## Relational Databases

The MovieLens dataset: http://grouplens.org/datasets/movielens/

The full dataset has 27,000 movies and 21,000,000 ratings. How big is the subset?


In [23]:
from sqlite_utils import SQLiteDatabase

In [24]:
ml = SQLiteDatabase('movielens-small.db')
query = "SELECT * FROM movies LIMIT 5"
ml.query(query)

movieId,title,year,genres
1,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji,1995,Adventure|Children|Fantasy
3,Grumpier Old Men,1995,Comedy|Romance
4,Waiting to Exhale,1995,Comedy|Drama|Romance
5,Father of the Bride Part II,1995,Comedy


In [25]:
ml.query("SELECT * FROM ratings limit 5")

userId,movieId,rating,timestamp
1,6,2.0,980730861
1,22,3.0,980731380
1,32,2.0,980731926
1,50,5.0,980732037
1,110,4.0,980730408


In [26]:
ml.query("SELECT * FROM links limit 5")

movieId,imdbId,tmdbId
1,114709,862
2,113497,8844
3,113228,15602
4,114885,31357
5,113041,11862


In [27]:
ml.query("SELECT COUNT(*) FROM movies WHERE year = 2003")

COUNT(*)
250


## Joins

This database, like most useful relational databases, stores its data in several tables that relate to each other. The movies table holds the title of each movie, and the other tables hold data about those movies.

- Movies:Links is 1:1
- Movies:Ratings is 1:Many
- Movies:Tags is Many:Many

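Let's explore this with distinct counts: when the number of distinct movieId values equals the number of rows, each movie appears in that table at most once.

Start with links, because it is 1:1. A simple join is then enough to get the title and the IMDb link.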

In [28]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(*) from links')

COUNT(DISTINCT(movieId)),COUNT(*)
8570,8570
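
Because links has one row per movie, joining it to movies neither drops nor duplicates titles. The sketch below illustrates such a join. It makes two assumptions that are not part of the notebook: it uses Python's standard `sqlite3` module instead of the `SQLiteDatabase` helper, and it builds a small in-memory database from three of the rows printed above, with guessed column types.

```python
import sqlite3

# Assumption: an in-memory stand-in for movielens-small.db, filled with
# three rows copied from the movies and links outputs shown earlier.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE links (movieId INTEGER PRIMARY KEY, imdbId INTEGER, tmdbId INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Toy Story", 1995), (2, "Jumanji", 1995), (3, "Grumpier Old Men", 1995)],
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?, ?)",
    [(1, 114709, 862), (2, 113497, 8844), (3, 113228, 15602)],
)

# 1:1 join: each movie matches exactly one links row.
join_query = """
SELECT movies.title, links.imdbId
FROM movies
JOIN links ON movies.movieId = links.movieId
ORDER BY movies.movieId
"""
title_links = conn.execute(join_query).fetchall()
for title, imdb_id in title_links:
    print(title, imdb_id)
```

The joined result has the same number of rows as either input table, which is what a 1:1 relationship implies.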


In [29]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(*) from ratings')

COUNT(DISTINCT(movieId)),COUNT(*)
8552,100023


In [30]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(DISTINCT(tag)), COUNT(*) from tags')

COUNT(DISTINCT(movieId)),COUNT(DISTINCT(tag)),COUNT(*)
672,1209,2488


One last example, joining movies to ratings. The only thing left to do after that would be to join on tags.
Reminder: a column does not have to appear in the output to be used in the query.

In [32]:
ratings_query = '''
SELECT 
    movies.title, 
    count(ratings.rating) AS num_ratings, 
    avg(ratings.rating) AS avg_rating
FROM movies
JOIN ratings ON movies.movieId = ratings.movieId
GROUP BY movies.title 
HAVING count(ratings.rating) > 100 
ORDER BY avg_rating DESC limit 10
'''
ml.query(ratings_query)

title,num_ratings,avg_rating
"Shawshank Redemption, The",328,4.44207317073
"Usual Suspects, The",239,4.36820083682
"Godfather, The",208,4.33413461538
Schindler's List,241,4.3112033195
Casablanca,129,4.25968992248
"Godfather: Part II, The",132,4.25
"Silence of the Lambs, The",337,4.23590504451
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb,126,4.22619047619
"Matrix, The",265,4.20754716981
Star Wars: Episode IV - A New Hope,306,4.19607843137
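
The remaining join, on tags, could look roughly like the sketch below. Everything in it is illustrative: it uses the standard `sqlite3` module with an in-memory database, the tags table is reduced to the two columns counted earlier (movieId and tag), and the tag rows themselves are invented, not taken from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE tags (movieId INTEGER, tag TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1, "Toy Story"), (2, "Jumanji"), (3, "Grumpier Old Men")],
)
# Hypothetical tag rows, made up for this example only.
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [(1, "animation"), (1, "family"), (2, "family"), (2, "board game")],
)

# Many:Many: one movie can carry several tags, and one tag can
# apply to several movies, so count distinct movies per tag.
tag_query = """
SELECT tags.tag, COUNT(DISTINCT movies.movieId) AS num_movies
FROM movies
JOIN tags ON movies.movieId = tags.movieId
GROUP BY tags.tag
ORDER BY num_movies DESC, tags.tag
"""
tag_counts = conn.execute(tag_query).fetchall()
print(tag_counts)
```

Movies without any tag (here, Grumpier Old Men) drop out of an inner join; a `LEFT JOIN` would keep them.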


## Filtering Data

- Simple filtering: col = value
- LIKE, IN, comparisons
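
A minimal sketch of these filters, assuming the standard `sqlite3` module and an in-memory movies table filled with the five rows printed at the top of the notebook (the column types and the `titles` helper are assumptions made for this example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER, title TEXT, year INTEGER, genres TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        (1, "Toy Story", 1995, "Adventure|Animation|Children|Comedy|Fantasy"),
        (2, "Jumanji", 1995, "Adventure|Children|Fantasy"),
        (3, "Grumpier Old Men", 1995, "Comedy|Romance"),
        (4, "Waiting to Exhale", 1995, "Comedy|Drama|Romance"),
        (5, "Father of the Bride Part II", 1995, "Comedy"),
    ],
)

def titles(sql, params=()):
    """Hypothetical helper: run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql, params)]

# Simple filtering: col = value
exact = titles("SELECT title FROM movies WHERE title = ?", ("Jumanji",))
# LIKE: % matches any run of characters
romance = titles("SELECT title FROM movies WHERE genres LIKE ? ORDER BY movieId", ("%Romance%",))
# IN: membership in a list of values
picked = titles("SELECT title FROM movies WHERE movieId IN (1, 3, 5) ORDER BY movieId")
# Comparison operators
later = titles("SELECT title FROM movies WHERE movieId > 3 ORDER BY movieId")

print(exact, romance, picked, later, sep="\n")
```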


## Sorting Data

- Multiple sorts, ordering matters
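
To see why the order of sort keys matters, the sketch below sorts the same rows two ways. It assumes the standard `sqlite3` module and an in-memory ratings table holding the five rows printed earlier; the column types are guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        (1, 6, 2.0, 980730861),
        (1, 22, 3.0, 980731380),
        (1, 32, 2.0, 980731926),
        (1, 50, 5.0, 980732037),
        (1, 110, 4.0, 980730408),
    ],
)

# Highest rating first; ties on rating are broken by the earliest timestamp.
by_rating = [row[0] for row in conn.execute(
    "SELECT movieId FROM ratings ORDER BY rating DESC, timestamp ASC"
)]

# Swapping the keys sorts by time first; rating only breaks timestamp ties.
by_time = [row[0] for row in conn.execute(
    "SELECT movieId FROM ratings ORDER BY timestamp ASC, rating DESC"
)]

print(by_rating)
print(by_time)
```

The first sort key decides the overall order; each later key only reorders rows that tie on all the keys before it.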

## Aggregation

- DISTINCT
- AVG
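
A short sketch of both aggregations, again assuming the standard `sqlite3` module and an in-memory ratings table built from the five rows printed earlier (column types are assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        (1, 6, 2.0, 980730861),
        (1, 22, 3.0, 980731380),
        (1, 32, 2.0, 980731926),
        (1, 50, 5.0, 980732037),
        (1, 110, 4.0, 980730408),
    ],
)

# DISTINCT collapses repeated values into one row each.
distinct_ratings = [row[0] for row in conn.execute(
    "SELECT DISTINCT rating FROM ratings ORDER BY rating"
)]

# COUNT(DISTINCT ...) counts unique values; COUNT(*) counts rows; AVG averages a column.
num_distinct, num_rows, avg_rating = conn.execute(
    "SELECT COUNT(DISTINCT rating), COUNT(*), AVG(rating) FROM ratings"
).fetchone()

print(distinct_ratings, num_distinct, num_rows, avg_rating)
```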

## Joins

Joins are how a relational database management system expresses the relationships between its tables.