## Querying Relational Databases

- Databases usually have multiple tables
- Tables have different but related data
- Often reference or relate to each other by a key
- Make connections by **JOIN**ing


## MovieLens dataset

- Full dataset: [grouplens.org/datasets/movielens](http://grouplens.org/datasets/movielens/)
    - 27,000 movies
    - 470,000 tags
    - 21,000,000 ratings
    - by 230,000 users - anonymized
- We'll be working with a subset in SQLite format: **movielens-small**

## Reading the data

- Launch **sqlite3** and open **movielens-small.db**

Tips:
    
    .open movielens-small.db
    .mode column
    .headers on
    .tables

In [25]:
from sqlite_utils import SQLiteDatabase
ml = SQLiteDatabase('movielens-small.db')
ml.query("SELECT * FROM movies LIMIT 3")

movieId,title,year,genres
1,Toy Story,1995,Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji,1995,Adventure|Children|Fantasy
3,Grumpier Old Men,1995,Comedy|Romance
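
If the `SQLiteDatabase` helper used above isn't available, the same kind of query can be run with Python's standard `sqlite3` module. This is a minimal sketch, not the helper's implementation: it builds an in-memory stand-in table from the three rows shown above, and the `query` function is a hypothetical convenience wrapper.

```python
import sqlite3

# In-memory stand-in for movielens-small.db; with the real file you would
# call sqlite3.connect("movielens-small.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies ("
    "movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER, genres TEXT)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [
        (1, "Toy Story", 1995, "Adventure|Animation|Children|Comedy|Fantasy"),
        (2, "Jumanji", 1995, "Adventure|Children|Fantasy"),
        (3, "Grumpier Old Men", 1995, "Comedy|Romance"),
    ],
)


def query(sql, params=()):
    """Hypothetical helper: run a query and return (column names, rows)."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


columns, rows = query("SELECT * FROM movies LIMIT 3")
print(columns)
for row in rows:
    print(row)
```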


### Review on row order

- No inherent order in the rows - reference them by a unique value and not a position.
- movieId is a **PRIMARY KEY**. It's an identifier that must be different for every movie
- You can order by multiple columns. Let's try year, then title


### Exercise:

Query (`SELECT`) the **movies** table by **year** (newest first) then **title**

`... ORDER BY <column 1> (ASC|DESC), <column 2> (ASC|DESC)`

What's the first movie listed? The last?

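Bonus: some movies have a **null** year. SQLite sorts null as the smallest value, so those rows come first in ASC order and last in DESC order. Filter out movies **`where year is null`**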
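
The sort in this exercise can be tried on a toy table first. A minimal sketch, assuming Python's standard `sqlite3` module and an in-memory table; the two "Hypothetical" rows are made up for illustration. SQLite treats NULL as smaller than any other value when sorting, so where the undated row lands depends on the sort direction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1, "Toy Story", 1995),
        (2, "Jumanji", 1995),
        (3, "Hypothetical Undated Film", None),  # made-up row with no year
        (4, "Hypothetical Later Film", 2003),    # made-up row
    ],
)

# Year descending, then title ascending to break ties within a year.
newest_first = conn.execute(
    "SELECT title, year FROM movies ORDER BY year DESC, title ASC"
).fetchall()

# Ascending order puts the NULL year at the top instead.
oldest_first = conn.execute(
    "SELECT title, year FROM movies ORDER BY year ASC, title ASC"
).fetchall()

# Same descending sort, keeping only rows that have a year.
dated_only = conn.execute(
    "SELECT title, year FROM movies "
    "WHERE year IS NOT NULL ORDER BY year DESC, title ASC"
).fetchall()

for title, year in newest_first:
    print(year, title)
```

Note that `year = NULL` never matches anything in SQL; the test has to be written with `IS NULL` or `IS NOT NULL`.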

In [26]:
ml.query("SELECT * FROM ratings limit 3")

userId,movieId,rating,timestamp
1,6,2.0,980730861
1,22,3.0,980731380
1,32,2.0,980731926


## Ratings table

- The rating itself is a number (e.g. number of stars)
- Each rating has a **userId**, **movieId**, and **timestamp**
- We have a **movies** table also with a **movieId**
- We can look up ratings for a movie, but we need to know its **movieId**

That common **movieId** column is a **KEY**

- In the **movies** table, it's a **PRIMARY KEY**
- In the **ratings** table, it's a **FOREIGN KEY**
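
What the two kinds of key mean in practice can be sketched with a toy schema. This is an assumption-laden illustration using Python's standard `sqlite3` module: the `CREATE TABLE` statements below are hypothetical, and the real **movielens-small.db** file may or may not declare these constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    """
    CREATE TABLE ratings (
        userId    INTEGER,
        movieId   INTEGER REFERENCES movies (movieId),  -- foreign key
        rating    REAL,
        timestamp INTEGER
    )
    """
)

conn.execute("INSERT INTO movies VALUES (6, 'Hypothetical Movie')")
conn.execute("INSERT INTO ratings VALUES (1, 6, 2.0, 980730861)")  # movie 6 exists

# A rating that points at a movieId with no matching movie is rejected.
try:
    conn.execute("INSERT INTO ratings VALUES (1, 999, 3.0, 980731380)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A second movie with an existing movieId violates the primary key.
try:
    conn.execute("INSERT INTO movies VALUES (6, 'Another Hypothetical Movie')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(orphan_rejected, duplicate_rejected)
```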


## Exercise: Relational Data

Look up a favorite movie and find its most recent rating.

Tips:
- Titles like **The Godfather** are listed as **Godfather, The**
- Sorting the **timestamp** column can be used to find the oldest (ASC) or newest (DESC)


`SELECT * FROM ratings WHERE _____ = __ ORDER BY ___ DESC limit ___;`
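
The **timestamp** values in the sample output look like Unix epoch seconds; that is an assumption based on their size, not something the table states. Under that assumption, SQLite's `datetime(..., 'unixepoch')` turns them into readable UTC dates, which makes "most recent" easier to check by eye. A minimal sketch with Python's standard `sqlite3` module and the three sample rows shown earlier:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings ("
    "userId INTEGER, movieId INTEGER, rating REAL, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        (1, 6, 2.0, 980730861),
        (1, 22, 3.0, 980731380),
        (1, 32, 2.0, 980731926),
    ],
)

# Newest rating first, with the raw timestamp rendered as a UTC date string.
readable = conn.execute(
    """
    SELECT movieId, rating, datetime(timestamp, 'unixepoch') AS rated_at
    FROM ratings
    ORDER BY timestamp DESC
    """
).fetchall()

for movie_id, rating, rated_at in readable:
    print(movie_id, rating, rated_at)
```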

In [None]:
ml.query("SELECT * FROM links limit 3")

### Aggregate Functions

We covered calculations that work across the columns of a single row. Just like pandas, SQL also has aggregate functions, which work across many rows.

[SQLite Aggregate Functions](https://www.sqlite.org/lang_aggfunc.html)

- avg
- count
- max / min
- sum / total

And

- DISTINCT

Let's try them out: are there duplicate titles?

In [None]:
ml.query("SELECT COUNT(*) FROM movies WHERE year = 2003")
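
The aggregate functions and **DISTINCT** listed above can be tried on a toy table. A minimal sketch, assuming Python's standard `sqlite3` module; the "Hypothetical Remake" rows are made up so that one title appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1, "Toy Story", 1995),
        (2, "Jumanji", 1995),
        (3, "Hypothetical Remake", 1960),  # made-up row
        (4, "Hypothetical Remake", 2003),  # made-up row, same title
    ],
)

# One result row summarising the whole table.
total, distinct_titles, first_year, last_year = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT title), MIN(year), MAX(year) FROM movies"
).fetchone()

# Titles that occur more than once: group by title, keep groups larger than 1.
duplicates = conn.execute(
    """
    SELECT title, COUNT(*) AS copies
    FROM movies
    GROUP BY title
    HAVING COUNT(*) > 1
    """
).fetchall()

print(total, distinct_titles, first_year, last_year)
print(duplicates)
```

When `COUNT(*)` is larger than `COUNT(DISTINCT title)`, at least one title is repeated.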

# Joins

This database, like most useful relational databases, stores its data in several tables that relate to each other. The movies table has the title of each movie, and the other tables hold data about those movies.

- Movies:Links is 1:1
- Movies:Ratings is 1:Many
- Movies:Tags is Many:Many

Let's explore this with distinct counts

Start with links because it's 1:1. A simple join gets the title and the IMDB link for each movie.

In [None]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(*) from links')
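
A simple 1:1 join between **movies** and **links** might look like the sketch below. It uses Python's standard `sqlite3` module on toy tables. The `imdbId` column name is an assumption about the links table, and the id values are placeholders, not real IMDB ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
# Assumed shape of the links table: one row per movie.
conn.execute("CREATE TABLE links (movieId INTEGER PRIMARY KEY, imdbId TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)", [(1, "Toy Story"), (2, "Jumanji")]
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [(1, "0000001"), (2, "0000002")],  # placeholder ids
)

# Match each movie to its link row through the shared movieId key.
joined = conn.execute(
    """
    SELECT movies.title, links.imdbId
    FROM movies
    JOIN links ON movies.movieId = links.movieId
    ORDER BY movies.movieId
    """
).fetchall()

for title, imdb_id in joined:
    print(title, imdb_id)
```

Because the relationship is 1:1, the join returns exactly one row per movie.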

In [None]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(*) from ratings')

In [None]:
ml.query('SELECT COUNT(DISTINCT(movieId)), COUNT(DISTINCT(tag)), COUNT(*) from tags')
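
How to read these counts can be sketched on toy data: if the distinct count of **movieId** equals the row count, each movie appears at most once in that table, and if it is smaller, some movies appear on many rows. The tables, rows, and the `distinct_and_total` helper below are hypothetical, using Python's standard `sqlite3` module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (movieId INTEGER, imdbId TEXT)")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO links VALUES (?, ?)", [(1, "0000001"), (2, "0000002")]
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 1, 4.0), (2, 1, 5.0), (1, 2, 3.0)],  # movie 1 is rated twice
)


def distinct_and_total(table):
    """Hypothetical helper: (distinct movieIds, total rows) for a table."""
    # Table names cannot be bound as parameters, so only pass trusted names.
    sql = f"SELECT COUNT(DISTINCT movieId), COUNT(*) FROM {table}"
    return conn.execute(sql).fetchone()


links_counts = distinct_and_total("links")      # equal: one row per movie
ratings_counts = distinct_and_total("ratings")  # fewer ids than rows: 1:Many

print(links_counts, ratings_counts)
```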

One last example: join **movies** to **ratings**. The only thing left after that would be a join on tags.
Reminder: you don't need to display a column to use it in the query.

In [None]:
ratings_query = '''
SELECT 
    movies.title, 
    count(ratings.rating) AS num_ratings, 
    avg(ratings.rating) AS avg_rating
FROM movies
JOIN ratings ON movies.movieId = ratings.movieId
GROUP BY movies.movieId  -- group on the key: two different movies can share a title
HAVING count(ratings.rating) > 100 
ORDER BY avg_rating DESC LIMIT 10
'''
ml.query(ratings_query)
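
Grouping on a title and grouping on a key can give different answers when two movies share a title. A minimal sketch of the difference, assuming Python's standard `sqlite3` module and made-up rows; it drops the `HAVING` and `LIMIT` clauses to keep the toy data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [(1, "Hypothetical Remake"), (2, "Hypothetical Remake")],  # same title twice
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 1, 5.0), (2, 1, 4.0), (1, 2, 1.0)],
)

base = """
SELECT movies.title,
       count(ratings.rating) AS num_ratings,
       avg(ratings.rating)   AS avg_rating
FROM movies
JOIN ratings ON movies.movieId = ratings.movieId
"""

# Grouping by title lumps both movies into a single row.
by_title = conn.execute(base + "GROUP BY movies.title").fetchall()

# Grouping by the primary key keeps the two movies apart.
by_id = conn.execute(
    base + "GROUP BY movies.movieId ORDER BY movies.movieId"
).fetchall()

print(by_title)
print(by_id)
```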

## Filtering Data

- Simple filtering: col = value
- LIKE, IN, comparisons
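
The filters listed above can be sketched in one place. This assumes Python's standard `sqlite3` module and a small in-memory table; the `titles_where` helper is hypothetical. Values are passed as `?` parameters rather than pasted into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1, "Toy Story", 1995),
        (2, "Jumanji", 1995),
        (3, "Grumpier Old Men", 1995),
        (4, "Godfather, The", 1972),
    ],
)


def titles_where(condition, params):
    """Hypothetical helper: titles matching a WHERE condition, sorted."""
    sql = f"SELECT title FROM movies WHERE {condition} ORDER BY title"
    return [row[0] for row in conn.execute(sql, params)]


# Simple equality: col = value
from_1995 = titles_where("year = ?", (1995,))

# LIKE with % as a wildcard for "anything after this prefix"
godfathers = titles_where("title LIKE ?", ("Godfather%",))

# IN matches any value from a list
picked = titles_where("title IN (?, ?)", ("Toy Story", "Jumanji"))

# Comparison operators: <, <=, >, >=, <>
before_1990 = titles_where("year < ?", (1990,))

print(from_1995, godfathers, picked, before_1990)
```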


## Sorting Data

- Multiple sort columns: the order you list them in matters

## Aggregation

- DISTINCT
- AVG

## Joins

Joins are how a relational database management system connects related tables through their keys.