## Querying Relational Databases

- Databases usually have multiple tables
- Tables have different but related data
- Often reference or relate to each other by a key
- Make connections by **JOIN**ing


## MovieLens data set

- Full dataset: [grouplens.org/datasets/movielens](http://grouplens.org/datasets/movielens/)
    - 27,000 movies
    - 470,000 tags
    - 21,000,000 ratings
    - by 230,000 users - anonymized
- We'll be working with a subset in SQLite format: **movielens-small**

## Reading the data

- Launch **sqlite3** and open **movielens-small.db**

Tips:
    
    .open movielens-small.db
    .mode column
    .headers on
    .tables

In [None]:
from sqlite_utils import SQLiteDatabase
ml = SQLiteDatabase('movielens-small.db')
ml.query("SELECT * FROM movies LIMIT 3")

### Review on row order

- No inherent order in the rows - reference them by a unique value and not a position.
- movieId is a **PRIMARY KEY**. It's an identifier that must be different for every movie
- Can order by multiple columns. Let's try year, then title


### Exercise:

Query (`SELECT`) the **movies** table by **year** (newest first) then **title**

`... ORDER BY <column 1> (ASC|DESC), <column 2> (ASC|DESC)`

What's the first movie listed? The last?

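Bonus: in SQLite, **null** sorts as the smallest value, so it comes first with `ASC` and last with `DESC`. Filter out movies without a year using **`WHERE year IS NOT NULL`**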
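
A minimal sketch of multi-column ordering and of where SQLite places null values, using a toy in-memory table through Python's built-in `sqlite3` module. The `films` table and its rows are made up for illustration and are not part of movielens-small.

```python
import sqlite3

# Toy in-memory table (hypothetical rows, not the real movielens data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("B", 1999), ("A", 1999), ("C", 2005), ("D", None)],
)

# Ascending: NULL sorts before every number; ties on year are broken by title
asc_rows = conn.execute(
    "SELECT title, year FROM films ORDER BY year ASC, title ASC"
).fetchall()

# Descending, with rows that have no year removed
desc_rows = conn.execute(
    "SELECT title, year FROM films WHERE year IS NOT NULL "
    "ORDER BY year DESC, title ASC"
).fetchall()

print(asc_rows)   # [('D', None), ('A', 1999), ('B', 1999), ('C', 2005)]
print(desc_rows)  # [('C', 2005), ('A', 1999), ('B', 1999)]
```

Each column in `ORDER BY` gets its own direction, so the year can run newest first while titles stay alphabetical.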

In [None]:
ml.query('')

In [None]:
ml.query("SELECT * FROM ratings limit 3")

## Ratings table

- The rating itself is a number (e.g. number of stars)
- Each rating has a **userId**, **movieId**, and **timestamp**
- We have a **movies** table also with a **movieId**
- We can look up ratings for a movie, but we need to know its **movieId**

That common **movieId** column is a **KEY**

- In the **movies** table, it's a **PRIMARY KEY**
- In the **ratings** table, it's a **FOREIGN KEY**
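
To make the two kinds of key concrete, here is a small sketch with Python's built-in `sqlite3` module. The schema is hypothetical and only mirrors the column names used above; the real movielens-small database may declare its keys differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
conn.execute(
    "CREATE TABLE ratings ("
    " userId INTEGER,"
    " movieId INTEGER REFERENCES movies(movieId),"
    " rating REAL,"
    " timestamp INTEGER)"
)
conn.execute("INSERT INTO movies VALUES (1, 'Toy Story')")
conn.execute("INSERT INTO ratings VALUES (10, 1, 4.0, 1000000000)")

errors = []
bad_statements = (
    # Reuses movieId 1, so it breaks PRIMARY KEY uniqueness
    "INSERT INTO movies VALUES (1, 'Duplicate id')",
    # Points at movieId 99, which does not exist in movies
    "INSERT INTO ratings VALUES (10, 99, 3.0, 1000000001)",
)
for sql in bad_statements:
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))

print(errors)
```

Both inserts are rejected: the primary key keeps every movie unique, and the foreign key keeps every rating pointing at a movie that exists.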


## Exercise: Relational Data

Look up a favorite movie and find its newest and oldest ratings.

Tips:
- Titles like **The Godfather** are listed as **Godfather, The**
- Sorting the **timestamp** column can be used to find the oldest (ASC) or newest (DESC)
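
The raw timestamps are hard to read. Assuming they are stored as seconds since 1970-01-01 UTC (as in the MovieLens files), SQLite's `datetime(..., 'unixepoch')` turns them into readable dates. The sketch below uses a made-up in-memory table, not the real ratings.

```python
import sqlite3

# Toy ratings table with two made-up timestamps (seconds since 1970-01-01 UTC)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movieId INTEGER, rating REAL, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 4.0, 1000000000), (1, 3.5, 0)],
)

# Oldest first, with the timestamp converted to a readable UTC date
rows = conn.execute(
    "SELECT rating, datetime(timestamp, 'unixepoch') AS rated_at "
    "FROM ratings ORDER BY timestamp ASC"
).fetchall()

print(rows)  # [(3.5, '1970-01-01 00:00:00'), (4.0, '2001-09-09 01:46:40')]
```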


In [None]:
ml.query("SELECT * FROM ratings WHERE _____ = __ ORDER BY ___ DESC limit ___")

### Aggregate Functions

We covered calculations that work across columns, but just like Pandas, there are also functions that aggregate data in different ways.

[SQLite Aggregate Functions](https://www.sqlite.org/lang_aggfunc.html)

- avg
- count
- max / min
- sum
- group_concat 
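
A quick sketch of these functions on a toy in-memory table, using Python's built-in `sqlite3` module. The rows are made up for illustration and are not from movielens-small.

```python
import sqlite3

# Toy in-memory ratings table (made-up rows, not the real movielens data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (userId INTEGER, movieId INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(10, 1, 4.0), (11, 1, 5.0), (10, 2, 3.0), (12, 2, 2.0)],
)

# Each aggregate collapses all matching rows into a single value
n, mean, lowest, highest, total = conn.execute(
    "SELECT COUNT(*), AVG(rating), MIN(rating), MAX(rating), SUM(rating) "
    "FROM ratings"
).fetchone()

# group_concat joins values into one string; the order is not guaranteed
users = conn.execute("SELECT group_concat(userId) FROM ratings").fetchone()[0]

print(n, mean, lowest, highest, total)  # 4 3.5 2.0 5.0 14.0
print(users)
```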

In [None]:
ml.query("SELECT COUNT(*) FROM ratings WHERE movieId = ___")
# How many ratings does your movie have?
# What's the average rating for your movie?

### Distinct

Distinct is often used with these aggregate functions, especially with data that repeats, like a year.

In [None]:
ml.query("SELECT COUNT(year) FROM movies") # How many years is this?

In [None]:
ml.query("SELECT year FROM movies limit 10") # Let's see why
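
As a sketch of what is going on, the toy table below (made-up rows, queried with Python's built-in `sqlite3` module) shows three different counts: `COUNT(*)` counts rows, `COUNT(year)` counts rows where year is not null, and `COUNT(DISTINCT year)` counts each year only once.

```python
import sqlite3

# Toy in-memory table with a repeated year and a missing year
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 1999), ("B", 1999), ("C", 2005), ("D", None)],
)

all_rows, with_year, distinct_years = conn.execute(
    "SELECT COUNT(*), COUNT(year), COUNT(DISTINCT year) FROM films"
).fetchone()

print(all_rows, with_year, distinct_years)  # 4 3 2
```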

### Exercise: Aggregates

The full dataset has:

- 27,000 movies
- 470,000 tags
- 21,000,000 ratings
- by 230,000 users

1. What are these statistics for the tables in your database?
2. What's the average rating for all movies?
3. If each rating is a star, how many stars have been given for **Shawshank Redemption, The**?

Hint: there is no users table but userIds exist in the ratings table

# Querying Multiple Tables

Joins and keys

The real power of relational databases is in the **relations**.

movies and ratings are related, but there are other tables:

- movies:links is 1:1
- movies:ratings is 1:Many
- movies:tags is Many:Many

Connect related tables in a single query, with a **JOIN**
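
Here is a minimal sketch of a 1:Many join on two toy in-memory tables, using Python's built-in `sqlite3` module. The titles and ratings are made up; only the column names follow the ones in this notebook.

```python
import sqlite3

# Two toy tables that share a movieId column (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movieId INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE ratings (movieId INTEGER, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "Alpha"), (2, "Beta")])
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 4.0), (1, 5.0), (2, 3.0)],
)

# The ON clause says which rows belong together: matching movieId values
joined = conn.execute(
    "SELECT movies.title, ratings.rating "
    "FROM movies JOIN ratings ON movies.movieId = ratings.movieId "
    "ORDER BY movies.title, ratings.rating"
).fetchall()

print(joined)  # [('Alpha', 4.0), ('Alpha', 5.0), ('Beta', 3.0)]
```

Because the relation is 1:Many, the title repeats once for every rating that movie has.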

Let's have a look at the links table, and see how we can get the IMDB link for some movies in the table.

In [None]:
# JOIN links ON movies.movieId = links.movieId
ml.query('')

### Exercise

Write a query like the one above that prints the title and tag for the Star Wars movies

**Hint:**

    SELECT title, tag FROM movies JOIN tags...


In [None]:
starwars_tags = '''
SELECT 
    movies.title,
    tags.tag
FROM
    movies
JOIN
    tags ON movies.movieId = tags.movieId
WHERE
    movies.title like 'Star Wars: Episode%'
ORDER BY
    tag ASC, year ASC
'''
#ml.query(starwars_tags)

### Joins and Aggregates

We can join tables, and we can run aggregate functions like AVG and COUNT. It's useful to put those together.

For example, the average rating for all the Star Wars movies:

In [None]:
# Start with a query, then add the aggregates
avg_ratings = '''
SELECT 
    ratings.rating
FROM
    movies
JOIN
    ratings ON movies.movieId = ratings.movieId
WHERE
    movies.title like 'Star Wars: Episode%'
'''
ml.query(avg_ratings)

### Exercise

1. Include the number of ratings
2. Order to show the highest rated on top. Which movie is it?
3. How does the average rating of the first 3 movies compare to the second 3?
4. How does the average rating of these movies compare to the Godfather trilogy?


## Grouping

When aggregating, it often makes sense to aggregate groups of data instead of the whole dataset.

For example: one average rating for each movie instead of one for all movies.

This is called grouping. We covered it in Pandas, and it's also possible in SQL.
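
A small sketch of the difference, on a toy in-memory table with made-up ratings (Python's built-in `sqlite3` module). Without `GROUP BY` the aggregate returns one row for the whole table; with it, one row per group.

```python
import sqlite3

# Toy ratings table (made-up rows, not the real movielens data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (movieId INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)",
    [(1, 4.0), (1, 5.0), (2, 3.0)],
)

# One average over every row in the table
overall = conn.execute("SELECT AVG(rating) FROM ratings").fetchone()[0]

# One count and one average per movieId
per_movie = conn.execute(
    "SELECT movieId, COUNT(*), AVG(rating) "
    "FROM ratings GROUP BY movieId ORDER BY movieId"
).fetchall()

print(overall)    # 4.0
print(per_movie)  # [(1, 2, 4.5), (2, 1, 3.0)]
```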


In [None]:
ml.query('select year, count(*) FROM movies GROUP BY year')

### Grouping with Joins

Grouping works with any query, not just single-table queries.

Let's apply this to the ratings query from earlier, and see what that looks like

1. Include title and year
2. Include the number of ratings
3. Order by the average rating (highest first)
4. Order by the most ratings (highest first)

In [None]:
star_wars_ratings = '''
SELECT 
    avg(ratings.rating) AS avg_rating
FROM
    movies
JOIN
    ratings ON movies.movieId = ratings.movieId
WHERE
    movies.title like 'Star Wars: Episode%'
'''
ml.query(star_wars_ratings)

### 10 Highest-Rated Movies

We're pretty close, just need to remove the WHERE clause and add a LIMIT, right?

In [None]:
ratings_query = '''
SELECT 
    movies.title, 
    movies.year,
    count(ratings.rating) AS num_ratings, 
    avg(ratings.rating) AS avg_rating
FROM
    movies
JOIN
    ratings ON movies.movieId = ratings.movieId
GROUP BY
    movies.movieId  -- movieId is unique; titles can repeat (e.g. remakes)
ORDER BY
    avg_rating DESC
LIMIT 10
'''
ml.query(ratings_query)