In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [4]:
%load_ext sql

from sqlalchemy import create_engine
engine = create_engine('postgresql://localhost/dvdrental')

%sql postgresql://localhost/dvdrental

'Connected: @dvdrental'

## Question 1

We want to understand more about the movies that families are watching. The following categories are considered family movies: Animation, Children, Classics, Comedy, Family and Music.

**Create a query that lists each movie, the film category it is classified in, and the number of times it has been rented out.**

In [20]:
%%sql r <<
SELECT f.title, c.name AS category, r_sub.rental_count
FROM film f
JOIN film_category fc ON f.film_id = fc.film_id
JOIN category c ON c.category_id = fc.category_id
JOIN (
      -- rentals per film, counted through its inventory copies
      SELECT i.film_id, COUNT(r.rental_id) AS rental_count
      FROM inventory i
      JOIN rental r ON r.inventory_id = i.inventory_id
      GROUP BY i.film_id) r_sub
ON r_sub.film_id = f.film_id
WHERE c.name IN ('Animation', 'Children', 'Classics',
                 'Comedy', 'Family', 'Music')
ORDER BY 2, 1, 3;

 * postgresql://localhost/dvdrental
350 rows affected.
Returning data to local variable r


In [22]:
r.DataFrame().head(10)

|   | title | category | rental_count |
|---|---|---|---|
| 0 | Alter Victory | Animation | 22 |
| 1 | Anaconda Confessions | Animation | 21 |
| 2 | Bikini Borrowers | Animation | 17 |
| 3 | Blackout Private | Animation | 27 |
| 4 | Borrowers Bedazzled | Animation | 22 |
| 5 | Canyon Stock | Animation | 19 |
| 6 | Carol Texas | Animation | 18 |
| 7 | Champion Flatliners | Animation | 13 |
| 8 | Clash Freddy | Animation | 25 |
| 9 | Club Graffiti | Animation | 19 |
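
The query returns 350 rows, while the same category filter in Question 2 returns 361. A likely explanation is that the inner `JOIN` on the rental subquery drops films that were never rented. If those films should appear with a count of 0, a `LEFT JOIN` with `COALESCE` is one option. The sketch below illustrates the difference on a toy in-memory SQLite database; the table contents are made up for illustration and are not the dvdrental data.

```python
import sqlite3

# Toy stand-ins for the dvdrental tables (hypothetical rows, not the real data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, inventory_id INTEGER);
INSERT INTO film VALUES (1, 'Rented Twice'), (2, 'Rented Once'), (3, 'Never Rented');
INSERT INTO inventory VALUES (10, 1), (11, 1), (20, 2), (30, 3);
INSERT INTO rental VALUES (100, 10), (101, 11), (102, 20);
""")

# Rentals per film, counted through the inventory copies (same shape as r_sub).
counts_sql = """
SELECT i.film_id, COUNT(r.rental_id) AS rental_count
FROM inventory i
JOIN rental r ON r.inventory_id = i.inventory_id
GROUP BY i.film_id
"""

# Inner join: films without any rental disappear from the result.
inner_rows = conn.execute(f"""
SELECT f.title, r_sub.rental_count
FROM film f
JOIN ({counts_sql}) r_sub ON r_sub.film_id = f.film_id
ORDER BY f.title
""").fetchall()

# Left join: every film is kept, missing counts become 0.
left_rows = conn.execute(f"""
SELECT f.title, COALESCE(r_sub.rental_count, 0) AS rental_count
FROM film f
LEFT JOIN ({counts_sql}) r_sub ON r_sub.film_id = f.film_id
ORDER BY f.title
""").fetchall()

print(inner_rows)
print(left_rows)
```

Whether never-rented films belong in the answer depends on how "the number of times it has been rented out" is read; the inner join is fine if only rented films are of interest.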


## Question 2

Now we need to know how the rental duration of these family-friendly movies compares to the rental duration of all movies. **Can you provide a table with the movie titles, divided into 4 levels (first_quarter, second_quarter, third_quarter, and final_quarter) based on the quartiles (25%, 50%, 75%) of the rental duration for movies across all categories?**

Make sure to also indicate the category that these family-friendly movies fall into.

In [31]:
%%sql r <<
SELECT f.title, f.rental_duration, c.name category,
       NTILE(4) OVER (ORDER BY f.rental_duration) AS standard_quartile
FROM film f
JOIN film_category fc ON fc.film_id = f.film_id
JOIN category c ON c.category_id = fc.category_id
WHERE c.name IN ('Animation', 'Children', 'Classics', 
                   'Comedy', 'Family', 'Music');

 * postgresql://localhost/dvdrental
361 rows affected.
Returning data to local variable r


In [34]:
r.DataFrame().head(10)

|   | title | rental_duration | category | standard_quartile |
|---|---|---|---|---|
| 0 | Sweethearts Suspects | 3 | Children | 1 |
| 1 | Go Purple | 3 | Music | 1 |
| 2 | Bilko Anonymous | 3 | Family | 1 |
| 3 | Wait Cider | 3 | Animation | 1 |
| 4 | Daughter Madigan | 3 | Children | 1 |
| 5 | Turn Star | 3 | Animation | 1 |
| 6 | Rush Goodfellas | 3 | Family | 1 |
| 7 | King Evolution | 3 | Family | 1 |
| 8 | Tracy Cider | 3 | Animation | 1 |
| 9 | Wisdom Worker | 3 | Comedy | 1 |
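
In SQL, window functions are evaluated after the `WHERE` clause, so the query above ranks only the 361 family films rather than films across all categories. If the quartiles are meant to come from all films, one option is to compute `NTILE(4)` in a CTE over every film and apply the category filter afterwards. The sketch below shows the difference on a toy in-memory SQLite table; the rows and the `all_films` CTE name are illustrative assumptions, not the dvdrental data, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Toy table (hypothetical rows): four family films and four longer-rented action films.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE film (title TEXT, rental_duration INTEGER, category TEXT);
INSERT INTO film VALUES
  ('A', 3, 'Family'), ('B', 4, 'Family'), ('C', 5, 'Family'), ('D', 6, 'Family'),
  ('E', 6, 'Action'), ('F', 7, 'Action'), ('G', 7, 'Action'), ('H', 7, 'Action');
""")

# Filter first, then rank: quartiles are relative to family films only.
filtered = conn.execute("""
SELECT title,
       NTILE(4) OVER (ORDER BY rental_duration, title) AS standard_quartile
FROM film
WHERE category = 'Family'
ORDER BY title
""").fetchall()

# Rank all films first, then filter: quartiles are relative to every film.
overall = conn.execute("""
WITH all_films AS (
  SELECT title, category,
         NTILE(4) OVER (ORDER BY rental_duration, title) AS standard_quartile
  FROM film
)
SELECT title, standard_quartile
FROM all_films
WHERE category = 'Family'
ORDER BY title
""").fetchall()

print(filtered)
print(overall)
```

Note also that `NTILE` splits rows into equal-sized buckets regardless of ties, so films with the same `rental_duration` can land in different quartiles. The `title` tiebreaker in the sketch only makes the assignment deterministic.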


## Question 3

Finally, provide a table with the family-friendly film category, each of the quartiles, and the count of movies for each combination of film category and rental duration quartile. The resulting table should have three columns:

- Category
- Standard_quartile (Rental length category)
- Count (count of movies for each combination of film category and rental length category)

In [37]:
%%sql
WITH t1 AS
(SELECT f.title, 
        f.rental_duration, 
        c.name category,
        NTILE(4) OVER (ORDER BY f.rental_duration) AS standard_quartile
 FROM film f
 JOIN film_category fc ON fc.film_id = f.film_id
 JOIN category c ON c.category_id = fc.category_id
 WHERE c.name IN ('Animation', 'Children', 'Classics', 
                   'Comedy', 'Family', 'Music'))

SELECT category, standard_quartile, COUNT(title) movie_counts
FROM t1
GROUP BY 1,2
ORDER BY 1, 2, 3;

 * postgresql://localhost/dvdrental
24 rows affected.


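| category | standard_quartile | movie_counts |
|---|---|---|
| Animation | 1 | 22 |
| Animation | 2 | 12 |
| Animation | 3 | 15 |
| Animation | 4 | 17 |
| Children | 1 | 14 |
| Children | 2 | 18 |
| Children | 3 | 14 |
| Children | 4 | 14 |
| Classics | 1 | 14 |
| Classics | 2 | 14 |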
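
As a quick sanity check, the long-format counts can be pivoted into one row per category and summed; each row total should equal the number of films in that category. The sketch below is plain Python and uses only the complete categories visible in the output above (Animation and Children); the variable names are illustrative.

```python
from collections import defaultdict

# Rows copied from the output above: (category, standard_quartile, movie_counts).
rows = [
    ("Animation", 1, 22), ("Animation", 2, 12), ("Animation", 3, 15), ("Animation", 4, 17),
    ("Children", 1, 14), ("Children", 2, 18), ("Children", 3, 14), ("Children", 4, 14),
]

# Pivot to {category: {quartile: count}}.
wide = defaultdict(dict)
for category, quartile, count in rows:
    wide[category][quartile] = count

# Per-category totals across the four quartiles.
totals = {category: sum(by_q.values()) for category, by_q in wide.items()}

for category, by_q in wide.items():
    cells = "  ".join(f"Q{q}={by_q[q]}" for q in sorted(by_q))
    print(f"{category:<10} {cells}  total={totals[category]}")
```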
