# About this project

This project uses SQL to answer hypothetical sales and marketing questions about a DVD rental store, using the MySQL sample database Sakila.

# Sakila Database

The Sakila database is a nicely normalised schema modelling a DVD rental store, featuring things like films, actors, film-actor relationships, and a central inventory table that connects films, stores, and rentals.

The Sakila MySQL sample database is available from http://dev.mysql.com/doc/index-other.html. 


## Sakila Database Entity Relationship Diagram (ERD)

<img src="https://www.jooq.org/img/sakila.png">

## Problem Description


- **General Information**:
    - What's the time range of the data?
    - How many stores are there?
    - How many staff members are there?
    

- **Store performance**:
    - Which store has more customer rentals?
    - Which store makes the most money? 
    
    
- **Inventory summary**:
    - How many film categories are there in the store?
    - Track the inventory level and determine whether a rental can happen.
        
        
- **Consumer behavior**:
    - What are the top 10 most popular films that customers rent?
    - What are the top 10 films that customers rented for the longest period?
    - Among the films rented for the longest period, which were rented most often?
    - What is the average rental period?
    - Which genres are most popular? 
    - Who are our most loyal customers?
    - If we want to hire an actor to do ads for us, which actor is in the most films?
    - If we want to hire an actor to do ads for us, which actors/actresses are most popular given our rental history?

    
    
- **Sales summary**:
    - Do we make the most money from long or short rentals?
    - Monitor customers' outstanding balances and find overdue DVDs.

In [2]:
# pip install ipython-sql

In [2]:
# pip install pymysql

In [3]:
# Loading the SQL module
%load_ext sql

In [4]:
# Connect to database
%sql mysql+pymysql://root:123456@localhost/sakila

# General Information

### What's the time range of the data?

In [19]:
%%sql

select distinct substr(payment_date, 1, 7) date
from payment

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


date
2005-05
2005-06
2005-07
2005-08
2006-02
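
The distinct-month query shows which months contain payments. If exact bounds are wanted, `min` and `max` on `payment_date` return them in a single row. The following is a minimal sketch on an in-memory SQLite table, not on the Sakila database; the three timestamps and the helper name `payment_date_range` are made up for illustration.

```python
import sqlite3


def payment_date_range(payment_dates):
    """Return (earliest, latest) payment_date from a toy payment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table payment (payment_id integer primary key, payment_date text)"
    )
    conn.executemany(
        "insert into payment (payment_date) values (?)",
        [(d,) for d in payment_dates],
    )
    # Same aggregate query would work against sakila.payment in MySQL.
    row = conn.execute(
        "select min(payment_date), max(payment_date) from payment"
    ).fetchone()
    conn.close()
    return row


# Hypothetical sample timestamps, not taken from Sakila.
sample_dates = ["2005-08-23 22:50:12", "2005-05-25 11:30:37", "2006-02-14 15:16:03"]
print(payment_date_range(sample_dates))
```

The same `select min(payment_date), max(payment_date) from payment` statement should run unchanged in a `%%sql` cell.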


### How many stores are there?

In [20]:
%%sql

select distinct store_id
from store

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


store_id
1
2


### How many staff members are there?

In [21]:
%%sql

select distinct staff_id
from staff

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


staff_id
1
2


# Store Performance

### Which store has more customer rentals?

In [22]:
%%sql

select i.store_id, count(distinct r.rental_id) num_of_rentals
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
group by i.store_id
order by count(distinct r.rental_id)

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


store_id,num_of_rentals
1,7923
2,8121


### Which store makes the most money?

In [23]:
%%sql

select s.store_id, sum(p.amount) sale
from payment p
left join staff s
on s.staff_id = p.staff_id
group by s.store_id
order by sum(p.amount)

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


store_id,sale
1,33482.5
2,33924.06
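
The query above credits each payment to the store of the staff member who took it. The `sales_by_store` view queried at the end of this notebook reports different per-store totals, which would be consistent with the view crediting payments to the store that owns the rented copy instead. That is an assumption about the view, and `show create view sales_by_store` would confirm it. The sketch below uses a made-up in-memory SQLite schema (amounts in cents, all ids hypothetical) to show how the two attributions can split the same total differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table staff (staff_id integer primary key, store_id integer);
create table inventory (inventory_id integer primary key, store_id integer);
create table rental (rental_id integer primary key, inventory_id integer);
create table payment (payment_id integer primary key, rental_id integer,
                      staff_id integer, amount_cents integer);

insert into staff values (1, 1), (2, 2);
insert into inventory values (10, 1), (20, 2);
insert into rental values (100, 10), (200, 20), (300, 10);
-- payment 2 is taken by store 1 staff for a copy owned by store 2,
-- payment 3 is taken by store 2 staff for a copy owned by store 1
insert into payment values (1, 100, 1, 300), (2, 200, 1, 500), (3, 300, 2, 200);
""")

# Attribution used in the notebook query: store of the staff member.
by_staff_store = conn.execute("""
    select s.store_id, sum(p.amount_cents)
    from payment p
    join staff s on s.staff_id = p.staff_id
    group by s.store_id
    order by s.store_id
""").fetchall()

# Alternative attribution: store that owns the rented inventory item.
by_inventory_store = conn.execute("""
    select i.store_id, sum(p.amount_cents)
    from payment p
    join rental r on r.rental_id = p.rental_id
    join inventory i on i.inventory_id = r.inventory_id
    group by i.store_id
    order by i.store_id
""").fetchall()

print(by_staff_store)      # per-store totals by staff
print(by_inventory_store)  # per-store totals by inventory owner
```

Both results sum to the same overall revenue; only the split between stores changes, so the answer to "which store makes the most money" depends on which attribution is intended.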


# Consumer behavior

In [13]:
%%sql

select * from payment

 * mysql+pymysql://root:***@localhost/sakila
16044 rows affected.


payment_id,customer_id,staff_id,rental_id,amount,payment_date,last_update
1,1,1,76,2.99,2005-05-25 11:30:37,2006-02-15 22:12:30
2,1,1,573,0.99,2005-05-28 10:35:23,2006-02-15 22:12:30
3,1,1,1185,5.99,2005-06-15 00:54:12,2006-02-15 22:12:30
4,1,2,1422,0.99,2005-06-15 18:02:53,2006-02-15 22:12:30
5,1,2,1476,9.99,2005-06-15 21:08:46,2006-02-15 22:12:30
6,1,1,1725,4.99,2005-06-16 15:18:57,2006-02-15 22:12:30
7,1,1,2308,4.99,2005-06-18 08:41:48,2006-02-15 22:12:30
8,1,2,2363,0.99,2005-06-18 13:33:59,2006-02-15 22:12:30
9,1,1,3284,3.99,2005-06-21 06:24:45,2006-02-15 22:12:30
10,1,2,4526,5.99,2005-07-08 03:17:05,2006-02-15 22:12:30


- If we want to hire an actor to do ads for us, which actor is in the most films?
- If we want to hire an actor to do ads for us, which actors/actresses are most popular given our rental history?
- Given the films customers have rented, which new ones should we suggest to them?
- Who are our most loyal customers?

### What are the top 10 most popular films that customers rent?

In [26]:
%%sql

select f.title, count(r.inventory_id) times_rented
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
left join film f
on i.film_id = f.film_id
group by f.title
order by count(r.inventory_id) desc
limit 10

 * mysql+pymysql://root:***@localhost/sakila
10 rows affected.


title,times_rented
BUCKET BROTHERHOOD,34
ROCKETEER MOTHER,33
RIDGEMONT SUBMARINE,32
GRIT CLOCKWORK,32
SCALAWAG DUCK,32
JUGGLER HARDLY,32
FORWARD TEMPLE,32
HOBBIT ALIEN,31
ROBBERS JOON,31
ZORRO ARK,31


### What are the top films that customers rented for the longest period?

In [48]:
%%sql

select f.title, datediff(return_date, rental_date) rental_days
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
left join film f
on i.film_id = f.film_id
order by datediff(return_date, rental_date) desc
limit 10

 * mysql+pymysql://root:***@localhost/sakila
10 rows affected.


title,rental_days
ROBBERY BRIGHT,10
UPRISING UPTOWN,10
TRADING PINOCCHIO,10
LOVE SUICIDES,10
HOME PITY,10
BAREFOOT MANCHURIAN,10
JUGGLER HARDLY,10
FORWARD TEMPLE,10
SHOW LORD,10
MOSQUITO ARMAGEDDON,10


### Among the films rented for the longest period, which were rented most often?

In [68]:
%%sql

with rental_days as (
select f.title, datediff(return_date, rental_date) rental_days
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
left join film f
on i.film_id = f.film_id
),

rental_times as (select f.title, count(r.inventory_id) times_rented
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
left join film f
on i.film_id = f.film_id
group by f.title
)

select distinct rd.title, rental_days, times_rented
from rental_days rd
left join rental_times rt
on rd.title = rt.title
where rental_days = 10
order by times_rented desc
limit 10

 * mysql+pymysql://root:***@localhost/sakila
10 rows affected.


title,rental_days,times_rented
ROCKETEER MOTHER,10,33
JUGGLER HARDLY,10,32
FORWARD TEMPLE,10,32
TIMBERLAND SKY,10,31
CAT CONEHEADS,10,30
HARRY IDAHO,10,30
FORRESTER COMANCHEROS,10,27
SWARM GOLD,10,27
CURTAIN VIDEOTAPE,10,27
BLACKOUT PRIVATE,10,27
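
The query above hard-codes `rental_days = 10`, the maximum seen in the previous result. A subquery on the same CTE can compute that maximum instead, so the filter keeps working if the data changes. Below is a minimal sketch on a made-up in-memory SQLite table. SQLite has no `datediff`, so the day difference is emulated with `julianday(date(...))`; in MySQL the original `datediff(return_date, rental_date)` would stay as is. Film titles and dates are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table rental (rental_id integer primary key, title text,
                     rental_date text, return_date text);
insert into rental (title, rental_date, return_date) values
  ('FILM A', '2005-05-01 10:00:00', '2005-05-11 09:00:00'),
  ('FILM A', '2005-06-01 10:00:00', '2005-06-03 09:00:00'),
  ('FILM B', '2005-05-02 10:00:00', '2005-05-12 20:00:00'),
  ('FILM C', '2005-05-03 10:00:00', '2005-05-08 10:00:00');
""")

longest_rented = conn.execute("""
    with rental_days as (
        -- whole calendar days between rental and return (datediff stand-in)
        select title,
               cast(julianday(date(return_date)) - julianday(date(rental_date))
                    as integer) as days
        from rental
    ),
    rental_times as (
        select title, count(*) as times_rented
        from rental
        group by title
    )
    select distinct rd.title, rd.days, rt.times_rented
    from rental_days rd
    join rental_times rt on rt.title = rd.title
    -- computed maximum instead of a hard-coded 10
    where rd.days = (select max(days) from rental_days)
    order by rt.times_rented desc, rd.title
""").fetchall()

print(longest_rented)
```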


### What is the average rental period?

In [34]:
%%sql

select round(avg(datediff(return_date, rental_date)),0) avg_rental_days
from rental

 * mysql+pymysql://root:***@localhost/sakila
1 rows affected.


avg_rental_days
5


### Which genres are most popular?

In [57]:
%%sql

select c.name, count(r.inventory_id) times_rented
from rental r
left join inventory i
on r.inventory_id = i.inventory_id
left join film f
on i.film_id = f.film_id
left join film_category fc
on f.film_id = fc.film_id
left join category c
on fc.category_id = c.category_id
group by c.name
order by count(r.inventory_id) desc

 * mysql+pymysql://root:***@localhost/sakila
16 rows affected.


name,times_rented
Sports,1179
Animation,1166
Action,1112
Sci-Fi,1101
Family,1096
Drama,1060
Documentary,1050
Foreign,1033
Games,969
Children,945


### Who are our most loyal customers?

In [72]:
%%sql

select concat(first_name, ' ', last_name) name, sum(amount) total_payment, count(rental_id) times_rented
from customer c
left join payment p
on c.customer_id = p.customer_id
group by name
order by sum(amount) desc, count(rental_id) desc
limit 10

 * mysql+pymysql://root:***@localhost/sakila
10 rows affected.


name,total_payment,times_rented
KARL SEAL,221.55,45
ELEANOR HUNT,216.54,46
CLARA SHAW,195.58,42
RHONDA KENNEDY,194.61,39
MARION SNYDER,194.61,39
TOMMY COLLAZO,186.62,38
WESLEY BULL,177.6,40
TIM CARY,175.61,39
MARCIA DEAN,175.58,42
ANA BRADLEY,174.66,34
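
The query above groups by the concatenated name. If two customers ever share a first and last name, their payments would be merged into one row, so grouping by `customer_id` is the safer habit. The sketch below shows the difference on a made-up in-memory SQLite table (names, ids and amounts in cents are hypothetical, and `||` is SQLite's string concatenation where MySQL uses `concat`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table customer (customer_id integer primary key,
                       first_name text, last_name text);
create table payment (payment_id integer primary key,
                      customer_id integer, amount_cents integer);
-- customers 1 and 2 are different people with the same name
insert into customer values (1, 'ALEX', 'SMITH'), (2, 'ALEX', 'SMITH'), (3, 'KIM', 'LEE');
insert into payment (customer_id, amount_cents) values
  (1, 500), (1, 300), (2, 400), (3, 700);
""")

# Grouping by name merges the two ALEX SMITH customers.
totals_by_name = conn.execute("""
    select first_name || ' ' || last_name as name, sum(amount_cents) as total
    from customer c
    join payment p on p.customer_id = c.customer_id
    group by first_name, last_name
    order by total desc
""").fetchall()

# Grouping by the primary key keeps them apart.
totals_by_id = conn.execute("""
    select c.customer_id, first_name || ' ' || last_name as name,
           sum(amount_cents) as total
    from customer c
    join payment p on p.customer_id = c.customer_id
    group by c.customer_id, first_name, last_name
    order by total desc, c.customer_id
""").fetchall()

print(totals_by_name)
print(totals_by_id)
```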



## Explore product-related tables

### FILM table

In [5]:
%%sql
describe film

 * mysql+pymysql://root:***@localhost/sakila
13 rows affected.


Field,Type,Null,Key,Default,Extra
film_id,smallint unsigned,NO,PRI,,auto_increment
title,varchar(128),NO,MUL,,
description,text,YES,,,
release_year,year,YES,,,
language_id,tinyint unsigned,NO,MUL,,
original_language_id,tinyint unsigned,YES,MUL,,
rental_duration,tinyint unsigned,NO,,3,
rental_rate,"decimal(4,2)",NO,,4.99,
length,smallint unsigned,YES,,,
replacement_cost,"decimal(5,2)",NO,,19.99,


In [6]:
%%sql
# What is the largest rental_rate for each rating?

select rating, max(rental_rate) largest_rate
from film
group by rating;

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


rating,largest_rate
PG,4.99
G,4.99
NC-17,4.99
PG-13,4.99
R,4.99


In [7]:
%%sql
# How many films are in each rating category?

select rating, count(film_id) num_of_film
from film
group by rating

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


rating,num_of_film
PG,194
G,178
NC-17,210
PG-13,223
R,195


In [8]:
%%sql
# Create a new column film_length to segment films by length:
# length < 60 is Short, length < 120 is Standard, length >= 120 is Long.
# Then count the number of films in each segment.

select case 
    when length <60 then 'Short'
    when length <120 then 'Standard'
    when length >=120 then 'Long'
    end as film_length,
    count(film_id) num_of_film
from film
group by film_length

 * mysql+pymysql://root:***@localhost/sakila
3 rows affected.


film_length,num_of_film
Standard,438
Short,96
Long,466


In [9]:
%%sql
# Find language name for each film

select title, name
from film f
left join language l
on f.language_id = l.language_id
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,name
ACADEMY DINOSAUR,English
ACE GOLDFINGER,English
ADAPTATION HOLES,English
AFFAIR PREJUDICE,English
AFRICAN EGG,English


###  ACTOR table

In [10]:
%%sql
describe actor

 * mysql+pymysql://root:***@localhost/sakila
4 rows affected.


Field,Type,Null,Key,Default,Extra
actor_id,smallint unsigned,NO,PRI,,auto_increment
first_name,varchar(45),NO,,,
last_name,varchar(45),NO,MUL,,
last_update,timestamp,NO,,CURRENT_TIMESTAMP,DEFAULT_GENERATED on update CURRENT_TIMESTAMP


In [11]:
%%sql
# Which actors have the last name 'Johansson'?

select actor_id, first_name, last_name
from actor
where last_name = 'Johansson'

 * mysql+pymysql://root:***@localhost/sakila
3 rows affected.


actor_id,first_name,last_name
8,MATTHEW,JOHANSSON
64,RAY,JOHANSSON
146,ALBERT,JOHANSSON


In [12]:
%%sql
# Add a column showing each actor's full name with only the first letter of the first and last names capitalized

select *,
    concat(
        upper(left(first_name, 1)), lower(substr(first_name, 2)), ' ',
        upper(left(last_name, 1)), lower(substr(last_name, 2))
    ) full_name
from actor
limit 10

 * mysql+pymysql://root:***@localhost/sakila
10 rows affected.


actor_id,first_name,last_name,last_update,full_name
1,PENELOPE,GUINESS,2006-02-15 04:34:33,Penelope Guiness
2,NICK,WAHLBERG,2006-02-15 04:34:33,Nick Wahlberg
3,ED,CHASE,2006-02-15 04:34:33,Ed Chase
4,JENNIFER,DAVIS,2006-02-15 04:34:33,Jennifer Davis
5,JOHNNY,LOLLOBRIGIDA,2006-02-15 04:34:33,Johnny Lollobrigida
6,BETTE,NICHOLSON,2006-02-15 04:34:33,Bette Nicholson
7,GRACE,MOSTEL,2006-02-15 04:34:33,Grace Mostel
8,MATTHEW,JOHANSSON,2006-02-15 04:34:33,Matthew Johansson
9,JOE,SWANK,2006-02-15 04:34:33,Joe Swank
10,CHRISTIAN,GABLE,2006-02-15 04:34:33,Christian Gable


In [13]:
%%sql
# How many distinct actors’ last names are there?

select count(distinct last_name) num_of_last_name
from actor

 * mysql+pymysql://root:***@localhost/sakila
1 rows affected.


num_of_last_name
121


In [14]:
%%sql 
# Which last names are not repeated? 

select last_name
from actor
group by last_name
having count(last_name) = 1
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


last_name
ASTAIRE
BACALL
BALE
BALL
BARRYMORE


In [15]:
%%sql
# Which last names appear more than once?

select last_name
from actor
group by last_name
having count(last_name) >1
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


last_name
AKROYD
ALLEN
BAILEY
BENING
BERRY


### FILM_ACTOR table

In [16]:
%%sql
describe film_actor

 * mysql+pymysql://root:***@localhost/sakila
3 rows affected.


Field,Type,Null,Key,Default,Extra
actor_id,smallint unsigned,NO,PRI,,
film_id,smallint unsigned,NO,PRI,,
last_update,timestamp,NO,,CURRENT_TIMESTAMP,DEFAULT_GENERATED on update CURRENT_TIMESTAMP


In [17]:
%%sql
# Count the number of actors in each film, order the result by the number of actors with descending order

select title, count(actor_id) num_of_actor
from film_actor fa
join film f
on fa.film_id = f.film_id
group by title
order by num_of_actor desc
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,num_of_actor
LAMBS CINCINATTI,15
CRAZY HOME,13
DRACULA CRYSTAL,13
CHITTY LOCK,13
BOONDOCK BALLROOM,13


In [18]:
%%sql 
# How many films has each actor played in?

select first_name, last_name, count(film_id) num_of_film
from film_actor fa
join actor a
on fa.actor_id = a.actor_id
group by first_name, last_name
order by num_of_film desc
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


first_name,last_name,num_of_film
SUSAN,DAVIS,54
GINA,DEGENERES,42
WALTER,TORN,41
MARY,KEITEL,40
MATTHEW,CARREY,39


In [19]:
%%sql
# What's the actor name for each actor_id, and the film title for each film_id?

select fa.actor_id, first_name, last_name, fa.film_id, title
from film_actor fa
join film f on fa.film_id = f.film_id
join actor a on fa.actor_id = a.actor_id
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


actor_id,first_name,last_name,film_id,title
1,PENELOPE,GUINESS,1,ACADEMY DINOSAUR
1,PENELOPE,GUINESS,23,ANACONDA CONFESSIONS
1,PENELOPE,GUINESS,25,ANGELS LIFE
1,PENELOPE,GUINESS,106,BULWORTH COMMANDMENTS
1,PENELOPE,GUINESS,140,CHEAPER CLYDE


In [20]:
%%sql 
# Which category does each film in the film table belong to?

select title, c.name category_name
from film f
left join film_category fc
on f.film_id = fc.film_id
left join category c
on fc.category_id = c.category_id
order by category_name
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,category_name
AMERICAN CIRCUS,Action
ARK RIDGEMONT,Action
BAREFOOT MANCHURIAN,Action
ANTITRUST TOMATOES,Action
AMADEUS HOLY,Action


In [21]:
%%sql
# Which films have rental_rate > 2 and rating G, PG-13 or PG. 

select title, rental_rate, rating
from film
where rental_rate>2 and rating in ('G','PG-13','PG')
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,rental_rate,rating
ACE GOLDFINGER,4.99,G
AFFAIR PREJUDICE,2.99,G
AFRICAN EGG,2.99,G
AGENT TRUMAN,2.99,PG
AIRPLANE SIERRA,4.99,PG-13


## Sales information

In [22]:
%%sql
# How many rentals happened from 2005-05 to 2005-08?

select count(rental_id) num_of_rental
from rental
where rental_date between '2005-05-01' and '2005-08-31'

 * mysql+pymysql://root:***@localhost/sakila
1 rows affected.


num_of_rental
15862
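
One caveat on the date filter: in MySQL, comparing a `datetime` column with the literal `'2005-08-31'` treats the literal as midnight at the start of that day, so `between '2005-05-01' and '2005-08-31'` leaves out anything that happened later on 31 August. Whether that changes the count depends on the data. A half-open range (`>= '2005-05-01'` and `< '2005-09-01'`) avoids the question. The sketch below reproduces the effect on a made-up in-memory SQLite table; SQLite compares these timestamps as text, which gives the same result for the hypothetical rows used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table rental (rental_id integer primary key, rental_date text)")
conn.executemany(
    "insert into rental (rental_date) values (?)",
    [
        ("2005-05-01 09:00:00",),
        ("2005-08-30 12:00:00",),
        ("2005-08-31 18:30:00",),  # last day of the period, after midnight
        ("2005-09-01 08:00:00",),  # outside the period
    ],
)

# Closed range with a date-only upper bound: misses the 31 August evening row.
between_count = conn.execute(
    "select count(*) from rental "
    "where rental_date between '2005-05-01' and '2005-08-31'"
).fetchone()[0]

# Half-open range: everything from 1 May up to, but not including, 1 September.
half_open_count = conn.execute(
    "select count(*) from rental "
    "where rental_date >= '2005-05-01' and rental_date < '2005-09-01'"
).fetchone()[0]

print(between_count, half_open_count)
```

The same half-open pattern applies to the `payment_date` filters in the sales queries later in the notebook.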


In [23]:
%%sql
# What's the number of rentals by month?

select substr(rental_date, 1, 7) month, count(rental_id) num_of_rental
from rental
group by month

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


month,num_of_rental
2005-05,1156
2005-06,2311
2005-07,6709
2005-08,5686
2006-02,182


In [24]:
%%sql
# Rank the staff by total rental volume over the whole period

select first_name, last_name, count(rental_id) num_of_rental
from rental r
join staff s
on r.staff_id = s.staff_id
group by first_name, last_name
order by num_of_rental desc

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


first_name,last_name,num_of_rental
Mike,Hillyer,8040
Jon,Stephens,8004


## Inventory information

In [25]:
%%sql
# What's the inventory level report for each film in each store?

select title, store_id, count(inventory_id) num_of_inventory
from inventory i
left join film f
on i.film_id = f.film_id
group by title, store_id
order by title
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,store_id,num_of_inventory
ACADEMY DINOSAUR,1,4
ACADEMY DINOSAUR,2,4
ACE GOLDFINGER,2,3
ADAPTATION HOLES,2,4
AFFAIR PREJUDICE,1,4


In [26]:
%%sql
# What's the inventory level report for each film in each store, including category information for each film?
select title, store_id, c.name, count(inventory_id) num_of_inventory
from inventory i
left join film f
on i.film_id = f.film_id
left join film_category fc
on f.film_id = fc.film_id
left join category c
on c.category_id = fc.category_id
group by title, store_id, c.name
order by title
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


title,store_id,name,num_of_inventory
ACADEMY DINOSAUR,1,Documentary,4
ACADEMY DINOSAUR,2,Documentary,4
ACE GOLDFINGER,2,Horror,3
ADAPTATION HOLES,2,Documentary,4
AFFAIR PREJUDICE,1,Horror,4


In [39]:
%%sql
# Drop the table if it exists
drop table if exists sakila.inventory_summary

 * mysql+pymysql://root:***@localhost/sakila
0 rows affected.


[]

In [40]:
%%sql
# Create a table to save the above query result

create table inventory_summary as
select i.film_id, title, store_id, c.name, count(inventory_id) num_of_inventory
from inventory i
left join film f
on i.film_id = f.film_id
left join film_category fc
on f.film_id = fc.film_id
left join category c
on c.category_id = fc.category_id
group by i.film_id, title, store_id, c.name
order by title

 * mysql+pymysql://root:***@localhost/sakila
1521 rows affected.


[]

In [41]:
%%sql
# Use the inventory summary report to identify films that are not available in any store

select f.film_id, title
from film f
where f.film_id not in (select ins.film_id from inventory_summary ins)
limit 5

 * mysql+pymysql://root:***@localhost/sakila
5 rows affected.


film_id,title
14,ALICE FANTASIA
33,APOLLO TEEN
36,ARGONAUTS TOWN
38,ARK RIDGEMONT
41,ARSENIC INDEPENDENCE
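
The same question can also be answered without the intermediate `inventory_summary` table, by checking `inventory` directly with `not exists`. Unlike `not in`, `not exists` still behaves as expected if the subquery column ever contains a `null`. This is an alternative sketch, not the notebook's method, shown on a made-up in-memory SQLite schema with hypothetical film titles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table film (film_id integer primary key, title text);
create table inventory (inventory_id integer primary key,
                        film_id integer, store_id integer);
insert into film values (1, 'FILM A'), (2, 'FILM B'), (3, 'FILM C');
-- FILM B has no copies in any store
insert into inventory (film_id, store_id) values (1, 1), (1, 2), (3, 1);
""")

# Anti-join: keep films for which no inventory row exists.
films_without_inventory = conn.execute("""
    select f.film_id, f.title
    from film f
    where not exists (
        select 1 from inventory i where i.film_id = f.film_id
    )
    order by f.film_id
""").fetchall()

print(films_without_inventory)
```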


## Sales 

In [43]:
%%sql
# How much revenue was made each month from 2005-05 to 2005-08?

select substr(payment_date, 1, 7) month, sum(amount) sales
from payment
where payment_date between '2005-05-01' and '2005-08-31'
group by month

 * mysql+pymysql://root:***@localhost/sakila
4 rows affected.


month,sales
2005-05,4823.44
2005-06,9629.89
2005-07,28368.91
2005-08,24070.14


In [48]:
%%sql
# How much revenue did each store make each month from 2005-05 to 2005-08?

select s.store_id, substr(payment_date, 1, 7) month, sum(amount) sales
from payment p
join staff s
on p.staff_id = s.staff_id
where payment_date between '2005-05-01' and '2005-08-31'
group by s.store_id, month

 * mysql+pymysql://root:***@localhost/sakila
8 rows affected.


store_id,month,sales
1,2005-05,2621.83
1,2005-06,4774.37
1,2005-07,13998.56
1,2005-08,11853.65
2,2005-06,4855.52
2,2005-07,14370.35
2,2005-08,12216.49
2,2005-05,2201.61


In [55]:
%%sql
# Which movies are unpopular? The manager could put those up for sale to free up shelf space for newer ones.

select i.film_id, f.title, c.name category, count(r.inventory_id) num_of_rental
from inventory i
left join rental r 
on i.inventory_id = r.inventory_id
left join film f
on i.film_id = f.film_id
left join film_category fc
on f.film_id = fc.film_id
left join category c
on fc.category_id = c.category_id
group by 1,2,3
order by num_of_rental

 * mysql+pymysql://root:***@localhost/sakila
958 rows affected.


film_id,title,category,num_of_rental
400,HARDLY ROBBERS,Documentary,4
584,MIXED DOORS,Foreign,4
904,TRAIN BUNCH,Horror,4
94,BRAVEHEART HUMAN,Family,5
107,BUNCH MINDS,Drama,5
180,CONSPIRACY SPIRIT,Classics,5
310,FEVER EMPIRE,Games,5
335,FREEDOM CLEOPATRA,Comedy,5
343,FULL FLATLINERS,Children,5
362,GLORY TRACY,Games,5


In [30]:
%%sql 

select * 
from sales_by_store

 * mysql+pymysql://root:***@localhost/sakila
2 rows affected.


store,manager,total_sales
"Woodridge,Australia",Jon Stephens,33726.77
"Lethbridge,Canada",Mike Hillyer,33679.79
