# DVD rental analysis

### We use a sample PostgreSQL database to analyze DVD rentals over time.

##### For this exercise, we have set up a PostgreSQL database and imported a sample database that covers DVD rentals.

The DVD relational database has 15 tables, representing DVD rental behavior by various dimensions and metrics. These tables provide insight into the customers, staff, movie titles and rental time periods. 

In this Jupyter notebook, we will focus on using SQL to transform, explore and summarize this information. We will then store the results in pandas dataframes in order to produce summary statistics and graphical representations of the data using matplotlib.

---

### The following cells install the modules needed to query a PostgreSQL database from within a Jupyter notebook.

In [32]:
# Installing sqlalchemy, a library to interact with various databases (including postgreSQL)
!pip install sqlalchemy

# Install psycopg2, the postgres driver that sqlalchemy uses to connect
!pip install psycopg2

# Install ipython-sql to enable magic command querying through sqlalchemy
!pip install ipython-sql


In [58]:
# Importing necessary packages to query our postgresql database

import sqlalchemy
import psycopg2

# ipython-sql does not need to be imported like the other modules - it enables magic commands in jupyter notebook,
#  to run SQL within a code cell

# import ipython-sql

ModuleNotFoundError: No module named 'alembic'

In [31]:
help(sqlalchemy.text)

Help on function text in module sqlalchemy.sql.expression:

text(text, bind=None)
    Construct a new :class:`_expression.TextClause` clause,
    representing
    a textual SQL string directly.
    
    E.g.::
    
        from sqlalchemy import text
    
        t = text("SELECT * FROM users")
        result = connection.execute(t)
    
    The advantages :func:`_expression.text`
    provides over a plain string are
    backend-neutral support for bind parameters, per-statement
    execution options, as well as
    bind parameter and result-column typing behavior, allowing
    SQLAlchemy type constructs to play a role when executing
    a statement that is specified literally.  The construct can also
    be provided with a ``.c`` collection of column elements, allowing
    it to be embedded in other SQL expression constructs as a subquery.
    
    Bind parameters are specified by name, using the format ``:name``.
    E.g.::
    
        t = text("SELECT * FROM users WHERE id=:user_id

In [3]:
# Define endpoint for sqlalchemy.engine to interface with postgres

engine = sqlalchemy.create_engine('postgresql://postgres:W4ac8s-Dkpth@localhost:5432/dvdrental')
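The connection string above embeds the database password in the notebook. A minimal sketch of an alternative, assuming the credentials are supplied through environment variables (the names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT` and `PGDATABASE` are an assumption here, not something this notebook sets up), builds a URL of the same form without hard-coding the secret:

```python
import os
from urllib.parse import quote_plus


def build_postgres_url(env=os.environ):
    """Assemble a postgresql:// URL from environment-style settings."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "dvdrental")
    # quote_plus escapes characters such as '@' or '/' that would otherwise break the URL
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


# Example with a made-up password, passed as a plain dict instead of os.environ
example_url = build_postgres_url({"PGPASSWORD": "p@ss/word"})
print(example_url)
```

The resulting string could then be handed to `sqlalchemy.create_engine` in place of the literal URL.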

In [4]:
# Loading magic commands from ipython-sql, and necessary parameters for querying the database through sqlalchemy

%load_ext sql
%sql $engine.url

In [5]:
%%sql

-- Testing SQL connection

SELECT actor_id, first_name
from actor
WHERE first_name = 'Ed'

 * postgresql://postgres:***@localhost:5432/dvdrental
3 rows affected.


actor_id,first_name
3,Ed
136,Ed
179,Ed


----

### Our next block of cells will focus on some exploratory work with the DVD rental database. 
###### We will experiment with the available tables to understand more about their makeup. We can use this exploratory exercise to plan some deeper dives in a later analysis.

In [11]:
%%sql

-- List out all tables that exist in the database

SELECT table_name
FROM information_schema.tables
WHERE table_schema='public'
AND table_type='BASE TABLE';

 * postgresql://postgres:***@localhost:5432/dvdrental
15 rows affected.


table_name
actor
store
address
category
city
country
customer
film_actor
film_category
inventory


In [27]:
%%sql

-- Summarizing the total number of tables available in this database

SELECT count(distinct table_name) as num_tables
FROM information_schema.tables
WHERE 1=1
AND table_schema = 'public'
AND table_type = 'BASE TABLE'

 * postgresql://postgres:***@localhost:5432/dvdrental
1 rows affected.


num_tables
15


In [73]:
%%sql

select * from actor limit 5

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


actor_id,first_name,last_name,last_update
1,Penelope,Guiness,2013-05-26 14:47:57.620000
2,Nick,Wahlberg,2013-05-26 14:47:57.620000
3,Ed,Chase,2013-05-26 14:47:57.620000
4,Jennifer,Davis,2013-05-26 14:47:57.620000
5,Johnny,Lollobrigida,2013-05-26 14:47:57.620000


In [74]:
%%sql

select * from rental limit 5

 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


rental_id,rental_date,inventory_id,customer_id,return_date,staff_id,last_update
2,2005-05-24 22:54:33,1525,459,2005-05-28 19:40:33,1,2006-02-16 02:30:53
3,2005-05-24 23:03:39,1711,408,2005-06-01 22:12:39,1,2006-02-16 02:30:53
4,2005-05-24 23:04:41,2452,333,2005-06-03 01:43:41,2,2006-02-16 02:30:53
5,2005-05-24 23:05:21,2079,222,2005-06-02 04:33:21,1,2006-02-16 02:30:53
6,2005-05-24 23:08:07,2792,549,2005-05-27 01:32:07,1,2006-02-16 02:30:53


In [8]:
%%sql

select table_name, column_name, data_type
from information_schema.columns
where table_name in ('film','inventory','film_actor')
order by table_name, ordinal_position

 * postgresql://postgres:***@localhost:5432/dvdrental
20 rows affected.


table_name,column_name,data_type
film,film_id,integer
film,title,character varying
film,description,text
film,release_year,integer
film,language_id,smallint
film,rental_duration,smallint
film,rental_rate,numeric
film,length,smallint
film,replacement_cost,numeric
film,rating,USER-DEFINED


In [35]:
%%sql

-- Joining several tables to determine what the top ranking titles are, in descending order, by count of rentals. 

select film.title
--    , concat(actor.first_name,actor.last_name) as actor_name
    , count(distinct rental_id) as num_rentals
    from film 
    left outer join film_actor
    on film.film_id = film_actor.film_id
    left outer join actor
    on film_actor.actor_id = actor.actor_id
    left outer join inventory
    -- Caution: this matches film_id against inventory_id; inventory also has its own film_id column
    on film.film_id = inventory.inventory_id
    left outer join rental
    on inventory.inventory_id = rental.inventory_id
group by 1
order by 2 desc

 * postgresql://postgres:***@localhost:5432/dvdrental
1000 rows affected.


title,num_rentals
Doom Dancing,5
Silence Kane,5
Side Ark,5
Double Wrath,5
Shawshank Bubble,5
Life Twisted,5
Velvet Terminator,5
Dracula Crystal,5
Legally Secretary,5
Lawrence Love,5
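
Every title above tops out at 5 rentals, which is worth checking against the join keys: the query matches `film.film_id` to `inventory.inventory_id`, whereas `inventory` also carries a `film_id` column. The sketch below uses an in-memory SQLite database with made-up rows (not the dvdrental data) to show how the two join conditions can produce different counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER, title TEXT);
    CREATE TABLE inventory (inventory_id INTEGER, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER, inventory_id INTEGER);
    -- Toy rows: Film B has two copies (inventory 1 and 2), Film A has one (inventory 3)
    INSERT INTO film VALUES (1, 'Film A'), (2, 'Film B');
    INSERT INTO inventory VALUES (1, 2), (2, 2), (3, 1);
    INSERT INTO rental VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

QUERY = """
    SELECT f.title, COUNT(DISTINCT r.rental_id) AS num_rentals
    FROM film f
    LEFT JOIN inventory i ON {condition}
    LEFT JOIN rental r ON i.inventory_id = r.inventory_id
    GROUP BY f.title
    ORDER BY f.title
"""


def rentals_per_title(condition):
    """Run the count query with the given film-to-inventory join condition."""
    return conn.execute(QUERY.format(condition=condition)).fetchall()


by_inventory_id = rentals_per_title("f.film_id = i.inventory_id")
by_film_id = rentals_per_title("f.film_id = i.film_id")
print(by_inventory_id)  # counts when film_id is matched to inventory_id
print(by_film_id)       # counts when film_id is matched to inventory.film_id
```

In the toy data, Film B was rented three times across its two copies, which only the `film_id` join recovers.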


In [25]:
%%sql

-- Summarizing count of rental records by year and month

with base_calcs as (
    select DATE_PART('year',rental_date) as rental_year
    , DATE_PART('month',rental_date) as rental_month
    , DATE_PART('year',rental_date) || '-' || DATE_PART('month',rental_date) as rental_year_month
    , (DATE_PART('day',return_date) - DATE_PART('day',rental_date)) as days_rented
    , rental_id
    from rental
    where 1=1
    )
select rental_year
, rental_year_month
, count(distinct rental_id) as num_rentals
from base_calcs
where 1=1
group by 1,2
order by 2 asc



 * postgresql://postgres:***@localhost:5432/dvdrental
5 rows affected.


rental_year,rental_year_month,num_rentals
2005.0,2005-5,1156
2005.0,2005-6,2311
2005.0,2005-7,6709
2005.0,2005-8,5686
2006.0,2006-2,182


---

# Forecasting future DVD category rental demand

### Let's use SQL and the history in our tables to produce a forward-looking forecast. Typically, this would be best suited to Python, but we will experiment with using SQL for this use case.

First, we will start by pulling some sample information from the tables we will need to leverage. Then, we can switch to modelling our forecasts, using a very simple linear trend to extrapolate. 
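
As a reference point for the "very simple linear trend", here is a minimal sketch of an ordinary least-squares line fitted to the total monthly rental counts from the summary above (May to August 2005, leaving out the partial 2006-2 month). It is written in Python rather than SQL purely for illustration, it works on totals rather than per-category counts, and the function names are made up.

```python
def fit_line(ys):
    """Least-squares slope and intercept for ys against x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def extrapolate(ys, steps_ahead=1):
    """Value of the fitted line steps_ahead periods after the last observation."""
    slope, intercept = fit_line(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)


# Monthly rental counts for 2005-5 through 2005-8, taken from the summary above
monthly_rentals = [1156, 2311, 6709, 5686]
slope, intercept = fit_line(monthly_rentals)
next_month = extrapolate(monthly_rentals)
print(round(slope, 1), round(intercept, 1), round(next_month, 1))
```

The fitted slope is about 1,799 additional rentals per month, which extrapolates to roughly 8,460 for the following month. With only four points this is a rough guide at best. PostgreSQL's `regr_slope` and `regr_intercept` aggregates compute the same two quantities in SQL.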

In [60]:
%%sql

-- Quick SQL query to 'search' for specific column names across tables, to identify joins easily

select table_name, column_name
from information_schema.columns
where column_name = 'category_id'


 * postgresql://postgres:***@localhost:5432/dvdrental
2 rows affected.


table_name,column_name
category,category_id
film_category,category_id


In [62]:
%%sql

-- Pulling sample data and metadata from the tables we will require; we will use this info as 
-- a cheat sheet for crafting joins

select table_name, column_name, data_type
from information_schema.columns
where 1=1
and table_name = 'film_category'

union all

select table_name, column_name, data_type
from information_schema.columns
where 1=1
and table_name = 'film'
and column_name in ('film_id','release_year')

union all

select table_name, column_name, data_type
from information_schema.columns
where 1=1
and table_name = 'rental'
and column_name in ('rental_id','rental_date','inventory_id')

union all

select table_name, column_name, data_type
from information_schema.columns
where 1=1
and table_name = 'inventory'
and column_name in ('inventory_id','film_id','store_id')

union all

select table_name, column_name, data_type
from information_schema.columns
where 1=1
and table_name = 'category'
--and column_name in ('category_id','name')


 * postgresql://postgres:***@localhost:5432/dvdrental
14 rows affected.


table_name,column_name,data_type
film_category,film_id,smallint
film_category,category_id,smallint
film_category,last_update,timestamp without time zone
film,film_id,integer
film,release_year,integer
rental,rental_id,integer
rental,rental_date,timestamp without time zone
rental,inventory_id,integer
inventory,inventory_id,integer
inventory,film_id,smallint


In [73]:
%%sql

CREATE EXTENSION IF NOT EXISTS tablefunc;


 * postgresql://postgres:***@localhost:5432/dvdrental
Done.


[]


In [77]:
%%sql

-- Predicting demand for the upcoming Q4 2006 season, for the top 5 categories of movie rentals
-- First CTE focuses on joining all necessary tables for rental volume by category
-- Second CTE focuses on layering in biz logic for categorizing movies based on recency

with base_data_collect as (
    select c.name, f.release_year, r.rental_date, r.rental_id
    from category c
    left outer join film_category f_c
        on c.category_id = f_c.category_id
    left outer join film f
        on f_c.film_id = f.film_id
    left outer join inventory i
        on f.film_id = i.film_id
    left outer join rental r
        on i.inventory_id = r.inventory_id
),
biz_logic_applied as (
    select *
        , case when release_year = DATE_PART('year',rental_date) then 'recent_release' 
            else 'old_release' end as recency_category
        , DATE_PART('year',rental_date) || '-' || DATE_PART('month',rental_date) as year_month
    from base_data_collect
),
data_format_prep as (
    select year_month, name, count(distinct rental_id) as num_rentals
    from biz_logic_applied
    where year_month != '2006-2'
    group by 1,2
    order by 1 asc
),
data_pivot as (
    select * from crosstab ('select * from data_format_prep'
                           , 'SELECT DISTINCT name FROM data_format_prep ORDER BY name ASC')
    as sales_summary(store_name varchar, cheese_cheddar int, cheese_parmesan int, wine int, boop int, beep int)


)
select *
from data_pivot
limit 5



 * postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.UndefinedTable) relation "data_format_prep" does not exist
LINE 1: SELECT DISTINCT name FROM data_format_prep ORDER BY name ASC
                                  ^
QUERY:  SELECT DISTINCT name FROM data_format_prep ORDER BY name ASC

[SQL: -- Predicting demand for the upcoming Q4 2006 season, for the top 5 categories of movie rentals
-- First CTE focuses on joining all necessary tables for rental volume by category
-- Second CTE focuses on layering in biz logic for categorizing movies based on recency

with base_data_collect as (
    select c.name, f.release_year, r.rental_date, r.rental_id
    from category c
    left outer join film_category f_c
        on c.category_id = f_c.category_id
    left outer join film f
        on f_c.film_id = f.film_id
    left outer join inventory i
        on f.film_id = i.film_id
    left outer join rental r
        on i.inventory_id = r.inventory_id
),
biz_logic_applied as (
    s
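
The error above is consistent with how `crosstab` works: it receives its source query as a text string and runs it as a separate statement, so CTEs defined in the outer query, such as `data_format_prep`, are not visible to it. One alternative that avoids `crosstab` altogether is conditional aggregation, sketched below against an in-memory SQLite database. The table name, category names and counts are placeholders, not results from dvdrental.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the long-format output of data_format_prep (made-up numbers)
    CREATE TABLE monthly_rentals (year_month TEXT, name TEXT, num_rentals INTEGER);
    INSERT INTO monthly_rentals VALUES
        ('2005-05', 'Action', 10),
        ('2005-05', 'Comedy', 7),
        ('2005-06', 'Action', 12),
        ('2005-06', 'Comedy', 9);
""")

# One output column per category: each CASE keeps only that category's count
pivot_sql = """
    SELECT year_month,
           SUM(CASE WHEN name = 'Action' THEN num_rentals ELSE 0 END) AS action_rentals,
           SUM(CASE WHEN name = 'Comedy' THEN num_rentals ELSE 0 END) AS comedy_rentals
    FROM monthly_rentals
    GROUP BY year_month
    ORDER BY year_month
"""
pivoted = conn.execute(pivot_sql).fetchall()
print(pivoted)
```

Because this is ordinary `SELECT` syntax, the same shape can sit at the end of a CTE chain and read from an earlier CTE directly. The trade-off is that each category column has to be written out by hand.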

In [87]:
%%sql

select * from crosstab ("""with base_data_collect as (
    select c.name, f.release_year, r.rental_date, r.rental_id
    from category c
    left outer join film_category f_c
        on c.category_id = f_c.category_id
    left outer join film f
        on f_c.film_id = f.film_id
    left outer join inventory i
        on f.film_id = i.film_id
    left outer join rental r
        on i.inventory_id = r.inventory_id
),
biz_logic_applied as (
    select *
        , case when release_year = DATE_PART('year',rental_date) then 'recent_release'
            else 'old_release' end as recency_category
        , DATE_PART('year',rental_date) || '-' || DATE_PART('month',rental_date) as year_month
    from base_data_collect
)
    select year_month, name, count(distinct rental_id) as num_rentals
    from biz_logic_applied
    where year_month != '2006-2'
    group by 1,2
    order by 1 asc
"""
                           , 'SELECT DISTINCT name FROM category ORDER BY name ASC limit 5')
    as sales_summary(year_month varchar, cheese_cheddar int, cheese_parmesan int, wine int, boop int, beep int)


 * postgresql://postgres:***@localhost:5432/dvdrental
(psycopg2.errors.UndefinedColumn) column ""with base_data_collect as (
    select c.name, f.release_year," does not exist
LINE 1: select * from crosstab ("""with base_data_collect as (
                                ^

[SQL: select * from crosstab ("""with base_data_collect as (
    select c.name, f.release_year, r.rental_date, r.rental_id
    from category c
    left outer join film_category f_c
        on c.category_id = f_c.category_id
    left outer join film f
        on f_c.film_id = f.film_id
    left outer join inventory i
        on f.film_id = i.film_id
    left outer join rental r
        on i.inventory_id = r.inventory_id
),
biz_logic_applied as (
    select *
        , case when release_year = DATE_PART('year',rental_date) then 'recent_release'
            else 'old_release' end as recency_category
        , DATE_PART('year',rental_date) || '-' || DATE_PART('month',rental_date) as year_month
    from base_data_collect
