# Lab | Making predictions with logistic regression

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals.

To optimize our inventory, we would like to know which films will be rented next month, so we are asked to build a model that predicts it.


### Instructions

1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.
2. Create a query to get the list of films and a boolean indicating whether each film was rented last month (May 2005). This will be our target variable.
3. Read the data into a pandas DataFrame.
4. Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.
5. Create a logistic regression model to predict this variable from the cleaned data.
6. Evaluate the results.


## Import libraries and get database password

In [60]:
# import pymysql
# from sqlalchemy import create_engine
import pandas as pd  # needed below by pd.read_csv, even with the database connection disabled
# import getpass  # To get the password without showing the input
# password = getpass.getpass()

## Get database data through sql

In [61]:
# # get the data
# connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
# engine = create_engine(connection_string)
# query = ''' select title, rental_duration, rental_rate, length, replacement_cost, rating, category, rented_may
#             from(
#             select film_id,
#             case
#             when month(rental_date) = 5 then True
#             else False
#             end as rented_may
#             from rental
#             join inventory using(inventory_id)
#             join film using(film_id)
#             join film_category using(film_id)
#             join category using (category_id)
#             where year(rental_date) = 2005 and month(rental_date) = 5
#             group by film_id) t1
#             right join(
#             select rental_date, title, film_id, rental_duration, rental_rate, length, replacement_cost, rating, category.name as category
#             from rental
#             join inventory using(inventory_id)
#             join film using(film_id)
#             join film_category using(film_id)
#             join category using (category_id)
#             where year(rental_date) = 2005
#             group by film_id
#             ) t2
#             using (film_id)
#             order by title asc'''

# data = pd.read_sql_query(query, engine)
# data.head(30)

# This did not work on my Mac and I could not fix it, so the data is loaded from a CSV file instead.
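The commented-out query leaves `rented_may` empty for films with no May rental. As a hedged alternative, the flag can be computed directly by aggregating a `CASE` expression per film. The sketch below runs against an in-memory SQLite database with a hypothetical three-table miniature of Sakila (two films, two rentals), not against the real MySQL schema, so the date handling is an assumption that would need adapting.

```python
import sqlite3

# Hypothetical miniature of the Sakila tables, just enough to show the join shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE inventory (inventory_id INTEGER PRIMARY KEY, film_id INTEGER);
    CREATE TABLE rental (rental_id INTEGER PRIMARY KEY, rental_date TEXT, inventory_id INTEGER);

    INSERT INTO film VALUES (1, 'ACADEMY DINOSAUR'), (2, 'ACE GOLDFINGER');
    INSERT INTO inventory VALUES (10, 1), (20, 2);
    INSERT INTO rental VALUES
        (100, '2005-05-25 10:00:00', 10),
        (101, '2005-07-01 09:30:00', 20);
""")

# One row per film; rented_may is 1 if any rental falls in May 2005, else 0.
query = """
    SELECT f.title,
           MAX(CASE WHEN r.rental_date >= '2005-05-01'
                     AND r.rental_date <  '2005-06-01' THEN 1 ELSE 0 END) AS rented_may
    FROM film AS f
    LEFT JOIN inventory AS i ON i.film_id = f.film_id
    LEFT JOIN rental AS r ON r.inventory_id = i.inventory_id
    GROUP BY f.film_id, f.title
    ORDER BY f.title
"""
rows = conn.execute(query).fetchall()
print(rows)
```

With this shape the target column comes back as 0/1 with no missing values, so no later imputation would be needed.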


In [90]:
data = pd.read_csv('movie_db.csv')
data

Unnamed: 0.1,Unnamed: 0,title,rental_duration,rental_rate,length,replacement_cost,rating,category,rented_may
0,0,ACADEMY DINOSAUR,6,0.99,86,20.99,PG,Documentary,1.0
1,1,ACE GOLDFINGER,3,4.99,48,12.99,G,Horror,
2,2,ADAPTATION HOLES,7,2.99,50,18.99,NC-17,Documentary,1.0
3,3,AFFAIR PREJUDICE,5,2.99,117,26.99,G,Horror,1.0
4,4,AFRICAN EGG,6,2.99,130,22.99,G,Family,1.0
...,...,...,...,...,...,...,...,...,...
953,953,YOUNG LANGUAGE,6,0.99,183,9.99,G,Documentary,
954,954,YOUTH KICK,4,0.99,179,14.99,NC-17,Music,
955,955,ZHIVAGO CORE,6,0.99,105,10.99,NC-17,Horror,1.0
956,956,ZOOLANDER FICTION,5,2.99,101,28.99,R,Children,1.0


In [63]:
data.dtypes

Unnamed: 0            int64
title                object
rental_duration       int64
rental_rate         float64
length                int64
replacement_cost    float64
rating               object
category             object
rented_may          float64
dtype: object

In [64]:
# replacement_cost has 21 distinct values and would need to be binned to be useful,
# so we will drop it
data['replacement_cost'].value_counts()

20.99    55
21.99    55
22.99    54
29.99    52
12.99    52
27.99    51
13.99    50
14.99    48
11.99    47
17.99    46
10.99    46
26.99    45
19.99    45
23.99    44
25.99    41
9.99     40
28.99    40
18.99    40
24.99    37
16.99    36
15.99    34
Name: replacement_cost, dtype: int64

In [65]:
# only five distinct values, so treat rental_duration as categorical
data['rental_duration'].value_counts()

6    203
3    197
4    194
5    186
7    178
Name: rental_duration, dtype: int64

In [66]:
# only three distinct values, so treat rental_rate as categorical
data['rental_rate'].value_counts()

0.99    326
4.99    320
2.99    312
Name: rental_rate, dtype: int64

In [67]:
# length will be binned using the edges 0, 90, 120, 150, max
data['length'].describe()

count    958.000000
mean     115.490605
std       40.471844
min       46.000000
25%       80.250000
50%      114.000000
75%      150.000000
max      185.000000
Name: length, dtype: float64

In [68]:
data.isna().sum()

Unnamed: 0            0
title                 0
rental_duration       0
rental_rate           0
length                0
replacement_cost      0
rating                0
category              0
rented_may          272
dtype: int64

In [69]:
# A NaN means the film was not rented in May,
# so we can fill the NaNs with 0 (1 means rented)

In [70]:
data['rented_may'] = data['rented_may'].fillna(0)
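A minimal sketch of the same imputation on a hypothetical three-row frame, with one optional extra step: casting the filled column to `int` so the target reads as a clean 0/1 label. The toy values are made up for illustration. In the real data, 272 of the 958 films are missing a May rental, so roughly 28% of the rows become the negative class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rented_may": [1.0, np.nan, 1.0],
})

# NaN means "no May rental", so fill with 0 and cast the label to int
toy["rented_may"] = toy["rented_may"].fillna(0).astype(int)

# Class balance of the target
counts = toy["rented_may"].value_counts()
print(counts)
```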

In [71]:
data['category'].value_counts()

Sports         73
Family         67
Foreign        67
Animation      64
Documentary    63
Action         61
Drama          61
New            60
Sci-Fi         59
Games          58
Children       58
Comedy         56
Classics       54
Horror         53
Travel         53
Music          51
Name: category, dtype: int64

In [72]:
data.head()

Unnamed: 0.1,Unnamed: 0,title,rental_duration,rental_rate,length,replacement_cost,rating,category,rented_may
0,0,ACADEMY DINOSAUR,6,0.99,86,20.99,PG,Documentary,1.0
1,1,ACE GOLDFINGER,3,4.99,48,12.99,G,Horror,0.0
2,2,ADAPTATION HOLES,7,2.99,50,18.99,NC-17,Documentary,1.0
3,3,AFFAIR PREJUDICE,5,2.99,117,26.99,G,Horror,1.0
4,4,AFRICAN EGG,6,2.99,130,22.99,G,Family,1.0


In [73]:
# columns to encode as categorical: title, category, rating, rental_rate, rental_duration
# columns to drop: replacement_cost, Unnamed: 0
# columns to bin: length

# target column: rented_may

In [81]:
# columns to drop: replacement_cost, Unnamed: 0
data = data.drop(['replacement_cost'], axis=1)
data.head()

Unnamed: 0,title,rental_duration,rental_rate,length,rating,category,rented_may
0,ACADEMY DINOSAUR,6,0.99,short,PG,Documentary,1.0
1,ACE GOLDFINGER,3,4.99,short,G,Horror,0.0
2,ADAPTATION HOLES,7,2.99,short,NC-17,Documentary,1.0
3,AFFAIR PREJUDICE,5,2.99,normal,G,Horror,1.0
4,AFRICAN EGG,6,2.99,long,G,Family,1.0


In [75]:
data = data.drop(['Unnamed: 0'], axis=1)
data.head()

Unnamed: 0,title,rental_duration,rental_rate,length,replacement_cost,rating,category,rented_may
0,ACADEMY DINOSAUR,6,0.99,86,20.99,PG,Documentary,1.0
1,ACE GOLDFINGER,3,4.99,48,12.99,G,Horror,0.0
2,ADAPTATION HOLES,7,2.99,50,18.99,NC-17,Documentary,1.0
3,AFFAIR PREJUDICE,5,2.99,117,26.99,G,Horror,1.0
4,AFRICAN EGG,6,2.99,130,22.99,G,Family,1.0


In [76]:
#  columns to categorical encode: title, category, rating, rental_rate, rental_duration

In [77]:
data['rental_rate'] = data['rental_rate'].astype(object)
data['rental_duration'] = data['rental_duration'].astype(object)

In [82]:
data.dtypes

title                object
rental_duration      object
rental_rate          object
length             category
rating               object
category             object
rented_may          float64
dtype: object

In [79]:
#  columns to bin:length

In [80]:
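# Bin length (in minutes) into four right-inclusive intervals:
# (0, 90], (90, 120], (120, 150], (150, max]
lbl = ['short', 'normal', 'long', 'extended']
data['length'] = pd.cut(data['length'], [0, 90, 120, 150, int(data['length'].max())], labels=lbl)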
data['length'].value_counts()

short       311
extended    233
normal      207
long        207
Name: length, dtype: int64
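To make the bin edges concrete, here is a small sketch of how `pd.cut` assigns values that sit exactly on an edge. By default the intervals are closed on the right, so a 90-minute film counts as `short` and a 150-minute film as `long`. The sample lengths are hypothetical, and the top edge of 185 is taken from the `describe()` output rather than computed.

```python
import pandas as pd

# Hypothetical lengths chosen to sit on and around the bin edges
lengths = pd.Series([46, 90, 91, 120, 150, 151, 185])
bins = [0, 90, 120, 150, 185]
labels = ["short", "normal", "long", "extended"]

# Intervals are right-inclusive by default: (0, 90], (90, 120], (120, 150], (150, 185]
binned = pd.cut(lengths, bins=bins, labels=labels)
print(binned.tolist())
```

Passing `right=False` would flip the convention, so it is worth checking which side the edges fall on before interpreting the counts.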

In [85]:
# X-y split

y = data['rented_may']
X = data.drop(['rented_may'], axis = 1)

In [87]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression  # the lab asks for logistic, not linear, regression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # classification metrics
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', None)

In [88]:
# train test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)
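Since the target is imbalanced (272 of the 958 films have no May rental), one option worth considering is a stratified split, which keeps the class proportions similar in the train and test sets. The sketch below uses hypothetical toy arrays, not the notebook's `X` and `y`; only the `stratify` argument differs from the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 70% positives, standing in for X and y
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy keeps the share of each class close to 70/30 in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1337, stratify=y_toy
)
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```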


In [None]:
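# Note: rental_rate and rental_duration were cast to object and length was binned,
# so X has no numeric columns left and this selection is expected to be empty
X_num = X.select_dtypes(np.number)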
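The remaining lab steps (encode, fit, evaluate) could look roughly like the sketch below. It is not the notebook's actual continuation. It uses a hypothetical toy frame with randomly generated categories and labels, so the accuracy it prints should hover near chance. `pd.get_dummies` is one of several possible encoders, and `title` is left out on the assumption that a near-unique identifier carries no signal that generalizes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy features mimicking the cleaned categorical columns
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "rental_rate": rng.choice(["0.99", "2.99", "4.99"], size=n),
    "length": rng.choice(["short", "normal", "long", "extended"], size=n),
    "rating": rng.choice(["G", "PG", "PG-13", "R", "NC-17"], size=n),
})
# Random 0/1 labels, so no real relationship exists to be learned
y_toy = pd.Series(rng.integers(0, 2, size=n), name="rented_may")

# One-hot encode; drop_first avoids one redundant column per feature
X_enc = pd.get_dummies(toy, drop_first=True)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_enc, y_toy, test_size=0.2, random_state=1337
)

# Fit the classifier and score it on the held-out rows
model = LogisticRegression(max_iter=1000)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred, labels=[0, 1])
print(acc)
print(cm)
```

On the real data, accuracy alone can mislead because about 72% of films were rented in May; the confusion matrix shows whether the model simply predicts the majority class.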