# Lab | Making predictions with logistic regression

In this lab, you will be using the [Sakila](https://dev.mysql.com/doc/sakila/en/) database of movie rentals.

To optimize our inventory, we want to know which films will be rented next month, so we have been asked to build a model that predicts this.


### Instructions

1. Create a query or queries to extract the information you think may be relevant for building the prediction model. It should include some film features and some rental features. Use the data from 2005.
2. Create a query to get the list of films and a boolean indicating whether each film was rented last month (May 2005). This will be our target variable.
3. Read the data into a pandas DataFrame.
4. Analyze extracted features and transform them. You may need to encode some categorical variables, or scale numerical variables.
5. Create a logistic regression model to predict this variable from the cleaned data.
6. Evaluate the results.
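
Steps 4 to 6 can be sketched as a single scikit-learn pipeline. This is a minimal sketch on a small synthetic frame, not the lab's solution: the toy data, the split into numeric and categorical columns, and the helper name `build_model` are assumptions made for illustration, and the resulting score says nothing about the Sakila data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the extracted film features (NOT real Sakila rows).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "length": rng.integers(46, 186, size=n),
    "replacement_cost": rng.choice([9.99, 19.99, 29.99], size=n),
    "rating": rng.choice(["G", "PG", "PG-13", "R", "NC-17"], size=n),
    "rental_rate": rng.choice([0.99, 2.99, 4.99], size=n),
})
# Hypothetical target: random labels, only so the pipeline has something to fit.
toy["rented_may"] = rng.integers(0, 2, size=n)

numeric_cols = ["length", "replacement_cost"]
categorical_cols = ["rating", "rental_rate"]


def build_model():
    """Scale numeric columns, one-hot encode categorical ones, then fit a classifier."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


X = toy.drop(columns="rented_may")
y = toy["rented_may"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = build_model().fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on held-out toy rows: {accuracy:.2f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only, which avoids leaking information from the test split.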


## Import libraries and get database password

In [3]:
import pymysql
from sqlalchemy import create_engine
import pandas as pd
import getpass  # To get the password without showing the input
password = getpass.getpass()

········
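
A password that contains characters such as `@`, `:` or `/` can break a connection URL that is assembled by plain string concatenation. The following is a minimal sketch of one way to guard against that, using `urllib.parse.quote_plus` from the standard library. The example password and the helper name `build_connection_string` are made up for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(password, user="root", host="localhost", database="sakila"):
    """Build a MySQL/pymysql URL with the password percent-encoded."""
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}/{database}"


# Hypothetical password, chosen only because it contains reserved characters.
example = build_connection_string("p@ss:word/1")
print(example)
```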


## Get database data through SQL

In [21]:
# Get the data: one row per film rented in 2005, plus a flag for May rentals
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
engine = create_engine(connection_string)
query = '''
    select title, rental_duration, rental_rate, length, replacement_cost,
           rating, category, rented_may
    from (
        -- films with at least one rental in May 2005
        select distinct film_id, True as rented_may
        from rental
        join inventory using (inventory_id)
        where year(rental_date) = 2005 and month(rental_date) = 5
    ) t1
    right join (
        -- film features for every film rented in 2005
        select distinct film_id, title, rental_duration, rental_rate, length,
               replacement_cost, rating, category.name as category
        from rental
        join inventory using (inventory_id)
        join film using (film_id)
        join film_category using (film_id)
        join category using (category_id)
        where year(rental_date) = 2005
    ) t2 using (film_id)
    -- films without a May rental have no match in t1, so rented_may is NULL
    order by title asc'''

data = pd.read_sql_query(query, engine)
data.head(30)

| | title | rental_duration | rental_rate | length | replacement_cost | rating | category | rented_may |
|---|---|---|---|---|---|---|---|---|
| 0 | ACADEMY DINOSAUR | 6 | 0.99 | 86 | 20.99 | PG | Documentary | 1.0 |
| 1 | ACE GOLDFINGER | 3 | 4.99 | 48 | 12.99 | G | Horror | NaN |
| 2 | ADAPTATION HOLES | 7 | 2.99 | 50 | 18.99 | NC-17 | Documentary | 1.0 |
| 3 | AFFAIR PREJUDICE | 5 | 2.99 | 117 | 26.99 | G | Horror | 1.0 |
| 4 | AFRICAN EGG | 6 | 2.99 | 130 | 22.99 | G | Family | 1.0 |
| 5 | AGENT TRUMAN | 3 | 2.99 | 169 | 17.99 | PG | Foreign | 1.0 |
| 6 | AIRPLANE SIERRA | 6 | 4.99 | 62 | 28.99 | PG-13 | Comedy | NaN |
| 7 | AIRPORT POLLOCK | 6 | 4.99 | 54 | 15.99 | R | Horror | 1.0 |
| 8 | ALABAMA DEVIL | 3 | 2.99 | 114 | 21.99 | PG-13 | Horror | NaN |
| 9 | ALADDIN CALENDAR | 6 | 4.99 | 63 | 24.99 | NC-17 | Sports | NaN |
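
The missing `rented_may` values in the output above come from the outer join: a film with no May rental has no matching row in the first subquery, so the column is NULL. One way to get an explicit 0/1 target straight from SQL is conditional aggregation. The sketch below shows the idea on a tiny in-memory SQLite table. The table `toy_rental` and its rows are invented, and SQLite's `strftime` stands in for MySQL's `month()` and `year()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_rental (film_id INTEGER, rental_date TEXT)")
conn.executemany(
    "INSERT INTO toy_rental VALUES (?, ?)",
    [
        (1, "2005-05-24"),  # rented in May
        (1, "2005-06-15"),
        (2, "2005-07-08"),  # never rented in May
        (3, "2005-05-30"),  # rented in May
    ],
)

# MAX over a 0/1 flag is 1 if any rental of the film falls in May, else 0.
rows = conn.execute(
    """
    SELECT film_id,
           MAX(CASE WHEN strftime('%m', rental_date) = '05' THEN 1 ELSE 0 END) AS rented_may
    FROM toy_rental
    WHERE strftime('%Y', rental_date) = '2005'
    GROUP BY film_id
    ORDER BY film_id
    """
).fetchall()
conn.close()
print(rows)
```

Because every film gets exactly one row with a definite flag, no outer join and no later imputation of missing values is needed with this pattern.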


In [1]:
# Would need to be binned into categories; this column will probably be dropped
data['replacement_cost'].value_counts()


In [23]:
# Only five distinct values: treat as a categorical variable
data['rental_duration'].value_counts()

6    203
3    197
4    194
5    186
7    178
Name: rental_duration, dtype: int64

In [24]:
# Only three distinct values: treat as a categorical variable
data['rental_rate'].value_counts()

0.99    326
4.99    320
2.99    312
Name: rental_rate, dtype: int64
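
With only a handful of distinct values each, `rental_duration` and `rental_rate` can be treated as categories and one-hot encoded. The following is a minimal sketch with `pd.get_dummies` on a made-up three-row frame; the rows are illustrative, not query output.

```python
import pandas as pd

# Illustrative rows only; the real frame comes from the SQL query.
toy = pd.DataFrame({
    "rental_duration": [6, 3, 7],
    "rental_rate": [0.99, 4.99, 2.99],
})

# Cast to string first so the numbers are handled as labels, not quantities.
as_labels = toy.astype(str)
encoded = pd.get_dummies(as_labels, columns=["rental_duration", "rental_rate"])
print(encoded.columns.tolist())
```

Each original column becomes one indicator column per distinct value. For a linear model it may be worth passing `drop_first=True` to avoid perfectly collinear indicators.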

In [25]:
# Bin the film lengths using the edges 0, 90, 120, 180, max
data['length'].value_counts()

85     17
179    13
84     13
112    12
122    11
       ..
94      3
96      2
55      2
66      2
95      2
Name: length, Length: 140, dtype: int64
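
The binning noted in the cell above can be sketched with `pd.cut`. The edges 0, 90, 120, 180 and the maximum come from that comment; the bin labels and the sample lengths are assumptions for illustration.

```python
import pandas as pd

# Sample lengths in minutes, invented for the example.
lengths = pd.Series([48, 86, 90, 117, 130, 169, 185], name="length")

# Edges follow the comment: 0, 90, 120, 180, max.
# Infinity is used as the last edge so that any maximum fits.
edges = [0, 90, 120, 180, float("inf")]
labels = ["short", "medium", "long", "very_long"]  # hypothetical names

# Intervals are closed on the right by default, so 90 falls into "short".
length_bin = pd.cut(lengths, bins=edges, labels=labels)
print(length_bin.tolist())
```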

In [27]:
data.isna().sum()

title                 0
rental_duration       0
rental_rate           0
length                0
replacement_cost      0
rating                0
category              0
rented_may          272
dtype: int64
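
Given how the query is built, the 272 missing `rented_may` values appear to be films with no rental in May 2005, so one reasonable choice is to fill them with 0. The following is a minimal sketch on a toy column; the four values are invented.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the query output: 1.0 for a May rental, NaN otherwise.
toy = pd.DataFrame({"rented_may": [1.0, np.nan, 1.0, np.nan]})

# Treat "no matching May rental" as 0 and store the target as integers.
toy["rented_may"] = toy["rented_may"].fillna(0).astype(int)
print(toy["rented_may"].tolist())
```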