# Explore Data Science Academy

## Honour Code

I {Team CW6}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

## Table of Contents

1. [Introduction](#Introduction)

2. [Problem statement](#problem)
	
3. [Importing libraries](#import)

4. [Loading data](#data)
	
5. [Exploratory data analysis](#eda)
	
6. [Data preprocessing](#engineering)
	
7. [Model deployment](#modeling)
		
	7.1. [Support vector machine](#svm_one)

	7.2. [Linear support vector machine](#svm_two)

	7.3. [Multinomial naive bayes](#naive_bayes_one)

	7.4. [Logistic regression](#logistic_one)
		
8. [Model performance](#performance)

	8.1. [Support vector machine](#svm_three)

	8.2. [Linear support vector machine](#svm_four)

	8.3. [Multinomial naive bayes](#naive_bayes_two)

	8.4. [Logistic regression](#logistic_two)
		
9. [Model explanation](#explanation)
	
10. [Saving model as pickle file](#pickle)

11. [Conclusion](#conclusion)

12. [References](#references)

## 1. Introduction

## 2. Problem statement

The goal is to find and recommend the movies that each user is most likely to watch, using the ratings users have already given.

## 3. Importing libraries

In [18]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

## 4. Loading data

In [28]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

df_movies = pd.read_csv('movies.csv')
df_imdb_data = pd.read_csv('imdb_data.csv')

## 5. Exploratory data analysis

In [48]:
df_train.sort_values('rating', ascending=False).head()

Unnamed: 0,userId,movieId,rating,timestamp
3933573,118176,593,5.0,1197630915
7585778,32810,4282,5.0,1111718329
7585749,66818,1096,5.0,931950762
1564921,148022,2064,5.0,945887531
1564920,144508,589,5.0,838570244


In [30]:
# checking the movies we have
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [31]:
# previewing the IMDb metadata (cast, director, runtime, budget, plot keywords)
df_imdb_data.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [32]:
df_train.isna().sum().sum()

0

In [33]:
df_movies.isna().sum().sum()

0

In [34]:
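# fill missing values with 0 for display only; the result is not assigned,
# so df_imdb_data itself is unchanged
df_imdb_data.fillna(0).head()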

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [27]:
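# NOTE: this sums the column values, not the number of missing entries;
# df_imdb_data.isna().sum().sum() would count the missing values instead
df_imdb_data.sum().sum()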

1634261443.0
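
The `budget` column is stored as text such as `"$30,000,000"`, so it cannot be summarised numerically as it stands. A minimal sketch of counting missing values per column and converting the budget strings to numbers follows. The small `sample` frame and the `budget_usd` column name are illustrative assumptions rather than part of the notebook's data, and the sketch assumes every budget is a plain US dollar amount.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_imdb_data with a similar column layout.
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "runtime": [81.0, 104.0, np.nan],
    "budget": ["$30,000,000", None, "$25,000,000"],
})

# Count missing values per column (rather than summing the values themselves).
missing_per_column = sample.isna().sum()

# Strip currency symbols and thousands separators, then convert to numbers.
# Missing or unparseable budgets become NaN.
sample["budget_usd"] = pd.to_numeric(
    sample["budget"].str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",
)

print(missing_per_column)
print(sample[["movieId", "budget_usd"]])
```

Budgets quoted in another currency would need separate handling before a conversion like this is applied to the full file.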

In [55]:
# attach title and genres to every rating (inner join on movieId)
train_movies = pd.merge(df_train, df_movies, on='movieId')
train_movies.sort_values('userId', ascending=True).head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
7291637,1,7939,2.5,1147869183,Through a Glass Darkly (Såsom i en spegel) (1961),Drama
8249351,1,7323,3.5,1147869119,"Good bye, Lenin! (2003)",Comedy|Drama
8328154,1,7940,4.5,1147877967,The Magician (1958),Drama
9765874,1,7937,3.0,1147878055,"Silence, The (Tystnaden) (1963)",Drama
8597507,1,8154,5.0,1147868865,"Dolce Vita, La (1960)",Drama


In [56]:
train_movies.shape

(10000038, 6)
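
`pd.merge` performs an inner join by default, so ratings whose `movieId` is absent from the movie table are dropped without warning, and ratings are duplicated if a `movieId` appears more than once in that table. The sketch below shows one way both cases could be checked. The `ratings` and `movies` frames are made-up miniatures, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of df_train and df_movies.
ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 30],
    "rating": [4.0, 3.5, 5.0],
})
movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A (1999)", "Movie B (2004)"],
})

# validate raises MergeError if a movieId is duplicated in `movies`;
# indicator adds a `_merge` column showing where each row matched.
merged = pd.merge(
    ratings, movies, on="movieId",
    how="left", validate="many_to_one", indicator=True,
)

# Ratings that found no matching movie keep NaN in the movie columns.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With a left join the row count always equals the number of ratings, so any loss of information shows up as unmatched rows instead of a silently smaller frame.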

In [57]:
train_movies.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,10000040.0,10000040.0,10000040.0,10000040.0
mean,81199.09,21389.11,3.533395,1215677000.0
std,46793.59,39195.78,1.061124,226892100.0
min,1.0,1.0,0.5,789652000.0
25%,40510.0,1197.0,3.0,1011742000.0
50%,80914.0,2947.0,3.5,1199019000.0
75%,121579.0,8630.0,4.0,1447242000.0
max,162541.0,209171.0,5.0,1574328000.0
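
The `timestamp` column holds large integers rather than dates. Assuming these are Unix timestamps in seconds (an assumption based on their magnitude, not something the data files state), they could be converted to datetimes for easier inspection. The `rated_at` and `rating_year` column names are illustrative.

```python
import pandas as pd

# Two timestamp values copied from the first rows of df_train shown earlier.
ts = pd.DataFrame({"timestamp": [1197630915, 838570244]})

# Interpret the integers as seconds since 1970-01-01 (assumed unit).
ts["rated_at"] = pd.to_datetime(ts["timestamp"], unit="s")

# Derive the calendar year in which each rating was given.
ts["rating_year"] = ts["rated_at"].dt.year

print(ts)
```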


In [58]:
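# restrict to numeric columns; recent pandas versions raise an error
# when corr() meets the text columns title and genres
train_movies.corr(numeric_only=True)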

Unnamed: 0,userId,movieId,rating,timestamp
userId,1.0,-0.00427,0.002202,-0.001347
movieId,-0.00427,1.0,-0.00894,0.520786
rating,0.002202,-0.00894,1.0,0.010301
timestamp,-0.001347,0.520786,0.010301,1.0


In [59]:
train_movies.isna().sum().sum()

0

## 6. Data preprocessing

In [62]:
train_movies

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller
...,...,...,...,...,...,...
10000033,84146,107912,3.0,1389449965,"Fallen, The (2004)",Action|Drama|War
10000034,72315,190143,2.5,1567628158,Formentera Lady,(no genres listed)
10000035,131116,206347,3.0,1568558126,Nocturne (1946),Crime|Drama|Mystery
10000036,85757,196867,3.5,1563175258,Guys & Balls (2004),Comedy|Romance
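
The `genres` column packs several labels into one pipe-separated string, and the release year is embedded in `title`, so neither can be used directly as a numeric feature. One possible preprocessing step, sketched on a few rows taken from the output above, is to one-hot encode the genres and extract the year with a regular expression. The `year` column name and the regular expression are illustrative assumptions. Titles without a trailing year, such as "Formentera Lady", yield a missing value.

```python
import pandas as pd

# A few title/genre pairs copied from the train_movies output.
sample = pd.DataFrame({
    "title": ["In Bruges (2008)", "Fallen, The (2004)", "Formentera Lady"],
    "genres": [
        "Comedy|Crime|Drama|Thriller",
        "Action|Drama|War",
        "(no genres listed)",
    ],
})

# One indicator column per genre label, split on the pipe character.
genre_dummies = sample["genres"].str.get_dummies(sep="|")

# Capture a four-digit year in parentheses at the end of the title.
sample["year"] = (
    sample["title"]
    .str.extract(r"\((\d{4})\)\s*$", expand=False)
    .astype("float")
)

features = pd.concat([sample[["title", "year"]], genre_dummies], axis=1)
print(features)
```

Note that "(no genres listed)" becomes its own indicator column here. Whether to keep it or treat it as missing is a modelling decision.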


## 7. Model deployment

In [64]:
# features: every column except the target
X = train_movies.drop(columns=['rating'])
# target: the rating given by the user
y = train_movies['rating']

In [65]:
X

Unnamed: 0,userId,movieId,timestamp,title,genres
0,5163,57669,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller
1,87388,57669,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller
2,137050,57669,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller
3,120490,57669,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller
4,50616,57669,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller
...,...,...,...,...,...
10000033,84146,107912,1389449965,"Fallen, The (2004)",Action|Drama|War
10000034,72315,190143,1567628158,Formentera Lady,(no genres listed)
10000035,131116,206347,1568558126,Nocturne (1946),Crime|Drama|Mystery
10000036,85757,196867,1563175258,Guys & Balls (2004),Comedy|Romance


In [66]:
y

0           4.0
1           3.5
2           4.0
3           4.5
4           4.5
           ... 
10000033    3.0
10000034    2.5
10000035    3.0
10000036    3.5
10000037    3.0
Name: rating, Length: 10000038, dtype: float64

In [67]:
X.shape

(10000038, 5)

In [68]:
y.shape

(10000038,)
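
`X` still contains the text columns `title` and `genres`, so most estimators cannot consume it as it stands. Before fitting anything more elaborate, it can help to have a baseline to compare against. The sketch below is not the notebook's model. It assumes a simple random hold-out split and scores two naive predictors, the global mean rating and the per-movie mean rating, by RMSE on made-up data. All names in it are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up ratings with the same key columns as train_movies.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "userId": rng.integers(1, 50, size=500),
    "movieId": rng.integers(1, 20, size=500),
    "rating": rng.choice(np.arange(0.5, 5.5, 0.5), size=500),
})

# Random hold-out split, roughly 80% training and 20% validation.
mask = rng.random(len(toy)) < 0.8
train, valid = toy[mask], toy[~mask]


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Baseline 1: predict the global mean training rating for every row.
global_mean = train["rating"].mean()
rmse_global = rmse(valid["rating"], np.full(len(valid), global_mean))

# Baseline 2: predict each movie's mean training rating,
# falling back to the global mean for movies unseen in training.
movie_means = train.groupby("movieId")["rating"].mean()
pred_movie = valid["movieId"].map(movie_means).fillna(global_mean)
rmse_movie = rmse(valid["rating"], pred_movie)

print(f"global mean RMSE: {rmse_global:.3f}")
print(f"per-movie mean RMSE: {rmse_movie:.3f}")
```

On random data like this the two scores are expected to be similar. The comparison only becomes informative when run on the real ratings.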

## 8. Model performance

## 9. Model explanation

## 10. Conclusion

## 11. References