# Movie Recommendation System


![family_movie](Images/family_movie.jpeg)


You pick the movie, I'll choose the restaurant...

## 1. Project Overview

This project aims to provide the top 5 movie recommendations for any user. To do that, we base our system on a few data files that contain user IDs, movie IDs, ratings, tags, and genres. We will use collaborative filtering, drawing on both user-based and content-based filtering, to create the system.

### The Data

#### Source Data

This project uses the MovieLens dataset from the [GroupLens](https://grouplens.org/datasets/movielens/latest/research) lab at the University of Minnesota, which can be found in the `data` folder in this GitHub repository.


### Data Inspection
Our data spans four separate CSV files. There is also a README file that may tell us how the files relate to each other. Let's open it to gain some insight.

In [1]:
file_path = 'data/README.txt'

with open(file_path) as file:
    print(file.read())

Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for down

### High-level summary from our README file


#### Ratings Data File Structure (ratings.csv)

All ratings are contained in the file `ratings.csv`. Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

    userId,movieId,rating,timestamp
    
#### Tags Data File Structure (tags.csv)

All tags are contained in the file `tags.csv`. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following format:

    userId,movieId,tag,timestamp

#### Movies Data File Structure (movies.csv)

Movie information is contained in the file `movies.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,title,genres
 
#### Links Data File Structure (links.csv)

Identifiers that can be used to link to other sources of movie data are contained in the file `links.csv`. Each line of this file after the header row represents one movie, and has the following format:

    movieId,imdbId,tmdbId
    
### Data Inspection
Let's see if we can verify some of this data. I'm going to import these files into pandas one by one to make sure the data matches the description.

In [2]:
import pandas as pd
import numpy as np

In [3]:
ratings_df = pd.read_csv('data/ratings.csv')
ratings_df.head(5)

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


In [4]:
ratings_df.describe()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| count | 100836.0 | 100836.0 | 100836.0 | 100836.0 |
| mean | 326.127564 | 19435.295718 | 3.501557 | 1205946000.0 |
| std | 182.618491 | 35530.987199 | 1.042529 | 216261000.0 |
| min | 1.0 | 1.0 | 0.5 | 828124600.0 |
| 25% | 177.0 | 1199.0 | 3.0 | 1019124000.0 |
| 50% | 325.0 | 2991.0 | 3.5 | 1186087000.0 |
| 75% | 477.0 | 8122.0 | 4.0 | 1435994000.0 |
| max | 610.0 | 193609.0 | 5.0 | 1537799000.0 |


In [5]:
ratings_df.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [6]:
unique_movies = list(ratings_df['movieId'].unique())
print('Number of movies: ', len(unique_movies), '\n')

unique_users = list(ratings_df['userId'].unique())
print('Number of users: ', len(unique_users))


Number of movies:  9724 

Number of users:  610


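So, we have confirmed that there are no null values, that there are 100,836 movie ratings, and that there are 610 users. All of our ratings are between 0.5 and 5.0. We also see 9,724 distinct movies in the ratings file, slightly fewer than the 9,742 the README lists, which suggests a handful of movies have no ratings. Otherwise this looks promising and matches our README. Let's look at the tags.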
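The README also states a date range for the data, and the `timestamp` column can be checked against it. The sketch below assumes the timestamps are Unix epoch seconds in UTC, as the MovieLens README describes, and uses the rounded minimum and maximum values from the `describe()` output. The helper name `epoch_to_date` is made up for illustration.

```python
from datetime import datetime, timezone


def epoch_to_date(seconds):
    """Convert Unix epoch seconds to a UTC calendar date."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# Rounded min/max timestamps as displayed by ratings_df.describe()
first_rating = epoch_to_date(828124600)
last_rating = epoch_to_date(1537799000)

print(first_rating, last_rating)
```

Both dates fall on the boundaries the README gives (March 29, 1996 and September 24, 2018).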

In [7]:
tags_df = pd.read_csv('data/tags.csv')
tags_df.head(5)

| | userId | movieId | tag | timestamp |
|---|---|---|---|---|
| 0 | 2 | 60756 | funny | 1445714994 |
| 1 | 2 | 60756 | Highly quotable | 1445714996 |
| 2 | 2 | 60756 | will ferrell | 1445714992 |
| 3 | 2 | 89774 | Boxing story | 1445715207 |
| 4 | 2 | 89774 | MMA | 1445715200 |


In [8]:
tags_df.describe()

| | userId | movieId | timestamp |
|---|---|---|---|
| count | 3683.0 | 3683.0 | 3683.0 |
| mean | 431.149335 | 27252.013576 | 1320032000.0 |
| std | 158.472553 | 43490.558803 | 172102500.0 |
| min | 2.0 | 1.0 | 1137179000.0 |
| 25% | 424.0 | 1262.5 | 1137521000.0 |
| 50% | 474.0 | 4454.0 | 1269833000.0 |
| 75% | 477.0 | 39263.0 | 1498457000.0 |
| max | 610.0 | 193565.0 | 1537099000.0 |


In [9]:
tags_df.isna().sum()

userId       0
movieId      0
tag          0
timestamp    0
dtype: int64

In [10]:
tags_df['tag'].value_counts()

In Netflix queue       131
atmospheric             36
thought-provoking       24
superhero               24
surreal                 23
                      ... 
cia                      1
con men                  1
great humor              1
immigration              1
nonlinear narrative      1
Name: tag, Length: 1589, dtype: int64

Okay, this looks good. Our tags file contains user IDs up to 610, a maximum movie ID of 193565, and no null values. We can already see a few trends in the tags: "In Netflix queue", "atmospheric", "thought-provoking", and "superhero" are among the most popular. Before moving on to the movies DataFrame, let's look at the unique tags.

In [11]:
unique_items = list(tags_df['tag'].unique())
len(unique_items)
unique_items[0:20]

['funny',
 'Highly quotable',
 'will ferrell',
 'Boxing story',
 'MMA',
 'Tom Hardy',
 'drugs',
 'Leonardo DiCaprio',
 'Martin Scorsese',
 'way too long',
 'Al Pacino',
 'gangster',
 'mafia',
 'Mafia',
 'holocaust',
 'true story',
 'twist ending',
 'Anthony Hopkins',
 'courtroom drama',
 'britpop']

Among the first 20 unique tags, we can see a discrepancy between 'mafia' and 'Mafia', so we will likely need to convert the tags to lowercase. The same inconsistency shows up with actors' names: some appear in lowercase ("will ferrell") while others are capitalized ("Tom Hardy").
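To make the effect of case normalization concrete, here is a small dependency-free sketch. The sample list is a hand-picked subset of the tags shown above, and `normalize_tag` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def normalize_tag(tag):
    """Lowercase a tag and trim surrounding whitespace."""
    return tag.strip().lower()


# A few tags copied from the unique-tag listing above
sample_tags = ['mafia', 'Mafia', 'Tom Hardy', 'will ferrell', 'gangster']

raw_counts = Counter(sample_tags)
normalized_counts = Counter(normalize_tag(t) for t in sample_tags)

# 'mafia' and 'Mafia' collapse into a single key after normalization
print(len(raw_counts), len(normalized_counts))
```

Fewer distinct tags means fewer one-hot columns later on, which matters when the tag vocabulary is already large.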

In [12]:
movies_df = pd.read_csv('data/movies.csv')
movies_df.tail(5)

Unnamed: 0,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [13]:
movies_df.isna().sum()

movieId    0
title      0
genres     0
dtype: int64

Again, this looks good. There are 9,742 movies with no null values, which is exactly what the README said. We're not going to concern ourselves with the links file for now, so we'll leave it be.
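The ratings file contained 9,724 distinct movie IDs, while `movies.csv` has 9,742 rows. One way to account for the gap is to list the IDs that appear in the catalog but were never rated. The sketch below uses toy IDs rather than the real data, and `unrated_movie_ids` is a hypothetical helper.

```python
def unrated_movie_ids(catalog_ids, rated_ids):
    """Return the sorted IDs that are in the catalog but have no ratings."""
    return sorted(set(catalog_ids) - set(rated_ids))


# Toy IDs for illustration only (not taken from the dataset)
catalog = [1, 2, 3, 4, 5]
rated = [1, 1, 3, 5, 5]

missing = unrated_movie_ids(catalog, rated)
print(missing)
```

On the real data this could be called as `unrated_movie_ids(movies_df['movieId'], ratings_df['movieId'])`. If every rated movie also appears in `movies.csv`, the result should contain 9742 - 9724 = 18 IDs.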

### Data Approach
Ultimately we'd like to combine these three data structures into one DataFrame. We want to model this using linear regression: that is, we want to predict which user-movie pairings are most likely to receive the highest ratings. We'll start with the ratings DataFrame and combine the others as we go.

A few things to note. We probably don't need the timestamp, as we're not interested in the time-series aspect of the data. We also know we'll have to one-hot encode the genre information as well as the tags. The genre information is limited to 19 categories; the tags, on the other hand, have 1,589 unique values. We might be able to clean some of these up, but that is still quite a lot.

#### Dropping timestamp.
I will drop the timestamp from each of the `ratings_df` and `tags_df`.

In [14]:
ratings = ratings_df.drop('timestamp', axis = 1)
tags = tags_df.drop('timestamp', axis = 1)

#### Lowercase
As we indicated above, we need to convert the tags to lowercase. We also strip a few stray trailing characters (`"`, `!`, `@`, `#`, `$`).

In [15]:
# lowercase each tag, then strip trailing quote and punctuation characters
tags['tag'] = tags['tag'].str.lower().str.rstrip('"!@#$')

#### One-hot encoding
Before we merge the files, let's one-hot encode the `tag` column in the `tags` DataFrame. The `genres` column in the `movies` DataFrame will need the same treatment.

In [16]:
tags_ohe = pd.get_dummies(tags, columns = ['tag'], prefix='', prefix_sep='')
tags_ohe

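| | userId | movieId | "artsy | 06 oscar nominated best movie - animation | 1900s | 1920s | 1950s | 1960s | 1970s | 1980s | ... | world war i | world war ii | writing | wrongful imprisonment | wry | younger men | zither | zoe kazan | zombies | zooey deschanel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 60756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 60756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 60756 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 89774 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 89774 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3678 | 606 | 7382 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3679 | 606 | 7936 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3680 | 610 | 3265 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3681 | 610 | 3265 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |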
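The `genres` column was also slated for one-hot encoding, but each movie can carry several pipe-separated genres, so a plain `get_dummies` on the raw string would treat every combination as its own category. A minimal plain-Python sketch of the multi-hot idea follows. The function name `multi_hot_genres` is hypothetical, and the sample strings are copied from the `movies_df.tail(5)` output.

```python
def multi_hot_genres(genre_strings, sep='|'):
    """Turn separator-delimited genre strings into 0/1 indicator rows."""
    split_rows = [g.split(sep) for g in genre_strings]
    # Sort the genre names so the column order is deterministic
    columns = sorted({genre for row in split_rows for genre in row})
    matrix = [[1 if col in row else 0 for col in columns] for row in split_rows]
    return columns, matrix


# Genre strings copied from the movies_df.tail(5) output
genres = ['Action|Animation|Comedy|Fantasy', 'Animation|Comedy|Fantasy', 'Drama']
columns, matrix = multi_hot_genres(genres)

print(columns)
for row in matrix:
    print(row)
```

In pandas, `movies_df['genres'].str.get_dummies(sep='|')` should produce the same kind of indicator columns in one call.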


### Modeling Approach
To create our baseline model, we're going to use the `surprise` module. We will compare SVD and a variety of KNN-based methods within `surprise` to determine which is the most accurate for our dataset. For consistency's sake, we will use RMSE (root mean squared error) as the metric.
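As a reminder of what the metric measures, here is a small hand-rolled version of RMSE, alongside MAE (which `surprise` also reports). The ratings below are made-up numbers for illustration. In the notebook itself, `surprise` computes these for us.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical true and predicted ratings
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625
```

RMSE is never smaller than MAE on the same predictions, because squaring gives the larger errors more weight.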

#### Reading our Dataset
To begin, we will load our data into the `surprise` dataset format. This will make the subsequent modeling a little smoother.

In [17]:
#import the relevant item from surprise
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV

In [18]:
#read in dataset to surprise format
from surprise import Reader, Dataset
reader = Reader()
data = Dataset.load_from_df(ratings,reader)

In [19]:
#check that the data loaded properly and build a full trainset
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724


This matches our original check, so it appears we've loaded the data successfully.

#### Model-Based Methods (Matrix Factorization) - SVD with surprise module
Below we will use the `surprise` library to create an SVD model with tuned hyperparameters. We will use GridSearchCV for this.
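For intuition about what `n_factors` controls: in the biased matrix-factorization model that the `surprise` documentation describes for `SVD`, a predicted rating is the global mean plus a user bias, an item bias, and the dot product of two latent vectors of length `n_factors`. The sketch below is a toy illustration with made-up numbers, not values learned from our data. The function name and the clipping bounds are assumptions.

```python
def predict_rating(global_mean, user_bias, item_bias,
                   user_factors, item_factors, low=0.5, high=5.0):
    """Biased matrix-factorization prediction, clipped to the rating scale."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + interaction
    # Clip to the assumed rating scale of this dataset (0.5 to 5.0)
    return min(high, max(low, raw))


# Hypothetical values with n_factors = 2
pred = predict_rating(3.5, 0.2, -0.1, [0.5, -1.0], [1.0, 0.5])
print(round(pred, 2))

# A raw score above the scale is clipped to the maximum
clipped = predict_rating(3.5, 1.0, 1.0, [1.0], [2.0])
print(clipped)
```

One related detail that may be worth checking: `Reader()` defaults to a 1-5 rating scale, while this dataset's ratings start at 0.5, so passing `rating_scale=(0.5, 5.0)` to `Reader` would match the data more closely.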

In [20]:
## we will set up a SVD model with appropriate hyperparameters.

#established some initial hyperparameters
params = {'n_factors': [20, 50, 100],
         'reg_all': [0.02, 0.05, 0.1]}

#instantiate GridSearchCV model
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)

#fit our ratings dataset "data" onto the model
g_s_svd.fit(data)

Now we will print the results

In [21]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.8691347889749693, 'mae': 0.667961381211066}
{'rmse': {'n_factors': 100, 'reg_all': 0.05}, 'mae': {'n_factors': 20, 'reg_all': 0.02}}


We see an RMSE of about 0.87, which isn't bad on a scale of 0.5-5.0.

For RMSE, the best parameters are n_factors = 100 and reg_all = 0.05 (MAE prefers n_factors = 20 and reg_all = 0.02). The reg_all value sits in the middle of our range, while n_factors is at the upper edge. Let's do another quick search to see if we can improve this.

In [22]:
## we will set up a SVD model with appropriate hyperparameters.

#established some initial hyperparameters
params = {'n_factors': [35, 60, 80],
         'reg_all': [0.35, 0.06, 0.8]}

#instantiate GridSearchCV model
g_s_svd = GridSearchCV(SVD,param_grid=params,n_jobs=-1)

#fit our ratings dataset "data" onto the model
g_s_svd.fit(data)

In [23]:
print(g_s_svd.best_score)
print(g_s_svd.best_params)

{'rmse': 0.8695572084639343, 'mae': 0.6687425042822729}
{'rmse': {'n_factors': 80, 'reg_all': 0.06}, 'mae': {'n_factors': 80, 'reg_all': 0.06}}


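The RMSE barely moved (0.8696 versus 0.8691 in the first search). Note that this second grid did not include n_factors = 100, and two of its three reg_all values (0.35 and 0.8) are far from the earlier optimum, so it mostly confirms that a reg_all around 0.05-0.06 works well. For now we'll treat the model as largely optimized and return to this later.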
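Each of the two searches above tried a 3 x 3 grid. To make explicit what a grid search enumerates, here is a dependency-free sketch that expands a parameter grid into its individual combinations. The helper `expand_grid` is hypothetical and is not how `surprise` exposes this internally.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of values in a parameter grid."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


# The first grid used in the notebook
first_grid = {'n_factors': [20, 50, 100], 'reg_all': [0.02, 0.05, 0.1]}
combos = expand_grid(first_grid)

print(len(combos))
print(combos[0], combos[-1])
```

Each combination is trained once per cross-validation fold, so assuming the default of 5 folds, one search fits 9 x 5 = 45 models. That is worth keeping in mind before widening the grid.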

#### Memory-Based Methods (Neighborhood-Based) KNN with surprise

To begin with, we can look at the simpler neighborhood-based approaches, starting with KNNBasic. With KNNBasic, we'll need a trainset and a testset in order to evaluate results. We'll also run a few examples to determine the best hyperparameters.

We'll import the relevant modules first.

In [24]:
# import the relevant modules from surprise
from surprise.model_selection import train_test_split
from surprise.prediction_algorithms import knns
from surprise.similarities import cosine, msd, pearson
from surprise import accuracy

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)
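Before fitting anything, it may help to see the core idea behind a basic neighborhood model: the predicted rating is a similarity-weighted average of the ratings given by the most similar users. The sketch below is a simplified illustration under stated assumptions. The function name is hypothetical, `k=40` mirrors the `KNNBasic` default, and neighbors with non-positive similarity are simply dropped.

```python
def knn_basic_predict(neighbors, k=40):
    """Similarity-weighted average over the k most similar neighbors.

    neighbors: list of (similarity, rating) pairs, one per user
    who has rated the target movie.
    """
    top_k = sorted(neighbors, key=lambda pair: pair[0], reverse=True)[:k]
    # Keep only neighbors that are positively similar to the target user
    top_k = [(sim, rating) for sim, rating in top_k if sim > 0]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbors, so no prediction
    return sum(sim * rating for sim, rating in top_k) / total_sim


# Hypothetical (similarity, rating) pairs
neighbors = [(0.9, 5.0), (0.6, 4.0), (0.1, 1.0), (-0.3, 2.0)]
pred = knn_basic_predict(neighbors, k=2)
print(round(pred, 2))
```

With `k=2`, only the two closest users contribute, so the low rating from the weakly similar third user does not drag the prediction down.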

With KNNBasic, we have to set some of our hyperparameters. We'll try both "cosine" and "pearson" similarity. We'll also use user-based similarity: there are far fewer users than movies, so this will save us considerable time.
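To show how the two similarity options differ, here is a plain-Python sketch that computes both over the movies two users have in common. The users and ratings are invented, the helper names are hypothetical, and the library's own implementation may differ in details such as minimum-support handling.

```python
import math


def common_ratings(u, v):
    """Ratings of both users, restricted to the movies they both rated."""
    shared = sorted(set(u) & set(v))
    return [u[m] for m in shared], [v[m] for m in shared]


def cosine_sim(u, v):
    a, b = common_ratings(u, v)
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def pearson_sim(u, v):
    a, b = common_ratings(u, v)
    if not a:
        return 0.0
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    # Center each user's ratings so a consistent offset is ignored
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / norm if norm else 0.0


# Hypothetical users: bob rates every shared movie one star lower than alice
alice = {1: 5.0, 2: 3.0, 3: 4.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 4: 5.0}

print(cosine_sim(alice, bob))
print(pearson_sim(alice, bob))
```

Pearson treats the two users as having identical taste because it removes each user's average, whereas cosine is slightly below 1 because it compares the raw rating vectors.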

In [25]:
#basic_pearson.fit(trainset)
#predictions = basic_pearson.test(testset)
#print(accuracy.rmse(predictions))

NameError: name 'basic_pearson' is not defined

In [26]:
# cross validating with KNNBasic
#knn_basic = KNNBasic(sim_options={'name':'pearson', 'user_based':True})
#cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

knn_baseline = KNNBaseline(sim_options={'name':'pearson', 'user_based':True})
cv_knn_baseline = cross_validate(knn_baseline,data)

Estimating biases using als...
Computing the pearson similarity matrix...


AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

In [123]:
sim_cos = {"name": "cosine", "user_based": True}
basic = knns.KNNBasic(sim_options=sim_cos)
basic.fit(trainset)

#let's see how well the model did on the test set
#predictions = basic.test(testset)
#print(accuracy.rmse(predictions))

Computing the cosine similarity matrix...


AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
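Both KNN cells fail with the same `AttributeError`, which points to a version mismatch rather than a problem with our data: the installed `surprise` build still refers to `np.int`, an alias that was deprecated in NumPy 1.20 and removed in NumPy 1.24. A small sketch for checking whether a given NumPy version is affected follows. The helper name is hypothetical, and it only inspects a version string.

```python
def np_int_alias_removed(numpy_version):
    """Return True if this NumPy version no longer provides np.int."""
    major, minor = (int(part) for part in numpy_version.split('.')[:2])
    return (major, minor) >= (1, 24)


# Example version strings
for version in ['1.20.0', '1.23.5', '1.24.0', '2.0.1']:
    print(version, np_int_alias_removed(version))
```

In the notebook this could be called as `np_int_alias_removed(np.__version__)`. If it returns `True`, the usual options are to install a NumPy release below 1.24 or to upgrade to a `scikit-surprise` release built against the newer NumPy. Which of these suits this environment is an assumption to verify.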