# Content-Based Filtering based on Deep Learning: Movie Recommender System


In [2]:
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## Where's the data from?
As in the original lab, the dataset is [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/). 
As quoted: 
>This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

For more info, check **./MovieLens/README.txt**

In [10]:
# Let's pull some movie stats!
movies = pd.read_csv('./MovieLens/movies.csv')
ratings = pd.read_csv('./MovieLens/ratings.csv')

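# Extract the four-digit release year in parentheses from the title,
# e.g. 'Toy Story (1995)' -> '1995' (kept as a string; NaN if no year is found)
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)', expand=False)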

print(f"Movies dataframe shape: {movies.shape} Ratings shape: {ratings.shape}")

Movies dataframe shape: (9742, 4) Ratings shape: (100836, 4)
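
To make the year extraction easier to check, here is a minimal sketch on a few sample titles. The `titles` series and the `years_numeric` name are made up for illustration, and anchoring the pattern to the end of the title plus the numeric conversion are assumptions, not steps the notebook performs.

```python
import pandas as pd

# Hypothetical sample titles; the last one has no year at all
titles = pd.Series([
    "Toy Story (1995)",
    "Seven (a.k.a. Se7en) (1995)",
    "Some Untitled Movie",
])

# Capture four digits inside parentheses at the end of the title.
# expand=False returns a Series instead of a one-column DataFrame.
years = titles.str.extract(r'\((\d{4})\)\s*$', expand=False)

# Optional: convert to a nullable integer so the year can be used as a numeric feature
years_numeric = pd.to_numeric(years, errors='coerce').astype('Int64')

print(years_numeric)
```

Titles without a parenthesized year end up as missing values, so they would need to be dropped or imputed before training.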


In [11]:
# Left-merge the ratings onto the movies by movieId, so each rating row carries the movie's title and year
merged_df = pd.merge(movies, ratings, on='movieId', how='left')

# Group by movie title and calculate rating count and average rating
movie_stats = merged_df.groupby('title').agg({
    'rating': ['count', 'mean'],
    'year': 'first'  # Keep the release year (identical for every rating of a title)
})
movie_stats.columns = ['rating_count', 'average_rating', 'year']

# Top 5 movies by number of ratings. Really nice movies btw.
movie_stats.reset_index().sort_values(by='rating_count', ascending=False).head(5)

Unnamed: 0,title,rating_count,average_rating,year
3164,Forrest Gump (1994),329,4.164134,1994
7609,"Shawshank Redemption, The (1994)",317,4.429022,1994
6878,Pulp Fiction (1994),307,4.197068,1994
7696,"Silence of the Lambs, The (1991)",279,4.16129,1991
5521,"Matrix, The (1999)",278,4.192446,1999
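
The same statistics can be sketched with pandas named aggregation, which avoids renaming the columns afterwards. The toy frames below are made up for illustration. Grouping by `movieId` as well as `title` is an assumption that keeps two different movies with the same title apart.

```python
import pandas as pd

# Hypothetical toy data: movie 3 has no ratings
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story (1995)', 'Heat (1995)', 'Unrated Movie (2000)'],
    'year': ['1995', '1995', '2000'],
})
ratings = pd.DataFrame({
    'userId': [1, 2, 1],
    'movieId': [1, 1, 2],
    'rating': [4.0, 5.0, 3.0],
})

# A left merge keeps movies without ratings (their rating is NaN)
merged = movies.merge(ratings, on='movieId', how='left')

movie_stats = (
    merged.groupby(['movieId', 'title'], as_index=False)
    .agg(
        rating_count=('rating', 'count'),   # 'count' ignores NaN, so unrated movies get 0
        average_rating=('rating', 'mean'),  # NaN for unrated movies
        year=('year', 'first'),
    )
    .sort_values('rating_count', ascending=False)
)

print(movie_stats)
```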


## Which features do we have?
**Movie features:** for now, the release year, the one-hot encoded genres and the average rating.<br>
TODO: Think of new interesting features! Duration, country, budget... We also have tags...<br>
**User features:** for now, the user's average rating per genre.<br>
TODO: Add rating count and rating average, per-country average and so on. There's a lot of feature engineering to be done here...<br>
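
As a rough illustration of the per-genre user average, here is a minimal sketch on toy data. The frames and the `user_genre_avg` name are hypothetical, and filling genres a user never rated with 0 is an assumption rather than a choice made in this notebook.

```python
import pandas as pd

# Hypothetical toy data
movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Comedy|Romance', 'Action', 'Comedy'],
})
ratings = pd.DataFrame({
    'userId': [1, 1, 2, 2],
    'movieId': [1, 2, 1, 3],
    'rating': [4.0, 2.0, 3.0, 5.0],
})

# One row per (rating, genre) pair
exploded = (
    ratings.merge(movies, on='movieId')
    .assign(genre=lambda d: d['genres'].str.split('|'))
    .explode('genre')
)

# Rows are users, columns are genres, values are the user's mean rating in that genre
user_genre_avg = exploded.pivot_table(
    index='userId', columns='genre', values='rating', aggfunc='mean'
).fillna(0.0)  # Assumption: 0 marks a genre the user has never rated

print(user_genre_avg)
```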

## How will we structure the model training?
We will use three row-aligned dataframes: one with the user features, one with the movie features and one with the ratings (the target Y). Row *i* of each dataframe describes the same rating event.<br>
For example, if entry #99 of each dataframe is:<br>

|Index  |Movie  |User   |Y |
| ---   |   --- | ---   |---|
|99     |Shrek 2|#793   |3.5|

That would mean that User 793 rated 'Shrek 2' 3.5 stars.
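
The row alignment described above can be sketched as follows. The toy frames are made up, and splitting with `train_test_split` (imported at the top) is an assumption about how the three dataframes might be used later, since this section does not show the split itself.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of each object describes the same rating event
movie_rows = pd.DataFrame({'title': ['Shrek 2', 'Heat', 'Shrek 2', 'Toy Story']})
user_rows = pd.DataFrame({'userId': [793, 793, 12, 12]})
y = pd.Series([3.5, 4.0, 2.0, 5.0], name='rating')

# The three objects must have the same length to stay aligned
assert len(movie_rows) == len(user_rows) == len(y)

# Passing all three at once applies the same shuffle to each of them,
# so the rows stay aligned after the split
m_train, m_test, u_train, u_test, y_train, y_test = train_test_split(
    movie_rows, user_rows, y, test_size=0.25, random_state=1
)

print(m_train.index.tolist(), u_train.index.tolist(), y_train.index.tolist())
```

Splitting each dataframe in a separate call would only keep the rows aligned if the same `random_state` were used every time, so a single call is the less error-prone option.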


In [17]:
# Build one movie-feature row per rating (not one row per movie)
movies_data = []
# Go over the ratings and create the entries
for index, rating in ratings.iterrows():
    movie = movies[movies['movieId'] == rating['movieId']]
    if not movie.empty:  # Skip ratings whose movie is missing from movies.csv
        movie_row = movie.iloc[0]
        # Mean rating of this movie across all users
        avg_rating = ratings[ratings['movieId'] == rating['movieId']]['rating'].mean()

        movies_data.append({
            'title': movie_row['title'],
            'year': movie_row['year'],
            'genres': movie_row['genres'],
            'average_rating': avg_rating
        })

movies_df = pd.DataFrame(movies_data)  # Convert the list to a DataFrame
movies_df.head(10)

Unnamed: 0,title,year,genres,average_rating
0,Toy Story (1995),1995,Adventure|Animation|Children|Comedy|Fantasy,3.92093
1,Grumpier Old Men (1995),1995,Comedy|Romance,3.259615
2,Heat (1995),1995,Action|Crime|Thriller,3.946078
3,Seven (a.k.a. Se7en) (1995),1995,Mystery|Thriller,3.975369
4,"Usual Suspects, The (1995)",1995,Crime|Mystery|Thriller,4.237745
5,From Dusk Till Dawn (1996),1996,Action|Comedy|Horror|Thriller,3.509091
6,Bottle Rocket (1996),1996,Adventure|Comedy|Crime|Romance,3.782609
7,Braveheart (1995),1995,Action|Drama|War,4.031646
8,Rob Roy (1995),1995,Action|Drama|Romance|War,3.545455
9,Canadian Bacon (1995),1995,Comedy|War,2.863636


Keep in mind that each entry here corresponds to a rating, not to a movie!

In [22]:
# Split the genres for one-hot encoding and add them to the dataframe
genres_one_hot = movies_df['genres'].str.get_dummies(sep='|')
movies_df_onehot = pd.concat([movies_df, genres_one_hot], axis=1)

# Drop the original 'genres' column and the '(no genres listed)' indicator;
# movies without genres are then simply all zeros across the remaining genre columns
movies_df_onehot.drop(columns=['genres', '(no genres listed)'], inplace=True)

movies_df_onehot.head(5)

Unnamed: 0,title,year,average_rating,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,Toy Story (1995),1995,3.92093,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Grumpier Old Men (1995),1995,3.259615,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
2,Heat (1995),1995,3.946078,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,1,0,0
3,Seven (a.k.a. Se7en) (1995),1995,3.975369,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
4,"Usual Suspects, The (1995)",1995,4.237745,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0
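
To see what the one-hot step does to a movie without genres, here is a minimal sketch on made-up data. Passing `errors='ignore'` to `drop` is an assumption added so the call does not fail when no row carries the `(no genres listed)` label.

```python
import pandas as pd

# Hypothetical toy data: the last movie has no genre information
toy = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C'],
    'genres': ['Comedy|Romance', 'Action', '(no genres listed)'],
})

# One indicator column per distinct label in the '|'-separated strings
one_hot = toy['genres'].str.get_dummies(sep='|')

encoded = (
    pd.concat([toy, one_hot], axis=1)
    # errors='ignore' skips labels that are absent instead of raising KeyError
    .drop(columns=['genres', '(no genres listed)'], errors='ignore')
)

print(encoded)
```

After the drop, 'Movie C' has a 0 in every genre column, which is how a missing genre is represented here.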


The movies dataframe is now ready for the model. Next, let's create the users dataframe.

_Based on the Machine Learning Specialization lab. Thanks to Stanford Online, Coursera, and DeepLearningAI. It's been really fun!_