# Content-Based Filtering based on Deep Learning: Movie Recommender System


In [1]:
# Libraries
import numpy as np
import pandas as pd
import tensorflow as tf 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

## Where's the data from?
As in the original lab, the dataset is [MovieLens ml-latest-small](https://grouplens.org/datasets/movielens/latest/).
The dataset's description reads:
>This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

For more info, check **./MovieLens/README.txt**

In [14]:
# Let's pull some movie stats!
movies_df = pd.read_csv('./MovieLens/movies.csv')
ratings_df = pd.read_csv('./MovieLens/ratings.csv')

print(f"Movies dataframe shape: {movies_df.shape} Ratings shape: {ratings_df.shape}")

Movies dataframe shape: (9742, 3) Ratings shape: (100836, 4)


In [28]:
# Merge the DataFrames on movie ID so every rating row carries its movie title.
# how='left' keeps movies without any rating (their rating is NaN and counts as 0).
merged_df = pd.merge(movies_df, ratings_df, on='movieId', how='left')

# Group by movie title and calculate rating count and average rating
movie_stats = merged_df.groupby('title').agg({'rating': ['count', 'mean']})
movie_stats.columns = ['rating_count', 'average_rating']

# Top 5 movies by number of ratings. Really nice movies btw.
movie_stats.reset_index().sort_values(by='rating_count', ascending=False).head(5)

|      |title                            |rating_count|average_rating|
| ---  | ---                             | ---        | ---          |
|3164  |Forrest Gump (1994)              |329         |4.164134      |
|7609  |Shawshank Redemption, The (1994) |317         |4.429022      |
|6878  |Pulp Fiction (1994)              |307         |4.197068      |
|7696  |Silence of the Lambs, The (1991) |279         |4.16129       |
|5521  |Matrix, The (1999)               |278         |4.192446      |


## Which features do we have?
**Movie features:** for now, release year, one-hot encoded genres, and average rating.<br>
TODO: Think of new interesting features! Duration, country, budget... <br>
**User features:** the user's average rating per genre. <br>
TODO: Add rating count, overall rating average, per-country average, and so on. There's a lot of feature engineering to be done here... <br>
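
The features listed above could be built roughly as follows. This is only a sketch on a few made-up rows that mimic the column layout of `movies.csv` and `ratings.csv`; the names `movie_features` and `user_features`, and the choice to fill never-rated genres with 0, are assumptions rather than the lab's actual code.

```python
import pandas as pd

# Toy stand-ins for movies.csv and ratings.csv (hypothetical rows, same columns).
movies_df = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Heat (1995)", "Shrek 2 (2004)"],
    "genres": ["Animation|Comedy", "Action|Crime", "Animation|Comedy"],
})
ratings_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2],
    "movieId": [1, 2, 1, 3, 2],
    "rating":  [4.0, 3.0, 5.0, 3.5, 2.0],
})

# Movie features: release year parsed from the trailing "(YYYY)" in the title
# (kept as float so a title without a year would become NaN), one-hot genres,
# and the average rating per movie.
year = (
    movies_df["title"].str.extract(r"\((\d{4})\)\s*$")[0].astype(float).rename("year")
)
genres = movies_df["genres"].str.get_dummies(sep="|")
avg_rating = (
    ratings_df.groupby("movieId", as_index=False)["rating"].mean()
    .rename(columns={"rating": "average_rating"})
)
movie_features = (
    pd.concat([movies_df[["movieId"]], year, genres], axis=1)
    .merge(avg_rating, on="movieId", how="left")
    .set_index("movieId")
)

# User features: each user's mean rating per genre.
genre_cols = list(genres.columns)
rated = ratings_df.merge(movies_df[["movieId"]].join(genres), on="movieId")
rating_sums = rated[genre_cols].mul(rated["rating"], axis=0).groupby(rated["userId"]).sum()
rating_counts = rated[genre_cols].groupby(rated["userId"]).sum()
# Assumption: a genre the user never rated gets 0 instead of NaN.
user_features = (rating_sums / rating_counts).fillna(0.0)

print(movie_features)
print(user_features)
```

Both tables are indexed by ID (`movieId` and `userId`), which makes it easy to look up the feature row that belongs to a given rating later on.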

## How will we structure the model training?
The training data is split across three DataFrames with aligned rows: one with user features, one with movie features, and one with the ratings. Row *i* of each DataFrame describes the same rating event. <br>
For example, if entry #99 of the three DataFrames is:

|Index  |Movie  |User   |Rating (y) |
| ---   |   --- | ---   |---|
|99     |Shrek 2|#793   |3.5|

That would mean that User 793 rated 'Shrek 2' 3.5 stars.
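
One way to get this row alignment is to look up each rating's user and movie in ID-indexed feature tables. The sketch below uses made-up IDs and feature values; the names `user_train`, `item_train`, and `y_train` are assumptions for illustration, not necessarily the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical feature tables indexed by ID (all values are made up).
user_features = pd.DataFrame(
    {"Animation": [4.0, 4.25], "Comedy": [4.0, 4.25]},
    index=pd.Index([1, 793], name="userId"),
)
movie_features = pd.DataFrame(
    {"year": [1995, 2004], "average_rating": [4.5, 3.5]},
    index=pd.Index([10, 20], name="movieId"),
)
ratings_df = pd.DataFrame({
    "userId":  [1, 793, 793],
    "movieId": [10, 10, 20],
    "rating":  [4.0, 5.0, 3.5],
})

# For every rating, fetch the matching user row and movie row.
# .loc repeats a feature row whenever its ID appears in several ratings.
user_train = user_features.loc[ratings_df["userId"]].reset_index(drop=True)
item_train = movie_features.loc[ratings_df["movieId"]].reset_index(drop=True)
y_train = ratings_df[["rating"]].reset_index(drop=True)

# Row i of all three frames now refers to the same rating event.
assert len(user_train) == len(item_train) == len(y_train)
print(pd.concat([user_train, item_train, y_train], axis=1))
```

Keeping users and movies in separate frames means the user features and the movie features can be fed to the model as two separate inputs.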


_Based on the Machine Learning Specialization lab. Thanks to Stanford Online, Coursera, and DeepLearning.AI. It's been really fun!_