This notebook is divided into the following parts:
1. Data cleaning
2. Data Analysis
3. Build Recommender system
4. Rate your recommender system

First we make the required imports

In [1]:
#data analysis libraries 
import numpy as np
import pandas as pd

#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#ignore warnings
import warnings
warnings.filterwarnings('ignore')


Now we import all the datasets

We consider all datasets except for the link dataset

For now I am testing on the small dataset. I will switch to the bigger dataset at the end.

In [2]:
movies = pd.read_csv('./Downloads/ml-latest-small/movies.csv')
ratings = pd.read_csv('./Downloads/ml-latest-small/ratings.csv')
tags = pd.read_csv('./Downloads/ml-latest-small/tags.csv')
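
Since the dataset folder will change later, one option is to keep the path in a single place. The sketch below is only an illustration: `load_movielens` is a hypothetical helper (not part of pandas or MovieLens), and it assumes the bigger dataset uses the same three file names.

```python
from pathlib import Path

import pandas as pd


def load_movielens(data_dir):
    """Read movies.csv, ratings.csv and tags.csv from one folder.

    Hypothetical helper: assumes all three files sit directly in data_dir.
    """
    data_dir = Path(data_dir)
    names = ("movies", "ratings", "tags")
    return {name: pd.read_csv(data_dir / f"{name}.csv") for name in names}


# Usage with the path from the cell above:
# frames = load_movielens('./Downloads/ml-latest-small')
# movies, ratings, tags = frames['movies'], frames['ratings'], frames['tags']
```

Switching datasets then means changing only the folder passed to the helper.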

In [3]:
print(movies.head())
print(ratings.head())
print(tags.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
   userId  movieId              tag   timestamp
0       2    60756            funny  1445714994
1       2    60756  Highly quotable  1445714996
2       2    60756     will ferre

**Part 1: Cleaning the data**

First, we need to check for null values in the datasets

In [4]:
print(movies.isnull().any())
print(ratings.isnull().any())
print(tags.isnull().any())

movieId    False
title      False
genres     False
dtype: bool
userId       False
movieId      False
rating       False
timestamp    False
dtype: bool
userId       False
movieId      False
tag          False
timestamp    False
dtype: bool
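
`isnull().any()` only says whether a column has nulls at all. If the bigger dataset does contain some, it may help to know how many and in which rows. A minimal sketch on a made-up toy frame (the data below is invented for illustration, not taken from MovieLens):

```python
import numpy as np
import pandas as pd

# Toy data with one missing tag (invented for illustration)
toy = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [10, 20, 30],
    "tag": ["funny", np.nan, "classic"],
})

# Number of nulls per column instead of a plain True/False
null_counts = toy.isnull().sum()
print(null_counts)

# The rows that dropna() would remove
rows_with_nulls = toy[toy.isnull().any(axis=1)]
print(rows_with_nulls)
```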


My current dataset has no null values, but I write the cleaning statements anyway in case the bigger dataset contains some

In [5]:
movies = movies.dropna()
ratings = ratings.dropna()
tags = tags.dropna()

Extract year from the movie title for analysis

In [6]:
# Raw string so the regex escapes \( and \) reach the regex engine unchanged
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)
movies['year'].describe()

count     9730
unique     107
top       2002
freq       311
Name: year, dtype: object
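
The pattern above captures whatever sits in the last pair of parentheses, so the `year` column is text and may contain non-year strings. A stricter variant is sketched below. It assumes the year is always four digits in the final parentheses of the title, which may not hold for every row. The titles are examples for illustration.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Babylon 5",  # no year in the title
])

# Only accept four digits inside parentheses at the very end of the title
year = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)

# Convert to a nullable integer column so years can be compared numerically
year_num = pd.to_numeric(year, errors="coerce").astype("Int64")
print(year_num)
```

Titles without a matching year end up as missing values, which the later `dropna()` would remove.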

In [7]:
movies.describe()

Unnamed: 0,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


Comparing the two summaries above, the year column has only 9730 values while movieId has 9742, so some movies are missing a year

In [8]:
movies[movies['year'].isnull()]

Unnamed: 0,movieId,title,genres,year
6059,40697,Babylon 5,Sci-Fi,
9031,140956,Ready Player One,Action|Sci-Fi|Thriller,
9091,143410,Hyena Road,(no genres listed),
9138,147250,The Adventures of Sherlock Holmes and Doctor W...,(no genres listed),
9179,149334,Nocturnal Animals,Drama|Thriller,
9259,156605,Paterson,(no genres listed),
9367,162414,Moonlight,Drama,
9448,167570,The OA,(no genres listed),
9514,171495,Cosmos,(no genres listed),
9515,171631,Maria Bamford: Old Baby,(no genres listed),


This gives us two insights:
1. Some of the movies do not have a year in their title
2. Some of the movies do not have a genre ('(no genres listed)')

We remove movies without genres, and movies without a year, to make our dataset more robust

In [9]:
movies = movies[movies['genres'] != '(no genres listed)']
movies = movies.dropna()
movies.describe()

Unnamed: 0,movieId
count,9704.0
mean,41764.853566
std,51770.255671
min,1.0
25%,3228.75
50%,7254.5
75%,74789.5
max,193609.0


Now we clean the genres by turning the pipe-separated genres column into one 0/1 column per genre

In [10]:
# Get new columns with genre names and put 0/1
# str.get_dummies('|') does all three steps at once:
# Step 1 get all unique genres from the column
# Step 2 add a column for each unique genre
# Step 3 for each new column put 1 if the movie belongs to that genre
temp = movies['genres'].str.get_dummies('|')
movies = movies.join(temp)
movies.head()

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children,Comedy,Crime,...,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995,0,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men (1995),Comedy|Romance,1995,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II (1995),Comedy,1995,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
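
To make the one-hot step easier to follow, here is the same `str.get_dummies` call on a three-row toy frame built from titles shown above. This is a small illustration, not the full dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": [
        "Toy Story (1995)",
        "Grumpier Old Men (1995)",
        "Father of the Bride Part II (1995)",
    ],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Comedy|Romance",
        "Comedy",
    ],
})

# Split each genres string on '|' and build one 0/1 column per genre
indicators = toy["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original rows (same index)
toy_wide = toy.join(indicators)
print(toy_wide)
```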


Can we remove some of the less useful genres? We would need a "genre-rating" analysis to figure that out. For now we count how many movies fall into each genre

In [11]:
# Columns 0-3 are movieId, title, genres and year; everything after is a genre indicator
movies.iloc[:, 4:].sum()

Action         1827
Adventure      1263
Animation       611
Children        664
Comedy         3756
Crime          1199
Documentary     440
Drama          4359
Fantasy         779
Film-Noir        87
Horror          978
IMAX            158
Musical         334
Mystery         573
Romance        1596
Sci-Fi          978
Thriller       1892
War             382
Western         167
dtype: int64
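
One way to spot sparse genres is to look at each genre's share of all movies. The sketch below reuses some of the counts printed above and the 9704 movies left after cleaning. The 2% cutoff (`min_share`) is an arbitrary assumption for illustration, not a recommended value.

```python
import pandas as pd

# A few of the counts printed above (copied by hand for this sketch)
genre_counts = pd.Series({
    "Drama": 4359, "Comedy": 3756, "Thriller": 1892, "Action": 1827,
    "Western": 167, "IMAX": 158, "Film-Noir": 87,
})
n_movies = 9704  # rows left in movies after cleaning

# Fraction of movies carrying each genre, smallest first
share = (genre_counts / n_movies).sort_values()

min_share = 0.02  # hypothetical cutoff, chosen only for illustration
sparse_genres = share[share < min_share].index.tolist()
print(sparse_genres)  # ['Film-Noir', 'IMAX', 'Western']
```

In the notebook itself, `genre_counts` could come straight from `movies.iloc[:, 4:].sum()` and `n_movies` from `len(movies)`.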

Now we merge the tags dataframe with the ratings dataframe. This will help us analyze the relationship between tags and ratings

But first, we drop the timestamp columns since they do not seem useful to us

In [14]:
tag = tags.iloc[:,0:3]
rating = ratings.iloc[:,0:3]
ratings_and_tags = pd.merge(rating, tag , on=['userId','movieId'])
ratings_and_tags.head(10)

Unnamed: 0,userId,movieId,rating,tag
0,2,60756,5.0,funny
1,2,60756,5.0,Highly quotable
2,2,60756,5.0,will ferrell
3,2,89774,5.0,Boxing story
4,2,89774,5.0,MMA
5,2,89774,5.0,Tom Hardy
6,2,106782,5.0,drugs
7,2,106782,5.0,Leonardo DiCaprio
8,2,106782,5.0,Martin Scorsese
9,7,48516,1.0,way too long
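
Note that `pd.merge` defaults to an inner join, so ratings without any tag disappear from the result, and a rating with several tags is repeated once per tag. The toy example below (values invented for illustration) shows the difference and uses `indicator=True` to count untagged ratings.

```python
import pandas as pd

rating = pd.DataFrame({
    "userId": [1, 2, 2],
    "movieId": [1, 60756, 89774],
    "rating": [4.0, 5.0, 5.0],
})
tag = pd.DataFrame({
    "userId": [2, 2, 2],
    "movieId": [60756, 60756, 89774],
    "tag": ["funny", "Highly quotable", "MMA"],
})

# Default inner join: only (userId, movieId) pairs present in both frames
inner = pd.merge(rating, tag, on=["userId", "movieId"])

# Left join keeps every rating; _merge marks rows that found no tag
left = pd.merge(rating, tag, on=["userId", "movieId"], how="left", indicator=True)
untagged = (left["_merge"] == "left_only").sum()

print(len(inner), len(left), untagged)
```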


We might also need to merge movies and ratings to find the relationship between genres and ratings

In [None]:
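# iloc[:, 0:5] would keep only the first genre column (Action),
# so merge the full movies frame to have every genre indicator next to the ratings
ratings_and_movies = pd.merge(movies, rating, on=['movieId'])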

**Part 2: Data Analysis**

In this part we want to answer the following questions, which will help us build our recommender system

1. What is the relationship between genre and rating of a movie?
2. What is the relationship between year and rating of a movie?
3. What is the relationship between tag and rating of a movie?
4. What is the relationship between tag + genre and rating of a movie?
5. Are there any sparse genres that we can discard to make our system more efficient?
6. Which of the features discussed above are valuable to our recommendation system?
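
As a starting point for question 1, the mean rating per genre can be computed from the merged frame and the 0/1 genre columns. The sketch below uses made-up toy data and only two genres. It is one possible approach, not the final analysis.

```python
import pandas as pd

# Toy stand-ins for the cleaned movies and ratings frames (invented values)
movies_toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "Comedy": [1, 0, 1],
    "Drama": [0, 1, 1],
})
ratings_toy = pd.DataFrame({
    "movieId": [1, 1, 2, 3],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

merged = ratings_toy.merge(movies_toy, on="movieId")
genre_cols = ["Comedy", "Drama"]

# For each genre, average the ratings of movies flagged with that genre.
# A movie with several genres contributes to each of them.
mean_by_genre = pd.Series({
    g: merged.loc[merged[g] == 1, "rating"].mean() for g in genre_cols
})
print(mean_by_genre)
```

The same loop should also work on `ratings_and_movies`, as long as that frame contains all the genre indicator columns.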