# Building a Movie Recommender System

In this project I'm going to build a simple movie recommender system that uses movie ratings as the basis for its calculation, an approach widely used to generate "best movies" charts. I will combine each movie's mean rating and its number of votes into a single score, then sort by that score to recommend the best movies to users.


### Concept
There are several basic concepts and assumptions used in this project:
1. Movies that are more popular have a higher probability of being liked by viewers.
2. This system does not provide personalized recommendations for each type of user.
3. The implementation essentially sorts the movies by rating and popularity, then shows the top movies from the list.

#### IMDb Weighted Rating Formula
I am going to use the weighted rating formula by IMDb (Internet Movie Database) for the calculations that follow.

$$ \left(\frac{v}{v+m}\times R\right) + \left(\frac{m}{v+m}\times C\right) $$

Where:\
v = number of votes for a particular movie\
m = minimum number of votes required for a movie to get into the chart\
R = average rating of the movie\
C = the mean rating across all movies in the dataset
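
To make the effect of the formula concrete, here is a minimal sketch in Python. The function name `weighted_rating` and the numbers for `m` and `C` are assumptions chosen only for illustration; they are not values taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own mean rating R with the global mean C.

    v: number of votes for the movie
    m: minimum-vote threshold for the chart
    """
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, for illustration only
C = 6.0    # assumed mean rating across all movies
m = 1000   # assumed minimum-vote threshold

# Two movies with the same mean rating but very different vote counts
few_votes = weighted_rating(v=50, R=9.0, m=m, C=C)
many_votes = weighted_rating(v=50000, R=9.0, m=m, C=C)

print(round(few_votes, 2), round(many_votes, 2))  # 6.14 8.94
```

A movie with few votes is pulled toward the global mean `C`, while a movie with many votes keeps a score close to its own mean rating `R`.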

### Importing Libraries and Loading Files
The datasets we will use are title.basics.tsv and title.ratings.tsv. title.basics.tsv contains general information about each title, including but not limited to its title, start year, and genres. title.ratings.tsv contains the rating information for each title.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [3]:
# Importing the tsv files
movie_df = pd.read_csv('title.basics.tsv', sep='\t')
rating_df = pd.read_csv('title.ratings.tsv', sep='\t')
# Data preview
print(movie_df.head())
print('\n')
print(rating_df.head())

      tconst  titleType                                      primaryTitle  \
0  tt0221078      short                         Circle Dance, Ute Indians   
1  tt8862466  tvEpisode  ¡El #TeamOsos va con todo al "Reality del amor"!   
2  tt7157720  tvEpisode                                     Episode #3.41   
3  tt2974998  tvEpisode                         Episode dated 16 May 1987   
4  tt2903620  tvEpisode                  Frances Bavier: Aunt Bee Retires   

                                      originalTitle  isAdult startYear  \
0                         Circle Dance, Ute Indians        0      1898   
1  ¡El #TeamOsos va con todo al "Reality del amor"!        0      2018   
2                                     Episode #3.41        0      2016   
3                         Episode dated 16 May 1987        0      1987   
4                  Frances Bavier: Aunt Bee Retires        0      1973   

  endYear runtimeMinutes             genres  
0      \N             \N  Documentary,Short  


### Data Cleaning
In this part, I will first focus on the dataframe loaded from "title.basics.tsv" (movie_df) and examine it by checking each column's non-null count and data type.

In [4]:
# Obtaining data information
print(movie_df.info())
# Check for Null value in the dataset
print(movie_df.isnull().sum())
# Previewing the data that contains null values
print(movie_df.loc[(movie_df['primaryTitle'].isnull()) | (movie_df['originalTitle'].isnull())])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9025 entries, 0 to 9024
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          9025 non-null   object
 1   titleType       9025 non-null   object
 2   primaryTitle    9011 non-null   object
 3   originalTitle   9011 non-null   object
 4   isAdult         9025 non-null   int64 
 5   startYear       9025 non-null   object
 6   endYear         9025 non-null   object
 7   runtimeMinutes  9025 non-null   object
 8   genres          9014 non-null   object
dtypes: int64(1), object(8)
memory usage: 634.7+ KB
None
tconst             0
titleType          0
primaryTitle      14
originalTitle     14
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            11
dtype: int64
          tconst  titleType primaryTitle originalTitle  isAdult startYear  \
9000  tt10790040  tvEpisode          NaN           NaN        0      2019 

We can see that there are some null values in the 'primaryTitle' and 'originalTitle' columns. Since rows without a title cannot be meaningfully analyzed or recommended, we can simply delete them from the dataframe.

In [5]:
# Updating movie_df dataframe by deleting null values
movie_df = movie_df.loc[(movie_df['primaryTitle'].notnull()) & (movie_df['originalTitle'].notnull())]
# Number of rows after the null values are deleted
print(len(movie_df))

9011
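
As a side note, the same filtering can be expressed with `DataFrame.dropna` and its `subset` parameter. The sketch below uses a made-up three-row frame (`toy_df` is a hypothetical stand-in for movie_df) to check that both approaches keep the same rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_df (made-up rows, for illustration only)
toy_df = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3'],
    'primaryTitle': ['A', np.nan, 'C'],
    'originalTitle': ['A', np.nan, np.nan],
})

# Boolean-mask filtering, in the same style as the cell above
by_mask = toy_df.loc[
    toy_df['primaryTitle'].notnull() & toy_df['originalTitle'].notnull()
]

# dropna with subset drops rows where any of the listed columns is null
by_dropna = toy_df.dropna(subset=['primaryTitle', 'originalTitle'])

print(by_mask.equals(by_dropna))  # True
```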


Aside from the 'primaryTitle' and 'originalTitle' columns, there are also some null values in the 'genres' column. Using the same method, we are going to delete those rows as well.

In [6]:
# Previewing the data that contains null values
print(movie_df.loc[movie_df['genres'].isnull()])
# Updating movie_df dataframe by deleting null values
movie_df = movie_df.loc[movie_df['genres'].notnull()]
# Number of rows after the null values are deleted
print(len(movie_df))

          tconst  titleType  \
9014  tt10233364  tvEpisode   
9015  tt10925142  tvEpisode   
9016  tt10970874  tvEpisode   
9017  tt11670006  tvEpisode   
9018  tt11868642  tvEpisode   
9019   tt2347742  tvEpisode   
9020   tt3984412  tvEpisode   
9021   tt8740950  tvEpisode   
9022   tt9822816  tvEpisode   
9023   tt9900062  tvEpisode   
9024   tt9909210  tvEpisode   

                                           primaryTitle originalTitle  \
9014  Rolling in the Deep Dish\tRolling in the Deep ...             0   
9015  The IMDb Show on Location: Star Wars Galaxy's ...             0   
9016  Die Bauhaus-Stadt Tel Aviv - Vorbild für die M...             0   
9017  ...ein angenehmer Unbequemer...\t...ein angene...             0   
9018  GGN Heavyweight Championship Lungs With Mike T...             0   
9019  No sufras por la alergia esta primavera\tNo su...             0   
9020  I'm Not Going to Come Last, I'm Just Going to ...             0   
9021  Weight Loss Resolution Restart - Ins 

From the data preview we can see that there are many '\\N' values in the 'startYear', 'endYear', and 'runtimeMinutes' columns. These '\\N' values represent nulls, so we are going to replace them with NaN using np.nan (a floating-point missing-value marker, not a string). The data type of these columns will then be converted to float64.

In [7]:
# Replace '\\N' with NaN using np.nan
movie_df['startYear'] = movie_df['startYear'].replace('\\N', np.nan)
movie_df['endYear'] = movie_df['endYear'].replace('\\N', np.nan)
movie_df['runtimeMinutes'] = movie_df['runtimeMinutes'].replace('\\N', np.nan)
# Converting the column's data type into float64
movie_df['startYear'] = movie_df['startYear'].astype('float64')
movie_df['endYear'] = movie_df['endYear'].astype('float64')
movie_df['runtimeMinutes'] = movie_df['runtimeMinutes'].astype('float64')
# Previewing the result
print(movie_df['startYear'].unique()[:5])
print(movie_df['endYear'].unique()[:5])
print(movie_df['runtimeMinutes'].unique()[:5])

[1898. 2018. 2016. 1987. 1973.]
[  nan 2005. 1955. 2006. 1999.]
[nan 29.  7. 23. 85.]
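
As an aside on the conversion cell above, the '\\N' marker can also be handled at load time through the `na_values` parameter of `pd.read_csv`. The sketch below uses a tiny made-up TSV string (`sample_tsv` is hypothetical, not the real file) rather than title.basics.tsv itself.

```python
import io
import pandas as pd

# Tiny made-up TSV with the same null marker as title.basics.tsv
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt1\t1898\t\\N\t\\N\n"
    "tt2\t2018\t2019\t23\n"
)

# na_values tells the parser to treat the literal \N as a missing value
toy_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

# Columns that contain a missing value are parsed as float64
print(toy_df.dtypes)
print(toy_df['endYear'].isna().sum())
```

In this toy data, columns without any missing value (such as 'startYear') are parsed as integers, so an explicit `astype('float64')` would still be needed if a uniform type is wanted.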


The output of the conversion cell shows that the '\\N' values have been replaced by NaN. The next step is to transform the data in the 'genres' column into lists so that filtering by genre becomes much easier. This can be done by creating a function named transform_to_list.

In [8]:
# Define the transform_to_list function
def transform_to_list(x):
    # Split on commas; a single genre such as 'Drama' becomes a one-element list
    if isinstance(x, str) and x != '\\N':
        return x.split(',')
    # Return an empty list if there is no data
    return []
# Apply the function to column 'genres'
movie_df['genres'] = movie_df['genres'].apply(transform_to_list)
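
To illustrate why list-valued genres make filtering easier, here is a small sketch on a made-up Series (the values and the variable names `genres`, `genre_lists`, and `has_drama` are assumptions for illustration). Note that plain `str.split(',')` turns a single-genre string such as 'Drama' into a one-element list.

```python
import pandas as pd

# Made-up genre strings in the same comma-separated format as the dataset
genres = pd.Series(['Documentary,Short', 'Drama', 'Comedy,Drama'])

# Split each string into a list of genres
genre_lists = genres.apply(lambda s: s.split(','))

# With lists, membership tests match whole genre names exactly
has_drama = genre_lists.apply(lambda g: 'Drama' in g)

print(genres[has_drama].tolist())  # ['Drama', 'Comedy,Drama']
```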

After we finish with the movie_df dataframe, we can move on to examine the rating_df dataframe. We will first preview the dataframe and then obtain its information.

In [9]:
# Dataframe preview
print(rating_df.head())
# Dataframe information
print(rating_df.info())

      tconst  averageRating  numVotes
0  tt0000001            5.6      1608
1  tt0000002            6.0       197
2  tt0000003            6.5      1285
3  tt0000004            6.1       121
4  tt0000005            6.1      2050
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030009 entries, 0 to 1030008
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1030009 non-null  object 
 1   averageRating  1030009 non-null  float64
 2   numVotes       1030009 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ MB
None
