# Building a Recommender System


Broadly speaking, recommender systems are algorithms that suggest relevant items to users; the items may be movies to watch, texts to read, products to buy, or anything else, depending on the industry. Over the last few decades, with the rise of YouTube, Amazon, Netflix, and many similar web services, recommender systems have come to play a growing role in our lives. From e-commerce (suggesting articles that may interest buyers) to online advertising (showing users content that matches their preferences), recommender systems are now hard to avoid in our daily online journeys.[<sup>[ref]</sup>](https://towardsdatascience.com/introduction-to-recommender-systems-6c66cf15ada)

## Load Datasets

Import the required Python libraries and load the datasets as pandas dataframes.

In [1]:
import pandas as pd
import numpy as np

In [2]:
file_movie = 'https://storage.googleapis.com/dqlab-dataset/title.basics.tsv'
file_rating = 'https://storage.googleapis.com/dqlab-dataset/title.ratings.tsv'

movie_df = pd.read_csv(file_movie, sep='\t')
rating_df = pd.read_csv(file_rating, sep='\t')

## Data Cleaning: `movie_df`

Preview the first few rows of the dataframe, then check its shape and data types.

In [3]:
movie_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898,\N,\N,"Documentary,Short"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018,\N,\N,"Comedy,Drama"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016,\N,29,"Comedy,Game-Show"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987,\N,\N,News
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973,\N,\N,Documentary


In [4]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9025 entries, 0 to 9024
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          9025 non-null   object
 1   titleType       9025 non-null   object
 2   primaryTitle    9011 non-null   object
 3   originalTitle   9011 non-null   object
 4   isAdult         9025 non-null   int64 
 5   startYear       9025 non-null   object
 6   endYear         9025 non-null   object
 7   runtimeMinutes  9025 non-null   object
 8   genres          9014 non-null   object
dtypes: int64(1), object(8)
memory usage: 634.7+ KB


The dataset consists of 9025 entries, and the columns `primaryTitle`, `originalTitle`, and `genres` have missing values. Using the method `.isnull()`, we count the missing values in each column.

In [5]:
movie_df.isnull().sum()

tconst             0
titleType          0
primaryTitle      14
originalTitle     14
isAdult            0
startYear          0
endYear            0
runtimeMinutes     0
genres            11
dtype: int64

Slice the dataset to display entries containing missing values.

In [6]:
movie_df.loc[(
    movie_df['primaryTitle'].isnull()) | (
    movie_df['originalTitle'].isnull()) | (
    movie_df['genres'].isnull()
)]

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
9000,tt10790040,tvEpisode,,,0,2019,\N,\N,\N
9001,tt10891902,tvEpisode,,,0,2020,\N,\N,Crime
9002,tt11737860,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9003,tt11737862,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9004,tt11737866,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9005,tt11737872,tvEpisode,,,0,2020,\N,\N,\N
9006,tt11737874,tvEpisode,,,0,2020,\N,\N,"Comedy,Drama,Romance"
9007,tt1971246,tvEpisode,,,0,2011,\N,\N,Biography
9008,tt2067043,tvEpisode,,,0,1965,\N,\N,Music
9009,tt4404732,tvEpisode,,,0,2015,\N,\N,Comedy


Update the dataframe `movie_df` by excluding the rows that contain missing values. Re-check the number of entries after the update.

In [7]:
movie_df = movie_df.loc[(
    movie_df['primaryTitle'].notnull()) & (
    movie_df['originalTitle'].notnull()) & (
    movie_df['genres'].notnull()
)].copy()  # copy so later column assignments do not act on a slice

len(movie_df)

9000
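The same row filter can likely be written more compactly with `DataFrame.dropna`. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not part of the dataset) to check that both approaches keep the same rows.

```python
import pandas as pd

# Toy frame standing in for movie_df; the values are made up for illustration.
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4'],
    'primaryTitle': ['A', None, 'C', 'D'],
    'originalTitle': ['A', None, 'C', 'D'],
    'genres': ['Drama', 'Comedy', None, 'News'],
})

required = ['primaryTitle', 'originalTitle', 'genres']

# Boolean-mask version, mirroring the cell above
masked = toy.loc[
    toy['primaryTitle'].notnull()
    & toy['originalTitle'].notnull()
    & toy['genres'].notnull()
]

# dropna version: a row is dropped if any of the listed columns is missing
dropped = toy.dropna(subset=required)

print(masked.equals(dropped))
print(dropped['tconst'].tolist())
```

Listing the required columns once in `subset` keeps the filter easy to extend if more columns need to be checked later.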

Notice that the columns `startYear`, `endYear`, and `runtimeMinutes` contain the placeholder string `'\\N'`. This value will be replaced with `NaN`, and the columns will then be cast to `'float64'` (they cannot be cast to `'int64'` because NaN is a floating-point value).

In [8]:
movie_df['startYear'] = movie_df['startYear'].replace(
    '\\N', np.nan).astype('float64')
movie_df['endYear'] = movie_df['endYear'].replace(
    '\\N', np.nan).astype('float64')
movie_df['runtimeMinutes'] = movie_df['runtimeMinutes'].replace(
    '\\N', np.nan).astype('float64')

print(movie_df['startYear'].unique()[:5],
      movie_df['endYear'].unique()[:5],
      movie_df['runtimeMinutes'].unique()[:5],
      sep='\n')

[1898. 2018. 2016. 1987. 1973.]
[  nan 2005. 1955. 2006. 1999.]
[nan 29.  7. 23. 85.]
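As an alternative to replacing the placeholder after loading, `pd.read_csv` accepts an `na_values` argument that marks given strings as missing while parsing. The sketch below shows this on a small inline sample; the sample rows and the names `sample_tsv` and `sample_df` are assumptions for illustration only.

```python
import io
import pandas as pd

# Small inline sample in the same tab-separated layout; rows are illustrative.
sample_tsv = (
    "tconst\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\t1898\t\\N\t\\N\n"
    "tt0000002\t2016\t\\N\t29\n"
    "tt0000003\t2001\t2005\t23\n"
)

# na_values tells read_csv to treat the literal string \N as missing
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', na_values='\\N')

print(sample_df.dtypes)
print(sample_df['endYear'].tolist())
```

Note that `na_values` applies to every column, so with the real files it would also turn `\N` entries in text columns such as `genres` into missing values. That changes which rows a later null check removes, so the counts shown in this notebook would not necessarily be reproduced.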


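Create a function to transform each entry in the column `genres` into a list. Because the function splits only when the entry contains a comma, an empty list is returned both when no genres are available and when the entry names a single genre (for example, `News`).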

In [9]:
def transform_to_list(entry):
    if ',' in entry:
        return entry.split(',')
    else:
        return []

movie_df['genres'] = movie_df['genres'].apply(transform_to_list)
movie_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0221078,short,"Circle Dance, Ute Indians","Circle Dance, Ute Indians",0,1898.0,,,"[Documentary, Short]"
1,tt8862466,tvEpisode,"¡El #TeamOsos va con todo al ""Reality del amor""!","¡El #TeamOsos va con todo al ""Reality del amor""!",0,2018.0,,,"[Comedy, Drama]"
2,tt7157720,tvEpisode,Episode #3.41,Episode #3.41,0,2016.0,,29.0,"[Comedy, Game-Show]"
3,tt2974998,tvEpisode,Episode dated 16 May 1987,Episode dated 16 May 1987,0,1987.0,,,[]
4,tt2903620,tvEpisode,Frances Bavier: Aunt Bee Retires,Frances Bavier: Aunt Bee Retires,0,1973.0,,,[]
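If single-genre titles should keep their genre, one possible variant is sketched below. The name `genres_to_list` is hypothetical, and the outputs shown elsewhere in this notebook were produced with the original `transform_to_list`, so swapping this in would change them.

```python
def genres_to_list(entry):
    """Split a comma-separated genre string (hypothetical variant of transform_to_list)."""
    # Treat the \N placeholder, empty strings, and non-string values (e.g. NaN)
    # as "no genres available"
    if not isinstance(entry, str) or entry in ('', '\\N'):
        return []
    # A string without a comma yields a one-element list
    return entry.split(',')

print(genres_to_list('Comedy,Drama'))
print(genres_to_list('News'))
print(genres_to_list('\\N'))
```

Checking for the placeholder explicitly, rather than for the presence of a comma, separates "no genre recorded" from "exactly one genre".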


In [10]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9000 entries, 0 to 8999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          9000 non-null   object 
 1   titleType       9000 non-null   object 
 2   primaryTitle    9000 non-null   object 
 3   originalTitle   9000 non-null   object 
 4   isAdult         9000 non-null   int64  
 5   startYear       8338 non-null   float64
 6   endYear         79 non-null     float64
 7   runtimeMinutes  2594 non-null   float64
 8   genres          9000 non-null   object 
dtypes: float64(3), int64(1), object(5)
memory usage: 703.1+ KB


## Data Cleaning: `rating_df`

Preview the first few rows, then check the dataframe information.

In [11]:
rating_df.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.6,1608
1,tt0000002,6.0,197
2,tt0000003,6.5,1285
3,tt0000004,6.1,121
4,tt0000005,6.1,2050


In [12]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030009 entries, 0 to 1030008
Data columns (total 3 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   tconst         1030009 non-null  object 
 1   averageRating  1030009 non-null  float64
 2   numVotes       1030009 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 23.6+ MB


This dataframe contains more than one million entries, significantly larger than `movie_df`.

## Merge Dataframes

The dataframes `movie_df` and `rating_df` are merged with the parameter `how='inner'`. The join is performed on the column `'tconst'`, which exists in both dataframes, so the result contains only the titles whose `tconst` appears in both.

In [13]:
movie_rating_df = pd.merge(movie_df, rating_df, on='tconst', how='inner')
movie_rating_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240
3,tt6941700,tvEpisode,RuPaul Roast,RuPaul Roast,0,2017.0,,,[],8.0,11
4,tt7305674,video,UCLA Track & Field Promo,UCLA Track & Field Promo,0,2017.0,,,"[Short, Sport]",9.7,7


In [14]:
movie_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1376 entries, 0 to 1375
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          1376 non-null   object 
 1   titleType       1376 non-null   object 
 2   primaryTitle    1376 non-null   object 
 3   originalTitle   1376 non-null   object 
 4   isAdult         1376 non-null   int64  
 5   startYear       1376 non-null   float64
 6   endYear         26 non-null     float64
 7   runtimeMinutes  1004 non-null   float64
 8   genres          1376 non-null   object 
 9   averageRating   1376 non-null   float64
 10  numVotes        1376 non-null   int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 129.0+ KB
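The drop from 9000 rows to 1376 follows from inner-join semantics: a title survives only if its `tconst` has a match in the ratings table. A minimal sketch on made-up frames (the names `left` and `right` and their values are assumptions) illustrates this, and uses `indicator=True` on an outer merge to show where each key came from.

```python
import pandas as pd

# Toy frames; keys tt2 and tt3 are shared, tt1 and tt4 are not
left = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                     'primaryTitle': ['A', 'B', 'C']})
right = pd.DataFrame({'tconst': ['tt2', 'tt3', 'tt4'],
                      'averageRating': [6.0, 7.5, 8.1]})

# Inner merge keeps only keys present in both frames
inner = pd.merge(left, right, on='tconst', how='inner')
print(inner['tconst'].tolist())

# Outer merge with indicator=True adds a _merge column:
# 'both', 'left_only', or 'right_only'
outer = pd.merge(left, right, on='tconst', how='outer', indicator=True)
print(outer[['tconst', '_merge']])
```

Running the outer variant on the real dataframes would be one way to count how many titles have no rating at all.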


Drop the rows with missing values in the columns `'startYear'` and `'runtimeMinutes'`.

In [15]:
movie_rating_df = movie_rating_df.dropna(
    subset=['startYear', 'runtimeMinutes'])
movie_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1004 entries, 0 to 1374
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tconst          1004 non-null   object 
 1   titleType       1004 non-null   object 
 2   primaryTitle    1004 non-null   object 
 3   originalTitle   1004 non-null   object 
 4   isAdult         1004 non-null   int64  
 5   startYear       1004 non-null   float64
 6   endYear         17 non-null     float64
 7   runtimeMinutes  1004 non-null   float64
 8   genres          1004 non-null   object 
 9   averageRating   1004 non-null   float64
 10  numVotes        1004 non-null   int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 94.1+ KB


## Building a Simple Recommender System

A simple recommender system is based solely on ranking, i.e., the movies will be sorted by some popularity metrics such as the number of votes, ratings, etc. In this case, the metric that will be used is 'weighted average rating', which takes into account the average rating and number of votes.

$weighted\;rating = \frac{Rv+Cm}{v+m}$

    R = average rating of the title
    C = mean of the average ratings across all titles
    v = number of votes for the title
    m = minimum number of votes required for a title to be ranked
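To see how the formula behaves, the sketch below plugs in two titles from this dataset, using the rounded values of C and m that the notebook prints (6.83 and 229). The function name `weighted_rating` and the variable names are assumptions for illustration; because C is rounded, the results differ slightly from the `score` column computed later.

```python
def weighted_rating(R, v, C, m):
    """Weighted rating (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

C_rounded = 6.83   # rounded mean rating from this notebook
m_rounded = 229    # rounded vote threshold from this notebook

# Many votes relative to m: the score stays close to the title's own rating
print(round(weighted_rating(R=7.1, v=459, C=C_rounded, m=m_rounded), 2))

# Few votes relative to m: the score is pulled toward the global mean C
print(round(weighted_rating(R=5.7, v=7, C=C_rounded, m=m_rounded), 2))
```

The first title (7.1 from 459 votes) scores about 7.01, while the second (5.7 from only 7 votes) scores about 6.80, much nearer to C than to its own rating.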

In [16]:
C = movie_rating_df['averageRating'].mean()
m = movie_rating_df['numVotes'].quantile(0.8)
print(f'C = {C:.2f}',
      f'm = {m:.2f}', sep='\n')

C = 6.83
m = 229.00


Define a function that adds a new column `score`, containing the weighted average rating, to the dataframe and returns that column.

In [17]:
def imdb_weighted_rating(df, var=0.8):
    v = df['numVotes']
    R = df['averageRating']
    C = df['averageRating'].mean()
    m = df['numVotes'].quantile(var)
    df['score'] = (R*v + C*m) / (m + v)  # IMDb formula
    return df['score']

In [18]:
imdb_weighted_rating(movie_rating_df)
movie_rating_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,score
0,tt0043745,short,Lion Down,Lion Down,0,1951.0,,7.0,"[Animation, Comedy, Family]",7.1,459,7.009992
1,tt0167491,video,Wicked Covergirls,Wicked Covergirls,1,1998.0,,85.0,[],5.7,7,6.796077
2,tt6574096,tvEpisode,Shadow Play - Part 2,Shadow Play - Part 2,0,2017.0,,22.0,"[Adventure, Animation, Comedy]",8.5,240,7.68438
5,tt2262289,movie,The Pin,The Pin,0,2013.0,,85.0,[],7.7,27,6.921384
6,tt0874027,tvEpisode,Episode #32.9,Episode #32.9,0,2006.0,,29.0,"[Comedy, Game-Show, News]",8.0,8,6.869089
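The function above writes `score` into the dataframe it receives. If leaving the input untouched is preferable, a non-mutating variant could look like the following sketch; `weighted_rating_series` and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def weighted_rating_series(df, quantile=0.8):
    """Return the weighted rating as a Series without modifying df (hypothetical helper)."""
    v = df['numVotes']
    R = df['averageRating']
    C = R.mean()               # mean rating across all rows
    m = v.quantile(quantile)   # vote threshold at the given quantile
    return (R * v + C * m) / (v + m)

# Made-up data for illustration
toy = pd.DataFrame({
    'averageRating': [9.0, 5.0, 7.0, 8.0],
    'numVotes': [1000, 10, 200, 50],
})

# assign returns a new frame; toy itself gains no 'score' column
scored = toy.assign(score=weighted_rating_series(toy))
print('score' in toy.columns)
print(scored)
```

Returning a Series and attaching it with `assign` makes it explicit at the call site that a new column is being created.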


Create a function that recommends movies based simply on their scores.

In [19]:
def simple_recommender(df, top=100):
    # By default, return the 100 top-scored titles
    df = df.loc[df['numVotes'] >= m]
    df = df.sort_values(by='score', ascending=False)
    df = df[:top]
    return df

Get a recommendation of 25 top-scoring movies:

In [20]:
simple_recommender(movie_rating_df, top=25)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,score
68,tt4110822,tvEpisode,S.O.S. Part 2,S.O.S. Part 2,0,2015.0,,43.0,"[Action, Adventure, Drama]",9.4,3820,9.254624
236,tt2200252,video,Attack of the Clones Review,Attack of the Clones Review,0,2010.0,,86.0,[],9.3,1411,8.955045
1181,tt7697962,tvEpisode,Chapter Seventeen: The Missionaries,Chapter Seventeen: The Missionaries,0,2019.0,,54.0,"[Drama, Fantasy, Horror]",9.2,1536,8.89245
326,tt7124590,tvEpisode,Chapter Thirty-Four: Judgment Night,Chapter Thirty-Four: Judgment Night,0,2018.0,,42.0,"[Crime, Drama, Mystery]",9.1,1859,8.850993
1045,tt0533506,tvEpisode,The Prom,The Prom,0,1999.0,,60.0,"[Action, Drama, Fantasy]",8.9,2740,8.740308
71,tt8399426,tvEpisode,Savages,Savages,0,2018.0,,58.0,"[Drama, Fantasy, Romance]",9.0,1428,8.700045
1234,tt2843830,tvEpisode,VIII.,VIII.,0,2014.0,,57.0,"[Adventure, Drama]",8.9,1753,8.660784
1087,tt4295140,tvSeries,Chef's Table,Chef's Table,0,2015.0,,50.0,[],8.6,12056,8.566998
1054,tt2503932,tvEpisode,Trial and Error,Trial and Error,0,2013.0,,43.0,"[Drama, Fantasy, Horror]",8.6,2495,8.451165
448,tt0337566,video,AC/DC: Live at Donington,AC/DC: Live at Donington,0,1992.0,,120.0,"[Documentary, Music]",8.5,1343,8.256663
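`simple_recommender` reads the vote threshold from the global variable `m` defined earlier. A variant that takes the threshold as an argument may be easier to reuse and test; the sketch below uses a hypothetical name `top_scored` and made-up data.

```python
import pandas as pd

def top_scored(df, min_votes, top=100):
    """Hypothetical variant of simple_recommender with an explicit vote threshold."""
    # Keep only titles with enough votes, then rank by score
    eligible = df.loc[df['numVotes'] >= min_votes]
    return eligible.sort_values(by='score', ascending=False).head(top)

# Made-up data for illustration
toy = pd.DataFrame({
    'primaryTitle': ['A', 'B', 'C', 'D'],
    'numVotes': [500, 20, 300, 1000],
    'score': [8.1, 9.5, 7.2, 8.8],
})

print(top_scored(toy, min_votes=229, top=2)['primaryTitle'].tolist())
```

Title `B` has the highest score but is excluded because it falls below the vote threshold, which is the intended effect of the `numVotes >= m` filter.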


Improve the recommender by taking user preferences into account through three filters: `isAdult`, `startYear`, and `genres`.

In [21]:
def user_prefer_recommender(
    df, ask_adult, ask_start_year, ask_genre, top=100):
    
    # ask_adult = yes/no
    if ask_adult.lower() == 'yes':
        df = df.loc[df['isAdult'] == 1]
    elif ask_adult.lower() == 'no':
        df = df.loc[df['isAdult'] == 0]

    # ask_start_year = numeric
    df = df.loc[df['startYear'] >= int(ask_start_year)]

    # ask_genre = 'all' or specified
    if ask_genre.lower() != 'all':
        def filter_genre(genre):
            # Case-insensitive substring match against the stringified list
            return ask_genre.lower() in str(genre).lower()
        df = df.loc[df['genres'].apply(filter_genre)]

    df = df.loc[df['numVotes'] >= m]
    df = df.sort_values(by='score', ascending=False)
    df = df[:top]
    return df

In [22]:
df = movie_rating_df.copy()

user_prefer_recommender(
    df,
    ask_adult = 'No',
    ask_start_year = 2000,
    ask_genre = 'Romance',
    top=25)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,score
71,tt8399426,tvEpisode,Savages,Savages,0,2018.0,,58.0,"[Drama, Fantasy, Romance]",9.0,1428,8.700045
1281,tt3166390,tvEpisode,Looking for a Plus-One,Looking for a Plus-One,0,2014.0,,28.0,"[Comedy, Drama, Romance]",8.7,396,8.014679
455,tt1115217,tvEpisode,We're Not from Here,We're Not from Here,0,2007.0,,22.0,"[Comedy, Romance]",7.7,2480,7.626421
961,tt6077690,tvEpisode,The Hot Tub Contamination,The Hot Tub Contamination,0,2016.0,,20.0,"[Comedy, Romance]",7.5,1819,7.425036
1090,tt0212815,movie,Big Eden,Big Eden,0,2000.0,,118.0,"[Comedy, Drama, Romance]",7.3,3906,7.273948
691,tt3663996,tvEpisode,Storm Watch,Storm Watch,0,2014.0,,43.0,"[Drama, Romance]",7.0,309,6.927461
164,tt0464049,movie,The History Boys,The History Boys,0,2006.0,,109.0,"[Comedy, Drama, Romance]",6.8,19858,6.800337
927,tt3663990,tvEpisode,Pilot,Pilot,0,2014.0,,44.0,"[Drama, Romance]",6.6,403,6.683187
585,tt5051278,tvMovie,A Midsummer Night's Dream,A Midsummer Night's Dream,0,2016.0,,90.0,"[Comedy, Fantasy, Romance]",6.3,599,6.446466
1126,tt2151739,movie,Meet Me in Montenegro,Meet Me in Montenegro,0,2014.0,,90.0,"[Comedy, Drama, Romance]",5.8,258,6.284136
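The genre filter in `user_prefer_recommender` checks whether the requested text occurs anywhere in the stringified list, so a request for `Music` would also match titles tagged `Musical`. If exact genre matches are wanted, a membership test on the list is one option. The helper name `has_genre` below is hypothetical.

```python
def has_genre(genres, wanted):
    """Return True if wanted equals one of the list entries, ignoring case."""
    wanted = wanted.lower()
    return any(g.lower() == wanted for g in genres)

# Exact match: 'music' does not match 'Musical'
print(has_genre(['Comedy', 'Musical'], 'music'))
print(has_genre(['Documentary', 'Music'], 'music'))

# Substring check as used in filter_genre, for comparison
print('music' in str(['Comedy', 'Musical']).lower())
```

This could replace the body of `filter_genre` with `has_genre(genre, ask_genre)`, under the assumption that every entry in `genres` is a list of strings, as produced by `transform_to_list`.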
