### Recommendation System

**Content Based Recommendation**

Content-based filtering is a type of recommender system that attempts to guess what a user may like based on that user’s activity.

Content-based filtering makes recommendations by using keywords and attributes assigned to objects in a database (e.g., items in an online marketplace) and matching them to a user profile. The user profile is created based on data derived from a user’s actions, such as purchases, ratings (likes and dislikes), downloads, items searched for on a website and/or placed in a cart, and clicks on product links.
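The matching idea can be sketched in a few lines of plain Python. This is only a toy illustration: the items, attributes, and weights below are invented for the example and are not taken from the dataset used later.

```python
# Toy illustration of content-based scoring (all data below is made up).
# Each item is described by a set of attributes.
items = {
    "wireless mouse": {"electronics", "computer", "accessory"},
    "mechanical keyboard": {"electronics", "computer"},
    "laptop stand": {"accessory"},
    "garden hose": {"garden", "outdoor"},
}

# The user profile maps each attribute to a weight derived from past activity
# (for example, summed ratings of items carrying that attribute).
user_profile = {"electronics": 4.5, "computer": 9.0, "accessory": 4.5}


def score(attributes, profile):
    """Sum the profile weights of the attributes an item has."""
    return sum(profile.get(attr, 0.0) for attr in attributes)


# Rank items from best to worst match for this user.
ranked = sorted(items, key=lambda name: score(items[name], user_profile), reverse=True)
print(ranked)
```

Items whose attributes overlap most with the heavily weighted parts of the profile rise to the top; items with no overlap score zero.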

**Importing Datasets**

In [None]:
#Content Based Recommendation System would not require many libraries as it is a fairly simple model
import pandas as pd
movies_df = pd.read_csv('/content/movies.csv', na_values=['na','--','?','-','None','none','non'])
ratings_df = pd.read_csv('/content/ratings.csv', na_values=['na','--','?','-','None','none','non'])


**Exploratory Data Analysis**

In [None]:

movies_df.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


The movies dataset has 3 attributes: movieId, title, and genres.

In [None]:
ratings_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


The common link between the two datasets is the "movieId" column.

**Checking the Shape of the Dataset**

In [None]:
print("Shape of the Movie Table",movies_df.shape)
print("Shape of the Ratings Table",ratings_df.shape)

Shape of the Movie Table (9742, 3)
Shape of the Ratings Table (100836, 4)


Counting unique users

In [None]:
print("No. of Unique Users",len(ratings_df['userId'].unique()))

No. of Unique Users 610


**Data Preparation and Cleaning**

In [None]:
#Extracting the year (still wrapped in parentheses) using a regex; raw strings avoid invalid-escape warnings
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))',expand=False)
#The year still carries its parentheses, so a second extract keeps only the four digits
movies_df['year'] = movies_df['year'].str.extract(r'(\d\d\d\d)',expand=False)
#Removing the year from the title; regex=True is needed because recent pandas versions treat the pattern literally by default
movies_df['title'] = movies_df['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Splitting the pipe-separated genres string into a list of genres
movies_df['genres'] = movies_df['genres'].str.split('|')
movies_df.head()

  


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
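As a hedged alternative, the year extraction and title cleanup can each be done in a single pass. The sketch below runs on a small made-up Series rather than on `movies_df`, and the pattern assumes the year always appears as four digits in parentheses.

```python
import pandas as pd

# Toy titles in the same "Title (YYYY)" layout as movies.csv; the last has no year.
titles = pd.Series(["Toy Story (1995)", "Heat (1995)", "Some Untitled Film"])

# One capture group inside literal parentheses returns just the four digits.
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# regex=True makes pandas treat the pattern as a regular expression;
# \s* also removes the space before the year.
clean_titles = titles.str.replace(r"\s*\(\d{4}\)", "", regex=True)

print(years.tolist())
print(clean_titles.tolist())
```

Titles without a year yield a missing value in `years`, which is consistent with the null entries that show up in the `year` column later on.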


In [None]:
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


Checking important details of the dataset, datatypes etc.

In [None]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
 3   year     9729 non-null   object
dtypes: int64(1), object(3)
memory usage: 304.6+ KB


Checking the same details for the ratings dataset

In [None]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Checking for Null Values

In [None]:
movies_df.isna().sum()

movieId     0
title       0
genres      0
year       13
dtype: int64

In [None]:
#Filling null years with 0 (the rows are kept rather than dropped)
movies_df['year'] = movies_df['year'].fillna(0)
movies_df.dtypes

movieId     int64
title      object
genres     object
year       object
dtype: object
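Note that `year` is still of `object` dtype after the fill, since it now mixes strings with the integer 0. If a numeric column is wanted, one possible conversion is sketched below on a toy Series; both options are suggestions, not steps the notebook itself performs.

```python
import pandas as pd

# Toy year column: strings from the regex step plus one missing value.
year = pd.Series(["1995", "1973", None], dtype=object)

# errors="coerce" turns anything unparseable into NaN instead of raising.
year_num = pd.to_numeric(year, errors="coerce")

# Option 1: mirror the notebook's choice and use 0 as the "unknown" marker.
year_int = year_num.fillna(0).astype(int)

# Option 2: keep the gap explicit with pandas' nullable integer dtype.
year_nullable = year_num.astype("Int64")

print(year_int.tolist())
print(year_nullable.dtype)
```

Option 2 avoids treating 0 as a real year in any later arithmetic or filtering.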

In [None]:
movies_df.head(5)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


To run any algorithm on the genres, we first need to encode them numerically. There are two common options: label encoding and one-hot encoding. We use one-hot encoding here, because a movie can belong to several genres at once.

In [None]:
movies_with_genres = movies_df.copy(deep=True)
# We encode the genres one movie at a time, adding one column per genre
# 1 - if it contains the genre
# 0 - if it does not contain the genre

for index, row in movies_df.iterrows():
    for genre in row['genres']:
        movies_with_genres.at[index, genre] = 1


In [None]:
movies_with_genres = movies_with_genres.fillna(0)
movies_with_genres.head(5)

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
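The row-by-row loop is easy to follow but slow on large tables. A vectorised sketch using `Series.str.get_dummies` is shown below; it assumes the genres are still in their original pipe-separated string form (a list column could be joined back with `.str.join('|')` first), and the toy frame is made up for the example.

```python
import pandas as pd

# Toy rows in the raw movies.csv layout, before genres are split into lists.
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": ["Adventure|Children|Fantasy", "Comedy|Romance", "Comedy"],
})

# One 0/1 column per genre; columns come back in alphabetical order.
genre_flags = toy["genres"].str.get_dummies(sep="|").astype(int)

toy_with_genres = toy.join(genre_flags)
print(toy_with_genres)
```

Unlike the loop, this produces integer flags directly and needs no `fillna(0)` afterwards, but the column order differs from the order-of-first-appearance layout shown above.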


Let's create a sample user profile for whom we will recommend movies.

In [None]:
# Sam's movie ratings on a scale of 0 to 5, with 0 the minimum and 5 the maximum
sam_movie_ratings = [
            {'title':'Predator', 'rating':4.9},
            {'title':'Final Destination', 'rating':4.9},
            {'title':'Mission Impossible', 'rating':4},
            {'title':"Beverly Hills Cop", 'rating':3},
            {'title':'Exorcist, The', 'rating':4.8},
            {'title':'Waiting to Exhale', 'rating':3.9},
            {'title':'Avengers, The', 'rating':4.5},
            {'title':'Omen, The', 'rating':5.0}
         ] 
sam_movie_ratings = pd.DataFrame(sam_movie_ratings)
sam_movie_ratings

Unnamed: 0,title,rating
0,Predator,4.9
1,Final Destination,4.9
2,Mission Impossible,4.0
3,Beverly Hills Cop,3.0
4,"Exorcist, The",4.8
5,Waiting to Exhale,3.9
6,"Avengers, The",4.5
7,"Omen, The",5.0


In [None]:
sam_movie_ratings = sam_movie_ratings.merge(movies_df,how="left",on='title')[['movieId','title','rating','genres']].dropna()
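Because the merge matches on the exact title string, a rated title that is spelled differently from the catalog entry gets no `movieId` and is silently removed by `dropna()`. The sketch below, on made-up stand-in frames, uses the `indicator` option of `merge` to list such titles before they are dropped.

```python
import pandas as pd

# Toy catalog and ratings; the titles are illustrative stand-ins.
catalog = pd.DataFrame({
    "movieId": [10, 11],
    "title": ["Predator", "Mission: Impossible"],
})
rated = pd.DataFrame({
    "title": ["Predator", "Mission Impossible"],
    "rating": [4.9, 4.0],
})

# indicator=True adds a "_merge" column saying where each row came from.
merged = rated.merge(catalog, how="left", on="title", indicator=True)

# Rows marked "left_only" found no catalog entry and would be lost by dropna().
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

The opposite case is also worth keeping in mind: a title shared by several catalog entries matches all of them, so the merged frame can end up with more rows than the original ratings list.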

In [None]:
sam_movie_ratings['rating'][0]

4.9

Learning more about Sam's Profile

In [None]:
sam_genres = sam_movie_ratings.merge(movies_with_genres,on = 'movieId',how = 'left').drop(['title_x','rating','genres_x','title_y','genres_y','year','movieId'],axis = 1)

Having a look at our test case, Sam's Profile

In [None]:
# Weight each movie's genre vector by Sam's rating for that movie, then sum per genre.
# .to_numpy() pairs the rows by position, since the two frames carry different index labels.
sam_profile = sam_genres.mul(sam_movie_ratings['rating'].to_numpy(), axis=0).sum()
sam_profile

Adventure              9.0
Animation              0.0
Children               0.0
Comedy                 6.9
Fantasy                0.0
Romance                3.9
Drama                 11.8
Action                16.9
Crime                  3.0
Thriller              19.8
Horror                14.8
Mystery                9.8
Sci-Fi                 9.4
War                    0.0
Musical                0.0
Documentary            0.0
IMAX                   4.5
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64

Sam's largest weights are for Thriller (19.8) and Action (16.9), followed by Horror (14.8), so these are the genres the recommendations will favour.

Doing the Recommendation

In [None]:
#Setting the movieID as index of movies_with_genres table
movies_with_genres = movies_with_genres.set_index(movies_with_genres.movieId)
movies_with_genres.drop(['movieId','title','genres','year'], axis=1, inplace=True)
movies_with_genres.head(5)

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,War,Musical,Documentary,IMAX,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Multiply the genres by the weights and then take the weighted average.
recommendation_table_df = (movies_with_genres.dot(sam_profile)) / sam_profile.sum()

# Let's view the recommendation table
recommendation_table_df.head()

movieId
1    0.144809
2    0.081967
3    0.098361
4    0.205829
5    0.062842
dtype: float64
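To make the scoring step concrete, the short check below recomputes two of these scores by hand in plain Python. The profile weights are copied from Sam's profile output and the genre lists from `movies_df`; the helper name `score` is only for this illustration.

```python
# Sam's profile weights, copied from the profile output (zero-weight genres omitted).
sam_profile = {
    "Adventure": 9.0, "Comedy": 6.9, "Romance": 3.9, "Drama": 11.8,
    "Action": 16.9, "Crime": 3.0, "Thriller": 19.8, "Horror": 14.8,
    "Mystery": 9.8, "Sci-Fi": 9.4, "IMAX": 4.5,
}
total_weight = sum(sam_profile.values())  # normalising constant

# Genre lists for two movies, as shown in movies_df.
movie_genres = {
    "Toy Story": ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
    "Waiting to Exhale": ["Comedy", "Drama", "Romance"],
}


def score(genres):
    """Dot product of a 0/1 genre vector with the profile, divided by the total weight."""
    return sum(sam_profile.get(g, 0.0) for g in genres) / total_weight


for title, genres in movie_genres.items():
    print(title, round(score(genres), 6))
```

The results agree with the entries for movieId 1 (0.144809) and movieId 4 (0.205829) in the table. Because the score is a sum over a movie's genres, movies tagged with many of Sam's preferred genres rank higher than movies tagged with only one of them.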

Sorting the values by score in descending order

Final results: the top 15 recommended movies for Sam.

In [None]:
#Getting the top 15 recommended movies 
top_15_index = recommendation_table_df.sort_values(ascending = False).head(15).index.tolist()
movies_df[movies_df['movieId'].isin(top_15_index)][['title','genres','year']]

Unnamed: 0,title,genres,year
167,Strange Days,"[Action, Crime, Drama, Mystery, Sci-Fi, Thriller]",1995
4370,"Crazies, The (a.k.a. Code Name: Trixie)","[Action, Drama, Horror, Sci-Fi, Thriller]",1973
4843,Ichi the Killer (Koroshiya 1),"[Action, Comedy, Crime, Drama, Horror, Thriller]",2001
5556,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,"[Action, Animation, Crime, Drama, Film-Noir, M...",1989
5802,Night Watch (Nochnoy dozor),"[Action, Fantasy, Horror, Mystery, Sci-Fi, Thr...",2004
5980,"Cave, The","[Action, Adventure, Horror, Mystery, Sci-Fi, T...",2005
6145,Pulse,"[Action, Drama, Fantasy, Horror, Mystery, Sci-...",2006
6429,Pusher III: I'm the Angel of Death,"[Action, Comedy, Drama, Horror, Thriller]",2005
6546,"Invasion, The","[Action, Drama, Horror, Sci-Fi, Thriller]",2007
6789,Rogue,"[Action, Adventure, Horror, Sci-Fi, Thriller]",2007
