# Case Study - Recommendation Systems Part 2

---------------
## Context
---------------

Online streaming platforms like **Netflix** have large catalogs of movies. If we can build a **recommendation system** that recommends relevant movies to users based on **their historical interactions**, this would improve customer satisfaction and hence **improve revenue**. The techniques that we will learn here are not limited to movies; they can be applied to any item for which you want to build a recommendation system.

-----------------
## Objective
-----------------

Using the datasets described below, we will build two different types of recommendation systems - 
- **Clustering-based recommendation system**
- **Content-based recommendation system**

-----------------
## Dataset
-----------------

We will use three datasets for this case study.
- **ratings** dataset that contains the following attributes: 
    - userId
    - movieId
    - rating
    - timestamp

- **movies** dataset that contains the following attributes:
    - movieId
    - title
    - genres
 
- **tags** dataset that contains the following attributes:
    - userId
    - movieId
    - tag: Brief comments about the movie
    - timestamp

### Importing Libraries

In [None]:
#Importing necessary libraries.
import warnings #Used to suppress warning messages in the output.
warnings.filterwarnings('ignore')

import numpy as np #A basic python package for numerical operations
import pandas as pd # A python library to process and do computations on the data frames.

import matplotlib.pyplot as plt # A basic library to do data visualizations
import seaborn as sns # A slightly advanced one for data visualization.
from collections import defaultdict #A dictionary that supplies a default value for missing keys instead of raising a KeyError
from sklearn.metrics.pairwise import cosine_similarity #To compute the cosine similarity between vectors

### Loading data

In [None]:
#Loading the movies dataset
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
movies.shape

(9742, 3)

In [None]:
#Loading the ratings dataset
ratings = pd.read_csv('ratings.csv')

In [None]:
ratings.shape

Let's **merge both the datasets** to get the title and rating of each movie in a single dataframe

In [None]:
#Merging datasets on movieId 
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title']], on='movieId', how='inner')
ratings_with_title.head()

Unnamed: 0,userId,movieId,rating,timestamp,title
0,1,1,4.0,964982703,Toy Story (1995)
1,5,1,4.0,847434962,Toy Story (1995)
2,7,1,4.5,1106635946,Toy Story (1995)
3,15,1,2.5,1510577970,Toy Story (1995)
4,17,1,4.5,1305696483,Toy Story (1995)
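
An inner merge keeps only the `movieId` values found in both dataframes, so rows can disappear without any warning. The sketch below uses two made-up frames (`toy_ratings` and `toy_movies` are hypothetical, not the MovieLens files) to show how an outer merge with `indicator=True` exposes the rows that would be lost:

```python
import pandas as pd

# Toy frames (hypothetical data, not the MovieLens files)
toy_ratings = pd.DataFrame({"userId": [1, 1, 2], "movieId": [10, 20, 30], "rating": [4.0, 3.5, 5.0]})
toy_movies = pd.DataFrame({"movieId": [10, 20, 40], "title": ["A", "B", "D"]})

# Inner merge: only movieIds present in both frames survive
inner = pd.merge(toy_ratings, toy_movies, on="movieId", how="inner")

# Outer merge with an indicator column shows where each row came from
outer = pd.merge(toy_ratings, toy_movies, on="movieId", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(toy_ratings), len(inner))
print(unmatched[["movieId", "_merge"]])
```

Comparing the row counts before and after the merge, as done here, is a quick way to check whether any ratings were dropped.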


Let's check the **info** of the data 

In [None]:
ratings_with_title.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
 4   title      100836 non-null  object 
dtypes: float64(1), int64(3), object(1)
memory usage: 4.6+ MB


- There are **100,836 observations** and **5 columns** in the data. 
- Compare this with the output of `ratings.shape`: if the merged dataframe has **fewer observations** than the ratings data, some movies in the ratings data are missing from the movies data, because an inner merge keeps only the `movieId` values present in both.
- All the columns are of **numeric data type** except the **title column** which is of **object data type**.
- The timestamp column is stored as **int64** (Unix epoch seconds) rather than as a DateTime. We could convert it, but **we don't need timestamp for our analysis**. Hence, **we can drop this column**.

In [None]:
#Dropping timestamp column
rating = ratings_with_title.drop(['timestamp'], axis=1)
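
Although the timestamp is dropped here, it can help to see what the integers mean. The sketch below decodes the first timestamp shown in the merged data, assuming the values are seconds since 1970-01-01 UTC:

```python
from datetime import datetime, timezone

# First timestamp shown in the merged data above
ts = 964982703

# Interpreting it as seconds since 1970-01-01 UTC (an assumption about the file's encoding)
rated_at = datetime.fromtimestamp(ts, tz=timezone.utc)
print(rated_at.isoformat())
```

pandas offers the same conversion for a whole column through `pd.to_datetime(..., unit='s')`.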

* Let us now compute the **average rating** and the **number of ratings** for each movie. We will use both later to rank the recommended movies.

In [None]:
#Calculating average ratings (the rating column is selected first so that the non-numeric title column is not averaged)
average_rating = rating.groupby('movieId')['rating'].mean()

#Calculating the count of ratings
count_rating = rating.groupby('movieId')['rating'].count()

#Making a dataframe with the count and average of ratings
final_rating = pd.DataFrame({'avg_rating':average_rating, 'rating_count':count_rating})

In [None]:
final_rating.head()

Unnamed: 0_level_0,avg_rating,rating_count
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,3.92093,215
2,3.431818,110
3,3.259615,52
4,2.357143,7
5,3.071429,49
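
As an alternative sketch, the same two columns can be built in a single `groupby` pass with named aggregation. The frame `toy` below is made-up data standing in for the `rating` dataframe:

```python
import pandas as pd

# Toy ratings (hypothetical values) standing in for the `rating` dataframe
toy = pd.DataFrame({"movieId": [1, 1, 2, 2, 2], "rating": [4.0, 5.0, 3.0, 3.0, 4.5]})

# Named aggregation builds both columns in a single groupby pass
summary = toy.groupby("movieId")["rating"].agg(avg_rating="mean", rating_count="count")
print(summary)
```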


### Exploring the dataset

Let's explore the dataset and answer some basic data-related questions 

#### Q1. How many unique users are present in the above dataset?

In [None]:
ratings_with_title['userId'].nunique()

610

- There are **610 users** in the dataset

#### Q2. What is the total number of unique movies?

In [None]:
ratings_with_title['title'].nunique()

9719

- There are **9719 unique movie titles** in the dataset

- To demonstrate the clustering-based recommendation system, the **Surprise** package is introduced in this case study. 
- Please use the following command to install the `surprise` library. You only need to do it **once**, when running the code for the first time.

**!pip install surprise**

In [None]:
# To compute the accuracy of models
from surprise import accuracy

# class is used to parse a file containing ratings, data should be in structure - user ; item ; rating
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the rating data in train and test dataset
from surprise.model_selection import train_test_split

#For implementing clustering based recommendation system
from surprise import CoClustering

#### Before building the recommendation systems, let's go over some basic terminology we are going to use:

**Relevant item** - An item (movie in this case) whose **actual rating is at or above the threshold rating (here 3.5)** is relevant; if the **actual rating is below the threshold then it is a non-relevant item**.  

**Recommended item** - An item whose **predicted rating is at or above the threshold (here 3.5) is a recommended item**; if the **predicted rating is below the threshold then that movie will not be recommended to the user**.  


**False Negative (FN)** - It is the **frequency of relevant items that are not recommended to the user**. If the relevant items are not recommended to the user, then the user might not buy the product/item. This would result in the **loss of opportunity for the service provider** which they would like to minimize.

**False Positive (FP)** - It is the **frequency of recommended items that are actually not relevant**. In this case, the recommendation system is not doing a good job of finding and recommending the relevant items to the user. This would result in **loss of resources for the service provider** which they would also like to minimize.

**Recall** - It is the **fraction of actually relevant items that are recommended to the user**, i.e. if out of 10 relevant movies, 6 are recommended to the user then recall is 0.60. The higher the recall, the better the model. It is one of the metrics used to assess the performance of classification models.

**Precision** - It is the **fraction of recommended items that are actually relevant**, i.e. if out of 10 recommended items, 6 are found relevant by the user then precision is 0.60. The higher the precision, the better the model. It is one of the metrics used to assess the performance of classification models.

**When building a recommendation system, it is customary to evaluate how many of its recommendations are relevant and how many of the relevant items it recommends. Below are the most commonly used performance metrics for recommendation systems.**

### Precision@k and Recall@k

**Precision@k** - It is the **fraction of recommended items that are relevant in `top k` predictions**. Value of k is the number of recommendations to be provided to the user. One can choose a variable number of recommendations to be given to a unique user.  


**Recall@k** - It is the **fraction of relevant items that are recommended to the user in `top k` predictions**.

**F1-Score@k** - It is the **harmonic mean of Precision@k and Recall@k**. When **precision@k and recall@k both seem to be important** then it is useful to use this metric because it is representative of both of them. 
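
To make these definitions concrete, here is a small worked example for a single user. The rating pairs are made up for illustration, and the logic mirrors the definitions above with `k = 4` and a threshold of 3.5:

```python
# One user's test-set predictions as (estimated, true) rating pairs (made-up values)
user_ratings = [(4.8, 5.0), (4.2, 3.0), (3.9, 4.0), (3.6, 2.5), (3.1, 4.5), (2.0, 1.0)]
k, threshold = 4, 3.5

# Rank by estimated rating, highest first, and keep the top k
top_k = sorted(user_ratings, key=lambda pair: pair[0], reverse=True)[:k]

# Relevant items overall, recommended items in the top k, and items that are both
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
n_rec_k = sum(est >= threshold for est, _ in top_k)
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold for est, true_r in top_k)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0
denom = precision_at_k + recall_at_k
f1_at_k = 2 * precision_at_k * recall_at_k / denom if denom else 0
print(precision_at_k, round(recall_at_k, 3), round(f1_at_k, 3))
```

Here 2 of the 4 recommended items are relevant (precision 0.5) and 2 of the 3 relevant items are recommended (recall about 0.667), which gives an F1 score of about 0.571.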

### Some useful functions

- Below function takes the **recommendation model** as input and gives the **precision@k and recall@k** for that model.  
- To compute **precision and recall**, **top k** predictions are taken under consideration for each user.

In [None]:
def precision_recall_at_k(model, k=10, threshold=3.5):
    """Print RMSE and the user-averaged precision@k, recall@k and F_1 score (uses the global testset)"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    #Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    #Mean of the per-user precisions
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)),3)
    #Mean of the per-user recalls
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)),3)
    
    #F_1 score is the harmonic mean of precision and recall; it is set to 0 when both are 0
    f1 = round((2*precision*recall)/(precision+recall),3) if (precision+recall) != 0 else 0
    
    accuracy.rmse(predictions) #Prints the RMSE on the test set
    print('Precision: ', precision) #Overall precision
    print('Recall: ', recall) #Overall recall
    print('F_1 score: ', f1) #Overall F_1 score

Below we are loading the **`rating` dataset**, which is a **pandas dataframe**, into a **different format called `surprise.dataset.DatasetAutoFolds`** which is required by this library. To do this we will be **using the classes `Reader` and `Dataset`**

In [None]:
# instantiating Reader scale with expected rating scale
reader = Reader(rating_scale=(0, 5))

# loading the rating dataset
data = Dataset.load_from_df(rating[['userId', 'movieId', 'rating']], reader)

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in movies based on how they rate different movies. We cluster similar users together and recommend movies to a user based on ratings from other users in the same cluster.

- **Co-clustering** is a set of techniques in **Cluster Analysis**. Given some **matrix A, we want to cluster rows of A and columns of A simultaneously**. This is a common task for user-item matrices. 

- As it clusters both the rows and columns simultaneously, it is also called **bi-clustering**. To understand the working of the algorithm, let A be an m x n matrix. The goal is to generate co-clusters: a subset of rows that exhibit similar behavior across a subset of columns, or vice versa.

- Co-clustering is defined as two map functions:
    - rows -> row cluster indexes
    - columns -> column cluster indexes

  These map functions are learned simultaneously. It is **different from other clustering techniques** where we cluster **first the rows and then the columns**. 
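
To show how the two map functions turn into a rating estimate, here is a minimal sketch on a made-up 4 x 4 matrix. It assumes the prediction rule documented for Surprise's `CoClustering`: the co-cluster average, plus the user's offset from the user-cluster average, plus the item's offset from the item-cluster average. The cluster assignments are fixed by hand here; the real algorithm learns them.

```python
# Toy user-item matrix; None marks a missing rating (values are made up)
R = [
    [5, 4, None, 1],
    [4, None, 1, 2],
    [1, 2, 5, None],
    [None, 1, 4, 5],
]
# Assumed (not learned) cluster assignments: the two map functions
user_cluster = [0, 0, 1, 1]   # rows -> row cluster indexes
item_cluster = [0, 0, 1, 1]   # columns -> column cluster indexes

def mean_where(keep):
    """Average the known ratings R[u][i] for which keep(u, i) is true."""
    vals = [R[u][i] for u in range(len(R)) for i in range(len(R[0]))
            if R[u][i] is not None and keep(u, i)]
    return sum(vals) / len(vals)

def estimate(u, i):
    cu, ci = user_cluster[u], item_cluster[i]
    co_cluster_avg = mean_where(lambda a, b: user_cluster[a] == cu and item_cluster[b] == ci)
    user_cluster_avg = mean_where(lambda a, b: user_cluster[a] == cu)
    item_cluster_avg = mean_where(lambda a, b: item_cluster[b] == ci)
    user_avg = mean_where(lambda a, b: a == u)
    item_avg = mean_where(lambda a, b: b == i)
    # Co-cluster average, shifted by how the user and the item deviate from their clusters
    return co_cluster_avg + (user_avg - user_cluster_avg) + (item_avg - item_cluster_avg)

est = estimate(0, 2)
print(round(est, 3))
```

For user 0 and item 2 the co-cluster average is 4/3, the user rates 0.5 above the user-cluster average, and the item sits 1/3 above the item-cluster average, so the estimate is about 2.167.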

In [None]:
# using CoClustering algorithm.
clust_baseline = CoClustering(random_state=1)

# training the algorithm on the trainset
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_baseline)

RMSE: 0.9490
Precision:  0.717
Recall:  0.502
F_1 score:  0.591


- We have calculated **RMSE** to check **how far the overall predicted ratings** are from the **actual ratings**.
- Here the **F_1 score** of the **baseline model is ~0.59**. A precision of ~0.72 means that **most recommended movies were relevant**, while a recall of ~0.50 means that only **about half of the relevant movies were recommended**. We will try to improve this later by **tuning the hyperparameters** of this algorithm using **GridSearchCV**.

- Let's now predict a rating for a user with `userId=4` and `movieId=10` as shown below
- Here the user has already rated the movie.

In [None]:
#Making prediction for user_id 4 and movie_id 10.
clust_baseline.predict(4, 10, r_ui=4, verbose=True)

user: 4          item: 10         r_ui = 4.00   est = 3.68   {'was_impossible': False}


Prediction(uid=4, iid=10, r_ui=4, est=3.6757402992691386, details={'was_impossible': False})

As we can see, the **actual rating** for this **user-item pair is 4** and the **rating predicted by the baseline Co-clustering model is 3.68**. The model has under-estimated the rating by a small margin. We will try to fix this later by tuning the hyperparameters of the model using GridSearchCV.

Below we predict the rating for the same `userId=4` but for a movie which this user has not interacted with before, i.e. `movieId=3` - 

In [None]:
#Making prediction for userid 4 and movieId 3.
clust_baseline.predict(4, 3, verbose=True)

user: 4          item: 3          r_ui = None   est = 3.26   {'was_impossible': False}


Prediction(uid=4, iid=3, r_ui=None, est=3.258169827544438, details={'was_impossible': False})

#### Improving clustering based recommendation system by tuning its hyper-parameters

Below we will be tuning hyper-parameters for the `CoClustering` algorithm. Let's try to understand the different hyperparameters of this algorithm - 

- **n_cltr_u** (int) – Number of **user clusters**. Default is 3.
- **n_cltr_i** (int) – Number of **item clusters**. Default is 3.
- **n_epochs** (int) – Number of **iterations of the optimization loop**. Default is 20.
- **random_state** (int, RandomState instance from NumPy, or None) – Determines the RNG that will be used for initialization. If int, random_state will be used as a seed for a new RNG. This is useful to get the same initialization over multiple calls to fit(). If RandomState instance, this same instance is used as RNG. If None, the current RNG from numpy is used. Default is None.
- **verbose** (bool) – If True, the current epoch will be printed. Default is False.

In [None]:
# set the parameter space to tune
param_grid = {'n_cltr_u':[3,4,5,6], 'n_cltr_i': [3,4,5,6], 'n_epochs': [30,40,50]}

# performing 3-fold gridsearch cross validation
gs = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9542812996581099
{'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 30}
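
To get a feel for the size of the search above, the sketch below enumerates the same parameter grid with the standard library. It only counts candidates and does not fit any model:

```python
from itertools import product

# Same grid as in the search above
param_grid = {'n_cltr_u': [3, 4, 5, 6], 'n_cltr_i': [3, 4, 5, 6], 'n_epochs': [30, 40, 50]}
cv = 3

# Every combination of the listed values is one candidate model
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos), "candidates,", len(combos) * cv, "fits with", cv, "folds")
```

With 4 x 4 x 3 = 48 candidates and 3 folds, the search trains 144 models, which is why widening the grid quickly becomes expensive.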


Once the grid search is **complete**, we can get the **optimal values** for each of those hyperparameters as shown above

Now we will build **final model** by using tuned values of the hyperparameters which we received by using grid search cross-validation

In [None]:
# using tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u=3,n_cltr_i=3, n_epochs=30, random_state=1)

# training the algorithm on the trainset
clust_tuned.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_tuned)

RMSE: 0.9490
Precision:  0.715
Recall:  0.5
F_1 score:  0.588


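- We can see that the **F_1 score** of the **tuned co-clustering model (0.588)** is **slightly lower than the F_1 score** of the baseline co-clustering model (0.591), while the RMSE on the test set is unchanged. The grid search selected the default number of user and item clusters and only changed `n_epochs`, so a large difference was not expected.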

- Let's now **predict rating** for a user with `userId=4` and for `movieId=10` as shown below
- Here the user has already rated the movie.

In [None]:
#Using the tuned co-clustering model to predict the rating for userId 4 and movieId 10.
clust_tuned.predict(4, 10, r_ui=4, verbose=True)

user: 4          item: 10         r_ui = 4.00   est = 3.64   {'was_impossible': False}


Prediction(uid=4, iid=10, r_ui=4, est=3.6438089875094075, details={'was_impossible': False})

If we compare the two predicted ratings, we can see that the **predicted rating from the baseline model (3.68) is slightly closer to the actual rating (4)** than the one from the tuned model (3.64).

In [None]:
#Using the tuned co-clustering model to predict the rating for userId 4 and movieId 3, for which the actual rating is unknown.
clust_tuned.predict(4, 3, verbose=True)

user: 4          item: 3          r_ui = None   est = 3.25   {'was_impossible': False}


Prediction(uid=4, iid=3, r_ui=None, est=3.250249996095622, details={'was_impossible': False})

#### Implementing the recommendation algorithm based on the optimized CoClustering model

Below we will be implementing a function where the input parameters are - 

- data: a **rating** dataset
- user_id: the user id **for which we want the recommendations**
- top_n: the **number of movies we want to recommend**
- algo: the algorithm we want to use **for predicting the ratings**

The output of the function is a **list of top_n (movieId, predicted rating) pairs** recommended for the given user_id based on the given algorithm.

In [None]:
def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store the recommended movie ids
    recommendations = []
    
    # creating an user item interactions matrix 
    user_item_interactions_matrix = data.pivot(index='userId', columns='movieId', values='rating')
    
    # extracting those movie ids which the user_id has not interacted yet
    non_interacted_movies = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the movie ids which user_id has not interacted yet
    for item_id in non_interacted_movies:
        
        # predicting the ratings for those non interacted movie ids by this user
        est = algo.predict(user_id, item_id).est
        
        # appending the predicted ratings
        recommendations.append((item_id, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returning top n highest predicted rating movies for this user

In [None]:
#Getting top 5 recommendations for user_id 4 using "Co-clustering based optimized" algorithm.
clustering_recommendations = get_recommendations(rating, 4, 5, clust_tuned)
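
The selection logic of `get_recommendations` can be tried without a trained model. In the sketch below, `top_n_unseen` and `fake_estimates` are hypothetical names: the dictionary stands in for `algo.predict(user_id, item_id).est`, and `heapq.nlargest` replaces the full sort, which is one possible design choice when only a few items are needed.

```python
import heapq

def top_n_unseen(seen_items, all_items, predict, top_n):
    """Return the top_n (item, estimate) pairs among items the user has not rated."""
    candidates = ((item, predict(item)) for item in all_items if item not in seen_items)
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])

# Hypothetical stand-in for algo.predict(user_id, item_id).est
fake_estimates = {1: 3.2, 2: 4.7, 3: 4.1, 4: 2.5, 5: 4.9}
result = top_n_unseen(seen_items={5}, all_items=fake_estimates, predict=fake_estimates.get, top_n=2)
print(result)
```

Item 5 has the highest estimate but is skipped because the user has already rated it, so items 2 and 3 are returned.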

### Correcting the Ratings and Ranking the above movies

While comparing two movies, the **rating** alone does not describe **how likely a user is to like a movie**. The **number of users who have rated that movie** is also important to consider, because a rating based on many users is more reliable than one based on a few. For this reason, we calculate a **"corrected_ratings"** value for each movie. For example, a **rating of 4 with a rating_count of 3 is weaker evidence than a rating of 3 with a rating_count of 50**. The correction used here **subtracts the inverse of the square root of the rating_count** from the predicted rating, so the penalty shrinks as more users rate the movie.

In [None]:
def ranking_movies(recommendations, final_rating):
  # sort the movies based on ratings count
  ranked_movies = final_rating.loc[[items[0] for items in recommendations]].sort_values('rating_count', ascending=False)[['rating_count']].reset_index()

  # merge with the recommended movies to get predicted ratings
  ranked_movies = ranked_movies.merge(pd.DataFrame(recommendations, columns=['movieId', 'predicted_ratings']), on='movieId', how='inner')

  # rank the movies based on corrected ratings
  ranked_movies['corrected_ratings'] = ranked_movies['predicted_ratings'] - 1 / np.sqrt(ranked_movies['rating_count'])

  # sort the movies based on corrected ratings
  ranked_movies = ranked_movies.sort_values('corrected_ratings', ascending=False)
  
  return ranked_movies

**Note:** In the **corrected rating formula above**, we could add the **quantity `1/np.sqrt(n)` instead of subtracting it to get more optimistic predictions**. But here we are **subtracting this quantity**, as there are some movies with a rating of 5 and **we can't have a rating of more than 5 for a movie**.

In [None]:
#Ranking movies based on above recommendations
ranking_movies(clustering_recommendations, final_rating)

Unnamed: 0,movieId,rating_count,predicted_ratings,corrected_ratings
0,304,3,5,4.42265
1,53,2,5,4.292893
2,99,2,5,4.292893
3,238,2,5,4.292893
4,148,1,5,4.0
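
The corrected ratings in the table above can be checked by hand. The helper `corrected_rating` below is a hypothetical name that applies the same formula as `ranking_movies` to single values:

```python
import math

def corrected_rating(predicted, rating_count):
    # Same correction as in ranking_movies: subtract 1 / sqrt(rating_count)
    return predicted - 1 / math.sqrt(rating_count)

# Rows from the table above: predicted rating 5 with 3, 2 and 1 ratings
for count in (3, 2, 1):
    print(count, round(corrected_rating(5, count), 6))

# Two hypothetical movies: rated 4 by 3 users and rated 3 by 50 users
print(round(corrected_rating(4, 3), 3), round(corrected_rating(3, 50), 3))
```

Note that the correction narrows the gap between a movie rated 4 with 3 ratings (about 3.42) and one rated 3 with 50 ratings (about 2.86) without reversing their order.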


**Let us now move to the final recommendation algorithm, the content-based recommendation system.**

### Content Based Recommendation System

In a **content-based recommendation system**, we use **text** features, e.g. reviews, to find similar movies.

Text data generally contains punctuation, stopwords, and non-ASCII characters, which make it **very noisy**. So, we will first need to **pre-process the text** and then we will **generate features from the text to compute similarities** between the texts/reviews.

Let's load the **tags dataset**

In [None]:
tags = pd.read_csv('tags.csv')
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In this dataset, we don't have any movie reviews or plots, so we will combine the columns **title** and **genres** from the movies dataset with **tag** from the tags dataset to create a text-based feature. We will then apply the **tfidf** feature extraction technique to these texts and use the resulting features to find similar movies.

In [None]:
#Merging all the three datasets on movieId
ratings_with_title = pd.merge(ratings, movies[['movieId', 'title', 'genres']], on='movieId' )
final_ratings = pd.merge(ratings_with_title, tags[['movieId', 'tag']], on='movieId' )
final_ratings

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,tag
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
1,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
2,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,fun
3,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
4,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,pixar
...,...,...,...,...,...,...,...
233208,599,176419,3.5,1516604655,Mother! (2017),Drama|Horror|Mystery|Thriller,uncomfortable
233209,599,176419,3.5,1516604655,Mother! (2017),Drama|Horror|Mystery|Thriller,unsettling
233210,594,7023,4.5,1108972356,"Wedding Banquet, The (Xi yan) (1993)",Comedy|Drama|Romance,In Netflix queue
233211,606,6107,4.0,1171324428,Night of the Shooting Stars (Notte di San Lore...,Drama|War,World War II


- We can see that **multiple genres are separated by | which we need to remove**.
- We will combine the three columns title, genres, and tag

In [None]:
#Replacing | character with space in genres column
final_ratings['genres'] = final_ratings['genres'].apply(lambda x: " ".join(x.split('|')))

In [None]:
#Combining title, genres, and tag columns
final_ratings['text'] = final_ratings['title'] + ' ' + final_ratings['genres'] + ' ' + final_ratings['tag']
final_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,tag,text
0,1,1,4.0,964982703,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar,Toy Story (1995) Adventure Animation Children ...
1,1,1,4.0,964982703,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar,Toy Story (1995) Adventure Animation Children ...
2,1,1,4.0,964982703,Toy Story (1995),Adventure Animation Children Comedy Fantasy,fun,Toy Story (1995) Adventure Animation Children ...
3,5,1,4.0,847434962,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar,Toy Story (1995) Adventure Animation Children ...
4,5,1,4.0,847434962,Toy Story (1995),Adventure Animation Children Comedy Fantasy,pixar,Toy Story (1995) Adventure Animation Children ...


Now, we will **keep only five columns** - userId, movieId, rating, title, and text. We will drop the duplicate titles from the data and set the **title column as the index** of the dataframe, which leaves four data columns.

In [None]:
final_ratings = final_ratings[['userId', 'movieId', 'rating', 'title', 'text']]
final_ratings = final_ratings.drop_duplicates(subset=['title'])
final_ratings = final_ratings.set_index('title')
final_ratings.head()

Unnamed: 0_level_0,userId,movieId,rating,text
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Toy Story (1995),1,1,4.0,Toy Story (1995) Adventure Animation Children ...
Grumpier Old Men (1995),1,3,4.0,Grumpier Old Men (1995) Comedy Romance moldy
Seven (a.k.a. Se7en) (1995),1,47,5.0,Seven (a.k.a. Se7en) (1995) Mystery Thriller m...
"Usual Suspects, The (1995)",1,50,5.0,"Usual Suspects, The (1995) Crime Mystery Thril..."
Bottle Rocket (1996),1,101,5.0,Bottle Rocket (1996) Adventure Comedy Crime Ro...


In [None]:
final_ratings.shape

(1554, 4)

**Now, let's process the text data and create features to find the similarity between movies**

#### Loading libraries to handle text data

In [None]:
#Importing nltk(natural language toolkit library)
import nltk
nltk.download('punkt') #Downloading the Punkt tokenizer models used by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('stopwords') #Downloading stopwords
nltk.download('wordnet') #Downloading wordnet

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
import re #This is importing regular expression
from nltk import word_tokenize # Word_tokenize is used to do tokenization.
from nltk.stem import WordNetLemmatizer #Importing the lemmatizer
from nltk.corpus import stopwords # Importing the stopwords
from sklearn.feature_extraction.text import TfidfVectorizer # Tfidf vectoriser used to create the computational vectors

We will create a **function to pre-process the text data**. Before that, let's see some **terminology**:
- **stopwords:** A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that does not contain information in the text and can be ignored.
- **Lemmatization:** Lemmatization in linguistics is the process of grouping together the inflected forms of a word so they can be analyzed as a single item. For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words.

In [None]:
def tokenize(text):
    
    #Making each letter lowercase and replacing non-alphabetical characters with spaces
    text = re.sub(r"[^a-zA-Z]"," ",text.lower())
    
    #Extracting each word in the text
    tokens = word_tokenize(text)
    
    #Removing stopwords (the stopword set is built once per call for fast lookups)
    stop_words = set(stopwords.words("english"))
    words = [word for word in tokens if word not in stop_words]
    
    #Lemmatizing the words with a single lemmatizer instance
    lemmatizer = WordNetLemmatizer()
    text_lems = [lemmatizer.lemmatize(word).strip() for word in words]

    return text_lems
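
The cleaning and stop-word steps can also be tried without downloading any NLTK data. The sketch below is a simplified stand-in, not the function above: `simple_tokenize` and `STOP_WORDS` are hypothetical names, the stop-word set is a tiny hand-picked sample, a whitespace split replaces `word_tokenize`, and lemmatization is skipped.

```python
import re

# Tiny hand-picked stop-word set (hypothetical; NLTK's English list is much longer)
STOP_WORDS = {"the", "a", "an", "in", "of", "and"}

def simple_tokenize(text):
    # Lowercase and replace anything that is not a letter with a space
    cleaned = re.sub(r"[^a-zA-Z]", " ", text.lower())
    # Whitespace split stands in for word_tokenize; no lemmatization here
    return [word for word in cleaned.split() if word not in STOP_WORDS]

tokens = simple_tokenize("Usual Suspects, The (1995) Crime Mystery Thriller")
print(tokens)
```

The year and the punctuation disappear because only letters are kept, and "the" is removed as a stop word.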

### Feature Extraction

Below are some of the ways to extract features from texts - 
- Bag of words
- TF-IDF
- One hot encoding
- Word vectors

![alt text](tfidf.png)

Here, we will be using **tfidf** as a feature extraction technique

In [None]:
tfidf = TfidfVectorizer(tokenizer=tokenize)
movie_tfidf = tfidf.fit_transform(final_ratings['text'].values).toarray()
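
To see what a tf-idf weight looks like, here is a pure-Python sketch on three made-up documents. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the helper `tfidf_vector` is a hypothetical name.

```python
import math
from collections import Counter

# Three tiny tokenized "documents" (made-up text)
docs = [["crime", "thriller", "twist"], ["crime", "drama"], ["comedy", "romance"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    tf = Counter(doc)
    # Smoothed idf, ln((1 + n) / (1 + df)) + 1, then L2 normalization
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(docs[0])
print({t: round(w, 3) for t, w in vec.items()})
```

The term "crime" appears in two documents, so it receives a lower weight than "thriller" and "twist", which appear in only one.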

In [None]:
pd.DataFrame(movie_tfidf)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,2743,2744,2745,2746,2747,2748,2749,2750,2751,2752,2753,2754,2755,2756,2757,2758,2759,2760,2761,2762,2763,2764,2765,2766,2767,2768,2769,2770,2771,2772,2773,2774,2775,2776,2777,2778,2779,2780,2781,2782
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.215724,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.224745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1549,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1550,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1551,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.483172,0.0,0.483172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1552,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.240926,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have **extracted features from the text data**. Now, we can find similarities between movies using these features. We will use cosine similarity to calculate the similarity.

In [None]:
similar_movies = cosine_similarity(movie_tfidf, movie_tfidf)
similar_movies

array([[1.        , 0.02268393, 0.        , ..., 0.02022472, 0.        ,
        0.        ],
       [0.02268393, 1.        , 0.        , ..., 0.04779055, 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.02022472, 0.04779055, 0.        , ..., 1.        , 0.00719396,
        0.19617374],
       [0.        , 0.        , 0.        , ..., 0.00719396, 1.        ,
        0.01217017],
       [0.        , 0.        , 0.        , ..., 0.19617374, 0.01217017,
        1.        ]])
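
Each entry of this matrix is the cosine of the angle between two feature vectors. A minimal sketch of the formula on made-up three-term vectors (the helper `cosine` is a hypothetical name, not the scikit-learn function):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Convention used here: similarity with an all-zero vector is 0
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Made-up term-weight vectors over a shared 3-term vocabulary
movie_a = [1.0, 0.0, 1.0]
movie_b = [1.0, 1.0, 0.0]
movie_c = [0.0, 1.0, 0.0]

print(round(cosine(movie_a, movie_b), 3), cosine(movie_a, movie_c))
```

Movies that share one of their two terms score about 0.5, movies with no terms in common score 0, and every movie has a similarity of 1 with itself, which is why the diagonal of the matrix above is all ones.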

Finally, let's create a function to find most similar movies to recommend for a given movie

In [None]:
# function that takes in movie title as input and returns the top 10 recommended movies
def recommendations(title, similar_movies):
    
    recommended_movies = []
    
    indices = pd.Series(final_ratings.index)
    
    # getting the index of the movie that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_movies[idx]).sort_values(ascending = False)

    # getting the indices of 10 most similar movies
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)
    
    # populating the list with the titles of the best 10 matching movies
    for i in top_10_indexes:
        recommended_movies.append(list(final_ratings.index)[i])
        
    return recommended_movies

In [None]:
recommendations('Usual Suspects, The (1995)', similar_movies)

[71, 1186, 124, 551, 569, 77, 719, 766, 123, 658]


['Game, The (1997)',
 'Andalusian Dog, An (Chien andalou, Un) (1929)',
 'Town, The (2010)',
 'Now You See Me (2013)',
 'Charade (1963)',
 'Negotiator, The (1998)',
 'Following (1998)',
 '21 Grams (2003)',
 'Inception (2010)',
 'Insomnia (2002)']

- The movie is a **Crime, Mystery, Thriller** movie and the **majority of our recommendations** lie in one or more of these genres, which suggests that the recommendations are reasonable.

### Conclusion

- In this case study, we built recommendation systems using two different approaches. They are as follows:

  - **clustering-based recommendation systems**
  - **content-based recommendation systems**
- To demonstrate clustering-based recommendation systems, the **surprise** library was used. Grid search cross-validation was used to find the best hyperparameters, and the resulting model was used to make predictions.
- For performance evaluation of the clustering-based models, **precision@k and recall@k** were introduced in this case study. Using these two metrics, the F_1 score was calculated for each model.
- We can try to further improve the performance of these models using hyperparameter tuning.