# <b> Recommender Systems </b>

## <b>Learning Objectives</b>

In this lesson, we will cover the following concepts:

- Recommendation system
- The long tail
- A simple popularity-based recommender system
- A collaborative filtering model
- Evaluating a recommendation system

### **What Is a Recommender System?**
 
A recommender or recommendation system is a subclass of an information filtering system that seeks to predict the rating or preference that a user would give to an item.

Let’s consider the example shown in the figure below. Here, we have a user database, that is, data consisting of items rated by users. Now, let’s suppose that a new user visits the website and likes five out of ten items. A recommender system recommends the items that the new user might like, based on similarity with other items. We will dive deeper into this concept in the coming sections.


![recommender_system_info](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/Lesson_08_Recomendar_Systems/recommender_system_info.jpg)

## **The Theory of Long Tail**

- It shows how products in low demand or with low sales volume can collectively make up a market share that exceeds that of the relatively few current bestsellers and blockbusters, but only if the store or distribution channel is large enough.

- The long tail concept looks at less popular goods in lower demand. Selling these goods could increase profitability as consumers navigate away from mainstream markets.

- This can be easily understood by looking at the figure [below](https://www.wired.com/2004/10/tail/).

![longtail](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/Applied_Machine_Learning/Images/Lesson_08_Recomendar_Systems/longtail.png)

The figure above shows how [Rhapsody](https://en.wikipedia.org/wiki/Rhapsody_(music)), which sells music albums both online and offline, makes use of the long tail. We can observe the following:
 
- Both Rhapsody and Walmart sell the most popular music albums online, but the former offers 19 times more songs than Walmart. Even though there is demand for popular music albums, there is also demand for the less popular ones online. Recommender systems leverage these less popular items.


## Recommend the Most Popular Items
 
- Let's consider the movie dataset. We will look carefully at the user ratings and think about what can be done.

- The answer that strikes first is the **most popular item**. This is exactly what we will be doing.

- Technically, this is the fastest method, but it does come with a major drawback, which is a lack of personalization. The dataset has many files; we will be looking at a few of them, mainly the ones that relate to movie ratings.

## **Popularity-Based Recommender System**

- Consider a news website as an example. There is a division by section, so users can go to the section that interests them.

- At any given time, there are only a few hot topics, so there is a high chance that a user wants to read the news that most other users are reading.



### Import Libraries

In Python, pandas is used for data manipulation and analysis. NumPy is a package that provides a multidimensional array object and multiple derived objects. Matplotlib is a visualization library in Python for 2D plots of arrays. Seaborn is an open-source Python library built on top of Matplotlib. mean_squared_error is a function from the scikit-learn metrics module that measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values.

These libraries are loaded with the import keyword.



In [2]:
import pandas as pd
import os, io
import numpy as np
from pandas import Series, DataFrame, read_table
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error
%matplotlib inline


### Extracting the Dataset from the Zip File


Before reading the data, download the "ml-100k.zip" dataset from the resource section and upload it to the lab. Click the Up arrow icon on the left side, under the View icon, and select the file from wherever it was downloaded on your system.

After this, the uploaded file will be visible on the left side of your lab, alongside the .ipynb files.

Then, the snippet below will extract the zipped dataset into the current folder.



In [2]:
import zipfile
with zipfile.ZipFile('ml-100k.zip', 'r') as zip_ref:
    zip_ref.extractall(".")

We start by exploring the dataset of movie ratings; our interest lies particularly in the ratings. Let's see how we can recommend the most popular (that is, most frequently rated) movies.

In [3]:
#Load the Ratings data
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = read_table('ml-100k//u.data',header=None,sep='\t')
ratings.columns = r_cols
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


In [4]:
i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = read_table('./ml-100k//u.item', sep='|',names=i_cols,
 encoding='latin-1')

In [5]:
items.head(10)

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995,,http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,7,Twelve Monkeys (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Twelve%20Monk...,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
7,8,Babe (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
8,9,Dead Man Walking (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Dead%20Man%20...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,10,Richard III (1995),22-Jan-1996,,http://us.imdb.com/M/title-exact?Richard%20III...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [6]:
items[items.movie_id==50]

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
49,50,Star Wars (1977),01-Jan-1977,,http://us.imdb.com/M/title-exact?Star%20Wars%2...,0,1,1,0,0,...,0,0,0,0,0,1,1,0,1,0



ratings is a variable that stores all the columns from the u.data file of the ml-100k dataset, and items stores the movie details from the u.item file.
 The head() function displays the first five rows of a DataFrame by default.



## Let's Build a Popularity-Based Recommender System

After our initial exploration, we decided that the ideal data would combine the movie details with the movie ratings. Let's see how we can do this.



We will use the pd.merge function that is used to combine data on common columns or indices.
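Before applying pd.merge to the full tables, it can help to see what it does on a tiny example. The sketch below uses two hypothetical miniature tables (toy_items and toy_ratings are made-up names and values, not part of the dataset) and relies on the default inner join.

```python
import pandas as pd

# Hypothetical miniature versions of the items and ratings tables
toy_items = pd.DataFrame({
    "movie_id": [1, 2, 3],
    "movie title": ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"],
})
toy_ratings = pd.DataFrame({
    "user_id": [10, 11, 10],
    "movie_id": [1, 1, 2],
    "rating": [4, 5, 3],
})

# Inner join (the default): keeps only movie_id values present in both tables,
# and repeats the movie details once per matching rating
toy_merged = pd.merge(toy_items, toy_ratings, on="movie_id")
print(toy_merged)
```

Movie 3 has no ratings in this toy data, so it does not appear in the merged result.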



In [7]:
new_data = pd.merge(items,ratings,on='movie_id')

In [9]:
new_data.head(1)

Unnamed: 0,movie_id,movie title,release date,video release date,IMDb URL,unknown,Action,Adventure,Animation,Children's,...,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western,user_id,rating,unix_timestamp
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,308,4,887736532


In [10]:
new_data  = new_data[['movie_id','movie title','user_id','rating']]

In [11]:
new_data.head()

Unnamed: 0,movie_id,movie title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3



new_data is a variable that stores the result returned by the pd.merge function. It combines items and ratings.
 The head() function displays the first five rows from new_data.



Before proceeding to build the recommender system, we will observe the following steps to recommend movies:
- Find unique users
- Count the number of times the movie has been seen
- [Rank](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rank.html) the scores (counts)
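The counting and ranking steps can be tried on a small made-up table first. This is only a sketch: the titles and user IDs are invented, and it compares two tie-breaking options of the pandas rank method to show why method='first' always yields distinct ranks.

```python
import pandas as pd

# Made-up viewing log: one row per (movie, user) rating
toy = pd.DataFrame({
    "movie title": ["A", "A", "A", "B", "B", "C", "C"],
    "user_id": [1, 2, 3, 1, 2, 2, 3],
})

# Number of ratings per movie: A has 3, B and C have 2 each
counts = toy.groupby("movie title")["user_id"].count()

# method='first' breaks ties by order of appearance, so every rank is distinct
rank_first = counts.rank(ascending=False, method="first")
# method='min' gives tied movies the same (lowest) rank
rank_min = counts.rank(ascending=False, method="min")

print(pd.DataFrame({"count": counts, "first": rank_first, "min": rank_min}))
```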

In [12]:
new_data['movie title'].value_counts()

Star Wars (1977)               584
Contact (1997)                 509
Fargo (1996)                   508
Return of the Jedi (1983)      507
Liar Liar (1997)               485
                              ... 
Stefano Quantestorie (1993)      1
Office Killer (1997)             1
Quartier Mozart (1992)           1
Yankee Zulu (1994)               1
Fear, The (1995)                 1
Name: movie title, Length: 1664, dtype: int64

In [11]:
new_data.groupby(['movie title'])['user_id'].count()

movie title
'Til There Was You (1997)                  9
1-900 (1994)                               5
101 Dalmatians (1996)                    109
12 Angry Men (1957)                      125
187 (1997)                                41
                                        ... 
Young Guns II (1990)                      44
Young Poisoner's Handbook, The (1995)     41
Zeus and Roxanne (1997)                    6
unknown                                    9
Á köldum klaka (Cold Fever) (1994)         1
Name: user_id, Length: 1664, dtype: int64

In [14]:
new_data['movie title'].value_counts().reset_index()

Unnamed: 0,index,movie title
0,Star Wars (1977),584
1,Contact (1997),509
2,Fargo (1996),508
3,Return of the Jedi (1983),507
4,Liar Liar (1997),485
...,...,...
1659,Stefano Quantestorie (1993),1
1660,Office Killer (1997),1
1661,Quartier Mozart (1992),1
1662,Yankee Zulu (1994),1


In [18]:
score_data = (
    new_data.groupby(['movie title'])['user_id']
    .count()
    .reset_index()
    .rename(columns={'user_id': 'scores'})
    .sort_values('scores', ascending=False)
)

In [22]:
score_data['Rank']=score_data.scores.rank(ascending=False,method='first')

In [23]:
score_data

Unnamed: 0,movie title,scores,Rank
1398,Star Wars (1977),584,1.0
333,Contact (1997),509,2.0
498,Fargo (1996),508,3.0
1234,Return of the Jedi (1983),507,4.0
860,Liar Liar (1997),485,5.0
...,...,...,...
633,"Great Day in Harlem, A (1994)",1,1660.0
1111,"Other Voices, Other Rooms (1997)",1,1661.0
620,Good Morning (1971),1,1662.0
606,Girls Town (1996),1,1663.0


In [24]:
def popularity(train, title, ids):
    # Count how many ratings each title received
    train_data_grouped = train.groupby([title])[ids].count().reset_index()
    train_data_grouped.rename(columns={ids: 'score'}, inplace=True)

    # Sort by score (descending), breaking ties alphabetically by title
    train_data_sort = train_data_grouped.sort_values(['score', title], ascending=[False, True])

    # Rank the titles; method='first' gives tied scores distinct ranks
    train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=False, method='first')

    # Keep the ten most popular titles
    popularity_recommendations = train_data_sort.head(10)

    return popularity_recommendations

In [25]:
popularity(new_data,'movie title','user_id')

Unnamed: 0,movie title,score,Rank
1398,Star Wars (1977),584,1.0
333,Contact (1997),509,2.0
498,Fargo (1996),508,3.0
1234,Return of the Jedi (1983),507,4.0
860,Liar Liar (1997),485,5.0
460,"English Patient, The (1996)",481,6.0
1284,Scream (1996),478,7.0
1523,Toy Story (1995),452,8.0
32,Air Force One (1997),431,9.0
744,Independence Day (ID4) (1996),429,10.0


## Drawback

Having recommended the movies, we can immediately conclude that the major drawback of such a system would be the **lack of personalization**.

## **Collaborative Filtering**
 
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if person A has the same opinion as person B on an issue, A is more likely to share B's opinion on a different issue than that of a randomly chosen person.

## Types of Collaborative Filtering
 
### User-Based Collaborative Filtering
 
In this type, we find look-alike customers (based on similarity) and offer products that a customer's look-alikes chose in the past. This algorithm is very effective but takes a lot of time and resources, because it needs to compute similarity information for every pair of customers. Therefore, for platforms with a large user base, this algorithm is hard to implement without a very strong parallelizing system.
 
1. Build a matrix of things each user bought or viewed or rated
2. Compute similarity scores between users
3. Find users similar to you
4. Recommend stuff they bought or viewed or rated that you haven’t yet
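The four steps above can be sketched on a toy matrix. Everything here is an assumption made for illustration: the users, movies, and ratings are invented, Pearson correlation stands in for the similarity score, and only the single most similar user is consulted.

```python
import numpy as np
import pandas as pd

# Step 1: a made-up user x movie rating matrix (NaN = not rated)
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5, 4, 1, np.nan],
        "Movie B": [4, 5, 2, 1],
        "Movie C": [1, 2, 5, 4],
        "Movie D": [np.nan, 5, np.nan, 2],
    },
    index=["u1", "u2", "u3", "u4"],
)

# Step 2: similarity between users (corr() works on columns, hence the transpose)
user_similarity = toy_matrix.T.corr()

# Step 3: find the user most similar to the target user
target = "u1"
neighbours = user_similarity[target].drop(target).sort_values(ascending=False)
most_similar = neighbours.index[0]

# Step 4: recommend what that user rated and the target has not rated yet
unseen = toy_matrix.loc[target].isna()
candidates = toy_matrix.loc[most_similar][unseen].dropna().sort_values(ascending=False)

print(most_similar)
print(candidates)
```

In this toy data, u2 rates the shared movies most like u1 does, so the only movie u1 has not seen that u2 rated (Movie D) is recommended.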
 
### Problems 
1. People are fickle, so their tastes tend to change
2. There are usually more people than things

### Item-Based Collaborative Filtering
 
It is quite similar to the previous algorithm, but instead of finding customer look-alikes, it tries to find items that look alike. Once we have an item look-alike matrix, we can easily recommend similar items to customers who have purchased an item from the store. This algorithm is far less resource-consuming than user-based collaborative filtering. 
 
1. Find every pair of movies that were watched by the same person
2. Measure the similarity of rating across all the users who watched both
3. Sort movies by the similarity strength
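A minimal sketch of these three steps, assuming a small invented rating matrix and Pearson correlation as the similarity measure (the lesson's own implementation with pandas follows later):

```python
from itertools import combinations

import pandas as pd

# Made-up user x movie ratings
toy_matrix = pd.DataFrame(
    {
        "Movie A": [5, 4, 1, 2],
        "Movie B": [4, 5, 2, 1],
        "Movie C": [1, 2, 5, 4],
    },
    index=["u1", "u2", "u3", "u4"],
    dtype=float,
)

pairs = []
# Step 1: every pair of movies
for a, b in combinations(toy_matrix.columns, 2):
    # keep only the users who rated both movies
    both = toy_matrix[[a, b]].dropna()
    if len(both) >= 2:
        # Step 2: similarity of the ratings across those users
        pairs.append((a, b, both[a].corr(both[b])))

# Step 3: sort the pairs by similarity strength
pair_scores = pd.DataFrame(
    pairs, columns=["item_1", "item_2", "similarity"]
).sort_values("similarity", ascending=False)

print(pair_scores)
```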
 

### Interesting fact 
 
Amazon uses item-based collaborative filtering extensively and has described it in great detail. You can read more at [Amazon](https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf).


Let's get started with building our item-based collaborative recommender system. For convenience, let's split this into two parts. 

- To find similarities between items
- To recommend them to users

Item-based collaborative filtering is usually the most feasible solution, as the number of items is usually smaller than the number of users, which improves the computational speed.

**Leverage the Pandas** 
 
- To begin with, we will use the pandas pivot table to look at relationships between movies. The pivot table in pandas is an excellent tool for summarizing one or more numeric variables based on two other categorical variables.

- We start by building a utility matrix (a matrix with one row per user, one column per movie, and the ratings as values).
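To make the shape of a utility matrix concrete, here is a small sketch on invented data (toy_long and its values are hypothetical). It turns a long table with one row per rating into a user-by-movie grid, leaving NaN wherever a user has not rated a movie.

```python
import pandas as pd

# Made-up long-format ratings: one row per (user, movie) rating
toy_long = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movie title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [5, 3, 4, 2, 5],
})

# Rows become users, columns become movies, cells hold the rating
toy_utility = toy_long.pivot_table(
    index="user_id", columns="movie title", values="rating"
)

print(toy_utility)
```

Five ratings spread over a 3 x 3 grid leave four empty cells, which is the same sparsity pattern seen, on a much larger scale, in the real matrix.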

In [26]:
movie_ratings = new_data.pivot_table(index=['user_id'],columns=['movie title'],values='rating')

In [27]:
movie_ratings.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [31]:
movie_ratings.corr()['187 (1997)'].sort_values().dropna()

movie title
Gaslight (1944)                             -1.0
Ma vie en rose (My Life in Pink) (1997)     -1.0
Designated Mourner, The (1997)              -1.0
All About Eve (1950)                        -1.0
Naked Gun 33 1/3: The Final Insult (1994)   -1.0
                                            ... 
Fall (1997)                                  1.0
Tango Lesson, The (1997)                     1.0
Tank Girl (1995)                             1.0
Super Mario Bros. (1993)                     1.0
Duoluo tianshi (1995)                        1.0
Name: 187 (1997), Length: 975, dtype: float64

The utility matrix above holds the rating given by each user to each movie title. There are many NaN values because no user reviews every movie. Let’s start by looking at the geeks' favorite, Star Wars, and see how it correlates pairwise with other movies in the table.

## Similarity Function

To decide the similarity between two items in the dataset, let's briefly look at the popular similarity functions.

### Terminology

- Let $\mathbf{r_x}$ denote the vector of ratings given by users to item x and $\mathbf{r_y}$ the vector of ratings given to item y. To find the pairwise similarity between two items, the following metrics can be used:

## Cosine Index

$$sim(\mathbf{r_x},\mathbf{r_y}) = \cos(\mathbf{r_x},\mathbf{r_y}) = \dfrac{\mathbf{r_x} \cdot \mathbf{r_y}}{||\mathbf{r_x}||\ ||\mathbf{r_y}||} $$ 

The major problem is that it treats missing values as negative: a missing rating is typically filled in as 0, which reads as a very low rating.

## Pearson Index

$S_{xy}$ = the set of users who have rated both item x and item y; $r_{xs}$ is the rating given to item x by user s, and $r_{xm}$ is the mean rating of item x.

$$sim(\mathbf{r_x},\mathbf{r_y})=\dfrac{\sum_{s \in S_{xy}}(r_{xs}- r_{xm})(r_{ys}- r_{ym})}{\sqrt{\sum_{s \in S_{xy}}(r_{xs}- r_{xm})^2}\ \sqrt{\sum_{s \in S_{xy}}(r_{ys}- r_{ym})^2}} $$ 

## Jaccard Index

$$Jaccard \ Index = \dfrac{Number \ in\  both \  sets}{Number \  in\  either \ set}  $$
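The three measures can be compared side by side on a pair of made-up rating vectors. This is a sketch under stated assumptions: missing ratings are NaN, the cosine version fills them with 0, the Pearson version takes each item's mean over the co-rating users only (which is how pandas computes pairwise correlations), and the Jaccard version compares the sets of users who rated each item. The function names are hypothetical helpers, not library calls.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two rating vectors; NaN is replaced by 0."""
    x, y = np.nan_to_num(x), np.nan_to_num(y)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_similarity(x, y):
    """Pearson correlation over the users who rated both items."""
    both = ~np.isnan(x) & ~np.isnan(y)
    dx = x[both] - x[both].mean()  # deviations from the item mean
    dy = y[both] - y[both].mean()
    return float((dx @ dy) / np.sqrt((dx @ dx) * (dy @ dy)))

def jaccard_index(x, y):
    """Users who rated both items divided by users who rated either item."""
    rated_x, rated_y = ~np.isnan(x), ~np.isnan(y)
    return float((rated_x & rated_y).sum() / (rated_x | rated_y).sum())

# Ratings of two items by the same five users (NaN = not rated)
r_x = np.array([5.0, 4.0, np.nan, 1.0, 2.0])
r_y = np.array([4.0, 5.0, 3.0, np.nan, 1.0])

print("cosine :", cosine_similarity(r_x, r_y))
print("pearson:", pearson_similarity(r_x, r_y))
print("jaccard:", jaccard_index(r_x, r_y))
```

Note that the Jaccard index ignores the rating values entirely, while the cosine value changes depending on how the missing entries are filled.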

Let's start with the Pearson Index in this case. Now that we have understood how similar products can be found, let's start with the movie, Star Wars.

In [17]:
new_data

Unnamed: 0,movie_id,movie title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3
...,...,...,...,...
99998,1678,Mat' i syn (1997),863,1
99999,1679,B. Monkey (1998),863,3
100000,1680,Sliding Doors (1998),863,2
100001,1681,You So Crazy (1994),896,3


In [32]:
movie_ratings

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,,,,,,,,,,,...,,,,,,,,,,
940,,,,,,,,,,,...,,,,,,,,,,
941,,,,,,,,,,,...,,,,,,,,,,
942,,,,,,,,3.0,,3.0,...,,,,,,,,,,


In [33]:
StarWarsRatings = movie_ratings['Star Wars (1977)'] 

In [34]:
StarWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Now, let’s use the **corrwith()** function to compute the pairwise correlation between the Star Wars user ratings and the ratings of every other film in the utility matrix.

In [36]:
similarmovies = movie_ratings.corrwith(StarWarsRatings)
similarmovies =similarmovies.dropna()
df = pd.DataFrame(similarmovies)
df.sort_values(0,ascending=False)

Unnamed: 0_level_0,0
movie title,Unnamed: 1_level_1
Hollow Reed (1996),1.0
Commandments (1997),1.0
Cosi (1996),1.0
No Escape (1994),1.0
Stripes (1981),1.0
...,...
For Ever Mozart (1996),-1.0
Frankie Starlight (1995),-1.0
I Like It Like That (1994),-1.0
American Dream (1990),-1.0


In [40]:
df.sort_values(0,ascending=False).head(10)

Unnamed: 0_level_0,0
movie title,Unnamed: 1_level_1
Star Wars (1977),1.0
Open Season (1996),1.0
Commandments (1997),1.0
"Innocent Sleep, The (1995)",1.0
Little City (1998),1.0
Rough Magic (1995),1.0
"Convent, The (Convento, O) (1995)",1.0
Pie in the Sky (1995),1.0
"Beans of Egypt, Maine, The (1994)",1.0
"Line King: Al Hirschfeld, The (1996)",1.0


If we look at the results closely, something looks wrong: obscure films show a perfect correlation of 1.0 with Star Wars.

The likely reason is that a handful of people who have seen obscure films are skewing the results. We want to get rid of the movies that only a few people have watched, because they produce misleading correlations.
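A quick way to see why rarely rated movies produce perfect scores: with only two users in common, the Pearson correlation can only be +1 or -1 (as long as the ratings differ), whatever the actual tastes are. The sketch below uses two invented rating columns to show this.

```python
import pandas as pd

# Made-up ratings from six users; the obscure film was rated by only two of them
popular = pd.Series([5.0, 4.0, 5.0, 3.0, 4.0, 2.0])
obscure = pd.Series([5.0, None, None, 4.0, None, None])

# Only users 0 and 3 rated both films, so the correlation is computed
# from just two points, which always lie on a straight line
corr = popular.corr(obscure)
print(corr)
```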

We use the groupby function, which involves some combination of splitting the object, applying a function, and combining the results, and the sort_values function, which sorts by the values along either axis.

In [37]:
movie_stats = new_data.groupby('movie title').agg({'rating':['size','mean']})

In [38]:
movie_stats.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie title,Unnamed: 1_level_2,Unnamed: 2_level_2
'Til There Was You (1997),9,2.333333
1-900 (1994),5,2.6
101 Dalmatians (1996),109,2.908257
12 Angry Men (1957),125,4.344
187 (1997),41,3.02439


In [39]:
check = movie_stats.sort_values([('rating','mean')],ascending=False)

In [40]:
check

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie title,Unnamed: 1_level_2,Unnamed: 2_level_2
They Made Me a Criminal (1939),1,5.0
Marlene Dietrich: Shadow and Light (1996),1,5.0
"Saint of Fort Washington, The (1993)",2,5.0
Someone Else's America (1995),1,5.0
Star Kid (1997),3,5.0
...,...,...
"Eye of Vichy, The (Oeil de Vichy, L') (1993)",1,1.0
King of New York (1990),1,1.0
Touki Bouki (Journey of the Hyena) (1973),1,1.0
"Bloody Child, The (1996)",1,1.0


Now, we can clearly observe that some movies have very few ratings (size). Therefore, we set a threshold: a movie must have at least 100 ratings to be considered.

In [41]:
popularmovies = movie_stats['rating']['size']>=100

movie_stats[popularmovies].sort_values([('rating','mean')],ascending=False)[:10]

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie title,Unnamed: 1_level_2,Unnamed: 2_level_2
"Close Shave, A (1995)",112,4.491071
Schindler's List (1993),298,4.466443
"Wrong Trousers, The (1993)",118,4.466102
Casablanca (1942),243,4.45679
"Shawshank Redemption, The (1994)",283,4.44523
Rear Window (1954),209,4.38756
"Usual Suspects, The (1995)",267,4.385768
Star Wars (1977),584,4.359589
12 Angry Men (1957),125,4.344
Citizen Kane (1941),198,4.292929


In [42]:
df = movie_stats[popularmovies].join(DataFrame(similarmovies,columns=['similarity']))
df.sort_values('similarity',ascending=False)[:20]

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
movie title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Star Wars (1977),584,4.359589,1.0
"Empire Strikes Back, The (1980)",368,4.206522,0.748353
Return of the Jedi (1983),507,4.00789,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,3.246154,0.377433
"Sting, The (1973)",241,4.058091,0.367538
Indiana Jones and the Last Crusade (1989),331,3.930514,0.350107
Pinocchio (1940),101,3.673267,0.347868
"Frighteners, The (1996)",115,3.234783,0.332729
L.A. Confidential (1997),297,4.161616,0.319065


## Building an End-to-End Recommender System

Based on what we have done so far, these are the steps to follow to recommend a movie:

- Compute the correlation score for every pair in the matrix
- Choose a user and find his or her movies of interest
- Recommend movies to him or her
- Improve on the recommendation

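The pandas method **corr()** computes the correlation score for every pair of columns in the matrix, that is, for every pair of movies, which in turn creates a sparse matrix. With min_periods=100, as used in the cells below, a pair gets a score only if at least 100 users rated both movies; all other entries are NaN. Let's see how this looks.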
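The effect of the min_periods argument of corr(), which the cells below set to 100, is easier to see on a small table. In this sketch the data is invented and the threshold is lowered to 3 so that it matters on five users.

```python
import numpy as np
import pandas as pd

# Made-up ratings: Movie C was rated by only two of the five users
toy = pd.DataFrame({
    "Movie A": [5, 4, 1, 2, 3],
    "Movie B": [4, 5, 2, 1, 3],
    "Movie C": [np.nan, np.nan, 5, 4, np.nan],
})

# Without a threshold, two common raters are enough to produce a score
loose = toy.corr()
# With min_periods=3, pairs with fewer than three common raters become NaN
strict = toy.corr(min_periods=3)

print(loose)
print(strict)
```

The A/C score of -1.0 in the first matrix rests on two users only; the stricter call drops it while keeping the A/B score, which is based on all five users.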

In [43]:
movie_ratings.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [44]:
corrMatrix = movie_ratings.corr(method='pearson',min_periods=100) # method="spearman"
corrMatrix.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
movie title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,


Now, we want to recommend movies to a friend, so let's have a look at the movies our friend has rated.

In [45]:
friend_ratings = movie_ratings.loc[1].dropna()[1:4]
friend_ratings

movie title
12 Angry Men (1957)                    5.0
20,000 Leagues Under the Sea (1954)    3.0
2001: A Space Odyssey (1968)           4.0
Name: 1, dtype: float64

In [46]:
moviesLikedby_u1=movie_ratings.loc[1].dropna().sort_values().tail(5)

In [47]:
moviesLikedby_u1

movie title
Raiders of the Lost Ark (1981)            5.0
Godfather, The (1972)                     5.0
Good, The Bad and The Ugly, The (1966)    5.0
Crumb (1994)                              5.0
Kids in the Hall: Brain Candy (1996)      5.0
Name: 1, dtype: float64

In [48]:
moviesLikedby_u1.index

Index(['Raiders of the Lost Ark (1981)', 'Godfather, The (1972)',
       'Good, The Bad and The Ugly, The (1966)', 'Crumb (1994)',
       'Kids in the Hall: Brain Candy (1996)'],
      dtype='object', name='movie title')

In [49]:
first_movie=moviesLikedby_u1.index[0]

In [50]:
first_movie

'Raiders of the Lost Ark (1981)'

In [51]:
corrMatrix[first_movie].dropna().sort_values(ascending=False).head(5)

movie title
Raiders of the Lost Ark (1981)               1.000000
Indiana Jones and the Last Crusade (1989)    0.539606
Empire Strikes Back, The (1980)              0.538659
Star Wars (1977)                             0.536117
Back to the Future (1985)                    0.506807
Name: Raiders of the Lost Ark (1981), dtype: float64

In [52]:
for i in moviesLikedby_u1.index:
    print("_________________________________________________")
    print("Movies Similar to "+i)
    print(corrMatrix[i].dropna().sort_values(ascending=False)[1:6])
   

_________________________________________________
Movies Similar to Raiders of the Lost Ark (1981)
movie title
Indiana Jones and the Last Crusade (1989)    0.539606
Empire Strikes Back, The (1980)              0.538659
Star Wars (1977)                             0.536117
Back to the Future (1985)                    0.506807
Firm, The (1993)                             0.490823
Name: Raiders of the Lost Ark (1981), dtype: float64
_________________________________________________
Movies Similar to Godfather, The (1972)
movie title
Godfather: Part II, The (1974)        0.683862
GoodFellas (1990)                     0.421477
People vs. Larry Flynt, The (1996)    0.393439
Chinatown (1974)                      0.376133
Apocalypse Now (1979)                 0.374378
Name: Godfather, The (1972), dtype: float64
_________________________________________________
Movies Similar to Good, The Bad and The Ugly, The (1966)
movie title
Alien (1979)             0.448382
Godfather, The (1972)    0.31624

In [53]:
recommended_movie=[]
for i in moviesLikedby_u1.index:
    print("_________________________________________________")
    print("Movies Similar to "+i)
    print(corrMatrix[i].dropna().sort_values(ascending=False)[1:6])
    
    
    
    for j in corrMatrix[i].dropna().sort_values(ascending=False)[1:6].index:
        recommended_movie.append(j)

_________________________________________________
Movies Similar to Raiders of the Lost Ark (1981)
movie title
Indiana Jones and the Last Crusade (1989)    0.539606
Empire Strikes Back, The (1980)              0.538659
Star Wars (1977)                             0.536117
Back to the Future (1985)                    0.506807
Firm, The (1993)                             0.490823
Name: Raiders of the Lost Ark (1981), dtype: float64
_________________________________________________
Movies Similar to Godfather, The (1972)
movie title
Godfather: Part II, The (1974)        0.683862
GoodFellas (1990)                     0.421477
People vs. Larry Flynt, The (1996)    0.393439
Chinatown (1974)                      0.376133
Apocalypse Now (1979)                 0.374378
Name: Godfather, The (1972), dtype: float64
_________________________________________________
Movies Similar to Good, The Bad and The Ugly, The (1966)
movie title
Alien (1979)             0.448382
Godfather, The (1972)    0.31624

In [54]:
recommended_movie

['Indiana Jones and the Last Crusade (1989)',
 'Empire Strikes Back, The (1980)',
 'Star Wars (1977)',
 'Back to the Future (1985)',
 'Firm, The (1993)',
 'Godfather: Part II, The (1974)',
 'GoodFellas (1990)',
 'People vs. Larry Flynt, The (1996)',
 'Chinatown (1974)',
 'Apocalypse Now (1979)',
 'Alien (1979)',
 'Godfather, The (1972)',
 'Jurassic Park (1993)',
 'Aliens (1986)',
 'Fugitive, The (1993)']

In [87]:
corrMatrix['Raiders of the Lost Ark (1981)'].dropna().sort_values(ascending=False)[1:6].index

Index(['Indiana Jones and the Last Crusade (1989)',
       'Empire Strikes Back, The (1980)', 'Star Wars (1977)',
       'Back to the Future (1985)', 'Firm, The (1993)'],
      dtype='object', name='movie title')

In [55]:
movie_ratings.loc[17].dropna().sort_values().tail(5)

movie title
Fargo (1996)                                    4.0
City of Lost Children, The (1995)               4.0
Twelve Monkeys (1995)                           4.0
Willy Wonka and the Chocolate Factory (1971)    4.0
Swingers (1996)                                 5.0
Name: 17, dtype: float64

In [56]:
def getRecommendation(userNumber):
    # The five movies this user rated highest
    moviesLikedby_u = movie_ratings.loc[userNumber].dropna().sort_values().tail(5)
    recommended_movie = []
    for i in moviesLikedby_u.index:
        # Skip position 0 (normally the movie itself) and keep the next five most correlated titles
        for j in corrMatrix[i].dropna().sort_values(ascending=False)[1:6].index:
            recommended_movie.append(j)
    return recommended_movie




In [57]:
getRecommendation(17)

['Sling Blade (1996)',
 'Lone Star (1996)',
 'Quiz Show (1994)',
 'Lawrence of Arabia (1962)',
 'People vs. Larry Flynt, The (1996)',
 'Star Trek: The Wrath of Khan (1982)',
 'Professional, The (1994)',
 'Grosse Pointe Blank (1997)',
 'Frighteners, The (1996)',
 'Cape Fear (1991)',
 'Raising Arizona (1987)',
 'Grease (1978)',
 'Field of Dreams (1989)',
 'Wizard of Oz, The (1939)',
 'Beavis and Butt-head Do America (1996)',
 'Trainspotting (1996)',
 'Rock, The (1996)',
 'Star Wars (1977)',
 'Scream (1996)',
 'Fargo (1996)']

In [58]:
getRecommendation(108)

['English Patient, The (1996)',
 'Dead Man Walking (1995)',
 'Leaving Las Vegas (1995)',
 'Godfather, The (1972)',
 'Sense and Sensibility (1995)',
 'English Patient, The (1996)',
 'Contact (1997)',
 'Scream (1996)',
 'Silence of the Lambs, The (1991)',
 'Toy Story (1995)',
 'Fugitive, The (1993)',
 'Blade Runner (1982)',
 'English Patient, The (1996)',
 'Emma (1996)',
 'Raising Arizona (1987)',
 'Secrets & Lies (1996)',
 'Dead Man Walking (1995)',
 'Big Night (1996)',
 'Star Trek: The Wrath of Khan (1982)',
 'Professional, The (1994)',
 'Grosse Pointe Blank (1997)',
 'Frighteners, The (1996)',
 'Cape Fear (1991)']

Some movies come up more than once, because they are very similar to more than one of the movies that the user has rated. Let's eliminate them.
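One possible way to remove the repeats is sketched below. The helper unique_recommendations is hypothetical (it is not part of the notebook); it keeps the first occurrence of each title, preserves the original order, and can optionally drop titles the user has already rated.

```python
def unique_recommendations(recommended, already_rated=()):
    """Drop repeated titles (keeping the first occurrence) and already rated ones."""
    seen = set(already_rated)
    result = []
    for title in recommended:
        if title not in seen:
            seen.add(title)
            result.append(title)
    return result

# A shortened, made-up recommendation list with repeats
raw = [
    "English Patient, The (1996)",
    "Dead Man Walking (1995)",
    "English Patient, The (1996)",
    "Emma (1996)",
    "Dead Man Walking (1995)",
]

cleaned = unique_recommendations(raw, already_rated=["Emma (1996)"])
print(cleaned)
```

Applied to the notebook, the list returned by getRecommendation could be passed as the first argument.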

Having done all the computations using pandas, we can see that this approach is computationally intensive. Fortunately, there is a Python module that does this for us.

## Using the Surprise Module

[Python Surprise](http://surprise.readthedocs.io/en/stable/index.html) is an easy-to-use Python scikit for recommender systems. Let's see how to build a recommender system using the surprise module and focus on the model inspired by K-Nearest Neighbors (KNN).

## Common Practice

1. Define the similarity $S_{ij}$ between items i and j
2. Select the K nearest neighbors N(i;x)
    - Items most similar to i that were rated by user x
3. Estimate the rating $r_{x_i}$ as the weighted average

$$ r_{x_i} = b_{x_i} + \dfrac{\sum_{j \in N(i;x)} S_{ij} (r_{x_j} - b_{x_j})}{\sum_{j \in N(i;x)} S_{ij}} $$

Here, the term $b_{x_i}$ is the baseline estimator for the rating, comprising three terms: the overall mean movie rating, the rating deviation of user x, and the rating deviation of movie i.
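The weighted average above can be written out directly with NumPy. This is a minimal sketch with invented numbers, not the Surprise implementation; the helper names baseline and predict_rating are hypothetical.

```python
import numpy as np

def baseline(overall_mean, user_deviation, item_deviation):
    """Baseline estimate: overall mean plus the user and item deviations."""
    return overall_mean + user_deviation + item_deviation

def predict_rating(baseline_xi, similarities, neighbour_ratings, neighbour_baselines):
    """Baseline for (x, i) plus the similarity-weighted average of the
    neighbours' deviations from their own baselines."""
    s = np.asarray(similarities, dtype=float)
    r = np.asarray(neighbour_ratings, dtype=float)
    b = np.asarray(neighbour_baselines, dtype=float)
    return float(baseline_xi + (s * (r - b)).sum() / s.sum())

# Made-up example: three neighbour items j rated by user x
b_xi = baseline(overall_mean=3.6, user_deviation=0.2, item_deviation=-0.3)
pred = predict_rating(
    b_xi,
    similarities=[0.8, 0.5, 0.2],        # S_ij for each neighbour
    neighbour_ratings=[5.0, 4.0, 3.0],   # r_xj
    neighbour_baselines=[4.0, 3.5, 3.5], # b_xj
)
print(round(pred, 3))
```

The user rated the two most similar neighbours above their baselines, so the prediction lands above the baseline of 3.5.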

## **Evaluation Metrics**
#### Comparing Predictions with Known Ratings

**RMSE**

- Root Mean Square Error (RMSE) 
    - $\sqrt{\frac{1}{N}\sum_{x_i}(r_{x_i} - r_{x_i}^*)^2}$, where $r_{x_i}$ is the predicted rating, $r_{x_i}^*$ is the actual rating, and N is the number of ratings compared
- Precision at top 10 
    - Percentage of the top 10 recommended items that are relevant to the user
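Both metrics are short enough to compute by hand. The sketch below uses NumPy and invented predictions; the function names are hypothetical, and precision_at_k divides by the number of recommendations actually returned, which is one possible convention when fewer than k items are available.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

def precision_at_k(recommended, relevant, k=10):
    """Share of the top-k recommended items that are in the relevant set."""
    top_k = list(recommended)[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

# Made-up predictions versus known ratings
error = rmse([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])

# Made-up ranked recommendations versus the items the user actually liked
precision = precision_at_k(["a", "b", "c", "d"], {"a", "c", "x"}, k=4)

print(error, precision)
```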

**Note: In this lesson, we saw the use of the recommender systems.**
