<h1 align="center"> Recommender Systems<h1/>
***
## Program so far
- Linear Regression 
- Logistic Regression
- Tree Based Methods
- Ensemble Methods
- Time Series 
- Natural Language Processing
***
## What are we going to learn today? 
- What is Recommendation System
- The long tail
- A simple Popularity based Recommender system
- A Collaborative Filtering Model
- Evaluating a Recommendation system

Let's get started!!

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## So what is a recommender system?
***
A recommender system or a recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" that a user would give to an item. Lets consider an example shown in the figure below. Here we have a user database i.e data consisting of items rated by the user. Now lets suppose that a new user visits and likes 5 out of 10 items in the website. Then a recommender system recommends the items the new user might like based on similarity with other items. We shall get to this concept below. 

![recommender_system_info.jpg](../images/recommender_system_info.jpg)

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>


***
## The theory of long tail
***


- It shows how products in low demand or with low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters but only if the store or distribution channel is large enough.

- The long tail concept looks at less popular goods in lower demand and use of  these goods could increase in 
  profitability because consumers are navigating away from mainstream markets.
  
- This can be easily understood by looking at the figure [below](https://www.wired.com/2004/10/tail/). 

![longtail.png](../images/longtail.png)

The figure above clearly shows the use of longtail by [Rhapsody](https://en.wikipedia.org/wiki/Rhapsody_(music) where they sell Music albums both online and off line. We can clearly observe the following.

- Both Rhapsody and Wal-Mart sell the most popular music albums onlie. But the former offers 19 times more songs than Wal-Mart. Even though there is a demand for the popular music albums there is also a demand for the less popular online. Recommender systems leverage these less popular items online. 

## Dost ka sawaal ?

So now let's turn back to what John bhai is upto this time. He is put on task by his friend who asked him to recommend movies for him. John now turns back to data and thinks about a way to do so. Let's see what he does?

## Recommend the most popular items

- John gets his hand dirty with the movie dataset. He looks carefully at the user ratings and thinks whats to be done? 

- The answer that strikes first is the most **popular Item**. This is exactly what he will be doing. 

- Technicaly this is super fast but it does come with a major drawback which is lack of personalization. The dataset that has many files but he will be looking at a few of these files mainly the ones which relate to movie ratings.

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## Popularity based recommender system
***
In this approach we recommend items which are liked by most number of users. This is a blazing fast and dirty approach and thus has a major drawback. The thing is, there is no personalization involved with this approach. Such methods are widely used in news portals and work effectively beacuse of the following points

- There is division by section so user can look at the section of his interest.
- At a time there are only a few hot topics and there is a high chance that a user wants to read the news which is being read by most others.

**Let's take a look at John's Journey !!**

In [1]:
import pandas as pd
import os, io
import numpy as np
from pandas import Series, DataFrame, read_table
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import mean_squared_error
%matplotlib inline

John starts to explore the data set of movie ratings and particularly his interest lies in ratings. Lets see how he recommends the most popular (i.e highly rated) movies to his friend.

In [2]:
## Load the Ratings data
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = read_table('../data/ml-100k//u.data',header=None,sep='\t')
ratings.columns = r_cols

i_cols = ['movie_id', 'movie title' ,'release date','video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
items = read_table('../data/ml-100k//u.item', sep='|',names=i_cols,
 encoding='latin-1')

In [3]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


## Let's build a popularity based recommender system


With his initial exploration John decides that an ideal data would be the one were he could also have the movie ratings with him. Let's see how he is able to do so.

In [4]:
new_data = pd.merge(items,ratings,on='movie_id')
new_data  = new_data[['movie_id','movie title','user_id','rating']]

In [5]:
new_data.head()

Unnamed: 0,movie_id,movie title,user_id,rating
0,1,Toy Story (1995),308,4
1,1,Toy Story (1995),287,5
2,1,Toy Story (1995),148,4
3,1,Toy Story (1995),280,4
4,1,Toy Story (1995),66,3


Before proceeding to build the, he concludes that he will observe the following steps to recommend movies to his friend.
- Find the unique users
- Count the number of times the movie has been seen.
- [Rank](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rank.html) the scores (counts) 

In [6]:
def popularity(train,title,ids):
    train_data_grouped = train.groupby([title])[ids].count().reset_index()  #user_id  #movie title
    train_data_grouped.rename(columns = {ids: 'score'},inplace=True)            
    train_data_sort = train_data_grouped.sort_values(['score',title], ascending = [0,1])
    train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first')
    popularity_recommendations = train_data_sort.head(10) 
    return popularity_recommendations

In [7]:
popularity(new_data,'movie title','user_id')

Unnamed: 0,movie title,score,Rank
1398,Star Wars (1977),584,1.0
333,Contact (1997),509,2.0
498,Fargo (1996),508,3.0
1234,Return of the Jedi (1983),507,4.0
860,Liar Liar (1997),485,5.0
460,"English Patient, The (1996)",481,6.0
1284,Scream (1996),478,7.0
1523,Toy Story (1995),452,8.0
32,Air Force One (1997),431,9.0
744,Independence Day (ID4) (1996),429,10.0


<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## Pandas rank
***
- Compute numerical data ranks (1 through n) along axis. Equal values are assigned a rank that is the average of the ranks of those values. This is taken from the official [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rank.html)

There he does it again. He uses is analyisis skill and leverages data to his advantage to arrive at recommendation for his friend. 

## Shortcomings

Having recommended the movies to his friend, he gets to John and tells him though he was happy to start off with he realised that some of the movies do not match his taste. John thinks a bit and immidiately concludes on the major drawback of such a system **Lack of Personalization**.

John dwelled upon the problem of personalization and after some research he concludes that **Colaborative Filtering** methods will definitely help his friend. Below summarises what he learnt.

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## Collaborative Filtering
****

In the newer, narrower sense, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.

![Collaborative_filtering.gif](../images/Collaborative_filtering.gif)

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***

## Types of Collaborative Filtering.
***
## User Based Collaborative Filtering
***
Here we find look alike customers (based on similarity) and offer products which first customer’s look alike has chosen in past. This algorithm is very effective but takes a lot of time and resources. It requires to compute every customer pair information which takes time. Therefore, for big base platforms, this algorithm is hard to implement without a very strong parallelizable system.

1. Build a matrix of things each user bought viewed rated
2. compute similarity scores between users
3. find users similar to you
4. Recommend stuff they bought/viewed/rated that you haven’t yet
***
## Problems 
1. People fickle, tastes change
2. They are usually many more people than things
3. People do bad things

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## Item Based Collaborative Filtering
***
It is quite similar to previous algorithm, but instead of finding customer look alike, we try finding item look alike. Once we have item look alike matrix, we can easily recommend alike items to customer who have purchased any item from the store. This algorithm is far less resource consuming than user-user collaborative filtering. 

It tells us **Users who liked this item also like ....... items.** It takes items as inputs and recommends other similar items as recommendations.

1. Select an item.
2. Find other users who liked that item.
3. Find other items that the other users liked.
4. Make recommendations.


## Interesting fact 

Item- Item Collaboration is extensively used in amazon and they came out with it in great detail. You can read more at [amazon](https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf)

----------
Lets get started with building our Item based collaborative recommender system. For convinience Lets split this up into two parts. 

- To find Similarities between items
- To Recommend them to users

phew!! After all this research he was convinced that item based collaborative filtering would be the most feasible solution as the number of items is alwas less than the number of users and it improves compuational speed. He then approaches his friend to help him out by rating movies.

**Here we will leverage the power of pandas** 

- To begin we will use the pandas pivot table to look at relationships between movies and we will use the pivot table in pandas

- John begins his journey by building a utility matrix (matrix consisting of movies and ratings) 

In [8]:
movie_ratings = new_data.pivot_table(index=['user_id'],columns=['movie title'],values='rating')

In [9]:
movie_ratings.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


The above table gives information about the rating given by each user against the movie title. There are many **NaN's** as it is not necessary for each user to review each movie. To proceed lets start by looking at one the geeks favourite Star Wars and see how it is correlated pairwise with other movies in the table.
***

Wait a minute !!! But we how do will he decide if it is correlated or not?  Here comes the use of similarity function

<img src="../images/icon/Technical-Stuff.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## Similarity Function
***

So to decide the similarity between two items in the dataset lets briefly look at the popular similarity functions.

### Terminology

-Let $\textbf{$r_x$}$ denote the rating of the item x given by the user and $\textbf{$r_y$}$ be the rating of item y. To find the similarity pairwise between two items. The following metrics can be used.


## Euclidean distance 

A simple yet powerful method to denote similarities. It is given by 

$$sim(\textbf{$r_{x}$},\textbf{$r_y$}) =  \sqrt{\sum_{}(r_x - r_y)^2}$$

The shorter the distance, the more similar the items are.
***

## Cosine Index

$$sim(\textbf{$r_{x}$},\textbf{$r_y$}) = cos(\textbf{$r_x$},\textbf{$r_y$}) = \dfrac{\textbf{$r_x$}\textbf{$r_y$}}{||\textbf{$r_x$}||\  ||\textbf{$r_y$}||} $$ 

| Value of $cos \theta$ | Value of $\theta$ | Conclusion|
|:---:|:---: |:---:|
| +1 |$0^{\circ}$   | Vectors are more similar |
|  0 |$90^{\circ}$  | Vectors are not similar (orthogonal vectors)|
| -1 |$180^{\circ}$ | Vectors are diametrically dissimilar |

Major problem is it treats missing values as negative.
***
## Pearson Index

$S_{xy}$ = Items x and y both have ratings

$$sim(\textbf{$r_{x}$},\textbf{$r_y$})=\dfrac{\sum_{x\epsilon s}(\textbf{$r_{xs}$}- \textbf{$r_{xm}$})(\textbf{$r_{ys}$}- \textbf{$r_y$})}{(\sqrt{\sum_{s\epsilon s_{xy}}(\textbf{$r_{xs}$}- \textbf{$r_{xm}$})^2}(\sqrt{\sum_{s\epsilon s_{xy}}(\textbf{$r_{ys}$}- \textbf{$r_{ym}$})^2}} $$ 

However, this index is prone to extreme values.
***
## Jacard Index

$$Jacard \ Index = \dfrac{Number \ in\  both \  sets}{Number \  in\  either \ set}  $$

For his case he begins with  Pearson Index. Now that he has understood how to find similar items John starts by checking his similarity score with the movie **Star Wars** in his data set.

In [10]:
StarWarsRatings = movie_ratings['Star Wars (1977)'] 
StarWarsRatings.head()

user_id
0    5.0
1    5.0
2    5.0
3    NaN
4    5.0
Name: Star Wars (1977), dtype: float64

Now he cheks the pair wise correlation of user rating of **StarWars** with other movies using `.corrwith()` function.

In [11]:
similarmovies = movie_ratings.corrwith(StarWarsRatings)
similarmovies = similarmovies.dropna()
df = pd.DataFrame(similarmovies)
df.head()

Unnamed: 0_level_0,0
movie title,Unnamed: 1_level_1
'Til There Was You (1997),0.872872
1-900 (1994),-0.645497
101 Dalmatians (1996),0.211132
12 Angry Men (1957),0.184289
187 (1997),0.027398


John now carefully looks at the data and finds something wrong.

What went wrong ?? Here, the possible explanation is that our movies are getting messed up by a handful of people who saw obscure movies. So he decides to get rid of the movies that only a few people watched that are producing wrong results. 

In [12]:
movie_stats = new_data.groupby('movie title').agg({'rating':[np.size,np.mean]})

In [13]:
check = movie_stats.sort_values([('rating','mean')],ascending=False)

In [14]:
check.head()

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie title,Unnamed: 1_level_2,Unnamed: 2_level_2
They Made Me a Criminal (1939),1,5.0
Marlene Dietrich: Shadow and Light (1996),1,5.0
"Saint of Fort Washington, The (1993)",2,5.0
Someone Else's America (1995),1,5.0
Star Kid (1997),3,5.0


Now he clearly observes that there are movies which have not been rated by users as expected. For the sake of simplicity, he sets a threshold of 100 ratings per movie. 

In [15]:
popularmovies = movie_stats['rating']['size']>=100

movie_stats[popularmovies].sort_values([('rating','mean')],ascending=False)[:10]

Unnamed: 0_level_0,rating,rating
Unnamed: 0_level_1,size,mean
movie title,Unnamed: 1_level_2,Unnamed: 2_level_2
"Close Shave, A (1995)",112,4.491071
Schindler's List (1993),298,4.466443
"Wrong Trousers, The (1993)",118,4.466102
Casablanca (1942),243,4.45679
"Shawshank Redemption, The (1994)",283,4.44523
Rear Window (1954),209,4.38756
"Usual Suspects, The (1995)",267,4.385768
Star Wars (1977),584,4.359589
12 Angry Men (1957),125,4.344
Citizen Kane (1941),198,4.292929


Now being happy with the results he makes some final checks with the data at hand.

In [16]:
df = movie_stats[popularmovies].join(DataFrame(similarmovies,columns=['similarity']))
df.sort_values('similarity',ascending=False)[:20]

Unnamed: 0_level_0,"(rating, size)","(rating, mean)",similarity
movie title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Star Wars (1977),584,4.359589,1.0
"Empire Strikes Back, The (1980)",368,4.206522,0.748353
Return of the Jedi (1983),507,4.00789,0.672556
Raiders of the Lost Ark (1981),420,4.252381,0.536117
Austin Powers: International Man of Mystery (1997),130,3.246154,0.377433
"Sting, The (1973)",241,4.058091,0.367538
Indiana Jones and the Last Crusade (1989),331,3.930514,0.350107
Pinocchio (1940),101,3.673267,0.347868
"Frighteners, The (1996)",115,3.234783,0.332729
L.A. Confidential (1997),297,4.161616,0.319065


## Accomplished

John found similarities between Star Wars and other movies in the data set and is happy with the results.

## Building a full blown item based recommender system
***

By this time his friend returns with his prefered set of movies. John is happy with his new approach and he lists down points which will help to accomplish recommendations.

- Compute the correlation matrix for movie ratings for every movie pair
- Choose a movie and sort users who have rated the movie in descending order
- Find other items from the above list of users with the help of correlation matrix 
- Make recommendations

The Pandas method `.corr()` will compute the correlation score for every pair in the matrix. This creates a sparse matrix. Lets see how this looks.

In [17]:
# Step 1: Build an item-item similarity matrix based on Pearson Index having at least 100 ratings

corrMatrix = movie_ratings.corr(method='pearson',min_periods=100) 
corrMatrix.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
movie title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'Til There Was You (1997),,,,,,,,,,,...,,,,,,,,,,
1-900 (1994),,,,,,,,,,,...,,,,,,,,,,
101 Dalmatians (1996),,,1.0,,,,,,,,...,,,,,,,,,,
12 Angry Men (1957),,,,1.0,,,,,,,...,,,,,,,,,,
187 (1997),,,,,,,,,,,...,,,,,,,,,,


In [18]:
# For understanding lets replace "NaN" values by "0"s.

movie_ratings.fillna(0, inplace=True)
movie_ratings

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,2.0,5.0,0.0,0.0,3.0,4.0,0.0,0.0,...,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,2.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,4.0,0.0
6,0.0,0.0,0.0,4.0,0.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,4.0,0.0,0.0,5.0,5.0,0.0,4.0,...,0.0,0.0,0.0,5.0,3.0,0.0,3.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
movie_ratings.columns

Index([''Til There Was You (1997)', '1-900 (1994)', '101 Dalmatians (1996)',
       '12 Angry Men (1957)', '187 (1997)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '3 Ninjas: High Noon At Mega Mountain (1998)', '39 Steps, The (1935)',
       ...
       'Yankee Zulu (1994)', 'Year of the Horse (1997)', 'You So Crazy (1994)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Poisoner's Handbook, The (1995)',
       'Zeus and Roxanne (1997)', 'unknown',
       'Á köldum klaka (Cold Fever) (1994)'],
      dtype='object', name='movie title', length=1664)

Now lets take a look at how recommendations would be thrown for the movie `12 Angry Men (1957)` according to item based collaborative filtering.

In [20]:
# Step 2: Choose a movie and sort users who have rated the movie in descending order

# Lets take the movie '12 Angry Men (1957)'
movie = '12 Angry Men (1957)'

# Filter for users who rated "12 Angry Men (1957)"
users = movie_ratings[movie_ratings[movie] != 0]

# Sort the users in descending order of ratings
sort_users = users.sort_values(by = [movie], ascending=False)

# Pick the top users
top_users_id = sort_users[[movie]].index

# Print out the top user ids
print("Top user ids for the movie- " + movie + " are: " + ', '.join(str(i) for i in top_users_id))

Top user ids for the movie- 12 Angry Men (1957) are: 1, 409, 588, 556, 553, 530, 487, 468, 416, 398, 606, 397, 392, 379, 361, 339, 326, 313, 592, 615, 311, 747, 924, 892, 889, 886, 868, 756, 751, 741, 639, 716, 704, 693, 691, 686, 659, 658, 312, 932, 68, 151, 232, 95, 115, 182, 65, 234, 60, 24, 246, 156, 189, 272, 288, 298, 16, 174, 10, 90, 655, 694, 648, 622, 661, 684, 178, 308, 696, 82, 727, 738, 749, 766, 774, 846, 864, 870, 876, 13, 7, 601, 185, 450, 478, 315, 305, 327, 373, 381, 389, 268, 237, 426, 6, 573, 458, 474, 429, 224, 568, 567, 561, 514, 528, 221, 535, 488, 537, 387, 331, 881, 172, 194, 18, 271, 796, 405, 417, 425, 524, 723, 109, 307, 712


These are the users who have rated the movie `12 Angry Men (1947` very highly.

In [21]:
# Step 3: Find other movies from the above list of users

# Dataframe of above users
user_movie = movie_ratings.loc[top_users_id,:]

# Remove the `12 Angry Men (1957) movie column
user_movie.drop(movie, axis = 1, inplace=True)

# Function to find movies similar to '12 Angry Men (1957) for a user
def find_similar_movies(user, user_movie, movie, corrMatrix):
    
    '''
    Arguments: (a) user: user_id
               (b) user_movie: user-movie matrix of ratings
               (c) movie: '12 Angry Men (1957)'
               (d) corrMatrix: item-item similarity matrix
    Output: Movies most similar to '12 Angry Men (1957)'
    '''
    
    user_rating = users.loc[user, movie]
    top_movies = user_movie.loc[user,].sort_values(ascending=False).dropna()
    top_movies = user_rating*top_movies
    for each_movie in top_movies.index:
        similarity = corrMatrix.loc[each_movie, movie]
        top_movies.loc[each_movie] *= similarity 
    top_movies.sort_values(ascending=False, inplace=True)
    return top_movies.dropna()

In [22]:
# Step 4: Recommend movies based on similarities for every user

recommendations = {}
for user in top_users_id:
    movies = find_similar_movies(user, user_movie, movie, corrMatrix)
    films, scores = list(movies.index), list(movies.values)
#     print(list(movies.index), list(movies.values))
    for film in films:
        if film in recommendations.keys():
            recommendations[film].extend(scores)
        else:
            recommendations[film] = scores

for k, v in recommendations.items():
    recommendations[k] = np.mean(v)
    
print("Recommendations for the movie '12 Angry Men (1957) are:'", ', '.join(recommendations.keys()))

Recommendations for the movie '12 Angry Men (1957) are:' Star Wars (1977), Raiders of the Lost Ark (1981)


Now John gets back to his friend with these recommendations. After having a look his friend was convinced and applauded John.

## John ka khayal

Having done all the computations using pandas John foud that it is computationally intensive. Also, there is another problem with the calculation. It suffers from the problem of cold start. Essentially, it means that movies that were rated less are bound to be less recommended, or worse; it might not even get recommendations. Even though a latest movie which has a good enough rating but rated by very few users, it will not be shown.

It is a **memory based** approach which will suffer ultimately in the long run due to the high sparsity of movie ratings. Also, as data grows correlation calculation will become cumbersome slowing down the pipeline. This is where **model based** approaches such as **matrix based factorization** come to the rescue. 

Although it is not necessary that it will always outperform item or user based collaborative filtering methods. 

Lets have a look at it

## What is matrix factorization?

Simply, it is the breaking down of one matrix into a product of multiple matrices. There are many different ways to factor matrices, but singular value decomposition (SVD) is particularly useful for making recommendations.

***

### So what is Singular Value Decomposition?

At a high level, SVD is an algorithm that decomposes a matrix $R$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $R$. Mathematically, it decomposes $R$ into two unitary matrices and a diagonal matrix.

\begin{equation}\large
   R = U\sum V^{T}
\end{equation}

where 
- $R$ is user ratings matrix 
- $U$ is the user “features” matrix and its columns can build back all columns of $R$
- $Σ$ is the diagonal matrix of singular values (essentially weights) 
- $V^T$ is the movie “features” matrix and the columns of $V$ can build back all rows of $R$
- $U$ and $V^T$ are orthogonal, and represent different things. 
- $U$ represents how much users “like” each feature and 
- $V^T$ represents how relevant each feature is to each movie

Okay, now enough of mathematical stuff, lets proceed to build one.

Before that John decides to prepare his data such that users are on the index and movie names are features.

In [23]:
df = new_data.pivot_table(index=['user_id'],columns=['movie title'],values='rating')
df.head()

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,,,,,,,,,,,...,,,,,,,,,,
1,,,2.0,5.0,,,3.0,4.0,,,...,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,...,,,,,,,,,,
3,,,,,2.0,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [24]:
# Fill missing values with 0's
df.fillna(0, inplace=True)

# Normalize the data by converting into numpy array first
matrix = df.as_matrix()
user_mean = np.mean(matrix, axis=1)
matrix_normed = matrix - user_mean.reshape(-1,1)

Both **Scipy** and **Numpy** have functions for SVD. But we will use the **Scipy** function because it lets us choose the number of latent factors we want to use to approximate the original ratings matrix.

In [25]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(matrix_normed, k = 50)  # Number of latent factors is taken 50

In [26]:
''' Sigma is a diagonal matrix but with scipy it returns only those diagonal values.
So, we need to convert it back into a diagonal matrix from. 
Transformation is given below.'''

sigma = np.diag(sigma)

## Making predictions 

Now, John has everything to predict ratings. Also, he remembers that since he subtracted the mean during normalization he has to add it back after predictions are made so that accurate answers are observed.

In [27]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_mean.reshape(-1, 1)
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = df.columns)

In [28]:
preds_df

movie title,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",...,Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
0,-0.031335,0.000821,-0.034204,-0.021158,0.001784,-0.153189,-0.083192,-0.004130,0.000049,-0.004028,...,0.003289,-0.000358,0.004787,-0.081861,0.022996,0.020154,-0.064093,-0.005055,-0.024924,0.005055
1,-0.072076,-0.060237,0.598500,2.506863,-0.078158,0.186711,1.490075,5.766853,-0.019980,0.164837,...,-0.073835,-0.193848,-0.026537,3.436720,0.514202,0.103189,0.348079,-0.027599,0.825738,0.015252
2,0.073818,0.000705,0.628568,0.443968,0.203148,0.347435,-0.250397,-1.393918,0.005043,-0.105983,...,-0.024160,0.021793,0.009585,-0.150768,-0.055185,-0.023544,0.198900,-0.007873,0.122707,0.011194
3,-0.017711,0.015344,-0.023134,0.225587,0.651014,-0.161835,-0.158951,-0.141943,0.008309,-0.008342,...,-0.014991,0.104109,-0.009795,0.083725,0.111976,-0.032097,0.101378,-0.016329,-0.021215,-0.001023
4,-0.038605,0.005756,-0.102273,-0.138900,0.369944,-0.261871,0.051986,-0.200241,0.010192,-0.089849,...,0.024312,0.078282,0.016897,0.132723,0.245203,0.102807,0.015265,0.003282,-0.014746,0.015980
5,-0.123549,0.141265,0.752508,-0.212343,-0.137099,-0.653243,0.631983,2.347558,0.003443,-0.105098,...,0.020810,-0.104338,-0.020381,4.142209,0.282970,0.347924,0.389391,0.026321,0.468258,0.005091
6,0.205046,0.032973,0.250763,4.300871,0.153593,0.302792,0.496675,4.199654,0.014966,1.973931,...,-0.025947,0.085415,-0.045599,2.621401,-0.545697,-0.225331,0.077382,0.054582,0.219991,-0.081437
7,-0.078234,0.067212,2.194784,4.923839,-0.183947,0.539404,4.427358,5.051831,0.069877,3.427504,...,0.039704,0.050481,0.073004,5.452918,3.218965,1.276476,1.249201,-0.060226,-0.064488,-0.081601
8,-0.088125,0.030187,-0.112249,-0.477318,0.033606,-0.232024,0.085723,0.304530,0.019027,-0.266139,...,-0.000055,0.066243,0.019692,0.073568,1.521279,0.289155,-0.191295,-0.002665,-0.169865,0.028686
9,-0.002833,0.024514,-0.169797,0.727165,0.014355,0.029264,0.088300,0.111689,-0.008230,0.288903,...,0.006487,-0.039393,0.002707,0.206047,0.165342,0.156679,0.084070,-0.015263,0.024534,0.027832


Now let us see `User 0`'s recommendations

In [29]:
# Recommendations for User_id = 0 in a sorted manner
preds_0 = preds_df.iloc[0,:].sort_values(ascending=False)

# Recommendations if user has not rated the movie previously
recommendations_0 = [i for i in preds_0.index if df.loc[0,i]==0]

# Recommend only top 10 items
final_recommendations_0 = recommendations_0[:10]
print("Recommendations for user with id 0 are: " + ', '.join(final_recommendations_0))

Recommendations for user with id 0 are: Return of the Jedi (1983), Raiders of the Lost Ark (1981), Indiana Jones and the Last Crusade (1989), Godfather, The (1972), Princess Bride, The (1987), Back to the Future (1985), Blade Runner (1982), Toy Story (1995), Terminator, The (1984), Pulp Fiction (1994)


Now they look like some pretty cool recommendations for the user with id `0`. John's friend is satisfied with this method as he finds this scalable to large datasets.

You can try changing the number of latent factors to see how your predicted ratings vary. 

**Also, you can approximate SVD with gradient descent to make your recommendations even better. Although this is a very advanced approach you can always try to do it on your own.**

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***

## Evaluation Metrics
***

There are many metrics used for evaluating a recommender system. We will be discussing a few of them
***

### Comparing predictions with known ratings

**RMSE: Root Mean Squared Error**

  * RMSE is a quadratic scoring rule that also measures the average magnitude of the error. 
  * It’s the square root of the average of squared differences between prediction and actual observation.
  \begin{equation}\large
   RMSE = \sqrt{\frac{1}{n}\sum_{j=1}^n(y_{j}-\hat{y}_{j})^{2}}
\end{equation}
  
  here $r_{x_i}$ is predicted rating and $r_{x_i}^*$ is the actual rating

- Precision at top 10 
  - % of those in top 10

**MAE : Mean Absolute Error**

* MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. 
* It is the average over the test sample of the absolute differences between prediction and actual observation 
  where all individual differences have equal weight.
  

\begin{equation}\large
   MAE = \frac{1}{n}\sum_{j=1}^{n}\mid y_{j} - \hat{y}_{j}\mid
\end{equation}

Apart from them other metrics like `precision`, `recall`, `diversity scores` can be also be used for evaluating a recommender system.

## Content based recommender systems
- Main Idea: Recommend Items to customer X similar to previous items rated highly by X

## Example

- Movie recommendations
  - Recommend movies with same actors, Directors, genere
- Websites Blogs News
  - Recommend other sites with similar content
- Given a description recommend other items with similar description 

![plan_of_action.png](../images/plan_of_action.png)

Being Summer John and his turn their attention to outdoor activities. But wait a minute !!! Amazed with his recommendation capabilities his friend puts him to task to recommend clothing. Lets see what he finds this time. 

In [30]:
data = pd.read_csv('../data/sample-data.csv')

In [31]:
data.head()

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."


Now he finds that this dataset has descriptions unlike the other data sets handled so far. He is confused to come up with recommendations for his friend. He then learns a new concept

![](../images/confused-person.jpg)

<img src="../images/icon/Concept-Alert.png" alt="Concept-Alert" style="width: 100px;float:left; margin-right:15px"/>
***
## TF-IDF
***
TF-IDF stands for Term Frequency times Inverse document frequency. 

- TF Stands for term frequency: It tells us how often a term appeares in a document. 
- IDF stands for Inverse Document Frequency: It tells us how rare it is for a document to have this term. 
  - It is calculated by inverse of how many documents have this tag divided by the total number of documents.      Generally a log of the computed value id used to bring it to scale in which can be used. 

- Then we multiply the TF and IDF to to get a weight which is assigned to a particular search term we are looking for. 



With the newly gained knowledge in hand he sets out to compute [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). He decides to take the following steps
- Create a TF-IDF matrix of unigrams, bigrams, and trigrams
  for each product and remove the common words such as and the etc. 

- Then compute similarity between all products using
  SciKit Leanr's linear_kernel (which in this case is
  equivalent to cosine similarity).

- Iterate through each item's similar items and store the
  100 most-similar. Here he stops at hundred otherwise the list could get too large
  
- Similarities and their scores are stored in redis as a
  Sorted Set, with one set for each item.

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [33]:
tf = TfidfVectorizer(analyzer='word',
                             ngram_range=(1, 3),
                             min_df=0,
                             stop_words='english')
tfidf_matrix = tf.fit_transform(data['description'])

cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

for idx, row in data.iterrows():
            similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
            similar_items = [(cosine_similarities[idx][i], data['id'][i])
                             for i in similar_indices]

            flattened = sum(similar_items[1:],())



John finds that the  sum returns a list of tuples into a single tuple: This single tuple has similarities and item id. This is shown below Thus retreving the id and similarities and sorting them would solve his issue and there he has a recommendation for his friend again. 

In [34]:
similarities = list()
item_id = list()

for i in range(0,len(flattened),2): 
    similarities.append(flattened[i]) # Collecting similarity scores and 
    item_id.append(flattened[i-1])

sol = DataFrame({'id':item_id,'similarities':similarities})
predictions = sol.merge(data,how='inner',on='id')[:10] # using merge to get description

After sorting one can clearly view the predictions. Now these predictions are totally based on the content in the description.   

In [35]:
predictions

Unnamed: 0,id,similarities,description
0,315,0.362816,Kite town t-shirt - Artist Chris Del Moro tran...
1,499,0.318046,All-wear cargo shorts - All-Wear Cargo Shorts ...
2,462,0.317783,Custodian pants - reg - The graveyard shift ha...
3,463,0.315561,Custodian pants - short - The graveyard shift ...
4,32,0.256629,Custodian pants - long - The graveyard shift h...
5,34,0.234255,Delivery shorts - Locals know all the best spo...
6,483,0.216608,Duck shorts - Sometimes life requires a welder...
7,303,0.201515,All-wear capris - Capris are more discreet tha...
8,482,0.201159,Duck pants - short - Essential wear for splitt...
9,481,0.19911,Duck pants - reg - Essential wear for splittin...


**In class Question** looking at the similarity scores and the method used what do you think should be the range of the scores?

<img src="../images/icon/Recap.png" alt="In Session Recap" style="width: 100px;float:left; margin-right:15px"/>

## In class recap
***
- What is Recommendation System
- The long tail
- A simple Popularity based Recommender system
- A Collaborative Filtering Model
- Evaluating a Recommendation system

## Thank You

For more queries - Reach out to academics@greyatom.com

## Next Session: Big Data #01 - HDFS