<a href="https://colab.research.google.com/github/alexiej/laboratory/blob/master/07_Recommendation_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07. RECOMMENDATION SYSTEM

Recommendation systems using different techniques.



In [0]:
import pandas as pd
import numpy as np

## Measure systems

In [2]:
data = pd.DataFrame([['User1', 'Star Wars', '4'],
                     ['User1', 'Star Trek', '4'],
                     ['User1', 'StarGate', '3'],
                     ['User2', 'Star Trek', '3'],
                     ['User2', 'Terminator', '1'],
                     ['User2', 'Hobbit', '5']], columns=['User', 'Movie', 'Stars (1-5)'])
data

Unnamed: 0,User,Movie,Stars (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
2,User1,StarGate,3
3,User2,Star Trek,3
4,User2,Terminator,1
5,User2,Hobbit,5


In [31]:
train_set = data.iloc[[0,1,3]]
train_set

Unnamed: 0,User,Movie,Stars (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
3,User2,Star Trek,3


In [32]:
test_set = data[~data.index.isin(train_set.index)]
test_set

Unnamed: 0,User,Movie,Stars (1-5)
2,User1,StarGate,3
4,User2,Terminator,1
5,User2,Hobbit,5


### Top-N Hit Rate

In this measure, the recommendation system returns the top N recommendations for each user. If a recommended item also appears in that user's test set, it counts as a hit.

$\begin{align*}top\_n\_hitrate=\frac{hits}{users}\end{align*}$

The recommendations below are the output of a recommendation system built from the training set only. Movies that a user has actually seen can therefore still appear in the list when they were held out in the test set; those are the hits:

User | Movie | Position
--- | --- | ---
User1 | Godzilla | 0
User1 | StarGate | 1 (HIT)
User1 | Independence Day | 2
User2 | Godzilla | 0
User2 | Terminator | 1 (HIT)
User2 | Independence Day | 2


$\begin{align*}top\_n\_hitrate=\frac{hits}{users}=\frac{2}{2}=1.0\end{align*}$

The code below additionally divides this value by the number of recommendations per user (N = 3), which gives a hit rate of 1.0 / 3 ≈ 33%.

In [6]:
recommendation = pd.DataFrame( [['User1', 'Godzilla', 0],
                                ['User1', 'StarGate', 1],
                                ['User1', 'Independence Day', 2],
                                ['User2', 'Godzilla', 0],
                                ['User2', 'Terminator', 1],
                                ['User2', 'Independence Day', 2]], columns=['User', 'Movie', 'Position'])
recommendation

Unnamed: 0,User,Movie,Position
0,User1,Godzilla,0
1,User1,StarGate,1
2,User1,Independence Day,2
3,User2,Godzilla,0
4,User2,Terminator,1
5,User2,Independence Day,2


In [30]:
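def get_top_n_hitrate(data, recommendation):
    # A hit is a (User, Movie) pair present in both the held-out data and the recommendations.
    hits = len(data.merge(recommendation, on=['User', 'Movie']))
    users = data['User'].nunique()
    rec_users = recommendation['User'].nunique()
    if users == 0 or rec_users == 0:
        return 0.0

    # N = average number of recommendations per user (3 in this example).
    N = len(recommendation) / rec_users
    return (hits / users) / N

print(f'Our Hit Rate is equal to: {get_top_n_hitrate(test_set, recommendation):.2%}')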

Our Hit Rate is equal to: 33.33%
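
As a cross-check of the formula above, here is a minimal sketch that computes `hits / users` without the extra division by N. The helper name `hits_per_user` is hypothetical, and the two frames are retyped from the example so the block runs on its own.

```python
import pandas as pd

# Held-out ratings (the test set from the example, without the stars column).
test_set = pd.DataFrame(
    [['User1', 'StarGate'],
     ['User2', 'Terminator'],
     ['User2', 'Hobbit']],
    columns=['User', 'Movie'])

# Top-3 recommendations per user, as in the table above.
recommendation = pd.DataFrame(
    [['User1', 'Godzilla', 0],
     ['User1', 'StarGate', 1],
     ['User1', 'Independence Day', 2],
     ['User2', 'Godzilla', 0],
     ['User2', 'Terminator', 1],
     ['User2', 'Independence Day', 2]],
    columns=['User', 'Movie', 'Position'])


def hits_per_user(test_df, rec_df):
    """Hypothetical helper: number of hits divided by the number of test users."""
    hits = test_df.merge(rec_df, on=['User', 'Movie'])
    users = test_df['User'].nunique()
    return 0.0 if users == 0 else len(hits) / users


rate = hits_per_user(test_set, recommendation)
print(rate)      # 2 hits / 2 users
print(rate / 3)  # the same value divided by N = 3
```

The first value matches the 1.0 from the formula; the second is the N-normalized variant that the notebook reports as a percentage.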


### Top-N Hit Leave One Out Cross Validation

For each user we remove one rated item from the training data and then generate recommendations for that user. If the recommendations contain the left-out item, we count a hit (`hit += 1`); in either case we increase the number of evaluated users (`total += 1`).

In this example `train_set` contains ratings from two users.

* We remove one item per user from the training set: `[('User1', 'StarGate'), ('User2', 'Terminator')]`
* We return the training set without the items removed in the previous step.

In [168]:
train_set = pd.DataFrame([['User1', 'Star Wars', '4'],
                        ['User1', 'Star Trek', '4'],
                        ['User1', 'StarGate', '3'],
                        ['User2', 'Star Trek', '3'],
                        ['User2', 'Terminator', '1'],
                        ['User2', 'Hobbit', '5']], columns=['User', 'Movie', 'Stars (1-5)'])

np.random.seed(1)  # fix the seed so the shuffle below is reproducible

def get_left_out_prediction_movies(df):
    # Shuffle the rows, then hold out the first row of each user (one random item per user).
    left_out_rows = df.sample(frac=1).groupby('User').head(1)
    # Everything that was not held out is used to build the recommendations.
    rows_to_topn = df[~df.index.isin(left_out_rows.index)]
    left_out = [(row['User'], row['Movie']) for _, row in left_out_rows.iterrows()]
    return left_out, rows_to_topn

left_out_predictions, rows_to_topn = get_left_out_prediction_movies(train_set)
print('Left out prediction movies: ', left_out_predictions)
rows_to_topn

Left out prediction movies:  [('User1', 'StarGate'), ('User2', 'Terminator')]


Unnamed: 0,User,Movie,Stars (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
3,User2,Star Trek,3
5,User2,Hobbit,5
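
The split can also be made reproducible without touching NumPy's global random state, by passing `random_state` to `DataFrame.sample`. The following is a minimal sketch under that assumption; `split_leave_one_out` and `seed` are hypothetical names, and the rows held out for a given seed are not necessarily the ones printed above.

```python
import pandas as pd

ratings = pd.DataFrame(
    [['User1', 'Star Wars', 4],
     ['User1', 'Star Trek', 4],
     ['User1', 'StarGate', 3],
     ['User2', 'Star Trek', 3],
     ['User2', 'Terminator', 1],
     ['User2', 'Hobbit', 5]],
    columns=['User', 'Movie', 'Stars (1-5)'])


def split_leave_one_out(df, seed=1):
    """Hold out one randomly chosen row per user; return (held_out, remaining)."""
    # Shuffle with a local seed, then keep the first shuffled row of each user.
    shuffled = df.sample(frac=1, random_state=seed)
    held_out = shuffled.groupby('User').head(1)
    # Drop the held-out rows from the original frame to get the training rows.
    remaining = df.drop(held_out.index)
    return held_out, remaining


held_out, remaining = split_leave_one_out(ratings)
print(held_out[['User', 'Movie']])
print(remaining)
```

Calling the function again with the same `seed` returns the same split, which makes repeated evaluation runs comparable.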


* Get recommendations for the training set from the previous step (without the left-out movies).

In [165]:
def get_topn_recommendation(df, n = 3):
  # Placeholder: returns a fixed top-N list per user and ignores `df` and `n`.
  return {
          'User1': ['Jurassic Park', 'Hobbit', 'Blade Runner'],
          'User2': ['Terminator', 'Space Odyssey', 'Alien']}

topn_recommendation = get_topn_recommendation(rows_to_topn)
topn_recommendation

{'User1': ['Jurassic Park', 'Hobbit', 'Blade Runner'],
 'User2': ['Terminator', 'Space Odyssey', 'Alien']}

* For each left-out movie we check whether it appears in that user's recommendation list; if it does, we increase the hit count. In this example the excluded `Terminator` for `User2` is in the recommendations, so we have 1 hit for 2 users and the result is 0.5.

In [172]:
def get_topn_hitrate_leave_one_out(topn_recommendation, left_out_predictions):
    hit = 0
    total = 0
    for user, left_out_movie in left_out_predictions:
        # A hit means the held-out movie shows up in that user's top-N list.
        if left_out_movie in topn_recommendation[user]:
            hit += 1
        total += 1
    return hit / total if total else 0.0

get_topn_hitrate_leave_one_out(topn_recommendation, left_out_predictions)

0.5
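
Because `get_topn_recommendation` returns a fixed list, it may help to see the same evaluation run against a recommender that actually reads the training rows. The sketch below swaps in a simple "most-rated movies the user has not rated yet" baseline; this baseline and the names `recommend_most_rated` and `hit_rate` are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Training rows after the leave-one-out step (same rows as `rows_to_topn`).
train = pd.DataFrame(
    [['User1', 'Star Wars', 4],
     ['User1', 'Star Trek', 4],
     ['User2', 'Star Trek', 3],
     ['User2', 'Hobbit', 5]],
    columns=['User', 'Movie', 'Stars (1-5)'])

# Left-out (user, movie) pairs, as printed earlier.
left_out = [('User1', 'StarGate'), ('User2', 'Terminator')]


def recommend_most_rated(train_df, n=3):
    """Hypothetical baseline: most-rated movies that the user has not rated yet."""
    popularity = train_df['Movie'].value_counts()  # most-rated first
    result = {}
    for user, rows in train_df.groupby('User'):
        seen = set(rows['Movie'])
        unseen = [movie for movie in popularity.index if movie not in seen]
        result[user] = unseen[:n]
    return result


def hit_rate(topn, left_out_pairs):
    """Fraction of left-out pairs whose movie is in that user's top-N list."""
    if not left_out_pairs:
        return 0.0
    hits = sum(movie in topn.get(user, []) for user, movie in left_out_pairs)
    return hits / len(left_out_pairs)


topn = recommend_most_rated(train)
print(topn)
print(hit_rate(topn, left_out))
```

With this data the hit rate is 0.0: each left-out movie was rated by only one user, so after removal it no longer occurs in the training rows and the baseline has no way to recommend it.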

#### CONS

* On real data the value is very low (for example 0.001), because it is hard to place exactly the one excluded movie in the top-N list.
* Requires a huge database of user ratings.

## COSINE SIMILARITY


https://en.wikipedia.org/wiki/Cosine_similarity
---
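
For reference, the cosine similarity of two vectors $a$ and $b$ is $\cos(\theta)=\frac{a \cdot b}{\|a\| \, \|b\|}$. A minimal NumPy sketch follows; the function name and the sample vectors are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here, not part of the definition.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either has zero length)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return float(np.dot(a, b) / norm)


# Illustrative vectors only, not taken from the ratings above.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction, approximately 1.0
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal vectors
print(cosine_similarity([0, 0], [1, 1]))        # zero vector falls back to 0.0
```

The value depends only on the direction of the vectors, not their length, which is why scaling one vector by 2 leaves the similarity at about 1.0.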

