<a href="https://colab.research.google.com/github/alexiej/AMADA/blob/master/07_Recommendation_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07. RECOMMENDATION SYSTEM

Recommendation systems using different techniques.



In [0]:
import pandas as pd
import numpy as np

## Measure systems

In [238]:
data = pd.DataFrame([['User1', 'Star Wars', '4'],
                     ['User1', 'Star Trek', '4'],
                     ['User1', 'StarGate', '3'],
                     ['User2', 'Star Trek', '3'],
                     ['User2', 'Terminator', '1'],
                     ['User2', 'Hobbit', '5']], columns=['User', 'Movie', 'Rate (1-5)'])
data

Unnamed: 0,User,Movie,Rate (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
2,User1,StarGate,3
3,User2,Star Trek,3
4,User2,Terminator,1
5,User2,Hobbit,5


In [142]:
train_set = data.iloc[[0,1,3]]
train_set

Unnamed: 0,User,Movie,Rate (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
3,User2,Star Trek,3


In [143]:
test_set = data[~data.index.isin(train_set.index)]
test_set

Unnamed: 0,User,Movie,Rate (1-5)
2,User1,StarGate,3
4,User2,Terminator,1
5,User2,Hobbit,5


### Top-N Hit Rate

For this metric, the recommender returns a top-N list for each user. If an item in that list also appears in the user's test set, it counts as a hit.

$\begin{align*}top\_n\_hitrate=\frac{hits}{users}\end{align*}$

The table below shows the recommender's output, generated from the train set only. Movies that the user has actually rated but that were held out in the test set can therefore appear in it, and each such movie is a HIT:

User | Movie | Position
--- | --- | ---
User1 | Godzilla | 0
User1 | StarGate | 1 (HIT)
User1 | Independence Day | 2
User2 | Godzilla | 0
User2 | Terminator | 1 (HIT)
User2 | Independence Day | 2


$\begin{align*}top\_n\_hitrate=\frac{hits}{users}=\frac{2}{2}=1.0\end{align*}$

The code below additionally divides this value by the number of recommendations per user (N=3), which gives a hit rate of $\frac{1.0}{3}\approx 33\%$.

In [144]:
recommendation = pd.DataFrame( [['User1', 'Godzilla', 0, 4.8],
                                ['User1', 'StarGate', 1, 4.7],
                                ['User1', 'Independence Day', 2, 4.0],
                                ['User2', 'Godzilla', 0 , 4.8],
                                ['User2', 'Terminator', 1, 3.8],
                                ['User2', 'Independence Day', 2, 3.0]], columns=['User', 'Movie', 'Position', 'Rate'])
recommendation

Unnamed: 0,User,Movie,Position,Rate
0,User1,Godzilla,0,4.8
1,User1,StarGate,1,4.7
2,User1,Independence Day,2,4.0
3,User2,Godzilla,0,4.8
4,User2,Terminator,1,3.8
5,User2,Independence Day,2,3.0


In [145]:
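def get_top_n_hitrate(data, recommendation):
    # A hit is a (User, Movie) pair present in both the ratings and the recommendations
    hits = len(data.merge(recommendation, on=['User', 'Movie']))
    users = data['User'].nunique()

    # N = number of recommendations per user (assumes every user gets the same number)
    N = len(recommendation) / recommendation['User'].nunique()
    return 0 if users == 0 else (hits / users) / N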

print(f'Our Hit Rate is equal to: {get_top_n_hitrate(data, recommendation):.2%}')

Our Hit Rate is equal to: 33.33%


### Top-N Hit Leave One Out Cross Validation

For leave-one-out cross-validation we remove one rated item per user from the training data and generate recommendations for that user. If the top-N recommendations contain the left-out item, we count it as a hit (`hit += 1`); in either case we increase the total (`total += 1`).

In this example `train_set` contains two users.

* We remove one item per user from the train set: ` [('User1', 'StarGate'), ('User2', 'Terminator')]`
* We return the train set without the items removed in the previous step.

In [146]:
train_set = pd.DataFrame([['User1', 'Star Wars', '4'],
                        ['User1', 'Star Trek', '4'],
                        ['User1', 'StarGate', '3'],
                        ['User2', 'Star Trek', '3'],
                        ['User2', 'Terminator', '1'],
                        ['User2', 'Hobbit', '5']], columns=['User', 'Movie', 'Rate (1-5)'])

np.random.seed(1) # required so that the random sampling below is reproducible

def get_left_out_prediction_movies(df, left=1):
  # Shuffle the rows, then hold out `left` rows per user; the remaining rows are used to build the top-N list
  left_out_rows = df.sample(frac=1).groupby('User').head(left)
  rows_to_topn = df[~df.index.isin(left_out_rows.index)]
  return [ (row['User'], row['Movie'], row['Rate (1-5)']) for index,row in left_out_rows.iterrows()], rows_to_topn

left_out_predictions, rows_to_topn = get_left_out_prediction_movies(train_set, left=1)
print('Left out prediction movies: ', left_out_predictions)
rows_to_topn

Left out prediction movies:  [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]


Unnamed: 0,User,Movie,Rate (1-5)
0,User1,Star Wars,4
1,User1,Star Trek,4
3,User2,Star Trek,3
5,User2,Hobbit,5


* Get recommendations based on the train set from the previous step (without the left-out movies).

In [152]:
def get_recommendations(user_ratings):
  return pd.DataFrame(
      [['User1','Jurrasic Park', 4.5],
      ['User1','Hobbit', 4.0],
      ['User1','Blade Runner', 3.8],
      ['User1','Inception', 3.4],    
      ['User2','Space Odyssey', 4.7],
      ['User2','Terminator', 4.3],
      ['User2','Alien', 3.9],
      ['User2','Predator', 3.4]], columns=['user','movie', 'rate']      
  )

recommendations = get_recommendations(rows_to_topn)
recommendations

Unnamed: 0,user,movie,rate
0,User1,Jurrasic Park,4.5
1,User1,Hobbit,4.0
2,User1,Blade Runner,3.8
3,User1,Inception,3.4
4,User2,Space Odyssey,4.7
5,User2,Terminator,4.3
6,User2,Alien,3.9
7,User2,Predator,3.4


In [197]:
def get_topn_recommendation(recommendations, n = 3):
  topn_recommendation = dict()
  for user in recommendations['user'].unique():
    topn_recommendation[user] = [(series['movie'], series['rate']) for index,series in recommendations[recommendations['user']==user].sort_values(by='rate', ascending=False).head(n).iterrows()]
  return topn_recommendation

def get_topn_movies(user_recommendations):
  # user_recommendations is a list of (movie, rate) tuples for one user
  return [x[0] for x in user_recommendations]

topn_recommendation = get_topn_recommendation(recommendations)
print(get_topn_movies(topn_recommendation['User1']))
topn_recommendation

# {'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)],
#  'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}

['Jurrasic Park', 'Hobbit', 'Blade Runner']


{'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)],
 'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}

* For each left-out movie we check whether it appears in that user's top-N list; if it does, we increase the hit count. In this example the left-out `Terminator` for `User2` appears in the recommendations, so we have 1 hit out of 2 users, and the result is 0.5.

In [198]:
def topn_hitrate_leave_one_out(topn_recommendation, left_out_predictions):
    hit = 0 
    total = 0
    for user, left_out_movie, stars in left_out_predictions:
      recommendation = get_topn_movies(topn_recommendation[user])
      if left_out_movie in recommendation:
        hit+=1
      total += 1
    return hit/total

topn_hitrate_leave_one_out(topn_recommendation, left_out_predictions)

0.5

**CONS**

* On real data the value is typically very low (for example 0.001), because it is hard to hit exactly the one left-out movie among all the movies in the database.
* It requires a large database of user ratings.

### Average Reciprocal Hit Rank (ARHR)

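- This is like the hit rate, but it also takes into account the position $rank_i$ at which the hit appears in the list.
- It is a more user-focused metric: a hit near the top of the list counts more than a hit near the bottom.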

$\begin{align*}arhr=\frac{\sum_{i=1}^{n}\frac{1}{rank_i}}{users}\end{align*}$

In this example 'Terminator' is in second place for `User2`, and `User1` has no hit, so:

$\frac{1}{rank_{user1}}=0$ (there is no hit, so this user contributes 0)

$rank_{user2}=2$ (Terminator is in second place)


$arhr=\frac{\frac{1}{rank_{user1}}+\frac{1}{rank_{user2}}}{users}=\frac{0+\frac{1}{2}}{2}=0.25$



In [199]:
def average_reciprocal_hitrank(topn_recommendation, left_out_predictions):
    hit_rank = 0 
    total = 0
    for user, left_out_movie, stars  in left_out_predictions:
      recommendation = get_topn_movies(topn_recommendation[user])
      if left_out_movie in recommendation:
          hit_rank += 1/(recommendation.index(left_out_movie) + 1 ) # it starts from zero
      total += 1
    return hit_rank/total

print("Recommendation: \n", topn_recommendation)
print("Leave out recommendation: \n", left_out_predictions)

print("ARHR: \n" ,average_reciprocal_hitrank(topn_recommendation, left_out_predictions))

Recommendation: 
 {'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)], 'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}
Leave out recommendation: 
 [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]
ARHR: 
 0.25


### Cumulative Hit Rate (cHR)

Here we only count left-out movies whose actual rating is at or above a threshold. If the user did not enjoy a movie, recommending it should not count as a success.


In our example, the movie 'Terminator' was rated '1', which means the user did not like it.

For the cumulative hit rate we therefore exclude it from the calculation.


In [200]:
def cumulative_hit_rate(topn_recommendation, left_out_predictions, minimum_rank = 3.0):
    hit_rank = 0 
    total = 0
    for user,left_out_movie, stars  in left_out_predictions:
      stars = float(stars)
      recommendation = get_topn_movies(topn_recommendation[user])

      if stars >= minimum_rank:
        if left_out_movie in recommendation:
            hit_rank+= 1/(recommendation.index(left_out_movie) + 1 ) # it starts from zero
        total += 1
    return 0 if total<=0 else hit_rank/total


print("Recommendation: \n", topn_recommendation)
print("Leave out recommendation: \n", left_out_predictions)

print("Cumulative Hit Rate: \n" ,cumulative_hit_rate(topn_recommendation, left_out_predictions, 3.0))

Recommendation: 
 {'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)], 'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}
Leave out recommendation: 
 [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]
Cumulative Hit Rate: 
 0.0



$\frac{1}{rank_{user1}}=0$ (`StarGate` is rated '3', so it is counted, but there is no hit)

`User2`: Terminator is in second place, but it is rated '1' and $1 \geq 3$ is false, so it is skipped and does not increase the total.


$chr(3)=\frac{\frac{1}{rank_{user1}}}{counted\ items}=\frac{0}{1}=0$
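The `cumulative_hit_rate` function above adds $\frac{1}{rank}$ for each hit. A count-based variant, which simply counts hits among the sufficiently well-rated left-out items, could look like the sketch below. The function name `cumulative_hit_count_rate` and the parameter `min_rating` are illustrative assumptions, not part of the notebook.

```python
def cumulative_hit_count_rate(topn_by_user, left_out, min_rating=3.0):
    """Fraction of left-out items rated >= min_rating that appear in the user's top-N list."""
    hits = 0
    total = 0
    for user, movie, stars in left_out:
        if float(stars) < min_rating:
            continue  # low-rated items are skipped and do not increase the total
        total += 1
        recommended = [title for title, _ in topn_by_user.get(user, [])]
        if movie in recommended:
            hits += 1
    return hits / total if total else 0.0


# Same toy data as in the cells above (hypothetical variable names)
topn_example = {
    'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)],
    'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)],
}
left_out_example = [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]

print(cumulative_hit_count_rate(topn_example, left_out_example, 3.0))  # 0.0
print(cumulative_hit_count_rate(topn_example, left_out_example, 1.0))  # 0.5
```

With a threshold of 1.0 nothing is filtered out, so the result equals the plain leave-one-out hit rate.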

### Rating Hit Rate (rHR)

Here we break the hit rate down by star rating (⭐ 1-5). For each left-out movie we look at the rating the user gave it: if the movie appears in the top-N list, it counts as a hit for that rating, and in either case it increases the total for that rating.



In [201]:
from collections import defaultdict

def rating_hit_rate(topn_recommendation, left_out_predictions):
        hits = defaultdict(float)
        total = defaultdict(float)

        # For each left-out rating
        for user, left_out_movie, stars in left_out_predictions:
            stars = float(stars)

            # Is it in the predicted top N for this user?
            recommendation = get_topn_movies(topn_recommendation[user])
            if left_out_movie in recommendation:
                hits[stars]  += 1
            total[stars] += 1
            
        # Print the hit rate for each star rating, from 5 down to 1
        for star in range(5,0, -1): # sorted(hits.keys()):
            print ('\nFor stars:',star,  '⭐'*star, 
                   ', Hit rate is: ', hits[star] / total[star] if star in hits else 0)


print("Recommendation: \n", topn_recommendation)
print("Leave out recommendation: ", left_out_predictions)

rating_hit_rate(topn_recommendation, left_out_predictions)

Recommendation: 
 {'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)], 'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}
Leave out recommendation:  [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]

For stars: 5 ⭐⭐⭐⭐⭐ , Hit rate is:  0

For stars: 4 ⭐⭐⭐⭐ , Hit rate is:  0

For stars: 3 ⭐⭐⭐ , Hit rate is:  0

For stars: 2 ⭐⭐ , Hit rate is:  0

For stars: 1 ⭐ , Hit rate is:  1.0


In this example the only left-out movie that was recommended is `Terminator`, rated `1`, so it is the only hit.

$hits(3)=0.0$ - `StarGate` is not in the recommendations

$total(3)=1.0$ - `StarGate` is in left_outs

$hits(1)=1.0$ - `Terminator` is in the recommendations

$total(1)=1.0$ - `Terminator` is in left_outs


$rating\_hit\_rate(3) = \frac{hits(3)}{total(3)}=\frac{0}{1}=0.0$

$rating\_hit\_rate(1) = \frac{hits(1)}{total(1)}=\frac{1}{1}=1.0$
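The per-rating calculation above can also be returned as a dictionary instead of being printed, which makes it easier to reuse. The following is a minimal sketch under the assumption that ratings are whole stars; the name `rating_hit_rate_table` is hypothetical.

```python
from collections import defaultdict


def rating_hit_rate_table(topn_by_user, left_out):
    """Return {rating: hits / total} for every rating that occurs among the left-out items."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for user, movie, stars in left_out:
        rating = int(float(stars))  # assumption: whole-star ratings
        recommended = [title for title, _ in topn_by_user.get(user, [])]
        totals[rating] += 1
        if movie in recommended:
            hits[rating] += 1
    return {rating: hits[rating] / totals[rating] for rating in sorted(totals)}


# Same toy data as in the cells above (hypothetical variable names)
topn_example = {
    'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)],
    'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)],
}
left_out_example = [('User1', 'StarGate', '3'), ('User2', 'Terminator', '1')]

print(rating_hit_rate_table(topn_example, left_out_example))  # {1: 1.0, 3: 0.0}
```

Ratings that never occur among the left-out items are omitted from the result rather than reported as 0.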

### COVERAGE

The percentage of `<user, item>` pairs that can be predicted with a rating of at least `min_rate`.

This is useful for checking how well the system handles new items that have not been rated yet.

In [211]:
topn_recommendation

{'User1': [('Jurrasic Park', 4.5), ('Hobbit', 4.0), ('Blade Runner', 3.8)],
 'User2': [('Space Odyssey', 4.7), ('Terminator', 4.3), ('Alien', 3.9)]}

In [228]:
def user_coverage(topn_recommendation, num_users, min_rate=3.0):
  hits = 0
  for user in topn_recommendation.keys():
    curr_hit = sum([1.0 for movie, rate in topn_recommendation[user] if rate >= min_rate])
    hits +=  curr_hit/len(topn_recommendation[user])
  return hits/num_users
            
user_coverage(topn_recommendation, num_users = 2, min_rate=3.9)

0.8333333333333333

In this example `User1` has three recommendations, but only two have a rate greater than or equal to 3.9; all three of `User2`'s recommendations have a rate greater than or equal to 3.9.

$user1\_hits=(1.0 + 1.0+0.0)/ 3.0 \approx 0.67$

$user2\_hits=(1.0 + 1.0+1.0)/ 3.0 = 1.00$

$user\_coverage=\frac{user1\_hits+user2\_hits}{num\_users}\approx\frac{1.67}{2}\approx 0.83$

## Surprise

In [0]:
!pip install surprise

In [332]:
data = pd.DataFrame([['User1', 'Space Odyssey', '4'],
                     ['User1', 'Star Trek', '4'],
                     ['User1', 'StarGate', '3'],
                     ['User2', 'Blade Runner', '3'],
                     ['User2', 'Terminator', '1'],
                     ['User2', 'Hobbit', '5'],
                     ['User3', 'Star Trek', '3'],
                     ['User3', 'StarGate', '1'],
                     ['User3', 'Blade Runner', '5'], 
                     ['User3', 'Terminator', '4']], columns=['User', 'Movie', 'Rate (1-5)'])
data

Unnamed: 0,User,Movie,Rate (1-5)
0,User1,Space Odyssey,4
1,User1,Star Trek,4
2,User1,StarGate,3
3,User2,Blade Runner,3
4,User2,Terminator,1
5,User2,Hobbit,5
6,User3,Star Trek,3
7,User3,StarGate,1
8,User3,Blade Runner,5
9,User3,Terminator,4


In [347]:
from surprise import Dataset
from surprise import Reader
from surprise import KNNBaseline
import itertools


datas = Dataset.load_from_df(data, Reader(rating_scale=(1,5)))
# Train Set that contains all data
full_train_set = datas.build_full_trainset()

# Check similarity between two items
sims_algo = KNNBaseline(sim_options={'name': 'pearson_baseline', 'user_based': False})
sims_algo.fit(full_train_set);

print('\n** Compute similarities between all pairs of movies:\n')
sims_matrix = sims_algo.compute_similarities() 
print(sims_matrix)

print('\n** The similarity between two movies (StarGate and Star Trek). \n\
The matrix is symmetric (sims_matrix[stargate_id][startrek_id]=sims_matrix[startrek_id][stargate_id]):\n')

stargate_id = sims_algo.trainset.to_inner_iid('StarGate')
startrek_id = sims_algo.trainset.to_inner_iid('Star Trek')

print(sims_matrix[stargate_id][startrek_id])
print(sims_matrix[startrek_id][stargate_id])

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

** Compute similarities between all pairs of movies:

Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
[[1.         0.         0.         0.         0.         0.        ]
 [0.         1.         0.0041005  0.         0.         0.        ]
 [0.         0.0041005  1.         0.         0.         0.        ]
 [0.         0.         0.         1.         0.00558801 0.        ]
 [0.         0.         0.         0.00558801 1.         0.        ]
 [0.         0.         0.         0.         0.         1.        ]]

** The similarity between two movies (StarGate and Star Trek). 
The matrix is symmetric (sims_matrix[stargate_id][startrek_id]=sims_matrix[startrek_id][stargate_id]):

0.004100497759935318
0.004100497759935318


`StarGate` and `Star Trek` have a non-zero similarity because both were rated by the same users (`User1` and `User3`), so they might be similar. `Space Odyssey` has no similarity to any other movie because it was rated by only one user.

### DIVERSITY

For diversity we use the similarity matrix between movies. `Diversity` is `(1-S)`, where `S` is the average similarity over all pairs of movies within each user's recommendations.

* Very high diversity often means that the algorithm returns nearly random recommendations.
* Very low diversity is not good either, because the recommended movies are all very similar to each other.

In [371]:
topn_recommendation = {'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)],
 'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3),  ('StarGate', 4.3)]}

print('Our recommendation:\n ', topn_recommendation)

def diversity(topn_recommendation, sims_algo):
    n = 0
    total = 0
    sims_matrix = sims_algo.compute_similarities()
    print('\nCalculate similarity for all combinations:\n')
    for user in topn_recommendation.keys():
        pairs = itertools.combinations(get_topn_movies(topn_recommendation[user]), 2)
        # e.g. for User2 the pairs of movie titles are:
        # [('Space Odyssey', 'Star Trek'),
        #  ('Space Odyssey', 'StarGate'),
        #  ('Star Trek', 'StarGate')]
        for pair in pairs:
            movie1 = pair[0]
            movie2 = pair[1]

            id1 = sims_algo.trainset.to_inner_iid(str(movie1))
            id2 = sims_algo.trainset.to_inner_iid(str(movie2))

            similarity = sims_matrix[id1][id2]
            print('Similarity(', movie1,',',  movie2, ') = ' , similarity)
            total += similarity
            n += 1

    S = total / n
    print('\nTotal Similarity: ', total, '/', n, ' = ', S)
    return (1-S)


print('\nCalculation Diversity:\n')
diversity_score = diversity(topn_recommendation, sims_algo)  # separate name so the function is not overwritten


print('\nDiversity: \n', diversity_score)

Our recommendation:
  {'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)], 'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3), ('StarGate', 4.3)]}

Calculation Diversity:

Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

Calculate similarity for all combinations:

Similarity( Blade Runner , Hobbit ) =  0.0
Similarity( Space Odyssey , Star Trek ) =  0.0
Similarity( Space Odyssey , StarGate ) =  0.0
Similarity( Star Trek , StarGate ) =  0.004100497759935318

Total Similarity:  0.004100497759935318 / 4  =  0.0010251244399838294

Diversity: 
 0.9989748755600162


For User1 we've got only one combination of recommendation:

```
[('Blade Runner', 'Hobbit')]

Similarity( Blade Runner , Hobbit ) =  0.0
```

For User2 we've got three combinations:

```
[('Space Odyssey', 'Star Trek'),
 ('Space Odyssey', 'StarGate'),
 ('Star Trek', 'StarGate')]
```

And similarity is:

```
Similarity( Space Odyssey , Star Trek ) =  0.0
Similarity( Space Odyssey , StarGate ) =  0.0
Similarity( Star Trek , StarGate ) =  0.004100497759935318
```

$S = (0.0+0.0+0.0+0.0041) / n = 0.0041 / 4 = 0.001025$

$Diversity=(1-S)=1-0.001025=0.998975$
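The same arithmetic can be checked without Surprise. The sketch below assumes the pairwise similarities are already known and stored in a plain dictionary keyed by unordered title pairs; `diversity_from_pairs` and `pair_similarity` are hypothetical names, and any pair missing from the dictionary is assumed to have similarity 0.

```python
import itertools


def diversity_from_pairs(topn_by_user, pair_similarity):
    """Return 1 - S, where S is the mean similarity over all within-user pairs of titles."""
    total = 0.0
    n = 0
    for items in topn_by_user.values():
        titles = [title for title, _ in items]
        for a, b in itertools.combinations(titles, 2):
            # Assumption: pairs not listed in the dictionary have similarity 0
            total += pair_similarity.get(frozenset((a, b)), 0.0)
            n += 1
    if n == 0:
        return None  # no pairs, so diversity is undefined
    return 1.0 - total / n


# Rounded similarity taken from the output above
known_sims = {frozenset(('Star Trek', 'StarGate')): 0.0041}
topn_example = {
    'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)],
    'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3), ('StarGate', 4.3)],
}

score = diversity_from_pairs(topn_example, known_sims)
print(round(score, 6))  # 0.998975
```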

### NOVELTY

In [409]:
topn_recommendation = {'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)],
 'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3),  ('StarGate', 4.3)]}
print(topn_recommendation)

data = pd.DataFrame([['User1', 'Space Odyssey', '4'],
                     ['User1', 'Star Trek', '4'],
                     ['User1', 'StarGate', '3'],
                     ['User2', 'Blade Runner', '3'],
                     ['User2', 'Terminator', '1'],
                     ['User2', 'Hobbit', '5'],
                     ['User3', 'Star Trek', '3'],
                     ['User3', 'StarGate', '1'],
                     ['User3', 'Blade Runner', '5'], 
                     ['User3', 'Terminator', '4']], columns=['User', 'Movie', 'Rate (1-5)'])
data

{'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)], 'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3), ('StarGate', 4.3)]}


Unnamed: 0,User,Movie,Rate (1-5)
0,User1,Space Odyssey,4
1,User1,Star Trek,4
2,User1,StarGate,3
3,User2,Blade Runner,3
4,User2,Terminator,1
5,User2,Hobbit,5
6,User3,Star Trek,3
7,User3,StarGate,1
8,User3,Blade Runner,5
9,User3,Terminator,4


For novelty we look at how popular each recommended movie is, measured here as the number of users who rated it, and average that value over all recommendations. Note that with this count-based score a higher value means the recommended movies are more popular.

In [413]:
def get_rankings(data):
    return data.groupby('Movie').count()['User'].sort_values(ascending=False)

def novelty(topn_recommendation, rankings):
    n = 0
    total = 0
    for user in topn_recommendation.keys():
        for movie in get_topn_movies(topn_recommendation[user]):
            rank = rankings[movie]
            total += rank
            print('Rank', rank,':' ,movie)
            n += 1
    print('\nNovelty: ', total, '/', n, ' = ', total/n)
    return total / n

print('Recommendation: ', topn_recommendation,'\n')
nov = novelty( topn_recommendation, get_rankings(data))


# Second example: recommend only the most-rated movies
topn_recommendation = {'User1': [('Blade Runner', 4.5)],
 'User2': [('Star Trek', 4.3),  ('StarGate', 4.3)]}
print('\nRecommendation: ', topn_recommendation,'\n')
nov = novelty( topn_recommendation, get_rankings(data))

Recommendation:  {'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)], 'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3), ('StarGate', 4.3)]} 

Rank 2 : Blade Runner
Rank 1 : Hobbit
Rank 1 : Space Odyssey
Rank 2 : Star Trek
Rank 2 : StarGate

Novelty:  8 / 5  =  1.6

Recommendation:  {'User1': [('Blade Runner', 4.5)], 'User2': [('Star Trek', 4.3), ('StarGate', 4.3)]} 

Rank 2 : Blade Runner
Rank 2 : Star Trek
Rank 2 : StarGate

Novelty:  6 / 3  =  2.0


The value for `Blade Runner`, `Star Trek`, and `StarGate` is 2 because each was rated by two users. The other recommended movies (`Hobbit` and `Space Odyssey`) were rated by only one user. If we recommend only the most-rated movies, the computed value is higher.

* A very high novelty is not good. If users only get movies they do not know, they might think the algorithm is random.
* A very low novelty means the system shows only popular movies that users have probably already watched or heard of, even if they have not rated them in the system.
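The `novelty` function above averages rating counts. A rank-based variant, where rank 1 is the most-rated movie and a larger mean rank means more obscure recommendations, might look like the sketch below. The function names and the alphabetical tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def popularity_ranks(ratings):
    """Map each movie to its popularity rank (1 = most rated); ties are broken by title."""
    counts = Counter(movie for _, movie, _ in ratings)
    ordered = sorted(counts, key=lambda m: (-counts[m], m))
    return {movie: rank for rank, movie in enumerate(ordered, start=1)}


def mean_popularity_rank(topn_by_user, ranks):
    """Average popularity rank over all recommended movies; higher means more novel."""
    recommended = [title for items in topn_by_user.values() for title, _ in items]
    return sum(ranks[title] for title in recommended) / len(recommended)


ratings = [('User1', 'Space Odyssey', '4'), ('User1', 'Star Trek', '4'),
           ('User1', 'StarGate', '3'), ('User2', 'Blade Runner', '3'),
           ('User2', 'Terminator', '1'), ('User2', 'Hobbit', '5'),
           ('User3', 'Star Trek', '3'), ('User3', 'StarGate', '1'),
           ('User3', 'Blade Runner', '5'), ('User3', 'Terminator', '4')]
ranks = popularity_ranks(ratings)

broad = {'User1': [('Blade Runner', 4.5), ('Hobbit', 4.0)],
         'User2': [('Space Odyssey', 4.7), ('Star Trek', 4.3), ('StarGate', 4.3)]}
popular_only = {'User1': [('Blade Runner', 4.5)],
                'User2': [('Star Trek', 4.3), ('StarGate', 4.3)]}

print(mean_popularity_rank(broad, ranks))         # 3.4
print(mean_popularity_rank(popular_only, ranks))  # 2.0
```

Under this definition the list that includes the less-rated `Hobbit` and `Space Odyssey` scores as more novel than the list of most-rated movies only.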

## COSINE SIMILARITY


https://en.wikipedia.org/wiki/Cosine_similarity
---
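As a minimal sketch of the formula from the linked article, cosine similarity is the dot product of two vectors divided by the product of their lengths. The example below assumes each movie is represented by the ratings it received from the users who rated both movies; that representation and the function name `cosine_similarity` are assumptions, not something defined in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as not similar to anything
    return dot / (norm_a * norm_b)


# Ratings by (User1, User3) taken from the data above
star_trek = [4, 3]
stargate = [3, 1]
print(cosine_similarity(star_trek, stargate))
```

Because ratings are all positive, raw cosine similarity between rating vectors tends to be high; this is one reason the Surprise cell above uses a baseline-adjusted Pearson measure instead.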

