# Imports

In [1]:
import numpy as np
import pandas as pd
from joblib import load
from scipy.spatial.distance import cosine

# Toy data

In [2]:
ratings = pd.DataFrame([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0]
], columns=['Star Wars', 'Matrix', 'Avengers', 'Lord Of The Rings'])
ratings.index.name = 'User'

In [3]:
ratings

User,Star Wars,Matrix,Avengers,Lord Of The Rings
0,1,1,0,0
1,0,1,1,0
2,0,0,1,1
3,1,0,1,1
4,0,0,1,0


# Notation

<table>
    <tr>
        <td>$i$</td>
        <td>Item</td>
    </tr>
    <tr>
        <td>$u$</td>
        <td>User</td>
    </tr>
    <tr>
        <td>$I_{u}$</td>
        <td>Rated items/interactions of user u</td>
    </tr>
    <tr>
        <td>$sim(i, j)$</td>
        <td>Similarity of items i and j</td>
    </tr>
    <tr>
        <td>$dist(i, j)$</td>
        <td>Distance between items i and j</td>
    </tr>
    <tr>
        <td>$novelty(i, u)$</td>
        <td>Novelty of item i for user u</td>
    </tr>
    <tr>
        <td>$unexpectedness(i, u)$</td>
        <td>Unexpectedness of item i for user u</td>
    </tr>
    <tr>
        <td>$relevance(i, u)$</td>
        <td>Relevance of item i for user u</td>
    </tr>
    <tr>
        <td>$serendipity(i, u)$</td>
        <td>Serendipity of item i for user u</td>
    </tr>
</table>

# Metrics

## Novelty

Metrics based on an item's distance from the user profile (the items a user has consumed) [S. Vargas, P. Castells].

$$ dist(i, j) = 1 - sim(i, j) $$

The similarity metric must satisfy $sim(i, j) \in [0, 1]$, so that the distance also lies in $[0, 1]$.

$ novelty_1 $ returns the distance to the closest (most similar) item among all rated items

$$ novelty_1(i, u) = \min_{j \in I_{u}} {dist(i, j)} $$

$ novelty_2 $ returns the mean distance to all rated items

$$ novelty_2(i, u) = \frac {1}{|I_{u}|} \sum_{j \in I_{u}} dist(i, j) $$

In [4]:
def similarity(i, j):
    # scipy's `cosine` returns the cosine *distance*, so similarity is its complement
    return 1 - cosine(i, j)

def distance(i, j):
    return 1 - similarity(i, j)

In [5]:
def novelty1(i, rated_items):
    """
    The first variation of the novelty metric proposed by S. Vargas and P. Castells.
    
    Parameters
    ----------
    i
        Item whose novelty is computed
    rated_items : list
        Items rated by the user for whom the novelty of item i is computed
    """
    return np.min([distance(i, j) for j in rated_items])

In [6]:
def novelty2(i, rated_items):
    """
    The second variation of the novelty metric proposed by S. Vargas and P. Castells.
    
    Parameters
    ----------
    i
        Item whose novelty is computed
    rated_items : list
        Items rated by the user for whom the novelty of item i is computed
    """
    return (1 / len(rated_items)) * np.sum([distance(i, j) for j in rated_items])
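
As a rough illustration of the two formulas, the sketch below computes both novelty variants for one candidate item on the toy data. It assumes that each item is represented by its column of the toy `ratings` matrix (one entry per user) and uses a plain NumPy cosine similarity. The helper names `cosine_similarity` and `cosine_distance`, and the choice of user 0 as the example, are illustrative assumptions rather than part of the notebook.

```python
import numpy as np

# Toy item vectors: the columns of the `ratings` frame, one entry per user
items = {
    'Star Wars':         np.array([1, 0, 0, 1, 0]),
    'Matrix':            np.array([1, 1, 0, 0, 0]),
    'Avengers':          np.array([0, 1, 1, 1, 1]),
    'Lord Of The Rings': np.array([0, 0, 1, 1, 0]),
}

def cosine_similarity(i, j):
    # For non-negative vectors the result lies in [0, 1]
    return float(i @ j / (np.linalg.norm(i) * np.linalg.norm(j)))

def cosine_distance(i, j):
    return 1 - cosine_similarity(i, j)

# User 0 rated Star Wars and Matrix; the candidate item is Lord Of The Rings
candidate = items['Lord Of The Rings']
profile = [items['Star Wars'], items['Matrix']]

distances = [cosine_distance(candidate, j) for j in profile]
novelty_min = np.min(distances)    # distance to the closest rated item
novelty_mean = np.mean(distances)  # mean distance to all rated items

print(novelty_min, novelty_mean)
```

Here the candidate shares one rater with Star Wars and none with Matrix, so the minimum-based variant is about 0.5 and the mean-based variant about 0.75.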

## Unexpectedness

<table>
    <tr>
        <td>$p(i)$</td>
        <td>Probability that any user has rated item i</td>
    </tr>
    <tr>
        <td>$p(i, j)$</td>
        <td>Probability that items i and j are rated together</td>
    </tr>
</table>

<i>Point-wise mutual information</i> (PMI) indicates how similar two items are, based on the number of users who have rated both items and the number who have rated each item separately [Kaminskas, Bridge]:

$$ PMI(i, j) = -\log_2 \frac {p(i, j)}{p(i)p(j)}/\log_2 p(i, j) = - \frac {\log_2 {p(i, j)} - \log_2 {p(i)p(j)}} {\log_2 p(i, j)} $$

$PMI(i, j) \in [-1, 1]$, where -1 indicates that items i and j are never rated together, while 1 indicates that items i and j are always rated together.
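
The two boundary values are limits of the formula rather than values it can always be evaluated at: with $p(i, j) = 0$ the logarithm is undefined, and with $p(i, j) = 1$ the denominator is zero. The sketch below handles both cases explicitly. It assumes binary rating vectors, and the helper name `npmi_safe` and the convention of returning 1 when every user rated both items are assumptions made for illustration.

```python
import numpy as np

def npmi_safe(i, j):
    """PMI as defined above, with explicit boundary handling (hypothetical helper)."""
    i = np.asarray(i)
    j = np.asarray(j)
    p_i = i.mean()                        # share of users who rated item i
    p_j = j.mean()                        # share of users who rated item j
    p_ij = np.mean((i == 1) & (j == 1))   # share of users who rated both
    if p_ij == 0:
        return -1.0  # never rated together: the limit of the formula
    if p_ij == 1:
        return 1.0   # assumption: every user rated both items
    return float(-np.log2(p_ij / (p_i * p_j)) / np.log2(p_ij))

star_wars = [1, 0, 0, 1, 0]
matrix = [1, 1, 0, 0, 0]
lotr = [0, 0, 1, 1, 0]

never_together = npmi_safe(matrix, lotr)            # no user rated both
always_together = npmi_safe(star_wars, star_wars)   # an item always co-occurs with itself

print(never_together, always_together)
```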

In [7]:
def single_probability(i):
    """
    Returns the probability that a user has rated the item.
    
    Parameters
    ----------
    i : np.array
        1-D array of binary ratings for the item, one entry per user
    """
    return np.sum(i, axis=0) / i.shape[0]

In [8]:
ratings

User,Star Wars,Matrix,Avengers,Lord Of The Rings
0,1,1,0,0
1,0,1,1,0
2,0,0,1,1
3,1,0,1,1
4,0,0,1,0


In [9]:
def together_rated_probability(i, j):
    """
    Returns the probability that both items are rated together.
    
    Parameters
    ----------
    i : np.array
        Numpy 1-D array containing item ratings of the first item
    j : np.array
        Numpy 1-D array containing item ratings of the second item
    """
    return np.sum((i + j) == 2) / i.shape[0]

In [10]:
star_wars_rating_prob = single_probability(ratings['Star Wars'])
star_wars_rating_prob

0.4

In [11]:
star_wars_avengers_rating_prob = together_rated_probability(ratings['Star Wars'], ratings['Avengers'])
star_wars_avengers_rating_prob

0.2

In [12]:
def pmi(i, j):
    """
    Point-wise mutual information.
    """
    p_i = single_probability(i)
    p_j = single_probability(j)
    p_i_j = together_rated_probability(i, j)
    return -np.log2(p_i_j / (p_i * p_j)) / np.log2(p_i_j)

Star Wars and Avengers are rated together less often than independence would predict ($0.2 < 0.4 \cdot 0.8$), so their PMI is negative.

In [13]:
pmi(ratings['Star Wars'], ratings['Avengers'])

-0.2920296742201793
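
The value can be checked by hand from the toy data: 2 of 5 users rated Star Wars, 4 of 5 rated Avengers, and 1 of 5 rated both.

$$ p(i) = 0.4, \quad p(j) = 0.8, \quad p(i, j) = 0.2 $$

$$ PMI(i, j) = -\frac{\log_2 \frac{0.2}{0.4 \cdot 0.8}}{\log_2 0.2} = -\frac{\log_2 0.625}{\log_2 0.2} \approx -\frac{-0.678}{-2.322} \approx -0.292 $$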

Based on PMI, unexpectedness has two variations:

$$ unexpectedness_{1}(i, u) = \max_{j \in I_{u}} {PMI(i, j)} $$

$$ unexpectedness_{2}(i, u) = \frac {1}{|I_{u}|} \sum_{j \in I_{u}} {PMI(i, j)} $$

In [14]:
def unexpectedness1(i, rated_items):
    return np.max([pmi(i, j) for j in rated_items])

In [15]:
def unexpectedness2(i, rated_items):
    # Mean PMI over the rated items, matching the formula above
    return np.sum([pmi(i, j) for j in rated_items]) / len(rated_items)
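
To see how the two variants differ, the sketch below evaluates both on the toy data for a user who rated Avengers and Lord Of The Rings, with Star Wars as the candidate item. It is self-contained, so it redefines the PMI computation under the hypothetical name `npmi`, and it assumes binary rating vectors and pairs that co-occur at least once.

```python
import numpy as np

# Toy item vectors: the columns of the `ratings` frame, one entry per user
items = {
    'Star Wars':         np.array([1, 0, 0, 1, 0]),
    'Avengers':          np.array([0, 1, 1, 1, 1]),
    'Lord Of The Rings': np.array([0, 0, 1, 1, 0]),
}

def npmi(i, j):
    p_i = i.mean()
    p_j = j.mean()
    p_ij = np.mean((i == 1) & (j == 1))
    # Assumes p_ij > 0; a pair that never co-occurs needs separate handling
    return float(-np.log2(p_ij / (p_i * p_j)) / np.log2(p_ij))

candidate = items['Star Wars']
profile = [items['Avengers'], items['Lord Of The Rings']]

scores = [npmi(candidate, j) for j in profile]
unexp_max = np.max(scores)    # first variation: maximum PMI
unexp_mean = np.mean(scores)  # second variation: mean PMI

print(scores, unexp_max, unexp_mean)
```

The maximum is driven by the single most strongly associated rated item (here Lord Of The Rings), while the mean also reflects the weak association with Avengers.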

## Relevance

We propose two variations based on the user profile:

$$ relevance_{1}(i, u) = \min_{j \in S \subset I_{u}} sim(i, j) $$

$$ relevance_{2}(i, u) = \frac {1}{|I_{u}|} \sum_{j \in S \subset I_{u}} sim(i, j) $$

Here *S* is a subset of the items rated by user u, e.g. the ***k*** most recent interactions.
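
A minimal sketch of the two relevance formulas follows. It assumes that `rated_items` is ordered from oldest to newest, so that *S* can be taken as the last `k` entries, and that similarity is a cosine similarity over item vectors. The function names, the parameter `k`, and the example ordering of the profile are all hypothetical.

```python
import numpy as np

def cosine_similarity(i, j):
    return float(i @ j / (np.linalg.norm(i) * np.linalg.norm(j)))

def relevance1(i, rated_items, k=2):
    recent = rated_items[-k:]  # S: the k most recent rated items
    return min(cosine_similarity(i, j) for j in recent)

def relevance2(i, rated_items, k=2):
    recent = rated_items[-k:]
    # Normalized by |I_u|, as in the formula above
    return sum(cosine_similarity(i, j) for j in recent) / len(rated_items)

matrix = np.array([1, 1, 0, 0, 0])
star_wars = np.array([1, 0, 0, 1, 0])
avengers = np.array([0, 1, 1, 1, 1])
lotr = np.array([0, 0, 1, 1, 0])

# Hypothetical profile, oldest first; the candidate item is Lord Of The Rings
profile = [matrix, star_wars, avengers]
rel_min = relevance1(lotr, profile, k=2)
rel_sum = relevance2(lotr, profile, k=2)

print(rel_min, rel_sum)
```

With `k=2` only Star Wars and Avengers enter the computation; Matrix, the oldest interaction, affects only the normalization in the second variation.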

## Serendipity

We suggest a serendipity metric that takes into account an item's novelty, unexpectedness, and relevance.

$$ serendipity(i, u) = \alpha \ novelty(i, u) + \beta \ unexpectedness(i, u) + \gamma \ relevance(i, u) $$

The coefficients must satisfy $ \alpha + \beta + \gamma = 1 $; they can be found empirically or adjusted by the learning algorithm.

Setting any of the three coefficients to 0 drops the corresponding component, so the resulting definition of serendipity varies [A Survey of Serendipity in Recommender Systems, Kotkov, Wang & Veijalainen].
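
The weighted combination can be sketched as a small function that also checks the constraint on the coefficients. The function signature, the default weights, and the component scores in the example are hypothetical values chosen for illustration, not results from the notebook.

```python
import math

def serendipity(novelty, unexpectedness, relevance,
                alpha=1/3, beta=1/3, gamma=1/3):
    """Weighted sum of the three components (illustrative sketch)."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * novelty + beta * unexpectedness + gamma * relevance

# Hypothetical component scores for one (item, user) pair
score = serendipity(novelty=0.8, unexpectedness=0.2, relevance=0.5,
                    alpha=0.5, beta=0.3, gamma=0.2)
print(score)
```

Passing `gamma=0` (with `alpha + beta = 1`) would reduce this to a definition based on novelty and unexpectedness only.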

# Evaluation

Movies dataset.

In [16]:
movies_df = pd.read_csv('../jupyter/data/movies.csv')

In [17]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Movies dataset with one-hot encoding.

In [18]:
movies_extended_df = load('/root/Downloads/t_film_profile_sem_0_and_com_001.pickle')

In [19]:
movies_extended_df

movieId,30,65,74,75,83,90,100,107,108,110,...,70078755,70078760,70078761,70078764,70078766,70078779,70078781,70078784,70078788,70078789
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
465044,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
467731,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
468343,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
468707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Rating dataset that contains every user

In [20]:
# ratings_df = pd.read_csv('../jupyter/data/ratings.csv')
# ratings_df = ratings_df.sort_values(['userId', 'timestamp'])

In [21]:
# ratings_df.head()

Rating dataset where users with <= 3 interactions are filtered out. 

In [22]:
ratings_cleaned_df = pd.read_csv('../jupyter/test_datasets4/df_ratings_drop_users_5.csv')
ratings_cleaned_df = ratings_cleaned_df.sort_values(['userId', 'timestamp'])

In [23]:
ratings_cleaned_df.head(30)

Unnamed: 0,userId,movieId,rating,timestamp
16772,0,145,3.0,1216755656
16773,0,1729,3.5,1216755676
16774,0,1953,3.5,1216755747
16775,0,249,3.0,1216755814
16776,0,431,4.0,1216922547
64364,1,1183,3.0,877373332
64365,1,1416,4.0,877373333
64366,1,1617,3.0,877373333
64367,2,2858,5.0,940199496
64368,2,2890,2.0,940199572


In [24]:
ratings_cleaned_df[ratings_cleaned_df['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
64364,1,1183,3.0,877373332
64365,1,1416,4.0,877373333
64366,1,1617,3.0,877373333
