# Top N Recommendation

This notebook contains two top-N recommendation examples:

- **Top N consumed**: the N items most consumed by users
- **Top N rated**: the N items with the highest average user rating

The dataset to be used will be [MovieLens](https://grouplens.org/datasets/movielens/), whose exploratory analysis was carried out in the practical example of the module **Introduction to Recommendation Systems**.

In [2]:
import os
import sys
import pandas as pd
#from google.colab import files
import matplotlib.pyplot as plt
import matplotlib
from cycler import cycler

matplotlib.rcParams['axes.prop_cycle'] = cycler(color=['#007efd', '#FFC000', '#303030'])

# Loading and processing the dataset

For more information on this section, see the `MovieLens Exploratory Analysis` notebook from module 01.

In [6]:
#!pip install pyarrow


In [9]:
import pyarrow.parquet as pq

# Load the parquet file
df_ratings = pq.read_table('ratings.parquet')

# Convert to a pandas DataFrame
df_ratings = df_ratings.to_pandas()

# Show the last rows of the DataFrame
df_ratings.tail()

Unnamed: 0,user_id,item_id,rating,timestamp
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
1000208,6040,1097,4,956715569


## Item metadata file

Upload the `movies.parquet` file.

In [11]:
# Load the parquet file
df_items = pq.read_table('movies.parquet')

# Convert to a pandas DataFrame
df_items = df_items.to_pandas()

# Show the last rows of the DataFrame
df_items.tail()

Unnamed: 0,item_id,title,genres
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller


In [12]:
def convert_genres_to_list(genres:str, separator='|'):
    return genres.split(separator)

df_items = pd.read_parquet('movies.parquet')
df_items['genres'] = df_items['genres'].apply(convert_genres_to_list)
df_items.tail()

Unnamed: 0,item_id,title,genres
3878,3948,Meet the Parents (2000),[Comedy]
3879,3949,Requiem for a Dream (2000),[Drama]
3880,3950,Tigerland (2000),[Drama]
3881,3951,Two Family House (2000),[Drama]
3882,3952,"Contender, The (2000)","[Drama, Thriller]"


# Top N Consumed Recommendation

In our first example of non-personalized recommendation, we recommend the items most consumed by users.

In general, a recommendation function returns two pieces of information for each recommended item:

- `item_id`: item identifier
- `score`: value used to rank the items offered to the user
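
As a minimal illustration of this output format, the sketch below counts interactions per item on a small hand-made ratings table. The toy data and the names `toy_ratings` and `toy_scores` are assumptions made for illustration; they are not part of MovieLens.

```python
import pandas as pd

# Toy ratings table (hypothetical data, for illustration only)
toy_ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3, 3],
    'item_id': [10, 20, 10, 30, 10, 20],
    'rating':  [5, 3, 4, 2, 5, 4],
})

# One row per item: the score is the number of ratings the item received
toy_scores = (
    toy_ratings
    .groupby('item_id')
    .size()
    .reset_index(name='score')
    .sort_values(by='score', ascending=False)
)
print(toy_scores)
```

Item 10 was rated by three users, so it comes first in the ranking, followed by items 20 and 30.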

In [17]:
def recommend_top_n_consumptions(ratings:pd.DataFrame, n:int) -> pd.DataFrame:
    # Score each item by the number of ratings it received
    recommendations = (
        ratings
        .groupby('item_id')['user_id']
        .count()
        .reset_index()
        .rename({'user_id': 'score'}, axis=1)
        .sort_values(by='score', ascending=False)
    )

    return recommendations.head(n)

df_top_consumptions = recommend_top_n_consumptions(df_ratings, n=10)
df_top_consumptions

Unnamed: 0,item_id,score
2651,2858,3428
253,260,2991
1106,1196,2990
1120,1210,2883
466,480,2672
1848,2028,2653
575,589,2649
2374,2571,2590
1178,1270,2583
579,593,2578


To better evaluate the recommendation result, we can **attach item metadata**:

In [18]:
df_top_consumptions.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,score,title,genres
0,2858,3428,American Beauty (1999),"[Comedy, Drama]"
1,260,2991,Star Wars: Episode IV - A New Hope (1977),"[Action, Adventure, Fantasy, Sci-Fi]"
2,1196,2990,Star Wars: Episode V - The Empire Strikes Back...,"[Action, Adventure, Drama, Sci-Fi, War]"
3,1210,2883,Star Wars: Episode VI - Return of the Jedi (1983),"[Action, Adventure, Romance, Sci-Fi, War]"
4,480,2672,Jurassic Park (1993),"[Action, Adventure, Sci-Fi]"
5,2028,2653,Saving Private Ryan (1998),"[Action, Drama, War]"
6,589,2649,Terminator 2: Judgment Day (1991),"[Action, Sci-Fi, Thriller]"
7,2571,2590,"Matrix, The (1999)","[Action, Sci-Fi, Thriller]"
8,1270,2583,Back to the Future (1985),"[Comedy, Sci-Fi]"
9,593,2578,"Silence of the Lambs, The (1991)","[Drama, Thriller]"
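
An inner join silently drops any recommended item that has no row in the metadata table. The sketch below shows one possible way to surface such gaps, using a left join with pandas' `validate` and `indicator` options. The toy tables and the names `toy_recs`, `toy_items`, and `missing` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical recommendations and metadata; item 99 has no metadata row
toy_recs = pd.DataFrame({'item_id': [10, 20, 99], 'score': [3, 2, 1]})
toy_items = pd.DataFrame({'item_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# A left join keeps every recommendation; validate raises if the metadata
# table contains duplicated item_id values; indicator adds a `_merge` column
enriched = toy_recs.merge(
    toy_items,
    on='item_id',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# Recommended items that found no match in the metadata table
missing = enriched.loc[enriched['_merge'] == 'left_only', 'item_id'].tolist()
print(missing)
```

With the MovieLens tables used in this notebook, all ten recommended items have metadata, so the inner join loses nothing here.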


____________

# Top N Rated Recommendation

Another top-N approach is to recommend the items that users rate most highly, based on explicit feedback fields such as _rating_. To do this, we use the **average rating of each movie** in the MovieLens dataset as the score.

In [19]:
from typing import Optional

def recommend_top_n_evaluations(ratings:pd.DataFrame, n:int, min_evaluations:Optional[int]=None) -> pd.DataFrame:
    # Score each item by its mean rating and keep the number of ratings behind that mean
    recommendations = (
        ratings
        .groupby('item_id')
        .agg({'rating': 'mean', 'user_id': 'count'})
        .reset_index()
        .rename({'rating': 'score', 'user_id': 'evaluations'}, axis=1)
        .sort_values(by=['score', 'evaluations'], ascending=False)
    )

    # Optionally discard items with fewer than min_evaluations ratings
    if min_evaluations is not None:
        recommendations = recommendations.query('evaluations >= @min_evaluations')

    return recommendations.head(n)

recommend_top_n_evaluations(df_ratings, n=10, min_evaluations=None)

Unnamed: 0,item_id,score,evaluations
744,787,5.0,3
3010,3233,5.0,2
926,989,5.0,1
1652,1830,5.0,1
2955,3172,5.0,1
3054,3280,5.0,1
3152,3382,5.0,1
3367,3607,5.0,1
3414,3656,5.0,1
3635,3881,5.0,1


In [20]:
df_top_evaluations = recommend_top_n_evaluations(df_ratings, n=10, min_evaluations=None)
df_top_evaluations.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,score,evaluations,title,genres
0,787,5.0,3,"Gate of Heavenly Peace, The (1995)",[Documentary]
1,3233,5.0,2,Smashing Time (1967),[Comedy]
2,989,5.0,1,Schlafes Bruder (Brother of Sleep) (1995),[Drama]
3,1830,5.0,1,Follow the Bitch (1998),[Comedy]
4,3172,5.0,1,Ulysses (Ulisse) (1954),[Adventure]
5,3280,5.0,1,"Baby, The (1973)",[Horror]
6,3382,5.0,1,Song of Freedom (1936),[Drama]
7,3607,5.0,1,One Little Indian (1973),"[Comedy, Drama, Western]"
8,3656,5.0,1,Lured (1947),[Crime]
9,3881,5.0,1,Bittersweet Motel (2000),[Documentary]


Note that every item in this ranking has a perfect average of 5.0 but was rated by at most three users, so the score is not reliable. To avoid this, the function accepts a **hyperparameter**, `min_evaluations`, with the **minimum number of evaluations** that an item needs to have to be considered in the recommendation.

In [21]:
df_top_evaluations = recommend_top_n_evaluations(df_ratings, n=10, min_evaluations=100)
df_top_evaluations.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,score,evaluations,title,genres
0,2019,4.56051,628,Seven Samurai (The Magnificent Seven) (Shichin...,"[Action, Drama]"
1,318,4.554558,2227,"Shawshank Redemption, The (1994)",[Drama]
2,858,4.524966,2223,"Godfather, The (1972)","[Action, Crime, Drama]"
3,745,4.520548,657,"Close Shave, A (1995)","[Animation, Comedy, Thriller]"
4,50,4.517106,1783,"Usual Suspects, The (1995)","[Crime, Thriller]"
5,527,4.510417,2304,Schindler's List (1993),"[Drama, War]"
6,1148,4.507937,882,"Wrong Trousers, The (1993)","[Animation, Comedy]"
7,922,4.491489,470,Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),[Film-Noir]
8,1198,4.477725,2514,Raiders of the Lost Ark (1981),"[Action, Adventure]"
9,904,4.47619,1050,Rear Window (1954),"[Mystery, Thriller]"
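
The value 100 used above is a choice, not a rule. One way to get a feel for the hyperparameter is to count how many items survive at different thresholds. The sketch below does this on hypothetical toy data; the thresholds `(1, 2, 4)` and the names `evaluations_per_item` and `survivors` are assumptions for illustration.

```python
import pandas as pd

# Toy ratings: item 10 has 4 ratings, item 20 has 2, item 30 has 1
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 1],
    'item_id': [10, 10, 10, 10, 20, 20, 30],
    'rating':  [4, 5, 4, 3, 5, 5, 5],
})

# Number of ratings received by each item
evaluations_per_item = toy_ratings.groupby('item_id')['user_id'].count()

# For each candidate threshold, count the items that would remain eligible
survivors = {
    threshold: int((evaluations_per_item >= threshold).sum())
    for threshold in (1, 2, 4)
}
print(survivors)
```

A higher threshold makes the average more trustworthy but shrinks the pool of candidate items, so the value is a trade-off worth inspecting on the real data.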


_____________

**Extra**: generating predictions to evaluate metrics

In [22]:
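train_size = 0.8
# Temporal split: order by timestamp so the validation set holds the most recent 20% of ratings
df_ratings = df_ratings.sort_values(by='timestamp')
split_index = int(train_size * df_ratings.shape[0])
df_train_set = df_ratings.iloc[:split_index]
df_valid_set = df_ratings.iloc[split_index:]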

In [23]:
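# Note: the ranking is computed on the full df_ratings, which includes the validation rows;
# pass df_train_set instead to keep the evaluation free of leakage
recommendations = recommend_top_n_consumptions(df_ratings, n=20)
scores = recommendations[['item_id', 'score']].to_dict(orient='records')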

In [24]:
model_name = 'top'
# Work on a copy so that df_valid_set is not modified
df_predictions = df_valid_set.copy()
# Ground truth: one {'item_id', 'rating'} dict per validation rating, grouped into a list per user
df_predictions['y_true'] = df_predictions.apply(lambda x: {'item_id': x['item_id'], 'rating': x['rating']}, axis=1)
df_predictions = df_predictions.groupby('user_id').agg({'y_true': list})
# Non-personalized model: every user receives the same ranked list
df_predictions['y_score'] = df_predictions.apply(lambda x: scores, axis=1)
df_predictions['model'] = model_name
df_predictions.reset_index(drop=False, inplace=True)
df_predictions.tail()


Unnamed: 0,user_id,y_true,y_score,model
1778,6001,"[{'item_id': 3751, 'rating': 4}, {'item_id': 3...","[{'item_id': 2858, 'score': 3428}, {'item_id':...",top
1779,6002,"[{'item_id': 1942, 'rating': 5}, {'item_id': 4...","[{'item_id': 2858, 'score': 3428}, {'item_id':...",top
1780,6016,"[{'item_id': 3756, 'rating': 3}, {'item_id': 3...","[{'item_id': 2858, 'score': 3428}, {'item_id':...",top
1781,6028,"[{'item_id': 3000, 'rating': 4}]","[{'item_id': 2858, 'score': 3428}, {'item_id':...",top
1782,6040,"[{'item_id': 3182, 'rating': 5}, {'item_id': 2...","[{'item_id': 2858, 'score': 3428}, {'item_id':...",top


In [25]:
column_order = ['model', 'user_id', 'y_true', 'y_score']
df_predictions[column_order].to_parquet(f'valid_{model_name}.parquet', index=None)
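
The notebook does not say which metrics are computed from this file. As one possible example, the sketch below computes precision@k from columns shaped like `y_true` and `y_score`. The function `precision_at_k`, the toy data, and the decision to treat every item in `y_true` as relevant regardless of its rating are all assumptions made for illustration.

```python
import pandas as pd

def precision_at_k(y_true: list, y_score: list, k: int) -> float:
    """Fraction of the top-k recommended items that appear in the user's y_true."""
    # Assumption: any item the user rated counts as relevant, whatever the rating
    relevant = {entry['item_id'] for entry in y_true}
    top_k = sorted(y_score, key=lambda entry: entry['score'], reverse=True)[:k]
    if not top_k:
        return 0.0
    hits = sum(entry['item_id'] in relevant for entry in top_k)
    return hits / len(top_k)

# Toy predictions in the same layout as df_predictions (hypothetical data)
shared_scores = [{'item_id': 10, 'score': 3}, {'item_id': 20, 'score': 2}]
toy_predictions = pd.DataFrame({
    'user_id': [1, 2],
    'y_true': [
        [{'item_id': 10, 'rating': 5}, {'item_id': 30, 'rating': 2}],
        [{'item_id': 40, 'rating': 4}],
    ],
    'y_score': [shared_scores, shared_scores],
})

# Per-user precision@2, then the average over users
toy_predictions['precision_at_2'] = toy_predictions.apply(
    lambda row: precision_at_k(row['y_true'], row['y_score'], k=2), axis=1
)
mean_precision = toy_predictions['precision_at_2'].mean()
print(mean_precision)
```

User 1 interacted with one of the two recommended items and user 2 with none, so the per-user values are 0.5 and 0.0.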