# Trending Recommendation


This notebook uses the [MovieLens](https://grouplens.org/datasets/movielens/) dataset, whose exploratory analysis was carried out in the practical example of the module **Introduction to Recommendation Systems**.

In [1]:
import os
import re
import sys
import pandas as pd
from datetime import datetime
#from google.colab import files
import matplotlib.pyplot as plt
import matplotlib
from cycler import cycler

matplotlib.rcParams['axes.prop_cycle'] = cycler(color=['#007efd', '#FFC000', '#303030'])

# Loading and processing the dataset

For more information on this section, see the notebook `Introduction to Recommender Systems`.

In [2]:
import pyarrow.parquet as pq

# Load a parquet file
df_ratings = pq.read_table('ratings.parquet')

# Convert to a pandas DataFrame
df_ratings = df_ratings.to_pandas()

# Show the last rows of the DataFrame
df_ratings.tail()

Unnamed: 0,user_id,item_id,rating,timestamp
1000204,6040,1091,1,956716541
1000205,6040,1094,5,956704887
1000206,6040,562,5,956704746
1000207,6040,1096,4,956715648
1000208,6040,1097,4,956715569


In [3]:
def convert_timestamp_to_date(timestamp: int):
    # fromtimestamp() interprets the Unix timestamp in the machine's local timezone
    return datetime.fromtimestamp(timestamp).date()

df_ratings = pd.read_parquet('ratings.parquet')
df_ratings['date'] = df_ratings['timestamp'].apply(convert_timestamp_to_date)
df_ratings.tail()

Unnamed: 0,user_id,item_id,rating,timestamp,date
1000204,6040,1091,1,956716541,2000-04-25
1000205,6040,1094,5,956704887,2000-04-25
1000206,6040,562,5,956704746,2000-04-25
1000207,6040,1096,4,956715648,2000-04-25
1000208,6040,1097,4,956715569,2000-04-25
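
Because `datetime.fromtimestamp` uses the local timezone, the same rating can land on a different date (and, near a month boundary, in a different monthly window) depending on where the notebook runs. The sketch below is an alternative, not the conversion used in this notebook: it assumes the `timestamp` column holds Unix seconds and pins the conversion to UTC with `pd.to_datetime`. The names `timestamps` and `dates_utc` are illustrative.

```python
import pandas as pd

# Two timestamps taken from the ratings output above (Unix seconds).
timestamps = pd.Series([956716541, 956704887])

# utc=True pins the conversion to UTC instead of the machine's local timezone.
dates_utc = pd.to_datetime(timestamps, unit='s', utc=True).dt.date

print(dates_utc.tolist())
# -> [datetime.date(2000, 4, 26), datetime.date(2000, 4, 25)]
```

In UTC the two ratings fall on different days, while the output above shows both on 2000-04-25.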


## Item metadata file

Load the file `movies.parquet`.

In [4]:
# Load a parquet file
df_items = pq.read_table('movies.parquet')

# Convert to a pandas DataFrame
df_items = df_items.to_pandas()

# Show the last rows of the DataFrame
df_items.tail()

Unnamed: 0,item_id,title,genres
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama
3882,3952,"Contender, The (2000)",Drama|Thriller


In [6]:
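def extract_year_from_title(title: str, regex=r'\((\d{4})\)'):
    # Raw string avoids an invalid-escape warning; matching the parenthesized
    # year keeps titles such as "2001: A Space Odyssey (1968)" from returning 2001
    match = re.search(regex, title)
    return None if match is None else match.group(1)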

def convert_genres_to_list(genres:str, separator='|'):
    return genres.split(separator)

df_items = pd.read_parquet('movies.parquet')
df_items['genres'] = df_items['genres'].apply(convert_genres_to_list)
df_items['year'] = df_items['title'].apply(extract_year_from_title)
df_items.tail()

Unnamed: 0,item_id,title,genres,year
3878,3948,Meet the Parents (2000),[Comedy],2000
3879,3949,Requiem for a Dream (2000),[Drama],2000
3880,3950,Tigerland (2000),[Drama],2000
3881,3951,Two Family House (2000),[Drama],2000
3882,3952,"Contender, The (2000)","[Drama, Thriller]",2000


# Trending Calculation

The _trending_ recommendation seeks to present the items that had the greatest _lift_ in consumption in a given window of time. In mathematical terms, the _lift_ can be defined as:

$$lift = \frac{consumptionCurrentWindow-consumptionPreviousWindow}{consumptionPreviousWindow}$$
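
As a quick worked example of the formula, the sketch below applies it to a few pairs of counts that appear later in this notebook. The helper name `lift` is only illustrative and is not part of the notebook's code.

```python
def lift(current: int, previous: int) -> float:
    """Relative change in consumption between two consecutive windows."""
    return (current - previous) / previous

print(round(lift(165, 17), 6))   # consumption grew from 17 to 165 -> 8.705882
print(round(lift(128, 165), 6))  # consumption dropped from 165 to 128 -> -0.224242
print(lift(1, 1))                # no change -> 0.0
```

A positive lift means consumption grew relative to the previous window, a negative lift means it shrank, and zero means it stayed the same.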

In this notebook, we will use the month in which the item was consumed as a time window. The function below helps us define a `window` column that will be used in the algorithm logic.

In [6]:
def extract_year_month(date):
    return '{:04d}-{:02d}'.format(date.year, date.month)
    
df_ratings['window'] = df_ratings['date'].apply(extract_year_month)
df_ratings.tail()

Unnamed: 0,user_id,item_id,rating,timestamp,date,window
1000204,6040,1091,1,956716541,2000-04-25,2000-04
1000205,6040,1094,5,956704887,2000-04-25,2000-04
1000206,6040,562,5,956704746,2000-04-25,2000-04
1000207,6040,1096,4,956715648,2000-04-25,2000-04
1000208,6040,1097,4,956715569,2000-04-25,2000-04


## Consumption per time window

In [7]:
df_window_consumptions = (
    df_ratings
    .groupby(['item_id', 'window'])
    .agg({'user_id': 'count'})
    .reset_index()
    .rename({'user_id': 'count'}, axis=1)
    .sort_values(by=['item_id', 'window'])
)
df_window_consumptions

Unnamed: 0,item_id,window,count
0,1,2000-04,18
1,1,2000-05,165
2,1,2000-06,127
3,1,2000-07,211
4,1,2000-08,381
...,...,...,...
65610,3952,2002-09,1
65611,3952,2002-11,1
65612,3952,2002-12,3
65613,3952,2003-01,2


## Temporal shift

To perform operations between the current and previous values of a time window, we can **shift** the values of a column within each group. Example:

| Grouping | Value | Shift Value |
|--------|-------|-------------|
| A | A1 | N/A |
| A | A2 | A1 |
| B | B1 | N/A |
| B | B2 | B1 |
| B | B3 | B2 |


In [14]:
df_window_consumptions.sort_values(by=['item_id', 'window'], inplace=True)

df_window_consumptions['count_previous'] = (
    df_window_consumptions
    .groupby(['item_id'])['count']
    .shift(1)
)
df_window_consumptions

Unnamed: 0,item_id,window,count,count_previous
0,1,2000-04,17,
1,1,2000-05,165,17.0
2,1,2000-06,128,165.0
3,1,2000-07,203,128.0
4,1,2000-08,386,203.0
...,...,...,...,...
65635,3952,2002-09,1,2.0
65636,3952,2002-11,1,1.0
65637,3952,2002-12,3,1.0
65638,3952,2003-01,2,3.0
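
One caveat: `shift(1)` returns the previous row within each item, that is, the previous window in which the item has at least one rating, which is not necessarily the previous calendar month. In the output above, item 3952 has rows for 2002-09 and 2002-11 but none for 2002-10. The sketch below shows one possible way to fill the missing months with zero before shifting. It is not what this notebook does, and the toy data and the names `df_counts`, `all_windows`, and `df_dense` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: item 1 has no ratings in 2002-10.
df_counts = pd.DataFrame({
    'item_id': [1, 1, 1],
    'window': ['2002-09', '2002-11', '2002-12'],
    'count': [2, 1, 3],
})

# Every month between the first and last window, formatted like the 'window' column.
all_windows = pd.period_range(
    df_counts['window'].min(), df_counts['window'].max(), freq='M'
).strftime('%Y-%m')

# One row per (item, month); months without ratings get a count of 0.
full_index = pd.MultiIndex.from_product(
    [df_counts['item_id'].unique(), all_windows],
    names=['item_id', 'window'],
)
df_dense = (
    df_counts
    .set_index(['item_id', 'window'])
    .reindex(full_index, fill_value=0)
    .reset_index()
)

# Now the previous row really is the previous calendar month.
df_dense['count_previous'] = df_dense.groupby('item_id')['count'].shift(1)
print(df_dense)
```

With this approach a previous count can be zero, so the lift division yields `inf` (or `NaN` for 0/0) and those rows would need separate handling. With several items it also adds zero rows for months before an item's first rating.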


## Lift

Implementing the following formula:
$$lift = \frac{countCurrentWindow-countPreviousWindow}{countPreviousWindow}$$

In [15]:
df_window_consumptions['lift'] = (df_window_consumptions['count'] - df_window_consumptions['count_previous'])/df_window_consumptions['count_previous']
df_window_consumptions

Unnamed: 0,item_id,window,count,count_previous,lift
0,1,2000-04,17,,
1,1,2000-05,165,17.0,8.705882
2,1,2000-06,128,165.0,-0.224242
3,1,2000-07,203,128.0,0.585938
4,1,2000-08,386,203.0,0.901478
...,...,...,...,...,...
65635,3952,2002-09,1,2.0,-0.500000
65636,3952,2002-11,1,1.0,0.000000
65637,3952,2002-12,3,1.0,2.000000
65638,3952,2003-01,2,3.0,-0.333333


## Specifying recommendation window

For the recommendation we need a reference window. For example:

- Trending in current window?
- Trending in the previous window?
- Trending in a specific window?

Once the reference window has been defined, we use the _lift_ value as the item's _score_ and sort the items by _score_ in descending order.

In [17]:
prediction_window = '2003-01'
(
    df_window_consumptions
    .query('window == @prediction_window')
    .rename({'lift': 'score'}, axis=1)
    .sort_values(by='score', ascending=False)
)

Unnamed: 0,item_id,window,count,count_previous,score
32959,2011,2003-01,7,1.0,6.00
42100,2502,2003-01,7,1.0,6.00
782,32,2003-01,6,1.0,5.00
64914,3897,2003-01,6,1.0,5.00
25525,1527,2003-01,5,1.0,4.00
...,...,...,...,...,...
64221,3847,2003-01,1,4.0,-0.75
21114,1266,2003-01,1,5.0,-0.80
18836,1179,2003-01,1,5.0,-0.80
18426,1136,2003-01,1,5.0,-0.80


_____________

# Recommending Trending Items

Finally, we put together all the logic described so far in the `recommend_trending_n` function below, which recommends the N items with the highest _lift_ in the chosen window.

In [8]:
# min_evaluations: minimum number of ratings an item must have in the previous window

def recommend_trending_n(ratings:pd.DataFrame, n:int, prediction_window:str=None, min_evaluations:int=None) -> pd.DataFrame:

    prediction_window = max(ratings['window']) if prediction_window is None else prediction_window

    ratings = ratings[['item_id', 'window', 'user_id']]
    # Window calculation: count ratings per item and window
    df_window_consumptions = (
        ratings
        .groupby(['item_id', 'window'])['user_id']
        .count()
        .reset_index()
        .rename({'user_id': 'count'}, axis=1)
        .sort_values(by=['item_id', 'window'])
    )

    # Temporal shift
    df_window_consumptions['count_previous'] = (
        df_window_consumptions
        .groupby(['item_id'])['count']
        .shift(1)
    )

    # Lift calculation
    df_window_consumptions['lift'] = (df_window_consumptions['count'] - df_window_consumptions['count_previous'])/df_window_consumptions['count_previous']

    # Window selection
    recommendations = (
        df_window_consumptions
        .query('window == @prediction_window')
        .rename({'lift': 'score'}, axis=1)
        .sort_values(by='score', ascending=False)
    )

    if min_evaluations is not None:
        recommendations = recommendations.query('count_previous >= @min_evaluations')

    return recommendations.head(n)

df_trending = recommend_trending_n(df_ratings, n=10, prediction_window='2002-12')
df_trending.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,window,count,count_previous,score,title,genres
0,1722,2002-12,7,1.0,6.0,Tomorrow Never Dies (1997),Action|Romance|Thriller
1,595,2002-12,7,1.0,6.0,Beauty and the Beast (1991),Animation|Children's|Musical
2,3503,2002-12,6,1.0,5.0,Solaris (Solyaris) (1972),Drama|Sci-Fi
3,3639,2002-12,6,1.0,5.0,"Man with the Golden Gun, The (1974)",Action
4,2990,2002-12,6,1.0,5.0,Licence to Kill (1989),Action
5,1179,2002-12,5,1.0,4.0,"Grifters, The (1990)",Crime|Drama|Film-Noir
6,3882,2002-12,5,1.0,4.0,Bring It On (2000),Comedy
7,1266,2002-12,5,1.0,4.0,Unforgiven (1992),Western
8,2966,2002-12,4,1.0,3.0,"Straight Story, The (1999)",Drama
9,2942,2002-12,7,2.0,2.5,Flashdance (1983),Drama|Romance


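Note that some items had a high _lift_/_score_ even though **few users consumed them**: an item that goes from 1 rating to 7 gets a score of 6.0.

To avoid recommending such items, we can require a minimum number of ratings in the previous window (the `min_evaluations` parameter) for an item to be recommended.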

In [10]:
prediction_window = '2002-12'
min_evaluations = 2
n = 10

df_trending = recommend_trending_n(
    df_ratings, 
    n=n,
    prediction_window=prediction_window,
    min_evaluations=min_evaluations
)
df_trending.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,window,count,count_previous,score,title,genres
0,2942,2002-12,7,2.0,2.5,Flashdance (1983),Drama|Romance
1,3791,2002-12,10,3.0,2.333333,Footloose (1984),Drama
2,2926,2002-12,6,2.0,2.0,Hairspray (1988),Comedy|Drama
3,3635,2002-12,6,2.0,2.0,"Spy Who Loved Me, The (1977)",Action
4,1032,2002-12,5,2.0,1.5,Alice in Wonderland (1951),Animation|Children's|Musical
5,1097,2002-12,7,3.0,1.333333,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
6,2054,2002-12,6,3.0,1.0,"Honey, I Shrunk the Kids (1989)",Adventure|Children's|Comedy|Fantasy|Sci-Fi
7,1258,2002-12,4,2.0,1.0,"Shining, The (1980)",Horror
8,1375,2002-12,4,2.0,1.0,Star Trek III: The Search for Spock (1984),Action|Adventure|Sci-Fi
9,1517,2002-12,4,2.0,1.0,Austin Powers: International Man of Mystery (1...,Comedy


# Collection Selection

We can restrict the recommendation to a specific collection of items, such as a single genre, by filtering the ratings dataset before calling the function.

In [11]:
genre = "Children's"
item_ids = df_items[df_items['genres'].apply(lambda x: genre in x)]['item_id']
df_ratings_filtered = df_ratings[df_ratings['item_id'].isin(item_ids)]
df_ratings_filtered.tail()

Unnamed: 0,user_id,item_id,rating,timestamp,date,window
999888,6040,919,5,956704191,2000-04-25,2000-04
1000014,6040,34,4,956704584,2000-04-25,2000-04
1000153,6040,2384,4,956703954,2000-04-25,2000-04
1000191,6040,3751,4,964828782,2000-07-28,2000-07
1000208,6040,1097,4,956715569,2000-04-25,2000-04


In [12]:
df_trending = recommend_trending_n(df_ratings_filtered, n=10, prediction_window='2002-12', min_evaluations=2)
df_trending.merge(df_items, on='item_id', how='inner')

Unnamed: 0,item_id,window,count,count_previous,score,title,genres
0,1032,2002-12,5,2.0,1.5,Alice in Wonderland (1951),Animation|Children's|Musical
1,1097,2002-12,7,3.0,1.333333,E.T. the Extra-Terrestrial (1982),Children's|Drama|Fantasy|Sci-Fi
2,2054,2002-12,6,3.0,1.0,"Honey, I Shrunk the Kids (1989)",Adventure|Children's|Comedy|Fantasy|Sci-Fi
3,317,2002-12,7,4.0,0.75,"Santa Clause, The (1994)",Children's|Comedy|Fantasy
4,596,2002-12,3,2.0,0.5,Pinocchio (1940),Animation|Children's
5,1,2002-12,3,2.0,0.5,Toy Story (1995),Animation|Children's|Comedy
6,594,2002-12,2,2.0,0.0,Snow White and the Seven Dwarfs (1937),Animation|Children's|Musical
7,2090,2002-12,2,2.0,0.0,"Rescuers, The (1977)",Animation|Children's
8,2761,2002-12,2,2.0,0.0,"Iron Giant, The (1999)",Animation|Children's
9,1028,2002-12,2,2.0,0.0,Mary Poppins (1964),Children's|Comedy|Musical
