# Feature generators tutorial

This notebook presents the RePlay functionality for feature preprocessing and for generating new user and item features from existing features and the interaction history. RePlay offers the following classes:

* CatFeaturesTransformer
* LogStatFeaturesProcessor


### Fit

To train a feature generator, use the `.fit()` method.

### Transform the data

The `.transform()` method transforms the data using what was learned from the training dataset.
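
The fit/transform split can be illustrated without RePlay. The sketch below uses a hypothetical `MeanCenterer` class (not part of RePlay) that learns a statistic in `fit()` and reuses it in `transform()`, including on data it has not seen during fitting.

```python
class MeanCenterer:
    """Hypothetical toy generator: learns a mean in fit() and subtracts it in transform()."""

    def fit(self, values):
        # Remember a statistic of the training data only.
        self.mean_ = sum(values) / len(values)
        return self

    def transform(self, values):
        # Apply the stored statistic to any data, seen or unseen.
        return [v - self.mean_ for v in values]


centerer = MeanCenterer().fit([1.0, 2.0, 3.0])
train_out = centerer.transform([1.0, 2.0, 3.0])
new_out = centerer.transform([10.0])
print(train_out, new_out)
```

The point of the split is that `transform()` never looks at the statistics of the data it receives; it only applies what `fit()` stored.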

In [2]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import functions as sf

from replay.data_preparator import DataPreparator, Indexer
from replay.utils import convert2spark

sns.set_theme(style="whitegrid", palette="muted")

## Get started

Download the **MovieLens** dataset and preprocess it with `DataPreparator` and `Indexer`.

In [3]:
ratings = pd.read_csv("./data/ml1m_ratings.dat", sep="\t", names=["userId", "itemId","relevance","timestamp"])

For each interaction, we add the categorical variable `month`, derived from the interaction timestamp.

In [4]:
# Derive the month (1-12) from the Unix timestamp, which pandas interprets as UTC
ratings["month"] = pd.to_datetime(ratings["timestamp"], unit="s").dt.month

In [7]:
dp = DataPreparator()
log = dp.transform(data=ratings,
                  columns_mapping={
                      "user_id": "userId",
                      "item_id":  "itemId",
                      "relevance": "relevance",
                      "timestamp": "timestamp"
                  })

log.show(2)

20-Sep-22 11:33:25, replay, INFO: Columns with ids of users or items are present in mapping. The dataframe will be treated as an interactions log.


+-------+-------+---------+-------------------+-----+
|user_id|item_id|relevance|          timestamp|month|
+-------+-------+---------+-------------------+-----+
|      1|   1193|      5.0|2001-01-01 01:12:40|   12|
|      1|    661|      3.0|2001-01-01 01:35:09|   12|
+-------+-------+---------+-------------------+-----+
only showing top 2 rows
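
The two rows above show `month` equal to 12 next to a timestamp in January. A likely explanation is that the month was taken from the UTC value produced by `pd.to_datetime(..., unit='s')`, while the Spark session renders timestamps in UTC+3. The sketch below checks this with the standard library only; the UTC+3 offset is an assumption, not something the notebook states.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Spark output above is rendered in UTC+3.
display_tz = timezone(timedelta(hours=3))
shown = datetime(2001, 1, 1, 1, 12, 40, tzinfo=display_tz)

# The same instant expressed in UTC, the time zone pandas uses for unit="s".
as_utc = shown.astimezone(timezone.utc)

print(shown.month, as_utc.month)
print(as_utc.isoformat())
```

Under that assumption the displayed January timestamp corresponds to a late-December instant in UTC, which would explain the value 12 in `month`.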



In [8]:
indexer = Indexer(user_col='user_id', item_col='item_id')
indexer.fit(users=log.select('user_id'),
            items=log.select('item_id'))
log = indexer.transform(df=log)
log.show(2)

+--------+--------+---------+-------------------+-----+
|user_idx|item_idx|relevance|          timestamp|month|
+--------+--------+---------+-------------------+-----+
|    4131|      43|      5.0|2001-01-01 01:12:40|   12|
|    4131|     585|      3.0|2001-01-01 01:35:09|   12|
+--------+--------+---------+-------------------+-----+
only showing top 2 rows
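
Conceptually, indexing replaces raw ids with contiguous integers that are learned in `fit` and applied in `transform`. A minimal sketch of that idea follows. The helper names `fit_index` and `apply_index` are hypothetical, and the sorted ordering is an assumption: the output above, where user 1 becomes 4131, shows that `Indexer` orders ids differently.

```python
def fit_index(raw_ids):
    """Map each distinct raw id to a contiguous integer (sorted order is an assumption)."""
    return {raw: idx for idx, raw in enumerate(sorted(set(raw_ids)))}


def apply_index(raw_ids, mapping):
    """Replace raw ids with their learned integer indices."""
    return [mapping[raw] for raw in raw_ids]


item_mapping = fit_index([1193, 661, 1193, 914])
indexed_items = apply_index([1193, 661], item_mapping)
print(item_mapping, indexed_items)
```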



We keep only the first 20 users (`user_idx < 20`) and exclude all interactions from month 12, so that month 12 can later serve as a value unseen during fitting.

In [9]:
log_20_users = log.where("user_idx < 20 and month != 12")

In [10]:
log_20_users_pandas = log_20_users.toPandas()

In [11]:
def find_cold_idx(df):
    """Return the smallest user and item indices that do not occur in ``df``."""
    # Convert to pandas once instead of once per column
    pdf = df.select("user_idx", "item_idx").toPandas()

    def first_missing(values):
        known = set(values)
        # Scan the same range as before; the result is None if nothing is missing
        return next((i for i in range(len(known) * 2) if i not in known), None)

    cold_user_idx = first_missing(pdf["user_idx"].unique().tolist())
    cold_item_idx = first_missing(pdf["item_idx"].unique().tolist())

    return cold_user_idx, cold_item_idx
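
The helper above looks for the smallest non-negative index that is absent from the log. The same logic on plain Python lists, with a hypothetical function name `smallest_missing_index`, can be sketched as:

```python
def smallest_missing_index(indices):
    """Return the smallest non-negative integer that is not in ``indices``."""
    known = set(indices)
    # Among 0..len(known) at least one value must be missing.
    for candidate in range(len(known) + 1):
        if candidate not in known:
            return candidate


gap_inside = smallest_missing_index([0, 1, 2, 4])
gap_at_end = smallest_missing_index([0, 1, 2])
gap_at_start = smallest_missing_index([1, 2])
print(gap_inside, gap_at_end, gap_at_start)
```

With indices 0 to 19 all present, as for the 20 users kept above, this logic yields 20.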

## class CatFeaturesTransformer()

Transforms the categorical features listed in `cat_cols_list` with one-hot encoding and removes the original columns.
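
To make the transformation concrete, here is a minimal pure-Python sketch of one-hot encoding on rows stored as dictionaries. `OneHotSketch` is a hypothetical stand-in and not the RePlay implementation; only the `ohe_<column>_<value>` naming is taken from the outputs in this notebook.

```python
class OneHotSketch:
    """Hypothetical stand-in used only to illustrate one-hot encoding."""

    def __init__(self, column):
        self.column = column
        self.categories_ = []

    def fit(self, rows):
        # Learn the distinct values of the column from the training rows.
        self.categories_ = sorted({row[self.column] for row in rows})

    def transform(self, rows):
        encoded = []
        for row in rows:
            # Drop the original column and add one 0/1 column per learned value.
            new_row = {k: v for k, v in row.items() if k != self.column}
            for value in self.categories_:
                new_row[f"ohe_{self.column}_{value}"] = int(row[self.column] == value)
            encoded.append(new_row)
        return encoded


train = [{"user_idx": 16, "month": 1}, {"user_idx": 16, "month": 3}]
encoder = OneHotSketch("month")
encoder.fit(train)
encoded_rows = encoder.transform(train)
print(encoded_rows)
```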

In [12]:
from replay.data_preparator import CatFeaturesTransformer
cft = CatFeaturesTransformer(["month"])
cft.fit(log_20_users)

In [13]:
log_trsfrm = cft.transform(log_20_users)

#### Before

In [14]:
log_20_users_pandas

| | user_idx | item_idx | relevance | timestamp | month |
|---|---|---|---|---|---|
| 0 | 16 | 366 | 4.0 | 2001-01-10 21:07:43 | 1 |
| 1 | 16 | 748 | 4.0 | 2002-03-21 18:09:13 | 3 |
| 2 | 16 | 1252 | 3.0 | 2003-01-20 21:14:09 | 1 |
| 3 | 16 | 1551 | 5.0 | 2002-09-10 19:34:53 | 9 |
| 4 | 16 | 3112 | 3.0 | 2002-04-02 00:49:39 | 4 |
| ... | ... | ... | ... | ... | ... |
| 25464 | 18 | 945 | 4.0 | 2000-05-09 20:05:51 | 5 |
| 25465 | 18 | 20 | 4.0 | 2000-05-09 20:02:15 | 5 |
| 25466 | 18 | 1452 | 4.0 | 2000-05-10 02:17:40 | 5 |
| 25467 | 18 | 1646 | 3.0 | 2000-05-09 23:10:36 | 5 |


#### After

In [15]:
log_trsfrm.toPandas()

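| | user_idx | item_idx | relevance | timestamp | ohe_month_9 | ohe_month_1 | ohe_month_5 | ohe_month_2 | ohe_month_6 | ohe_month_3 | ohe_month_10 | ohe_month_7 | ohe_month_4 | ohe_month_11 | ohe_month_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16 | 366 | 4.0 | 2001-01-10 21:07:43 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 16 | 748 | 4.0 | 2002-03-21 18:09:13 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 16 | 1252 | 3.0 | 2003-01-20 21:14:09 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16 | 1551 | 5.0 | 2002-09-10 19:34:53 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 16 | 3112 | 3.0 | 2002-04-02 00:49:39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25464 | 18 | 945 | 4.0 | 2000-05-09 20:05:51 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25465 | 18 | 20 | 4.0 | 2000-05-09 20:02:15 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25466 | 18 | 1452 | 4.0 | 2000-05-10 02:17:40 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25467 | 18 | 1646 | 3.0 | 2000-05-09 23:10:36 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |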


Let us check how the class behaves with values that were not present in the training data. To do this, we add one interaction with the value 12 in the `month` column to the DataFrame. No row with this value was seen when the transformer was fitted.

In [16]:
log_20_users_pandas["month"].unique()

array([ 1,  3,  9,  4,  2,  7,  8, 10, 11,  6,  5])

In [17]:
user_with_12_month_attribute = log.where("month == 12").limit(1)

# Collect the single row once and read both indices from it
row = user_with_12_month_attribute.collect()[0]
item_idx = row["item_idx"]
user_idx = row["user_idx"]

log_21_users = log_20_users.union(user_with_12_month_attribute)

In [18]:
log_trsfrm_21_users = cft.transform(log_21_users)

As we can see, for the interaction with a month value of 12, all one-hot columns are **0**: the unseen value gets no column of its own.

In [19]:
log_trsfrm_21_users.where(f"user_idx == {user_idx} and item_idx == {item_idx}").toPandas()

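| | user_idx | item_idx | relevance | timestamp | ohe_month_9 | ohe_month_1 | ohe_month_5 | ohe_month_2 | ohe_month_6 | ohe_month_3 | ohe_month_10 | ohe_month_7 | ohe_month_4 | ohe_month_11 | ohe_month_8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4131 | 43 | 5.0 | 2001-01-01 01:12:40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |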
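
Since an unseen value leaves every one-hot column at 0, such rows can be detected after the transformation. The helper below is a hypothetical sketch operating on a row represented as a dictionary; it is not part of RePlay.

```python
def is_unseen_category(row, prefix="ohe_month_"):
    """Return True when every one-hot column with the given prefix is 0."""
    ohe_values = [value for name, value in row.items() if name.startswith(prefix)]
    return bool(ohe_values) and not any(ohe_values)


# Months seen during fitting, in the column order of the output above.
months_seen_in_fit = [9, 1, 5, 2, 6, 3, 10, 7, 4, 11, 8]

row_month_12 = {f"ohe_month_{m}": 0 for m in months_seen_in_fit}
row_month_1 = {**row_month_12, "ohe_month_1": 1}

print(is_unseen_category(row_month_12), is_unseen_category(row_month_1))
```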


## class LogStatFeaturesProcessor()

**Feature names can start with:**

1. `u_` - user features
2. `i_` - item features
3. `u_i` - user features relative to items
4. `i_u` - item features relative to users
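
The prefixes make it possible to group generated columns by name. A small sketch follows, using column names that appear in the outputs of this notebook; the function `feature_group` and the group labels are hypothetical. The longer prefixes are checked first so that `u_i_...` is not mistaken for a plain `u_...` feature.

```python
# Longer prefixes first, otherwise "u_i_" would match "u_".
PREFIX_GROUPS = [
    ("u_i_", "user relative to items"),
    ("i_u_", "item relative to users"),
    ("u_", "user"),
    ("i_", "item"),
]


def feature_group(column_name):
    """Return the group label for a generated feature column, or 'other'."""
    for prefix, group in PREFIX_GROUPS:
        if column_name.startswith(prefix):
            return group
    return "other"


columns = [
    "u_log_num_interact",
    "i_quantile_05",
    "u_i_log_num_interact_diff",
    "i_u_log_num_interact_diff",
    "na_u_log_features",
]
groups = {name: feature_group(name) for name in columns}
print(groups)
```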

**Features:**

* `log_num_interact` - log of the number of interactions
* `log_interact_days_count` - log of the number of days with interactions
* `min_interact_date` - earliest interaction timestamp
* `max_interact_date` - latest interaction timestamp
* `std` - standard deviation of relevance
* `mean` - mean relevance
* `quantile_05` - approximate 0.05 quantile of relevance
* `quantile_5` - approximate 0.5 quantile of relevance
* `quantile_95` - approximate 0.95 quantile of relevance
* `history_length_days` - history length in days
* `last_interaction_gap_days` - number of days between the last interaction and the last date in the log
* `abnormality` - measure of how atypical a user is
* `abnormalityCR` - measure of how atypical a user is, less sensitive to controversial items
* `mean_log_num_interact` - average log number of interactions of the users/items that interacted with the item/user
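
A few of these statistics can be sketched on a made-up toy log in plain Python. The assumptions here are that the logarithm is natural and that history length is the difference between the latest and earliest timestamps in whole days; the rows are invented for illustration and this is not the RePlay implementation.

```python
import math
from collections import defaultdict
from datetime import datetime

# (user_idx, item_idx, relevance, timestamp): made-up toy rows
toy_log = [
    (0, 10, 5.0, datetime(2001, 1, 1)),
    (0, 11, 3.0, datetime(2001, 1, 11)),
    (1, 10, 4.0, datetime(2001, 2, 1)),
]

rows_by_user = defaultdict(list)
for user, _item, relevance, timestamp in toy_log:
    rows_by_user[user].append((relevance, timestamp))

user_features = {}
for user, rows in rows_by_user.items():
    relevances = [r for r, _ in rows]
    dates = [t for _, t in rows]
    user_features[user] = {
        "u_log_num_interact": math.log(len(rows)),  # assumption: natural log
        "u_mean": sum(relevances) / len(relevances),
        "u_min_interact_date": min(dates),
        "u_max_interact_date": max(dates),
        "u_history_length_days": (max(dates) - min(dates)).days,
    }

print(user_features[0])
```

Item features with the `i_` prefix would follow the same pattern with rows grouped by `item_idx`.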

In [20]:
from replay.history_based_fp import LogStatFeaturesProcessor
lf = LogStatFeaturesProcessor()
lf.fit(log_20_users)

In [21]:
log_trsfrm = lf.transform(log_20_users)

#### Before

In [22]:
log_20_users_pandas

| | user_idx | item_idx | relevance | timestamp | month |
|---|---|---|---|---|---|
| 0 | 16 | 366 | 4.0 | 2001-01-10 21:07:43 | 1 |
| 1 | 16 | 748 | 4.0 | 2002-03-21 18:09:13 | 3 |
| 2 | 16 | 1252 | 3.0 | 2003-01-20 21:14:09 | 1 |
| 3 | 16 | 1551 | 5.0 | 2002-09-10 19:34:53 | 9 |
| 4 | 16 | 3112 | 3.0 | 2002-04-02 00:49:39 | 4 |
| ... | ... | ... | ... | ... | ... |
| 25464 | 18 | 945 | 4.0 | 2000-05-09 20:05:51 | 5 |
| 25465 | 18 | 20 | 4.0 | 2000-05-09 20:02:15 | 5 |
| 25466 | 18 | 1452 | 4.0 | 2000-05-10 02:17:40 | 5 |
| 25467 | 18 | 1646 | 3.0 | 2000-05-09 23:10:36 | 5 |


#### After

In [24]:
log_trsfrm_pandas = log_trsfrm.toPandas()

In [25]:
log_trsfrm_pandas.head(10)

| | item_idx | user_idx | relevance | timestamp | month | u_log_num_interact | u_log_interact_days_count | u_min_interact_date | u_max_interact_date | u_std | ... | i_quantile_05 | i_quantile_5 | i_quantile_95 | i_history_length_days | i_last_interaction_gap_days | i_mean_u_log_num_interact | na_u_log_features | na_i_log_features | u_i_log_num_interact_diff | i_u_log_num_interact_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 366 | 16 | 4.0 | 2001-01-10 21:07:43 | 1 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 2.0 | 4.0 | 5.0 | 245 | 778 | 7.147767 | 0.0 | 0.0 | -0.4108 | 0.467611 |
| 1 | 748 | 16 | 4.0 | 2002-03-21 18:09:13 | 3 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 3.0 | 4.0 | 5.0 | 681 | 343 | 7.144384 | 0.0 | 0.0 | -0.407417 | 0.119304 |
| 2 | 1252 | 16 | 3.0 | 2003-01-20 21:14:09 | 1 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 1.0 | 3.0 | 4.0 | 983 | 38 | 7.108856 | 0.0 | 0.0 | -0.371889 | 0.032292 |
| 3 | 1551 | 16 | 5.0 | 2002-09-10 19:34:53 | 9 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 3.0 | 4.0 | 5.0 | 768 | 170 | 7.17303 | 0.0 | 0.0 | -0.436064 | -0.286161 |
| 4 | 3112 | 16 | 3.0 | 2002-04-02 00:49:39 | 4 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 3.0 | 3.0 | 3.0 | 407 | 331 | 7.094975 | 0.0 | 0.0 | -0.358008 | -1.672456 |
| 5 | 2971 | 16 | 3.0 | 2002-04-02 00:46:00 | 4 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 3.0 | 3.0 | 4.0 | 442 | 331 | 7.094975 | 0.0 | 0.0 | -0.358008 | -1.672456 |
| 6 | 2495 | 16 | 4.0 | 2003-02-21 17:27:09 | 2 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 1.0 | 2.0 | 4.0 | 817 | 6 | 7.156144 | 0.0 | 0.0 | -0.419177 | -0.979308 |
| 7 | 1783 | 16 | 3.0 | 2002-04-24 21:55:18 | 4 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 2.0 | 3.0 | 3.0 | 574 | 309 | 7.191866 | 0.0 | 0.0 | -0.4549 | -0.573843 |
| 8 | 731 | 16 | 3.0 | 2002-04-02 01:52:19 | 4 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 1.0 | 3.0 | 4.0 | 890 | 51 | 7.129113 | 0.0 | 0.0 | -0.392146 | 0.119304 |
| 9 | 1501 | 16 | 3.0 | 2002-03-27 18:40:11 | 3 | 6.736967 | 4.795791 | 2001-01-10 20:59:24 | 2003-02-27 15:31:39 | 0.946053 | ... | 2.0 | 3.0 | 5.0 | 763 | 258 | 7.165088 | 0.0 | 0.0 | -0.428121 | 0.032292 |
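
In the rows above, `u_i_log_num_interact_diff` matches `u_log_num_interact` minus `i_mean_u_log_num_interact` up to rounding. This is an observation from the displayed values rather than a documented definition; the check below reproduces it for the first three rows.

```python
# (u_log_num_interact, i_mean_u_log_num_interact, u_i_log_num_interact_diff)
# copied from rows 0-2 of the output above
displayed_rows = [
    (6.736967, 7.147767, -0.4108),
    (6.736967, 7.144384, -0.407417),
    (6.736967, 7.108856, -0.371889),
]

recomputed = [round(u_log - i_mean_u_log, 6) for u_log, i_mean_u_log, _ in displayed_rows]
max_error = max(
    abs(value - reported) for value, (_, _, reported) in zip(recomputed, displayed_rows)
)
print(recomputed, max_error)
```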


#### `LogStatFeaturesProcessor()` behavior when adding a cold user or item

In [26]:
cold_user_idx, cold_item_idx =  find_cold_idx(log_20_users)

#### Add cold user

In [27]:
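# Note: `isin` receives a Column rather than a list of values here,
# so this filter may not restrict the row to items present in `log_20_users`
user_cold = log.where(f"user_idx == {cold_user_idx}").filter(log.item_idx.isin(log_20_users.item_idx)).limit(1)

# Collect the single row once and read both indices from it
row = user_cold.collect()[0]
item_idx = row["item_idx"]
user_idx = row["user_idx"]

log_21_users = log_20_users.union(user_cold)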

In [28]:
log_trsfrm = lf.transform(log_21_users)

In [29]:
log_trsfrm.where(f"user_idx == {user_idx}").toPandas()

| | item_idx | user_idx | relevance | timestamp | month | u_log_num_interact | u_log_interact_days_count | u_min_interact_date | u_max_interact_date | u_std | ... | i_quantile_05 | i_quantile_5 | i_quantile_95 | i_history_length_days | i_last_interaction_gap_days | i_mean_u_log_num_interact | na_u_log_features | na_i_log_features | u_i_log_num_interact_diff | i_u_log_num_interact_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1658 | 20 | 3.0 | 2000-08-11 16:42:11 | 8 | 0.0 | 0.0 | 1970-01-01 03:00:00 | 1970-01-01 03:00:00 | 0.0 | ... | 3.0 | 3.0 | 5.0 | 114 | 823 | 7.27781 | 1.0 | 0.0 | -7.27781 | 1.791759 |
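
For the cold user, the user statistics are 0, `na_u_log_features` is 1.0, and both date columns read `1970-01-01 03:00:00`. This is consistent with missing values being filled with 0 and a zero timestamp being rendered as the Unix epoch in a UTC+3 session. Both points are inferences from the output, not documented behavior, and the UTC+3 offset is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps in the output are rendered in UTC+3.
display_tz = timezone(timedelta(hours=3))

# A date column filled with 0 is the Unix epoch.
epoch_shown = datetime.fromtimestamp(0, tz=display_tz)
epoch_text = epoch_shown.strftime("%Y-%m-%d %H:%M:%S")

# With the user statistic filled with 0, the difference feature reduces to
# minus the item-side average shown in the row above.
u_i_diff = 0.0 - 7.27781

print(epoch_text, u_i_diff)
```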


#### Add cold item

In [30]:
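# Caution: `isin` receives a Column rather than a list of values here, so this
# filter does not appear to restrict the row to the 20 warm users: the row
# returned below has user_idx 1603, which is cold as well (na_u_log_features is 1.0)
item_cold = log.where(f"item_idx == {cold_item_idx}").filter(log.user_idx.isin(log_20_users.user_idx)).limit(1)

# Collect the single row once and read both indices from it
row = item_cold.collect()[0]
item_idx = row["item_idx"]
user_idx = row["user_idx"]

log_21_users = log_20_users.union(item_cold)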

In [31]:
log_trsfrm = lf.transform(log_21_users)

In [32]:
log_trsfrm.where(f"item_idx == {item_idx}").toPandas()

| | item_idx | user_idx | relevance | timestamp | month | u_log_num_interact | u_log_interact_days_count | u_min_interact_date | u_max_interact_date | u_std | ... | i_quantile_05 | i_quantile_5 | i_quantile_95 | i_history_length_days | i_last_interaction_gap_days | i_mean_u_log_num_interact | na_u_log_features | na_i_log_features | u_i_log_num_interact_diff | i_u_log_num_interact_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1603 | 3.0 | 2000-12-31 09:25:37 | 12 | 0.0 | 0.0 | 1970-01-01 03:00:00 | 1970-01-01 03:00:00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |


#### Add cold item and user

In [33]:
# Exclude all 20 warm users (user_idx < 20) so that the selected user is cold too
item_cold = log.where(f"item_idx == {cold_item_idx}").filter(~sf.col("user_idx").isin(list(range(20)))).limit(1)

# Collect the single row once and read both indices from it
row = item_cold.collect()[0]
item_idx = row["item_idx"]
user_idx = row["user_idx"]

log_21_users = log_20_users.union(item_cold)

In [34]:
log_trsfrm = lf.transform(log_21_users)

In [35]:
log_trsfrm.where(f"item_idx == {item_idx} and user_idx == {user_idx}").toPandas()

| | item_idx | user_idx | relevance | timestamp | month | u_log_num_interact | u_log_interact_days_count | u_min_interact_date | u_max_interact_date | u_std | ... | i_quantile_05 | i_quantile_5 | i_quantile_95 | i_history_length_days | i_last_interaction_gap_days | i_mean_u_log_num_interact | na_u_log_features | na_i_log_features | u_i_log_num_interact_diff | i_u_log_num_interact_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 1603 | 3.0 | 2000-12-31 09:25:37 | 12 | 0.0 | 0.0 | 1970-01-01 03:00:00 | 1970-01-01 03:00:00 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
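
The outputs in this section suggest a simple pattern: features of users or items seen in `fit` are looked up, while unseen ones get zeros together with a flag (`na_u_log_features` or `na_i_log_features`) set to 1.0. The sketch below mimics that pattern for users with a plain dictionary. The function `lookup_user_features` is hypothetical, and the behavior is inferred from the outputs above rather than taken from the RePlay source.

```python
def lookup_user_features(user_idx, fitted_features, feature_names):
    """Return stored features for a known user, or zeros plus a cold flag."""
    if user_idx in fitted_features:
        return {**fitted_features[user_idx], "na_u_log_features": 0.0}
    # Unknown user: fill every feature with 0 and raise the cold flag.
    return {**{name: 0.0 for name in feature_names}, "na_u_log_features": 1.0}


feature_names = ["u_log_num_interact", "u_std"]
# Values for user 16 are copied from the output earlier in this notebook.
fitted_features = {16: {"u_log_num_interact": 6.736967, "u_std": 0.946053}}

warm_user = lookup_user_features(16, fitted_features, feature_names)
cold_user = lookup_user_features(1603, fitted_features, feature_names)
print(warm_user, cold_user)
```

Keeping an explicit flag lets a downstream model distinguish a genuine zero from a value that was filled in because the user or item was unseen.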
