## Recommendation system project

In this project, I build a recommender system for text posts. The raw data comes from the following tables:

**user_data:**

| Field name  | Overview  |
|---|---|
| age  | User age (in profile)  |
| city  | User city (in profile)  |
| country  | User country (in profile)  |
| exp_group  | Experimental group: some encrypted category  |
| gender  | User gender  |
| user_id  | Unique user ID  |
| os | The operating system of the device from which the social network is used  |
| source  | Whether the user came to the app from organic traffic or from ads  |

**post_text_df:**

| Field name | Overview |
|---|---|
| post_id  | Unique post ID  |
| text  | Text content of the post  |
| topic  | Main theme |

**feed_data**:

| Field name  | Overview  |
|---|---|
| timestamp  | Time of the view  |
| user_id | ID of the user who viewed the post |
| post_id  | ID of the viewed post  |
| action  | Action type: view or like  |
| target  | For views, 1 if a like followed almost immediately after the view, otherwise 0. For like actions, the value is missing.  |
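
Since like rows carry no target value, a training set would normally keep only the view rows. The sketch below shows one way to do that on a toy frame. The sample values are invented for illustration and the variable names (`feed`, `views`, `like_rate`) are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for feed_data; the values are invented for illustration
feed = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "post_id": [10, 10, 11, 12],
    "action": ["view", "like", "view", "view"],
    "target": [1.0, None, 0.0, 0.0],
})

# Keep only view rows: they are the ones that carry a 0/1 target
views = feed[feed["action"] == "view"].copy()
views["target"] = views["target"].astype(int)

# Share of views that were followed by a like
like_rate = views["target"].mean()
```

The same filter could also be pushed into the SQL query with a `WHERE` clause, which would avoid transferring the like rows at all.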

### Import required libraries

In [29]:
import warnings
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


In [24]:
warnings.filterwarnings("ignore")

### Database connection and used tables overview

This work uses PostgreSQL as the RDBMS. The `connection` variable holds the database connection string. It will be removed when the project is published on GitHub. To demonstrate the web service, a small sample of the processed data will be provided.

In [2]:
connection = "postgresql://robot-startml-ro:pheiph0hahj1Vaif@postgres.lab.karpov.courses:6432/startml"

In [None]:
### Users data

user_info = pd.read_sql(
    """SELECT * FROM public.user_data""",

    con=connection
)

In [6]:
user_info.head()

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads


In [None]:
### Posts and topics

posts_info = pd.read_sql(
    """SELECT * FROM public.post_text_df""",
    
    con=connection
)

In [7]:
posts_info.head()

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business


**feed_data** contains about 77 million rows. That is too many to load at once, so let's take 1 million.

In [None]:
feed_data = pd.read_sql(
    """SELECT * FROM public.feed_data LIMIT 1000000""",
    
    con=connection
)

In [9]:
feed_data.head()

Unnamed: 0,timestamp,user_id,post_id,action,target
0,2021-10-25 19:44:04,164488,2289,view,0
1,2021-10-25 19:44:53,164488,1289,view,0
2,2021-10-25 19:47:35,164488,1734,view,0
3,2021-10-25 19:50:33,164488,541,view,0
4,2021-10-25 19:51:49,164488,970,view,0


### Working with data and features for the content-based model

Recall how the content-based approach works:

1. Train on data up to a selected cutoff timestamp

2. For any pair (user_id, post_id)

3. Predict whether a like will happen or not

4. Ideally, the model should predict probabilities
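
Probabilities matter because they let us rank candidate posts for a user and return the top few. The following is a minimal sketch of that ranking step. The function `recommend` and the stub `toy_probability` are hypothetical names introduced here; in a real pipeline the stub would be replaced by a call to the trained model.

```python
def recommend(user_id, candidate_posts, like_probability, k=5):
    """Return the k candidate post ids with the highest predicted like probability."""
    scored = [
        (post_id, like_probability(user_id, post_id))
        for post_id in candidate_posts
    ]
    # Highest probability first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [post_id for post_id, _ in scored[:k]]


def toy_probability(user_id, post_id):
    # Hypothetical stand-in for a trained model's probability output
    return {10: 0.2, 11: 0.7, 12: 0.5}[post_id]


top = recommend(
    user_id=1,
    candidate_posts=[10, 11, 12],
    like_probability=toy_probability,
    k=2,
)
```

With these toy scores, post 11 is ranked first and post 12 second.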

Each user_id needs a fixed set of features. In general, the original table is suitable as is:

In [19]:
user_info.head()

Unnamed: 0,user_id,gender,age,country,city,exp_group,os,source
0,200,1,34,Russia,Degtyarsk,3,Android,ads
1,201,0,37,Russia,Abakan,0,Android,ads
2,202,1,17,Russia,Smolensk,4,Android,ads
3,203,0,18,Russia,Moscow,1,iOS,ads
4,204,0,36,Russia,Anzhero-Sudzhensk,3,Android,ads


The post_id side is much more interesting, because the texts need some kind of embedding:

In [20]:
posts_info.head()

Unnamed: 0,post_id,text,topic
0,1,UK economy facing major risks\n\nThe UK manufa...,business
1,2,Aids and climate top Davos agenda\n\nClimate c...,business
2,3,Asian quake hits European shares\n\nShares in ...,business
3,4,India power shares jump on debut\n\nShares in ...,business
4,5,Lacroix label bought by US firm\n\nLuxury good...,business


In this work, I will use several approaches to obtain text embeddings: `TF-IDF`, `bert-base-cased`, `roberta-base` and `distilbert-base-cased`. The models trained on the different feature sets will then be compared using A/B testing.

#### TF-IDF

In [21]:
import re
import string

from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import TfidfVectorizer

wnl = WordNetLemmatizer()

def preprocessing(line, token=wnl):
    line = line.lower()
    line = re.sub(r"[{}]".format(string.punctuation), " ", line)
    line = line.replace('\n\n', ' ').replace('\n', ' ')
    line = ' '.join([token.lemmatize(x) for x in line.split(' ')])
    return line


tfidf = TfidfVectorizer(
    stop_words='english',
    preprocessor=preprocessing
)
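
To see what the text cleaning does before vectorization, here is a dependency-light sketch of the same idea. It is not the notebook's `preprocessing` function: it leaves out lemmatization (so NLTK is not needed), escapes the punctuation set with `re.escape`, and collapses repeated whitespace. The name `basic_clean` is introduced only for this example.

```python
import re
import string

# Character class that matches any ASCII punctuation mark
PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")


def basic_clean(line: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    line = line.lower()
    line = PUNCT_RE.sub(" ", line)
    # str.split() without arguments also removes newlines and repeated spaces
    return " ".join(line.split())


cleaned = basic_clean("UK economy facing major risks\n\nThe UK's output fell.")
```

A function like this could be passed as `preprocessor=` to `TfidfVectorizer` in the same way as the original one.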

In [26]:
tfidf_data = (
    tfidf
    .fit_transform(posts_info["text"])
    .toarray()
)

tfidf_data = pd.DataFrame(
    tfidf_data,
    index=posts_info.post_id,
    columns=tfidf.get_feature_names_out()
)

tfidf_data.head()

Unnamed: 0_level_0,00,000,0001,000bn,000m,000s,000th,001,001and,001st,...,𝓫𝓮,𝓫𝓮𝓽𝓽𝓮𝓻,𝓬𝓸𝓾𝓻𝓽𝓼,𝓱𝓮𝓪𝓻𝓲𝓷𝓰,𝓶𝓪𝔂,𝓹𝓱𝔂𝓼𝓲𝓬𝓪𝓵,𝓼𝓸𝓸𝓷𝓮𝓻,𝓼𝓾𝓫𝓸𝓻𝓭𝓲𝓷𝓪𝓽𝓮,𝓽𝓱𝓮,𝓽𝓸
post_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.132739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.050614,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


About 46,000 features is too many for most models, and most of the values are zeros. So let's use the resulting matrix only to generate new features. First, generate aggregated features based on TF-IDF:

In [28]:
posts_info["TotalTfIdf"] = tfidf_data.sum(axis=1).reset_index()[0]
posts_info["MaxTfIdf"] = tfidf_data.max(axis=1).reset_index()[0]
posts_info["MeanTfIdf"] = tfidf_data.mean(axis=1).reset_index()[0]

posts_info.head()

Unnamed: 0,post_id,text,topic,TotalTfIdf,MaxTfIdf,MeanTfIdf
0,1,UK economy facing major risks\n\nThe UK manufa...,business,8.748129,0.495805,0.00019
1,2,Aids and climate top Davos agenda\n\nClimate c...,business,11.878472,0.308003,0.000258
2,3,Asian quake hits European shares\n\nShares in ...,business,12.67553,0.261799,0.000276
3,4,India power shares jump on debut\n\nShares in ...,business,6.622786,0.537713,0.000144
4,5,Lacroix label bought by US firm\n\nLuxury good...,business,6.352096,0.420251,0.000138
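
The assignment above goes through `reset_index()` so that the aggregated values line up with the default 0..n-1 index of `posts_info`. A position-based variant is sketched below on toy data, assuming only that both frames list the posts in the same order. The frames `posts` and `weights` and their values are invented for illustration.

```python
import pandas as pd

# Toy posts table with a default RangeIndex
posts = pd.DataFrame({"post_id": [7, 8, 9], "topic": ["tech", "sport", "tech"]})

# Toy TF-IDF-like matrix indexed by post_id, as in the notebook
weights = pd.DataFrame(
    [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [0.2, 0.2, 0.6]],
    index=posts["post_id"],
    columns=["a", "b", "c"],
)

# .to_numpy() drops the index, so values are assigned by row position
posts["TotalTfIdf"] = weights.sum(axis=1).to_numpy()
posts["MaxTfIdf"] = weights.max(axis=1).to_numpy()
posts["MeanTfIdf"] = weights.mean(axis=1).to_numpy()
```

Assigning a Series directly would align on index labels instead, which yields missing values when the labels (post ids) do not match the row numbers.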


Second, apply PCA to the TF-IDF matrix, cluster the reduced data using `KMeans`, and use the distance to each cluster center as features:

In [30]:
# Calculate PCA
centered = tfidf_data - tfidf_data.mean()
pca = PCA(n_components=20)
pca_decomp = pca.fit_transform(centered)

# Cluster texts using KMeans
n_clusters = 15
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(pca_decomp)
posts_info["TextCluster"] = kmeans.labels_

dists_columns = [f"DistanceTo{i}thCluster" for i in range(1, n_clusters + 1)]

# Construct new dataframe
dists_df = pd.DataFrame(
    data=kmeans.transform(pca_decomp),
    columns=dists_columns
)

dists_df.head()

Unnamed: 0,DistanceTo1thCluster,DistanceTo2thCluster,DistanceTo3thCluster,DistanceTo4thCluster,DistanceTo5thCluster,DistanceTo6thCluster,DistanceTo7thCluster,DistanceTo8thCluster,DistanceTo9thCluster,DistanceTo10thCluster,DistanceTo11thCluster,DistanceTo12thCluster,DistanceTo13thCluster,DistanceTo14thCluster,DistanceTo15thCluster
0,0.541474,0.153752,0.502006,0.432577,0.450781,0.391465,0.470436,0.478488,0.529134,0.497147,0.44574,0.502932,0.692984,0.441007,0.58153
1,0.440881,0.250714,0.37334,0.14795,0.306506,0.19326,0.331262,0.338365,0.409533,0.370184,0.279457,0.417898,0.580274,0.290202,0.343088
2,0.476969,0.123318,0.388621,0.331715,0.333302,0.235024,0.366664,0.360627,0.432135,0.393841,0.320277,0.429572,0.541091,0.31406,0.539727
3,0.439527,0.217457,0.350847,0.260073,0.274319,0.158582,0.323516,0.318172,0.383934,0.320928,0.278984,0.405158,0.454373,0.261881,0.497525
4,0.396859,0.280756,0.295478,0.217563,0.195963,0.080126,0.30478,0.258181,0.34405,0.293527,0.216597,0.351318,0.511789,0.177199,0.463908
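
As a sanity check on what these distance features mean, the sketch below runs `KMeans` on three well-separated synthetic blobs and confirms that `transform` returns Euclidean distances to the cluster centers. The data, the number of clusters and the explicit `n_init=10` are choices made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# One column per cluster: distance from each point to that cluster's center
distances = kmeans.transform(points)

# The closest center should coincide with the assigned cluster label
nearest = distances.argmin(axis=1)

# The same distances computed by hand from the fitted centers
manual = np.linalg.norm(
    points[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)
```

The smallest value in each row of `distances` therefore identifies the cluster stored in `labels_`, while the remaining columns describe how far the text is from the other clusters.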


Concatenate new features with `posts_info` table:

In [31]:
posts_info = pd.concat((posts_info, dists_df), axis=1)
posts_info.head()

Unnamed: 0,post_id,text,topic,TotalTfIdf,MaxTfIdf,MeanTfIdf,TextCluster,DistanceTo1thCluster,DistanceTo2thCluster,DistanceTo3thCluster,...,DistanceTo6thCluster,DistanceTo7thCluster,DistanceTo8thCluster,DistanceTo9thCluster,DistanceTo10thCluster,DistanceTo11thCluster,DistanceTo12thCluster,DistanceTo13thCluster,DistanceTo14thCluster,DistanceTo15thCluster
0,1,UK economy facing major risks\n\nThe UK manufa...,business,8.748129,0.495805,0.00019,1,0.541474,0.153752,0.502006,...,0.391465,0.470436,0.478488,0.529134,0.497147,0.44574,0.502932,0.692984,0.441007,0.58153
1,2,Aids and climate top Davos agenda\n\nClimate c...,business,11.878472,0.308003,0.000258,3,0.440881,0.250714,0.37334,...,0.19326,0.331262,0.338365,0.409533,0.370184,0.279457,0.417898,0.580274,0.290202,0.343088
2,3,Asian quake hits European shares\n\nShares in ...,business,12.67553,0.261799,0.000276,1,0.476969,0.123318,0.388621,...,0.235024,0.366664,0.360627,0.432135,0.393841,0.320277,0.429572,0.541091,0.31406,0.539727
3,4,India power shares jump on debut\n\nShares in ...,business,6.622786,0.537713,0.000144,5,0.439527,0.217457,0.350847,...,0.158582,0.323516,0.318172,0.383934,0.320928,0.278984,0.405158,0.454373,0.261881,0.497525
4,5,Lacroix label bought by US firm\n\nLuxury good...,business,6.352096,0.420251,0.000138,5,0.396859,0.280756,0.295478,...,0.080126,0.30478,0.258181,0.34405,0.293527,0.216597,0.351318,0.511789,0.177199,0.463908


Join `posts_info` with `feed_data` and `user_info`:

In [32]:
joined_data = pd.merge(feed_data,
                       posts_info,
                       on="post_id",
                       how="left")

joined_data = pd.merge(joined_data,
                       user_info,
                       on="user_id",
                       how="left")

joined_data.head()

Unnamed: 0,timestamp,user_id,post_id,action,target,text,topic,TotalTfIdf,MaxTfIdf,MeanTfIdf,...,DistanceTo13thCluster,DistanceTo14thCluster,DistanceTo15thCluster,gender,age,country,city,exp_group,os,source
0,2021-10-25 19:44:04,164488,2289,view,0,EU software patent law faces axe\n\nThe Europe...,tech,9.152702,0.294567,0.000199,...,0.557824,0.320145,0.522452,1,33,Russia,Kostin Log,0,iOS,organic
1,2021-10-25 19:44:53,164488,1289,view,0,Hague given up his PM ambition\n\nFormer Conse...,politics,6.994911,0.716144,0.000152,...,0.614566,0.294214,0.248263,1,33,Russia,Kostin Log,0,iOS,organic
2,2021-10-25 19:47:35,164488,1734,view,0,Rush future at Chester uncertain\n\nIan Rushs ...,sport,7.004362,0.640725,0.000152,...,0.58442,0.175505,0.479616,1,33,Russia,Kostin Log,0,iOS,organic
3,2021-10-25 19:50:33,164488,541,view,0,Musicians to tackle US red tape\n\nMusicians g...,entertainment,9.547157,0.401861,0.000208,...,0.568098,0.214478,0.43361,1,33,Russia,Kostin Log,0,iOS,organic
4,2021-10-25 19:51:49,164488,970,view,0,Boothroyd calls for Lords speaker\n\nBetty Boo...,politics,6.14949,0.545454,0.000134,...,0.646946,0.278332,0.417744,1,33,Russia,Kostin Log,0,iOS,organic


The data also has a `timestamp` column. Let's extract features from it:

In [None]:
timestamps = pd.to_datetime(joined_data["timestamp"])
joined_data["hour"] = timestamps.dt.hour
joined_data["month"] = timestamps.dt.month

# Remove action and text columns
# But leave timestamp for train/test split
joined_data = joined_data.drop([
                "action",
                "text",
                ],
                axis=1)

joined_data = joined_data.set_index(["user_id", "post_id"])

joined_data.head()

In [None]:
### The unneeded columns (`action`, `text`) were already dropped above
### and the index is set, so reuse the same table under the name `df`
### (`timestamp` is kept for now)

df = joined_data

df.head(50)

### Time to train the models!

In [None]:
### Let's start with a relatively simple model,
### for example a decision tree,
### and then move on to boosting

### How to validate? How do we split into train and test?
### By time, since the data has a temporal
### structure: we want to correctly estimate
### probabilities for future recommendations

max(df.timestamp), min(df.timestamp)

In [None]:
### Take 2021-12-15 as the cutoff

df_train = df[df.timestamp < '2021-12-15']
df_test = df[df.timestamp >= '2021-12-15']

df_train = df_train.drop('timestamp', axis=1)
df_test = df_test.drop('timestamp', axis=1)

X_train = df_train.drop('target', axis=1)
X_test = df_test.drop('target', axis=1)

y_train = df_train['target']
y_test = df_test['target']

y_train.shape, y_test.shape
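
The time-based split can also be wrapped in a small helper, which makes it easy to check that no test row precedes the cutoff. This is a sketch on toy data. The helper name `time_split` and the sample rows are assumptions made for the example.

```python
import pandas as pd


def time_split(frame, cutoff, time_col="timestamp"):
    """Split rows into (before cutoff, at or after cutoff) by a time column."""
    ts = pd.to_datetime(frame[time_col])
    cutoff = pd.Timestamp(cutoff)
    return frame[ts < cutoff], frame[ts >= cutoff]


# Toy rows; the values are invented for illustration
toy = pd.DataFrame({
    "timestamp": [
        "2021-12-01 10:00:00",
        "2021-12-14 23:59:59",
        "2021-12-15 00:00:00",
        "2021-12-20 08:30:00",
    ],
    "target": [0, 1, 0, 1],
})

train, test = time_split(toy, "2021-12-15")
```

Every training row is strictly earlier than every test row, so the model is evaluated only on interactions it could not have seen.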

In [None]:
X_train

In [None]:
### Start with a decision tree!

from sklearn.compose import ColumnTransformer
from category_encoders import TargetEncoder
from category_encoders.one_hot import OneHotEncoder

object_cols = [
    'topic', 'TextCluster', 'gender', 'country',
    'city', 'exp_group', 'hour', 'month',
    'os', 'source'
]

cols_for_ohe = [x for x in object_cols if X_train[x].nunique() < 5]
cols_for_mte = [x for x in object_cols if X_train[x].nunique() >= 5]

### Save the indices of these columns

cols_for_ohe_idx = [list(X_train.columns).index(col) for col in cols_for_ohe]
cols_for_mte_idx = [list(X_train.columns).index(col) for col in cols_for_mte]

t = [
    ('OneHotEncoder', OneHotEncoder(), cols_for_ohe_idx),
    ('MeanTargetEncoder', TargetEncoder(), cols_for_mte_idx)
]

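### Note: with the default remainder='drop', columns not listed
### in the transformers are dropped from the model input
col_transform = ColumnTransformer(transformers=t)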

from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

pipe_dt = Pipeline([("column_transformer",
                     col_transform),
                     
                    ("decision_tree", 
                     DecisionTreeClassifier())])

pipe_dt.fit(X_train, y_train)
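
For intuition about mean target encoding, the sketch below computes the plain, unsmoothed version with pandas: each category is replaced by the average target observed for it in the training data. This is a simplification and, as far as I know, not identical to what `category_encoders.TargetEncoder` does, since that encoder also blends in the global mean through smoothing. The toy data is invented for illustration.

```python
import pandas as pd

# Toy training data; the values are invented for illustration
train = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Abakan", "Abakan", "Abakan"],
    "target": [1, 0, 0, 0, 1],
})
test = pd.DataFrame({"city": ["Moscow", "Smolensk"]})

# Fallback for categories that never appear in the training data
global_mean = train["target"].mean()

# Average target per category, learned on the training data only
city_means = train.groupby("city")["target"].mean()

train_encoded = train["city"].map(city_means)
test_encoded = test["city"].map(city_means).fillna(global_mean)
```

The encoding is learned on the training split only and then applied to the test split, which is also what placing the encoder inside a `Pipeline` achieves.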

In [None]:
### Measure the quality of this model
### using ROC-AUC

from sklearn.metrics import roc_auc_score

print(f"ROC-AUC on train: {roc_auc_score(y_train, pipe_dt.predict_proba(X_train)[:, 1])}")
print(f"ROC-AUC on test: {roc_auc_score(y_test, pipe_dt.predict_proba(X_test)[:, 1])}")
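
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The pure-Python sketch below computes it directly from that definition. It is quadratic in the number of examples and meant only as an illustration, not as a replacement for `roc_auc_score`. The function name `pairwise_auc` is introduced here.

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the value is 0.75.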

In [None]:
### Now train CatBoost!

from catboost import CatBoostClassifier

catboost = CatBoostClassifier(iterations=100,
                              learning_rate=1,
                              depth=2)

catboost.fit(X_train, y_train, cat_features=object_cols)

In [None]:
### Measure the quality of this model
### using ROC-AUC

print(f"ROC-AUC on train: {roc_auc_score(y_train, catboost.predict_proba(X_train)[:, 1])}")
print(f"ROC-AUC on test: {roc_auc_score(y_test, catboost.predict_proba(X_test)[:, 1])}")

In [None]:
### Out of curiosity, look at the feature importances

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_importance(importance,names,model_type):
    
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)
    
    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)
    
    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    
    #Define size of bar plot
    plt.figure(figsize=(10,8))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + ' FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')
    
plot_feature_importance(catboost.feature_importances_,X_train.columns,'Catboost')

In [None]:
### Save the model

catboost.save_model(
    'catboost_model',
    format="cbm"                  
)

### Store the features required by the model in the database

In [None]:
posts_info.to_sql(
    "posts_info_features",
    con="postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml",
    schema="public",
    if_exists='replace'
)

In [None]:
### Is everything OK?

test_ = pd.read_sql(
    """SELECT * FROM public.posts_info_features""",
    
    con="postgresql://robot-startml-ro:pheiph0hahj1Vaif@"
        "postgres.lab.karpov.courses:6432/startml"
)

test_
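
The same write-then-read check can be rehearsed locally without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in, which is an assumption made for this example. The toy frame `features` is invented, and `index=False` is passed so that the DataFrame index is not written as an extra column.

```python
import sqlite3

import pandas as pd

# Toy feature table; the values are invented for illustration
features = pd.DataFrame({
    "post_id": [1, 2, 3],
    "TextCluster": [1, 3, 1],
    "TotalTfIdf": [8.75, 11.88, 12.68],
})

# In-memory SQLite database used in place of PostgreSQL
conn = sqlite3.connect(":memory:")

# Write the table, replacing it if it already exists
features.to_sql("posts_info_features", con=conn, if_exists="replace", index=False)

# Read it back to confirm the round trip
restored = pd.read_sql("SELECT * FROM posts_info_features", con=conn)
conn.close()
```

With the default `index=True`, as in the cell above, the stored table would contain an additional `index` column that has to be ignored or dropped when the features are loaded.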