<center><img src="https://github.com/hse-ds/iad-applied-ds/blob/master/2021/hw/hw1/img/logo_hse.png?raw=1" width="1000"></center>

<h1><center>Applied Problems in Data Analysis</center></h1>
<h2><center>Homework 4: Recommender Systems</center></h2>

# Introduction

In this assignment you will continue working with the data from the seminar: [Articles Sharing and Reading from CI&T Deskdrop](https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop).

# Loading and preprocessing the data

In [2]:
import pandas as pd
import numpy as np
import math
import scipy

Let's load the data and preprocess it the same way as in the seminar.

In [3]:
from google.colab import files
files.upload()         # expire any previous token(s) and upload recreated token

!rm -r ~/.kaggle
!mkdir ~/.kaggle
!mv ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list

Saving kaggle.json to kaggle.json
rm: cannot remove '/root/.kaggle': No such file or directory
ref                                                                   title                                             size  lastUpdated          downloadCount  voteCount  usabilityRating  
--------------------------------------------------------------------  -----------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
muratkokludataset/date-fruit-datasets                                 Date Fruit Datasets                              408KB  2022-04-03 09:25:39          10817       1463  0.9375           
victorsoeiro/netflix-tv-shows-and-movies                              Netflix TV Shows and Movies                        2MB  2022-05-15 00:01:23           2865        103  1.0              
mdmahmudulhasansuzan/students-adaptability-level-in-online-education  Students Adaptability Level in Online Education    6KB  2022-04-16 04:4

In [4]:
!kaggle datasets download -d gspmoreira/articles-sharing-reading-from-cit-deskdrop
!unzip articles-sharing-reading-from-cit-deskdrop.zip -d articles

Downloading articles-sharing-reading-from-cit-deskdrop.zip to /content
 61% 5.00M/8.20M [00:00<00:00, 15.1MB/s]
100% 8.20M/8.20M [00:00<00:00, 24.2MB/s]
Archive:  articles-sharing-reading-from-cit-deskdrop.zip
  inflating: articles/shared_articles.csv  
  inflating: articles/users_interactions.csv  


In [5]:
articles_df = pd.read_csv("articles/shared_articles.csv")
articles_df = articles_df[articles_df["eventType"] == "CONTENT SHARED"]
articles_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en


In [6]:
interactions_df = pd.read_csv("articles/users_interactions.csv")
interactions_df.head(2)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US


In [7]:
interactions_df.personId = interactions_df.personId.astype(str)
interactions_df.contentId = interactions_df.contentId.astype(str)
articles_df.contentId = articles_df.contentId.astype(str)

In [8]:
# define a dictionary that sets the strength of each interaction type
event_type_strength = {
   "VIEW": 1.0,
   "LIKE": 2.0, 
   "BOOKMARK": 2.5, 
   "FOLLOW": 3.0,
   "COMMENT CREATED": 4.0,  
}

interactions_df["eventStrength"] = interactions_df.eventType.apply(lambda x: event_type_strength[x])

Keep only the users who interacted with at least five distinct articles.

In [9]:
users_interactions_count_df = (
    interactions_df
    .groupby(["personId", "contentId"])
    .first()
    .reset_index()
    .groupby("personId").size())
print("# users:", len(users_interactions_count_df))

users_with_enough_interactions_df = \
    users_interactions_count_df[users_interactions_count_df >= 5].reset_index()[["personId"]]
print("# users with at least 5 interactions:",len(users_with_enough_interactions_df))

# users: 1895
# users with at least 5 interactions: 1140


Keep only the interactions that belong to the filtered users.

In [10]:
interactions_from_selected_users_df = interactions_df.loc[
    interactions_df.personId.isin(users_with_enough_interactions_df.personId)]

In [11]:
print(f"# interactions before: {interactions_df.shape}")
print(f"# interactions after: {interactions_from_selected_users_df.shape}")

# interactions before: (72312, 9)
# interactions after: (69868, 9)


Aggregate all of a user's interactions with each article and smooth the result by taking its logarithm.

In [12]:
def smooth_user_preference(x):
    return math.log(1+x, 2)
    
interactions_full_df = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"]).eventStrength.sum()
    .apply(smooth_user_preference)
    .reset_index().set_index(["personId", "contentId"])
)
interactions_full_df["last_timestamp"] = (
    interactions_from_selected_users_df
    .groupby(["personId", "contentId"])["timestamp"].last()
)
        
interactions_full_df = interactions_full_df.reset_index()
interactions_full_df.head(5)

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
2,-1007001694607905623,-793729620925729327,1.0,1472834892
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
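
As a small worked example of the smoothing step, the sketch below sums the event strengths for a few hypothetical event sequences and applies the same `log2(1 + x)` transform. The event sequences are made up for illustration; the weights and the formula are the ones defined above. Two views give `log2(3)`, which is the 1.584963 seen in the table.

```python
import math

# Interaction weights copied from the event_type_strength dictionary above.
event_type_strength = {
    "VIEW": 1.0,
    "LIKE": 2.0,
    "BOOKMARK": 2.5,
    "FOLLOW": 3.0,
    "COMMENT CREATED": 4.0,
}

def smooth_user_preference(x):
    # log2(1 + x): a single VIEW maps to exactly 1.0
    return math.log(1 + x, 2)

# Hypothetical event sequences for one (user, article) pair.
examples = {
    "one view": ["VIEW"],
    "two views": ["VIEW", "VIEW"],
    "view + like + bookmark": ["VIEW", "LIKE", "BOOKMARK"],
}

smoothed = {
    name: smooth_user_preference(sum(event_type_strength[e] for e in events))
    for name, events in examples.items()
}
for name, value in smoothed.items():
    print(f"{name}: {value:.6f}")
```

The logarithm compresses heavy users' totals, so ten views of one article do not weigh ten times as much as a single view.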


Split the dataset into training and test sets by time.

In [13]:
from sklearn.model_selection import train_test_split

split_ts = 1475519530
interactions_train_df = interactions_full_df.loc[interactions_full_df.last_timestamp < split_ts].copy()
interactions_test_df = interactions_full_df.loc[interactions_full_df.last_timestamp >= split_ts].copy()

print(f"# interactions on Train set: {len(interactions_train_df)}")
print(f"# interactions on Test set: {len(interactions_test_df)}")

interactions_train_df

# interactions on Train set: 29329
# interactions on Test set: 9777


Unnamed: 0,personId,contentId,eventStrength,last_timestamp
0,-1007001694607905623,-5065077552540450930,1.0,1470395911
2,-1007001694607905623,-793729620925729327,1.0,1472834892
6,-1032019229384696495,-1006791494035379303,1.0,1469129122
7,-1032019229384696495,-1039912738963181810,1.0,1459376415
8,-1032019229384696495,-1081723567492738167,2.0,1464054093
...,...,...,...,...
39099,997469202936578234,9112765177685685246,2.0,1472479493
39100,998688566268269815,-1255189867397298842,1.0,1474567164
39101,998688566268269815,-401664538366009049,1.0,1474567449
39103,998688566268269815,6881796783400625893,1.0,1474567675


To make quality evaluation easier, we store the data in a format where each row corresponds to a user and the columns hold the true labels and the predictions as lists.

In [14]:
interactions = (
    interactions_train_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
    .reset_index()
    .rename(columns={"contentId": "true_train"})
    .set_index("personId")
)

interactions["true_test"] = (
    interactions_test_df
    .groupby("personId")["contentId"].agg(lambda x: list(x))
)

# fill missing values with empty placeholders (empty strings here)
interactions.loc[pd.isnull(interactions.true_test), "true_test"] = [
    "" for x in range(len(interactions.loc[pd.isnull(interactions.true_test), "true_test"]))]

interactions.head(1)

Unnamed: 0_level_0,true_train,true_test
personId,Unnamed: 1_level_1,Unnamed: 2_level_1
-1007001694607905623,"[-5065077552540450930, -793729620925729327]","[-6623581327558800021, 1469580151036142903, 72..."


# The LightFM library

For recommendations you will use the [LightFM](https://making.lyst.com/lightfm/docs/home.html) library, which implements several popular algorithms. As in the seminar, recommendation quality is measured with the *precision@10* metric.
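
To make the metric concrete, here is a minimal sketch of precision@k for a single user, computed on plain lists like the `true_test` column above. The helper name `precision_at_k_lists` and the toy item ids are hypothetical; LightFM's own `precision_at_k` works on model scores and sparse matrices instead, but the per-user ratio has the same meaning.

```python
def precision_at_k_lists(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that the user actually interacted with."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical ranked recommendations and test interactions for one user.
recommended = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
relevant = ["b", "j", "z"]

score = precision_at_k_lists(recommended, relevant, k=10)
print(score)
```

Here two of the ten recommended items are relevant, so the score is 2 / 10. Values below 0.01 therefore mean that, on average, fewer than one in a hundred recommended articles was actually read.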

In [15]:
!pip install lightfm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lightfm
  Downloading lightfm-1.16.tar.gz (310 kB)
Building wheels for collected packages: lightfm
  Building wheel for lightfm (setup.py) ... done
  Created wheel for lightfm: filename=lightfm-1.16-cp37-cp37m-linux_x86_64.whl size=697423 sha256=f5426655c8707c48a1cb0fe4dc6f1f4def26c8ec795524e78fce945da4fc2c56
  Stored in directory: /root/.cache/pip/wheels/f8/56/28/5772a3bd3413d65f03aa452190b00898b680b10028a1021914
Successfully built lightfm
Installing collected packages: lightfm
Successfully installed lightfm-1.16


In [16]:
from lightfm import LightFM
from lightfm.evaluation import precision_at_k

## Task 1 (2 points)

LightFM models work with sparse matrices. Create sparse matrices `data_train` and `data_test` (of size number of users by number of articles) such that the cell at the intersection of a user's row and an article's column holds the strength of their interaction if there was one, and zero otherwise.

In [None]:
interactions_test_df.head()

Unnamed: 0,personId,contentId,eventStrength,last_timestamp
1,-1007001694607905623,-6623581327558800021,1.0,1487240080
3,-1007001694607905623,1469580151036142903,1.0,1487240062
4,-1007001694607905623,7270966256391553686,1.584963,1485994324
5,-1007001694607905623,8729086959762650511,1.0,1487240086
16,-1032019229384696495,-1415040208471067980,2.70044,1482413824


In [17]:
all_pers_ids = set(interactions_train_df['personId'].unique())
all_pers_ids.update(set(interactions_test_df['personId']))

all_cont_ids = set(interactions_train_df['contentId'].unique())
all_cont_ids.update(set(interactions_test_df['contentId']))

print("Users in total", len(all_pers_ids), "Articles in total", len(all_cont_ids))

Users in total 1140 Articles in total 2984


In [18]:
all_pers_ids = list(all_pers_ids)
all_cont_ids = list(all_cont_ids)

In [19]:
from scipy.sparse import csr_matrix

def create_sparse_matrix(interactions_df: pd.DataFrame, users: list, contents: list) -> csr_matrix:
    # dense users x articles table of zeros (float dtype, matching the strengths)
    matrix = pd.DataFrame(0.0, columns=contents, index=users)

    user_ids = interactions_df['personId'].values
    content_ids = interactions_df['contentId'].values
    event_strengths = interactions_df['eventStrength'].values

    # write each interaction strength into its (user, article) cell; missing values become 0
    for i in range(len(interactions_df)):
        matrix.loc[user_ids[i], content_ids[i]] = 0.0 if np.isnan(event_strengths[i]) else event_strengths[i]

    return csr_matrix(matrix.values)
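
The function above goes through a dense table, which is slow and memory-hungry for larger catalogues. A possible alternative, sketched below under the assumption that each (user, article) pair occurs at most once, builds the CSR matrix directly from (row, column, value) triplets. The name `create_sparse_matrix_fast` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def create_sparse_matrix_fast(interactions_df, users, contents):
    # Map each id to its row / column position (same order as the input lists).
    user_pos = {u: i for i, u in enumerate(users)}
    content_pos = {c: j for j, c in enumerate(contents)}

    rows = interactions_df["personId"].map(user_pos).to_numpy()
    cols = interactions_df["contentId"].map(content_pos).to_numpy()
    vals = np.nan_to_num(interactions_df["eventStrength"].to_numpy(dtype=np.float32))

    # Note: repeated (row, col) pairs would be summed by this constructor.
    return csr_matrix((vals, (rows, cols)), shape=(len(users), len(contents)))

# Toy data with the same column names as interactions_train_df.
toy = pd.DataFrame({
    "personId": ["u1", "u1", "u2"],
    "contentId": ["c2", "c3", "c1"],
    "eventStrength": [1.0, 2.0, 1.584963],
})
toy_matrix = create_sparse_matrix_fast(toy, ["u1", "u2"], ["c1", "c2", "c3"])
print(toy_matrix.toarray())
```

Because the positions come from the `users` and `contents` lists, passing the same lists for train and test keeps the rows and columns of both matrices aligned.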

In [20]:
data_train = create_sparse_matrix(interactions_train_df, all_pers_ids, all_cont_ids)

data_test = create_sparse_matrix(interactions_test_df, all_pers_ids, all_cont_ids)

In [21]:
data_train.shape

(1140, 2984)

In [22]:
# removing nan values

np.isnan(data_train.data).any()
data_train.data = np.nan_to_num(data_train.data)
data_train.eliminate_zeros()

In [23]:
data_train.shape

(1140, 2984)

In [24]:
data_test.shape

(1140, 2984)

In [25]:
np.isnan(data_test.data).any()
data_test.data = np.nan_to_num(data_test.data)
data_test.eliminate_zeros()

In [26]:
data_test.shape

(1140, 2984)

## Task 2 (1 point)

Train a LightFM model with `loss="warp"` and compute *precision@10* on the test set.

In [None]:
model = LightFM(loss='warp')
model.fit(data_train, epochs=20)

train_precision = precision_at_k(model, data_train, k=10).mean()
test_precision = precision_at_k(model, data_test, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.22562951
Test precision is: 0.007942974


The quality leaves much to be desired...

## Task 3 (3 points)

When calling `fit`, LightFM accepts a feature description of the items via `item_features`. Let's use this. We will derive the features from the article text as [TF-IDF](https://ru.wikipedia.org/wiki/TF-IDF) (you can use `TfidfVectorizer` from scikit-learn). Create a matrix `feat` of size number of articles by number of features, train LightFM with `loss="warp"`, and compute precision@10 on the test set.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
feat = vectorizer.fit_transform(articles_df['text'])

In [None]:
model = LightFM(loss='warp')
model.fit(data_train, item_features=feat, epochs=20)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.23165469
Test precision is: 0.006211813


On the test set the quality changes only slightly, up or down, depending on the run.
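
One thing worth checking here: `feat` is built from `articles_df['text']`, so its rows follow the order of `articles_df` (3047 articles), while the columns of `data_train` follow `all_cont_ids` (2984 articles). LightFM pairs feature rows with item columns by position, so this mismatch may be one reason the text features do not help. The sketch below shows one way to align them on toy data; all `toy_*` names and values are hypothetical. It also prepends an identity block, on the assumption (from my reading of the LightFM docs) that the default per-item identity features are only used when `item_features` is not supplied.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for articles_df and all_cont_ids (hypothetical values).
toy_articles = pd.DataFrame({
    "contentId": ["c3", "c1", "c9"],  # c9 never appears in the interactions
    "text": ["bitcoin ledger", "google data center", "unused article"],
})
toy_cont_ids = ["c1", "c2", "c3"]  # column order of the interaction matrix; c2 has no text

# Reorder texts to follow the matrix columns; items without text get a placeholder.
texts = (
    toy_articles.drop_duplicates("contentId")
    .set_index("contentId")["text"]
    .reindex(toy_cont_ids)
    .fillna("unknown")
)

tfidf = TfidfVectorizer().fit_transform(texts)

# Optional: prepend an identity block so every item keeps its own embedding.
toy_feat = sp.hstack([sp.identity(len(toy_cont_ids), format="csr"), tfidf]).tocsr()
print(toy_feat.shape)
```

With the real data, the same `reindex(all_cont_ids)` step would give a feature matrix with exactly 2984 rows, one per column of `data_train` and `data_test`.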

## Task 4 (2 points)

In Task 3 we used the raw article text. In this task, first preprocess the text (convert it to lowercase, remove stop words, reduce words to their normal form, etc.), then train the model and evaluate its quality on the test data.

In [28]:
!pip install langdetect

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993242 sha256=c62fbf68978c331e0f88e9c5ea5df0302be1345e53d9611f81eba8568cfac70d
  Stored in directory: /root/.cache/pip/wheels/c5/96/8a/f90c59ed25d75e50a8c10a1b1c2d4c402e4dacfa87f3aff36a
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [29]:
import nltk
import re
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords 
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from langdetect import detect

from nltk.tokenize import word_tokenize

from string import punctuation

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [30]:
articles_df.head(3)

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en


In [31]:
text_edited_df = articles_df.copy()
text_edited_df['text'] = articles_df['text']
text_edited_df = text_edited_df[['contentId','lang','text', 'title']] # add id, language, title data and text itself
text_edited_df.head()

Unnamed: 0,contentId,lang,text,title
1,-4110354420726924665,en,All of this work is still very early. The firs...,"Ethereum, a Virtual Currency, Enables Transact..."
2,-7292285110016212249,en,The alarm clock wakes me at 8:00 with stream o...,Bitcoin Future: When GBPcoin of Branson Wins O...
3,-6151852268067518688,en,We're excited to share the Google Data Center ...,Google Data Center 360° Tour
4,2448026894306402386,en,The Aite Group projects the blockchain market ...,"IBM Wants to ""Evolve the Internet"" With Blockc..."
5,-2826566343807132236,en,One of the largest and oldest organizations fo...,IEEE to Talk Blockchain at Cloud Computing Oxf...


In [32]:
text_edited_df.shape

(3047, 4)

In [33]:
# add missing articles
full_cont = pd.DataFrame(interactions_full_df.contentId.unique(), columns = ['contentId'])
text_edited_df = pd.merge(full_cont, text_edited_df, on = 'contentId', how = 'left' )
text_edited_df.shape

(2984, 4)

In [34]:
text_edited_df['text'] = text_edited_df['text'].fillna('unknown')
text_edited_df['lang'] = text_edited_df['lang'].fillna('no text')
text_edited_df.head()

Unnamed: 0,contentId,lang,text,title
0,-5065077552540450930,pt,A AXA se manteve na liderança do ranking de ma...,Ranking das maiores seguradoras da Europa - 20...
1,-6623581327558800021,en,"About a decade ago, a handful of Google's most...","Spanner, the Google Database That Mastered Tim..."
2,-793729620925729327,en,"Posted by Sam Thorogood , Developer Programs E...",Closure Compiler in JavaScript
3,1469580151036142903,en,This is one of the great discussions among dev...,Don't document your code. Code your documentat...
4,7270966256391553686,en,We are excited to announce the release of .NET...,Announcing .NET Core 1.0


In [35]:
# the number of different languages presented
text_edited_df.lang.value_counts() 

en         2148
pt          822
no text       8
la            2
es            2
ja            2
Name: lang, dtype: int64

In [36]:
# since the majority of texts are in English or Portuguese, we will use stop words from these languages
stop_words_en = stopwords.words('english') + list(punctuation) 
stop_words_pt = stopwords.words('portuguese') + list(punctuation) 

stopwords_all = stopwords.words('portuguese') + stopwords.words('english')

1. PorterStemmer

In [None]:
porter_stemmer = PorterStemmer()

In [None]:
def stemming_tokenizer(str_input):    
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    
    words = [porter_stemmer.stem(word)
             for word in words 
             if word not in stopwords_all]
    return words

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=stemming_tokenizer, stop_words = stopwords_all)
feat = vectorizer.fit_transform(articles_df['text'])



In [None]:
model = LightFM(loss='warp')
model.fit(data_train, item_features=feat, epochs=20)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.23821943
Test precision is: 0.006211813


2. SnowballStemmer

In [None]:
snb_stemmer_eng = nltk.stem.SnowballStemmer('english')
snb_stemmer_pt = nltk.stem.SnowballStemmer('portuguese')

In [None]:
def stemming_tokenizer(str_input):    
    wt = word_tokenize(str_input)  
    if detect(str_input) == 'pt':  
        preprocessed = [snb_stemmer_pt.stem(word) for word in wt if word not in stop_words_pt and word.isalpha()]  
    else:    
        preprocessed = [snb_stemmer_eng.stem(word) for word in wt if word not in stop_words_en and word.isalpha()]
    return preprocessed

In [None]:
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=stemming_tokenizer, stop_words = stopwords_all)
feat = vectorizer.fit_transform(articles_df['text'])



In [None]:
model = LightFM(loss='warp')
model.fit(data_train, item_features=feat, epochs=20)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.23929857
Test precision is: 0.005702648


3. WordNetLemmatizer

In [37]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [38]:
lmtzr = WordNetLemmatizer()

In [39]:
def lemming_tokenizer(str_input):    
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    
    words = [lmtzr.lemmatize(word)
             for word in words 
             if word not in stopwords_all]
    return words

In [40]:
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=lemming_tokenizer, stop_words = stopwords_all)
feat = vectorizer.fit_transform(articles_df['text'])



In [None]:
model = LightFM(loss='warp')
model.fit(data_train, item_features=feat, epochs=20)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.23363309
Test precision is: 0.007433809


Did the prediction quality improve?

The prediction quality either did not improve or improved only marginally.

## Task 5 (2 points)

Tune the hyperparameters of the LightFM model (`no_components` and others) to improve model quality.

In [None]:
# base model for the grid search

model = LightFM(learning_rate=0.05, loss='warp', no_components=50)
model.fit(data_train, item_features=feat, epochs=20)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

In [42]:
!pip install bayesian-optimization

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bayesian-optimization
  Downloading bayesian-optimization-1.2.0.tar.gz (14 kB)
Building wheels for collected packages: bayesian-optimization
  Building wheel for bayesian-optimization (setup.py) ... done
  Created wheel for bayesian-optimization: filename=bayesian_optimization-1.2.0-py3-none-any.whl size=11685 sha256=26dd6b580b047362b6442d2e271aba5d51242617de8320e39d0e0c50dac5739e
  Stored in directory: /root/.cache/pip/wheels/fd/9b/71/f127d694e02eb40bcf18c7ae9613b88a6be4470f57a8528c5b
Successfully built bayesian-optimization
Installing collected packages: bayesian-optimization
Successfully installed bayesian-optimization-1.2.0


In [43]:
from bayes_opt import BayesianOptimization

In [50]:
parameters = {
    'components_num': (20, 80),
#     "learning_schedule": ["adagrad", "adadelta"],  
     "learning_rate": (0.001, 0.05),
     "item_alpha": (0.00000001, 0.000001),
     "user_alpha": (0.00000001, 0.000001),
#     "max_sampled": (5, 15),
    "epoch_num": (5, 50),
}


def BO_func(components_num, learning_rate, item_alpha, user_alpha, epoch_num):
    epoch_num = int(epoch_num)
    components_num = int(components_num)
    
    model = LightFM(learning_rate=learning_rate, loss='warp', no_components=int(components_num), user_alpha=user_alpha, item_alpha=item_alpha)
    model.fit(data_train, item_features=feat, epochs=int(epoch_num))

    train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
    test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()
    
    return test_precision

In [51]:
optimizer = BayesianOptimization(
  f = BO_func,
  pbounds = parameters,
  verbose = 5,
  random_state = 5, 
 )

optimizer.maximize(
  init_points = 4,
  n_iter = 3, 
 )

|   iter    |  target   | compon... | epoch_num | item_a... | learni... | user_a... |
-------------------------------------------------------------------------------------
| 1         | 0.006619  | 33.32     | 44.18     | 2.147e-0  | 0.04601   | 4.935e-0  |
| 2         | 0.00835   | 56.7      | 39.47     | 5.232e-0  | 0.01554   | 1.958e-0  |
| 3         | 0.006415  | 24.84     | 38.23     | 4.469e-0  | 0.008757  | 8.811e-0  |
| 4         | 0.00723   | 36.45     | 23.64     | 3.031e-0  | 0.03181   | 5.84e-07  |
| 5         | 0.006619  | 57.15     | 38.59     | 3.373e-0  | 0.04314   | 2.564e-0  |
| 6         | 0.006517  | 56.62     | 39.45     | 4.538e-0  | 0.04

In [57]:
print(optimizer.max)

{'target': 0.008350306190550327, 'params': {'components_num': 56.70463177415874, 'epoch_num': 39.4658535416142, 'item_alpha': 5.232338079942139e-07, 'learning_rate': 0.015543224577234876, 'user_alpha': 1.9584401637463912e-07}}


In [66]:
hyperp_dict = optimizer.max['params']

In [67]:
lr, user_alpha, item_alpha = hyperp_dict['learning_rate'], hyperp_dict['user_alpha'], hyperp_dict['item_alpha']

In [68]:
model = LightFM(learning_rate=lr, loss='warp', no_components=56, user_alpha = user_alpha, item_alpha = item_alpha)
model.fit(data_train, item_features=feat, epochs=39)

train_precision = precision_at_k(model, data_train, item_features=feat, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.34631297
Test precision is: 0.007841141


We can see that the quality improved, but only marginally, even with hyperparameter tuning: test precision@10 is about 0.0078 after tuning versus 0.0074 for the default model on the same lemmatized TF-IDF features. Presumably a longer and more thorough search could reach 0.01-0.02, but that would still not be good quality.
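
Note that `BO_func` returns `test_precision`, so the test set itself drives the search and the reported test score may be optimistic. One possible remedy, sketched below on toy data, is to hold out the most recent part of the training period as a validation set and let the optimizer maximize precision on that instead. The 20% share and all `toy_*` values are hypothetical choices, not something the assignment prescribes.

```python
import pandas as pd

# Toy stand-in for interactions_train_df (hypothetical values).
toy_train = pd.DataFrame({
    "personId": ["u1", "u1", "u2", "u2", "u3"],
    "contentId": ["c1", "c2", "c1", "c3", "c2"],
    "eventStrength": [1.0, 2.0, 1.0, 1.58, 1.0],
    "last_timestamp": [100, 200, 300, 400, 500],
})

# Hypothetical choice: hold out the most recent 20% of the training period.
valid_ts = toy_train["last_timestamp"].quantile(0.8)
fit_df = toy_train[toy_train["last_timestamp"] < valid_ts]
valid_df = toy_train[toy_train["last_timestamp"] >= valid_ts]

print(len(fit_df), len(valid_df))
```

The two frames can then be turned into sparse matrices in the same way as `data_train` and `data_test`, and the final model is evaluated on the untouched test set only once.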

## Bonus task (3 points)

Above we used a fairly simple representation of the article text, TF-IDF. In this task you need to represent the article text (optionally together with the title) as an embedding obtained from a recurrent network or a transformer (you may use any pretrained model you like). Train the model using these embeddings and compare the results with the previous ones.

In [1]:
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.0.tar.gz (79 kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<

In [70]:
from sentence_transformers import SentenceTransformer  # Sentence-Transformers: transformer models for text embeddings, https://www.sbert.net/

In [71]:
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')  # pretrained model picked from https://www.sbert.net/docs/pretrained_models.html (as far as I can tell it is trained on English NLI/STS-B data, not multilingual)
embeddings = model.encode(list(articles_df['text']))
feat_trf = csr_matrix(embeddings)

In [72]:
len(embeddings[0]) # embedding dimensionality

768

In [74]:
model = LightFM(learning_rate=0.05, loss='warp', no_components=32)
model.fit(data_train, item_features=feat_trf, epochs=46)

train_precision = precision_at_k(model, data_train, item_features=feat_trf, k=10).mean()
test_precision = precision_at_k(model, data_test, item_features=feat_trf, k=10, train_interactions=data_train).mean()

print("Train precision is:", train_precision)
print("Test precision is:", test_precision)

Train precision is: 0.09334533
Test precision is: 0.0023421592


Compared with the results above, the model built on the new embeddings performs worse even on the training set (train precision@10 of about 0.09 versus 0.23 or higher with TF-IDF). In principle this should not happen, since a pretrained model produces more general-purpose embeddings learned from a much larger body of text.