# Recommendation system

Download the dataset [Health_and_Personal_Care.jsonl.gz](https://drive.google.com/file/d/12N52kB4D1iqgzSuoWEfNSY3KqVRp10wL/view?usp=drive_link) and put it in the `data` directory (the path that the `DATA_DIR` environment variable points to).

In [29]:
%load_ext autoreload
%autoreload 2

import os

print(os.environ['DATA_DIR'])

root_data_dir = os.environ['DATA_DIR']
print(os.listdir(root_data_dir))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/Users/username/PycharmProjects/ml_for_products/data
['model_dockerized.cb', 'zinc_data', 'model.cb', 'Health_and_Personal_Care.jsonl.gz', 'mlflow', 'minio', 'bidmachine_task_data', 'bidmachine_logs.zip', 'downloaded_model.cb', 'item_cards.gzip', 'meta_Health_and_Personal_Care.jsonl.gz']


In [30]:
from utils import read_raw_data

file_name = 'Health_and_Personal_Care.jsonl.gz'
data_path = os.path.join(root_data_dir, file_name)

json_data = read_raw_data(data_path, limit=1000)
print(len(json_data))

Dataset num items: 1000 from /Users/username/PycharmProjects/ml_for_products/data/Health_and_Personal_Care.jsonl.gz
1000
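
`read_raw_data` comes from the project's `utils` module, which is not shown in this notebook. As a rough sketch of what such a reader might do, assuming the file is gzip-compressed JSON Lines and that `limit` caps the number of records returned (the name `read_jsonl_gz` is hypothetical, not the project's function):

```python
import gzip
import json
from itertools import islice


def read_jsonl_gz(path, limit=None):
    """Read up to `limit` JSON records from a gzip-compressed JSON Lines file.

    Hypothetical stand-in for the project's `read_raw_data`; the real helper may differ.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as fh:
        # Skip blank lines so they do not count toward the limit.
        non_empty = (line for line in fh if line.strip())
        # islice(..., None) reads everything; islice(..., n) stops after n records.
        return [json.loads(line) for line in islice(non_empty, limit)]
```

Reading lazily with `islice` means a small `limit` never decompresses the whole file, which matters for the larger metadata dump used later.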


In [31]:
json_data[0]

{'rating': 4.0,
 'title': '12 mg is 12 on the periodic table people! Mg for magnesium',
 'text': 'This review is more to clarify someone else’s review bc they didn’t understand understand the labeling!  It shows 1000mg as advertised & another little label says 12mg bc 12 is on the periodic table for magnesium!  I realize not everyone takes chemistry, but 4 ppl liked his review & so misinformation is spreading.  This works. If however you are on opiate level medications that are causing constipation you should talk to your pain dr or your gastrointestinal dr & ask for a medication called Linzess which works must better & must faster, but is unnecessary for most people.  If magnesium is working for you just make sure to take it with food & drink 6-8 glasses of water per day.  Staying hydrated will really help.  Before switching to Linzess I used to take one 1,000 mg pill am & pm every day with meals & always with an 8 ounce glass of water or other liquid.',
 'images': [],
 'asin': 'B07TD

In [32]:
import pandas as pd

user_item_data_df = pd.DataFrame(
    [('health', i['user_id'], i['parent_asin'], i['rating']) for i in json_data],
    columns=['category', 'CustomerID', 'ProductID', 'target'],
)

user_item_data_df.head()

|   | category | CustomerID | ProductID | target |
|---|---|---|---|---|
| 0 | health | AFKZENTNBQ7A7V7UXW5JJI6UGRYQ | B07TDSJZMR | 4.0 |
| 1 | health | AEVWAM3YWN5URJVJIZZ6XPD2MKIA | B08637FWWF | 5.0 |
| 2 | health | AHSPLDNW5OOUK2PLH7GXLACFBZNQ | B07KJVGNN5 | 5.0 |
| 3 | health | AEZGPLOYTSAPR3DHZKKXEFPAXUAA | B092RP73CX | 4.0 |
| 4 | health | AEQAYV7RXZEBXMQIQPL6KCT2CFWQ | B08KYJLF5T | 1.0 |


# Recommender baseline

In [33]:
user_item_data_df.groupby('ProductID').agg(num_entries=('CustomerID', 'count')).sort_values(by='num_entries', ascending=False).head()

| ProductID | num_entries |
|---|---|
| B07XVVVB8W | 22 |
| B08X9LB1WC | 7 |
| B0BSRPX53Z | 7 |
| B0B4328BFW | 5 |
| B08KBQNDJC | 4 |
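
Review counts like these are the basis of a popularity baseline: every user gets the same list of most-reviewed products. A minimal sketch of that idea in plain Python (the helper name `top_n_popular` is hypothetical and not part of the project's `recsys` package):

```python
from collections import Counter


def top_n_popular(product_ids, n=5):
    """Return the n most frequently reviewed product IDs, most popular first."""
    counts = Counter(product_ids)
    return [product_id for product_id, _ in counts.most_common(n)]
```

With the DataFrame above, `top_n_popular(user_item_data_df['ProductID'].tolist())` would put `B07XVVVB8W` first, since it has the most reviews (22). A personalized model is only worth its complexity if it beats this kind of list.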


In [34]:
from recsys.utils import prepare_evaluation_df

evaluation_df = prepare_evaluation_df(user_item_data_df).to_pandas()

evaluation_df.head()

Transformation started...
Negative candidates: 380264, Positive samples: 1000
Num negatives 0.6702412868632708


|   | category | CustomerID | ProductID | target |
|---|---|---|---|---|
| 0 | health | AE25NQAZI3725GZIL5FS52ZIKWKQ | B007QESMDK | 1 |
| 1 | health | AE25NQAZI3725GZIL5FS52ZIKWKQ | B00FZIUHL4 | 0 |
| 2 | health | AE25NQAZI3725GZIL5FS52ZIKWKQ | B00FB3DHTC | 0 |
| 3 | health | AE25NQAZI3725GZIL5FS52ZIKWKQ | B06XWW6QZG | 0 |
| 4 | health | AE25NQAZI3725GZIL5FS52ZIKWKQ | B00F5DFFYS | 0 |
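
The source of `prepare_evaluation_df` is not shown. Judging by its log output and the 0/1 `target` column, it appears to pair each user with products they did not review and label those pairs 0. A rough sketch of that idea, assuming uniform sampling of a fixed number of negatives per user (the function name, the `negatives_per_user` parameter, and the sampling scheme are all assumptions, not the project's implementation):

```python
import random


def sample_negatives(interactions, negatives_per_user=4, seed=0):
    """Label observed (user, item) pairs 1 and add sampled unseen pairs labelled 0.

    `interactions` is an iterable of (user_id, item_id) pairs.
    Returns a list of (user_id, item_id, target) tuples.
    """
    rng = random.Random(seed)

    # Collect the set of items each user has interacted with.
    seen = {}
    for user_id, item_id in interactions:
        seen.setdefault(user_id, set()).add(item_id)
    all_items = sorted({item for items in seen.values() for item in items})

    rows = []
    for user_id, positives in seen.items():
        rows.extend((user_id, item_id, 1) for item_id in sorted(positives))
        # Negatives are drawn only from items this user has never reviewed.
        candidates = [item for item in all_items if item not in positives]
        k = min(negatives_per_user, len(candidates))
        rows.extend((user_id, item_id, 0) for item_id in rng.sample(candidates, k))
    return rows
```

Sampled negatives are only presumed negatives: an unreviewed product may still be one the user would like, so metrics computed on such labels are approximate.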


# Features

In [35]:
file_name = 'meta_Health_and_Personal_Care.jsonl.gz'
data_path = os.path.join(root_data_dir, file_name)

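# Build the ID set once; set membership is O(1) versus scanning the array per record.
product_ids = set(user_item_data_df['ProductID'])
json_meta_data = [i for i in read_raw_data(data_path) if i['parent_asin'] in product_ids]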
print(len(json_meta_data))

Dataset num items: 60293 from /Users/username/PycharmProjects/ml_for_products/data/meta_Health_and_Personal_Care.jsonl.gz
851


In [36]:
catalog_df = pd.json_normalize(json_meta_data)

catalog_df.head(3)

Unnamed: 0,main_category,title,average_rating,rating_number,features,description,price,images,videos,store,...,details.Coating Description,details.Photo Filter Effect Type,details.Filter Type,details.Is Foldable,details.Target Species,details.Cartoon Character,details.Filter Class,details.Test type,details.Allergen Information,details.Mounting Type
0,Health & Personal Care,"GoodSense Premium Saline, Nasal Moisturizing S...",4.1,29,"[INGREDIENTS: Compare to Ocean ingredients., U...",[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],Good Sense,...,,,,,,,,,,
1,Health & Personal Care,Premium Dry brush for take a bath and Lymphati...,2.7,7,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,[],Smiley smith,...,,,,,,,,,,
2,Health & Personal Care,Cord Locks Silicone Toggles for Drawstrings El...,4.2,632,[],[],,[{'thumb': 'https://m.media-amazon.com/images/...,"[{'title': 'Cord Locks Silicone Toggles', 'url...",Abodhu,...,,,,,,,,,,


In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',
    ngram_range=(1, 2),
    max_features=100,  # Limit number of features
    max_df=0.8,        # Ignore terms that appear in more than 80% of documents
    min_df=1           # Keep terms that appear in at least 1 document (no lower cutoff)
)

tfidf_matrix = vectorizer.fit_transform([i['title'] for i in json_meta_data]).toarray()
feature_names = vectorizer.get_feature_names_out()
print(tfidf_matrix.shape)

(851, 100)


In [38]:
feature_store = {j['parent_asin']: tfidf_matrix[i,:] for i, j in enumerate(json_meta_data)}
print(len(feature_store))

851


In [43]:
from recsys.model import get_model, get_data
from IPython.display import clear_output

model = get_model()
data_pool, target = get_data(evaluation_df, feature_store)
model.fit(data_pool)
clear_output()
print('model training finished')
print(model)

model training finished
<catboost.core.CatBoostClassifier object at 0x14e530950>


In [44]:
from sklearn.metrics import f1_score, accuracy_score, recall_score, roc_auc_score

proba = model.predict_proba(data_pool)
print('roc_auc: %.4f' % roc_auc_score(target, proba[:,1]))

roc_auc: 0.6151
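
The AUC above is computed on the same `data_pool` that `model.fit` was trained on, so it reflects fit to the training data rather than generalization. One common alternative is to hold out a share of users before training and score only on them. A minimal sketch of such a split, assuming rows are grouped by `CustomerID` (the helper `split_users` and its parameters are hypothetical):

```python
import random


def split_users(user_ids, test_fraction=0.2, seed=0):
    """Split unique user IDs into disjoint train and test sets."""
    # Sort first so the shuffle is reproducible for a given seed.
    unique_users = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_users)
    # Always hold out at least one user.
    n_test = max(1, int(len(unique_users) * test_fraction))
    test_users = set(unique_users[:n_test])
    train_users = set(unique_users[n_test:])
    return train_users, test_users
```

The resulting sets could then filter the evaluation rows, for example `evaluation_df[evaluation_df['CustomerID'].isin(train_users)]`, before building separate train and test pools. Splitting by user rather than by row keeps all of one user's positives and negatives on the same side of the split.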
