The purpose of this test was to train and evaluate a BPR and an ItemKNN model with the RecBole library on a custom dataset derived from medical reviews, and to compare the results with those obtained on the MovieLens dataset. The process involved downloading, extracting, and inspecting the MovieLens dataset to understand its structure. The medical dataset was then prepared by merging its train and test files, assigning integer identifiers, and converting dates to Unix timestamps, and the interaction file was formatted to match what RecBole expects. Although the medical dataset was structured like MovieLens, neither the BPR model nor the ItemKNN model produced meaningful results on it. Training and evaluation worked without problems on the MovieLens data but not on the custom medical dataset, which suggests that the data preprocessing or the model configuration needs further investigation.

1) In this step I inspect the interaction file of the ML-1M dataset as used by RecBole. This file is part of the RecBole framework, but I downloaded it manually from another source to have a basis for comparison.

In [2]:
import pandas as pd
# Load and inspect the interaction file
interaction_df = pd.read_csv('ml-1m.inter', sep='\t')
print("ML1M Interaction File:")
print(interaction_df.head())

ML1M Interaction File:
   user_id:token  item_id:token  rating:float  timestamp:float
0              1           1193             5        978300760
1              1            661             3        978302109
2              1            914             3        978301968
3              1           3408             4        978300275
4              1           2355             5        978824291
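
The header above follows RecBole's atomic-file convention, in which every column is named `field:type` and the type is one of `token`, `token_seq`, `float`, or `float_seq`. A minimal sketch of a header check follows; the helper name `check_inter_header` and the inline sample are illustrative and not part of RecBole.

```python
import io
import pandas as pd

# Feature types used in RecBole atomic files.
VALID_TYPES = {"token", "token_seq", "float", "float_seq"}

def check_inter_header(columns):
    """Return a list of problems found in 'name:type' column headers."""
    problems = []
    for col in columns:
        parts = col.split(":")
        if len(parts) != 2:
            problems.append(f"{col!r}: expected exactly one ':' between name and type")
        elif parts[1] not in VALID_TYPES:
            problems.append(f"{col!r}: unknown type {parts[1]!r}")
    return problems

# Small inline sample with the same header as the file inspected above.
sample = (
    "user_id:token\titem_id:token\trating:float\ttimestamp:float\n"
    "1\t1193\t5\t978300760\n"
)
header = pd.read_csv(io.StringIO(sample), sep="\t", nrows=0).columns
print(check_inter_header(header))
```

Running the same check on the columns of any other `.inter` file gives a quick indication of whether its header follows the convention.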


2) In the second step, I load the medical dataset, which is split into two parts (training and testing), and do an initial data exploration to get an impression of its structure.

In [4]:
# Load the datasets
train_df = pd.read_csv('drugsComTrain_raw.csv')
test_df = pd.read_csv('drugsComTest_raw.csv')

# Print the first few rows of the train dataset
print("Medical Data Train Dataset:")
print(train_df.head())

# Print the first few rows of the test dataset
print("\nMedical Data Test Dataset:")
print(test_df.head())

Medical Data Train Dataset:
   uniqueID                  drugName                     condition  \
0    206461                 Valsartan  Left Ventricular Dysfunction   
1     95260                Guanfacine                          ADHD   
2     92703                    Lybrel                 Birth Control   
3    138000                Ortho Evra                 Birth Control   
4     35696  Buprenorphine / naloxone             Opiate Dependence   

                                              review  rating       date  \
0  "It has no side effect, I take it in combinati...       9  20-May-12   
1  "My son is halfway through his fourth week of ...       8  27-Apr-10   
2  "I used to take another oral contraceptive, wh...       5  14-Dec-09   
3  "This is my first time using any form of birth...       8   3-Nov-15   
4  "Suboxone has completely turned my life around...       9  27-Nov-16   

   usefulCount  
0           27  
1          192  
2           17  
3           10  
4        

3) Then I merge the training and test datasets and prepare the data so that the column names and format are identical to those of the ML-1M interaction file. I then save the adjusted data as 'custom.inter' so that an interaction file (*.inter) is available in the required format.

In [5]:
from sklearn.preprocessing import LabelEncoder

# Combine train and test datasets
df = pd.concat([train_df, test_df])

# Assign unique identifiers
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

df['user_id'] = user_encoder.fit_transform(df['uniqueID'])
df['item_id'] = item_encoder.fit_transform(df['drugName'])

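# Convert date (e.g. '20-May-12') to Unix time in seconds;
# 'int64' is given explicitly so the result does not depend on the platform's default int
df['timestamp'] = pd.to_datetime(df['date'], format='%d-%b-%y').astype('int64') / 10**9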

# Create the interaction file
interaction_df = df[['user_id', 'item_id', 'rating', 'timestamp']]
interaction_df.columns = ['user_id:token', 'item_id:token', 'rating:float', 'timestamp:float']
interaction_df.to_csv('custom.inter', sep='\t', index=False)

# Verify the interaction file
print("Merged and adjusted Medical Data Sample:")
print(interaction_df.head())



Merged and adjusted Medical Data Sample:
   user_id:token  item_id:token  rating:float  timestamp:float
0         191228           3425             9     1.337472e+09
1          88331           1539             8     1.272326e+09
2          85964           1986             5     1.260749e+09
3         127958           2450             8     1.446509e+09
4          33057            554             9     1.480205e+09
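
Because `user_id` is derived from `uniqueID`, it is worth checking how many interactions each user actually has. If `uniqueID` identifies a single review rather than a reviewer (an assumption the data above does not confirm), every user would have exactly one interaction, and collaborative models such as BPR and ItemKNN would have very little to learn from. The sketch below uses a toy frame; `interaction_summary` is a hypothetical helper.

```python
import pandas as pd

def interaction_summary(inter, user_col="user_id:token", item_col="item_id:token"):
    """Summarise how interactions are distributed over users."""
    per_user = inter.groupby(user_col).size()
    return {
        "n_users": int(per_user.shape[0]),
        "n_items": int(inter[item_col].nunique()),
        "n_interactions": int(len(inter)),
        "median_per_user": float(per_user.median()),
        # Fraction of users that appear in exactly one row.
        "share_single_interaction_users": float((per_user == 1).mean()),
    }

# Toy data in which every row belongs to a different user.
toy = pd.DataFrame({
    "user_id:token": [0, 1, 2, 3],
    "item_id:token": [10, 10, 11, 12],
    "rating:float": [9, 8, 5, 8],
})
summary = interaction_summary(toy)
print(summary)
```

Calling `interaction_summary(interaction_df)` on the merged medical data would show whether the share of single-interaction users is close to 1.0.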


4) Then I load both interaction files (*.inter) and print the first rows and the dataset information again. As far as I can tell, the two files have the same structure.

In [6]:
# Load the interaction files
ml1m_inter = pd.read_csv('ml-1m.inter', sep='\t')
custom_inter = pd.read_csv('custom.inter', sep='\t')

# Print the first few rows of the interaction files
print("ML1M Interaction File:")
print(ml1m_inter.head())

print("Custom Interaction File:")
print(custom_inter.head())

# Check for missing values and data types
print("\nML1M Interaction File Info:")
print(ml1m_inter.info())

print("\nCustom Interaction File Info:")
print(custom_inter.info())

ML1M Interaction File:
   user_id:token  item_id:token  rating:float  timestamp:float
0              1           1193             5        978300760
1              1            661             3        978302109
2              1            914             3        978301968
3              1           3408             4        978300275
4              1           2355             5        978824291
Custom Interaction File:
   user_id:token  item_id:token  rating:float  timestamp:float
0         191228           3425             9     1.337472e+09
1          88331           1539             8     1.272326e+09
2          85964           1986             5     1.260749e+09
3         127958           2450             8     1.446509e+09
4          33057            554             9     1.480205e+09

ML1M Interaction File Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
 #   Column           Non-Null Count    Dtype
---  ----
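
The side-by-side comparison of the two files can also be condensed into a single table of column names and dtypes. This is a sketch on toy frames that mimic the two heads printed above; `compare_schemas` is a hypothetical helper, and the values are copied from the printed samples only for illustration.

```python
import pandas as pd

def compare_schemas(left, right):
    """List every column of either frame with the dtype it has in each."""
    columns = list(dict.fromkeys(list(left.columns) + list(right.columns)))
    rows = []
    for col in columns:
        rows.append({
            "column": col,
            "left_dtype": str(left[col].dtype) if col in left.columns else "missing",
            "right_dtype": str(right[col].dtype) if col in right.columns else "missing",
        })
    table = pd.DataFrame(rows).set_index("column")
    table["same"] = table["left_dtype"] == table["right_dtype"]
    return table

# Integer timestamps on one side, float timestamps on the other.
ml_like = pd.DataFrame({
    "user_id:token": [1, 1],
    "timestamp:float": [978300760, 978302109],
})
custom_like = pd.DataFrame({
    "user_id:token": [191228, 88331],
    "timestamp:float": [1.337472e9, 1.272326e9],
})
schema = compare_schemas(ml_like, custom_like)
print(schema)
```

With the real data, `compare_schemas(ml1m_inter, custom_inter)` would flag any column that is missing or parsed with a different dtype.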

In [9]:
from recbole.quick_start import run_recbole

# Configuration for running BPR Model
config_dict = {
    'model': 'BPR',
    'dataset': 'custom',
    'train_batch_size': 4096,
    'eval_batch_size': 4096,
    'epochs': 10,
    'show_progress': True,
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'valid_metric': 'MRR@10',
}

# Run the BPR model on the custom (medical) dataset.
# Note: config_dict is not passed in the active call, so RecBole's defaults apply.
print("Training BPR with Custom (Medical) Dataset:")
#result = run_recbole(model='BPR', dataset='custom', config_dict=config_dict)
result = run_recbole(model='BPR', dataset='custom')
print(result)

Training BPR with Custom (Medical) Dataset:


20 Aug 17:34    INFO  ['/home/stef/.local/lib/python3.10/site-packages/ipykernel_launcher.py', '-f', '/home/stef/.local/share/jupyter/runtime/kernel-57aa8a76-cc89-4f61-b316-dca3de1caa64.json']
20 Aug 17:34    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = dataset/custom
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 300
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeatable = False
m

ValueError: not enough values to unpack (expected 2, got 1)
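
The log shows `data_path = dataset/custom`, so RecBole looks for the file at `dataset/custom/custom.inter`, whereas the earlier cell saved `custom.inter` to the working directory. The error message is also consistent with a column header that lacks its `:type` suffix, although the truncated traceback does not confirm this. The sketch below writes a toy interaction file into the expected layout and reads back the raw header line for inspection. The helper `write_recbole_dataset`, the dataset name `custom_demo`, and the temporary root directory are assumptions made for illustration, and the RecBole call itself is left commented out.

```python
import tempfile
from pathlib import Path

import pandas as pd

def write_recbole_dataset(inter, name, root):
    """Write <root>/<name>/<name>.inter as a tab-separated file and return its path."""
    target_dir = Path(root) / name
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{name}.inter"
    inter.to_csv(path, sep="\t", index=False)
    return path

toy = pd.DataFrame({
    "user_id:token": [0, 0, 1],
    "item_id:token": [10, 11, 10],
    "rating:float": [9, 8, 5],
    "timestamp:float": [1337472000, 1272326400, 1260748800],
})

root = tempfile.mkdtemp()  # stand-in for the 'dataset' directory
path = write_recbole_dataset(toy, "custom_demo", root)

# Read the raw header line to see exactly what a loader would see.
header = path.read_text().splitlines()[0].split("\t")
print(header)

# Settings that point RecBole at the file and name the fields explicitly.
config_dict = {
    "data_path": root,
    "USER_ID_FIELD": "user_id",
    "ITEM_ID_FIELD": "item_id",
    "RATING_FIELD": "rating",
    "TIME_FIELD": "timestamp",
    "load_col": {"inter": ["user_id", "item_id", "rating", "timestamp"]},
    "epochs": 10,
}
# from recbole.quick_start import run_recbole
# result = run_recbole(model="BPR", dataset="custom_demo", config_dict=config_dict)
```

If the header of the real file under `dataset/custom/` prints anything other than four `name:type` entries, that file is likely not the one produced in step 3.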