### Summary of Testing the Drug Review Dataset (Kaggle)

The objective of this test was to preprocess a drug review dataset and use it to train a recommendation model with RecBole. The train and test CSV files were loaded and combined into a single DataFrame. Integer identifiers for users and items were assigned with `LabelEncoder`, and the date strings were converted to Unix timestamps. Interaction, user, and item files were then written out, and the first rows of each table were printed as a sanity check.

The RecBole configuration was defined in code, specifying the fields to load, the batch sizes, the number of epochs, and the evaluation metrics. The BPR (Bayesian Personalized Ranking) model was then initialized and trained on the prepared dataset, with a progress bar displayed for each epoch.
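The configuration used in this notebook leaves the evaluation protocol to RecBole's defaults. The sketch below spells those settings out so that the split and the ranking cut-off are visible in the notebook. The values are assumed to mirror RecBole's documented defaults for `eval_args` and `topk` and should be checked against the installed version; `eval_settings` and `with_eval_settings` are hypothetical names introduced here for illustration.

```python
# Hypothetical explicit evaluation settings, meant to be merged into the
# configuration dictionary. Values are assumed to match RecBole's defaults.
eval_settings = {
    'eval_args': {
        'split': {'RS': [0.8, 0.1, 0.1]},  # ratio split: train / valid / test
        'group_by': 'user',                # split each user's interactions separately
        'order': 'RO',                     # random ordering before splitting
        'mode': 'full',                    # rank each positive against all items
    },
    'topk': [10],                          # cut-off behind Recall@10, MRR@10, ...
}


def with_eval_settings(base_config: dict) -> dict:
    """Return a copy of base_config with the explicit evaluation settings added."""
    merged = dict(base_config)
    merged.update(eval_settings)
    return merged
```

With `group_by` set to `user`, the ratio split is applied to each user's interactions separately, so a user needs several interactions before any of them can land in the validation or test portion.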

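Although training completed, the returned results show a best validation score of `-inf` and `None` for both the validation and test results, meaning that no evaluation metrics were ever computed. A likely cause lies in data preparation: `user_id` is encoded from `uniqueID`, which appears to identify individual reviews rather than reviewers, so each "user" would have exactly one interaction and a per-user split would leave nothing for validation or testing. Incorrect file formatting, unsuitable configuration settings, or an issue within RecBole itself cannot yet be ruled out.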

Further investigation is needed to verify the data preprocessing steps, review the configuration settings, and check for potential bugs in RecBole. Additional diagnostics or tests may help pinpoint the cause and refine the approach.
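One diagnostic that may be worth running first is a count of interactions per user, since evaluation splits that are grouped by user need more than one interaction per user to produce any validation or test data. The sketch below uses toy data; `summarize_user_activity` and the `min_needed` threshold of 3 (one interaction each for train, validation, and test) are assumptions made for illustration.

```python
import pandas as pd


def interactions_per_user(inter: pd.DataFrame, user_col: str = 'user_id') -> pd.Series:
    """Count how many interaction rows each user has."""
    return inter.groupby(user_col).size()


def summarize_user_activity(inter: pd.DataFrame,
                            user_col: str = 'user_id',
                            min_needed: int = 3) -> dict:
    """Summarize whether users have enough interactions to be split."""
    counts = interactions_per_user(inter, user_col)
    return {
        'n_users': int(counts.size),
        'n_interactions': int(counts.sum()),
        'max_per_user': int(counts.max()),
        'users_with_enough': int((counts >= min_needed).sum()),
    }


# Toy data in which every user appears exactly once.
toy = pd.DataFrame({
    'user_id': [0, 1, 2, 3],
    'item_id': [10, 11, 10, 12],
    'rating': [9, 8, 5, 8],
})
summary = summarize_user_activity(toy)
print(summary)
```

Calling `summarize_user_activity(interaction_df)` on the real interaction table would show whether `n_users` equals `n_interactions`, which is the pattern to look for here.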

In [7]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the datasets
train_df = pd.read_csv('drugsComTrain_raw.csv')
test_df = pd.read_csv('drugsComTest_raw.csv')

# Combine train and test datasets (ignore_index avoids duplicate row labels)
df = pd.concat([train_df, test_df], ignore_index=True)

# Assign unique identifiers
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

df['user_id'] = user_encoder.fit_transform(df['uniqueID'])
df['item_id'] = item_encoder.fit_transform(df['drugName'])

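# Convert date strings to Unix timestamps in seconds
# (int64 nanoseconds since the epoch, divided by 10**9; plain `int` can be 32-bit on some platforms)
df['timestamp'] = pd.to_datetime(df['date'], format='%d-%b-%y').astype('int64') / 10**9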

# Create the interaction file
interaction_df = df[['user_id', 'item_id', 'rating', 'timestamp']]
interaction_df.to_csv('custom.inter', sep='\t', index=False)

# Create the user file
user_df = df[['user_id']].drop_duplicates()
user_df.to_csv('custom.user', sep='\t', index=False)

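# Create the item file, keeping one row per item
# (a drug reviewed for several conditions keeps the first condition seen)
item_df = df[['item_id', 'drugName', 'condition']].drop_duplicates(subset='item_id')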
item_df.to_csv('custom.item', sep='\t', index=False)

# Preview the first rows of each table
print("Interaction File Sample:")
print(interaction_df.head())

print("\nUser File Sample:")
print(user_df.head())

print("\nItem File Sample:")
print(item_df.head())



Interaction File Sample:
   user_id  item_id  rating     timestamp
0   191228     3425       9  1.337472e+09
1    88331     1539       8  1.272326e+09
2    85964     1986       5  1.260749e+09
3   127958     2450       8  1.446509e+09
4    33057      554       9  1.480205e+09

User File Sample:
   user_id
0   191228
1    88331
2    85964
3   127958
4    33057

Item File Sample:
   item_id                  drugName                     condition
0     3425                 Valsartan  Left Ventricular Dysfunction
1     1539                Guanfacine                          ADHD
2     1986                    Lybrel                 Birth Control
3     2450                Ortho Evra                 Birth Control
4      554  Buprenorphine / naloxone             Opiate Dependence
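RecBole's documentation describes its input as "atomic files" whose header cells carry a type suffix (for example `user_id:token`) and which are looked up under `dataset/<dataset name>/` by default. The cell above writes plain headers into the working directory, so a conversion step along the following lines may be needed; whether the original run used one is not shown. The helper name `write_atomic_file` and the type mapping `INTER_TYPES` are assumptions made for this sketch.

```python
import os

import pandas as pd

# Assumed field types, following RecBole's `name:type` header convention.
INTER_TYPES = {
    'user_id': 'token',
    'item_id': 'token',
    'rating': 'float',
    'timestamp': 'float',
}


def write_atomic_file(frame: pd.DataFrame, field_types: dict,
                      dataset_name: str, suffix: str,
                      root: str = 'dataset') -> str:
    """Write `frame` as a tab-separated file with typed header cells.

    The file is written to <root>/<dataset_name>/<dataset_name>.<suffix>.
    """
    out_dir = os.path.join(root, dataset_name)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f'{dataset_name}.{suffix}')

    columns = list(field_types)
    typed_header = [f'{name}:{ftype}' for name, ftype in field_types.items()]
    # `header` accepts a list of aliases that replace the column names on output.
    frame[columns].to_csv(path, sep='\t', index=False, header=typed_header)
    return path
```

For the interaction table above, the call would be `write_atomic_file(interaction_df, INTER_TYPES, 'custom', 'inter')`. The user and item tables would need their own type mappings.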


In [14]:
from recbole.quick_start import run_recbole
from recbole.config import Config
from recbole.utils import init_seed
from recbole.data.utils import create_dataset, data_preparation

# Define the configuration dictionary directly in code
config_dict = {
    'USER_ID_FIELD': 'user_id',
    'ITEM_ID_FIELD': 'item_id',
    'RATING_FIELD': 'rating',
    'TIME_FIELD': 'timestamp',
    'load_col': {
        'inter': ['user_id', 'item_id', 'rating', 'timestamp'],
        'item': ['item_id', 'drugName', 'condition'],
        'user': ['user_id']
    },
    'train_batch_size': 4096,
    'eval_batch_size': 4096,
    'epochs': 10,
    'show_progress': True,
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'valid_metric': 'MRR@10',
}

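# Initialize configuration and dataset
# (optional: run_recbole below repeats these steps internally, so this part is
# only useful for inspecting the dataset and the splits before training)
config = Config(model='BPR', dataset='custom', config_dict=config_dict)
init_seed(config['seed'], config['reproducibility'])
dataset = create_dataset(config)

# Prepare data
train_data, valid_data, test_data = data_preparation(config, dataset)

# Run the model: builds its own config and dataset, trains BPR, and returns the results dict
run_recbole(model='BPR', dataset='custom', config_dict=config_dict)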

Train     0: 100%|██████████████████████████████████████████████████| 53/53 [00:14<00:00,  3.59it/s]
Train     1: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.20it/s]
Train     2: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.26it/s]
Train     3: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.43it/s]
Train     4: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.52it/s]
Train     5: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.64it/s]
Train     6: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.42it/s]
Train     7: 100%|██████████████████████████████████████████████████| 53/53 [00:13<00:00,  4.02it/s]
Train     8: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.09it/s]
Train     9: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  

{'best_valid_score': -inf,
 'valid_score_bigger': True,
 'best_valid_result': None,
 'test_result': None}
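The progress bars give one indirect check on the data split. Assuming each training batch holds up to `train_batch_size` interactions (an assumption about how RecBole batches BPR training data), 53 batches per epoch bound the size of the training split. The helper `interaction_bounds` below is a hypothetical name used only for this sketch.

```python
def interaction_bounds(n_batches: int, batch_size: int) -> tuple:
    """Smallest and largest row counts that need exactly n_batches batches."""
    lowest = (n_batches - 1) * batch_size + 1
    highest = n_batches * batch_size
    return lowest, highest


low, high = interaction_bounds(53, 4096)
print(low, high)  # 212993 217088
```

If `len(df)` for the combined DataFrame falls inside this range, nearly every row went into the training split, which would be consistent with empty validation and test sets.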