The main problem in this project was that the custom dataset, derived from medical reviews, did not yield meaningful results with the RecBole framework: the BPR run returned no validation or test result (`best_valid_score: -inf`), whereas the MovieLens dataset gave consistent and useful outcomes. To identify the root cause, I compared the custom dataset with two randomly generated datasets: a small one with the same column structure, and a larger one matching the custom dataset's size.

**Key Findings:**

1. **Dataset Structure:** All three datasets had the same columns and declared types (`user_id:token`, `item_id:token`, `rating:float`, `timestamp:float`), indicating that structural differences were not the issue.
   
2. **Dataset Size:** The custom dataset and the large randomly generated dataset had the same number of entries (215,063), while the smaller random dataset had only 10,000 entries.

3. **Statistical Properties:**
   - The **Custom Dataset** had a higher mean rating (6.99) than the random datasets (~3.0), since its ratings extend beyond the 1-5 range generated for the random data. Its timestamp range and distribution also differed, with more recent timestamps on average.
   - The **Large Random Dataset** showed statistical properties similar to the smaller random dataset but differed from the custom dataset in rating distribution and timestamps.

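These differences suggest that the custom dataset's inherent characteristics, such as higher ratings and more recent timestamps, could be influencing the model's performance. Both random datasets, however, trained and evaluated normally and produced the near-zero scores expected of random data, whereas the custom run returned `None` for its validation and test results, so rating and timestamp statistics alone may not explain the failure. Further investigation into how these and other dataset properties affect training and evaluation is needed to understand why the custom dataset isn't producing meaningful results.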
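
One property the comparison above does not cover is how many interactions each user has. Because Step 3 builds `user_id` from the review-level `uniqueID` column, it may be worth checking whether most users appear only once, which would leave little to hold out for validation and testing. A minimal sketch of such a check follows; `interactions_per_user_summary` is a hypothetical helper, not part of RecBole or pandas, and the inputs are toy data.

```python
from collections import Counter


def interactions_per_user_summary(user_ids):
    """Summarise how interactions are spread across users.

    `user_ids` is any iterable of user identifiers, one entry per interaction
    (for example the `user_id:token` column of a .inter file).
    """
    per_user = Counter(user_ids)          # user id -> number of interactions
    counts = list(per_user.values())
    num_users = len(counts)
    num_interactions = sum(counts)
    single = sum(1 for c in counts if c == 1)
    return {
        "num_users": num_users,
        "num_interactions": num_interactions,
        "mean_per_user": num_interactions / num_users if num_users else 0.0,
        "share_single_interaction": single / num_users if num_users else 0.0,
    }


# Toy data: every user appears exactly once, as unique review ids would.
print(interactions_per_user_summary([0, 1, 2, 3]))
# Toy data: user 1 has three interactions, user 2 has two, user 3 has one.
print(interactions_per_user_summary([1, 1, 1, 2, 2, 3]))
```

A pandas column can be passed directly, since iterating over a Series yields its values. A `share_single_interaction` close to 1.0 would point at the user identifier rather than at ratings or timestamps.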

Step 1: Create a random dataset in the structure RecBole requires (user_id:token, item_id:token, rating:float, timestamp:float)

In [1]:
import pandas as pd
import numpy as np
import os

# Step 1: Create random dataset
def create_random_dataset(num_users, num_items, num_interactions):
    np.random.seed(42)

    user_ids = np.random.randint(1, num_users + 1, size=num_interactions)
    item_ids = np.random.randint(1, num_items + 1, size=num_interactions)
    ratings = np.random.randint(1, 6, size=num_interactions)  # Ratings between 1 and 5
    timestamps = np.random.randint(1_000_000_000, 1_500_000_000, size=num_interactions)  # Random timestamps

    data = {
        'user_id:token': user_ids,
        'item_id:token': item_ids,
        'rating:float': ratings,
        'timestamp:float': timestamps
    }

    df = pd.DataFrame(data)
    return df

# Step 2: Save dataset in the correct directory structure
def save_dataset(df, dataset_name):
    dataset_dir = f'dataset/{dataset_name}'

    # Ensure the directory exists
    os.makedirs(dataset_dir, exist_ok=True)

    # Save the interaction file
    interaction_file_path = os.path.join(dataset_dir, f'{dataset_name}.inter')
    df.to_csv(interaction_file_path, sep='\t', index=False)
    print(f'Dataset saved to {interaction_file_path}')

# Step 3: Run the process
dataset_name = 'random_ml1m_structure'
random_df = create_random_dataset(num_users=1000, num_items=1000, num_interactions=10000)
save_dataset(random_df, dataset_name)

# Check the first few rows of the dataset
print("First few rows of the dataset:")
print(random_df.head())

Dataset saved to dataset/random_ml1m_structure/random_ml1m_structure.inter
First few rows of the dataset:
   user_id:token  item_id:token  rating:float  timestamp:float
0            103            954             3       1229343492
1            436            279             2       1325078740
2            861            251             5       1033486648
3            271            310             1       1398289623
4            107            720             4       1369596254


Step 2: Train the BPR model on the randomly generated dataset

In [2]:
from recbole.quick_start import run_recbole

# Configuration for running BPR Model
config_dict = {
    'model': 'BPR',
    'dataset': 'random_ml1m_structure',
    'train_batch_size': 4096,
    'eval_batch_size': 4096,
    'epochs': 10,
    'show_progress': True,
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'valid_metric': 'MRR@10',
}

# Run the BPR model
run_recbole(model='BPR', dataset='random_ml1m_structure', config_dict=config_dict)

21 Aug 17:11    INFO  ['/home/stef/.local/lib/python3.10/site-packages/ipykernel_launcher.py', '-f', '/home/stef/.local/share/jupyter/runtime/kernel-ff343df6-b497-4e3f-ba9f-447e555ce260.json']
21 Aug 17:11    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = dataset/random_ml1m_structure
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 10
train_batch_size = 4096
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}
repeat

Train     5: 100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00, 181.27it/s]
21 Aug 17:11    INFO  epoch 5 training [time: 0.01s, train loss: 1.3778]
Evaluate   : 100%|██████████████████████████████████████████████| 249/249 [00:00<00:00, 3532.88it/s]
21 Aug 17:11    INFO  epoch 5 evaluating [time: 0.08s, valid_score: 0.005000]
21 Aug 17:11    INFO  valid result: 
recall@10 : 0.0181    mrr@10 : 0.005    ndcg@10 : 0.008    hit@10 : 0.0181    precision@10 : 0.0018
Train     6: 100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00, 185.65it/s]
21 Aug 17:11    INFO  epoch 6 training [time: 0.01s, train loss: 1.3761]

{'best_valid_score': 0.0059,
 'valid_score_bigger': True,
 'best_valid_result': OrderedDict([('recall@10', 0.0191),
              ('mrr@10', 0.0059),
              ('ndcg@10', 0.009),
              ('hit@10', 0.0191),
              ('precision@10', 0.0019)]),
 'test_result': OrderedDict([('recall@10', 0.014),
              ('mrr@10', 0.0039),
              ('ndcg@10', 0.0061),
              ('hit@10', 0.014),
              ('precision@10', 0.0014)])}

Step 3: Load the custom (medical) dataset, concatenate the train and test sets, and arrange the columns in the structure RecBole requires

In [4]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Load the datasets
train_df = pd.read_csv('drugsComTrain_raw.csv')
test_df = pd.read_csv('drugsComTest_raw.csv')

# Combine train and test datasets
df = pd.concat([train_df, test_df])

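# Assign integer identifiers.
# Caution: `uniqueID` appears to identify a single review rather than a reviewer
# (Step 7 reports 215063 rows with user ids starting at 0 and averaging exactly
# 107531), so each "user" here may have only one interaction.
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()

df['user_id'] = user_encoder.fit_transform(df['uniqueID'])
df['item_id'] = item_encoder.fit_transform(df['drugName'])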

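# Convert date to a Unix timestamp in seconds.
# An explicit int64 avoids platform-dependent integer widths with astype(int).
df['timestamp'] = pd.to_datetime(df['date'], format='%d-%b-%y').astype('int64') / 10**9

# Create the interaction file (copy so the rename does not touch a view of df)
interaction_df = df[['user_id', 'item_id', 'rating', 'timestamp']].copy()
interaction_df.columns = ['user_id:token', 'item_id:token', 'rating:float', 'timestamp:float']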

# Verify the interaction file
print("Interaction File Sample:")
print(interaction_df.head())

# Save the interaction file
dataset_name_custom = 'custom'
save_dataset(interaction_df, dataset_name_custom)

Interaction File Sample:
   user_id:token  item_id:token  rating:float  timestamp:float
0         191228           3425             9     1.337472e+09
1          88331           1539             8     1.272326e+09
2          85964           1986             5     1.260749e+09
3         127958           2450             8     1.446509e+09
4          33057            554             9     1.480205e+09
Dataset saved to dataset/custom/custom.inter
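
The date conversion in this step can be checked by hand. The sketch below uses only the standard library and assumes, as the `format='%d-%b-%y'` argument above implies, day-month-year strings with English month abbreviations, interpreted as UTC midnight. `date_to_unix` is a hypothetical helper and the input string is an illustrative value, not a row quoted from the dataset.

```python
from datetime import datetime, timezone


def date_to_unix(date_str, fmt="%d-%b-%y"):
    """Convert a date string such as '20-May-12' to Unix seconds (UTC midnight)."""
    parsed = datetime.strptime(date_str, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


print(date_to_unix("20-May-12"))  # 1337472000.0
```

The result, 1337472000.0, equals the first `timestamp:float` value (1.337472e+09) in the sample output above, so the timestamps are in seconds and fall in a plausible range.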


Step 4: Train BPR model with custom (medical) dataset

In [21]:
from recbole.quick_start import run_recbole

# Configuration for running BPR Model
config_dict = {
    'model': 'BPR',
    'dataset': 'custom',
    'train_batch_size': 4096,
    'eval_batch_size': 4096,
    'epochs': 10,
    'show_progress': True,
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'valid_metric': 'MRR@10',
}

# Run the BPR model
run_recbole(model='BPR', dataset='custom', config_dict=config_dict)

Train     0: 100%|██████████████████████████████████████████████████| 53/53 [00:13<00:00,  3.85it/s]
Train     1: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.69it/s]
Train     2: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.25it/s]
Train     3: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.15it/s]
Train     4: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.40it/s]
Train     5: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.43it/s]
Train     6: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.70it/s]
Train     7: 100%|██████████████████████████████████████████████████| 53/53 [00:11<00:00,  4.67it/s]
Train     8: 100%|██████████████████████████████████████████████████| 53/53 [00:12<00:00,  4.38it/s]
Train     9: 100%|██████████████████████████████████████████████████| 53/53 [00:14<00:00,  

{'best_valid_score': -inf,
 'valid_score_bigger': True,
 'best_valid_result': None,
 'test_result': None}
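
A result of `None` for both validation and test is what one would expect if the held-out sets were empty. The evaluation settings logged for the random-dataset runs show a ratio split of 0.8/0.1/0.1 grouped by user, and the same defaults presumably applied here. The sketch below models such a split with simple flooring to show how users with very few interactions contribute nothing to validation or test. This is a simplified assumption for illustration, not RecBole's actual splitting code, and `per_user_split_sizes` is a hypothetical helper.

```python
def per_user_split_sizes(num_interactions, ratios=(0.8, 0.1, 0.1)):
    """Return (train, valid, test) sizes for one user's interactions.

    Simplified model: validation and test sizes are floored, and the
    training set receives whatever remains.
    """
    _, valid_ratio, test_ratio = ratios
    valid = int(valid_ratio * num_interactions)
    test = int(test_ratio * num_interactions)
    train = num_interactions - valid - test
    return train, valid, test


# Show how the split behaves for users with few and with many interactions.
for n in (1, 5, 10, 20):
    print(n, per_user_split_sizes(n))
```

Under this model a user with a single interaction always ends up entirely in the training set. Real implementations may round differently for small counts, so the per-user interaction counts of the custom dataset are worth inspecting directly.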

Step 5: Generate another random dataset for comparison, this time with the same number of users as the custom (medical) dataset

In [5]:
import pandas as pd
import numpy as np
import os

def generate_large_random_ml1m_structure():
    np.random.seed(42)

    # Parameters based on the 'custom.inter' dataset
    num_users = 215063  # Same number of users as in 'custom.inter'
    num_items = 3670    # Approximate number of items in 'custom.inter'
    num_interactions = len(pd.read_csv('dataset/custom/custom.inter', sep='\t'))  # Ensure the same number of interactions

    # Generate random user IDs and item IDs
    user_ids = np.random.randint(1, num_users + 1, num_interactions)
    item_ids = np.random.randint(1, num_items + 1, num_interactions)

    # Generate random ratings between 1 and 5
    ratings = np.random.randint(1, 6, num_interactions)

    # Generate random timestamps within a reasonable range
    timestamps = np.random.randint(1_000_000_000, 1_500_000_000, num_interactions)

    # Create a DataFrame
    data = {
        'user_id:token': user_ids,
        'item_id:token': item_ids,
        'rating:float': ratings,
        'timestamp:float': timestamps
    }
    df = pd.DataFrame(data)

    # Save the DataFrame to a .inter file
    save_dataset(df, 'random_ml1m_structure_large')

    print("Large random ML1M structure dataset created and saved as 'random_ml1m_structure_large.inter'.")
    print(df.head())

# Same save helper as in Step 1, redefined so this cell can run on its own
def save_dataset(df, dataset_name):
    dataset_dir = f'dataset/{dataset_name}'

    # Ensure the directory exists
    os.makedirs(dataset_dir, exist_ok=True)

    # Save the interaction file
    interaction_file_path = os.path.join(dataset_dir, f'{dataset_name}.inter')
    df.to_csv(interaction_file_path, sep='\t', index=False)
    print(f'Dataset saved to {interaction_file_path}')

# Run the function to generate the large dataset
generate_large_random_ml1m_structure()


Dataset saved to dataset/random_ml1m_structure_large/random_ml1m_structure_large.inter
Large random ML1M structure dataset created and saved as 'random_ml1m_structure_large.inter'.
   user_id:token  item_id:token  rating:float  timestamp:float
0         121959           2679             4       1417338431
1         146868           1508             3       1224557973
2         131933           2803             1       1494229191
3         103695            929             2       1245403329
4         119880           1017             4       1065209478
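
Matching the number of users and interactions does not by itself match how interactions are distributed across users. In this cell each interaction picks its user uniformly at random, so the count for any one user follows a binomial distribution. The sketch below computes, under that uniform-sampling assumption, the probability that a given user id ends up with zero, exactly one, or at least two interactions. `user_count_probabilities` is a hypothetical helper.

```python
def user_count_probabilities(num_users, num_interactions):
    """P(0), P(1), P(>=2) interactions for one user under uniform sampling."""
    p = 1.0 / num_users                  # chance a single interaction picks this user
    q = 1.0 - p
    p_zero = q ** num_interactions
    p_one = num_interactions * p * q ** (num_interactions - 1)
    return p_zero, p_one, 1.0 - p_zero - p_one


p0, p1, p2 = user_count_probabilities(215063, 215063)
print(f"P(0)={p0:.3f}  P(1)={p1:.3f}  P(>=2)={p2:.3f}")
```

With these parameters roughly 37% of user ids receive no interaction, 37% receive exactly one, and about 26% receive two or more. The random dataset therefore contains users with several interactions, which may be one reason it behaves differently from a dataset in which every user id occurs once.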


Step 6: Train the BPR model on the larger randomly generated dataset

In [6]:
from recbole.quick_start import run_recbole

# Configuration for running BPR Model
config_dict = {
    'model': 'BPR',
    'dataset': 'random_ml1m_structure_large',
    'train_batch_size': 4096,
    'eval_batch_size': 4096,
    'epochs': 10,
    'show_progress': True,
    'metrics': ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision'],
    'valid_metric': 'MRR@10',
}

# Run the BPR model
run_recbole(model='BPR', dataset='random_ml1m_structure_large', config_dict=config_dict)

21 Aug 17:13    INFO  ['/home/stef/.local/lib/python3.10/site-packages/ipykernel_launcher.py', '-f', '/home/stef/.local/share/jupyter/runtime/kernel-ff343df6-b497-4e3f-ba9f-447e555ce260.json']
21 Aug 17:13    INFO  
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 2020
state = INFO
reproducibility = True
data_path = dataset/random_ml1m_structure_large
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 10
train_batch_size = 4096
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'RS': [0.8, 0.1, 0.1]}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'full', 'test': 'full'}}



{'best_valid_score': 0.0012,
 'valid_score_bigger': True,
 'best_valid_result': OrderedDict([('recall@10', 0.0037),
              ('mrr@10', 0.0012),
              ('ndcg@10', 0.0018),
              ('hit@10', 0.0037),
              ('precision@10', 0.0004)]),
 'test_result': OrderedDict([('recall@10', 0.003),
              ('mrr@10', 0.0011),
              ('ndcg@10', 0.0015),
              ('hit@10', 0.003),
              ('precision@10', 0.0003)])}
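
To judge whether scores this small are plausible, they can be compared with the chance level. If a ranker orders items uniformly at random and a user has r held-out items among n candidates, the probability that at least one of them appears in the top k is 1 - C(n-r, k)/C(n, k), which reduces to k/n for r = 1. The sketch below evaluates this for the item counts used in the two random datasets (1000 and 3670). It ignores that items seen in training are typically excluded from the candidates, so treat it as a rough reference only. `chance_hit_at_k` is a hypothetical helper.

```python
from math import comb


def chance_hit_at_k(k, num_items, num_relevant=1):
    """Probability that a uniformly random top-k list hits >= 1 relevant item."""
    misses = comb(num_items - num_relevant, k)  # top-k lists with no relevant item
    return 1.0 - misses / comb(num_items, k)


print(round(chance_hit_at_k(10, 1000), 4))   # small random dataset (1000 items)
print(round(chance_hit_at_k(10, 3670), 4))   # large random dataset (3670 items)
```

The two values come out at about 0.01 and 0.0027, the same order of magnitude as the reported test `hit@10` scores (0.014 and 0.003). That is consistent with a model that could not learn anything useful from random interactions.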

Step 7: Show characteristics of the different datasets

In [8]:
import pandas as pd

def load_and_describe_dataset(dataset_path, dataset_name):
    # Load the dataset
    df = pd.read_csv(dataset_path, sep='\t')

    # Print dataset info
    print(f"\n{dataset_name} Info:")
    print(df.info())

    # Print dataset statistics
    print(f"\n{dataset_name} Statistics:")
    print(df.describe())

# Define the paths for the datasets
custom_path = 'dataset/custom/custom.inter'
random_path = 'dataset/random_ml1m_structure/random_ml1m_structure.inter'
large_random_path = 'dataset/random_ml1m_structure_large/random_ml1m_structure_large.inter'

# Evaluate the datasets
load_and_describe_dataset(custom_path, "Custom (medical) Dataset")
load_and_describe_dataset(random_path, "Random ML1M Dataset")
load_and_describe_dataset(large_random_path, "Large Random ML1M Dataset")


Custom (medical) Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215063 entries, 0 to 215062
Data columns (total 4 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   user_id:token    215063 non-null  int64  
 1   item_id:token    215063 non-null  int64  
 2   rating:float     215063 non-null  int64  
 3   timestamp:float  215063 non-null  float64
dtypes: float64(1), int64(3)
memory usage: 6.6 MB
None

Custom (medical) Dataset Statistics:
       user_id:token  item_id:token   rating:float  timestamp:float
count  215063.000000  215063.000000  215063.000000     2.150630e+05
mean   107531.000000    1829.467998       6.990008     1.402447e+09
std     62083.484809    1008.687261       3.275554     8.530971e+07
min         0.000000       0.000000       1.000000     1.203811e+09
25%     53765.500000    1053.000000       5.000000     1.334189e+09
50%    107531.000000    1877.000000       8.000000     1.433635e+09
75%    