# Surprise algorithm
&ensp; &ensp; The training set and the test set each have three columns: `user`, `item`, and `rating`. Each row records the rating a user gave to an item.  
  
&ensp; &ensp; In the selected dataset, user and item IDs both range over [1, 40] and ratings over [1, 5]. Two recommendation algorithms were tested, SVD and SlopeOne, with the following results:

### Load environment

In [118]:
from surprise import Dataset, Reader, SVD, SlopeOne

### Read data

In [13]:
# Specify the file paths
train_file_path = "../data/surprise_train.csv"
test_file_path = "../data/surprise_test.csv"

# Define the reader
reader = Reader(line_format="user item rating", sep=",")

# Load the training data
train_data = Dataset.load_from_file(train_file_path, reader=reader)
trainset = train_data.build_full_trainset()

### SVD Model

In [14]:
# Create and fit the SVD model
svd_model = SVD()
svd_model.fit(trainset)

# Load the test data and convert it to a list of (user, item, rating) tuples
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.build_full_trainset().build_testset()

# Make predictions on the test set
predictions = svd_model.test(testset)

# Print example predictions
for prediction in predictions[:5]:
    print(prediction)

user: 1          item: 15         r_ui = 3.00   est = 2.88   {'was_impossible': False}
user: 1          item: 25         r_ui = 5.00   est = 2.94   {'was_impossible': False}
user: 39         item: 29         r_ui = 5.00   est = 2.58   {'was_impossible': False}
user: 10         item: 2          r_ui = 1.00   est = 3.89   {'was_impossible': False}
user: 19         item: 21         r_ui = 2.00   est = 2.49   {'was_impossible': False}


### SlopeOne Model

In [15]:
# Create and fit the SlopeOne model
slopeone_model = SlopeOne()
slopeone_model.fit(trainset)

# Load the test data and convert it to a list of (user, item, rating) tuples
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.build_full_trainset().build_testset()

# Make predictions on the test set
predictions = slopeone_model.test(testset)

# Print example predictions
for prediction in predictions[:5]:
    print(prediction)

user: 1          item: 15         r_ui = 3.00   est = 2.92   {'was_impossible': False}
user: 1          item: 25         r_ui = 5.00   est = 2.67   {'was_impossible': False}
user: 39         item: 29         r_ui = 5.00   est = 2.58   {'was_impossible': False}
user: 10         item: 2          r_ui = 1.00   est = 4.83   {'was_impossible': False}
user: 19         item: 21         r_ui = 2.00   est = 1.00   {'was_impossible': False}
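
&ensp; &ensp; The printed rows alone make it hard to compare the two models. As a minimal sketch, the root-mean-square error (RMSE) can be computed over the five example predictions printed for each model. The `(r_ui, est)` values are copied from the outputs above, and the helper name `rmse` is introduced here for illustration only. Five rows are far too few to rank the models; on the full prediction list, `surprise.accuracy.rmse(predictions)` could be used instead.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# (r_ui, est) copied from the five example predictions of each model
svd_pairs = [(3.0, 2.88), (5.0, 2.94), (5.0, 2.58), (1.0, 3.89), (2.0, 2.49)]
slopeone_pairs = [(3.0, 2.92), (5.0, 2.67), (5.0, 2.58), (1.0, 4.83), (2.0, 1.00)]

svd_rmse = rmse(svd_pairs)
slopeone_rmse = rmse(slopeone_pairs)
print(f"SVD RMSE (5 rows): {svd_rmse:.2f}")
print(f"SlopeOne RMSE (5 rows): {slopeone_rmse:.2f}")
```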


### Summary
&ensp; &ensp; In the methods above, the `line_format` parameter of Surprise's `Reader` class accepts only the fields `user`, `item`, `rating`, and optionally `timestamp`; it has no way to take additional feature columns. (See the Surprise documentation: https://surprise.readthedocs.io/en/stable/reader.html?highlight=line_format)

&ensp; &ensp; Because these recommendation algorithms cannot incorporate the multi-feature drug data, they are not suitable for predicting dosage.

# LightFM algorithm

### An example of movie recommendations
&ensp; &ensp; `lightfm_example.pkl` is a demonstration dataset for recommending movies based on user preferences. The first column is `user_id`, the second is `movie_id`, and columns three to nine hold the features that are passed to the model as user features: `gender`, `age`, `occupation`, `movie_genre`, `movie_year`, `movie_director`, and `movie_actor`. The tenth column is `rating`.

In [1]:
import pandas as pd
import numpy as np
from lightfm.data import Dataset
from scipy.sparse import coo_matrix
from lightfm import LightFM
df = pd.read_pickle('../data/lightfm_example.pkl')
# Building user features (movie_director is listed in the data description but not included here)
dataset = Dataset()
dataset.fit(users=df['user_id'].unique(),
            items=df['movie_id'].unique(),
            user_features=df['age'].astype(str).unique().tolist() 
            + df['gender'].unique().tolist() 
            + df['occupation'].unique().tolist() 
            + df['movie_genre'].unique().tolist()
            + df['movie_year'].unique().tolist() 
            + df['movie_actor'].unique().tolist() )

# Mapping user features
user_features = dataset.build_user_features((row['user_id'], [str(row['age']), 
                                                              row['gender'], 
                                                              row['occupation'], 
                                                              row['movie_genre'], 
                                                              row['movie_year'], 
                                                              row['movie_actor']])
                                            for idx, row in df.iterrows())
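
&ensp; &ensp; To make the result of the mapping step more concrete, the sketch below mimics with plain NumPy what a user-feature matrix of this kind looks like: one identity column per user followed by one indicator column per feature, with each row normalised to sum to 1. This follows the documented defaults of LightFM's `Dataset` (`user_identity_features=True`) and `build_user_features` (`normalize=True`), but it is an illustration rather than LightFM's own code, and the toy users and features are made up.

```python
import numpy as np

# Hypothetical toy data: two users with two features each
users = ["u1", "u2"]
features = ["F", "M", "25", "31"]
user_to_features = {"u1": ["F", "25"], "u2": ["M", "31"]}

user_index = {u: i for i, u in enumerate(users)}
# Feature columns come after the per-user identity columns
feature_index = {f: len(users) + j for j, f in enumerate(features)}

matrix = np.zeros((len(users), len(users) + len(features)), dtype=np.float32)
for user, feats in user_to_features.items():
    row = user_index[user]
    matrix[row, row] = 1.0                      # identity feature
    for feat in feats:
        matrix[row, feature_index[feat]] = 1.0  # indicator feature

# Row-normalise so that each user's feature weights sum to 1
matrix /= matrix.sum(axis=1, keepdims=True)
print(matrix)
```

&ensp; &ensp; The number of columns is the number of users plus the number of distinct features, which is why the learned embeddings are per feature rather than per user.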


In [34]:
num_users = df['user_id'].nunique()
num_items = df['movie_id'].nunique()

# Assuming that user_id and movie_id are contiguous integers starting from 1 in the dataset
matrix = coo_matrix((df['rating'], (df['user_id'] - 1, df['movie_id'] - 1)), shape=(num_users, num_items))

# Split the data into training and test sets
# For simplicity, let's assume the first 80% as training and the remaining 20% as test

num_ratings = matrix.nnz
train_size = int(0.8 * num_ratings)

train = coo_matrix((matrix.data[:train_size], (matrix.row[:train_size], matrix.col[:train_size])), shape=(num_users, num_items))
test = coo_matrix((matrix.data[train_size:], (matrix.row[train_size:], matrix.col[train_size:])), shape=(num_users, num_items))
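
&ensp; &ensp; Taking the first 80% of the stored entries as training data depends on the row order: if the rows of `df` are sorted (for example by user), the test set may contain users that never appear in training. A hedged alternative is to shuffle the entry indices before splitting. The sketch below does this with NumPy only; `split_indices` and the fixed seed are illustrative choices, and LightFM also provides `lightfm.cross_validation.random_train_test_split` for the same purpose.

```python
import numpy as np

def split_indices(num_entries, train_fraction=0.8, seed=42):
    """Return shuffled (train_idx, test_idx) index arrays over the stored entries."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_entries)
    cut = int(train_fraction * num_entries)
    return order[:cut], order[cut:]

# Example with 10 stored ratings
train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

&ensp; &ensp; The index arrays can then replace the slices used above, for example `matrix.data[train_idx]`, `matrix.row[train_idx]`, and `matrix.col[train_idx]`.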

In [35]:
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, user_features=user_features, epochs=10)

<lightfm.lightfm.LightFM at 0x7f113b86a6a0>

In [36]:
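# predict() takes internal 0-based indices, not the raw IDs
user_ids = [1, 1, 2]
item_ids = [1, 2, 3]
# The model was fitted with user_features, so they would normally be passed
# here as well (user_features=user_features); the output below comes from
# the call as originally written, without them
model.predict(np.array(user_ids), np.array(item_ids))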

array([-0.03655178, -0.05357065,  0.07167926], dtype=float32)
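
&ensp; &ensp; Note that the interaction matrix above is indexed by `user_id - 1`, while the user-feature matrix is indexed by the mapping that `Dataset.fit` builds. The two agree only if the IDs are contiguous, start at 1, and first appear in ascending order. The sketch below uses made-up IDs to mimic an order-of-first-appearance mapping, which is how `Dataset.fit` is understood to assign indices, and compares it with `id - 1`. The actual mapping can be inspected with `dataset.mapping()`.

```python
# Hypothetical raw user IDs in the order they appear in the data
raw_user_ids = [3, 1, 2, 1, 3]

# Assign indices in order of first appearance
appearance_index = {}
for uid in raw_user_ids:
    appearance_index.setdefault(uid, len(appearance_index))

# Index obtained by simply subtracting 1 from the raw ID
minus_one_index = {uid: uid - 1 for uid in set(raw_user_ids)}

print(appearance_index)                     # {3: 0, 1: 1, 2: 2}
print(minus_one_index == appearance_index)  # False
```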

### Dose prediction
&ensp; &ensp; Following the same procedure, we replace `user_id` with the drug's `CID`, `movie_id` with `DosageID`, the features in columns three to nine with `mw`, `polararea`, `complexity`, `xlogp`, `heavycnt`, `hbonddonor`, and `hbondacc`, and `rating` with `Dose_mmol`. All DosageIDs are 0: the movie example has a many-to-many relationship between users and movies, whereas in this study many CIDs map to the dose in a many-to-one relationship, so there is only a single item. The results are as follows:

In [2]:
import pandas as pd
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM
from lightfm.data import Dataset
df = pd.read_pickle('../data/lightfm_dosage.pkl')
# Building user features
dataset = Dataset()

dataset.fit(users=df['CID'].unique(),
            items=df['DosageID'].unique(),
            user_features=df['mw'].astype(str).unique().tolist() 
            + df['polararea'].astype(str).unique().tolist() 
            + df['complexity'].astype(str).unique().tolist()  
            + df['xlogp'].astype(str).unique().tolist() 
            + df['heavycnt'].astype(str).unique().tolist()
            + df['hbonddonor'].astype(str).unique().tolist() 
            + df['hbondacc'].astype(str).unique().tolist()
           )

# Mapping user features
user_features = dataset.build_user_features((row['CID'], [str(row['mw'])
                                                          , str(row['polararea'])
                                                          , str(row['complexity'])
                                                          , str(row['xlogp'])
                                                          , str(row['heavycnt'])
                                                          , str(row['hbonddonor'])
                                                          , str(row['hbondacc'])
                                                          ])
                                            for idx, row in df.iterrows())

In [95]:
num_users = df['CID'].nunique()
num_items = df['DosageID'].nunique()

# Assuming that CID and DosageID are contiguous integers starting from 1 in the dataset
# (a stored DosageID of 0 would give a negative column index here)
matrix = coo_matrix((df['Dose_mmol'], (df['CID'] - 1, df['DosageID'] - 1)), shape=(num_users, num_items))

# Split the data into training and test sets
# For simplicity, let's assume the first 80% as training and the remaining 20% as test

num_ratings = matrix.nnz
train_size = int(0.8 * num_ratings)

train = coo_matrix((matrix.data[:train_size], (matrix.row[:train_size], matrix.col[:train_size])), shape=(num_users, num_items))
test = coo_matrix((matrix.data[train_size:], (matrix.row[train_size:], matrix.col[train_size:])), shape=(num_users, num_items))

In [96]:
model = LightFM(learning_rate=0.05, loss='bpr')
model.fit(train, user_features=user_features, epochs=10)

<lightfm.lightfm.LightFM at 0x7f113ab998e0>

In [108]:
cid_indices = [1, 2, 3, 4]      # 0-based user (CID) indices
dosage_indices = [0, 0, 0, 0]   # 0-based item (DosageID) indices
predicted_scores = model.predict(np.array(cid_indices), np.array(dosage_indices), user_features=user_features)
predicted_scores

array([-0.28663096, -0.34560135, -0.3711385 , -0.23588978], dtype=float32)

### Summary
&ensp; &ensp; All of the predicted scores are negative. The scores produced by this model, which is trained with the BPR loss, are meant only for ranking candidate items: they are not on the scale of the training target and cannot be read as predicted values of `Dose_mmol`. Only the relative order of the scores is meaningful, in that a higher score ranks an item higher for that user; the sign and absolute magnitude of a score have no direct interpretation. The scores therefore cannot be used to predict dosage.
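
&ensp; &ensp; One way to see why the absolute score values carry no meaning is to look at the BPR objective, which depends only on the difference between the score of an observed item and that of an unobserved one. The sketch below is a plain-Python illustration of the per-pair loss `-log(sigmoid(s_pos - s_neg))`, not LightFM's implementation; `bpr_pair_loss` is a name introduced here, and treating item 1 as observed and item 2 as unobserved is an assumption made for the example. Shifting both scores by the same constant leaves the loss unchanged, so training never pins down an absolute level.

```python
import math

def bpr_pair_loss(score_pos, score_neg):
    """BPR loss for one (observed, unobserved) item pair: -log(sigmoid(diff))."""
    diff = score_pos - score_neg
    return math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))

# Two scores taken from the movie example output (user index 1, item indices 1 and 2);
# which of the two is the observed item is assumed here for illustration
s_item1, s_item2 = -0.03655178, -0.05357065

loss = bpr_pair_loss(s_item1, s_item2)
shifted_loss = bpr_pair_loss(s_item1 + 5.0, s_item2 + 5.0)
print(loss, shifted_loss)
```

&ensp; &ensp; The two printed losses agree up to floating-point rounding even though the shifted scores are both positive, which is why a negative score by itself says nothing about the item.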