# Collaborative Filtering Recommendation System on CCLE (Cancer Cell Line Encyclopedia) Drug Sensitivity Data


---

* Student Name: Arya Wira Syahdwinata
* NPM : 2306174892

---

Data Source: Barretina, J. et al. (2012). The Cancer Cell Line Encyclopedia enables predictive modeling of anticancer drug sensitivity. Nature, 483, 603-607.

---
The goal is to predict the missing (NaN) sensitivity values (ratings) of cancer cell lines (users) to given drugs (items) using a collaborative filtering recommendation system.

### Install requirements

In [1]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 7.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163690 sha256=8664d833f0282848f3a94e8eba5b4e11f4dea9886f4362dd7b459bce2715ee87
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


### Import all libraries

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from surprise import Dataset, Reader, KNNBasic, accuracy

### Get the data (already preprocessed into IC50 values using the Bayesian sigmoid method)

Credit: C. Suphavilai, D. Bertrand, and N. Nagarajan, “Predicting Cancer Drug Response using a Recommender System,” Bioinformatics, vol. 34, no. 22, pp. 3907–3914, Jun. 2018, doi: 10.1093/bioinformatics/bty452. Available: https://doi.org/10.1093/bioinformatics/bty452

In [None]:
# Specify the paths to the files
data_path = "ccle_all_abs_ic50_bayesian_sigmoid.csv"

### Preprocessing
---
1. Read the data
2. Reshape the data
3. Replace NaN with 0
4. Split the data into 4 different parts; each part serves in turn as the test data.

In [None]:
# Load the data into pandas DataFrames
data = pd.read_csv(data_path)

# Reshape the data
data = data.melt(id_vars='Unnamed: 0', var_name='drug_name', value_name='sensitivity')
data.rename(columns={'Unnamed: 0': 'cell_line_name'}, inplace=True)

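# Handle missing values: replace NaN with 0.
# Assign the result back instead of calling inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
data['sensitivity'] = data['sensitivity'].fillna(0)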

# Define the rating scale
reader = Reader(rating_scale=(data['sensitivity'].min(), data['sensitivity'].max()))

# Split the long-format DataFrame into 4 consecutive blocks of rows
indices = np.array_split(data.index, 4)  # Split the index into 4 parts
data_blocks = [data.loc[idx] for idx in indices]  # Create the 4 sub-DataFrames
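
The reshape and split steps can be illustrated on a toy matrix. The sketch below is only an illustration: the cell line names, drug names, and values are made up, and it assumes the real CSV has the same wide layout (one row per cell line, one column per drug, with the cell line name in the `Unnamed: 0` column).

```python
import numpy as np
import pandas as pd

# Toy wide matrix (hypothetical names and values): rows are cell lines, columns are drugs.
wide = pd.DataFrame(
    {
        "Unnamed: 0": ["cell_A", "cell_B", "cell_C", "cell_D"],
        "drug_1": [1.0, 2.0, np.nan, 4.0],
        "drug_2": [5.0, np.nan, 7.0, 8.0],
    }
)

# Wide -> long: one (cell line, drug, sensitivity) row per matrix entry.
long = wide.melt(id_vars="Unnamed: 0", var_name="drug_name", value_name="sensitivity")
long = long.rename(columns={"Unnamed: 0": "cell_line_name"})

# Split the row index into 4 consecutive blocks.
blocks = [long.loc[idx] for idx in np.array_split(long.index.to_numpy(), 4)]

# Show which drugs end up in each block and how many rows it holds.
for k, block in enumerate(blocks):
    print(k, block["drug_name"].unique().tolist(), len(block))
```

In this toy case every block contains a single drug, because `melt` stacks the drug columns one after another. If the real file has the same layout, each fold would hold out entire drugs rather than scattered entries, which is worth checking when interpreting the per-fold RMSE.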

### Train the model using user-based collaborative filtering (KNN with cosine similarity), test it, and compute the performance (RMSE)
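
For reference, the quantities involved can be written out as follows. This is a sketch based on the formulation in the Surprise documentation for `KNNBasic`. The notation is introduced here and is not part of the notebook: $r_{ui}$ is the sensitivity of cell line $u$ to drug $i$, $I_{uv}$ is the set of drugs rated by both $u$ and $v$, $N_i^k(u)$ is the set of at most $k$ cell lines most similar to $u$ that have a value for drug $i$, and $T$ is the test set.

$$
\text{cosine\_sim}(u, v) = \frac{\sum_{i \in I_{uv}} r_{ui}\, r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2}\,\sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}
$$

$$
\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v)\, r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}
$$

$$
\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u, i) \in T} \left(\hat{r}_{ui} - r_{ui}\right)^2}
$$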

In [4]:
rmse_scores = []  # To store the RMSE scores for each fold

for i in range(4):
    test_data = data_blocks[i]
    train_data = pd.concat([data_blocks[j] for j in range(4) if j != i])  # Use all other blocks for training

    # Define the rating scale for the current training data
    reader = Reader(rating_scale=(train_data['sensitivity'].min(), train_data['sensitivity'].max()))

    # Load the training data from the DataFrame and build a Surprise trainset
    train_data = Dataset.load_from_df(train_data, reader)
    train_data = train_data.build_full_trainset()

    # Use user-based collaborative filtering with cosine similarity
    sim_options = {'name': 'cosine', 'user_based': True}
    algo = KNNBasic(sim_options=sim_options)

    # Train the algorithm on the trainset
    algo.fit(train_data)

    # Predict ratings for the testset
    predictions = []
    for _, row in test_data.iterrows():
        prediction = algo.predict(row['cell_line_name'], row['drug_name'])
        predictions.append(prediction.est)

    # Compute RMSE for the current fold
    rmse = np.sqrt(mean_squared_error(test_data['sensitivity'], predictions))
    rmse_scores.append(rmse)

print("RMSE scores: ", rmse_scores)
print("Average RMSE score: ", np.mean(rmse_scores))

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE scores:  [3.4521649442526163, 4.2728503864177085, 3.7257478150381167, 4.014022458911873]
Average RMSE score:  3.8661964011550785
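
To make the mechanics of user-based KNN with cosine similarity concrete, here is a NumPy-only sketch on a toy matrix. It is a simplified illustration, not Surprise's implementation: the helper names (`cosine_sim`, `predict`, `rmse`), the toy values, and the choice of `k=2` are all assumptions made for this example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the items both users have a value for (NaN = missing)."""
    common = ~np.isnan(a) & ~np.isnan(b)
    if not common.any():
        return 0.0
    den = np.linalg.norm(a[common]) * np.linalg.norm(b[common])
    return float(np.dot(a[common], b[common]) / den) if den > 0 else 0.0

def predict(ratings, user, item, k=2):
    """Similarity-weighted mean over the k most similar users that rated `item`."""
    candidates = [
        (cosine_sim(ratings[user], ratings[other]), ratings[other, item])
        for other in range(ratings.shape[0])
        if other != user and not np.isnan(ratings[other, item])
    ]
    # Keep the k most similar users, then drop non-positive similarities.
    top = [(s, r) for s, r in sorted(candidates, reverse=True)[:k] if s > 0]
    if not top:
        return float(np.nanmean(ratings))  # fall back to the global mean
    sims = np.array([s for s, _ in top])
    vals = np.array([r for _, r in top])
    return float(np.dot(sims, vals) / sims.sum())

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy matrix (hypothetical values): rows are cell lines, columns are drugs.
R = np.array([
    [1.0, 2.0, np.nan],   # cell line 0: value for drug 2 is missing
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 1.0, 8.0],
])

pred = predict(R, user=0, item=2, k=2)
print("prediction:", pred)  # about 4.5 for this toy matrix
print("rmse:", rmse([3.0, 5.0], [pred, 5.0]))
```

Cell lines 1 and 2 are both treated as perfect neighbours of cell line 0, even though cell line 2 has doubled values, because cosine similarity ignores magnitude. The prediction is therefore the plain average of their values for drug 2.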
