`Created by Ayushi Dubey`

# Importing Necessary Libraries

All libraries used in this notebook are imported in the cell below, along with the `standardise_smiles` and `standardise_inchikey` functions defined in the `src` directory.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ersilia import ErsiliaModel
import sys
# Assumption: `src` sits next to `data`, one level above this notebook
# (the data files are read from '../data'), so the path is '../src'.
sys.path.append('../src')
from smiles_processing import standardise_smiles
from inchikey_processing import standardise_inchikey
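`sys.path.append` resolves a relative path against the current working directory, so whether the helper modules are found depends on where the notebook is launched from. The following is a minimal sketch of a more defensive lookup, assuming the modules live in a folder named `src` in the working directory or one of its parents. `find_src_dir` is a hypothetical helper and not part of the repository.

```python
import sys
from pathlib import Path


def find_src_dir(start=None, name="src"):
    """Return the nearest directory called `name` at or above `start`, else None."""
    start = Path(start or Path.cwd()).resolve()
    for folder in [start, *start.parents]:
        candidate = folder / name
        if candidate.is_dir():
            return candidate
    return None


src_dir = find_src_dir()
if src_dir is not None and str(src_dir) not in sys.path:
    # Make modules such as smiles_processing importable.
    sys.path.append(str(src_dir))
```

With this in place, the two `from ... import ...` lines should work regardless of which project subfolder the notebook is started from, provided a `src` folder exists somewhere above it.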

# Data Preprocessing

The dataset was downloaded from ChEMBL in TSV format and contains 8715 entries. It is preprocessed to remove null or irrelevant data and to bring it into the required format.

## Reading the raw data

In [None]:
df = pd.read_csv('../data/raw_data.tsv', delimiter='\t')

## Exploring the raw data

In [None]:
df.head()

In [None]:
df.info()

Only the `Smiles` and `Inchi Key` columns are required for further analysis, so we drop all other columns.

In [None]:
selected_columns = ['Smiles', 'Inchi Key']
df = df.loc[:, selected_columns]

In [None]:
df.head()

In [None]:
df.info()

# Standardising Smiles and Inchi Key

We standardise both the `Smiles` and the `Inchi Key` of each molecule using the functions defined in the `src` directory, starting with the SMILES strings.

In [None]:
smiles_list = df['Smiles'].tolist()
standardised_smiles_list = standardise_smiles(smiles_list)
df['standardised_smiles'] = standardised_smiles_list
df

In [None]:
df.info()

Since a few molecules could not be kekulized, their `standardised_smiles` entries are null. We remove those rows from the dataset.

In [None]:
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
df

In [None]:
df.info()
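Before discarding rows, it can be useful to see how many molecules failed standardisation and which ones. The sketch below shows one way to do this on a toy DataFrame. The toy values are made up, and it assumes that a failed standardisation is stored as a null value in `standardised_smiles`, as described above.

```python
import pandas as pd

# Toy stand-in for the notebook's DataFrame; None marks a failed standardisation.
toy = pd.DataFrame({
    "Smiles": ["CCO", "c1ccccc1", "not_a_smiles"],
    "standardised_smiles": ["CCO", "c1ccccc1", None],
})

# Keep the failures aside for inspection before dropping them.
failed = toy[toy["standardised_smiles"].isna()]
print(f"{len(failed)} of {len(toy)} molecules could not be standardised")

# Drop only rows where standardisation failed, then renumber the index.
cleaned = toy.dropna(subset=["standardised_smiles"]).reset_index(drop=True)
```

Passing `subset=` restricts the check to the standardised column, so rows are not removed because of nulls in unrelated columns.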

We also standardise the `Inchi Key` column, generating the keys from the standardised SMILES with the function defined in the `src` directory.

In [None]:
smiles_list = df['standardised_smiles'].tolist()
standardised_inchikeys_list = standardise_inchikey(smiles_list)
df['standardised_inchikeys'] = standardised_inchikeys_list
df

In [None]:
df.info()
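A quick format check can catch malformed keys before they reach the model. The sketch below relies on the standard InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and one final letter). `looks_like_inchikey` is a hypothetical helper, not one of the functions in `src`.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(value):
    """Return True if `value` is a string with the InChIKey layout."""
    return isinstance(value, str) and INCHIKEY_PATTERN.fullmatch(value) is not None


print(looks_like_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))  # aspirin -> True
print(looks_like_inchikey("BSYNRYMUTXBXSQ"))               # truncated -> False
```

On the notebook's DataFrame, `df['standardised_inchikeys'].map(looks_like_inchikey).all()` would report whether every key has the expected shape. This checks the layout only, not whether the key matches the molecule.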

Preprocessing is now complete, and the dataset contains 8697 entries. Next, we draw a random sample of 1000 entries using the `sample` method and rename the columns to `Inchi_key` and `Smiles` for convenience.

# Creating the Sample of 1000 Entries

In [None]:
input_data = df.sample(n=1000, random_state=42)
input_data = input_data[['standardised_inchikeys', 'standardised_smiles']]
input_data.rename(columns={'standardised_inchikeys': 'Inchi_key', 'standardised_smiles': 'Smiles'}, inplace=True)
input_data.reset_index(drop=True, inplace=True)
input_data.to_csv('../data/input_data.csv', index=False)
input_data.head()

In [None]:
input_data.info()
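Fixing `random_state` makes the sample reproducible: the same rows are drawn each time the cell is rerun on the same DataFrame. The sketch below illustrates this on a toy DataFrame whose contents are placeholders.

```python
import pandas as pd

# Toy DataFrame with 20 placeholder SMILES strings.
toy = pd.DataFrame({"Smiles": ["C" * (i + 1) + "O" for i in range(20)]})

# Two draws with the same seed select the same rows in the same order.
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)
print(first.equals(second))  # True

# Sampling is without replacement by default, so no row appears twice.
print(first.index.is_unique)  # True
```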

# Running the Model on the Input Data

The model is run on the input data from the terminal using the following commands:
```
ersilia -v fetch eos30gr
ersilia serve eos30gr
ersilia -v api run -i input_data.csv -o output_data.csv
```

The model output is saved to `output_data.csv`, which is located in the `data` directory.

# Model Bias Evaluation

## Exploring the output data

In [None]:
predictions_df = pd.read_csv('../data/output_data.csv')
predictions_df.head()

In [None]:
predictions_df.info()

The output data contains three columns: `key`, which holds the InChIKey; `input`, which holds the SMILES string; and `activity10`, which holds the predicted probability of hERG blockade.
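A simple sanity check is to confirm that every input molecule received a prediction. The sketch below compares the `Smiles` column of the input file with the `input` column of the output, using toy DataFrames with made-up values. `missing_from_output` is a hypothetical helper, and it assumes the model echoes the input SMILES unchanged.

```python
import pandas as pd


def missing_from_output(input_df, output_df):
    """Return input SMILES that have no row in the model output, sorted."""
    return sorted(set(input_df["Smiles"]) - set(output_df["input"]))


# Toy stand-ins for input_data.csv and output_data.csv (values are made up).
toy_input = pd.DataFrame({"Inchi_key": ["KEY-A", "KEY-B"], "Smiles": ["CCO", "CCN"]})
toy_output = pd.DataFrame({"key": ["KEY-A"], "input": ["CCO"], "activity10": [0.12]})

print(missing_from_output(toy_input, toy_output))  # ['CCN']
```

An empty list means every sampled molecule has a row in the output. A similar one-liner, `output_df['activity10'].between(0, 1).all()`, would confirm that all predictions are valid probabilities.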

## Visualizing the predictions

### Molecules with highest predicted probability

In [None]:
# Sort the DataFrame by predicted probabilities in descending order
top_predictions = predictions_df.sort_values(by='activity10', ascending=False).head(10)

In [None]:
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='activity10', y='input', data=top_predictions, palette='viridis')

# Set plot labels and title
plt.xlabel('Predicted Probability')
plt.ylabel('SMILES')
plt.title('Top 10 Molecules with Highest Predicted Probabilities')

# Show the plot
plt.show()
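The top-10 plot shows only the extreme end of the predictions. To judge whether the model leans towards one class, it also helps to summarise all predictions at once. The sketch below counts how many molecules fall at or above a decision threshold. The threshold of 0.5 is an assumption made for illustration, and `summarise_predictions` is a hypothetical helper.

```python
import pandas as pd


def summarise_predictions(probabilities, threshold=0.5):
    """Summarise predicted probabilities against an assumed decision threshold."""
    probs = pd.Series(probabilities, dtype=float)
    if probs.empty:
        raise ValueError("no predictions to summarise")
    blockers = int((probs >= threshold).sum())
    return {
        "n": len(probs),
        "predicted_blockers": blockers,
        "fraction_blockers": blockers / len(probs),
        "mean_probability": float(probs.mean()),
    }


# Toy probabilities; in the notebook this would be predictions_df['activity10'].
summary = summarise_predictions([0.91, 0.12, 0.55, 0.40])
print(summary)
```

A `fraction_blockers` value far from the proportion expected in the sampled data would suggest that the model is biased towards one class on this dataset.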