# <center>Exercise: Data Analysis</center>

---

*Fill in the blanks* in the provided code blocks.

Please turn in the answers by **Feb 21, 2025**.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.options.display.precision = 3

1a- The pIC50 values are calculated from IC50 measurements in molar concentration (**M**):$$\mathrm{pIC}_{50}=-\log_{10}(\mathrm{IC}_{50})$$
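As a quick sanity check of the units, the following sketch applies the formula in the forward direction to a single made-up value (an IC50 of 1 μM, which is not taken from the data set). The helper name `pic50_from_ic50_molar` is hypothetical and only used for illustration.

```python
import math


def pic50_from_ic50_molar(ic50_molar: float) -> float:
    """Apply pIC50 = -log10(IC50), with IC50 given in mol/L (M)."""
    return -math.log10(ic50_molar)


# made-up example value: 1 uM, expressed in M before taking the logarithm
ic50_um = 1.0
ic50_molar = ic50_um * 1e-6
pic50 = pic50_from_ic50_molar(ic50_molar)
print(round(pic50, 3))  # 6.0
```

A smaller IC50 (a more potent compound) therefore corresponds to a larger pIC50.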

Convert the pIC50 values in the BACE-1 data set (url provided) back to IC50 values in *μM*.

```python
bace1_df = pd._____("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/bace.csv")
bace1_df["IC50 (uM)"] = bace1_df._____(lambda row: __________, axis=1)
bace1_activities = bace1_df.loc[:, ["IC50 (uM)", "pIC50"]]
```

Calculate the following statistical metrics for the IC50 and pIC50 values: range, variance, standard deviation, skewness, and kurtosis.

(15 points)

```python
# create a list for the function names
stats_name = ["max","min","var","std","skew","kurt"]
# initiate a dataframe to store the statistical values
bace1_stats = pd.DataFrame()
for stat in stats_name:
    # fill in the function which interprets string as code
    stat_val = ____(f"bace1_activities.{stat}(numeric_only=True)")
    # add the results to the dataframe
    bace1_stats[stat] = stat_val
# calculate the range using max & min
bace1_stats["range"] = bace1_stats.apply(lambda row: row[____]-row[____], axis=1)
display(bace1_stats)
```
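To build intuition for the shape metrics before running them on the real data, here is a minimal sketch on two made-up toy samples (not from the BACE-1 set). It calls the pandas methods directly rather than through the loop above. Note that pandas computes `var` and `std` with `ddof=1` by default, and that `kurt` returns the excess kurtosis (a normal distribution gives 0).

```python
import pandas as pd

# two made-up toy samples: one symmetric, one with a single large outlier
toy = pd.DataFrame(
    {
        "symmetric": [1, 2, 3, 4, 5],
        "right_skewed": [1, 1, 2, 2, 50],
    },
    dtype=float,
)

# one row per sample, one column per metric
summary = pd.DataFrame(
    {
        "var": toy.var(),
        "std": toy.std(),
        "skew": toy.skew(),
        "kurt": toy.kurt(),
    }
)
print(summary)
```

The symmetric sample has a skewness of zero, while the sample with the outlier has a clearly positive skewness.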

In [None]:
# paste answers and run



1b- Plot the IC50 and pIC50 values as separate histograms.

```python
# create a figure for two subplots
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))
# subplot 1: IC50 distribution
hist_IC50 = sns.histplot(data=______, x=______, bins=20, ax=axs[0])
# subplot 2: pIC50 distribution
hist_pIC50 = sns.histplot(data=______, x=______, bins=20, ax=axs[1])
# display the figure
plt.show()
```

Based on the data distributions, explain why pIC50 is used as the target variable over IC50 itself.

(15 points)

In [None]:
# paste answers and run



**Arguments for converting IC50 to pIC50 before modeling:**

--------------

2a- Read in the solubility data set (ESOL) from the [MoleculeNet website](https://moleculenet.org/datasets-1). This spreadsheet contains a number of relevant descriptors already.

```python
from rdkit.Chem import Descriptors, PandasTools

# read the spreadsheet from url
esol_df = pd.______("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv")
# examine the column names to identify the relevant ones
esol_df._____
```

For practice purposes, we will discard the provided features, and instead create a descriptor set for it using *RDKit*.

(20 points)

```python
# keep only two columns: SMILES and the experimental solubility values
esol_df = esol_df.loc[__, __]
# the original column name for solubility is quite long; rename it for simplicity
esol_df = esol_df.rename(columns={__________:"solubility"})

# add rdkit mol objects to the dataframe
PandasTools._________(esol_df, smilesCol=_____, molCol='mol_obj')
# remove empty cells
esol_df = esol_df.____()

# calculate all descriptors for all entries
# convert the rdkit mol object column to a list, to iterate over the items more efficiently
mols = esol_df.mol_obj.to_list()
# calculate the descriptors using a list comprehension
# this creates a list of dictionaries, which can be easily converted into a dataframe
esol_desc_list = [Descriptors._______(mol) for mol in mols]
esol_desc_df = pd.DataFrame(esol_desc_list)
# remove columns containing empty cells
esol_desc_df = esol_desc_df.____(axis=1)
display(esol_desc_df)
```

In [None]:
import sys
if ('google.colab' in sys.modules) and ('rdkit' not in sys.modules):
    !pip install rdkit

# hide the deprecation warnings from CalcMolDescriptors
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')

from rdkit.Chem import Descriptors, PandasTools

# paste answers and run



2b- Use *scikit-learn* to perform a train/test split of 70:30 on the solubility data set.

```python
from sklearn.model_selection import _________

# with dataframes as input for the split, the outputs are dataframes as well
feat_train, feat_test, y_train, y_test = ________(______, _______, test_size=___, random_state=63)
```

For the training set: calculate both the variance-covariance matrix and the correlation matrix. Include the experimental solubility values for this process.

```python
# combine (concatenate) the training set solubility column and descriptors into one dataframe
esol_train_yX = pd.concat([y_train, feat_train], axis=1)
# calculate and display the covariance matrix
esol_covmat = esol_train_yX.___()
display(esol_covmat)
# calculate and display the correlation matrix
esol_corrmat = esol_train_yX.___()
display(esol_corrmat)
```
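The two matrices are closely related: a correlation matrix is a covariance matrix rescaled by the standard deviations of the variables. The sketch below illustrates this with NumPy on synthetic data (the variables `x` and `y` are made up and unrelated to ESOL); it is meant as a conceptual check, not as the solution to the pandas template above.

```python
import numpy as np

# synthetic data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
data = np.column_stack([x, y])

# covariance matrix (columns are variables)
cov = np.cov(data, rowvar=False)
# rescale each entry by the product of the two standard deviations
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)

# compare with the correlation matrix computed directly
corr = np.corrcoef(data, rowvar=False)
print(np.allclose(corr_from_cov, corr))  # True

# changing the units of x changes the covariance but not the correlation
data_rescaled = data * np.array([1000.0, 1.0])
corr_rescaled = np.corrcoef(data_rescaled, rowvar=False)
print(np.allclose(corr_rescaled, corr))  # True
```

Because descriptors come in very different units and scales, the correlation matrix is usually the easier one to compare across features.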

Identify the 3 features with the highest absolute correlation to solubility.

(20 points)

```python
# sort by correlation to solubility, along both axes
# use key=abs to evaluate the absolute values
esol_corrmat = esol_corrmat.sort_values(by=_____, ascending=False, key=___, axis=0)
esol_corrmat = esol_corrmat.sort_values(by=_____, ascending=False, key=___, axis=1)
# display the top entries (and the target variable) of the sorted correlation matrix
esol_corrmat.____(4)
```
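The effect of `key=abs` is easiest to see on a small example. The sketch below sorts a made-up Series of correlation values (the feature names are hypothetical) with and without the key.

```python
import pandas as pd

# made-up correlation values for four hypothetical features
toy_corr = pd.Series(
    {"feat_a": -0.9, "feat_b": 0.5, "feat_c": 0.1, "feat_d": -0.7}
)

# plain descending sort: strong negative correlations end up last
plain = toy_corr.sort_values(ascending=False)
print(plain.index.to_list())   # ['feat_b', 'feat_c', 'feat_d', 'feat_a']

# sorting by absolute value ranks by strength, regardless of sign
ranked = toy_corr.sort_values(ascending=False, key=abs)
print(ranked.index.to_list())  # ['feat_a', 'feat_d', 'feat_b', 'feat_c']
```

Without the key, a strongly anti-correlated feature would be ranked at the bottom even though it may be highly informative.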

In [None]:
# paste answers and run



2c- Use a scatter plot matrix to visualize the relationships among solubility and the 3 most correlated features.

(5 points)

```python
# the top index values in the sorted correlation matrix are the relevant features
# retrieve these values and convert them into a list
visualize_col = esol_corrmat.index[___].to_list()
# retrieve the feature values from the original matrix
esol_train_plot = esol_train_yX.loc[:,visualize_col]
# a scatter plot matrix can be created using the pd.plotting.scatter_matrix function
# or the sns.pairplot function
scatterplt_mat = __________(esol_train_plot)
```

In [None]:
# paste answers and run



2d- This data set will be used again in future exercises. Use the *pickle* package to save the processed data (mandatory: variables `feat_train`, `feat_test`, `y_train`, `y_test`).

(5 points)

```python
import pickle

if 'google.colab' in sys.modules:
    # if using colab: save to google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    # save the pickle file to a folder on your drive
    pickle_path = "/content/gdrive/MyDrive/________/"
else:
    # if using jupyter notebook locally: save to local folder
    pickle_path = "______/"

# create a pickle file
pickle_out = open(pickle_path+"esol_processed.pkl", "__")
# create a list to store the variables to be exported
pickle_content = __ __ __ __ __ __ __
# save data to the pickle file
pickle._____(pickle_content, pickle_out)
pickle_out.close()
```
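To confirm that pandas objects survive pickling unchanged, the following sketch performs a round trip in memory with `pickle.dumps` and `pickle.loads` instead of writing a file. The toy DataFrame and Series are made up for illustration and stand in for the real training and test variables.

```python
import pickle

import pandas as pd

# made-up stand-ins for a feature table and a target column
toy_features = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
toy_target = pd.Series([0.1, 0.2], name="y")

# serialize a list of objects to bytes, then restore it
payload = pickle.dumps([toy_features, toy_target])
restored_features, restored_target = pickle.loads(payload)

# the restored objects are equal to the originals
print(restored_features.equals(toy_features))  # True
print(restored_target.equals(toy_target))      # True
```

The objects come back in the same order in which they were stored in the list, so it is worth noting that order when the file is reloaded in a later exercise.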

In [None]:
# paste answers and run



3- What is the difference between numeric and categorical data?

What kind of supervised learning method should be used to predict numeric target variables, and what kind for categorical target variables?

(10 points)

**Answers:**

------------

4- What are the minimum and maximum values of Pearson's correlation coefficient? What do they mean?

(10 points)

**Answers:**

------------