# <center>Exercise: k-Nearest Neighbours</center>

---

*Fill in the blanks* in the provided code blocks.

Please turn in your answers by **Feb 21, 2025**.

In [None]:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
if ('google.colab' in sys.modules) and ('rdkit' not in sys.modules):
    !pip install rdkit
from rdkit.Chem import Descriptors, PandasTools

# hide the warnings from CalcMolDescriptors
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.*')
```

1a- Read the "blood-brain barrier penetration (BBBP)" dataset from [MoleculeNet](https://moleculenet.org/datasets-1). This data set contains about 2000 small molecules, each labeled as able (1) or unable (0) to cross the blood-brain barrier.

Use the library *RDKit* to obtain topological descriptors for the molecules from their SMILES strings. Use your chemical knowledge to select 10 descriptors.

([List of descriptors available through RDKit](https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors))

```python
bbbp_df = pd.read_csv("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/BBBP.csv")
# keep only the names, SMILES, and the labels from the original spreadsheet
bbbp_df = bbbp_df.loc[:,["name","smiles","p_np"]]
# add the rdkit mol objects to the dataframe
PandasTools._______(bbbp_df, smilesCol=_____, molCol='mol_obj')
# remove entries containing empty cells
bbbp_df = bbbp_df._____()
# reset the indices after removing entries (drop=True discards the old index instead of keeping it as a column)
bbbp_df = bbbp_df.reset_index(drop=True)

# calculate all the descriptors
bbbp_desc_list = [Descriptors._______(mol) for mol in _________]
# convert the resulting list of dictionaries to a dataframe
bbbp_desc_df = pd.DataFrame(bbbp_desc_list)
# remove the columns containing empty cells
bbbp_desc_df = bbbp_desc_df._____(axis=1)

# create a list with the names of the selected descriptors
desc_selection = [__ __ __ __ __ __ __]
# retrieve the 10 selected descriptors
bbbp_desc_sel = bbbp_desc_df.loc[:, desc_selection]
```

This data set will be used again in future exercises. Combine the names, SMILES, labels, and all calculated descriptors into one dataframe, and export it to a `.csv` file.

```python
if 'google.colab' in sys.modules:
    # if using colab: save to google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    # save the file to a folder on your drive
    out_path = "/content/gdrive/MyDrive/_______/"
else:
    # if using jupyter notebook locally: save to local folder
    out_path = "./"

# combine the information into a single dataframe
bbbp_out = pd.concat([bbbp_df.loc[:, ["name", "smiles", "p_np"]], bbbp_desc_df], axis=1)

# export to a csv file
bbbp_out.______(out_path+"bbbp_data.csv")
```

(20 points)


In [None]:
```python
# paste answers and run
```


1b- Using the package *scikit-learn*, perform a 70:30 train-test split on the 10 selected descriptors (features) and the "p_np" labels, then standardize the features.

```python
from sklearn.model_selection import _________
from sklearn.preprocessing import ________

# train/test split
feat_train, feat_test, lbl_train, lbl_test = ________(_________,
                                                      _________,
                                                      test_size=___,
                                                      random_state=29)
# standardization
scaler = ________().fit(________)
X_train = scaler.________(feat_train)
X_test = scaler.________(feat_test)
```
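
The template above fits the scaler on the training features and then applies it to both sets. The reasoning can be illustrated with a minimal NumPy sketch. The arrays `toy_train` and `toy_test` are made up for illustration and are not part of the exercise data, and the sketch assumes plain z-score standardization (subtract the mean, divide by the standard deviation).

```python
import numpy as np

# hypothetical toy features (rows = samples, columns = features)
toy_train = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# the statistics are estimated from the training set only
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)

# the same shift and scale are then applied to both sets
toy_train_scaled = (toy_train - mu) / sigma
toy_test_scaled = (toy_test - mu) / sigma

print(toy_train_scaled.mean(axis=0))  # approximately 0 for every feature
print(toy_train_scaled.std(axis=0))   # approximately 1 for every feature
print(toy_test_scaled)                # test values are not forced to mean 0 / std 1
```

Because the test set only reuses `mu` and `sigma`, no information from the test samples leaks into the preprocessing step.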

(10 points)

In [None]:
```python
# paste answers and run
```


2a- Use this processed data set with the 10 selected features and the kNN algorithm (k=2) to predict the blood-brain barrier penetration labels ("p_np").

Calculate the accuracy, specificity and sensitivity for the test set.

```python
# import the relevant functions
from sklearn.neighbors import ________
from sklearn.metrics import ________, ________

# build the model
clf = _________(_______=2).___(_______, lbl_train)

# make predictions
pred_train = clf.________(_______)
pred_test = clf.________(_______)

# calculate the accuracy
acc_test = ________(______, _____)
# sensitivity is the recall of the positive class
sens_test = ________(lbl_test, pred_test, pos_label=__)
# specificity is the recall of the negative class
spec_test = ________(lbl_test, pred_test, pos_label=__)

# display the results
print("Test set accuracy: {:.2f}".format(acc_test))
print("Test set sensitivity: {:.2f}".format(sens_test))
print("Test set specificity: {:.2f}".format(spec_test))
```

Plot the confusion matrix for both the train and test sets.

```python
from sklearn.metrics import _________, _________

print("confusion matrix: train")
# calculate the confusion matrix
cm_train = _________(________, ________)
# create the confusion matrix plot
disp_cm_train = _________(confusion_matrix=cm_train)
disp_cm_train.plot(cmap='viridis')
plt.show()

print("\nconfusion matrix: test")
cm_test = _________(________, ________)
disp_cm_test = __________(confusion_matrix=cm_test)
disp_cm_test.plot(cmap='viridis')
plt.show()
```
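
To connect the confusion matrix to the three metrics, the four cell counts can be tallied by hand. The sketch below uses made-up toy labels (`y_true`, `y_pred`), not the BBBP data, and arranges the counts with rows as true labels and columns as predicted labels, which is the layout scikit-learn uses for binary 0/1 labels.

```python
import numpy as np

# hypothetical toy labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0])

# count the four cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# rows = true label (0, 1), columns = predicted label (0, 1)
cm = np.array([[tn, fp],
               [fn, tp]])
print(cm)

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)  # recall of the positive class
specificity = tn / (tn + fp)  # recall of the negative class

print("accuracy: {:.2f}".format(accuracy))        # accuracy: 0.70
print("sensitivity: {:.2f}".format(sensitivity))  # sensitivity: 0.67
print("specificity: {:.2f}".format(specificity))  # specificity: 0.75
```

Comparing hand-computed values like these with the library output is a quick way to check that `pos_label` was set for the intended class.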

(20 points)

In [None]:
```python
# paste answers and run
```


2b- Evaluate how the number of neighbors (k from 1 to 15) affects the accuracy and sensitivity on the training and test sets.

```python
# numbers of neighbors to be tested
neighbors_settings = range(__, __)

acc_train = []
sens_train = []
acc_test = []
sens_test = []
for n_neighbors in neighbors_settings:
    clf = ___________(_______=n_neighbors).___(______, ______)
    pred_train = clf._______(X_train)
    pred_test = clf._______(X_test)
    # append the accuracy scores to the list
    acc_train._____(________(lbl_train, pred_train))
    acc_test._____(________(lbl_test, pred_test))
    # append the sensitivity scores
    sens_train._____(________(lbl_train, pred_train, pos_label=1))
    sens_test._____(________(lbl_test, pred_test, pos_label=1))
```
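
For intuition about what changes when the number of neighbors is varied, the voting step of kNN can be written out in a few lines of NumPy. This is only a sketch under stated assumptions: Euclidean distance, unweighted majority vote, and binary 0/1 labels. The helper `knn_predict` and the toy arrays are hypothetical and are not a replacement for the scikit-learn classifier used in the exercise, whose tie handling may differ.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    preds = []
    for x in query_X:
        # Euclidean distance from the query point to every training point
        dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
        # indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # count the votes per class; a tie resolves to the smaller label here
        votes = np.bincount(train_y[nearest], minlength=2)
        preds.append(int(np.argmax(votes)))
    return np.array(preds)

# hypothetical toy data: two well-separated clusters
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_queries = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_predict(toy_X, toy_y, toy_queries, k=3))  # → [0 1]
```

Calling `knn_predict` with different values of `k` on your own toy points shows how the size of the voting neighborhood changes the prediction.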

Plot the graphs of accuracy vs number of neighbors, and sensitivity vs number of neighbors.

```python
disp_acc = plt.figure(figsize=(5,4))
# add line plots for train & test sets
plt.____(_________, _________, label="train")
plt.____(_________, _________, label="test")
plt.title("Accuracy")
plt.legend()
plt.show()

disp_sens = plt.figure(figsize=(5,4))
plt.____(_________, _________, label="train")
plt.____(_________, _________, label="test")
plt.title("Sensitivity")
plt.legend()
plt.show()
```

(20 points)

In [None]:
```python
# paste answers and run
```


3- What is the main parameter to be adjusted in the kNN algorithm?

(10 points)

**Answers:**

------------

4- What happens in classification when k=1?

What happens in classification when k is equal to the number of data points in the training set?

(20 points)

**Answers:**

------------