#  <center> Problem Set 4 </center>

<center> 3.C01/3.C51, 10.C01/10.C51 </center>

<b>Name:</b>

<b>Kerberos id:</b>

### Download required data & install packages

In [None]:
!wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps4/data/nonbio_version/drug.csv
!pip install molvs
!pip install rdkit

In [None]:
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem import Draw
from rdkit.Chem import AllChem
from molvs import standardize_smiles

## Part 1: Dimensionality Reduction for Molecular Representations

The following cell may raise a deprecation warning, which can be ignored.

In [None]:
################ Run #################

# convert SMILES strings to Morgan fingerprints with rdkit
df = pd.read_csv("drug.csv")
radius = 3
num_bits = 512

class ECFP:
    def __init__(self, smiles):
        self.mols = [Chem.MolFromSmiles(i) for i in smiles]
        self.smiles = smiles

    def mol2fp(self, mol):
        bi = {}
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=radius,
                                                   bitInfo=bi, nBits=num_bits)
        array = np.zeros((1,))
        DataStructs.ConvertToNumpyArray(fp, array)

        return array, bi

    def compute_ECFP(self):
        bit_headers = ["bit" + str(i) for i in range(num_bits)]
        arr = np.empty((0,num_bits), int).astype(int)
        bitInfo_all = []
        mol_all = []

        for i in self.mols:
            mol_all.append(i)
            fp, bi = self.mol2fp(i)
            arr = np.vstack((arr, fp))
            bitInfo_all.append(bi)

        df_fp = pd.DataFrame(np.asarray(arr).astype(int),columns=bit_headers)
        df_fp.insert(loc=0, column="smiles", value=self.smiles)
        df_fp.insert(loc=1, column="mol", value=mol_all)
        df_fp.insert(loc=2, column="bitInfo", value=bitInfo_all)
        return df_fp

smiles_standardized = [standardize_smiles(i) for i in df["SMILES"].values]
fp_descriptor = ECFP(smiles_standardized)
fp = fp_descriptor.compute_ECFP()

# keep only the bit columns: SMILES are referenced from the df dataframe,
# and the mol and bitInfo columns are not needed below
fp = fp.drop(columns=["smiles", "mol", "bitInfo"]).values.astype(float)

################ Run #################

The resulting array, `fp`, contains the 512 bits (columns) making up the fingerprints for the 4,629 molecules (rows).

### 1.1 (5 points, Grad only) Choosing radius and number of bits for Morgan fingerprints

Provide a one-sentence description of what the radius represents and another of what the number of bits represents. How does adjusting the radius parameter affect the granularity of the motifs captured by the fingerprints, and how does this relate to the choice of the number of bits?

In [None]:
################ Solution #################

"""
Radius is the maximum distance (in bonds) from each central atom within which
neighboring atoms are included in that atom's substructure.

Number of bits is the length of the fingerprint vector into which the hashed
substructures are folded.

Adjusting the radius changes the granularity of the motifs captured by the
fingerprints by setting the topological extent of the substructures
considered. A larger radius includes atoms that are further from the
central atom, capturing larger and more specific molecular features.
This relates to the choice of the number of bits because a larger radius
increases the number of unique substructures encountered, so more bits are
needed to keep distinct substructures from colliding on the same bit.
"""

################ Solution #################
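
The link between radius and number of bits comes down to bit collisions. The following is a minimal sketch, not RDKit's actual hashing: it assumes substructure identifiers behave like random integers that are folded into the bit vector by taking them modulo the vector length. The function name `count_collisions` and the feature count of 200 are hypothetical choices for illustration.

```python
import numpy as np

def count_collisions(n_features, n_bits, seed=0):
    """Fold distinct integer identifiers into n_bits positions and
    return how many of them land on an already occupied bit."""
    rng = np.random.default_rng(seed)
    # Stand-in for substructure hashes (assumption: random integers).
    ids = np.unique(rng.integers(0, 2**32, size=n_features))
    occupied_bits = np.unique(ids % n_bits)
    return int(ids.size - occupied_bits.size)

# Same identifiers, two vector lengths: the shorter vector collides more.
for n_bits in (512, 2048):
    print(n_bits, count_collisions(n_features=200, n_bits=n_bits))
```

Because 512 divides 2048, any two identifiers that collide in the 2048-bit vector also collide in the 512-bit one, so the shorter vector can never have fewer collisions for the same set of identifiers.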

### 1.2 (10 points) Principal Component Analysis on Molecular Fingerprints

Perform PCA to reduce data into vectors of 100 dimensions.
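
As a rough illustration of the call pattern only, the sketch below runs scikit-learn's `PCA` on a random binary matrix standing in for the fingerprints. The names `X_demo`, `pca_demo`, and `X_demo_pca` are placeholders and the data are synthetic, so the output says nothing about the drug dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder for a fingerprint matrix: 300 "molecules" x 512 bits.
X_demo = rng.integers(0, 2, size=(300, 512)).astype(float)

# Project onto the first 100 principal components.
pca_demo = PCA(n_components=100)
X_demo_pca = pca_demo.fit_transform(X_demo)

print(X_demo_pca.shape)
```

The first two columns of the transformed array are the PC1 and PC2 coordinates used for a 2D scatter plot.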

In [None]:
################ Solution #################

# perform PCA


# skeleton code for plotting
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(, , s=3, label="inactive")
ax.scatter(, , color="red", s=3, label="active")
ax.legend()
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
plt.show()

################ Solution #################

What is the explained variance ratio of the 100 principal components?
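
For reference, the explained variance ratio of a component is its variance divided by the total variance of the centered data. The sketch below computes it directly from the singular values with NumPy on synthetic placeholder data (`X_demo` is hypothetical); a fitted scikit-learn `PCA` object exposes the same per-component quantity as `explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder data: 300 samples x 512 binary features.
X_demo = rng.integers(0, 2, size=(300, 512)).astype(float)

# Center the data, then take singular values (sorted in descending order).
X_centered = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance along each component is proportional to the squared singular value.
ratio = singular_values**2 / np.sum(singular_values**2)
top100 = ratio[:100].sum()

print(f"Fraction of variance explained by the first 100 components: {top100:.3f}")
```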

In [None]:
################ Solution #################



################ Solution #################

What patterns do you observe (if any)?

In [None]:
################ Solution #################



################ Solution #################

### 1.3 (10 points) t-SNE analysis on Molecular Fingerprints

Perform t-SNE on the obtained principal components, with perplexity values of 2, 30, and 500. Plot the results and label your plots.
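
One way to organize the runs is a small helper that loops over perplexities, sketched here on synthetic data. The helper name `tsne_embeddings` and the array `X_demo` are hypothetical, and the demo is scaled down to 120 points with perplexities 2 and 30 so it runs quickly; recent scikit-learn versions require the perplexity to be smaller than the number of samples, so a value of 500 only works on a sufficiently large dataset.

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_embeddings(X, perplexities, seed=0):
    """Return a dict mapping each perplexity to its 2D t-SNE embedding."""
    embeddings = {}
    for p in perplexities:
        tsne = TSNE(n_components=2, perplexity=p, random_state=seed)
        embeddings[p] = tsne.fit_transform(X)
    return embeddings

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 20))  # placeholder for PCA-reduced data
demo = tsne_embeddings(X_demo, perplexities=(2, 30))

for p, emb in demo.items():
    print(p, emb.shape)
```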

In [None]:
################ Solution #################



################ Solution #################

What differences do you see between the 3 t-SNE plots? What patterns do you observe in the perplexity = 30 plot?

In [None]:
################ Solution #################



################ Solution #################

### 1.4 (20 points) Are the low dimensional embeddings meaningful?

Split the data into 10 folds. For each fold, train on the other 9 folds and validate on the held-out fold. Record your predictions.
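
The out-of-fold bookkeeping can be sketched as follows. This assumes a random forest classifier (suggested by the imports at the top of the notebook, but not specified by the problem) and uses synthetic features and labels; `X_demo`, `y_demo`, and `y_pred` are placeholder names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Synthetic stand-ins for features and binary activity labels.
X_demo = rng.normal(size=(200, 10))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Each sample is predicted exactly once, by a model that never saw it.
y_pred = np.empty_like(y_demo)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_demo):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred[val_idx] = clf.predict(X_demo[val_idx])

print("out-of-fold accuracy:", (y_pred == y_demo).mean())
```

Writing predictions into a preallocated array indexed by `val_idx` keeps them aligned with the original row order, which makes later per-molecule comparisons straightforward.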

In [None]:
################ Solution #################



################ Solution #################

Classify your predictions into True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN).
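
The four classes follow directly from comparing true and predicted labels. A minimal NumPy sketch, with a hypothetical helper name `confusion_classes` and a toy input:

```python
import numpy as np

def confusion_classes(y_true, y_pred):
    """Label each sample as TP, TN, FP, or FN (1 = active, 0 = inactive)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    classes = np.empty(y_true.shape, dtype="<U2")
    classes[y_true & y_pred] = "TP"    # active, predicted active
    classes[~y_true & ~y_pred] = "TN"  # inactive, predicted inactive
    classes[~y_true & y_pred] = "FP"   # inactive, predicted active
    classes[y_true & ~y_pred] = "FN"   # active, predicted inactive
    return classes

example = confusion_classes([1, 0, 0, 1], [1, 0, 1, 0])
print(example)  # ['TP' 'TN' 'FP' 'FN']
```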

In [None]:
################ Solution #################



################ Solution #################

Plot the 2D t-SNE embeddings (perplexity = 30) colored by the four classification classes.

In [None]:
################ Solution #################



################ Solution #################

What pattern do you observe?

In [None]:
################ Solution #################



################ Solution #################

### 1.5 (10 points) UMAP analysis on Molecular Fingerprints

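Perform UMAP on the obtained principal components and store the 2D embedding in `pca_umap`, which later cells reference. UMAP is not among the packages installed above; it is provided by the `umap-learn` package. Plot the results.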

In [None]:
################ Solution #################



################ Solution #################

### 1.6 (15 points) Visualize latent clusters for structure similarity

First run DBSCAN (available as `sklearn.cluster.DBSCAN`) on only the active molecules, store the cluster assignments in `labels`, and visualize with labels.
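
As a sketch of the clustering call, the example below runs scikit-learn's `DBSCAN` on three synthetic 2D blobs standing in for an embedding of the active molecules. The names `points` and `labels_demo` and the values of `eps` and `min_samples` are illustrative assumptions; suitable values for the real embedding depend on its scale and density.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of 50 points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centers])

# fit_predict returns one integer label per point; -1 marks noise.
labels_demo = DBSCAN(eps=0.6, min_samples=5).fit_predict(points)
n_clusters = len(set(labels_demo)) - (1 if -1 in labels_demo else 0)

print("clusters found:", n_clusters)
```

Points labeled `-1` are treated as noise rather than as a cluster, which matters when picking a cluster label to inspect.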

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))

################ Solution #################

labels = None

################ Solution #################

ax.scatter(pca_umap[df.is_active.values == 1][:, 0],
           pca_umap[df.is_active.values == 1][:, 1],
           c=labels)

# annotate the first point of each cluster with its label
for label in np.unique(labels):
    idx = np.flatnonzero(labels == label)[0]    # scalar index of first member
    ax.text(pca_umap[df.is_active.values == 1][idx, 0],
            pca_umap[df.is_active.values == 1][idx, 1],
            str(label), size=20)

Now pick one of the clusters, by setting the label, and visualize.

In [None]:
################ Run #################

label = 5
cluster = df.SMILES[df.is_active.values == 1][labels == label]

# visualize all molecules in cluster
mol_list = []
for mol in cluster:
    mol_list.append(Chem.MolFromSmiles(mol))
Draw.MolsToGridImage(mol_list)

################ Run #################

Comment on the similarity of structures within the cluster. Explore by trying a few different clusters.

In [None]:
################ Solution #################



################ Solution #################