# Notebook 2: Machine Learning for Drug Discovery

In this notebook, you will use **machine learning** to predict whether molecules can block the Estrogen Receptor alpha (ERα).

## What is Machine Learning?

Machine learning (ML) is when a computer learns patterns from data, without being explicitly programmed. Think of it like this:

- **Training**: You show the computer thousands of molecules and tell it which ones are "active" (they block the target) and which ones are "inactive" (they don't).
- **Predicting**: After seeing enough examples, the computer learns the patterns and can predict whether a *brand new* molecule will be active or not — even one that has never been tested in a lab!

It's similar to how you learn to recognise cats vs. dogs from photos — after seeing enough examples, you can tell them apart even in photos you've never seen before.

## Why is this useful for drug discovery?

Testing molecules in the lab is slow and expensive. With ML, we can:
- Screen **millions** of molecules on a computer in minutes
- Focus lab experiments on the most promising candidates
- Save years of work and millions of euros

## What will we do in this notebook?

1. **Download** real experimental data on ERα from the ChEMBL database
2. **Encode** molecules into numbers ("fingerprints") that the computer can understand
3. **Train** 3 different ML models: Random Forest, SVM, and Neural Network
4. **Evaluate** how well the models perform (are they actually learning something useful?)
5. **Predict** the activity of new molecules you choose!

---

## Step 0: Install required packages

In [None]:
%%capture
!pip install rdkit scikit-learn chembl_webresource_client tqdm

In [None]:
import os
import time
from warnings import filterwarnings

import pandas as pd
import numpy as np
from sklearn import svm, metrics, clone
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import (
    accuracy_score, recall_score, matthews_corrcoef,
    roc_curve, roc_auc_score, mean_absolute_error, mean_squared_error
)
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import MACCSkeys, AllChem
from tqdm.auto import tqdm

filterwarnings('ignore')
SEED = 22

## Step 1: Download ERα data from ChEMBL

[ChEMBL](https://www.ebi.ac.uk/chembl/) is a huge public database maintained by the European Bioinformatics Institute (EBI). It contains millions of experimental measurements: researchers around the world test molecules in the lab and upload their results to ChEMBL.

We'll download all the binding data for ERα automatically using Python — no manual downloading needed!

> **What's happening here?** We're asking ChEMBL: *"Give me all molecules that have been tested for binding to ERα (target CHEMBL206), where the measurement type is IC50 and it's a direct binding assay."*
>
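> **IC50** = the concentration of a compound needed to reduce the measured signal (for example, binding of a reference ligand to the protein) by 50%. Lower IC50 = more potent compound.
>
> **pChEMBL** = a standardized potency value on a log scale: the negative base-10 logarithm of the IC50 in mol/L (so an IC50 of 100 nM gives a pChEMBL of 7). Higher pChEMBL = stronger binding. Think of it like a "binding score" where bigger is better.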
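
To make the log scale concrete, here is a minimal sketch of the conversion, assuming the IC50 is given in nanomolar and pChEMBL is computed as the negative base-10 logarithm of the molar value. The function name `ic50_nm_to_pchembl` is made up for this illustration; ChEMBL already provides `pchembl_value`, so the notebook itself never needs this step.

```python
import math

def ic50_nm_to_pchembl(ic50_nm):
    """Convert an IC50 in nanomolar to a pChEMBL-style value: -log10(IC50 in mol/L)."""
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_molar)

# A 100-fold drop in IC50 raises the value by 2 units
for ic50_nm in (1, 100, 10_000):
    print(f"IC50 = {ic50_nm:>6} nM  ->  pChEMBL = {ic50_nm_to_pchembl(ic50_nm):.1f}")
```

Each step of 1 on this scale is a 10-fold change in concentration, which is why small differences in pChEMBL can matter a lot.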

In [None]:
from chembl_webresource_client.new_client import new_client

CHEMBL_TARGET = 'CHEMBL206'  # ChEMBL ID for Estrogen Receptor alpha

# Fetch bioactivity data
activity = new_client.activity
results = activity.filter(
    target_chembl_id=CHEMBL_TARGET,
    type='IC50',                      # IC50 is a common measure of drug potency
    relation='=',                     # Only exact measurements
    assay_type='B'                    # Binding assays only
).only([
    'molecule_chembl_id', 'canonical_smiles', 'pchembl_value', 'assay_type',
    'standard_relation', 'standard_value'
])

# Convert to a pandas DataFrame
data = pd.DataFrame(results)
print(f"Downloaded {len(data)} datapoints from ChEMBL for {CHEMBL_TARGET}")
data.head()

## Step 2: Clean the data

Real-world data is always messy! Some entries might have missing values, duplicates, or inconsistent measurements. Before we can train a model, we need to tidy things up:

- Keep only the columns we need (molecule ID, SMILES, pChEMBL value)
- Remove entries with missing values
- Remove duplicate molecules
- Label each molecule as **active** or **inactive**

> **What's happening here?**
>
> We set an activity threshold: **pChEMBL ≥ 6.5 → active**. This corresponds to an IC50 of roughly 300 nM (nanomolar) or lower, which is the ballpark for a useful drug candidate.
>
> **SMILES** is a way to write chemical structures as text. For example, `CCO` is ethanol (the alcohol in drinks). Every molecule in ChEMBL has a SMILES string that describes its structure.
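
The threshold can be checked with a few lines of plain Python. This is a sketch that assumes pChEMBL is the negative base-10 logarithm of the IC50 in mol/L; the helper names `pchembl_to_ic50_nm` and `label_activity` are hypothetical and only mirror what the cleaning cell does with pandas.

```python
ACTIVITY_THRESHOLD = 6.5  # same cut-off the notebook uses

def pchembl_to_ic50_nm(pchembl):
    """Invert pChEMBL = -log10(IC50 in mol/L) and return the IC50 in nanomolar."""
    return 10 ** (9 - pchembl)

def label_activity(pchembl, threshold=ACTIVITY_THRESHOLD):
    """Return 1.0 for active (at or above the threshold), otherwise 0.0."""
    return 1.0 if pchembl >= threshold else 0.0

print(f"pChEMBL {ACTIVITY_THRESHOLD} is an IC50 of about "
      f"{pchembl_to_ic50_nm(ACTIVITY_THRESHOLD):.0f} nM")

for value in (5.0, 6.5, 8.2):
    print(f"pChEMBL {value} -> label {label_activity(value)}")
```

Note that a molecule sitting exactly on the threshold counts as active, because the comparison is "greater than or equal to".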

In [None]:
# Rename columns for convenience
pd_data = data[['molecule_chembl_id', 'pchembl_value', 'canonical_smiles']].copy()
pd_data.columns = ['Molecule_ChEMBL_ID', 'pChEMBL_value', 'Smiles']

# Convert pChEMBL to numeric (some entries may be text)
pd_data['pChEMBL_value'] = pd.to_numeric(pd_data['pChEMBL_value'], errors='coerce')

# Drop rows with missing values
pd_data.dropna(subset=['pChEMBL_value', 'Smiles'], inplace=True)

# Remove duplicates (keep the first measurement for each molecule)
pd_data.drop_duplicates(subset='Smiles', keep='first', inplace=True)
pd_data.reset_index(drop=True, inplace=True)

# Add activity label: active if pChEMBL >= 6.5
pd_data['active'] = (pd_data['pChEMBL_value'] >= 6.5).astype(float)

print(f"After cleaning: {len(pd_data)} molecules")
print(f"  Active molecules:   {int(pd_data.active.sum())}")
print(f"  Inactive molecules: {len(pd_data) - int(pd_data.active.sum())}")

## Step 3: Convert molecules to fingerprints

Computers can't look at a chemical drawing and understand it. We need to convert each molecule into a list of numbers — a **molecular fingerprint**.

> **What's happening here?**
>
> Think of a fingerprint as a **barcode** for a molecule. It's a long list of 0s and 1s, where each position answers a yes/no question about the molecule's structure:
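> - One position might ask: "Does this molecule contain a nitrogen-hydrogen bond?" → 1 (yes) or 0 (no)
> - Another might ask: "Does this molecule have a six-membered ring?" → 1 or 0
> - ...and so on for 166 predefined questions (MACCS keys, which RDKit stores as 167 bits because bit 0 is unused) or 2048 hashed positions that encode the neighbourhood around each atom (Morgan fingerprints)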
>
> **Why does this work?** Molecules with similar structures tend to have similar fingerprints, and molecules with similar structures tend to have similar biological activity. So the ML model can learn: *"fingerprints that look like THIS tend to be active."*
>
> We start with **MACCS keys** (167 bits, simple) and later switch to **Morgan fingerprints** (2048 bits, more detailed) for better accuracy.
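
One common way to put a number on "similar fingerprints" is the Tanimoto coefficient. The notebook does not use it directly, so treat the following as an illustrative sketch: the 8-bit fingerprints are invented toy examples, not real MACCS keys.

```python
def tanimoto(fp_a, fp_b):
    """Bits that are on in both fingerprints divided by bits that are on in either."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

# Toy 8-bit fingerprints (made up for illustration)
mol_a = [1, 1, 0, 1, 0, 0, 1, 0]
mol_b = [1, 1, 0, 1, 0, 0, 0, 0]  # differs from mol_a in a single bit
mol_c = [0, 0, 1, 0, 1, 1, 0, 0]  # shares no "on" bits with mol_a

print(tanimoto(mol_a, mol_b))  # 0.75
print(tanimoto(mol_a, mol_c))  # 0.0
```

A value of 1.0 means identical fingerprints and 0.0 means nothing in common, so `mol_b` would be considered a close relative of `mol_a` while `mol_c` would not.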

In [None]:
def smiles_to_fp(smiles, method='maccs', n_bits=2048):
    """Convert a SMILES string to a molecular fingerprint."""
    try:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return np.nan
    except Exception:  # e.g. the input is not a string
        return np.nan

    if method == 'maccs':
        return list(MACCSkeys.GenMACCSKeys(mol))
    elif method == 'morgan2':
        fp_gen = AllChem.GetMorganGenerator(radius=2, fpSize=n_bits)
        return list(fp_gen.GetFingerprint(mol))
    elif method == 'morgan3':
        fp_gen = AllChem.GetMorganGenerator(radius=3, fpSize=n_bits)
        return list(fp_gen.GetFingerprint(mol))
    else:
        return list(MACCSkeys.GenMACCSKeys(mol))

In [None]:
# Generate fingerprints for all molecules
tqdm.pandas()
compound_df = pd_data.copy()
compound_df['fp'] = compound_df['Smiles'].progress_apply(smiles_to_fp)
compound_df.dropna(subset=['fp'], inplace=True)
compound_df.reset_index(drop=True, inplace=True)

print(f"Generated fingerprints for {len(compound_df)} molecules")

## Step 4: Split the data into training and test sets

We split our data into two parts:
- **Training set** (80%): used to teach the model
- **Test set** (20%): used to check if the model actually learned something useful

> **What's happening here?** It's like studying for an exam with practice problems (training), and then taking the real exam (test) with questions you haven't seen before. If you score well on the exam, you actually understood the material!
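
Under the hood, a random split is little more than "shuffle, then cut". The sketch below shows the idea with the standard library only. The function `holdout_split` is a hypothetical stand-in and not how scikit-learn's `train_test_split` is implemented, so the exact molecules that land in each set will differ.

```python
import random

def holdout_split(items, test_fraction=0.2, seed=22):
    """Shuffle a copy of the items and set the first part aside as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

molecule_ids = [f"mol_{i}" for i in range(10)]
train_ids, test_ids = holdout_split(molecule_ids)

print("train:", train_ids)
print("test: ", test_ids)
```

The important property is that no molecule appears in both sets, otherwise the "exam" would contain questions the model has already seen.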

In [None]:
fingerprints = np.array(compound_df.fp.tolist())
labels = compound_df.active.tolist()

train_x, test_x, train_y, test_y = train_test_split(
    fingerprints, labels, test_size=0.2, random_state=SEED
)

print(f"Training set: {len(train_x)} molecules")
print(f"Test set:     {len(test_x)} molecules")

## Step 5: Train Machine Learning models

Now the fun part! We'll try three completely different ML algorithms and see which one works best. Each algorithm learns in its own way:

1. **Random Forest (RF)**: Imagine 100 people each making a flowchart of yes/no questions about a molecule (e.g., "Does it have a ring? → yes → Does it have nitrogen? → yes → probably active"). Each person makes a slightly different flowchart. The final prediction is whatever the majority votes for. That's a Random Forest — an "ensemble" of decision trees.

2. **Support Vector Machine (SVM)**: Imagine plotting all molecules in a high-dimensional space (one axis per fingerprint bit). SVM finds the best "dividing surface" that separates active molecules on one side from inactive ones on the other.

3. **Neural Network (ANN)**: Loosely inspired by the brain. Data flows through layers of interconnected "neurons", where each neuron applies a mathematical function. The network adjusts its connections during training until it gets good at the task.

### Helper functions

These functions help us measure how good each model is and plot the results.
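
To see what the metrics mean, it can help to compute them once by hand from the four possible outcomes (true/false positives and negatives). The sketch below uses made-up labels and assumes both classes are present, so no denominator is zero. In the notebook, scikit-learn does this work for us.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true positives, true negatives, false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def summarize(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # share of actives that were found
    specificity = tn / (tn + fp)   # share of inactives that were rejected
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }

# Toy example: 4 actives and 6 inactives (invented labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

report = summarize(y_true, y_pred)
for name, value in report.items():
    print(f"{name:18s} {value:.2f}")
```

Here the model finds 3 of 4 actives (sensitivity 0.75) but wrongly flags 2 of 6 inactives (specificity 0.67), which a single accuracy number of 0.70 would hide.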

In [None]:
def model_performance(ml_model, test_x, test_y, verbose=True):
    """Calculate and print how well a model performs."""
    test_prob = ml_model.predict_proba(test_x)[:, 1]
    test_pred = ml_model.predict(test_x)

    accuracy = accuracy_score(test_y, test_pred)
    sens = recall_score(test_y, test_pred)
    spec = recall_score(test_y, test_pred, pos_label=0)
    auc = roc_auc_score(test_y, test_prob)
    bal_accuracy = (sens + spec) / 2
    mcc = matthews_corrcoef(test_y, test_pred)

    if verbose:
        print(f"  Accuracy:     {accuracy:.2f}")
        print(f"  Sensitivity:  {sens:.2f}  (how well it finds active molecules)")
        print(f"  Specificity:  {spec:.2f}  (how well it rejects inactive molecules)")
        print(f"  AUC:          {auc:.2f}  (overall performance, 1.0 = perfect)")

    return accuracy, sens, spec, auc, bal_accuracy, mcc


def plot_roc_curves(models, test_x, test_y):
    """Plot ROC curves for all trained models."""
    fig, ax = plt.subplots(figsize=(7, 5))

    for model_info in models:
        ml_model = model_info['model']
        test_prob = ml_model.predict_proba(test_x)[:, 1]
        fpr, tpr, _ = roc_curve(test_y, test_prob)
        auc_score = roc_auc_score(test_y, test_prob)
        ax.plot(fpr, tpr, label=f"{model_info['label']} (AUC = {auc_score:.2f})")

    ax.plot([0, 1], [0, 1], 'r--', label='Random guessing')
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('ROC Curves: How good is each model?')
    ax.legend(loc='lower right')
    plt.tight_layout()
    plt.show()

### Model 1: Random Forest

The Random Forest builds 100 decision trees, each trained on a slightly different random subset of the data. This makes it robust — even if one tree makes a mistake, the majority vote is usually correct.
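
The voting idea can be sketched with three hand-written "trees". Everything here is invented for illustration: real trees are learned from the training data, and the bit positions below do not correspond to actual MACCS keys.

```python
from collections import Counter

# Three toy "trees", each asking different yes/no questions about a 4-bit fingerprint
def tree_1(fp):
    return "active" if fp[0] and fp[2] else "inactive"

def tree_2(fp):
    return "active" if fp[1] else "inactive"

def tree_3(fp):
    return "active" if fp[2] and not fp[3] else "inactive"

TREES = [tree_1, tree_2, tree_3]

def forest_predict(fp, trees):
    """Collect one vote per tree and return the majority label plus the vote counts."""
    votes = Counter(tree(fp) for tree in trees)
    return votes.most_common(1)[0][0], dict(votes)

label, votes = forest_predict([1, 0, 1, 0], TREES)
print(label, votes)  # active {'active': 2, 'inactive': 1}
```

Even though `tree_2` disagrees, the forest still answers "active". scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes, but the intuition is the same.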

In [None]:
# Train a Random Forest
model_RF = RandomForestClassifier(n_estimators=100, criterion='entropy', random_state=SEED)
model_RF.fit(train_x, train_y)

print("Random Forest performance:")
_ = model_performance(model_RF, test_x, test_y)

### Model 2: Support Vector Machine (SVM)

SVM uses a mathematical trick called the "kernel trick" (here: radial basis function / RBF) to find complex boundaries between active and inactive molecules, even when the data isn't neatly separable by a straight line.
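
The RBF kernel itself is a simple formula: it turns the distance between two fingerprints into a similarity between 0 and 1. The sketch below evaluates it on made-up bit vectors, using the same `gamma=0.1` as the SVM cell; the helper name `rbf_kernel` is chosen just for this example.

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Toy 8-bit fingerprints (invented for illustration)
fp_ref  = [1, 1, 0, 1, 0, 0, 1, 0]
fp_near = [1, 1, 0, 1, 0, 0, 0, 0]  # 1 bit different
fp_far  = [0, 0, 1, 0, 1, 1, 0, 1]  # all 8 bits different

print(f"identical: {rbf_kernel(fp_ref, fp_ref):.3f}")
print(f"1 bit off: {rbf_kernel(fp_ref, fp_near):.3f}")
print(f"8 bits off: {rbf_kernel(fp_ref, fp_far):.3f}")
```

For 0/1 fingerprints the squared distance is just the number of differing bits, so a larger `gamma` makes the similarity fall off faster as molecules become more different.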

In [None]:
# Train an SVM
model_SVM = svm.SVC(kernel='rbf', C=1, gamma=0.1, probability=True, random_state=SEED)
model_SVM.fit(train_x, train_y)

print("SVM performance:")
_ = model_performance(model_SVM, test_x, test_y)

### Model 3: Neural Network (ANN)

Our neural network has 2 hidden layers (with 5 and 3 neurons). It's small by modern AI standards, but often enough for molecular data. Neural networks shine when you have very large datasets — with smaller data, simpler models like RF and SVM sometimes do equally well or better.
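
To get a feel for how small this network is, we can count its adjustable numbers (weights and biases). The sketch assumes a fully connected network with a single output neuron for the yes/no answer; `count_mlp_parameters` is a helper written for this illustration, not part of scikit-learn.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Count weights and biases of a fully connected network."""
    layer_sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # One weight per connection between neighbouring layers
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    # One bias per neuron outside the input layer
    biases = sum(layer_sizes[1:])
    return weights + biases

print(count_mlp_parameters(167, (5, 3)))   # 167-bit MACCS input: 862
print(count_mlp_parameters(2048, (5, 3)))  # 2048-bit Morgan input: 10267
```

Almost all parameters sit between the input and the first hidden layer, so the fingerprint length matters far more for model size than the hidden layers do.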

In [None]:
# Train a Neural Network
model_ANN = MLPClassifier(hidden_layer_sizes=(5, 3), random_state=SEED, max_iter=500)
model_ANN.fit(train_x, train_y)

print("Neural Network performance:")
_ = model_performance(model_ANN, test_x, test_y)

### Compare all three models with ROC curves

Now let's compare all three models visually using **ROC curves**.

> **What's happening here?**
>
> The ROC (Receiver Operating Characteristic) curve shows how well a model separates active from inactive molecules. It plots two things:
> - **True Positive Rate** (y-axis): Of all truly active molecules, how many did we correctly identify? (higher = better)
> - **False Positive Rate** (x-axis): Of all truly inactive molecules, how many did we accidentally flag as active? (lower = better)
>
> A perfect model hugs the top-left corner. A random coin flip gives the diagonal red dashed line.
>
> The **AUC** (Area Under the Curve) summarizes this in one number:
> - **AUC = 1.0** → perfect predictions
> - **AUC = 0.5** → no better than random guessing
> - **AUC > 0.8** → generally considered a good model
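
Another way to read the AUC: pick one active and one inactive molecule at random, and ask how often the model gives the active one the higher score. The sketch below computes exactly that on invented scores. It is meant to build intuition; in the notebook, `roc_auc_score` does the calculation.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (active, inactive) pairs where the active scores higher; ties count half."""
    active_scores = [s for y, s in zip(y_true, scores) if y == 1]
    inactive_scores = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for active in active_scores:
        for inactive in inactive_scores:
            if active > inactive:
                wins += 1.0
            elif active == inactive:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))

# Toy predictions: one active (score 0.4) is ranked below one inactive (score 0.7)
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

print(f"AUC = {pairwise_auc(y_true, scores):.2f}")  # AUC = 0.89
```

Of the 9 possible active/inactive pairs, 8 are ranked correctly, giving 8/9 ≈ 0.89. A model that ranks every active above every inactive reaches 1.0.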

In [None]:
models = [
    {'label': 'Random Forest', 'model': model_RF},
    {'label': 'SVM', 'model': model_SVM},
    {'label': 'Neural Network', 'model': model_ANN},
]

plot_roc_curves(models, test_x, test_y)

## Step 6: Cross-validation

The results above used a single random train/test split. But what if we got lucky (or unlucky) with that particular split? **Cross-validation** repeats the experiment multiple times to give us a more reliable answer.

> **What's happening here?**
>
> We split all the data into 3 equal parts (called "folds"):
> - **Round 1**: Train on folds 1+2, test on fold 3
> - **Round 2**: Train on folds 1+3, test on fold 2
> - **Round 3**: Train on folds 2+3, test on fold 1
>
> Every molecule gets to be in the test set exactly once. We then report the average performance across all rounds. If the numbers are consistent across rounds (low standard deviation), we can trust the performance estimate more.
>
> This is a standard practice in machine learning — you should always cross-validate before trusting your model!
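
The three rounds can be sketched with plain Python lists. The helper `kfold_indices` is a simplified, hypothetical stand-in for scikit-learn's `KFold`, so the exact fold membership will differ from the notebook's, but the rotation logic is the same.

```python
import random

def kfold_indices(n_samples, n_folds=3, seed=22):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]  # deal indices out like cards
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

splits = list(kfold_indices(9))
for round_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Round {round_number}: train on {sorted(train_idx)}, test on {sorted(test_idx)}")
```

With 9 molecules and 3 folds, every round trains on 6 molecules and tests on the remaining 3, and across the three rounds each molecule is tested exactly once.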

In [None]:
def crossvalidation(ml_model, df, n_folds=3):
    """Run cross-validation and print average performance."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=SEED)

    auc_per_fold = []
    acc_per_fold = []

    for train_index, test_index in kf.split(df):
        fold_model = clone(ml_model)

        fold_train_x = df.iloc[train_index].fp.tolist()
        fold_train_y = df.iloc[train_index].active.tolist()
        fold_test_x = df.iloc[test_index].fp.tolist()
        fold_test_y = df.iloc[test_index].active.tolist()

        fold_model.fit(fold_train_x, fold_train_y)
        accuracy, sens, spec, auc, bal_acc, mcc = model_performance(
            fold_model, fold_test_x, fold_test_y, verbose=False
        )
        auc_per_fold.append(auc)
        acc_per_fold.append(accuracy)

    print(f"  Mean AUC:      {np.mean(auc_per_fold):.2f} +/- {np.std(auc_per_fold):.2f}")
    print(f"  Mean Accuracy: {np.mean(acc_per_fold):.2f} +/- {np.std(acc_per_fold):.2f}")


N_FOLDS = 3

for model_info in models:
    print(f"\n--- {model_info['label']} ---")
    crossvalidation(model_info['model'], compound_df, n_folds=N_FOLDS)

## Step 7: Regression — predict *how strongly* a molecule binds

So far, we've been doing **classification**: just predicting "active" or "inactive" (a yes/no answer). Now let's try **regression** — predicting the actual pChEMBL value as a continuous number.

> **What's happening here?**
>
> - **Classification** = "Is this molecule active?" → Yes or No
> - **Regression** = "What is this molecule's pChEMBL value?" → e.g., 7.3
>
> This is like the difference between pass/fail grading and giving a percentage score. Regression gives us more information — we can rank molecules from strongest to weakest binder.
>
> We also switch to **Morgan fingerprints** (2048 bits) instead of MACCS keys (167 bits). More bits = more detailed description of the molecule = the model has more information to work with.
>
> **How do we measure regression performance?**
> - **MAE** (Mean Absolute Error): On average, how far off are our predictions? (lower = better)
> - **RMSE** (Root Mean Squared Error): Similar, but penalizes big errors more heavily. (lower = better)
> - Rule of thumb: MAE < 0.6 and RMSE < 1.0 is considered decent for pChEMBL prediction.
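
The difference between the two error measures is easiest to see on a tiny example. The values below are invented: three predictions are off by 0.5 and one is off by 2.0. The short function names `mae` and `rmse` are chosen for this sketch; the notebook uses scikit-learn's `mean_absolute_error` and `mean_squared_error` instead.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square the errors, average, then take the root."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

measured  = [7.0, 6.0, 8.0, 5.0]   # made-up experimental pChEMBL values
predicted = [7.5, 5.5, 7.5, 7.0]   # made-up predictions; the last one is far off

print(f"MAE:  {mae(measured, predicted):.2f}")   # MAE:  0.88
print(f"RMSE: {rmse(measured, predicted):.2f}")  # RMSE: 1.09
```

The single large miss pulls the RMSE well above the MAE, which is what "penalizes big errors more heavily" means in practice.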

In [None]:
# Switch to Morgan3 fingerprints (longer, more detailed barcodes)
compound_df_reg = compound_df.copy()
tqdm.pandas()
compound_df_reg['fp'] = compound_df_reg['Smiles'].progress_apply(
    smiles_to_fp, args=('morgan3',)
)
compound_df_reg.dropna(subset=['fp', 'pChEMBL_value'], inplace=True)
compound_df_reg.reset_index(drop=True, inplace=True)

print(f"Molecules for regression: {len(compound_df_reg)}")

In [None]:
# Cross-validation for regression
regressor = RandomForestRegressor(random_state=SEED)
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)

mae_folds = []
rmse_folds = []
trained_model = None

for train_idx, test_idx in kf.split(compound_df_reg):
    fold_model = clone(regressor)
    fold_train_x = compound_df_reg.iloc[train_idx].fp.tolist()
    fold_train_y = compound_df_reg.iloc[train_idx].pChEMBL_value.tolist()
    fold_test_x = compound_df_reg.iloc[test_idx].fp.tolist()
    fold_test_y = compound_df_reg.iloc[test_idx].pChEMBL_value.tolist()

    fold_model.fit(fold_train_x, fold_train_y)
    preds = fold_model.predict(fold_test_x)

    mae_folds.append(mean_absolute_error(fold_test_y, preds))
    rmse_folds.append(np.sqrt(mean_squared_error(fold_test_y, preds)))
    trained_model = fold_model  # Keep the last fold's model (trained on all folds but one) for predictions

print(f"Regression results (Random Forest):")
print(f"  MAE:  {np.mean(mae_folds):.2f} +/- {np.std(mae_folds):.2f}")
print(f"  RMSE: {np.mean(rmse_folds):.2f} +/- {np.std(rmse_folds):.2f}")
print("\nRule of thumb: MAE < 0.6 and RMSE < 1.0 is considered decent.")

## Step 8: Predict activity for new molecules!

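Now let's use our trained regression model to predict how strongly some well-known ERα ligands bind. Three are drugs (or drug metabolites) used in medicine, and one is the body's own hormone. Because these molecules are so well studied, they may already be part of the ChEMBL training data, so treat the predictions as a sanity check rather than a test on truly unseen molecules: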

- **4-hydroxytamoxifen** — the active metabolite of tamoxifen, our reference drug from Notebook 1
- **Estradiol** — the natural estrogen hormone that normally activates ERα
- **Tamoxifen** — one of the most widely used breast cancer drugs (a "prodrug" that gets converted to 4-hydroxytamoxifen in the body)
- **Raloxifene** — another ERα blocker, used for both breast cancer prevention and osteoporosis

In [None]:
# SMILES for our test molecules
# Raw strings (r'...') keep backslashes in SMILES from being read as escape characters
test_molecules = {
    '4-Hydroxytamoxifen': r'OC1=CC=C(/C(=C(/CC)C2=CC=CC=C2)C3=CC=C(OCCN(C)C)C=C3)C=C1',
    'Estradiol':          r'O[C@@H]1CC[C@@H]2[C@H]3CCC4=CC(=CC=C4[C@@H]3CC[C@]12C)O',
    'Tamoxifen':          r'CCC(/C1=CC=CC=C1)=C(\C2=CC=C(OCCN(C)C)C=C2)C3=CC=CC=C3',
    # Benzothiophene core with a single ketone linker
    'Raloxifene':         r'C1CCN(CC1)CCOC2=CC=C(C=C2)C(=O)C3=C(SC4=C3C=CC(=C4)O)C5=CC=C(C=C5)O',
}

print("Predicted pChEMBL values for ER\u03b1:")
print("=" * 50)

predictions_for_nb3 = {}
for name, smiles in test_molecules.items():
    fp = smiles_to_fp(smiles, 'morgan3')
    if fp is not np.nan:
        pred = trained_model.predict([fp])[0]
        predictions_for_nb3[name] = pred
        status = 'ACTIVE' if pred >= 6.5 else 'inactive'
        print(f"  {name:25s}  pChEMBL = {pred:.2f}  ({status})")
    else:
        print(f"  {name:25s}  Could not parse SMILES")

**How do these predictions compare to reality?** You can look up the experimental pChEMBL values for these molecules on [ChEMBL](https://www.ebi.ac.uk/chembl/) to check!

In Notebook 3, we'll use **molecular docking** to predict binding from a completely different angle — fitting 3D molecular shapes into the protein's binding pocket. Then we can compare the ML predictions above with the docking results and see if both methods agree.

---

## Summary

In this notebook, you learned how to:

1. **Download** experimental data from ChEMBL using Python
2. **Clean** messy real-world data and label molecules as active/inactive
3. **Encode** molecules as fingerprints — numerical "barcodes" that ML models can understand
4. **Train** three classification models (Random Forest, SVM, Neural Network) and compare them with ROC curves
5. **Cross-validate** to make sure the models are reliable
6. **Train** a regression model to predict continuous binding strength (pChEMBL values)
7. **Predict** the activity of new, untested molecules

**Key takeaway**: With just a few lines of code, we built models that can predict whether a molecule binds to ERα, without running a single new lab experiment! The models learn from measurements that other researchers have already made and shared.

---

## Try it yourself!

Want to test your own molecules? You can use the **PubChem Sketcher** to draw molecules or search them by name:

1. Go to the [PubChem Sketcher](https://pubchem.ncbi.nlm.nih.gov/edit3/index.html)
2. Either **draw** a molecule using the tools, or click the **search icon** and type a molecule name (e.g., "ibuprofen", "caffeine", "aspirin")
3. Once your molecule appears, look for the **SMILES** string (it appears below the drawing)
4. Copy the SMILES and paste it in the cell below
5. Run the cell to get a prediction!

You can also search for molecules directly on [PubChem](https://pubchem.ncbi.nlm.nih.gov/) — search by name, then find the "Canonical SMILES" under section 2.1.4.

In [None]:
# --- TRY IT YOURSELF ---
# Paste SMILES strings below to predict their activity

my_molecules = [
    # 'paste_your_SMILES_here',
    # 'another_SMILES_here',
]

if my_molecules:
    print("Your predictions:")
    for smiles in my_molecules:
        fp = smiles_to_fp(smiles, 'morgan3')
        if fp is not np.nan:
            pred = trained_model.predict([fp])[0]
            status = 'ACTIVE' if pred >= 6.5 else 'inactive'
            print(f"  {smiles[:50]:50s}  pChEMBL = {pred:.2f}  ({status})")
        else:
            print(f"  Could not parse: {smiles[:50]}")
else:
    print("Add your SMILES strings to the list above and re-run this cell!")