### Task: Predicting Hydration Free Energy Classification using Morgan Fingerprints

**Goal:** Predict whether a molecule has favorable or unfavorable hydration free energy.

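In this task, you will build a neural network to classify molecules based on their **hydration free energy**: the free energy change, in kcal/mol, of transferring a molecule from the gas phase into water. More negative values mean more favorable hydration. You'll use Morgan fingerprints as molecular representations.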

#### Setting up the notebook and importing necessary libraries

In [None]:
!pip install rdkit

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from tqdm import tqdm
import requests
from io import StringIO

warnings.filterwarnings('ignore')

# Deep Learning
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Chemistry
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem import AllChem

#### Loading and Analyzing FreeSolv Dataset

The FreeSolv dataset contains experimental and calculated hydration free energies (in kcal/mol) for small molecules.
The loading code below is already written for you.

In [None]:
def load_freesolv_dataset():
    """Load the FreeSolv hydration free energy dataset"""
    url = "https://raw.githubusercontent.com/MobleyLab/FreeSolv/master/database.txt"
    response = requests.get(url, timeout=30)  # fail instead of hanging on a stalled connection
    response.raise_for_status()
    
    # Parse the semicolon-delimited file
    lines = response.text.strip().split('\n')
    
    # Filter out comment lines
    data_lines = [line for line in lines if not line.startswith('#') and line.strip()]
    
    # Parse each line
    smiles_list = []
    names_list = []
    expt_values = []
    calc_values = []
    
    for line in data_lines:
        parts = [p.strip() for p in line.split(';')]
        if len(parts) >= 6:  # Ensure we have enough fields
            smiles = parts[1]  # SMILES string
            name = parts[2]    # Molecule name
            try:
                exp_value = float(parts[3])   # Experimental hydration free energy
                calc_value = float(parts[5])  # Calculated hydration free energy
                # Validate SMILES before adding
                mol = Chem.MolFromSmiles(smiles)
                if mol is not None:
                    smiles_list.append(smiles)
                    names_list.append(name)
                    expt_values.append(exp_value)
                    calc_values.append(calc_value)
            except ValueError:
                continue
    
    df = pd.DataFrame({
        'SMILES': smiles_list,
        'name': names_list,
        'expt': expt_values,
        'calc': calc_values
    })
    
    print(f"Loaded FreeSolv with {len(df)} compounds")
    print("Columns in dataset:", df.columns.tolist())
    
    return df

# Load dataset
df = load_freesolv_dataset()

#### Task 1: Data Visualization and Classification Setup

TODO Part A: Visualize the distribution of hydration free energies in the FreeSolv dataset.
- Create a histogram showing the distribution of experimental hydration free energies
- Create a scatter plot showing the values across the dataset
- Analyze the distribution to understand the data range
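
As a rough illustration of the two plots in Part A, the sketch below draws a histogram and a scatter plot side by side. It uses synthetic values in place of `df['expt']` (an assumption made so the example runs on its own), so the shapes you see here say nothing about FreeSolv itself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for df['expt'] (assumption: values in kcal/mol)
rng = np.random.default_rng(0)
expt = rng.normal(loc=-4.0, scale=3.5, size=600)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Histogram: overall shape and range of the distribution
ax_hist.hist(expt, bins=40, edgecolor="black")
ax_hist.set_xlabel("Hydration free energy (kcal/mol)")
ax_hist.set_ylabel("Number of molecules")

# Scatter: value of each molecule against its position in the dataset
ax_scatter.scatter(np.arange(len(expt)), expt, s=8, alpha=0.6)
ax_scatter.set_xlabel("Molecule index")
ax_scatter.set_ylabel("Hydration free energy (kcal/mol)")

fig.tight_layout()
```

With the real data, replacing `expt` by `df['expt'].values` should be enough.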

TODO Part B: Based on your visualization, choose a cutoff value to create binary classification labels.
- Common cutoff values to consider: -5.0, -4.0, -3.0 kcal/mol (or choose your own based on the data)
- Print the class distribution to check that the class sizes are balanced, or at least reasonable
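
One way to compare candidate cutoffs is to count how many molecules fall on each side. The sketch below does this on a handful of made-up values (not FreeSolv data); `class_counts` is a hypothetical helper introduced only for this example, and it follows the convention that class 0 means "below the cutoff".

```python
import numpy as np
import pandas as pd

# Toy values standing in for df['expt'] (illustrative numbers, not FreeSolv data)
toy = pd.DataFrame({"expt": [-9.5, -7.2, -6.1, -4.8, -4.5, -3.4, -2.9, -1.0, 0.5, 2.1]})

def class_counts(values, cutoff):
    """Return (n_class0, n_class1), with class 0 = values below the cutoff."""
    labels = np.where(values < cutoff, 0, 1)
    return int((labels == 0).sum()), int((labels == 1).sum())

for cutoff in (-5.0, -4.0, -3.0):
    n0, n1 = class_counts(toy["expt"], cutoff)
    print(f"cutoff {cutoff:+.1f}: class 0 = {n0}, class 1 = {n1}")
```

A cutoff that leaves one class with very few molecules makes accuracy harder to interpret, which is why the class sizes are worth checking before training.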

In [None]:
# TODO Part A: Implement visualization code here
# Hint: Use plt.hist() and plt.scatter()


In [None]:
# TODO Part B: Choose cutoff and create classification labels
# Example:
# CUTOFF = -5.0  # Choose your own value!
# df['HydrationClass'] = np.where(df['expt'] < CUTOFF, 0, 1)  # 0 = favorable, 1 = unfavorable
# print("\nClass distribution:")
# print(df['HydrationClass'].value_counts())


#### Task 2: Calculate Molecular Descriptors (Exploratory Analysis)

TODO: Calculate LogP values for all molecules in the dataset.

Note: LogP (the logarithm of the octanol-water partition coefficient) is NOT in the dataset; you'll calculate it using RDKit. This is for exploratory analysis to understand the relationship between lipophilicity (LogP) and hydration free energy.


- Use RDKit's `Descriptors.MolLogP()` function to calculate LogP
- Store the results in a new column 'LogP'
- Plot the distribution of LogP values
- Explore how LogP relates to hydration free energy (the target we're predicting)
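
The sketch below shows the SMILES-to-LogP step on three hand-picked molecules rather than on the dataset. `Descriptors.MolLogP` returns the Crippen estimate of LogP, so the numbers are model estimates, not measurements.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# A few hand-picked SMILES for illustration (not drawn from the dataset)
examples = {"ethanol": "CCO", "benzene": "c1ccccc1", "hexane": "CCCCCC"}

logp = {}
for name, smi in examples.items():
    mol = Chem.MolFromSmiles(smi)  # returns None if the SMILES cannot be parsed
    if mol is not None:
        logp[name] = Descriptors.MolLogP(mol)  # Crippen estimate of LogP

for name, value in logp.items():
    print(f"{name}: LogP = {value:.2f}")
```

The same two calls can be applied row by row to `df['SMILES']` to fill the 'LogP' column.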

In [None]:
# TODO: Calculate LogP for each molecule
# Hint: Loop through df['SMILES'], convert to mol objects, then calculate LogP

#### Task 3: Set Hyperparameters and Generate Fingerprints

TODO: Define hyperparameters and generate Morgan fingerprints for all molecules.

Morgan fingerprints represent a molecule as a fixed-length bit vector. Each circular substructure around an atom, up to a chosen radius, is hashed to a bit position, and that bit is set to 1.

Key hyperparameters:
- `FINGERPRINT_SIZE`: Length of the fingerprint vector (e.g. 1024 or 2048)
- `MAX_RADIUS`: Maximum radius, in bonds, of the circular substructures around each atom (try 2, 3, 4, 5)
- `BATCH_SIZE`: Number of samples per training batch
- `NUM_EPOCHS`: Number of training epochs
- `LEARNING_RATE`: Learning rate for the optimizer
- `DROPOUT`: Dropout rate for regularization
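
To make the fingerprint hyperparameters concrete, here is a minimal sketch that generates one Morgan fingerprint. It imports the generator from `rdkit.Chem.rdFingerprintGenerator`; the names `FP_SIZE` and `RADIUS` are placeholders for this example and play the roles of `FINGERPRINT_SIZE` and `MAX_RADIUS`.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

FP_SIZE = 2048  # illustrative; plays the role of FINGERPRINT_SIZE
RADIUS = 2      # illustrative; plays the role of MAX_RADIUS

generator = rdFingerprintGenerator.GetMorganGenerator(radius=RADIUS, fpSize=FP_SIZE)

mol = Chem.MolFromSmiles("CCO")  # ethanol
fp = generator.GetFingerprintAsNumPy(mol)  # array of 0/1 values, length FP_SIZE

print(fp.shape, int(fp.sum()), "bits set")
```

A NumPy array is convenient here because it converts directly to a PyTorch tensor later on.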

In [None]:
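# TODO: Define hyperparameters (starting values; tune them as you experiment)
FINGERPRINT_SIZE = 2048   # length of the Morgan fingerprint vector
MAX_RADIUS = 3            # Morgan radius, in bonds
BATCH_SIZE = 32           # samples per training batch
NUM_EPOCHS = 100          # maximum number of training epochs
WEIGHT_DECAY = 1e-3       # L2 penalty for the optimizer
LEARNING_RATE = 1e-3      # initial learning rate
PATIENCE = 3              # presumably: epochs without improvement before the scheduler lowers the learning rate
FACTOR = 0.5              # presumably: factor by which the scheduler multiplies the learning rate
EARLY_STOPPING = 10       # presumably: epochs without improvement before training stops
DROPOUT = 0.2             # dropout probability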

In [None]:
# TODO: Generate Morgan fingerprints
# Hint: Use AllChem.GetMorganGenerator() and GetFingerprint()
# Store fingerprints in df['morgan']

#### How do the hyperparameters matter?

- **FINGERPRINT_SIZE**: Larger sizes encode more substructures with fewer bit collisions, but they enlarge the model's input layer and slow training
- **MAX_RADIUS**: Controls how many bonds away from each atom the substructures extend (radius 2 corresponds roughly to ECFP4)
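
The effect of both settings can be seen by counting how many bits are set for a single molecule. The sketch below uses aspirin purely as an example; `count_on_bits` is a hypothetical helper, and the very small size of 32 is chosen only to make collisions visible.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

def count_on_bits(smiles, radius, fp_size):
    """Number of bits set in the Morgan fingerprint of one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=fp_size)
    return gen.GetFingerprint(mol).GetNumOnBits()

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, chosen only as an example

# A larger radius adds larger substructures, so more bits tend to be set
for radius in (1, 2, 3):
    print(f"radius={radius}, size=2048:", count_on_bits(smiles, radius, 2048))

# A tiny fingerprint forces different substructures onto the same bit
for fp_size in (32, 2048):
    print(f"radius=3, size={fp_size}:", count_on_bits(smiles, 3, fp_size))
```

When two substructures share a bit, the model can no longer tell them apart, which is the cost of a short fingerprint.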

**Challenge:** Experiment with different hyperparameter values to see how they affect model performance!

#### Task 4: Define PyTorch Dataset and DataLoaders

**TODO:** Create a custom PyTorch Dataset class and instantiate train/test DataLoaders.
- Implement the `__len__` and `__getitem__` methods
- Split data into train/test sets (80/20 split)
- Use 'HydrationClass' as the target variable (0 = favorable, 1 = unfavorable)
- Create DataLoader objects for both sets
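
A possible shape for these pieces is sketched below on random stand-in data, so it runs without the fingerprints from Task 3. The class name `FPDataset` follows the TODO in the next cell; the stratified split and the fixed `random_state` are choices made for this example, not requirements of the task.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

class FPDataset(Dataset):
    """Pairs each fingerprint vector with its binary class label."""

    def __init__(self, fingerprints, labels):
        self.X = torch.tensor(np.asarray(fingerprints), dtype=torch.float32)
        self.y = torch.tensor(np.asarray(labels), dtype=torch.float32)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Random stand-in data (assumption): 100 molecules with 64-bit fingerprints
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 64))
y = rng.integers(0, 2, size=100)

# 80/20 split; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_loader = DataLoader(FPDataset(X_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(FPDataset(X_test, y_test), batch_size=32, shuffle=False)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

Labels are stored as `float32` because binary cross-entropy losses in PyTorch expect floating-point targets.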

In [None]:
# TODO: Define FPDataset class
# TODO: Split data into train/test
# TODO: Create DataLoader objects

#### Task 5: Define MLP Model Architecture

**TODO:** Define a Multi-Layer Perceptron (MLP) for binary classification.
- Input layer: size = FINGERPRINT_SIZE
- Hidden layers: implement at least 2 hidden layers with ReLU activation
- Dropout layers for regularization
- Output layer: single neuron with Sigmoid activation for binary classification
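
The sketch below is one way to lay out such a network; the hidden sizes (512, 128) and the Adam optimizer are assumptions for illustration, not values prescribed by the task. Because the output layer ends in a Sigmoid, it is paired with `nn.BCELoss`.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_size, hidden_sizes=(512, 128), dropout=0.2):
        super().__init__()
        layers = []
        in_features = input_size
        # Each hidden block: linear layer, ReLU activation, dropout
        for hidden in hidden_sizes:
            layers += [nn.Linear(in_features, hidden), nn.ReLU(), nn.Dropout(dropout)]
            in_features = hidden
        # Single output neuron squashed to a probability
        layers += [nn.Linear(in_features, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)  # shape (batch,) to match the labels

model = MLP(input_size=2048)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)

dummy = torch.rand(4, 2048)  # fake batch of four fingerprints
probs = model(dummy)
print(probs.shape)
```

A common alternative is to drop the final Sigmoid and use `nn.BCEWithLogitsLoss`, which is numerically more stable, but that departs from the architecture described above.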

In [None]:
# TODO: Define MLP class
# TODO: Instantiate model, loss criterion, and optimizer

#### Task 6: Train the Model

**TODO:** Implement the training loop.
- Training phase: forward pass, compute loss, backward pass, update weights
- Validation phase: evaluate on the test set with gradient computation disabled (`torch.no_grad()`)
- Implement learning rate scheduling
- Implement early stopping based on validation loss
- Track and store training and validation losses
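
The steps above can be sketched as a single function. The example below trains a tiny network on random toy data so that it runs in a few seconds. The choice of `ReduceLROnPlateau` as the scheduler, the toy labels, and all numeric settings are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data (assumption): 16-bit random "fingerprints", label equals the first bit
X = torch.randint(0, 2, (200, 16)).float()
y = X[:, 0].clone()
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3
)

def train_model(model, num_epochs=50, early_stopping=10):
    train_losses, val_losses = [], []
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(num_epochs):
        # Training phase: forward pass, loss, backward pass, weight update
        model.train()
        running = 0.0
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb).squeeze(-1), yb)
            loss.backward()
            optimizer.step()
            running += loss.item() * len(xb)
        train_losses.append(running / len(train_loader.dataset))

        # Validation phase: no gradients, dropout (if any) switched off
        model.eval()
        running = 0.0
        with torch.no_grad():
            for xb, yb in val_loader:
                running += criterion(model(xb).squeeze(-1), yb).item() * len(xb)
        val_loss = running / len(val_loader.dataset)
        val_losses.append(val_loss)

        scheduler.step(val_loss)  # lower the learning rate when val loss plateaus

        # Early stopping: stop after too many epochs without improvement
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= early_stopping:
                break
    return train_losses, val_losses

train_losses, val_losses = train_model(model)
print(f"Stopped after {len(val_losses)} epochs")
```

The two returned lists have one entry per completed epoch, which is the form needed for plotting loss curves afterwards.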

In [None]:
# TODO: Implement train_model function
# TODO: Train the model

#### Task 7: Evaluate Model Performance

**TODO:** Visualize training progress and evaluate model performance.
- Plot training and validation loss curves
- Calculate and display accuracy on test set
- Create a confusion matrix to visualize predictions
- Analyze where the model makes mistakes
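
For the accuracy and confusion-matrix steps, the sketch below works through eight hypothetical predictions by hand. The probabilities and labels are invented for the example; with the real model they would come from running it over the test DataLoader.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical predicted probabilities and true labels for eight test molecules
probs = np.array([0.10, 0.80, 0.35, 0.60, 0.90, 0.20, 0.55, 0.40])
y_true = np.array([0, 1, 0, 0, 1, 0, 1, 1])

y_pred = (probs >= 0.5).astype(int)    # threshold the sigmoid outputs at 0.5
accuracy = (y_pred == y_true).mean()
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(f"Accuracy: {accuracy:.2f}")
print(cm)
```

Here six of the eight predictions are correct. The off-diagonal entries of `cm` count the two kinds of mistakes separately, and `sns.heatmap(cm, annot=True, fmt="d")` is one way to display the matrix.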

In [None]:
# TODO: Plot training curves

In [None]:
# TODO: Calculate predictions and create confusion matrix