# Pandas + RDKit: Working with Chemical CSV Data 📈⚗️

*General Chemistry & Cyberinfrastructure Skills Module*

## Learning Objectives
1. **Read** and **write** chemical data (SMILES + property columns) using **pandas** CSV I/O.
2. **Clean** datasets by removing invalid SMILES and missing values.
3. **Visualise** chemical property trends with matplotlib and seaborn, leveraging RDKit‐derived descriptors.

## Prerequisites
- Python ≥ 3.8
- **pandas** for tabular data handling
- **RDKit** for chemistry operations
- **matplotlib** / **seaborn** for plots

On Google Colab, uncomment the `!pip install` line at the top of the next cell before running it.

In [None]:
# !pip install rdkit pandas matplotlib seaborn -q  # ← Uncomment if needed (the older `rdkit-pypi` package name is deprecated)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from rdkit import Chem
from rdkit.Chem import Descriptors


## Step 1 – Load a CSV
To get started, we’ll create a **tiny sample CSV** on the fly and then read it back with `pd.read_csv`. In real projects you’d skip the first cell and load an existing file directly:

In [None]:
sample_csv = 'sample_mols.csv'
pd.DataFrame({
    'SMILES': ['CCO', 'c1ccccc1', 'invalid_smiles', 'O=C=O'],
    'IC50_nM': [120, 3000, 50, None]
}).to_csv(sample_csv, index=False)
print('Wrote sample CSV →', sample_csv)

In [None]:
df = pd.read_csv(sample_csv)
df
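
When loading a real file, it can help to state the column types up front and to count missing values before any cleaning. The following is a minimal sketch, assuming the same two column names as the sample file; the in-memory text and the `df_check` and `missing_per_column` names are stand-ins introduced here for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (same columns as sample_mols.csv).
csv_text = "SMILES,IC50_nM\nCCO,120\nc1ccccc1,3000\ninvalid_smiles,50\nO=C=O,\n"

# Read SMILES as text and activity as float; an empty field becomes NaN.
df_check = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"SMILES": str, "IC50_nM": float},
)

# Count missing values per column before any cleaning.
missing_per_column = df_check.isna().sum()
print(df_check.dtypes)
print(missing_per_column)
```

Note that an unparsable SMILES string is still a non-missing value as far as pandas is concerned; only RDKit can flag it.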

## Step 2 – Clean the Dataset
We’ll parse each SMILES with RDKit; invalid strings become **None** in a new `Mol` column, which `dropna` treats as missing:

In [None]:
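def smiles_to_mol(s):
    """Return an RDKit Mol, or None if the SMILES cannot be parsed."""
    try:
        # MolFromSmiles returns None (it does not raise) for an invalid string.
        return Chem.MolFromSmiles(s)
    except Exception:
        # Non-string input, such as a NaN from an empty cell, raises instead.
        return None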

df['Mol'] = df['SMILES'].apply(smiles_to_mol)
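# .copy() makes `clean` an independent DataFrame, so adding columns in Step 3
# does not trigger pandas' SettingWithCopyWarning.
clean = df.dropna(subset=['Mol', 'IC50_nM']).copy()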
print('Rows after cleaning:', len(clean))
clean
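
Before dropping rows, it is often useful to record why each one fails. The following is a minimal sketch, assuming the sample columns used above; the `drop_reason` column and the `report` and `kept` names are hypothetical, introduced only for illustration. `RDLogger.DisableLog` merely silences RDKit's parse-error messages.

```python
import pandas as pd
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.*")  # silence RDKit parse-error messages

raw = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1", "invalid_smiles", "O=C=O"],
    "IC50_nM": [120, 3000, 50, None],
})

# MolFromSmiles returns None (it does not raise) for an unparsable string.
mols = raw["SMILES"].apply(Chem.MolFromSmiles)

bad_smiles = mols.isna()
no_activity = raw["IC50_nM"].isna()

# Hypothetical bookkeeping column: why each row would be dropped.
report = raw.copy()
report["drop_reason"] = ""
report.loc[bad_smiles, "drop_reason"] = "invalid SMILES"
report.loc[no_activity, "drop_reason"] = "missing IC50"

# Keep only rows that pass both checks.
kept = report[~(bad_smiles | no_activity)]
print(report[["SMILES", "drop_reason"]])
print("Rows kept:", len(kept))
```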

## Step 3 – Compute Descriptors
Let’s add molecular weight and logP using RDKit:

In [None]:
clean['MolWt'] = clean['Mol'].apply(Descriptors.MolWt)
clean['logP'] = clean['Mol'].apply(Descriptors.MolLogP)
clean
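
The two `apply` calls generalise to any number of descriptors by looping over a name-to-function mapping. The following is a minimal sketch, assuming every molecule parsed successfully; the `descriptor_funcs` dictionary and the `add_descriptors` helper are hypothetical names, while `Descriptors.MolWt` and `Descriptors.MolLogP` are the same RDKit functions used above.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical mapping: output column name -> RDKit descriptor function.
descriptor_funcs = {
    "MolWt": Descriptors.MolWt,
    "logP": Descriptors.MolLogP,
}

def add_descriptors(frame, mol_col="Mol", funcs=descriptor_funcs):
    """Return a copy of frame with one new column per descriptor."""
    out = frame.copy()
    for name, func in funcs.items():
        out[name] = out[mol_col].apply(func)
    return out

demo = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1"]})
demo["Mol"] = demo["SMILES"].apply(Chem.MolFromSmiles)
demo = add_descriptors(demo)
print(demo[["SMILES", "MolWt", "logP"]].round(2))
```

Returning a copy rather than modifying the input keeps the helper free of side effects, at the cost of a little extra memory.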

## Step 4 – Visualise
Plot logP vs. IC₅₀ (nM) and colour by molecular weight:

In [None]:
sns.set(style='whitegrid')
plt.figure(figsize=(5,4))
scatter = plt.scatter(clean['logP'], clean['IC50_nM'], c=clean['MolWt'], s=80, cmap='viridis')
plt.colorbar(scatter, label='MolWt')
plt.xlabel('logP')
plt.ylabel('IC50 (nM)')
plt.title('Activity vs. lipophilicity')
plt.show()
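
IC50 values often span several orders of magnitude, so a logarithmic scale tends to make such plots easier to read. One common convention is pIC50 = -log10(IC50 in mol/L), which for values in nM equals 9 - log10(IC50 in nM). The following is a minimal sketch, assuming the `IC50_nM` column holds positive nanomolar values; the `activity` frame and the `pIC50` column name are introduced here for illustration.

```python
import numpy as np
import pandas as pd

activity = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1"],
    "IC50_nM": [120.0, 3000.0],
})

# 1 nM = 1e-9 mol/L, so -log10(IC50_nM * 1e-9) = 9 - log10(IC50_nM).
activity["pIC50"] = 9 - np.log10(activity["IC50_nM"])
print(activity.round(2))
```

Plotting `pIC50` on the y-axis, or calling `plt.yscale('log')` on the original plot, are two ways to apply this idea.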

## Step 5 – Save Cleaned Data
Export the curated data (with descriptors) to a new CSV:

In [None]:
clean_out = 'cleaned_mols.csv'
cols_to_save = ['SMILES', 'IC50_nM', 'MolWt', 'logP']
clean[cols_to_save].to_csv(clean_out, index=False)
print('Cleaned CSV saved to:', clean_out)
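
RDKit `Mol` objects are in-memory Python objects and would be written to CSV only as an uninformative text representation, which is presumably why `Mol` is left out of `cols_to_save`; molecules can be rebuilt from the SMILES column after reloading. A quick round-trip check can confirm that the file reads back as expected. The following is a minimal sketch, using a temporary directory and a two-row stand-in frame (`curated`, a hypothetical name) rather than the real `clean` DataFrame.

```python
import os
import tempfile
import pandas as pd

curated = pd.DataFrame({
    "SMILES": ["CCO", "c1ccccc1"],
    "IC50_nM": [120.0, 3000.0],
})

# Write to a temporary directory so the check leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_mols.csv")
    curated.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns, same row count, and no missing values after the round trip.
print(list(reloaded.columns) == list(curated.columns))
print(len(reloaded) == len(curated))
print(int(reloaded.isna().sum().sum()))
```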

## Your Turn 📝
1. Replace `sample_csv` with **your own dataset** (or build a bigger one).  
2. Add at least **two more RDKit descriptors** (e.g. `Descriptors.TPSA`, `Descriptors.NumHAcceptors`).  
3. Plot a pairplot (`sns.pairplot`) of descriptors vs. activity.  
4. Optional: use `pandas` group‐by or `qcut` to bin molecules by molecular weight and compare median activities.

## Summary & Next Steps
- **pandas** makes CSV I/O and cleaning straightforward.  
- **RDKit** can enrich each molecule with physicochemical descriptors.  
- **seaborn/matplotlib** provide quick insight into property trends.

From here, you can scale this workflow to thousands of compounds, export to other formats (Parquet, Excel), or feed the cleaned data into ML models.