# AI-PSCI-004: RDKit Fundamentals
**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 2 | Module: Introduction to AI in Pharmaceuticals | Estimated Time: 60-90 minutes**

**Prerequisites**: AI-PSCI-001, AI-PSCI-002, AI-PSCI-003

---


## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Install and import RDKit in Google Colab
2. Create molecule objects from SMILES strings
3. Calculate basic molecular properties (MW, LogP, HBD, HBA, TPSA)
4. Generate and display 2D molecular structures
5. Perform substructure searches to find molecular patterns

---


## 📚 Background

### What is RDKit?

**RDKit** (the name comes from Rational Discovery, the company where its development began) is one of the most widely used open-source cheminformatics toolkits in the pharmaceutical industry. It provides Python tools for:

- Reading and writing chemical structures
- Calculating molecular properties (descriptors)
- Searching for substructures
- Generating molecular fingerprints
- 2D and 3D molecular visualization

Many drug discovery teams use RDKit or similar tools to analyze chemical libraries and guide medicinal chemistry decisions.

### Why Molecular Properties Matter

Before a drug can work, it must:
1. **Be absorbed** (get into the bloodstream)
2. **Distribute** to target tissues
3. **Bind** to the target protein
4. **Avoid** rapid metabolism and clearance

Simple molecular properties such as molecular weight, lipophilicity, and hydrogen bonding capacity strongly influence each of these steps and the compound's overall ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) profile.

### Key Concepts

- **SMILES**: Text representation of chemical structures (from AI-PSCI-003)
- **Mol object**: RDKit's internal representation of a molecule
- **Molecular descriptor**: Calculated property describing a molecule
- **Lipinski's Rule of 5**: Guidelines for oral drug absorption
- **Substructure search**: Finding molecules containing specific patterns

---


## 🛠️ Setup

Run this cell to install RDKit:


In [None]:
#@title 🛠️ Install RDKit
!pip install rdkit -q
print("✅ RDKit installed successfully!")

Import the required libraries:


In [None]:
#@title 📦 Import Libraries
from rdkit import Chem
from rdkit.Chem import Draw, Descriptors, AllChem
from rdkit.Chem.Draw import IPythonConsole
import pandas as pd
import matplotlib.pyplot as plt

# Configure RDKit to display molecules nicely in Colab
IPythonConsole.ipython_useSVG = True

print("✅ All libraries imported!")

---

## 🔬 Guided Inquiry 1: Creating Molecules from SMILES

### Context

The first step in any RDKit workflow is creating a **Mol object** from a SMILES string. The Mol object is RDKit's internal representation of a molecule, which enables all downstream calculations and visualizations.

If RDKit cannot parse a SMILES string (invalid syntax or impossible chemistry), it logs a message and returns `None` rather than raising an exception, so always check your results!
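The `None` check described above can be sketched as follows. This is a minimal illustration, not the expected solution: the dictionary name `smiles_by_name` and the deliberately broken entry are assumptions made for the example, and the SMILES strings should be confirmed against a database such as PubChem.

```python
from rdkit import Chem

# Illustrative SMILES; verify them against a trusted source before use.
smiles_by_name = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "metformin": "CN(C)C(=N)N=C(N)N",
    "broken example": "C1CC(",  # deliberately invalid: unclosed ring and branch
}

mols = {}
for name, smi in smiles_by_name.items():
    mol = Chem.MolFromSmiles(smi)  # returns None when parsing fails
    if mol is None:
        print(f"Could not parse {name!r}: {smi}")
        continue
    mols[name] = mol
    print(f"{name}: {mol.GetNumAtoms()} heavy atoms")
```

Only the three valid entries end up in `mols`; the broken one is reported and skipped instead of causing a failure later in the workflow.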

### Your Task

Using your AI assistant, write code to:

1. Define SMILES strings for **aspirin**, **caffeine**, and **metformin**
2. Convert each SMILES to an RDKit Mol object using `Chem.MolFromSmiles()`
3. Verify each molecule was created successfully (not `None`)
4. Display one of the molecules to see its 2D structure

💡 **Prompting Tips**:
- Ask: "What is the SMILES string for aspirin?"
- Ask: "How do I create an RDKit molecule from a SMILES string?"
- If a molecule returns `None`, ask your AI to check the SMILES syntax

### Verification

After running your code, confirm:
- [ ] All three Mol objects are not `None`
- [ ] You can see a 2D structure when displaying a molecule
- [ ] No error messages

📓 **Lab Notebook**: Record the SMILES strings for each drug. Note any patterns you observe in the structures.


In [None]:
# Your code here



## 🔬 Guided Inquiry 2: Calculating Molecular Properties

### Context

Molecular properties (descriptors) are numerical values calculated from a molecule's structure. These are crucial for predicting whether a compound might make a good drug.

**Lipinski's Rule of 5** is a famous guideline stating that orally active drugs typically have:
- Molecular Weight ≤ 500 Da
- LogP (lipophilicity) ≤ 5
- H-bond donors ≤ 5
- H-bond acceptors ≤ 10

These rules have guided medicinal chemistry for decades!
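One possible way to turn these four thresholds into code is shown below. It is a sketch under stated assumptions: the helper name `lipinski_profile` and the `violations` key are hypothetical, and `Descriptors.MolLogP` is a calculated (Crippen) estimate rather than a measured LogP.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def lipinski_profile(smiles):
    """Return the four Rule-of-5 descriptors plus a violation count."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    props = {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),  # calculated estimate, not measured
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
    }
    # Each comparison is True (counted as 1) when a threshold is exceeded.
    props["violations"] = sum([
        props["MW"] > 500,
        props["LogP"] > 5,
        props["HBD"] > 5,
        props["HBA"] > 10,
    ])
    return props


aspirin = lipinski_profile("CC(=O)OC1=CC=CC=C1C(=O)O")
print(aspirin)
```

A compound passes all four criteria when `violations` is 0. Counting violations rather than returning a single yes/no makes it easy to see how far a compound is from compliance.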

### Your Task

Using your AI assistant, write code to:

1. Calculate these properties for all three drugs:
   - Molecular Weight (`Descriptors.MolWt`)
   - LogP (`Descriptors.MolLogP`)
   - Number of H-bond donors (`Descriptors.NumHDonors`)
   - Number of H-bond acceptors (`Descriptors.NumHAcceptors`)
   - Topological Polar Surface Area (`Descriptors.TPSA`)

2. Create a pandas DataFrame with the results

3. Determine which drugs pass all Lipinski Rule of 5 criteria

💡 **Prompting Tips**:
- Ask: "What RDKit function calculates molecular weight?"
- Ask: "How do I create a DataFrame from a dictionary?"
- Ask your AI to explain what each property means biologically

### Verification

After running your code, confirm:
- [ ] Aspirin MW ≈ 180 Da
- [ ] Caffeine MW ≈ 194 Da
- [ ] Metformin MW ≈ 129 Da
- [ ] All three drugs pass Lipinski's rules

📓 **Lab Notebook**: Which drug has the highest LogP? What might this mean for its absorption?


In [None]:
# Your code here



## 🔬 Guided Inquiry 3: Visualizing Multiple Molecules

### Context

Drug discovery often involves comparing many molecules at once. RDKit's `Draw.MolsToGridImage()` function creates a grid of 2D structures, which is invaluable for visual comparison.

Let's expand our drug library and visualize them together!
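As a rough sketch of how `Draw.MolsToGridImage()` is typically called, the example below draws a small grid with legends. The variable names are illustrative, and `useSVG=True` is chosen here only so the result is a plain SVG string outside a notebook; in Colab, with `IPythonConsole` imported, the grid is displayed inline instead.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Three of the drugs from this inquiry, used here as a small example set.
drugs = {
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Acetaminophen": "CC(=O)NC1=CC=C(C=C1)O",
    "Trimethoprim": "COC1=CC(=CC(=C1OC)OC)CC2=CN=C(N=C2N)N",
}
names = list(drugs)
mols = [Chem.MolFromSmiles(smi) for smi in drugs.values()]

# legends must be in the same order as mols; molsPerRow sets the column count.
grid_svg = Draw.MolsToGridImage(
    mols,
    molsPerRow=3,
    subImgSize=(250, 250),
    legends=names,
    useSVG=True,
)
```

Keeping names and SMILES in one dictionary helps ensure that each legend stays aligned with its structure when the library grows.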

### Your Task

Using your AI assistant, write code to:

1. Add three more drugs to your collection:
   - **Ibuprofen**: `CC(C)CC1=CC=C(C=C1)C(C)C(=O)O`
   - **Acetaminophen**: `CC(=O)NC1=CC=C(C=C1)O`
   - **Trimethoprim**: `COC1=CC(=CC(=C1OC)OC)CC2=CN=C(N=C2N)N`

2. Create a grid image showing all 6 drugs

3. Add drug names as legends below each structure

💡 **Prompting Tips**:
- Ask: "How do I use RDKit's MolsToGridImage with legends?"
- Ask: "How do I control the number of columns in the grid?"
- If structures look wrong, ask about adding 2D coordinates

### Verification

After running your code, confirm:
- [ ] Grid shows 6 molecules
- [ ] Each molecule has its name as a label
- [ ] Structures look chemically reasonable

📓 **Lab Notebook**: Which drugs look most similar structurally? Can you identify common functional groups?


In [None]:
# Your code here



## 🔬 Guided Inquiry 4: Substructure Searching

### Context

A powerful feature of cheminformatics is **substructure searching**: finding molecules that contain a specific pattern. This is how medicinal chemists search for:
- All compounds with a carboxylic acid group
- All compounds containing a benzene ring
- All compounds with a specific scaffold

RDKit uses SMARTS patterns (like regex for chemistry) for flexible searching.
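A substructure search of this kind might look like the sketch below. The dictionary names are illustrative, only four drugs are included to keep it short, and the acid pattern `C(=O)[OH]` is a stricter variant chosen for this example because it requires a hydrogen on the oxygen and therefore does not match esters.

```python
from rdkit import Chem

library = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Acetaminophen": "CC(=O)NC1=CC=C(C=C1)O",
    "Metformin": "CN(C)C(=N)N=C(N)N",
}
mols = {name: Chem.MolFromSmiles(smi) for name, smi in library.items()}

# Query molecules are built from SMARTS, not SMILES.
patterns = {
    "carboxylic acid": Chem.MolFromSmarts("C(=O)[OH]"),
    "benzene ring": Chem.MolFromSmarts("c1ccccc1"),
}

# For each pattern, collect the names of the drugs that contain it.
hits = {
    label: [name for name, mol in mols.items() if mol.HasSubstructMatch(patt)]
    for label, patt in patterns.items()
}
print(hits)

# Atom indices of the first acid match in aspirin (empty tuple if no match).
acid_atoms = mols["Aspirin"].GetSubstructMatch(patterns["carboxylic acid"])
print(acid_atoms)
```

`HasSubstructMatch` answers yes or no, while `GetSubstructMatch` returns the matching atom indices. Those indices can be passed to the `highlightAtomLists` argument of `Draw.MolsToGridImage()` to highlight the matched group in a drawing.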

### Your Task

Using your AI assistant, write code to:

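1. Create a SMARTS pattern for a **carboxylic acid** group: `C(=O)O` (this simple pattern also matches esters; `C(=O)[OH]` is stricter because it requires a hydrogen on the oxygen)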
2. Search your drug library to find which drugs contain this group
3. Create a SMARTS pattern for an **aromatic ring**: `c1ccccc1` (a benzene ring, i.e. six aromatic carbons; rings containing nitrogen will not match)
4. Search for drugs containing aromatic rings
5. Highlight the matching substructure in at least one molecule

💡 **Prompting Tips**:
- Ask: "How do I do a substructure search in RDKit?"
- Ask: "What's the difference between SMILES and SMARTS?"
- Ask: "How do I highlight a substructure in a molecule image?"

### Verification

After running your code, confirm:
- [ ] Aspirin and Ibuprofen contain carboxylic acid groups
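- [ ] Four drugs (4 of 6) match the benzene pattern `c1ccccc1`: aspirin, ibuprofen, acetaminophen, and trimethoprim (caffeine's rings contain nitrogen atoms, so this pattern does not match them)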
- [ ] Metformin does NOT contain an aromatic ring
- [ ] Highlighted image shows the matched atoms

📓 **Lab Notebook**: What other substructures might be pharmaceutically important? (Think about common functional groups in drugs.)


In [None]:
# Your code here



In [None]:
# Your code here



## 🔬 Guided Inquiry 5: Property Analysis and Visualization

### Context

Data visualization helps us understand patterns in molecular properties. Let's create plots comparing our drug library across multiple dimensions.

A common plot in drug discovery is **MW vs LogP**, which shows the "chemical space" occupied by compounds and helps identify outliers.
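A plot of this kind could be put together roughly as follows. This is a sketch with assumed choices: MW is placed on the x-axis, the colors and label offsets are arbitrary, and the coloring checks only the two plotted criteria rather than all four Lipinski rules.

```python
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Descriptors

drugs = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Metformin": "CN(C)C(=N)N=C(N)N",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Acetaminophen": "CC(=O)NC1=CC=C(C=C1)O",
    "Trimethoprim": "COC1=CC(=CC(=C1OC)OC)CC2=CN=C(N=C2N)N",
}

fig, ax = plt.subplots(figsize=(6, 4))
points = []
for name, smi in drugs.items():
    mol = Chem.MolFromSmiles(smi)
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)  # calculated estimate, not measured
    passes = mw <= 500 and logp <= 5  # only the two plotted criteria
    ax.scatter(mw, logp, color="tab:green" if passes else "tab:red")
    # Offset each label slightly so it does not sit on top of its point.
    ax.annotate(name, (mw, logp), xytext=(5, 5), textcoords="offset points")
    points.append((name, mw, logp))

# Reference lines at the Rule-of-5 thresholds.
ax.axvline(500, linestyle="--", color="gray")
ax.axhline(5, linestyle="--", color="gray")
ax.set_xlabel("Molecular weight (Da)")
ax.set_ylabel("Calculated LogP")
ax.set_title("MW vs LogP")
```

In a notebook the figure renders at the end of the cell; the `points` list is kept only so the plotted values can be inspected afterwards.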

### Your Task

Using your AI assistant, write code to:

1. Calculate MW and LogP for all 6 drugs
2. Create a scatter plot of MW vs LogP
3. Label each point with the drug name
4. Add Lipinski Rule of 5 thresholds as reference lines (MW=500, LogP=5)
5. Color the points based on whether they pass Lipinski's rules

💡 **Prompting Tips**:
- Ask: "How do I add text labels to matplotlib scatter plots?"
- Ask: "How do I add horizontal and vertical reference lines?"
- If labels overlap, ask about using `plt.annotate()` with offsets

### Verification

After running your code, confirm:
- [ ] Scatter plot shows 6 points
- [ ] Each point is labeled with drug name
- [ ] All drugs appear in the "safe" region bounded by the two threshold lines (MW ≤ 500 and LogP ≤ 5)
- [ ] Axes are properly labeled

📓 **Lab Notebook**: Which drug has the highest LogP? Is this consistent with what you know about its absorption?


In [None]:
# Your code here



## 🔬 Guided Inquiry 6: Creating a Drug Property Report

### Context

Let's bring everything together by creating a comprehensive property report for our drug library. This is similar to what medicinal chemists prepare when evaluating compound series.
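One way such a report could be assembled with pandas is sketched below. The column names (`RotBonds`, `Rings`, `AromaticRings`, `Lipinski`) are illustrative choices rather than a required format, and the LogP values are calculated estimates.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

drugs = {
    "Aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "Caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "Metformin": "CN(C)C(=N)N=C(N)N",
    "Ibuprofen": "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
    "Acetaminophen": "CC(=O)NC1=CC=C(C=C1)O",
    "Trimethoprim": "COC1=CC(=CC(=C1OC)OC)CC2=CN=C(N=C2N)N",
}

# One dictionary per drug; pandas turns the list of dicts into table rows.
rows = []
for name, smi in drugs.items():
    mol = Chem.MolFromSmiles(smi)
    rows.append({
        "Drug": name,
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        "TPSA": Descriptors.TPSA(mol),
        "RotBonds": Descriptors.NumRotatableBonds(mol),
        "Rings": Descriptors.RingCount(mol),
        "AromaticRings": Descriptors.NumAromaticRings(mol),
    })

report = pd.DataFrame(rows)

# Element-wise check of all four Rule-of-5 thresholds.
passes = (
    (report["MW"] <= 500)
    & (report["LogP"] <= 5)
    & (report["HBD"] <= 5)
    & (report["HBA"] <= 10)
)
report["Lipinski"] = passes.map({True: "Pass", False: "Fail"})

report = report.sort_values("MW").round(2).reset_index(drop=True)
print(report)
```

Building the table from a list of dictionaries keeps each drug's values together, and the boolean mask makes the compliance rule explicit and easy to change.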

### Your Task

Using your AI assistant, write code to:

1. Calculate an expanded set of properties:
   - MW, LogP, HBD, HBA, TPSA
   - Number of rotatable bonds (`Descriptors.NumRotatableBonds`)
   - Number of rings (`Descriptors.RingCount`)
   - Number of aromatic rings (`Descriptors.NumAromaticRings`)

2. Create a formatted DataFrame with all properties

3. Add a column indicating Lipinski Rule of 5 compliance

4. Sort by molecular weight

💡 **Prompting Tips**:
- Ask: "What other descriptors are available in RDKit?"
- Ask: "How do I add a new column to a pandas DataFrame based on conditions?"
- If you want to export, ask about `df.to_csv()`

### Verification

After running your code, confirm:
- [ ] Table has 8 property columns plus the drug name and a Lipinski compliance column
- [ ] All 6 drugs are included
- [ ] Drugs are sorted from smallest to largest MW
- [ ] All drugs show "Pass" for Lipinski compliance

📓 **Lab Notebook**: Which drug has the most rotatable bonds? Why might this matter for binding?


In [None]:
# Your code here



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Install and import RDKit in Google Colab
- [ ] Create Mol objects from SMILES strings
- [ ] Calculate molecular properties (MW, LogP, HBD, HBA, TPSA)
- [ ] Generate 2D structure images for single molecules and grids
- [ ] Perform substructure searches using SMARTS patterns
- [ ] Create visualizations comparing molecular properties
- [ ] Explain Lipinski's Rule of 5 and its significance

### Your lab notebook should include:

- [ ] SMILES strings for all 6 drugs studied
- [ ] Property table comparing drugs
- [ ] MW vs LogP scatter plot
- [ ] Notes on which drugs contain carboxylic acids vs aromatic rings
- [ ] Reflection on how molecular properties relate to drug absorption

---


## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Property Design**: If you were designing a drug for oral absorption, what LogP range would you target and why?

2. **Structural Patterns**: Caffeine has a very low LogP (about -0.07 experimentally; the value calculated by `Descriptors.MolLogP` will differ) yet crosses the blood-brain barrier easily. What structural features might explain this?

3. **Beyond Lipinski**: Lipinski's Rule of 5 was developed in 1997. What limitations might these rules have for modern drug discovery (think about biologics, targeted therapies, etc.)?

---


## 📖 Further Reading

- [RDKit Documentation](https://www.rdkit.org/docs/) - Official RDKit docs with tutorials
- [RDKit Cookbook](https://www.rdkit.org/docs/Cookbook.html) - Practical code examples
- [Lipinski CA et al. (1997)](https://www.sciencedirect.com/science/article/pii/S0169409X96004231) - Original Rule of 5 paper
- [TeachOpenCADD T001](https://projects.volkamerlab.org/teachopencadd/talktorials/T001_query_chembl.html) - More cheminformatics examples

---


## 🔗 Connection to Research

The RDKit skills you learned today are used daily in pharmaceutical research:

- **Virtual screening**: Filter millions of compounds based on property criteria
- **Lead optimization**: Track how structural changes affect properties
- **Patent analysis**: Search for compounds containing specific scaffolds
- **ADMET prediction**: Use properties as inputs to machine learning models

In upcoming talktorials, you'll use these RDKit fundamentals to:
- Query ChEMBL for target-specific bioactivity data (AI-PSCI-005)
- Calculate molecular fingerprints for similarity searching (AI-PSCI-007)
- Prepare molecules for docking studies (AI-PSCI-014)

---

*AI-PSCI-004 Complete. Proceed to AI-PSCI-005: ChEMBL Database Queries.*
