# AI-PSCI-003: Molecular Representations

**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 2 | Module: Introduction to AI in Pharmaceuticals | Estimated Time: 60-90 minutes**

**Prerequisites**: AI-PSCI-001 (Colab basics), AI-PSCI-002 (AI collaboration)

---

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Understand and interpret **SMILES notation** for small molecules
2. Explain the purpose and structure of **InChI and InChIKey** identifiers
3. Read and interpret basic **PDB file format** for protein structures
4. **Convert between molecular representations** programmatically
5. **Visualize molecules** in 2D and 3D using Python

---

## 📚 Background

### Why Molecular Representations Matter

In computational drug discovery, we need ways to represent molecules that computers can understand. Just as DNA sequences encode genetic information as strings of letters (A, T, G, C), we have developed notations to encode chemical structures.

The key challenge: a molecule is a 3D object with atoms and bonds, but computers work best with text strings and numbers. Different representations serve different purposes:

| Representation | Purpose | Example |
|----------------|---------|--------|
| **SMILES** | Human-readable, searchable | `CC(=O)OC1=CC=CC=C1C(=O)O` |
| **InChI** | Unique identifier | `InChI=1S/C9H8O4/c1-6(10)...` |
| **InChIKey** | Database lookup | `BSYNRYMUTXBXSQ-UHFFFAOYSA-N` |
| **PDB** | 3D coordinates | Atom positions in space |
| **MOL/SDF** | Complete structure | 2D/3D with all properties |

### SMILES: Simplified Molecular-Input Line-Entry System

SMILES was developed in the 1980s as a way to represent molecules as text strings. Key rules:

- **Atoms**: Written as element symbols (C, N, O, S, etc.)
- **Bonds**: Single bonds are implicit; `=` for double, `#` for triple
- **Branches**: Shown in parentheses `()`
- **Rings**: Indicated by numbers at ring closure points
- **Aromaticity**: Lowercase letters (c, n, o) indicate aromatic atoms
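The rules above can be checked by letting RDKit parse a SMILES string and report what it found. This is a minimal sketch that uses the aspirin SMILES from the table; the variable names are illustrative only.

```python
from rdkit import Chem

# Aspirin written in Kekule form (alternating single/double ring bonds)
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
mol = Chem.MolFromSmiles(smiles)  # returns None if the SMILES cannot be parsed

# Hydrogens are implicit in SMILES, so only heavy atoms are counted here
heavy_atoms = mol.GetNumAtoms()

# RDKit perceives the benzene ring as aromatic even from the Kekule form
aromatic_atoms = sum(atom.GetIsAromatic() for atom in mol.GetAtoms())

# The two remaining explicit double bonds are the C=O groups
double_bonds = sum(
    bond.GetBondType() == Chem.BondType.DOUBLE for bond in mol.GetBonds()
)

# The ring-closure digit "1" appears twice and closes one ring
ring_count = mol.GetRingInfo().NumRings()

print(f"Heavy atoms:    {heavy_atoms}")
print(f"Aromatic atoms: {aromatic_atoms}")
print(f"Double bonds:   {double_bonds}")
print(f"Rings:          {ring_count}")

# On output, RDKit writes the ring atoms in lowercase (aromatic) form
canonical = Chem.MolToSmiles(mol)
print(canonical)
```

Note that `Chem.MolFromSmiles` returns `None` rather than raising an error for an invalid string, so checking for `None` is a sensible first step with any SMILES you did not write yourself.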

### InChI: International Chemical Identifier

InChI is a canonical identifier developed by IUPAC. Unlike SMILES, where the same molecule can be written as many different valid strings, the standard InChI algorithm produces a single string for a given structure.

The **InChIKey** is a fixed-length, 27-character hash of the InChI, designed for database and web searching. Because it is a hash, it cannot be converted back into a structure.
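To see the difference from SMILES in practice, the sketch below writes ethanol three different ways and generates the identifiers with RDKit's `Chem.MolToInchi` and `Chem.MolToInchiKey`. Ethanol is used here only as a small illustrative molecule.

```python
from rdkit import Chem

# Three valid SMILES strings for the same molecule (ethanol)
smiles_variants = ["CCO", "OCC", "C(O)C"]

inchis = []
inchi_keys = []
for smi in smiles_variants:
    mol = Chem.MolFromSmiles(smi)
    inchis.append(Chem.MolToInchi(mol))
    inchi_keys.append(Chem.MolToInchiKey(mol))

for smi, inchi, key in zip(smiles_variants, inchis, inchi_keys):
    print(f"{smi:8s} {inchi:40s} {key}")

# All three inputs collapse to one InChI and one InChIKey
print("Unique InChIs:   ", len(set(inchis)))
print("Unique InChIKeys:", len(set(inchi_keys)))
print("InChIKey length: ", len(inchi_keys[0]))
```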

### PDB Format: Protein Data Bank

The PDB format stores 3D atomic coordinates in fixed-width text records. It was originally designed for proteins but is also used for small molecules. Each `ATOM` or `HETATM` line contains:
- Atom serial number and atom name
- Residue name and residue number
- Coordinates (x, y, z)

These records are the starting point for docking, visualization, and structural analysis.
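Because PDB records are fixed-width, fields are read by column position rather than by splitting on whitespace. The sketch below builds one `HETATM` line field by field and slices it back apart. The coordinates and the `UNL` residue name are made-up illustrative values, and `parse_pdb_atom_line` is a hypothetical helper, not part of any library.

```python
# Build a sample HETATM record one fixed-width field at a time.
# Values are invented for illustration; only the column layout matters.
SAMPLE_LINE = "".join([
    "HETATM",    # columns 1-6:   record type
    "    1",     # columns 7-11:  atom serial number
    " ",         # column 12:     unused
    " C1 ",      # columns 13-16: atom name
    " ",         # column 17:     alternate location
    "UNL",       # columns 18-20: residue name
    "  ",        # columns 21-22: unused + chain ID (blank here)
    "   1",      # columns 23-26: residue number
    "    ",      # columns 27-30: insertion code + unused
    "   1.234",  # columns 31-38: x coordinate
    "  -0.567",  # columns 39-46: y coordinate
    "   0.089",  # columns 47-54: z coordinate
    "  1.00",    # columns 55-60: occupancy
    "  0.00",    # columns 61-66: temperature factor
    " " * 10,    # columns 67-76: unused
    " C",        # columns 77-78: element symbol
])


def parse_pdb_atom_line(line):
    """Slice one ATOM/HETATM record into a dict (Python slices are 0-based)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_pdb_atom_line(SAMPLE_LINE)
print(SAMPLE_LINE)
print(atom)
```

Slicing by column is the safer choice because adjacent fields can run together with no space between them, for example when a coordinate is large and negative.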

### Key Concepts

- **Canonical SMILES**: A unique, standardized SMILES for a molecule
- **Isomeric SMILES**: Includes stereochemistry information
- **2D vs 3D**: 2D shows connectivity; 3D shows actual spatial arrangement
- **Molecular fingerprints**: Bit vectors encoding structural features (we'll cover these later)
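The first two concepts can be illustrated with a short sketch. Toluene and L-alanine are chosen here only as small examples. Keep in mind that a canonical SMILES is canonical for the toolkit that produced it; other toolkits may write a different string for the same molecule.

```python
from rdkit import Chem

# Canonical SMILES: three ways of writing toluene collapse to one string
variants = ["Cc1ccccc1", "c1ccccc1C", "CC1=CC=CC=C1"]
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(smi)) for smi in variants}
print("Canonical forms:", canonical)

# Isomeric SMILES: the @ symbol records the stereocenter of L-alanine
l_alanine = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
isomeric = Chem.MolToSmiles(l_alanine)                       # keeps stereochemistry
flat = Chem.MolToSmiles(l_alanine, isomericSmiles=False)     # drops stereochemistry
print("Isomeric:    ", isomeric)
print("Non-isomeric:", flat)
```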

---

## 🛠️ Setup

Run the cells below to install required packages and import libraries.

In [None]:
#@title 📦 Install Required Packages (run once)
!pip install rdkit -q
!pip install py3Dmol -q
print("✅ Packages installed successfully!")

In [None]:
#@title 📚 Import Libraries
# Core libraries
import numpy as np
import pandas as pd

# RDKit for molecular operations
from rdkit import Chem
from rdkit.Chem import Draw, AllChem, Descriptors
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem.Draw import IPythonConsole

# For 3D visualization
import py3Dmol

# Configure display
IPythonConsole.ipython_useSVG = True
IPythonConsole.molSize = (400, 300)

print("✅ All libraries imported successfully!")

---

## 🔬 Guided Inquiry 1: Understanding SMILES Notation

### Context

SMILES (Simplified Molecular-Input Line-Entry System) is the most widely used text representation of molecules in cheminformatics. Understanding SMILES is essential for querying databases, communicating with AI tools, and working with molecular data.

### Your Task

Using your AI assistant, write code to:

1. Create SMILES strings for these well-known drugs:
   - **Aspirin** (acetylsalicylic acid)
   - **Caffeine**
   - **Ibuprofen**

2. Use RDKit to create molecule objects from these SMILES

3. Display the 2D structures to verify they're correct

💡 **Prompting Tips**:
- Ask: "What is the SMILES notation for aspirin?"
- If unsure, ask your AI to explain each part of the SMILES string
- You can verify structures at PubChem (https://pubchem.ncbi.nlm.nih.gov/)
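As a starting pattern, the sketch below draws a grid of 2D structures for three molecules that are not part of this task (ethanol, toluene, and paracetamol). It assumes `Draw.MolsToGridImage` with `useSVG=True`; in a notebook the grid renders when it is the last expression in the cell.

```python
from rdkit import Chem
from rdkit.Chem import Draw

# Illustrative molecules only; swap in the SMILES you find for the task
names_to_smiles = {
    "Ethanol": "CCO",
    "Toluene": "Cc1ccccc1",
    "Paracetamol": "CC(=O)Nc1ccc(O)cc1",
}

# MolFromSmiles returns None for an invalid string, so check before drawing
mols = [Chem.MolFromSmiles(smi) for smi in names_to_smiles.values()]
for name, mol in zip(names_to_smiles, mols):
    if mol is None:
        print(f"Could not parse SMILES for {name}")

grid = Draw.MolsToGridImage(
    mols,
    molsPerRow=3,
    subImgSize=(250, 200),
    legends=list(names_to_smiles),
    useSVG=True,
)
grid  # displays the grid when this is the last line of a notebook cell
```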

### Verification

After running your code, confirm:
- [ ] All three molecules display correctly
- [ ] Aspirin shows a benzene ring with acetyl and carboxylic acid groups
- [ ] Caffeine shows a bicyclic purine structure with 3 methyl groups
- [ ] Ibuprofen shows a benzene ring with isobutyl and propionic acid groups

📓 **Lab Notebook**: Record the SMILES strings and describe what each part of the aspirin SMILES represents.

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 2: Molecular Properties from SMILES

### Context

Once we have a molecule as an RDKit object, we can calculate many properties. These properties are critical for drug discovery because they help predict whether a compound will be absorbed, distributed, metabolized, and excreted (ADME) appropriately.

### Your Task

Using your AI assistant, write code to:

1. For each of the three drugs (Aspirin, Caffeine, Ibuprofen), calculate:
   - **Molecular Weight** (MW)
   - **LogP** (lipophilicity, expressed as the octanol-water partition coefficient)
   - **Number of hydrogen bond donors** (HBD)
   - **Number of hydrogen bond acceptors** (HBA)
   - **Number of rotatable bonds**

2. Display the results in a formatted table

3. Check which drugs satisfy **Lipinski's Rule of Five**:
   - MW ≤ 500
   - LogP ≤ 5
   - HBD ≤ 5
   - HBA ≤ 10

💡 **Prompting Tips**:
- Ask: "How do I calculate molecular weight using RDKit?"
- Look for `Descriptors` module functions
- Ask about Lipinski's Rule of Five if unfamiliar
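The pattern for this task can be sketched on a molecule outside the exercise, here paracetamol (acetaminophen). The dictionary layout is just one possible design. Note that `Descriptors.MolLogP` is a calculated estimate, not an experimental value.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol

# Calculate the five properties listed in the task
props = {
    "MW": Descriptors.MolWt(mol),
    "LogP": Descriptors.MolLogP(mol),          # calculated (Crippen) estimate
    "HBD": Descriptors.NumHDonors(mol),
    "HBA": Descriptors.NumHAcceptors(mol),
    "RotB": Descriptors.NumRotatableBonds(mol),
}

# Lipinski thresholds as given above (rotatable bonds are not part of the rule)
limits = {"MW": 500, "LogP": 5, "HBD": 5, "HBA": 10}
passes = all(props[name] <= limit for name, limit in limits.items())

for name, value in props.items():
    print(f"{name:5s} {value:8.2f}")
print("Passes Lipinski's Rule of Five:", passes)
```

For several molecules, the same `props` dictionaries can be collected into a list and passed to `pd.DataFrame` to produce the formatted table.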

### Verification

After running your code, confirm:
- [ ] Aspirin MW ≈ 180.16 g/mol
- [ ] Caffeine MW ≈ 194.19 g/mol
- [ ] Ibuprofen MW ≈ 206.28 g/mol
- [ ] All three drugs pass Lipinski's Rule of Five

📓 **Lab Notebook**: Record which properties each drug satisfies and explain why all three pass Lipinski's rules.

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 3: InChI and InChIKey

### Context

While SMILES is great for human readability, the same molecule can have multiple valid SMILES representations. InChI (International Chemical Identifier) provides a **unique, canonical** identifier for each molecule, making it ideal for database searching and cross-referencing.

### Your Task

Using your AI assistant, write code to:

1. Generate the **InChI** and **InChIKey** for each of the three drugs

2. Demonstrate that different SMILES give the same InChI:
   - Try at least two different valid SMILES for aspirin
   - Show they produce the same InChIKey

3. Split the InChI string into its layers to understand its components

💡 **Prompting Tips**:
- Ask: "How do I generate InChI from a molecule in RDKit?"
- Ask: "What are different valid SMILES representations of aspirin?"
- Request explanation of InChI layers if unfamiliar
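An InChI is a series of layers separated by `/`, so plain string handling is enough to take it apart. The sketch below does this for ethanol as an illustrative molecule; the `named` dictionary is just one convenient way to hold the layers.

```python
from rdkit import Chem

inchi = Chem.MolToInchi(Chem.MolFromSmiles("CCO"))  # ethanol
print(inchi)

# Everything before the first "/" is the prefix and version
prefix, *layers = inchi.split("/")
version = prefix.split("=")[1]   # "1S" means version 1, standard InChI

# The first layer is the molecular formula and has no letter prefix
formula = layers[0]

# Later layers start with a one-letter tag: c = connectivity, h = hydrogens
named = {layer[0]: layer[1:] for layer in layers[1:]}

print("Version:     ", version)
print("Formula:     ", formula)
print("Connectivity:", named.get("c"))
print("Hydrogens:   ", named.get("h"))
```

In the connectivity layer, the numbers are atom labels assigned by the InChI algorithm, and `1-2-3` reads as atom 1 bonded to atom 2 bonded to atom 3.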

### Verification

After running your code, confirm:
- [ ] InChIKey for aspirin is `BSYNRYMUTXBXSQ-UHFFFAOYSA-N`
- [ ] Different SMILES for the same molecule produce identical InChIKeys
- [ ] You can explain the connectivity (`/c`) layer of the InChI

📓 **Lab Notebook**: Document why InChIKey is useful for database searching. What would happen if you searched for a molecule using SMILES instead?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 4: Understanding PDB Format

### Context

The Protein Data Bank (PDB) format stores 3D atomic coordinates. While originally designed for proteins, it's widely used for ligands (small molecules) as well. Understanding this format is essential for molecular docking and structure-based drug design.

### Your Task

Using your AI assistant, write code to:

1. Generate 3D coordinates for aspirin using RDKit's conformer generation

2. Export the structure to PDB format and examine the content

3. Parse the PDB text to identify:
   - Number of atoms
   - Atom types present
   - The 3D coordinates of the first carbon atom

4. Create a simple visualization of the atom positions

💡 **Prompting Tips**:
- Ask: "How do I generate 3D coordinates for a molecule in RDKit?"
- Look for `AllChem.EmbedMolecule()` for conformer generation
- Ask about the meaning of each column in a PDB ATOM record
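The workflow can be sketched on ethanol before you apply it to aspirin. The fixed `randomSeed` is an assumption added here so the embedding is reproducible; the MMFF optimization step is optional and simply relaxes the initial geometry.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hydrogens must be added explicitly before generating 3D coordinates
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))  # ethanol

# EmbedMolecule returns 0 on success and -1 if embedding failed
status = AllChem.EmbedMolecule(mol, randomSeed=42)
AllChem.MMFFOptimizeMolecule(mol)  # optional force-field cleanup

pdb_block = Chem.MolToPDBBlock(mol)
print(pdb_block)

# Keep only coordinate records (skips CONECT and END lines)
atom_lines = [
    line for line in pdb_block.splitlines()
    if line.startswith(("ATOM", "HETATM"))
]
print("Number of atoms:", len(atom_lines))

# Coordinates sit in fixed columns 31-54 of each record
first = atom_lines[0]
x, y, z = float(first[30:38]), float(first[38:46]), float(first[46:54])
print(f"First atom at ({x:.3f}, {y:.3f}, {z:.3f})")
```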

### Verification

After running your code, confirm:
- [ ] Aspirin has 21 atoms (including hydrogens)
- [ ] PDB contains HETATM records (small molecule atoms)
- [ ] Each atom has x, y, z coordinates
- [ ] Coordinates are in Angstroms (Å)

📓 **Lab Notebook**: Explain the difference between ATOM and HETATM records in PDB files. Why would a drug molecule use HETATM?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 5: 3D Visualization with py3Dmol

### Context

Visualizing molecules in 3D helps us understand their shape, which is critical for understanding how they fit into binding pockets. py3Dmol provides interactive 3D visualization right in your Colab notebook.

### Your Task

Using your AI assistant, write code to:

1. Create an interactive 3D visualization of aspirin using py3Dmol

2. Show the molecule in different representations:
   - Stick model (default)
   - Ball-and-stick model
   - Surface representation

3. Color atoms by element type

4. Add a spin animation to see the molecule from all angles

💡 **Prompting Tips**:
- Ask: "How do I visualize a molecule in 3D using py3Dmol?"
- Look for `setStyle()` method for different representations
- The `spin()` method adds animation

### Verification

After running your code, confirm:
- [ ] You can rotate the molecule by dragging
- [ ] Carbons appear gray, oxygens red, hydrogens white
- [ ] The aromatic ring is visible and planar
- [ ] The acetyl and carboxylic acid groups are positioned correctly

📓 **Lab Notebook**: Describe the overall shape of aspirin. Is it planar or 3D? How might this shape relate to its biological activity?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 6: Converting Between Formats

### Context

In real research, you'll need to convert between molecular formats frequently: getting SMILES from a database, converting to 3D for docking, saving results in different formats. RDKit makes this straightforward.

### Your Task

Using your AI assistant, write code to:

1. Start with the drug **Metformin** (diabetes medication)
   - Find its SMILES representation

2. Convert to multiple formats:
   - Canonical SMILES
   - InChI and InChIKey
   - Molecular formula
   - PDB format (with 3D coordinates)
   - MOL block format

3. Demonstrate round-trip conversion:
   - Convert SMILES → InChI → Molecule → SMILES
   - Verify the final SMILES represents the same molecule

💡 **Prompting Tips**:
- Ask: "What is the SMILES for metformin?"
- Look for `MolToSmiles()`, `MolToInchi()`, `MolToPDBBlock()`, `MolToMolBlock()`
- `rdMolDescriptors.CalcMolFormula()` gives the molecular formula
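A round trip can be sketched on ethanol before trying metformin. Comparing canonical SMILES is the check suggested in the task; the extra InChIKey comparison is an addition here, since molecules with tautomers can come back from InChI in a different but equivalent form.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

start_smiles = "OCC"  # ethanol, deliberately not in canonical order
mol = Chem.MolFromSmiles(start_smiles)

# Forward conversions
canonical_before = Chem.MolToSmiles(mol)
formula = rdMolDescriptors.CalcMolFormula(mol)
inchi = Chem.MolToInchi(mol)

# Round trip: InChI -> molecule -> SMILES
mol_back = Chem.MolFromInchi(inchi)
canonical_after = Chem.MolToSmiles(mol_back)

# Two independent checks that the molecule survived the round trip
same_smiles = canonical_before == canonical_after
same_key = Chem.MolToInchiKey(mol) == Chem.MolToInchiKey(mol_back)

print("Start SMILES:    ", start_smiles)
print("Canonical before:", canonical_before)
print("Formula:         ", formula)
print("InChI:           ", inchi)
print("Canonical after: ", canonical_after)
print("Same SMILES:", same_smiles, "| Same InChIKey:", same_key)
```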

### Verification

After running your code, confirm:
- [ ] Metformin has molecular formula C4H11N5
- [ ] All format conversions succeed without errors
- [ ] Round-trip conversion produces equivalent SMILES

📓 **Lab Notebook**: List three scenarios where you might need to convert between molecular formats in drug discovery research.

In [None]:
# Your code here



---

## ✅ Checkpoint

Congratulations! You've completed the Molecular Representations talktorial. Before moving on, confirm you can:

- [ ] Read and write SMILES notation for simple molecules
- [ ] Calculate molecular properties (MW, LogP, HBD, HBA) using RDKit
- [ ] Generate and interpret InChI and InChIKey identifiers
- [ ] Generate 3D coordinates and export to PDB format
- [ ] Visualize molecules in 3D using py3Dmol
- [ ] Convert between different molecular representations

### Your lab notebook should include:

- [ ] SMILES strings for at least 3 drugs with structural annotations
- [ ] A table of molecular properties for your chosen drugs
- [ ] InChIKeys for each molecule (useful for future database searches)
- [ ] Screenshots or descriptions of 3D visualizations
- [ ] Answers to the reflection questions below

---

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Format Selection**: When would you use SMILES versus InChIKey versus PDB format? Give a specific example for each.

2. **Database Searching**: You want to find all published information about a specific drug. Would you search using SMILES or InChIKey? Why?

3. **3D Structure**: Why is generating 3D coordinates important for drug discovery? What kinds of analyses require 3D information that 2D cannot provide?

4. **Limitations**: What information is lost when converting a 3D structure to SMILES? When might this matter?

---

## 📖 Further Reading

- [SMILES Tutorial - Daylight](https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html) - Comprehensive SMILES documentation
- [InChI FAQ - IUPAC](https://www.inchi-trust.org/technical-faq/) - Official InChI documentation
- [RDKit Documentation](https://www.rdkit.org/docs/) - Complete RDKit reference
- [PDB File Format](https://www.wwpdb.org/documentation/file-format) - Official PDB format specification
- [py3Dmol Documentation](https://3dmol.csb.pitt.edu/) - 3D visualization in Jupyter

---

## 🔗 Connection to Research

The molecular representations you learned today are the foundation for nearly every computational drug discovery task:

- **Virtual Screening**: SMILES-based searches of compound libraries (millions of molecules)
- **Database Queries**: InChIKey lookups in ChEMBL, PubChem for known bioactivity
- **Molecular Docking**: 3D coordinates for predicting binding poses
- **Machine Learning**: Molecular fingerprints derived from SMILES/structures
- **Property Prediction**: ADMET models using calculated descriptors

In Week 5, you'll select your drug target from our 6-target portfolio (DHFR, HIV-1 Protease, SARS-CoV-2 Mpro, AChE, COX-2, or DPP-4). You'll use these representation skills to:

1. Query ChEMBL for known inhibitors of your target
2. Analyze molecular properties of active compounds
3. Visualize how drugs bind to your target
4. Compare reference drugs across different targets

---

*AI-PSCI-003 Complete. Proceed to AI-PSCI-004: RDKit Fundamentals.*