[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/kevingreenman/molecular-representations/blob/main/3-digital_chemistry_representations.ipynb)


# Digital Chemistry: Molecule Representations 🖥️⚛️

*General Chemistry & Cyberinfrastructure Skills Module*

## Learning Objective
Work with and **differentiate** between common computer‐readable chemical representations:
- **SMILES strings**
- **Graph objects** (nodes/edges)
- **Adjacency matrices**
- **XYZ files / Cartesian coordinates**
- …and how to interconvert them in Python.

### Warm‑Up Questions

**WQ‑1.** Are the ways we typically draw molecules on paper (for example, structural formulas or skeletal drawings) precise enough for a computer to interpret unambiguously?

<span style="color:cyan"><strong>Free response:</strong> YOUR RESPONSE TEXT HERE </span>

## Prerequisites
- Python ≥ 3.8
- **RDKit** for cheminformatics
- **networkx** for graph manipulation, with **matplotlib** for drawing the graphs
- **numpy** and **pandas** for matrices / tables

If you’re on Google Colab, run the install cell below first.

In [None]:
# !pip install rdkit networkx pandas numpy matplotlib -q   # ← Uncomment on first run ("rdkit" is the current PyPI name; "rdkit-pypi" is the legacy one)

from rdkit import Chem
from rdkit.Chem import AllChem
import networkx as nx
import numpy as np
import pandas as pd

## Representation Cheat‑Sheet
| Representation | Key idea | Strengths | Limitations |
|----------------|----------|-----------|-------------|
| **SMILES** | Compact line notation describing connectivity & bond order | Human‐typeable, ubiquitous in databases | No 3‑D coordinates; stereochemistry is kept only when written explicitly; one molecule can have many valid strings |
| **Graph** | Nodes = atoms, edges = bonds | Natural for algorithms (shortest path, fingerprints) | Needs extra attributes to capture 3‑D info |
| **Adjacency matrix** | Square matrix; entry is 1 if atoms *i* and *j* are bonded, otherwise 0 | Linear algebra friendly | Grows as *N²*; not human readable; the plain 0/1 form drops bond order |
| **XYZ** | List of atoms + x y z coordinates | Simple 3‑D, universal import/export | No bonding info; charge and spin multiplicity are not part of the format and must be supplied separately |


In [None]:
def mol_to_graph(mol):
    """Convert RDKit Mol → networkx.Graph with atom symbol labels."""
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), symbol=atom.GetSymbol())
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        g.add_edge(i, j, order=str(bond.GetBondType()))
    return g

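def mol_to_xyz(mol, seed=42):
    """Return an XYZ string (explicit Hs added, 3‑D coordinates embedded)."""
    m3d = Chem.AddHs(Chem.Mol(mol))  # work on a copy; XYZ needs explicit hydrogens
    params = AllChem.ETKDG()
    params.randomSeed = seed  # fixed seed so reruns give the same coordinates
    if AllChem.EmbedMolecule(m3d, params) == -1:  # -1 signals embedding failure
        raise ValueError('RDKit could not generate 3-D coordinates for this molecule')
    AllChem.UFFOptimizeMolecule(m3d)  # quick force-field clean-up of the geometry
    conf = m3d.GetConformer()
    lines = [str(m3d.GetNumAtoms()), 'Generated by RDKit']  # atom count, then comment line
    for atom in m3d.GetAtoms():
        pos = conf.GetAtomPosition(atom.GetIdx())
        lines.append(f"{atom.GetSymbol():<2} {pos.x:>10.4f} {pos.y:>10.4f} {pos.z:>10.4f}")
    return "\n".join(lines)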


## Worked Example — Ethanol (*C₂H₅OH*)

In [None]:
smiles = 'CCO'
mol = Chem.MolFromSmiles(smiles)
print('SMILES:', smiles)

In [None]:
# Hydrogens are implicit in this Mol, so the matrix covers only the three heavy atoms.
adj = Chem.GetAdjacencyMatrix(mol)
atom_labels = [f"{a.GetSymbol()}{a.GetIdx()}" for a in mol.GetAtoms()]  # e.g. C0, C1, O2
df_adj = pd.DataFrame(adj, index=atom_labels, columns=atom_labels)
df_adj
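
To make the "1 if bonded" rule concrete, the same matrix can be assembled by hand from a bond list. This is a minimal, dependency-free sketch; the function name `adjacency_from_bonds` and the variable `adj_manual` are illustrative and not part of RDKit.

```python
# Heavy-atom skeleton of ethanol, indexed in the order of the SMILES 'CCO'.
atoms = ['C', 'C', 'O']
bonds = [(0, 1), (1, 2)]  # C0-C1 and C1-O2


def adjacency_from_bonds(n_atoms, bonds):
    """Return an n_atoms x n_atoms list of lists with 1 where two atoms are bonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1  # bonds are undirected, so the matrix is symmetric
    return matrix


adj_manual = adjacency_from_bonds(len(atoms), bonds)
for symbol, row in zip(atoms, adj_manual):
    print(symbol, row)
```

Each row sums to the number of heavy-atom neighbours of that atom, which is one reason this form is convenient for linear algebra.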

In [None]:
G = mol_to_graph(mol)
print(f'Graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges')
print('Edges with bond order:')
for u, v, d in G.edges(data=True):
    print(f"{G.nodes[u]['symbol']}{u} – {G.nodes[v]['symbol']}{v}: order {d['order']}")

import matplotlib.pyplot as plt

# Draw the graph visually with atom symbols as labels
plt.figure(figsize=(3,2))
pos = nx.spring_layout(G, seed=42)
labels = {n: G.nodes[n]['symbol']+str(n) for n in G.nodes}
nx.draw(G, pos, with_labels=True, labels=labels, node_color='lightgray', node_size=800, edge_color='slateblue')
edge_labels = {(u, v): d['order'] for u, v, d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels, font_color='red')
plt.title("Molecule Graph")
plt.axis('off')
plt.show()

In [None]:
xyz_text = mol_to_xyz(mol)
print(xyz_text)
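
The XYZ text lists only element symbols and coordinates, so any program that needs bonds has to guess them. A common heuristic is to call two atoms bonded when their distance is below a scaled sum of covalent radii. The sketch below assumes a small hand-typed radius table, a tolerance factor of 1.2, and a hand-written water geometry; `parse_xyz` and `guess_bonds` are hypothetical helper names, not RDKit functions.

```python
import math

# Approximate covalent radii in angstrom (illustrative values for three elements).
COVALENT_RADII = {'H': 0.31, 'C': 0.76, 'O': 0.66}


def parse_xyz(xyz_text):
    """Parse an XYZ string into a list of (symbol, x, y, z) tuples."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0])
    parsed = []
    for line in lines[2:2 + n_atoms]:  # skip the atom-count and comment lines
        symbol, x, y, z = line.split()
        parsed.append((symbol, float(x), float(y), float(z)))
    return parsed


def guess_bonds(parsed_atoms, tolerance=1.2):
    """Return index pairs whose distance is within tolerance * (r_i + r_j)."""
    found = []
    for i in range(len(parsed_atoms)):
        for j in range(i + 1, len(parsed_atoms)):
            sym_i, *pos_i = parsed_atoms[i]
            sym_j, *pos_j = parsed_atoms[j]
            cutoff = tolerance * (COVALENT_RADII[sym_i] + COVALENT_RADII[sym_j])
            if math.dist(pos_i, pos_j) <= cutoff:
                found.append((i, j))
    return found


water_xyz = """3
water (hand-written example geometry)
O  0.0000  0.0000  0.1173
H  0.0000  0.7572 -0.4692
H  0.0000 -0.7572 -0.4692
"""

water_atoms = parse_xyz(water_xyz)
water_bonds = guess_bonds(water_atoms)
print(water_bonds)
```

The guess recovers which atoms are connected but says nothing about bond order or formal charge, which is exactly the information SMILES carries and XYZ does not.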

## Your Turn 📝
1. Supply **two** SMILES strings.  
2. For each:
   - Show the SMILES.
   - Build and display the adjacency matrix.
   - Convert to a NetworkX graph and print basic stats (nodes, edges).
   - Generate XYZ coordinates.
3. Compare: what information is preserved or lost when moving between representations?

In [None]:
# TODO 1: Replace with your SMILES
my_smiles = ['O=C=O', 'c1ccccc1']  # CO2, benzene

for smi in my_smiles:
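    mol = Chem.MolFromSmiles(smi)
    if mol is None:  # RDKit returns None for a SMILES it cannot parse
        print('\nCould not parse SMILES:', smi)
        continue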
    print('\nSMILES:', smi)
    # TODO 2: Adjacency matrix
    adj = Chem.GetAdjacencyMatrix(mol)
    atom_labels = [f"{a.GetSymbol()}{a.GetIdx()}" for a in mol.GetAtoms()]
    df_adj = pd.DataFrame(adj, index=atom_labels, columns=atom_labels)
    print('\nAdjacency matrix:')
    print(df_adj)
    
    # TODO 3: Graph stats
    G = mol_to_graph(mol)
    print(f'\nGraph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges')
    print('Edges with bond order:')
    for u, v, d in G.edges(data=True):
        print(f"  {G.nodes[u]['symbol']}{u} – {G.nodes[v]['symbol']}{v}: order {d['order']}")
    
    # TODO 4: XYZ coords
    xyz_text = mol_to_xyz(mol)
    print('\nXYZ coordinates:')
    print(xyz_text)


## Molecular Descriptors & Fingerprints 🧮🔑

Chemical *descriptors* are numeric summaries of molecular properties, while *fingerprints* are bit‑vectors that encode structural motifs for rapid similarity searches.

| Category | Examples | Typical use |
|----------|----------|-------------|
| **Descriptors (scalar)** | Molecular weight, logP, topological polar surface area (TPSA), number of H‑bond donors/acceptors | QSAR, drug‑likeness filters |
| **Fingerprints (bit‑vector)** | Morgan (ECFP), MACCS, RDKit topological | Substructure search, clustering, similarity metrics |


In [None]:
from rdkit.Chem import Descriptors
from rdkit.Chem import rdMolDescriptors as rdmd

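def basic_descriptors(smiles):
    """Return a dict of common scalar descriptors for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f'Could not parse SMILES: {smiles!r}')
    return {
        'MolWt': Descriptors.MolWt(mol),      # average molecular weight, g/mol
        'logP': Descriptors.MolLogP(mol),     # Crippen estimate of octanol/water logP
        'TPSA': rdmd.CalcTPSA(mol),           # topological polar surface area, Å²
        'HBA': rdmd.CalcNumLipinskiHBA(mol),  # Lipinski-style H-bond acceptor count
        'HBD': rdmd.CalcNumLipinskiHBD(mol),  # Lipinski-style H-bond donor count
    }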

# Demo on ethanol
print(basic_descriptors('CCO'))
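
To see what a scalar descriptor boils down to, the molecular weight of ethanol can be reproduced by hand from its atom counts. This is only a sketch: the three-element `ATOMIC_WEIGHT` table and the `molecular_weight` helper are assumptions made for illustration, using standard atomic weights rounded to three decimals.

```python
# Standard atomic weights in g/mol, rounded to three decimals (illustrative subset).
ATOMIC_WEIGHT = {'H': 1.008, 'C': 12.011, 'O': 15.999}


def molecular_weight(atom_counts):
    """Sum atomic weights over a {element: count} dictionary."""
    return sum(ATOMIC_WEIGHT[element] * count for element, count in atom_counts.items())


ethanol_counts = {'C': 2, 'H': 6, 'O': 1}  # C2H5OH has six hydrogens in total
ethanol_mw = molecular_weight(ethanol_counts)
print(f'{ethanol_mw:.3f} g/mol')
```

The result should agree with the `MolWt` entry printed for `'CCO'` to within rounding of the atomic weights.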

In [None]:
from rdkit.Chem import MACCSkeys
from rdkit import DataStructs

ethanol = Chem.MolFromSmiles('CCO')
benzene = Chem.MolFromSmiles('c1ccccc1')

# Morgan fingerprint (radius 2, 1024 bits)
fp_eth = rdmd.GetMorganFingerprintAsBitVect(ethanol, radius=2, nBits=1024)
fp_ben = rdmd.GetMorganFingerprintAsBitVect(benzene, radius=2, nBits=1024)
sim = DataStructs.TanimotoSimilarity(fp_eth, fp_ben)
print(f'Tanimoto similarity (ethanol vs benzene): {sim:.2f}')

# MACCS keys example
maccs_eth = MACCSkeys.GenMACCSKeys(ethanol)
print('MACCS bits set for ethanol:', list(maccs_eth.GetOnBits())[:10], '...')
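
The Tanimoto similarity used above is the number of on-bits two fingerprints share divided by the number of on-bits set in either one. A minimal sketch on plain Python sets follows; the bit positions are made up for illustration, and returning 0.0 for two empty fingerprints is a convention chosen here rather than a statement about RDKit's behaviour.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto (Jaccard) similarity between two collections of on-bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / len(union)


# Hypothetical on-bit positions for two small fingerprints.
fp_a = {1, 5, 9, 42}
fp_b = {5, 9, 77}
toy_similarity = tanimoto(fp_a, fp_b)
print(toy_similarity)  # 2 shared bits / 5 distinct bits = 0.4
```

Applied to real fingerprints, `list(fp.GetOnBits())` supplies the index collections, as in the MACCS example above.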

### Your Turn 📝 — Descriptor & Fingerprint Playground
1. Pick **three** molecules (SMILES).  
2. Build a **pandas DataFrame** of their basic descriptors using `basic_descriptors`.  
3. Compute Morgan fingerprints and create a **similarity matrix** (Tanimoto) between all pairs.  
4. Which pair is most similar? Does that align with chemical intuition?

In [None]:
# TODO 6: Descriptor & fingerprint analysis
my_smiles = ['CCO', 'O=C=O', 'CC(=O)O']  # edit!

# Build descriptor table
desc_rows = []
for smi in my_smiles:
    row = basic_descriptors(smi)
    row['SMILES'] = smi
    desc_rows.append(row)
df_desc = pd.DataFrame(desc_rows).set_index('SMILES')
display(df_desc)

# Fingerprint similarity matrix
fps = [rdmd.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, 1024) for s in my_smiles]
n = len(fps)
sim_mat = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_mat[i, j] = DataStructs.TanimotoSimilarity(fps[i], fps[j])
df_sim = pd.DataFrame(sim_mat, index=my_smiles, columns=my_smiles)
display(df_sim)

### Critical‑Thinking Questions

**CTQ‑1.** In one sentence, what *extra* information does an **XYZ file** contain that a **SMILES** string does not? What key information does SMILES keep that XYZ is missing?

<span style="color:cyan"><strong>Free response:</strong> YOUR RESPONSE TEXT HERE </span>

**CTQ‑2.** The SMILES strings `O=C=O` and `C(=O)=O` both depict CO₂. Why can two strings represent the same molecule? What does this imply about SMILES “uniqueness”?

<span style="color:cyan"><strong>Free response:</strong> YOUR RESPONSE TEXT HERE </span>

**CTQ‑3.** For each task below, pick the *best* representation (SMILES, NetworkX graph, adjacency matrix, or XYZ) **and justify** your choice:

a) Counting rings  
b) Running a DFT geometry optimisation  
c) Fingerprint‑based similarity search

<span style="color:cyan"><strong>Free response:</strong> YOUR RESPONSE TEXT HERE </span>

### Challenge ⭐ — SMILES ↔ Graph ↔ SMILES Round‑Trip
Write a function that:
1. Converts a SMILES string to a NetworkX graph, **then**
2. Reconstructs a new RDKit Mol from that graph, and
3. Retrieves a SMILES again.

Compare the original and reconstructed SMILES. Are they identical? *(Hint: RDKit’s `RWMol` class can build a molecule atom by atom, `Chem.SanitizeMol` validates the result, and `Chem.MolToSmiles` writes canonical SMILES.)*

In [None]:
# TODO 5: Implement round‑trip converter (optional)
def smiles_round_trip(smi):
    pass

## Summary & Next Steps
- **SMILES, graphs, adjacency matrices, XYZ**: complementary ways to encode structure.  
- **Descriptors** give *numeric* property snapshots; **fingerprints** give *bitwise* structural signatures for fast similarity.  
- Combining these representations is foundational for QSAR modeling, virtual screening, and cheminformatics pipelines.