# AI-PSCI-006: ADMET Filtering & Drug-Likeness
**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 3 | Module: AI in Drug Discovery | Estimated Time: 60-90 minutes**

**Prerequisites**: AI-PSCI-001, AI-PSCI-002, AI-PSCI-003, AI-PSCI-004, AI-PSCI-005

---

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. Apply Lipinski's Rule of Five to filter compound libraries
2. Understand and apply Veber's rules for oral bioavailability
3. Distinguish between drug-likeness and lead-likeness criteria
4. Use PAINS filters to identify problematic compounds
5. Build a multi-criteria filtering pipeline for compound selection
6. Visualize the impact of filters on chemical space

---

## 📚 Background

### Why Filter Compounds?

Not every molecule that binds to a target will make a good drug. Before a compound can become a medicine, it must:

1. **Be absorbed** into the bloodstream (usually from the gut)
2. **Distribute** to the target tissue
3. **Not be metabolized** too quickly
4. **Be excreted** safely
5. **Not be toxic**

These properties are collectively called **ADMET** (Absorption, Distribution, Metabolism, Excretion, Toxicity). Computational filters help us predict which compounds are likely to have good ADMET properties.

### Lipinski's Rule of Five (Ro5)

In 1997, Christopher Lipinski analyzed compounds that reached Phase II clinical trials and found that orally active drugs typically have:

| Property | Cutoff | Rationale |
|----------|--------|----------|
| Molecular Weight | ≤ 500 Da | Larger molecules have trouble crossing membranes |
| LogP | ≤ 5 | Too lipophilic = poor solubility, high plasma protein binding |
| H-bond Donors | ≤ 5 | Too many = poor membrane permeability |
| H-bond Acceptors | ≤ 10 | Too many = poor membrane permeability |

This is called the "Rule of Five" because every cutoff is a multiple of 5.
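
The table above can be checked for a single molecule with a few RDKit calls. The following is a minimal sketch, not a reference implementation: the helper name `lipinski_violations` is hypothetical, aspirin is used only as a convenient example, and `Descriptors.MolLogP` is a calculated (Crippen) estimate rather than a measured LogP. RDKit's donor and acceptor counts may also differ slightly from the simple N/O counting used in the original paper.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski


def lipinski_violations(smiles):
    """Return (properties, violation_count), or None if the SMILES cannot be parsed.

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    props = {
        "MW": Descriptors.MolWt(mol),
        "LogP": Descriptors.MolLogP(mol),    # calculated estimate, not experimental
        "HBD": Lipinski.NumHDonors(mol),     # RDKit's donor definition
        "HBA": Lipinski.NumHAcceptors(mol),  # RDKit's acceptor definition
    }
    # Each comparison is True (1) when the cutoff is exceeded
    violations = sum([
        props["MW"] > 500,
        props["LogP"] > 5,
        props["HBD"] > 5,
        props["HBA"] > 10,
    ])
    return props, violations


# Aspirin as an example input
aspirin_props, aspirin_violations = lipinski_violations("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin_props)
print("Violations:", aspirin_violations)
```

Returning `None` for unparsable SMILES keeps bad records from silently counting as zero-violation compounds when the helper is applied to a whole library.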

### Veber's Rules for Oral Bioavailability

In 2002, Veber et al. added two more important criteria:

| Property | Cutoff | Rationale |
|----------|--------|----------|
| TPSA | ≤ 140 Å² | Topological Polar Surface Area affects membrane permeability |
| Rotatable Bonds | ≤ 10 | Flexible molecules have reduced permeability |
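
Both Veber properties are available in RDKit's `rdMolDescriptors` module. The sketch below is illustrative only: `passes_veber` is a hypothetical helper name, and RDKit's rotatable-bond definition may not match the one used in the original study exactly.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def passes_veber(smiles, max_tpsa=140.0, max_rotatable=10):
    """Return TPSA, rotatable bond count, and a pass/fail flag (None if unparsable).

    Hypothetical helper for illustration only.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    tpsa = rdMolDescriptors.CalcTPSA(mol)                 # in square angstroms
    n_rot = rdMolDescriptors.CalcNumRotatableBonds(mol)   # RDKit's definition
    return {
        "TPSA": tpsa,
        "RotatableBonds": n_rot,
        "passes_veber": tpsa <= max_tpsa and n_rot <= max_rotatable,
    }


# Aspirin as an example input
veber_result = passes_veber("CC(=O)Oc1ccccc1C(=O)O")
print(veber_result)
```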

### Lead-Likeness vs Drug-Likeness

**Lead compounds** are starting points for optimization. They're typically:
- Smaller (MW ≤ 350)
- Less lipophilic (LogP ≤ 3)
- Leaving room to add groups during optimization

**Drug-like compounds** are closer to final drugs:
- Larger (MW ≤ 500)
- More optimized binding
- Meet Lipinski/Veber criteria

### PAINS Filters

**Pan-Assay INterference compoundS (PAINS)** are molecules that give false positives in many assays due to:
- Reactivity with assay components
- Aggregation
- Fluorescence interference
- Redox cycling

Filtering out PAINS prevents wasting resources on "frequent hitters" that won't become drugs.

### Key Concepts

- **ADMET**: Absorption, Distribution, Metabolism, Excretion, Toxicity
- **Lipinski Ro5**: MW ≤ 500, LogP ≤ 5, HBD ≤ 5, HBA ≤ 10
- **Veber Rules**: TPSA ≤ 140 Å², Rotatable bonds ≤ 10
- **Lead-likeness**: Smaller, less optimized starting points (MW ≤ 350, LogP ≤ 3)
- **PAINS**: Problematic substructures that cause assay interference

---

## 🛠️ Setup

Run this cell to install required packages:

In [None]:
#@title 🛠️ Install Packages
!pip install rdkit chembl_webresource_client -q
print("✅ Packages installed successfully!")

Import the required libraries:

In [None]:
#@title 📦 Import Libraries
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Descriptors, Draw, FilterCatalog, rdMolDescriptors
from rdkit.Chem.FilterCatalog import FilterCatalogParams
from chembl_webresource_client.new_client import new_client

# Set display options
pd.set_option('display.max_columns', 15)

print("✅ All libraries imported!")

---

## 📊 Load Example Dataset

We'll use a dataset of DHFR inhibitors from ChEMBL (similar to what you created in AI-PSCI-005). If you saved your dataset, you can load it. Otherwise, we'll fetch fresh data.

In [None]:
#@title 📊 Load or Fetch DHFR Compound Data

# Try to load saved data, otherwise fetch from ChEMBL
try:
    df = pd.read_csv('DHFR_compounds_chembl.csv')
    print(f"✅ Loaded saved dataset: {len(df)} compounds")
except FileNotFoundError:
    print("Fetching fresh data from ChEMBL...")
    print("(This may take 2-3 minutes)")
    
    # Fetch data using chembl_webresource_client
    activities_api = new_client.activity
    molecules_api = new_client.molecule
    
    # Query DHFR bioactivity
    activities = activities_api.filter(
        target_chembl_id="CHEMBL202",
        type="IC50",
        relation="=",
        assay_type="B"
    ).only('molecule_chembl_id', 'standard_value', 'standard_units')
    
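    activities_df = pd.DataFrame.from_records(activities)
    # Keep nM records only; .copy() avoids pandas' SettingWithCopyWarning on the next assignment
    activities_df = activities_df[activities_df['standard_units'] == 'nM'].copy()
    # Coerce non-numeric entries to NaN; they are dropped later with the missing pIC50 values
    activities_df['standard_value'] = pd.to_numeric(activities_df['standard_value'], errors='coerce')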
    
    # Get unique molecules
    molecule_ids = activities_df['molecule_chembl_id'].unique().tolist()
    molecules = molecules_api.filter(
        molecule_chembl_id__in=molecule_ids
    ).only('molecule_chembl_id', 'molecule_structures')
    
    # Extract SMILES
    mol_data = []
    for mol in molecules:
        structures = mol.get('molecule_structures')
        if structures and 'canonical_smiles' in structures:
            mol_data.append({
                'chembl_id': mol['molecule_chembl_id'],
                'smiles': structures['canonical_smiles']
            })
    
    molecules_df = pd.DataFrame(mol_data)
    
    # Merge and clean
    df = activities_df.merge(
        molecules_df,
        left_on='molecule_chembl_id',
        right_on='chembl_id'
    )
    
    # Calculate pIC50
    def ic50_to_pic50(ic50_nM):
        if ic50_nM <= 0:
            return None
        return 9 - math.log10(ic50_nM)
    
    df['pIC50'] = df['standard_value'].apply(ic50_to_pic50)
    df = df.dropna(subset=['pIC50'])
    
    # Remove duplicates (keep median pIC50)
    df = df.groupby('chembl_id').agg({
        'smiles': 'first',
        'standard_value': 'median',
        'pIC50': 'median'
    }).reset_index()
    
    df.columns = ['chembl_id', 'smiles', 'IC50_nM', 'pIC50']
    print(f"✅ Fetched {len(df)} unique compounds from ChEMBL")

print(f"\nDataset shape: {df.shape}")
df.head()

---

## 🔬 Guided Inquiry 1: Lipinski's Rule of Five

### Context

Lipinski's Rule of Five is the most widely used drug-likeness filter. A compound is considered "drug-like" if it has **no more than one violation** of the following rules:

- Molecular Weight ≤ 500 Da
- LogP ≤ 5
- H-bond Donors ≤ 5
- H-bond Acceptors ≤ 10

Let's calculate these properties for our entire compound library and see how many pass.

### Your Task

Using your AI assistant, write code to:

1. Calculate MW, LogP, HBD, and HBA for each compound
2. Count how many Lipinski violations each compound has
3. Filter to keep compounds with ≤ 1 violation
4. Report what percentage of compounds pass

💡 **Prompting Tips**:
- Ask: "How do I calculate Lipinski properties with RDKit?"
- Ask: "How do I count violations of multiple conditions in pandas?"
- Remember: The rule allows ONE violation, not zero
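
As a starting point for the second tip, violations can be counted in pandas by summing boolean comparisons. This is a sketch on a made-up toy table, not a solution for your dataset: the column names (`MW`, `LogP`, `HBD`, `HBA`) and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy descriptor table; values are invented for illustration
toy = pd.DataFrame({
    "MW":   [180.2, 612.7, 455.5, 720.9],
    "LogP": [1.3,   4.2,   5.6,   6.1],
    "HBD":  [1,     3,     2,     7],
    "HBA":  [3,     9,     6,     12],
})

# Each comparison gives a boolean column; casting to int and summing counts violations per row
toy["lipinski_violations"] = (
    (toy["MW"] > 500).astype(int)
    + (toy["LogP"] > 5).astype(int)
    + (toy["HBD"] > 5).astype(int)
    + (toy["HBA"] > 10).astype(int)
)

# The rule tolerates one violation
toy["passes_lipinski"] = toy["lipinski_violations"] <= 1
pass_rate = 100 * toy["passes_lipinski"].mean()

print(toy)
print(f"{pass_rate:.1f}% of toy compounds pass")
```

Keeping the violation count as its own column makes it easy to answer the lab notebook question about which violation is most common.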

### Verification

After running your code, confirm:
- [ ] All four properties are calculated for each compound
- [ ] Violation count is between 0 and 4
- [ ] Most drug-like compounds should pass (expect >70%)

📓 **Lab Notebook**: Record the percentage of compounds passing Lipinski's rules. What's the most common violation?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 2: Veber's Rules for Oral Bioavailability

### Context

Veber et al. (2002) found that two additional properties strongly predict oral bioavailability in rats:

- **TPSA ≤ 140 Å²**: Topological Polar Surface Area - sum of polar atom surfaces
- **Rotatable Bonds ≤ 10**: Molecular flexibility affects membrane permeation

These rules complement Lipinski's Ro5 for predicting oral absorption.

### Your Task

Using your AI assistant, write code to:

1. Calculate TPSA and rotatable bond count for each compound
2. Count compounds passing Veber's rules
3. Create a scatter plot of TPSA vs rotatable bonds
4. Highlight compounds that fail Veber's rules

💡 **Prompting Tips**:
- Ask: "How do I calculate TPSA in RDKit?"
- Ask: "How do I count rotatable bonds in RDKit?"
- Ask: "How do I color scatter plot points based on a condition?"
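
For the plotting tip, one common pattern is to draw passing and failing compounds as separate scatter series and mark the cutoffs with dashed lines. The sketch below uses invented toy values and assumed column names (`TPSA`, `RotatableBonds`); adapt it to whatever your own DataFrame uses.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    "TPSA": [63.6, 95.0, 152.3, 120.1, 180.4],
    "RotatableBonds": [3, 12, 8, 5, 14],
})
toy["passes_veber"] = (toy["TPSA"] <= 140) & (toy["RotatableBonds"] <= 10)

fig, ax = plt.subplots(figsize=(6, 4))

# One scatter call per group so each gets its own color and legend entry
for passed, label, color in [(True, "Passes Veber", "tab:blue"),
                             (False, "Fails Veber", "tab:red")]:
    subset = toy[toy["passes_veber"] == passed]
    ax.scatter(subset["TPSA"], subset["RotatableBonds"],
               color=color, label=label, alpha=0.7)

# Dashed lines mark the two cutoffs
ax.axvline(140, linestyle="--", color="gray")
ax.axhline(10, linestyle="--", color="gray")
ax.set_xlabel("TPSA (Å²)")
ax.set_ylabel("Rotatable bonds")
ax.set_title("Veber cutoffs (toy data)")
ax.legend()
plt.show()
```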

### Verification

After running your code, confirm:
- [ ] TPSA values are in reasonable range (typically 20-200 Å²)
- [ ] Rotatable bonds are non-negative integers
- [ ] Scatter plot clearly shows the rule cutoffs

📓 **Lab Notebook**: What percentage of compounds pass Veber's rules? How does this compare to Lipinski?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 3: Lead-Likeness Criteria

### Context

**Lead-like** compounds are starting points for drug optimization. They're intentionally smaller and less optimized than drug-like compounds, leaving "room to grow" during medicinal chemistry optimization.

**Lead-likeness criteria** (Teague & Leeson, 2007):
- MW ≤ 350 Da (smaller than drug-like)
- LogP ≤ 3 (less lipophilic)
- HBD ≤ 3
- HBA ≤ 6
- Rotatable bonds ≤ 7
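
Unlike Lipinski's rule, the criteria above are typically applied as a strict conjunction: every condition must hold. A minimal sketch on an invented toy table follows; the column names and values are assumptions, and "drug-like" here means at most one Lipinski violation, as defined in Guided Inquiry 1.

```python
import pandas as pd

# Toy descriptor table; values are invented for illustration
toy = pd.DataFrame({
    "MW":   [180.2, 342.4, 455.5, 310.0],
    "LogP": [1.3,   2.8,   4.1,   3.6],
    "HBD":  [1,     2,     3,     1],
    "HBA":  [3,     5,     7,     4],
    "RotatableBonds": [3, 6, 9, 4],
})

# Lead-likeness: all five conditions must be met
toy["lead_like"] = (
    (toy["MW"] <= 350)
    & (toy["LogP"] <= 3)
    & (toy["HBD"] <= 3)
    & (toy["HBA"] <= 6)
    & (toy["RotatableBonds"] <= 7)
)

# Drug-likeness: at most one Lipinski violation
n_violations = (
    (toy["MW"] > 500).astype(int)
    + (toy["LogP"] > 5).astype(int)
    + (toy["HBD"] > 5).astype(int)
    + (toy["HBA"] > 10).astype(int)
)
toy["drug_like"] = n_violations <= 1

# Compounds that are lead-like but not drug-like
lead_not_drug = toy[toy["lead_like"] & ~toy["drug_like"]]

print(toy[["lead_like", "drug_like"]].sum())
print("Lead-like but not drug-like:", len(lead_not_drug))
```

Because every lead-like cutoff here is tighter than the matching Lipinski cutoff, a compound that meets these lead-like criteria cannot fail Lipinski, which is worth keeping in mind when you interpret the comparison.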

### Your Task

Using your AI assistant, write code to:

1. Apply lead-likeness criteria to your dataset
2. Compare the number of lead-like vs drug-like compounds
3. Identify compounds that are lead-like but NOT drug-like (unusual!)
4. Create a comparison visualization

💡 **Prompting Tips**:
- Ask: "What are typical lead-likeness criteria?"
- Think about WHY leads are smaller - what happens during optimization?
- Ask: "How do I compare two filtered datasets?"

### Verification

After running your code, confirm:
- [ ] Lead-like compounds are (mostly) a subset of the drug-like compounds
- [ ] Fewer compounds pass lead-likeness than drug-likeness
- [ ] Visualization clearly shows the relationship

📓 **Lab Notebook**: Why would a medicinal chemist prefer to start with a lead-like compound rather than a drug-like compound?

In [None]:
# Your code here



---

## 🔬 Guided Inquiry 4: PAINS Filtering

### Context

**PAINS (Pan-Assay INterference compoundS)** are molecules that interfere with biological assays through various mechanisms:

- **Aggregation**: Form colloidal particles that non-specifically inhibit enzymes
- **Reactivity**: Covalently modify assay components
- **Fluorescence**: Interfere with fluorescent readouts
- **Redox cycling**: Generate reactive oxygen species

RDKit includes PAINS filter patterns that can identify these problematic substructures.
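
A minimal sketch of how RDKit's `FilterCatalog` can be set up for PAINS screening is shown here. The helper name `pains_alerts` is hypothetical. The first test SMILES is the example used in RDKit's own FilterCatalog documentation; ethanol is included only as a small molecule that should raise no alert.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build a catalog containing the PAINS families shipped with RDKit
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains_catalog = FilterCatalog(params)


def pains_alerts(smiles):
    """Return the names of matched PAINS patterns (empty list if none).

    Returns None if the SMILES cannot be parsed. Hypothetical helper.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [entry.GetDescription() for entry in pains_catalog.GetMatches(mol)]


# Example molecule from the RDKit FilterCatalog documentation
flagged = pains_alerts("O=C(Cn1cnc2c1c(=O)n(C)c(=O)n2C)N/N=C/c1c(O)ccc2c1cccc2")
# Ethanol, expected to be free of PAINS alerts
clean = pains_alerts("CCO")

print("Flagged patterns:", flagged)
print("Clean compound alerts:", clean)
```

Building the catalog once and reusing it matters for speed when the helper is applied to thousands of compounds.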

### Your Task

Using your AI assistant, write code to:

1. Set up the RDKit PAINS filter catalog
2. Screen your compound library for PAINS alerts
3. Report how many compounds have PAINS alerts
4. Display examples of compounds flagged as PAINS

💡 **Prompting Tips**:
- Ask: "How do I use RDKit's FilterCatalog for PAINS filtering?"
- Ask: "What does PAINS stand for and why does it matter?"
- Ask: "How do I extract the name of matched PAINS patterns?"

### Verification

After running your code, confirm:
- [ ] FilterCatalog is properly initialized with PAINS filters
- [ ] Each compound is checked for PAINS alerts
- [ ] Some compounds are flagged (typically 5-15% of random libraries)
- [ ] Can identify which PAINS pattern was matched

📓 **Lab Notebook**: What types of PAINS patterns are most common in your dataset?

In [None]:
# Your code here



In [None]:
# Your code here (visualize PAINS compounds)



---

## 🔬 Guided Inquiry 5: Building a Multi-Criteria Filter Pipeline

### Context

In real drug discovery, we apply multiple filters sequentially to prioritize compounds. A typical pipeline might be:

```
Raw Library → Lipinski → Veber → PAINS → Final Filtered Set
```

Let's build this pipeline and track how many compounds survive each filter.
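
One way to structure such a pipeline is a list of named steps applied in order, recording the surviving count after each. The sketch below assumes the pass/fail flags have already been computed; the column names (`passes_lipinski`, `passes_veber`, `has_pains_alert`) and compound IDs are hypothetical.

```python
import pandas as pd

# Toy library with precomputed flags; IDs and values are invented
toy = pd.DataFrame({
    "compound": ["cmpd_A", "cmpd_B", "cmpd_C", "cmpd_D", "cmpd_E", "cmpd_F"],
    "passes_lipinski": [True, True, True, False, True, True],
    "passes_veber":    [True, False, True, True, True, True],
    "has_pains_alert": [False, False, True, False, False, False],
})

# Each step is (name, function returning a boolean mask of rows to keep)
steps = [
    ("Lipinski", lambda d: d["passes_lipinski"]),
    ("Veber",    lambda d: d["passes_veber"]),
    ("PAINS",    lambda d: ~d["has_pains_alert"]),
]

attrition = [("Raw library", len(toy))]
current = toy
for name, keep in steps:
    current = current[keep(current)]      # apply the filter to the survivors so far
    attrition.append((name, len(current)))

for name, count in attrition:
    print(f"{name:12s} {count}")
```

The final set is the same whatever order the filters run in, but the per-step counts are not, so the order affects which filter appears most stringent in the funnel.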

### Your Task

Using your AI assistant, write code to:

1. Define a filtering pipeline combining Lipinski, Veber, and PAINS
2. Apply filters sequentially and track compound attrition
3. Create a "funnel" visualization showing compound loss at each step
4. Export the final filtered dataset

💡 **Prompting Tips**:
- Ask: "How do I create a funnel chart in matplotlib?"
- Think about the ORDER of filters - does it matter?
- Ask: "How do I export a filtered DataFrame to CSV?"

### Verification

After running your code, confirm:
- [ ] Each filter removes some compounds
- [ ] Final count is less than or equal to initial count
- [ ] Funnel visualization clearly shows attrition
- [ ] CSV file is created with filtered compounds

📓 **Lab Notebook**: What percentage of compounds survive all filters? Which filter is most stringent?

In [None]:
# Your code here



In [None]:
# Your code here (export filtered dataset)



---

## 🔬 Guided Inquiry 6: Chemical Space Visualization

### Context

Let's visualize how filtering affects the chemical space of our compound library. By plotting MW vs LogP (a common chemical space representation), we can see what kinds of compounds are removed by our filters.
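
A rough sketch of such a plot: the full library is drawn as large gray markers, and the filtered subset is overlaid as smaller markers colored by pIC50. All values and column names (`MW`, `LogP`, `pIC50`, `passes_filters`) are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data invented for illustration
toy = pd.DataFrame({
    "MW":    [250.3, 320.4, 480.5, 560.7, 410.2, 630.9],
    "LogP":  [1.2,   2.5,   4.1,   5.8,   3.3,   6.4],
    "pIC50": [5.1,   6.3,   7.8,   8.4,   6.9,   9.0],
    "passes_filters": [True, True, True, False, True, False],
})
kept = toy[toy["passes_filters"]]

fig, ax = plt.subplots(figsize=(6, 4))
# Original library as a gray background layer
ax.scatter(toy["MW"], toy["LogP"], color="lightgray", s=120, label="Original")
# Filtered compounds on top, colored by potency
points = ax.scatter(kept["MW"], kept["LogP"], c=kept["pIC50"],
                    cmap="viridis", s=40, label="Filtered")
# Lipinski boundaries for MW and LogP
ax.axvline(500, linestyle="--", color="gray")
ax.axhline(5, linestyle="--", color="gray")
fig.colorbar(points, ax=ax, label="pIC50")
ax.set_xlabel("Molecular weight (Da)")
ax.set_ylabel("LogP")
ax.legend()
plt.show()

# Mean property values before and after filtering
cols = ["MW", "LogP", "pIC50"]
summary = pd.DataFrame({"Original": toy[cols].mean(), "Filtered": kept[cols].mean()})
print(summary.round(2))
```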

### Your Task

Using your AI assistant, write code to:

1. Create a scatter plot comparing original vs filtered datasets
2. Add Lipinski boundaries to the plot
3. Color points by potency (pIC50)
4. Create a summary table comparing statistics before/after filtering

💡 **Prompting Tips**:
- Ask: "How do I create a scatter plot with a colorbar for pIC50?"
- Ask: "How do I plot two datasets on the same axes?"
- Consider using transparency (alpha) to show overlapping points

### Verification

After running your code, confirm:
- [ ] Both original and filtered datasets are visible
- [ ] Lipinski boundaries are clearly marked
- [ ] Color scale shows potency (pIC50)
- [ ] Summary statistics are meaningful

📓 **Lab Notebook**: Does filtering remove mostly potent or weak compounds? What does this mean for drug discovery?

In [None]:
# Your code here



In [None]:
# Your code here (summary statistics)



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Calculate and apply Lipinski's Rule of Five
- [ ] Calculate and apply Veber's rules (TPSA, rotatable bonds)
- [ ] Distinguish between lead-likeness and drug-likeness
- [ ] Use RDKit's FilterCatalog to screen for PAINS
- [ ] Build a multi-criteria filtering pipeline
- [ ] Visualize the impact of filters on chemical space
- [ ] Interpret filter results in a drug discovery context

### Your lab notebook should include:

- [ ] Summary of filter pass rates for each criterion
- [ ] PAINS patterns found in your dataset
- [ ] Filtering funnel visualization
- [ ] Property comparison before/after filtering
- [ ] Exported filtered CSV file

---

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Filter Trade-offs**: If you had to choose between a potent compound (pIC50=9) that fails Lipinski and a weaker compound (pIC50=6) that passes all filters, which would you prioritize and why?

2. **Beyond Rule of 5**: Many successful antibiotics (like vancomycin) violate multiple Lipinski rules. Why might oral bioavailability rules be less relevant for antibiotics?

3. **PAINS Controversy**: Some researchers argue that PAINS filters are too aggressive and remove valid drug candidates. What information would you need to decide whether to keep or remove a PAINS-flagged compound?

---

## 📖 Further Reading

- [Lipinski et al. (1997)](https://doi.org/10.1016/S0169-409X(96)00423-1) - Original Rule of Five paper
- [Veber et al. (2002)](https://doi.org/10.1021/jm020017n) - Oral bioavailability rules
- [Baell & Holloway (2010)](https://doi.org/10.1021/jm901137j) - PAINS filters paper
- [Leeson & Springthorpe (2007)](https://doi.org/10.1038/nrd2445) - Review of drug-like concepts in medicinal chemistry, including lead-likeness
- [RDKit FilterCatalog Documentation](https://www.rdkit.org/docs/source/rdkit.Chem.FilterCatalog.html)

---

## 🔗 Connection to Research

The filtering skills you learned today are used daily in pharmaceutical research:

- **High-throughput screening triage**: Filter millions of HTS hits to prioritize compounds
- **Virtual screening**: Pre-filter compound libraries before docking
- **Lead optimization**: Track how properties change during optimization
- **Patent analysis**: Evaluate competitor compound portfolios

In upcoming talktorials, you'll build on these skills to:
- Calculate molecular fingerprints for similarity searching (AI-PSCI-007)
- Cluster compounds to select diverse sets (AI-PSCI-008)
- Build machine learning models using these descriptors (AI-PSCI-007)

---

*AI-PSCI-006 Complete. Proceed to AI-PSCI-007: Molecular Fingerprints & Similarity.*