# AI-PSCI-018: Debugging & Optimization Strategies
## 📘 SOLUTION NOTEBOOK

**AI in Pharmaceutical Sciences: Bench to Bedside**  
VCU School of Pharmacy | VIP Program | Spring 2026

---

**Week 11 | Module: Testing & Evaluation | Estimated Time: 90-120 minutes**

**Prerequisites**: AI-PSCI-015 through AI-PSCI-017 (docking and validation talktorials)

---

## 🎯 Learning Objectives

After completing this talktorial, you will be able to:

1. **Troubleshoot common computational errors** in drug discovery pipelines
2. **Interpret error messages** to quickly identify root causes
3. **Optimize pipeline performance** using profiling and caching strategies
4. **Handle edge cases gracefully** with robust error handling
5. **Document debugging processes** for reproducibility and team communication

---

## 📚 Background

### Why Debugging Skills Matter

In computational drug discovery, you'll spend significant time debugging code and troubleshooting unexpected results. Common scenarios include:

- **API failures**: External services (ChEMBL, PDB, ESMFold) may be down or rate-limited
- **Data quality issues**: Missing values, incorrect formats, unexpected molecules
- **Memory constraints**: Large datasets or batch processing may exhaust resources
- **Numerical errors**: NaN values, division by zero, floating-point precision loss
- **Package conflicts**: Version incompatibilities between libraries

### The Debugging Mindset

Effective debugging follows a systematic approach:

1. **Reproduce**: Can you consistently trigger the error?
2. **Isolate**: What's the minimal code that causes the problem?
3. **Understand**: What does the error message actually tell you?
4. **Fix**: Apply the smallest change that resolves the issue
5. **Verify**: Confirm the fix works and doesn't break other things
6. **Document**: Record what went wrong and how you fixed it

### Key Concepts
- **Traceback**: The sequence of function calls leading to an error
- **Exception handling**: Using try/except to gracefully handle errors
- **Profiling**: Measuring where code spends time and memory
- **Defensive programming**: Writing code that anticipates failures
- **Logging**: Recording program execution for debugging

---

## 🛠️ Setup

Run this cell to install required packages:

In [None]:
#@title 🛠️ Install Required Packages
!pip install rdkit biopython chembl_webresource_client requests -q
print("✅ Packages installed successfully!")

Import required libraries:
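
One way to confirm that the installation worked is to attempt each import and report failures rather than stopping at the first one. This is a minimal sketch, not the course's official import cell. It assumes the import names `rdkit`, `Bio` (Biopython), `chembl_webresource_client`, and `requests` for the packages installed above. The helper name `check_imports` is hypothetical.

```python
import importlib

def check_imports(module_names):
    """Try to import each module and return {name: True/False}."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError as exc:
            # Report the problem and keep going so every missing package is listed.
            print(f"Could not import {name}: {exc}")
            status[name] = False
    return status

# Import names for the packages installed above (Biopython imports as "Bio").
REQUIRED = ["rdkit", "Bio", "chembl_webresource_client", "requests"]
import_status = check_imports(REQUIRED)
print(import_status)
```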

---

## 🎯 Target Configuration

Select your drug target. This should match your selection from previous talktorials.

In [None]:
#@title 🎯 Select Your Drug Target
TARGET = "DHFR" #@param ["DHFR", "ABL1", "EGFR", "AChE", "COX-2", "DPP-4"]

# Complete target configuration
TARGET_CONFIG = {
    "DHFR": {"pdb": "2W9S", "uniprot": "P0ABQ4", "chembl": "CHEMBL202", "drug": "Trimethoprim"},
    "ABL1": {"pdb": "1IEP", "uniprot": "P00519", "chembl": "CHEMBL1862", "drug": "Imatinib"},
    "EGFR": {"pdb": "1M17", "uniprot": "P00533", "chembl": "CHEMBL203", "drug": "Erlotinib"},
    "AChE": {"pdb": "4EY7", "uniprot": "P22303", "chembl": "CHEMBL220", "drug": "Donepezil"},
    "COX-2": {"pdb": "3LN1", "uniprot": "P35354", "chembl": "CHEMBL230", "drug": "Celecoxib"},
    "DPP-4": {"pdb": "1X70", "uniprot": "P27487", "chembl": "CHEMBL284", "drug": "Sitagliptin"}
}

config = TARGET_CONFIG[TARGET]
print(f"✅ Target: {TARGET}")
print(f"   PDB: {config['pdb']} | UniProt: {config['uniprot']} | ChEMBL: {config['chembl']}")

---

## 🔬 Guided Inquiry 1: Understanding Error Messages

### Context

Python error messages (tracebacks) contain valuable information, but they can be intimidating. Learning to read them systematically is a crucial skill.

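A traceback is best read **bottom-up**:
- The **last line** names the exception type and gives its message
- The lines above it show the **call stack**: which function called which, oldest call first
- The **most recent call** (printed last, just above the error line) is usually the best place to start looking, although the root cause may sit in an earlier frame that passed in bad data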
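
The bottom-up reading described above can also be done programmatically with the standard-library `traceback` module. The sketch below is illustrative only: `TARGETS`, `lookup_chembl_id`, and `describe_error` are hypothetical names, and the dictionary is a tiny stand-in for `TARGET_CONFIG`.

```python
import traceback

TARGETS = {"DHFR": "CHEMBL202"}  # tiny stand-in for TARGET_CONFIG

def lookup_chembl_id(target_name):
    # Raises KeyError when the target is not in the dictionary.
    return TARGETS[target_name]

def describe_error(exc):
    """Summarize an exception: its type, message, and innermost frame."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1]  # Python lists the most recent call last
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "function": innermost.name,
        "line": innermost.lineno,
    }

try:
    lookup_chembl_id("HER2")
except KeyError as exc:
    info = describe_error(exc)
    print(f"{info['type']} in {info['function']}() "
          f"at line {info['line']}: {info['message']}")
```

The innermost frame points at `lookup_chembl_id`, where the exception was raised. The actual mistake (an unsupported target name) was made by the caller, which is why the frames above the last one are still worth reading.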

### Your Task

Using your AI assistant, write code to:
1. Create intentional errors of different types (TypeError, ValueError, KeyError, etc.)
2. Catch and analyze the error messages
3. Extract useful information from tracebacks
4. Create a reference guide for common errors in drug discovery

💡 **Prompting Tips**:
- Ask: "What are the most common Python errors and what causes them?"
- Request examples of errors specific to RDKit, pandas, and API calls
- Ask how to programmatically extract line numbers from tracebacks

### Verification
After running your code, confirm:
- [ ] At least 5 different error types demonstrated
- [ ] Each error message is explained
- [ ] Drug discovery-specific examples included

📓 **Lab Notebook**: Create a personal "error dictionary" with errors you've encountered and their solutions.

In [None]:
# Your code here



## 🔬 Guided Inquiry 2: Defensive Programming with Try/Except

### Context

In production pipelines, you can't let one bad molecule crash your entire analysis. **Defensive programming** means anticipating what could go wrong and handling it gracefully.

The key patterns are:
- `try/except`: Catch specific errors
- `try/except/finally`: Always run cleanup code
- `try/except/else`: Run code only if no error occurred
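
The three patterns can be combined in a single function. The following is a minimal sketch using only the standard library. `parse_ic50` is a hypothetical helper, and treating non-positive values as unusable is an assumption made for illustration.

```python
def parse_ic50(raw_value):
    """Convert a raw IC50 entry to float; return (value_or_None, steps_taken)."""
    steps = []
    try:
        value = float(raw_value)  # may raise ValueError or TypeError
    except (ValueError, TypeError) as exc:
        # Catch only the errors we expect from bad input.
        steps.append(f"failed: {type(exc).__name__}")
        result = None
    else:
        # Runs only when float() succeeded.
        steps.append("parsed")
        result = value if value > 0 else None
    finally:
        # Runs in every case, success or failure.
        steps.append("done")
    return result, steps

for raw in ["12.5", "n/a", None, "-3"]:
    print(repr(raw), "->", parse_ic50(raw))
```

Not every library signals failure with an exception. RDKit's `Chem.MolFromSmiles`, for example, returns `None` for a SMILES string it cannot parse, so a robust molecule function needs an explicit `is None` check in addition to `try/except`.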

### Your Task

Using your AI assistant, write code to:
1. Create a robust molecule processing function that handles bad SMILES
2. Process a batch of molecules, logging failures without crashing
3. Implement retry logic for API calls
4. Create a summary of successes and failures

💡 **Prompting Tips**:
- Ask: "How do I catch multiple exception types in Python?"
- Request help implementing exponential backoff for retries
- Ask about the difference between catching Exception vs specific errors
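
Retry logic with exponential backoff, as mentioned in the tips, could look roughly like this. It is a sketch under stated assumptions: `TemporaryServiceError` and `flaky_fetch` are hypothetical stand-ins for a real API client and its timeout or rate-limit errors, and the `sleep` parameter exists only so the example can run without actually waiting.

```python
import time

class TemporaryServiceError(Exception):
    """Hypothetical stand-in for a timeout or rate-limit error."""

def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a temporary error wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TemporaryServiceError as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the error propagate
            delay = base_delay * (2 ** attempt)  # doubles after every failure
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Simulated flaky service: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TemporaryServiceError("service busy")
    return {"status": "ok"}

waits = []  # record the delays instead of sleeping
result = retry_with_backoff(flaky_fetch, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

Only the temporary error type is retried. Anything else, such as a `KeyError` caused by a typo, propagates immediately, because retrying a bug just delays the failure. With the `requests` library, the exception to catch would typically be a subclass of `requests.exceptions.RequestException`.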

### Verification
After running your code, confirm:
- [ ] Function doesn't crash on invalid input
- [ ] Failures are logged with useful information
- [ ] Summary shows success/failure counts
- [ ] Retry logic works for temporary failures

📓 **Lab Notebook**: Document which errors you catch and why. When should you let errors propagate vs. catching them?

In [None]:
# Your code here



## 🔬 Guided Inquiry 3: Performance Profiling

### Context

When processing thousands of molecules, small inefficiencies add up. **Profiling** helps identify where your code spends the most time so you can optimize effectively.

Key profiling approaches:
- **Timing**: Measure how long operations take
- **Line profiling**: See time spent on each line
- **Memory profiling**: Track memory usage
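
The simplest of these approaches, timing, can be packaged as a decorator. This is a minimal sketch: `timed`, `TIMINGS`, and the two toy functions are hypothetical, and the workload is a stand-in for real molecule processing.

```python
import functools
import time

TIMINGS = {}  # function name -> list of elapsed seconds

def timed(func):
    """Record the wall-clock time of each call in TIMINGS."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the time even if func raises.
            elapsed = time.perf_counter() - start
            TIMINGS.setdefault(func.__name__, []).append(elapsed)
    return wrapper

@timed
def count_letters(smiles_list):
    # Deliberately simple stand-in for a per-molecule calculation.
    return [sum(1 for ch in s if ch.isalpha()) for s in smiles_list]

@timed
def deduplicate(smiles_list):
    return sorted(set(smiles_list))

smiles = ["CCO", "c1ccccc1", "CCO"] * 1000
count_letters(smiles)
deduplicate(smiles)

# Slowest step first: this ordering is what identifies the bottleneck.
for name, runs in sorted(TIMINGS.items(), key=lambda kv: -sum(kv[1])):
    print(f"{name}: {sum(runs):.6f}s over {len(runs)} call(s)")
```

`time.perf_counter()` is used because it is a high-resolution clock meant for measuring intervals. For per-line or memory measurements, a dedicated profiler is needed.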

### Your Task

Using your AI assistant, write code to:
1. Create a timing decorator to measure function execution
2. Profile a molecule processing pipeline
3. Identify bottlenecks in the workflow
4. Implement optimizations and measure improvement

💡 **Prompting Tips**:
- Ask: "How do I create a timing decorator in Python?"
- Request help with the `timeit` module for accurate measurements
- Ask about vectorized operations vs. loops in pandas
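
Caching is one optimization worth measuring when the same molecule appears many times in a dataset. The sketch below compares a cached and an uncached function with `timeit`. `expensive_descriptor` is a hypothetical stand-in for a slow per-molecule calculation, not a real descriptor.

```python
import functools
import timeit

def expensive_descriptor(smiles):
    """Stand-in for a slow per-molecule calculation (hypothetical)."""
    total = 0
    for _ in range(200):
        total += sum(ord(ch) for ch in smiles)
    return total

@functools.lru_cache(maxsize=None)
def cached_descriptor(smiles):
    # Same result as expensive_descriptor, remembered per unique SMILES string.
    return expensive_descriptor(smiles)

# A library with many repeated structures.
library = ["CCO", "c1ccccc1", "CC(=O)O"] * 500

uncached = timeit.timeit(lambda: [expensive_descriptor(s) for s in library], number=3)
cached = timeit.timeit(lambda: [cached_descriptor(s) for s in library], number=3)

print(f"uncached: {uncached:.4f}s  cached: {cached:.4f}s")
print(cached_descriptor.cache_info())
```

Caching only helps when inputs repeat. On a library of unique structures it adds memory use without saving time, so the before/after measurement matters more than the technique itself.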

### Verification
After running your code, confirm:
- [ ] Timing decorator works on multiple functions
- [ ] Bottleneck identified (which step is slowest?)
- [ ] Optimization implemented
- [ ] Before/after comparison shows improvement

📓 **Lab Notebook**: Document the bottleneck you found and how you optimized it. What's the speedup factor?

In [None]:
# Your code here



## 🔬 Guided Inquiry 4: Handling Edge Cases

### Context

Real-world data is messy. Your pipeline will encounter:
- Missing values (NaN, None, empty strings)
- Unusual molecules (salts, mixtures, polymers)
- Extreme values (very large/small numbers)
- Unexpected formats (different SMILES conventions)

**Edge case handling** means your code behaves predictably for unusual inputs, not just the "happy path."
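
A first-pass validation function for the cases listed above might look like the sketch below. It uses only the standard library, so it does not check chemical validity, which would require RDKit parsing and sanitization. The function name, the flag names, and the molecular weight range of 100 to 900 are assumptions chosen for illustration.

```python
import math
from collections import Counter

def check_record(smiles, mol_weight=None, mw_range=(100.0, 900.0)):
    """Return a list of data-quality flags for one record (empty list = clean)."""
    # Missing values: None, NaN (how pandas represents gaps), or blank strings.
    if smiles is None or (isinstance(smiles, float) and math.isnan(smiles)):
        return ["missing_smiles"]
    if not isinstance(smiles, str):
        return ["not_a_string"]
    if not smiles.strip():
        return ["missing_smiles"]

    flags = []
    # In SMILES, "." separates disconnected fragments (salts, mixtures).
    if "." in smiles:
        flags.append("multi_fragment")
    if mol_weight is not None:
        if isinstance(mol_weight, float) and math.isnan(mol_weight):
            flags.append("missing_weight")
        elif not (mw_range[0] <= mol_weight <= mw_range[1]):
            flags.append("extreme_weight")
    return flags

records = [
    ("CC(=O)Oc1ccccc1C(=O)O", 180.16),  # aspirin
    ("CC(=O)[O-].[Na+]", 82.03),        # sodium acetate (a salt)
    ("", None),
    (None, None),
    (float("nan"), 300.0),
    ("CCO", 46.07),                      # ethanol
]

# Minimal data quality report: count how often each flag occurs.
report = Counter(
    flag
    for smiles, weight in records
    for flag in (check_record(smiles, weight) or ["ok"])
)
print(dict(report))
```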

### Your Task

Using your AI assistant, write code to:
1. Identify edge cases in molecular data
2. Create validation functions for common issues
3. Handle each edge case gracefully
4. Generate a data quality report

💡 **Prompting Tips**:
- Ask: "What edge cases should I handle when processing SMILES strings?"
- Request help with RDKit's molecule sanitization
- Ask about handling salts and mixtures in ChEMBL data

### Verification
After running your code, confirm:
- [ ] Salt/mixture detection works
- [ ] Empty/None values handled
- [ ] Extreme molecular weights flagged
- [ ] Data quality report generated

📓 **Lab Notebook**: What edge cases have you encountered in your target's data? How did you handle them?

In [None]:
# Your code here



## 🔬 Guided Inquiry 5: Logging and Documentation

### Context

When debugging production issues or reviewing results months later, good **logging** is invaluable. Python's `logging` module provides structured logging that can be configured for different environments.

Log levels (in order of increasing severity):
- `DEBUG`: Detailed information for diagnosing problems
- `INFO`: Confirmation that things are working
- `WARNING`: Something unexpected happened, but code continues
- `ERROR`: A serious problem prevented some functionality
- `CRITICAL`: The program may not be able to continue
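
The levels above map directly onto methods of a `logging.Logger`. The sketch below writes timestamped records to a file and then counts entries per level. The logger name, the log file location in the system temp directory, and the toy `process` function are assumptions made for illustration.

```python
import logging
import os
import tempfile

def build_logger(log_path, name="pipeline"):
    """Create a logger that writes timestamped records to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate lines when a notebook cell is re-run
    logger.propagate = False  # keep records out of the root logger
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter(
        "%(asctime)s | %(levelname)-8s | %(funcName)s | %(message)s"))
    logger.addHandler(handler)
    return logger

log_path = os.path.join(tempfile.gettempdir(), "pipeline_demo.log")
logger = build_logger(log_path)

def process(smiles_list):
    logger.info("Starting batch of %d molecules", len(smiles_list))
    for smiles in smiles_list:
        if not smiles:
            logger.warning("Skipping empty SMILES")
            continue
        logger.debug("Processed %s", smiles)
    logger.info("Batch finished")

process(["CCO", "", "c1ccccc1"])

# Summarize the log file: number of entries per level.
for handler in logger.handlers:
    handler.flush()
with open(log_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
level_counts = {}
for line in lines:
    level = line.split(" | ")[1].strip()
    level_counts[level] = level_counts.get(level, 0) + 1
print(level_counts)
```

For long-running pipelines, `logging.handlers.RotatingFileHandler` can replace `FileHandler` to cap the size of the log file.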

### Your Task

Using your AI assistant, write code to:
1. Set up structured logging with timestamps and levels
2. Create a pipeline that logs progress and issues
3. Save logs to a file for later review
4. Generate a debugging summary from logs

💡 **Prompting Tips**:
- Ask: "How do I configure Python logging to write to a file?"
- Request help with log formatting (timestamps, function names)
- Ask about rotating log files to prevent them from growing too large

### Verification
After running your code, confirm:
- [ ] Logs appear with timestamps and levels
- [ ] Different log levels are used appropriately
- [ ] Log file is created and contains entries
- [ ] Summary can be extracted from logs

📓 **Lab Notebook**: What information should you always log in a drug discovery pipeline? What's too much?

In [None]:
# Your code here



## 🔬 Guided Inquiry 6: Creating a Debugging Toolkit

### Context

Experienced developers build personal **debugging toolkits** - collections of functions and patterns they reuse across projects. This saves time and ensures consistent error handling.
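
As a starting point, a toolkit might contain a block timer and a batch helper that collects failures instead of stopping. This is a sketch only: `stopwatch` and `safe_map` are hypothetical names, and the default choice of exceptions to catch is an assumption that should be adapted to the function being wrapped.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, results):
    """Store the elapsed seconds for a block of code under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

def safe_map(func, items, catch=(ValueError, TypeError)):
    """Apply func to every item; collect failures instead of stopping."""
    successes, failures = [], []
    for index, item in enumerate(items):
        try:
            successes.append((index, func(item)))
        except catch as exc:
            failures.append({
                "index": index,
                "item": item,
                "error": type(exc).__name__,
                "detail": str(exc),
            })
    return successes, failures

timings = {}
raw_values = ["7.2", "abc", None, "5.9"]
with stopwatch("parse_pIC50", timings):
    ok, bad = safe_map(float, raw_values)

print(f"{len(ok)} succeeded, {len(bad)} failed")
for failure in bad:
    print(failure)
```

Keeping the original index with every success and failure makes it possible to trace a problem back to the exact input row, which is the information a debugging summary usually needs.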

### Your Task

Using your AI assistant, write code to:
1. Create a reusable debugging utilities module
2. Include timing, error handling, and validation functions
3. Create a template for debugging notebooks
4. Generate a debugging checklist for drug discovery pipelines

💡 **Prompting Tips**:
- Ask: "What utilities should be in a data science debugging toolkit?"
- Request a checklist for debugging molecular data issues
- Ask about tools for memory profiling in Python

### Verification
After running your code, confirm:
- [ ] Utility functions are reusable
- [ ] Checklist covers common issues
- [ ] Template is practical

📓 **Lab Notebook**: What debugging tools have been most useful in your project? What would you add to this toolkit?

In [None]:
# Your code here



---

## ✅ Checkpoint

Before moving on to the next talktorial, confirm you can:

- [ ] Read and interpret Python tracebacks
- [ ] Use try/except for defensive programming
- [ ] Profile code to identify performance bottlenecks
- [ ] Handle common edge cases in molecular data
- [ ] Set up logging for production pipelines
- [ ] Apply a systematic debugging approach

### Your lab notebook should include:
- [ ] Your personal "error dictionary" with common errors
- [ ] Performance optimization you implemented
- [ ] Edge cases you've encountered in your target's data
- [ ] Your debugging checklist additions

---

## 🤔 Reflection Questions

Answer these in your lab notebook:

1. **Error Handling Philosophy**: When should you catch an exception vs. let it propagate? How do you decide?

2. **Performance vs. Readability**: You identified a bottleneck, but the optimization makes code harder to read. How do you balance these concerns?

3. **Documentation**: What level of logging is appropriate for different stages of a project (development vs. production)?

4. **Real-World Application**: Describe a debugging challenge you faced in your target's data. How did you diagnose and solve it?

---

## 📖 Further Reading

- [Python Logging HOWTO](https://docs.python.org/3/howto/logging.html) - Official documentation
- [Effective Python, Items 65-75](https://effectivepython.com/) - Testing and debugging best practices
- [RDKit Cookbook: Error Handling](https://www.rdkit.org/docs/Cookbook.html) - RDKit-specific tips
- [The Art of Debugging](https://jvns.ca/blog/2019/06/23/a-few-debugging-resources/) - Julia Evans' debugging resources
- [Python Profiling Guide](https://realpython.com/python-profiling/) - Comprehensive profiling tutorial

---

## 🔗 Connection to Research

Debugging skills are essential for computational research because:

**Reproducibility**: A bug that silently produces wrong results is worse than one that crashes. Good error handling ensures you catch problems early and document them.

**Efficiency**: Profiling your pipelines means you can process more molecules in less time, enabling larger-scale virtual screening and analysis.

**Robustness**: Real-world drug discovery data is messy. Edge case handling means your pipeline works on diverse datasets without manual intervention.

**Collaboration**: Good logging and documentation mean colleagues can understand and troubleshoot your code, and you can understand it months later.

The debugging patterns you've learned here apply directly to:
- Production virtual screening pipelines
- Automated compound profiling workflows
- Machine learning training pipelines
- API-based tool integration

---

*AI-PSCI-018 Complete. Proceed to AI-PSCI-019: Documentation for Reproducibility.*