# AI-PSCI-019: Documentation for Reproducibility

**Week 12 | Module: Integrative Reflection | Time: 90-120 min**

---

In this talktorial, you will learn to document your computational drug discovery pipeline for reproducibility and sharing. Proper documentation ensures your work can be understood, verified, and extended by others.

### Why Documentation Matters

In computational drug discovery:
- **Reproducibility crisis**: Many published computational results cannot be reproduced
- **Collaboration**: Team members need to understand and extend your work
- **Publication**: Journals increasingly require code and data availability
- **Career**: Well-documented code demonstrates professionalism

By the end of this session, you'll have a fully documented, GitHub-ready pipeline.

## Learning Objectives

By the end of this talktorial, you will be able to:

1. **Document computational workflows** with clear, structured documentation
2. **Create reproducible environments** using requirements files
3. **Write methods sections** suitable for publication
4. **Prepare code for sharing** on GitHub
5. **Generate automated documentation** from code
6. **Create a README** that enables others to reproduce your work

## Background

### The Reproducibility Standard

A reproducible computational pipeline must include:

| Component | Purpose | Example |
|-----------|---------|--------|
| **Code** | The actual implementation | Python scripts, notebooks |
| **Data** | Input files or access instructions | PDB IDs, ChEMBL queries |
| **Environment** | Software dependencies | requirements.txt, environment.yml |
| **Documentation** | How to run everything | README.md, docstrings |
| **Results** | Expected outputs for verification | Example outputs, checksums |
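
The "checksums" entry in the table can be made concrete with a short sketch. The helper names (`sha256_of_file`, `write_checksums`) and the manifest filename are illustrative assumptions, not part of the course materials; only the standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksums(paths, manifest: str = "checksums.sha256") -> list:
    """Write one '<digest>  <filename>' line per file and return the lines."""
    lines = [f"{sha256_of_file(p)}  {Path(p).name}" for p in paths]
    Path(manifest).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Someone re-running the pipeline can recompute the digests of their own outputs and compare them with the manifest. Matching digests indicate byte-identical files, so this works best for deterministic outputs such as CSV tables.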

### Documentation Levels

```
Level 1: Code Comments
    - Inline explanation of tricky logic
    
Level 2: Docstrings
    - Function/class documentation
    
Level 3: README
    - Project-level overview
    
Level 4: Methods Section
    - Publication-ready description
```

### FAIR Principles for Data

- **F**indable: Use persistent identifiers (PDB, UniProt, ChEMBL IDs)
- **A**ccessible: Provide download instructions
- **I**nteroperable: Use standard formats (SMILES, PDB, CSV)
- **R**eusable: Include clear licenses and metadata
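
One way to apply the four principles is to store a small metadata record next to each dataset. The sketch below is a minimal illustration: the `DatasetRecord` class, its field names, the dataset title, and the license value are all hypothetical placeholders, while the identifiers are the DHFR entries used later in this talktorial.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    identifiers: Dict[str, str]                        # Findable: persistent IDs
    access: str                                        # Accessible: how to obtain the data
    formats: List[str] = field(default_factory=list)   # Interoperable: standard formats
    license: str = "UNSPECIFIED"                       # Reusable: state the license


record = DatasetRecord(
    title="DHFR example dataset",  # placeholder title
    identifiers={"pdb": "1RX1", "uniprot": "P0ABQ4", "chembl": "CHEMBL202"},
    access="Download the structure from the RCSB PDB using the PDB ID",
    formats=["SMILES", "PDB", "CSV"],
    license="CC-BY-4.0",  # placeholder; use the license that applies to your data
)

# Serialize to JSON so the record can be saved alongside the data files
metadata_json = json.dumps(asdict(record), indent=2)
print(metadata_json)
```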

---

## Setup

First, let's install the packages used throughout this talktorial.

In [None]:
#@title Install Required Packages
!pip install rdkit pandas numpy matplotlib -q
print("Packages installed successfully!")

### Import Libraries

In [None]:
#@title Import Libraries
import os
import sys
import json
import datetime
import inspect
import textwrap
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field, asdict

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem

print("Libraries imported successfully!")

### Target Configuration

Select your drug target from the dropdown. All subsequent analyses will use your chosen target.

In [None]:
#@title Target Configuration
TARGET = "DHFR" #@param ["DHFR", "ABL1", "EGFR", "AChE", "COX-2", "DPP-4"]

# Complete target configuration
TARGET_CONFIG = {
    "DHFR": {
        "name": "Dihydrofolate Reductase",
        "disease": "Antibiotic Resistance",
        "pdb_id": "1RX1",
        "uniprot_id": "P0ABQ4",
        "chembl_id": "CHEMBL202",
        "drug": "Trimethoprim",
        "drug_smiles": "COc1cc(Cc2cnc(N)nc2N)cc(OC)c1OC",
        "organism": "E. coli",
        "mutations": ["P21L", "A26T", "L28R", "W30R", "I94L"]
    },
    "ABL1": {
        "name": "Tyrosine-protein kinase ABL1",
        "disease": "Chronic Myeloid Leukemia",
        "pdb_id": "1IEP",
        "uniprot_id": "P00519",
        "chembl_id": "CHEMBL1862",
        "drug": "Imatinib",
        "drug_smiles": "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",
        "organism": "Homo sapiens",
        "mutations": ["T315I", "E255K", "Y253H", "M351T", "G250E"]
    },
    "EGFR": {
        "name": "Epidermal Growth Factor Receptor",
        "disease": "Non-Small Cell Lung Cancer",
        "pdb_id": "1M17",
        "uniprot_id": "P00533",
        "chembl_id": "CHEMBL203",
        "drug": "Erlotinib",
        "drug_smiles": "COCCOc1cc2ncnc(Nc3cccc(C#C)c3)c2cc1OCCOC",
        "organism": "Homo sapiens",
        "mutations": ["L858R", "T790M", "C797S", "G719S", "L861Q"]
    },
    "AChE": {
        "name": "Acetylcholinesterase",
        "disease": "Alzheimer's Disease",
        "pdb_id": "4EY7",
        "uniprot_id": "P22303",
        "chembl_id": "CHEMBL220",
        "drug": "Donepezil",
        "drug_smiles": "COc1cc2c(cc1OC)C(=O)C(CC1CCN(Cc3ccccc3)CC1)C2",
        "organism": "Homo sapiens",
        "mutations": ["Y337A", "F338A", "W286A", "D74N", "E202Q"]
    },
    "COX-2": {
        "name": "Cyclooxygenase-2",
        "disease": "Pain/Inflammation",
        "pdb_id": "3LN1",
        "uniprot_id": "P35354",
        "chembl_id": "CHEMBL230",
        "drug": "Celecoxib",
        "drug_smiles": "Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1",
        "organism": "Homo sapiens",
        "mutations": ["V523I", "S530A", "Y385F", "R120A", "H90Q"]
    },
    "DPP-4": {
        "name": "Dipeptidyl Peptidase-4",
        "disease": "Type 2 Diabetes",
        "pdb_id": "1X70",
        "uniprot_id": "P27487",
        "chembl_id": "CHEMBL284",
        "drug": "Sitagliptin",
        "drug_smiles": "Fc1cc(F)c(C[C@@H](N)CC(=O)N2CCn3c(nnc3C(F)(F)F)C2)cc1F",
        "organism": "Homo sapiens",
        "mutations": ["S630A", "H740A", "D708A", "E205A", "E206A"]
    }
}

config = TARGET_CONFIG[TARGET]
print(f"Target: {TARGET}")
print(f"Full Name: {config['name']}")
print(f"Disease Area: {config['disease']}")
print(f"Reference Drug: {config['drug']}")

---

## Guided Inquiry 1: Creating Reproducible Environment Files

### Background

The first step in reproducibility is capturing your software environment. Python packages can behave differently across versions, so documenting exact versions is critical.

### Your Task

Create functions to:
1. Capture the current environment (installed packages and versions)
2. Generate a `requirements.txt` file
3. Create a more detailed environment report

**Verification**: You should see a list of key packages with their versions, followed by the contents of the generated requirements.txt.
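
If you want a starting point, the three tasks could be sketched roughly as follows. This is one possible approach using the standard-library `importlib.metadata` module. The function names and the `KEY_PACKAGES` tuple are assumptions made for illustration.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable

# Packages installed in the Setup cell; adjust to match your own pipeline
KEY_PACKAGES = ("rdkit", "pandas", "numpy", "matplotlib")


def capture_versions(packages: Iterable[str] = KEY_PACKAGES) -> Dict[str, str]:
    """Map each distribution name to its installed version or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def requirements_text(versions: Dict[str, str]) -> str:
    """Render pinned 'name==version' lines, skipping missing packages."""
    pinned = sorted(
        f"{name}=={ver}" for name, ver in versions.items() if ver != "not installed"
    )
    return "".join(line + "\n" for line in pinned)


def environment_report(versions: Dict[str, str]) -> dict:
    """Combine interpreter, platform, and package versions into one report."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


versions = capture_versions()
print(requirements_text(versions))
print(environment_report(versions))
```

Pinning with `==` records exactly what was installed when the results were produced. The report adds the Python and platform details that `requirements.txt` alone does not capture.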

In [None]:
# Your code here



---

## Guided Inquiry 2: Documenting Functions with Docstrings

### Background

Docstrings are the foundation of code documentation. Well-written docstrings allow others (and future you) to understand what functions do without reading the implementation.

### Your Task

Create a drug discovery pipeline function with comprehensive documentation:
1. Use Google-style docstrings (Args, Returns, Raises, Example)
2. Include type hints
3. Create a function to extract and display docstrings

**Verification**: You should see well-formatted docstrings and the function should process molecules correctly.
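
As a reference for the docstring layout, here is a minimal sketch of a Google-style docstring with type hints and a small extraction helper. To stay dependency-free it takes precomputed descriptors instead of RDKit molecules, so your own pipeline function will look different. `lipinski_violations` and `show_docstring` are illustrative names, and the numbers in the example are illustrative inputs.

```python
import inspect
from typing import Dict


def lipinski_violations(mol_weight: float, logp: float,
                        h_donors: int, h_acceptors: int) -> Dict[str, bool]:
    """Flag which of Lipinski's rule-of-five criteria a compound violates.

    Args:
        mol_weight: Molecular weight in daltons.
        logp: Calculated octanol-water partition coefficient.
        h_donors: Number of hydrogen-bond donors.
        h_acceptors: Number of hydrogen-bond acceptors.

    Returns:
        A dict mapping each criterion name to True if it is violated.

    Raises:
        ValueError: If mol_weight is not positive or a count is negative.

    Example:
        >>> lipinski_violations(290.3, 0.9, 2, 7)["mol_weight"]
        False
    """
    if mol_weight <= 0 or h_donors < 0 or h_acceptors < 0:
        raise ValueError("mol_weight must be positive and counts non-negative")
    return {
        "mol_weight": mol_weight > 500,
        "logp": logp > 5,
        "h_donors": h_donors > 5,
        "h_acceptors": h_acceptors > 10,
    }


def show_docstring(func) -> str:
    """Return the function's signature followed by its cleaned docstring."""
    header = f"{func.__name__}{inspect.signature(func)}"
    return f"{header}\n\n{inspect.getdoc(func) or '(no docstring)'}"


print(show_docstring(lipinski_violations))
```

`inspect.getdoc` strips the common indentation from the docstring, which is why the printed text lines up cleanly.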

In [None]:
# Your code here



---

## Guided Inquiry 3: Creating a README for Your Pipeline

### Background

A README is the first thing people see when they visit your repository. It should provide a complete overview of your project and clear instructions for getting started.

### Your Task

Create a function that generates a comprehensive README.md for your drug discovery pipeline:
1. Project title and description
2. Installation instructions
3. Usage examples
4. Data sources with proper attribution
5. License information

**Verification**: You should see a well-structured markdown README template.
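
A README generator can be little more than string templating over the target configuration. The sketch below assumes a dict with the same keys as `TARGET_CONFIG`. The function name, the section wording, and the default MIT license are placeholders to adapt.

```python
from typing import Dict

# Built at runtime so this example nests cleanly inside a Markdown document
FENCE = "`" * 3

# Subset of the DHFR entry from the Target Configuration cell
example_cfg = {
    "name": "Dihydrofolate Reductase",
    "disease": "Antibiotic Resistance",
    "pdb_id": "1RX1",
    "uniprot_id": "P0ABQ4",
    "chembl_id": "CHEMBL202",
    "drug": "Trimethoprim",
}


def generate_readme(target: str, cfg: Dict[str, str], license_name: str = "MIT") -> str:
    """Build README.md text with the five sections listed in the task."""
    lines = [
        f"# {target} Drug Discovery Pipeline",
        "",
        f"Computational pipeline targeting {cfg['name']} ({cfg['disease']}), "
        f"with {cfg['drug']} as the reference drug.",
        "",
        "## Installation",
        "",
        f"{FENCE}bash",
        "pip install -r requirements.txt",
        FENCE,
        "",
        "## Usage",
        "",
        "Open the notebook, select the target in the configuration cell, "
        "and run all cells in order.",
        "",
        "## Data Sources",
        "",
        f"- Structure: RCSB PDB, entry {cfg['pdb_id']}",
        f"- Sequence: UniProt, accession {cfg['uniprot_id']}",
        f"- Bioactivity: ChEMBL, target {cfg['chembl_id']}",
        "",
        "## License",
        "",
        f"{license_name} (see the LICENSE file)",
    ]
    return "\n".join(lines) + "\n"


readme_text = generate_readme("DHFR", example_cfg)
print(readme_text)
```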

In [None]:
# Your code here



---

## Guided Inquiry 4: Writing a Methods Section

### Background

A well-written methods section enables others to reproduce your computational experiments. It should be detailed enough for an expert to follow, while remaining concise.

### Your Task

Create a function that generates a publication-quality methods section:
1. Include all software versions
2. Describe data sources with identifiers
3. Detail computational parameters
4. Reference original tool publications

**Verification**: You should see a methods section suitable for a journal submission.
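
The four requirements can be assembled from information the notebook already holds. The following sketch is one possible shape for such a generator. The function name is hypothetical, and the version strings (`X.Y.Z`) and parameters are placeholders rather than real values. Citations are left as an explicit placeholder because they must come from the tools' own publications.

```python
import platform
from typing import Dict

# Subset of the DHFR entry from the Target Configuration cell
example_cfg = {
    "name": "Dihydrofolate Reductase",
    "organism": "E. coli",
    "pdb_id": "1RX1",
    "uniprot_id": "P0ABQ4",
    "chembl_id": "CHEMBL202",
    "drug": "Trimethoprim",
}


def generate_methods(cfg: Dict[str, str], versions: Dict[str, str],
                     params: Dict[str, object]) -> str:
    """Draft a methods paragraph from config, software versions, and parameters."""
    software = ", ".join(f"{name} {ver}" for name, ver in sorted(versions.items()))
    param_text = "; ".join(f"{key} = {value}" for key, value in params.items())
    return (
        f"All analyses were performed in Python {platform.python_version()} "
        f"using {software}. "
        f"The structure of {cfg['name']} ({cfg['organism']}) was obtained from "
        f"the Protein Data Bank (PDB ID: {cfg['pdb_id']}; UniProt: {cfg['uniprot_id']}). "
        f"Bioactivity data were retrieved from ChEMBL (target ID: {cfg['chembl_id']}). "
        f"{cfg['drug']} served as the reference compound. "
        f"Computational parameters: {param_text}. "
        "[Cite the original publication for each tool here.]"
    )


methods_text = generate_methods(
    example_cfg,
    versions={"rdkit": "X.Y.Z", "numpy": "X.Y.Z"},  # placeholders; use captured versions
    params={"random seed": 42},                     # placeholder parameter
)
print(methods_text)
```

Treat the output as a draft. A generated paragraph still needs editing for the target journal's style and a check that every parameter affecting the results is listed.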

In [None]:
# Your code here



---

## Guided Inquiry 5: Automated Documentation Generation

### Background

Modern projects use automated documentation tools to generate documentation from code. This ensures documentation stays in sync with the code and reduces maintenance burden.

### Your Task

Create a documentation generator that:
1. Extracts docstrings from a module/class
2. Generates markdown documentation
3. Creates an API reference page

**Verification**: You should see auto-generated documentation for a sample pipeline class.
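
The standard-library `inspect` module is enough for a small generator of this kind. In the sketch below, `ExamplePipeline` is a hypothetical class that exists only to give the generator something to document, and `api_reference` is an illustrative name. Tools such as Sphinx do the same job with far more features.

```python
import inspect


class ExamplePipeline:
    """Hypothetical pipeline class used only to demonstrate doc extraction."""

    def load(self, path: str) -> list:
        """Read one SMILES string per non-empty line of a text file."""
        with open(path) as handle:
            return [line.strip() for line in handle if line.strip()]

    def count(self, smiles: list) -> int:
        """Return the number of molecules loaded."""
        return len(smiles)

    def _internal(self):
        """Names starting with an underscore are skipped by the generator."""


def api_reference(cls) -> str:
    """Render a Markdown API page from a class's public method docstrings."""
    lines = [f"# API Reference: {cls.__name__}", "", inspect.getdoc(cls) or "", ""]
    # getmembers returns (name, member) pairs sorted by name
    for name, member in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        lines.append(f"## `{name}{inspect.signature(member)}`")
        lines.append("")
        lines.append(inspect.getdoc(member) or "*No docstring.*")
        lines.append("")
    return "\n".join(lines)


api_page = api_reference(ExamplePipeline)
print(api_page)
```

Because the page is built from the live class, re-running the generator after a code change updates the signatures automatically. The docstring text itself still has to be maintained by hand.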

In [None]:
# Your code here



---

## Guided Inquiry 6: Creating a Complete Documentation Package

### Background

A complete documentation package brings together all components: README, requirements, methods, and API docs into a coherent structure.

### Your Task

Create a function that generates a complete documentation package:
1. README.md
2. requirements.txt
3. METHODS.md
4. CHANGELOG.md template
5. LICENSE file

**Verification**: You should see all documentation files generated with appropriate content.
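
Once each piece of content exists as a string, packaging reduces to writing a filename-to-content mapping into a directory. The sketch below assumes that design. `write_docs_package` is an illustrative name, and every file body shown is a placeholder to be replaced with your generated content. The changelog skeleton loosely follows the Keep a Changelog layout.

```python
from pathlib import Path
from typing import Dict, List


def write_docs_package(out_dir: str, files: Dict[str, str]) -> List[Path]:
    """Write each filename -> content pair into out_dir and return the paths."""
    root = Path(out_dir)
    root.mkdir(parents=True, exist_ok=True)
    written = []
    for name, content in files.items():
        path = root / name
        path.write_text(content, encoding="utf-8")
        written.append(path)
    return written


CHANGELOG_TEMPLATE = (
    "# Changelog\n\n"
    "## [Unreleased]\n\n"
    "### Added\n\n"
    "- Initial documentation package\n"
)

# Placeholder contents; substitute the outputs of your own generator functions
package = {
    "README.md": "# Project title\n\nSee METHODS.md for details.\n",
    "requirements.txt": "",
    "METHODS.md": "# Methods\n",
    "CHANGELOG.md": CHANGELOG_TEMPLATE,
    "LICENSE": "Replace with the full text of your chosen license.\n",
}
```

Calling `write_docs_package("docs_package", package)` would create the directory if needed and write all five files. The directory name here is only an example.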

In [None]:
# Your code here



---

## Checkpoint

By now, you should have:

1. **Created environment files** - requirements.txt with exact versions
2. **Written comprehensive docstrings** - Google-style with Args, Returns, Example
3. **Generated a README** - Complete project overview with installation and usage
4. **Drafted a methods section** - Publication-ready with citations
5. **Built API documentation** - Automatically generated from code
6. **Assembled a documentation package** - All files needed for sharing

### Self-Assessment

Can you answer these questions?

1. Why do we pin exact package versions in requirements.txt?
2. What are the essential sections of a Google-style docstring?
3. What database identifiers should be included in methods sections?
4. Why is a LICENSE file important for open source projects?

---

## Reflection Questions

Consider these questions for your lab notebook:

1. **Reproducibility**: What additional information would you need to reproduce someone else's computational study?

2. **Audience**: How would your documentation differ for a computational chemist vs. a biologist?

3. **Maintenance**: How do you plan to keep documentation updated as code changes?

4. **Open Science**: What are the benefits and risks of sharing your code publicly?

---

## Further Reading

### Documentation Best Practices
- [Write the Docs](https://www.writethedocs.org/) - Documentation community
- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) - Docstring conventions
- [Keep a Changelog](https://keepachangelog.com/) - Changelog format

### Reproducibility Resources
- [FAIR Principles](https://www.go-fair.org/fair-principles/) - Data sharing standards
- [Nature: Code availability](https://www.nature.com/nature-research/editorial-policies/code-and-software) - Journal requirements

### Tools
- [Sphinx](https://www.sphinx-doc.org/) - Python documentation generator
- [MkDocs](https://www.mkdocs.org/) - Project documentation with Markdown
- [GitHub Pages](https://pages.github.com/) - Free documentation hosting

---

## Research Connection

The documentation skills you've learned connect to broader research practices:

### The Reproducibility Crisis

Studies have shown that a significant fraction of published computational results cannot be reproduced. Common issues include:
- Missing code or data
- Undocumented parameters
- Version incompatibilities
- Incomplete methods descriptions

### Journal Requirements

Many journals now require:
- Code availability statements
- Data repository links
- Methods reproducibility checks

### Career Benefits

Well-documented code:
- Demonstrates professionalism
- Enables collaboration
- Can increase the visibility and citation of your work
- Builds reputation in the field

---

*Congratulations! You've completed AI-PSCI-019: Documentation for Reproducibility.*