# Session 1: Python Fundamentals for Protein Scientists

**Duration:** ~1 hour 20 minutes

## What You'll Learn

By the end of this session, you will be able to:
1. Navigate Jupyter notebooks confidently
2. Use Python's core data types (strings, numbers, lists, dictionaries)
3. Write loops and functions to automate repetitive tasks
4. Load and inspect data with pandas
5. Understand the limitations of AI-extracted data

## Why Python for Protein Science?

- **Automate tedious tasks**: Rename files, process hundreds of samples, generate reports
- **Reproducible analysis**: Your analysis is documented in code, not clicks
- **Powerful libraries**: pandas for data, matplotlib for plots, Biopython for sequences
- **Growing community**: Most new structural biology tools are Python-based

## How to Use This Notebook

- **Run cells** by pressing `Shift + Enter`
- **Modify code** and re-run to experiment
- **Don't worry about errors** - they're how we learn!
- **Ask questions** at any time

Let's get started!

---
## 1. Jupyter Notebook Orientation

A Jupyter notebook is made of **cells**. There are two types:

1. **Markdown cells** (like this one) - for text, headings, explanations
2. **Code cells** (like the one below) - for Python code that runs

### Running Cells

- Click on a cell, then press `Shift + Enter` to run it
- The result appears below the cell
- A number appears in the brackets `[1]` showing execution order

Try running the cell below:

In [1]:
# This is a code cell. Lines starting with # are comments (ignored by Python)
print("Hello, protein scientists!")

Hello, protein scientists!


### Cell States

- `[ ]` - Cell hasn't been run yet
- `[*]` - Cell is currently running (wait for it!)
- `[1]` - Cell finished running (number shows order)

### Important: Run Cells in Order

Variables and functions defined in earlier cells are available in later cells. If you skip around, you might get errors about undefined variables.

**If things go wrong:** Go to `Kernel` → `Restart & Run All` to start fresh.

### EXERCISE 1.1: Your First Edit

**Task:** Modify the message below to print your name, then run the cell.

**Hint:** Change the text inside the quotes.

In [4]:
# TODO: Change the message to include your name
print("Hello from !")

Hello from !


---
## 2. Variables and Types

**Variables** are named containers that store data. Think of them as labeled boxes.

### Creating Variables

Use `=` to assign a value to a variable name:

In [5]:
# String (text) - always in quotes
protein_name = "Lysozyme"

# Integer (whole number)
residue_count = 129

# Float (decimal number)
molecular_weight = 14.3

# Boolean (True or False)
is_membrane_protein = False

print(protein_name)
print(residue_count)
print(molecular_weight)
print(is_membrane_protein)

Lysozyme
129
14.3
False


### Checking Types

Use `type()` to see what kind of data a variable holds:

In [6]:
print(type(protein_name))      # str (string)
print(type(residue_count))     # int (integer)
print(type(molecular_weight))  # float
print(type(is_membrane_protein))  # bool (boolean)

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>


### f-Strings: Formatting Output

f-strings let you embed variables directly in text. Put `f` before the quote and variables in `{curly braces}`:

In [7]:
# f-string example
print(f"{protein_name} has {residue_count} residues")
print(f"Molecular weight: {molecular_weight} kDa")

# Format numbers with decimals using :.Nf (N decimal places)
pi = 3.14159265359
print(f"Pi to 2 decimals: {pi:.2f}")
print(f"Pi to 4 decimals: {pi:.4f}")

Lysozyme has 129 residues
Molecular weight: 14.3 kDa
Pi to 2 decimals: 3.14
Pi to 4 decimals: 3.1416
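The same `{value:format}` syntax handles a few other cases that come up with lab numbers. The sketch below uses made-up values (they are not from the workshop data) to show scientific notation, thousands separators, percentages, and zero-padding:

```python
# Illustrative values only
kd_molar = 2.5e-9          # a dissociation constant in M
cell_count = 1250000
yield_fraction = 0.8734
sample_number = 7

print(f"Kd: {kd_molar:.2e} M")          # scientific notation, 2 decimals
print(f"Cells: {cell_count:,}")         # thousands separator
print(f"Yield: {yield_fraction:.1%}")   # fraction shown as a percentage
print(f"Sample: S{sample_number:03d}")  # zero-padded to 3 digits
```

The part after the colon only changes how the value is displayed; the variable itself keeps its full precision.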


### Variable Naming Rules

- Use lowercase letters with underscores: `protein_name` (not `ProteinName`)
- Must start with a letter or underscore, not a number
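- Can't use Python keywords like `if`, `for`, or `class`; also avoid reusing built-in names like `print` or `list`, which would hide the built-in function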
- Make names descriptive: `sample_concentration` beats `x`
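If you are unsure whether a name is allowed, Python can tell you. The sketch below checks a few candidate names (chosen here purely for illustration) using the standard-library `keyword` module and the string method `isidentifier()`. It uses a `for` loop, which is covered in Section 7.

```python
import keyword

candidate_names = ["protein_name", "2nd_sample", "for", "sample-1", "_tmp"]

results = {}
for name in candidate_names:
    # A usable name has valid identifier syntax AND is not a reserved keyword
    is_valid = name.isidentifier() and not keyword.iskeyword(name)
    results[name] = is_valid
    print(f"{name!r}: {'ok' if is_valid else 'not allowed'}")
```

Here `2nd_sample` fails because it starts with a digit, `sample-1` fails because of the hyphen, and `for` fails because it is a keyword.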

### EXERCISE 2.1: Create Protein Variables

**Task:** Create variables for a protein you work with (or make one up!).

Fill in the blanks below:

In [8]:
# TODO: Fill in the values for your protein
my_protein = "___"           # Name as a string
my_residues = ___             # Number of residues as an integer
my_mw = ___                   # Molecular weight as a float
my_pdb_id = "___"            # PDB ID as a string (e.g., "1LYZ")

# This line will print your protein info
print(f"{my_protein} (PDB: {my_pdb_id}): {my_residues} residues, {my_mw:.1f} kDa")

ValueError: Unknown format code 'f' for object of type 'str'

<details>
<summary>Click to see example solution</summary>

```python
my_protein = "Green Fluorescent Protein"
my_residues = 238
my_mw = 26.9
my_pdb_id = "1GFL"

print(f"{my_protein} (PDB: {my_pdb_id}): {my_residues} residues, {my_mw:.1f} kDa")
# Output: Green Fluorescent Protein (PDB: 1GFL): 238 residues, 26.9 kDa
```
</details>

---
## 3. Operators and Comparisons

Python can do math! Here are the basic operators:

In [9]:
# Arithmetic operators
print("Addition:", 10 + 3)       # 13
print("Subtraction:", 10 - 3)    # 7
print("Multiplication:", 10 * 3) # 30
print("Division:", 10 / 3)       # 3.333... (always gives float)
print("Integer division:", 10 // 3)  # 3 (floor division: rounds down)
print("Remainder:", 10 % 3)      # 1 (modulo)
print("Power:", 2 ** 10)         # 1024 (exponentiation)

Addition: 13
Subtraction: 7
Multiplication: 30
Division: 3.3333333333333335
Integer division: 3
Remainder: 1
Power: 1024


### Real Example: Molecular Weight Calculation

Let's estimate a protein's molecular weight from its residue count:

In [12]:
# Average MW of a free amino acid, weighted by abundance in proteins (Daltons)
# (once water is removed, this gives the familiar ~110 Da per residue)
avg_aa_mw = 128

# Water lost per peptide bond (Daltons)
water_loss = 18.015

# Calculate MW for lysozyme
n_residues = 129
estimated_mw = (avg_aa_mw * n_residues) - (water_loss * (n_residues - 1))

print(f"Estimated MW: {estimated_mw:.0f} Da")
print(f"Estimated MW: {estimated_mw/1000:.1f} kDa")

Estimated MW: 14206 Da
Estimated MW: 14.2 kDa


### Comparisons

Comparisons return `True` or `False`:

In [13]:
# Comparison operators
print("5 == 5:", 5 == 5)   # Equal to (note: DOUBLE equals!)
print("5 != 3:", 5 != 3)   # Not equal to
print("5 > 3:", 5 > 3)     # Greater than
print("5 < 3:", 5 < 3)     # Less than
print("5 >= 5:", 5 >= 5)   # Greater than or equal
print("5 <= 3:", 5 <= 3)   # Less than or equal

5 == 5: True
5 != 3: True
5 > 3: True
5 < 3: False
5 >= 5: True
5 <= 3: False


In [14]:
# Practical example: Is pH physiological?
pH = 7.4
is_physiological = 7.0 <= pH <= 7.6
print(f"pH {pH} is physiological: {is_physiological}")

# Using 'and' / 'or'
temp = 37
good_conditions = (7.0 <= pH <= 7.6) and (35 <= temp <= 40)
print(f"Good cell culture conditions: {good_conditions}")

pH 7.4 is physiological: True
Good cell culture conditions: True
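The cell above only exercises `and`. As a small companion sketch (the pH value is invented for illustration), here is how `or` and `not` behave with the same kind of range check:

```python
pH = 6.2
temp = 37

# 'or' is True when at least one side is True
out_of_range = (pH < 7.0) or (pH > 7.6)
print(f"pH {pH} out of range: {out_of_range}")

# 'not' flips True to False and vice versa
in_range = not out_of_range
print(f"pH {pH} in range: {in_range}")

# Operators can be combined; parentheses keep the intent readable
needs_attention = out_of_range or not (35 <= temp <= 40)
print(f"Needs attention: {needs_attention}")
```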


### EXERCISE 3.1: Calculate Extinction Coefficient

**Task:** Calculate the molar extinction coefficient at 280 nm for a protein.

Formula: ε₂₈₀ = (nTrp × 5500) + (nTyr × 1490) + (nCystine × 125)

For lysozyme: 6 Trp, 3 Tyr, 4 disulfide bonds (cystines)

In [15]:
# TODO: Fill in the values and calculate epsilon_280
n_trp = ___
n_tyr = ___
n_cystine = ___  # Number of disulfide bonds

# Extinction coefficients (M⁻¹ cm⁻¹)
eps_trp = 5500
eps_tyr = 1490
eps_cystine = 125

# TODO: Calculate epsilon_280
epsilon_280 = ___

print(f"ε₂₈₀ = {epsilon_280} M⁻¹ cm⁻¹")

ε₂₈₀ =  M⁻¹ cm⁻¹


<details>
<summary>Click to see solution</summary>

```python
n_trp = 6
n_tyr = 3
n_cystine = 4

eps_trp = 5500
eps_tyr = 1490
eps_cystine = 125

epsilon_280 = (n_trp * eps_trp) + (n_tyr * eps_tyr) + (n_cystine * eps_cystine)

print(f"ε₂₈₀ = {epsilon_280} M⁻¹ cm⁻¹")
# Output: ε₂₈₀ = 37970 M⁻¹ cm⁻¹
```
</details>

---
## 4. Strings in Detail

Strings are sequences of characters. They're essential for handling:
- Protein sequences
- Sample names
- File paths
- PDB IDs

### Indexing and Slicing

Access individual characters or ranges using square brackets.

**Important:** Python counts from 0, not 1!

In [None]:
sequence = "MKTLLILAVVLAV"

# Indexing: access single characters
print("First residue:", sequence[0])   # M (index 0, not 1!)
print("Second residue:", sequence[1])  # K
print("Last residue:", sequence[-1])   # V (negative = from end)

# Slicing: [start:stop] - stop is NOT included!
print("First 3:", sequence[0:3])       # MKT (indices 0, 1, 2)
print("Same thing:", sequence[:3])     # MKT (start defaults to 0)
print("Last 4:", sequence[-4:])        # VLAV
print("Middle:", sequence[3:7])        # LLIL

print("Length:", len(sequence))        # 13 characters

### Helpful Diagram for Slicing

```
String:   M   K   T   L   L   I   L   A   V   V   L   A   V
Index:    0   1   2   3   4   5   6   7   8   9  10  11  12
Negative:-13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1
```

For slicing, think of each index as pointing to the *gap before* that character: `sequence[3:7]` cuts at gap 3 and gap 7 and returns the four characters in between.
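A quick way to convince yourself of the slicing rules is to check them in code. This sketch reuses the same `sequence` and also shows the optional third slice number (the step), which the examples above do not use:

```python
sequence = "MKTLLILAVVLAV"

# A slice [start:stop] contains stop - start characters
segment = sequence[3:7]
print(segment, len(segment))

# Splitting at any index and rejoining gives back the original
rejoined = sequence[:5] + sequence[5:]
print(rejoined == sequence)

# An optional third number is the step
every_second = sequence[::2]    # residues at indices 0, 2, 4, ...
reversed_seq = sequence[::-1]   # step of -1 walks backwards
print(every_second)
print(reversed_seq)
```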

### Useful String Methods

In [None]:
# Case conversion
pdb_id = "1lyz"
print("Uppercase:", pdb_id.upper())    # 1LYZ
print("Lowercase:", "HELLO".lower())   # hello

# Remove whitespace
messy = "  1LYZ  "
print("Stripped:", messy.strip())      # 1LYZ

# Split into parts
header = "1LYZ_A_chain|Lysozyme|Gallus_gallus"
parts = header.split("|")
print("Split:", parts)  # ['1LYZ_A_chain', 'Lysozyme', 'Gallus_gallus']

# Join parts together
residues = ["ALA", "GLY", "SER"]
print("Joined:", "-".join(residues))  # ALA-GLY-SER

# Check contents
print("Contains 'LYZ':", "LYZ" in pdb_id.upper())  # True
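
String methods return a new string rather than changing the original, which means calls can be chained. A minimal sketch, using an invented messy input:

```python
raw_id = "  1lyz \n"

# Each method returns a new string, so calls chain left to right
clean_id = raw_id.strip().upper()
print(clean_id)

# The original is unchanged (strings are immutable)
print(repr(raw_id))

# Two more methods that are handy for IDs and file names
starts_with_digit = clean_id.startswith("1")
swapped = clean_id.replace("LYZ", "XYZ")   # "XYZ" is just a placeholder
print(starts_with_digit, swapped)
```

If you want to keep the cleaned value, assign it to a variable as above; calling `raw_id.strip()` on its own line discards the result.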

### EXERCISE 4.1: Parse a FASTA Header

**Task:** Extract the PDB ID and chain from a FASTA header.

Header format: `>pdb_id|chain|description`

In [16]:
fasta_header = ">4HHB|A|HEMOGLOBIN ALPHA CHAIN"

# TODO: Extract the PDB ID and chain
# Step 1: Remove the > character (hint: use slicing)
clean_header = ___

# Step 2: Split by | character
parts = ___

# Step 3: Get the PDB ID (first part) and chain (second part)
pdb_id = ___
chain = ___

print(f"PDB ID: {pdb_id}")
print(f"Chain: {chain}")

PDB ID: 
Chain: 


<details>
<summary>Click to see solution</summary>

```python
fasta_header = ">4HHB|A|HEMOGLOBIN ALPHA CHAIN"

# Step 1: Remove the > character
clean_header = fasta_header[1:]  # Everything from index 1 onward

# Step 2: Split by | character
parts = clean_header.split("|")

# Step 3: Get the PDB ID and chain
pdb_id = parts[0]
chain = parts[1]

print(f"PDB ID: {pdb_id}")  # 4HHB
print(f"Chain: {chain}")    # A
```
</details>

---
## 5. Lists

Lists are ordered collections that can hold multiple items. Use square brackets `[]`.

Perfect for:
- Sample names
- Wavelengths
- Time points
- Anything you need multiple of!

In [17]:
# Creating lists
samples = ["WT", "K52A", "D101N", "E35Q"]
wavelengths = [280, 260, 230, 214]
mixed = ["protein", 42, 3.14, True]  # Can mix types (but usually don't)

print("Samples:", samples)
print("Number of samples:", len(samples))

Samples: ['WT', 'K52A', 'D101N', 'E35Q']
Number of samples: 4


In [18]:
# Indexing and slicing (same as strings!)
print("First sample:", samples[0])     # WT
print("Last sample:", samples[-1])     # E35Q
print("First two:", samples[:2])       # ['WT', 'K52A']

First sample: WT
Last sample: E35Q
First two: ['WT', 'K52A']


In [19]:
# Modifying lists (unlike strings, lists are mutable!)
samples.append("H148G")        # Add to end
print("After append:", samples)

samples.insert(1, "Reference") # Insert at position 1
print("After insert:", samples)

samples.remove("Reference")    # Remove by value
print("After remove:", samples)

After append: ['WT', 'K52A', 'D101N', 'E35Q', 'H148G']
After insert: ['WT', 'Reference', 'K52A', 'D101N', 'E35Q', 'H148G']
After remove: ['WT', 'K52A', 'D101N', 'E35Q', 'H148G']


In [20]:
# Useful list operations
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

print("Sorted (new list):", sorted(numbers))  # Returns new list
print("Original:", numbers)                    # Original unchanged

print("Min:", min(numbers))
print("Max:", max(numbers))
print("Sum:", sum(numbers))

# Check membership
print("Is 'WT' in samples?", "WT" in samples)  # True

Sorted (new list): [1, 1, 2, 3, 4, 5, 6, 9]
Original: [3, 1, 4, 1, 5, 9, 2, 6]
Min: 1
Max: 9
Sum: 31
Is 'WT' in samples? True
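`sorted()` has an in-place counterpart, the list method `.sort()`, and mixing the two up is a common source of confusion. A short sketch of the difference:

```python
numbers = [3, 1, 4, 1, 5, 9, 2, 6]

# sorted() builds a new list; the original stays as it was
descending = sorted(numbers, reverse=True)
print("Descending copy:", descending)
print("Original:", numbers)

# .sort() reorders the list itself and returns None
result = numbers.sort()
print("Return value of .sort():", result)
print("Original after .sort():", numbers)
```

Because `.sort()` returns `None`, writing `numbers = numbers.sort()` throws the list away. Use `sorted()` when you need the sorted list as a value.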


### EXERCISE 5.1: Working with Amino Acid Data

**Task:** Create a list of amino acids and perform some operations.

In [21]:
# Given: one-letter codes for aromatic amino acids
aromatics = ["F", "W", "Y"]  # Phe, Trp, Tyr

# TODO: Create a list of charged amino acids (D, E, K, R, H)
charged = ___

# TODO: Combine aromatics and charged into one list (hint: use +)
special_residues = ___

# TODO: Sort the combined list alphabetically
special_sorted = ___

print("Special residues:", special_residues)
print("Sorted:", special_sorted)
print("Total count:", len(special_residues))

Special residues: 
Sorted: 
Total count: 0


<details>
<summary>Click to see solution</summary>

```python
aromatics = ["F", "W", "Y"]
charged = ["D", "E", "K", "R", "H"]
special_residues = aromatics + charged
special_sorted = sorted(special_residues)

print("Special residues:", special_residues)  # ['F', 'W', 'Y', 'D', 'E', 'K', 'R', 'H']
print("Sorted:", special_sorted)              # ['D', 'E', 'F', 'H', 'K', 'R', 'W', 'Y']
print("Total count:", len(special_residues))  # 8
```
</details>

---
## 6. Dictionaries

Dictionaries store **key-value pairs**. Use curly braces `{}`.

Perfect for:
- Protein metadata
- Sample annotations
- Lookup tables (amino acid properties, etc.)

In [None]:
# Creating a dictionary
lysozyme = {
    "name": "Hen egg-white lysozyme",
    "pdb_id": "1LYZ",
    "residues": 129,
    "mw_kda": 14.3,
    "organism": "Gallus gallus",
    "is_membrane": False
}

print(lysozyme)

In [None]:
# Accessing values by key
print("Name:", lysozyme["name"])
print("PDB ID:", lysozyme["pdb_id"])
print("MW:", lysozyme["mw_kda"], "kDa")

In [None]:
# Adding and updating values
lysozyme["resolution"] = 1.9    # Add new key
lysozyme["mw_kda"] = 14.4       # Update existing key

print(lysozyme)

In [None]:
# Getting keys and values
print("All keys:", list(lysozyme.keys()))
print("All values:", list(lysozyme.values()))

# Safe access with .get() - returns None if key doesn't exist
print("Ligand:", lysozyme.get("ligand"))          # None
print("Ligand:", lysozyme.get("ligand", "N/A"))  # "N/A" (default)
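
`.get()` is one way to handle keys that might be missing. Another is to test for the key with `in`, which works on dictionaries the same way it does on lists and strings. The sketch below uses a trimmed-down copy of the `lysozyme` dictionary:

```python
lysozyme = {"name": "Hen egg-white lysozyme", "pdb_id": "1LYZ", "residues": 129}

# 'in' checks the keys, not the values
has_ligand = "ligand" in lysozyme
has_pdb_id = "pdb_id" in lysozyme
print(has_ligand, has_pdb_id)

# Note: lysozyme["ligand"] would stop with a KeyError, because the key is absent

# .pop() removes a key and hands back its value
removed = lysozyme.pop("residues")
print("Removed value:", removed)
print("Keys left:", list(lysozyme.keys()))
```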

### Lookup Tables

Dictionaries are great for storing reference data:

In [None]:
# Molecular weights of the free amino acids (Da)
aa_mw = {
    "A": 89.1,  "R": 174.2, "N": 132.1, "D": 133.1,
    "C": 121.2, "E": 147.1, "Q": 146.2, "G": 75.1,
    "H": 155.2, "I": 131.2, "L": 131.2, "K": 146.2,
    "M": 149.2, "F": 165.2, "P": 115.1, "S": 105.1,
    "T": 119.1, "W": 204.2, "Y": 181.2, "V": 117.1
}

# Look up a specific amino acid
print(f"Tryptophan MW: {aa_mw['W']} Da")
print(f"Glycine MW: {aa_mw['G']} Da")

### EXERCISE 6.1: Create a Protein Dictionary

**Task:** Create a dictionary that stores several pieces of information about a protein.

In [None]:
# TODO: Create a dictionary for GFP with the following info:
# - name: "Green Fluorescent Protein"
# - pdb_id: "1GFL"
# - residues: 238
# - excitation_nm: 395
# - emission_nm: 509
# - organism: "Aequorea victoria"

gfp = {
    # Fill in here
}

# Print a formatted summary
print(f"{gfp['name']} ({gfp['pdb_id']})")
print(f"Excitation/Emission: {gfp['excitation_nm']}/{gfp['emission_nm']} nm")

<details>
<summary>Click to see solution</summary>

```python
gfp = {
    "name": "Green Fluorescent Protein",
    "pdb_id": "1GFL",
    "residues": 238,
    "excitation_nm": 395,
    "emission_nm": 509,
    "organism": "Aequorea victoria"
}

print(f"{gfp['name']} ({gfp['pdb_id']})")
# Output: Green Fluorescent Protein (1GFL)
print(f"Excitation/Emission: {gfp['excitation_nm']}/{gfp['emission_nm']} nm")
# Output: Excitation/Emission: 395/509 nm
```
</details>

---
## 7. Loops

Loops let you repeat actions. This is where Python saves you time!

### For Loops

Use `for` to iterate over items in a list (or any sequence):

In [None]:
samples = ["WT", "K52A", "D101N", "E35Q"]

# Basic for loop
for sample in samples:
    print(f"Processing {sample}...")

In [None]:
# Loop with index using enumerate()
for i, sample in enumerate(samples):
    print(f"{i+1}. {sample}")

In [None]:
# Loop over a range of numbers
for i in range(5):  # 0, 1, 2, 3, 4
    print(f"Cycle {i+1}")

print("---")

# Range with start, stop, and step
for wavelength in range(200, 400, 50):  # 200, 250, 300, 350
    print(f"Measuring at {wavelength} nm")

In [None]:
# Loop over dictionary items
protein = {"name": "Lysozyme", "pdb": "1LYZ", "mw": 14.3}

for key, value in protein.items():
    print(f"{key}: {value}")

### Building Results with Loops

In [None]:
# Calculate concentrations from absorbance readings
absorbances = [0.12, 0.25, 0.38, 0.51, 0.63]
epsilon = 38000  # M⁻¹ cm⁻¹
path_length = 1  # cm

concentrations = []  # Start with empty list

for A in absorbances:
    # Beer-Lambert: A = ε × c × l  →  c = A / (ε × l)
    c = A / (epsilon * path_length)
    c_uM = c * 1e6  # Convert to μM
    concentrations.append(c_uM)

print("Absorbances:", absorbances)
print("Concentrations (μM):", [round(c, 2) for c in concentrations])
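
The last line of that cell uses a **list comprehension**, a compact way to write the "start with an empty list, loop, append" pattern. The sketch below shows the two forms side by side, with illustrative concentration values:

```python
concentrations = [3.1578947, 6.5789474, 10.0]   # illustrative values in μM

# Explicit loop: create an empty list, then append inside the loop
rounded_loop = []
for c in concentrations:
    rounded_loop.append(round(c, 2))

# List comprehension: the same steps on one line
rounded_comp = [round(c, 2) for c in concentrations]

print(rounded_loop)
print(rounded_comp)
print("Same result:", rounded_loop == rounded_comp)
```

Both forms are fine. The explicit loop is easier to read when each step needs several lines, and the comprehension is convenient for one-line transformations.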

### EXERCISE 7.1: Count Residues

**Task:** Count how many times each target residue appears in a sequence.

In [None]:
sequence = "MKTLLILAVVLAV"
target_residues = ["L", "V", "M"]

# TODO: Count occurrences of each target residue
# Hint: Use sequence.count(residue) inside a loop

for residue in target_residues:
    count = ___  # Use .count() method
    print(f"{residue}: {count}")

<details>
<summary>Click to see solution</summary>

```python
sequence = "MKTLLILAVVLAV"
target_residues = ["L", "V", "M"]

for residue in target_residues:
    count = sequence.count(residue)
    print(f"{residue}: {count}")

# Output:
# L: 4
# V: 3
# M: 1
```
</details>
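
To count *every* residue rather than a chosen few, one option is to combine a loop with a dictionary. This is a sketch of that idea, followed by the standard-library `collections.Counter`, which does the same job:

```python
from collections import Counter

sequence = "MKTLLILAVVLAV"

# Build the counts by hand: one dictionary entry per residue type
counts = {}
for residue in sequence:
    # .get(residue, 0) returns 0 the first time a residue is seen
    counts[residue] = counts.get(residue, 0) + 1

print(counts)

# Counter does the same counting in one call
counter = Counter(sequence)
top_two = counter.most_common(2)   # the two most frequent residues
print(top_two)
```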

---
## 8. Functions

Functions are reusable blocks of code. Define once, use many times!

### Defining Functions

In [None]:
def calculate_concentration(absorbance, epsilon, path_length=1):
    """
    Calculate molar concentration from absorbance using Beer-Lambert law.
    
    Parameters
    ----------
    absorbance : float
        Measured absorbance (AU)
    epsilon : float
        Molar extinction coefficient (M⁻¹ cm⁻¹)
    path_length : float, optional
        Cuvette path length in cm (default: 1)
    
    Returns
    -------
    float
        Concentration in molar (M)
    """
    concentration = absorbance / (epsilon * path_length)
    return concentration

In [None]:
# Using the function
c = calculate_concentration(0.5, 38000)
print(f"Concentration: {c*1e6:.2f} μM")

# With explicit path length
c2 = calculate_concentration(0.5, 38000, path_length=0.5)
print(f"With 0.5 cm cuvette: {c2*1e6:.2f} μM")
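
Functions pay off most when combined with loops. The sketch below repeats a compact version of `calculate_concentration` so that it runs on its own, then applies it to a dictionary of readings. The sample names are borrowed from earlier sections, and the absorbance values are hypothetical.

```python
def calculate_concentration(absorbance, epsilon, path_length=1):
    """Beer-Lambert law: c = A / (epsilon * l), returned in molar."""
    return absorbance / (epsilon * path_length)

# Hypothetical A280 readings, keyed by sample name
readings = {"WT": 0.50, "K52A": 0.38, "D101N": 0.19}
epsilon = 38000  # M⁻¹ cm⁻¹

conc_uM = {}
for sample, absorbance in readings.items():
    conc_uM[sample] = calculate_concentration(absorbance, epsilon) * 1e6
    print(f"{sample}: {conc_uM[sample]:.2f} μM")

# Arguments can also be passed by name, in any order
c_named = calculate_concentration(epsilon=38000, absorbance=0.50)
```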

In [None]:
# Functions can return multiple values (Python packs them into a tuple)
def protein_stats(sequence):
    """Calculate basic statistics for a protein sequence."""
    length = len(sequence)
    n_aromatic = sequence.count("F") + sequence.count("W") + sequence.count("Y")
    pct_aromatic = (n_aromatic / length) * 100
    
    return length, n_aromatic, pct_aromatic

# Unpack multiple return values
seq = "MKTLLILAVVLAFWY"
length, n_arom, pct_arom = protein_stats(seq)
print(f"Length: {length}, Aromatic: {n_arom} ({pct_arom:.1f}%)")

### EXERCISE 8.1: Write a Function

**Task:** Write a function that estimates molecular weight from residue count.

In [None]:
def estimate_mw(n_residues, avg_aa_mw=128):
    """
    Estimate protein molecular weight from residue count.
    
    Parameters
    ----------
    n_residues : int
        Number of amino acid residues
    avg_aa_mw : float, optional
        Average MW of a free amino acid in Daltons (default: 128)
    
    Returns
    -------
    float
        Estimated MW in kDa
    """
    # TODO: Calculate MW (subtract 18.015 Da of water per peptide bond)
    # Number of peptide bonds = n_residues - 1
    
    water_loss = ___
    total_mw = ___
    mw_kda = ___
    
    return mw_kda

# Test your function
print(f"Lysozyme (129 residues): {estimate_mw(129):.1f} kDa")
print(f"GFP (238 residues): {estimate_mw(238):.1f} kDa")

<details>
<summary>Click to see solution</summary>

```python
def estimate_mw(n_residues, avg_aa_mw=128):
    water_loss = 18.015  # Da lost per peptide bond
    total_mw = (n_residues * avg_aa_mw) - (water_loss * (n_residues - 1))
    mw_kda = total_mw / 1000
    return mw_kda

print(f"Lysozyme (129 residues): {estimate_mw(129):.1f} kDa")  # ~14.2 kDa
print(f"GFP (238 residues): {estimate_mw(238):.1f} kDa")       # ~26.2 kDa
```
</details>

---
## 9. Pandas DataFrames

Pandas is Python's most important library for data analysis. It provides:
- **DataFrame**: A table (like an Excel spreadsheet)
- **Series**: A single column

This is the payoff for everything you've learned!

In [None]:
import pandas as pd
from pathlib import Path

# Set up paths
DATA_DIR = Path('materials/session1/data')
print(f"Data directory: {DATA_DIR}")
print(f"Files available: {list(DATA_DIR.glob('*.csv'))}")

### Loading Data

In [None]:
# Load a CSV file
df = pd.read_csv(DATA_DIR / 'hello_table.csv')

# Display the data
df

### Inspecting Data

In [None]:
# Basic inspection
print("Shape (rows, columns):", df.shape)
print("\nColumn names:", list(df.columns))
print("\nFirst 2 rows:")
df.head(2)

In [None]:
# Data types and memory
df.info()

In [None]:
# Statistics for numeric columns
df.describe()

### Selecting Data

In [None]:
# Load more interesting data
categories = pd.read_csv(DATA_DIR / 'mini_categories.csv')
categories

In [None]:
# Select a single column (returns a Series)
categories['protein']

In [None]:
# Select multiple columns (returns a DataFrame)
categories[['protein', 'status']]

In [None]:
# Select rows by position with .iloc
print("First row:")
print(categories.iloc[0])

print("\nRows 1-3:")
categories.iloc[1:4]

### Filtering Data

In [None]:
# Filter rows based on conditions
passing = categories[categories['status'] == 'pass']
print("Samples that passed:")
passing

In [None]:
# Multiple conditions (use & for AND, | for OR)
hek_pass = categories[(categories['expression_system'] == 'HEK') & 
                       (categories['status'] == 'pass')]
print("HEK samples that passed:")
hek_pass
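
Filtering works because a comparison on a column produces one `True`/`False` per row (a boolean mask), and indexing with that mask keeps only the `True` rows. The sketch below builds a small stand-in table so it runs without the workshop files. The column names match `mini_categories.csv`, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for mini_categories.csv
demo = pd.DataFrame({
    "protein": ["OpsinA", "OpsinB", "OpsinB", "KinaseC"],
    "expression_system": ["HEK", "HEK", "Sf9", "E. coli"],
    "status": ["pass", "fail", "pass", "pass"],
})

# A comparison on a column gives one True/False per row
mask = demo["status"] == "pass"
print(mask.tolist())

# Each condition needs its own parentheses when combined with & or |
selected = demo[(demo["status"] == "pass") &
                (demo["expression_system"].isin(["HEK", "Sf9"]))]
print(selected["protein"].tolist())

# ~ inverts a mask: here, the rows that did NOT pass
failed = demo[~mask]
print(failed["protein"].tolist())
```

The parentheses matter because `&` and `|` bind more tightly than `==`; without them pandas raises an error rather than guessing what you meant.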

### Aggregating Data

In [None]:
# Count values in a column
categories['expression_system'].value_counts()

In [None]:
# Group by and aggregate
categories.groupby('expression_system')['status'].value_counts()

In [None]:
# Load time series data
timeseries = pd.read_csv(DATA_DIR / 'mini_timeseries.csv')
print(timeseries)
print(f"\nMean absorbance: {timeseries['absorbance'].mean():.3f}")
print(f"Max absorbance: {timeseries['absorbance'].max():.3f}")

### Quick Visualization

In [None]:
import matplotlib.pyplot as plt

# Simple plot
plt.figure(figsize=(8, 4))
plt.plot(timeseries['time_min'], timeseries['absorbance'], 'o-')
plt.xlabel('Time (min)')
plt.ylabel('Absorbance')
plt.title('Reaction Progress')
plt.grid(True, alpha=0.3)
plt.show()

### EXERCISE 9.1: Explore the Data

**Task:** Answer these questions about the categories data.

In [None]:
categories = pd.read_csv(DATA_DIR / 'mini_categories.csv')

# TODO: Answer these questions:

# 1. How many unique proteins are there?
n_proteins = ___
print(f"Unique proteins: {n_proteins}")

# 2. What percentage of samples passed?
n_pass = ___
n_total = ___
pct_pass = ___
print(f"Pass rate: {pct_pass:.0f}%")

# 3. Which protein has the most samples? (hint: use value_counts())
most_common = ___
print(f"Most common protein: {most_common}")

<details>
<summary>Click to see solution</summary>

```python
categories = pd.read_csv(DATA_DIR / 'mini_categories.csv')

# 1. How many unique proteins?
n_proteins = categories['protein'].nunique()  # or len(categories['protein'].unique())
print(f"Unique proteins: {n_proteins}")  # 3

# 2. Pass rate
n_pass = (categories['status'] == 'pass').sum()
n_total = len(categories)
pct_pass = (n_pass / n_total) * 100
print(f"Pass rate: {pct_pass:.0f}%")  # 60%

# 3. Most common protein
most_common = categories['protein'].value_counts().index[0]
print(f"Most common protein: {most_common}")  # OpsinB
```
</details>

---
## 10. AI Data Extraction Exercise

AI tools like Claude and ChatGPT can extract data from figures and tables. But how accurate are they?

Let's find out by comparing AI-extracted data against hand-corrected values.

### The Task

We asked an AI to extract data from a published GPCR binding pocket residue table. We also manually transcribed the same table. Let's compare them!

In [22]:
from workshop_utils import load_extracted_data, compare_dataframes

# Load both versions
df_ai, df_corrected = load_extracted_data()

print("AI-extracted data shape:", df_ai.shape)
print("Hand-corrected data shape:", df_corrected.shape)

AI-extracted data shape: (17, 17)
Hand-corrected data shape: (17, 17)


In [25]:
# Look at the AI-extracted data
print("AI-extracted data (first 10 rows):")
df_ai.head(10)

AI-extracted data (first 10 rows):


Unnamed: 0,Family,Receptor,Conserved Residues (Binding Pocket),Conserved Residues (Binding Pocket).1,Conserved Residues (Binding Pocket).2,Conserved Residues (Binding Pocket).3,Conserved Residues (Binding Pocket).4,Conserved Residues (Binding Pocket).5,Non-Conserved Residues (Binding Pocket),Non-Conserved Residues (Binding Pocket).1,Non-Conserved Residues (Binding Pocket).2,Non-Conserved Residues (Binding Pocket).3,Non-Conserved Residues (Binding Pocket).4,Non-Conserved Residues (Binding Pocket).5,Non-Conserved Residues (Binding Pocket).6,Non-Conserved Residues (Binding Pocket).7,Non-Conserved Residues (Binding Pocket).8
0,,,2x46,3x50,6x50,8x45,8x47,8x49,3x47,6x43,6x44,6x46,6x47,6x48,6x49,7x56,7x57
1,Parathyroid hormone receptor,PTH1R,R,H,E,L,N,E,I,L,V,M,P,L,F,I,Y
2,Parathyroid hormone receptor,PTH2R,R,H,E,L,N,E,I,L,V,V,L,V,F,I,Y
3,Glucagon receptor family,GLP1R,R,H,E,L,N,E,L,L,T,I,P,L,L,L,Y
4,Glucagon receptor family,GLP2R,R,H,E,L,N,E,L,L,V,I,P,L,L,Q,Y
5,Glucagon receptor family,GHRHR,R,H,E,L,N,E,L,L,F,I,P,L,F,L,Y
6,Glucagon receptor family,GIPR,R,H,E,L,N,E,L,L,T,V,P,L,L,A,Y
7,Glucagon receptor family,GCGR,R,H,E,L,N,E,L,L,T,I,P,L,L,L,Y
8,Glucagon receptor family,SCTR,R,H,E,L,N,E,L,L,I,L,P,L,F,L,Y
9,Calcitonin receptors,CTR,R,H,E,L,N,E,M,M,I,V,P,L,L,I,Y


In [26]:
# Look at the hand-corrected data
print("Hand-corrected data (first 10 rows):")
df_corrected.head(10)

Hand-corrected data (first 10 rows):


Unnamed: 0,Family,Receptor,Conserved Residues (Binding Pocket),Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Non-Conserved Residues (Binding Pocket),Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,,,2x46,2x50,3x50,6x45,8x47,8x49,3x47,6x43,6x44,6x46,6x47,6x48,6x49,7x56,7x57
1,Parathyroid hormone receptor,PTH1R,R,H,E,L,N,E,I,L,V,M,P,L,F,I,Y
2,Parathyroid hormone receptor,PTH2R,R,H,E,L,N,E,I,L,V,V,L,V,F,I,Y
3,Glucagon receptor family,GLP1R,R,H,E,L,N,E,L,L,T,I,P,L,L,L,Y
4,Glucagon receptor family,GLP2R,R,H,E,L,N,E,L,L,V,I,P,L,L,Q,Y
5,Glucagon receptor family,GHRHR,R,H,E,L,N,E,L,L,F,I,P,L,F,L,Y
6,Glucagon receptor family,GIPR,R,H,E,L,N,E,L,L,T,V,P,L,L,L,Y
7,Glucagon receptor family,GCGR,R,H,E,L,N,E,L,L,T,I,P,L,L,L,Y
8,Glucagon receptor family,SCTR,R,H,E,L,N,E,L,L,L,I,P,L,F,L,Y
9,Calcitonin receptors,CTR,R,H,E,L,N,E,M,M,I,V,P,L,L,I,Y


In [27]:
# Notice the column names are different!
print("AI columns:", list(df_ai.columns[:5]))
print("Corrected columns:", list(df_corrected.columns[:5]))

AI columns: ['Family', 'Receptor', 'Conserved Residues (Binding Pocket)', 'Conserved Residues (Binding Pocket).1', 'Conserved Residues (Binding Pocket).2']
Corrected columns: ['Family', 'Receptor', 'Conserved Residues (Binding Pocket)', 'Unnamed: 3', 'Unnamed: 4']


### Comparing the Data

Since the column names differ, we'll compare by position using our `compare_dataframes` function:

In [28]:
# Run the comparison
match_rate, differences = compare_dataframes(df_ai, df_corrected)

COMPARISON RESULTS
Total cells compared: 289
Matching cells: 281
Different cells: 8
Match rate: 97.2%

DIFFERENCES FOUND (showing first 8)
  Row 0, Col 3:
    AI value:        '3x50'
    Corrected value: '2x50'
  Row 0, Col 4:
    AI value:        '6x50'
    Corrected value: '3x50'
  Row 0, Col 5:
    AI value:        '8x45'
    Corrected value: '6x45'
  Row 6, Col 15:
    AI value:        'A'
    Corrected value: 'L'
  Row 8, Col 10:
    AI value:        'I'
    Corrected value: 'L'
  Row 8, Col 11:
    AI value:        'L'
    Corrected value: 'I'
  Row 11, Col 0:
    AI value:        'Corticotropin-releasing factor receptors'
    Corrected value: 'Corticotropin-releasing factor'
  Row 12, Col 0:
    AI value:        'Corticotropin-releasing factor receptors'
    Corrected value: 'Corticotropin-releasing factor'


### Key Takeaways

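1. **AI extraction works** - 281 of 289 cells (97.2%) match
2. **But it's not perfect** - column names differ, three residue-position labels in the header row are wrong, three residue letters are wrong (one substitution and one swapped pair), and two family names carry an extra word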
3. **Always validate** - compare against the original or a manual check
4. **Document your process** - note that data was "extracted using AI, manually verified"

### Discussion Questions

- Would you trust this AI extraction for a publication?
- What checks would you do before using AI-extracted data?
- How does this compare to manually copying data (which also has errors)?

---
## Summary

Congratulations! You've learned:

| Concept | What You Can Do Now |
|---------|--------------------|
| **Variables** | Store and name data |
| **Types** | Work with strings, numbers, booleans |
| **Lists** | Collect multiple items |
| **Dictionaries** | Store key-value pairs |
| **Loops** | Repeat actions automatically |
| **Functions** | Create reusable code |
| **Pandas** | Load, inspect, and filter data |
| **AI Validation** | Trust but verify AI outputs |

### Next Steps

1. Try the exercises again without looking at solutions
2. Load your own CSV data and explore it
3. Check out the `python_cheatsheet.py` file for quick reference
4. In Session 2, we'll do real data analysis!

### Questions?

Programming is learned by doing (and making mistakes). Ask questions!