# Class 1: Genomic Sequence Analysis with Python

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/duttaprat/BMI_503/blob/main/Class1_Genomics/notebook1_genomics_sequence_analysis.ipynb)

**Course**: BMI 503 - Introduction to Computer Science for Biomedical Informatics  
**Instructors**: Prof. Ramana Davuluri & Prof. Fusheng Wang  
**Institution**: Stony Brook University

---

## Learning Objectives
By the end of this notebook, you will:
1. Use **Biopython** for sequence manipulation
2. Work with **pysam** for alignment files (BAM/SAM)
3. Analyze variants using **PyVCF** (installed below as the `PyVCF3` package, imported as `vcf`)
4. Manipulate genomic intervals with **pybedtools**
5. Use **scikit-bio** for sequence analysis
6. Visualize data with **matplotlib/seaborn**
7. Process data with **pandas**

## Setup & Installation

In [None]:
# Install packages
!pip install biopython pysam PyVCF3 scikit-bio pandas matplotlib seaborn -q
!apt-get install -y bedtools -qq
!pip install pybedtools -q
print("✅ Installation complete!")

In [None]:
# Import libraries
import warnings
warnings.filterwarnings('ignore')

from Bio import SeqIO, Entrez, SeqUtils
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import pysam
import vcf
import pybedtools
import skbio
from skbio import DNA, RNA, Protein
from skbio.sequence.distance import hamming
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
print("📦 Libraries imported!")

---
## Part 1: Biopython

In [None]:
# Basic sequence operations
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(f"DNA: {dna}")
print(f"Complement: {dna.complement()}")
print(f"Reverse Complement: {dna.reverse_complement()}")
print(f"RNA: {dna.transcribe()}")
print(f"Protein: {dna.transcribe().translate()}")
print(f"GC Content: {SeqUtils.gc_fraction(dna)*100:.2f}%")
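
To make the Biopython calls above less of a black box, here is a minimal pure-Python sketch of what a reverse complement and a GC fraction compute. The helper names `reverse_complement` and `gc_fraction_plain` are hypothetical and written for this illustration only; they assume an unambiguous A/C/G/T alphabet and are not how Biopython implements these operations.

```python
# Illustrative helpers only; the notebook itself relies on Biopython.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the resulting string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def gc_fraction_plain(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

example = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(reverse_complement(example))
print(f"{gc_fraction_plain(example) * 100:.2f}%")
```

In real analyses, prefer the library versions, which also handle ambiguous bases.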

In [None]:
# Download from NCBI
# NCBI asks for a contact email with Entrez requests; replace the placeholder with your own
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()
print(f"Downloaded: {record.description[:60]}...")
print(f"Length: {len(record.seq)} bp")

---
## Part 2: Pysam (BAM/SAM)

In [None]:
# Download sample BAM
!wget -q https://github.com/pysam-developers/pysam/raw/master/tests/ex1.bam
!wget -q https://github.com/pysam-developers/pysam/raw/master/tests/ex1.bam.bai

bam = pysam.AlignmentFile("ex1.bam", "rb")
print(f"References: {bam.references}")
print(f"Total reads: {bam.count()}")
bam.close()

---
## Part 3: PyVCF (Variants)

In [None]:
# Download VCF
!wget -q https://raw.githubusercontent.com/jamescasbon/PyVCF/master/vcf/test/example-4.0.vcf -O sample.vcf

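# Open the file in a context manager so the handle is closed afterwards
with open('sample.vcf', 'r') as vcf_file:
    vcf_reader = vcf.Reader(vcf_file)
    print(f"Samples: {vcf_reader.samples}")

    # Show the first three records
    for i, record in enumerate(vcf_reader):
        if i >= 3:
            break
        print(f"{record.CHROM}:{record.POS} {record.REF}>{record.ALT[0]} Q={record.QUAL}")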
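
The fields printed above (`CHROM`, `POS`, `REF`, `ALT`, `QUAL`) correspond to the first tab-separated columns of each VCF data line. The sketch below parses one such line by hand to show where those values come from. The `Variant` type, the `parse_vcf_line` helper, and the example line are all hypothetical, made up for this illustration; the line is not taken from `sample.vcf`, and a real parser such as PyVCF handles far more (headers, INFO, genotypes).

```python
from typing import NamedTuple, Optional

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: list
    qual: Optional[float]

def parse_vcf_line(line: str) -> Variant:
    """Split one VCF data line and keep CHROM, POS, REF, ALT and QUAL."""
    chrom, pos, _id, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
    return Variant(
        chrom=chrom,
        pos=int(pos),                               # VCF positions are 1-based
        ref=ref,
        alts=alt.split(","),                        # ALT may list several alleles
        qual=None if qual == "." else float(qual),  # "." marks a missing value
    )

# Made-up data line for illustration only.
example_line = "chr1\t12345\t.\tG\tA,T\t50\tPASS\t."
variant = parse_vcf_line(example_line)
print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts[0]} Q={variant.qual}")
```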

---
## Part 4: Pybedtools

In [None]:
# Create BED files
with open('features.bed', 'w') as f:
    f.write("chr1\t100\t200\tfeature1\n")
    f.write("chr1\t300\t400\tfeature2\n")

with open('genes.bed', 'w') as f:
    f.write("chr1\t150\t250\tgene1\n")

features = pybedtools.BedTool('features.bed')
genes = pybedtools.BedTool('genes.bed')
intersect = features.intersect(genes)
print("Intersections:")
print(intersect)
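
BED intervals are 0-based and half-open, so `chr1 100 200` covers positions 100 through 199. The sketch below reproduces the overlap logic for the two small files above in plain Python, as a rough picture of what an intersection reports. The function `intersect_intervals` and the tuple layout are hypothetical choices for this illustration, and the nested loop is deliberately naive; it is not how bedtools works internally.

```python
def intersect_intervals(a, b):
    """Return the overlapping portion of each interval in `a` with each in `b`.

    Intervals are (chrom, start, end, name) tuples in BED coordinates
    (0-based, half-open), so [100, 200) and [200, 300) do not overlap.
    """
    hits = []
    for chrom_a, start_a, end_a, name_a in a:
        for chrom_b, start_b, end_b, _name_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # a non-empty overlap
                hits.append((chrom_a, start, end, name_a))
    return hits

feature_ivs = [("chr1", 100, 200, "feature1"), ("chr1", 300, 400, "feature2")]
gene_ivs = [("chr1", 150, 250, "gene1")]
print(intersect_intervals(feature_ivs, gene_ivs))
# [('chr1', 150, 200, 'feature1')]
```

Only `feature1` overlaps `gene1`, and only the shared region 150 to 200 is reported.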

---
## Part 5: Scikit-bio

In [None]:
# Sequence analysis
seq = DNA('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
print(f"GC content: {seq.gc_content():.2%}")
print(f"Reverse complement: {seq.reverse_complement()}")

# k-mers
kmers = list(seq.iter_kmers(3))
print(f"\nTotal 3-mers: {len(kmers)}")
kmer_counts = Counter(str(k) for k in kmers)
print(f"Most common: {kmer_counts.most_common(3)}")
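
Overlapping k-mers are just every length-k substring, taken one position at a time, so a sequence of length L yields L - k + 1 of them. The following is a minimal standard-library sketch of that idea; `count_kmers` is a hypothetical helper for illustration and is not part of scikit-bio.

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in seq."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # range() is empty when k > len(seq), which gives an empty Counter
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

kmer_totals = count_kmers("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", 3)
print(sum(kmer_totals.values()))  # 37, i.e. 39 - 3 + 1
print(kmer_totals.most_common(3))
```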

---
## Part 6: Pandas

In [None]:
# Genomic dataframe
data = {
    'Gene': ['GENE1', 'GENE2', 'GENE3'],
    'Chr': ['chr1', 'chr1', 'chr2'],
    'Start': [1000, 5000, 2000],
    'End': [2000, 6500, 3500],
    'Expression': [45.2, 123.5, 67.8]
}
df = pd.DataFrame(data)
df['Length'] = df['End'] - df['Start']
print(df)
print(f"\nMean expression: {df['Expression'].mean():.2f}")

---
## Part 7: Visualization

In [None]:
# Nucleotide composition
seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG" * 3)
counts = [seq.count(n) for n in ['A', 'T', 'G', 'C']]

plt.figure(figsize=(10, 5))
plt.bar(['A', 'T', 'G', 'C'], counts, color=['#FF6B6B', '#4ECDC4', '#FFD93D', '#95E1D3'])
plt.title('Nucleotide Composition', fontweight='bold')
plt.ylabel('Count')
plt.grid(alpha=0.3)
plt.show()

In [None]:
# GC content sliding window
sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG" * 10)
window = 20
positions, gc_vals = [], []

# +1 so the final full window (ending at the last base) is included
for i in range(0, len(sequence) - window + 1, 5):
    win = sequence[i:i+window]
    positions.append(i)
    gc_vals.append(SeqUtils.gc_fraction(win) * 100)

plt.figure(figsize=(12, 5))
plt.plot(positions, gc_vals, linewidth=2, color='#2E86AB')
plt.axhline(50, color='red', linestyle='--', alpha=0.5)
plt.title('GC Content Sliding Window', fontweight='bold')
plt.xlabel('Position')
plt.ylabel('GC %')
plt.grid(alpha=0.3)
plt.show()
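
A sliding-window scan is defined by three numbers: sequence length, window size, and step. To include every full window, start positions run from 0 up to and including `length - window`. The sketch below checks that arithmetic for the values used in this cell (a 390-base sequence, window 20, step 5); `window_starts` is a hypothetical helper introduced only for this illustration.

```python
def window_starts(seq_len: int, window: int, step: int) -> list:
    """Start indices of every full window of size `window`, `step` apart."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return list(range(0, seq_len - window + 1, step))

starts = window_starts(390, 20, 5)
print(len(starts), starts[-1])  # 75 370
```

The last window starts at 370 and ends exactly at base 390, so no bases at the end of the sequence are left unscanned.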

---
## Complete Workflow Example

In [None]:
print("="*60)
print("COMPLETE GENOMIC ANALYSIS WORKFLOW")
print("="*60)

# 1. Download sequence
print("\n[1] Downloading TP53 from NCBI...")
Entrez.email = "your.email@example.com"
handle = Entrez.efetch(db="nucleotide", id="NM_000546", rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
handle.close()
print(f"✅ {record.description[:50]}... ({len(record.seq)} bp)")

# 2. Analyze
print("\n[2] Analyzing sequence properties...")
gc = SeqUtils.gc_fraction(record.seq) * 100
print(f"✅ GC Content: {gc:.2f}%")

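# 3. Find motifs
# Note: NM_000546 is an mRNA record, so these are TATAAA motif matches,
# not annotated promoter TATA boxes.
print("\n[3] Finding TATA boxes...")
tata_pos = [i for i in range(len(record.seq) - 6 + 1) if record.seq[i:i+6] == "TATAAA"]
print(f"✅ Found {len(tata_pos)} TATA box(es)")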

# 4. k-mers
print("\n[4] Generating k-mers...")
dna_skbio = DNA(str(record.seq[:100]))
kmers = list(dna_skbio.iter_kmers(3))
print(f"✅ Generated {len(kmers)} 3-mers")

# 5. Create dataframe
print("\n[5] Creating summary dataframe...")
summary = pd.DataFrame({
    'Gene': ['TP53'],
    'Length': [len(record.seq)],
    'GC%': [gc],
    'TATA_boxes': [len(tata_pos)]
})
print(summary)

print("\n✅ Workflow complete!")
print("="*60)
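
Step 3 of the workflow scans the sequence for an exact motif. One possible way to write the same search as a reusable function is sketched below using `str.find`, which reports overlapping matches when the search restarts one position after each hit. The name `find_motif` is hypothetical, and the function assumes plain strings (for a Biopython `Seq`, pass `str(record.seq)`).

```python
def find_motif(seq: str, motif: str) -> list:
    """Return the 0-based start of every (possibly overlapping) exact match."""
    if not motif:
        raise ValueError("motif must not be empty")
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        # Restart one base later so overlapping matches are not skipped
        start = seq.find(motif, start + 1)
    return positions

print(find_motif("TATAAATATAAA", "TATAAA"))  # [0, 6]
print(find_motif("AAAA", "AA"))              # [0, 1, 2]
```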

---
## Exercises

### Exercise 1: Sequence Analysis
Download BRCA1 (NM_007294), calculate its GC content, and find the position of every ATG in the sequence.

In [None]:
# Your code here


### Exercise 2: VCF Analysis
Filter variants with quality > 50 and collect them in a pandas DataFrame.

In [None]:
# Your code here


### Exercise 3: k-mer Analysis
Generate 4-mers from a sequence and find the most frequent one.

In [None]:
# Your code here


---
## Summary

You learned:
- ✅ Biopython for sequences
- ✅ Pysam for BAM files
- ✅ PyVCF for variants
- ✅ Pybedtools for intervals
- ✅ Scikit-bio for analysis
- ✅ Pandas for data
- ✅ Matplotlib/Seaborn for viz

### Resources
- [Biopython](http://biopython.org)
- [Pysam Docs](https://pysam.readthedocs.io)
- [NCBI](https://www.ncbi.nlm.nih.gov)