# Class 1 - Notebook 0: Accessing Omics Data and Reference Genomes

**Course**: BMI 503 - Introduction to Computer Science for Biomedical Informatics  
**Instructor**: Pratik Dutta  
**Term**: Fall 2025  
**Institution**: Stony Brook University

---

## Learning Objectives
1. Access human reference genome sequences
2. Download genomics data from NCBI
3. Access RNA-seq data from GEO
4. Download single-cell data
5. Access imaging data from public repositories

---

## Introduction: Where to Get Omics Data?

### Major Public Databases

**Genomics:**
- 🧬 **NCBI**: GenBank, RefSeq, dbSNP, ClinVar
- 🧬 **Ensembl**: Genome annotations, variation data
- 🧬 **UCSC Genome Browser**: Reference genomes, tracks

**Transcriptomics:**
- 📊 **GEO (Gene Expression Omnibus)**: Microarray, RNA-seq
- 📊 **SRA (Sequence Read Archive)**: Raw sequencing data
- 📊 **GTEx**: Tissue-specific expression
- 📊 **TCGA**: Cancer genomics

**Imagomics:**
- 🔬 **TCGA**: Whole slide images
- 🔬 **Human Protein Atlas**: Tissue/cell images
- 🔬 **IDC (Imaging Data Commons)**: Cancer imaging

---

## Setup

In [None]:
# Install required packages
!pip install biopython requests pandas geoparse scanpy wget -q
print("✅ Packages installed!")

In [None]:
import warnings
warnings.filterwarnings('ignore')

from Bio import Entrez, SeqIO
import requests
import pandas as pd
import gzip
import os

# Set your email for NCBI (REQUIRED!)
Entrez.email = "your.email@example.com"  # CHANGE THIS!

print("📦 Libraries loaded!")
print("\n⚠️ IMPORTANT: Set your email in Entrez.email above!")

---

# Part 1: Human Reference Genome

## What is a Reference Genome?

A **reference genome** is a representative DNA sequence used as a standard for comparison.

### Human Reference Versions:
- **GRCh38/hg38**: Current major assembly (released 2013, since updated through patch releases)
- **GRCh37/hg19**: Previous major assembly (released 2009)

### Why We Need It:
- Align sequencing reads
- Identify variants
- Annotate genes
- Compare across studies

---

## 1.1: Download Chromosome Sequence from NCBI

In [None]:
def download_chromosome_ncbi(chrom, start=None, end=None):
    """
    Download human chromosome sequence from NCBI.
    
    Parameters:
    - chrom: chromosome number (1-22, X, Y, MT)
    - start: start position (optional, for partial sequence)
    - end: end position (optional, for partial sequence)
    
    Returns:
    - SeqRecord object
    """
    # Chromosome accession numbers (GRCh38)
    chrom_accessions = {
        '1': 'NC_000001.11', '2': 'NC_000002.12', '3': 'NC_000003.12',
        '4': 'NC_000004.12', '5': 'NC_000005.10', '6': 'NC_000006.12',
        '7': 'NC_000007.14', '8': 'NC_000008.11', '9': 'NC_000009.12',
        '10': 'NC_000010.11', '11': 'NC_000011.10', '12': 'NC_000012.12',
        '13': 'NC_000013.11', '14': 'NC_000014.9', '15': 'NC_000015.10',
        '16': 'NC_000016.10', '17': 'NC_000017.11', '18': 'NC_000018.10',
        '19': 'NC_000019.10', '20': 'NC_000020.11', '21': 'NC_000021.9',
        '22': 'NC_000022.11', 'X': 'NC_000023.11', 'Y': 'NC_000024.10',
        'MT': 'NC_012920.1'
    }
    
    accession = chrom_accessions.get(str(chrom))
    if not accession:
        raise ValueError(f"Invalid chromosome: {chrom}")
    
    print(f"📥 Downloading chromosome {chrom} ({accession})...")
    
    # Fetch sequence
    if start is not None and end is not None:
        # Fetch partial sequence (seq_start/seq_stop are 1-based and inclusive)
        handle = Entrez.efetch(
            db="nucleotide",
            id=accession,
            rettype="fasta",
            retmode="text",
            seq_start=start,
            seq_stop=end
        )
        print(f"   Region: {start:,} - {end:,} ({end - start + 1:,} bp)")
    else:
        # Fetch full chromosome
        handle = Entrez.efetch(
            db="nucleotide",
            id=accession,
            rettype="fasta",
            retmode="text"
        )
        print(f"   Fetching full chromosome...")
    
    record = SeqIO.read(handle, "fasta")
    handle.close()
    
    print(f"✅ Downloaded {len(record.seq):,} bp")
    return record

# Example: Download a ~100 kb region of chromosome 13
# (overlaps the BRCA2 gene on GRCh38)
chr13_region = download_chromosome_ncbi('13', start=32300000, end=32400000)

print(f"\n📊 Sequence info:")
print(f"  ID: {chr13_region.id}")
print(f"  Length: {len(chr13_region.seq):,} bp")
print(f"  First 100 bp: {chr13_region.seq[:100]}")
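
The `seq_start` and `seq_stop` arguments are 1-based and inclusive, so a region from `start` to `end` contains `end - start + 1` bases. For larger regions it can help to request several smaller windows instead of one long sequence. The sketch below shows one way to compute such windows; `split_region` is a hypothetical helper written for this notebook, not part of Biopython, and the 50 kb window size is an arbitrary assumption.

```python
def split_region(start, end, window=50_000):
    """Split a 1-based, inclusive region into consecutive windows.

    Hypothetical helper for illustration: each (w_start, w_end) pair could be
    passed as start/end to a separate download request.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    windows = []
    w_start = start
    while w_start <= end:
        # The last window is clipped so it never runs past `end`
        w_end = min(w_start + window - 1, end)
        windows.append((w_start, w_end))
        w_start = w_end + 1
    return windows


windows = split_region(32_300_000, 32_400_000, window=50_000)
total = sum(w_end - w_start + 1 for w_start, w_end in windows)
print(windows)
print(f"Total bases: {total:,}")
```

Because both ends are inclusive, the example region yields two full windows plus a final one-base window, for 100,001 bases in total.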

## 1.2: Get Gene Sequence by Name

In [None]:
def get_gene_sequence(gene_name):
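    """
    Search NCBI Nucleotide for a gene and download the first RefSeq hit.
    
    Note: the first hit is not guaranteed to be the genomic gene record.
    It may be a transcript (mRNA) or another RefSeq entry, so check the
    description of the returned record.
    
    Parameters:
    - gene_name: Gene symbol (e.g., 'BRCA1', 'TP53')
    
    Returns:
    - SeqRecord object, or None if nothing was found
    """
    print(f"🔍 Searching for gene: {gene_name}")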
    
    # Search for the gene
    search_term = f"{gene_name}[Gene Name] AND Homo sapiens[Organism] AND RefSeq[Filter]"
    handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=5)
    record = Entrez.read(handle)
    handle.close()
    
    if not record['IdList']:
        print(f"❌ No results found for {gene_name}")
        return None
    
    # Get the first result
    gene_id = record['IdList'][0]
    print(f"✅ Found ID: {gene_id}")
    
    # Fetch sequence
    handle = Entrez.efetch(db="nucleotide", id=gene_id, rettype="fasta", retmode="text")
    seq_record = SeqIO.read(handle, "fasta")
    handle.close()
    
    print(f"📥 Downloaded {len(seq_record.seq):,} bp")
    return seq_record

# Example: Get TP53 gene sequence
tp53 = get_gene_sequence('TP53')

if tp53:
    print(f"\n📊 TP53 Gene:")
    print(f"  ID: {tp53.id}")
    print(f"  Description: {tp53.description}")
    print(f"  Length: {len(tp53.seq):,} bp")
    print(f"  First 100 bp: {tp53.seq[:100]}")
    
    # Calculate GC content
    gc_count = tp53.seq.count('G') + tp53.seq.count('C')
    gc_percent = (gc_count / len(tp53.seq)) * 100
    print(f"  GC content: {gc_percent:.2f}%")
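
The GC calculation above counts only uppercase `G` and `C` and divides by the full sequence length, so soft-masked (lowercase) bases or runs of `N` would skew the result. A minimal sketch of a more tolerant version follows; `gc_content` is a hypothetical helper, and ignoring everything except A, C, G, and T is a design assumption rather than a fixed convention.

```python
def gc_content(sequence):
    """Return GC percentage over unambiguous bases (A, C, G, T) only."""
    seq = str(sequence).upper()  # accept str or Seq-like objects, any case
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    if called == 0:
        # Nothing but N or other ambiguity codes: avoid division by zero
        return 0.0
    return 100.0 * gc / called


print(f"{gc_content('ggccNNat'):.1f}%")
print(f"{gc_content('ACGT'):.1f}%")
```

With this definition, `N` bases are excluded from both the numerator and the denominator.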

## 1.3: Download Reference Genome from Ensembl

In [None]:
def download_ensembl_chromosome(chrom, output_file="chr.fa.gz"):
    """
    Download chromosome FASTA from Ensembl FTP.
    
    Parameters:
    - chrom: chromosome number (1-22, X, Y, MT)
    - output_file: output filename
    """
    # Ensembl FTP URL (GRCh38, release 110)
    base_url = "https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/"
    
    # The same naming pattern covers chromosomes 1-22, X, Y, and MT
    filename = f"Homo_sapiens.GRCh38.dna.chromosome.{chrom}.fa.gz"
    
    url = base_url + filename
    
    print(f"📥 Downloading chromosome {chrom} from Ensembl...")
    print(f"   URL: {url}")
    print(f"   ⚠️ Warning: This can be large (50-250 MB per chromosome)")
    
    # Check if file exists
    if os.path.exists(output_file):
        print(f"   ✅ File already exists: {output_file}")
        return output_file
    
    # Download in streamed chunks so the whole file is never held in memory
    response = requests.get(url, stream=True, timeout=60)
    
    if response.status_code == 200:
        with open(output_file, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"✅ Downloaded to {output_file}")
        return output_file
    else:
        print(f"❌ Download failed: {response.status_code}")
        return None

# Example: Download mitochondrial genome (small, ~16kb)
print("💡 Downloading mitochondrial genome (small example)...\n")
mt_file = download_ensembl_chromosome('MT', output_file='chr_MT.fa.gz')

if mt_file:
    # Read the gzipped FASTA
    with gzip.open(mt_file, 'rt') as f:
        mt_seq = SeqIO.read(f, 'fasta')
    
    print(f"\n📊 Mitochondrial Genome:")
    print(f"  ID: {mt_seq.id}")
    print(f"  Length: {len(mt_seq.seq):,} bp")
    print(f"  First 100 bp: {mt_seq.seq[:100]}")
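
`SeqIO.read` expects exactly one record, which suits a single-chromosome file. For multi-record files, or when Biopython is not available, a gzipped FASTA can also be streamed with the standard library alone. The following is a minimal sketch under that assumption; `read_fasta_gz` is a hypothetical helper, and the demo file is created locally so the example does not need a network connection.

```python
import gzip
import os
import tempfile


def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Write a tiny two-record demo file, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fa.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">seq1 demo\nACGT\nNNAC\n>seq2\nGGCC\n")

records = list(read_fasta_gz(demo_path))
for name, sequence in records:
    print(name, len(sequence), sequence)
```

Because the function is a generator, only one record is held in memory at a time.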

---

# Part 2: Genomics Data from Public Databases

## 2.1: Search and Download from NCBI

In [None]:
def search_ncbi_variants(gene_name, max_results=10):
    """
    Search for variants in a gene from dbSNP.
    
    Parameters:
    - gene_name: Gene symbol
    - max_results: Maximum number of results
    
    Returns:
    - List of variant IDs
    """
    print(f"🔍 Searching variants in {gene_name}...")
    
    search_term = f"{gene_name}[Gene Name] AND human[Organism]"
    handle = Entrez.esearch(db="snp", term=search_term, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    
    variant_ids = record['IdList']
    # 'Count' is the total number of matches; IdList holds at most max_results IDs
    print(f"✅ Found {record['Count']} variants (returning {len(variant_ids)})")
    
    return variant_ids

def get_variant_info(rs_id):
    """
    Get detailed information about a variant.
    
    Parameters:
    - rs_id: dbSNP rs ID
    """
    handle = Entrez.efetch(db="snp", id=rs_id, rettype="xml", retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    
    return records

# Example: Search for BRCA1 variants
brca1_variants = search_ncbi_variants('BRCA1', max_results=5)

print(f"\n📋 First 5 BRCA1 variant IDs:")
for i, var_id in enumerate(brca1_variants, 1):
    print(f"  {i}. rs{var_id}")

---

# Part 3: Transcriptomics Data from GEO

## 3.1: Search GEO Database

In [None]:
def search_geo(query, max_results=10):
    """
    Search GEO database.
    
    Parameters:
    - query: Search term
    - max_results: Maximum results
    
    Returns:
    - List of GEO accession numbers
    """
    print(f"🔍 Searching GEO for: {query}")
    
    handle = Entrez.esearch(
        db="gds",
        term=query,
        retmax=max_results
    )
    record = Entrez.read(handle)
    handle.close()
    
    print(f"✅ Found {record['Count']} datasets")
    
    # Get details
    if record['IdList']:
        handle = Entrez.esummary(db="gds", id=",".join(record['IdList']))
        summaries = Entrez.read(handle)
        handle.close()
        return summaries
    
    return []

# Example: Search for breast cancer RNA-seq data
results = search_geo('breast cancer RNA-seq', max_results=5)

print(f"\n📊 Top 5 Breast Cancer RNA-seq Datasets:\n")
for i, dataset in enumerate(results, 1):
    print(f"{i}. {dataset['Accession']}")
    print(f"   Title: {dataset['title']}")
    print(f"   Type: {dataset['entryType']}")
    print(f"   Samples: {dataset['n_samples']}")
    print()

## 3.2: Download GEO Dataset with GEOparse

In [None]:
import GEOparse

def download_geo_series(geo_id):
    """
    Download and parse GEO Series.
    
    Parameters:
    - geo_id: GEO Series ID (e.g., 'GSE48968')
    
    Returns:
    - GEO object
    """
    print(f"📥 Downloading {geo_id}...")
    
    gse = GEOparse.get_GEO(geo=geo_id, destdir="./")
    
    print(f"✅ Downloaded!")
    print(f"\n📊 Dataset Info:")
    print(f"  Title: {gse.metadata['title'][0]}")
    print(f"  Organism: {gse.metadata['organism'][0] if 'organism' in gse.metadata else 'N/A'}")
    print(f"  Samples: {len(gse.gsms)}")
    print(f"  Platforms: {len(gse.gpls)}")
    
    return gse

# Example: public RNA-seq series (download time depends on the series size)
print("💡 Downloading an example dataset...\n")
try:
    gse = download_geo_series('GSE48968')  # RNA-seq series
    
    # Show sample info
    print(f"\n📋 First 3 samples:")
    for i, (sample_name, sample) in enumerate(list(gse.gsms.items())[:3], 1):
        print(f"  {i}. {sample_name}")
        print(f"     Title: {sample.metadata['title'][0]}")
        print(f"     Type: {sample.metadata['type'][0] if 'type' in sample.metadata else 'N/A'}")
        
except Exception as e:
    print(f"⚠️ Error downloading: {e}")
    print("💡 This requires an internet connection")

---

# Part 4: Single-Cell Data

## 4.1: Download Public scRNA-seq Datasets

In [None]:
import scanpy as sc

# Scanpy has built-in datasets
print("📦 Scanpy Built-in Datasets:\n")

datasets = [
    ('pbmc3k', '2,700 PBMCs from 10x Genomics'),
    ('pbmc68k_reduced', '68k PBMC dataset, subsampled and preprocessed'),
    ('paul15', 'Mouse hematopoiesis'),
    ('burczynski06', "Bulk PBMC microarray (ulcerative colitis and Crohn's disease)"),
]

for name, desc in datasets:
    print(f"  • {name}: {desc}")

# Download PBMC dataset
print(f"\n📥 Downloading PBMC3k dataset...")
adata = sc.datasets.pbmc3k()

print(f"✅ Downloaded!")
print(f"\n📊 Dataset Info:")
print(f"  Cells: {adata.n_obs:,}")
print(f"  Genes: {adata.n_vars:,}")
print(f"  Data type: {type(adata.X)}")
print(f"\n📋 Cell metadata columns:")
print(f"  {list(adata.obs.columns)}")
print(f"\n📋 Gene metadata columns:")
print(f"  {list(adata.var.columns)}")

## 4.2: Download from 10x Genomics

In [None]:
def download_10x_dataset(dataset_name, output_dir="./data"):
    """
    Download public 10x Genomics datasets.
    
    Parameters:
    - dataset_name: Dataset identifier
    - output_dir: Output directory
    """
    # Some public 10x datasets
    datasets = {
        'pbmc_1k_v3': 'https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5',
        'heart_1k_v3': 'https://cf.10xgenomics.com/samples/cell-exp/3.0.0/heart_1k_v3/heart_1k_v3_filtered_feature_bc_matrix.h5',
    }
    
    if dataset_name not in datasets:
        print(f"❌ Dataset {dataset_name} not available")
        print(f"Available datasets: {list(datasets.keys())}")
        return None
    
    url = datasets[dataset_name]
    filename = url.split('/')[-1]
    filepath = os.path.join(output_dir, filename)
    
    # Create directory
    os.makedirs(output_dir, exist_ok=True)
    
    if os.path.exists(filepath):
        print(f"✅ File already exists: {filepath}")
        return filepath
    
    print(f"📥 Downloading {dataset_name}...")
    print(f"   URL: {url}")
    print(f"   ⚠️ This may take a few minutes...")
    
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code != 200:
        # Avoid saving an error page to disk as if it were the dataset
        print(f"❌ Download failed: {response.status_code}")
        return None
    total_size = int(response.headers.get('content-length', 0))
    
    with open(filepath, 'wb') as f:
        downloaded = 0
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
            downloaded += len(chunk)
            if total_size > 0:
                percent = (downloaded / total_size) * 100
                print(f"\r   Progress: {percent:.1f}%", end='')
    
    print(f"\n✅ Downloaded to {filepath}")
    return filepath

# List available datasets
print("📚 Available 10x Genomics Datasets:")
print("  1. pbmc_1k_v3: 1k PBMCs")
print("  2. heart_1k_v3: 1k Heart cells")
print("\n💡 Uncomment below to download (files are ~15-25 MB)")

# Uncomment to download:
# h5_file = download_10x_dataset('pbmc_1k_v3')
# if h5_file:
#     adata = sc.read_10x_h5(h5_file)
#     print(f"\n📊 Loaded: {adata.n_obs} cells × {adata.n_vars} genes")

---

# Part 5: Imaging Data

## 5.1: Access Human Protein Atlas Images

In [None]:
def get_hpa_image_url(gene, tissue='breast'):
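    """
    Look up a gene in the Human Protein Atlas.
    
    Parameters:
    - gene: Gene symbol (e.g., 'BRCA1')
    - tissue: Tissue type
    
    Returns:
    - Parsed JSON data for the gene, or None if the request fails
    """
    # Note: HPA pages are often addressed by Ensembl gene ID
    # (e.g., ENSG00000012048-BRCA1), so a bare symbol may not resolve.
    base_url = f"https://www.proteinatlas.org/{gene}/tissue/{tissue}"
    
    print(f"🔗 Human Protein Atlas URL:")
    print(f"   {base_url}")
    print(f"\n💡 Visit this URL to see immunohistochemistry images")
    
    # API endpoint for gene-level data
    api_url = f"https://www.proteinatlas.org/{gene}.json"
    
    try:
        response = requests.get(api_url, timeout=30)
        if response.status_code == 200:
            data = response.json()
            print(f"\n✅ Found protein data for {gene}")
            return data
        print(f"\n❌ Request failed: {response.status_code}")
    except (requests.RequestException, ValueError) as e:
        # Network problems, or a response body that is not valid JSON
        print(f"\n❌ Error: {e}")
    
    return None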

# Example: Get BRCA1 protein atlas data
brca1_data = get_hpa_image_url('BRCA1', 'breast')

print("\n📸 To download images manually:")
print("   1. Visit the Human Protein Atlas website")
print("   2. Search for your gene")
print("   3. Right-click on images to download")

## 5.2: Download Sample Pathology Images

In [None]:
from PIL import Image
import matplotlib.pyplot as plt

def download_sample_histology_image():
    """
    Download a sample histology image from a public source.
    """
    print("📥 Sample whole slide image sources:")
    print("  • OpenSlide test data: https://openslide.cs.cmu.edu/download/openslide-testdata/")
    print("  • TCGA: https://portal.gdc.cancer.gov/")
    print("  • IDC: https://imaging.datacommons.cancer.gov/")
    
    # Sample H&E image from Wikimedia Commons
    sample_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/HE_kidney.jpg/800px-HE_kidney.jpg"
    
    print(f"\n📥 Downloading sample H&E image...")
    
    response = requests.get(sample_url, timeout=30)
    
    if response.status_code == 200:
        # Save image
        with open('sample_he.jpg', 'wb') as f:
            f.write(response.content)
        
        # Display
        img = Image.open('sample_he.jpg')
        
        plt.figure(figsize=(10, 10))
        plt.imshow(img)
        plt.title('Sample H&E Stained Tissue (Kidney)', fontsize=14, fontweight='bold')
        plt.axis('off')
        plt.show()
        
        print(f"✅ Downloaded and displayed!")
        print(f"   Image size: {img.size}")
        print(f"   Mode: {img.mode}")
    else:
        print(f"❌ Download failed: {response.status_code}")

# Download and display
download_sample_histology_image()

## 5.3: Access TCGA Imaging Data

In [None]:
import json

def search_tcga_images(cancer_type='BRCA', limit=5):
    """
    Search for TCGA imaging data.
    
    Parameters:
    - cancer_type: TCGA project code (e.g., 'BRCA', 'LUAD')
    - limit: Number of results
    """
    print(f"🔍 TCGA Imaging Data Access:\n")
    
    # TCGA project codes
    tcga_projects = {
        'BRCA': 'Breast Invasive Carcinoma',
        'LUAD': 'Lung Adenocarcinoma',
        'COAD': 'Colon Adenocarcinoma',
        'PRAD': 'Prostate Adenocarcinoma',
        'KIRC': 'Kidney Renal Clear Cell Carcinoma'
    }
    
    if cancer_type in tcga_projects:
        print(f"   Cancer Type: {tcga_projects[cancer_type]}")
    
    # GDC API endpoint
    api_url = "https://api.gdc.cancer.gov/files"
    
    # Query parameters. The filter is built as a dict and serialized with
    # json.dumps, because literal JSON braces clash with f-string placeholders.
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {"field": "cases.project.project_id", "value": [f"TCGA-{cancer_type}"]}},
            {"op": "in", "content": {"field": "files.data_type", "value": ["Slide Image"]}},
        ],
    }
    params = {
        'filters': json.dumps(filters),
        'size': limit,
        'fields': 'file_name,file_id,file_size,data_type,cases.project.project_id'
    }
    
    try:
        response = requests.get(api_url, params=params, timeout=30)
        
        if response.status_code == 200:
            data = response.json()
            hits = data['data']['hits']
            
            print(f"\n✅ Found {data['data']['pagination']['total']} slide images")
            print(f"\n📋 First {len(hits)} results:\n")
            
            for i, hit in enumerate(hits, 1):
                file_name = hit['file_name']
                file_id = hit['file_id']
                file_size = hit['file_size'] / (1024**2)  # Convert to MB
                
                print(f"{i}. {file_name}")
                print(f"   ID: {file_id}")
                print(f"   Size: {file_size:.1f} MB")
                print(f"   Download: https://api.gdc.cancer.gov/data/{file_id}")
                print()
            
            print(f"\n💡 To download images:")
            print(f"   Use the GDC Data Transfer Tool or visit:")
            print(f"   https://portal.gdc.cancer.gov/")
            
        else:
            print(f"❌ Query failed: {response.status_code}")
            
    except Exception as e:
        print(f"❌ Error: {e}")

# Example: Search for breast cancer images
search_tcga_images('BRCA', limit=3)
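
GDC filters are nested JSON, and literal braces do not mix well with f-string placeholders, so it is usually safer to build the filter as a Python dictionary and serialize it with `json.dumps`. The sketch below isolates that step; `build_gdc_slide_filter` is a hypothetical helper, and the field names are the ones used in the query above.

```python
import json


def build_gdc_slide_filter(project_id):
    """Return a JSON string selecting slide images from one GDC project."""
    filters = {
        "op": "and",
        "content": [
            {
                "op": "in",
                "content": {
                    "field": "cases.project.project_id",
                    "value": [project_id],
                },
            },
            {
                "op": "in",
                "content": {
                    "field": "files.data_type",
                    "value": ["Slide Image"],
                },
            },
        ],
    }
    # json.dumps handles quoting and escaping, so no braces are typed by hand
    return json.dumps(filters)


encoded = build_gdc_slide_filter("TCGA-BRCA")
print(encoded)
```

The resulting string can be passed as the `filters` query parameter, and `json.loads(encoded)` recovers the original structure for inspection.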

---

# Summary

## What We Learned

### 1. Reference Genomes ✅
- Download chromosomes from NCBI
- Get gene sequences by name
- Access Ensembl reference data

### 2. Genomics Data ✅
- Search NCBI databases
- Download variant data from dbSNP
- Access GenBank sequences

### 3. Transcriptomics Data ✅
- Search GEO database
- Download RNA-seq datasets
- Access single-cell data

### 4. Imaging Data ✅
- Access Human Protein Atlas
- Download pathology images
- Query TCGA imaging repository

---

## Key Resources

### Genomics
- **NCBI**: https://www.ncbi.nlm.nih.gov/
  - GenBank, RefSeq, dbSNP, ClinVar
- **Ensembl**: https://www.ensembl.org/
  - Genome annotations, variation data
- **UCSC Genome Browser**: https://genome.ucsc.edu/
  - Reference genomes, genome tracks
- **1000 Genomes**: https://www.internationalgenome.org/
  - Human genetic variation

### Transcriptomics
- **GEO (Gene Expression Omnibus)**: https://www.ncbi.nlm.nih.gov/geo/
  - Microarray and RNA-seq data
- **SRA (Sequence Read Archive)**: https://www.ncbi.nlm.nih.gov/sra
  - Raw sequencing data
- **GTEx**: https://gtexportal.org/
  - Tissue-specific gene expression
- **TCGA**: https://portal.gdc.cancer.gov/
  - Cancer genomics data
- **10x Genomics**: https://www.10xgenomics.com/resources/datasets
  - Single-cell datasets

### Imaging
- **Human Protein Atlas**: https://www.proteinatlas.org/
  - Protein expression images
- **TCGA Imaging**: https://portal.gdc.cancer.gov/
  - Cancer pathology slides
- **IDC (Imaging Data Commons)**: https://imaging.datacommons.cancer.gov/
  - Cancer imaging data
- **OpenSlide Test Data**: https://openslide.cs.cmu.edu/download/openslide-testdata/
  - Sample whole slide images

---

## Best Practices

### 1. **Always Set Your Email for NCBI** ⚠️
```python
Entrez.email = "your.email@example.com"
```
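NCBI **requires** this for API usage: it is how they contact you about problematic requests before blocking access, and Biopython warns if it is not set.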

### 2. **Check File Sizes Before Downloading**
- **Chromosomes**: 50-250 MB each
- **Whole genomes**: 3+ GB
- **Single-cell datasets**: 10-100+ MB
- **Whole slide images**: 100 MB - 10+ GB
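
One way to check a file's size before committing to a download is to send an HTTP `HEAD` request and read the `Content-Length` header. The sketch below uses only the standard library; `human_size` and `remote_size` are hypothetical helpers, and not every server reports `Content-Length`, so the result may be `None`.

```python
import urllib.request


def human_size(n_bytes):
    """Format a byte count using binary units (1 KB = 1024 bytes)."""
    size = float(n_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024


def remote_size(url, timeout=30):
    """Return Content-Length in bytes from a HEAD request, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        length = response.headers.get("Content-Length")
    return int(length) if length is not None else None


print(human_size(16_569))
print(human_size(250 * 1024**2))

# Requires a network connection, so it is left commented out:
# size = remote_size("https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz")
# print(human_size(size) if size is not None else "size not reported")
```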

### 3. **Use Appropriate Data Versions**
- **GRCh38/hg38**: Current human genome (2013+)
- **GRCh37/hg19**: Previous version (2009)
- Always check which version a dataset uses!
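
A small guard can catch mixed assemblies before two datasets are combined. The following is a minimal sketch: `ASSEMBLY_ALIASES`, `normalize_assembly`, and `same_assembly` are hypothetical names, and treating the UCSC and GRC names as interchangeable is a simplification, since the two naming schemes also differ in chromosome labels (`chr1` versus `1`).

```python
# Map common assembly names (case-insensitive) to one canonical label
ASSEMBLY_ALIASES = {
    "grch38": "GRCh38",
    "hg38": "GRCh38",
    "grch37": "GRCh37",
    "hg19": "GRCh37",
}


def normalize_assembly(name):
    """Return the canonical assembly label, or raise for unknown names."""
    key = name.strip().lower()
    if key not in ASSEMBLY_ALIASES:
        raise ValueError(f"Unknown assembly name: {name!r}")
    return ASSEMBLY_ALIASES[key]


def same_assembly(name_a, name_b):
    """True if both names refer to the same human reference assembly."""
    return normalize_assembly(name_a) == normalize_assembly(name_b)


print(same_assembly("hg38", "GRCh38"))
print(same_assembly("hg19", "GRCh38"))
```

Raising on unknown names is deliberate: silently accepting an unrecognized assembly would defeat the purpose of the check.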

### 4. **Cache Downloaded Data**
```python
if os.path.exists(filename):
    print(f"File exists: {filename}")
    return filename  # Don't re-download
```
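
A cache check is more reliable when an interrupted download cannot leave a partial file under the final name. The sketch below writes to a temporary file first and renames it only on success; `cached_fetch` is a hypothetical helper, and `fetch_bytes` stands in for whatever function actually performs the download.

```python
import os
import tempfile


def cached_fetch(path, fetch_bytes):
    """Return `path`, calling fetch_bytes() only if the file is not cached."""
    if os.path.exists(path):
        return path  # Don't re-download
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it,
    # so a failed download never leaves a partial file at `path`
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch_bytes())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return path


calls = []

def fake_download():
    """Stand-in for a real download; records how often it is called."""
    calls.append(1)
    return b">demo\nACGT\n"


target = os.path.join(tempfile.mkdtemp(), "demo.fa")
cached_fetch(target, fake_download)
cached_fetch(target, fake_download)  # served from the cache
print(f"Downloads performed: {len(calls)}")
```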

### 5. **Respect Rate Limits**
- **NCBI**: Max 3 requests/second without API key
- **NCBI with API key**: 10 requests/second
- Add delays between requests if needed:
```python
import time
time.sleep(0.34)  # ~3 requests/second
```
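
A fixed `time.sleep` after every request also waits when the previous call was already slow. A small throttle that only sleeps for the remaining part of the interval is one alternative. This is a sketch, with `Throttle` a hypothetical class rather than anything provided by Biopython.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, max_per_second):
        self.min_interval = 1.0 / max_per_second
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(max_per_second=3)
started = time.monotonic()
for request_number in range(4):
    throttle.wait()
    # An NCBI request would go here
elapsed = time.monotonic() - started
print(f"4 calls took {elapsed:.2f} s")
```

The first call goes through immediately, so four calls at three per second take roughly one second in total.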

### 6. **Handle Errors Gracefully**
```python
try:
    data = download_data()
except Exception as e:
    print(f"Error: {e}")
    # Have a backup plan
```
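
For transient failures such as a busy server, a "backup plan" can be as simple as retrying with an increasing delay. The sketch below shows exponential backoff; `retry` and `flaky_download` are hypothetical names, and the attempt count and delays are arbitrary assumptions to tune per service.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as error:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.2f} s")
            time.sleep(delay)


state = {"calls": 0}

def flaky_download():
    """Stand-in for a download that fails twice before succeeding."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "payload"


result = retry(flaky_download, attempts=5, base_delay=0.01,
               exceptions=(ConnectionError,))
print(result, state["calls"])
```

Restricting `exceptions` to errors that are plausibly temporary avoids retrying on genuine bugs such as a `KeyError`.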

### 7. **Cite Data Sources**
Always acknowledge data sources in publications!

**Example citations:**
- NCBI: "Data from NCBI GenBank (https://www.ncbi.nlm.nih.gov/)"
- GEO: "Data from GEO Series GSE##### (Barrett et al., 2013)"
- TCGA: "Data from TCGA Research Network (https://www.cancer.gov/tcga)"

---

## Common Data Formats

### Genomics
| Format | Description | Use Case |
|--------|-------------|----------|
| FASTA | Sequence data | Reference genomes, gene sequences |
| FASTQ | Sequencing reads + quality | Raw sequencing data |
| VCF | Variant calls | SNPs, indels, structural variants |
| BED | Genomic regions | Gene annotations, ChIP-seq peaks |
| BAM | Aligned reads | Mapped sequencing data |
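
To make the FASTQ row concrete: each record spans four lines (header, sequence, separator, quality string), and each quality character encodes a Phred score. A minimal parser sketch follows; `parse_fastq` is a hypothetical helper that assumes the common Phred+33 encoding and unwrapped four-line records.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, qualities) from 4-line FASTQ records.

    Assumes Phred+33 quality encoding and one line per sequence.
    """
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"Length mismatch in record number {i // 4 + 1}")
        # Phred+33: the score is the character's ASCII code minus 33
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


example = ["@read1", "ACGT", "+", "II5!"]
parsed = list(parse_fastq(example))
for read_id, sequence, qualities in parsed:
    print(read_id, sequence, qualities)
```

Here `I` decodes to 40, `5` to 20, and `!` to 0.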

### Transcriptomics
| Format | Description | Use Case |
|--------|-------------|----------|
| CSV/TSV | Count matrices | Gene expression tables |
| H5AD | AnnData format | Single-cell data (scanpy) |
| MTX | Matrix Market | Sparse single-cell counts |
| H5 | HDF5 format | 10x Genomics data |
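
The Matrix Market coordinate format stores only non-zero entries as `row column value` triples with 1-based indices, which is why it suits sparse single-cell counts. The sketch below parses a tiny example by hand to show the layout; `parse_mtx` is a hypothetical helper, and real files are normally loaded with a library reader instead.

```python
def parse_mtx(text):
    """Parse Matrix Market coordinate text into ((n_rows, n_cols), entries).

    `entries` maps 0-based (row, col) tuples to float values.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("%%MatrixMarket"):
        raise ValueError("missing MatrixMarket header")
    # Lines starting with '%' after the header are comments
    body = [ln for ln in lines[1:] if not ln.startswith("%")]
    n_rows, n_cols, n_entries = (int(x) for x in body[0].split())
    entries = {}
    for ln in body[1:]:
        row, col, value = ln.split()
        # File indices are 1-based; convert to 0-based for Python
        entries[(int(row) - 1, int(col) - 1)] = float(value)
    if len(entries) != n_entries:
        raise ValueError("entry count does not match the size line")
    return (n_rows, n_cols), entries


example_mtx = """%%MatrixMarket matrix coordinate integer general
% 3 genes x 2 cells, 3 non-zero counts
3 2 3
1 1 5
3 1 2
2 2 7
"""

shape, entries = parse_mtx(example_mtx)
print(shape)
print(entries)
```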

### Imaging
| Format | Description | Use Case |
|--------|-------------|----------|
| SVS | Aperio format | Whole slide images |
| TIFF | Standard image | Microscopy images |
| JPEG/PNG | Compressed | Web display, thumbnails |
| DICOM | Medical imaging | Clinical scans (MRI, CT) |
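
File extensions are not always reliable, so it can help to identify an image by its leading "magic" bytes. The sketch below checks a few well-known signatures; `sniff_image_format` is a hypothetical helper, and note that SVS slides are TIFF-based, so they would be reported as TIFF or BigTIFF here.

```python
def sniff_image_format(data):
    """Guess an image format from the first bytes of a file."""
    if data[:3] == b"\xff\xd8\xff":
        return "JPEG"
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        return "PNG"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "TIFF"  # little-endian or big-endian classic TIFF
    if data[:4] in (b"II+\x00", b"MM\x00+"):
        return "BigTIFF"
    # DICOM files carry a 128-byte preamble followed by the marker "DICM"
    if len(data) >= 132 and data[128:132] == b"DICM":
        return "DICOM"
    return "unknown"


print(sniff_image_format(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
print(sniff_image_format(b"II*\x00" + b"\x00" * 16))
print(sniff_image_format(b"\x00" * 128 + b"DICM"))

# With a real file, reading the first 132 bytes is enough:
# with open("sample_he.jpg", "rb") as fh:
#     print(sniff_image_format(fh.read(132)))
```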

---

## Exercises

### Genomics (Exercises 1-4):
1. **Download BRCA2 gene sequence**
   - Use `get_gene_sequence('BRCA2')`
   - Calculate its length and GC content

2. **Get chromosome 21 sequence**
   - Download a 1 Mb region: positions 10,000,000 to 11,000,000
   - Count the number of 'N' bases (unknown)

3. **Search variants in TP53**
   - Find 10 variants in the TP53 gene
   - Print their rs IDs

4. **Compare chromosome sizes**
   - Look up the sizes of chr X and chr Y
   - Calculate the size difference

### Transcriptomics (Exercises 5-8):
5. **Search GEO for Alzheimer's data**
   - Search: "alzheimer disease RNA-seq"
   - List the top 5 results with sample counts

6. **Load PBMC 68k dataset**
   - Use `sc.datasets.pbmc68k_reduced()`
   - Print number of cells and genes

7. **Find highly expressed genes**
   - In PBMC3k dataset, calculate mean expression
   - Find top 10 most expressed genes

8. **Compare datasets**
   - Load 3 different scanpy datasets
   - Create a table comparing their sizes

### Imaging (Exercises 9-10):
9. **Find protein expression images**
   - Search Human Protein Atlas for TP53
   - Find images for liver tissue

10. **Download tissue images**
    - Download H&E images for 3 different tissues
    - Display them in a grid

---

## Troubleshooting

### Problem: NCBI download fails
**Solution**: 
- Check your email is set: `Entrez.email = "your@email.com"`
- Check internet connection
- Try again later (server may be busy)

### Problem: File too large
**Solution**:
- Download smaller regions first
- Use streaming downloads
- Check available disk space

### Problem: GEO download slow
**Solution**:
- Be patient (some datasets are large)
- Use `destdir` parameter to save to specific location
- Consider downloading during off-peak hours

### Problem: Import errors
**Solution**:
```bash
pip install --upgrade biopython geoparse scanpy
```

---

## Next Steps

Now that you know how to **access** omics data, proceed to learn how to **analyze** it:

### 📓 **Notebook 1: Genomics Packages**
- Biopython, pysam, PyVCF, pybedtools
- Sequence manipulation, variant calling
- **Duration**: ~70 minutes

### 📓 **Notebook 2: Transcriptomics Packages**
- pandas, scanpy, anndata, gseapy
- Bulk RNA-seq and single-cell analysis
- **Duration**: ~70 minutes

### 📓 **Notebook 3: Imagomics Packages**
- PIL, scikit-image, opencv, squidpy
- Image processing, spatial transcriptomics
- **Duration**: ~30 minutes

---

## 🎓 Congratulations!

**You now know how to access public omics data!**

### Key skills acquired:
- ✅ Download human reference genome sequences
- ✅ Access NCBI databases (GenBank, dbSNP)
- ✅ Query and download GEO datasets
- ✅ Get single-cell RNA-seq data
- ✅ Find and access imaging resources

### Remember:
- 🔑 Always set your email for NCBI
- 💾 Cache downloaded data
- ⏱️ Respect rate limits
- 📚 Cite data sources

**Happy analyzing! 🧬📊🔬**