# Session 3: Navigating NCBI and UCSC Genome Databases

## Learning Objectives

By the end of this session, you will be able to:

1. Navigate the NCBI database suite to find genomic resources
2. Search for and download reference genomes and annotations
3. Use the UCSC Genome Browser to visualise genomic regions
4. Download public sequencing data from the SRA
5. Retrieve gene and protein information programmatically

## 1. Introduction to Biological Databases

Biological databases store and organise genomic, proteomic, and other biological data so that it can be searched, cross-referenced, and downloaded. Two of the most widely used resources are:

- **NCBI (National Center for Biotechnology Information)**: A comprehensive collection of databases including GenBank, RefSeq, SRA, and more
- **UCSC Genome Browser**: A powerful visualisation tool and database for genome annotations

### Why Use These Databases?

| Task | Database |
|------|----------|
| Download reference genomes | NCBI RefSeq, UCSC |
| Find gene sequences | NCBI Gene, UCSC |
| Access raw sequencing data | NCBI SRA |
| Visualise genomic regions | UCSC Genome Browser |
| Find protein sequences | NCBI Protein, UniProt |
| Identify genetic variants | dbSNP, ClinVar |

## 2. NCBI Database Overview

NCBI hosts over 40 interconnected databases. Here are the most relevant for bioinformatics:

### Core NCBI Databases

| Database | URL | Purpose |
|----------|-----|--------|
| **GenBank** | ncbi.nlm.nih.gov/genbank | Primary nucleotide sequence repository |
| **RefSeq** | ncbi.nlm.nih.gov/refseq | Curated reference sequences |
| **SRA** | ncbi.nlm.nih.gov/sra | Sequence Read Archive (raw data) |
| **Gene** | ncbi.nlm.nih.gov/gene | Gene-centred information |
| **Assembly** | ncbi.nlm.nih.gov/assembly | Genome assemblies |
| **Taxonomy** | ncbi.nlm.nih.gov/taxonomy | Organism classification |

### GenBank vs RefSeq

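GenBank is an archive of everything submitters deposit, whereas RefSeq is a curated, non-redundant set built from those records. Which one you pick affects the quality and redundancy of the sequences you analyse: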

| Feature | GenBank | RefSeq |
|---------|---------|--------|
| Submission | Anyone can submit | NCBI curated |
| Redundancy | Contains duplicates | Non-redundant |
| Quality | Variable | High quality, reviewed |
| Accession prefix | Various (e.g., AB, AY) | NM_, NR_, XM_, NC_ |
| Best for | All available sequences | Reference analyses |
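
The RefSeq prefixes in the table can be checked mechanically. The following is a minimal sketch, not an official NCBI tool: `refseq_type` is an illustrative function name, the accession in the example call is a made-up placeholder, and only the four prefixes listed above are handled.

```bash
# Classify an accession by the RefSeq prefixes listed in the table above.
# refseq_type is a hypothetical helper; other RefSeq prefixes exist
# but are not covered here.
refseq_type() {
    case "$1" in
        NM_*) echo "RefSeq curated mRNA" ;;
        NR_*) echo "RefSeq curated non-coding RNA" ;;
        XM_*) echo "RefSeq predicted mRNA" ;;
        NC_*) echo "RefSeq complete genomic molecule" ;;
        *)    echo "not one of the listed RefSeq prefixes" ;;
    esac
}

# NM_000000.0 is a placeholder, not a real record
refseq_type "NM_000000.0"   # prints: RefSeq curated mRNA
```

An accession without an underscore after the two-letter prefix (for example one starting with AB or AY) falls through to the last branch, which is a hint that it is probably a GenBank record rather than a RefSeq one.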

## 3. Searching NCBI

### Using the Web Interface

**Step 1**: Go to [ncbi.nlm.nih.gov](https://www.ncbi.nlm.nih.gov)

**Step 2**: Select the appropriate database from the dropdown menu

**Step 3**: Enter your search query

### Search Query Examples

**Finding a Reference Genome (Assembly Database)**:
```
Search: "Ovis aries"[Organism] AND "reference genome"[Filter]
```

**Finding a Gene (Gene Database)**:
```
Search: MSTN[Gene Name] AND "Ovis aries"[Organism]
```

**Finding Sequencing Data (SRA Database)**:
```
Search: "RNA-Seq"[Strategy] AND "Ovis aries"[Organism] AND "liver"[All Fields]
```

### Using Search Filters

NCBI supports advanced search syntax:

| Syntax | Example | Description |
|--------|---------|-------------|
| `[Organism]` | "Ovis aries"[Organism] | Filter by species |
| `[Gene Name]` | MSTN[Gene Name] | Search specific gene |
| `[Title]` | muscle[Title] | Search in title field |
| `AND`, `OR`, `NOT` | sheep AND muscle NOT fat | Boolean operators |
| `[Filter]` | refseq[Filter] | Apply specific filters |
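
The same query syntax can be sent to NCBI programmatically through the Entrez E-utilities `esearch` endpoint. The sketch below assumes `curl` is installed; `ncbi_search` and the `DRY_RUN` switch are illustrative names introduced here, not part of any NCBI tool. `curl -G --data-urlencode` takes care of escaping the quotes, spaces, and brackets in the search term.

```bash
# Send an Entrez query to the E-utilities esearch endpoint.
# ncbi_search and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN=1 the curl command is printed instead of executed.
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

ncbi_search() {
    db=$1      # Entrez database, e.g. gene, assembly, sra
    term=$2    # query written exactly as in the web search box
    run=""
    [ -n "${DRY_RUN:-}" ] && run="echo"
    $run curl -sG "${EUTILS}/esearch.fcgi" \
        --data-urlencode "db=${db}" \
        --data-urlencode "term=${term}" \
        --data-urlencode "retmode=json"
}

# Example (needs network access); the matching record IDs are
# returned as JSON:
# ncbi_search gene 'MSTN[Gene Name] AND "Ovis aries"[Organism]'
```

When scripting many queries, keep the request rate low: NCBI asks users without an API key to stay at or below three requests per second.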

## 4. Downloading Reference Genomes

### From NCBI Assembly Database

**Example: Downloading the Sheep Reference Genome (Oar_v4.0)**

1. Go to [ncbi.nlm.nih.gov/assembly](https://www.ncbi.nlm.nih.gov/assembly)
2. Search: `"Ovis aries"[Organism] AND "reference genome"[Filter]`
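3. Click on the assembly (e.g., Oar_v4.0; the assembly currently flagged as the reference may be a newer one, in which case search for Oar_v4.0 by name)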
4. Click the "Download Assembly" button
5. Select file types:
   - `*_genomic.fna.gz` - FASTA sequence
   - `*_genomic.gff.gz` - Gene annotations
   - `*_genomic.gtf.gz` - GTF annotations

### Using Command Line (wget/curl)

```bash
# Download sheep reference genome
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.fna.gz

# Download corresponding annotation
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/298/735/GCF_000298735.2_Oar_v4.0/GCF_000298735.2_Oar_v4.0_genomic.gff.gz

# Decompress
gunzip GCF_000298735.2_Oar_v4.0_genomic.fna.gz
```
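
The long FTP paths above follow a regular pattern: the nine digits of the accession are split into three groups of three, followed by a directory named `<accession>_<assembly name>`. The sketch below rebuilds that path; `assembly_ftp_dir` is an illustrative helper name, and it assumes the assembly name contains no spaces.

```bash
# Build the NCBI genomes FTP directory for an assembly.
# assembly_ftp_dir is a hypothetical helper, not an NCBI tool.
# $1 = assembly accession, $2 = assembly name (assumed to have no spaces)
assembly_ftp_dir() {
    acc=$1
    name=$2
    prefix=${acc%%_*}          # GCF (RefSeq) or GCA (GenBank)
    digits=${acc#*_}           # strip the prefix ...
    digits=${digits%%.*}       # ... and the version, leaving nine digits
    p1=$(printf '%s' "$digits" | cut -c1-3)
    p2=$(printf '%s' "$digits" | cut -c4-6)
    p3=$(printf '%s' "$digits" | cut -c7-9)
    printf 'https://ftp.ncbi.nlm.nih.gov/genomes/all/%s/%s/%s/%s/%s_%s\n' \
        "$prefix" "$p1" "$p2" "$p3" "$acc" "$name"
}

dir=$(assembly_ftp_dir GCF_000298735.2 Oar_v4.0)
echo "${dir}/GCF_000298735.2_Oar_v4.0_genomic.fna.gz"
```

Each assembly directory also carries an `md5checksums.txt` file, which can be used with `md5sum` to confirm that a large download arrived intact.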

### Using NCBI Datasets Tool

NCBI provides a command-line tool called `datasets` for easier downloads:

```bash
# Install NCBI datasets (conda)
conda install -c conda-forge ncbi-datasets-cli

# Download genome by accession
datasets download genome accession GCF_000298735.2 --include genome,gff3

# Download by organism name
datasets download genome taxon "Ovis aries" --reference
```

## 5. Accessing the Sequence Read Archive (SRA)

The SRA stores raw sequencing data deposited by researchers, often alongside published studies. This is invaluable for:
- Replicating published analyses
- Meta-analyses across studies
- Training and testing pipelines

### Understanding SRA Accession Numbers

| Prefix | Level | Example | Description |
|--------|-------|---------|-------------|
| SRP/ERP/DRP | Study | SRP012345 | Entire project |
| SRS/ERS/DRS | Sample | SRS123456 | Biological sample |
| SRX/ERX/DRX | Experiment | SRX123456 | Library/experiment |
| SRR/ERR/DRR | Run | SRR1234567 | Actual sequencing run |

Our training FASTQ file is named after its SRA run accession, `SRR10532784`.
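
When a script receives a mixed list of accessions, it helps to check which level each one refers to before trying to download it. A minimal sketch based only on the prefixes in the table above (`sra_level` is an illustrative name):

```bash
# Report the SRA level of an accession from its three-letter prefix.
# sra_level is a hypothetical helper written for this session.
# Returns a non-zero status for anything that does not match.
sra_level() {
    case "$1" in
        [SED]RP[0-9]*) echo "Study" ;;
        [SED]RS[0-9]*) echo "Sample" ;;
        [SED]RX[0-9]*) echo "Experiment" ;;
        [SED]RR[0-9]*) echo "Run" ;;
        *) echo "Unknown"; return 1 ;;
    esac
}

sra_level SRR10532784   # prints: Run
```

Only run accessions point directly at a set of reads; the other levels group one or more runs.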

### Finding Data in SRA

**Example: Finding RNA-Seq data for sheep liver**

1. Go to [ncbi.nlm.nih.gov/sra](https://www.ncbi.nlm.nih.gov/sra)
2. Search: `"Ovis aries"[Organism] AND "RNA-Seq"[Strategy] AND liver[All Fields]`
3. Filter by:
   - Source: TRANSCRIPTOMIC
   - Platform: ILLUMINA
   - Access: Public

### Downloading SRA Data

```bash
# Install SRA Toolkit
conda install -c bioconda sra-tools

# Download and convert to FASTQ (recommended method)
fasterq-dump SRR10532784

# For paired-end data, this creates:
# SRR10532784_1.fastq (forward reads)
# SRR10532784_2.fastq (reverse reads)

# Download multiple runs
fasterq-dump SRR10532784 SRR10532785 SRR10532786

# Compress the output
gzip *.fastq
```
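
For large runs, a common pattern is to fetch the archive with `prefetch` first and convert it afterwards, so that an interrupted transfer does not force the conversion to start over. The sketch below wraps that pattern in a loop. `fetch_runs`, the `fastq` output directory, the thread count, and the `DRY_RUN` switch are all illustrative choices, not requirements of the SRA Toolkit.

```bash
# Fetch, convert, and compress one or more SRA runs.
# fetch_runs and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN=1 the commands are printed instead of executed.
fetch_runs() {
    run=""
    [ -n "${DRY_RUN:-}" ] && run="echo"
    for acc in "$@"; do
        # Download the run archive into ./<accession>/
        $run prefetch "$acc" || return 1
        # Convert to FASTQ files in ./fastq using 4 threads
        $run fasterq-dump --threads 4 --outdir fastq "$acc" || return 1
        # fasterq-dump writes uncompressed FASTQ, so compress afterwards
        $run gzip fastq/"$acc"*.fastq || return 1
    done
}

# Preview the commands without downloading anything:
DRY_RUN=1 fetch_runs SRR10532784
```

Running `fetch_runs SRR10532784` without `DRY_RUN` performs the actual download, which needs network access and enough disk space for both the archive and the FASTQ output.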

## 6. UCSC Genome Browser

The UCSC Genome Browser ([genome.ucsc.edu](https://genome.ucsc.edu)) is a powerful tool for visualising genomic data and downloading annotations.

### Key Features

- Interactive genome visualisation
- Multiple annotation tracks
- Custom track upload
- Table Browser for data export
- BLAT sequence search

### Navigating to a Region

**Example: Viewing the MSTN (myostatin) gene in sheep**

1. Go to [genome.ucsc.edu](https://genome.ucsc.edu)
2. Click "Genome Browser" or "Genomes"
3. Select assembly: Sheep (oviAri4)
4. In the search box, enter: `MSTN`
5. Click "Go"

### Position Format

The UCSC search box accepts several position formats:

```
chr2:118,171,687-118,180,018    # Chromosome 2, specific coordinates
chr2:118171687-118180018        # Without commas
MSTN                             # Gene name
```
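
Browser positions are 1-based and include both end points, whereas BED files use a 0-based start and an exclusive end. The sketch below converts a browser position into a BED line; `position_to_bed` is an illustrative helper name.

```bash
# Convert a UCSC browser position (1-based, inclusive) into a
# tab-separated BED line (0-based start, exclusive end).
# position_to_bed is a hypothetical helper written for this session.
position_to_bed() {
    pos=$(printf '%s' "$1" | tr -d ',')   # drop thousands separators
    chrom=${pos%%:*}                      # text before the colon
    range=${pos#*:}                       # start-end
    start=${range%%-*}
    end=${range#*-}
    printf '%s\t%s\t%s\n' "$chrom" "$((start - 1))" "$end"
}

position_to_bed "chr2:118,171,687-118,180,018"
# prints: chr2<TAB>118171686<TAB>118180018
```

In BED coordinates the region length is simply end minus start, here 118180018 − 118171686 = 8332 bp.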

## 7. UCSC Table Browser

The Table Browser allows you to download annotations and sequences in various formats.

### Accessing the Table Browser

1. Go to [genome.ucsc.edu/cgi-bin/hgTables](https://genome.ucsc.edu/cgi-bin/hgTables)
2. Select your assembly and track
3. Define region and output format
4. Download

### Download Examples

**Example: Download all RefSeq genes as a BED file**

- Assembly: Sheep oviAri4
- Group: Genes and Gene Predictions
- Track: NCBI RefSeq
- Table: refGene (if it is not listed for this track, choose the RefSeq table that is offered, e.g. ncbiRefSeq)
- Region: genome
- Output format: BED

### Command Line Downloads

```bash
# Download chromosome sizes for sheep
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.chrom.sizes

# Download 2bit genome file (compact format)
wget https://hgdownload.soe.ucsc.edu/goldenPath/oviAri4/bigZips/oviAri4.2bit
```
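
A `chrom.sizes` file is plain text with two tab-separated columns: sequence name and length in bases. A quick summary is a useful sanity check after downloading. The sketch below uses `awk`; `chrom_summary` is an illustrative helper name.

```bash
# Summarise a UCSC chrom.sizes file (name<TAB>length per line).
# chrom_summary is a hypothetical helper written for this session.
chrom_summary() {
    awk -F '\t' '
        { n++; total += $2 }
        END { printf "%d sequences, %.0f bp\n", n, total }
    ' "$1"
}

# Example, using the file downloaded above:
# chrom_summary oviAri4.chrom.sizes
#
# List the five longest sequences:
# sort -k2,2nr oviAri4.chrom.sizes | head -n 5
```

The `%.0f` format is used for the total because some `awk` implementations truncate large integers printed with `%d`.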

## 8. BLAT Sequence Search

BLAT (BLAST-Like Alignment Tool) quickly maps sequences to a genome. It is typically much faster than BLAST at locating near-identical sequences within a single genome, at the cost of sensitivity to more divergent matches.

### Using BLAT

1. Go to [genome.ucsc.edu/cgi-bin/hgBlat](https://genome.ucsc.edu/cgi-bin/hgBlat)
2. Select assembly
3. Paste your sequence
4. Click "Submit"

### Example: Finding a Primer Location

You have designed a PCR primer and want to verify its location:

```
Primer sequence: ATGCGATCGATCGATCGATCG
```

For each hit, BLAT reports:
- Chromosome and coordinates
- Strand orientation
- Alignment score
- Number of mismatches
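
If BLAT reports a hit on the minus strand, the genomic sequence at that position is the reverse complement of what you pasted. The sketch below prints the example primer and its reverse complement in FASTA format, ready to paste into the BLAT form; `revcomp` is an illustrative helper name.

```bash
# Reverse-complement a DNA sequence.
# revcomp is a hypothetical helper: awk reverses the string,
# then tr swaps each base for its complement.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTacgt' 'TGCAtgca'
}

primer="ATGCGATCGATCGATCGATCG"
printf '>primer_fwd\n%s\n' "$primer"
printf '>primer_revcomp\n%s\n' "$(revcomp "$primer")"
# The second record is CGATCGATCGATCGATCGCAT
```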

### When to Use BLAT vs BLAST

| Use BLAT | Use BLAST |
|----------|----------|
| Finding known sequence in genome | Finding similar sequences |
| Mapping primers or probes | Identifying homologues |
| Quick lookups | Sensitive searches |
| Same species | Cross-species |

## 9. Hands-On Exercise

### Exercise: Tracing the Origin of Our Training Data

Our training datasets include sheep genomic data. Let's trace where they came from:

**Task 1: Find information about SRR10532784**

1. Go to NCBI SRA: [ncbi.nlm.nih.gov/sra](https://www.ncbi.nlm.nih.gov/sra)
2. Search: `SRR10532784`
3. Answer these questions:
   - What organism is this from?
   - What type of sequencing (RNA-Seq, WGS, etc.)?
   - What tissue/sample type?
   - What sequencing platform was used?

**Task 2: Find the sheep reference genome**

1. Go to NCBI Assembly: [ncbi.nlm.nih.gov/assembly](https://www.ncbi.nlm.nih.gov/assembly)
2. Search: `"Ovis aries"[Organism]`
3. Find the current reference genome
4. Note the assembly accession (GCF_...)

**Task 3: Explore the VCF data context**

Our sheep VCF contains SNP data. Use bcftools to examine it:

```bash
# How many chromosomes have variants?
bcftools view -H training_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u | wc -l

# What chromosomes are represented?
bcftools view -H training_datasets/sheep.snp.vcf.gz | cut -f1 | sort -u

# How many samples were sequenced?
bcftools query -l training_datasets/sheep.snp.vcf.gz | wc -l
```
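
As an optional extension, the same pipeline style can report how many variants fall on each chromosome. The sketch below reads header-less VCF lines from standard input, so it can sit at the end of a `bcftools view -H` pipe; `count_per_chrom` is an illustrative helper name and `VCF` is a placeholder variable for the path to your file.

```bash
# Count VCF records per chromosome.
# count_per_chrom is a hypothetical helper written for this exercise.
# Input: header-less VCF lines on stdin. Output: chromosome<TAB>count.
count_per_chrom() {
    cut -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Example (VCF is a placeholder for the path to the sheep VCF):
# bcftools view -H "$VCF" | count_per_chrom
```

Chromosomes with unexpectedly few variants are worth a second look, since they can point to filtering or coverage differences.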

## 10. Summary

In this session, we covered:

1. **NCBI Databases**: GenBank, RefSeq, SRA, Gene, Assembly
2. **Search Strategies**: Using filters and Boolean operators
3. **Downloading Data**: Web interface and command-line tools
4. **UCSC Genome Browser**: Navigation and visualisation
5. **Table Browser**: Exporting annotations and sequences
6. **BLAT**: Quick sequence mapping

### Quick Reference

| Task | Resource | Tool/Method |
|------|----------|-------------|
| Download reference genome | NCBI | datasets, wget |
| Download raw reads | SRA | fasterq-dump |
| View genomic region | UCSC | Genome Browser |
| Export annotations | UCSC | Table Browser |
| Map sequence to genome | UCSC | BLAT |

## Next Session

In the next session, we will learn how to organise bioinformatics projects for reproducibility and collaboration.

## Additional Resources

- [NCBI Education Resources](https://www.ncbi.nlm.nih.gov/home/learn/)
- [UCSC Genome Browser User Guide](https://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html)
- [SRA Handbook](https://www.ncbi.nlm.nih.gov/sra/docs/)
- [NCBI Datasets Documentation](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/)