# Prepare genome and annotations

## FASTA files

1. Make a new directory called `genome`
2. Navigate to `genome/`
3. Download the reference genome file, which will be compressed as `.gz`
    1. The `-O` option saves the incoming file under a new file name, which we call `genome.fa.gz`
    2. Check out details and info about the genome at [WormBase ParaSite](https://parasite.wormbase.org/Schistosoma_mansoni_prjea36577/Info/Index/)
4. Decompress the reference

In [1]:
!mkdir genome
%cd genome
!wget -nc -O genome.fa.gz https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz
!gzip -d -f genome.fa.gz

/data/users/heky1803/BIOL343/2_genome_exploration/genome
--2025-09-16 16:06:40--  https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.genomic.fa.gz
Resolving ftp.ebi.ac.uk... 193.62.193.165
Connecting to ftp.ebi.ac.uk|193.62.193.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 116797085 (111M) [application/x-gzip]
Saving to: ‘genome.fa.gz’


2025-09-16 16:06:47 (17.1 MB/s) - ‘genome.fa.gz’ saved [116797085/116797085]



Whenever we run a command line tool, you are expected to define the options, arguments, and/or flags used and explain ***why*** each one is used. You can find these definitions in the manual, which is typically accessed by running the command followed by `--help`. Use the manual to define the meaning and purpose of each flag; put your answers below:

***wget***

`-nc`:

`-O`:

***gzip***

`-d`:

`-f`:

Sequence data are usually stored in FASTA files (`.fasta` or `.fa`), which can hold nucleotide or amino acid sequences. An explanation of the FASTA format can be found [here](https://zhanggroup.org/FASTA/); take a moment to read it. You will use what you learn, along with a few command line tools, to explore the genome that we just downloaded.

## Genome exploration

### `head` and `tail`
`head` and `tail` are used to glance at the beginning or end of a file, respectively. They are simple tools - usually the only modifier is `-#`, where `#` is an integer that dictates how many lines you want to be printed.
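
As a minimal sketch of the line-count modifier, the snippet below builds a small throwaway file (the name `toy_lines.txt` and its contents are made up for illustration) and prints its first three and last two lines. It uses the `-n 3` spelling, which behaves like the `-3` shorthand described above.

```shell
# Build a six-line toy file in a temporary directory (hypothetical data).
workdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\nline6\n' > "$workdir/toy_lines.txt"

# First three lines of the file.
first_three=$(head -n 3 "$workdir/toy_lines.txt")
echo "$first_three"

# Last two lines of the file.
last_two=$(tail -n 2 "$workdir/toy_lines.txt")
echo "$last_two"
```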


In [2]:
!head genome.fa

>SM_V10_1 length=87984048
TGATAGTTAGTCATATGAAAGCATCATTAGTAAACCACATTGCTTATTATATTGAACAGT
TACATCTGGCTTATTATACAAAGAGAAAACCATACTATTCATACTATTCTCTTTTTGATC
TTCTCAATCTTCTGTTGTTAGATATTCTATTCCTTGCTCACCATATATACTACTTATGTC
AATATAAGTAGCTCACACCACACTACTACTTCAACTACTACTACTACTGCTACTGCTACT
GTTGTGAACAGAACACGACTGTTGGACAATCGAATCTAATTAAGCAAACATTACAAACTA
TCTCAGTAAATTGTAATAACCATACAATGGACAGTTAATTTGCAAATTATCAACCAATAG
TCTCCATGTTCACTGTTTGTTCTTTAATTCCATTGCTTAAAGTTACTTGGTAAAATGCAA
AAACCATACTATGATACTCTATTTGCAAAATGTTCTTCAATTGTCTCAGGCTTCATTGTT
CCTGGATTTCCACACAAATTGCTTTTTATTTCTGTTCTTCTCTTCTTGATCTTCTCAATC


In [3]:
!tail genome.fa

TTGGTTTCTTAGTGTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTATAGTC
CCTTGGTTTCTTAGTGTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTATAG
TCCCTTGGTTTCTTAGTGTTATAGCCCATACTCCTTTAGTCTTTTAGTATTATCGTCTAT
AGTACGGTAGGTGGGTAAGGTAGAAAATGTTGTTTGTTTGATTCTGTATTTCGTGCAGAT
AAGATGTTTGTAGTCTCTACTTGGCAGTGGTAGAAGTGTTTAACTTGATGAAGGGGATAG
GTGTATGTTCTGTCCTTTGTTTTTGAATAGTGGTTTCGGTTTTGTTTTTTTTTTTGGTGG
GGGTTAAAGTATAGGATTAAGTTAATTTAATGGTAAGTAAAATGATTTCCGAAAAAAGAC
CTAAATTTGTGTTATATATATAATATATACAATTATAATATAGAAGGAGAAAAGATGTAA
AAATAGGATTTAGGGAGGAGGAAAATTTATAGGTTTTGATAATAAATTTTTCTTGTAAGG
GGGTACCCTTACAGAATTTTTGGGGTAGTGGTTGGAT



### `grep`
`grep` is a command line tool that is often used to inspect files, especially files that include sequences (e.g., FASTA and FASTQ, which we'll see later).

View the `grep` manual and then use it to inspect the genome file in the command line. These files are often hundreds of megabytes to gigabytes, so we tend not to open the entire file in a text editor but instead use command line tools to answer our questions.

In [4]:
!grep --help

Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE.
Example: grep -i 'hello world' menu.h main.c

Pattern selection and interpretation:
  -E, --extended-regexp     PATTERN is an extended regular expression
  -F, --fixed-strings       PATTERN is a set of newline-separated strings
  -G, --basic-regexp        PATTERN is a basic regular expression (default)
  -P, --perl-regexp         PATTERN is a Perl regular expression
  -e, --regexp=PATTERN      use PATTERN for matching
  -f, --file=FILE           obtain PATTERN from FILE
  -i, --ignore-case         ignore case distinctions
  -w, --word-regexp         force PATTERN to match only whole words
  -x, --line-regexp         force PATTERN to match only whole lines
  -z, --null-data           a data line ends in 0 byte, not newline

Miscellaneous:
  -s, --no-messages         suppress error messages
  -v, --invert-match        select non-matching lines
  -V, --version             display version information and exit
      

Now that you've looked at the manual, complete the following tasks. Put the correct answer below each bullet.

- Count the number of contigs/chromosomes using a single `grep` command. The command should return an integer.
- List the name of each contig/chromosome, which is included in the header of each sequence. Your `grep` command should return the entire header of each entry, with a new line separating entries.

The answers to the above questions should be printed in the output derived from the below code block:

In [6]:
!grep -c ">" genome.fa

10


This assembly is full-length (an impressive achievement, which you'll learn more about next semester 🤯) and has been assembled into 7 autosomes, two sex chromosomes (Z and W), and a mitochondrial genome, which accounts for the 10 sequences counted above. Every single nucleotide of the entire genome is contained in this single file. It's kind of amazing that decades of work can culminate in a file that downloads as only ~111 MB compressed...
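
The tally above comes from counting header lines. A minimal sketch of that counting step on a toy FASTA is shown below; the sequence names and bases are invented for illustration, and anchoring the pattern with `^` is an optional extra that restricts matches to lines that begin with `>`.

```shell
# Toy FASTA with three records (sequence names and bases are invented).
workdir=$(mktemp -d)
printf '>chrA length=8\nACGTACGT\n>chrB length=4\nTTTT\n>chrM length=4\nGGCC\n' > "$workdir/toy.fa"

# Count header lines; '^' anchors the match to the start of the line.
n_records=$(grep -c '^>' "$workdir/toy.fa")
echo "$n_records"
```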

## Genome annotations
### GTF files
Typically, we're not just interested in the sequences but in what the sequences represent, like genes, coding sequences, promoters, exons, etc. The elements that a genome contains are called "annotations," and these are stored in annotation files. Genome annotation files are tab-separated files that include coordinate information - that is, the chromosome and nucleotide location - for each annotation. For example, the annotations include the start/stop location of every single gene, mRNA, and exon. Annotations are usually in GTF or GFF format; GTF is more common and preferred by many programs. [Here's the definition](http://mblab.wustl.edu/GTF22.html) for the GTF file format. Take a few minutes to read through the information.

Let's get the annotations and decompress them.

- This time we'll pipe the commands, so we have to redirect the download to standard out (`-O -`).
- The pipe operator `|` allows you to run a command using standard output from the previous command as the input for the ensuing command.
- To write a single command over multiple lines, end each line except the last with a backslash (`\`).
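
The three ideas above can be tried offline with a small sketch. Here `gzip -c` on a toy line stands in for the download written to standard out (an assumption made so that no network access is needed), and the file name `toy_annotations.gtf` is hypothetical.

```shell
# Stand-in for the download: compress a toy line and write it to standard output,
# then pipe it to a second command that decompresses it into a file.
workdir=$(mktemp -d)
printf 'chr1\ttoy\tgene\t100\t500\n' | gzip -c | \
    gzip -d > "$workdir/toy_annotations.gtf"

# The decompressed text ends up in the redirected file.
decompressed=$(cat "$workdir/toy_annotations.gtf")
echo "$decompressed"
```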

In [7]:
!wget -O - https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.canonical_geneset.gtf.gz | \
    gzip -f -d > annotations.gtf

--2025-09-16 16:20:53--  https://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS19/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS19.canonical_geneset.gtf.gz
Resolving ftp.ebi.ac.uk... 193.62.193.165
Connecting to ftp.ebi.ac.uk|193.62.193.165|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3228084 (3.1M) [application/x-gzip]
Saving to: ‘STDOUT’


2025-09-16 16:20:54 (3.54 MB/s) - written to stdout [3228084/3228084]



Let's check out what the annotations look like using the `head` and `tail` commands.

In [8]:
!head -10 annotations.gtf

#!genebuild-version 2022-11-WormBase
SM_V10_1	WormBase	gene	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; gene_source "WormBase"; gene_biotype "protein_coding";
SM_V10_1	WormBase	transcript	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
SM_V10_1	WormBase	exon	68427	68783	.	-	.	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; exon_id "Smp_329140.1.e1"; tag "Ensembl_canonical";
SM_V10_1	WormBase	CDS	68596	68763	.	-	0	gene_id "Smp_329140"; gene_version "1"; transcript_id "Smp_329140.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "

In [9]:
!tail -10 annotations.gtf

SM_V10_MITO	WormBase	CDS	13064	13348	.	+	0	gene_id "Smp_900100"; gene_version "1"; transcript_id "Smp_900100.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; protein_id "Smp_900100.1"; tag "Ensembl_canonical";
SM_V10_MITO	WormBase	start_codon	13064	13066	.	+	0	gene_id "Smp_900100"; gene_version "1"; transcript_id "Smp_900100.1"; exon_number "1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
SM_V10_MITO	WormBase	gene	13537	14415	.	+	.	gene_id "Smp_900110"; gene_version "1"; gene_source "WormBase"; gene_biotype "protein_coding";
SM_V10_MITO	WormBase	transcript	13537	14415	.	+	.	gene_id "Smp_900110"; gene_version "1"; transcript_id "Smp_900110.1"; gene_source "WormBase"; gene_biotype "protein_coding"; transcript_source "WormBase"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
SM_V1

Each line is a different annotation. In the first and last 10 lines we see gene, exon, CDS, start codon, stop codon, 5' UTR, and 3' UTR. The columns (technically "fields") of a GTF file are described at the link provided above. Provide definitions for each of these annotations and answer the last question:

- gene:
- exon:
- CDS:
- start codon:
- stop codon:
- 5' UTR:
- 3' UTR:
- Why isn't there an "intron" annotation?

Fields are tab-separated. `grep` can again be used to search for annotations of interest, but this time we have to use a few regular expressions. Regular expressions ("regexp") are sequences of characters ("expressions") that can be used to search for patterns in the input text. The language Perl is (in)famous for enabling powerful regular expression searches, and its regular expression syntax has been integrated into `grep`. [Here](https://www.cheat-sheets.org/saved-copy/perl-regexp-refcard-a4.pdf) is a cheat sheet for regular expressions in Perl.
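
As a rough sketch of how a pattern can be tied to a whole field, the snippet below runs two searches on an invented, shortened annotation file. It assumes a GNU `grep` built with Perl-regexp support (the `-P` option listed in the help text earlier); the records and the file name `toy.gtf` are hypothetical.

```shell
# Toy tab-separated annotations (hypothetical records, trimmed to a few fields).
workdir=$(mktemp -d)
printf 'chr1\ttoy\tgene\t100\t500\tgene_id "g1";\n' > "$workdir/toy.gtf"
printf 'chr1\ttoy\texon\t100\t200\tgene_id "g1"; exon_number "1";\n' >> "$workdir/toy.gtf"
printf 'chr1\ttoy\tCDS\t120\t200\tgene_id "g1"; exon_number "1";\n' >> "$workdir/toy.gtf"

# Plain substring search: counts every line containing "exon" anywhere.
n_substring=$(grep -c 'exon' "$workdir/toy.gtf")

# Perl-style pattern: "exon" must sit between two tab characters.
n_field=$(grep -c -P '\texon\t' "$workdir/toy.gtf")

echo "$n_substring $n_field"
```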

For example, below are a few lines that one might use to count the number of transcripts in the GTF:

In [10]:
!grep -c "transcript" annotations.gtf
!grep -c -P "\ttranscript\t" annotations.gtf

231617
10960


Define the new `grep` flag and define the regular expression:

- `-P`: interprets the pattern as a Perl regular expression

- `\t`: a tab

- Explain the difference between the two commands in the above code block and why they resulted in different values.

We can use a similar idea to count the number of genes on chromosome 1 of the genome.

In [None]:
!grep -c -P "SM_V10_1.*\tgene\t" annotations.gtf

Define the regexp:

- `.*`:

### `cut`
There are a few command line tools used to explore delimited files (`cut`, `awk`, etc.). These tools allow you to slice up delimited files into different rows and columns (technically "fields"), and they can be piped with other tools to create powerful pipelines.

Check out the help page for `cut`:

In [None]:
!cut --help

For example, to get only the first field of the GTF:

In [None]:
!cut -f 1 annotations.gtf | head

You can combine `grep` and `cut` to extract fields from specific lines. For instance, suppose you're interested in the gene called Smp_104210, which encodes an opsin protein (the receptor that detects photons in eyes or eye-spots). Searching for a gene ID with `grep` shows you all the lines that contain it. The command below demonstrates this with a different gene ID, Smp_245390:

In [5]:
!grep 'Smp_245390' annotations.gtf

SM_V10_1	curated	gene	7098303	7098851	.	-	.	gene_id "Smp_245390"; gene_version "1"; gene_source "curated"; gene_biotype "protein_coding";
SM_V10_1	curated	transcript	7098303	7098851	.	-	.	gene_id "Smp_245390"; gene_version "1"; transcript_id "Smp_245390.1"; gene_source "curated"; gene_biotype "protein_coding"; transcript_source "curated"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
SM_V10_1	curated	exon	7098622	7098851	.	-	.	gene_id "Smp_245390"; gene_version "1"; transcript_id "Smp_245390.1"; exon_number "1"; gene_source "curated"; gene_biotype "protein_coding"; transcript_source "curated"; transcript_biotype "protein_coding"; exon_id "Smp_245390.1.e1"; tag "Ensembl_canonical";
SM_V10_1	curated	CDS	7098622	7098851	.	-	0	gene_id "Smp_245390"; gene_version "1"; transcript_id "Smp_245390.1"; exon_number "1"; gene_source "curated"; gene_biotype "protein_coding"; transcript_source "curated"; transcript_biotype "protein_coding"; protein_id "Smp_245390.1"; tag "Ensembl_cano

Now suppose you wanted the start position of the transcript associated with Smp_104210. We know from the GTF description that the 3rd field contains the feature type and the 4th field contains the start location. `cut` allows you to parse each field of a delimited file. `grep` extracts lines containing the search term, which can then be piped to `cut` to extract the feature type and start position, and then `grep` can again be used to keep only the transcript features.
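
The pipeline described above can be sketched on a toy file before running it on the real annotations. The gene IDs `g1` and `g2`, the coordinates, and the file name `toy.gtf` are all invented for illustration.

```shell
# Toy GTF-like file: nine tab-separated fields per line (hypothetical records).
workdir=$(mktemp -d)
{
  printf 'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n'
  printf 'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n'
  printf 'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "g1.1";\n'
  printf 'chr1\ttoy\tgene\t900\t950\t.\t-\t.\tgene_id "g2";\n'
} > "$workdir/toy.gtf"

# 1) keep lines mentioning g1, 2) keep fields 3 and 4, 3) keep the transcript row.
transcript_start=$(grep 'g1' "$workdir/toy.gtf" | cut -f 3,4 | grep 'transcript')
echo "$transcript_start"
```

Because `cut` has already dropped the attribute column, the final `grep` only sees the feature type and start position, so the `transcript_id` text on other rows can no longer cause a match.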

In [1]:
!grep 'Smp_104210' annotations.gtf | cut -f 3,4 | grep 'transcript'



Using the features of Smp_104210, answer the following questions:

- What's the difference between the transcript start location (which you found above) and the start codon?
- How many exons are in Smp_104210?
- What's the nucleotide length of the entire transcript?
- What's the nucleotide length of the coding sequence?
- How many amino acids are in the Smp_104210 peptide?

### `sort` and `uniq`

Not all GTFs include the same features/annotations, and other types of annotation files (e.g., GFF3) are often much more comprehensive. When working with a new genome and its annotations, it's a good idea to make sure you know what types of features are included. You could open up the file and scroll through it, but they tend to be quite large. A better way is to use command line tools to explore the file.

`sort` is a tool that explains itself - it sorts lines. After sorting, `uniq` can be used to remove duplicate lines. In an unmodified GTF, there are unlikely to be any duplicate lines - but there are many duplicates within separate fields of the GTF. As we know, the 3rd column includes the feature type, and we've seen several of these features using `head` and `tail`, but have we seen all of them? Using `cut`, `sort`, and `uniq`, we can use the command line to view them all. First, check out the `sort` and `uniq` man pages.
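
A small sketch of why the sorting step comes first: `uniq` only collapses duplicates that sit on adjacent lines. The feature list below is invented for illustration, and the file name `features.txt` is hypothetical.

```shell
# Toy list of feature types, one per line (hypothetical values).
workdir=$(mktemp -d)
printf 'exon\ngene\nexon\nCDS\ngene\n' > "$workdir/features.txt"

# Without sorting, no duplicates are adjacent, so uniq leaves all five lines.
n_without_sort=$(uniq "$workdir/features.txt" | wc -l | tr -d ' ')

# Sorting first places identical values next to each other.
unique_features=$(sort "$workdir/features.txt" | uniq)
n_with_sort=$(printf '%s\n' "$unique_features" | wc -l | tr -d ' ')

echo "$n_without_sort $n_with_sort"
```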

In [None]:
!sort --help

In [None]:
!uniq --help

Now, pipe `cut`, `sort`, and `uniq` to list all the features. Use what you found in the man pages to provide the number of occurrences of each feature. Each feature and its count should be printed in the output of the code block below:

Answer the following questions:

- Why are there more transcripts than there are genes?

# Genome and annotation visualization

## JBrowse2

FASTA files and GTF files are foundational, but even with command line tools it is difficult to inspect and explore these files, especially if we're interested in specific features. Thankfully, there are a range of different "genome browsers" that allow you to load genome sequences, annotations, and other "tracks" and use the mouse to interact with them.

There are a few popular genome browsers, but my favorite is [JBrowse2](https://jbrowse.org/jb2/). This browser is actively developed, modern, and works on all types of computers. Navigate to the JBrowse2 link and follow the instructions to download and install the app. After that, run the following commands to generate an index of the genome FASTA file. Indices are common auxiliary files in bioinformatics; they allow tools/apps to quickly access large files based on location ("random access") rather than starting at the beginning/end and searching for the location of interest. As you will see, we will use JBrowse2 to very quickly hop around the genome, which is enabled by the index file. We use `samtools` to generate this index. We'll become much more familiar with this program later.

In [3]:
!samtools faidx genome.fa
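
The index that `samtools faidx` writes (`genome.fa.fai`) is a small tab-separated table; its first five columns record each sequence's name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below is not samtools itself. It uses `awk` on an invented toy FASTA to build a table with the same five columns, then uses the recorded offset to jump straight to one sequence, which is the idea behind random access.

```shell
# Toy FASTA (invented names and bases); chrA is wrapped at 6 bases per line.
workdir=$(mktemp -d)
printf '>chrA\nACGTAC\nGT\n>chrB\nTTTT\n' > "$workdir/toy.fa"

# Build a .fai-style table: name, length, byte offset of the first base,
# bases per line, bytes per line (bases plus the newline).
index=$(awk -v OFS='\t' '
  /^>/ { if (name != "") print name, len, off, bases, width
         name = substr($1, 2); len = 0; bases = 0
         pos += length($0) + 1; off = pos; next }
       { if (bases == 0) { bases = length($0); width = bases + 1 }
         len += length($0); pos += length($0) + 1 }
  END  { if (name != "") print name, len, off, bases, width }
' "$workdir/toy.fa")
echo "$index"

# Random access: jump straight to chrB using its recorded offset.
chrB_offset=$(printf '%s\n' "$index" | awk -F '\t' '$1 == "chrB" { print $3 }')
chrB_seq=$(tail -c +"$((chrB_offset + 1))" "$workdir/toy.fa" | head -c 4)
echo "$chrB_seq"
```

Reading chrB this way never scans the chrA bases, which is why a browser can hop between distant locations in a large genome quickly.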

The rest of this section will proceed locally, on your actual computer rather than on the server.

You now need to download three files from the server:

1. `genome.fa` - the FASTA file that contains the entire genomic sequence
2. `genome.fa.fai` - the index for the genome
3. `annotations.gtf` - the annotations file

Right-click on the files in the Explorer pane of VS Code and click "Download..." It may be useful to put these files in a new folder somewhere on your computer.

The rest of the lesson will be shown live during class.