<div>
<img src="https://www.nasa.gov/wp-content/uploads/2024/07/osdr-gl4hs-logo.png" width="600"/>
</div>

# **NOTEBOOK 3: Building a reference genome index**


In this notebook you will build a STAR index for a single reference chromosome (mouse chromosome 17, GRCm39). A real analysis would index the entire genome; one chromosome keeps this hands-on demonstration small.

## **Objectives of this notebook**
The primary objective of this notebook is to build a reference genome index for chromosome 17. Like any index, a reference genome index lets the aligner look up where a read might match without scanning the whole sequence, which speeds up the mapping/alignment step. In subsequent notebooks, you will use this index to align the sequence records from the FASTQ files to this reference genome. You can learn more about reference genomes in this [Wikipedia article](https://en.wikipedia.org/wiki/Reference_genome).
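
To make the idea of an index concrete, here is a minimal sketch of a suffix array, the kind of structure the STAR log later in this notebook reports building. It is a toy illustration under stated assumptions, not STAR's implementation: the function names `build_suffix_array` and `find_occurrences` and the short example sequence are made up for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, sorted alphabetically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted 0-based positions where pattern occurs, using binary search."""
    k = len(pattern)

    def prefix(rank):
        # first k characters of the suffix at this rank in the sorted order
        start = suffix_array[rank]
        return text[start:start + k]

    # find the first suffix whose prefix is >= pattern
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # find the first suffix whose prefix is > pattern
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


reference = "GATTACAGATTACA"  # made-up toy sequence, not real genome data
index = build_suffix_array(reference)
print(find_occurrences(reference, index, "TTA"))  # prints [2, 9]
```

The sorting is done once, up front. After that, each lookup needs only a handful of comparisons instead of a scan over the whole sequence, which is the trade-off an aligner's index makes.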

## **UNIX commands introduced in this notebook**

[`tail`](https://man7.org/linux/man-pages/man1/tail.1.html) command to see the last n lines of a file.

[`mkdir`](https://man7.org/linux/man-pages/man1/mkdir.1.html) command to make a directory.
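
For comparison, the same two operations can be sketched in plain Python. This is only an illustration: the helper name `tail` and the temporary file paths are made up for the example, and the notebook itself uses the shell commands.

```python
import os
import tempfile
from collections import deque


def tail(path, n=10):
    """Return the last n lines of a text file, similar to `tail -n`."""
    with open(path) as handle:
        # a deque with maxlen keeps only the most recent n lines
        return list(deque(handle, maxlen=n))


# demonstrate in a throwaway temporary directory (hypothetical paths)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "demo", "nested")
    os.makedirs(demo_dir, exist_ok=True)  # similar to `mkdir -p`
    demo_file = os.path.join(demo_dir, "numbers.txt")
    with open(demo_file, "w") as handle:
        for i in range(1, 26):
            handle.write(f"line {i}\n")
    last_three = tail(demo_file, n=3)
    print(last_three)  # prints ['line 23\n', 'line 24\n', 'line 25\n']
```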

# Prepare runtime environment

In [1]:
# mount google drive for notebook
from google.colab import drive
drive.flush_and_unmount()
drive.mount("mnt")


Drive not mounted, so nothing to flush and unmount.
Mounted at mnt


In [2]:
# time the notebook
import datetime
start_time = datetime.datetime.now()
print('notebook start time: ', start_time.strftime('%Y-%m-%d %H:%M:%S'))

notebook start time:  2025-07-14 19:18:37


In [3]:
# define FASTQ_DIR (directory)
import os
FASTQ_DIR="/content/mnt/MyDrive/NASA/GL4HS/FASTQ"
if not os.path.exists(FASTQ_DIR):
  raise Exception("STOP! You haven't completed the previous notebooks yet")

In [4]:
# create the directory that will hold the reference genome
import os
REFERENCE_DIR='/content/mnt/MyDrive/NASA/GL4HS/REFERENCE'
if not os.path.exists(REFERENCE_DIR):
  !mkdir {REFERENCE_DIR}


In [5]:
# create the directory that will hold the STAR software
import os
STAR_DIR='/content/mnt/MyDrive/NASA/GL4HS/STAR'
if not os.path.exists(STAR_DIR):
  !mkdir {STAR_DIR}

In [6]:
# download and install STAR (executable for alignment and building the reference genome index)
# the check must point at the extracted STAR-2.7.11b folder, otherwise STAR is downloaded on every run
if not os.path.exists(f"{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR"):
  !wget -O {STAR_DIR}/STAR.tar.gz https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
  !tar -xzf {STAR_DIR}/STAR.tar.gz -C {STAR_DIR}
  # remove the compressed tar file (only exists if it was just downloaded)
  !rm {STAR_DIR}/STAR.tar.gz

# make sure the STAR binary is executable
!chmod +x {STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR

--2025-07-14 19:18:38--  https://github.com/alexdobin/STAR/archive/2.7.11b.tar.gz
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/alexdobin/STAR/tar.gz/refs/tags/2.7.11b [following]
--2025-07-14 19:18:38--  https://codeload.github.com/alexdobin/STAR/tar.gz/refs/tags/2.7.11b
Resolving codeload.github.com (codeload.github.com)... 140.82.114.10
Connecting to codeload.github.com (codeload.github.com)|140.82.114.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘/content/mnt/MyDrive/NASA/GL4HS/STAR/STAR.tar.gz’

/content/mnt/MyDriv     [        <=>         ]  11.89M  7.17MB/s    in 1.7s    

2025-07-14 19:18:41 (7.17 MB/s) - ‘/content/mnt/MyDrive/NASA/GL4HS/STAR/STAR.tar.gz’ saved [12466670]



In [7]:
# check version of STAR
!{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR --version

2.7.11b


In [8]:
# download the GRCm39 reference for chromosome 17 (a real analysis would use the entire genome)
# gunzip -c decompresses to standard output, which is redirected to a .fa file in the reference directory
import os
if not os.path.exists(f"{REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz"):
  !wget -O {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz https://ftp.ensembl.org/pub/release-113/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz
# decompress separately so a missing .fa file is rebuilt even if the .gz is already present
if not os.path.exists(f"{REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa"):
  !gunzip -c {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz > {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa


--2025-07-14 19:18:45--  https://ftp.ensembl.org/pub/release-113/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.169
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.169|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 27906831 (27M) [application/x-gzip]
Saving to: ‘/content/mnt/MyDrive/NASA/GL4HS/REFERENCE/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz’


2025-07-14 19:18:48 (12.5 MB/s) - ‘/content/mnt/MyDrive/NASA/GL4HS/REFERENCE/Mus_musculus.GRCm39.dna.chromosome.17.fa.gz’ saved [27906831/27906831]



In [9]:
# look at the first 10 lines of the reference fasta file
!head {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

>17 dna:chromosome chromosome:GRCm39:17:1:95294699:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


In [10]:
# look at 11 lines (100000 through 100010, inclusive) in the middle of the reference fasta file
!sed -n '100000,100010 p' {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

ATGTGTGTGTATATGTGTGCGTGCGCGTGTGTGTGTGCACGTGTGTGTGTATGTGTGCGC
GCATGCGTGTGTGTGCGTGCGCGCGCGTGTGTATGTGTGCGTGTGTGCACGCATGTGTGT
ATGTGCGCGCGCGTGCGTGTGCGCGTGCGCGCGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTAAAGTACTGAGGCTTTACTTTATCTACTACCTTGGGAGAGGA
GGCTTAATTAGAGCTCTGCCTCCCTAAACCATTCTTCCCCCAGGAAAAGTTACTTAATCC
TCCTACAGTTCAGAATCAGGACGCAAAGGATTACAACAAACATGGCTTCTCCTATCATGT
GAATCCTTTCTTTTTTTTTTTTTAGGATTTATTTATTTTATTTATATGAGTACACTGTAG
CTGCCTTTAGACACCCCAGAAGAGGGCATCAGATCCCATTACAGATGGTTGCGAGCCACC
ATGTGGTTGCTGGGAATTGAACTCAGGACCTCTGGAAGAGCAGTCAGTGCCCTTAACCAC
TGAGCCATCTCTCCAGTTCAAGAATCCTCAAGAATTTATTTCTGTGTATGTTTGTGCGAG
TGAGTGCCATTTGTGTGCGGGTACCCTGAGGTCAAAAGAAGGCATCAGATCCTCTGGAGA


In [11]:
# look at the last 10 lines of the reference fasta file
!tail {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


Read [this discussion thread](https://www.reddit.com/r/genetics/comments/rz47pq/why_there_is_a_lot_of_ns_at_the_begining_of_the/) to see why a reference genome FASTA file may begin and end with long runs of 'N' characters.
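
If you want to measure those runs rather than eyeball them, a small Python sketch like the one below could help. The helper names `read_fasta_sequence` and `n_summary` and the toy record are made up for illustration, and the sketch reads only the first record of a FASTA file.

```python
import io


def read_fasta_sequence(handle):
    """Return (header, sequence) for the first record in a FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                break  # stop when a second record begins
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    return header, "".join(chunks)


def n_summary(sequence):
    """Count leading, trailing and total N characters in a sequence."""
    upper = sequence.upper()
    leading = len(upper) - len(upper.lstrip("N"))
    trailing = len(upper) - len(upper.rstrip("N"))
    return {
        "length": len(upper),
        "leading_N": leading,
        "trailing_N": trailing,
        "total_N": upper.count("N"),
    }


# toy record (made up) standing in for the real chromosome file
toy_fasta = io.StringIO(">toy made-up record\nNNNNNNACGT\nACNNGTACNN\nNNNN\n")
header, sequence = read_fasta_sequence(toy_fasta)
summary = n_summary(sequence)
print(header, summary)
```

To try it on the real reference, pass an open file handle for the `.fa` file in `REFERENCE_DIR` instead of the `io.StringIO` object. The whole chromosome is held in memory, which should be manageable for a single chromosome.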

# Run STAR to build reference chromosome index

Read [this discussion thread](https://www.biostars.org/p/251736/) and search the [documentation](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf) to learn more about the `--genomeChrBinNbits` option of the STAR command.
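
The command in the next cell also sets `--genomeSAindexNbases`. For small genomes, the STAR manual suggests scaling this value down to min(14, log2(GenomeLength)/2 - 1). The sketch below applies that rule to the chromosome 17 length reported in the FASTA header earlier in this notebook; the function name `suggested_sa_index_nbases` is hypothetical, and rounding down to a whole number is an assumption of this sketch.

```python
import math


def suggested_sa_index_nbases(genome_length):
    """Apply the STAR manual's small-genome rule: min(14, log2(length)/2 - 1)."""
    # rounding down to an integer is an assumption made for this sketch
    return min(14, int(math.log2(genome_length) / 2 - 1))


chr17_length = 95294699  # taken from the FASTA header line shown earlier
print(suggested_sa_index_nbases(chr17_length))  # prints 12
```

The result agrees with the value of 12 passed to STAR in this notebook.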

In [12]:
# run STAR to create index of GRCm39 chr 17 reference
if not os.path.exists(REFERENCE_DIR + '/MM39_CHR17'):
  !mkdir -p {REFERENCE_DIR}/MM39_CHR17
!{STAR_DIR}/STAR-2.7.11b/bin/Linux_x86_64_static/STAR \
        --runThreadN 2 \
        --runMode genomeGenerate \
        --genomeDir {REFERENCE_DIR}/MM39_CHR17 \
        --genomeFastaFiles {REFERENCE_DIR}/Mus_musculus.GRCm39.dna.chromosome.17.fa  \
        --genomeSAindexNbases 12 \
        --genomeChrBinNbits 5

	/content/mnt/MyDrive/NASA/GL4HS/STAR/STAR-2.7.11b/bin/Linux_x86_64_static/STAR --runThreadN 2 --runMode genomeGenerate --genomeDir /content/mnt/MyDrive/NASA/GL4HS/REFERENCE/MM39_CHR17 --genomeFastaFiles /content/mnt/MyDrive/NASA/GL4HS/REFERENCE/Mus_musculus.GRCm39.dna.chromosome.17.fa --genomeSAindexNbases 12 --genomeChrBinNbits 5
	STAR version: 2.7.11b   compiled: 2024-01-25T16:12:02-05:00 :/home/dobin/data/STAR/STARcode/STAR.master/source
Jul 14 19:18:50 ..... started STAR run

Jul 14 19:18:50 ... starting to generate Genome files
Jul 14 19:18:51 ... starting to sort Suffix Array. This may take a long time...
Jul 14 19:18:52 ... sorting Suffix Array chunks and saving them to disk...
Jul 14 19:20:33 ... loading chunks from disk, packing SA...
Jul 14 19:20:36 ... finished generating suffix array
Jul 14 19:20:36 ... generating Suffix Array index
Jul 14 19:20:48 ... completed Suffix Array index
Jul 14 19:20:48 ... writing Genome to disk ...
Jul 14 19:20:48 ... writing Suffix Array to di

In [13]:
# check index file
!ls -lh {REFERENCE_DIR}/MM39_CHR17

total 907M
-rw------- 1 root root    9 Jul 14 19:18 chrLength.txt
-rw------- 1 root root   12 Jul 14 19:18 chrNameLength.txt
-rw------- 1 root root    3 Jul 14 19:18 chrName.txt
-rw------- 1 root root   11 Jul 14 19:18 chrStart.txt
-rw------- 1 root root  91M Jul 14 19:20 Genome
-rw------- 1 root root  847 Jul 14 19:20 genomeParameters.txt
-rw------- 1 root root 723M Jul 14 19:20 SA
-rw------- 1 root root  94M Jul 14 19:20 SAindex


In [14]:
# check size of google drive usage for the reference (should be about 900M)
!du -sh {REFERENCE_DIR}/MM39_CHR17

907M	/content/mnt/MyDrive/NASA/GL4HS/REFERENCE/MM39_CHR17


# Check your work before moving on
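
One possible extra check, sketched below, is to confirm that the index directory contains every file seen in the `ls -lh` listing above and that none of them is empty. The list of names is copied from that listing; other STAR versions or options may write additional files. The helper name `missing_index_files` is made up for this example.

```python
import os
import tempfile

# file names copied from the `ls -lh` listing of MM39_CHR17 in this notebook
EXPECTED_INDEX_FILES = [
    "chrLength.txt",
    "chrNameLength.txt",
    "chrName.txt",
    "chrStart.txt",
    "Genome",
    "genomeParameters.txt",
    "SA",
    "SAindex",
]


def missing_index_files(index_dir, expected=EXPECTED_INDEX_FILES):
    """Return the expected files that are absent or empty in index_dir."""
    missing = []
    for name in expected:
        path = os.path.join(index_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


# demonstration on a throwaway directory where only two of the files exist
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Genome", "SA"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("placeholder")
    demo_missing = missing_index_files(tmp)
    print(demo_missing)
```

In the notebook you would call `missing_index_files(REFERENCE_DIR + '/MM39_CHR17')`; an empty list suggests the index was written completely.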

In [15]:
# check size of all GL4HS drive usage (should be about 1.4G)
!du -sh /content/mnt/MyDrive/NASA/GL4HS

1.3G	/content/mnt/MyDrive/NASA/GL4HS


In [16]:
# time the notebook
import datetime
end_time = datetime.datetime.now()
print('notebook end time: ', end_time.strftime('%Y-%m-%d %H:%M:%S'))

total_time = end_time - start_time
print('notebook total runtime: ', total_time)

notebook end time:  2025-07-14 19:21:05
notebook total runtime:  0:02:27.626818
