## The source of the GTF file used for STAR

It is the official gene model file from the Gencode website matching the version of our reference genome (.fa file).

- https://www.gencodegenes.org/human/release_25.html


Since our reference genome only contains chromosome 20, entries from other chromosomes were discarded.


```
# Keep the header
head -n 5 gencode.v25.annotation.gtf > chr20.gtf

# Only keep entries on chr20
head -n 5 gencode.v25.annotation.gtf > chr20.gtf
```

In [32]:
# Copy required dataset to working directory
ls /scratch/work/courses/AppliedGenomics2021Sec3/week04_recitation

GCF_000001405.33_GRCh38.p7_chr20_genomic.fna  read_1.fastq
chr20.gtf				      read_2.fastq


In [33]:
# Create the working folder structure
mkdir data result

mkdir: cannot create directory ‘data’: File exists
mkdir: cannot create directory ‘result’: File exists


: 1

In [2]:
# Copy the required files to our working folder

cp /scratch/work/courses/AppliedGenomics2021Sec3/week04_recitation/* data
tree data -h

data
├── [  62M]  GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
├── [  22M]  read_1.fastq
└── [  22M]  read_2.fastq

0 directories, 3 files


## The modules we are using today

In [11]:
module purge
module load trimmomatic/0.39
module load star/intel/2.7.6a
module load bowtie2/2.4.2
module load samtools/intel/1.11

## Trim the fastq files

In [36]:
cd data
java -jar /share/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
read_1.fastq read_2.fastq \
read_1_trimmed.fq read_1_unpair_trimmed.fq \
read_2_trimmed.fq read_2_unpair_trimmed.fq \
HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

cd ..

TrimmomaticPE: Started with arguments:
 -phred33 read_1.fastq read_2.fastq read_1_trimmed.fq read_1_unpair_trimmed.fq read_2_trimmed.fq read_2_unpair_trimmed.fq HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Multiple cores found: Using 2 threads
Input Read Pairs: 97206 Both Surviving: 88187 (90.72%) Forward Only Surviving: 5936 (6.11%) Reverse Only Surviving: 1671 (1.72%) Dropped: 1412 (1.45%)
TrimmomaticPE: Completed successfully


## STAR Aligner

In [9]:
# Make a directory for STAR's reference genome
mkdir data/hg38_chr20_basic

### Building index for the reference genome

In [3]:
STAR --runThreadN 1 \ # The number of threads to use
--runMode genomeGenerate \ # Generating reference genome
--genomeSAindexNbases 11 \ # The nubmer index bases to use (the higher the number the more RAM is required)
--genomeDir data/hg38_chr20_basic \ # Path to save the output files
--genomeFastaFiles data/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna # The fasta file for the reference genome

Feb 22 08:11:08 ..... started STAR run
Feb 22 08:11:08 ... starting to generate Genome files
Feb 22 08:11:09 ... starting to sort Suffix Array. This may take a long time...
Feb 22 08:11:09 ... sorting Suffix Array chunks and saving them to disk...
Feb 22 08:12:24 ... loading chunks from disk, packing SA...
Feb 22 08:12:25 ... finished generating suffix array
Feb 22 08:12:25 ... generating Suffix Array index
Feb 22 08:12:29 ... completed Suffix Array index
Feb 22 08:12:29 ... writing Genome to disk ...
Feb 22 08:12:31 ... writing Suffix Array to disk ...
Feb 22 08:12:32 ... writing SAindex to disk
Feb 22 08:12:33 ..... finished successfully


In [12]:
# Make a directory for alignment result
mkdir result/STAR

In [8]:
STAR --genomeDir ./data/hg38_chr20_basic \ # The path to the indexed reference
--runThreadN 1 \ # The number of threads to use
--readFilesIn ./data/read_1_trimmed.fq data/read_2_trimmed.fq \ # The sequencing result
--outFileNamePrefix ./result/STAR/ \ # Path to save aligned files (SAM/BAM)
--outSAMtype BAM SortedByCoordinate \ # Output format
--outSAMunmapped Within \ # How to deal with the unmapped reads
--outSAMattributes Standard # Tags to include for the aligned SAM/BAM file

Feb 22 08:14:52 ..... started STAR run
Feb 22 08:14:52 ..... loading genome
Feb 22 08:14:52 ..... started mapping
Feb 22 08:15:18 ..... finished mapping
Feb 22 08:15:18 ..... started sorting BAM
Feb 22 08:15:21 ..... finished successfully


In [10]:
# Inspect the result
samtools flagstat result/STAR/Aligned.sortedByCoord.out.bam

196110 + 0 in total (QC-passed reads + QC-failed reads)
1698 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
193206 + 0 mapped (98.52% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
191508 + 0 properly paired (98.51% : N/A)
191508 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


### Build a reference genome with prior knowledge of splicing junctions

In [15]:
STAR --runThreadN 2 \
--runMode genomeGenerate \
--genomeSAindexNbases 11 \
--genomeDir data/hg38_chr20_sjdb \
--genomeFastaFiles data/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna \
--sjdbGTFfile data/chr20.gtf \
--sjdbOverhang 99

Feb 22 08:23:12 ..... started STAR run
Feb 22 08:23:12 ... starting to generate Genome files
Feb 22 08:23:12 ..... processing annotations GTF
Feb 22 08:23:13 ... starting to sort Suffix Array. This may take a long time...
Feb 22 08:23:13 ... sorting Suffix Array chunks and saving them to disk...
Feb 22 08:24:28 ... loading chunks from disk, packing SA...
Feb 22 08:24:30 ... finished generating suffix array
Feb 22 08:24:30 ... generating Suffix Array index
Feb 22 08:24:33 ... completed Suffix Array index
Feb 22 08:24:33 ..... inserting junctions into the genome indices
Feb 22 08:24:39 ... writing Genome to disk ...
Feb 22 08:24:39 ... writing Suffix Array to disk ...
Feb 22 08:24:39 ... writing SAindex to disk
Feb 22 08:24:39 ..... finished successfully


### Align with the new indexed reference

In [16]:
mkdir result/STARsjdb
STAR --genomeDir ./data/hg38_chr20_basic \
--runThreadN 2 \
--readFilesIn ./data/read_1.fastq data/read_2.fastq \
--outFileNamePrefix ./result/STARsjdb/ \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard 

mkdir: cannot create directory ‘result/STARsjdb’: File exists
Feb 22 08:24:40 ..... started STAR run
Feb 22 08:24:40 ..... loading genome
Feb 22 08:24:40 ..... started mapping
Feb 22 08:25:06 ..... finished mapping
Feb 22 08:25:07 ..... started sorting BAM
Feb 22 08:25:08 ..... finished successfully


In [17]:
# Inspect the result
samtools flagstat result/STARsjdb/Aligned.sortedByCoord.out.bam

196110 + 0 in total (QC-passed reads + QC-failed reads)
1698 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
193206 + 0 mapped (98.52% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
191508 + 0 properly paired (98.51% : N/A)
191508 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


## Bowtie2

In [19]:
mkdir hg38_chr20_bowtie2

In [23]:
cd data/hg38_chr20_bowtie2

In [25]:
bowtie2-build ../GCF_000001405.33_GRCh38.p7_chr20_genomic.fna hg38_chr20_bowtie2

Settings:
  Output files: "hg38_chr20_bowtie2.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:01
bmax according to bmaxDivN setting: 15986064
Using parameters --bmax 11989548 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 11

In [26]:
cd ../../
mkdir result/bowtie2

In [30]:
bowtie2 -p 2 \
        -x data/hg38_chr20_bowtie2/hg38_chr20_bowtie2 \
        -1 data/read_1.fastq -2 data/read_2.fastq -S result/bowtie2/bowtie2.sam

97206 reads; of these:
  97206 (100.00%) were paired; of these:
    1478 (1.52%) aligned concordantly 0 times
    66192 (68.09%) aligned concordantly exactly 1 time
    29536 (30.38%) aligned concordantly >1 times
    ----
    1478 pairs aligned concordantly 0 times; of these:
      35 (2.37%) aligned discordantly 1 time
    ----
    1443 pairs aligned 0 times concordantly or discordantly; of these:
      2886 mates make up the pairs; of these:
        1510 (52.32%) aligned 0 times
        904 (31.32%) aligned exactly 1 time
        472 (16.35%) aligned >1 times
99.22% overall alignment rate


In [39]:
samtools view -S -b result/bowtie2/bowtie2.sam > result/bowtie2/bowtie2.bam

In [40]:
samtools sort result/bowtie2/bowtie2.bam -o result/bowtie2/bowtie2_sorted.bam

In [41]:
samtools flagstats result/bowtie2/bowtie2_sorted.bam

194412 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
192902 + 0 mapped (99.22% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
191456 + 0 properly paired (98.48% : N/A)
191710 + 0 with itself and mate mapped
1192 + 0 singletons (0.61% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
