# FASTQ download and QC

## Fetching FASTQ data from NCBI SRA

In previous weeks, we explored the genome and annotations. These were stored in FASTA and GTF files and will be used as the reference to which RNA sequencing reads are mapped/aligned. After alignment, we can count how many RNA-seq reads align to the annotations we care about (genes or transcripts). Now we need to get the FASTQ files that we want to align and analyze. To do so, we will be referencing an SRA run table, which is a list of all the sequencing runs (and their FASTQ files) associated with a given experiment uploaded to NCBI SRA. Here's the run table for the Winners vs. Losers experiment:

In [1]:
!head -15 SraRunTable.txt
!ls

Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,Bytes,Center Name,Collection_Date,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,dev_stage,Experiment,geo_loc_name_country,geo_loc_name_country_continent,geo_loc_name,Host,Instrument,isolate,isolation_source,Library Name,LibraryLayout,LibrarySelection,LibrarySource,name,Organism,Platform,ReleaseDate,create_date,version,Sample Name,SRA Study,tissue
SRR26691082,RNA-Seq,69,692754411,PRJNA1036419,SAMN38137055,Invertebrate,504911946,"CHARLES UNIVERSITY, FACULTY OF SCIENCE",2022-11,public,"run.zq,fastq,sra","s3,ncbi,gs","s3.us-east-1,gs.us-east1,ncbi.public",egg,SRX22390854,Czech Republic,Europe,Czech Republic: Prague,Mus musculus,NextSeq 500,Belo Horizonte,laboratory,LIV_ma1,SINGLE,PolyA,TRANSCRIPTOMIC,LIV_ma1,Schistosoma mansoni,ILLUMINA,2023-12-31T00:00:00Z,2023-11-07T04:53:00Z,1,LIV_ma1,SRP470475,liver
SRR26691083,RNA-Seq,69,607047855,PRJNA1036419,SAMN38137054,Invertebrate,440003332,"CHARLES UNIVERSITY, F

You can see that there are 12 different runs, each with an ID that looks something like `SRR26691082`. After the ID, quite a bit of run metadata is provided. You should spend some time learning what each field denotes.

As we know, the experiment had four different samples: 
- liver immature
- liver mature
- intestine immature
- intestine mature

Here's the data from Figure 1 of the paper:

<img src="assets/journal.ppat.1012268.g001.PNG" alt="experiment data" width="600">

The 12 runs correspond to the 12 dots shown in Figure 1D.

First we have to use `sra-tools` to download the FASTQ files from NCBI's SRA database. We'll call it through `parallel-fastq-dump`, a wrapper that splits each run into blocks and runs `fastq-dump` on them in parallel. To process every run, we'll use a `while` loop within bash. We can use `cut` (which we learned about in [week 2](../2_genome_exploration/2_genome_exploration.ipynb)) to get the first field, but because the table is comma-delimited instead of tab-delimited, we have to tell the program with the `-d` option. We also drop the header row with `tail -n +2`. The run IDs are saved to a file, which can then be looped through line-by-line. The FASTQ files will be stored in a new directory called `fastq`.

In [2]:
!mkdir fastq
!cut -d ',' -f 1 SraRunTable.txt | tail -n +2 > sra_list.txt
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --sra-id $line --threads 16 --outdir fastq --gzip; \
    done < sra_list.txt

Getting SRR26691082 from NCBI SRA
2024-10-08 16:01:04,926 - SRR ids: ['SRR26691082']
2024-10-08 16:01:04,926 - extra args: ['--gzip']
2024-10-08 16:01:04,927 - tempdir: /local/scratch/job_98296/pfd_c6xj1vjl
2024-10-08 16:01:04,927 - CMD: sra-stat --meta --quick SRR26691082
2024-10-08 16:01:07,163 - SRR26691082 spots: 10039919
2024-10-08 16:01:07,163 - blocks: [[1, 627494], [627495, 1254988], [1254989, 1882482], [1882483, 2509976], [2509977, 3137470], [3137471, 3764964], [3764965, 4392458], [4392459, 5019952], [5019953, 5647446], [5647447, 6274940], [6274941, 6902434], [6902435, 7529928], [7529929, 8157422], [8157423, 8784916], [8784917, 9412410], [9412411, 10039919]]
2024-10-08 16:01:07,163 - CMD: fastq-dump -N 1 -X 627494 -O /local/scratch/job_98296/pfd_c6xj1vjl/0 --gzip SRR26691082
2024-10-08 16:01:07,176 - CMD: fastq-dump -N 627495 -X 1254988 -O /local/scratch/job_98296/pfd_c6xj1vjl/1 --gzip SRR26691082
2024-10-08 16:01:07,177 - CMD: fastq-dump -N 1254989 -X 1882482 -O /local/scratc

***8 minute time to completion***
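
Pulling the first column with `cut` is safe here because the run ID comes before any quoted field. Other columns are riskier: the `Center Name` value contains a comma inside quotes, which `cut -d ','` would split in two. As a minimal sketch (not part of the original workflow), Python's `csv` module handles the quoting. The table below is an abbreviated excerpt of `SraRunTable.txt` with only four of its columns, and `read_run_table` is a hypothetical helper name.

```python
import csv
import io

# Abbreviated excerpt of SraRunTable.txt (only 4 of the real columns are kept).
EXCERPT = '''Run,Center Name,LibraryLayout,tissue
SRR26691082,"CHARLES UNIVERSITY, FACULTY OF SCIENCE",SINGLE,liver
'''

def read_run_table(handle):
    """Return one dict per run, keyed by the column names in the header row."""
    return list(csv.DictReader(handle))

runs = read_run_table(io.StringIO(EXCERPT))
for run in runs:
    # The quoted comma in "Center Name" does not shift the later columns.
    print(run["Run"], run["LibraryLayout"], run["tissue"])
```

To read the real file instead of the excerpt, pass `open("SraRunTable.txt", newline="")` to the same function.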

If we take another look at the metadata, there are a few things that are of interest to us and our analysis. First, these reads were generated with `PolyA` selection, which means many reads should end in a stretch of A's at their 3' end. Second, the reads are `ILLUMINA` reads generated on a `NextSeq 500` instrument, which means they may have Illumina adapters, depending on whether or not they were trimmed by the authors prior to uploading them to SRA. Third, the `LibraryLayout` field is `SINGLE`, so these are single-end reads, which matches the paper. This metadata will be important to us soon. You can get a bit more information about the data using the tool `seqkit`. For instance:

In [4]:
!seqkit stats fastq/SRR26691082.fastq.gz

file                        format  type    num_seqs      sum_len  min_len  avg_len  max_len
fastq/SRR26691082.fastq.gz  FASTQ   DNA   10,039,919  692,754,411       69       69       69


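You can also use `grep` to count sequences by counting header lines. Because the files are gzipped, you have to decompress them first. Anchoring the pattern to the start of the line and including the run prefix (`^@SRR`) matters, because `@` is also a valid quality character and can begin a quality line:

In [5]:
!zcat fastq/SRR26691082.fastq.gz | grep -c '^@SRR'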

10039919
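
Another way to count reads, which does not depend on which characters appear in the file, relies on each FASTQ record spanning four lines. Here is a minimal Python sketch, assuming the usual unwrapped four-line layout. `count_fastq_records` is a hypothetical helper name, and the demo writes a tiny made-up file rather than touching the real data.

```python
import gzip
import os
import tempfile

def count_fastq_records(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4

# Demo on a tiny made-up file; note the '@' that starts the second quality string.
demo = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n@II@\n"
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write(demo)

# Two records, even though three lines in the file contain '@'.
print(count_fastq_records(demo_path))
```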


## FastQC

Whenever we look at FASTQ files for the first time, we should perform quality control (QC). The primary tool used for read QC is called FastQC, which is installed in our environment. Let's take a look at the manual and then run QC on our reads:

In [6]:
!fastqc -h


            FastQC - A high throughput sequence QC analysis tool

SYNOPSIS

	fastqc seqfile1 seqfile2 .. seqfileN

    fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] 
           [-c contaminant file] seqfile1 .. seqfileN

DESCRIPTION

    FastQC reads a set of sequence files and produces from each one a quality
    control report consisting of a number of different modules, each one of 
    which will help to identify a different potential type of problem in your
    data.
    
    If no files to process are specified on the command line then the program
    will start as an interactive graphical application.  If files are provided
    on the command line then the program will run with no user interaction
    required.  In this mode it is suitable for inclusion into a standardised
    analysis pipeline.
    
    The options for the program as as follows:
    
    -h --help       Print this help file and exit
    
    -v --version    Print the version of the program and exit

In [7]:
!mkdir fastq/qc
!fastqc -t 16 fastq/*.fastq.gz -o fastq/qc

application/gzip
application/gzip
Started analysis of SRR26691082.fastq.gz
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRR26691083.fastq.gz
Started analysis of SRR26691084.fastq.gz
Started analysis of SRR26691085.fastq.gz
Started analysis of SRR26691086.fastq.gz
Started analysis of SRR26691087.fastq.gz
Approx 5% complete for SRR26691082.fastq.gz
Approx 5% complete for SRR26691083.fastq.gz
Started analysis of SRR26691088.fastq.gz
Approx 5% complete for SRR26691084.fastq.gz
Started analysis of SRR26691089.fastq.gz
Approx 5% complete for SRR26691086.fastq.gz
Approx 5% complete for SRR26691085.fastq.gz
Started analysis of SRR26691090.fastq.gz
Started analysis of SRR26691091.fastq.gz
Started analysis of SRR26691092.fastq.gz
Started analysis of SRR26691093.fastq.gz
Approx 10% complete for SRR26691083.fastq.gz
Approx 10% complete for SRR26691084.fas

***1 minute time to completion***

Let's go through each section one-by-one.

### Per base sequence quality

Base quality scores are represented in Phred+33 encoding. Remember that Q20 is a popular threshold: it corresponds to a 1% chance that the base call is wrong, and the error probability rises tenfold for every 10-point drop in Phred quality. It is typical to see high base quality at the 5' end of reads and lower quality at the 3' end. For this reason, the 3' ends of reads are sometimes trimmed to remove low-quality bases.
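
To make the Phred scale concrete, here is a small Python sketch of the two conversions involved: from a quality character to a score (subtract 33 from its ASCII code) and from a score to an error probability (10 raised to the power of -Q/10). The function names are hypothetical, chosen for illustration.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)

def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) to a list of integer scores."""
    return [ord(char) - 33 for char in quality_string]

for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.3f}")

# '5' is ASCII 53, so it encodes Q20; 'I' is ASCII 73, so it encodes Q40.
print(decode_phred33("5I"))
```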

### Per tile sequence quality

Tiles are physical locations on the flow cell. Warnings in this section can be associated with flow cells that have been overloaded.

### Per sequence quality scores

This distribution shows the average Phred quality for entire reads. It helps you make decisions about removing entire reads based on their average quality.

### Per base sequence content

This distribution shows the bases most likely to be found at any given position in the reads. In general, the likelihood of finding any of A, T, G, or C at a given position should be the same across the entire sequence. Deviations from this expectation may be due to the presence of barcodes, adaptors, UMIs, UMI linkers, or PolyA tails. Here's an example of a dataset that has the UMI linker on the 5' end.

<img src="assets/linker.png" alt="UMI linker" width="450">
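
A simplified sketch of what this module computes is shown below: for each read position, the fraction of reads carrying each base. The reads are made up for illustration, and `per_base_content` is a hypothetical helper, not FastQC's implementation. A shared prefix such as a linker shows up as positions where a single base has a fraction of 1.0.

```python
from collections import Counter

def per_base_content(reads):
    """Return, for each position, the fraction of reads carrying each base."""
    n_positions = max(len(read) for read in reads)
    content = []
    for pos in range(n_positions):
        # Only reads long enough to cover this position contribute.
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts.values())
        content.append({base: counts[base] / total for base in "ACGT"})
    return content

# Made-up reads that all carry the same 4 nt prefix, followed by varied sequence.
reads = ["TATAGCCA", "TATACGTG", "TATATACC", "TATAGGAT"]
for pos, fractions in enumerate(per_base_content(reads), start=1):
    print(pos, fractions)
```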

### Per sequence GC content

Since our reads are derived from mRNA fragments, we would expect that the percentage of Gs and Cs would be normally distributed in a way that mimics the underlying transcriptome of the organism from which the mRNA was extracted. If the distribution isn't normal (that is, if there are sharp peaks), that is an indication of some sort of contamination. Here is an example of a sharp peak:

<img src="assets/gc_peak.png" alt="GC" width="450">

This peak is indicative of something in the library that has a higher GC% than the rest of the mRNA. Sharp peaks are often associated with adaptor dimers. These are reads generated from two adaptors that were ligated together during library prep - they have the sequencing primer binding site, but they don't have an insert. Here's the adaptor sequence used for this library, with Gs and Cs annotated:

```
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC = 34
 x  xxx  x xx x xx x x  x xx x x x = 18
18 / 34 = 53%
```

So, it would make sense that if many reads are associated with adaptor dimers, these reads would have a significantly higher GC% than average.
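
The hand count above can be double-checked with a few lines of Python. This is only a sketch; `gc_fraction` is a hypothetical helper name.

```python
def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    sequence = sequence.upper()
    return sum(base in "GC" for base in sequence) / len(sequence)

adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
gc_count = sum(base in "GC" for base in adapter)

# Prints the adapter length, the number of G/C bases, and the GC percentage.
print(len(adapter), gc_count, f"{gc_fraction(adapter):.0%}")
```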

### Per base N content

Ns are called when the base caller can't make a decision for that cycle. N calls are usually associated (and correlated) with per base sequence quality.

### Sequence length distribution

Some sequencing libraries will have non-uniform read lengths, in which case we'd see a distribution of different lengths. For runs that have a pre-defined read length, we should see a sharp peak at that length.

### Sequence duplication levels

For this module, FastQC tallies duplicates among the sequences that first appear in the first 100,000 reads of each file; reads longer than 75 bp are truncated to 50 bp for this comparison. The best libraries will have few, if any, duplicates. High levels of duplication may be indicative of PCR overamplification. Sometimes overamplification is a deliberate tradeoff: highly expressed transcripts will be highly duplicated, but more lowly expressed transcripts are recovered. FASTQs can be deduplicated before or after the alignment step.
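
The idea behind a duplication tally can be sketched in a few lines of Python. This is a deliberately simplified version that counts exact duplicates over all reads given to it; it is not FastQC's algorithm, and the reads and the helper name `duplication_summary` are made up for illustration.

```python
from collections import Counter

def duplication_summary(reads):
    """Return (number of distinct sequences, percent of reads left after deduplication)."""
    counts = Counter(reads)
    distinct = len(counts)
    return distinct, 100 * distinct / len(reads)

# Six made-up reads: one sequence seen 3 times, one seen twice, one seen once.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG", "CCAG"]
distinct, pct_remaining = duplication_summary(reads)
print(distinct, f"{pct_remaining:.0f}%")
```

A low percentage remaining means most reads are copies of a small number of sequences.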

### Overrepresented sequences

If you have many duplicate reads, it's likely there are overrepresented sequences. You can also use this list of sequences to help evaluate whether you have adaptor dimers. A simple way to check for problems is to search a sequence database for the overrepresented sequence. If it hits a sequence derived from your organism of interest, then you likely don't have a problem, and the overrepresentation may reflect true underlying biology. The most useful search tool is BLAST. You can use [this website](https://blast.ncbi.nlm.nih.gov/Blast.cgi) to BLAST the sequences against NCBI's entire nucleotide database.

### Adapter content

This module searches reads for popular adaptors and PolyA stretches. If you performed mRNA enrichment, you may see higher PolyA instances at either end of the reads. 

## FastQC evaluation

To properly evaluate these QC data, we need to consider how the library was prepared. This consideration will give us an idea of what (if any) adaptors we may expect to see, how much sequence duplication might be present, and if PolyA tails are expected. From the [paper](https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1012268#sec010), we know that they used the QuantSeq 3' mRNA FWD V2 library prep kit with UMI Second Strand Synthesis and 6 nt Unique Dual Indexing (UDI). You can read a bit about the kit [here](https://www.lexogen.com/store/quantseq-3-mrna-seq-v2-fwd-with-udi/). This kit is optimized for a few things:

- Degraded mRNA - it will only get sequences from the 3' end (and PolyA tails)
- Low input RNA - which is often the case for small eukaryotic parasites, especially when working with eggs, which are difficult to get RNA out of

Remember how library prep works. When doing PolyA selection, an oligo(dT) primer will be used for first-strand cDNA synthesis and a random primer will be used for second strand synthesis. The library is then amplified, during which barcodes and P5/P7 are added. Here's the schematic from Lexogen:

<img src="assets/library_prep.png" alt="library prep" width="600">

The QuantSeq 3' mRNA FWD V2 library also allows for incorporation of unique molecular identifiers (UMIs) during second strand synthesis. The UMI is already present during the amplification step, which means that each read carries a UMI tag that identifies which *original fragment* the read was derived from. The UMI is 6 nt long and has TATA immediately after it, prior to the insert. Reads that share a UMI and map to the same place are PCR copies of one fragment, so the UMIs tell us how many duplicated reads are present in the data. Here's what that looks like:

<img src="assets/umi.png" alt="UMI" width="450">

Using this information, we can make the following inferences:

- We are likely to have PolyA tails
- We may see UMIs and UMI linkers (TATA)
- Because the RNA was low-input, some duplicated sequences are to be expected.
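
The read layout described above (a 6 nt UMI, then the TATA linker, then the insert) can be sketched in Python as follows. The reads are made up, `split_read` is a hypothetical helper, and real tools such as UMI-tools handle this far more thoroughly.

```python
UMI_LENGTH = 6      # length of the UMI, as stated above
LINKER = "TATA"     # linker that follows the UMI

def split_read(sequence):
    """Split a read into (umi, insert) if the TATA linker follows the first 6 nt.

    Returns (None, sequence) unchanged when the linker is not where expected.
    """
    linker_end = UMI_LENGTH + len(LINKER)
    if sequence[UMI_LENGTH:linker_end] == LINKER:
        return sequence[:UMI_LENGTH], sequence[linker_end:]
    return None, sequence

# Made-up read: 6 nt UMI + TATA + insert.
print(split_read("ACGTACTATAGGCTTAGCAA"))
# Made-up read with no linker at positions 7-10.
print(split_read("GGCTTAGCAAGGCTTAGCAA"))
```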

Take some time to look at all the FastQC files and see if there are any other problems that should be noted.

## MultiQC

When you have many FASTQ files, all the relevant QC can be difficult to manage. MultiQC is designed to coalesce QC metrics for many different bioinformatics datatypes.

Check the manual first:

In [None]:
!multiqc -h

MultiQC is very easy to use. All you have to do is run the command followed by the path to the working directory. It will discover all relevant QC files and wrap them up into a very helpful report. Run it now, then take a look at `multiqc_report.html` (you'll have to download it and open it in your browser).

In [8]:
!multiqc .


  /// MultiQC 🔍 | v1.17

|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 45/45
|            fastqc | Found 12 reports
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete


# FASTQ trimming and filtering

QC metrics will inform whether or not reads should be trimmed (i.e., PolyA tails, low quality bases, or adaptors removed) or filtered (low quality reads removed). Typically, manufacturers will provide some guidance on how to trim and filter sequences produced from their kit. For Lexogen (the maker of the QuantSeq kit), this information can be found in the [FAQs](https://faqs.lexogen.com/faq/what-sequences-should-be-trimmed). This site shows that `AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC` is the adapter sequence. Is that sequence (or one like it) in the "Overrepresented sequences" section? If so, we need to trim the adapter. If not, we just need to trim the PolyA tail.

Here's what the authors of the paper said about read QC:

> Quality control of raw reads was performed in FastQC (v0.11.9). Reads were mapped to the S. mansoni genome (Wormbase ParaSite v9) using STAR aligner (version 2.7.10a) with options adjusted to the QuantSeq FWD 3’mRNA Library Prep Kit. The UMIs were deduplicated using open source UMI-tools software package (version 1.1.2).

Although they did look at QC, they say nothing about trimming or filtering. They do mention that they deduplicated reads using UMI-tools. We won't be able to reproduce this because the reads that we fetched from NCBI SRA don't all have UMIs; some only have the UMI linker (TATA).

For QuantSeq libraries, each read may include a 6 nt UMI, a 4 nt (TATA) linker, the insert, the PolyA tail, and then the adapter. Based on our QC data, it looks like the adapters have been trimmed (`AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC` isn't overrepresented). The PolyA tail *may* have been trimmed, but it's not obvious - we still see up to 40% PolyA at position 59 in SRR26691093. On the other side of the read, we see TATA in all 4 files, and SRR26691093 and SRR26691087 contain the 6 nt UMI while SRR26691085 and SRR26691082 do not.

We definitely want to trim PolyA tails, but whether to trim UMIs and linkers depends on the aligner being used. STAR (the aligner we'll be using) can soft-clip the ends of reads that have a high mismatch rate. This also means that we don't need to trim low-quality bases, because they can also be soft-clipped. What about the duplicates? Again, those can be marked during alignment, so we don't need to remove them either.

In the end, we'll just trim PolyA tails and do some trimming that's particular to the NextSeq instrument that we're not going to get into. We will use the tool [cutadapt](https://cutadapt.readthedocs.io/en/stable/) to perform the trimming. First, check out the manual:

In [9]:
!cutadapt --help

cutadapt version 4.9

Copyright (C) 2010 Marcel Martin <marcel.martin@scilifelab.se> and contributors

Cutadapt removes adapter sequences from high-throughput sequencing reads.

Usage:
    cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq

For paired-end reads:
    cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. All reads from input.fastq will be written to
output.fastq with the adapter sequence removed. Adapter matching is
error-tolerant. Multiple adapter sequences can be given (use further -a
options), but only the best-matching adapter will be removed.

Input may also be in FASTA format. Compressed input and output is supported and
auto-detected from the file name (.gz, .xz, .bz2). Use the file name '-' for
standard input/output. Without the -o option, output is sent to standard output.

Citation:

Marcel Martin. Cutadapt removes a

Edit this block and describe what each of the following `cutadapt` flags does.

`-j`: 

`-m`: 

`-O`:

`-a`:

`-n`:

Now let's trim and run QC on these new files to see how they look. We'll use a bash loop again.

In [10]:
%mkdir trimmed
!for fastq in fastq/*.fastq.gz; do \
    base_name=$(basename "$fastq" .fastq.gz); \
    cutadapt -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./trimmed/$base_name.fastq.gz $fastq; \
done

This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./trimmed/SRR26691082.fastq.gz fastq/SRR26691082.fastq.gz
Processing single-end reads on 16 cores ...
Done           00:00:12    10,039,919 reads @   1.2 µs/read;  49.11 M reads/minute
Finished in 12.337 s (1.229 µs/read; 48.83 M reads/minute).

=== Summary ===

Total reads processed:              10,039,919

== Read fate breakdown ==
Reads that were too short:              60,645 (0.6%)
Reads written (passing filters):     9,979,274 (99.4%)

Total basepairs processed:   692,754,411 bp
Quality-trimmed:               1,674,822 bp (0.2%)
Poly-A-trimmed:               27,670,085 bp (4.0%)
Total written (filtered):    662,590,840 bp (95.6%)

=== Poly-A trimmed ===

length	count
0	7854189
3	308134
4	155349
5	215901
6	171737
7	124653
8	98316
9	89926
10	85678
11	67989
12	86254
13	66030
14	50981
15	51620
16	42716
17	40238
18	39007
19	37771
20	36185
21	46458
22	69529
23	36987
24	18668
25	1

***2 minute time to completion***

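Re-run FastQC so we can compare the metrics before and after trimming, then generate an overall report with MultiQC. This time, we use the `--force` option to overwrite the previous report and the `-d` option to prepend each report's directory to its sample name, so the raw and trimmed reports for the same run stay distinct. The two directories to search are given as arguments.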

In [1]:
#!mkdir trimmed/qc
#!fastqc -t 16 trimmed/*.fastq.gz -o trimmed/qc
!multiqc --force -d fastq/qc/ trimmed/qc/


  /// MultiQC 🔍 | v1.17

|           multiqc | Prepending directory to sample names
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq/fastq/qc
|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq/trimmed/qc
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 48/48
|            fastqc | Found 24 reports
|           multiqc | Report      : multiqc_report.html   (overwritten)
|           multiqc | Data        : multiqc_data   (overwritten)
|           multiqc | MultiQC complete


***2 minute time to completion***

Check out the new MultiQC report to see what changed. Not a whole lot, but in some samples (e.g., SRR26691093) the PolyA tail was substantially trimmed. Our trimming did not deal with UMIs or linkers - we'll talk more about those when we get to alignment next week. You can open the individual QC files to inspect them further.

You'll notice that the "Sequence Length Distribution" sections often went from ✅ to ❌. That's because the trimmed PolyA tail was different for each read, so now we have a broad distribution of lengths rather than all of them being 69 nt. 
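
The spread in lengths is easy to reproduce with a naive Python sketch. It strips an exact run of trailing A's, which is a simplification: cutadapt's `--poly-a` uses its own algorithm and this code does not replicate it. The reads are made up, `trim_trailing_a` is a hypothetical helper, and the `min_run=3` default mirrors the shortest tail length listed in the cutadapt report above.

```python
def trim_trailing_a(sequence, min_run=3):
    """Remove a run of A's from the 3' end if it is at least min_run long.

    Naive exact-match version, for illustration only.
    """
    stripped = sequence.rstrip("A")
    if len(sequence) - len(stripped) >= min_run:
        return stripped
    return sequence

# Made-up reads, all 12 nt long before trimming.
reads = ["GCTTAGCCTGAC", "GCTTAGCAAAAA", "GCTTAAAAAAAA", "GCTTAGCCTGAA"]

# After trimming, the reads no longer share a single length.
print([len(trim_trailing_a(read)) for read in reads])
```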

Even though we don't have adapters, we still have overrepresented sequences. What are they? BLAST them against a nucleotide database and see if they are concerning or not.