# FASTQ download and QC

## Fetching FASTQ data from NCBI SRA

In previous weeks, we explored the genome and annotations. They were stored in FASTA and GTF files and will be used as the reference to which RNA sequencing reads are mapped/aligned. After alignment, we can count how many RNA-seq reads align to the annotations we care about (genes or transcripts).

Now, we need to get the FASTQ files that we want to align and analyze. To do so, we will be referencing an SRA read table, which is a list of all the FASTQ files associated with a given experiment uploaded to NCBI SRA. Here's the read table for the *in vivo* PZQ treatment paper:

In [1]:
!head -11 SraRunTable.txt

Run,AGE,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,Bytes,Center Name,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,dev_stage,Experiment,Instrument,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,replicate,create_date,version,Sample Name,sex,SRA Study,strain,tissue,treatment
SRR10776762,7 weeks,RNA-Seq,102,2570036880,PRJNA597909,SAMN13691180,Model organism or animal,1127285500,MEDICAL COLLEGE OF WISCONSIN,public,"fastq,run.zq,sra","gs,ncbi,s3","gs.us-east1,ncbi.public,s3.us-east-1",sexually mature,SRX7450683,Illumina HiSeq 2500,ADULT_PZQ_E,PAIRED,RT-PCR,TRANSCRIPTOMIC,Schistosoma mansoni,ILLUMINA,2020-01-21T00:00:00Z,Biological replicate 5,2019-12-27T20:33:00Z,1,ADULT_PZQ_biological_replicate_5,pooled male and female,SRP238922,NMRI,whole animal,PZQ
SRR10776763,7 weeks,RNA-Seq,102,2444119104,PRJNA597909,SAMN13691179,Model organism or animal,1065352234,MEDICAL COLLEGE OF WISCONSIN,public,"fastq,run.zq,sra","gs,n

You can see that there are 10 different runs, each with an ID that looks something like `SRR10776762`. After the ID, quite a bit of run metadata is provided. You should spend some time learning what each field means.

The experiment that we're reanalyzing compared 2 treatment groups: parasites that were harvested from untreated mice and parasites that were harvested from mice treated with 100 mg/kg PZQ. Each group had 5 biological replicates, totaling 10 RNA-seq runs.

Most RNA-seq data in the USA is stored in NCBI's SRA database. We have to use `sra-tools` to download the FASTQ files from SRA. To do so, we'll use a `while` loop within bash that reads a list of run IDs. We can use `cut` (which we learned about in [week 2](../2_genome_exploration/2_genome_exploration.ipynb)) to get the first field, but because the table is comma-delimited instead of tab-delimited, we have to use the `-d` option to tell `cut` which delimiter to use. We then use `tail -n +2` to skip the header line. The output is saved to a file, which can then be looped through line by line. The FASTQ files will be stored in a new directory called `fastq`. Here's the code to get the list of runs we want to download. It will create a new file called `sra_list.txt`:

In [2]:
!mkdir fastq
!cut -d ',' -f 1 SraRunTable.txt | tail -n +2 > sra_list.txt
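
As a cross-check on the `cut` approach, here is a minimal Python sketch that pulls the same `Run` column with the standard `csv` module. The function name `read_run_ids` is made up for illustration, and the excerpt is just four columns taken from the first data row shown above. It also shows why splitting on commas is only safe for the first field: some later fields (such as `DATASTORE filetype`) are quoted and contain commas themselves.

```python
import csv
import io

def read_run_ids(handle):
    """Return the value of the 'Run' column for every row of an SRA run table."""
    reader = csv.DictReader(handle)  # understands quoted fields that contain commas
    return [row["Run"] for row in reader]

# Four columns excerpted from the first data row of SraRunTable.txt shown above.
excerpt = io.StringIO(
    "Run,DATASTORE filetype,LibraryLayout,treatment\n"
    'SRR10776762,"fastq,run.zq,sra",PAIRED,PZQ\n'
)
run_ids = read_run_ids(excerpt)
print(run_ids)

# A naive split on commas breaks the quoted field into three pieces,
# so the row appears to have 6 fields instead of 4.
naive_fields = 'SRR10776762,"fastq,run.zq,sra",PAIRED,PZQ'.split(",")
print(len(naive_fields))
```

To run this on the real table, you could pass `open("SraRunTable.txt", newline="")` to `read_run_ids` instead of the in-memory excerpt.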

Here's the code to download the FASTQ files:

In [1]:
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --split-files --sra-id $line --threads 16 --outdir fastq --gzip; \
    done < sra_list.txt

Getting SRR10776762 from NCBI SRA
2025-10-07 16:19:03,243 - SRR ids: ['SRR10776762']
2025-10-07 16:19:03,243 - extra args: ['--gzip']
2025-10-07 16:19:03,243 - tempdir: /local/scratch/job_321474/pfd_8o72rt4g
2025-10-07 16:19:03,243 - CMD: sra-stat --meta --quick SRR10776762
2025-10-07 16:19:07,413 - SRR10776762 spots: 25196440
2025-10-07 16:19:07,413 - blocks: [[1, 787388], [787389, 1574776], [1574777, 2362164], [2362165, 3149552], [3149553, 3936940], [3936941, 4724328], [4724329, 5511716], [5511717, 6299104], [6299105, 7086492], [7086493, 7873880], [7873881, 8661268], [8661269, 9448656], [9448657, 10236044], [10236045, 11023432], [11023433, 11810820], [11810821, 12598208], [12598209, 13385596], [13385597, 14172984], [14172985, 14960372], [14960373, 15747760], [15747761, 16535148], [16535149, 17322536], [17322537, 18109924], [18109925, 18897312], [18897313, 19684700], [19684701, 20472088], [20472089, 21259476], [21259477, 22046864], [22046865, 22834252], [22834253, 23621640], [23621641

***14 minute time to completion***

If we take another look at the metadata, there are a few things that are of interest to us and our analysis. First, the reads are `ILLUMINA` reads generated on a `HiSeq 2500` instrument, which means they may have Illumina adapters, depending on whether or not they were trimmed by the authors prior to uploading them to SRA. Second, we can see that these are paired-end reads (`PAIRED`). This information will be important to us soon. You can get a bit more information about the data using the tool `seqkit`. For instance:

In [None]:
!seqkit stats fastq/SRR10776762_1.fastq.gz

***1 minute time to completion***
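
To make the summary less of a black box, here is a rough Python sketch of the kind of numbers such a report contains: the read count and the minimum, average, and maximum read length. It is not how `seqkit` is implemented. The function name `fastq_stats` and the two toy records are invented for illustration, and the sketch assumes simple four-line FASTQ records.

```python
import io

def fastq_stats(handle):
    """Summarize read count and read lengths from an open FASTQ text handle."""
    num_seqs = 0
    sum_len = 0
    min_len = None
    max_len = 0
    for line_number, line in enumerate(handle):
        if line_number % 4 != 1:  # only the second line of each record is sequence
            continue
        length = len(line.strip())
        num_seqs += 1
        sum_len += length
        min_len = length if min_len is None else min(min_len, length)
        max_len = max(max_len, length)
    return {
        "num_seqs": num_seqs,
        "sum_len": sum_len,
        "min_len": min_len or 0,
        "avg_len": sum_len / num_seqs if num_seqs else 0.0,
        "max_len": max_len,
    }

# Two made-up records, used only to exercise the function.
toy_fastq = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nACGT\n+\nIIII\n")
stats = fastq_stats(toy_fastq)
print(stats)
```

For a real gzipped file, the handle could come from `gzip.open(path, "rt")` in the standard library, although a dedicated tool will be much faster on files with millions of reads.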

## FastQC

Whenever we look at FASTQ files for the first time, we should perform quality control (QC). The primary tool used for read QC is called FastQC, which is installed in our environment. Let's take a look at the manual and then run QC on our reads:

In [None]:
!fastqc -h

The following code uses the `*` wildcard (a shell glob, not a regular expression), so `fastq/*.fastq.gz` will expand to every file in `fastq` ending in `.fastq.gz`. This line will run QC on all 10 files and use 32 cores (threads):

In [2]:
!mkdir fastq/qc
!fastqc -t 32 fastq/*.fastq.gz -o fastq/qc

application/gzip
application/gzip
Started analysis of SRR10776762.fastq.gz
application/gzip
application/gzip
application/gzip
Started analysis of SRR10776763.fastq.gz
application/gzip
application/gzip
application/gzip
Started analysis of SRR10776764.fastq.gz
application/gzip
application/gzip
Started analysis of SRR10776765.fastq.gz
Started analysis of SRR10776766.fastq.gz
Started analysis of SRR10776767.fastq.gz
Started analysis of SRR10776768.fastq.gz
Started analysis of SRR10776769.fastq.gz
Started analysis of SRR10776770.fastq.gz
Started analysis of SRR10776771.fastq.gz
Approx 5% complete for SRR10776764.fastq.gz
Approx 5% complete for SRR10776763.fastq.gz
Approx 5% complete for SRR10776762.fastq.gz
Approx 5% complete for SRR10776766.fastq.gz
Approx 5% complete for SRR10776765.fastq.gz
Approx 5% complete for SRR10776767.fastq.gz
Approx 5% complete for SRR10776768.fastq.gz
Approx 5% complete for SRR10776769.fastq.gz
Approx 5% complete for SRR10776771.fastq.gz
Approx 5% complete for S

***3 minute time to completion***

Let's go through each section one-by-one.

### Per base sequence quality

Base quality scores are represented in Phred+33 encoding. Remember that Q20 is a popular threshold: it corresponds to a 1% chance that the base call is wrong, and the likelihood of an error rises quickly as the Phred quality drops below 20. It is typical to see high base quality at the 5' end of reads and lower quality at the 3' end. For this reason, the 3' end of reads is sometimes trimmed to remove low-quality bases.
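
The relationship between a Phred score and an error probability can be sketched in a few lines of Python. The function names are made up for illustration. The formula is the standard Phred definition, Q = -10 * log10(P), and Phred+33 means each quality character's ASCII code is the score plus 33.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33) to a list of integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

# 'I' is ASCII 73, '5' is 53, '+' is 43, and '!' is 33.
scores = decode_phred33("I5+!")
print(scores)

for q in scores:
    print(q, phred_to_error_prob(q))
```

A score of 40 means roughly 1 error in 10,000 calls, 20 means 1 in 100, and 10 means 1 in 10, which is why quality drops near the 3' end matter so much.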

### Per tile sequence quality

Tiles are physical locations on the flow cell. Warnings in this section can be associated with flow cells that have been overloaded.

### Per sequence quality scores

This distribution shows the average Phred quality of entire reads. It helps you make decisions about removing entire reads based on their average quality.

### Per base sequence content

This distribution shows the bases most likely to be found at any given position in the reads. In general, the likelihood of finding any of A, T, G, or C at a given position should be similar across the entire sequence. Deviations from this expectation may be due to the presence of barcodes or adapters. Here's an example of a dataset that has the adapter on the 5' end.

<img src="assets/linker.png" alt="linker" width="450">
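
Here is a small Python sketch of what a per-position base tally looks like. It is only an illustration of the idea, not FastQC's code. The function name `per_base_content` and the four toy reads (which share the 5' prefix `AGAT`) are invented.

```python
from collections import Counter

def per_base_content(reads):
    """For each position, return the fraction of each base among reads covering it."""
    counts = []
    for read in reads:
        for pos, base in enumerate(read):
            if pos == len(counts):  # first read long enough to reach this position
                counts.append(Counter())
            counts[pos][base] += 1
    return [
        {base: n / sum(position.values()) for base, n in position.items()}
        for position in counts
    ]

# Made-up reads: a shared 4-nt prefix followed by unrelated sequence.
toy_reads = ["AGATCCGA", "AGATGTAC", "AGATTAGC", "AGATACTT"]
content = per_base_content(toy_reads)
for pos, fractions in enumerate(content):
    print(pos, fractions)
```

In the first four positions a single base accounts for 100% of the reads, which is the same kind of skew a shared linker or barcode produces in the FastQC plot. After the prefix, the bases are mixed.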

### Per sequence GC content

Since our reads are derived from mRNA fragments, we would expect that the percentage of Gs and Cs would be normally distributed in a way that mimics the underlying transcriptome of the organism from which the mRNA was extracted. If the distribution isn't normal (that is, if there are sharp peaks), that is an indication of some sort of contamination. Here is an example of a sharp peak:

<img src="assets/gc_peak.png" alt="GC" width="450">

This peak is indicative of something in the library that has a higher GC% than the rest of the mRNA. Sharp peaks are often associated with adapter dimers. These are reads generated from two adapters that were ligated together during library prep. They have the sequencing primer binding site, but they don't have an insert. Here's the adapter sequence used for this library, with Gs and Cs annotated:

```
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC = 34
 x  xxx  x xx x xx x x  x xx x x x = 18
18 / 34 = 53%
```

So, it would make sense that if many reads come from adapter dimers, these reads would have a noticeably higher GC% than average.
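
The hand count above can be checked with a few lines of Python. The helper name `gc_fraction` is made up for illustration.

```python
def gc_fraction(seq):
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)

adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
gc_count = sum(base in "GC" for base in adapter)
print(len(adapter), gc_count, round(gc_fraction(adapter) * 100))  # 34 18 53
```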

### Per base N content

Ns are called when the base caller can't make a decision for that cycle. N calls are usually associated (and correlated) with per base sequence quality.

### Sequence length distribution

Some sequencing libraries will have non-uniform read lengths, in which case we'd see a distribution of different lengths. For runs that have a pre-defined read length, we should see a sharp peak at that length.

### Sequence duplication levels

For this module, FastQC will scan the first 50 bp of the first 100,000 reads and tally duplicates. The best libraries will have few, if any, duplicates. High levels of duplication may be indicative of PCR over-amplification. Sometimes over-amplification is an accepted tradeoff: highly expressed transcripts will be highly duplicated, but more lowly expressed transcripts are recovered. FASTQs can be deduplicated before or after the alignment step.
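
The tallying idea can be sketched in Python as follows. This is a simplification that follows the description above (compare only the first 50 bp of the first 100,000 reads), and FastQC's real bookkeeping differs in its details. The function name, its parameters, and the three toy reads are invented for illustration.

```python
from collections import Counter
from itertools import islice

def duplication_summary(reads, max_reads=100_000, prefix_len=50):
    """Tally duplicates among the first max_reads reads, comparing only a prefix."""
    counts = Counter(read[:prefix_len] for read in islice(reads, max_reads))
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        "percent_remaining": 100 * distinct / total if total else 0.0,
    }

# Made-up reads: the first two are identical over their first 50 bp and only
# differ further along, so they count as duplicates of each other here.
toy_reads = ["ACGT" * 25, "ACGT" * 13 + "T" * 48, "G" * 100]
summary = duplication_summary(toy_reads)
print(summary)
```

The `percent_remaining` value mimics the idea of "how much of the library would be left after deduplication": the lower it is, the more duplicated the library.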

### Overrepresented sequences

If you have many duplicate reads, it's likely there are overrepresented sequences. You can also use this list of sequences to help evaluate whether you have adapter dimers. A simple way to check for problems is to search a sequence database for the overrepresented sequence. If it hits a sequence deriving from your organism of interest, then you likely don't have a problem, and the overrepresentation may reflect true underlying biology. The most useful search tool is BLAST. You can use [this website](https://blast.ncbi.nlm.nih.gov/Blast.cgi) to BLAST the sequences against the NCBI's entire nucleotide database.
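
If you want to pull candidate sequences out yourself before BLASTing them, a minimal Python sketch might look like this. The function name and the toy reads are invented. The default cutoff of 0.1% of all reads is my understanding of the threshold FastQC uses for this module, so treat it as an assumption and adjust it as needed.

```python
from collections import Counter

def overrepresented(reads, min_fraction=0.001):
    """Return (sequence, count, fraction) for sequences above min_fraction of reads."""
    counts = Counter(reads)
    total = sum(counts.values())
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]

# Made-up reads; the cutoff is raised to 50% so the toy data gives one hit.
toy_reads = ["AGATCGGAAGAGC"] * 3 + ["ACGTTGCA", "TTGACCGT"]
hits = overrepresented(toy_reads, min_fraction=0.5)
print(hits)
```

Each sequence returned this way could then be pasted into BLAST to see whether it matches your organism, an adapter, or a contaminant.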

### Adapter content

This module searches reads for popular adapters and PolyA stretches. If you performed mRNA enrichment, you may see higher PolyA instances at either end of the reads. 

## MultiQC

When you have many FASTQ files, all the relevant QC can be difficult to manage. MultiQC is designed to coalesce QC metrics for many different bioinformatics datatypes.

Check the manual first:

In [None]:
!multiqc -h

MultiQC is very easy to use. All you have to do is run the command followed by the path to the working directory. It will discover all relevant QC files and wrap them up into a very helpful report. Run it now, then take a look at `multiqc_report.html` (you'll have to download it and open it in your browser).

In [3]:
!multiqc .


  /// MultiQC 🔍 | v1.17

|           multiqc | Search path : /data/users/zhouz6436/BIOL343/5_fastq
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 39/39
|            fastqc | Found 10 reports
|           multiqc | Report      : multiqc_report.html
|           multiqc | Data        : multiqc_data
|           multiqc | MultiQC complete


# FASTQ trimming and filtering

QC metrics will inform whether or not reads should be trimmed (i.e., PolyA tails, low quality bases, or adapters removed) or filtered (low quality reads removed). Typically, manufacturers will provide some guidance on how to trim and filter sequences produced from their kit. For Illumina (the maker of the TruSeq kit used in the paper), this information can be found in the [FAQs](https://support-docs.illumina.com/SHARE/AdapterSequences/Content/SHARE/AdapterSeq/TruSeq/UDIndexes.htm). This site shows that `AGATCGGAAGAGCACACGTCTGAACTCCAGTCA` is the adapter sequence for read 1 and `AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT` is the adapter sequence for read 2. However, our FastQC evaluation of these data shows no adapters or overrepresented sequences, which means the reads have likely already been trimmed.

Here's what the authors of the paper said about read QC:

> Trimmed reads were mapped to the Schistosoma mansoni genome (v7.2)...

Although they likely looked at QC, they say nothing about how they trimmed or filtered.

In the end, we'll just trim low quality sequences from the end (using Q = 20 as a threshold). We will use the tool [cutadapt](https://cutadapt.readthedocs.io/en/stable/) to perform the trimming. First, check out the manual:

In [None]:
!cutadapt --help

Edit this block and define the effect of each of the flags used in the `cutadapt` command below.

`-j`: 

`-q`: 

`-m`:

`-o`:

Now let's trim and run QC on these new files to see how they look. We'll use a bash loop again. Note that though our data was originally paired-end, the authors have already joined the read pairs so we're treating it like single-end (one read per fragment cluster).

In [1]:
!mkdir -p trimmed
!for fastq in fastq/*.fastq.gz; do \
    base_name=$(basename "$fastq" .fastq.gz); \
    cutadapt -j 32 -q 20 -m 20 -o ./trimmed/$base_name.fastq.gz $fastq; \
done

This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 32 -q 20 -m 100 -o ./trimmed/SRR10776762.fastq.gz fastq/SRR10776762.fastq.gz
Processing single-end reads on 32 cores ...
Done           00:00:40    25,196,440 reads @   1.6 µs/read;  37.00 M reads/minute
Finished in 41.119 s (1.632 µs/read; 36.77 M reads/minute).

=== Summary ===

Total reads processed:              25,196,440

== Read fate breakdown ==
Reads that were too short:             123,383 (0.5%)
Reads written (passing filters):    25,073,057 (99.5%)

Total basepairs processed: 2,570,036,880 bp
Quality-trimmed:               2,429,454 bp (0.1%)
Total written (filtered):  2,556,825,458 bp (99.5%)
This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 32 -q 20 -m 100 -o ./trimmed/SRR10776763.fastq.gz fastq/SRR10776763.fastq.gz
Processing single-end reads on 32 cores ...
Done           00:00:38    23,961,952 reads @   1.6 µs/read;  36.88 M reads/minute
Finished in 39.066 s (1.630 µs/read; 36.80 M re

***6 minute time to completion***

You can see from the cutadapt logs that very few reads were trimmed or filtered, and that's because the authors already did these steps before uploading to SRA. Not all authors do this, so we must always do QC when downloading new data to re-analyze. For our own new data, we are responsible for QC, trimming, and filtering.
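
To see what 3' quality trimming followed by a length filter does to a single read, here is a Python sketch. It follows the approach described in the cutadapt documentation: subtract the cutoff from every quality, take running sums from the 3' end, and cut where the sum is lowest. This is a simplified re-implementation for teaching, not cutadapt's own code. The function names are made up, and the example qualities are the ones used in the cutadapt documentation with a cutoff of 10.

```python
def quality_trim_index(qualities, cutoff):
    """Return the number of bases to keep after 3' quality trimming."""
    running_sum = 0
    lowest_sum = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum > 0:  # good bases now outweigh the bad tail, so stop
            break
        if running_sum < lowest_sum:
            lowest_sum = running_sum
            keep = i  # cutting here removes the worst-scoring tail so far
    return keep

def trim_and_filter(seq, qualities, cutoff=20, min_length=20):
    """Quality-trim the 3' end, then drop the read if it is shorter than min_length."""
    keep = quality_trim_index(qualities, cutoff)
    if keep < min_length:
        return None  # the read would be discarded as too short
    return seq[:keep], qualities[:keep]

example_quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
kept = quality_trim_index(example_quals, cutoff=10)
print(kept)  # 4
print(trim_and_filter("ACGTACGTAC", example_quals, cutoff=10, min_length=4))
print(trim_and_filter("ACGTACGTAC", example_quals, cutoff=10, min_length=5))
```

Note that the quality of 11 in the middle of the bad tail does not rescue it: the trimming looks at the tail as a whole rather than stopping at the first base above the cutoff.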

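Re-run FastQC so we can compare the metrics before and after trimming, then generate an overall report with MultiQC. This time, we use the `--force` option to overwrite the previous report and the `-d` option to prepend each directory name to the sample names, so that reports for the raw and trimmed files (which share file names) can be told apart. The directories to search are listed at the end of the command.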

In [2]:
!mkdir trimmed/qc
!fastqc -t 32 trimmed/*.fastq.gz -o trimmed/qc
!multiqc --force -d fastq/qc/ trimmed/qc/

application/gzip
application/gzip
Started analysis of SRR10776762.fastq.gz
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRR10776763.fastq.gz
application/gzip
application/gzip
application/gzip
application/gzip
Started analysis of SRR10776764.fastq.gz
Started analysis of SRR10776765.fastq.gz
Started analysis of SRR10776766.fastq.gz
Started analysis of SRR10776767.fastq.gz
Started analysis of SRR10776768.fastq.gz
Started analysis of SRR10776769.fastq.gz
Started analysis of SRR10776770.fastq.gz
Started analysis of SRR10776771.fastq.gz
Approx 5% complete for SRR10776763.fastq.gz
Approx 5% complete for SRR10776762.fastq.gz
Approx 5% complete for SRR10776764.fastq.gz
Approx 5% complete for SRR10776766.fastq.gz
Approx 5% complete for SRR10776765.fastq.gz
Approx 5% complete for SRR10776767.fastq.gz
Approx 5% complete for SRR10776768.fastq.gz
Approx 5% complete for SRR10776769.fastq.gz
Approx 5% complete for SRR10776771.fastq.gz
Approx 5% complete for S

***3 minute time to completion***

Check out the new MultiQC report to see what changed. Not a whole lot, but in some samples (e.g., SRR26691093) the PolyA tail was substantially trimmed. Our trimming did not deal with UMIs or linkers. We'll talk more about those when we get to alignment next week. You can open the individual QC files to inspect them further.

You'll notice that the "Sequence Length Distribution" sections often went from ✅ to ❌. That's because the trimmed PolyA tail was different for each read, so now we have a broad distribution of lengths rather than all of them being 69 nt. 

Even though we don't have adapters, we still have overrepresented sequences. What are they? BLAST them against a nucleotide database and see if they are concerning or not.