# Week 5 Homework

***Due (pushed to your GitHub branch) on 10/18 by 11:59 pm***

## FASTQ fetch, QC, and trimming/filtering

Use the SRA Toolkit to fetch the relevant FASTQ files. You may use a hand-crafted `txt` file to provide the list of SRA run accessions instead of manipulating an SRA Run Table with `cut`.
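If you do start from a Run Table, note that `cut -d ','` splits on every comma, including commas inside quoted fields. The sketch below is one possible alternative using Python's `csv` module. It assumes, as the `cut`/`tail` approach does, that the accession is in the first column and that the file has a single header row; the function name `write_sra_list` is made up for illustration.

```python
import csv
from pathlib import Path


def write_sra_list(run_table, out_path):
    """Write the first column of a comma-separated Run Table, one accession per line."""
    with open(run_table, newline="") as handle:
        reader = csv.reader(handle)  # respects quoted fields that contain commas
        next(reader, None)           # skip the header row, like `tail -n +2`
        accessions = [row[0].strip() for row in reader if row and row[0].strip()]
    # One accession per line, ready to be read by a `while read` loop
    Path(out_path).write_text("".join(f"{acc}\n" for acc in accessions))
    return accessions
```

Calling `write_sra_list("SraRunTable_lemons.txt", "sra_list_lemons.txt")` in a notebook cell would produce the same kind of list file that the shell loop reads.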

In [1]:
#!mkdir fastq_lemons
#!cut -d ',' -f 1 SraRunTable_lemons.txt | tail -n +2 > sra_list_lemons.txt
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --sra-id $line --threads 16 --outdir fastq_lemons --gzip; \
    done < sra_list_lemons.txt

Getting SRR26560370 from NCBI SRA
2024-10-17 15:32:09,570 - SRR ids: ['SRR26560370']
2024-10-17 15:32:09,571 - extra args: ['--gzip']
2024-10-17 15:32:09,571 - tempdir: /local/scratch/job_99830/pfd__ic1jglq
2024-10-17 15:32:09,571 - CMD: sra-stat --meta --quick SRR26560370
2024-10-17 15:32:11,995 - SRR26560370 spots: 23569582
2024-10-17 15:32:11,995 - blocks: [[1, 1473098], [1473099, 2946196], [2946197, 4419294], [4419295, 5892392], [5892393, 7365490], [7365491, 8838588], [8838589, 10311686], [10311687, 11784784], [11784785, 13257882], [13257883, 14730980], [14730981, 16204078], [16204079, 17677176], [17677177, 19150274], [19150275, 20623372], [20623373, 22096470], [22096471, 23569582]]
2024-10-17 15:32:11,995 - CMD: fastq-dump -N 1 -X 1473098 -O /local/scratch/job_99830/pfd__ic1jglq/0 --gzip SRR26560370
2024-10-17 15:32:12,014 - CMD: fastq-dump -N 1473099 -X 2946196 -O /local/scratch/job_99830/pfd__ic1jglq/1 --gzip SRR26560370
2024-10-17 15:32:12,015 - CMD: fastq-dump -N 2946197 -X 44

Run QC on the FASTQ files and aggregate the results into a report.

In [4]:
!mkdir fastq_lemons/qc
!fastqc -t 16 fastq_lemons/*.fastq.gz -o fastq_lemons/qc
!multiqc fastq_lemons/.


  /// MultiQC 🔍 | v1.17

|           multiqc | Search path : /data/users/corwinbm5021/BIOL343/5_fastq/fastq_lemons
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 27/27
|            fastqc | Found 9 reports
|           multiqc | Existing reports found, adding suffix to filenames. Use '--force' to overwrite.
|           multiqc | Report      : multiqc_report_2.html
|           multiqc | Data        : multiqc_data_2
|           multiqc | MultiQC complete


Edit this block with a short evaluation of each of the report sections. Each evaluation should include your important observations, reasons for potential QC failure, and an opinion on what should be done about the failure (i.e., trimming or filtering). Problematic FASTQ files, if any, should be noted.

*Per base sequence quality:*

*Per tile sequence quality:*

*Per sequence quality scores:*

*Per base sequence content:*

*Per sequence GC content:*

*Per base N content:*

*Sequence length distribution:*

*Sequence duplication levels:*

*Overrepresented sequences:*

*Adapter content:*

If necessary, use `cutadapt` to trim and filter your reads. Provide a justification for the `cutadapt` parameters you used:



In [None]:
# Trim/filter each raw FASTQ file and write it under the same name in trimmed_lemons/
!mkdir -p trimmed_lemons
!for fastq in fastq_lemons/*.fastq.gz; do \
    base_name=$(basename "$fastq" .fastq.gz); \
    cutadapt -j 16 -m 20 --poly-a --nextseq-trim=10 -o "./trimmed_lemons/$base_name.fastq.gz" "$fastq"; \
done
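
The per-file loop can also be written in Python, which makes the mapping from each input file to its output file explicit. This is only a sketch: the helper name `cutadapt_command` and its defaults are illustrative, and the flags are simply the ones used in this notebook, not a recommendation for your data.

```python
from pathlib import Path


def cutadapt_command(fastq, out_dir="trimmed_lemons", threads=16):
    """Build the cutadapt argument list for one single-end FASTQ file."""
    fastq = Path(fastq)
    out_path = Path(out_dir) / fastq.name  # same file name, different directory
    return [
        "cutadapt",
        "-j", str(threads),   # number of CPU cores
        "-m", "20",           # discard reads shorter than 20 nt after trimming
        "--poly-a",           # trim poly-A tails
        "--nextseq-trim=10",  # quality trimming that also removes trailing dark-cycle G bases
        "-o", str(out_path),  # trimmed output file
        str(fastq),           # input file comes last
    ]
```

To run it, pass each command to `subprocess.run(cmd, check=True)` while iterating over `sorted(Path("fastq_lemons").glob("*.fastq.gz"))`, after creating the output directory.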

Run QC on the trimmed/filtered reads and generate an aggregated report.

In [None]:
!mkdir trimmed_lemons/qc
!fastqc -t 16 trimmed_lemons/*.fastq.gz -o trimmed_lemons/qc
!multiqc --force -d fastq_lemons/qc/ trimmed_lemons/qc/

Report your observations on which QC metrics improved (or got worse):



## Snakemake Pipeline

When running bioinformatics studies, it is wise to use notebooks (such as this one) to tie together code, decisions, and data visualization. The task above is the kind of report you might deliver to a client or supervisor who is interested in sequencing QC and the preprocessing decisions you made prior to alignment. In addition, the preprocessing should be readily reproducible by all relevant stakeholders, so you should supplement notebooks with pipeline scripts. Throughout the semester, we will be creating Snakemake pipelines to reproduce the published analysis that you've selected. Follow the steps below to begin building this pipeline:

1. Make a new `Snakefile` and `config.yaml` in `5_fastq`. 
2. In the `Snakefile`, produce the following rules:  
    a. `fetch_fastq` - use `parallel-fastq-dump` to fetch every FASTQ file listed in `config.yaml`.  
    b. `fastq_qc` - use `fastqc` to run QC on the raw FASTQ files.  
    c. `trim_filter` - use `cutadapt` to trim and/or filter the FASTQ reads using parameters justified in this notebook. ***Note:*** if you choose to trim/filter, it should be done on *all* of the FASTQ files; do not cherry-pick.  
    d. `trim_qc` - use `fastqc` to run QC on the trimmed FASTQ files.  
    e. `report` - use `multiqc` to aggregate all QC data into a single report.
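
Before writing the rules, it can help to map out which file each rule produces for each accession, since Snakemake links rules by matching file names. The plain-Python sketch below does only that. Everything specific in it is an assumption for illustration: the `samples` config key, the `fastq/` and `trimmed/` directory names, and one single-end file per accession. In a Snakefile, `configfile: "config.yaml"` makes the YAML contents available as a `config` dictionary like the one passed in here, and FastQC names its output `<input name>_fastqc.zip`.

```python
def expected_outputs(config):
    """Map each accession in the config to the file each per-sample rule would produce."""
    outputs = {}
    for sra in config["samples"]:  # "samples" is a hypothetical config key
        outputs[sra] = {
            "fetch_fastq": f"fastq/{sra}.fastq.gz",        # hypothetical layout
            "fastq_qc": f"fastq/qc/{sra}_fastqc.zip",      # FastQC naming convention
            "trim_filter": f"trimmed/{sra}.fastq.gz",
            "trim_qc": f"trimmed/qc/{sra}_fastqc.zip",
        }
    return outputs
```

The `report` rule would then take every `fastq_qc` and `trim_qc` file as input, which is what forces all upstream rules to run for every accession.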

This `Snakefile` should not use any `wrappers` because we already have the relevant software installed in the `biol343` conda environment. The pipeline should run to completion when any instructor or classmate runs `snakemake --use-conda`. You may use the documentation found [here](https://multiqc.info/docs/usage/pipelines/#snakemake) or any other online documentation you may find. You may not use any AI tools to complete this homework.