# Week 5 Homework

***Due (pushed to your GitHub branch) on 10/18 by 11:59 pm***

## FASTQ fetch, QC, and trimming/filtering

Use the SRA Toolkit to fetch the relevant FASTQ files. You may provide the list of SRA run accessions in a hand-written `txt` file (one accession per line) instead of extracting it from an SRA Run Table with `cut`.
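As a minimal sketch of the two options, the commands below build a small mock run table and check that extracting the first column with `cut` gives the same list as typing the accessions by hand. The `demo_sra` directory, the mock column names, and `PLACEHOLDER_RUN_2` are assumptions made for illustration; only `ERR4231022` comes from this notebook.

```shell
# Work in a throwaway directory so no real files are overwritten.
mkdir -p demo_sra

# Mock run table: a header line plus one row per run (hypothetical columns).
printf '%s\n' \
    'Run,ColumnB,ColumnC' \
    'ERR4231022,x,y' \
    'PLACEHOLDER_RUN_2,x,y' > demo_sra/SraRunTable_mock.txt

# Option 1: take the first comma-separated field and drop the header line.
cut -d ',' -f 1 demo_sra/SraRunTable_mock.txt | tail -n +2 > demo_sra/sra_list_from_table.txt

# Option 2: write the accessions by hand, one per line.
printf '%s\n' ERR4231022 PLACEHOLDER_RUN_2 > demo_sra/sra_list_by_hand.txt

# diff exits with status 0 only when the two lists are identical.
diff demo_sra/sra_list_from_table.txt demo_sra/sra_list_by_hand.txt && echo "lists match"
```

`cut -f 1` is safe here because the accession is the first column; pulling a later column this way could break on quoted fields that contain commas.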

Run QC on the FASTQ files and aggregate the results into a report.

In [1]:
!mkdir -p fastq fastq_placenta
!cut -d ',' -f 1 SraRunTable_Placenta.txt | tail -n +2 > sra_list_placenta.txt
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --sra-id "$line" --threads 16 --outdir fastq --gzip; \
    done < sra_list_placenta.txt

Getting ERR4231022 from NCBI SRA
2024-10-10 15:52:42,588 - SRR ids: ['ERR4231022']
2024-10-10 15:52:42,588 - extra args: ['--gzip']
2024-10-10 15:52:42,589 - tempdir: /local/scratch/job_98763/pfd_wkp5_n_8
2024-10-10 15:52:42,589 - CMD: sra-stat --meta --quick ERR4231022
2024-10-10 15:52:44,994 - ERR4231022 spots: 55208970
2024-10-10 15:52:44,994 - blocks: [[1, 3450560], [3450561, 6901120], [6901121, 10351680], [10351681, 13802240], [13802241, 17252800], [17252801, 20703360], [20703361, 24153920], [24153921, 27604480], [27604481, 31055040], [31055041, 34505600], [34505601, 37956160], [37956161, 41406720], [41406721, 44857280], [44857281, 48307840], [48307841, 51758400], [51758401, 55208970]]
2024-10-10 15:52:44,994 - CMD: fastq-dump -N 1 -X 3450560 -O /local/scratch/job_98763/pfd_wkp5_n_8/0 --gzip ERR4231022
2024-10-10 15:52:44,995 - CMD: fastq-dump -N 3450561 -X 6901120 -O /local/scratch/job_98763/pfd_wkp5_n_8/1 --gzip ERR4231022
2024-10-10 15:52:44,996 - CMD: fastq-dump -N 6901121 -X 

Edit this block with a short evaluation of each report section. Each evaluation should include your key observations, the likely reasons for any QC failure, and your opinion on how the failure should be addressed (i.e., trimming or filtering). Note any problem FASTQ files.

*Per base sequence quality:*

*Per tile sequence quality:*

*Per sequence quality scores:*

*Per base sequence content:*

*Per sequence GC content:*

*Per base N content:*

*Sequence length distribution:*

*Sequence duplication levels:*

*Overrepresented sequences:*

*Adapter content:*

If necessary, use `cutadapt` to trim and filter your reads. Provide a justification for the `cutadapt` parameters you used:
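One way to keep the `cutadapt` parameters explicit and easy to justify is to name them as variables and print the command before running it. The sketch below assumes single-end reads; the adapter sequence (the common Illumina adapter prefix), the quality and length cutoffs, and the `fastq_trimmed` directory are placeholder assumptions, not recommendations for your data. Choose your own values from the QC report.

```shell
# Placeholder parameters (assumptions for illustration only).
ADAPTER="AGATCGGAAGAGC"   # 3' adapter to remove (-a)
MIN_QUAL=20               # quality cutoff for trimming low-quality ends (-q)
MIN_LEN=20                # discard reads shorter than this after trimming (-m)

# Print the cutadapt command for one single-end file instead of running it.
# $1 = input FASTQ, $2 = output FASTQ
trim_cmd() {
    printf 'cutadapt -a %s -q %s -m %s -o %s %s\n' \
        "$ADAPTER" "$MIN_QUAL" "$MIN_LEN" "$2" "$1"
}

# Dry run for the accession fetched above (output path is hypothetical).
trim_cmd fastq/ERR4231022.fastq.gz fastq_trimmed/ERR4231022.trimmed.fastq.gz
```

Piping the printed line to `sh` (for example `trim_cmd in.fastq.gz out.fastq.gz | sh`) would run it, provided `cutadapt` is installed and the output directory exists.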



In [None]:
!mkdir -p fastq_placenta/placenta_qc
!fastqc -t 16 fastq/*.fastq.gz -o fastq_placenta/placenta_qc

Run QC on the trimmed/filtered reads and generate an aggregated report.
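Before aggregating, it can help to confirm that every FASTQ file actually has a FastQC result, since a report built from a partial set is easy to misread. The sketch below relies on FastQC's naming convention (`sample.fastq.gz` gives `sample_fastqc.zip`) and uses empty mock files in hypothetical `demo_fastq` and `demo_qc` directories in place of real data.

```shell
fq_dir="demo_fastq"   # hypothetical directory of (trimmed) FASTQ files
qc_dir="demo_qc"      # hypothetical directory of FastQC output
mkdir -p "$fq_dir" "$qc_dir"

# Empty mock files standing in for real data: sampleB has no FastQC output.
: > "$fq_dir/sampleA.fastq.gz"
: > "$fq_dir/sampleB.fastq.gz"
: > "$qc_dir/sampleA_fastqc.zip"

# Record every sample whose FastQC archive is absent.
: > "$qc_dir/missing.txt"
for fq in "$fq_dir"/*.fastq.gz; do
    base=$(basename "$fq" .fastq.gz)
    if [ ! -e "$qc_dir/${base}_fastqc.zip" ]; then
        echo "$base" >> "$qc_dir/missing.txt"
    fi
done

echo "Samples without FastQC output:"
cat "$qc_dir/missing.txt"
```

Once the list is empty, running `multiqc` on the QC directory (for example `multiqc demo_qc -o demo_report`) collects the FastQC results into one report.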

Report your observations on which QC metrics improved (or got worse):



## Snakemake Pipeline

When running bioinformatics studies, it is wise to use notebooks (such as this one) to tie together code, decisions, and data visualization. The task above is the kind of thing you might generate and report to a client or supervisor who is interested in sequencing QC and in the pre-processing decisions you made prior to alignment. In addition, the preprocessing should be readily reproducible by all relevant stakeholders, so you should supplement notebooks with pipeline scripts. Throughout the semester, we will be creating Snakemake pipelines to reproduce the published analysis that you've selected. Follow the steps below to begin building this pipeline:

1. Make a new `Snakefile` and `config.yaml` in `4_fastq`. 
2. In the `Snakefile`, produce the following rules:  
    a. `fetch_fastq` - use `parallel-fastq-dump` to fetch every FASTQ file listed in `config.yaml`.  
    b. `fastq_qc` - use `fastqc` to run QC on the raw FASTQ files.  
    c. `trim_filter` - use `cutadapt` to trim and/or filter the FASTQ reads, using parameters justified in this notebook. ***Note:*** if you choose to trim/filter, it should be done on *all* of the FASTQ files; do not cherry pick.  
    d. `trim_qc` - use `fastqc` to run QC on the trimmed FASTQ files.  
    e. `report` - use `multiqc` to aggregate all QC data into a single report.
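For the `fetch_fastq` rule to read its accessions from `config.yaml`, the accession list has to get into that file somehow. A minimal sketch, assuming a top-level key named `samples` (a hypothetical name; use whatever key your `Snakefile` reads) and writing into a hypothetical `demo_pipeline` directory so nothing real is overwritten:

```shell
mkdir -p demo_pipeline

# Stand-in accession list; PLACEHOLDER_RUN_2 is not a real accession.
printf '%s\n' ERR4231022 PLACEHOLDER_RUN_2 > demo_pipeline/sra_list.txt

# Emit a YAML list with one entry per non-empty line of the accession file.
{
    echo "samples:"
    while IFS= read -r acc; do
        if [ -n "$acc" ]; then
            echo "  - $acc"
        fi
    done < demo_pipeline/sra_list.txt
} > demo_pipeline/config.yaml

cat demo_pipeline/config.yaml
```

In the `Snakefile`, `configfile: "config.yaml"` would then expose the list as `config["samples"]`, and `snakemake -n` performs a dry run that shows which jobs would be executed.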

This `Snakefile` should not use any `wrappers` because we already have the relevant software installed in the `biol343` conda environment. The pipeline should run to completion when any instructor or classmate runs `snakemake --use-conda`. You may consult the [MultiQC pipeline documentation](https://multiqc.info/docs/usage/pipelines/#snakemake) or any other online documentation you find. You may not use any AI tools to complete this homework.