# Week 5 Homework

***Due (pushed to your GitHub branch) on 10/18 by 11:59 pm***

## FASTQ fetch, QC, and trimming/filtering

Use the SRA-toolkit to fetch the relevant FASTQ files. You may use a hand-crafted `txt` file to provide the list of SRAs instead of manipulating an SRA Run Table with `cut`.

In [2]:
# ERR4231022
# ERR4231010
# ERR4231012
# ERR4231013
# ERR4231014
# ERR4231015
# ERR4231016
# ERR4231017
# ERR4231018
# ERR4231019
# ERR4231020
# ERR4231021
# ERR4231023
# ERR4231011

Run QC on the FASTQ files and aggregate the results into a report.

In [3]:
# !mkdir fastq_placenta
# !cut -d ',' -f 1 SraRunTable_Placenta.txt | tail -n +2 > sra_list_placenta.txt
!while IFS= read -r line; do \
    echo "Getting $line from NCBI SRA"; \
    parallel-fastq-dump --sra-id $line --threads 16 --outdir fastq_placenta --gzip; \
    done < sra_list_placenta.txt

Getting  from NCBI SRA
usage: parallel-fastq-dump [-h] [-s SRA_ID] [-t THREADS] [-O OUTDIR]
                           [-T TMPDIR] [-N MINSPOTID] [-X MAXSPOTID] [-V]
parallel-fastq-dump: error: argument -s/--sra-id: expected one argument


Edit this block with a short evaluation of each of the report sections. Each evaluation should include your important observations, reasons for potential QC failure, and an opinion on what should be done on the failure (i.e., trimming or filtering). Problem FASTQ files, if any, should be noted.

*Per base sequence quality:*

*Per tile sequence quality:*

*Per sequence quality scores:*

*Per base sequence content:*

*Per sequence GC content:*

*Per base N content:*

*Sequence length distribution:*

*Sequence duplication levels:*

*Overrepresented sequences:*

*Adapter content:*

If necessary, use `cutadapt` to trim and filter your reads. Provide a justification for the `cutadapt` parameters you used:



In [None]:
# !mkdir fastq_placenta/placenta_qc
!fastqc -t 16 fastq_placenta/*.fastq.gz -o fastq_placenta/placenta_qc

In [20]:
# mkdir fastq_placenta/trimmed_placenta
!for fastq in fastq_placenta/*.fastq.gz; do \
    base_name=$(basename "$fastq" .fastq.gz); \
    cutadapt -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./fastq_placenta/trimmed/$base_name.fastq.gz $fastq; \
done

This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./fastq_placenta/trimmed/ERR4231010.fastq.gz fastq_placenta/ERR4231010.fastq.gz
[Errno 2] No such file or directory: './fastq_placenta/trimmed/ERR4231010.fastq.gz'
This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./fastq_placenta/trimmed/ERR4231011.fastq.gz fastq_placenta/ERR4231011.fastq.gz
[Errno 2] No such file or directory: './fastq_placenta/trimmed/ERR4231011.fastq.gz'
This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./fastq_placenta/trimmed/ERR4231012.fastq.gz fastq_placenta/ERR4231012.fastq.gz
[Errno 2] No such file or directory: './fastq_placenta/trimmed/ERR4231012.fastq.gz'
This is cutadapt 4.9 with Python 3.12.3
Command line parameters: -j 16 -m 20 --poly-a --nextseq-trim=10 -o ./fastq_placenta/trimmed/ERR4231013.fastq.gz fastq_placenta/ERR4231013.fastq.gz


Run QC on the trimmed/filtered reads and generate an aggregated report

In [10]:
!multiqc fastq_placenta/.


  [91m///[0m ]8;id=893344;https://multiqc.info\[1mMultiQC[0m]8;;\ 🔍 [2m| v1.17[0m

[34m|           multiqc[0m | Search path : /data/users/mccallke0364/BIOL343/5_fastq/fastq_placenta
[2K[34m|[0m         [34msearching[0m | [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [35m100%[0m [32m42/42[0m  [0m0m  
[?25h[34m|            fastqc[0m | Found 14 reports
[34m|           multiqc[0m | Existing reports found, adding suffix to filenames. Use '--force' to overwrite.
[34m|           multiqc[0m | Report      : multiqc_report_1.html
[34m|           multiqc[0m | Data        : multiqc_data_1
[34m|           multiqc[0m | MultiQC complete


Report your observations on which QC metrics improved (or got worse):



## Snakemake Pipeline

When running bioinformatics studies, it is wise to use notebooks (such as this one) to tie together code, decisions, and data visualization. The above task is something you might generate and report to a client or supervisor who might be interested in sequencing QC and what kind of pre-processing decisions you made prior to alignment. In addition, the preprocessing should be readily reproducible by all relevant stakeholders, so you should supplement notebooks with pipeline scripts. Throughout the semester, we will be creating snakemake pipelines to reproduce the published analysis that you've selected. Follow the below steps to begin producing this pipeline:

1. Make a new `Snakefile` and `config.yaml` in `5_fastq`. 
2. In the `Snakefile`, produce the following rules:  
    a. `fetch_fastq` - use `parallel-fastq-dump` to fetch every FASTQ file listed in `config.yaml`.  
    b. `fastq_qc` - use `fastqc` to run QC on the raw FASTQ files.  
    c. `trim_filter` - use `cutadapt` to trim and/or filter the FASTQ reads using parameters justified in this notebook. ***Note:*** if you choose to trim/filter, it should be done on *all* of the FASTQ files; do not cherry pick.  
    d. `trim_qc` - use `fastqc` to run QC on the trimmed FASTQ files.  
    e. `report` - use `multiqc` to aggregate all QC data into a single report.

This `Snakefile` should not use any `wrappers` because we already have the relevant software installed in the `biol343` conda environment. The pipeline should run to completion when any instructor or classmate runs `snakemake --use-conda`. You may use the documentation found [here](https://multiqc.info/docs/usage/pipelines/#snakemake) or any other online documentation you may find. You may not use any AI tools to complete this homework.