NOTE: I built this pipeline so that a user can skip the ~1 hr HISAT2 genome indexing build, if they wish, by providing previously indexed files. After zipping this directory and attempting to upload it to Brandeis Moodle, I found that the upload limit is 60 MB, far too small to host the index files in the `prebuilt_index/` directory. If you would like to see the original zip file in its entirety, please download it using the link provided, which pulls it from my Brandeis Google Drive. I put a lot of work into making this pipeline executable in different ways, both for efficiency and to challenge myself. The README below refers to the directory structure of the entire zipped folder. As noted, I had to remove the majority of it to meet the upload file size restriction.

Link to Wooten_Final.zip [https://drive.google.com/file/d/1rmW5RJSipdGeclBH-8OGBmZmXPYvxn9X/view?usp=sharing]

# RNA-Seq Workflow: Quality Assessment and Alignment Pipeline

## Overview
This project separates the RNA-Seq workflow into two parts:
1. Assessing read quality from FASTQ files using a Python script (`quality_check.py`) and generating plots to visualize the results.
2. Performing alignment and gene counting using a Bash script (`HISAT2_alignment.sh`), which aligns RNA-Seq reads to a reference genome and annotates gene-level read counts.

After completing each step, please refer to the discussion file (`discussion.ipynb`) for observations, before proceeding to the next step.

---

## Workflow Summary

### Part 1: Read Quality Assessment
- Run the `quality_check.py` script to assess the quality of input FASTQ files and generate plots.
- Plots and quality metrics are saved in the `outputs/` directory.
- Review the results in the `discussion.pdf` notebook.

### Part 2: RNA-Seq Alignment and Gene Counting
- After reviewing quality metrics, run the `HISAT2_alignment.sh` script to align the reads, convert alignments, and perform gene counting.
- The results are saved in the `outputs/` directory and interpreted in the `discussion.pdf` notebook.
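
As an illustration of the gene counting step, the sketch below tallies reads per gene from an annotated BED file. It is not taken from `HISAT2_alignment.sh`. It assumes each line comes from intersecting a 6-column BED record with a GTF record, so the read name is in column 4 and the GTF attribute string (containing `gene_id "..."`) is in the last column. The function name `count_reads_per_gene` is hypothetical.

```python
import re
from collections import defaultdict

# GTF attribute strings contain entries such as: gene_id "ENSG...";
GENE_ID_PATTERN = re.compile(r'gene_id "([^"]+)"')


def count_reads_per_gene(lines):
    """Count distinct read names per gene_id in intersected BED/GTF lines.

    Assumed layout (not confirmed by the pipeline): tab-separated fields with
    the read name in column 4 and the GTF attributes in the last column.
    """
    reads_by_gene = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        match = GENE_ID_PATTERN.search(fields[-1])
        if match:
            # A set avoids double counting a read that overlaps several
            # features (gene, transcript, exon) of the same gene.
            reads_by_gene[match.group(1)].add(fields[3])
    return {gene: len(reads) for gene, reads in reads_by_gene.items()}
```

Called on an open file handle for `outputs/sample1_annotated.bed`, this would return a dictionary mapping each gene ID to its number of distinct reads, provided the file follows the assumed layout.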

---

## Directory Structure

```plaintext
Final_submit/
├── inputs/                 # Input files (e.g., FASTQ, reference genome, GTF)
├── outputs/                # Outputs (e.g., plots, BAM/BED files, gene counts)
├── scripts/
│   ├── quality_check.py    # Python script for read quality assessment
│   └── HISAT2_alignment.sh # Bash script for alignment and analysis
├── index/                  # Genome index files created (if build is not skipped)
├── prebuilt_index/         # Genome index files to be used (if build is skipped for convenience)
├── discussion.ipynb        # File for interpreting results and visualizations
├── README.ipynb            # Instructions for use
└── expected_results.zip    # Zipped results as if the entire workflow had been run (in case you don't want to download HISAT2/matplotlib/etc.)
```
---

## Prerequisites

Make sure that the following tools and libraries are installed:  
- Python 3: with the `biopython` and `matplotlib` libraries
- HISAT2: for genome indexing and read alignment
- SAMtools: for handling SAM/BAM files
- BEDtools: for file conversion and annotation intersection

- Download and unzip the annotation file `Homo_sapiens.GRCh38.113.gtf` from the following link, placing it in the `inputs/` directory. [https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz]

- If running part 2 in its entirety (see details in Part 2), please download the file `Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz` from the following address [https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz], unzip the file, and place it in the `inputs/` directory for use.
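
Before running either part, it can help to confirm that the prerequisites are visible from the current environment. The following is a minimal sketch, not part of the submitted scripts. The executable names (`hisat2-build`, `hisat2`, `samtools`, `bedtools`) are the usual command names for these tools and are assumed here, and `Bio` is the import name of Biopython. The helper names `missing_tools` and `missing_modules` are hypothetical.

```python
import importlib.util
import shutil

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ["hisat2-build", "hisat2", "samtools", "bedtools"]
# Import names of the Python libraries (Biopython is imported as "Bio").
REQUIRED_MODULES = ["Bio", "matplotlib"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    problems = missing_tools() + missing_modules()
    if problems:
        print("Missing prerequisites:", ", ".join(problems))
    else:
        print("All prerequisites were found.")
```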

---

## Usage

### Part 1: Run the Python script to assess read quality 
1. Ensure your working directory is `Final_submit/`, then run the following command to execute the Python script:  

`python3 scripts/quality_check.py`

- Inputs: FASTQ files in `inputs/`
- Outputs: metrics printed to the terminal, plots saved in `outputs/`

2. Open `discussion.pdf` notebook to view the generated plots and review quality data.
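
To give a sense of the kind of metric a read quality check can report, here is a minimal sketch that computes the mean Phred score of each read. It is not the code in `quality_check.py`, which relies on Biopython according to the prerequisites. This version uses only the standard library and assumes four-line FASTQ records with Phred+33 quality encoding. All function names are hypothetical.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def mean_quality_per_read(handle):
    """Return the mean Phred score of each read that has quality values."""
    return [mean(phred_scores(q)) for _, _, q in read_fastq(handle) if q]


if __name__ == "__main__":
    # Tiny in-memory example; a real run would open a file in inputs/.
    example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n")
    print(mean_quality_per_read(example))
```

The per-read means could then be passed to `matplotlib` to draw a histogram, similar in spirit to the plots saved in `outputs/`.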

### Part 2: Run the Bash script for RNA-Seq alignment and gene counting

1. There are two ways to run the `HISAT2_alignment.sh` Bash script. For both, ensure your working directory is `Final_submit/`.  
 - To run the script in its entirety (includes genome indexing):  
    `bash scripts/HISAT2_alignment.sh`
 - The HISAT2 genome indexing step takes around 45 minutes. To bypass it and use the pre-built indexed genome, run the following:  
    `bash scripts/HISAT2_alignment.sh --skip-index`

    Inputs:
    - Reference genome: `inputs/Homo_sapiens.GRCh38.dna.primary_assembly.fa` (only if indexing is not skipped; must be downloaded and unzipped beforehand, see Prerequisites)
    - Indexed genome: `prebuilt_index/genome_index` (only if the indexing step is skipped)
    - Annotation file: `inputs/Homo_sapiens.GRCh38.113.gtf` (must be downloaded and unzipped beforehand, see Prerequisites)
    - FASTQ files: `inputs/sample1_R1.fastq`, `inputs/sample1_R2.fastq`

    Outputs:
    - Genome index files: located in the `index/` directory (if indexing is not skipped).
    - Alignment files:
        - `alignment_files/sample1.sam`: raw SAM output from HISAT2.
        - `alignment_files/sample1_sorted.bam`: sorted BAM file.
        - `alignment_files/sample1.bed`: BED file converted from BAM.
    - Final outputs:
        - `outputs/sample1_annotated.bed`: BED file intersected with the annotation.
        - `outputs/gene_counts.txt`: file containing gene-level read counts.

2. Revisit `discussion.pdf` for the interpretation of results and IGV snapshots of the alignments.
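
The effect of `--skip-index` on the sequence of steps can be sketched as follows. This is an illustration under stated assumptions, not the contents of `HISAT2_alignment.sh`: the exact commands and options the script uses are not shown in this README, the index prefix `index/genome_index` is assumed by analogy with `prebuilt_index/genome_index`, and the function names are hypothetical. The tool invocations use standard HISAT2, SAMtools, and BEDtools options.

```python
import shlex
import subprocess
import sys


def build_commands(skip_index=False):
    """Return ordered (argv, stdout_path) steps for one paired-end sample."""
    # Assumption: the freshly built index uses the prefix index/genome_index.
    prefix = "prebuilt_index/genome_index" if skip_index else "index/genome_index"
    steps = []
    if not skip_index:
        # Build the genome index (the slow step that --skip-index avoids).
        steps.append((["hisat2-build",
                       "inputs/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
                       prefix], None))
    steps += [
        # Align paired-end reads and write SAM.
        (["hisat2", "-x", prefix,
          "-1", "inputs/sample1_R1.fastq", "-2", "inputs/sample1_R2.fastq",
          "-S", "alignment_files/sample1.sam"], None),
        # Sort the alignments and write BAM.
        (["samtools", "sort", "-o", "alignment_files/sample1_sorted.bam",
          "alignment_files/sample1.sam"], None),
        # Convert BAM to BED; bedtools writes to stdout.
        (["bedtools", "bamtobed", "-i", "alignment_files/sample1_sorted.bam"],
         "alignment_files/sample1.bed"),
        # Intersect reads with the annotation, keeping both records.
        (["bedtools", "intersect", "-a", "alignment_files/sample1.bed",
          "-b", "inputs/Homo_sapiens.GRCh38.113.gtf", "-wa", "-wb"],
         "outputs/sample1_annotated.bed"),
    ]
    return steps


def run_steps(steps):
    """Execute each step, redirecting stdout to a file where requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_steps() to actually execute it.
    for argv, stdout_path in build_commands("--skip-index" in sys.argv[1:]):
        suffix = f" > {stdout_path}" if stdout_path else ""
        print(shlex.join(argv) + suffix)
```

Running the sketch with `--skip-index` prints the same plan without the `hisat2-build` step and with the alignment pointed at the pre-built index.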





