# Infection Transcriptome - Introduction

## Background

The experiment of interest analyzed the *in vivo* transcriptome of the bacterium *Bordetella pertussis*, which causes whooping cough. The transcriptome is the set of expressed genes, as measured from messenger RNA molecules. Thus, the *in vivo* transcriptome is the set of genes expressed while the bacterium infects a host, while the *in vitro* transcriptome is the set of genes expressed under laboratory conditions. The bacterium causes disease by expressing virulence factors, the molecules that enable infection, which are encoded by its virulence genes. For example, one of the virulence factors of *B. pertussis* is pertussis toxin (PT), which the bacterium secretes and which can damage the host.

It can be inferred that *in vitro* laboratory conditions would not fully reproduce the set of virulence genes expressed *in vivo*, because virulence expression is limited without an ongoing infection. RNA sequencing allows us to identify the genes expressed during infection by comparing the read counts of RNA sequences between the two conditions. This can be important in identifying new antigens for more effective vaccines or treatments.

## RNA-Seq

Check out this [StatQuest](https://www.youtube.com/watch?v=tlf6wYJrwKY&t=1s) video to learn more. RNA sequencing is a technology used to quantify gene expression in cells by counting specific RNA sequences. It is often used to compare two groups of cells with different expression patterns. Imagine a cluster of normal cells and a cluster of mutant cells. These clusters behave in distinctly different ways as a result of differences in gene expression.

What you will find when analyzing the cells is that while they share the same genome, the active genes are different, which is what differential expression means. This is why we do high-throughput sequencing: to indicate which genes are active and the extent to which they are transcribed.

There are three steps to RNA-Seq:
1. Prepare the sequencing library: Extract the RNA fragments, add adapters, and PCR amplify them to prepare for sequencing.
2. Sequence: A machine uses fluorescent tagging to determine the sequence of the RNA fragments.
3. Data analysis: After cleaning the raw data and determining read counts, visualize transcription differences.

The first step is done in the laboratory. We will be skipping this step because (1) there is no nearby laboratory and (2) I would be too lazy anyway. Instead, we will start with the raw data from an experiment looking at differential gene expression between the *in vivo* (infection) transcriptome and the *in vitro* (laboratory) transcriptome of *Bordetella pertussis*, the bacterium that causes whooping cough, from [this study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6470212/). This puts us in the middle of the second step: to put it briefly, after the fluorescent tagging of nucleotides in a vertical flow cell has produced the sequences, along with quality scores that assess the certainty of each base.

Here are the next steps after downloading the raw data:
1. Convert to FASTQ format, a raw data format to store the sequence and quality scores.
2. Filter out garbage reads.
3. Align high quality reads to the reference genome.
4. Count the number of reads per gene.
5. Normalize the count data.
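
To make steps 4 and 5 a bit more concrete, here is a toy sketch. It uses counts per million (CPM), which is just one simple normalization picked for illustration and not necessarily the method used later in this project. The gene names and counts are made up.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical read counts per gene for two samples.
# sample_b was sequenced ten times deeper than sample_a.
sample_a = {"geneA": 50, "geneB": 150, "geneC": 800}
sample_b = {"geneA": 500, "geneB": 1500, "geneC": 8000}

cpm_a = counts_per_million(sample_a)
cpm_b = counts_per_million(sample_b)
```

After normalization the two samples have identical values, which is the point: differences in sequencing depth alone should not look like differences in expression.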

## Sequence - Part 1

### File Structure

Here is a structure fairly common among many research scientists in bioinformatics projects:

```
project_name/
    software/
    data/
        FASTQ/
        cleanFASTQ/
        kallisto/
        microbesOnline/
        seqData/
```

The **software/** folder will store the downloaded tools used to process the raw data, while the **data/** folder will store both raw data and processed data.

Furthermore, inside the cleanFASTQ, FASTQ, and kallisto folders, you will want to create a new folder for each sample (**in_vitro_1**, **in_vitro_2**, **in_vitro_3**, **in_vivo_1**, **in_vivo_2**, **in_vivo_3**). Disclaimer: The *in vitro* samples were down-sampled so that their number of reads is similar to that of the *in vivo* samples.
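
If you would rather not create all of these folders by hand, a small Python sketch like the one below can do it. The function name `make_project_tree` is my own invention, and the folder names are taken from the layout and sample names in this notebook.

```python
from pathlib import Path

# Sample names from the text above.
SAMPLES = ["in_vitro_1", "in_vitro_2", "in_vitro_3",
           "in_vivo_1", "in_vivo_2", "in_vivo_3"]
# Folders that get one subfolder per sample.
PER_SAMPLE_STAGES = ["FASTQ", "cleanFASTQ", "kallisto"]
# Folders that stay flat.
OTHER_FOLDERS = ["software", "data/microbesOnline", "data/seqData"]


def make_project_tree(root):
    """Create the project folders under `root` and return the per-sample paths."""
    root = Path(root)
    for folder in OTHER_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    created = []
    for stage in PER_SAMPLE_STAGES:
        for sample in SAMPLES:
            path = root / "data" / stage / sample
            # exist_ok=True makes the function safe to run more than once.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `make_project_tree(".")` from inside your project folder should create everything in one go.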

Thanks to Andrew Liu for this. He was my savior and lord for this project. Here is his guide for RNA-seq on [Github](https://github.com/hypercubestart/Counting-Transcripts) if you want to check it out. This project is sort of an extension of his, using a new data set and extending the analysis beyond PCA, which will also be discussed later. I am also assuming you are using a Windows environment rather than the MacOS or Linux environment from his tutorial. Preferably you are using Windows PowerShell, since some commands only work in PowerShell.

### Raw Data

To download the raw data, you will need to use commands found in the [SRA Toolkit](https://www.ncbi.nlm.nih.gov/books/NBK158900/) that will allow you to convert the .sra files storing raw data into other more readable formats. Extract the folder into the ./software folder and add the bin folder inside the toolkit to your Path environment variable so that you can access the important commands.

To access the data, use the executable "prefetch" from the Toolkit and add the run identifier.

In [None]:
!prefetch SRR6718408
!prefetch SRR6718407
!prefetch SRR6718400
!prefetch SRR6718403
!prefetch SRR6718406
!prefetch SRR6718405

You probably now want to move each downloaded folder to the ./data/seqData folder.

In [None]:
!move /Y ./SRR6718408 ./data/seqData
!move /Y ./SRR6718407 ./data/seqData
!move /Y ./SRR6718400 ./data/seqData
!move /Y ./SRR6718403 ./data/seqData
!move /Y ./SRR6718406 ./data/seqData
!move /Y ./SRR6718405 ./data/seqData
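
If the `move` command gives you trouble in your shell, the same step can be sketched in Python with `shutil.move`. This assumes `prefetch` placed each run folder in the current working directory, as the commands above imply. The helper name `move_runs` is hypothetical.

```python
import shutil
from pathlib import Path

# Run identifiers from the prefetch commands above.
RUN_IDS = ["SRR6718408", "SRR6718407", "SRR6718400",
           "SRR6718403", "SRR6718406", "SRR6718405"]


def move_runs(run_ids, src_dir=".", dest_dir="./data/seqData"):
    """Move each downloaded run folder from src_dir into dest_dir.

    Returns the run IDs that were not found in src_dir.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for run_id in run_ids:
        source = src_dir / run_id
        if source.exists():
            shutil.move(str(source), str(dest_dir / run_id))
        else:
            # Report instead of failing, so one missing run does not stop the rest.
            missing.append(run_id)
    return missing
```

Running `move_runs(RUN_IDS)` returns a list of any runs it could not find, which is a quick way to spot a failed download.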

### FASTQ

Now that we have downloaded the .sra files storing the raw data, we want to convert the data into a FASTQ format, a text-based format storing the sequence and corresponding quality scores. Here is the format:
- Line 1: Starts with a "@" and stores unique ID for the sequence
- Line 2: Contains bases of sequenced fragment
- Line 3: Begins with a "+" with same sequence ID and additional optional description
- Line 4: Quality scores for each base in sequence
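
To see how the four lines fit together, here is a minimal sketch that parses one record and turns the quality characters into numbers. It assumes the common Phred+33 encoding (score = ASCII code minus 33), and the record itself is made up and shortened for readability. `FastqRecord` and `parse_fastq_record` are names I chose for this example.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    description: str
    qualities: list


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into its parts."""
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so each score is the ASCII code minus 33.
    scores = [ord(char) - 33 for char in quality]
    return FastqRecord(header[1:], sequence, plus[1:], scores)


# A made-up record with eight bases.
record = parse_fastq_record([
    "@read.1 1 length=8",
    "GGGGGCTG",
    "+read.1 1 length=8",
    "AAFFFJJ<",
])
print(record.qualities)  # [32, 32, 37, 37, 37, 41, 41, 27]
```

A higher score means the sequencer was more certain about that base, so under this encoding `J` (41) is a more confident call than `<` (27).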

Here is the command to convert a file to FASTQ with fastq-dump:

    fastq-dump --split-files -O <outdir> <path/file>
    
What? Let me explain (thanks to the [Toolkit Documentation](https://www.ncbi.nlm.nih.gov/books/NBK158900/)):

- `fastq-dump` : Simply the terminal command provided by the SRA Toolkit to convert the files
- `--split-files` : Dump each read into a separate file, where a suffix is given to indicate read number. This option is being used because the data was obtained using paired-end sequencing, as indicated by a "paired" layout.
- `-O` : Indicates the output directory of the files following the option
- `<outdir>` : The directory to output to, as indicated with -O option
- `<path/file>` : That is simply the path to the .sra file

If you explore the data folders downloaded earlier, you will see that each folder stores two files: a file called "NC_002929.2", which stores the complete genome, and the .sra file for the specific run. We care about the .sra file.

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vitro_1/ ./data/seqData/SRR6718408/SRR6718408.sra

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vitro_2/ ./data/seqData/SRR6718407/SRR6718407.sra

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vitro_3/ ./data/seqData/SRR6718400/SRR6718400.sra

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vivo_1/ ./data/seqData/SRR6718403/SRR6718403.sra

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vivo_2/ ./data/seqData/SRR6718406/SRR6718406.sra

In [None]:
!fastq-dump --split-files -O ./data/FASTQ/in_vivo_3/ ./data/seqData/SRR6718405/SRR6718405.sra
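
The six cells above differ only in the sample name and the run identifier, so they can also be generated in a loop. The sketch below builds the same `fastq-dump` commands from a mapping taken from those cells. The function names are my own, and `convert_all` only prints the commands unless you pass `dry_run=False`.

```python
import subprocess

# Sample-to-run mapping taken from the cells above.
SAMPLE_RUNS = {
    "in_vitro_1": "SRR6718408",
    "in_vitro_2": "SRR6718407",
    "in_vitro_3": "SRR6718400",
    "in_vivo_1": "SRR6718403",
    "in_vivo_2": "SRR6718406",
    "in_vivo_3": "SRR6718405",
}


def fastq_dump_command(sample, run_id, data_dir="./data"):
    """Build the fastq-dump argument list for one sample."""
    return [
        "fastq-dump", "--split-files",
        "-O", f"{data_dir}/FASTQ/{sample}/",
        f"{data_dir}/seqData/{run_id}/{run_id}.sra",
    ]


def convert_all(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for sample, run_id in SAMPLE_RUNS.items():
        command = fastq_dump_command(sample, run_id)
        print(" ".join(command))
        if not dry_run:
            # Requires fastq-dump to be on your Path, as set up earlier.
            subprocess.run(command, check=True)
```

Calling `convert_all()` first lets you check the printed commands against the cells above before running anything for real.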

When you go to your FASTQ folder, you should see two new files for each sample, one for each of the two reads of every fragment. You can inspect these .fastq files with the [Get-Content](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content?view=powershell-7) command to see the FASTQ format. Make sure to limit the number of lines to four, because you are just inspecting the basic structure of FASTQ. Note: You must use Windows PowerShell for this.

In [None]:
!Get-Content ./data/FASTQ/in_vitro_1/SRR6718408_1.fastq -TotalCount 4

The command will not be recognized if you are using Jupyter Notebooks, but here is the expected output:

```
@SRR6718408.1 1 length=151
GGGGGCTGGGACCTTCCTCAACGGTTTGATTCACGTGGGGCTGCAGAACTATTCCGGAGGGCGGGCAGGGGATCCTCCCGCCAATTCCCTGGGCCAGCGGCTCAAGGAGCTGCAACTTCCGCAAGGCCGCCTGAAAACTGGCACGCCGCCG
+SRR6718408.1 1 length=151
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<
```
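
Since `Get-Content` is a PowerShell command, a notebook-friendly alternative is to read the first few lines in Python. This is a minimal sketch, and the helper name `head` is my own choice.

```python
from itertools import islice


def head(path, n=4):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, "r", encoding="utf-8") as handle:
        # islice stops after n lines, so large FASTQ files are not loaded fully.
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `head("./data/FASTQ/in_vitro_1/SRR6718408_1.fastq")` returns the four lines of the first record as a list, which you can print line by line.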

This notebook is getting long, so we will do the main sequencing steps in the next notebook. Recall what must be done after formatting the raw data:
1. Filter out the garbage reads.
2. Align the high quality reads to the "reference" genome.
3. Count the number of reads per gene.
4. Normalize the data to account for significant differences in reads.