# Find a paper to reanalyze

In the first portion of this class, you will learn how to analyze RNA-seq data by attempting to reproduce a differential expression pipeline from a published paper. For this process to go smoothly and be useful, it's important to select the right kind of paper. This notebook includes a few things that you should look for.

## Clear experimental design

There are many use-cases for RNA-seq, not all of which involve identifying differentially expressed genes. Furthermore, there are many types of RNA-seq, including single-cell RNA-seq, which is a very popular modern iteration of the technology. scRNA-seq involves a different set of tools, so we will not be analyzing these types of data.

Your paper should have a clearly laid out experimental design that seeks to compare different datasets in order to identify differentially expressed genes. Thus, "differentially expressed genes" may be a good search term.

In the schistosome PZQ paper that we are walking through as a class, the experimental design was clear. There were 10 different samples: one for each post-exposure timepoint.

Those samples are summarized in Figure 1C:

<img src="assets/journal.pntd.0009200.g001.PNG" alt="experiment data" width="600">

A search for "mouse brain tissue differentially expressed genes RNA-seq" led me to this study by Jia et al., published in Neuroscience Letters in 2020. In this study, they analyzed differentially expressed genes in multiple brain tissues from control mice and from mice that modeled Parkinson’s disease. Figure 2 looks like this:

<img src="assets/1-s2.0-S030439402030344X-gr2.jpg" alt="Figure 2 from Transcriptomic profiling of differentially expressed genes and related pathways in different brain regions in Parkinson’s disease" width="600">

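In Fig. 2A, we see a heatmap that shows 24 samples from 8 groups (one for each column of the heatmap): four brain regions, each sampled from control mice and from Parkinson's model mice. In Fig. 2B, we see a PCA plot that shows how the samples cluster. The group labels are: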

1. CNCC: Cerebral cortex from control mice
2. CNHP: Hippocampus from control mice
3. CNST: Striatum from control mice
4. CNCB: Cerebellum from control mice
5. PDCC: Cerebral cortex from Parkinson's model mice
6. PDHP: Hippocampus from Parkinson's model mice
7. PDST: Striatum from Parkinson's model mice
8. PDCB: Cerebellum from Parkinson's model mice

For each of these papers, the experimental design is very clear: the authors extracted RNA from samples of interest and sequenced it to identify differentially expressed genes.

### Types of plots that demonstrate clear experimental design

There are a few clues that often suggest a clear experimental design, and you can usually find them in the figures. We already saw that heatmaps were included in both of the images above. We also saw PCA plots. PCA (principal component analysis) is a dimensionality reduction technique often used in RNA-seq analysis to take a high-dimensional dataset and plot it along two dimensions. In the case of RNA-seq, there are generally tens of thousands of dimensions - one dimension for each gene - and samples that are more transcriptomically similar will typically land nearer to each other in the PCA plot, often forming visible clusters.
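To make the idea concrete, here is a minimal sketch of PCA on a made-up expression matrix. The data, group labels, and gene counts are all hypothetical, and computing PCA through an SVD with NumPy is just one common way to do it; it is not the method used by either paper above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 samples x 200 genes of log-scale expression.
n_genes = 200
baseline = rng.normal(loc=5.0, scale=1.0, size=n_genes)
shift = np.zeros(n_genes)
shift[:20] = 3.0  # 20 genes are expressed more highly in the "treated" group

labels = ["control"] * 3 + ["treated"] * 3
samples = np.array([
    baseline
    + (shift if label == "treated" else 0.0)
    + rng.normal(scale=0.3, size=n_genes)  # per-sample noise
    for label in labels
])

# PCA via SVD: center each gene (column), then decompose.
centered = samples - samples.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s                          # sample coordinates on each component
explained = s**2 / np.sum(s**2)      # fraction of variance per component

for label, (pc1, pc2) in zip(labels, pcs[:, :2]):
    print(f"{label:8s} PC1={pc1:7.2f} PC2={pc2:7.2f}")
print(f"PC1 explains {explained[0]:.0%} of the variance")
```

In this toy example the control and treated samples should fall on opposite sides of PC1, which is the same kind of separation you would look for in a published PCA plot.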

Heatmaps, in turn, represent the strength of expression as a color gradient. In these plots, each gene is a row on the y-axis, and each sample is a column on the x-axis.
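Many heatmaps color each gene by a row-wise z-score rather than by raw expression, so that genes with very different baseline levels can share one color scale. Whether a given figure does this is something to check in its legend; the sketch below simply assumes it, and the gene names, sample names, and values are hypothetical.

```python
import numpy as np

# Hypothetical normalized expression: rows are genes, columns are samples.
genes = ["gene_a", "gene_b", "gene_c"]
samples = ["ctrl_1", "ctrl_2", "treated_1", "treated_2"]
expression = np.array([
    [10.0, 12.0, 30.0, 28.0],
    [500.0, 520.0, 480.0, 500.0],
    [8.0, 8.0, 2.0, 2.0],
])

# Z-score each row so every gene has mean 0 and standard deviation 1.
row_mean = expression.mean(axis=1, keepdims=True)
row_sd = expression.std(axis=1, ddof=1, keepdims=True)
z_scores = (expression - row_mean) / row_sd

# Each printed row is what one row of the heatmap would encode as color.
for gene, row in zip(genes, z_scores):
    print(gene, np.round(row, 2))
```

After scaling, `gene_a` and `gene_c` show opposite patterns across the two groups even though their raw values are far below those of `gene_b`.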

Another popular image type for RNA-seq is the so-called "volcano plot". Fig. 4 from a different paper about schistosomes is an example:

<img src="assets/journal.ppat.1012268.g004.PNG" alt="Figure 4 from Winners vs. Losers" width="600">

Each point in the volcano plot represents an individual gene. The y-axis represents the negative log-transformed p-value from the statistical test for differential expression (a higher number represents a lower p-value and thus higher confidence in the difference). The x-axis represents the log-transformed fold difference in expression between the two samples. In Fig. 4A, for instance, a higher number represents higher expression in the INT_ma eggs, while a lower number represents higher expression in the LIV_ma eggs. We again see heatmaps in this figure, as well.
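As a rough sketch of how the two axes of a volcano plot are computed, the snippet below converts group means and p-values into plot coordinates. The genes, means, and p-values are hypothetical, and the choice of log2 for fold change and -log10 for p-values is a common convention that I am assuming here, not something taken from the paper.

```python
import math

# Hypothetical per-gene results: mean expression in two groups and a p-value.
results = [
    {"gene": "gene_a", "mean_group1": 400.0, "mean_group2": 100.0, "p_value": 1e-8},
    {"gene": "gene_b", "mean_group1": 50.0, "mean_group2": 200.0, "p_value": 1e-4},
    {"gene": "gene_c", "mean_group1": 110.0, "mean_group2": 100.0, "p_value": 0.6},
]


def volcano_coordinates(row):
    """Return (x, y) for one gene: log2 fold change and -log10 p-value."""
    x = math.log2(row["mean_group1"] / row["mean_group2"])
    y = -math.log10(row["p_value"])
    return x, y


for row in results:
    x, y = volcano_coordinates(row)
    print(f"{row['gene']}: log2FC={x:+.2f}, -log10(p)={y:.2f}")
```

Genes far to the right or left and high on the plot (like `gene_a` and `gene_b` here) are the ones with large, statistically supported differences; `gene_c` would sit near the bottom center.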

## Clear data availability

In a perfect world, every paper that utilizes RNA-seq would include all relevant methodological details and code in order to reproduce the analysis. Unfortunately, we don't live in a perfect world, and many papers lack crucial details. At the very least, the following information is required:

1. Where the raw data can be acquired

Ideally, there would be a lot more information required, including the reference genome that was used for alignment, the type of sequencing that was performed, the code that was used to align the reads, etc.

To find out if the raw data is available, some journals will include a "Data Availability" statement. The schistosome PZQ paper includes the following:

>**Data Availability**: All data are fully available without restriction. All relevant data are within the manuscript and its Supporting Information files.

The RNA-Seq subsection of the Methods section includes more details:

>FASTQ files containing RNA-Seq data have been deposited in the NCBI SRA database under accession numbers PRJNA597909 and PRJNA602528.

The BioProject ID is what we're looking for - all the raw data will be found there. All Public Library of Science (PLOS) journals require a data availability statement, so I recommend starting there for your paper search. Here is a link to search for papers in PLOS journals: [Link to PLOS](https://plos.org/#journals). Unfortunately, the Parkinson's paper does not state where its data can be found (this is a huge issue and should not be allowed...).
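If you are skimming several papers, a small helper can pull accession-like IDs out of pasted text. This is only a sketch: the function name `find_accessions` is hypothetical, and the pattern is a simplified assumption that covers common BioProject (PRJNA/PRJEB/PRJDB), SRA-style (e.g. SRR, SRP), and GEO series (GSE) prefixes, so it may miss other formats.

```python
import re

# Simplified, assumed pattern for common accession prefixes.
ACCESSION_PATTERN = re.compile(
    r"\b(?:PRJ(?:NA|EB|DB)\d+|[SED]R[RXSP]\d+|GSE\d+)\b"
)


def find_accessions(text):
    """Return accession-like IDs in order of first appearance, without duplicates."""
    found = []
    for match in ACCESSION_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


statement = (
    "FASTQ files containing RNA-Seq data have been deposited in the NCBI SRA "
    "database under accession numbers PRJNA597909 and PRJNA602528."
)
print(find_accessions(statement))
```

Run on the sentence quoted above, this picks out the two BioProject IDs. A paper with no match at all is worth a closer look before you commit to it.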

One good way to start searching would be to go to the search bar for a PLOS journal, click "advanced search," select "body", and search for "differentially expressed genes". The first hit I get when I do that is this paper about tomatoes [Link to tomato paper](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0172411). The **Data Availability** statement lists many accession IDs starting with SRA - this is also a great option for you.

## Clear methodology

The best papers will include an entire public repository with the code required for reproduction, which will clearly mark the tools that were utilized. Most papers don't do this, but they will instead include this information in the Methods section. For instance, here are the methods for the tomato paper:

>For the RNA-seq data, the quality of data was evaluated by the FastQC software [20], and we retained reads that contained more than 95% bases and the bases’ quality score is 20. <br> <br> This program, Tophat-Cufflinks [21], could process a large number of read fragments based on RNA-Seq [22–23]. Transcripts selected could be processed as follows: (1) Aligning RNA-Seq reads to the reference genome. It was a core step in the analysis workflows, and we used Tophat [24] to align RNA-Seq reads to the genome. (2) Assembling transcripts. We used Cufflinks [25] packages to assemble transcripts. Frist, cufflinks assembled transcripts. Then, cuffmerge merged two or more transcript assemblies. Third, cuffdiff identified differentially expressed genes, transcripts and detected differential splicing and promoter. Besides, the analysis of differentially expressed genes was also conducted by FPKM. <br> <br> Differentially expressed genes (DEGs) were selected according to the threshold: |log2fold change| ≥ 1.00. False discovery rate (FDR) was used to correct the P values and genes with FDR < 0.05 were considered as significantly DGEs. We used in-house Python script to select applicable genes as DGEs.

This paragraph includes some of the tools used: Tophat, Cufflinks, cuffdiff, and their own Python script (which, again, should have been provided).
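The selection rule quoted above (|log2 fold change| ≥ 1 and FDR < 0.05) can be sketched in a few lines of Python. To be clear, this is not the authors' in-house script, which was not provided. The gene names and values are hypothetical, and I am assuming the Benjamini-Hochberg procedure for the FDR correction because it is a common choice.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical differential expression results.
genes = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
log2_fold_change = [2.5, -1.2, 0.4, 3.0, -2.0]
p_values = [0.0001, 0.004, 0.003, 0.2, 0.03]

fdr = benjamini_hochberg(p_values)

# Keep genes that pass both thresholds from the quoted methods.
degs = [
    gene
    for gene, lfc, q in zip(genes, log2_fold_change, fdr)
    if abs(lfc) >= 1.0 and q < 0.05
]
print(degs)
```

Here `gene_c` is dropped because its fold change is too small, and `gene_d` is dropped because its adjusted p-value is too high, even though its fold change is the largest in the table.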

Here are the methods for the schistosome PZQ paper:

>Libraries were generated using the TruSeq Stranded mRNA kit (Illumina) and sequenced using the Illumina HiSeq 2500 system (high-output mode, 50 bp paired-end reads at 20 million reads per sample). Trimmed reads were mapped to the Schistosoma mansoni genome (v7.2) using HISAT2. Differentially expressed gene products between vehicle control and PZQ-treated samples were identified using EdgeR (tagwise dispersion model, FDR adjusted p-value < 0.05). For experiments on sublethal drug treatment (Figs 2 and 3), differentially expressed, up and down-regulated transcripts were ranked by fold change and then functional enrichment analysis was performed using g:Profiler [21] to identify enriched GO-terms and KEGG pathways. Principal component analysis was performed using PCAGO [22], an R-based program using DESeq2-normalized counts from S2 and S3 Files and prcomp to visualize clustering of biological samples.

Here we see HISAT2, EdgeR, g:Profiler, etc.

Three of the most popular RNA-seq analysis tool combinations are:

- STAR-FeatureCounts
- HISAT-StringTie
- TopHat-Cufflinks

We will be using the first two options in week 6, but we'll be able to analyze data from papers that use TopHat-Cufflinks too. If you see any of these tools in the methods section, you're on the right track.
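When you are screening candidate papers, you can paste the Methods text into a quick check like the one below. This is only a convenience sketch: the helper name `tools_mentioned` and the patterns are my own assumptions, and simple keyword matching is no substitute for reading the methods.

```python
import re

# Assumed patterns for the tool names listed above.
# "STAR" is matched case-sensitively to avoid hits on the ordinary word "star".
TOOL_PATTERNS = {
    "STAR": re.compile(r"\bSTAR\b"),
    "featureCounts": re.compile(r"\bfeaturecounts\b", re.IGNORECASE),
    "HISAT": re.compile(r"\bhisat2?\b", re.IGNORECASE),
    "StringTie": re.compile(r"\bstringtie\b", re.IGNORECASE),
    "TopHat": re.compile(r"\btophat2?\b", re.IGNORECASE),
    "Cufflinks": re.compile(r"\bcufflinks\b", re.IGNORECASE),
}


def tools_mentioned(methods_text):
    """Return the names of the listed tools that appear in the text."""
    return [
        name
        for name, pattern in TOOL_PATTERNS.items()
        if pattern.search(methods_text)
    ]


pzq_methods = (
    "Trimmed reads were mapped to the Schistosoma mansoni genome (v7.2) "
    "using HISAT2."
)
tomato_methods = (
    "we used Tophat [24] to align RNA-Seq reads to the genome. "
    "We used Cufflinks [25] packages to assemble transcripts."
)
print(tools_mentioned(pzq_methods))
print(tools_mentioned(tomato_methods))
```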

# Task

Use these tips to select a paper with data that you will reproduce over the next 8 weeks. Read the paper thoroughly, and complete [the homework](1_homework.ipynb) by 9/19.