# CRISPR screening - GCB535

In this module we will analyze data from an early CRISPR screening paper (<a href="https://science.sciencemag.org/content/350/6264/1096">Wang T et al. Science 2015</a>) that compared gene essentiality between four different cell lines. We will use the <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4">MAGeCK pipeline</a> to analyze the data. We have already installed this pipeline for you in your UNIX environment.

The MAGeCK pipeline can start either from raw FASTQ files or from a read count matrix. Here we will start from a read count matrix.

To use `MAGeCK`, we first need to make sure that the software environment is set to `Ubuntu 20.04 (Experimental)` in CoCalc. To do this, in CoCalc:

- Click on the Settings Tab (wrench icon) 
- Look in the Project Control Panel (left side) 
- In "Software Environment" select "Ubuntu 20.04 (Experimental)" from the drop-down menu
- Click the "Save and Restart" button that pops up

**Q1.** In your directory you will find the file `count_data.tab`. Open a UNIX terminal and take a look at the first few rows of this file.

Each column contains the read counts from a single sample at a single time point. There are four cell lines: `KBM7`, `K562`, `Jiyoye` and `Raji`. For each cell line there are two time points, `initial` and `final`. Every cell line has a single replicate except `KBM7`, which has two.

How many sgRNA guides are listed in this file?

**Q2.** Now we would like to run the MAGeCK pipeline to compare the *KBM7* final time point to the initial time point.

We expect sgRNAs that target genes essential for *KBM7* growth to be depleted at the final time point. Take a look at the [tutorial](https://sourceforge.net/p/mageck/wiki/demo/#the-first-tutorial-starting-from-read-count-tables) on how to run MAGeCK when starting from a read count table. Run the analysis in the UNIX terminal and provide your one line of UNIX code below. Use the two KBM7 final samples as your treatment and the two KBM7 initial samples as your control. Use `KBM7` as the prefix for your output files.

After running the pipeline, you will find several output files in your directory, all starting with `KBM7`. We will look into two of them: `KBM7.sgrna_summary.txt`, which contains the sgRNA-level analysis, and `KBM7.gene_summary.txt`, which contains the gene-level analysis.

**Q3.** Load the two data sets into your R environment and save them as two tibble variables: `KBM7_sgRNA` and `KBM7_gene`.

Use `read.table` with `header = TRUE`, then convert each result to a tibble with `as_tibble`. Add your code to the cell below:

In [None]:
library(tidyverse)
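
The loading pattern asked for in Q3 can be sketched on a toy file. This is only an illustration: the file is written to a temporary path and its values are made up, so it runs without any MAGeCK output. The name `toy_gene` is hypothetical.

```r
library(tidyverse)

# Toy stand-in for a MAGeCK summary file (values are made up for illustration)
toy_path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tnum\tneg|p-value\tpos|p-value",
  "GENE_A\t10\t0.001\t0.9",
  "GENE_B\t8\t0.5\t0.4"
), toy_path)

# read.table() returns a data.frame and rewrites headers that are not valid
# R names, so "neg|p-value" becomes "neg.p.value"
toy_gene <- as_tibble(read.table(toy_path, header = TRUE, sep = "\t"))
print(toy_gene)
```

The header rewriting is why the column names used later in this notebook contain dots rather than the `|` and `-` characters found in the raw files.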

**Q4.** After loading the data, we will first compare the sgRNA counts between the early and late time points. Use `ggplot` to make a scatter plot comparing `control_mean` with `treat_mean`. Use a log scale for both axes. Do you see depletion of sgRNAs in the later sample? How would this depletion be affected by the length of time between the two time points?
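
A log-log scatter plot of this kind might look like the sketch below. The table `toy_sgrna` and its values are invented for illustration. Adding a pseudocount of 1 is an assumption made here so that zero counts stay finite on a log axis; it is not something the assignment prescribes.

```r
library(tidyverse)

# Toy sgRNA-level table (made-up values); the real columns come from
# KBM7.sgrna_summary.txt
toy_sgrna <- tibble(
  sgrna        = c("g1", "g2", "g3", "g4"),
  control_mean = c(120, 300, 45, 0),
  treat_mean   = c(110, 20, 50, 0)
)

# Pseudocount of 1 (an assumption) keeps zero counts plottable on a log scale
p_scatter <- ggplot(toy_sgrna, aes(x = control_mean + 1, y = treat_mean + 1)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # no-change diagonal
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "control_mean + 1", y = "treat_mean + 1")
p_scatter
```

Points below the dashed diagonal correspond to guides with fewer reads in the treatment sample than in the control.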

**Q5.** Next, we will look at gene-level data using `KBM7_gene`. MAGeCK calculates a separate score and p-value for increased and for decreased representation. To make a volcano plot, we have to combine the two p-values into a single p-value. Use `tidyverse` and `ggplot` to make a volcano plot with `pos.lfc` on the x-axis (the lfc is the same for positive and negative) and the `-log10` of the minimum of the two p-values (`neg.p.value`, `pos.p.value`) on the y-axis. Also, filter your data so that you plot only genes with more than one sgRNA (column `num`).
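
The p-value combination described above can be sketched on a toy table. The values in `toy_gene` are made up, and the object names `volcano_data` and `p_volcano` are hypothetical.

```r
library(tidyverse)

# Toy gene-level table (made-up values) with the columns named in Q5
toy_gene <- tibble(
  id          = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  num         = c(10, 8, 1, 9),
  pos.lfc     = c(-2.1, 0.1, -3.0, 1.5),
  neg.p.value = c(1e-5, 0.40, 1e-6, 0.95),
  pos.p.value = c(0.99, 0.55, 0.99, 1e-3)
)

volcano_data <- toy_gene %>%
  filter(num > 1) %>%                             # keep genes with more than one sgRNA
  mutate(min_p = pmin(neg.p.value, pos.p.value))  # row-wise minimum of the two p-values

p_volcano <- ggplot(volcano_data, aes(x = pos.lfc, y = -log10(min_p))) +
  geom_point() +
  labs(x = "log fold change", y = "-log10(min p-value)")
p_volcano
```

Note the use of `pmin()` rather than `min()`: inside `mutate()`, `min()` would return a single value for the whole column, whereas `pmin()` compares the two columns row by row.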

**Provide and execute your code below.**

This volcano plot shows that many more genes are depleted than enriched. This is expected, as many genes are required for maintaining cell growth, while fewer gene knockouts result in increased cell growth.

**Q6.** If we use a false discovery rate (FDR) threshold of 0.05 (the FDR is already provided to you in the `neg.fdr` column), how many gene knockouts result in decreased KBM7 cell growth?

**Q7.** What are the top 10 most depleted genes? Does their known function fit with your expectation?
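
Thresholding on FDR and ranking genes can be sketched as follows. The table is a toy with made-up values; `neg.fdr` and `neg.rank` follow the column naming of the MAGeCK gene summary after loading with `read.table`, and the object names are hypothetical.

```r
library(tidyverse)

# Toy gene-level table (made-up values)
toy_gene <- tibble(
  id       = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  neg.fdr  = c(0.001, 0.20, 0.03, 0.80),
  neg.rank = c(1, 3, 2, 4)
)

# Number of genes passing the FDR threshold for depletion
n_depleted <- toy_gene %>%
  filter(neg.fdr < 0.05) %>%
  nrow()

# The two best-ranked depleted genes (use n = 10 on the real data)
top_depleted <- toy_gene %>%
  slice_min(neg.rank, n = 2)

n_depleted
top_depleted
```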

## Homework

We would like to compare gene essentiality between the four cell lines provided in our data sets. 

**Q8.** Run MAGeCK:

- For each of the three additional cell lines: `K562`, `Jiyoye` and `Raji`.
- Then, load the gene summary data and use `tidyverse` to construct a data set that contains the `neg.fdr` for each cell line.
- At an FDR < 0.05, how many depleted genes are shared between all cell lines?
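
One way to assemble a per-cell-line `neg.fdr` table is sketched below on toy data. The list `toy_results` stands in for the four loaded gene summaries and its values are made up; `fdr_wide` and `shared` are hypothetical names. Stacking the tables and pivoting to wide format is only one of several reasonable approaches (a chain of joins would also work).

```r
library(tidyverse)

# Toy stand-ins for the four gene summary tables (made-up values)
toy_results <- list(
  KBM7   = tibble(id = c("G1", "G2", "G3"), neg.fdr = c(0.01, 0.20, 0.001)),
  K562   = tibble(id = c("G1", "G2", "G3"), neg.fdr = c(0.02, 0.01, 0.004)),
  Jiyoye = tibble(id = c("G1", "G2", "G3"), neg.fdr = c(0.03, 0.50, 0.20)),
  Raji   = tibble(id = c("G1", "G2", "G3"), neg.fdr = c(0.001, 0.03, 0.01))
)

# Stack the tables, labelling each row with its cell line, then spread
# neg.fdr into one column per cell line
fdr_wide <- bind_rows(toy_results, .id = "cell_line") %>%
  select(cell_line, id, neg.fdr) %>%
  pivot_wider(names_from = cell_line, values_from = neg.fdr)

# Genes below the FDR threshold in every cell line
shared <- fdr_wide %>%
  filter(if_all(-id, ~ .x < 0.05))

shared
```

A gene missing from any one table would get an `NA` in that column and be dropped by the filter, which matches the "shared between all cell lines" requirement.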

Take a look at the following [paper](https://cancerdiscovery.aacrjournals.org/content/6/8/824.long).

Apparently, CRISPR screens for gene depletion can have false positives due to targeting of chromosomal regions that are amplified. The reason for that is that if you use Cas9 to target a region that is amplified many times, the excessive DNA damage induced by Cas9 can have a toxic effect. 

K562 cells are known to have an amplification on chromosome 22, so let's check whether this is reflected in our data as well. We have provided you with genomic coordinates for all genes in the file `gene_start.tab`.

**Q9.** Plot gene depletion scores (`-log10(neg.p.value)`) as a function of genomic location for chromosome 22. Do you observe any regions of the chromosome that display clustered gene depletion? How would you assess the significance, if any, of these observations?
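
Joining depletion scores to genomic coordinates might look like the sketch below. The layout of `gene_start.tab` is not described in this notebook, so the column names `gene`, `chr` and `start`, the `"chr22"` label, and all values are assumptions for illustration; check the real file and adjust accordingly.

```r
library(tidyverse)

# Toy K562 gene-level results (made-up values)
toy_k562 <- tibble(
  id          = c("G1", "G2", "G3", "G4"),
  neg.p.value = c(1e-4, 0.5, 1e-3, 0.2)
)

# Toy coordinate table; column names are hypothetical, not taken from gene_start.tab
toy_coords <- tibble(
  gene  = c("G1", "G2", "G3", "G4"),
  chr   = c("chr22", "chr22", "chr22", "chr1"),
  start = c(23200000, 41000000, 23500000, 1000000)
)

# Attach coordinates, keep chromosome 22, and order by position
chr22_scores <- toy_k562 %>%
  inner_join(toy_coords, by = c("id" = "gene")) %>%
  filter(chr == "chr22") %>%
  arrange(start)

p_chr22 <- ggplot(chr22_scores, aes(x = start / 1e6, y = -log10(neg.p.value))) +
  geom_point() +
  labs(x = "Position on chromosome 22 (Mb)", y = "-log10(neg.p.value)")
p_chr22
```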