# Homework Set - III

Each question/subquestion is worth **5 points** unless indicated otherwise.

### ENCODE: Testing for Enrichment

Let's try to learn about possible genetic mechanisms of Alzheimer's using ENCODE data.

SNPs associated with this disease (in hg19) can be found in `ALZ_SNPs.bed`.

Choose a data set from the ENCODE website that you think may be relevant to the disease. Feel free to choose a tissue other than brain, since other tissues could be linked to the disease as well. Think about both tissue type and ENCODE mark. Please do not use DNase hypersensitivity or RNA-seq datasets.

**Hint: it will be easiest if you find a data set with data in the bed file format.**

**Note: PLEASE try to pick a file smaller than 1 GB. This will make the analysis run faster and keep it feasible on CoCalc!**

**Q1.** (2 points) What dataset did you choose and why? Include the tissue, epigenetic mark tested, and any identifying information, such as the name of the sample.

**Q2.** Now, test for significant overlap between the Alzheimer's SNPs and your ENCODE mark. This time, instead of generating fake SNP sets, use the Fisher's exact test implemented in bedtools (http://bedtools.readthedocs.org/en/latest/content/tools/fisher.html).

**NOTE: The fisher tool requires that your data is pre-sorted by chromosome and then by start position. e.g.**

```bash
$ sort -k1,1 -k2,2n in.bed > in.sorted.bed 
```
    
**for BED files, where `in.bed` is your input bed file and `in.sorted.bed` is the name that you want for the sorted output.**
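For intuition about what that sort order looks like, here is a minimal Python sketch that orders BED lines the same way: by chromosome name as text, then by start position as a number. The function name `sort_bed_lines` and the toy lines are made up for illustration, and the sketch assumes tab-separated lines with no header or track lines.

```python
def sort_bed_lines(lines):
    """Sort BED lines by chromosome (as text), then by start (as a number)."""
    def key(line):
        fields = line.rstrip("\n").split("\t")
        return (fields[0], int(fields[1]))
    return sorted(lines, key=key)


# Toy, made-up intervals (not from any course file).
toy = ["chr2\t50\t60\n", "chr1\t200\t300\n", "chr1\t5\t10\n", "chr10\t1\t2\n"]
sorted_lines = sort_bed_lines(toy)
for line in sorted_lines:
    print(line, end="")
```

Note that chromosome names are compared as text, so `chr10` sorts before `chr2`.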

This test looks for an association between two classifications. In our case, we are looking for an association between being a SNP in ALZ_SNPs.bed and being in an interval from your ENCODE data set. Run this test on your datasets. 

**NOTE: We have provided you a pre-sorted genome file in this directory that you can use for this analysis (hg19_genome.tsv).**

What is the right-tailed p-value? This is the probability, under the null hypothesis of no association, of observing at least as much overlap as you did; a small value suggests more overlap than expected by chance. Please include your Unix commands in the top box and your answer in the bottom box.
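To make the idea of a right-tailed p-value concrete, here is a minimal Python sketch of a right-tailed Fisher's exact test on a generic 2x2 table, using only the standard library. The function name `fisher_right_tail` and the toy counts are hypothetical; bedtools builds its own table from your intervals and the genome file, so this only illustrates the statistic and does not replace running the tool.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """Right-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X is hypergeometric with the table's
    row and column totals held fixed.
    """
    row1 = a + b          # total of the first row
    col1 = a + c          # total of the first column
    n = a + b + c + d     # grand total
    max_a = min(row1, col1)
    numerator = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, max_a + 1)
    )
    return numerator / comb(n, col1)


# Toy, made-up counts: 3 "in both", 1 "in A only", 1 "in B only", 3 "in neither".
p_right = fisher_right_tail(3, 1, 1, 3)
print(p_right)
```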

**Q3.** Was there a significant association between the disease-associated genetic variants and the ENCODE mark? Explain your findings to the best of your abilities. Are you surprised by the result? If so, why? If not, why not?

### ENCODE: Finding Active regulatory regions

You also know that H3K9ac is a mark of active chromatin. Download and unzip the narrowPeak bed file for mouse H3K9ac data (file **ENCFF997COQ**). Then, using the `intersect` command and the `-v` flag, count the number of genes from `MeAc_genes.bed` that do NOT also have an H3K9ac mark. `MeAc_genes.bed` was generated in the 2nd ENCODE module and has been provided in this folder.
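Conceptually, keeping only the entries of one file that overlap nothing in another can be sketched in a few lines of Python. This is an illustration of the idea only, not the bedtools implementation: the function names and toy intervals are made up, and it assumes BED-style half-open coordinates, so intervals that merely touch end to start do not count as overlapping.

```python
def overlaps(a, b):
    """True if two (chrom, start, end, ...) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def without_overlap(genes, peaks):
    """Return the genes that overlap none of the peaks."""
    return [g for g in genes if not any(overlaps(g, p) for p in peaks)]


# Toy, made-up intervals (not from MeAc_genes.bed or any ENCODE file).
genes = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 500, 600, "geneB"),
    ("chr2", 100, 200, "geneC"),
]
peaks = [("chr1", 150, 160), ("chr2", 200, 300)]

unmarked = without_overlap(genes, peaks)
print([g[3] for g in unmarked])
```

This brute-force comparison checks every gene against every peak, which is fine for a toy example but slow for genome-scale files; sorted input lets tools avoid that.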

**Q4.** Write the bedtools command(s) to do this. Don't forget to include any sorting commands you may need.

**Q5.** How many genes from MeAc_genes.bed do not have an H3K9ac mark?

**Q6.** Are you convinced that these genes are truly active genes in embryonic mouse liver? Do you think that there may be some genes in liver_genes.txt that are expressed that we missed? Why or why not? What complexities are we overlooking?

*Hint: Think about issues such as where different marks should be located relative to the gene (from the Enhancer paper in the prelab), whether bedtools' window command was the ideal one to use, and cell type heterogeneity*.

### Creating Pipelines and Automation of analysis (Rscripts)

In Module 22, you created an Rscript that you can use for high-throughput analysis of data.

**Q7.** Provide a copy of the Rscript that you created in your `H03_Homework-III` directory!

**Q8.** Now, let's unlock the power of automation. We have provided you with 3 data sets (exp1, exp2, exp3):

    /data
    
Using your script, analyze each of the 3 data sets provided and generate outputs!

**Q9.** Read each of the 3 output files you created into R and report the contents to your notebook. 

Provide your code below:

**Q10.** You'll note that if you had 1000s of experiments to analyze, it would be a real pain to write out the command line for each one. You have better things to do than that!

What could you do to save yourself from needing to do that? How would you modify the code to achieve that? (Provide a general description, but no need to write specific code)

### Automation / re-analysis Using Jupyter Notebooks

In Module 23 (Pharmacology), you probably noticed that you could have tabulated virtually all of the above in Excel.

However, in a true high-throughput screening assay, you will have **dozens** of plates to process. That's too much for any one person to do perfectly in Excel!

You may also have noticed that through doing this assignment, you have written a 'generic' pipeline to process a single plate.

**Q11.** Return to the in-class module. In order to process a different plate, called `plate2`, what would you change in the pipeline you created?

**Q12.** In your module 23 directory, we included data from 6 additional plates. 

Process each and report here (excluding controls):
- the Z prime factor for each plate  
- the number of cells per plate that gave a normalized score lower than -5, excluding controls. Note that rather than counting the results on the heatmap, you could `sum()` within the appropriate part of the heatmap table (excluding the controls, of course)

To do this, you could change the plate and re-run your notebook for that module, and record the results in the cells below.
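As a reminder of what these two quantities measure, here is a minimal Python sketch using the standard definition of the Z prime factor, Z' = 1 - 3(sd_pos + sd_neg) / |mean_pos - mean_neg|, and a simple count of scores below -5. The control values and scores are made up, the function name `z_prime` is hypothetical, and the sketch assumes the sample standard deviation; your module notebook may use a different convention or language.

```python
from statistics import mean, stdev


def z_prime(pos_controls, neg_controls):
    """Z prime factor from positive- and negative-control readings."""
    spread = stdev(pos_controls) + stdev(neg_controls)   # sample SDs (assumption)
    separation = abs(mean(pos_controls) - mean(neg_controls))
    return 1 - 3 * spread / separation


# Toy, made-up control readings (not from any course plate).
pos = [100, 102, 98]
neg = [10, 12, 8]
zp = z_prime(pos, neg)
print(round(zp, 3))

# Counting hits: each comparison is True/False, and summing counts the Trues.
scores = [-6.2, -1.0, 0.3, -5.0, -7.5]   # toy normalized scores, controls excluded
n_hits = sum(s < -5 for s in scores)
print(n_hits)
```

A score of exactly -5 is not counted here, because the condition is strictly "lower than -5".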

You will notice that in these data, you do not actually have any sample names attached to your data, e.g. what genes you actually screened.

Imagine that you were provided a file that looked like the following:

    sampleid,row,col,plate
    87234,C,3,1
    7134,C,4,1
    ...
    81672,P,22,7

i.e. a file with 2240 rows (+1 header) where each sampleid is mapped to a corresponding row, column, and plate. Note that positive and negative control columns are excluded.
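One way to picture such a file inside a program is as a table keyed by position. The minimal Python sketch below reads only the three rows visible in the excerpt above (the elided rows are left out) from an in-memory string and builds a dictionary keyed by (plate, row, col). The variable names are made up for illustration, and all fields are kept as text.

```python
import csv
import io

# Only the rows shown in the excerpt; the real file would have many more.
excerpt = """sampleid,row,col,plate
87234,C,3,1
7134,C,4,1
81672,P,22,7
"""

# Map each well position to its sampleid.
position_to_sample = {}
for record in csv.DictReader(io.StringIO(excerpt)):
    key = (record["plate"], record["row"], record["col"])
    position_to_sample[key] = record["sampleid"]

print(position_to_sample[("1", "C", "3")])
```

Including the plate in the key matters because the same row and column appear on every plate.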

**Q13.** Imagine that you now wanted to obtain the sampleids (i.e., the gene code id!) from a set of cells that were of interest, the 'hits' from the screen.

For that, imagine that you had a second file which collated all of the cells across all plates which had a normalized Z-score less than -5. e.g., it looked like this:

    row,col,plate
    C,3,1
    D,5,2
    P,6,7

Describe in words the steps that would allow a computer to print out the sampleids ONLY for the entries listed in this new file. To help you, we have provided the first two steps in the process. You complete the rest!
   * Be specific in the details of what you would check for during your look-up.
   * Hint: Pretend you had two sheets of paper, each with your lists, and you had to do this 'by hand'. What would you do, step-by-step?

### Prediction measurements and Machine Learning

**Q14.** In the context of machine learning problems, describe what is meant by the following terms:

- Features
- Examples
- Labels


**Q15.** Imagine data from two trained models applied to a test set:

Model 1:
- Correctly labelled true positive examples: 144
- Correctly labelled true negative examples: 1356
- True negative examples labelled as positive: 44
- True positive examples labelled as negative: 456

Model 2:
- Correctly labelled true positive examples: 551
- Correctly labelled true negative examples: 949
- True negative examples labelled as positive: 451
- True positive examples labelled as negative: 49

For calculations, we've inputted these numbers in for you per the following:

```
## provided in the homework
m1_tp = 144
m1_tn = 1356
m1_fp = 44
m1_fn = 456

m2_tp = 551
m2_tn = 949
m2_fp = 451
m2_fn = 49
```
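As a refresher on how counts like these combine into summary measures, here is a minimal Python sketch using the standard confusion-matrix definitions. The counts are made up and deliberately differ from the homework's values, and the function name `summarize` is hypothetical.

```python
def summarize(tp, tn, fp, fn):
    """Standard summary measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # all correct calls / all examples
        "sensitivity": tp / (tp + fn),    # true positives found / all true positives
        "specificity": tn / (tn + fp),    # true negatives found / all true negatives
        "precision": tp / (tp + fp),      # true positives / all predicted positives
    }


# Toy, made-up counts (not the homework's numbers).
toy = summarize(tp=8, tn=6, fp=2, fn=4)
for name, value in toy.items():
    print(name, round(value, 3))
```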

**Q15a.** (6 points) What is the accuracy of Model 1 and Model 2, respectively?

**Q15b.** (6 points) What proportion of true positive examples are correctly predicted in Model 1? Model 2?

**Q15c.** (6 points) What proportion of true negative examples are correctly predicted in Model 1? Model 2?

**Q15d.** (6 points) What proportion of true positive examples are correctly identified in Model 1? Model 2?

**Q15e.** (8 points) Based on the above, are these two models producing equivalent performance? Why or why not? Describe a situation where application of Model 1 would be preferable to Model 2; and conversely, a situation where application of Model 2 might be preferable to Model 1.