# In-Class: ENCODE

### Part 1: ENCODE on the Genome Browser
We will now start to look at ENCODE data. The quickest way to look at epigenetic marks in a specific region of the genome is through the UCSC Genome Browser.

*Task: You are interested in the gene SLC24A5 because of its association with albinism (Wei et al. 2013), and you would like to learn about its regulation using ENCODE data.*
1. Go to the UCSC Genome Browser at https://genome.ucsc.edu/cgi-bin/hgGateway
2. Using the hg19 assembly, enter the gene name SLC24A5 (coordinates chr15:48,413,169-48,434,589) into the search box.
3. Go to the "Regulation" group of tracks. Show whichever tracks you think will be relevant for answering the questions below. For help, refer to: https://genome.ucsc.edu/ENCODE/usageResources/ENCODE_QuickReferenceCard.pdf


1) Wei et al. showed that decreased levels of SLC24A5 are associated with albinism. You want to investigate this further by knocking out relevant transcription factors. Based on the UCSC Genome Browser, name two transcription factors that show evidence of binding to the gene and three that bind upstream of the gene.

2) Look at the DNase hypersensitivity information in the region directly upstream of this gene by loading the relevant ENCODE track (the ENCODE tracks have the black and white helix logo next to them). Can you find evidence of DNase hypersensitivity in a cell type relevant to this phenotype? If so, what is that cell type?

3) You want to determine whether the region upstream of this gene is an enhancer in any cell type. As you can see, it is very difficult to tell this quickly from the epigenetic marks alone. Luckily, there are tools that make computational predictions about the functional activity of a region by combining many epigenetic marks. One such tool is ChromHMM. Look at the ChromHMM tracks by loading the ENCODE histone modification track. Is there an enhancer directly upstream of this gene in any of the assayed cell types? If so, which one? Hint: Zoom out 1.5x to view the entire upstream region.

### Part 2: Using ENCODE like a Genomicist

The Genome Browser works well if you want to look at a few specific sites, but it is not a feasible way to investigate many sites across the entire genome. For that, we will download data from the ENCODE consortium and then analyze it using bedtools.

You are interested in understanding the epigenetic basis of the autoimmune disease Type I Diabetes. You are specifically interested in the possible role that B lymphocytes play in the disease. In order to study this connection, you want to download several data sets.

**Task:** Download data types of interest using the instructions below.

1. Go to the ENCODE website: https://www.encodeproject.org/
2. From the 'Data' tab on the top menu bar, select 'Experiment Search'.
3. Using the search filters on the left-hand side of the screen, find the data set: DNase-seq of B cell (Homo sapiens, female, adult 27 year) from Gregory Crawford's lab. To do this, click on the filters: Homo sapiens, adult, DNase-seq, Gregory Crawford.
4. Under "External resources", click the link to the data on GEO.
5. Under Download, right-click on the "(ftp)" link to copy the path to the **bed narrowPeak** formatted file.
6. Use wget in the terminal to download the file to CoCalc.
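
Before working with the downloaded file, it may help to know what a narrowPeak record looks like. The sketch below assumes the standard ENCODE narrowPeak layout (the six BED columns followed by signalValue, pValue, qValue, and peak). The `NarrowPeak` class, the `parse_narrowpeak_line` helper, and the example record are all hypothetical and made up for illustration; none of them come from the ENCODE file itself.

```python
from typing import NamedTuple


class NarrowPeak(NamedTuple):
    chrom: str
    start: int           # 0-based start coordinate
    end: int             # end coordinate, exclusive
    name: str
    score: int
    strand: str
    signal_value: float  # enrichment measure for the region
    p_value: float       # -log10 p-value, -1 if not assigned
    q_value: float       # -log10 q-value, -1 if not assigned
    peak: int            # summit offset from start, -1 if not called


def parse_narrowpeak_line(line: str) -> NarrowPeak:
    """Split one whitespace-delimited narrowPeak line into typed fields."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    return NarrowPeak(
        fields[0], int(fields[1]), int(fields[2]), fields[3],
        int(fields[4]), fields[5],
        float(fields[6]), float(fields[7]), float(fields[8]), int(fields[9]),
    )


# Made-up record for illustration only; it is not taken from the ENCODE data.
example = "chr1\t1000\t1150\t.\t0\t.\t12.5\t-1\t3.2\t75"
peak = parse_narrowpeak_line(example)
print(peak.chrom, peak.start, peak.end, peak.end - peak.start)
```

For the overlap analysis that follows, only the first three columns (chromosome, start, end) matter.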


**Task:** Let's start with a simple experiment. Are SNPs associated with Type I Diabetes more likely to be in DNase hypersensitive sites than random SNPs in the genome?

1. Use the bedtools intersect tool (http://bedtools.readthedocs.org/en/latest/content/tools/intersect.html) to measure the overlap between the disease-associated SNPs (found in T1D_SNPs.bed) and the DNase hypersensitive regions. Intersect reports the regions shared between two BED-formatted files. The -a and -b flags specify the two input file names.
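
To make the idea of an intersection concrete, here is a minimal Python sketch of the overlap rule on toy data. It is not bedtools itself: the `intersect` function, the `Interval` alias, and the coordinates are hypothetical, and the sketch assumes BED conventions (0-based start, exclusive end) and one reported hit per overlapping pair.

```python
from typing import List, Tuple

# (chrom, start, end) with a 0-based start and an exclusive end, as in BED
Interval = Tuple[str, int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping portion for every overlapping (a, b) pair."""
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            same_chrom = chrom_a == chrom_b
            # Two half-open intervals overlap when each starts before the other ends.
            if same_chrom and start_a < end_b and start_b < end_a:
                hits.append((chrom_a, max(start_a, start_b), min(end_a, end_b)))
    return hits


# Toy data, invented for illustration: three SNPs and two hypersensitive regions.
snps = [("chr1", 105, 106), ("chr1", 500, 501), ("chr2", 105, 106)]
peaks = [("chr1", 100, 200), ("chr2", 300, 400)]

hits = intersect(snps, peaks)
print(hits)       # [('chr1', 105, 106)]
print(len(hits))  # 1
```

Only the first SNP falls inside a region on the same chromosome. The nested loop is fine for a toy example but far too slow for genome-scale files, which is one reason to use bedtools for the real analysis.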


<font color="red">Remember, you need to start a terminal window to use bedtools.</font>


**Question 1a.** How many SNPs are in T1D_SNPs.bed?

**Question 1b.** What is the command to find the number of intersections?

*Hint: You will need to use bedtools to identify all of the intersections and then pipe that output into a command that counts the total number of intersections.*

**Question 1c.** Now, run the command. How many overlaps are there? 

Next, we want to test whether this overlap is greater than expected by chance. To do this, we will use simulations to estimate how likely it is to see an overlap this large by chance.

To do these simulations, we are going to generate a random set of SNPs the same size as our set of interest using the bedtools random function.
The -l flag specifies the length of each region (in this case 1, since we are generating single nucleotide variants), the -n flag specifies the number of random regions to generate, and the -g flag specifies a file describing the structure of the genome (the chromosome names and their lengths). The simulated regions can then be written to a file using ">".

bedtools random -l 1 -n 836 -seed **x** -g human.hg19.genome > Random_**x**

Do this 10 times, replacing x with the seeds 1 through 10. Specifying a seed makes the random number generator produce the same sequence of numbers on every run, which keeps research reproducible when software relies on randomness. So, your first command would be:


bedtools random -l 1 -n 836 -seed **1** -g human.hg19.genome > Random_**1**
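
The effect of a seed can be seen in a small Python sketch. This illustrates seeding in general and does not reproduce how bedtools samples positions; the `draw_positions` helper and the chromosome length are hypothetical.

```python
import random
from typing import List


def draw_positions(seed: int, n: int, chrom_length: int) -> List[int]:
    """Draw n random 0-based positions using a generator fixed by the seed."""
    rng = random.Random(seed)  # independent generator, so runs do not interfere
    return [rng.randrange(chrom_length) for _ in range(n)]


first = draw_positions(seed=1, n=5, chrom_length=1_000_000)
again = draw_positions(seed=1, n=5, chrom_length=1_000_000)
other = draw_positions(seed=2, n=5, chrom_length=1_000_000)

print(first == again)  # True: the same seed reproduces the same draws
print(first == other)  # a different seed gives a different set of draws
```

Using ten different seeds therefore gives ten different random SNP sets, each of which can be regenerated exactly by anyone who reruns the command.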

**Question 2.** For each random replicate, run the same command as in Question 1b (this time using Random_x instead of T1D_SNPs.bed) to find the number of intersections.

After each run, write the number below.

1. 
2.  
3. 
4. 
5. 
6. 
7. 
8. 
9. 
10. 
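
Once the ten counts are recorded, one common way to summarize them is an empirical p-value: the fraction of simulations with an overlap at least as large as the observed one, with 1 added to both numerator and denominator so the estimate is never exactly zero. The sketch below assumes that convention. The `empirical_p_value` helper is hypothetical, and the numbers are placeholders invented for illustration, not results from this exercise.

```python
from typing import List


def empirical_p_value(observed: int, simulated: List[int]) -> float:
    """Share of simulated counts >= observed, with a +1 correction."""
    at_least_as_large = sum(1 for count in simulated if count >= observed)
    return (at_least_as_large + 1) / (len(simulated) + 1)


# Placeholder values, invented for illustration; substitute your own counts.
observed_overlap = 40
random_overlaps = [12, 9, 15, 11, 10, 14, 8, 13, 12, 10]

p = empirical_p_value(observed_overlap, random_overlaps)
print(round(p, 3))  # 0.091
```

With only 10 simulations the smallest value this estimate can take is 1/11, so a stronger claim about significance would need many more random replicates.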

**Question 3.** Does the overlap with the original SNP set still seem significant?

**Question 4.** Should this convince you that DNase hypersensitivity in B cells is causing these SNPs to be associated with Type I Diabetes? Hint: Think about confounding factors.