# Homework Set - II

Each question/subquestion is worth **2 points**, unless indicated otherwise.

### Reproducibility \+ RNA\-seq



You are a thoughtful researcher investigating antitumor immunity in human cancer. You have dived into the literature and found an interesting RNAseq dataset from Javitt, Shmueli, and colleagues. In this paper, they identified a proteasome regulator called PSME4. Using a human cancer cell line, they knocked down expression of PSME4 in one group of cells and kept a second group of cells without the PSME4 knockdown, then collected the cells and performed RNAseq. Thus, there are two conditions: PSME4 knockdown (PSME4_KD) and a control (Ctrl_KD).

You will be doing a basic analysis of the `genes.table.counts` dataset generated from DESeq2 analysis. To make sure someone else can reproduce your analysis and results, you want to generate a report with the code and the output results. You will do so by making your own Jupyter notebook, using R.

**Q1.** **[4 points]** First, create a new notebook analogous to what you did in Module 12.

- Create it in your HW2 folder, name it `my_hw2_notebook`, and set the kernel to R.
- Add a new section called "Homework", and include an appropriate header describing what is contained in this notebook. 
- Add your code and answers to the questions below to the Homework section. 
- Make sure the kernel is set to R (System-wide).

NOTE: You will write your answers to Q1-6 in the new notebook you just created; you don't need to copy them into this notebook.

**Q2.** **[3 points]** Create new cells which contain R code "chunks" that perform the following tasks:

- First, read in the `genes.table.counts` file and store it in a variable.
- Make the `Genes` column the row names of the table. Then, remove the `Genes` column.
- Use the `head()` function to print out the first few rows of the table and check that the table loaded correctly.
- What are the columns? Write your answer in a markdown cell in your new notebook.



**Q3.** We are interested in the regulatory effects of PSME4, so we decide to run DESeq2 to compare the knockdown group with the control group. Follow the steps below to perform the analysis.

**Q3a. \[5 points\]** Create a DESeq2 object.

Unlike the in\-class activities, here we will learn to create a `DESeqDataSet` from a matrix. Write code to perform the following steps:

- Load DESeq2
- From `genes.table.counts`, extract the columns of raw counts \(since DESeq2 requires raw counts\), and name the new dataframe `raw_counts`
- Create a dataframe specifying the condition for each sample. The code is provided as follows: 

```R
colData <- data.frame(
    condition = factor(c("Control", "Control", "KD", "KD")),
    row.names = colnames(raw_counts)
)
```

- Create the `DESeqDataSet`: use the `raw_counts` matrix and the `colData` dataframe to create a `DESeqDataSet` object named `dds`, specifying the design formula \(`~condition`\). Use the function `DESeqDataSetFromMatrix`. You can learn how to use the function by running `?DESeqDataSetFromMatrix`.



**Q3b. \[8 points\]** We will then run the DESeq2 pipeline as we did in the in\-class activity RNASeq\_II. Write code to perform the following steps:

- Run the DESeq2 pipeline
- Extract the results comparing KD against Control.
- Sort the results by `padj` and print the top hits with `head()`
- Use `results()` and `summary()` to report how many genes pass the significance thresholds: an adjusted p\-value &lt; 0.05 and a log2 fold change &gt; 0.585 or &lt; \-0.585 \(roughly a 1.5\-fold change in either direction\)



**Q4.** **\[5 points\]** Next, in DESeq2 analysis, you can create an MA plot. This plot helps us check 1\) if our normalization strategy works and 2\) if we have evidence of differentially expressed genes. Another common visualization in RNAseq analysis is the volcano plot. In volcano plots, the y\-axis is the negative logarithm of the adjusted p\-values, and the x\-axis is the log2 Fold Change values.

- Using `ggplot` or `plot()`, create a scatter plot using log of `baseMean` on the x axis and the log2 Fold Change on the y\-axis. Add labels for your x\-axis and y\-axis. Label the title as "MA Plot". 
- In your MA plot, are your points clustered around y = 0? Write your answer in a markdown cell in your new notebook.
- Using `ggplot` or `plot()`, create a volcano plot. Label x\-axis and y\-axis, and label the title as "Volcano Plot".



**Q5.** **[2 points]** Finally:

- Create a new cell and call the `sessionInfo()` command.

This command gives you a snapshot of the versions of libraries and various tools that you have installed (and run). 
Useful for reproducibility!

**Q6.** **[2 points]** Now:

- Make sure you hit the save button on your jupyter notebook!
- Download your ipython notebook file as an html file. To accomplish this, go to the "File" menu and select "Download as", and select HTML.
- Copy the HTML back to your homework project directory.

Try downloading this onto your own computer! Open it up in a web browser. As you can see, it looks very similar to the code in the Jupyter notebook, except that you cannot edit it.

.html files can be opened by any web browser. Therefore, this html file can be useful for sharing your results with anyone who does not have Jupyter installed on their computer. You can also download Jupyter notebooks as .pdf files if you have Jupyter installed on your own computer. However, the pdf converter does not always work on CoCalc.

### Working in the UNIX environment

In the questions below, write the UNIX command that you used to accomplish the task.

**Q7a.** **[2 points]** Move to the directory called `assignment_directory`

**Q7b.** **[2 points]** Now list the files in this directory:

**Q7c.** **[2 points]** See how there is one file here, and a directory called `output/`. 

- Do a listing of this directory to see what is in that directory (if anything).

**Q8.** **[2 points]** So we see that nothing is in the output directory, and so our job will be to fill this in ourselves! First, let's get a sense of what the data looks like. Write a command to look at the file, `yelp_academic.txt`.

**Q9.** **[3 points]** This file contains information on Philly restaurants from Yelp. The data was extracted from here (https://github.com/blakecbartlett/Philly-Food-Map). Now, with this file, write a command to:

- remove the header line
- extract the category of food
- sort the category of food in descending order
- count how many times each unique food category appears (remember you can use the man pages if you forget a certain flag).
- sort the category of food in ascending order in terms of number of each type
- Finally, send the output of this command to a file called `output/cat_counts.txt`
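
The general shape of a pipeline like this can be sketched on a small made-up file. Everything below is an assumption for illustration: the `toy_demo/` scratch directory, the tab delimiter, and the column layout (name, category, rating) are hypothetical and may not match `yelp_academic.txt`, so check the real file before adapting it.

```shell
# Build a toy tab-delimited file with a header line (hypothetical layout).
mkdir -p toy_demo
printf 'name\tcategory\trating\n' > toy_demo/toy.txt
printf 'A\tPizza\t4.0\nB\tThai\t4.5\nC\tPizza\t3.5\n' >> toy_demo/toy.txt
printf 'D\tPizza\t5.0\nE\tThai\t4.0\nF\tVegan\t5.0\n' >> toy_demo/toy.txt

# tail -n +2 : start printing at line 2, which skips the header
# cut -f 2   : keep the second tab-delimited column
# sort -r    : put identical values next to each other (reverse order)
# uniq -c    : collapse adjacent duplicates and prefix each with its count
# sort -n    : order the lines numerically by that count (ascending)
tail -n +2 toy_demo/toy.txt | cut -f 2 | sort -r | uniq -c | sort -n > toy_demo/toy_counts.txt

cat toy_demo/toy_counts.txt
```

Note that `uniq` only collapses duplicates that sit on adjacent lines, which is why a `sort` has to come before it.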

**Q10.** **[2 points]** Write a command to print this file.

**Q11.** **[2 points]** What's the least common category of food to have? How many restaurants have this type of food?

Insert answer here:

**Q12.** **[3 points]** Now, let's look at combinations of food categories and yelp ratings for restaurants with high reviews. Write a command to 
- remove the header line
- extract both the food category and yelp rating from the file
- remove restaurants with a rating of 3.5
- sort the extracted combinations in descending order
- count how many times each unique combination of food category and rating appears
- sort the counts in descending order by the rating only
- and send this to `output/combo_cat_counts.txt`

(Hint: you may need to review the manual for `grep` to accomplish this task.)
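
As a rough illustration of filtering lines out with `grep`, here is a sketch on made-up data. The `toy_demo/` directory and the two-column layout (category, rating) are assumptions, not the layout of the real file, and the final sorting step from the list above is left out.

```shell
# Toy two-column file: category and rating, tab-delimited (hypothetical).
mkdir -p toy_demo
printf 'Pizza\t4.0\nThai\t4.5\nPizza\t3.5\n' > toy_demo/pairs.txt
printf 'Pizza\t5.0\nVegan\t5.0\nPizza\t5.0\n' >> toy_demo/pairs.txt

# grep -v prints only the lines that do NOT match the pattern.
# -F treats the pattern as a literal string, so "." is not a wildcard.
grep -v -F '3.5' toy_demo/pairs.txt | sort -r | uniq -c > toy_demo/pair_counts.txt

cat toy_demo/pair_counts.txt
```

Without `-F` (or an escaped `\.`), the pattern `3.5` would also match strings such as `345`, because `.` matches any single character in a regular expression.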

**Q13.** **[2 points]** What's the most common 5 star food category? How many 5 star restaurants have that food category?

Insert answer here:



For the next part of this assignment, we are going to get some more practice with `grep` and file compression. For this, we will be using a data set from the city of Philadelphia about bike trails. This data set is pulled from here:

https://opendataphilly.org/datasets/existing-trails/

**Q14.** **[3 points]** Move back into the main directory for this homework, make a folder called `homework_data/`, and move into it.

**Q15.** **[4 points]** Copy the `Existing_Trails.csv` file from `assignment_directory` to `homework_data`.

Then, print the first 10 lines of this file to see what it looks like.

Notice how this is a comma separated file, containing information about bike trails in Philadelphia. Each row contains information about a given trail such as the trail name, the material of the trail, and its mileage. 

**Q16.** **[4 points]** Give a command to 
- cut out the 7th column of this file (remember that you need to specify that it is comma-delimited)
- count how many different, unique trails there are in this column (remember you can pipe results to the wc command to get the number of lines)

*Hint: Check to see if the file has a header. If so, don't forget to account for this extra line.* 
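
A minimal sketch of the same idea on a made-up CSV, assuming a header line and a column of interest in position 3 (the file name, column names, and values are all hypothetical):

```shell
# Toy comma-separated file with a header line.
mkdir -p toy_demo
printf 'id,surface,name\n' > toy_demo/toy.csv
printf '1,paved,River Path\n2,gravel,Hill Loop\n' >> toy_demo/toy.csv
printf '3,paved,River Path\n4,paved,Creek Trail\n' >> toy_demo/toy.csv

# cut -d ',' sets the field delimiter; -f 3 picks the third field.
# tail -n +2 drops the header so it is not counted as a value.
# sort -u keeps one copy of each value; wc -l counts the lines that remain.
cut -d ',' -f 3 toy_demo/toy.csv | tail -n +2 | sort -u | wc -l
```

One caveat: `cut` splits on every comma, so a field that contains a quoted comma would be split incorrectly. It is worth scanning the real file for such fields before trusting the count.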

**Q17.** **[4 points]** Now give a command to count how many times each trail appears in the 7th column. How many trails are displayed 3 times in the dataset?

The three trails with 3 occurrences in the dataset are: Schuylkill Banks Boardwalk, Falls Bridge Sidepath, Cobbs Creek Connector

**Q18.** **[2 points]** Now use `grep` to find all the entries that contain 'Schuylkill Banks Boardwalk'.

**Q19.** **[2 points]** Next,

- write this to a file called `schuylkill_banks_boardwalk_trails.txt`
- compress this into a .tar.gz file called `schuylkill_banks_boardwalk_trails.tar.gz`
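
For reference, creating and inspecting a gzip-compressed tar archive might look like the sketch below. The file names and the `toy_demo/` directory are placeholders rather than the names used in this assignment.

```shell
# A small placeholder file to archive.
mkdir -p toy_demo
printf 'River Path,paved\nRiver Path,gravel\n' > toy_demo/matches.txt

# -c create an archive, -z compress it with gzip, -f name the archive file.
# The subshell changes into toy_demo first, so the archive stores
# "matches.txt" rather than "toy_demo/matches.txt".
(cd toy_demo && tar -czf matches.tar.gz matches.txt)

# -t lists the archive contents without extracting anything.
tar -tzf toy_demo/matches.tar.gz
```

Listing the archive with `-t` is a quick way to confirm that it contains what you expect before sharing or deleting the original.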

### Analyzing ChIP-seq Results

In the following section we ask questions related to the ChIP-seq analysis modules. All the data needed to answer these questions are included in the `chip-seq_data` folder.

**Q20.** **[2 points]** From the folder where you ended the last section, give and run the commands to navigate into the `chip-seq_data` folder and identify how many peaks there are in the file of p53 peaks under DMSO treatment, which we used in the first ChIP-Seq module.

**Q21.** **[2 points]** Next, give a command to:
1. Pull out the first column of the bed file
2. Sort it
3. Display how many peaks there are on each chromosome.
4. Sort it in descending order

**Q22.** **[2 points]** Next, upload this file to the UCSC genome browser (you can use the same uploading approach as for the bedGraph file). Then, navigate to the genomic interval chr1:76204325-76204623 and look at the peak overlap. Which gene does this peak overlap with? (Please use the hg19 genome build.)

**Q23.** **[2 points]** Now go back to the p53-input.DMSO_peaks.narrowPeak file. Which peak name (in the fourth column) does this locus correspond to?

**Q24.** **[2 points]** What are the -log10 values of the p- and q-value for this peak?

**Q25.** **[2 points]** Now, we will look at another peak in the region chr15:63,449,045-63,449,917. We have provided a bedgraph file called p53_Nutlin.sub.bedgraph of some of the reads from the p53 condition around this region. Next, upload this file to the genome browser and look at this bedgraph file next to the called peak, making sure you're at the chr15:63,449,045-63,449,917 genomic interval. Do the raw read patterns seem to support the called peak?

**Q26.** **[2 points]** Short answer: We will continue examining the peak at chr15:63,449,045-63,449,917. Where along the gene does this peak for the p53_Nutlin.sub.bedgraph file seem to fall? Does it seem to differ between isoforms (make sure to look at the UCSC genes)? Given that this data reflects binding of p53, a transcription factor, what might this mean about the relationship between p53 and the gene?

**Q27.** **[2 points]** Short answer: The UCSC genome browser displays several tracks related to gene regulation. Display the ENCODE regulation panel. Which histone mark overlaps well with the region surrounding the peak identified above? Is this mark typically associated with activation or repression of genes?

**Q28.** **[2 points]** Short answer: Examine the MultiZ vertebrate alignment track in the UCSC genome browser for this region. Which organism among Rhesus, Mouse, Dog, Elephant, Chicken, X. tropicalis, and Zebrafish has the most conservation in the region of the ChIP-seq peak?

**Q29.** **[2 points]** Give the bedtools intersect call to find the number of p53 peaks in the Nutlin condition that are present in the DMSO condition:

**Q30.** **[4 points]** What fraction of Nutlin p53 peaks is present in DMSO?
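
Once you have a file of shared peaks, turning two line counts into a fraction can be sketched as follows. The two BED files here are made up, and the sketch assumes each shared peak appears only once in the shared file; if a peak could be reported several times, de-duplicate before counting.

```shell
mkdir -p toy_demo
# Four hypothetical peaks in one condition...
printf 'chr1\t100\t200\nchr1\t500\t600\n' > toy_demo/all_peaks.bed
printf 'chr2\t50\t150\nchr3\t10\t90\n' >> toy_demo/all_peaks.bed
# ...of which two are assumed to have been reported as shared.
printf 'chr1\t100\t200\nchr2\t50\t150\n' > toy_demo/shared_peaks.bed

# Reading from stdin makes wc -l print only the count, not the file name.
shared=$(wc -l < toy_demo/shared_peaks.bed)
total=$(wc -l < toy_demo/all_peaks.bed)

# Shell arithmetic is integer-only, so awk handles the division.
awk -v s="$shared" -v t="$total" 'BEGIN { printf "%.2f\n", s / t }' > toy_demo/fraction.txt

cat toy_demo/fraction.txt
```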

**Q31.** **[4 points]** hg19 was the previous reference genome for the human genome, but there are newer versions of the human genome now. The most commonly used current reference genome (as of 2024) is hg38, although the recent Telomere to Telomere (T2T) Consortium has created a new reference that covers additional regions in the centromeres and telomeres across the human genome.

The bed files are in hg19, but we would like to see where they are located in hg38, since this is the more commonly used reference genome. Using the liftOver function (read instructions here: https://genome.ucsc.edu/cgi-bin/hgLiftOver) in UCSC, please liftover p53-input.DMSO_peaks.narrowPeak from hg19 to hg38.

Hints: 
1. You will notice that the liftover function can take as input the bed file, but it needs a specific format, which is outlined in its specifications.
2. The results of the conversion can be obtained by going to the "Results" tab and clicking "View Conversions" to download a bed file. 
3. This downloaded bed file can be visualized using Unix commands. 

With this lifted-over bed file, answer the following question.

1. **What is the new location of the region chr1:121484534-121485164 from hg19 in hg38?**

**Q32.** **[4 points]** The liftOver function can lift over between different reference genomes of the same species, but it can also lift over to other species. Try lifting over the same set of input peaks from p53-input.DMSO_peaks.narrowPeak to the mouse reference genome mm10.

Then answer this question:
1. **How many of the peaks failed to liftOver**? Provide a reason for why they could have failed to lift over. 