# Networks HW

The goal of this homework is to gain hands-on experience working with real biological networks, and to start thinking about how your choice of network might affect your inferences about the biology you are studying.

In this assignment we will focus on comparing Mendelian disease genes, genes that are frequently somatically mutated in cancer, and genes that are neither. We would like to understand whether cancer genes are similar to Mendelian disease genes as has been previously suggested in the literature: 
> Torkamani, Ali, and Nicholas J. Schork. "Prediction of cancer driver mutations in protein kinases." Cancer Research 68.6 (2008): 1675-1682.

# Instructions
Answer the following questions in your own words using a Jupyter notebook and upload a PDF of your code and written answers to Gradescope. This assignment is due **3/3/22 at 9:00AM**.

See this document for instructions on creating a Jupyter notebook and submitting it to Gradescope:

https://docs.google.com/document/d/1VzYYsY_IvQP_HvnulPal4t49rBGC0wfZBAMLQDFXRAo

* You can copy all necessary files for this homework to your home directory with the following command:
```bash
cp -r /datasets/bg237-wi22-a00-public/cmm262-2022/hw/hw3 ~/
```

* Please write your code directly in this notebook, and add your written answers in markdown cells in the same notebook.

**Hint**: All commands needed to complete this homework can be found in the exercise notebooks completed in class!

In [None]:
# Load libraries for network analysis
suppressMessages(library(igraph))

In [None]:
# Read in two networks
# First, a binary protein interaction network constructed from an unbiased yeast two-hybrid (Y2H) experimental screen
Y2H <- read.table(file="data/Networks/HI-II-14.tsv",header=TRUE,sep="\t")
head(Y2H)

In [None]:
# Second, a literature-curated network of high-confidence protein interactions
lit <- read.table(file="data/Networks/Lit-BM-13.tsv",header=TRUE,sep="\t")
head(lit)
# These networks are hosted here: http://interactome.dfci.harvard.edu/H_sapiens/index.php

In [None]:
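# Build undirected igraph objects from the two edge lists
edgelist <- cbind(as.character(Y2H$Symbol.A),as.character(Y2H$Symbol.B))
edgelist2 <- cbind(as.character(lit$symbol_a),as.character(lit$symbol_b))
# graph_from_data_frame() is the current name for the older graph.data.frame()
g <- graph_from_data_frame(edgelist, directed=FALSE)
g2 <- graph_from_data_frame(edgelist2, directed=FALSE)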

# Question 1 (6 points)

In this homework, we will investigate similarities and differences between networks generated by a systematic screen and networks generated by literature curation.

In parts a, b, and e, please provide answers for **both** graphs.

#### 1a) How many nodes? (1 point)

#### 1b) How many edges? (1 point)

#### 1c) Get a list of unique node names (1 point)
Hint: use `names()` with the solution to `1a`.

1. How many nodes are shared between the two graphs?
2. How many nodes are in g but not g2?
3. How many nodes are in g2 but not g?
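
As a small illustration of the operations involved in parts a-c, the sketch below uses two toy graphs rather than the homework data. The graphs `toy1` and `toy2` and their node labels are made up for this example; only the igraph calls and base R set functions carry over.

```r
suppressMessages(library(igraph))

# Two toy undirected graphs with hypothetical node labels (not the homework data)
toy1 <- graph_from_data_frame(
  data.frame(a = c("A", "B", "C"), b = c("B", "C", "D"), stringsAsFactors = FALSE),
  directed = FALSE)
toy2 <- graph_from_data_frame(
  data.frame(a = c("C", "D"), b = c("D", "E"), stringsAsFactors = FALSE),
  directed = FALSE)

# Node and edge counts
vcount(toy1)
ecount(toy1)

# Node names, then base R set operations to compare the two graphs
names1 <- names(V(toy1))
names2 <- names(V(toy2))
shared <- intersect(names1, names2)  # in both graphs
only1  <- setdiff(names1, names2)    # in toy1 but not toy2
only2  <- setdiff(names2, names1)    # in toy2 but not toy1
```

Note that `setdiff()` is not symmetric, so the order of its arguments determines which graph's private nodes are returned.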

#### 1d) Compare the node degree distributions of the two graphs. Do they both follow a power-law distribution? Plot the degree distributions, perform the appropriate test, and report the p-values. Draw a conclusion from your test results using $\alpha = 0.05$. (1 point)
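
One possible way to approach a power-law test is sketched below on a simulated graph, not on the homework networks. It assumes igraph's `fit_power_law()` is an acceptable test here; the class notebooks may use a different approach. The graph `toy` and the seed are arbitrary choices for illustration.

```r
suppressMessages(library(igraph))

# Simulated preferential-attachment graph (illustration only)
set.seed(1)
toy <- sample_pa(1000, m = 2, directed = FALSE)
deg <- degree(toy)

# Degree distribution on log-log axes; entry i of dd corresponds to degree i - 1
dd <- degree_distribution(toy)
k <- seq_along(dd) - 1
keep <- dd > 0 & k > 0
plot(k[keep], dd[keep], log = "xy",
     xlab = "Degree", ylab = "Fraction of nodes")

# Fit a power law to the degrees
fit <- fit_power_law(deg)
fit$alpha   # fitted exponent
names(fit)  # shows which components this igraph version returns
fit$KS.p    # Kolmogorov-Smirnov p-value, if this version reports it
```

In this test the null hypothesis is that the data could have been drawn from the fitted power law, so a p-value below 0.05 argues against a power law rather than for one. Depending on the installed igraph version, `KS.p` may be missing unless a p-value is explicitly requested; check the function's help page if it comes back `NULL`.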

#### 1e) Finally, determine the diameter of each graph. (1 point)
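
Real interaction networks are often split into several connected components, which affects how the diameter is defined. The toy example below (hypothetical graphs, not the homework data) shows igraph's default behavior, where the longest shortest path is taken within components.

```r
suppressMessages(library(igraph))

# A 6-node ring: the two most distant nodes are 3 steps apart
ring <- make_ring(6)
diameter(ring)

# Add a separate 2-node component; the graph is now disconnected
toy <- disjoint_union(ring, make_full_graph(2))
components(toy)$no

# With the default unconnected = TRUE, the diameter is the longest
# finite shortest path found in any component
diameter(toy)
```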

#### 1f) Briefly state similarities and differences between properties of the two graphs based on parts a-e. (1 point)

# Question 2 (6 points)
Evaluate coverage of different gene sets in the graph.

**Hint**: Use this syntax to get the nodes in the graph that are also in a list of interesting genes:
```r
nodesinlist <- nodenames[which(nodenames %in% genelist)]
```

Negating the test with `!` gives the names **not** in the list. This is safer than `-which(...)`, which returns an empty vector when no names match.
```r
nodesnotinlist <- nodenames[!(nodenames %in% genelist)]
```
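
The behavior of these subsetting idioms can be checked on a few made-up values. The gene symbols below are placeholders chosen for illustration, not taken from the homework files.

```r
# Hypothetical node names and gene list
nodenames <- c("TP53", "BRCA1", "CFTR", "GAPDH")
genelist  <- c("TP53", "BRCA1", "KRAS")

# Names that are in the list
in_list <- nodenames[nodenames %in% genelist]

# Names that are not in the list, using logical negation
not_in_list <- nodenames[!(nodenames %in% genelist)]

# Pitfall: when nothing matches, which() returns integer(0),
# and negative indexing with it drops everything instead of nothing
no_matches <- character(0)
dropped_all <- nodenames[-which(nodenames %in% no_matches)]
kept_all    <- nodenames[!(nodenames %in% no_matches)]

length(in_list)
length(not_in_list)
length(dropped_all)
length(kept_all)
```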

In [None]:
# Load in disease gene lists
mend <- scan("data/OMIM/Mendelian_HGNC.txt",what=as.character())
cancer <- scan("data/Cancer/cancer_genes.2_sources.txt",what=as.character())

#### 2a) How many Mendelian disease genes are in the **Y2H** graph? (1 point)

#### 2b) How many cancer genes are in the **Y2H** graph? (1 point)

#### 2c) How many nodes in the **Y2H** graph are neither cancer nor Mendelian disease genes? (1 point)

#### Now, repeat the same statistics for the **literature-based** graph.

#### 2d) How many Mendelian disease genes are in the **literature-based** graph? (1 point)

#### 2e) How many cancer genes are in the **literature-based** graph? (1 point)

#### 2f) How many nodes in the **literature-based** graph are neither cancer nor Mendelian disease genes? (1 point)

# Question 3 (10 points)
Compare graph measures between Mendelian disease genes, cancer genes, and non-disease genes in the Y2H network.

**Hint**: You want one number per gene in each group. Boxplots are a good way to compare distributions. If a method you run here produces warnings, that's OK.

We recommend first defining a function called `create_boxplot()` and then calling that function in each of the subsequent parts of this question.
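
One possible shape for such a helper is sketched below on a toy graph. The argument names (`values`, `genes`, `mend_genes`, `cancer_genes`, `ylab`) and the toy gene lists are assumptions made for this example; your own function may reasonably look different. In this sketch a gene that appears in both lists is counted in both groups.

```r
suppressMessages(library(igraph))

# Hypothetical helper: one boxplot per gene class for a per-node measure.
# values: one number per node; genes: node names in the same order.
create_boxplot <- function(values, genes, mend_genes, cancer_genes, ylab) {
  groups <- list(
    Mendelian = values[genes %in% mend_genes],
    Cancer    = values[genes %in% cancer_genes],
    Neither   = values[!(genes %in% c(mend_genes, cancer_genes))]
  )
  boxplot(groups, ylab = ylab)
  invisible(groups)  # return the grouped values for inspection
}

# Toy graph and toy gene lists (illustration only)
toy <- graph_from_data_frame(
  data.frame(a = c("A", "A", "A", "B", "D"),
             b = c("B", "C", "D", "C", "E"),
             stringsAsFactors = FALSE),
  directed = FALSE)
toy_mend   <- c("A", "B")
toy_cancer <- c("B", "C")

res <- create_boxplot(degree(toy), names(V(toy)), toy_mend, toy_cancer, "Degree")
```

The same call accepts any per-node vector in vertex order, for example `transitivity(toy, type = "local")`, `closeness(toy)`, or `betweenness(toy)`. Passing the node names separately avoids relying on the measure itself returning a named vector.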

#### 3a) Plot degree distributions for each class of gene (1 point)

#### 3b) Plot the clustering coefficient distribution for each class of gene (1 point)

#### 3c) Plot closeness centrality for each class of gene (1 point)

#### 3d) Plot betweenness centrality for each class of gene (1 point)

#### Now repeat for the literature-curated network.

#### 3e) Plot degree distributions for each class of gene (1 point)

#### 3f) Plot the clustering coefficient distribution for each class of gene (1 point)

#### 3g) Plot closeness centrality for each class of gene (1 point)

#### 3h) Plot betweenness centrality for each class of gene (1 point)

#### 3i) Do your conclusions about the properties of these different classes of genes change when you use different networks? (2 points)

# Question 4 (4 points)

Next, compare enrichment for 4-node motifs in the Y2H network versus the literature-based network.

**Hint**: There are 6 unique 4-node motifs in an undirected graph when the subgraph must be connected, and 11 in total when it does not have to be connected.
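
The counts in this hint can be checked directly in igraph, which numbers the undirected 4-node isomorphism classes from 0 to 10. The sketch below is one way to enumerate and draw them, and to count motifs on a toy graph; the complete graph `toy` is an arbitrary stand-in for the homework networks.

```r
suppressMessages(library(igraph))

# Build one representative graph for each 4-node isomorphism class
classes <- lapply(0:10, function(i) {
  graph_from_isomorphism_class(4, i, directed = FALSE)
})
connected <- sapply(classes, is_connected)
length(classes)  # total number of classes
sum(connected)   # number of connected classes

# Draw all classes in a grid, titled by class number
op <- par(mfrow = c(3, 4), mar = c(1, 1, 2, 1))
for (i in seq_along(classes)) {
  plot(classes[[i]], main = paste("Class", i - 1), vertex.label = NA)
}
par(op)

# Count 4-node motifs on a toy graph: one entry per class, in class order.
# In a complete graph on 5 nodes, every 4-node subgraph is itself complete.
toy <- make_full_graph(5)
counts <- motifs(toy, size = 4)
counts
```

Entries of `counts` that correspond to unconnected classes are not counted as motifs by igraph, so they should be ignored when comparing graphs.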

#### 4a) Visualize the possible 4-node motifs for both connected and unconnected graphs. (1 point)

#### 4b) Count the number of motifs in each graph. (1 point)

#### 4c) Do the graphs differ in the number of motifs? Which motifs are more common in the Y2H network? Which are more common in the literature-derived network? Why might that be? (2 points)