# Networks HW

The goal of this homework is to gain hands-on experience working with real biological networks, and to start thinking about how your choice of network might affect your inferences about the biology you are studying.

In this assignment we will focus on comparing Mendelian disease genes, genes that are frequently somatically mutated in cancer, and genes that are neither. We would like to understand whether cancer genes are similar to Mendelian disease genes, as has been previously suggested in the literature:

Torkamani A, Schork NJ. Prediction of cancer driver mutations in protein kinases. Cancer Research. 2008;68(6):1675–82. PMID: 18339846

## Instructions:
* Please write your code directly into this notebook.
* Please add your written answers directly into this notebook as well, as comments in the code cells.

* Please save your Jupyter notebook as a PDF (save it as HTML, then print the HTML page to PDF) and submit it to GradeScope.

* **Hint**: all commands needed to complete this homework can be found in the exercise notebooks completed in class.

### The data for this homework can be found at `/oasis/tscc/scratch/biom200/cmm262/Module_6/Homework`

### Copy the data into your home directory.
`cp -r /oasis/tscc/scratch/biom200/cmm262/Module_6/Homework ~/Networks_HW`

### Make sure to enter your Networks environment:
`source activate cmm262-networks`

### Start a Jupyter notebook and complete the assignment

In [None]:
# Load libraries for network analysis
library(igraph)

In [None]:
# Read in two networks
# First, a binary protein interaction network constructed from an unbiased yeast two-hybrid (Y2H) experimental screen
Y2H <- read.table(file = "~/Networks_HW/Networks/HI-II-14.tsv", header = TRUE, sep = "\t")
head(Y2H)
# Second, a literature-curated network of high-confidence protein interactions
lit <- read.table(file = "~/Networks_HW/Networks/Lit-BM-13.tsv", header = TRUE, sep = "\t")
head(lit)
# These networks are hosted here: http://interactome.dfci.harvard.edu/H_sapiens/index.php

In [None]:
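# Build undirected igraph objects from the two edge lists
# g  = Y2H (systematic screen) network, g2 = literature-curated network
edgelist <- cbind(as.character(Y2H$Symbol.A), as.character(Y2H$Symbol.B))
g <- graph_from_data_frame(edgelist, directed = FALSE)
edgelist2 <- cbind(as.character(lit$symbol_a), as.character(lit$symbol_b))
g2 <- graph_from_data_frame(edgelist2, directed = FALSE)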

In [None]:
# In this homework, we will investigate similarities and differences 
# in the networks generated by systematic screen versus literature
# curation. 

# 2 points
# 1a) How many nodes?

# 1b) How many edges? 

# 1c) Get a list of unique node names - hint: use names() with the solution to 1a
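The basic size queries in question 1 can be tried out on a tiny made-up graph first. The sketch below is only an illustration: the edge list and the names `toy`, `toy_edges`, and `node_names` are hypothetical and are not part of the homework data.

```r
library(igraph)

# Toy undirected graph with made-up labels (NOT the homework networks)
toy_edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "C", "D")
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

vcount(toy)                  # number of nodes
ecount(toy)                  # number of edges
node_names <- V(toy)$name    # character vector of unique node names
node_names
```

Real interaction tables can contain self-interactions or repeated pairs, and these are counted as edges unless you remove them (for example with `simplify()`), so it is worth stating which convention you used.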


In [None]:
# 1d) Compare the node degree distributions of the 2 graphs. Do they both follow a power law distribution?
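One possible way to inspect a degree distribution is a log-log plot together with a maximum-likelihood power-law fit. The sketch below uses a simulated preferential-attachment graph as a stand-in (an assumption made so the example runs without the homework files); `toy`, `deg`, `dd`, and `fit` are illustrative names.

```r
library(igraph)

set.seed(1)
# Simulated scale-free-like graph used as a stand-in (NOT the homework data)
toy <- sample_pa(500, m = 2, directed = FALSE)

deg <- degree(toy)               # one degree value per node
dd  <- degree_distribution(toy)  # relative frequency of degree 0, 1, 2, ...
k   <- seq_along(dd) - 1         # degree value matching each entry of dd

# Log-log plot; zero entries are dropped because they cannot be log-scaled
keep <- k > 0 & dd > 0
plot(k[keep], dd[keep], log = "xy",
     xlab = "degree k", ylab = "P(k)", main = "Toy degree distribution")

# Power-law fit to the degree sequence
fit <- fit_power_law(deg)
fit$alpha     # estimated exponent
fit$xmin      # lower cutoff used for the fit
fit$KS.stat   # Kolmogorov-Smirnov statistic (smaller means a closer fit)
```

A roughly straight line on the log-log plot is only suggestive; the fit statistics give a more defensible basis for comparing the two networks.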


In [None]:
# 1e) Compare the diameters of the graphs

# Based on these analyses, would you conclude that the graphs are similar? 

In [None]:
# Load in disease gene lists
mend <- scan("~/Networks_HW/OMIM/Mendelian_HGNC.txt",what=as.character())
cancer <- scan("~/Networks_HW/Cancer/cancer_genes.2_sources.txt",what=as.character())
length(mend)
length(cancer)

In [None]:
# 1 point
# 2) Evaluate coverage of different gene sets in the graph
# Hint: use this syntax to get the nodes in the graph that are also in a list of interesting genes
# nodesinlist <- nodenames[which(nodenames %in% genelist)]
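# nodesnotinlist <- nodenames[-which(nodenames %in% genelist)] gives the names not in the list
#   (caution: if nothing matches, which() returns integer(0) and this returns an empty vector)
# You can also use !, the logical "not" operator in R: nodenames[!(nodenames %in% genelist)]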

# 2a) Determine how many Mendelian disease genes are in the graph

# 2b) Determine how many cancer genes are in the graph

# 2c) Make a list of the nodes in the graph that are neither cancer nor Mendelian disease genes
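The membership syntax from the hint can be checked on a small example. All gene names and vectors below are made up for illustration. The sketch also shows why logical negation with `!` is safer than `-which(...)` when there is no overlap.

```r
# Hypothetical node names and gene lists (NOT the homework data)
nodenames <- c("TP53", "BRCA1", "GENE_X", "GENE_Y")
genelist  <- c("TP53", "BRCA1", "KRAS")
otherlist <- c("GENE_X")

in_list <- nodenames %in% genelist   # logical vector, one entry per node
nodenames[in_list]                   # "TP53" "BRCA1"
nodenames[!in_list]                  # "GENE_X" "GENE_Y"
sum(in_list)                         # 2 nodes are covered by genelist

# Pitfall: with no overlap, which() returns integer(0), and x[-integer(0)]
# selects nothing instead of everything
no_overlap <- c("FOO", "BAR")
bad  <- nodenames[-which(nodenames %in% no_overlap)]   # character(0)
good <- nodenames[!(nodenames %in% no_overlap)]        # all four names

# Nodes that appear in neither list
neither <- setdiff(nodenames, union(genelist, otherlist))   # "GENE_Y"
```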


In [None]:
# 1 point
# Repeat for the literature-based graph
# Are more or fewer of the disease-related genes present in the literature-derived graph?
# 2d) Determine how many Mendelian disease genes are in the graph

# 2e) Determine how many cancer genes are in the graph

# 2f) Make a list of the nodes in the graph that are neither cancer nor Mendelian disease genes


In [None]:
# 2 points
# 3) Compare graph measures between disease genes, cancer genes, and non-disease genes in the Y2H network
# Hint: You want a number for each gene in the group
#       Boxplots are a good way to compare distributions
#       You might get warnings for a method you run here. That's ok
# 3a) plot degree distributions for each class of gene

# 3b) plot clustering coefficient distribution for each class of gene

# 3c) plot closeness centrality for each class of gene

# 3d) plot betweenness centrality for each class of gene
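A minimal sketch of the "one number per gene, then boxplots per class" idea from the hint, run on igraph's built-in Zachary karate club graph. The class labels are assigned arbitrarily here purely so the example runs; they are an assumption and carry no biological meaning.

```r
library(igraph)

# Built-in example graph used as a stand-in (NOT the homework networks)
toy <- make_graph("Zachary")

# Arbitrary, made-up class labels, one per node
group <- factor(rep_len(c("mendelian", "cancer", "neither"), vcount(toy)))

# One value per node for each measure
measures <- data.frame(
  group       = group,
  degree      = degree(toy),
  clustering  = transitivity(toy, type = "local"),  # NaN for nodes with degree < 2
  closeness   = closeness(toy),
  betweenness = betweenness(toy)
)

# One boxplot per measure, split by class
par(mfrow = c(2, 2))
for (m in c("degree", "clustering", "closeness", "betweenness")) {
  boxplot(measures[[m]] ~ measures$group, main = m, xlab = "", ylab = m)
}
```

On real interaction networks that are not fully connected, `closeness()` may emit warnings or return `NaN` for some nodes, depending on the igraph version.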



In [None]:
# 2 points
# Now repeat for the literature-curated network
# Would you draw different conclusions using the literature-based network?
# You might get warnings for a method you run here. That's ok
# 3e) plot degree distributions for each class of gene

# 3f) plot clustering coefficient distribution for each class of gene

# 3g) plot closeness centrality for each class of gene

# 3h) plot betweenness centrality for each class of gene



# Do your conclusions about the properties of these different classes of genes change when you use different networks?

In [None]:
# 2 points
# 4) Next, compare enrichment for 4-node motifs in the Y2H network versus the literature-based network
# Hint: There are 6 unique motifs in which edges connect all 4 nodes in an undirected graph;
#       there are 11 undirected 4-node motifs in total when the subgraph doesn't have to be connected

# Visualize the possible motifs

# Count the number of motifs in each graph

# Do the graphs differ in terms of the number of motifs? Which motifs are more common in the Y2H network? Which in the literature-derived network?
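One way to visualize and count 4-node motifs is sketched below on two toy graphs: the built-in Zachary graph and a random graph of the same size. Both are stand-ins chosen only so the example is self-contained. It assumes that `motifs()` reports `NA` for the disconnected isomorphism classes, which is what lets the connected classes be picked out.

```r
library(igraph)

# Two stand-in graphs of equal size (NOT the homework networks)
toy_a <- make_graph("Zachary")
set.seed(1)
toy_b <- sample_gnm(vcount(toy_a), ecount(toy_a))

# Counts per isomorphism class (classes are numbered from 0)
counts_a <- motifs(toy_a, size = 4)
counts_b <- motifs(toy_b, size = 4)
connected <- which(!is.na(counts_a))   # positions of the connected classes

# Draw each connected 4-node motif
par(mfrow = c(2, 3), mar = c(1, 1, 2, 1))
for (i in connected) {
  plot(graph_from_isomorphism_class(4, i - 1, directed = FALSE),
       vertex.label = NA, main = paste("class", i - 1))
}

# Side-by-side counts and within-graph fractions
comparison <- data.frame(
  class = connected - 1,
  toy_a = counts_a[connected],
  toy_b = counts_b[connected]
)
comparison$frac_a <- comparison$toy_a / sum(comparison$toy_a)
comparison$frac_b <- comparison$toy_b / sum(comparison$toy_b)
comparison
```

Comparing fractions rather than raw counts is one design choice for networks of different sizes; raw counts alone mostly reflect how large and dense each graph is.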

In [None]:
# Bonus: Are these motifs over-represented relative to similar random networks? 
# Hint: you can perform degree-preserving permutation using rewire(g, with = keeping_degseq()) - see igraph documentation
#       The niter argument of keeping_degseq() sets the number of rewiring trials (edge swaps) to attempt
# Note: This might be computationally intensive
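A rough sketch of a degree-preserving null model for motif counts, again on the built-in Zachary graph rather than the homework networks. The number of permutations, the choice of `niter`, and the z-score and empirical p-value summaries are illustrative assumptions, not a prescribed method.

```r
library(igraph)

set.seed(42)
toy <- make_graph("Zachary")      # stand-in graph (NOT the homework data)
observed <- motifs(toy, size = 4)

# A single rewired graph keeps every node's degree but shuffles the edges
one_shuffle <- rewire(toy, with = keeping_degseq(niter = 10 * ecount(toy)))

# Null distribution: motif counts across many rewired graphs
n_perm <- 100                     # illustrative; more permutations give smoother estimates
null_counts <- replicate(n_perm, {
  shuffled <- rewire(toy, with = keeping_degseq(niter = 10 * ecount(toy)))
  motifs(shuffled, size = 4)
})                                # matrix: 11 motif classes x n_perm permutations

# Summaries per motif class (rows for disconnected classes stay NA)
null_mean <- rowMeans(null_counts)
null_sd   <- apply(null_counts, 1, sd)
z_score   <- (observed - null_mean) / null_sd
emp_p     <- (rowSums(null_counts >= observed) + 1) / (n_perm + 1)

data.frame(class = 0:10, observed, null_mean, z_score, emp_p)
```

On networks with thousands of edges each permutation is considerably slower, so it may be sensible to start with a small `n_perm` and scale up.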

