# PPI-network

This code finds driver genes with a network-based approach applied to PPI (protein-protein interaction) data from the STRING database. Genes associated with bladder cancer are used as seed genes, which provide the starting point for the analysis.

### Downloading the PPI network

First, the PPI network is downloaded from the STRING database. The file used here is the version 10.5 links file for human proteins (taxon 9606).

The data is then saved as a `networkx` graph.

In [1]:
import requests, zlib

# download the file
zip_file_url = 'https://stringdb-static.org/download/protein.links.v10.5/9606.protein.links.v10.5.txt.gz'
STRING_response = requests.get(zip_file_url)

# decompress the file. ('data' is a string.)
data = zlib.decompress(STRING_response.content, zlib.MAX_WBITS|32)
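
The `zlib.MAX_WBITS|32` argument asks `zlib` to detect a gzip or zlib header automatically, which is why the `.gz` download can be decompressed without the `gzip` module. A minimal sketch on an in-memory payload follows; the sample text is made up for illustration and is not STRING data.

```python
import gzip
import zlib

# Hypothetical two-line payload standing in for the downloaded file.
sample_text = "protein1 protein2 combined_score\nA B 900\n"
compressed = gzip.compress(sample_text.encode("utf-8"))

# wbits = MAX_WBITS | 32 enables automatic gzip/zlib header detection.
raw = zlib.decompress(compressed, zlib.MAX_WBITS | 32)

# On Python 3 zlib returns bytes, so decode before splitting into lines.
decoded = raw.decode("utf-8")
```

On Python 2, as used in this notebook, the decompressed result is already a `str`, so the decode step is only needed when porting to Python 3.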


### Processing the network data

The data is then processed into a `networkx` `Graph`, with each line of the file becoming one edge. The first two space-separated fields of a line are the edge's two nodes, and the third field (the score) is stored as the edge weight.

In [2]:
import networkx as nx

data_processed = data.split('\n')
del data_processed[0]   # drop the header line
del data_processed[-1]  # drop the empty string left after the final newline

G = nx.Graph()
for item in data_processed:
    nodes = item.split(' ')
    G.add_edge(nodes[0],nodes[1],weight=int(nodes[2]))

print G.number_of_edges()
print G.number_of_nodes()

5676528
19576
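
The parsing step can be sketched on a tiny sample as below. The sample lines and the `build_graph` helper are hypothetical and only mimic the layout described above (a header line, then two node names and a score per line).

```python
import networkx as nx

# Made-up sample in the same layout: header line, then "node node score".
sample = (
    "protein1 protein2 combined_score\n"
    "9606.A 9606.B 900\n"
    "9606.B 9606.C 350\n"
)

def build_graph(text):
    """Build an undirected weighted graph from link lines (hypothetical helper)."""
    graph = nx.Graph()
    lines = text.splitlines()[1:]  # skip the header line
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines such as a trailing newline
        node_a, node_b, score = line.split()
        graph.add_edge(node_a, node_b, weight=int(score))
    return graph

sample_graph = build_graph(sample)
```

Using `splitlines()` and skipping blank lines avoids having to delete the first and last list entries by hand.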


### Pruning the network
#### Remove all unconnected nodes

In [3]:
largest_cc = max(nx.connected_components(G), key=len)

# collect every node outside the largest connected component, then remove it
to_delete_unconnected = []
for key in G:
    if key not in largest_cc:
        to_delete_unconnected.append(key)

for node in to_delete_unconnected:
    G.remove_node(node)
    
print(G.number_of_nodes())

19574
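
An alternative way to keep only the largest connected component is to take a subgraph instead of deleting nodes one by one. This is a sketch on a toy graph, not the approach used in the cell above.

```python
import networkx as nx

# Toy graph: one component of three nodes and one of two nodes.
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("x", "y")])

# connected_components yields sets of nodes; keep the largest one.
largest = max(nx.connected_components(toy), key=len)

# copy() gives an independent graph that can be modified afterwards.
toy_lcc = toy.subgraph(largest).copy()
```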


#### Remove low confidence edges

In [7]:
to_delete_edges = []
for edge in G.edges():
    if G.get_edge_data(*edge)['weight'] < 400:
        to_delete_edges.append(edge)

for item in to_delete_edges:
    if G.has_edge(*item):
        G.remove_edge(*item)

to_delete_nodes = []
for key in G:
    if len(G[key])==0:
        to_delete_nodes.append(key)
        
for item in to_delete_nodes:
    G.remove_node(item)
    
print(G.number_of_nodes())
print(G.number_of_edges())


18836
792900
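
The same filtering can be written more compactly with `remove_edges_from` and `nx.isolates`. The sketch below uses a toy graph with made-up weights and the same cutoff of 400 as the cell above.

```python
import networkx as nx

toy = nx.Graph()
toy.add_edge("a", "b", weight=900)
toy.add_edge("b", "c", weight=399)
toy.add_edge("c", "d", weight=150)

threshold = 400  # same cutoff as in the cell above

# Collect first, then remove, so the graph is not modified while iterating.
weak_edges = [(u, v) for u, v, w in toy.edges(data="weight") if w < threshold]
toy.remove_edges_from(weak_edges)

# nx.isolates yields nodes of degree zero; materialise the list before removing.
toy.remove_nodes_from(list(nx.isolates(toy)))
```

Removing edges can split a graph into several components, so it may be worth recomputing the largest connected component after this step if a single connected network is required.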


#### Check that no nodes without edges remain

In [26]:
to_delete_no_edges = []
for key in G:
    if len(G[key]) == 0:
        to_delete_no_edges.append(key)
        
print(len(to_delete_no_edges))

0


### Extend the seed genes using DIAMOnD

#### Import the seed genes

In [27]:
seed_file = 'seed_genes.tsv'
seed_genes = set()
for line in open(seed_file,'r'):
    # the first column in the line will be interpreted as a seed gene:
    line_data = line.strip().split('\t')
    seed_gene = line_data[0]
    seed_genes.add(seed_gene)
    
print len(seed_genes)

40
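
Reading the first column of a tab-separated file can be sketched as follows. The helper name `read_seed_genes` and the sample rows are hypothetical; the sketch also skips blank lines, which the loop above would add as an empty string.

```python
import io

# Made-up TSV content; a real seed file would be opened from disk instead.
sample_tsv = io.StringIO(
    "GENE_A\tbladder\n"
    "\n"
    "GENE_B\tbladder\n"
    "GENE_A\tduplicate row\n"
)

def read_seed_genes(handle):
    """Return the set of first-column entries, skipping blank lines."""
    genes = set()
    for line in handle:
        fields = line.strip().split("\t")
        if fields[0]:
            genes.add(fields[0])
    return genes

sample_seeds = read_seed_genes(sample_tsv)
```

With a real file, `with open(seed_file) as handle: seed_genes = read_seed_genes(handle)` would also close the file once reading is done.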


#### Ensure all the seed genes are in the network

In [31]:
all_genes_in_network = set(G.nodes())
seed_genes = set(seed_genes)
disease_genes = seed_genes & all_genes_in_network

if len(disease_genes) != len(seed_genes):
    print "DIAMOnD(): ignoring %s of %s seed genes that are not in the network" %(
        len(seed_genes - all_genes_in_network), len(seed_genes))
    
print len(disease_genes) - len(seed_genes)

0


#### Run DIAMOnD

In [41]:
import DIAMOnD

max_number_of_added_nodes = 160
alpha = 1

added_nodes = DIAMOnD.diamond_iteration_of_first_X_nodes(G,disease_genes,max_number_of_added_nodes,alpha)
added_nodes = [node[0] for node in added_nodes]

ext_seed_genes = added_nodes + list(seed_genes)
print len(ext_seed_genes)

200
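
DIAMOnD, as described by its authors, ranks candidate nodes by connectivity significance: a hypergeometric p-value for a node with a given degree having at least the observed number of links to seed genes. The sketch below illustrates that p-value only. It is not the code in the `DIAMOnD` module, the function and parameter names are hypothetical, and it ignores the seed weight `alpha` (set to 1 above).

```python
from math import comb

def connectivity_pvalue(total_nodes, seed_count, degree, seed_links):
    """P(at least `seed_links` of `degree` neighbours are seeds) under a
    hypergeometric null with `seed_count` seeds among `total_nodes`."""
    denominator = comb(total_nodes, degree)
    upper = min(degree, seed_count)
    tail = sum(
        comb(seed_count, i) * comb(total_nodes - seed_count, degree - i)
        for i in range(seed_links, upper + 1)
    )
    return tail / denominator

# Toy numbers: 10 nodes, 3 seeds, a candidate with 2 links, 1 of them to a seed.
p_example = connectivity_pvalue(total_nodes=10, seed_count=3, degree=2, seed_links=1)
```

A smaller p-value means the candidate is connected to the seed set more strongly than expected by chance.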


In [23]:
counter = 0
for key in G:
    if counter == 0:
        print key
        print type(key)
    counter = counter + 1

9606.ENSP00000395733
<type 'str'>


In [25]:
G['9606.ENSP00000395733'][:][weight ]

TypeError: unhashable type
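
The `TypeError` above comes from the `[:]` slice: `G[node]` is a dict-like mapping from each neighbour to that edge's attributes, and a slice is not a hashable key. A sketch of one way to collect the weights of all edges at a node, on a toy graph with made-up node names:

```python
import networkx as nx

toy = nx.Graph()
toy.add_edge("p1", "p2", weight=509)
toy.add_edge("p1", "p3", weight=870)

# toy["p1"] maps each neighbour of p1 to that edge's attribute dict.
neighbour_weights = {nbr: attrs["weight"] for nbr, attrs in toy["p1"].items()}
```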

In [None]:
counter = 0
for item in genes:
    if G.has_node(item):
        counter += 1
        
print(counter)

In [15]:
counter = 0
for edge in G.edges():
    if counter == 0:
        print edge
    counter = counter + 1
    
print G.edges['9606.ENSP00000395733','9606.ENSP00000332454']['weight']

('9606.ENSP00000395733', '9606.ENSP00000332454')
509
