# PPI-network

This code was developed to find driver genes using a network-based approach on PPI (protein-protein interaction) data from the STRING database. Genes associated with bladder cancer were used as seed genes and provide a starting point for the analysis.

### Downloading the PPI network

First, the human (taxon 9606) PPI network, version 10.5, is downloaded from the STRING database as a gzip-compressed text file.

The decompressed text is then loaded into a `networkx` graph in the next step.

In [2]:
import requests, zlib

# download the file
zip_file_url = 'https://stringdb-static.org/download/protein.links.v10.5/9606.protein.links.v10.5.txt.gz'
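STRING_response = requests.get(zip_file_url)
STRING_response.raise_for_status()  # stop early if the download failed

# decompress the file; MAX_WBITS|32 lets zlib detect the gzip header.
# decode the bytes so that 'data' is a text string (required on Python 3).
data = zlib.decompress(STRING_response.content, zlib.MAX_WBITS|32).decode('utf-8')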
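
The decompression step can be tried offline on a tiny payload. The sketch below compresses a few invented lines with `gzip` and unpacks them with the same `zlib.decompress(..., zlib.MAX_WBITS | 32)` call; the sample lines are made up for illustration, and only the header layout follows the STRING links file.

```python
import gzip
import zlib

# Invented sample in the layout of a STRING links file (header + edges).
sample_text = (
    "protein1 protein2 combined_score\n"
    "9606.A 9606.B 900\n"
    "9606.B 9606.C 150\n"
)

# Stand-in for STRING_response.content: gzip-compressed bytes.
compressed = gzip.compress(sample_text.encode("utf-8"))

# Adding 32 to the window size lets zlib detect the gzip header itself.
raw_bytes = zlib.decompress(compressed, zlib.MAX_WBITS | 32)

# decompress returns bytes; decode to get text that can be split on '\n'.
decoded = raw_bytes.decode("utf-8")
print(decoded.split("\n")[0])
```

Decoding right after decompression keeps the later `split('\n')` and `split(' ')` calls working on text rather than bytes.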


### Processing the network data

The data is then processed by building a `networkx` `Graph` in which each line of the file becomes one edge. The first two 'words' on each line are the edge's two nodes, and the third is the interaction score, which is stored as the edge weight.

In [30]:
import networkx as nx

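data_processed = data.split('\n')
del data_processed[0]   # drop the header line
del data_processed[-1]  # drop the empty string after the final newline

G = nx.Graph()
for item in data_processed:
    # each line reads: protein1 protein2 combined_score
    nodes = item.split(' ')
    G.add_edge(nodes[0], nodes[1], weight=int(nodes[2]))

print(G.number_of_edges())
print(G.number_of_nodes())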

5676528
19576
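
The parsing step can be checked on a handful of lines without downloading anything. The sketch below is one possible way to write it: `parse_links` is a hypothetical helper, the sample lines are invented, and the result is a plain list of `(node, node, weight)` tuples, which is the shape that `networkx`'s `Graph.add_weighted_edges_from` accepts.

```python
def parse_links(text):
    """Turn the text of a links file into (protein1, protein2, score) tuples."""
    edges = []
    lines = text.splitlines()
    for line in lines[1:]:          # skip the header line
        if not line.strip():        # ignore blank lines
            continue
        protein1, protein2, score = line.split(" ")
        edges.append((protein1, protein2, int(score)))
    return edges


# Invented sample; the first pair appears once in each direction.
sample = (
    "protein1 protein2 combined_score\n"
    "9606.A 9606.B 900\n"
    "9606.B 9606.A 900\n"
    "9606.B 9606.C 150\n"
)
edges = parse_links(sample)
print(edges)

# Each undirected pair, counted once regardless of direction.
unique_pairs = {frozenset((u, v)) for u, v, _ in edges}
print(len(unique_pairs))
```

An undirected `Graph` stores A-B and B-A as a single edge, so the edge count of the graph can be lower than the number of lines read.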


### Pruning the network
#### Remove all unconnected nodes

In [31]:
print(nx.is_connected(G))
print(nx.number_connected_components(G))

largest_cc = max(nx.connected_components(G), key=len)
print(len(largest_cc))

# collect every node outside the largest connected component
to_delete_unconnected = []
for key in G:
    if key not in largest_cc:
        to_delete_unconnected.append(key)

print(len(to_delete_unconnected))

for node in to_delete_unconnected:
    G.remove_node(node)
    
print(G.number_of_nodes())

False
2
19574
2
19574
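
The same pruning can be written more compactly. The sketch below is a minimal illustration on an invented toy graph rather than the STRING network; it keeps only the largest connected component by passing the unwanted nodes to `remove_nodes_from` in one call.

```python
import networkx as nx

# Toy graph: one component of three nodes and one of two nodes.
toy = nx.Graph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("x", "y")])

# The largest component is the node set with the most members.
largest = max(nx.connected_components(toy), key=len)

# Drop every node that is not in the largest component.
toy.remove_nodes_from([n for n in list(toy) if n not in largest])

print(sorted(toy.nodes()))
print(nx.is_connected(toy))
```

Building the list of nodes before removing them avoids changing the graph while iterating over it, which is also why the cell above collects the nodes first.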


#### Remove low-confidence edges

In [None]:
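# NOTE: `net_links` is assumed to be the graph built above (`G`);
# `genes` and `ext_genes` are not defined in this section.

# BEFORE

print(net_links.number_of_nodes())
print(net_links.number_of_edges(),'\n')


# DELETE EDGES AND NODES

# collect edges scoring below 400 (STRING's medium-confidence cutoff, 0.4 on a 0-1 scale)
to_delete_edges = []
for edge in net_links.edges():
    if net_links.get_edge_data(*edge)['weight'] < 400:
        to_delete_edges.append(edge)

for item in to_delete_edges:
    if net_links.has_edge(*item):
        net_links.remove_edge(*item)

# remove nodes left without any neighbours
to_delete_nodes = []
for key in net_links:
    if len(net_links[key]) == 0:
        to_delete_nodes.append(key)

for item in to_delete_nodes:
    net_links.remove_node(item)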
    

# AFTER

# count how many genes from each list are still in the network
counter = 0
for item in genes:
    if net_links.has_node(item):
        counter += 1
        
print(counter)

counter = 0
for item in ext_genes:
    if net_links.has_node(item):
        counter += 1

print(counter)

print(net_links.number_of_nodes())
print(net_links.number_of_edges(),'\n')
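
An alternative to deleting edges from the finished graph is to filter the edge list before the graph is built. The sketch below shows that variant on invented `(protein1, protein2, score)` tuples; the name `MIN_SCORE` is hypothetical, and the value 400 is taken from the cell above.

```python
# STRING scores are scaled to 0-1000; 400 matches the cutoff used above.
MIN_SCORE = 400

# Invented (protein1, protein2, score) tuples for illustration.
edges = [
    ("9606.A", "9606.B", 900),
    ("9606.B", "9606.C", 150),
    ("9606.C", "9606.D", 400),
    ("9606.D", "9606.E", 100),
]

# Keep edges at or above the cutoff (the cell above deletes weight < 400).
kept = [(u, v, w) for u, v, w in edges if w >= MIN_SCORE]

# Nodes that still have at least one edge; the rest would be isolated.
nodes_before = {n for u, v, _ in edges for n in (u, v)}
nodes_after = {n for u, v, _ in kept for n in (u, v)}

print(kept)
print(sorted(nodes_before - nodes_after))
```

Filtering first means isolated nodes never enter the graph, so no separate node-removal pass is needed. The result can still be disconnected, so the largest-component step would have to run afterwards.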

In [26]:
seed_file = 'seed_genes.tsv'
seed_genes = set()
with open(seed_file, 'r') as handle:
    for line in handle:
        # the first column in the line will be interpreted as a seed gene:
        line_data = line.strip().split('\t')
        seed_gene = line_data[0]
        seed_genes.add(seed_gene)

print(len(seed_genes))

40
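
The cell above takes the first tab-separated column of each line as a seed gene. A self-contained sketch of the same idea follows. The file contents and the helper name `read_seed_genes` are invented, and skipping blank lines and `#` comment lines is an added assumption, not something the original cell does.

```python
import os
import tempfile


def read_seed_genes(path):
    """Return the set of first-column entries from a tab-separated file."""
    seeds = set()
    with open(path, "r") as handle:
        for line in handle:
            line = line.strip()
            # Assumption: blank lines and '#' comments carry no gene.
            if not line or line.startswith("#"):
                continue
            seeds.add(line.split("\t")[0])
    return seeds


# Write a small invented seed file to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("# gene\tsource\nGENE1\tsetA\nGENE2\tsetB\n\nGENE1\tsetC\n")
    tmp_path = tmp.name

seed_genes = read_seed_genes(tmp_path)
os.remove(tmp_path)
print(sorted(seed_genes))
```

Because the result is a set, a gene listed on several lines is counted once.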


In [1]:
import DIAMOnD

input_list = ['dummy_sys_argv','','seed_genes.tsv',160]

In [15]:
data_processed[0].split(' ')[1]

'9606.ENSP00000263431'
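
The value above shows that node labels have the form `<taxon id>.<protein id>`, here taxon `9606` followed by an Ensembl protein ID. Seed genes are only found in the graph if they are written in the same form. The sketch below splits a label into its two parts and counts how many seeds appear among a set of node labels; `split_string_id` and `count_seeds_in_network` are hypothetical helpers, and every label except the first is invented.

```python
def split_string_id(label):
    """Split '<taxon>.<protein>' into its two parts."""
    taxon, protein = label.split(".", 1)
    return taxon, protein


def count_seeds_in_network(seeds, node_labels):
    """Count the seeds that appear verbatim among the node labels."""
    return len(set(seeds) & set(node_labels))


nodes = {"9606.ENSP00000263431", "9606.ENSP_EXAMPLE_1", "9606.ENSP_EXAMPLE_2"}
print(split_string_id("9606.ENSP00000263431"))

# The second seed lacks the taxon prefix, so it does not match any node.
seeds = ["9606.ENSP00000263431", "ENSP_EXAMPLE_1"]
print(count_seeds_in_network(seeds, nodes))
```

A count lower than the number of seeds is a hint that the seed file and the network use different identifier styles.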