# PPI Data Preparation<a id='top'></a>
**Sections:**<br>
[0) Description](#0)<br>
[1) Importing Modules and Packages](#1)<br>
[2) Configuration](#2)<br>
[3) Loading Gene Ontology](#3)<br>
[4) Loading Genes and Annotations](#4)<br>
[5) Loading PPI data](#5)<br>
[6) Generating Negative PPI data](#6)<br>
[7) Saving the results](#7)<br>


## Description<a id='0'></a>

**Aim:** This Jupyter notebook produces the PPI data of a species of interest (e.g., _human_ or *yeast*) with which the deepSimDEF networks are trained and evaluated.

---
**Output file format:** (tab separated, with a header row)<br>
_protein1_ _protein2_ _interaction_label_ <br>
CD44 ARHGEF1 1<br>
POLR2G CTDP1 1<br>
...<br>
OR5L2 SLC7A11 0<br>
SCNM1 CEP120 0<br>
...<br>
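
A minimal sketch of reading a file in this format with pandas. It assumes the tab-separated layout with the `Protein_1`, `Protein_2`, `Interaction_label` header row written by the saving cell at the end of this notebook; the in-memory buffer is a stand-in for the real file and its rows are copied from the listing above.

```python
import io
import pandas as pd

# Stand-in for the saved file (assumption: tab separated, with a header row).
sample = (
    "Protein_1\tProtein_2\tInteraction_label\n"
    "CD44\tARHGEF1\t1\n"
    "POLR2G\tCTDP1\t1\n"
    "OR5L2\tSLC7A11\t0\n"
    "SCNM1\tCEP120\t0\n"
)

ppi = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"Interaction_label": int})

# Number of positive (1) and negative (0) pairs in the sample.
label_counts = ppi["Interaction_label"].value_counts().to_dict()
print(label_counts)
```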

---
Files needed for this preprocessing are:

 * **Gene ontology:** ['go.obo' file](http://current.geneontology.org/ontology/go.obo)<br><br>
 
 * **Association files:** [gene association files ingested from GO Consortium members](http://current.geneontology.org/products/pages/downloads.html)
  * **Human** - [Gene Association file (Homo sapiens)](http://geneontology.org/gene-associations/goa_human.gaf.gz)
  * **Yeast** - [Gene Association file (Saccharomyces cerevisiae)](http://current.geneontology.org/annotations/sgd.gaf.gz)<br><br>
  
 * **PPI data:** [String website](https://string-db.org/cgi/input.pl?sessionId=a5gYvToqoD08&input_page_show_search=off) [Examples](https://string-db.org/cgi/input.pl?sessionId=Yxc5T5teWAUN&input_page_active_form=examples)<br>
         
     * **Human** - [String website for PPI data (Homo sapiens)](https://string-db.org/cgi/download.pl?sessionId=rEreY0WA5fL8&species_text=Homo+sapiens)
         * Protein links full: [Protein Protein Interaction file - full](https://stringdb-static.org/download/protein.links.full.v11.0/9606.protein.links.full.v11.0.txt.gz)
         * Protein info: [Protein Protein Interaction file - information](https://stringdb-static.org/download/protein.info.v11.0/9606.protein.info.v11.0.txt.gz)
         * Protein actions: [Protein Protein Interaction file - action](https://stringdb-static.org/download/protein.actions.v11.0/9606.protein.actions.v11.0.txt.gz)<br>
         
     * **Yeast** - [String website for PPI data (Saccharomyces cerevisiae)](https://string-db.org/cgi/download.pl?sessionId=rEreY0WA5fL8&species_text=Saccharomyces+cerevisiae)
         * Protein links full: [Protein Protein Interaction file - full](https://stringdb-static.org/download/protein.links.full.v11.0/4932.protein.links.full.v11.0.txt.gz)
         * Protein info: [Protein Protein Interaction file - information](https://stringdb-static.org/download/protein.info.v11.0/4932.protein.info.v11.0.txt.gz)
         * Protein actions: [Protein Protein Interaction file - action](https://stringdb-static.org/download/protein.actions.v11.0/4932.protein.actions.v11.0.txt.gz)<br>
         * Extra source: ['yeastgenome.org' interaction data](http://downloads.yeastgenome.org/pub/yeast/literature_curation/interaction_data.tab)

---
[back to top](#top)<br>

## Importing Modules and Packages<a id='1'></a>
[back to top](#top)<br>

In [None]:
import pandas as pd
import numpy as np
import os
import requests
import easydict
import linecache
import pprint
import random

pp = pprint.PrettyPrinter(indent=4)

## Configuration<a id='2'></a>
[back to top](#top)<br>

In [None]:
species = 'yeast' # species of interest to load the data of and save the results for
confidence_threshold = {'human': 850, 'yeast': 650}

if species=='human':
    association_file_name = 'goa_human.gaf.gz' # human
    association_file_url = 'http://geneontology.org/gene-associations/goa_human.gaf.gz'
    interactions_full_file = '9606.protein.links.full.v11.0.txt.gz'
    interactions_full_url = 'https://stringdb-static.org/download/protein.links.full.v11.0/9606.protein.links.full.v11.0.txt.gz'
    interactions_action_file = '9606.protein.actions.v11.0.txt.gz'
    interactions_action_url = 'https://stringdb-static.org/download/protein.actions.v11.0/9606.protein.actions.v11.0.txt.gz'
    interactions_info_file = '9606.protein.info.v11.0.txt.gz'
    interactions_info_url = 'https://stringdb-static.org/download/protein.info.v11.0/9606.protein.info.v11.0.txt.gz'
elif species=='yeast':
    association_file_name = 'sgd.gaf.gz' # yeast
    association_file_url = 'http://current.geneontology.org/annotations/sgd.gaf.gz'
    interactions_full_file = '4932.protein.links.full.v11.0.txt.gz'
    interactions_full_url = 'https://stringdb-static.org/download/protein.links.full.v11.0/4932.protein.links.full.v11.0.txt.gz'
    interactions_action_file = '4932.protein.actions.v11.0.txt.gz'
    interactions_action_url = 'https://stringdb-static.org/download/protein.actions.v11.0/4932.protein.actions.v11.0.txt.gz'
    interactions_info_file = '4932.protein.info.v11.0.txt.gz'
    interactions_info_url = 'https://stringdb-static.org/download/protein.info.v11.0/4932.protein.info.v11.0.txt.gz'
    interactions_extra_file = 'interaction_data.tab'
    interactions_extra_url = 'http://sgd-archive.yeastgenome.org/?prefix=pub/yeast/literature_curation/interaction_data.tab'
    
args = easydict.EasyDict({
    "go_dir": 'gene_ontology/raw/',     # directory to the Gene Ontology 'go.obo' file
    "association_file_dir": 'species/{}/association_file/raw'.format(species), # directory to the association file of the species
    "interaction_files_dir": 'species/{}/ppi/raw'.format(species),             # directory into which the interaction files are downloaded
    "ppi_raw_dir": 'species/{}/ppi/raw'.format(species),                       # directory to the raw ppi data
    "result_ppi_dir": 'species/{}/ppi/processed'.format(species),              # directory in which the results would be saved
    "combined_score_threshold": confidence_threshold[species],   # the score to filter out low-confidence interactions (only those above the score are kept)
    "download_gene_ontology": True,     # download the latest version of gene ontology into the specified directory above
    "download_association_file": True,  # download association file of the species of interest into the specified directory above
    "download_interaction_files": True,  # download interaction files of the species of interest into the specified directory above
    "extra_ppi_source": False,           # including an extra source of positive PPI information to generate better negative PPIs
    "seed": 2021                         # seed to make sure the random negative samples are reproducible
})

os.makedirs(args.result_ppi_dir, exist_ok=True)  # create 'result_ppi_dir' folder (if it does not exist already)

np.random.seed(args.seed)
random.seed(args.seed)

subontology_map = {"C":"CC", "P":"BP", "F":"MF"}

## Loading Gene Ontology<a id='3'></a>
[back to top](#top)<br>

In [None]:
if args.download_gene_ontology:
    os.makedirs(args.go_dir, exist_ok=True)  # create the 'go_dir' folder (if it does not exist already)
    print("Downloading the latest version of Gene Ontology into '{}'...".format(args.go_dir))
    url = 'http://current.geneontology.org/ontology/go.obo'
    r = requests.get(url, allow_redirects=True)
    r.raise_for_status()  # stop here rather than saving an HTTP error page as 'go.obo'
    with open('{}/go.obo'.format(args.go_dir), 'wb') as fw:
        fw.write(r.content)

print("Gene Ontology {}".format(linecache.getline('{}/go.obo'.format(args.go_dir), 2))) # the second line of 'go.obo' holds the release, e.g., releases/2020-03-23

In [None]:
"""Reading Gene Ontology to extract Terms and their Descriptive Names"""
with open("{}/go.obo".format(args.go_dir)) as f:
    content = f.read()
content = content.split("[Typedef]")[0].split("[Term]")
print("Information of the last GO term in the file:\n~~~~~~~~~~~~~~~~~~~~~~~~~{}".format(content[-1]))

In [None]:
"""Going through every GO term and extracting the information needed ('id', 'alt_id', 'namespace', and 'is_obsolete')"""
go_term_dict = {}
for c in content:
    go_id = ''
    for l in c.split("\n"):
        # id
        if l.startswith("id: GO:"):
            go_id = l.split("id: ")[1]
            go_term_dict[go_id] = {}
        # alt_id
        if l.startswith("alt_id:"):
            go_term_dict[go_id].setdefault("alt_id", []).append(l.split("alt_id: ")[1])
        # namespace
        if l.startswith("namespace:"):
            go_term_dict[go_id]["namespace"] = l.split("namespace: ")[1]
        # is_obsolete
        if l.startswith("is_obsolete:"):
            go_term_dict[go_id]["is_obsolete"] = l.split("is_obsolete: ")[1]
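
The stanza parsing above can be tried on a small in-memory example before running it on the full `go.obo` file. The sketch below is illustrative only: the OBO excerpt is hand-written (the term names are made up), and `parse_obo_terms` is a hypothetical helper that mirrors the logic of the cell above rather than a function of this notebook.

```python
# Tiny hand-written OBO excerpt (illustrative, not copied from a real release).
obo_text = """format-version: 1.2
data-version: releases/0000-00-00

[Term]
id: GO:0000001
name: example term A
namespace: biological_process
alt_id: GO:0000011

[Term]
id: GO:0000002
name: example term B
namespace: molecular_function
is_obsolete: true

[Typedef]
id: part_of
"""

def parse_obo_terms(text):
    """Return {go_id: {...}} with 'namespace', 'alt_id', 'is_obsolete' per [Term] stanza."""
    terms = {}
    # Drop everything from the first [Typedef] on, then skip the header chunk.
    stanzas = text.split("[Typedef]")[0].split("[Term]")[1:]
    for stanza in stanzas:
        go_id = None
        for line in stanza.splitlines():
            if line.startswith("id: GO:"):
                go_id = line[len("id: "):]
                terms[go_id] = {}
            elif go_id and line.startswith("alt_id: "):
                terms[go_id].setdefault("alt_id", []).append(line[len("alt_id: "):])
            elif go_id and line.startswith("namespace: "):
                terms[go_id]["namespace"] = line[len("namespace: "):]
            elif go_id and line.startswith("is_obsolete: "):
                terms[go_id]["is_obsolete"] = line[len("is_obsolete: "):]
    return terms

example_terms = parse_obo_terms(obo_text)
print(example_terms)
```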

In [None]:
"""printing how the key:values are organized for every GO term"""
for i in range(15):
    print(list(go_term_dict)[i], end=": ")
    pp.pprint(go_term_dict[list(go_term_dict)[i]])

In [None]:
"""grouping GO terms based on the sub-ontologies they belong to"""
subontology_go_term_dict = {}
for go_id in go_term_dict:
    if not go_term_dict[go_id].get('is_obsolete', False): # or => if 'is_obsolete' not in go_term_dict[go_id]:
        subontology_go_term_dict.setdefault(go_term_dict[go_id]['namespace'].split('_')[1][0].upper(), []).append(go_id)

In [None]:
"""including 'alt_id' into the sub-ontology's groups of GO terms"""
for go_id in go_term_dict:
    if go_term_dict[go_id].get('alt_id', False): # or => if 'alt_id' in go_term_dict[go_id]:
        for alt_id in go_term_dict[go_id].get('alt_id'):
            subontology_go_term_dict[go_term_dict[go_id]['namespace'].split('_')[1][0].upper()].append(alt_id)
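
The one-letter keys used above come from the second word of each namespace (`biological_process` gives `P`, and so on). The sketch below shows that mapping on toy data. It is a variant under stated assumptions, not the notebook's code: `namespace_to_aspect` and `toy_terms` are hypothetical names, the terms are stored in sets rather than lists (which makes later `in` checks fast), and obsolete terms are skipped before their `alt_id` entries are considered.

```python
from collections import defaultdict

def namespace_to_aspect(namespace):
    """'biological_process' -> 'P', 'molecular_function' -> 'F', 'cellular_component' -> 'C'."""
    return namespace.split("_")[1][0].upper()

# Toy stand-in for go_term_dict (hypothetical entries).
toy_terms = {
    "GO:0000001": {"namespace": "biological_process", "alt_id": ["GO:0000011"]},
    "GO:0000002": {"namespace": "molecular_function", "is_obsolete": "true"},
    "GO:0000003": {"namespace": "cellular_component"},
}

terms_by_aspect = defaultdict(set)
for go_id, info in toy_terms.items():
    if "is_obsolete" in info:
        continue  # obsolete terms are left out of every group
    aspect = namespace_to_aspect(info["namespace"])
    terms_by_aspect[aspect].add(go_id)
    terms_by_aspect[aspect].update(info.get("alt_id", []))  # alternative ids join the same group

print(dict(terms_by_aspect))
```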

In [None]:
"""printing how the key:values are organized for different sub-ontologies"""
for subontology in subontology_go_term_dict:
    print("{} ({}):: {} <= {} GO term (with 'alt_id') => {}".format(
        subontology, 
        subontology_map[subontology], 
        " ".join(subontology_go_term_dict[subontology][:3]), 
        len(subontology_go_term_dict[subontology]), 
        " ".join(subontology_go_term_dict[subontology][-3:])))

## Loading Genes and Annotations<a id='4'></a>
[back to top](#top)<br>

In [None]:
if args.download_association_file:
    os.makedirs(args.association_file_dir, exist_ok=True)  # create the 'association_file_dir' folder (if it does not exist already)
    print("Downloading the latest version of association file into '{}'...".format(args.association_file_dir))
    r = requests.get(association_file_url, allow_redirects=True)
    r.raise_for_status()  # stop here rather than saving an HTTP error page
    with open('{}/{}'.format(args.association_file_dir, association_file_name), 'wb') as fw:
        fw.write(r.content)
print("Done!")

In [None]:
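df = pd.read_csv("{}/{}".format(args.association_file_dir, association_file_name), sep='\t', comment="!", skip_blank_lines=True, header=None, dtype=str)
df = df.iloc[:,[1, 2, 3, 4, 6, 8]]  # GAF columns kept: object id, symbol, qualifier, GO id, evidence code, aspect
if len(df[df[3].isnull()])==0:
    # every row carries a qualifier: drop the negated ('NOT') annotations
    df = df[~df[3].str.contains("NOT")]
    df = df.dropna().reset_index(drop=True)
else:
    # some qualifiers are empty: keep only the rows without a qualifier
    df = df[df[3].isnull()]
    # the qualifier is NaN in every remaining row, so it must be excluded from the NaN check
    df = df.dropna(subset=[1, 2, 4, 6, 8]).reset_index(drop=True)
df = df.drop(df.columns[2], axis=1)  # the qualifier column is no longer needed
df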
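
To illustrate the qualifier filter on its own, here is a small sketch on a toy table. The rows are made up for illustration; only the column positions (1: object id, 2: symbol, 3: qualifier, 4: GO id, 6: evidence code, 8: aspect) follow the cell above.

```python
import pandas as pd

# Toy annotation table using the same integer column labels as the cell above.
toy = pd.DataFrame({
    1: ["ID1", "ID2", "ID3"],
    2: ["GENE1", "GENE2", "GENE3"],
    3: ["involved_in", "NOT|involved_in", "enables"],
    4: ["GO:0000001", "GO:0000002", "GO:0000003"],
    6: ["IDA", "IMP", "IEA"],
    8: ["P", "P", "F"],
})

# A qualifier containing 'NOT' negates the annotation, so those rows are dropped.
kept = toy[~toy[3].str.contains("NOT")].reset_index(drop=True)
print(kept)
```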

In [None]:
"""keeping track of the gene ids and their mappings"""
protein_gene_id_map = {}
for gene_id, protein_id in zip(df[1], df[2]):
    protein_gene_id_map[protein_id] = gene_id

#### Removing 'ND' and 'IEA' annotations

In [None]:
df = df[(df[6]!='ND') & (df[6]!='IEA')]
df
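
`ND` stands for "No biological Data available" and `IEA` for "Inferred from Electronic Annotation", so the filter keeps only experimentally or manually supported annotations. An equivalent way to write it, sketched on made-up rows, uses `isin` with a set of excluded codes (`excluded_codes` is a hypothetical name), which is easier to extend if further codes should be dropped.

```python
import pandas as pd

# Toy table: column 2 is the gene symbol, column 6 the evidence code.
toy_evidence = pd.DataFrame({
    2: ["GENE1", "GENE2", "GENE3", "GENE4"],
    6: ["IDA", "IEA", "ND", "TAS"],
})

excluded_codes = {"ND", "IEA"}
curated = toy_evidence[~toy_evidence[6].isin(excluded_codes)]
print(curated)
```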

In [None]:
"""protein dictionary to keep track of annotations for proteins (from each sub-ontology)"""
proteins_dict = {}
for index, row in df.iterrows():
    gene = row[2]
    go_term_id = row[4]
    subontology = row[8]
    if go_term_id in subontology_go_term_dict[subontology]:
        proteins_dict.setdefault(gene, dict()).setdefault(subontology, set()).add(go_term_id)
        
"""printing how the key:values are organized for every gene/protein"""
for i in range(5):
    print(list(proteins_dict)[i], end=": ")
    pp.pprint(proteins_dict[list(proteins_dict)[i]])
print("\nTotal number of genes/proteins annotated:", len(proteins_dict))

#### Taking into account only fully annotated genes/proteins

In [None]:
"""keeping track of fully annotated genes/proteins"""
fully_annotated_proteins_wo_iea = []
for protein in proteins_dict:
    if len(proteins_dict[protein]) == 3:
        fully_annotated_proteins_wo_iea.append(protein)
print("Out of {} proteins {} are (experimentally or manually) annotated by all three sub-ontologies.".format(len(proteins_dict), len(fully_annotated_proteins_wo_iea)))
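
The "fully annotated" criterion can be stated as a set comparison: a protein qualifies only if its annotations cover all three aspects. A minimal sketch on toy data follows; `toy_proteins_dict` and `required_aspects` are hypothetical names, and the result is kept as a set, which would also speed up the membership tests performed later.

```python
# Toy stand-in for proteins_dict: gene -> aspect -> set of GO terms.
toy_proteins_dict = {
    "GENE1": {"P": {"GO:0000001"}, "F": {"GO:0000002"}, "C": {"GO:0000003"}},
    "GENE2": {"P": {"GO:0000001"}, "C": {"GO:0000003"}},  # no 'F' annotation
}

required_aspects = {"P", "F", "C"}
fully_annotated = {
    gene
    for gene, annotations in toy_proteins_dict.items()
    if required_aspects <= set(annotations)  # all three aspects present
}
print(fully_annotated)
```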

## Loading PPI data<a id='5'></a>
[back to top](#top)<br>

In [None]:
if args.download_interaction_files:
    os.makedirs(args.interaction_files_dir, exist_ok=True)  # create the 'interaction_files_dir' folder (if it does not exist already)
    print("Downloading the latest version of interaction files into '{}'...".format(args.interaction_files_dir))

    # fetch the 'links.full', 'actions', and 'info' files in turn
    for url, file_name in [(interactions_full_url, interactions_full_file),
                           (interactions_action_url, interactions_action_file),
                           (interactions_info_url, interactions_info_file)]:
        r = requests.get(url, allow_redirects=True)
        r.raise_for_status()  # stop here rather than saving an HTTP error page
        with open('{}/{}'.format(args.interaction_files_dir, file_name), 'wb') as fw:
            fw.write(r.content)

print("Done!")

In [None]:
if args.extra_ppi_source and species=='yeast':  # an extra source is configured for yeast only
    os.makedirs(args.interaction_files_dir, exist_ok=True)  # create the 'interaction_files_dir' folder (if it does not exist already)
    print("Downloading the latest version of the extra interaction file into '{}'...".format(args.interaction_files_dir))

    r = requests.get(interactions_extra_url, allow_redirects=True)
    r.raise_for_status()  # stop here rather than saving an HTTP error page
    with open('{}/{}'.format(args.interaction_files_dir, interactions_extra_file), 'wb') as fw:
        fw.write(r.content)

print("Done!")

#### Loading species PPI info file to map names

In [None]:
df_ppi_info = pd.read_csv("{}/{}".format(args.ppi_raw_dir, interactions_info_file), sep='\t')
df_ppi_info

In [None]:
id_map_dict = {}
for protein_external_id, preferred_name in zip(df_ppi_info['protein_external_id'] , df_ppi_info['preferred_name']):
    id_map_dict[protein_external_id] = preferred_name

#### Loading species PPI data

In [None]:
df_ppi = pd.read_csv("{}/{}".format(args.ppi_raw_dir, interactions_full_file), sep=' ')
df_ppi

In [None]:
"""we need all positive interactions for the generation of negative interactions"""
all_positive_interactions = set()
for protein1, protein2 in zip(df_ppi['protein1'] , df_ppi['protein2']):
    all_positive_interactions.add("{} {}".format(id_map_dict[protein1], id_map_dict[protein2]))
    all_positive_interactions.add("{} {}".format(id_map_dict[protein2], id_map_dict[protein1]))

"""including other sources into consideration to create better negative PPIs later"""
if args.extra_ppi_source:
    if species=='yeast':
        extra_file = 'interaction_data.tab'
        print("Extra source of PPI for {} ('{}' file).".format(species, extra_file))
        df_ppi_another = pd.read_csv("{}/{}".format(args.ppi_raw_dir, extra_file), sep='\t', header=None)
        for index, row in df_ppi_another.iterrows():
            all_positive_interactions.add("{} {}".format(row[1], row[3]))
            all_positive_interactions.add("{} {}".format(row[3], row[1]))
    elif species=='human':
        print("No extra source of PPI for {}.".format(species))

print("The total number of all positive interactions is:", len(all_positive_interactions))
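
Each interaction is stored in both orientations so that a later membership test does not depend on the order of the two proteins. A small sketch of that idea, with a hypothetical helper `add_symmetric` and made-up gene names:

```python
def add_symmetric(pairs, protein_a, protein_b):
    """Store the pair in both orientations so lookups are order independent."""
    pairs.add("{} {}".format(protein_a, protein_b))
    pairs.add("{} {}".format(protein_b, protein_a))

toy_positive = set()
add_symmetric(toy_positive, "GENE1", "GENE2")
add_symmetric(toy_positive, "GENE3", "GENE3")  # a self-interaction yields a single entry

print(sorted(toy_positive))
```

The price is that the set size is not the number of distinct interactions: every pair of different proteins is counted twice.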

#### Experimentally Supported Interactions

In [None]:
"""keeping only those interactions which are supported experimentally and have a high confidence score"""
df_ppi = df_ppi[(df_ppi['experiments']!=0) & (args.combined_score_threshold<df_ppi['combined_score'])][['protein1', 'protein2', 'experiments', 'combined_score']]
df_ppi
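
The effect of the two conditions is easiest to see on a toy table. In the sketch below the protein identifiers and scores are made up; only the column names and the yeast threshold of 650 are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for the STRING 'links.full' table (hypothetical ids and scores).
toy_links = pd.DataFrame({
    "protein1": ["4932.P1", "4932.P2", "4932.P3"],
    "protein2": ["4932.P4", "4932.P5", "4932.P6"],
    "experiments": [0, 300, 900],
    "combined_score": [900, 700, 600],
})

threshold = 650
kept_links = toy_links[
    (toy_links["experiments"] != 0)             # some experimental evidence
    & (toy_links["combined_score"] > threshold)  # strictly above the threshold
]
print(kept_links)
```

The first row is dropped for lacking experimental evidence and the third for its low combined score, so only the second row remains.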

In [None]:
experimentally_positive_interactions = set()
for protein1, protein2 in zip(df_ppi['protein1'] , df_ppi['protein2']):
    experimentally_positive_interactions.add("{} {}".format(id_map_dict[protein1], id_map_dict[protein2]))
    experimentally_positive_interactions.add("{} {}".format(id_map_dict[protein2], id_map_dict[protein1]))
    
print("Number of experimentally positive interactions (with confidence score above {}): {}".format(args.combined_score_threshold, len(experimentally_positive_interactions)))

#### Physical (i.e. binding) vs Functional Interactions

In [None]:
"""would be interested only in physical interactions (typically binding)"""
"""for more information refer to: 'http://version10.string-db.org/help/faq/#how-do-i-extract-purely-experimental-data'""" 
df_ppi_physical = pd.read_csv("{}/{}".format(args.ppi_raw_dir, interactions_action_file), sep='\t')
df_ppi_physical = df_ppi_physical[df_ppi_physical['mode']=='binding'][['item_id_a', 'item_id_b', 'mode']]
df_ppi_physical

In [None]:
physical_interactions = set()
for protein1, protein2 in zip(df_ppi_physical['item_id_a'] , df_ppi_physical['item_id_b']):
    physical_interactions.add("{} {}".format(id_map_dict[protein1], id_map_dict[protein2]))
    physical_interactions.add("{} {}".format(id_map_dict[protein2], id_map_dict[protein1]))

#### Now consider only experimental & physical interactions for which we have full annotations

In [None]:
experimentally_positive_interactions = experimentally_positive_interactions.intersection(physical_interactions)
print("Total number of experimentally positive physical interactions:", len(experimentally_positive_interactions))

In [None]:
"""for a proper experiment, we only consider proteins which are fully annotated"""
fully_annotated_positive_interactions = set()
for interaction in sorted(experimentally_positive_interactions):
    p1, p2 = interaction.split()
    if (p1 in fully_annotated_proteins_wo_iea) and (p2 in fully_annotated_proteins_wo_iea):
        if p1==p2:
            fully_annotated_positive_interactions.add("{} {}".format(p1, p2))
        elif "{} {}".format(p2, p1) not in fully_annotated_positive_interactions:
            fully_annotated_positive_interactions.add("{} {}".format(p1, p2))
            
print("Out of {} gene pairs, {} are fully annotated by all three sub-ontologies.".format(len(experimentally_positive_interactions), len(fully_annotated_positive_interactions)))
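
Because the positive set holds both orientations of every pair, the loop above keeps just one of them. A compact alternative, sketched here on toy data as an assumption rather than as the notebook's method, is to map each pair to a canonical sorted tuple; this should yield the same unordered pairs.

```python
# Toy symmetric set: both orientations of each pair, plus one self-interaction.
symmetric = {"A B", "B A", "A A", "C D", "D C"}

# Sorting the two names gives one canonical representative per unordered pair.
unique_pairs = {tuple(sorted(pair.split())) for pair in symmetric}
print(sorted(unique_pairs))
```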

## Generating Negative PPI data<a id='6'></a>
[back to top](#top)<br>

In [None]:
"""we generate negative interactions by random sampling while making sure they are not positive in any form"""
fully_annotated_negative_interactions = set()
count = 0
while count != len(fully_annotated_positive_interactions):
    protein1, protein2 = np.random.choice(fully_annotated_proteins_wo_iea, 2)
    if "{} {}".format(protein1, protein2) not in all_positive_interactions:
        if "{} {}".format(protein1, protein2) not in fully_annotated_negative_interactions:

            fully_annotated_negative_interactions.add("{} {}".format(protein1, protein2))
            count += 1

print("Total number of positive interactions ({}): {}".format(species, len(fully_annotated_positive_interactions)))
print("Total number of negative interactions ({}): {}".format(species, len(fully_annotated_negative_interactions)))
print("Total size of the PPI dataset is: {}".format(len(fully_annotated_positive_interactions) + len(fully_annotated_negative_interactions)))
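
The sampling cell above draws two proteins with replacement and treats `A B` and `B A` as different negatives, so self-pairs and mirrored duplicates can occur. The sketch below is a stricter variant offered as an assumption, not the notebook's method: `sample_negative_pairs` is a hypothetical helper that draws two distinct proteins, stores each pair in one canonical orientation, and gives up after `max_tries` draws instead of looping forever.

```python
import numpy as np

def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=2021, max_tries=100000):
    """Draw n_pairs unordered protein pairs that are absent from positive_pairs."""
    rng = np.random.default_rng(seed)
    proteins = sorted(proteins)  # fixed order so the seed fully determines the draw
    negatives = set()
    tries = 0
    while len(negatives) < n_pairs:
        if tries == max_tries:
            raise RuntimeError("could not find enough negative pairs")
        tries += 1
        i, j = rng.choice(len(proteins), size=2, replace=False)  # two distinct proteins
        a, b = sorted((proteins[i], proteins[j]))                # canonical orientation
        if "{} {}".format(a, b) in positive_pairs or "{} {}".format(b, a) in positive_pairs:
            continue  # known positive in either orientation
        negatives.add("{} {}".format(a, b))
    return negatives

toy_proteins = ["A", "B", "C", "D", "E"]
toy_positives = {"A B", "B A"}
toy_negatives = sample_negative_pairs(toy_proteins, toy_positives, 4)
print(sorted(toy_negatives))
```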

## Saving the results<a id='7'></a>
[back to top](#top)<br>

In [None]:
with open(f'{args.result_ppi_dir}/{species}_protein_protein_interaction.tsv', 'w') as fw:
    fw.write("Protein_1\tProtein_2\tInteraction_label\n")
    for positive_interaction in sorted(fully_annotated_positive_interactions):
        positive_interaction = positive_interaction.replace(' ', '\t')
        fw.write(f"{positive_interaction}\t1\n")
    for negative_interaction in sorted(fully_annotated_negative_interactions):
        negative_interaction = negative_interaction.replace(' ', '\t')
        fw.write(f"{negative_interaction}\t0\n")

In [None]:
df = pd.read_csv(f"{args.result_ppi_dir}/{species}_protein_protein_interaction.tsv", sep="\t", dtype=str)
df

In [None]:
ppi_genes = set(list(df.Protein_1) + list(df.Protein_2))
print(f"Number of {species} genes:", len(ppi_genes))
with open(f'{args.result_ppi_dir}/{species}_protein_protein_interaction_genes.tsv', 'w') as fw:
    for gene in sorted(ppi_genes):
        fw.write(f"{gene}\n")
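
A quick round-trip check can confirm that the saved file is balanced and free of duplicate rows. The sketch below writes a toy dataset to a temporary directory in the same tab-separated layout and reads it back; the pairs and the file name are made up for illustration.

```python
import os
import tempfile
import pandas as pd

toy_positive_pairs = {"A B", "C D"}
toy_negative_pairs = {"A C", "B D"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_protein_protein_interaction.tsv")
    # Same layout as the saving cell: header row, then positives (1) and negatives (0).
    with open(path, "w") as fw:
        fw.write("Protein_1\tProtein_2\tInteraction_label\n")
        for pair in sorted(toy_positive_pairs):
            fw.write(pair.replace(" ", "\t") + "\t1\n")
        for pair in sorted(toy_negative_pairs):
            fw.write(pair.replace(" ", "\t") + "\t0\n")
    check = pd.read_csv(path, sep="\t", dtype=str)

n_pos = int((check["Interaction_label"] == "1").sum())
n_neg = int((check["Interaction_label"] == "0").sum())
has_duplicates = bool(check.duplicated(subset=["Protein_1", "Protein_2"]).any())
print(n_pos, n_neg, has_duplicates)
```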

[back to top](#top)<br>

---