### Get complementary PPIs (nonPPIs)

From the set of all possible pairs of proteins, we need to know which pairs are not members of our set of PPIs.
We refer to these pairs as the nonPPIs set. It is from this set that we can identify our missing PPIs.
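
As a minimal sketch of the idea, assume a toy set of four proteins and two known PPIs (both invented here purely for illustration, not taken from the dataset). The nonPPIs are the pairs left over once the known PPIs are removed from all possible pairs:

```python
import itertools

# Hypothetical toy inputs, for illustration only
toy_proteins = ['A', 'B', 'C', 'D']
toy_ppis = {('A', 'B'), ('C', 'D')}

# Every unordered pair of distinct proteins
all_pairs = set(itertools.combinations(toy_proteins, 2))

# The complement: pairs that are not known PPIs
toy_non_ppis = sorted(all_pairs - toy_ppis)
print(toy_non_ppis)
```

Four proteins give six possible pairs, so removing the two known PPIs leaves four nonPPIs.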

The PPIs in the original data are converted to numeric values for easier evaluation with machine learning models.

The output is also numerized for later processing.
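
The numerization step can be pictured with a small sketch. The three pairs are a hypothetical sample using protein names from this dataset, and the variable names are illustrative. Sorting the names before enumerating them is an assumption added here so that the mapping is reproducible; enumerating an unsorted set does not guarantee a particular order.

```python
# Hypothetical sample of PPIs, for illustration only
sample_ppis = [('CCL11', 'CCR3'), ('CCL17', 'CCR4'), ('CCL22', 'CCR4')]

# Collect the distinct protein names and sort them for a stable order
sample_prots = sorted({name for pair in sample_ppis for name in pair})

# Map each protein name to its integer index
prot_to_index = {name: idx for idx, name in enumerate(sample_prots)}

# Replace the names in each pair with their indices
sample_numerized = [(prot_to_index[p1], prot_to_index[p2]) for p1, p2 in sample_ppis]
print(sample_numerized)
```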

For this, we need the ```Asthma and Allergy``` dataset, which contains 1425 PPIs.

In [1]:
import csv
import itertools

file = 'Data/Allergy_and_Asthma.txt'
#sample = 'sample.txt'

PPIs = []
ppiScores = []
with open(file, 'r') as f:
    reader = csv.reader(f, delimiter = '\t')
    for p1, p2, score in reader:
        PPIs.append((p1, p2))
        ppiScores.append(score)

uniqueProts = list(set([i for p in PPIs for i in p]))

# Represent each protein as a number (its index in uniqueProts)
protIndextDict = {k: v for v, k in enumerate(uniqueProts)}
numerizedProt = [(protIndextDict[p1], protIndextDict[p2]) for (p1, p2) in PPIs]

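# combinations() returns a one-shot iterator, so separate copies are built for printing later
protCombinations = itertools.combinations(uniqueProts, 2)
_protCombinations = itertools.combinations(uniqueProts, 2) # Copy for printing later
numerizedProtCombinations = itertools.combinations(range(len(uniqueProts)), 2)
_numerizedProtCombinations = itertools.combinations(range(len(uniqueProts)), 2) # Copy for printing later

# nonPPIs = all possible pairs minus the known PPIs.
# Note: tuples compare in order, so a PPI stored as (B, A) does not remove the combination (A, B).
nonPPIs = list(set(protCombinations) - set(PPIs))
numerizedNonPPIs = list(set(numerizedProtCombinations) - set(numerizedProt))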

numerizedPPIs = []
for prot, score in zip(numerizedProt, ppiScores):
    _score = (int(score) // 100) * 100 # Round down to a multiple of 100 (e.g. 999 -> 900) to reduce the number of classes
    numerizedPPIs.append(prot + (_score,))

def writeToFile(filename, data):
    with open(filename, 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(data)
    
    
#outputSample = 'sample_nonPPIs.txt'
output = [
    ('Data/Processed/Numerized_Allergy_and_Asthma.txt', numerizedPPIs),
    ('Data/Processed/Allergy_and_Asthma_nonPPIs.txt', nonPPIs),
    ('Data/Processed/Numerized_Allergy_and_Asthma_nonPPIs.txt', numerizedNonPPIs),
]

for filename, data in output:
    writeToFile(filename, data)
    

### Peek at the variables in the script
Below is a printed preview of the variables in the above script to help understand the process.
The PPI scores of the numerized PPIs are reduced to multiples of 100. As the output shows, the script rounds each score down (999 becomes 900 and 151 becomes 100). The bin width of 100 is an arbitrary choice that creates at most 10 classes to predict. Each Y value (PPI score) is a class, and with fewer classes the classification task generally becomes easier for the model.
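
The binning can be sketched as follows. The helper name `bin_score` and the sample scores are hypothetical. Integer floor division is used because it reproduces the values shown in the printed output (999 maps to 900, 151 maps to 100), and the count of 10 classes assumes that scores lie in the range 0 to 999.

```python
def bin_score(score, width=100):
    """Map a raw score (string or int) to the lower edge of its bin."""
    return (int(score) // width) * width

raw_scores = ['999', '999', '151', '400', '87']  # hypothetical sample scores
binned = [bin_score(s) for s in raw_scores]
print(binned)

# Assuming scores lie in 0..999, they fall into at most 10 distinct bins
possible_classes = sorted({bin_score(s) for s in range(1000)})
print(len(possible_classes))
```

A smaller `width` would keep more detail in the scores at the cost of more classes.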

In [2]:
import numpy as np
_PPIs = np.array(PPIs)
print 'PPIs:\n', _PPIs
print 'Shape: ', _PPIs.shape

PPIs:
[['CCL11' 'CCR3']
 ['CCL17' 'CCR4']
 ['CCL22' 'CCR4']
 ..., 
 ['CHIA' 'IL4']
 ['FCER1A' 'PPARG']
 ['IL2RA' 'RNASE2']]
Shape:  (1425L, 2L)


In [3]:
_ppiScores = np.array(ppiScores)
print 'PPI Scores: ', _ppiScores
print 'Shape: ', _ppiScores.shape

PPI Scores:  ['999' '999' '999' ..., '151' '151' '151']
Shape:  (1425L,)


In [4]:
_uniqueProts = np.array(uniqueProts)
print 'Unique Proteins: ', _uniqueProts
print 'Shape: ', _uniqueProts.shape

Unique Proteins:  ['CCL2' 'FOXP3' 'PDCD1' 'CRLF2' 'IL1RL1' 'EPX' 'CCL5' 'TBX21' 'CCL8'
 'IL12B' 'IL12A' 'IL21' 'LTB4R' 'CCR3' 'CCR4' 'ADAM33' 'IFNGR2' 'TSLP'
 'CCR8' 'CD40LG' 'CLCA1' 'MS4A2' 'IL17A' 'STAT6' 'CHIA' 'FCER1A' 'PRG2'
 'CPA3' 'AREG' 'CHI3L1' 'RORC' 'STAT5A' 'IL2RA' 'CYSLTR1' 'RNASE3'
 'TNFRSF4' 'CCL22' 'ICOS' 'CCL24' 'CCL26' 'TGFB1' 'IL5RA' 'RNASE2' 'IL4R'
 'IL13RA2' 'IFNG' 'ALOX5' 'MAF' 'RETNLB' 'CLC' 'PMCH' 'PPARG' 'CMA1'
 'IL3RA' 'IL33' 'SATB1' 'IL31' 'GPR44' 'KITLG' 'CSF3R' 'GATA3' 'MRC1'
 'IL18' 'CCL11' 'IL13RA1' 'BCL6' 'POSTN' 'CCL17' 'IL10' 'ARG1' 'IL13'
 'CSF2' 'KIT' 'ADRB2' 'IL4' 'IL5' 'SIGLEC8' 'IL3' 'MMP9' 'IL25' 'TPSAB1'
 'TNFSF4' 'IL17RB' 'IL9']
Shape:  (84L,)


In [5]:
print 'Mapping of Protein to Numeric Value: ', protIndextDict
print 'Count: ', len(protIndextDict)

Mapping of Protein to Numeric Value:  {'CCL2': 0, 'FOXP3': 1, 'IL13': 70, 'CRLF2': 3, 'IL1RL1': 4, 'EPX': 5, 'CCL5': 6, 'TBX21': 7, 'CCL8': 8, 'IL13RA2': 44, 'IL12B': 9, 'IL12A': 10, 'IL21': 11, 'LTB4R': 12, 'CCR3': 13, 'CCR4': 14, 'ADAM33': 15, 'IFNGR2': 16, 'TSLP': 17, 'CCR8': 18, 'CD40LG': 19, 'CLCA1': 20, 'MS4A2': 21, 'CYSLTR1': 33, 'STAT6': 23, 'CHIA': 24, 'FCER1A': 25, 'CPA3': 27, 'CHI3L1': 29, 'RORC': 30, 'ADRB2': 73, 'GATA3': 60, 'IFNG': 45, 'TNFRSF4': 35, 'CCL22': 36, 'ICOS': 37, 'CCL24': 38, 'PDCD1': 2, 'TGFB1': 40, 'TNFSF4': 81, 'RNASE3': 34, 'RNASE2': 42, 'IL17A': 22, 'AREG': 28, 'IL5RA': 41, 'ALOX5': 46, 'MMP9': 78, 'IL4': 74, 'CLC': 49, 'PMCH': 50, 'PPARG': 51, 'IL4R': 43, 'IL9': 83, 'IL3RA': 53, 'IL33': 54, 'SATB1': 55, 'IL31': 56, 'GPR44': 57, 'KITLG': 58, 'CSF3R': 59, 'RETNLB': 48, 'MRC1': 61, 'IL18': 62, 'IL2RA': 32, 'IL13RA1': 64, 'BCL6': 65, 'POSTN': 66, 'IL10': 68, 'ARG1': 69, 'CCL17': 67, 'CMA1': 52, 'CSF2': 71, 'CCL11': 63, 'STAT5A': 31, 'IL5': 75, 'SIGLEC8': 76,

In [6]:
_numerizedProt = np.array(numerizedProt)
print 'Numerized PPIs:\n', _numerizedProt
print 'Shape: ', _numerizedProt.shape

Numerized PPIs:
[[63 13]
 [67 14]
 [36 14]
 ..., 
 [24 74]
 [25 51]
 [32 42]]
Shape:  (1425L, 2L)


In [7]:
__protCombinations = np.array(list(_protCombinations))
print 'All possible PPI combinations:\n', __protCombinations
print 'Shape: ', __protCombinations.shape

All possible PPI combinations:
[['CCL2' 'FOXP3']
 ['CCL2' 'PDCD1']
 ['CCL2' 'CRLF2']
 ..., 
 ['TNFSF4' 'IL17RB']
 ['TNFSF4' 'IL9']
 ['IL17RB' 'IL9']]
Shape:  (3486L, 2L)
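
The shape above agrees with the number of ways to choose 2 proteins out of 84, which is 84 × 83 / 2 = 3486. A quick check, using illustrative variable names:

```python
import itertools

n_proteins = 84  # number of unique proteins reported above

# Closed form for "n choose 2"
expected_pairs = n_proteins * (n_proteins - 1) // 2

# Same count obtained by enumerating the combinations
enumerated_pairs = sum(1 for _ in itertools.combinations(range(n_proteins), 2))

print('%d %d' % (expected_pairs, enumerated_pairs))
```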


In [8]:
__numerizedProtCombinations = np.array(list(_numerizedProtCombinations))
print 'All possible PPI combinations (Numerized):\n', __numerizedProtCombinations
print 'Shape: ', __numerizedProtCombinations.shape

All possible PPI combinations (Numerized):
[[ 0  1]
 [ 0  2]
 [ 0  3]
 ..., 
 [81 82]
 [81 83]
 [82 83]]
Shape:  (3486L, 2L)


In [9]:
_nonPPIs = np.array(nonPPIs)
print 'List of Non-PPIs:\n', _nonPPIs
print 'Shape: ', _nonPPIs.shape

List of Non-PPIs:
[['RNASE3' 'TPSAB1']
 ['RETNLB' 'CSF3R']
 ['PDCD1' 'IL5']
 ..., 
 ['CCL2' 'CYSLTR1']
 ['CCL8' 'CSF3R']
 ['CCR8' 'SIGLEC8']]
Shape:  (2737L, 2L)


In [10]:
_numerizedNonPPIs = np.array(numerizedNonPPIs)
print 'List of Non-PPIs (Numerized):\n', _numerizedNonPPIs
print 'Shape: ', _numerizedNonPPIs.shape

List of Non-PPIs (Numerized):
[[32 54]
 [21 28]
 [ 4 36]
 ..., 
 [ 8 80]
 [28 51]
 [27 75]]
Shape:  (2737L, 2L)
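
If each of the 1425 PPIs removed exactly one of the 3486 combinations, 3486 - 1425 = 2061 nonPPIs would remain rather than 2737. One possible cause, offered here as an assumption that has not been checked against the data file, is that a set difference on tuples is order-sensitive: a PPI stored as `(B, A)` does not cancel the combination `(A, B)`. The toy sketch below (hypothetical names and data) shows the effect and one way to compare pairs regardless of order using `frozenset`.

```python
import itertools

toy_proteins = ['A', 'B', 'C']        # hypothetical proteins
toy_ppis = [('B', 'A'), ('B', 'C')]   # the first pair is stored in reverse order

all_pairs = list(itertools.combinations(toy_proteins, 2))

# Order-sensitive difference: ('A', 'B') survives although B-A is a known PPI
naive_non_ppis = sorted(set(all_pairs) - set(toy_ppis))

# Order-insensitive difference: compare pairs as frozensets
known = {frozenset(pair) for pair in toy_ppis}
strict_non_ppis = [pair for pair in all_pairs if frozenset(pair) not in known]

print(naive_non_ppis)
print(strict_non_ppis)
```

If the data file does contain reversed or duplicated pairs, the order-insensitive version would be the safer basis for the nonPPIs set.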


In [11]:
_numerizedPPIs = np.array(numerizedPPIs)
print 'List of PPIs with Scores (Numerized):\n', _numerizedPPIs
print 'Shape: ', _numerizedPPIs.shape

List of PPIs with Scores (Numerized):
[[ 63  13 900]
 [ 67  14 900]
 [ 36  14 900]
 ..., 
 [ 24  74 100]
 [ 25  51 100]
 [ 32  42 100]]
Shape:  (1425L, 3L)
