# Conjoint featurization

## Christian Roncal cmsc499a Dr. Leiserson

# Understanding:

The amino acids will be represented by 20 letters:
1. A - ALA  
1. G - GLY
1. V - VAL
1. I - ILE
1. L - LEU
1. F - PHE
1. P - PRO
1. Y - TYR
1. M - MET
1. T - THR 
1. S - SER
1. H - HIS
1. N - ASN
1. Q - GLN
1. W - TRP
1. R - ARG
1. K - LYS
1. D - ASP
1. E - GLU
1. C - CYS

These amino acids are then grouped into classes based on their dipoles and side chains. The rationale is that these two characteristics greatly affect binding. We get 7 classes.

In [1]:
classes = dict.fromkeys(['A', 'G','V'], 1)
classes.update(dict.fromkeys(['I', 'L', 'F', 'P'], 2))
classes.update(dict.fromkeys(['Y', 'M', 'T', 'S'], 3))
classes.update(dict.fromkeys(['H', 'N', 'Q', 'W'], 4))
classes.update(dict.fromkeys(['R', 'K'], 5))
classes.update(dict.fromkeys(['D', 'E'], 6))
classes.update(dict.fromkeys(['C'], 7))

In [6]:
s = 'MSTNPKPQRK'
ctf = []
for c in s:
    ctf.append(str(classes[c]))
print(''.join(ctf))

3334252455


In [7]:
len(classes.keys()), len(classes.values())

(20, 20)

In [8]:
import pickle

with open('classes.pkl', 'wb') as d:
    pickle.dump(classes, d)

# Conjoint featurization

We then define two vector spaces, D and F. D contains every possible grouping of length 2, 3, or 4 over the seven classes, and F holds the normalized count of each grouping's occurrences in a given sequence.

## Building vector space D

There are $7^2 + 7^3 + 7^4 = 2793$ possible groupings (ordered, with repeats allowed, so this is really a Cartesian product rather than a set of permutations). We simply concatenate the three blocks, concretely $$D = D^{7^2} \oplus D^{7^3} \oplus D^{7^4}$$

It follows that the first $7^2$ indices will be taken by $D^{7^2}$, and so on.
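
As a quick check of this index layout, the sketch below computes where each block starts inside the concatenated $D$. It assumes the class labels are the integers 1 to 7 from the `classes` dictionary; the names `sizes` and `offsets` are illustrative only and do not appear elsewhere in the notebook.

```python
# Class labels 1..7, as strings (assumed to match the `classes` dictionary).
alphabet = [str(c) for c in range(1, 8)]

# Number of groupings of length 2, 3 and 4.
lengths = (2, 3, 4)
sizes = [len(alphabet) ** k for k in lengths]

# Index at which each block starts inside the concatenated D.
offsets = [sum(sizes[:i]) for i in range(len(sizes))]

for k, size, start in zip(lengths, sizes, offsets):
    print(f"length {k}: indices {start}..{start + size - 1} ({size} groupings)")

total = sum(sizes)
print("total:", total)
```

So the pairs occupy the first 49 slots, the triples the next 343, and the quadruples the remaining 2401.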

To make counting frequencies easy, it helps to represent $D$ as a dictionary with each grouping as a key and its index as the value. This way, for every grouping we encounter, we simply look up its index in the dictionary and increment the corresponding entry of $F_i$.
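
A minimal sketch of this lookup-and-increment idea, run on the short class string printed earlier for `'MSTNPKPQRK'`. It assumes groupings are counted with an overlapping sliding window, which the text does not state explicitly; `groupings`, `index_of` and `counts` are illustrative names, not part of the notebook.

```python
from itertools import chain, product

# Class labels 1..7 as strings (assumed to match the `classes` dictionary).
alphabet = [str(c) for c in range(1, 8)]

# Every grouping of length 2, 3 and 4, in concatenated order.
groupings = [
    ''.join(p)
    for p in chain.from_iterable(product(alphabet, repeat=k) for k in (2, 3, 4))
]

# Grouping -> index lookup table.
index_of = {g: i for i, g in enumerate(groupings)}

class_seq = '3334252455'  # class string for 'MSTNPKPQRK'
counts = [0] * len(groupings)

# Slide a window of each length over the class string and bump the
# matching slot (assumption: overlapping occurrences are all counted).
for k in (2, 3, 4):
    for j in range(len(class_seq) - k + 1):
        counts[index_of[class_seq[j:j + k]]] += 1

print(counts[index_of['33']], counts[index_of['333']])
```

Each window costs one dictionary lookup, so the work grows with the sequence length rather than with the size of $D$.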

In [1]:
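from itertools import chain, product

def buildD():
    # class labels are 1..7, matching the values in the classes dictionary
    alphabet = list(map(str, range(1, 8)))
    ds = [product(alphabet, repeat=i) for i in [2,3,4]]
    print(ds)
    # flatten the three products into a single list of tuples
    ds = list(chain.from_iterable(ds))
    return ds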

In [2]:
D = buildD()
len(D)

[<itertools.product object at 0x7f259c4a3e58>, <itertools.product object at 0x7f259c4a3ea0>, <itertools.product object at 0x7f259c4a3ee8>]


2793

In [5]:
import pickle

with open('D.pkl', 'wb') as d:
    pickle.dump(D, d)

To build $F_i$ for a given string of amino acids $s$, we create a dictionary mapping each grouping to its index. $F_i$ is then just an array of counts, where $F_i[j]$ is the count of the grouping mapped to index $j$.

In [5]:
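import pickle
import numpy as np

# load the class mapping saved earlier as classes.pkl
with open('classes.pkl', 'rb') as f:
    classes = pickle.load(f)

def seqToClass(seq):
    # map each residue to its class label and join into one string
    return ''.join(str(classes[x]) for x in seq)

def normalize(Fi):
    # shift by the minimum count, then scale by the maximum count
    return (Fi - min(Fi)) / max(Fi)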

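def getFi(D, seq):
    grptoi = {''.join(p): i for i, p in enumerate(D)}  # group to index mappings
    Fi = np.zeros(len(D))

    classSeq = seqToClass(seq)

    # slide a window of length 2, 3, and 4 over the class string so that
    # overlapping groups are counted too (str.count skips overlaps)
    for k in (2, 3, 4):
        for j in range(len(classSeq) - k + 1):
            idx = grptoi.get(classSeq[j:j + k])
            if idx is not None:
                Fi[idx] += 1

    return normalize(Fi)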
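
The `normalize` step above can be written out explicitly. With $f_j$ the raw count for the grouping at index $j$, the code computes

$$F_i[j] = \frac{f_j - \min_k f_k}{\max_k f_k}$$

For example, raw counts $(0, 2, 1, 4)$ become $(0, 0.5, 0.25, 1)$. The expression is undefined when every count is zero (so $\max_k f_k = 0$), which would happen for a sequence shorter than two residues.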

In [6]:
from Bio import SeqIO

# Bio.Alphabet was removed in Biopython 1.78, so no alphabet is attached here
record = SeqIO.read('hepc.fasta', 'fasta')

In [7]:
classSeq = seqToClass(record.seq)

In [8]:
fi = getFi(D, record.seq)

# Prepare uniprots

We need every UniProt accession in the interaction data so that this featurization can be computed for each protein.

In [9]:
from make_matrix import *

In [10]:
df = read_interaction_csvs(['flu_interactions_full.csv', 'ebola_interactions_full.csv', 'hepc_interactions_full.csv'])

Merging files....


In [16]:
virus_prots = df['virusUprot'].unique()
human_prots = df['humanUprot'].unique()

### Save these proteins

In [19]:
import pickle 

with open('virus_uprots.pkl', 'wb') as vu:
    pickle.dump(virus_prots, vu)
    
with open('human_uprots.pkl', 'wb') as hu:
    pickle.dump(human_prots, hu)

### Make the full matrix

In [21]:
df = mp_processpairs(df, mp_pairmatcher, retfn, 20)

In [23]:
df.head()

Unnamed: 0,virusUprot,humanUprot,edge
0,Q6QYT2,P04637,1.0
1,Q6QYT2,P62333,0.0
2,Q6QYT2,Q14257,0.0
3,Q6QYT2,Q13438,0.0
4,Q6QYT2,P08708,0.0


In [22]:
df.to_csv('full_train.csv')

In [36]:
import pandas as pd
df = pd.read_csv('full_train.csv', usecols=['virusUprot', 'humanUprot', 'edge'])

In [37]:
df.head()

Unnamed: 0,virusUprot,humanUprot,edge
0,Q6QYT2,P04637,1.0
1,Q6QYT2,P62333,0.0
2,Q6QYT2,Q14257,0.0
3,Q6QYT2,Q13438,0.0
4,Q6QYT2,P08708,0.0


In [38]:
df.to_csv('full_train.csv', index=False)

In [40]:
df.sample(frac=1).to_csv('full_train.csv', index=False)