# Conjoint featurization

## Christian Roncal cmsc499a Dr. Leiserson

# Understanding:

The amino acids will be represented by 20 letters:
1. A - ALA  
1. G - GLY
1. V - VAL
1. I - ILE
1. L - LEU
1. F - PHE
1. P - PRO
1. Y - TYR
1. M - MET
1. T - THR 
1. S - SER
1. H - HIS
1. N - ASN
1. Q - GLN
1. W - TRP
1. R - ARG
1. K - LYS
1. D - ASP
1. E - GLU
1. C - CYS

These amino acids are then grouped into classes based on their dipoles and side chains. The rationale is that these two characteristics strongly affect binding. This yields 7 classes.

In [1]:
classes = dict.fromkeys(['A', 'G','V'], 1)
classes.update(dict.fromkeys(['I', 'L', 'F', 'P'], 2))
classes.update(dict.fromkeys(['Y', 'M', 'T', 'S'], 3))
classes.update(dict.fromkeys(['H', 'N', 'Q', 'W'], 4))
classes.update(dict.fromkeys(['R', 'K'], 5))
classes.update(dict.fromkeys(['D', 'E'], 6))
classes.update(dict.fromkeys(['C'], 7))

In [2]:
len(classes.keys()), len(classes.values())

(20, 20)
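As a quick sanity check on the grouping, the mapping can be inverted to list the members of each class, and a short peptide can be translated by hand. This is a minimal sketch: the `classes` dictionary is repeated so the block runs on its own, and `members`, `toy_peptide`, and `toy_classes` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Same residue-to-class mapping as in the cell above, repeated so this runs standalone.
classes = dict.fromkeys(['A', 'G', 'V'], 1)
classes.update(dict.fromkeys(['I', 'L', 'F', 'P'], 2))
classes.update(dict.fromkeys(['Y', 'M', 'T', 'S'], 3))
classes.update(dict.fromkeys(['H', 'N', 'Q', 'W'], 4))
classes.update(dict.fromkeys(['R', 'K'], 5))
classes.update(dict.fromkeys(['D', 'E'], 6))
classes.update(dict.fromkeys(['C'], 7))

# Invert the mapping: class number -> list of member residues.
members = defaultdict(list)
for residue, cls in classes.items():
    members[cls].append(residue)

for cls in sorted(members):
    print(cls, ''.join(members[cls]))

# Translate a short, made-up peptide into its class string.
toy_peptide = 'MKVLA'  # hypothetical example, not taken from the data
toy_classes = ''.join(str(classes[aa]) for aa in toy_peptide)
print(toy_classes)  # M->3, K->5, V->1, L->2, A->1, so '35121'
```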

# Conjoint featurization

We then define two vector spaces, D and F. D contains every possible permutation (with repeats) of length 2, 3, or 4 over the 7 classes, and F holds the normalized count of each permutation's occurrences in a given sequence.

## Building vector space D

There are $7^2 + 7^3 + 7^4 = 2793$ possible permutations (with repeats, so really a Cartesian product). We simply concatenate the three blocks; concretely, $$D = D^{7^2} \oplus D^{7^3} \oplus D^{7^4}$$

It follows that the first $7^2$ indices will be taken by $D^{7^2}$, the next $7^3$ by $D^{7^3}$, and the last $7^4$ by $D^{7^4}$.
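The index ranges implied by this concatenation can be worked out directly. The following is a small sketch of the arithmetic only; `sizes`, `total`, and `offsets` are illustrative names, and the ranges assume 0-based indexing.

```python
# Block sizes for groups of length 2, 3 and 4 over 7 classes.
lengths = (2, 3, 4)
sizes = [7 ** k for k in lengths]  # [49, 343, 2401]
total = sum(sizes)                 # 2793

# First and last index (inclusive, 0-based) occupied by each block.
offsets = {}
start = 0
for k, size in zip(lengths, sizes):
    offsets[k] = (start, start + size - 1)
    start += size

print(total)
print(offsets)  # {2: (0, 48), 3: (49, 391), 4: (392, 2792)}
```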

To count frequencies easily, it helps to represent $D$ as a dictionary with the permutation as the key and its index as the value. This way, for every grouping we encounter, we simply look up its index in the dictionary and increment the corresponding entry of $F_i$.
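The lookup described here might look like the sketch below. It assumes the groups are labelled `'1'` to `'7'` so that they line up with the values in `classes`; `group_to_index` is a hypothetical name. Because `itertools.product` emits tuples in lexicographic order, a pair `(a, b)` lands at index `7 * (a - 1) + (b - 1)`.

```python
from itertools import chain, product

# Assumption: class labels are the strings '1'..'7', matching the `classes` values.
alphabet = [str(c) for c in range(1, 8)]
D = list(chain.from_iterable(product(alphabet, repeat=k) for k in (2, 3, 4)))

# Map each group (a tuple of class labels) to its position in D.
group_to_index = {group: i for i, group in enumerate(D)}

print(group_to_index[('1', '1')])            # 0, first pair
print(group_to_index[('7', '7')])            # 48, last pair
print(group_to_index[('1', '1', '1')])       # 49, first triple
print(group_to_index[('7', '7', '7', '7')])  # 2792, last quadruple
print(group_to_index[('3', '5')])            # 7 * 2 + 4 = 18
```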

In [75]:
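from itertools import chain, product

def buildD():
    # Class labels are 1..7, matching the values in `classes`.
    alphabet = list(map(str, range(1, 8)))
    # All ordered groups (with repeats) of length 2, 3 and 4.
    ds = [product(alphabet, repeat=i) for i in [2, 3, 4]]
    # Concatenate the blocks in order: 7^2, then 7^3, then 7^4 entries.
    return list(chain.from_iterable(ds))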

In [127]:
D = buildD()
len(D)

2793

To build $F_i$ for a given string of amino acids $s$, we create a dictionary mapping each grouping to its index. $F_i$ is then just an array of counts, where $F_i[j]$ is the count of the grouping mapped to index $j$.
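One way to obtain these counts is a sliding window over the class string, sketched below on a toy input. `window_counts` is a hypothetical helper, not part of the notebook. Note that Python's `str.count` only counts non-overlapping matches, so it can return a smaller number than the sliding window when a grouping overlaps itself.

```python
from collections import Counter

def window_counts(class_seq, k):
    """Count every length-k window in class_seq, including overlapping ones."""
    return Counter(class_seq[i:i + k] for i in range(len(class_seq) - k + 1))

toy = '31313'  # toy class string

pairs = window_counts(toy, 2)
triples = window_counts(toy, 3)

print(pairs)    # '31' twice, '13' twice
print(triples)  # '313' twice (positions 0 and 2), '131' once

# str.count skips the overlapping second match of '313'.
print(toy.count('313'))  # 1
```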

In [184]:
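import numpy as np

def seqToClass(seq):
    # Translate each residue letter into its class digit, e.g. 'AK' -> '15'.
    return ''.join(str(classes[x]) for x in seq)

def normalize(Fi):
    # Shift by the minimum count, then scale by the maximum count.
    return (Fi - min(Fi)) / max(Fi)

def getFi(D, seq):
    grptoi = {p: i for i, p in enumerate(D)}  # group to index mappings
    Fi = np.zeros(len(grptoi))

    classSeq = seqToClass(seq)
    # print(classSeq)

    # Slide a window of each group length over the class string so that
    # overlapping occurrences are counted (str.count would miss them).
    for k in sorted({len(p) for p in D}):
        for start in range(len(classSeq) - k + 1):
            group = tuple(classSeq[start:start + k])
            if group in grptoi:  # skip windows that are not part of D
                Fi[grptoi[group]] += 1

    return normalize(Fi)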
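A small worked example of this normalization, $(F_i - \min F_i) / \max F_i$, may help. Because the divisor is the maximum rather than the range, the largest entry only reaches 1 when the minimum count is 0. The sketch below also guards against an all-zero count vector, which would otherwise divide by zero; `normalize_safe` is a hypothetical variant, not the function used in this notebook.

```python
import numpy as np

def normalize_safe(Fi):
    """Same formula as normalize, but returns zeros when every count is zero."""
    Fi = np.asarray(Fi, dtype=float)
    peak = Fi.max()
    if peak == 0:
        return np.zeros_like(Fi)
    return (Fi - Fi.min()) / peak

a = normalize_safe([0, 2, 4])  # minimum is 0, so values are 0, 0.5, 1
b = normalize_safe([1, 2, 4])  # minimum is 1, so values are 0, 0.25, 0.75
c = normalize_safe([0, 0, 0])  # all zeros stay zeros instead of becoming NaN

print(a, b, c)
```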

In [185]:
from Bio import SeqIO

# Read the single protein record from the FASTA file.
# Bio.Alphabet was removed in Biopython 1.78, so no alphabet is attached here.
record = SeqIO.read('hepc.fasta', 'fasta')

In [186]:
classSeq = seqToClass(record.seq)

In [187]:
fi = getFi(D, record.seq)

31313425614642731216253424352225221122237451351145161234337271142314154133521125373434413223413331273232124335124511166316135116243131333641572741212622361611524531217522256611241124422113422762626111233323623423163155521513223213331342312325137335463261622614224546311423516364511226326224166665613121622553552221222415263422226245626312211417222245122222255553112363313312162135321331331163131312264336416313616333332226162162623613433136611661177333
