# Quantification of differences between 3D protein structures

In this project, we focus on a quantitative representation of the difference between a pair of protein structures.

### Prerequisites

1. The mapping from isoforms to proteins is a bijection.

2. The mapping from gene to isoforms is one-to-many.

3. Any two isoforms transcribed from the same gene will have overlaps between the sequences of their proteins.

### What we need

1. A map from gene -> isoforms -> proteins -> atoms 

2. A map from gene -> exon -> length of the exon (is this number of atoms?)
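
One possible way to hold these two maps in memory is a pair of plain dictionaries. The sketch below is only an illustration: the gene, isoform, and exon identifiers and all lengths are made up, the helper `exon_spans` is hypothetical, and exon length is assumed to be counted in amino acids (the open question above is not settled by this). The protein -> atoms level is left out.

```python
# Hypothetical identifiers and lengths, for illustration only.
# gene -> isoform -> ordered list of exon ids (each isoform maps to one protein)
gene_to_isoforms = {
    "G": {
        "I1": ["e1", "e2", "e4"],
        "I2": ["e1", "e3", "e4"],
    }
}

# gene -> exon -> length (assumed here to be the number of amino acids)
gene_to_exon_length = {
    "G": {"e1": 4, "e2": 3, "e3": 5, "e4": 2}
}


def exon_spans(gene, isoform):
    """Return {exon: (start, end)} residue index ranges (0-based, end inclusive)."""
    spans = {}
    start = 0
    for exon in gene_to_isoforms[gene][isoform]:
        length = gene_to_exon_length[gene][exon]
        spans[exon] = (start, start + length - 1)
        start += length
    return spans


print(exon_spans("G", "I1"))
print(exon_spans("G", "I2"))
```

With spans like these, an exon present in both isoforms directly gives a pair of index ranges of equal length, one in each protein.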

### Algorithm for technique 1

1. We are given the sequences of two proteins, say P<sup>(1)</sup> and P<sup>(2)</sup>, which originate from a single gene, say G.

2. Each protein, say P<sup>(1)</sup>, is a sequence of amino acids a<sup>(1)</sup><sub>1</sub>, a<sup>(1)</sup><sub>2</sub>, ..., a<sup>(1)</sup><sub>n</sub>, where n is the length of the protein.

3. We predict the structures of the proteins using AlphaFold2, so that P<sup>(1)</sup><sub>i</sub> denotes the coordinates of a<sup>(1)</sup><sub>i</sub>.

4. Let proteins P<sup>(1)</sup> and P<sup>(2)</sup> correspond to isoforms I<sup>(1)</sup> and I<sup>(2)</sup>. (Note: I<sup>(1)</sup> and I<sup>(2)</sup> must both be transcribed from the source gene G.)

5. We find the exons shared by I<sup>(1)</sup> and I<sup>(2)</sup>. Then, using the lengths of the exons, i.e. the number of amino acids in each exon, we find the overlapping runs between P<sup>(1)</sup> and P<sup>(2)</sup>.

6. Thus, we get k overlaps L<sub>1</sub>, L<sub>2</sub>, ..., L<sub>k</sub>, where each overlap L<sub>i</sub> stores the starting and ending indices of the overlap in each protein. That is, L<sub>i</sub> is represented as a 4-tuple (i<sub>1</sub>, i<sub>2</sub>, j<sub>1</sub>, j<sub>2</sub>) such that i<sub>2</sub> - i<sub>1</sub> == j<sub>2</sub> - j<sub>1</sub> and a<sup>(1)</sup><sub>i<sub>1</sub> + x</sub> == a<sup>(2)</sup><sub>j<sub>1</sub> + x</sub> for x = 0, 1, ..., j<sub>2</sub> - j<sub>1</sub>.

7. Given an overlapping run in a protein, say a<sup>(1)</sup><sub>i<sub>1</sub></sub> to a<sup>(1)</sup><sub>i<sub>2</sub></sub>, we compute the distance between each pair of amino acids to form a symmetric matrix D<sup>(1)</sup><sub>1</sub> such that D<sup>(1)</sup><sub>1</sub>[i][j] = ||P<sup>(1)</sup><sub>i</sub> - P<sup>(1)</sup><sub>j</sub>|| for i, j in {i<sub>1</sub>, ..., i<sub>2</sub>}.

8. Given all such matrices, one per overlap, say D<sup>(1)</sup><sub>1</sub>, D<sup>(1)</sup><sub>2</sub>, ..., D<sup>(1)</sup><sub>k</sub> for P<sup>(1)</sup> and D<sup>(2)</sup><sub>1</sub>, D<sup>(2)</sup><sub>2</sub>, ..., D<sup>(2)</sup><sub>k</sub> for P<sup>(2)</sup>, we compute the sum of the norms of their differences, i.e. Σ<sub>i</sub> ||D<sup>(1)</sup><sub>i</sub> - D<sup>(2)</sup><sub>i</sub>||<sub>F</sub>.
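
The overlap, distance-matrix, and Frobenius-norm steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names, the input format (ordered exon ids per isoform plus a dictionary of exon lengths in amino acids), and the toy coordinates are all hypothetical, and each protein is assumed to be given as an (n, 3) array with one coordinate per amino acid. The sketch keeps one overlap per shared exon and does not merge adjacent shared exons into a longer run.

```python
import numpy as np


def shared_exon_overlaps(exons_1, exons_2, exon_length):
    """Overlaps (i1, i2, j1, j2) for exons present in both isoforms.

    exons_1, exons_2: ordered exon ids of the two isoforms (hypothetical format).
    exon_length: exon id -> length in amino acids.
    Indices are 0-based and inclusive.
    """
    def spans(exons):
        out, start = {}, 0
        for e in exons:
            out[e] = (start, start + exon_length[e] - 1)
            start += exon_length[e]
        return out

    s1, s2 = spans(exons_1), spans(exons_2)
    return [(*s1[e], *s2[e]) for e in exons_1 if e in s2]


def distance_matrix(coords, start, end):
    """Pairwise Euclidean distances for residues start..end (inclusive)."""
    run = coords[start:end + 1]                  # shape (m, 3)
    diff = run[:, None, :] - run[None, :, :]     # shape (m, m, 3)
    return np.linalg.norm(diff, axis=-1)         # shape (m, m)


def structure_difference(coords_1, coords_2, overlaps):
    """Sum over overlaps of the Frobenius norm of D1_i - D2_i."""
    total = 0.0
    for i1, i2, j1, j2 in overlaps:
        d1 = distance_matrix(coords_1, i1, i2)
        d2 = distance_matrix(coords_2, j1, j2)
        total += np.linalg.norm(d1 - d2, ord="fro")
    return total


# Toy data (hypothetical): two isoforms sharing exons e1 and e4.
exon_length = {"e1": 4, "e2": 3, "e3": 5, "e4": 2}
overlaps = shared_exon_overlaps(["e1", "e2", "e4"], ["e1", "e3", "e4"], exon_length)

rng = np.random.default_rng(0)
coords_1 = rng.normal(size=(9, 3))    # stand-in for predicted coordinates of P(1)
coords_2 = rng.normal(size=(11, 3))   # stand-in for predicted coordinates of P(2)
# Make the shared runs of P(2) rigidly moved copies of those in P(1),
# so every internal distance matches and the score should be close to 0.
rotation = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
coords_2[0:4] = coords_1[0:4] + np.array([5.0, -2.0, 1.0])
coords_2[9:11] = coords_1[7:9] @ rotation.T

score = structure_difference(coords_1, coords_2, overlaps)
print(overlaps)
print(score)
```

Because the score is built from within-run distances only, it does not change when a shared run is translated or rotated as a rigid body, which is why the toy example scores close to zero. It also means the relative placement of different runs is not compared.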

In [1]:
from Bio.PDB import PDBParser
import numpy as np

In [2]:
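# PERMISSIVE=True tolerates malformed records; QUIET=True suppresses parser warnings.
parser = PDBParser(PERMISSIVE=True, QUIET=True)

# The first argument is an arbitrary structure id; the second is the path to the PDB file.
data = parser.get_structure("2fat", "/Users/aasavarikakne/Desktop/xplore/prediction/selected_prediction.pdb")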

In [3]:
for key, value in data.header.items():
    print(key, value)

name 
head 
idcode 
deposition_date 1909-01-08
release_date 1909-01-08
structure_method unknown
resolution None
structure_reference []
journal_reference 
author 
compound {'1': {'misc': ''}}
source {'1': {'misc': ''}}
has_missing_residues False
missing_residues []


In [4]:
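# Collect the 3D coordinates of every atom in the structure
# (all atoms of all residues, not one point per amino acid).
positions = []

for model in data.get_models():
    for chain in model.get_chains():
        for residue in chain.get_residues():
            for atom in residue.get_atoms():
                # get_vector() returns a Bio.PDB Vector; get_array() gives its NumPy array
                positions.append(atom.get_vector().get_array())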

In [5]:
# Stack the per-atom vectors into a single (n_atoms, 3) array.
positions = np.stack(positions, axis=0)
print(positions.shape)

(1070, 3)
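
The `positions` array above has one row per atom, whereas the algorithm indexes coordinates by amino acid. One possible way to get a single point per residue is to take its C-alpha atom. That choice is an assumption made here for illustration and is not fixed by the text above. The helper name `ca_coordinates` is hypothetical; it relies only on the `get_residues()`, `"CA" in residue`, `residue["CA"]`, and `get_coord()` behaviour of Bio.PDB objects.

```python
import numpy as np


def ca_coordinates(structure):
    """Return an (n_residues, 3) array with one C-alpha position per residue.

    `structure` is expected to behave like a Bio.PDB Structure.
    Residues without a "CA" atom are skipped. If the structure holds
    several models, residues from all of them are included.
    """
    coords = [
        residue["CA"].get_coord()
        for residue in structure.get_residues()
        if "CA" in residue
    ]
    return np.asarray(coords, dtype=float)
```

Calling `ca_coordinates(data)` on the structure parsed earlier would give an array whose row i can play the role of P<sup>(1)</sup><sub>i</sub>, provided the file contains a single model and every residue has a C-alpha atom.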
