### Import necessary libraries

In [123]:
import pandas as pd
import numpy as np
import os 
import seaborn as sns
import statistics
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

In [2]:
# enable mpld3 for this notebook, see documentation
%matplotlib inline
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()

In [3]:
# import mpld3 modules
from mpld3 import plugins, utils

In [4]:
import re
s = 'c141y_d228a_n235k_n239m'

In [5]:
re.findall('[0-9]+',s) # example string matching

['141', '228', '235', '239']
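The same idea can be extended to pull out the residues as well as the positions. This is a minimal sketch, assuming every mutation in a nametag is written as letter, number, letter (as in the example string above); `parse_tag` and `MUTATION_RE` are hypothetical names that do not appear elsewhere in this notebook.

```python
import re

# Hypothetical helper: split a nametag into (wild-type residue, position,
# mutant residue) triples. Assumes the letter-number-letter tag format.
MUTATION_RE = re.compile(r'([a-z])([0-9]+)([a-z])')

def parse_tag(tag):
    """Return a list of (wild_type, position, mutant) tuples for one nametag."""
    return [(wt, int(pos), mut) for wt, pos, mut in MUTATION_RE.findall(tag)]

parsed = parse_tag('c141y_d228a_n235k_n239m')
print(parsed)
# [('c', 141, 'y'), ('d', 228, 'a'), ('n', 235, 'k'), ('n', 239, 'm')]
```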

### Load the cleaned dataset

In [6]:
# load in dataset
mutants = pd.read_csv('../data/interim/k8_clean_data.csv', header=None, low_memory=False)

In [7]:
mutants.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,5400,5401,5402,5403,5404,5405,5406,5407,5408,5409
0,-0.161,-0.014,0.002,-0.036,-0.033,-0.093,0.025,0.005,0.0,-0.015,...,0.013,0.021,0.02,0.016,-0.011,0.003,0.01,-0.007,a119e_l125p,inactive
1,-0.158,-0.002,-0.012,-0.025,-0.012,-0.106,0.013,0.005,0.0,-0.002,...,-0.008,0.007,0.015,-0.008,-0.011,-0.004,0.013,0.005,a119e_r283k_a353v,inactive
2,-0.169,-0.025,-0.01,-0.041,-0.045,-0.069,0.038,0.014,0.008,-0.014,...,0.01,0.025,0.025,0.021,-0.012,0.006,0.016,-0.018,c135y,inactive
3,-0.183,-0.051,-0.023,-0.077,-0.092,-0.015,0.071,0.027,0.02,-0.019,...,0.012,0.05,0.038,0.051,-0.015,0.017,0.027,-0.049,c135y_e285m,inactive
4,-0.154,0.005,-0.011,-0.013,-0.002,-0.115,0.005,0.002,-0.003,0.002,...,0.012,0.009,0.003,-0.001,0.002,-0.006,0.009,0.013,c135y_e285v,inactive


In [107]:
# pull out the column of mutants with the nametags

nametags = mutants[5408].astype(str)
print(nametags)

0              a119e_l125p
1        a119e_r283k_a353v
2                    c135y
3              c135y_e285m
4              c135y_e285v
               ...        
16586    y220c_t230c_n239y
16587    y220c_y234f_n239l
16588                y234c
16589          y234c_a119e
16590          y234f_n239l
Name: 5408, Length: 16591, dtype: object


### Subset the data

Implement the analysis on a subset first, then apply the same steps to the full dataset.

First, check the mutant tags to see where the mutations fall in the amino acid chain, then subset based on protein domain.

This gives 5 subsets, one for each of the 5 major p53 domains.

In [138]:
# create empty lists for each domain
# each list will collect the rows of `mutants` assigned to that protein domain

ad_loci = []
dbd_loci = []
td_loci = []
nls_loci = []
bd_loci = []

In [139]:
def binner(num):
    # map a residue position to a domain bin:
    # 0 = ad, 1 = dbd, 2 = td, 3 = nls, 4 = bd
    if num <= 101:
        return 0
    elif num < 305:
        return 1
    elif num < 326:
        return 2
    elif num < 364:
        return 3
    else:
        return 4

In [140]:
# since there could be multiple loci, assign each mutant to the domain that
# holds the most of its mutation loci (the mode of the per-locus bins)
# note: on a tie, statistics.mode returns the first mode found (Python 3.8+)

domain_lists = [ad_loci, dbd_loci, td_loci, nls_loci, bd_loci]

for tag in nametags:
    search = re.findall(r'[0-9]+', str(tag))
    snps = np.array(list(map(int, search)))
    bins = list(map(binner, snps))
    b = statistics.mode(bins)

    # append the row(s) of `mutants` carrying this tag to the matching domain list
    domain_lists[b].append(mutants[mutants[5408] == tag])

# make sure to check the size of each subset

In [142]:
len(dbd_loci)

16591

In [None]:
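# pd.concat raises a ValueError on an empty list, so only concatenate non-empty subsets
if ad_loci:
    ad_loci = pd.concat(ad_loci)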

I tried splitting the dataset by the domain that contains the mutation loci, but the split was unusable: all 16,591 mutants ended up in the DBD list.

Splitting the mutants by protein domain gave wildly imbalanced subsets. Every mutant was assigned to the DBD bin, which means each one carries at least one mutation in the DNA-binding domain (DBD) of p53.
We can look at the domain-wise distribution of the p53 mutations to confirm this.
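One way to look at that distribution is to count every mutation position separately, instead of assigning each mutant to a single domain. The following is a sketch only: `demo_tags` is a small stand-in built from four nametags shown earlier (swap in `nametags` for the real count), `DOMAIN_NAMES` is a hypothetical label list, and `binner` repeats the cut-offs from the cell above so the block runs on its own.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical label list, ordered to match the bin numbers returned by binner.
DOMAIN_NAMES = ['ad', 'dbd', 'td', 'nls', 'bd']

def binner(num):
    # Same cut-offs as the notebook's binner cell.
    if num <= 101:
        return 0
    elif num < 305:
        return 1
    elif num < 326:
        return 2
    elif num < 364:
        return 3
    else:
        return 4

# Stand-in for `nametags`: four tags copied from the outputs above.
demo_tags = pd.Series(['a119e_l125p', 'a119e_r283k_a353v', 'c135y', 'y234c_a119e'])

# Count one entry per mutation position rather than one per mutant.
domain_counts = Counter(
    DOMAIN_NAMES[binner(int(pos))]
    for tag in demo_tags
    for pos in re.findall(r'[0-9]+', tag)
)

# Reindex so domains with no mutations show up as 0 instead of being dropped.
distribution = pd.Series(domain_counts).reindex(DOMAIN_NAMES, fill_value=0)
print(distribution)
```

In this small sample, the triple mutant `a119e_r283k_a353v` has one position outside the DBD (353), yet the mode-based assignment would still place it in the DBD list. Counting per position makes that kind of case visible.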

A more even way to partition the dataset would probably be by the number of mutations per mutant: one, two, three, or four or more.

In [147]:
singles = [] # one mutation
doubles = [] # two mutations
triples = [] # three mutations
multis = [] # four or more mutations

In [148]:
# subset data based on the number of mutations per mutant protein

for tag in nametags:
    # each run of digits in the tag is one mutation locus
    num_loci = len(re.findall(r'[0-9]+', str(tag)))

    if num_loci == 1:
        singles.append(mutants[mutants[5408] == tag])
    elif num_loci == 2:
        doubles.append(mutants[mutants[5408] == tag])
    elif num_loci == 3:
        triples.append(mutants[mutants[5408] == tag])
    elif num_loci >= 4:
        multis.append(mutants[mutants[5408] == tag])
        
# make sure to check the size of each subset

In [149]:
singles = pd.concat(singles)
doubles = pd.concat(doubles)
triples = pd.concat(triples)
multis = pd.concat(multis)

In [150]:
singles.shape

(61, 5410)

In [151]:
doubles.shape

(16374, 5410)

In [152]:
triples.shape

(114, 5410)

In [153]:
multis.shape

(42, 5410)

Splitting the dataset this way also gave imbalanced subsets, but not *quite* as imbalanced as the split by protein domain. The large majority of mutants (16,374 of 16,591, about 98.7%) have exactly two mutations; single mutants (61), triple mutants (114), and mutants with four or more mutations (42) are far less common.
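The group sizes can also be checked without looping over the rows. This is a rough sketch using pandas string methods on a stand-in Series; `demo_tags`, `labels`, and `size_group` are illustrative names, and the real check would use `nametags` in place of `demo_tags`.

```python
import pandas as pd

# Stand-in for `nametags`, using tags that appear earlier in this notebook.
demo_tags = pd.Series([
    'c135y',
    'a119e_l125p',
    'c135y_e285m',
    'a119e_r283k_a353v',
    'c141y_d228a_n235k_n239m',
])

# Each run of digits is one mutation locus, so counting regex matches
# gives the number of mutations per mutant.
num_mutations = demo_tags.str.count(r'[0-9]+')

# Collapse "four or more" into a single group, then attach readable labels.
labels = {1: 'singles', 2: 'doubles', 3: 'triples', 4: 'multis'}
size_group = num_mutations.clip(upper=4).map(labels)

print(size_group.value_counts())
```

A tag with no digits at all would map to `NaN` here, which mirrors the loop above silently skipping such tags.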

Is there perhaps a relationship between the number of mutations and the mutation loci? 
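A cross-tabulation of mutation count against domain is one possible way to start answering this. The sketch below builds one record per mutation position from a few stand-in tags; `demo_tags`, `DOMAIN_NAMES`, `records`, and `table` are illustrative names, and `binner` repeats the cut-offs used earlier so the block is self-contained.

```python
import re

import pandas as pd

# Hypothetical label list, ordered to match the bin numbers returned by binner.
DOMAIN_NAMES = ['ad', 'dbd', 'td', 'nls', 'bd']

def binner(num):
    # Same cut-offs as the notebook's binner cell.
    if num <= 101:
        return 0
    elif num < 305:
        return 1
    elif num < 326:
        return 2
    elif num < 364:
        return 3
    else:
        return 4

# Stand-in for `nametags`: tags copied from the outputs above.
demo_tags = ['c135y', 'a119e_l125p', 'a119e_r283k_a353v', 'y220c_t230c_n239y']

# One record per mutation position: how many mutations its mutant has,
# and which domain the position falls in.
records = []
for tag in demo_tags:
    positions = [int(p) for p in re.findall(r'[0-9]+', tag)]
    for pos in positions:
        records.append({
            'n_mutations': len(positions),
            'domain': DOMAIN_NAMES[binner(pos)],
        })

long_table = pd.DataFrame(records)

# Rows: mutations per mutant. Columns: domain of each mutated position.
table = pd.crosstab(long_table['n_mutations'], long_table['domain'])
print(table)
```

With the full `nametags` column, a table like this would show whether positions outside the DBD appear mainly in mutants with three or more mutations, or are spread across all group sizes.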

*One idea is to randomly subset the rows of the dataset for the preprocessing. Otherwise, if possible, we could make synthetic data to supplement the subsets.*
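The random-subsetting idea could look roughly like the sketch below, which caps each group at a fixed number of randomly chosen rows. The data is synthetic and only mimics the imbalance; `demo`, `cap`, and `balanced` are illustrative names, and the cap of 5 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Synthetic, deliberately imbalanced stand-in for the mutant table.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'feature': rng.normal(size=100),
    'group': ['doubles'] * 90 + ['singles'] * 6 + ['triples'] * 4,
})

# Keep at most `cap` randomly chosen rows per group. Groups smaller than
# the cap are kept whole.
cap = 5
balanced = pd.concat([
    part.sample(n=min(len(part), cap), random_state=0)
    for _, part in demo.groupby('group')
])

print(balanced['group'].value_counts())
```

Downsampling like this discards most of the double mutants, so it is better suited to exploratory preprocessing than to a final analysis.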

## Feature Selection

Let's try visualizing with PCA to find some trends in the data. We'll then plot the data with t-SNE and compare the results.

Reference: 
https://medium.com/analytics-vidhya/pca-vs-t-sne-17bcd882bf3d
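As a starting point for the PCA step, here is a minimal sketch using the `PCA` and `scale` imports from the top of the notebook. It runs on random stand-in data, not on the mutant features; for the real data, `X` would presumably be the numeric feature columns of `mutants` (everything before the nametag and label columns), which is an assumption about the column layout.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Random stand-in for the numeric feature matrix (50 samples, 20 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))

# Standardize each feature to zero mean and unit variance so that no single
# feature dominates the principal components.
X_scaled = scale(X)

# Project onto the first two principal components for a 2-D scatter plot.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)
print(pca.explained_variance_ratio_)
```

The two columns of `coords` can be passed straight to a scatter plot, colored by the active/inactive label, and `explained_variance_ratio_` indicates how much of the total variance those two components capture.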

In [None]:
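# view correlation heatmap of the first subset
# note: `ad_mutants` is never defined above; `singles` is assumed here as the first subset

plt.subplots(figsize=(12,10))
sns.heatmap(singles.corr(numeric_only=True));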