First, we'll import the necessary modules, read the data and print its first 5 rows:

In [2]:
import numpy as np
import pandas as pd
import matplotlib as mpl

In [3]:
train_csv_raw = pd.read_csv("/Users/daniiltekunov/Desktop/mhc/mhc_train.csv")

print(train_csv_raw[0:5])

        mhc   sequence      meas  pep_class
0  HLAA0101  AADFPGIAR  0.084687          0
1  HLAA0101  AADKAAAAY  0.638438          1
2  HLAA0101  AADSFATSY  0.599135          1
3  HLAA0101  AAFLDDNAF  0.084687          0
4  HLAA0101  AAGLPAIFV  0.084687          0


Now we'll pre-process this data to build a feature space from its qualitative (categorical) features.

First, we'll remove an outlier that was detected by visually inspecting the data: element 90849, HLAE0101, which has an 'E' letter in its mhc code. Since this element is just 0.0011% of the whole dataset, we'll remove it. This shrinks the feature space: the letter can then only be 'A' or 'B', so a three-way one-hot feature such as [0,0,0] collapses to a single Boolean.

In [4]:
train_csv = train_csv_raw[train_csv_raw.mhc != "HLAE0101"]
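
Before dropping a rare category, it can help to count how often each locus letter occurs. The following is a minimal sketch on a tiny made-up frame, not the real `mhc_train.csv`; the names `toy`, `locus`, `locus_counts` and `filtered` are hypothetical. It filters on the letter itself rather than on the full code, which is an alternative to the comparison used above, not the same operation.

```python
import pandas as pd

# Toy stand-in for the training data (values are illustrative only).
toy = pd.DataFrame({"mhc": ["HLAA0101", "HLAA0201", "HLAB0702", "HLAE0101"]})

# The fourth character of the code is the locus letter ('A', 'B', ...).
locus = toy["mhc"].str[3]

# How many rows carry each letter; a rare letter stands out here.
locus_counts = locus.value_counts()

# Keep only the rows whose locus letter is 'A' or 'B'.
filtered = toy[locus.isin(["A", "B"])]
```

On the real data, `locus_counts / len(train_csv_raw)` would give the share of each letter.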

After that we can start building the feature space, using the following methods:

1) Data normalization

2) Changing the dimension of a vector space via One-hot encoding

### We'll start with the protein structure:

* *HLA* - constant value

* *A* - depends on the structure of the protein; it can be 'A' or 'B', so after removing the outlier it is effectively Boolean

* *0101* - a four-digit number that ranges from 0101 to 9901

Analysis with pandas shows that the third digit of this number is always 0, which lets us reduce the four-digit number to a three-digit one as follows:

* 0101 -> 011
* 0202 -> 022
* ...
* 9901 -> 991
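
The mapping listed above drops the digit that is always 0, which in these examples is the third character of the four-digit part. A minimal sketch of that reduction follows; `reduce_allele_code` is a hypothetical helper, and the locus letters in the sample codes are illustrative.

```python
def reduce_allele_code(mhc: str) -> str:
    """Drop the always-zero third digit: 'HLAA0101' -> '011'."""
    digits = mhc[4:8]  # the four-digit part after 'HLA' and the locus letter
    if len(digits) != 4 or not digits.isdigit():
        raise ValueError(f"unexpected mhc code: {mhc!r}")
    if digits[2] != "0":
        # Guard the assumption that the dropped digit carries no information.
        raise ValueError(f"third digit is not 0 in {mhc!r}")
    return digits[:2] + digits[3]


# Sample codes (the locus letters here are illustrative).
examples = ["HLAA0101", "HLAA0202", "HLAB9901"]
reduced = [reduce_allele_code(code) for code in examples]
```

The explicit check on the third digit makes the function fail loudly if a new dataset breaks the assumption, instead of silently merging different codes.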

Initially, the hypothesis was that this feature is numeric, so the first step was to normalize the values:

In [12]:
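length_of_train_csv = len(train_csv)

# Assumption: listed_csv_mhc is not defined in the cells shown here;
# it is taken to be the mhc column as a plain list.
listed_csv_mhc = train_csv.mhc.tolist()


def extract_ints_from_mhc(input_list):
    # Takes characters 4..6 of each code, e.g. "HLAA0101" -> "010" -> 10.
    # Note: this keeps the first three digits, not the 0101 -> 011 form.
    outcome_list = list()

    for i in range(0, length_of_train_csv):
        outcome_list.append(int(input_list[i][4:7]))
    return outcome_list


def normalize(X):
    # Z-score: subtract the mean and divide by the standard deviation
    return (X-X.mean())/X.std()

normalize(pd.Series(extract_ints_from_mhc(listed_csv_mhc)))[:5]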

0   -1.053061
1   -1.053061
2   -1.053061
3   -1.053061
4   -1.053061
dtype: float64

Closer analysis showed that these values are ordinal rather than numeric, so normalization does not apply here.

### Now let's pre-process the 'sequence' feature:

It has 9 positions, each holding a letter of the English alphabet.

Closer analysis showed that no sequence contains 'Z', 'X', 'J', 'U' or 'O', so we get 21 different letters at 9 positions.

With one-hot encoding we get 21*9 = 189 Boolean features, from '1A' to '9Y'.
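
The 189 features can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the notebook's own code: `ALPHABET`, `INDEX`, `one_hot_peptide` and `feature_names` are hypothetical names, and the letter order is alphabetical.

```python
import numpy as np

# 21 letters: the English alphabet without J, O, U, X and Z.
ALPHABET = "ABCDEFGHIKLMNPQRSTVWY"
INDEX = {letter: i for i, letter in enumerate(ALPHABET)}
PEPTIDE_LENGTH = 9


def one_hot_peptide(sequence):
    """Encode a 9-letter peptide as a flat vector of 9 * 21 = 189 zeros and ones."""
    if len(sequence) != PEPTIDE_LENGTH:
        raise ValueError(f"expected {PEPTIDE_LENGTH} letters, got {len(sequence)}")
    encoded = np.zeros((PEPTIDE_LENGTH, len(ALPHABET)), dtype=np.int8)
    for position, letter in enumerate(sequence):
        encoded[position, INDEX[letter]] = 1  # KeyError on an unknown letter
    return encoded.ravel()


# Feature names run from '1A' to '9Y': position first, then letter.
feature_names = [
    f"{position}{letter}"
    for position in range(1, PEPTIDE_LENGTH + 1)
    for letter in ALPHABET
]

vector = one_hot_peptide("AALEGLSGF")
```

Exactly one feature per position is set, so every encoded peptide sums to 9.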

### Now we finalize the feature space:

For HLAA0101 - AALEGLSGF it will be as follows:

{ 0,   
1, # is HLAA == true   
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #1A   
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #2A   
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #3L   
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #4E   
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #5G   
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #6L   
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, #7S   
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #8G   
0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, #9F   
0.21281259490425353,  
-1.729404 }  

This is how we obtain the necessary values:


In [16]:
def encoding(input_list):
    encoding_rules1 = {'A': (1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'B': (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'C': (0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'D': (0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'E': (0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'F': (0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'G': (0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'H': (0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'I': (0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'K': (0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'L': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'M': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
                       'N': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
                       'P': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0),
                       'Q': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
                       'R': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0),
                       'S': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                       'T': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
                       'V': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0),
                       'W': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
                       'Y': (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)}  # One-hot encoding rules

    encoding_rules2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D',  
                       5: 'E', 6: 'F', 7: 'G', 8: 'H',
                       9: 'I', 10: 'K', 11: 'L', 12: 'M',
                       13: 'N', 14: 'P', 15: 'Q', 16: 'R',
                       17: 'S', 18: 'T', 19: 'V', 20: 'W',
                       21: 'Y'}  

    encoded_list = list()

    # Each sequence becomes a list of one-hot tuples, one per position
    for sequence in input_list:
        encoded_list.append([encoding_rules1[letter] for letter in sequence])

    return encoded_list


def extract_ints_from_mhc(input_list):

    outcome_list = list()

    for i in range(0, length_of_train_csv):
        outcome_list.append(int(input_list[i][4:7]))
    return outcome_list


def extract_booleans_from_mhc(input_list):

    outcome_list = list()

    for i in range(0, length_of_train_csv):
        if input_list[i][3] == 'A':
            outcome_list.append(1)
        else:
            outcome_list.append(0)

    return outcome_list
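
`encoding` returns, for each peptide, a list of tuples (one tuple per position), so its result is nested three levels deep, while a model usually expects one flat row per sample. A minimal sketch of one way to flatten it and prepend the Boolean feature is shown below, assuming NumPy. The data is a made-up miniature (2 peptides, 3 positions, a 4-letter alphabet) to keep it short, and the names `nested`, `is_hlaa`, `flat` and `features` are hypothetical.

```python
import numpy as np

# Miniature of the nested one-hot output: 2 peptides x 3 positions x 4 letters.
nested = [
    [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 0, 1)],
    [(0, 0, 1, 0), (1, 0, 0, 0), (1, 0, 0, 0)],
]

# Miniature of the Boolean 'is HLAA' feature, one value per peptide.
is_hlaa = [1, 0]

# Collapse positions and letters into one axis: shape (2, 3 * 4).
flat = np.array(nested).reshape(len(nested), -1)

# Put the Boolean feature in front of the one-hot block: shape (2, 13).
features = np.column_stack([is_hlaa, flat])
```

With the real data the one-hot block would have 189 columns per row instead of 12.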

With these features in place, we can finally start designing the architecture of the neural network.