COVID-19 Epitope Prediction

Data Exploration

Introduction
About the Data
The data was collected from, and is freely available as, a published data set originally provided by the Immune Epitope Database (IEDB) and UniProt. It comprises proteins and their characteristics from 1) B-cell, the main training set (input_bcell.csv), 2) SARS, also training data (input_sars.csv), and 3) COVID-19, the unlabeled target (input_covid.csv). This notebook assesses 1) the quality of the data set, 2) its features, and 3) the value distribution of those features. No warranties exist regarding the correctness of the data, and liability for damages resulting from its use is disclaimed. Unrestricted permission to use the data was also not provided, especially since some data may be covered by patents or other rights.

Preliminaries

Loading Libraries

In [21]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.utils import resample

Reading in the data

In [22]:
## Base URL without a trailing slash; the f-strings below add the separator
INPUT_DIR = 'https://raw.githubusercontent.com/efejiroe/covid_epitope_prediction/master/data'
bcell = pd.read_csv(f'{INPUT_DIR}/input_bcell.csv')
sars = pd.read_csv(f'{INPUT_DIR}/input_sars.csv')
covid1 = pd.read_csv(f'{INPUT_DIR}/input_covid_01.csv')
covid2 = pd.read_csv(f'{INPUT_DIR}/input_covid_02.csv')
covid = pd.concat([covid1, covid2], axis=0, ignore_index=True)
bsars = pd.concat([bcell, sars], axis=0, ignore_index=True)

Information on the data

In [31]:
## sars training set I
sars.head(3)

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,17,MFIFLLFLTLTSGSDLD,0.887,0.04,1.056,-2.159,5.569763,0.116335,-0.061116,33.205116,0
1,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,1,15,MFIFLLFLTLTSGSD,0.869,0.047,1.056,-2.5,5.569763,0.116335,-0.061116,33.205116,0
2,AAU93319,MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...,2,10,FIFLLFLTL,0.621,0.042,1.148,-7.467,5.569763,0.116335,-0.061116,33.205116,0


In [32]:
sars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   parent_protein_id    520 non-null    object 
 1   protein_seq          520 non-null    object 
 2   start_position       520 non-null    int64  
 3   end_position         520 non-null    int64  
 4   peptide_seq          520 non-null    object 
 5   chou_fasman          520 non-null    float64
 6   emini                520 non-null    float64
 7   kolaskar_tongaonkar  520 non-null    float64
 8   parker               520 non-null    float64
 9   isoelectric_point    520 non-null    float64
 10  aromaticity          520 non-null    float64
 11  hydrophobicity       520 non-null    float64
 12  stability            520 non-null    float64
 13  target               520 non-null    int64  
dtypes: float64(8), int64(3), object(3)
memory usage: 57.0+ KB


In [24]:
## bcell training set II
bcell.head(3)

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,165,SASFT,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,255,LCLKI,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,149,AHRET,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1


In [25]:
## bsars consolidated training set
bsars.head(3)

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
0,A2T3T0,MDVLYSLSKTLKDARDKIVEGTLYSNVSDLIQQFNQMIITMNGNEF...,161,165,SASFT,1.016,0.703,1.018,2.22,5.810364,0.103275,-0.143829,40.2733,1
1,F0V2I4,MTIHKVAINGFGRIGRLLFRNLLSSQGVQVVAVNDVVDIKVLTHLL...,251,255,LCLKI,0.77,0.179,1.199,-3.86,6.210876,0.065476,-0.036905,24.998512,1
2,O75508,MVATCLQVVGFVTSFVGWIGVIVTTSTNDWVVTCGYTIPTCRKLDE...,145,149,AHRET,0.852,3.427,0.96,4.28,8.223938,0.091787,0.879227,27.863333,1


In [26]:
## covid test set
covid.head(3)

Unnamed: 0,parent_protein_id,protein_seq,start_position,end_position,peptide_seq,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability
0,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,1,5,MGILP,0.948,0.28,1.033,-2.72,6.03595,0.10929,-0.138642,31.377603
1,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,2,6,GILPS,1.114,0.379,1.07,-0.58,6.03595,0.10929,-0.138642,31.377603
2,6VYB_A,MGILPSPGMPALLSLVSLLSVLLMGCVAETGTQCVNLTTRTQLPPA...,3,7,ILPSP,1.106,0.592,1.108,-1.3,6.03595,0.10929,-0.138642,31.377603


Data Assessment

No missing values were found in either the training or the test set; the data types are listed below.

In [34]:
## Checking columns for data type and null values
bsars.info()
covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14907 entries, 0 to 14906
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   parent_protein_id    14907 non-null  object 
 1   protein_seq          14907 non-null  object 
 2   start_position       14907 non-null  int64  
 3   end_position         14907 non-null  int64  
 4   peptide_seq          14907 non-null  object 
 5   chou_fasman          14907 non-null  float64
 6   emini                14907 non-null  float64
 7   kolaskar_tongaonkar  14907 non-null  float64
 8   parker               14907 non-null  float64
 9   isoelectric_point    14907 non-null  float64
 10  aromaticity          14907 non-null  float64
 11  hydrophobicity       14907 non-null  float64
 12  stability            14907 non-null  float64
 13  target               14907 non-null  int64  
dtypes: float64(8), int64(3), object(3)
memory usage: 1.6+ MB
<class 'pandas.core.frame.Dat
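The absence of missing values can also be checked directly instead of being read off the `info()` listing. The following is a minimal sketch; the small `demo` frame stands in for `bsars` and `covid`, and the helper name `missing_report` is an illustrative assumption rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the dtype and the number of missing values for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
    })

# Hypothetical stand-in frame with one deliberately missing value
demo = pd.DataFrame({
    "peptide_seq": ["SASFT", "LCLKI", "AHRET"],
    "emini": [0.703, np.nan, 3.427],
    "target": [1, 1, 1],
})

report = missing_report(demo)
print(report)

# Total number of missing cells; a value of 0 would support the claim above
total_missing = int(report["n_missing"].sum())
```

Applied to the real frames, `missing_report(bsars)` and `missing_report(covid)` would be expected to show zero in every `n_missing` row if the statement above holds.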

Feature Importance

In [36]:
sars.describe()

Unnamed: 0,start_position,end_position,chou_fasman,emini,kolaskar_tongaonkar,parker,isoelectric_point,aromaticity,hydrophobicity,stability,target
count,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0,520.0
mean,617.871154,635.876923,1.000442,1.719804,1.03896,1.278696,5.569763,0.116335,-0.06111554,33.20512,0.269231
std,349.582246,349.315328,0.08719,4.736354,0.037978,1.418791,0.0,0.0,6.945576e-18,1.422454e-14,0.443987
min,1.0,10.0,0.621,0.0,0.908,-7.467,5.569763,0.116335,-0.06111554,33.20512,0.0
25%,359.0,373.75,0.949,0.17975,1.013,0.5345,5.569763,0.116335,-0.06111554,33.20512,0.0
50%,571.5,592.5,1.009,0.4395,1.036,1.412,5.569763,0.116335,-0.06111554,33.20512,0.0
75%,921.0,940.0,1.05525,1.18125,1.058,2.245,5.569763,0.116335,-0.06111554,33.20512,1.0
max,1241.0,1255.0,1.317,40.605,1.228,4.907,5.569763,0.116335,-0.06111554,33.20512,1.0
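In the summary above, `isoelectric_point`, `aromaticity`, `hydrophobicity` and `stability` have identical minimum and maximum values and an effectively zero standard deviation, which suggests that each takes a single value across the SARS set. A hedged sketch of how such columns could be flagged is given here; the rows are copied from the `sars.head(3)` output, and the helper name `constant_columns` is an assumption made for illustration.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of numeric columns that hold a single distinct value."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].nunique() == 1]

# Three rows taken from the sars.head(3) output
sample = pd.DataFrame({
    "chou_fasman": [0.887, 0.869, 0.621],
    "parker": [-2.159, -2.5, -7.467],
    "isoelectric_point": [5.569763, 5.569763, 5.569763],
    "aromaticity": [0.116335, 0.116335, 0.116335],
    "stability": [33.205116, 33.205116, 33.205116],
})

flagged = constant_columns(sample)
print(flagged)
```

A column with no variation inside one data set cannot separate positive from negative peptides within that set, although it may still vary across the combined `bsars` frame.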


ValueError: could not convert string to float: 'A2T3T0'
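The `ValueError` above is consistent with a string column such as `parent_protein_id` being passed where numeric input is expected, for example to a scikit-learn estimator. The cell that produced it is not shown, so the following is only a sketch of one way to rank features with `ExtraTreesClassifier` after restricting the inputs to numeric columns. The synthetic data, the choice of columns, and the hyperparameters are all assumptions, not the author's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the training frame (random values, not real data)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "parent_protein_id": ["A2T3T0"] * n,  # string column of the kind named in the error
    "chou_fasman": rng.normal(1.0, 0.1, n),
    "emini": rng.exponential(1.0, n),
    "parker": rng.normal(1.3, 1.4, n),
})
# Make the label depend on one feature so the ranking has something to find
demo["target"] = (demo["parker"] > 1.3).astype(int)

# Keep numeric predictors only and remove the label from the feature matrix
X = demo.select_dtypes(include="number").drop(columns="target")
y = demo["target"]

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, one per column, normalised to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With the notebook's data, the same pattern would apply to `bsars` in place of `demo`; whether positional columns such as `start_position` and `end_position` belong among the predictors is a modelling decision this sketch does not settle.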

Feature Value Distribution