# Protein Feature data treatment
In this notebook we process data obtained from the [UniProt database](https://www.uniprot.org/uniprot/?query=*&fil=organism%3a%22Homo+sapiens+(Human)+%5b9606%5d%22&offset=0). The export contains information on all the *Homo sapiens* proteins in a `.tab` file. For the moment, we are interested in two numeric values (length and mass) and in the number of $\alpha$-helices, $\beta$-strands and turns.<br>
Here we clean the data and select the desired features.<br>
<br>
    Author: Juan Sebastian Diaz Boada, May 2020

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd

### Importing raw database from file

In [3]:
data = pd.read_csv('uni_prot.tab',sep='\t',low_memory=False)
data.head()

Unnamed: 0,Entry,Cross-reference (GeneID),Length,Mass,Helix,Beta strand,Turn,Glycosylation,Disulfide bond,Absorption,...,Site,Catalytic activity,Cofactor,DNA binding,Metal binding,Nucleotide binding,Tissue specificity,Involvement in disease,Subcellular location [CC],Region
0,Q8NF67,,263,31171,,,,,,,...,,,,,,,,,,
1,Q9NPB9,51554;,350,39914,,,,"CARBOHYD 6; /note=""N-linked (GlcNAc...) aspar...","DISULFID 112..184; /evidence=""ECO:0000255|PRO...",,...,,,,,,,TISSUE SPECIFICITY: Predominantly expressed in...,,SUBCELLULAR LOCATION: Early endosome {ECO:0000...,
2,P31937,11112;,336,35329,"HELIX 51..60; /evidence=""ECO:0000244|PDB:2GF2...","STRAND 42..45; /evidence=""ECO:0000244|PDB:2GF...","TURN 80..82; /evidence=""ECO:0000244|PDB:2GF2""...",,,,...,,CATALYTIC ACTIVITY: Reaction=3-hydroxy-2-methy...,,,,"NP_BIND 40..68; /note=""NAD""; /evidence=""ECO:...",TISSUE SPECIFICITY: Detected in skin fibroblas...,,SUBCELLULAR LOCATION: Mitochondrion.,
3,P61981,7532;,247,28303,"HELIX 4..16; /evidence=""ECO:0000244|PDB:3UZD""...",,"TURN 32..34; /evidence=""ECO:0000244|PDB:6FEL""...",,,,...,"SITE 57; /note=""Interaction with phosphoserin...",,,,,,"TISSUE SPECIFICITY: Highly expressed in brain,...","DISEASE: Epileptic encephalopathy, early infan...",SUBCELLULAR LOCATION: Cytoplasm {ECO:0000250}.,
4,O94805,51412;,426,46877,,,,,,,...,,,,,,,,"DISEASE: Epileptic encephalopathy, early infan...",SUBCELLULAR LOCATION: Nucleus {ECO:0000303|Pub...,"REGION 39..82; /note=""Essential for mediating..."


### Data cleaning

In [4]:
# Select features (columns) of interest from dataset
data = data[['Cross-reference (GeneID)', 'Length', 'Mass', 'Helix',
       'Beta strand', 'Turn']]
# Rename columns
data = data.rename(columns={'Cross-reference (GeneID)':'GeneID','Beta strand':'n_strands',\
                           'Helix':'n_helices','Turn':'n_turns'})
print('Original database length:',len(data.index))
# Remove entries without a GeneID
data = data.dropna(subset=['GeneID'])
# Remove duplicated genes (keep the first occurrence); copy so the assignment below acts on an independent frame
data = data[~data.duplicated(['GeneID'])].copy()
print('Length of database after filtering:',len(data.index))
# Remove the semicolon after GeneID
data['GeneID'] = data['GeneID'].str.replace(';', '', regex=False)
data.head()

Original database length: 188558
Length of database after filtering: 18991


Unnamed: 0,GeneID,Length,Mass,n_helices,n_strands,n_turns
1,51554,350,39914,,,
2,11112,336,35329,"HELIX 51..60; /evidence=""ECO:0000244|PDB:2GF2...","STRAND 42..45; /evidence=""ECO:0000244|PDB:2GF...","TURN 80..82; /evidence=""ECO:0000244|PDB:2GF2""..."
3,7532,247,28303,"HELIX 4..16; /evidence=""ECO:0000244|PDB:3UZD""...",,"TURN 32..34; /evidence=""ECO:0000244|PDB:6FEL""..."
4,51412,426,46877,,,
5,125,375,39855,"HELIX 48..54; /evidence=""ECO:0000244|PDB:1U3U...","STRAND 8..15; /evidence=""ECO:0000244|PDB:1U3U...","TURN 142..144; /evidence=""ECO:0000244|PDB:1U3..."
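Removing every semicolon from `GeneID` is safe only when each cell holds a single ID. The sketch below is a possible sanity check, not part of the original pipeline: it uses made-up toy values (the `'100;200;'` cell is hypothetical) and strips only the trailing semicolon so that cells listing several IDs remain detectable.

```python
import pandas as pd

# Toy stand-in for the 'GeneID' column; the multi-ID cell is a hypothetical example.
gene_ids = pd.Series(['51554;', '11112;', '100;200;'], name='GeneID')

# Drop only the trailing semicolon, so multiple IDs stay distinguishable.
stripped = gene_ids.str.rstrip(';')

# Flag cells that still contain a semicolon, i.e. that list more than one ID.
multi_id = stripped.str.contains(';', regex=False)

print(stripped.tolist())      # ['51554', '11112', '100;200']
print(int(multi_id.sum()))    # 1
```

If such a check reports zero multi-ID cells on the real column, replacing all semicolons and stripping the trailing one give the same result.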


### Counting the number of 3D structures

In [5]:
# Get a dataframe for strands, helices and turns
beta = pd.DataFrame(data['n_strands'])
alpha = pd.DataFrame(data['n_helices'])
turn = pd.DataFrame(data['n_turns'])

# WARNING: Do a more rigorous check on whether the separator `";` can be used

In [6]:
# Divide into columns per strand, helix or turn
beta = beta['n_strands'].str.split('\\";', expand = True)
alpha = alpha['n_helices'].str.split('\\";', expand = True)
turn = turn['n_turns'].str.split('\\";', expand = True)

In [7]:
# Replace the annotation columns with the number of strands, helices and turns
data.loc[:,'n_strands'] = beta.count(axis=1)
data.loc[:,'n_helices'] = alpha.count(axis=1)
data.loc[:,'n_turns'] = turn.count(axis=1)
data.head()

Unnamed: 0,GeneID,Length,Mass,n_helices,n_strands,n_turns
1,51554,350,39914,0,0,0
2,11112,336,35329,17,10,4
3,7532,247,28303,12,0,4
4,51412,426,46877,0,0,0
5,125,375,39855,18,22,4
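One way to cross-check the counts without relying on the `";` separator is to count the feature keywords directly. This is a minimal sketch, assuming each feature entry starts with its keyword (`HELIX`, `STRAND` or `TURN`) followed by a space and a numeric position, as in the `data.head()` output earlier. The helper `count_features` and the toy strings are illustrative, not part of the original notebook.

```python
import pandas as pd

# Toy annotation strings in the format shown in the head() output; values are made up.
helix = pd.Series([
    'HELIX 51..60; /evidence="ECO:0000244|PDB:2GF2"; '
    'HELIX 63..70; /evidence="ECO:0000244|PDB:2GF2"',
    None,
    'HELIX 4..16; /evidence="ECO:0000244|PDB:3UZD"',
])

def count_features(column, keyword):
    """Count occurrences of `keyword` followed by a digit; missing cells count as 0."""
    pattern = rf'\b{keyword} \d'
    return column.str.count(pattern).fillna(0).astype(int)

n_helices = count_features(helix, 'HELIX')
print(n_helices.tolist())  # [2, 0, 1]
```

Comparing these counts with those from the split-based approach on the real columns would show whether the separator is reliable. Entries whose position does not start with a digit would be missed by this pattern.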


### Export into new file

In [9]:
data.to_csv('./orig_data/proteins.csv',index=False,sep=';')
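Because the file is written with `;` as the separator, it has to be read back with the same separator. The sketch below illustrates the round trip on a toy frame written to a temporary directory. The `dtype={'GeneID': str}` argument is an assumption about how the IDs should be treated downstream: it keeps them as text instead of letting pandas infer integers.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as `data`; values taken from the head() output.
toy = pd.DataFrame({
    'GeneID': ['51554', '11112'],
    'Length': [350, 336],
    'Mass': [39914, 35329],
    'n_helices': [0, 17],
    'n_strands': [0, 10],
    'n_turns': [0, 4],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'proteins.csv')
    toy.to_csv(path, index=False, sep=';')
    # Read with the same separator; keep GeneID as text rather than an inferred integer.
    restored = pd.read_csv(path, sep=';', dtype={'GeneID': str})

print(restored.dtypes)
```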