# 01/28/2022 Implementing biochemical/biophysical feature extraction based on human+viral effector sequences



For the protein engineering data scientist position, we need to make some fake data for the technical interview. Here I'm loading in the human+viral effector sequences and adding columns to annotate these amino acid sequences with peptide properties.

# Reading in the data

In [1]:
import os, sys, math
sys.path.append('/usr/local/lib/python3.9/site-packages')

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("notebook", font_scale=1.4)
import peptides

#np.set_printoptions(threshold=np.inf)

In [2]:
df = pd.read_csv("nuclear_peptides_split.txt", sep="\t")
#df = df[1:100][:]
df

|  | ensembl_gene_id | split_peptide_seq |
|---|---|---|
| 0 | ENSG00000023608 | MGTPPGLQTDCEALLSRFQETDSVRFEDFTELWRNMKFGTIFCGRM... |
| 1 | ENSG00000023608 | LAWRYFLPPYTFQIRVGALYLLYGLYNTQLCQPKQKIRVALKDWDE... |
| 2 | ENSG00000023608 | DAAYIFRKLRLDRAFHFTAMPKLLSYRMKKKIHRAEVTEEFKDPSD... |
| 3 | ENSG00000023608 | LNVHDHYQNMKHVISVDKSKPDKALSLIKDDFFDNIKNIVLEHQQW... |
| 4 | ENSG00000023608 | DGEEKMEGNSQETERCERAESLAKIKSKAFSVVIQASKSRRHRQVK... |
| ... | ... | ... |
| 41794 | ENSG00000156531 | SLGGFSIEDVQKEIKRGTKLMCSLCHCPGATIGCDVKTCHRTYHYH... |
| 41795 | ENSG00000156531 | GCDVKTCHRTYHYHCALHDKAQIREKPSQGIYMVYCRKHKKTAHNS... |
| 41796 | ENSG00000156531 | AHHKCMLFSSALVSSHSDNESLGGFSIEDVQKEIKRGTKLMCSLCH... |
| 41797 | ENSG00000156531 | KSNRDKECGQLLISENQKVAAHHKCMLFSSALVSSHSDNESLGGFS... |


# Peptides package

Here I'm using the Peptides package to annotate the 60-amino-acid peptide sequences.

I will annotate each peptide sequence with additional columns to reflect:
- molecular weight
- charge
- hydrophobicity
- helix/bend preference (Kidera Factor 1)
- volume (Physical Descriptor 1)

Documentation:
- https://peptides.readthedocs.io/en/stable/

- https://peptides.readthedocs.io/en/stable/api.html
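
To make the hydrophobicity column more concrete, here is a minimal sketch of a mean hydropathy score computed by hand. It assumes the Kyte-Doolittle scale and a plain average over residues; whether this matches the default behavior of `hydrophobicity()` in the Peptides package is an assumption to check against the documentation. `KYTE_DOOLITTLE` and `mean_hydropathy` are hypothetical names for illustration, not part of the package.

```python
# Kyte-Doolittle hydropathy values per residue (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over all residues in seq."""
    if not seq:
        raise ValueError("sequence is empty")
    # Reject anything outside the 20 standard amino acids (e.g. X, U, *)
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical toy sequences, not taken from the dataset
print(mean_hydropathy("AILV"))  # aliphatic residues, strongly positive
print(mean_hydropathy("DEKR"))  # charged residues, strongly negative
```

Positive scores indicate a more hydrophobic peptide and negative scores a more hydrophilic one.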


In [3]:
#Setting up empty lists to populate as I loop through the dataframe's peptide sequences
mol_weight = []
charge = []
hydrophobicity = []
helix_pref = []
volume = []

#Looping through the peptide sequences (selected by column name rather than position), converting each aa sequence to a Peptide object and then annotating it with characteristics
for seq in df["split_peptide_seq"]:
    aa = peptides.Peptide(seq)
    mol_weight.append(aa.molecular_weight())
    charge.append(aa.charge())  #net charge, using the package's default settings
    hydrophobicity.append(aa.hydrophobicity())
    helix_pref.append(aa.kidera_factors()[0])  #Kidera Factor 1
    volume.append(aa.physical_descriptors()[0])  #Physical Descriptor 1
    
df.insert(2,"mol_weight",mol_weight)
df.insert(3,"charge",charge)
df.insert(4,"hydrophobicity",hydrophobicity)
df.insert(5,"helix_pref",helix_pref)
df.insert(6,"volume",volume)
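
The cell above grows one list per feature and inserts each list separately. A possible alternative, sketched here under the assumption that every feature can be written as a function of the sequence string, is to keep the features in a dictionary of callables and build all columns in one pass. `build_feature_columns` and the toy featurizers are hypothetical stand-ins for illustration; in this notebook they would be replaced by calls on a `peptides.Peptide` object.

```python
def build_feature_columns(sequences, featurizers):
    """Return {column_name: [one value per sequence]} for each featurizer."""
    columns = {name: [] for name in featurizers}
    for seq in sequences:
        for name, func in featurizers.items():
            columns[name].append(func(seq))
    return columns


# Toy featurizers (assumptions for illustration only)
toy_featurizers = {
    "length": len,
    "n_basic": lambda s: sum(s.count(aa) for aa in "KRH"),
    "n_acidic": lambda s: sum(s.count(aa) for aa in "DE"),
}

# Hypothetical toy sequences, not taken from the dataset
columns = build_feature_columns(["MKRDE", "GGHH"], toy_featurizers)
print(columns)
```

With a DataFrame, the resulting dictionary could be attached in one step with `df.assign(**columns)`, which appends the new columns after the existing ones. Adding a feature then means adding one dictionary entry instead of a new list, a new `append` call, and a new `insert` call.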

In [4]:
df

|  | ensembl_gene_id | split_peptide_seq | mol_weight | charge | hydrophobicity | helix_pref | volume |
|---|---|---|---|---|---|---|---|
| 0 | ENSG00000023608 | MGTPPGLQTDCEALLSRFQETDSVRFEDFTELWRNMKFGTIFCGRM... | 7010.04024 | -1.114877 | -0.473333 | -0.166167 | -0.117500 |
| 1 | ENSG00000023608 | LAWRYFLPPYTFQIRVGALYLLYGLYNTQLCQPKQKIRVALKDWDE... | 7262.49314 | 3.024591 | -0.140000 | -0.108167 | 0.290167 |
| 2 | ENSG00000023608 | DAAYIFRKLRLDRAFHFTAMPKLLSYRMKKKIHRAEVTEEFKDPSD... | 7217.50014 | 3.187125 | -0.488333 | -0.350500 | 0.197833 |
| 3 | ENSG00000023608 | LNVHDHYQNMKHVISVDKSKPDKALSLIKDDFFDNIKNIVLEHQQW... | 7137.10814 | 3.453626 | -1.100000 | -0.025667 | 0.042000 |
| 4 | ENSG00000023608 | DGEEKMEGNSQETERCERAESLAKIKSKAFSVVIQASKSRRHRQVK... | 6598.16894 | 0.039344 | -1.238333 | -0.173833 | -0.658167 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 41794 | ENSG00000156531 | SLGGFSIEDVQKEIKRGTKLMCSLCHCPGATIGCDVKTCHRTYHYH... | 6702.75844 | 3.083865 | -0.466667 | -0.044167 | -0.502333 |
| 41795 | ENSG00000156531 | GCDVKTCHRTYHYHCALHDKAQIREKPSQGIYMVYCRKHKKTAHNS... | 6884.97414 | 6.232823 | -0.660000 | -0.018833 | -0.173667 |
| 41796 | ENSG00000156531 | AHHKCMLFSSALVSSHSDNESLGGFSIEDVQKEIKRGTKLMCSLCH... | 6449.42324 | 0.085868 | -0.086667 | -0.094000 | -0.782000 |
| 41797 | ENSG00000156531 | KSNRDKECGQLLISENQKVAAHHKCMLFSSALVSSHSDNESLGGFS... | 6630.48774 | 1.154905 | -0.626667 | -0.169833 | -0.536333 |


In [5]:
df.to_csv("/Users/robinyeo/Documents/GitHub/interviews/protein_ds/technical_screen_simulation/nuclear_peptides_annotated_RWY.csv")