# Preprocessing NetSurfP2.0 output to generate input for the GT-CNN model

This notebook covers every step needed to convert sequences and their secondary structure (SS) predictions from NetSurfP2.0 into the input format used for GT-CNN fold prediction:
- Read the `<file>.csv` file generated by NetSurfP2.0 directly.
- Filter the sequences by domain, keeping only the annotated domain region.
- Pad every sequence and SS prediction to a length of 798.
- Write the result to a CSV file that serves as input for the next step.

Inputs:
- SS prediction file generated by NetSurfP2.0 (`<file>.csv`)<br>
    Note: If the SS predictions come from another tool, convert them to the format shown at the end of this notebook and go directly to the next step.<br>
- Domain annotation file<br>
    Generated by running the sequences through Batch-CD-Search and processing the output.<br>
    A tab-separated file with 3 columns: SeqID | DomainStart | DomainEnd<br>
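
As a rough illustration of the domain annotation format, the sketch below loads a small tab-separated table and checks that the coordinates are plausible. The column names `Name`, `Domain_start` and `Domain_end` follow the table printed later in this notebook; the three rows and the sequence IDs are made up, and treating coordinates as 1-based and inclusive is an assumption based on how the notebook slices the sequences.

```python
import io
import pandas as pd

# Hypothetical three-row annotation table; real files come from processed Batch-CD-Search output.
tsv_text = (
    "Name\tDomain_start\tDomain_end\n"
    "seqA\t1\t162\n"
    "seqB\t12\t175\n"
    "seqC\t823\t1527\n"
)

domains = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Assuming 1-based inclusive coordinates: start must be >= 1 and end must be >= start.
bad_rows = domains[
    (domains["Domain_start"] < 1) | (domains["Domain_end"] < domains["Domain_start"])
]

# Length of each domain under the same assumption.
domain_lengths = (domains["Domain_end"] - domains["Domain_start"] + 1).tolist()
print(len(bad_rows), domain_lengths)
```

A check like this can catch swapped or zero-based coordinates before they silently produce empty or shifted domain slices.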

### 1. Imports, functions and definitions

In [1]:
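# Import the required packages; these three helpers from Utils are the ones used in this notebook
from Utils import Transfer_Function, Zero_Padding, Partition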
import pandas as pd

In [2]:
pwd

'/home/esbg-lab/Dropbox (ESBG LAB)/GT_strML/Github_Folder/GT-CNN/Codes'

### 2. Read the CSV file produced by NetSurfP2.0 and label IDs with family and fold information

In [3]:
new_GT = pd.read_csv("../Datasets/gtu/gtu.netsurfp.csv")
domain_file = pd.read_csv("../Datasets/gtu/gtu.domainAnnotation.tsv", sep="\t")

In [4]:
# Attach the fold and family labels; the second return value is not needed here
new_GT_seq, _ = Transfer_Function(new_GT, val=True, fold_name='u', family_name='GT-u')

In [5]:
domain_file.head()

Unnamed: 0,Name,Domain_start,Domain_end
0,GT48-u|ABI14554|H.annuusxHelianthusdebilissubs...,1,162
1,GT11-u|AXY11804|H.saguini97-6194-5F0_Bacteria,2,164
2,GT48-u|ABI14556|H.annuusxHelianthusdebilissubs...,1,163
3,GT11-u|APZ35041|M.aurumKACC15219_Bacteria,12,175
4,GT26-u|BAS09688|A.spHiyo4_Bacteria,1,165


In [6]:
new_GT_seq['family'].value_counts()

GT-u    4413
Name: family, dtype: int64

### 3. Cutting the sequences (domain-based or direct)

In [7]:
# Merge domain bounds with the sequence and SS info.
# pd.merge defaults to an inner join, so sequences without a matching domain entry are dropped.
new_GT_seq2 = pd.merge(new_GT_seq, domain_file, on='Name')
new_GT_seq2 = new_GT_seq2[['Name', 'fold', 'family', 'Domain_start', 'Domain_end', 'q3seq', 'rawseq']]
new_GT_seq2.shape

(4039, 7)
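
The row count drops from 4413 to 4039 at this merge, presumably because some sequences have no entry in the domain file. A minimal sketch of how the dropped IDs could be listed, using a left merge with pandas' `indicator` option on toy stand-ins for `new_GT_seq` and `domain_file` (all IDs and values below are hypothetical):

```python
import pandas as pd

# Toy stand-ins for new_GT_seq and domain_file (hypothetical IDs and values).
seqs = pd.DataFrame({"Name": ["seqA", "seqB", "seqC"],
                     "q3seq": ["CCHH", "HHEE", "CCCC"]})
doms = pd.DataFrame({"Name": ["seqA", "seqC"],
                     "Domain_start": [1, 2],
                     "Domain_end": [3, 4]})

# how="left" keeps every sequence; indicator=True adds a "_merge" column
# that marks rows found only in the left table as "left_only".
merged = pd.merge(seqs, doms, on="Name", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Inspecting the unmatched names makes it easier to tell whether sequences were dropped because no domain was found or because the IDs differ between the two files.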

In [8]:
new_GT_seq2

Unnamed: 0,Name,fold,family,Domain_start,Domain_end,q3seq,rawseq
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT-u,1,185,CCCCCCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCCHHHHHHHH...,RYAPLVGYKKPWSNSGWLRKLFGGSDAHSTMASITGNDRMDVIKRD...
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT-u,1,458,CCCCHHHHHHHHHHHHCCCCCCCHHHHCCCHHHHHHHHHHHHHHHC...,MLPSIEQRLHILQLISTLSAHHTKECLRNPQPLYVEQVKERYAPLV...
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT-u,1,302,CCCHHHHHHHHCCCHHHHHHHHCCCCCEEEECCCCCEEEEEECCEE...,MGSLFKQIYRYTRPRAYRHNENLWPFTRITRAPSGEISALRYKGKT...
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT-u,823,1527,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,MSYNDNNHNYYDPNQQGGGVPNDGYYQQPYDMNQQQQQQQQQPYDD...
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT-u,826,1528,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,MANWQNTDPNGNYYYNGAENNEFYDQDYASQQPEQQQGGEGYYDEY...
...,...,...,...,...,...,...,...
4034,GT107-u|AFS38829|A.macleodiiATCC27126_,u,GT-u,50,830,CCCCCCHHHHHHCHCCCCCCCCHHHHHHHCCCCCCCHHHHHHHHHH...,MKKSLKFTQSFANLTASNCSVSDVLFLKNYSLGDFDLVKRSQPFLI...
4035,GT107-u|AHV35628|A.hydrophilaYL17_,u,GT-u,1,684,CCCCCCHHHHCCCCCEEECCCCCCCCCCEEEEECCCHHHHHHHHHH...,MARSTQIGTCLGQPLALFTSPDEPQPGDVLIGWGQKANTRQIKQQA...
4036,GT107-u|VDZ75107|A.hermanniiNCTC12129_,u,GT-u,1,218,CCCCCCCCCCCCCCCCCCCEEEEEECCCCCHHHHHCCCCCCCCHHH...,MSKYNLGGAWQLPADAAGKRVLLVPGQVEDDASIITGTLSINTNRD...
4037,GT107-u|QCT94715|C.mediatlanticusTB-2_,u,GT-u,1,672,CCCCCCCHHHHHHHHHHCCCEEEECCCCHHHHHHHHHHHHCCCEEE...,MNIKNKNFYLNNICNIFTKQYFTGWGRKRTGMFAKWSYKKFGGKLV...


### 4. Cut to get only the domain regions

In [9]:
# Domain_start and Domain_end are 1-based and inclusive, so the start is shifted by one for Python slicing
new_GT_seq_cut = new_GT_seq2.copy()
new_GT_seq_cut['q3seq'] = new_GT_seq_cut.apply(lambda x: x['q3seq'][(x['Domain_start'] - 1):x['Domain_end']], axis=1)
new_GT_seq_cut['rawseq'] = new_GT_seq_cut.apply(lambda x: x['rawseq'][(x['Domain_start'] - 1):x['Domain_end']], axis=1)
new_GT_seq_cut

Unnamed: 0,Name,fold,family,Domain_start,Domain_end,q3seq,rawseq
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT-u,1,185,CCCCCCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCCHHHHHHHH...,RYAPLVGYKKPWSNSGWLRKLFGGSDAHSTMASITGNDRMDVIKRD...
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT-u,1,458,CCCCHHHHHHHHHHHHCCCCCCCHHHHCCCHHHHHHHHHHHHHHHC...,MLPSIEQRLHILQLISTLSAHHTKECLRNPQPLYVEQVKERYAPLV...
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT-u,1,302,CCCHHHHHHHHCCCHHHHHHHHCCCCCEEEECCCCCEEEEEECCEE...,MGSLFKQIYRYTRPRAYRHNENLWPFTRITRAPSGEISALRYKGKT...
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT-u,823,1527,CCCHHHHHHHHHHHHHHCCCCCCCCCHHHCCCCCCCCCCCCCCEEE...,PRNSEAERRISFFAQSLATPMPEPVPVDNMPTFTVFTPHYSEKILL...
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT-u,826,1528,CCCHHHHHHHHHHHHHHCCCCCCCCCHHHCCCCCCCCCCCCCEEEE...,PRNSEAERRISFFAQSLATPMPEPLPVDNMPTFTVLTPHYSERILL...
...,...,...,...,...,...,...,...
4034,GT107-u|AFS38829|A.macleodiiATCC27126_,u,GT-u,50,830,HHHHHCCCCCCCCCCCCEEEEECCCCHHHHHHHHHHHHHHHCCHEE...,NIVEKNNNKKTKRGLEKRFCASLPWNEKKLVGMNKYLTDILGYDKY...
4035,GT107-u|AHV35628|A.hydrophilaYL17_,u,GT-u,1,684,CCCCCCHHHHCCCCCEEECCCCCCCCCCEEEEECCCHHHHHHHHHH...,MARSTQIGTCLGQPLALFTSPDEPQPGDVLIGWGQKANTRQIKQQA...
4036,GT107-u|VDZ75107|A.hermanniiNCTC12129_,u,GT-u,1,218,CCCCCCCCCCCCCCCCCCCEEEEEECCCCCHHHHHCCCCCCCCHHH...,MSKYNLGGAWQLPADAAGKRVLLVPGQVEDDASIITGTLSINTNRD...
4037,GT107-u|QCT94715|C.mediatlanticusTB-2_,u,GT-u,1,672,CCCCCCCHHHHHHHHHHCCCEEEECCCCHHHHHHHHHHHHCCCEEE...,MNIKNKNFYLNNICNIFTKQYFTGWGRKRTGMFAKWSYKKFGGKLV...
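
The index arithmetic in the cut can be checked on a short example. The sketch below wraps the same slice in a small helper (the name `cut_domain` and both toy strings are hypothetical) and extracts residues 3 to 6 from a 10-residue sequence and its Q3 string:

```python
def cut_domain(sequence: str, domain_start: int, domain_end: int) -> str:
    """Return residues domain_start..domain_end, both 1-based and inclusive."""
    # Python slices are 0-based with an exclusive end, hence start - 1 and end unchanged.
    return sequence[domain_start - 1:domain_end]

raw = "MKKSLKFTQS"  # hypothetical 10-residue sequence
q3 = "CCCCHHHHEE"   # hypothetical Q3 string of the same length

domain_raw = cut_domain(raw, 3, 6)
domain_q3 = cut_domain(q3, 3, 6)
print(domain_raw, domain_q3)
```

Both strings are cut with the same bounds, so each residue stays aligned with its predicted secondary structure state.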


### 5. Padding

In [10]:
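# Pad q3seq and rawseq to a fixed length of 798; the output below shows '-' as the padding character
new_GT_seq_pad = Zero_Padding(new_GT_seq_cut, 798)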

In [11]:
new_GT_seq_pad.head()

Unnamed: 0,Name,fold,family,q3seq,rawseq,paddings
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT-u,----------------------------------------------...,----------------------------------------------...,306
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT-u,----------------------------------------------...,----------------------------------------------...,170
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT-u,----------------------------------------------...,----------------------------------------------...,248
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT-u,----------------------------------------------...,----------------------------------------------...,46
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT-u,----------------------------------------------...,----------------------------------------------...,47
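
The implementation of `Zero_Padding` lives in `Utils` and is not shown here. The `paddings` values above are consistent with half of the missing length, rounded down (for example, a 185-residue domain gives (798 - 185) // 2 = 306), which suggests padding on both sides. The sketch below is one way to reproduce that behaviour; the function name `pad_center` and the choice to put any odd leftover character on the right are assumptions.

```python
from typing import Tuple

def pad_center(seq: str, target_len: int = 798, pad_char: str = "-") -> Tuple[str, int]:
    """Pad seq on both sides to target_len; return the padded string and the left pad size."""
    total = target_len - len(seq)
    if total < 0:
        raise ValueError("sequence is longer than target_len")
    left = total // 2       # rounded down, matching the 'paddings' column above
    right = total - left    # assumption: any odd leftover goes on the right
    return pad_char * left + seq + pad_char * right, left

padded, left_pad = pad_center("A" * 185)
print(len(padded), left_pad)
```

Applying the same function to `q3seq` and `rawseq` keeps the two strings aligned position by position after padding.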


### 6. Partition the data

In [12]:
new_GT_seq_final = Partition(new_GT_seq_pad, maxwordCount=798)

Jump extra-long tokens


### 7. Save the processed table to CSV

In [13]:
new_GT_seq_final.to_csv("../ExampleOutputs/allgtu_processed.csv")
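
Because `to_csv` writes the DataFrame index by default, reading the file back without further options yields an extra `Unnamed: 0` column. The sketch below shows this round trip on a small made-up table, using an in-memory buffer in place of the real output file:

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for the processed output.
table = pd.DataFrame({"Name": ["seqA", "seqB"], "paddings": [306, 170]})

buffer = io.StringIO()
table.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)             # index comes back as "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index_col=0 restores it as the index

print(list(with_extra.columns))
print(list(restored.columns))
```

Whether the next step expects the extra column is not stated in this notebook, so it is worth checking how the downstream code reads the file before changing how it is written.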