# Preprocessing NetSurfP output to generate input for the CNN model

This notebook contains all the steps needed to convert sequences with secondary structure (ss) predictions from NetSurfP2.0 into the input format for GT-CNN fold prediction:
- Read in the <file>.csv file generated by NetSurfP2.0 directly.
- Perform domain-based filtering of the sequences.
- Add padding so that all sequences and ss predictions are 798 characters long.
- Organize the result into a CSV file to be used as input for the next step.

Inputs:
- SS prediction file generated by NetSurfP2.0 (<file>.csv)<br>
    Note: If SS prediction is done with another tool, convert its output to the same format as the table produced at the end of this notebook and go directly to the next step.<br>
- Domain annotation file<br>
    Generated by running sequences through Batch-CD-Search and processing the output.<br>
    A tab-separated file with 3 columns: Name | Domain_start | Domain_end<br>

Requirements to get the fold and family columns properly formatted:
- The names for all the sequences in ss prediction and domain annotation files need to be properly formatted.<br>
    E.g.: GT69-u|AAY89392|C.gattiiVGIR265_Fungi => GTfamily-fold|SequenceID|organismName_taxonomicGroup
- OR provide a separate file listing fold and family, using `fold_name` and `family_name` in `Transfer_Function` with `val=True`.
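
The naming convention above can be checked before running the notebook. The following is a minimal sketch, not the parsing used inside `Transfer_Function`: `parse_name` and `ParsedName` are hypothetical names introduced here, and the split rules (fold after the last hyphen, taxonomic group after the last underscore) are assumptions based on the example name and on the `fold` and `family` columns shown later in this notebook.

```python
from typing import NamedTuple


class ParsedName(NamedTuple):
    family: str           # full label, e.g. "GT69-u" (as in the family column)
    fold: str             # e.g. "u"
    sequence_id: str      # e.g. "AAY89392"
    organism: str         # e.g. "C.gattiiVGIR265"
    taxonomic_group: str  # e.g. "Fungi"


def parse_name(name: str) -> ParsedName:
    """Hypothetical helper: split 'GTfamily-fold|SequenceID|organismName_taxonomicGroup'."""
    fields = name.split("|")
    if len(fields) != 3:
        raise ValueError(f"expected 3 '|'-separated fields, got {len(fields)}: {name!r}")
    label, sequence_id, source = fields
    if "-" not in label or "_" not in source:
        raise ValueError(f"malformed name: {name!r}")
    # Assumption: the fold follows the last hyphen of the first field
    fold = label.rsplit("-", 1)[1]
    # Assumption: the taxonomic group follows the last underscore of the third field
    organism, taxonomic_group = source.rsplit("_", 1)
    return ParsedName(label, fold, sequence_id, organism, taxonomic_group)


parsed = parse_name("GT69-u|AAY89392|C.gattiiVGIR265_Fungi")
print(parsed.family, parsed.fold)
```

Running such a check over the `Name` column of both input files would surface malformed identifiers early, before they silently drop out of the merge in step 3.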

### 1. Imports, functions and definitions

In [1]:
# Import the necessary packages
from Utils import Transfer_Function, Zero_Padding, Partition
import pandas as pd

### 2. Read the CSV file produced by NetSurfP; IDs must already include family and fold information

In [2]:
new_GT = pd.read_csv("../Datasets/gtu/gtu.netsurfp.csv")
domain_file = pd.read_csv("../Datasets/gtu/gtu.domainAnnotation.tsv", sep="\t")

In [3]:
new_GT_seq, _ = Transfer_Function(new_GT, val=False)
new_GT_seq.head()

Unnamed: 0,Name,fold,family,q3seq,rawseq
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT69-u,CCCCCCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCCHHHHHHHH...,RYAPLVGYKKPWSNSGWLRKLFGGSDAHSTMASITGNDRMDVIKRD...
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT69-u,CCCCHHHHHHHHHHHHCCCCCCCHHHHCCCHHHHHHHHHHHHHHHC...,MLPSIEQRLHILQLISTLSAHHTKECLRNPQPLYVEQVKERYAPLV...
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT73-u,CCCHHHHHHHHCCCHHHHHHHHCCCCCEEEECCCCCEEEEEECCEE...,MGSLFKQIYRYTRPRAYRHNENLWPFTRITRAPSGEISALRYKGKT...
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT48-u,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,MSYNDNNHNYYDPNQQGGGVPNDGYYQQPYDMNQQQQQQQQQPYDD...
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT48-u,CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC...,MANWQNTDPNGNYYYNGAENNEFYDQDYASQQPEQQQGGEGYYDEY...


In [4]:
new_GT_seq.shape

(4446, 5)

### 3. Cut the sequences (domain-based or direct)

In [5]:
# Merge domain bounds with the sequence and ss info.
# The inner join removes sequences without domain information.

new_GT_seq2 = pd.merge(new_GT_seq, domain_file, on='Name', how='inner')
new_GT_seq2 = new_GT_seq2[['Name', 'fold', 'family', 'Domain_start', 'Domain_end', 'q3seq', 'rawseq']]
new_GT_seq2.shape

(4072, 7)
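
The row count falls from 4446 to 4072 because the merge keeps only names present in both tables. To see which sequences were dropped, one option is a left merge with `indicator=True`. This is a sketch on toy data: `seqs` and `domains` are hypothetical stand-ins for `new_GT_seq` and `domain_file`.

```python
import pandas as pd

# Toy stand-ins for new_GT_seq and domain_file (hypothetical data)
seqs = pd.DataFrame({"Name": ["a", "b", "c"], "q3seq": ["CCH", "HHE", "EEC"]})
domains = pd.DataFrame({"Name": ["a", "c"], "Domain_start": [1, 2], "Domain_end": [3, 3]})

# A left merge keeps every sequence; indicator=True adds a _merge column
# that records whether a matching domain annotation was found
checked = pd.merge(seqs, domains, on="Name", how="left", indicator=True)

# Rows marked 'left_only' have no domain annotation and would be
# removed by the inner merge used in this notebook
dropped = checked.loc[checked["_merge"] == "left_only", "Name"].tolist()
print(dropped)
```

Inspecting the dropped names helps distinguish sequences that truly lack a domain hit from names that merely differ in formatting between the two files.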

### 4. Cut to get only the domain regions

In [6]:
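new_GT_seq_cut = new_GT_seq2.copy()
# Domain_start and Domain_end are treated as 1-based, inclusive positions,
# so the equivalent Python slice is [Domain_start - 1 : Domain_end].
new_GT_seq_cut['q3seq'] = new_GT_seq_cut.apply(lambda x: x['q3seq'][(x['Domain_start'] - 1):x['Domain_end']], axis=1)
new_GT_seq_cut['rawseq'] = new_GT_seq_cut.apply(lambda x: x['rawseq'][(x['Domain_start'] - 1):x['Domain_end']], axis=1)
new_GT_seq_cut.head()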

Unnamed: 0,Name,fold,family,Domain_start,Domain_end,q3seq,rawseq
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT69-u,1,185,CCCCCCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCCHHHHHHHH...,RYAPLVGYKKPWSNSGWLRKLFGGSDAHSTMASITGNDRMDVIKRD...
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT69-u,1,458,CCCCHHHHHHHHHHHHCCCCCCCHHHHCCCHHHHHHHHHHHHHHHC...,MLPSIEQRLHILQLISTLSAHHTKECLRNPQPLYVEQVKERYAPLV...
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT73-u,1,302,CCCHHHHHHHHCCCHHHHHHHHCCCCCEEEECCCCCEEEEEECCEE...,MGSLFKQIYRYTRPRAYRHNENLWPFTRITRAPSGEISALRYKGKT...
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT48-u,823,1527,CCCHHHHHHHHHHHHHHCCCCCCCCCHHHCCCCCCCCCCCCCCEEE...,PRNSEAERRISFFAQSLATPMPEPVPVDNMPTFTVFTPHYSEKILL...
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT48-u,826,1528,CCCHHHHHHHHHHHHHHCCCCCCCCCHHHCCCCCCCCCCCCCEEEE...,PRNSEAERRISFFAQSLATPMPEPLPVDNMPTFTVLTPHYSERILL...
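
The slice in the cell above converts 1-based, inclusive domain coordinates into Python's 0-based, end-exclusive indexing. A small worked example follows, with a hypothetical `slice_domain` helper that also rejects out-of-range bounds. Python slicing would otherwise shorten the result silently, and whether to raise or clip in that case is a design choice this notebook does not specify.

```python
def slice_domain(seq: str, start: int, end: int) -> str:
    """Hypothetical helper: return seq[start..end] for 1-based, inclusive bounds."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError(f"bounds ({start}, {end}) invalid for length {len(seq)}")
    # Subtract 1 from start only: Python's slice end is already exclusive
    return seq[start - 1:end]


example = "ABCDEFGHIJ"
domain = slice_domain(example, 3, 6)
print(domain, len(domain))
```

The resulting length is always `end - start + 1`, so a domain annotated as 823 to 1527 yields 705 residues.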


### 5. Padding

In [7]:
new_GT_seq_pad = Zero_Padding(new_GT_seq_cut, 798)

In [8]:
new_GT_seq_pad.head()

Unnamed: 0,Name,fold,family,q3seq,rawseq,paddings
0,GT69-u|AAY89392|C.gattiiVGIR265_Fungi,u,GT69-u,----------------------------------------------...,----------------------------------------------...,306
1,GT69-u|AAC13946|C.neoformans_Fungi,u,GT69-u,----------------------------------------------...,----------------------------------------------...,170
2,GT73-u|AIE00872|K.pneumoniaesubsppneumoniaeKP5...,u,GT73-u,----------------------------------------------...,----------------------------------------------...,248
3,GT48-u|ABX80511|C.parapsilosis_Fungi,u,GT48-u,----------------------------------------------...,----------------------------------------------...,46
4,GT48-u|AAF34719|C.glabrataATCC90876_Fungi,u,GT48-u,----------------------------------------------...,----------------------------------------------...,47
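
The `paddings` values above are consistent with centred padding: the first domain has 185 residues, which leaves 798 - 185 = 613 padding characters, and floor(613 / 2) = 306. The sketch below reproduces that arithmetic. It is inferred from the displayed output, not taken from the `Zero_Padding` source; `center_pad` is a hypothetical name, and placing the odd extra character on the right is an assumption.

```python
from typing import Tuple


def center_pad(seq: str, target_len: int = 798, pad_char: str = "-") -> Tuple[str, int]:
    """Hypothetical sketch: centre seq in a string of target_len using pad_char.

    Returns the padded string and the number of pad characters on the left.
    """
    total = target_len - len(seq)
    if total < 0:
        raise ValueError(f"sequence length {len(seq)} exceeds target {target_len}")
    left = total // 2
    # Assumption: when total is odd, the extra character goes on the right
    right = total - left
    return pad_char * left + seq + pad_char * right, left


padded, left = center_pad("C" * 185)
print(len(padded), left)
```

The same function gives a left padding of 46 for a 705-residue domain, which matches the fourth row of the table.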


### 6. Partition the data

In [9]:
new_GT_seq_final = Partition(new_GT_seq_pad, maxwordCount=798)

Jump extra-long tokens


### 7. Save processed table to csv

In [10]:
new_GT_seq_final.to_csv("../ExampleOutputs/gtu.processed.csv", index=False)