# Adding KOG functional categories to annotated toxins

Importing modules:

In [60]:
import pandas as pd
import numpy as np

Importing toxins and KOG tables into dataframes:

In [61]:
kog = pd.read_excel("KOG_BLASTP_RESULTS.xlsx", usecols=["qseqid", "sseqid"])
toxins = pd.read_excel("Toxins_transcriptome.xlsx", usecols=["prot_id"], sheet_name="Somente toxinas") 

Changing kog column names:

In [63]:
kog = kog.rename(columns={'qseqid': 'prot_id', 'sseqid': 'kog_funccat'})

Checking if dataframes have duplicate values in the 'prot_id' column, which will be used to merge them: 

In [64]:
dataframes={"kog": kog, "toxins": toxins}
for k, v in dataframes.items():
    print("{} dataframe has only unique entries: {}".format(k, v['prot_id'].is_unique))

kog dataframe has only unique entries: False
toxins dataframe has only unique entries: False


Both dataframes contain duplicate values. Selecting every row whose prot_id is duplicated and saving those rows to xlsx files for manual inspection:

In [65]:
for k, v in dataframes.items():
    duplicateRowsDF = v[v.duplicated(subset='prot_id', keep=False)]
    duplicateRowsDF.to_excel("Duplicated_{}.xlsx".format(k), index=False)
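To make the manual check easier, duplicates can be split into exact repeats (safe to drop) and conflicts (same prot_id, different category). The following is a minimal sketch on toy data; the IDs and the names `all_dups`, `exact_dups` and `conflicting_ids` are illustrative assumptions, not part of the original tables.

```python
import pandas as pd

# Toy data (hypothetical IDs, not taken from the real tables).
df = pd.DataFrame({
    "prot_id": ["p1", "p2", "p2", "p3", "p3", "p4"],
    "kog_funccat": ["L", "D", "D", "T", "R", "U"],
})

# keep=False flags every member of a duplicated group, not just the repeats.
all_dups = df[df.duplicated(subset="prot_id", keep=False)]

# Exact repeats: the whole row is identical, so dropping them loses nothing.
exact_dups = df[df.duplicated(keep=False)]

# Conflicts: the same prot_id carries more than one distinct category.
n_variants = df.drop_duplicates().groupby("prot_id")["kog_funccat"].nunique()
conflicting_ids = sorted(n_variants[n_variants > 1].index)
```

If `conflicting_ids` is empty, removing duplicates cannot discard a functional category.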

After manually checking that no information will be lost in the kog functional categories, we remove the duplicates for both datasets:

In [66]:
#Cleaning toxins dataframe (has duplicated empty values):
toxins.replace('', np.nan, inplace=True) #replaces empty strings with NaN
toxins.dropna(inplace=True) #removes NaN 
toxins.drop_duplicates(keep="first", inplace=True) #removes non-NaN duplicates

#Cleaning kog dataframe:
kog.drop_duplicates(keep='first', inplace=True) #removes fully duplicated rows (all columns identical)

#Checking if the dataframes are unique 
for k, v in dataframes.items():
    print("{} dataframe has only unique entries: {}".format(k, v['prot_id'].is_unique))

#Saving to unique xlsx tables:
kog.to_excel("Unique_kog.xlsx", index=False)
toxins.to_excel("Unique_toxins.xlsx", index=False)

kog dataframe has only unique entries: True
toxins dataframe has only unique entries: True
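The same cleaning steps can also be written as one chain that returns a new dataframe instead of modifying it in place. This is a sketch on toy data, assuming the only column is `prot_id`; `toy` and `clean` are hypothetical names. Passing `subset="prot_id"` makes the deduplication key explicit, and `reset_index(drop=True)` renumbers the rows after the removals.

```python
import numpy as np
import pandas as pd

# Toy column with empty strings and repeated IDs (hypothetical values).
toy = pd.DataFrame({"prot_id": ["p1", "", "p2", "p2", "", "p3"]})

clean = (
    toy.replace("", np.nan)                  # empty strings become NaN
       .dropna(subset=["prot_id"])           # drop rows without an ID
       .drop_duplicates(subset="prot_id", keep="first")  # one row per ID
       .reset_index(drop=True)               # renumber rows 0..n-1
)
```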


Now that the dataframes are clean, we will parse the KOG functional categories:

In [67]:
kog['kog_funccat'] = kog['kog_funccat'].str.split('|').str[3]
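The split above takes the fourth pipe-separated field of each identifier. The sketch below shows the behaviour on made-up strings; the field layout `f0|f1|f2|L` is an assumption for illustration only, since the real sseqid format is not shown here. Identifiers with fewer than four fields yield NaN rather than raising an error, so counting NaN values is a cheap sanity check.

```python
import pandas as pd

# Made-up identifiers; only the position of the last field matters here.
sseqid = pd.Series(["f0|f1|f2|L", "f0|f1|f2|T", "malformed|id"])

# Split on the literal '|' and take the element at index 3 (the fourth field).
funccat = sseqid.str.split("|").str[3]

# Rows that could not be parsed come back as NaN.
n_unparsed = int(funccat.isna().sum())
```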

Checking if kog dataframe is ok:

In [68]:
kog

| | prot_id | kog_funccat |
|---|---|---|
| 0 | TRINITY_DN0_c0_g1_i1.p3 | L |
| 1 | TRINITY_DN10003_c0_g1_i1.p2 | D |
| 2 | TRINITY_DN10012_c0_g1_i1.p1 | T |
| 3 | TRINITY_DN10018_c0_g1_i1.p1 | R |
| 4 | TRINITY_DN10026_c0_g1_i1.p1 | T |
| ... | ... | ... |
| 38273 | TRINITY_DN998_c0_g1_i1.p1 | U |
| 38274 | TRINITY_DN9990_c0_g1_i1.p1 | R |
| 38275 | TRINITY_DN9998_c0_g1_i1.p1 | B |
| 38276 | TRINITY_DN99_c0_g1_i1.p1 | T |


Joining 'toxins' and 'kog' dataframes.

Inner join: only prot_id values that appear in both tables are kept, so the final dataframe contains only toxins that were assigned a KOG functional category.

In [71]:
toxins_kog = pd.merge(toxins, kog, on=["prot_id"], how='inner')
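As an illustration of what the inner join keeps, here is a small sketch on toy tables (hypothetical IDs; `toxins_toy`, `kog_toy` and `merged` are assumed names). The optional `validate="one_to_one"` argument makes pandas raise an error if either side still has a repeated key, which guards the uniqueness assumption checked earlier.

```python
import pandas as pd

# Toy tables (hypothetical IDs).
toxins_toy = pd.DataFrame({"prot_id": ["p1", "p2", "p3"]})
kog_toy = pd.DataFrame({
    "prot_id": ["p2", "p3", "p4", "p5"],
    "kog_funccat": ["D", "T", "R", "U"],
})

# Inner join: only IDs present in both tables survive.
# validate="one_to_one" raises MergeError if a key is repeated on either side.
merged = pd.merge(toxins_toy, kog_toy, on="prot_id", how="inner",
                  validate="one_to_one")
```

Here `merged` holds only p2 and p3, so it is smaller than both inputs.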

Checking the dataframe sizes (expected result: toxins_kog will be smaller than both kog and toxins):

In [73]:
print('Kog entries: {}\nToxin entries: {}\nToxin_kog entries: {}'.format(len(kog), len(toxins), len(toxins_kog)))

Kog entries: 27919
Toxin entries: 945
Toxin_kog entries: 364
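The counts show that many toxins have no KOG hit. One possible way to list them, sketched here on toy data with hypothetical IDs and names, is a left join with `indicator=True`, which adds a `_merge` column marking rows found only in the left table.

```python
import pandas as pd

# Toy tables (hypothetical IDs).
toxins_toy = pd.DataFrame({"prot_id": ["p1", "p2", "p3"]})
kog_toy = pd.DataFrame({
    "prot_id": ["p2", "p3", "p4"],
    "kog_funccat": ["D", "T", "R"],
})

# Left join keeps every toxin; _merge records whether a KOG row was found.
with_flag = pd.merge(toxins_toy, kog_toy, on="prot_id", how="left",
                     indicator=True)

# Toxins without any KOG functional category.
unmatched = with_flag.loc[with_flag["_merge"] == "left_only", "prot_id"].tolist()
n_matched = int((with_flag["_merge"] == "both").sum())
```

The matched and unmatched counts should add up to the number of toxins.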


Saving the final dataframe to an Excel table:

In [74]:
toxins_kog.to_excel('TOXINS_KOG.xlsx', index=False)