# Data From PATRIC

In [4]:
import pandas as pd
import numpy as np
import networkx as nx

## Genomic Features

The table below contains a list of genomic features, including coding DNA.

Each feature is uniquely identified by its BRC ID and is associated with a protein family, referred to as a PATRIC genus-specific family (PLfam).

In [100]:
features = pd.read_csv('genome_features.csv')

In [101]:
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 21 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Genome                                   10999 non-null  object 
 1   Genome ID                                10999 non-null  float64
 2   Accession                                10999 non-null  object 
 3   BRC ID                                   10999 non-null  object 
 4   RefSeq Locus Tag                         10703 non-null  object 
 5   Alt Locus Tag                            5488 non-null   object 
 6   Feature ID                               10999 non-null  object 
 7   Annotation                               10999 non-null  object 
 8   Feature Type                             10999 non-null  object 
 9   Start                                    10999 non-null  int64  
 10  End                                      10999

From this table, we extract the data needed to map features to the protein families referred to by Nguyen et al.:

In [140]:
plf = features[['BRC ID', 'PATRIC genus-specific families (PLfams)']].astype("string")
plf.columns = ['BRC_ID', 'PLFam']
plf.set_index('BRC_ID', inplace = True)
plf

BRC_ID,PLFam
fig|1241616.6.peg.978,PLF_1279_00000947
fig|1241616.6.peg.979,PLF_1279_00001869
fig|1241616.6.peg.980,PLF_1279_00000303
fig|1241616.6.peg.981,PLF_1279_00000735
fig|1241616.6.peg.982,PLF_1279_00000362
...,...
fig|93061.5.peg.83,PLF_1279_00002111
fig|93061.5.peg.939,PLF_1279_00000867
fig|93061.5.peg.940,PLF_1279_00000994
fig|93061.5.peg.941,PLF_1279_00000907
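
Because later cells look up a PLfam with `plf.loc[...]`, it can help to confirm that BRC IDs really are unique and to count features without a family. The following is a minimal sketch on toy data; the IDs, family names, and variable names (`toy_features`, `toy_plf`) are illustrative assumptions, not values from `genome_features.csv`.

```python
import pandas as pd

# Toy stand-in for the genomic features table (IDs and families are made up)
toy_features = pd.DataFrame({
    'BRC ID': ['fig|1.1.peg.1', 'fig|1.1.peg.2', 'fig|1.1.peg.3'],
    'PATRIC genus-specific families (PLfams)': ['PLF_1_00000001', 'PLF_1_00000002', None],
})

# Same reshaping as in the notebook: string dtype, short names, BRC ID as index
toy_plf = toy_features.astype('string')
toy_plf.columns = ['BRC_ID', 'PLFam']
toy_plf = toy_plf.set_index('BRC_ID')

# A unique index means a lookup by BRC ID returns exactly one row
ids_are_unique = toy_plf.index.is_unique

# Features without a family assignment appear as <NA>
n_missing_family = int(toy_plf['PLFam'].isna().sum())

print(ids_are_unique, n_missing_family)
```

If the index were not unique, a lookup by BRC ID would return several rows and the per-row mapping used later would need to be adapted.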


## Protein Interaction Network

The table below contains pairs of interacting proteins in the Staphylococcus aureus protein network, identified by their BRC IDs.

In [103]:
ppi = pd.read_csv('ppi_patric.csv')
ppi = ppi[['Interactor A ID', 'Interactor B ID']].astype("string")
ppi.columns = ['Interactor_A_ID', 'Interactor_B_ID']
ppi.head()

Unnamed: 0,Interactor_A_ID,Interactor_B_ID
0,fig|93061.5.peg.452,fig|93061.5.peg.713
1,fig|93061.5.peg.1920,fig|93061.5.peg.1921
2,fig|93061.5.peg.111,fig|93061.5.peg.119
3,fig|93061.5.peg.112,fig|93061.5.peg.121
4,fig|93061.5.peg.1069,fig|93061.5.peg.1071


## Specialty Genes

The table containing specialty genes relates several genomic features to a relevant property:
 - Essential gene
 - Antibiotic resistance
 - Virulence factor
 - Human homolog
 - Drug target
 - Transporter
 
We are particularly interested in properties associated with antibiotic resistance. Besides the antibiotic resistance genes themselves, virulence factors may also be causally related to bacterial resistance.

In [104]:
specialty_genes = pd.read_csv('specialty_genes.csv')
specialty_genes = specialty_genes[['BRC ID', 'Property']]
specialty_genes.columns = ['BRC_ID', 'Property']
specialty_genes.set_index('BRC_ID', inplace = True)
specialty_genes.Property.unique()

array(['Antibiotic Resistance', 'Essential Gene', 'Virulence Factor',
       'Human Homolog', 'Drug Target', 'Transporter'], dtype=object)
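
Since both antibiotic resistance and virulence factors are of interest, one way to inspect them is to filter the table on those two properties and count the features carrying each. This is a sketch on toy data; the BRC IDs and the names `toy_specialty` and `properties_of_interest` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the specialty genes table (IDs are made up)
toy_specialty = pd.DataFrame({
    'BRC_ID': ['fig|1.1.peg.1', 'fig|1.1.peg.1', 'fig|1.1.peg.2', 'fig|1.1.peg.3'],
    'Property': ['Antibiotic Resistance', 'Virulence Factor',
                 'Virulence Factor', 'Transporter'],
}).set_index('BRC_ID')

# Keep only the two properties discussed in the text
properties_of_interest = ['Antibiotic Resistance', 'Virulence Factor']
selected = toy_specialty[toy_specialty['Property'].isin(properties_of_interest)]

# Number of rows per property
property_counts = selected['Property'].value_counts()

# Features annotated with both properties at once
n_properties = selected.groupby(level='BRC_ID')['Property'].nunique()
features_with_both = n_properties[n_properties == 2].index.tolist()

print(property_counts)
print(features_with_both)
```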

## Conserved Genes used for prediction in Nguyen et al. 2020

The next table lists the protein families of 10 experiments (each with 100 non-overlapping protein families) selected from a set of conserved genes and used in Nguyen et al.

Each protein family has a feature importance value derived from XGBoost, which measures how much that family contributes to classifying a genome as resistant or susceptible.

In [106]:
feature_importance = pd.read_excel('saureus_feature_importance.xlsx')

In [107]:
feature_importance

Unnamed: 0,Protein Family ID,Model,Total Feature Importance,Annotation
0,PLF_1279_00001080,1,162.412577,hypothetical protein
1,PLF_1279_00001505,1,81.039855,ABC transporter-like sensor ATP-binding protei...
2,PLF_1279_00001583,1,67.782436,Polysaccharide intercellular adhesin (PIA) bio...
3,PLF_1279_00001118,1,60.701992,"Nickel ABC transporter, substrate-binding prot..."
4,PLF_1279_00001691,1,54.623888,Activator of the mannose operon (transcription...
...,...,...,...,...
995,PLF_1279_00007034,10,0.000000,Cold shock protein of CSP family
996,PLF_1279_00001353,10,0.000000,UPF0398 protein YpsA
997,PLF_1279_00000861,10,0.000000,LSU ribosomal protein L15p (L27Ae)
998,PLF_1279_00000601,10,0.000000,LSU ribosomal protein L30p (L7e)


Since the information used in Nguyen et al. is given in terms of protein families, we need to associate every feature with its corresponding protein family.

Let's check whether every genomic feature in the PPI has an associated PATRIC genus-specific family (PLfam):

In [111]:
# Features in column A that have no entry in the PLfam table
ppi['Interactor_A_ID'][~ppi['Interactor_A_ID'].isin(plf.index)]

Series([], Name: Interactor_A_ID, dtype: string)

In [112]:
# Features in column B that have no entry in the PLfam table
ppi['Interactor_B_ID'][~ppi['Interactor_B_ID'].isin(plf.index)]

2085    fig|93061.5.peg.894
Name: Interactor_B_ID, dtype: string

There is no PLFam associated with the feature fig|93061.5.peg.894 (row 2085, interactor B).

Before discarding this interaction, let's also check whether any relevant property is associated with this feature:

In [110]:
specialty_genes.loc[specialty_genes.index == ppi['Interactor_B_ID'].loc[2085]]

BRC_ID,Property


There is no property associated with this feature, so it can be excluded:

In [114]:
ppi.drop(2085, axis = 0, inplace = True)
ppi.reset_index(drop=True, inplace=True)

Now every remaining feature in the PPI can be mapped to a PLfam.
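
The membership checks on interactor A and interactor B can also be done in a single pass by stacking both columns. The sketch below uses toy data; `toy_ppi`, `toy_plf`, and the short IDs are hypothetical stand-ins for the notebook's tables.

```python
import pandas as pd

# Toy PPI edge list and PLfam lookup table (IDs are made up)
toy_ppi = pd.DataFrame({
    'Interactor_A_ID': ['f1', 'f2', 'f3'],
    'Interactor_B_ID': ['f2', 'f9', 'f1'],
})
toy_plf = pd.DataFrame(
    {'PLFam': ['PLF_A', 'PLF_B', 'PLF_C']},
    index=pd.Index(['f1', 'f2', 'f3'], name='BRC_ID'),
)

# Stack both columns into one Series indexed by (row, column name)
endpoints = toy_ppi.stack()

# Endpoints that have no entry in the PLfam table
unmapped = endpoints[~endpoints.isin(toy_plf.index)]

# Rows of the edge list that contain at least one unmapped endpoint
rows_to_drop = unmapped.index.get_level_values(0).unique().tolist()

# Same clean-up as in the notebook: drop those rows and renumber
toy_ppi_clean = toy_ppi.drop(rows_to_drop).reset_index(drop=True)

print(unmapped)
```

Reading the rows to drop from the result avoids hard-coding a row number such as 2085, which would silently become wrong if the input file changed.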

## Writing PPI in terms of PLFams for conserved genes

We create a new PPI in which each feature belonging to a conserved gene is replaced by its PLfam:

In [115]:
# Work on a copy so the feature-level PPI is left untouched
ppi_plfams = ppi.copy()

# Protein families used as predictors in Nguyen et al.
conserved_plfams = set(feature_importance['Protein Family ID'])

# Replace a feature ID with its PLfam only when that PLfam is a conserved gene
for col in ['Interactor_A_ID', 'Interactor_B_ID']:
    for i in ppi_plfams.index:
        family = plf.loc[ppi.at[i, col], 'PLFam']
        if pd.notna(family) and family in conserved_plfams:
            ppi_plfams.at[i, col] = family

# Different features of the same family can produce identical edges
ppi_plfams.drop_duplicates(keep='first', inplace=True)
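
The row-by-row substitution can also be written without explicit loops, using `Series.map` for the lookup and `Series.where` to keep the original feature ID when its family is not a conserved gene. This is only a sketch on toy data; `toy_plf`, `conserved`, and `to_plfam` are hypothetical names introduced for the example.

```python
import pandas as pd

# Toy lookup from feature ID to PLfam, and the set of conserved families
toy_plf = pd.Series({'f1': 'PLF_A', 'f2': 'PLF_B', 'f3': 'PLF_C'}, name='PLFam')
conserved = {'PLF_A', 'PLF_C'}

# Toy feature-level edge list; rows 0 and 2 are the same edge
toy_ppi = pd.DataFrame({
    'Interactor_A_ID': ['f1', 'f2', 'f1'],
    'Interactor_B_ID': ['f3', 'f3', 'f3'],
})

def to_plfam(column):
    """Replace feature IDs by their PLfam when that family is conserved."""
    family = column.map(toy_plf)
    # Keep the family where it is conserved, otherwise fall back to the feature ID
    return family.where(family.isin(conserved), column)

toy_ppi_plfams = (
    toy_ppi.apply(to_plfam)      # applied to each interactor column in turn
    .drop_duplicates()
    .reset_index(drop=True)
)

print(toy_ppi_plfams)
```

On large edge lists this form is usually much faster than indexing one cell at a time, and it never modifies the original table.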

### Resistance Genes in PPI

In [116]:
specialty_genes[specialty_genes.Property == 'Antibiotic Resistance']

BRC_ID,Property
fig|1413510.3.peg.2169,Antibiotic Resistance
fig|93061.5.peg.1154,Antibiotic Resistance
fig|93061.5.peg.2089,Antibiotic Resistance
fig|93061.5.peg.842,Antibiotic Resistance
fig|158879.11.peg.1813,Antibiotic Resistance
...,...
fig|158879.11.peg.2331,Antibiotic Resistance
fig|1241616.6.peg.1396,Antibiotic Resistance
fig|158879.11.peg.647,Antibiotic Resistance
fig|158879.11.peg.2107,Antibiotic Resistance


In [1]:
# Turn BRC_ID back into a regular column; later cells only use AMR_genes['BRC_ID']
AMR_genes = specialty_genes.loc[specialty_genes.Property == 'Antibiotic Resistance'].reset_index()

In [118]:
AMR_genes

Unnamed: 0,BRC_ID,Property
0,fig|1413510.3.peg.2169,Antibiotic Resistance
1,fig|93061.5.peg.1154,Antibiotic Resistance
2,fig|93061.5.peg.2089,Antibiotic Resistance
3,fig|93061.5.peg.842,Antibiotic Resistance
4,fig|158879.11.peg.1813,Antibiotic Resistance
...,...,...
264,fig|158879.11.peg.2331,Antibiotic Resistance
265,fig|1241616.6.peg.1396,Antibiotic Resistance
266,fig|158879.11.peg.647,Antibiotic Resistance
267,fig|158879.11.peg.2107,Antibiotic Resistance


We need to find which genes related to antibiotic resistance are in the PPI:

In [119]:
AMR_genes_ppi_A = AMR_genes[AMR_genes['BRC_ID'].isin(ppi_plfams['Interactor_A_ID'])]['BRC_ID']
AMR_genes_ppi_B = AMR_genes[AMR_genes['BRC_ID'].isin(ppi_plfams['Interactor_B_ID'])]['BRC_ID']

AMR_genes_ppi = pd.DataFrame(pd.concat([AMR_genes_ppi_A, AMR_genes_ppi_B], axis = 0))
AMR_genes_ppi.reset_index(drop=True, inplace=True)

In [120]:
AMR_genes_ppi

Unnamed: 0,BRC_ID
0,fig|93061.5.peg.2089
1,fig|93061.5.peg.842
2,fig|93061.5.peg.1243
3,fig|93061.5.peg.2252
4,fig|93061.5.peg.1237
...,...
88,fig|93061.5.peg.88
89,fig|93061.5.peg.1310
90,fig|93061.5.peg.471
91,fig|93061.5.peg.2118


### Conserved Genes used for prediction in Nguyen et al. 2020 in PPI

We also need to find which conserved genes from the 10 experiments (each with 100 non-overlapping protein families) used for prediction in the paper are in the PPI:

In [121]:
conserved_ppi_A = feature_importance[feature_importance['Protein Family ID'].isin(ppi_plfams['Interactor_A_ID'])]['Protein Family ID']
conserved_ppi_B = feature_importance[feature_importance['Protein Family ID'].isin(ppi_plfams['Interactor_B_ID'])]['Protein Family ID']

conserved_ppi = pd.DataFrame(pd.concat([conserved_ppi_A, conserved_ppi_B], axis = 0).drop_duplicates())

In [122]:
conserved_ppi

Unnamed: 0,Protein Family ID
2,PLF_1279_00001583
3,PLF_1279_00001118
5,PLF_1279_00001741
6,PLF_1279_00001743
8,PLF_1279_00000675
...,...
974,PLF_1279_00001003
982,PLF_1279_00002144
988,PLF_1279_00001063
990,PLF_1279_00001416


# NetworkX

In [123]:
ppi_info = pd.DataFrame(columns = ['Conserved Gene', 'Shortest Path to an AMR gene (length)',])

ppi_info['Conserved Gene'] = conserved_ppi.reset_index(drop = True)['Protein Family ID']

## For each conserved gene having a path to an AMR, what is the length of this path? 

In [125]:
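# Build an undirected graph from the PLfam-level edge list
ppi_graph_plfams = nx.from_pandas_edgelist(ppi_plfams, 'Interactor_A_ID', 'Interactor_B_ID')

# For each conserved gene, record the distance to the closest reachable AMR gene
for idx, i in enumerate(conserved_ppi['Protein Family ID']):
    lengths = []
    for j in AMR_genes_ppi['BRC_ID']:
        if nx.has_path(ppi_graph_plfams, i, j):
            lengths.append(nx.shortest_path_length(ppi_graph_plfams, i, j))
    if lengths:
        # .at writes directly into ppi_info and avoids chained assignment
        ppi_info.at[idx, 'Shortest Path to an AMR gene (length)'] = min(lengths)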
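
The distance from each conserved gene to its nearest AMR gene can also be obtained from a single multi-source search that starts at all AMR genes at once, instead of one search per pair. The sketch below uses `nx.multi_source_dijkstra_path_length` on a toy graph; the node names and the variables `toy_graph`, `amr_nodes`, and `conserved_nodes` are assumptions for illustration.

```python
import networkx as nx

# Toy undirected network: PLF_* are conserved genes, amr_* are AMR genes
toy_graph = nx.Graph([
    ('PLF_A', 'amr_1'),
    ('PLF_A', 'PLF_B'),
    ('PLF_B', 'PLF_C'),
    ('PLF_C', 'amr_2'),
    ('PLF_D', 'PLF_E'),   # separate component with no AMR gene
])
amr_nodes = {'amr_1', 'amr_2'}

# One search from all AMR genes; edges without a weight attribute count as 1,
# so each value is the hop count to the nearest AMR gene
distance_to_amr = nx.multi_source_dijkstra_path_length(toy_graph, amr_nodes)

# Unreachable nodes are absent from the result, so .get() yields None for them
conserved_nodes = ['PLF_A', 'PLF_B', 'PLF_C', 'PLF_D']
shortest = {node: distance_to_amr.get(node) for node in conserved_nodes}

print(shortest)
```

Here `PLF_D` has no path to any AMR gene and gets `None`, which plays the same role as the missing values left in `ppi_info` by the pairwise loop.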

In [126]:
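# Look up each score by protein family ID rather than by row position,
# since conserved_ppi is not guaranteed to follow the row order of feature_importance
score_by_family = feature_importance.drop_duplicates('Protein Family ID').set_index('Protein Family ID')['Total Feature Importance']
ppi_info['Feature Score'] = ppi_info['Conserved Gene'].map(score_by_family)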

In [127]:
print(ppi_info.groupby(['Shortest Path to an AMR gene (length)']).size().reset_index(name='Count'))

   Shortest Path to an AMR gene (length)  Count
0                                      1    102
1                                      2    300
2                                      3    200
3                                      4     68
4                                      5     10
5                                      6      7


In [29]:
ppi_info

Unnamed: 0,Conserved Gene,Shortest Path to an AMR gene (length),Feature Score
0,PLF_1279_00001583,2,81.039855
1,PLF_1279_00001118,2,67.782436
2,PLF_1279_00001741,2,60.701992
3,PLF_1279_00001743,3,54.623888
4,PLF_1279_00000675,3,51.659804
...,...,...,...
753,PLF_1279_00001003,3,0.000000
754,PLF_1279_00002144,,0.000000
755,PLF_1279_00001063,3,0.000000
756,PLF_1279_00001416,3,0.000000


Removing genes with no path to an AMR gene:

In [128]:
ppi_info = ppi_info[~ppi_info[['Shortest Path to an AMR gene (length)', 'Feature Score']].isnull().any(axis = 1)]

## What is the correlation between the feature score and the length of the path?

In [129]:
ppi_info['Shortest Path to an AMR gene (length)'].astype('int').corr(ppi_info['Feature Score'].astype('float64'))

0.0497869055108677
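
Path length is a small integer and feature scores are heavily skewed toward zero, so a rank-based (Spearman) correlation may be a useful complement to the Pearson value. A minimal sketch on toy data follows; it computes Spearman as the Pearson correlation of the ranks, and the column names and numbers are invented for the example.

```python
import pandas as pd

# Toy stand-in for ppi_info after removing rows with missing values
toy_info = pd.DataFrame({
    'path_length': [1, 2, 2, 3, 4],
    'feature_score': [50.0, 20.0, 30.0, 5.0, 0.0],
})

# Pearson correlation, as computed in the notebook
pearson = toy_info['path_length'].corr(toy_info['feature_score'])

# Spearman correlation: Pearson correlation of the ranks (ties get average ranks)
spearman = toy_info['path_length'].rank().corr(toy_info['feature_score'].rank())

print(pearson, spearman)
```

In this toy table longer paths go with lower scores, so both coefficients are negative; the real data need not behave the same way.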