# Staphylococcus aureus
## Data

In [609]:
import pandas as pd
import numpy as np

### Genomic Features

The table below contains a list of genimoc features, including coding DNA.

Each feature is solely identified by BRC ID and associated to a protein family referred as PATRIC genus-specific families (PLfams).

In [610]:
saureus_features = pd.read_csv('saureus_genome_features.csv')

In [611]:
saureus_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10999 entries, 0 to 10998
Data columns (total 21 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Genome                                   10999 non-null  object 
 1   Genome ID                                10999 non-null  float64
 2   Accession                                10999 non-null  object 
 3   BRC ID                                   10999 non-null  object 
 4   RefSeq Locus Tag                         10703 non-null  object 
 5   Alt Locus Tag                            5488 non-null   object 
 6   Feature ID                               10999 non-null  object 
 7   Annotation                               10999 non-null  object 
 8   Feature Type                             10999 non-null  object 
 9   Start                                    10999 non-null  int64  
 10  End                                      10999

Through this table, we extract useful data to map protein families referred by Nguyen et. al.:

In [612]:
plf = saureus_features[['BRC ID', 'PATRIC genus-specific families (PLfams)']].astype("string")
plf.columns = ['BRC_ID', 'PLFam']
plf.set_index('BRC_ID', inplace = True)
plf.head()

Unnamed: 0_level_0,PLFam
BRC_ID,Unnamed: 1_level_1
fig|1241616.6.peg.978,PLF_1279_00000947
fig|1241616.6.peg.979,PLF_1279_00001869
fig|1241616.6.peg.980,PLF_1279_00000303
fig|1241616.6.peg.981,PLF_1279_00000735
fig|1241616.6.peg.982,PLF_1279_00000362


### Protein Interaction Network

The table below contais pairs of proteins interacting with each other on Staphylococcus aureus protein network, identified by their BRC ID.

In [613]:
saureus_ppi = pd.read_csv('saureus_ppi_patric.csv')
saureus_ppi = saureus_ppi[['Interactor A ID', 'Interactor B ID']].astype("string")
saureus_ppi.columns = ['Interactor_A_ID', 'Interactor_B_ID']
saureus_ppi.head()

Unnamed: 0,Interactor_A_ID,Interactor_B_ID
0,fig|93061.5.peg.452,fig|93061.5.peg.713
1,fig|93061.5.peg.1920,fig|93061.5.peg.1921
2,fig|93061.5.peg.111,fig|93061.5.peg.119
3,fig|93061.5.peg.112,fig|93061.5.peg.121
4,fig|93061.5.peg.1069,fig|93061.5.peg.1071


### Specialty Genes

The table containing specialty genes relates several genomic features to a relevant property:
 - Essential gene
 - Antibiotic resistance
 - Virulence factor
 - Human homolog
 - Drug target
 - Transporter
 
We are particularly interested on properties associated to antibiotic resistance. Besides genes related to antibiotic resistance themselves, it is possible to have causal relation between virulence factor and bacterial resistance.

In [614]:
sa_specialty_genes = pd.read_csv('saureus_specialty_genes.csv')
sa_specialty_genes = sa_specialty[['BRC ID', 'Property']]
sa_specialty_genes.columns = ['BRC_ID', 'Property']
sa_specialty_genes.set_index('BRC_ID', inplace = True)
sa_specialty_genes.Property.unique()

array(['Antibiotic Resistance', 'Essential Gene', 'Virulence Factor',
       'Human Homolog', 'Drug Target', 'Transporter'], dtype=object)

### Conserved Genes

Next table is listening protein families of 10 experiments (each one with 100 non overlapping protein families) selected from a set of conserved genes and used in the paper of Nguyen et. al.

Each protein family has a feature importance value derived from XGBoost, which means a contribution degree from a protein family given to classificate in resistant/susceptible phenotype.

In [615]:
sa_feature_importance = pd.read_excel('saureus_feature_importance.xlsx')

In [616]:
sa_feature_importance

Unnamed: 0,Protein Family ID,Model,Total Feature Importance,Annotation
0,PLF_1279_00001080,1,162.412577,hypothetical protein
1,PLF_1279_00001505,1,81.039855,ABC transporter-like sensor ATP-binding protei...
2,PLF_1279_00001583,1,67.782436,Polysaccharide intercellular adhesin (PIA) bio...
3,PLF_1279_00001118,1,60.701992,"Nickel ABC transporter, substrate-binding prot..."
4,PLF_1279_00001691,1,54.623888,Activator of the mannose operon (transcription...
...,...,...,...,...
995,PLF_1279_00007034,10,0.000000,Cold shock protein of CSP family
996,PLF_1279_00001353,10,0.000000,UPF0398 protein YpsA
997,PLF_1279_00000861,10,0.000000,LSU ribosomal protein L15p (L27Ae)
998,PLF_1279_00000601,10,0.000000,LSU ribosomal protein L30p (L7e)


Once the information used in the paper from Nguyen is given in therms of protein families, we need to associate every feature for a corresponding protein family.

Lets  check if every genome feature in the PPI has a associated Patric Local Family:

In [617]:
saureus_ppi['Interactor_A_ID'].isin(plf.index)[saureus_ppi['Interactor_A_ID'].isin(plf.index)==False]

Series([], Name: Interactor_A_ID, dtype: bool)

In [618]:
saureus_ppi['Interactor_B_ID'].isin(plf.index)[saureus_ppi['Interactor_B_ID'].isin(plf.index)==False]

2085    False
Name: Interactor_B_ID, dtype: bool

There is no PLFam associated to the feature fig|93061.5.peg.894 (line 2085, interactor B).

Before ignore this information, lets also check if there is some relevant characteristic related to this feature:

In [619]:
sa_specialty_genes.loc[sa_specialty_genes.index == saureus_ppi['Interactor_B_ID'].loc[2085]]

Unnamed: 0_level_0,Property
BRC_ID,Unnamed: 1_level_1


There is no property associated to this feature, hence, it can be excluded:

In [620]:
saureus_ppi.drop(2085, axis = 0, inplace = True)
saureus_ppi.reset_index(drop=True, inplace=True)

Now we can map a feature to a Patric Local Family with no problem.

# NetworkX

In [621]:
import networkx as nx
import scipy

This is the network protein interaction in terms of PLFams:

In [622]:
ppi_graph = nx.from_pandas_edgelist(saureus_ppi, 'Interactor_A_ID', 'Interactor_B_ID')

In [623]:
ppi_graph.number_of_edges()

4999

### Resistance Genes

In [624]:
sa_specialty_genes[sa_specialty_genes.Property == 'Antibiotic Resistance']

Unnamed: 0_level_0,Property
BRC_ID,Unnamed: 1_level_1
fig|1413510.3.peg.2169,Antibiotic Resistance
fig|93061.5.peg.1154,Antibiotic Resistance
fig|93061.5.peg.2089,Antibiotic Resistance
fig|93061.5.peg.842,Antibiotic Resistance
fig|158879.11.peg.1813,Antibiotic Resistance
...,...
fig|158879.11.peg.2331,Antibiotic Resistance
fig|1241616.6.peg.1396,Antibiotic Resistance
fig|158879.11.peg.647,Antibiotic Resistance
fig|158879.11.peg.2107,Antibiotic Resistance


In [625]:
resistance_genes = sa_specialty_genes.loc[sa_specialty_genes.Property == 'Antibiotic Resistance'].reset_index()

In [626]:
resistance_genes

Unnamed: 0,BRC_ID,Property
0,fig|1413510.3.peg.2169,Antibiotic Resistance
1,fig|93061.5.peg.1154,Antibiotic Resistance
2,fig|93061.5.peg.2089,Antibiotic Resistance
3,fig|93061.5.peg.842,Antibiotic Resistance
4,fig|158879.11.peg.1813,Antibiotic Resistance
...,...,...
264,fig|158879.11.peg.2331,Antibiotic Resistance
265,fig|1241616.6.peg.1396,Antibiotic Resistance
266,fig|158879.11.peg.647,Antibiotic Resistance
267,fig|158879.11.peg.2107,Antibiotic Resistance


We need to find which genes related to antibiotic resistance are in the PPI:

In [661]:
resistance_genes_ppi_A = resistance_genes[resistance_genes['BRC_ID'].isin(saureus_ppi['Interactor_A_ID'])]['BRC_ID']
resistance_genes_ppi_B = resistance_genes[resistance_genes['BRC_ID'].isin(saureus_ppi['Interactor_B_ID'])]['BRC_ID']

resistance_genes_ppi = pd.DataFrame(pd.concat([resistance_genes_ppi_A, resistance_genes_ppi_B], axis = 0))
resistance_genes_ppi.reset_index(drop=True, inplace=True)

### Conserved Genes used for prediction in Nguyen et. al. 2020

We also need to find which conserved genes used to prediction in the paper are in the PPI

In [628]:
sa_conserved_ppi_A = sa_feature_importance[sa_feature_importance['Protein Family ID'].isin(saureus_ppi['Interactor_A_ID'])]['Protein Family ID']
sa_conserved_ppi_B = sa_feature_importance[sa_feature_importance['Protein Family ID'].isin(saureus_ppi['Interactor_B_ID'])]['Protein Family ID']

sa_conserved_ppi = pd.DataFrame(pd.concat([sa_paper_in_ppi_A, sa_paper_in_ppi_B], axis = 0).drop_duplicates())

### Virulence Factors

In [630]:
virulence_genes = sa_specialty_genes.loc[sa_specialty_genes.Property == 'Virulence Factor'].reset_index()

We need to find which genes related to virulence are in the PPI:

In [631]:
virulence_genes_ppi_A = virulence_genes[virulence_genes['BRC_ID'].isin(saureus_ppi['Interactor_A_ID'])]
virulence_genes_ppi_B = virulence_genes[virulence_genes['BRC_ID'].isin(saureus_ppi['Interactor_B_ID'])]

virulence_genes_ppi = pd.concat([virulence_genes_ppi_A, virulence_genes_ppi_B], axis = 0)

In [659]:
virulence_genes_ppi.reset_index(drop=True, inplace=True)

### Writing PPI in terms of PLFams for conserved genes

Now, for conserved genes, lets create a new PPI substituing the feature for its Patric Local Familiy for conserved genes: 

In [633]:
saureus_ppi_plfams = saureus_ppi

for i in range(1,len(saureus_ppi['Interactor_A_ID'])):
    if plf.loc[saureus_ppi['Interactor_A_ID'][i]].isin(sa_feature_importance['Protein Family ID']).bool():
        saureus_ppi_plfams.at[i, 'Interactor_A_ID'] = plf.loc[saureus_ppi['Interactor_A_ID'][i]].PLFam
        
for i in range(1,len(saureus_ppi['Interactor_B_ID'])):
    if plf.loc[saureus_ppi['Interactor_B_ID'][i]].isin(sa_feature_importance['Protein Family ID']).bool():
        saureus_ppi_plfams.at[i, 'Interactor_B_ID'] = plf.loc[saureus_ppi['Interactor_B_ID'][i]].PLFam        
        
saureus_ppi_plfams.drop_duplicates(subset=None, keep='first', inplace=True)

## Statistics

In [638]:
ppi_info = pd.DataFrame(columns = ['Conserved Gene', 'Shortest Path to an AMR gene (length)', 'Is it virulence related?'])

ppi_info['Conserved Gene'] = sa_conserved_in_ppi.reset_index(drop = True)

#### For each conserved gene having a path to an AMR, what is the length of this path? 

In [658]:
resistance_genes_ppi

Unnamed: 0,BRC_ID
2,fig|93061.5.peg.2089
3,fig|93061.5.peg.842
7,fig|93061.5.peg.2384
8,fig|93061.5.peg.2139
11,fig|93061.5.peg.1243
...,...
247,fig|93061.5.peg.1310
252,fig|93061.5.peg.471
254,fig|93061.5.peg.2118
255,fig|93061.5.peg.287


In [665]:
ppi_graph_plfams = nx.from_pandas_edgelist(saureus_ppi_plfams, 'Interactor_A_ID', 'Interactor_B_ID')

ppi_info = pd.DataFrame(columns = ['Conserved Gene', 'Shortest Path to an AMR gene (length)', 'Is it virulence related?'])

ppi_info['Conserved Gene'] = sa_conserved_in_ppi.reset_index(drop = True)

idx = 0
for i in sa_conserved_in_ppi:
    lengths = []
    for j in resistance_genes_ppi['BRC_ID']:
        if nx.has_path(ppi_graph_plfams, i, j):
            lengths.append(nx.shortest_path_length(ppi_graph_plfams, i, j))
    if lengths:        
        ppi_info['Shortest Path to an AMR gene (length)'][idx] = min(lengths)
        
    idx += 1

In [712]:
ppi_info['Model'] = np.float64(sa_feature_importance[sa_feature_importance['Protein Family ID'].isin(sa_conserved_in_ppi)]['Model'].reset_index(drop = True))

In [709]:
ppi_info['Feature Score'] = sa_feature_importance[sa_feature_importance['Protein Family ID'].isin(sa_conserved_in_ppi)]['Total Feature Importance'].reset_index(drop = True)np.float64

In [708]:
print(ppi_info.groupby(['Shortest Path to an AMR gene (length)', 'Model']).size().reset_index(name='Count'))

    Shortest Path to an AMR gene (length)  Model  Count
0                                       1      1     13
1                                       1      2     17
2                                       1      3     13
3                                       1      4     10
4                                       1      5      7
5                                       1      6      7
6                                       1      7     12
7                                       1      8     14
8                                       1      9      2
9                                       1     10      8
10                                      2      1     42
11                                      2      2     29
12                                      2      3     32
13                                      2      4     30
14                                      2      5     31
15                                      2      6     38
16                                      2      7

In [713]:
ppi_info

Unnamed: 0,Conserved Gene,Shortest Path to an AMR gene (length),Is it virulence related?,Model,Feature Score
0,PLF_1279_00001583,2,,1.0,81.039855
1,PLF_1279_00001118,2,,1.0,67.782436
2,PLF_1279_00001741,2,,1.0,60.701992
3,PLF_1279_00001743,3,,1.0,54.623888
4,PLF_1279_00000675,3,,1.0,51.659804
...,...,...,...,...,...
753,PLF_1279_00001003,3,,10.0,0.000000
754,PLF_1279_00002144,,,10.0,0.000000
755,PLF_1279_00001063,3,,10.0,0.000000
756,PLF_1279_00001416,3,,10.0,0.000000


In [716]:
ppi_info2 = ppi_info[~ppi_info[['Shortest Path to an AMR gene (length)', 'Feature Score']].isnull().any(axis = 1)]

#### What is the correlaton between the feature score and the length of the path?

In [719]:
ppi_info2['Shortest Path to an AMR gene (length)'].astype('int').corr(ppi_info2['Feature Score'].astype('float64'))

0.04841244141598265