# Data From PATRIC

In [2]:
import pandas as pd
import numpy as np
import networkx as nx

## Genomic Features

The table below contains a list of genomic features, including coding DNA.

Each feature is uniquely identified by its BRC ID and is associated with a protein family, referred to as a PATRIC genus-specific family (PLfam).

In [5]:
# Load the four genome feature exports and stack them row-wise.
# The original per-file row index is kept, so index labels repeat across files.
features = pd.concat(
    [pd.read_csv(f'genome_features{i}.csv') for i in [1, 2, 3, 4]], axis = 0
)

In [6]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17217 entries, 0 to 4326
Data columns (total 21 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   Genome                                   17217 non-null  object 
 1   Genome ID                                17217 non-null  float64
 2   Accession                                17217 non-null  object 
 3   BRC ID                                   17217 non-null  object 
 4   RefSeq Locus Tag                         14821 non-null  object 
 5   Alt Locus Tag                            12958 non-null  object 
 6   Feature ID                               17217 non-null  object 
 7   Annotation                               17217 non-null  object 
 8   Feature Type                             17217 non-null  object 
 9   Start                                    17217 non-null  int64  
 10  End                                      17217 

From this table, we extract the data needed to map features to the protein families referred to by Nguyen et al.:

In [7]:
plf = features[['BRC ID', 'PATRIC genus-specific families (PLfams)']].astype("string")
plf.columns = ['BRC_ID', 'PLFam']
plf.set_index('BRC_ID', inplace = True)
plf

BRC_ID,PLFam
fig|83332.12.peg.1000,PLF_1763_00001652
fig|83332.12.peg.1001,PLF_1763_00001934
fig|83332.12.peg.1002,PLF_1763_00021396
fig|83332.12.peg.1003,PLF_1763_00003246
fig|83332.12.peg.1004,PLF_1763_00003200
...,...
fig|233413.5.peg.7,PLF_1763_00003574
fig|233413.5.peg.999,PLF_1763_00002802
fig|233413.5.peg.106,PLF_1763_00002641
fig|233413.5.peg.1000,PLF_1763_00002923
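
As a minimal sketch of how such a BRC ID to PLfam table can be used, the snippet below builds a two-row stand-in from rows shown in the output above and looks up families by feature ID. The name `plf_demo` and the ID `fig|0.0.peg.0` are hypothetical, introduced only for illustration; the sketch also assumes each BRC ID appears once in the index.

```python
import pandas as pd

# Toy stand-in for the `plf` table (hypothetical name `plf_demo`),
# built from two rows shown in the output above.
plf_demo = (
    pd.DataFrame(
        {
            "BRC_ID": ["fig|83332.12.peg.1000", "fig|83332.12.peg.1001"],
            "PLFam": ["PLF_1763_00001652", "PLF_1763_00001934"],
        }
    )
    .astype("string")
    .set_index("BRC_ID")
)

# Look up the family of a single feature by its BRC ID.
family = plf_demo.loc["fig|83332.12.peg.1000", "PLFam"]

# Look up several IDs at once; IDs absent from the table yield a missing value.
# "fig|0.0.peg.0" is a made-up ID used to show the missing case.
families = plf_demo["PLFam"].reindex(["fig|83332.12.peg.1001", "fig|0.0.peg.0"])
```

`reindex` is used here rather than `.loc` with a list because it tolerates IDs that are not in the table instead of raising an error.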


## Protein Interaction Network

The table below contains pairs of proteins that interact with each other in the Mycobacterium tuberculosis protein network, identified by their BRC IDs.

In [9]:
ppi = pd.read_csv('ppi_patric.csv')
ppi = ppi[['Interactor A ID', 'Interactor B ID']].astype("string")
ppi.columns = ['Interactor_A_ID', 'Interactor_B_ID']
ppi

,Interactor_A_ID,Interactor_B_ID
0,fig|419947.8.peg.2722,fig|419947.9.peg.2601
1,fig|83332.12.peg.4366,fig|83332.12.peg.2246
2,fig|83332.12.peg.4366,fig|83332.12.peg.899
3,fig|83332.12.peg.4366,fig|83332.12.peg.3329
4,fig|83332.12.peg.4142,fig|83332.12.peg.3034
...,...,...
2429,fig|419947.9.peg.3089,fig|419947.9.peg.4595
2430,fig|83332.12.peg.339,fig|83332.12.peg.764
2431,fig|83332.12.peg.4012,fig|83332.12.peg.3267
2432,fig|419947.9.peg.2083,fig|419947.9.peg.4173


## Specialty Genes

The table containing specialty genes relates several genomic features to a relevant property:

 - Essential gene
 - Antibiotic resistance
 - Virulence factor
 - Human homolog
 - Drug target
 - Transporter
 
We are particularly interested in properties associated with antibiotic resistance. Besides the genes directly related to antibiotic resistance, there may also be a causal relation between virulence factors and bacterial resistance.
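
As a rough sketch of how rows could be selected by property, the snippet below filters a toy table for one property, or for a set of properties. The table `specialty_demo`, the `gene_*` IDs, and the exact spelling of the `Property` values are assumptions made for illustration; the real values should be checked against the loaded data, for example with `specialty_genes['Property'].unique()`.

```python
import pandas as pd

# Toy table with hypothetical IDs; the Property spellings are assumed,
# not taken from the real PATRIC export.
specialty_demo = pd.DataFrame(
    {
        "BRC_ID": ["gene_a", "gene_b", "gene_c", "gene_d"],
        "Property": [
            "Antibiotic Resistance",
            "Virulence Factor",
            "Transporter",
            "Antibiotic Resistance",
        ],
    }
).set_index("BRC_ID")

# Keep only features annotated with antibiotic resistance.
resistance = specialty_demo[specialty_demo["Property"] == "Antibiotic Resistance"]

# Keep features annotated with either property of interest.
wanted = ["Antibiotic Resistance", "Virulence Factor"]
resistance_or_virulence = specialty_demo[specialty_demo["Property"].isin(wanted)]
```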

Here the table is loaded and reduced to the BRC ID and Property columns, so that it can be filtered by the antibiotic resistance property:

In [10]:
# Load the four specialty gene exports and stack them row-wise.
specialty_genes = pd.concat(
    [pd.read_csv(f'specialty_genes{i}.csv') for i in [1, 2, 3, 4]], axis = 0
)

# Keep only the feature ID and its property, indexed by BRC ID.
specialty_genes = specialty_genes[['BRC ID', 'Property']]
specialty_genes.columns = ['BRC_ID', 'Property']
specialty_genes.set_index('BRC_ID', inplace = True)
specialty_genes