# Antibiotic Resistance Prediction

Let's start by opening the PATRIC data that lists bacterial genomes and the antibiotics they are resistant or susceptible to.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

dat = pd.read_csv("PATRIC_genomes_AMR.tsv", sep='\t', dtype=str)
dat.head()

,genome_id,genome_name,taxon_id,antibiotic,resistant_phenotype,measurement,measurement_sign,measurement_value,measurement_unit,laboratory_typing_method,laboratory_typing_method_version,laboratory_typing_platform,vendor,testing_standard,testing_standard_year,source
0,32002.4,Achromobacter denitrificans strain USDA-ARS-US...,32002,ampicillin,,==16,==,16,mg/L,MIC,BOPO6F plate; cattle host,Sensititre,TREK Diagnostic Systems,CLSI,,
1,32002.4,Achromobacter denitrificans strain USDA-ARS-US...,32002,ceftiofur,,>8,>,8,mg/L,MIC,BOPO6F plate; cattle host,Sensititre,TREK Diagnostic Systems,CLSI,,
2,32002.4,Achromobacter denitrificans strain USDA-ARS-US...,32002,chlortetracycline,,==8,==,8,mg/L,MIC,BOPO6F plate; cattle host,Sensititre,TREK Diagnostic Systems,CLSI,,
3,32002.4,Achromobacter denitrificans strain USDA-ARS-US...,32002,clindamycin,,>16,>,16,mg/L,MIC,BOPO6F plate; cattle host,Sensititre,TREK Diagnostic Systems,CLSI,,
4,32002.4,Achromobacter denitrificans strain USDA-ARS-US...,32002,danofloxacin,,==1,==,1,mg/L,MIC,BOPO6F plate; cattle host,Sensititre,TREK Diagnostic Systems,CLSI,,


We only really care about the antibiotic/bacteria pairs and not how the measurements were taken, who took them, etc., so let's drop all the irrelevant columns. Also, let's drop any data points that don't list a resistant phenotype (i.e., whether the genome is susceptible or resistant to the antibiotic).

In [5]:
orig_rows = dat.shape[0]
dat = dat[["genome_id", "genome_name", "taxon_id", "antibiotic", "resistant_phenotype"]]
dat = dat.dropna(how="any")
dropped_rows = orig_rows - dat.shape[0]
print("Dropped {} rows of the original {}".format(dropped_rows, orig_rows))
dat.head()

Dropped 15788 rows of the original 125389


,genome_id,genome_name,taxon_id,antibiotic,resistant_phenotype
18,1310800.122,Acinetobacter baumannii 1000160,1310800,imipenem,Susceptible
19,1310784.3,Acinetobacter baumannii 1007214,1310784,carbapenem,Susceptible
20,1310784.3,Acinetobacter baumannii 1007214,1310784,imipenem,Susceptible
21,1310751.3,Acinetobacter baumannii 1022959,1310751,carbapenem,Resistant
22,1310751.3,Acinetobacter baumannii 1022959,1310751,imipenem,Resistant


To avoid dealing with the 60 GB of genome sequences PATRIC has available, we're only going to build classifiers for three antibiotics. Let's narrow the list down to three that each have a medium number of tested genomes (around 1000): penicillin, capreomycin, and fusidic acid. Let's check some quick stats on the data for each of these antibiotics.
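
As a rough sketch of how such a shortlist could be found, the snippet below counts distinct genomes per antibiotic and keeps those close to a target. The helper name `antibiotics_near`, its `target` and `tolerance` parameters, and the toy frame are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def antibiotics_near(dat, target=1000, tolerance=300):
    """Hypothetical helper: antibiotics whose number of distinct tested
    genomes lies within `tolerance` of `target`, largest first."""
    genomes_per_antibiotic = dat.groupby("antibiotic")["genome_id"].nunique()
    close = genomes_per_antibiotic[(genomes_per_antibiotic - target).abs() <= tolerance]
    return close.sort_values(ascending=False)

# Toy frame standing in for `dat` (made-up IDs and antibiotic names)
toy = pd.DataFrame({
    "genome_id": ["1.1", "1.2", "1.3", "2.1", "2.2", "3.1"],
    "antibiotic": ["a", "a", "a", "b", "b", "c"],
})
shortlist = antibiotics_near(toy, target=3, tolerance=1)
print(shortlist)
```

On the real table, the call would be `antibiotics_near(dat)`, with the tolerance adjusted to taste.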

In [67]:
from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

dat_penicillin = dat[dat['antibiotic'] == 'penicillin']
dat_capreomycin = dat[dat['antibiotic'] == 'capreomycin']
dat_fusidic_acid = dat[dat['antibiotic'] == 'fusidic acid']

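def getNumUniqueGenus(antibiotic_dat):
    # genome_id looks like '<taxon_id>.<n>', so the prefix before the dot is a
    # taxon ID; distinct prefixes serve here as a rough proxy for genus.
    # Read from the frame passed in. The original read dat_penicillin here,
    # which is why the recorded output shows the same count for all three.
    ids = antibiotic_dat['genome_id'].unique()
    return len({id_num.split('.')[0] for id_num in ids})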

print("There are {} unique genuses for penicillin, {} for capreomycin, and {} for fusidic acid".format(
    getNumUniqueGenus(dat_penicillin),
    getNumUniqueGenus(dat_capreomycin),
    getNumUniqueGenus(dat_fusidic_acid)
))

print("There are {} penicillin data points, {} capreomycin data points, and {} fusidic acid data points".format(
    dat_penicillin.shape[0],
    dat_capreomycin.shape[0],
    dat_fusidic_acid.shape[0]))

cnt_p = dat_penicillin[['genome_id', 'resistant_phenotype']].groupby('resistant_phenotype').count()
cnt_c = dat_capreomycin[['genome_id', 'resistant_phenotype']].groupby('resistant_phenotype').count()
cnt_f = dat_fusidic_acid[['genome_id', 'resistant_phenotype']].groupby('resistant_phenotype').count()
display_side_by_side(cnt_p, cnt_c, cnt_f)

There are 148 unique genuses for penicillin, 148 for capreomycin, and 148 for fusidic acid
There are 1430 penicillin data points, 1210 capreomycin data points, and 1106 fusidic acid data points


resistant_phenotype,genome_id
Intermediate,107
Resistant,1089
Susceptible,234

resistant_phenotype,genome_id
Resistant,234
Susceptible,976

resistant_phenotype,genome_id
Intermediate,6
Resistant,121
Susceptible,979
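
The per-antibiotic phenotype counts can also be produced in a single table with `pd.crosstab`, which makes the class balance easier to compare across antibiotics. This is only a sketch: the helper name `phenotype_counts` and the toy frame are assumptions for illustration, and the real call would pass `dat` instead.

```python
import pandas as pd

def phenotype_counts(dat, antibiotics):
    """Hypothetical helper: rows are antibiotics, columns are phenotypes,
    cells are the number of data points."""
    subset = dat[dat["antibiotic"].isin(antibiotics)]
    return pd.crosstab(subset["antibiotic"], subset["resistant_phenotype"])

# Toy frame standing in for `dat` (made-up rows)
toy = pd.DataFrame({
    "antibiotic": ["penicillin", "penicillin", "capreomycin", "fusidic acid", "ampicillin"],
    "resistant_phenotype": ["Resistant", "Susceptible", "Susceptible", "Resistant", "Resistant"],
})
counts = phenotype_counts(toy, ["penicillin", "capreomycin", "fusidic acid"])
# Row-wise fractions show how imbalanced each antibiotic's labels are
fractions = counts.div(counts.sum(axis=1), axis=0)
print(counts)
print(fractions)
```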


We have at least a few hundred data points for each antibiotic, as well as a good variety of bacteria types, so we are good to go. Let's download the protein coding DNA sequences for the bacteria in our dataset. This will take a while.

In [None]:
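import ftplib
import os

# Union of genome IDs across the three antibiotics. Adding the NumPy arrays
# returned by unique() with + does not concatenate them, so build sets instead.
genome_ids = (set(dat_penicillin.genome_id)
              | set(dat_capreomycin.genome_id)
              | set(dat_fusidic_acid.genome_id))

os.makedirs('./sequences', exist_ok=True)  # make sure the target directory exists

for genome_id in genome_ids:
    file_nm = genome_id + '.PATRIC.ffn'
    local_path = os.path.join('./sequences', file_nm)
    if not os.path.isfile(local_path):  # skip genomes already downloaded
        conn = ftplib.FTP('ftp.patricbrc.org')
        conn.login()  # anonymous login
        conn.cwd('/patric2/genomes/' + genome_id + '/')
        # The with block closes the local file once the transfer finishes
        with open(local_path, 'wb') as out_file:
            conn.retrbinary('RETR ' + file_nm, out_file.write)
        conn.quit()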

Now that we have the genomes, we will extract features for our classifier. We'll use k-mer counts as our features: a k-mer is a string of length k, and we count how many times each one occurs in the DNA. We will use a software package called Jellyfish to extract these counts from the sequences for us. Let's load in the counts.
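
To make the feature definition concrete, here is a minimal pure-Python sketch of k-mer counting on a single short sequence. It is only an illustration of the idea, not how Jellyfish works internally and not the pipeline used in this notebook; the function name `count_kmers` and the choice to skip windows containing non-ACGT characters are assumptions.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: ignore windows with ambiguous bases such as N
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

# "ATGATGA" contains the 3-mers ATG, TGA, GAT, ATG, TGA
counts = count_kmers("ATGATGA", 3)
print(counts)
```

For whole genomes a dedicated tool is the practical choice, since the number of possible k-mers grows as 4^k.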

In [22]:
dat_tetracycline.resistant_phenotype.unique()

array(['Resistant', 'Intermediate', 'Susceptible'], dtype=object)

Let's summarise our data. Lens does a ton of upfront computation, so this might take a few minutes.

In [53]:
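import lens

ls = lens.summarise(dat)  # compute the summary up front (the slow step)
le = lens.explore(ls)     # wrap the summary for interactive exploration
le.describe()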

0,1,2,3,4,5
,genome_id,genome_name,taxon_id,antibiotic,resistant_phenotype
desc,,,,categorical,categorical
dtype,object,object,object,object,object
notnulls,109601,109601,109601,109601,109601
nulls,0,0,0,0,0
unique,15471,15436,3027,106,6


In [55]:
dat.resistant_phenotype.unique()

array(['Susceptible', 'Resistant', 'Intermediate', 'Non-susceptible',
       'Not defined', 'RS'], dtype=object)

Now let's download the protein coding sequences for all the genomes. This is about 55 GB and takes hours to download.

In [None]:
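import ftplib
import os

os.makedirs('./sequences', exist_ok=True)  # make sure the target directory exists

for genome_id in dat.genome_id.unique():
    file_nm = genome_id + '.PATRIC.ffn'
    local_path = os.path.join('./sequences', file_nm)
    if not os.path.isfile(local_path):  # skip genomes already downloaded
        conn = ftplib.FTP('ftp.patricbrc.org')
        conn.login()  # anonymous login
        conn.cwd('/patric2/genomes/' + genome_id + '/')
        # The with block closes the local file once the transfer finishes
        with open(local_path, 'wb') as out_file:
            conn.retrbinary('RETR ' + file_nm, out_file.write)
        conn.quit()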