# Data Wrangling for Predicting Antibiotic Resistance in Gonorrhea

The relevant data is contained in four separate files:
1. A CSV file containing all strain samples and the minimum inhibitory concentration (MIC) of azithromycin, ciprofloxacin, and cefixime, among other antibiotics
2. Three space-separated files, one per antibiotic, each containing the most common unitigs among resistant samples:
<ul>
    <li>azithromycin (azm)</li>
    <li>ciprofloxacin (cip)</li>
    <li>cefixime (cfx)</li>
</ul>

## 1. Package Importing

In [1]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## 2. Data Collection

### 2.1 Data Loading
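I began by loading the MIC data and the unitig files. Only the azithromycin unitig file is loaded in this run; the loading cells for the other two are commented out.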

In [2]:
# Loading the MIC data
mic_data = pd.read_csv('../data/external/metadata.csv')
mic_data.head()

Unnamed: 0,Sample_ID,Year,Country,Continent,Beta.lactamase,Azithromycin,Ciprofloxacin,Ceftriaxone,Cefixime,Tetracycline,...,log2_cro_mic,log2_cfx_mic,log2_tet_mic,log2_pen_mic,azm_sr,cip_sr,cro_sr,cfx_sr,tet_sr,pen_sr
0,ERR1549286,2015.0,UK,Europe,,>256,,0.016,,,...,-5.965784,,,,1.0,,0.0,,,
1,ERR1549290,2015.0,UK,Europe,,>256,,0.004,,,...,-7.965784,,,,1.0,,0.0,,,
2,ERR1549291,2015.0,UK,Europe,,>256,,0.006,,,...,-7.380822,,,,1.0,,0.0,,,
3,ERR1549287,2015.0,UK,Europe,,>256,,0.006,,,...,-7.380822,,,,1.0,,0.0,,,
4,ERR1549288,2015.0,UK,Europe,,>256,,0.008,,,...,-6.965784,,,,1.0,,0.0,,,


In [26]:
# Loading azithromycin data (raw string so the regex \s is not read as a string escape)
unitigs_azm = pd.read_csv('../data/external/azm_sr_gwas_filtered_unitigs.Rtab', sep=r'\s', engine='python')
unitigs_azm.head()

Unnamed: 0,pattern_id,ERR1549286,ERR1549290,ERR1549291,ERR1549287,ERR1549288,ERR1549299,ERR1549292,ERR1549298,ERR1549296,...,ERR2172345,ERR2172346,ERR2172347,ERR2172348,ERR2172349,ERR2172350,ERR2172351,ERR2172352,ERR2172353,ERR2172354
0,CTTAACATATTTGCCTTTGATTTTTGAAGAAGCTGCCACGCCGGCAG,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,TACCGTAACCGGCAATGCGGATATTACGGTC,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,CAGACGGCATTTTTTTTGCGTTTTTCGGGAGG,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,AACGGGTTTTCAGACGGCATTCGATATCGGGACG,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,CCAAAAATTACCCGCGTTGACGTAGCTAAAGA,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
# Loading ciprofloxacin data
#unitigs_cip = pd.read_csv('../data/external/cip_sr_gwas_filtered_unitigs.Rtab', sep=r'\s', engine='python')
#unitigs_cip.head()

In [5]:
# Loading cefixime (cfx) data
#unitigs_cfx = pd.read_csv('../data/external/cfx_sr_gwas_filtered_unitigs.Rtab', sep=r'\s', engine='python')
#unitigs_cfx.head()
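
The three loading cells differ only in the drug code in the file name, so they could be folded into one helper. The following is a minimal sketch, not the notebook's actual code: `load_unitigs`, `load_all_unitigs`, `DATA_DIR`, and `DRUGS` are hypothetical names, and the file-name pattern is assumed from the three paths used above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical constants; adjust to the local directory layout.
DATA_DIR = Path('../data/external')
DRUGS = ['azm', 'cip', 'cfx']


def load_unitigs(path):
    """Read one whitespace-separated unitig presence/absence table."""
    return pd.read_csv(path, sep=r'\s', engine='python')


def load_all_unitigs(data_dir=DATA_DIR, drugs=DRUGS):
    """Return a dict mapping each drug code to its unitig DataFrame."""
    return {
        drug: load_unitigs(Path(data_dir) / f'{drug}_sr_gwas_filtered_unitigs.Rtab')
        for drug in drugs
    }
```

Calling `load_all_unitigs()` would hold all three tables in memory at once, which may be undesirable for files of this width; loading one drug at a time with `load_unitigs` keeps the footprint smaller.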

### 2.2 Data Joining

Each unitig file needs the samples' MIC data for the respective antibiotic.

In [6]:
# Print shape of each DataFrame
print(f"mic_data shape: {mic_data.shape}")
print(f"unitigs_azm shape: {unitigs_azm.shape}")
#print(f"unitigs_cip shape: {unitigs_cip.shape}")
#print(f"unitigs_cfx shape: {unitigs_cfx.shape}")


mic_data shape: (3786, 31)
unitigs_azm shape: (515, 3972)


The rows of 'mic_data' are indexed by sample ID, while the unitig file has the sample IDs as columns, so I will transpose the unitig table. Additionally, 'mic_data' has 3786 rows while 'unitigs_azm' has 3972 columns ('pattern_id' plus 3971 sample columns), so some samples might not have MIC data.

In [27]:
# Set index to 'pattern_id' and transpose
unitigs_azm_T = unitigs_azm.set_index('pattern_id').T

# Reset index and rename the new column to 'Sample_ID'
unitigs_azm_T.reset_index(inplace=True)
unitigs_azm_T = unitigs_azm_T.rename(columns = {'index':'Sample_ID'})
unitigs_azm_T.head()

pattern_id,Sample_ID,CTTAACATATTTGCCTTTGATTTTTGAAGAAGCTGCCACGCCGGCAG,TACCGTAACCGGCAATGCGGATATTACGGTC,CAGACGGCATTTTTTTTGCGTTTTTCGGGAGG,AACGGGTTTTCAGACGGCATTCGATATCGGGACG,CCAAAAATTACCCGCGTTGACGTAGCTAAAGA,CGGACCGGTATTCCGTCGAAATCACCGCCGTCAACCGCCCC,TGAAATTGTCCATCTCGTATGCCGTCTTCTGCTTG,"TACGGTATTGTCCGCATTATTAAACTCAAAACC,AGAAGACGGCATACGAGATGGACAATTTCATCC",GGCATTTTTTTTGCGTTTTTCGGGAGGGGGCGGC,...,ACCGATGAGTTCGCCGGAATCGGTACGATTGAC,CTGCTGGACAAAAAAGGGATTAAAGATATCACC,CGTTCCTTTCGGCGTATTCTCGCCGTTGCGCGGCG,TCACATTTCCGCTTCAGACGGCATCCGATATGA,GAAGCTGCCACGCCGGCAGAAGTGTTGTTTGCGGG,ACGCCGAAAGGAACGTGTATGCTGCCGCCCAACTGCG,ACTCGAATTTTGCAGGATTGGTATCAATGGCGATAATGCGACCGGCTTTGG,"ACCCGGCCCGGGCTGGCAGGCTACGGCTACACCGGTATCC,CACCTTAGGGAATCGTTCCCTTTGGGCCGGG,TACGCCGAAAGGAACGTGTATGCTGCCGCCC,GGGATTGTTGATTGTCGGACTGTTGTGCAACCTC",AGCCTGATTCACCAATGGTTTGTTCATAACAA,TTTTGAGCAGAAAGCAGTCAAAAACAGGGGGATTTTGCCCTTTTGACAGGTTCGAGTGCCG
0,ERR1549286,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
1,ERR1549290,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
2,ERR1549291,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
3,ERR1549287,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
4,ERR1549288,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1
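
Before merging, it can help to confirm that the transposed table has one row per sample and no duplicated sample IDs. The sketch below repeats the transpose step on a toy table; `transpose_unitigs` is a hypothetical helper, and the toy values are made up for illustration.

```python
import pandas as pd


def transpose_unitigs(unitigs):
    """Turn a unitig-by-sample table into a sample-by-unitig table."""
    transposed = unitigs.set_index('pattern_id').T
    transposed.columns.name = None  # drop the leftover 'pattern_id' axis label
    return transposed.rename_axis('Sample_ID').reset_index()


# Toy stand-in for a unitig table (values are invented).
toy = pd.DataFrame({
    'pattern_id': ['ACGT', 'TTGA'],
    'S1': [0, 1],
    'S2': [1, 1],
    'S3': [0, 0],
})
toy_T = transpose_unitigs(toy)

# One row per sample, one column per unitig plus 'Sample_ID'.
assert toy_T.shape == (toy.shape[1] - 1, toy.shape[0] + 1)
# Sample IDs should be unique for a clean one-to-one merge.
assert toy_T['Sample_ID'].is_unique
```

Applied to the real data, the same shape check would expect 3971 rows, since 'unitigs_azm' has 3972 columns including 'pattern_id'.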


As I will be merging on 'Sample_ID', I wanted to compare the sample IDs in the unitig data with those in the MIC data.

In [18]:
# Check if samples in 'mic_data' are in 'unitigs_azm_T'
print('samples from mic_data that are in unitigs_azm_T:')
print(mic_data['Sample_ID'].isin(unitigs_azm_T['Sample_ID']).value_counts())
# Check if samples in 'unitigs_azm_T' are in 'mic_data'
print('samples from unitigs_azm_T that are in mic_data:')
print(unitigs_azm_T['Sample_ID'].isin(mic_data['Sample_ID']).value_counts())

samples from mic_data that are in unitigs_azm_T:
True    3786
Name: Sample_ID, dtype: int64
samples from unitigs_azm_T that are in mic_data:
True     3786
False     185
Name: Sample_ID, dtype: int64


All 3786 samples from 'mic_data' are in 'unitigs_azm_T', but 185 of the 3971 samples in 'unitigs_azm_T' have no entry in 'mic_data'.
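
The counts above say how many samples lack MIC data but not which ones. A minimal sketch of listing them, shown on toy frames with invented IDs and values (`mic_toy` and `unitigs_toy` are hypothetical stand-ins for 'mic_data' and 'unitigs_azm_T'):

```python
import pandas as pd

# Toy stand-ins; the IDs and values are invented for illustration.
mic_toy = pd.DataFrame({'Sample_ID': ['S1', 'S2'],
                        'Azithromycin': ['>256', '0.5']})
unitigs_toy = pd.DataFrame({'Sample_ID': ['S1', 'S2', 'S3'],
                            'ACGT': [0, 1, 1]})

# Boolean mask: True where a unitig sample has no MIC record.
no_mic = ~unitigs_toy['Sample_ID'].isin(mic_toy['Sample_ID'])
missing_ids = unitigs_toy.loc[no_mic, 'Sample_ID'].tolist()
print(missing_ids)
```

On the real tables the same mask should select the 185 samples reported above.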

In [29]:
# Left-join the 'Azithromycin' column from 'mic_data' on 'Sample_ID'
unitigs_azm = unitigs_azm_T.merge(mic_data[['Sample_ID', 'Azithromycin']],
                                  how='left', on='Sample_ID')
unitigs_azm

Unnamed: 0,Sample_ID,CTTAACATATTTGCCTTTGATTTTTGAAGAAGCTGCCACGCCGGCAG,TACCGTAACCGGCAATGCGGATATTACGGTC,CAGACGGCATTTTTTTTGCGTTTTTCGGGAGG,AACGGGTTTTCAGACGGCATTCGATATCGGGACG,CCAAAAATTACCCGCGTTGACGTAGCTAAAGA,CGGACCGGTATTCCGTCGAAATCACCGCCGTCAACCGCCCC,TGAAATTGTCCATCTCGTATGCCGTCTTCTGCTTG,"TACGGTATTGTCCGCATTATTAAACTCAAAACC,AGAAGACGGCATACGAGATGGACAATTTCATCC",GGCATTTTTTTTGCGTTTTTCGGGAGGGGGCGGC,...,CTGCTGGACAAAAAAGGGATTAAAGATATCACC,CGTTCCTTTCGGCGTATTCTCGCCGTTGCGCGGCG,TCACATTTCCGCTTCAGACGGCATCCGATATGA,GAAGCTGCCACGCCGGCAGAAGTGTTGTTTGCGGG,ACGCCGAAAGGAACGTGTATGCTGCCGCCCAACTGCG,ACTCGAATTTTGCAGGATTGGTATCAATGGCGATAATGCGACCGGCTTTGG,"ACCCGGCCCGGGCTGGCAGGCTACGGCTACACCGGTATCC,CACCTTAGGGAATCGTTCCCTTTGGGCCGGG,TACGCCGAAAGGAACGTGTATGCTGCCGCCC,GGGATTGTTGATTGTCGGACTGTTGTGCAACCTC",AGCCTGATTCACCAATGGTTTGTTCATAACAA,TTTTGAGCAGAAAGCAGTCAAAAACAGGGGGATTTTGCCCTTTTGACAGGTTCGAGTGCCG,Azithromycin
0,ERR1549286,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
1,ERR1549290,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
2,ERR1549291,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
3,ERR1549287,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
4,ERR1549288,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3966,ERR2172350,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
3967,ERR2172351,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
3968,ERR2172352,1,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
3969,ERR2172353,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,>256
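
Because this is a left join, samples without MIC data stay in the result with a missing 'Azithromycin' value. One way to make that explicit is the `indicator` and `validate` options of `DataFrame.merge`. This is a sketch on toy frames with invented values, not the notebook's own step, and whether unlabelled rows should be dropped is an assumption left to the later analysis.

```python
import pandas as pd

# Toy stand-ins; the IDs and values are invented for illustration.
mic_toy = pd.DataFrame({'Sample_ID': ['S1', 'S2'],
                        'Azithromycin': ['>256', '0.5']})
unitigs_toy = pd.DataFrame({'Sample_ID': ['S1', 'S2', 'S3'],
                            'ACGT': [0, 1, 1]})

merged = unitigs_toy.merge(
    mic_toy[['Sample_ID', 'Azithromycin']],
    how='left',
    on='Sample_ID',
    validate='one_to_one',  # raise if a Sample_ID is duplicated on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Rows that came only from the unitig table have no MIC value.
n_unlabelled = (merged['_merge'] == 'left_only').sum()

# Optionally keep only the samples that have a MIC record.
labelled = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

On the real data, `n_unlabelled` would be expected to match the 185 samples that are absent from 'mic_data'.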
