# Resistome Analysis

This guided notebook teaches you how to analyze resistome data. The data are a collection of samples from the MetaSUB consortium, collected from subways in New York and London. The notebook is intended for advanced users.

You will work with several tables: Antimicrobial Resistance (AMR) abundance tables, a metadata table describing each sample, a table with the taxonomy of each sample, and a table with the geographic coordinates of major hospitals in NYC and London.

In this notebook you will learn to do the following:
- correlate taxa with AMR
- match AMR to hospitals
- compare AMR to sample metadata


Throughout this notebook there are some questions you will need to consider:
- Are you interested in all types of AMR or just a few? If you are interested in all types of AMR, how will you represent them?
- The AMR data here is already normalized for read depth and gene size (units of RPKM). Should you do any other sort of normalization?
- Many samples have no detected AMRs of a given type. How will you handle this?
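The questions above can be made concrete with a small sketch. It is only one set of options, not a recommended pipeline: `toy_amr` is a made-up stand-in for a table shaped like `amr_class` (samples as rows, AMR classes as columns, values in RPKM), and the three representations shown (a per-sample total, a presence/absence table, and a `log1p` transform that leaves zeros at zero) are assumptions about what might be useful.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for amr_class: rows are samples, columns are AMR classes (RPKM).
toy_amr = pd.DataFrame(
    {
        "Elfamycins": [2.04, 0.0, 0.0, 0.0],
        "Rifampin": [0.51, 0.0, 0.0, 0.0],
        "betalactams": [0.12, 0.0, 0.0, 0.004],
        "Fosfomycin": [0.0, 0.0, 0.0, 0.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Option 1: collapse all AMR classes into one total per sample.
total_amr = toy_amr.sum(axis=1)

# Option 2: ignore abundance and keep only presence/absence.
present = toy_amr > 0

# Option 3: compress the long right tail; log1p maps 0 to 0,
# so samples with no detected AMR stay at zero.
log_amr = np.log1p(toy_amr)

# Classes never detected in any sample carry no information and can be dropped.
detected = toy_amr.loc[:, present.any(axis=0)]
```

Which representation fits depends on the task: totals and presence/absence are easy to compare against metadata, while per-class log values keep more detail for correlation with taxa.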

In this notebook you will use both Python and R. In Python you will work with Pandas DataFrames, which are documented here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [1]:
# Setup Python

%load_ext rpy2.ipython
import pandas as pd

taxa = pd.read_csv('/srv/data/shared-data/nyclondon_taxa.csv', index_col=0)
amr_class = pd.read_csv('/srv/data/shared-data/nyclondon_amr_class.csv', index_col=0)
amr_gene = pd.read_csv('/srv/data/shared-data/nyclondon_amr_gene.csv', index_col=0)
amr_mechanism = pd.read_csv('/srv/data/shared-data/nyclondon_amr_mech.csv', index_col=0)
metadata = pd.read_csv('/srv/data/shared-data/nyclondon_metadata.csv', index_col=0)
hospitals = pd.read_csv('/srv/data/shared-data/nyclondon_hospitals.csv', index_col=0)

In [9]:
%%R

# Setup R

library(ggplot2)
library(reshape2)

In [13]:
amr_class.iloc[0:5,]

Unnamed: 0,Aminocoumarins,Aminoglycosides,Bacitracin,Cationic antimicrobial peptides,Elfamycins,Fluoroquinolones,Fosfomycin,Fusidic acid,Glycopeptides,Lipopeptides,MLS,Multi-drug resistance,Phenicol,Rifampin,Sulfonamides,Tetracyclines,Trimethoprim,Tunicamycin,betalactams
haib17CEM4890_H75CGCCXY_SL263647,0.0,0.0,0.0,0.0,2.039541,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.512658,0.0,0.0,0.0,0.0,0.121487
haib17CEM4890_H75CGCCXY_SL263659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263671,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.023336,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004233


In [14]:
metadata.iloc[0:5,]

Unnamed: 0,city,latitude,longitude,surface_material,coastal_city,city_population,city_density,ave_june_temp,city_elevation,continent
haib17CEM4890_H7KYMCCXY_SL273052,berlin,52.50842,13.377179,metal,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273064,berlin,52.498003,13.362799,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273076,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273088,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273100,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe


In [5]:
'''
Preprocess your data. 

Use this section to decide how you will preprocess your data.
You may want to preprocess data differently for different tasks.
'''

pass

In [6]:
'''
Compare taxa to AMRs. 

What kind of AMR data most directly relates to taxa?
Do any AMRs seem to be more common with certain taxa?
Does AMR abundance relate to any features like diversity?
'''

pass
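One possible starting point for this cell is a rank correlation between every taxon and every AMR class. The sketch below is an assumption-laden illustration, not the intended solution: `toy_taxa` and `toy_amr` are made-up tables, the column names are hypothetical, and Spearman correlation is computed as the Pearson correlation of ranks so that only pandas is needed.

```python
import pandas as pd

# Made-up tables; the real ones would be taxa and one of the amr_* tables.
toy_taxa = pd.DataFrame(
    {"taxon_a": [0.1, 0.4, 0.2, 0.8], "taxon_b": [0.9, 0.6, 0.8, 0.2]},
    index=["s1", "s2", "s3", "s4"],
)
toy_amr = pd.DataFrame(
    {"class_x": [1.0, 3.0, 2.0, 5.0, 4.0]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Keep only samples present in both tables, in the same order.
shared = toy_taxa.index.intersection(toy_amr.index)
taxa_ranks = toy_taxa.loc[shared].rank()
amr_ranks = toy_amr.loc[shared].rank()

# Pearson correlation of ranks is the Spearman correlation.
# Rows of the result are taxa, columns are AMR classes.
spearman = pd.DataFrame(
    {amr_col: taxa_ranks.corrwith(amr_ranks[amr_col]) for amr_col in amr_ranks.columns}
)
```

With many taxa and many AMR classes this produces a large matrix of correlations, so some form of multiple-testing correction or filtering would be needed before reading anything into individual values.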

In [3]:
'''
Compare AMR abundance to hospitals.

Do AMRs become more or less abundant near hospitals?
Does the type of AMR matter?
If you find a correlation is it driven by taxa or some other metadata feature?

A geographic distance function is provided for you below.
'''

from math import radians, sin, cos, atan2, sqrt


def geodist(coord1, coord2):
    """Return the great-circle distance in km between two (lat, lon) pairs in degrees."""
    EARTH_RADIUS_KM = 6373.0
    # Convert everything to radians before calling any trig function.
    lat1, lon1 = (radians(v) for v in coord1)
    lat2, lon2 = (radians(v) for v in coord2)
    dlon = lon2 - lon1
    # Central angle via the atan2 form, which stays numerically stable
    # for both very small and near-antipodal distances.
    y = sqrt(
        (cos(lat2) * sin(dlon)) ** 2
        + (cos(lat1) * sin(lat2) - sin(lat1) * cos(lat2) * cos(dlon)) ** 2
    )
    x = sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(dlon)
    c = atan2(y, x)
    return EARTH_RADIUS_KM * c
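A distance function becomes useful once each sample is paired with its nearest hospital. The sketch below shows one way that could look. It makes several assumptions: the coordinates are made up, the hospital table is assumed to have `latitude` and `longitude` columns like the metadata table, and `haversine_km` and `nearest_hospital_km` are hypothetical helpers defined here only so that the example runs on its own.

```python
from math import radians, sin, cos, asin, sqrt

import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km; all inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# Made-up coordinates standing in for the metadata and hospitals tables.
toy_samples = pd.DataFrame(
    {"latitude": [40.75, 51.50], "longitude": [-73.99, -0.12]},
    index=["nyc_sample", "london_sample"],
)
toy_hospitals = pd.DataFrame(
    {"latitude": [40.74, 51.52], "longitude": [-73.98, -0.10]},
    index=["hospital_a", "hospital_b"],
)


def nearest_hospital_km(sample_row, hospitals):
    """Distance from one sample to the closest hospital in the table."""
    dists = hospitals.apply(
        lambda h: haversine_km(
            sample_row["latitude"], sample_row["longitude"],
            h["latitude"], h["longitude"],
        ),
        axis=1,
    )
    return dists.min()


toy_samples["nearest_hospital_km"] = toy_samples.apply(
    nearest_hospital_km, axis=1, hospitals=toy_hospitals
)
```

The resulting column can then be compared against AMR abundance per sample, for example with a rank correlation or by binning samples into near and far groups.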


In [7]:
'''
Compare AMR to sample metadata.

Does AMR occur more on certain kinds of surfaces?
Do AMRs occur in clusters?
'''

pass
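For the surface question, a grouped summary is one simple place to begin. The following is a minimal sketch under stated assumptions: the values are made up, `total_amr` is a hypothetical per-sample total, and `surface_material` is the column name shown in the metadata preview above.

```python
import pandas as pd

sample_ids = ["s1", "s2", "s3", "s4", "s5"]

# Made-up per-sample AMR totals (RPKM) and surface annotations.
toy_total = pd.Series([0.0, 2.5, 0.3, 0.0, 1.2], index=sample_ids, name="total_amr")
toy_meta = pd.DataFrame(
    {"surface_material": ["metal", "metal", "plastic", "plastic", "plastic"]},
    index=sample_ids,
)

# Align AMR values with metadata on the sample index.
joined = toy_meta.join(toy_total, how="inner")

# Summarize each surface: sample count, median abundance,
# and the fraction of samples with any detected AMR.
summary = joined.groupby("surface_material")["total_amr"].agg(
    n_samples="count",
    median_rpkm="median",
    frac_detected=lambda v: (v > 0).mean(),
)
```

Reporting the fraction of samples with any detection alongside the median is one way to keep the many zero-valued samples visible instead of letting them hide inside an average.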