# Taxonomic Analysis

This guided notebook teaches you how to analyze taxonomic data. The data are a collection of samples from the MetaSUB consortium, collected from subways around the world and sequenced to identify microbial and trace DNA.

You will be working with two tables: a table of metadata and a table of read counts assigned to different taxa.

In this notebook you will learn to do the following:
- normalize samples by their read count
- calculate the alpha diversity of a sample
- identify the most abundant and prevalent taxa
- compare taxonomic and geographic distance
- find correlations between taxa
- plot a PCoA of samples

In this notebook you will use both Python and R. You will work with pandas DataFrames; their documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [8]:
# Setup Python

%load_ext rpy2.ipython
import pandas as pd

taxa = pd.read_csv('/srv/data/shared-data/krakenhll_species.csv', index_col=0)
metadata = pd.read_csv('/srv/data/shared-data/metadata.csv', index_col=0)

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [2]:
%%R

# Setup R

library(ggplot2)
library(reshape2)

In [5]:
taxa.iloc[0:10,0:10]

Unnamed: 0,Acanthamoeba polyphaga mimivirus,Acetoanaerobium sticklandii,Acetobacter aceti,Acetobacter pasteurianus,Acetobacter persici,Acetobacter pomorum,Acetobacter senegalensis,Acetobacter sp. SLV-7,Acetobacter subgen. Acetobacter,Acetobacter tropicalis
haib17CEM4890_H75CGCCXY_SL263647,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263659,0.0,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263671,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263683,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263695,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263707,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263719,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H75CGCCXY_SL263731,0.0,0.0,117.0,159.0,61.0,6.0,198.0,0.0,117.0,157.0
haib17CEM4890_H7KYMCCXY_SL273041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
haib17CEM4890_H7KYMCCXY_SL273052,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0


In [4]:
metadata.iloc[0:5,]

Unnamed: 0,city,latitude,longitude,surface_material,coastal_city,city_population,city_density,ave_june_temp,city_elevation,continent
haib17CEM4890_H7KYMCCXY_SL273052,berlin,52.50842,13.377179,metal,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273064,berlin,52.498003,13.362799,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273076,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273088,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe
haib17CEM4890_H7KYMCCXY_SL273100,berlin,52.50887,13.322773,plastic,no,3711930,4200,19.4,34.0,europe


In [6]:
'''
Normalize taxa by read counts.

Before we can use the taxa table we need to normalize it.
There are many ways to do this but the simplest is
to divide each row by its sum. This makes each
row sum to 1 and converts it into a
'compositional' vector.

Try to do this in 1-2 lines of code.
'''

4+ 2

6
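
One possible way to do the row normalization described above, sketched on a toy table rather than the real data. The names `toy_taxa` and `toy_normalized` are made up for illustration, and filling all-zero samples with 0 is an assumption; you may prefer to drop such samples instead.

```python
import pandas as pd

# Toy stand-in for the `taxa` table: rows are samples, columns are taxa.
toy_taxa = pd.DataFrame(
    {"taxon_a": [2.0, 0.0, 5.0], "taxon_b": [6.0, 0.0, 5.0]},
    index=["sample_1", "sample_2", "sample_3"],
)

# Divide every row by its own sum; axis=0 aligns the row sums with the index.
row_sums = toy_taxa.sum(axis=1)
toy_normalized = toy_taxa.div(row_sums, axis=0)

# A sample with no reads has a row sum of 0, so its row becomes NaN (0 / 0).
# Replacing those values with 0 is one option; dropping the sample is another.
toy_normalized = toy_normalized.fillna(0.0)
```

On the real data the same two steps would apply to `taxa` itself, which keeps the exercise within the suggested 1-2 lines.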

In [16]:
'''
Identify mean abundance and prevalence for each taxon.

Mean abundance is the average proportion of a taxon across samples.
Put it into a variable called `mean_abundance`

Prevalence is the fraction of samples where each taxon is found.
Put it into a variable called `prevalence`

Try to do this in 2-4 lines of code.
'''

mean_abundance = None
prevalence = None
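
A minimal sketch of both quantities on a toy normalized table. The `toy_` names are illustrative only, and treating any proportion above 0 as "found" is an assumption; a small threshold may be more appropriate for real data.

```python
import pandas as pd

# Toy normalized table: rows are samples (each sums to 1), columns are taxa.
toy_normalized = pd.DataFrame(
    {"taxon_a": [0.5, 0.0, 1.0, 0.0], "taxon_b": [0.5, 1.0, 0.0, 1.0]},
    index=["s1", "s2", "s3", "s4"],
)

# Mean abundance: average proportion of each taxon across all samples.
toy_mean_abundance = toy_normalized.mean(axis=0)

# Prevalence: fraction of samples in which each taxon has a nonzero proportion.
# The comparison gives a boolean table, and the mean of booleans is a fraction.
toy_prevalence = (toy_normalized > 0).mean(axis=0)
```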

In [7]:
%%R -i mean_abundance -i prevalence

# Plot abundance and prevalence

log(4)

NameError: name 'mean_abundance' is not defined

In [1]:
'''
Find the alpha diversity of each sample.

Alpha diversity, the diversity within a single sample, is a
core concept in microbiome research. Fill in the code below to
calculate species richness and Shannon entropy for each sample.
'''

from math import log

def species_richness(sample, zero_thresh=0.0001):
    '''Calculate the number of species above a given abundance.'''
    pass

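def shannon_entropy(sample):
    '''Calculate the Shannon entropy of a normalized sample.'''
    pass

# Samples are rows, so apply each function along axis=1.
# `taxa` should already be normalized so that each row sums to 1.
shannon_entropy_vector = taxa.apply(shannon_entropy, axis=1)
species_richness_vector = taxa.apply(species_richness, axis=1)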

NameError: name 'taxa' is not defined
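
One way the two stubs above could be filled in, shown on a toy table. The `_sketch` function names and the toy data are hypothetical, and the natural logarithm is an assumed choice of base; base 2 is also common.

```python
from math import log

import pandas as pd


def species_richness_sketch(sample, zero_thresh=0.0001):
    """Count the taxa whose proportion is above zero_thresh."""
    return int((sample > zero_thresh).sum())


def shannon_entropy_sketch(sample):
    """Shannon entropy (natural log) of a sample whose proportions sum to 1."""
    # Zero proportions are skipped, since p * log(p) tends to 0 as p tends to 0.
    return -sum(p * log(p) for p in sample if p > 0)


toy_normalized = pd.DataFrame(
    {"taxon_a": [0.5, 1.0], "taxon_b": [0.5, 0.0]},
    index=["even_sample", "single_taxon_sample"],
)

# axis=1 passes each row (one sample) to the function.
toy_richness = toy_normalized.apply(species_richness_sketch, axis=1)
toy_entropy = toy_normalized.apply(shannon_entropy_sketch, axis=1)
```

A sample split evenly between two taxa has entropy log(2), while a sample dominated by a single taxon has entropy 0, which is a quick sanity check for any implementation.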

In [None]:
%%R -i shannon_entropy_vector -i species_richness_vector

# Compare shannon entropy to species richness

In [None]:
'''
Now we'll try something a little more complex, 
comparing the geographic distance between samples 
with taxonomic distance.

First we need to compute the distances. Geographic distance
has a standard definition (great-circle distance), but there
are many ways to compute taxonomic distance. Look up the Jaccard
index and Jensen-Shannon divergence as places to start.

There is some code below to get you started.
'''

from math import radians, sin, cos, atan2, sqrt
from scipy.spatial.distance import pdist

def geographic_distance(s1, s2):
    """Compute the geographic distance between two samples."""
    RADIUS_OF_EARTH_KM = 6373.0
    pass


# make a table with just latitude and longitude, in the same row order as taxa
# (assumes every sample in taxa also has a row in metadata)
coords = metadata.loc[taxa.index, ['latitude', 'longitude']]
geo_dist_vector = pdist(coords, metric=geographic_distance)

def taxonomic_distance(t1, t2):
    """Compute a taxonomic distance between two samples."""
    pass

taxa_dist_vector = pdist(taxa, metric=taxonomic_distance)
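
A sketch of how the distance stubs above might be completed, using toy arrays. The haversine formula is assumed for geographic distance because the imported functions suggest it. `haversine_km`, `jaccard_distance`, and the `toy_` arrays are illustrative names. SciPy's `jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence.

```python
from math import atan2, cos, radians, sin, sqrt

import numpy as np
from scipy.spatial.distance import jensenshannon, pdist

RADIUS_OF_EARTH_KM = 6373.0  # same approximate radius as in the cell above


def haversine_km(s1, s2):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (s1[0], s1[1], s2[0], s2[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return RADIUS_OF_EARTH_KM * 2 * atan2(sqrt(a), sqrt(1 - a))


def jaccard_distance(t1, t2):
    """1 - (taxa shared by both samples / taxa present in either sample)."""
    present1, present2 = t1 > 0, t2 > 0
    union = np.logical_or(present1, present2).sum()
    if union == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - np.logical_and(present1, present2).sum() / union


# Toy coordinates in degrees: (latitude, longitude).
toy_coords = np.array([[0.0, 0.0], [0.0, 90.0], [0.0, 0.0]])
toy_geo_dist = pdist(toy_coords, metric=haversine_km)

# Toy normalized samples (rows) over three taxa (columns).
toy_normalized = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.5, 0.0]])
toy_jaccard = pdist(toy_normalized, metric=jaccard_distance)
toy_js = pdist(toy_normalized, metric=jensenshannon)
```

`pdist` returns one value per pair of rows, ordered (0, 1), (0, 2), (1, 2), so the geographic and taxonomic vectors line up element by element as long as both inputs list the samples in the same order.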

In [17]:
%%R -i taxa_dist_vector -i geo_dist_vector

# plot taxonomic distance vs geographic distance.
# What happens if you filter small distances? big distances?

NameError: name 'taxa_dist_vector' is not defined

## Correlations Between Taxa

Figure out which taxa occur in the same samples.
Some questions to consider:
- do you care about abundance or just the presence of taxa?
- how often do taxa need to co-occur to be interesting?
- do you want to separate samples by some feature in the metadata?

You can do this in Python or R. Plot your results.
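
As a starting point, here is a sketch of two ways to look at co-occurrence on a toy normalized table: counting shared samples from presence/absence, and rank-correlating abundances. The `toy_` names are illustrative, and Spearman correlation is only one assumed choice among several reasonable measures.

```python
import pandas as pd

# Toy normalized table: rows are samples, columns are taxa.
toy_normalized = pd.DataFrame(
    {
        "taxon_a": [0.5, 0.4, 0.0, 0.0],
        "taxon_b": [0.5, 0.6, 0.0, 0.2],
        "taxon_c": [0.0, 0.0, 1.0, 0.8],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Presence/absence view: 1 where a taxon was observed in a sample, else 0.
toy_presence = (toy_normalized > 0).astype(int)

# Entry (i, j) counts the samples that contain both taxon i and taxon j.
# The diagonal holds the number of samples containing each taxon.
toy_cooccurrence = toy_presence.T.dot(toy_presence)

# Abundance-based alternative: rank correlation between taxon columns.
toy_correlation = toy_normalized.corr(method="spearman")
```

With thousands of taxa the full taxon-by-taxon matrix gets large, so filtering to prevalent taxa first may be worth considering.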


## PCoA of Samples

Run PCoA (Principal Coordinates Analysis) on your samples: https://www.sequentix.de/gelquest/help/principal_coordinates_analysis.htm

PCoA is related to Principal Component Analysis (PCA), but PCA is not well suited to compositional
data like ours. Instead of using the data directly, PCoA uses the distances between samples. You can
use the distance matrices you built above or write something new.
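
If you reuse the output of `pdist`, note that it is a condensed vector rather than a square matrix. Many PCoA functions expect a square sample-by-sample distance matrix, though you should check the documentation of whichever library you choose. The sketch below shows only the conversion, on toy data with illustrative `toy_` names; it does not run PCoA itself.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy normalized samples (rows) over three taxa (columns).
toy_normalized = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.3, 0.5]])

# pdist returns a condensed vector with one entry per pair of samples.
toy_condensed = pdist(toy_normalized, metric="jensenshannon")

# squareform expands it into a symmetric matrix with zeros on the diagonal.
toy_square = squareform(toy_condensed)
```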

Both Python and R have libraries for PCoA. You can use either.
Once you've run your PCoA, make two plots. First, plot the variance
explained by each axis. Second, plot the first few axes, colored or
labeled by some metadata.

Do not try to write your own code to run PCoA!