# Functional Analysis

This is a guided notebook to teach you how to analyze functional data. The data used here is a collection of samples from the MetaSUB consortium. These are samples collected from subways around the world and sequenced to identify microbial and trace DNA. 

You will mainly be working with three tables: a table of metadata, a table of gene abundances, and a table of pathway abundances. You will also have a table of taxonomy and a list of estimated average genome sizes (AGS).

In this notebook you will learn to do the following:
- normalize functional data by AGS
- select high variance pathways
- filter pathways by coverage
- connect pathways to different metadata conditions

In this notebook you will use both Python and R. You will work with pandas DataFrames; their documentation is here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [3]:
# Setup Python

%load_ext rpy2.ipython
import pandas as pd

taxa = pd.read_csv('/srv/data/shared-data/krakenhll_species.csv', index_col=0)
metadata = pd.read_csv('/srv/data/shared-data/metadata.csv', index_col=0)
pathway_abundances = pd.read_csv('/srv/data/shared-data/pathway_abundances.csv', index_col=0)
pathway_coverages = pd.read_csv('/srv/data/shared-data/pathway_coverages.csv', index_col=0)
ags = pd.read_csv('/srv/data/shared-data/ags.csv')




In [2]:
%%R

# Setup R

library(ggplot2)
library(reshape2)

In [4]:
'''
Normalize by Average Genome Size.

For certain applications it can be desirable to normalize
pathway abundances by the estimated average genome size. This
gives you a number that represents the average fraction of
genome space dedicated to a particular function.

AGS normalization tends to produce more even coverage plots.
However, many people choose not to use it, and you may
choose not to as well.

AGS is typically estimated by aligning reads to a set of
Universal Single Copy Genes. You have been given a table 
of AGS estimates from MicrobeCensus.

Try to do this in 2-3 lines of code.
'''

pass
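
The normalization described above can be sketched on toy data. This is only an illustration under stated assumptions: the table orientation (samples as rows, pathways as columns), the column names `sample` and `average_genome_size`, and the choice to divide by AGS are all hypothetical and may not match the real `pathway_abundances` and `ags` tables.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values and column names).
# Assumption: rows are samples, columns are pathways.
toy_abundances = pd.DataFrame(
    {"pathway_A": [10.0, 40.0, 30.0], "pathway_B": [5.0, 20.0, 60.0]},
    index=["sample_1", "sample_2", "sample_3"],
)
toy_ags = pd.DataFrame(
    {
        "sample": ["sample_1", "sample_2", "sample_3"],
        "average_genome_size": [2.0e6, 4.0e6, 3.0e6],
    }
)

# Index the AGS estimates by sample name so they align with the abundance rows.
ags_by_sample = toy_ags.set_index("sample")["average_genome_size"]

# Divide every row (sample) by that sample's AGS estimate.
toy_normalized = toy_abundances.div(ags_by_sample, axis=0)
print(toy_normalized)
```

Using `div(..., axis=0)` aligns on the sample index, so a sample missing from the AGS table would produce NaN values rather than a silent mismatch.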

In [5]:
'''
Filter low coverage pathways.

Pathways have two key properties: abundance and coverage.
Abundance measures the total number of reads that mapped to
a pathway; coverage measures the fraction of the pathway
that has reads.

Even high abundance pathways may not be 100% covered but we
would like to remove pathways that are very poorly covered.
Set a dropout threshold and use pathway_coverages to 'drop out'
low coverage pathways.

Try to do this in 2-4 lines of code.
'''

MINIMUM_COVERAGE = 0.5
pass
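
There is more than one reasonable way to read "drop out" here, so the sketch below shows two on toy data. The table layout (samples as rows, pathways as columns) and the use of mean coverage in the second option are assumptions, not something the notebook specifies.

```python
import pandas as pd

MINIMUM_COVERAGE = 0.5

# Toy tables with matching shape (hypothetical values).
toy_abundances = pd.DataFrame(
    {"pathway_A": [10.0, 40.0], "pathway_B": [5.0, 20.0], "pathway_C": [1.0, 2.0]},
    index=["sample_1", "sample_2"],
)
toy_coverages = pd.DataFrame(
    {"pathway_A": [0.9, 0.8], "pathway_B": [0.6, 0.2], "pathway_C": [0.1, 0.3]},
    index=["sample_1", "sample_2"],
)

# Option 1: zero out individual (sample, pathway) entries with low coverage.
masked = toy_abundances.where(toy_coverages >= MINIMUM_COVERAGE, other=0.0)

# Option 2: drop whole pathways whose mean coverage is below the threshold.
well_covered = toy_coverages.mean(axis=0) >= MINIMUM_COVERAGE
filtered = toy_abundances.loc[:, well_covered]

print(masked)
print(filtered)
```

Option 1 keeps the table shape, which can be convenient for plotting; option 2 shrinks the table, which reduces the number of downstream tests.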

In [6]:
%%R -i pathway_abundances -i pathway_coverages -i MINIMUM_COVERAGE

# Plot (unfiltered) pathway abundance vs coverage
# mark where you drop out low coverage pathways


In [None]:
'''
Filter low-variance pathways.

Some pathways represent housekeeping functions and
do not vary much in abundance. These pathways usually 
aren't interesting so we'd like to remove them to avoid 
cluttering our downstream analyses.

Find the variance of each pathway and remove pathways
that are too stable.
'''

pass
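
A minimal sketch of a variance filter on toy data follows. The cutoff `MINIMUM_VARIANCE` is a hypothetical name and value chosen for illustration, and the layout (samples as rows, pathways as columns) is an assumption; on real data a quantile-based cutoff may be a more natural choice.

```python
import pandas as pd

# Toy table: one stable "housekeeping" pathway and one variable pathway.
toy_abundances = pd.DataFrame(
    {
        "housekeeping": [10.0, 10.1, 9.9, 10.0],
        "variable": [1.0, 20.0, 5.0, 40.0],
    },
    index=["sample_1", "sample_2", "sample_3", "sample_4"],
)

MINIMUM_VARIANCE = 1.0  # hypothetical threshold, tune for real data

# Variance of each pathway (column) across samples.
pathway_variance = toy_abundances.var(axis=0)

# Keep only the pathways that vary more than the threshold.
high_variance = toy_abundances.loc[:, pathway_variance > MINIMUM_VARIANCE]
print(high_variance.columns.tolist())
```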

## Statistical Testing of Pathways

Figure out which pathways vary significantly based on metadata.

To do this you will need to pick some metadata feature of interest.
Choose two (or more) groups of samples based on your metadata feature
and determine whether any pathways are statistically enriched
between the groups.

You may use any statistical test you like, but the Mann-Whitney U test
(https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) is a good choice.

Once you have your p-values, find the abundance of each gene in each sample
as a ratio of the average abundance of that gene. Use this, along with your
p-values, to make a volcano plot (https://en.wikipedia.org/wiki/Volcano_plot_(statistics)).

You can do this in Python or R. Plot your results.
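
One way to assemble the numbers behind a volcano plot is sketched below on toy data, using `scipy.stats.mannwhitneyu`. The group labels, table layout, and column names are hypothetical. Note one deliberate simplification: the effect size here is the log2 ratio of the two group means, a common choice for volcano plots, rather than the per-sample ratio to the overall average described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Toy abundance table: rows are samples, columns are pathways (assumed layout).
toy_abundances = pd.DataFrame(
    {
        "pathway_A": [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0],
        "pathway_B": [5.0, 6.0, 7.0, 8.0, 5.5, 6.5, 7.5, 8.5],
    },
    index=[f"sample_{i}" for i in range(1, 9)],
)
# Hypothetical metadata feature splitting the samples into two groups.
toy_groups = pd.Series(["indoor"] * 4 + ["outdoor"] * 4, index=toy_abundances.index)

group_1 = toy_abundances.loc[toy_groups == "indoor"]
group_2 = toy_abundances.loc[toy_groups == "outdoor"]

rows = []
for pathway in toy_abundances.columns:
    # Two-sided Mann-Whitney U test between the two groups.
    _, p_value = mannwhitneyu(
        group_1[pathway], group_2[pathway], alternative="two-sided"
    )
    # Effect size: log2 ratio of group means (an assumption, see lead-in).
    log2_fc = np.log2(group_2[pathway].mean() / group_1[pathway].mean())
    rows.append({"pathway": pathway, "p_value": p_value, "log2_fold_change": log2_fc})

volcano = pd.DataFrame(rows).set_index("pathway")
# The y-axis of a volcano plot is conventionally -log10(p).
volcano["neg_log10_p"] = -np.log10(volcano["p_value"])
print(volcano)
```

Plotting `log2_fold_change` against `neg_log10_p` gives the volcano shape; the resulting table can also be passed to an R cell with `-i volcano` for plotting with ggplot2.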


## PCoA of Samples and Genes

Run PCoA on your samples (https://www.sequentix.de/gelquest/help/principal_coordinates_analysis.htm).

PCoA is related to Principal Component Analysis (PCA), but PCA is not suitable for compositional
data like ours. Instead of using the data directly, PCoA uses the distances between samples. You can
use the distance matrices you built above or write something new.
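
If you need a fresh distance matrix, the sketch below builds one on toy data with `scipy.spatial.distance`. Bray-Curtis dissimilarity is an assumed choice here, not one the notebook prescribes, and the table layout (samples as rows) is likewise an assumption. The square matrix it produces is the kind of input a library PCoA routine, such as the `pcoa` function in scikit-bio, would typically accept.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy abundance table: rows are samples, columns are pathways (assumed layout).
toy_abundances = pd.DataFrame(
    {
        "pathway_A": [1.0, 1.0, 0.0],
        "pathway_B": [1.0, 1.0, 4.0],
        "pathway_C": [2.0, 2.0, 0.0],
    },
    index=["sample_1", "sample_2", "sample_3"],
)

# Pairwise Bray-Curtis dissimilarity between rows (samples).
condensed = pdist(toy_abundances.values, metric="braycurtis")
sample_distances = pd.DataFrame(
    squareform(condensed),
    index=toy_abundances.index,
    columns=toy_abundances.index,
)

# Transposing the table gives distances between features instead of samples.
feature_distances = pd.DataFrame(
    squareform(pdist(toy_abundances.T.values, metric="braycurtis")),
    index=toy_abundances.columns,
    columns=toy_abundances.columns,
)

print(sample_distances)
```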

Both Python and R contain libraries for PCoA. You can use either.
Once you've run your PCoA make two plots. First, plot the variance
represented by each axis. Second, plot the first few axes with
some metadata.

Run PCoA twice: once for samples and once for genes.

Do not try to write your own code to run PCoA!