# Calour microbiome databases interface tutorial

## Setup

In [1]:
import calour as ca


In [2]:
ca.set_log_level(11)

In [3]:
%matplotlib notebook

## Load the data
We will use the chronic fatigue syndrome data from:

Giloteaux, L., Goodrich, J.K., Walters, W.A., Levine, S.M., Ley, R.E. and Hanson, M.R., 2016.

Reduced diversity and altered composition of the gut microbiome in individuals with myalgic encephalomyelitis/chronic fatigue syndrome.

Microbiome, 4(1), p.30.

In [4]:
cfs=ca.read_amplicon('data/chronic-fatigue-syndrome.biom',
                     'data/chronic-fatigue-syndrome.sample.txt',
                     normalize=10000,min_reads=1000)

2018-07-26 13:09:44 INFO loaded 87 samples, 2129 features
2018-07-26 13:09:44 INFO After filtering, 87 remaining
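
As a rough illustration of what the `min_reads` and `normalize` arguments do conceptually, the following pure-Python sketch drops samples with too few reads and rescales the rest to a fixed total. The function name `filter_and_normalize` and the toy counts are hypothetical and are not part of calour; the real `read_amplicon` operates on the loaded biom table.

```python
def filter_and_normalize(samples, min_reads=1000, normalize=10000):
    """Drop low-depth samples, then rescale each remaining sample to a fixed total.

    samples: dict mapping sample ID -> dict of feature ID -> raw read count.
    (Hypothetical helper for illustration only, not a calour function.)
    """
    result = {}
    for sample_id, counts in samples.items():
        total = sum(counts.values())
        if total < min_reads:
            continue  # too few reads, discard the whole sample
        # Rescale so that the reads in this sample sum to `normalize`
        result[sample_id] = {f: c * normalize / total for f, c in counts.items()}
    return result


# Hypothetical toy data, not taken from the CFS experiment
toy = {
    'S1': {'seqA': 1500, 'seqB': 500},  # 2000 reads, kept
    'S2': {'seqA': 300, 'seqB': 200},   # 500 reads, dropped
}
normalized = filter_and_normalize(toy)
print(normalized)
```

After this step every kept sample has the same total, so abundances can be compared across samples regardless of sequencing depth.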


## Preprocess
Remove uninteresting (low-abundance) bacteria, cluster the bacteria, and sort the samples by disease status.

In [5]:
cfs=cfs.filter_abundance(10)

2018-07-26 13:09:45 INFO After filtering, 1100 remaining
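
The filtering idea can be sketched in plain Python as follows. This assumes the cutoff is applied to the abundance of each bacterium summed over all samples; the helper name `keep_abundant_features` and the toy table are hypothetical, and the exact rule calour applies may differ.

```python
def keep_abundant_features(table, cutoff=10):
    """Keep features whose abundance, summed over all samples, reaches the cutoff.

    table: dict mapping feature ID -> list of per-sample abundances.
    (Hypothetical helper; the summed-abundance rule is an assumption.)
    """
    return {feature: values for feature, values in table.items()
            if sum(values) >= cutoff}


# Hypothetical toy table with three samples
toy_table = {
    'seqA': [4, 5, 3],   # total 12, kept
    'seqB': [0, 1, 2],   # total 3, removed
    'seqC': [10, 0, 0],  # total 10, kept
}
filtered = keep_abundant_features(toy_table, cutoff=10)
print(sorted(filtered))
```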


In [6]:
cfs=cfs.cluster_features()

2018-07-26 13:09:45 INFO After filtering, 1100 remaining


In [7]:
cfs=cfs.sort_samples('Subject')

## Viewing database annotations
In the interactive heatmap, clicking on a bacterium shows a list of all database results for the selected bacterium.

We can choose which databases to use with the `databases=['dbbact',...]` parameter. The available databases depend on which database modules are installed.

Currently, supported microbiome database interfaces include:

* dbBact - a community database for manual annotations about bacteria (interface installation instructions at [dbbact-calour](https://github.com/amnona/dbbact-calour)).

* SpongeEMP - an automatic database for sea sponge samples (interface installation instructions at [spongeworld-calour](https://github.com/amnona/spongeworld-calour)).

* phenoDB - phenotypic information about selected bacteria (interface installation instructions at [pheno-calour](https://github.com/amnona/pheno-calour)).

By default, calour uses the dbBact database for microbiome data.

In [8]:
cfs.plot(sample_field='Subject',gui='jupyter')

<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a10abebe0>

## dbBact enrichment of selected bacteria
By selecting a set of bacteria (using shift+click or ctrl+click) and clicking the "Enrichment" button, we get a list of terms that are significantly enriched in the selected bacteria compared to the rest of the bacteria in the plot.

## Adding dbBact annotations
(Only possible using the `gui='qt5'` GUI)

To add a new annotation to the selected set of bacteria, choose the "Annotate" button.

Detailed instructions are available at the dbBact.org website.

## Differential abundance
We find the bacteria that differ significantly between samples with 'Control' (healthy) and 'Patient' (sick) in the 'Subject' field.

In [9]:
dd=cfs.diff_abundance(field='Subject',val1='Control',val2='Patient', random_seed=2018)

2018-07-26 13:09:57 INFO 87 samples with both values
2018-07-26 13:09:57 INFO After filtering, 1100 remaining
2018-07-26 13:09:57 INFO 39 samples with value 1 (['Control'])
2018-07-26 13:09:58 INFO method meandiff. number of higher in ['Control'] : 38. number of higher in ['Patient'] : 16. total 54
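
The log above mentions the `meandiff` method. As a simplified illustration of a permutation test on the difference of group means, for a single bacterium, consider the sketch below. It is not calour's implementation: the helper names and toy values are hypothetical, and the sketch does no multiple-testing correction across bacteria, which a real analysis of many features needs.

```python
import random


def mean_diff(values, labels, group1):
    """Mean of the samples labeled `group1` minus the mean of all other samples."""
    g1 = [v for v, l in zip(values, labels) if l == group1]
    g2 = [v for v, l in zip(values, labels) if l != group1]
    return sum(g1) / len(g1) - sum(g2) / len(g2)


def permutation_pvalue(values, labels, group1, num_perm=1000, seed=2018):
    """Two-sided permutation p-value for the mean difference of one feature."""
    rng = random.Random(seed)  # fixed seed makes the result reproducible
    observed = abs(mean_diff(values, labels, group1))
    shuffled = list(labels)
    hits = 0
    for _ in range(num_perm):
        rng.shuffle(shuffled)  # break the link between labels and values
        if abs(mean_diff(values, shuffled, group1)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero
    return (hits + 1) / (num_perm + 1)


# Hypothetical abundances of one bacterium in 8 samples
labels = ['Control'] * 4 + ['Patient'] * 4
values = [10, 12, 11, 13, 1, 2, 0, 3]
pval = permutation_pvalue(values, labels, 'Control')
print(pval)
```

With 1000 permutations, the smallest p-value this formula can produce is 1/1001, about 0.000999. Fixing the seed plays the same role as `random_seed=2018` in the call above: repeated runs give identical results.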


### Plot the significant bacteria
When clicking on a bacterium, we'll get both dbBact and SpongeEMP information (the two databases passed in the `databases` parameter).

In [10]:
dd.plot(sample_field='Subject', gui='jupyter', databases=['dbbact','sponge'])

<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a16f33f98>

## dbBact term enrichment (`diff_abundance_enrichment`)
We can ask what is special about the bacteria significantly higher in the Control group vs. the Patient group, and vice versa.

* Note: since the per-feature annotations are fetched from dbBact, this command needs a live internet connection.
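
To make the idea of term enrichment between two groups of bacteria concrete, here is a minimal sketch that compares, for each term, the fraction of bacteria carrying it in each group. The function name, the bacteria IDs, and the term assignments are hypothetical, and the statistic is a deliberate simplification: it is not the score or the significance test that dbBact uses.

```python
def term_fraction_diff(annotations, group1, group2):
    """For each term: fraction of group1 bacteria with it minus fraction of group2.

    annotations: dict mapping bacterium ID -> set of terms.
    (Hypothetical helper; a simplified stand-in for the real enrichment score.)
    """
    terms = set().union(*(annotations[b] for b in group1 + group2))
    diffs = {}
    for term in terms:
        frac1 = sum(term in annotations[b] for b in group1) / len(group1)
        frac2 = sum(term in annotations[b] for b in group2) / len(group2)
        diffs[term] = frac1 - frac2
    return diffs


# Hypothetical annotations for four bacteria
annotations = {
    'seqA': {'feces', 'control'},
    'seqB': {'feces', 'rural community'},
    'seqC': {'feces', "crohn's disease"},
    'seqD': {"crohn's disease"},
}
diffs = term_fraction_diff(annotations, ['seqA', 'seqB'], ['seqC', 'seqD'])
for term, diff in sorted(diffs.items(), key=lambda item: item[1]):
    print(term, diff)
```

Terms with a strongly positive or negative difference are the candidates for enrichment in one group or the other; a real analysis also needs a significance test.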

### Default parameters

In [11]:
ax, enriched=dd.plot_diff_abundance_enrichment()

2018-07-26 13:10:03 INFO Getting dbBact annotations for 54 sequences, please wait...
2018-07-26 13:10:08 INFO Got 2328 annotations
2018-07-26 13:10:08 INFO Added annotation data to experiment. Total 705 annotations, 54 terms
2018-07-26 13:10:08 INFO removed 0 terms


The enriched terms are returned as a calour experiment (terms are features, bacteria are samples), so we can see the
list of enriched terms with the p-value (pvals) and effect size (odif) in its feature metadata.

In [12]:
enriched.feature_metadata

Unnamed: 0,odif,pvals,term
little physical activity {*single exp 63*},-18.562500,0.000999,little physical activity {*single exp 63*}
LOWER IN physical activity {*single exp 63*},-18.562500,0.000999,LOWER IN physical activity {*single exp 63*}
LOWER IN rural community,-18.518092,0.000999,LOWER IN rural community
LOWER IN control,-17.807566,0.000999,LOWER IN control
LOWER IN small village,-17.452303,0.000999,LOWER IN small village
LOWER IN tunapuco {*single exp 276*},-16.430921,0.000999,LOWER IN tunapuco {*single exp 276*}
LOWER IN peru {*single exp 276*},-16.430921,0.000999,LOWER IN peru {*single exp 276*}
crohn's disease,-15.587171,0.000999,crohn's disease
chronic fatigue syndrome {*single exp 12*},-15.187500,0.000999,chronic fatigue syndrome {*single exp 12*}
LOWER IN adult,-14.254934,0.001998,LOWER IN adult


We can plot the enriched terms heatmap to see the term scores for each bacterium.

Note that the rows are now the bacteria and the columns are the terms.

In [16]:
enriched.plot(gui='jupyter', databases=[], feature_field='term',sample_field='group',
              yticklabel_kwargs={'rotation': 0, 'size': 7})

<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1c678c50>

## Look at the behavior of a single term
We want to see all the annotations in which a given term appears, and which bacteria from either group (CFS or healthy) appear in those annotations.
To do this, we use `dbbact.show_term_details_diff()`. The output of this function is an experiment where each column is a bacterium and each row is an annotation. The heatmap shows whether each bacterium appears in each annotation; color indicates the annotation type.

In [38]:
dbbact=ca.database._get_database_class('dbbact')

In [40]:
term_info_exp = dbbact.show_term_details_diff('small village',dd,gui='jupyter')

2018-07-26 13:24:01 INFO found 12 annotations with term
2018-07-26 13:24:01 INFO After filtering, 12 remaining


### Getting enriched annotations instead of terms
Each annotation comes from a single experiment (as opposed to terms, which can come from annotations in multiple experiments).

In [17]:
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='annotation')

2018-07-26 13:12:53 INFO removed 0 terms


In [18]:
enriched.feature_metadata

Unnamed: 0,odif,pvals,term
higher in individuals with low physical activity ( high in little physical activity compared to physical activity in feces homo sapiens united states of america,-18.562500,0.000999,higher in individuals with low physical activi...
high in united states of america city state of oklahoma compared to peru small village tunapuco rural community in feces homo sapiens adult,-16.430921,0.000999,high in united states of america city state o...
high in children with Crohn's disease compared to healthy adult controls ( high in crohn's disease child obsolete_juvenile stage compared to control adult in feces homo sapiens glasgow,-15.187500,0.000999,high in children with Crohn's disease compared...
high in chronic fatigue syndrome compared to control in feces homo sapiens new york county,-15.187500,0.000999,high in chronic fatigue syndrome compared to...
high in female compared to male in feces homo sapiens united states of america,-15.187500,0.000999,high in female compared to male in feces ho...
Higher in animal product diet compared to plant diet ( high in diet animal product diet compared to plant diet in feces homo sapiens united states of america,-11.368421,0.000999,Higher in animal product diet compared to plan...
"common feces, homo sapiens, infant, kingdom of norway, oslo, age 1 year,",-10.125000,0.000999,"common feces, homo sapiens, infant, kingdom o..."
high in infant age 1 year compared to adult age 30-40 in feces homo sapiens kingdom of norway oslo,-10.125000,0.000999,high in infant age 1 year compared to adult ...
higher in stroke patients compared to healthy controls ( high in stroke compared to control in feces homo sapiens china adult guangzhou city prefecture,-10.125000,0.001998,higher in stroke patients compared to healthy ...
lower in infants age<1 year compared to 1-3 years in baby feces ( high in age age > 1 year compared to age <1 year in feces homo sapiens infant finland,-9.858553,0.016983,lower in infants age<1 year compared to 1-3 ye...


### Getting both enriched terms and annotations

In [19]:
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='combined')

2018-07-26 13:13:05 INFO removed 0 terms


In [20]:
enriched.feature_metadata

Unnamed: 0,odif,pvals,term
higher in individuals with low physical activity ( high in little physical activity compared to physical activity in feces homo sapiens united states of america,-18.562500,0.000999,higher in individuals with low physical activi...
LOWER IN physical activity {*single exp 63*},-18.562500,0.000999,LOWER IN physical activity {*single exp 63*}
little physical activity {*single exp 63*},-18.562500,0.000999,little physical activity {*single exp 63*}
LOWER IN rural community,-18.518092,0.000999,LOWER IN rural community
LOWER IN control,-17.807566,0.000999,LOWER IN control
LOWER IN small village,-17.452303,0.000999,LOWER IN small village
LOWER IN peru {*single exp 276*},-16.430921,0.000999,LOWER IN peru {*single exp 276*}
high in united states of america city state of oklahoma compared to peru small village tunapuco rural community in feces homo sapiens adult,-16.430921,0.000999,high in united states of america city state o...
LOWER IN tunapuco {*single exp 276*},-16.430921,0.000999,LOWER IN tunapuco {*single exp 276*}
crohn's disease,-15.587171,0.000999,crohn's disease


### Ignoring selected experiments already in dbBact
If our experiment is already in dbBact, or if there are other experiments in dbBact we do not want to include in the enrichment analysis, we can specify them using the `ignore_exp=[expID,...]` parameter.

In our case, the cfs experiment has already been added to dbBact, so let's ignore its annotations when doing the analysis. By looking at [dbBact.org](dbBact.org) we know its experiment ID is 12. Alternatively, we can use `ignore_exp=True` to automatically detect the current experiment ID if it exists in dbBact (using the md5 hashes of the data and mapping files).
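
Conceptually, ignoring experiments amounts to dropping every annotation that originates from one of the listed experiment IDs before the enrichment is computed. The sketch below shows that filtering step on hypothetical annotation records; the `expid` and `description` keys and the helper name are assumptions made for illustration, not the structure calour uses internally.

```python
def drop_experiments(annotations, ignore_exp):
    """Return only the annotations that do not come from an ignored experiment.

    annotations: list of dicts, each with an 'expid' key (assumed field name).
    ignore_exp: iterable of experiment IDs to exclude.
    """
    ignored = set(ignore_exp)
    return [a for a in annotations if a['expid'] not in ignored]


# Hypothetical annotation records
toy_annotations = [
    {'expid': 12, 'description': 'high in chronic fatigue syndrome compared to control'},
    {'expid': 63, 'description': 'higher in individuals with low physical activity'},
    {'expid': 276, 'description': 'high in united states of america compared to peru'},
]
kept = drop_experiments(toy_annotations, ignore_exp=[12])
print([a['expid'] for a in kept])
```

Excluding the experiment under study avoids a circular result, where the enrichment simply rediscovers annotations that were derived from the same data.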

In [21]:
ax, enriched=dd.plot_diff_abundance_enrichment(term_type='combined', ignore_exp=[12])

2018-07-26 13:13:12 INFO removed 0 terms


## Adding common dbBact terms to features (`add_terms_to_features`)
We can attach to each bacterium the most common dbBact term associated with it.

The terms are selected from all of the dbBact terms, or can be selected from a supplied list.
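
The selection of a "most common term" can be sketched with `collections.Counter`. This is only an illustration of the idea: the helper `most_common_term` and the toy term list are hypothetical, and returning `None` when nothing matches is a choice made for this sketch, not necessarily what calour stores.

```python
from collections import Counter


def most_common_term(terms, use_term_list=None):
    """Return the most frequent term, optionally restricted to an allowed list.

    terms: list of terms collected from all annotations of one bacterium.
    use_term_list: if given, only these terms are considered.
    """
    if use_term_list is not None:
        allowed = set(use_term_list)
        terms = [t for t in terms if t in allowed]
    if not terms:
        return None  # sketch-specific choice for "no matching term"
    return Counter(terms).most_common(1)[0][0]


# Hypothetical terms gathered for a single bacterium
toy_terms = ['feces', 'feces', 'soil', 'saliva', 'soil', 'soil']
print(most_common_term(toy_terms))
print(most_common_term(toy_terms, ['feces', 'saliva', 'skin', 'mus musculus']))
```

Restricting the candidates to a short list, as in the call below with `use_term_list`, keeps the resulting labels comparable across bacteria.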

In [22]:
cfs=cfs.add_terms_to_features(dbname='dbbact',use_term_list=['feces','saliva','skin','mus musculus'])

2018-07-26 13:13:20 INFO Getting dbBact annotations for 1100 sequences, please wait...
2018-07-26 13:13:32 INFO Got 24053 annotations
2018-07-26 13:13:32 INFO Added annotation data to experiment. Total 2151 annotations, 1100 terms


In [23]:
tt=cfs.sort_by_metadata('common_term',axis='feature')

In [24]:
tt.plot(sample_field='Subject', feature_field='common_term', gui='jupyter')

<calour.heatmap.plotgui_jupyter.PlotGUI_Jupyter at 0x1a1cbe0e48>

## Get enriched terms using all bacteria

Instead of just comparing the bacteria enriched in the two groups (and then comparing terms between them), we can compute a weighted term average for each group using all bacteria (weighting the terms of each bacterium by its frequency in the sample). This can help when we don't have a strong set of bacteria separating the two groups.
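
The weighting described above can be sketched as follows: within each sample, every term receives the summed frequency of the bacteria annotated with it, and the per-sample scores are then averaged within each group. The function names and toy values are hypothetical, and the sketch omits the statistical test that follows in the real analysis.

```python
def sample_term_scores(frequencies, bacteria_terms):
    """Sum, for each term, the frequencies of the bacteria annotated with it."""
    scores = {}
    for bacterium, freq in frequencies.items():
        for term in bacteria_terms.get(bacterium, ()):
            scores[term] = scores.get(term, 0.0) + freq
    return scores


def group_mean_scores(samples, bacteria_terms):
    """Average the per-sample term scores over a list of samples."""
    totals = {}
    for frequencies in samples:
        for term, score in sample_term_scores(frequencies, bacteria_terms).items():
            totals[term] = totals.get(term, 0.0) + score
    return {term: total / len(samples) for term, total in totals.items()}


# Hypothetical toy data: terms per bacterium and relative frequencies per sample
bacteria_terms = {'seqA': {'feces'}, 'seqB': {'feces', 'saliva'}, 'seqC': {'skin'}}
control = [{'seqA': 0.5, 'seqB': 0.25, 'seqC': 0.25},
           {'seqA': 0.75, 'seqB': 0.25, 'seqC': 0.0}]
patient = [{'seqA': 0.25, 'seqB': 0.25, 'seqC': 0.5}]

control_scores = group_mean_scores(control, bacteria_terms)
patient_scores = group_mean_scores(patient, bacteria_terms)
print(control_scores)
print(patient_scores)
```

Because every bacterium contributes in proportion to its frequency, a term can differ between groups even when no single bacterium is significantly different on its own.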

In [25]:
dbbact=ca.database._get_database_class('dbbact')

In [32]:
enriched=dbbact.sample_enrichment(cfs,'Subject','Control','Patient',
                                  term_type='combined',ignore_exp=[12])

2018-07-26 13:17:22 INFO 87 samples with both values
2018-07-26 13:17:22 INFO After filtering, 2704 remaining
2018-07-26 13:17:22 INFO 39 samples with value 1 (['Control'])
2018-07-26 13:17:24 INFO method meandiff. number of higher in ['Control'] : 455. number of higher in ['Patient'] : 51. total 506


In [27]:
enriched.feature_metadata

Unnamed: 0,term,num_features,_calour_diff_abundance_effect,_calour_diff_abundance_pval,_calour_diff_abundance_group
enzyme supplement,enzyme supplement,20,-1.467864,0.000999,Patient
-no enzyme supplement,-no enzyme supplement,20,-1.252388,0.000999,Patient
high in EPI dogs with enzyme supplement compared to no supplement ( high in enzyme supplement compared to no enzyme supplement in feces united states of america exocrine pancreatic insufficiency canis lupus familiaris dog,high in EPI dogs with enzyme supplement compar...,20,-1.252388,0.000999,Patient
-gastric bypass,-gastric bypass,4,-1.009475,0.000999,Patient
lower in people with Roux-en-Y gastric bypass compared to controls ( high in control compared to gastric bypass in feces homo sapiens united states of america,lower in people with Roux-en-Y gastric bypass ...,4,-1.009475,0.000999,Patient
-physical activity,-physical activity,49,-0.964541,0.001998,Patient
higher in individuals with low physical activity ( high in little physical activity compared to physical activity in feces homo sapiens united states of america,higher in individuals with low physical activi...,49,-0.964541,0.001998,Patient
little physical activity,little physical activity,49,-0.931222,0.002997,Patient
high in children with Crohn's disease compared to healthy adult controls ( high in crohn's disease child obsolete_juvenile stage compared to control adult in feces homo sapiens glasgow,high in children with Crohn's disease compared...,53,-0.874353,0.000999,Patient
-age 30-40,-age 30-40,16,-0.832852,0.000999,Patient
