Command line and QIIME2 -- refactoring #16

Merged
merged 24 commits into from
Aug 28, 2019
2 changes: 0 additions & 2 deletions .travis.yml
@@ -5,8 +5,6 @@ language: python
env:
- PYVERSION=3.5 USE_CYTHON=TRUE MAKE_DOC=TRUE
before_install:
- "export DISPLAY=:99.0"
- "sh -e /etc/init.d/xvfb start"
- wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- chmod +x miniconda.sh
- ./miniconda.sh -b
Empty file added Icon
Empty file.
181 changes: 156 additions & 25 deletions README.md
@@ -6,33 +6,164 @@

# gemelli

## usage

```python
import numpy as np
import pandas as pd
from gemelli.factorization import TenAls
from gemelli.preprocessing import build, rclr

# construct and transform the tensor
tensor = build()
tensor.construct(table, metadata, subjects,
                 [condition_1, condition_2, ..., condition_n])
tensor_rclr = rclr(tensor.counts)
# factorize
TF = TenAls().fit(tensor_rclr)
# component labels for the loading files (rank must be defined by the caller)
PC = ['PC' + str(i + 1) for i in range(rank)]
# loadings as dataframes
sample_loading = pd.DataFrame(abs(TF.sample_loading),
                              tensor.subject_order)
feature_loading = pd.DataFrame(TF.feature_loading,
                               tensor.feature_order)
temporal_loading = pd.DataFrame(TF.conditional_loading,
                                tensor.condition_orders[0])
```
Gemelli is a toolbox for running tensor factorization on sparse compositional omics datasets. Gemelli performs unsupervised dimensionality reduction of spatiotemporal microbiome data. The output of gemelli helps to resolve spatiotemporal subject variation and the biological features that separate subjects.
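
The rclr preprocessing step named in this README can be illustrated on a single sample. The block below is a minimal sketch of the general robust centered log-ratio idea (take logs of the observed, nonzero counts and center them on their mean, leaving zeros as missing). It is not gemelli's implementation, which is not shown in this diff; the function name `rclr_sample` and the use of `None` for missing entries are assumptions made for illustration.

```python
import math


def rclr_sample(counts):
    """Robust centered log-ratio of one sample (illustrative sketch).

    Zeros are treated as missing values and returned as None, so only
    observed entries contribute to the centering term.
    """
    logs = [math.log(c) if c > 0 else None for c in counts]
    observed = [v for v in logs if v is not None]
    if not observed:
        return logs
    center = sum(observed) / len(observed)
    return [v - center if v is not None else None for v in logs]


sample = [10, 0, 100, 1000]
transformed = rclr_sample(sample)
# The zero stays missing; the observed values are centered around zero.
print(transformed)
```

Keeping zeros as missing, rather than adding a pseudocount, is what lets the downstream factorization work on only the observed entries of a sparse table.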

## Installation

To install the most up-to-date version of gemelli, run the following command:

    # pip (only supported for QIIME2 >= 2018.8)
    pip install gemelli

**Note**: gemelli is not compatible with Python 2; it requires Python 3.4 or later.

## Using gemelli inside [QIIME 2](https://qiime2.org/)

A QIIME2 tutorial can be found [here](https://github.com/cameronmartino/gemelli/ipynb/tutorials/QIIME2-jansson-ibd-tutorial.md).

`Note: a more formal tutorial is coming soon.`

```bash
$ qiime gemelli ctf --help

Usage: qiime gemelli ctf [OPTIONS]

Gemelli resolves spatiotemporal subject variation and the biological
features that separate them. In this case, a subject may have several
paired samples, where each sample may be a time point. The output is akin
to conventional beta-diversity analyses but with the paired component
integrated in the dimensionality reduction.

Inputs:
--i-table ARTIFACT FeatureTable[Frequency]
Input table in biom format. [required]
Parameters:
--m-sample-metadata-file METADATA...
(multiple arguments will be merged)
Sample metadata file in QIIME2 formatting. [required]
--p-individual-id-column TEXT
Metadata column containing subject IDs to use for
pairing samples. WARNING: if replicates exist for an
individual ID at either state_1 to state_N, that
subject will be mean grouped by default. [required]
--p-state-column TEXT Metadata column containing state (e.g.,Time,
BodySite) across which samples are paired. At least
one is required but up to four are allowed by other
state inputs. [required]
--p-n-components INTEGER
The underlying low-rank structure (suggested: 2 <
rank < 10) [minimum 2] [default: 3]
--p-min-sample-count INTEGER
Minimum sum cutoff of sample across all features
[default: 0]
--p-min-feature-count INTEGER
Minimum sum cutoff of features across all samples
[default: 0]
--p-max-iterations-als INTEGER
Max number of Alternating Least Square (ALS)
optimization iterations (suggested to be below 100;
beware of overfitting) [minimum 1] [default: 25]
--p-max-iterations-rptm INTEGER
Max number of Robust Tensor Power Method (RTPM)
optimization iterations (suggested to be below 100;
beware of overfitting) [minimum 1] [default: 25]
--p-n-initializations INTEGER
The number of initialization vectors. Larger values
will give more accurate factorization but will be more
computationally expensive [minimum 1] [default: 25]
--m-feature-metadata-file METADATA...
(multiple arguments will be merged) [optional]
Outputs:
--o-subject-biplot ARTIFACT PCoAResults % Properties('biplot')
Compositional biplot of subjects as points and
features as arrows. Where the variation between
subject groupings is explained by the log-ratio
between opposing arrows. WARNING: The % variance
explained is spread over n-components and can be
inflated. [required]
--o-state-distance-matrix ARTIFACT DistanceMatrix
A sample-sample distance matrix generated from the
euclidean distance of the subject-state ordinations
and itself. [required]
--o-state-subject-ordination ARTIFACT SampleData[SampleTrajectory]
A trajectory is an ordination that can be
visualized over time or another context. [required]
--o-state-feature-ordination ARTIFACT FeatureData[FeatureTrajectory]
A trajectory is an ordination that can be
visualized over time or another context. [required]
Miscellaneous:
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr
during execution of this action. Or silence output if
execution is successful (silence is golden).
--citations Show citations and exit.
--help Show this message and exit.

```
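
As a rough illustration of how the required options in the help text above fit together, the sketch below assembles a `qiime gemelli ctf` command line in Python. Only the flag names come from the help output; every file name and metadata column name (`table.qza`, `metadata.tsv`, `host_subject_id`, `timepoint`, and the output names) is a hypothetical placeholder.

```python
import shlex

# Flag names are taken from the help text; all values are hypothetical.
args = [
    "qiime", "gemelli", "ctf",
    "--i-table", "table.qza",
    "--m-sample-metadata-file", "metadata.tsv",
    "--p-individual-id-column", "host_subject_id",
    "--p-state-column", "timepoint",
    "--p-n-components", "3",
    "--o-subject-biplot", "subject-biplot.qza",
    "--o-state-distance-matrix", "state-distance.qza",
    "--o-state-subject-ordination", "state-subject-ordination.qza",
    "--o-state-feature-ordination", "state-feature-ordination.qza",
]

# Quote each argument so the printed command is safe to paste into a shell.
command = " ".join(shlex.quote(a) for a in args)
print(command)

# To actually run it (requires a QIIME 2 environment with gemelli installed):
# import subprocess; subprocess.run(args, check=True)
```

Passing the list to `subprocess.run` rather than a single string avoids shell-quoting problems when paths contain spaces.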

## Using gemelli as a standalone tool

```bash
$ gemelli --help

Usage: gemelli [OPTIONS]

Runs CTF with an rclr preprocessing step.

Options:
--in-biom TEXT Input table in biom format. [required]
--sample-metadata-file TEXT Sample metadata file in QIIME2 formatting.
[required]
--individual-id-column TEXT Metadata column containing subject IDs to use
for pairing samples. WARNING: if replicates
exist for an individual ID at either state_1
to state_N, that subject will be mean grouped.
[required]
--state-column-1 TEXT Metadata column containing state (e.g.,Time,
BodySite) across which samples are paired. At
least one is required but up to four are
allowed by other state inputs. [required]
--output-dir TEXT Location of output files. [required]
--n_components INTEGER The underlying low-rank structure (suggested:
1 < rank < 10) [minimum 2] [default: 3]
--min-sample-count INTEGER Minimum sum cutoff of sample across all
features [default: 0]
--min-feature-count INTEGER Minimum sum cutoff of features across all
samples [default: 5]
--max_iterations_als INTEGER Max number of Alternating Least Square (ALS)
optimization iterations (suggested to be
below 100; beware of overfitting) [minimum 1]
[default: 50]
--max_iterations_rptm INTEGER Max number of Robust Tensor Power Method
(RTPM) optimization iterations (suggested to
be below 100; beware of overfitting) [minimum
1] [default: 50]
--n_initializations INTEGER The number of initialization vectors. Larger
values will give more accurate factorization
but will be more computationally expensive
(suggested to be below 100; beware of
overfitting) [minimum 1] [default: 50]
--feature-metadata-file TEXT Feature metadata file in QIIME2 formatting.
--state-column-2 TEXT Metadata column containing state (e.g.,Time,
BodySite) across which samples are paired. At
least one is required but up to four are
allowed by other state inputs.
--state-column-3 TEXT Metadata column containing state (e.g.,Time,
BodySite) across which samples are paired. At
least one is required but up to four are
allowed by other state inputs.
--state-column-4 TEXT Metadata column containing state (e.g.,Time,
BodySite) across which samples are paired. At
least one is required but up to four are
allowed by other state inputs.
--help Show this message and exit.

```
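
The warning on `--individual-id-column` says that replicates for the same individual and state are mean grouped. How gemelli does this internally is not shown in this diff; the block below is a minimal sketch of the idea, assuming each sample is a `(subject, state, counts)` tuple. The function name `mean_group` and the sample values are hypothetical.

```python
from collections import defaultdict


def mean_group(samples):
    """Average count vectors that share the same (subject, state) pair."""
    grouped = defaultdict(list)
    for subject, state, counts in samples:
        grouped[(subject, state)].append(counts)
    # zip(*vectors) walks the replicates feature by feature.
    return {
        key: [sum(col) / len(col) for col in zip(*vectors)]
        for key, vectors in grouped.items()
    }


samples = [
    ("subject_A", "day_1", [10, 0, 4]),
    ("subject_A", "day_1", [20, 2, 6]),  # replicate of the sample above
    ("subject_A", "day_2", [5, 5, 5]),
]
averaged = mean_group(samples)
print(averaged)
```

After grouping, each subject contributes exactly one count vector per state, which is the shape a subject-by-feature-by-state tensor needs.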

## resources
## Other Resources

Named after Gemelli by Alighiero Boetti and also the pasta.

2 changes: 1 addition & 1 deletion gemelli/__init__.py
@@ -6,4 +6,4 @@
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

__version__ = "0.0.1"
__version__ = "0.0.2"
43 changes: 43 additions & 0 deletions gemelli/_ctf_defaults.py
@@ -0,0 +1,43 @@
# Configuration file where you can set the parameter default values and
# descriptions.
DEFAULT_COMP = 3
DEFAULT_MSC = 0
DEFAULT_MFC = 0
DEFAULT_MAXITER = 25
DEFAULT_FMETA = None
DEFAULT_COND = None

DESC_INIT = ("The number of initialization vectors. Larger values will "
             "give more accurate factorization but will be more "
             "computationally expensive [minimum 1]")
DESC_ITERATIONSALS = ("Max number of Alternating Least Square (ALS)"
" optimization iterations (suggested to be below 100;"
" beware of overfitting) [minimum 1]")
DESC_ITERATIONSRTPM = ("Max number of Robust Tensor Power Method (RTPM)"
" optimization iterations (suggested to be below 100;"
" beware of overfitting) [minimum 1]")
DESC_COMP = ("The underlying low-rank structure (suggested: 2 < rank < 10)"
" [minimum 2]")
DESC_MSC = "Minimum sum cutoff of sample across all features"
DESC_MFC = "Minimum sum cutoff of features across all samples"
DESC_OUT = "Location of output files."
DESC_FMETA = "Feature metadata file in QIIME2 formatting."
DESC_BIN = "Input table in biom format."
DESC_SMETA = "Sample metadata file in QIIME2 formatting."
DESC_SUBJ = ("Metadata column containing subject IDs to"
" use for pairing samples. WARNING: if"
" replicates exist for an individual ID at"
" either state_1 to state_N, that subject will"
" be mean grouped by default.")
DESC_COND = ("Metadata column containing state (e.g.,Time, BodySite)"
" across which samples are paired."
" At least one is required but up to four are allowed"
" by other state inputs.")
QORD = ("A trajectory is an ordination that can be visualized "
        "over time or another context.")
QDIST = ("A sample-sample distance matrix generated from the euclidean distance"
" of the subject-state ordinations and itself.")
QLOAD = ("Compositional biplot of subjects as points and features as arrows."
" Where the variation between subject groupings is explained by the"
" log-ratio between opposing arrows. WARNING: The % variance explained"
" is spread over n_components and can be inflated.")
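
The `QDIST` description refers to a distance matrix built from the Euclidean distance between ordination coordinates. The sketch below shows that computation in plain Python under stated assumptions: the coordinates dictionary and sample IDs are hypothetical, and gemelli's own code path for this output is not shown in this diff.

```python
import math


def euclidean_distance_matrix(coords):
    """Return (ids, matrix) of pairwise Euclidean distances between rows."""
    ids = list(coords)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    matrix = [[dist(coords[i], coords[j]) for j in ids] for i in ids]
    return ids, matrix


# Hypothetical ordination coordinates (two components per sample).
coords = {
    "s1": [0.0, 0.0],
    "s2": [3.0, 4.0],
    "s3": [6.0, 8.0],
}
ids, matrix = euclidean_distance_matrix(coords)
for sample_id, row in zip(ids, matrix):
    print(sample_id, row)
```

The result is symmetric with a zero diagonal, which is what a sample-sample distance matrix is expected to look like.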
33 changes: 15 additions & 18 deletions gemelli/base.py
@@ -19,13 +19,10 @@ class _BaseImpute(object):
def fit(self):
""" Placeholder for fit; this
should be implemented by the sub-method"""

def transform(self):
""" return loadings
"""
return self.sample_loading, \
self.feature_loading, \
self.conditional_loading
@abstractmethod
def label(self):
""" Placeholder for label; this
should be implemented by the sub-method"""


class _BaseConstruct(object):
@@ -36,15 +33,15 @@ class _BaseConstruct(object):
"""
@abstractmethod
def construct(self):
"""
conditional_loading : array-like or list of array-like
The conditional loading vectors
of shape (conditions, r) if there is 1 type
of condition, and a list of such matrices if
there are more than 1 type of condition
feature_loading : array-like
The feature loading vectors
of shape (features, r)
sample_loading : array-like
The sample loading vectors
"""
conditional_loading : array-like or list of array-like
The conditional loading vectors
of shape (conditions, r) if there is 1 type
of condition, and a list of such matrices if
there are more than 1 type of condition
feature_loading : array-like
The feature loading vectors
of shape (features, r)
sample_loading : array-like
The sample loading vectors
of shape (samples, r) """
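
A general Python note on the base classes in this file: the hunk headers show them subclassing `object`, and `@abstractmethod` is only enforced when the class uses `ABCMeta` (for example by inheriting from `abc.ABC`). Whether base.py sets a metaclass elsewhere is not visible in this diff, so the block below is only a sketch of how enforcement would look; the class names `BaseImpute`, `Incomplete`, and `Complete` are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseImpute(ABC):
    """Hypothetical base class whose abstract methods are enforced."""

    @abstractmethod
    def fit(self):
        """Fit the model; must be implemented by the subclass."""

    @abstractmethod
    def label(self):
        """Return labels; must be implemented by the subclass."""


class Incomplete(BaseImpute):
    # label() is missing, so this class cannot be instantiated.
    def fit(self):
        return self


class Complete(BaseImpute):
    def fit(self):
        return self

    def label(self):
        return "complete"


try:
    Incomplete()
    enforced = False
except TypeError:
    enforced = True

print(enforced, Complete().fit().label())
```

With a plain `object` base, `Incomplete()` would instantiate without complaint and the missing method would only surface when it is called.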
14 changes: 14 additions & 0 deletions gemelli/citations.bib
@@ -0,0 +1,14 @@
@article {Martino2019,
author = {Martino, Cameron and Morton, James T. and Marotz, Clarisse A. and Thompson, Luke R. and Tripathi, Anupriya and Knight, Rob and Zengler, Karsten},
editor = {Neufeld, Josh D.},
title = {A Novel Sparse Compositional Technique Reveals Microbial Perturbations},
volume = {4},
number = {1},
elocation-id = {e00016-19},
year = {2019},
doi = {10.1128/mSystems.00016-19},
publisher = {American Society for Microbiology Journals},
URL = {https://msystems.asm.org/content/4/1/e00016-19},
eprint = {https://msystems.asm.org/content/4/1/e00016-19.full.pdf},
journal = {mSystems}
}