ATTENTION

Before opening a new issue here, please check the appropriate help channel on the bioBakery Support Forum (https://forum.biobakery.org) and consider opening or commenting on a thread there.

Previous Version

For the 0.8.17 version of HAllA, please see https://github.com/biobakery/halla_legacy.

HAllA version # 0.8.20

Given two high-dimensional 'omics datasets X and Y (continuous and/or categorical features) from the same n biosamples, HAllA (Hierarchical All-against-All Association Testing) discovers densely-associated blocks of features in the X vs. Y association matrix where:

each block is defined as all associations between features in a subtree of X hierarchy and features in a subtree of Y hierarchy,
and a block is densely associated if at least a proportion 1 - FNR of its pairwise associations are FDR significant (FNR is the pre-defined expected false negative rate).
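The density criterion above can be sketched in a few lines. This is only an illustration, not HAllA's implementation: the function name `is_densely_associated` is borrowed from the pseudocode later in this document, while the `reject` matrix layout and the `fnr` argument are assumptions made for the example.

```python
def is_densely_associated(reject, fnr):
    """Return True if at least a (1 - fnr) fraction of the block is significant.

    `reject` is an assumed layout: a 2-D list of booleans where reject[i][j]
    is True when the association between X feature i and Y feature j is
    FDR significant.
    """
    flags = [flag for row in reject for flag in row]
    if not flags:
        return False  # an empty block is never reported
    return sum(flags) / len(flags) >= 1 - fnr


# 5 of 6 pairs significant: 0.83 >= 0.8, so the block counts as dense
block = [[True, True, True],
         [True, True, False]]
dense = is_densely_associated(block, fnr=0.2)
```

With a stricter `fnr` of 0.1, the same block would not qualify because 5/6 is below 0.9.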

Example code can be found in the examples directory.

Installation

Option 1. (Recommended):

pip install halla --no-binary :all:

Option 2. Install with `setup.py`

python setup.py develop

Other than Python (version >= 3.7) and R (version >= 3.6.1), install all required libraries listed in requirements.txt, specifically:

Required Python packages:

jenkspy (version >= 0.1.5)
Matplotlib (version >= 3.3.0)
NumPy (version >= 1.19.0)
pandas (version >= 1.0.5)
PyYAML (version >= 5.4)
rpy2 (version >= 3.3.5) - Notes on installing rpy2 in macOS
scikit-learn (version >= 0.23.1)
SciPy (version >= 1.5.1)
seaborn (version >= 0.10.1)
statsmodels (version >= 0.11.1)
tqdm (version >= 4.50.2)

R packages:

eva (version >= 0.2.6)
XICOR (version >= 0.3.3)

On macOS, read the notes on installing rpy2 before installing the requirements. Specifically, first run `env CC=/usr/local/Cellar/gcc/X.x.x/bin/gcc-X pip install rpy2`, where X.x.x is the gcc version on the machine.

HAllA Overview

Available parameters

Available pairwise distance metrics:

spearman (default for continuous data)
pearson
mi (default for mixed/categorical data)
nmi
xicor
dcor

There are three steps in HAllA:

1. Pairwise similarity matrix computation

The pairwise similarity matrix between all features in X and Y is computed with a specified similarity measure, such as Spearman correlation and normalized mutual information (NMI). This step then generates the p-value and q-value tables.
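As a rough sketch of this step, the snippet below computes a Spearman similarity matrix with its p-values and then derives q-values. It assumes features are stored as rows, and it assumes Benjamini-Hochberg as the FDR procedure; the function names `pairwise_spearman` and `benjamini_hochberg` are hypothetical and not part of HAllA's API.

```python
import numpy as np
from scipy.stats import spearmanr


def pairwise_spearman(X, Y):
    """X: (p features, n samples), Y: (q features, n samples)."""
    p, q = X.shape[0], Y.shape[0]
    sim = np.empty((p, q))
    pvals = np.empty((p, q))
    for i in range(p):
        for j in range(q):
            # spearmanr on two 1-D arrays returns (correlation, p-value)
            sim[i, j], pvals[i, j] = spearmanr(X[i], Y[j])
    return sim, pvals


def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values (q-values) with the same shape as pvals."""
    flat = pvals.ravel()
    m = flat.size
    order = np.argsort(flat)
    scaled = flat[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working from the largest p-value downwards
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    qvals = np.empty(m)
    qvals[order] = np.minimum(adjusted, 1.0)
    return qvals.reshape(pvals.shape)
```

A pair would then be called FDR significant when its q-value falls below the chosen FDR threshold.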

Note that for handling heterogeneous data, all continuous features are first discretized into bins using a specified binning method.
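To illustrate what discretization means here, the following sketch bins a continuous feature into equal-frequency (quantile) bins. Equal-frequency binning and the default of four bins are assumptions chosen for the example; the binning method HAllA actually applies is configurable and is not specified in this section. The function name is hypothetical.

```python
import numpy as np


def discretize_equal_frequency(values, n_bins=4):
    """Map continuous values to integer bin labels 0..n_bins-1 using quantile edges."""
    values = np.asarray(values, dtype=float)
    # interior quantile edges, e.g. the 25%, 50% and 75% quantiles for 4 bins
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    # right=True puts a value equal to an edge into the lower bin
    return np.digitize(values, edges, right=True)


labels = discretize_equal_frequency(np.arange(1, 9), n_bins=4)
```

Once every feature is categorical, a measure such as mutual information can be applied uniformly to continuous and categorical features.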

2. Hierarchical clustering

Hierarchical clustering is performed on the features of each dataset, using the similarity measure from step 1 converted into a distance. This produces one tree per dataset.
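A minimal sketch of this step with SciPy is shown below. It makes several assumptions: Spearman correlation as the similarity, `1 - |similarity|` as the conversion to a distance, and average linkage. The function name `cluster_features` is hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def cluster_features(data, method="average"):
    """data: (features, samples). Returns the root ClusterNode of the feature tree."""
    # Spearman correlation is the Pearson correlation of the per-feature ranks
    rho = np.corrcoef(rankdata(data, axis=1))
    # assumed conversion from similarity to distance
    dist = np.clip(1.0 - np.abs(rho), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    return to_tree(linkage(condensed, method=method))
```

The returned node exposes `get_left()`, `get_right()`, `is_leaf()` and `pre_order()`, which is enough to walk the tree in the block-finding step.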

3. Finding densely-associated blocks

This recursive step is described in the pseudocode below:

def find_densely_associated_blocks(x, y):
    x_features = all features in x
    y_features = all features in y
    if is_densely_associated(x_features, y_features):
        report block and terminate
    else:
        # bifurcate one according to Gini impurities of the splits
        x_branches, y_branches = bifurcate_one_of(x, y)
        if both x and y are leaves:
            terminate
        for each x_branch and y_branch in x_branches and y_branches:
            find_densely_associated_blocks(x_branch, y_branch)

initial function call: find_densely_associated_blocks(X_root, Y_root)
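The pseudocode can be turned into a small runnable sketch. Everything beyond the pseudocode is an assumption made for illustration: trees are nested tuples whose leaves are integer feature indices, `reject` is a boolean matrix of FDR-significant pairs, impurity is the binary Gini impurity weighted by branch size, and ties are broken in favour of splitting X. The leaf check is done before bifurcating so that two leaves are never split.

```python
def leaves(tree):
    """A tree is either a leaf (int feature index) or a (left, right) tuple."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def gini(reject, xs, ys):
    """Binary Gini impurity of the significance calls in block xs x ys."""
    cells = [reject[i][j] for i in xs for j in ys]
    p = sum(cells) / len(cells)
    return 1 - p**2 - (1 - p)**2


def weighted_gini(reject, branches, other, split_x):
    """Size-weighted mean impurity of the sub-blocks produced by one split."""
    total = sum(len(leaves(b)) for b in branches)
    score = 0.0
    for b in branches:
        xs, ys = (leaves(b), leaves(other)) if split_x else (leaves(other), leaves(b))
        score += len(leaves(b)) / total * gini(reject, xs, ys)
    return score


def find_blocks(reject, x, y, fnr=0.2, out=None):
    out = [] if out is None else out
    xs, ys = leaves(x), leaves(y)
    cells = [reject[i][j] for i in xs for j in ys]
    if sum(cells) / len(cells) >= 1 - fnr:
        out.append((xs, ys))  # dense block: report and stop descending
        return out
    x_leaf, y_leaf = isinstance(x, int), isinstance(y, int)
    if x_leaf and y_leaf:
        return out  # nothing left to split
    # candidate splits: bifurcate X or bifurcate Y (leaves cannot be split)
    candidates = []
    if not x_leaf:
        candidates.append((weighted_gini(reject, x, y, True), [(b, y) for b in x]))
    if not y_leaf:
        candidates.append((weighted_gini(reject, y, x, False), [(x, b) for b in y]))
    _, pairs = min(candidates, key=lambda c: c[0])
    for x_branch, y_branch in pairs:
        find_blocks(reject, x_branch, y_branch, fnr, out)
    return out


# Toy data: X features 0 and 1 are associated with Y features 0 and 1 only
reject = [[True, True, False],
          [True, True, False],
          [False, False, False],
          [False, False, False]]
blocks = find_blocks(reject, ((0, 1), (2, 3)), ((0, 1), 2))
```

On this toy input the recursion first splits X, then splits Y inside the upper branch, and reports the single block made of X features 0 and 1 against Y features 0 and 1.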

For example, given two datasets X (features: X1, X2, X3, X4, X5) and Y (features: Y1, Y2, Y3, Y4), hierarchically clustered into an X tree and a Y tree, the algorithm first evaluates the roots of both trees and checks whether the block consisting of all features of X and Y is densely associated, i.e. whether the proportion of significant (rejected) pairwise tests is at least 1 - FNR.

If the block is not densely-associated, the algorithm would bifurcate one of the trees. It would pick one of:

[X1 X2][X3 X4 X5] >< [Y1 Y2 Y3 Y4] or
[X1 X2 X3 X4 X5] >< [Y1 Y2 Y3][Y4]

based on the Gini impurity of the splits: it picks the split that produces the lower weighted Gini impurity.

Once it picks the split with the lower impurity (let's say the first split), it will iteratively evaluate the branches:

find densely-associated blocks in [X1 X2] vs [Y1 Y2 Y3 Y4], and
find densely-associated blocks in [X3 X4 X5] vs [Y1 Y2 Y3 Y4]

and keep going until it terminates.
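The choice between the two candidate splits in this example can be made concrete with a small calculation. The 5 x 4 significance matrix below is invented purely for illustration, and weighting each sub-block by its number of cells is an assumption about how the weighted impurity is formed.

```python
def gini(block):
    """Binary Gini impurity of a block of significant / non-significant calls."""
    cells = [c for row in block for c in row]
    p = sum(cells) / len(cells)
    return 1 - p**2 - (1 - p)**2


def weighted_gini(blocks):
    """Mean impurity of the sub-blocks, weighted by their number of cells."""
    sizes = [sum(len(row) for row in b) for b in blocks]
    return sum(s / sum(sizes) * gini(b) for s, b in zip(sizes, blocks))


# Hypothetical calls (rows X1..X5, columns Y1..Y4); 1 = FDR significant
reject = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

# [X1 X2][X3 X4 X5] >< [Y1 Y2 Y3 Y4]
split_x = [reject[:2], reject[2:]]
# [X1 X2 X3 X4 X5] >< [Y1 Y2 Y3][Y4]
split_y = [[row[:3] for row in reject], [row[3:] for row in reject]]

score_x = weighted_gini(split_x)
score_y = weighted_gini(split_y)
chosen = "split X" if score_x <= score_y else "split Y"
```

For this made-up matrix the X split separates significant from non-significant rows almost perfectly, so it has the lower weighted impurity and would be chosen.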

Contributions

Thanks go to these wonderful people:
