Skip to content

biobakery/halla

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

ATTENTION

Before opening a new issue here, please check the appropriate help channel on the bioBakery Support Forum (https://forum.biobakery.org) and consider opening or commenting on a thread there.

Previous Version

Please direct to https://github.com/biobakery/halla_legacy for the 0.8.17 version of HAllA.


HAllA version # 0.8.20

Given two high-dimensional 'omics datasets X and Y (continuous and/or categorical features) from the same n biosamples, HAllA (Hierarchical All-against-All Association Testing) discovers densely-associated blocks of features in the X vs. Y association matrix where:

  • each block is defined as all associations between features in a subtree of X hierarchy and features in a subtree of Y hierarchy,
  • and a block is densely associated if (1 - FNR)% of pairwise associations are FDR significant (FNR is the pre-defined expected false negative rate)

Example codes can be found under examples directory.

Installation

Option 1. (Recommended):

pip install halla --no-binary :all:

Option 2. Install with setup.py

python setup.py develop

Other than Python (version >= 3.7) and R (version >= 3.6.1), install all required libraries listed in requirements.txt, specifically:

Requirements Python packages:

R packages:

# for MacOS - read the notes on installing rpy2:
#   specifically run 'env CC=/usr/local/Cellar/gcc/X.x.x/bin/gcc-X pip install rpy2'
#   where X.x.x is the gcc version on the machine **BEFORE** running the following command

HAllA Overview

Available parameters

Available pairwise distance metrics:

  • spearman (default for continuous data)
  • pearson
  • mi (default for mixed/categorical data)
  • nmi
  • xicor
  • dcor

There are three steps in HAllA:

1. Pairwise similarity matrix computation

The pairwise similarity matrix between all features in X and Y is computed with a specified similarity measure, such as Spearman correlation and normalized mutual information (NMI). This step then generates the p-value and q-value tables.

Note that for handling heterogeneous data, all continuous features are first discretized into bins using a specified binning method.

2. Hierarchical clustering

Hierarchical clustering on the features in each dataset is performed using the converted similarity measure used in step 1. It produces a tree for each dataset.

3. Finding densely-associated blocks

This recursive step is described in the pseudocode below:

def find_densely_associated_blocks(x, y):
    x_features = all features in x
    y_features = all features in y
    if is_densely_associated(x_features, y_features):
        report block and terminate
    else:
        # bifurcate one according to Gini impurities of the splits
        x_branches, y_branches = bifurcate_one_of(x, y)
        if both x and y are leaves:
            terminate
        for each x_branch and y_branch in x_branches and y_branches:
            find_densely_associated_blocks(x_branch, y_branch)

initial function call: find_densely_associated_blocks(X_root, Y_root)

For example, given two datasets of X (features: X1, X2, X3, X4, X5) and Y (features: Y1, Y2, Y3, Y4) both hierarchically clustered in X tree and Y tree, the algorithm first evaluates the roots of both trees and checks if the block consisting of all features of X and Y are densely-associated (if %significance (%reject) >= (1 - FNR)%).

If the block is not densely-associated, the algorithm would bifurcate one of the trees. It would pick one of:

  • [X1 X2][X3 X4 X5] >< [Y1 Y2 Y3 Y4] or
  • [X1 X2 X3 X4 X5] >< [Y1 Y2 Y3][Y4]

based on the Gini impurity of the splits (pick the split that produces a lower weighted Gini impurity),

Once it picks the split with the lower impurity (let's say the first split), it will iteratively evaluate the branches:

  • find densely-associated blocks in [X1 X2] vs [Y1 Y2 Y3 Y4], and
  • find densely-associated blocks in [X3 X4 X5] vs [Y1 Y2 Y3 Y4]

and keep going until it terminates.

Contributions

Thanks go to these wonderful people:

Releases

No releases published

Packages

No packages published

Languages