Griddify

Redistribute tabular data into a grid for easy visualization and image-based deep learning. This library is greatly inspired by the excellent MolMap library.

Installation

git clone https://github.com/ersilia-os/griddify.git
cd griddify
pip install -e .

Note that you may have to install a C++ compiler. You can just use conda for that:

conda install -c conda-forge cxx-compiler

Step by step

Get a multidimensional dataset and preprocess it

In this example, we will use a dataset of 200 physicochemical descriptors calculated for about 10k compounds. You can get these data with the following command.

from griddify import datasets

data = datasets.get_compound_descriptors()

It is important that you preprocess your data (impute missing values, normalize, etc.). We provide functionality to do so.

from griddify import Preprocessing

pp = Preprocessing()
pp.fit(data)
data = pp.transform(data)
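
To make the idea of preprocessing concrete, here is a minimal sketch of median imputation followed by per-column standardization, written with the standard library only. It is not Griddify's `Preprocessing` implementation, whose internals are not described here; the helper names `fit_column_stats` and `transform_rows` are hypothetical and exist only for this illustration.

```python
import math
import statistics


def fit_column_stats(rows):
    """Learn per-column median, mean and standard deviation (hypothetical helper)."""
    stats = []
    for column in zip(*rows):
        observed = [v for v in column if not math.isnan(v)]
        median = statistics.median(observed)
        # Fill gaps with the median before computing mean and spread.
        filled = [median if math.isnan(v) else v for v in column]
        mean = statistics.fmean(filled)
        std = statistics.pstdev(filled) or 1.0  # constant column: avoid dividing by zero
        stats.append((median, mean, std))
    return stats


def transform_rows(rows, stats):
    """Impute missing values with the learned median, then standardize each column."""
    out = []
    for row in rows:
        new_row = []
        for value, (median, mean, std) in zip(row, stats):
            if math.isnan(value):
                value = median
            new_row.append((value - mean) / std)
        out.append(new_row)
    return out


# Toy table: 3 samples x 2 features, with one missing value.
toy = [
    [1.0, 10.0],
    [2.0, float("nan")],
    [3.0, 30.0],
]
stats = fit_column_stats(toy)
scaled = transform_rows(toy, stats)
```

Separating the fit and transform steps mirrors the `fit`/`transform` calls above: statistics are learned once and can then be applied to new samples.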

Create a 2D cloud of data features

Start by calculating distances between features.

from griddify import FeatureDistances

fd = FeatureDistances(metric="cosine").calculate(data)
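
As an illustration of what a feature-to-feature distance means, the sketch below computes cosine distances between the columns of a small table, using the standard library only. It assumes the result is a square matrix with one row and one column per feature; the exact object returned by `FeatureDistances` is not specified here, and `feature_distance_matrix` is a hypothetical helper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def feature_distance_matrix(rows):
    """Compare features (columns), not samples (rows)."""
    columns = list(zip(*rows))
    return [[cosine_distance(ci, cj) for cj in columns] for ci in columns]


# Toy table: 3 samples x 3 features. Feature 1 is feature 0 doubled,
# so the two point in the same direction and are at distance ~0.
toy = [
    [1.0, 2.0, -1.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
]
dist = feature_distance_matrix(toy)
```

Features that vary together across samples end up close to each other, which is what lets the later steps place them in neighboring grid cells.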

You can now obtain a 2D cloud of your data features. By default, UMAP is used.

from griddify import Tabular2Cloud

tc = Tabular2Cloud()
tc.fit(fd)
Xc = tc.transform(fd)

It is always good to inspect the resulting projection. The cloud contains one point per feature in your dataset.

from griddify.plots import cloud_plot

cloud_plot(Xc)

Rearrange the 2D cloud onto a grid

Distribute cloud points on a grid using a linear assignment algorithm.

from griddify import Cloud2Grid

cg = Cloud2Grid()
cg.fit(Xc)
Xg = cg.transform(Xc)
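
The linear assignment idea can be sketched on a tiny example: each cloud point is given its own grid cell so that the total squared displacement is as small as possible. The code below solves this by brute force over all permutations, which is only feasible for a handful of points; it is not how `Cloud2Grid` is implemented, and `assign_to_grid` is a hypothetical helper. For realistic sizes, a dedicated solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import itertools
import math


def assign_to_grid(points):
    """Assign each 2D point to a distinct cell of the smallest square grid that fits."""
    side = math.ceil(math.sqrt(len(points)))
    cells = [(gx, gy) for gx in range(side) for gy in range(side)]

    # Rescale the cloud to the grid's coordinate range so distances are comparable.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def rescale(v, lo, hi):
        return 0.0 if hi == lo else (v - lo) / (hi - lo) * (side - 1)

    scaled = [
        (rescale(x, min(xs), max(xs)), rescale(y, min(ys), max(ys)))
        for x, y in points
    ]

    # Exhaustive search: try every way of giving the points distinct cells.
    best_cost, best = math.inf, None
    for chosen in itertools.permutations(cells, len(points)):
        cost = sum(
            (px - cx) ** 2 + (py - cy) ** 2
            for (px, py), (cx, cy) in zip(scaled, chosen)
        )
        if cost < best_cost:
            best_cost, best = cost, chosen
    return list(best), side


# Toy cloud of 4 feature points, one near each corner of the unit square.
cloud = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.8, 0.9)]
grid_cells, grid_side = assign_to_grid(cloud)
```

Because every point must receive its own cell, crowded regions of the cloud are spread out while neighboring features stay close together on the grid.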

You can check the rearrangement with an arrows plot.

from griddify.plots import arrows_plot

arrows_plot(Xc, Xg)

To continue with the next steps, it is actually more convenient to get mappings as integers. The following method gives you the size of the grid as well.

mappings, side = cg.get_mappings(Xc)

Rearrange your flat data points into grids

Let's go back to the original tabular data. We want to transform the input data, where each sample is a one-dimensional array, into output data where each sample is an image (i.e. a two-dimensional grid). Please ensure that the data are normalized or scaled.

from griddify import Flat2Grid

fg = Flat2Grid(mappings, side)
Xi = fg.transform(data)
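
Conceptually, this step writes each feature value into the grid cell assigned to that feature. The sketch below assumes the mappings are a list of integer `(row, column)` pairs, one per feature, and that unused cells are filled with zero; the actual format returned by `get_mappings` may differ, and `flat_to_grid` is a hypothetical helper rather than part of Griddify.

```python
def flat_to_grid(sample, mappings, side, fill=0.0):
    """Place each feature value of one sample into its assigned grid cell."""
    grid = [[fill] * side for _ in range(side)]
    for value, (row, col) in zip(sample, mappings):
        grid[row][col] = value
    return grid


# Hypothetical mappings for 3 features on a 2 x 2 grid; one cell stays empty.
toy_mappings = [(0, 0), (0, 1), (1, 0)]
toy_side = 2
samples = [
    [0.5, -1.2, 0.3],
    [1.1, 0.0, -0.7],
]
images = [flat_to_grid(s, toy_mappings, toy_side) for s in samples]
```

The same mappings are reused for every sample, so a given feature always lands in the same pixel and the resulting images are directly comparable.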

Explore one sample.

from griddify.plots import grid_plot

grid_plot(Xi[0])

Full pipeline

You can run the full pipeline described above in only a few lines of code.

from griddify import datasets
from griddify import Griddify

data = datasets.get_compound_descriptors()

gf = Griddify(preprocess=True)
gf.fit(data)
Xi = gf.transform(data)

You can find more examples as Jupyter Notebooks in the notebooks folder.

Learn more

The Ersilia Open Source Initiative is on a mission to strengthen research capacity in low-income countries. Please reach out to us if you want to contribute: hello@ersilia.io
