Redistribute tabular data into a grid for easy visualization and image-based deep learning. This library is greatly inspired by the excellent MolMap library.
git clone https://github.com/ersilia-os/griddify.git
cd griddify
pip install -e .
You may need to install a C++ compiler. One option is to install it with conda:
conda install -c conda-forge cxx-compiler
In this example, we will use a dataset of 200 physicochemical descriptors calculated for about 10k compounds. You can get these data with the following command.
from griddify import datasets
data = datasets.get_compound_descriptors()
It is important that you preprocess your data (impute missing values, normalize, etc.). We provide functionality to do so.
from griddify import Preprocessing
pp = Preprocessing()
pp.fit(data)
data = pp.transform(data)
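To make the preprocessing step concrete, here is a minimal NumPy sketch of median imputation followed by standardization. It is an illustration of the general idea only, not the internals of the `Preprocessing` class, and the toy table `X` is made up for the example.

```python
import numpy as np

# Toy table (assumed for illustration): 4 samples x 3 features, one missing value.
X = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

# Impute each missing value with the median of its column.
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

# Standardize each column to zero mean and unit variance.
means = X_imputed.mean(axis=0)
stds = X_imputed.std(axis=0)
stds[stds == 0] = 1.0  # guard against constant columns
X_scaled = (X_imputed - means) / stds
```

After this step every feature is on a comparable scale, which matters both for the distance calculation and for the pixel intensities of the final images.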
Start by calculating distances between features.
from griddify import FeatureDistances
fd = FeatureDistances(metric="cosine").calculate(data)
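As a rough picture of what a feature-to-feature distance matrix looks like, the sketch below computes cosine distances between the columns of a random table with SciPy. This is an assumption about the general approach, not a description of how `FeatureDistances` is implemented, and the random data is only a placeholder.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))  # placeholder data: 50 samples x 6 features

# Features are columns, so transpose before computing pairwise distances.
# pdist returns a condensed vector; squareform expands it to a square matrix.
D = squareform(pdist(X.T, metric="cosine"))

print(D.shape)  # one row and one column per feature
```

The result is symmetric with zeros on the diagonal, and its size depends on the number of features, not the number of samples.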
You can now obtain a 2D cloud of your data features. By default, UMAP is used.
from griddify import Tabular2Cloud
tc = Tabular2Cloud()
tc.fit(fd)
Xc = tc.transform(fd)
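To illustrate how a distance matrix can be turned into 2D coordinates, here is a small sketch that uses classical multidimensional scaling (MDS) in plain NumPy. Note that this is a deliberately simpler stand-in and not UMAP, which is what the library uses by default; the helper `classical_mds` is hypothetical and written only for this example.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points in n_components dimensions from a distance matrix D."""
    n = D.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Keep the eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)

# Placeholder input: Euclidean distances between 8 random 2D points.
rng = np.random.default_rng(0)
P = rng.normal(size=(8, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

cloud = classical_mds(D)  # shape (8, 2): one 2D point per input item
```

Whatever the projection method, the output has the same form: one 2D point per feature, with similar features placed close together.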
It is always good to inspect the resulting projection. The cloud contains one point per feature in your dataset.
from griddify.plots import cloud_plot
cloud_plot(Xc)
Distribute cloud points on a grid using a linear assignment algorithm.
from griddify import Cloud2Grid
cg = Cloud2Grid()
cg.fit(Xc)
Xg = cg.transform(Xc)
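The idea of snapping a cloud onto a grid with linear assignment can be sketched with SciPy's `linear_sum_assignment`. The grid construction, the rescaling, and the squared-distance cost below are assumptions made for illustration; they are not necessarily the choices made inside `Cloud2Grid`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
cloud = rng.random((9, 2))  # placeholder cloud: 9 feature points

# Smallest square grid with at least one cell per point.
side = int(np.ceil(np.sqrt(len(cloud))))
axis = np.linspace(0.0, 1.0, side)
cells = np.array([(x, y) for y in axis for x in axis])

# Rescale the cloud to the unit square so it overlaps the grid.
mins = cloud.min(axis=0)
span = cloud.max(axis=0) - mins
span[span == 0] = 1.0
cloud01 = (cloud - mins) / span

# Cost of placing point i on cell j is their squared distance.
cost = cdist(cloud01, cells, metric="sqeuclidean")
rows, cols = linear_sum_assignment(cost)

# Each point gets a distinct cell; total displacement is minimized.
grid_points = np.empty_like(cloud01)
grid_points[rows] = cells[cols]
```

Because the assignment is one-to-one, no two features share a cell, while neighbouring features in the cloud tend to stay neighbours on the grid.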
You can check the rearrangement with an arrows plot.
from griddify.plots import arrows_plot
arrows_plot(Xc, Xg)
For the next steps, it is more convenient to get the mappings as integers. The following method also returns the size of the grid.
mappings, side = cg.get_mappings(Xc)
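As a hedged illustration of what integer mappings could look like, the sketch below converts grid coordinates in the unit square into integer cell indices. The coordinate convention and the layout of the resulting array are assumptions; the exact format returned by `get_mappings` may differ.

```python
import numpy as np

side = 3
# Hypothetical grid coordinates in [0, 1], one row per feature.
grid_points = np.array([
    [0.0, 0.0],
    [0.5, 1.0],
    [1.0, 0.5],
])

# Scale to the index range 0..side-1 and round to the nearest integer.
mappings = np.rint(grid_points * (side - 1)).astype(int)
```

Integer indices can be used directly to address cells of an array, which is what the image construction step needs.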
Let's go back to the original tabular data. We want to transform the input data, where each sample is represented as a one-dimensional array, into output data where each sample is represented as an image (i.e. a two-dimensional grid). Please ensure that the data are normalized or scaled.
from griddify import Flat2Grid
fg = Flat2Grid(mappings, side)
Xi = fg.transform(data)
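The transformation from flat rows to images amounts to writing each feature value into its assigned cell. Here is a minimal NumPy sketch, assuming that `mappings` holds one `(row, column)` pair per feature; that layout and the toy values are hypothetical and not taken from `Flat2Grid` itself.

```python
import numpy as np

side = 3
# Hypothetical mappings: feature j goes to cell (row, col) = mappings[j].
mappings = np.array([[0, 0], [0, 2], [1, 1], [2, 0], [2, 2]])

# Toy table: 2 samples x 5 features.
X = np.arange(10, dtype=float).reshape(2, 5)

# One side x side image per sample; unassigned cells stay at zero.
images = np.zeros((X.shape[0], side, side))
images[:, mappings[:, 0], mappings[:, 1]] = X
```

Cells without an assigned feature keep a constant background value (zero here), which is one reason the input should be normalized first.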
Explore one sample.
from griddify.plots import grid_plot
grid_plot(Xi[0])
You can run the full pipeline described above in only a few lines of code.
from griddify import datasets
from griddify import Griddify
data = datasets.get_compound_descriptors()
gf = Griddify(preprocess=True)
gf.fit(data)
Xi = gf.transform(data)
You can find more examples as Jupyter Notebooks in the notebooks folder.
The Ersilia Open Source Initiative is on a mission to strengthen research capacity in low-income countries. Please reach out to us if you want to contribute: hello@ersilia.io