Software for creating and comparing data fingerprints: locality-sensitive hashing of semi-structured data in JSON or XML format.
More information and datasets: http://db.systemsbiology.net/gestalt/data_fingerprints/
Preprint: https://www.biorxiv.org/content/early/2018/04/02/293183
-
Create fingerprints. L is the desired fingerprint length.
a. From a collection of JSON objects (one per file in a directory): bin/LPH_multiple_JSON.pl directory L [normalize] > collection
b. From a collection of XML files (one per file in a directory): bin/LPH_multiple_XML.pl directory L [normalize] > collection
c. From a collection of JSON objects (in one file): bin/LPH_JSON.pl file idField L [normalize] > collection
d. From a collection of XML objects (in one file): bin/LPH_XML.pl file idField L [normalize] > collection
e. From a stream of JSON objects, one per line (as in Wikidata): bin/LPH_linewise_JSON.pl file idField L [normalize] > collection
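To give a feel for what a fixed-length fingerprint of a JSON object can look like, here is a minimal Python sketch of generic feature hashing. It is not the algorithm implemented by the bin/LPH_*.pl scripts (see the preprint for that); the names `sketch_fingerprint`, `_walk` and `_bucket` are hypothetical and exist only for this illustration.

```python
import hashlib
import json


def _bucket(token, length):
    """Map a string to a stable bucket index in [0, length)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % length


def _walk(value, path=""):
    """Yield (path, leaf) pairs for every scalar in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from _walk(child, path + "/" + key)
    elif isinstance(value, list):
        for child in value:
            yield from _walk(child, path + "/[]")
    else:
        yield path, value


def sketch_fingerprint(obj, length):
    """Count hashed (path, value) pairs into a vector of the given length.

    Illustrative only: this is plain feature hashing, an assumption made
    for this example, not the method used by the Perl or Python tools.
    """
    vec = [0.0] * length
    for path, leaf in _walk(obj):
        vec[_bucket(path + "=" + json.dumps(leaf), length)] += 1.0
    return vec
```

Because each leaf is hashed together with its path, two objects that differ only in key order produce the same vector, and objects sharing many path/value pairs produce similar vectors.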
-
Visualize. Example R code, where L is your fingerprint length:
data <- read.table("collection", header=FALSE)
# Columns 3..(L+2) hold the fingerprint values
M <- as.matrix(data[,2 + 1:L])
pca <- prcomp(M, center=TRUE, scale.=TRUE)
# Point size scales with the square root of column 2
mag <- sqrt(data[,2])/50
col <- rep(grey(0,.5), nrow(data))
plot(pca$x[,1], pca$x[,2], pch=20, cex=mag, col=col, xlab='PC1', ylab='PC2')
# t-SNE embedding of the same fingerprint columns
library(Rtsne)
tsne <- Rtsne(data[,2 + 1:L], dims=2, perplexity=50, verbose=TRUE, max_iter=500)
plot(tsne$Y, main='tsne', pch=20, cex=mag, col=col)
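For readers working in Python rather than R, the same file can be loaded with a few lines of standard-library code. This is a sketch under assumptions inferred from the R snippet above: whitespace-separated fields, an identifier in column 1, a numeric value in column 2 (the one the R code uses to scale point size), and the L fingerprint values after that. The function name `read_collection` is hypothetical.

```python
def read_collection(lines, length):
    """Parse fingerprint records from an iterable of text lines.

    Assumed layout (inferred from the R example, not from a format spec):
    id, one numeric column, then `length` fingerprint values.
    Returns a list of (id, numeric_column, fingerprint) tuples.
    """
    records = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < length + 2:
            raise ValueError("expected at least %d fields, got %d"
                             % (length + 2, len(fields)))
        ident = fields[0]
        size = float(fields[1])
        fingerprint = [float(x) for x in fields[2:2 + length]]
        records.append((ident, size, fingerprint))
    return records
```

Typical use would be `with open("collection") as fh: records = read_collection(fh, L)`, with L set to the fingerprint length used when the collection was created.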
-
Serialize fingerprints into a database.
bin/serializeLPH.pl collection L columnsToIgnore normalize @listOfFingerprints
bin/serializeLPH.pl collection L columnsToIgnore normalize *.outn.gz
-
Compare two databases.
bin/searchLPHs.pl collection anotherCollection
-
Perform all-against-all comparisons in one database.
bin/searchLPHs.pl collection
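As a rough picture of what an all-against-all comparison involves, the Python sketch below scores every pair of fingerprints and lists the most similar pairs first. The similarity measure used here, Pearson correlation, is an assumption chosen for illustration; this README does not state how bin/searchLPHs.pl scores pairs. The names `pearson` and `all_against_all` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def all_against_all(fingerprints):
    """Score each unordered pair once; return (id_a, id_b, score), best first.

    `fingerprints` maps an identifier to its fingerprint vector.
    """
    pairs = [
        (a, b, pearson(fingerprints[a], fingerprints[b]))
        for a, b in combinations(sorted(fingerprints), 2)
    ]
    pairs.sort(key=lambda item: item[2], reverse=True)
    return pairs
```

Scoring each unordered pair once keeps the work at n(n-1)/2 comparisons for n fingerprints.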
This project is conceptually related to, but distinct from, the Genome Fingerprints project: https://github.com/gglusman/genome-fingerprints
This repository contains two versions of the data-fingerprint code:
- A Perl version, which lives in the bin directory
- A Python version, which lives in datafingerprint, with associated helper programs in scripts
The Perl version is under active development and may contain newer features than the Python version.
The Dockerfile builds the resources for the Python version by default.
To run in Docker, first build and start the container:
docker-compose build
docker-compose up -d
To connect to the container and use the data-fingerprint code:
docker-compose exec datafingerprint bash
See datafingerprint/tests/README.md for information on the behave and unit tests.