Joa, a software Bertillonage tool for Java code


Joa is a signature extractor for Java source and binary files. Joa can be very effective at matching Java binary and source files against a corpus to determine their provenance.

Its operation is described in:

Julius Davies, Daniel M. German, Michael W. Godfrey and Abram Hindle. “EMSE: Software Bertillonage: Determining the Provenance of Software Development Artifacts”, Journal of Empirical Software Engineering. To appear.


Julius Davies, Daniel M. German, Michael W. Godfrey and Abram Hindle. “Software bertillonage: finding the provenance of an entity”. In MSR ‘11: Proceedings of the 8th Working Conference on Mining Software Repositories, pp. 183-192.

We kindly ask that, if you use Joa for research purposes, you cite the above papers.

Joa is licensed under the GPLv2 (or any later version).


To build Joa, simply run ant in the top-level directory.

how to use

The simplest way to run Joa is to use the script. Make sure java can be found on your PATH.

dmg@iodine:~/extProjects/sig-extractor$ ./ 
Signature extractor version $Id: 248 2012-04-29 05:45:47Z dmg $

Joa: extracts signatures from *.class and *.java.
by Julius Davies and Daniel M. German, April 2012.

Usage: [flags] [paths-to-examine...] 

  --stdin            / -in   Reads paths from stdin.
  --recursiveZip     / -rz   Process zips inside zips.
  --hashOutput       / -ho   Each output line is: SHA1;FQN;PATH
  --noFQN            / -nf   Class signature should not include FQN.

  --sortOutput       / -so   Sorts output (all signatures) by FQN.
  --sortInnerClasses / -si   Sorts inner-classes by name within each signature.
  --sortMethods      / -sm   Sorts methods by name within each signature.
  --sortFields       / -sf   Sorts fields by name within each signature.

  --querySame        / -qs   Generates bin_2_bin / src_2_src SQL
  --queryOther       / -qo   Generates bin_2_src / src_2_bin SQL.
  --queryFileHashes  / -qh   Generates SQL based on file SHA1's.

  Note: The '-queryOther/-qo' option takes precedence over '-querySame/-qs'.

  If data is supplied on STDIN, the extractor assumes this contains a list
  of paths separated by newlines (LF).  Paths supplied on the command-line
  are ignored when STDIN has data.

  Sorting the output ( -so / --sortOutput ) can require a lot of RAM.

Java thinks it's allowed to use at most 871MB of RAM right now.


Joa requires a corpus database. The easiest way to create it is to run the extractor on each archive in your corpus (jar, zip, tar, war, etc.).
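Since the usage text says that --stdin reads newline-separated paths, one way to feed a whole corpus to the extractor is to generate that list of paths first. The following is a minimal Python sketch of that idea, not part of Joa: the function name find_archives and the set of extensions are assumptions made for illustration.

```python
import os
import sys

# Extensions treated as archives; this set is an assumption for illustration.
ARCHIVE_EXTENSIONS = (".jar", ".zip", ".tar", ".war")


def find_archives(root):
    """Return the sorted paths of archive files found under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(ARCHIVE_EXTENSIONS):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: list_archives.py <corpus-directory>\n")
        return
    # One path per line, as the --stdin description in the usage text expects.
    sys.stdout.write("".join(path + "\n" for path in find_archives(argv[1])))


if __name__ == "__main__":
    main(sys.argv)
```

The output of this hypothetical list_archives.py can then be piped into the extractor script run with --stdin and -ho.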

You need to run it with, at least, the -ho (--hashOutput) option. This will generate output such as:

I;Processing: aspectjrt-2.5.6.jar

Concatenate the outputs of all the runs, then separate the File records (lines starting with F) from the Signature records (lines starting with S). You now have two files: files.txt and sigs.txt. Remove duplicate lines from each, then compress each one using xz. You should end up with two files, files.txt.xz and sigs.txt.xz.
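The split, de-duplicate and compress steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, each record is kept exactly as the extractor printed it, records are sorted only to make the result reproducible, and everything is held in memory, so a very large corpus may be better served by standard command-line tools.

```python
import lzma


def split_records(lines):
    """Split extractor output into unique F (file) and S (signature) records."""
    files, sigs = set(), set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("F"):
            files.add(line)
        elif line.startswith("S"):
            sigs.add(line)
        # Any other line (for example "I;Processing: ...") is dropped.
    return sorted(files), sorted(sigs)


def write_xz(path, records):
    """Write records, one per line, to an xz-compressed text file."""
    with lzma.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(record + "\n")


def build_corpus_files(output_paths, files_out="files.txt.xz", sigs_out="sigs.txt.xz"):
    """Concatenate saved extractor outputs and write the two compressed files."""
    lines = []
    for path in output_paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            lines.extend(handle)
    files, sigs = split_records(lines)
    write_xz(files_out, files)
    write_xz(sigs_out, sigs)
    return len(files), len(sigs)
```

Calling build_corpus_files with the list of saved output files writes files.txt.xz and sigs.txt.xz in the current directory and returns the number of unique records of each kind.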

Using PostgreSQL, create a database called maven. Then run the script in db/bulk-load/

Usage: ./ [/path/to/files.xz] [/path/to/sigs.xz]

PostgreSQL data loader for the Java Signature Extractor.
by Julius Davies, Daniel M. German.  August 25, 2011.

Note:  You must have a PostgreSQL database named "maven" already
       created, and it must not require a password for the current
       user (e.g., the command "psql maven" must work).

Depending on the size of your data, the load might take a few hours.

Querying the database

Simply run

./ <path-to-jar> | psql maven

The script takes options on the command line. They should match the options that you used during the extraction (for example, -sf or -sm).