askalono is a library and command-line tool to help detect license texts. It's designed to be fast, accurate, and to support a wide variety of license texts.
This tool does not provide legal advice and it is not a lawyer. It endeavors to match your input to a database of similar license texts, and tell you what it thinks is a close match. But, it can't tell you that the given license is authoritative over a project. Nor can it tell you what to do with a license once it's identified. You are not entitled to rely on the accuracy of the output of this tool, and should seek independent legal advice for any licensing questions that may arise from using this tool.
Additional pre-release note
This software is in the early stages of its lifecycle. While its goals are to be as accurate as it can be, there may be more bugs than expected of a production product.
On the command line
Pre-built binaries are available in the Releases section on GitHub. Rust developers may also grab a copy by running
cargo install askalono-cli.
askalono id <filename>
<filename> is a file (not folder) containing license text to analyze. In many projects, this file is called
COPYING. askalono will analyze the text and output what it thinks it is.
If askalono can't identify a file, it may simply be a license it doesn't know. But if it's actually source code with a file header (or footer, or anything in between), askalono may be able to dig deeper. To try this, pass the --optimize flag:
askalono id --optimize <filename>
If you'd like to discover license files within a directory tree, askalono offers a crawl subcommand:
askalono crawl <directory>
As a library
At the moment,
LicenseContent are exposed for usage.
tl;dr: Sørensen–Dice scoring, multi-threading, compressed cache file
At its core, askalono builds up bigrams (word pairs) of input text, and compares that with other license texts it knows about to see how similar they are. It scores each match with a Sørensen–Dice coefficient and looks for the highest result. There is some minimal preprocessing happening before matching, but there are no hand-maintained regular expressions or curations used to determine a match.
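To make the scoring idea concrete, here is a minimal sketch of a Sørensen–Dice coefficient computed over word bigrams. It is not askalono's actual implementation: the function names bigrams and dice are hypothetical, it uses plain sets with only lowercasing as preprocessing, and it assumes two empty inputs should score 0.0.

```rust
use std::collections::HashSet;

/// Hypothetical helper: split text into lowercase words and collect
/// every adjacent word pair (bigram) into a set.
fn bigrams(text: &str) -> HashSet<(String, String)> {
    let words: Vec<String> = text
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .collect();
    words
        .windows(2)
        .map(|pair| (pair[0].clone(), pair[1].clone()))
        .collect()
}

/// Sørensen–Dice coefficient: 2 * |A ∩ B| / (|A| + |B|).
/// 1.0 means the two bigram sets are identical, 0.0 means nothing is shared.
fn dice(a: &HashSet<(String, String)>, b: &HashSet<(String, String)>) -> f64 {
    let total = a.len() + b.len();
    if total == 0 {
        return 0.0; // assumption: two empty inputs count as "no match"
    }
    let shared = a.intersection(b).count();
    2.0 * shared as f64 / total as f64
}
```

For example, "permission is hereby granted free of charge" and "permission is hereby granted at no cost" each produce six bigrams and share three of them, so the score is 2 * 3 / (6 + 6) = 0.5.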
In detail, the matching process:
- Reads in input text
- Normalizes everything it reasonably can -- Unicode characters, whitespace, quoting styles, etc. are all whittled down to something common.
- Lines that tend to change a lot in licenses, like "Copyright 20XX Some Person", are additionally removed.
- Tokenizes normalized text into a set of bigrams.
- In parallel, the bigram set is compared with all of the other sets askalono knows about.
- The resulting list is sorted, the top match identified, and result returned.
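The steps above can be sketched end to end using only the Rust standard library. This is an illustration under stated assumptions rather than askalono's real code or API: normalize, bigrams, dice, and best_match are hypothetical names, the normalization rules are a tiny subset chosen for the example, and one scoped thread per known license stands in for whatever parallelism the real tool uses.

```rust
use std::collections::HashSet;
use std::thread;

type Bigrams = HashSet<(String, String)>;

/// Hypothetical normalization: drop lines starting with "copyright",
/// lowercase, unify curly quotes, and collapse whitespace.
fn normalize(text: &str) -> String {
    let mut words: Vec<String> = Vec::new();
    for line in text.lines() {
        let lowered = line.to_lowercase();
        if lowered.trim_start().starts_with("copyright") {
            continue; // lines like "Copyright 20XX Some Person" vary per project
        }
        let unified: String = lowered
            .chars()
            .map(|c| match c {
                '\u{201C}' | '\u{201D}' => '"',
                '\u{2018}' | '\u{2019}' => '\'',
                other => other,
            })
            .collect();
        words.extend(unified.split_whitespace().map(str::to_owned));
    }
    words.join(" ")
}

/// Tokenize normalized text into a set of adjacent word pairs.
fn bigrams(normalized: &str) -> Bigrams {
    let words: Vec<&str> = normalized.split_whitespace().collect();
    words
        .windows(2)
        .map(|p| (p[0].to_owned(), p[1].to_owned()))
        .collect()
}

/// Sørensen–Dice coefficient over two bigram sets (0.0 if both are empty).
fn dice(a: &Bigrams, b: &Bigrams) -> f64 {
    let total = a.len() + b.len();
    if total == 0 {
        return 0.0;
    }
    2.0 * a.intersection(b).count() as f64 / total as f64
}

/// Score the input against every known (name, text) pair in parallel,
/// sort by descending score, and return the top match if there is one.
fn best_match(input: &str, known: &[(&str, &str)]) -> Option<(String, f64)> {
    let input_set = bigrams(&normalize(input));
    let mut scores: Vec<(String, f64)> = thread::scope(|scope| {
        let handles: Vec<_> = known
            .iter()
            .map(|&(name, text)| {
                let input_set = &input_set;
                scope.spawn(move || {
                    let candidate = bigrams(&normalize(text));
                    (name.to_owned(), dice(input_set, &candidate))
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("scoring thread panicked"))
            .collect()
    });
    scores.sort_by(|a, b| b.1.total_cmp(&a.1));
    scores.into_iter().next()
}
```

In this sketch the known texts are normalized on every call. Doing that work once ahead of time and storing the resulting sets is the motivation for a persisted cache.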
To optimize startup, askalono builds up a database of license texts (applying the same normalization techniques described above), and persists this data to a MessagePack'd & gzip'd cache file. This cache is loaded at startup, and is optionally embedded in the binary itself.
The name askalono means "shallot" in Esperanto. You could try to derive a hidden meaning from it, but the real reason is just that onions are delicious and Esperanto is an interesting language. In the author's opinion. (Sed la verkisto ne estas bonega Esperantisto, do bonvolu konversacii en la angla sur ĉi tiu projekto.)
How is this different from other solutions?
There are several other excellent projects in this space, including licensee, LiD, and ScanCode. These projects attempt to get a larger picture of a project's licensing, and can look at other sources of metadata to try to find answers. All of them inspired the creation of askalono, first as a curiosity, then as a serious project.
askalono focuses on the problem of matching text itself -- it's often the piece that is difficult to optimize for speed and accuracy. askalono could be seen as a piece of plumbing in a larger system. The askalono command line application includes other goodies, such as a directory crawler, but these are largely for quick once-off use before diving in with more systematic solutions. (If you're looking for such a solution, take a look at the projects I just mentioned!)
Where do the licenses come from?
License data is sourced directly from SPDX: https://github.com/spdx/license-list-data
askalono can parse the "json" format included in that repository to generate its cache.
At this time, askalono is not taking requests for additional licenses in its default dataset -- its dataset is SPDX's own.
Contributions are very welcome! See CONTRIBUTING for more info.
This library is licensed under the Apache 2.0 License.