Nostril: Nonsense String Evaluator

Nostril

Nostril is the Nonsense String Evaluator: a Python module that infers whether a given short string of characters is likely to be random gibberish or something meaningful.

Author: Michael Hucka
Code repository: https://github.com/casics/nostril
License: Unless otherwise noted, this content is licensed under the GPLv3 license.

🏁 Recent news and activities

May 2018: The JOSS paper has been published. Also, Nostril release 1.1.1 has a citable DOI: 10.22002/D1.935.

April 2018: Version 1.1.1 fixes the requirements.txt file so that it requires only minimum versions instead of doing exact version comparisons. The release also updates the documentation in docs/explanations. Other changes (which were in release 1.1.0) include a fix to setup.py to make automatic installation of dependencies work properly, updated installation instructions below, improvements to the JOSS paper, a change to the command-line program to use the more conventional -V instead of -v for printing the version, and internal code refactoring.

☀ Introduction

A number of research efforts have investigated extracting and analyzing textual information contained in software artifacts. However, source code files can contain meaningless text, such as random text used as markers or test cases, and code extraction methods can also sometimes make mistakes and produce garbled text. When used in processing pipelines without human intervention, it is often important to include a data cleaning step before passing tokens extracted from source code to subsequent analysis or machine learning algorithms. Thus, a basic (and often unmentioned) step is to filter out nonsense tokens.

Nostril is a Python 3 module that can be used to infer whether a given word or text string is likely to be nonsense or meaningful text. Nostril takes a text string and returns True if it is probably nonsense, False otherwise. Meaningful in this case means a string of characters that is probably constructed from real or real-looking English words or fragments of real words (even if the words are run togetherlikethis). The main use case is to decide whether short strings returned by source code mining methods are likely to be program identifiers (of classes, functions, variables, etc.), or random characters or other non-identifier strings. To illustrate, the following example code,

from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

produces the following output:

bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense

Nostril uses a combination of heuristic rules and a probabilistic assessment. It is not always correct (see below). It is tuned to reduce false positives: it is more likely to say something is not gibberish when it really might be. This is suitable for its intended purpose of filtering source code identifiers – a difficult problem, incidentally, because program identifiers often consist of acronyms and word fragments jammed together (e.g., "kBoPoMoFoOrderIdCID", "ioFlXFndrInfo", etc.), which can challenge even humans. Nevertheless, on the identifier strings from the Loyola University of Delaware Identifier Splitting Oracle, Nostril classifies over 99% correctly.

Nostril is reasonably fast: once the module is loaded, on a 4 GHz Apple computer running OS X 10.12, calling the evaluation function returns a result in 30–50 microseconds per string on average.

♥️ Please cite the Nostril paper and the version you use

Article citations are critical for academic developers. If you use Nostril and you publish papers about work that uses Nostril, please cite the Nostril paper:

Hucka, M. (2018). Nostril: A nonsense string evaluator written in Python. Journal of Open Source Software, 3(25), 596, https://doi.org/10.21105/joss.00596

Please also use the DOI to indicate the specific version you use, to improve other people's ability to reproduce your results. For example, Nostril release 1.1.1 has the DOI 10.22002/D1.935.

✺ Installation instructions

The following is probably the simplest and most direct way to install Nostril on your computer:

sudo pip3 install git+https://github.com/casics/nostril.git

Alternatively, you can clone this repository and then install Nostril from the local copy using pip:

git clone https://github.com/casics/nostril.git
cd nostril
sudo python3 -m pip install .

Both of these installation approaches should automatically install some Python dependencies that Nostril relies upon, namely plac, tabulate, humanize, and pytest.

► Using Nostril

The basic usage is very simple. Nostril provides a Python function named nonsense(). This function takes a single text string as an argument and returns a Boolean value as a result. Here is an example:

from nostril import nonsense
if nonsense('yoursinglestringhere'):
   print("nonsense")
else:
   print("real")

The Nostril source code distribution also comes with a command-line program called nostril. You can invoke the nostril command-line interface in two ways:

  1. Using the Python interpreter:
    python3 -m nostril
    
  2. On Linux and macOS systems, using the program nostril, which should be installed automatically by setup.py in a bin directory on your shell's command search path. Thus, you should be able to run it normally:
    nostril
    

The command-line program can take strings on the command line or (with the -f option) in a file, and will return nonsense-or-not assessments for each string. It can be useful for interactive testing and experimentation. For example:

# nostril bunchofwords xywinlist ioFlXFndrInfo lasaakldfalakj xyxyxyx
bunchofwords    [real]
xywinlist       [real]
ioFlXFndrInfo   [real]
lasaakldfalakj  [nonsense]
xyxyxyx         [nonsense]

Beware that the Nostril module takes a noticeable amount of time to load, and since the command-line program must reload the module anew each time, it is relatively slow as a means of using Nostril. (In normal usage, your program would only load the Python module once and not incur the loading time on every call.)

Nostril ignores numbers, spaces and punctuation characters embedded in the input string. This was a design decision made for practicality – it makes Nostril a bit easier to use. If, for your application, the presence of non-letter characters indicates that a string is definitely nonsense, then you may wish to test for that separately before passing the string to Nostril.
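
A separate test of this kind could look like the following minimal sketch. The helper names has_non_letters and strict_nonsense are hypothetical and not part of Nostril, and the sketch assumes that "letter" means an ASCII letter. The classifier is passed in as an argument so that the sketch does not depend on how Nostril is imported.

```python
import string


def has_non_letters(text):
    """Return True if text contains any character that is not an ASCII letter."""
    return any(ch not in string.ascii_letters for ch in text)


def strict_nonsense(text, classifier):
    """Label any string containing digits, spaces or punctuation as nonsense.

    `classifier` is assumed to behave like Nostril's nonsense() function:
    it takes a string and returns True if the string is probably nonsense.
    """
    if has_non_letters(text):
        return True
    return classifier(text)
```

With Nostril installed, the call would be strict_nonsense(s, nonsense) after "from nostril import nonsense".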

Please see the docs subdirectory for more information about Nostril and its operation.

🎯 Performance

You can verify the following results yourself by running the small test program tests/test.py. The following are the results on sets of strings that are all either real identifiers or all random/gibberish text:

                               Type of content         Results
Test case                      Meaningful  Gibberish   False pos.  False neg.  Accuracy
/usr/share/dict/web2           218,752     0           89          0           99.96%
Ludiso oracle                  2,540       0           6           0           99.76%
Auto-generated random strings  0           997,636     0           82,754      91.70%
Hand-written random strings    0           1,000       0           205         79.50%

In tests on real identifiers extracted from actual software source code mined by the author in another project, Nostril's performance is as follows:

                              Type of content         Results
Test case                     Meaningful  Gibberish   False pos.  False neg.  Precision  Recall
Strings mined from real code  4,261       364         6           5           98.36%     98.63%
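
The precision and recall figures can be reproduced from the counts in that row. The sketch below assumes that "nonsense" is the positive class, so a false positive is a meaningful string labeled as nonsense and a false negative is a gibberish string labeled as real. The function name precision_recall is made up for this example.

```python
def precision_recall(gibberish_total, false_pos, false_neg):
    """Compute precision and recall with 'nonsense' as the positive class."""
    # Gibberish strings correctly labeled as nonsense.
    true_pos = gibberish_total - false_neg
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / gibberish_total
    return precision, recall


# Counts taken from the "Strings mined from real code" row.
precision, recall = precision_recall(gibberish_total=364, false_pos=6, false_neg=5)
print('precision: {:.2%}, recall: {:.2%}'.format(precision, recall))
# prints "precision: 98.36%, recall: 98.63%"
```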

⚠️ Limitations

Nostril is not fool-proof; it will generate some false positives and false negatives. This is an unavoidable consequence of the problem domain: without direct knowledge, even a human cannot recognize a real text string in all cases. Nostril's default trained system puts emphasis on reducing false positives (i.e., reducing how often it mistakenly labels something as nonsense) rather than false negatives, so it will sometimes report that something is not nonsense when it really is.

A vexing result is that this system performs worse on supposedly "random" strings typed by a human. I hypothesize this is because those strings may be less random than they seem: if someone is asked to type junk at random on a QWERTY keyboard, they are likely to use a lot of characters from the home row (a-s-d-f-g-h-j-k-l), and those actually turn out to be rather common in English words. In other words, what we think of as strings "typed at random" on a keyboard are actually not that random, and probably have statistical properties similar to those of real words. These cases are hard for Nostril, but thankfully, in real-world situations, they are rare. This view is supported by the fact that Nostril's performance is much better on statistically random text strings generated by software.
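
The home-row observation can be explored with a small helper. This is only an illustrative sketch and is not part of Nostril; the name home_row_fraction is made up for this example, and a high fraction by itself says nothing definite about whether a string is nonsense.

```python
# Letters on the QWERTY home row, as listed above.
HOME_ROW = set('asdfghjkl')


def home_row_fraction(text):
    """Return the fraction of letters in text that lie on the QWERTY home row."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in HOME_ROW for ch in letters) / len(letters)


print(home_row_fraction('lasaakldfalakj'))  # keyboard mashing → 1.0
print(home_row_fraction('alfalfa'))         # a real English word → 1.0
```

Both the mashed string and the real word consist entirely of home-row letters, which is the kind of overlap that makes hand-typed junk hard to separate from real text.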

Nostril has been trained using American English words, and is unlikely to work for other languages unchanged. However, the underlying framework may work if it is retrained on different sample inputs. Nostril uses n-grams coupled with a custom TF-IDF weighting scheme. See the subdirectory training for the code used to train the system.
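
For readers unfamiliar with these terms, the following sketch shows character n-grams and the textbook inverse document frequency (IDF) computed over a word list. It is not Nostril's custom weighting scheme, which is defined by the code in the training subdirectory. The function names and the choice of n=3 are assumptions made only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams in text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def inverse_document_frequency(words, n=3):
    """Map each n-gram to log(N / df), treating each word as one document."""
    doc_freq = Counter()
    for word in words:
        # Count each n-gram at most once per word.
        doc_freq.update(set(char_ngrams(word, n)))
    total = len(words)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}


print(char_ngrams('getint'))  # → ['get', 'eti', 'tin', 'int']
```

In this textbook form, n-grams that occur in many words receive a weight close to zero and rare n-grams receive a larger weight.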

Finally, the algorithm does not perform well on very short text, and by default Nostril imposes a lower length limit of 6 characters – strings must be at least 6 characters long or else it will raise an exception.
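
If your input may contain short strings, one option is to check the length before calling Nostril instead of handling the exception. The sketch below is an assumption-laden example, not part of Nostril: the names count_letters and nonsense_or_default are hypothetical, only ASCII letters are counted (because Nostril ignores digits, spaces and punctuation), and the default threshold of 6 lets 6-letter strings through because the 6-letter identifier "getint" in the introduction is classified without error.

```python
import string


def count_letters(text):
    """Count the ASCII letters in text, ignoring digits, spaces and punctuation."""
    return sum(1 for ch in text if ch in string.ascii_letters)


def nonsense_or_default(text, classifier, min_letters=6, default=None):
    """Classify text, or return `default` if it has too few letters.

    `classifier` is assumed to behave like Nostril's nonsense() function.
    `min_letters` is an assumed threshold; adjust it if your installed
    version of Nostril uses a different limit.
    """
    if count_letters(text) < min_letters:
        return default
    return classifier(text)
```

With Nostril installed, the call would be nonsense_or_default(s, nonsense) after "from nostril import nonsense"; a result of None then means the string was too short to evaluate.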

📚 More information

Please see the docs subdirectory for more information.

⁇ Getting help and support

If you find an issue, please submit it in the GitHub issue tracker for this repository.

♬ Contributing — info for developers

Any constructive contributions – bug reports, pull requests (code or documentation), suggestions for improvements, and more – are welcome. Please feel free to contact me directly, or even better, jump right in and use the standard GitHub approach of forking the repo and creating a pull request.

Everyone is asked to read and respect the code of conduct when participating in this project.

❤️ Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.