Specification of the commonsense text format used for resisting authorship analysis

Commonsense

This is the specification for the commonsense text format used for resisting authorship analysis. A conforming text should be more difficult to attribute to its author through stylistic means, which is essential for anonymous publishing. The canonical implementation of a validator for this spec may be found here. The format is named after Common Sense, the pamphlet published anonymously by Thomas Paine in 1776.

We aim to prevent an author from being identified from published text, both today and in the future.

Quick Validation

$ gem install commonsense
$ commonsense thomas_paine.txt

Examples

Meeting at public square. Power to every person.


writing idea

This day I am feeble because all political material is a record but tomorrow I am able to put writing in public as a secret.

Specification (v1.0)

The reference implementation of this specification may be found here. In case of conflict/ambiguity, the implementation should take precedence.

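Note: all regular expressions are Ruby regexps and must match the whole string they are applied to, i.e. as if anchored at both ends. In Ruby, ^ and $ anchor at line boundaries, so \A and \z are the anchors that give a full-string match.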
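
The full-match requirement can be illustrated with a short sketch. The helper name full_match? is hypothetical and is not taken from the reference implementation; it simply wraps a pattern in Ruby's string anchors.

```ruby
# Hypothetical helper (an assumption, not part of the reference implementation):
# true only when the pattern matches the entire string.
def full_match?(pattern, string)
  /\A(?:#{pattern.source})\z/.match?(string)
end

sentence = /[a-zA-Z0-9 ]+\./

full_match?(sentence, "Power to every person.")   # => true
full_match?(sentence, "Power to every person. ")  # => false (trailing space)

# ^ and $ anchor at line boundaries, so a multi-line string can slip through.
/^[a-z]+$/.match?("abc\n!!!")    # => true, because the first line matches
/\A[a-z]+\z/.match?("abc\n!!!")  # => false, because the whole string does not
```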

Definitions

  • A candidate text is a UTF-8 encoded text stream.
  • A nice character is /[a-zA-Z0-9 \.\n]/
  • A line is a maximal substring of the candidate text that does not contain the \n character
  • A sentence character is /[a-zA-Z0-9 ]/
  • A word character is /[a-zA-Z0-9]/
  • A word is a maximal substring of word characters
  • The word list is the list of words returned by the words method of the reference implementation
  • The capitalized word list is derived from the word list, with each word capitalized
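
As a rough sketch, the definitions above might be written in Ruby as follows. The method names lines_of and words_of and the tiny SAMPLE_WORD_LIST are assumptions made for illustration; the real word list comes from the words method of the reference implementation, and its capitalization rule may differ from String#capitalize.

```ruby
# Character classes copied from the definitions above.
NICE_CHARACTER     = /[a-zA-Z0-9 \.\n]/
SENTENCE_CHARACTER = /[a-zA-Z0-9 ]/
WORD_CHARACTER     = /[a-zA-Z0-9]/

# Lines: maximal substrings containing no "\n".
# The -1 limit keeps trailing empty lines instead of dropping them.
def lines_of(text)
  text.split("\n", -1)
end

# Words: maximal runs of word characters.
def words_of(text)
  text.scan(/[a-zA-Z0-9]+/)
end

# Hypothetical stand-in for the reference implementation's word list.
SAMPLE_WORD_LIST = %w[meeting at public square power to every person].freeze

# Capitalized word list: each word of the word list, capitalized.
SAMPLE_CAPITALIZED_WORD_LIST = SAMPLE_WORD_LIST.map(&:capitalize).freeze
```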

Criteria

A candidate text conforms to the commonsense spec if and only if it meets all the following criteria:

  1. Characters: The text contains only nice characters
  2. Whitespace: Each line has no beginning/trailing spaces and contains no consecutive space characters
  3. Lines: Each line is either empty, a heading or a sentence list
  4. Headings: A heading is /S+/, where S is any sentence character
  5. Sentences: A sentence is /S+\./, where S is any sentence character
  6. Sentence Lists: A sentence list is one or more consecutive sentences, each separated from the next by a single space character
  7. First Words: The first word of each sentence must be in the capitalized word list
  8. Other Words: All words not covered by (7) must be in the word list
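
The criteria above can be sketched as a small validator. This is a minimal illustration under stated assumptions, not the reference implementation: WORD_LIST is a hypothetical stand-in for the real word list, the method names are invented for this example, and heading words are checked against the plain word list because criterion 7 mentions only sentences.

```ruby
# Hypothetical stand-in; the real list comes from the reference implementation.
WORD_LIST = %w[meeting at public square power to every person writing idea].freeze
CAPITALIZED_WORD_LIST = WORD_LIST.map(&:capitalize).freeze

HEADING       = /[a-zA-Z0-9 ]+/                 # criterion 4
SENTENCE      = /[a-zA-Z0-9 ]+\./               # criterion 5
SENTENCE_LIST = /#{SENTENCE}(?: #{SENTENCE})*/  # criterion 6

# True only when the pattern matches the entire string.
def full_match?(pattern, string)
  /\A(?:#{pattern.source})\z/.match?(string)
end

# Criteria 7 and 8: check every word of a heading or a single sentence.
def words_ok?(fragment, first_capitalized:)
  fragment.scan(/[a-zA-Z0-9]+/).each_with_index.all? do |word, index|
    list = (index.zero? && first_capitalized) ? CAPITALIZED_WORD_LIST : WORD_LIST
    list.include?(word)
  end
end

def conforms?(text)
  # Criterion 1: only nice characters.
  return false unless full_match?(/[a-zA-Z0-9 .\n]*/, text)

  text.split("\n", -1).all? do |line|
    next true if line.empty? # criterion 3: empty lines are allowed
    # Criterion 2: no leading, trailing, or consecutive spaces.
    next false if line.start_with?(" ") || line.end_with?(" ") || line.include?("  ")

    if full_match?(HEADING, line)
      words_ok?(line, first_capitalized: false)
    elsif full_match?(SENTENCE_LIST, line)
      line.split(".").all? { |sentence| words_ok?(sentence, first_capitalized: true) }
    else
      false
    end
  end
end
```

With this sample word list, conforms?("Meeting at public square. Power to every person.") and conforms?("writing idea") both return true, while a double space, a lowercase sentence opener, or a character such as "!" makes the text fail.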

Attack Vectors

Possible vulnerabilities include the following, each with an example that passes validation:

  1. Semantic: the author betrays themselves through the meaning of the text, e.g. revealing location ("I am in the south")
  2. Format Artefacts: distinctive formatting shared between texts, e.g. use of punctuation ("I. Protest. Government.")
  3. Stylistic: text style properties, e.g. dialect ("I like protest government. I like seem political.")
  4. Metadata: data outside our control, e.g. publication timestamps ("I protest government.")

Mitigation

  1. Semantic: not in the project scope.
  2. Format Artefacts: we do a pretty good job of mitigating these by completely preventing things such as double-spaced sentences and trailing whitespace.
  3. Stylistic: our severe restrictions on formatting and characters help to a large extent. Vertical spacing between paragraphs and sentence length are outside our control.
  4. Metadata: not in the project scope.

Todo

  • expand word list
  • gain experience using the specification in a real-world setting
  • revise (possibly antiquated) word list
  • allow something like "conforms to commonsense vX.Z" at end of text
  • investigate stylometric methods
  • run analyses on validated texts

Further Reading

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/beneills/commonsense-spec