This is the specification for the commonsense text format, which is designed to resist authorship analysis. A conforming text should be more difficult to attribute to its author through stylistic means, which is essential for anonymous publishing. The canonical implementation of a validator for this spec may be found here. The format is named after Common Sense, the pamphlet published anonymously by Thomas Paine in 1776.
We aim to prevent the identification of an author from published text, both today and in the future.
$ gem install commonsense
$ commonsense thomas_paine.txt
Meeting at public square. Power to every person.
writing idea
This day I am feeble because all political material is a record but tomorrow I am able to put writing in public as a secret.
The reference implementation of this specification may be found here. In case of conflict/ambiguity, the implementation should take precedence.
Note: all regular expressions are Ruby regexps and must match the whole string they are applied to (i.e. as if wrapped in beginning/end anchors: ^ and $).
- A candidate text is a UTF-8 encoded text stream.
- A nice character is /[a-zA-Z0-9 \.\n]/
- A line is a maximal substring of the candidate text that does not contain the \n character
- A sentence character is /[a-zA-Z0-9 ]/
- A word character is /[a-zA-Z0-9]/
- A word is a maximal substring of word characters
- The word list is the list of words returned by the words method of the reference implementation
- The capitalized word list is derived from the word list, with each word capitalized
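The definitions above can be sketched in Ruby as follows. This is a minimal illustration, not the reference implementation: the constant and method names (`NICE_CHAR`, `full_match?`, `lines_of`, `words_of`) are assumptions made for this example. Because `^` and `$` match at line boundaries in Ruby, the sketch uses `\A` and `\z` to require a full-string match.

```ruby
# Character classes from the definitions (names are illustrative only).
NICE_CHAR     = /[a-zA-Z0-9 .\n]/
SENTENCE_CHAR = /[a-zA-Z0-9 ]/
WORD_CHAR     = /[a-zA-Z0-9]/

# Full match: the pattern must cover the entire string.
# \A and \z anchor to the string, whereas ^ and $ anchor to lines in Ruby.
def full_match?(pattern, string)
  /\A(?:#{pattern.source})\z/.match?(string)
end

# Lines: maximal substrings that contain no "\n".
# The -1 limit keeps trailing empty lines instead of dropping them.
def lines_of(text)
  text.split("\n", -1)
end

# Words: maximal runs of word characters.
def words_of(text)
  text.scan(/[a-zA-Z0-9]+/)
end
```

For example, `words_of("I protest. Now")` splits on the period and the spaces, since neither is a word character.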
A candidate text conforms to the commonsense spec if and only if it meets all the following criteria:
- Characters: The text contains only nice characters
- Whitespace: Each line has no beginning/trailing spaces and contains no consecutive space characters
- Lines: Each line is either empty, a heading or a sentence list
- Headings: A heading is /S+/, where S is any sentence character
- Sentences: A sentence is /S+\./, where S is any sentence character
- Sentence Lists: sentence lists are consecutive sentences, separated by a single space character
- First Words: The first word of each sentence should be in the capitalized word list
- Other Words: All words not covered by the First Words criterion should be in the word list
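The criteria above can be combined into a small checker. The sketch below is an illustration under stated assumptions, not the reference implementation: `WORDS` is a tiny hypothetical stand-in for the real word list, and the method names (`whitespace_ok?`, `words_ok?`, `conforms?`) are invented for this example.

```ruby
require "set"

# Hypothetical stand-in for the reference implementation's word list.
WORDS = Set.new(%w[I am in the south protest government meeting at public square])
CAPITALIZED_WORDS = Set.new(WORDS.map(&:capitalize))

# A heading is one or more sentence characters.
HEADING = /\A[a-zA-Z0-9 ]+\z/
# A sentence list is one or more sentences separated by a single space.
SENTENCE_LIST = /\A[a-zA-Z0-9 ]+\.(?: [a-zA-Z0-9 ]+\.)*\z/

# Whitespace: no leading, trailing, or consecutive spaces.
def whitespace_ok?(line)
  !line.start_with?(" ") && !line.end_with?(" ") && !line.include?("  ")
end

# First Words / Other Words: the first word of each sentence must be in the
# capitalized list; every other word (including heading words) in the word list.
def words_ok?(line)
  if HEADING.match?(line)
    line.split(" ").all? { |w| WORDS.include?(w) }
  else
    line.split(". ").all? do |sentence|
      first, *rest = sentence.delete(".").split(" ")
      CAPITALIZED_WORDS.include?(first) && rest.all? { |w| WORDS.include?(w) }
    end
  end
end

def conforms?(text)
  # Characters: only nice characters anywhere in the text.
  return false unless text.match?(/\A[a-zA-Z0-9 .\n]*\z/)

  text.split("\n", -1).all? do |line|
    next true if line.empty? # Lines: empty lines are allowed
    whitespace_ok?(line) &&
      (HEADING.match?(line) || SENTENCE_LIST.match?(line)) &&
      words_ok?(line)
  end
end
```

With this stand-in word list, `conforms?("I protest government.")` holds, while `conforms?("I  protest.")` fails the whitespace criterion. Where this sketch and the reference implementation disagree, the reference implementation takes precedence.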
Possible vulnerabilities include (with example passing validation):
- Semantic: author betrays themself in meaning of text, e.g. revealing location ("I am in the south")
- Format Artefacts: common unique formatting between texts, e.g. use of punctuation ("I. Protest. Government.")
- Stylistic: text style properties, e.g. dialect ("I like protest government. I like seem political.")
- Metadata: data outside our control, e.g. publication timestamps ("I protest government.")
Mitigations for each of the vulnerabilities above:
- Semantic: Not in the project scope.
- Format Artefacts: We do a pretty good job of mitigating formatting artefacts, by completely preventing things such as double-spaced sentences and trailing whitespace.
- Stylistic: Our severe restrictions on formatting and characters help to a large extent. Vertical spacing between paragraphs and sentence length are outside our control.
- Metadata: Not in the project scope.
- expand word list
- gain experience of using specification in real-world setting
- revise (possibly antiquated) wordlist
- allow something like "conforms to commonsense vX.Z" at end of text
- investigate stylometric methods
- run analyses on validated texts
Bug reports and pull requests are welcome on GitHub at https://github.com/beneills/commonsense-spec