Analyzing words Redditors aren't sure how to spell

Analyzing indicators of spelling uncertainty such as (sp?) in Reddit comments.


  • token_counts.csv: maps each lowercase token to the number of times that token was seen followed by an (sp?) marker in comments
  • examples_in_context.csv: a sample of relevant comments, with 3 columns: comment id, token (the word followed by the spelling marker), and the full body of the comment. Restricted to tokens with count >= 10. Grepping this file is useful for debugging.
  • clustering.ipynb: Groups the tokens in token_counts.csv into clusters of alternative (mis)spellings of the same word.
  • treemaps.ipynb: Generates visualizations of the clusters induced by clustering.ipynb.
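A minimal sketch of loading the token counts, assuming token_counts.csv has a header row and two columns, token and count (the real file's exact layout may differ; the sample rows here are made up):

```python
import csv
import io

# Stand-in for open("token_counts.csv") -- the rows below are
# illustrative assumptions, not real data from the repo.
sample = io.StringIO("token,count\nbechamel,812\nfuchsia,411\n")

counts = {row["token"]: int(row["count"]) for row in csv.DictReader(sample)}
top = sorted(counts.items(), key=lambda kv: -kv[1])  # most-marked tokens first
```
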


Data collection

My data was gathered using the following query on the Reddit comments dataset hosted on Google BigQuery:

SELECT id, REGEXP_EXTRACT(body, r"(?m:^[^&].* ([0-9A-Za-z\047-]+) ?\([sS][pP](?:elling)?\??\))") 
  token, body
FROM [fh-bigquery:reddit_comments.all]

This matches a word, an optional space, and an indicator of spelling uncertainty such as any of the following: (sp?), (SP?), (sp), (spelling?), (Spelling).

(We require that the line on which the match occurs doesn't start with &. This avoids lines that start with > (encoded as &gt; in the dataset), so that we don't double-count lines that are quoted in comment replies.)
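The same pattern can be tried out locally with Python's re module (assuming RE2 and Python's re agree on this pattern; \047 in the original is just the octal escape for an apostrophe, written literally here):

```python
import re

# Python port of the BigQuery RE2 pattern above.
PATTERN = re.compile(
    r"^[^&].* ([0-9A-Za-z'-]+) ?\([sS][pP](?:elling)?\??\)",
    re.MULTILINE,
)

def extract_token(body):
    """Return the word preceding a spelling-uncertainty marker, or None."""
    m = PATTERN.search(body)
    return m.group(1) if m else None
```

For example, extract_token("I made a bechamel (sp?) sauce") returns "bechamel", while a quoted line beginning with &gt; yields no match.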


To generate my final list of top tokens, I did some manual work to weed out false positive spelling indicators, e.g.:

  • abbreviations: skill points (sp), Sleep paralysis (SP), the Samajwadi Party (SP)
  • use-mention distinction: I love how he put the (sp) in there
  • tokens that belong to a wide variety of multi-word phrases: ____ Island, ____ disease, ____ acid
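That weeding step can be sketched as a simple filter, assuming a hand-maintained blocklist (the tokens listed here are illustrative; the actual special cases live in clustering.ipynb):

```python
# Hypothetical blocklist built from manual review of examples_in_context.csv;
# the specific tokens and the min_count cutoff of 10 are assumptions based on
# the false-positive categories described above.
FALSE_POSITIVE_TOKENS = {"points", "paralysis", "party", "island", "disease", "acid"}

def keep_token(token, count, min_count=10):
    """Drop rare tokens and known false-positive spelling indicators."""
    return count >= min_count and token.lower() not in FALSE_POSITIVE_TOKENS
```
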


I clustered together word forms based on Levenshtein(sp?) distance, normalized by length. Again, you can see some hand-tuned special cases in clustering.ipynb that avoid false-positive cluster associations, e.g. 'bechamel' is not a misspelling of 'deschanel'.
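The core idea can be sketched as follows, assuming a greedy single-link grouping with a normalized-distance cutoff (the notebook's exact threshold and linkage strategy may differ):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def norm_distance(a, b):
    """Edit distance normalized by the longer token's length."""
    return levenshtein(a, b) / max(len(a), len(b))

def cluster(tokens, threshold=0.3):
    """Greedily group tokens whose normalized distance to any existing
    cluster member is within the threshold (threshold is an assumption)."""
    clusters = []
    for tok in tokens:
        for c in clusters:
            if any(norm_distance(tok, member) <= threshold for member in c):
                c.append(tok)
                break
        else:
            clusters.append([tok])
    return clusters
```

For instance, 'recieve' and 'receive' differ by 2 edits out of 7 characters (normalized distance ≈ 0.29), so they land in the same cluster, while an unrelated token stays in its own.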