The way we speak and the words we choose can be greatly affected by our social influences. Technical communities develop their own niche vocabulary, while teenagers may adopt new words that seem entirely foreign to their parents. The goal of this project is to study how new words or jargon begin, and how they spread across communities into the wider lexicon.
For example, words such as “lol”, “pwn”, or even “crypto” all started as niche acronyms, misspellings, or jargon, but have become widely adopted. How does this happen? And what statistical dynamics govern this process?
This project aims to understand jargon and how it evolves on the internet. We're curious to see how new acronyms, jargon, lingo, or misspellings move from one community to another. To do this, we're building algorithms to detect jargon and applying them to a massive corpus of every single comment ever made on Reddit.
- Raw data are scraped from zip files found here
  - Files can be downloaded using `Download_All_Files.ipynb`
- Each month of comments is then parsed to extract only the raw comments (removing comment metadata)
  - Parsing is done using `Split_corpus.ipynb`
  - Note to self: move the file from the SSD to the project folder, unzip it, then run `Split_corpus.ipynb`
- Individual words are counted using `extract_word_count.sh`
  - E.g. `sh extract_word_count.sh RC_2005-12`
  - Note to self: after getting the word counts, move the files back to the SSD
- Once word counts are tallied, we can begin analyzing the data with `Word_Count_Analysis.ipynb` or `Word_Count-TFIDF.ipynb`
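
To make the analysis step a little more concrete, here is a minimal sketch of plain TF-IDF computed over tallied word counts. It is not the code in the notebooks: the function name `tf_idf`, the choice to treat each community as one "document", and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(counts_by_group):
    """Score each word in each group by term frequency times inverse document frequency.

    counts_by_group maps a group name (for example a subreddit or a month)
    to a Counter of word -> raw count. Each group is treated as one document.
    """
    n_groups = len(counts_by_group)

    # Document frequency: in how many groups does each word appear at all?
    doc_freq = Counter()
    for counts in counts_by_group.values():
        doc_freq.update(counts.keys())

    scores = {}
    for group, counts in counts_by_group.items():
        total = sum(counts.values())
        if total == 0:
            scores[group] = {}  # nothing to score in an empty group
            continue
        scores[group] = {
            word: (count / total) * math.log(n_groups / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Toy counts, invented for illustration only (not project data).
toy = {
    "gaming": Counter({"the": 50, "pwn": 5, "lol": 10}),
    "finance": Counter({"the": 60, "crypto": 8}),
    "news": Counter({"the": 70, "lol": 2}),
}
scores = tf_idf(toy)
top = max(scores["gaming"], key=scores["gaming"].get)
print(top)
```

A word that appears in every group (like "the" here) gets a score of zero, while a word concentrated in one community floats to the top, which is roughly the behavior a jargon detector wants.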
Work in Progress: Currently working on a single pipeline for the above ^
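
As a rough idea of what the parse and count steps could look like when chained together, the sketch below strips comment metadata and tallies words in one pass. It assumes each line of an unzipped monthly file is one JSON object with the comment text in a `body` field, and that `[deleted]` and `[removed]` placeholders should be skipped; the function names and the tokenizer regex are hypothetical and are not taken from `Split_corpus.ipynb` or `extract_word_count.sh`.

```python
import json
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def extract_bodies(lines):
    """Yield only the comment text, dropping every other metadata field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the month
        if not isinstance(record, dict):
            continue
        body = record.get("body")
        if isinstance(body, str) and body not in ("[deleted]", "[removed]"):
            yield body


def count_words(bodies):
    """Tally lowercase tokens across all comment bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(TOKEN_RE.findall(body.lower()))
    return counts


# Invented sample lines standing in for one month of comments.
sample = [
    '{"author": "a", "subreddit": "gaming", "body": "lol I got pwned. LOL"}',
    '{"author": "b", "subreddit": "gaming", "body": "[deleted]"}',
    "not json",
]
counts = count_words(extract_bodies(sample))
print(counts.most_common(3))
```

Because `extract_bodies` accepts any iterable of lines, an open file handle for a month such as `RC_2005-12` could be passed in directly, so the whole file never has to sit in memory.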
Next steps:
- pull in December data from 2005 to 2018
  - currently have 2005-2014 and 2016
- top 20 words for top 20 subs, Spearman correlation across 2017/2018 (i.e., "how stable is our tf-idf++ measure?")
- figures for selected words (maybe OED new additions?)
- write-up
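
For the stability check in the next steps, one possible shape for the Spearman comparison is sketched below. It ranks the words shared by two years by score and correlates the ranks. The helper names and the scores are made up for illustration, and since the tf-idf++ measure is not defined here, any per-word score can be plugged in.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (sx * sy)


def score_stability(scores_a, scores_b):
    """Correlate the scores of the words present in both periods."""
    shared = sorted(set(scores_a) & set(scores_b))
    return spearman([scores_a[w] for w in shared], [scores_b[w] for w in shared])


# Hypothetical scores for one subreddit in two years (not real results).
scores_2017 = {"lol": 0.9, "pwn": 0.4, "crypto": 0.7, "yeet": 0.1}
scores_2018 = {"lol": 0.8, "pwn": 0.2, "crypto": 0.9, "yeet": 0.3}
rho = score_stability(scores_2017, scores_2018)
print(round(rho, 3))
```

A rho near 1 would mean the word ranking barely moved between the two years, while a value near 0 would suggest the measure is not stable. If SciPy is available, `scipy.stats.spearmanr` computes the same statistic.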