How does language evolve on the internet?

The way we speak and the words we choose can be greatly affected by our social influences. Technical communities develop their own niche vocabulary, while teenagers may adopt new words that seem entirely foreign to their parents. The goal of this project is to study how new words or jargon begin, and how they spread across communities into the wider lexicon.

For example, words such as “lol”, “pwn”, or even “crypto” all started as niche acronyms, misspellings, or jargon, but have become widely adopted. How does this happen? And what statistical dynamics govern this process?

This project aims to understand jargon and how it evolves on the internet. We're curious to see how new acronyms, jargon, lingo, or misspellings move from one community to another. To do this, we're building algorithms to detect jargon and applying them to a massive corpus of every single comment ever made on Reddit.
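As a rough illustration of what "detecting jargon" can mean, the sketch below scores each word by how concentrated it is in one community relative to the others, using a plain tf-idf with each community treated as one document. This is not the exact measure used in the notebooks (the "tf-idf++" variant is not described here); the function name `tfidf_by_community` and the toy counts are made up for the example.

```python
import math
from collections import Counter


def tfidf_by_community(counts):
    """Score words per community with a plain tf-idf.

    counts: dict mapping community name -> Counter of word counts.
    Each community is treated as a single "document".
    """
    n_communities = len(counts)

    # Document frequency: in how many communities does each word appear?
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())

    scores = {}
    for community, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[community] = {
            # term frequency * inverse document frequency
            word: (n / total) * math.log(n_communities / doc_freq[word])
            for word, n in word_counts.items()
        }
    return scores


# Toy counts, invented purely for illustration.
toy = {
    "gaming": Counter({"the": 50, "pwn": 5, "game": 20}),
    "finance": Counter({"the": 60, "crypto": 8, "market": 15}),
    "cooking": Counter({"the": 40, "recipe": 12, "game": 1}),
}
scores = tfidf_by_community(toy)
ranked_gaming = sorted(scores["gaming"], key=scores["gaming"].get, reverse=True)
print(ranked_gaming)
```

A word that appears in every community (here "the") gets a score of zero, while words confined to one community score higher the more often they are used there.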

Nuts and Bolts

Preprocessing the Data

Raw data are scraped from zip files found here
- files can be downloaded using Download_All_Files.ipynb
Each month of comments is then parsed to extract only the raw comments (removing comment metadata)
- Parsing done using Split_corpus.ipynb
- note to self: move file from SSD to project folder, unzip, then run split corpus
Individual words are counted using extract_word_count.sh
- E.g. sh extract_word_count.sh RC_2005-12
- note to self: after getting word count, move files back to SSD
Once word counts are tallied, we can begin analyzing the data with Word_Count_Analysis.ipynb or Word_Count_TFIDF.ipynb
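The parsing and counting steps above can be sketched in a few lines of Python. This is only an illustration, not what Split_corpus.ipynb or extract_word_count.sh actually do: it assumes each line of a monthly dump is one JSON object with the comment text in a `body` field, and the tokenizer regex and the helper names `iter_comment_bodies` and `count_words` are choices made for this example.

```python
import json
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def iter_comment_bodies(lines, field="body"):
    """Yield only the comment text from newline-delimited JSON records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the month
        if isinstance(record, dict) and isinstance(record.get(field), str):
            yield record[field]


def count_words(bodies):
    """Tally lowercase tokens across all comment bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(TOKEN_RE.findall(body.lower()))
    return counts


# Made-up sample records standing in for one month of comments.
sample = [
    '{"author": "a", "subreddit": "gaming", "body": "lol that was a pwn"}',
    '{"author": "b", "subreddit": "gaming", "body": "LOL, again"}',
    "not json",
]
word_counts = count_words(iter_comment_bodies(sample))
print(word_counts.most_common(3))
```

For a real month, `sample` would be replaced by an open file handle (for example `open("RC_2005-12")`), which is iterated line by line so the whole file never has to fit in memory.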

Work in Progress: Currently working on a single pipeline for the above ^

Next steps:

pull in December data from 2005 to 2018
- currently have 2005-2014 and 2016
top 20 words for top 20 subs, Spearman correlation across 2017/2018 (i.e., "how stable is our tf-idf++ measure?")
figures for selected words (maybe OED new additions?)
write-up
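For the stability check in the list above, a Spearman rank correlation between two years of scores could look like the sketch below. It is a minimal pure-Python version (Pearson correlation of average ranks, so ties are handled); the scores are invented for illustration and the helper names are hypothetical. In practice `scipy.stats.spearmanr` does the same job.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (sx * sy)


# Invented scores for one subreddit in two different years.
scores_2017 = {"pwn": 0.9, "lol": 0.7, "crypto": 0.4, "meme": 0.2}
scores_2018 = {"pwn": 0.8, "lol": 0.75, "crypto": 0.1, "meme": 0.3}

# Compare only words that have a score in both years.
shared = sorted(set(scores_2017) & set(scores_2018))
rho = spearman([scores_2017[w] for w in shared],
               [scores_2018[w] for w in shared])
print(round(rho, 3))
```

A value near 1 would suggest the ranking of words is stable from one year to the next; values near 0 would suggest the measure is noisy or the vocabulary is shifting.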
