# Misspelling Analysis

We noticed immediately that the vocabulary of our sampled Twitter dataset was much larger than that of the Wikipedia social network. The additional vocabulary seemed to come from the Twitter habit of misspelling words, either intentionally for ironic or humorous effect, or to express how long a sound gets drawn out, as in "aaaaahhhhh" or "uhh ohhhhh" and the hundreds of other possible variations on those sounds.

Contrast this with Wikipedia's social network, where, at the time of data collection, the discourse seemed to consist of a much more formal style of communication, similar to email correspondence.

For this reason, we hoped to find a network cluster with a low "misspelling rate", using that rate as a proxy for a formal communication style similar to that of the Wikipedia dataset.
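The drawn-out spellings described above can be flagged with a simple pattern before any dictionary lookup. The following is a minimal sketch, not part of the original analysis: the names `is_elongated` and `squeeze` are hypothetical, and the threshold of three repeated characters is an assumption.

```python
import re

# Three or more repeats of the same character, e.g. the "aaaaa" in "aaaaahhhhh".
# The threshold of 3 is an assumption; ordinary English words rarely exceed 2.
ELONGATION = re.compile(r"(.)\1{2,}")

def is_elongated(word: str) -> bool:
    """Return True if any character repeats three or more times in a row."""
    return ELONGATION.search(word) is not None

def squeeze(word: str) -> str:
    """Collapse each run of 3+ identical characters to a single character."""
    return ELONGATION.sub(r"\1", word)

for token in ["aaaaahhhhh", "ohhhhh", "hello", "book"]:
    print(token, is_elongated(token), squeeze(token))
```

Squeezing elongated tokens first might let many of the "hundreds of variations" collapse to a handful of forms, though whether that would change the per-cluster results is untested.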

In [12]:
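import pandas as pd
from autocorrect import Speller
from nltk import download, word_tokenize
from nltk.corpus import stopwords

# On a fresh environment the NLTK data must be fetched once, e.g.
# download('stopwords') and download('punkt').
spell = Speller(lang='en')
stop_words = set(stopwords.words('english'))  # set for fast membership tests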

In [3]:
def is_mispelling(word):
    return spell(word) != word

In [2]:
def number_of_mispellings(words):
    # Count the tokens that the spell checker would change (0 for an empty list).
    return sum(1 for word in words if is_mispelling(word))

In [7]:
def preprocess_word(doc):
    doc = doc.lower()  # Lowercase the text.
    doc = word_tokenize(doc)  # Split into tokens.
    doc = [w for w in doc if w not in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    # Drop 'n' and 'br' tokens, presumably residue of newlines and <br> tags.
    doc = [w for w in doc if w not in ('n', 'br')]
    return doc

### Warning: this cell takes a *long* time to run. Like a whole day.

In [8]:
# generate Misspelling.csv
sample_tweets_df = pd.read_csv('sample_with_label_and_clusters.csv')
sample_tweets_df['TweetTextNew'] = sample_tweets_df['TweetText'].apply(preprocess_word)
sample_tweets_df['MisspelledWordsCount'] = sample_tweets_df['TweetTextNew'].apply(number_of_mispellings)
sample_tweets_df['MisspellRate'] = sample_tweets_df['MisspelledWordsCount']/sample_tweets_df['WordsCount']
# remove NaNs (e.g. 0/0 when a tweet has no counted words)
sample_tweets_df = sample_tweets_df.dropna(subset=['MisspellRate'])
sample_tweets_df[['TweetID','MisspellRate','MisspelledWordsCount','WordsCount','K5','K8','K10','K12']].to_csv('data/Misspelling.csv')
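Since the same words recur across many tweets, one possible way to reduce this runtime is to cache the spell-check result per distinct word. The sketch below shows the idea with `functools.lru_cache`; `slow_spell` is a hypothetical stand-in for the real `spell(word)` call, and the speed-up has not been measured on this dataset.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive check actually runs

def slow_spell(word: str) -> str:
    """Hypothetical stand-in for an expensive spell-correction call."""
    global calls
    calls += 1
    return {"teh": "the", "recieve": "receive"}.get(word, word)

@lru_cache(maxsize=None)
def is_misspelling_cached(word: str) -> bool:
    """Cached check: each distinct word is corrected only once."""
    return slow_spell(word) != word

tokens = ["teh", "cat", "teh", "cat", "recieve", "teh"]
misspelled = sum(is_misspelling_cached(t) for t in tokens)
print(misspelled, "misspelled tokens,", calls, "spell-check calls")
```

In the notebook, the same decorator could in principle be applied to `is_mispelling`; how much it helps depends on how often tokens repeat.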

In [30]:
# read in Misspelling.csv
sample_tweets_df = pd.read_csv('data/Misspelling.csv')

In [33]:
sample_tweets_df.groupby('K5')['WordsCount'].sum() / sample_tweets_df.groupby('K5')['MisspelledWordsCount'].sum()

K5
0    10.031610
1    10.520557
2    10.492694
3     9.826167
4    10.352032
dtype: float64

As the output above shows, there was no noticeable difference in misspellings between the network clusters: every cluster has roughly ten words per misspelled word. (Note that the ratio computed above is total words divided by misspelled words, the inverse of a misspelling rate.) We suspect this is because, although there anecdotally seem to be communities on Twitter that use their accounts in a more professional setting, such as the blue-check journalists, they were not well represented in the noise of our API listener. This was probably especially the case given the vulgarity of many of our seed words.
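To read the K5 output as a misspelling rate, take the reciprocal of each value. The sketch below does this arithmetic on the printed numbers; the variable names are illustrative only.

```python
# Words per misspelled word for each K5 cluster, copied from the output above.
words_per_misspelling = {
    0: 10.031610,
    1: 10.520557,
    2: 10.492694,
    3: 9.826167,
    4: 10.352032,
}

# The misspelling rate is the reciprocal: misspelled words / total words.
misspell_rate = {k: 1.0 / v for k, v in words_per_misspelling.items()}

# Spread between the highest and lowest cluster rates.
spread = max(misspell_rate.values()) - min(misspell_rate.values())

for cluster, rate in sorted(misspell_rate.items()):
    print(f"K5={cluster}: {rate:.2%}")
print(f"spread: {spread:.2%}")
```

All five rates land close to 10%, and the gap between the highest and lowest cluster is under one percentage point.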

It would be interesting to address this by collecting more data and downweighting certain follows, in a fashion similar to inverse document frequency weighting.

For example, since some accounts, such as Katy Perry's, have more than a hundred million followers, following them says little about a user, so those accounts should have less influence on defining a network following cluster.
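A minimal sketch of what such a weighting could look like, by analogy with inverse document frequency: each follow is weighted by the log of the total user count divided by the followed account's follower count. The function name `follow_weight` and all numbers below are hypothetical, and this scheme was not applied in the analysis above.

```python
import math

def follow_weight(follower_count: int, total_users: int) -> float:
    """IDF-style weight: log(total_users / follower_count), floored at 0."""
    if follower_count <= 0:
        return 0.0  # an account nobody follows contributes no follow edges
    return max(0.0, math.log(total_users / follower_count))

# Hypothetical numbers, for illustration only.
TOTAL_USERS = 1_000_000
followers = {"celebrity": 500_000, "niche_journalist": 2_000}
weights = {name: follow_weight(n, TOTAL_USERS) for name, n in followers.items()}

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this scheme a follow of a widely followed account counts for much less than a follow of a niche account, which is the intended effect when building clusters from the follow graph.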