## r/AmItheAsshole Language Analysis

### Introduction

[r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/) is a subreddit where people post a story about a conflict they have had, and the community judges who the "asshole" in the situation is. The page description puts it eloquently:

"A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole." ([r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/))

The goal of this project is to find out whether there are any notable lexical differences between the language of the "assholes" and the language of those who are not.

One thing I would like to note is that this corpus is not transcribed from speech, so the language of the posts may have been thoughtfully worded and may not be representative of one's natural speech. Still, I believe that the judgments voted on by the Reddit community can provide some insight.

### Scraping from Reddit
#### PRAW
[The Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/v4.1.0/index.html) is a Python package that wraps the Reddit API. In this project it is used to scrape posts from the r/AmItheAsshole subreddit.

#### The r/AmItheAsshole Archives
The subreddit archives posts under four judgements that are relevant to this project. There is a fifth judgement, 'INFO' (not enough info), but it is not archived and does not help with this investigation. The four archived judgements are described in the docstring of `get_archive()` below.

#### Helper functions for scraping from reddit

In [1]:
import os
import glob
import yaml
import praw
import pandas as pd
from tqdm import tqdm_notebook



# Initialize reddit API
reddit = praw.Reddit('scraper', user_agent='corpus ling project')
reddit.read_only = True

# Abbreviations for the archive queries
archive_names = { 'YTA' : 'Asshole', 'NTA' : 'Not the A-hole', 'ESH' : 'Everyone Sucks',
        'NAH' : 'No A-holes here'}

# Pandas dataframe containing all of the posts.
corpus = pd.DataFrame()

def get_archive(judgment):
    '''
    Get the posts from a single archive of r/AmItheAsshole
    Archive judgement options are:
        'YTA' -- You're the Asshole (& the other party is not)
        'NTA' -- You're Not the A-hole (& the other party is)
        'ESH' -- Everyone Sucks Here
        'NAH' -- No A-holes here
    '''
    return reddit.subreddit('AmItheAsshole').search(
            'flair_name:"' + archive_names[judgment] + '"', limit=None)

def load_archives(download_local = True):
    '''
    Grabs each of the archives and puts them in the corpus dataframe.
    The download_local flag determines if you want to download the text
    as well in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        posts = get_archive(judgement)
        # One record per post; building a list avoids handing json_normalize a one-shot iterator
        records = [{'archive': judgement, **vars(post)} for post in posts]
        row_df = pd.json_normalize(records)
        # store only the id, title, archive, selftext, and url
        corpus = pd.concat([corpus, row_df[['id', 'title', 'archive', 'selftext', 'url']]],
                           ignore_index=True, sort=True)
        
        # Download the text locally
        if download_local:
            # separate archives into different directories.
            os.makedirs(judgement, exist_ok=True)
            for index, post in corpus.loc[corpus['archive'] == judgement].iterrows():
                # Save each file in their respective directories with their id as the filename
                with open(os.path.join(judgement, post.id + '.yaml'), 'w') as file:
                    yaml.dump(post.to_dict(), file)

def local_load_archives():
    '''
    Loads the archived posts from yaml files in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        for filename in glob.glob(judgement + '/*.yaml'):
            with open(filename) as file:
                post = pd.json_normalize(yaml.load(file, Loader=yaml.FullLoader))
                corpus = pd.concat([corpus, post], ignore_index=True, sort=True)



#### Several options for loading the corpus of posts:
To keep results consistent, please use the `local_load_archives()` option: new posts are added to the subreddit over time, so the analysis of this corpus may change if it is rebuilt from the live archives.

Note: `load_archives()` will not work unless PRAW can find a `praw.ini` file, for example in the current directory. This file contains the necessary account information for using the Reddit API.
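
As a quick sanity check before calling `load_archives()`, the sketch below tests whether a `praw.ini` file defines the `scraper` site that `praw.Reddit('scraper', ...)` refers to. The helper name `has_praw_site` and the placeholder credentials are hypothetical. `client_id` and `client_secret` are the usual PRAW settings for a read-only script application, and this check only looks at the path it is given, whereas PRAW itself may also search other configuration directories.

```python
import configparser
import os
import tempfile

def has_praw_site(path, site='scraper', required=('client_id', 'client_secret')):
    """Return True if the ini file at `path` has a [site] section with the required keys."""
    parser = configparser.ConfigParser()
    # ConfigParser.read skips missing files and returns the list of files it parsed
    if not parser.read(path):
        return False
    return parser.has_section(site) and all(parser.has_option(site, key) for key in required)

# Demonstration with a throwaway file and placeholder (not real) credentials
with tempfile.TemporaryDirectory() as tmp:
    ini_path = os.path.join(tmp, 'praw.ini')
    with open(ini_path, 'w') as f:
        f.write('[scraper]\nclient_id=YOUR_CLIENT_ID\nclient_secret=YOUR_CLIENT_SECRET\n')
    found = has_praw_site(ini_path)
    missing = has_praw_site(os.path.join(tmp, 'does_not_exist.ini'))

print(found, missing)
```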

In [None]:
# Run this if you just want to use the posts that are downloaded locally.
local_load_archives()

In [None]:
# Run this if you want the most recent set of posts but without downloading them locally.
load_archives(False)

In [None]:
# Run this if you want to load and update the local files with the current archives.
load_archives()

#### The corpus
The corpus of posts is represented as a dataframe whose most important columns are the `archive` column and the text (`selftext`). The following code prints a summary of each archive, including the number of posts.

In [3]:
for judgement in archive_names.keys():
    print("\n\n Summary of all of the posts in the \"" + archive_names[judgement] + "\" archive:")
    print(corpus.loc[corpus['archive'] == judgement].describe())



 Summary of all of the posts in the "Asshole" archive:
       archive      id                                           selftext  \
count      238     238                                                238   
unique       1     238                                                238   
top        YTA  eaghx8  My sisters boyfriend came over for thanksgivin...   
freq       238       1                                                  1   

                                                    title  \
count                                                 238   
unique                                                238   
top     AITA that I didn't give up my bus seat for a p...   
freq                                                    1   

                                                      url  
count                                                 238  
unique                                                238  
top     https://www.reddit.com/r/AmItheAsshole/comment...  
freq                                                    1  

##### Feel free to take a break and read a random post:

In [4]:
post = corpus.sample(n=1)
print(post.selftext.tolist()[0])

Been married seven years. I'm 35M, she's 34F. We have one daughter 5F and one son 9moM. For more context about our family: I make a lot of money. My wife is unemployed and has been unemployed since we met. The house is mine, the two cars are mine, etc. I mean, they're both of ours, but I purchased them for the family.

My wife decided to spring onto me two weeks ago that she wants to explore a polyromantic relationship. She has no one in mind but is excited about the potential and asks if I'm up for it. My answer is a resounding no because I agreed to monogamy when we married. We actively discussed this beforehand, too, and I thought we were on the same page.

This is where I may be the asshole: my wife has zero bargaining power here. To put it very bluntly, she doesn't make enough money to make this kind of suggestion or demand. She has said me not accepting her polyromanticism is putting our relationship into jeopardy and while I may lose a lot in a divorce or separation, she's 34 wi

##### Can you guess how the community judged this submission?

In [5]:
print(archive_names[post.archive.tolist()[0]])

Everyone Sucks


### Analyzing the Corpus
#### NLTK
[Natural Language Toolkit (NLTK)](https://www.nltk.org/) is a Python library with many natural language processing tools.

First we need to import nltk and download some tools:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

Some helper functions for using the nltk library on the corpus dataframe:

In [7]:
from nltk import word_tokenize, sent_tokenize, Text, FreqDist, pos_tag

def corpus_tokenize():
    '''
    Converts the text of each post to a list of tokens and places
    them in a list in a column within the corpus dataframe.
    '''
    global corpus
    corpus['tokens'] = list(map(lambda text : word_tokenize(text.lower()), corpus['selftext']))
    # Get rid of punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    corpus['tokens_no_punct'] = list(map(lambda text : tokenizer.tokenize(text.lower()), corpus['selftext']))
    # Tokenize on sentences.
    corpus['sent_tokens'] = list(map(sent_tokenize, corpus['selftext']))
    
def corpus_pos_tags():
    '''
    POS tag the whole corpus and put it into a new column. Only
    run after corpus_tokenize().
    '''
    global corpus
    corpus['pos_tags'] = list(map(pos_tag, corpus['tokens']))

    
def get_archive_texts(archives, no_punctuation = False):
    '''
    Gets the texts of every post of the archives listed in archives
    and returns it as one list of words.
    '''
    global corpus
    texts = []
    for archive in archives:
        for post_tokens in corpus.loc[corpus['archive'] == archive]['tokens_no_punct' if no_punctuation else 'tokens'].to_list():
            texts += post_tokens
    return texts

def get_archive_texts_sentences(archives):
    '''
    Gets the texts of every post of the archives listed in archives
    and returns it as one list of sentences.
    '''
    global corpus
    texts = []
    for archive in archives:
        for sent_tokens in corpus.loc[corpus['archive'] == archive]['sent_tokens'].to_list():
            texts += sent_tokens
    return texts

def get_archive_texts_pos_tags(archives):
    '''
    Gets the pos tags of every post of the archives listed in archives
    and returns it as one list of pos tags.
    '''
    global corpus
    texts = []
    for archive in archives:
        for sent_tokens in corpus.loc[corpus['archive'] == archive]['pos_tags'].to_list():
            texts += sent_tokens
    return texts

def get_freq_dist(archives, no_stopwords = False):
    '''
    Get a FreqDist for the given archive names. There is also the option to remove stopwords.
    '''
    text = get_archive_texts(archives, no_punctuation=True)
    if no_stopwords:
        # filter out stopwords (a set keeps each membership test cheap)
        stopwords = set(nltk.corpus.stopwords.words('english'))
        text = [word for word in text if word not in stopwords]
    return FreqDist(text)

#### Preprocessing the Corpus
Creates three new columns in the corpus dataframe: 'tokens', 'tokens_no_punct', and 'sent_tokens'. The first holds the raw output of NLTK's `word_tokenize` function on the lowercased text, the second holds the tokens without punctuation, and the third holds the text split into sentences.

In [8]:
corpus_tokenize()
corpus.head()

| | archive | id | selftext | title | url | tokens | tokens_no_punct | sent_tokens |
|---|---|---|---|---|---|---|---|---|
| 0 | YTA | e8l9h6 | My husband (30M) is defending his PhD in 1 wee... | WIBTA if I didn't go to my husband's PhD defense | https://www.reddit.com/r/AmItheAsshole/comment... | [my, husband, (, 30m, ), is, defending, his, p... | [my, husband, 30m, is, defending, his, phd, in... | [My husband (30M) is defending his PhD in 1 we... |
| 1 | YTA | ec6a0d | Okay, to establish this, my ex and I have been... | AITA for not offering to help my ex pay for di... | https://www.reddit.com/r/AmItheAsshole/comment... | [okay, ,, to, establish, this, ,, my, ex, and,... | [okay, to, establish, this, my, ex, and, i, ha... | [Okay, to establish this, my ex and I have bee... |
| 2 | YTA | duxns4 | So Thursday, I went to a party one of my frien... | AITA for being livid at my roommate for callin... | https://www.reddit.com/r/AmItheAsshole/comment... | [so, thursday, ,, i, went, to, a, party, one, ... | [so, thursday, i, went, to, a, party, one, of,... | [So Thursday, I went to a party one of my frie... |
| 3 | YTA | e617k1 | I wanted to do something nice for my mom for h... | AITA for not painting my step father in a fami... | https://www.reddit.com/r/AmItheAsshole/comment... | [i, wanted, to, do, something, nice, for, my, ... | [i, wanted, to, do, something, nice, for, my, ... | [I wanted to do something nice for my mom for ... |
| 4 | YTA | e8tpjf | Edit: threw OUT* a diaper beside my head\n\nTh... | AITA for being mad someone threw a diaper next... | https://www.reddit.com/r/AmItheAsshole/comment... | [edit, :, threw, out*, a, diaper, beside, my, ... | [edit, threw, out, a, diaper, beside, my, head... | [Edit: threw OUT* a diaper beside my head\n\nT... |
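
To make the 'tokens_no_punct' column more concrete, the sketch below uses only the standard library. `re.findall(r'\w+', ...)` stands in here for `RegexpTokenizer(r'\w+')`, which matches the same pattern, and the sample sentence is adapted from the title of the first post shown above. Note that the pattern splits contractions and possessives at the apostrophe, which is worth keeping in mind when reading the n-gram results further down.

```python
import re

def tokenize_no_punct(text):
    """Lowercase the text and keep only runs of word characters (letters, digits, underscore)."""
    return re.findall(r'\w+', text.lower())

sample = "I didn't go to my husband's PhD defense."  # adapted example sentence
tokens = tokenize_no_punct(sample)
print(tokens)
# ['i', 'didn', 't', 'go', 'to', 'my', 'husband', 's', 'phd', 'defense']
```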


#### Word Frequencies
Since these posts are written in the first person and describe personal experiences, many of the most frequent words are pronouns. The following frequency counts therefore remove stopwords. This can be changed by replacing `True` with `False` in the `no_stopwords` argument.

In [18]:
pd.DataFrame(get_freq_dist(archive_names.keys(), no_stopwords=True).most_common(20), columns=['Word', 'Frequency'])

| | Word | Frequency |
|---|---|---|
| 0 | like | 1455 |
| 1 | said | 1388 |
| 2 | would | 1175 |
| 3 | told | 1156 |
| 4 | time | 1065 |
| 5 | get | 1006 |
| 6 | one | 955 |
| 7 | got | 908 |
| 8 | want | 895 |
| 9 | really | 877 |


In [19]:
n_most_common = 50

pd.DataFrame(get_freq_dist(['YTA', 'ESH'], no_stopwords=True).most_common(n_most_common),
             columns=['word', 'frequency']).join(
    pd.DataFrame(get_freq_dist(['NTA', 'NAH'], no_stopwords=True).most_common(n_most_common),
                 columns=['word', 'frequency']), lsuffix="_a_hole")

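| | word_a_hole | frequency_a_hole | word | frequency |
|---|---|---|---|---|
| 0 | like | 715 | like | 740 |
| 1 | said | 690 | said | 698 |
| 2 | told | 541 | would | 654 |
| 3 | time | 539 | told | 615 |
| 4 | would | 521 | time | 526 |
| 5 | get | 480 | get | 526 |
| 6 | one | 447 | want | 509 |
| 7 | got | 419 | one | 508 |
| 8 | know | 387 | really | 495 |
| 9 | want | 386 | got | 489 |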
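
The two top-10 lists above overlap almost entirely, so it can be easier to look only at the words that appear in one list but not the other. The sketch below does this with plain Python sets, using the ten words per group copied from the output above. The variable names are illustrative, and only rank membership is compared, not the raw counts.

```python
# Top-10 words per group, copied from the table above
a_hole_top = ['like', 'said', 'told', 'time', 'would', 'get', 'one', 'got', 'know', 'want']
other_top = ['like', 'said', 'would', 'told', 'time', 'get', 'want', 'one', 'really', 'got']

# Words that made the top 10 in one group but not in the other
only_a_hole = sorted(set(a_hole_top) - set(other_top))
only_other = sorted(set(other_top) - set(a_hole_top))

print(only_a_hole)  # ['know']
print(only_other)   # ['really']
```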


#### POS tags
First we need to calculate all of the POS tags. 

In [11]:
corpus_pos_tags()

How do the frequencies of the different tags compare?

In [24]:
from collections import Counter

counts = Counter(tag for word, tag in get_archive_texts_pos_tags(['YTA', 'ESH']))
a_hole_pos_freq = pd.DataFrame(counts, index=[0])
a_hole_pos_freq = a_hole_pos_freq.transpose()
a_hole_pos_freq.columns = ['a_hole_pos_count']

counts = Counter(tag for word, tag in get_archive_texts_pos_tags(['NTA', 'NAH']))
pos_freq = pd.DataFrame(counts, index=[0])
pos_freq = pos_freq.transpose()
pos_freq.columns = ['pos_count']


pos_freqs = a_hole_pos_freq.join(pos_freq)
pos_freqs['total'] = pos_freqs['a_hole_pos_count'] + pos_freqs['pos_count']
pos_freqs.sort_values(by=['total'], ascending=False)

| | a_hole_pos_count | pos_count | total |
|---|---|---|---|
| NN | 31312 | 34864 | 66176 |
| IN | 19320 | 21324 | 40644 |
| PRP | 14628 | 15535 | 30163 |
| DT | 13611 | 15230 | 28841 |
| JJ | 13426 | 15264 | 28690 |
| RB | 13200 | 14884 | 28084 |
| VBD | 10780 | 11512 | 22292 |
| VB | 9922 | 11190 | 21112 |
| . | 9743 | 10718 | 20461 |
| CC | 8371 | 9271 | 17642 |
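
Because the two groups contain different numbers of tokens, raw tag counts are hard to compare directly. One option, sketched below, is to convert each count into a share of its group's total. The counts are copied from the table above, so the shares are relative to these ten tags only, not to every tag in the corpus. The helper name `to_shares` is hypothetical.

```python
# (a_hole_pos_count, pos_count) for the ten most frequent tags in the table above
pos_counts = {
    'NN': (31312, 34864), 'IN': (19320, 21324), 'PRP': (14628, 15535),
    'DT': (13611, 15230), 'JJ': (13426, 15264), 'RB': (13200, 14884),
    'VBD': (10780, 11512), 'VB': (9922, 11190), '.': (9743, 10718),
    'CC': (8371, 9271),
}

def to_shares(counts):
    """Divide each count by the sum of all counts so the values add up to 1."""
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

a_hole_shares = to_shares({tag: pair[0] for tag, pair in pos_counts.items()})
other_shares = to_shares({tag: pair[1] for tag, pair in pos_counts.items()})

# Print both shares and their difference for each tag
for tag in pos_counts:
    diff = a_hole_shares[tag] - other_shares[tag]
    print(f"{tag:>4}  {a_hole_shares[tag]:.4f}  {other_shares[tag]:.4f}  {diff:+.4f}")
```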


#### N-grams and Collocations
Let's explore making bigrams and trigrams and finding collocations for this corpus.

First, let's look at the collocations generated from bigrams. Several different scoring measures can be used; they are listed in one of the code comments below.

In [13]:
import nltk
from nltk.collocations import BigramCollocationFinder

# Feel free to change these values.
n_collocs = 50
n_freq_filter = 5

bigram_measures = nltk.collocations.BigramAssocMeasures()
# Options: pmi, likelihood_ratio, chi_sq, dice, phi_sq, etc.
scoring_measure = bigram_measures.pmi

flatten = lambda l: (l[0][0], l[0][1], l[1])

finder = BigramCollocationFinder.from_words(get_archive_texts(['YTA', 'ESH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
a_hole_colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure))),
                             columns=['bigram_word_1', 'bigram_word_2', 'score'])

finder = BigramCollocationFinder.from_words(get_archive_texts(['NTA', 'NAH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure))),
                      columns=['bigram_word_1', 'bigram_word_2', 'score'])

a_hole_colloc.join(colloc, lsuffix="_a_hole").head(n_collocs)

| | bigram_word_1_a_hole | bigram_word_2_a_hole | score_a_hole | bigram_word_1 | bigram_word_2 | score |
|---|---|---|---|---|---|---|
| 0 | peanut | butter | 14.027584 | april | fools | 14.64165 |
| 1 | mr | lastname | 13.995163 | baked | goods | 14.419257 |
| 2 | passive | aggressive | 13.709408 | nail | polish | 14.378615 |
| 3 | miss | johnson | 13.679661 | ice | cream | 13.864042 |
| 4 | fake | lashes | 13.487016 | tl | dr | 13.471725 |
| 5 | y | o | 13.371538 | imgur | com | 13.471725 |
| 6 | tl | dr | 13.317091 | plant | based | 13.393722 |
| 7 | 400 | calories | 13.094698 | cow | milk | 13.356247 |
| 8 | blah | blah | 13.042231 | passive | aggressive | 13.352143 |
| 9 | social | media | 12.732128 | https | imgur | 13.319722 |


What about trigrams?

In [14]:
import nltk
from nltk.collocations import TrigramCollocationFinder

# Feel free to change these values.
n_collocs = 50
n_freq_filter = 5

trigram_measures = nltk.collocations.TrigramAssocMeasures()
# Options: pmi, raw_freq, likelihood_ratio, chi_sq, jaccard, dice, phi_sq, etc.
scoring_measure = trigram_measures.pmi

flatten = lambda l: (l[0][0], l[0][1], l[0][2], l[1])

finder = TrigramCollocationFinder.from_words(get_archive_texts(['YTA', 'ESH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
a_hole_colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure)[:n_collocs])),
                             columns=['trigram_word_1', 'trigram_word_2', 'trigram_word_3', 'score'])

finder = TrigramCollocationFinder.from_words(get_archive_texts(['NTA', 'NAH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure)[:n_collocs])),
                      columns=['trigram_word_1', 'trigram_word_2', 'trigram_word_3', 'score'])

a_hole_colloc.join(colloc, lsuffix="_a_hole")

| | trigram_word_1_a_hole | trigram_word_2_a_hole | trigram_word_3_a_hole | score_a_hole | trigram_word_1 | trigram_word_2 | trigram_word_3 | score |
|---|---|---|---|---|---|---|---|---|
| 0 | long | story | short | 21.166446 | https | imgur | com | 26.791446 |
| 1 | on | social | media | 19.234156 | less | active | side | 21.184116 |
| 2 | at | 5 | 30am | 17.816748 | long | story | short | 20.640683 |
| 3 | year | old | boy | 17.366181 | blake | s | ring | 18.265686 |
| 4 | my | mouth | shut | 17.242991 | aunt | tina | s | 17.780259 |
| 5 | a | phd | student | 17.07491 | my | mouth | shut | 17.750472 |
| 6 | of | ice | cream | 16.977982 | the | self | feeder | 17.721808 |
| 7 | few | days | ago | 16.774776 | 11 | year | old | 16.760411 |
| 8 | along | the | lines | 16.594526 | reddit | aita | edit | 16.639795 |
| 9 | 6 | months | ago | 16.44879 | under | the | impression | 16.622273 |


#### Sentiment Analysis
NLTK already ships with a sentiment analysis model called [VADER](https://github.com/cjhutto/vaderSentiment):

"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."

The following are some functions to help process the text with the sentiment analyzer.

In [15]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def get_average_sentiment_scores(text):
    '''
    Takes in a list of sentences and returns the average VADER
    sentiment scores over those sentences.
    '''
    sid = SentimentIntensityAnalyzer()
    averages = {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
    count = 0
    for sentence in text:
        ss = sid.polarity_scores(sentence)
        for key, val in ss.items():
            averages[key] += val
        count += 1
    # Guard against posts with no sentences (e.g. an empty selftext)
    if count > 0:
        for key in averages:
            averages[key] /= count
    return averages

def corpus_sentiment_scores():
    '''
    Updates the corpus with the average sentiment scores of each post.
    '''
    global corpus
    scores = list(map(get_average_sentiment_scores, corpus['sent_tokens']))
    # Drop any stale score columns, then attach the fresh ones aligned on the index
    corpus = corpus[corpus.columns.difference(scores[0].keys())].join(
        pd.DataFrame(scores, index=corpus.index))
    
def average_archive_sentiment_scores(archives):
    '''
    Gets the average sentiment scores over all posts in the given
    archives and returns them as a dictionary.
    '''
    score_columns = ['neg', 'neu', 'pos', 'compound']
    # Select only the numeric score columns so that mean() ignores the text columns
    in_archives = corpus['archive'].isin(archives)
    return corpus.loc[in_archives, score_columns].mean(axis=0).to_dict()

Let's first put these scores into the corpus dataframe.

In [16]:
# Preprocessing
corpus_sentiment_scores()
corpus.head()

| | archive | id | pos_tags | selftext | sent_tokens | title | tokens | tokens_no_punct | url | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | YTA | e8l9h6 | [(my, PRP$), (husband, NN), ((, (), (30m, CD),... | My husband (30M) is defending his PhD in 1 wee... | [My husband (30M) is defending his PhD in 1 we... | WIBTA if I didn't go to my husband's PhD defense | [my, husband, (, 30m, ), is, defending, his, p... | [my, husband, 30m, is, defending, his, phd, in... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.074684 | 0.819947 | 0.105368 | 0.099732 |
| 1 | YTA | ec6a0d | [(okay, NN), (,, ,), (to, TO), (establish, VB)... | Okay, to establish this, my ex and I have been... | [Okay, to establish this, my ex and I have bee... | AITA for not offering to help my ex pay for di... | [okay, ,, to, establish, this, ,, my, ex, and,... | [okay, to, establish, this, my, ex, and, i, ha... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.046938 | 0.8515 | 0.101625 | 0.061269 |
| 2 | YTA | duxns4 | [(so, RB), (thursday, JJ), (,, ,), (i, JJ), (w... | So Thursday, I went to a party one of my frien... | [So Thursday, I went to a party one of my frie... | AITA for being livid at my roommate for callin... | [so, thursday, ,, i, went, to, a, party, one, ... | [so, thursday, i, went, to, a, party, one, of,... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.073577 | 0.864731 | 0.061615 | -0.061638 |
| 3 | YTA | e617k1 | [(i, RB), (wanted, VBD), (to, TO), (do, VB), (... | I wanted to do something nice for my mom for h... | [I wanted to do something nice for my mom for ... | AITA for not painting my step father in a fami... | [i, wanted, to, do, something, nice, for, my, ... | [i, wanted, to, do, something, nice, for, my, ... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.047857 | 0.851571 | 0.100571 | 0.095214 |
| 4 | YTA | e8tpjf | [(edit, NN), (:, :), (threw, NN), (out*, VBZ),... | Edit: threw OUT* a diaper beside my head\n\nTh... | [Edit: threw OUT* a diaper beside my head\n\nT... | AITA for being mad someone threw a diaper next... | [edit, :, threw, out*, a, diaper, beside, my, ... | [edit, threw, out, a, diaper, beside, my, head... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.072667 | 0.834185 | 0.093185 | 0.014785 |


How do the sentiment scores of the different judgements compare on average?

In [17]:
sentiment_scores = pd.concat(
    [pd.DataFrame(average_archive_sentiment_scores([archive]), index=[0]) for archive in archive_names.keys()],
    ignore_index=True)

archive_abr = list(archive_names.keys())
sentiment_scores.rename(index=lambda i : archive_names[archive_abr[i]], inplace=True)
sentiment_scores

| | neg | neu | pos | compound |
|---|---|---|---|---|
| Asshole | 0.073949 | 0.837025 | 0.088843 | 0.042957 |
| Not the A-hole | 0.079794 | 0.838691 | 0.08087 | 0.008572 |
| Everyone Sucks | 0.077294 | 0.840701 | 0.081026 | 0.003074 |
| No A-holes here | 0.071918 | 0.833326 | 0.093873 | 0.052523 |
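
One way to read these averages is against the cut-offs that the VADER documentation suggests for the compound score: at or above 0.05 is treated as positive, at or below -0.05 as negative, and anything in between as neutral. Those cut-offs are meant for individual sentences, so applying them to per-archive averages, as the sketch below does, is an assumption made purely for illustration. The values are copied from the table above and the function name `label_compound` is hypothetical.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# Average compound score per archive, copied from the table above
average_compound = {
    'Asshole': 0.042957,
    'Not the A-hole': 0.008572,
    'Everyone Sucks': 0.003074,
    'No A-holes here': 0.052523,
}

labels = {archive: label_compound(score) for archive, score in average_compound.items()}
print(labels)
```

Under this reading only the "No A-holes here" archive crosses the positive cut-off, and only barely. All four averages sit close to neutral, so the differences between judgements should be interpreted with caution.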
