## r/AmItheAsshole Language Analysis

### Introduction

[r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/) is a subreddit where people post a story about a conflict they have had, and the community judges who the "asshole" in the situation is. The page description words this very eloquently:

"A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole." ([r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/))

The goal of this project is to find out whether there are any notable lexical differences between the language of the "assholes" and the language of those who are not.

One thing that I would like to note is that this corpus is not transcribed from speech, so the language of the posts may have been thoughtfully worded and may not be representative of one's natural speech. Still, I believe that the judgements voted on by the Reddit community can provide insight into these differences.

### Scraping from Reddit
#### PRAW
[The Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/v4.1.0/index.html) is a Python package that wraps the Reddit API. In this project it is used to scrape posts from the r/AmItheAsshole subreddit.

#### The r/AmItheAsshole Archives
The subreddit archives posts into 4 categories that are relevant to this project. There are 5 judgements that the community can decide on; however, the fifth one, 'INFO' (not enough info), is not archived and does not help with this investigation. The 4 archived judgements are described in the `get_archive` function below.

#### Helper functions for scraping from reddit

In [1]:
import os
import glob
import yaml
import praw
import pandas as pd
from tqdm import tqdm_notebook



# Initialize reddit API
reddit = praw.Reddit('scraper', user_agent='corpus ling project')
reddit.read_only = True

# Abbreviations for the archive queries
archive_names = { 'YTA' : 'Asshole', 'NTA' : 'Not the A-hole', 'ESH' : 'Everyone Sucks',
        'NAH' : 'No A-holes here'}

# Pandas dataframe containing all of the posts.
corpus = pd.DataFrame()

def get_archive(judgment):
    '''
    Get the posts from a single archive of r/AmItheAsshole
    Archive judgement options are:
        'YTA' -- You're the Asshole (& the other party is not)
        'NTA' -- You're Not the A-hole (& the other party is)
        'ESH' -- Everyone Sucks Here
        'NAH' -- No A-holes here
    '''
    return reddit.subreddit('AmItheAsshole').search(
            'flair_name:"' + archive_names[judgment] + '"', limit=None)

def load_archives(download_local = True):
    '''
    Grabs each of the archives and puts them in the corpus dataframe.
    The download_local flag determines if you want to download the text
    as well in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        posts = get_archive(judgement)
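        # pd.json_normalize replaces pd.io.json.json_normalize (pandas >= 1.0).
        row_df = pd.json_normalize(
                [{'archive' : judgement, **vars(post)} for post in posts])
        # store only the id, title, archive, rawtext, and url
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        corpus = pd.concat([corpus, row_df[['id', 'title', 'archive', 'selftext', 'url']]],
                           ignore_index = True, sort=True)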
        
        # Download the text locally
        if (download_local):
            if not os.path.exists(judgement):
                # separate archives into different directories.
                os.mkdir(judgement)
            for index, post in (corpus.loc[corpus['archive'] == judgement]).iterrows():
                # Save each file in their respective directories with their id as the filename
                with open(judgement + '/' + post.id + '.yaml', 'w') as file:
                    yaml.dump(post.to_dict(), file)

def local_load_archives():
    '''
    Loads the archived posts from yaml files in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        for filename in glob.glob(judgement + '/*.yaml'):
            with open(filename) as file:
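                # pd.concat and pd.json_normalize replace the older
                # DataFrame.append and pd.io.json.json_normalize calls.
                corpus = pd.concat([corpus, pd.json_normalize(
                    yaml.load(file, Loader=yaml.FullLoader))],
                                   ignore_index = True, sort=True)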



#### Several options for loading the corpus of posts:
To keep things consistent, please use the `local_load_archives()` option: new posts are added to the subreddit over time, so the analysis of this corpus may change if the posts are scraped again.

Note: `load_archives()` will not work unless you have a `praw.ini` file in the current directory. This file contains the necessary account information for using the Reddit API.
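
Before calling `load_archives()`, it can help to check that `praw.ini` defines the `scraper` site that `praw.Reddit('scraper', ...)` looks up. The sketch below uses only the standard library. The helper name `missing_praw_keys` is made up for illustration, and treating `client_id` and `client_secret` as the required entries is an assumption based on PRAW's documented read-only setup, not something specific to this project.

```python
import configparser

# Assumption: read-only PRAW access needs these two entries in the site section.
REQUIRED_KEYS = ('client_id', 'client_secret')

def missing_praw_keys(ini_text, site='scraper'):
    '''
    Returns the required keys that are absent from the [site] section
    of the given praw.ini contents (all of them if the section is missing).
    '''
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(site):
        return list(REQUIRED_KEYS)
    return [key for key in REQUIRED_KEYS if not parser.has_option(site, key)]

# Placeholder values only; real credentials come from your Reddit app settings.
example_ini = """
[scraper]
client_id=YOUR_CLIENT_ID
"""
print(missing_praw_keys(example_ini))  # ['client_secret']
```

To check a real file, pass `open('praw.ini').read()` instead of `example_ini`.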

In [None]:
# Run this if you just want to use the posts that are downloaded locally.
local_load_archives()

In [None]:
# Run this if you want the most recent set of posts but without downloading them locally.
load_archives(False)

In [None]:
# Run this if you want to load and update the local files with the current archives.
load_archives()

#### The corpus
The corpus of posts is represented as a dataframe whose most important columns are the archive column and the text column (`selftext`). The following code prints a summary of each archive, including the number of posts.

In [3]:
for judgement in archive_names.keys():
    print("\n\n Summary of all of the posts in the \"" + archive_names[judgement] + "\" archive:")
    print(corpus.loc[corpus['archive'] == judgement].describe())



 Summary of all of the posts in the "Asshole" archive:
       archive      id                                           selftext  \
count      238     238                                                238   
unique       1     238                                                238   
top        YTA  e8qq4s  I had my daughter young, her father stood by m...   
freq       238       1                                                  1   

                                                    title  \
count                                                 238   
unique                                                238   
top     AITA for not wanting to cancel our vacation pl...   
freq                                                    1   

                                                      url  
count                                                 238  
unique                                                238  
top     https://www.reddit.com/r/AmItheAsshole/comment...  
freq                                                    1  

##### Feel free to take a break and read a random post:

In [4]:
post = corpus.sample(n=1)
print(post.selftext.tolist()[0])



My wife and I are on vacation in Orlando Florida and needed a ride to Denny’s. I ordered the UBER and was matched with a man that arrived in less than 2 minutes after we requested. When we sat down in the car he looked back and asked if he could use the bathroom really fast, and we said it was fine. Me and my wife looked at each other and agreed that it’s fair that someone should be allowed to use the bathroom if need be (obviously). He actually did a slow jog into the resort we were staying at, turned around and returned to tell us the rate wouldn’t go up because of it. But as time went on I told my wife there was no way I’m giving this guy 5 stars, she thinks it’s not fair to the driver. The entire affair lasted 5 minutes or so. Why wouldn’t he use the bathroom before he accepted the ride? I hate rating Uber drivers with low scores because it’s their livelyhood but come on dude. AITA?

TDLR
Friendly Uber driver picked my wife and I up and then left us in the car to use the bathroom

##### Can you guess how the community judged this submission?

In [5]:
print(archive_names[post.archive.tolist()[0]])

Asshole


### Analyzing the Corpus
#### NLTK
[Natural Language Toolkit (NLTK)](https://www.nltk.org/) is a Python library with a wide range of natural language processing tools.

First we need to import nltk and download some tools:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

Some helper functions for using the nltk library on the corpus dataframe:

In [7]:
from nltk import word_tokenize, sent_tokenize, Text, FreqDist

def corpus_tokenize():
    '''
    Converts the text of each post to a list of tokens and places
    them in a list in a column within the corpus dataframe.
    '''
    global corpus
    corpus['tokens'] = list(map(lambda text : word_tokenize(text.lower()), corpus['selftext']))
    # Get rid of punctuation
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
    corpus['tokens_no_punct'] = list(map(lambda text : tokenizer.tokenize(text.lower()), corpus['selftext']))
    # Tokenize on sentences.
    corpus['sent_tokens'] = list(map(sent_tokenize, corpus['selftext']))
    
def get_archive_texts(archives, no_punctuation = False):
    '''
    Gets the texts of every post of the archives listed in archives
    and returns it as one list of words.
    '''
    global corpus
    texts = []
    for archive in archives:
        for post_tokens in corpus.loc[corpus['archive'] == archive]['tokens_no_punct' if no_punctuation else 'tokens'].to_list():
            texts += post_tokens
    return texts

def get_archive_texts_sentences(archives):
    '''
    Gets the texts of every post of the archives listed in archives
    and returns it as one list of sentences.
    '''
    global corpus
    texts = []
    for archive in archives:
        for sent_tokens in corpus.loc[corpus['archive'] == archive]['sent_tokens'].to_list():
            texts += sent_tokens
    return texts

def get_freq_dist(archives, no_stopwords = False):
    '''
    Get a FreqDist for the given archive names. There is also the option to remove stopwords.
    '''
    text = get_archive_texts(archives, no_punctuation=True)
    if no_stopwords:
        # filter out stopwords (a set makes each membership test fast)
        stopwords = set(nltk.corpus.stopwords.words('english'))
        text = [word for word in text if word not in stopwords]
    return FreqDist(text)

#### Preprocessing the Corpus
This step creates three new columns in the corpus dataframe: 'tokens', 'tokens_no_punct', and 'sent_tokens'. The first holds the raw output of NLTK's `word_tokenize` function on the lowercased text, the second holds the lowercased tokens without punctuation, and the third holds the text split into sentences.

In [8]:
corpus_tokenize()
corpus.head()

|   | archive | id | selftext | title | url | tokens | tokens_no_punct | sent_tokens |
|---|---------|----|----------|-------|-----|--------|-----------------|-------------|
| 0 | YTA | birkjn | My wife is pregnant with our daughter. Initial... | WIBTA if I ask my pregnant wife to move out be... | https://www.reddit.com/r/AmItheAsshole/comment... | [my, wife, is, pregnant, with, our, daughter, ... | [my, wife, is, pregnant, with, our, daughter, ... | [My wife is pregnant with our daughter., Initi... |
| 1 | YTA | d6cpjt | When my girlfriend and I put out advertisement... | AITA for throwing out my roommates non-vegan f... | https://www.reddit.com/r/AmItheAsshole/comment... | [when, my, girlfriend, and, i, put, out, adver... | [when, my, girlfriend, and, i, put, out, adver... | [When my girlfriend and I put out advertisemen... |
| 2 | YTA | cm0bft | I had a party at my house last night. I have a... | AITA for telling a friend’s friend that he cou... | https://www.reddit.com/r/AmItheAsshole/comment... | [i, had, a, party, at, my, house, last, night,... | [i, had, a, party, at, my, house, last, night,... | [I had a party at my house last night., I have... |
| 3 | YTA | d7omtz | Edit: I meant “unlocked” in title. Not open. \... | AITA for monitoring my son’s shower time and m... | https://www.reddit.com/r/AmItheAsshole/comment... | [edit, :, i, meant, “, unlocked, ”, in, title,... | [edit, i, meant, unlocked, in, title, not, ope... | [Edit: I meant “unlocked” in title., Not open.... |
| 4 | YTA | cvlkut | So my situation is a little difficult so I tho... | WIBTA if I told a close family friend that her... | https://www.reddit.com/r/AmItheAsshole/comment... | [so, my, situation, is, a, little, difficult, ... | [so, my, situation, is, a, little, difficult, ... | [So my situation is a little difficult so I th... |
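
To see how the 'tokens' and 'tokens_no_punct' columns differ, the punctuation-free version can be approximated with the standard library. As far as I know, `RegexpTokenizer(r'\w+').tokenize` behaves like `re.findall(r'\w+', ...)`, so the sketch below assumes that equivalence. The sample sentence is taken from row 3 above, and the helper name `tokenize_no_punct` is only illustrative.

```python
import re

def tokenize_no_punct(text):
    '''Lowercases the text and keeps only runs of word characters.'''
    return re.findall(r'\w+', text.lower())

sample = 'Edit: I meant “unlocked” in title. Not open.'
print(tokenize_no_punct(sample))
# ['edit', 'i', 'meant', 'unlocked', 'in', 'title', 'not', 'open']
```

One side effect of this pattern is that apostrophes split words, so "Blake's" becomes `blake`, `s`. That is why a token such as `s` shows up in the trigram results later on.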


#### Word Frequencies
Since these posts are written in the first person and describe personal experiences, many of the most frequent words are pronouns and other function words. The following frequency counts therefore remove stopwords. To keep them, change `no_stopwords=True` to `no_stopwords=False`.

In [9]:
pd.DataFrame(get_freq_dist(archive_names.keys(), no_stopwords=True).most_common(20), columns=['Word', 'Frequency'])

|   | Word | Frequency |
|---|------|-----------|
| 0 | like | 1376 |
| 1 | said | 1298 |
| 2 | would | 1089 |
| 3 | told | 1084 |
| 4 | time | 979 |
| 5 | get | 924 |
| 6 | one | 883 |
| 7 | got | 848 |
| 8 | want | 838 |
| 9 | really | 807 |


In [10]:
n_most_common = 50

pd.DataFrame(get_freq_dist(['YTA', 'ESH'], no_stopwords=True).most_common(n_most_common),
             columns=['word', 'frequency']).join(
    pd.DataFrame(get_freq_dist(['NTA', 'NAH'], no_stopwords=True).most_common(n_most_common),
                 columns=['word', 'frequency']), lsuffix="_a_hole")

|   | word_a_hole | frequency_a_hole | word | frequency |
|---|-------------|------------------|------|-----------|
| 0 | like | 691 | like | 685 |
| 1 | said | 677 | said | 621 |
| 2 | told | 534 | would | 582 |
| 3 | time | 527 | told | 550 |
| 4 | would | 507 | get | 464 |
| 5 | get | 460 | want | 460 |
| 6 | one | 430 | one | 453 |
| 7 | got | 414 | time | 452 |
| 8 | want | 378 | got | 434 |
| 9 | know | 376 | really | 433 |
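
Raw counts are hard to compare when the two groups contain different numbers of tokens. One option, not used in the tables above, is to express each count as a rate per 1,000 tokens. This is a minimal standard-library sketch on made-up token lists; `relative_freq` is a hypothetical helper and not part of NLTK.

```python
from collections import Counter

def relative_freq(tokens, per=1000):
    '''Returns each word's frequency per `per` tokens.'''
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: per * count / total for word, count in counts.items()}

# Toy data standing in for the two groups of archives.
group_a = ['said', 'told', 'said', 'time']
group_b = ['said', 'want', 'really', 'want', 'get', 'said', 'one', 'got']

print(relative_freq(group_a)['said'])  # 500.0 (2 of 4 tokens)
print(relative_freq(group_b)['said'])  # 250.0 (2 of 8 tokens)
```

Both toy groups contain "said" twice, yet the rates differ because the groups differ in size. The function assumes a non-empty token list.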


#### N-grams and Collocations
Let's explore bigrams and trigrams and find collocations in this corpus.

First, let's look at the collocations generated from bigrams. Several different scoring measures can be used; they are listed in one of the comments below.

In [11]:
import nltk
from nltk.collocations import *

# Feel free to change these values.
n_collocs = 50
n_freq_filter = 5

bigram_measures = nltk.collocations.BigramAssocMeasures()
# Options: pmi, likelihood_ratio, chi_sq, dice, phi_sq, etc.
scoring_measure = bigram_measures.pmi

flatten = lambda l: (l[0][0], l[0][1], l[1])

finder = BigramCollocationFinder.from_words(get_archive_texts(['YTA', 'ESH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
a_hole_colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure))),
                             columns=['bigram_word_1', 'bigram_word_2', 'score'])

finder = BigramCollocationFinder.from_words(get_archive_texts(['NTA', 'NAH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure))),
                      columns=['bigram_word_1', 'bigram_word_2', 'score'])

a_hole_colloc.join(colloc, lsuffix="_a_hole").head(n_collocs)

|   | bigram_word_1_a_hole | bigram_word_2_a_hole | score_a_hole | bigram_word_1 | bigram_word_2 | score |
|---|----------------------|----------------------|--------------|---------------|---------------|-------|
| 0 | peanut | butter | 13.993357 | april | fools | 14.482595 |
| 1 | mr | lastname | 13.960935 | baked | goods | 14.260203 |
| 2 | passive | aggressive | 13.675181 | ice | cream | 13.830519 |
| 3 | fake | lashes | 13.645433 | imgur | com | 13.575705 |
| 4 | miss | johnson | 13.645433 | tl | dr | 13.482595 |
| 5 | y | o | 13.337311 | https | imgur | 13.395132 |
| 6 | tl | dr | 13.282863 | plant | based | 13.395132 |
| 7 | 400 | calories | 13.060471 | cow | milk | 13.197193 |
| 8 | blah | blah | 13.008003 | passive | aggressive | 13.193089 |
| 9 | social | media | 12.697901 | co | worker | 13.067558 |
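
The PMI scores above can be reproduced by hand. As I understand NLTK's `BigramAssocMeasures.pmi`, the score is the base-2 logarithm of (bigram count × total word count) / (count of word 1 × count of word 2). The sketch below recomputes that quantity with the standard library on a toy token list; `bigram_pmi` is an illustrative name, not an NLTK function.

```python
import math
from collections import Counter

def bigram_pmi(tokens, w1, w2):
    '''Pointwise mutual information (base 2) of the bigram (w1, w2).'''
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_total = len(tokens)
    n_bigram = bigram_counts[(w1, w2)]
    if n_bigram == 0:
        raise ValueError('bigram does not occur in the tokens')
    return math.log2(n_bigram * n_total / (word_counts[w1] * word_counts[w2]))

tokens = ['ice', 'cream', 'is', 'good', 'ice', 'cream', 'is', 'cold']
# 'ice cream' occurs twice, 'ice' twice, 'cream' twice, out of 8 tokens:
# log2(2 * 8 / (2 * 2)) = log2(4)
print(bigram_pmi(tokens, 'ice', 'cream'))  # 2.0
```

PMI rewards pairs whose words rarely occur apart, so very rare pairs tend to score highest. That is why the frequency filter (`n_freq_filter`) matters: without it, pairs seen only once or twice would dominate the list.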


What about trigrams?

In [12]:
import nltk
from nltk.collocations import *

# Feel free to change these values.
n_collocs = 50
n_freq_filter = 5

trigram_measures = nltk.collocations.TrigramAssocMeasures()
# Options: pmi, raw_freq, likelihood_ratio, chi_sq, jaccard, dice, phi_sq, etc.
scoring_measure = trigram_measures.pmi

flatten = lambda l: (l[0][0], l[0][1], l[0][2], l[1])

finder = TrigramCollocationFinder.from_words(get_archive_texts(['YTA', 'ESH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
a_hole_colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure)[:n_collocs])),
                             columns=['trigram_word_1', 'trigram_word_2', 'trigram_word_3', 'score'])

finder = TrigramCollocationFinder.from_words(get_archive_texts(['NTA', 'NAH'], no_punctuation=True))
finder.apply_freq_filter(n_freq_filter)
colloc = pd.DataFrame(list(map(flatten, finder.score_ngrams(scoring_measure)[:n_collocs])),
                      columns=['trigram_word_1', 'trigram_word_2', 'trigram_word_3', 'score'])

a_hole_colloc.join(colloc, lsuffix="_a_hole")

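|   | trigram_word_1_a_hole | trigram_word_2_a_hole | trigram_word_3_a_hole | score_a_hole | trigram_word_1 | trigram_word_2 | trigram_word_3 | score |
|---|-----------------------|-----------------------|-----------------------|--------------|----------------|----------------|----------------|-------|
| 0 | long | story | short | 21.173606 | https | imgur | com | 26.970837 |
| 1 | on | social | media | 19.20466 | less | active | side | 21.172299 |
| 2 | at | 5 | 30am | 17.767305 | long | story | short | 20.487062 |
| 3 | year | old | boy | 17.366499 | blake | s | ring | 18.136434 |
| 4 | my | mouth | shut | 17.211599 | aunt | tina | s | 17.690536 |
| 5 | a | phd | student | 17.142572 | my | mouth | shut | 17.686796 |
| 6 | of | ice | cream | 16.945303 | the | self | feeder | 17.674725 |
| 7 | along | the | lines | 16.935662 | under | the | impression | 16.580749 |
| 8 | few | days | ago | 16.799013 | 2 | 3 | minutes | 16.5032 |
| 9 | 6 | months | ago | 16.429702 | reddit | aita | edit | 16.373697 |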
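
`apply_freq_filter(n_freq_filter)` drops every n-gram that occurs fewer than `n_freq_filter` times before scoring. A rough standard-library equivalent of that counting-and-filtering step is sketched below on a toy sentence; the helper name `frequent_trigrams` is only illustrative.

```python
from collections import Counter

def frequent_trigrams(tokens, min_freq):
    '''Counts trigrams and keeps those seen at least min_freq times.'''
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {trigram: n for trigram, n in counts.items() if n >= min_freq}

tokens = 'long story short i kept my mouth shut long story short'.split()
print(frequent_trigrams(tokens, min_freq=2))
# {('long', 'story', 'short'): 2}
```

Raising `min_freq` trades coverage for reliability: fewer trigrams survive, but the ones that do are less likely to be one-off phrases from a single post.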


#### Sentiment Analysis
NLTK already provides a sentiment analysis model called [VADER](https://github.com/cjhutto/vaderSentiment):

"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."

The following are some functions to help process the text with the sentiment analyzer.

In [13]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def get_average_sentiment_scores(text):
    '''
    Takes in a list of sentences and returns the average VADER
    sentiment scores over those sentences.
    '''
    sid = SentimentIntensityAnalyzer()
    averages = {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}
    count = 0
    for sentence in text:
        ss = sid.polarity_scores(sentence)
        for key, val in ss.items():
            averages[key] += val
        count += 1
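    if count == 0:
        # No sentences (e.g. an empty post): return the all-zero averages
        # instead of dividing by zero.
        return averages
    for key in averages.keys():
        averages[key] /= count
    return averages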

def corpus_sentiment_scores():
    '''
    Updates the corpus with the average sentiment scores of each post.
    '''
    global corpus
    # One row of averaged scores per post, aligned with the corpus index.
    scores = pd.DataFrame(list(map(get_average_sentiment_scores, corpus['sent_tokens'])),
                          index=corpus.index)
    # Drop any score columns left over from a previous run before joining the new ones.
    corpus = corpus[corpus.columns.difference(scores.columns)].join(scores)
    
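def average_archive_sentiment_scores(archives):
    '''
    Gets the average sentiment scores over all posts in the given
    archives and returns them as a dictionary.
    '''
    global corpus
    score_columns = ['neg', 'neu', 'pos', 'compound']
    # Select only the numeric score columns; newer pandas versions
    # raise an error when mean() is applied to text columns.
    sentiment_scores = corpus.loc[corpus['archive'].isin(archives), score_columns]
    return sentiment_scores.mean(axis=0).to_dict()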
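
The averaging in `get_average_sentiment_scores` is a plain per-key mean over sentences. The sketch below shows the same arithmetic on hand-written score dictionaries, so it runs without NLTK. The numbers are invented and are not real VADER output, and `mean_scores` is a hypothetical helper.

```python
def mean_scores(sentence_scores):
    '''Averages a list of score dictionaries key by key.'''
    keys = ('neg', 'neu', 'pos', 'compound')
    if not sentence_scores:
        return {key: 0.0 for key in keys}
    return {key: sum(s[key] for s in sentence_scores) / len(sentence_scores)
            for key in keys}

# Invented scores for a two-sentence post.
scores = [
    {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.5},
    {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.25},
]
print(mean_scores(scores))
# {'neg': 0.25, 'neu': 0.5, 'pos': 0.25, 'compound': 0.125}
```

Because every sentence has equal weight, a one-word sentence and a long sentence contribute equally to a post's score.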

Let's first put these scores into the corpus dataframe.

In [14]:
# Preprocessing
corpus_sentiment_scores()
corpus.head()

|   | archive | id | selftext | sent_tokens | title | tokens | tokens_no_punct | url | neg | neu | pos | compound |
|---|---------|----|----------|-------------|-------|--------|-----------------|-----|-----|-----|-----|----------|
| 0 | YTA | birkjn | My wife is pregnant with our daughter. Initial... | [My wife is pregnant with our daughter., Initi... | WIBTA if I ask my pregnant wife to move out be... | [my, wife, is, pregnant, with, our, daughter, ... | [my, wife, is, pregnant, with, our, daughter, ... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.083115 | 0.751923 | 0.164962 | 0.096042 |
| 1 | YTA | d6cpjt | When my girlfriend and I put out advertisement... | [When my girlfriend and I put out advertisemen... | AITA for throwing out my roommates non-vegan f... | [when, my, girlfriend, and, i, put, out, adver... | [when, my, girlfriend, and, i, put, out, adver... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.073885 | 0.845115 | 0.080962 | 0.0118 |
| 2 | YTA | cm0bft | I had a party at my house last night. I have a... | [I had a party at my house last night., I have... | AITA for telling a friend’s friend that he cou... | [i, had, a, party, at, my, house, last, night,... | [i, had, a, party, at, my, house, last, night,... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.095333 | 0.795509 | 0.109176 | 0.069017 |
| 3 | YTA | d7omtz | Edit: I meant “unlocked” in title. Not open. \... | [Edit: I meant “unlocked” in title., Not open.... | AITA for monitoring my son’s shower time and m... | [edit, :, i, meant, “, unlocked, ”, in, title,... | [edit, i, meant, unlocked, in, title, not, ope... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.065571 | 0.924857 | 0.009571 | -0.061807 |
| 4 | YTA | cvlkut | So my situation is a little difficult so I tho... | [So my situation is a little difficult so I th... | WIBTA if I told a close family friend that her... | [so, my, situation, is, a, little, difficult, ... | [so, my, situation, is, a, little, difficult, ... | https://www.reddit.com/r/AmItheAsshole/comment... | 0.084931 | 0.789724 | 0.125276 | 0.095721 |


How do the sentiment scores of the different judgements compare on average?

In [15]:
sentiment_scores = pd.concat(
    [pd.DataFrame(average_archive_sentiment_scores([archive]), index=[0]) for archive in archive_names.keys()],
    ignore_index=True)

archive_abr = list(archive_names.keys())
sentiment_scores.rename(index=lambda i : archive_names[archive_abr[i]], inplace=True)
sentiment_scores

|   | neg | neu | pos | compound |
|---|-----|-----|-----|----------|
| Asshole | 0.073949 | 0.837025 | 0.088843 | 0.042957 |
| Not the A-hole | 0.081716 | 0.836592 | 0.080954 | 0.006575 |
| Everyone Sucks | 0.078342 | 0.840287 | 0.080346 | -0.000789 |
| No A-holes here | 0.072422 | 0.833325 | 0.094126 | 0.054809 |
