## r/AmItheAsshole Language Analysis

### Introduction

[r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/) is a subreddit where people post a story about a conflict they have had, and the community judges who the "asshole" in the situation is. The page description words this very eloquently:

"A catharsis for the frustrated moral philosopher in all of us, and a place to finally find out if you were wrong in an argument that's been bothering you. Tell us about any non-violent conflict you have experienced; give us both sides of the story, and find out if you're right, or you're the asshole." ([r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/))

The goal of this project is to find out whether there are any notable lexical differences between the language of the "assholes" and the language of those who are not.

One thing I would like to note is that this corpus is not transcribed from speech, so the language of the posts might have been thoughtfully worded and might not be representative of one's natural speech. Still, I believe that the judgements voted on by the Reddit community can provide insight into how the two groups use language.

### Scraping from Reddit
#### PRAW
[The Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/v4.1.0/index.html) is a Python package that wraps the Reddit API and covers most of what you can do on Reddit. In this project it is used to scrape posts from the r/AmItheAsshole subreddit.

#### The r/AmItheAsshole Archives
The subreddit archives posts under 4 judgement categories that are relevant to this project. There are 5 possible judgements in total, but the fifth, 'INFO' (Not enough info), is not archived and does not help with this investigation. The 4 archived judgements are described in the docstring of the `get_archive` function below.

#### Helper functions for scraping from reddit

In [1]:
import os
import glob
import yaml
import praw
import pandas as pd
from tqdm import tqdm_notebook



# Initialize reddit API
reddit = praw.Reddit('scraper', user_agent='corpus ling project')
reddit.read_only = True

# Abbreviations for the archive queries
archive_names = { 'TYA' : 'Asshole', 'NTA' : 'Not the A-hole', 'ESH' : 'Everyone Sucks',
        'NAH' : 'No A-holes here'}

# Pandas dataframe containing all of the posts.
corpus = pd.DataFrame()

def get_archive(judgment):
    '''
    Get the posts from a single archive of r/AmItheAsshole
    Archive judgement options are:
        'TYA' -- You're the Asshole (& the other party is not)
        'NTA' -- You're Not the A-hole (& the other party is)
        'ESH' -- Everyone Sucks Here
        'NAH' -- No A-holes here
    '''
    return reddit.subreddit('AmItheAsshole').search(
            'flair_name:"' + archive_names[judgment] + '"', limit=None)

def load_archives(download_local = True):
    '''
    Grabs each of the archives and puts them in the corpus dataframe.
    The download_local flag determines whether the posts are also saved
    as YAML files, one directory per archive, in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        posts = get_archive(judgement)
        # pd.json_normalize and pd.concat replace the deprecated
        # pd.io.json.json_normalize and DataFrame.append calls.
        row_df = pd.json_normalize(
                [{'archive': judgement, **vars(post)} for post in posts])
        # store only the id, title, archive, raw text, and url
        corpus = pd.concat(
                [corpus, row_df[['id', 'title', 'archive', 'selftext', 'url']]],
                ignore_index=True, sort=True)
        
        # Download the text locally
        if download_local:
            # Separate archives into different directories.
            os.makedirs(judgement, exist_ok=True)
            for index, post in corpus.loc[corpus['archive'] == judgement].iterrows():
                # Save each post in its archive's directory with its id as the filename
                with open(os.path.join(judgement, post['id'] + '.yaml'), 'w') as file:
                    yaml.dump(post.to_dict(), file)

def local_load_archives():
    '''
    Loads the archived posts from yaml files in the current directory.
    '''
    global corpus 
    corpus = pd.DataFrame()
    for judgement in tqdm_notebook(archive_names.keys()):
        for filename in glob.glob(judgement + '/*.yaml'):
            with open(filename) as file:
                # pd.concat replaces the deprecated DataFrame.append call.
                corpus = pd.concat(
                    [corpus, pd.json_normalize(yaml.load(file, Loader=yaml.FullLoader))],
                    ignore_index=True, sort=True)
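
To make the search in `get_archive` more concrete, here is a small sketch of the query string it builds for each judgement. It copies the `archive_names` mapping from the cell above. `build_flair_query` is a hypothetical helper introduced only for illustration, and nothing here contacts Reddit.

```python
# Same abbreviation-to-flair mapping as in the helper cell above.
archive_names = {'TYA': 'Asshole', 'NTA': 'Not the A-hole',
                 'ESH': 'Everyone Sucks', 'NAH': 'No A-holes here'}


def build_flair_query(judgement):
    """Return the search string used for one judgement (hypothetical helper)."""
    return 'flair_name:"{}"'.format(archive_names[judgement])


# Show the query that would be sent for every archive.
for abbreviation in archive_names:
    print(abbreviation, '->', build_flair_query(abbreviation))
```

Each archive is therefore just a flair-filtered search, which is why the results can change as new posts are flaired.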



#### Several options for loading the corpus of posts:
To keep things consistent, please use the `local_load_archives()` option, because posts are added over time and the analysis of this corpus may change as new posts are added.

Note: `load_archives()` will not work unless you have a `praw.ini` file in the current directory. This file contains the necessary account information for using the Reddit API.
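
As a rough guide to what that file might look like, the sketch below renders a minimal `praw.ini` with Python's `configparser`. The section name `scraper` matches the site name passed to `praw.Reddit` above. The credential values are placeholders, and I am assuming read-only access, which needs only a client ID and client secret. My own file is not reproduced here and yours may need more fields.

```python
import configparser
import io

# Placeholder credentials (assumption): replace them with the values from
# your own Reddit app. They are not real and will not authenticate.
config = configparser.ConfigParser()
config['scraper'] = {
    'client_id': 'YOUR_CLIENT_ID',
    'client_secret': 'YOUR_CLIENT_SECRET',
}

# Render the file contents as text; save this text as ./praw.ini to use it.
buffer = io.StringIO()
config.write(buffer)
praw_ini_text = buffer.getvalue()
print(praw_ini_text)
```

Keep the real file out of version control, since it holds your credentials.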

In [5]:
# Run this if you just want to use the posts that are downloaded locally.
local_load_archives()

In [4]:
# Run this if you want the most recent set of posts but without downloading them locally.
load_archives(False)

In [2]:
# Run this if you want to load and update the local files with the current archives.
load_archives()

#### The corpus
The corpus of posts is represented as a dataframe whose most important columns are the archive column (`archive`) and the post text (`selftext`). The following code prints a summary of each archive, including its post count:

In [6]:
for judgement in archive_names.keys():
    print("\n\n Summary of all of the posts in the \"" + archive_names[judgement] + "\" archive:")
    print(corpus.loc[corpus['archive'] == judgement].describe())



 Summary of all of the posts in the "Asshole" archive:
       archive      id                                           selftext  \
count      328     328                                                328   
unique       1     328                                                328   
top        TYA  eblrzx  Hi reddit. My husband and I were eating at chi...   
freq       328       1                                                  1   

                                                    title  \
count                                                 328   
unique                                                328   
top     AITA for riding my bike to work even though I ...   
freq                                                    1   

                                                      url  
count                                                 328  
unique                                                328  
top     https://www.reddit.com/r/AmItheAsshole/comment...  
freq       
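
If only the post counts are needed, a shorter check than `describe()` may be enough. The sketch below assumes a dataframe with the same `archive` and `selftext` columns as `corpus`. `toy_corpus` and its rows are invented for illustration and are not real posts.

```python
import pandas as pd

# Toy stand-in for the real corpus; these rows are invented for illustration.
toy_corpus = pd.DataFrame({
    'id': ['a1', 'a2', 'a3', 'a4', 'a5'],
    'archive': ['TYA', 'NTA', 'NTA', 'ESH', 'NAH'],
    'selftext': ['text one', 'text two', 'text three', 'text four', 'text five'],
})

# Number of posts in each archive, indexed by the archive abbreviation.
posts_per_archive = toy_corpus['archive'].value_counts()

# Pull out the text of a single archive as a plain list of strings.
nta_texts = toy_corpus.loc[toy_corpus['archive'] == 'NTA', 'selftext'].tolist()

print(posts_per_archive)
```

The same two lines should work on the real `corpus` dataframe once it has been loaded.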

### Analysing the Corpus

#### NLTK
[Natural Language Toolkit (NLTK)](https://www.nltk.org/) is a Python library that provides a wide range of natural language processing tools.
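
To give a feel for the kind of lexical comparison this project is after, here is one possible sketch, not necessarily the method used later in the analysis. It uses only the standard library so that it runs anywhere. NLTK's `FreqDist` is a `Counter` subclass and could take the place of `Counter` here. The function name, the simple regex tokenizer, and the example sentences are all assumptions made for illustration.

```python
import re
from collections import Counter


def relative_frequencies(texts):
    """Return each word's share of all word tokens in a list of texts."""
    tokens = [tok for text in texts
              for tok in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Invented example sentences, not real posts.
group_a = ["I told him I was sorry", "I was wrong"]
group_b = ["She said she was sorry"]

freq_a = relative_frequencies(group_a)
freq_b = relative_frequencies(group_b)

# Words that make up a larger share of group_a than of group_b,
# ordered by the size of that difference.
overused_in_a = sorted(
    (w for w in freq_a if freq_a[w] > freq_b.get(w, 0.0)),
    key=lambda w: freq_a[w] - freq_b.get(w, 0.0),
    reverse=True)
print(overused_in_a)
```

Comparing relative rather than raw frequencies matters here because the archives contain different numbers of posts.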
