# Introduction
This notebook shows the data wrangling steps taken for this Am I The Asshole classifier and chatbox project. It contains the work for loading, examining and cleaning the data. Data was loaded using the Python Reddit API Wrapper (PRAW). In order to get enough data, several pulls were made and concatenated. As this is an NLP project, a lot of text cleaning went into this process. Resources from nltk and sklearn were used to clean and vectorize the data.

## Table of Contents
- [Importing Libraries](#section1)
- [Loading Dataset](#section2)
- [Examining DataSet](#section3)
- [Cleaning and Vectorizing Dataset](#section4)

### Importing Libraries  <a name="section1"></a>
Import necessary libraries. 

In [1]:
import praw
import seaborn as sns
import numpy as np
import pandas as pd
from datetime import datetime
import string
import CommentPull.CommentPuller #Import custom class for handling labeling and df's. 

### Loading Dataset  <a name="section2"></a>
A single PRAW listing pull is cut off after about 1000 posts. Several pulls were done to overcome this so that enough data could be collected. These pulls also take a long time, so before proceeding the pulls are concatenated and saved.

Here the Reddit API is accessed and a subreddit object for Am I the Asshole is created.

In [2]:
reddit = praw.Reddit(client_id='my_id', client_secret='this is a secret', 
                     user_agent='AITA Scrapping')
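
The cell above writes the credentials directly into the notebook. A minimal sketch of one alternative, assuming the credentials are stored in environment variables: the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` and the helper `reddit_credentials` are hypothetical and not part of the original project.

```python
import os

def reddit_credentials(env=os.environ):
    """Collect PRAW keyword arguments from a mapping of environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent string used in this notebook
        "user_agent": env.get("REDDIT_USER_AGENT", "AITA Scrapping"),
    }

# Usage (needs praw installed and real credentials exported):
# reddit = praw.Reddit(**reddit_credentials())
```

This keeps secrets out of the saved notebook while leaving the rest of the workflow unchanged.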

In [3]:
aita = reddit.subreddit('AmITheAsshole')
reddit.read_only

True

Below are six different pulls for data. The first four were done at once and all use the top() method for pulling. A different time_filter is passed to each so that different data is collected for each pull. These data were then concatenated and saved. After deciding that more data was needed, the hot and new pulls were made and saved.

In [None]:
#All time top() pull
posts = []
for post in aita.top(limit=1000):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
aita_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])

In [None]:
#Week top() pull
posts = []
for post in aita.top(limit=1000, time_filter='week'):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
week_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])

In [None]:
#Month top() pull
posts = []
for post in aita.top(limit=1000, time_filter='month'):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
month_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])

In [None]:
#Year top() pull
posts = []
for post in aita.top(limit=1000, time_filter='year'):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
year_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])

In [44]:
#Hot() pull
posts = []
for post in aita.hot(limit=1000):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
hot_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])

In [45]:
#New() pull
posts = []
for post in aita.new(limit=1000):
    posts.append([post.title, post.id, post.selftext, post.comments, post.link_flair_text, 
                  post.upvote_ratio, post.score, post.num_comments])
new_df = pd.DataFrame(posts, columns=['title', 'post_id', 'post_text', 'comments', 'flair_text', 
                                       'upvote_ratio', 'num_upvotes', 'num_comments'])
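
The six pull cells above repeat the same loop and column list. One possible way to factor that out is sketched below. The helper name `listing_to_df` and the `fake_post` stand-in are assumptions made for illustration; the stand-in only exists so the sketch runs without network access.

```python
from types import SimpleNamespace
import pandas as pd

COLUMNS = ['title', 'post_id', 'post_text', 'comments', 'flair_text',
           'upvote_ratio', 'num_upvotes', 'num_comments']

def listing_to_df(listing):
    """Build a dataframe from any iterable of submission-like objects."""
    rows = [[post.title, post.id, post.selftext, post.comments, post.link_flair_text,
             post.upvote_ratio, post.score, post.num_comments] for post in listing]
    return pd.DataFrame(rows, columns=COLUMNS)

# Stand-in for a PRAW submission (hypothetical values)
fake_post = SimpleNamespace(title='AITA for testing?', id='abc123', selftext='Some text',
                            comments=('c1', 'c2'), link_flair_text='Not the A-hole',
                            upvote_ratio=0.9, score=10, num_comments=2)
demo_df = listing_to_df([fake_post])

# With PRAW the same helper would be called as, for example:
# year_df = listing_to_df(aita.top(limit=1000, time_filter='year'))
```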

In [None]:
#Combine the four top() pulls and drop rows that appear in more than one pull
aita_df = pd.concat([aita_df, week_df, month_df, year_df]).drop_duplicates().reset_index(drop=True)

In [47]:
aita_df = pd.concat([aita_df, hot_df])
aita_df = pd.concat([aita_df, new_df])
aita_df.shape

(5777, 8)
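
Note that the hot and new pulls are concatenated here without dropping duplicates, so a post that appears in more than one pull is still counted in the shape above (the CommentPuller class used later removes duplicate post_ids). A small sketch of deduplicating on `post_id` at concatenation time, using toy frames whose names and values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for two pulls that share one post (hypothetical data)
top_demo = pd.DataFrame({'post_id': ['a1', 'b2'], 'title': ['first', 'second']})
hot_demo = pd.DataFrame({'post_id': ['b2', 'c3'], 'title': ['second', 'third']})

# Keep the first occurrence of each post_id and rebuild a clean index
combined = (pd.concat([top_demo, hot_demo])
              .drop_duplicates(subset='post_id')
              .reset_index(drop=True))
```

Deduplicating on `post_id` alone avoids comparing the other columns, whose values (scores, comment counts) can change between pulls.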

In [48]:
aita_df.to_pickle('aita_data.pkl') #Save the combined pulls so they do not need to be repeated

### Examining Dataset <a name="section3"></a>
This brief section examines the dataframe to see how much data we have and trims it down to the rows needed for classification.

In [4]:
#Read in data from previous pulls
aita_df = pd.read_pickle('aita_data.pkl')
aita_df.shape

(5777, 8)

Here we can see that there are many different types of flairs for posts. These flairs are assigned by a bot 24 hours after the post is made. The bot takes into account the number of upvotes for each comment as well as whether the comment is saying 'NTA' or 'YTA'. Here 'NTA' means 'Not the Asshole' and 'YTA' means 'You're the Asshole'. These flairs will be used as labels for the data.

In [5]:
aita_df['flair_text'].value_counts()

Not the A-hole     4023
Asshole             539
UPDATE              205
No A-holes here     155
Everyone Sucks      137
Not enough info      61
TL;DR                21
                     11
META                  9
Update                6
Charitable META       1
Open Forum            1
META Asshole          1
Name: flair_text, dtype: int64

This project only seeks to classify a post as 'NTA' or 'YTA', so posts with any other flair will be dropped. This also removes posts that are not asking for a judgment on a story, such as updates and meta posts.

In [6]:
#Keep rows related to assholes
aita_df = aita_df.loc[(aita_df['flair_text'] == 'Not the A-hole') | (aita_df['flair_text'] == 'Asshole')]
aita_df.reset_index(drop=True, inplace=True)
aita_df.head(10)

Unnamed: 0,title,post_id,post_text,comments,flair_text,upvote_ratio,num_upvotes,num_comments
0,AITA for telling my wife the lock on my daught...,ocx94s,My brother in-law (Sammy) lost his home shortl...,"(h3xygto, h3wzvps, h3wy7le, h3wysc4, h3wyfzq, ...",Not the A-hole,0.92,79183,5324
1,AITA For suing my girlfriend after she had my ...,gr8bp3,I'll try to keep this short. I had a [1967 Imp...,"(frysjyr, frxd8p0, frxbsw1, frxdiv6, frxeco8, ...",Not the A-hole,0.98,70802,2775
2,AITA for pretending to get fired when customer...,e5k3z2,I am a high schooler with a weekend job at a c...,"(f9k4vv0, f9k55ot, f9k658r, f9k786c, f9k5lfx, ...",Not the A-hole,0.92,63524,3648
3,AITA for punishing my son after he said someth...,iagtso,"About a week ago, my (39F) family ordered Chin...","(g1pfb3h, g1o1jvg, g1o3lu2, g1o7upz, g1o02m8, ...",Not the A-hole,0.92,52902,2631
4,WIBTA for refusing to stop cooking bacon in my...,dkqv29,"Dad here, old fart, loves his daughter to piec...","(f4ivyg3, f4iwop1, f4iy1el, f4iwiwh, f4ixrfa, ...",Not the A-hole,0.89,51399,7553
5,Aita for wearing the “joke” bikini my friend g...,d7yuot,So it was my birthday couple months ago. Had a...,"(f16j5c6, f15xb3y, f15yu9d, f168ufx, f160co8, ...",Not the A-hole,0.86,50963,4030
6,"AITA for ""announcing"" that my dad's not paying...",ofol5x,My aunt and uncle are paying for my cousins co...,"(h4dpm7g, h4drdr2, h4dtq2h, h4dq47x, h4dqkgo, ...",Not the A-hole,0.89,50841,2884
7,AITA for refusing to pay for my sister's husba...,ouje2w,\n\n\nContext: My sister (F27) and I (18F) los...,"(h74siuj, h72q30k, h72q7dp, h72oyxj, h72p2ky, ...",Not the A-hole,0.87,49640,6429
8,AITA for telling my son he deserved his gf bre...,daglhs,So my son had a long-distance gf recently for ...,"(f1qq36d, f1pi20n, f1pi9gw, f1phqpo, f1pieyj, ...",Not the A-hole,0.92,47772,4008
9,AITA - Telling my parents to pay me back my co...,i9pm5u,I was raised by parents who believed (religiou...,"(g1gjks8, g1gjt5y, g1gk9da, g1gkibi, g1gl4u6, ...",Not the A-hole,0.96,47555,2793


In [7]:
aita_df['flair_text'].value_counts()

Not the A-hole    4023
Asshole            539
Name: flair_text, dtype: int64
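
The same filter can also be expressed with `isin`, which scales more easily if more flairs are ever kept. The sketch below uses a toy frame, and the mapping of flairs to a 0/1 `label_id` (0 for 'Not the A-hole', 1 for 'Asshole') is an assumption chosen to match the comment labels used later in this notebook.

```python
import pandas as pd

# Toy flair column (hypothetical data)
demo = pd.DataFrame({'flair_text': ['Not the A-hole', 'Asshole', 'UPDATE', None, 'Asshole']})

keep = ['Not the A-hole', 'Asshole']
filtered = demo[demo['flair_text'].isin(keep)].reset_index(drop=True)

# Assumed encoding: 0 = Not the A-hole, 1 = Asshole
filtered = filtered.assign(
    label_id=filtered['flair_text'].map({'Not the A-hole': 0, 'Asshole': 1}))
```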

A quick check is done here to make sure that posts with a low number of comments are not low effort and still contribute to the data.

In [8]:
#Inspect low comment rows to see if they should be kept. 
print(len(aita_df[aita_df['num_comments'] < 10]))
aita_df[aita_df['num_comments'] < 10].head(10)

108


Unnamed: 0,title,post_id,post_text,comments,flair_text,upvote_ratio,num_upvotes,num_comments
1174,AITA for choosing to distance myself from my s...,qvbcwt,I moved back home after a breakup to save for ...,"(hkv9qkg, hkvbyeu, hkvahy8, hkvd9qw, hkvasj4, ...",Not the A-hole,0.95,143,9
1407,AITA for refusing to entertain my FIL while he...,qxanrz,My father in law has always had a very close r...,"(hl87v57, hl89yti, hl8kam3, hl883uo, hl8kdl0, ...",Not the A-hole,0.96,36,8
1531,AITA for asking my friend to plan our mutual f...,que9n5,Let’s call her A. We have a close group of fou...,"(hkpi69k, hkpi7qg, hkpofal, hkpi0zg, hkpk29a)",Not the A-hole,0.92,20,8
1560,"AITA for not letting a friend move in in an ""e...",qud1d4,I (24m) have this friend (27F) with whom I hav...,"(hkpceso, hkpd5wp, hkpidqq, hkpd4yp, hkpgc6v, ...",Not the A-hole,0.79,19,9
1562,AITA for not wanting to visit my half siblings?,qxzx54,Long story short I was always terrified of my ...,"(hlcyss1, hld0a91, hld14v7, hld4q5k, hlcylqz, ...",Not the A-hole,0.91,16,7
1574,AITA saved from the pawnshop,qwb5z0,"My exhusband and I were together for 20 years,...","(hl1ut4o, hl1xqnt, hl217bq, hl27y3n, hl28stb, ...",Not the A-hole,0.91,18,8
1619,AITA - Fiance Using My Hairbrush,qu2sjj,I've caught my fiance using my hairbrush in th...,"(hknlf65, hknmczj, hknmmv7, hknn3ca, hknn9b3, ...",Not the A-hole,0.83,16,7
1624,AITA for wanting my husband to go back to work?,qy2f5u,My (30f) husband (34m) has been experiencing a...,"(hldbspl, hldcquk, hldccq3, hldfdru, hldcrsa, ...",Not the A-hole,0.9,16,9
1636,AITAH for not wanting my mom to be on the phon...,qw9mko,"My sister is 23 and lives out of state, and sh...","(hl1gt1a, hl1h8s3, hl1hftu, hl1hcmi, hl1hjhm, ...",Not the A-hole,0.84,15,7
3603,AITA for talking to a friend's ex,r1hc8y,"So it's a little more complex than that, but t...","(hlymbyv, hlyoy7h, hlyn30r, hlyqjy5, hlysjyh, ...",Not the A-hole,0.8,8,8


### Cleaning and Vectorizing the Dataset <a name="section4"></a>
This section shows the steps taken to clean the data and vectorize it for machine learning use. A custom class (CommentPuller) is used here to perform most of the data manipulations.

In [9]:
#Wrap aita_df in the class that will perform our manipulations and labeling.
#This class will remove duplicate post_ids.
#This class also adds a few features we will explore in the next cell.
comment_puller = CommentPull.CommentPuller.CommentPuller(aita_df)

In [10]:
#We can see on the far right side of the dataframe that we have six new features.
#These all deal with length, reading ease and reading grade level.
#These features will be fun to explore during EDA and may contribute to our model.
comment_puller.post_df.head(5)

Unnamed: 0,title,post_id,post_text,comments,flair_text,upvote_ratio,num_upvotes,num_comments,title_length,title_ease_score,title_grade_level,post_text_length,text_ease_score,text_grade_level
0,AITA for telling my wife the lock on my daught...,ocx94s,My brother in-law (Sammy) lost his home shortl...,"(h3xygto, h3wzvps, h3wy7le, h3wysc4, h3wyfzq, ...",Not the A-hole,0.92,79183,5324,27,77.91,10.8,486,59.1,16.95
1,AITA For suing my girlfriend after she had my ...,gr8bp3,I'll try to keep this short. I had a [1967 Imp...,"(frysjyr, frxd8p0, frxbsw1, frxdiv6, frxeco8, ...",Not the A-hole,0.98,70802,2775,16,72.16,6.4,924,74.15,13.49
2,AITA for pretending to get fired when customer...,e5k3z2,I am a high schooler with a weekend job at a c...,"(f9k4vv0, f9k55ot, f9k658r, f9k786c, f9k5lfx, ...",Not the A-hole,0.92,63524,3648,13,75.2,8.28,403,62.95,14.36
3,AITA for punishing my son after he said someth...,iagtso,"About a week ago, my (39F) family ordered Chin...","(g1pfb3h, g1o1jvg, g1o3lu2, g1o7upz, g1o02m8, ...",Not the A-hole,0.92,52902,2631,10,69.79,8.0,377,69.55,12.35
4,WIBTA for refusing to stop cooking bacon in my...,dkqv29,"Dad here, old fart, loves his daughter to piec...","(f4ivyg3, f4iwop1, f4iy1el, f4iwiwh, f4ixrfa, ...",Not the A-hole,0.89,51399,7553,17,62.68,11.51,412,64.68,14.72
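
The notebook does not say how the ease score and grade level columns are computed. As a rough illustration only, the sketch below assumes the standard Flesch reading ease and Flesch-Kincaid grade formulas with a very naive syllable counter; the function names are hypothetical and the values it produces will not necessarily match the columns above.

```python
import re

def count_syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def readability(text):
    """Return (word_count, Flesch reading ease, Flesch-Kincaid grade level)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0, None, None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Standard Flesch reading ease and Flesch-Kincaid grade formulas
    ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return len(words), round(ease, 2), round(grade, 2)

n_words, ease, grade = readability("The cat sat on the mat. It was happy.")
```

Short sentences with short words push the ease score up and the grade level down, which is the relationship these features are meant to capture.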


This method pulls the top ten comments from each post and makes a new dataframe from them, preserving post_id. This will allow us to label each comment as 'NTA' or 'YTA', or as neither if there is a conflict.

In [11]:
#Make a df within the class made of comments and their respective post ids
comment_puller.make_comment_df()
comment_puller.comment_df.shape

(35417, 2)
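
The internals of `make_comment_df` are not shown in this notebook. A minimal sketch of the idea, assuming each post row holds a sequence of comment strings: keep at most the first ten per post and reshape to one row per comment. With PRAW the items would be Comment objects, whose text lives in `.body`. The names `posts_demo`, `TOP_N` and `comment_demo` are made up for illustration.

```python
import pandas as pd

# Toy posts with plain-string comments (hypothetical data)
posts_demo = pd.DataFrame({
    'post_id': ['a1', 'b2'],
    'comments': [['NTA, clearly', 'YTA here', 'INFO?'], ['NTA']],
})

TOP_N = 10  # number of comments kept per post

# Truncate each comment list, then explode to one row per comment
comment_demo = (posts_demo
                .assign(comment=posts_demo['comments'].apply(lambda c: list(c)[:TOP_N]))
                .explode('comment')[['comment', 'post_id']]
                .reset_index(drop=True))
```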

In [12]:
#Inspect the first few rows of this new df. 
comment_puller.comment_df.head(10)

Unnamed: 0,comment,post_id
0,#[Be Civil](https://www.reddit.com/r/AmItheAss...,ocx94s
1,NTA\n\nThis is a recurring theme here on Reddi...,ocx94s
2,NTA\n\nYour daughter doesn’t feel like she has...,ocx94s
3,NTA\n\nGood on you for standing up for your da...,ocx94s
4,NTA. Don't back down. You are the only one sti...,ocx94s
5,NTA\n\n>my daughters aren't thieves!!! it's no...,ocx94s
6,"""I let Sammy and his family move in which's so...",ocx94s
7,NTA. \nNormal behaviour is asking to borrow st...,ocx94s
8,NTA.\n\nGood for you for standing up for your ...,ocx94s
9,NTA. And why the heck are you supposed to trea...,ocx94s


In [13]:
#This method will search the comments for 'NTA' and 'YTA' and label them accordingly.
#A 1 denotes YTA and a 0 denotes NTA. NaN indicates a comment with neither or with both.
comment_puller.label_dfs()
comment_puller.comment_df.head()

Unnamed: 0,comment,post_id,label
0,#[Be Civil](https://www.reddit.com/r/AmItheAss...,ocx94s,
1,NTA\n\nThis is a recurring theme here on Reddi...,ocx94s,0.0
2,NTA\n\nYour daughter doesn’t feel like she has...,ocx94s,0.0
3,NTA\n\nGood on you for standing up for your da...,ocx94s,0.0
4,NTA. Don't back down. You are the only one sti...,ocx94s,0.0
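
The labeling rule described above can be sketched as follows. This is not the code inside `label_dfs`, which is not shown here; the helper `label_comment`, the whole-word matching and the case sensitivity are all assumptions.

```python
import re
import numpy as np
import pandas as pd

def label_comment(text):
    """Return 1.0 for YTA, 0.0 for NTA, NaN when neither or both appear."""
    has_nta = re.search(r'\bNTA\b', text) is not None
    has_yta = re.search(r'\bYTA\b', text) is not None
    if has_nta == has_yta:  # both present or both absent: ambiguous
        return np.nan
    return 1.0 if has_yta else 0.0

# Toy comments (hypothetical data)
comments_demo = pd.Series(['NTA. Good for you', 'YTA, sorry',
                           'Not NTA but not YTA either', 'Be civil'])
labels = comments_demo.apply(label_comment)
```

Because NaN is a float, the label column becomes float-valued, which matches the `0.0` entries in the output above.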


This method takes the columns post_text and title from post_df and cleans them of numbers, stop words and punctuation. It also makes the text all lower case. Then sklearn's CountVectorizer class is used to turn these data into vectors. A new text_df (containing vectors for post_text) and title_df (containing vectors for title text) are created and stored in the class.
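
The cleaning step can be sketched with the standard library alone. The stop word set below is a tiny hand-written stand-in (the project uses resources from nltk), and the helper name `clean_text` is hypothetical.

```python
import re
import string

# Tiny illustrative stop word set; a real run would use a full list such as nltk's
STOP_WORDS = {'a', 'an', 'and', 'the', 'to', 'my', 'i', 'for', 'of', 'in'}

def clean_text(text, stop_words=STOP_WORDS):
    """Lower-case the text and strip numbers, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r'\d+', ' ', text)  # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    tokens = [tok for tok in text.split() if tok not in stop_words]
    return ' '.join(tokens)

cleaned = clean_text("AITA for telling my wife I won't pay $500 for the 2 tickets?")
```

Removing punctuation without inserting a space turns "won't" into "wont", which is consistent with column names such as `wouldnt` in the vectorized output further down.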

In [14]:
#This method cleans and vectorizes the data.
#By default, the vectorizer will only include words with 4 or more occurrences. This can be set with the parameter freq.
#N-grams can also be modified with the parameter gram. The default is single words only.
test = comment_puller.vectorize_text(freq = 6, gram=(1,3))
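
How `freq` and `gram` are passed on inside `vectorize_text` is not shown. One plausible mapping, assumed here, is onto CountVectorizer's `min_df` and `ngram_range` parameters. Note that `min_df` counts the number of documents a term appears in, not its total number of occurrences. The toy documents below are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Toy, already-cleaned documents (hypothetical data)
docs = ['refusing pay sister wedding',
        'telling sister refusing pay',
        'telling mom wedding']

# Keep unigrams and bigrams that appear in at least two documents
vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# Order the column names by their feature index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
vector_demo = pd.DataFrame(matrix.toarray(), columns=vocab)
```

Raising the document-frequency threshold and widening the n-gram range pull in opposite directions: the first shrinks the vocabulary, the second grows it.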

In [15]:
#Examine text_df 
print(comment_puller.text_df.shape)
comment_puller.text_df.head()

(3564, 17868)


Unnamed: 0,post_id,label_id,num_words,ease_score,grade_level,aback,abandon,abandoned,abandoning,ability,...,yr,yr ago,yr old,yta,yup,zero,zone,zoo,zoom,zoom call
0,ocx94s,0,486,59.1,16.95,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,gr8bp3,0,924,74.15,13.49,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,e5k3z2,0,403,62.95,14.36,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,iagtso,0,377,69.55,12.35,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,dkqv29,0,412,64.68,14.72,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
#Examine title_df
print(comment_puller.title_df.shape)
comment_puller.title_df.head()

(3564, 892)


Unnamed: 0,post_id,label_id,num_words,ease_score,grade_level,abandoning,able,accept,accepting,access,...,would,wouldnt,wrong,year,year ago,year old,yelling,yelling mom,yo,younger
0,ocx94s,0,27,77.91,10.8,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,gr8bp3,0,16,72.16,6.4,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,e5k3z2,0,13,75.2,8.28,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,iagtso,0,10,69.79,8.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,dkqv29,0,17,62.68,11.51,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
#Get the total count of each term to check the dataframes for outliers
title_sums = comment_puller.title_df.iloc[:, 5:].sum(axis=0)
body_sums = comment_puller.text_df.iloc[:, 5:].sum(axis=0)

In [18]:
#Show values over 200 for title_df
title_sums[title_sums > 200]

aita             3319
aita refusing     242
aita telling      528
brother           203
daughter          215
family            206
friend            356
husband           255
mom               202
refusing          264
sister            313
telling           588
wanting           208
wibta             221
dtype: int64

Since 'aita' shows up so much more than anything else, let's delete it as it's clearly an outlier. This makes sense as most posts start with AITA.

In [19]:
#Delete the 'aita' column
comment_puller.title_df.drop('aita', axis=1, inplace=True)
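
The check-and-drop steps above can be combined into one pass that removes every term column whose total exceeds a chosen cutoff. The toy frame, the cutoff of 2 and the variable names below are assumptions for illustration; the notebook itself picks the column to drop by inspection.

```python
import pandas as pd

# Toy term-count frame (hypothetical data)
counts = pd.DataFrame({'post_id': ['a1', 'b2', 'c3'],
                       'aita': [1, 1, 1],
                       'sister': [0, 1, 0],
                       'wedding': [1, 0, 0]})

# Total occurrences of each term across all posts
totals = counts.drop(columns='post_id').sum(axis=0)

# Terms above the cutoff are treated as outliers and removed
frequent = totals[totals > 2]
trimmed = counts.drop(columns=frequent.index)
```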

The body text of posts doesn't seem to have this problem, so it can be left the same. 

In [20]:
#Check for values with counts over 4000
body_sums[body_sums > 4000].sort_values(ascending=False)

said     7053
told     6504
like     5605
time     4752
get      4749
would    4305
got      4079
dtype: int64

In [40]:
words = comment_puller.post_df['post_text']

In [21]:
#Save the class
comment_puller.save()