### Import Packages

In [173]:
#Standard Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Web Request Package
import requests

### How Do We Get a Bunch of Reddit Comments?

Reddit, like many websites that produce large amounts of data that people may want to access, has what is called an API (Application Programming Interface).  Essentially, this is a set of definitions, protocols, and tools that make data available to developers in an accessible format.  However, APIs typically place fairly strict limits on the amount of data one can access, at least without paying a premium.

Reddit's API is convenient in that it stores a lot of useful data and labeling relevant to each comment, user, etc. However, regular users of the API are limited to pulling 1000 comments at a time.  This in itself is not really an issue, but Reddit also limits users to pulling only the *most recent* 1000 comments.  This means we would have to wait for some indeterminate amount of time between pulls, and given the volume of activity on these subreddits, it could be months or even years before we have a powerful dataset.

Luckily, there is an open-source alternative to Reddit's API, called **[Pushshift API](https://github.com/pushshift/api)**. In addition to providing useful extra features, the Pushshift API has two great advantages:

- **We can specify a date and time that we want to start a pull from** (i.e., pull the 1000 comments prior to time *t*). This allows us to loop backward over set intervals of time and pull as much distinct content as we want.


- **The data is returned as a list of dictionaries**, which allows very easy conversion into Pandas.
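As a small illustration of both points, the sketch below builds a query string with a `before` timestamp and converts a list of dictionaries into a DataFrame. The records are invented for illustration and only mimic the shape of a Pushshift response; the `before` value is simply the timestamp used later in this notebook.

```python
import pandas as pd
from urllib.parse import urlencode

# Query values: up to 1000 comments from r/AskMen posted before an epoch timestamp.
base_url = "https://api.pushshift.io/reddit/search/comment/"
params = {"subreddit": "askmen", "before": 1545243580, "size": 1000}
url = base_url + "?" + urlencode(params)

# Invented records shaped like the 'data' field of a response (not real comments).
sample_response = {
    "data": [
        {"id": "aaa111", "parent_id": "t3_xyz", "body": "First reply", "score": 3},
        {"id": "bbb222", "parent_id": "t1_aaa111", "body": "Nested reply", "score": 1},
    ]
}

# Each dictionary becomes a row and each key becomes a column.
sample_df = pd.DataFrame(sample_response["data"])
print(url)
print(sample_df)
```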

Let's access some Reddit comments and see what we get.

In [174]:
#The url given below calls for the most recent 1000 comments from threads on r/AskMen.
url = "https://api.pushshift.io/reddit/search/comment/?subreddit=askmen&sort=des&size=1000"

In [175]:
headers = {'User-agent': 'eamonious'}
res = requests.get(url, headers=headers)
res.status_code

200

The 200 code indicates that we have successfully accessed the API.

In [176]:
json = res.json()
comments = pd.DataFrame(json['data'])

Let's look at the different data features available to us.

In [177]:
comments.columns

Index(['author', 'author_cakeday', 'author_flair_background_color',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_patreon_flair', 'body', 'created_utc', 'distinguished',
       'gildings', 'id', 'link_id', 'no_follow', 'parent_id', 'permalink',
       'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit',
       'subreddit_id'],
      dtype='object')

We are interested in the following features: 
- **body**: raw text of the comment
- **created_utc**: timestamp of the comment
- **id**: comment unique id
- **parent_id**: unique id of the *parent* comment (or the *thread id* for first-tier comments that reply directly to the thread). Because of the way this field is formatted, we will be able to identify which comments are first-tier comments.
- **score**: how many upvotes the comment has
- **subreddit**: subreddit that the comment was in.  This will be the target variable.

In [178]:
#Removes everything but the features we are interested in.
comments = comments[['body','created_utc','id','parent_id','score','subreddit']]

### AskMen and AskWomen: Establishing a Proxy for Gender

For this project, we will be focusing on the subreddits **r/AskMen** and **r/AskWomen**.  

These two subreddits work as follows: people start threads in which they ask a question, looking for answers from only male or only female redditors, respectively.  So, ***in each subreddit, we can expect the first-tier comments (replying directly to the thread) to be almost exclusively from men or women, respectively***: men answering questions in AskMen, and women answering questions in AskWomen. Thus, if we can grab only first-tier comments from each subreddit, we can get a large, balanced dataset of essentially gender-labeled Reddit comments, without any manual tagging.

Accordingly, we want to collect only first-tier comments if possible, and exclude all lower-tier comment replies, which may be from either gender. ***We can filter for first-tier comments by using the 'parent_id' feature***.  First-tier comments all show the thread id as the parent, which begins with 't3_'. Lower-tier comments show the id of the parent comment, which uses a different prefix ('t1_').  All we need to do is exclude anything that doesn't have the 't3_' prefix.

In [179]:
#Drops all comments that are not in the first tier, i.e., direct responses to the original post.
#First-tier comments have a parent_id that starts with 't3_' (the thread id).
comments = comments[comments['parent_id'].str.startswith('t3_')]
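To see the filter at work on a toy example, the sketch below uses invented `parent_id` values. It relies on `str.startswith`, which checks the prefix specifically rather than searching the whole string.

```python
import pandas as pd

# Invented parent_id values: 't3_' marks a thread, 't1_' marks another comment.
toy = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "parent_id": ["t3_a7oy9v", "t1_ec4lbi4", "t3_a7mkui"],
})

# Keep only rows whose parent is the thread itself (first-tier comments).
first_tier = toy[toy["parent_id"].str.startswith("t3_")]
print(first_tier["id"].tolist())
```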

Based on experimentation, it looks like the percentage of every 1000 comments that are first tier comments is approximately the same in AskMen and AskWomen (~30-35%), so the classes should remain roughly balanced.

In [15]:
comments.head()

| Unnamed: 0 | body | created_utc | id | parent_id | score | subreddit |
|---|---|---|---|---|---|---|
| 0 | I was 23 but I went with my ~29 year old cowor... | 1545243578 | ec4lbi4 | t3_a7oy9v | 1 | AskMen |
| 1 | Portland, OR.\r\r\r\n\r\r\r\nThe city itself i... | 1545243546 | ec4la0n | t3_a7mkui | 1 | AskMen |
| 2 | nope. "the cats goodbye" watch how a c... | 1545243536 | ec4l9lm | t3_a7fe60 | 1 | AskMen |
| 3 | Drunk as fuck me during an unintended one nigh... | 1545243524 | ec4l90i | t3_a79zu9 | 1 | AskMen |
| 4 | There was this one time when I went over one o... | 1545243449 | ec4l5g6 | t3_a7kmvc | 1 | AskMen |


You can see we have the comment text, the timestamp (we'll discuss the format in a minute), the unique id, the parent id (only t3 means only first-tier!), the upvote count, and the subreddit.  This is our proof of concept.  Now let's go and get some data.

### Pulling Comments from Pushshift.io

As noted earlier, Reddit limits you to grabbing 1000 comments in a single call, and this rule extends to the Pushshift API as well.  To collect more than 1000 comments, and also to reflect a wider variety of timeframes than simply the last few days, we will use the feature in Pushshift that allows you to query based on a timestamp.  

We can create a loop that repeatedly collects the 1000 most recent comments from a subreddit prior to a specified date-time and keeps only the first-tier ones, starting at the present and moving backward at 12-day intervals. I chose 12 days because it will quickly give me a variety of times of year, times of month, days of the week, etc., and it is a large enough gap that all comments should be new and my dataset will span at least a full year.

The API uses the **epoch timestamp format**, a numerical representation that counts the seconds elapsed since January 1, 1970 (UTC).  12 days corresponds to 1036800 seconds (12 × 86400).  What I will do is use an initial timestamp from this week and then subtract 1036800 from it in each request, collecting comments further and further back in time and appending them until I have collected 40000+ first-tier comments from the AskMen subreddit.  I will then do the same thing for AskWomen.
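The arithmetic can be checked with a short sketch. The constant names here are my own; the starting value is the timestamp used in the requests in this notebook.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60   # 86400 seconds in a day
STEP = 12 * SECONDS_PER_DAY      # the 12-day interval, in seconds
START = 1545243580               # initial timestamp used in this notebook

# The first few 'before' values, each 12 days earlier than the previous one,
# printed next to their human-readable UTC equivalents.
for i in range(3):
    ts = START - i * STEP
    print(ts, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```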

AskMen Data Grab:

In [183]:
#Creates the initial dataframe 
#1000 most recent comments at present time (1545243580), filtered to first-tier only
url = "https://api.pushshift.io/reddit/search/comment/?subreddit=askmen&before=1545243580&sort=des&size=1000"
headers = {'User-agent': 'eamonious'}
res = requests.get(url, headers=headers)
json = res.json()
commentsm = pd.DataFrame(json['data'])
commentsm = commentsm[['body','created_utc','id','link_id','parent_id','score','subreddit']]
commentsm['parent_id'] = commentsm['parent_id'].map(lambda x: x if 't3_' in x else 0)
commentsm = commentsm[commentsm['parent_id']!=0]
commentsm = commentsm[commentsm['body']!='[removed]']

#Loops backward over 12 day intervals, adding the 1000 most recent comments prior to each timepoint,
#filtered to first-tier only
for i in range(1,80):
    url = "https://api.pushshift.io/reddit/search/comment/?subreddit=askmen&before={}&sort=des&size=1000".format(1545243580 - i*1036800)
    headers = {'User-agent': 'eamonious'}
    res = requests.get(url, headers=headers)
    json = res.json()
    commentbloc = pd.DataFrame(json['data'])
    commentbloc = commentbloc[['body','created_utc','id','link_id','parent_id','score','subreddit']]
    commentbloc['parent_id'] = commentbloc['parent_id'].map(lambda x: x if 't3_' in x else 0)
    commentbloc = commentbloc[commentbloc['parent_id']!=0]
    commentbloc = commentbloc[commentbloc['body']!='[removed]']
    commentsm = pd.concat([commentsm, commentbloc], ignore_index=True)


In [188]:
len(commentsm)

43774
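Since the two data grabs differ only in the subreddit name, the same steps could be wrapped in one helper. The following is a minimal sketch rather than the code that produced the numbers in this notebook: `fetch_json`, `first_tier_comments`, and the `pause` and `fetch` parameters are hypothetical names, the status check and delay are additions, and it assumes the endpoint responds as it did when the notebook was run.

```python
import time
import pandas as pd
import requests

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"
COLUMNS = ['body', 'created_utc', 'id', 'link_id', 'parent_id', 'score', 'subreddit']

def fetch_json(params):
    """Requests one block of comments and returns the parsed JSON."""
    res = requests.get(BASE_URL, params=params, headers={'User-agent': 'eamonious'})
    res.raise_for_status()  # stop on an error response instead of trying to parse it
    return res.json()

def first_tier_comments(subreddit, start, step=1036800, n_pulls=80,
                        fetch=fetch_json, pause=1.0):
    """Collects first-tier comments from n_pulls windows, stepping backward in time."""
    blocs = []
    for i in range(n_pulls):
        # Same query values as the URLs in this notebook, with 'before' moved back each pass.
        params = {'subreddit': subreddit, 'before': start - i * step,
                  'sort': 'des', 'size': 1000}
        bloc = pd.DataFrame(fetch(params)['data'])
        if bloc.empty:
            continue
        bloc = bloc[COLUMNS]
        bloc = bloc[bloc['parent_id'].str.startswith('t3_')]  # first-tier only
        bloc = bloc[bloc['body'] != '[removed]']
        blocs.append(bloc)
        time.sleep(pause)  # hypothetical courtesy delay between requests
    if not blocs:
        return pd.DataFrame(columns=COLUMNS)
    return pd.concat(blocs, ignore_index=True)
```

With this helper, a grab would reduce to a call such as `first_tier_comments('askmen', 1545243580)`. Passing a different `fetch` function makes it possible to try the filtering logic on canned data without touching the network.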

AskWomen Data Grab:

In [221]:
url = "https://api.pushshift.io/reddit/search/comment/?subreddit=askwomen&before=1545243580&sort=des&size=1000"
headers = {'User-agent': 'eamonious'}
res = requests.get(url, headers=headers)
json = res.json()
commentsw = pd.DataFrame(json['data'])
commentsw = commentsw[['body','created_utc','id','link_id','parent_id','score','subreddit']]
commentsw['parent_id'] = commentsw['parent_id'].map(lambda x: x if 't3_' in x else 0)
commentsw = commentsw[commentsw['parent_id']!=0]
commentsw = commentsw[commentsw['body']!='[removed]']

for i in range(1,80):
    url = "https://api.pushshift.io/reddit/search/comment/?subreddit=askwomen&before={}&sort=des&size=1000".format(1545243580 - i*1036800)
    headers = {'User-agent': 'eamonious'}
    res = requests.get(url, headers=headers)
    json = res.json()
    commentbloc = pd.DataFrame(json['data'])
    commentbloc = commentbloc[['body','created_utc','id','link_id','parent_id','score','subreddit']]
    commentbloc['parent_id'] = commentbloc['parent_id'].map(lambda x: x if 't3_' in x else 0)
    commentbloc = commentbloc[commentbloc['parent_id']!=0]
    commentbloc = commentbloc[commentbloc['body']!='[removed]']
    commentsw = pd.concat([commentsw, commentbloc], ignore_index=True)

In [223]:
len(commentsw)

41555

Our classes are well balanced, both around 40k comments.  We have our data now, but it needs some more work before we can analyze it.

### Cleaning Away Mod Messages and Deleted Comments

First, we want to drop any rows with null values.  Second, we want to make sure we don't have any duplicate comment IDs.  Because of the way we collected the data (two pulls would overlap if a block of 1000 comments ever spanned more than 12 days), it's possible we have some duplicates.

In [259]:
#Remove rows with null values
commentsm.dropna(inplace=True)
commentsw.dropna(inplace=True)

#Remove comments with the same ID
commentsm.drop_duplicates('id',inplace=True)
commentsw.drop_duplicates('id',inplace=True)
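As a toy illustration of these two steps, the sketch below combines two invented pulls whose time windows overlap, so that one comment id appears twice and one row has a missing body.

```python
import pandas as pd

# Two invented pulls: comment 'b2' appears in both, and 'c3' has no body.
pull_1 = pd.DataFrame({'id': ['a1', 'b2'], 'body': ['first', 'second']})
pull_2 = pd.DataFrame({'id': ['b2', 'c3'], 'body': ['second', None]})
combined = pd.concat([pull_1, pull_2], ignore_index=True)

# Drop rows with missing values, then keep the first row seen for each id.
cleaned = combined.dropna().drop_duplicates('id')
print(cleaned)
```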

We can look at the highest value_counts in the 'body' category (comment text) to see the most frequently appearing comments.  Most of these will be boilerplate moderator comments, which are subreddit-specific.  We will want to remove these.  Also, when comments on Reddit are deleted, they are typically replaced by text saying deleted or removed; we will want to get rid of these as well.  There are usually a few different standard mod comments, but they contain identifying language that we can filter for.

In [260]:
#This is what the top comments list looked like for r/AskWomen
commentsw['body'].value_counts()[0:45]

[deleted]

Below, I remove moderator comments from each subreddit by filtering for certain typifying language based on my review of top comments in each sub.  I also remove deleted comments.

In [261]:
#Removing deleted comments and moderator comments from AskMen
commentsm = commentsm[commentsm['body']!='[deleted]']
commentsm = commentsm[commentsm['body']!='[removed]']

commentsm['body'] = commentsm['body'].map(lambda x: 0 if 'has been removed' in x else x)
commentsm['body'] = commentsm['body'].map(lambda x: 0 if 'AskMen' in x else x)
commentsm = commentsm[commentsm['body']!=0]


#Removing deleted comments and moderator comments from AskWomen
commentsw = commentsw[commentsw['body']!='[deleted]']

commentsw['body'] = commentsw['body'].map(lambda x: 0 if 'has been removed' in x else x)
commentsw['body'] = commentsw['body'].map(lambda x: 0 if 'emoved' in str(x)[0:10] else x)
commentsw['body'] = commentsw['body'].map(lambda x: 0 if 'AskWomen' in str(x) else x)
commentsw = commentsw[commentsw['body']!=0]

commentsw = commentsw[commentsw['body']!='Please feel free to respond based on the genders that you find attractive. This question is not limited to women who date men.']
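The same kind of filtering can also be written with boolean masks instead of mapping to a placeholder 0. This is a sketch of an alternative on invented comment bodies, not the code used above; the boilerplate strings are made up to resemble the moderator language described.

```python
import pandas as pd

# Invented comment bodies, including boilerplate similar to what is filtered above.
toy = pd.DataFrame({'body': [
    'I usually just go for a walk.',
    '[deleted]',
    'Your comment has been removed because it breaks a rule.',
    'Welcome to AskMen! Please read the sidebar.',
]})

# Exact match for deleted comments, substring match for moderator language.
is_deleted = toy['body'] == '[deleted]'
is_mod = toy['body'].str.contains('has been removed|AskMen', regex=True)

# Keep rows that match neither mask.
kept = toy[~(is_deleted | is_mod)]
print(kept['body'].tolist())
```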

In [4]:
commentsm.shape

(33395, 6)

In [5]:
commentsw.shape

(38405, 6)

Our classes are still reasonably well balanced.  Now we will combine the AskMen and AskWomen comments into a single dataframe.

In [6]:
comments = pd.concat([commentsm, commentsw])
comments = comments.reset_index(drop=True)
comments.head()

| Unnamed: 0 | body | created_utc | id | parent_id | score | subreddit |
|---|---|---|---|---|---|---|
| 0 | I was 23 but I went with my ~29 year old cowor... | 1545243578 | ec4lbi4 | t3_a7oy9v | 1 | AskMen |
| 1 | Portland, OR.\r\r\r\n\r\r\r\nThe city itself i... | 1545243546 | ec4la0n | t3_a7mkui | 1 | AskMen |
| 2 | nope. "the cats goodbye" watch how a c... | 1545243536 | ec4l9lm | t3_a7fe60 | 1 | AskMen |
| 3 | Drunk as fuck me during an unintended one nigh... | 1545243524 | ec4l90i | t3_a79zu9 | 1 | AskMen |
| 4 | There was this one time when I went over one o... | 1545243449 | ec4l5g6 | t3_a7kmvc | 1 | AskMen |


Notice the comment containing '\r\r\r\n...'. Those are carriage-return and newline characters, which appear when people break their comments into paragraphs. Let's check how many comments contain them.

In [7]:
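#Counts the distinct comment texts that contain '\r' (plus one for the placeholder 0),
#which approximates the number of affected comments.
len(comments['body'].map(lambda x: x if '\r' in x else 0).unique())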

23261
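The expression above counts distinct texts rather than rows, so it is an approximation. A more direct row count could be sketched as follows, shown here on invented strings so the two numbers can be compared.

```python
import pandas as pd

# Invented bodies: one without line breaks, three identical ones with them, one more with them.
toy = pd.DataFrame({'body': ['one line', 'a\r\nb', 'a\r\nb', 'a\r\nb', 'c\r\rd']})

# Number of rows whose text contains a carriage return.
n_rows = toy['body'].str.contains('\r', regex=False).sum()

# The unique-based expression: distinct matching texts plus the placeholder 0.
n_unique = len(toy['body'].map(lambda x: x if '\r' in x else 0).unique())
print(n_rows, n_unique)
```

On real data the two figures should be close as long as few comments with line breaks are exact duplicates of each other.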

So 23000+ comments have a \r combo somewhere! This will interfere with our attempt to calculate word length (the number of words in each comment), and may affect our vectorizations and predictions too.  Looking through the data further, many comments also contain runs of multiple spaces.  When we go to calculate word length, we're going to want to use the space character as a splitter.  So we need to reduce these runs to single spaces, or we'll get a bunch of empty strings counted as words.

So we want to replace every run of \r and \n characters, and every run of multiple spaces, with a single space.  We can make this type of specific text substitution with **regular expressions** (regex for short).

In [8]:
#Import Regex
import re

#This function selects any consecutive combination of \r's and \n's in a bloc of text, 
#and replaces that selection with a single space.
def replace_linebreaks_w_space(x):
    return re.sub('([\r\n]+)',' ',x) 

#This function selects any stretch of two or more consecutive spaces in a bloc of text,
#and replaces that selection with a single space.
def replace_multispace_w_space(x):
    return re.sub('([ ]{2,})',' ',x)

#Here we take every comment and apply the two functions to it.
comments['body'] = comments['body'].map(replace_linebreaks_w_space)
comments['body'] = comments['body'].map(replace_multispace_w_space)
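To see what each pass does, the sketch below applies the same two functions to an invented sample string that contains both line-break runs and extra spaces.

```python
import re

def replace_linebreaks_w_space(x):
    # Any run of \r and \n characters becomes one space.
    return re.sub('([\r\n]+)', ' ', x)

def replace_multispace_w_space(x):
    # Any run of two or more spaces becomes one space.
    return re.sub('([ ]{2,})', ' ', x)

# Invented sample text, loosely modeled on the comment shown earlier.
sample = "Portland, OR.\r\r\r\n\r\r\r\nThe city   itself is nice."
step_1 = replace_linebreaks_w_space(sample)
step_2 = replace_multispace_w_space(step_1)
print(repr(step_1))
print(repr(step_2))
```

Running the line-break pass first matters: a line break next to a space turns into two spaces, which the second pass then collapses.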

NOW we can make a column with a proper word length count for each comment!

In [9]:
#Strips away any spaces at the beginning or end of each comment, splits the comment into a list of words,
#and returns the length of that list (i.e., the number of words in the comment)
comments['word length'] = comments['body'].map(lambda x: len(x.strip().split(' ')))
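A quick check on invented strings shows why the spaces had to be collapsed first: splitting on a single space produces an empty string wherever spaces repeat, and each one inflates the count.

```python
text_raw = "  so  many   spaces "
text_clean = "so many spaces"

# Splitting on ' ' keeps an empty string between each pair of adjacent spaces.
raw_parts = text_raw.strip().split(' ')
raw_count = len(raw_parts)

# With single spaces only, the list length equals the number of words.
clean_count = len(text_clean.strip().split(' '))

print(raw_parts)
print(raw_count, clean_count)
```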

In [10]:
comments.head()

| Unnamed: 0 | body | created_utc | id | parent_id | score | subreddit | word length |
|---|---|---|---|---|---|---|---|
| 0 | I was 23 but I went with my ~29 year old cowor... | 1545243578 | ec4lbi4 | t3_a7oy9v | 1 | AskMen | 23 |
| 1 | Portland, OR. The city itself is now unafforda... | 1545243546 | ec4la0n | t3_a7mkui | 1 | AskMen | 36 |
| 2 | nope. "the cats goodbye" watch how a cat says ... | 1545243536 | ec4l9lm | t3_a7fe60 | 1 | AskMen | 28 |
| 3 | Drunk as fuck me during an unintended one nigh... | 1545243524 | ec4l90i | t3_a79zu9 | 1 | AskMen | 16 |
| 4 | There was this one time when I went over one o... | 1545243449 | ec4l5g6 | t3_a7kmvc | 1 | AskMen | 192 |


We now have accurate word length data. PS. Notice that the \r\r\r\n is gone from that comment!

The last thing we're going to do is ***remove all comments that are three words or shorter***, as it's difficult, and for the most part unreasonable, to guess anything from comments this short.  We want to focus on accurately predicting comments that have some content.

In [12]:
comments = comments[comments['word length']>=4]
len(comments)

67354

Let's save our cleaned dataset!

In [13]:
comments.to_csv('comments_final.csv')