# Project 3: Reddit API Classification & Natural Language Processing

## Tom Ludlow, DSI-NY-6

Using NLP to identify posts from **r/audioengineering** and **r/livesound**

## Project Overview

This project file contains 5 notebooks covering the data science process for the following problem:

### How well can Natural Language Processing models differentiate post content from two similar subreddits, and which type of Classifier works best?



# Notebook 1: API and EDA

This notebook contains the code used to query the Reddit API and store post data from our target subreddits **r/audioengineering** and **r/livesound**.  The code makes 10 pulls of up to 100 posts each to collect the maximum of roughly 1000 available posts per subreddit.  These posts are then formatted and saved for use in the remainder of the project.

### Contents:
- [**Loop to pull Reddit API posts**](#Loop-to-pull-Reddit-API-posts)
- [**EDA**](#EDA)
    -  [Formatting changes](#Formatting-changes-after-loading-from-files)
    -  [Unravel posts](#Unravel-posts)
- [**Save separate DataFrames**](#Save-separate-DataFrames)

In [1]:
# library imports
import requests
import time
import pandas as pd
import ast # to convert string data to indexable list of dictionaries
from tqdm import tqdm

In [3]:
# random state var
r = 1219

### Loop to pull Reddit API posts

To collect our Reddit data, we used Reddit's existing `.json` API format: appending `.json` to a subreddit URL returns the listing as JSON, which `requests` parses into a dictionary.  We include a headers dictionary with a `User-agent` value to identify ourselves to Reddit, which allows us to run an API loop that accumulates the maximum allowed posts (~1000 per subreddit).
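
As a minimal sketch of the URL pattern described above, the helper below builds the listing URL with the `limit` and `after` query parameters.  The function name `build_listing_url` is hypothetical and is not part of this notebook; the base URL, the `limit` value, and the sample `after` token are taken from the loop and its output.

```python
from urllib.parse import urlencode

BASE_URL = 'https://reddit.com/'  # same base URL as used in this notebook


def build_listing_url(subreddit, limit=100, after=None):
    """Return the .json listing URL for a subreddit (hypothetical helper)."""
    params = {'limit': limit}
    if after:  # only add 'after' once a previous page has supplied one
        params['after'] = after
    return BASE_URL + 'r/' + subreddit + '.json?' + urlencode(params)


first_url = build_listing_url('audioengineering')
next_url = build_listing_url('audioengineering', after='t3_a3vakv')
print(first_url)
print(next_url)
```

Building the query string with `urlencode` avoids hand-concatenating `&` and `=` and produces the same URL shape that the loop prints.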

Once this data was obtained, we saved the raw contents to a `.csv` file.

In [2]:
# create header parameter for API
headers_dict = {'User-agent':'twludlow'}

In [32]:
# instantiate API variables
url = 'https://reddit.com/'
sub01_url = url + 'r/audioengineering' # set sub01 to 'Audio Engineering'
sub02_url = url + 'r/livesound'        # set sub02 to 'Live Sound'

limit_num = 100     # API 'limit' parameter

sub01_after = None  # instantiate empty counters for API 'after' parameter
sub02_after = None

sub01_pages = []    # instantiate empty lists to save API results
sub02_pages = []

for i in range(10): # pull from API 10 times
    
    # add 'after' parameters if an id has been saved - starts as None
    if sub01_after and sub02_after:
        # create full API url for sub01
        sub01_after_url = sub01_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub01_after
        print(sub01_after_url)
        
        # create full API url for sub02
        sub02_after_url = sub02_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub02_after
        print(sub02_after_url)
    
    # if one after is logged and the other is not
    elif bool(sub01_after) != bool(sub02_after):
        print('After reference out of sync.')
        break
    
    else:
        # create first run url
        sub01_after_url = sub01_url + '.json?limit=' + str(limit_num)
        sub02_after_url = sub02_url + '.json?limit=' + str(limit_num)
    
    # pull json from sub01
    sub01_res = requests.get(sub01_after_url, headers=headers_dict)
    print(i, sub01_res.status_code)
    
    # if sub01 connection is established
    if sub01_res.status_code == 200:
        # add page to list
        sub01_pages.append(sub01_res.json()['data'])
        print('sub01_pages length: ', len(sub01_pages))
        
        # set 'after' parameter for next run
        sub01_after = sub01_res.json()['data']['after']
        print('sub01_after: ', sub01_after)
        
    else:        
        print('Connection failed.\n')
    
    # sleep one second
    time.sleep(1)
    
    # pull json from sub02
    sub02_res = requests.get(sub02_after_url, headers=headers_dict)
    print(i, sub02_res.status_code)
    
    # if sub02 connection is established
    if sub02_res.status_code == 200:
        # add page to list
        sub02_pages.append(sub02_res.json()['data'])
        print('sub02_pages length: ', len(sub02_pages))
        
        # set 'after' parameter for next run
        sub02_after = sub02_res.json()['data']['after']
        print('sub02_after: ', sub02_after)
    else:
        print('Connection failed.\n')
        
    # sleep one second    
    time.sleep(1)

0 200
sub01_pages length:  1
sub01_after:  t3_a3vakv
0 200
sub02_pages length:  1
sub02_after:  t3_a32xoh
https://reddit.com/r/audioengineering.json?limit=100&after=t3_a3vakv
https://reddit.com/r/livesound.json?limit=100&after=t3_a32xoh
1 200
sub01_pages length:  2
sub01_after:  t3_a1a83f
1 200
sub02_pages length:  2
sub02_after:  t3_9zp2tb
https://reddit.com/r/audioengineering.json?limit=100&after=t3_a1a83f
https://reddit.com/r/livesound.json?limit=100&after=t3_9zp2tb
2 200
sub01_pages length:  3
sub01_after:  t3_9yjhye
2 200
sub02_pages length:  3
sub02_after:  t3_9xp13m
https://reddit.com/r/audioengineering.json?limit=100&after=t3_9yjhye
https://reddit.com/r/livesound.json?limit=100&after=t3_9xp13m
3 200
sub01_pages length:  4
sub01_after:  t3_9wbrxt
3 200
sub02_pages length:  4
sub02_after:  t3_9vrhrc
https://reddit.com/r/audioengineering.json?limit=100&after=t3_9wbrxt
https://reddit.com/r/livesound.json?limit=100&after=t3_9vrhrc
4 200
sub01_pages length:  5
sub01_after:  t3_9tusd7
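
The pagination pattern used in the loop can be sketched for a single subreddit as follows.  This is an illustrative variant, not the code that produced the data: `collect_pages` and `fetch_page` are hypothetical names, and the sketch assumes the final page reports its `after` value as `None`.  A stand-in dictionary replaces the real API so the example runs offline.

```python
def collect_pages(fetch_page, max_pages=10):
    """Collect listing pages until max_pages is reached or 'after' runs out.

    fetch_page is a hypothetical callable: it takes the current 'after'
    token (None for the first page) and returns the listing's 'data' dict.
    """
    pages = []
    after = None
    for _ in range(max_pages):
        data = fetch_page(after)
        pages.append(data)
        after = data.get('after')
        if after is None:  # no further pages; stop instead of restarting
            break
    return pages


# Stand-in for the API so the sketch runs offline: three fake pages.
fake_listing = {
    None: {'after': 't3_aaa', 'children': ['post1', 'post2']},
    't3_aaa': {'after': 't3_bbb', 'children': ['post3']},
    't3_bbb': {'after': None, 'children': ['post4']},
}

pages = collect_pages(lambda after: fake_listing[after])
print(len(pages))
```

In a live run, `fetch_page` would wrap the `requests.get(...)` call, the status-code check, and the one-second sleep used in the loop.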

In [33]:
# create DataFrames from posting lists
sub01_df = pd.DataFrame(sub01_pages)
sub02_df = pd.DataFrame(sub02_pages)

In [34]:
# save API data to files
sub01_df.to_csv('./audio_engineering_posts.csv', index=False)
sub02_df.to_csv('./live_sound_posts.csv', index=False)

Saves: 181214_audio_engineering_posts.csv, 181214_live_sound_posts.csv

Iterations: **10**

Here we can see the format of the data we've obtained: a dictionary, with post data contained in the `children` key.  Within this, we will need to access the nested keys below. 

_(Screenshot used to adjust original pull results display.)_

<img src="./img/reddit_api_pull_screen.png" alt="Result header from initial Reddit API Pull" title="Reddit API JSON Header" />
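
To make the nesting concrete, here is a small mock of one listing page, trimmed to the keys this notebook relies on.  The top-level keys match the columns that appear in the saved files (`after`, `before`, `children`, `dist`, `modhash`); every value is a placeholder rather than real API output.

```python
# Mock of one listing page, trimmed to the keys this notebook uses.
# The values are placeholders, not real API output.
page = {
    'after': 't3_example',
    'before': None,
    'dist': 2,
    'modhash': '',
    'children': [
        {'kind': 't3', 'data': {'title': 'First title', 'selftext': 'First body', 'ups': 3}},
        {'kind': 't3', 'data': {'title': 'Second title', 'selftext': '', 'ups': 1}},
    ],
}

first_post = page['children'][0]['data']  # page -> children -> post -> data
print(first_post['title'])
print(len(page['children']))
```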

## EDA

Reload original API data from our subreddits.

In [4]:
sub01_df = pd.read_csv('./reddit_data/181214_audio_engineering_posts.csv')

In [5]:
sub02_df = pd.read_csv('./reddit_data/181214_live_sound_posts.csv')

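The `ast` library function `literal_eval` takes a string containing a Python literal (here, a list of dictionaries) and safely evaluates it back into the corresponding Python object.  We need this because saving to `.csv` stored each `children` list as a string; `literal_eval` converts those strings back into lists of dictionaries in notebook memory.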

In [9]:
sub01_df['children'] = sub01_df['children'].map(ast.literal_eval)

In [10]:
sub02_df['children'] = sub02_df['children'].map(ast.literal_eval)
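
A small standalone illustration of what `ast.literal_eval` does to one stored value.  The string below is a shortened, made-up stand-in for a real `children` cell.

```python
import ast

# A page as it comes back from the CSV: the list of dicts is now one long string.
stored = "[{'kind': 't3', 'data': {'approved_at_utc': None, 'ups': 7}}]"

restored = ast.literal_eval(stored)  # parses Python literals only; runs no code

print(type(stored).__name__, '->', type(restored).__name__)
print(restored[0]['data']['ups'])
```

`json.loads` would reject this string because it uses single quotes and `None`, which is why a Python-literal parser is used here.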

### Formatting changes after loading from files

In [108]:
sub01_df.head()

| | after | before | children | dist | modhash |
|---|---|---|---|---|---|
| 0 | t3_a3vakv | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 102 | |
| 1 | t3_a1a83f | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 2 | t3_9yjhye | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 3 | t3_9wbrxt | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 4 | t3_9tusd7 | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |


In [109]:
sub01_df.shape

(10, 5)

In [110]:
sub02_df.head()

| | after | before | children | dist | modhash |
|---|---|---|---|---|---|
| 0 | t3_a32xoh | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 102 | |
| 1 | t3_9zp2tb | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 2 | t3_9xp13m | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 3 | t3_9vrhrc | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |
| 4 | t3_9ssc53 | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 100 | |


In [111]:
sub02_df.shape

(10, 5)

In [112]:
# store each page's list of post dictionaries (one Series per subreddit)

ae_posts_bulk = sub01_df['children']
ls_posts_bulk = sub02_df['children']

In [113]:
ae_posts_bulk.head()

0    [{'kind': 't3', 'data': {'approved_at_utc': No...
1    [{'kind': 't3', 'data': {'approved_at_utc': No...
2    [{'kind': 't3', 'data': {'approved_at_utc': No...
3    [{'kind': 't3', 'data': {'approved_at_utc': No...
4    [{'kind': 't3', 'data': {'approved_at_utc': No...
Name: children, dtype: object

In [114]:
ls_posts_bulk.head()

0    [{'kind': 't3', 'data': {'approved_at_utc': No...
1    [{'kind': 't3', 'data': {'approved_at_utc': No...
2    [{'kind': 't3', 'data': {'approved_at_utc': No...
3    [{'kind': 't3', 'data': {'approved_at_utc': No...
4    [{'kind': 't3', 'data': {'approved_at_utc': No...
Name: children, dtype: object

In [115]:
ae_posts_bulk.shape

(10,)

In [116]:
for post in ae_posts_bulk: 
    print(len(post))

102
100
100
100
100
100
100
100
100
22


### Unravel posts

#### Target Fields

 - Title: 'title'
 - Posts: 'selftext'
 - Author: 'author_fullname'
 - Upvotes: 'ups'

We will collect and store the fields listed above to perform our initial modeling analysis, and to leave room for additional supplemental analysis.

In [117]:
ae_posts_bulk[0][0]['data']['title'] # format to isolate data for Title

'Tech Support and Troubleshooting - December 10, 2018'

In [118]:
ae_posts_bulk[0][0]['data']['selftext'] # format to isolate data for Post

"Welcome the /r/audioengineering Tech Support and Troubleshooting Thread.  We kindly ask that all tech support questions and basic troubleshooting questions (how do I hook up 'a' to 'b'?, headphones vs mons, etc) go here.  If you see posts that belong here, please report them to help us get to them in a timely manner.  Thank you!\n\n   Daily Threads:\n\n\n* [Monday - Gear Recommendations Sticky Thread](http://www.reddit.com/r/audioengineering/search?q=title%3Arecommendation+author%3Aautomoderator&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Monday - Tech Support and Troubleshooting Sticky Thread](http://www.reddit.com/r/audioengineering/search?q=title%3ASupport+author%3Aautomoderator&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Tuesday - Tips &amp; Tricks](http://www.reddit.com/r/audioengineering/search?q=title%3A%22tuesdays%22+AND+%28author%3Aautomoderator+OR+author%3Ajaymz168%29&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Friday - How did they do that?](http://www.reddit.com/r

In [119]:
ls_posts_bulk[0][0]['data']['selftext']

'Post the pictures you took at your gigs this week!'

In [120]:
ae_posts_bulk[0][0]['data']['author_fullname'] # format to isolate author's fullname ID ('t2_' prefix plus base36 ID)

't2_6l4z3'

In [121]:
ls_posts_bulk[0][0]['data']['author_fullname']

't2_6l4z3'

In [122]:
ae_posts_bulk[0][0]['data']['ups'] # format to get number of upvotes on a post

7

#### Post titles - 'title'

To collect post attributes, we used nested list comprehensions: for each subreddit post batch (roughly 100 posts), iterate through each individual post dictionary and return the value stored under the attribute's key (e.g. `title`, `selftext`).

In [123]:
# 'ae' for audio engineering
ae_titles = [ae_posts_bulk[i][j]['data']['title'] for i in range(len(ae_posts_bulk))
            for j in range(len(ae_posts_bulk[i]))]

# 'ls' for live sound
ls_titles = [ls_posts_bulk[i][j]['data']['title'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]
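
The same flattening can be written by iterating over the batches and posts directly instead of over index ranges.  This is an equivalent sketch on toy data, assuming each batch is a list of post dictionaries with a `data` key as shown earlier; `extract_field` is a hypothetical helper name.

```python
def extract_field(pages, key):
    """Flatten a sequence of post batches into one list of values for `key`."""
    return [post['data'][key] for page in pages for post in page]


# Toy batches standing in for ae_posts_bulk / ls_posts_bulk.
toy_pages = [
    [{'data': {'title': 'a', 'ups': 1}}, {'data': {'title': 'b', 'ups': 2}}],
    [{'data': {'title': 'c', 'ups': 3}}],
]

titles = extract_field(toy_pages, 'title')
ups = extract_field(toy_pages, 'ups')
print(titles)
print(ups)
```

The outer `for` walks the batches and the inner `for` walks the posts in each batch, so values come out in the original post order.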

#### Posts - 'selftext'

In [124]:
# create lists of posts using nested comprehensions
ae_posts = [ae_posts_bulk[i][j]['data']['selftext'] for i in range(len(ae_posts_bulk)) 
            for j in range(len(ae_posts_bulk[i]))]

ls_posts = [ls_posts_bulk[i][j]['data']['selftext'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]

In [125]:
len(ae_posts)

924

In [126]:
len(ls_posts)

980

#### Upvotes - 'ups'

In [127]:
# list of upvotes using nested comprehensions
ae_ups = [ae_posts_bulk[i][j]['data']['ups'] for i in range(len(ae_posts_bulk)) 
            for j in range(len(ae_posts_bulk[i]))]

ls_ups = [ls_posts_bulk[i][j]['data']['ups'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]

#### Authors - 'author_fullname'

We build the author lists with explicit loops to handle missing author data.  The comprehension used above failed here because some posts have no value in the author field (`author_fullname`).

In [128]:
ae_authors = [] # empty lists to store results
ls_authors = []

for i in range(len(ae_posts_bulk)): # for each batch of posts (roughly 100)
    for j in range(len(ae_posts_bulk[i])): # for each post in the batch
        try:
            ae_authors.append(ae_posts_bulk[i][j]['data']['author_fullname']) # attempt to add to list
        except KeyError:
            ae_authors.append('no author') # if the key is missing, add text stating 'no author'

for i in range(len(ls_posts_bulk)): # for each batch of posts
    for j in range(len(ls_posts_bulk[i])): # for each individual post
        try:
            ls_authors.append(ls_posts_bulk[i][j]['data']['author_fullname']) # attempt to add to list
        except KeyError:
            ls_authors.append('no author') # if the key is missing, add 'no author' instead
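
An alternative sketch that keeps the comprehension form: `dict.get` returns a default when a key is absent, so no `try`/`except` is needed.  This assumes the failure comes from the `author_fullname` key being absent on some posts; `extract_authors` is a hypothetical helper name and the data is made up.

```python
def extract_authors(pages, default='no author'):
    """Return author IDs for every post, using `default` where the key is absent."""
    return [post['data'].get('author_fullname', default)
            for page in pages for post in page]


# Toy batch: the second post has no 'author_fullname' key.
toy_pages = [[{'data': {'author_fullname': 't2_abc'}}, {'data': {}}]]

authors = extract_authors(toy_pages)
print(authors)
```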

In [129]:
len(ae_authors)

924

In [130]:
len(ls_authors)

980

In [131]:
# compile lists into DataFrame
ae_df = pd.DataFrame([ae_titles, ae_posts, ae_authors, ae_ups], index=['title','post','author','upvotes'])

In [132]:
# transpose from rows to columns
ae_df = ae_df.T

In [133]:
# compile lists into DataFrame
ls_df = pd.DataFrame([ls_titles, ls_posts, ls_authors, ls_ups], index=['title','post','author','upvotes'])

In [134]:
# transpose from rows to columns
ls_df = ls_df.T

### Save separate DataFrames

In [135]:
ae_df.to_csv('./csv/ae_df.csv', index=False)
ls_df.to_csv('./csv/ls_df.csv', index=False)

Save data to csv above, then reload from csv below.

In [3]:
ae_df = pd.read_csv('./csv/ae_df.csv')
ls_df = pd.read_csv('./csv/ls_df.csv')

Now that we have our initial data tables, we designate each with a binary value for the status `'is_ls'`, where `0` is for all posts from r/audioengineering, and `1` for r/livesound.

In [4]:
# binarize our classifier: 'is_ls' (is live sound)
ae_df['is_ls'] = 0
ls_df['is_ls'] = 1

Combine the classified posts in a single DataFrame, then replace all `NaN` values in the `post` column with `' '`.

In [5]:
df = pd.concat([ae_df, ls_df])

In [14]:
# assign the result back rather than calling fillna in place on a column selection
df['post'] = df['post'].fillna(' ')

Our model input text is a combination of `title` and `post` data, which we combined into the `comb` column as the title, a single space, then the post string.

In [27]:
df['comb'] = df['title'] + ' ' + df['post']
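
The labeling, concatenation, fill, and combine steps can be seen end to end on a two-row toy example.  The frames and their contents below are made up for illustration; `ignore_index=True` is a variation that renumbers rows at concatenation time, whereas this notebook re-indexes in a later cell.

```python
import pandas as pd

# Toy stand-ins for ae_df and ls_df; the second post body is missing.
ae_toy = pd.DataFrame({'title': ['Mic preamp question'], 'post': ['Which one?']})
ls_toy = pd.DataFrame({'title': ['Gig photos'], 'post': [None]})

ae_toy['is_ls'] = 0  # r/audioengineering
ls_toy['is_ls'] = 1  # r/livesound

toy = pd.concat([ae_toy, ls_toy], ignore_index=True)
toy['post'] = toy['post'].fillna(' ')           # missing bodies become a single space
toy['comb'] = toy['title'] + ' ' + toy['post']  # title, one space, then post text

print(toy['comb'].tolist())
```

Filling the missing bodies first matters: adding a string to a missing value yields a missing value, which would leave `comb` empty for title-only posts.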

In [28]:
df.head()

| | title | post | author | upvotes | is_ls | comb |
|---|---|---|---|---|---|---|
| 0 | Tech Support and Troubleshooting - December 10... | Welcome the /r/audioengineering Tech Support a... | t2_6l4z3 | 7 | 0 | Tech Support and Troubleshooting - December 10... |
| 1 | Gear Recommendation (What Should I Buy?) Threa... | Welcome to our weekly Gear Recommendation Thre... | t2_6l4z3 | 15 | 0 | Gear Recommendation (What Should I Buy?) Threa... |
| 2 | Will I EVER understand compression...? | Ahh yes, my monthly compression post...\n\nI '... | t2_2r3uhjqr | 96 | 0 | Will I EVER understand compression...? Ahh yes... |
| 3 | I'm interviewing to be an intern at a big stud... | What questions should I ask?\n\nEdit: I'm gett... | t2_dd3qi | 145 | 0 | I'm interviewing to be an intern at a big stud... |
| 4 | If I faced two speakers towards each other, on... | | t2_bl2x2 | 5 | 0 | If I faced two speakers towards each other, on... |


In [29]:
df.isnull().sum()

title      0
post       0
author     0
upvotes    0
is_ls      0
comb       0
dtype: int64

In [30]:
df.shape

(1904, 6)

We can see that among our `is_ls` classes, we have 980 posts from r/livesound, and 924 posts from r/audioengineering.

In [31]:
df.is_ls.value_counts()

1    980
0    924
Name: is_ls, dtype: int64

In [32]:
df.index = range(len(df)) # re-index, since original index values were preserved during concatenation

In [33]:
# check for empty posts and store index to list
to_drop = []

for i, post in enumerate(df['comb']):
    if len(post)==0:
        to_drop.append(i)

In [34]:
len(to_drop)

0

In [35]:
# drop rows with empty posts using index list
df.drop(to_drop, inplace=True)

In [36]:
df.is_ls.value_counts()

1    980
0    924
Name: is_ls, dtype: int64

In [37]:
df.to_csv('./csv/181219_post_df.csv', index=False)

This EDA process has created a DataFrame containing titles, posts, authors, upvotes, and the combined text, as well as our target vector `is_ls`.  We have saved this to a csv file, and will proceed with pre-processing in Notebook 2.

## Continue to Notebook 2: Pre-Processing