# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit has throttled Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [89]:
import requests, time, json, datetime, dill, pixiedust
import pandas as pd

from sklearn.feature_extraction import stop_words
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords



Pixiedust database opened successfully


In [None]:
URL_boardgames = "http://www.reddit.com/r/boardgames.json"

In [None]:
## send a request to reddit getting the first 25 posts
res = requests.get(URL_boardgames, headers = {'User-agent': 'project3 Bot 0.1'})

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [None]:
data = res.json()

In [None]:
#Checking to see that there's stuff there. Don't worry, there is.
#data

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!

In [None]:
#collecting more data
URL_EXTENDER = "?after="


for i in range(9): 
    #makes a total of 10, or 250 posts. We'll see how much I really need/collect    
    #okay so I got 251 posts. Not sure how but guess it doesn't really matter?
    last_title = data['data']['after']
    
    #retrieve new data
    temp_data = requests.get(URL_boardgames+URL_EXTENDER+last_title, headers = {'User-agent': 'project3 Bot 0.1'})
    
    temp_data = temp_data.json()
    data['data']['children'].extend(temp_data['data']['children'])
    data['data']['after'] = temp_data['data']['after']
    time.sleep(3)
    print('Iteration', i+1)

In [None]:
len(data['data']['children']) #not sure why there are 251 results...
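
The loop above can also be wrapped in a small helper. This is only a sketch: `collect_posts` and `fetch_listing` are hypothetical names (they are not in the Data-Gathering-Script), and it assumes Reddit returns `after` as `null` (`None` in Python) once there are no more pages, so the loop stops early instead of failing on string concatenation.

```python
import time


def fetch_listing(url, user_agent='project3 Bot 0.1'):
    """Send one GET request and return the listing as a dictionary."""
    import requests  # imported here so collect_posts can be tried offline

    res = requests.get(url, headers={'User-agent': user_agent})
    res.raise_for_status()  # surfaces a 429 instead of parsing an error page
    return res.json()


def collect_posts(base_url, max_pages=10, pause=3, fetch=fetch_listing):
    """Follow the 'after' token for up to max_pages requests."""
    posts = []
    after = None
    for page in range(max_pages):
        # the first request has no token; later ones append ?after=<token>
        url = base_url if after is None else base_url + '?after=' + after
        listing = fetch(url)
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            break  # assumed to mean there is nothing left to page through
        if page < max_pages - 1:
            time.sleep(pause)  # stay within Reddit's request limits
    return posts
```

Calling `collect_posts(URL_boardgames)` would make up to 10 requests. The `fetch` argument is there so the paging logic can be checked with a fake function before hitting Reddit.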

## Gathering data with the two subreddits I chose: r/TalesFromTechSupport and r/LFG

These cells work, but they are early iterations of the Data-Gathering-Script (a separate notebook), which does the scraping while I work on the rest of the project.

In [None]:
jsons = {}

In [None]:
#This cell is deprecated. Using the Data-Gathering-Script notebook
#Tales From Tech Support Data Scraping
URL_tfts = "http://www.reddit.com/r/talesfromtechsupport.json"

res = requests.get(URL_tfts, headers = {'User-agent': 'project3 Bot 0.1'})
jsons['tfts'] = res.json()

URL_EXTENDER = "?after="


for i in range(9): 
    #makes a total of 10 requests, or 250 posts. We'll see how much I really need/collect    
    #okay so I got 251 posts. Not sure how but guess it doesn't really matter?
    last_title = jsons['tfts']['data']['after']
    
    #retrieve new data
    temp_data = requests.get(URL_tfts+URL_EXTENDER+last_title, headers = {'User-agent': 'project3 Bot 0.1'})
    
    temp_data = temp_data.json()
    jsons['tfts']['data']['children'].extend(temp_data['data']['children'])
    jsons['tfts']['data']['after'] = temp_data['data']['after']
    time.sleep(3)
    print('Iteration', i+1)
len(jsons['tfts']['data']['children'])

In [None]:
#This cell is deprecated. Using the Data-Gathering-Script notebook
#LFG Data Scraping

URL_lfg = "http://www.reddit.com/r/LFG.json"

res = requests.get(URL_lfg, headers = {'User-agent': 'project3 Bot 0.1'})
jsons['lfg'] = res.json()

URL_EXTENDER = "?after="


for i in range(9): 
    #makes a total of 10 requests, or 250 posts. We'll see how much I really need/collect    
    #okay so I got 251 posts. Not sure how but guess it doesn't really matter?
    last_title = jsons['lfg']['data']['after']
    
    #retrieve new data
    temp_data = requests.get(URL_lfg+URL_EXTENDER+last_title, headers = {'User-agent': 'project3 Bot 0.1'})
    
    temp_data = temp_data.json()
    jsons['lfg']['data']['children'].extend(temp_data['data']['children'])
    jsons['lfg']['data']['after'] = temp_data['data']['after']
    time.sleep(3)
    print('Iteration', i+1)
len(jsons['lfg']['data']['children'])

In [None]:
jsons = {'tfts':data_tfts, 'lfg':data_lfg}

In [None]:
#This cell is deprecated. Using the Data-Gathering-Script notebook
#Create a backup of my data in case the current working data is overridden
#only run occasionally, usually while figuring out how to append json files together
import json, datetime


for k, v in jsons.items():
    filepath = './data/backup_my_data_' + k + str(datetime.datetime.now()) + '.json'
    with open(filepath, 'w+') as f:
        json.dump(v, f, indent=4, sort_keys=True)


In [None]:
#This cell is deprecated. Using the Data-Gathering-Script notebook
#writing to a json file in my project
import json

for k, v in jsons.items():
    filepath = './data/my_data_' + k + '.json'
    with open(filepath, 'w') as f:
        json.dump(v, f, indent=4, sort_keys=True)

In [None]:
#This cell is deprecated. Using the Data-Gathering-Script notebook
#reading from a json file in my project
import json
jsons = {}
for i in ['tfts', 'lfg']:
    with open('./data/my_data_'+i+'.json', 'r') as f:
        jsons[i] = json.load(f)

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

Guess I didn't need to save the json files...

What do I want to be in the features? Obviously the text from the post but I'm going to take a look at some of the other features that are given to us from the json file.

Subreddits I chose:
 - https://www.reddit.com/r/talesfromtechsupport
 - https://www.reddit.com/r/lfg/
 
Other options:
 - https://www.reddit.com/r/dataisbeautiful/
 - https://www.reddit.com/r/airz23  
 - https://www.reddit.com/r/nosleep

_wanna make it really hard? pick airz and tfts_

Potential features:
- `'subreddit'`
- `'url'`
- `'author'`
- `'domain'`
- `'downs'`
- `'is_self'` 
- `'is_video'` 
- `'likes'`
- `'media'`
- `'num_comments'`
- `'num_crossposts'`
- `'num_reports'`
- `'selftext'`
- `'score'`
- `'title'`
- `'ups'`
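
One way to pull just these fields out of the scraped JSON is sketched below. It assumes each entry in `data['data']['children']` is a dictionary whose `'data'` key holds the post's fields, which matches the listings used above. `listing_to_rows` and `FEATURES` are names made up for this example.

```python
FEATURES = [
    'subreddit', 'url', 'author', 'domain', 'downs', 'is_self', 'is_video',
    'likes', 'media', 'num_comments', 'num_crossposts', 'num_reports',
    'selftext', 'score', 'title', 'ups',
]


def listing_to_rows(listing, features=FEATURES):
    """Return one flat dictionary per post, keeping only the chosen fields."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        # .get() leaves a None where a post lacks a field instead of raising
        rows.append({name: post.get(name) for name in features})
    return rows
```

The result can be handed straight to pandas, for example `pd.DataFrame(listing_to_rows(jsons['lfg']))`, and then written out with `to_csv`.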

Create main dataframe

In [2]:
main_data_path = './data/main_dataframe.csv'


In [3]:
#Load in main data if not already in memory

main_df = pd.DataFrame()
try:
    main_df = pd.read_csv(main_data_path)
except FileNotFoundError:
    # this is for the initial scrape, when we don't have a df saved as a csv
    pass

main_df.shape

(2782, 16)

Load in new data and add to current dataset

In [28]:
#Load in freshly scraped data
new_data_path = './data/new_data.csv'

new_df = pd.read_csv(new_data_path)
new_df.shape

(2715, 16)

In [85]:
#Add to dataset and delete duplicates
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
main_df = pd.concat([main_df, new_df], ignore_index=True)
main_df.drop_duplicates(subset=['url'], inplace=True)
main_df.reset_index(drop=True, inplace=True)
main_df.shape

(3161, 16)

Saving our data

In [30]:
#save main df 
main_df.to_csv(main_data_path, index=False)

In [9]:
#backup of main df
#run every so often just so if we accidentally overwrite main_df save we still have our data
import datetime

filepath = './data/backup_my_dataframe_'+ str(datetime.datetime.now())+'.csv'
main_df.to_csv(filepath, index=False)
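
A note on the backup filename: `str(datetime.datetime.now())` contains spaces and colons, and colons are not allowed in Windows filenames. A sketch of a more portable name is below; `backup_path` is a hypothetical helper and the timestamp format is just one reasonable choice.

```python
import datetime


def backup_path(prefix='./data/backup_my_dataframe_', ext='.csv', now=None):
    """Build a backup path with a compact, filesystem-friendly timestamp."""
    if now is None:
        now = datetime.datetime.now()
    # e.g. 20180830_140509 rather than '2018-08-30 14:05:09.123456'
    return prefix + now.strftime('%Y%m%d_%H%M%S') + ext
```

With this in place, the backup cell would become `main_df.to_csv(backup_path(), index=False)`.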

This is to save the state of the notebook so if I have to relaunch I don't have to re-run everything.

In [11]:
dill_session = '083018_1'

In [12]:
%%time
import dill
# Save
dill.dump_session('project3_notebook_env_'+ dill_session +'.db')


CPU times: user 116 ms, sys: 11.8 ms, total: 128 ms
Wall time: 128 ms


In [None]:
%%time
import dill
# Load
dill.load_session('project3_notebook_env_' + dill_session + '.db')

## Cleaning up the posts

I need to make sure that the posts aren't giving away where they're from in any obvious way, such as having the subreddit name in the text.


In [51]:
# phrases that indicate subreddit name - to remove
phrases = ['tales from tech support', 'looking for gamers', 'looking for games', 'no sleep']

In [73]:
my_words = set()
for i in ['talesfromtechsupport', 'lfg', 'nosleep']:
    words = [i, '/r/'+i, 'r/'+i]
    my_words.update(words)

words = ['tfts']
my_words.update(words)
my_words

{'/r/lfg',
 '/r/nosleep',
 '/r/talesfromtechsupport',
 'lfg',
 'nosleep',
 'r/lfg',
 'r/nosleep',
 'r/talesfromtechsupport',
 'talesfromtechsupport',
 'tfts'}

In [75]:
# Input
    # text - string
    # stem_or_lem - string, 'stem' uses PorterStemmer, 'lem' uses WordNetLemmatizer, anything else uses nothing
    
# Output
    # a string
    
def process(text, stem_or_lem = 'lem', stop = 'english'):
    
    # lower case
    text = text.lower()
    
    # remove potential phrases of the subreddit name
    phrases = ['tales from tech support', 'looking for gamers', 'looking for games', 'no sleep']
    for i in phrases:
        text = text.replace(i, '')

    
    # Grab all of the words. Disregard punctuation. 
    tokenizer = RegexpTokenizer(r'(\$?(\d+[\.,]?)+%?|(\/?\w+)+)') 
    tokens = tokenizer.tokenize(text)
    #print(tokens)
    new_tokens = [i[0] for i in tokens]
    
    # Remove stop words
    if stop is not None:
        #print('Stop')
        stops = set(stopwords.words(stop))
        stops.update(my_words)
        new_tokens = [word for word in new_tokens if word not in stops]
    
    if stem_or_lem == 'lem':
        #print('Lem')
        lemmatizer = WordNetLemmatizer()
        new_tokens = [lemmatizer.lemmatize(i) for i in new_tokens]
        
    elif stem_or_lem == 'stem':
        #print('Stem')
        p_stemmer = PorterStemmer()
        new_tokens = [p_stemmer.stem(i) for i in new_tokens]
        
   
    
    return " ".join(new_tokens)

text = "This is an example of some text! I'm NoT eNtIrElY sure why I paid $49.23 for it but hey, /r/dumb_thing /r/lfg r/LFG tales from tech support tfts looking for gamers is a great place. 39%. computers compute computing. I.Wonder.What'll.Happen"

results = process(text, 'stem')
print(type(results))
results


<class 'str'>


'exampl text entir sure paid $49.23 hey /r/dumb_th great place 39% comput comput comput wonder happen'

In [88]:
main_df.dtypes

subreddit          object
url                object
author             object
domain             object
downs               int64
is_self              bool
is_video             bool
likes             float64
media             float64
num_comments        int64
num_crossposts      int64
num_reports       float64
selftext           object
score               int64
title              object
ups                 int64
dtype: object

In [None]:

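for i in ['title', 'selftext']:
    # an empty selftext comes back from read_csv as NaN (a float), which would
    # break text.lower(), so fill those with an empty string first
    main_df[i] = main_df[i].fillna('').map(process)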

In [92]:
%pixie_debugger

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
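
To make the count / binary / tf-idf distinction concrete, here is a minimal sketch on four made-up documents (they are not from the scraped data). In the notebook, the input would be the processed `title` and `selftext` columns instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# toy documents, invented for illustration only
docs = [
    'printer broke printer jammed',
    'user forgot password again',
    'looking for online game group',
    'new player looking for campaign',
]

count_vec = CountVectorizer()              # raw term counts
binary_vec = CountVectorizer(binary=True)  # 1 if the term appears, else 0
tfidf_vec = TfidfVectorizer()              # counts reweighted by rarity

X_count = count_vec.fit_transform(docs)
X_binary = binary_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)

# all three share one vocabulary here, so the matrices have the same shape
print(X_count.shape, X_binary.shape, X_tfidf.shape)
print(sorted(count_vec.vocabulary_))
```

In the first document "printer" appears twice, so its count feature is 2 while its binary feature is 1. Whether that difference helps the classifier is exactly what the questions above ask you to test.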

In [76]:
print(main_df.columns)

Index(['subreddit', 'url', 'author', 'domain', 'downs', 'is_self', 'is_video',
       'likes', 'media', 'num_comments', 'num_crossposts', 'num_reports',
       'selftext', 'score', 'title', 'ups'],
      dtype='object')


## Predicting subreddit using Random Forests + Another Classifier

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
## YOUR CODE HERE
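
A possible way to build the target, sketched in plain Python. Which subreddit gets class `1` is arbitrary; this example assumes `'lfg'`, and the comparison ignores case because I have not checked how the names are capitalized in the `subreddit` column. `encode_target` is a hypothetical helper.

```python
def encode_target(subreddits, positive='lfg'):
    """Map subreddit names to 1 (the positive class) or 0 (everything else)."""
    positive = positive.lower()
    return [1 if name.lower() == positive else 0 for name in subreddits]


example = ['LFG', 'talesfromtechsupport', 'lfg']
print(encode_target(example))
```

With the dataframe, this could be applied as `main_df['target'] = encode_target(main_df['subreddit'])`.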

#### Thought experiment: What is the baseline accuracy for this model?

In [None]:
## YOUR CODE HERE
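
The baseline is the accuracy of always guessing the most common class, so it is just that class's share of the data. A minimal sketch, with `baseline_accuracy` as a made-up helper name and toy labels standing in for the real target:

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# toy labels: three posts from class 0 and one from class 1
print(baseline_accuracy([0, 0, 0, 1]))  # 0.75
```

Any model worth keeping should beat this number on held-out data.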

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.

In [None]:
## YOUR CODE HERE
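
A sketch of cross-validating a vectorizer and a random forest together. Putting both in a `Pipeline` means the vectorizer is refit on each training fold, so no vocabulary leaks in from the validation fold. The texts and labels below are invented placeholders, and `n_estimators=50`, `cv=3`, and `random_state=42` are arbitrary choices rather than tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# placeholder data: 6 toy posts per class, invented for illustration
texts = [
    'printer will not print', 'user forgot the password', 'server rebooted itself',
    'monitor cable unplugged', 'email stopped syncing', 'laptop battery died',
    'looking for online group', 'new player seeking campaign', 'need a dungeon master',
    'weekly game on discord', 'group needs one more player', 'starting a new campaign',
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ('vec', TfidfVectorizer()),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# one accuracy score per fold
scores = cross_val_score(pipe, texts, labels, cv=3, scoring='accuracy')
print(scores.mean())
```

For the bonus, the same `pipe` can be passed to `GridSearchCV` with parameter names such as `vec__ngram_range` or `clf__n_estimators`.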

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.