# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles Python's default user agent. You'll need to set a custom `User-agent` to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [11]:
import requests
import json
import time
import pandas as pd

In [186]:
URL = "http://www.reddit.com/r/boardgames.json"

In [None]:
URL = "https://www.reddit.com/r/boardgames/comments/9cbiai/rboardgames_daily_discussion_and_game.json"

In [187]:
## YOUR CODE HERE
test_dict = {}
res = requests.get(URL, headers={'User-agent': 'DT Bot 0.1'})
test_dict[1] = res

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```
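
A subreddit listing is a nested dictionary: the posts sit under `data['data']['children']`, and each child carries its own `data` dictionary. Here is a minimal sketch of pulling selected fields out of that structure. The `sample_listing` dictionary and the `extract_fields` helper are illustrative stand-ins (not real API output or part of any library), so the block runs without a network call.

```python
# Hand-made stand-in for res.json(); real responses carry many more keys.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example",
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 3}},
        ],
    },
}


def extract_fields(listing, fields):
    """Return one dict per post containing only the requested fields."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        # .get() yields None for keys that a given post lacks.
        rows.append({field: post.get(field) for field in fields})
    return rows


rows = extract_fields(sample_listing, ["title", "score"])
print(rows)
```

Using `.get()` instead of direct indexing is a design choice here: not every post carries every key, so a missing key becomes `None` rather than raising a `KeyError`.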

In [188]:
comment_data = res.json()

In [319]:
range(len(comment_data[1]['data']['children']))

range(0, 20)

In [None]:
comment_data

In [291]:
for i in range(len(comment_data[1]['data']['children'])):
    print(comment_data[1]['data']['children'][i]['data']['body'])

Please keep all calls to the [Board Game Recommender Bot](https://www.reddit.com/r/boardgamerecommender/comments/82lkbo/rboardgames_home_of_bg_recommender_mar2018sep2018/) as replies to this comment. Thanks!

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/boardgames) if you have any questions or concerns.*
[deleted]
I've just preordered Arkham Horror 3rdE (after sellling my Eldritch Horror) watching just a gameplay (sorta) from GenCon and Discover: Lands Unknown, practically not knowing anything about the game; just out of curiosity. Wish me luck guys.
Big catan fan but know that there are better ones out there. Started out with original catan and recently been playing cities and knights and love it. 

I play with my two other "roommates" and occasionally we get a 4th and 5th "person" interested in playing, so ideally looking for a game we can play between 3-5+ players (if there are any two player game

In [290]:
comment_data[1]['data']['children'][19]['data']['body']

'**Grimslingers** \n\nI recently read an article about upcoming games.  One being Outlaws in a strange land.   The art was stunning. It said it took place in the Grimslingers universe . So naturally I looked up the games.  They seem pretty interesting. Anyone have any experience playing them ? Could you recommend? The good, the bad ? Etc.\n\nThanks !'

In [151]:
data = res.json()

In [152]:
data['data']['after']

't3_9bxyad'

In [153]:
data['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [217]:
data

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'boardgames',
     'selftext': '**Welcome to /r/boardgames Daily Discussion and Game Recommendations**\n\nThis is meant to be a place where you can ask any and all questions relating to the board gaming world: general or specific game recommendations, rule clarifications, definitions of terms/acronyms, and other quick questions that might not warrant their own post. \n\nIf you are seeking game recommendations you will get better responses if you give us enough background to help you. You can use [this template](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template-no-explainer) to do so. [Here](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template) is a version with explanations of what we\'re looking for.  \n\nIf you reply to any comment that has a game name in **bold** with "**/u/r2d8

In [155]:
data['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'parent_whitelist_status', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'domain', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'contest_mode', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_1

In [156]:
['ups', 'downs', 'likes', 'num_reports', 'selftext', 'selftext_html', 'gilded', 'distinguished', 'over_18', 'num_comments', 'locked', 'score', 'title', 'stickied']

['ups',
 'downs',
 'likes',
 'num_reports',
 'selftext',
 'selftext_html',
 'gilded',
 'distinguished',
 'over_18',
 'num_comments',
 'locked',
 'score',
 'title',
 'stickied']

In [277]:
data['data']['children'][0]['data']

{'approved_at_utc': None,
 'subreddit': 'boardgames',
 'selftext': '**Welcome to /r/boardgames Daily Discussion and Game Recommendations**\n\nThis is meant to be a place where you can ask any and all questions relating to the board gaming world: general or specific game recommendations, rule clarifications, definitions of terms/acronyms, and other quick questions that might not warrant their own post. \n\nIf you are seeking game recommendations you will get better responses if you give us enough background to help you. You can use [this template](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template-no-explainer) to do so. [Here](https://www.reddit.com/r/boardgames/wiki/personalized-game-recommendation-template) is a version with explanations of what we\'re looking for.  \n\nIf you reply to any comment that has a game name in **bold** with "**/u/r2d8 getparentinfo**", one of our robots will tell you more about the game\n\nJust remember that this is a commun

In [158]:
for i in range(len(data['data']['children'])):
    base_keys = data['data']['children'][0]['data'].keys()
    new_keys = data['data']['children'][i]['data'].keys()
    if base_keys != new_keys:
        print(i)
        for key in base_keys:
            if key not in new_keys:
                print("{} not in {} keys".format(key, i))
        for key in new_keys:
            if key not in base_keys:
                print("{} not in {} keys".format(key, 0))

1
post_hint not in 0 keys
preview not in 0 keys
3
post_hint not in 0 keys
preview not in 0 keys
4
post_hint not in 0 keys
preview not in 0 keys
6
post_hint not in 0 keys
preview not in 0 keys
7
post_hint not in 0 keys
preview not in 0 keys
10
post_hint not in 0 keys
preview not in 0 keys
15
post_hint not in 0 keys
preview not in 0 keys
16
post_hint not in 0 keys
preview not in 0 keys
19
post_hint not in 0 keys
preview not in 0 keys
20
post_hint not in 0 keys
preview not in 0 keys
23
post_hint not in 0 keys
preview not in 0 keys
25
post_hint not in 0 keys
preview not in 0 keys


In [254]:
data['data']['children'][23]['data']['url']

'https://www.reddit.com/r/boardgames/comments/9cb0fh/7_wonders_score_sheet_removed_from_google_play/'

In [159]:
print(len(data['data']['children']))

26


#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3)  # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
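
The three steps and the throttling note can be sketched as a single loop. This is only one possible shape, under some assumptions: `collect_posts` and `fetch_page` are hypothetical names, `fetch_page` stands for whatever function performs the actual request, and the loop assumes `after` comes back as `None` once there are no further pages. A fake two-page feed is used so the sketch runs offline.

```python
import time


def collect_posts(fetch_page, max_pages, pause=3):
    """Follow the 'after' cursor for up to max_pages pages.

    fetch_page is a hypothetical callable: it takes the current 'after'
    value (None for the first page) and returns the listing's 'data' dict.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        data = fetch_page(after)
        posts.extend(child["data"] for child in data["children"])
        after = data["after"]
        if after is None:  # assumed to mean there are no further pages
            break
        time.sleep(pause)  # pause between requests to respect the limit
    return posts


# Fake two-page feed so the sketch runs without a network call.
fake_pages = {
    None: {"children": [{"data": {"title": "a"}}], "after": "t3_x"},
    "t3_x": {"children": [{"data": {"title": "b"}}], "after": None},
}
posts = collect_posts(lambda after: fake_pages[after], max_pages=5, pause=0)
print([p["title"] for p in posts])
```

With real requests, `fetch_page` could wrap something like `requests.get(URL, params={'after': after}, headers=...).json()['data']`, keeping the paging logic separate from the HTTP call.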

In [20]:
## YOUR CODE HERE
board_game_df = pd.DataFrame()
video_game_df = pd.DataFrame()

In [10]:
categories = ['ups', 'downs', 'likes', 'num_reports', 'selftext', 'selftext_html', 'gilded', 'distinguished',
              'over_18', 'num_comments', 'locked', 'score', 'title', 'stickied', 'id']

In [4]:
def reddit_scraper(subreddit, df, n):
    
    # Set the index for the dataframe
    df_index = 0
    # Start grabbing the information from the appropriate json
    url = "http://www.reddit.com/r/{}.json".format(subreddit)
    for _ in range(n):
        result = requests.get(url, headers={'User-agent': 'DT Bot 0.1'})
        data = result.json()['data']
        # For each entry in the json file, find the features we want and store it into a dataframe
        for i in range(len(data['children'])):
            for category in categories:
                df.loc[df_index, category] = data['children'][i]['data'][category]
            df_index += 1
        
        # Make updates and get ready to scrape again
        time.sleep(3)
        after = data['after']
        url = "http://www.reddit.com/r/{}.json?after={}".format(subreddit, after)
        
        if df_index % 1000 == 0:
            print(df_index)

        
    return

In [7]:
reddit_scraper('boardgames', board_game_df, 1)

In [183]:
board_game_df['title']

0        /r/boardgames Daily Discussion and Game Recomm...
1             Framed my Pandemic Season 1 board [Spoilers]
2        How do you keep other experienced players from...
3        First indie film about the ancient game of 'Go...
4        Ignore the title. But is this a legal move in ...
5        The Dominion expansion you never knew you need...
6              What are Great Games to Use Metal Coins In?
7                          Finally got games to the table!
8        Trying to remember the name of a game similar ...
9                         How do farms work in Carcasonne?
10       Am I playing wrong or is Spirit Island just re...
11                        What 4X rulebooks should I read?
12       I need some relationship advice in regards to ...
13                      Simple foam core insert for Jaipur
14              Sierra Madre Games sold to Ion Game Design
15       How to stop family from fighting about which g...
16                                                  Dixi

In [None]:
password = 'YOUR_PASSWORD'  # placeholder: keep real credentials out of the notebook

In [5]:
import requests
import requests.auth

In [38]:
import requests
import requests.auth
client_auth = requests.auth.HTTPBasicAuth(client_id, secret)
post_data = {"grant_type": "password", "username": username, "password": password}
headers = {"User-Agent": "{} Bot 0.1".format(username)}
response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
token = response.json()['access_token']

In [39]:
response.json()

{'access_token': '145801484283-DrpCvSm0JoxlCyTFZej2amjg4vc',
 'token_type': 'bearer',
 'expires_in': 3600,
 'scope': '*'}

In [90]:
headers = {"Authorization": "bearer " + token, "User-Agent": "{} Bot 0.1".format(username)}
response = requests.get("https://oauth.reddit.com/r/boardgames.json", headers=headers)

In [97]:
print(response.headers['X-Ratelimit-Used'])
print(response.headers['X-Ratelimit-Remaining'])
print(response.headers['X-Ratelimit-Reset'])

2
598.0
596
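
The three rate-limit headers printed above arrive as strings (note the `598.0`), so they need converting before any comparison or sleep. Here is a minimal sketch of turning them into a wait time. The helper name `seconds_to_wait` is hypothetical, and treating `X-Ratelimit-Reset` as the number of seconds until the limit resets is an assumption based on the values shown.

```python
def seconds_to_wait(headers):
    """Return how long to pause, based on the rate-limit headers."""
    # Header values are strings such as '598.0', so convert them first.
    remaining = float(headers.get("X-Ratelimit-Remaining", 1))
    reset = float(headers.get("X-Ratelimit-Reset", 0))
    # Only wait when no requests are left in the current window.
    return reset if remaining < 1 else 0.0


print(seconds_to_wait({"X-Ratelimit-Remaining": "598.0", "X-Ratelimit-Reset": "596"}))
print(seconds_to_wait({"X-Ratelimit-Remaining": "0", "X-Ratelimit-Reset": "42"}))
```

The result can be passed straight to `time.sleep()` after each request.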


In [7]:
def reddit_token_grabber(username, password, client_id, secret):
    
    # Get a token to use to scrape
    client_auth = requests.auth.HTTPBasicAuth(client_id, secret)
    post_data = {"grant_type": "password", "username": username, "password": password}
    headers = {"User-Agent": username + " Bot 0.1"}
    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    token = response.json()['access_token']
    return token

In [13]:
def subreddit_scraper(subreddit, username, password, client_id, secret, df, categories, n):
    
    # Set the index for the dataframe to save the results
    df_index = 0

    # Start grabbing the information from the appropriate json
    token = reddit_token_grabber(username, password, client_id, secret)
    headers = {"Authorization": "bearer " + token, "User-Agent": username + " Bot 0.1"}
    url = "https://oauth.reddit.com/r/{}.json".format(subreddit)
    
    for counter in range(n):
        
        # Get a bunch of subreddit posts and store them as data
        result = requests.get(url, headers=headers)
        data = result.json()['data']
        
        # For each entry in the json file, find the features we want and store it into a dataframe
        for i in range(len(data['children'])):
            for category in categories:
                df.loc[df_index, category] = data['children'][i]['data'][category]
        
            # Also, get the main comments from the subreddit post
            if data['children'][i]['data']['num_comments'] > 0:
                comments_url = 'https://oauth.reddit.com{}.json'.format(data['children'][i]['data']['permalink'])
                comments_result = requests.get(comments_url, headers=headers)
                comments_data = comments_result.json()[1]['data']['children']
                comments = ''
                for j in range(len(comments_data)):
                    try:
                        comments += comments_data[j]['data']['body']
                    except KeyError:  # some entries have no 'body' key
                        pass
                df.loc[df_index, 'comments'] = comments
            else:
                df.loc[df_index, 'comments'] = ''
                
            df_index += 1
        
        # Make updates and get ready to scrape again
        # Header values are strings, so convert before comparing or sleeping
        if float(result.headers['X-Ratelimit-Remaining']) < 1:
            wait = float(result.headers['X-Ratelimit-Reset'])
            print("Need to wait: {} seconds".format(wait))
            time.sleep(wait)
        after = data['after']
        if after is None:  # no more pages to fetch
            break
        url = "https://oauth.reddit.com/r/{}.json?after={}".format(subreddit, after)
        
        # Keep the user updated on the progress
        if (counter + 1) % 10 == 0:
            print(counter + 1)

        
    return

In [9]:
subreddit_scraper(subreddit='boardgames', username=username, password=password, client_id=client_id,
                  secret=secret, df=board_game_df, categories=categories, n=1000)

KeyError: 'body'
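
A `KeyError: 'body'` like this one most likely means that some entry in the comment listing has no `body` key. One probable cause is a trailing placeholder of kind `more` that stands in for collapsed comments. Below is a minimal sketch of collecting comment text while skipping such entries. The helper `join_comment_bodies` and the sample data are illustrative, not real API output.

```python
def join_comment_bodies(children):
    """Concatenate top-level comment text, skipping entries without a body."""
    bodies = []
    for child in children:
        body = child.get("data", {}).get("body")
        if body is not None:
            bodies.append(body)
    # Join with a space so words from adjacent comments do not run together.
    return " ".join(bodies)


# Hand-made sample: the last entry imitates a 'more' placeholder.
sample_children = [
    {"kind": "t1", "data": {"body": "Great game."}},
    {"kind": "t1", "data": {"body": "Agreed!"}},
    {"kind": "more", "data": {"count": 12, "children": ["abc"]}},
]
print(join_comment_bodies(sample_children))
```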

In [21]:
subreddit_scraper(subreddit='VideoGame', username=username, password=password, client_id=client_id,
                  secret=secret, df=video_game_df, categories=categories, n=1000)

10


KeyboardInterrupt: 

### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.
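
One way to save regularly is to append each scraped batch to the same file, writing the header only the first time. The sketch below uses the standard-library `csv` module and assumes each batch is a list of dictionaries. The helper `append_rows` and the file name are hypothetical, and the demo writes to a temporary directory.

```python
import csv
import os
import tempfile


def append_rows(path, rows, fieldnames):
    """Append rows (a list of dicts) to a CSV, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo in a temporary directory; the file name is illustrative.
checkpoint = os.path.join(tempfile.mkdtemp(), "board_games_checkpoint.csv")
append_rows(checkpoint, [{"title": "Post A", "score": 5}], ["title", "score"])
append_rows(checkpoint, [{"title": "Post B", "score": 7}], ["title", "score"])

with open(checkpoint, newline="", encoding="utf-8") as f:
    saved = list(csv.DictReader(f))
print(len(saved), saved[0]["title"])
```

A file written this way can be loaded back with `pd.read_csv` once scraping is finished.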

In [124]:
# Export to csv
# board_game_df.to_csv('./data/board_games_1', index=False)

In [143]:
# video_game_df.to_csv('./data/video_games_1', index=False)

In [140]:
# board_game_df = pd.read_csv('./data/board_games_1')

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
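
To make the difference between count, binary, and TF-IDF features concrete, here is a small sketch on a made-up corpus of three titles (the titles are illustrative, not scraped data). It assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny made-up corpus standing in for post titles.
titles = [
    "best board game for two players",
    "board game night game ideas",
    "new video game trailer",
]

count_vec = CountVectorizer()              # raw term counts
binary_vec = CountVectorizer(binary=True)  # 1 if the term appears at all
tfidf_vec = TfidfVectorizer()              # counts reweighted by term rarity

counts = count_vec.fit_transform(titles)
binary = binary_vec.fit_transform(titles)
tfidf = tfidf_vec.fit_transform(titles)

# "game" occurs twice in the second title, so count and binary differ there.
game_col = count_vec.vocabulary_["game"]
print(counts[1, game_col], binary[1, game_col])
print(tfidf.shape)
```

All three matrices have the same shape, so they can be swapped in and out of the same model to compare performance.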

In [None]:
# Use a TF-IDF vectorizer (unigrams and bigrams) on the text of the data
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2))
# Fit on the training text only, then reuse the fitted vocabulary on the test text
# (get_feature_names_out() requires scikit-learn >= 1.0)
text_df = pd.DataFrame(tfidf.fit_transform(X_train['text']).toarray(), columns=tfidf.get_feature_names_out(),
                       index=X_train.index)
X_train = pd.concat([X_train[['num_comments', 'score']], text_df], axis=1)
text_df = pd.DataFrame(tfidf.transform(X_test['text']).toarray(), columns=tfidf.get_feature_names_out(),
                       index=X_test.index)
X_test = pd.concat([X_test[['num_comments', 'score']], text_df], axis=1)

In [None]:
## YOUR CODE HERE

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
## YOUR CODE HERE

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [None]:
## YOUR CODE HERE

#### Thought experiment: What is the baseline accuracy for this model?
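
One common reading of "baseline accuracy" is the accuracy you would get by always predicting the most frequent class. A minimal sketch under that assumption, with made-up labels and a hypothetical helper named `baseline_accuracy`:

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)


# Made-up labels: 1 = first subreddit, 0 = second subreddit.
y = [1, 1, 1, 0, 0]
print(baseline_accuracy(y))
```

A model is only useful if it beats this number, which sits near 0.5 when the two subreddits contribute similar numbers of posts.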

In [None]:
## YOUR CODE HERE

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [None]:
## YOUR CODE HERE

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
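
For the bonus, here is a rough sketch of how `GridSearchCV` and `Pipeline` fit together. The texts, labels, and parameter grid are made up for illustration and are far too small to give meaningful scores. The point is the wiring, not the result. It assumes scikit-learn is installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up titles and labels (1 = board games, 0 = video games).
texts = [
    "board game night", "new dice tower review", "card game expansion",
    "tabletop strategy board", "console game release", "pc graphics card",
    "video game trailer", "online multiplayer console",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
])

# Step-name prefixes ("tfidf__", "rf__") route each setting to its step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "rf__max_depth": [None, 3],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline means it is refit on the training portion of each fold, so the vocabulary never sees the held-out posts during cross-validation.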

In [None]:
## YOUR CODE HERE

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [None]:
## YOUR CODE HERE

# Executive Summary
---
Put your executive summary in a Markdown cell below.

In [6]:
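import os

# Read credentials from environment variables (names are illustrative)
# rather than hard-coding them in the notebook
subreddit = 'videogames'
username = os.environ['REDDIT_USERNAME']
password = os.environ['REDDIT_PASSWORD']
client_id = os.environ['REDDIT_CLIENT_ID']
secret = os.environ['REDDIT_SECRET']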