<a id="top"></a>

# Data Gathering

---



#### This Notebook
- [Scraping Function](#func)
- [Scraping the Subreddits](#scrape)

#### Other Notebooks
- [Data Cleaning and EDA](cleaning_and_EDA.ipynb)
- [Models](models.ipynb)

### Importing
---

In [4]:
import pandas as pd
import numpy as np
import requests
import time

<a id = "func"></a>

## Scraping Function

---

Factoring the scraping process into a function makes scraping fast and repeatable. It lets us gather a lot of data either from a few subreddits or from a wide variety of them. In this project, data was scraped on different days of the week, after posts had had a chance to cycle through. This matters more when the chosen subreddit is slower or has fewer subscribers.

[Back to top](#top)

In [2]:
# Function for scraping roughly n_posts posts from a specified subreddit.
# Calling the API for a specific subreddit without an "after" query
# returns the first 25 posts in addition to any pinned posts, so
# the number of posts returned is likely to be a few higher than
# the number requested.

# After around 1000 posts, reddit's API will start to give duplicate posts.
# Duplicate checking is the duty of the user, especially if scraping again 
# before new posts have had a chance to circulate through the subreddit.

def scrape_subreddit(subreddit, n_posts, u_agent = "pepega bot", nap = 2):
    
    posts = []
    params = {}
    url = f"https://www.reddit.com/r/{subreddit}.json"
    
    for i in range(0, n_posts, 25):
        
        print(f"Gathering {i + 25} {subreddit} posts...", end = "")

        res = requests.get(url, params = params, headers = {"User-agent": u_agent})
        
        if res.status_code != 200:
            print("\n" + f"Unexpected status code, exiting loop. Status code: {res.status_code}")
            break
        
        listing = res.json()["data"]
        posts.extend(listing["children"])
        
        print("Complete!")

        # a null "after" token means there is no next page; requests drops
        # None-valued params, so continuing would re-fetch the first page
        if listing["after"] is None:
            break
        params = {"after": listing["after"]}

        time.sleep(nap)
    
    # automatically "de-nest" the posts; "kind" key is useless for our purposes
    posts = [post["data"] for post in posts]
        
    print(f"Scrape complete, returning {len(posts)} posts.")
        
    return posts
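
The token-based paging that `scrape_subreddit` relies on can be tried offline. The sketch below is not part of the project code: `paginate`, `make_page`, and `fake_fetch` are hypothetical names, and the fake pages only imitate the `data` / `children` / `after` shape that the function above reads. It assumes the listing reports `after` as null once no further page exists.

```python
def make_page(titles, after):
    """Build a fake listing page shaped like the JSON the scraper reads (assumed shape)."""
    children = [{"kind": "t3", "data": {"title": t}} for t in titles]
    return {"data": {"children": children, "after": after}}


# Offline stand-in for the API: three pages of two posts each (made-up data).
FAKE_PAGES = {
    None: make_page(["a", "b"], "t3_b"),
    "t3_b": make_page(["c", "d"], "t3_d"),
    "t3_d": make_page(["e", "f"], None),
}


def fake_fetch(after):
    """Hypothetical fetcher: return the page that follows the given token."""
    return FAKE_PAGES[after]


def paginate(fetch_page, n_posts, page_size=25):
    """Collect de-nested posts by following the "after" token page by page."""
    posts = []
    after = None
    for _ in range(0, n_posts, page_size):
        listing = fetch_page(after)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        if after is None:
            # No next page: stop instead of starting over from page one.
            break
    return posts


titles = [post["title"] for post in paginate(fake_fetch, n_posts=10, page_size=2)]
print(titles)
```

Requesting more posts than the fake listing holds ends the loop after the third page, which is the situation in which a real scrape would otherwise begin returning repeats.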

<a id = "scrape"></a>
## Scraping the Subreddits
---

This is where the scraping itself takes place. First, the basic training set of about 1000 posts per subreddit is gathered. Those cells are then commented out so that they cannot be rerun and the training data does not change. Consistent training data is important for interpreting the models in the Models notebook. Testing data is scraped about a week later. This mimics a Kaggle competition, where the model is built on training data and the true evaluation comes from how well it performs on completely unseen data. Finally, posts are scraped from similar subreddits that cover the same topics. This is so that the model can classify the underlying topics of the subreddit rather than the particulars of a single subreddit, such as the links shared or the automated threads generated by the auto moderators.

[Back to Top](#top)
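
Since the test set is meant to be unseen, one quick sanity check is to look for titles that appear in both the training and the test scrape. This is a minimal sketch, not a step the notebook performs: `title_overlap` is a hypothetical helper and the example posts are made up. A shared title is only a hint, because recurring automated threads can reuse a title across different posts.

```python
def title_overlap(train_posts, test_posts):
    """Return the set of titles that appear in both lists of post dicts."""
    train_titles = {post["title"] for post in train_posts}
    test_titles = {post["title"] for post in test_posts}
    return train_titles & test_titles


# Made-up posts standing in for the de-nested dicts returned by the scraper.
train_example = [{"title": "Homebrew class feedback"},
                 {"title": "Weekly questions thread"}]
test_example = [{"title": "Weekly questions thread"},
                {"title": "New campaign advice"}]

shared = title_overlap(train_example, test_example)
print(shared)  # {'Weekly questions thread'}
```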

In [23]:
# # Scraping the DND subreddit for the training data frame.
# # Do not rerun!

# dnd_posts = scrape_subreddit("DNDNext", 1000)

Gathering 25 DNDNext posts...Complete!
Gathering 50 DNDNext posts...Complete!
Gathering 75 DNDNext posts...Complete!
Gathering 100 DNDNext posts...Complete!
Gathering 125 DNDNext posts...Complete!
Gathering 150 DNDNext posts...Complete!
Gathering 175 DNDNext posts...Complete!
Gathering 200 DNDNext posts...Complete!
Gathering 225 DNDNext posts...Complete!
Gathering 250 DNDNext posts...Complete!
Gathering 275 DNDNext posts...Complete!
Gathering 300 DNDNext posts...Complete!
Gathering 325 DNDNext posts...Complete!
Gathering 350 DNDNext posts...Complete!
Gathering 375 DNDNext posts...Complete!
Gathering 400 DNDNext posts...Complete!
Gathering 425 DNDNext posts...Complete!
Gathering 450 DNDNext posts...Complete!
Gathering 475 DNDNext posts...Complete!
Gathering 500 DNDNext posts...Complete!
Gathering 525 DNDNext posts...Complete!
Gathering 550 DNDNext posts...Complete!
Gathering 575 DNDNext posts...Complete!
Gathering 600 DNDNext posts...Complete!
Gathering 625 DNDNext posts...Complete!
Gat

In [24]:
# # Scraping the Pathfinder subreddit for the training data frame.
# # Do not rerun!

# path_posts = scrape_subreddit("Pathfinder_RPG", 1000)

Gathering 25 Pathfinder_RPG posts...Complete!
Gathering 50 Pathfinder_RPG posts...Complete!
Gathering 75 Pathfinder_RPG posts...Complete!
Gathering 100 Pathfinder_RPG posts...Complete!
Gathering 125 Pathfinder_RPG posts...Complete!
Gathering 150 Pathfinder_RPG posts...Complete!
Gathering 175 Pathfinder_RPG posts...Complete!
Gathering 200 Pathfinder_RPG posts...Complete!
Gathering 225 Pathfinder_RPG posts...Complete!
Gathering 250 Pathfinder_RPG posts...Complete!
Gathering 275 Pathfinder_RPG posts...Complete!
Gathering 300 Pathfinder_RPG posts...Complete!
Gathering 325 Pathfinder_RPG posts...Complete!
Gathering 350 Pathfinder_RPG posts...Complete!
Gathering 375 Pathfinder_RPG posts...Complete!
Gathering 400 Pathfinder_RPG posts...Complete!
Gathering 425 Pathfinder_RPG posts...Complete!
Gathering 450 Pathfinder_RPG posts...Complete!
Gathering 475 Pathfinder_RPG posts...Complete!
Gathering 500 Pathfinder_RPG posts...Complete!
Gathering 525 Pathfinder_RPG posts...Complete!
Gathering 550 Pa

In [142]:
# # Code for manually checking the titles of the posts to ensure no dupes.
# # Don't run on large numbers of posts (obviously)
# # scrape_subreddit already de-nests each post, so there is no "data" key here.

# for i, post in enumerate(dnd_posts):
#     print(f"POST NUMBER {i + 1}" + "-" * 100 + "\n")
#     print(post["title"])
#     print("\n")

In [6]:
# Scraping for a test dataset, to be run at a later date so 
# that threads have a chance to cycle through the subreddit.
# The purpose of this separate testing data set is to mimic
# the structure of a Kaggle competition, such that a model
# could be generalized to unseen data. 

dnd_test = scrape_subreddit("DNDNext", 500)
path_test = scrape_subreddit("Pathfinder_RPG", 500)

Gathering 25 DNDNext posts...Complete!
Gathering 50 DNDNext posts...Complete!
Gathering 75 DNDNext posts...Complete!
Gathering 100 DNDNext posts...Complete!
Gathering 125 DNDNext posts...Complete!
Gathering 150 DNDNext posts...Complete!
Gathering 175 DNDNext posts...Complete!
Gathering 200 DNDNext posts...Complete!
Gathering 225 DNDNext posts...Complete!
Gathering 250 DNDNext posts...Complete!
Gathering 275 DNDNext posts...Complete!
Gathering 300 DNDNext posts...Complete!
Gathering 325 DNDNext posts...Complete!
Gathering 350 DNDNext posts...Complete!
Gathering 375 DNDNext posts...Complete!
Gathering 400 DNDNext posts...Complete!
Gathering 425 DNDNext posts...Complete!
Gathering 450 DNDNext posts...Complete!
Gathering 475 DNDNext posts...Complete!
Gathering 500 DNDNext posts...Complete!
Scrape complete, returning 502 posts.
Gathering 25 Pathfinder_RPG posts...Complete!
Gathering 50 Pathfinder_RPG posts...Complete!
Gathering 75 Pathfinder_RPG posts...Complete!
Gathering 100 Pathfinder_RP

In [9]:
# Scraping 500 posts from subreddits similar to the main 
# ones chosen for the project. Each subreddit is in the 
# odd position of being one of many for its particular
# game. For example, the DnDNext subreddit is a place for 
# discussion on specifically the newest edition of the 
# game, 5e. The actual DnD subreddit is dedicated to meta 
# discussion of DnD as a whole rather than specific
# instances of the game being played. DnDBehindTheScreen,
# on the other hand, is solely dedicated to discussion
# about running the game. While confusing, this gives us a
# unique opportunity to scrape data that is very similar in
# terms of content but comes from a different environment.

dnd_alt = scrape_subreddit("DnDBehindTheScreen", 500)
path_alt = scrape_subreddit("Pathfinder", 500)

Gathering 25 DnDBehindTheScreen posts...Complete!
Gathering 50 DnDBehindTheScreen posts...Complete!
Gathering 75 DnDBehindTheScreen posts...Complete!
Gathering 100 DnDBehindTheScreen posts...Complete!
Gathering 125 DnDBehindTheScreen posts...Complete!
Gathering 150 DnDBehindTheScreen posts...Complete!
Gathering 175 DnDBehindTheScreen posts...Complete!
Gathering 200 DnDBehindTheScreen posts...Complete!
Gathering 225 DnDBehindTheScreen posts...Complete!
Gathering 250 DnDBehindTheScreen posts...Complete!
Gathering 275 DnDBehindTheScreen posts...Complete!
Gathering 300 DnDBehindTheScreen posts...Complete!
Gathering 325 DnDBehindTheScreen posts...Complete!
Gathering 350 DnDBehindTheScreen posts...Complete!
Gathering 375 DnDBehindTheScreen posts...Complete!
Gathering 400 DnDBehindTheScreen posts...Complete!
Gathering 425 DnDBehindTheScreen posts...Complete!
Gathering 450 DnDBehindTheScreen posts...Complete!
Gathering 475 DnDBehindTheScreen posts...Complete!
Gathering 500 DnDBehindTheScreen p

### Duplicate Checking
---

Titles are used to check for duplicates because the post ID did not seem reliable for this purpose, whereas titles are expected to differ from post to post. Post bodies cannot be used because some posts have no body text, which leaves that field null. If a list has fewer unique titles than posts, at least one title is repeated.
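
When the counts differ, the repeated titles can be listed and the extra copies dropped. The sketch below is one way to do that under the same title-as-key assumption. `repeated_titles` and `dedupe_by_title` are hypothetical helpers that the notebook does not define, and the sample posts are made up.

```python
from collections import Counter


def repeated_titles(posts):
    """Return the titles that occur more than once, in first-seen order."""
    counts = Counter(post["title"] for post in posts)
    return [title for title, n in counts.items() if n > 1]


def dedupe_by_title(posts):
    """Keep only the first post seen for each title, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        if post["title"] not in seen:
            seen.add(post["title"])
            unique.append(post)
    return unique


# Made-up posts; "selftext" is None where a post has no body.
sample = [
    {"title": "Rules question", "selftext": "How does grappling work?"},
    {"title": "Character art", "selftext": None},
    {"title": "Rules question", "selftext": "How does grappling work?"},
]

print(repeated_titles(sample))       # ['Rules question']
print(len(dedupe_by_title(sample)))  # 2
```

Keeping the first occurrence is an arbitrary choice; it simply leaves the posts in scrape order.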

In [32]:
print(len(dnd_posts))
print(len(path_posts))

print(len({post["title"] for post in dnd_posts}))
print(len({post["title"] for post in path_posts}))

991
990
990
990


In [7]:
print(len(dnd_test))
print(len(path_test))

print(len({post["title"] for post in dnd_test}))
print(len({post["title"] for post in path_test}))

502
502
502
502


### Exporting
---

In [34]:
pd.DataFrame(dnd_posts).to_csv("../data/dnd_raw.csv", index = False)
pd.DataFrame(path_posts).to_csv("../data/path_raw.csv", index = False)

In [8]:
pd.DataFrame(dnd_test).to_csv("../data/dnd_test_raw.csv", index = False)
pd.DataFrame(path_test).to_csv("../data/path_test_raw.csv", index = False)

In [10]:
pd.DataFrame(dnd_alt).to_csv("../data/dnd_alt_raw.csv", index = False)
pd.DataFrame(path_alt).to_csv("../data/path_alt_raw.csv", index = False)

---
[Back to top](#top)