## Notebook Description
This notebook scrapes posts from the Backpacking and Ultrarunning subreddits, then saves the output as a .csv file in the `raw_data` folder.

In [1]:
import time
import requests
import pandas as pd

In [2]:
# This scraper depends on the pushshift API, which was down and returning a 521 error at the time this project was submitted

base_url = 'https://api.pushshift.io/reddit'
dfs = []
subreddits = ['backpacking', 'ultrarunning']
for subreddit in subreddits:
    # Set before to be current time
    before = int(time.time()) # now
    for i in range(1, 21): # Keep range small until things are working as expected
        # create params: before, subreddit, size
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before
        }
        # request one page of submissions
        res = requests.get(base_url + '/search/submission/', params=params)
        # skip this page if the request failed (e.g. a 521 while pushshift is down)
        if res.status_code != 200:
            print(f'Request {i} for {subreddit} failed with status {res.status_code}')
            time.sleep(3)
            continue
        # parse the JSON payload
        data = res.json()['data']
        if not data:
            break  # no older posts left for this subreddit
        # turn the JSON into a DataFrame, keeping only the columns of interest
        posts = pd.DataFrame(data)[['title', 'selftext', 'subreddit', 'created_utc']]
        # add posts DataFrame to dfs
        dfs.append(posts)
        print(f'Scraped {i * 100} posts for the {subreddit} subreddit')
        # set before to be the timestamp of the last (oldest) post on this page
        before = int(posts['created_utc'].iloc[-1])
        # space out requests to 3 second intervals - Super important!
        time.sleep(3)
        
# concat all dfs
df = pd.concat(dfs, ignore_index=True)

Scraped 100 posts for the backpacking subreddit
Scraped 200 posts for the backpacking subreddit
Scraped 300 posts for the backpacking subreddit
Scraped 400 posts for the backpacking subreddit
Scraped 500 posts for the backpacking subreddit
Scraped 600 posts for the backpacking subreddit
Scraped 700 posts for the backpacking subreddit
Scraped 800 posts for the backpacking subreddit
Scraped 900 posts for the backpacking subreddit
Scraped 1000 posts for the backpacking subreddit
Scraped 1100 posts for the backpacking subreddit
Scraped 1200 posts for the backpacking subreddit
Scraped 1300 posts for the backpacking subreddit
Scraped 1400 posts for the backpacking subreddit
Scraped 1500 posts for the backpacking subreddit
Scraped 1600 posts for the backpacking subreddit
Scraped 1700 posts for the backpacking subreddit
Scraped 1800 posts for the backpacking subreddit
Scraped 1900 posts for the backpacking subreddit
Scraped 2000 posts for the backpacking subreddit
Scraped 100 posts for the ult
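
Because the scraper depends on an API that can return errors such as 521, one option is to retry a failed request a few times before giving up. The helper below is only a sketch of that idea, not part of the original notebook: the name `get_with_retry`, its `retries`, `delay`, `get`, and `sleep` parameters, and the linearly growing wait are all assumptions made for illustration.

```python
import time


def get_with_retry(url, params, get=None, retries=3, delay=3.0, sleep=time.sleep):
    """Call get(url, params=params) until it returns a 200 or retries run out.

    Hypothetical helper: `get` defaults to requests.get, and the wait between
    attempts grows linearly (delay, 2 * delay, ...). Returns None on failure.
    """
    if get is None:
        import requests  # imported lazily so the helper can be exercised offline
        get = requests.get
    for attempt in range(1, retries + 1):
        res = get(url, params=params)
        if res.status_code == 200:
            return res
        print(f'Attempt {attempt} returned status {res.status_code}')
        if attempt < retries:
            sleep(delay * attempt)  # wait longer after each failed attempt
    return None
```

Inside the scraping loop, `res = get_with_retry(base_url + '/search/submission/', params)` could stand in for the direct `requests.get` call, with a `None` result treated as a skipped page.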

In [3]:
df['subreddit'].value_counts()

ultrarunning    2000
backpacking     1999
Name: subreddit, dtype: int64

In [4]:
df.head()

| | title | selftext | subreddit | created_utc |
|---|---|---|---|---|
| 0 | We’d comfort in the back country? Check these ... | | backpacking | 1646363240 |
| 1 | Finding water on the trail | Yo waddup every1, I was wondering how I might ... | backpacking | 1646363128 |
| 2 | Pacing Tips | Hey y'all! I've just been backpacking in Mexic... | backpacking | 1646359058 |
| 3 | Fes Morocco 5 hr layover | [removed] | backpacking | 1646358302 |
| 4 | Hi all, I have a favor to ask. I designed a mo... | | backpacking | 1646356995 |
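
The `created_utc` values are Unix epoch seconds, which is also what the scraper passes back as the `before` cursor. As a quick sanity check on how far back a scrape reaches, they can be converted to readable UTC datetimes. This is a minimal sketch using only the standard library; the helper name `utc_from_epoch` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone


def utc_from_epoch(created_utc):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(created_utc), tz=timezone.utc)


# Timestamp of the first row shown in the preview
first_post = utc_from_epoch(1646363240)
print(first_post.isoformat())  # 2022-03-04T03:07:20+00:00
```

With the full DataFrame, `pd.to_datetime(df['created_utc'], unit='s', utc=True)` should perform the same conversion for the whole column.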


In [5]:
# write df to CSV
df.to_csv('../raw_data/reddit.csv', index=False)