# Gathering Posts From the Subreddit ShowerThoughts

In [1]:
#import useful libraries
import pandas as pd   
import requests  
import time
import datetime

## Objective


- In this notebook I gather posts from the ShowerThoughts subreddit using the Pushshift API.
- Each request returns at most 100 posts, so a function is needed that loops over requests, with a 30-second timer between them so that I don't get kicked out of the Pushshift API.
- The posts are added to a DataFrame containing the author, title, subreddit, and creation time of each post, and then saved to a CSV file.
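
The looping approach described above can be sketched independently of the network call. This is a minimal sketch, not the function used later in the notebook: `collect_posts`, `fetch_page`, and `pause` are hypothetical names, and `fetch_page` stands in for whatever performs the real request. It assumes each post is a dict with a `created_utc` key.

```python
import time


def collect_posts(fetch_page, start_before, max_pages=90, wait_seconds=30, pause=time.sleep):
    """Page backwards through posts, using the oldest timestamp as the next cursor.

    fetch_page(before) is a hypothetical callable returning a list of post
    dicts (each with a 'created_utc' key) created before the given timestamp.
    """
    posts = []
    before = start_before
    for page in range(max_pages):
        batch = fetch_page(before)
        if not batch:
            # nothing older is available, so stop early
            break
        posts.extend(batch)
        # the oldest post in this batch becomes the cursor for the next request
        before = min(post['created_utc'] for post in batch)
        if page < max_pages - 1:
            # wait between requests to stay under the rate limit
            pause(wait_seconds)
    return posts
```

Passing `pause` and `fetch_page` as arguments is a design choice made here so the loop can be tried with stand-in functions before any real requests are sent.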

In [2]:
#create empty list for posts
posts = []

#create a list of post times, seeded with a starting timestamp, so each request only asks for older posts and we don't get duplicates
skip = [1600460524]
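
The starting value in `skip` is a Unix timestamp in seconds. As a sketch of how such a value could be derived from a readable date with the `datetime` module, the helper below converts a timezone-aware datetime to epoch seconds. The name `to_epoch` is hypothetical, and treating the date as UTC is an assumption based on the `created_utc` field name.

```python
import datetime


def to_epoch(moment):
    """Convert a timezone-aware datetime to whole Unix seconds."""
    return int(moment.timestamp())


# 2020-09-18 20:22:04 UTC corresponds to the seed value used in skip
start = datetime.datetime(2020, 9, 18, 20, 22, 4, tzinfo=datetime.timezone.utc)
print(to_epoch(start))  # 1600460524
```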

In [3]:
def post_pulls(subreddit):
    #count to keep track of how many posts we've pulled
    count = 0
    
        #each request returns up to 100 posts, so 90 requests collect up to 9,000 posts
    while count < 90:
        
        #set up search parameters for requests
        pull_params = {'subreddit': subreddit,
                       'size': 100,
                       'sort': 'desc',
                       'aggs': 'created_utc',
                       'before': skip[-1]}
        
        #create url
        url = 'https://api.pushshift.io/reddit/submission/search/'
        
        #get requests
        res = requests.get(url, params=pull_params)
    
        #turn into json dictionary format
        data = res.json()
              
        #add pulls to post list
        posts.extend(data['data'])
    
        #add count
        count += 1
        
        #create dataframe from every post pulled so far
        shower_data = pd.DataFrame(posts)[['author', 'title', 'subreddit', 'created_utc']]

        #drop posts with repeated titles
        result = shower_data.drop_duplicates(subset='title')

        #save updated shower_data file
        result.to_csv('./data/shower_data.csv', index=False)

        #get the oldest post time pulled so far and append it to skip,
        #so the next request only asks for posts before it
        skip.append(result['created_utc'].min())
            
        print(f'This is pull {count} out of 90')
    
        #set sleep timer for 30 seconds so I don't get banned
        time.sleep(30)
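
The loop above assumes every request succeeds and returns JSON with a `data` key. A hedged sketch of a more defensive fetch step follows. `fetch_batch` is a hypothetical helper, not part of the notebook, and `get` is assumed to behave like `requests.get`, returning an object with `status_code` and `json()`.

```python
import time


def fetch_batch(get, url, params, retries=3, wait_seconds=30, pause=time.sleep):
    """Return the list under 'data' from a JSON response, retrying on failure.

    get is assumed to behave like requests.get: it accepts a URL and a
    params dict and returns an object with status_code and json().
    """
    for attempt in range(1, retries + 1):
        res = get(url, params=params)
        if res.status_code == 200:
            # a missing 'data' key is treated as an empty batch
            return res.json().get('data', [])
        if attempt < retries:
            # wait before trying again
            pause(wait_seconds)
    raise RuntimeError(f'request failed after {retries} attempts')
```

Inside the loop this could be called as `fetch_batch(requests.get, url, pull_params)` in place of the separate `requests.get` and `res.json()` steps.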

In [4]:
post_pulls('Showerthoughts')

This is pull 1 out of 90
This is pull 2 out of 90
This is pull 3 out of 90
This is pull 4 out of 90
This is pull 5 out of 90
This is pull 6 out of 90
This is pull 7 out of 90
This is pull 8 out of 90
This is pull 9 out of 90
This is pull 10 out of 90
This is pull 11 out of 90
This is pull 12 out of 90
This is pull 13 out of 90
This is pull 14 out of 90
This is pull 15 out of 90
This is pull 16 out of 90
This is pull 17 out of 90
This is pull 18 out of 90
This is pull 19 out of 90
This is pull 20 out of 90
This is pull 21 out of 90
This is pull 22 out of 90
This is pull 23 out of 90
This is pull 24 out of 90
This is pull 25 out of 90
This is pull 26 out of 90
This is pull 27 out of 90
This is pull 28 out of 90
This is pull 29 out of 90
This is pull 30 out of 90
This is pull 31 out of 90
This is pull 32 out of 90
This is pull 33 out of 90
This is pull 34 out of 90
This is pull 35 out of 90
This is pull 36 out of 90
This is pull 37 out of 90
This is pull 38 out of 90
This is pull 39 out o

In [5]:
#call in data frame to make sure it saved correctly
shower_data = pd.read_csv('./data/shower_data.csv')

In [6]:
shower_data

Unnamed: 0,author,title,subreddit,created_utc
0,dadobis,There were probably a lot of singers talented ...,Showerthoughts,1600460465
1,SueMe-YouWont,"Sometimes I really miss college, then I get ou...",Showerthoughts,1600460451
2,JohnBrambleberry,"Someone, at some point, gave the first blowjob...",Showerthoughts,1600460424
3,jdmlover2009,A.B💿.E.F.G,Showerthoughts,1600460371
4,0Iriss0,The first person to find a cat must have been ...,Showerthoughts,1600460368
...,...,...,...,...
8851,ArSeeFurtyFree,We think about aliens as being hugely advanced...,Showerthoughts,1600106046
8852,Tanjarts,"We will either find intelligent life, or forev...",Showerthoughts,1600106042
8853,Lum1nar,A trillion years ago a plant died and now it’s...,Showerthoughts,1600106019
8854,Lum1nar,A kazillion years ago a plant died and now it’...,Showerthoughts,1600105982


In [7]:
shower_data.duplicated().sum()

0
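
`duplicated()` with no arguments only flags rows that match on every column. Since the function deduplicates on `title`, a check restricted to that column, together with a conversion of `created_utc` to readable dates, may be more informative. The frame below is a small made-up sample with the same column names, not data from the CSV.

```python
import pandas as pd

# made-up sample rows with the same columns as shower_data
sample = pd.DataFrame({
    'author': ['a', 'b', 'c'],
    'title': ['first thought', 'second thought', 'first thought'],
    'subreddit': ['Showerthoughts'] * 3,
    'created_utc': [1600460465, 1600460451, 1600460424],
})

# count rows that repeat across every column
full_row_dupes = sample.duplicated().sum()

# count rows whose title already appeared in an earlier row
title_dupes = sample.duplicated(subset='title').sum()

# convert epoch seconds to timezone-aware timestamps
sample['created'] = pd.to_datetime(sample['created_utc'], unit='s', utc=True)

print(full_row_dupes, title_dupes)
print(sample['created'].min(), sample['created'].max())
```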