# 1 - Project 3 Data Collection
This is the first of a series of notebooks for this project.

Note that subsequent project notebooks will refer to this notebook.  Because running this notebook collects data for a fixed point in time, and could potentially overwrite data collected previously, it will be treated as a stand-alone notebook.  Timestamps will be saved to a text file so that this code can be modified to duplicate the run shown in this notebook.

# Initial Package Imports

In [3]:
import requests
import pandas as pd

# Source for getting the current Unix timestamp for the pushshift API:
#  https://stackoverflow.com/questions/16755394/what-is-the-easiest-way-to-get-current-gmt-time-in-unix-timestamp-format
import time
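
As a small aside, the sketch below shows how the Unix timestamps used throughout this notebook relate to a readable UTC date. The `to_utc_string` helper is hypothetical (it is not part of the project code), and the example value is taken from the `retrieved_utc` column of the worldnews output at the end of this notebook.

```python
import time
from datetime import datetime, timezone


def to_utc_string(unix_seconds):
    """Hypothetical helper: format a Unix timestamp (seconds) as a UTC string."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


# Current time as whole seconds since the Unix epoch (an int in Python 3)
right_now = round(time.time())
print(right_now, to_utc_string(right_now))

# A fixed value from the notebook output, for reference
print(to_utc_string(1682378382))  # 2023-04-24 23:19:42
```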

# Define Data Gathering Function
This function accomplishes the following:
* Connects to the internet
* Retrieves data with the pushshift API
* Gathers data in increments of approximately 1000 submissions, known as 'trials'
* Prints the HTTP status code returned for each trial
* Creates timestamps for when the data are pulled and for the time before which data are collected through the API in each trial (the 'before' parameter)
    * The pushshift API has a 'before' parameter, which is redefined at each trial to be the Unix timestamp of the oldest submission in the previous trial.  By default, pushshift pulls the most recent submissions, and it returns only about 1000 submissions at a time.  To get more than 1000 submissions, the 'before' parameter must be set at each trial to collect submissions prior to the oldest previously gathered submission
* Saves these timestamps to a text file so that they can be referenced if the data need to be replicated
    * The text file is named with the 'name' parameter and the Unix timestamp at which the data were pulled, to automatically differentiate these data from any prior or subsequent calls of this function to the same subreddit
* Saves a dataframe, which concatenates the data from all trials, to a CSV file
    * The CSV file is named with the 'name' parameter and the Unix timestamp at which the data were pulled, to automatically differentiate these data from any prior or subsequent calls of this function to the same subreddit
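
The 'before' logic described above can be sketched without any network access. This is a minimal illustration, not the function used in this notebook: `collect_pages` and `fake_fetch` are hypothetical names, `fake_fetch` stands in for the API with made-up timestamps 10 down to 1, and the sketch assumes that results arrive newest first and that 'before' excludes the given timestamp.

```python
def collect_pages(fetch_page, trials):
    """Collect up to `trials` pages, moving `before` back to the oldest item of each page."""
    collected = []
    before = None  # None means "start from the most recent submissions"
    for _ in range(trials):
        page = fetch_page(before)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        before = page[-1]['created_utc']  # last item is the oldest (newest-first order)
    return collected


def fake_fetch(before, page_size=3):
    """Hypothetical stand-in for the API: submissions created at times 10, 9, ..., 1."""
    newest = 10 if before is None else before - 1
    oldest = max(newest - page_size + 1, 1)
    return [{'created_utc': t} for t in range(newest, oldest - 1, -1)]


submissions = collect_pages(fake_fetch, trials=2)
print([s['created_utc'] for s in submissions])  # [10, 9, 8, 7, 6, 5]
```

With a page size of 3 and 2 trials, the second page starts just below the oldest timestamp of the first page, so no submission is collected twice.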

In [4]:
def data_getter(subreddit, trials, name):
#=====  INITIAL LOCAL VARIABLES  =================================================================================
    
    # Establish base url:
    url_fnc = 'https://api.pushshift.io/reddit/search/submission'
    
    # Establish initial parameters for most current subreddit pull:
    params_getter = {
    'subreddit': subreddit,
    'size': 1000,
    }
    
    # Create an empty dataframe to which we can concatenate each run
    master_df = pd.DataFrame()
    
    # Create the first instance of the timestamp list, and establish current time
    right_now = round(time.time())
    pull_times = [right_now]
    
#=====  Gathering the data from reddit/pushshift // Create DataFrame // Concatenate to Master  ===================
    
    # Get the data:
    res_fnc = requests.get(url_fnc, params_getter)
    print(res_fnc.status_code) # for debugging, prints the website status code to show function progress
    
    # Make dataframe:
    df = res_fnc.json() # Parse the JSON response into a dictionary
    df = pd.DataFrame(df['data']) # Build a dataframe from the list of submissions under the 'data' key
    
    # Concatenate to master
    master_df = pd.concat([master_df, df])

#=====  Iterate the above steps over remaining trials  ===========================================================
    for i in range(0, (trials - 1)): # Establishes a for loop to iterate over the remaining trials (minus the first one)
        
        # Update the before parameter to be the created time (in utc) of the last item in the previous trial's dataframe
        params_getter = {
        'subreddit': subreddit,
        'size': 1000,
        'before': int(df['created_utc'].iloc[-1])
        }
        
        # Get the data:
        res_fnc = requests.get(url_fnc, params_getter)
        print(res_fnc.status_code) # for debugging

        # Make dataframe:
        df = res_fnc.json() # Parse the JSON response into a dictionary
        df = pd.DataFrame(df['data']) # Build a dataframe from the list of submissions under the 'data' key

        # Concatenate to master
        master_df = pd.concat([master_df, df])
        
        # Add the oldest created time (in utc) from this trial to the pull_times list:
        pull_times.append(int(df['created_utc'].iloc[-1]))
        
#=====  Create a text file with all the pull times for replicability =====================================================================
    # Source inspiring this code:  https://www.guru99.com/reading-and-writing-files-in-python.html
    # The with-block closes the file automatically, even if the write fails
    with open(f'../data/{name}_pulltimes_{right_now}.txt', 'w') as f:
        f.write(f'{pull_times}')

#=====  Finally, return the fully concatenated master dataframe, reset index, store to csv  =================================================
    # reset_index returns a new dataframe, so the result must be assigned back
    # Source for refresher on how to use this:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
    master_df = master_df.reset_index(drop=True)
    master_df.to_csv(f'../data/{name}_{right_now}.csv')
    return master_df.head()
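
Because the pull times are written as the string form of a Python list, they can be read back with `ast.literal_eval` from the standard library. The sketch below is one possible approach, assuming the file contains exactly what `f.write(f'{pull_times}')` produced with plain Python integers. `read_pull_times` is a hypothetical helper, and the file name and timestamps are made up for illustration.

```python
import ast
import os
import tempfile


def read_pull_times(path):
    """Hypothetical helper: parse a saved pull-times file back into a list of ints."""
    with open(path) as f:
        return ast.literal_eval(f.read())


# Made-up timestamps: run start time first, then older 'before' values
example_times = [1682378400, 1682300000, 1682200000]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example_pulltimes_1682378400.txt')
    # Write the list the same way data_getter does
    with open(path, 'w') as f:
        f.write(f'{example_times}')
    restored = read_pull_times(path)

print(restored)  # [1682378400, 1682300000, 1682200000]
```

The first entry is the time the run started and the remaining entries are the oldest `created_utc` values of the later trials, so they can serve as reference points for 'before' when attempting to re-run a pull.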

# Gather Data for TheOnion and WorldNews Subreddits

In [5]:
data_getter('theonion', 6, 'theonion')

200
200
200
200
200
200


Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,is_gallery,media_metadata,gallery_data,crosspost_parent_list,crosspost_parent,author_created_utc,retrieved_on,call_to_action,author_cakeday,removal_reason
0,TheOnion,,t2_4a27h,0,Idiot Tornado Tears Harmlessly Through Empty F...,[],r/TheOnion,False,6,,...,,,,,,,,,,
1,TheOnion,,t2_3jamc,0,New Texas Law Requires Schools To Display Imag...,[],r/TheOnion,False,6,,...,,,,,,,,,,
2,TheOnion,,t2_3jamc,0,New Poll Finds Americans Would Respect Biden M...,[],r/TheOnion,False,6,,...,,,,,,,,,,
3,TheOnion,,t2_3jamc,0,Could You Pass Racial Discrimination Training ...,[],r/TheOnion,False,6,,...,,,,,,,,,,
4,TheOnion,,t2_3jamc,0,Dog And Owner Having Public Fight,[],r/TheOnion,False,6,,...,,,,,,,,,,


In [6]:
data_getter('worldnews', 6, 'worldnews')

200
200
200
200
200
200


Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,crosspost_parent_list,crosspost_parent,author_cakeday,link_flair_template_id
0,worldnews,,t2_8q2g97db4,0,The parents of a 10-year-old boy living with a...,[],r/worldnews,False,6,,...,0,,False,1682378382,1682378383,2023-04-24 23:19:26,,,,
1,worldnews,,t2_8q2g97db4,0,The parents of a 10-year-old boy living with a...,[],r/worldnews,False,6,,...,0,,False,1682378313,1682378314,2023-04-24 23:18:19,,,,
2,worldnews,,t2_dss8b,0,Mexico finds tons of liquid meth in tequila bo...,[],r/worldnews,False,6,,...,0,,False,1682377268,1682377268,2023-04-24 23:00:56,,,,
3,worldnews,,t2_9xhkarmen,0,"Tucker Carlson Leaving Fox News, Last Episode ...","[{'e': 'text', 't': 'Not Appropriate Subreddit'}]",r/worldnews,False,6,normal,...,0,,False,1682377130,1682377131,2023-04-24 22:58:38,,,,
4,worldnews,,t2_2aex0igh,0,Film explores B.C. woman’s experience with mag...,[],r/worldnews,False,6,,...,0,,False,1682376678,1682376679,2023-04-24 22:51:04,,,,
