## 01_Data Collection

This Notebook collects data using the Pushshift API to scrape submission posts and comments from four Reddit subreddits: r/SelfDrivingCars, r/Futurology, r/Technology, and r/Artificial.  The scraped data is converted from JSON dictionary format to pandas dataframes.

### Contents:
- [Import Packages](#Import-Packages)
- [Reddit urls and name codes for dataframes](#Reddit-urls-and-name-codes-for-dataframes)
- [Pushshift API Scrape to Dataframe Functions](#Pushift-API-Scrape-to-Dataframe-Functions)
    - [Submission Scraping Functions](#Submission-Scraping-Functions)
    - [Comment Scraping Functions](#Comment-Scraping-Functions)
    - [Loop Functions for Multiple API calls](#Loop-Functions-for-Multiple-API-calls)
- [Submissions API Calls](#Submissions-API-Calls)
- [Comments API Calls](#Comments-API-Calls)
- [Search 'Self-driving' Subreddit Scraper Function](#Search-'Self-driving'-Subreddit-Scraper-Function)
- [Search 'Self-driving' Scraping Function Calls](#Search-'Self-driving'-Scraping-Function-Calls) <br>
    - [r/selfdrivingcars Search 'self-driving' Scrapes](#r/selfdrivingcars-Search-'self-driving'-Scrapes)
    - [r/Technology Search 'self-driving' Scrapes](#r/Technology-Search-'self-driving'-Scrapes)
    - [r/Futurology Search 'self-driving' Scrapes](#r/Futurology-Search-'self-driving'-Scrapes)
    - [r/Artificial Search 'self-driving' Scrapes](#r/Artificial-Search-'self-driving'-Scrapes)
- [Export Dataframes to CSV](#Export-Dataframes-to-CSV) <br>
- [Notebook Summary](#Notebook-Summary) <br>

### Import Packages

In [3]:
import pandas as pd
import requests
import time
from time import sleep
import datetime


### Reddit urls and name codes for dataframes

1. Self-driving Cars
'https://www.reddit.com/r/SelfDrivingCars/'
df name: sdc 

2. Futurology
'https://www.reddit.com/r/Futurology/'
df name: fut

3. Technology
'https://www.reddit.com/r/technology/'
df name: tech

4. Artificial
'https://www.reddit.com/r/artificial/'
df name: ai


Data to be Scraped

1. SelfDrivingCars: Submissions
2. SelfDrivingCars: Comments
3. Futurology: Submissions
4. Futurology: Comments
5. Technology: Submissions
6. Technology: Comments
7. Artificial: Submissions
8. Artificial: Comments


### Pushift API Scrape to Dataframe Functions

#### Submission Scraping Functions

In [7]:
# Function makes an API call to the given subreddit (primarily
# r/selfdrivingcars) and returns a dataframe of the submissions
# pulled, along with the earliest and latest utc timestamps.

# Function takes following arguments: 
#       - qty: desired quantity of posts to pull (max 1000)
#       - subred: subreddit name
#       - call_ct: sequence count of api call for respective subreddit
#       - first_utc: earliest utc time from the previous api call, or a
#         reference timestamp when scraping an alternate subreddit
#       - ctrl: 1 for selfdrivingcars, 2 for all other subreddits. 

def sub_to_df(qty, subred, call_ct, first_utc, ctrl): 
    
    # base url for submissions/posts
    base = 'https://api.pushshift.io/reddit/search/submission/'
    
    # quantity of posts pulled in api call
    size = '&size=' + str(qty)
    
    # Ctrl sets a control parameter for time. For Selfdrivingcars, 
    # we want to use before last utc, for all other subreddits, 
    # we use the selfdrivingcars timestamp and the after parameter
    # to get posts/comments from the same approximate time period.  
    
    if ctrl == 1:
        time = '&before='+str(first_utc)
    else:
        time = '&after='+str(first_utc)
    
    # test if this is first api call or not. if not first, add the before/after utc time parameter 
    if call_ct == 1:
        #concatenated url
        url = base + '?subreddit=' + subred + size
    else: 
        url = base + '?subreddit=' + subred + time + size
    
    #api request
    res = requests.get(url)
    
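    # check that the api call was successful and assign to data;
    # stop here on failure so a missing or stale `data` is never used
    if res.status_code == 200:
        data = res.json()['data']
    else:
        raise RuntimeError(f'API call failed with status code {res.status_code}')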
       
    # define list of features to grab from each post
    keys = ['author','created_utc','id','full_link','num_comments',
            'subreddit','subreddit_id','title']
    
    # create empty list to hold all subreddit posts pulled in the api call 
    sub_list =[]
    
    # Loop through each dictionary in json list
    for i in range(len(data)): 
        sub_dict = {}  # create empty dictionary to store submissions
    
        # loop through list of pre-selected keys, or features, defined by 'keys' above
        for j in range(len(keys)):
            
            #Assign each data key and value pairs to the sub dict
            sub_dict[keys[j]] = data[i][keys[j]]
    
        # append each submission dict to sub_list
        sub_list.append(sub_dict)
    
    # Create dataframe from list of submission dicts
    sub_df = pd.DataFrame(sub_list)
    
    # Get earliest and latest post utc times
    first_utc = sub_df['created_utc'].sort_values().iloc[0]
    last_utc =  sub_df['created_utc'].sort_values(ascending=False).iloc[0]
    
    # convert Epoch to standard date format (local time)
    first_pst = datetime.datetime.fromtimestamp(first_utc).strftime('%c')
    last_pst = datetime.datetime.fromtimestamp(last_utc).strftime('%c')
  
    s = sub_df.shape[0]
    
    print(f'Dataframe contains {s} posts from {first_pst} to {last_pst}')
    return sub_df, first_utc, last_utc
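
The URL assembly in `sub_to_df` could also be written with `urllib.parse.urlencode`, which handles the `?` and `&` separators and any escaping. The following is a minimal sketch under assumptions: the helper name `build_url` and its `direction` argument are hypothetical and not part of the notebook, and the timestamp in the second call is an arbitrary illustrative value.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook; the helper name and signature are hypothetical.
BASE = 'https://api.pushshift.io/reddit/search/submission/'

def build_url(subred, qty, utc=None, direction='before'):
    """Return a search URL, adding a time filter only when utc is given."""
    params = {'subreddit': subred, 'size': qty}
    if utc is not None:
        # direction is 'before' (walk backward) or 'after' (walk forward)
        params[direction] = utc
    return BASE + '?' + urlencode(params)

# First call: no time filter
print(build_url('selfdrivingcars', 1000))

# Later call for an alternate subreddit (arbitrary illustrative timestamp)
print(build_url('futurology', 1000, utc=1500000000, direction='after'))
```

Building the query from a dictionary keeps the parameters in one place, so adding or dropping a filter does not require editing several string fragments.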


In [8]:
# Function makes API call to alternate subreddit and 
# returns a dataframe of the submissions/posts along with their
# earliest and latest utc times.
# Function takes following arguments: 
#       - desired quantity of posts to pull (max 1000)
#       - subreddit name
#       - sequence count of api call for respective subreddit
#       - last utc time from previous api call

def alt_sub_to_df(qty, subred, call_ct, last_utc): 
    
    # base url for submissions/posts
    base = 'https://api.pushshift.io/reddit/search/submission/'
    
    # quantity of posts pulled in api call
    size = '&size=' + str(qty)
    
    time = '&after='+str(last_utc)
    
    # test if this is first api call or not. if not first, add the after utc time parameter 
    if call_ct == 1:
        #concatenated url
        url = base + '?subreddit=' + subred + size
    else: 
        url = base + '?subreddit=' + subred + time + size
    
    #api request
    res = requests.get(url)
    
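    # check that the api call was successful and assign to data;
    # stop here on failure so a missing or stale `data` is never used
    if res.status_code == 200:
        data = res.json()['data']
    else:
        raise RuntimeError(f'API call failed with status code {res.status_code}')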
       
    # define list of features to grab from each post
    keys = ['author','created_utc','id','full_link','num_comments',
            'subreddit','subreddit_id','title']
    
    # create empty list to hold all subreddit posts pulled in the api call 
    sub_list =[]
    
    # Loop through each dictionary in json list
    for i in range(len(data)): 
        sub_dict = {}  # create empty dictionary to store submissions
    
        # loop through list of pre-selected keys, or features, defined by 'keys' above
        for j in range(len(keys)):
            
            #Assign each data key and value pairs to the sub dict
            sub_dict[keys[j]] = data[i][keys[j]]
    
        # append each submission dict to sub_list
        sub_list.append(sub_dict)
    
    # Create dataframe from list of submission dicts
    sub_df = pd.DataFrame(sub_list)
    
    # Get earliest and latest post utc times
    first_utc = sub_df['created_utc'].sort_values().iloc[0]
    last_utc =  sub_df['created_utc'].sort_values(ascending=False).iloc[0]
    
    # convert Epoch to standard date format (local time)
    first_pst = datetime.datetime.fromtimestamp(first_utc).strftime('%c')
    last_pst = datetime.datetime.fromtimestamp(last_utc).strftime('%c')
  
    s = sub_df.shape[0]
    
    print(f'Dataframe contains {s} posts from {first_pst} to {last_pst}')
    return sub_df, first_utc, last_utc
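
The nested loops that copy the selected keys out of each JSON record can be condensed into a comprehension. This is a sketch only: `select_keys` is a hypothetical helper, the sample records are made up, and using `dict.get` (which yields `None` for a missing key rather than raising `KeyError`) is a design choice that differs from the functions above.

```python
# Hypothetical helper; the key names are a subset of the notebook's `keys` list.
KEYS = ['author', 'created_utc', 'id', 'title']

def select_keys(records, keys):
    """Keep only the requested keys; missing keys become None instead of raising KeyError."""
    return [{k: rec.get(k) for k in keys} for rec in records]

# Made-up records standing in for res.json()['data']
sample = [
    {'author': 'a1', 'created_utc': 100, 'id': 'x', 'title': 't1', 'score': 5},
    {'author': 'a2', 'created_utc': 200, 'id': 'y'},  # no 'title'
]

rows = select_keys(sample, KEYS)
print(rows)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` in the same way as `sub_list`.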


#### Comment Scraping Functions

In [13]:
# Function makes API call to defined subreddit comments, returns dataframe
# plus the earliest and latest comment utc times.
# Function takes following arguments: 
#       - desired quantity of comments to pull (max 500)
#       - subreddit name
#       - sequence count of api call for respective subreddit
#       - earliest utc time from previous api call
#       - ctrl: 1 for selfdrivingcars, 2 for all other subreddits

def com_to_df(qty, subred, call_ct, first_utc, ctrl): 
    
    # base url for comments
    base = 'https://api.pushshift.io/reddit/search/comment/'
    
    # quantity of comments pulled in api call
    size = '&size=' + str(qty)

    
    # Ctrl sets a control parameter for time. For Selfdrivingcars, 
    # we want to use before last utc, for all other subreddits, 
    # we use the selfdrivingcars timestamp and the after parameter
    # to get posts/comments from the same approximate time period.  
    
    if ctrl == 1:
        time = '&before='+str(first_utc)
    else:
        time = '&after='+str(first_utc)
    
    # test if this is first api call or not. 
    # if not first, add 'before' utc time parameter 
    if call_ct == 1:
        #concatenated url
        url = base + '?subreddit=' + subred + size
    else: 
        url = base + '?subreddit=' + subred + time + size
    
    #api request
    res = requests.get(url)
    
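    # check that the api call was successful and assign to data;
    # stop here on failure so a missing or stale `data` is never used
    if res.status_code == 200:
        data = res.json()['data']
    else:
        raise RuntimeError(f'API call failed with status code {res.status_code}')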
       
    # define list of features/keys to select from each post
    keys = ['author', 'body', 'created_utc', 'id', 
            'parent_id', 'subreddit', 'subreddit_id']
    
    # create empty list to hold all subreddit comments pulled in the api call 
    com_list =[]
 
    # Loop through each dictionary in json list
    for i in range(len(data)): 
        com_dict = {}  # create empty dictionary to store one comment
    
        # loop through list of pre-selected keys, or features, defined by 'keys' above
        for j in range(len(keys)):
            
            #Assign each data key and value pairs to the com dict
            com_dict[keys[j]] = data[i][keys[j]]
    
        # append each comment dict to com_list
        com_list.append(com_dict)
    
    # Create dataframe from list of comment dicts
    com_df = pd.DataFrame(com_list)
    
    # Get earliest and latest comment utc times
    first_utc = com_df['created_utc'].sort_values().iloc[0]
    last_utc =  com_df['created_utc'].sort_values(ascending=False).iloc[0]
    
    # convert Epoch to standard date format
    first_pst = datetime.datetime.fromtimestamp(first_utc).strftime('%c')
    last_pst = datetime.datetime.fromtimestamp(last_utc).strftime('%c')
    
    s = com_df.shape[0]
  
    print(f'Dataframe contains {s} comments from {first_pst} to {last_pst}')
    return com_df, first_utc, last_utc
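
The print statements in these functions call `datetime.datetime.fromtimestamp` without a timezone, so the dates shown depend on the local timezone of the machine running the notebook. A minimal sketch of an explicit conversion follows; `epoch_to_str` is a hypothetical helper, and the fixed UTC-8 offset is only a stand-in for Pacific Standard Time that ignores daylight saving.

```python
import datetime

def epoch_to_str(utc_seconds, tz=datetime.timezone.utc):
    """Format an epoch timestamp in an explicit timezone (UTC by default)."""
    return datetime.datetime.fromtimestamp(utc_seconds, tz).strftime('%Y-%m-%d %H:%M:%S %Z')

# Fixed UTC-8 offset as a stand-in for Pacific Standard Time (ignores daylight saving)
PST = datetime.timezone(datetime.timedelta(hours=-8), 'PST')

# The same instant rendered in two timezones
print(epoch_to_str(1325376000))
print(epoch_to_str(1325376000, PST))
```

Passing the timezone explicitly makes the printed ranges reproducible across machines.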


#### Loop Functions for Multiple API calls

In [14]:
# Looping function to Call Reddit Submissions API multiple times

def loop_api(func, qty, subred, n, ctrl):
    df, fst_utc, last_utc = func(qty, subred, 1, 'NA', ctrl) 
    api_ct = 1
    df_list = [df]
    fst_utc_list = [fst_utc]
    lst_utc_list = [last_utc]
    
    for i in range(2,n+1): 
        df_x, fst_utc, lst_utc = func(qty, subred, api_ct+1 , fst_utc, ctrl)
        
        df_list.append(df_x)
        fst_utc_list.append(fst_utc)
        lst_utc_list.append(lst_utc)
        
        df = pd.concat([df, df_x], axis = 0)
        
        api_ct += 1
        
        sleep(3)
        
    print(f'{subred} subreddit API called {api_ct} times.')
    print(f'Final Dataframe has {df.shape[0]} rows.')
          
    return df, df_list, fst_utc_list, lst_utc_list
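
The backward paging that `loop_api` performs (each call asks for posts older than the earliest timestamp seen so far) can be illustrated without touching the network. This is a sketch under assumptions: `paginate_before` and `fake_fetch` are hypothetical names, and the stub data simply stands in for API responses.

```python
def paginate_before(fetch, n_calls):
    """Call fetch(before) up to n_calls times, passing the earliest timestamp seen so far."""
    pages, before = [], None
    for _ in range(n_calls):
        page = fetch(before)
        if not page:          # stop early if the source is exhausted
            break
        pages.append(page)
        before = min(rec['created_utc'] for rec in page)
    return pages

# Stand-in for the API: 10 fake posts with timestamps 10..1, newest first
POSTS = [{'created_utc': t} for t in range(10, 0, -1)]

def fake_fetch(before, size=3):
    """Return up to `size` fake posts older than `before` (all posts if before is None)."""
    older = [p for p in POSTS if before is None or p['created_utc'] < before]
    return older[:size]

pages = paginate_before(fake_fetch, 5)
print([[p['created_utc'] for p in page] for page in pages])
```

Stopping on an empty page is an addition for illustration; `loop_api` itself always makes `n` calls.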

In [15]:
# Looping function to Call alternate SubReddits APIs multiple times.

# This function uses the utc timestamps from the SelfdrivingCars API calls 
# to pull data from approximately the same time period.  This is done because
# some subreddits have more posts than others, so the time periods would drift apart if 
# each subreddit were paged using its own consecutive timestamps.
# This must be called after the Selfdrivingcars subreddit scrape to utilize its timestamps.

def loop_alts_api(func, qty, subred, utc_list, n, ctrl):
    df, fst_utc, last_utc = func(qty, subred, 1, utc_list[0], ctrl) 
    api_ct = 1
    df_list = [df]
    fst_utc_list = [fst_utc]
    lst_utc_list = [last_utc]
    
    for i in range(2,n+1): 
        df_x, fst_utc, lst_utc = func(qty, subred, api_ct+1 , utc_list[i-1], ctrl)
        
        df_list.append(df_x)
        fst_utc_list.append(fst_utc)
        lst_utc_list.append(lst_utc)
        
        df = pd.concat([df, df_x], axis = 0)
        
        api_ct += 1
        
        #sleep(3)
        
    print(f'{subred} subreddit API called {api_ct} times.')
    print(f'Final Dataframe has {df.shape[0]} rows.')
          
    return df, df_list, fst_utc_list
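
The idea behind `loop_alts_api`, anchoring each call on a timestamp taken from the reference subreddit rather than on the subreddit's own previous page, can be sketched with a stub. The names `fetch_aligned` and `fake_fetch_after` are hypothetical and the data is made up. Unlike the notebook's functions, which skip the time filter on the first call, this simplified sketch applies it on every call.

```python
def fetch_aligned(fetch_after, ref_utcs):
    """For each reference timestamp, fetch one page of posts created after it."""
    return [fetch_after(utc) for utc in ref_utcs]

# Stand-in for a busier subreddit: one fake post per time unit, 1..100
BUSY = [{'created_utc': t} for t in range(1, 101)]

def fake_fetch_after(after, size=3):
    """Return the first `size` fake posts created after `after`, oldest first."""
    newer = [p for p in BUSY if p['created_utc'] > after]
    return newer[:size]

# Reference timestamps as they would come from the slower subreddit's pages
windows = fetch_aligned(fake_fetch_after, [80, 50, 20])
print([[p['created_utc'] for p in w] for w in windows])
```

Each window starts right after its reference timestamp, so the busier subreddit is sampled from roughly the same periods as the reference one instead of only from its most recent days.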

### Submissions API Calls

In [121]:
# Get 5000 Submissions from Selfdrivingcars subreddit

sdc_subs, sdc_sub_df_list, sdc_sub_first_utc, sdc_sub_last_utc = loop_api(sub_to_df,
                                                                          1000, 'selfdrivingcars', 5, 1)

Dataframe contains 1000 posts from Tue May 14 13:32:16 2019 to Tue Oct 15 12:33:41 2019
Dataframe contains 1000 posts from Thu Jan 17 11:19:18 2019 to Tue May 14 10:57:21 2019
Dataframe contains 1000 posts from Thu Sep 27 13:46:43 2018 to Thu Jan 17 08:27:54 2019
Dataframe contains 1000 posts from Thu May 24 08:01:52 2018 to Thu Sep 27 13:39:18 2018
Dataframe contains 1000 posts from Sat Feb  3 02:01:58 2018 to Thu May 24 07:11:34 2018
selfdrivingcars subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [122]:
# Get 5000 Submissions from Futurology subreddit

fut_subs, fut_sub_df_list, fut_sub_utc_list = loop_alts_api(sub_to_df,
                                                            1000, 'futurology', sdc_sub_first_utc, 5, 2)

Dataframe contains 1000 posts from Fri Oct  4 09:03:32 2019 to Wed Oct 16 00:42:46 2019
Dataframe contains 1000 posts from Thu Jan 17 11:35:50 2019 to Mon Jan 28 08:41:57 2019
Dataframe contains 1000 posts from Thu Sep 27 13:48:07 2018 to Mon Oct  8 02:57:35 2018
Dataframe contains 1000 posts from Thu May 24 08:13:04 2018 to Mon Jun  4 14:36:06 2018
Dataframe contains 1000 posts from Sat Feb  3 02:02:50 2018 to Tue Feb 13 12:42:24 2018
futurology subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [123]:
# Get 5000 Submissions from Technology subreddit

tech_subs, tech_sub_df_list, tech_sub_utc_list = loop_alts_api(sub_to_df, 
                                                               1000, 'technology', sdc_sub_first_utc, 5, 2)

Dataframe contains 1000 posts from Fri Oct 11 09:28:02 2019 to Wed Oct 16 00:45:06 2019
Dataframe contains 1000 posts from Thu Jan 17 11:24:08 2019 to Mon Jan 21 03:12:53 2019
Dataframe contains 1000 posts from Thu Sep 27 13:51:45 2018 to Sun Sep 30 19:26:03 2018
Dataframe contains 1000 posts from Thu May 24 08:06:02 2018 to Sun May 27 10:23:14 2018
Dataframe contains 1000 posts from Sat Feb  3 02:09:15 2018 to Tue Feb  6 01:15:25 2018
technology subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [124]:
# Get 5000 Submissions from Artificial subreddit

ai_subs, ai_sub_df_list, ai_sub_utc_list = loop_alts_api(sub_to_df, 
                                                               1000, 'artificial', sdc_sub_first_utc, 5, 2)

Dataframe contains 1000 posts from Sun Aug 25 12:23:17 2019 to Tue Oct 15 22:18:34 2019
Dataframe contains 1000 posts from Thu Jan 17 12:07:28 2019 to Wed Mar  6 02:20:47 2019
Dataframe contains 1000 posts from Thu Sep 27 13:48:51 2018 to Tue Nov  6 05:55:06 2018
Dataframe contains 1000 posts from Thu May 24 08:32:04 2018 to Wed Jul 11 04:02:21 2018
Dataframe contains 1000 posts from Sat Feb  3 02:58:08 2018 to Wed Mar 28 22:08:57 2018
artificial subreddit API called 5 times.
Final Dataframe has 5000 rows.


### Comments API Calls

In [125]:
# Get 5000 Comments from Selfdrivingcars subreddit

sdc_coms, sdc_com_df_list, sdc_com_first_utc, sdc_com_last_utc = loop_api(com_to_df,
                                                                          1000, 'selfdrivingcars', 5, 1)

Dataframe contains 1000 comments from Tue Oct  1 20:58:59 2019 to Wed Oct 16 00:00:35 2019
Dataframe contains 1000 comments from Tue Sep 17 22:24:41 2019 to Tue Oct  1 20:53:42 2019
Dataframe contains 1000 comments from Tue Sep  3 02:14:54 2019 to Tue Sep 17 18:58:47 2019
Dataframe contains 1000 comments from Fri Aug 16 13:13:49 2019 to Tue Sep  3 02:07:31 2019
Dataframe contains 1000 comments from Tue Aug  6 15:08:52 2019 to Fri Aug 16 13:11:46 2019
selfdrivingcars subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [126]:
# Get 5000 Comments from Futurology subreddit

fut_coms, fut_com_df_list, fut_com_utc_list = loop_alts_api(com_to_df,
                                                            1000, 'futurology', sdc_com_first_utc, 5, 2)

Dataframe contains 1000 comments from Tue Oct 15 14:19:14 2019 to Wed Oct 16 00:45:32 2019
Dataframe contains 1000 comments from Tue Sep 17 22:24:56 2019 to Wed Sep 18 06:49:03 2019
Dataframe contains 1000 comments from Tue Sep  3 02:43:24 2019 to Tue Sep  3 12:23:35 2019
Dataframe contains 1000 comments from Fri Aug 16 13:13:52 2019 to Sat Aug 17 07:55:05 2019
Dataframe contains 1000 comments from Tue Aug  6 15:09:25 2019 to Wed Aug  7 04:31:21 2019
futurology subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [127]:
# Get 5000 Comments from Technology subreddit

tech_coms, tech_com_df_list, tech_com_utc_list = loop_alts_api(com_to_df, 
                                                               1000, 'technology', sdc_com_first_utc, 5, 2)

Dataframe contains 1000 comments from Tue Oct 15 17:32:53 2019 to Wed Oct 16 00:46:46 2019
Dataframe contains 1000 comments from Tue Sep 17 22:26:51 2019 to Wed Sep 18 07:48:56 2019
Dataframe contains 1000 comments from Tue Sep  3 02:16:11 2019 to Tue Sep  3 11:43:40 2019
Dataframe contains 1000 comments from Fri Aug 16 13:14:45 2019 to Fri Aug 16 21:44:15 2019
Dataframe contains 1000 comments from Tue Aug  6 15:08:57 2019 to Wed Aug  7 04:48:53 2019
technology subreddit API called 5 times.
Final Dataframe has 5000 rows.


In [128]:
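# Get 5000 Comments from artificial subreddit
# (subreddit argument corrected to 'artificial'; the recorded run passed 'technology',
# so the output shown for this cell is r/technology data until the cell is rerun)

ai_coms, ai_com_df_list, ai_com_utc_list = loop_alts_api(com_to_df, 
                                                               1000, 'artificial', sdc_com_first_utc, 5, 2)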

Dataframe contains 1000 comments from Tue Oct 15 17:32:53 2019 to Wed Oct 16 00:46:46 2019
Dataframe contains 1000 comments from Tue Sep 17 22:26:51 2019 to Wed Sep 18 07:48:56 2019
Dataframe contains 1000 comments from Tue Sep  3 02:16:11 2019 to Tue Sep  3 11:43:40 2019
Dataframe contains 1000 comments from Fri Aug 16 13:14:45 2019 to Fri Aug 16 21:44:15 2019
Dataframe contains 1000 comments from Tue Aug  6 15:08:57 2019 to Wed Aug  7 04:48:53 2019
technology subreddit API called 5 times.
Final Dataframe has 5000 rows.
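
Because the same loop function is reused for every subreddit with only the subreddit string changing, a quick check on the `subreddit` column can confirm that each dataframe holds what its name implies. The following is a minimal sketch; `check_subreddit` is a hypothetical helper and the two frames are made-up stand-ins for scraped results.

```python
import pandas as pd

def check_subreddit(df, expected):
    """Return True if every row's subreddit matches `expected` (case-insensitive)."""
    found = set(df['subreddit'].str.lower().unique())
    return found == {expected.lower()}

# Made-up frames standing in for scraped results
ok = pd.DataFrame({'subreddit': ['artificial', 'artificial']})
mixed_up = pd.DataFrame({'subreddit': ['technology', 'technology']})

print(check_subreddit(ok, 'artificial'))
print(check_subreddit(mixed_up, 'artificial'))
```

Running such a check on each result before exporting would flag a dataframe that was filled from the wrong subreddit.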


## Search 'Self-Driving' API calls

In [45]:
# Define Lists of Epoch start and end dates. 
# Start dates are either Jan 1, or March 1. 
# End dates are Dec. 31.

yr_08 = [2008, 1199174408, 1230796748]
yr_09 = [2009, 1230796808, 1262289548]
yr_10 = [2010, 1262332808, 1293868748]  
yr_11 = [2011, 1298966408, 1325404748]  

yr_12 = [2012, 1325376000, 1357027148]
yr_13 = [2013, 1357070408, 1388563148]
yr_14 = [2014, 1393704008, 1420099148]
yr_15 = [2015, 1425240008, 1451635148]
yr_16 = [2016, 1456862408, 1483257548]
yr_17 = [2017, 1488398408, 1514793548]
yr_18 = [2018, 1519934408, 1546329548] 
yr_19 = [2019, 1551470408, 1577865548] 

utc_list_lg = [yr_08, yr_09, yr_10, yr_11, 
               yr_12, yr_13, yr_14, yr_15, 
               yr_16, yr_17, yr_18, yr_19] 

utc_list_st = utc_list_lg[4:]

# Subreddit Community start dates: 
# Self-driving subreddit start date: Jun 26, 2012 - 7 years
# Futurology subreddit start date: Dec 12, 2011 - 7.5 years 
# Technology subreddit start date: Jan 25, 2008 - 11 years
# Artificial subreddit start date: Mar 13, 2008 - 11 years 
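
The yearly epoch windows could also be generated rather than typed by hand. This is a sketch, assuming UTC boundaries: `year_bounds` is a hypothetical helper, and because most of the hard-coded values above are offset from UTC midnight by about eight hours, the generated numbers will not match them exactly.

```python
import calendar
import datetime

def year_bounds(year, start_month=1):
    """Return [year, start_epoch, end_epoch] with both bounds computed in UTC."""
    start = calendar.timegm(datetime.datetime(year, start_month, 1).timetuple())
    # one second before the next year begins
    end = calendar.timegm(datetime.datetime(year + 1, 1, 1).timetuple()) - 1
    return [year, start, end]

# Windows for 2012 through 2019, each starting on Jan 1
bounds = [year_bounds(y) for y in range(2012, 2020)]
print(bounds[0])

# A window starting on March 1 instead
print(year_bounds(2011, start_month=3))
```

Generating the windows this way keeps consecutive years from overlapping, since each window ends one second before the next one starts.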


### Search 'Self-driving' Subreddit Scraper Function

In [68]:
# Function that pulls posts with the 'self driving' string in title or comment
# 
# Function takes following arguments: 
#       - quantity of posts to pull (max 1000)
#       - post type: 'submission' or 'comment'
#       - subreddit name
#       - search term (not used in the query, see note below)
#       - epoch start time
#       - epoch end time
#
# Returns dataframe and last utc epoch time

# This function only works for "self-driving" or "self driving", 
# because the search parameter is hardcoded inside the function.


def srch_to_df(qty, typ, subred, search, utc_start, utc_end): 
    
    # base url for search; the post type (submission/comment) is appended below
    base = 'https://api.pushshift.io/reddit/search/'
    
    # quantity of posts pulled in api call
    size = '&size=' + str(qty)
    
    srch_param = '?q='+ '%22self%20driving%22~2'
    
    time = '&after='+ str(utc_start) + '&before=' + str(utc_end)
    
    # concatenated url: post type, search term, subreddit, size and time window
    url = base + typ + '/' + srch_param + '&subreddit=' + subred + size + time
    
    #api request
    res = requests.get(url)
    
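    # check that the api call was successful and assign to data;
    # stop here on failure so a missing or stale `data` is never used
    if res.status_code == 200:
        data = res.json()['data']
    else:
        raise RuntimeError(f'API call failed with status code {res.status_code}')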
    
    # define list of features to grab from each post
    if typ == 'submission':
        # list of features to grab if post type is submissions
        keys = ['author','created_utc','id','full_link','num_comments',
                'subreddit','subreddit_id','title']
    else: 
        # list of features if post type is comment
        keys = ['author', 'body', 'created_utc', 'id', 
                'parent_id', 'subreddit', 'subreddit_id']
        
    # create empty list to hold all subreddit posts pulled in the api call 
    df_list =[]
    
    # Loop through each dictionary in json list
    for i in range(len(data)): 
        df_dict = {}  # create empty dictionary to store submissions
    
        # loop through list of pre-selected keys, or features, defined by 'keys' above
        for j in range(len(keys)):
            
            #Assign each data key and value pairs to the sub dict
            df_dict[keys[j]] = data[i][keys[j]]
            
    
        # append each submission dict to sub_list
        df_list.append(df_dict)
    
    # Create dataframe from list of submission dicts
    df = pd.DataFrame(df_list)
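    # take the year in UTC so a window starting at midnight UTC is not labeled with the previous year
    df['year'] = datetime.datetime.fromtimestamp(utc_start, datetime.timezone.utc).strftime('%Y')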
    
    
    
    # Get earliest and latest post utc times
    first_utc = df['created_utc'].sort_values().iloc[0]
    last_utc =  df['created_utc'].sort_values(ascending=False).iloc[0]
    
    # convert Epoch to standard date format (local time)
    first_pst = datetime.datetime.fromtimestamp(first_utc).strftime('%c')
    last_pst = datetime.datetime.fromtimestamp(last_utc).strftime('%c')
  
    s = df.shape[0]
    
    print(f'Dataframe contains {s} posts from {first_pst} to {last_pst}')
    return df, last_utc
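
The hard-coded `srch_param` string is the percent-encoded form of a quoted phrase followed by `~2`. A sketch of producing the same string from an arbitrary phrase is shown below. `phrase_query` is a hypothetical helper, and reading `~2` as a phrase-proximity modifier is an assumption, not something the notebook states. On Python 3.7 and later, `urllib.parse.quote` leaves `~` unescaped.

```python
from urllib.parse import quote

def phrase_query(phrase, slop=2):
    """Build a quoted-phrase query with a ~slop suffix, percent-encoded for a URL."""
    return '?q=' + quote(f'"{phrase}"~{slop}')

# Reproduces the hard-coded search parameter used in srch_to_df
print(phrase_query('self driving'))

# The same pattern with a different (illustrative) phrase
print(phrase_query('driverless car'))
```

With a helper like this, the `search` argument could drive the query instead of being ignored.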


In [51]:
# Test call the search scraping function (utc_list_lg[1] is the 2009 window)

tech_srch_09, last_utc = srch_to_df(500, 'submission',
                                                'technology','self-driving',
                                                utc_list_lg[1][1], utc_list_lg[1][2])

In [65]:
# Looping function to Call Search Function n-year times

def loop_srch_api(func, qty, typ, subred, search, utc_list, n):
    df, last_utc = func(qty, typ, subred, search, utc_list[0][1], utc_list[0][2]) 
    api_ct = 1
    df_list = [df]  # include the first year's dataframe in the list
    
    for i in range(1,n): 
        df_x, last_utc = func(qty, typ, subred, search, utc_list[i][1], utc_list[i][2])
    
        df_list.append(df_x)
        df = pd.concat([df, df_x], axis = 0)
        
        api_ct += 1
        
        sleep(3)
        
    print(f'{subred} subreddit API called {api_ct} times.')
    print(f'Final {subred} Dataframe has {df.shape[0]} rows with the string: {search}.')
          
    return df, df_list
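
An alternative to concatenating inside the loop is to collect the per-year frames and combine them once at the end. This is a minimal sketch, not the notebook's method: `combine` is a hypothetical helper, the frames are made up, and skipping empty frames, renumbering the index, and passing `sort=False` are illustrative design choices.

```python
import pandas as pd

def combine(frames):
    """Concatenate per-year frames once, skipping empty ones and renumbering rows."""
    non_empty = [f for f in frames if not f.empty]
    if not non_empty:
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True, sort=False)

# Made-up frames; the middle one mimics a year with no matches
frames = [
    pd.DataFrame({'id': ['a', 'b'], 'year': ['2012', '2012']}),
    pd.DataFrame(),
    pd.DataFrame({'id': ['c'], 'year': ['2013']}),
]

combined = combine(frames)
print(combined)
```

A single `pd.concat` call avoids copying the growing dataframe on every iteration, and `ignore_index=True` gives each row a unique index.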

In [20]:
# API parameters pulled out of function for troubleshooting

utc_list = yr_08
base = 'https://api.pushshift.io/reddit/search/'
typ = 'submission'
search = '?q=self-driving'
subred = 'selfdrivingcars'
qty = 10
size = '&size=' + str(qty)
time = '&after='+ str(utc_list[1]) #+ '&before=' + str(utc_list[2])
url = base + typ + '/' + search + '&subreddit=' + subred + size + time
res = requests.get(url)
if res.status_code == 200:
    data = res.json()['data']
else:
    print('Your API call malfunctioned')

## Search 'Self-driving' Scraping Function Calls

### r/selfdrivingcars Search 'self-driving' Scrapes

Note: although this subreddit is devoted to self-driving cars, I scraped only the posts and comments containing 'self-driving' so that the results are consistent with the other subreddit scrapes in this section.

In [21]:
# Submissions with 'Self-driving' in title, Get up to 500 per year

sdc_srch_sub, sdc_sd_sub_list = loop_srch_api(srch_to_df, 500, 'submission',
                                                'selfdrivingcars','self-driving',
                                                utc_list_st, len(utc_list_st))

Dataframe contains 69 posts from Tue Jun 26 09:31:40 2012 to Mon Dec 31 18:15:13 2012
Dataframe contains 284 posts from Sat Jan  5 11:39:50 2013 to Mon Dec 30 11:36:46 2013
Dataframe contains 500 posts from Sat Mar  1 15:37:06 2014 to Mon Dec 29 11:22:34 2014
Dataframe contains 500 posts from Mon Mar  2 05:04:31 2015 to Wed Oct  7 20:00:12 2015
Dataframe contains 500 posts from Tue Mar  1 19:40:39 2016 to Mon Jul 18 08:56:44 2016
Dataframe contains 500 posts from Thu Mar  2 00:12:42 2017 to Wed Jul 19 15:18:41 2017
Dataframe contains 500 posts from Thu Mar  1 12:03:13 2018 to Tue Jul 31 14:48:37 2018
Dataframe contains 500 posts from Fri Mar  1 19:36:09 2019 to Sun Sep 15 15:58:10 2019
selfdrivingcars subreddit API called 8 times.
Final selfdrivingcars Dataframe has 3353 rows with the string: self-driving.


In [22]:
# Comments with 'Self-driving' in body, Get up to 500 per year

sdc_srch_com, sdc_sd_com_list = loop_srch_api(srch_to_df, 500, 'comment',
                                                'selfdrivingcars','self-driving',
                                                utc_list_st, len(utc_list_st))

Dataframe contains 29 posts from Sun Jul  8 12:44:31 2012 to Fri Nov 30 07:55:04 2012
Dataframe contains 500 posts from Sat Jan  5 11:47:28 2013 to Sat Oct 26 13:35:28 2013
Dataframe contains 500 posts from Sat Mar  1 17:10:51 2014 to Fri May  9 16:00:31 2014
Dataframe contains 500 posts from Mon Mar  2 07:08:48 2015 to Sun May 24 22:28:14 2015
Dataframe contains 500 posts from Tue Mar  1 12:13:05 2016 to Tue May 10 16:35:37 2016
Dataframe contains 500 posts from Wed Mar  1 17:35:00 2017 to Sun May 21 12:42:55 2017
Dataframe contains 500 posts from Thu Mar  1 17:32:06 2018 to Thu Apr  5 09:17:57 2018
Dataframe contains 500 posts from Fri Mar  1 12:37:15 2019 to Mon Apr  1 11:35:08 2019
selfdrivingcars subreddit API called 8 times.
Final selfdrivingcars Dataframe has 3529 rows with the string: self-driving.


### r/Technology Search 'self-driving' Scrapes

In [52]:
# Submissions with 'Self-driving' in title, Get up to 500 per year

# Comment out the first_utc lines in the scraping function before making this call
# for r/Technology, which otherwise raises a key error. Most likely a year with no
# matching posts returns an empty dataframe that has no created_utc column.

tech_srch_sub, tech_sd_sub_list = loop_srch_api(srch_to_df, 500, 'submission',
                                                'technology','self-driving',
                                                utc_list_lg, len(utc_list_lg));

technology subreddit API called 12 times.
Final technology Dataframe has 2857 rows with the string: self-driving.
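
Instead of commenting out the `first_utc` lines for subreddits that have years without matches, the timestamp lookup could guard against an empty result. The following is a sketch under that assumption; `utc_range` is a hypothetical helper and the frames are made up.

```python
import pandas as pd

def utc_range(df):
    """Return (first_utc, last_utc), or (None, None) when the frame has no rows."""
    if df.empty or 'created_utc' not in df.columns:
        return None, None
    return df['created_utc'].min(), df['created_utc'].max()

# An empty result has no created_utc column, so indexing it directly would raise KeyError
print(utc_range(pd.DataFrame([])))

# A normal result returns the earliest and latest timestamps
print(utc_range(pd.DataFrame([{'created_utc': 30}, {'created_utc': 10}])))
```

A caller can then skip the date printout when the first value is `None` and still return the (empty) dataframe for that year.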


In [55]:
# Comments with 'Self-Driving' in body, Pull up to 500 per year

tech_srch_com, tech_srch_com_list = loop_srch_api(srch_to_df, 500, 'comment',
                                                'technology','self-driving',
                                                utc_list_lg, len(utc_list_lg))

Dataframe contains 4 posts from Thu Jun 12 19:35:54 2008 to Wed Dec 17 05:26:48 2008
Dataframe contains 14 posts from Sat Jan 17 13:55:18 2009 to Wed Nov 18 01:50:39 2009
Dataframe contains 65 posts from Wed Jan  6 06:20:22 2010 to Sat Dec 11 06:46:36 2010
Dataframe contains 155 posts from Thu Mar  3 17:52:07 2011 to Thu Dec 15 16:48:35 2011
Dataframe contains 500 posts from Sat Jan  7 22:45:42 2012 to Fri Aug 10 00:44:37 2012
Dataframe contains 500 posts from Wed Jan  2 01:43:08 2013 to Mon Apr  8 14:07:23 2013
Dataframe contains 500 posts from Sat Mar  1 12:38:21 2014 to Wed May 21 12:18:36 2014
Dataframe contains 500 posts from Sun Mar  1 12:05:47 2015 to Mon Mar 23 10:36:16 2015
Dataframe contains 500 posts from Tue Mar  1 14:21:12 2016 to Fri May 13 12:48:46 2016
Dataframe contains 500 posts from Wed Mar  1 18:15:13 2017 to Sun Apr  9 09:06:03 2017
Dataframe contains 500 posts from Thu Mar  1 17:10:09 2018 to Fri Apr 27 17:46:36 2018
Dataframe contains 500 posts from Fri Mar  1 12

### r/Futurology Search 'self-driving' Scrapes

In [56]:
# Futurology Submissions with 'Self-driving' in title, Pull up to 500 per year

fut_srch_sub, fut_srch_sub_list = loop_srch_api(srch_to_df, 500, 'submission',
                                                'futurology','self-driving',
                                                utc_list_st, len(utc_list_st))

Dataframe contains 33 posts from Tue May 22 09:52:07 2012 to Mon Dec 31 18:44:01 2012
Dataframe contains 96 posts from Fri Jan  4 17:05:12 2013 to Tue Dec 24 07:29:30 2013
Dataframe contains 242 posts from Mon Mar  3 08:54:44 2014 to Wed Dec 31 12:17:08 2014
Dataframe contains 468 posts from Sun Mar  1 14:22:28 2015 to Wed Dec 30 15:09:43 2015
Dataframe contains 500 posts from Wed Mar  2 22:11:21 2016 to Sun Sep 18 21:34:27 2016
Dataframe contains 500 posts from Wed Mar  1 12:35:41 2017 to Tue Aug 15 02:36:25 2017
Dataframe contains 500 posts from Thu Mar  1 20:32:31 2018 to Tue Aug 21 06:56:15 2018
Dataframe contains 347 posts from Sat Mar  2 09:45:15 2019 to Tue Oct 15 06:12:30 2019
futurology subreddit API called 8 times.
Final futurology Dataframe has 2686 rows with the string: self-driving.


In [69]:
# Futurology Comments with 'Self-driving' in body, Get up to 500 per year

fut_srch_com, fut_srch_com_list = loop_srch_api(srch_to_df, 500, 'comment',
                                                'futurology','self-driving',
                                                utc_list_st, len(utc_list_st))


Dataframe contains 220 posts from Mon Mar  5 15:04:27 2012 to Sun Dec 30 21:44:50 2012
Dataframe contains 500 posts from Wed Jan  2 04:24:48 2013 to Fri Aug  2 00:23:42 2013
Dataframe contains 500 posts from Mon Mar  3 10:25:59 2014 to Wed May 14 17:52:21 2014
Dataframe contains 500 posts from Mon Mar  2 07:22:35 2015 to Wed Mar 18 06:53:20 2015
Dataframe contains 500 posts from Tue Mar  1 12:16:43 2016 to Tue Apr  5 16:42:41 2016
Dataframe contains 500 posts from Wed Mar  1 14:40:31 2017 to Sat Mar 25 11:12:55 2017
Dataframe contains 500 posts from Fri Mar  2 05:32:13 2018 to Mon Mar 19 23:40:28 2018
Dataframe contains 500 posts from Fri Mar  1 12:03:12 2019 to Mon Apr 22 21:54:37 2019
futurology subreddit API called 8 times.
Final futurology Dataframe has 3720 rows with the string: self-driving.


### r/Artificial Search 'self-driving' Scrapes

In [60]:
# Artificial Submissions with 'Self-Driving' in title, Pull up to 500 per year

# Comment out first_utc lines in Api scrape Function prior to calling

ai_srch_sub, ai_srch_sub_list = loop_srch_api(srch_to_df, 500, 'submission',
                                                'artificial','self-driving',
                                                utc_list_lg, len(utc_list_lg))

artificial subreddit API called 12 times.
Final artificial Dataframe has 176 rows with the string: self-driving.


In [64]:
# Artificial Comments with 'Self-Driving' in body, Pull up to 500 per year

# Comment out first_utc lines in Api scrape Function prior to calling

ai_srch_com, ai_srch_com_list = loop_srch_api(srch_to_df, 500, 'comment',
                                                'artificial','self-driving',
                                                utc_list_lg, len(utc_list_lg))

artificial subreddit API called 12 times.
Final artificial Dataframe has 423 rows with the string: self-driving.


### Export Dataframes to CSV

#### Set 1: Recent 5000 posts in each subreddit

In [178]:
# Self-driving Submissions
sdc_subs.to_csv('./datasets/selfdriving_subs.csv')

# Futurology Submissions
fut_subs.to_csv('./datasets/future_subs.csv')

# Technology Submissions
tech_subs.to_csv('./datasets/tech_subs.csv')

# Artificial Submissions
ai_subs.to_csv('./datasets/ai_subs.csv')

# Self-driving Comments
sdc_coms.to_csv('./datasets/selfdriving_coms.csv')

# Futurology Comments
fut_coms.to_csv('./datasets/future_coms.csv')

# Technology Comments
tech_coms.to_csv('./datasets/tech_coms.csv')

# Artificial Comments
ai_coms.to_csv('./datasets/ai_coms.csv')


#### Set 2: Posts from full Lifespan of Subreddit with 'self-driving' in text

In [70]:
# Self-driving Submissions
sdc_srch_sub.to_csv('./datasets/selfdriving_srch_sub.csv')

# Futurology Submissions
fut_srch_sub.to_csv('./datasets/future_srch_sub.csv')

# Technology Submissions
tech_srch_sub.to_csv('./datasets/tech_srch_sub.csv')

# Artificial Submissions
ai_srch_sub.to_csv('./datasets/ai_sub.csv')

# Self-driving Comments
sdc_srch_com.to_csv('./datasets/selfdriving_srch_com.csv')

# Futurology Comments
fut_srch_com.to_csv('./datasets/future_com.csv')

# Technology Comments
tech_srch_com.to_csv('./datasets/tech_com.csv')

# Artificial Comments
ai_srch_com.to_csv('./datasets/ai_srch_com.csv')
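
The dataframes above are built by stacking several API pages with `pd.concat`, so their index repeats, and `to_csv` writes that index as an extra unnamed column by default. A minimal sketch of exporting without it follows. The frame is made up, and an in-memory buffer stands in for a file under `./datasets/`.

```python
import io
import pandas as pd

# Made-up frame with a repeated index, as produced by stacking several API pages
df = pd.concat([pd.DataFrame({'id': ['a', 'b']}), pd.DataFrame({'id': ['c']})])

buf = io.StringIO()          # in-memory stand-in for a file under ./datasets/
df.to_csv(buf, index=False)  # drop the repeated index instead of saving it
buf.seek(0)

# Reading the data back yields only the original columns
restored = pd.read_csv(buf)
print(restored)
```

Whether to drop the index depends on how the following notebooks read these files, so this is an option to consider rather than a required change.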

### Notebook Summary

In this Notebook we used the Pushshift API to scrape 5000 data points from each of the Submissions and Comments sections of four subreddits.  We then scraped up to 500 additional Submissions and up to 500 additional Comments containing 'self-driving' for each year of the four subreddits' lifespans.  After scraping, we converted the data to pandas dataframes and exported them to CSV for cleaning and NLP Classification in the following Notebooks.