# Using the Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It supports higher-level search and querying of Reddit comments and submissions, which facilitates data collection for analysis and modeling. This notebook uses the requests library to query the API, which returns a JSON response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift Subreddit: https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/


In [1]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query URL to the Pushshift API search endpoint.

In [2]:
url = 'https://api.pushshift.io/reddit/search/'

Setting the parameters for the query. A full list of parameters can be found at: https://pushshift.io/api-parameters/

In [3]:
params = {'searchType':'submission',
          'subreddit':'conservative,libertarian',
          'sort':'desc',
          'size':10,
#           'before':,
#           'after':,
         }
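
The commented-out `before` and `after` keys take epoch timestamps in seconds, the same unit as `created_utc` in the response. As a minimal sketch, assuming a one-day window on 2019-03-27 UTC (an illustrative date, not taken from the original pull), the bounds could be computed like this. `to_epoch` and `window_params` are hypothetical names introduced for the example.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Return the Unix timestamp (whole seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Illustrative window: everything created on 2019-03-27 (UTC).
window_params = {'searchType': 'submission',
                 'subreddit': 'conservative,libertarian',
                 'sort': 'desc',
                 'size': 10,
                 'after': to_epoch(2019, 3, 27),
                 'before': to_epoch(2019, 3, 28),
                 }

print(window_params['after'], window_params['before'])
```

Building the timestamps from timezone-aware dates avoids an offset from the local clock, since `created_utc` is recorded in UTC.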

Making the request.

In [4]:
response = requests.get(url, params=params)
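
It can help to look at the URL that a set of parameters produces before relying on the results. The sketch below uses only the standard library; `preview_url` is a hypothetical helper, and its output approximates the URL that requests builds rather than being guaranteed to match it character for character. After a real request, `response.url` shows the URL that was actually sent.

```python
from urllib.parse import urlencode

def preview_url(base_url, params):
    """Build the query URL, skipping parameters whose value is None."""
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return f'{base_url}?{query}'

example_params = {'searchType': 'submission',
                  'subreddit': 'conservative,libertarian',
                  'sort': 'desc',
                  'size': 10,
                  'before': None,  # unset bounds are left out of the URL
                  }

print(preview_url('https://api.pushshift.io/reddit/search/', example_params))
```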

Checking the status code to make sure the request was accepted and the server is responsive.

In [5]:
response.status_code

200

The status code returned from the server is 200, meaning the query was accepted and there aren't any connection issues. Checking the number of records in the JSON response.

In [6]:
len(response.json()['data'])

10

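The length is 10, as expected. Note that the records returned are comments (they carry `body`, `link_id`, and `parent_id`) even though `searchType` was set to 'submission'. Assessing the record structure for keys of interest.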

In [13]:
response.json()['data']

[{'author': '[deleted]',
  'author_flair_background_color': '',
  'author_flair_css_class': None,
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': 'dark',
  'body': '[removed]',
  'created_utc': 1553727783,
  'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
  'id': 'ejj6hw8',
  'link_id': 't3_b650hr',
  'no_follow': True,
  'parent_id': 't3_b650hr',
  'permalink': '/r/Conservative/comments/b650hr/funny_how_it_is_conservatives_who_must_accept/ejj6hw8/',
  'retrieved_on': 1553727784,
  'score': 1,
  'send_replies': True,
  'stickied': False,
  'subreddit': 'Conservative',
  'subreddit_id': 't5_2qh6p'},
 {'author': 'yourmothersbhole',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_2tz0m7s4',
  'author_patreon_flair': False,
  'body'

Keys of interest are:
- author
- body
- created_utc
- link_id
- parent_id
- permalink
- subreddit
- subreddit_id

In [11]:
col_list = ['author',
            'body',
            'subreddit',
            'subreddit_id',
            'created_utc',
            'retrieved_on',
            'link_id',
            'parent_id',
            'permalink',
            ]
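
The records in the response do not all carry the same keys (the first record above has no `author_flair_richtext`, while the second does), so a lookup that tolerates missing keys is safer than indexing directly. A minimal sketch, where `select_keys` is a hypothetical helper and the record is a trimmed copy of the first one shown above:

```python
def select_keys(record, keys):
    """Return only the requested keys; keys missing from the record become None."""
    return {key: record.get(key) for key in keys}

# Trimmed copy of the first record in the response above.
record = {'author': '[deleted]',
          'body': '[removed]',
          'created_utc': 1553727783,
          'score': 1,
          'subreddit': 'Conservative'}

print(select_keys(record, ['author', 'body', 'subreddit', 'retrieved_on']))
```

Here `retrieved_on` comes back as `None` because the trimmed record omits it, and `score` is dropped because it was not requested.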

## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [12]:
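def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Default to the current time at call time, not at function definition time.
    if now is None:
        now = round(time.time())

    # Require a file extension: a single dot that is not at the start of the
    # path and not part of a '..' sequence.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Locate the start of the base name (e.g. 'raw_comments' in
    # '../data/raw_comments.json') and insert the timestamp in front of it.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise NameError('Expected a file name of the form word_word.ext.')
    stamp = match.start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description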
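
To see what the second regular expression does, the naming step can be run on its own without writing to the log. This is a minimal sketch: `stamp_name` is a hypothetical stand-in that copies only the name-building lines of the function above, and the timestamp is the one reported by the pull later in this notebook.

```python
import re

def stamp_name(file_path, now):
    """Insert `now` in front of the word_word base name of file_path."""
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a file name of the form word_word.ext.')
    stamp = match.start()
    return f'{file_path[:stamp]}{now}_{file_path[stamp:]}'

print(stamp_name('../data/raw_comments.json', 1553699454))
```

The pattern was written for paths with a directory prefix such as `../data/`, so the sketch is only exercised on the path this notebook actually uses.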

Writing a function that collects comments page by page and saves the raw data for each pull as JSON. Parsing into a dataframe with the features of interest is handled by a separate function further down. Request loop inspired by: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [16]:
def reddit_query(subreddits, n_samples=1500, searchType='comment', before=None, after=None):
    url = 'https://api.pushshift.io/reddit/search/'
    # Page backwards in time, starting from `before` if given, otherwise from now.
    last_comment = before if before is not None else round(time.time())
    comment_list = []
    timestamp = None

    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')

        params = {'searchType': searchType,
                  'subreddit': subreddits,
                  'sort': 'desc',
                  'size': 1000,
                  'before': last_comment - 1,
                  'after': after,
                  }

        try:
            response = requests.get(url, params=params)
        except requests.RequestException:
            return 'Error. Pull not completed.'
        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'

        try:
            posts = response.json()['data']
        except (ValueError, KeyError):
            return 'Error. Pull not completed.'

        if len(posts) == 0:
            # Nothing older matches the query, so stop rather than loop forever.
            break

        # Results are sorted newest first, so the last record is the oldest one.
        last_comment = posts[-1]['created_utc']
        timestamp = last_comment
        comment_list.extend(posts)
        time.sleep(1)  # pause between requests
        run += 1

    if not comment_list:
        return 'Error. Pull not completed.'

    formatted_name, now, file_description = filename_format_log(file_path=f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)

    print(f'Saved and completed query and returned {len(comment_list)} {searchType}s.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
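
The request loop walks backwards in time: each page is sorted newest first, and the oldest timestamp on the page becomes the upper bound for the next request. The sketch below isolates that idea so it can be run offline. `paginate`, `fake_fetch`, and the `FAKE` records are hypothetical stand-ins for the API, with pages of 10 instead of 1000.

```python
def paginate(fetch_page, n_samples, start):
    """Collect at least n_samples records by walking backwards in time.

    fetch_page(before) must return records sorted newest first, each with a
    'created_utc' key, all strictly older than `before`.
    """
    collected = []
    before = start
    while len(collected) < n_samples:
        page = fetch_page(before)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        before = page[-1]['created_utc']  # oldest record seen so far
    return collected

# Stand-in for the API: 25 fake records, one per second.
FAKE = [{'id': i, 'created_utc': 1000 + i} for i in range(25)]

def fake_fetch(before, size=10):
    """Return up to `size` fake records older than `before`, newest first."""
    older = [r for r in FAKE if r['created_utc'] < before]
    return sorted(older, key=lambda r: r['created_utc'], reverse=True)[:size]

result = paginate(fake_fetch, n_samples=15, start=2000)
print(len(result), result[0]['created_utc'], result[-1]['created_utc'])
```

Asking for 15 records returns 20 here because whole pages are appended. The same overshoot explains why a request for a small sample from the real API can come back with a full page.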

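Using the query function to collect at least 15 comments from the conservative subreddit. Because each request asks for up to 1000 records, a single query returns a full page of 1000 comments.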

In [17]:
reddit_query(subreddits='conservative',
             n_samples=15,
             searchType='comment')

Starting query 1
Saved and completed query and returned 1000 comments.
Reddit text is ready for processing.
Last timestamp was 1553699454.


In [18]:
with open('../data/1553699454_raw_comments.json', 'r') as f:
    cons_sample_list = json.load(f)

In [19]:
len(cons_sample_list)

1000

In [20]:
cons_sample_list[0]

{'author': 'EVG2666',
 'author_flair_background_color': '',
 'author_flair_css_class': None,
 'author_flair_richtext': [{'e': 'text', 't': 'Conservative'}],
 'author_flair_template_id': None,
 'author_flair_text': 'Conservative',
 'author_flair_text_color': 'dark',
 'author_flair_type': 'richtext',
 'author_fullname': 't2_20w4ohq9',
 'author_patreon_flair': False,
 'body': 'The average Liberal is a mindless NPC. Even professors who should know better are so gullible. They believe everything they see on CNN. No critical thinking',
 'created_utc': 1553728358,
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'ejj7b7y',
 'link_id': 't3_b67wys',
 'no_follow': True,
 'parent_id': 't3_b67wys',
 'permalink': '/r/Conservative/comments/b67wys/republicans_more_informed_than_democrats/ejj7b7y/',
 'retrieved_on': 1553728360,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'Conservative',
 'subreddit_id': 't5_2qh6p'}
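
A few of these fields are easier to read after a small transformation. `created_utc` is an epoch timestamp in seconds, `permalink` is a path relative to reddit.com, and the prefixes on `parent_id` follow Reddit's type convention (`t1_` for a comment, `t3_` for a submission), so a comment whose `parent_id` starts with `t3_` replies directly to the submission. A minimal sketch, where `describe_comment` is a hypothetical helper applied to values copied from the record above:

```python
from datetime import datetime, timezone

def describe_comment(comment):
    """Summarize a raw comment record in a more readable form."""
    created = datetime.fromtimestamp(comment['created_utc'], tz=timezone.utc)
    return {
        'created': created.isoformat(),
        # True when the comment replies to the submission itself.
        'top_level': comment['parent_id'].startswith('t3_'),
        'url': 'https://www.reddit.com' + comment['permalink'],
    }

sample = {'created_utc': 1553728358,
          'parent_id': 't3_b67wys',
          'link_id': 't3_b67wys',
          'permalink': '/r/Conservative/comments/b67wys/republicans_more_informed_than_democrats/ejj7b7y/'}

print(describe_comment(sample))
```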

Parsing the JSON records into a dataframe containing the features of interest.

In [21]:
def reddit_parse(sample):

    col_list = ['author',
                'body',
                'subreddit',
                'subreddit_id',
                'created_utc',
                'link_id',
                'parent_id',
                'permalink',
                ]

    # Copy the selected columns so later assignments do not act on a view.
    comments_df = pd.DataFrame(sample)
    comments_df = comments_df[col_list].copy()

    # Encode the subreddit as a binary target: 0 = Conservative, 1 = Libertarian.
    # Any other subreddit name maps to NaN.
    comments_df = comments_df.rename(columns={'subreddit': 'libertarian'})
    comments_df['libertarian'] = comments_df['libertarian'].map({'Conservative': 0, 'Libertarian': 1})

    col_order = ['author',
                 'body',
                 'libertarian',
                 'created_utc',
                 'subreddit_id',
                 'parent_id',
                 'link_id',
                 'permalink',
                ]

    return comments_df[col_order]
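
The label mapping is case-sensitive: the query was made with 'conservative', but the API returns the subreddit name as 'Conservative', and any name that is not in the mapping ends up without a label. A quick count over the raw records can confirm that nothing falls through. This is a sketch in plain Python, with `LABELS`, `label_counts`, and the four sample records introduced only for illustration.

```python
from collections import Counter

LABELS = {'Conservative': 0, 'Libertarian': 1}

def label_counts(records, labels=LABELS):
    """Count records per label; unmapped subreddit names are counted under None."""
    return Counter(labels.get(record.get('subreddit')) for record in records)

records = [{'subreddit': 'Conservative'},
           {'subreddit': 'Libertarian'},
           {'subreddit': 'Conservative'},
           {'subreddit': 'conservative'},  # lowercase, so it is not mapped
           ]

print(label_counts(records))
```

A nonzero count under `None` signals records that would receive a missing label in the dataframe.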

Parsing the sample and reviewing the shape of the dataframe to confirm the transformation.

In [24]:
cons_comments_df = reddit_parse(cons_sample_list)

In [25]:
cons_comments_df.shape

(1000, 8)

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [26]:
cons_comments_df.head()

| | author | body | libertarian | created_utc | subreddit_id | parent_id | link_id | permalink |
|---|---|---|---|---|---|---|---|---|
| 0 | EVG2666 | The average Liberal is a mindless NPC. Even pr... | 0 | 1553728358 | t5_2qh6p | t3_b67wys | t3_b67wys | /r/Conservative/comments/b67wys/republicans_mo... |
| 1 | 1iJack23 | so that means he can be charged with the crime... | 0 | 1553728316 | t5_2qh6p | t3_b68ugv | t3_b68ugv | /r/Conservative/comments/b68ugv/as_if_it_never... |
| 2 | [deleted] | [removed] | 0 | 1553728300 | t5_2qh6p | t3_b650hr | t3_b650hr | /r/Conservative/comments/b650hr/funny_how_it_i... |
| 3 | [deleted] | [removed] | 0 | 1553728227 | t5_2qh6p | t1_eji7af8 | t3_b66klc | /r/Conservative/comments/b66klc/ohio_defunds_p... |
| 4 | mdgolfpro | Mueller's report is bi-partisan and objective.... | 0 | 1553728203 | t5_2qh6p | t3_b650hr | t3_b650hr | /r/Conservative/comments/b650hr/funny_how_it_i... |
