# Using the Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It offers higher-level search and querying of Reddit comments and submissions, which facilitates data collection for analysis and modeling. This notebook uses the requests library to query the API, which returns a JSON response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift subreddit thread on pulling large amounts of data: https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/


In [33]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query URL to the Pushshift API.

In [34]:
url = 'https://api.pushshift.io/reddit/search/'

Setting the parameters for the query. A full list of parameters can be found at https://pushshift.io/api-parameters/

In [35]:
params = {'searchType':'submission',
          'subreddit':'lifeprotips, unethicallifeprotips',
          'sort':'desc',
          'size':10,
#           'before': '10d',
#           'after': '168d',
         }

Making the request.

In [36]:
response = requests.get(url, params=params)
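
To see what these parameters look like once they are encoded into a query string, here is a minimal sketch that uses the standard library's `urlencode`. This is an assumption for illustration: requests performs its own encoding, and the URL it actually sent can be read afterwards from `response.url`.

```python
from urllib.parse import urlencode

url = 'https://api.pushshift.io/reddit/search/'
params = {'searchType': 'submission',
          'subreddit': 'lifeprotips, unethicallifeprotips',
          'sort': 'desc',
          'size': 10}

# urlencode percent-encodes the comma and turns the space into '+'.
query_string = urlencode(params)
full_url = f'{url}?{query_string}'
print(full_url)
```

Printing the encoded URL makes it easy to spot a stray space or a misspelled parameter name before any request is sent.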

Checking the status code to make sure the request succeeded and the server is responsive.

In [37]:
response.status_code

200
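
For reference, a numeric status code can be translated into its standard phrase with the standard library. This is a small sketch; `describe_status` is a hypothetical helper, not part of requests or of this notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return 'code phrase' for a known HTTP status code, or 'code (unknown)' otherwise."""
    try:
        return f'{code} {HTTPStatus(code).phrase}'
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return f'{code} (unknown)'

for code in (200, 404, 429, 500):
    print(describe_status(code))
```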

The status code returned from the server is 200, meaning the query was accepted and there aren't any connection issues. Checking the number of records in the JSON response.

In [38]:
len(response.json()['data'])

10

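Length is 10, as expected. Inspecting the structure of the records for keys of interest. Note that the records shown below carry comment fields such as body, link_id, and parent_id, so this request appears to have returned comments even though searchType was set to 'submission'.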

In [39]:
response.json()['data']

[{'author': 'evolutiontoe',
  'author_flair_background_color': None,
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_template_id': None,
  'author_flair_text': None,
  'author_flair_text_color': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_zyeq9i1',
  'author_patreon_flair': False,
  'body': 'I discovered this a while back too. Which  has has been awesome. Since it takes me months to finish a book. Because if Reddit!!!',
  'created_utc': 1554086088,
  'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
  'id': 'ejulcpz',
  'is_submitter': False,
  'link_id': 't3_b7tthl',
  'no_follow': True,
  'parent_id': 't3_b7tthl',
  'permalink': '/r/LifeProTips/comments/b7tthl/lpt_if_your_digital_library_loan_expires_before/ejulcpz/',
  'retrieved_on': 1554086089,
  'score': 1,
  'send_replies': True,
  'stickied': False,
  'subreddit': 'LifeProTips',
  'subreddit_id': 't5_2s5oq'},
 {'author': 'ImS0hungry',
  'author_flair_background_color': None,
  

Keys of interest are:
- author
- body
- retrieved_on
- created_utc
- link_id
- parent_id
- permalink
- subreddit
- subreddit_id

In [None]:
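# Each key needs its own trailing comma; without one, adjacent string
# literals are silently concatenated into a single element.
col_list = ['author',
            'body',
            'subreddit',
            'subreddit_id',
            'created_utc',
            'retrieved_on',
            'link_id',
            'parent_id',
            'permalink',
            ]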
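
As a quick illustration of how the keys of interest are used, the sketch below trims one record down to those keys. The record is an abbreviated copy of the first one in the output above, and `trim_record` is a hypothetical helper introduced only for this example.

```python
col_list = ['author', 'body', 'subreddit', 'subreddit_id', 'created_utc',
            'retrieved_on', 'link_id', 'parent_id', 'permalink']

# Abbreviated copy of the first record in the response; 'body' and
# 'permalink' are left out here to show how missing keys are handled.
record = {'author': 'evolutiontoe',
          'created_utc': 1554086088,
          'id': 'ejulcpz',
          'link_id': 't3_b7tthl',
          'parent_id': 't3_b7tthl',
          'retrieved_on': 1554086089,
          'score': 1,
          'subreddit': 'LifeProTips',
          'subreddit_id': 't5_2s5oq'}

def trim_record(record, keys):
    """Keep only the requested keys; keys absent from the record map to None."""
    return {key: record.get(key) for key in keys}

trimmed = trim_record(record, col_list)
print(trimmed)
```

Using `dict.get` rather than indexing means a record that lacks one of the keys produces a `None` value instead of raising a `KeyError`.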

## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [None]:
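def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):

    # Take the timestamp at call time. A default of round(time.time()) in the
    # signature would be evaluated once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension: a single dot that is not part of a leading './' or '../'.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Find where the base name (e.g. 'raw_comments') starts so the timestamp can be prefixed to it.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise NameError('File name must have the form <word>_<word>.<extension>.')
    stamp = match.start()

    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description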
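
The naming convention itself, a timestamp prefixed to the base name, can also be sketched without regular expressions. `stamp_filename` is a hypothetical helper, not part of the notebook, and it leaves out the logging step.

```python
import posixpath

def stamp_filename(file_path, now):
    """Prefix the base name of a '/'-separated path with a timestamp."""
    directory, base = posixpath.split(file_path)
    return posixpath.join(directory, f'{now}_{base}')

print(stamp_filename('../data/raw_comments.json', 1553206537))
```

Because the timestamp comes first in the file name, files in the data directory sort chronologically by pull.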

Writing a function that collects records in a loop of requests and saves the raw data from each pull to a timestamped .json file. The request loop was inspired by this thread: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [None]:
def reddit_query(subreddits, n_samples=30000, searchType='submission', before=None, after=None):
    url = 'https://api.pushshift.io/reddit/search/'
    # Upper time bound for the next request: the caller's `before` if given, else now.
    last_comment = before if before is not None else round(time.time()) - 1
    comment_list = []
    timestamp = None

    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')

        params = {'searchType': searchType,
                  'subreddit': subreddits,
                  'sort': 'desc',
                  'size': 1000,
                  'before': last_comment,
                  'after': after,
                  }

        try:
            response = requests.get(url, params=params)
        except requests.RequestException as err:
            return f'Error. Pull not completed: {err}'

        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'

        try:
            posts = response.json()['data']
        except (ValueError, KeyError):
            return 'Error. Pull not completed.'

        # An empty page means nothing older is available; stop instead of
        # repeating the same request forever.
        if len(posts) == 0:
            break

        comment_list.extend(posts)
        # With sort='desc', the last post on the page is the oldest one seen so far.
        timestamp = posts[-1]['created_utc']
        last_comment = timestamp - 1
        time.sleep(1)  # pause between requests
        run += 1

    if not comment_list:
        return 'No results returned.'

    formatted_name, now, file_description = filename_format_log(file_path=f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)

    print(f'Saved and completed query and returned {len(comment_list)} {searchType}s.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
    return timestamp
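
The paging logic in the function above, walking backwards in time by reusing the oldest `created_utc` as the next upper bound, can be tried offline. The sketch below swaps the API for a fake in-memory archive; `paginate`, `fake_fetch`, and `FAKE_ARCHIVE` are hypothetical names introduced only for this illustration.

```python
def paginate(fetch_page, n_samples, start):
    """Collect records by walking backwards in time, one page per call.

    fetch_page(before) must return records with created_utc < before,
    sorted newest first; an empty list means there is nothing older.
    """
    collected = []
    before = start
    while len(collected) < n_samples:
        page = fetch_page(before)
        if not page:
            break  # nothing older left, so stop rather than loop forever
        collected.extend(page)
        before = page[-1]['created_utc']  # oldest record becomes the new bound
    return collected

# Stand-in for the API: 25 fake records, one per second, served 10 at a time.
FAKE_ARCHIVE = [{'created_utc': t} for t in range(1025, 1000, -1)]

def fake_fetch(before, size=10):
    older = [r for r in FAKE_ARCHIVE if r['created_utc'] < before]
    return older[:size]

records = paginate(fake_fetch, n_samples=22, start=2000)
print(len(records), records[-1]['created_utc'])
```

Whole pages are appended, so the result can overshoot `n_samples`. With a strict upper bound, records that share the boundary second with the last record of a page could be missed; whether that matters depends on how densely the data is timestamped.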

Using the query function to request 30,000 records from the lifeprotips and unethicallifeprotips subreddits.

In [30]:
reddit_query(subreddits='lifeprotips, unethicallifeprotips',
             n_samples=30000,
             searchType='submission',
#              before = '10d',
#              after = '168d'
            )

Starting query 16
Starting query 17


'Error. Pull not completed.'

1553699454

In [None]:
with open('../data/1553206537_raw_comments.json', 'r') as f:
    bo_comment_list = json.load(f)

In [None]:
len(bo_comment_list)

In [None]:
bo_comment_list[0]

Parsing the json file into a dataframe containing the features of interest.

In [None]:
def reddit_parse(sample):

    col_list = ['author',
                'body',
                'subreddit',
                'subreddit_id',
                'created_utc',
                'link_id',
                'parent_id',
                'permalink',
                ]

    df = pd.DataFrame(sample)
    df = df[col_list].copy()

    # Encode the subreddit as a binary label: 1 for lifeprotips, 0 for unethicallifeprotips.
    # Names come back capitalized (e.g. 'LifeProTips'), so lowercase them before mapping.
    df = df.rename(columns={'subreddit': 'lifeprotips'})
    df['lifeprotips'] = df['lifeprotips'].str.lower().map({'unethicallifeprotips': 0, 'lifeprotips': 1})

    col_order = ['author',
                 'body',
                 'lifeprotips',
                 'created_utc',
                 'subreddit_id',
                 'parent_id',
                 'link_id',
                 'permalink',
                ]

    return df[col_order]
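
One detail of the label mapping is worth checking in isolation: `Series.map` with a dictionary matches keys exactly, including case. The sketch below assumes the second subreddit comes back as 'UnethicalLifeProTips'; only 'LifeProTips' appears in the output shown earlier.

```python
import pandas as pd

label_map = {'unethicallifeprotips': 0, 'lifeprotips': 1}

# 'UnethicalLifeProTips' is an assumed capitalization, used for illustration.
subreddits = pd.Series(['LifeProTips', 'UnethicalLifeProTips', 'LifeProTips'])

unmatched = subreddits.map(label_map)            # capitalized names match no key
labels = subreddits.str.lower().map(label_map)   # lowercase first, then map

print(unmatched.isna().all())
print(labels.tolist())
```

Values with no matching key become NaN rather than raising an error, so a quick `isna().sum()` on the label column after parsing is a cheap sanity check.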

Parsing the loaded records and reviewing the shape of the dataframe to ensure the transformation was correct.

In [None]:
comments_df = reddit_parse(bo_comment_list)

In [None]:
comments_df.shape

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [None]:
comments_df.head()

In [None]:
comments_df['body'].nunique()

In [None]:
comments_df['author'].nunique()

In [None]:
comments_df['created_utc'].min()

In [None]:
comments_df['created_utc'].max()

In [None]:
x=[1553206319, 1553897404]
start_end = [time.asctime(time.gmtime(i)) for i in x]

In [None]:
start_end
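
As an alternative to `time.asctime`, the same epoch values can be rendered as timezone-aware ISO 8601 strings, which sort and parse more predictably. This is a sketch using the standard library; `utc_iso` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def utc_iso(epoch_seconds):
    """Convert epoch seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()

start_end_iso = [utc_iso(t) for t in (1553206319, 1553897404)]
print(start_end_iso)
```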