# Using the Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It supports higher-level search and querying of Reddit comments and submissions, which facilitates data collection for analysis and modeling. This notebook uses the requests library to query the API, which returns a JSON response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift Subreddit (thread on pulling large amounts of data): https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/


In [187]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query URL to the Pushshift submission search endpoint.

In [188]:
url = 'https://api.pushshift.io/reddit/search/submission/'

Setting the parameters for the query. A full list of parameters can be found on: https://pushshift.io/api-parameters/

In [189]:
params = {#'searchType':'submission',
          'subreddit':'lifeprotips, unethicallifeprotips',
          'sort':'desc',
          'size':10,
#           'before': '10d',
#           'after': '168d',
         }
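To see roughly what these parameters turn into, the sketch below builds the query string by hand with the standard library. It is only an approximation of what requests sends: the helper name `preview_url` is made up for illustration, and the `before`/`after` entries are added here to show that unset (`None`) parameters are left out, which is how requests treats them. Once a request has actually been made, `response.url` shows the real URL.

```python
from urllib.parse import urlencode

def preview_url(base_url, params):
    """Approximate the URL a GET request would use for these parameters."""
    # Drop parameters set to None, mirroring how requests omits them.
    active = {key: value for key, value in params.items() if value is not None}
    return f'{base_url}?{urlencode(active)}'

url = 'https://api.pushshift.io/reddit/search/submission/'
params = {'subreddit': 'lifeprotips, unethicallifeprotips',
          'sort': 'desc',
          'size': 10,
          'before': None,  # unset, so it is left out of the query string
          'after': None,   # unset, so it is left out of the query string
          }

query_url = preview_url(url, params)
print(query_url)
```

Note that the space after the comma in the subreddit list is encoded and sent as part of the value, so it is worth checking that both subreddits appear in the results.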

Making the request.

In [14]:
response = requests.get(url, params=params)

Checking the status code to make sure the query was accepted and the server is responsive.

In [15]:
response.status_code

200

The status code returned from the server is 200, meaning the query was accepted and there aren't any connection issues. Checking the number of records in the JSON response.

In [16]:
len(response.json()['data'])

10

Length is 10, as expected. Assessing the file structure for keys of interest.

In [17]:
response.json()['data']

[{'author': 'IllustriousVirus26',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_db25fyv',
  'author_patreon_flair': False,
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1554087060,
  'domain': 'self.LifeProTips',
  'full_link': 'https://www.reddit.com/r/LifeProTips/comments/b7wduw/the_more_you_keep_your_mind_occupied_the_less_you/',
  'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
  'id': 'b7wduw',
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': False,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_css_class': 'removed',
  'link_flair_richtext': [],
  'link_flair_text': 'Removed',
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': True,
  'num_comments': 2,


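Keys of interest are listed below. For submissions, `title` and `selftext` hold the text; `body` is the equivalent field for comments.
- author
- title
- selftext
- body
- retrieved_on
- created_utc
- link_id
- parent_id
- permalink
- subreddit
- subreddit_id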

In [18]:
col_list = ['author',
            'title',
            'selftext',
#             'body',
            'subreddit',
            'subreddit_id',
            'created_utc',
            'retrieved_on',  # listed once; a missing comma previously fused it with 'subreddit'
            'link_id',
            'parent_id',
            'permalink',
            ]
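Before settling on a column list, it can help to check which of the wanted keys are actually present in every record, since selecting a missing column from a dataframe raises an error. The sketch below does this on two made-up records standing in for `response.json()['data']`; the helper name `key_coverage` and the sample values are illustrative only.

```python
from collections import Counter

def key_coverage(records):
    """Return the fraction of records in which each key appears."""
    counts = Counter(key for record in records for key in record)
    return {key: count / len(records) for key, count in counts.items()}

# Made-up records standing in for response.json()['data'].
sample = [
    {'author': 'user_a', 'title': 'First tip', 'selftext': '', 'created_utc': 1554087060},
    {'author': 'user_b', 'title': 'Second tip', 'created_utc': 1554087000},
]

wanted = ['author', 'title', 'selftext', 'body']
coverage = key_coverage(sample)

# Keys that are absent from at least one record.
missing = [key for key in wanted if coverage.get(key, 0) < 1.0]
print(missing)
```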

## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [19]:
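def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Resolve the timestamp at call time. A default of round(time.time()) in the
    # signature is evaluated only once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension: a single '.' that is not leading and not part of '..'.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Find where the base name (e.g. 'raw_submissions') starts so the timestamp can go before it.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise NameError('File name must be lowercase words joined by an underscore, e.g. raw_submissions.json.')
    stamp = match.start()

    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description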
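The naming scheme itself (a timestamp placed in front of the file's base name) can be sketched without regular expressions. The version below is an alternative illustration rather than the function above: `stamp_filename` is a hypothetical helper, it does no logging, and it uses `posixpath` so the forward-slash paths behave the same on any operating system.

```python
import posixpath

def stamp_filename(file_path, stamp):
    """Insert a timestamp in front of the base name of a relative path."""
    folder, name = posixpath.split(file_path)
    return posixpath.join(folder, f'{stamp}_{name}')

stamped = stamp_filename('../data/raw_submissions.json', 1549858877)
print(stamped)
```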

Writing a function for collecting submissions in batches and saving out the raw data for each pull as JSON, so it can later be parsed into a dataframe with the features of interest. Request loop inspired by: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [42]:
def reddit_query(subreddits, n_samples=30000, searchType='submission', before=None, after=None):
    # Build the endpoint from searchType so the query matches the saved file name.
    url = f'https://api.pushshift.io/reddit/search/{searchType}/'
    # NOTE: `before` is currently unused; paging always starts from the current time.
    last_post = round(time.time())
    timestamp = last_post
    post_list = []

    run = 1
    while len(post_list) < n_samples:
        print(f'Starting query {run}')

        params = {'subreddit': subreddits,
                  'sort': 'desc',
                  'size': 1000,
                  'before': last_post - 1,
                  'after': after,
                  }

        try:
            response = requests.get(url, params=params)
            response.raise_for_status()
            posts = response.json()['data']
        except requests.HTTPError:
            return f'Check status. Error code: {response.status_code}'
        except (requests.RequestException, ValueError, KeyError):
            return 'Error. Pull not completed.'

        if len(posts) == 0:
            # Nothing older is left; stop rather than repeat the same query forever.
            break

        # Results are newest-first, so the last post is the oldest one seen so far.
        last_post = posts[-1]['created_utc']
        timestamp = last_post
        post_list.extend(posts)
        time.sleep(1)  # pause between requests
        run += 1

    formatted_name, now, file_description = filename_format_log(file_path=f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(post_list, f)

    print(f'Saved and completed query and returned {len(post_list)} {searchType}s.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
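The paging idea in the request loop (ask for the newest posts before a cursor, then move the cursor back to the oldest post received) can be tried without touching the network. The sketch below swaps the API call for a made-up `fake_fetch` over a tiny in-memory archive; it assumes `before` excludes the cursor value itself, and all names and numbers are illustrative.

```python
# Pretend archive: one post per timestamp, newest first.
ARCHIVE = [{'created_utc': t} for t in range(10, 0, -1)]

def fake_fetch(before, size):
    """Stand-in for the API call: newest posts strictly older than `before`."""
    older = [post for post in ARCHIVE if post['created_utc'] < before]
    return older[:size]

def collect(n_samples, start, size=3):
    """Page backwards in time until enough posts are gathered or none remain."""
    posts, cursor = [], start
    while len(posts) < n_samples:
        page = fake_fetch(before=cursor, size=size)
        if not page:
            break  # archive exhausted
        posts.extend(page)
        cursor = page[-1]['created_utc']  # oldest post seen so far
    return posts

partial = collect(n_samples=7, start=11)
everything = collect(n_samples=100, start=11)
print(len(partial), len(everything))
```

Because whole pages are appended, the loop can return slightly more than `n_samples`, and it stops early once the archive runs out.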

Using the query function to collect 10,000 submissions from the lifeprotips subreddit.

In [95]:
reddit_query(subreddits='lifeprotips',
             n_samples=10000,
             searchType='submission',
#              before = '10d',
#              after = '168d'
            )

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 submissions.
Reddit text is ready for processing.
Last timestamp was 1549858877.
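The last timestamp is in epoch seconds, which is hard to read at a glance. A small sketch for converting it to a UTC date, to see how far back the pull reached (the variable names are illustrative):

```python
from datetime import datetime, timezone

last_timestamp = 1549858877  # value printed by the pull above

# Interpret the epoch seconds as a timezone-aware UTC datetime.
pulled_back_to = datetime.fromtimestamp(last_timestamp, tz=timezone.utc)
print(pulled_back_to.isoformat())
```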



In [165]:
with open(f'../data/1549858877_raw_submissions.json', 'r') as f:
    lpt_list = json.load(f)

In [166]:
len(lpt_list)

10000

In [106]:
lpt_list[0]

{'author': 'minigunman123',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_63uak',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1554088723,
 'domain': 'self.LifeProTips',
 'full_link': 'https://www.reddit.com/r/LifeProTips/comments/b7wnp4/if_you_load_a_news_website_with_a_paywall_keep/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'b7wnp4',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 2,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'all_ads',
 'permalink': '/r/LifeProTip

In [99]:
reddit_query(subreddits='unethicallifeprotips',
             n_samples=10000,
             searchType='submission',
#              before = '10d',
#              after = '168d'
            )

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 submissions.
Reddit text is ready for processing.
Last timestamp was 1542760866.


In [167]:
with open(f'../data/1542760866_raw_submissions.json', 'r') as f:
    ulpt_list = json.load(f)

In [168]:
ulpt_list[0]

{'author': 'ThatNothing',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_3iog3nfl',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1554088420,
 'domain': 'self.UnethicalLifeProTips',
 'full_link': 'https://www.reddit.com/r/UnethicalLifeProTips/comments/b7wm0c/ulpt_request_resume_tips_for_college_grads/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'b7wm0c',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': False,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'house_only',
 'permalin

Parsing the json file into a dataframe containing the features of interest.

In [169]:
def reddit_parse(sample, df=None):
    # `df` is not used; it is kept only so existing calls that pass it still work.

    # Columns to keep. Other candidate keys are left commented out.
    col_list = ['author',
                'title',
#                 'selftext',
#                 'body',
                'subreddit',
#                 'subreddit_id',
#                 'created_utc',
#                 'retrieved_on',
#                 'link_id',
#                 'parent_id',
                'permalink',
                ]

    # Build a dataframe from the list of post dictionaries and keep the selected columns in order.
    return pd.DataFrame(sample)[col_list]

Parsing each pull into a dataframe, combining the two, and reviewing the result to ensure correct transformation.

In [170]:
df_lpt = reddit_parse(lpt_list, df = 'df_lpt')

In [171]:
df_ulpt = reddit_parse(ulpt_list, df = 'df_ulpt')

In [172]:
frames = [df_lpt, df_ulpt]

In [173]:
df = pd.concat(frames, ignore_index=True)

In [174]:
df.tail()

|  | author | title | subreddit | permalink |
|---|---|---|---|---|
| 19995 | existentialzebra | As a parent of a baby, smell your baby’s diape... | UnethicalLifeProTips | /r/UnethicalLifeProTips/comments/9yym50/as_a_p... |
| 19996 | shercroft | ULPT Hey kids, rather than search through your... | UnethicalLifeProTips | /r/UnethicalLifeProTips/comments/9yyigh/ulpt_h... |
| 19997 | Jojda6 | How to Receive a Double Refund from Amazon | UnethicalLifeProTips | /r/UnethicalLifeProTips/comments/9yyc7m/how_to... |
| 19998 | TheBaconestCow | ULPT I don't know the specifics, basically whe... | UnethicalLifeProTips | /r/UnethicalLifeProTips/comments/9yybak/ulpt_i... |
| 19999 | TheBaconestCow | I don't know the specifics here, basically whe... | UnethicalLifeProTips | /r/UnethicalLifeProTips/comments/9yy7b6/i_dont... |


In [175]:
df['subreddit'] = df['subreddit'].map({'LifeProTips': 0, 'UnethicalLifeProTips': 1})
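One detail worth keeping in mind with `Series.map` and a dictionary: any value without a matching key becomes `NaN`, and the match is case-sensitive. The sketch below shows this on a tiny made-up frame, where the lowercase name is deliberately left unmatched; the column name `label` and the sample rows are illustrative.

```python
import pandas as pd

labels = {'LifeProTips': 0, 'UnethicalLifeProTips': 1}

# Made-up rows; the last name is lowercase and has no key in `labels`.
demo = pd.DataFrame({'subreddit': ['LifeProTips', 'UnethicalLifeProTips', 'lifeprotips']})
demo['label'] = demo['subreddit'].map(labels)

# Count rows the mapping did not cover.
unmapped = int(demo['label'].isna().sum())
print(unmapped)
```

Checking for `NaN` values right after the mapping is a cheap way to confirm every row received a label.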

In [176]:
df.shape

(20000, 4)

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [177]:
df.head()

|  | author | title | subreddit | permalink |
|---|---|---|---|---|
| 0 | minigunman123 | If you load a news website with a paywall, kee... | 0 | /r/LifeProTips/comments/b7wnp4/if_you_load_a_n... |
| 1 | Justinjp126 | LPT: So you know those little pieces of detach... | 0 | /r/LifeProTips/comments/b7wnm4/lpt_so_you_know... |
| 2 | TheFormulaS | LPT: Use a pistachio shell to help you open a ... | 0 | /r/LifeProTips/comments/b7wnja/lpt_use_a_pista... |
| 3 | NullandRandom | LPT: Whenever you buy food combos but you know... | 0 | /r/LifeProTips/comments/b7wney/lpt_whenever_yo... |
| 4 | Serenitybyjan88 | LPT: Take a picture or video of yourself doing... | 0 | /r/LifeProTips/comments/b7wncl/lpt_take_a_pict... |


In [178]:
df['title'].nunique()

19742

In [179]:
df.drop_duplicates(subset = 'title', inplace = True)

In [180]:
df.subreddit.value_counts()

1    9909
0    9833
Name: subreddit, dtype: int64

In [181]:
# Remove a leading 'ULPT' or 'LPT' tag, plus any colons or whitespace at the start of the title.
# str.lstrip('ULPT') strips any leading U, L, P or T characters rather than the tag itself,
# so it also clips untagged titles that happen to start with one of those letters.
df['title'] = df['title'].str.replace(r'^(?:U?LPT\b)?[:\s]*', '', regex=True)
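A note on stripping these tags: `str.lstrip` treats its argument as a set of characters, not as a prefix, so it removes any run of leading `U`, `L`, `P` or `T` characters. The sketch below contrasts that with a regular expression that only removes a whole `ULPT` or `LPT` tag. The helper name `strip_tag` and the sample titles are illustrative.

```python
import re

def strip_tag(title):
    """Remove a leading 'ULPT'/'LPT' tag and any colons or spaces at the start."""
    return re.sub(r'^(?:U?LPT\b)?[:\s]*', '', title)

untagged = 'The more you keep your mind occupied'

# lstrip removes the leading 'T' because 'T' is in the character set.
clipped = untagged.lstrip('ULPT')
print(clipped)

# The regular expression leaves untagged titles alone and removes whole tags only.
print(strip_tag(untagged))
print(strip_tag('LPT: Use a pistachio shell'))
print(strip_tag('ULPT Hey kids'))
```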

In [185]:
df['author'].nunique()

13917

In [186]:
df.to_csv('../data/combined_df.csv', index = False)