# Accessing Data

## Using the Pushshift API

Pushshift is a service that archives and indexes Reddit at regular intervals. It offers higher-level search and query functionality for Reddit comments and submissions, which facilitates data collection for analysis and modeling. In this notebook, the API is queried with the requests library; it returns a JSON response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift Subreddit: https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/


## Importing Libraries

In [2]:
import requests, time, csv, json, re
import pandas as pd

### We'll be accessing the data using the Pushshift API

The Pushshift API allows for comprehensive search and retrieval of Reddit data.

We'll use it to retrieve submissions from the two subreddits of interest:
- r/depression
- r/suicidewatch

## Setting the base query syntax:

Setting the query url to the pushshift api

In [3]:
url = 'https://api.pushshift.io/reddit/submission/search/'

Testing the Pushshift API by pulling 10 submissions from the r/depression subreddit

In [4]:
params = {
          'subreddit':'depression', #subreddit
          'sort':'desc',
          'size':10 # number of submissions being collected
         }

Making the request.

In [5]:
response = requests.get(url, params=params)

Checking the status code to make sure the query terms were accepted and the server is responsive

In [6]:
response.status_code

200

The status code returned from the server is 200, meaning the query was accepted without error.
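
The query string built from `params` can also be previewed without contacting the server. The sketch below uses only the standard library; `build_query_url` is a hypothetical helper, not part of Pushshift or `requests`, and the `after` entry is added only to show how unset parameters are dropped. After a real request, `response.url` shows the URL that was actually sent.

```python
from urllib.parse import urlencode

def build_query_url(base_url, params):
    """Return base_url with params appended as a query string (hypothetical helper)."""
    # Drop parameters set to None, mirroring how requests omits them
    cleaned = {key: value for key, value in params.items() if value is not None}
    return f'{base_url}?{urlencode(cleaned)}'

url = 'https://api.pushshift.io/reddit/submission/search/'
params = {'subreddit': 'depression', 'sort': 'desc', 'size': 10, 'after': None}

query_url = build_query_url(url, params)
print(query_url)
```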

Now we'll take a look at the structure of the JSON response

In [8]:
response.json()

{'data': [{'author': 'swnkthekid',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_2w7nrjas',
   'author_patreon_flair': False,
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1554770811,
   'domain': 'self.depression',
   'full_link': 'https://www.reddit.com/r/depression/comments/bb1jun/my_fuckin_anxiety/',
   'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
   'id': 'bb1jun',
   'is_crosspostable': True,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': True,
   'is_self': True,
   'is_video': False,
   'link_flair_background_color': '',
   'link_flair_richtext': [],
   'link_flair_text_color': 'dark',
   'link_flair_type': 'text',
   'locked': False,
   'media_only': False,
   'no_follow': True,
   'num_comments': 0,
   'num_crossposts': 0,
   'over_18': False,
   'parent_whitelist_status': 

We can look at all the available fields in the dataset

In [8]:
response.json()['data'][0].keys()

dict_keys(['author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_richtext', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'suggested_sort', 'thumbnail', 'title', 'url', 'whitelist_status', 'wls'])

Below are the columns of interest

In [9]:
col_list = ['author',
            'author_fullname',
            'subreddit',
            'id',
            'created_utc',
            'retrieved_on',
            'permalink',
            'url',
            'num_comments',
            'title',
            'selftext'
            ]
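
As a rough illustration of what selecting these columns means for a single record, the sketch below filters one submission dict down to a list of fields. `select_fields` is a hypothetical helper introduced here for illustration, and the record is an abbreviated copy of the response shown earlier.

```python
def select_fields(record, fields):
    """Keep only the requested keys from one submission dict (hypothetical helper)."""
    # dict.get returns None for any field missing from the record
    return {field: record.get(field) for field in fields}

# Abbreviated record using values from the response shown earlier
record = {'author': 'swnkthekid', 'id': 'bb1jun', 'created_utc': 1554770811, 'is_video': False}
subset = select_fields(record, ['author', 'id', 'created_utc', 'selftext'])
print(subset)
```

Fields that are absent from a record come back as `None` in this sketch, which makes gaps in the raw data easy to spot.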

## Querying Reddit and saving raw data in .json format:

Writing a function for creating a logfile and formatting file names with a unique timestamp.

In [9]:
# Code snippet from Chris Sanatra - DSI TA

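def filename_format_log(file_path, 
                        logfile = '../assets/file_log.txt', 
                        now = None, 
                        file_description = None): 
    # Take the timestamp at call time; a default of round(time.time())
    # would be evaluated only once, when the function is defined
    if now is None:
        now = round(time.time())
   
    # Require a file extension (a single dot that is not at the start of the path)
    ext = re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path)
    if ext is None:
        raise NameError('Please enter a relative path with a file extension.') 
    
    # Insert the timestamp in front of the base name, e.g. raw_submissions
    stamp = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'  
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description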
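
To make the renaming step easier to follow, here is a minimal sketch of only the timestamp insertion, without the log file. `stamp_filename` is a hypothetical name; it reuses the same regular expression as the function above and assumes a lowercase base name of the form `word_word` directly before the extension.

```python
import re

def stamp_filename(file_path, now):
    """Insert a Unix timestamp before the base file name (hypothetical helper)."""
    # Locate a lowercase name of the form word_word that sits right before the extension
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a file name like raw_submissions.json')
    start = match.start()
    return f'{file_path[:start]}{now}_{file_path[start:]}'

stamped = stamp_filename('../data/raw_submissions.json', 1554081775)
print(stamped)  # ../data/1554081775_raw_submissions.json
```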

Writing a function for collecting submissions in a paginated request loop and saving out the raw data for each pull as JSON; parsing into a dataframe with the features of interest happens in a later step. Request loop inspired by: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [10]:
def reddit_query(subreddits, n_samples=1500, searchType='submission', before=None, after=None):
    url = 'https://api.pushshift.io/reddit/submission/search/'
    # Page backwards in time, starting from `before` (or from now if it is not given)
    last_comment = before if before is not None else round(time.time())
    timestamp = last_comment
    comment_list = []
    
    run = 1
    while len(comment_list) < n_samples:
        print(f'Starting query {run}')
        
        params = {'searchType':searchType,
                  'subreddit':subreddits,
                  'sort':'desc',
                  'size':n_samples,
                  'before':last_comment-1,
                  'after':after,
                 }
        
        try:
            response = requests.get(url, params = params)
        except requests.exceptions.RequestException:
            return 'Error. Pull not completed.'
        if response.status_code != 200:
            return f'Check status. Error code: {response.status_code}'
        
        try:
            posts = response.json()['data']
        except (ValueError, KeyError):
            return 'Error. Pull not completed.'
        
        if len(posts) == 0:
            # No older posts are available; stop instead of looping forever
            break
        
        # The oldest post on this page bounds the next request
        last_comment = posts[-1]['created_utc']
        comment_list.extend(posts)
        timestamp = last_comment
        time.sleep(1)  # pause between requests
        run += 1
    
    formatted_name, now, file_description = filename_format_log(file_path =f'../data/raw_{searchType}s.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)
    
    print(f'Saved and completed query and returned {len(comment_list)} {searchType}s.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
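
The core idea of the request loop in `reddit_query`, using the timestamp of the oldest record on each page as the upper bound for the next request, can be tried out without touching the network. The sketch below is an illustration under stated assumptions: `paginate_backwards`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names, the in-memory list stands in for the API, and, unlike `reddit_query`, the result is trimmed to exactly `n_samples`.

```python
def paginate_backwards(fetch_page, n_samples, start_time):
    """Collect up to n_samples records by repeatedly asking for older ones.

    fetch_page(before) is a hypothetical callable returning records sorted
    newest first, each with a 'created_utc' key.
    """
    collected = []
    before = start_time
    while len(collected) < n_samples:
        page = fetch_page(before)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # The oldest record on this page bounds the next request
        before = page[-1]['created_utc']
    return collected[:n_samples]

# In-memory stand-in for the API: ten fake posts with timestamps 10 down to 1
FAKE_POSTS = [{'id': i, 'created_utc': i} for i in range(10, 0, -1)]

def fake_fetch(before, page_size=4):
    # Return at most page_size posts strictly older than `before`
    older = [post for post in FAKE_POSTS if post['created_utc'] < before]
    return older[:page_size]

result = paginate_backwards(fake_fetch, n_samples=7, start_time=11)
print([post['id'] for post in result])
```

Because the stand-in filters with a strict less-than, records that share a timestamp with a page boundary would be skipped; real data with many posts per second may need a different boundary rule.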

Using the query function to collect 10000 submissions from each subreddit

In [12]:
# reddit_query(subreddits='depression',
#              n_samples=10000,
#              searchType='submission')

In [13]:
# reddit_query(subreddits='suicidewatch',
#              n_samples=10000,
#              searchType='submission')

## Opening the Saved JSON Files

In [14]:
with open(f'../data/1552716506_raw_submission_depression.json', 'r') as f:
    depression_json = json.load(f)

In [15]:
with open(f'../data/1550623407_raw_submission_suicidewatch.json', 'r') as f:
    suicidewatch_json = json.load(f)

Confirming the length of each json file is 10000

In [16]:
len(suicidewatch_json)

10000

In [17]:
len(depression_json)

10000
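
A length of 10000 does not by itself show that every record is distinct. As an optional extra check, and assuming each record carries an `id` key as in the response shown earlier, repeated ids could be listed along these lines. `duplicate_ids` is a hypothetical helper and the sample records are toy data.

```python
from collections import Counter

def duplicate_ids(records):
    """Return ids that appear more than once in a list of submission dicts (hypothetical helper)."""
    counts = Counter(record['id'] for record in records)
    return sorted(post_id for post_id, count in counts.items() if count > 1)

# Toy records; with the real data this would be called on suicidewatch_json
sample = [{'id': 'b7vi8w'}, {'id': 'b7vhu2'}, {'id': 'b7vi8w'}]
duplicates = duplicate_ids(sample)
print(duplicates)
```

An empty list would indicate that no id is repeated.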

We can view the first record from the suicidewatch subreddit

In [18]:
suicidewatch_json[0]

{'author': 'graciegrinch',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_236k8jvz',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1554081775,
 'domain': 'self.SuicideWatch',
 'full_link': 'https://www.reddit.com/r/SuicideWatch/comments/b7vi8w/i_have_a_plan/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'b7vi8w',
 'is_crosspostable': False,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': True,
 'num_comments': 1,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status': 'no_ads',
 'permalink': '/r/SuicideWatch/comments/b7vi8w/i_have_a_pla

### Parsing the json into a Pandas DataFrame

Writing a function to retrieve the columns of interest and convert the json to a dataframe

In [20]:
def reddit_parse(json_data):
    
    col_list = ['author',
            'author_fullname',
            'subreddit',
            'id',
            'created_utc',
            'retrieved_on',
            'permalink',
            'url',
            'num_comments',
            'title',
            'selftext'
            ]
    
    # Build a dataframe from the list of records
    comments_df = pd.DataFrame(json_data)
    
#     comments_df.rename(columns={'subreddit':'suicidewatch'}, inplace=True)
#     comments_df['suicidewatch'] = comments_df['suicidewatch'].map({'depression':0, 'suicidewatch':1})

    # Keep only the columns of interest
    return comments_df[col_list]

Reviewing the shape of the dataframe to ensure correct transformation

In [21]:
df_suicidewatch = reddit_parse(suicidewatch_json)

In [22]:
df_suicidewatch.shape

(10000, 11)

In [23]:
df_depression = reddit_parse(depression_json)

In [24]:
df_depression.shape

(10000, 11)

Shape corresponds with expected values. Reviewing the head of the dataframe to ensure data was correctly labeled. 

In [25]:
df_suicidewatch.head()

| | author | author_fullname | subreddit | id | created_utc | retrieved_on | permalink | url | num_comments | title | selftext |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | graciegrinch | t2_236k8jvz | SuicideWatch | b7vi8w | 1554081775 | 1554081777 | /r/SuicideWatch/comments/b7vi8w/i_have_a_plan/ | https://www.reddit.com/r/SuicideWatch/comments... | 1 | i have a plan | idk. i don’t want to do this anymore |
| 1 | hungryyyfordeathv2 | t2_3ik6dp65 | SuicideWatch | b7vhu2 | 1554081705 | 1554081707 | /r/SuicideWatch/comments/b7vhu2/oof_an_update/ | https://www.reddit.com/r/SuicideWatch/comments... | 0 | oof: an update | so, it was day three of VSED but my parents fo... |
| 2 | im-not-worth-it | t2_15fss0 | SuicideWatch | b7vc47 | 1554080763 | 1554080764 | /r/SuicideWatch/comments/b7vc47/when_you_never... | https://www.reddit.com/r/SuicideWatch/comments... | 0 | When you never asked to be born and hate your ... | ...not giving a damn about how you feel. |
| 3 | beyondphilosophy1996 | t2_23ywnrzy | SuicideWatch | b7v99u | 1554080307 | 1554080307 | /r/SuicideWatch/comments/b7v99u/i_think_tonigh... | https://www.reddit.com/r/SuicideWatch/comments... | 1 | I think tonight's the night, and I just want t... | I'm not special, millions and millions of peop... |
| 4 | HeWillBeCalledEzra | t2_3e3vfshi | SuicideWatch | b7v7h4 | 1554080004 | 1554080005 | /r/SuicideWatch/comments/b7v7h4/goodbye_my_not... | https://www.reddit.com/r/SuicideWatch/comments... | 3 | Goodbye. My notes.. I'm giving up | "This is my first letter. I have two. If you'r... |


In [26]:
df_depression.head()

| | author | author_fullname | subreddit | id | created_utc | retrieved_on | permalink | url | num_comments | title | selftext |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 89throwawayX | t2_3insi9we | depression | b7vft8 | 1554081373 | 1554081375 | /r/depression/comments/b7vft8/i_will_have_to_c... | https://www.reddit.com/r/depression/comments/b... | 0 | I will have to cut ties w/ my family and it hu... | [removed] |
| 1 | gothic_reality | t2_39c20e2f | depression | b7vfje | 1554081327 | 1554081328 | /r/depression/comments/b7vfje/how_do_i_explain... | https://www.reddit.com/r/depression/comments/b... | 0 | How do i explain to my mom that liking dark st... | I love my mom, but sometimes she overreacts to... |
| 2 | 3453456346346 | t2_335d0qqi | depression | b7vf2d | 1554081250 | 1554081251 | /r/depression/comments/b7vf2d/i_want_to_lie_do... | https://www.reddit.com/r/depression/comments/b... | 0 | I want to lie down on a bed and curl up to sle... | I hate doing anything - i hate typing this but... |
| 3 | ravenkills273 | t2_1493cg | depression | b7veer | 1554081137 | 1554081138 | /r/depression/comments/b7veer/the_phrase_life_... | https://www.reddit.com/r/depression/comments/b... | 1 | The phrase “life is short” is bullshit. | I’m 21 years old, and everyone tells me that t... |
| 4 | sofresh510 | t2_11m461 | depression | b7vb7c | 1554080619 | 1554080620 | /r/depression/comments/b7vb7c/why_is_it_that_i... | https://www.reddit.com/r/depression/comments/b... | 0 | Why is it that I just want the day to end alre... | [removed] |


### Saving Dataframes to CSV files

In [28]:
# df_depression.to_csv('../data/df_depression.csv',index=True)

# df_suicidewatch.to_csv('../data/df_suicidewatch.csv',index=True)