---
# NOTEBOOK 1 - DATA COLLECTION WITH PUSHSHIFT API
---

# Project 3:
# 'Wait, was that a joke?' - Language Model Identification of Absurdist Humor and Satire
## Daniel Rossetti

# Problem Statement:
This project is conducted from the standpoint of a data scientist hired by a university researching the relatability of AI chat bots and their ability to identify nuances of human language, particularly humor.  Some statements are considered funny but are not necessarily presented as a joke.  The task is to develop a language model that can identify humorous strings of text which are not structured as jokes but instead resemble factual information.  Text must be sourced to train and test a model which can differentiate between humorous and non-humorous statements.

**Can an NLP model be trained to recognize satire or absurdist humor?**

# Project Approach:
The Onion is a satirical news organization which produces news titles and stories, often relevant to real, current events, that are satirical or would otherwise be classified as absurdist humor.  For clarity, **news articles and titles from The Onion are not real.**  The titles are, however, formatted like actual news titles and articles, which are not humorous when they come from legitimate news agencies.

Comparing the titles of posts to the subreddit r/TheOnion against the subreddit r/worldnews provides a way to compare strings of text which share many of the same formatting characteristics, but have completely different goals with respect to humor.

These subreddits are to be scraped and the data processed to see if a language model can distinguish the humorous Onion titles from the factual World News titles.

# Project 3 Data Collection
Note that subsequent project notebooks will refer to this notebook.  Because running this notebook provides data for a fixed point in time and could potentially overwrite data collected previously, it is treated as a stand-alone notebook.  Timestamps are saved to a text file so that this code can be modified to duplicate the run shown in this notebook.
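
The timestamps used throughout this notebook are Unix timestamps (seconds since 1970-01-01 UTC).  When checking which point in time a run covers, it can help to convert between these values and readable dates.  The following is a minimal sketch using only the standard library; the helper names `to_unix` and `from_unix` are illustrative and are not used elsewhere in this project.

```python
from datetime import datetime, timezone

def to_unix(dt):
    """Return the Unix timestamp (whole seconds) for a timezone-aware datetime."""
    return int(dt.timestamp())

def from_unix(ts):
    """Return the UTC datetime for a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

start_of_2020 = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(to_unix(start_of_2020))             # 1577836800
print(from_unix(1577836800).isoformat())  # 2020-01-01T00:00:00+00:00
```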

Data are sourced using the Pushshift API ([link](https://github.com/pushshift/api)) for the [r/TheOnion](https://www.reddit.com/r/TheOnion/) and [r/worldnews](https://www.reddit.com/r/worldnews/) subreddits.

# CONTENT WARNING - INAPPROPRIATE LANGUAGE
The following should be noted before continuing with this notebook:
* The subreddit posts collected (and therefore shown or presented in these notebooks) may contain profanity or vulgarities, or be otherwise NSFW (Not Safe for Work), because the posts are collected before the subreddit moderators can remove them and because The Onion commonly uses this language for the sake of humor
* The subreddit posts were not generated by the author of this project/notebooks and do not represent his opinions


# Initial Package Imports

In [20]:
import requests
import pandas as pd

# Source to get current Unix Timestamp for the pushshift API:
#  https://stackoverflow.com/questions/16755394/what-is-the-easiest-way-to-get-current-gmt-time-in-unix-timestamp-format
import time

# Define Data Gathering Function
This function accomplishes the following:
* Connects to the internet
* Retrieves data with the pushshift API
* Gathers data in increments of approximately 1000 submissions, known as 'trials'
* Prints the HTTP status code after each trial
* Creates timestamps for when the data are pulled and for the time before which data are collected in each trial (the 'before' parameter)
    * The pushshift API has a parameter 'before', which is redefined at each trial to be the unix timestamp of the oldest submission in the previous trial.  By default pushshift returns the most recent submissions, and only about 1000 submissions at a time.  To get more than 1000 submissions, the 'before' parameter must be set at each trial so that submissions prior to the oldest previously gathered submission are collected
* Saves these timestamps to a text file so that they can be referenced if the data need to be replicated
    * The text file is named with the 'name' parameter and the unix timestamp at which the data were pulled, to automatically differentiate these data from any prior or subsequent calls of this function to the same subreddit
* Saves a dataframe, which concatenates the data from all trials, to a csv file
    * The csv file is named with the 'name' parameter and the unix timestamp at which the data were pulled, to automatically differentiate these data from any prior or subsequent calls of this function to the same subreddit
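
The 'before' logic described in the list above can be illustrated without any network access.  This is only a sketch of the idea, not the function used in this notebook: `paginate_before` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the API by serving made-up posts with timestamps 1 through 10, newest first.

```python
def paginate_before(fetch_page, start_before, max_pages):
    """Collect records page by page, walking backwards in time.

    fetch_page(before) is a hypothetical callable that returns a list of
    dicts, newest first, each with a 'created_utc' key older than `before`.
    """
    records = []
    before = start_before
    cutoffs = [before]          # the 'before' value used for each page
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:            # nothing older is available, so stop early
            break
        records.extend(page)
        before = page[-1]['created_utc']   # oldest item on this page
        cutoffs.append(before)
    return records, cutoffs


# Stand-in for the API: fake posts with timestamps 1..10, served 3 at a time.
def fake_fetch(before, size=3):
    older = [t for t in range(10, 0, -1) if t < before]
    return [{'created_utc': t} for t in older[:size]]

records, cutoffs = paginate_before(fake_fetch, start_before=11, max_pages=5)
print(len(records))   # 10
print(cutoffs)        # [11, 8, 5, 2, 1]
```

Each page's oldest timestamp becomes the cutoff for the next request, which is how more than one page of submissions can be gathered.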

**NOTE:  No problems with server requests were encountered when running this function.  The main challenge was that pushshift was often down, which prevented any request from going through.  When it did work, the function below gathered the data efficiently.**
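
Since the service was often unavailable, one option (not used in this notebook) is to retry a failed request a few times before giving up.  The sketch below is an assumption about how that could look; `call_with_retries` is a hypothetical helper, and the fake responses exist only so the example runs offline.

```python
import time
from types import SimpleNamespace

def call_with_retries(request_fn, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call request_fn() until it returns a 200 status code or attempts run out.

    request_fn is any zero-argument callable returning an object with a
    .status_code attribute (for example, a lambda wrapping requests.get).
    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = request_fn()
            if response.status_code == 200:
                return response
            print(f'Attempt {attempt}: status {response.status_code}')
        except Exception as err:  # e.g. a connection error while the service is down
            print(f'Attempt {attempt}: request failed ({err})')
        if attempt < max_attempts:
            sleep(wait_seconds * attempt)  # wait a little longer after each failure
    return None


# Offline demonstration: the first fake response fails, the second succeeds.
responses = iter([SimpleNamespace(status_code=503), SimpleNamespace(status_code=200)])
result = call_with_retries(lambda: next(responses), wait_seconds=0)
print(result.status_code)  # 200
```

With the real API, the callable could be something like `lambda: requests.get(url_fnc, params_getter)`.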

In [27]:
def data_getter(subreddit, trials, name, input_time = 'now'):
#=====  INITIAL LOCAL VARIABLES  =================================================================================
    
    # Create the first instance of the timestamp list
    if input_time == 'now':
        right_now = round(time.time())
    else:
        right_now = input_time
    
    # Start a list of timestamps
    pull_times = [right_now]
    
    # Establish base url:
    url_fnc = 'https://api.pushshift.io/reddit/search/submission'
    
    # Establish initial parameters for most current subreddit pull:
    params_getter = {
    'subreddit': subreddit,
    'size': 1000,
    'before': right_now
    }
    
    # Create an empty dataframe to which we can concatenate each run
    master_df = pd.DataFrame()
    
#=====  Gathering the data from reddit/pushshift // Create DataFrame // Concatenate to Master  ===================
    
    # Get the data:
    res_fnc = requests.get(url_fnc, params_getter)
    print(res_fnc.status_code) # for debugging, prints the website status code to show function progress
    
    # Make dataframe:
    df = res_fnc.json() # Dump data to a json
    df = pd.DataFrame(df['data']) # Pull the 'data' dictionary out of the json
    
    # Concatenate to master
    master_df = pd.concat([master_df, df])

#=====  Iterate the above steps over remaining trials  ===========================================================
    for i in range(trials - 1): # Iterate over the remaining trials (the first one was run above)
        
        # Stop early if the previous trial returned no submissions
        if df.empty:
            break
        
        # Update the before parameter to be the created time (in utc) of the last item in the previous trial's dataframe
        params_getter = {
        'subreddit': subreddit,
        'size': 1000,
        'before': int(df['created_utc'].iloc[-1])
        }
        
        # Add the 'before' value used for this trial to the pull_times list:
        pull_times.append(params_getter['before'])
        
        # Get the data:
        res_fnc = requests.get(url_fnc, params_getter)
        print(res_fnc.status_code) # for debugging

        # Make dataframe:
        df = res_fnc.json() # Dump data to a json
        df = pd.DataFrame(df['data']) # Pull the 'data' dictionary out of the json

        # Concatenate to master
        master_df = pd.concat([master_df, df])
        
        
#=====  Create a text file with all the pull times for replicability =====================================================================
    # Source inspiring this code:  https://www.guru99.com/reading-and-writing-files-in-python.html
    # The 'with' block closes the file even if the write fails
    with open(f'../data/{name}_pulltimes_{right_now}.txt', 'w') as f:
        f.write(f'{pull_times}')

#=====  Finally, return the fully concatenated master dataframe, reset index, store to csv  =================================================
    master_df = master_df.reset_index(drop = True) # reset_index returns a new dataframe, so it must be reassigned.  Source for refresher on how to use this:  https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
    master_df.to_csv(f'../data/{name}_{right_now}.csv')
    return master_df.head()
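
The pull-times file written by the function above contains the printed form of a Python list, so it can be read back with `ast.literal_eval`.  A minimal sketch, assuming the file was written by `data_getter`; the helper name `load_pull_times`, the example path, and the timestamps in the demonstration are made up for illustration.

```python
import ast
import os
import tempfile

def load_pull_times(path):
    """Read a pull-times text file and return its contents as a list of ints."""
    with open(path) as f:
        return ast.literal_eval(f.read())


# Demonstration with a temporary file and made-up timestamps.
example = [1682380000, 1682370000, 1682360000]
path = os.path.join(tempfile.mkdtemp(), 'example_pulltimes.txt')
with open(path, 'w') as f:
    f.write(f'{example}')

pull_times = load_pull_times(path)
print(pull_times[0])  # 1682380000
```

The first entry is the starting timestamp of the run, so passing it as `input_time` starts a new call to `data_getter` from the same point in time.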

# Gather Data for TheOnion and WorldNews Subreddits

In [5]:
data_getter('theonion', 6, 'theonion')

200
200
200
200
200
200


Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,is_gallery,media_metadata,gallery_data,crosspost_parent_list,crosspost_parent,author_created_utc,retrieved_on,call_to_action,author_cakeday,removal_reason
0,TheOnion,,t2_4a27h,0,Idiot Tornado Tears Harmlessly Through Empty F...,[],r/TheOnion,False,6,,...,,,,,,,,,,
1,TheOnion,,t2_3jamc,0,New Texas Law Requires Schools To Display Imag...,[],r/TheOnion,False,6,,...,,,,,,,,,,
2,TheOnion,,t2_3jamc,0,New Poll Finds Americans Would Respect Biden M...,[],r/TheOnion,False,6,,...,,,,,,,,,,
3,TheOnion,,t2_3jamc,0,Could You Pass Racial Discrimination Training ...,[],r/TheOnion,False,6,,...,,,,,,,,,,
4,TheOnion,,t2_3jamc,0,Dog And Owner Having Public Fight,[],r/TheOnion,False,6,,...,,,,,,,,,,


In [6]:
data_getter('worldnews', 6, 'worldnews')

200
200
200
200
200
200


Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,...,num_crossposts,media,is_video,retrieved_utc,updated_utc,utc_datetime_str,crosspost_parent_list,crosspost_parent,author_cakeday,link_flair_template_id
0,worldnews,,t2_8q2g97db4,0,The parents of a 10-year-old boy living with a...,[],r/worldnews,False,6,,...,0,,False,1682378382,1682378383,2023-04-24 23:19:26,,,,
1,worldnews,,t2_8q2g97db4,0,The parents of a 10-year-old boy living with a...,[],r/worldnews,False,6,,...,0,,False,1682378313,1682378314,2023-04-24 23:18:19,,,,
2,worldnews,,t2_dss8b,0,Mexico finds tons of liquid meth in tequila bo...,[],r/worldnews,False,6,,...,0,,False,1682377268,1682377268,2023-04-24 23:00:56,,,,
3,worldnews,,t2_9xhkarmen,0,"Tucker Carlson Leaving Fox News, Last Episode ...","[{'e': 'text', 't': 'Not Appropriate Subreddit'}]",r/worldnews,False,6,normal,...,0,,False,1682377130,1682377131,2023-04-24 22:58:38,,,,
4,worldnews,,t2_2aex0igh,0,Film explores B.C. woman’s experience with mag...,[],r/worldnews,False,6,,...,0,,False,1682376678,1682376679,2023-04-24 22:51:04,,,,


# Create Unseen Final Datasets

In [25]:
data_getter('theonion', 1, 'theonion', 1578009619)

200


Unnamed: 0,all_awardings,allow_live_comments,archived,author,author_created_utc,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,url,whitelist_status,wls,retrieved_utc,updated_utc,utc_datetime_str,crosspost_parent,crosspost_parent_list,author_cakeday,link_flair_template_id
0,[],False,False,SlovenianCat,1497181000.0,,,[],,,...,https://youtu.be/Tmvw7N-Nn1U,all_ads,6,1586956866,1679598886,2020-01-02 23:55:13,,,,
1,[],False,False,PM_ME_UR_PERM,1576314000.0,,,[],,,...,https://quizzes.clickhole.com/is-your-flamingo...,all_ads,6,1586954159,1679598575,2020-01-02 17:27:16,,,,
2,[],False,False,[deleted],,,,,,,...,https://i.redd.it/3j7hjwuudd841.jpg,all_ads,6,1586952809,1679598422,2020-01-02 13:47:08,,,,
3,[],False,False,LeeSinSmokesWeed,1386045000.0,,,[],,,...,https://www.youtube.com/watch?v=rYaZ57Bn4pQ,all_ads,6,1586950757,1679598222,2020-01-02 05:49:02,,,,
4,[],True,False,PM_ME_UR_PERM,1576314000.0,,,[],,,...,https://www.theonion.com/so-people-could-be-li...,all_ads,6,1586947769,1679597882,2020-01-01 21:27:55,,,,


In [26]:
data_getter('worldnews', 1, 'worldnews', 1680688103)

200


Unnamed: 0,subreddit,selftext,gilded,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,thumbnail_height,...,author_fullname,author_premium,author_flair_richtext,post_hint,author_flair_type,preview,author_patreon_flair,crosspost_parent_list,crosspost_parent,author_cakeday
0,worldnews,[removed],0,Guinness certifies world's deepest fish found ...,[],r/worldnews,False,6,,78.0,...,,,,,,,,,,
1,worldnews,,0,Russia looking for ways to plug gaps in budget...,"[{'e': 'text', 't': 'Russia/Ukraine'}]",r/worldnews,False,6,russia,78.0,...,t2_eo9dxhcn,False,[],link,text,{'images': [{'source': {'url': 'https://extern...,False,,,
2,worldnews,,0,Sanna Marin to step down as SDP leader (Finland),[],r/worldnews,False,6,,73.0,...,t2_10ysf4hw,False,[],link,text,{'images': [{'source': {'url': 'https://extern...,False,,,
3,worldnews,,0,Checkout this video boom,[],r/worldnews,False,6,,,...,t2_8l2iii52i,False,[],,text,,False,,,
4,worldnews,,0,What is a recession? Are we in recession now?,[],r/worldnews,False,6,,,...,t2_7i8iri4wi,False,[],,text,,False,,,
