Reference: https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

Twitter Reference for Celina: https://towardsdatascience.com/how-to-access-twitters-api-using-tweepy-5a13a206683b

In [1]:
from psaw import PushshiftAPI

import datetime as dt
import pandas as pd
import numpy as np

In [2]:
api = PushshiftAPI()

The cells below pull submission details from the specified subreddit. They take quite a while to run (about 5 minutes for /r/TheOnion and 2-3 hours for /r/NotTheOnion).

In [6]:
start_epoch = int(dt.datetime(2019, 1, 1).timestamp())

onion_list = list(api.search_submissions(
    after=start_epoch,
    subreddit='TheOnion',
    filter=['title', 'subreddit', 'score', 'created_utc', 'num_comments'],
))
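
One caveat on `start_epoch`: calling `.timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the cutoff shifts by the UTC offset of wherever the notebook runs. A minimal sketch, assuming a UTC cutoff is what is wanted (the names `start_epoch_local` and `start_epoch_utc` are illustrative, not from the notebook):

```python
import datetime as dt

# Naive datetime: .timestamp() assumes the machine's local time zone.
start_epoch_local = int(dt.datetime(2019, 1, 1).timestamp())

# Timezone-aware datetime: the result is the same on every machine.
start_epoch_utc = int(
    dt.datetime(2019, 1, 1, tzinfo=dt.timezone.utc).timestamp()
)

print(start_epoch_utc)  # 1546300800
```

The two values differ only by the local UTC offset, so this matters mostly for posts created within a few hours of the cutoff.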

In [104]:
start_epoch = int(dt.datetime(2019, 1, 1).timestamp())

notonion_list = list(api.search_submissions(
    after=start_epoch,
    subreddit='NotTheOnion',
    filter=['title', 'subreddit', 'score', 'created_utc', 'num_comments'],
))

In [5]:
print('Length of onion_list: ', len(onion_list))
print('Length of notonion_list: ', len(notonion_list))

Length of onion_list:  1428


NameError: name 'notonion_list' is not defined

We can now build a dataframe. First, collect each post's title and score into a list of tuples, then convert that list into a dataframe. To keep track of the source after the final merge, we insert a new column called 'Onion' that flags whether the post is from the /r/TheOnion subreddit.

In [35]:
# Collect a (title, score) pair for every /r/TheOnion submission.
onion_title_list = [(entry.title, entry.score) for entry in onion_list]
    
onion_title_list

[('Candidate profile: Democrat vp nominee Kamala Harris', 1),
 ('‘Damn You’ Shouts Contact Tracer Losing Track Of Coronavirus After It Catches Hold Of Helicopter’s Ladder',
  1),
 ('NASA Announces Plans To Launch Chimpanzee Into Sun', 1),
 ('NASA Announces Plans To Launch Chimpanzee Into Sun', 1),
 ('Trump Online Store Begins Selling Decommissioned USPS Mailboxes So Fans Can Own Piece Of History',
  1),
 ('‘Damn You’ Shouts Contact Tracer Losing Track Of Coronavirus After It Catches Hold Of Helicopter’s Ladder',
  1),
 ('Adam Silver Warns Player Against Leaving Bubble For Strip Clubs With Lackluster Talent',
  1),
 ('Eric Trump Tapes Karaoke Machine To Don Jr.’s Chest As Part Of Final Preparations To Spy On China',
  1),
 ('NASA to rename cosmic objects', 1),
 ('Federal Prisons Reinstitute Executions By Lethal Inflation', 1),
 ('New Patriotic Gatorade Ad Shows Terrorists Being Waterboarded With Gatorade',
  1),
 ('Mentally Unbalanced Man Still Waiting For The Right Trump Comment To Inc

In [52]:
df_onion = pd.DataFrame(onion_title_list)

df_onion

Unnamed: 0,0,1
0,Candidate profile: Democrat vp nominee Kamala ...,1
1,‘Damn You’ Shouts Contact Tracer Losing Track ...,1
2,NASA Announces Plans To Launch Chimpanzee Into...,1
3,NASA Announces Plans To Launch Chimpanzee Into...,1
4,Trump Online Store Begins Selling Decommission...,1
...,...,...
4665,Hungover Man Horrified To Learn He Made Dozens...,1
4666,"Report: 750,000 Americans Die Each Year During...",1
4667,New Year’s Resolution,1
4668,Kotex Introduces New Confetti Popper Tampons F...,1


In [53]:
df_onion.insert(loc=0, column='Onion', value=1)
df_onion.columns = ['Onion', 'Title', 'Score']
df_onion.shape

(4670, 3)

In [66]:
df_onion.columns = ['Onion', 'Title', 'Score']

df_onion = df_onion[df_onion['Score'] >= 2]
df_onion.shape

(2340, 3)
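
The raw list printed earlier contains repeated titles (the NASA chimpanzee headline appears twice, for example). If repeats are unwanted, one option is to drop them before applying the score threshold. This is a sketch on made-up rows, not a step the notebook performs; `df_demo` and its contents are placeholders:

```python
import pandas as pd

# Hypothetical rows standing in for df_onion.
df_demo = pd.DataFrame({
    "Onion": [1, 1, 1, 1],
    "Title": ["Headline A", "Headline A", "Headline B", "Headline C"],
    "Score": [5, 5, 1, 12],
})

# Drop rows whose title repeats, keeping the first occurrence.
df_demo = df_demo.drop_duplicates(subset="Title")

# Keep only posts at or above the score threshold (2, as in the cell above).
df_demo = df_demo[df_demo["Score"] >= 2]

print(df_demo["Title"].tolist())  # ['Headline A', 'Headline C']
```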

In [55]:
df_onion.to_csv('onion_title_list_2019.csv', index = False)

We can now do the same for the posts from /r/NotTheOnion. Again, we create a column called 'Onion', but set it to 0 to indicate that the post is **not** from /r/TheOnion.

In [67]:
notonion_title_list = []

for entry in notonion_list:
    notonion_title_list.append((entry.title, entry.score))

NameError: name 'notonion_list' is not defined

In [59]:
df_notonion = pd.DataFrame(notonion_title_list)

NameError: name 'notonion_title_list' is not defined

In [61]:
df_notonion.insert(loc=0, column='Onion', value=0)
df_notonion.columns = ['Onion', 'Title', 'Score']
df_notonion.shape

ValueError: cannot insert Onion, already exists
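
The `ValueError` above is what `DataFrame.insert` raises when the target column already exists, which happens when the cell is run a second time on the same frame. A sketch of a re-runnable alternative, using made-up rows (`sample_rows` and `df_sample` are placeholder names, not from the notebook): name the columns at construction time and set the label by plain assignment, which overwrites instead of failing.

```python
import pandas as pd

# Hypothetical (title, score) pairs standing in for notonion_title_list.
sample_rows = [("Headline A", 150), ("Headline B", 3)]

# Name the columns when building the frame instead of renaming afterwards.
df_sample = pd.DataFrame(sample_rows, columns=["Title", "Score"])

# Plain assignment is idempotent: re-running it just overwrites the column.
df_sample["Onion"] = 0
df_sample["Onion"] = 0  # simulated second run, no error

# Put the label first to match the notebook's column order.
df_sample = df_sample[["Onion", "Title", "Score"]]

print(df_sample.shape)  # (2, 3)
```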

In [65]:
df_notonion.columns = ['Onion', 'Title', 'Score']

df_notonion = df_notonion[df_notonion['Score'] >= 100]
df_notonion.shape

(3231, 3)

In [123]:
df_notonion.to_csv('notonion_title_list_2019.csv', index = False)

In [64]:
df_notonion = pd.read_csv('notonion_title_list_2019_upvote100.csv')

In [4]:
# start_epoch=int(dt.datetime(2019, 1, 1).timestamp())

# news_list = list(api.search_submissions(after= start_epoch,
#                             subreddit='News',
#                             filter=['title', 'subreddit', 'score'],))

In [40]:
news_title_list = []

for entry in news_list:
    news_title_list.append((entry.title, entry.score))

In [41]:
df_news = pd.DataFrame(news_title_list)

In [42]:
df_news.insert(loc=0, column='Onion', value=0)
df_news.columns = ['Onion', 'Title', 'Score']
df_news.shape

(901522, 3)

In [43]:
df_news.columns = ['Onion', 'Title', 'Score']

df_news = df_news[df_news['Score'] >= 100]
df_news.shape

(12396, 3)

In [21]:
df_news.to_csv('news_title_list_2019_upvote100.csv', index = False)

In [22]:
df_onion = pd.read_csv('onion_title_list_2019_upvote100.csv')

Now we'll combine the /r/TheOnion and /r/News data frames by stacking their rows into a single data frame.

In [44]:
frames = [df_onion, df_news]

df_merge = pd.concat(frames)

In [45]:
df_merge.tail()

Unnamed: 0,Onion,Title,Score
844415,0,Peel police disappointed over Amber Alert comp...,681
844429,0,Amazon made an $11.2bn profit in 2018 but paid...,134
844455,0,Judge keeps most Keystone XL pipeline work on ...,193
844457,0,Informant didn't buy drugs from couple killed ...,2183
844475,0,"Gunman kills 5, wounds 5 officers in workplace...",409
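
The index values in the `tail()` output (844415 and so on) are carried over from `df_news` as it was before filtering, because `pd.concat` keeps each frame's original index by default. This does not affect the CSV written with `index = False`, but if a clean 0..n-1 index is preferred in memory, `ignore_index=True` is one way to get it. A sketch with placeholder frames (`df_a` and `df_b` are not from the notebook):

```python
import pandas as pd

# Placeholder frames with non-contiguous indexes, as left behind by filtering.
df_a = pd.DataFrame(
    {"Onion": [1, 1], "Title": ["A", "B"], "Score": [3, 7]}, index=[4, 9]
)
df_b = pd.DataFrame(
    {"Onion": [0], "Title": ["C"], "Score": [120]}, index=[844475]
)

# Default: original index labels are preserved.
df_default = pd.concat([df_a, df_b])

# ignore_index=True: rows are renumbered from 0.
df_reset = pd.concat([df_a, df_b], ignore_index=True)

print(df_default.index.tolist())  # [4, 9, 844475]
print(df_reset.index.tolist())    # [0, 1, 2]
```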


In [46]:
df_merge.to_csv('merge_onionandnews_2019_upvote100.csv', index = False)