# Project 3: Web APIs & Classification

**Problem Statement:** While performing maintenance, an engineer accidentally deleted multiple posts from r/nottheonion and r/theonion. Unfortunately, the engineer was only able to recover the titles of the lost posts.

We were therefore tasked with building a classification model, trained on posts submitted before 1 January 2022, that assigns the recovered posts back to their respective subreddits, r/nottheonion and r/theonion, based solely on the post titles.

This model would also serve as a proof of concept for an automated moderator that deletes posts which do not belong in the subreddit they were posted to. Bots have increasingly been spamming subreddits with irrelevant posts.

Moderators currently spend a substantial amount of their time reviewing user reports and deleting spam posts. Automating this policing would free up time for the human moderators, who are volunteers.


**Web Data Scraping Using the Pushshift API**


In [1]:
#import for API 
from psaw import PushshiftAPI
import numpy as np
import pandas as pd

In [2]:
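# Scrape up to 2,500 submissions from a subreddit via Pushshift
def data_scrape(subreddit):

    # Instantiate the Pushshift client
    api = PushshiftAPI()

    # Fetch submissions created before 2022-01-01 00:00:00 UTC (epoch 1640995200)
    scrape_l = list(api.search_submissions(subreddit=subreddit,
                                before=1640995200,
                                filter=['title', 'subreddit', 'num_comments', 'author', 'subreddit_subscribers', 'score', 'domain', 'created_utc'],
                                limit=2500))

    # Keep only the fields used later. Each result is read by field name
    # rather than by tuple position, so a post that lacks one field cannot
    # shift the remaining values into the wrong column.
    clean_scrape_1 = []
    for post in scrape_l:
        scrape_dict = {}
        scrape_dict['subreddit'] = getattr(post, 'subreddit', None)
        scrape_dict['author'] = getattr(post, 'author', None)
        scrape_dict['domain'] = getattr(post, 'domain', None)
        scrape_dict['title'] = getattr(post, 'title', None)
        scrape_dict['num_comments'] = getattr(post, 'num_comments', None)
        scrape_dict['score'] = getattr(post, 'score', None)
        scrape_dict['timestamp'] = getattr(post, 'created_utc', None)
        clean_scrape_1.append(scrape_dict)

    # Print the subscriber count recorded on the second result
    print(subreddit, 'subscribers:', getattr(scrape_l[1], 'subreddit_subscribers', None))

    # Return a list of dicts, one per post
    return clean_scrape_1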
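
The `before` argument passed to `search_submissions` is a Unix epoch timestamp. As a minimal sketch using only the standard library, the cutoff can be derived from a calendar date instead of being hard-coded. The helper name `to_epoch` and the variable `cutoff` are illustrative assumptions and do not appear in the original notebook.

```python
from datetime import datetime, timezone

# Hypothetical helper, not part of the original notebook
def to_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

# Posts must have been submitted before 1 January 2022 (UTC)
cutoff = to_epoch(2022, 1, 1)
print(cutoff)  # 1640995200, the value used for `before`
```

Deriving the value this way makes it easier to see which date the scrape is bounded by, and to change that date without looking up the epoch by hand.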

In [3]:
# create DataFrame
not_onion = pd.DataFrame(data_scrape('nottheonion'))



nottheonion subscribers: 20438921


In [4]:
# Save data to csv (index=False keeps the row index out of the file)
not_onion.to_csv('./not_onion.csv', index=False)

In [5]:
# check shape
print(f'not_onion shape: {not_onion.shape}')

# check head
not_onion.head()

not_onion shape: (2497, 7)


| | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | nottheonion | Taco_duck68 | wral.com | Man attempts to pay for car with rap, steals p... | 0 | 1 | 1640995192 |
| 1 | nottheonion | BlackNingaa | bloodyelbow.com | Former UFC fighter reveals past as sex worker ... | 1 | 1 | 1640994707 |
| 2 | nottheonion | Lopsided_File_1642 | facebook.com | Log into Facebook | 1 | 1 | 1640991506 |
| 3 | nottheonion | SkinnyWhiteGirl19 | theartnewspaper.com | McDonald’s blocked from building drive-through... | 0 | 1 | 1640990429 |
| 4 | nottheonion | kids-cake-and-crazy | kjrh.com | Legendary actress Betty White dies at 99 on Ne... | 0 | 1 | 1640989181 |
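
The `timestamp` column holds Unix epoch seconds, which are hard to read at a glance. The following is a small sketch, using only the standard library and the five timestamps shown in the rows above, that checks they fall before the scrape cutoff and converts them to readable UTC times. The variable names are illustrative assumptions.

```python
from datetime import datetime, timezone

CUTOFF = 1640995200  # the `before` value used in the scrape

# Timestamps copied from the five r/nottheonion rows shown above
sample_timestamps = [1640995192, 1640994707, 1640991506, 1640990429, 1640989181]

# Every scraped post should predate the cutoff
all_before_cutoff = all(ts < CUTOFF for ts in sample_timestamps)

# Convert epoch seconds to human-readable UTC strings
readable = [
    datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d %H:%M:%S')
    for ts in sample_timestamps
]

print(all_before_cutoff)  # True
print(readable[0])        # 2021-12-31 23:59:52
```

On the full DataFrame, the same conversion could be done with `pd.to_datetime(not_onion['timestamp'], unit='s', utc=True)`.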


In [6]:
# create onion DataFrame
onion = pd.DataFrame(data_scrape('TheOnion'))



TheOnion subscribers: 165298


In [7]:
# Save data to csv (index=False keeps the row index out of the file)
onion.to_csv('./onion.csv', index=False)

In [8]:
# check shape
print(f'onion shape: {onion.shape}')

# check head
onion.head()

onion shape: (2497, 7)


| | subreddit | author | domain | title | num_comments | score | timestamp |
|---|---|---|---|---|---|---|---|
| 0 | TheOnion | mothershipq | theonion.com | Surgeon Kind Of Pissed Patient Seeing Her Defo... | 0 | 1 | 1640973300 |
| 1 | TheOnion | -ImYourHuckleberry- | theartnewspaper.com | McDonald’s blocked from building drive-through... | 1 | 1 | 1640971771 |
| 2 | TheOnion | dwaxe | theonion.com | Gwyneth Paltrow Touts New Diamond-Encrusted Tr... | 0 | 1 | 1640955671 |
| 3 | TheOnion | dwaxe | theonion.com | Artist Crafting Music Box Hopes It Delights At... | 0 | 1 | 1640955669 |
| 4 | TheOnion | dwaxe | theonion.com | Homeowner Trying To Smoke Out Snakes Accidenta... | 0 | 1 | 1640955668 |
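
Row 1 above links to theartnewspaper.com rather than theonion.com, and its truncated title matches one in the r/nottheonion sample, which suggests that not every post in r/TheOnion is an Onion article. As a hedged sketch, the domains of the five rows shown can be tallied with the standard library to surface such posts. The variable names are illustrative, and whether these rows should be kept is a modelling decision this sketch does not make.

```python
from collections import Counter

# Domains copied from the five r/TheOnion rows shown above
onion_domains = [
    'theonion.com',
    'theartnewspaper.com',
    'theonion.com',
    'theonion.com',
    'theonion.com',
]

# Count how often each domain appears
domain_counts = Counter(onion_domains)

# Domains other than the subreddit's own source
off_site = [domain for domain in domain_counts if domain != 'theonion.com']

print(domain_counts.most_common())  # [('theonion.com', 4), ('theartnewspaper.com', 1)]
print(off_site)                     # ['theartnewspaper.com']
```

On the full DataFrame, `onion['domain'].value_counts()` gives the same kind of tally.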
