# Project 3: Subreddit Classification

## Problem Statement

Our customer, a prominent auto magazine publisher, has engaged us to accurately classify posts from the subreddits r/cars and r/motorcycles, in an effort to better understand the amount of subreddit participation between cars and motorcycles. They will use this data to position their editorials to cater to the masses.

In this project, we will use the [Pushshift API](https://github.com/pushshift/api) to scrape posts from the two subreddits mentioned above on [reddit.com](https://www.reddit.com).
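
To make the request shape concrete, here is a minimal sketch of how the URL for one page of results might be assembled. The endpoint and the `subreddit`, `size`, and `before` parameters are the ones used by the scraper in this notebook; the helper name `build_query_url` and the timestamp are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Endpoint used by the scraper in this notebook.
PUSHSHIFT_URL = 'https://api.pushshift.io/reddit/search/submission'


def build_query_url(subreddit, size=100, before=None):
    """Hypothetical helper: return the URL for one page of submissions."""
    params = {'subreddit': subreddit, 'size': size, 'before': before}
    # Drop unset parameters, mirroring how requests omits None-valued params.
    params = {key: value for key, value in params.items() if value is not None}
    return f'{PUSHSHIFT_URL}?{urlencode(params)}'


first_page = build_query_url('cars')
next_page = build_query_url('cars', before=1640995200)  # illustrative timestamp
print(first_page)
print(next_page)
```

The first request carries no `before` value, so it starts from the latest post; later requests pass the timestamp of the oldest post seen so far.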

## Importing libraries

In [1]:
import requests
import pandas as pd
import numpy as np
import random
import time

## Defining a function for repeatability

In [2]:
class ScrapeError(Exception):
    '''Raised when the Pushshift API returns a non-200 status code.'''


def reddit_scraper(subreddit, posts=100):
    '''
    Scrape submissions from the specified subreddit, 100 posts per request.

    subreddit = str, subreddit name without the "r/" prefix

    posts = int, in multiples of 100

    Returns a DataFrame of the scraped posts with duplicate rows dropped.
    '''
    # quick sanity check on function call
    if posts <= 0 or posts % 100 != 0:
        raise ValueError('Please enter a value for posts in multiples of 100')

    url = 'https://api.pushshift.io/reddit/search/submission'
    header = {'User-agent': 'this-knee-fan'}
    columns = ['selftext', 'title', 'upvote_ratio', 'subreddit', 'author', 'is_self']
    before = None  # None on the first request so we start from the latest post
    frames = []

    for i in range(1, posts // 100 + 1):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before
        }

        req = requests.get(url, params, headers=header)
        # check the status before parsing, since an error response may not be JSON
        if req.status_code != 200:
            raise ScrapeError(f'Scraping error {req.status_code}')

        js = req.json()
        print(f'Scraping {i}00 \tStatus code: {req.status_code}')
        before = js['data'][-1]['created_utc']  # before is now the earliest post in this batch
        frames.append(pd.DataFrame(js['data'])[columns])

        rest_time = random.randint(2, 8)
        print(f'Resting for {rest_time} seconds...')
        time.sleep(rest_time)  # rest a random number of seconds between requests

    print('\n\nScraping complete!')
    # DataFrame.append was removed in pandas 2.0, so batches are combined with pd.concat
    return pd.concat(frames).drop_duplicates()
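
The `before` cursor is what lets each request pick up where the previous one stopped. The sketch below isolates that idea from the network. `fake_fetch` is a hypothetical stand-in for the API call, and it assumes results come back newest first, as the scraper above does.

```python
def fake_fetch(posts, size, before=None):
    """Hypothetical stand-in for one API call: newest posts first."""
    eligible = [p for p in posts if before is None or p['created_utc'] < before]
    eligible.sort(key=lambda p: p['created_utc'], reverse=True)
    return eligible[:size]


def paginate(posts, size, pages):
    """Collect up to `pages` batches, moving the `before` cursor back each time."""
    before = None  # start from the latest post
    collected = []
    for _ in range(pages):
        batch = fake_fetch(posts, size, before)
        if not batch:  # nothing older is left
            break
        collected.extend(batch)
        before = batch[-1]['created_utc']  # oldest timestamp in this batch
    return collected


# Ten made-up posts with timestamps 1..10.
all_posts = [{'id': n, 'created_utc': n} for n in range(1, 11)]
result = paginate(all_posts, size=3, pages=3)
print([p['created_utc'] for p in result])  # [10, 9, 8, 7, 6, 5, 4, 3, 2]
```

In this sketch the comparison is strict, so two posts sharing the boundary timestamp could be split across batches and one of them skipped. Whether the real API behaves the same way is not verified here.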

## Web scraping, saving to .csv

In [3]:
%%time
# WARNING!! LONG RUN TIME EXPECTED!!
cars = reddit_scraper('cars',10_000)

Scraping 100 	Status code: 200
Resting for 3 seconds...
Scraping 200 	Status code: 200
Resting for 7 seconds...
Scraping 300 	Status code: 200
Resting for 3 seconds...
Scraping 400 	Status code: 200
Resting for 2 seconds...
Scraping 500 	Status code: 200
Resting for 8 seconds...
Scraping 600 	Status code: 200
Resting for 4 seconds...
Scraping 700 	Status code: 200
Resting for 8 seconds...
Scraping 800 	Status code: 200
Resting for 4 seconds...
Scraping 900 	Status code: 200
Resting for 3 seconds...
Scraping 1000 	Status code: 200
Resting for 4 seconds...
Scraping 1100 	Status code: 200
Resting for 2 seconds...
Scraping 1200 	Status code: 200
Resting for 2 seconds...
Scraping 1300 	Status code: 200
Resting for 5 seconds...
Scraping 1400 	Status code: 200
Resting for 6 seconds...
Scraping 1500 	Status code: 200
Resting for 5 seconds...
Scraping 1600 	Status code: 200
Resting for 5 seconds...
Scraping 1700 	Status code: 200
Resting for 6 seconds...
Scraping 1800 	Status code: 200
Resting 

In [4]:
cars.shape

(9835, 6)

In [5]:
cars.head()

| | selftext | title | upvote_ratio | subreddit | author | is_self |
|---|---|---|---|---|---|---|
| 0 | [removed] | Best way to gain MORE experience with a manual? | 1.0 | cars | helpmebuyacar123 | True |
| 1 | | 2023 Acura Integra Receiving Optional SH-AWD | 1.0 | cars | NCSUGrad2012 | False |
| 2 | | AutoTrader: Ford Bronco Review: No Doors, No R... | 1.0 | cars | Delta_Mike_Sierra_ | False |
| 3 | [removed] | Help identifying my Great Grandfathers car (1910) | 1.0 | cars | fkncatalinawinemixer | True |
| 4 | | West Virginia House bill would ban OTA updates | 1.0 | cars | borderwave2 | False |


In [6]:
cars.tail()

| | selftext | title | upvote_ratio | subreddit | author | is_self |
|---|---|---|---|---|---|---|
| 94 | [removed] | What is the best car to buy that makes people ... | 1.0 | cars | NervusTrader | True |
| 95 | | Huawei's electric car introduced | 1.0 | cars | voltner10 | False |
| 96 | [removed] | Is a 2011 Kia Forte for 3900 a good deal and i... | 1.0 | cars | Abcd403044 | True |
| 97 | | Below Link | 1.0 | cars | vaibhavj82 | False |
| 99 | Title. I just bought a brand new 2021 Challeng... | Apologies if this has been asked, feel free to... | 1.0 | cars | EveryoneLovesNudez | True |
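
The `tail()` output above ends at index label 99 and skips 98, even though the frame has 9,835 rows. That is consistent with combining 100-row batches without resetting the index and then dropping duplicate rows. A minimal sketch with made-up data (not the scraped posts) shows the same pattern:

```python
import pandas as pd

# Two made-up batches; each gets its own 0..2 index, like each 100-post page.
batch_1 = pd.DataFrame({'title': ['a', 'b', 'c'], 'author': ['x', 'y', 'z']})
batch_2 = pd.DataFrame({'title': ['d', 'b', 'e'], 'author': ['u', 'y', 'v']})

combined = pd.concat([batch_1, batch_2]).drop_duplicates()
print(combined.index.tolist())  # [0, 1, 2, 0, 2]: labels repeat and one is missing

# Resetting the index gives unique, consecutive row labels.
tidy = combined.reset_index(drop=True)
print(tidy.index.tolist())      # [0, 1, 2, 3, 4]
```

Since the frames are saved with `index=False`, the repeated labels do not end up in the .csv files either way.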


In [7]:
cars.to_csv('../data/cars.csv',index=False)
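
One thing to keep in mind when the file is loaded again: rows with `is_self` set to `False` show an empty `selftext` in the output above, and pandas reads empty CSV fields back as `NaN` by default. A small sketch with made-up rows, using an in-memory buffer instead of `../data/cars.csv`:

```python
import io
import pandas as pd

# Made-up rows shaped like the scraped data; the second is a link post.
sample = pd.DataFrame({
    'selftext': ['[removed]', ''],
    'title': ['A text post', 'A link post'],
    'is_self': [True, False],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # same call style as the cell above
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded['selftext'].isna().tolist())  # [False, True]

# keep_default_na=False keeps empty fields as empty strings instead.
buffer.seek(0)
as_text = pd.read_csv(buffer, keep_default_na=False)
print(as_text['selftext'].tolist())  # ['[removed]', '']
```

Either option works for later text processing, as long as the choice is made deliberately (for example, filling `NaN` with an empty string before combining `title` and `selftext`).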

In [11]:
%%time
# WARNING!! LONG RUN TIME EXPECTED!!
bikes = reddit_scraper('motorcycle',10_000)

Scraping 100 	Status code: 200
Resting for 3 seconds...
Scraping 200 	Status code: 200
Resting for 3 seconds...
Scraping 300 	Status code: 200
Resting for 4 seconds...
Scraping 400 	Status code: 200
Resting for 5 seconds...
Scraping 500 	Status code: 200
Resting for 8 seconds...
Scraping 600 	Status code: 200
Resting for 4 seconds...
Scraping 700 	Status code: 200
Resting for 5 seconds...
Scraping 800 	Status code: 200
Resting for 3 seconds...
Scraping 900 	Status code: 200
Resting for 5 seconds...
Scraping 1000 	Status code: 200
Resting for 2 seconds...
Scraping 1100 	Status code: 200
Resting for 7 seconds...
Scraping 1200 	Status code: 200
Resting for 4 seconds...
Scraping 1300 	Status code: 200
Resting for 2 seconds...
Scraping 1400 	Status code: 200
Resting for 4 seconds...
Scraping 1500 	Status code: 200
Resting for 5 seconds...
Scraping 1600 	Status code: 200
Resting for 8 seconds...
Scraping 1700 	Status code: 200
Resting for 4 seconds...
Scraping 1800 	Status code: 200
Resting 

In [12]:
bikes.shape

(9852, 6)

In [13]:
bikes.head()

| | selftext | title | upvote_ratio | subreddit | author | is_self |
|---|---|---|---|---|---|---|
| 0 | | My new classic motorcycle | 1.0 | motorcycle | Puzzleheaded_Pipe734 | False |
| 1 | How often should I run my 1983 goldwing to avo... | Staying Alive | 1.0 | motorcycle | Puzzleheaded_Pipe734 | True |
| 2 | Kind of in the title. I picked up a Suzuki Vol... | First Timer Question, Suzuki Volusia | 1.0 | motorcycle | wolfnibblets | True |
| 3 | Also, I bought this bike 2nd-hand from a deale... | Is this pitting indicative of anything? | 1.0 | motorcycle | SoupyDelicious | True |
| 4 | | Rainbow - Death Alley Driver (1982) | 1.0 | motorcycle | mattjshermandotcom | False |


In [14]:
bikes.tail()

| | selftext | title | upvote_ratio | subreddit | author | is_self |
|---|---|---|---|---|---|---|
| 95 | Got a little bike 250cc and Highways are terri... | Are there a lot of riders that are strictly Ci... | 1.0 | motorcycle | cheeseandwich | True |
| 96 | Hello all, \n\nI am looking at an end of year ... | Buying a damaged bike | 1.0 | motorcycle | Random--J | True |
| 97 | | My Second Bike (2003 R6) | 1.0 | motorcycle | assblister | False |
| 98 | | 1200 miles and $4500 worth of work later... th... | 1.0 | motorcycle | noobishchan | False |
| 99 | | can anyone id this lil bike? | 0.99 | motorcycle | ninezerone | False |


In [15]:
bikes.to_csv('../data/bikes.csv',index=False)

With the two subreddits successfully scraped and saved to .csv files, we will move to the next notebook for preprocessing, EDA, and modeling.