# Project 3 - Using NLP to classify which subreddit a given post came from
For project 3, your goal is two-fold:
1. Using Reddit's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.

## Problem Statement

## Executive Summary

### Table of Contents
- [Create headers and url](#Create-headers-and-url)
- [Investing subreddit scrape](#Investing-subreddit-scrape)
- [Student Loans subreddit scrape](#Student-Loans-subreddit-scrape)

In [3]:
#Import the necessary libraries
import requests
import pandas as pd
import time
import random
from bs4 import BeautifulSoup

pd.set_option('display.max_columns', None)

## Create headers and url

In [4]:
#send a custom User-agent header instead of the default python-requests one
headers = {'User-agent': 'Geoff Inc 8.0'}

In [5]:
investurl = 'https://www.reddit.com/r/investing/new.json'

In [6]:
loanurl = 'https://www.reddit.com/r/StudentLoans/.json'
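The listing URLs above can also carry query parameters. As a minimal sketch (not part of the original notebook), the helper below builds a paginated URL with the standard library. `build_listing_url` is a hypothetical name, and using `limit` is an assumption about how one might cut down the number of requests: Reddit's listing endpoints return 25 items per request by default and accept a `limit` of up to 100.

```python
from urllib.parse import urlencode

def build_listing_url(base_url, after=None, limit=None):
    """Return base_url with optional 'after' and 'limit' query parameters."""
    params = {}
    if after is not None:
        params['after'] = after
    if limit is not None:
        params['limit'] = limit
    # urlencode escapes the values; skip the '?' when there is nothing to add
    return base_url + ('?' + urlencode(params) if params else '')

print(build_listing_url('https://www.reddit.com/r/investing/new.json',
                        after='t3_gg6nuv', limit=100))
```

The same function works for either subreddit URL, so the string concatenation with `'?after='` would not need to be repeated in each scraping loop.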

## Investing subreddit scrape

In [364]:
#scraping for investing subreddit posts
invest_posts = []
after = None

for a in range(100):
    # the first request has no 'after' token; later requests continue from the last post seen
    if after is None:
        current_url = investurl
    else:
        current_url = investurl + '?after=' + after
    print(current_url)
    investres = requests.get(current_url, headers=headers)

    # stop scraping on any non-OK response
    if investres.status_code != 200:
        print('Status error', investres.status_code)
        break

    current_dict = investres.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    invest_posts.extend(current_posts)
    # token for the next page; None means the listing has no further pages
    after = current_dict['data']['after']
    
    
    if a > 0:
        prev_posts = pd.read_csv('./datasets/investing.csv')
        current_df = pd.DataFrame(current_posts)
        final = pd.concat([prev_posts,current_df])
        final.to_csv('./datasets/investing.csv',index=False)
        
    else:
        pd.DataFrame(current_posts).to_csv('./datasets/investing.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/investing/new.json
6
https://www.reddit.com/r/investing/new.json?after=t3_gg6nuv
4
https://www.reddit.com/r/investing/new.json?after=t3_gfzvr2
3
https://www.reddit.com/r/investing/new.json?after=t3_gfsn42
4
https://www.reddit.com/r/investing/new.json?after=t3_gfkrsj
3
https://www.reddit.com/r/investing/new.json?after=t3_gfejee
5
https://www.reddit.com/r/investing/new.json?after=t3_gf8ckh
5
https://www.reddit.com/r/investing/new.json?after=t3_gf445n
6
https://www.reddit.com/r/investing/new.json?after=t3_gexw1b
3
https://www.reddit.com/r/investing/new.json?after=t3_geqeep
2
https://www.reddit.com/r/investing/new.json?after=t3_gelo26
4
https://www.reddit.com/r/investing/new.json?after=t3_ged65t
5
https://www.reddit.com/r/investing/new.json?after=t3_ge6iz5
4
https://www.reddit.com/r/investing/new.json?after=t3_gdxdv8
3
https://www.reddit.com/r/investing/new.json?after=t3_gdoqvg
5
https://www.reddit.com/r/investing/new.json?after=t3_gdir9z
6
https://www.reddit.com/r
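The loop above always runs for 100 iterations. Reddit sets `after` to null on the last page of a listing, and the check for a missing `after` token then sends the loop back to the first page, which is one plausible reason the CSV ends up with many repeated posts. Below is a minimal sketch of a loop that stops instead. It assumes a hypothetical `fetch_page` callable that takes an `after` token and returns the parsed JSON dictionary; in the notebook this would wrap `requests.get(...).json()`. The status check and the random sleep are left out for brevity.

```python
def collect_posts(fetch_page, max_pages=100):
    """Page through a listing until max_pages is reached or no 'after' token remains."""
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        posts.extend(child['data'] for child in page['data']['children'])
        after = page['data']['after']
        if after is None:
            # listing exhausted: stop rather than start again from the first page
            break
    return posts


# Stand-in for the real API call (hypothetical data), so the sketch runs offline
fake_listing = {
    None: {'data': {'children': [{'data': {'name': 't3_a'}}], 'after': 't3_a'}},
    't3_a': {'data': {'children': [{'data': {'name': 't3_b'}}], 'after': None}},
}
posts = collect_posts(lambda after: fake_listing[after])
print([p['name'] for p in posts])
```

Passing the fetch function in as an argument keeps the paging logic testable without touching the network.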

In [365]:
#read investing csv
indf = pd.read_csv('./datasets/investing.csv')

In [366]:
indf.shape

(2496, 102)

In [386]:
#check that indf loads correctly
indf.head()

| Unnamed: 0 | name | subreddit | title | selftext |
|---|---|---|---|---|
| 0 | t3_ggfbbc | investing | This video is the simplest video that explains... | # [https://youtu.be/PqiewtqGYM4](https://youtu... |
| 1 | t3_ggfazw | investing | Non index funds that do well when the market i... | I thought I’d try something a little different... |
| 2 | t3_ggf7zk | investing | What profits should we expect for a company th... | I'm new to investing and have no background in... |
| 3 | t3_ggeebs | investing | Daily Advice Thread - All basic help or advice... | If your question is "I have $10,000, what do I... |
| 4 | t3_ggedr4 | investing | Group and company f/s and consolidated statements | I can't seem to understand. The difference bet... |


In [368]:
#retain 'name','subreddit','title','selftext' columns
indf = indf[['name','subreddit','title','selftext']]

In [369]:
len(indf[indf['selftext']!=''])

2496
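Because `indf` was read back from a CSV, posts with no body text are stored as `NaN` rather than `''`. `NaN != ''` evaluates to `True`, so the count above equals the total row count no matter how many posts are empty. A small sketch on made-up rows (the data is hypothetical) shows the difference and counts missing bodies with `isna()` instead:

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the second post has an empty selftext field
csv_text = "name,selftext\nt3_a,some text\nt3_b,\nt3_c,more text\n"
demo = pd.read_csv(io.StringIO(csv_text))

# NaN != '' is True, so this counts every row
not_empty_string = len(demo[demo['selftext'] != ''])
# isna() flags the rows whose body text is actually missing
missing = int(demo['selftext'].isna().sum())
print(not_empty_string, missing)
```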

In [370]:
#drop every column except these four (a no-op once the selection above has run)
indf.drop(columns=indf.columns.difference(['name','subreddit','title','selftext']),inplace=True)

In [371]:
#drop duplicate rows in indf
indf.drop_duplicates(subset='title',keep='first',inplace=True,ignore_index=True)

In [372]:
indf.shape

(959, 4)
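Deduplicating on `title` also removes distinct posts that happen to share a title, which could affect recurring threads such as the "Daily Advice Thread" in the sample rows. Reddit gives every post a unique `name` (for example `t3_ggfbbc`), so one alternative is to deduplicate on that column. This is a sketch on hypothetical rows, not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical rows: t3_a was scraped twice; t3_b and t3_c are different posts with the same title
demo = pd.DataFrame({
    'name': ['t3_a', 't3_a', 't3_b', 't3_c'],
    'title': ['First post', 'First post', 'Daily thread', 'Daily thread'],
})

by_title = demo.drop_duplicates(subset='title', keep='first', ignore_index=True)
by_name = demo.drop_duplicates(subset='name', keep='first', ignore_index=True)
print(len(by_title), len(by_name))
```

Which key is appropriate depends on the goal: `name` removes only re-scraped posts, while `title` also collapses reposts and recurring threads.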

In [388]:
#save indf to csv
indf.to_csv('./datasets/ddup_investing.csv')
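By default `to_csv` writes the row index as an extra, unnamed first column, which `read_csv` then loads as `Unnamed: 0` (the column visible in the `head()` output earlier). Here is a sketch of the round trip using an in-memory buffer and made-up data; passing `index=False` avoids the extra column:

```python
import io
import pandas as pd

demo = pd.DataFrame({'name': ['t3_a', 't3_b'], 'title': ['x', 'y']})  # hypothetical data

# to_csv() with no path returns the CSV text, which is read straight back in
with_index = pd.read_csv(io.StringIO(demo.to_csv()))
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))
print(list(with_index.columns))
print(list(without_index.columns))
```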

## Student Loans subreddit scrape

In [9]:
#scraping for student loans subreddit posts
loan_posts = []
after = None

for a in range(100):
    if after is None:
        current_url = loanurl
    else:
        current_url = loanurl + '?after=' + after
    print(current_url)
    loanres = requests.get(current_url, headers=headers)
    
    if loanres.status_code != 200:
        print('Status error', loanres.status_code)
        break
    
    
    current_dict = loanres.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    loan_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    
    if a > 0:
        prev_posts = pd.read_csv('./datasets/loan.csv')
        current_df = pd.DataFrame(current_posts)
        finalloan = pd.concat([prev_posts,current_df])
        finalloan.to_csv('./datasets/loan.csv',index=False)
        
    else:
        pd.DataFrame(current_posts).to_csv('./datasets/loan.csv', index = False)
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/StudentLoans/.json
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_ghwirh
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_gh1msb
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_ggfrcb
4
https://www.reddit.com/r/StudentLoans/.json?after=t3_gfn4gj
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_ges7xz
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gdtia0
3
https://www.reddit.com/r/StudentLoans/.json?after=t3_gdcsex
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gc6jp5
2
https://www.reddit.com/r/StudentLoans/.json?after=t3_gbu91f
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_gbhdg1
4
https://www.reddit.com/r/StudentLoans/.json?after=t3_gadqpb
3
https://www.reddit.com/r/StudentLoans/.json?after=t3_g9xgke
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_g91vbm
6
https://www.reddit.com/r/StudentLoans/.json?after=t3_g8q4bo
5
https://www.reddit.com/r/StudentLoans/.json?after=t3_g7mg1i
6
https://www.reddit.com/r

In [15]:
#read loan csv file
loandf = pd.read_csv('./datasets/loan.csv')

In [16]:
loandf.shape

(2468, 104)

In [33]:
#check that loandf loads correctly
loandf.head()

| Unnamed: 0 | name | subreddit | title | selftext |
|---|---|---|---|---|
| 0 | t3_9w474g | StudentLoans | How to Identify a Student Loan Scam | It seems it's time to sticky another post abou... |
| 1 | t3_ghp77u | StudentLoans | Update on credit bureau reporting for COVID wa... | Hi there. This weekend many of you reported t... |
| 2 | t3_ghxdmi | StudentLoans | "Average" Person Paying Loans? Not a doctor/la... | Hey! Long-time lurker...\n\n Not sure if this ... |
| 3 | t3_gi5zt3 | StudentLoans | Why would my student loan payment go down? | So every month, I pay roughly about 100/month ... |
| 4 | t3_ghz2d7 | StudentLoans | Conflicting advice for mountain of debt (high ... | Hello all! I'm extremely happy with my career,... |


In [17]:
loandf = loandf[['name','subreddit','title','selftext']]

In [18]:
len(loandf[loandf['selftext']!=''])

2468

In [22]:
#drop every column except these four (a no-op once the selection above has run)
loandf.drop(columns=loandf.columns.difference(['name','subreddit','title','selftext']),inplace=True)

In [23]:
#drop duplicate rows in loandf
loandf.drop_duplicates(subset='title',keep='first',inplace=True,ignore_index=True)

In [30]:
loandf.drop_duplicates(subset='selftext',keep='first',inplace=True,ignore_index=True)
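One caveat when deduplicating on `selftext`: `drop_duplicates` treats missing values as equal to each other, so every post without body text after the first is dropped as a duplicate. Here is a sketch on hypothetical rows that sets the text-less posts aside before deduplicating; whether such posts should be kept at all is a separate modelling decision:

```python
import pandas as pd

# Hypothetical rows: two different posts with no body text, plus one repeated body
demo = pd.DataFrame({
    'title': ['Link post 1', 'Link post 2', 'Question', 'Question again'],
    'selftext': [None, None, 'same body', 'same body'],
})

# Missing values count as duplicates of each other, so 'Link post 2' is lost here
naive = demo.drop_duplicates(subset='selftext', keep='first', ignore_index=True)

# Deduplicate only the rows that have text, then add the text-less rows back
has_text = demo['selftext'].notna()
careful = pd.concat(
    [demo[has_text].drop_duplicates(subset='selftext', keep='first'), demo[~has_text]],
    ignore_index=True,
)
print(len(naive), len(careful))
```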

In [27]:
loandf.shape

(947, 4)

In [32]:
#save loandf to csv
loandf.to_csv('./datasets/ddup_loan.csv')