# Project Goals

For project 3, your goal is two-fold:
1. Using Reddit's API, collect posts from any 2 subreddits
2. Use NLP to train a classifier that predicts which subreddit a given post came from.
    - This is a binary classification problem.


### About the API

Reddit's API is fairly straightforward. For example, if I want the posts from [`/r/boardgames`](https://www.reddit.com/r/boardgames), all I have to do is add `.json` to the end of the url: https://www.reddit.com/r/boardgames.json

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk
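As a minimal sketch of the `.json` pattern described above: the helper names `listing_url`, `extract_posts`, and `fetch_listing` are hypothetical, and the field names (`name`, `title`, `selftext`) are the ones used by the scraping code later in this notebook. It assumes the listing JSON nests posts under `data` > `children`.

```python
def listing_url(subreddit):
    """Build the JSON listing URL for a subreddit name such as 'boardgames'."""
    return 'https://www.reddit.com/r/' + subreddit + '.json'


def extract_posts(listing):
    """Pull the name, title, and body text out of one parsed listing page."""
    posts = []
    for child in listing['data']['children']:
        post = child['data']
        posts.append({'name': post['name'],
                      'title': post['title'],
                      'selftext': post['selftext']})
    return posts


def fetch_listing(subreddit, user_agent='Mozilla/5.0'):
    """Request one page of posts and return the parsed JSON (needs network access)."""
    # Imported here so the two helpers above work without the dependency installed
    import requests
    response = requests.get(listing_url(subreddit),
                            headers={'User-agent': user_agent})
    response.raise_for_status()  # fail loudly on a non-200 response
    return response.json()
```

Calling `extract_posts(fetch_listing('boardgames'))` would then return one dictionary per post on the first page.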

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier; the other can be any classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.
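To make the "Bayes classifier" requirement concrete, here is a rough, dependency-free sketch of a multinomial Naive Bayes text classifier with Laplace smoothing. It is only an illustration of the idea, not a library implementation and not necessarily the model this project should ship. The function names, the whitespace tokenizer, and the `alpha` smoothing parameter are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, text, alpha=1.0):
    """Return the class with the highest log posterior for `text`."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace-smoothed log likelihood of the token given the class
            numerator = word_counts[label][token] + alpha
            denominator = total_words + alpha * len(vocabulary)
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training on a handful of labeled posts and calling `predict_naive_bayes(model, 'some new post text')` returns the label (here, 0 or 1 for the two subreddits) whose word distribution fits the text best.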

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

**Pro Tip 2:** Reddit will give you 25 posts **per request**. To get enough data, you'll need to hit Reddit's API **repeatedly** (most likely in a `for` loop). _Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. **THIS IS CRUCIAL**_

**Pro Tip 3:** The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

**Pro Tip 4:** At the end of each loop, be sure to save the results from your scrape as a `csv`: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
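The loop described in the tips above could be sketched roughly as follows. This is an illustration under assumptions: `collect_posts` and its parameters are hypothetical names, `fetch_page` is a caller-supplied function that takes the previous page's `after` token and returns the parsed listing JSON, and the checkpoint uses the standard-library `csv` module in place of the Pandas round trip so the sketch has no dependencies.

```python
import csv
import time

FIELDS = ['name', 'title', 'selftext']


def collect_posts(fetch_page, max_pages=40, pause_seconds=2.0, csv_path=None):
    """Page through a listing, pausing between requests and checkpointing to CSV."""
    rows = []
    after = None  # no token on the first request
    for _ in range(max_pages):
        listing = fetch_page(after)
        for child in listing['data']['children']:
            rows.append({field: child['data'][field] for field in FIELDS})
        # Rewrite the checkpoint file after every page so a crash loses little
        if csv_path is not None:
            with open(csv_path, 'w', newline='', encoding='utf-8') as handle:
                writer = csv.DictWriter(handle, fieldnames=FIELDS)
                writer.writeheader()
                writer.writerows(rows)
        after = listing['data']['after']
        if after is None:
            break  # no further pages to request
        time.sleep(pause_seconds)  # be polite between requests
    return rows
```

With 25 posts per request, `max_pages=40` lines up with the 1,000-post cap mentioned in Pro Tip 3. Passing the fetcher in as an argument also makes the loop easy to try against canned pages before hitting the real API.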

# Scraping Data from Reddit

In [3]:
#import libraries

import pandas as pd
import numpy as np
import json
import urllib
import requests
import time
from psaw import PushshiftAPI

%matplotlib inline

In [4]:
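# Listing URL for the first subreddit (labeled 1 below)
subreddit1 = 'https://www.reddit.com/r/personalfinance/.json'
#subreddit2 = 'https://www.reddit.com/r/writing/.json'
# NOTE: later cells also expect a `subreddit0` URL (labeled 0), which is not defined in the cells shown
id1 = 10  # Steam app ID used in the review-API experiment below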

In [5]:
headers = {'User-agent':'Mozilla/5.0'}

In [9]:
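def get_postcount(id_no,start_offset=0):
    # Return the total number of reviews Steam reports for the given app ID
    
    if id_no!='':
        steam_jayson_url = 'http://store.steampowered.com/appreviews/'+str(id_no)+'?json=1&start_offset='+str(start_offset)
        steam_jayson = requests.get(steam_jayson_url).json()
        # Fixed: the original read from an undefined `srdt_jayson` and returned nothing.
        # The key names are kept from the original; check them against the actual response.
        s_lng = int(steam_jayson['metadata']['total_results'])
        return s_lng
    
    else:
        print('Enter a steam app ID!')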

In [10]:
req1 = requests.get('http://store.steampowered.com/appreviews/10?json=1&start_offset=0',headers=headers).json()

In [24]:
req1['reviews'][0]

{'recommendationid': '50887827',
 'author': {'steamid': '76561198000788461',
  'num_games_owned': 132,
  'num_reviews': 15,
  'playtime_forever': 6639,
  'playtime_last_two_weeks': 0,
  'last_played': 1544387799},
 'language': 'english',
 'review': 'this game taught me how to be a man',
 'timestamp_created': 1558967616,
 'timestamp_updated': 1558967616,
 'voted_up': True,
 'votes_up': 16,
 'votes_funny': 14,
 'weighted_vote_score': '0.65728306770324707',
 'comment_count': 0,
 'steam_purchase': True,
 'received_for_free': False,
 'written_during_early_access': False}
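A review record like the one above nests the author fields one level down, which is awkward for a tabular dataset. As a small sketch (the helper name `flatten_review` and the `author_` prefix are assumptions for illustration), the record can be flattened into a single-level dictionary before building a DataFrame:

```python
def flatten_review(review):
    """Return a single-level dict, prefixing nested author fields with 'author_'."""
    flat = {key: value for key, value in review.items() if key != 'author'}
    for key, value in review.get('author', {}).items():
        flat['author_' + key] = value
    # The score arrives as a string in the sample above, so convert it to a number
    if 'weighted_vote_score' in flat:
        flat['weighted_vote_score'] = float(flat['weighted_vote_score'])
    return flat
```

A list of such flattened dictionaries can be passed straight to `pd.DataFrame(...)`, giving one column per field.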

In [103]:
def get_posts(url,headers = {'User-agent':'Mozilla/5.0'},loops=2):
    # Page through a subreddit listing and collect each post's name, title, and body text
    posts = []
    names = []
    titles = []
    aft_name=None

    for i in range(loops):
        # 'after' tells Reddit which post the next page should start after
        if aft_name is None:
            params={}
        else:
            params={'after':aft_name}

        req = requests.get(url,params=params,headers=headers)

        if req.status_code == 200:
            jayson = req.json()
            # Posts live under data > children in the listing JSON
            for child in jayson['data']['children']:
                names.append(child['data']['name'])
                titles.append(child['data']['title'])
                posts.append(child['data']['selftext'])
            aft_name = jayson['data']['after']
        else:
            print(req.status_code)
            break
            
        # Random pause between requests
        time.sleep(np.random.randint(3,30))
    
    posts_df = pd.DataFrame({'names':names,
                         'titles':titles,
                         'posts':posts},columns = ['names','titles','posts'])
    
    return posts_df

In [104]:
sr1_posts = get_posts(subreddit1,loops=40)

In [105]:
len(sr1_posts)

994

In [106]:
sr1_df = sr1_posts
sr1_df['subreddit']=1
sr1_df.head()

| | names | titles | posts | subreddit |
|---|---|---|---|---|
| 0 | t3_bvkm9e | 30-Day Challenge #6: Review your investment as... | # 30-day challenges\n\nWe are pleased to conti... | 1 |
| 1 | t3_bywisi | Weekday Help and Victory Thread for the week o... | ### If you need help, please check the [PF Wik... | 1 |
| 2 | t3_bznn9z | Is it wrong to apply for other jobs to see wha... | Some days I wonder if I would get paid more wo... | 1 |
| 3 | t3_bzc1x4 | always get a debt verification letter! | follow up to the $25 medical debt collection p... | 1 |
| 4 | t3_bzk3t7 | Do people REALLY have emergency funds with 20k... | Hello hello! Really simple question I'm just h... | 1 |


In [107]:
len(sr1_df)

994

In [108]:
len(sr1_df[sr1_df['posts']!=''])

987

In [109]:
# Drop duplicate posts (the Reddit API caps requests at 1,000 records) and posts with no body text
sr1_df = sr1_df.drop_duplicates(subset='posts',keep='first')
sr1_df = sr1_df[sr1_df['posts']!='']
len(sr1_df)

935

In [110]:
sr1_df.to_csv('./datasets/subreddit1.csv',index=False)

In [111]:
sr0_posts=get_posts(subreddit0,loops=40)

In [112]:
sr0_df = sr0_posts
sr0_df['subreddit']=0
len(sr0_df[sr0_df['posts']!=''])

980

In [113]:
#check for duplicate rows
sr0_df = sr0_df.drop_duplicates()
sr0_df = sr0_df[sr0_df['posts']!='']
len(sr0_df)

980

In [114]:
sr0_df.to_csv('./datasets/subreddit0.csv',index=False)

In [115]:
sr0_df.head(15)

| | names | titles | posts | subreddit |
|---|---|---|---|---|
| 0 | t3_bzl3k2 | “Every lie we tell incurs a debt to the truth.... | Isn’t that what debt is? A lie, saying we can ... | 0 |
| 1 | t3_bzgsp7 | Baby Step 2 - Second Student Loan Paid Off | Another day, another dollar, another debt paid... | 0 |
| 2 | t3_bzovir | Money &amp; dating | Hi DR subreddit! I’m almost done replenishing ... | 0 |
| 3 | t3_bzkg8w | Cutting up credit cards | My husband and I are on step 2 and I want to c... | 0 |
| 4 | t3_bzjwxi | The strength to climb a mountain | New to DR and the baby steps. We have complete... | 0 |
| 5 | t3_bzm41r | BS6’ers out there...how do you keep motivated? | After 30 months of constant chipping away, I f... | 0 |
| 6 | t3_bzqd6x | HSA in Baby Step #2 | About a week a half ago DR received a question... | 0 |
| 7 | t3_bzdfg7 | A meme thread? | I get pictures and links are not allowed in th... | 0 |
| 8 | t3_bzevzw | Dave's paycheck math | Anybody else think Dave underestimates the dif... | 0 |
| 9 | t3_bzbh1t | Looking for a new job: keep paying off debt or... | My husband and I are in baby step to and we’ve... | 0 |
