#### Conclusion

Okay, so there isn't an obvious practical use for simply classifying a post into a subreddit. Every post on Reddit already belongs to a subreddit, so a post always arrives with its label attached. At first glance, then, the classifier isn't very impressive in terms of utility.

The process of building a binary classifier can, however, bring to light other things about whatever is being studied. Keep in mind that the sample is drawn from a demographically biased source, so there are limits to how far the findings generalize to other applications.

Learning which keywords matter most in a city's subreddit may give you an idea of the city's culture, or of what its residents care about. Many things could be discovered from this sort of exploration. It could also be used in market research (yawn) by comparing one brand's subreddit with another's, or one show's fanbase with another's. The binary classification model could also be adapted for message filtering.

Sociologists, for example, would have a field day with all the types of thematic analysis they could apply.

### Gabriela Osorio
#### DSI Project 3 - Creating a Binary Subreddit Classifier - TO vs. LA
#### November 5, 2018

#### Preamble
> Reddit is an online public content-sharing platform organized into categories known as subreddits. Subreddits are made up of user-submitted posts that can be text, media, or both. Users interact with posts through upvotes, downvotes, comments, and, of course, views. This project outlines the creation of a subreddit classifier that predicts which subreddit a given post is from. Specifically, it is a binary classifier for the Toronto and Los Angeles subreddits. The model can then be extended to explore which features matter most to the classifier.

#### Quick Model Summary
> **Input**: 'Title' <br>
**Output**: Binary label ('LA' or 'TO')<br>
**Type**: Binary Classifier: Random Forest, Support Vector Machine <br>
**Metrics of Success**: Accuracy <br>

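The summary above can be sketched end to end as follows. This is a minimal illustration, not the notebook's final model: the titles are invented placeholders, and the use of `TfidfVectorizer` with default settings is an assumption, since the summary only names the classifier families and the accuracy metric.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical titles standing in for scraped posts (not real data).
titles = [
    'TTC delays on Line 1 this morning',
    'Best poutine near Queen Street West?',
    'Raptors parade route announced',
    'Snow squall warning for the GTA',
    'Leafs tickets for sale at face value',
    'High Park cherry blossoms are out',
    'Traffic on the 405 is brutal again',
    'Best tacos in Echo Park?',
    'Dodgers game parking tips',
    'Santa Ana winds and fire warnings',
    'Lakers tickets for sale at face value',
    'Griffith Observatory hike at sunset',
]
labels = ['TO'] * 6 + ['LA'] * 6

# Hold out four titles (two per class) to measure accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=4, stratify=labels, random_state=42
)

# Input: title text. Output: 'TO' or 'LA'.
model = make_pipeline(
    TfidfVectorizer(),  # assumed text representation
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy on the toy test set: {accuracy:.2f}')
```

Swapping `RandomForestClassifier` for `sklearn.svm.SVC` in the pipeline would give the Support Vector Machine variant. With only twelve invented titles the accuracy figure itself is meaningless; the point is the shape of the pipeline.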

## PART 1: Webscraping Using the Reddit API

We begin by scraping posts from the Toronto and LA subreddits using Reddit's API. This portion was created from a template provided by Max Humber, course instructor, so it should not be mistaken for the author's original work. 

Potentially interesting and influential post features collected in this scrape include:
- subreddit: to be part of the target vector later on 
- title: text input  
- selftext : actual text from the post
- downs : downvotes, negative points
- ups : upvotes, positive points
- num_comments : number of comments
- permalink 
- name 
- author 
- is_original_content : binary answer to "Is content in selftext original?"
- edited : binary answer to "Has this post been edited?"
- media_only : binary answer to "Does the post contain only media?"
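
To make the field list concrete, the sketch below filters one post down to those fields. The sample post is invented for illustration; only the `kind`/`data` wrapper reflects how each entry in a Reddit listing is shaped, and the extra keys (`score`, `thumbnail`) stand in for the many fields that get discarded.

```python
# Fields to retain, taken from the list above.
KEEP = {
    'subreddit', 'title', 'selftext', 'downs', 'ups', 'num_comments',
    'permalink', 'name', 'author', 'is_original_content', 'edited',
    'media_only',
}

# Hypothetical listing entry; all values are made up for illustration.
sample_post = {
    'kind': 't3',
    'data': {
        'subreddit': 'toronto',
        'title': 'Sunrise over the lake this morning',
        'selftext': '',
        'downs': 0,
        'ups': 42,
        'num_comments': 7,
        'permalink': '/r/toronto/comments/abc123/sunrise_over_the_lake/',
        'name': 't3_abc123',
        'author': 'example_user',
        'is_original_content': False,
        'edited': False,
        'media_only': False,
        'score': 42,      # not in KEEP, so it is dropped
        'thumbnail': '',  # not in KEEP, so it is dropped
    },
}

def select_fields(post, keep=KEEP):
    """Return only the whitelisted fields from a listing entry."""
    return {k: v for k, v in post['data'].items() if k in keep}

row = select_fields(sample_post)
print(sorted(row))
```

Each filtered dictionary becomes one row of the DataFrame built later, with the kept field names as columns.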

### PART 1A: Scraping

In [2]:
import datetime
import pandas as pd
import requests
import time

from bs4 import BeautifulSoup

In [3]:
headers = {'User-Agent': 'My User Agent 1.0'}

In [4]:
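def fetch_page(url, after=''):
    # `after` is the name of the last post already seen; empty means the first page.
    params = {'after': after}
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
    return response.json()['data']['children']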

In [5]:
def parse_post(post):
    keep = ['subreddit', 'title', 'selftext', 'downs', 'ups', 'num_comments', 'permalink', 'name', 'author', 'time', 'is_original_content', 'edited', 'media_only'] 
    return {k:v for k, v in post['data'].items() if k in keep}

In [6]:
def parse_page(page):
    after = ''
    posts = []
    for post in page:
        post = parse_post(post)
        after = post['name']
        posts.append(post)
    return posts, after

In [7]:
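all_posts = []  # shared across calls, so posts from every fetched subreddit accumulate here
def fetch_subreddit(subreddit, pages=25):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    after = ''
    for i in range(pages):
        print(f'Fetching Page {i + 1}')
        page = fetch_page(url, after)
        posts, after = parse_page(page)
        all_posts.extend(posts)
        if not after:
            break  # an empty page means there is nothing further to fetch
        time.sleep(5)  # pause between requests
    return all_posts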
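
The loop relies on cursor-style pagination: the `name` of the last post on one page is sent as `after` to request the next page. The sketch below reproduces that logic offline. `fake_fetch_page`, `FAKE_POSTS`, and `PAGE_SIZE` are hypothetical stand-ins for the real request, so the behaviour of the actual API (page size, rate limits) is not modelled.

```python
# Hypothetical in-memory "subreddit": ten posts named t3_0 ... t3_9.
FAKE_POSTS = [{'data': {'name': f't3_{i}', 'title': f'Post {i}'}} for i in range(10)]
PAGE_SIZE = 4  # assumed page size, for the demo only

def fake_fetch_page(after=''):
    """Return the page of posts that follows the post named `after`."""
    names = [p['data']['name'] for p in FAKE_POSTS]
    start = names.index(after) + 1 if after else 0
    return FAKE_POSTS[start:start + PAGE_SIZE]

def collect(pages):
    """Walk up to `pages` pages, stopping early when a page comes back empty."""
    collected, after = [], ''
    for _ in range(pages):
        page = fake_fetch_page(after)
        if not page:
            break  # nothing left to fetch
        collected.extend(post['data'] for post in page)
        after = page[-1]['data']['name']  # cursor for the next request
    return collected

demo_posts = collect(pages=5)
print(len(demo_posts), demo_posts[-1]['name'])
```

The early exit matters because a cursor that falls back to an empty string would restart from the first page and collect the same posts again.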

In [None]:
posts = fetch_subreddit('Toronto')

Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4


#### Looking at the fetched goods through a DataFrame aka stopping to smell the roses.

In [None]:
df=pd.DataFrame(all_posts)
df

### PART 1B: Creating and Exporting Scraped Goods as CSV

In [None]:
!mkdir -p data

In [None]:
now = str(datetime.datetime.now())[:19]

filename = f'data/datasci scrape {now}.csv'
filename

In [None]:
df.to_csv(filename, index=False)
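
As a possible alternative to slicing the string form of the timestamp, the sketch below builds the path with `strftime` and `pathlib`. The helper name `build_scrape_path` and its parameters are hypothetical; the format string deliberately avoids colons, which some filesystems reject in filenames.

```python
import datetime
import tempfile
from pathlib import Path

def build_scrape_path(directory, prefix='datasci scrape', now=None):
    """Create `directory` if needed and return a timestamped CSV path inside it."""
    now = now or datetime.datetime.now()
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    stamp = now.strftime('%Y-%m-%d %H-%M-%S')
    return folder / f'{prefix} {stamp}.csv'

# Demonstrate in a throwaway directory with a fixed timestamp.
with tempfile.TemporaryDirectory() as tmp:
    fixed = datetime.datetime(2018, 11, 5, 9, 30, 0)
    path = build_scrape_path(Path(tmp) / 'data', now=fixed)
    folder_created = path.parent.is_dir()
    print(path.name)
```

In the notebook this could be used as `df.to_csv(build_scrape_path('data'), index=False)`, which would also make the separate `mkdir` cell unnecessary.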

## PART 2: Preprocessing

In [None]:
TO = pd.read_csv('./data/TO.csv')
TO.shape

In [None]:
LA = pd.read_csv('./data/LA.csv')
LA.shape
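
With both files loaded, a natural next step is to stack them and derive the binary target from the `subreddit` column. The sketch below uses two tiny hand-made frames in place of `TO.csv` and `LA.csv`. The subreddit strings, the `label` column name, and the de-duplication step are all assumptions for illustration, so check them against the values actually present in the scraped data.

```python
import pandas as pd

# Hypothetical stand-ins for the two scraped files (not real data).
to_demo = pd.DataFrame({
    'subreddit': ['toronto', 'toronto'],
    'title': ['Streetcar diversion on King', 'Best patio downtown?'],
})
la_demo = pd.DataFrame({
    'subreddit': ['LosAngeles', 'LosAngeles', 'LosAngeles'],
    'title': ['405 closed overnight', 'Best tacos in Echo Park?', '405 closed overnight'],
})

# Stack the frames, then drop repeated titles (assumed possible if scrapes overlap).
combined = pd.concat([to_demo, la_demo], ignore_index=True)
combined = combined.drop_duplicates(subset='title').reset_index(drop=True)

# Map subreddit names onto the two output labels from the model summary.
combined['label'] = combined['subreddit'].map({'toronto': 'TO', 'LosAngeles': 'LA'})

X = combined['title']  # model input
y = combined['label']  # binary target
print(combined['label'].value_counts().to_dict())
```

Any subreddit value missing from the mapping would produce `NaN` in `label`, which is a quick way to spot unexpected spellings before modelling.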
