#### Conclusion

Okay, so there isn't a clear real-world use for simply classifying a post into a subreddit. For one, every post on Reddit already belongs to a subreddit, so a post always has its subreddit attached to it. On its own, that's not very impressive in utility. At first glance, that is.

The process of creating a binary classifier can bring to light other factors about whatever is being studied. Keep in mind, though, that the sample this information comes from is demographically biased, so there are limits to how far the findings generalize to other applications.

Learning which keywords matter most to a city may give you an idea about the culture in that city, or what's important to it. So many things could be discovered from this sort of exploration. It could also be used in market research analysis (yawn) by comparing one brand's subreddit with another's, or one show's fanbase with another's. This binary classification model can also be adapted to message filtering, such as sorting spam out of emails.

Sociologists, for example, would have a field day with this, given all the types of thematic analysis they could apply.

### Gabriela Osorio
#### DSI Project 3 - Creating a Binary Subreddit Classifier - TO vs. LA
#### November 5, 2018

#### Preamble
>  Reddit is an online public content-sharing platform that is organized into categories known as subreddits. Subreddits are made up of user-submitted posts that can be text, media, or both. Users interact with posts through upvotes, downvotes, comments, and of course, views. This project outlines the creation of a subreddit classifier that predicts which subreddit a given post is from. Specifically, it's a binary classifier for the Toronto and Los Angeles subreddits. The model can then be expanded upon to explore which characteristics matter most to the classifier.

#### Quick Model Summary
> **Input**: 'Title' <br>
**Output**: Binary label ('LA' or 'TO')<br>
**Type**: Binary Classifier: Random Forest, Support Vector Machine <br>
**Metrics of Success**: Accuracy <br>


## PART 1: Webscraping Using the Reddit API

We begin by scraping posts from the Toronto and LA subreddits using Reddit's API. This portion was created from a template provided by Max Humber, course instructor, so it should not be mistaken for the author's original work. 

Potentially interesting and influential features of posts that have been identified and included in this webscraping include: 
- subreddit: to be part of the target vector later on 
- title: text input  
- selftext : actual text from the post
- downs : downvotes, negative points
- ups : upvotes, positive points
- num_comments : number of comments
- permalink 
- name 
- author 
- is_original_content : binary answer to "Is content in selftext original?"
- edited : binary answer to "Has this post been edited?"
- media_only : binary answer to "Does the post only have a photo?" 

### PART 1A: Scraping

In [2]:
import datetime
import pandas as pd
import requests
import time

from bs4 import BeautifulSoup

In [3]:
headers = {'User-Agent': 'My User Agent 1.0'}

In [4]:
def fetch_page(url, after=''):
    params = {'after': after}
    response = requests.get(url, headers=headers, params=params)
    return response.json()['data']['children']

In [5]:
def parse_post(post):
    keep = ['subreddit', 'title', 'selftext', 'downs', 'ups', 'num_comments', 'permalink', 'name', 'author', 'time', 'is_original_content', 'edited', 'media_only'] 
    return {k:v for k, v in post['data'].items() if k in keep}

In [6]:
def parse_page(page):
    after = ''
    posts = []
    for post in page:
        post = parse_post(post)
        after = post['name']
        posts.append(post)
    return posts, after

In [7]:
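all_posts = []  # module-level list: every call to fetch_subreddit appends to it
def fetch_subreddit(subreddit, pages=25):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    after = ''  # cursor: name of the last post seen, sent back to request the next page
    for i in range(pages):
        print(f'Fetching Page {i + 1}')
        page = fetch_page(url, after)
        posts, after = parse_page(page)
        all_posts.extend(posts)
        time.sleep(5)  # pause between requests to avoid hammering the API
    return all_posts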
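The cursor logic above can be tried without touching the network. The sketch below is an offline stand-in, not the instructor's template: `paginate`, `fake_fetch`, and `FAKE_LISTING` are hypothetical names, and the fake fetcher simply serves two posts per request. It also keeps the collected posts in a local list, so calling it for a second subreddit would not mix results with the first.

```python
def paginate(fetch, pages):
    """Collect posts across up to `pages` requests, using the last post's name as the cursor."""
    collected = []          # local list, so repeated calls do not share state
    after = ''
    for _ in range(pages):
        page = fetch(after)
        if not page:        # stop early once the listing runs out
            break
        collected.extend(page)
        after = page[-1]['name']
    return collected

# Hypothetical stand-in for the network call: serves 2 posts per request from a fixed list.
FAKE_LISTING = [{'name': f't3_{i}', 'title': f'post {i}'} for i in range(5)]

def fake_fetch(after):
    names = [p['name'] for p in FAKE_LISTING]
    start = names.index(after) + 1 if after else 0
    return FAKE_LISTING[start:start + 2]

posts = paginate(fake_fetch, pages=10)
print([p['name'] for p in posts])
```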

In [8]:
posts = fetch_subreddit('Toronto')

Fetching Page 1
Fetching Page 2
Fetching Page 3
Fetching Page 4
Fetching Page 5
Fetching Page 6
Fetching Page 7
Fetching Page 8
Fetching Page 9
Fetching Page 10
Fetching Page 11
Fetching Page 12
Fetching Page 13
Fetching Page 14
Fetching Page 15
Fetching Page 16
Fetching Page 17
Fetching Page 18
Fetching Page 19
Fetching Page 20
Fetching Page 21
Fetching Page 22
Fetching Page 23
Fetching Page 24
Fetching Page 25


#### Looking at the fetched goods through a DataFrame aka stopping to smell the roses.

In [9]:
df=pd.DataFrame(all_posts)
df

Unnamed: 0,author,downs,edited,is_original_content,media_only,name,num_comments,permalink,selftext,subreddit,title,ups
0,gooker55,0,False,False,False,t3_9u5q1y,63,/r/toronto/comments/9u5q1y/space_transit/,,toronto,Space Transit,1181
1,tannc,0,False,False,False,t3_9u67tf,23,/r/toronto/comments/9u67tf/i_took_a_photo_of_t...,,toronto,I took a photo of the Royal Bank Plaza.,416
2,ur_a_idiet,0,False,False,False,t3_9u9t2q,6,/r/toronto/comments/9u9t2q/muslims_surround_to...,,toronto,Muslims Surround Toronto Synagogues With Prote...,35
3,imlesmartest,0,False,False,False,t3_9u1sdu,32,/r/toronto/comments/9u1sdu/falling_in_love_wit...,,toronto,“Fall”ing in love with this city,1148
4,chinese_horse,0,False,False,False,t3_9uamrs,7,/r/toronto/comments/9uamrs/breaking_rookie_pre...,,toronto,BREAKING: Rookie Premier Doug Ford will shuffl...,15
5,Elliottafc,0,False,False,False,t3_9u6fph,14,/r/toronto/comments/9u6fph/ecological_collapse...,,toronto,Ecological collapse of Toronto's ravine system...,80
6,excaliburcat,0,False,False,False,t3_9u6c2v,8,/r/toronto/comments/9u6c2v/just_a_normal_every...,,toronto,"Just a normal everyday pigeon, getting off the...",67
7,plasticawareness,0,False,False,False,t3_9u6pml,20,/r/toronto/comments/9u6pml/qew_is_being_repaired/,,toronto,QEW Is being repaired.,57
8,nonameattachedforme,0,False,False,False,t3_9u7gsq,5,/r/toronto/comments/9u7gsq/beautiful_mural_nea...,,toronto,Beautiful Mural near Ossington,34
9,HBVenus,0,False,False,False,t3_9u8j3z,4,/r/toronto/comments/9u8j3z/woman_wanted_in_mul...,Wanted for damaging property (restaurants and ...,toronto,Woman WANTED in multiple Mischief Under $5000 ...,19


### PART 1B: Creating and Exporting Scraped Goods as CSV 

In [10]:
!mkdir data

mkdir: data: File exists


In [11]:
now = str(datetime.datetime.now())[:19]

filename = f'data/datasci scrape {now}.csv'
filename

'data/datasci scrape 2018-11-05 01:14:17.csv'

In [12]:
df.to_csv(filename, index=False)

## PART 2: Preprocessing

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords  # English stop-word list used for my_stopwords below
import numpy as np
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn import svm  # svm is its own sklearn module, not part of linear_model
from sklearn.metrics import r2_score, accuracy_score

import seaborn as sns

In [70]:
TO=pd.read_csv('./data/TO.csv')
TO.shape

(627, 12)

In [71]:
LA= pd.read_csv('./data/LA.csv')
LA.shape

(1254, 12)

In [73]:
cities=[TO,LA]
tola=pd.concat(cities)
tola.dtypes

author                 object
downs                   int64
edited                 object
is_original_content      bool
media_only               bool
name                   object
num_comments            int64
permalink              object
selftext               object
subreddit              object
title                  object
ups                     int64
dtype: object

#### Let's hold our horses here.
Though we extracted many potential features for our upcoming predictive models, we'll only be focusing on the title in this model. 

In [None]:
tola.shape

In [None]:
my_stopwords = stopwords.words('english')
my_stopwords.extend(['amp','x200b','\n'])


In [None]:
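# Encode the target on the combined frame: Toronto = 1, Los Angeles = 0.
# The scraped Toronto rows carry the value 'toronto' (see the table in Part 1).
tola['subreddit'] = tola['subreddit'].replace({'toronto': 1, 'LosAngeles': 0})

In [None]:
X = tola['title']
y = tola['subreddit']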

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2)
tfidvec = TfidfVectorizer(stop_words = my_stopwords)
tfidvec.fit(X_train)
X_train_tfidf = tfidvec.transform(X_train)
X_test_tfidf = tfidvec.transform(X_test)
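To see why the vectorizer is fit on the training titles only and then reused on the test titles, here is a small sketch with made-up titles (none of them come from the scraped data). A word that never appears in training, like 'earthquake' below, simply gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles for illustration only.
train_titles = ["subway delays downtown", "freeway traffic downtown"]
test_titles = ["subway traffic earthquake"]   # 'earthquake' never appears in training

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_titles)   # learns the vocabulary and IDF weights
test_matrix = vec.transform(test_titles)         # reuses them without refitting

print(sorted(vec.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.nnz)   # number of test-title words that matched the training vocabulary
```

Both matrices end up with the same number of columns, which is what lets a model trained on one be evaluated on the other.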

In [None]:
from sklearn.metrics import confusion_matrix

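def conf_matrix(model, X_test):
    """Print the confusion-matrix counts for `model` and return them as a labelled table."""
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)  # relies on the global y_test from the split above
    tn, fp, fn, tp = cm.ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    # Rows and columns follow sorted label order: 0 = LA, 1 = TO.
    return pd.DataFrame(cm, columns = ['Predicted LA','Predicted TO'], index = ['Actual LA', 'Actual TO'])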
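The order in which `ravel()` unpacks the matrix is easy to get backwards, so here is a tiny worked example, assuming the encoding 1 = Toronto and 0 = Los Angeles. The label lists are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]   # 0 = LA, 1 = TO (hypothetical labels)
y_hat = [0, 1, 1, 1]    # one LA post wrongly predicted as TO

cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()   # row-major order: [[tn, fp], [fn, tp]]

table = pd.DataFrame(cm,
                     index=['Actual LA', 'Actual TO'],
                     columns=['Predicted LA', 'Predicted TO'])
print(table)
print(tn, fp, fn, tp)   # 1 1 0 2
```

Because scikit-learn sorts the labels, row 0 and column 0 belong to class 0 (LA here), and "positive" means class 1 (TO).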


In [None]:
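from sklearn.svm import SVC

# `svc` is never defined earlier in the notebook; a default-parameter SVC is assumed here.
svc = SVC()
svc.fit(X_train_tfidf, y_train)
y_pred = svc.predict(X_test_tfidf)  # predict on the vectorized titles, not the raw text

In [None]:
import pandas as pd 
pd.DataFrame({
    'true': y_test,
    'pred': y_pred
}).sample(10)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)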

In [None]:
conf_matrix(svc, X_test_tfidf)

Creating a pipeline to use in model

In [82]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


steps = [("vectorizer", TfidfVectorizer(stop_words=my_stopwords)),
         ("rf", RandomForestClassifier())]

pipe = Pipeline(steps)
grid_params = {
    "vectorizer__max_features": [2000, 3000, 4000],
    "vectorizer__ngram_range":[(1,1), (1,2)],
    "rf__n_estimators": [2500, 3000, 3500],
    "rf__max_depth": [17, 18, 19, 20],
    "rf__min_samples_leaf": [1, 2, 3]}



In [None]:
gs=GridSearchCV(pipe, grid_params, verbose=1, n_jobs=2, cv=5)
results=gs.fit(X_train, y_train) 
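Since the full grid above takes a while to run, a scaled-down sketch can help check the mechanics first. Everything here is a toy assumption: the titles, the labels, the two-value grids, and `cv=2` are chosen only so the search finishes in seconds. The `step__parameter` naming is the same convention used in `grid_params`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = Toronto-style titles, 0 = Los Angeles-style titles.
titles = ["ttc subway delay", "ttc streetcar delay", "gardiner closure ttc", "ttc fare increase",
          "405 freeway traffic", "freeway closure 405", "405 tacos", "dodgers freeway 405"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("rf", RandomForestClassifier(random_state=0))])

# Keys are "<step name>__<parameter name>".
toy_grid = {"vectorizer__ngram_range": [(1, 1), (1, 2)],
            "rf__n_estimators": [10, 50]}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_search.fit(titles, labels)          # raw text goes in; the pipeline vectorizes per fold

print(toy_search.best_params_)          # attribute, not a method
print(round(toy_search.best_score_, 2)) # mean cross-validated accuracy of the best combination
print(toy_search.predict(["ttc delay downtown"]))
```

Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on that fold's training titles only, so no vocabulary leaks from the validation split.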

In [None]:
best_params = results.best_params_  # best_params_ is an attribute, not a method

In [None]:
grid_params=best_params

In [None]:
best_params

In [None]:
# Score the tuned pipeline on the raw test titles (the pipeline vectorizes them itself).
accuracy_score(y_test, gs.predict(X_test))

In [None]:
conf_matrix(gs, X_test)

In [None]:
gs

In [None]:
results

In [None]:
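# Cross-validate the best pipeline found by the grid search on the training split.
cross_val_score(estimator=gs.best_estimator_, X=X_train, y=y_train, scoring='accuracy', cv=5, n_jobs=2)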