#### Conclusion

Okay, so there isn't an obvious real-world use for simply classifying a post into a subreddit. Every post on Reddit already belongs to a subreddit, so a post always arrives with its label attached. It's just not very impressive in utility. At first glance, that is.

The process of creating a binary classifier can bring to light other things about whatever is being studied. Keep in mind that the sample comes from a demographically biased population (Reddit users), so there are limits to how far the findings generalize to other applications.

Learning about what keywords are most important to a city may give you an idea about the culture in the city, or what's important to the city. So many things could be discovered from this sort of exploration. Could also be used in market research analysis (yawn) by comparing one brand's subreddit with another's. Or, one show's fanbase vs another's. This binary classification model can also be modified to be applied to message filters (like sorting out spam from emails).
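
One way to surface such keywords is to rank the TF-IDF terms by a random forest's impurity-based feature importances. What follows is a minimal sketch under stated assumptions: the six titles and their labels are made up for illustration, and the small forest is not the tuned model from this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles; 1 = Toronto, 0 = Los Angeles.
titles = [
    "TTC subway delays again this morning",
    "Best poutine near the CN Tower",
    "Raptors game tonight downtown",
    "Traffic on the 405 is brutal today",
    "Best tacos near Santa Monica",
    "Lakers game tonight downtown",
]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer(stop_words="english")
X_demo = vectorizer.fit_transform(titles)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, labels)

# vocabulary_ maps each term to its column index; invert it to look terms up by column.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

# Pair every term with its importance and keep the five highest.
ranked = sorted(
    ((index_to_term[i], importance) for i, importance in enumerate(forest.feature_importances_)),
    key=lambda pair: pair[1],
    reverse=True,
)
top_keywords = ranked[:5]
print(top_keywords)
```

On real data the same ranking would be read from the fitted model rather than from a toy forest, and the top terms are only as representative as the scraped sample.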

The applications don't end there, however. Sociologists would have a field day with this tool because of its far-reaching applications in thematic analysis.

### Gabriela Osorio
#### DSI Project 3 - Creating a Binary Subreddit Classifier - TO vs. LA
#### November 5, 2018

#### Preamble
>  Reddit is an online public content sharing platform that is organized into different categories known as subreddits. Subreddits are composed of user-submitted posts that can be text, media, or both. Users can interact with posts through upvotes, downvotes, comments, and of course, views. This project outlines the creation of a subreddit classifier that predicts which subreddit a given post is from. Specifically, it's a binary classifier for the Toronto and Los Angeles subreddits. The model can then be expanded upon to explore which features matter most to the classifier.

#### Quick Model Summary
> **Input**: 'Title' <br>
**Output**: Binary label ('LA' or 'TO')<br>
**Type**: Binary Classifier: Random Forest, Support Vector Machine <br>
**Metrics of Success**: Accuracy <br>


## PART 1: Webscraping Using the Reddit API

We begin by scraping posts from the Toronto and LA subreddits using Reddit's API. This portion was created from a template provided by Max Humber, course instructor, so it should not be mistaken for the author's original work. 

Potentially interesting and influential post features that were identified and included in this scrape:
- subreddit: to be part of the target vector later on 
- title: text input  
- selftext : actual text from the post
- downs : downvotes, negative points
- ups : upvotes, positive points
- num_comments : number of comments
- permalink 
- name 
- author 
- is_original_content : binary answer to "Is content in selftext original?"
- edited : binary answer to "Has this post been edited?"
- media_only : binary answer to "Does the post only have a photo?" 

### PART 1A: Scraping

In [None]:
import datetime
import pandas as pd
import requests
import time

from bs4 import BeautifulSoup

In [None]:
headers = {'User-Agent': 'My User Agent 1.0'}

In [None]:
def fetch_page(url, after=''):
    params = {'after': after}
    response = requests.get(url, headers=headers, params=params)
    return response.json()['data']['children']

In [None]:
def parse_post(post):
    keep = ['subreddit', 'title', 'selftext', 'downs', 'ups', 'num_comments', 'permalink', 'name', 'author', 'time', 'is_original_content', 'edited', 'media_only'] 
    return {k:v for k, v in post['data'].items() if k in keep}

In [None]:
def parse_page(page):
    after = ''
    posts = []
    for post in page:
        post = parse_post(post)
        after = post['name']
        posts.append(post)
    return posts, after
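
To see how the `after` cursor is derived without calling the API, the two parsers can be run on a hand-built page. This is only a sketch: `fake_page` and its field values are invented, the `_sketch` function names are hypothetical copies of the functions above, and the shortened `keep` list is an assumption made to keep the example small.

```python
# Hypothetical stand-in for response.json()['data']['children'].
fake_page = [
    {"kind": "t3", "data": {"name": "t3_aaa", "title": "First post", "score": 10}},
    {"kind": "t3", "data": {"name": "t3_bbb", "title": "Second post", "score": 3}},
]

def parse_post_sketch(post):
    # Keep only the fields of interest (shortened list for this sketch).
    keep = ["name", "title"]
    return {k: v for k, v in post["data"].items() if k in keep}

def parse_page_sketch(page):
    after = ""
    posts = []
    for post in page:
        post = parse_post_sketch(post)
        after = post["name"]  # ends up as the name of the last post on the page
        posts.append(post)
    return posts, after

demo_posts, demo_after = parse_page_sketch(fake_page)
print(demo_posts)
print(demo_after)
```

Passing the returned cursor back as the `after` parameter asks Reddit for the listing that follows that post, which is how successive pages are walked.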

In [None]:
all_posts = []
def fetch_subreddit(subreddit, pages=40):
    url = f'https://www.reddit.com/r/{subreddit}.json'
    after = ''
    for i in range(pages):
        print(f'Fetching Page {i + 1}')
        page = fetch_page(url, after)
        posts, after = parse_page(page)
        all_posts.extend(posts)
        time.sleep(5)
    return all_posts

In [None]:
posts = fetch_subreddit('LosAngeles')

#### Looking at the fetched goods through a DataFrame aka stopping to smell the roses.

In [None]:
df=pd.DataFrame(all_posts)
df

### PART 1B: Creating and Exporting Scraped Goods as CSV

In [None]:
!mkdir data

In [None]:
now = str(datetime.datetime.now())[:19]

filename = f'data/datasci scrape {now}.csv'
filename

In [None]:
df.to_csv(filename, index=False)

## PART 2: Preprocessing

In [None]:
import numpy as np
import seaborn as sns
from nltk.corpus import stopwords  # English stop-word list used by the vectorizer
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, accuracy_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
# Dropped the unused sklearn.feature_extraction.stop_words import (removed in newer scikit-learn).


In [None]:
TO=pd.read_csv('./data/TO.csv')
TO.shape

In [None]:
LA= pd.read_csv('./data/LA.csv')
LA.shape

In [None]:
cities=[TO,LA]
tola=pd.concat(cities)
tola.dtypes

#### Let's hold our horses here.
Though we extracted many potential features for our upcoming predictive models, we'll only be focusing on title for now in this model. 

In [None]:
tola.shape

In [None]:
my_stopwords = stopwords.words('english')
my_stopwords.extend(['amp','x200b','\n'])


Let's check what fraction of posts are missing their text.

In [None]:
missing_frac = tola['selftext'].isnull().mean()  # share of posts with no selftext
print(f'{missing_frac:.0%} of posts are missing their text')

Okay, so we're definitely not going to use the selftext.

Let's get ready to set up our actual experiment now! X will be our 'title', because about 76% of our data points have no text in the post beyond the title. Moving along.
First, we want to identify our X (features) and y (target). Our features will be the title strings, and our target will be the TO and LA labels. Right now those values are strings that say either toronto or losangeles, so we will change our target's values to binary integers (1 for Toronto, 0 for Los Angeles).

In [None]:
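X = tola['title']  # use the combined TO + LA frame, not the raw scrape in df
# Lowercase first so the mapping matches however the scrape capitalized the names.
tola['subreddit'] = tola['subreddit'].str.lower().map({'toronto': 1, 'losangeles': 0})

In [None]:
y = tola['subreddit']
y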

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.2, random_state=42)
tfidvec = TfidfVectorizer(stop_words = my_stopwords)
tfidvec.fit(X_train)
X_train= tfidvec.transform(X_train)
X_test = tfidvec.transform(X_test)
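
For intuition about what the vectorizer learns from the training titles, the inverse document frequency part of the weighting can be computed by hand. The sketch below assumes scikit-learn's default smoothed form, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and deliberately skips tokenization rules, term-frequency scaling, and the final L2 normalization. The three titles are invented.

```python
import math
from collections import Counter

# Hypothetical training titles, already lowercased and split on whitespace.
train_titles = [
    "traffic downtown today",
    "raptors game downtown",
    "best tacos downtown",
]
tokenized = [title.split() for title in train_titles]

n_docs = len(tokenized)
# Document frequency: in how many titles does each term appear at least once?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: a term that appears in every title gets the minimum weight of 1.0.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

for term, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{term:10s} {weight:.3f}")
```

Because the weights are learned from `X_train` only, a word that appears solely in the test titles has no column and is ignored at transform time.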

Tried fitting a linear regression here. This doesn't work: the target is a class label, not a continuous value.

In [None]:
X_test

Now setting up just ONE model to pass into our function that outputs a nice confusion matrix DataFrame. Trying a Support Vector Machine.

In [None]:
#parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
clf = svm.SVC(gamma=.001)
clf.fit(X_train, y_train)

In [None]:
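from sklearn.metrics import confusion_matrix

def conf_matrix(model, X_test):
    """Print the four confusion counts and return the matrix as a labelled DataFrame."""
    y_pred = model.predict(X_test)
    # Relies on the global y_test from the train/test split above.
    cm = confusion_matrix(y_test, y_pred)
    # Labels sort as 0 (LA) then 1 (TO), so TO is the "positive" class.
    tn, fp, fn, tp = cm.ravel()
    print(f"True Negatives: {tn}")
    print(f"False Positives: {fp}")
    print(f"False Negatives: {fn}")
    print(f"True Positives: {tp}")
    return pd.DataFrame(cm, columns=['Predicted LA', 'Predicted TO'], index=['Actual LA', 'Actual TO'])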


In [None]:
conf_matrix(clf, X_test)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
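
To make the four confusion counts and the accuracy score concrete, they can be tallied by hand on a few labels. This sketch uses made-up `y_true` and `y_hat` lists and the encoding from this notebook (0 for Los Angeles, 1 for Toronto), so "positive" means Toronto.

```python
# Hypothetical labels: 0 = Los Angeles (negative class), 1 = Toronto (positive class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_hat))
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)

# Accuracy is the share of predictions that land on the diagonal of the confusion matrix.
accuracy = (tp + tn) / len(pairs)

print(f"tn={tn} fp={fp} fn={fn} tp={tp} accuracy={accuracy}")
```

Accuracy is a reasonable headline metric here only because the two classes are roughly balanced; with a lopsided split the individual counts would tell more of the story.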

Creating a pipeline to use in model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline


steps = [("vectorizer", TfidfVectorizer(stop_words=my_stopwords)),
         ("rf", RandomForestClassifier())]

pipe = Pipeline(steps)
grid_params = {
    "vectorizer__max_features": [2000, 3000, 4000],
    "vectorizer__ngram_range":[(1,1), (1,2)],
    "rf__n_estimators": [2500, 3000, 3500],
    "rf__max_depth": [17, 18, 19, 20],
    "rf__min_samples_leaf": [1, 2, 3]}



In [None]:
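# The pipeline vectorizes internally, so it must be fit on raw titles rather than the
# TF-IDF matrices built earlier; the same random_state reproduces the earlier split.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
gs = GridSearchCV(pipe, grid_params, verbose=1, n_jobs=2, cv=5)
results = gs.fit(X_train_raw, y_train)

In [None]:
results.best_params_

In [None]:
best_params = results.best_params_

In [None]:
best_params

In [None]:
# Score the tuned pipeline, not the earlier SVM predictions.
y_pred = gs.predict(X_test_raw)
accuracy_score(y_test, y_pred)

In [None]:
conf_matrix(gs, X_test_raw)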

In [None]:
gs
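
For a sense of how much work the grid search above involves, the number of candidate settings and model fits can be counted directly. This is a rough sketch: the grid is repeated so the block is self-contained, and `n_folds` is set to the five folds used in the search.

```python
from math import prod

# Same grid as in the pipeline cell, repeated here for a self-contained count.
grid = {
    "vectorizer__max_features": [2000, 3000, 4000],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [2500, 3000, 3500],
    "rf__max_depth": [17, 18, 19, 20],
    "rf__min_samples_leaf": [1, 2, 3],
}
n_folds = 5  # number of cross-validation folds

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in grid.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

Each of those fits grows thousands of trees, and by default the search refits the best candidate once more on the full training set, so trimming the grid is the easiest way to shorten the run.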

In [None]:
results

In [None]:
# Cross-validate the untuned pipeline on the raw titles, scoring by accuracy.
cross_val_score(estimator=pipe, X=X, y=y, scoring='accuracy', cv=5, n_jobs=2)