# Project 3: Reddit API Classification & Natural Language Processing

## Tom Ludlow, DSI-NY-6

Using NLP to identify posts from **r/audioengineering** and **r/livesound**

# Notebook 5: Model Evaluation

This notebook documents the evaluation of four optimized NLP classification models.  To test their effectiveness on unseen data, new Reddit posts are first pulled from **r/audioengineering** and **r/livesound** and run through the same pre-processing that was applied to the training data.  Each model is then scored on these new posts, and a confusion matrix shows the counts of true negative, false positive, false negative, and true positive predictions.

The individual results are then compared with a Voting Classifier that combines all four models and returns the majority-vote prediction (`voting='hard'`).  Finally, each model's term coefficients or feature importances are checked to find its most and least important terms.

### Contents:
- [**Pull New Reddit Posts**](#Pull-New-Reddit-Posts)
- [**EDA**](#EDA)
- [**Pre-processing**](#Pre-processing)
- [**Optimized Model Evaluation**](#Optimized-Model-Evaluation)
    - [Optimized Model Features](#Optimized-Model-Features)
    - [Multinomial Naive Bayes](#Model-1-Evaluation:-Multinomial-Naive-Bayes)
    - [Random Forest](#Model-2-Evaluation:-Random-Forest)
    - [Gradient-Boost Decision Tree](#Model-3-Evaluation:-Gradient-Boost-Decision-Tree)
    - [TF-IDF Logistic Regression](#Model-4-Evaluation:-TF-IDF-Logistic-Regression)
    - [Voting Classifier](#Voting-Classifier-Evaluation)
- [**Feature Analysis**](#Feature-Analysis)
    - [Multinomial Naive Bayes](#Model-1:-Multinomial-Naive-Bayes-Coefficients)
    - [Random Forest](#Model-2:-Random-Forest-Feature-Priority)
    - [GradientBoost Decision Tree](#Model-3:-GradientBoost-Decision-Tree-Feature-Priority)
    - [TF-IDF Logistic Regression](#Model-4:-Logistic-Regression-Coefficients)

**Libraries**

In [1]:
# library imports
import requests
import time
import pandas as pd
import numpy as np
import ast
import re
from tqdm import tqdm

# preprocessing imports
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# modeling imports
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, \
    GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# plotting imports
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [148]:
# random state var
r = 1220

## Pull New Reddit Posts
#### Loop to pull new Reddit API posts made after 12/14/18

In [5]:
# create header parameter for API
headers_dict = {'User-agent':'twludlow'}

In [16]:
# instantiate API variables
url = 'https://reddit.com/'
sub01_url = url + 'r/audioengineering' # set sub01 to 'Audio Engineering'
sub02_url = url + 'r/livesound'        # set sub02 to 'Live Sound'

limit_num = 50      # API 'limit' parameter

sub01_after = None  # instantiate empty counters for API 'after' parameter
sub02_after = None

sub01_pages = []    # instantiate empty lists to save API results
sub02_pages = []

for i in range(3): # pull from API 3 times
    
    # add 'after' parameters if an id has been saved - starts as None
    if sub01_after and sub02_after:
        # create full API url for sub01
        sub01_after_url = sub01_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub01_after
        print(sub01_after_url)
        
        # create full API url for sub02
        sub02_after_url = sub02_url + '.json?limit=' \
                            + str(limit_num) + '&after=' \
                            + sub02_after
        print(sub02_after_url)
    
    # if one after is logged and the other is not
    elif bool(sub01_after) != bool(sub02_after):
        print('After reference out of sync.')
        break
    
    else:
        # create first run url
        sub01_after_url = sub01_url + '.json?limit=' + str(limit_num)
        sub02_after_url = sub02_url + '.json?limit=' + str(limit_num)
    
    # pull json from sub01
    sub01_res = requests.get(sub01_after_url, headers=headers_dict)
    print(i, sub01_res.status_code)
    
    # if sub01 connection is established
    if sub01_res.status_code == 200:
        # add page to list
        sub01_pages.append(sub01_res.json()['data'])
        print('sub01_pages length: ', len(sub01_pages))
        
        # set 'after' parameter for next run
        sub01_after = sub01_res.json()['data']['after']
        print('sub01_after: ', sub01_after)
        
    else:        
        print('Connection failed.\n')
    
    # sleep one second
    time.sleep(1)
    
    # pull json from sub02
    sub02_res = requests.get(sub02_after_url, headers=headers_dict)
    print(i, sub02_res.status_code)
    
    # if sub02 connection is established
    if sub02_res.status_code == 200:
        # add page to list
        sub02_pages.append(sub02_res.json()['data'])
        print('sub02_pages length: ', len(sub02_pages))
        
        # set 'after' parameter for next run
        sub02_after = sub02_res.json()['data']['after']
        print('sub02_after: ', sub02_after)
    else:
        print('Connection failed.\n')
        
    # sleep one second    
    time.sleep(1)

0 200
sub01_pages length:  1
sub01_after:  t3_a6v1sq
0 200
sub02_pages length:  1
sub02_after:  t3_a67kph
https://reddit.com/r/audioengineering.json?limit=50&after=t3_a6v1sq
https://reddit.com/r/livesound.json?limit=50&after=t3_a67kph
1 200
sub01_pages length:  2
sub01_after:  t3_a64puo
1 200
sub02_pages length:  2
sub02_after:  t3_a4o87j
https://reddit.com/r/audioengineering.json?limit=50&after=t3_a64puo
https://reddit.com/r/livesound.json?limit=50&after=t3_a4o87j
2 200
sub01_pages length:  3
sub01_after:  t3_a4fh6e
2 200
sub02_pages length:  3
sub02_after:  t3_a32nmp


In [17]:
# create DataFrames from posting lists
sub01_df = pd.DataFrame(sub01_pages)
sub02_df = pd.DataFrame(sub02_pages)

In [18]:
# save API data to files
sub01_df.to_csv('./NEW_audio_engineering_posts.csv', index=False)
sub02_df.to_csv('./NEW_live_sound_posts.csv', index=False)

Saves: NEW_audio_engineering_posts.csv, NEW_live_sound_posts.csv

Iterations: 3 x 50 posts
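The paging logic in the loop above can also be sketched as two small helpers. This is a minimal illustration, not the notebook's code: `build_listing_url`, `pull_pages`, and the injected `fetch_json` callable are hypothetical names, and the sketch assumes the listing's `after` field is `None` once no further pages exist.

```python
from typing import Callable, Optional


def build_listing_url(sub_url: str, limit: int, after: Optional[str] = None) -> str:
    """Return the JSON listing URL for a subreddit, optionally continuing after a post id."""
    listing_url = f"{sub_url}.json?limit={limit}"
    if after:
        listing_url += f"&after={after}"
    return listing_url


def pull_pages(sub_url: str,
               fetch_json: Callable[[str], Optional[dict]],
               n_pages: int = 3,
               limit: int = 50) -> list:
    """Collect up to n_pages listing pages, following each page's 'after' token."""
    pages, after = [], None
    for _ in range(n_pages):
        payload = fetch_json(build_listing_url(sub_url, limit, after))
        if payload is None:      # request failed: stop rather than re-pull the same page
            break
        pages.append(payload['data'])
        after = payload['data']['after']
        if after is None:        # assumed end-of-listing signal
            break
    return pages
```

A real `fetch_json` might wrap `requests.get(url, headers=headers_dict)`, return `res.json()` on a 200 status and `None` otherwise, and sleep one second between calls as the cell above does. Handling each subreddit independently also avoids the "out of sync" case the original loop has to guard against.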

## EDA 
**(repeats the process used for the original training data)**

In [3]:
sub01_df = pd.read_csv('./reddit_data/NEW_audio_engineering_posts.csv')

In [4]:
sub02_df = pd.read_csv('./reddit_data/NEW_live_sound_posts.csv')

In [5]:
sub01_df['children'] = sub01_df.children.map(lambda x: ast.literal_eval(x))

In [6]:
sub02_df['children'] = sub02_df.children.map(lambda x: ast.literal_eval(x))

### Formatting changes after loading from files

In [7]:
sub01_df.head()

| | after | before | children | dist | modhash |
|---|---|---|---|---|---|
| 0 | t3_a6v1sq | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 52 | |
| 1 | t3_a64puo | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 50 | |
| 2 | t3_a4fh6e | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 50 | |


In [8]:
sub01_df.shape

(3, 5)

In [9]:
sub02_df.head()

| | after | before | children | dist | modhash |
|---|---|---|---|---|---|
| 0 | t3_a67kph | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 52 | |
| 1 | t3_a4o87j | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 50 | |
| 2 | t3_a32nmp | | [{'kind': 't3', 'data': {'approved_at_utc': No... | 50 | |


In [10]:
sub02_df.shape

(3, 5)

In [11]:
# save post dictionaries in arrays

ae_posts_bulk = sub01_df['children']
ls_posts_bulk = sub02_df['children']

In [12]:
ae_posts_bulk.head()

0    [{'kind': 't3', 'data': {'approved_at_utc': No...
1    [{'kind': 't3', 'data': {'approved_at_utc': No...
2    [{'kind': 't3', 'data': {'approved_at_utc': No...
Name: children, dtype: object

In [13]:
ls_posts_bulk.head()

0    [{'kind': 't3', 'data': {'approved_at_utc': No...
1    [{'kind': 't3', 'data': {'approved_at_utc': No...
2    [{'kind': 't3', 'data': {'approved_at_utc': No...
Name: children, dtype: object

In [14]:
ae_posts_bulk.shape

(3,)

In [15]:
for post in ae_posts_bulk: 
    print(len(post))

52
50
50


### Unravel posts

#### Target Fields

 - Title: 'title'
 - Posts: 'selftext'
 - Author: 'author_fullname'
 - Upvotes: 'ups'

In [16]:
ae_posts_bulk[0][0]['data']['title']

'Tech Support and Troubleshooting - December 17, 2018'

In [17]:
ae_posts_bulk[0][0]['data']['selftext']

"Welcome the /r/audioengineering Tech Support and Troubleshooting Thread.  We kindly ask that all tech support questions and basic troubleshooting questions (how do I hook up 'a' to 'b'?, headphones vs mons, etc) go here.  If you see posts that belong here, please report them to help us get to them in a timely manner.  Thank you!\n\n   Daily Threads:\n\n\n* [Monday - Gear Recommendations Sticky Thread](http://www.reddit.com/r/audioengineering/search?q=title%3Arecommendation+author%3Aautomoderator&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Monday - Tech Support and Troubleshooting Sticky Thread](http://www.reddit.com/r/audioengineering/search?q=title%3ASupport+author%3Aautomoderator&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Tuesday - Tips &amp; Tricks](http://www.reddit.com/r/audioengineering/search?q=title%3A%22tuesdays%22+AND+%28author%3Aautomoderator+OR+author%3Ajaymz168%29&amp;restrict_sr=on&amp;sort=new&amp;t=all)\n* [Friday - How did they do that?](http://www.reddit.com/r

In [18]:
ls_posts_bulk[0][0]['data']['selftext']

'Post the pictures you took at your gigs this week!'

In [19]:
ae_posts_bulk[0][0]['data']['author_fullname']

't2_6l4z3'

In [20]:
ls_posts_bulk[0][0]['data']['author_fullname']

't2_6l4z3'

In [21]:
ae_posts_bulk[0][0]['data']['ups']

8

#### Post titles - 'title'

In [22]:
ae_titles = [ae_posts_bulk[i][j]['data']['title'] for i in range(len(ae_posts_bulk)) 
            for j in range(len(ae_posts_bulk[i]))]
ls_titles = [ls_posts_bulk[i][j]['data']['title'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]

#### Posts - 'selftext'

In [23]:
# create lists of post bodies using nested comprehensions
ae_posts = [ae_posts_bulk[i][j]['data']['selftext'] for i in range(len(ae_posts_bulk)) 
            for j in range(len(ae_posts_bulk[i]))]
ls_posts = [ls_posts_bulk[i][j]['data']['selftext'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]

In [24]:
len(ae_posts)

152

In [25]:
len(ls_posts)

152

#### Upvotes - 'ups'

In [26]:
ae_ups = [ae_posts_bulk[i][j]['data']['ups'] for i in range(len(ae_posts_bulk)) 
            for j in range(len(ae_posts_bulk[i]))]
ls_ups = [ls_posts_bulk[i][j]['data']['ups'] for i in range(len(ls_posts_bulk)) 
            for j in range(len(ls_posts_bulk[i]))]

#### Authors - 'author_fullname'

Built with explicit loops so that posts with no `author_fullname` key get a placeholder value.

In [27]:
ae_authors = []
ls_authors = []

# dict.get returns the placeholder when 'author_fullname' is missing
for page in ae_posts_bulk:
    for post in page:
        ae_authors.append(post['data'].get('author_fullname', 'no author'))

for page in ls_posts_bulk:
    for post in page:
        ls_authors.append(post['data'].get('author_fullname', 'no author'))

In [28]:
len(ae_authors)

152

In [29]:
len(ls_authors)

152

In [30]:
# compile lists into DataFrame
ae_df = pd.DataFrame([ae_titles, ae_posts, ae_authors, ae_ups], index=['title','post','author','upvotes'])

In [31]:
# transpose from rows to columns
ae_df = ae_df.T

In [32]:
# compile lists into DataFrame
ls_df = pd.DataFrame([ls_titles, ls_posts, ls_authors, ls_ups], index=['title','post','author','upvotes'])

In [33]:
# transpose from rows to columns
ls_df = ls_df.T
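The four field extractions above walk the same nested structure once per field. As a hedged alternative, the sketch below flattens the pages into one record per post in a single pass. `flatten_pages` is a hypothetical helper, not part of the notebook; it assumes each page is a list of `{'kind': ..., 'data': {...}}` dictionaries, as shown in the outputs above.

```python
def flatten_pages(pages, fields=('title', 'selftext', 'author_fullname', 'ups')):
    """Turn a sequence of listing pages (each a list of post dicts) into flat records."""
    records = []
    for page in pages:
        for child in page:
            data = child['data']
            # missing keys become None instead of raising KeyError
            record = {field: data.get(field) for field in fields}
            # reuse the notebook's placeholder for posts without an author id
            if 'author_fullname' in record and record['author_fullname'] is None:
                record['author_fullname'] = 'no author'
            records.append(record)
    return records
```

Because the result is a list of dictionaries, it could be passed straight to `pd.DataFrame(...)` and the columns renamed to `title`, `post`, `author`, and `upvotes`, which would remove the need for the transpose step.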

#### Save separate DataFrames

In [34]:
ae_df.to_csv('./csv/NEW_ae_df.csv', index=False)
ls_df.to_csv('./csv/NEW_ls_df.csv', index=False)

In [35]:
# binarize our classifier: 'is_ls' (is live sound)
ae_df['is_ls'] = 0
ls_df['is_ls'] = 1

In [36]:
df = pd.concat([ae_df, ls_df])

In [37]:
df.head()

| | title | post | author | upvotes | is_ls |
|---|---|---|---|---|---|
| 0 | Tech Support and Troubleshooting - December 17... | Welcome the /r/audioengineering Tech Support a... | t2_6l4z3 | 8 | 0 |
| 1 | Gear Recommendation (What Should I Buy?) Threa... | Welcome to our weekly Gear Recommendation Thre... | t2_6l4z3 | 6 | 0 |
| 2 | Is a ThunderBolt audio interface worth it for ... | Right now I record through a crappy AudioBox U... | t2_24wuqxkk | 31 | 0 |
| 3 | What’s the difference b/w dithering and trunca... | It’s my basic understanding that dithering is ... | t2_y139dwj | 9 | 0 |
| 4 | Getting wide auto-panning to sound right/auto-... | I’ve noticed that a lot of songs I really enjo... | t2_4anq6 | 4 | 0 |


In [38]:
df.shape

(304, 5)

In [39]:
df.is_ls.value_counts()

1    152
0    152
Name: is_ls, dtype: int64

In [40]:
df['post'] = df['post'].fillna(' ')  # assign back instead of chained inplace fill

In [41]:
df['comb'] = df['title'] + ' ' + df['post']

In [42]:
df.index = range(len(df))

In [43]:
# check for posts with no title or body text and store index to list
# (strip first: 'comb' always contains at least the joining space)
to_drop = []

for i, post in enumerate(df['comb']):
    if len(post.strip()) == 0:
        to_drop.append(i)

In [44]:
len(to_drop)

0

In [45]:
# drop rows with empty posts using index list
df.drop(to_drop, inplace=True)

In [46]:
df.is_ls.value_counts()

1    152
0    152
Name: is_ls, dtype: int64

In [47]:
df.to_csv('./csv/NEW_post_df.csv', index=False)

Last saved 12/19/18 as NEW_post_df.csv

## Pre-processing

In [2]:
df = pd.read_csv('./csv/NEW_post_df.csv')

In [3]:
df.shape

(304, 6)

In [4]:
df.comb.head()

0    Tech Support and Troubleshooting - December 17...
1    Gear Recommendation (What Should I Buy?) Threa...
2    Is a ThunderBolt audio interface worth it for ...
3    What’s the difference b/w dithering and trunca...
4    Getting wide auto-panning to sound right/auto-...
Name: comb, dtype: object

In [51]:
df.head()

| | title | post | author | upvotes | is_ls | comb |
|---|---|---|---|---|---|---|
| 0 | Tech Support and Troubleshooting - December 17... | Welcome the /r/audioengineering Tech Support a... | t2_6l4z3 | 8 | 0 | Tech Support and Troubleshooting - December 17... |
| 1 | Gear Recommendation (What Should I Buy?) Threa... | Welcome to our weekly Gear Recommendation Thre... | t2_6l4z3 | 6 | 0 | Gear Recommendation (What Should I Buy?) Threa... |
| 2 | Is a ThunderBolt audio interface worth it for ... | Right now I record through a crappy AudioBox U... | t2_24wuqxkk | 31 | 0 | Is a ThunderBolt audio interface worth it for ... |
| 3 | What’s the difference b/w dithering and trunca... | It’s my basic understanding that dithering is ... | t2_y139dwj | 9 | 0 | What’s the difference b/w dithering and trunca... |
| 4 | Getting wide auto-panning to sound right/auto-... | I’ve noticed that a lot of songs I really enjo... | t2_4anq6 | 4 | 0 | Getting wide auto-panning to sound right/auto-... |


### Tokenizing titles and posts

In [5]:
rt = RegexpTokenizer(r"[\w/\']+") # regex to include words, slash characters for urls, apostrophes

In [6]:
df.comb.sample(5)

121    [Question] EQing Reverbs before or after it? W...
297    X32 Complete Backup I can't believe it's been ...
160    Sad Shure SM81 - Any ideas? Hey /r/livesound, ...
93     Rode NTA-1 Capsule Swap? Edit: May also help i...
109    When Using EQ on dialogue does subtle work or ...
Name: comb, dtype: object

In [7]:
cleaned = []
for text in df.comb:
    text_loop = text.replace('&amp;', '&')          # decode ampersand entity
    text_loop = text_loop.replace('#x200B;', ' ')    # zero-width space entity
    text_loop = text_loop.replace('nbsp;', ' ')      # non-breaking space entity
    cleaned.append(text_loop.replace('\n', ' ').strip())  # newlines
df['comb'] = cleaned  # assign the whole column to avoid chained assignment

In [8]:
len(df.post)

304

#### Tokenize each post and save to list

In [9]:
comb_tokens = []  # one token list per post

for i in range(len(df.comb)):
    loop_tokens = rt.tokenize(df.comb.iloc[i].lower()) # use iloc to skip removed rows
    for j, token in enumerate(loop_tokens):
        if re.match(r"\d+[\w]*", token):   # blank out tokens that start with a digit
            loop_tokens[j] = ''
        if re.match(r"//[\w]*", token):    # blank out url remainders such as '//www'
            loop_tokens[j] = ''
        if ('audioengineering' in token) or ('livesound' in token) or ('http' in token):
            loop_tokens[j] = ''            # blank out subreddit names and url schemes
    comb_tokens.append(loop_tokens)        # add token list to comb_tokens

In [10]:
len(comb_tokens)

304

In [11]:
len(comb_tokens[0])

153

In [12]:
comb_tokens[0][:5]

['tech', 'support', 'and', 'troubleshooting', 'december']
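The filtering rules in the loop above can be sketched as a standalone function using only the standard library. This is an approximation under stated assumptions: `clean_tokens` is a hypothetical name, `re.findall` stands in for NLTK's `RegexpTokenizer`, and unwanted tokens are dropped outright, whereas the notebook replaces them with empty strings.

```python
import re

TOKEN_PATTERN = re.compile(r"[\w/']+")   # same character class as the tokenizer above
BLOCKED_SUBSTRINGS = ('audioengineering', 'livesound', 'http')


def clean_tokens(text):
    """Lowercase, tokenize, and drop numeric, URL, and subreddit-name tokens."""
    kept = []
    for token in TOKEN_PATTERN.findall(text.lower()):
        if re.match(r"\d", token):       # token starts with a digit
            continue
        if token.startswith('//'):       # URL remainder such as '//www'
            continue
        if any(blocked in token for blocked in BLOCKED_SUBSTRINGS):
            continue                     # subreddit names would leak the label
        kept.append(token)
    return kept
```

Removing the subreddit names matters for evaluation: a post that mentions "r/livesound" would otherwise be trivially classifiable, which would overstate how well the models separate the two communities on content alone.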

### Lemmatize

In [13]:
lm = WordNetLemmatizer()

In [14]:
posts_t_lm = []

for post in comb_tokens:
    post_lm = []                       # lemmas for this post
    for word in post:
        word_lm = lm.lemmatize(word)   # get lemmatized word
        post_lm.append(word_lm)        # add to post list
    posts_t_lm.append(post_lm)         # add post list to lemma matrix

In [16]:
posts_t_lm[0][:5]

['tech', 'support', 'and', 'troubleshooting', 'december']

### Join lemmatized tokens into one string per post

In [17]:
posts_t_lm_list = []

for post in posts_t_lm:
    posts_t_lm_list.append(' '.join(post))

In [18]:
posts_t_lm_list[:2]

["tech support and troubleshooting december   welcome the  tech support and troubleshooting thread we kindly ask that all tech support question and basic troubleshooting question how do i hook up 'a' to 'b' headphone v mon etc go here if you see post that belong here please report them to help u get to them in a timely manner thank you daily thread monday gear recommendation sticky thread   reddit  q title  author  restrict_sr on sort new t all monday tech support and troubleshooting sticky thread   reddit  q title  author  restrict_sr on sort new t all tuesday tip trick   reddit  q title    and   or author   restrict_sr on sort new t all friday how did they do that   reddit  q title  author  restrict_sr on sort new t all",
 'gear recommendation what should i buy thread december   welcome to our weekly gear recommendation thread where you can ask  for recommendation on smart purchase low cost gear and purchasing recommendation request have become common in the ae subreddit there is als

### Add index to posts and titles and create DataFrames

In [19]:
df_pre = pd.DataFrame(data=[posts_t_lm_list], index=['post_lm'])

In [20]:
df_pre = df_pre.T

In [21]:
df_pre.head()

| | post_lm |
|---|---|
| 0 | tech support and troubleshooting december we... |
| 1 | gear recommendation what should i buy thread d... |
| 2 | is a thunderbolt audio interface worth it for ... |
| 3 | what s the difference b/w dithering and trunca... |
| 4 | getting wide auto panning to sound right/auto ... |


In [22]:
df_pre['is_ls'] = df['is_ls']

In [23]:
df_pre.head()

| | post_lm | is_ls |
|---|---|---|
| 0 | tech support and troubleshooting december we... | 0 |
| 1 | gear recommendation what should i buy thread d... | 0 |
| 2 | is a thunderbolt audio interface worth it for ... | 0 |
| 3 | what s the difference b/w dithering and trunca... | 0 |
| 4 | getting wide auto panning to sound right/auto ... | 0 |


In [69]:
df_pre.to_csv('./csv/NEW_df_pre.csv', index=False)

In [70]:
new_test = df_pre.copy()  # copy so the in-place drops below leave df_pre intact

### Check for and remove duplicate posts

In [73]:
old_df_pre = pd.read_csv('./csv/181219_df_pre.csv')

In [74]:
old_df_pre[old_df_pre.is_ls==0].post_lm.head()

0    tech support and troubleshooting december   we...
1    gear recommendation what should i buy thread d...
2    will i ever understand compression ahh yes my ...
3    i'm interviewing to be an intern at a big stud...
4    if i faced two speaker towards each other one ...
Name: post_lm, dtype: object

In [75]:
for i, newpost in enumerate(new_test[new_test.is_ls==0].post_lm):
    if "will i ever understand compression" in newpost:
        print(i, newpost)

79 will i ever understand compression ahh yes my monthly compression post i 'get' the idea of compression but i am struggling so much with how to hear what is reasonable/appropriate/too much when adding compression here's my understanding and please correct me on any or all of this threshold the point where the plug in actually kick in ratio how much of the sound over the threshold is compressed attack length of time before compression kick in release how long compression is held for makeup gain getting the 'lost' volume back from the compression i'm trying to start with understanding the logic house compressor i've picked one to learn studio fet my problem is i can see the needle move and know something is happening but i just can't tell if i've found a sweet spot am i monitoring the whole track while getting the compressor right or picking a particular good point where there might be loud word or soft word and looping it till i get it right i've watched ton of video explanation anima

In [76]:
old_df_pre[old_df_pre.is_ls==1].post_lm.head()

924    weekly office pic thread week of       post th...
925    no stupid question thread week of       the on...
926                                      hope this count
927            i think we might have to re align the sub
928      got to the venue and they have a mystery switch
Name: post_lm, dtype: object

In [77]:
for i, newpost in enumerate(new_test[new_test.is_ls==1].post_lm):
    if "hope this count" in newpost:
        print(i, newpost)

48 hope this count


In [78]:
new_test[new_test.is_ls==1].post_lm.head()

152    weekly office pic thread week of       post th...
153    no stupid question thread week of       the on...
154    the best production timelapse i've seen so wel...
155    wireless router/mobile app for a h qu/sq serie...
156                               the realest of reverbs
Name: post_lm, dtype: object

In [79]:
new_test.post_lm[200]

'hope this count'

In [80]:
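# the first overlap with the training data was found at these positions;
# everything from there on is dropped
new_test.drop(range(79,152), inplace=True)   # r/audioengineering rows (first overlap at 79)
new_test.drop(range(200,304), inplace=True)  # r/livesound rows (first overlap at 152 + 48)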

In [81]:
new_test.is_ls.value_counts()

0    79
1    48
Name: is_ls, dtype: int64
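The overlap check above locates the first already-seen post by hand and drops everything after it. A hedged alternative is to filter by exact text match against the training posts. `drop_seen_posts` is a hypothetical helper, and note that it behaves differently from the range-based drop: it removes only posts whose lemmatized text appears verbatim in the old data.

```python
def drop_seen_posts(new_posts, old_posts):
    """Return (position, text) pairs for new posts whose text is not in old_posts."""
    seen = set(old_posts)   # set membership keeps the check O(1) per post
    return [(i, text) for i, text in enumerate(new_posts) if text not in seen]
```

With pandas, the same idea could be expressed as a boolean mask, for example `new_test[~new_test.post_lm.isin(old_df_pre.post_lm)]`. Either way, exact matching misses posts that were edited between pulls, so the manual spot checks above remain a useful sanity test.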

In [82]:
new_test.to_csv('./csv/181220_new_test.csv', index=False)

# Optimized Model Evaluation

## Optimized Model Features
 
**Model 1:** Multinomial Naive-Bayes
 - *Lemmatizer*
 - *CountVectorizer*
  - `stop_words='english'`
  - `ngram_range=(1,2)`
 - *GridSearch*
  - `cv__max_features=35000`
  - `mnb__alpha=1.2`
 
**Model 2:** Random Forest
 - *Lemmatizer*
 - *CountVectorizer*
  - `stop_words='english'`
  - `ngram_range=(1,1)`
 - *GridSearch*
  - `cv__max_features=None`
  - `rf__criterion='gini'`
  - `rf__n_estimators=99`
  - `rf__max_depth=9`
  - `rf__max_features='sqrt'`
  
**Model 3:** Gradient-Boost Decision Tree
 - *Lemmatizer*
 - *CountVectorizer*
  - `stop_words='english'`
  - `ngram_range=(1,2)`
 - *GridSearch*
  - `cv__max_features=None`
  - `gb__loss='deviance'`
  - `gb__max_depth=7`
  - `gb__n_estimators=100`
  
**Model 4:** TF-IDF Logistic Regression
 - *Lemmatizer*
 - *TfidfVectorizer*
  - `stop_words='english'`
  - `ngram_range=(1,2)`
 - *StandardScaler*
  - `with_mean=False`
 - *GridSearch*
  - `tf__max_features=30000`
  - `lr__penalty='l1'`
  - `lr__C=1`
  - `lr__tol=.001`

In [83]:
new_test = pd.read_csv('./csv/181220_new_test.csv')

In [27]:
X_train = pd.read_csv('./csv/181220_X_train.csv', index_col=0)
X_test = pd.read_csv('./csv/181220_X_test.csv', index_col=0)
y_train = pd.read_csv('./csv/181220_y_train.csv', index_col=0)
y_test = pd.read_csv('./csv/181220_y_test.csv', index_col=0)

# Model 1 Evaluation: Multinomial Naive Bayes

In [32]:
m1_steps = [('m1_cv',CountVectorizer(stop_words='english', ngram_range=(1,2), max_features=35000)),
           ('m1_mnb',MultinomialNB(alpha=1.2))]

In [33]:
pipe_1 = Pipeline(m1_steps)

In [34]:
pipe_1.fit(X_train.post_lm, y_train.is_ls)

Pipeline(memory=None,
     steps=[('m1_cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=35000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('m1_mnb', MultinomialNB(alpha=1.2, class_prior=None, fit_prior=True))])

#### Training Accuracy

In [152]:
pipe_1.score(X_train.post_lm, y_train.is_ls)

0.9873861247372109

#### Testing Accuracy

In [153]:
pipe_1.score(X_test.post_lm, y_test.is_ls)

0.8340336134453782

## Accuracy against unseen posts: 84.25%

In [154]:
pipe_1.score(new_test.post_lm, new_test.is_ls)

0.84251968503937

In [155]:
tn, fp, fn, tp = confusion_matrix(new_test.is_ls, pipe_1.predict(new_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 67
False Positives: 12
False Negatives: 8
True Positives: 40

Accuracy:  0.84251968503937
Sensitivity:  0.8333333333333334
Specificity:  0.8481012658227848
Precision:  0.7692307692307693
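The same confusion-matrix report is repeated for every model below. As a minimal sketch, the calculation can be wrapped in one function; `binary_metrics` is a hypothetical helper written in plain Python, and it assumes labels are coded 0 (r/audioengineering) and 1 (r/livesound) as in `is_ls`.

```python
def binary_metrics(y_true, y_pred):
    """Count confusion-matrix cells for 0/1 labels and derive the four summary rates."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # avoid ZeroDivisionError when a class or prediction is absent
        return numerator / denominator if denominator else float('nan')

    return {
        'tn': tn, 'fp': fp, 'fn': fn, 'tp': tp,
        'accuracy': ratio(tn + tp, tn + fp + fn + tp),
        'sensitivity': ratio(tp, tp + fn),   # share of r/livesound posts found
        'specificity': ratio(tn, tn + fp),   # share of r/audioengineering posts found
        'precision': ratio(tp, tp + fp),     # share of r/livesound predictions that are right
    }
```

For Model 1, the counts above give accuracy (67 + 40) / 127, sensitivity 40 / 48, specificity 67 / 79, and precision 40 / 52, which match the printed values.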


# Model 2 Evaluation: Random Forest

In [35]:
m2_steps = [('m2_cv',CountVectorizer(stop_words='english', ngram_range=(1,1))),
           ('m2_rf',RandomForestClassifier(criterion='gini', n_estimators=99, max_depth=9))]

In [36]:
pipe_2 = Pipeline(m2_steps)

In [37]:
pipe_2.fit(X_train.post_lm, y_train.is_ls)

Pipeline(memory=None,
     steps=[('m2_cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
       ...obs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))])

#### Training Accuracy

In [159]:
pipe_2.score(X_train.post_lm, y_train.is_ls)

0.877365101611773

#### Testing Accuracy

In [160]:
pipe_2.score(X_test.post_lm, y_test.is_ls)

0.773109243697479

## Accuracy against unseen posts: 76.38%

In [161]:
pipe_2.score(new_test.post_lm, new_test.is_ls)

0.7637795275590551

In [162]:
tn, fp, fn, tp = confusion_matrix(new_test.is_ls, pipe_2.predict(new_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 53
False Positives: 26
False Negatives: 4
True Positives: 44

Accuracy:  0.7637795275590551
Sensitivity:  0.9166666666666666
Specificity:  0.6708860759493671
Precision:  0.6285714285714286


# Model 3 Evaluation: Gradient-Boost Decision Tree

In [38]:
# note: loss='deviance' was renamed to 'log_loss' in scikit-learn 1.1
m3_steps = [('m3_cv',CountVectorizer(stop_words='english', ngram_range=(1,2), max_features=None)),
           ('m3_gb',GradientBoostingClassifier(loss='deviance', n_estimators=100, max_depth=7))]

In [39]:
pipe_3 = Pipeline(m3_steps)

In [40]:
pipe_3.fit(X_train.post_lm, y_train.is_ls)

Pipeline(memory=None,
     steps=[('m3_cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
       ...    subsample=1.0, tol=0.0001, validation_fraction=0.1,
              verbose=0, warm_start=False))])

#### Training Accuracy

In [166]:
pipe_3.score(X_train.post_lm, y_train.is_ls)

0.9985984583041345

#### Testing Accuracy

In [167]:
pipe_3.score(X_test.post_lm, y_test.is_ls)

0.8046218487394958

## Accuracy against unseen posts: 81.89%

In [168]:
pipe_3.score(new_test.post_lm, new_test.is_ls)

0.8188976377952756

In [169]:
tn, fp, fn, tp = confusion_matrix(new_test.is_ls, pipe_3.predict(new_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 63
False Positives: 16
False Negatives: 7
True Positives: 41

Accuracy:  0.8188976377952756
Sensitivity:  0.8541666666666666
Specificity:  0.7974683544303798
Precision:  0.7192982456140351


# Model 4 Evaluation: TF-IDF Logistic Regression

In [24]:
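# penalty='l1' relies on the liblinear solver, the default when this was run;
# newer scikit-learn versions need solver='liblinear' passed explicitly
m4_steps = [('m4_tf',TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_features=30000)),
            ('m4_ss',StandardScaler(with_mean=False)),
            ('m4_lr',LogisticRegression(penalty='l1', C=1, tol=.001))]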

In [25]:
pipe_4 = Pipeline(m4_steps)

In [28]:
pipe_4.fit(X_train.post_lm, y_train.is_ls)

Pipeline(memory=None,
     steps=[('m4_tf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=30000, min_df=1,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True... penalty='l1', random_state=None, solver='warn',
          tol=0.001, verbose=0, warm_start=False))])

#### Training Accuracy

In [29]:
pipe_4.score(X_train.post_lm, y_train.is_ls)

0.99929922915206726

#### Testing Accuracy

In [30]:
pipe_4.score(X_test.post_lm, y_test.is_ls)

0.80672268907563027

## Accuracy against unseen posts: 85.04%

In [176]:
pipe_4.score(new_test.post_lm, new_test.is_ls)

0.8503937007874016

In [177]:
tn, fp, fn, tp = confusion_matrix(new_test.is_ls, pipe_4.predict(new_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 67
False Positives: 12
False Negatives: 7
True Positives: 41

Accuracy:  0.8503937007874016
Sensitivity:  0.8541666666666666
Specificity:  0.8481012658227848
Precision:  0.7735849056603774


# Voting Classifier Evaluation

In [41]:
vote_all = VotingClassifier([
    ('mnb', pipe_1),
    ('rf', pipe_2),
    ('gb', pipe_3),
    ('lr', pipe_4)
])

In [42]:
vote_all.fit(X_train.post_lm, y_train.is_ls)

VotingClassifier(estimators=[('mnb', Pipeline(memory=None,
     steps=[('m1_cv', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=35000, min_df=1,
        ngram_range=(1, 2), ...nalty='l1', random_state=None, solver='warn',
          tol=0.001, verbose=0, warm_start=False))]))],
         flatten_transform=None, n_jobs=None, voting='hard', weights=None)

#### Training Accuracy

In [180]:
vote_all.score(X_train.post_lm, y_train.is_ls)

0.9985984583041345

#### Testing Accuracy

In [181]:
vote_all.score(X_test.post_lm, y_test.is_ls)

0.8298319327731093

## Accuracy against unseen posts: 88.98%

In [182]:
vote_all.score(new_test.post_lm, new_test.is_ls)

0.889763779527559

In [183]:
tn, fp, fn, tp = confusion_matrix(new_test.is_ls, vote_all.predict(new_test.post_lm)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

print("\nAccuracy: ", (tn + tp) / (tn + fp + fn + tp))
print("Sensitivity: ", tp / (tp + fn))
print("Specificity: ", tn / (tn + fp))
print("Precision: ", tp / (tp + fp))

True Negatives: 70
False Positives: 9
False Negatives: 5
True Positives: 43

Accuracy:  0.889763779527559
Sensitivity:  0.8958333333333334
Specificity:  0.8860759493670886
Precision:  0.8269230769230769
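The fitted classifier's repr shows `voting='hard'`, so each post receives the label that most of the four pipelines predict. The sketch below illustrates that rule in plain Python. `hard_vote` is a hypothetical function, not scikit-learn's implementation; the tie-break toward the smallest label follows scikit-learn's documented behaviour of resolving ties in ascending class order.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models; ties go to the smallest class label."""
    voted = []
    # zip(*...) regroups the lists so each tuple holds every model's vote for one post
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        voted.append(min(label for label, count in counts.items() if count == best))
    return voted
```

With four estimators a 2-2 split is possible, and under this rule such posts are labelled 0 (r/audioengineering). That may be part of why the ensemble's specificity is the highest of all the models evaluated here.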


# Feature Analysis

### Model 1: Multinomial Naive Bayes Coefficients

In [184]:
m1 = pipe_1.named_steps['m1_mnb']

cv1 = pipe_1.named_steps['m1_cv']
cv1.transform(X_train.post_lm)  # vectorizer is already fitted inside pipe_1

<1427x35000 sparse matrix of type '<class 'numpy.int64'>'
	with 90419 stored elements in Compressed Sparse Row format>

In [185]:
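# For a two-class MultinomialNB, coef_ holds the log probability of each term
# given class 1 (r/livesound), so every value is negative
m1_df = pd.DataFrame(m1.coef_.T, index=cv1.get_feature_names(), columns=['coef'])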

In [186]:
m1_df.coef.sort_values(ascending=False)

sound                -5.571445
wa                   -5.814237
just                 -5.873034
like                 -5.911619
ve                   -5.935506
live                 -6.212235
know                 -6.250139
work                 -6.295294
ha                   -6.301084
need                 -6.312766
band                 -6.330550
speaker              -6.330550
looking              -6.348655
use                  -6.367095
audio                -6.373318
mixer                -6.392222
time                 -6.437773
want                 -6.457948
mic                  -6.464765
channel              -6.492506
guy                  -6.521039
thanks               -6.573018
using                -6.573018
way                  -6.573018
question             -6.580669
amp                  -6.588379
don                  -6.635932
monitor              -6.635932
help                 -6.660584
good                 -6.660584
                       ...    
sampling theory     -11.267419
sand    

In [187]:
# Function researched and borrowed from Stack Overflow
# https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers

def important_features(vectorizer, classifier, n=20):
    """Print the n most frequent terms per class from the fitted model's raw counts."""
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    # feature_count_[i] holds the per-term counts observed for class i during fit
    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]
    print("Important words for r/audioengineering\n")
    for count, feat in topn_class1:
        print(class_labels[0], count, feat)
    print("-----------------------------------------\n")
    print("Important words for r/livesound\n")
    for count, feat in topn_class2:
        print(class_labels[1], count, feat)

In [188]:
important_features(pipe_1.named_steps['m1_cv'], pipe_1.named_steps['m1_mnb'], 15)

Important words for r/audioengineering

0 466.0 sound
0 308.0 just
0 305.0 audio
0 289.0 like
0 275.0 recording
0 262.0 wa
0 246.0 ve
0 232.0 know
0 229.0 track
0 222.0 new
0 203.0 use
0 186.0 way
0 178.0 sort
0 174.0 mix
0 171.0 thread
-----------------------------------------

Important words for r/livesound

1 356.0 sound
1 279.0 wa
1 263.0 just
1 253.0 like
1 247.0 ve
1 187.0 live
1 180.0 know
1 172.0 work
1 171.0 ha
1 169.0 need
1 166.0 speaker
1 166.0 band
1 163.0 looking
1 160.0 use
1 159.0 audio
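
Both lists above are led by the same high-frequency terms ("sound", "just", "like"), because raw counts reward words that are common in either subreddit. One possible complement, sketched here under stated assumptions, is to rank terms by the difference in smoothed log probabilities between the two classes. The helper name `class_log_ratio`, the toy count matrix, and `alpha=1.0` are all assumptions for illustration; the smoothing formula is the standard additive smoothing that Multinomial Naive Bayes applies to its feature counts.

```python
import numpy as np

def class_log_ratio(feature_count, alpha=1.0):
    """Smoothed log P(term | class 1) minus log P(term | class 0), per term."""
    smoothed = np.asarray(feature_count, dtype=float) + alpha
    # Normalize each class row so it sums to 1, in log space
    log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    return log_prob[1] - log_prob[0]

# Hypothetical counts: rows are classes (0, 1), columns are terms
terms = ["sound", "recording", "venue"]
counts = [[40, 30, 2],
          [40, 3, 25]]

ratio = class_log_ratio(counts)
# Most negative leans toward class 0, most positive toward class 1
for term, r in sorted(zip(terms, ratio), key=lambda pair: pair[1]):
    print(f"{term:10s} {r:+.2f}")
```

In the notebook, the same function could be applied to `m1.feature_count_` with the model's tuned `alpha`. Terms shared by both classes land near zero, while class-specific terms move to the extremes.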


### Model 2: Random Forest Feature Priority

In [189]:
m2 = pipe_2.named_steps['m2_rf']

cv2 = pipe_2.named_steps['m2_cv']
cv2.fit_transform(X_train.post_lm)

<1427x8413 sparse matrix of type '<class 'numpy.int64'>'
	with 53056 stored elements in Compressed Sparse Row format>

In [190]:
m2_df = pd.DataFrame(m2.feature_importances_, index=cv2.get_feature_names(), columns=['fi'])

In [191]:
m2_df[m2_df.fi==0].shape[0]

6966

In [192]:
# Drop the terms with zero importance so only contributing features remain
m2_df.drop(m2_df[m2_df.fi==0].index, inplace=True)

In [193]:
m2_df.fi.sort_values(ascending=False)

venue           2.992175e-02
recording       1.892804e-02
gig             1.657400e-02
stage           1.538751e-02
album           1.395730e-02
drum            1.215183e-02
mixer           1.196458e-02
engineering     1.152045e-02
author          1.151017e-02
sticky          1.101322e-02
track           1.069544e-02
compression     9.653836e-03
interface       9.174475e-03
new             8.895328e-03
event           8.349787e-03
audio           8.276294e-03
reddit          7.959382e-03
foh             7.930446e-03
title           7.501955e-03
live            7.457397e-03
thread          7.152979e-03
record          6.857877e-03
wireless        6.822146e-03
trick           6.565180e-03
project         6.548113e-03
restrict_sr     6.267189e-03
pa              5.988630e-03
console         5.843818e-03
friday          5.701423e-03
master          5.621560e-03
                    ...     
charging        1.168733e-06
disability      1.164820e-06
panning         1.161813e-06
picking       

### Model 3: GradientBoost Decision Tree Feature Priority

In [194]:
m3 = pipe_3.named_steps['m3_gb']

cv3 = pipe_3.named_steps['m3_cv']
cv3.fit_transform(X_train.post_lm)

<1427x62698 sparse matrix of type '<class 'numpy.int64'>'
	with 118117 stored elements in Compressed Sparse Row format>

In [195]:
m3_df = pd.DataFrame(m3.feature_importances_, index=cv3.get_feature_names(), columns=['fi'])

In [196]:
m3_df[m3_df.fi==0].shape[0]

61088

In [197]:
m3_df.drop(m3_df[m3_df.fi==0].index, inplace=True)

In [198]:
m3_df.fi.sort_values(ascending=False)

recording              5.773130e-02
venue                  4.778125e-02
track                  4.730125e-02
mixer                  3.424911e-02
gig                    3.087447e-02
audio                  2.798132e-02
live                   2.607853e-02
studio                 2.311059e-02
pa                     1.931976e-02
album                  1.699700e-02
wireless               1.686895e-02
tuesday                1.575035e-02
mix                    1.424106e-02
daily                  1.322172e-02
interface              1.273513e-02
backing track          1.236456e-02
console                1.189749e-02
stage                  1.044619e-02
sound                  1.021441e-02
foh                    1.011606e-02
response               9.096207e-03
engineering            8.829859e-03
receiver               8.753419e-03
x32                    8.437276e-03
make                   8.282002e-03
record                 7.987869e-03
drum                   7.836628e-03
thread                 7.666

### Model 4: Logistic Regression Coefficients

In [199]:
m4 = pipe_4.named_steps['m4_lr']

tf4 = pipe_4.named_steps['m4_tf']
tf4.fit_transform(X_train.post_lm)

<1427x30000 sparse matrix of type '<class 'numpy.float64'>'
	with 85419 stored elements in Compressed Sparse Row format>

In [200]:
m4_df = pd.DataFrame(m4.coef_.T, index=tf4.get_feature_names(), columns=['coef'])

In [201]:
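# Negative coefficients push toward class 0 (r/audioengineering),
# positive coefficients toward class 1 (r/livesound)
m4_df.coef.sort_values(ascending=True)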

studio         -0.423051
interface      -0.392248
recording      -0.352222
track          -0.345991
drum           -0.284348
focusrite      -0.278234
album          -0.273362
noise          -0.269377
record         -0.264122
mastering      -0.262601
bass           -0.237672
recorded       -0.221967
engineering    -0.215780
session        -0.210754
thing          -0.207626
topic          -0.205324
audio          -0.200939
way            -0.198831
project        -0.195182
reverb         -0.194448
mix            -0.191028
sm7b           -0.187119
car            -0.185066
sample         -0.183184
production     -0.178777
tracking       -0.176243
plugin         -0.172085
improve        -0.170546
song           -0.157336
guitar pedal   -0.147453
                  ...   
case            0.175549
vocalist        0.175683
shure           0.177090
sound guy       0.188930
got             0.190835
receiver        0.191993
ipad            0.192117
main            0.192637
rig             0.193438
