# Data Scientist Evaluation
# Chelsea Zerrenner

The following script classifies Stack Exchange posts into one of 5 categories: "astronomy," "aviation," "beer," "outdoors," and "pets".  The script was written according to the steps laid out in the evaluation instructions.  Note: The first attempt follows the instructions explicitly.  It came to my attention, however, that some steps called for by data science "best practices" were either missing from the instructions or performed out of sequence.  As such, I have provided additional attempts that take this into account.  
***
## Attempt 1:
This attempt follows the instructions explicitly in the order they were laid out.

### Create dataset from Stack Exchange Posts
Create connection to MongoDB to access the **stack** database and extract the following collections: **astronomy**, **aviation**, **beer**, **outdoors**, and **pets**.  

In [1]:
from pymongo import MongoClient
import pandas as pd
import numpy as np
import nltk

client = MongoClient()
db = client.stack

astronomy = pd.DataFrame(list(db.astronomy.find()))
aviation = pd.DataFrame(list(db.aviation.find()))
beer = pd.DataFrame(list(db.beer.find()))
outdoors = pd.DataFrame(list(db.outdoors.find()))
pets = pd.DataFrame(list(db.pets.find()))

client.close()

Combine collections to build one complete dataset named **posts** and create label based on the topic of the collection.

In [2]:
posts = pd.concat([astronomy, aviation, beer, outdoors, pets])

def f(x):
    if x['id'].startswith('astronomy'): return 'astronomy'
    elif x['id'].startswith('aviation'): return 'aviation'
    elif x['id'].startswith('beer'): return 'beer'
    elif x['id'].startswith('outdoors'): return 'outdoors'
    else: return 'pets'
    
posts['label'] = posts.apply(f, axis = 1)
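
The labelling rule in `f` can also be written as a lookup over the known topic names. The following is a minimal sketch in plain Python, assuming (as `f` does) that every post `id` begins with the name of its collection. The example ids and the name `label_from_id` are hypothetical, and unlike `f` the sketch raises an error for an unknown prefix instead of defaulting to 'pets'.

```python
TOPICS = ("astronomy", "aviation", "beer", "outdoors", "pets")

def label_from_id(post_id, topics=TOPICS):
    """Return the topic whose name prefixes post_id."""
    for topic in topics:
        if post_id.startswith(topic):
            return topic
    # Fail loudly rather than silently assigning a default label
    raise ValueError(f"unrecognised post id: {post_id!r}")

# Hypothetical ids, for illustration only
example_ids = ["astronomy_1", "beer_42", "pets_7"]
labels = [label_from_id(i) for i in example_ids]
print(labels)
```

On the DataFrame this could be applied as `posts['id'].apply(label_from_id)`, which avoids the row-wise `axis = 1` apply.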

Reduce the dataset to the **title**, **body**, and **label** inputs.  Separate the label data (outputs) from the text (inputs). 

In [3]:
X = posts['title'] + posts['body']
y = posts['label']

In [4]:
X.head()

0    How do I calculate the inclination of an objec...
1    How are the compositional components of exopla...
2    Amateur observing targets for binary star syst...
3    Why do sunspots appear dark? Sunspots, such as...
4    Why can't light escape from a black hole? I've...
dtype: object

### Data Preprocessing
First, remove punctuation from the text in the **X** dataset. 

In [5]:
# Remove punctuation
import string

def remove_punc(text):
    # Check characters to see if they are in punctuation
    nopunc = [char for char in text if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    return nopunc
       
X = X.apply(remove_punc)
X.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\chzerr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    How do I calculate the inclination of an objec...
1    How are the compositional components of exopla...
2    Amateur observing targets for binary star syst...
3    Why do sunspots appear dark Sunspots such as t...
4    Why cant light escape from a black hole Ive he...
dtype: object
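
The character-by-character filter in `remove_punc` can be expressed with a translation table, which is usually faster on long strings. This is a sketch of an equivalent function (the name `remove_punc_fast` is made up for illustration), not the code that produced the output above.

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNC_TABLE = str.maketrans("", "", string.punctuation)

def remove_punc_fast(text):
    """Strip ASCII punctuation from text without inserting spaces."""
    return text.translate(_PUNC_TABLE)

cleaned = remove_punc_fast("Why can't light escape from a black hole? I've heard...")
print(cleaned)
```

Either version deletes apostrophes without adding a space, which is why tokens such as `im`, `ive` and `dont` show up among the most common words later on.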

Build a term document matrix **tdm** from the **X** dataset (one row per post, one column per word), then use its column totals to find the 1000 most common words.

In [6]:
nltk.download('punkt')
nltk.download('wordnet')
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer

class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]

vect = CountVectorizer(tokenizer = LemmaTokenizer(), 
                       stop_words = 'english',  
                       lowercase = True).fit(X)
tdm = vect.transform(X)
sum_words = tdm.sum(axis = 0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1], reverse = True)
common_words = words_freq[:1000]
common_words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\chzerr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\chzerr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[('wa', 2009),
 ('dog', 1784),
 ('cat', 1760),
 ('like', 1745),
 ('im', 1702),
 ('doe', 1667),
 ('time', 1566),
 ('ha', 1448),
 ('know', 1383),
 ('just', 1350),
 ('aircraft', 1099),
 ('ive', 1077),
 ('question', 1021),
 ('use', 974),
 ('way', 970),
 ('dont', 963),
 ('star', 919),
 ('make', 877),
 ('year', 849),
 ('flight', 834),
 ('beer', 828),
 ('day', 790),
 ('need', 788),
 ('water', 737),
 ('good', 709),
 ('used', 704),
 ('pilot', 703),
 ('want', 699),
 ('food', 676),
 ('thing', 631),
 ('really', 590),
 ('plane', 579),
 ('earth', 576),
 ('light', 574),
 ('different', 559),
 ('possible', 558),
 ('u', 540),
 ('long', 524),
 ('say', 498),
 ('planet', 482),
 ('looking', 478),
 ('people', 477),
 ('new', 470),
 ('area', 464),
 ('old', 463),
 ('away', 461),
 ('point', 461),
 ('using', 460),
 ('think', 454),
 ('small', 451),
 ('work', 449),
 ('doesnt', 439),
 ('black', 438),
 ('rabbit', 438),
 ('look', 436),
 ('come', 428),
 ('problem', 424),
 ('moon', 421),
 ('place', 414),
 ('mean', 410),
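
The column totals that `tdm.sum(axis = 0)` produces can be illustrated on a toy corpus with the standard library. This is only a sketch of the counting idea: the documents are hypothetical and no lemmatization or stop-word removal is applied.

```python
from collections import Counter

# Toy corpus (hypothetical documents, already lower-cased and punctuation-free)
docs = [
    "dog food for old dog",
    "star light from old star",
    "dog and cat food",
]

# One Counter per document plays the role of a matrix row; summing the rows
# gives the corpus-wide frequency of every term (the column totals).
rows = [Counter(doc.split()) for doc in docs]
totals = sum(rows, Counter())

top_3 = totals.most_common(3)
print(top_3)
```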

In [7]:
tdm

<5826x22201 sparse matrix of type '<class 'numpy.int64'>'
	with 201970 stored elements in Compressed Sparse Row format>

In [8]:
tdm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

#### Perform Dimension Reduction
Perform singular value decomposition on the term document matrix and retain enough components to explain 95% of the variance. 

In [9]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components = tdm.shape[1] - 1, random_state = 1234)
tsvd = svd.fit_transform(tdm)
svd_var_ratios = svd.explained_variance_ratio_

In [10]:
# Create a function
def select_n_components(var_ratio, goal_var: float) -> int:
    # Set initial variance explained so far
    total_variance = 0.0
    
    # Set initial number of features
    n_components = 0
    
    # For the explained variance of each feature:
    for explained_variance in var_ratio:
        
        # Add the explained variance to the total
        total_variance += explained_variance
        
        # Add one to the number of components
        n_components += 1
        
        # If we reach our goal level of explained variance
        if total_variance >= goal_var:
            # End the loop
            break
            
    # Return the number of components
    return n_components

select_n_components(svd_var_ratios, 0.95)

2198
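
The same threshold search can be written more compactly with a running sum. A minimal sketch, assuming the ratios arrive in the order returned by `explained_variance_ratio_`; the function name and the example ratios are hypothetical.

```python
from itertools import accumulate

def n_components_for(var_ratio, goal_var):
    """Smallest number of leading components whose ratios sum to goal_var."""
    for n, total in enumerate(accumulate(var_ratio), start=1):
        if total >= goal_var:
            return n
    # Goal never reached: keep every component, as select_n_components does
    return len(var_ratio)

# Hypothetical explained-variance ratios, largest first
ratios = [0.5, 0.3, 0.1, 0.06, 0.04]
print(n_components_for(ratios, 0.95))
```

With these ratios the running sum first passes 0.95 at the fourth component.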

In [11]:
svd_final = TruncatedSVD(n_components = select_n_components(svd_var_ratios, 0.95), random_state = 1234)
tdm_final = svd_final.fit_transform(tdm)

In [12]:
tdm_final.shape

(5826, 2198)

### Build Classifier
Set aside 500 random discussions from the entire dataset to serve as a hold-out test set.  Then fit a classifier to the training data, using 10-fold cross validation to tune the parameters of the classifier of our choice (in this case, a random forest). 

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tdm_final, y, test_size = 500/tdm_final.shape[0], random_state = 0) 

In [14]:
from sklearn.model_selection import RandomizedSearchCV 
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score

# Define model & performance measure
mdl = RandomForestClassifier(n_estimators = 20)
accuracy = make_scorer(accuracy_score)

In [15]:
# Random search for parameters
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 100),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rand_search = RandomizedSearchCV(mdl, param_distributions = param_dist, n_iter = 20, random_state = 1234, scoring = accuracy, 
                                 refit = True, cv = 10)
mdl_1 = rand_search.fit(X_train, y_train)
mdl_1.best_score_

0.8490424333458505
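
With `n_iter = 20` and `cv = 10`, the random search scores 20 sampled parameter settings on 10 folds each (200 fits), then refits the best setting on all of `X_train`. The sketch below illustrates only the sampling step, using Python's `random` module in place of `scipy.stats.randint`, so the values it draws will not match those drawn by `RandomizedSearchCV`. Note that `sp_randint(low, high)` excludes `high`, as `randrange` does.

```python
import random

def sample_settings(n_iter, seed):
    """Draw n_iter parameter settings from the same ranges as param_dist."""
    rng = random.Random(seed)
    settings = []
    for _ in range(n_iter):
        settings.append({
            "max_depth": rng.choice([3, None]),
            "max_features": rng.randrange(1, 100),      # 1..99
            "min_samples_split": rng.randrange(2, 11),  # 2..10
            "min_samples_leaf": rng.randrange(1, 11),   # 1..10
            "bootstrap": rng.choice([True, False]),
            "criterion": rng.choice(["gini", "entropy"]),
        })
    return settings

candidates = sample_settings(n_iter=20, seed=1234)
print(len(candidates), candidates[0])
```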

In [16]:
att1_cv_accuracy = mdl_1.best_score_

In [17]:
mdl_1.best_estimator_

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=85, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [18]:
mdl_1.best_params_

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 85,
 'min_samples_leaf': 10,
 'min_samples_split': 2}

In [19]:
# Evaluate on hold-out test sample
att1_y_pred = mdl_1.predict(X_test)
att1_test_accuracy = accuracy_score(y_test, att1_y_pred)
att1_test_accuracy

0.81999999999999995

In [20]:
# Classification Report
from sklearn.metrics import classification_report, confusion_matrix

target_names = ['astronomy', 'aviation', 'beer', 'outdoors', 'pets']
att1_class_report = classification_report(y_test, att1_y_pred, target_names = target_names)
print(att1_class_report)

             precision    recall  f1-score   support

  astronomy       0.89      0.89      0.89        96
   aviation       0.81      0.87      0.84       139
       beer       1.00      0.45      0.62        22
   outdoors       0.71      0.80      0.75       131
       pets       0.91      0.79      0.85       112

avg / total       0.83      0.82      0.82       500



In [21]:
# Confusion Matrix
att1_conf_matrix = confusion_matrix(y_test, att1_y_pred, labels = target_names)
print(att1_conf_matrix)

[[ 85   6   0   3   2]
 [  7 121   0  10   1]
 [  0   1  10  11   0]
 [  3  17   0 105   6]
 [  0   4   0  19  89]]
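
The per-class figures in the classification report can be recovered from this matrix, whose rows are true labels and whose columns are predicted labels. The sketch below recomputes precision and recall from the values printed above. It shows, for example, that beer recall is low because 11 of the 22 beer posts were predicted as outdoors.

```python
labels = ["astronomy", "aviation", "beer", "outdoors", "pets"]

# Attempt 1 confusion matrix, copied from the output above
conf = [
    [85,   6,  0,   3,  2],
    [ 7, 121,  0,  10,  1],
    [ 0,   1, 10,  11,  0],
    [ 3,  17,  0, 105,  6],
    [ 0,   4,  0,  19, 89],
]

recall = {}
precision = {}
for i, name in enumerate(labels):
    row_total = sum(conf[i])                 # posts that truly belong to the class
    col_total = sum(row[i] for row in conf)  # posts predicted as the class
    recall[name] = conf[i][i] / row_total
    precision[name] = conf[i][i] / col_total

for name in labels:
    print(f"{name:>10}  precision={precision[name]:.2f}  recall={recall[name]:.2f}")
```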


In [22]:
# Store evaluation metrics
from sklearn.metrics import precision_recall_fscore_support

att1_train_accuracy = accuracy_score(y_train, mdl_1.predict(X_train))
att1_L = list(precision_recall_fscore_support(y_test, att1_y_pred, average = 'weighted'))
metrics = pd.DataFrame([[1, att1_train_accuracy, att1_cv_accuracy, att1_test_accuracy, *att1_L[:3]]], 
                            columns = ['attempt', 'train accuracy', 'mean cv accuracy', 'test accuracy', 'test precision', 
                                       'test recall', 'test f1-score'])

***
## Attempt 2: Fit Classifier on Term Frequency Inverse Document Frequency Matrix
Train on a tf-idf matrix in place of the raw term frequency matrix in an attempt to improve the classifier's performance.  A tf-idf matrix carries more information than the term frequency matrix because it also takes into account the number of documents in which a word appears.
### Data Preprocessing
Convert the term document matrix of raw word frequencies into a term frequency inverse document frequency matrix. 

In [23]:
# Term Frequency Inverse Document Frequency 
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(tdm)
tfidf = tfidf_transformer.transform(tdm)
tfidf

<5826x22201 sparse matrix of type '<class 'numpy.float64'>'
	with 201970 stored elements in Compressed Sparse Row format>
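
To make the weighting concrete, here is a minimal pure-Python sketch of the calculation, assuming the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`): each count is multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document row is then scaled to unit length. The count matrix is hypothetical.

```python
import math

# Toy count matrix: 3 documents (rows) x 3 terms (columns); values are hypothetical
terms = ["like", "dog", "star"]
counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 2, 0],
]

n_docs = len(counts)
# Document frequency: number of documents in which each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]
# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf_rows = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf_rows.append([v / norm for v in weighted])  # L2-normalise each document

print([round(w, 3) for w in idf])
```

A term that occurs in every document ("like" here) gets the minimum weight of 1, while a term confined to one document ("star") gets the largest weight.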

#### Perform Dimension Reduction

In [24]:
tfidf_svd = TruncatedSVD(n_components = tfidf.shape[1] - 1, random_state = 1234)
tfidf_tsvd = tfidf_svd.fit_transform(tfidf)
tfidf_svd_var_ratios = tfidf_svd.explained_variance_ratio_

select_n_components(tfidf_svd_var_ratios, 0.95)

3971

In [25]:
tfidf_svd_final = TruncatedSVD(n_components = select_n_components(tfidf_svd_var_ratios, 0.95), random_state = 1234)
tfidf_final = tfidf_svd_final.fit_transform(tfidf)

### Build Classifier on TF-IDF components
Set aside 500 documents to serve as the hold-out test sample.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_final, y, test_size = 500/tfidf_final.shape[0], random_state = 0) 

We already have the model and evaluation metric that we defined in attempt 1 (**mdl** and **accuracy**).  We will also use the same parameter distributions and random search cross validation as attempt 1, but fit them to the term frequency inverse document frequency matrix.  For reference on how we obtained **mdl**, **accuracy**, **param_dist** and **rand_search**, see below:
```
# Define model & performance measure
mdl = RandomForestClassifier(n_estimators = 20)
accuracy = make_scorer(accuracy_score)

# Random search for parameters
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 100),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rand_search = RandomizedSearchCV(mdl, param_distributions = param_dist, n_iter = 20, random_state = 1234, 
                                 scoring = accuracy, refit = True, cv = 10)
```

In [27]:
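# Fit classifier using same algorithm, parameter distributions & random search parameters on tf-idf matrix
# fit() returns the search object itself, so fit an unfitted copy to avoid overwriting mdl_1
from sklearn.base import clone

mdl_2 = clone(rand_search).fit(X_train, y_train)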
att2_cv_accuracy = mdl_2.best_score_
att2_cv_accuracy

0.91156590311678554

In [28]:
mdl_2.best_estimator_

RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features=90,
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=5, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [29]:
mdl_2.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 90,
 'min_samples_leaf': 5,
 'min_samples_split': 2}

In [30]:
# Evaluate on hold-out test sample
att2_y_pred = mdl_2.predict(X_test)
att2_test_accuracy = accuracy_score(y_test, att2_y_pred)
att2_test_accuracy

0.89000000000000001

In [31]:
# Classification Report
target_names = ['astronomy', 'aviation', 'beer', 'outdoors', 'pets']
att2_class_report = classification_report(y_test, att2_y_pred, target_names = target_names)
print(att2_class_report)

             precision    recall  f1-score   support

  astronomy       0.94      0.93      0.93        96
   aviation       0.86      0.95      0.90       139
       beer       1.00      0.64      0.78        22
   outdoors       0.87      0.85      0.86       131
       pets       0.89      0.88      0.89       112

avg / total       0.89      0.89      0.89       500



In [32]:
# Confusion Matrix
att2_conf_matrix = confusion_matrix(y_test, att2_y_pred, labels = target_names)
print(att2_conf_matrix)

[[ 89   4   0   2   1]
 [  3 132   0   2   2]
 [  2   3  14   3   0]
 [  1  10   0 111   9]
 [  0   4   0   9  99]]


In [33]:
# Store evaluation metrics
att2_train_accuracy = accuracy_score(y_train, mdl_2.predict(X_train))
att2_L = list(precision_recall_fscore_support(y_test, att2_y_pred, average = 'weighted'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
metrics = pd.concat([metrics, 
                     pd.DataFrame([[2, att2_train_accuracy, att2_cv_accuracy, att2_test_accuracy, *att2_L[:3]]], 
                                  columns = ['attempt', 'train accuracy', 'mean cv accuracy', 'test accuracy', 
                                             'test precision', 'test recall', 'test f1-score'])], 
                    ignore_index = True)

In [34]:
metrics

| | attempt | train accuracy | mean cv accuracy | test accuracy | test precision | test recall | test f1-score |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.980473 | 0.849042 | 0.82 | 0.830855 | 0.82 | 0.819058 |
| 1 | 2 | 0.999812 | 0.911566 | 0.89 | 0.892493 | 0.89 | 0.888826 |
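
The test accuracy and test recall columns are identical in this table. That is expected: with `average = 'weighted'`, recall is the support-weighted mean of the per-class recalls, which reduces to overall accuracy. The sketch below checks this using the attempt 2 confusion matrix printed above.

```python
# Attempt 2 confusion matrix, copied from the output above
# (rows are true labels, columns are predicted labels)
conf = [
    [89,   4,  0,   2,  1],
    [ 3, 132,  0,   2,  2],
    [ 2,   3, 14,   3,  0],
    [ 1,  10,  0, 111,  9],
    [ 0,   4,  0,   9, 99],
]

total = sum(sum(row) for row in conf)
support = [sum(row) for row in conf]                  # true posts per class
recall = [conf[i][i] / support[i] for i in range(len(conf))]

accuracy = sum(conf[i][i] for i in range(len(conf))) / total
weighted_recall = sum(s * r for s, r in zip(support, recall)) / total

print(round(accuracy, 2), round(weighted_recall, 2))
```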


***
The final 2 attempts mirror attempts 1 and 2, but with the train-test split occurring *before* dimension reduction.  This ensures that no information from the hold-out test set leaks into the training data, which could bias our results and give an inaccurate evaluation of our classifier.    
## Attempt 3: Train-Test Split prior to Dimension Reduction, Build Classifier on Term Frequency Matrix
### Data Preprocessing
We have already built our term document matrix measuring word frequency in each document.  Now we'll perform the train-test split and *then* apply dimension reduction before ultimately training our classifier. 

In [35]:
X_train, X_test, y_train, y_test = train_test_split(tdm, y, test_size = 500/tdm.shape[0], random_state = 0) 

#### Perform Dimension Reduction

In [36]:
svd = TruncatedSVD(n_components = X_train.shape[1] - 1, random_state = 1234)
tsvd = svd.fit_transform(X_train)
svd_var_ratios = svd.explained_variance_ratio_

select_n_components(svd_var_ratios, 0.95)

2045

In [37]:
svd_final = TruncatedSVD(n_components = select_n_components(svd_var_ratios, 0.95), random_state = 1234)
X_train_final = svd_final.fit_transform(X_train)
X_test_final = svd_final.transform(X_test)

print('X_train = ' + str(X_train_final.shape))
print('X_test = ' + str(X_test_final.shape))

X_train = (5326, 2045)
X_test = (500, 2045)
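
The pattern used with `svd_final` (fit on the training rows, then only transform the test rows) applies to any fitted preprocessing step. Below is a minimal sketch of that pattern, with a toy mean-centring transform standing in for `TruncatedSVD`. The class `MeanCenterer` and the data are hypothetical and serve only to show where the fitted statistics come from.

```python
class MeanCenterer:
    """Toy fitted transform: subtracts column means learned during fit."""

    def fit(self, rows):
        n = len(rows)
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [[v - m for v, m in zip(row, self.means_)] for row in rows]

# Hypothetical feature rows, split before any fitting takes place
train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[100.0, 1000.0]]

centerer = MeanCenterer().fit(train_rows)    # statistics come from train only
train_final = centerer.transform(train_rows)
test_final = centerer.transform(test_rows)   # reuses the train statistics

print(centerer.means_, test_final)
```

Had the transform been fitted on all three rows, the extreme test row would have shifted the means, and the training features would have carried information about the hold-out data.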


### Build Classifier
As before, we will use the same algorithm, parameter distributions and random search parameters to fit our classifier, this time on the data that was split prior to dimension reduction.  For reference, see below for how these items were derived:
```
# Define model & performance measure
mdl = RandomForestClassifier(n_estimators = 20)
accuracy = make_scorer(accuracy_score)

# Random search for parameters
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 100),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rand_search = RandomizedSearchCV(mdl, param_distributions = param_dist, n_iter = 20, random_state = 1234, 
                                 scoring = accuracy, refit = True, cv = 10)
```

In [38]:
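# Fit classifier using same algorithm, parameter distributions & random search parameter on data split prior to dimension reduction
# fit() returns the search object itself, so fit an unfitted copy to keep earlier models intact
from sklearn.base import clone

mdl_3 = clone(rand_search).fit(X_train_final, y_train)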
att3_cv_accuracy = mdl_3.best_score_
att3_cv_accuracy 

0.85843034171986476

In [39]:
mdl_3.best_estimator_

RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=None, max_features=85, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=10,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=20, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [40]:
mdl_3.best_params_

{'bootstrap': False,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 85,
 'min_samples_leaf': 10,
 'min_samples_split': 2}

In [41]:
# Evaluate on hold-out test sample
att3_y_pred = mdl_3.predict(X_test_final)
att3_test_accuracy = accuracy_score(y_test, att3_y_pred)
att3_test_accuracy

0.84199999999999997

In [42]:
# Classification Report
target_names = ['astronomy', 'aviation', 'beer', 'outdoors', 'pets']
att3_class_report = classification_report(y_test, att3_y_pred, target_names = target_names)
print(att3_class_report)

             precision    recall  f1-score   support

  astronomy       0.92      0.88      0.90        96
   aviation       0.87      0.88      0.88       139
       beer       1.00      0.73      0.84        22
   outdoors       0.72      0.84      0.78       131
       pets       0.89      0.79      0.83       112

avg / total       0.85      0.84      0.84       500



In [43]:
# Confusion Matrix
att3_conf_matrix = confusion_matrix(y_test, att3_y_pred, labels = target_names)
print(att3_conf_matrix)

[[ 84   2   0   8   2]
 [  4 123   0  11   1]
 [  0   2  16   4   0]
 [  2  11   0 110   8]
 [  1   4   0  19  88]]


In [45]:
# Store evaluation metrics
att3_train_accuracy = accuracy_score(y_train, mdl_3.predict(X_train_final))
att3_L = list(precision_recall_fscore_support(y_test, att3_y_pred, average = 'weighted'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
metrics = pd.concat([metrics, 
                     pd.DataFrame([[3, att3_train_accuracy, att3_cv_accuracy, att3_test_accuracy, *att3_L[:3]]], 
                                  columns = ['attempt', 'train accuracy', 'mean cv accuracy', 'test accuracy', 
                                             'test precision', 'test recall', 'test f1-score'])], 
                    ignore_index = True)

In [46]:
metrics

| | attempt | train accuracy | mean cv accuracy | test accuracy | test precision | test recall | test f1-score |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.980473 | 0.849042 | 0.82 | 0.830855 | 0.82 | 0.819058 |
| 1 | 2 | 0.999812 | 0.911566 | 0.89 | 0.892493 | 0.89 | 0.888826 |
| 2 | 3 | 0.986481 | 0.85843 | 0.842 | 0.85075 | 0.842 | 0.843437 |


***
## Attempt 4: Train-Test Split prior to Dimension Reduction, Build Classifier on TF-IDF Matrix
### Data Preprocessing
Split tf-idf matrix into train and test sets.

In [47]:
X_train, X_test, y_train, y_test = train_test_split(tfidf, y, test_size = 500/tfidf.shape[0], random_state = 0)

#### Perform Dimension Reduction

In [48]:
tfidf_svd = TruncatedSVD(n_components = X_train.shape[1] - 1, random_state = 1234)
tfidf_tsvd = tfidf_svd.fit_transform(X_train)
tfidf_svd_var_ratios = tfidf_svd.explained_variance_ratio_

select_n_components(tfidf_svd_var_ratios, 0.95)

3703

In [49]:
tfidf_svd_final = TruncatedSVD(n_components = select_n_components(tfidf_svd_var_ratios, 0.95), random_state = 1234)
X_train_final = tfidf_svd_final.fit_transform(X_train)
X_test_final = tfidf_svd_final.transform(X_test)

print('X_train = ' + str(X_train_final.shape))
print('X_test = ' + str(X_test_final.shape))

X_train = (5326, 3703)
X_test = (500, 3703)


### Build Classifier
Reference:
```
# Define model & performance measure
mdl = RandomForestClassifier(n_estimators = 20)
accuracy = make_scorer(accuracy_score)

# Random search for parameters
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 100),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

rand_search = RandomizedSearchCV(mdl, param_distributions = param_dist, n_iter = 20, random_state = 1234, 
                                 scoring = accuracy, refit = True, cv = 10)
```

In [50]:
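# Fit classifier using same algorithm, parameter distributions, random search parameters on tf-idf matrix split prior to dimension reduction
# fit() returns the search object itself, so fit an unfitted copy to keep earlier models intact
from sklearn.base import clone

mdl_4 = clone(rand_search).fit(X_train_final, y_train)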
att4_cv_accuracy = mdl_4.best_score_
att4_cv_accuracy

0.91306796845662785

In [51]:
mdl_4.best_estimator_

RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features=79,
            max_leaf_nodes=None, min_impurity_split=1e-07,
            min_samples_leaf=8, min_samples_split=4,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [52]:
mdl_4.best_params_

{'bootstrap': False,
 'criterion': 'entropy',
 'max_depth': None,
 'max_features': 79,
 'min_samples_leaf': 8,
 'min_samples_split': 4}

In [53]:
# Evaluate on hold-out test sample
att4_y_pred = mdl_4.predict(X_test_final)
att4_test_accuracy = accuracy_score(y_test, att4_y_pred)
att4_test_accuracy

0.89000000000000001

In [54]:
# Classification Report
target_names = ['astronomy', 'aviation', 'beer', 'outdoors', 'pets']
att4_class_report = classification_report(y_test, att4_y_pred, target_names = target_names)
print(att4_class_report)

             precision    recall  f1-score   support

  astronomy       0.91      0.90      0.91        96
   aviation       0.90      0.93      0.91       139
       beer       1.00      0.77      0.87        22
   outdoors       0.90      0.82      0.86       131
       pets       0.83      0.94      0.88       112

avg / total       0.89      0.89      0.89       500



In [55]:
# Confusion Matrix
att4_conf_matrix = confusion_matrix(y_test, att4_y_pred, labels = target_names)
print(att4_conf_matrix)

[[ 86   6   0   2   2]
 [  3 129   0   4   3]
 [  2   1  17   1   1]
 [  3   5   0 108  15]
 [  0   2   0   5 105]]


In [56]:
# Store evaluation metrics
att4_train_accuracy = accuracy_score(y_train, mdl_4.predict(X_train_final))
att4_L = list(precision_recall_fscore_support(y_test, att4_y_pred, average = 'weighted'))
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
metrics = pd.concat([metrics, 
                     pd.DataFrame([[4, att4_train_accuracy, att4_cv_accuracy, att4_test_accuracy, *att4_L[:3]]], 
                                  columns = ['attempt', 'train accuracy', 'mean cv accuracy', 'test accuracy', 
                                             'test precision', 'test recall', 'test f1-score'])], 
                    ignore_index = True)

In [57]:
metrics

| | attempt | train accuracy | mean cv accuracy | test accuracy | test precision | test recall | test f1-score |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.980473 | 0.849042 | 0.82 | 0.830855 | 0.82 | 0.819058 |
| 1 | 2 | 0.999812 | 0.911566 | 0.89 | 0.892493 | 0.89 | 0.888826 |
| 2 | 3 | 0.986481 | 0.85843 | 0.842 | 0.85075 | 0.842 | 0.843437 |
| 3 | 4 | 0.997559 | 0.913068 | 0.89 | 0.892909 | 0.89 | 0.889623 |


Attempt 4 yields the best results on most of the metrics used for evaluation: it has the highest mean cross-validation accuracy (0.913) and the highest test precision and f1-score, and it ties attempt 2 on test accuracy (0.89).  Only its train accuracy (0.998) falls slightly below that of attempt 2 (0.9998).  These figures will change from iteration to iteration because of the randomness involved in fitting (the random forest itself is not seeded), so the small margins between attempts 2 and 4 should not be over-interpreted.

Attempt 4 also gives the strongest test recall for **beer** posts (0.77, compared with 0.45, 0.64 and 0.73 in attempts 1 to 3).  These posts are the most problematic to classify accurately, as the category has the fewest documents associated with it (22 in the test set).  For the other categories, attempt 2 has slightly higher recall on astronomy, aviation and outdoors, while attempt 4 is higher on pets.  Training on the term frequency inverse document frequency matrix appears to address the beer issue by penalizing words that appear in many or most of the documents, and it raises test accuracy overall: from 0.82 to 0.89 when the split follows dimension reduction, and from 0.842 to 0.89 when it precedes it.

Performing the train-test split prior to dimension reduction improved train, cross-validation and test accuracy for the term frequency matrix (attempt 3 versus attempt 1).  For the tf-idf matrix (attempt 4 versus attempt 2) the effect was negligible: cross-validation accuracy rose marginally, test accuracy was unchanged and train accuracy fell slightly.