# A picture spoken with 1000 words, reported by CNN - Part 1 
Eu Jin Lok

9 February 2018

# Establishing the benchmarks
In this notebook we will go into the details of how to build a document classifier using a CNN (convolutional neural network), a deep learning architecture well known for image classification. For the full background on this topic, please check out my blog post at this link: 

xxxxxxxxxxx

This is part 1 of the code, which establishes some simple benchmarks. We will be using the "HappyDB" dataset from Kaggle for our experiment: 

https://www.kaggle.com/ritresearch/happydb

So without further ado, let's begin...

In [12]:
#import the key libraries 
import pandas as pd 
import numpy as np
import os 
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
from sklearn.metrics import accuracy_score
from pandas import crosstab
os.chdir("C:\\Users\\User\\Dropbox\\Pet Project\\Blog\\CNN\\")

So the first step, after loading the necessary packages, is to grab our training dataset: the "HappyDB" dataset from Kaggle mentioned above.

In [2]:
# import data 
train = pd.read_csv("happydb\\cleaned_hm.csv")  

# run some checks 
train.head(3)
print(train.shape)
print(train.isnull().sum())

(100535, 9)
hmid                         0
wid                          0
reflection_period            0
original_hm                  0
cleaned_hm                   0
modified                     0
num_sentence                 0
ground_truth_category    86410
predicted_category           0
dtype: int64


In [3]:
# Let's integer-encode the labels: each category name gets an integer ID
categories = train.predicted_category.unique()
dic = {category: i for i, category in enumerate(categories)}
labels = train.predicted_category.map(dic)
print(dic)

{'achievement': 4, 'nature': 6, 'leisure': 3, 'bonding': 2, 'exercise': 1, 'affection': 0, 'enjoy_the_moment': 5}
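
The mapping above assigns one integer ID per category (an integer encoding, not a one-hot vector). As a minimal sketch of the same idea on a made-up list of categories, the snippet below also builds the reverse lookup, which is handy when reading predictions back as category names. The names `toy_categories`, `label_to_id` and `id_to_label` are my own and do not appear in the notebook.

```python
# Toy category column (made up for illustration; not HappyDB data)
toy_categories = ["affection", "achievement", "affection", "bonding", "achievement"]

# Assign IDs in order of first appearance
label_to_id = {}
for category in toy_categories:
    if category not in label_to_id:
        label_to_id[category] = len(label_to_id)

# Reverse lookup: integer ID -> category name
id_to_label = {i: category for category, i in label_to_id.items()}

# Encode the column, then decode it again to check the round trip
encoded = [label_to_id[c] for c in toy_categories]
decoded = [id_to_label[i] for i in encoded]

print(label_to_id)  # {'affection': 0, 'achievement': 1, 'bonding': 2}
print(encoded)      # [0, 1, 0, 2, 1]
```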


After reading the dataset and creating a reference dictionary of labels and their associated IDs, we'll split the dataset into training and test sets.

In [5]:
#split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(train.cleaned_hm, labels, test_size=0.20)

#turn the text into word-count vectors
vectorizer = CountVectorizer(max_features=1000) #1000 since our theme is a thousand words 
x_train = vectorizer.fit_transform(x_train)

#Apply the fitted vectoriser to the test data, reusing the vocabulary learned from the training set 
x_test = vectorizer.transform(x_test).toarray()
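
To make the vectorising step concrete, here is a rough pure-Python sketch of the bag-of-words idea: learn a vocabulary of the most frequent words from the training text only, then count those same words in the test text. It is a simplification for illustration, not scikit-learn's implementation; the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are my own assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (a simplified stand-in for CountVectorizer's default tokenisation)
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs, max_features):
    # Keep the max_features most frequent tokens across the training docs
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    top = [tok for tok, _ in counts.most_common(max_features)]
    return {tok: idx for idx, tok in enumerate(sorted(top))}

def transform(docs, vocabulary):
    # One row per document, one column per vocabulary word;
    # words outside the vocabulary are simply ignored
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

toy_train = ["my dog made me happy", "my dog and my cat played", "happy dog"]
toy_test = ["my happy hamster"]

vocab = fit_vocabulary(toy_train, max_features=3)  # learned from training text only
train_rows = transform(toy_train, vocab)
test_rows = transform(toy_test, vocab)             # "hamster" is unseen, so it is dropped

print(vocab)      # {'dog': 0, 'happy': 1, 'my': 2}
print(test_rows)  # [[0, 1, 1]]
```

The point of the design is that the test set never influences the vocabulary. In scikit-learn, calling `transform` on an already-fitted `CountVectorizer` applies the learned vocabulary in the same way.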

# Benchmark 1: Multinomial Naive Bayes = 82%
So before we dive straight into CNN, let's establish some simple models first so we have something to benchmark against. First on the list is Multinomial Naive Bayes. Why? It's easy to code and needs no tuning to get a reasonable result.  
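
To show how little code the idea needs, here is a rough from-scratch sketch of multinomial Naive Bayes with additive (Laplace) smoothing on made-up token lists. It is only an illustration: the function names and toy data are my own, and the benchmark itself uses scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """docs: list of token lists; labels: one class label per doc."""
    vocab = {tok for doc in docs for tok in doc}
    class_doc_counts = Counter(labels)
    class_word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_word_counts[label].update(doc)

    model = {}
    for label, n_docs in class_doc_counts.items():
        total_words = sum(class_word_counts[label].values())
        denom = total_words + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        # Additive smoothing avoids log(0) for words unseen in a class
        log_likelihood = {
            tok: math.log((class_word_counts[label][tok] + alpha) / denom)
            for tok in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    # Score = log prior + sum of log likelihoods of the known words
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[tok] for tok in doc if tok in log_likelihood
        )
    return max(scores, key=scores.get)

# Made-up training examples (not HappyDB data)
toy_docs = [
    ["ran", "five", "miles"],
    ["gym", "workout", "ran"],
    ["dinner", "with", "wife"],
    ["hugged", "my", "wife"],
]
toy_labels = ["exercise", "exercise", "affection", "affection"]

model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(model, ["ran", "gym"]))  # exercise
print(predict(model, ["my", "wife"]))  # affection
```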

In [6]:
nb = MultinomialNB()
nb.fit(x_train, y_train)
nb.score(x_test, y_test) #accuracy on the test dataset

0.82384244293032283

82% accuracy, which is pretty good for a simple Multinomial Naive Bayes based on the top 1000 words. Let's see how a Tree Ensemble stacks up... 

# Benchmark 2: Tree Ensemble = 84%
Next on the list is a Tree Ensemble, aka Random Forest. Four years ago, this was the most popular off-the-shelf model that everyone went to in order to quickly gauge whether a dataset had enough signal in it, or whether it needed reworking (more data cleaning to purge the noise). 

In [7]:
random_forest = RandomForestClassifier(n_estimators=50, verbose=1, n_jobs=-1)
random_forest.fit(x_train, y_train)

[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   12.3s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   16.5s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=50, n_jobs=-1, oob_score=False, random_state=None,
            verbose=1, warm_start=False)

In [8]:
random_forest.score(x_test, y_test) #accuracy on the test dataset

[Parallel(n_jobs=8)]: Done  34 tasks      | elapsed:    0.0s
[Parallel(n_jobs=8)]: Done  50 out of  50 | elapsed:    0.1s finished


0.84129905008206096

84% accuracy, only slightly better than the Naive Bayes model. Let's try another model that has gained a bit of popularity in recent times, especially amongst folks on Kaggle...

# Benchmark 3: Gradient Boosting = 86%
It's called LightGBM, an implementation of the Gradient Boosting architecture that is very similar to XGBoost but apparently faster. I have not used it before and I've always wanted to try it... and now I will! 

In [9]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(x_train.astype(np.float64), y_train)
lgb_eval = lgb.Dataset(x_test.astype(np.float64), y_test, reference=lgb_train)

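# specify the configuration as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 7,  # one class per category in dic
    'metric': {'multi_logloss', 'multi_error'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 1
}

print('Start training...')
# Stop if the validation scores have not improved for 10 rounds.
# Note: newer LightGBM releases (4.0+, as far as I know) drop this argument
# in favour of callbacks=[lgb.early_stopping(stopping_rounds=10)].
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=200,
                valid_sets=lgb_eval,
                early_stopping_rounds=10)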

Start training...
[1]	valid_0's multi_error: 0.227433	valid_0's multi_logloss: 1.85111
Training until validation scores don't improve for 10 rounds.
[2]	valid_0's multi_error: 0.233252	valid_0's multi_logloss: 1.77402
[3]	valid_0's multi_error: 0.230616	valid_0's multi_logloss: 1.70162
[4]	valid_0's multi_error: 0.235789	valid_0's multi_logloss: 1.63935
[5]	valid_0's multi_error: 0.232357	valid_0's multi_logloss: 1.58004
[6]	valid_0's multi_error: 0.229373	valid_0's multi_logloss: 1.52626
[7]	valid_0's multi_error: 0.226041	valid_0's multi_logloss: 1.47691
[8]	valid_0's multi_error: 0.228577	valid_0's multi_logloss: 1.43151
[9]	valid_0's multi_error: 0.229174	valid_0's multi_logloss: 1.38932
[10]	valid_0's multi_error: 0.229423	valid_0's multi_logloss: 1.35012
[11]	valid_0's multi_error: 0.227185	valid_0's multi_logloss: 1.3135
[12]	valid_0's multi_error: 0.225643	valid_0's multi_logloss: 1.27949
[13]	valid_0's multi_error: 0.225593	valid_0's multi_logloss: 1.24937
...

[117]	valid_0's multi_error: 0.161834	valid_0's multi_logloss: 0.488042
[118]	valid_0's multi_error: 0.161287	valid_0's multi_logloss: 0.486329
[119]	valid_0's multi_error: 0.16079	valid_0's multi_logloss: 0.484638
[120]	valid_0's multi_error: 0.161038	valid_0's multi_logloss: 0.483141
[121]	valid_0's multi_error: 0.160243	valid_0's multi_logloss: 0.481544
[122]	valid_0's multi_error: 0.160292	valid_0's multi_logloss: 0.480109
[123]	valid_0's multi_error: 0.160044	valid_0's multi_logloss: 0.478573
[124]	valid_0's multi_error: 0.159745	valid_0's multi_logloss: 0.477087
[125]	valid_0's multi_error: 0.159248	valid_0's multi_logloss: 0.475685
[126]	valid_0's multi_error: 0.159149	valid_0's multi_logloss: 0.474334
[127]	valid_0's multi_error: 0.1588	valid_0's multi_logloss: 0.473097
[128]	valid_0's multi_error: 0.1589	valid_0's multi_logloss: 0.471863
[129]	valid_0's multi_error: 0.158552	valid_0's multi_logloss: 0.470629
[130]	valid_0's multi_error: 0.158651	valid_0's multi_logloss: 0.4693
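
The two columns in the log are the validation metrics requested in `params`. As I understand the standard definitions, `multi_logloss` is the mean negative log of the probability given to the true class, and `multi_error` is the share of rows where the most probable class is wrong. Below is a minimal pure-Python sketch on made-up probabilities; the function bodies and toy numbers are my own, and LightGBM's implementation may differ in details such as probability clipping and tie handling.

```python
import math

def multi_logloss(y_true, probs):
    # Mean negative log-probability assigned to the true class
    return -sum(math.log(p[y]) for y, p in zip(y_true, probs)) / len(y_true)

def multi_error(y_true, probs):
    # Fraction of rows whose highest-probability class is not the true class
    wrong = sum(1 for y, p in zip(y_true, probs) if p.index(max(p)) != y)
    return wrong / len(y_true)

# Made-up example: 4 rows, 3 classes, each row of probabilities sums to 1
toy_true = [0, 2, 1, 1]
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],  # true class is 1, but class 0 has the top probability
    [0.2, 0.6, 0.2],
]

loss = multi_logloss(toy_true, toy_probs)
error = multi_error(toy_true, toy_probs)
print(f"multi_logloss = {loss:.4f}")
print(f"multi_error   = {error:.2f}")  # 0.25
```

For scale, a uniform guess over 7 classes has a log loss of ln 7 ≈ 1.95, which is roughly where the first boosting round starts in the log above (1.85).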

In [16]:
pred = gbm.predict(x_test.astype(np.float64), num_iteration=gbm.best_iteration)
pred = pd.DataFrame(pred).idxmax(axis=1)
accuracy_score(y_test, pred) #accuracy on the test dataset

0.85900432685134531

86% accuracy, only slightly better than Random Forest, but there is certainly a lot of room to improve if we tweak the parameters. One notable thing about LightGBM: it's blazing fast! 

Before we move on, let's check the predictions against the actuals visually to make sure we haven't lost the plot...

In [20]:
print(dic)
crosstab(pred, y_test.reset_index(drop=True))

{'achievement': 4, 'nature': 6, 'leisure': 3, 'bonding': 2, 'exercise': 1, 'affection': 0, 'enjoy_the_moment': 5}


predicted_category     0    1     2     3     4     5    6
row_0
0                   6292    7    84    36   267   113   21
1                      8  153     0     3    24     8    0
2                     64    7  1965     5    49    17    3
3                     21   11     2  1092   104    95   18
4                    297   52    55   273  6205   632   61
5                     55    7    13   108   203  1335   23
6                     10    6     3    14    32    24  230
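
The table can be read more easily with a few summary numbers. In the crosstab, rows are the model's predictions and columns are the actual labels (confusingly named `predicted_category`, because that is the dataset column the labels came from). The sketch below copies the counts from the table and derives overall accuracy plus per-class precision and recall; the variable names and the `id_to_label` reverse lookup are my own additions.

```python
# Counts copied from the crosstab above:
# rows = predicted class ID, columns = actual class ID
confusion = [
    [6292, 7, 84, 36, 267, 113, 21],
    [8, 153, 0, 3, 24, 8, 0],
    [64, 7, 1965, 5, 49, 17, 3],
    [21, 11, 2, 1092, 104, 95, 18],
    [297, 52, 55, 273, 6205, 632, 61],
    [55, 7, 13, 108, 203, 1335, 23],
    [10, 6, 3, 14, 32, 24, 230],
]

# Reverse of the dic printed above (ID -> category name)
id_to_label = {0: "affection", 1: "exercise", 2: "bonding", 3: "leisure",
               4: "achievement", 5: "enjoy_the_moment", 6: "nature"}

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total

# Row sums = how often each class was predicted; column sums = how often it occurred
predicted_totals = [sum(row) for row in confusion]
actual_totals = [sum(confusion[r][c] for r in range(n_classes)) for c in range(n_classes)]

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.8590
for i in range(n_classes):
    precision = confusion[i][i] / predicted_totals[i]
    recall = confusion[i][i] / actual_totals[i]
    print(f"{id_to_label[i]:<17} precision={precision:.3f} recall={recall:.3f}")
```

The diagonal reproduces the 86% accuracy reported earlier. The largest off-diagonal cell (632) is actual `enjoy_the_moment` moments predicted as `achievement`, which suggests those two categories are the hardest for the model to separate.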
