# Text Classification

* Data Retrieval
* Data Preprocessing and Normalization
* Building Train and Test Datasets
* Feature Engineering Techniques
    1. Traditional
    2. Advanced
* Classification Models
    1. Multinomial Naive Bayes
    2. Logistic Regression
    3. Support Vector Machines
    4. Ensemble Models
    5. Random Forest
    6. Gradient Boosting Machines
* Evaluating Classification Models
    1. Confusion Matrix
* Building and Evaluating Our Text Classifier
    1. Bag of Words Features with Classification Models
    2. TF-IDF Features with Classification Models
    3. Comparative Model Performance Evaluation
    4. Word2Vec Embeddings with Classification Models
    5. GloVe Embeddings with Classification Models
    6. FastText Embeddings with Classification Models
    7. Model Tuning
    8. Model Performance Evaluation

If spaCy cannot find the `en_core_web_sm` model, download it with:

python -m spacy download en_core_web_sm

## Data Retrieval

In [39]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import text_normalizer as tn
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

data = fetch_20newsgroups(subset='all', shuffle=True, remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [18]:
# building the dataframe
corpus, target_labels, target_names = (data.data, data.target, [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})
print(data_df.shape)
data_df.head(10)

(18846, 3)


| | Article | Target Label | Target Name |
|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | 4 | comp.sys.mac.hardware |
| 5 | \n\nBack in high school I worked as a lab assi... | 12 | sci.electronics |
| 6 | \n\nAE is in Dallas...try 214/241-6060 or 214/... | 4 | comp.sys.mac.hardware |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | 10 | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | 10 | rec.sport.hockey |
| 9 | \nIf a Christian means someone who believes in... | 19 | talk.religion.misc |


## Data Preprocessing and Normalization

In [19]:
total_nulls = data_df[data_df.Article.str.strip() == ''].shape[0]
print("Empty documents:", total_nulls)

Empty documents: 515


In [20]:
data_df = data_df[~(data_df.Article.str.strip() == '')]
data_df.shape

(18331, 3)

In [21]:
import nltk
stopword_list = nltk.corpus.stopwords.words('english')

# keep negation words out of the stopword list so they survive in bi-grams
stopword_list.remove('no')
stopword_list.remove('not')

# normalize our corpus
norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True, 
                                  text_stemming=False, special_char_removal=True, remove_digits=True, 
                                  stopword_removal=True, stopwords=stopword_list)

data_df['Clean Article'] = norm_corpus

# view sample data
data_df = data_df[['Article', 'Clean Article', 'Target Label', 'Target Name']]
data_df.head(10)

| | Article | Clean Article | Target Label | Target Name |
|---|---|---|---|---|
| 0 | \n\nI am sure some bashers of Pens fans are pr... | sure basher pens fan pretty confused lack kind... | 10 | rec.sport.hockey |
| 1 | My brother is in the market for a high-perform... | brother market high performance video card sup... | 3 | comp.sys.ibm.pc.hardware |
| 2 | \n\n\n\n\tFinally you said what you dream abou... | finally say dream mediterranean new area great... | 17 | talk.politics.mideast |
| 3 | \nThink!\n\nIt's the SCSI card doing the DMA t... | think scsi card dma transfer not disk scsi car... | 3 | comp.sys.ibm.pc.hardware |
| 4 | 1) I have an old Jasmine drive which I cann... | old jasmine drive not use new system understan... | 4 | comp.sys.mac.hardware |
| 5 | \n\nBack in high school I worked as a lab assi... | back high school work lab assistant bunch expe... | 12 | sci.electronics |
| 6 | \n\nAE is in Dallas...try 214/241-6060 or 214/... | ae dallas try tech support may line one get start | 4 | comp.sys.mac.hardware |
| 7 | \n[stuff deleted]\n\nOk, here's the solution t... | stuff delete ok solution problem move canada y... | 10 | rec.sport.hockey |
| 8 | \n\n\nYeah, it's the second one. And I believ... | yeah second one believe price try get good loo... | 10 | rec.sport.hockey |
| 9 | \nIf a Christian means someone who believes in... | christian mean someone believe divinity jesus ... | 19 | talk.religion.misc |


In [22]:
data_df['Clean Article'] = norm_corpus
# treat empty or whitespace-only strings as missing values
data_df = data_df.replace(r'^\s*$', np.nan, regex=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18331 entries, 0 to 18845
Data columns (total 4 columns):
Article          18331 non-null object
Clean Article    18300 non-null object
Target Label     18331 non-null int64
Target Name      18331 non-null object
dtypes: int64(1), object(3)
memory usage: 716.1+ KB


In [23]:
data_df = data_df.dropna().reset_index(drop=True)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18300 entries, 0 to 18299
Data columns (total 4 columns):
Article          18300 non-null object
Clean Article    18300 non-null object
Target Label     18300 non-null int64
Target Name      18300 non-null object
dtypes: int64(1), object(3)
memory usage: 572.0+ KB


In [24]:
data_df.to_csv('clean_newsgroups.csv', index=False)

In [25]:
data_df = pd.read_csv('clean_newsgroups.csv')

## Building Train and Test Datasets

In [26]:
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names =\
                                 train_test_split(np.array(data_df['Clean Article']), np.array(data_df['Target Label']),
                                                  np.array(data_df['Target Name']), test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape

((12261,), (6039,))

In [27]:
from collections import Counter

trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd], 
             columns=['Target Label', 'Train Count', 'Test Count'])
.sort_values(by=['Train Count', 'Test Count'],
             ascending=False))

| | Target Label | Train Count | Test Count |
|---|---|---|---|
| 15 | sci.crypt | 667 | 295 |
| 0 | soc.religion.christian | 662 | 312 |
| 5 | rec.motorcycles | 660 | 309 |
| 10 | comp.sys.ibm.pc.hardware | 654 | 309 |
| 8 | comp.windows.x | 653 | 327 |
| 11 | rec.sport.hockey | 651 | 322 |
| 19 | sci.space | 649 | 304 |
| 7 | sci.med | 648 | 312 |
| 17 | rec.sport.baseball | 648 | 303 |
| 4 | sci.electronics | 647 | 309 |


## Evaluating Classification Models
I had trouble getting the code for this section to work.
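
The outline lists the confusion matrix under this section, so here is a minimal sketch of one, assuming `confusion_matrix` from `sklearn.metrics`. The labels are made-up toy values for illustration, not results from the newsgroups data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# toy labels for illustration only (not taken from the newsgroups results)
true_labels = ['sci.med', 'sci.med', 'sci.space', 'sci.space', 'sci.space', 'rec.autos']
pred_labels = ['sci.med', 'sci.space', 'sci.space', 'sci.space', 'rec.autos', 'rec.autos']
classes = ['rec.autos', 'sci.med', 'sci.space']

# rows are true classes, columns are predicted classes, in the order given by `labels`
cm = confusion_matrix(true_labels, pred_labels, labels=classes)
print(cm)

# per-class recall is the diagonal divided by the row sums
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(classes, recall_per_class)))
```

Off-diagonal cells show which classes get mistaken for which: here one `sci.med` document was predicted as `sci.space`.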

## Building and Evaluating Our Text Classifier

In [3]:
# condensed version of the setup above: reload the cleaned data and rebuild the train/test split
import nltk
from sklearn.model_selection import train_test_split
import numpy as np
import text_normalizer as tn
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

# load the cleaned dataset saved earlier
data_df = pd.read_csv('clean_newsgroups.csv')

# split data
train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names =\
                                 train_test_split(np.array(data_df['Clean Article']), np.array(data_df['Target Label']),
                                                  np.array(data_df['Target Name']), test_size=0.33, random_state=42)

## Bag of Words Features with Classification Models
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# build BOW features on train articles
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

# transform test articles into features
cv_test_features = cv.transform(test_corpus)

print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)

BOW model:> Train features shape: (12261, 65914)  Test features shape: (6039, 65914)
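
To make the fit/transform split concrete, here is a small sketch on a made-up toy corpus (the documents are assumptions for illustration). The vocabulary is learned from the training text only, so a word that appears only in the test text is ignored, which is why train and test features share the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy documents, invented for illustration
toy_train = ['disk drive scsi card', 'hockey game hockey fan']
toy_test = ['scsi disk hockey playoffs']

toy_cv = CountVectorizer()
toy_train_features = toy_cv.fit_transform(toy_train)  # learns the vocabulary
toy_test_features = toy_cv.transform(toy_test)        # reuses that vocabulary

# columns follow the alphabetically sorted vocabulary
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
print(toy_train_features.toarray())
print(toy_test_features.toarray())  # 'playoffs' has no column, so it is dropped
```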


In [15]:
# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)
mnb_bow_cv_scores = cross_val_score(mnb, cv_train_features, train_label_names, cv=5)
mnb_bow_cv_mean_score = np.mean(mnb_bow_cv_scores)
print('CV Accuracy (5-fold):', mnb_bow_cv_scores)
print('Mean CV Accuracy:', mnb_bow_cv_mean_score)
mnb_bow_test_score = mnb.score(cv_test_features, test_label_names)
print('Test Accuracy:', mnb_bow_test_score)

CV Accuracy (5-fold): [0.68590004 0.67887668 0.68665851 0.68504902 0.67184943]
Mean CV Accuracy: 0.6816667346037866
Test Accuracy: 0.690677


In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, train_label_names)
lr_bow_cv_scores = cross_val_score(lr, cv_train_features, train_label_names, cv=5)
lr_bow_cv_mean_score = np.mean(lr_bow_cv_scores)
print('CV Accuracy (5-fold):', lr_bow_cv_scores)
print('Mean CV Accuracy:', lr_bow_cv_mean_score)
lr_bow_test_score = lr.score(cv_test_features, test_label_names)
print('Test Accuracy:', lr_bow_test_score)

In [17]:
# Support Vector Machines
from sklearn.svm import LinearSVC

svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)
svm_bow_cv_scores = cross_val_score(svm, cv_train_features, train_label_names, cv=5)
svm_bow_cv_mean_score = np.mean(svm_bow_cv_scores)
print('CV Accuracy (5-fold):', svm_bow_cv_scores)
print('Mean CV Accuracy:', svm_bow_cv_mean_score)
svm_bow_test_score = svm.score(cv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_score)

CV Accuracy (5-fold): [0.63388866 0.64102564 0.64422685 0.64910131 0.64729951]
Mean CV Accuracy: 0.6431083933094228
Test Accuracy: 0.6522603079980129


In [18]:
# SVM with Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier

svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(cv_train_features, train_label_names)
svmsgd_bow_cv_scores = cross_val_score(svm_sgd, cv_train_features, train_label_names, cv=5)
svmsgd_bow_cv_mean_score = np.mean(svmsgd_bow_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_bow_cv_scores)
print('Mean CV Accuracy:', svmsgd_bow_cv_mean_score)
svmsgd_bow_test_score = svm_sgd.score(cv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_bow_test_score)

CV Accuracy (5-fold): [0.63185697 0.62596663 0.61321909 0.64787582 0.64729951]
Mean CV Accuracy: 0.6332436029841757
Test Accuracy: 0.6511011756913396


In [19]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(cv_train_features, train_label_names)
rfc_bow_cv_scores = cross_val_score(rfc, cv_train_features, train_label_names, cv=5)
rfc_bow_cv_mean_score = np.mean(rfc_bow_cv_scores)
print('CV Accuracy (5-fold):', rfc_bow_cv_scores)
print('Mean CV Accuracy:', rfc_bow_cv_mean_score)
rfc_bow_test_score = rfc.score(cv_test_features, test_label_names)
print('Test Accuracy:', rfc_bow_test_score)

CV Accuracy (5-fold): [0.52824055 0.50956451 0.53855569 0.52205882 0.51718494]
Mean CV Accuracy: 0.5231209039972265
Test Accuracy: 0.5418115582050008


In [20]:
# Gradient Boosting Machines
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(cv_train_features, train_label_names)
gbc_bow_cv_scores = cross_val_score(gbc, cv_train_features, train_label_names, cv=5)
gbc_bow_cv_mean_score = np.mean(gbc_bow_cv_scores)
print('CV Accuracy (5-fold):', gbc_bow_cv_scores)
print('Mean CV Accuracy:', gbc_bow_cv_mean_score)
gbc_bow_test_score = gbc.score(cv_test_features, test_label_names)
print('Test Accuracy:', gbc_bow_test_score)

CV Accuracy (5-fold): [0.55221455 0.55474155 0.5630355  0.5502451  0.54500818]
Mean CV Accuracy: 0.5530489757470004
Test Accuracy: 0.5553899652260308


In [21]:
## TF-IDF Features with Classification Models
from sklearn.feature_extraction.text import TfidfVectorizer

# build TF-IDF features on train articles
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

# transform test articles into features
tv_test_features = tv.transform(test_corpus)

print('TFIDF model:> Train features shape:', tv_train_features.shape, 
      ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (12261, 65914)  Test features shape: (6039, 65914)
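
A short sketch of what TF-IDF weighting does, again on a made-up toy corpus (an assumption for illustration). A term that occurs in every document gets a lower weight than a rarer term, and with the default `norm='l2'` each document vector has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# toy documents, invented for illustration; 'card' appears in all of them
toy_docs = ['card disk', 'card hockey', 'card space']
toy_tv = TfidfVectorizer(use_idf=True)
toy_tfidf = toy_tv.fit_transform(toy_docs).toarray()

card_col = toy_tv.vocabulary_['card']
disk_col = toy_tv.vocabulary_['disk']

# in the first document, the common word 'card' is weighted below the rarer 'disk'
print(toy_tfidf[0, card_col], toy_tfidf[0, disk_col])

# every row has unit L2 norm because norm='l2' is the default
print(np.linalg.norm(toy_tfidf, axis=1))
```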


In [23]:
# Naive Bayes
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)
mnb_tfidf_cv_scores = cross_val_score(mnb, tv_train_features, train_label_names, cv=5)
mnb_tfidf_cv_mean_score = np.mean(mnb_tfidf_cv_scores)
print('CV Accuracy (5-fold):', mnb_tfidf_cv_scores)
print('Mean CV Accuracy:', mnb_tfidf_cv_mean_score)
mnb_tfidf_test_score = mnb.score(tv_test_features, test_label_names)
print('Test Accuracy:', mnb_tfidf_test_score)

CV Accuracy (5-fold): [0.70337261 0.7049247  0.71113831 0.69812092 0.71890344]
Mean CV Accuracy: 0.7072919961196964
Test Accuracy: 0.7072362974002319


In [28]:
# Logistic Regression
lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(tv_train_features, train_label_names)
lr_tfidf_cv_scores = cross_val_score(lr, tv_train_features, train_label_names, cv=5)
lr_tfidf_cv_mean_score = np.mean(lr_tfidf_cv_scores)
print('CV Accuracy (5-fold):', lr_tfidf_cv_scores)
print('Mean CV Accuracy:', lr_tfidf_cv_mean_score)
lr_tfidf_test_score = lr.score(tv_test_features, test_label_names)
print('Test Accuracy:', lr_tfidf_test_score)

NameError: name 'tv_train_features' is not defined

In [25]:
# Support Vector Machines
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)
svm_tfidf_cv_scores = cross_val_score(svm, tv_train_features, train_label_names, cv=5)
svm_tfidf_cv_mean_score = np.mean(svm_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svm_tfidf_cv_scores)
print('Mean CV Accuracy:', svm_tfidf_cv_mean_score)
svm_tfidf_test_score = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_tfidf_test_score)

CV Accuracy (5-fold): [0.75375863 0.75539276 0.76336189 0.75531046 0.75327332]
Mean CV Accuracy: 0.75621941262751
Test Accuracy: 0.7597284318595794


In [26]:
# SVM with Stochastic Gradient Descent
svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(tv_train_features, train_label_names)
svmsgd_tfidf_cv_scores = cross_val_score(svm_sgd, tv_train_features, train_label_names, cv=5)
svmsgd_tfidf_cv_mean_score = np.mean(svmsgd_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_tfidf_cv_scores)
print('Mean CV Accuracy:', svmsgd_tfidf_cv_mean_score)
svmsgd_tfidf_test_score = svm_sgd.score(tv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_tfidf_test_score)

CV Accuracy (5-fold): [0.75457131 0.75376475 0.75968992 0.75245098 0.75900164]
Mean CV Accuracy: 0.7558957211546691
Test Accuracy: 0.7595628415300546


In [27]:
# Random Forest
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(tv_train_features, train_label_names)
rfc_tfidf_cv_scores = cross_val_score(rfc, tv_train_features, train_label_names, cv=5)
rfc_tfidf_cv_mean_score = np.mean(rfc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', rfc_tfidf_cv_scores)
print('Mean CV Accuracy:', rfc_tfidf_cv_mean_score)
rfc_tfidf_test_score = rfc.score(tv_test_features, test_label_names)
print('Test Accuracy:', rfc_tfidf_test_score)

CV Accuracy (5-fold): [0.53311662 0.51322751 0.53529172 0.53390523 0.51841244]
Mean CV Accuracy: 0.526790703507522
Test Accuracy: 0.5447921841364465


In [28]:
# Gradient Boosting
gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(tv_train_features, train_label_names)
gbc_tfidf_cv_scores = cross_val_score(gbc, tv_train_features, train_label_names, cv=5)
gbc_tfidf_cv_mean_score = np.mean(gbc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', gbc_tfidf_cv_scores)
print('Mean CV Accuracy:', gbc_tfidf_cv_mean_score)
gbc_tfidf_test_score = gbc.score(tv_test_features, test_label_names)
print('Test Accuracy:', gbc_tfidf_test_score)

CV Accuracy (5-fold): [0.55302722 0.56654457 0.55161159 0.54575163 0.5413257 ]
Mean CV Accuracy: 0.5516521415850433
Test Accuracy: 0.5534028812717338


In [29]:
## Comparative Model Performance Evaluation
pd.DataFrame([['Naive Bayes', mnb_bow_cv_mean_score, mnb_bow_test_score, 
               mnb_tfidf_cv_mean_score, mnb_tfidf_test_score],
              ['Logistic Regression', lr_bow_cv_mean_score, lr_bow_test_score, 
               lr_tfidf_cv_mean_score, lr_tfidf_test_score],
              ['Linear SVM', svm_bow_cv_mean_score, svm_bow_test_score, 
               svm_tfidf_cv_mean_score, svm_tfidf_test_score],
              ['Linear SVM (SGD)', svmsgd_bow_cv_mean_score, svmsgd_bow_test_score, 
               svmsgd_tfidf_cv_mean_score, svmsgd_tfidf_test_score],
              ['Random Forest', rfc_bow_cv_mean_score, rfc_bow_test_score, 
               rfc_tfidf_cv_mean_score, rfc_tfidf_test_score],
              ['Gradient Boosted Machines', gbc_bow_cv_mean_score, gbc_bow_test_score, 
               gbc_tfidf_cv_mean_score, gbc_tfidf_test_score]],
             columns=['Model', 'CV Score (TF)', 'Test Score (TF)', 'CV Score (TF-IDF)', 'Test Score (TF-IDF)'],
             ).T

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Model | Naive Bayes | Logistic Regression | Linear SVM | Linear SVM (SGD) | Random Forest | Gradient Boosted Machines |
| CV Score (TF) | 0.681667 | 0.700353 | 0.643108 | 0.633244 | 0.523121 | 0.553049 |
| Test Score (TF) | 0.690677 | 0.701275 | 0.65226 | 0.651101 | 0.541812 | 0.55539 |
| CV Score (TF-IDF) | 0.707292 | 0.742113 | 0.756219 | 0.755896 | 0.526791 | 0.551652 |
| Test Score (TF-IDF) | 0.707236 | 0.738202 | 0.759728 | 0.759563 | 0.544792 | 0.553403 |


In [12]:
## Word2Vec Embeddings with Classification Models
def document_vectorizer(corpus, model, num_features):
    # `index2word` is the gensim 3.x attribute; gensim 4.x renamed it to `index_to_key`
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary:
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)
            
        return feature_vector
    
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, 
                                     num_features) for tokenized_sentence in corpus]
    return np.array(features)
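
The averaging step inside `document_vectorizer` can be seen in isolation with a toy lookup table standing in for the trained Word2Vec model. The `toy_vectors` dictionary, its values, and the `average_toy_vectors` helper are hypothetical names made up for illustration. As in the function above, out-of-vocabulary words are skipped and a document with no known words stays a zero vector.

```python
import numpy as np

# hypothetical 3-dimensional "embeddings", standing in for model.wv
toy_vectors = {
    'hockey': np.array([1.0, 0.0, 2.0]),
    'game': np.array([3.0, 2.0, 0.0]),
}

def average_toy_vectors(words, vectors, num_features):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

# 'playoffs' is out of vocabulary, so only 'hockey' and 'game' are averaged
doc_vector = average_toy_vectors(['hockey', 'game', 'playoffs'], toy_vectors, 3)
empty_vector = average_toy_vectors(['playoffs'], toy_vectors, 3)
print(doc_vector)
print(empty_vector)
```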

In [8]:
# tokenize corpus
tokenized_train = [tn.tokenizer.tokenize(text)
                   for text in train_corpus]
tokenized_test = [tn.tokenizer.tokenize(text)
                  for text in test_corpus]

# build word2vec word embeddings
import gensim
# note: `size` and `iter` are the gensim 3.x argument names;
# gensim 4.x renamed them to `vector_size` and `epochs`
w2v_num_features = 1000
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features,
                                   window=100, min_count=2, sample=1e-3, sg=1, 
                                   iter=5, workers=10)

In [14]:
# generate document level embeddings
# remember we only use train dataset vocabulary embeddings
# so that test dataset truly remains an unseen dataset
# generate averaged word vector features from word2vec model
avg_wv_train_features = document_vectorizer(corpus=tokenized_train, model=w2v_model,
                                            num_features=w2v_num_features)
avg_wv_test_features = document_vectorizer(corpus=tokenized_test, model=w2v_model,
                                           num_features=w2v_num_features)
print('Word2Vec model:> Train features shape:', avg_wv_train_features.shape,
      ' Test features shape:', avg_wv_test_features.shape)

Word2Vec model:> Train features shape: (12261, 1000)  Test features shape: (6039, 1000)


In [17]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier

svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
svm.fit(avg_wv_train_features, train_label_names)
svm_w2v_cv_scores = cross_val_score(svm, avg_wv_train_features, train_label_names, cv=5)
svm_w2v_cv_mean_score = np.mean(svm_w2v_cv_scores)
print('CV Accuracy (5-fold):', svm_w2v_cv_scores)
print('Mean CV Accuracy:', svm_w2v_cv_mean_score)
svm_w2v_test_score = svm.score(avg_wv_test_features, test_label_names)
print('Test Accuracy:', svm_w2v_test_score)

CV Accuracy (5-fold): [0.74766355 0.74928775 0.74745002 0.74550654 0.75      ]
Mean CV Accuracy: 0.7479815714074336
Test Accuracy: 0.7348898824308661


In [None]:
## GloVe Embeddings with Classification Models
# feature engineering with GloVe model
#train_nlp = [tn.nlp(item) for item in train_corpus]
#train_glove_features = np.array([item.vector for item in train_nlp])

#test_nlp = [tn.nlp(item) for item in test_corpus]
#test_glove_features = np.array([item.vector for item in test_nlp])

#print('GloVe model:> Train features shape:', train_glove_features.shape,
#      ' Test features shape:', test_glove_features.shape)

In [None]:
# building our SVM model
#svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
#svm.fit(train_glove_features, train_label_names)
#svm_glove_cv_scores = cross_val_score(svm, train_glove_features, train_label_names, cv=5)
#svm_glove_cv_mean_score = np.mean(svm_glove_cv_scores)
#print('CV Accuracy (5-fold):', svm_glove_cv_scores)
#print('Mean CV Accuracy:', svm_glove_cv_mean_score)
#svm_glove_test_score = svm.score(test_glove_features, test_label_names)
#print('Test Accuracy:', svm_glove_test_score)

In [None]:
## FastText Embeddings with Classification Models
#from gensim.models.fasttext import FastText

#ft_num_features = 1000
# sg decides whether to use the skip-gram model (1) or CBOW (0)
#ft_model = FastText(tokenized_train, size=ft_num_features, window=100,
#                    min_count=2, sample=1e-3, sg=1, iter=5, workers=10)

# generate averaged word vector features from FastText model
#avg_ft_train_features = document_vectorizer(corpus=tokenized_train, model=ft_model,
#                                            num_features=ft_num_features)
#avg_ft_test_features = document_vectorizer(corpus=tokenized_test, model=ft_model,
#                                           num_features=ft_num_features)

#print('FastText model:> Train features shape:', avg_ft_train_features.shape,
#      ' Test features shape:', avg_ft_test_features.shape)

# build SVM model
#svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
#svm.fit(avg_ft_train_features, train_label_names)
#svm_ft_cv_scores = cross_val_score(svm, avg_ft_train_features, train_label_names, cv=5)
#svm_ft_cv_mean_score = np.mean(svm_ft_cv_scores)
#print('CV Accuracy (5-fold):', svm_ft_cv_scores)
#print('Mean CV Accuracy:', svm_ft_cv_mean_score)
#svm_ft_test_score = svm.score(avg_ft_test_features, test_label_names)
#print('Test Accuracy:', svm_ft_test_score)

In [None]:
#from sklearn.neural_network import MLPClassifier

#mlp = MLPClassifier(solver='adam', alpha=1e-5, learning_rate='adaptive', early_stopping=True,
#                    activation='relu', hidden_layer_sizes=(512, 512), random_state=42)
#mlp.fit(avg_ft_train_features, train_label_names)

#mlp_ft_test_score = mlp.score(avg_ft_test_features, test_label_names)
#print('Test Accuracy:', mlp_ft_test_score)

In [21]:
## Model Tuning
# Tuning our Multinomial Naive Bayes model
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

mnb_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('mnb', MultinomialNB())
                         ])

param_grid = {'tfidf__ngram_range': [(1,1), (1,2)],
              'mnb__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]}

gs_mnb = GridSearchCV(mnb_pipeline, param_grid, cv=5, verbose=2)
gs_mnb = gs_mnb.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.8s remaining:    0.0s


[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.1s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   5.9s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.2s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] .

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  4.2min finished


In [22]:
gs_mnb.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
           stop_words=None, strip_accents=None, sublinear_tf=False,
           token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
           vocabulary=None)),
  ('mnb', MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))],
 'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words=None, strip_accents=None, sublinear_tf=False,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=No

In [24]:
# model performances across different hyperparameter values in the hyperparameter space
cv_results = gs_mnb.cv_results_
results_df = pd.DataFrame({'rank': cv_results['rank_test_score'], 'params': cv_results['params'],
                           'cv score (mean)': cv_results['mean_test_score'],
                           'cv score (std)': cv_results['std_test_score']})
results_df = results_df.sort_values(by=['rank'], ascending=True)
pd.set_option('display.max_colwidth', 100)
results_df

| | rank | params | cv score (mean) | cv score (std) |
|---|---|---|---|---|
| 4 | 1 | {'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 1)} | 0.77041 | 0.007948 |
| 5 | 2 | {'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 2)} | 0.770247 | 0.009625 |
| 6 | 3 | {'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 1)} | 0.757279 | 0.006842 |
| 7 | 4 | {'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 2)} | 0.752059 | 0.00708 |
| 3 | 5 | {'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 2)} | 0.75051 | 0.012631 |
| 1 | 6 | {'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 2)} | 0.742517 | 0.011306 |
| 2 | 7 | {'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 1)} | 0.742191 | 0.009478 |
| 0 | 8 | {'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 1)} | 0.729141 | 0.008456 |
| 8 | 9 | {'mnb__alpha': 1, 'tfidf__ngram_range': (1, 1)} | 0.708914 | 0.006629 |
| 9 | 10 | {'mnb__alpha': 1, 'tfidf__ngram_range': (1, 2)} | 0.699943 | 0.005237 |


In [25]:
best_mnb_test_score = gs_mnb.score(test_corpus, test_label_names)
print('Test Accuracy:', best_mnb_test_score)

Test Accuracy: 0.7736380195396589
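
The pipeline-plus-grid-search pattern used for tuning can be tried at a much smaller scale. The sketch below uses a made-up toy corpus and labels (assumptions for illustration), and `cv=2` is chosen only because the toy data is tiny. Grid keys follow the `step__parameter` naming convention, so `mnb__alpha` sets `alpha` on the pipeline step named `mnb`.

```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus and labels, invented for illustration
toy_corpus = ['hockey game goal', 'hockey team win', 'goal team score', 'hockey score game',
              'disk drive card', 'scsi card drive', 'disk scsi controller', 'drive controller card']
toy_labels = ['sport'] * 4 + ['hardware'] * 4

toy_pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('mnb', MultinomialNB())])
toy_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)], 'mnb__alpha': [1e-2, 1]}

# 2 ngram settings x 2 alpha values = 4 candidates, each scored with 2-fold CV
toy_gs = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_gs.fit(toy_corpus, toy_labels)

print(toy_gs.best_params_)
toy_predictions = toy_gs.predict(['hockey goal', 'scsi disk'])
print(toy_predictions)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so held-out folds do not leak into the vocabulary or IDF weights.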


In [30]:
# Tuning our Logistic Regression model
lr_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('lr', LogisticRegression(penalty='l2', max_iter=100, random_state=42))])

param_grid = {'tfidf__ngram_range': [(1,1), (1,2)], 'lr__C': [1,5,10]}

gs_lr = GridSearchCV(lr_pipeline, param_grid, cv=5, verbose=2)
gs_lr = gs_lr.fit(train_corpus, train_label_names)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] lr__C=1, tfidf__ngram_range=(1, 1) ..............................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............... lr__C=1, tfidf__ngram_range=(1, 1), total=   3.3s
[CV] lr__C=1, tfidf__ngram_range=(1, 1) ..............................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    4.1s remaining:    0.0s


[CV] ............... lr__C=1, tfidf__ngram_range=(1, 1), total=   3.2s
[CV] lr__C=1, tfidf__ngram_range=(1, 1) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 1), total=   3.1s
[CV] lr__C=1, tfidf__ngram_range=(1, 1) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 1), total=   3.2s
[CV] lr__C=1, tfidf__ngram_range=(1, 1) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 1), total=   3.1s
[CV] lr__C=1, tfidf__ngram_range=(1, 2) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 2), total=  21.4s
[CV] lr__C=1, tfidf__ngram_range=(1, 2) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 2), total=  21.4s
[CV] lr__C=1, tfidf__ngram_range=(1, 2) ..............................
[CV] ............... lr__C=1, tfidf__ngram_range=(1, 2), total=  21.8s
[CV] lr__C=1, tfidf__ngram_range=(1, 2) ..............................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  7.7min finished


In [31]:
# evaluate best tuned model on the test dataset
best_lr_test_score = gs_lr.score(test_corpus, test_label_names)
print('Test Accuracy:', best_lr_test_score)

Test Accuracy: 0.763371419109124


In [33]:
# Tuning the Linear SVM model
svm_pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('svm', LinearSVC(random_state=42))])

param_grid = {'tfidf__ngram_range': [(1,1), (1,2)], 'svm__C': [0.01, 0.1, 1.5]}

gs_svm = GridSearchCV(svm_pipeline, param_grid, cv=5, verbose=2)
gs_svm = gs_svm.fit(train_corpus, train_label_names)

# evaluating best tuned model on the test dataset
best_svm_test_score = gs_svm.score(test_corpus, test_label_names)
print('Test Accuracy :', best_svm_test_score)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] svm__C=0.01, tfidf__ngram_range=(1, 1) ..........................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 1), total=   1.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 1) ..........................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.3s remaining:    0.0s


[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 1), total=   1.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 1) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 1), total=   1.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 1) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 1), total=   1.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 1) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 1), total=   1.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 2) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 2), total=   7.0s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 2) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 2), total=   7.5s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 2) ..........................
[CV] ........... svm__C=0.01, tfidf__ngram_range=(1, 2), total=   7.0s
[CV] svm__C=0.01, tfidf__ngram_range=(1, 2) ..........................
[CV] .

[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:  3.0min finished


Test Accuracy : 0.7751283325053817
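
The grid searches above print only the test accuracy of the refit best model. The following is a minimal sketch of how the winning parameter combination and the per-candidate cross-validation scores could be inspected. The corpus, labels and all `toy_` names are hypothetical stand-ins for `train_corpus` and `train_label_names`, and `cv=2` is used only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# hypothetical toy corpus standing in for train_corpus / train_label_names
toy_corpus = ['the team won the hockey game', 'a great goal in the hockey match',
              'the goalie saved the game', 'hockey playoffs start tonight',
              'the new graphics card renders fast', 'image rendering with a graphics library',
              'the graphics driver crashed', 'a fast card for 3d image rendering']
toy_labels = ['hockey'] * 4 + ['graphics'] * 4

toy_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                         ('svm', LinearSVC(random_state=42))])
toy_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)], 'svm__C': [0.01, 0.1, 1.5]}

# cv=2 only because the toy corpus has four documents per class
toy_gs = GridSearchCV(toy_pipeline, toy_grid, cv=2)
toy_gs.fit(toy_corpus, toy_labels)

print('Best parameters:', toy_gs.best_params_)
print('Best mean CV accuracy:', toy_gs.best_score_)
# mean cross-validated accuracy of every candidate, in grid order
for params, score in zip(toy_gs.cv_results_['params'],
                         toy_gs.cv_results_['mean_test_score']):
    print(params, round(score, 3))
```

The same attributes are available on `gs_lr` and `gs_svm` after fitting. The logs above show `n_jobs=1`; passing `n_jobs=-1` to `GridSearchCV` would spread the fits across all CPU cores and may shorten the search.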


In [35]:
## Model Performance Evaluation
import model_evaluation_utils as meu

mnb_predictions = gs_mnb.predict(test_corpus)
unique_classes = list(set(test_label_names))
meu.get_metrics(true_labels=test_label_names, predicted_labels=mnb_predictions)

Accuracy: 0.7736
Precision: 0.7784
Recall: 0.7736
F1 Score: 0.7714
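
Accuracy and recall are identical in the output above (0.7736), which is what support-weighted averaging over classes produces. The internals of `meu.get_metrics` are not shown here, so the sketch below only assumes that it behaves like scikit-learn's metrics with `average='weighted'`. The labels and all `toy_` names are hypothetical.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# hypothetical toy labels standing in for test_label_names / mnb_predictions
toy_true = ['a', 'a', 'a', 'b', 'b', 'c']
toy_pred = ['a', 'a', 'b', 'b', 'c', 'c']

toy_accuracy = accuracy_score(toy_true, toy_pred)
# average='weighted' weights each class's score by its number of true documents
toy_precision = precision_score(toy_true, toy_pred, average='weighted')
toy_recall = recall_score(toy_true, toy_pred, average='weighted')
toy_f1 = f1_score(toy_true, toy_pred, average='weighted')

print('Accuracy:', round(toy_accuracy, 4))
print('Precision:', round(toy_precision, 4))
print('Recall:', round(toy_recall, 4))
print('F1 Score:', round(toy_f1, 4))
```

Weighted recall sums the correctly classified documents of every class and divides by the total, so it always equals accuracy.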


In [36]:
meu.display_classification_report(true_labels=test_label_names,
                                  predicted_labels=mnb_predictions,
                                  classes=unique_classes)

                          precision    recall  f1-score   support

   comp.sys.mac.hardware       0.81      0.74      0.77       315
           comp.graphics       0.67      0.74      0.70       307
               rec.autos       0.85      0.77      0.81       343
             alt.atheism       0.68      0.65      0.66       267
comp.sys.ibm.pc.hardware       0.63      0.76      0.69       309
        rec.sport.hockey       0.93      0.92      0.92       322
          comp.windows.x       0.86      0.80      0.83       327
         rec.motorcycles       0.77      0.77      0.77       309
      rec.sport.baseball       0.92      0.90      0.91       303
                 sci.med       0.88      0.88      0.88       312
            misc.forsale       0.80      0.69      0.74       319
               sci.crypt       0.76      0.85      0.80       295
      talk.religion.misc       0.77      0.34      0.47       201
  soc.religion.christian       0.71      0.88      0.78       312
         

In [41]:
label_data_map = {v:k for k, v in data_labels_map.items()}
label_map_df = pd.DataFrame(list(label_data_map.items()),
                            columns=['Label Name', 'Label Number'])
label_map_df

Unnamed: 0,Label Name,Label Number
0,alt.atheism,0
1,comp.graphics,1
2,comp.os.ms-windows.misc,2
3,comp.sys.ibm.pc.hardware,3
4,comp.sys.mac.hardware,4
5,comp.windows.x,5
6,misc.forsale,6
7,rec.autos,7
8,rec.motorcycles,8
9,rec.sport.baseball,9


In [42]:
# build confusion matrix
unique_class_nums = label_map_df['Label Number'].values
mnb_prediction_class_nums = [label_data_map[item] for item in mnb_predictions]
meu.display_confusion_matrix_pretty(true_labels=test_label_nums,
                                    predicted_labels=mnb_prediction_class_nums,
                                    classes=unique_class_nums)
# class labels 0, 15 and 19 show many misclassifications

Unnamed: 0_level_0,Unnamed: 1_level_0,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:,Predicted:
Unnamed: 0_level_1,Unnamed: 1_level_1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
Actual:,0,174,2,0,2,0,1,0,2,4,0,3,3,1,1,2,28,8,12,15,9
Actual:,1,2,226,15,13,8,13,3,0,1,3,1,8,7,3,3,0,1,0,0,0
Actual:,2,0,15,212,39,7,13,4,0,0,0,0,3,5,0,3,0,0,1,2,0
Actual:,3,0,10,26,236,8,3,9,0,0,0,0,2,15,0,0,0,0,0,0,0
Actual:,4,0,11,9,25,233,2,10,2,1,0,0,7,13,1,1,0,0,0,0,0
Actual:,5,0,32,12,6,3,262,1,0,2,0,1,2,3,0,1,0,1,1,0,0
Actual:,6,0,4,7,27,13,2,219,12,3,1,1,8,13,1,4,1,2,0,1,0
Actual:,7,0,0,2,3,4,1,6,265,27,1,2,1,10,2,4,2,5,1,7,0
Actual:,8,1,0,0,1,2,1,5,17,239,2,4,3,4,4,2,6,8,2,7,1
Actual:,9,2,2,1,1,0,1,2,0,3,272,9,1,1,0,0,3,0,3,2,0
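
Each row of the confusion matrix can be turned into a per-class recall and a ranked list of the labels it is most often confused with. The sketch below uses plain Python on two rows copied from the output above. The helper names `row_recall` and `top_confusions` are hypothetical and not part of `model_evaluation_utils`.

```python
# rows copied from the confusion matrix output above
# (list position = predicted label number, value = number of test documents)
actual_0 = [174, 2, 0, 2, 0, 1, 0, 2, 4, 0, 3, 3, 1, 1, 2, 28, 8, 12, 15, 9]
actual_9 = [2, 2, 1, 1, 0, 1, 2, 0, 3, 272, 9, 1, 1, 0, 0, 3, 0, 3, 2, 0]

def row_recall(row, label):
    """Fraction of documents of class `label` that were predicted as `label`."""
    return row[label] / sum(row)

def top_confusions(row, label, n=3):
    """The n wrong labels most often predicted for class `label`, as (label, count)."""
    wrong = [(pred, count) for pred, count in enumerate(row) if pred != label]
    return sorted(wrong, key=lambda pair: pair[1], reverse=True)[:n]

print(round(row_recall(actual_0, 0), 2), top_confusions(actual_0, 0))
print(round(row_recall(actual_9, 9), 2))
```

Row 0 sums to 267 documents with a recall of 0.65, matching the support and recall reported for alt.atheism in the classification report. Its largest off-diagonal entry is the 28 documents predicted as label 15.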


In [43]:
label_map_df[label_map_df['Label Number'].isin([0, 15, 19])]

Unnamed: 0,Label Name,Label Number
0,alt.atheism,0
15,soc.religion.christian,15
19,talk.religion.misc,19


In [46]:
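# Extract test document row numbers by splitting the row positions
# (assumes the same test_size and random_state as the split that produced test_corpus)
train_idx, test_idx = train_test_split(np.arange(len(data_df)),
                                       test_size=0.33, random_state=42)
test_idx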

array([ 4097,  8528,  7621, ..., 14979,  4772,  7800])
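
Splitting the row positions works because `train_test_split` shuffles by position: with the same number of rows, `test_size` and `random_state`, it selects the same rows whatever the array contains. The sketch below checks this on a hypothetical ten-document array. All `toy_` names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy documents standing in for data_df['Article']
toy_docs = np.array(['doc_%d' % i for i in range(10)])

# split the documents themselves ...
toy_train, toy_test = train_test_split(toy_docs, test_size=0.33, random_state=42)
# ... and, separately, their row positions with the same settings
toy_train_idx, toy_test_idx = train_test_split(np.arange(len(toy_docs)),
                                               test_size=0.33, random_state=42)

# the positions select exactly the documents of the first split, in the same order
print((toy_docs[toy_test_idx] == toy_test).all())
```

This only holds if `data_df` has the same rows, in the same order, as when the original train/test split was made.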

In [47]:
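# confidence = highest class probability for each test document
predict_probas = gs_mnb.predict_proba(test_corpus).max(axis=1)
# copy the slice so the new columns are not assigned to a view of data_df
test_df = data_df.iloc[test_idx].copy()
test_df['Predicted Name'] = mnb_predictions
test_df['Predicted Confidence'] = predict_probas
test_df.head()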

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
4097,\nDid you watch the games????\n\n,watch game,10,rec.sport.hockey,rec.sport.hockey,0.529729
8528,I too have been watching the IIsi speedup reports and plan to upgrade in\nthe next few weeks. T...,watch iisi speedup report plan upgrade next week plan build small board different crystal able s...,4,comp.sys.mac.hardware,comp.sys.mac.hardware,0.44685
7621,"\nI think one (not ideal) solution is to use the\ntracing utility (can't remember the name, sorr...",think one not ideal solution use trace utility not remember name sorry include corel draw w pack...,1,comp.graphics,comp.graphics,0.978992
4754,\n I am curious about knowing which commericial cars today\n have v engines.\n\n V4 - I ...,curious know commericial car today v engine v not know v legend mr mr vw golf passat l vr inline...,7,rec.autos,rec.autos,0.9998
15903,"DH>>Does anyone out their have a mountain tape backup that I could compare\nDH>>notes with, (jum...",dhdoes anyone mountain tape backup could compare dhnotes jumper setting software ect dhor anyone...,3,comp.sys.ibm.pc.hardware,comp.sys.ibm.pc.hardware,0.374575
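
The `Predicted Confidence` column is the largest entry in each row of the probability matrix, and the predicted class is the one sitting in that column. The sketch below shows the relationship with NumPy. The class names and probabilities are hypothetical stand-ins for `gs_mnb.classes_` and `gs_mnb.predict_proba(test_corpus)`.

```python
import numpy as np

# hypothetical class names and probabilities standing in for
# gs_mnb.classes_ and gs_mnb.predict_proba(test_corpus)
toy_classes = np.array(['alt.atheism', 'rec.autos', 'sci.med'])
toy_proba = np.array([[0.10, 0.85, 0.05],
                      [0.40, 0.25, 0.35],
                      [0.02, 0.03, 0.95]])

# highest probability in each row
toy_confidence = toy_proba.max(axis=1)
# class whose column holds that highest probability
toy_predicted = toy_classes[toy_proba.argmax(axis=1)]

print(toy_confidence)
print(toy_predicted)
```

With 20 classes the winning probability can be well below 0.5, as in row 15903 above (0.374575), and still be a correct prediction.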


In [48]:
# look at the most confident misclassifications of talk.religion.misc as soc.religion.christian
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc')
                 & (test_df['Predicted Name'] == 'soc.religion.christian')]
          .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
4304,"\nOK, here's at least one Christian's answer:\n\nJesus was a JEW, not a Christian. In this context Matthew 5:14-19 makes\nsense. Matt 5:17 ""Do not think that I [Jesus] came to abolish the Law or...",ok least one christians answer jesus jew not christian context matthew make sense matt not think jesus come abolish law prophet not come abolish fulfill jesus live jewish law however culmination p...,19,talk.religion.misc,soc.religion.christian,0.991289
4237,"The Nicene Creed\n\nWE BELIEVE in one God the Father Almighty, Maker of heaven and earth, and of all things visible and invisible.\nAnd in one Lord Jesus Christ, the only-begotten Son of God, bego...",nicene creed believe one god father almighty maker heaven earth thing visible invisible one lord jesus christ begotten son god begotten father world god god light light god god begotten not make o...,19,talk.religion.misc,soc.religion.christian,0.989182
14513,"iank@microsoft.com (Ian Kennedy) writes...\n\n\nMore along the lines of Hebrews 12:25-29, I reckon...\n\n\tSee that you refuse not him that speaks. For if they\n\tescaped not who refused him that ...",iankmicrosoft com ian kennedy write along line hebrews reckon see refuse not speak escape not refuse spake earth much shall not escape turn away speak heaven whose voice shake earth promise say ye...,19,talk.religion.misc,soc.religion.christian,0.988875
16678,"\nJesus did not say that he was the fulfillment of the Law, and, unless\nI'm mistaken, heaven and earth have not yet passed away. Am I mistaken?\nAnd, even assuming that one can just gloss over th...",jesus not say fulfillment law unless mistaken heaven earth not yet pass away mistaken even assume one gloss portion word jesus really think accomplish not jesus say jew annul v say jesus record wo...,19,talk.religion.misc,soc.religion.christian,0.985892
13764,": \n: I am a Mormon. I believe in Christ, that he is alive. He raised himself\n: [Text deleted]\n:\n: I learned that the concept of the Holy Trinity was never taught by Jesus\n: Christ, that it ...",mormon believe christ alive raise text delete learn concept holy trinity never teach jesus christ agree council clergyman long christ ascend man no authority speak jesus never teach concept trinit...,19,talk.religion.misc,soc.religion.christian,0.974706


In [51]:
# same inspection for talk.religion.misc documents predicted as alt.atheism
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc')
                  & (test_df['Predicted Name'] == 'alt.atheism')]
          .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df

Unnamed: 0,Article,Clean Article,Target Label,Target Name,Predicted Name,Predicted Confidence
4706,"This discussion on ""objective"" seems to be falling into solipsism (Eg: the\nrecent challenge from Frank Dwyer, for someone to prove that he can actually\nobserve phenomena). Someones even made th...",discussion objective seem fall solipsism eg recent challenge frank dwyer someone prove actually observe phenomenon someone even make statement science subjective even atom subjective get bit silly...,19,talk.religion.misc,alt.atheism,0.972266
914,"\n\nAtoms are not objective. They aren't even real. What scientists call\nan atom is nothing more than a mathematical model that describes \ncertain physical, observable properties of our surrou...",atom not objective not even real scientist call atom nothing mathematical model describe certain physical observable property surrounding subjective objective though approach scientist take discus...,19,talk.religion.misc,alt.atheism,0.935916
11820,\nI think that if a theist were truly objective and throws out the notion that\nGod definitely exists and starts from scratch to prove to themselves that\nthe scriptures are the whole truth then t...,think theist truly objective throw notion god definitely exist start scratch prove scripture whole truth person would no longer theist miss something people convert non theism theism bring non the...,19,talk.religion.misc,alt.atheism,0.820338
6020,"\n\n[""it"" is Big Bang]\n\nSince you asked... from the Big Bang to the formation of atoms is about\n10E11 seconds. As for the ""color"": bright. Very very bright. \n\n\nI don't. I believe the curren...",big bang since ask big bang formation atom e second color bright bright not believe current theory cosmology fairly well support observational evidence not well support say evolution relativity an...,19,talk.religion.misc,alt.atheism,0.788279
2334,In <1ren9a$94q@morrow.stanford.edu> salem@pangea.Stanford.EDU (Bruce Salem) \n\n\n\nThis brings up another something I have never understood. I asked this once\nbefore and got a few interesting r...,renaqmorrow stanford edu salempangea stanford edu bruce salem bring another something never understand ask get interesting response somehow not seem satisfied would nt not consider good source may...,19,talk.religion.misc,alt.atheism,0.72891
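
The two cells above repeat the same filter for different class pairs. A small helper could cover any pair. This is a sketch only: `confident_errors` and the toy frame are hypothetical, and the column names are assumed to match those of `test_df`.

```python
import pandas as pd

def confident_errors(df, true_name, predicted_name, n=5):
    """Rows of class `true_name` predicted as `predicted_name`, most confident first."""
    mask = (df['Target Name'] == true_name) & (df['Predicted Name'] == predicted_name)
    return df[mask].sort_values(by='Predicted Confidence', ascending=False).head(n)

# hypothetical toy frame with the same column names as test_df
toy_df = pd.DataFrame({
    'Target Name': ['talk.religion.misc', 'talk.religion.misc',
                    'talk.religion.misc', 'alt.atheism'],
    'Predicted Name': ['alt.atheism', 'soc.religion.christian',
                       'alt.atheism', 'alt.atheism'],
    'Predicted Confidence': [0.72, 0.99, 0.97, 0.88]})

toy_errors = confident_errors(toy_df, 'talk.religion.misc', 'alt.atheism')
print(toy_errors)
```

On the real data the call would be `confident_errors(test_df, 'talk.religion.misc', 'alt.atheism')`.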
