## Classifying User Posts Using Machine Learning

The purpose of this project is to explore machine learning algorithms for classifying newsgroup posts into predefined categories. The motivation is to better understand why certain algorithms work well for topic classification. The four categories are 'alt.atheism', 'talk.religion.misc', 'comp.graphics', and 'sci.space', and the user posts come from the 20 newsgroups collection available through scikit-learn.

In [1]:
%%capture
# the %%capture prevents the DeprecationWarning message for one of the modules
# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from collections import OrderedDict
from IPython.display import display, HTML

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

## Data exploration

The data is from sklearn.datasets fetch_20newsgroups. Metadata (headers, footers, and quotes) is stripped out, and the posts are split into training and test sets. The test set is further split in half to provide a development set.

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

n_test = len(newsgroups_test.target)
test_data, test_labels = newsgroups_test.data[int(n_test/2):], newsgroups_test.target[int(n_test/2):]
dev_data, dev_labels = newsgroups_test.data[:int(n_test/2)], newsgroups_test.target[:int(n_test/2)]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('test label shape:', test_labels.shape)
print('dev label shape:', dev_labels.shape)
print('labels names:', newsgroups_train.target_names)

('training label shape:', (2034,))
('test label shape:', (677,))
('dev label shape:', (676,))
('labels names:', ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])


Let's look at a few of these posts. Labels 0-3 correspond to 'alt.atheism', 'comp.graphics', 'sci.space', and 'talk.religion.misc', respectively.

In [3]:
for x in range(5):
    print "\n======================\ncategory:", train_labels[x],"\n", \
         train_data[x], \
         "\n======================\n"


category: 1 
Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych 


category: 3 


Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuries. 


category: 2 

 >In article <

These posts have a lot of variety in terms of their content and formatting. This makes for an interesting classification challenge!

To turn the posts into something machine intelligible we will tokenize the words and create feature vectors. To do this we will use scikit-learn's CountVectorizer. CountVectorizer's fit_transform function returns a sparse document-term matrix. When printed, each non-zero entry appears as a tuple followed by an integer. Element 1 of the tuple is the document index and element 2 is the feature index, that is, the column assigned to a word token. The integer that follows the tuple is the count for that pair. In other words, it is the number of occurrences of that word in that document.
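As a rough illustration of that representation, here is a plain-Python sketch that builds the same kind of (document index, feature index) -> count mapping for three made-up sentences. The toy documents and variable names are assumptions for illustration only. The regular expression mirrors CountVectorizer's default of lowercased tokens with at least two word characters, but this is not the library's implementation.

```python
import re
from collections import Counter

# Three made-up "posts" standing in for the newsgroup data.
docs = ["the moon orbits the earth",
        "graphics file format",
        "the image file"]

# Lowercase and keep tokens of two or more word characters.
tokens_per_doc = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]

# Sorted vocabulary; each token gets a column (feature) index.
vocab = sorted({tok for toks in tokens_per_doc for tok in toks})
col = {tok: j for j, tok in enumerate(vocab)}

# Sparse storage: only non-zero (document, feature) pairs are kept.
sparse_counts = {}
for i, toks in enumerate(tokens_per_doc):
    for tok, n in Counter(toks).items():
        sparse_counts[(i, col[tok])] = n

for (i, j), n in sorted(sparse_counts.items()):
    print((i, j), n, vocab[j])
```

Each printed line has the shape described above: a (document, feature) tuple, the count, and, for readability, the token that the feature index stands for.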

## Vectorizing the user comments

In [4]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_data)

In [5]:
X_train.shape

(2034, 26879)

This tells us that there are 2034 posts and 26879 features. Most of these features are words and numbers, but others are strings of unknown meaning. For example '000062david42' is a feature.

The cell below creates a dataframe in which each column is a feature of the training set. The values in each column are the number of occurrences of that feature in each post.

In [6]:
df = pd.DataFrame(vectorizer.fit_transform(train_data).todense(), columns = vectorizer.get_feature_names())
print(df.shape)
df

(2034, 26879)


Unnamed: 0,00,000,0000,00000,000000,000005102000,000062david42,0001,000100255pixel,00041032,...,zurich,zurvanism,zus,zvi,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This is clearly very sparse data -- look at all those zeros!
What fraction of the matrix are non-zero values?
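One way to answer that question is to divide the number of stored non-zero entries by the total number of cells. The sketch below does this for a small made-up sparse mapping; the dimensions and counts are hypothetical and are not taken from the newsgroup data.

```python
# Toy document-term counts stored sparsely: (row, column) -> count.
# The shape and entries are invented for illustration.
n_rows, n_cols = 3, 8
nonzero = {(0, 0): 1, (0, 5): 1, (0, 6): 1, (0, 7): 2,
           (1, 1): 1, (1, 2): 1, (1, 3): 1,
           (2, 1): 1, (2, 4): 1, (2, 7): 1}

# Fraction of cells holding a non-zero value.
density = len(nonzero) / float(n_rows * n_cols)
print("fraction of non-zero entries: %.3f" % density)
```

For the SciPy sparse matrix returned by fit_transform, the equivalent calculation should be `X_train.nnz / float(X_train.shape[0] * X_train.shape[1])`, since `nnz` is the number of stored entries.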

Out of curiosity, let's look at the first and last words in the feature space.

In [7]:
features = vectorizer.get_feature_names()
print "The zeroth and last feature strings are:", \
      features[0], \
      "and", \
      features[-1] + "."

The zeroth and last feature strings are: 00 and zyxel.


Zyxel is a communications company. Its presence suggests that removing non-English words would be a bad idea, since many names and identifiers would be lost, which would likely decrease the classification accuracy.

In [8]:
# calculating the number of unique tokens (ie words) in the training and dev sets
train_vectorizer = CountVectorizer()
X_train = train_vectorizer.fit_transform(train_data)

dev_vectorizer = CountVectorizer()
X_dev = dev_vectorizer.fit_transform(dev_data)

print "The training data set has", len(set(train_vectorizer.vocabulary_)), "unique features."
print "The development data set has", len(set(dev_vectorizer.vocabulary_)), "unique features."
print "And the two share", len(set(dev_vectorizer.vocabulary_.keys()) & set(train_vectorizer.vocabulary_.keys())), "unique features"

The training data set has 26879 unique features.
The development data set has 16246 unique features.
And the two share 12219 unique features


In [9]:
labels_dict = {0 : 'alt.atheism', 
               1 : 'comp.graphics',
               2 : 'sci.space',
               3 : 'talk.religion.misc'}

## Logistic regression for classifying user comments

In [10]:
def ngram_vectorizer(ngram):
    # instantiating CountVectorizer
    vectorizer = CountVectorizer(ngram_range=ngram)

    # fitting and transforming training data
    X_train = vectorizer.fit_transform(train_data)

    # reuse the CountVectorizer fitted on the training data and only .transform
    # the dev data, so that both matrices share the same feature columns
    X_dev = vectorizer.transform(dev_data)

    # instantiating a logistic regression model
    lr = LogisticRegression()
    lr.fit(X=X_train, y=train_labels)
    
    n_top = 3
    words = []
    top_ns = []
    pred_category = []
    df = pd.DataFrame()
    for i in set(train_labels):
        # gathering the coefficients
        top_n = np.argsort(lr.coef_[i])[-n_top:]
        top_ns.extend(list(top_n))
        pred_category.extend([labels_dict[i]]*n_top)
        print 'mean and stddev of the coefficients for category', i, ":", np.mean(lr.coef_[i]), ",", np.std(lr.coef_[i])
    for j in top_ns:
        # gathering the words with the largest coefficients
        words.append(vectorizer.get_feature_names()[j]) # the feature names are the tokenized words
    df['word(s)'] = words
    df['pred_category'] = pred_category
    for i in set(train_labels):
        df[labels_dict[i] + "_coef"] = lr.coef_[i][top_ns]
    
    print "classification report for ngram_range", ngram, "\n", \
        classification_report(y_true = dev_labels, y_pred = lr.predict(X_dev))
    display(df)
    print "========\n"

There's a lot to talk about in the data below. First, the unigram (single-word) approach outperforms the bigram and trigram approaches on average precision, recall, and F1 (0.69, 0.70, 0.69 versus 0.63, 0.63, 0.61 and 0.58, 0.51, 0.48). This is probably because the most relevant topic identifiers are single words rather than two- or three-word phrases.
Second, notice that the top trigrams include phrases with no obvious topical meaning, such as 'agree with you', which suggests the trigram model is picking up incidental phrasing rather than topical vocabulary.
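To make the difference between single-word, two-word, and three-word features concrete, here is a minimal sketch of how n-word phrases can be extracted from a sentence. The `ngrams` helper and the example sentence are hypothetical, and the whitespace tokenization is a simplification of what CountVectorizer does.

```python
def ngrams(text, n):
    """Return the list of n-word phrases in `text` (whitespace tokenization)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "thanks in advance for the update"
for n in (1, 2, 3):
    print(n, ngrams(sentence, n))
```

A six-word sentence yields six single words but only four three-word phrases, and each specific phrase is far less likely than a single word to recur across posts, which is one plausible reason longer n-grams carry a weaker signal here.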

Now let's talk about the coefficients shown for each category. For each predicted category, the three largest coefficients were identified and the word(s) associated with them extracted. Notice that the coefficients are larger for the unigram model than for the bigram and trigram models. This is also apparent in the standard deviations: for the unigram model the stddev (about 0.07 to 0.08) is roughly two to three times that of the bigram and trigram models (about 0.02 to 0.03).

In [11]:
for ngram in [(1,1), (2,2), (3,3)]:
    ngram_vectorizer(ngram)

mean and stddev of the coefficients for category 0 : -0.002303413677017988 , 0.0787821264351687
mean and stddev of the coefficients for category 1 : -0.000720433179118076 , 0.06977261275787278
mean and stddev of the coefficients for category 2 : 0.00023755130931552037 , 0.07664199358312632
mean and stddev of the coefficients for category 3 : -0.0018505818886303236 , 0.07360915128836859
classification report for ngram_range (1, 1) 
             precision    recall  f1-score   support

          0       0.62      0.56      0.59       165
          1       0.74      0.88      0.81       185
          2       0.77      0.76      0.77       199
          3       0.59      0.53      0.56       127

avg / total       0.69      0.70      0.69       676



| | word(s) | pred_category | alt.atheism_coef | comp.graphics_coef | sci.space_coef | talk.religion.misc_coef |
|---|---|---|---|---|---|---|
| 0 | bobby | alt.atheism | 0.989956 | -0.221093 | -0.34086 | -0.463411 |
| 1 | atheists | alt.atheism | 1.030696 | -0.096547 | -0.318807 | -0.835035 |
| 2 | deletion | alt.atheism | 1.125056 | -0.397583 | -0.41991 | -0.394396 |
| 3 | file | comp.graphics | -0.334684 | 1.266316 | -0.806165 | -0.626966 |
| 4 | image | comp.graphics | -0.582814 | 1.345943 | -0.82527 | -0.467288 |
| 5 | graphics | comp.graphics | -0.758397 | 1.936768 | -1.336505 | -0.763043 |
| 6 | nasa | sci.space | -0.572388 | -0.478139 | 1.011602 | -0.467679 |
| 7 | orbit | sci.space | -0.413948 | -0.671164 | 1.224774 | -0.629579 |
| 8 | space | sci.space | -1.260178 | -1.316531 | 2.161801 | -1.170735 |
| 9 | blood | talk.religion.misc | -0.5333 | -0.106914 | -0.316273 | 1.054174 |



mean and stddev of the coefficients for category 0 : -0.0011705295046437229 , 0.02955175847815318
mean and stddev of the coefficients for category 1 : -0.000361007815361034 , 0.030448851343429704
mean and stddev of the coefficients for category 2 : -0.0005378179697379003 , 0.031069249422986066
mean and stddev of the coefficients for category 3 : -0.0013159588831620685 , 0.0282077021772131
classification report for ngram_range (2, 2) 
             precision    recall  f1-score   support

          0       0.60      0.55      0.57       165
          1       0.62      0.82      0.70       185
          2       0.65      0.71      0.68       199
          3       0.65      0.32      0.43       127

avg / total       0.63      0.63      0.61       676



| | word(s) | pred_category | alt.atheism_coef | comp.graphics_coef | sci.space_coef | talk.religion.misc_coef |
|---|---|---|---|---|---|---|
| 0 | cheers kent | alt.atheism | 0.649144 | -0.882288 | -0.821936 | 0.601477 |
| 1 | was just | alt.atheism | 0.677623 | -0.192925 | -0.197945 | -0.301982 |
| 2 | claim that | alt.atheism | 0.77168 | -0.257707 | -0.352081 | -0.200703 |
| 3 | in advance | comp.graphics | -0.544996 | 0.972282 | -0.53101 | -0.507304 |
| 4 | comp graphics | comp.graphics | -0.379829 | 1.037308 | -0.470812 | -0.396651 |
| 5 | looking for | comp.graphics | -0.755597 | 1.319679 | -0.613692 | -0.699852 |
| 6 | sci space | sci.space | -0.317719 | -0.388504 | 0.738842 | -0.274843 |
| 7 | the moon | sci.space | -0.40481 | -0.576411 | 0.951758 | -0.240626 |
| 8 | the space | sci.space | -0.314117 | -0.645741 | 1.030145 | -0.324507 |
| 9 | compuserve com | talk.religion.misc | -0.132647 | -0.21182 | -0.20088 | 0.701502 |



mean and stddev of the coefficients for category 0 : -0.001979061104483735 , 0.024369227543747256
mean and stddev of the coefficients for category 1 : -0.0022930858649384644 , 0.026199600746962058
mean and stddev of the coefficients for category 2 : -0.0009958035133713098 , 0.025793266352965093
mean and stddev of the coefficients for category 3 : -0.0025432667441444754 , 0.02291327590420167
classification report for ngram_range (3, 3) 
             precision    recall  f1-score   support

          0       0.59      0.34      0.43       165
          1       0.43      0.91      0.58       185
          2       0.63      0.49      0.55       199
          3       0.69      0.17      0.28       127

avg / total       0.58      0.51      0.48       676



| | word(s) | pred_category | alt.atheism_coef | comp.graphics_coef | sci.space_coef | talk.religion.misc_coef |
|---|---|---|---|---|---|---|
| 0 | grow up childish | alt.atheism | 0.626089 | -0.17817 | -0.155396 | -0.114554 |
| 1 | look up irony | alt.atheism | 0.626089 | -0.17817 | -0.155396 | -0.114554 |
| 2 | up childish propagandist | alt.atheism | 0.626089 | -0.17817 | -0.155396 | -0.114554 |
| 3 | agree with you | comp.graphics | -0.241578 | 0.666488 | -0.378151 | 0.129854 |
| 4 | am looking for | comp.graphics | -0.643466 | 1.116126 | -0.426317 | -0.622128 |
| 5 | thanks in advance | comp.graphics | -0.699582 | 1.286426 | -0.591322 | -0.666146 |
| 6 | of message deleted | sci.space | -0.118678 | -0.17817 | 0.563455 | -0.114554 |
| 7 | for the update | sci.space | -0.101518 | -0.202723 | 0.570407 | -0.101533 |
| 8 | on the moon | sci.space | -0.345603 | -0.381953 | 0.719988 | -0.318832 |
| 9 | be with you | talk.religion.misc | -0.234381 | -0.258666 | -0.240453 | 0.553884 |





Let's now see how we can improve the logistic regression by adjusting its regularization. Regularization can help us avoid learning too large a weight for any single feature. There are two primary types for logistic regression: L1 (the penalty used in lasso regression) and L2 (the penalty used in ridge regression). The main difference between the two is that L1 can bring coefficients exactly to zero, thus removing some features altogether, whereas L2 shrinks coefficients without completely removing features. The default regularization used for scikit-learn's logistic regression is L2, which penalizes the sum of the squared weights.  
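The sketch below illustrates why an L1 penalty can produce exact zeros while an L2 penalty only shrinks. It applies one shrinkage (proximal) step of each kind to a handful of made-up weights. The function names, weights, and threshold `t` are assumptions for illustration, and this is not a description of how the solver used in this notebook works internally.

```python
def soft_threshold(weights, t):
    """L1 shrinkage step: move each weight toward zero by t, clipping at zero."""
    out = []
    for w in weights:
        if abs(w) <= t:
            out.append(0.0)      # small weights are removed entirely
        elif w > 0:
            out.append(w - t)
        else:
            out.append(w + t)
    return out

def l2_shrink(weights, t):
    """L2 shrinkage step for the penalty (t/2) * sum(w^2): scale by 1 / (1 + t)."""
    return [w / (1.0 + t) for w in weights]

weights = [2.0, -0.3, 0.05, -1.2]   # made-up coefficients
t = 0.5                             # made-up penalty strength

l1_result = soft_threshold(weights, t)
l2_result = l2_shrink(weights, t)
print("L1:", l1_result)
print("L2:", l2_result)
```

After the L1 step the two small weights are exactly zero, while the L2 step leaves every weight non-zero but smaller in magnitude.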
  
Let's explore the effect of L1 versus L2 regularization on the F1 score. Getting into some details: the default tolerance (tol) value is 0.0001. This is the tolerance for the solver's stopping criterion, so a smaller value demands a more precisely converged solution. The default is strict and can lead to convergence problems, so it will be relaxed to 0.01. For the logistic regression, the macro-averaged F1 score is reported.
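The scoring code in this section passes average='macro' to f1_score. As a plain-Python sketch of what that averaging does, the example below computes an F1 score per class and then takes the unweighted mean. The helper names and toy labels are made up for illustration, and unlike the library this simplified version only considers labels that appear in y_true.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / float(len(labels))

y_true = [0, 0, 0, 0, 1, 1, 2, 2]   # toy labels
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]   # toy predictions
print("macro F1: %.4f" % macro_f1(y_true, y_pred))
```

Because every class counts equally, a small class such as talk.religion.misc influences the macro average as much as a large one.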

In [12]:
tol = 0.01 

# instantiating countvectorizer
cv = CountVectorizer()
X_train = cv.fit_transform(train_data)
X_dev = cv.transform(dev_data)

# training LogisticRegression model using "l1" penalty
lr_l1 = LogisticRegression(penalty="l1", tol=tol)
lr_l1.fit(X_train, train_labels)

# training LogisticRegression model using "l2" penalty
lr_l2 = LogisticRegression(penalty="l2", tol=tol)
lr_l2.fit(X_train, train_labels)

def non_zero_weight_count(lr):
    """the number of learned weights that are not zero"""
    count = 0
    for weights in lr.coef_:
        for weight in weights:
            if weight != 0:
                count += 1
    return(count)
def calc_F1(lr_model):
    return metrics.f1_score(y_true=dev_labels, y_pred=lr_model.predict(X = X_dev), average='macro')

coefficients_count = lr_l1.coef_.shape[0] * lr_l1.coef_.shape[1]

summary_table = pd.DataFrame(
    OrderedDict((
        ("regularization type", ['L1', 'L2']),
        ("number of non-zero weights", [non_zero_weight_count(lr_l1), non_zero_weight_count(lr_l2)]),
     ("percent of coefficients that are non-zero", \
         [100 * non_zero_weight_count(lr_l1)/ coefficients_count,
          100 * non_zero_weight_count(lr_l2)/ coefficients_count]),
     ("F1 score", [calc_F1(lr_model) for lr_model in [lr_l1, lr_l2]])))
)
summary_table

| | regularization type | number of non-zero weights | percent of coefficients that are non-zero | F1 score |
|---|---|---|---|---|
| 0 | L1 | 2211 | 2 | 0.682414 |
| 1 | L2 | 107516 | 100 | 0.673588 |


This shows that L1 regularization set about 98% of the coefficients to 0, yet both models have about the same F1 score. That in itself is pretty interesting. Considering that the other 98% of coefficients don't improve things much, they seem like computational dead weight, and I prefer the leanness of L1 regularization. Let's see what happens if we try to improve upon L1 and L2 regularization by testing different lambda coefficients.  

The lambda coefficient determines how aggressively to reduce coefficient magnitude. A large lambda says, try really hard to reduce the magnitude of coefficients, while a small lambda says, don't worry too much about coefficient magnitude. scikit-learn's logistic regression class instead exposes C, which is simply 1/lambda. Thus, when C is large, large coefficients are penalized less than when C is small.
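The relationship between C and lambda can be written out as a sketch. Here $\ell$ stands for a generic per-example loss and $R(w)$ for the penalty; both symbols are notation introduced for this illustration and are not taken from the notebook or from the library's exact objective.

$$ \min_w \; \sum_i \ell(w; x_i, y_i) + \lambda R(w) \quad \Longleftrightarrow \quad \min_w \; C \sum_i \ell(w; x_i, y_i) + R(w), \qquad C = \frac{1}{\lambda} $$

$$ R(w) = \|w\|_1 = \sum_k |w_k| \ \ \text{(L1)}, \qquad R(w) = \tfrac{1}{2}\|w\|_2^2 = \tfrac{1}{2}\sum_k w_k^2 \ \ \text{(L2)} $$

The two forms differ only by the constant factor $\lambda$, so they share the same minimizer. Raising C puts more weight on fitting the training data, while lowering C puts more weight on keeping the coefficients small.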

In [13]:
def find_nonzero_i(lr):
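    '''gets the indexes of features that have a positive weight for at least
    one label in a logistic regression ('lr') model; features whose weights
    are all zero or negative are not counted'''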
    nonzero_i = []
    for label in lr.coef_:
        for i, weight in enumerate(label):
            if weight > 0:
                nonzero_i.append(i)
    nonzero_i = list(set(nonzero_i)) # removing duplicates with set
    return(nonzero_i)

Cs = [0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]
F1s = []
vocab_len = []
summary_table = pd.DataFrame()
for regularization_type in ["l1", "l2"]:
    for C in Cs:
        lr = LogisticRegression(penalty = regularization_type, 
                                tol = tol, 
                                C = C)
        lr.fit(X_train, train_labels)   
        
        nonzero_i = find_nonzero_i(lr)        
        pred = lr.predict(X = X_dev) # X_dev[:, nonzero_i]
        F1 = metrics.f1_score(y_pred=pred, y_true=dev_labels, average = 'macro')
        summary_table = summary_table.append(
            {
                "regularization type": regularization_type.capitalize(),
                "C hyperparameter": C,
                "number of non-zero features": len(nonzero_i),
                "F1 score": F1
            }, 
            ignore_index = True
        )
            
summary_table[summary_table.columns[::-1]]

| | regularization type | number of non-zero features | F1 score | C hyperparameter |
|---|---|---|---|---|
| 0 | L1 | 12.0 | 0.396846 | 0.01 |
| 1 | L1 | 188.0 | 0.624815 | 0.1 |
| 2 | L1 | 933.0 | 0.68487 | 1.0 |
| 3 | L1 | 2312.0 | 0.666484 | 10.0 |
| 4 | L1 | 6491.0 | 0.618008 | 100.0 |
| 5 | L1 | 20208.0 | 0.575365 | 1000.0 |
| 6 | L1 | 25470.0 | 0.560051 | 10000.0 |
| 7 | L1 | 25910.0 | 0.57551 | 100000.0 |
| 8 | L1 | 25966.0 | 0.541606 | 1000000.0 |
| 9 | L2 | 26817.0 | 0.646087 | 0.01 |


These results show that the C hyperparameter has a strong effect on logistic regression with L1 regularization. When C is low, coefficient magnitude weighs heavily in the cost function, so only features with very strong associations keep non-zero coefficients, and the coefficients of less impactful features become zero. There is a sweet spot, in this case around a C of 1, where there are enough non-zero coefficients to produce the highest F1 score (0.685). Below it, too many coefficients are set to zero and the model underfits; above it, more and more features are retained and the F1 score on the development data declines, which is consistent with overfitting. With L2 regularization, C does not decide which features are kept, because coefficients are shrunk but not set exactly to zero. It still changes the coefficient magnitudes and therefore the predictions: the L2 row shown (C = 0.01) has an F1 score of 0.646, compared with 0.674 at the default C of 1 in the previous table.

## Naive Bayes for Topic Classification
Let's now turn our attention to Naive Bayes.
The equation it uses is Bayes' rule:  
$$ p(topic\ category\ i\ |\ token\ j) = \frac{p(token\ j\ |\ topic\ category\ i)p(topic\ category\ i)}{p(token\ j)} $$
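A post contains many tokens, so the single-token rule has to be combined across them. The "naive" assumption is that tokens are conditionally independent given the topic category, which gives the following sketch. The shorthand $c_i$ for topic category $i$, $t_j$ for token $j$, and $n$ for the number of tokens in the post is notation introduced here for illustration.

$$ p(c_i\ |\ t_1, \dots, t_n) \propto p(c_i) \prod_{j=1}^{n} p(t_j\ |\ c_i) $$

The predicted category is the $c_i$ that maximizes the right-hand side. The denominator from Bayes' rule is the same for every category, so it can be dropped when comparing categories.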

In [14]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_data)
X_dev = vectorizer.transform(dev_data)
nb = MultinomialNB(alpha=1)
nb.fit(X = X_train, y = train_labels)
print classification_report(y_true = dev_labels, 
                      y_pred = nb.predict(X = X_dev))

             precision    recall  f1-score   support

          0       0.63      0.75      0.68       165
          1       0.92      0.91      0.92       185
          2       0.84      0.87      0.85       199
          3       0.68      0.50      0.58       127

avg / total       0.78      0.78      0.78       676



Here, we can see that Naive Bayes is performing better than logistic regression (average F1 of 0.78 versus 0.69 for the unigram logistic regression). One possible reason for the better performance of Naive Bayes is that it is less likely than logistic regression to overfit the training data. To see if we can improve Naive Bayes further, let's tune its Laplace smoothing hyperparameter. In scikit-learn's multinomial Naive Bayes this is called alpha, and it is the value added to every feature count before the probabilities are estimated. Without smoothing, a token never seen with a category during training would get a probability of zero for that category. Since the per-token probabilities are multiplied together, that single zero would wipe out the evidence from every other token in the post, which can hurt prediction.
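As a minimal sketch of additive smoothing, the example below estimates p(token | category) from made-up counts with and without alpha. The `token_prob` helper, the three-word vocabulary, and the counts are all hypothetical. The formula is the standard additive-smoothing estimate, written here for illustration and not copied from the library's source.

```python
def token_prob(count, class_total, vocab_size, alpha):
    """Smoothed estimate of p(token | category): add alpha to every count."""
    return (count + alpha) / float(class_total + alpha * vocab_size)

# Hypothetical token counts within a single category.
counts = {"orbit": 30, "moon": 20, "pixel": 0}
class_total = sum(counts.values())
vocab_size = len(counts)

for alpha in (0.0, 0.1, 1.0):
    probs = dict((tok, token_prob(n, class_total, vocab_size, alpha))
                 for tok, n in counts.items())
    print(alpha, probs)
```

With alpha set to 0 the unseen token gets probability zero, so any post containing it would score zero for this category no matter what else it says. Any positive alpha gives it a small non-zero probability while the estimates still sum to one.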


In [15]:
alphas = list(np.linspace(0,0.2,10))
summary_table = pd.DataFrame()
for alpha in alphas:
    mnb = MultinomialNB(alpha = alpha)
    mnb.fit(X_train, train_labels)
    dev_pred = mnb.predict(X = X_dev)
    f1_score = metrics.f1_score(y_true = dev_labels,
                               y_pred = dev_pred,
                               average = "macro")
    summary_table = summary_table.append({'alpha': alpha,
                                         'F1 score': f1_score}, ignore_index = True)
summary_table



| | F1 score | alpha |
|---|---|---|
| 0 | 0.363807 | 0.0 |
| 1 | 0.764591 | 0.022222 |
| 2 | 0.765929 | 0.044444 |
| 3 | 0.768043 | 0.066667 |
| 4 | 0.770256 | 0.088889 |
| 5 | 0.770484 | 0.111111 |
| 6 | 0.771956 | 0.133333 |
| 7 | 0.770714 | 0.155556 |
| 8 | 0.76751 | 0.177778 |
| 9 | 0.769409 | 0.2 |


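These results show that the F1 score peaks at an alpha of about 0.13 (0.772), although every alpha tested between 0.09 and 0.16 scores within 0.002 of that peak. With no smoothing at all (alpha of 0), the F1 score collapses to 0.364.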

Let's now take our best performing conditions and test them with the test data.

In [16]:
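# instantiating CountVectorizer
# caution: `ngram` is the loop variable left over from In [11]; if the cells ran
# in order it is (3, 3) here, so the results below are for trigram features.
# Pass ngram_range=(1, 1) to evaluate the unigram setup instead.
vectorizer = CountVectorizer(ngram_range=ngram)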

# fitting and transforming training data
X_train = vectorizer.fit_transform(train_data)

X_test = vectorizer.transform(test_data)

lr = LogisticRegression(tol = 0.01, penalty = "l2", C = 0.1)
lr.fit(X = X_train, y = train_labels)
print "Optimized logistic regression conditions: "
print classification_report(y_true = test_labels, 
                      y_pred = lr.predict(X = X_test))
print "="*20
mnb = MultinomialNB(alpha = 0.1)
mnb.fit(X = X_train, y = train_labels)
print "Optimized Multinomial Naive Bayes conditions: "
print classification_report(y_true = test_labels, 
                      y_pred = mnb.predict(X = X_test))

Optimized logistic regression conditions: 
             precision    recall  f1-score   support

          0       0.62      0.31      0.41       154
          1       0.43      0.92      0.59       204
          2       0.56      0.41      0.47       195
          3       0.64      0.11      0.19       124

avg / total       0.55      0.49      0.44       677

Optimized Multinomial Naive Bayes conditions: 
             precision    recall  f1-score   support

          0       0.45      0.64      0.53       154
          1       0.84      0.70      0.76       204
          2       0.65      0.55      0.60       195
          3       0.43      0.42      0.43       124

avg / total       0.62      0.59      0.60       677



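With these settings, Multinomial Naive Bayes does better than logistic regression on the test data on average (precision, recall, and F1 of 0.62, 0.59, and 0.60 versus 0.55, 0.49, and 0.44), although logistic regression has higher precision for categories 0 and 3 and higher recall for category 1. Note that the vectorizer in this cell reuses the `ngram` variable left over from the earlier loop, so if the cells were run in order both models were evaluated on trigram features, and these scores are not directly comparable with the unigram results on the development data. The gap may result from logistic regression overfitting the training data. In summary, this shows us the power of the simple Naive Bayes classifier as applied to categorizing user posts into newsgroups.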