# Battle of the Tweet Classification Algorithms!
Having written [an implementation of the Naive Bayes algorithm](https://github.com/chrisfalter/DataScience/tree/master/NLP/NaiveBayes) to predict the geolocation of test tweets from a tweet corpus, I thought it would be worthwhile to see how it compares to other algorithms that can be used for document classification. The scikit-learn library provides a terrific toolkit for this exploration because:
+ It has a plethora of classification algorithms
+ The `sklearn.model_selection.GridSearchCV` class provides an easy-to-use API to optimize a classifier's hyperparameters

This experiment compares the accuracy of the following algorithms:
+ Naive Bayes
+ Adaboost
+ Random Forest
+ K-Nearest Neighbors
+ Gradient Boost
+ Model Stack of Best 3 Models

You will also learn how to use `sklearn.pipeline.Pipeline` and `sklearn.preprocessing.FunctionTransformer` to create a custom data preprocessing pipeline.

## The Limits of This Experiment
Machine learning's "No Free Lunch" theorem states that no single algorithm gives the best predictions for all problems; the ability to solve one class of problems well is gained at the expense of doing more poorly on some other class of problems. Therefore do not use the outcome of this experiment to justify using the tweet classification winner as your algorithm of choice to answer all data questions! Instead, be informed and inspired by this experiment to conduct your own experiment to find which classifier (with which hyperparameters) will best answer your questions.

Let's begin by importing some classes and namespaces.

In [1]:
import re
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk import download
import string
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

## Building a Data Pipeline from Transform Functions
Scikit-learn provides an API to build a pipeline of transformers to munge, cleanse, transform, and feature-engineer your data set. I wrote a variety of functions that conform to the expected interface for a data transformer. It is a simple matter to use the `sklearn.preprocessor.FunctionTransformer` class to turn them into building blocks of a pipeline.

In [2]:
# Assumes a list of strings, one string per doc. Use before tokenization of docs.
def lowercase(X):
    for i in range(len(X)):
        X[i] = X[i].lower()
    return X    
    
specialEntitiesRegex = re.compile("&[a-z]+;")

# Assumes a list of strings, one string per doc. Use before tokenization of docs.
def removeSpecialEntities(X):
    for i in range(len(X)):
        s = X[i]
        m = specialEntitiesRegex.search(s)
        while (m):
            s = s[:m.start()] + s[m.end():]
            m = specialEntitiesRegex.search(s)
        X[i] = s
    return X

printable = set(string.printable)

# Assumes a list of strings, one string per doc. Use before tokenization of docs.
def removeUnprintable(X):
    for i in range(len(X)):
        s = ''.join(filter(lambda x: x in printable, X[i]))
        X[i] = s
    return X
    
punc = string.punctuation.replace(' ','') # don't remove spaces! They demarcate words
punctuationRemover = str.maketrans(punc, ' '*len(punc))

# Assumes a list of strings, one string per doc. Use before tokenization of docs.
def removePunctuation(X):
    for i in range(len(X)):
        s = X[i].translate(punctuationRemover)
        X[i] = s
    return X

def tokenize(X):
    for i in range(len(X)):
        X[i] = X[i].split()
    return X
    
cityInitials = ['la', 'orl', 'washing']

# Assumes a list of lists, one list of tokens per doc. Use after tokenization of docs.
def generateCityInitialsTokens(X):
    for i in range(len(X)):
        doc = X[i]
        initialList = []
        for token in doc:
            for initials in cityInitials:
                if token.startswith(initials):
                    initialList.append(initials)
        if (initialList):
            doc += initialList
            X[i] = doc
    return X

monikers = set(['ny','york','manh',
            'dc',
            'bost',
            'chic','illin',
            'hollw','angel',
            'orlan','fl',
            'sf','bay','fran',
            'tl','georgia',
            'houst','tx','tex',
            'phil','delph','penn',
            'sd','dieg',
            'toro','canad','ontar'])

# Assumes a list of lists, one list of tokens per doc. Use after tokenization of docs.
def generateMonikerTokens(X):
    for i in range(len(X)):
        doc = X[i]
        monikerList = []
        for tok in doc:
            for moniker in monikers:
                if moniker in tok:
                    monikerList.append(moniker)
        if monikerList:
            doc += monikerList
            X[i] = doc
    return X

# make sure the list of stop words is available
download("stopwords")
commonTweetWords = ['job','hiring','jobs','careerarc','street','opening','work','apply','st'] 
commonTweetWords += list("abcdefghijklmnopqrstuvwxyz") + list("01234456789")
stopWords = set(stopwords.words('english') + commonTweetWords)

# Assumes a list of lists, one list of tokens per doc. Use after tokenization of docs.
def removeStopWords(X):
    '''
    Assumes a list of lists, one list of tokens per doc. Use after tokenization of docs.
    '''
    for i in range(len(X)):
        doc = X[i]
        doc = list(filter(lambda t: t not in stopWords, doc))
        X[i] = doc
    return X

# Use at end of preprocessing pipeline to convert doc-as-tokens into a string, so it can be used by a vectorizer
def stringize(X):
    for i in range(len(X)):
        X[i] = ' '.join(X[i])
    return X

preprocessor = Pipeline(steps=[
    ('Lower Case', FunctionTransformer(lowercase, validate=False)), 
    ('SpecialEntities', FunctionTransformer(func = removeSpecialEntities, validate = False)),
    ("Unprintable",FunctionTransformer(func = removeUnprintable, validate = False)),
    ("Punctuation",FunctionTransformer(removePunctuation, validate = False)),
    ("Tokenize",FunctionTransformer(tokenize, validate=False)),
    ("City Initials", FunctionTransformer(generateCityInitialsTokens, validate=False)), 
    ("Monikers", FunctionTransformer(generateMonikerTokens, validate=False)), 
    ("Stop Words", FunctionTransformer(removeStopWords, validate=False)), 
    ('Stringize', FunctionTransformer(stringize, validate = False))])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\582139\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now let's load our training and test data into the preprocessing pipeline.

In [25]:
def getLocationAndTweet(s):
    '''
    Returns a tuple (location, tweet)
    location = first token in string s
    tweet = remainder of string s
    '''
    divider = s.find(' ')
    return s[:divider], s[divider + 1:]

def getXandY(rawData):
    X, Y = [],[]
    for line in rawData:
        location, tweet = getLocationAndTweet(line)
        X.append(tweet)
        Y.append(location)
    return X, Y
        
train_path = 'C:/NLP/ClassificationSklearn/tweets.train.txt'
test_path = 'C:/NLP/ClassificationSklearn/tweets.test1.txt'
with open(train_path, 'r', encoding='latin1', newline='\n') as trainTweets:
    X_train, Y_train = getXandY(trainTweets)
with open(test_path, 'r', encoding='latin1', newline='\n') as testTweets:
    X_test, Y_test = getXandY(testTweets)    
X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)

Now let's transform the preprocessed tweets into vectorized (sparse array) representations that the scikit-learn classifiers can use.

In [4]:
count_vectorizer = CountVectorizer(lowercase = False)
X_train_cv = count_vectorizer.fit_transform(X_train, Y_train)
X_test_cv = count_vectorizer.transform(X_test)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_cv)
X_test_tfidf = tfidf_transformer.fit_transform(X_test_cv)

## Time to Test

### Naive Bayes

In [5]:
def fit_and_score(classifier, X_tr, Y_tr, X_te, Y_te):
    classifier.fit(X_tr, Y_tr)
    predictions = classifier.predict(X_te)
    return metrics.accuracy_score(Y_te, predictions)

print("Naive Bayes w/count:   %0.3f" % fit_and_score(MultinomialNB(), X_train_cv, Y_train, X_test_cv, Y_test))
print("Naive Bayes w/tf-idf:   %0.3f" % fit_and_score(MultinomialNB(), X_train_tfidf, Y_train, X_test_tfidf, Y_test))

Naive Bayes w/count:   0.698
Naive Bayes w/tf-idf:   0.614


The MultinomialNB classifier with straightforward counts obtains the identical result to [the Naive Bayes classifier that I manually coded](https://github.com/chrisfalter/DataScience/tree/master/NLP/NaiveBayes)! My intuition that TF-IDF would degrade model accuracy seems accurate, as well.

### Tune the Hyperparameters
Now let's use sklearn's ability to search over hyperparameters to tune our Naive Bayes classifier. Given the small number of combinations that this code will examine, we can use the exhaustive search strategy provided by [the GridSearchCV method](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). If we had a large state space of hyperparameters to explore, we might first use [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to narrow our search to the most promising region of hyperparameter settings.

In [5]:
def tune_classifier(classifier, clf_name, parameter_space, X_tr, Y_tr, X_te, Y_te):
    estimator = Pipeline([(clf_name, classifier)])
    grid_search = GridSearchCV(estimator, parameter_space, n_jobs=3, cv=5) # 5-fold cross-validation
    grid_search.fit(X_tr, Y_tr)
    best_estimator = grid_search.best_estimator_
    best_parameters = best_estimator.get_params()
    for param_name in sorted(parameter_space.keys()):
        print("%s: %r" % (param_name, best_parameters[param_name]))
    predictions = best_estimator.predict(X_te)
    print(clf_name + " tuned model accuracy: %0.3f" % metrics.accuracy_score(Y_te, predictions))    

In [36]:
param_name = "NB__alpha"
parameter_space = {param_name: np.array(range(1, 11))/10}
tune_classifier(MultinomialNB(), "NB",  parameter_space, X_train_cv, Y_train, X_test_cv, Y_test)

NB__alpha: 0.40000000000000002
NB tuned model accuracy: 0.694


Curiously, after hyperparameter tuning the model performs slightly worse on the test set. Do not conclude, however, that you should not tune hyperparameters! In all likelihood, the tuned model should generalize better to a typical, large test set. There might be something just a little bit atypical about our test set.

Now let's try some other algorithms.
### K-Nearest Neighbors, Random Forest, and Adaboost

In [None]:
# Because of the high number of classes (12), we specify a high number of neighbors in K-neighbors classifier.
# p = 2 so Minkowski = Euclidean distance. Because the TF-IDF vector is L2-normalized, this yields same neighbor ranking
# as cosine similarity.
kn_parm_space = {"KN__n_neighbors": [11, 13, 15], 
                 "KN__p": [2]} 
tune_classifier(KNeighborsClassifier(), "KN", kn_parm_space, X_train_tfidf, Y_train, X_test_tfidf, Y_test)

rf_parm_space = {"RF__n_estimators": [10, 50, 100, 250],
                 "RF__max_features": ["log2"]}
tune_classifier(RandomForestClassifier(), "RF", rf_parm_space, X_train_cv, Y_train, X_test_cv, Y_test)

ada_parm_space = {"Ada__n_estimators": [50, 100, 150], 
                  "Ada__learning_rate": [0.2, 0.6, 1.0]}
tune_classifier(AdaBoostClassifier(), "Ada",  ada_parm_space, X_train_cv, Y_train, X_test_cv, Y_test)
tune_classifier(AdaBoostClassifier(), "Ada",  ada_parm_space, X_train_tfidf, Y_train, X_test_tfidf, Y_test)



KN__n_neighbors: 11
KN__p: 2
KN tuned model accuracy: 0.464
RF__max_features: 'log2'
RF__n_estimators: 250
RF tuned model accuracy: 0.706
Ada__learning_rate: 1.0
Ada__n_estimators: 150
Ada tuned model accuracy: 0.600
Ada__learning_rate: 0.6
Ada__n_estimators: 150
Ada tuned model accuracy: 0.602


K-Nearest Neighbors and Adaboost don't seem to compete effectively in predicting tweet geological origins. However, Random Forest is doing well. Since its best result came at the maximum value of the hyperparameter `n_estimators`, let's extend the range of `n_estimators` values further.

In [6]:
rf_parm_space = {"RF__n_estimators": [200, 400, 600, 800],
                 "RF__max_features": ["log2"]}
tune_classifier(RandomForestClassifier(), "RF", rf_parm_space, X_train_cv, Y_train, X_test_cv, Y_test)


RF__max_features: 'log2'
RF__n_estimators: 800
RF tuned model accuracy: 0.708


No dice! Adding model complexity by tripling the number of estimators had almost no effect on accuracy. So it is probably better to stick with 250 decision trees in the random forest, given that each tree would have a maximum of log2 features.

My off-line testing shows that the training vectors `X_train_cv` and `X_train_tfidf` have 32,000 features (i.e., words). This means that each decision tree in the random forest would use a maximum of 14 features (log2(32000)). Let's set the maximum tree features to the square root of 32000 (178) to see if that improves classification accuracy. 

In [12]:
rf_parm_space = {"RF__n_estimators": [15, 40, 90, 150],
                 "RF__max_features": ["sqrt"]}
tune_classifier(RandomForestClassifier(), "RF", rf_parm_space, X_train_cv, Y_train, X_test_cv, Y_test)


RF__max_features: 'sqrt'
RF__n_estimators: 90
RF tuned model accuracy: 0.684


### Gradient Boosting
Like the Random Forest algorithm, Gradient Boosting builds an ensemble of decision trees. However, the trees are generated to minimize a loss function along a gradient, rather than randomly as in the random forest method. Let's see how well it classifies tweets.

In [14]:
gb_parm_space = {"GB__n_estimators": [80, 150, 300],
                 "GB__max_depth": [3,4,5]}
tune_classifier(GradientBoostingClassifier(), "GB", gb_parm_space, X_train_cv, Y_train, X_test_cv, Y_test)

GB__max_depth: 5
GB__n_estimators: 300
GB tuned model accuracy: 0.672


## Model Stacking
Predicting with a stack of models is the machine learning equivalent of "the wisdom of crowds"; the strengths of the various models compensate for the weaknesses of any one model. The procedure is straightforward: Select the 3 most desirable models, configure them according to the experiments conducted so far, then combine their votes into a single prediction. Sophisticated practitioners with plenty of time on their hands could stack many more than 3 models, but the Random Forest, Naive Bayes, and Gradient Boosting classifiers will suffice for our purposes.

In [7]:
rf_clf = RandomForestClassifier(n_estimators = 250, max_features = 'log2')
nb_clf = MultinomialNB(alpha = 0.4)
gb_clf = GradientBoostingClassifier(max_depth = 5, n_estimators = 300)

soft_stack = VotingClassifier([('rf', rf_clf), ('nb', nb_clf), ('gb', gb_clf)], voting = 'soft', n_jobs = 1)
print("Stack w/uniform soft voting:   %0.3f" % fit_and_score(soft_stack, X_train_cv, Y_train, X_test_cv, Y_test))

hard_stack = VotingClassifier([('rf', rf_clf), ('nb', nb_clf), ('gb', gb_clf)], voting = 'hard', weights = [3, 2, 2], n_jobs = 1)
print("Stack w/weighted hard voting:   %0.3f" % fit_and_score(hard_stack, X_train_cv, Y_train, X_test_cv, Y_test))


Stack w/uniform soft voting:   0.704
Stack w/weighted hard voting:   0.706


## The Winner Is ... It Depends
Random forest gets the top accuracy score of 70.8%, followed closely by the model stack at 70.6% and Naive Bayes at 69.4%. However, outside of Kaggle contests, accuracy is not the only criterion in selecting an algorithm. For example, let's see how long the algorithms run to obtain their results.

In [40]:
import time

# time measurement for training
nb_start = time.time()
nb_clf = MultinomialNB(alpha = 0.4)
nb_clf.fit(X_train_cv, Y_train)
nb_training_time = time.time() - nb_start

rf_start = time.time()
rf_clf = RandomForestClassifier(n_estimators = 250, max_features = 'log2')
rf_clf.fit(X_train_cv, Y_train)
rf_training_time = time.time() - rf_start

# time measurement for prediction
nb_start = time.time()
nb_clf.predict(X_test_cv)
nb_predict_time = time.time() - nb_start

rf_start = time.time()
rf_clf.predict(X_test_cv)
rf_predict_time = time.time() - rf_start

# display results
print('=' * 10, 'Training Race', '=' * 10)
print("Naive Bayes:   %0.4f" % nb_training_time)
print("Random Forest: %0.4f" % rf_training_time)
print()
print('=' * 10, 'Prediction Race', '=' * 10)
print("Naive Bayes:   %0.4f" % nb_predict_time)
print("Random Forest: %0.4f" % rf_predict_time)

Naive Bayes:   0.1235
Random Forest: 842.6814

Naive Bayes:   0.0035
Random Forest: 0.6795


Naive Bayes trains *almost 4 orders of magnitude faster than random forest, and predicts 200 times faster*. If you are renting your compute from AWS, Azure, or Google, using Naive Bayes instead of Random Forest could provide huge cost savings.

And now let's see how much we can learn about how various features contribute to the predictions.

In [47]:
def topFeaturesInRF(rf_clf, vec, n):
    top_features = []
    feature_map = {} # key = index, value = term
    for k in vec.vocabulary_:
        feature_map[vec.vocabulary_[k]] = k
    top_index =  np.argsort(-rf_clf.feature_importances_)[:n]
    for i in range(len(top_index)):
        top_features.append(feature_map[top_index[i]])
    return top_features

top_rf_features = topFeaturesInRF(rf_clf, count_vectorizer, 40)
print(top_rf_features)
    

['chic', 'ny', 'york', 'chicago', 'houston', 'houst', 'bost', 'angel', 'tx', 'fran', 'new', 'il', 'ca', 'atlanta', 'orl', 'toronto', 'dieg', 'toro', 'washing', 'phil', 'boston', 'dc', 'tl', 'washington', 'nyc', 'los', 'san', 'sanfrancisco', 'delph', 'angeles', 'la', 'losangeles', 'ga', 'orlan', 'francisco', 'sandiego', 'philadelphia', 'orlando', 'orlpol', 'newyork']


In [38]:
def topFeaturesInNB(nb_clf, vec, n):
    '''
    n = number of features per class
    '''
    top_features = {}
    class_names = nb_clf.classes_
    feature_map = {} # key = index, value = term
    for k in vec.vocabulary_:
        feature_map[vec.vocabulary_[k]] = k
    for i in range(nb_clf.feature_log_prob_.shape[0]):
        index_top = np.argsort(-nb_clf.feature_log_prob_[i,:])[:n]
        features_for_class = []
        for idx in index_top:
            features_for_class.append(feature_map[idx])
        top_features[class_names[i]] = features_for_class
    return top_features 

nb_clf = MultinomialNB(alpha = 0.4)
nb_clf.fit(X_train_cv, Y_train)
top_features = topFeaturesInNB(nb_clf, count_vectorizer, 10)
for city in top_features:
    print(city, "-", top_features[city])
        

Atlanta,_GA - ['tl', 'atlanta', 'ga', 'georgia', 'la', 'ny', 'atl', 'latest', 'click', 'fl']
Boston,_MA - ['bost', 'boston', 'la', 'latest', 'report', 'massachusetts', 'see', 'click', 'ny', 'great']
Chicago,_IL - ['chic', 'chicago', 'il', 'la', 'illin', 'ny', 'illinois', 'tl', 'latest', 'click']
Houston,_TX - ['tx', 'houst', 'houston', 'la', 'latest', 'tex', 'click', 'texas', 'nursing', 'healthcare']
Los_Angeles,_CA - ['la', 'angel', 'ca', 'los', 'angeles', 'losangeles', 'hollywood', 'tl', 'california', 'ny']
Manhattan,_NY - ['ny', 'york', 'new', 'nyc', 'la', 'newyork', 'manh', 'fl', 'manhattan', 'park']
Orlando,_FL - ['orl', 'fl', 'orlan', 'orlpol', 'orlando', 'opd', 'ave', 'la', 'dr', 'rd']
Philadelphia,_PA - ['phil', 'delph', 'philadelphia', 'pa', 'la', 'philly', 'penn', 'ny', 'pennsylvania', 'tl']
San_Diego,_CA - ['dieg', 'ca', 'san', 'sandiego', 'diego', 'la', 'sd', 'ny', 'california', 'latest']
San_Francisco,_CA - ['fran', 'ca', 'san', 'sanfrancisco', 'francisco', 'sf', 'la', 'ca

We see that Naive Bayes provides *much better feature information than Random Forest*. For example, the second most important feature according to the Random Forest `feature_importances_` list is `'ny'`. You would naturally assume that this would very strongly predict a geolocation of Manhattan; but you might be wrong. The Naive Bayes `feature_log_prob_` list shows that `'ny'` also appears among the top 10 predictors for 9 other cities. Perhaps the feature generation of the `ny` token is not working as expected due to words like "irony" or "phony" in tweets. The detailed information provided by Naive Bayes indicates that removing `'ny'` from the feature-generation would be worth trying.

### Conclusion: Rules of Thumb
For the tweet geolocation classification problem, here are the heuristics you could use in model selection:
+ In exploratory stages, until your preprocessing pipeline is stable: use Naive Bayes due to its dramatic advantages in speed and cost.
+ In production:
  + If you need to explain the details of your algorithm to regulators or customers: use Naive Bayes
  + If the higher compute costs of random forest outweigh its slightly better accuracy: use Naive Bayes
  + If accuracy trumps everything: use Random Forest
+ In a Kaggle competition: use Random Forest

This experiment did not examine the capabilities of a recurrent neural network (RNN). An RNN would be even more complex than a random forest and yield even less information about feature importance, but its (presumably) superior accuracy might justify its use. 