# Tweet Poster Classifier

This is a basic Natrual Langaue Processing (NLP) and Supervised Machine Learning (ML) example using real tweets from Elon Musk and Jeff Bezos to demonstrate the capabilities of common classification models and their application to unstructured text data.

## Predictor

Building the vectorizer will require the following steps:

 - Clean and Vectorize the corpus, then split into training and testing sets
 - Decide on a list of Classification Models to train 
 - Create Parameter Grids to uncover the best performing combination
 - Grid Search using the training set
 - Evalutate the perfromance of the best models using the testing sets

In [1]:
#silence warnings for this demo
import warnings
warnings.filterwarnings('ignore')

#sklearn tools to preprocess data
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample

#sklearn models to train
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

#sklearn tools for evaluating models
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score


import pandas as pd
import nltk
nltk.download('stopwords')

import numpy as np
import re
from collections import Counter
from pprint import pprint

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alexr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


To begin, the DataFrams from the last Notebook is read in

In [2]:
musk_df = pd.read_csv('./data/musk_tweets.csv', index_col=0)
musk_df.head()

Unnamed: 0,tweet_id,handle,retweet_count,text,mined_at,created_at
0,1422627025068695556,elonmusk,244,@Erdayastronaut @ErcXspace We stole the idea f...,2021-08-03 16:36:38.236180,Tue Aug 03 18:34:48 +0000 2021
1,1422615364479897606,elonmusk,191,@flcnhvy Pitch control requires more force tha...,2021-08-03 16:36:38.236180,Tue Aug 03 17:48:28 +0000 2021
2,1422612139160834050,elonmusk,144,@TeslaFruit Thanks Sandy!,2021-08-03 16:36:38.236180,Tue Aug 03 17:35:39 +0000 2021
3,1422608233995382791,elonmusk,3527,https://t.co/nNjhPIEhcZ,2021-08-03 16:36:38.236180,Tue Aug 03 17:20:08 +0000 2021
4,1422607954101084161,elonmusk,8561,Super Heavy Booster moving to orbital launch m...,2021-08-03 16:36:38.236180,Tue Aug 03 17:19:01 +0000 2021


In [3]:
bezos_df = pd.read_csv('./data/bezos_tweets.csv',index_col=0)
bezos_df.head()

Unnamed: 0,tweet_id,handle,retweet_count,text,mined_at,created_at
0,1233441223232245760,JeffBezos,2554,"Discussing climate, sustainability, and preser...",2021-08-03 16:36:42.400558,Fri Feb 28 17:17:58 +0000 2020
1,1224154674804084736,JeffBezos,13443,"I just took a DNA test, turns out I’m 100% @li...",2021-08-03 16:36:42.400558,Mon Feb 03 02:16:32 +0000 2020
2,1222572705066536961,JeffBezos,2775,"Hey, Alexa — show everyone our upcoming Super ...",2021-08-03 16:36:42.400558,Wed Jan 29 17:30:21 +0000 2020
3,1220059386694922240,JeffBezos,5441,#Jamal https://t.co/8ej1rUBXVb,2021-08-03 16:36:42.400558,Wed Jan 22 19:03:20 +0000 2020
4,1219093283265138688,JeffBezos,9970,"Hey, India. We’re rolling out our new fleet of...",2021-08-03 16:36:42.400558,Mon Jan 20 03:04:23 +0000 2020


To do a small amount of analysis first, the top 10 most common tokens are printed out. A token in NGram will be groups of words as oppose to single words

In [4]:
#define fucntion to analysis top 10 common tokens
def top10Tokens(df, vectorizer):
    """
    provided a dataframe (w/ 'text' column) and a vectorzier, returns top ten results.
    """
    flattened_corpus = " ".join(df['text'])
    text_vector = vectorizer.build_analyzer()(flattened_corpus)
    return Counter(text_vector).most_common(10)

In [5]:
#intialize TfidfVector
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(2,5))

#print out results
print("Top 10 Elon Musk Phrases: \n")
pprint(top10Tokens(musk_df, vectorizer))

print("\nTop 10 Jeff Bezos Phrases: \n")
pprint(top10Tokens(bezos_df, vectorizer))

Top 10 Elon Musk Phrases: 

[('erdayastronaut spacex', 15),
 ('doo doo', 12),
 ('super heavy', 10),
 ('self driving', 10),
 ('fsd beta', 9),
 ('doo doo doo', 9),
 ('teslarati klenderjoey', 8),
 ('ppathole spacex', 8),
 ('later year', 8),
 ('teslarati residentsponge', 7)]

Top 10 Jeff Bezos Phrases: 

[('gradatimferociter https', 13),
 ('blueorigin https', 8),
 ('blueorigin team', 7),
 ('launchlandrepeat blueorigin', 7),
 ('blue origin', 6),
 ('crew capsule', 6),
 ('excited announce', 4),
 ('amazon air', 4),
 ('huge kudos', 4),
 ('new shepard', 4)]


It's clear Elon is much more frivilous with what he likes to tweet about.

To prepare the data for Machine Learning Processing, the first step will be to clean the data. This entails using the Natural Langauge ToolKit or NLTK package. A stemmer is used to group words with the same stem. Stemming is essentially converting a conjugated word to its base form. Puncuation and stop words are removed from the corpus as well.

In [6]:
#fucntion to apply to each tweet. Stems words, and removes puct and stopwords
def nltkTextCleaner(text):
    
    stemmer = nltk.stem.PorterStemmer()
    words = nltk.corpus.stopwords.words("english")
    
    removePunct = re.sub("[^a-zA-z]", " ", text).split()
    clean_stems = [stemmer.stem(i) for i in removePunct if i not in words]
    return " ".join(clean_stems).lower()

As seen before, the number of tweets from both CEO are darastically different. Elon has 1500 tweets, while Jeff only has about 250. This is known as an imbalance in labels, which can lead models to favor predicting the majority class (in this case Elon).

To compensate this, the resample fucntion is brought in from the scikit-learn library, which will randomly create duplicates of posts until both classes are balanced.

In [7]:
#resamle bezos tweets until its equal to amount of tweets musk has
bezos_df = resample(bezos_df, n_samples = len(musk_df), random_state=42)

#concat dataframes
combined_df = pd.concat([musk_df, bezos_df])

#make variable with the clean text
all_text = combined_df['text'].values
clean_text = [nltkTextCleaner(i) for i in all_text]

#create the labels to be used later
labels = combined_df['handle'].values

In [8]:
#print combined DataFrame
combined_df

Unnamed: 0,tweet_id,handle,retweet_count,text,mined_at,created_at
0,1422627025068695556,elonmusk,244,@Erdayastronaut @ErcXspace We stole the idea f...,2021-08-03 16:36:38.236180,Tue Aug 03 18:34:48 +0000 2021
1,1422615364479897606,elonmusk,191,@flcnhvy Pitch control requires more force tha...,2021-08-03 16:36:38.236180,Tue Aug 03 17:48:28 +0000 2021
2,1422612139160834050,elonmusk,144,@TeslaFruit Thanks Sandy!,2021-08-03 16:36:38.236180,Tue Aug 03 17:35:39 +0000 2021
3,1422608233995382791,elonmusk,3527,https://t.co/nNjhPIEhcZ,2021-08-03 16:36:38.236180,Tue Aug 03 17:20:08 +0000 2021
4,1422607954101084161,elonmusk,8561,Super Heavy Booster moving to orbital launch m...,2021-08-03 16:36:38.236180,Tue Aug 03 17:19:01 +0000 2021
...,...,...,...,...,...,...
114,911304409803522048,JeffBezos,561,"Future engineer, 9-year-old Ryan, signs Amazon...",2021-08-03 16:36:42.400558,Fri Sep 22 19:01:17 +0000 2017
174,775324124935778304,JeffBezos,1340,Blue Origin’s next step…meet New Glenn #NewGle...,2021-08-03 16:36:42.400558,Mon Sep 12 13:24:10 +0000 2016
193,722248253996019712,JeffBezos,96,Champagne cracked in the newsroom today. Congr...,2021-08-03 16:36:42.655675,Tue Apr 19 02:19:37 +0000 2016
219,688732394442997760,JeffBezos,292,"Believe me, if you’re ever tossed in a foreign...",2021-08-03 16:36:42.655675,Sun Jan 17 14:39:33 +0000 2016


To process the text, it must be transformed into a numeric representation (i.e. a vector). The same vector as before will be used, but with a constraint on how big of a vocabulary to build. Here the vocab limit is set to 1000

In [9]:
#Create features and lables, assign to common convention of X, y variable names
vectorizer = TfidfVectorizer(stop_words='english', ngram_range = (2,4), max_features = 1000)
X = vectorizer.fit_transform(clean_text)
y = labels

A big theme in training prediciton models is that they can be overfitted to the training data. Meaning that it performs better on the training data (almost 'memorizing' anwsers) but generalize worse (scoring lowly on unseen data).

To check whether a model is overfitting or not, a fraction of the dataset is put to the side, seperated from the training data, to be used to evaluate how the model performs on unseen data.

Scikit Learn provides a simple function to randomly shuffle and seperate data while mantaining the order of the data called train_test_split().

In [10]:
#split data into testing and training set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)

More than just one algorithm should be tested out because it can usually be ambigiuous which model should perform better. It is better to train various models ands compare them than decide theoretically which one should be best.

Even when looking at one algoritm, there are various hyperparameters that can be tuned to reach better performance, but again, instead of speculating, its better to test out all the combinations of parameters of interest.

The models that will be tested out here are the following. Links are provided of the models' scikit learn documentation:

 - [Logisitic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

 - [MultiLayer Perceptron Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

 - [K Nearest Neighbors Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

 - [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

 - [Gradient Boosting Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

In [11]:
#dictionary with initiallized models
models = {
    "LR" : LogisticRegression(penalty = 'l1', solver='liblinear', random_state=42),
    "MLP" : MLPClassifier(random_state=42),
    "KN" : KNeighborsClassifier(),
    "SVC" : SVC(random_state=42),
    "GBC" : GradientBoostingClassifier(random_state=42)
}

#parameter grids for each model
params = {
        "LR" : {'C' : list(np.power(10.0, np.arange(-3, 3)))},
    
        "MLP" : {
            'hidden_layer_sizes' : [50,100,250],
            'solver' : ['adam', 'sgd'],
            'alpha' : np.power(10.0, np.arange(-4,1))
        },
    
        "KN" : {
            'n_neighbors': np.arange(2, 6),
            'weights' : ['uniform', 'distance']
        },
    
        "SVC" : {
            'C' : list(np.power(10.0, np.arange(-3, 3))),
            'kernel' : ['linear', 'rbf', 'sigmoid']    
        },
        "GBC" : {
            'loss' : ['deviance', 'exponential'],
            'learning_rate' : list(np.power(10.0, np.arange(-5,0))),
            'n_estimators' : [50, 100, 250],
            'max_depth' : [3,4,5]
        }
}

Since an interative process is being performed, a loop was created to do all the training and collect the results.

In [12]:
#create list to append results in
best_estimators = []
for algorithm, model in models.items():
    #create dict that will describe results
    best_estimator = {}
    
    #train using grid search to find best params
    trainer = GridSearchCV(model, param_grid = params[algorithm], cv=3)
    trainer.fit(X_train, y_train)
    
    #populate dict with results
    best_estimator['algorithm'] = algorithm
    best_estimator['estimator'] = trainer.best_estimator_
    best_estimator['cv_score'] = trainer.best_score_
    best_estimator['params'] = trainer.best_params_
    
    #create test score using test data
    y_pred = trainer.best_estimator_.predict(X_test)
    best_estimator['test_score'] = accuracy_score(y_pred, y_test)
    
    #add results to list
    best_estimators.append(best_estimator)

In [13]:
#print out results using pandas
pd.DataFrame(data = best_estimators)

Unnamed: 0,algorithm,estimator,cv_score,params,test_score
0,LR,"LogisticRegression(C=10.0, penalty='l1', rando...",0.814839,{'C': 10.0},0.823913
1,MLP,"MLPClassifier(alpha=1.0, hidden_layer_sizes=10...",0.817169,"{'alpha': 1.0, 'hidden_layer_sizes': 100, 'sol...",0.825
2,KN,KNeighborsClassifier(n_neighbors=2),0.814839,"{'n_neighbors': 2, 'weights': 'uniform'}",0.825
3,SVC,"SVC(C=10.0, kernel='sigmoid', random_state=42)",0.81577,"{'C': 10.0, 'kernel': 'sigmoid'}",0.825
4,GBC,([DecisionTreeRegressor(criterion='friedman_ms...,0.812973,"{'learning_rate': 0.1, 'loss': 'deviance', 'ma...",0.820652


The results show that the best model was tied between the MLP Classifier, the K Nearest Neighbor, and the SVM classifier.

All the models got decent results of 82% from the test data. To combare to a base metric, a dummy classifier (aka one that will guess the majority class everytime) would yield 50% (recall that bezos's tweets were resampled).