# Stylext: Tweet Attribution with Naive Bayes & Logistic Regression

Note to non-programmer friends: *You need to select "Restart & Clear Output" under the Kernel tab before running the code cells below*

### Introduction

Although technically not 100% pure stylometry (because distinguishing one user from the other with the code below is affected by the respective topics being discussed), this notebook file will illustrate how the same sort of algorithms used to distinguish spam from valid email can also be used to distinguish one Twitter user from another using their post content alone.

Both feeds used in the sample csv data are about the same topic (economics). However, there are no *conscious* attempts by the users to obfuscate their Tweet styles, but the techniques used can still be useful if someone is unaware of what distinguishes their tweet style from others.

With each Python code cell, click on it to highlight then shift + enter to execute it. The * symbol means it's running, while a number means it completed.

## Part 1: Importing Needed Libraries

You will need *pandas* to read in rows and colums (containing the raw article text, and columns for all of the criteria of interest.

*Numpy* and *scipy* add functionality that you will depend on throughout notebook use. Very specific tools are also imported from *scikit-learn.* In particular, a few natural language processing tools are imported which may be used to boost model accuracy (with iterative trial and error).

**Do not worry if the brief library descriptions in the code below do not make sense to you; the specifics of what they do are ellaborated further in this notebook as they are put to use.**

In [1]:
# These are the core libraries you need to import to run the scripts that follow.

import pandas as pd # this is needed to read in dataframes (rows and columns of data)
import numpy as np # numpy allows you to efficiently work with and execute operations on arrays
import scipy as sp # scipy builds off of numpy, giving you added linear algebra functionality under the hood

Now that our core libraries are imported, we need to import several things from Scikit-Learn. These will allow use to *add structure* to otherwise unstructured text, *apply machine learning models* to classify text samples, and *measure the accuracy* of the output for the data we will load in. 

In [2]:
# Here are more specific tools from Scikit-Learn for natural language processing and measuring accuracy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # two vectorization methods we want for later
from sklearn.naive_bayes import MultinomialNB # multinomial naive bayes classifier
from sklearn.linear_model import LogisticRegression # basic logistic regression classifier
from sklearn.model_selection import train_test_split # this splits the data loaded in into training & testing groups
from sklearn import metrics # this will help us understand the results of the train/test split simulation

## Part 2: Load in CSV File Containing Tweets and Define Train/Test Variables

Now we will read in the data file (in comma seperated values format) from an online github repo, and store it in a python variable called "tweets" so we can continue to work with it throughout the notebook. Pandas' "read_csv" feature will allow use to read in and define the CSV.

In [3]:
# Read post_feed.csv into a DataFrame. Any CSV with columns containing raw tweet contents and usernames can often work.
# If you're offline, replace the link with the file location for post_feed.csv if you have it stored locally.

url = 'csv/tweets.csv' # define url as csv data
tweets = pd.read_csv(url) # read the csv file using the pandas python library and define it as 'tweets'

Defining training and testing variables must be done in order to even begin testing and improving predictive accuracy.

In [4]:
# define X and y, or the manipulated variable and the responding variable: Given the text, which user tweeted it?

X = tweets.raw_text  # this defines X as the csv column that contains all the raw tweet text of both users
y = tweets.username  # the responding variable (y) is now defined as the column with the two Twitter usernames


# split the new DataFrame into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

The code below ensures the dataframe has been read in and the variables are defined as intended.

In [5]:
 # check the first five rows/tweets

tweets.head()

Unnamed: 0,time_stamp,raw_text,username
0,2016-05-17 04:53:06,*Tweets about his IQ but misspells both letter...,DLin71_feed
1,2016-05-17 02:20:27,Fascinating quotes on the back of Trump's new ...,DLin71_feed
2,2016-05-17 02:06:07,ME: Want to see a magic trick?\r\nCHILD: Yes!\...,DLin71_feed
3,2016-05-16 03:43:53,@pmarca Experts Insist That Tetherball Has Fin...,DLin71_feed
4,2016-05-16 03:21:34,2012 NOMINEE: I have a car elevator\r\nGOP: *f...,DLin71_feed


In [6]:
# check the last five rows/tweets, notice the change in which username's tweets are visible

tweets.tail()

Unnamed: 0,time_stamp,raw_text,username
3528,2015-12-21 17:00:21,Kenyan villagers ban MP from going home using ...,NinjaEconomics
3529,2015-12-21 14:00:26,Venezuela frees Pepsi workers it arrested for ...,NinjaEconomics
3530,2015-12-21 08:05:33,This gem: https://t.co/NBGaAxJvUT @mrgunn @mat...,NinjaEconomics
3531,2015-12-21 06:04:32,The only thing that stops.a bad man with a wea...,NinjaEconomics
3532,2015-12-21 01:54:40,"In every country, people overestimate the shar...",NinjaEconomics


In [7]:
# check the number of rows (tweets stored) and columns

tweets.shape

# even though there are 3 columns, most of the time we're only going to use two at a time (no need for time_stamp)

(3533, 3)

In [8]:
# check the first five rows in a shorter format - it's only displaying the raw_text column that X was defined as.

X.head()

0    *Tweets about his IQ but misspells both letter...
1    Fascinating quotes on the back of Trump's new ...
2    ME: Want to see a magic trick?\r\nCHILD: Yes!\...
3    @pmarca Experts Insist That Tetherball Has Fin...
4    2012 NOMINEE: I have a car elevator\r\nGOP: *f...
Name: raw_text, dtype: object

In [9]:
# now we will see the first five rows of the responding variable (the username of who posted what)

y.head()

0    DLin71_feed
1    DLin71_feed
2    DLin71_feed
3    DLin71_feed
4    DLin71_feed
Name: username, dtype: object

## Part 3: Time to Vectorize

- **What:** Separate text into units such as sentences or words that can be better quantified
- **Why:** Gives structure to previously unstructured text that machine learning can be applied to
- **Notes:** Relatively easy with English language text, not easy with some languages

We are now going to create what are called "document-term matrices" of the tweets. Think of these as rows and columns which store numbers representing how often certain terms appear in a given document (or passage of text). The image below may help you understand what that looks like under the hood:

&nbsp;

![Document-Term Matrix](http://mlg.postech.ac.kr/static/research/nmf_cluster1.PNG)

&nbsp;

In [10]:
# use CountVectorizer to create document-term matrices from X_train and X_test

vect = CountVectorizer() # because vect is way easier to type than CountVectorizer...
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.fit_transform(X_test)

# now we have quantitative info about the tweets that a 'multinomial naive Bayes classifier' can work with

**Just to clarify what's going on in the adjacent cells:** All the **rows** are the *individual tweets* that are stored in the CSV file. But the astronomical crapload of **columns** is literally *each unique term* that appears. Those are going to be the "features" used to "fingerprint" one user from another. 

In [11]:
# rows are documents, columns are terms (aka "tokens" or "features")

X_train_dtm.shape

(2649, 9433)

In [12]:
# last 50 features

print(vect.get_feature_names()[-50:])

['yields', 'yigccncbvc', 'yiuqomnq6b', 'ylnxtdwqax', 'yo', 'york', 'you', 'young', 'younger', 'your', 'yours', 'yourself', 'youth', 'youtube', 'yqj1crgunp', 'yr', 'yrhycoujwj', 'yrs', 'ytdhfqx2uo', 'yuan', 'yugey', 'yuuuuge', 'yxej7ujeji', 'yzilber', 'z3jlakii1d', 'z4ngvfeofn', 'z7z2zrv49u', 'zachweiner', 'zaxwlyffr5', 'zero', 'zerohedge', 'zf8rf4dxkr', 'zhpllsypbn', 'zika', 'ziploc', 'ziuz8mnbrw', 'zj8k3bbmds', 'zjhvrnfbza', 'zme70c7tev', 'znwymtnt5m', 'zodiac', 'zoecdckc5e', 'zp6bm3tqdw', 'zqkf6j6s3b', 'zrziri8zqz', 'ztgffqx9km', 'zuckerberg', 'zuvmoy65sc', 'zvb2hh0fb6', 'zwb2ujzwpi']


In [13]:
# show vectorizer options, which are currently at their default values

vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

**Take a look at the output above.** These are settings that CountVectorizer has (which is currently stored in "vect"). The main ones to keep in mind are lowercase (whether all the words get converted to lowercase), max_features (how many of those words are used to "fingerprint" the text of the tweets), ngram_range (if set to (1, 2), it will look at individual words as well as word pairs, and so on as you increment the latter number), and stop_words (words that are so common that they might not be useful for tweet classification can be ignored).

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) - in case you might be interested.

In [14]:
# We will not convert to lowercase for now, but if we did it would reduce the number of unique words looked at

vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(2649, 10529)

- Parameter **lowercase:** boolean, True by default
    - If True, Convert all characters to lowercase before tokenizing.
    
This can be useful for preventing word capitalization from making your results less predictive.

In [15]:
# last 50 features

print(vect.get_feature_names()[-50:])

['yourlyingeyes', 'yours', 'yourself', 'youth', 'youtube', 'ypmpHVlcgH', 'yr', 'yrs', 'yuhoTTe82N', 'z1td7ysKZY', 'z2rLFYMeyd', 'z3GrnFpgoW', 'z41SHufjWx', 'z49Km4Q1ky', 'z6clWWLaRJ', 'z7I10h1CvP', 'zDcEhQ62Za', 'zHRKYWhDcX', 'zJ3RKVfCFZ', 'zN1TqO1gt6', 'zTMhty2uyP', 'zUSJYSc8rA', 'zUgXN8UBIp', 'zV5RwlguKr', 'zVADU0WyIU', 'zYN20ilr0p', 'zachhaller', 'zackcooperYale', 'zackwhittaker', 'zaxvdCb4WK', 'zbclypcVJr', 'zcuAJwbtQS', 'zer', 'zero', 'zerohedge', 'zeynep', 'zgXtz1rlDr', 'zh5l9DmJJ6', 'zhb79SzxrM', 'zhnqMSbKN4', 'zn8PLI6RED', 'zoIxiiKbxA', 'zoiLznYBmp', 'zone', 'zookmann', 'zqd5ayR8U2', 'zv1dcTjJzz', 'zyKKkxAp1g', 'zzv0VhgTM7', 'zzzzzzzzz']


In [16]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(2649, 38979)

- Parameter **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [17]:
# last 50 features

print(vect.get_feature_names()[-50:])

['zj3rkvfcfz', 'zjidforq8w', 'zjtsqygvk7', 'zjtsqygvk7 https', 'zk8rhd5dzu', 'zl1jyllm4z', 'zn1tqo1gt6', 'zn8pli6red', 'znhxore3bj', 'zoe', 'zoe called', 'zoe saldana', 'zoilznybmp', 'zoixiikbxa', 'zoixiikbxa amp', 'zone', 'zookmann', 'zookmann https', 'zozkj5cczx', 'zozkj5cczx https', 'zqcygzlhsy', 'zqcygzlhsy https', 'zqd5ayr8u2', 'zs7hkl4tau', 'zs7hkl4tau from', 'zsfewct3nc', 'ztmhty2uyp', 'ztr6oter1d', 'zugxn8ubip', 'zuka', 'zuka virus', 'zusjysc8ra', 'zv1dctjjzz', 'zv5rwlgukr', 'zvadu0wyiu', 'zwebn0atpm', 'zwebn0atpm https', 'zwischenzugjg', 'zwischenzugjg don', 'zxgpaschbh', 'zxienh2kkd', 'zykkkxap1g', 'zyn20ilr0p', 'zyn20ilr0p https', 'zzerxrmdmn', 'zzerxrmdmn https', 'zzf7nelt8u', 'zzf7nelt8u https', 'zzv0vhgtm7', 'zzzzzzzzz']


**Predicting which user made what Tweet:** 

Now for the moment of truth... How accurate can we predict who is who?

In [18]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

0.9253393665158371


**The cell below will eliminate the need for typing in the same code over and over again, as well as produce an output that includes all the information we need to know about how the number of unique features is affecting the classifier accuracy.**

In [19]:
# define a function that accepts a vectorizer and calculates the accuracy

lr = LogisticRegression()

def tokenize_test(vect, model):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    if model == 'lr':
        lr.fit(X_train_dtm, y_train)
        y_pred_class = lr.predict(X_test_dtm)
        algorithm = 'Logistic Regression'
    elif model == 'nb':
        nb.fit(X_train_dtm, y_train)
        y_pred_class = nb.predict(X_test_dtm)
        algorithm = 'Multinomial Naive Bayes'
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))
    print(algorithm)

In [20]:
vect = CountVectorizer()
tokenize_test(vect, model='lr')

Features:  9433
Accuracy:  0.9287330316742082
Logistic Regression


In [21]:
# include 1-grams, 2-grams, and 3-grams

vect = CountVectorizer(ngram_range=(1, 3))
tokenize_test(vect, model='nb')

Features:  73865
Accuracy:  0.9253393665158371
Multinomial Naive Bayes


## Part 4: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [22]:
# show vectorizer options again, notice the setting for "stop_words" isn't set to any language

vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [23]:
# remove English stop words

vect = CountVectorizer(stop_words='english')
tokenize_test(vect, model='nb')

Features:  9179
Accuracy:  0.9321266968325792
Multinomial Naive Bayes


In [24]:
# set of stop words

print(vect.get_stop_words())

frozenset({'beside', 'once', 'thereupon', 'twenty', 'within', 'whereby', 'thereby', 'and', 'why', 'themselves', 'hasnt', 'anyway', 'an', 'it', 'she', 'hers', 'six', 'whereupon', 'while', 'further', 'hence', 'been', 'part', 'whose', 'no', 'neither', 'ourselves', 'this', 'what', 'already', 'perhaps', 'thick', 'as', 'through', 'top', 'am', 'take', 'whatever', 'amongst', 'bottom', 'ltd', 'please', 'system', 'in', 'might', 'nevertheless', 'for', 'whether', 'the', 'describe', 'therein', 'via', 'throughout', 'without', 'mill', 'former', 'during', 'eg', 'ever', 'almost', 'go', 'hereafter', 'seeming', 'somewhere', 'thin', 'alone', 'wherein', 'itself', 'mostly', 'were', 'latter', 'a', 'seems', 'interest', 'nothing', 'fire', 'becomes', 'down', 'found', 'if', 'behind', 'last', 'move', 'of', 'all', 'one', 'since', 'sincere', 'thus', 'every', 'thereafter', 'us', 'across', 'more', 'however', 'also', 'ours', 'who', 'he', 'would', 'five', 'our', 'hundred', 'due', 'how', 'against', 'often', 'become', 't

## Part 5: Other CountVectorizer Options

- **max_features:** int or None, default=None
- If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

In [25]:
# remove English stop words and only keep 100 features

vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect, model='nb')

Features:  100
Accuracy:  0.920814479638009
Multinomial Naive Bayes


In [26]:
# all 100 features

print(vect.get_feature_names())

['000', '10', 'actually', 'america', 'american', 'americans', 'amp', 'article', 'asian', 'average', 'bc', 'better', 'big', 'black', 'called', 'child', 'china', 'college', 'data', 'day', 'did', 'didn', 'does', 'doesn', 'don', 'dsmoxon', 'economic', 'economics', 'economists', 'economy', 'female', 'free', 'gender', 'going', 'good', 'government', 'help', 'higher', 'home', 'http', 'https', 'income', 'inequality', 'isn', 'job', 'jobs', 'just', 'know', 'like', 'link', 'live', 'look', 'make', 'man', 'mattocko', 'maybe', 'mean', 'men', 'need', 'new', 'old', 'pay', 'people', 'police', 'poor', 'ppl', 'read', 'really', 'right', 'said', 'say', 'says', 'sex', 'support', 'sure', 'tell', 'think', 'time', 'trump', 'trying', 'tweet', 'tweets', 'twitter', 'understand', 'use', 'using', 've', 'wage', 'want', 'way', 'white', 'woman', 'women', 'won', 'work', 'world', 'wrong', 'year', 'years', 'yes']


In [27]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect, model='nb')

Features:  38979
Accuracy:  0.9264705882352942
Multinomial Naive Bayes


In [28]:
# include 1-grams and 2-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 2), max_features=10000)
tokenize_test(vect, model='nb')

Features:  10000
Accuracy:  0.9321266968325792
Multinomial Naive Bayes


- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.

In [29]:
# include 1-grams and 2-grams, and only include terms that appear at least 2 times

vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect, model='nb')
print(vect.get_feature_names())

Features:  6676
Accuracy:  0.9400452488687783
Multinomial Naive Bayes


In [30]:
# include 1-grams and 2-grams, and only include terms that appear at least 2 times, with 2200 max features

vect = CountVectorizer(ngram_range=(1, 2), min_df=2, max_features=2200)
tokenize_test(vect, model='nb')
print
print(vect.get_feature_names()) # this will display the actual feature names

Features:  2200
Accuracy:  0.9423076923076923
Multinomial Naive Bayes
['000', '10', '10 000', '100', '11', '12', '14', '15', '16', '18', '1st', '20', '20 of', '200', '2013', '2014', '2015', '2016', '2016 https', '2017', '2017 president', '2018', '21', '22', '2nd', '30', '35', '39', '40', '40 of', '400', '45', '50', '60', '75', '90', '_romanelite', '_sunilrawat', 'ability', 'ability to', 'able', 'able to', 'abortion', 'abortions', 'about', 'about https', 'about the', 'about their', 'about your', 'above', 'abt', 'abuse', 'academic', 'acceptable', 'access', 'access to', 'accidentally', 'according', 'according to', 'account', 'accounts', 'accused', 'across', 'actually', 'ad', 'ad_captandum', 'add', 'added', 'advice', 'advocate', 'afford', 'afford to', 'africa', 'after', 'after the', 'again', 'against', 'age', 'age of', 'ago', 'agree', 'airbnb', 'airlines', 'alex', 'algorithm', 'alison', 'alisonrapp', 'all', 'all of', 'all the', 'all women', 'allegations', 'allowed', 'allowed to', 'almost',

### The cell below contains the settings that got me closest to 95% accuracy

**However it's likely that the result is due in part to overfitting on available data. K-folds validation would be needed to reduce this effect.**

In [31]:
# include 1-grams, 2-grams, and 3-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 3), max_features=1550, stop_words='english')
tokenize_test(vect, model='nb')

Features:  1550
Accuracy:  0.9468325791855203
Multinomial Naive Bayes


## ADDENDUM: Using GridSearchCV for Hyperparameter Opitimization

Although it wasn't covered comprehensively in the part-time course I took, it turns out there is a rather simple way to do multiple train/test splits on the same data with cross-validation (to avoid overfitting), as well as try out a variety of parameter settings to see which ones output the most predictive results. 

**Enter GridSearchCV.**

In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [33]:
steps=[('vectorize',CountVectorizer()),\
       ('clf',MultinomialNB())]

In [34]:
pipe=Pipeline(steps)

In [35]:
X_train, X_test, y_train, y_test=\
train_test_split(tweets['raw_text'], tweets['username'], random_state=1)

In [36]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorize', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [37]:
pred=pipe.predict(X_test)

In [38]:
print('Accuracy = {:3f}'.format(accuracy_score(y_test,pred)))

Accuracy = 0.925339


In [39]:
pipe.named_steps.keys()

dict_keys(['vectorize', 'clf'])

In [40]:
param_grid = dict(vectorize__stop_words=[None,'english'],\
                  vectorize__min_df=[1,3,5,7,9,12],\
                  vectorize__lowercase=[True,False],\
                  vectorize__ngram_range=[(1,1),(1,2),(1,3),(2,2),(2,3)],\
                  vectorize__max_features=[1693,1695,1697] # did a lot of iteration to whittle it down to those
                 )

In [41]:
grid_search = GridSearchCV(pipe, param_grid=param_grid,\
                           scoring=make_scorer(accuracy_score),n_jobs=6)

In [42]:
%time res=grid_search.fit(tweets['raw_text'],tweets['username'])

CPU times: user 2.04 s, sys: 100 ms, total: 2.14 s
Wall time: 57.7 s


In [43]:
res.best_params_

{'vectorize__lowercase': True,
 'vectorize__max_features': 1695,
 'vectorize__min_df': 1,
 'vectorize__ngram_range': (1, 1),
 'vectorize__stop_words': None}

In [44]:
print(res.best_score_)

0.9377299745258987


## Future Experimentation

**I plan on expanding the GridSearchCV to use different k-folds values, as well as test out the accuracy of different classifiers in addition to Multinomial Naive Bayes. Trying out different train/test split sizes and random states is also on my radar. I just want to see if any accuracy increases are due to overfitting or not.**