# Stylext: Tweet Attribution w/Multinomial Naive Bayes & Logistic Regression

## Introduction

This is not pure stylometry, because the topic being discussed also helps distinguish one user from another. Even so, this notebook illustrates how the same kind of algorithm used to separate spam from non-spam can also distinguish one Twitter user from another. Both feeds are about the same topic (economics), and neither user makes any *conscious* attempt to obfuscate their tweet style.

To run a Python code cell, click on it to highlight it, then press Shift + Enter. A * symbol means the cell is still running, while a number means it has completed.

## Part 1: Importing Needed Libraries

You will need *pandas* to read in the rows and columns of the CSV file (one column containing the raw text, plus columns for all of the other criteria of interest).

*Numpy* and *scipy* add functionality that you will depend on throughout notebook use. Very specific tools are also imported from *scikit-learn.* Additionally, a few natural language processing tools are imported which may be used to boost model accuracy (with iterative trial and error).

In [None]:
# These are the core libraries you need to import to run the scripts that follow.

import pandas as pd
import numpy as np
import scipy as sp

Now that our core libraries are imported, we need to import several things from Scikit-Learn. These will allow us to add structure to otherwise unstructured text, apply machine learning models to classify text samples, and measure the accuracy of the output for the data we will load in.

In [None]:
# Here are more specific tools from Scikit-Learn

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer # two vectorization methods we want for later
from sklearn.naive_bayes import MultinomialNB # multinomial naive bayes classifier
from sklearn.linear_model import LogisticRegression # basic logistic regression classifier
from sklearn.model_selection import train_test_split # this splits the data loaded in into training & testing groups (formerly in sklearn.cross_validation)
from sklearn import metrics # this will help us understand the results of the train/test split simulation

## Part 2: Load in CSV File Containing Tweets

In [None]:
# Read post_feed.csv into a DataFrame. Any CSV with columns containing raw tweet contents and usernames can often work.
# If you're offline, replace the link with the file location for post_feed.csv if you have it stored locally.

url = 'https://raw.githubusercontent.com/analyticascent/stylext/master/csv/post_feed.csv'
post = pd.read_csv(url)


# define X and y, or the manipulated variable and the responding variable: Given the text, which user tweeted it?

X = post.raw_text  # Depending on the raw tweet text column contents...
y = post.username  # ...which user wrote the tweet?


# split the new DataFrame into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# check the first five rows/tweets

post.head()

In [None]:
# check the first five rows in a shorter format

X.head()

In [None]:
# check the number of rows (X is a single column, so its shape has only one entry)

X.shape

## Part 3: Time to Vectorize

- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
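
To make the "separate text into units" step concrete, here is a minimal sketch of how `CountVectorizer` splits a string with its default settings. The sample sentence is made up for illustration and is not taken from the tweet data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer uses internally
# to turn one raw string into a list of tokens
analyzer = CountVectorizer().build_analyzer()

# hypothetical sample text (not from post_feed.csv)
tokens = analyzer('Rates are UP, again!')
print(tokens)
```

With the defaults, the text is lowercased, punctuation is dropped, and tokens shorter than two characters are ignored.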

We are now going to create what are called "document-term matrices" of the tweets. Think of these as rows and columns which store numbers representing how often certain terms appear in a given document (or passage of text). The image below may help you understand what that looks like under the hood:

&nbsp;

![Document-Term Matrix](http://mlg.postech.ac.kr/static/research/nmf_cluster1.PNG)

&nbsp;
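
A small document-term matrix can also be built by hand to see the idea in miniature. The sketch below uses three made-up "tweets" (hypothetical data, not from the CSV file) and displays the counts as a table with one row per document and one column per term.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# three made-up "tweets" (hypothetical data, not taken from post_feed.csv)
toy_tweets = ['rates are rising', 'rates are falling fast', 'inflation is rising']

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_tweets)

# vocabulary_ maps each term to its column index;
# the columns are in sorted term order, so sorting the terms labels them correctly
toy_terms = sorted(toy_vect.vocabulary_)
toy_table = pd.DataFrame(toy_dtm.toarray(), columns=toy_terms)
print(toy_table)
```

Each cell holds the number of times that column's term appears in that row's document, and most cells are zero, which is why the real matrices are stored in sparse form.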

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test

vect = CountVectorizer() # because vect is way easier to type than CountVectorizer...
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# now we have quantitative info about the tweets that a 'multinomial naive Bayes classifier' can work with

In [None]:
vect

**Just to clarify what's going on in the adjacent cells:** All the **rows** are the *individual tweets* that are stored in the CSV file. But the astronomical crapload of **columns** is literally *each unique term* that appears. Those are going to be the "features" used to "fingerprint" one user from another. 

In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")

X_train_dtm.shape

In [None]:
# last 50 features

print(vect.get_feature_names_out()[-50:])  # get_feature_names_out() requires scikit-learn 1.0 or newer

In [None]:
# show vectorizer options

vect

[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) - in case you might be interested.

- Parameter **lowercase:** boolean, True by default
    - If True, convert all characters to lowercase before tokenizing.
    
This can be useful for preventing word capitalization from making your results less predictive.
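
As a rough illustration of the effect, the sketch below fits two vectorizers on the same made-up strings (hypothetical examples, not from the tweet data) and compares the vocabularies they learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents that differ only in capitalization of "call"
docs = ['Call me a cab', 'call you tonight', 'please CALL me']

# vocabulary_ maps each learned term to a column index; sorted() lists the terms
lower_terms = sorted(CountVectorizer(lowercase=True).fit(docs).vocabulary_)
cased_terms = sorted(CountVectorizer(lowercase=False).fit(docs).vocabulary_)

print(len(lower_terms), lower_terms)
print(len(cased_terms), cased_terms)
```

With `lowercase=False`, "Call", "call", and "CALL" become three separate features, so the vocabulary grows. Whether that helps or hurts attribution is something to test, since capitalization habits can themselves be part of a user's style.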

In [None]:
# We will not convert to lowercase for now, but if we did it would reduce the number of quantified features

vect = CountVectorizer(lowercase=False)
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# last 50 features

print(vect.get_feature_names_out()[-50:])  # get_feature_names_out() requires scikit-learn 1.0 or newer

- Parameter **ngram_range:** tuple (min_n, max_n)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
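
To see what `ngram_range` adds, this minimal sketch compares the features extracted from one made-up sentence (a hypothetical example, not from the tweet data) with and without 2-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# one made-up sentence
sentence = ['the fed raised rates']

# 1-grams only (the default) versus 1-grams plus 2-grams
unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(sentence).vocabulary_)
uni_and_bigrams = sorted(CountVectorizer(ngram_range=(1, 2)).fit(sentence).vocabulary_)

print(unigrams)
print(uni_and_bigrams)
```

The second list contains every single word plus every pair of adjacent words, which is why the feature count rises sharply once 2-grams are included.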

In [None]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

In [None]:
# last 50 features

print(vect.get_feature_names_out()[-50:])  # get_feature_names_out() requires scikit-learn 1.0 or newer

**Predicting which user made what Tweet:** 

Now for the moment of truth... How accurately can we predict who is who?

In [None]:
# use default options for CountVectorizer
vect = CountVectorizer()

# create document-term matrices
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# use Naive Bayes to predict which user wrote each tweet
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy
print(metrics.accuracy_score(y_test, y_pred_class))

The cell below defines a helper function so that we do not have to retype the same code over and over. Its output includes what we need to know about how the number of unique features affects classifier accuracy.

In [None]:
# define a function that accepts a vectorizer and calculates the accuracy

def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_pred_class = nb.predict(X_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
vect = CountVectorizer()
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

## Part 4: Stopword Removal

- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text

In [None]:
# show vectorizer options

vect

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
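
Here is a minimal sketch of what the built-in English list does, using a made-up phrase (a hypothetical example, not from the tweet data).

```python
from sklearn.feature_extraction.text import CountVectorizer

# one made-up phrase containing two very common words
docs = ['the market and the dollar']

all_terms = sorted(CountVectorizer().fit(docs).vocabulary_)
kept_terms = sorted(CountVectorizer(stop_words='english').fit(docs).vocabulary_)

print(all_terms)
print(kept_terms)
```

For authorship attribution this is worth testing rather than assuming: how often someone uses common function words can be a stylistic signal, so removing them may lower accuracy instead of raising it.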

In [None]:
# remove English stop words

vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

In [None]:
# set of stop words

print(vect.get_stop_words())

## Part 5: Other CountVectorizer Options

- **max_features:** int or None, default=None
    - If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
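
As a small illustration, the sketch below keeps only the two most frequent terms from a made-up corpus (hypothetical data, not from the CSV file).

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up corpus: "rates" appears 5 times, "up" 3 times, "inflation" once
docs = ['rates rates rates up', 'rates up up', 'inflation rates']

top_vect = CountVectorizer(max_features=2)
top_vect.fit(docs)

# only the two terms with the highest total counts survive
top_terms = sorted(top_vect.vocabulary_)
print(top_terms)
```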

In [None]:
# remove English stop words and only keep 100 features

vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

In [None]:
# all 100 features

print(vect.get_feature_names_out())  # get_feature_names_out() requires scikit-learn 1.0 or newer

In [None]:
# include 1-grams and 2-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 2), max_features=2200)
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams

vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

In [None]:
# include 1-grams and 2-grams, and limit the number of features

vect = CountVectorizer(ngram_range=(1, 2), max_features=10000)
tokenize_test(vect)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
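
Note that `min_df` counts documents, not total occurrences. The sketch below uses a made-up corpus (hypothetical data, not from the CSV file) to show which terms survive `min_df=2`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# made-up corpus: "rates" is in 3 documents, "up" in 2, "down" and "again" in 1 each
docs = ['rates up', 'rates down', 'rates up again']

frequent_vect = CountVectorizer(min_df=2)
frequent_vect.fit(docs)

# terms found in fewer than 2 documents are dropped from the vocabulary
frequent_terms = sorted(frequent_vect.vocabulary_)
print(frequent_terms)
```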

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 tweets

vect = CountVectorizer(ngram_range=(1, 2),  max_features=10000, min_df=2)
tokenize_test(vect)
print(vect.get_feature_names_out())  # get_feature_names_out() requires scikit-learn 1.0 or newer

## Part 6: Term Frequency-Inverse Document Frequency (TF-IDF) Introduction

- **What:** Computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [None]:
# Just pretend each of these strings is a "document" - we will vectorize them

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term Frequency

vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document Frequency

vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names_out())

In [None]:
# Term Frequency-Inverse Document Frequency (simple version)

tf/df

In [None]:
# TfidfVectorizer

vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())

**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)
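
The numbers from `TfidfVectorizer` differ from the simple `tf/df` ratio because, with its default settings, scikit-learn uses a smoothed logarithmic IDF, `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit length. The sketch below reproduces that calculation by hand on the same three strings. The variable names are my own, and it assumes the vectorizer's defaults (`smooth_idf=True`, `norm='l2'`).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

counts = CountVectorizer().fit_transform(docs).toarray()  # term frequencies
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)                       # documents containing each term

# smoothed inverse document frequency, matching TfidfVectorizer's defaults
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# weight the counts by IDF, then L2-normalize each row
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, sklearn_tfidf))
```

If the two matrices agree, the check prints `True`, which confirms that the vectorizer is doing nothing more mysterious than weighting and normalizing the counts.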

## Part 7: Applying TF-IDF to Tweet Classification

In [None]:
# Term Frequency (one row per tweet, one column per term)

vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(post.raw_text).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document Frequency (number of tweets each term appears in)

vect = CountVectorizer(binary=True)
df = vect.fit_transform(post.raw_text).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, -1), columns=vect.get_feature_names_out())

In [None]:
# Term Frequency-Inverse Document Frequency (simple version)

tf/df

In [None]:
# TfidfVectorizer

vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(post.raw_text).toarray(), columns=vect.get_feature_names_out())

## Bonus: Adding Features to a Document-Term Matrix

In [None]:
# define X and y

feature_cols = ['raw_text', 'syllables', 'periods', 'hyphens']
X = post[feature_cols]
y = post.username

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# use CountVectorizer with text column only

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.raw_text)
X_test_dtm = vect.transform(X_test.raw_text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)

In [None]:
# shape of the other three feature columns

X_train.drop('raw_text', axis=1).shape

In [None]:
# cast other feature columns to float and convert to a sparse matrix

extra = sp.sparse.csr_matrix(X_train.drop('raw_text', axis=1).astype(float))
extra.shape

In [None]:
# combine sparse matrices

X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape

In [None]:
# repeat for testing set

extra = sp.sparse.csr_matrix(X_test.drop('raw_text', axis=1).astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape

In [None]:
# use logistic regression with text column only

logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# use logistic regression with all features

logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))