## Spooky Author Identification / Logistic Regression
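**Logistic Regression** is a classification algorithm (despite the name) that works well for natural language processing. In its basic form it models a binary outcome (0 or 1) and outputs a probability for it; with several classes, one such model can be fit per class (one-vs-rest) and the most probable class is predicted. Here, it will be used to classify text by author.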
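
To make the one-vs-rest idea concrete, here is a minimal sketch in plain Python. The scores are made-up numbers standing in for the per-class linear scores a fitted model would produce, and `sigmoid` and `ovr_probabilities` are hypothetical helper names, not part of scikit-learn. Normalizing the per-class sigmoid outputs so they sum to 1 is one common way to turn one-vs-rest scores into class probabilities.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def ovr_probabilities(scores):
    """Turn one score per class into probabilities that sum to 1.

    Each class gets its own binary (class vs. rest) probability,
    and the results are rescaled by their total.
    """
    raw = [sigmoid(s) for s in scores]
    total = sum(raw)
    return [p / total for p in raw]


# Made-up scores for three classes (assumption, for illustration only).
scores = [2.0, -1.0, 0.0]
probs = ovr_probabilities(scores)
predicted = probs.index(max(probs))  # index of the most probable class

print(probs)
print(predicted)
```

The predicted class is simply the one with the largest probability, which is the class with the largest score.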

In [38]:
import pandas as pd
import numpy as np

# plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# sklearn
from sklearn.model_selection import train_test_split              # train/test split (sklearn.cross_validation was removed)
from sklearn.preprocessing import label_binarize                  # binarize y
from sklearn.feature_extraction.text import CountVectorizer       # vectorizer
from sklearn.feature_extraction.text import TfidfTransformer      # tf-idf weighting
from sklearn.linear_model import LogisticRegression               # classifier
from sklearn.model_selection import GridSearchCV                  # parameter tuning
from sklearn.pipeline import Pipeline                             # pipeline
from sklearn import metrics                                       # metrics

# other modules
from stop_words import get_stop_words
import string
from pprint import pprint


# Read training texts: texts
texts = pd.read_csv('train.csv')

# Read testing texts: tests
tests = pd.read_csv('test.csv')

In [3]:
X = texts.text
y = texts.author

In [5]:
vect = CountVectorizer()

In [6]:
X_train_dtm = vect.fit_transform(X)

In [7]:
X_test_dtm = vect.transform(tests.text)
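
The difference between `fit_transform` and `transform` is easier to see on a tiny example. The sketch below uses made-up sentences (not taken from the competition data) and hypothetical variable names; it shows that the vocabulary is learned once from the training documents and that words unseen during fitting are silently dropped when new documents are transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy sentences, not taken from train.csv or test.csv.
train_docs = ["the raven flew", "the monster woke"]
new_docs = ["the raven woke screaming"]

toy_vect = CountVectorizer()
train_dtm = toy_vect.fit_transform(train_docs)  # learn the vocabulary, then count
new_dtm = toy_vect.transform(new_docs)          # count using the learned vocabulary only

vocab = sorted(toy_vect.vocabulary_)            # column order of the matrices
print(vocab)
print(train_dtm.toarray())
print(new_dtm.toarray())                        # "screaming" has no column
```

Both matrices have the same number of columns, which is what lets a model fitted on one be applied to the other.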

In [8]:
logreg = LogisticRegression()

In [9]:
%time logreg.fit(X_train_dtm, y)

CPU times: user 7.43 s, sys: 258 ms, total: 7.69 s
Wall time: 2.05 s


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [16]:
# predict the author for each training text (same data the model was fit on)
y_pred_class = logreg.predict(X_train_dtm)

In [50]:
# calculate predicted probabilities for X_train_dtm (well-calibrated), one column per author
y_pred_prob = logreg.predict_proba(X_train_dtm)
y_pred_prob

array([[  9.95874162e-01,   3.93395999e-03,   1.91877519e-04],
       [  4.41999134e-01,   5.11737135e-01,   4.62637317e-02],
       [  8.52544292e-01,   1.46206283e-01,   1.24942556e-03],
       ..., 
       [  9.29914244e-01,   4.63423411e-02,   2.37434149e-02],
       [  7.47753813e-01,   8.41549851e-02,   1.68091202e-01],
       [  9.06674984e-02,   8.67555406e-01,   4.17770952e-02]])

In [18]:
# calculate accuracy on the training data (optimistic: the model has already seen these texts)
metrics.accuracy_score(y, y_pred_class)

0.96828234332703411
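
The score above is measured on the same texts the model was fit on, so it likely overstates how well the model generalizes. A minimal sketch of a held-out check with `train_test_split` follows. It runs on a made-up toy corpus so that it is self-contained; the words, the `author_a`/`author_b`/`author_c` labels, and all variable names are assumptions for illustration. In the notebook, `X` and `y` would take the place of `toy_texts` and `toy_labels`.

```python
from itertools import combinations

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up toy corpus: three hypothetical authors with distinct vocabularies.
toy_words = {
    "author_a": ["raven", "midnight", "chamber", "nevermore"],
    "author_b": ["cosmic", "tentacle", "abyss", "cult"],
    "author_c": ["creature", "glacier", "laboratory", "grief"],
}
toy_texts, toy_labels = [], []
for label, vocab in toy_words.items():
    for pair in combinations(vocab, 2):  # 6 two-word "documents" per author
        toy_texts.append(" ".join(pair))
        toy_labels.append(label)

# Hold out 6 documents, keeping the author proportions equal in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_texts, toy_labels, test_size=6, stratify=toy_labels, random_state=0
)

# Fit the vectorizer on the fitting part only, then reuse it for validation.
val_vect = CountVectorizer()
X_fit_dtm = val_vect.fit_transform(X_fit)
X_val_dtm = val_vect.transform(X_val)

val_model = LogisticRegression()
val_model.fit(X_fit_dtm, y_fit)

val_accuracy = metrics.accuracy_score(y_val, val_model.predict(X_val_dtm))
print(val_accuracy)
```

Fitting the vectorizer only on the fitting part keeps the validation texts fully unseen, vocabulary included.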

In [46]:
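# NOTE: if y holds author names as strings (as the error below suggests),
# classes=[0,1,2] matches no label, so every column of y_test is all zeros
y_test = label_binarize(y, classes=[0,1,2])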

In [49]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
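
This error is what `roc_auc_score` raises when a column of `y_true` contains a single value. One plausible cause here is that the `classes` passed to `label_binarize` do not match the actual labels in `y`. A minimal sketch of a fix, assuming the labels are strings, is to binarize with the fitted model's own `classes_`, which also guarantees the columns line up with those of `predict_proba`. The toy corpus and all variable names below are made up so the example is self-contained.

```python
from itertools import combinations

from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import label_binarize

# Made-up toy corpus: three hypothetical authors with distinct vocabularies.
toy_words = {
    "author_a": ["raven", "midnight", "chamber", "nevermore"],
    "author_b": ["cosmic", "tentacle", "abyss", "cult"],
    "author_c": ["creature", "glacier", "laboratory", "grief"],
}
auc_texts, auc_labels = [], []
for label, vocab in toy_words.items():
    for pair in combinations(vocab, 2):
        auc_texts.append(" ".join(pair))
        auc_labels.append(label)

auc_dtm = CountVectorizer().fit_transform(auc_texts)
auc_model = LogisticRegression().fit(auc_dtm, auc_labels)
auc_prob = auc_model.predict_proba(auc_dtm)  # columns follow auc_model.classes_

# Binarize with the real class labels, in the same order as the probability columns.
y_bin = label_binarize(auc_labels, classes=auc_model.classes_)

# With an indicator matrix as y_true, the per-class AUCs are macro-averaged.
auc = metrics.roc_auc_score(y_bin, auc_prob)
print(auc)
```

In the notebook, the equivalent change would be `label_binarize(y, classes=logreg.classes_)`. As with the accuracy above, an AUC computed on the training data is an optimistic estimate.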