# Basic Poe vs. Austen

This notebook is a very basic NLP exercise that builds a model to discriminate between Edgar Allan Poe and Jane Austen. The corpora are excerpts from *The Masque of the Red Death* and *Pride and Prejudice*.

In [1]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('./author_comparison.csv')

In [3]:
X = df['text']
y = df['author']

In [4]:
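# build a bag-of-words vocabulary, dropping common English stop words
cv = CountVectorizer(stop_words='english')

# learn the vocabulary, then encode each text as a sparse row of term counts
cv.fit(X)
Xs = cv.transform(X)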

In [5]:
# note: scikit-learn 1.0 added get_feature_names_out(), and 1.2 removed get_feature_names()
pd.DataFrame(Xs.todense(), columns=cv.get_feature_names()).head()

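| | account | acknowledged | affect | agreed | aloft | answer | apartment | approach | approached | arrest | ... | want | way | week | went | white | wife | wild | yard | year | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |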
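
To make the document-term matrix above easier to read, here is a minimal sketch of the same vectorizing step on two made-up sentences. The sentences and the `toy_` names are illustrative assumptions and are not taken from `author_comparison.csv`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented sentences (hypothetical data, not from the notebook's CSV)
toy_docs = [
    "The red death came to the prince",
    "Mr Bennet was the prince of patience",
]

# Same settings as the notebook: lowercase tokens, English stop words removed
toy_cv = CountVectorizer(stop_words='english')
toy_counts = toy_cv.fit_transform(toy_docs)

# vocabulary_ maps each kept token to its column index;
# sorting the tokens reproduces the column order of the matrix
vocab = sorted(toy_cv.vocabulary_)
print(vocab)

# One row per sentence, one column per vocabulary term
print(toy_counts.toarray())
```

Each row is a sentence and each column counts one surviving term, so a stop word such as "the" gets no column while "prince" gets a 1 in both rows.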


In [6]:
lr = LogisticRegression()
lr.fit(Xs, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [7]:
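# accuracy on the same rows the model was fit on, so this is a training score,
# not an estimate of how well the model generalizes
lr.score(Xs, y)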

1.0
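
The 1.0 above is measured on the sentences the model was trained on. A minimal sketch of scoring on held-out sentences instead might look like the following. The eight sentences are invented stand-ins for the CSV, and the pipeline, split size, and `random_state` are assumptions rather than anything the notebook does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented stand-in corpus (hypothetical, not the notebook's data)
texts = [
    "The red death crept through the dark abbey halls",
    "The prince locked the iron gates against the plague",
    "Flames flickered on the tripods in the black chamber",
    "The masked figure stood silent beside the ebony clock",
    "Mr Bennet wished his daughters would marry well",
    "A single gentleman of fortune must want a wife",
    "Mrs Bennet invited the new neighbour to dinner",
    "The sisters walked to town to hear the news",
]
labels = ["Poe"] * 4 + ["Austen"] * 4

# Hold out a quarter of the sentences, keeping both authors in each split
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vocabulary on the training sentences only,
# so no test-set words leak into the features
model = make_pipeline(CountVectorizer(stop_words='english'), LogisticRegression())
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

With a corpus this small the held-out number is noisy, but comparing it with the training score is what shows whether a perfect fit reflects memorization.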

In [8]:
# pair each vocabulary term (the 'coefs' column) with its fitted coefficient (the 'values' column)
coefs_df = pd.DataFrame({'coefs': cv.get_feature_names(), 'values': lr.coef_[0]})
coefs_df.head()

| | coefs | values |
|---|---|---|
| 0 | account | 0.128269 |
| 1 | acknowledged | 0.183673 |
| 2 | affect | -0.313142 |
| 3 | agreed | -0.031972 |
| 4 | aloft | 0.118884 |


In [9]:
# sorting by absolute value
coefs_df['values_abs'] = abs(coefs_df['values'])

In [10]:
coefs_df.sort_values('values_abs', ascending=False).head(20)

| | coefs | values | values_abs |
|---|---|---|---|
| 45 | death | 0.564314 | 0.564314 |
| 132 | mr | -0.507528 | 0.507528 |
| 153 | prince | 0.48372 | 0.48372 |
| 15 | bennet | -0.475556 | 0.475556 |
| 161 | red | 0.45498 | 0.45498 |
| 184 | single | -0.454779 | 0.454779 |
| 223 | want | -0.411261 | 0.411261 |
| 114 | like | 0.361284 | 0.361284 |
| 209 | tripods | 0.355619 | 0.355619 |
| 75 | flames | 0.355619 | 0.355619 |


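Positive coefficients here point toward Poe and negative ones toward Austen: sentences mentioning "death", "prince", or "red" are probably from Poe, while sentences including "Mr." or "Bennet" are probably from Austen. Neat!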
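
Which author a positive coefficient favors depends on how the labels are ordered, and `lr.classes_` makes that explicit. The sketch below checks it on four invented sentences. The data and the label strings `"Poe"` and `"Austen"` are assumptions, since the notebook does not show the values in `df['author']`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Invented sentences and assumed label strings (hypothetical data)
texts = [
    "The red death held the prince",
    "Death came in red robes",
    "Mr Bennet teased his wife",
    "Mr Bennet wanted a quiet library",
]
labels = ["Poe", "Poe", "Austen", "Austen"]

cv = CountVectorizer(stop_words='english')
Xs = cv.fit_transform(texts)
lr = LogisticRegression().fit(Xs, labels)

# In a binary problem, coef_[0] scores classes_[1]:
# positive weights favor classes_[1], negative weights favor classes_[0]
negative_class, positive_class = lr.classes_
print(f"positive -> {positive_class}, negative -> {negative_class}")

# Recover the terms in column order and line them up with their weights
terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
coefs = pd.Series(lr.coef_[0], index=terms).sort_values()
print(coefs)
```

Labels are sorted alphabetically, so with these strings "Poe" lands in `classes_[1]` and terms that appear only in the Poe sentences get positive weights.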