# Goal
Predict the author of a quote using term counts or TF-IDF features and a Naive Bayes classifier.

### References
https://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

https://github.com/alvations/Quotables

In [5]:
import io
import numpy as np
import pandas as pd
import requests

header = ['author', 'text']
url = 'https://raw.githubusercontent.com/alvations/Quotables/master/author-quote.txt'
text = requests.get(url).content
df = pd.read_csv(io.StringIO(text.decode('utf-8')), delimiter='\t', header=None, names=header)

df.describe()

| | author | text |
|---|---|---|
| count | 36165 | 36165 |
| unique | 2297 | 36159 |
| top | Henri Nouwen | If you can dream it, you can do it. |
| freq | 25 | 2 |


Interesting that there are 6 duplicate quotes (count: 36165 - unique: 36159). I wonder if those were from the same author, or if multiple authors are credited with those quotes.

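Also, good motivation for this project in that most frequent quote. :) (Keep in mind that `describe()` reports `top` for each column separately, so the quote is not necessarily Henri Nouwen's.)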
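The question about the duplicated quotes could be answered with a small helper like the one below. This is only a sketch: `authors_by_duplicate_quote` is a name I made up, and the sample rows are placeholders standing in for `zip(df.author, df.text)`.

```python
from collections import defaultdict


def authors_by_duplicate_quote(pairs):
    """Map each quote that appears more than once to the set of authors credited with it."""
    authors = defaultdict(set)
    counts = defaultdict(int)
    for author, text in pairs:
        authors[text].add(author)
        counts[text] += 1
    return {text: authors[text] for text in counts if counts[text] > 1}


# Made-up rows; with the real data, pass zip(df.author, df.text) instead.
sample = [
    ('Author A', 'Quote one.'),
    ('Author B', 'Quote one.'),
    ('Author C', 'Quote two.'),
    ('Author C', 'Quote two.'),
    ('Author D', 'Quote three.'),
]

duplicates = authors_by_duplicate_quote(sample)
for text, credited in sorted(duplicates.items()):
    print('{0} -> {1}'.format(text, sorted(credited)))
```

A duplicate with more than one credited author is a quote attributed to several people; a duplicate with a single author is just a repeated row.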

In [46]:
# Add a numeric column representing the author
df.author = pd.Categorical(df.author)
df['author_code'] = df.author.cat.codes
df.head()

| | author | text | author_code |
|---|---|---|---|
| 0 | A. A. Milne | If you live to be a hundred, I want to live to... | 0 |
| 1 | A. A. Milne | Promise me you'll always remember: You're brav... | 0 |
| 2 | A. A. Milne | Did you ever stop to think, and forget to star... | 0 |
| 3 | A. A. Milne | Organizing is what you do before you do someth... | 0 |
| 4 | A. A. Milne | Weeds are flowers too, once you get to know them. | 0 |


Now it's time for some fun!

### CountVectorizer

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')  # , max_features=10

In [31]:
X = cv.fit_transform(df.text).toarray()
print(X)

In [73]:
print(X.shape)

(36165, 26215)


I want to validate that the X matrix indeed contains the correct tokens.

In [74]:
first_quote_word_indexes = np.nonzero(X[0])[0].tolist()
all_features = cv.get_feature_names()
first_quote_words = [all_features[i] for i in first_quote_word_indexes]

print(df.text[0])
print(first_quote_words)

If you live to be a hundred, I want to live to be a hundred minus one day so I never have to live without you.
[u'day', u'live', u'minus', u'want']


Works for me. Onward!

In [47]:
y = df.author_code.values
print(y)

[   0    0    0 ..., 2295 2295 2295]


In [76]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[1960    9 2152 ..., 1948 1358  165]
[ 896  270 1055 ..., 1562 1054 2234]


In [77]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.0516478655165


Okay, I might need to rethink this. I got an accuracy of `0.0516478655165`. Here are some things I think I'll try:
* `TfidfVectorizer` instead of `CountVectorizer`
* Use a one-record-per-author approach by concatenating all quotes into one quote-set per author
* Dimensionality reduction
* N-gram tweaking

I'll try them in that order.
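To put the `0.0516` figure in context, it may help to compare it with two naive baselines taken from the `describe()` output above: guessing uniformly among the 2,297 authors, and always guessing the most frequent author (25 of 36,165 quotes). This is back-of-the-envelope arithmetic only, and the majority-class number assumes the test split looks like the full data set.

```python
# Figures copied from the describe() output and the accuracy printed above
n_authors = 2297
n_quotes = 36165
top_author_count = 25
accuracy = 0.0516478655165

# Expected accuracy when picking one of the authors uniformly at random
uniform_baseline = 1.0 / n_authors

# Expected accuracy when always predicting the most frequent author
majority_baseline = float(top_author_count) / n_quotes

print('uniform guess:     {0:.6f}'.format(uniform_baseline))
print('majority guess:    {0:.6f}'.format(majority_baseline))
print('lift over uniform: {0:.1f}x'.format(accuracy / uniform_baseline))
```

By that yardstick the classifier is far above chance, even though the absolute accuracy is low.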

### TfidfVectorizer

In [115]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(stop_words='english')  # pass max_features to limit features while testing

In [116]:
X = tv.fit_transform(df.text).toarray()
print(X)

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


In [120]:
print(X.shape)

(36165, 26215)


In [121]:
first_quote_word_indexes = np.nonzero(X[0])[0].tolist()
all_features = tv.get_feature_names()
first_quote_words = [all_features[i] for i in first_quote_word_indexes]

print(df.text[0])
print(first_quote_words)

If you live to be a hundred, I want to live to be a hundred minus one day so I never have to live without you.
[u'day', u'live', u'minus', u'want']


In [124]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
[1960    9 2152 ..., 1948 1358  165]
[ 896  270 1055 ..., 1562 1054 2234]


In [94]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_train, y_train)
# y_pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(accuracy)

0.0669099756691
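Both runs above use `GaussianNB`, which needs the `.toarray()` call to get a dense matrix. A variant I did not try here is `MultinomialNB`, which is designed for count-like features and accepts the sparse matrix from the vectorizer directly. The sketch below uses made-up quotes and the placeholder labels `author_a` and `author_b`; I have not run it on the quote data, so it says nothing about the accuracy it would reach there.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data: two placeholder authors with distinct vocabularies
train_quotes = [
    'honey bees hum',
    'honey pot sweet',
    'stars planets orbit',
    'planets galaxies stars',
]
train_authors = ['author_a', 'author_a', 'author_b', 'author_b']

# The pipeline keeps the term counts sparse from vectorizer to classifier
model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(train_quotes, train_authors)

predicted = model.predict(['sweet honey', 'orbit of the planets'])
print(predicted)
```

Skipping the dense conversion also avoids holding a 36,165 by 26,215 array in memory.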


### One Record Per Author

In [108]:
df_squashed = df.groupby('author_code').apply(lambda x: x.sum())
df_squashed.head()

| author_code (index) | author | text | author_code |
|---|---|---|---|
| 0 | A. A. MilneA. A. MilneA. A. MilneA. A. MilneA.... | If you live to be a hundred, I want to live to... | 0 |
| 1 | A. J. JacobsA. J. JacobsA. J. JacobsA. J. Jaco... | The key to making healthy decisions is to resp... | 10 |
| 2 | A. P. J. Abdul KalamA. P. J. Abdul KalamA. P. ... | You have to dream before your dreams can come ... | 50 |
| 3 | AaliyahAaliyahAaliyahAaliyahAaliyahAaliyahAali... | I stay true to myself and my style, and I am a... | 75 |
| 4 | Aaron EckhartAaron EckhartAaron EckhartAaron E... | I often feel that my days in New York City, th... | 100 |


In [109]:
cv = CountVectorizer(stop_words='english')  # , max_features=10

In [110]:
X_squashed = cv.fit_transform(df_squashed.text).toarray()
print(X_squashed)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 1 0 ..., 0 0 0]]


In [111]:
print(X_squashed.shape)

(2297, 26215)


In [112]:
first_quote_word_indexes = np.nonzero(X_squashed[0])[0].tolist()
all_features = cv.get_feature_names()
first_quote_words = [all_features[i] for i in first_quote_word_indexes]

print(df_squashed.text[0])
print(first_quote_words)

If you live to be a hundred, I want to live to be a hundred minus one day so I never have to live without you.Promise me you'll always remember: You're braver than you believe, and stronger than you seem, and smarter than you think.Did you ever stop to think, and forget to start again?Organizing is what you do before you do something, so that when you do it, it is not all mixed up.Weeds are flowers too, once you get to know them.You can't stay in your corner of the forest waiting for others to come to you. You have to go to them sometimes.The third-rate mind is only happy when it is thinking with the majority. The second-rate mind is only happy when it is thinking with the minority. The first-rate mind is only happy when it is thinking.Bores can be divided into two classes; those who have their own particular subject, and those who do not need a subject.What I say is that, if a fellow really likes potatoes, he must be a pretty decent sort of fellow.My spelling is Wobbly. It's good spel

In [113]:
y_squashed = df_squashed.author_code.values
print(y_squashed)

[    0    10    50 ..., 57350 57375 57400]


In [114]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_squashed_train, X_squashed_test, y_squashed_train, y_squashed_test = train_test_split(X_squashed, y_squashed, test_size=0.25, random_state=0)
print(X_squashed_train)
print(X_squashed_test)
print(y_squashed_train)
print(y_squashed_test)

[[0 2 0 ..., 0 0 1]
 [0 0 0 ..., 0 0 0]
 [0 1 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
[19975  1716 37275 ..., 19075 12525 41325]
[14240 52600  1960  5985  4077 37550 41400 56425  1587  5010  1456   746
  1506  2064 15475  1512 31600 40325  1364  5551 22500 10450  2212  6112
 55050  1325 41700 38350  3300 32675  9630 15225  2230 39175 51550 10550
  7721 48775 10290  8134   628 23950 52575 21760 25560 49825 41075  2832
   372 25441  8325 30900 15192 46625   970 17825 12925  7752 37025 19760
  5265 24525 11925 34500 11950   250 17136  3078 13400  6216 10458 24960
  7245  1158 34272 11389 24350 38200 22950  2838 17347 52150 55325 11748
 25950 17784  6363 35525 39825  8000 11075   831 24045 10890 56075   429
  1953 40750  4230  6984 16900 30260 32800  1750 11875  5454 32025  3310
  7025  4465  8440  9253  2386  6125 42450 2

In [106]:
clf = GaussianNB()
clf.fit(X_squashed_train, y_squashed_train)
accuracy = clf.score(X_squashed_test, y_squashed_test)
print(accuracy)

0.0


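Well, that's good to know, and it makes sense now. With one record per author, every author in the test set is missing from the training set, so the classifier can never predict the right label and an accuracy of 0.0 is guaranteed. (The `sum()` also added up `author_code` within each group, which is why the labels above read 0, 10, 50, ... instead of 0, 1, 2, ...) Fewer records meant a much faster fit but nothing to learn from. I'm not even going to worry about using tfidf with this approach. Dimensionality reduction is next!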
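The 0.0 result can be reproduced without any text at all. When every label occurs exactly once, a train/test split leaves no label in common between the two sides, and a classifier can only predict labels it saw during `fit`. A minimal sketch, using a plain shuffle as a stand-in for `train_test_split`:

```python
import random

n_authors = 2297

# One label per author, as in the squashed data set
labels = list(range(n_authors))
rng = random.Random(0)
rng.shuffle(labels)

# Hold out 25% of the rows, mirroring test_size=0.25
n_test = int(round(0.25 * n_authors))
test_labels = set(labels[:n_test])
train_labels = set(labels[n_test:])

# Labels that appear on both sides of the split
shared = test_labels & train_labels
print(len(shared))  # prints 0
```

With no shared labels, no prediction on the test rows can ever be correct.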

### Dimensionality Reduction: PCA

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=None)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_

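Yeah, don't do that. It brought my MacBook to its knees. With `n_components=None`, PCA keeps every component, so it has to decompose the full dense 27,123 by 26,215 training matrix.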
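A lighter option that I did not try here is `TruncatedSVD`, which works on the sparse matrix from the vectorizer without densifying it and only computes the number of components asked for. The sketch below uses made-up quotes, and `n_components=2` is a placeholder; a sensible value for the real data would need experimenting.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up quotes standing in for df.text
quotes = [
    'honey bees hum',
    'honey pot sweet',
    'stars planets orbit',
    'planets galaxies stars',
]

# Keep the TF-IDF matrix sparse: no .toarray() call
tfidf = TfidfVectorizer(stop_words='english')
X_sparse = tfidf.fit_transform(quotes)

# Reduce to a small, fixed number of components
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

print(X_reduced.shape)
print(svd.explained_variance_ratio_)
```

As with PCA, the reducer would be fit on the training rows only and then applied to the test rows with `svd.transform`.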

### Dimensionality Reduction: LDA

In [28]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words

# Split on runs of word characters, dropping punctuation
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
en_stop = get_stop_words('en')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()

# Make sure every quote is a unicode string before tokenizing
df['text'] = df['text'].astype('unicode')
print(df.dtypes)

author    object
text      object
dtype: object


In [29]:
# list for tokenized documents in loop
texts = []

# loop through document list
for quote in df.text:

    # clean and tokenize document string
    raw = quote.lower()
    tokens = tokenizer.tokenize(raw)

    # remove stop words from tokens
    stopped_tokens = [token for token in tokens if token not in en_stop]

    # stemming is disabled for now; to enable it, use
    # [p_stemmer.stem(token) for token in stopped_tokens]
    stemmed_tokens = stopped_tokens

    # add tokens to list
    texts.append(stemmed_tokens)

print(texts)



In [30]:
from gensim import corpora, models

# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# for i in range(0, len(dictionary)):
#     print('{0}: {1}'.format(i, dictionary[i]))

In [31]:
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus)

[[(0, 1), (1, 1), (2, 3), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1)], [(8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1)], [(18, 1), (19, 1), (20, 1), (21, 1), (22, 1)], [(23, 1), (24, 1), (25, 1)], [(26, 1), (27, 1), (28, 1), (29, 1)], [(30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1)], [(40, 1), (41, 1), (42, 3), (43, 3), (44, 3), (45, 1), (46, 1), (47, 3), (48, 1)], [(30, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2)], [(56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1)], [(29, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 2)], [(73, 1), (74, 1), (75, 1), (76, 1)], [(1, 2), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1)], [(34, 1), (83, 1), (84, 1), (85, 1), (86, 1)], [(0, 2), (21, 1), (87, 1), (88, 1), (89, 1)], [(30, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 1)], [(1, 2), (98, 1), (99, 1), 

In [32]:
# generate LDA model
ldamodel = models.ldamodel.LdaModel(corpus, num_topics=100, id2word = dictionary, passes=20)

print(ldamodel)

LdaModel(num_terms=26428, num_topics=100, decay=0.5, chunksize=2000)


In [40]:
print(ldamodel.print_topics(num_topics=100, num_words=5))

[(0, u'0.123*"let" + 0.090*"country" + 0.070*"reason" + 0.066*"pretty" + 0.061*"high"'), (1, u'0.180*"put" + 0.180*"look" + 0.029*"created" + 0.028*"s" + 0.027*"people"'), (2, u'0.195*"around" + 0.065*"songs" + 0.042*"reading" + 0.039*"dare" + 0.029*"happened"'), (3, u'0.188*"person" + 0.036*"trust" + 0.035*"front" + 0.035*"internet" + 0.034*"lie"'), (4, u'0.079*"history" + 0.064*"set" + 0.047*"looks" + 0.046*"s" + 0.043*"president"'), (5, u'0.162*"way" + 0.142*"something" + 0.126*"s" + 0.057*"trying" + 0.036*"people"'), (6, u'0.096*"society" + 0.077*"lost" + 0.052*"unless" + 0.052*"strength" + 0.038*"nation"'), (7, u'0.135*"part" + 0.080*"film" + 0.070*"probably" + 0.043*"behind" + 0.039*"food"'), (8, u'0.209*"years" + 0.059*"comedy" + 0.050*"five" + 0.049*"four" + 0.043*"fashion"'), (9, u'0.137*"mean" + 0.062*"house" + 0.060*"stories" + 0.056*"action" + 0.046*"writer"'), (10, u'0.053*"like" + 0.050*"lies" + 0.036*"market" + 0.034*"emotions" + 0.033*"feels"'), (11, u'0.071*"middle" + 

In [62]:
X_lda = []
for doc in corpus:
    doc_tpc = ldamodel.get_document_topics(doc)
    doc_rel = []
    for i in range(100):
        try:
            doc_rel.append(doc_tpc[i][1])
        except IndexError:
            doc_rel.append(0)
    X_lda.append(doc_rel)
X_lda = np.array(X_lda)
print(X_lda)

[[ 0.08416667  0.08416667  0.75083333 ...,  0.          0.          0.        ]
 [ 0.15005825  0.40990341  0.08948857 ...,  0.          0.          0.        ]
 [ 0.16833333  0.20065486  0.30267848 ...,  0.          0.          0.        ]
 ..., 
 [ 0.09040507  0.15164494  0.09929388 ...,  0.          0.          0.        ]
 [ 0.50027628  0.12722372  0.12625    ...,  0.          0.          0.        ]
 [ 0.24555204  0.09181818  0.09369638 ...,  0.          0.          0.        ]]


In [58]:
# Use a separate name so the X_lda matrix is not overwritten
first_doc_topics = ldamodel[corpus[0]]
print(first_doc_topics)

[(50, 0.084166666666666681), (90, 0.084166666666666667), (99, 0.75083333333333346)]
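Comparing the two outputs above, something looks off: the first row of the dense matrix holds 0.084, 0.084 and 0.751 in columns 0 to 2, while the sparse result for the same quote assigns those weights to topics 50, 90 and 99. The loop appears to index the `(topic_id, probability)` pairs by position rather than by topic id. A minimal sketch of the mapping I think was intended, where `to_dense_topic_vector` is a helper name I made up:

```python
def to_dense_topic_vector(doc_topics, num_topics):
    """Place each (topic_id, probability) pair at index topic_id; other topics stay 0.0."""
    dense = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        dense[topic_id] = prob
    return dense


# Sparse topics for the first quote, copied from the output above
sparse_topics = [(50, 0.084166666666666681),
                 (90, 0.084166666666666667),
                 (99, 0.75083333333333346)]

vector = to_dense_topic_vector(sparse_topics, num_topics=100)
print('{0} {1} {2}'.format(vector[50], vector[90], vector[99]))
```

Applied to the whole corpus, this would be `np.array([to_dense_topic_vector(ldamodel.get_document_topics(doc), 100) for doc in corpus])`. I have not re-run the classifier with this change, so the accuracy below still reflects the original features.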


In [64]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_lda_train, X_lda_test, y_lda_train, y_lda_test = train_test_split(X_lda, y, test_size=0.25, random_state=0)
print(X_lda_train.shape)
print(X_lda_test.shape)
print(y_lda_train.shape)
print(y_lda_test.shape)

(27123, 100)
(9042, 100)
(27123,)
(9042,)


In [65]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X_lda_train, y_lda_train)
accuracy = clf.score(X_lda_test, y_lda_test)
print(accuracy)

0.000221190002212


Well, obviously I have a lot to learn about predicting the author of a quote. But I have learned a great deal in this process.

I expect to come up short sometimes as I try to apply machine learning to data in the wild, but I will continue looking for non-tutorial data sets to play with so I can get better experience.

I decided not to continue with this project. I'll experiment with n-grams some other time.

I think my next area of exploration will be document and query similarity. I'll start here: https://radimrehurek.com/gensim/tutorial.html and then go into tfidf and cosine similarity.