# Spooky Authors Identification

## Code to identify spooky authors for the Kaggle challenge

Competition page: https://www.kaggle.com/c/spooky-author-identification

In [1]:
import warnings
warnings.filterwarnings("ignore")

import sys
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth',10000)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing
from keras.models import Sequential
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.layers import Dense

Using TensorFlow backend.


Load the training data from the "train.csv" file, available at: https://www.kaggle.com/c/spooky-author-identification/data

In [2]:
print('Reading training data')
dataframe_train = pd.read_csv('train.csv').fillna(0)
print('Finished reading training data')
dataframe_train.info()


Reading training data
Finished reading training data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19579 entries, 0 to 19578
Data columns (total 3 columns):
id        19579 non-null object
text      19579 non-null object
author    19579 non-null object
dtypes: object(3)
memory usage: 459.0+ KB


In [12]:
dataframe_train.head(10)


,id,text,author
0,id26305,"This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.",EAP
1,id17569,It never once occurred to me that the fumbling might be a mere mistake.,HPL
2,id11008,"In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.",EAP
3,id27763,"How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.",MWS
4,id12958,"Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.",HPL
5,id22965,"A youth passed in solitude, my best years spent under your gentle and feminine fosterage, has so refined the groundwork of my character that I cannot overcome an intense distaste to the usual brutality exercised on board ship: I have never believed it to be necessary, and when I heard of a mariner equally noted for his kindliness of heart and the respect and obedience paid to him by his crew, I felt myself peculiarly fortunate in being able to secure his services.",MWS
6,id09674,"The astronomer, perhaps, at this point, took refuge in the suggestion of non luminosity; and here analogy was suddenly let fall.",EAP
7,id13515,The surcingle hung in ribands from my body.,EAP
8,id19322,"I knew that you could not say to yourself 'stereotomy' without being brought to think of atomies, and thus of the theories of Epicurus; and since, when we discussed this subject not very long ago, I mentioned to you how singularly, yet with how little notice, the vague guesses of that noble Greek had met with confirmation in the late nebular cosmogony, I felt that you could not avoid casting your eyes upward to the great nebula in Orion, and I certainly expected that you would do so.",EAP
9,id00912,"I confess that neither the structure of languages, nor the code of governments, nor the politics of various states possessed attractions for me.",MWS


In [4]:
def get_top_n_words(corpus, n=None):
    """
    List the top n words in a vocabulary according to occurrence in a text corpus.
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

In [11]:
# Extract most used words
most_common_words = get_top_n_words(dataframe_train.text, n=10)
most_common_words

[('the', 35585),
 ('of', 20955),
 ('and', 17956),
 ('to', 12843),
 ('in', 9458),
 ('was', 6647),
 ('that', 6447),
 ('my', 5418),
 ('it', 4915),
 ('he', 4433)]

In [13]:
# Extract most used words by each author
EAP_words = get_top_n_words(dataframe_train.loc[dataframe_train['author'] == 'EAP'].text, n=10)
HPL_words = get_top_n_words(dataframe_train.loc[dataframe_train['author'] == 'HPL'].text, n=10)
MWS_words = get_top_n_words(dataframe_train.loc[dataframe_train['author'] == 'MWS'].text, n=10)

print(EAP_words)
print(HPL_words)
print(MWS_words)

[('the', 14993), ('of', 8972), ('and', 5735), ('to', 4765), ('in', 4124), ('that', 2333), ('it', 2332), ('was', 2224), ('my', 1788), ('with', 1696)]
[('the', 10933), ('and', 6098), ('of', 5846), ('to', 3249), ('in', 2736), ('was', 2174), ('that', 2022), ('had', 1779), ('he', 1647), ('it', 1402)]
[('the', 9659), ('of', 6137), ('and', 6123), ('to', 4829), ('my', 2659), ('in', 2598), ('was', 2249), ('that', 2092), ('her', 1657), ('his', 1646)]
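The lists above are dominated by function words ("the", "of", "and") that all three authors share, so they say little about style. The following is a minimal, dependency-free sketch of the same counting with common words filtered out. The token pattern is the one CountVectorizer uses by default; the stop-word set and the helper name `top_n_content_words` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: lowercase words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical, deliberately tiny stop-word set (the ten most frequent words above).
STOP_WORDS = {"the", "of", "and", "to", "in", "was", "that", "my", "it", "he"}


def top_n_content_words(corpus, n=10, stop_words=STOP_WORDS):
    """Count lowercase tokens across the corpus, skipping stop words."""
    counts = Counter()
    for text in corpus:
        tokens = TOKEN_PATTERN.findall(text.lower())
        counts.update(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


sample = [
    "The surcingle hung in ribands from my body.",
    "It never once occurred to me that the fumbling might be a mere mistake.",
]
top_words = top_n_content_words(sample, n=3)
print(top_words)
```

With scikit-learn, passing `stop_words='english'` to CountVectorizer should have a similar effect using its built-in English list.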


Create the training samples and labels, vectorizing the text with either TfidfVectorizer or CountVectorizer
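To make the difference between the two vectorizers concrete, here is a small pure-Python sketch. It assumes scikit-learn's default TF-IDF settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the helper names are hypothetical and the toy corpus is not taken from the data set.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


def tfidf_vectors(corpus):
    """Return (vocabulary, rows) with smoothed-idf, L2-normalized weights."""
    docs = [Counter(tokenize(text)) for text in corpus]
    vocabulary = sorted(set().union(*docs))
    n_docs = len(docs)
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in doc for doc in docs))) + 1
        for term in vocabulary
    }
    rows = []
    for doc in docs:
        weights = [doc[term] * idf[term] for term in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
vocabulary, rows = tfidf_vectors(toy_corpus)
print(vocabulary)
print([round(w, 3) for w in rows[1]])
```

A count vector would give "the" and "dog" the same value in the second sentence; the TF-IDF row gives "dog" the larger weight because it occurs in fewer documents.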

In [15]:
X = dataframe_train.text.values
y = dataframe_train.author

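# Build the vocabulary and vectorize the sentences
#vectorizer = TfidfVectorizer()
# min_df=1 (the default) keeps every term; newer scikit-learn releases validate
# this parameter and may reject the integer 0
vectorizer = CountVectorizer(min_df=1)

vector_X = vectorizer.fit_transform(X)
# vocabulary_ maps each term to its column index, so its length is the feature count
# (get_feature_names() was removed in scikit-learn 1.2)
features = len(vectorizer.vocabulary_)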
print("Number of features is:", features)
print("The vectorized array looks like:")
print(vector_X.toarray())


Number of features is: 25068
The vectorized array looks like:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
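The printed array shows only zeros because almost every sentence uses a tiny fraction of the vocabulary. Calling `toarray()` also materializes the full dense matrix, which is costly at this size. The arithmetic below is a rough estimate, assuming the counts are stored as 8-byte integers (int64 is CountVectorizer's default dtype).

```python
# Shape reported by the notebook: 19579 sentences x 25068 vocabulary terms.
n_rows, n_features = 19579, 25068

# Assumption: each count is stored as an 8-byte integer (int64).
bytes_per_value = 8

dense_bytes = n_rows * n_features * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"Dense array: {dense_bytes:,} bytes (about {dense_gib:.2f} GiB)")
```

Printing `vector_X.shape` and `vector_X.nnz`, both standard attributes of SciPy sparse matrices, would describe the matrix without allocating the dense copy.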


In [16]:
# Encode and binarize labels for predictions
# Encode
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(["EAP", "HPL", "MWS"])
encoded_y = label_encoder.transform(y)
# Binarize
label_binarizer = preprocessing.LabelBinarizer()
label_binarizer.fit(encoded_y)
binarized_y = label_binarizer.transform(encoded_y)
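For three classes, the two steps above amount to mapping each author to an integer index (in alphabetical order) and then to a one-hot row. The sketch below reproduces that mapping in plain Python; the names `CLASS_TO_INDEX` and `one_hot` are hypothetical and only illustrate the expected shape of `binarized_y`.

```python
# The three author labels, sorted alphabetically as LabelEncoder orders its classes.
CLASSES = sorted(["EAP", "HPL", "MWS"])
CLASS_TO_INDEX = {label: index for index, label in enumerate(CLASSES)}


def one_hot(label):
    """Return a one-hot row with a 1 in the column of the given author."""
    row = [0] * len(CLASSES)
    row[CLASS_TO_INDEX[label]] = 1
    return row


labels = ["EAP", "HPL", "EAP", "MWS"]
encoded = [CLASS_TO_INDEX[label] for label in labels]  # integer codes
binarized = [one_hot(label) for label in labels]       # one row per sentence

print(encoded)
print(binarized)
```

The column order (EAP, HPL, MWS) is the same order the submission file uses later for `predictions[:,0]`, `predictions[:,1]` and `predictions[:,2]`.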


## Use a neural network to classify texts

In [17]:
epochs = 2
batch_size = 100


# Define the Neural Network for predictions
model = Sequential()
model.add(Dense(9, input_dim=vector_X.shape[1], activation='relu'))
model.add(Dense(6, activation='relu'))
model.add(Dense(3, activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit the model on the training set
model.fit(vector_X, binarized_y, epochs=epochs, batch_size=batch_size, verbose=2)


Epoch 1/2
 - 11s - loss: 0.1855
Epoch 2/2
 - 11s - loss: 0.0971


<keras.callbacks.History at 0x1a26b72278>
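The output layer above uses three independent sigmoid units, so the three scores for a sentence are not constrained to sum to 1. A common alternative for single-label, multi-class problems is a softmax output trained with cross-entropy (in Keras, `activation='softmax'` with `loss='categorical_crossentropy'`). The sketch below is not the notebook's model; it only contrasts the two activations on hypothetical logits.

```python
import math


def sigmoid(z):
    """Squash a single logit into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence, in the order EAP, HPL, MWS.
logits = [2.0, 0.5, -1.0]

sigmoid_scores = [sigmoid(z) for z in logits]
softmax_scores = softmax(logits)

print("sigmoid:", [round(s, 3) for s in sigmoid_scores], "sum =", round(sum(sigmoid_scores), 3))
print("softmax:", [round(s, 3) for s in softmax_scores], "sum =", round(sum(softmax_scores), 3))
```

Both activations rank the classes identically here; the difference is that softmax scores can be read directly as a probability distribution over the three authors.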

Load the test set from the "test.csv" file, available at: https://www.kaggle.com/c/spooky-author-identification/data

Predict the author scores for each test sentence and generate the output.csv file for submission

In [18]:
print('Reading test data')
dataframe_test = pd.read_csv('test.csv').fillna(0)
print('Finished reading test data')


X_submission = dataframe_test.text.values
# encode the submission test documents into vectors using the previously fitted vectorizer
vector_X_submission = vectorizer.transform(X_submission)
# make predictions on submission test set
predictions = model.predict(vector_X_submission)

identifications = dataframe_test.id.values
data = {'id': identifications, 'EAP': predictions[:,0], 'HPL': predictions[:,1], 'MWS': predictions[:,2]}
dataframe_submission = pd.DataFrame(data=data)
cols = ['id','EAP','HPL','MWS']
dataframe_submission = dataframe_submission[cols]
dataframe_submission.to_csv('output.csv', index=False)


Reading test data
Finished reading test data
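Before submitting, it can help to check the scores against the competition metric on sentences whose author is known. The sketch below assumes the metric is multi-class log loss with each row rescaled to sum to 1 and probabilities clipped away from 0 and 1; the clipping constant and the function names are assumptions, not taken from the notebook.

```python
import math

EPS = 1e-15  # assumed clipping constant to keep log() finite


def normalize_row(scores):
    """Rescale one row of positive scores so that it sums to 1."""
    total = sum(scores)  # sigmoid outputs are positive, so total > 0
    return [score / total for score in scores]


def multiclass_log_loss(y_true, y_scores):
    """Average negative log-probability assigned to the true class.

    y_true:   one-hot rows, e.g. [1, 0, 0] for EAP
    y_scores: model scores per class, not necessarily summing to 1
    """
    total = 0.0
    for truth, scores in zip(y_true, y_scores):
        probs = normalize_row(scores)
        for t, p in zip(truth, probs):
            p = min(max(p, EPS), 1 - EPS)
            total += t * math.log(p)
    return -total / len(y_true)


# Hypothetical scores for two sentences with known authors (EAP, then MWS).
y_true = [[1, 0, 0], [0, 0, 1]]
y_scores = [[0.8, 0.6, 0.2], [0.1, 0.2, 0.9]]
print(round(multiclass_log_loss(y_true, y_scores), 4))
```

A model that assigns equal scores to all three authors gets a loss of ln(3), about 1.0986, which is a useful baseline when reading the result.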
