Objective: Predict the author of excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and H. P. Lovecraft.

Category: Text Analysis / Natural Language Processing
    
Data Source: https://www.kaggle.com/c/spooky-author-identification

In [1]:
# Print the current working directory
import os
os.getcwd()

'C:\\Users\\Ashoo\\Documents\\R playground\\text-analysis\\scripts\\python'

In [36]:
# Load the required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer as CV
from sklearn.naive_bayes import MultinomialNB

In [17]:
# Set the data path
data_path = r"C:\Users\Ashoo\Documents\R playground\text-analysis\data\kaggle_spooky_authors"
print(data_path)

C:\Users\Ashoo\Documents\R playground\text-analysis\data\kaggle_spooky_authors


In [20]:
# Load the data
training_data = pd.read_csv("C:\\Users\\Ashoo\\Documents\\R playground\\text-analysis\\data\\kaggle_spooky_authors\\train.csv")
testing_data = pd.read_csv("C:\\Users\\Ashoo\\Documents\\R playground\\text-analysis\\data\\kaggle_spooky_authors\\test.csv")
training_data.head()

        id                                               text  author
0  id26305  This process, however, afforded me no means of...     EAP
1  id17569  It never once occurred to me that the fumbling...     HPL
2  id11008  In his left hand was a gold snuff box, from wh...     EAP
3  id27763  How lovely is spring As we looked from Windsor...     MWS
4  id12958  Finding nothing else, not even gold, the Super...     HPL


Map the author labels to integers ("EAP" to 0, "HPL" to 1, and "MWS" to 2) and store them in a new column named "author_num".

Next, we take all the rows of the "text" column and put them in X (a Python variable).

Similarly, we take all the rows of the "author_num" column and put them in y (another Python variable).

In [21]:
# Encode the author labels as integers
training_data['author_num'] = training_data.author.map({'EAP':0, 'HPL':1, 'MWS':2})
# Features (raw text) and target (encoded author)
X = training_data['text']
y = training_data['author_num']
print(X.head())
print(y.head())

0    This process, however, afforded me no means of...
1    It never once occurred to me that the fumbling...
2    In his left hand was a gold snuff box, from wh...
3    How lovely is spring As we looked from Windsor...
4    Finding nothing else, not even gold, the Super...
Name: text, dtype: object
0    0
1    1
2    0
3    2
4    1
Name: author_num, dtype: int64
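The mapping step can be sketched on a tiny made-up frame. The names `toy_df`, `label_map` and `inverse_map` are assumptions for illustration only; the point is that `map` returns NaN for any label missing from the dictionary, and that an inverse dictionary turns the numbers back into author codes.

```python
import pandas as pd

# Toy frame standing in for training_data (rows are made up for illustration)
toy_df = pd.DataFrame({"author": ["EAP", "HPL", "MWS", "EAP"]})

label_map = {"EAP": 0, "HPL": 1, "MWS": 2}
inverse_map = {num: name for name, num in label_map.items()}

toy_df["author_num"] = toy_df["author"].map(label_map)

# map() yields NaN for labels that are not keys of the dictionary,
# so a quick check catches typos in the mapping
all_mapped = toy_df["author_num"].notna().all()

# Decode the integers back into author codes
decoded = toy_df["author_num"].map(inverse_map)
print(all_mapped)
print(decoded.tolist())
```

Decoding is handy later for reading predictions as author codes instead of numbers.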


Now we need to split the labelled data into a training set and a testing set. We train the model on the training set and evaluate it on the testing set.

We split it 70% for training and 30% for testing. The split below takes the first 70% of the rows in file order, without shuffling.

In [22]:
# Index of the first row that goes to the testing set (70% of the rows)
per = int(0.7 * len(X))
X_train = X[:per]
X_test = X[per:]
y_train = y[:per]
y_test = y[per:]
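The slicing above keeps the rows in file order. A common alternative, not what this notebook does, is a shuffled and stratified split with scikit-learn's `train_test_split`. Here is a minimal sketch on made-up data; `X_toy`, `y_toy` and the seed are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y: 10 made-up excerpts with labels 0/1
X_toy = pd.Series([f"sentence number {i}" for i in range(10)])
y_toy = pd.Series([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,     # same 70/30 proportion as above
    shuffle=True,      # mix the rows before splitting
    stratify=y_toy,    # keep class proportions similar in both parts
    random_state=42,   # arbitrary seed, only for reproducibility
)
print(len(X_tr), len(X_te))
```

Stratifying matters most when one author has far fewer excerpts than the others.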

##### Converting text data into numbers, or in other words, `Vectorization`

Computers cannot work with raw text directly; a model only understands numbers. But we have to classify text, so what do we do? We tokenize the text (split it into words) and vectorize it, which stores the count of each word.

In [31]:
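# Toy example
text = ["My name is Sindabad the sailor man"]
# Keep the case as-is; a token is a run of word characters or a comma
toy = CV(lowercase=False, token_pattern=r'\w+|\,')
toy.fit_transform(text)        # learn the vocabulary
print(toy.vocabulary_)         # word -> column index
matrix = toy.transform(text)   # 1 x 7 matrix of word counts
# Counts stored in the first five columns
print(matrix[0,0])
print(matrix[0,1])
print(matrix[0,2])
print(matrix[0,3])
print(matrix[0,4])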

{'My': 0, 'name': 4, 'is': 2, 'Sindabad': 1, 'the': 6, 'sailor': 5, 'man': 3}
1
1
1
1
1
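In the toy example every word occurs once, so every count is 1. A small sketch with two made-up sentences and a repeated word shows the counts more clearly. It uses the same `CountVectorizer` settings as above; the sentences and the names `docs`, `cv` and `dtm` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences; "the" appears twice in each
docs = ["the raven and the bust", "the monster and the maker"]

cv = CountVectorizer(lowercase=False, token_pattern=r'\w+|\,')
dtm = cv.fit_transform(docs)   # sparse document-term matrix, one row per sentence

# Column names in column order (vocabulary_ maps word -> column index)
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
print(columns)
print(dtm.toarray())
```

Each row is one sentence and each column is one word of the vocabulary, so the entry for "the" is 2 in both rows.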


In [30]:
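vect = CV(lowercase=False, token_pattern=r'\w+|\,')
# Learn the vocabulary from all the labelled text (X) and count the words
X_cv = vect.fit_transform(X)
# Count the words of each part with the same vocabulary
X_train_cv = vect.transform(X_train)
X_test_cv = vect.transform(X_test)
print(X_train_cv.shape)   # (number of training excerpts, vocabulary size)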

(13705, 27497)
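The cell above learns the vocabulary from all of X, including the 30% we hold out. A stricter variant, shown here only as a sketch on made-up sentences, fits the vectorizer on the training part alone. Words that appear only in new text are then simply ignored by `transform`, and the number of columns stays the same.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the old house", "the dark house"]   # made-up training text
new_docs = ["the haunted house"]                   # "haunted" is unseen in training

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)   # vocabulary comes from train_docs only
new_counts = vec.transform(new_docs)           # unseen words get no column

print(train_counts.shape, new_counts.shape)
print("haunted" in vec.vocabulary_)
print(new_counts.toarray())
```

With this variant the hold-out score is computed on text the vectorizer has never seen, which is closer to what happens with Kaggle's test set.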


The final step: we give the training data to `clf.fit` and then score the model. Let's check the accuracy on the 30% of rows we held out for testing.

In [37]:
# Train on the 70% split and report the accuracy on the held-out 30%
clf = MultinomialNB()
clf.fit(X_train_cv, y_train)
clf.score(X_test_cv, y_test)

0.8432073544433095
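The number above is the mean accuracy returned by `score`. Since we submit probabilities to Kaggle, it may also be worth looking at the log loss on the held-out rows; I am assuming here that the competition is scored by multi-class log loss, so please check its evaluation page. A minimal sketch on a made-up miniature corpus (all names and sentences are for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, log_loss
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus; 0 and 1 are hypothetical author codes
train_docs = ["raven raven night", "night raven door",
              "monster creature life", "creature life monster"]
train_y = [0, 0, 1, 1]
val_docs = ["raven door", "monster life"]
val_y = [0, 1]

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(train_docs), train_y)
val_counts = vec.transform(val_docs)

# score() is the mean accuracy of predict() against the true labels
acc = accuracy_score(val_y, model.predict(val_counts))
score = model.score(val_counts, val_y)

# Log loss uses the predicted probabilities, so confident mistakes cost more
loss = log_loss(val_y, model.predict_proba(val_counts), labels=model.classes_)
print(acc, score, loss)
```

Accuracy only asks whether the top guess is right; log loss also rewards well-calibrated probabilities.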

Now, let's predict the authors for Kaggle's test set. But first, vectorize the test set with the same fitted vectorizer we used for the training set.

In [38]:
# Reuse the fitted vectorizer; X_test now holds the Kaggle test matrix
X_test = vect.transform(testing_data["text"])

In [39]:
# Refit on all the labelled data, then predict author probabilities for the test set
clf = MultinomialNB()
clf.fit(X_cv, y)
predicted_result = clf.predict_proba(X_test)
predicted_result.shape

(8392, 3)

The result has 8392 rows, one per test excerpt, and 3 columns, one per author, each holding the predicted probability for that author.
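The columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, which is the sorted list of labels seen in training. With the mapping used here (0, 1, 2) that gives EAP, HPL, MWS. A small sketch on a made-up count matrix (the numbers and names are assumptions for illustration):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix: 6 documents, 4 vocabulary terms
counts = np.array([[3, 0, 0, 1],
                   [2, 1, 0, 0],
                   [0, 3, 1, 0],
                   [0, 2, 0, 1],
                   [0, 0, 3, 1],
                   [1, 0, 2, 0]])
labels = np.array([0, 0, 1, 1, 2, 2])   # hypothetical author codes

nb = MultinomialNB().fit(counts, labels)
proba = nb.predict_proba(counts)

print(nb.classes_)                            # column order of proba
print(proba.shape)                            # (documents, classes)
print(np.allclose(proba.sum(axis=1), 1.0))    # each row is a probability distribution
```

Checking `classes_` before naming the submission columns avoids silently swapping two authors.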

In [40]:
# Now we create a result data frame and add the columns necessary for the Kaggle submission
result = pd.DataFrame()
result["id"] = testing_data["id"]
result["EAP"] = predicted_result[:,0]
result["HPL"] = predicted_result[:,1]
result["MWS"] = predicted_result[:,2]
result.head()

        id       EAP           HPL           MWS
0  id02310  0.000924  2.968107e-05     0.9990459
1  id24541       1.0  3.262255e-07  9.234086e-09
2  id00134  0.003193     0.9968065  8.737549e-07
3  id27757  0.920985    0.07901473  4.811097e-07
4  id04081  0.953158   0.005981884    0.04086011


Finally, we submit the result to Kaggle for evaluation.

In [41]:
result.to_csv("TO_SUBMIT.csv", index=False)
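Before uploading, a quick sanity check of the file can save a wasted submission. The sketch below uses a made-up two-row frame and an in-memory buffer instead of a real file; the ids and probabilities are hypothetical, and the checks are only what seems sensible to me, not an official Kaggle validator.

```python
import io

import numpy as np
import pandas as pd

# Made-up stand-in for the result data frame (ids and numbers are hypothetical)
toy_result = pd.DataFrame({
    "id": ["id00001", "id00002"],
    "EAP": [0.7, 0.1],
    "HPL": [0.2, 0.1],
    "MWS": [0.1, 0.8],
})

# Write to an in-memory buffer and read it back, like a round trip through the CSV
buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

columns_ok = list(check.columns) == ["id", "EAP", "HPL", "MWS"]
rows_ok = np.allclose(check[["EAP", "HPL", "MWS"]].sum(axis=1), 1.0)
no_missing = not check.isna().any().any()
print(columns_ok, rows_ok, no_missing)
```

On the real file, the same checks would run on `pd.read_csv("TO_SUBMIT.csv")`.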