# Spooky Author Classification

This notebook demonstrates how to download data using the Kaggle API, create word embeddings, and build a CNN for the [Spooky Author Identification Challenge](https://www.kaggle.com/c/spooky-author-identification).  



## Downloading the Data using the Kaggle API

Install the Kaggle library so that you can use the Kaggle API, then download your credentials file, "kaggle.json", from the Kaggle website.  To download the credentials to your local computer:



1.   Log in to Kaggle.com.  This should take you to your home hub.
2.   Click your profile picture in the top right corner of the home hub.  Select My Account.
3.   In your account settings, find the API section and click the Create New API Token button.  This should automatically download your credentials to your computer.
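
Once the token is downloaded, a quick sanity check can catch a truncated or misnamed file before any API call fails. The sketch below assumes the token is a JSON object with `username` and `key` fields; the helper name `load_kaggle_credentials` and the demo values are hypothetical placeholders, not part of the Kaggle library.

```python
import json
import os
import tempfile

def load_kaggle_credentials(path):
    """Read a Kaggle API token and check that it has the expected fields."""
    with open(path) as f:
        creds = json.load(f)
    # Assumption: the token is a JSON object with "username" and "key".
    missing = [field for field in ("username", "key") if field not in creds]
    if missing:
        raise ValueError("kaggle.json is missing: " + ", ".join(missing))
    return creds

# Demo with a throwaway token; the values are placeholders, not real credentials.
demo_path = os.path.join(tempfile.mkdtemp(), "kaggle.json")
with open(demo_path, "w") as f:
    json.dump({"username": "example-user", "key": "example-key"}, f)

creds = load_kaggle_credentials(demo_path)
print(creds["username"])  # example-user
```

Pointing the helper at the real downloaded file instead of `demo_path` performs the same check on your own token.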



In [2]:
#install Kaggle library
!pip install Kaggle

Collecting Kaggle
  Downloading https://files.pythonhosted.org/packages/83/9b/ac57e15fbb239c6793c8d0b7dfd1a4c4a025eaa9f791b5388a7afb515aed/kaggle-1.5.0.tar.gz (53kB)
Collecting python-slugify (from Kaggle)
  Downloading https://files.pythonhosted.org/packages/00/ad/c778a6df614b6217c30fe80045b365bfa08b5dd3cb02e8b37a6d25126781/python-slugify-1.2.6.tar.gz
Collecting Unidecode>=0.04.16 (from python-slugify->Kaggle)
  Downloading https://files.pythonhosted.org/packages/59/ef/67085e30e8bbcdd76e2f0a4ad8151c13a2c5bce77c85f8cad6e1f16fb141/Unidecode-1.0.22-py2.py3-none-any.

Running the following code prints a link.  Follow it to receive a verification code, then copy the code and paste it into the text box that appears.  The code searches Google Drive for a file named "kaggle.json" (I keep mine in a Drive folder called ".kaggle") and copies it to the location on the Colab machine where the Kaggle API expects to find it:

```
filename = "/root/.kaggle/kaggle.json"
```


In [3]:
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])
filename = "/root/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
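# the with-block closes the file once the download finishes
with io.FileIO(filename, 'wb') as fh:
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while not done:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))
# owner read/write only; the mode must be the octal literal 0o600, not decimal 600
os.chmod(filename, 0o600)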

Download 100%.


Download the files from the Spooky Author Identification Challenge using the Kaggle API.  An easy way to get the correct command is to go to the [data page of the competition](https://www.kaggle.com/c/spooky-author-identification/data) and copy the command located to the right of the data sources.

In [4]:
!kaggle competitions download -c spooky-author-identification

Downloading sample_submission.zip to /content
  0% 0.00/29.4k [00:00<?, ?B/s]
100% 29.4k/29.4k [00:00<00:00, 17.0MB/s]
Downloading test.zip to /content
  0% 0.00/538k [00:00<?, ?B/s]
100% 538k/538k [00:00<00:00, 74.7MB/s]
Downloading train.zip to /content
  0% 0.00/1.26M [00:00<?, ?B/s]
100% 1.26M/1.26M [00:00<00:00, 142MB/s]


Unzip the downloaded files.  

In [0]:
import zipfile
import os
import pandas as pd

currentPath = os.getcwd()

# extract both archives into the working directory;
# the with-block closes each archive after extraction
for name in ("train.zip", "test.zip"):
    with zipfile.ZipFile(os.path.join(currentPath, name), 'r') as zip_ref:
        zip_ref.extractall(currentPath)

In [6]:
train = pd.read_csv("train.csv")
train.head()

|  | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [7]:
len(train)

19579

In [8]:
lines = train['text'].tolist()
lines[0:5]

['This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.',
 'It never once occurred to me that the fumbling might be a mere mistake.',
 'In his left hand was a gold snuff box, from which, as he capered down the hill, cutting all manner of fantastic steps, he took snuff incessantly with an air of the greatest possible self satisfaction.',
 'How lovely is spring As we looked from Windsor Terrace on the sixteen fertile counties spread beneath, speckled by happy cottages and wealthier towns, all looked as in former years, heart cheering and fair.',
 'Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.']

## Create a Vocabulary

The following code maps each word to a unique number.

Excerpts that contain more words than the maximum document length are truncated.  Excerpts that contain fewer words are padded with repeated instances of the PADWORD, a word that is not expected to appear in any excerpt.  Every excerpt therefore ends up containing exactly 200 words.  The following code also produces the "vocab.tsv" file, which maps each unique word to a number.
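
To make the truncate-or-pad rule concrete, here is a minimal pure-Python sketch. It is a simplification and not the `VocabularyProcessor` implementation: it splits on whitespace only, has no `<UNK>` entry, and uses a maximum length of 5 instead of 200. The helper names `build_vocab` and `encode` are hypothetical.

```python
PADWORD = 'ASDFG'   # same pad token as the notebook
MAX_LEN = 5         # small value for illustration; the notebook uses 200

def build_vocab(texts):
    """Assign ids in order of first appearance; 0 is reserved for the pad word."""
    vocab = {PADWORD: 0}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_len=MAX_LEN):
    """Map words to ids, then truncate or pad with 0 to exactly max_len."""
    ids = [vocab[w] for w in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

texts = ["the raven", "the house of usher fell down"]
vocab = build_vocab(texts)
print(encode(texts[0], vocab))  # [1, 2, 0, 0, 0]
print(encode(texts[1], vocab))  # [1, 3, 4, 5, 6]
```

The short excerpt is padded with the id of the pad word, while the long excerpt loses its sixth word, so both rows have the same length.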


In [9]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import shutil
import tensorflow as tf
import tensorflow.contrib.learn as tflearn
import tensorflow.contrib.layers as tflayers
from tensorflow.contrib.learn.python.learn import learn_runner
import tensorflow.contrib.metrics as metrics
from tensorflow.python.platform import gfile
from tensorflow.contrib import lookup

tf.logging.set_verbosity(tf.logging.INFO)

# variables set by init()
BUCKET = None
TRAIN_STEPS = 1000
WORD_VOCAB_FILE = None 
N_WORDS = -1

# hardcoded into graph
BATCH_SIZE = 32

# describe your data
TARGETS = ['EAP', 'HPL', 'MWS']
MAX_DOCUMENT_LENGTH = 200
CSV_COLUMNS = ['text', 'author']
LABEL_COLUMN = 'author'
DEFAULTS = [['null'], ['null']]
PADWORD = 'ASDFG'

# create vocabulary
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(lines)
with gfile.Open('vocab.tsv', 'wb') as f:
    f.write("{}\n".format(PADWORD))
    for word, index in vocab_processor.vocabulary_._mapping.items():
      f.write("{}\n".format(word))
N_WORDS = len(vocab_processor.vocabulary_)

Instructions for updating:
Please use tensorflow/transform or tf.data.


The 'vocab.tsv' file contains one word per line, and the number associated with a word is the zero-based position of its line.  When the file is loaded with pandas below, the first column shows that number (the row index) and the second column shows the word.  The first two words in the dictionary are the pad word (with a corresponding value of 0) and UNK (with a corresponding value of 1), which represents words found in the test data that were not found in the training data.
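
Because the code above writes one word per line, a word's number is simply the zero-based position of its line. The following sketch reads such a file into a dictionary; the helper `read_vocab` and the three-line demo file are hypothetical stand-ins for the real "vocab.tsv".

```python
import os
import tempfile

def read_vocab(path):
    """Return {word: id}, where id is the zero-based line number."""
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): i for i, line in enumerate(f)}

# Hypothetical three-line file that mirrors the start of vocab.tsv.
demo_path = os.path.join(tempfile.mkdtemp(), 'vocab_demo.tsv')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('ASDFG\n<UNK>\nThis\n')

vocab = read_vocab(demo_path)
print(vocab)  # {'ASDFG': 0, '<UNK>': 1, 'This': 2}
```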

In [10]:
checkVocab = pd.read_csv("vocab.tsv", header = None)
checkVocab.head()

|  | 0 |
|---|---|
| 0 | ASDFG |
| 1 | &lt;UNK&gt; |
| 2 | This |
| 3 | process |
| 4 | however |


The following code performs a lookup for the words "process" and "left".  The word "process" corresponds to the number 3 and the word "left" corresponds to the number 49.

In [11]:
table = lookup.index_table_from_file(
  vocabulary_file='vocab.tsv', num_oov_buckets=1, vocab_size=None, default_value=-1)
numbers = table.lookup(tf.constant('process left'.split()))
with tf.Session() as sess:
  tf.tables_initializer().run()
  # print the words that were looked up, not the first excerpt
  print("{} --> {}".format('process left', numbers.eval()))

process left --> [ 3 49]
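
The lookup table is created with `num_oov_buckets=1`, so a word that is missing from the vocabulary file is assigned an id just past the end of the vocabulary. The sketch below imitates that behavior in plain Python. It is an illustration only: the function `lookup` is hypothetical, and Python's built-in `hash` stands in for the string hash that TensorFlow uses to pick a bucket.

```python
def lookup(words, vocab, num_oov_buckets=1):
    """Return the id of each word; unknown words go to an OOV bucket."""
    vocab_size = len(vocab)
    ids = []
    for word in words:
        if word in vocab:
            ids.append(vocab[word])
        else:
            # Assumption: Python's hash stands in for TensorFlow's string hash.
            ids.append(vocab_size + hash(word) % num_oov_buckets)
    return ids

vocab = {'ASDFG': 0, '<UNK>': 1, 'This': 2, 'process': 3}
print(lookup(['process', 'zombie'], vocab))  # [3, 4]
```

With a single bucket every unknown word lands on the same id, which here is 4, the size of the toy vocabulary.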


In [12]:
len(checkVocab)

28361

## Process Words

Split every excerpt in the training data into words.  The words tensor below is a sparse tensor that stores each word together with its index.  The index is a vector of size 2: the first value is the index of the excerpt and the second value is the position of the word within that excerpt.
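
As an illustration of that two-part index, the sketch below builds the same kind of structure in plain Python: a list of `[excerpt, position]` pairs, the matching words, and the overall shape. It is not TensorFlow code, and the function name `string_split` is reused here only for readability.

```python
def string_split(texts):
    """Return (indices, values, dense_shape) like a sparse tensor of words."""
    indices, values = [], []
    for row, text in enumerate(texts):
        for col, word in enumerate(text.split()):
            indices.append([row, col])
            values.append(word)
    # The widest excerpt determines the number of columns.
    n_cols = max((i[1] + 1 for i in indices), default=0)
    return indices, values, [len(texts), n_cols]

indices, values, shape = string_split(["It never", "How lovely is"])
print(indices)  # [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]
print(values)   # ['It', 'never', 'How', 'lovely', 'is']
print(shape)    # [2, 3]
```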

In [0]:
# string operations
excerpts = tf.constant(lines)
words = tf.string_split(excerpts)

Create a dense tensor of strings in which each row represents an excerpt and each entry is either a word of the excerpt or the PADWORD.  The words keep their original order, and the padwords are appended after the last word.  The tensor called numbers is the numerical representation of densewords: each word is replaced with its corresponding number in the vocabulary.
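
A small plain-Python sketch of the sparse-to-dense step may help. It fills a grid with the pad word and then writes each word at its `[excerpt, position]` index. The function `to_dense` is a hypothetical stand-in for the TensorFlow operation, and the two short excerpts are made up for the example.

```python
def to_dense(indices, values, dense_shape, default_value):
    """Place each value at its [row, col] index; fill the rest with the default."""
    rows, cols = dense_shape
    dense = [[default_value] * cols for _ in range(rows)]
    for (r, c), v in zip(indices, values):
        dense[r][c] = v
    return dense

indices = [[0, 0], [0, 1], [1, 0], [1, 1], [1, 2]]
values = ['It', 'never', 'How', 'lovely', 'is']
dense = to_dense(indices, values, [2, 3], 'ASDFG')
print(dense)  # [['It', 'never', 'ASDFG'], ['How', 'lovely', 'is']]
```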

In [14]:
densewords = tf.sparse.to_dense(words, default_value=PADWORD)
numbers = table.lookup(densewords)

Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.


In [0]:
padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
padded = tf.pad(numbers, padding)
sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])

In [16]:
sliced

<tf.Tensor 'Slice:0' shape=(19579, 200) dtype=int64>

In [0]:
# just dimensionality reduction, no context words
EMBEDDING_SIZE = 10
embeds = tf.contrib.layers.embed_sequence(sliced, 
                 vocab_size=N_WORDS, embed_dim=EMBEDDING_SIZE)

In [18]:
embeds

<tf.Tensor 'EmbedSequence/embedding_lookup/Identity:0' shape=(19579, 200, 10) dtype=float32>

In [0]:
WINDOW_SIZE = EMBEDDING_SIZE
STRIDE = int(WINDOW_SIZE/2)
conv = tf.contrib.layers.conv1d(embeds, 1, WINDOW_SIZE, 
                stride=STRIDE, padding='SAME') # (?, 40, 1)    
conv = tf.nn.relu(conv) # (?, 40, 1)    
words = tf.squeeze(conv, [2]) # (?, 40)
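
The length of the convolution output follows from the input length, the stride, and the padding mode. The sketch below applies the usual formulas for 'SAME' and 'VALID' padding; the helper `conv1d_output_length` is hypothetical and only does the arithmetic, it does not call TensorFlow.

```python
import math

def conv1d_output_length(input_length, window, stride, padding='SAME'):
    """Number of output positions of a 1-D convolution."""
    if padding == 'SAME':
        # The input is padded so that only the stride matters.
        return math.ceil(input_length / stride)
    if padding == 'VALID':
        # Only windows that fit entirely inside the input are used.
        return math.ceil((input_length - window + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

print(conv1d_output_length(200, 10, 5))           # 40
print(conv1d_output_length(200, 10, 5, 'VALID'))  # 39
```

With 200 words per excerpt, a window of 10, and a stride of 5, 'SAME' padding gives 40 positions per excerpt.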

In [0]:
# DNN

EMBEDDING_SIZE = 10
WINDOW_SIZE = EMBEDDING_SIZE
STRIDE = int(WINDOW_SIZE/2)




In [0]:
# CNN model parameters
EMBEDDING_SIZE = 10
WINDOW_SIZE = EMBEDDING_SIZE
STRIDE = int(WINDOW_SIZE/2)
def cnn_model(features, target, mode):
    
    
    # string operations
    text = tf.squeeze(features['text'], [1])
    words = tf.string_split(text)
    densewords = tf.sparse.to_dense(words, default_value=PADWORD)
    numbers = table.lookup(densewords)
    # pad every row with MAX_DOCUMENT_LENGTH zeros, then cut back to exactly that length
    padding = tf.constant([[0,0],[0,MAX_DOCUMENT_LENGTH]])
    padded = tf.pad(numbers, padding)
    sliced = tf.slice(padded, [0,0], [-1, MAX_DOCUMENT_LENGTH])
    print('words_sliced={}'.format(sliced))  # (?, 200)

    # layer to take the words and convert them into vectors (embeddings)
    embeds = tf.contrib.layers.embed_sequence(sliced, vocab_size=N_WORDS, embed_dim=EMBEDDING_SIZE)
    print('words_embed={}'.format(embeds)) # (?, 200, 10)
    
    # now do convolution
    conv = tf.contrib.layers.conv1d(embeds, 1, WINDOW_SIZE, stride=STRIDE, padding='SAME') # (?, 40, 1)
    conv = tf.nn.relu(conv) # (?, 40, 1)
    words = tf.squeeze(conv, [2]) # (?, 40)
    print('words_conv={}'.format(words)) # (?, 40)

    n_classes = len(TARGETS)

    logits = tf.contrib.layers.fully_connected(words, n_classes, activation_fn=None)
    #print('logits={}'.format(logits)) # (?, 3)
    predictions_dict = {
      'author': tf.gather(TARGETS, tf.argmax(logits, 1)),
      'class': tf.argmax(logits, 1),
      'prob': tf.nn.softmax(logits)
    }

    if mode == tf.contrib.learn.ModeKeys.TRAIN or mode == tf.contrib.learn.ModeKeys.EVAL:
       loss = tf.losses.sparse_softmax_cross_entropy(target, logits)
       train_op = tf.contrib.layers.optimize_loss(
         loss,
         tf.contrib.framework.get_global_step(),
         optimizer='Adam',
         learning_rate=0.01)
    else:
       loss = None
       train_op = None
      
    # Add variable initializer.
    init = tf.global_variables_initializer()

    # Create a saver.
    saver = tf.train.Saver()

    return tflearn.ModelFnOps(
      mode=mode,
      predictions=predictions_dict,
      loss=loss,
      train_op=train_op)
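
The model turns its three logits into probabilities with a softmax, picks the author with the largest value, and trains on the cross-entropy of the true class. The pure-Python sketch below shows that arithmetic for a single example. The logits are made-up numbers, and the helper names `softmax` and `sparse_cross_entropy` are hypothetical rather than TensorFlow functions.

```python
import math

TARGETS = ['EAP', 'HPL', 'MWS']

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Negative log-probability of the true class for one example."""
    return -math.log(softmax(logits)[label])

logits = [2.0, 0.5, -1.0]   # made-up scores for one excerpt
probs = softmax(logits)
predicted = TARGETS[probs.index(max(probs))]
print(predicted)  # EAP
```

The loss shrinks as the probability of the true class grows, which is what the optimizer works toward during training.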

    




In [24]:
#get length of training data
len(train)

19579

In [0]:
#split training data into train and evaluation data
evaluation_data = train[15663:]
training_data = train[:15663]


In [0]:
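def to_inputs(frame):
    # cnn_model expects a (batch, 1) string tensor under 'text' and integer class ids as the target
    features = {'text': tf.expand_dims(tf.constant(frame['text'].tolist()), 1)}
    labels = tf.constant(frame['author'].map(TARGETS.index).tolist())
    return features, labels

train_results = cnn_model(*to_inputs(training_data), mode=tf.contrib.learn.ModeKeys.TRAIN)
eval_results = cnn_model(*to_inputs(evaluation_data), mode=tf.contrib.learn.ModeKeys.EVAL)
with tf.Session() as sess:
    # each call builds its own variables, so these are the losses at initialization
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run([train_results.loss, eval_results.loss]))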

In [0]:
with tf.Session() as sess:
    # foo is a placeholder for a function that builds a tensor; it is not defined in this notebook
    result=foo()
    init=tf.global_variables_initializer()
    sess.run(init)
    print(sess.run(result))