# Classifying poems as character sequences

The dataset is stored as compressed JSON. We use `pandas` to load it as a `pandas.DataFrame` table:

In [None]:
import pandas as pd
EXTRACT = 'selected_poems.json.bz2'
poems = pd.read_json(EXTRACT, compression='infer')
poems.head()

## Preparing the data

### Step 1: Encoding poems as character sequences

We treat the poems as sequences of characters and first determine the set of all characters appearing in the poems:

In [None]:
used_alphabet = set().union(*poems['text'].apply(set))
''.join(sorted(used_alphabet))
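
As a small illustration of the set-union idiom used above, here is a sketch on three made-up strings (the variable names `texts`, `alphabet` and `chars` are only for this example):

```python
# Toy strings standing in for poem texts (made up for this example)
texts = ['abba', 'cab', 'a d']

# Union of the character sets of all strings
alphabet = set().union(*(set(t) for t in texts))

# Sorted and joined for display; the space sorts before the letters
chars = ''.join(sorted(alphabet))
print(repr(chars))  # ' abcd'
```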

The poems contain some strange characters, which we would like to filter out. So, we fix an alphabet to use:

In [None]:
ALPHABET = 'abcdefghijklmnopqrstuvwxyzäöüßABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ .,;:!?-()"\'\n'
len(ALPHABET)

We now transform each poem as follows: each character is replaced by its index in `ALPHABET`, starting at 1, or by 0 if it is not contained in the alphabet. In a later step, we also cut each poem down to 1000 characters and pad shorter poems with zeros to obtain sequences of a fixed length.

In [None]:
char_index = {char: index + 1 for index, char in enumerate(ALPHABET)}

def index_characters(text):
    return [char_index.get(char, 0) for char in text]
                                              
poems['characters'] = poems.text.apply(index_characters)
poems[['text', 'characters']].head()
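
To make the index encoding concrete, the following sketch applies the same idea to a tiny toy alphabet. The names `toy_alphabet`, `toy_index` and `encode` are hypothetical and not part of the notebook's pipeline.

```python
# Toy alphabet, used only for this illustration (not the ALPHABET above)
toy_alphabet = 'abc '
toy_index = {char: index + 1 for index, char in enumerate(toy_alphabet)}

def encode(text):
    # Characters outside the alphabet fall back to index 0
    return [toy_index.get(char, 0) for char in text]

encoded = encode('ab cx')
print(encoded)  # [1, 2, 4, 3, 0]
```

The final `x` is not in the toy alphabet and is therefore mapped to 0.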

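Next, we apply a one-hot encoding. Index 0 (characters outside the alphabet) is mapped to the all-zero vector, every other index to a unit vector: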

In [None]:
import numpy as np

eye = np.eye(len(ALPHABET))
zeros = np.zeros((1, len(ALPHABET)))
codes = np.vstack([zeros, eye])
codes

In [None]:
poems['characters_ohe'] = poems.characters.apply(lambda chars: codes[chars])
poems['characters_ohe'].head()
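
The lookup `codes[chars]` relies on NumPy's integer-array indexing: each index selects one row of the code matrix. A minimal sketch with a hypothetical alphabet of only three characters:

```python
import numpy as np

n_chars = 3  # hypothetical alphabet size for this example
# Row 0 is the all-zero code, rows 1..n_chars are the unit vectors
toy_codes = np.vstack([np.zeros((1, n_chars)), np.eye(n_chars)])

indices = [2, 0, 3]            # an encoded toy "text"
one_hot = toy_codes[indices]   # one row per character
print(one_hot)
```

The result has shape `(3, 3)`: one row per character, one column per alphabet entry, and the unknown character (index 0) becomes a row of zeros.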

Finally, we stack the matrices obtained for the poems into a single array of fixed shape. For this, we use a convenience function of Keras:

In [None]:
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 1000
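# truncating='post' keeps the first MAX_LEN characters (the default 'pre' keeps the last ones)
X = pad_sequences(poems.characters_ohe, maxlen=MAX_LEN, truncating='post')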
X.shape
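
To illustrate what cutting and padding to a fixed length does, here is a plain NumPy sketch. It is not the Keras implementation; it assumes that the first `maxlen` steps are kept and that zeros are added in front (front padding matches the Keras default `padding='pre'`). The helper name `pad_or_truncate` is hypothetical.

```python
import numpy as np

def pad_or_truncate(seq, maxlen):
    """Keep the first `maxlen` steps and left-pad shorter sequences with zeros."""
    seq = np.asarray(seq)[:maxlen]
    # Works for 1D index sequences and for 2D one-hot matrices alike
    padded = np.zeros((maxlen,) + seq.shape[1:], dtype=seq.dtype)
    padded[maxlen - len(seq):] = seq
    return padded

short = pad_or_truncate([1, 2, 3], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0 0 1 2 3]
print(long_)  # [1 2 3 4 5]
```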

### Step 2: Encoding the authors

Next, we encode the labels, that is, the authors. We could proceed as before, but pandas offers the convenience function `get_dummies`:

In [None]:
authors_ohe = pd.get_dummies(poems.author)
authors_ohe.head()

We access the raw matrix via the `values` attribute:

In [None]:
y = authors_ohe.values
y[:5]
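
The following sketch shows the label encoding on made-up data and how `argmax` recovers the original labels. The author names are placeholders and say nothing about the actual dataset; `dtype=float` is passed explicitly because the default dtype of `get_dummies` differs between pandas versions.

```python
import pandas as pd

# Placeholder labels, made up for this example
toy_authors = pd.Series(['Goethe', 'Heine', 'Goethe', 'Rilke'])

# One column per distinct label, sorted alphabetically
toy_ohe = pd.get_dummies(toy_authors, dtype=float)
labels = toy_ohe.values

# argmax along each row gives the column position, i.e. the class index
decoded = toy_ohe.columns[labels.argmax(axis=1)]
print(list(decoded))  # ['Goethe', 'Heine', 'Goethe', 'Rilke']
```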

### Step 3: Shuffle and split the dataset

Next, we shuffle the data and split it into a training set and a test set. For the moment, we do this by hand:

In [None]:
def train_test_split(X,y,ratio=0.7):
    total = X.shape[0]
    indices = np.random.permutation(total)
    pos = int(ratio * total)  # use the ratio argument instead of a hard-coded 0.7
    train_indices, test_indices = indices[:pos], indices[pos:]
    return (X[train_indices], y[train_indices]), (X[test_indices], y[test_indices])

(X_train, y_train), (X_test, y_test) = train_test_split(X,y)
X_train.shape
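
A quick sanity check for such a split is that the sizes match the ratio and that features and labels stay aligned after shuffling. The sketch below uses toy arrays and a hypothetical helper `shuffle_split` with a fixed seed for reproducibility; it is not the function defined above.

```python
import numpy as np

def shuffle_split(X, y, ratio=0.7, seed=0):
    # A seeded generator makes the shuffle reproducible
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    pos = int(ratio * X.shape[0])
    train, test = indices[:pos], indices[pos:]
    return (X[train], y[train]), (X[test], y[test])

# Toy data: row i of X_toy starts with 2*i, and y_toy[i] == i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

(X_a, y_a), (X_b, y_b) = shuffle_split(X_toy, y_toy, ratio=0.8)
print(X_a.shape, X_b.shape)  # (8, 2) (2, 2)
```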

## Training a neural network for classification

### What about dense layers as before?

We want to train a neural network that predicts the author of a poem. Let's try a network similar to the one we used for the MNIST task:

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten

In [None]:
def build_model():
    return Sequential([
        Flatten(),
        Dense(3, activation='softmax')
    ])

In [None]:
def train_model(model, epochs=10, batch_size=32):
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='Adadelta')
    history = model.fit(X_train,y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)
    return model, pd.DataFrame(history.history)
    
model = build_model()
model, history = train_model(model)

This is not going to get us very far... We observe an extreme form of overfitting.
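
One plausible contributing factor is the sheer number of weights: after flattening, every (position, character) pair gets its own weight for each class. The sketch below counts them; the alphabet size of 70 is a hypothetical round number, not the exact value of `len(ALPHABET)`.

```python
def dense_on_flattened_params(seq_len, n_features, n_classes):
    # One weight per (input, class) pair plus one bias per class
    return seq_len * n_features * n_classes + n_classes

n_params = dense_on_flattened_params(seq_len=1000, n_features=70, n_classes=3)
print(n_params)  # 210003
```

With far more parameters than training poems, such a model can simply memorize the training set.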

### A better try: convolutional layers

We now train a convolutional neural network consisting of

- a stack of **convolutional layers** for pattern extraction and 
- a **dense layer** for classification.

In [None]:
from keras.layers import Conv1D, GlobalMaxPooling1D

def build_model():
    return Sequential([
        Conv1D(64, kernel_size=3, strides=1, activation='relu', input_shape=(MAX_LEN, len(ALPHABET))),
        Conv1D(128, kernel_size=3, strides=1, activation='relu'),
        GlobalMaxPooling1D(),
        Dense(128, activation='relu'),
        Dense(3, activation='softmax')
    ])

model, history = train_model(build_model())
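
To see how the convolutional stack changes the shape of the data, here is a NumPy sketch of a 1D convolution without padding, followed by global max pooling. It is an illustration with random weights and small toy sizes, not the Keras implementation; the helper `conv1d_relu` is hypothetical.

```python
import numpy as np

def conv1d_relu(x, kernels, bias):
    """1D convolution without padding, followed by ReLU.

    x: (steps, channels_in), kernels: (kernel_size, channels_in, filters)
    """
    kernel_size = kernels.shape[0]
    steps_out = x.shape[0] - kernel_size + 1
    # Collect every window of `kernel_size` consecutive time steps
    windows = np.stack([x[i:i + kernel_size] for i in range(steps_out)])
    # Contract over window position and input channel
    out = np.tensordot(windows, kernels, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))   # 10 time steps, 4 input channels
h1 = conv1d_relu(x, rng.normal(size=(3, 4, 5)), np.zeros(5))
h2 = conv1d_relu(h1, rng.normal(size=(3, 5, 6)), np.zeros(6))
pooled = h2.max(axis=0)        # global max pooling over the time axis

print(h1.shape, h2.shape, pooled.shape)  # (8, 5) (6, 6) (6,)
```

Each layer with kernel size 3 shortens the sequence by 2 steps, so for sequences of length 1000 the two layers would produce 998 and then 996 steps (assuming the default `'valid'` padding of `Conv1D`). Global max pooling then keeps one value per filter, independent of the sequence length.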

Let us visualize the training history again:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set()

def plot_history(history):
    _, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))
    history[['loss', 'val_loss']].plot.line(ax=ax1)
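    # Older Keras versions log 'acc', newer ones 'accuracy'
    acc = 'acc' if 'acc' in history else 'accuracy'
    history[[acc, 'val_' + acc]].plot.line(ax=ax2)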
    
plot_history(history)

Finally, let us evaluate the trained model on the held-out test data:

In [None]:
def validate(model):
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_true = np.argmax(y_test, axis=1)
    return y_true, y_pred

y_true, y_pred = validate(model)

A confusion matrix gives a useful view of the test result: rows correspond to the true authors, columns to the predicted ones. One way to compute it is the function `pd.crosstab`:

In [None]:
def confusion(y_true, y_pred):
    # Map class indices to author names; this also works if a class is never predicted
    labels = authors_ohe.columns
    return pd.crosstab(pd.Series(labels[y_true], name='true'),
                       pd.Series(labels[y_pred], name='predicted'))

confusion(y_true, y_pred)
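
The following sketch builds a confusion matrix for made-up predictions and reads the accuracy off its diagonal. The class names `A`, `B`, `C` and all values are placeholders, not results from the model above.

```python
import numpy as np
import pandas as pd

names = np.array(['A', 'B', 'C'])             # placeholder author names
y_true_toy = np.array([0, 0, 1, 1, 2, 2])     # made-up true classes
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])     # made-up predictions

table = pd.crosstab(pd.Series(names[y_true_toy], name='true'),
                    pd.Series(names[y_pred_toy], name='predicted'))

# Correct predictions sit on the diagonal (all classes occur in both axes here)
accuracy = np.trace(table.values) / table.values.sum()
print(table)
print(accuracy)
```

Four of the six toy predictions are correct, so the accuracy is 4/6.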

## Exercise: Training an embedding layer for characters

Instead of a one-hot encoding, we can train dense embeddings for the characters using the `Embedding` layer of Keras. First, we prepare the data, this time keeping the integer character indices:

In [None]:
X = pad_sequences(poems.characters, maxlen=MAX_LEN, truncating='post')
(X_train, y_train), (X_test, y_test) = train_test_split(X,y)
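
Conceptually, an embedding is a trainable lookup table: each character index selects one row of a weight matrix. The NumPy sketch below, with random weights and hypothetical sizes, shows that this lookup equals multiplying a one-hot vector by the same matrix. It illustrates the idea only and is not a solution to the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 5, 2   # hypothetical sizes; index 0 gets its own row
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

indices = np.array([3, 0, 4])              # an encoded toy "text"
looked_up = embedding_matrix[indices]      # direct row lookup

one_hot = np.eye(vocab_size)[indices]      # one-hot rows for the same indices
via_matmul = one_hot @ embedding_matrix    # same result via matrix product

print(np.allclose(looked_up, via_matmul))  # True
```

The lookup avoids materializing the one-hot matrix, and the embedding dimension can be chosen much smaller than the alphabet size.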

Now, reuse the previous model but put an `Embedding` layer in front to transform character index sequences into vector sequences:

In [None]:
from keras.layers import Embedding, MaxPooling1D

def build_model():
    # Your code here!
    pass

Once done, go and try this model!

In [None]:
model, _ = train_model(build_model(), epochs=10)
confusion(*validate(model))