# Classification with bag-of-words statistics

Finally, let us try to classify our poems with bag-of-words statistics, that is, with per-text word statistics that ignore word order.

## Extracting word statistics with `Tokenizer`

We already used the `Tokenizer` class from the `keras.preprocessing.text` module to transform text into sequences of word indices. Its `texts_to_matrix` method can also extract word statistics, returning one row per text and one column per word index:

In [None]:
import pandas as pd
from keras.preprocessing.text import Tokenizer

texts = ["Wenn Fliegen hinter Fliegen fliegen", "Fliegen Fliegen Fliegen nach"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
tokenizer.texts_to_matrix(texts, mode='binary')

The `mode` argument selects the statistic and accepts `'binary'` (the default), `'count'`, `'freq'`, and `'tfidf'`:

In [None]:
vocab = list(tokenizer.word_index)

for mode in ['binary', 'count', 'freq', 'tfidf']:
    print(f'\n===== Statistics: {mode} =====\n')
    # Column 0 is dropped because Keras word indices start at 1 (index 0 is reserved).
    stats = tokenizer.texts_to_matrix(texts, mode=mode)[:, 1:].tolist()
    print(pd.DataFrame.from_records(stats, columns=vocab, index=texts))
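
To make the modes more concrete, here is a minimal pure-Python sketch of what `'binary'`, `'count'`, and `'freq'` compute. It assumes simple lowercasing and whitespace splitting, whereas the real `Tokenizer` also strips punctuation. The helper `bag_of_words` is hypothetical and not part of Keras, and `'tfidf'` is left out because its exact weighting is implementation-specific.

```python
from collections import Counter

def bag_of_words(texts, mode="count"):
    """Hypothetical helper: return (vocab, matrix) with one row per text."""
    # Simplified tokenization (an assumption): lowercase, split on whitespace.
    tokenized = [text.lower().split() for text in texts]
    # Order the vocabulary by overall frequency, most frequent word first.
    totals = Counter(word for tokens in tokenized for word in tokens)
    vocab = [word for word, _ in totals.most_common()]
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if mode == "count":      # how often each word occurs in the text
            row = [counts[word] for word in vocab]
        elif mode == "binary":   # 1 if the word occurs at all, else 0
            row = [int(counts[word] > 0) for word in vocab]
        elif mode == "freq":     # count divided by the number of words in the text
            row = [counts[word] / max(len(tokens), 1) for word in vocab]
        else:
            raise ValueError(f"unsupported mode: {mode}")
        matrix.append(row)
    return vocab, matrix

texts = ["Wenn Fliegen hinter Fliegen fliegen", "Fliegen Fliegen Fliegen nach"]
for mode in ["binary", "count", "freq"]:
    vocab, matrix = bag_of_words(texts, mode)
    print(mode, vocab, matrix)
```

Each row has one entry per vocabulary word, so texts of different lengths become fixed-size vectors that a dense network can consume.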

Alternatively, one can use the `CountVectorizer` and `TfidfVectorizer` classes from the `sklearn.feature_extraction.text` module of the `scikit-learn` library to extract comparable statistics.

## Exercise: Classification using word statistics

Try to use one of these word statistics as an input to a dense neural network for classifying our poems!

### Useful code from previous notebooks

Here are some helpful functions:

In [None]:
import numpy as np
import pandas as pd

EXTRACT = 'selected_poems.json.bz2'
ALPHABET = 'abcdefghijklmnopqrstuvwxyzäöüßABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ .,;:!?-()"\'\n'

def clean_text(text):
    return ''.join([char for char in text if char in ALPHABET])

poems = pd.read_json(EXTRACT, compression='infer')
poems['cleaned_text'] = poems.text.apply(clean_text)

In [None]:
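from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

def data_from_column(column_name, max_len, train_ratio=0.7):
    """Split a column of `poems` into train/test data; also return short author names."""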
    if max_len is None:
        X = poems[column_name].values
    else:
        X = pad_sequences(poems[column_name], max_len)
    authors_ohe = pd.get_dummies(poems['author'])
    y = authors_ohe.values
    short_authors = [author.split(',')[0] for author in authors_ohe.columns]
    return train_test_split(X, y, train_size=train_ratio), short_authors

In [None]:
from keras import Sequential
from keras.layers import Dense

def train_model(model, epochs=5, batch_size=8):
    # Relies on the global variables X_train and y_train, which must be defined before calling.
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='Adadelta')
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)
    return model, pd.DataFrame(history.history)

### Your solution here!