# Natural language processing with spaCy

## A quick look at spaCy

Several Python libraries exist for natural language processing. In this tutorial, we take a look at spaCy.

In [None]:
import spacy

To use spaCy, we first need to load a **language model**, which contains statistical information about a language, aggregated from web text (in particular Wikipedia) or from news:

In [None]:
nlp = spacy.load('de_core_news_sm')

The object returned is a **natural language processor** which can be applied to text as follows:

In [None]:
doc = nlp('Wovon man nicht sprechen kann, darüber muss man schweigen.')

The processed text is a sequence of tokens with linguistic information stored as attributes, for example

- the token text,
- the part of speech of the token,
- the lemma of the token,
- an embedding vector.

To show these attributes in a table, we use a pandas DataFrame:

In [None]:
import pandas as pd

pd.DataFrame({'text': [token.text for token in doc],
              'pos': [token.pos_ for token in doc],
              'lemma': [token.lemma_ for token in doc],
              'vector': [token.vector for token in doc]})

We see that the part of speech and the lemma are recognized almost correctly.

Moreover, spaCy can be used to

- **extract named entities**,
- compute **document similarities**,
- **parse** the **syntax tree** of text.

See the spaCy documentation for details.
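
To give an idea of what a document similarity can look like, here is a minimal sketch in plain NumPy. It assumes the common approach of averaging the token vectors of each document and comparing the averages by cosine similarity, which is how the spaCy documentation describes its default similarity. The function names and the toy vectors are made up for illustration and are not part of spaCy.

```python
import numpy as np

def doc_vector(token_vectors):
    """Average the token vectors of a document into a single vector."""
    return np.mean(token_vectors, axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "word vectors", invented for this example.
doc_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
doc_b = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
doc_c = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 3.0]])

sim_ab = cosine_similarity(doc_vector(doc_a), doc_vector(doc_b))
sim_ac = cosine_similarity(doc_vector(doc_a), doc_vector(doc_c))
print(sim_ab, sim_ac)
```

Documents `doc_a` and `doc_b` point in the same direction and are therefore maximally similar, while `doc_a` and `doc_c` are orthogonal.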

## Classifying poems with pre-trained word embeddings

We now want to use spaCy's word vectors for our classification task. 

### Prepare the data

First, we load the poems:

In [None]:
import numpy as np
import pandas as pd

EXTRACT = 'selected_poems.json.bz2'
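# Characters to keep. The uppercase range includes 'J' so that words like "Jahr" survive cleaning.
ALPHABET = 'abcdefghijklmnopqrstuvwxyzäöüßABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ .,;:!?-()"\'\n'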

def clean_text(text):
    return ''.join([char for char in text if char in ALPHABET])

poems = pd.read_json(EXTRACT, compression='infer')
poems['cleaned_text'] = poems.text.apply(clean_text)
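
To see what this kind of character filter does to a text, here is a small self-contained sketch. The reduced alphabet and the name `clean_text_toy` are assumptions made for illustration only; the notebook itself uses `ALPHABET` and `clean_text`.

```python
# Reduced alphabet for illustration only; the notebook's ALPHABET is larger.
TOY_ALPHABET = 'abcdefghijklmnopqrstuvwxyzäöüß .,\n'

def clean_text_toy(text, alphabet=TOY_ALPHABET):
    """Keep only the characters that occur in `alphabet`."""
    return ''.join(char for char in text if char in alphabet)

cleaned = clean_text_toy('über\tallen gipfeln [1] ist ruh,')
print(repr(cleaned))
```

Characters outside the alphabet are dropped without replacement, so a tab between two words glues them together and a removed footnote marker leaves a double space behind.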

Second, we transform the poems into sequences of word vectors. This should not take much more than a minute:

In [None]:
def text_to_wordvecs(text):
    doc = nlp(text)
    return np.stack([token.vector for token in doc])

with nlp.disable_pipes('parser', 'ner'):
    poems['word_vecs'] = [text_to_wordvecs(text) for text in poems['cleaned_text']]

poems['word_vecs'].head()

To speed things up, we disabled the spaCy processing stages that are not needed to access the word vectors:

- parsing the syntax tree and
- extraction of named entities.

Next, we use the function `data_from_column` from the previous notebook to get our training and test data:

In [None]:
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

def data_from_column(column_name, max_len, train_ratio=0.7):
    if max_len is None:
        X = poems[column_name].values
    else:
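        # pad_sequences defaults to dtype='int32', which would truncate the float word vectors
        X = pad_sequences(poems[column_name], maxlen=max_len, dtype='float32')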
    authors_ohe = pd.get_dummies(poems['author'])
    y = authors_ohe.values
    short_authors = [author.split(',')[0] for author in authors_ohe.columns]
    return train_test_split(X, y, train_size=train_ratio), short_authors

In [None]:
MAX_LEN = 500

(X_train, X_test, y_train, y_test), authors = data_from_column('word_vecs', MAX_LEN)
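
Padding brings all poems to the same length so that they can be stacked into one array. The following NumPy sketch mimics what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): short sequences receive zero rows in front, long sequences keep only their last `max_len` rows. The helper `pad_pre` is hypothetical and only meant to illustrate the idea for a single sequence of float vectors.

```python
import numpy as np

def pad_pre(sequence, max_len):
    """Left-pad with zero rows, or keep only the last `max_len` rows."""
    sequence = np.asarray(sequence, dtype='float32')
    dim = sequence.shape[1]
    padded = np.zeros((max_len, dim), dtype='float32')
    trimmed = sequence[-max_len:]               # truncate from the front
    padded[max_len - len(trimmed):] = trimmed   # zero rows stay in front
    return padded

short_seq = np.ones((2, 3))              # 2 tokens, 3-dimensional vectors
long_seq = np.arange(15).reshape(5, 3)   # 5 tokens, 3-dimensional vectors

padded_short = pad_pre(short_seq, 4)
padded_long = pad_pre(long_seq, 4)
print(padded_short)
print(padded_long)
```

With `MAX_LEN = 500`, this means that poems longer than 500 tokens lose their beginning rather than their end.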

### Build and train a convolutional neural network

We reuse the model from the previous notebook, but of course without the embedding layer, since the inputs are already word vectors:

In [None]:
from keras import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

DIM = 96

def build_model(max_len=MAX_LEN):
    return Sequential([
        Conv1D(128, kernel_size=3, activation='relu', input_shape=(max_len, DIM)),
        GlobalMaxPooling1D(),
        Dense(3, activation='softmax')
    ])
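
To make the shapes in this model concrete, here is a rough NumPy sketch of what a one-dimensional convolution without padding followed by global max pooling computes for a single input. The functions `conv1d_valid` and `global_max_pool` are illustrative stand-ins, not the Keras implementation, and the toy sizes are much smaller than in the notebook.

```python
import numpy as np

def conv1d_valid(x, kernels, bias):
    """x: (steps, channels); kernels: (kernel_size, channels, filters)."""
    kernel_size, _, filters = kernels.shape
    out_steps = x.shape[0] - kernel_size + 1
    out = np.empty((out_steps, filters))
    for t in range(out_steps):
        window = x[t:t + kernel_size]  # (kernel_size, channels)
        out[t] = np.tensordot(window, kernels, axes=2) + bias
    return np.maximum(out, 0.0)        # ReLU activation

def global_max_pool(features):
    """Keep the strongest response of every filter over all positions."""
    return features.max(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))           # toy input: 10 tokens, 4-dim vectors
kernels = rng.normal(size=(3, 4, 8))   # kernel_size=3, 8 filters
bias = np.zeros(8)

conv_out = conv1d_valid(x, kernels, bias)
pooled = global_max_pool(conv_out)
print(conv_out.shape, pooled.shape)
```

By the same arithmetic, an input of shape `(500, 96)` gives a convolution output of shape `(498, 128)` and a pooled vector of length 128, which the dense layer maps to three class probabilities.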

Ready, steady, go:

In [None]:
def train_model(model, epochs=8, batch_size=8):
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='Adadelta')
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)
    return model, pd.DataFrame(history.history)

model, history = train_model(build_model())

In [None]:
from sklearn import metrics

def validate(model):
    authors = [author.split(',')[0] for author in pd.get_dummies(poems['author']).columns]
    y_pred = np.argmax(model.predict(X_test), axis=1)
    y_res = np.argmax(y_test, axis=1)
    print(metrics.classification_report(y_res, y_pred, target_names=authors))
    cm = pd.crosstab(y_res, y_pred)
    cm.index = authors
    cm.columns = authors
    print(cm)

validate(model)
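
The evaluation turns softmax outputs into class indices with `argmax` and then counts how often each true class is predicted as each other class. Here is a minimal sketch of that step with made-up probabilities. The helper `confusion_matrix` is written by hand for illustration and fixes the number of classes up front, so that a class that is never predicted still gets its own column.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up softmax outputs for four samples and three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [1, 0, 0],
                    [0, 0, 1]])

y_pred = probs.argmax(axis=1)
y_true = one_hot.argmax(axis=1)
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```

The diagonal holds the correctly classified samples, so the accuracy is the trace divided by the total count.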

### Regularization with batch normalization

To fight overfitting, let us increase the batch size and try batch normalization:

In [None]:
from keras.layers import Activation, BatchNormalization

def build_model(max_len=MAX_LEN):
    return Sequential([
        Conv1D(128, kernel_size=3, input_shape=(max_len, DIM)),
        BatchNormalization(),
        Activation('relu'),
        GlobalMaxPooling1D(),
        Dense(3, activation='softmax')
    ])

model, history = train_model(build_model(), batch_size=64)

In our case, batch normalization does not really improve performance.
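
For reference, this is roughly what batch normalization computes during training when it is placed after a one-dimensional convolution: every channel is shifted and scaled to zero mean and unit variance over the batch and time axes, then rescaled by the learned parameters `gamma` and `beta`. The sketch below is a simplified NumPy version under these assumptions. It leaves out the moving averages that Keras keeps for inference, and `batch_norm_train` is a hypothetical name.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """x: (batch, steps, channels); normalize each channel over batch and steps."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 20, 4))  # toy batch of 64 samples
normalized = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(normalized.mean(axis=(0, 1)))
print(normalized.std(axis=(0, 1)))
```

Since the statistics are estimated per batch, they become less noisy with larger batches, which is one reason to combine batch normalization with a larger batch size.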

## Building a shallow-and-wide model using the functional API of Keras

We now want to apply convolutions with different kernel sizes to the input in parallel, so that the model extracts finer and coarser details at the same time. This can be done using the functional API of Keras as follows:

In [None]:
from keras.models import Model
from keras.layers import Input, Concatenate

def convolve_and_pool(units, kernel_size, inputs):
    conv = Conv1D(units, kernel_size, activation='relu')(inputs)
    return GlobalMaxPooling1D()(conv)

def build_model(max_len=MAX_LEN):
    inputs = Input((max_len, DIM))
    convs = [convolve_and_pool(96, ks, inputs) for ks in (3, 5, 7)]
    concatenated = Concatenate()(convs)
    dense = Dense(3, activation='softmax')(concatenated)
    return Model(inputs, dense)
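
The shapes in this model can be checked by hand. The following sketch assumes convolutions without padding and with stride 1, which are the Keras defaults, and computes the output shape of every branch as well as the size of the concatenated feature vector. The function `branch_shapes` is a hypothetical helper for illustration.

```python
def branch_shapes(max_len, filters, kernel_sizes):
    """Output shapes per branch for unpadded, stride-1 convolutions."""
    shapes = {}
    for ks in kernel_sizes:
        conv_steps = max_len - ks + 1
        shapes[ks] = {'conv': (conv_steps, filters), 'pooled': (filters,)}
    return shapes

shapes = branch_shapes(max_len=500, filters=96, kernel_sizes=(3, 5, 7))
concatenated_dim = sum(s['pooled'][0] for s in shapes.values())
dense_params = concatenated_dim * 3 + 3   # weights plus biases for 3 classes

for ks, s in shapes.items():
    print(ks, s)
print(concatenated_dim, dense_params)
```

Because of global max pooling, every branch contributes a vector of the same length regardless of its kernel size, which is what makes the concatenation straightforward.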

In [None]:
model, history = train_model(build_model())

In [None]:
validate(model)

### Exercise

Instead of using convolutions with different kernel sizes, use iterated/stacked convolutions.
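
As a hint, it helps to know how many input tokens a stacked convolution can see. The following sketch computes this receptive field, assuming stride 1 and no dilation. The function name is made up for this example.

```python
def receptive_field(kernel_sizes):
    """Input positions seen by one output position of stacked stride-1 convolutions."""
    field = 1
    for ks in kernel_sizes:
        field += ks - 1
    return field

fields = {depth: receptive_field([3] * depth) for depth in (1, 2, 3)}
print(fields)
```

Under these assumptions, two stacked convolutions with kernel size 3 cover as many tokens as one convolution with kernel size 5, and three of them cover as many as one with kernel size 7.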