[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Humboldt-WI/adams/blob/master/exercises/tut9_CNN_NLP_teacher.ipynb)

# Tutorial 9: Convolutional Neural Nets for Text Data
In this tutorial, we will first explain what the layers `Conv2D` (rank-3 tensors) and `Conv1D` (rank-2 tensors) do. Then, we will use `Conv1D` to classify Tweets into positive, neutral and negative sentiments—the Tweets are from the clients of different airlines. 

For further examples, please visit [demos/cnn](https://github.com/Humboldt-WI/adams/tree/master/demos/cnn).

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow as tf
import string
import re
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

## ConveNets
Convnets are widely used in computer vision applications. The most common is the `Conv2D` which takes as input tensors of shape `(height, width, channels)` plus the batch. Let's see a simple example 

In [3]:
# Create a sample input (batch, height, width, channels)
tf.random.set_seed(1234) # for reproducibility
ex_input = tf.concat([tf.ones((1,3,3,1)), 2*tf.ones((1,3,3,1))], axis=3 ) # (1,3,3,2)
ex_input

<tf.Tensor: shape=(1, 3, 3, 2), dtype=float32, numpy=
array([[[[1., 2.],
         [1., 2.],
         [1., 2.]],

        [[1., 2.],
         [1., 2.],
         [1., 2.]],

        [[1., 2.],
         [1., 2.],
         [1., 2.]]]], dtype=float32)>

In [4]:
# Apply a convnet with 1 filter and a kernel of size 2
cnn2D = layers.Conv2D(filters=1,kernel_size=2, input_shape=ex_input.shape[1:])
cnn2D(ex_input)

<tf.Tensor: shape=(1, 2, 2, 1), dtype=float32, numpy=
array([[[[-0.27699053],
         [-0.27699053]],

        [[-0.27699053],
         [-0.27699053]]]], dtype=float32)>

In [5]:
# Let's understand the matrix operations
kernel = cnn2D.get_weights()[0] # random initialization weights
kernel.shape

(2, 2, 2, 1)

In [6]:
kernel[:,:,0,:] # weights of first channel

array([[[ 0.05379575],
        [ 0.1154424 ]],

       [[-0.09852028],
        [ 0.01021165]]], dtype=float32)

In [11]:
np.sum(1*kernel[:,:,0,:])+np.sum(2*kernel[:,:,1,:]) # replicate the firts output of the convnet
# np.sum(np.multiply(tf.ones(2,2), kernel[:,:,0,0])) + np.sum(np.multiply(2*tf.ones(2,2), kernel[:,:,1,0])) # replicate the firts output of the convnet

-0.27699053

Convnets are not restricted to rank-3 tensor `(height, width, channels)`. Keras also has `Conv3D` and `Conv1D` implemented. Let's look at `Conv1D`, which requires a rank-2 tensor as input, such as sequence data.

In [13]:
# Input for cnn1D (batch, seq_length, emb_dim)
tf.random.set_seed(1234)
ex_input = tf.concat([tf.ones((1,1,2)), 2*tf.ones((1,1,2)), 3*tf.ones((1,1,2))], axis = 1) # (1, 3, 2)
ex_input

<tf.Tensor: shape=(1, 3, 2), dtype=float32, numpy=
array([[[1., 1.],
        [2., 2.],
        [3., 3.]]], dtype=float32)>

In [14]:
# Apply a convnet with 1 filter and a kernel of size 2
cnn1D = layers.Conv1D(filters=1,kernel_size=2, input_shape=ex_input.shape[1:])
cnn1D(ex_input)

<tf.Tensor: shape=(1, 2, 1), dtype=float32, numpy=
array([[[-0.8928499],
        [-1.4366169]]], dtype=float32)>

In [16]:
kernel = cnn1D.get_weights()[0]
kernel.shape

(2, 2, 1)

In [17]:
print(np.sum(1*kernel[0,:,:] + 2*kernel[1,:,:] )) # first row and second row
print(np.sum(2*kernel[0,:,:] + 3*kernel[1,:,:] )) # secod row and third row

-0.8928499
-1.4366169


# Tweets classification
The purpose is to put `Conv1D` into practice. We have Twitter data concerning airline clients and the labels of their tweets (positive, neutral, negative). The idea is to create a classification model for tweets. We'll only care about the positive and negative in the first part. Then, we include the neutral labels. 

In [18]:
# Load data
tot_tweets = pd.read_csv("../../../demos/cnn/Tweets.csv.zip")
tot_tweets = tot_tweets[['airline_sentiment','text']]

## Positive and Negative Tweets

### Exercise 1: 
Remove the samples with the label `neutral`, create train and validation sets, and then transform them to NumPy arrays.

In [19]:
# Remove neutral labels and transform to numpy
tweets = tot_tweets[tot_tweets['airline_sentiment']!='neutral'].copy()
tweets['airline_sentiment'] = tweets['airline_sentiment'].map({'positive' : 1, 'negative': 0})
X_train, X_val, y_train, y_val = train_test_split(tweets['text'], tweets['airline_sentiment'], test_size = 0.2, random_state = 5)
X_train = X_train.to_numpy()
X_val = X_val.to_numpy()
y_train = y_train.to_numpy()
y_val = y_val.to_numpy()

### Exercise 2:
Create a function to standardize the text. In particular, replace any character that is not a-z OR A-Z with a space, convert to lowercase, and remove punctuation and double space. 

In [41]:
# define standarization function 
def our_standardization(text_data):
  remove_non = tf.strings.regex_replace(text_data, '[^a-zA-Z]', ' ') # replace non a-z OR A-Z with " "
  lowercase = tf.strings.lower(remove_non) # convert to lowercase
  pattern_remove_punctuation = '[%s]' % re.escape(string.punctuation) # pattern to remove punctuation
  remove_punct = tf.strings.regex_replace(lowercase, pattern_remove_punctuation, '') # apply pattern
  remove_double_spaces = tf.strings.regex_replace(remove_punct, '\s+', ' ') # remove double space
  remove_initial_end_spaces  =tf.strings.regex_replace(remove_double_spaces, '^\s*|\s*$', '') 
  return remove_initial_end_spaces
  


### Exercise 3:
Create a vectorization layer and apply it to the text data. Use 10000 tokens with a maximum length for each tweet of 50. 

In [46]:
vocab_size=10000
seq_length = 50
# Create a vectorization layer
vectorize_layer = TextVectorization(
    standardize = our_standardization,
    max_tokens = vocab_size,
    output_sequence_length = seq_length
    )
vectorize_layer.adapt(X_train)

## Transform sequences of words to seq of integers and labels to tensor
X_train = vectorize_layer(X_train)
X_val = vectorize_layer(X_val)
y_train = tf.convert_to_tensor(y_train)
y_val = tf.convert_to_tensor(y_val)

## Model `Embedding` + `Conv1D` + `MaxPooling1D` + `Flatten` + `Dense`
### Exercise 4:
Create a model with one `Embedding` of dimension 16, followed by a `Conv1D` with 32 filters and a kernel size of 8 and relu activation. Then, apply `MaxPooling1D` with a pool size of 2, `Flatten` the output and finally use the `Dense` layer. Can you explain the number of parameters?

In [50]:
emb_size = 16
num_filters = 32
ker_size = 8

inputs = tf.keras.Input(shape = (seq_length, ))
emb = layers.Embedding(input_dim=vocab_size, output_dim=emb_size)(inputs) 
x = layers.Conv1D(filters = num_filters, kernel_size = ker_size, activation = 'relu')(emb)
x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 50, 16)            160000    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 43, 32)            4128      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 21, 32)            0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 672)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 673       
Total params: 164,801
Trainable params: 164,801
Non-trainable params: 0
_____________________________________________________

### Exercise 5: 
Train the model using a batch size of 128 for 20 epochs and an `EarlyStopping` callback with patience of 3. Restore the best weights and evaluate the validation set.

In [51]:
callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3,restore_best_weights=True)]

model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 20,
    batch_size = 128,
    callbacks=callbacks,
    verbose=2)

Epoch 1/20
73/73 - 1s - loss: 0.4664 - accuracy: 0.8006 - val_loss: 0.3941 - val_accuracy: 0.8268
Epoch 2/20
73/73 - 0s - loss: 0.2980 - accuracy: 0.8744 - val_loss: 0.2770 - val_accuracy: 0.8826
Epoch 3/20
73/73 - 0s - loss: 0.1970 - accuracy: 0.9269 - val_loss: 0.2167 - val_accuracy: 0.9173
Epoch 4/20
73/73 - 0s - loss: 0.1484 - accuracy: 0.9453 - val_loss: 0.1950 - val_accuracy: 0.9233
Epoch 5/20
73/73 - 0s - loss: 0.1216 - accuracy: 0.9556 - val_loss: 0.2145 - val_accuracy: 0.9207
Epoch 6/20
73/73 - 0s - loss: 0.1034 - accuracy: 0.9618 - val_loss: 0.1908 - val_accuracy: 0.9268
Epoch 7/20
73/73 - 0s - loss: 0.0888 - accuracy: 0.9675 - val_loss: 0.2014 - val_accuracy: 0.9199
Epoch 8/20
73/73 - 0s - loss: 0.0780 - accuracy: 0.9716 - val_loss: 0.2071 - val_accuracy: 0.9225
Epoch 9/20
73/73 - 1s - loss: 0.0685 - accuracy: 0.9763 - val_loss: 0.2425 - val_accuracy: 0.9199


<tensorflow.python.keras.callbacks.History at 0x7fea3a9cd640>

In [52]:
model.evaluate(X_val, y_val)[1]



0.9268081188201904

## Model `Embedding` + `Conv1D` + `GlobalAveragePooling1D` + `Dense`
### Exercise 6:
Create a new model similar to the previous one but replace the `MaxPooling1D` and `Flatten` layers with `GlobalAveragePooling1D`. Can you explain what we are doing? Next, train the model using the previous settings. Is it better?

In [53]:
emb_size = 16
num_filters = 32
ker_size = 8

inputs = tf.keras.Input(shape = (seq_length, ))
emb = layers.Embedding(input_dim=vocab_size, output_dim=emb_size)(inputs) 
x = layers.Conv1D(filters = num_filters, kernel_size = ker_size, activation = 'relu')(emb)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 50, 16)            160000    
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 43, 32)            4128      
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
Total params: 164,161
Trainable params: 164,161
Non-trainable params: 0
_________________________________________________________________


In [54]:
callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3,restore_best_weights=True)]

model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 20,
    batch_size = 128,
    callbacks=callbacks,
    verbose = 2)

Epoch 1/20
73/73 - 1s - loss: 0.5483 - accuracy: 0.7869 - val_loss: 0.5024 - val_accuracy: 0.7878
Epoch 2/20
73/73 - 0s - loss: 0.4441 - accuracy: 0.7971 - val_loss: 0.4142 - val_accuracy: 0.8216
Epoch 3/20
73/73 - 0s - loss: 0.3603 - accuracy: 0.8414 - val_loss: 0.3619 - val_accuracy: 0.8411
Epoch 4/20
73/73 - 1s - loss: 0.3078 - accuracy: 0.8683 - val_loss: 0.3337 - val_accuracy: 0.8553
Epoch 5/20
73/73 - 0s - loss: 0.2640 - accuracy: 0.8947 - val_loss: 0.2923 - val_accuracy: 0.8779
Epoch 6/20
73/73 - 0s - loss: 0.2256 - accuracy: 0.9143 - val_loss: 0.2636 - val_accuracy: 0.8965
Epoch 7/20
73/73 - 0s - loss: 0.1948 - accuracy: 0.9285 - val_loss: 0.2366 - val_accuracy: 0.9056
Epoch 8/20
73/73 - 0s - loss: 0.1730 - accuracy: 0.9399 - val_loss: 0.2284 - val_accuracy: 0.9142
Epoch 9/20
73/73 - 0s - loss: 0.1555 - accuracy: 0.9471 - val_loss: 0.2286 - val_accuracy: 0.9138
Epoch 10/20
73/73 - 0s - loss: 0.1417 - accuracy: 0.9513 - val_loss: 0.2128 - val_accuracy: 0.9164
Epoch 11/20
73/73 -

<tensorflow.python.keras.callbacks.History at 0x7fea3aa281f0>

In [55]:
model.evaluate(X_val, y_val)[1]



0.9229103326797485

## Model `Embedding` + `Conv1D` + `MaxPooling1D`+ `Conv1D` + `MaxPooling1D` + `Flatten` + `Dense`
### Exercise 7:
Let's try now a deeper network by adding `Conv1D` + `MaxPooling1D` to the first configuration.

In [56]:
emb_size = 16
num_filters = 32
ker_size = 8

inputs = tf.keras.Input(shape = (seq_length, ))
emb = layers.Embedding(input_dim=vocab_size, output_dim=emb_size)(inputs) 
x = layers.Conv1D(filters = num_filters, kernel_size = ker_size, activation = 'relu')(emb)
x = layers.MaxPooling1D(2)(x)
x = layers.Conv1D(filters = num_filters, kernel_size = int(ker_size/2), activation = 'relu')(x)
x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.summary()

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding_3 (Embedding)      (None, 50, 16)            160000    
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 43, 32)            4128      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 21, 32)            0         
_________________________________________________________________
conv1d_5 (Conv1D)            (None, 18, 32)            4128      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 9, 32)             0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 288)               0   

In [57]:
callbacks = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3,restore_best_weights=True)]

model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 20,
    batch_size = 128,
    callbacks=callbacks,
    verbose = 2)

Epoch 1/20
73/73 - 2s - loss: 0.4693 - accuracy: 0.7941 - val_loss: 0.4095 - val_accuracy: 0.8172
Epoch 2/20
73/73 - 1s - loss: 0.2995 - accuracy: 0.8740 - val_loss: 0.2875 - val_accuracy: 0.8688
Epoch 3/20
73/73 - 1s - loss: 0.1912 - accuracy: 0.9294 - val_loss: 0.2197 - val_accuracy: 0.9104
Epoch 4/20
73/73 - 1s - loss: 0.1442 - accuracy: 0.9444 - val_loss: 0.1992 - val_accuracy: 0.9225
Epoch 5/20
73/73 - 1s - loss: 0.1152 - accuracy: 0.9568 - val_loss: 0.2366 - val_accuracy: 0.9117
Epoch 6/20
73/73 - 1s - loss: 0.0964 - accuracy: 0.9641 - val_loss: 0.2020 - val_accuracy: 0.9203
Epoch 7/20
73/73 - 1s - loss: 0.0811 - accuracy: 0.9717 - val_loss: 0.2291 - val_accuracy: 0.9138


<tensorflow.python.keras.callbacks.History at 0x7fea3ac4e8b0>

In [58]:
model.evaluate(X_val, y_val)[1]



0.9224772453308105

## Positive, Negative and Neutral Tweets
### Exercise 8:
Now, we're going to use the three labels to create the model. But, first, encode the corresponding labels, split the data and transform it to NumPy.

In [59]:

tweets['airline_sentiment'] = tot_tweets['airline_sentiment'].map({'positive' : 2, 'neutral':1, 'negative': 0}).copy()

X_train, X_val, y_train, y_val = train_test_split(tweets['text'], tweets['airline_sentiment'], test_size = 0.2, random_state = 5)
X_train = X_train.to_numpy()
X_val = X_val.to_numpy()
y_train = y_train.to_numpy()
y_val = y_val.to_numpy()

### Exercise 9:
Repeat the previous procedure to create the vectorization layer.

In [60]:
vocab_size=10000
seq_length = 50
# Create a vectorization layer
vectorize_layer = TextVectorization(
    standardize = our_standardization,
    max_tokens = vocab_size,
    output_sequence_length = seq_length
    )
vectorize_layer.adapt(X_train)

## Transform sequences of words to seq of integers and labels to tensor
X_train = vectorize_layer(X_train)
X_val = vectorize_layer(X_val)
y_train = tf.convert_to_tensor(y_train)
y_val = tf.convert_to_tensor(y_val)

## Model `Embeddings`+`Conv1D`+`MaxPooling1D`+`Flatten`+`Dense`
### Exercise 10:
Modify the first model, i.e. `Embeddings`+`Conv1D`+`MaxPooling1D`+`Flatten`+`Dense`, to this problem (be aware of the expected output dimension and loss function).

In [61]:
emb_size = 16
num_filtes = 32
ker_size = 8

inputs = tf.keras.Input(shape = (seq_length, ))
emb = layers.Embedding(input_dim=vocab_size, output_dim=emb_size)(inputs) 
x = layers.Conv1D(filters = num_filtes, kernel_size = ker_size, activation = 'relu')(emb)
x = layers.MaxPooling1D(2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(3, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()

Model: "model_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_5 (InputLayer)         [(None, 50)]              0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 50, 16)            160000    
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 43, 32)            4128      
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 21, 32)            0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 672)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 3)                 2019      
Total params: 166,147
Trainable params: 166,147
Non-trainable params: 0
_____________________________________________________

In [62]:
earlystop = [tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 3,restore_best_weights=True)]

model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs = 20,
    batch_size = 128,
    callbacks=earlystop,
    verbose = 2)


Epoch 1/20
73/73 - 1s - loss: 0.5491 - accuracy: 0.7900 - val_loss: 0.4356 - val_accuracy: 0.8064
Epoch 2/20
73/73 - 1s - loss: 0.3292 - accuracy: 0.8567 - val_loss: 0.2716 - val_accuracy: 0.8917
Epoch 3/20
73/73 - 1s - loss: 0.1941 - accuracy: 0.9252 - val_loss: 0.2061 - val_accuracy: 0.9186
Epoch 4/20
73/73 - 0s - loss: 0.1443 - accuracy: 0.9442 - val_loss: 0.1883 - val_accuracy: 0.9246
Epoch 5/20
73/73 - 0s - loss: 0.1160 - accuracy: 0.9565 - val_loss: 0.2072 - val_accuracy: 0.9177
Epoch 6/20
73/73 - 0s - loss: 0.0967 - accuracy: 0.9636 - val_loss: 0.1929 - val_accuracy: 0.9268
Epoch 7/20
73/73 - 1s - loss: 0.0817 - accuracy: 0.9709 - val_loss: 0.2172 - val_accuracy: 0.9164


<tensorflow.python.keras.callbacks.History at 0x7fea3c5975e0>

In [63]:
model.evaluate(X_val, y_val)[1]



0.9246426820755005