# Using CNN for Sentence Classification

[Paper](https://arxiv.org/pdf/1408.5882.pdf): Convolutional Neural Networks for Sentence Classification by Yoon Kim

## Imports

In [1]:
%matplotlib inline
import collections
import math
import numpy as np
import pandas as pd
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve

seed = 54321

## How data is transformed for sentence classification

- Let's assume a sentence of $p$ words. 

- First, we will pad the sentence with some special words (if the length of the sentence is $< n$) to set the sentence length to $n$ words, where $n \geq p$. 

- Next, we will represent each word in the sentence by a vector of size $k$. This vector can be either a one-hot-encoded representation or a learned word vector (for example, from skip-gram, CBOW, or GloVe). 

- Then a batch of $b$ sentences can be represented by a $b \times n \times k$ tensor. 

* **

Let's walk through an example. Consider the following three sentences:
- *Bob and Mary are friends*
- *Bob plays soccer*
- *Mary likes to sing in the choir*

In this example, the third sentence has the most words, so let's set $n=7$, which is the number of words in the third sentence.
Next, we create the one-hot-encoded representation for each word. Here we have $13$ distinct words. Thus we get:
```
Bob: 1,0,0,0,...
and: 0,1,0,0,...
Mary:0,0,1,0,...
...so on...
```
Also, $k = 13$, since each one-hot vector has one entry per distinct word. Finally, we can represent the three sentences as a 3-D matrix of size $3 \times 7 \times 13$, as shown below:

<div align='center'>
    <img src='images/sentence_matrix.png'/>
</div>
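The construction above can be reproduced in a few lines of NumPy. This is a minimal sketch, not part of the original notebook: it assumes whitespace tokenization, all-zero rows for padding positions, and that the third sentence is the 7-word *Mary likes to sing in the choir*, which is what $n=7$ and $13$ distinct words imply.

```python
import numpy as np

sentences = [
    "Bob and Mary are friends",
    "Bob plays soccer",
    "Mary likes to sing in the choir",  # assumed 7-word third sentence
]
tokenized = [s.split() for s in sentences]

# n is the length of the longest sentence
n = max(len(tokens) for tokens in tokenized)

# Assign each distinct word an index in order of first appearance
vocab = {}
for tokens in tokenized:
    for word in tokens:
        vocab.setdefault(word, len(vocab))
k = len(vocab)

# b x n x k tensor; padding positions stay as all-zero rows
batch = np.zeros((len(sentences), n, k), dtype=np.float32)
for i, tokens in enumerate(tokenized):
    for j, word in enumerate(tokens):
        batch[i, j, vocab[word]] = 1.0

print(batch.shape)  # (3, 7, 13)
```

Each real word contributes exactly one `1` to the tensor, so the total number of ones equals the total number of words across the three sentences.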

* **

You could also use word embeddings instead of one-hot encoding here. Representing each word as a one-hot-encoded vector introduces sparsity and wastes memory. Embeddings let the model learn more compact and expressive word representations. This also means that $k$ becomes a hyperparameter (the embedding size) rather than being fixed by the vocabulary size. In the figure above, each column would then be a dense, continuous vector rather than a combination of 0s and 1s.
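To make the relationship concrete, the sketch below shows that an embedding lookup is equivalent to multiplying a one-hot vector by an embedding matrix, just without materializing the sparse vectors. The matrix `E`, the sizes, and the word IDs are made-up values for illustration; in a real model `E` is learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, k = 13, 4                  # k is now a free hyperparameter
E = rng.normal(size=(vocab_size, k))   # hypothetical embedding matrix

word_ids = np.array([0, 5, 6])         # illustrative word IDs for one sentence

# Embedding lookup: pick one row of E per word ID
dense = E[word_ids]                    # shape (3, k)

# Same result via explicit one-hot vectors and a matrix product
one_hot = np.eye(vocab_size)[word_ids]  # shape (3, vocab_size)
via_matmul = one_hot @ E

print(dense.shape, np.allclose(dense, via_matmul))
```

The lookup form stores only `len(word_ids)` integers per sentence, whereas the one-hot form stores `len(word_ids) * vocab_size` numbers.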

## Get the Data

In [2]:
url = 'http://cogcomp.org/Data/QA/QC/'
dir_name = 'data'

def download_data(dir_name, filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
  
    os.makedirs(dir_name, exist_ok=True)
    if not os.path.exists(os.path.join(dir_name,filename)):
        filepath, _ = urlretrieve(url + filename, os.path.join(dir_name,filename))
    else:
        filepath = os.path.join(dir_name, filename)
    
    statinfo = os.stat(filepath)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filepath)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filepath + '. Can you get to it with a browser?')
        
    return filepath

train_filename = download_data(dir_name, 'train_5500.label', 335858)
test_filename = download_data(dir_name, 'TREC_10.label', 23354)

Found and verified data\train_5500.label
Found and verified data\TREC_10.label


## Read & Preprocess Data

In [3]:
def read_data(filename):
    '''
    Read labeled questions from the file with the given filename.
    Returns three parallel lists: lower-cased question strings,
    categories, and sub-categories.
    '''

    # Holds question strings, categories and sub-categories
    # category/sub_category definitions: https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html
    questions, categories, sub_categories = [], [], []     
    
    with open(filename,'r',encoding='latin-1') as f:        
        # Read each line
        for row in f:   
            # Each string has format <cat>:<sub cat> <question>
            # Split by : to separate cat and (sub_cat + question)
            row_str = row.split(":")        
            cat, sub_cat_and_question = row_str[0], row_str[1]
            tokens = sub_cat_and_question.split(' ')
            # The first word in sub_cat_and_question is the sub category
            # rest is the question
            sub_cat, question = tokens[0], ' '.join(tokens[1:])        
            
            questions.append(question.lower().strip())
            categories.append(cat)
            sub_categories.append(sub_cat)
            

    return questions, categories, sub_categories
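The `<cat>:<sub cat> <question>` format can be checked on a single line. The helper below is a hypothetical variant, not the notebook's `read_data`: it uses `str.partition` so that only the first colon and the first space act as separators. The second input line is made up to show a question that itself contains a colon.

```python
def parse_label_line(row):
    """Split one '<cat>:<sub_cat> <question>' line into its three parts."""
    cat, _, rest = row.partition(":")           # split at the first colon only
    sub_cat, _, question = rest.partition(" ")  # first token is the sub-category
    return cat, sub_cat, question.strip().lower()

dataset_style = parse_label_line(
    "ENTY:cremat What films featured the character Popeye Doyle ?\n"
)
made_up = parse_label_line("DESC:def What does the ratio 3 : 1 mean ?\n")

print(dataset_style)
print(made_up)
```

With `row.split(":")`, as used in `read_data`, any text after a second colon ends up in `row_str[2]` and is not part of the returned question, so a partition-based split may be worth considering if such questions matter.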

In [4]:
train_questions, train_categories, train_sub_categories = read_data(train_filename)
test_questions, test_categories, test_sub_categories = read_data(test_filename)

In [5]:
n_samples = 10
print(f"train_questions has {len(train_questions)} questions / {len(train_categories)} labels")
print("Some samples")
for question, cat, sub_cat in zip(train_questions[:n_samples], train_categories[:n_samples], train_sub_categories[:n_samples]):    
    print(f"\t{question} / cat - {cat} / sub_cat - {sub_cat}")
          
print(f"\ntest_questions has {len(test_questions)} questions / {len(test_categories)} labels")
print("Some samples")
for question, cat, sub_cat in zip(test_questions[:n_samples], test_categories[:n_samples], test_sub_categories[:n_samples]):    
    print(f"\t{question} / cat - {cat} / sub_cat - {sub_cat}")

train_questions has 5452 questions / 5452 labels
Some samples
	how did serfdom develop in and then leave russia ? / cat - DESC / sub_cat - manner
	what films featured the character popeye doyle ? / cat - ENTY / sub_cat - cremat
	how can i find a list of celebrities ' real names ? / cat - DESC / sub_cat - manner
	what fowl grabs the spotlight after the chinese year of the monkey ? / cat - ENTY / sub_cat - animal
	what is the full form of .com ? / cat - ABBR / sub_cat - exp
	what contemptible scoundrel stole the cork from my lunch ? / cat - HUM / sub_cat - ind
	what team did baseball 's st. louis browns become ? / cat - HUM / sub_cat - gr
	what is the oldest profession ? / cat - HUM / sub_cat - title
	what are liver enzymes ? / cat - DESC / sub_cat - def
	name the scar-faced bounty hunter of the old west . / cat - HUM / sub_cat - ind

test_questions has 500 questions / 500 labels
Some samples
	how far is it from denver to aspen ? / cat - NUM / sub_cat - dist
	what county is modesto , cal

## Converting train-test text data to `pd.DataFrame` 

In [6]:
# Define training and testing
train_df = pd.DataFrame(
    {'question': train_questions, 'category': train_categories, 'sub_category': train_sub_categories}
)
test_df = pd.DataFrame(
    {'question': test_questions, 'category': test_categories, 'sub_category': test_sub_categories}
)

train_df.head(n=10)

| | question | category | sub_category |
|---|---|---|---|
| 0 | how did serfdom develop in and then leave russ... | DESC | manner |
| 1 | what films featured the character popeye doyle ? | ENTY | cremat |
| 2 | how can i find a list of celebrities ' real na... | DESC | manner |
| 3 | what fowl grabs the spotlight after the chines... | ENTY | animal |
| 4 | what is the full form of .com ? | ABBR | exp |
| 5 | what contemptible scoundrel stole the cork fro... | HUM | ind |
| 6 | what team did baseball 's st. louis browns bec... | HUM | gr |
| 7 | what is the oldest profession ? | HUM | title |
| 8 | what are liver enzymes ? | DESC | def |
| 9 | name the scar-faced bounty hunter of the old w... | HUM | ind |


In [7]:
# Shuffle the data for better randomization
train_df = train_df.sample(frac=1.0, random_state=seed)

### Convert string labels to integer IDs

In [8]:
# Generate the label to ID mapping
unique_cats = train_df["category"].unique()

labels_map = dict(zip(unique_cats, np.arange(unique_cats.shape[0])))

print(f"Label->ID mapping: {labels_map}")

n_classes = len(labels_map)

# Convert all string labels to IDs
train_df["category"] = train_df["category"].map(labels_map)
test_df["category"] = test_df["category"].map(labels_map)

# View some data
train_df.head(n=10)

Label->ID mapping: {'DESC': 0, 'ENTY': 1, 'LOC': 2, 'NUM': 3, 'HUM': 4, 'ABBR': 5}


| | question | category | sub_category |
|---|---|---|---|
| 5267 | what is an aurora ? | 0 | def |
| 21 | what articles of clothing are tokens in monopo... | 1 | other |
| 3258 | what causes rust ? | 0 | reason |
| 1356 | what does an irate car owner call iron oxide ? | 1 | termeq |
| 1529 | what do we call the imaginary line along the t... | 2 | other |
| 3631 | why is hockey so violent ? | 0 | reason |
| 4802 | how many characters makes up a word for typing... | 3 | count |
| 2288 | what peter blatty novel recounts the horrors o... | 1 | cremat |
| 803 | what is measured in curies ? | 0 | def |
| 4472 | what does seccession mean ? | 0 | def |


### Split training data to train and valid subsets

In [9]:
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(train_df, test_size=0.1)
print(f"Train size: {train_df.shape}")
print(f"Valid size: {valid_df.shape}")

# Print data
train_df.head()

Train size: (4906, 3)
Valid size: (546, 3)


| | question | category | sub_category |
|---|---|---|---|
| 3400 | what was franklin roosevelt 's program for eco... | 1 | event |
| 2630 | how many megawatts will the power project in i... | 3 | count |
| 3449 | what is a fear of money ? | 1 | dismed |
| 1640 | what dog was dubbed the mortgage lifter ? | 1 | animal |
| 5194 | the kentucky horse park is close to which amer... | 2 | city |


## Tokenizer & Padding Sentences

### Tokenizer

In [11]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df.question.tolist())

# Vocab size
n_vocab = len(tokenizer.index_word) + 1
print(f"Vocabulary size: {n_vocab}")

Vocabulary size: 7849
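The `+ 1` in `n_vocab` is there because word IDs start at 1 and ID 0 is reserved (it is later used as the padding value). The sketch below is a simplified, pure-Python stand-in for the idea, not Keras's implementation: it splits on whitespace and ranks words by frequency, whereas Keras's `Tokenizer` also lowercases text and filters punctuation by default.

```python
from collections import Counter

texts = ["what is an aurora ?", "what causes rust ?"]

# Count word frequencies over the whole corpus
counts = Counter(word for text in texts for word in text.split())

# More frequent words get smaller IDs; IDs start at 1, so 0 stays free
word_index = {
    word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)
}

# One extra slot for the reserved ID 0
n_vocab_sketch = len(word_index) + 1

# Convert a text to a sequence of IDs, skipping unknown words
sequence = [word_index[w] for w in "what is rust ?".split() if w in word_index]

print(word_index)
print(n_vocab_sketch, sequence)
```

An embedding layer sized with `n_vocab_sketch` rows can then be indexed by every ID from 0 up to the largest word ID.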


### Find the sequence length

Here we look at the 1st and 99th percentiles of the sequence lengths. We will use the 99th percentile as our maximum sequence length.

In [13]:
# Split each string by " ", compute length of the list, get the percentiles
train_df["question"].str.split(" ").str.len().describe(percentiles=[0.01, 0.5, 0.99])

count    4906.000000
mean       10.060742
std         3.771990
min         2.000000
1%          4.000000
50%        10.000000
99%        22.000000
max        37.000000
Name: question, dtype: float64
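Choosing the 99th percentile rather than the maximum keeps sequences short while truncating only a small fraction of questions. The sketch below illustrates that trade-off on a made-up list of lengths (not the real dataset), using `np.percentile`.

```python
import numpy as np

# Hypothetical sequence lengths: mostly short, with one long outlier
lengths = np.array([4] * 10 + [10] * 80 + [22] * 9 + [37])

# Candidate maximum length: the 99th percentile, truncated to an integer
max_len = int(np.percentile(lengths, 99))

# Fraction of sequences that fit without truncation
coverage = float(np.mean(lengths <= max_len))

print(max_len, coverage, int(lengths.max()))
```

Here a single outlier of length 37 would otherwise force every sequence to be padded to 37 positions.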

### Padding Shorter Sentences

We pad short sentences so that all sentences have the same length.

We feed the model a batch of questions at a time, and it is very unlikely that all questions in a batch have the same number of tokens. Without equal lengths, the batch cannot be stacked into a single tensor. To solve this, we pad shorter sequences with a special value and truncate sequences longer than a specified length, using the `tf.keras.preprocessing.sequence.pad_sequences()` function. Its main arguments are:
- `sequences` - list of lists of integers; each inner list is one sequence
- `maxlen` - length of every sequence after padding and truncation
- `padding` - whether to pad at the beginning (`pre`) or end (`post`)
- `truncating` - whether to truncate at the beginning (`pre`) or end (`post`)
- `value` - the value used for padding
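The effect of `padding='post'` and `truncating='post'` can be illustrated without TensorFlow. The function below is a hypothetical pure-Python sketch of that one configuration, not the Keras implementation, and the input IDs are made up.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate at the end so every row has length maxlen."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                       # drop tokens past maxlen
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the rest
    return padded

rows = pad_post([[5, 3], [7, 1, 4, 9, 2]], maxlen=4)
print(rows)  # [[5, 3, 0, 0], [7, 1, 4, 9]]
```

The short sequence gains two trailing zeros and the long one loses its last token, so both rows can be stacked into one rectangular array.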

In [15]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Convert each list of tokens to a list of IDs, using tokenizer's mapping
train_sequences = tokenizer.texts_to_sequences(train_df["question"].tolist())
train_labels = train_df["category"].values

valid_sequences = tokenizer.texts_to_sequences(valid_df["question"].tolist())
valid_labels = valid_df["category"].values

test_sequences = tokenizer.texts_to_sequences(test_df["question"].tolist())
test_labels = test_df["category"].values


# The 99th percentile of the sequence lengths in the training corpus is 22,
# which is why we pick 22 as max_seq_length
max_seq_length = 22

# Pad shorter sentences and truncate longer ones (max length: max_seq_length)
preprocessed_train_sequences = pad_sequences(train_sequences, 
                                             maxlen=max_seq_length, 
                                             padding='post', 
                                             truncating='post')

preprocessed_valid_sequences = pad_sequences(valid_sequences, 
                                             maxlen=max_seq_length, 
                                             padding='post', 
                                             truncating='post')

preprocessed_test_sequences = pad_sequences(test_sequences, 
                                            maxlen=max_seq_length, 
                                            padding='post', 
                                            truncating='post')

In [17]:
preprocessed_train_sequences.shape

(4906, 22)

## Sentence Classifying CNN

### Convolution Operation

To learn a rich set of features, we use parallel convolution layers with different filter sizes. In the simplified view used here, each convolution layer outputs a hidden vector of size $1 \times n$ (where $n$ is the number of words per sentence after padding), and we concatenate these outputs to form an input of size $q \times n$ for the next layer, where $q$ is the number of parallel layers. A larger $q$ lets the model capture phrases of more lengths, which often helps, although it also adds parameters and does not guarantee better performance.

* **

The value of convolving can be understood in the following manner. Think about the movie rating learning problem (with two classes, positive or negative), and we have the following sentences:

- *I like the movie, not too bad*
- *I did not like the movie, bad*

Now imagine a convolution window of size 5. Let’s bin the words according to the movement of the convolution window.

The sentence *I like the movie, not too bad* gives:
- [I, like, the, movie, ‘,’]
- [like, the, movie, ‘,’, not]
- [the, movie, ‘,’, not, too]
- [movie, ‘,’, not, too, bad]

The sentence *I did not like the movie, bad* gives:
- [I, did, not, like, the]
- [did, not, like, the, movie]
- [not, like, the, movie, ‘,’]
- [like, the, movie, ‘,’, bad]


For the first sentence, windows such as the following convey that the rating is positive:
- [I, like, the, movie, ‘,’] ; [movie, ‘,’, not, too, bad]

However, for the second sentence, windows such as the following convey negativity in the rating:
- [did, not, like, the, movie]
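The windows listed above can be generated mechanically. The sketch below assumes the comma is treated as its own token, as in the lists above; `sliding_windows` is a hypothetical helper name.

```python
def sliding_windows(tokens, size):
    """Return every contiguous window of `size` tokens, moving one step at a time."""
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

first = ["I", "like", "the", "movie", ",", "not", "too", "bad"]
second = ["I", "did", "not", "like", "the", "movie", ",", "bad"]

first_windows = sliding_windows(first, 5)
second_windows = sliding_windows(second, 5)

for window in first_windows + second_windows:
    print(window)
```

Each sentence has 8 tokens, so a window of size 5 with stride 1 yields 8 - 5 + 1 = 4 windows per sentence.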

* **

- We can see patterns that help classify ratings because the word order (spatial information) is preserved.

  - For example, a technique such as bag-of-words, which discards word order when computing sentence representations, would give the two sentences above highly similar representations.

- The convolution operation plays an important role in preserving the spatial information of the sentences. 

- With $q$ parallel layers of different filter sizes, the network learns to extract the rating from phrases of different lengths, which tends to improve performance.
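The bag-of-words point can be checked numerically. The sketch below builds word-count vectors for the two example sentences (with the comma as its own token, an assumption carried over from the window lists) and compares them with cosine similarity.

```python
import math
from collections import Counter

first = ["I", "like", "the", "movie", ",", "not", "too", "bad"]
second = ["I", "did", "not", "like", "the", "movie", ",", "bad"]

def cosine(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

similarity = cosine(Counter(first), Counter(second))
print(round(similarity, 3))  # 0.875
```

The two sentences share 7 of their 8 tokens, so their bag-of-words vectors are nearly identical even though the sentiments are opposite.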


<div align='center'>
    <img src='images/conv_op.png'/>
</div>

### Pooling Over Time

The pooling operation is designed to subsample the outputs produced by the previously discussed parallel convolution layers.

Let's assume the output of the last layer, $h$, is of size $q \times n$. The pooling-over-time layer then produces an output $h'$ of size $q \times 1$.

Simply put, the pooling-over-time operation creates a vector by concatenating the maximum element of each convolution layer's output. 

<div align='center'>
    <img src='images/pooling_over_time.png'/>
</div>
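In array terms, pooling over time is a maximum along the sequence axis. The sketch below uses a made-up $q \times n$ matrix with $q = 3$ and $n = 4$.

```python
import numpy as np

# Hypothetical outputs of q = 3 parallel convolution layers over n = 4 positions
h = np.array([
    [0.1, 0.9, 0.3, 0.0],
    [0.5, 0.2, 0.7, 0.4],
    [0.0, 0.1, 0.05, 0.6],
])

# Max over the time (position) axis, keeping a q x 1 shape
h_pooled = h.max(axis=1, keepdims=True)

print(h_pooled.shape)     # (3, 1)
print(h_pooled.ravel())   # [0.9 0.7 0.6]
```

The result no longer depends on where in the sentence the strongest response occurred, only on how strong it was.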

### Model Architecture

<div align='center'>
    <img src='images/sentence_classification_cnn_architecture.png'/>
</div>

## Build Sentence Classifying CNN

We are going to implement a very simple CNN to classify sentences. Even with this simple structure, the model reaches good accuracy (about 88.6% on the test set, as shown at the end). 

**Our CNN has one convolution stage made up of 3 parallel convolution layers (kernel sizes 3, 4, and 5). This is followed by a pooling-over-time layer and finally a fully connected softmax layer that produces the class probabilities.**
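As a sanity check on this architecture, the parameter counts can be worked out by hand. The sketch below takes its sizes from the notebook (vocabulary size 7849, embedding size 64, 100 filters per convolution, kernel sizes 3, 4, and 5, and 6 classes) and uses the standard formulas for embedding, 1-D convolution, and dense layers with bias terms.

```python
n_vocab, emb_size = 7849, 64
n_filters, kernel_sizes, n_classes = 100, (3, 4, 5), 6

# Embedding: one vector of size emb_size per vocabulary entry
embedding_params = n_vocab * emb_size

# Conv1D: kernel_size * in_channels * filters weights, plus one bias per filter
conv_params = [ks * emb_size * n_filters + n_filters for ks in kernel_sizes]

# After concatenation and pooling over time: one value per filter
pooled_features = n_filters * len(kernel_sizes)

# Dense: weights plus one bias per class
dense_params = pooled_features * n_classes + n_classes

total_params = embedding_params + sum(conv_params) + dense_params

print(embedding_params, conv_params, dense_params, total_params)
# 502336 [19300, 25700, 32100] 1806 581242
```

The embedding and the first two convolution counts agree with the `Param #` column printed by `cnn_model.summary()` in the notebook; most of the parameters sit in the embedding layer.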

In [18]:
import tensorflow.keras.backend as K
import tensorflow.keras.layers as layers
import tensorflow.keras.regularizers as regularizers
from tensorflow.keras.models import Model

In [19]:
K.clear_session()

# Input layer takes word IDs as inputs
word_id_inputs = layers.Input(shape=(max_seq_length,), dtype='int32')

# Get the embeddings of the inputs / out [batch_size, sent_length, output_dim]
embedding_out = layers.Embedding(input_dim=n_vocab, output_dim=64)(word_id_inputs)

# For all layers: in [batch_size, sent_length, emb_size] / out [batch_size, sent_length, 100]
conv1_1 = layers.Conv1D(100, kernel_size=3, 
                        strides=1, padding='same', 
                        activation='relu')(embedding_out)

conv1_2 = layers.Conv1D(100, kernel_size=4,
                        strides=1, padding='same', 
                        activation='relu')(embedding_out)

conv1_3 = layers.Conv1D(100, kernel_size=5,
                        strides=1, padding='same',
                        activation='relu')(embedding_out)

# in: the three conv outputs / out [batch_size, sent_length, 300]
conv_out = layers.Concatenate(axis=-1)([conv1_1, conv1_2, conv1_3])

# Pooling over time operation. This is doing the max pooling over sequence length
# in other words, each feature map results in a single output
# in [batch_size, sent_length, 300] / out [batch_size, 1, 300]
pool_over_time_out = layers.MaxPool1D(pool_size=max_seq_length,
                                      padding='valid')(conv_out)

# Flatten the unit length dimension
flatten_out = layers.Flatten()(pool_over_time_out)

# Compute the final output
out = layers.Dense(n_classes, activation='softmax',
                   kernel_regularizer=regularizers.l2(l2=0.001))(flatten_out)


# Define the model
cnn_model = Model(inputs=word_id_inputs, outputs=out)


# Compile the model with loss/optimizer/metrics
cnn_model.compile(loss='sparse_categorical_crossentropy', 
                  optimizer='adam', 
                  metrics=['accuracy'])

cnn_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, 22)]         0           []                               
                                                                                                  
 embedding (Embedding)          (None, 22, 64)       502336      ['input_1[0][0]']                
                                                                                                  
 conv1d (Conv1D)                (None, 22, 100)      19300       ['embedding[0][0]']              
                                                                                                  
 conv1d_1 (Conv1D)              (None, 22, 100)      25700       ['embedding[0][0]']              
                                                                                              

## Training the model

- `ReduceLROnPlateau` - Reduces the learning rate when no improvement is detected

    - The technique we'll be using is known as "decaying the learning rate." The idea is to reduce the learning rate (by some fraction) whenever the model has stopped improving. The following callback helps us do this:

In [20]:
# callbacks
lr_reduce_callback = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss',
                                                         patience=3, verbose=1,
                                                         min_delta=0.0001, min_lr=0.000001)


# Train the model
cnn_model.fit(preprocessed_train_sequences, train_labels,
              validation_data=(preprocessed_valid_sequences, valid_labels),
              batch_size=128,
              epochs=25,
              callbacks=[lr_reduce_callback])

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 9: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 12: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 15: ReduceLROnPlateau reducing learning rate to 1.0000000656873453e-06.
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 18: ReduceLROnPlateau reducing learning rate to 1e-06.
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


<keras.callbacks.History at 0x254ba090ac0>
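The reductions in the log above (0.001 to 0.0001 to 0.00001 to 0.000001, then held at `min_lr`) follow a simple rule. The function below is a simplified, hypothetical simulation of that rule, not the Keras source: it assumes a starting rate of 0.001 (Adam's default), a reduction factor of 0.1 (the callback's default `factor`), and no cooldown, and the validation losses are made up.

```python
def simulate_reduce_lr(val_losses, lr=1e-3, factor=0.1,
                       patience=3, min_delta=1e-4, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:      # meaningful improvement
            best = loss
            wait = 0
        else:                            # plateau: count the stalled epoch
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
        history.append(lr)
    return history

# Improves once, then stalls for three epochs: one reduction at the end
schedule = simulate_reduce_lr([1.0, 0.8, 0.8, 0.8, 0.8])
print(schedule)

# A long plateau: the rate keeps shrinking until it reaches min_lr
flat = simulate_reduce_lr([1.0] * 14)
print(flat[-1])
```

With `patience=3`, a fully stalled model has its rate cut every third epoch, which matches the spacing of the reductions at epochs 9, 12, 15, and 18 in the log.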

## Test the model on test data

> Test Accuracy: 88.6% for 500 test sentences

In [21]:
cnn_model.evaluate(preprocessed_test_sequences, test_labels, return_dict=True)



{'loss': 0.39220285415649414, 'accuracy': 0.8859999775886536}
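To inspect individual predictions rather than the aggregate accuracy, the softmax outputs can be mapped back to category names. The sketch below uses made-up probabilities and true labels for two questions; in the notebook, such an array would come from something like `cnn_model.predict(preprocessed_test_sequences)`. The `labels_map` dictionary is copied from the printed label-to-ID mapping.

```python
import numpy as np

labels_map = {'DESC': 0, 'ENTY': 1, 'LOC': 2, 'NUM': 3, 'HUM': 4, 'ABBR': 5}
id_to_label = {idx: name for name, idx in labels_map.items()}

# Hypothetical softmax outputs, one row per question
probs = np.array([
    [0.05, 0.05, 0.10, 0.70, 0.05, 0.05],
    [0.60, 0.10, 0.10, 0.10, 0.05, 0.05],
])
true_ids = np.array([3, 1])  # hypothetical ground-truth IDs

# The predicted class is the one with the highest probability
pred_ids = probs.argmax(axis=-1)
pred_labels = [id_to_label[int(i)] for i in pred_ids]

# Accuracy is the fraction of predictions that match the labels
accuracy = float((pred_ids == true_ids).mean())

print(pred_labels, accuracy)  # ['NUM', 'DESC'] 0.5
```

The same comparison, applied to all 500 test questions, is what the `accuracy` value reported by `evaluate` summarizes.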