# Using Word Level CNN

In this recipe, we will see how a convolutional neural network (CNN) can be applied to text classification, using word-level features and pre-trained embeddings. We will walk through and implement the approach published in "Convolutional Neural Networks for Sentence Classification" by Yoon Kim. According to this paper, a CNN with just a few layers on top of pre-trained embeddings can achieve excellent results. Let's look at the paper in detail and understand how to leverage CNNs for text-related tasks.

Before going to the implementation, let's understand the model first. The model is shown in the figure below.

![](figures/Using_Word_Level_CNN_for_Text_Classsification.png)

Figure: Architecture of the model, which takes word-level features and performs text classification


Let's assume a given sentence has $n$ words and each word is represented by a $k$-dimensional vector, so the resulting input matrix has size $ n \times k $. All sentences are padded to the same length. This input matrix of size $ n \times k $ is then convolved using filters of different sizes. In our implementation, we will be using filter sizes = [2,3,4]. One more thing to observe here is that the stride is very large. In a regular CNN we hardly go beyond a stride of 4-5, but here a stride of 100 is used. In this model, the stride is always equal to the size of the embedding. By keeping the stride equal to the embedding size, the model learns features of different words separately.
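
To make the effect of such a large stride concrete, the following sketch computes how many positions a filter visits along each axis. It assumes the standard output-size formula for a convolution without padding or dilation (the defaults of `nn.Conv2d`). The helper name `conv_output_size` is hypothetical, and the sentence length of 200 tokens is an assumption based on the truncation length used later in this recipe.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Number of filter positions along one axis (no padding, dilation 1)."""
    return (input_size - kernel_size) // stride + 1

embed_dim = 100      # width of the input matrix (k)
sentence_len = 200   # height of the input matrix (n); assumed padded length

for h in [2, 3, 4]:
    rows = conv_output_size(sentence_len, h, stride=100)
    cols = conv_output_size(embed_dim, h, stride=100)
    print(f"filter size {h}: output is {rows} x {cols}")
```

With these assumed sizes, every filter size yields a 2 x 1 output, which is why the width axis can later be squeezed away before pooling.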

Mathematically, a filter of height $ h \in H; H= \{2,3,4\} $ is selected, with width/stride $k$ equal to the dimension of the embedding vectors. In this way, different features are learnt by choosing different words. Let's say the input matrix is $ X \in \mathbb{R}^{n \times k} $ and a convolution operation with a filter $W$ of height $h$ is applied to it; then the derived feature $C_i$ can be given as:

$$C_i = f(W\bullet X_{i:i+h} + b)$$

Here $ X_{i:i+h} $ is the small portion of the input matrix (a window of $h$ words starting at position $i$) over which the convolution operation is applied, $b$ is the bias term, and $f$ is a non-linearity (ReLU in the implementation below). This operation is applied with different window/kernel sizes and the features are collected. Then 1-dimensional max pooling is applied over the collected features to identify the most striking ones. After max pooling, all the features are concatenated and a feed-forward layer is applied on top.
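
As a minimal sketch of the convolution-then-pool idea described above, the following pure-Python example slides one filter of height 2 over a toy sentence matrix, applies ReLU as the non-linearity, and then takes the maximum over all positions (1D max pooling). The toy numbers, the stride of one word, and the helper names `conv_feature` and `feature_map` are assumptions made for illustration; they are not taken from the paper or from the implementation in this recipe.

```python
def relu(v):
    return max(0.0, v)

def conv_feature(W, window, b):
    """One feature: element-wise product of filter and window, summed, plus bias, then ReLU."""
    total = sum(w * x for w_row, x_row in zip(W, window) for w, x in zip(w_row, x_row))
    return relu(total + b)

def feature_map(X, W, b):
    """Apply the filter to every window of h consecutive words."""
    h = len(W)
    return [conv_feature(W, X[i:i + h], b) for i in range(len(X) - h + 1)]

# Toy sentence: n = 4 words, k = 3 dimensions per word
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]
# One filter of height h = 2 spanning the full embedding width k = 3
W = [[1.0, 0.0, -1.0],
     [0.0, 2.0, 0.0]]
b = 0.5

C = feature_map(X, W, b)   # one feature per window
pooled = max(C)            # max pooling over all positions
print(C, pooled)           # [1.5, 2.5, 2.5] 2.5
```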



# Installation

In [None]:
import json
import random
import os
import chakin
import matplotlib.pyplot as plt
import pandas as pd
import torch
from torchtext import data
from torchtext import vocab
import zipfile

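SEED = 1234  # arbitrary fixed seed (assumed value; the original leaves SEED undefined)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')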

# Preprocessing
We will be using the IMDB Large Movie Review Dataset. This is a dataset for binary sentiment classification containing 25,000 highly polar movie reviews for training and 25,000 for testing. Let's use TorchText to preprocess our data. The pre-processing involves:

1. Splitting the data into two parts, train and test
2. Reading the data using TorchText and applying pre-processing operations such as tokenization and padding
3. Defining data fields
4. Generating the vocabulary
5. Making train and test data iterators
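
The padding and label-encoding steps can be sketched without TorchText as follows. This is an illustration only: the helper names `pad_or_truncate` and `one_hot`, the whitespace tokenizer, and the tiny `max_len` values are assumptions chosen to keep the example readable.

```python
def pad_or_truncate(tokens, max_len, pad_token='<pad>'):
    """Return exactly max_len tokens: pad short inputs, cut long ones."""
    if len(tokens) < max_len:
        return tokens + [pad_token] * (max_len - len(tokens))
    return tokens[:max_len]

def one_hot(label, num_classes=2):
    """Encode an integer class label as a one-hot list."""
    vec = [0] * num_classes
    vec[int(label)] = 1
    return vec

tokens = "A truly wonderful film".lower().split()
short = pad_or_truncate(tokens, max_len=6)   # padded up to 6 tokens
long_ = pad_or_truncate(tokens, max_len=2)   # cut down to 2 tokens
print(short, long_, one_hot(1))
```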

In [None]:
split = 0.80
data_block = []
data_from_file =  pd.read_csv("data/imdb.tsv", sep="\t")
data_block = data_from_file[["review","sentiment"]].values

In [None]:
# random.shuffle can duplicate rows of a 2-D NumPy array, so shuffle a list of rows instead
rows = [list(row) for row in data_block]
random.shuffle(rows)
split_at = int(len(rows) * split)

# write one JSON object per line; the with-blocks close the files when done
with open('data/train.json', 'w') as train_file:
    for review, label in rows[:split_at]:
        train_file.write(json.dumps({'review': review, 'label': label}) + "\n")
with open('data/test.json', 'w') as test_file:
    for review, label in rows[split_at:]:
        test_file.write(json.dumps({'review': review, 'label': label}) + "\n")

In [None]:
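MAX_LEN = 200  # every review is padded or truncated to this many tokens

def tokenize(reviews):
    # identity placeholder; not passed to the REVIEW field, which uses TorchText's default tokenizer
    return reviews

def pad_to_equal(x):
    # x is a list of tokens: pad short reviews with '<pad>', truncate long ones
    if len(x) < MAX_LEN:
        return x + ['<pad>'] * (MAX_LEN - len(x))
    return x[:MAX_LEN]

def to_categorical(x):
    # map a binary label to a one-hot vector: 0 -> [1, 0], 1 -> [0, 1]
    x = int(x)
    if x == 1:
        return [0, 1]
    if x == 0:
        return [1, 0]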

In [None]:
# defining data fields
REVIEW = data.Field(sequential=True , preprocessing = pad_to_equal , use_vocab = True, lower=True,batch_first=True)
LABEL = data.Field(is_target=True,use_vocab = False, sequential=False, preprocessing =to_categorical)
fields = {'review': ('review', REVIEW), 'label': ('label', LABEL)}

# constructing tabular dataset
train_data , test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields)

# constructing vocabulary
REVIEW.build_vocab(train_data, test_data)
LABEL.build_vocab(train_data, test_data)

# making iterator
train_iter, test_iter = data.Iterator.splits(
        (train_data, test_data), sort_key=lambda x: len(x.review),
        batch_sizes=(32,len(test_data)), device=device,)

# Downloading Embeddings
For this experiment, I will be using 100-dimensional GloVe vectors trained on the "Wikipedia + Gigaword 5 (6B)" corpus. I will be using chakin to download the GloVe word vectors. Once the vectors are downloaded, the vocabulary of our train and test splits is mapped to the GloVe vectors using the snippet given below. Remember this method, because we will be using this shortcut in many places in this chapter.
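
Conceptually, mapping a vocabulary to pre-trained vectors means looking up each word in the GloVe file and falling back to a default vector for words that are missing. The sketch below shows this in plain Python. The two-dimensional toy vectors, the zero-vector fallback, and the helper names `parse_glove_lines` and `build_embedding_matrix` are assumptions for illustration, not the TorchText implementation.

```python
def parse_glove_lines(lines):
    """Parse GloVe's text format: a word followed by its space-separated vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def build_embedding_matrix(vocab_words, vectors, dim):
    """One row per vocabulary word; words without a vector get zeros."""
    return [vectors.get(word, [0.0] * dim) for word in vocab_words]

# Toy stand-in for a few lines of a GloVe file (the real vectors used here have 100 dimensions)
toy_lines = ["good 0.1 0.2", "bad -0.3 0.4"]
vectors = parse_glove_lines(toy_lines)
matrix = build_embedding_matrix(['<pad>', 'good', 'bad', 'unseen'], vectors, dim=2)
print(matrix)
```

The row order follows the vocabulary order, so row `i` of the matrix can be copied directly into row `i` of an embedding layer.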

In [None]:
embed_exists = os.path.isfile('../embeddings/glove.6B.zip')
if not embed_exists:
    print("Downloading GloVe embeddings; if the download fails midway, delete `../embeddings/glove.6B.zip` and retry")
    chakin.search(lang='English')
    chakin.download(number=16, save_dir='../embeddings')
    # extract into the directory that vocab.Vectors reads from below
    with zipfile.ZipFile("../embeddings/glove.6B.zip", 'r') as zip_ref:
        zip_ref.extractall("../embeddings/glove.6B/")

## Developing vocab

In [None]:
vec = vocab.Vectors(name = "glove.6B.100d.txt",cache = "../embeddings/glove.6B/")
REVIEW.build_vocab(train_data, test_data, max_size=100000, vectors=vec)

In [None]:
review_vocab = REVIEW.vocab

# The Model

Convolution layers: the embeddings generated in the previous layer for each sentence are passed to the convolution layers given below. For sentiment analysis, the entire review is passed to `Conv2d` layers with different filter sizes [2,3,4], which can be represented as follows:

```python
self.conv12 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 2,stride= 100)
self.conv13 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 3,stride= 100)
self.conv14 = nn.Conv2d(in_channels = 1, out_channels=8, kernel_size= 4,stride= 100)
```

The output of each `Conv2d` layer is passed to a 1D max-pooling layer, and all the resulting features are concatenated as given below:

```python
x1 = self.conv_and_pool(x,self.conv12) 
x2 = self.conv_and_pool(x,self.conv13) 
x3 = self.conv_and_pool(x,self.conv14)
x = torch.cat((x1, x2, x3), 1)
```
Finally, a fully connected layer along with dropout and ReLU is applied to squeeze the features into 2 outputs, equal to the number of final classes.

```python
x = self.dropout(x) # (N, len(Ks)*Co)
logit = F.relu(self.fc1(x)) # (N, C)
logit = torch.softmax(logit, dim=1)
```

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN_Text(nn.Module):
    def __init__(self, embed_num, embed_dim, class_num, kernel_num, kernel_sizes, dropout, static, stride):
        super(CNN_Text, self).__init__()
        self.embed_num = embed_num
        self.embed_dim = embed_dim
        self.class_num = class_num
        self.kernel_num = kernel_num
        self.kernel_sizes = kernel_sizes
        self.static = static

        # embedding layer initialised with the pre-trained GloVe vectors
        self.embedding = nn.Embedding(embed_num, embed_dim)
        self.embedding.weight.data.copy_(review_vocab.vectors)
        # fine-tune the embeddings unless `static` is set
        self.embedding.weight.requires_grad = not static
        # one convolution per kernel size
        self.convs1 = nn.ModuleList([nn.Conv2d(in_channels=1, out_channels=kernel_num, kernel_size=K, stride=stride) for K in kernel_sizes])

        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(len(kernel_sizes)*kernel_num, class_num)

    def conv_and_pool(self, x, conv):
        x = F.relu(conv(x)).squeeze(3)              # (N, Co, L), L = positions left after striding
        x = F.max_pool1d(x, x.size(2)).squeeze(2)   # (N, Co)
        return x

    def forward(self, x):
        x = self.embedding(x)  # (N, W, D)
        x = x.unsqueeze(1)  # (N, Ci, W, D)
        x = [self.conv_and_pool(x, conv) for conv in self.convs1]  # [(N, Co), ...]*len(Ks)
        x = torch.cat(x, 1)  # (N, len(Ks)*Co)

        x = self.dropout(x)
        logit = F.relu(self.fc1(x))   # (N, C)
        logit = torch.softmax(logit, dim=1)
        return logit

# Training Function

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.argmax(preds, dim=1)
    correct = (rounded_preds == torch.argmax(y, dim=1)).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
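
As a quick check of what `binary_accuracy` computes, here is the same argmax comparison in plain Python on a toy batch. The helper name `argmax_accuracy` and the numbers are made up for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def argmax_accuracy(preds, targets):
    """Fraction of rows whose predicted class matches the one-hot target."""
    correct = [argmax(p) == argmax(t) for p, t in zip(preds, targets)]
    return sum(correct) / len(correct)

preds = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]   # softmax outputs
targets = [[1, 0], [0, 1], [0, 1], [0, 1]]                  # one-hot labels
print(argmax_accuracy(preds, targets))  # 0.75: the third row is misclassified
```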

In [None]:
def train(model, iterator, optimizer, criterion):
    """
    To train the model
    """
    epoch_loss = []
    epoch_acc = []
    model.train()
    
    for batch in iterator:
        feature, target = batch.review, batch.label
        optimizer.zero_grad()
        predictions = model(feature)            
        loss = criterion(predictions, target.float())  # BCELoss expects float targets; stays on the model's device
        loss.backward()
        optimizer.step()
        acc = binary_accuracy(predictions, target.float())
        epoch_loss.append(loss.item())
        epoch_acc.append(acc.item())
        
    return model, sum(epoch_loss) / len(epoch_loss), sum(epoch_acc) / len(epoch_acc)

In [None]:
def test_accuracy_calculator(model, test_iterator):
    """
    To calculate test accuracy
    """
    epoch_acc = []
    model.eval()  # disable dropout while evaluating
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in test_iterator:
            feature, target = batch.review, batch.label
            predictions = model(feature)
            acc = binary_accuracy(predictions, target.float())
            epoch_acc.append(acc.item())
    return sum(epoch_acc) / len(epoch_acc)

## Defining Model Parameters

In [None]:
embed_num = len(REVIEW.vocab)
class_num = 2  # binary sentiment, matching the one-hot labels from to_categorical
kernel_sizes = [2, 3, 4]
embed_dim = 100
stride = 100
kernel_num  = 8
dropout = 0.2
static = False

In [None]:
cnn = CNN_Text( embed_num, embed_dim, class_num, kernel_num, kernel_sizes, dropout, static, stride)
cnn = cnn.to(device)  # works on both GPU and CPU-only machines

## Defining optimizer, loss and training loop

In [None]:
optimizer = torch.optim.SGD(cnn.parameters(), lr=0.01, momentum=0.9)
criterion = nn.BCELoss()
criterion = criterion.to(device)
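
For reference, `nn.BCELoss` with its default mean reduction averages $-[y \log p + (1-y)\log(1-p)]$ over all elements of the batch. The following plain-Python sketch reproduces that arithmetic on a toy batch; the helper name `bce_loss` and the numbers are illustrative assumptions.

```python
import math

def bce_loss(preds, targets):
    """Mean binary cross-entropy over every element of the batch."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p_row, t_row in zip(preds, targets)
             for p, t in zip(p_row, t_row)]
    return sum(terms) / len(terms)

preds = [[0.9, 0.1], [0.3, 0.7]]    # softmax outputs for two reviews
targets = [[1, 0], [0, 1]]          # one-hot labels
print(round(bce_loss(preds, targets), 4))
```

Confident, correct predictions (first row) contribute much less to the loss than hesitant ones (second row).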

In [None]:
from tensorboardX import SummaryWriter  # provides export_scalars_to_json, used below

epochs  = 300
log_interval = 1
loss = []
accuracy = []
test_accuracy = []
writer = SummaryWriter()
for i in range(epochs):
    if i!=0 and i%30==0:
        for param_group in optimizer.param_groups:
            param_group['lr'] = param_group['lr']/1.5
            print(" %%% NEW LEARNING RATE : ", param_group['lr']," %%%")
    model, epoch_loss, epoch_acc = train(cnn, train_iter, optimizer, criterion)
    test_acc = test_accuracy_calculator(model, test_iter)
    accuracy.append(epoch_acc)
    loss.append(epoch_loss)
    test_accuracy.append(test_acc)
    writer.add_scalar('epoch_loss',epoch_loss, i)
    writer.add_scalar('test_acc',test_acc, i)
    writer.add_scalar('epoch_acc',epoch_acc, i)
writer.export_scalars_to_json("./all_scalars.json")
writer.close()
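
The manual decay in the loop above divides the learning rate by 1.5 every 30 epochs, which has the closed form $lr_i = lr_0 / 1.5^{\lfloor i/30 \rfloor}$. A small sketch to inspect the schedule before committing to a long run (the helper name `lr_at_epoch` is hypothetical):

```python
def lr_at_epoch(epoch, base_lr=0.01, factor=1.5, every=30):
    """Learning rate in effect during the given (0-based) epoch."""
    return base_lr / (factor ** (epoch // every))

for epoch in [0, 29, 30, 299]:
    print(epoch, lr_at_epoch(epoch))
```

Over 300 epochs the rate is reduced nine times, so the final epochs train with roughly 1/38 of the initial learning rate.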
    

# Plotting
When I applied the above implementation to the IMDB sentiment analysis dataset, it achieved 95+% accuracy on the training data and 75% accuracy on the test data. You may go through the code and correlate it with the original research paper. The progress of loss and accuracy throughout training is shown below.

![](figures/Using_Word_Level_CNN_for_Text_Classsification_result.png)

Figure: Training progress across epochs


In [None]:
plt.plot(accuracy , label = "Train Accuracy")
plt.plot(loss , label = "Train Loss")
plt.plot(test_accuracy, label = "Test Accuracy")
plt.ylabel("Accuracy/Loss")
plt.xlabel("Epoch")
plt.legend(loc='upper left')
plt.show()