# Convolutional Neural Network for text classification

Convolutional Neural Networks (CNNs) were originally developed for computer vision tasks. [Kim](https://arxiv.org/pdf/1408.5882.pdf) applied CNNs to problems in Natural Language Processing and obtained good results. In this tutorial, we show how to implement a CNN for an NLP classification task.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -U torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/56/74/6fc9dee50f7c93d6b7d9644554bdc9692f3023fa5d1de779666e6bf8ae76/torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1MB)
ERROR: torchvision 0.9.0+cu101 has requirement torch==1.8.0, but you'll have torch 1.8.1 which is incompatible.
ERROR: torchtext 0.9.0 has requirement torch==1.8.0, but you'll have torch 1.8.1 which is incompatible.
Installing collected packages: torch
  Found existing installation: torch 1.8.0+cu101
    Uninstalling torch-1.8.0+cu101:
      Successfully uninstalled torch-1.8.0+cu101
Successfully installed torch-1.8.1


In [None]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


## Import required Python libraries

In [None]:
import torch
import torchtext
from torchtext.legacy.data import Field, LabelField  # For torchtext<=0.8.x, import these from `torchtext.data` instead
from torchtext.legacy.data import TabularDataset
from torchtext.legacy.data import Iterator, BucketIterator
import spacy
import en_core_web_sm
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as autograd
from tqdm import tqdm, trange
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix

In [None]:
## Set seed of randomization and working device
manual_seed = 77
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
    torch.cuda.manual_seed(manual_seed)

cuda


## Load dataset

In this tutorial, we use the corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). HappyDB is a dataset of about 100,000 `happy moments` crowd-sourced via Amazon’s Mechanical Turk where each worker was asked to describe in a complete sentence `what made them happy in the past 24 hours`. Each user was asked to describe three such moments. 
In this tutorial, we focus on `sociality classification`. Sociality refers to `whether or not other people than the author are involved in the emotion situation`. For example, an emotion experience with a sociality value "yes" (i.e., other people are involved) could teach us about social groups (e.g., families) and the range of emotions expressed during specific types of situations (e.g., wedding, death). 

We use only the labelled portion of the dataset, which contains 10,560 samples.

We have already preprocessed the moments (tokenization, removal of URLs, mentions, hashtags, and so on; they are stored in a field named `tweet` in the code below) and placed them in the ``./happy_db`` folder as three files: ``train.tsv``, ``dev.tsv`` and ``test.tsv``. We split the labelled data into an 80% training set (8,448 moments) and a 20% development set (2,112 moments).

Tokenizer

In [None]:
spacy_en = en_core_web_sm.load()
def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

`Fields`

In [None]:
TEXT = Field(sequential=True, tokenize=tokenize_en, lower=True)
LABEL = Field(sequential=False, unk_token = None)

Load dataset using `TabularDataset`

In [None]:
# your code goes here 
train, val, test = TabularDataset.splits(
               path="./drive/My Drive/Colab Notebooks/happy_db/", # the root directory where the data lies
               train='train.tsv', validation="dev.tsv", test="test.tsv", # file names
               format='tsv',
               skip_header=True, # if your tsv file has a header, make sure to pass this to ensure it doesn't get processed as data!
               fields=[('tweet', TEXT), ('label', LABEL)])

Build your vocabulary to map words and labels to integers. 

In [None]:
TEXT.build_vocab(train, min_freq=2)
LABEL.build_vocab(train)

In [None]:
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))

Vocabulary size of TEXT: 3546
Vocabulary size of LABEL: 2
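To make the word-to-integer mapping concrete, here is a minimal pure-Python sketch of what building a vocabulary with a minimum frequency amounts to. It is an illustration only, not torchtext's implementation: the helper name `build_vocab`, the toy corpus, and the placement of the `<unk>` and `<pad>` tokens at the front are assumptions made for this example.

```python
from collections import Counter

def build_vocab(tokenized_texts, min_freq=2, specials=("<unk>", "<pad>")):
    """Map each sufficiently frequent token to an integer id (hypothetical helper)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Special tokens come first, then tokens by descending frequency (ties alphabetical).
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

toy_corpus = [["i", "like", "this", "movie"],
              ["i", "like", "my", "dog"],
              ["this", "dog", "is", "happy"]]
stoi, itos = build_vocab(toy_corpus, min_freq=2)

# Tokens seen fewer than min_freq times fall back to the <unk> id.
encoded = [stoi.get(tok, stoi["<unk>"]) for tok in ["i", "like", "cats"]]
print(stoi)
print(encoded)
```

Raising `min_freq` shrinks the vocabulary (and therefore the embedding table) at the cost of mapping more rare words to the unknown token.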


Construct the Iterators to get the train, dev, and test splits. Use `BucketIterator` to initialize the Iterators for the train, dev, and test data.

In [None]:
train_iter, val_iter, test_iter = BucketIterator.splits(
    (train, val, test),  # the datasets we want the iterators to draw data from
    batch_sizes=(64, 256, 256),
    # sort_key: a key used to sort examples so that examples with similar
    # lengths are batched together, which minimizes padding
    sort_key=lambda x: len(x.tweet),
    sort=True,
    sort_within_batch=True
)

In [None]:
# create a single batch and terminate the loop
for batch in train_iter:
    tweets = batch.tweet
    labels = batch.label
    break  #we use first batch as an example.
    
print('tweets:', tweets.shape)
print('labels:', labels.shape)

# tweets: [length, batch size]
# labels: [batch size]

tweets: torch.Size([4, 64])
labels: torch.Size([64])
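The `[length, batch size]` layout and the role of padding can be illustrated without torchtext. The sketch below uses `torch.nn.utils.rnn.pad_sequence` on three made-up sentences; the token ids and the padding index `PAD_ID = 1` are assumptions chosen for the example, not values taken from the HappyDB vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three already-numericalized toy sentences of different lengths (made-up ids).
sentences = [torch.tensor([5, 8, 2, 9]),
             torch.tensor([7, 3]),
             torch.tensor([4, 6, 10])]

PAD_ID = 1  # assumed padding index for this sketch

# batch_first=False gives the [length, batch size] layout used above.
batch = pad_sequence(sentences, batch_first=False, padding_value=PAD_ID)
print(batch.shape)            # [longest sentence, number of sentences]
print(batch[:, 1].tolist())   # the short sentence followed by padding

# Fraction of the batch that is padding; grouping similar lengths keeps this small.
pad_fraction = (batch == PAD_ID).float().mean().item()
print(pad_fraction)
```

Batching sentences of similar length, as `BucketIterator` does through `sort_key`, keeps the padded fraction low, so less computation is spent on padding tokens.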


## CNN model for NLP

Instead of image pixels, the input to most NLP tasks is sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character or byte-pair. That is, each row is a vector that represents a word. Usually, these vectors are **word embeddings** such as word2vec or GloVe. You can also train your word embedding layer from scratch. For a 10-word sentence using a 300-dimensional embedding, we would have a `10 × 300` matrix as our input. 

In vision, kernels slide over local patches of an image, but in NLP we typically use kernels that slide over full rows of the matrix (words). Thus, the "width" of our kernels is the same as the width of the input matrix (i.e., the word embedding size). The height (or region size) of the kernels may vary, but sliding windows over 2-5 words at a time is typical.
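One way to picture a kernel that covers full rows is as a window moving over consecutive words. The plain-Python sketch below lists those windows for a 7-token sentence; `sliding_windows` is a hypothetical helper written for this illustration and involves no actual convolution.

```python
def sliding_windows(tokens, region_size):
    """Return every run of `region_size` consecutive tokens (hypothetical helper)."""
    return [tokens[i:i + region_size]
            for i in range(len(tokens) - region_size + 1)]

tokens = ["i", "like", "this", "movie", "very", "much", "!"]
for region_size in (2, 3, 4):
    windows = sliding_windows(tokens, region_size)
    # A kernel with this region size produces one value per window.
    print(region_size, len(windows), windows[0])
```

A sentence of length `n` yields `n - region_size + 1` windows, which is why kernels with different region sizes produce feature maps of different lengths.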

**Here is a simple example of a CNN for NLP.**

1. Each input token (word) is represented by a 5-dimensional vector. The sentence length is 7. Hence, the input matrix of the CNN is a `7 x 5` tensor.
2. We use three kernel (or filter) region sizes: `2, 3 and 4`, each of which has `2` kernels. Each kernel performs a convolution on the sentence matrix and generates a (variable-length) feature map.
3. Then `1-max pooling` is performed over each map (i.e., the largest number from each feature map is recorded). Thus a univariate feature is generated from each of the six maps.
4. These 6 features are **concatenated** to form a feature vector for the linear layer.
5. The final **softmax fully connected layer** then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.

Source: Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

![](http://www.wildml.com/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-12.05.40-PM-1024x937.png)

Picture Courtesy: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

**Let's build this model step by step.**

#### First, we need to create an input tensor and a label tensor.

In [None]:
input_sentence = 'I like this movie very much!'

In [None]:
text_token = TEXT.preprocess(input_sentence)
print("text_token:", text_token)
input_tensor = TEXT.process([text_token])
print(input_tensor)
print("shape of input_tensor: ", input_tensor.shape)

text_token: ['i', 'like', 'this', 'movie', 'very', 'much', '!']
tensor([[  2],
        [147],
        [ 45],
        [124],
        [ 33],
        [189],
        [ 63]])
shape of input_tensor:  torch.Size([7, 1])


**Create a label tensor.**

In [None]:
label_token = LABEL.preprocess('no')
print("label_token:", label_token)
label_tensor = LABEL.process([label_token])
print("shape of label_tensor: ", label_tensor.shape)

label_token: no
shape of label_tensor:  torch.Size([1])


#### The first layer is an embedding layer, which embeds each token into a 5-D vector.

 We will use [``torch.nn.Embedding module``](https://pytorch.org/docs/stable/nn.html#embedding) to store word vectors corresponding to words in the vocabulary.

In [None]:
VOCAB_SIZE = len(TEXT.vocab.stoi)
print("VOCAB_SIZE:",VOCAB_SIZE)
# Note, the parameters to Embedding class below are:
# num_embeddings (int): size of the dictionary of embeddings
# embedding_dim (int): the size of each embedding vector
# For more details on Embedding class, see: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/sparse.py
embedding = nn.Embedding(VOCAB_SIZE, 5)

VOCAB_SIZE: 3552


Let us now feed the tensors of our sample batch to the embedding module and extract the sequence of word embeddings for each tweet.

In [None]:
# print tensor containing word ids for our batch
print("input_tensor: ", input_tensor.shape)

# feed the "input_tensor" tensor to the embedding module
input_embeddings = embedding(input_tensor)

# print the dimensions of input_embeddings
print("tweet input word embeddings size: ", input_embeddings.size()) 

# first dimension - number of words in that single example (7)
# second dimension - batch size (1)
# third dimension - number of features for a word (5) (or word embedding size)

input_tensor:  torch.Size([7, 1])
tweet input word embeddings size:  torch.Size([7, 1, 5])
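An embedding layer is essentially a trainable lookup table. The sketch below, with toy sizes chosen only for illustration, checks that calling the layer is the same as indexing rows of its weight matrix and that a repeated token id always receives the same vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_embedding = nn.Embedding(num_embeddings=10, embedding_dim=5)  # toy sizes

token_ids = torch.tensor([[2], [7], [2]])   # [sequence length = 3, batch size = 1]
vectors = toy_embedding(token_ids)          # [3, 1, 5]

# Each output row is just the matching row of the weight matrix.
same_as_indexing = torch.equal(vectors, toy_embedding.weight[token_ids])
# The same id always maps to the same vector.
repeated_id_matches = torch.equal(vectors[0], vectors[2])
print(vectors.shape, same_as_indexing, repeated_id_matches)
```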


#### Convolutional layers

This `input_embeddings` is the input tensor of CNN.

Our input is a tensor with 1 channel and 2 dimensions, so we will use [`nn.Conv2d()`](https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d) to convolve the input data.

Note that the input tensor of `nn.Conv2d()` should have size `[batch size, # of input channels, height, width]`. Here, `height` is the sequence length of our sentence (i.e., 7), and `width` is the embedding size (i.e., 5). But the shape of `input_embeddings` is `[sequence length, batch size, embedding size]`. So we need to `permute` the dimensions and `unsqueeze` a channel dimension.

In [None]:
input_embeddings = input_embeddings.permute(1,0,2)
print(input_embeddings.shape)
input_embeddings = input_embeddings.unsqueeze(1)
print(input_embeddings.shape)

# first dimension - number of examples (1)
# second dimension - number of input channels (1)
# third dimension - sequence length (7)
# fourth dimension - word embedding size (5)

torch.Size([1, 7, 5])
torch.Size([1, 1, 7, 5])
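It is worth seeing why `permute` is used here rather than a reshape. In the sketch below the toy values encode which example each entry belongs to (10, 11, 12 for example 0 and 20, 21, 22 for example 1); the tensor itself is made up for this illustration.

```python
import torch

# Toy "embeddings": 3 time steps, batch of 2, embedding size 1,
# so the values themselves show which example each entry belongs to.
x = torch.tensor([[[10.], [20.]],
                  [[11.], [21.]],
                  [[12.], [22.]]])          # [sequence length, batch size, emb]

permuted = x.permute(1, 0, 2)               # [batch size, sequence length, emb]
reshaped = x.reshape(2, 3, 1)               # same shape, but rows are scrambled

print(permuted[0].flatten().tolist())       # tokens of example 0 only
print(reshaped[0].flatten().tolist())       # mixes example 0 and example 1

with_channel = permuted.unsqueeze(1)        # [batch size, 1, sequence length, emb]
print(with_channel.shape)
```

`permute` swaps axes while keeping every token attached to its own example, whereas `reshape` only reinterprets the flat memory order and would silently mix sentences within a batch.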


To initialize `nn.Conv2d()`, you need to specify some hyperparameters.

`in_channels` is the number of channels of the input tensor, and `out_channels` is the number of channels of the output tensor. You can think of `out_channels` as the number of kernels (or filters) we use. In this example, `in_channels` is 1 and each convolutional layer has 2 output channels (the full model later in this tutorial uses 32).

The `kernel_size` argument is the size of the convolutional kernel. Because we want kernels whose height and width differ, we pass a tuple `(height, width)` (e.g., `(2, 5)`).

The `stride` controls the stride for the cross-correlation. We use stride of 1 here.

We will not pad the input matrix. 

We have three different kernel region sizes: 2, 3 and 4, each of which has 2 kernels. Each kernel performs a convolution on the sentence matrix separately. Namely, we need three convolutional layers.

The first convolutional layer has a region size of 2 (i.e., the height is 2) and 2 kernels. In NLP tasks, we typically use kernels that slide over full rows of the matrix (words). Thus, the "width" of our kernels is the same as the width of the input matrix (i.e., the embedding size).

In [None]:
convolute_region2 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(2,5), stride=1)
convolute_region2_output = convolute_region2(input_embeddings)
print(convolute_region2_output.shape)

torch.Size([1, 2, 6, 1])


The first dimension is the batch size, i.e., 1.

The second dimension is the number of output channels (number of kernels), i.e., 2.

The third dimension is the output height, i.e., 6. You can use this formula to calculate it, where $K$ is the kernel size along that axis, $P$ the padding and $S$ the stride:
$$H_{out}=\frac{(H_{in}-K+2P)}{S}+1 = \frac{(7-2+2\times0)}{1}+1 = 6$$ 

The fourth dimension is the output width, i.e., 1. The same formula applies along the width:
$$W_{out}=\frac{(W_{in}-K+2P)}{S}+1 = \frac{(5-5+2\times0)}{1}+1 = 1$$ 
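The output-size formula is easy to check in a few lines of plain Python. `conv_output_size` is a hypothetical helper written for this sketch; it uses integer (floor) division, which matters only when the stride does not divide evenly.

```python
def conv_output_size(size_in, kernel, padding=0, stride=1):
    """Output length along one axis: floor((size_in - kernel + 2*padding) / stride) + 1."""
    return (size_in - kernel + 2 * padding) // stride + 1

seq_len, emb_dim = 7, 5
for region_size in (2, 3, 4):
    height = conv_output_size(seq_len, region_size)
    width = conv_output_size(emb_dim, emb_dim)
    print(region_size, (height, width))
```

For the 7-token sentence this gives heights 6, 5 and 4 for region sizes 2, 3 and 4, with a width of 1 in every case.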

Then, we will apply the `ReLU` function on the output of convolutional layer. 

In [None]:
ReLU_fn = nn.ReLU()
convolute_region2_output = ReLU_fn(convolute_region2_output)
print("convolute_region2_output:", convolute_region2_output.shape)

convolute_region2_output: torch.Size([1, 2, 6, 1])


The output width is always 1 because the "width" of our kernels is the same as the embedding size.
Hence, we can squeeze the last dimension.

In [None]:
convolute_region2_output = convolute_region2_output.squeeze(-1)
print("convolute_region2_output:", convolute_region2_output.shape)

convolute_region2_output: torch.Size([1, 2, 6])
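To see what the convolution computes, the sketch below recomputes the output of a `(2, 5)` kernel by hand and compares it with `nn.Conv2d`. A random tensor stands in for the sentence embeddings here, so the numbers differ from the cell above; only the mechanics are the same.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sentence = torch.randn(1, 1, 7, 5)                     # [batch, channel, length, emb]
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(2, 5))

with torch.no_grad():
    library_output = conv(sentence).squeeze(-1)        # [1, 2, 6]

    # Recompute by hand: for each kernel and each window of 2 rows,
    # multiply element-wise with the kernel weights, sum, and add the bias.
    manual_output = torch.zeros(1, 2, 6)
    for k in range(2):                                 # kernel index
        for i in range(6):                             # window start position
            window = sentence[0, 0, i:i + 2, :]        # [2, 5]
            manual_output[0, k, i] = (window * conv.weight[k, 0]).sum() + conv.bias[k]

matches = torch.allclose(library_output, manual_output, atol=1e-5)
print(matches)
```

Each entry of a feature map is therefore a weighted sum over one window of consecutive word vectors, plus a bias.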


The second convolutional layer has a region size of 3 (i.e., the height is 3) and 2 kernels. Using the same logic, we create a layer `convolute_region3`, apply the `ReLU` function, and squeeze the last dimension.

In [None]:
convolute_region3 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(3,5), stride=1)
convolute_region3_output = convolute_region3(input_embeddings)
convolute_region3_output = ReLU_fn(convolute_region3_output)
convolute_region3_output = convolute_region3_output.squeeze(-1)
print(convolute_region3_output.shape)

torch.Size([1, 2, 5])


The third convolutional layer has a region size of 4 (i.e., the height is 4) and 2 kernels. Using the same logic, we create a layer `convolute_region4`, apply the `ReLU` function, and squeeze the last dimension.

In [None]:
convolute_region4 = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(4,5), stride=1)
convolute_region4_output = convolute_region4(input_embeddings)
convolute_region4_output = ReLU_fn(convolute_region4_output)
convolute_region4_output = convolute_region4_output.squeeze(-1)
print(convolute_region4_output.shape)

torch.Size([1, 2, 4])


Note that these three layers operate on the input embeddings in parallel rather than sequentially.

Hence, instead of creating three convolutional layers separately, we can create a [`torch.nn.ModuleList()`](https://pytorch.org/docs/stable/nn.html#torch.nn.ModuleList) to hold these submodules in a list.

In [None]:
kernel_sizes = [2,3,4] #three region size
kernel_num = 2      # number of kernels 
embedding_dim = 5   # word embedding size
Ci = 1              # number of input channel

convolution_layers = nn.ModuleList([
    nn.Conv2d(in_channels = Ci, out_channels = kernel_num, kernel_size = (K, embedding_dim)) 
                                    for K in kernel_sizes  ])

convolute_outputs = [F.relu(conv(input_embeddings)).squeeze(3) for conv in convolution_layers]  

`convolute_outputs` is a list that contains the outputs of all three convolutional layers.
Let's see what they look like.

In [None]:
for i, item in enumerate(convolute_outputs):
    print(i," shape: ",item.shape)

#They are same as our previous outputs. 

0  shape:  torch.Size([1, 2, 6])
1  shape:  torch.Size([1, 2, 5])
2  shape:  torch.Size([1, 2, 4])
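A natural question is why `nn.ModuleList` is needed instead of a plain Python list. The sketch below defines two hypothetical toy modules (`WithPlainList` and `WithModuleList`, names invented for this example) and counts the parameters that PyTorch can see in each.

```python
import torch.nn as nn

class WithPlainList(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain Python list hides the layers from PyTorch.
        self.convs = [nn.Conv2d(1, 2, (k, 5)) for k in (2, 3, 4)]

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.ModuleList registers each layer as a submodule.
        self.convs = nn.ModuleList([nn.Conv2d(1, 2, (k, 5)) for k in (2, 3, 4)])

def count_params(module):
    return sum(p.numel() for p in module.parameters())

plain_count = count_params(WithPlainList())
registered_count = count_params(WithModuleList())
print(plain_count, registered_count)
```

Layers kept in a plain list are not returned by `parameters()`, so an optimizer built from `model.parameters()` would not update them and `.to(device)` would not move them.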


#### Max Pooling layers

Our convolved outputs (feature maps) have size `[batch size, number of kernels, output height]` (e.g., `[1,2,6]`). Hence, each kernel represents each sample as a vector, here of length 6.

Max pooling over the whole length of each feature map (`1-max pooling`) is then performed, i.e., the largest number from each map is recorded. Thus one feature is generated from each of the six maps.

Let's apply [`F.max_pool1d()`](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.max_pool1d) to `convolute_region2_output` first.

In [None]:
maxpool_region2 = F.max_pool1d(convolute_region2_output, convolute_region2_output.size(2))
print(convolute_region2_output)
print(maxpool_region2)
print(maxpool_region2.shape) # [batch size, number of kernels, 1]

tensor([[[1.0953, 0.3164, 0.0496, 0.4001, 0.0000, 0.1510],
         [0.4668, 0.9165, 0.3379, 0.3215, 0.1232, 0.2179]]],
       grad_fn=<SqueezeBackward1>)
tensor([[[1.0953],
         [0.9165]]], grad_fn=<SqueezeBackward1>)
torch.Size([1, 2, 1])


In [None]:
# We squeeze the last dimension.
maxpool_region2 = maxpool_region2.squeeze(-1)
print(maxpool_region2.shape)

torch.Size([1, 2])


Then, same for `convolute_region3_output` and `convolute_region4_output`.

In [None]:
maxpool_region3 = F.max_pool1d(convolute_region3_output, convolute_region3_output.size(2))
maxpool_region3 = maxpool_region3.squeeze(-1)
print(maxpool_region3.shape)

torch.Size([1, 2])


In [None]:
maxpool_region4 = F.max_pool1d(convolute_region4_output, convolute_region4_output.size(2))
maxpool_region4 = maxpool_region4.squeeze(-1)
print(maxpool_region4.shape)

torch.Size([1, 2])


We can also use a list comprehension to perform max pooling on `convolute_outputs`, which we generated above with `torch.nn.ModuleList()`.

In [None]:
max_pooling_outputs = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in convolute_outputs]  

`max_pooling_outputs` is a list that contains the outputs of all three max-pooling operations.
Let's see what they look like.

In [None]:
for i, item in enumerate(max_pooling_outputs):
    print(i," shape: ",item.shape)

#They are same as our previous outputs of max-pooling layers.  

0  shape:  torch.Size([1, 2])
1  shape:  torch.Size([1, 2])
2  shape:  torch.Size([1, 2])
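Pooling with a window as long as the feature map is the same as taking the maximum along the position axis. The sketch below checks this on random toy feature maps (not the outputs above) and shows that maps of different lengths end up with the same pooled shape.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feature_maps = torch.relu(torch.randn(1, 2, 6))    # [batch, kernels, positions]
shorter_maps = torch.relu(torch.randn(1, 2, 4))    # e.g. from a larger region size

pooled = F.max_pool1d(feature_maps, feature_maps.size(2)).squeeze(-1)
pooled_shorter = F.max_pool1d(shorter_maps, shorter_maps.size(2)).squeeze(-1)
direct_max = feature_maps.max(dim=2).values

print(pooled.shape, pooled_shorter.shape)
print(torch.equal(pooled, direct_max))
```

Because every map is reduced to a single number per kernel, the pooled features have a fixed size regardless of sentence length, which is what allows them to be concatenated and fed to a linear layer.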


#### Concatenate 

We concatenate the 6 features together to form a feature vector. 

In [None]:
concat = torch.cat((maxpool_region2, maxpool_region3, maxpool_region4), dim = -1)
print(concat.shape) # [batch size, feature size]

torch.Size([1, 6])


We can also concatenate the list, `max_pooling_outputs`. 

In [None]:
concat_list = torch.cat(max_pooling_outputs, 1)
print(concat_list.shape) # [batch size, feature size]

torch.Size([1, 6])


#### Dropout

Before the fully connected layer, we add a [`drop-out layer`](https://pytorch.org/docs/stable/nn.html#dropout-layers) with a drop-out rate of 0.5 to reduce over-fitting.

In [None]:
drop_out = nn.Dropout(p=0.5)

In [None]:
drop_output = drop_out(concat)
print("drop_output:", drop_output.shape)

drop_output: torch.Size([1, 6])
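Dropout behaves differently during training and evaluation, which is why `model.train()` and `model.eval()` matter later on. The sketch below applies `nn.Dropout(p=0.5)` to a vector of ones (a made-up input) in both modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.ones(1, 6)
dropout_layer = nn.Dropout(p=0.5)

dropout_layer.train()                 # training mode: units are dropped at random
train_out = dropout_layer(features)
# Every entry is either 0 (dropped) or 1 / (1 - p) = 2 (kept and rescaled).
print(train_out)

dropout_layer.eval()                  # evaluation mode: dropout does nothing
eval_out = dropout_layer(features)
print(torch.equal(eval_out, features))
```

The rescaling by `1 / (1 - p)` during training keeps the expected value of each feature unchanged, so no correction is needed at evaluation time.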


#### Softmax fully connected layers

The final **softmax fully connected layer** then receives this feature vector as input and uses it to classify the sentence; here the task is binary classification, so there are two possible output states.

The input feature size is `number of kernel region sizes x number of kernels per region size`.

In our case, it is `3 x 2 = 6`.

The output feature size is 2 (binary classification).

In [None]:
fc = nn.Linear(3 * 2 , 2)  
fc_output = F.softmax(fc(drop_output),dim=1)
print("fc_output:", fc_output.shape)
print(fc_output)

fc_output: torch.Size([1, 2])
tensor([[0.8767, 0.1233]], grad_fn=<SoftmaxBackward>)


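`fc_output` is the prediction of the CNN model.
We can pass this prediction and the true labels to a loss function to calculate the loss, and then backpropagate it. Note that `nn.CrossEntropyLoss()` applies log-softmax internally and expects raw scores (logits); because `fc_output` has already passed through `F.softmax`, softmax is effectively applied twice in the cell below.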

In [None]:
# We use nn.CrossEntropyLoss() as our loss function. 
criterion = nn.CrossEntropyLoss()
criterion(fc_output, label_tensor)

tensor(1.1392, grad_fn=<NllLossBackward>)
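Since `nn.CrossEntropyLoss` combines log-softmax and negative log-likelihood, it is worth comparing what it returns for raw scores and for probabilities. The sketch below uses made-up logits for a single example; it is not a change to the tutorial's model, only an illustration of the loss function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0]])      # made-up raw scores for one example
target = torch.tensor([0])

ce = nn.CrossEntropyLoss()
loss_from_logits = ce(logits, target)

# Equivalent two-step form: log-softmax followed by negative log-likelihood.
loss_two_step = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Passing probabilities instead of logits applies softmax a second time.
loss_from_probs = ce(F.softmax(logits, dim=1), target)

# Best case when the inputs are probabilities: a perfect prediction of [1, 0].
loss_floor = ce(torch.tensor([[1.0, 0.0]]), target)

print(loss_from_logits.item(), loss_two_step.item(),
      loss_from_probs.item(), loss_floor.item())
```

When probabilities are passed in, the two-class loss cannot fall below `log(1 + exp(-1))`, roughly 0.313, even for a perfect prediction. A common alternative is to return the logits from the model for training and apply `F.softmax` only when probabilities are needed.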

Congratulations! We have implemented the sample CNN successfully.

Let's apply this architecture to a real-world task: sociality classification with HappyDB.

## Build a `class` for CNN of text classification

In [None]:
# To define a CNN class
class CNN_Text(nn.Module):
    def __init__(self, vocabulary_size, embedding_dim, output_size, kernel_num, region_sizes, dropout):
        '''
        vocabulary_size: vocabulary size
        embedding_dim: word embedding size
        output_size: number of classes in prediction
        kernel_num: number of kernels (number of output channels of convolutional layers)
        region_sizes: height of kernels of convolutional layers
        dropout: dropout rate
        '''
        super(CNN_Text, self).__init__()
        # the size of input channel is 1.
        Ci = 1
        
        # word embedding layer
        self.embeddings = nn.Embedding(num_embeddings = vocabulary_size, embedding_dim = embedding_dim )
        
        # convolution with kernels
        self.convolution_layers = nn.ModuleList([nn.Conv2d(in_channels = Ci, out_channels = kernel_num, kernel_size = (K, embedding_dim)) for K in region_sizes])
        
        # a dropout layer
        self.dropout = nn.Dropout(dropout) 
        
        # fully connected layer
        self.fc = nn.Linear(len(region_sizes) * kernel_num, output_size)

    def forward(self, x):
        # input x  [sequence length, batch size]
        
        input_embeddings = self.embeddings(x)  
        # [sequence length, batch size, embedding size]

        input_embeddings = input_embeddings.permute(1,0,2)
        input_embeddings = input_embeddings.unsqueeze(1)
        # [batch size, 1 (number of channels), sequence length, embedding size]

        # convolutional layers
        convolute_outputs = [F.relu(conv(input_embeddings)).squeeze(3) for conv in self.convolution_layers]  
        
        # to get the maximum value of filtered tensor
        max_pooling_outputs = [F.max_pool1d(i, i.size(2)).squeeze(2) for i in convolute_outputs] 
        
        concat_list = torch.cat(max_pooling_outputs, 1) # concatenate representations
        
        drop_output = self.dropout(concat_list)  # add drop layer
        
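        fc1_output = self.fc(drop_output)  # class scores (logits) from the fully connected layer
        
        # Note: nn.CrossEntropyLoss expects logits and applies log-softmax itself,
        # so training on these probabilities applies softmax twice.
        final_output = F.softmax(fc1_output,dim=1)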
        
        return final_output
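One practical detail of this architecture: a `Conv2d` kernel of height 4 cannot be applied to an input with fewer than 4 rows, so a sentence shorter than the largest region size raises an error. A possible workaround is to pad short inputs before calling the model. In the sketch below, `pad_to_min_length` is a hypothetical helper and `pad_id=1` is an assumed padding index; neither is part of the original notebook.

```python
import torch
import torch.nn.functional as F

def pad_to_min_length(token_ids, min_length, pad_id):
    """Pad a [sequence length, batch size] id tensor along the length axis (hypothetical helper)."""
    missing = min_length - token_ids.size(0)
    if missing <= 0:
        return token_ids
    # F.pad pads the last dimension first: (batch_left, batch_right, length_top, length_bottom).
    return F.pad(token_ids, (0, 0, 0, missing), value=pad_id)

short_sentence = torch.tensor([[12], [40]])          # 2 tokens, batch size 1
padded = pad_to_min_length(short_sentence, min_length=4, pad_id=1)
print(padded.squeeze(1).tolist())

long_sentence = torch.tensor([[3], [4], [5], [6], [7]])
unchanged = pad_to_min_length(long_sentence, min_length=4, pad_id=1)
print(unchanged.shape)
```

With region sizes `[2, 3, 4]`, `min_length` would be 4, i.e. `max(region_sizes)`.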


In [None]:
# Hyper Parameters

# the vocabulary size
vocabulary_size = len(TEXT.vocab.stoi) 

# Dimension of word embedding is 300. Namely, each word is expressed by a vector that has 300 dimensions.
embedding_dim = 300 

# region size as 2, 3, and 4
kernel_sizes = [2,3,4] 

# the number of kernel in each region size
kernels_num = 32  

# The dropout rate is set to be 0.5.
dropout = 0.5

# The output size of labels.
output_size = 2

# learning rate is set to be 0.01.
lr = 0.01        

# The number of epochs is set to 5 (the training loop below uses its own MAX_EPOCH instead).
num_epoch = 5  

# instantiate the CNN_Text class and assign it to model
model = CNN_Text(vocabulary_size, embedding_dim, output_size, kernels_num, kernel_sizes, dropout).to(device)


In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.1)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

CNN_Text(
  (embeddings): Embedding(3552, 300)
  (convolution_layers): ModuleList(
    (0): Conv2d(1, 32, kernel_size=(2, 300), stride=(1, 1))
    (1): Conv2d(1, 32, kernel_size=(3, 300), stride=(1, 1))
    (2): Conv2d(1, 32, kernel_size=(4, 300), stride=(1, 1))
  )
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=96, out_features=2, bias=True)
)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,152,290 trainable parameters
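The reported parameter count can be reproduced by hand from the layer shapes printed above. The arithmetic below uses the sizes shown in the model summary (vocabulary of 3,552, embedding size 300, 32 kernels per region size).

```python
vocabulary_size, embedding_dim = 3552, 300
kernels_num, region_sizes, output_size = 32, [2, 3, 4], 2

embedding_params = vocabulary_size * embedding_dim
# Each Conv2d has kernels_num kernels of shape (1, K, embedding_dim) plus one bias per kernel.
conv_params = sum(kernels_num * 1 * k * embedding_dim + kernels_num for k in region_sizes)
# The linear layer maps len(region_sizes) * kernels_num features to output_size classes.
fc_params = len(region_sizes) * kernels_num * output_size + output_size

total = embedding_params + conv_params + fc_params
print(f"{embedding_params:,} + {conv_params:,} + {fc_params:,} = {total:,}")
```

The embedding table accounts for more than 90% of the parameters, so vocabulary size and embedding dimension dominate the model size.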


In [None]:
# Loss and optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # define an optimizer for updating the parameters
criterion = nn.CrossEntropyLoss()   # define the loss function (named criterion to match the training loop below)

`train()` and `evaluate()` functions

In [None]:
def train(model, iterator, optimizer, criterion):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        batch_input, labels = batch.tweet, batch.label
        batch_input = batch_input.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        
        outputs = model(batch_input)

        loss = criterion(outputs, labels)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.cpu().item()

    return epoch_loss / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            batch_input, labels = batch.tweet, batch.label
            batch_input = batch_input.to(device)
            labels = labels.to(device)

            outputs = model(batch_input)

            loss = criterion(outputs, labels)

            epoch_loss += loss.cpu().item()

            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(outputs.cpu(), 1)
            # collect all predictions and true labels as plain Python ints
            all_pred.extend(predicted.tolist())
            all_label.extend(labels.cpu().tolist())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return epoch_loss / len(iterator), accuracy, f1score
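For readers unfamiliar with macro-averaged F1, the plain-Python sketch below spells out the computation: F1 is calculated per class and the per-class scores are averaged with equal weight. `macro_f1` and the toy labels are written for this illustration; the tutorial itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (plain-Python sketch)."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 1, 1]
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(accuracy, macro_f1(toy_true, toy_pred))
```

Unlike accuracy, macro F1 gives the minority class the same weight as the majority class, which makes it a useful companion metric when the labels are imbalanced.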

In [None]:
# Train the model
MAX_EPOCH = 15
total_step = len(train_iter)
loss_list = []
acc_list = []

for epoch in trange(MAX_EPOCH, desc="Epoch"):
    train_loss = train(model, train_iter, optimizer, criterion)  
    val_loss, val_acc, val_f1 = evaluate(model, val_iter, criterion)

    # Create checkpoint at end of each epoch
    state_dict_model = model.state_dict() 
    state = {
        'epoch': epoch,
        'state_dict': state_dict_model,
        'optimizer': optimizer.state_dict()
        }

    torch.save(state, "./drive/My Drive/Colab Notebooks/ckpt_cnn/CNN_TEXT_"+str(epoch+1)+".pt")

    print('\n Epoch [{}/{}], Train Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}, Validation F1: {:.4f}'.format(epoch+1, MAX_EPOCH, train_loss, val_loss, val_acc, val_f1))
    

 Epoch [1/15], Train Loss: 0.4036, Validation Loss: 0.4269, Validation Accuracy: 0.8883, Validation F1: 0.8876
 Epoch [2/15], Train Loss: 0.4110, Validation Loss: 0.4253, Validation Accuracy: 0.8902, Validation F1: 0.8894
 Epoch [3/15], Train Loss: 0.4070, Validation Loss: 0.4246, Validation Accuracy: 0.8911, Validation F1: 0.8908
 Epoch [4/15], Train Loss: 0.4094, Validation Loss: 0.4148, Validation Accuracy: 0.8968, Validation F1: 0.8962
 Epoch [5/15], Train Loss: 0.4127, Validation Loss: 0.4148, Validation Accuracy: 0.8835, Validation F1: 0.8821
 Epoch [6/15], Train Loss: 0.4087, Validation Loss: 0.4144, Validation Accuracy: 0.8835, Validation F1: 0.8821
 Epoch [7/15], Train Loss: 0.4284, Validation Loss: 0.4312, Validation Accuracy: 0.8902, Validation F1: 0.8895
 Epoch [8/15], Train Loss: 0.4395, Validation Loss: 0.4489, Validation Accuracy: 0.8684, Validation F1: 0.8681
 Epoch [9/15], Train Loss: 0.5892, Validation Loss: 0.4795, Validation Accuracy: 0.8381, Validation F1: 0.8378
 Epoch [10/15], Train Loss: 0.4263, Validation Loss: 0.4252, Validation Accuracy: 0.8911, Validation F1: 0.8907
 Epoch [11/15], Train Loss: 0.4203, Validation Loss: 0.4608, Validation Accuracy: 0.8674, Validation F1: 0.8665
 Epoch [12/15], Train Loss: 0.4205, Validation Loss: 0.4490, Validation Accuracy: 0.8750, Validation F1: 0.8736
 Epoch [13/15], Train Loss: 0.4078, Validation Loss: 0.4243, Validation Accuracy: 0.8920, Validation F1: 0.8918
 Epoch [14/15], Train Loss: 0.4182, Validation Loss: 0.4748, Validation Accuracy: 0.8371, Validation F1: 0.8301
 Epoch [15/15], Train Loss: 0.4314, Validation Loss: 0.4330, Validation Accuracy: 0.8949, Validation F1: 0.8944
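The checkpoints written at the end of each epoch can be loaded back to resume training or to evaluate a particular epoch. The sketch below shows the round trip with a small stand-in model and a temporary file; `toy_model`, the file name, and the epoch value are placeholders, and only the dictionary keys (`epoch`, `state_dict`, `optimizer`) follow the training loop above.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(6, 2)                       # stand-in for the CNN
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.01)

# Save a checkpoint with the same keys as the training loop above.
ckpt_path = os.path.join(tempfile.mkdtemp(), "toy_checkpoint.pt")
torch.save({"epoch": 3,
            "state_dict": toy_model.state_dict(),
            "optimizer": toy_optimizer.state_dict()}, ckpt_path)

# Restore into freshly created objects.
restored_model = nn.Linear(6, 2)
restored_optimizer = torch.optim.Adam(restored_model.parameters(), lr=0.01)
checkpoint = torch.load(ckpt_path, map_location="cpu")
restored_model.load_state_dict(checkpoint["state_dict"])
restored_optimizer.load_state_dict(checkpoint["optimizer"])
restored_model.eval()                             # switch off dropout for inference

weights_match = torch.equal(restored_model.weight, toy_model.weight)
print(checkpoint["epoch"], weights_match)
```

In the tutorial's setting, the same pattern would be applied to a `CNN_Text` instance created with the same hyperparameters, choosing the checkpoint of the epoch with the best validation score.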





# Exercise 

* Can you implement a Convolutional Recurrent Neural Network with the following architecture?
```
CLSTM(
  (convs1): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 300), stride=(1, 1), padding=(1, 0))
    (1): Conv2d(1, 100, kernel_size=(5, 300), stride=(1, 1), padding=(2, 0))
  )
  (lstm): LSTM(200, 200, batch_first=True)
  (hidden2label): Linear(in_features=200, out_features=2, bias=True)
  (softmax): LogSoftmax()
  (dropout_layer): Dropout(p=0.2)
)
```

### Reference
* http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
* https://medium.com/jatana/report-on-text-classification-using-cnn-rnn-han-f0e887214d5f
* http://www.davidsbatista.net/blog/2018/03/31/SentenceClassificationConvNets/