[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/pytorch/ignite/blob/master/examples/notebooks/TextCNN.ipynb)

# Convolutional Neural Networks for Sentence Classification using Ignite

In [135]:
!pip install pytorch-ignite torchtext==0.9.1 spacy
!python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.8.1
  Using cached torch-1.8.1-cp37-cp37m-manylinux1_x86_64.whl (804.1 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0
    Uninstalling torch-1.11.0:
      Successfully uninstalled torch-1.11.0
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.12.0+cu113 requires torch==1.11.0, but you have torch 1.8.1 which is incompatible.
torchdata 0.3.0 requires torch==1.11.0, but you have torch 1.8.1 which is incompatible.
torchaudio 0.11.0+cu113 requires torch==1.11.0, but you have torch 1.8.1 which is incompatible.[0m
Successfully installed torch-1.8.1


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
[K     |████████████████████████████████| 12.8 MB 5.2 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


## Import Libraries

In [137]:
import random
from torchtext import datasets
from torchtext.vocab import GloVe
#for model
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

`Ignite` is a High-level library to help with training neural networks in PyTorch. It comes with an `Engine` to set up a training loop, various metrics, handlers and a helpful contrib section! 

Below we import the following:
* **Engine**: Runs a given process_function over each batch of a dataset, emitting events as it goes.
* **Events**: Allows users to attach functions to an `Engine` to fire functions at a specific event. Eg: `EPOCH_COMPLETED`, `ITERATION_STARTED`, etc.
* **Accuracy**: Metric to calculate accuracy over a dataset, for binary, multiclass, multilabel cases. 
* **Loss**: General metric that takes a loss function as a parameter, calculate loss over a dataset.
* **RunningAverage**: General metric to attach to Engine during training. 
* **ModelCheckpoint**: Handler to checkpoint models. 
* **EarlyStopping**: Handler to stop training based on a score function. 
* **ProgressBar**: Handler to create a tqdm progress bar.

In [138]:
from ignite.engine import Engine, Events
from ignite.metrics import Accuracy, Loss, RunningAverage
from ignite.handlers import ModelCheckpoint, EarlyStopping
from ignite.contrib.handlers import ProgressBar
from ignite.utils import manual_seed

SEED = 1234
manual_seed(SEED)

In [139]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Processing Data

We first set up a tokenizer using `torchtext.data.utils`.
The job of a tokenizer to split a sentence into "tokens". You can read more about it at [wikipedia](https://en.wikipedia.org/wiki/Lexical_analysis).
We will use the tokenizer from the "spacy" library which is a popular choice. Feel free to switch to "basic_english" if you want to use the default one or any other that you want.

docs: https://pytorch.org/text/stable/data_utils.html

In [140]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")



Next, the IMDB training and test datasets are downloaded. The `torchtext.datasets` API returns the train/test dataset split directly without the preprocessing information. Each split is an iterator which yields the raw texts and labels line-by-line.

In [142]:
!pip install torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.11.0
  Using cached torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.8.1
    Uninstalling torch-1.8.1:
      Successfully uninstalled torch-1.8.1
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.9.1 requires torch==1.8.1, but you have torch 1.11.0 which is incompatible.[0m
Successfully installed torch-1.11.0


In [143]:
import pickle
with open("/content/train", "rb") as fp: 
  train = pickle.load(fp)

with open("/content/test", "rb") as fp: 
  test = pickle.load(fp)

In [144]:
train_iter_unclean, test_iter = train, test

In [145]:
#removing neutral tags
train_iter = []
for line in range(len(train_iter_unclean)):
  if train_iter_unclean[line][0] != 6:
    train_iter.append(train_iter_unclean[line])

In [146]:
#Cleaned data
df = pd.DataFrame(train_iter, columns=['labels','sentences'])
df.head()

Unnamed: 0,labels,sentences
0,4,Do you have your forms?
1,4,Let me see them.
2,4,Who told you to get in this line?
3,4,Okay. But I didn't tell you to get in this li...
4,4,This form is a Z.X.four.


In [167]:
#data distribution
labels = ['ang', 'dis', 'exc', 'fea', 'fru', 'hap', 'neu', 'oth', 'sad', 'sur']
df_dist = df[df['labels'].isin([0,1,2,3,4,5,7,8,9])]
class_dist = dict(pd.Series(df['labels']).value_counts())
for k,v in class_dist.items():
  print(labels[k],": ",v)

fru :  1515
ang :  941
sad :  872
exc :  775
hap :  465
sur :  92
fea :  31
oth :  3
dis :  2


In [168]:
#removing 'dst' , 'oth'
train_iter_temp = []
for line in range(len(train_iter)):
  if train_iter[line][0] != 1 or train_iter[line][0] != 7:
    train_iter_temp.append(train_iter[line])
train_iter = train_iter_temp

Now we set up the train, validation and test splits.  

In [169]:
# We are using only 1000 samples for faster training
# set to -1 to use full data
N = 1000 

# We will use 80% of the `train split` for training and the rest for validation
train_frac = 0.8
_temp = list(train_iter)


random.shuffle(_temp)
_temp = _temp[:(N if N > 0 else len(_temp) )]
n_train = int(len(_temp)*train_frac)

train_list = _temp[0:n_train]
validation_list = _temp[n_train:]
test_list = list(test_iter)
test_list = test_list[:(N if N > 0 else len(test_list))]

In [170]:
print(train_list[0])

(2, " Awesome. We're both going to be in L.A.")


Let's explore a data sample to see what it looks like.
Each data sample is a tuple  of the format `(label, text)`.

The value of label is either 'pos' or 'neg'.


In [171]:
random_sample = random.sample(train_list,1)[0]
print(' text:', random_sample[1])
print('label:', random_sample[0])

 text:  I'm sorry, sir, it's not my choice.
label: 4


Now that we have the datasets splits, let's build our vocabulary. For this, we will use the `Vocab` class from `torchtext.vocab`. It is important that we build our vocabulary based on the train dataset as validation and test are **unseen** in our experimenting. 

`Vocab` allows us to use pretrained **GloVE** 100 dimensional word vectors. This means each word is described by 100 floats! If you want to read more about this, here are a few resources.
* [StanfordNLP - GloVe](https://github.com/stanfordnlp/GloVe)
* [DeepLearning.ai Lecture](https://www.coursera.org/lecture/nlp-sequence-models/glove-word-vectors-IxDTG)
* [Stanford CS224N Lecture by Richard Socher](https://www.youtube.com/watch?v=ASn7ExxLZws)

Note than the GloVE download size is around 900MB, so it might take some time to download. 

An instance of the `Vocab` class has the following attributes:
* `extend` is used to extend the vocabulary
* `freqs` is a dictionary of the frequency of each word
* `itos` is a list of all the words in the vocabulary.
* `stoi` is a dictionary mapping every word to an index.
* `vectors` is a torch.Tensor of the downloaded embeddings


In [172]:
from collections import Counter
from torchtext.vocab import Vocab

counter = Counter()

for (label, line) in train_list:
    counter.update(tokenizer(line))

vocab = Vocab(
    counter,
    min_freq=10,
    vectors=GloVe(name='6B', dim=100, cache='/tmp/glove/')
)

In [173]:
print("The length of the new vocab is", len(vocab))
new_stoi = vocab.stoi
print("The index of '<BOS>' is", new_stoi['<BOS>'])
print("The index of '<EOS>' is", new_stoi['<EOS>'])
new_itos = vocab.itos
print("The token at index 2 is", new_itos[5])

The length of the new vocab is 178
The index of '<BOS>' is 0
The index of '<EOS>' is 0
The token at index 2 is ,


We now create `text_transform` and `label_transform`, which are callable objects, such as a `lambda` func here, to process the raw text and label data from the dataset iterators (or iterables like a `list`). You can add the special symbols such as `<BOS>` and `<EOS>` to the sentence in `text_transform`.

In [174]:
text_transform = lambda x: [vocab[token] for token in tokenizer(x)]
label_transform = lambda x: 1 if x == 'pos' else 0

# Print out the output of text_transform
print("input to the text_transform:", "here is an example?")
print("output of the text_transform:", text_transform("here is an example"))

input to the text_transform: here is an example?
output of the text_transform: [46, 19, 0, 0]


For generating the data batches we will use `torch.utils.data.DataLoader`. You could customize the data batch by defining a function with the `collate_fn` argument in the DataLoader. Here, in the `collate_batch` func, we process the raw text data and add padding to dynamically match the longest sentence in a batch.

In [175]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform(_label))
        processed_text = torch.tensor(text_transform(_text))
        text_list.append(processed_text)
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)


In [178]:
batch_size = 8  # A batch size of 8

def create_iterators(batch_size=8):
    """Heler function to create the iterators"""
    dataloaders = []
    for split in [train_list, validation_list, test_list]:
        dataloader = DataLoader(
            split, batch_size=batch_size,
            collate_fn=collate_batch
            )
        dataloaders.append(dataloader)
        print(dataloaders)
    return dataloaders


In [179]:
train_iterator, valid_iterator, test_iterator = create_iterators()

[<torch.utils.data.dataloader.DataLoader object at 0x7fa2403cebd0>]
[<torch.utils.data.dataloader.DataLoader object at 0x7fa2403cebd0>, <torch.utils.data.dataloader.DataLoader object at 0x7fa2403ce9d0>]
[<torch.utils.data.dataloader.DataLoader object at 0x7fa2403cebd0>, <torch.utils.data.dataloader.DataLoader object at 0x7fa2403ce9d0>, <torch.utils.data.dataloader.DataLoader object at 0x7fa2403cea90>]


Let's actually explore what the output of the iterator is, this way we'll know what the input of the model is, how to compare the label to the output and how to set up our process_functions for Ignite's `Engine`.
* `batch[0][0]` is the label of a single example. We can see that `vocab.stoi` was used to map the label that originally text into a float.
* `batch[1][0]` is the text of a single example. Similar to label, `vocab.stoi` was used to convert each token of the example's text into indices.

Now let's print the lengths of the sentences of the first 10 batches of `train_iterator`. We see here that all the batches are of different lengths, this means that the iterator is working as expected.

In [185]:
batch = next(iter(train_iterator))
print('batch[0][0] : ', batch[1][0])
print('batch[1][0] : ', batch[0][[0] != 1])

lengths = []
for i, batch in enumerate(train_iterator):
    x = batch[1]
    lengths.append(x.shape[0])
    if i == 10:
        break

print ('Lengths of first 10 batches : ', lengths)

batch[0][0] :  tensor([2, 2, 2, 2, 2, 2, 2, 2])
batch[1][0] :  tensor([[0, 0, 0, 0, 0, 0, 0, 0]])
Lengths of first 10 batches :  [41, 39, 39, 29, 51, 39, 27, 31, 32, 40, 41]


## TextCNN Model

In [210]:
class TextCNN(nn.Module):
    def __init__(
        self,
        vocab_size,
        embedding_dim, 
        kernel_sizes, 
        num_filters, 
        num_classes, d_prob, mode):
        super(TextCNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.kernel_sizes = kernel_sizes
        self.num_filters = num_filters
        self.num_classes = num_classes
        self.d_prob = d_prob
        self.mode = mode
        self.embedding = nn.Embedding(
            vocab_size, embedding_dim, padding_idx=0)
        self.load_embeddings()
        self.conv = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim,
                                             out_channels=num_filters,
                                             kernel_size=k, stride=1) for k in kernel_sizes])
        self.dropout = nn.Dropout(d_prob)
        self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes)

    def forward(self, x):
        batch_size, sequence_length = x.shape
        x = self.embedding(x.T).transpose(1, 2)
        x = [F.relu(conv(x)) for conv in self.conv]
        x = [F.max_pool1d(c, c.size(-1)).squeeze(dim=-1) for c in x]
        x = torch.cat(x, dim=1)
        x = self.fc(self.dropout(x))
        return torch.sigmoid(x).squeeze()

    def load_embeddings(self):
        if 'static' in self.mode:
            self.embedding.weight.data.copy_(vocab.vectors)
            if 'non' not in self.mode:
                self.embedding.weight.data.requires_grad = False
                print('Loaded pretrained embeddings, weights are not trainable.')
            else:
                self.embedding.weight.data.requires_grad = True
                print('Loaded pretrained embeddings, weights are trainable.')
        elif self.mode == 'rand':
            print('Randomly initialized embeddings are used.')
        else:
            raise ValueError('Unexpected value of mode. Please choose from static, nonstatic, rand.')

## Creating Model, Optimizer and Loss

Below we create an instance of the TextCNN model and load embeddings in **static** mode. The model is placed on a device and then a loss function of Binary Cross Entropy and Adam optimizer are set up. 

In [227]:
vocab_size, embedding_dim = vocab.vectors.shape

model = TextCNN(vocab_size=vocab_size,
                embedding_dim=embedding_dim,
                kernel_sizes=[3, 4, 5],
                num_filters=100,
                num_classes=1, 
                d_prob=0.5,
                mode='static')
model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
criterion = nn.MSELoss()

Loaded pretrained embeddings, weights are not trainable.


## Training and Evaluating using Ignite

### Trainer Engine - process_function
Below is a single operation during the trainig process. This process_function will be attached to the training engine.

In [228]:
def process_function(engine, batch):
    model.train()
    optimizer.zero_grad()
    y, x = batch
    x = x.to(device)
    y = y.to(device)
    y_pred = model(x)
    loss = criterion(y_pred, y.float())
    loss.backward()
    optimizer.step()
    return loss.item()

### Evaluator Engine - process_function

In [229]:
def eval_function(engine, batch):
    model.eval()
    with torch.no_grad():
        y, x = batch
        y = y.to(device)
        x = x.to(device)
        y = y.float()
        y_pred = model(x)
        return y_pred, y

### Instantiating Training and Evaluating Engines

In [230]:
trainer = Engine(process_function)
train_evaluator = Engine(eval_function)
validation_evaluator = Engine(eval_function)

### Metrics - RunningAverage, Accuracy and Loss

To start, we'll attach a metric of Running Average to track a running average of the scalar loss output for each batch. 

In [231]:
RunningAverage(output_transform=lambda x: x).attach(trainer, 'loss')

In [232]:
def thresholded_output_transform(output):
    y_pred, y = output
    y_pred = torch.round(y_pred)
    return y_pred, y

In [233]:
Accuracy(output_transform=thresholded_output_transform).attach(train_evaluator, 'accuracy')
Loss(criterion).attach(train_evaluator, 'bce')

In [234]:
Accuracy(output_transform=thresholded_output_transform).attach(validation_evaluator, 'accuracy')
Loss(criterion).attach(validation_evaluator, 'bce')

### Progress Bar

In [235]:
pbar = ProgressBar(persist=True, bar_format="")
pbar.attach(trainer, ['loss'])

### EarlyStopping - Tracking Validation Loss

In [236]:
def score_function(engine):
    val_loss = engine.state.metrics['bce']
    return -val_loss

handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
validation_evaluator.add_event_handler(Events.COMPLETED, handler)

<ignite.engine.events.RemovableEventHandle at 0x7fa233f3d9d0>

### Attaching Custom Functions to Engine at specific Events

In [237]:
@trainer.on(Events.EPOCH_COMPLETED)
def log_training_results(engine):
    train_evaluator.run(train_iterator)
    metrics = train_evaluator.state.metrics
    avg_accuracy = metrics['accuracy']
    avg_bce = metrics['bce']
    pbar.log_message(
        "Training Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
        .format(engine.state.epoch, avg_accuracy, avg_bce))
    
def log_validation_results(engine):
    validation_evaluator.run(valid_iterator)
    metrics = validation_evaluator.state.metrics
    avg_accuracy = metrics['accuracy']
    avg_bce = metrics['bce']
    pbar.log_message(
        "Validation Results - Epoch: {}  Avg accuracy: {:.2f} Avg loss: {:.2f}"
        .format(engine.state.epoch, avg_accuracy, avg_bce))
    pbar.n = pbar.last_print_n = 0

trainer.add_event_handler(Events.EPOCH_COMPLETED, log_validation_results)

<ignite.engine.events.RemovableEventHandle at 0x7fa23164f910>

### ModelCheckpoint

In [238]:
checkpointer = ModelCheckpoint('/tmp/models', 'textcnn', n_saved=2, create_dir=True, save_as_state_dict=True, require_empty=False)
trainer.add_event_handler(Events.EPOCH_COMPLETED, checkpointer, {'textcnn': model})

<ignite.engine.events.RemovableEventHandle at 0x7fa232359110>

### Run Engine

In [239]:
hist = trainer.run(train_iterator, max_epochs=20)

  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 1  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 1  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 2  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 2  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 3  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 3  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 4  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 4  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 5  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 5  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 6  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 6  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 7  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 7  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 8  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 8  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 9  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 9  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 10  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 10  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 11  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 11  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 12  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 12  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 13  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 13  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 14  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 14  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 15  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 15  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 16  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 16  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 17  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 17  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 18  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 18  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 19  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 19  Avg accuracy: 1.00 Avg loss: 0.00


  1%|1         | 1/100 [00:00<?, ?it/s]

Training Results - Epoch: 20  Avg accuracy: 1.00 Avg loss: 0.00
Validation Results - Epoch: 20  Avg accuracy: 1.00 Avg loss: 0.00


In [223]:
hist.metrics

{'loss': 0.00048381025062802077}

That's it! We have successfully trained and evaluated a Convolutational Neural Network for Text Classification. 

In [224]:
a=train_evaluator.run(train_iterator)

In [225]:
a.metrics

{'accuracy': 1.0, 'bce': 0.0003537757694721222}