In [1]:
!python --version

Python 3.8.10


In [2]:
!pip install -q pytorch-lightning==1.6.5 spacy==2.2.4
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)

## Overview
- Download the conversation dataset and parse it into a PyTorch dataset
- Create a trainer helper function for multi-epoch training
- Model 1: Averaged pre-trained word vectors + linear layer (baseline)
- Model 2: Sliding-window trigram over word vectors
- Model 3: EmbeddingBag-based model on character trigrams

## Dataset Information
We'll be using the Empathetic Dialogues dataset open-sourced by Facebook ([link](https://github.com/facebookresearch/EmpatheticDialogues))

The columns we'll primarily focus on are:
1. context: emotion we're trying to predict
2. prompt + utterance: We'll combine these sentences and use them as input

Let's download and explore the dataset.

In [3]:
import tarfile
import os
import csv

DIRECTORY_NAME = "data"
TRAIN_FILE = "data/empatheticdialogues/train.csv"
VALIDATION_FILE = "data/empatheticdialogues/valid.csv"
TEST_FILE = "data/empatheticdialogues/test.csv"

def download_dataset():
    """
    Download the dataset. The tarball contains three files: train.csv, valid.csv and test.csv
    """

    !wget 'https://dl.fbaipublicfiles.com/parlai/empatheticdialogues/empatheticdialogues.tar.gz'
    if not os.path.isdir(DIRECTORY_NAME):
        os.makedirs(DIRECTORY_NAME)
    tar = tarfile.open('empatheticdialogues.tar.gz')
    tar.extractall(DIRECTORY_NAME)
    tar.close()

# download_dataset()

In [4]:
# verify the downloaded files
import glob
glob.glob(f"{DIRECTORY_NAME}/**/*.csv", recursive=True)

['data/empatheticdialogues/test.csv',
 'data/empatheticdialogues/train.csv',
 'data/empatheticdialogues/valid.csv']

In [5]:
# let's see few examples from the dataset
import pandas as pd

df = pd.read_csv('data/empatheticdialogues/train.csv', sep='\\n', header=None)
df = df[0].str.split(',', expand=True)
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header
df.head()



Unnamed: 0,conv_id,utterance_idx,context,prompt,speaker_idx,utterance,selfeval,tags
1,hit:0_conv:1,1,sentimental,I remember going to the fireworks with my best...,1,I remember going to see the fireworks with my ...,5|5|5_2|2|5,
2,hit:0_conv:1,2,sentimental,I remember going to the fireworks with my best...,0,Was this a friend you were in love with_comma_...,5|5|5_2|2|5,
3,hit:0_conv:1,3,sentimental,I remember going to the fireworks with my best...,1,This was a best friend. I miss her.,5|5|5_2|2|5,
4,hit:0_conv:1,4,sentimental,I remember going to the fireworks with my best...,0,Where has she gone?,5|5|5_2|2|5,
5,hit:0_conv:1,5,sentimental,I remember going to the fireworks with my best...,1,We no longer talk.,5|5|5_2|2|5,
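
The utterances above show commas encoded as the literal token `_comma_`, which is presumably why splitting each raw row on `,` yields clean columns. If readable text is needed for display, the placeholder can be swapped back. This is a minimal sketch: `restore_commas` is a hypothetical helper that is not part of the notebook, and the example sentence is made up.

```python
def restore_commas(text: str) -> str:
    """Replace the dataset's `_comma_` placeholder with a real comma."""
    return text.replace("_comma_", ",")


# Made-up sentence in the dataset's style (not taken from the data)
raw = "I was late_comma_ and I felt terrible."
clean = restore_commas(raw)
print(clean)
```

The rest of the notebook keeps the raw `_comma_` tokens, so applying this before vectorizing would change the inputs the models see.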


Let's create a label encoder that converts our text labels to integer ids and vice versa.

In [6]:
label_to_integer = dict()
integer_to_label = dict()

for ix, label in enumerate(df["context"].unique()):
    label_to_integer[label] = ix
    integer_to_label[ix] = label
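
As a quick sanity check of the encoder idea, the same two-dictionary pattern can be exercised on a toy label list. The labels and the `toy_*` names below are illustrative assumptions; `dict.fromkeys` is used only to mimic the first-seen ordering of pandas' `unique()`.

```python
toy_labels = ["sentimental", "afraid", "proud", "afraid", "sentimental"]

toy_label_to_integer = {}
toy_integer_to_label = {}

# dict.fromkeys drops duplicates while keeping first-seen order
for ix, label in enumerate(dict.fromkeys(toy_labels)):
    toy_label_to_integer[label] = ix
    toy_integer_to_label[ix] = label

# Encode to integer ids, then decode back to text labels
encoded = [toy_label_to_integer[label] for label in toy_labels]
decoded = [toy_integer_to_label[ix] for ix in encoded]
print(encoded)
print(decoded)
```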

In [7]:
def parse_dataset(file_path, num_samples=5000):
    # read each row as a single column row
    df = pd.read_csv(file_path, sep="\\n", header=None)
    # split up each row into separate columns
    df = df[0].str.split(',', expand=True)
    # set the header by using the first row
    new_header = df.iloc[0]
    df = df[1:]
    df.columns = new_header

    # convert labels to integers
    df["target"] = df["context"].apply(lambda x: label_to_integer[x])
    df["feature"] = df["prompt"] + " " + df["utterance"]

    # return df with only required columns: feature and target
    return df[["target", "feature"]].sample(n = num_samples, random_state=0).values

We will limit the sample size of the train, validation, and test sets to speed up training.

In [8]:
training_data = parse_dataset(TRAIN_FILE, num_samples=40000)
validation_data = parse_dataset(VALIDATION_FILE, num_samples=4000)
test_data = parse_dataset(TEST_FILE, num_samples=4000)

print("Shape of training data:", training_data.shape)
print("Shape of validation data:", validation_data.shape)
print("Shape of test data:", test_data.shape)



Shape of training data: (40000, 2)
Shape of validation data: (4000, 2)
Shape of test data: (4000, 2)


### Create PyTorch Dataset and DataLoaders

- **Dataset**: A `Dataset` stores the samples and their corresponding labels.
- **DataLoader**: A `DataLoader` wraps an iterable around the `Dataset` to enable easy, batched access to the samples.
- **LightningDataModule**: A datamodule is a shareable, reusable class that encapsulates all the steps needed to process data. A datamodule encapsulates the five steps involved in data processing in PyTorch:
    1. Download / tokenize / process.
    2. Clean and (maybe) save to disk.
    3. Load inside Dataset.
    4. Apply transforms (rotate, tokenize, etc...)
    5. Wrap inside a DataLoader.

In [9]:
from torch.utils.data import DataLoader, Dataset, random_split
from torch import nn
import pytorch_lightning as pl

In [10]:
class ClassificationDataset(Dataset):
    """Creates a PyTorch dataset to consume our pre-loaded CSV data"""
    def __init__(self, data, vectorizer):
        self.dataset = data
        # vectorizer needs to implement a vectorize function that returns vector and tokens
        self.vectorizer = vectorizer

    def __len__(self):
        return len(self.dataset)
    
    def __getitem__(self, idx):
        label, sentence = self.dataset[idx]
        sentence_vector, sentence_tokens = self.vectorizer.vectorize(sentence)
        return {
            "vectors": sentence_vector,
            "label": label,
            "tokens": sentence_tokens, # for debugging only
            "sentence": sentence # for debugging only
        }

In [11]:
import torch
from torch.nn.utils.rnn import pad_sequence

In [12]:
class ClassificationDataModule(pl.LightningDataModule):
    """LightningDataModule: Wrapper class for the dataset to be used in training"""
    def __init__(self, vectorizer, params):
        super().__init__()
        self.params = params
        self.train_data = ClassificationDataset(training_data, vectorizer)
        self.validation_data = ClassificationDataset(validation_data, vectorizer)
        self.test_data = ClassificationDataset(test_data, vectorizer)

    # Function to convert the input raw data from the dataset into the model input.
    def collate_fn(self, batch):
        # Embedding layers need the inputs to be integer so we need to add this special case here.
        if self.params.integer_input:
            word_vector = [torch.LongTensor(item["vectors"]) for item in batch]
            sentence_vector = pad_sequence(word_vector, batch_first=True, padding_value=0)
        else:
            sentence_vector = torch.stack([torch.Tensor(item["vectors"]) for item in batch])
            # print("Batch type",type(batch))
        labels = torch.LongTensor([item["label"] for item in batch])
        return {"vectors": sentence_vector, "labels": labels, "sentences": [item["sentence"] for item in batch]}
    
    # Training dataloader: will reset itself each epoch
    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.params.batch_size, collate_fn=self.collate_fn, num_workers=4)
    
    # Validation dataloader: will reset itself each epoch
    def val_dataloader(self):
        return DataLoader(self.validation_data, batch_size=self.params.batch_size, collate_fn=self.collate_fn, num_workers=4)
    
    # Test dataloader: will reset itself each epoch
    def test_dataloader(self):
        return DataLoader(self.test_data, batch_size=self.params.batch_size, collate_fn=self.collate_fn, num_workers=4)
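
To make the `integer_input` branch of `collate_fn` concrete, here is a plain-Python sketch of what padding a batch of variable-length id sequences does. `pad_batch` is a hypothetical stand-in written for illustration; it mimics the effect of `pad_sequence(..., batch_first=True, padding_value=0)` on lists rather than tensors.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Three "sentences" of token ids with different lengths
batch = [[4, 9, 2], [7], [3, 5]]
padded = pad_batch(batch)
print(padded)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.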

We've now created the DataLoader and Datasets. Let's write the training and testing loops.
`LightningModule` organizes the PyTorch code into 5 sections

1. Computations (init).
2. Train loop (training_step)
3. Validation loop (validation_step)
4. Test loop (test_step)
5. Optimizers (configure_optimizers)

In [13]:
import torchmetrics
import torch.nn.functional as F

In [14]:
class EmotionClassifier(pl.LightningModule):
    def __init__(self, model, params):
        super().__init__()
        self.model = model
        self.params = params
        self.accuracy = torchmetrics.Accuracy(task="multiclass", num_classes=params.num_classes)

    def forward(self, x):
        return self.model(x)
    
    def training_step(self, batch, batch_idx):
        x = batch["vectors"]
        y = batch["labels"]
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y, reduction="mean")
        self.log_dict(
            {"train_loss": loss},
            batch_size=self.params.batch_size,
            prog_bar=True
        )
        return loss
    
    def validation_step(self, batch, batch_nb):
        x = batch["vectors"]
        y = batch["labels"]
        y_hat = self(x)
        val_loss = F.cross_entropy(y_hat, y, reduction="mean")
        predictions = torch.argmax(y_hat, dim=1)
        self.log_dict(
            {"val_loss": val_loss, "val_acc": self.accuracy(predictions, y)},
            batch_size=self.params.batch_size,
            prog_bar=True
        )
        return val_loss
    
    def test_step(self,batch, batch_nb):
        x = batch["vectors"]
        y = batch["labels"]
        y_hat = self(x)
        test_loss = F.cross_entropy(y_hat, y, reduction="mean")
        predictions = torch.argmax(y_hat, dim=1)
        self.log_dict(
            {"test_loss": test_loss, "test_acc": self.accuracy(predictions, y)},
            batch_size=self.params.batch_size,
            prog_bar=True
        )
        return test_loss
    
    def predict_step(self, batch, batch_idx):
        y_hat = self.model(batch["vectors"])
        predictions = torch.argmax(y_hat, dim=1)
        return {"logits": y_hat, "predictions": predictions,
                "labels": batch["labels"], "sentences": batch["sentences"]}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.params.learning_rate)
        return optimizer
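
The two quantities logged in the validation and test steps, mean cross-entropy over raw logits and argmax accuracy, can be reproduced by hand. The sketch below uses plain Python lists instead of tensors; the helper names and the toy logits are assumptions made for illustration.

```python
import math


def cross_entropy(logits, targets):
    """Mean cross-entropy for a batch of raw logits and integer class targets."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
    return total / len(targets)


def accuracy(logits, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


toy_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0], [1.0, 1.5, 0.0]]
toy_targets = [0, 2, 0]
loss = cross_entropy(toy_logits, toy_targets)
acc = accuracy(toy_logits, toy_targets)
print(loss, acc)
```

With uniform logits over `k` classes the loss is `log(k)`, so a 32-class model that has learned nothing would sit near `log(32) ≈ 3.47`.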

Once we have a `LightningModule` and a `LightningDataModule`, a `Trainer` automates everything else.
Let's write a helper function that takes the model, vectorizer, and hyperparameters.

In [15]:
def trainer(model, params, vectorizer):
    # Create a PyTorch Lightning trainer that validates after every epoch
    pl_trainer = pl.Trainer(max_epochs=params.max_epochs, check_val_every_n_epoch=1)

    # Initialize our data module with the passed vectorizer
    data_module = ClassificationDataModule(vectorizer, params)

    # Wrap the raw torch module in our LightningModule
    model = EmotionClassifier(model, params)

    # Train and validate the model
    pl_trainer.fit(model, data_module.train_dataloader(), val_dataloaders=data_module.val_dataloader())

    # Test the model
    pl_trainer.test(model, data_module.test_dataloader())

    # Predict on the same test set to show some output
    output = pl_trainer.predict(model, data_module.test_dataloader())

    # Print the first two examples of the second batch (output[1])
    for i in range(2):
        print("#########")
        print(f"Sentence: {output[1]['sentences'][i]}")
        print(f"Predicted Emotion: {integer_to_label[output[1]['predictions'][i].item()]}")
        print(f"Actual Label: {integer_to_label[output[1]['labels'][i].item()]}")

## Models
### Model 1: Average word vector of the sentence - Baseline

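Let's build a simple baseline first: each sentence is represented by the average of its pre-trained word vectors (loaded from spaCy's `en_core_web_md`), followed by a single linear layer.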
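
The averaging step is just an element-wise mean over the token vectors. Here is a minimal sketch with made-up 3-dimensional vectors; the `toy_vectors` table and the `mean_vector` helper are assumptions for illustration, and the real vectors have 300 dimensions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Toy "pre-trained" vectors, invented for this example
toy_vectors = {
    "i": [1.0, 0.0, 2.0],
    "am": [0.0, 2.0, 2.0],
    "happy": [2.0, 4.0, 2.0],
}

tokens = "i am happy".split()
sentence_vector = mean_vector([toy_vectors[token] for token in tokens])
print(sentence_vector)
```

Averaging discards word order, which is the limitation the sliding-window model tries to soften.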

In [16]:
import numpy as np
import en_core_web_md

# load the entire spaCy word-vector index into memory
loaded_spacy_model = en_core_web_md.load()

In [17]:
class WordVectorClassifier(torch.nn.Module):
    def __init__(self, word_vec_dimension, num_classes):
        super().__init__()
        self.classes = num_classes
        self.linear_layer = torch.nn.Linear(word_vec_dimension, num_classes)

    def forward(self, batch):
        """Projection from word_vec_dimension to num_classes

        Batch has shape (batch_size, word_vec_dimension): one pooled vector per sentence
        """
        return self.linear_layer(batch)

class HParams:
    batch_size: int = 32
    integer_input: bool = False
    word_vec_dimension: int = 300
    num_classes: int = 32
    learning_rate: float = 0.001
    max_epochs: int = 10


class SpacyVectorizer:
    def vectorize(self, sentence):
        """Given a sentence, tokenize it and reference pre-trained word vector for each token.
        
        Returns a tuple of sentence_vector and list of text tokens
        """
        spacy_doc = loaded_spacy_model.make_doc(sentence)  # tokenize only, e.g. "I am Wang"
        # One pre-trained vector per token: [vec("I"), vec("am"), vec(UNK)]
        word_vector = [token.vector for token in spacy_doc]
        sentence_tokens = [token.text for token in spacy_doc]
        # Average the token vectors into a single sentence vector
        sentence_vector = np.mean(np.array(word_vector), axis=0)
        return sentence_vector, sentence_tokens
    
    

In [18]:
trainer(model=WordVectorClassifier(HParams.word_vec_dimension, HParams.num_classes),
        params=HParams,
        vectorizer=SpacyVectorizer())

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /workspaces/nlp/text_classification/lightning_logs

  | Name     | Type                 | Params
--------------------------------------------------
0 | model    | WordVectorClassifier | 9.6 K 
1 | accuracy | MulticlassAccuracy   | 0     
--------------------------------------------------
9.6 K     Trainable params
0         Non-trainable params
9.6 K     Total params
0.039     Total estimated model params size (MB)


Epoch 9: 100%|██████████| 1375/1375 [00:12<00:00, 105.89it/s, loss=2.05, v_num=0, train_loss=1.930, val_loss=2.240, val_acc=0.371]
Testing DataLoader 0: 100%|██████████| 125/125 [00:01<00:00, 107.17it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.36250001192092896
        test_loss            2.229304552078247
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 125/125 [00:01<00:00, -1066.76it/s]
#########
Sentence: When I got falsely accused of eating my roommate's ice cream last night_comma_ I was completely outraged! I don't even eat ice cream because I'm lactose intolerant_comma_ and she knows that! I won

### Model 2: Sliding-window trigram over word vectors

In [19]:
class HParams:
    batch_size: int = 32
    integer_input: bool = False
    word_vec_dimension: int = 300
    num_classes: int = 32
    learning_rate: float = 0.001
    max_epochs: int = 10
    n_grams: int = 3


class SpacyChunkVectorizer:
    def __init__(self, params):
        self.params = params
        
    def vectorize(self, sentence):
        """Given a sentence, tokenize it, concatenate the word vectors of every
        window of n_grams consecutive tokens, and average the windows.

        Returns a tuple of sentence_vector and list of text tokens
        """
        n = self.params.n_grams
        spacy_doc = loaded_spacy_model.make_doc(sentence)  # tokenize only
        word_vector = [token.vector for token in spacy_doc]
        sentence_tokens = [token.text for token in spacy_doc]
        if len(word_vector) == 0:
            raise ValueError(f"Cannot build n-grams from an empty sentence: {sentence!r}")

        # Pad short sentences by repeating the last word vector until one window fits
        while len(word_vector) < n:
            word_vector.append(word_vector[-1])

        # Slide a window of n word vectors over the sentence and concatenate each window
        trigrams = [np.hstack(word_vector[i:i + n]) for i in range(len(word_vector) - n + 1)]

        # Average the windows into a single (n * word_vec_dimension,) vector
        sentence_vector = np.mean(np.array(trigrams), axis=0)
        return sentence_vector, sentence_tokens

In [20]:
v = SpacyChunkVectorizer(HParams)
vector, _ = v.vectorize("My name is Wang")
vector.shape

(900,)
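
The `(900,)` shape comes from concatenating three 300-dimensional word vectors per window and then averaging the windows. The same mechanics can be traced with 2-dimensional toy vectors. `window_average` is a hypothetical pure-Python helper, and the numbers are invented for illustration.

```python
def window_average(word_vectors, n=3):
    """Concatenate each window of n consecutive vectors, then average the windows."""
    if not word_vectors:
        raise ValueError("need at least one word vector")

    # Repeat the last vector so that short inputs still fill one window
    padded = [list(vec) for vec in word_vectors]
    while len(padded) < n:
        padded.append(padded[-1])

    windows = []
    for i in range(len(padded) - n + 1):
        window = []
        for vec in padded[i:i + n]:
            window.extend(vec)  # concatenation, like np.hstack
        windows.append(window)

    # Element-wise mean across windows
    return [sum(column) / len(windows) for column in zip(*windows)]


toy_word_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
pooled = window_average(toy_word_vectors, n=3)
print(pooled)
```

Four tokens yield two windows of length 6 (3 tokens × 2 dimensions), and the result keeps a separate slot for the first, second, and third position of a window.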

In [21]:
trainer(
    model=WordVectorClassifier(
        HParams.word_vec_dimension * HParams.n_grams,
        HParams.num_classes
    ),
    params=HParams,
    vectorizer=SpacyChunkVectorizer(HParams)
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name     | Type                 | Params
--------------------------------------------------
0 | model    | WordVectorClassifier | 28.8 K
1 | accuracy | MulticlassAccuracy   | 0     
--------------------------------------------------
28.8 K    Trainable params
0         Non-trainable params
28.8 K    Total params
0.115     Total estimated model params size (MB)


Epoch 9: 100%|██████████| 1375/1375 [00:16<00:00, 83.88it/s, loss=1.83, v_num=1, train_loss=1.710, val_loss=2.110, val_acc=0.394]
Testing DataLoader 0: 100%|██████████| 125/125 [00:01<00:00, 89.60it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.3932499885559082
        test_loss           2.1019606590270996
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 125/125 [00:01<00:00, -783.54it/s] 
#########
Sentence: When I got falsely accused of eating my roommate's ice cream last night_comma_ I was completely outraged! I don't even eat ice cream because I'm lactose intolerant_comma_ and she knows that! I wonder

In [22]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams

In [23]:
text = "My name"
char_trigrams = [*ngrams(text, 3, pad_left=True, left_pad_symbol="")]
char_trigrams_token = [''.join(chars) for chars in char_trigrams]
char_trigrams_token


['M', 'My', 'My ', 'y n', ' na', 'nam', 'ame']
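
For readers without NLTK at hand, the left-padded character trigrams shown above can be reproduced with a few lines of plain Python. `char_trigrams` is a hypothetical helper that assumes an empty-string pad symbol, matching the call in the cell above.

```python
def char_trigrams(text, n=3):
    """Character n-grams with n-1 empty-string pads on the left."""
    padded = [""] * (n - 1) + list(text)
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


tokens = char_trigrams("My name")
print(tokens)
```

Because the pad symbol is empty, the first two tokens are shorter than three characters, which lets the model tell sentence-initial characters apart.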

### Model 3: EmbeddingBag on character trigrams

In [24]:
from collections import Counter

In [25]:
class HParamsCTT:
    batch_size: int = 16
    integer_input: bool = True
    num_classes: int = 32
    learning_rate: float = 0.001
    max_epochs: int = 10
    n_grams: int = 3
    embed_dim: int = 350
    num_tokens: int = 5000


class CharacterTrigramTokenizer:
    """
    We represent a sentence as a sequence of character-trigram ids.
    The num_tokens most frequent trigrams in the training data get ids 1..num_tokens;
    every other trigram is mapped to id 0.
    """
    def __init__(self, train_data, num_tokens):
        self.num_tokens = num_tokens
        self.token_to_id_map = self.get_char_trigram_token_map(train_data, num_tokens)

    def get_char_trigrams(self, sentence):
        char_trigrams = [*ngrams(sentence, 3, pad_left=True, left_pad_symbol="")]
        char_trigrams_token = [''.join(chars) for chars in char_trigrams]
        return char_trigrams_token

    def get_char_trigram_token_map(self, train_data, num_tokens):
        count = Counter()
        for label, sentence in train_data:
            char_trigrams_token = self.get_char_trigrams(sentence)
            count.update(char_trigrams_token) 

        token_to_id_map = {d[0]: i+1 for i, d in enumerate(count.most_common(num_tokens))}
        return token_to_id_map
    
    def vectorize(self, sentence):
        trigrams = self.get_char_trigrams(sentence)
        sentence_vector = [self.token_to_id_map.get(trigram, 0) for trigram in trigrams]
        return sentence_vector, None
        

In [26]:
ctt = CharacterTrigramTokenizer(training_data, HParamsCTT.num_tokens)
i = 0
for k, v in ctt.token_to_id_map.items():
    if i >= 5:
        break
    print(k, v)
    i += 1

 th 1
 I  2
the 3
 to 4
ing 5
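
The vocabulary logic can be checked end to end on a tiny corpus: count trigrams, keep the most frequent ones with ids starting at 1, and send everything else to 0. The corpus and helper names below are assumptions for illustration, and the trigram function is a plain-Python stand-in for the NLTK call.

```python
from collections import Counter


def char_trigrams(text, n=3):
    """Character n-grams with n-1 empty-string pads on the left."""
    padded = [""] * (n - 1) + list(text)
    return ["".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def build_vocab(sentences, num_tokens):
    """Map the num_tokens most frequent trigrams to ids 1..num_tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_trigrams(sentence))
    return {token: i + 1 for i, (token, _) in enumerate(counts.most_common(num_tokens))}


def vectorize(sentence, token_to_id):
    """Turn a sentence into trigram ids; unseen or rare trigrams become 0."""
    return [token_to_id.get(trigram, 0) for trigram in char_trigrams(sentence)]


toy_corpus = ["aaaa", "aaab"]
toy_vocab = build_vocab(toy_corpus, num_tokens=2)
ids = vectorize("aab", toy_vocab)
print(toy_vocab, ids)
```

Here `'aaa'` occurs three times and receives id 1, while `'aab'` falls outside the two-token vocabulary and is encoded as 0.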


Now let's create a simple `EmbeddingBag`-based model and start training it.

In [27]:
class EmbeddingBagClassificationModel(torch.nn.Module):
    def __init__(self, num_tokens, embed_dim, n_classes):
        super().__init__()
        self.classes = n_classes
        self.embedding = torch.nn.EmbeddingBag(num_tokens, embed_dim)
        self.linear = torch.nn.Linear(embed_dim, n_classes)

    def forward(self, batch):
        embed = self.embedding(batch)
        return self.linear(embed)
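
Conceptually, `EmbeddingBag` in its default `mean` mode looks up one embedding row per id and averages the rows of each bag (here, each sentence). The sketch below mirrors that with plain lists. The table values are invented and `embedding_bag_mean` is a hypothetical helper, not the PyTorch implementation.

```python
def embedding_bag_mean(table, bags):
    """For each bag of ids, average the corresponding rows of the embedding table."""
    pooled = []
    for bag in bags:
        rows = [table[token_id] for token_id in bag]
        pooled.append([sum(column) / len(rows) for column in zip(*rows)])
    return pooled


# Toy embedding table: row index = token id (row 0 doubles as unknown/padding)
toy_table = [
    [0.0, 0.0],  # id 0
    [1.0, 3.0],  # id 1
    [3.0, 1.0],  # id 2
]

# Second bag is the single id 1, right-padded with 0 to the batch length
toy_bags = [[1, 2], [1, 0]]
pooled = embedding_bag_mean(toy_table, toy_bags)
print(pooled)
```

Since the model above does not set a `padding_idx`, padded positions take part in the mean just like real tokens, as the second bag shows.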

In [28]:
trainer(
    model = EmbeddingBagClassificationModel(
        HParamsCTT.num_tokens + 1,
        HParamsCTT.embed_dim,
        HParamsCTT.num_classes
    ),
    params=HParamsCTT,
    vectorizer=ctt
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name     | Type                            | Params
-------------------------------------------------------------
0 | model    | EmbeddingBagClassificationModel | 1.8 M 
1 | accuracy | MulticlassAccuracy              | 0     
-------------------------------------------------------------
1.8 M     Trainable params
0         Non-trainable params
1.8 M     Total params
7.046     Total estimated model params size (MB)


Epoch 9: 100%|██████████| 2750/2750 [00:27<00:00, 100.85it/s, loss=1.02, v_num=2, train_loss=0.625, val_loss=2.040, val_acc=0.438]
Testing DataLoader 0: 100%|██████████| 250/250 [00:00<00:00, 281.76it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_acc            0.40299999713897705
        test_loss           2.1087276935577393
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Predicting DataLoader 0: 100%|██████████| 250/250 [00:00<00:00, -3130.87it/s] 
#########
Sentence: I am very hopeful to go on vacation this summer Summer can be over in July for teachers sometimes. Are you a teacher?
Predicted Emotion: hopeful
Actual Label: hopeful
#########
Sentence: I just b

In [29]:
%load_ext tensorboard

In [30]:
%tensorboard --logdir lightning_logs