# Practical machine learning and deep learning. Lab 4

# Many-to-many NLP task.

# [Competition](https://www.kaggle.com/t/afa89356762e438cad5f04bf0e23f3ce)

## Goal

Your goal is to implement a neural network that tags each entity (token) with its part of speech.

## Submission

The submission format is described on the competition page.

> Remember, you can structure your solution however you like. The template classes and functions in this file are just hints.

In [9]:
import pandas as pd
import torch
import warnings

warnings.filterwarnings('ignore')

## Data reading and preprocessing

In [10]:
train = pd.read_csv('/kaggle/input/pmldl-week4-many-to-many-nlp-task/train.csv')
test = pd.read_csv('/kaggle/input/pmldl-week4-many-to-many-nlp-task/test.csv')

In [11]:
train.head()

| Unnamed: 0 | sentence_id | entity_id | entity | tag |
|---|---|---|---|---|
| 0 | 0 | 0 | It | PRON |
| 1 | 0 | 1 | is | VERB |
| 2 | 0 | 2 | true | ADJ |
| 3 | 0 | 3 | that | ADP |
| 4 | 0 | 4 | his | DET |


In [12]:
test.head()

| Unnamed: 0 | id | sentence_id | entity_id | entity |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | In |
| 1 | 1 | 0 | 1 | another |
| 2 | 2 | 0 | 2 | setback |
| 3 | 3 | 0 | 3 | yesterday |
| 4 | 4 | 0 | 4 | , |


First, let's split the dataset into training and validation parts. We split the sentence ids at random, so every sentence ends up entirely in one part.

In [16]:
from sklearn.model_selection import train_test_split
VALIDATION_RATIO = 0.2
# +1 so that the last sentence id is included in the range
train_split, val_split = train_test_split(range(train['sentence_id'].max() + 1), test_size=VALIDATION_RATIO, random_state=420)

Then select the rows of the original dataframe whose sentence ids fall into each split.

In [17]:
train_dataframe = train[train['sentence_id'].isin(train_split)]
val_dataframe = train[train['sentence_id'].isin(val_split)]

In [18]:
pos_tags = ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRT', 'PRON', 'VERB', '.', 'X']
cat2idx = {tag: i for i, tag in enumerate(pos_tags)}
idx2cat = {v: k for k, v in cat2idx.items()}

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']
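
One way to read these constants: the special symbols take the first four slots of the vocabulary, so `<unk>` gets index 0 and `<pad>` gets index 1. The dependency-free sketch below illustrates the idea. `build_token_index` and `encode` are hypothetical helpers (they are not part of torchtext), and the `min_freq` cutoff is an assumption.

```python
from collections import Counter

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']


def build_token_index(sentences, min_freq=1):
    """Hypothetical helper: map tokens to ids, special symbols first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(special_symbols)
    # Sorted so that the mapping is deterministic across runs.
    itos += sorted(tok for tok, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}


def encode(sentence, stoi):
    """Replace every token with its id, falling back to <unk>."""
    return [stoi.get(tok, UNK_IDX) for tok in sentence]


stoi = build_token_index([["it", "is", "true"], ["it", "is", "."]], min_freq=2)
print(encode(["it", "is", "new"], stoi))  # [5, 4, 0]
```

Tokens never seen in training (here `new`) fall back to `UNK_IDX`, which is what the validation and test sets need.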

To work with the data more conveniently, let's wrap it in a dataset class.



In [19]:
import torch
torch.manual_seed(420)
from torchtext.vocab import build_vocab_from_iterator


class PosTaggingDataset(torch.utils.data.Dataset):
    def __init__(self, dataframe: pd.DataFrame, vocab = None, max_size=100):
        self.dataframe = dataframe
        self._preprocess()
        self.vocab = vocab or self._create_vocab()

    def _preprocess(self):
        # fill missing values in entities
        ...

        # Fill missing tag to `other` - `X`
        ...

        # Clean entities column
        ...
        
        # Group the rows by sentence, so that sentences[i] and tags[i]
        # hold the tokens and the tags of the same sentence
        ...

        self.sentences = ...
        self.tags = ...
    
    def _create_vocab(self):
        # creates the vocabulary that is used for encoding
        # the sequence of tokens (the split sentence)
        vocab = ...
        return vocab

    def _get_sentence(self, index: int) -> list:
        # retrieves sentence from dataset by index
        ...
        return self.vocab(sent)

    def _get_labels(self, index: int) -> list:
        # retrieves tags from dataset by index
        tags = ...
        return tags

    def __getitem__(self, index) -> tuple[list, list]:
        return self._get_sentence(index), self._get_labels(index)
    
    def __len__(self) -> int:
        return len(self.sentences)
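
The `_preprocess` step could look roughly like the sketch below, which runs on a toy frame with the same columns as `train.csv`. Treat it as one possible approach, not the reference solution: the `<unk>` filler for missing entities, the lower-casing, and the helper name `group_sentences` are all assumptions. The test dataframe has no `tag` column, so it would need a separate branch.

```python
import pandas as pd

# Toy frame with the same columns as train.csv (the values are made up).
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 1, 1],
    "entity_id":   [0, 1, 2, 0, 1],
    "entity":      ["It", "is", None, "Hello", "."],
    "tag":         ["PRON", "VERB", None, "X", "."],
})


def group_sentences(df: pd.DataFrame):
    """Hypothetical helper: return parallel lists of token lists and tag lists."""
    df = df.copy()
    # Missing entities become the <unk> symbol, missing tags become `X`.
    df["entity"] = df["entity"].fillna("<unk>").astype(str).str.lower()
    df["tag"] = df["tag"].fillna("X")
    # Keep tokens in their original order inside each sentence.
    df = df.sort_values(["sentence_id", "entity_id"])
    grouped = df.groupby("sentence_id", sort=True)
    sentences = grouped["entity"].agg(list).tolist()
    tags = grouped["tag"].agg(list).tolist()
    return sentences, tags


sentences, tags = group_sentences(toy)
print(sentences)  # [['it', 'is', '<unk>'], ['hello', '.']]
print(tags)       # [['PRON', 'VERB', 'X'], ['X', '.']]
```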

In [20]:
# Create train dataset
train_dataset = ...
val_dataset = ...

Now that we have torch datasets, creating the dataloaders is straightforward.

In [22]:
batch_size = 128
max_size = 50

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def collate_batch(batch: list):
    # Collate a list of samples into a tensor batch.
    # The input is a list of pairs from the dataset:
    # [([ent1, ent2, ...], [tag1, tag2, ...]), ([ent1, ent2, ...], [tag1, tag2, ...]), ...]
    # The output is a tensor of entities and a tensor of tags.
    sentences_batch, postags_batch = [], []
    for _sent, _postags in batch:
        ...

    # Remember: a recurrent network consumes one time step at a time,
    # so the first items of all sequences are passed as the first input.
    # The tensors should therefore have shape (max_size, ...., batch_size)
    return ..., ...

train_dataloader = ...
val_dataloader = ...

In [24]:
# just to check that all shapes are correct

for batch in train_dataloader:
    inp, out = batch
    print(inp.shape)
    print(out.shape)
    break

torch.Size([50, 128])
torch.Size([50, 128])
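
A minimal sketch of a collate function that produces the `[max_size, batch_size]` layout printed above. It assumes that the dataset already returns integer ids for both tokens and tags, and that padded tag positions are marked with `TAG_PAD_IDX = -100` so the loss can skip them later; both the constant and the name `collate_fixed` are hypothetical.

```python
import torch

PAD_IDX = 1          # index of <pad> among the special symbols
TAG_PAD_IDX = -100   # assumption: a value outside the 0..11 tag range


def collate_fixed(batch, max_size=50):
    """Pad or truncate every sample to max_size; time is the first dimension."""
    sents = torch.full((max_size, len(batch)), PAD_IDX, dtype=torch.long)
    tags = torch.full((max_size, len(batch)), TAG_PAD_IDX, dtype=torch.long)
    for col, (sent, postags) in enumerate(batch):
        n = min(len(sent), max_size)
        sents[:n, col] = torch.tensor(sent[:n], dtype=torch.long)
        tags[:n, col] = torch.tensor(postags[:n], dtype=torch.long)
    return sents, tags


toy_batch = [([4, 5, 6], [8, 9, 0]), ([7], [5])]
inp, out = collate_fixed(toy_batch, max_size=4)
print(inp.shape)            # torch.Size([4, 2])
print(inp[:, 1].tolist())   # [7, 1, 1, 1]
print(out[:, 0].tolist())   # [8, 9, 0, -100]
```

Such a function can be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument.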


## Creating the network

Many-to-many (seq2seq) networks rely on recurrent units. These let the network learn hidden features and carry information from one token to the next.

### Embeddings

For embeddings, you can train your own features with `nn.Embedding` or use pretrained embeddings (such as GloVe, FastText, or BERT).
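
Both options can be sketched with `nn.Embedding`. The sizes are toy values, and the random `pretrained_matrix` only stands in for real GloVe or FastText vectors, which would have to be loaded and aligned with the vocabulary separately.

```python
import torch
import torch.nn as nn

PAD_IDX = 1
vocab_size, emb_dim = 10, 4  # toy sizes

# Option 1: embeddings trained from scratch; the <pad> row stays at zero.
trainable = nn.Embedding(vocab_size, emb_dim, padding_idx=PAD_IDX)

# Option 2: frozen pretrained vectors (random stand-in for a real matrix).
pretrained_matrix = torch.randn(vocab_size, emb_dim)
frozen = nn.Embedding.from_pretrained(pretrained_matrix, freeze=True, padding_idx=PAD_IDX)

tokens = torch.tensor([[4, 5], [6, PAD_IDX]])  # [sent len, batch size]
print(trainable(tokens).shape)  # torch.Size([2, 2, 4])
print(frozen(tokens).shape)     # torch.Size([2, 2, 4])
```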

### Recurrent

For processing sequences you can use recurrent units like `LSTM`.
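
As a quick shape check, here is an `nn.LSTM` applied to a time-first batch. The sizes are arbitrary, and the bidirectional setting is an assumption; it is a common choice for tagging because each token then sees both its left and right context.

```python
import torch
import torch.nn as nn

sent_len, batch_size, emb_dim, hidden_dim = 5, 3, 4, 8

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hidden_dim, bidirectional=True)
embedded = torch.randn(sent_len, batch_size, emb_dim)  # [sent len, batch size, emb dim]

outputs, (hidden, cell) = lstm(embedded)
print(outputs.shape)  # torch.Size([5, 3, 16]): one vector per token, both directions concatenated
print(hidden.shape)   # torch.Size([2, 3, 8]): final state of each direction
```

For tagging, `outputs` is the tensor of interest, since it holds one hidden vector per token.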

### Linear

Add a simple `nn.Linear` layer to map the recurrent outputs to tag scores. ~~This is basic stuff, what do you want~~

### Regularization

Remember to add Dropout and Batch Normalization for regularization.
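
The sketch below shows both layers on a time-first tensor, with toy sizes. Two details are worth noting: dropout is only active in training mode, and `nn.BatchNorm1d` expects the feature dimension second, so the tensor is permuted before and after. Note that this simple version also normalizes over padded positions, which is a simplification.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sent_len, batch_size, features = 5, 3, 8
hidden = torch.randn(sent_len, batch_size, features)

dropout = nn.Dropout(p=0.5)
dropout.train()
noisy = dropout(hidden)   # some values zeroed, the rest scaled by 1 / (1 - p)
dropout.eval()
clean = dropout(hidden)   # identity in evaluation mode

bn = nn.BatchNorm1d(features)
# [sent len, batch, features] -> [batch, features, sent len] -> back again
normed = bn(hidden.permute(1, 2, 0)).permute(2, 0, 1)
print(normed.shape)  # torch.Size([5, 3, 8])
```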

In [27]:
import torch.nn as nn

class POSTagger(nn.Module):
    def __init__(self,  ...):
        
        super().__init__()
        
        ...
    def forward(self, text):

        # text shape= [sent len, batch size]
        
        ...
        
        # predictions shape = [sent len, batch size, output dim]
        return predictions
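
One possible way to fill in the template is a bidirectional LSTM tagger, sketched below. The class name `BiLSTMTagger`, the layer sizes, and the dropout rate are assumptions chosen for illustration; any architecture that maps `[sent len, batch size]` to `[sent len, batch size, output dim]` fits the template.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    """Sketch: Embedding -> BiLSTM -> Dropout -> Linear."""

    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, dropout=0.25, pad_idx=1):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=True,
            # PyTorch applies this dropout between stacked layers only.
            dropout=dropout if n_layers > 1 else 0.0,
        )
        self.dropout = nn.Dropout(dropout)
        # Forward and backward states are concatenated, hence hidden_dim * 2.
        self.fc = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, text):
        # text shape = [sent len, batch size]
        embedded = self.dropout(self.embedding(text))
        outputs, _ = self.lstm(embedded)
        # predictions shape = [sent len, batch size, output dim]
        return self.fc(self.dropout(outputs))


sketch_model = BiLSTMTagger(input_dim=100, embedding_dim=16, hidden_dim=32, output_dim=12)
sketch_out = sketch_model(torch.randint(0, 100, (50, 8)))
print(sketch_out.shape)  # torch.Size([50, 8, 12])
```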

## Training

When training, take into account the shape of your output and the shape of the labels. Perform the required transformations and use a loss function that fits your task.
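
As an illustration of the shape handling, the sketch below flattens the time and batch dimensions before calling `nn.CrossEntropyLoss`. It assumes that padded label positions carry the value `TAG_PAD_IDX = -100` (a hypothetical constant), which the loss skips via `ignore_index`.

```python
import torch
import torch.nn as nn

TAG_PAD_IDX = -100  # assumption: label value used at padded positions
sent_len, batch_size, output_dim = 6, 4, 12

torch.manual_seed(0)
predictions = torch.randn(sent_len, batch_size, output_dim)     # model output
labels = torch.randint(0, output_dim, (sent_len, batch_size))   # gold tags
labels[4:, 0] = TAG_PAD_IDX                                      # first sentence has 4 tokens

loss_fn = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)
# CrossEntropyLoss expects [N, classes] scores and [N] targets.
loss = loss_fn(predictions.reshape(-1, output_dim), labels.reshape(-1))
print(loss.dim())  # 0, a scalar averaged over the non-padded tokens
```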

> Do not forget about tqdm and logging: you want a readable training run, not some unreadable ~~sht~~ logs.

In [31]:
from tqdm.autonotebook import tqdm

def train_one_epoch(
    model,
    loader,
    optimizer,
    loss_fn,
    epoch_num=-1
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: train",
        leave=True,
    )
    model.train()
    train_loss = 0.0
    total = 0
    for i, batch in loop:
        texts, labels = batch
        
        ...

        loop.set_postfix({"loss": train_loss/total})

def val_one_epoch(
    model,
    loader,
    loss_fn,
    epoch_num=-1,
    best_so_far=0.0,
    ckpt_path='best.pt'
):
    
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: val",
        leave=True,
    )
    val_loss = 0.0
    correct = 0
    total = 0
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            texts, labels = batch

            ...

            loop.set_postfix({"loss": val_loss/total, "acc": correct / total})
        
        if correct / total > best_so_far:
            ...

    return best_so_far
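
The `correct` and `total` counters in the validation loop should only count real tokens, not padding. A minimal sketch, assuming padded labels are marked with `-100`; the helper name `count_correct` is hypothetical.

```python
import torch


def count_correct(predictions, labels, ignore_index=-100):
    """Return (number of correctly tagged tokens, number of real tokens)."""
    predicted_tags = predictions.argmax(dim=-1)  # [sent len, batch size]
    mask = labels != ignore_index                # True where the label is real
    correct = (predicted_tags[mask] == labels[mask]).sum().item()
    return correct, int(mask.sum().item())


# Toy scores whose argmax is [[1, 3], [2, 3], [0, 3]].
toy_logits = torch.nn.functional.one_hot(
    torch.tensor([[1, 3], [2, 3], [0, 3]]), num_classes=4
).float()
toy_labels = torch.tensor([[1, 3], [2, 0], [-100, -100]])

print(count_correct(toy_logits, toy_labels))  # (3, 4)
```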

In [39]:
INPUT_DIM = len(train_dataset.vocab)
OUTPUT_DIM = len(pos_tags)

...


model = ....to(device)

optimizer = ...
loss_fn = ...

In [None]:
best = -float('inf')
num_epochs = ...
for epoch in range(num_epochs):
    train_one_epoch(model, train_dataloader, optimizer, loss_fn, epoch_num=epoch)
    # keep the best score between epochs, so only improved models are checkpointed
    best = val_one_epoch(model, val_dataloader, loss_fn, epoch, best_so_far=best)

# Predictions

Write the prediction code. That's it. No more instructions: you have already done this three times.

In [41]:
# you can use the same dataset class
test_dataset = ...

In [42]:
batch_size = 128

# remember that for training we can use padding, but for testing we need to write
# exactly as many predictions as each sentence has tokens into the submission
def collate_batch(batch: list):
    sentences_batch, sentences_lengths = [], []
    for _sent, _ in batch:
        ...

    return ...

test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_batch)
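
A sketch of a test-time collate function that keeps the true sentence lengths. It pads to the longest sentence in the batch with `pad_sequence` instead of truncating to a fixed size, so no token is lost; the name `collate_with_lengths` is hypothetical, and the dataset is assumed to return `(token ids, anything)` pairs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1


def collate_with_lengths(batch):
    """Return a [max len in batch, batch size] tensor and the true lengths."""
    sentences_batch, sentences_lengths = [], []
    for _sent, _ in batch:
        sentences_batch.append(torch.tensor(_sent, dtype=torch.long))
        sentences_lengths.append(len(_sent))
    # pad_sequence is time-first by default (batch_first=False).
    padded = pad_sequence(sentences_batch, padding_value=PAD_IDX)
    return padded, sentences_lengths


padded, lengths = collate_with_lengths([([4, 5, 6], None), ([7], None)])
print(padded.shape)           # torch.Size([3, 2])
print(padded[:, 1].tolist())  # [7, 1, 1]
print(lengths)                # [3, 1]
```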

In [43]:
def predict(
    model,
    loader,
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc="Predictions",
        leave=True,
    )
    predictions = []
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, batch in loop:
            ...

    return predictions
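
Inside the loop, the padded positions have to be dropped before the tags are appended to `predictions`. A minimal sketch, assuming the collate function returns the true lengths alongside the padded batch and that the dataloader keeps the original row order (`shuffle=False`); `trim_and_flatten` is a hypothetical helper.

```python
import torch


def trim_and_flatten(logits, lengths):
    """Turn [sent len, batch size, output dim] scores into one flat list of tag ids."""
    predicted = logits.argmax(dim=-1)  # [sent len, batch size]
    flat = []
    for column, length in enumerate(lengths):
        # Keep only the first `length` time steps of each sentence.
        flat.extend(predicted[:length, column].tolist())
    return flat


# Toy scores whose argmax is [[1, 9], [4, 0], [5, 0]]; the second sentence has one token.
toy_scores = torch.nn.functional.one_hot(
    torch.tensor([[1, 9], [4, 0], [5, 0]]), num_classes=12
).float()
print(trim_and_flatten(toy_scores, [3, 1]))  # [1, 4, 5, 9]
```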

In [44]:
ckpt = torch.load("best.pt")
model.load_state_dict(ckpt)

predictions = predict(model, test_dataloader)
predictions[:10]

Predictions:   0%|          | 0/113 [00:00<?, ?it/s]

[1, 4, 5, 5, 10, 5, 7, 5, 5, 9]

In [45]:
results = pd.Series(predictions).apply(lambda x: idx2cat[x])
results.to_csv('submission.csv', index_label='id')

In [46]:
results

0          ADP
1          DET
2         NOUN
3         NOUN
4            .
          ... 
303020    NOUN
303021     PRT
303022    VERB
303023    NOUN
303024       .
Length: 303025, dtype: object