# Practical machine learning and deep learning. Lab 3

# Deep Learning in Natural Language Processing

# [Competition](https://www.kaggle.com/t/4677b08c063f433ba1eb8f3543af90b4)

## Goal

Your goal is to implement a neural network that classifies Amazon product reviews.

## Submission

The submission format is described on the competition page.

## Data preprocessing

Data preprocessing is an essential step in building a Machine Learning model, and the quality of the model depends on how well the data has been preprocessed.

In NLP, text preprocessing is the first step in the process of building a model.

The various text preprocessing steps are:

* Tokenization
* Lower casing
* Stop words removal
* Stemming
* Lemmatization

These steps are widely used to reduce dimensionality, since they shrink the vocabulary the model has to handle.
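
To make the dimensionality point concrete, the sketch below counts distinct tokens before and after lowercasing and stop word removal. The sample sentences and the tiny `STOP_WORDS` set are made up for illustration; a real pipeline would use a full stop word list such as the one shipped with `nltk`.

```python
# Toy illustration: normalization shrinks the vocabulary.
# The reviews and STOP_WORDS below are invented for this example.
reviews = [
    "The battery is great and the screen is great",
    "Great battery but THE screen is dim",
]
STOP_WORDS = {"the", "is", "and", "but"}

raw_tokens = [tok for review in reviews for tok in review.split()]
lowered = [tok.lower() for tok in raw_tokens]
filtered = [tok for tok in lowered if tok not in STOP_WORDS]

raw_vocab = set(raw_tokens)
lowered_vocab = set(lowered)
filtered_vocab = set(filtered)

print(len(raw_vocab), len(lowered_vocab), len(filtered_vocab))  # 11 8 4
```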

First, let's read the input data and then perform the preprocessing steps.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_dataframe = pd.read_csv('/kaggle/input/pmldl-week-3-dl-in-natural-language-processing/train.csv')
test_dataframe = pd.read_csv('/kaggle/input/pmldl-week-3-dl-in-natural-language-processing/test.csv')

train_dataframe.head()

The training data has `4` features (`Title`, `Helpfulness`, `Score` and `Text`) and a target column (`Category`). The test data has the same features but no target column.

First, let's write functions for preprocessing the helpfulness and score features in case we need them.

In [None]:
def preprocess_score_inplace(df):
    """
    Normalizes score to make it from 0 to 1.
    
    For now it is from 1.0 to 5.0, so natural choice
    is to normalize by (f - 1.0)/4.0
    """
    df['Score'] = (df['Score'] - df['Score'].min()) / (df['Score'].max() - df['Score'].min())
    return df

def preprocess_helpfulness_inplace(df):
    """
    Splits feature by '/' and normalize helpfulness to make it from 0 to 1
    
    The total number of assessments can be 0, so let's substitute it
    with 1. The resulting helpfulness still will be zero but we
    remove the possibility of division by zero exception.
    """
    _splitted = df['Helpfulness'].str.split('/', expand=True)
    _helpful, _total = _splitted[0], _splitted[1]
    _total.replace("0", "1", inplace=True)
    df['Helpfulness'] = _helpful.astype(int) / _total.astype(int)
    return df    
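
One detail worth checking: the docstring of `preprocess_score_inplace` describes a fixed-range rule `(f - 1.0) / 4.0`, while the code scales by the minimum and maximum observed in the dataframe. The two agree only when the data contains both a 1.0 and a 5.0 score. The plain-Python sketch below (the helper names are hypothetical) shows the difference on a sample that never contains the lowest score.

```python
def fixed_range(scores, low=1.0, high=5.0):
    """Scale with the known rating bounds (assumed to be 1.0 and 5.0)."""
    return [(s - low) / (high - low) for s in scores]

def min_max(scores):
    """Scale with the bounds observed in this particular sample."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

sample = [3.0, 4.0, 5.0]          # no 1.0 or 2.0 in this sample
print(fixed_range(sample))        # [0.5, 0.75, 1.0]
print(min_max(sample))            # [0.0, 0.5, 1.0]
```

Since train and test are normalized separately in this notebook, the fixed-range variant is the one that guarantees both end up on the same scale.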

The two other features are both text. For simplicity, let's concatenate them so that we have one full text feature. The resulting code is also a function.

In [None]:
def concat_title_text_inplace(df):
    """
    Concatenates Title and Text columns together
    """
    df['Text'] = df['Title'] + " " + df['Text']
    df.drop('Title', axis=1, inplace=True)
    return df

Also, encode the target categories so that each label becomes an integer index.

In [None]:
enc = LabelEncoder()

# LabelEncoder expects a 1-D array of labels, so pass the column directly
cat_encoded = enc.fit_transform(train_dataframe['Category']).astype(np.int16)
train_df_enc = train_dataframe.copy()
train_df_enc['Category'] = cat_encoded

train_df_enc.head()
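
Conceptually, label encoding maps each distinct category to an integer index and keeps the mapping so that predictions can be decoded later. The sketch below mimics that with plain Python; the category names are invented, and sorting the classes mirrors what `LabelEncoder` does.

```python
labels = ["toys", "books", "toys", "grocery"]   # made-up categories

classes = sorted(set(labels))                    # ['books', 'grocery', 'toys']
to_index = {name: i for i, name in enumerate(classes)}

encoded = [to_index[name] for name in labels]    # forward mapping
decoded = [classes[i] for i in encoded]          # inverse mapping

print(encoded)            # [2, 0, 2, 1]
print(decoded == labels)  # True
```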

Let's visualize our first stage of preprocessing.

In [None]:
train_copy = train_df_enc.head().copy()

preprocess_score_inplace(
    preprocess_helpfulness_inplace(
        concat_title_text_inplace(train_copy)
    )
)


### Text cleaning

For text cleaning, you can use lowercasing, punctuation removal, number removal, tokenization, stop word removal, and stemming. Together these strip most of the noise from the text.

In [None]:
import re

def lower_text(text: str):
    return text.lower()

def remove_numbers(text: str):
    """
    Substitute all numbers with a space, e.g. in
    "there is5dogs".
    
    If subs with '' -> "there isdogs"
    With ' ' -> "there is dogs"
    """
    text_nonum = re.sub(r'\d+', ' ', text)
    return text_nonum

def remove_punctuation(text: str):
    """
    Substitute all punctuation with a space, e.g. in
    "hello!nice to meet you"
    
    If subs with '' -> "hellonice to meet you"
    With ' ' -> "hello nice to meet you"
    """
    # keep only lowercase letters and whitespace ('|' inside [] would be a literal pipe)
    text_nopunct = re.sub(r'[^a-z\s]+', ' ', text)
    return text_nopunct

def remove_multiple_spaces(text: str):
    # raw string avoids the invalid-escape warning for '\s'
    text_no_doublespace = re.sub(r'\s+', ' ', text).strip()
    return text_no_doublespace

This will give us clean text.

In [None]:
sample_text = train_copy['Text'][4]

_lowered = lower_text(sample_text)
_without_numbers = remove_numbers(_lowered)
_without_punct = remove_punctuation(_without_numbers)
_single_spaced = remove_multiple_spaces(_without_punct)

print(sample_text)
print('-'*10)
print(_lowered)
print('-'*10)
print(_without_numbers)
print('-'*10)
print(_without_punct)
print('-'*10)
print(_single_spaced)
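
A small regex detail matters in cleaning functions like these: inside a character class, `|` is a literal pipe rather than an alternation operator, so a class such as `[^a-z|\s]` leaves pipes in the text. The snippet below compares the two patterns on a made-up string.

```python
import re

text = "great|value!!  5stars"

# '|' is part of the allowed set here, so it survives the substitution
keeps_pipe = re.sub(r"[^a-z|\s]+", " ", text)
# only lowercase letters and whitespace are allowed here
drops_pipe = re.sub(r"[^a-z\s]+", " ", text)

print(repr(keeps_pipe))
print(repr(drops_pipe))
```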

Now, harder preprocessing: tokenization, stop words removal and stemming.
For that you can use several packages, but we encourage you to use `nltk` - Natural Language ToolKit as well as `torchtext`.


Take a look at:
* `nltk.tokenize.word_tokenize` or `torchtext.data.utils.get_tokenizer` for tokenization
* `nltk.corpus.stopwords` for stop words removal
* `nltk.stem.PorterStemmer` for stemming

In [None]:
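import nltk
import nltk.tokenize as tkn
import nltk.corpus as corpus
import nltk.stem as stem

# word_tokenize and stopwords need these NLTK resources; newer NLTK releases
# look for 'punkt_tab' instead of 'punkt', so request both.
for resource in ('punkt', 'punkt_tab', 'stopwords'):
    nltk.download(resource, quiet=True)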

def tokenize_text(text: str) -> list[str]:
    return tkn.word_tokenize(text)

def remove_stop_words(tokenized_text: list[str]) -> list[str]:
    stop_words = set(corpus.stopwords.words('english'))
    cleaned_text = [w for w in tokenized_text if w not in stop_words]
    return cleaned_text

def stem_words(tokenized_text: list[str]) -> list[str]:
    stemmer = stem.PorterStemmer()
    return [stemmer.stem(word) for word in tokenized_text]

In [None]:
_tokenized = tokenize_text(_single_spaced)
_without_sw = remove_stop_words(_tokenized)
_stemmed = stem_words(_without_sw)

print(_single_spaced)
print('-'*10)
print(_tokenized)
print('-'*10)
print(_without_sw)
print('-'*10)
print(_stemmed)

As you can see, many words were removed, and stemming stripped away the unnecessary word endings (I mean stems, c'mon). Now we can assemble the full cleaning stage.

In [None]:
def preprocessing_stage(text):
    _lowered = lower_text(text)
    _without_numbers = remove_numbers(_lowered)
    _without_punct = remove_punctuation(_without_numbers)
    _single_spaced = remove_multiple_spaces(_without_punct)
    _tokenized = tokenize_text(_single_spaced)
    _without_sw = remove_stop_words(_tokenized)
    _stemmed = stem_words(_without_sw)
    
    return _stemmed

def clean_text_inplace(df):
    df['Text'] = df['Text'].apply(preprocessing_stage)
    return df

def preprocess(df):
    df.fillna(" ", inplace=True)
    _preprocess_score = preprocess_score_inplace(df)
    _preprocess_helpfulness = preprocess_helpfulness_inplace(_preprocess_score)
    _concatted = concat_title_text_inplace(_preprocess_helpfulness)

    _cleaned = clean_text_inplace(_concatted)
    
    return _cleaned
    

And now let's apply it to our train and test dataframes.

In [None]:
train_preprocessed = preprocess(train_df_enc)
test_preprocessed = preprocess(test_dataframe)

train_preprocessed.head()

Now, let's split our original train dataset into train and val sets.

In [None]:
from sklearn.model_selection import train_test_split

ratio = 0.1
train, val = train_test_split(
    train_preprocessed, stratify=train_preprocessed['Category'], test_size=ratio, random_state=420
)
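
Stratification keeps the class proportions of the full dataset in both splits. The following is a simplified pure-Python sketch of that idea, not the algorithm `train_test_split` uses internally; the function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.1, seed=420):
    """Return (train_idx, val_idx) with roughly equal class ratios."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # take the same fraction from every class, at least one sample
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

labels = ["a"] * 80 + ["b"] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))              # 90 10
print(sum(labels[i] == "b" for i in val_idx))    # 2
```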

And now, for the best result, let's get rid of pandas so that nothing stops us from working with torchtext. For that, let's create an iterator that yields samples for us.

# Creating dataloaders

First, build the vocabulary from the training set.

For that, use `torchtext.vocab.build_vocab_from_iterator`.

In [None]:
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(texts):
    # iterate over the token lists that were passed in, not the global `train`
    for tokens in texts:
        yield tokens


# Define special symbols and indices
UNK_IDX, PAD_IDX = 0, 1
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>']

vocab = build_vocab_from_iterator(yield_tokens(train['Text']), specials=special_symbols)
vocab.set_default_index(UNK_IDX)
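
To see what the vocabulary does under the hood, here is a rough pure-Python approximation: special tokens take the first indices, the remaining tokens are ordered by frequency, and unknown tokens fall back to `<unk>`. The helper names are hypothetical and the tie-breaking order is an assumption, so indices may differ from what `torchtext` produces.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>", "<pad>")):
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk_idx=0):
    # tokens missing from the vocabulary map to the <unk> index
    return [stoi.get(tok, unk_idx) for tok in tokens]

corpus = [["good", "book"], ["good", "toy"], ["bad", "book", "good"]]
stoi = build_vocab(corpus)
print(stoi)
print(encode(["good", "unseen", "toy"], stoi))   # [2, 0, 5]
```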

And then use our vocab to encode the tokenized sequence.

In [None]:
# use positional indexing: after the split, label 2 may not be in `train`
sample = train['Text'].iloc[2]
print(sample)
encoded = vocab(sample)
print(encoded)

Now we can define our collate function and create dataloaders.

In [None]:
import torch
from torch.utils.data import DataLoader

torch.manual_seed(420)

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# From https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _, _, _text, _label in batch:
        label_list.append(_label)
        _preprocessed = torch.tensor(vocab(_text), dtype=torch.int64)
        offsets.append(len(_preprocessed))
        text_list.append(_preprocessed)
        
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)
    
train_dataloader = DataLoader(
    train.to_numpy(), batch_size=128, shuffle=True, collate_fn=collate_batch
)

val_dataloader = DataLoader(
    val.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)
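
The offsets are the least obvious part of `collate_batch`: all token ids in a batch are concatenated into one flat tensor, and `offsets[i]` marks where sample `i` starts. The sketch below reproduces that bookkeeping with plain lists (the encoded samples are made up).

```python
from itertools import accumulate

batch = [[5, 2, 9], [7], [4, 4, 8, 1]]     # three encoded samples

flat = [tok for sample in batch for tok in sample]
lengths = [len(sample) for sample in batch]
# Start positions: cumulative sum of [0, len_0, len_1, ...] without the last length.
offsets = list(accumulate([0] + lengths[:-1]))

print(flat)      # [5, 2, 9, 7, 4, 4, 8, 1]
print(offsets)   # [0, 3, 4]

# Recover each sample from the flat list and the offsets.
bounds = offsets + [len(flat)]
recovered = [flat[bounds[i]:bounds[i + 1]] for i in range(len(batch))]
print(recovered == batch)   # True
```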

# Defining Network


To build the network you can use `torch.nn.Embedding` or `torch.nn.EmbeddingBag`. This lets your network learn an embedding vector for each token.

As for the other modules in your network, consider these options:
* Simple Linear layers, activations, basic stuff that goes into the network
* You may skip the offsets (start indices of the sequences) in the forward pass and use a predefined sequence length instead (the maximum length, some fixed value, etc.). If you choose this option, change the `collate_batch` function to match your architecture.
* You could use recurrent or attention-based layers (RNN, GRU, LSTM, even Transformer, it is all up to you), but remember to keep track of the dimensions and hidden states.
* If you have any questions, google them.
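
For the fixed-length option mentioned above, every sequence has to be truncated or padded to the same length before stacking. A minimal sketch, assuming `<pad>` has index 1 as in the vocabulary defined earlier; the function name and the `max_len` value are illustrative.

```python
def pad_or_truncate(token_ids, max_len, pad_idx=1):
    """Cut sequences longer than max_len, right-pad shorter ones."""
    clipped = token_ids[:max_len]
    return clipped + [pad_idx] * (max_len - len(clipped))

batch = [[5, 2, 9, 7, 3], [7], []]
padded = [pad_or_truncate(ids, max_len=4) for ids in batch]
print(padded)   # [[5, 2, 9, 7], [7, 1, 1, 1], [1, 1, 1, 1]]
```

A list like `padded` can be turned into a `(batch, max_len)` tensor with `torch.tensor(padded)`, and `nn.Embedding` accepts a `padding_idx` argument so that the pad vector stays at zero.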

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim 
import torchmetrics

In [None]:
# From https://github.com/Bjarten/early-stopping-pytorch
class EarlyStopping:
    """Early stops the training if validation loss doesn't improve after a given patience."""
    def __init__(self, patience=7, verbose=False, delta=0, path='checkpoint.pt', trace_func=print):
        """
        Args:
            patience (int): How long to wait after last time validation loss improved.
                            Default: 7
            verbose (bool): If True, prints a message for each validation loss improvement. 
                            Default: False
            delta (float): Minimum change in the monitored quantity to qualify as an improvement.
                            Default: 0
            path (str): Path for the checkpoint to be saved to.
                            Default: 'checkpoint.pt'
            trace_func (function): trace print function.
                            Default: print            
        """
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_score = None
        self.early_stop = False
        self.val_loss_min = np.inf  # np.Inf was removed in NumPy 2.0
        self.delta = delta
        self.path = path
        self.trace_func = trace_func
    def __call__(self, val_loss, model):

        score = -val_loss

        if self.best_score is None:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
        elif score < self.best_score + self.delta:
            self.counter += 1
            self.trace_func(f'EarlyStopping counter: {self.counter} out of {self.patience}')
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_score = score
            self.save_checkpoint(val_loss, model)
            self.counter = 0

    def save_checkpoint(self, val_loss, model):
        '''Saves model when validation loss decrease.'''
        if self.verbose:
            self.trace_func(f'Validation loss decreased ({self.val_loss_min:.6f} --> {val_loss:.6f}).  Saving model ...')
        torch.save(model.state_dict(), self.path)
        self.val_loss_min = val_loss
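
The stopping rule is easier to follow on a concrete loss curve. The function below is a stripped-down re-implementation of the same counter logic without checkpointing (the name `epochs_until_stop` and the loss values are made up): an epoch counts as an improvement only if the loss drops at least `delta` below the best value so far.

```python
def epochs_until_stop(val_losses, patience=2, delta=0.0):
    """Return the 1-based epoch at which training stops, or None."""
    best = None
    counter = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best is None or loss <= best - delta:
            best = loss          # improvement: remember it, reset the counter
            counter = 0
        else:
            counter += 1         # no sufficient improvement this epoch
            if counter >= patience:
                return epoch
    return None

losses = [1.00, 0.80, 0.79, 0.81, 0.795, 0.78]
print(epochs_until_stop(losses, patience=2))              # 5
print(epochs_until_stop(losses, patience=3))              # None
print(epochs_until_stop(losses, patience=2, delta=0.05))  # 4
```

A larger `delta` makes the rule stricter: small gains such as 0.80 to 0.79 no longer reset the counter, so training stops earlier.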

In [None]:
class TextClassificationModel(nn.Module):
    def __init__(self, num_classes):
        super(TextClassificationModel, self).__init__()
        
        self.n_classes = num_classes
        self.embed = nn.EmbeddingBag(len(vocab), 128, sparse=False)
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 256)
        self.fc4 = nn.Linear(256, num_classes)

    def forward(self, text, offsets):
        text = self.embed(text, offsets)
        text = F.relu(self.fc1(text))
        text = F.relu(self.fc2(text))
        text = F.relu(self.fc3(text))
#         text = F.relu(self.fc4(text))
#         text = F.relu(self.fc5(text))
        text = self.fc4(text)
        return text
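
With its default `mode='mean'`, `nn.EmbeddingBag` looks up the embedding of every token in a bag and averages them, so each review becomes one fixed-size vector regardless of its length. The sketch below imitates this with plain lists and a made-up 2-dimensional embedding table; the function name is hypothetical.

```python
embedding_table = [      # row i is the (invented) vector for token id i
    [0.0, 0.0],
    [1.0, 3.0],
    [3.0, 5.0],
    [2.0, -2.0],
]

def embedding_bag_mean(flat_ids, offsets, table):
    """Average the embedding rows of each bag delimited by offsets."""
    bounds = list(offsets) + [len(flat_ids)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [table[i] for i in flat_ids[start:end]]
        if not rows:                       # empty bag -> zero vector
            pooled.append([0.0] * len(table[0]))
            continue
        pooled.append([sum(col) / len(rows) for col in zip(*rows)])
    return pooled

flat_ids = [1, 2, 3, 3]      # two reviews: [1, 2] and [3, 3]
offsets = [0, 2]
pooled = embedding_bag_mean(flat_ids, offsets, embedding_table)
print(pooled)                # [[2.0, 4.0], [2.0, -2.0]]
```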

In [None]:
from tqdm.autonotebook import tqdm

def train_one_epoch(
    model,
    loader,
    optimizer,
    loss_fn,
    scheduler,
    f1_score,
    epoch_num=-1
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: train",
        leave=True,
    )
    model.train()
    train_loss = 0.0
    for i, (labels, texts, offsets) in loop:
        # zero the parameter gradients
        model.zero_grad()

        # forward pass
        outputs = model(texts, offsets)
        # loss calculation
        loss = loss_fn(outputs, labels)
        
        # backward pass
        loss.backward()

        # optimizer run
        optimizer.step()
        
        # the loss is already a batch mean (default reduction of
        # nn.CrossEntropyLoss), so report the average over batches
        train_loss += loss.item()
        loop.set_postfix({"loss": train_loss / i})
    
    # scheduler step (ReduceLROnPlateau monitors the summed training loss)
    scheduler.step(train_loss)

In [None]:
def val_one_epoch(
    model,
    loader,
    loss_fn,
    epoch_num=-1,
    f1_score=None, 
    best_so_far=0.0,
    ckpt_path='best_model.pt'
):
    
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc=f"Epoch {epoch_num}: val",
        leave=True,
    )
    val_loss = 0.0
    score = 0.0
    correct = 0
    total = 0
    model.eval()  # evaluation mode
    with torch.no_grad():
        for i, (labels, texts, offsets) in loop:
            # forward pass (the collate function already moved tensors to the device)
            outputs = model(texts, offsets)
            # loss calculation
            loss = loss_fn(outputs, labels)
            predicted = torch.argmax(outputs, dim=1)
            total += predicted.size(0)
            correct += (predicted == labels).sum().item()

            val_loss += loss.item()
            # compute the batch F1 once; every call also updates the metric state
            batch_f1 = f1_score(predicted, labels).item()
            score += batch_f1
            loop.set_postfix({"loss": val_loss / i, "acc": correct / total, "f1": batch_f1})
        
    score /= len(loader)
    print(f"F1 score: {score}")
    if score > best_so_far:
        # save the best model and remember its score (not the metric object)
        torch.save(model.state_dict(), ckpt_path)
        best_so_far = score

    return best_so_far, val_loss
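
Averaging a per-batch metric over batches, as the validation loop does, weights a small final batch as much as a full one. The toy numbers below (invented, with accuracy standing in for the metric) show how the batch average can drift from the value computed over all samples. With `torchmetrics`, accumulating state over the epoch and calling `compute()` once at the end is the usual way to get the global value.

```python
# (correct, total) per batch: two full batches and a small last one
batches = [(96, 128), (90, 128), (1, 4)]

per_batch = [correct / total for correct, total in batches]
batch_average = sum(per_batch) / len(per_batch)

global_accuracy = sum(c for c, _ in batches) / sum(t for _, t in batches)

print(round(batch_average, 4))     # 0.5677
print(round(global_accuracy, 4))   # 0.7192
```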

In [None]:
epochs = 100
model = TextClassificationModel(6).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, 'min', factor=0.3, patience=1, verbose=True)
stopper = EarlyStopping(delta=1e-3)
f1_score = torchmetrics.F1Score(task = "multiclass", num_classes = 6).to(device)

In [None]:
best = -float('inf')
for epoch in range(epochs):
    train_one_epoch(model, train_dataloader, optimizer, loss_fn, scheduler, f1_score, epoch_num=epoch)
    best, val_loss = val_one_epoch(model, val_dataloader, loss_fn, epoch, f1_score, best_so_far=best)
    
    stopper(val_loss, model)
    
    if stopper.early_stop:
        print("Early stopping")
        break

# Predictions

In [None]:
def collate_batch(batch):
    text_list, offsets = [], [0]
    for _, _, _, _text in batch:
        _preprocessed = torch.tensor(vocab(_text), dtype=torch.int64)
        offsets.append(len(_preprocessed))
        text_list.append(_preprocessed)
        
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return text_list.to(device), offsets.to(device)

test_dataloader = DataLoader(
    test_preprocessed.to_numpy(), batch_size=128, shuffle=False, collate_fn=collate_batch
)

In [None]:
def predict(
    model,
    loader,
):
    loop = tqdm(
        enumerate(loader, 1),
        total=len(loader),
        desc="Predictions:",
        leave=True,
    )
    predictions = []
    with torch.no_grad():
        model.eval()  # evaluation mode
        for i, (texts, offsets) in loop:
            
            # forward pass (no loss is computed at prediction time)
            outputs = model(texts, offsets)
            
            _, predicted = torch.max(outputs.data, 1)
            predictions += predicted.detach().cpu().tolist()

    return predictions

In [None]:
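# map_location lets the checkpoint load even if it was saved on a different device
ckpt = torch.load("/kaggle/working/best_model.pt", map_location=device)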
model.load_state_dict(ckpt)

predictions = predict(model, test_dataloader)
predictions[:10]

In [None]:
results = pd.Series(enc.inverse_transform(predictions))
results.head()

In [None]:
results.to_csv('submission.csv', index_label='id')