## Text Classification

---
https://www.kaggle.com/competitions/nlp-txt-classification

<br/>

The goal is to classify tweets into one of five sentiment categories: Extremely Negative, Negative, Neutral, Positive, Extremely Positive.

This falls into the "Classifying whole sentences" category of common NLP tasks.


<br/>

In this notebook I use an LSTM model with pretrained GloVe embeddings (https://nlp.stanford.edu/projects/glove/).

<br/>

In [1]:
import gc

from tqdm import tqdm
import numpy as np
import pandas as pd

import torch

In [2]:
# Cuda maintenance
gc.collect()
torch.cuda.empty_cache()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Torch device: ", device)

Torch device:  cuda


### 0. Load data

---

In the cell below the data are loaded from CSV files and the training data are preprocessed:
* Drop all rows that contain at least one NaN value
* Drop all duplicate rows
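
As a small illustration of these two steps, here is a sketch on a toy frame (the rows are made up, not taken from the competition data). It also encodes the string labels as integers with `pd.factorize`, which numbers labels in order of first appearance, the same order `unique()` returns; this is offered as an equivalent alternative, not as what the notebook cell itself calls.

```python
import pandas as pd

# Toy data (made up for illustration): one row with a NaN and one exact duplicate
toy = pd.DataFrame({
    "text": ["good", "bad", None, "good", "meh"],
    "label": ["Positive", "Negative", "Neutral", "Positive", "Neutral"],
})

# Same cleaning chain as in the notebook
clean = toy.dropna().drop_duplicates().reset_index(drop=True)

# Map string labels to integer ids in order of first appearance
codes, labels = pd.factorize(clean["label"])
clean["label"] = codes

print(clean)
print(list(labels))
```

Keeping `labels` around matters later: the predicted integer ids are mapped back to label names through it when the submission file is written.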

<br/>

In [3]:
df = pd.read_csv('data/train.csv')  
df_test = pd.read_csv('data/test.csv')  

df = df.dropna().drop_duplicates().reset_index(drop=True)
df = df.drop(["Unnamed: 0"], axis=1)
df.rename(columns={"Sentiment": "label"}, inplace=True)
df.rename(columns={"Text": "text"}, inplace=True)
df = df.astype({"text": str, "label": str})

labels = df["label"].unique()
num_labels = len(labels)

df["label"] = df["label"].apply(lambda x: np.where(labels == x)[0][0])

df_test = df_test.astype({"Text": str})
df_test.rename(columns={"Text": "text"}, inplace=True)
df_test = df_test.drop(["id"], axis=1)

In [4]:
df.head(5)

,text,label
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,0
1,advice Talk to your neighbours family to excha...,1
2,Coronavirus Australia: Woolworths to give elde...,1
3,My food stock is not the only one which is emp...,1
4,"Me, ready to go at supermarket during the #COV...",2


<br/>

Training data are split into training and evaluation parts:

<br/>

In [5]:
from sklearn.model_selection import train_test_split

train_df, eval_df = train_test_split(df, test_size=0.2, random_state=42)
eval_df.reset_index(drop=True, inplace=True)
train_df.reset_index(drop=True, inplace=True)

<br/>

Now we can create three datasets. ``TextDataset`` inherits from ``torch.utils.data.Dataset``
and implements the ``__getitem__`` method.
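
The `text_dataset` module itself is not shown in this notebook. The sketch below is only a guess at what such a class could look like, based on how it is used later (samples are accessed as `element["text"]` and `element["label"]`). The class name `TextDatasetSketch` and the `-1` placeholder label for unlabeled test rows are assumptions.

```python
import pandas as pd
from torch.utils.data import Dataset


class TextDatasetSketch(Dataset):
    """Hypothetical stand-in for the notebook's TextDataset."""

    def __init__(self, df: pd.DataFrame):
        self.df = df.reset_index(drop=True)
        # The test frame has no "label" column; -1 is an assumed placeholder.
        self.has_labels = "label" in self.df.columns

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        label = int(row["label"]) if self.has_labels else -1
        return {"text": row["text"], "label": label}


toy = pd.DataFrame({"text": ["stay home", "shelves are empty"], "label": [0, 1]})
sample = TextDatasetSketch(toy)[1]
print(sample)
```

Returning a dict per sample keeps the collate function simple, since it can pick out the text and the label by name.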

<br/>

In [6]:
from text_dataset import TextDataset

train_dataset = TextDataset(train_df)
eval_dataset = TextDataset(eval_df)
test_dataset = TextDataset(df_test)

### 1. Build vocabulary
---
<br/>

As the next step we build a vocabulary from all tokens
in all datasets (including the test set).

I first tried the **spaCy** tokenizer, since it is a popular rule-based tokenizer (according to https://huggingface.co/docs/transformers/tokenizer_summary):

```

from spacy.lang.en import English
nlp = English()
tokenizer = nlp.tokenizer

```

However, it gave an accuracy of only 0.28, so I went back to the "basic_english" tokenizer.
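
For orientation, "basic_english" is essentially lowercasing plus a handful of regex substitutions, followed by a whitespace split. The sketch below is a rough approximation of that behavior, not torchtext's actual source; the function name `basic_english_like` and the rule table are assumptions. It shows why URLs end up split into pieces such as `https`, `//t`, `.`, `co/...` in the tokenized sample printed further down.

```python
import re

# (pattern, replacement) pairs roughly approximating the "basic_english" rules
_RULES = [
    (r"'", " ' "),
    (r'"', ""),
    (r"\.", " . "),
    (r",", " , "),
    (r"\(", " ( "),
    (r"\)", " ) "),
    (r"!", " ! "),
    (r"\?", " ? "),
    (r";", " "),
    (r":", " "),
]


def basic_english_like(text):
    """Lowercase, pad or drop punctuation, then split on whitespace."""
    text = text.lower()
    for pattern, replacement in _RULES:
        text = re.sub(pattern, replacement, text)
    return text.split()


tokens_demo = basic_english_like("Stay home! https://t.co/abc")
print(tokens_demo)
# ['stay', 'home', '!', 'https', '//t', '.', 'co/abc']
```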

<br/>

In [34]:
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer("basic_english")

In [35]:
text_sample = train_dataset[0]["text"]
tokens = tokenizer(text_sample)

print("Original sentence: ", text_sample)
print("Tokenized: ")
for token in tokens:
    print(token, end=', ')

Original sentence:  ?ÂThe past 20 years have been a rollercoaster for AfricaÂs #oil and #gas production. This 20-year up-and-down cycle coincided with oil prices,Â writes NJ Ayuk on #BillionsAtPlay. 

?Relevant as the world faces the impact of the #coronavirus!

?: https://t.co/MWtz7Q28pP https://t.co/6bixG0ZGcW
Tokenized: 
?, âthe, past, 20, years, have, been, a, rollercoaster, for, africaâs, #oil, and, #gas, production, ., this, 20-year, up-and-down, cycle, coincided, with, oil, prices, ,, â, writes, nj, ayuk, on, #billionsatplay, ., ?, relevant, as, the, world, faces, the, impact, of, the, #coronavirus, !, ?, https, //t, ., co/mwtz7q28pp, https, //t, ., co/6bixg0zgcw, 

In [36]:
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

default_element = "<UNK>"

def build_vocabulary(datasets):
    for dataset in datasets:
        for element in dataset:
            yield tokenizer(element['text'].lower())
    yield tokenizer(default_element) # Adding default element

vocabulary = build_vocab_from_iterator(build_vocabulary([train_dataset,
                                                         eval_dataset,
                                                         test_dataset]))

44954lines [00:01, 36324.51lines/s]


### 2. Load pretrained embeddings

---
<br/>

The next step is to load the pretrained **GloVe** model and extract the embeddings from it.
I chose a model trained on Twitter data, since we also work with tweets.

<br/>

In [37]:
import gensim.downloader as api

model_glove_twitter = api.load("glove-twitter-100")
weights = torch.FloatTensor(model_glove_twitter.vectors)
emb_dim = weights.shape[1]

In [38]:
# Let's test the loaded vocab and embeddings mapping

from torch import nn

# Query
query = 'home'
if query in model_glove_twitter.vocab:
    query_id = torch.tensor(model_glove_twitter.vocab[query].index)

    # from_pretrained freezes the weights by default (freeze=True)
    embedding = nn.Embedding.from_pretrained(weights)

    gensim_vector = torch.tensor(model_glove_twitter[query])
    embedding_vector = embedding(query_id)

    print("embedding_vector: ", embedding_vector, embedding_vector.shape)
    print(gensim_vector == embedding_vector)
else:
    print("Not in vocab")

embedding_vector:  tensor([-0.0252,  0.1716,  0.7060, -0.0338,  0.4849, -0.1502, -0.1197,  0.4575,
         0.2481, -0.5705, -0.3946,  0.4307, -4.3351,  0.2138, -0.3850, -0.6234,
        -0.0492,  0.3994, -1.0027, -0.4547,  0.0878, -0.4888, -0.1139, -0.0667,
         0.2869, -0.4383,  0.0987, -0.3731,  0.0329, -0.3718, -0.4841, -0.1208,
        -0.1984, -0.0498,  0.4935,  0.2743,  0.6581, -0.0402, -0.1216,  1.2886,
        -0.8873,  0.7126, -0.0645, -0.2075,  0.4908,  0.1899,  0.2216,  0.2641,
        -0.0557,  0.6315, -0.5075, -0.0834, -0.1345,  0.0815, -0.4965,  0.1643,
        -0.1437, -0.0217,  0.2549,  0.1717,  0.3381,  0.1570, -0.3156, -0.7458,
         0.2387, -0.1820, -0.3221, -0.7053,  0.6782,  0.0383,  0.2351,  0.2206,
         0.3867,  0.5412,  0.1290, -0.2550, -0.1341,  0.1612,  0.1051,  0.2181,
         1.5972,  0.3996, -0.1259, -0.1610,  0.5371,  0.1436,  0.1532,  0.3488,
         0.7011, -0.0236, -0.3139,  0.0303,  0.3886, -0.0444,  0.1591,  0.1538,
         0.2294,  0.3

<br/>

The next step is to construct the weight matrix from the pretrained embeddings:

<br/>

In [39]:
weights_matrix = np.zeros((len(vocabulary), emb_dim))

idx = 0
not_pretrained_embeddings_count = 0
for token in vocabulary.stoi:
    token_str = token
    if not isinstance(token_str, str):
        token_str = token_str.text
    if token_str in model_glove_twitter.vocab:
        weights_matrix[idx] = torch.tensor(model_glove_twitter[token_str])
    else:
        weights_matrix[idx] = np.random.normal(scale=0.6, size=(emb_dim, ))
        not_pretrained_embeddings_count += 1
    idx += 1

In [40]:
print("{} embeddings from {} are not pretrained".format(not_pretrained_embeddings_count, idx))

68157 embeddings from 97237 are not pretrained
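
In the run above roughly 70% of the vocabulary (68157 of 97237 tokens) has no GloVe vector and starts from random values. The same construction can be sketched on toy data as follows. The `stoi` and `pretrained` dictionaries are made-up stand-ins for the vocabulary and the GloVe model. This variant writes each row at the token's own index instead of a running counter, which is one way to keep the matrix aligned with vocabulary lookups.

```python
import numpy as np

emb_dim = 4
# Toy stand-ins (assumptions): token -> index, and a tiny "pretrained" table
stoi = {"<unk>": 0, "home": 1, "covid19": 2}
pretrained = {"home": np.ones(emb_dim, dtype=np.float32)}

rng = np.random.default_rng(0)
weights_matrix = np.zeros((len(stoi), emb_dim), dtype=np.float32)
missing = 0
for token, idx in stoi.items():
    if token in pretrained:
        # Known token: copy the pretrained vector
        weights_matrix[idx] = pretrained[token]
    else:
        # Unknown token: random initialization, to be learned during training
        weights_matrix[idx] = rng.normal(scale=0.6, size=emb_dim)
        missing += 1

print(f"{missing} of {len(stoi)} rows randomly initialized")
```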


In [41]:
# Create Embedding Layer

num_embeddings, embedding_dim = weights_matrix.shape
emb_layer = nn.Embedding(num_embeddings, embedding_dim)
emb_layer.weight.data.copy_(torch.from_numpy(weights_matrix))

tensor([[ 0.2860, -0.0493,  0.2279,  ..., -0.4137, -0.7897, -0.3616],
        [ 0.6769,  0.2756,  1.1376,  ..., -0.0733,  0.1919,  0.4477],
        [ 0.1821, -0.0485,  0.2397,  ..., -0.3358,  0.1888, -0.4079],
        ...,
        [-0.3959, -0.4705, -0.1837,  ...,  0.3289, -0.5314,  1.7251],
        [ 0.3715, -0.7902, -0.2558,  ..., -0.6669, -1.7612, -0.3347],
        [-0.3460, -1.0794, -0.7125,  ...,  0.1530,  0.0520, -0.5716]])

### 3. Create batches and DataLoaders

---

<br/>

Next we define a collate function that turns a batch of samples into model inputs.
All token sequences in a batch are padded or truncated to the same length (``max_words``).
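
The padding and truncation step on its own can be sketched in plain Python. The function name `pad_or_truncate` is made up for this illustration, and `pad_id=0` mirrors the value used in the collator below (which token index 0 stands for depends on how the vocabulary was built).

```python
def pad_or_truncate(token_ids, max_words, pad_id=0):
    """Return a list of exactly max_words ids."""
    if len(token_ids) < max_words:
        # Too short: append padding ids at the end
        return token_ids + [pad_id] * (max_words - len(token_ids))
    # Long enough: keep only the first max_words ids
    return token_ids[:max_words]


batch = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
aligned = [pad_or_truncate(ids, max_words=4) for ids in batch]
print(aligned)  # [[5, 8, 2, 0], [7, 1, 9, 4]]
```

Once every row has the same length, the batch can be turned into a single rectangular tensor.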

<br/>

In [42]:
class BatchCollator(object):
    def __init__(self, vocabulary, tokenizer, max_words = 40, device = 'cuda'):
        self.device = device
        self.max_words = max_words
        self.vocabulary = vocabulary
        self.tokenizer = tokenizer
        
    def __align__(self, tokens):
        if len(tokens) < self.max_words:
            return tokens + ([0]* (self.max_words-len(tokens)))
        return tokens[:self.max_words]
        
    def __call__(self, batch):
        X, Y = [], []
        for element in batch:
            tokenized_text = self.tokenizer(element["text"].lower())
            X.append([self.vocabulary[token] for token in tokenized_text])
            Y.append(element['label'])
        # Pad or truncate every sample to max_words tokens
        X = [self.__align__(tokens) for tokens in X]
        return torch.tensor(X, dtype=torch.int32, device=self.device), torch.tensor(Y, device=self.device)

In [43]:
from torch.utils.data import DataLoader

max_words = 40
batch_collator = BatchCollator(vocabulary, tokenizer, max_words, device)

train_loader = DataLoader(train_dataset, batch_size=1024,
                          collate_fn=batch_collator, shuffle=True)
eval_loader  = DataLoader(eval_dataset, batch_size=1024,
                          collate_fn=batch_collator, shuffle=True)
test_loader  = DataLoader(test_dataset, batch_size=1024,
                          collate_fn=batch_collator, shuffle=False)

### 4. LSTM Model

---

<br/>

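Finally we define two LSTM models:
1. ``StackedLSTMClassifierWithPretrained`` - three separate single-layer ``nn.LSTM`` modules chained one after another (hidden sizes 50, 60 and 75)
2. ``LSTMClassifierWithPretrained`` - a single ``nn.LSTM`` with ``num_layers=3`` (hidden size 75)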
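
To make the difference between the two variants concrete, the sketch below pushes a random batch through three chained single-layer `nn.LSTM` modules and through one `nn.LSTM` with `num_layers=3`. The sizes are the ones used in this notebook; the random input and the batch size of 4 are illustrative only.

```python
import torch
from torch import nn

batch_size, seq_len, embed_len = 4, 40, 100
x = torch.randn(batch_size, seq_len, embed_len)

# Variant 1: three single-layer LSTMs chained; hidden sizes may differ per layer
lstm1 = nn.LSTM(embed_len, 50, num_layers=1, batch_first=True)
lstm2 = nn.LSTM(50, 60, num_layers=1, batch_first=True)
lstm3 = nn.LSTM(60, 75, num_layers=1, batch_first=True)
out, _ = lstm1(x)
out, _ = lstm2(out)
out_chained, (h_chained, _) = lstm3(out)

# Variant 2: one module with three internal layers, all with the same hidden size
lstm = nn.LSTM(embed_len, 75, num_layers=3, batch_first=True)
out_multi, (h_multi, _) = lstm(x)

print(out_chained.shape)  # torch.Size([4, 40, 75])
print(out_multi.shape)    # torch.Size([4, 40, 75])
print(h_chained.shape)    # torch.Size([1, 4, 75])
print(h_multi.shape)      # torch.Size([3, 4, 75])
```

In both cases `output[:, -1]` is the top layer's hidden state at the last time step, which is what the classifiers feed into the final linear layer.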

<br/>

In [50]:
from torch import nn
from torch.nn import functional as F
import gensim

embed_len = emb_dim
hidden_dim1 = 50
hidden_dim2 = 60
hidden_dim3 = 75
n_layers = 1


class StackedLSTMClassifierWithPretrained(nn.Module):
    def __init__(self, device='cuda'):
        super(StackedLSTMClassifierWithPretrained, self).__init__()
        self.device = device
        self.embedding_layer = emb_layer
        #self.embedding_layer.requires_grad = False
        self.lstm1 = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim1, num_layers=1, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=hidden_dim1, hidden_size=hidden_dim2, num_layers=1, batch_first=True)
        self.lstm3 = nn.LSTM(input_size=hidden_dim2, hidden_size=hidden_dim3, num_layers=1, batch_first=True)
        self.linear = nn.Linear(hidden_dim3, num_labels)

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        batch_size = len(X_batch)
        # Each nn.LSTM below has exactly one layer, so the initial states use 1 here
        # rather than the global n_layers (which a later cell sets to 3).
        hidden, carry = torch.randn(1, batch_size, hidden_dim1, device=self.device), torch.randn(1, batch_size, hidden_dim1, device=self.device)
        output, (hidden, carry) = self.lstm1(embeddings, (hidden, carry))

        hidden, carry = torch.randn(1, batch_size, hidden_dim2, device=self.device), torch.randn(1, batch_size, hidden_dim2, device=self.device)
        output, (hidden, carry) = self.lstm2(output, (hidden, carry))

        hidden, carry = torch.randn(1, batch_size, hidden_dim3, device=self.device), torch.randn(1, batch_size, hidden_dim3, device=self.device)
        output, (hidden, carry) = self.lstm3(output, (hidden, carry))
        # Classify from the output at the last time step
        return self.linear(output[:,-1])

In [52]:
from torch import nn
from torch.nn import functional as F
import gensim

embed_len = emb_dim
hidden_dim = 75
n_layers=3


class LSTMClassifierWithPretrained(nn.Module):
    def __init__(self, device='cuda'):
        super(LSTMClassifierWithPretrained, self).__init__()
        self.embedding_layer = emb_layer
        #self.embedding_layer.requires_grad = False
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, 
                              num_layers=n_layers, batch_first=True)
        # The linear layer that maps from hidden state space to tag space
        self.linear = nn.Linear(hidden_dim, num_labels)

    def forward(self, X_batch):
        embeddings = self.embedding_layer(X_batch)
        output, (hidden, carry) = self.lstm(embeddings)
        return self.linear(output[:,-1])

<br/>

For both models:

* Loss function: CrossEntropyLoss
* Optimizer: Adam
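
PyTorch's `CrossEntropyLoss` takes raw logits and integer class ids, applies log-softmax internally and averages over the batch by default, which is why the models return the linear layer's output without a softmax. The NumPy sketch below recomputes that quantity by hand (the helper name `cross_entropy` is made up). For a model with no preference between the five classes the loss is ln(5) ≈ 1.609, close to the first-epoch training losses of 1.542 and 1.551 in the logs below.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean softmax cross-entropy for raw logits and integer class ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


num_labels = 5
logits = np.zeros((3, num_labels))  # a model with no preference between classes
targets = np.array([0, 2, 4])
print(round(cross_entropy(logits, targets), 3))  # 1.609, i.e. ln(5)
```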

<br/>

In [46]:
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gc

def CalcValLossAndAccuracy(model, loss_fn, val_loader):
    with torch.no_grad():
        Y_shuffled, Y_preds, losses = [],[],[]
        for X, Y in val_loader:
            preds = model(X)
            loss = loss_fn(preds, Y)
            losses.append(loss.item())

            Y_shuffled.append(Y)
            Y_preds.append(preds.argmax(dim=-1))

        Y_shuffled = torch.cat(Y_shuffled)
        Y_preds = torch.cat(Y_preds)

        print("Valid Loss : {:.3f}".format(torch.tensor(losses).cpu().mean()))
        print("Valid Acc  : {:.3f}".format(accuracy_score(Y_shuffled.cpu().detach().numpy(), 
                                                          Y_preds.cpu().detach().numpy())))


def TrainModel(model, loss_fn, optimizer, train_loader, val_loader, device, epochs=10):
    model.to(device)
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            # .to() returns a new tensor, so the result must be assigned
            X, Y = X.to(device), Y.to(device)
            Y_preds = model(X) ## Make Predictions

            loss = loss_fn(Y_preds, Y) ## Calculate Loss
            losses.append(loss.item())

            optimizer.zero_grad() ## Clear previously calculated gradients
            loss.backward() ## Calculates Gradients
            optimizer.step() ## Update network weights.

        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
        CalcValLossAndAccuracy(model, loss_fn, val_loader)

In [57]:
from torch.optim import Adam

epochs = 30
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = LSTMClassifierWithPretrained(device)
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, eval_loader, device, epochs)

100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.07it/s]


Train Loss : 1.542
Valid Loss : 1.511
Valid Acc  : 0.306


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.75it/s]


Train Loss : 1.196
Valid Loss : 1.502
Valid Acc  : 0.400


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.52it/s]


Train Loss : 0.795
Valid Loss : 1.782
Valid Acc  : 0.399


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.84it/s]


Train Loss : 0.606
Valid Loss : 1.939
Valid Acc  : 0.409


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.18it/s]


Train Loss : 0.459
Valid Loss : 1.994
Valid Acc  : 0.463


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.21it/s]


Train Loss : 0.329
Valid Loss : 1.872
Valid Acc  : 0.492


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.02it/s]


Train Loss : 0.244
Valid Loss : 1.894
Valid Acc  : 0.522


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.01it/s]


Train Loss : 0.170
Valid Loss : 1.957
Valid Acc  : 0.528


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.82it/s]


Train Loss : 0.123
Valid Loss : 1.965
Valid Acc  : 0.547


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.27it/s]


Train Loss : 0.096
Valid Loss : 2.240
Valid Acc  : 0.538


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.82it/s]


Train Loss : 0.085
Valid Loss : 2.219
Valid Acc  : 0.537


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.64it/s]


Train Loss : 0.111
Valid Loss : 2.171
Valid Acc  : 0.549


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.61it/s]


Train Loss : 0.080
Valid Loss : 2.151
Valid Acc  : 0.544


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.02it/s]


Train Loss : 0.065
Valid Loss : 2.302
Valid Acc  : 0.544


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.16it/s]


Train Loss : 0.057
Valid Loss : 2.341
Valid Acc  : 0.544


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.63it/s]


Train Loss : 0.061
Valid Loss : 2.415
Valid Acc  : 0.534


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.20it/s]


Train Loss : 0.059
Valid Loss : 2.533
Valid Acc  : 0.536


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.43it/s]


Train Loss : 0.049
Valid Loss : 2.481
Valid Acc  : 0.541


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.03it/s]


Train Loss : 0.050
Valid Loss : 2.567
Valid Acc  : 0.531


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.23it/s]


Train Loss : 0.113
Valid Loss : 2.177
Valid Acc  : 0.555


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.08it/s]


Train Loss : 0.058
Valid Loss : 2.431
Valid Acc  : 0.549


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.42it/s]


Train Loss : 0.044
Valid Loss : 2.482
Valid Acc  : 0.548


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.32it/s]


Train Loss : 0.041
Valid Loss : 2.547
Valid Acc  : 0.543


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.82it/s]


Train Loss : 0.039
Valid Loss : 2.594
Valid Acc  : 0.543


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.87it/s]


Train Loss : 0.032
Valid Loss : 2.576
Valid Acc  : 0.542


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.68it/s]


Train Loss : 0.031
Valid Loss : 2.724
Valid Acc  : 0.537


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.27it/s]


Train Loss : 0.032
Valid Loss : 2.725
Valid Acc  : 0.537


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.71it/s]


Train Loss : 0.030
Valid Loss : 2.772
Valid Acc  : 0.540


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.29it/s]


Train Loss : 0.027
Valid Loss : 2.721
Valid Acc  : 0.533


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.31it/s]


Train Loss : 0.028
Valid Loss : 2.737
Valid Acc  : 0.525


In [51]:
from torch.optim import Adam

epochs = 30
learning_rate = 1e-3

loss_fn = nn.CrossEntropyLoss()
lstm_classifier = StackedLSTMClassifierWithPretrained(device)
optimizer = Adam(lstm_classifier.parameters(), lr=learning_rate)

TrainModel(lstm_classifier, loss_fn, optimizer, train_loader, eval_loader, device, epochs)

100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.71it/s]


Train Loss : 1.551
Valid Loss : 1.526
Valid Acc  : 0.310


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.01it/s]


Train Loss : 1.504
Valid Loss : 1.500
Valid Acc  : 0.316


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.25it/s]


Train Loss : 1.325
Valid Loss : 1.381
Valid Acc  : 0.386


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.29it/s]


Train Loss : 1.103
Valid Loss : 1.486
Valid Acc  : 0.402


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.98it/s]


Train Loss : 1.011
Valid Loss : 1.499
Valid Acc  : 0.391


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.17it/s]


Train Loss : 0.960
Valid Loss : 1.526
Valid Acc  : 0.401


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.11it/s]


Train Loss : 0.831
Valid Loss : 1.313
Valid Acc  : 0.510


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.09it/s]


Train Loss : 0.578
Valid Loss : 1.276
Valid Acc  : 0.572


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.21it/s]


Train Loss : 0.409
Valid Loss : 1.288
Valid Acc  : 0.584


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.24it/s]


Train Loss : 0.349
Valid Loss : 1.304
Valid Acc  : 0.611


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.13it/s]


Train Loss : 0.284
Valid Loss : 1.321
Valid Acc  : 0.612


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.21it/s]


Train Loss : 0.250
Valid Loss : 1.351
Valid Acc  : 0.611


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.94it/s]


Train Loss : 0.219
Valid Loss : 1.446
Valid Acc  : 0.615


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.15it/s]


Train Loss : 0.200
Valid Loss : 1.502
Valid Acc  : 0.609


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.22it/s]


Train Loss : 0.185
Valid Loss : 1.587
Valid Acc  : 0.607


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.78it/s]


Train Loss : 0.170
Valid Loss : 1.584
Valid Acc  : 0.602


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.37it/s]


Train Loss : 0.162
Valid Loss : 1.562
Valid Acc  : 0.609


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.81it/s]


Train Loss : 0.149
Valid Loss : 1.695
Valid Acc  : 0.602


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.00it/s]


Train Loss : 0.143
Valid Loss : 1.740
Valid Acc  : 0.598


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.50it/s]


Train Loss : 0.146
Valid Loss : 1.716
Valid Acc  : 0.602


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.44it/s]


Train Loss : 0.126
Valid Loss : 1.706
Valid Acc  : 0.601


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.70it/s]


Train Loss : 0.124
Valid Loss : 1.798
Valid Acc  : 0.599


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.48it/s]


Train Loss : 0.114
Valid Loss : 1.840
Valid Acc  : 0.602


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.66it/s]


Train Loss : 0.108
Valid Loss : 1.827
Valid Acc  : 0.601


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.65it/s]


Train Loss : 0.099
Valid Loss : 1.836
Valid Acc  : 0.590


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 20.31it/s]


Train Loss : 0.108
Valid Loss : 1.802
Valid Acc  : 0.596


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.90it/s]


Train Loss : 0.100
Valid Loss : 1.898
Valid Acc  : 0.595


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.60it/s]


Train Loss : 0.097
Valid Loss : 1.985
Valid Acc  : 0.601


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.42it/s]


Train Loss : 0.081
Valid Loss : 2.060
Valid Acc  : 0.593


100%|█████████████████████████████████████████████████████████████████████████████████| 33/33 [00:01<00:00, 19.70it/s]


Train Loss : 0.077
Valid Loss : 1.943
Valid Acc  : 0.585


In [58]:
def MakePredictions(model, loader, device):
    Y_shuffled, Y_preds = [], []
    with torch.no_grad():  # no gradients are needed for inference
        for X, Y in loader:
            X, Y = X.to(device), Y.to(device)
            preds = model(X)
            Y_preds.append(preds)
            Y_shuffled.append(Y)
    gc.collect()
    Y_preds, Y_shuffled = torch.cat(Y_preds), torch.cat(Y_shuffled)

    # argmax over the logits picks the same class as argmax over softmax probabilities
    return Y_shuffled.cpu().numpy(), Y_preds.argmax(dim=-1).cpu().numpy()

Y_actual, Y_preds = MakePredictions(lstm_classifier, test_loader, device)

In [59]:
sample_submission = pd.read_csv('data/sample_submission.csv')
sample_submission.head()

sample_submission['Sentiment'] = [labels[pred] for pred in Y_preds]
sample_submission.to_csv('lstm_test_submission.csv', index=False)
sample_submission.head(10)

,id,Sentiment
0,787bc85b-20d4-46d8-84a0-562a2527f684,Negative
1,17e934cd-ba94-4d4f-9ac0-ead202abe241,Negative
2,5914534b-2b0f-4de8-bb8a-e25587697e0d,Extremely Positive
3,cdf06cfe-29ae-48ee-ac6d-be448103ba45,Extremely Negative
4,aff63979-0256-4fb9-a2d9-86a3d3ca5470,Positive
5,b130f7fb-7048-48e6-a8af-57bb56ac1e27,Neutral
6,db72c632-8719-4847-b7f2-a89af05e1504,Negative
7,e45239d8-4dcf-4685-a955-a9a08ca829ee,Neutral
8,2854b1b2-5a41-4002-90d3-17fe77a3a78e,Positive
9,ff9be7e1-81a9-4c07-beda-4fee9a923f5e,Positive


## Result:


<br/>

Evaluation on test dataset: 0.487

<br/>

## Useful Links:

https://coderzcolumn.com/tutorials/artificial-intelligence/pytorch-lstm-for-text-classification-tasks#4
    
A Comprehensive Guide to Understand and Implement Text Classification in Python:
https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

How to Use GloVe Word Embeddings With PyTorch Networks?
https://coderzcolumn.com/tutorials/artificial-intelligence/how-to-use-glove-embeddings-with-pytorch

How to use Pre-trained Word Embeddings in PyTorch
https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76

