<a href="https://www.kaggle.com/code/vidhikishorwaghela/text-classification-attention-mechanism?scriptVersionId=174311390" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 🚀 Text Classification using Attention Mechanism 📝

## 📚 Overview

This project performs **text classification** on consumer complaints using an attention mechanism implemented in **Python** with **PyTorch**. The attention mechanism learns a weight for each token, so the model can focus on the parts of the input text that matter most for the predicted category.

## 🛠️ Tools and Technologies Used

- **Programming Language**: `Python`
- **Deep Learning Framework**: `PyTorch`
- **Libraries**:
  - `numpy`
  - `pandas`
  - `tqdm`
  - `nltk`
  - `scikit-learn`

## 📊 Dataset

The dataset used for this project is a collection of consumer complaints, where each complaint is labeled with a specific product category. The dataset is stored in a CSV file (`complaints.csv`).

## 🔄 Preprocessing

### GloVe Embeddings

- The pre-trained GloVe word embeddings are used to represent words in the text data.
- The embeddings are read from the `glove.6B.50d.txt` file (50-dimensional vectors), and `<pad>` and `<unk>` entries are added to the vocabulary.
- The vocabulary and embeddings are saved as pickle files (`vocabulary.pkl` and `embeddings.pkl`).

### Text Data Processing

- The text data is loaded from the CSV file.
- Missing values are dropped, and product labels are mapped to predefined categories.
- Text preprocessing includes converting to lowercase, removing punctuation and digits, and tokenization.
- Sequences are padded or truncated to a fixed length of 20 tokens, and the tokens are then mapped to integer indices using the vocabulary.
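
The cleaning and padding steps above can be sketched as a single helper. This is a minimal sketch rather than the notebook's exact code: `preprocess` is a hypothetical name, and `str.split` stands in for NLTK's `word_tokenize` so the example runs without NLTK data (contractions, for instance, will be split differently).

```python
import re

SEQ_LEN = 20  # fixed sequence length used in this project


def preprocess(text, seq_len=SEQ_LEN, pad="<pad>"):
    """Clean one complaint and return exactly seq_len tokens (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"[^\w'\s]+", " ", text)   # drop punctuation except apostrophes
    text = re.sub(r"\d+", "", text)          # drop digits
    text = re.sub(r"x{2,}", "", text)        # drop masked spans such as "xxxx"
    tokens = text.split()                    # stand-in for nltk.word_tokenize
    if len(tokens) >= seq_len:
        return tokens[:seq_len]              # truncate long texts
    return [pad] * (seq_len - len(tokens)) + tokens  # left-pad short texts


print(preprocess("My card ending XXXX was charged $25 twice!", seq_len=8))
```

Short texts are padded on the left, so the real tokens always sit at the end of the sequence.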

## 🧠 Model Architecture

### Attention Model

- The attention mechanism is implemented as a PyTorch module.
- It calculates attention weights based on input embeddings and applies them to the input data.
- The attention-weighted input is passed through a linear layer to obtain class predictions.
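
In symbols, the three steps above can be written as follows. This is a sketch consistent with the `AttentionModel` code in this notebook; the notation is introduced here for illustration: $x_t$ is the embedding of token $t$ with a constant 1 prepended, $w$ is the learned attention vector, $T$ is the sequence length, and $W$, $b$ are the parameters of the final linear layer.

$$
e_t = \tanh\left(w^\top x_t\right), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{s=1}^{T} \exp(e_s)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, x_t, \qquad
\hat{y} = W c + b
$$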

## 🚀 Training

- The model is trained using a custom PyTorch dataset.
- Training is performed for a specified number of epochs with mini-batch gradient descent.
- Training and validation loss are monitored to prevent overfitting.
- The best model is saved based on validation loss.
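
The checkpointing rule in the last bullet can be sketched without PyTorch. The helper name and the loss values below are made up for illustration; the notebook saves the model's `state_dict` at the point marked in the comment.

```python
def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which a new best validation loss is reached."""
    best_loss = float("inf")
    saved = []
    for epoch, v_loss in enumerate(valid_losses, start=1):
        if v_loss < best_loss:      # strict improvement only
            best_loss = v_loss
            saved.append(epoch)     # the notebook calls torch.save(...) here
    return saved, best_loss


# Illustrative (made-up) validation losses for four epochs
saved, best = epochs_that_save([1.01, 0.99, 1.00, 0.98])
print(saved, best)
```

An epoch whose validation loss is worse than the best so far leaves the saved checkpoint untouched.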

## 📈 Evaluation

### Testing

- The trained model is evaluated on a separate test dataset.
- Test loss and accuracy are calculated to assess model performance.

## 🎯 Inference

### Prediction on New Text

- The trained model can be used to make predictions on new text data.
- Input text is preprocessed and converted into tokens and integer indices.
- The model predicts the class label for the input text.

## 🏁 Conclusion

This project shows how a simple attention mechanism can be applied to text classification. By weighting the relevant parts of the input text, the attention model categorizes consumer complaints into predefined product categories. In the run recorded in this notebook it reaches a test accuracy of about 0.68 while using only the first 20 tokens of each complaint.


In [5]:
import re
import torch
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn as nn
from nltk.tokenize import word_tokenize
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

In [6]:
lr = 0.0005
vec_len = 50
seq_len = 20
num_epochs = 50
label_col = "Product"
tokens_path = "/kaggle/working/tokens.pkl"
labels_path = "/kaggle/working/labels.pkl"
data_path = "/kaggle/input/complaints/complaints.csv"
model_path = "/kaggle/working/attention.pth"
vocabulary_path = "/kaggle/working/vocabulary.pkl"
embeddings_path = "/kaggle/working/embeddings.pkl"
glove_vector_path = "/kaggle/input/glove6b/glove.6B.50d.txt"
text_col_name = "Consumer complaint narrative"
label_encoder_path = "/kaggle/working/label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
               'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
               'Credit card or prepaid card': 'card',
               'Money transfer, virtual currency, or money service': 'money_transfer',
               'virtual currency': 'money_transfer',
               'Mortgage': 'mortgage',
               'Payday loan, title loan, or personal loan': 'loan',
               'Debt collection': 'debt_collection',
               'Checking or savings account': 'savings_account',
               'Credit card': 'card',
               'Bank account or service': 'savings_account',
               'Credit reporting': 'credit_report',
               'Prepaid card': 'card',
               'Payday loan': 'loan',
               'Other financial service': 'others',
               'Virtual currency': 'money_transfer',
               'Student loan': 'loan',
               'Consumer Loan': 'loan',
               'Money transfers': 'money_transfer'}

In [7]:
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)


def load_file(name):
    """
    Function to load a pickle object
    """
    # Use a context manager so the file handle is always closed
    with open(name, "rb") as f:
        return pickle.load(f)

## Process GloVe embeddings
---

In [8]:
with open(glove_vector_path, "rt", encoding="utf-8") as f:
    emb = f.readlines()

In [9]:
vocabulary, embeddings = [], []

for item in emb:
    # Each line is "<word> <v1> <v2> ... <v50>"; split it once
    parts = item.split()
    vocabulary.append(parts[0])
    embeddings.append(parts[1:])

In [10]:
embeddings = np.array(embeddings, dtype=np.float32)

In [11]:
vocabulary = ["<pad>", "<unk>"] + vocabulary

In [12]:
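# Row 0 is the "<pad>" vector (all ones); row 1 is the "<unk>" vector
# (the mean of all GloVe vectors). The order matches the vocabulary list.
embeddings = np.vstack([np.ones(vec_len, dtype=np.float32),
                        np.mean(embeddings, axis=0),
                        embeddings])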

In [13]:
save_file(embeddings_path, embeddings)
save_file(vocabulary_path, vocabulary)

## Process text data
---

In [14]:
data = pd.read_csv(data_path)

In [15]:
data.dropna(subset=[text_col_name], inplace=True)

In [16]:
data.replace({label_col: product_map}, inplace=True)

### Encode labels

In [17]:
label_encoder = LabelEncoder()
label_encoder.fit(data[label_col])
labels = label_encoder.transform(data[label_col])

In [18]:
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)

### Process the text column

In [19]:
input_text = list(data[text_col_name])

In [20]:
len(input_text)

809343

### Convert text to lower case

In [21]:
input_text = [i.lower() for i in tqdm(input_text)]

100%|██████████| 809343/809343 [00:01<00:00, 661912.58it/s]


### Remove punctuation except apostrophes

In [22]:
input_text = [re.sub(r"[^\w\d'\s]+", " ", i) 
              for i in tqdm(input_text)]

100%|██████████| 809343/809343 [00:53<00:00, 15243.80it/s]


### Remove digits

In [23]:
input_text = [re.sub(r"\d+", "", i) for i in tqdm(input_text)]

100%|██████████| 809343/809343 [00:30<00:00, 26347.82it/s]


### Remove runs of two or more consecutive 'x' characters

In [24]:
input_text = [re.sub(r'[x]{2,}', "", i) for i in tqdm(input_text)]

100%|██████████| 809343/809343 [00:18<00:00, 43777.89it/s]


### Replace multiple spaces with a single space

In [25]:
input_text = [re.sub(' +', ' ', i) for i in tqdm(input_text)]

100%|██████████| 809343/809343 [00:57<00:00, 14168.69it/s]


### Tokenize the text

In [26]:
tokens = [word_tokenize(t) for t in tqdm(input_text)]

100%|██████████| 809343/809343 [19:12<00:00, 702.05it/s]


### Keep the first 20 tokens of each complaint, left-padding shorter texts

In [27]:
tokens = [i[:seq_len] if len(i) >= seq_len
          else ['<pad>'] * (seq_len - len(i)) + i
          for i in tqdm(tokens)]

100%|██████████| 809343/809343 [00:07<00:00, 106881.41it/s]


### Convert tokens to integer indices from vocabulary

In [28]:
def token_index(tokens, vocabulary, missing='<unk>'):
    """
    :param tokens: List of word tokens
    :param vocabulary: All words in the embeddings
    :param missing: Token for words not present in the vocabulary
    :return: List of integers representing the word tokens
    """
    idx_token = []
    for text in tqdm(tokens):
        idx_text = []
        for token in text:
            if token in vocabulary:
                idx_text.append(vocabulary.index(token))
            else:
                idx_text.append(vocabulary.index(missing))
        idx_token.append(idx_text)
    return idx_token

In [29]:
tokens = token_index(tokens, vocabulary)

100%|██████████| 809343/809343 [1:52:45<00:00, 119.62it/s]  
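
`list.index` scans the vocabulary from the start on every call, which is a likely reason this step took close to two hours. A possible alternative, sketched here, builds a word-to-index dictionary once and then does constant-time lookups. `token_index_fast` is a hypothetical name and is not part of the notebook; it is meant to return the same indices as `token_index`.

```python
def token_index_fast(tokens, vocabulary, missing="<unk>"):
    """Map lists of tokens to vocabulary indices with a dictionary lookup."""
    # Build the word -> index table once; setdefault keeps the first index of a
    # word if it appears more than once, matching list.index.
    word_to_idx = {}
    for idx, word in enumerate(vocabulary):
        word_to_idx.setdefault(word, idx)
    missing_idx = word_to_idx[missing]
    return [[word_to_idx.get(token, missing_idx) for token in text]
            for text in tokens]


vocab = ["<pad>", "<unk>", "credit", "report"]
# "experian" is not in this toy vocabulary, so it maps to the index of "<unk>"
print(token_index_fast([["<pad>", "credit", "report", "experian"]], vocab))
```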


### Save the tokens

In [30]:
save_file(tokens_path, tokens)

## Create attention model
---

In [31]:
class AttentionModel(nn.Module):

    def __init__(self, vec_len, seq_len, n_classes):
        super(AttentionModel, self).__init__()
        self.vec_len = vec_len
        self.seq_len = seq_len
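        # Attention vector of length vec_len + 1. The first entry multiplies the
        # constant 1 prepended to every embedding (a bias term, initialised to 0);
        # the remaining entries are scaled random values.
        attn_init = torch.cat([torch.zeros(1, 1),
                               torch.randn(vec_len, 1) / vec_len ** 0.5])
        # nn.Parameter registers the tensor and sets requires_grad=True
        self.attn_weights = nn.Parameter(attn_init)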
        self.activation = nn.Tanh()
        self.softmax = nn.Softmax(dim=1)
        self.linear = nn.Linear(vec_len + 1, n_classes)

    def forward(self, input_data):
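        # input_data: (batch, seq_len, vec_len + 1)
        # One score per token: (batch, seq_len, 1)
        hidden = torch.matmul(input_data, self.attn_weights)
        hidden = self.activation(hidden)
        # Normalise the scores across the sequence dimension
        attn = self.softmax(hidden)
        # Broadcasting scales every token vector by its attention weight
        attn_output = input_data * attn
        # Weighted sum over tokens: (batch, vec_len + 1)
        attn_output = torch.sum(attn_output, dim=1)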
        output = self.linear(attn_output)
        return output
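
To make the forward pass concrete, here is a dependency-free sketch of the same attention pooling for a single sequence, written with plain Python lists instead of tensors. `attention_pool` is a hypothetical helper, the input vectors and weights are made-up toy values, and the final linear layer is left out.

```python
import math


def attention_pool(inputs, attn_weights):
    """Attention-weighted sum of token vectors for one sequence.

    inputs: list of token vectors, each already prefixed with a constant 1.0
    attn_weights: list of floats with the same length as each token vector
    """
    # One tanh-squashed score per token
    scores = [math.tanh(sum(x * w for x, w in zip(vec, attn_weights)))
              for vec in inputs]
    # Softmax over the sequence positions
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Weighted sum of the token vectors
    dim = len(inputs[0])
    pooled = [sum(a * vec[d] for a, vec in zip(attn, inputs)) for d in range(dim)]
    return attn, pooled


toy_tokens = [[1.0, 0.5, -0.2], [1.0, 2.0, 0.1], [1.0, -1.0, 0.3]]
attn, pooled = attention_pool(toy_tokens, [0.0, 1.0, 0.5])
print([round(a, 3) for a in attn], [round(p, 3) for p in pooled])
```

Because the attention weights sum to 1, the first component of the pooled vector stays at 1.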

## Create PyTorch dataset
---

In [32]:
class TextDataset(torch.utils.data.Dataset):

    def __init__(self, tokens, embeddings, labels):
        """
        :param tokens: List of word tokens
        :param embeddings: Word embeddings (from glove)
        :param labels: List of labels
        """
        self.tokens = tokens
        self.embeddings = embeddings
        self.labels = labels

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        emb = torch.tensor(self.embeddings[self.tokens[idx], :])
        input_ = torch.cat((torch.ones(emb.shape[0],1), emb), dim=1)
        return torch.tensor(self.labels[idx]), input_

### Function to train the model

In [33]:
def train(train_loader, valid_loader, model, criterion, optimizer, 
          device, num_epochs, model_path):
    """
    Function to train the model
    :param train_loader: Data loader for train dataset
    :param valid_loader: Data loader for validation dataset
    :param model: Model object
    :param criterion: Loss function
    :param optimizer: Optimizer
    :param device: CUDA or CPU
    :param num_epochs: Number of epochs
    :param model_path: Path to save the model
    """
    best_loss = 1e8
    for i in range(num_epochs):
        print(f"Epoch {i+1} of {num_epochs}")
        valid_loss, train_loss = [], []
        model.train()
        # Train loop
        for batch_labels, batch_data in tqdm(train_loader):
            # Move data to GPU if available
            batch_labels = batch_labels.to(device)
            batch_data = batch_data.to(device)
            # Forward pass; the output already has shape (batch, n_classes),
            # so no squeeze is needed (squeezing would break a batch of size 1)
            batch_output = model(batch_data)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            train_loss.append(loss.item())
            optimizer.zero_grad()
            # Backward pass
            loss.backward()
            # Gradient update step
            optimizer.step()
        model.eval()
        # Validation loop (gradients are not needed here)
        with torch.no_grad():
            for batch_labels, batch_data in tqdm(valid_loader):
                # Move data to GPU if available
                batch_labels = batch_labels.to(device)
                batch_data = batch_data.to(device)
                # Forward pass; output shape is (batch, n_classes)
                batch_output = model(batch_data)
                # Calculate loss
                loss = criterion(batch_output, batch_labels)
                valid_loss.append(loss.item())
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        if v_loss < best_loss:
            best_loss = v_loss
            # Save model if validation loss improves
            torch.save(model.state_dict(), model_path)
        print(f"Best Validation Loss: {best_loss}")

### Function to test the model

In [34]:
def test(test_loader, model, criterion, device):
    """
    Function to test the model
    :param test_loader: Data loader for test dataset
    :param model: Model object
    :param criterion: Loss function
    :param device: CUDA or CPU
    """
    model.eval()
    test_loss = []
    test_accu = []
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for batch_labels, batch_data in tqdm(test_loader):
            # Move data to device
            batch_labels = batch_labels.to(device)
            batch_data = batch_data.to(device)
            # Forward pass; output shape is (batch, n_classes)
            batch_output = model(batch_data)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            test_loss.append(loss.item())
            batch_preds = torch.argmax(batch_output, dim=1)
            # Compute accuracy for this batch on the CPU
            test_accu.append(accuracy_score(batch_labels.cpu().numpy(),
                                            batch_preds.cpu().numpy()))
    # Average the per-batch values
    test_loss = np.mean(test_loss)
    test_accu = np.mean(test_accu)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accu}")
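
Note that `test` averages per-batch accuracies. When the last batch is smaller than the others, this can differ slightly from the accuracy computed over all samples. The sketch below uses made-up counts and a hypothetical helper name to show the difference.

```python
def batch_mean_vs_overall(correct_per_batch, size_per_batch):
    """Compare the mean of per-batch accuracies with the overall accuracy."""
    batch_accs = [c / n for c, n in zip(correct_per_batch, size_per_batch)]
    mean_of_batches = sum(batch_accs) / len(batch_accs)
    overall = sum(correct_per_batch) / sum(size_per_batch)
    return mean_of_batches, overall


# Two full batches of 16 and a final batch of 4 (made-up numbers)
mean_acc, overall_acc = batch_mean_vs_overall([12, 12, 4], [16, 16, 4])
print(mean_acc, overall_acc)
```

With thousands of batches and only the last one possibly smaller, the gap is small in practice.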

## Train attention model
---

### Load the files

In [35]:
tokens = load_file(tokens_path)
labels = load_file(labels_path)
embeddings = load_file(embeddings_path)
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)
vocabulary = load_file(vocabulary_path)

### Split data into train, validation and test sets

In [36]:
X_train, X_test, y_train, y_test = train_test_split(tokens, labels,
                                                    test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, 
                                                      y_train,
                                                      test_size=0.25)
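
The two calls give roughly a 60/20/20 split: 20% is held out for testing, then 25% of the remaining 80% (that is, 20% of the total) is held out for validation. The arithmetic can be checked with a short sketch. It assumes the held-out size is rounded up, and `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size=0.2, valid_size=0.25):
    """Sizes of the train/validation/test sets for the two-step split."""
    n_test = math.ceil(n_samples * test_size)      # first split
    n_rest = n_samples - n_test
    n_valid = math.ceil(n_rest * valid_size)       # second split on the remainder
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(809343)
print(n_train, n_valid, n_test)
# Batches of 16: the training loader drops the last partial batch
print(n_train // 16, math.ceil(n_valid / 16), math.ceil(n_test / 16))
```

Under that assumption the batch counts agree with the 30350 training and 10117 validation batches shown in the training log.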

### Create PyTorch datasets

In [37]:
train_dataset = TextDataset(X_train, embeddings, y_train)
valid_dataset = TextDataset(X_valid, embeddings, y_valid)
test_dataset = TextDataset(X_test, embeddings, y_test)

### Create data loaders

In [38]:
train_loader = torch.utils.data.DataLoader(train_dataset, 
                                           batch_size=16,
                                           shuffle=True, 
                                           drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, 
                                           batch_size=16)
test_loader = torch.utils.data.DataLoader(test_dataset, 
                                          batch_size=16)

### Create model object

In [39]:
device = torch.device("cuda:0" if torch.cuda.is_available() 
                      else "cpu")

In [40]:
model = AttentionModel(vec_len, seq_len, num_classes)

### Move the model to GPU if available

In [41]:
if torch.cuda.is_available():
    model = model.cuda()

### Define loss function and optimizer

In [42]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

### Training loop

In [43]:
train(train_loader, valid_loader, model, criterion, optimizer,
      device, num_epochs, model_path)

Epoch 1 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.00it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.96it/s]


Train Loss: 1.0973781970241514, Validation Loss: 1.014880746617741
Best Validation Loss: 1.014880746617741
Epoch 2 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.56it/s]
100%|██████████| 10117/10117 [00:14<00:00, 712.94it/s]


Train Loss: 1.0046015555010204, Validation Loss: 0.9996734576680343
Best Validation Loss: 0.9996734576680343
Epoch 3 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.08it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.56it/s]


Train Loss: 0.9860790113056511, Validation Loss: 0.9799373932547327
Best Validation Loss: 0.9799373932547327
Epoch 4 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.37it/s]
100%|██████████| 10117/10117 [00:14<00:00, 719.45it/s]


Train Loss: 0.9773461669148291, Validation Loss: 0.9751999612142414
Best Validation Loss: 0.9751999612142414
Epoch 5 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.52it/s]
100%|██████████| 10117/10117 [00:13<00:00, 723.07it/s]


Train Loss: 0.9731297619167034, Validation Loss: 0.971399791966588
Best Validation Loss: 0.971399791966588
Epoch 6 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.15it/s]
100%|██████████| 10117/10117 [00:14<00:00, 720.86it/s]


Train Loss: 0.970977611977245, Validation Loss: 0.9698809275805532
Best Validation Loss: 0.9698809275805532
Epoch 7 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.92it/s]
100%|██████████| 10117/10117 [00:14<00:00, 711.19it/s]


Train Loss: 0.9694003499722755, Validation Loss: 0.9690534166680196
Best Validation Loss: 0.9690534166680196
Epoch 8 of 50


100%|██████████| 30350/30350 [01:13<00:00, 410.64it/s]
100%|██████████| 10117/10117 [00:14<00:00, 712.99it/s]


Train Loss: 0.9684071356126858, Validation Loss: 0.9683583835886438
Best Validation Loss: 0.9683583835886438
Epoch 9 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.79it/s]
100%|██████████| 10117/10117 [00:14<00:00, 719.00it/s]


Train Loss: 0.9674935299113908, Validation Loss: 0.9675937080051608
Best Validation Loss: 0.9675937080051608
Epoch 10 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.63it/s]
100%|██████████| 10117/10117 [00:14<00:00, 722.58it/s]


Train Loss: 0.9667671653800192, Validation Loss: 0.966480879261287
Best Validation Loss: 0.966480879261287
Epoch 11 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.46it/s]
100%|██████████| 10117/10117 [00:14<00:00, 710.27it/s]


Train Loss: 0.9661460295569956, Validation Loss: 0.9660834086083301
Best Validation Loss: 0.9660834086083301
Epoch 12 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.24it/s]
100%|██████████| 10117/10117 [00:14<00:00, 718.59it/s]


Train Loss: 0.9656081514730681, Validation Loss: 0.9663067209846691
Best Validation Loss: 0.9660834086083301
Epoch 13 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.22it/s]
100%|██████████| 10117/10117 [00:14<00:00, 714.62it/s]


Train Loss: 0.9650431820745327, Validation Loss: 0.9653861272637877
Best Validation Loss: 0.9653861272637877
Epoch 14 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.84it/s]
100%|██████████| 10117/10117 [00:13<00:00, 725.35it/s]


Train Loss: 0.9645278386587562, Validation Loss: 0.964915106006558
Best Validation Loss: 0.964915106006558
Epoch 15 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.30it/s]
100%|██████████| 10117/10117 [00:14<00:00, 720.01it/s]


Train Loss: 0.9637343194471946, Validation Loss: 0.9630194749932427
Best Validation Loss: 0.9630194749932427
Epoch 16 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.36it/s]
100%|██████████| 10117/10117 [00:14<00:00, 719.66it/s]


Train Loss: 0.9626130436844449, Validation Loss: 0.9620846449265406
Best Validation Loss: 0.9620846449265406
Epoch 17 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.69it/s]
100%|██████████| 10117/10117 [00:13<00:00, 724.22it/s]


Train Loss: 0.9619674593357513, Validation Loss: 0.9613757478933955
Best Validation Loss: 0.9613757478933955
Epoch 18 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.56it/s]
100%|██████████| 10117/10117 [00:14<00:00, 713.77it/s]


Train Loss: 0.9614213606248774, Validation Loss: 0.9609081743615906
Best Validation Loss: 0.9609081743615906
Epoch 19 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.84it/s]
100%|██████████| 10117/10117 [00:14<00:00, 713.23it/s]


Train Loss: 0.9609946517908986, Validation Loss: 0.9605148667280907
Best Validation Loss: 0.9605148667280907
Epoch 20 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.69it/s]
100%|██████████| 10117/10117 [00:14<00:00, 714.67it/s]


Train Loss: 0.9606381737194108, Validation Loss: 0.9610709988388653
Best Validation Loss: 0.9605148667280907
Epoch 21 of 50


100%|██████████| 30350/30350 [01:14<00:00, 405.30it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.07it/s]


Train Loss: 0.9602065914427429, Validation Loss: 0.9606582809442515
Best Validation Loss: 0.9605148667280907
Epoch 22 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.68it/s]
100%|██████████| 10117/10117 [00:13<00:00, 730.06it/s]


Train Loss: 0.9598444503319912, Validation Loss: 0.959536322120316
Best Validation Loss: 0.959536322120316
Epoch 23 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.52it/s]
100%|██████████| 10117/10117 [00:13<00:00, 724.55it/s]


Train Loss: 0.9595399926786957, Validation Loss: 0.9596084963165395
Best Validation Loss: 0.959536322120316
Epoch 24 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.44it/s]
100%|██████████| 10117/10117 [00:14<00:00, 713.29it/s]


Train Loss: 0.959230711406597, Validation Loss: 0.9590410341990964
Best Validation Loss: 0.9590410341990964
Epoch 25 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.01it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.57it/s]


Train Loss: 0.9589286165511393, Validation Loss: 0.9591326247085116
Best Validation Loss: 0.9590410341990964
Epoch 26 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.34it/s]
100%|██████████| 10117/10117 [00:13<00:00, 722.92it/s]


Train Loss: 0.9586123955966613, Validation Loss: 0.9591355674676513
Best Validation Loss: 0.9590410341990964
Epoch 27 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.53it/s]
100%|██████████| 10117/10117 [00:14<00:00, 714.61it/s]


Train Loss: 0.9583692329665385, Validation Loss: 0.9584561800979113
Best Validation Loss: 0.9584561800979113
Epoch 28 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.36it/s]
100%|██████████| 10117/10117 [00:14<00:00, 716.18it/s]


Train Loss: 0.9580121843926793, Validation Loss: 0.9585801033050726
Best Validation Loss: 0.9584561800979113
Epoch 29 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.64it/s]
100%|██████████| 10117/10117 [00:14<00:00, 722.32it/s]


Train Loss: 0.9578437766967928, Validation Loss: 0.9582128624906886
Best Validation Loss: 0.9582128624906886
Epoch 30 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.12it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.03it/s]


Train Loss: 0.9575264667816571, Validation Loss: 0.957480560129865
Best Validation Loss: 0.957480560129865
Epoch 31 of 50


100%|██████████| 30350/30350 [01:14<00:00, 404.90it/s]
100%|██████████| 10117/10117 [00:14<00:00, 704.53it/s]


Train Loss: 0.9573833065778262, Validation Loss: 0.9569408385646642
Best Validation Loss: 0.9569408385646642
Epoch 32 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.78it/s]
100%|██████████| 10117/10117 [00:14<00:00, 719.50it/s]


Train Loss: 0.9571441274871544, Validation Loss: 0.9568577659593172
Best Validation Loss: 0.9568577659593172
Epoch 33 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.51it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.53it/s]


Train Loss: 0.9569243544934216, Validation Loss: 0.9566467127905328
Best Validation Loss: 0.9566467127905328
Epoch 34 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.53it/s]
100%|██████████| 10117/10117 [00:14<00:00, 721.09it/s]


Train Loss: 0.9567169237735833, Validation Loss: 0.957307778859287
Best Validation Loss: 0.9566467127905328
Epoch 35 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.68it/s]
100%|██████████| 10117/10117 [00:14<00:00, 717.66it/s]


Train Loss: 0.9564983710882299, Validation Loss: 0.9564247785248964
Best Validation Loss: 0.9564247785248964
Epoch 36 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.92it/s]
100%|██████████| 10117/10117 [00:14<00:00, 719.53it/s]


Train Loss: 0.9563013965495336, Validation Loss: 0.9568144433760791
Best Validation Loss: 0.9564247785248964
Epoch 37 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.52it/s]
100%|██████████| 10117/10117 [00:14<00:00, 712.97it/s]


Train Loss: 0.9560181630512046, Validation Loss: 0.956228954332542
Best Validation Loss: 0.956228954332542
Epoch 38 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.16it/s]
100%|██████████| 10117/10117 [00:14<00:00, 708.71it/s]


Train Loss: 0.9559394721085392, Validation Loss: 0.9557032289182856
Best Validation Loss: 0.9557032289182856
Epoch 39 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.61it/s]
100%|██████████| 10117/10117 [00:14<00:00, 709.49it/s]


Train Loss: 0.955771921485957, Validation Loss: 0.9554801564054243
Best Validation Loss: 0.9554801564054243
Epoch 40 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.10it/s]
100%|██████████| 10117/10117 [00:14<00:00, 721.20it/s]


Train Loss: 0.9555783447583974, Validation Loss: 0.9559570361785624
Best Validation Loss: 0.9554801564054243
Epoch 41 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.96it/s]
100%|██████████| 10117/10117 [00:14<00:00, 709.70it/s]


Train Loss: 0.9554122931369055, Validation Loss: 0.9553906602919944
Best Validation Loss: 0.9553906602919944
Epoch 42 of 50


100%|██████████| 30350/30350 [01:14<00:00, 407.70it/s]
100%|██████████| 10117/10117 [00:14<00:00, 709.83it/s]


Train Loss: 0.9552417332395494, Validation Loss: 0.954942398750524
Best Validation Loss: 0.954942398750524
Epoch 43 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.80it/s]
100%|██████████| 10117/10117 [00:14<00:00, 717.81it/s]


Train Loss: 0.9550751874057033, Validation Loss: 0.95639381673398
Best Validation Loss: 0.954942398750524
Epoch 44 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.57it/s]
100%|██████████| 10117/10117 [00:13<00:00, 730.69it/s]


Train Loss: 0.9549169340174909, Validation Loss: 0.9544378415219948
Best Validation Loss: 0.9544378415219948
Epoch 45 of 50


100%|██████████| 30350/30350 [01:14<00:00, 405.99it/s]
100%|██████████| 10117/10117 [00:13<00:00, 723.45it/s]


Train Loss: 0.9547431052884906, Validation Loss: 0.954548318290694
Best Validation Loss: 0.9544378415219948
Epoch 46 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.64it/s]
100%|██████████| 10117/10117 [00:14<00:00, 721.23it/s]


Train Loss: 0.954662127798236, Validation Loss: 0.9552477405817009
Best Validation Loss: 0.9544378415219948
Epoch 47 of 50


100%|██████████| 30350/30350 [01:14<00:00, 409.03it/s]
100%|██████████| 10117/10117 [00:14<00:00, 722.17it/s]


Train Loss: 0.9544609591070274, Validation Loss: 0.9543456434069432
Best Validation Loss: 0.9543456434069432
Epoch 48 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.15it/s]
100%|██████████| 10117/10117 [00:14<00:00, 715.47it/s]


Train Loss: 0.9543076367880013, Validation Loss: 0.9541854170355694
Best Validation Loss: 0.9541854170355694
Epoch 49 of 50


100%|██████████| 30350/30350 [01:14<00:00, 406.62it/s]
100%|██████████| 10117/10117 [00:14<00:00, 718.18it/s]


Train Loss: 0.9541601616748082, Validation Loss: 0.9541581283359645
Best Validation Loss: 0.9541581283359645
Epoch 50 of 50


100%|██████████| 30350/30350 [01:14<00:00, 408.12it/s]
100%|██████████| 10117/10117 [00:14<00:00, 717.12it/s]

Train Loss: 0.9540951015504424, Validation Loss: 0.9539356721614654
Best Validation Loss: 0.9539356721614654





### Test the model 

In [44]:
test(test_loader, model, criterion, device)

100%|██████████| 10117/10117 [00:21<00:00, 471.87it/s]

Test Loss: 0.9497232201060041, Test Accuracy: 0.6769104363561713





## Predict on new text
---

In [45]:
input_text = '''I am a victim of Identity Theft & currently have an Experian account that 
I can view my Experian Credit Report and getting notified when there is activity on 
my Experian Credit Report. For the past 3 days I've spent a total of approximately 9 
hours on the phone with Experian. Every time I call I get transferred repeatedly and 
then my last transfer and automated message states to press 1 and leave a message and 
someone would call me. Every time I press 1 I get an automatic message stating than you 
before I even leave a message and get disconnected. I call Experian again, explain what 
is happening and the process begins again with the same end result. I was trying to have 
this issue attended and resolved informally but I give up after 9 hours. There are hard 
hit inquiries on my Experian Credit Report that are fraud, I didn't authorize, or recall 
and I respectfully request that Experian remove the hard hit inquiries immediately just 
like they've done in the past when I was able to speak to a live Experian representative 
in the United States. The following are the hard hit inquiries : BK OF XXXX XX/XX/XXXX 
XXXX XXXX XXXX  XX/XX/XXXX XXXX  XXXX XXXX  XX/XX/XXXX XXXX  XX/XX/XXXX XXXX  XXXX 
XX/XX/XXXX'''

### Process input text

In [46]:
input_text = input_text.lower()
input_text = re.sub(r"[^\w\d'\s]+", " ", input_text)
input_text = re.sub(r"\d+", "", input_text)
input_text = re.sub(r'[x]{2,}', "", input_text)
input_text = re.sub(' +', ' ', input_text)
tokens = word_tokenize(input_text)

In [47]:
# Left-pad short inputs; longer inputs are truncated to seq_len further down
tokens = ['<pad>'] * (seq_len - len(tokens)) + tokens

In [48]:
idx_token = []
for token in tokens:
    if token in vocabulary:
        idx_token.append(vocabulary.index(token))
    else:
        idx_token.append(vocabulary.index('<unk>'))

In [49]:
token_emb = embeddings[idx_token,:]
token_emb = token_emb[:seq_len, :]
inp = torch.from_numpy(token_emb)

In [50]:
inp = torch.cat((torch.ones(inp.shape[0],1), inp), dim=1)

In [51]:
device = torch.device("cuda:0" if torch.cuda.is_available() 
                      else "cpu")

In [52]:
inp = inp.to(device)
inp = torch.unsqueeze(inp, 0)

In [53]:
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)

In [54]:
# Create model object
model = AttentionModel(vec_len, seq_len, num_classes)

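# Load trained weights (map_location lets a GPU-trained checkpoint load on CPU)
model.load_state_dict(torch.load(model_path, map_location=device))

# Move the model to the selected device and switch to evaluation mode
model = model.to(device)
model.eval()

# Forward pass without tracking gradients
with torch.no_grad():
    out = torch.squeeze(model(inp))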

# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")

Predicted Class: credit_report
