## Text Classification using Custom Data and PyTorch

Attempting to learn PyTorch with custom data to simulate real-world data collection.

Code source: https://medium.com/@spandey8312/text-classification-using-custom-data-and-pytorch-d88ba1087045

EDIT: The model fails to run. Reran with the code copied directly from the source. Additional work will be needed to fix this issue.

In [1]:
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset, DataLoader
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from sklearn.metrics import precision_score, recall_score, f1_score



Downloaded and extracted the WikiText-2 dataset files.

In [2]:
train_file = 'wikitext-2/wiki.train.tokens'
valid_file = 'wikitext-2/wiki.valid.tokens'
test_file = 'wikitext-2/wiki.test.tokens'

Tokenize the text data

In [3]:
tokenizer = get_tokenizer('basic_english')

Build the vocabulary

In [4]:
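def read_tokens(file_path):
    # Read the whole file and split it into a flat list of tokens
    with open(file_path, 'r', encoding='utf-8') as f:
        tokens = tokenizer(f.read())
    return tokens

In [5]:
train_tokens = read_tokens(train_file)
# build_vocab_from_iterator expects an iterable of token lists, so wrap the flat list;
# passing the flat list directly would iterate over each string and build a character vocabulary.
# '<pad>' is listed first so that it gets index 0.
vocab = build_vocab_from_iterator([train_tokens], specials=['<pad>', '<unk>'])
# Map tokens that were not seen in the training data to '<unk>'
vocab.set_default_index(vocab['<unk>'])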
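
To make the token-to-index mapping concrete, here is a minimal pure-Python sketch of what a vocabulary does. It is not torchtext's implementation: `build_simple_vocab` and the `<pad>`/`<unk>` special tokens are names assumed for illustration. It also shows why the input should be an iterable of token lists, since iterating over a bare string yields its characters.

```python
from collections import Counter

def build_simple_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map each distinct token to an integer index, most frequent first.

    Hypothetical helper for illustration; not part of torchtext.
    """
    counts = Counter(token for tokens in token_lists for token in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common() if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Wrapped in a list: one "document" made of word tokens.
word_vocab = build_simple_vocab([tokens])

# Passed directly: each string is itself iterated, so the "tokens" are characters.
char_vocab = build_simple_vocab(tokens)

print(sorted(word_vocab, key=word_vocab.get))
print(sorted(char_vocab, key=char_vocab.get))
```

The first vocabulary contains words, the second only single characters, even though both were built from the same token list.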

### Create a PyTorch dataset

Creating a custom dataset allows us to handle the data efficiently during training.

We need to define a custom dataset class that inherits from torch.utils.data.Dataset and implements two methods:

- `__len__` returns the number of examples in the dataset
- `__getitem__` retrieves a specific example from the dataset

Additionally, we convert the text examples into numerical representations using the vocabulary built during the preprocessing steps.

In [6]:
class WikiTextDataset(Dataset):
    def __init__(self, file_path, vocab):
        self.data = self.load_data(file_path)
        self.vocab = vocab
    
    def load_data(self, file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            tokens = tokenizer(f.read())
        return tokens
        
    def __len__(self):
        return len(self.data)
    
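    def __getitem__(self, idx):
        # self.data is a flat list of tokens, so each example is a single token string
        example = self.data[idx]
        # Convert the example into numerical representations using the vocabulary.
        # Iterating over a string yields its characters, so this lookup runs per character, not per word.
        numerical_tokens = [self.vocab[token] for token in example]
        return numerical_tokens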
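
For classification, each example usually needs a label alongside its token indices, and WikiText-2 is a language-modeling corpus that does not ship with class labels. The following is a minimal sketch of a labeled dataset under that assumption. `LabeledTextDataset`, the toy texts, the labels, and the `token_to_idx` mapping are all made up for illustration.

```python
import torch
from torch.utils.data import Dataset

class LabeledTextDataset(Dataset):
    """Hypothetical dataset of (token_ids, label) pairs for illustration."""

    def __init__(self, texts, labels, token_to_idx, unk_idx=1):
        assert len(texts) == len(labels)
        self.texts = texts                # list of token lists
        self.labels = labels              # list of integer class ids
        self.token_to_idx = token_to_idx  # plain dict standing in for a vocabulary
        self.unk_idx = unk_idx            # index used for unknown tokens

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Each example is a list of tokens, so this loop runs once per word
        ids = [self.token_to_idx.get(tok, self.unk_idx) for tok in self.texts[idx]]
        return torch.tensor(ids, dtype=torch.long), self.labels[idx]

token_to_idx = {"<pad>": 0, "<unk>": 1, "good": 2, "bad": 3, "movie": 4}
dataset = LabeledTextDataset(
    texts=[["good", "movie"], ["bad", "movie", "indeed"]],
    labels=[1, 0],
    token_to_idx=token_to_idx,
)
ids, label = dataset[1]
print(len(dataset), ids.tolist(), label)
```

Here "indeed" is not in the mapping, so it falls back to the unknown-token index.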

In [7]:
def collate_fn(batch):
    # pad sequences
    sequences = [torch.tensor(item) for item in batch]
    padded_sequences = pad_sequence(sequences, 
                                    batch_first=True,
                                    padding_value=0)

    return padded_sequences
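
As a quick check of what padding does, this small sketch runs `pad_sequence` on three made-up sequences of different lengths. The token ids are arbitrary. Keeping the original lengths around is an optional extra that can help later when the padded positions should be ignored.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three made-up sequences of token ids with different lengths
sequences = [
    torch.tensor([5, 8, 2]),
    torch.tensor([7]),
    torch.tensor([4, 9]),
]

# Right-pad every sequence with 0 up to the length of the longest one
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
lengths = torch.tensor([len(s) for s in sequences])

print(padded.shape)
print(padded.tolist())
print(lengths.tolist())
```

Because 0 is used as the padding value, it is safest when index 0 in the vocabulary is reserved for a padding token rather than a real token.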


### Define the text classification model

The text classification model architecture plays a crucial role in determining model performance. A common architecture for text classification consists of:
- embedding layers
- recurrent or convolutional layers
- fully connected layers

In this example, the class encapsulates the layers and structure of the text classification model.

In [8]:
class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)
        
    def forward(self, x):
        # x: (batch, seq_len) tensor of token indices
        embedded = self.embedding(x)        # (batch, seq_len, embedding_dim)
        output, _ = self.rnn(embedded)      # (batch, seq_len, hidden_dim)
        last_hidden = output[:, -1, :]      # (batch, hidden_dim), last time step
        logits = self.fc(last_hidden)       # (batch, num_classes)
        return logits
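
One thing to watch with `output[:, -1, :]` is that on a padded batch the last time step of a shorter sequence is padding. A common alternative, sketched here with assumed toy dimensions and made-up token ids, is to pack the padded batch with `pack_padded_sequence` and classify from the LSTM's final hidden state. This is a different technique from the one used in the class above, shown only to illustrate the tensor shapes involved.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim, num_classes = 20, 8, 16, 3  # toy sizes

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, num_classes)

x = torch.tensor([[5, 8, 2], [7, 0, 0]])  # second row is padded with 0
lengths = torch.tensor([3, 1])            # true lengths before padding

embedded = embedding(x)                   # (batch, seq_len, embedding_dim)
packed = pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False
)
_, (h_n, _) = lstm(packed)                # h_n: (num_layers, batch, hidden_dim)
logits = fc(h_n[-1])                      # (batch, num_classes)

print(embedded.shape, h_n.shape, logits.shape)
```

With packing, the hidden state for the second row is taken after its single real token rather than after two padding steps.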

### Training the model

Training a text classification model involves:
- splitting the dataset into training and validation sets
- defining training parameters

We will use Adam as the optimizer and CrossEntropyLoss as the loss function.

The training loop iterates over the training data for the specified number of epochs, performing forward and backward passes and updating the model's parameters using the gradients computed by backpropagation.
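
A single training step can be sketched as follows. The linear layer is only a stand-in for a classifier, and the features and targets are made up. The point is the shapes: `nn.CrossEntropyLoss` takes logits of shape (batch, num_classes) and one integer class id per example, each in the range 0 to num_classes - 1.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, num_classes, feature_dim = 4, 3, 6  # toy sizes

model = nn.Linear(feature_dim, num_classes)     # stand-in for a classifier
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

features = torch.randn(batch_size, feature_dim)  # made-up inputs
targets = torch.tensor([0, 2, 1, 2])             # one class id per example

model.train()
optimizer.zero_grad()              # clear gradients from the previous step
logits = model(features)           # forward pass: (batch_size, num_classes)
loss = criterion(logits, targets)  # scalar loss
loss.backward()                    # backward pass: compute gradients
optimizer.step()                   # update the parameters

print(logits.shape, targets.shape, loss.item())
```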

Training parameters

In [9]:
vocab_size = len(vocab)
embedding_dim = 100
hidden_dim = 128
num_classes = 10
batch_size = 32
num_epochs = 10
learning_rate = 0.001

Create the model

In [10]:
model = TextClassifier(vocab_size, embedding_dim, hidden_dim, num_classes)

Define the loss function and optimizer

In [11]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Create data loaders for the training and validation sets

In [12]:
train_dataset = WikiTextDataset(train_file, vocab)
valid_dataset = WikiTextDataset(valid_file, vocab)

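# collate_fn is not passed to these loaders, so they use the default collation,
# which requires every example in a batch to have the same length
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False)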

Validate that the train loader works as intended

Iterate over the training data for the specified number of epochs

In [13]:
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    total_samples = 0
    for inputs in train_loader:
        optimizer.zero_grad()
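        inputs = torch.LongTensor(inputs)
        # The dataset provides no class labels, so the inputs are reused as targets here.
        # CrossEntropyLoss expects one class index in [0, num_classes) per row of outputs.
        targets = inputs.clone()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, num_classes), targets.view(-1))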
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * len(inputs)
        total_samples += len(inputs)

    # Evaluate on the validation set after every epoch
    model.eval()
    total_val_loss = 0.0
    total_val_samples = 0
    with torch.no_grad():
        for inputs in valid_loader:
            inputs = torch.LongTensor(inputs)
            targets = inputs.clone()
            outputs = model(inputs)
            val_loss = criterion(outputs.view(-1, num_classes), targets.view(-1))

            total_val_loss += val_loss.item() * len(inputs)
            total_val_samples += len(inputs)

    avg_loss = total_loss / total_samples
    avg_val_loss = total_val_loss / total_val_samples

    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_loss:.4f}, Val Loss: {avg_val_loss:.4f}")

RuntimeError: each element in list of batch should be of equal size
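
This error is raised by the DataLoader's default collation when the examples in a batch are lists of different lengths. The sketch below reproduces it with made-up token ids and shows that supplying a padding collate function through the `collate_fn` argument avoids it. `pad_collate` is a hypothetical stand-in for such a function. This only addresses the batching error; a classifier would still need real class labels as targets.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy examples of different lengths (made-up token ids)
examples = [[5, 8, 2], [7], [4, 9], [3, 3, 3, 3]]

def pad_collate(batch):
    """Pad a batch of variable-length id lists to a common length."""
    sequences = [torch.tensor(item, dtype=torch.long) for item in batch]
    return pad_sequence(sequences, batch_first=True, padding_value=0)

# Default collation: fails because the lists in a batch differ in length
try:
    next(iter(DataLoader(examples, batch_size=2)))
    default_error = None
except RuntimeError as err:
    default_error = str(err)

# Custom collation: each batch is padded to its longest sequence
loader = DataLoader(examples, batch_size=2, collate_fn=pad_collate)
batches = [batch.tolist() for batch in loader]

print(default_error)
print(batches)
```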