# PyTorch Text - Text Classification with the torchtext library, Reloaded
Notebook for following along with the PyTorch NLP tutorial [Text classification with the torchtext library](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html), which uses the torchtext library to build the dataset for text classification. This notebook's purpose is to reload the model created in the main version of this notebook. <br><br>

### Choices for data

In [1]:
fileName = "textClassifier_AG_NEWS"

<br>

### Libraries and Modules
Importing the necessary libraries and modules for the notebook.

In [2]:
#Import cell
import glob
import matplotlib as mpl
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import math
import numpy as np
import os
import pandas as pd
import pickle as pk
import random
import re
import string
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from io import open
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1247) #setting seed value
print(f"Device: {device}. Cuda available: {torch.cuda.is_available()}")
#initial_seed() reports the seed in use; torch.seed() would reseed the generator with a random value
print(f"Torch current seed = {torch.initial_seed()}")
print("Imports complete")

Device: cpu. Cuda available: False
Torch current seed = 1247
Imports complete


<br>

### Data Loading and Manipulation Functions
<b>Functions:</b><br>
<ul>
    <li>collate_batch - uses pipelines to process input batch of data</li>
    <li>yield_tokens - processes data_iter for build_vocab_from_iterator()</li>
</ul>

In [3]:
#Data loading and manipulation function definition cell
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
        
        
label_pipeline = lambda x: int(x) - 1        
text_pipeline = lambda x: vocab(tokenizer(x))

print("Data loading and manipulation functions defined.")

Data loading and manipulation functions defined.
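
To make the `offsets` bookkeeping in `collate_batch` concrete, here is a minimal pure-Python sketch (no torch) that uses made-up token ids. The helper name `flatten_with_offsets` is hypothetical and not part of the notebook; it only mirrors the `offsets[:-1]` followed by `cumsum` step above.

```python
from itertools import accumulate


def flatten_with_offsets(token_id_lists):
    """Concatenate per-example token ids and record where each example starts."""
    lengths = [len(ids) for ids in token_id_lists]
    # Same idea as torch.tensor(offsets[:-1]).cumsum(dim=0), where offsets = [0] + lengths
    offsets = list(accumulate(([0] + lengths)[:-1]))
    flat = [tok for ids in token_id_lists for tok in ids]
    return flat, offsets


# Made-up token ids for three short texts
flat, offsets = flatten_with_offsets([[4, 9, 2], [7], [5, 5]])
print(flat)     # [4, 9, 2, 7, 5, 5]
print(offsets)  # [0, 3, 4]
```

Each offset marks the position in the flat id list where one text begins, which is the information `nn.EmbeddingBag` needs to separate the texts again.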


### Importing and preparing data sets
Importing and preparing the data for the models.

In [4]:
#Build a vocab with the raw training dataset, generating data batch and iter
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))
     
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])
vocab_size = len(vocab)

BATCH_SIZE = 64

train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset)*0.95)

split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset)-num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)
print("Vocab created and dataloaders defined.")

Vocab created and dataloaders defined.
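
As a rough illustration of what the vocabulary and `text_pipeline` do, the sketch below rebuilds the idea with the standard library only. Everything in it is an assumption made for illustration: `toy_tokenizer` is a simplified stand-in for `get_tokenizer('basic_english')`, and `build_toy_vocab` orders tokens by frequency, which may not match the exact ordering torchtext uses.

```python
import re
from collections import Counter


def toy_tokenizer(text):
    # Simplified stand-in for the 'basic_english' tokenizer: lowercase, split words and punctuation
    return re.findall(r"[a-z0-9']+|[.,!?()]", text.lower())


def build_toy_vocab(texts, unk_token="<unk>"):
    counts = Counter(tok for text in texts for tok in toy_tokenizer(text))
    # Special token first, then tokens by descending frequency (ties broken alphabetically)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate([unk_token] + ordered)}


def toy_text_pipeline(text, vocab, unk_token="<unk>"):
    # Unknown tokens fall back to the <unk> index, as with vocab.set_default_index
    return [vocab.get(tok, vocab[unk_token]) for tok in toy_tokenizer(text)]


vocab_demo = build_toy_vocab(["the cat sat", "the dog sat down"])
print(toy_text_pipeline("the bird sat", vocab_demo))  # [2, 0, 1]
```

The word "bird" never appeared while building the vocabulary, so it maps to index 0, the `<unk>` entry.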


<br>

### Class Definitions
<b>Classes:</b><br>
<ul>
    <li>TextClassificationModel - nn.Module class with an EmbeddingBag layer followed by a linear layer, used to classify the torchtext data</li>
</ul>

In [5]:
#Class definition cell
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class) -> None:
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()
        return None

    def init_weights(self) -> None:
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()
        return None
    
    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
    
print("Classes defined.")

Classes defined.
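
The embedding layer here is an `nn.EmbeddingBag` in its default `mean` mode, so each text is reduced to the average of its token embeddings before the linear layer. The pure-Python sketch below shows that pooling step on a made-up weight table; `mean_bags` is a hypothetical helper, and it assumes every bag contains at least one token.

```python
def mean_bags(weight, flat_ids, offsets):
    """Average the embedding rows of each bag; bag i spans offsets[i] up to offsets[i + 1]."""
    ends = list(offsets[1:]) + [len(flat_ids)]
    dim = len(weight[0])
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [weight[i] for i in flat_ids[start:end]]
        # Assumes a non-empty bag, otherwise this division would fail
        pooled.append([sum(row[d] for row in rows) / len(rows) for d in range(dim)])
    return pooled


# Made-up embedding table: 3 tokens, 2 dimensions
weight = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
print(mean_bags(weight, [1, 2, 2], [0, 2]))  # [[2.0, 3.0], [3.0, 4.0]]
```

The first bag holds tokens 1 and 2, so its vector is the average of their rows; the second bag holds only token 2.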


<br>

### Calculation functions
<b>Functions:</b><br>
<ul>
    <li>predict - uses input model to predict input text, returns category number</li>
</ul>

In [6]:
#Calculation functions cell
def predict(text, text_pipeline, preModel) -> int:
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = preModel(text, torch.tensor([0]))
    return output.argmax(1).item() + 1

print("Calculation functions defined.")

Calculation functions defined.
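
The `+ 1` in `predict` undoes the `- 1` applied by `label_pipeline`: the model scores classes 0 to 3, while the AG_NEWS labels run from 1 to 4. A small sketch with made-up scores, where `logits_to_label` and `ag_news_label_demo` are names introduced only for this illustration:

```python
ag_news_label_demo = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}


def logits_to_label(logits):
    # Index of the largest score (0-based), shifted back to the 1-based AG_NEWS label
    best_index = max(range(len(logits)), key=logits.__getitem__)
    return best_index + 1


made_up_logits = [-1.2, 3.4, 0.5, 0.1]
print(ag_news_label_demo[logits_to_label(made_up_logits)])  # Sports
```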


<br>

### Plotting functions
<b>Functions:</b> none are defined in this notebook; the cell below only enables inline plotting.

In [7]:
#Plotting functions Cell
%matplotlib inline

print("Plotting functions defined.")

Plotting functions defined.


<br>

### Training Functions
<b>Functions:</b>
<ul>
    <li>evaluate - evaluation loop, takes dataloader as input, returns accuracy.</li>
</ul>

In [8]:
#Training Functions
def evaluate(dataloader, evModel) -> float:
    evModel.eval()
    total_acc, total_count = 0, 0
    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = evModel(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1)==label).sum().item()
            total_count += label.size(0)       
    return total_acc/total_count

print("Training functions defined.")

Training functions defined.


### Main code
The `AG_NEWS` dataset has 4 labels, and therefore four classes:
`1: World`, `2: Sports`, `3: Business` and `4: Sci/Tec`. The number of classes is computed in [Importing and preparing data sets](#Importing-and-preparing-data-sets).

#### Reloading the model

In [9]:
#Reading in the meta file
with open(f'{fileName}_meta.txt', 'r') as f:
    fileContents = f.read() #the with block closes the file automatically

fileContents = [lineRead.split(': ') for lineRead in fileContents.split('\n')]
fileContents.pop() #removes empty end line
fileContents.pop() #additional remove to deal with classification dict

for descript, value in fileContents: #   
    match descript:
        case 'vocab_size':
            assert vocab_size == int(value)
            print(f"Vocab size: {vocab_size}")
        case 'emsize':
            emsize = int(value)
            print(f"Emsize {emsize}")
        case 'num_class':
            num_class = int(value)
            print(f"Num class: {num_class}")
        case 'ag_news_label':
            #find way to import news label
            continue
        case _:
            continue

Vocab size: 95811
Emsize 64
Num class: 4
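
One possible way to handle the open "import the label dictionary" note is sketched below. The exact layout of `{fileName}_meta.txt` is not shown in this notebook, so the sample text is an assumption: one `key: value` pair per line, with the label dictionary written as a Python literal. `parse_meta` is a hypothetical helper, not something the notebook defines.

```python
import ast


def parse_meta(text):
    """Parse 'key: value' lines, splitting only on the first ': ' of each line."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            continue  # skip blank or malformed lines
        try:
            # Turns '64' into an int and a dict literal into a dict
            meta[key] = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            meta[key] = value  # keep anything else as a plain string
    return meta


# Assumed file contents, for illustration only
sample = (
    "vocab_size: 95811\n"
    "emsize: 64\n"
    "num_class: 4\n"
    "ag_news_label: {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tec'}\n"
)
meta = parse_meta(sample)
print(meta["emsize"], meta["ag_news_label"][2])  # 64 Sports
```

Because `partition` splits only once, the colons inside the dictionary no longer break the line apart, so the two `pop()` calls would not be needed under this assumed layout.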


In [10]:
#Model creation
modelRE = TextClassificationModel(vocab_size, emsize, num_class)
modelRE.load_state_dict(torch.load(f"{fileName}_weights.pth"))

criterion = torch.nn.CrossEntropyLoss()

ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}
#find way to import this from meta file

print("Model reloaded: ", modelRE.eval())

Model reloaded:  TextClassificationModel(
  (embedding): EmbeddingBag(95811, 64, mode=mean)
  (fc): Linear(in_features=64, out_features=4, bias=True)
)


#### Evaluate the model with test dataset

In [11]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader, modelRE)
print("Test accuracy: {:8.3f}".format(accu_test))

Checking the results of test dataset.
Test accuracy:    0.909


#### Test on a random news story

In [12]:
ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

print("This is a %s news story (Expected: Sports)." %
      ag_news_label[predict(ex_text_str, text_pipeline, modelRE)])

This is a Sports news story (Expected: Sports).


<br>