<a href="https://colab.research.google.com/github/bballdave025/nlp_w_pytorch_zhongyu-pan/blob/main/PyTorch_CNN_Text___LinkedIn_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Network for Text Classification Using PyTorch

[Go straight to the code](#Installs-and-Imports)

## Navigation - Get the Notebook from Different Places for Different Uses

### (Navigation for the main presentation notebook)

[Google CoLab on my Google Drive](https://colab.research.google.com/drive/1PKkdbNcqUfV0sHCosWZf3JdF6F3kGoj7?usp=sharing) - A place to see all inputs
and outputs for the notebook, though you can't edit it without re-saving it.

<br/>

[GitHub Repo (link to be put in, soon)](https://github.com/bballdave025/nlp_w_pytorch_zhongyu-pan/) - Code repository: a place to see the latest changes as well as the Jupyter Notebooks completed earlier

<br/>

[GitHub Notebook File (link to be put in, soon)](#) - I don't think this
is as useful as the repo, but you can see the IPYNB file placeholder.
This file will only have input - I scrub the output before committing
any updates, because it's easier to do `diff`s (see changes in code)
on Jupyter Notebooks when you don't have the outputs.

<br/>

[On MyBinder (link to be put in, soon)](#) - A place to interact with the notebook, where you'll be led to the notebook without output and can
run the code and see the results yourself.<br/>
Note: [MyBinder](https://mybinder.org) is a great online project that lets you run a Jupyter notebook interactively, entirely in the browser. It's nice to have when you'd like to play with the code and see the outputs that come from running it. I've had some problems with Binder images going down, but I'll work to keep this one up and running.

## Putting Together All the Work from the Course

Course For NLP from LinkedIn

https://www.linkedin.com/learning/natural-language-processing-with-pytorch

The teacher is Zhongyu Pan, Content Creator at LinkedIn

We use PyTorch and a Convolutional Neural Network (using NLP features
rather than the pixel position features we use with image processing) to
do our text classification.

`Input -> Convolution -> Pooling -> ... -> Fully-connected layer -> Output`

We are also learning about RNNs. An RNN doesn't only pass data forward; it also feeds its output back into itself, so it can remember context from other words in the sequence. A CNN only passes data forward. An RNN is usually slower than a CNN.
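
To make the contrast concrete, here is a minimal pure-Python sketch, not the course's code. It assumes a toy "sentence" with one number per token, a hand-picked kernel, and a linear recurrence with made-up weights (`w_in`, `w_hidden`); real models use embedding vectors and learned weights.

```python
# Toy 1-D "sentence": one number per token (real models use embedding vectors).
tokens = [1.0, -2.0, 3.0, 0.5]

def conv_then_pool(seq, kernel):
    """Slide a kernel over the sequence, apply ReLU, then max-pool over time."""
    k = len(kernel)
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Each window is scored independently of the others (feed-forward only).
    activations = [max(0.0, sum(w * x for w, x in zip(kernel, win)))
                   for win in windows]
    return max(activations)

def simple_recurrence(seq, w_in=0.5, w_hidden=0.5):
    """Carry a hidden state from token to token (the feedback an RNN adds)."""
    hidden = 0.0
    for x in seq:
        # The new state depends on the current token AND the previous state.
        hidden = w_in * x + w_hidden * hidden
    return hidden

cnn_feature = conv_then_pool(tokens, kernel=[1.0, 1.0])
rnn_state = simple_recurrence(tokens)
print(cnn_feature, rnn_state)
```

The windows in `conv_then_pool` could all be computed at the same time, while the loop in `simple_recurrence` has to run token by token. That is one reason an RNN tends to be slower.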

### Installs and Imports

#### Robust Install-if-Needed Code

(I don't want to have to mess with whether the runtime has been
disconnected or if the version is right or whatever. I'm ensuring
compatibility for CoLab here.)

<b>Functions</b>

I'm just going to make something simple that makes sure I install the
packages I need if they haven't already been installed. This will be
especially useful for CoLab.

In [19]:
ilmd_imported = False

try:
  import importlib_metadata
  ilmd_imported = True
except ImportError:
  !pip install "importlib-metadata==8.4.0"
finally:
  if not ilmd_imported:
    import importlib_metadata
  ##endof:  if not ilmd_imported
##endof:  try/except/finally <importlib_metadata>

def is_package_installed(package_name):
  try:
    # <find-the-package-in-the-list>
    dist = importlib_metadata.distribution(package_name)
    print(f"{package_name} {dist.version} is installed.")
    return True
  except importlib_metadata.PackageNotFoundError:
    print(f"{package_name} is not installed.")
    return False
  finally:
    pass
  ##endof:  try/except/finally <find-the-package-in-the-list>
##endof:  def is_package_installed(package_name)
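
As a rough sketch of how an install-if-needed check could be driven without notebook magics, the block below uses the standard library's `importlib.metadata` (Python 3.8+) rather than the `importlib_metadata` backport. The helper names `installed_version` and `pip_command_if_needed` are my own for illustration, and the package name is deliberately fake. The sketch only builds the pip command; it does not run it.

```python
import importlib.metadata as stdlib_metadata  # standard library since Python 3.8
import sys

def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return stdlib_metadata.version(package_name)
    except stdlib_metadata.PackageNotFoundError:
        return None

def pip_command_if_needed(package_name, pinned_version=None):
    """Build (but do not run) the pip command for a missing package."""
    if installed_version(package_name) is not None:
        return None  # already installed, nothing to do
    requirement = (f"{package_name}=={pinned_version}"
                   if pinned_version else package_name)
    # Using sys.executable targets the interpreter that is running this code.
    return [sys.executable, "-m", "pip", "install", requirement]

# Hypothetical package name, chosen so that it is almost certainly not installed.
command = pip_command_if_needed("surely-not-a-real-package-xyz", "1.0.0")
print(command)
```

The returned list could be passed to `subprocess.check_call` to actually perform the install.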

In [4]:
# # I won't need this anymore
#!date -u +"%Y-%m-%dT%H%M%S%z"

2024-09-05T224706+0000


Output was

`2024-09-05T224706+0000`

... \[versions that work ... as of ...\] 2024-09-05 at 22:47:06 UTC+0000

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
# torchtext.legacy only exists in torchtext 0.9 through 0.11
from torchtext.legacy import data, datasets
import random

**Preprocessing text dataset**

In [None]:
seed = 966
torch.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [None]:
TEXT = data.Field(tokenize='spacy', lower=True)
LABEL = data.LabelField()

In [None]:
train, test = datasets.TREC.splits(TEXT, LABEL)
# random.seed() returns None, so seed first and pass the resulting state
random.seed(seed)
train, val = train.split(random_state = random.getstate())

downloading train_5500.label


100%|██████████| 336k/336k [00:00<00:00, 3.04MB/s]


downloading TREC_10.label


100%|██████████| 23.4k/23.4k [00:00<00:00, 881kB/s]
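
A note on reproducible splits: as far as I can tell from the torchtext legacy documentation, `split` expects `random_state` to be a value returned by `random.getstate()`. Since `random.seed` itself returns `None`, the seeded state has to be captured separately. The sketch below uses only the standard library to show that behavior.

```python
import random

seed = 966

# random.seed() seeds the generator in place and returns None ...
seed_return_value = random.seed(seed)

# ... so the seeded state is captured separately with random.getstate().
state = random.getstate()
first_draw = [random.random() for _ in range(3)]

# Restoring the captured state replays exactly the same draws.
random.setstate(state)
second_draw = [random.random() for _ in range(3)]

print(seed_return_value, first_draw == second_draw)
```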


In [None]:
vars(train[-1])

{'label': 'ENTY', 'text': ['how', 'do', 'you', 'say', '2', 'in', 'latin', '?']}

In [None]:
TEXT.build_vocab(train, min_freq=2)
LABEL.build_vocab(train)

In [None]:
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))
print(LABEL.vocab.stoi)

Vocabulary size of TEXT: 2641
Vocabulary size of LABEL: 6
defaultdict(None, {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5})
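
To get a feel for what `min_freq=2` does, here is a simplified, pure-Python approximation of vocabulary building. It is not torchtext's implementation: the function names, the tie-breaking rule, and the toy corpus are assumptions for illustration, and the real `build_vocab` may order tokens differently.

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, min_freq=1, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids, dropping tokens seen fewer than min_freq times."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept               # index -> string
    stoi = {tok: i for i, tok in enumerate(itos)}  # string -> index
    return stoi, itos

def numericalize(text, stoi, unk_token="<unk>"):
    """Replace each token by its id; unknown tokens share the <unk> id."""
    return [stoi.get(tok, stoi[unk_token]) for tok in text]

toy_corpus = [
    ["how", "do", "you", "say", "2", "in", "latin", "?"],
    ["how", "do", "you", "spell", "it", "?"],
]
stoi, itos = build_vocab_sketch(toy_corpus, min_freq=2)
ids = numericalize(["how", "do", "you", "say", "hello", "?"], stoi)
print(itos)
print(ids)
```

Words that appear only once ("say", "latin", ...) and words never seen at all ("hello") end up with the same `<unk>` id, which keeps the embedding table small.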


In [None]:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, val, test),
    batch_size = 64,
    sort_key=lambda x: len(x.text),
    device=device
)
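
The point of bucketing by `len(x.text)` is to put sentences of similar length in the same batch so that less padding is needed. The sketch below is a simplified stand-in, not `BucketIterator` itself (which, as I understand it, also shuffles). The helper names and toy sentence lengths are made up for illustration.

```python
def bucket_batches(examples, batch_size, sort_key=len):
    """Sort examples by length, then cut the sorted list into batches."""
    ordered = sorted(examples, key=sort_key)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def pad_batch(batch, pad_token="<pad>"):
    """Pad every example in a batch to the length of the longest one."""
    longest = max(len(ex) for ex in batch)
    return [ex + [pad_token] * (longest - len(ex)) for ex in batch]

def count_padding(batches):
    """Total number of pad tokens needed across all batches."""
    return sum(max(len(ex) for ex in b) * len(b) - sum(len(ex) for ex in b)
               for b in batches)

# Toy sentences with lengths 2, 9, 3, 8, 2 and 10 tokens.
sentences = [["tok"] * n for n in (2, 9, 3, 8, 2, 10)]

naive = [sentences[i:i + 2] for i in range(0, len(sentences), 2)]
bucketed = bucket_batches(sentences, batch_size=2)

print(count_padding(naive), count_padding(bucketed))
```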

**Building a Simple CNN Model**

In [None]:
class CNN(nn.Module):
  def __init__(self, vocabulary_size, embedding_size,
               kernels_number, kernel_sizes, output_size, dropout_rate):
    super().__init__()
    self.embedding = nn.Embedding(vocabulary_size, embedding_size)
    self.convolution_layers = nn.ModuleList([nn.Conv2d(in_channels=1, out_channels=kernels_number, kernel_size=(k, embedding_size))
                                            for k in kernel_sizes])
    self.dropout = nn.Dropout(dropout_rate)
    self.fully_connected = nn.Linear(len(kernel_sizes) * kernels_number, output_size)
  def forward(self, text):
    text = text.permute(1, 0)
    input_embeddings = self.embedding(text)
    input_embeddings = input_embeddings.unsqueeze(1)
    conved = [F.relu(convolution_layer(input_embeddings)).squeeze(3) for convolution_layer in self.convolution_layers]
    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
    concat = self.dropout(torch.cat(pooled, dim=1))
    final_output = self.fully_connected(concat)
    return final_output
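
To follow what `forward` does to the tensor shapes, here is a small trace with a single convolution layer. The sizes are toy values I picked to keep the shapes readable, not the notebook's hyperparameters, and the input is random token ids rather than real text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed toy sizes, smaller than the notebook's, to keep the trace readable.
seq_len, batch_size, vocab_size = 7, 4, 50
embedding_size, kernels_number, kernel_size = 10, 5, 3

embedding = nn.Embedding(vocab_size, embedding_size)
conv = nn.Conv2d(in_channels=1, out_channels=kernels_number,
                 kernel_size=(kernel_size, embedding_size))

# The iterator yields text as [seq_len, batch]; random ids stand in for it here.
text = torch.randint(0, vocab_size, (seq_len, batch_size))

x = text.permute(1, 0)                 # [batch, seq_len]
x = embedding(x)                       # [batch, seq_len, emb]
x = x.unsqueeze(1)                     # [batch, 1, seq_len, emb] (1 input channel)
conved = F.relu(conv(x)).squeeze(3)    # [batch, kernels, seq_len - k + 1]
pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2)  # [batch, kernels]

print(tuple(x.shape), tuple(conved.shape), tuple(pooled.shape))
```

Because each kernel spans the full embedding width, the convolution only slides along the sentence, and max-pooling over that axis leaves one number per kernel regardless of sentence length. One caveat: a batch whose sentences are all shorter than the largest kernel size would make the convolution fail.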

In [None]:
input_size = len(TEXT.vocab)
embedding_size = 100
kernels_number = 100
kernel_sizes = [2, 3, 4]
output_size = len(LABEL.vocab)
dropout_rate = 0.3

In [None]:
model = CNN(input_size, embedding_size, kernels_number, kernel_sizes, output_size, dropout_rate)

In [None]:
print(model)

CNN(
  (embedding): Embedding(2641, 100)
  (convolution_layers): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (fully_connected): Linear(in_features=300, out_features=6, bias=True)
)
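
As a sanity check on the printed architecture, the number of trainable parameters can be worked out by hand from the layer sizes shown above. This is plain arithmetic, assuming every layer has a bias term except the embedding.

```python
# Sizes taken from the printed model above.
vocabulary_size, embedding_size = 2641, 100
kernels_number, kernel_sizes, output_size = 100, [2, 3, 4], 6

# Embedding: one vector of length embedding_size per vocabulary entry.
embedding_params = vocabulary_size * embedding_size

# Each Conv2d: kernels_number filters of shape (1, k, embedding_size), plus biases.
conv_params = sum(kernels_number * (1 * k * embedding_size) + kernels_number
                  for k in kernel_sizes)

# Linear: (3 * 100) inputs to 6 outputs, plus 6 biases.
fc_params = len(kernel_sizes) * kernels_number * output_size + output_size

total_params = embedding_params + conv_params + fc_params
print(embedding_params, conv_params, fc_params, total_params)
```

On the real model, `sum(p.numel() for p in model.parameters())` should report the same total. Most of the parameters sit in the embedding table.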


In [None]:
model.to(device)

CNN(
  (embedding): Embedding(2641, 100)
  (convolution_layers): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (fully_connected): Linear(in_features=300, out_features=6, bias=True)
)

**Train and Evaluate Functions**

In [None]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
criterion = criterion.to(device)

optimizer = optim.Adam(model.parameters())
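
`nn.CrossEntropyLoss` takes the raw scores (logits) from the model, applies a log-softmax internally, and by default averages over the batch. For one example, the computation can be sketched in pure Python as below. The function name and the example logits are made up for illustration.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-softmax of the target class, computed stably."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target_index]

# Six classes, matching the six TREC labels.
uniform_loss = cross_entropy([0.0] * 6, target_index=0)
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0, 0.0, 0.0], target_index=0)

print(round(uniform_loss, 3), round(confident_loss, 3))
```

A model that scores all six classes equally gets a loss of ln 6, about 1.79, so that is a useful reference point when reading the training log.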

In [None]:
def accuracy(predictions, actual_label):
    # Index of the highest-scoring class for each example in the batch
    max_predictions = predictions.argmax(dim = 1)
    correct_predictions = max_predictions.eq(actual_label)
    # Plain division keeps this device-agnostic (works on CPU as well as CUDA)
    return correct_predictions.sum().float() / actual_label.shape[0]
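
Here is a small worked example of the same idea on CPU tensors. `batch_accuracy` is an illustrative name, and the logits and labels are made up. It divides by the plain batch size, so it does not depend on CUDA being available.

```python
import torch

def batch_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predicted = logits.argmax(dim=1)
    return predicted.eq(labels).sum().float() / labels.shape[0]

# Four examples, three classes; the third example is misclassified on purpose.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.1],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 1, 2])

acc = batch_accuracy(logits, labels)
print(acc.item())
```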

In [None]:
def train(model, iterator, optimizer, criterion):

    model.train()
    epoch_loss = 0
    epoch_acc = 0

    for batch in iterator:
        optimizer.zero_grad()

        predictions = model(batch.text)

        loss = criterion(predictions, batch.label)

        acc = accuracy(predictions, batch.label)

        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):

    model.eval()
    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():

        for batch in iterator:

            predictions = model(batch.text)

            loss = criterion(predictions, batch.label)

            acc = accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
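
One subtlety in both functions: dividing the summed batch accuracies by `len(iterator)` gives every batch the same weight, even when the last batch is smaller. The sketch below, using made-up `(correct, batch_size)` pairs, shows how that can differ slightly from the accuracy over all examples.

```python
def mean_of_batch_means(batch_results):
    """Average the per-batch accuracies, as the epoch functions above do."""
    return sum(correct / size for correct, size in batch_results) / len(batch_results)

def example_weighted(batch_results):
    """Accuracy over all examples, so small batches count for less."""
    total_correct = sum(correct for correct, _ in batch_results)
    total_seen = sum(size for _, size in batch_results)
    return total_correct / total_seen

# Hypothetical results: two full batches of 64 and a final batch of 8.
batch_results = [(48, 64), (48, 64), (8, 8)]

per_batch = mean_of_batch_means(batch_results)
per_example = example_weighted(batch_results)
print(round(per_batch, 4), round(per_example, 4))
```

With many full batches and only one short one, the gap is usually small, so the simpler per-batch average is a reasonable choice for monitoring.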

**Training the model**

In [None]:
number_of_epochs = 20

best_acc = float('-inf')

for epoch in range(number_of_epochs):

    # Train for one epoch, then measure loss and accuracy on the validation set
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    # Keep the weights from the epoch with the best validation accuracy so far
    if valid_acc > best_acc:
        best_acc = valid_acc
        torch.save(model.state_dict(), 'trec.pt')

    print(f'Epoch {epoch+1} ')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Validation Loss: {valid_loss:.3f} |  Validation Acc: {valid_acc*100:.2f}%')

Epoch 1 
	Train Loss: 1.312 | Train Acc: 48.76%
	 Validation Loss: 0.952 |  Validation Acc: 64.50%
Epoch 2 
	Train Loss: 0.779 | Train Acc: 71.66%
	 Validation Loss: 0.744 |  Validation Acc: 72.32%
Epoch 3 
	Train Loss: 0.585 | Train Acc: 81.01%
	 Validation Loss: 0.654 |  Validation Acc: 74.99%
Epoch 4 
	Train Loss: 0.452 | Train Acc: 85.48%
	 Validation Loss: 0.628 |  Validation Acc: 76.36%
Epoch 5 
	Train Loss: 0.359 | Train Acc: 88.29%
	 Validation Loss: 0.583 |  Validation Acc: 78.56%
Epoch 6 
	Train Loss: 0.275 | Train Acc: 91.91%
	 Validation Loss: 0.548 |  Validation Acc: 80.06%
Epoch 7 
	Train Loss: 0.207 | Train Acc: 94.12%
	 Validation Loss: 0.550 |  Validation Acc: 80.89%
Epoch 8 
	Train Loss: 0.163 | Train Acc: 95.83%
	 Validation Loss: 0.544 |  Validation Acc: 81.54%
Epoch 9 
	Train Loss: 0.137 | Train Acc: 96.33%
	 Validation Loss: 0.530 |  Validation Acc: 81.62%
Epoch 10 
	Train Loss: 0.104 | Train Acc: 97.80%
	 Validation Loss: 0.542 |  Validation Acc: 80.82%
Epoch 11 
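
To see which epochs would have triggered a save, the checkpoint rule from the loop can be replayed on the validation accuracies logged above. Only epochs 1 through 10 are used, since the log is cut off after that; later epochs may have improved on these numbers.

```python
# Validation accuracies (%) for epochs 1-10, copied from the log above.
valid_accs = [64.50, 72.32, 74.99, 76.36, 78.56, 80.06, 80.89, 81.54, 81.62, 80.82]

best_acc = float("-inf")
saved_epochs = []  # epochs at which the checkpoint file would be overwritten

for epoch, acc in enumerate(valid_accs, start=1):
    if acc > best_acc:  # same rule as in the training loop
        best_acc = acc
        saved_epochs.append(epoch)

best_epoch = saved_epochs[-1]
print(saved_epochs, best_epoch, best_acc)
```

Within these ten epochs, training accuracy keeps climbing while validation accuracy levels off around 81%, which is why the checkpoint is chosen on validation accuracy rather than on the last epoch.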

In [None]:
# map_location lets a checkpoint saved on GPU load on a CPU-only runtime
model.load_state_dict(torch.load('trec.pt', map_location=device))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.416 | Test Acc: 87.00%
