<a href="https://colab.research.google.com/github/bballdave025/nlp_w_pytorch_zhongyu-pan/blob/main/PyTorch_CNN_Text___LinkedIn_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convolutional Neural Network for Text Classification Using PyTorch

[Go straight to the code](#Installs-and-Imports)

## Navigation - Get the Notebook from Different Places for Different Uses

### (Navigation for the main presentation notebook)

[Google CoLab on my Google Drive](https://colab.research.google.com/drive/1PKkdbNcqUfV0sHCosWZf3JdF6F3kGoj7?usp=sharing) - A place to see all inputs
and outputs for the notebook, though you can't edit it without re-saving it.

<br/>

[GitHub Repo (link to be put in, soon)](https://github.com/bballdave025/nlp_w_pytorch_zhongyu-pan/) - Code repository: a place to see the latest changes as well as the Jupyter Notebooks completed earlier

<br/>

[GitHub Notebook File (link to be put in, soon)](#) - I don't think this
is as useful as the repo, but you can see the IPYNB file placeholder.
This file will only have input - I scrub the output before committing
any updates, because it's easier to do `diff`s (see changes in code)
on Jupyter Notebooks when you don't have the outputs.

<br/>

[On MyBinder (link to be put in, soon)](#) - A place to interact with the notebook, where you'll be led to the notebook without output and can
run the code and see the results yourself.<br/>
A note: [MyBinder](https://mybinder.org) is a great online project that lets you run a Jupyter notebook interactively, completely online. It's nice to have when you'd like to play with the code and get a better look at the outputs that come from running it. I've had some problems with images going down, but I'm going to work to keep this one up and running for access.

## Putting Together All the Work from the Course

Course For NLP from LinkedIn

https://www.linkedin.com/learning/natural-language-processing-with-pytorch

The teacher is Zhongyu Pan, Content Creator at LinkedIn.

We use PyTorch and a Convolutional Neural Network (using NLP features
rather than the pixel position features we use with image processing) to
do our text classification.

`Input -> Convolution -> Pooling -> ... -> Fully-connected layer -> Output`
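
The pipeline above can be traced with a tiny, self-contained shape walk-through. This is only a sketch under stated assumptions: all sizes (`batch_size`, `seq_len`, `num_filters`, and so on) are illustrative values picked for the example, and a single convolution stands in for the several kernel sizes used later in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes only; none of these values come from the notebook.
batch_size, seq_len, vocab_size, embed_dim = 2, 7, 50, 8
num_filters, kernel_height, num_classes = 4, 3, 6

# Input: a batch of token ids, shape [batch, seq_len]
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

embedding = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv2d(1, num_filters, kernel_size=(kernel_height, embed_dim))
fully_connected = nn.Linear(num_filters, num_classes)

# Embed and add a channel dimension: [2, 1, 7, 8]
embedded = embedding(token_ids).unsqueeze(1)
# Convolution: the kernel spans the whole embedding width: [2, 4, 5]
conved = F.relu(conv(embedded)).squeeze(3)
# Pooling: max over the sequence dimension: [2, 4]
pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2)
# Fully-connected layer -> Output: one logit per class: [2, 6]
logits = fully_connected(pooled)

for name, tensor in [("embedded", embedded), ("conved", conved),
                     ("pooled", pooled), ("logits", logits)]:
    print(f"{name:>8}: {tuple(tensor.shape)}")
```

The convolution shortens the sequence dimension from 7 to 7 - 3 + 1 = 5, and the pooling step then removes it entirely, which is why the linear layer can accept sentences of any length.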

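We are also learning about RNNs. An RNN doesn't only pass data forward; it also feeds its hidden state back into itself at every step of the sequence, whereas a CNN only passes data forward. That feedback lets an RNN remember context from earlier words in a sequence (and, in a bidirectional RNN, from later words as well). It's usually slower than a CNN, because it has to process the tokens one after another.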
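
The feedback loop can be made concrete with a one-unit recurrence, `h_t = tanh(w_x * x_t + w_h * h_(t-1))`. This is a minimal sketch in plain Python, not the course's code; the function name `rnn_forward` and the weights `w_x` and `w_h` are arbitrary illustrative choices.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of scalar inputs."""
    hidden = 0.0
    hidden_states = []
    for x_t in inputs:
        # The previous hidden state is fed back in alongside the new input.
        hidden = math.tanh(w_x * x_t + w_h * hidden + bias)
        hidden_states.append(hidden)
    return hidden_states

# Only the first input is non-zero (assumed toy data).
sequence = [1.0, 0.0, 0.0]
states = rnn_forward(sequence)
print(states)
```

Only the first input is non-zero, yet the later hidden states stay non-zero: the earlier input is still carried along through the fed-back hidden state. The loop also shows why an RNN is hard to parallelize over time, since each step needs the result of the previous one.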

### Installs and Imports

#### Installs

Note that installing a `conda` environment from `environment.txt` removes the need for the following installations, both for a notebook hosted on a local machine and for a notebook on MyBinder. The conda environment installation is set to be automatic for MyBinder.

#### Python Package Installs

In [None]:
!pip uninstall -y -q torch
!pip uninstall -y -q torchtext

!pip install -q torch
!pip install -q torchtext==0.10
!pip uninstall -y -q numpy
!pip install -q "numpy<2"

# import torch
# os.environ['TORCH'] = torch.__version__
# print(f"torch version {torch.__version__}")#
# 
# 
# # !pip install torchtext  # problems with both gpu and cpu versions
# #
# # fix_torchtext_install_ref = (
# #     r"https://github.com/pyg-team/pytorch_geometric/"
# #     r"issues/999#issuecomment-722438357-permalink"
# # )
# # fix_trchtxt_install_archived = (
# #     r"https://web.archive.org/web/20240907180141/"
# #     r"https://github.com/pyg-team/pytorch_geometric/issues/999"
# # )  # Search for "rusty1s commented on Nov 5, 2020"
# # fix_torchtext_notebook_example = (
# #     r"https://colab.research.google.com/"
# #     r"drive/1h3-vJGRVloF5zStxL5I0rSy4ZUPNsjy8?usp=sharing"
# # )  # which I don't think I can archive properly.
# 
# !pip install torchtext -f https://data.pyg.org/whl/torch-${TORCH}.html

# other URLs
# https://download.pytorch.org/whl/cu111
# https://download.pytorch.org/whl/cu117
# https://download.pytorch.org/whl/torch_stable.html

#### Imports and Versions

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import torchtext
from torchtext.legacy import data, datasets
import numpy as np

import sys

In [None]:
print(f"---\nPyTorch Version:     {torch.__version__}")
print(f"---\nTorch Text Version:  {torchtext.__version__}")
try:
    print(f"---\nrandom version:  {random.__version__}")
except AttributeError as ae:
    print("---\nThrown was:")
    print(f" > AttributeError: {str(ae)}")
    print("That's what I suspected. 'random' is built-in,")
    print("and thus doesn't have a '__version__' attribute.")
finally:
    pass
##endof:  try/except/finally <print the version of the 'random' module>
print(f"---\nnumpy version:       {np.__version__}")
print(f"---\n---\nPython version: {sys.version}")
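
Because built-in modules such as `random` have no `__version__`, another way to report versions is to ask the installed-package metadata instead. The following is a small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is hypothetical, and the lookup is by distribution name, which I'm assuming matches the import name for the packages listed.

```python
import sys
from importlib import metadata

def installed_version(distribution_name):
    """Return the installed version string, or None if no such distribution is installed."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None

for name in ("torch", "torchtext", "numpy", "random"):
    version = installed_version(name)
    shown = version if version is not None else "(not an installed distribution)"
    print(f"{name:<10} {shown}")

print(f"Python     {sys.version.split()[0]}")
```

Since `random` ships with Python itself, the lookup would normally report it as not installed, which is the same conclusion the `try`/`except` above reaches.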

In [None]:
# # I won't need this anymore, from Dave's Windows Machine
#!python -VVV
#!powershell -c (Get-Date -UFormat "%s_%Y%m%dT%H%M%S%Z00") -replace '[.][0-9]*_', '_'

Output (for Dave's Windows Machine)

`Python 3.10.14 | packaged by Anaconda, Inc. | (main, May  6 2024, 19:44:50) [MSC v.1916 64 bit (AMD64)]`

`timestamp`

In [None]:
# # I won't need this anymore, from CoLab
#!python -VVV
#!date -u +"%Y-%m-%dT%H%M%S%z"

Output (for CoLab)

`Python 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]`

`timestamp`

In [None]:
# # I won't need this anymore
#!python -VVV
#!date -u +"%Y-%m-%dT%H%M%S%z"

Output (for MyBinder)

`timestamp`

**Preprocessing text dataset**

In [None]:
seed = 966
torch.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

In [None]:
!pip install "spacy<3"
!python -m spacy download en_core_web_sm

In [None]:
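# Name the spaCy model explicitly so it matches the one downloaded above.
TEXT = data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm', lower=True)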
LABEL = data.LabelField()

In [None]:
train, test = datasets.TREC.splits(TEXT, LABEL)
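# random.seed() returns None, so seed first and pass the generator state explicitly.
random.seed(seed)
train, val = train.split(random_state=random.getstate())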
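
The `random_state` argument of `split` is meant to receive a value returned by `random.getstate()`. As a standard-library-only sketch (no torchtext involved), this is how seeding and state capture behave:

```python
import random

seed = 966  # same seed value as the notebook

# random.seed() seeds the global generator in place and returns None.
return_value = random.seed(seed)
print(return_value)

# random.getstate() captures the generator state so it can be replayed later.
state = random.getstate()
first_draw = [random.random() for _ in range(3)]

# Restoring the captured state reproduces exactly the same draws.
random.setstate(state)
second_draw = [random.random() for _ in range(3)]
print(first_draw == second_draw)
```

The practical point is that `random.seed(seed)` is called for its side effect only; it is the captured state, not the return value of `seed`, that makes a shuffle repeatable.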

In [None]:
vars(train[-1])

In [None]:
TEXT.build_vocab(train, min_freq=2)
LABEL.build_vocab(train)

In [None]:
print("Vocabulary size of TEXT:",len(TEXT.vocab.stoi))
print("Vocabulary size of LABEL:",len(LABEL.vocab.stoi))
print(LABEL.vocab.stoi)
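
To see what `min_freq=2` does to the vocabulary, here is a simplified sketch of a string-to-index mapping built with `collections.Counter`. It is not torchtext's implementation: the helper `build_stoi`, the toy sentences, and the tie-breaking order are assumptions made for illustration.

```python
from collections import Counter

def build_stoi(tokenized_sentences, min_freq=2, specials=("<unk>", "<pad>")):
    """Map tokens to integer ids, keeping only tokens seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    stoi = {tok: idx for idx, tok in enumerate(specials)}
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            stoi[tok] = len(stoi)
    return stoi

# Toy, already-tokenized questions (assumed data).
sentences = [
    ["what", "is", "pytorch", "?"],
    ["what", "is", "a", "tensor", "?"],
]
stoi = build_stoi(sentences, min_freq=2)
print(stoi)

# Tokens below the threshold (or never seen) fall back to the <unk> id at lookup time.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["what", "is", "numpy", "?"]]
print(ids)
```

Words that appear only once ("pytorch", "a", "tensor") never get their own id, so the embedding table stays smaller and the model learns a shared representation for rare words.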

In [None]:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, val, test),
    batch_size = 64,
    sort_key=lambda x: len(x.text),
    device=device
)
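
The reason for passing `sort_key=lambda x: len(x.text)` is that batching examples of similar length keeps padding to a minimum. The sketch below illustrates that idea in plain Python; it is a simplification and not how `BucketIterator` is actually implemented, and the helper names and example lengths are made up for the demonstration.

```python
import random

def bucket_batches(examples, batch_size, sort_key=len, seed=0):
    """Group examples of similar length into batches, then shuffle the batch order."""
    ordered = sorted(examples, key=sort_key)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches

def padding_needed(batches):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(len(ex) for ex in batch) - len(ex)
               for batch in batches for ex in batch)

# Hypothetical tokenized questions of varying length.
examples = [["w"] * n for n in (3, 12, 4, 11, 5, 10, 6, 9)]

# Batches taken in arrival order versus batches grouped by length.
naive = [examples[i:i + 2] for i in range(0, len(examples), 2)]
bucketed = bucket_batches(examples, batch_size=2)

print("padding without bucketing:", padding_needed(naive))
print("padding with bucketing:   ", padding_needed(bucketed))
```

With these toy lengths, the batches taken in arrival order need 24 pad tokens while the length-grouped batches need 4.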

**Building a Simple CNN Model**

In [None]:
class CNN(nn.Module):
  def __init__(self, vocabulary_size, embedding_size,
               kernels_number, kernel_sizes, output_size, dropout_rate):
    super().__init__()
    self.embedding = nn.Embedding(vocabulary_size, embedding_size)
    # One convolution per kernel size; each kernel spans the full embedding width.
    self.convolution_layers = nn.ModuleList([
        nn.Conv2d(in_channels=1, out_channels=kernels_number,
                  kernel_size=(k, embedding_size))
        for k in kernel_sizes])
    self.dropout = nn.Dropout(dropout_rate)
    self.fully_connected = nn.Linear(len(kernel_sizes) * kernels_number, output_size)

  def forward(self, text):
    # text: [seq_len, batch] -> [batch, seq_len]
    text = text.permute(1, 0)
    # [batch, seq_len, embedding_size]
    input_embeddings = self.embedding(text)
    # Add a channel dimension: [batch, 1, seq_len, embedding_size]
    input_embeddings = input_embeddings.unsqueeze(1)
    # Each entry: [batch, kernels_number, seq_len - k + 1]
    conved = [F.relu(convolution_layer(input_embeddings)).squeeze(3)
              for convolution_layer in self.convolution_layers]
    # Max-pool over the sequence; each entry: [batch, kernels_number]
    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
    # [batch, len(kernel_sizes) * kernels_number]
    concat = self.dropout(torch.cat(pooled, dim=1))
    # Class logits: [batch, output_size]
    final_output = self.fully_connected(concat)
    return final_output

In [None]:
input_size = len(TEXT.vocab)
embedding_size = 100
kernels_number = 100
kernel_sizes = [2, 3, 4]
output_size = len(LABEL.vocab)
dropout_rate = 0.3
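
A little arithmetic shows how these hyperparameters determine the tensor sizes inside the model. The sketch below assumes stride-1 convolutions with no padding (the `nn.Conv2d` defaults used in the class above); the helper `conv_output_length` and the example `seq_len` are hypothetical.

```python
def conv_output_length(seq_len, kernel_size):
    """Length of the feature map from a stride-1, unpadded convolution over seq_len tokens."""
    out_len = seq_len - kernel_size + 1
    if out_len < 1:
        raise ValueError(
            f"sequence of length {seq_len} is shorter than kernel size {kernel_size}"
        )
    return out_len

kernel_sizes = [2, 3, 4]   # same values the notebook uses
kernels_number = 100       # same value the notebook uses
seq_len = 10               # hypothetical padded batch length

for k in kernel_sizes:
    print(f"kernel {k}: feature map length {conv_output_length(seq_len, k)}")

# Max-pooling over the sequence keeps one value per filter, so the concatenated
# vector fed to the linear layer has this many features regardless of seq_len.
features = len(kernel_sizes) * kernels_number
print("fully-connected input features:", features)
```

One consequence worth keeping in mind: a batch whose padded length is shorter than the largest kernel size (4 here) cannot be convolved, and PyTorch raises an error in that case.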

In [None]:
model = CNN(input_size, embedding_size, kernels_number, kernel_sizes, output_size, dropout_rate)

In [None]:
print(model)

In [None]:
model.to(device)

**Train and Evaluate Functions**

In [None]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
criterion = criterion.to(device)

optimizer = optim.Adam(model.parameters())
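
`nn.CrossEntropyLoss` takes the raw scores (logits) from the model, applies a log-softmax internally, and by default averages the per-example losses over the batch. For a single example, the quantity it computes can be sketched in plain Python as follows; the function name and the example logits are illustrative only.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_index]

logits = [2.0, 0.5, -1.0]  # hypothetical raw scores for three classes

# Loss when the true class is the highest-scoring one.
print(f"{cross_entropy(logits, 0):.4f}")
# Loss when the true class is the lowest-scoring one.
print(f"{cross_entropy(logits, 2):.4f}")
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability shrinks, which is what the optimizer pushes against during training.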

In [None]:
def accuracy(predictions, actual_label):
    # Predicted class = index of the largest logit in each row.
    max_predictions = predictions.argmax(dim=1)
    correct_predictions = max_predictions.eq(actual_label)
    # Divide on whatever device the tensors live on, so this also runs on CPU.
    return correct_predictions.sum().float() / actual_label.shape[0]

In [None]:
def train(model, iterator, optimizer, criterion):

    model.train()
    epoch_loss = 0
    epoch_acc = 0

    for batch in iterator:
        optimizer.zero_grad()

        predictions = model(batch.text)

        loss = criterion(predictions, batch.label)

        acc = accuracy(predictions, batch.label)

        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
def evaluate(model, iterator, criterion):

    model.eval()
    epoch_loss = 0
    epoch_acc = 0

    with torch.no_grad():

        for batch in iterator:

            predictions = model(batch.text)

            loss = criterion(predictions, batch.label)

            acc = accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
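
Both functions report the mean of the per-batch accuracies, which weights every batch equally. When the last batch is smaller than the others, that can differ slightly from the accuracy over all examples. A plain-Python sketch with made-up `(correct, batch_size)` pairs shows the difference:

```python
def mean_of_batch_accuracies(batches):
    """Average the per-batch accuracies, as the train/evaluate functions do."""
    return sum(correct / size for correct, size in batches) / len(batches)

def per_example_accuracy(batches):
    """Total correct predictions divided by total examples."""
    total_correct = sum(correct for correct, _ in batches)
    total_examples = sum(size for _, size in batches)
    return total_correct / total_examples

# Hypothetical (correct, batch_size) pairs: two full batches and a small final one.
batches = [(48, 64), (48, 64), (1, 2)]

print(f"mean of batch accuracies: {mean_of_batch_accuracies(batches):.4f}")
print(f"per-example accuracy:     {per_example_accuracy(batches):.4f}")
```

With a batch size of 64 and thousands of examples the gap is usually small, but it is worth knowing about when comparing the numbers here against accuracies computed some other way.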

**Training the model**

In [None]:
number_of_epochs = 20

best_acc = float('-inf')

for epoch in range(number_of_epochs):

    # Run one pass over the training set, updating the weights.
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    # Measure loss and accuracy on the validation set (no weight updates).
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    if valid_acc > best_acc:
        # Keep the weights from the best validation epoch so far.
        best_acc = valid_acc
        torch.save(model.state_dict(), 'trec.pt')

    print(f'Epoch {epoch+1} ')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Validation Loss: {valid_loss:.3f} |  Validation Acc: {valid_acc*100:.2f}%')

In [None]:
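# map_location lets a checkpoint saved on a GPU be loaded on a CPU-only machine.
model.load_state_dict(torch.load('trec.pt', map_location=device))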

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
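
The save-and-reload pattern used for `trec.pt` can be tried in isolation. The sketch below is an assumption-laden miniature: a small `nn.Linear` stands in for the CNN, and an in-memory buffer stands in for the file on disk.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny stand-in model; the notebook's CNN would be handled the same way.
source_model = nn.Linear(4, 2)

# Save only the parameters (the state_dict), here to an in-memory buffer
# instead of a file such as 'trec.pt'.
buffer = io.BytesIO()
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps tensors saved on one device (e.g. a GPU) onto another.
restored_state = torch.load(buffer, map_location=torch.device("cpu"))

# The receiving model must be constructed with the same architecture first.
restored_model = nn.Linear(4, 2)
restored_model.load_state_dict(restored_state)

print(torch.equal(source_model.weight, restored_model.weight))
```

Because only the parameters are stored, the class definition and the same constructor arguments have to be available wherever the checkpoint is loaded.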