# Intro to Natural Language Processing
Welcome to NLP. NLP aims to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
Applications of NLP range from sentiment analysis, machine translation, chatbots, speech recognition, text summarization, and information retrieval.
In this notebook, we'll dive into the world of text analysis.
We will explore ways to extract meaning from text, and build a model that can classify newsgroups post into their respective topics.
We'll be using a simplistic technique called Bag of Words,
which involves representing text as numerical vectors of words represented their frequency and index.

# Data
To start off let's get some data. The dataset we are going to use is the 20 newsgroups dataset from `scikit-learn`.
The dataset comprises around 18000 newsgroups posts on 20 topics.

In [1]:
from sklearn.datasets import fetch_20newsgroups

news_dataset = fetch_20newsgroups(data_home='./')

In [2]:
import pandas as pd

# make a pandas dataframe out of the dataset
df = pd.DataFrame({
    'text': news_dataset.data,
    'label_number': news_dataset.target,
    'label_name': news_dataset.filenames
})

# df = df[:1000]

#  Let's see what's in it
df.head(5)

Unnamed: 0,text,label_number,label_name
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,./20news_home/20news-bydate-train/rec.autos/10...
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,./20news_home/20news-bydate-train/comp.sys.mac...
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,./20news_home/20news-bydate-train/comp.sys.mac...
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,./20news_home/20news-bydate-train/comp.graphic...
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,./20news_home/20news-bydate-train/sci.space/60880


# Preprocessing Data
In order to get usable data, we must transform the data to be suitable for analysis.
We'll be using some regular expression to clean out unwanted strings and
the `CountVectorizer` from scikit-learn to transform the collection of text
 into a matrix of token counts where each row represents a document and
  each column represents a unique word in the document collection.
  Let's see it in action.

In [3]:
import re
from pprint import pprint

def clean_df(df: pd.DataFrame):
    # remove non characters from review
    regex = re.compile('<\\w+ /?>|[^\\w \']|\\d|_')
    df['text'].replace(regex, ' ', regex=True, inplace=True)

    def extract_label_name(text: str):
        match = re.match(r'\./(.+)\\(.+)\\(.+)\\(.+)', text)
        return match.group(3) if match else text

    # extract label_name
    df['label_name'] = df.label_name.apply(
       extract_label_name
    )

    return df

# before
pprint(df.iloc[0]['text'])

clean_df(df)

# after
pprint(df.iloc[0]['text'])

("From: lerxst@wam.umd.edu (where's my thing)\n"
 'Subject: WHAT car is this!?\n'
 'Nntp-Posting-Host: rac3.wam.umd.edu\n'
 'Organization: University of Maryland, College Park\n'
 'Lines: 15\n'
 '\n'
 ' I was wondering if anyone out there could enlighten me on this car I saw\n'
 'the other day. It was a 2-door sports car, looked to be from the late 60s/\n'
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition,\n'
 'the front bumper was separate from the rest of the body. This is \n'
 'all I know. If anyone can tellme a model name, engine specs, years\n'
 'of production, where this car is made, history, or whatever info you\n'
 'have on this funky looking car, please e-mail.\n'
 '\n'
 'Thanks,\n'
 '- IL\n'
 '   ---- brought to you by your neighborhood Lerxst ----\n'
 '\n'
 '\n'
 '\n'
 '\n')
("From  lerxst wam umd edu  where's my thing  Subject  WHAT car is this   Nntp "
 'Posting Host  rac  wam umd edu Organization  University of Maryland  College '
 'Park 

In [4]:
df.head()

Unnamed: 0,text,label_number,label_name
0,From lerxst wam umd edu where's my thing Su...,7,./20news_home/20news-bydate-train/rec.autos/10...
1,From guykuo carson u washington edu Guy Kuo ...,4,./20news_home/20news-bydate-train/comp.sys.mac...
2,From twillis ec ecn purdue edu Thomas E Will...,4,./20news_home/20news-bydate-train/comp.sys.mac...
3,From jgreen amber Joe Green Subject Re We...,1,./20news_home/20news-bydate-train/comp.graphic...
4,From jcm head cfa harvard edu Jonathan McDow...,14,./20news_home/20news-bydate-train/sci.space/60880


In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# stop_words='english' removes common English words like "a" or "the' from the text
vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)
bag_of_words = vectorizer.fit_transform(df['text'])

In [6]:
bag_of_words

<11314x14060 sparse matrix of type '<class 'numpy.int64'>'
	with 981191 stored elements in Compressed Sparse Row format>

In [7]:
# see some of the tokens it collected
vectorizer.get_feature_names_out()

array(['aa', 'aaa', 'aardvark', ..., 'zx', 'zyeh', 'zz'], dtype=object)

# Dataset

Now let's use the cleaned `Dataframe` to make a pytorch `Dataset` so that we can manage and load data into our model later.


In [8]:
import torch
import numpy as np
from torch.utils.data import Dataset

class NewsGroupsDataset(Dataset):

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.vectorizer = CountVectorizer(stop_words='english', lowercase=True, max_df=.5, min_df=10)

        # fit vectorizer
        self.bag_of_words = self.vectorizer.fit_transform(self.df['text'])

    def __getitem__(self, index: int):
        # note: CrossEntropyLoss requires the datatype to be floats

        # converting bag-of-words representation into numpy array
        X = self.bag_of_words[index].toarray().squeeze().astype(np.float32)

        # one-hot encoded vector representing the target data
        # Y = [0.0] * len(self.classes)
        # Y[self.df.iloc[index]['label_number']] = 1.0
        # Y = torch.tensor(Y)
        Y = self.df.iloc[index]['label_number']

        return X, Y

    def __len__(self):
        return len(self.df)

    @property
    def classes(self):
        return fetch_20newsgroups(data_home='./').target_names

    @property
    def vocab_size(self):
        return len(self.vectorizer.get_feature_names_out())

In [9]:
dataset = NewsGroupsDataset(df)
dataset[0]

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32), 7)

# Hyper parameters
Set some hyper parameters for our model

In [10]:
import os

epochs = 10
batch_size = 128
lr = 1e-3
num_workers = int(os.getenv('NCPUS', 0))

# Model
Next, let's make the model. We'll make a super simple linear regression model using `pytorch_lightning`.
For those that are unfamiliar with the library, PyTorch Lightning is a lightweight and flexible PyTorch wrapper
that allows you to focus on the high-level structure of your deep learning models rather than the low-level details of PyTorch.

Here are a list of basic things we will need in pytorch lightning model to get started.
1) the forward function
2) training_step and validation_step
4) configure_optimizers
5) train_dataloader and val_dataloader

In [11]:
from torch import nn
from torch.utils.data import SubsetRandomSampler
from torch.utils.data import DataLoader
import pytorch_lightning as pl
import torchmetrics

class LitModel(pl.LightningModule):
    def __init__(self, in_features: int, out_features: int, hidden_units=16, *, dataset=dataset):
        super().__init__()
        self.dataset = dataset
        self.loss_fn = nn.CrossEntropyLoss()
        self.metric = torchmetrics.Accuracy(task='multiclass', num_classes=len(self.dataset.classes))

        self.in_features = in_features
        self.out_features = out_features
        self.hidden_units = hidden_units

        # setting up samplers to split data for training and evaluation
        dataset_indices = list(range(len(self.dataset)))
        np.random.shuffle(dataset_indices)
        split_index = int(np.floor(0.2 * len(self.dataset)))
        train_indices, val_indices = dataset_indices[split_index:], dataset_indices[:split_index]
        self.train_sampler = SubsetRandomSampler(train_indices)
        self.val_sampler = SubsetRandomSampler(val_indices)

        # layers
        self.layer = nn.Sequential(
            nn.Identity()
            #nn.Linear(self.in_features, self.hidden_units),
        )
        self.fc = nn.Sequential(
            nn.LazyLinear(self.out_features),
        )

    def forward(self, X: torch.Tensor):
        outputs = self.layer(X)
        return self.fc(outputs)

    def training_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)
        # log data to a logger
        self.log('train_loss', loss.item(), on_step=True, sync_dist=True)
        return loss

    def validation_step(self, batch: torch.Tensor, index: int):
        x, y = batch
        # forward pass
        output = self(x)
        # calculate loss
        loss = self.loss_fn(output, y)

        accuracy = self.metric(torch.argmax(output, dim=-1), y).item()

        # log data to a logger
        self.log('val_loss', loss.item(), on_step=True, sync_dist=True)
        self.log('accuracy', accuracy, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=lr)

    def train_dataloader(self):
        train_loader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            sampler=self.train_sampler,
            num_workers=num_workers
        )
        return train_loader

    def val_dataloader(self):
        val_loader = DataLoader(
            dataset=dataset,
            batch_size=batch_size,
            sampler=self.val_sampler,
            num_workers=num_workers
        )
        return val_loader

In [12]:
model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)



# Trainer
To train our model, we'll need a `Trainer`.
The `Trainer` is a high-level module that provides a simple and consistent interface for training, validation, and testing your PyTorch models.
We'll also want to log our data with a logger. You will need a Wandb account to view you're logs.

In [13]:
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning import Trainer

logger = WandbLogger(project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mdhudsmith[0m ([33mphysics-ml[0m). Use [1m`wandb login --relogin`[0m to force relogin


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [14]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type               | Params
-----------------------------------------------
0 | loss_fn | CrossEntropyLoss   | 0     
1 | metric  | MulticlassAccuracy | 0     
2 | layer   | Sequential         | 0     
3 | fc      | Sequential         | 0     
-----------------------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [15]:
logger.experiment.finish()

0,1
accuracy,▁▇▇▇██████
epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇████
train_loss,█▇▆▅▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▁▁▁▁▁▁▁▁▁
trainer/global_step,▁▁▁▁▂▂▂▂▂▃▃▁▃▃▄▂▄▄▄▂▅▅▅▂▅▆▆▂▆▆▇▂▇▇▇▃███▃
val_loss_epoch,█▅▄▃▂▂▂▁▁▁
val_loss_step,▇█▇█▅▇▅▅▄▄▄▄▂▃▃▃▃▂▂▃▂▁▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▁▅▁

0,1
accuracy,0.90628
epoch,9.0
train_loss,0.11428
trainer/global_step,709.0
val_loss_epoch,0.53925
val_loss_step,0.44915


# Improvement
There are ways to improve a model. One way is to try different model architectures.
Let's start by introducing non-linearity with the `ReLU` activation function.

In [16]:
model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
)



In [17]:
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [18]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type               | Params
-----------------------------------------------
0 | loss_fn | CrossEntropyLoss   | 0     
1 | metric  | MulticlassAccuracy | 0     
2 | layer   | Sequential         | 449 K 
3 | fc      | Sequential         | 0     
-----------------------------------------------
449 K     Trainable params
0         Non-trainable params
449 K     Total params
1.800     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [19]:
logger.experiment.finish()

VBox(children=(Label(value='0.005 MB of 0.005 MB uploaded (0.000 MB deduped)\r'), FloatProgress(value=1.0, max…

0,1
accuracy,▁▄▇▇▇█▇███
epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇████
train_loss,█▇▆▅▄▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
trainer/global_step,▁▁▁▁▂▂▂▂▂▃▃▁▃▃▄▂▄▄▄▂▅▅▅▂▅▆▆▂▆▆▇▂▇▇▇▃███▃
val_loss_epoch,█▄▂▂▁▁▁▁▁▁
val_loss_step,▇███▄▄▃▆▃▂▄▂▂▃▃▃▂▂▁▃▂▁▁▂▂▂▂▁▂▂▂▁▁▁▂▁▂▂▁▂

0,1
accuracy,0.91114
epoch,9.0
train_loss,0.04605
trainer/global_step,709.0
val_loss_epoch,0.39582
val_loss_step,0.40152


Did it improve?
Try adding regularization with `Dropout`.

In [20]:
model = LitModel(
    in_features=dataset.vocab_size,
    out_features=len(dataset.classes),
    hidden_units=32,
    dataset=dataset
)
model.layer = nn.Sequential(
    nn.Linear(model.in_features, model.hidden_units),
    nn.ReLU(),
    nn.Dropout(.2),
)



In [21]:
logger = WandbLogger(name='experiment2', project='lightning_logs')
trainer = Trainer(max_epochs=epochs, logger=logger, log_every_n_steps=1)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


In [22]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type               | Params
-----------------------------------------------
0 | loss_fn | CrossEntropyLoss   | 0     
1 | metric  | MulticlassAccuracy | 0     
2 | layer   | Sequential         | 449 K 
3 | fc      | Sequential         | 0     
-----------------------------------------------
449 K     Trainable params
0         Non-trainable params
449 K     Total params
1.800     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [23]:
logger.experiment.finish()

0,1
accuracy,▁▅▆▆▇▇████
epoch,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇████
train_loss,█▇▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
trainer/global_step,▁▁▁▁▂▂▂▂▂▃▃▁▃▃▄▂▄▄▄▂▅▅▅▂▅▆▆▂▆▆▇▂▇▇▇▃███▃
val_loss_epoch,█▄▂▂▁▁▁▁▁▁
val_loss_step,████▄▄▆▄▃▃▄▃▅▂▃▂▃▃▂▂▃▂▂▃▂▂▆▃▂▃▁▁▂▃▂▂▂▁▆▁

0,1
accuracy,0.90318
epoch,9.0
train_loss,0.07534
trainer/global_step,709.0
val_loss_epoch,0.48656
val_loss_step,0.33459


# Conclusion
Machine learning requires lots of experimenting.
It often requires trying out different models, hyperparameters, and preprocessing techniques to achieve optimal results.
This is only the start of NLP. Throughout the workshop you may find other approaches to this problem.