# End-to-End NLP: News Headline Classifier (Local Version)

_**Train a PyTorch-based model to classify news headlines into four categories**_

This notebook works well with the `Python 3 (PyTorch 1.4 Python 3.6 CPU Optimized)` kernel on SageMaker Studio, or `conda_pytorch_p36` on classic SageMaker Notebook Instances.

---

In this version, the model is trained and evaluated here on the notebook instance itself. We'll show in the follow-on notebook how to take advantage of Amazon SageMaker to separate these infrastructure needs.

Note that you can safely ignore the WARNING about the pip version.


In [None]:
# First install some libraries which might not be available across all kernels (e.g. in Studio):
!pip install ipywidgets

In [None]:
!pip install torchtext==0.5  # Depending on your PyTorch version https://pypi.org/project/torchtext/

### Download News Aggregator Dataset

We will download the **FastAi AG News** dataset from the https://registry.opendata.aws/fast-ai-nlp/ public repository. This dataset contains a table of news headlines and their corresponding classes.


In [None]:
%%time
import util.preprocessing

util.preprocessing.download_dataset()

### Let's visualize the dataset

We will load the `ag_news_csv/train.csv` file into a pandas DataFrame for our data processing work.

In [None]:
import os
import re
import numpy as np
import pandas as pd

In [None]:
column_names = ["CATEGORY", "TITLE", "CONTENT"]
# we use the train.csv only
df = pd.read_csv("data/ag_news_csv/train.csv", names=column_names, header=None, delimiter=",")
# shuffle the DataFrame rows
df = df.sample(frac = 1)
# make the category classes more readable
mapping = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
df = df.replace({'CATEGORY': mapping})
df.head()


For this exercise we'll **only use**:

- The **title** (Headline) of the news story, as our input
- The **category**, as our target variable


In [None]:
df["CATEGORY"].value_counts()

The dataset has **four article categories:**

- Business
- Sci/Tech
- Sports
- World

## Natural Language Pre-Processing

We'll do some basic processing of the text data to convert it into numerical form that the algorithm will be able to consume to create a model.

We'll apply typical pre-processing steps for NLP workloads: dummy encoding the labels, tokenizing the documents, and padding each document to a fixed sequence length so that every input vector has the same dimension.


### Dummy Encode the Labels


In [None]:
encoded_y, labels = util.preprocessing.dummy_encode_labels(df, "CATEGORY")
print(labels)
print(encoded_y)

In [None]:
# Rows were shuffled above, so take the first row by position (to line up with encoded_y[0])
df["CATEGORY"].iloc[0]

In [None]:
encoded_y[0]
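
To make the encoding step concrete, here is a minimal sketch of dummy (one-hot) encoding with NumPy. The helper `one_hot_encode` and the sample list are hypothetical and only illustrate the idea; `util.preprocessing.dummy_encode_labels` may be implemented differently, and the alphabetical label order is an assumption.

```python
import numpy as np

def one_hot_encode(categories):
    """Return (one-hot matrix, sorted label list) for a sequence of category names."""
    label_names = sorted(set(categories))
    index_of = {name: i for i, name in enumerate(label_names)}
    encoded = np.zeros((len(categories), len(label_names)), dtype=np.float32)
    for row, category in enumerate(categories):
        # Exactly one column per row is set to 1
        encoded[row, index_of[category]] = 1.0
    return encoded, label_names

sample_categories = ["World", "Sports", "Business", "Sci/Tech", "Sports"]
encoded_sample, sample_labels = one_hot_encode(sample_categories)
print(sample_labels)
print(encoded_sample)
```

Each row contains a single 1 in the column of its category, which is the target format that a four-way classifier with one output per class expects.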

### Tokenize and Set Fixed Sequence Lengths

We want to describe our inputs at the more meaningful word level (rather than individual characters), and ensure a fixed length of the input feature dimension.


In [None]:
processed_docs, tokenizer = util.preprocessing.tokenize_and_pad_docs(df, "TITLE")

In [None]:
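# Rows were shuffled above, so take the first row by position (to line up with processed_docs[0])
df["TITLE"].iloc[0]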

In [None]:
processed_docs[0]
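
As a rough illustration of what "tokenize and pad" means, the sketch below splits headlines into words, pads them to a fixed length, and maps each token to an integer index. All names here (`simple_tokenize`, `pad_or_truncate`, `build_vocab`, the `<pad>` and `<unk>` tokens, and the length of 6) are assumptions for the example and not the behaviour of `util.preprocessing.tokenize_and_pad_docs`.

```python
def simple_tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Force a token list to exactly max_len entries."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))

def build_vocab(token_lists, pad_token="<pad>", unk_token="<unk>"):
    """Assign each distinct token an integer index, reserving 0 and 1."""
    vocab = {pad_token: 0, unk_token: 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

demo_headlines = ["Stocks rally after merger news", "Team wins the cup"]
demo_tokens = [simple_tokenize(h) for h in demo_headlines]
demo_vocab = build_vocab(demo_tokens)
demo_max_len = 6
demo_ids = [
    [demo_vocab.get(tok, demo_vocab["<unk>"]) for tok in pad_or_truncate(toks, demo_max_len)]
    for toks in demo_tokens
]
print(demo_ids)
```

Every headline ends up as a list of exactly `demo_max_len` integers, which is what allows the documents to be stacked into a single rectangular array.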

### Import Word Embeddings

To represent our words in numeric form, we'll use pre-trained vector representations for each word in the vocabulary. In this case we'll be using pre-built GloVe word embeddings.

You could also explore training custom, domain-specific word embeddings using SageMaker's built-in [BlazingText algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html). See the official [blazingtext_word2vec_text8 sample](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/blazingtext_word2vec_text8) for an example notebook showing how.


In [None]:
%%time
embedding_matrix = util.preprocessing.get_word_embeddings(tokenizer, "data/embeddings")

In [None]:
np.save(
    file="./data/embeddings/docs-embedding-matrix",
    arr=embedding_matrix,
    allow_pickle=False,
)
vocab_size=embedding_matrix.shape[0]
print(embedding_matrix.shape)
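
Conceptually, the embedding matrix has one row per vocabulary word, where row `i` holds the pre-trained vector of the word with index `i`. The sketch below shows one way such a matrix could be assembled. The function `build_embedding_matrix`, the toy vocabulary, and the 3-dimensional vectors are made up for illustration (the model in this notebook uses 100-dimensional vectors), and `util.preprocessing.get_word_embeddings` may handle missing words differently.

```python
import numpy as np

def build_embedding_matrix(word_to_index, word_vectors, dim):
    """Row i holds the vector for the word with index i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_to_index), dim), dtype=np.float32)
    for word, idx in word_to_index.items():
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy stand-ins for the tokenizer vocabulary and the pre-trained vectors
toy_index = {"<pad>": 0, "market": 1, "goal": 2, "zzzz": 3}
toy_vectors = {"market": [0.1, 0.2, 0.3], "goal": [0.4, 0.5, 0.6]}
toy_matrix = build_embedding_matrix(toy_index, toy_vectors, dim=3)
print(toy_matrix.shape)
```

The first dimension of the matrix is the vocabulary size, which is why the notebook reads `vocab_size` from `embedding_matrix.shape[0]`.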

### Split Train and Test Sets

Finally we need to divide our data into model training and evaluation sets:


In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    processed_docs,
    encoded_y,
    test_size=0.2,
    random_state=42
)
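
For intuition, a shuffled 80/20 split can be sketched in a few lines of NumPy. This is not how scikit-learn implements `train_test_split`, and it will not reproduce the same rows for a given seed; `shuffled_split` and the toy arrays are hypothetical.

```python
import numpy as np

def shuffled_split(features, targets, test_fraction=0.2, seed=42):
    """Shuffle row indices once, then use the first test_fraction of them for evaluation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Index features and targets with the same indices so that pairs stay aligned
    return features[train_idx], features[test_idx], targets[train_idx], targets[test_idx]

toy_features = np.arange(20).reshape(10, 2)   # row i is [2*i, 2*i + 1]
toy_targets = np.arange(10)
f_train, f_test, t_train, t_test = shuffled_split(toy_features, toy_targets)
print(f_train.shape, f_test.shape)
```

The important property is that inputs and targets are indexed with the same shuffled indices, so each headline keeps its own label on both sides of the split.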

In [None]:
# Do you always remember to save your datasets for traceability when experimenting locally? ;-)
os.makedirs("./data/train", exist_ok=True)
np.save("./data/train/train_X.npy", x_train)
np.save("./data/train/train_Y.npy", y_train)
os.makedirs("./data/test", exist_ok=True)
np.save("./data/test/test_X.npy", x_test)
np.save("./data/test/test_Y.npy", y_test)
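
A quick way to confirm that saved arrays can be read back unchanged is a save/load round trip. The sketch below uses a temporary directory and small made-up arrays so that it does not touch the `./data` folders; the file names are placeholders.

```python
import os
import tempfile
import numpy as np

demo_x = np.arange(12).reshape(4, 3)
demo_y = np.eye(4, dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "demo_X.npy")
    y_path = os.path.join(tmp_dir, "demo_Y.npy")
    np.save(x_path, demo_x)
    np.save(y_path, demo_y)
    # np.load reads the whole array into memory, so it survives the directory cleanup
    reloaded_x = np.load(x_path)
    reloaded_y = np.load(y_path)

round_trip_ok = np.array_equal(demo_x, reloaded_x) and np.array_equal(demo_y, reloaded_y)
print(round_trip_ok, reloaded_x.shape, reloaded_y.dtype)
```

The `.npy` format stores shape and dtype alongside the values, so the arrays come back exactly as they were written.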

## Define the Model


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

seed = 42
np.random.seed(seed)
torch.manual_seed(seed)  # also seed PyTorch, which initializes the model weights
num_classes = len(labels)

In [None]:
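class Net(nn.Module):
    def __init__(self, vocab_size=400000, num_classes=4):
        super(Net, self).__init__()
        # 100-dimensional word vectors (the weights are replaced by GloVe in train())
        self.embedding = nn.Embedding(vocab_size, 100)
        self.conv1 = nn.Conv1d(100, 128, kernel_size=3)
        self.max_pool1d = nn.MaxPool1d(5)
        self.flatten1 = nn.Flatten()
        self.dropout1 = nn.Dropout(p=0.3)
        # 896 = 128 channels x 7 pooled positions, so this size depends on the padded sequence length
        self.fc1 = nn.Linear(896, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.embedding(x)         # (batch, seq_len, 100)
        x = torch.transpose(x, 1, 2)  # (batch, 100, seq_len): Conv1d expects channels first
        x = self.flatten1(self.max_pool1d(self.conv1(x)))
        x = self.dropout1(x)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        # Softmax over the class dimension; an explicit dim avoids the implicit-dimension warning
        return F.softmax(x, dim=1)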
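
The input size of `nn.Linear(896, 128)` follows from the padded sequence length, so it is worth seeing where the number comes from. The sketch below applies the output-length formulas documented for `Conv1d` and `MaxPool1d` in plain Python. The helper names are hypothetical, and the sequence lengths tried are assumptions, since the padded length is set inside `util.preprocessing`.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def maxpool1d_out_len(length, kernel_size):
    """Output length of max pooling with stride equal to kernel_size and no padding."""
    return (length - kernel_size) // kernel_size + 1

def flattened_features(seq_len, conv_channels=128, conv_kernel=3, pool_kernel=5):
    """Number of features reaching the first linear layer for a given padded length."""
    after_conv = conv1d_out_len(seq_len, conv_kernel)
    after_pool = maxpool1d_out_len(after_conv, pool_kernel)
    return conv_channels * after_pool

for seq_len in (30, 40, 50):
    print(seq_len, flattened_features(seq_len))
```

With these layer settings, any padded length from 37 to 41 tokens yields 896 features. If the padded length is changed outside that range, the first linear layer's input size has to change with it.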

## Define Train and Helper Functions


In [None]:
def test(model, test_loader, device):
    model.eval()
    test_loss = 0.0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            # Accumulate the mean loss of each batch
            test_loss += F.binary_cross_entropy(output, target, reduction='mean').item()
            pred = output.max(1, keepdim=True)[1]  # index of the highest class probability
            target_index = target.max(1, keepdim=True)[1]  # index of the 1 in the one-hot target
            correct += pred.eq(target_index).sum().item()

    # Average the per-batch means over the number of batches
    test_loss /= len(test_loader)
    print("val_loss: {:.4f}".format(test_loss))
    print("val_acc: {:.4f}".format(correct / len(test_loader.dataset)))

def train(train_loader, test_loader, embedding_matrix, vocab_size = 400000, num_classes = 4, epochs = 12, learning_rate = 0.001):
    ###### Setup model architecture ############
    model = Net(vocab_size, num_classes)
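    # Load the pre-trained embeddings and freeze them so that they are not updated during training
    model.embedding.weight = torch.nn.parameter.Parameter(torch.FloatTensor(embedding_matrix), requires_grad=False)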
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    optimizer = optim.RMSprop(model.parameters(), lr=learning_rate)

    for epoch in range(1, epochs + 1):
        model.train()
        running_loss = 0.0
        for batch_idx, (X_train, y_train) in enumerate(train_loader, 1):
            data, target = X_train.to(device), y_train.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.binary_cross_entropy(output, target)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print("epoch: {}".format(epoch))
        # loss.item() is already a per-batch mean, so average over the number of batches
        print("train_loss: {:.6f}".format(running_loss / len(train_loader)))
        print("Evaluating model")
        test(model, test_loader, device)
    return model
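
To see what the loss and accuracy computations above measure, here is a small NumPy sketch that evaluates binary cross-entropy and argmax accuracy on made-up probabilities. It only mirrors the formula behind `F.binary_cross_entropy` with mean reduction. The `eps` clipping is a choice made for this example (PyTorch guards against `log(0)` differently), and the helper names and toy arrays are hypothetical.

```python
import numpy as np

def mean_binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean over every element of -[t*log(p) + (1-t)*log(1-p)]."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(np.mean(-(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))))

def argmax_accuracy(probs, targets):
    """Fraction of rows where the most probable class matches the one-hot target."""
    return float(np.mean(probs.argmax(axis=1) == targets.argmax(axis=1)))

toy_probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.2, 0.5, 0.2, 0.1],
                      [0.1, 0.6, 0.2, 0.1]])
toy_onehot = np.array([[1, 0, 0, 0],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]], dtype=float)

print(mean_binary_cross_entropy(toy_probs, toy_onehot))
print(argmax_accuracy(toy_probs, toy_onehot))
```

Because the mean runs over every class column and not only the true class, the loss also penalizes probability assigned to wrong classes. The third toy row is misclassified, so accuracy comes out at two out of three.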

In [None]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, data, labels):
        'Initialization'
        self.labels = labels
        self.data = data

    def __len__(self):
        'Denotes the total number of samples'
        return len(self.data)

    def __getitem__(self, index):
        # Load data and get label
        X = torch.as_tensor(self.data[index]).long()
        y = torch.as_tensor(self.labels[index])
        return X, y

## Fit (Train) and Evaluate the Model


In [None]:
%%time
# fit the model here in the notebook:
epochs = 5
learning_rate = 0.001
model_dir = 'model/'
trainloader = torch.utils.data.DataLoader(Dataset(x_train, y_train), batch_size=16,
                                          shuffle=True)
testloader = torch.utils.data.DataLoader(Dataset(x_test, y_test), batch_size=1,
                                          shuffle=True )
print("Training model")
model = train(trainloader, testloader, embedding_matrix, 
              vocab_size=vocab_size, num_classes=num_classes, epochs=epochs, learning_rate=learning_rate)

## (**JupyterLab / SageMaker Studio Only**) Installing IPyWidgets Extension

This notebook uses a fun little interactive widget to query the classifier, which works out of the box in plain Jupyter on a SageMaker Notebook Instance - but in JupyterLab or SageMaker Studio requires an extension not installed by default.

**If you're using JupyterLab on a SageMaker Notebook Instance**, you can install it via UI:

- Select "*Settings > Enable Extension Manager (experimental)*" from the toolbar, and confirm to enable it
- Click on the new jigsaw puzzle piece icon in the sidebar, to open the Extension Manager
- Search for `@jupyter-widgets/jupyterlab-manager` (Scroll down - search results show up *below* the list of currently installed widgets!)
- Click "**Install**" below the widget's description
- Wait for the blue progress bar that appears by the search box
- You should be prompted "*A build is needed to include the latest changes*" - select "**Rebuild**"
- The progress bar should resume, and you should shortly see a "Build Complete" dialogue.
- Select "**Reload**" to reload the webpage

**If you're using SageMaker Studio**, you can install it via CLI:

- Open a new launcher and select **System terminal** (and **not** *Image terminal*)
- Change to the repository root folder (e.g. with `cd sagemaker-workshop-101`) and check with `pwd` (print working directory)
- Run `./init-studio.sh` and refresh your browser page when the script is complete.


## Use the Model (Locally)

Let's evaluate our model with some example headlines...

The cell below calls the `classify()` function directly from Python. To use the interactive widget instead, uncomment the `widgets.interact_manual` block. You can be creative with your headlines!


In [None]:
from IPython import display
import ipywidgets as widgets

def classify(text):
    """Classify a headline and print the results"""
    processed = tokenizer.preprocess(text)
    padded = tokenizer.pad([processed])
    # Map each token to its vocabulary index
    final_text = [tokenizer.vocab.stoi[w] for w in padded[0]]
    final_text = torch.tensor([final_text])
    model.cpu()
    model.eval()
    # No gradients are needed for inference
    with torch.no_grad():
        result = model(final_text)
    print(result)
    ix = int(torch.argmax(result, dim=1))  # index of the most probable class
    print(f"Predicted class: '{labels[ix]}' with confidence {result[0][ix].item():.2%}")

#interaction = widgets.interact_manual(
#    classify,
#    text=widgets.Text(
#        value="The markets were bullish after news of the merger",
#        placeholder="Type a news headline...",
#        description="Headline:",
#        layout=widgets.Layout(width="99%"),
#    )
#)
#interaction.widget.children[1].description = "Classify!"

# Write your own headline and try the classification here
classify("Retailers are expanding after the recent economic growth")
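
The last step of `classify()` turns a vector of class probabilities into a label and a confidence value. The sketch below isolates that step in plain Python. The probabilities are made up, and the label order shown is an assumption (the real order comes from `labels`).

```python
def top_prediction(probabilities, class_names):
    """Return (label, confidence) for the highest-scoring class."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]

demo_labels = ["Business", "Sci/Tech", "Sports", "World"]
demo_probs = [0.81, 0.09, 0.04, 0.06]
demo_label, demo_confidence = top_prediction(demo_probs, demo_labels)
print(f"Predicted class: '{demo_label}' with confidence {demo_confidence:.2%}")
```

Since the model ends in a softmax, the reported confidence is simply the largest of four values that sum to 1. It is useful for ranking classes but is not a calibrated probability.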

## Review

In this notebook we pre-processed publicly downloadable data and trained a neural news headline classifier, as a data scientist might normally do when working on a local machine.

...But can we use the cloud more effectively to allocate high-performance resources, and easily deploy our trained models for use by other applications?

Head on over to the next notebook, [Headline Classifier SageMaker.ipynb](Headline%20Classifier%20SageMaker.ipynb), where we'll show how the same model can be trained and then deployed on specific target infrastructure with Amazon SageMaker.
