###### Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2021 Lorena A. Barba, Tingyu Wang

# Multinomial logistic regression

In the previous lesson, we extended the logistic regression model by using multiple features (pixels of an image) to identify defective parts.
Since our data only have two class labels: "okay" and "defective", this problem is considered a binary classification problem.
What if we are asked to classify a dataset into more than two class labels, for instance, grading the quality of metal parts as A, B, or C?
In this notebook, we will introduce multinomial logistic regression to identify whether the sentiment of a tweet is negative, neutral or positive.

In [None]:
from autograd import numpy
from autograd import grad
import pandas as pd
from matplotlib import pyplot

## Preprocess the tweets

The original dataset in this lesson comes from US air travelers' tweets in February 2015, see Reference [1].
Each tweet expresses a passenger's feelings about the flight experience, and is labeled as negative, neutral or positive. 

Let's read in the data and take a look.

In [None]:
tweets = pd.read_csv('../data/tweets.csv')
pd.set_option('display.max_colwidth', None)  # display tweets using max width of cell
print(f"{tweets.shape = }")
num_samples = tweets.shape[0]
tweets.sample(10, random_state=0)  # fix random_state for reproducibility

There are 14k tweets in the dataset.
Each entry consists of the text of the tweet and its sentiment.
Our goal is to classify the sentiment of these tweets into the three labels.
Before we move on, let us check the percentage of each sentiment.

In [None]:
tweets.airline_sentiment.value_counts(normalize=True)

As you might have guessed, a negative experience is more likely to lead to a tweet.

### Cleaning the text data

Up to this point, we have always been given the feature arrays directly.
Even when the dataset consists of images, as in the metal-casting parts example, the grayscale images are represented as 2D arrays.
However, this is not the case for text data, like the tweets in this lesson.

In this lesson, we will briefly walk through how to clean raw text data and transform them into feature arrays.
Though this is not the main focus of this lesson, we hope it can provide you at least a high-level understanding of how text data are dealt with in real-world applications.

Let's import a script that we prepared to facilitate the preprocessing steps.

In [None]:
import sys
sys.path.append('../scripts/')
from lesson6_helpers import clean_tweet, LemmaTokenizer

The first step is always cleaning the data.
Specifically for our dataset, we clean the strings by removing non-alphabetic characters (numbers, symbols, punctuation), lowercasing all letters, and removing stop words.
The first two operations are relatively easy to understand.
Stop words are commonly used words, such as "a", "the", "or" and "will", that often carry very little information in a sentence compared with other words.
Removing them reduces the size of the data.
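
To make the three cleaning operations concrete, here is a minimal sketch. It is not the `clean_tweet` helper from `lesson6_helpers` (whose implementation is not shown here); the function name `clean_text_sketch` and the tiny stop-word list are assumptions made up for illustration.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# real stop-word lists contain many more entries.
STOP_WORDS = {"a", "an", "and", "is", "or", "the", "was", "will"}

def clean_text_sketch(text):
    """Keep letters only, lowercase everything, and drop stop words."""
    letters_only = re.sub(r"[^A-Za-z]+", " ", text)  # non-letters become spaces
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in STOP_WORDS)

print(clean_text_sketch("The flight was GREAT!!! 10/10"))  # flight great
```

The actual helper may differ in details, for example in which stop-word list it uses.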

Let's clean the tweets and save the results in the `clean_tweet` column.

In [None]:
tweets['clean_tweet'] = tweets['text'].apply(clean_tweet)
tweets.sample(5, random_state=6)   # fix random_state for reproducibility

### Training, validation, test split

Next, we split the data into three sets: training, validation and test datasets, as discussed in lesson 5.
As a recap, we fit the model with the training set, monitor the learning progress and tune the hyperparameters using the validation set, and evaluate the performance of the final tuned model using the test set.

Let's split our dataset in a 60/20/20 fashion.
Recall that in lesson 5 we split `ok_images` and `def_images` separately and then combined them to preserve the defective ratio in each of the three sets.
The same idea also applies to data with multi-class labels.
Here we want to keep the same negative-neutral-positive ratio across the three sets as the complete set.

Let's use the scikit-learn function [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) this time for convenience.
As its name suggests, this function can only split a dataset into two parts.
Therefore, we first split the data into `train` (80%) and `test` (20%), and then split the `train` (80%) again into the "real" `train` (60%) and `validation` (20%).
We also turn on the `stratify` option to preserve the negative-neutral-positive ratio in each set.
Let's execute the cell below.
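
To see why the second call uses `test_size=0.25`: one quarter of the remaining 80% is 20% of the full dataset. The short sketch below only does this arithmetic; `split_sizes` is a hypothetical helper, and the exact rounding used inside scikit-learn may differ by a sample when the sizes do not divide evenly.

```python
def split_sizes(n_samples, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, val, test) sizes for a two-stage split."""
    n_test = round(n_samples * test_frac)       # first split: hold out the test set
    n_rest = n_samples - n_test                 # what remains after the first split
    n_val = round(n_rest * val_frac_of_rest)    # second split: validation set
    return n_rest - n_val, n_val, n_test

print(split_sizes(1000))  # (600, 200, 200)
```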

In [None]:
from sklearn.model_selection import train_test_split

random_state = 0   # fix random_state for reproducibility
train, test = train_test_split(tweets, test_size=0.2,
                               random_state=random_state,
                               stratify=tweets['airline_sentiment'])
train, val = train_test_split(train, test_size=0.25,
                              random_state=random_state,
                              stratify=train['airline_sentiment'])

print("--- training set ---")
print(train['airline_sentiment'].value_counts(normalize=True))
print(f"{train.shape = }")
print("\n-- validation set --")
print(val['airline_sentiment'].value_counts(normalize=True))
print(f"{val.shape = }")
print("\n----- test set -----")
print(test['airline_sentiment'].value_counts(normalize=True))
print(f"{test.shape = }")

### Transform text into feature vectors

The next step is to convert the cleaned tweets into vectors of features $\mathbf{x}$. To help you understand the method we will use, let's consider the toy example below, borrowed from this [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Suppose a new dataset only has four samples (tweets):

```Python
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
```

There are 9 **unique** words in the corpus. Let's index them as below.

| `and` | `document` | `first` | `is` | `one` | `second` | `the` | `third` | `this`|
|-------|------------|---------|------|-------|----------|-------|---------|-------|
|   0   |      1     |    2    |   3  |   4   |     5    |   6   |    7    |    8  |

If we treat each unique word as a feature, we can then use a vector of word counts to represent a string.
For example, the second sample: `This document is the second document` has $2$ `document`s, $1$ `is`, $1$ `second`, $1$ `the` and $1$ `this`. Based on the table of unique words' indices above, its feature vector can be written as:
```Python
[0 2 0 1 0 1 1 0 1]
```
Similarly, we can convert this dataset into a feature matrix $X$:
```Python
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
```
This matrix has a shape of `(num_samples, num_unique_words)`.
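
The counting procedure above can be reproduced by hand in a few lines. This is a simplified sketch with a hand-rolled tokenizer (lowercase, letters only), not the scikit-learn implementation; the names `tokenize`, `count_vector` and `X_toy` are made up for this example.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    """Lowercase the text and extract runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

# Sorted unique words define the column index of each feature
vocabulary = sorted({word for doc in corpus for word in tokenize(doc)})

def count_vector(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]

X_toy = [count_vector(doc) for doc in corpus]
for row in X_toy:
    print(row)
```

Words that are not in the vocabulary are simply ignored by `count_vector`, which mirrors what happens when a fitted vocabulary is applied to new text.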

[`CountVectorizer()`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) in scikit-learn offers such functionality in a user-friendly way.
Let's run the cell below to generate feature arrays $X$ for all three datasets.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())

X_train = vectorizer.fit_transform(train.clean_tweet).astype('float').toarray()
X_val = vectorizer.transform(val.clean_tweet).astype('float').toarray()
X_test = vectorizer.transform(test.clean_tweet).astype('float').toarray()

The variable `vectorizer` indexes the unique words from the training set and we generate these feature arrays accordingly using `fit_transform()` or `transform()` (the difference is explained in lesson 4). Check `vectorizer.vocabulary_` if you are curious about the underlying mapping from unique words to indices.

In [None]:
print(f"{X_train.shape = }")
print(f"{X_val.shape   = }")
print(f"{X_test.shape  = }")

num_features = X_train.shape[1]   # equal to the number of unique words in the training set

### One-hot encoding for output variable

After preparing the feature arrays, let's work on the output variable `airline_sentiment`, which still has a string-like type at the moment.
We need to convert these categorical data into something we can compute with.
In lesson 5, we set $y_{\rm{true}} = 1$ for defective parts and $y_{\rm{true}} = 0$ for normal parts.
When we have multiple classes, e.g., $3$ classes in this context, we often use a vector $\mathbf{y}_\rm{true}$ instead of a scalar to represent the categories.
In our case, we set:

- $\mathbf{y}_\rm{true} = (1, 0, 0)^{\mathsf{T}}$ corresponds to `negative` tweets.
- $\mathbf{y}_\rm{true} = (0, 1, 0)^{\mathsf{T}}$ corresponds to `neutral` tweets.
- $\mathbf{y}_\rm{true} = (0, 0, 1)^{\mathsf{T}}$ corresponds to `positive` tweets.

for a single sample. This method is called **one-hot encoding**, where we assign a column to each class, and all elements are $0$ except one, which is set to $1$.
For convenience, let's use `OneHotEncoder()` from scikit-learn to obtain $\mathbf{y}_\rm{true}$ for all samples in one go.
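
Before relying on the library, it may help to see the encoding written by hand. The sketch below uses made-up labels and assumes the classes are ordered alphabetically, as in the list above; `categories`, `labels` and `Y_onehot` are names invented for this illustration.

```python
import numpy

categories = ["negative", "neutral", "positive"]            # assumed class order
labels = ["neutral", "negative", "positive", "negative"]    # made-up samples

# Map each label to its class index, then pick the matching row of the identity matrix
class_index = numpy.array([categories.index(label) for label in labels])
Y_onehot = numpy.eye(len(categories))[class_index]

print(Y_onehot)
```

Each row contains a single $1$, and `numpy.argmax(Y_onehot, axis=1)` recovers the class indices.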

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()   # check encoder.categories_
Y_train_true = encoder.fit_transform(train[['airline_sentiment']]).toarray()
Y_val_true = encoder.transform(val[['airline_sentiment']]).toarray()
Y_test_true = encoder.transform(test[['airline_sentiment']]).toarray()

num_classes = Y_train_true.shape[1]

print(f"{Y_train_true.shape = }")
print(f"{Y_val_true.shape = }")
print(f"{Y_test_true.shape = }")

Notice that we use the uppercase `Y` in the variable name, since it is now a matrix $Y_\rm{true}$.
Each row is the one-hot vector for a single sample.

Let's check these values for a few samples.

In [None]:
for i in range(5):
    print(Y_train_true[i], train['airline_sentiment'].iloc[i])

As you can see, preprocessing is nontrivial and, in real applications, just as important as training.
Let's move to the model fitting part of this lesson.

## Multinomial logistic regression model

Recall that the multiple logistic regression model is a combination of a linear prediction and the logistic function:

$$
z = \mathbf{x} \cdot \mathbf{w} + b =  \mathbf{x}^\mathsf{T} \mathbf{w} + b \\
\hat{y} = \operatorname{logistic}(z) = \frac{1}{1+e^{-z}}
$$

where $\mathbf{x}$ is the feature vector of a single sample.
The logistic function takes a scalar $z$ and outputs a value between $0$ and $1$. In the context of binary classification problems, $\hat{y}$ is interpreted as the probability of a sample $\mathbf{x}$ being class $1$; and naturally, the probability of the sample being class $0$ is $1-\hat{y}$ since the probabilities should sum up to $1$.

However, when the dependent variable has multiple class labels, e.g. $m$ labels, intuitively we need $m$ probabilities to describe our prediction.
Therefore, we need $m$ separate linear predictions (one for each label) and our linear prediction $z$ will become an $m$-element vector $\mathbf{z}$.

Suppose each sample has $d$ features. The linear predictions $\mathbf{z} = (z_1, z_2, \cdots, z_m)^\mathsf{T}$ of a single sample $\mathbf{x}$ can be written as:

$$
\begin{aligned}
z_1 &= b_1 + w_{1,1} x_1 + w_{2,1} x_2 + \cdots + w_{d,1} x_{d} \\
z_2 &= b_2 + w_{1,2} x_1 + w_{2,2} x_2 + \cdots + w_{d,2} x_{d} \\
\vdots & \\
z_m &= b_m + w_{1,m} x_1 + w_{2,m} x_2 + \cdots + w_{d,m} x_{d} \\
\end{aligned}
$$

We also have $m$ different intercepts here, as if we are doing $m$ separate linear regressions.
Each weight now has two subscripts: the first one corresponds to the feature, the second corresponds to the output label.
As you can see, the weights form a $d\times m$ matrix when you have multiclass labels.
We can write this in vector form as:

$$
\underset{m\times 1}{\mathbf{z}} = \underset{m \times d}{W^{\mathsf{T}}} \  \underset{d\times 1}{\mathbf{x}} + \underset{m\times 1}{\mathbf{b}}
$$

where the intercept term is now a vector $\mathbf{b} = (b_1, b_2, \cdots, b_m)^\mathsf{T}$.

In logistic regression, we then use the logistic function to map $z$ into a probability $\hat{y}$.
Similarly, for multiclass problems, we need to find a function that maps vector $\mathbf{z}$ into a discrete probability distribution $\hat{\mathbf{y}}$ over $m$ outcomes (labels).


The softmax function serves exactly this purpose:

$$
\hat{\mathbf{y}} = \operatorname{softmax}(\mathbf{z})
$$
where
$$
\hat{y}_i = \operatorname{softmax}(\mathbf{z})_{i}=\frac{e^{z_{i}}}{\sum_{j=1}^{m} e^{z_{j}}} \quad \text { for } i=1, \ldots, m .
$$

It basically computes the exponential of each linear prediction $z_i$ and normalizes it with the sum.
The elements of $\hat{\mathbf{y}}$ are from $0$ to $1$ and they also sum up to $1$.
Hence $\hat{\mathbf{y}}$ can be interpreted as a probability distribution over $m$ class labels.
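
As a quick numerical illustration, the sketch below applies softmax to a single made-up vector $\mathbf{z}$. Subtracting the maximum before exponentiating is a common stability trick that is not part of the formula above; it leaves the result unchanged because the shift cancels in the ratio. The name `softmax_vector` is an assumption for this example.

```python
import numpy

def softmax_vector(z):
    """Softmax of a 1D array, shifted by max(z) to avoid overflow in exp."""
    shifted = z - numpy.max(z)        # the shift cancels out after normalization
    exp_z = numpy.exp(shifted)
    return exp_z / numpy.sum(exp_z)

z = numpy.array([2.0, 1.0, 0.1])      # made-up linear predictions for m = 3 classes
y_hat = softmax_vector(z)

print(y_hat)
print(numpy.sum(y_hat))
```

The largest $z_i$ receives the largest probability, and adding the same constant to every element of $\mathbf{z}$ does not change $\hat{\mathbf{y}}$.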

### Matrix form

Let's derive the matrix form that can handle $N$ samples.
The formulation above works for a single sample $\mathbf{x}$,
i.e., $\mathbf{z}^{(i)} = W^{\mathsf{T}} \mathbf{x}^{(i)} + \mathbf{b}$ for the $i$-th sample (remember that we use superscript with parentheses to denote sample index).
By stacking these $\mathbf{z}^{(i)}$ as columns on the left-hand side (LHS) and combining the matrix-vector products into a single matrix-matrix product on the right-hand side (RHS), we get:

$$
\left[\begin{array}{cccc}
    \mid & \mid & & \mid \\
    \mathbf{z}^{(1)} & \mathbf{z}^{(2)} & \cdots & \mathbf{z}^{(N)} \\
    \mid & \mid & & \mid
\end{array}\right]
= W^{\mathsf{T}}
\left[\begin{array}{cccc}
    \mid & \mid & & \mid \\
    \mathbf{x}^{(1)} & \mathbf{x}^{(2)} & \cdots & \mathbf{x}^{(N)} \\
    \mid & \mid & & \mid
\end{array}\right]
+
\left[\begin{array}{cccc}
    \mid & \mid & & \mid \\
    \mathbf{b} & \mathbf{b} & \cdots & \mathbf{b} \\
    \mid & \mid & & \mid
\end{array}\right]
$$

However, in our feature array $X$, each sample occupies a row:

$$
X = 
\left[\begin{array}{ccc}
    - & {\mathbf{x}^{(1)}}^{\mathsf{T}} & - \\
    - & {\mathbf{x}^{(2)}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\mathbf{x}^{(N)}}^{\mathsf{T}} & -
\end{array}\right].
$$

To keep the convention of using rows to represent samples, we transpose both sides of the matrix form above:

$$
\left[\begin{array}{ccc}
    - & {\mathbf{z}^{(1)}}^{\mathsf{T}} & - \\
    - & {\mathbf{z}^{(2)}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\mathbf{z}^{(N)}}^{\mathsf{T}} & -
\end{array}\right]
= 
\left[\begin{array}{ccc}
    - & {\mathbf{x}^{(1)}}^{\mathsf{T}} & - \\
    - & {\mathbf{x}^{(2)}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\mathbf{x}^{(N)}}^{\mathsf{T}} & -
\end{array}\right]
W
+
\left[\begin{array}{ccc}
    - & {\mathbf{b}}^{\mathsf{T}} & - \\
    - & {\mathbf{b}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\mathbf{b}}^{\mathsf{T}} & -
\end{array}\right]
$$

If we define the LHS as matrix $Z$ and the intercept matrix as $B$, the matrix form can be simplified as:

$$
\underset{N\times m}{Z} = \underset{N\times d}{X} \  \underset{d\times m}{W} + \underset{N\times m}{B}
$$

Finally, we apply the softmax function to each row of $Z$ to get the prediction matrix $\hat{Y}$:

$$
\underset{N\times m}{\hat{Y}} = 
\left[\begin{array}{ccc}
    - & {\hat{\mathbf{y}}^{(1)}}^{\mathsf{T}} & - \\
    - & {\hat{\mathbf{y}}^{(2)}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\hat{\mathbf{y}}^{(N)}}^{\mathsf{T}} & -
\end{array}\right].
$$

The elements of each row of $\hat{Y}$ represent the probability distribution over $m$ labels for a certain sample, and they should sum up to $1$.

Let's code the model function together in the cell below.

In [None]:
def softmax_regression(X, params):
    """ softmax regression model """
    W, b = params[0], params[1]      
    Z = numpy.exp(X@W + b)  # no need to form B thanks to numpy broadcasting
    Z_sum = numpy.sum(Z, axis=1, keepdims=True)
    Y_pred = Z / Z_sum
    return Y_pred

Notice that we just add `X@W` and `b` (instead of forming the matrix $B$), even though they have different shapes: `(num_samples, num_classes)` and `(num_classes,)`, respectively.
This works because numpy operators can broadcast the smaller array to match the shape of the bigger one (if the shapes are compatible).
We suggest reading [this documentation](https://numpy.org/devdocs/user/basics.broadcasting.html) from numpy for a better understanding.
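
The small check below, using made-up shapes and random values, confirms that broadcasting `b` gives the same result as explicitly building the matrix $B$ by repeating $\mathbf{b}^\mathsf{T}$ on every row. All names ending in `_demo` are assumptions for this sketch.

```python
import numpy

N, d, m = 4, 5, 3                                  # made-up sizes: samples, features, classes
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(N, d))
W_demo = rng.normal(size=(d, m))
b_demo = numpy.array([0.1, 0.2, 0.3])              # shape (m,)

Z_broadcast = X_demo @ W_demo + b_demo             # b is broadcast across the N rows

B_demo = numpy.tile(b_demo, (N, 1))                # explicit (N, m) intercept matrix
Z_explicit = X_demo @ W_demo + B_demo

print(Z_broadcast.shape, B_demo.shape)
print(numpy.allclose(Z_broadcast, Z_explicit))
```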

### Decision boundary

Now that we have the probability distribution over multiple classes for a sample, we can pick the class with the highest probability as our final prediction, simply as:

$$
\text{label} = \underset{j}{\operatorname{argmax}} \hat{y}_{j}
$$


In [None]:
def classify(X, params):
    """
    X: 2D array
    params:
    
    return
    labels: 1D array
    """
    Y_pred = softmax_regression(X, params)
    labels = numpy.argmax(Y_pred, axis=1)
    return labels

Let's generate the vector of true labels, which will be used to measure the performance of our model later.

In [None]:
true_labels_train = numpy.argmax(Y_train_true, axis=1)
true_labels_val = numpy.argmax(Y_val_true, axis=1)
true_labels_test = numpy.argmax(Y_test_true, axis=1)

### Loss function

The log loss that we used for logistic regression without the regularization term is:

$$
\mathrm{loss} = - \sum_{i=1}^{N} \left[ y_{\text{true}}^{(i)}\log\left(\hat{y}^{(i)}\right) + \left(1-y_{\text{true}}^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right) \right],
$$

and we interpret $\hat{y}^{(i)}$ as the probability of the $i$-th sample being class $1$ based on our prediction, and $1 - \hat{y}^{(i)}$ as the probability of being class $0$.
If we break the sum and look at the loss of an individual sample, we get:

$$
\mathrm{one\ sample\ loss} = \left\{
\begin{aligned}
-\log \left( 1-\hat{y}^{(i)} \right) & \quad \text { if } y_\rm{true}^{(i)}=0 \\
-\log \left( \hat{y}^{(i)}   \right) & \quad \text { if } y_\rm{true}^{(i)}=1
\end{aligned}
\right.
$$

As you can see, the individual loss is just the negative logarithm of the probability of predicting the true label.
Let's apply the same philosophy to multiclass problems, and keep in mind $\hat{\mathbf{y}}^{(i)}$ is now a probability distribution over outcomes:

$$
\mathrm{one\ sample\ loss} = \left\{
\begin{aligned}
-\log \left( \hat{y}^{(i)}_1   \right) & \quad \text { if } \mathbf{y}_\rm{true}^{(i)}=(1,0,0)^\mathsf{T} \\
-\log \left( \hat{y}^{(i)}_2 \right) & \quad \text { if }  \mathbf{y}_\rm{true}^{(i)}=(0,1,0)^\mathsf{T} \\
-\log \left( \hat{y}^{(i)}_3 \right) & \quad \text { if }  \mathbf{y}_\rm{true}^{(i)}=(0,0,1)^\mathsf{T}
\end{aligned}
\right.
$$

Because $\mathbf{y}_\rm{true}^{(i)}$ is a one-hot vector, we can simplify the loss as:

$$
\mathrm{one\ sample\ loss} = -\mathbf{y}_\rm{true}^{(i)} \cdot \log \left( \hat{\mathbf{y}}^{(i)} \right)
$$

Therefore, the log loss generalized for multiclass problems is:

$$
\mathrm{loss} = - \sum_{i=1}^{N} \mathbf{y}_\rm{true}^{(i)} \cdot \log \left( \hat{\mathbf{y}}^{(i)} \right)
$$

Because the dot product is just the sum of the element-wise products, and our $\mathbf{y}_\rm{true}^{(i)}$ and $\hat{\mathbf{y}}^{(i)}$ are on the rows of ${Y_\rm{true}}$ and ${\hat{Y}}$:

$$
\underset{N\times m}{Y_\rm{true}} = 
\left[\begin{array}{ccc}
    - & {\mathbf{y}^{(1)}_\rm{true}}^{\mathsf{T}} & - \\
    - & {\mathbf{y}^{(2)}_\rm{true}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\mathbf{y}^{(N)}_\rm{true}}^{\mathsf{T}} & -
\end{array}\right]
\quad
\text{and}
\quad
\underset{N\times m}{\hat{Y}} = 
\left[\begin{array}{ccc}
    - & {\hat{\mathbf{y}}^{(1)}}^{\mathsf{T}} & - \\
    - & {\hat{\mathbf{y}}^{(2)}}^{\mathsf{T}} & - \\
    & \vdots & \\
    - & {\hat{\mathbf{y}}^{(N)}}^{\mathsf{T}} & -
\end{array}\right]\ ,
$$

we can rewrite the loss in matrix form as:

$$
\mathrm{loss} = -\sum_{i=1}^N \sum_{j=1}^m \left( Y_{\rm{true}} \circ \log \left( \hat{Y} \right) \right)_{ij} \ ,
$$

where $\circ$ denotes the element-wise product between two matrices of the same shape.
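
A tiny numerical check can make this concrete. With made-up values for two samples and three classes, summing all entries of the element-wise product of $Y_\rm{true}$ and $\log \hat{Y}$ gives the same number as adding up the negative log-probability of each sample's true class. The `_demo` arrays are invented for this sketch.

```python
import numpy

# Made-up one-hot labels and predicted probabilities (2 samples, 3 classes)
Y_true_demo = numpy.array([[1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0]])
Y_hat_demo = numpy.array([[0.7, 0.2, 0.1],
                          [0.1, 0.3, 0.6]])

# Matrix form: sum over all entries of the element-wise product
loss_matrix_form = -numpy.sum(Y_true_demo * numpy.log(Y_hat_demo))

# Per-sample form: pick the predicted probability of the true class
true_class = numpy.argmax(Y_true_demo, axis=1)
prob_of_true = Y_hat_demo[numpy.arange(len(true_class)), true_class]
loss_per_sample = -numpy.log(prob_of_true)

print(loss_matrix_form, numpy.sum(loss_per_sample))
```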

Let's implement the loss function below. In the code, we average the log loss over the samples and add a regularization term that includes all weights, averaged over the number of weights.

In [None]:
def model_loss(X, Y_true, params, _lambda):
    
    Y_pred = softmax_regression(X, params)
    W = params[0]
    loss = - numpy.sum(Y_true * numpy.log(Y_pred+1e-15)) / X.shape[0] \
           + _lambda * numpy.sum(W*W) / (W.shape[0] * W.shape[1])
    return loss

### Initialization

Let's initialize the parameters.

In [None]:
# get the gradients function
gradients = grad(model_loss, argnum=2)

# initialize the parameters
numpy.random.seed(10000)
W = numpy.random.normal(scale=1e-4, size=(num_features, num_classes))  # random initial guess from normal distribution
b = numpy.zeros(num_classes)


We can use the same metrics to evaluate the performance as in lesson 5. However, since our data have multiple classes, we can compute precision, recall and F-score for each label.
Again for brevity, let's use the function [`precision_recall_fscore_support()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html) from scikit-learn to check the initial performance.
It takes the true labels as the first argument and the predicted labels as the second.

In [None]:
pred_labels_test = classify(X_test, (W, b))

from sklearn.metrics import precision_recall_fscore_support

precision, recall, fscore, support = precision_recall_fscore_support(true_labels_test, pred_labels_test)
print(f"{precision = }")
print(f"{recall    = }")
print(f"{fscore    = }")
print(f"{support   = }")

The return value `support` gives you the number of occurrences of each class in the true labels.
Let's also compute the weighted average F-score using [`f1_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html),
since we often prefer to use a single number to track the accuracy during training.
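
To clarify what "weighted average" means here, the sketch below computes a per-class F-score by hand and then averages the scores using each class's support as the weight. It is a simplified illustration with made-up labels, not the scikit-learn implementation; `weighted_f1_sketch` is a hypothetical name, and classes with no predictions or no true samples are simply given a score of $0$.

```python
import numpy

def weighted_f1_sketch(true_labels, pred_labels, num_classes):
    """Per-class F1 and its support-weighted average, computed by hand."""
    true_labels = numpy.asarray(true_labels)
    pred_labels = numpy.asarray(pred_labels)
    f1 = numpy.zeros(num_classes)
    support = numpy.zeros(num_classes)
    for c in range(num_classes):
        tp = numpy.sum((pred_labels == c) & (true_labels == c))  # true positives
        fp = numpy.sum((pred_labels == c) & (true_labels != c))  # false positives
        fn = numpy.sum((pred_labels != c) & (true_labels == c))  # false negatives
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        if precision + recall > 0:
            f1[c] = 2 * precision * recall / (precision + recall)
        support[c] = tp + fn          # number of true samples in class c
    weighted = numpy.sum(f1 * support) / numpy.sum(support)
    return weighted, f1, support

# Made-up labels for three classes
true_demo = [0, 0, 0, 1, 1, 2]
pred_demo = [0, 0, 1, 1, 2, 2]
weighted, f1, support = weighted_f1_sketch(true_demo, pred_demo, num_classes=3)
print(f1, support, weighted)
```

Because the weights are the class supports, a frequent class such as `negative` influences the single number more than a rare one.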

In [None]:
from sklearn.metrics import f1_score

f1_score(true_labels_test, pred_labels_test, average='weighted')

### Train our model

- Describe what a learning curve is.
- Explain the discrepancy between training loss and validation loss, and accuracy.
- Save the weights, i.e., our model.
- Demonstrate overfitting in the learning curve (if possible).
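
The training loop below obtains its gradients from `autograd`. As a cross-check, the averaged log loss with the weight-averaged regularization term has a well-known closed-form gradient: $X^\mathsf{T}(\hat{Y} - Y_\rm{true})/N + 2\lambda W/(d\,m)$ for the weights and the column sums of $(\hat{Y} - Y_\rm{true})/N$ for the intercepts. The sketch below verifies this with a finite difference on made-up data. The function names and `_demo` arrays are assumptions for this example, and the small `1e-15` offset inside the logarithm of `model_loss` is ignored.

```python
import numpy

def softmax_rows(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = numpy.exp(Z - numpy.max(Z, axis=1, keepdims=True))
    return E / numpy.sum(E, axis=1, keepdims=True)

def loss_sketch(X, Y_true, W, b, lam):
    """Averaged log loss plus regularization averaged over the number of weights."""
    Y_hat = softmax_rows(X @ W + b)
    return -numpy.sum(Y_true * numpy.log(Y_hat)) / X.shape[0] + lam * numpy.sum(W * W) / W.size

def grads_sketch(X, Y_true, W, b, lam):
    """Closed-form gradients of loss_sketch with respect to W and b."""
    diff = (softmax_rows(X @ W + b) - Y_true) / X.shape[0]
    return X.T @ diff + 2 * lam * W / W.size, numpy.sum(diff, axis=0)

# Made-up data: 6 samples, 4 features, 3 classes
rng = numpy.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
Y_demo = numpy.eye(3)[rng.integers(0, 3, size=6)]
W_demo = rng.normal(scale=0.1, size=(4, 3))
b_demo = numpy.zeros(3)
grad_W, grad_b = grads_sketch(X_demo, Y_demo, W_demo, b_demo, 1.0)

# Central finite difference on a single weight as an independent check
eps = 1e-6
W_plus, W_minus = W_demo.copy(), W_demo.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
fd_W = (loss_sketch(X_demo, Y_demo, W_plus, b_demo, 1.0)
        - loss_sketch(X_demo, Y_demo, W_minus, b_demo, 1.0)) / (2 * eps)
print(fd_W, grad_W[1, 2])
```

A gradient-descent step then subtracts `lr` times these gradients from `W` and `b`, exactly as the loop does with the `autograd` results.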

In [None]:
%%time
from IPython.display import clear_output

lr = 1e-1  # learning rate
_lambda = 1.0  # regularization parameter

# a variable for the change in validation loss
change = numpy.inf

# a counter for optimization iterations
i = 0

# a variable to store the validation loss from the previous iteration
old_loss_val = 1e-15

# accuracy and loss history
loss_train_hist = []
loss_val_hist = []
acc_train_hist = []
acc_val_hist = []

# keep running if:
#   1. we still see significant changes in validation loss
#   2. iteration counter < 1000

while change >= 1e-10 and i < 1000:
    
    # calculate gradients and take a gradient descent step
    grads = gradients(X_train, Y_train_true, (W, b), _lambda)
    W -= (grads[0] * lr)
    b -= (grads[1] * lr)
    
    # compute training & validation loss
    loss_train = model_loss(X_train, Y_train_true, (W, b), _lambda)
    loss_val = model_loss(X_val, Y_val_true, (W, b), _lambda)
    loss_train_hist.append(loss_train)
    loss_val_hist.append(loss_val)
     
    # calculate metrics for training & validation dataset
    pred_labels_train = classify(X_train, (W, b))
    pred_labels_val = classify(X_val, (W, b))
    acc_train = f1_score(true_labels_train, pred_labels_train, average='weighted')
    acc_val = f1_score(true_labels_val, pred_labels_val, average='weighted')
    acc_train_hist.append(acc_train)
    acc_val_hist.append(acc_val)

    # calculate the relative change in validation loss
    change = numpy.abs((loss_val-old_loss_val)/old_loss_val)

    # update the counter and old_loss_val
    i += 1
    old_loss_val = loss_val
    
    # update plot every 10 steps
    if i%10 == 0:
        clear_output(wait=True)
        
        x = numpy.arange(i)
        f, (ax1, ax2) = pyplot.subplots(1, 2, figsize=(12,6))

        ax1.set_yscale('log')
        ax1.plot(x, loss_train_hist, label="training loss")
        ax1.plot(x, loss_val_hist, label="validation loss")
        ax1.set_title('Loss')
        ax1.legend()

        ax2.plot(x, acc_train_hist, label="training accuracy")
        ax2.plot(x, acc_val_hist, label="validation accuracy")
        ax2.set_title('Accuracy')
        ax2.legend()

        pyplot.show();

We can measure the final accuracy using the test set.

In [None]:
# final accuracy
pred_labels_test = classify(X_test, (W, b))
precision, recall, fscore, support = precision_recall_fscore_support(true_labels_test, pred_labels_test)
fscore_weighted = f1_score(true_labels_test, pred_labels_test, average='weighted')

print(f"{precision = }")
print(f"{recall    = }")
print(f"{fscore    = }")
print(f"{support   = }")
print(f"{fscore_weighted = }")

Show the predicted sentiment for a few test samples.

In [None]:
test['pred_sentiment'] = encoder.categories_[0][pred_labels_test]  # map class indices to sentiment names

In [None]:
test.drop(columns=['clean_tweet']).iloc[::200]

You can also test the model with new tweets!

In [None]:
new_tweets = ['had a terrible experience',
              'thank you A airlines',
              'I felt sick during the flight',
              'flight was delayed again',
              'love flying with A airlines']

new_clean_tweets = list(map(clean_tweet, new_tweets))
X_new = vectorizer.transform(new_clean_tweets).astype('float').toarray()
labels_new = encoder.categories_[0][classify(X_new, (W, b))]  # map class indices to sentiment names

for tweet, label in zip(new_tweets, labels_new):
    print(tweet, ':' ,label)

## What we've learned

- Text data preprocessing
- Softmax regression
- One-hot encoding
- Learning curve

## References

1. Crowdflower, 2019. Kaggle. https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [None]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.core.display import HTML
css_file = '../style/custom.css'
HTML(open(css_file, "r").read())