# Quality measure

Chatbot quality can be measured at several levels. In this section we start with the most backend-oriented measures, which relate strictly to the machine learning models, with a focus on neural networks. In the second section we explain how to measure the performance of a chatbot, not only of the chatbot itself but also of the whole infrastructure. In the last section we show how to measure quality based on the chatbot's output: we check the grammar and spelling of the output, and we also test its sentiment, which is crucial in many cases.

## Model performance

Model performance can be measured in different ways. One such measure is the value of the loss function, which can be computed using different methods.


### Loss function

It is a method of evaluating how well your algorithm models your dataset. If your predictions are totally off, your loss function will output a higher number. If they’re pretty good, it’ll output a lower one.

Examples:

**Mean Squared Error** (MSE) - the workhorse of basic loss functions: it’s easy to understand and implement, and generally works pretty well. To calculate MSE, you take the difference between your predictions and the ground truth, square it, and average it out across the whole dataset. Often used in regression.

$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i}^{true} - y_{i}^{pred})^2$
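
The MSE formula can be checked with a few lines of NumPy. This is a minimal sketch; the function name `mean_squared_error` and the sample values are illustrative assumptions, not part of the original material.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Illustrative values (assumed): one set of close predictions, one far off
y_true = [3.0, -0.5, 2.0, 7.0]
good_pred = [2.5, 0.0, 2.0, 8.0]
bad_pred = [0.0, 3.0, -2.0, 1.0]

print(mean_squared_error(y_true, good_pred))  # 0.375
print(mean_squared_error(y_true, bad_pred))   # 18.3125
```

The predictions that are far from the ground truth produce the higher loss, as described above.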

**Cross Entropy** (log loss) - measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label.

$H = -\frac{1}{n} \sum_{i=1}^{n} [y_{i}^{true} \log(y_{i}^{pred}) + (1 - y_{i}^{true}) \log(1 - y_{i}^{pred})]$
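
The binary cross-entropy formula above can be sketched in NumPy as follows. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; the function name and sample values are illustrative.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (log loss) averaged over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predicted probabilities so that log(0) never occurs (assumption)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Illustrative labels and predicted probabilities (assumed values)
y_true = [1, 0, 1, 1]
close_pred = [0.9, 0.1, 0.8, 0.95]
far_pred = [0.1, 0.9, 0.2, 0.05]

loss_close = binary_cross_entropy(y_true, close_pred)
loss_far = binary_cross_entropy(y_true, far_pred)
print(loss_close, loss_far)
```

The loss grows as the predicted probabilities diverge from the actual labels, so `loss_far` is larger than `loss_close`.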

### Gradient based optimization

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.


Vanilla gradient descent (batch gradient descent) computes the gradient of the cost function with respect to the parameters for the entire training dataset:

$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$

where:

- $\theta$ - parameters
- $\eta$ - learning rate
- $J$ - cost function
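
As a rough illustration of the update rule, the sketch below runs batch gradient descent on a small linear regression problem with MSE as the cost function $J$. The synthetic data, the learning rate and the number of steps are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 2 * x + 1 plus a little noise
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=100)
X = np.column_stack([x, np.ones_like(x)])  # second column acts as the bias term

theta = np.zeros(2)  # parameters [slope, intercept]
eta = 0.1            # learning rate

for step in range(500):
    residual = X @ theta - y
    # Gradient of the MSE cost with respect to theta, over the entire dataset
    gradient = (2.0 / len(y)) * (X.T @ residual)
    theta = theta - eta * gradient

print(theta)  # should end up close to [2, 1]
```

Every update uses all 100 examples, which is what distinguishes the batch variant from the stochastic one described next.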


**Stochastic Gradient Descent** (SGD)

Stochastic gradient descent in contrast performs a parameter update for each training example:

$\theta = \theta - \eta \cdot \nabla_\theta J( \theta, x_i, y_i)$

where:

- $x_i$ - example
- $y_i$ - label

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.

![Stochastic Gradient Descent](images/sgd.png)

**Mini-batch gradient descent**

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of N training examples:

$\theta = \theta - \eta \cdot \nabla_\theta J( \theta, x_{(i:i + N)}, y_{(i:i + N)})$


The mini-batch approach reduces the variance of the parameter updates (giving more stable convergence) and makes use of highly optimized matrix operations. It is typically the algorithm of choice when training a neural network, and the term **SGD** is usually also used when mini-batches are employed.
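
The three variants differ only in how many examples feed each update, which the following sketch tries to show on a linear regression problem. The helper `minibatch_gd`, the synthetic data and the hyperparameters are assumptions made for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, eta=0.1, epochs=200, seed=0):
    """Gradient descent on the MSE cost using mini-batches of `batch_size` examples."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            gradient = (2.0 / len(idx)) * (X[idx].T @ residual)
            theta = theta - eta * gradient
    return theta

# Synthetic data (assumed): y = 2 * x + 1 plus a little noise
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x, np.ones_like(x)])

theta_sgd = minibatch_gd(X, y, batch_size=1)         # one example per update (SGD)
theta_mini = minibatch_gd(X, y, batch_size=16)       # mini-batch
theta_batch = minibatch_gd(X, y, batch_size=len(y))  # whole dataset per update (batch)

print(theta_sgd, theta_mini, theta_batch)
```

With `batch_size=1` the loop reduces to stochastic gradient descent, and with `batch_size=len(y)` it reduces to batch gradient descent.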

**SGD optimizations**

There are many modifications of the standard SGD method that improve robustness, reduce oscillation and speed up convergence.

**Other**

There are plenty of optimizers like **Momentum**, **Nesterov Accelerated Gradient** (NAG), **Adagrad**, **Adadelta**, **RMSprop**, or **Adam**. Take a look at a comparison.

| Gradient based method visualization | |
|:---------------------------:|:-----------------------:|
| ![SGD variants summary](images/gradsummary.gif) | ![SGD variants summary](images/gradsaddlesummary.gif) |

Let's take a look at how to use the SGD optimizer with Keras. It can be done in a similar way in TensorFlow.

In [None]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import SGD
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize

features, labels = load_iris(return_X_y=True)
features = normalize(features)
labels = to_categorical(labels)

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.15, shuffle=True)

model = Sequential()
model.add(Dense(units=10, activation='relu', input_dim=x_train.shape[1]))
model.add(Dense(units=7, activation='relu'))
model.add(Dense(units=3, activation='softmax'))

optimizer = SGD(learning_rate=0.05, momentum=0.7)  # the `lr` argument was renamed to `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=150, batch_size=16, verbose=0)
loss, accuracy = model.evaluate(x_test, y_test)

print(f'accuracy on test set: {accuracy * 100}%')

We can apply different optimizers or loss functions to the same or to different classification problems; the choice does not depend on the neural network architecture. A more complex example using the MNIST dataset is shown below.

In [None]:
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPool2D
from keras.optimizers import Adam
from keras.utils import to_categorical  # keras.utils.np_utils is not available in recent Keras releases


FEATURES_SHAPE = (-1, 28, 28, 1)
MAX_FEATURE = 255

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = (x_train / MAX_FEATURE).astype(np.float16).reshape(FEATURES_SHAPE)
y_train = to_categorical(y_train)
x_test = (x_test / MAX_FEATURE).astype(np.float16).reshape(FEATURES_SHAPE)
y_test = to_categorical(y_test)

model = Sequential()
model.add(Conv2D(filters=6, kernel_size=5, activation='relu', input_shape=FEATURES_SHAPE[1:]))
model.add(MaxPool2D())
model.add(Conv2D(filters=16, kernel_size=3, activation='relu'))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dense(units=120, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=84, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=10, activation='softmax'))

optimizer = Adam(learning_rate=0.005)  # the `lr` argument was renamed to `learning_rate`
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

model.fit(x_train, y_train, epochs=10, batch_size=128, verbose=1)
loss, accuracy = model.evaluate(x_test, y_test)

print(f'accuracy on test set: {accuracy * 100}%')

#### Exercise 1. Run the above example with different optimizers and compare results

Compare Adam, SGD, RMSprop and Adagrad. Use 5 epochs and set the learning rate to 0.005.


In [None]:
from keras.optimizers import Adam, SGD, RMSprop, Adagrad

# your code below



## Model quality measures

Quality measures are more about the input and output of the model. Depending on how we divide the dataset, we can measure the quality of our model in different ways. The output can also be measured with several metrics, of which accuracy is the most popular.

### Dataset preparation

A common problem that every data scientist faces is how to divide a data set into training and testing data sets. To understand the following equations we need to introduce some notation. Let $\mathcal{L}_{n}$ be our training data set of size $n$, $T_{m}$ our testing data set of size $m$, $M_{e}$ the number of misclassified cases, $\mathcal{I}$ an indicator function that returns 1 if the condition in its argument holds (here: the predicted value differs from the label) and 0 otherwise, and $e(d)$ the error rate of classifier $d$. We also use the sets $X$ and $Y$ that we have already explained. The error rate can be written as follows:
\begin{equation}
e(d)=\frac{M_{e}}{m}.
\end{equation}
It is the complement of accuracy, which is described later in this section: $e(d)=1-ACC$. The error rate can be calculated differently depending on which method of data set preparation is used. There are a few commonly used approaches to handling the training, testing and validation data sets:

- resubstitution -- R-method,
- hold-out -- H-method,
- cross-validation -- $\pi$-method,
- bootstrap,
- jackknife.

The first method is a very simple one: we use the same data set for training and testing. It is not the best solution if we want a solid classifier, because the model is evaluated on data it has already seen. The error rate can be written as follows:
\begin{equation}
e_{R}(d)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{I}(d(X_{j};\mathcal{L}_{n})\neq Y_{j}).
\end{equation}
It means that we check the prediction for each element $j$ of our training data set and add 1 for each misclassified case. We then divide the sum by $n$, the number of elements in the training data set.

The second method divides the data set into two data sets, either in half or in other proportions. One set is our training data set and the second is our testing data set. We can swap those sets and calculate the average of both results. The error rate can be calculated as follows:
\begin{equation}
e_{\tau}(\hat{d})=\frac{1}{m}\sum_{j=1}^{m}\mathcal{I}(\hat{d}(X_{j}^{t};\mathcal{L}_{n})\neq Y_{j}^{t}).
\end{equation}
Compared to the resubstitution method, it evaluates the classifier on the testing data set only.

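
The difference between the resubstitution and hold-out error rates can be sketched with scikit-learn. The iris data set, the 1-nearest-neighbour classifier and the 50/50 split are assumptions chosen only to make the contrast visible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Resubstitution (R-method): train and test on the same data
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
e_resub = (clf.predict(X) != y).mean()

# Hold-out (H-method): train on one half, test on the other half
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
e_holdout = (clf.predict(X_test) != y_test).mean()

print(f"resubstitution error rate: {e_resub:.3f}")
print(f"hold-out error rate:       {e_holdout:.3f}")
```

Because the resubstitution estimate is computed on data the classifier has already seen, it tends to be optimistic compared to the hold-out estimate.
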
Cross-validation is the most common approach. It is also called the rotation method. We divide the data set into $k$ subsets, with the elements of each subset chosen randomly. One of those subsets is taken as the testing set, while the other subsets are merged into a training set. This is repeated $k$ times, so that each of the $k$ subsets serves as the testing set once. The error rate can be calculated as follows:
\begin{equation}
e_{CV}(d)=\frac{1}{n}\sum_{j=1}^{n}\mathcal{I}(\hat{d}(X_{j};\mathcal{L}_{n}^{(-j)})\neq Y_{j}).
\end{equation}
A special case is when $k=n$, which means that each subset consists of just one element. This approach is known as leave-one-out or the U-method.

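
The rotation over $k$ subsets can be written out explicitly with scikit-learn's `KFold`. This is a sketch under assumptions: the iris data set, a 3-nearest-neighbour classifier and $k=5$ are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)  # random assignment to subsets

errors = 0
for train_idx, test_idx in folds.split(X):
    # Train on the merged k-1 subsets, test on the remaining one
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    errors += int(np.sum(clf.predict(X[test_idx]) != y[test_idx]))

e_cv = errors / len(y)  # every element is tested exactly once
print(f"{k}-fold cross-validation error rate: {e_cv:.3f}")
```

Setting `n_splits=len(y)` turns the same loop into the leave-one-out variant, at the cost of training one classifier per element.
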
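The bootstrap method can be considered an extension of resubstitution. The goal is to generate multiple sets from the main set by random selection with replacement. We train on each generated set $\mathcal{L}_{n}^{\star b}$, $b=1,\dots,B$, count the errors on the observations $Z_{j}=(X_{j},Y_{j})$ that were not drawn into it, and calculate an average error at the end: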
\begin{equation}
e_{B}(d)=\frac{1}{B}\sum_{b=1}^{B}\frac{\sum_{j=1}^{n}\mathcal{I}(Z_{j}\notin\mathcal{L}_{n}^{\star b})\mathcal{I}(d(X_{j};\mathcal{L}^{\star b}_{n})\neq Y_{j})}{\sum_{j=1}^{n}\mathcal{I}(Z_{j}\notin\mathcal{L}^{\star b}_{n})}.
\end{equation}
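
The bootstrap formula can be followed term by term in a short sketch. The iris data set, the 3-nearest-neighbour classifier and $B=50$ are assumptions made for illustration; the inner ratio is computed only over the observations that were not drawn into the $b$-th sample, as in the equation above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n = len(y)
B = 50  # number of bootstrap samples (assumed)
rng = np.random.default_rng(0)

per_sample_errors = []
for b in range(B):
    drawn = rng.integers(0, n, size=n)            # n indices drawn with replacement
    left_out = np.setdiff1d(np.arange(n), drawn)  # observations not in the b-th sample
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[drawn], y[drawn])
    wrong = clf.predict(X[left_out]) != y[left_out]
    per_sample_errors.append(wrong.mean())        # inner ratio of the formula

e_boot = float(np.mean(per_sample_errors))        # average over the B samples
print(f"bootstrap error rate: {e_boot:.3f}")
```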

### Output quality metrics

There are several metrics to show the quality of our classification model:

- ROC that stands for Receiver Operating Characteristic curve,
- AUC -- Area Under Curve,
- $F_{1}$ score,
- Precision,
- Recall.

To calculate the metrics we need the confusion matrix:

|                      |condition positive |condition negative |
|----------------------|-------------------|-------------------|
|**predicted positive**|True Positive (TP) |False Positive (FP)|         
|**predicted negative**|False Negative (FN)|True Negative (TN) |

The most common metric is accuracy. It can be calculated as follows:
\begin{equation}
ACC=\frac{\#TP+\#TN}{\#TP+\#TN+\#FP+\#FN}.
\end{equation}
The next metric is called True Positive Rate (TPR). It can be calculated as follows:
\begin{equation}
TPR=\frac{\#TP}{\#TP+\#FN}.
\end{equation}
TPR is also called sensitivity or recall; it measures the share of actual positive cases that are predicted correctly. By $\#TP$ and $\#FP$ we mean the numbers of True Positive and False Positive cases. Its counterpart for negative cases is specificity, also called TNR, which stands for True Negative Rate. It can be calculated as follows:
\begin{equation}
TNR=\frac{\#TN}{\#TN+\#FP}.
\end{equation}
It measures how good we are at predicting the negative scenario. Another important metric is precision, also known as Positive Predictive Value (PPV):
\begin{equation}
PPV=\frac{\#TP}{\#TP+\#FP}.
\end{equation}
It is the ratio of correctly predicted positive cases to all cases predicted as positive, including those that are actually negative. Its counterpart for negative predictions is the Negative Predictive Value:
\begin{equation}
NPV=\frac{\#TN}{\#TN+\#FN}.
\end{equation}
We can also calculate the False Positive Rate, known as fall-out. It is the share of actual negative cases that are wrongly predicted as positive:
\begin{equation}
FPR=1-TNR.
\end{equation}
The counterpart of FPR is the False Negative Rate, the share of actual positive cases that are wrongly predicted as negative:
\begin{equation}
FNR=1-TPR.
\end{equation}
Another popular metric is the $F_{1}$ score, the harmonic mean of PPV and TPR:
\begin{equation}
F_{1}=2\frac{PPV\cdot TPR}{TPR+PPV}.
\end{equation}
The $F_{1}$ value, like all previous metrics, lies between 0 and 1; for $F_{1}$, 1 is the best value.
An interesting measure is the Matthews Correlation Coefficient (MCC), which describes the correlation between observed and predicted values. The value of MCC is between -1 and 1. A perfect classifier gives MCC=1, a random classifier gives MCC=0, and a classifier that is always wrong gives MCC=-1. This measure can be calculated as follows:
\begin{equation}
MCC=\frac{\#TP\cdot\#TN-\#FP\cdot\#FN}{\sqrt{(\#TP+\#FP)(\#TP+\#FN)(\#TN+\#FP)(\#TN+\#FN)}}.
\end{equation}
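
The count-based metrics can be traced on a small hand-made example. The labels and predictions below are hypothetical; the sketch covers only the rates derived directly from the confusion matrix and leaves $F_{1}$ and MCC out.

```python
# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Fill the four cells of the confusion matrix
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

acc = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)  # sensitivity / recall
tnr = tn / (tn + fp)  # specificity
ppv = tp / (tp + fp)  # precision
npv = tn / (tn + fn)
fpr = 1 - tnr         # fall-out
fnr = 1 - tpr

print(tp, fn, fp, tn)  # 3 1 1 5
print(acc, tpr, ppv)   # 0.8 0.75 0.75
```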

Let's generate a test and train dataset to measure the metrics.

In [None]:
from sklearn.datasets import make_blobs  # the samples_generator module was removed from scikit-learn
from sklearn.preprocessing import MinMaxScaler

# build the train dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1)
scalar = MinMaxScaler()
scalar.fit(X)
X = scalar.transform(X)

# build the test dataset
Xtest, ytest = make_blobs(n_samples=50, centers=2, n_features=2, random_state=1)
Xtest = scalar.transform(Xtest)

In [None]:
from keras.models import Sequential
from keras.layers import Dense

# build a simple neural network with three layers
model = Sequential()
model.add(Dense(4, input_dim=2, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile it with adam optimizer
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X, y, epochs=10, verbose=0)

ypred = (model.predict(Xtest) > 0.5).astype('int32')  # predict_classes was removed from recent Keras releases

In [None]:
print(len(ypred))

#### Exercise 2. Implement the f1 score and MCC score


In [None]:
def calculate_quality_metrics(y,ypredicted):
    tn = 0
    tp = 0
    fn = 0
    fp = 0
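    for i in range(len(y)):
        if y[i] > 0:
            # actual positive: a correct prediction is a TP, a miss is an FN
            if y[i] == ypredicted[i]:
                tp = tp + 1
            else:
                fn = fn + 1
        else:
            # actual negative: a correct prediction is a TN, a miss is an FP
            if y[i] == ypredicted[i]:
                tn = tn + 1
            else:
                fp = fp + 1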
    acc = ((tp + tn) * 1.0) / ((tp + tn + fp + fn) * 1.0)
    tpr = tp * 1.0 / (tp + fn) * 1.0
    tnr = tn * 1.0 / (tn + fp) * 1.0
    ppv = tp / (tp + fp) * 1.0
    npv = tn / (tn + fn) * 1.0
    fpr = 1.0 - tnr
    fnr = 1.0 - tpr
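    f1 = 0.0   # placeholder: implement the F1 score here (Exercise 2)
    mcc = 0.0  # placeholder: implement the MCC score here (Exercise 2)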
    print("Accuracy: "+str(acc))
    print("TPR: "+str(tpr))
    print("TNR: "+str(tnr))
    print("PPV: "+str(ppv))
    print("NPV: "+str(npv))
    print("FPR: "+str(fpr))
    print("FNR: "+str(fnr))
    print("MCC: "+str(mcc))
    return [acc, tpr, tnr, ppv, npv, fpr, fnr, mcc]

## Sentiment analysis

If we want to deploy our chatbot to production, it is very important to measure the sentiment of both the customers and our chatbot. We don't want to send our customers a message with a negative sentiment. Two of the most popular libraries for sentiment analysis are CoreNLP and TextBlob. These libraries are trained on datasets that often differ from our domain, so they do not always give the expected results. This is why we often need to build our own solution. Before we build a new one, we try TextBlob to get the main idea of sentiment analysis.

In [32]:
example = "The weather is good outside."

We just get the sentiment for the example text:

In [33]:
from textblob import TextBlob

text = TextBlob(example)
text.sentiment

Sentiment(polarity=0.35, subjectivity=0.32500000000000007)

A negative polarity means a negative sentiment and a positive polarity means a positive sentiment; polarity ranges from -1 to 1. The subjectivity says whether the sentence is objective or subjective. Its value is between 0 and 1, where 0 is fully objective and 1 is fully subjective.
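
One possible way to act on the polarity is a simple gate that decides whether a reply may be sent. The function `is_safe_to_send` and the threshold of 0.0 are hypothetical choices for illustration, not part of TextBlob.

```python
def is_safe_to_send(polarity, min_polarity=0.0):
    """Return True when a reply's polarity reaches the chosen threshold (assumed 0.0)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity is expected to lie in [-1, 1]")
    return polarity >= min_polarity

# 0.35 is the polarity TextBlob reported above for "The weather is good outside."
print(is_safe_to_send(0.35))   # True
# -0.6 stands for a hypothetical reply with a negative sentiment
print(is_safe_to_send(-0.6))   # False
```

In a chatbot, the polarity argument would come from a call such as `TextBlob(reply).sentiment.polarity`, and the threshold would need tuning on real conversations.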

## Grammar and spelling

There are several tools for checking spelling and grammar. We don't want our chatbot to reply with bad grammar or spelling errors. In Python we can use SpellChecker to check the spelling, pytypo to correct typos and Language-check to check the grammar of a given sentence. We should check grammar and spelling as often as possible.

### Spell checking

Spell checking is one of the basic tools for checking the output of our chatbot. It is not useful in many cases; it matters only for some generative chatbots.

In [9]:
from spellchecker import SpellChecker

spell = SpellChecker()

words = ['sample', 'words', 'heri', 'here']

for word in words:
    print(spell.correction(word))
    print(spell.candidates(word))

sample
{'sample'}
words
{'words'}
her
{'meri', 'heli', 'ceri', 'zeri', 'herri', 'peri', 'deri', 'heir', 'herb', 'hers', 'herd', 'henri', 'hera', 'here', 'hern', 'herr', 'hersi', 'hedi', 'seri', 'eri', 'hari', 'hero', 'her'}
here
{'here'}


### Typos fixing

We can also easily fix some simple typos with pytypo.

In [10]:
import pytypo

pytypo.correct_sentence('this traiining is great!!!')

'this traiining is great!'

### Grammar check

A more complex tool for checking grammar is LanguageTool, which supports more than 25 languages. It is an application written in Java, but Python wrappers are available.

In [None]:
import language_check

tool = language_check.LanguageTool('en-US')

tool.check("the are trainings")