# Experimental Methodology in Natural Language Processing

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

## Objectives

- Understanding 
    - the role and types of evaluation in NLP/ML
    - the lower and upper bounds of performance
    - correct usage of data for experimentation
    - evaluation metrics
    
- Learning how to use `scikit-learn` to perform a text classification experiment
    - provided baselines
    - text vectorization
    - evaluation methods

### Requirements
- [scikit-learn](https://scikit-learn.org/)
    - run `pip install scikit-learn`

## 1. Basic Concepts of Experimental Method

### 1.1. Lower & Upper Bounds of the Performance

#### Lower Bound: Baseline
A trivial solution to the problem:

- _random_: random decision
- _chance_: random decision w.r.t. the distribution of categories in the training data
- _majority_: assign everything to the largest category etc.
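
The three baselines above can be sketched in plain Python. This is only an illustration: the label lists are made up, and `accuracy` is a helper defined here, not a library function.

```python
import random
from collections import Counter

# Toy training and test labels (hypothetical data, for illustration only)
train_labels = ["pos"] * 6 + ["neg"] * 3 + ["neu"] * 1
test_labels = ["pos", "pos", "neg", "pos", "neu", "pos", "neg", "pos", "pos", "neg"]

rng = random.Random(0)  # fixed seed so the run is repeatable
label_counts = Counter(train_labels)
categories = sorted(label_counts)


def accuracy(refs, hyps):
    """Fraction of positions where the hypothesis equals the reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# random: every category is equally likely
random_hyps = [rng.choice(categories) for _ in test_labels]

# chance: sample categories in proportion to their training frequency
weights = [label_counts[c] for c in categories]
chance_hyps = rng.choices(categories, weights=weights, k=len(test_labels))

# majority: always predict the most frequent training category
majority_label = label_counts.most_common(1)[0][0]
majority_hyps = [majority_label] * len(test_labels)

print("random:  ", accuracy(test_labels, random_hyps))
print("chance:  ", accuracy(test_labels, chance_hyps))
print("majority:", accuracy(test_labels, majority_hyps))
```

The random and chance scores vary with the seed; the majority score depends only on how often the largest training category occurs in the test labels.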

#### Upper Bound: Inter-rater agreement
Usually human performance, measured as the agreement between human annotators.

A system is expected to perform within the lower and upper bounds.
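
As a rough sketch of how inter-rater agreement might be quantified, the snippet below computes the observed agreement between two annotators and Cohen's kappa. Kappa is one common choice that is not named in the text above, so treat it as an assumption; the annotations are made up.

```python
from collections import Counter

# Labels assigned to the same 10 items by two annotators (hypothetical data)
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

n = len(rater_a)

# Observed agreement: fraction of items that received identical labels
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Agreement expected by chance, from each rater's own label distribution
count_a, count_b = Counter(rater_a), Counter(rater_b)
categories = set(rater_a) | set(rater_b)
expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

# Cohen's kappa corrects the observed agreement for chance agreement
kappa = (observed - expected) / (1 - expected)

print("Observed agreement: {:.3f}".format(observed))
print("Cohen's kappa:      {:.3f}".format(kappa))
```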
            

### 1.2. Data Split

#### 1.2.1. Training-Testing Split

A data set is often split into the following parts:

- _Training_: for training / extracting rules / etc.
- _Development_ (_Validation_): for optimization / intermediate evaluation
- _Testing_: for the final evaluation 
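
A minimal sketch of such a split in plain Python is given below. The 80/10/10 ratio and the toy data are assumptions made for illustration; the text does not prescribe any particular proportions.

```python
import random

# Hypothetical dataset of 100 (example, label) pairs
dataset = [("example {}".format(i), i % 2) for i in range(100)]

rng = random.Random(42)  # fixed seed so the split is reproducible
shuffled = dataset[:]    # copy, so the original order is preserved
rng.shuffle(shuffled)

# 80% training, 10% development, 10% testing (assumed ratios)
n_train = len(shuffled) * 8 // 10
n_dev = len(shuffled) // 10

train_set = shuffled[:n_train]
dev_set = shuffled[n_train:n_train + n_dev]
test_set = shuffled[n_train + n_dev:]

print(len(train_set), len(dev_set), len(test_set))
```

The three parts are disjoint slices of one shuffled list, so no instance is shared between training, development and testing.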

#### 1.2.2. [K-Fold Cross-Validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))
In k-fold cross-validation, the original sample is randomly partitioned into $k$ equal-sized subsamples. Of the $k$ subsamples, a single subsample is retained as the validation data for testing the model, and the remaining $k - 1$ subsamples are used as training data. The cross-validation process is then repeated $k$ times, with each of the $k$ subsamples used exactly once as the validation data. The $k$ results can then be averaged to produce a single estimate.

- Random K-Fold Cross-Validation splits data into $K$ equal folds
- Stratified K-Fold Cross-Validation additionally makes sure that the distribution of target labels is similar across different folds

The general procedure is as follows:

- Shuffle the dataset randomly
- Split the dataset into $k$ folds
- For each fold:
    - Take the fold as a hold-out (test) data set
    - Take the remaining folds as a training data set
    - Fit a model on the training set and evaluate it on the test set
    - Retain the evaluation score and discard the model
- Summarize the model performance by averaging the evaluation scores
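
The procedure can be sketched without any library. In the snippet below, `k_fold_indices` is a hypothetical helper written for this illustration, the labels are made up, and the "model" is just the majority label of the training folds.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # shuffle the dataset
    folds = [indices[i::k] for i in range(k)]   # split into k folds
    for i, test_fold in enumerate(folds):
        # all folds except the i-th one form the training set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


labels = [0] * 12 + [1] * 8  # toy labels (hypothetical data)

scores = []
for train_idx, test_idx in k_fold_indices(len(labels), k=5):
    # "fit": the model is simply the majority label of the training folds
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    # "evaluate": accuracy of that prediction on the held-out fold
    scores.append(sum(labels[i] == majority for i in test_idx) / len(test_idx))

# summarize by averaging the retained scores
mean_score = sum(scores) / len(scores)
print("Fold scores:", scores)
print("Mean accuracy: {:.3f}".format(mean_score))
```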

## 2. Evaluation Metrics

### 2.1. Contingency Table

A [contingency table](https://en.wikipedia.org/wiki/Contingency_table) (also known as a _cross tabulation_ or _crosstab_) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. For the binary classification into positive (_POS_) and negative (_NEG_) classes, the predictions of a model (_HYP_, for hypotheses) with respect to the true labels (_REF_, for references) can be represented as the following matrix.

|     |         | REF     |         |
|-----|---------|:-------:|:-------:|
|     |         | __POS__ | __NEG__ |
| HYP | __POS__ | TP      | FP      |
|     | __NEG__ | FN      | TN      |


Where:
- __TP__: True Positives (usually denoted as $a$)
- __FP__: False Positives ($b$)
- __FN__: False Negatives ($c$)
- __TN__: True Negatives ($d$)
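
As a small illustration, the four cells can be counted directly from paired reference and hypothesis labels. The label lists below are invented for the example.

```python
# Reference and hypothesis labels for 10 instances (hypothetical data)
refs = ["POS", "POS", "NEG", "NEG", "POS", "NEG", "NEG", "POS", "NEG", "NEG"]
hyps = ["POS", "NEG", "NEG", "POS", "POS", "NEG", "NEG", "POS", "POS", "NEG"]

pairs = list(zip(refs, hyps))

tp = sum(r == "POS" and h == "POS" for r, h in pairs)  # predicted POS, truly POS
fp = sum(r == "NEG" and h == "POS" for r, h in pairs)  # predicted POS, truly NEG
fn = sum(r == "POS" and h == "NEG" for r, h in pairs)  # predicted NEG, truly POS
tn = sum(r == "NEG" and h == "NEG" for r, h in pairs)  # predicted NEG, truly NEG

print("          REF=POS  REF=NEG")
print("HYP=POS   {:7d}  {:7d}".format(tp, fp))
print("HYP=NEG   {:7d}  {:7d}".format(fn, tn))
```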

### 2.1. The Simplest Case: Accuracy

$$ \text{Accuracy} = \frac{\text{Num. of Correct Decisions}}{\text{Total Num. of Instances}} $$

- Known number of instances
- Single decision for each instance 
- Single correct answer for each instance 
- All errors are equal

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}}$$

__What if TN is infinite or unknown?__

e.g.: Number of irrelevant queries to a search engine
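
A short numeric sketch of why this matters: when TN is very large, accuracy is dominated by it and says little about how well the positive class is handled. All counts below are made up.

```python
def accuracy(tp, fp, fn, tn):
    """Accuracy from the four contingency-table cells."""
    return (tp + tn) / (tp + fp + fn + tn)


# Balanced case: accuracy reflects the quality of both classes
balanced = accuracy(tp=40, fp=10, fn=10, tn=40)

# TN-dominated case: only 10 of the 30 positives are found,
# yet accuracy is close to 1 because of the huge number of true negatives
dominated = accuracy(tp=10, fp=5, fn=20, tn=1_000_000)

print("Balanced:     {:.6f}".format(balanced))
print("TN-dominated: {:.6f}".format(dominated))
```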

### 2.2. Precision & Recall

|     |         | REF     |         |             |
|-----|---------|:-------:|:-------:|-------------|
|     |         | __POS__ | __NEG__ |             |
| HYP | __POS__ | TP      | FP      | _Precision_ |
|     | __NEG__ | FN      | TN      |             |
|     |         | _Recall_ |        |             |


$$ \text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$$

$$ \text{Recall} = \frac{\text{TP}}{\text{TP}+\text{FN}}$$

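__2 Values__: precision and recall have to be reported together.

Precision-Recall Trade-Off: raising one typically lowers the other, e.g. predicting _POS_ more liberally increases recall but usually decreases precision.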
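
The trade-off can be illustrated with a classifier that outputs a score and a decision threshold. This is a sketch with invented scores; `precision_recall` is a helper defined only for this example.

```python
# (score, reference label) pairs; 1 = POS, 0 = NEG (hypothetical data)
scored = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
          (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0)]


def precision_recall(threshold):
    """Predict POS when score >= threshold and return (precision, recall)."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Lowering the threshold predicts POS more often
for threshold in (0.85, 0.45, 0.05):
    p, r = precision_recall(threshold)
    print("threshold {:.2f}: precision {:.2f}, recall {:.2f}".format(threshold, p, r))
```

On this toy data, lowering the threshold moves recall from 0.4 to 1.0 while precision falls from 1.0 to 0.5.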

### 2.3. F-Measure

- Harmonic Mean of Precision & Recall 
- Usually evenly weighted


$$ F_{\beta} = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

The most common value is $\beta = 1$:

$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
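
The formulas translate directly into code. The sketch below uses made-up precision and recall values; returning 0 when both are 0 is a convention assumed here to avoid division by zero.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention for the undefined case
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 0.75, 0.60  # hypothetical values

f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)
f05 = f_beta(precision, recall, beta=0.5)

print("F1:   {:.3f}".format(f1))
print("F2:   {:.3f}".format(f2))
print("F0.5: {:.3f}".format(f05))
```

With these values $F_2$ lies closer to recall and $F_{0.5}$ closer to precision, while $F_1$ sits between the two.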

### 2.4. Micro, Macro and (Macro-) Weighted Averaging

In a Multi-Class setting per-class scores are averaged to produce a single score.
There are several ways the scores could be averaged. 

__Micro Averaging__

We compute scores by summing the true positives, false positives and false negatives (and true negatives, where the metric uses them) over all classes.

__Macro Averaging__

We first compute scores per class, then average the scores ignoring their distribution in the test set.

__(Macro-) Weighted Averaging__

Similar to Macro Averaging, but we additionally weight the scores by the class-frequency.

#### Precision Example

Let's assume we have 3 classes. The precision formula from above is:

$$ \text{Precision} = \frac{\text{TP}}{\text{TP}+\text{FP}}$$

$$\text{Micro Precision} = \frac{\text{TP}_1 + \text{TP}_2 +\text{TP}_3}{(\text{TP}_1 + \text{TP}_2 +\text{TP}_3)+(\text{FP}_1 + \text{FP}_2 +\text{FP}_3)}$$

$$\text{Macro Precision} = \frac{P_1 + P_2 + P_3}{3} = P_1 \cdot \frac{1}{3} + P_2 \cdot \frac{1}{3} + P_3 \cdot \frac{1}{3}$$

$$\text{Weighted Precision} = P_1 \cdot \frac{S_1}{N} + P_2 \cdot \frac{S_2}{N} + P_3 \cdot \frac{S_3}{N}$$

Where:
- $P_i$ is the precision of class $i$
- $S_i$ is the support for class $i$ (i.e. the number of observations with that label)
- $N$ is the total number of observations
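
A worked example of the three averages for 3 classes is sketched below. The per-class counts are invented, chosen so that they are consistent with a single-label setting (each of the $N = 50$ instances receives exactly one prediction).

```python
# Per-class true positives, false positives and support (hypothetical counts)
counts = {
    "A": {"tp": 8, "fp": 2, "support": 10},
    "B": {"tp": 3, "fp": 3, "support": 5},
    "C": {"tp": 30, "fp": 4, "support": 35},
}

n_total = sum(c["support"] for c in counts.values())

# Per-class precision P_i = TP_i / (TP_i + FP_i)
per_class = {k: c["tp"] / (c["tp"] + c["fp"]) for k, c in counts.items()}

# Micro: pool the counts over classes, then compute a single precision
sum_tp = sum(c["tp"] for c in counts.values())
sum_fp = sum(c["fp"] for c in counts.values())
micro = sum_tp / (sum_tp + sum_fp)

# Macro: unweighted mean of the per-class precisions
macro = sum(per_class.values()) / len(per_class)

# Weighted: per-class precisions weighted by class support S_i / N
weighted = sum(per_class[k] * counts[k]["support"] / n_total for k in counts)

print("Per class:", {k: round(v, 3) for k, v in per_class.items()})
print("Micro:    {:.3f}".format(micro))
print("Macro:    {:.3f}".format(macro))
print("Weighted: {:.3f}".format(weighted))
```

The macro average is pulled down by the small, poorly predicted class B, whereas the micro and weighted averages are dominated by the large class C.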

## 3. Classification with Scikit-Learn

- Loading Data
- Baselines
- Training Classifier
- Evaluation


### 3.1. Loading and Inspecting a Dataset

`scikit-learn` comes with several toy datasets.
Let's use one of those (iris) to perform a simple classification experiment.

The iris dataset is a classic and very easy multi-class classification dataset.

| Property          | Value |
|-------------------|-------|
| Classes           |   3 |
| Samples per class |  50 |
| Samples total     | 150 |
| Dimensionality    |   4 | 
| Features          | real, positive | 

In [1]:
from sklearn.datasets import load_iris
from collections import Counter
data = load_iris()

print("Classes: {}".format(len(list(data.target_names))))
print("Samples: {}".format(len(data.data)))
print("Dimensionality: {}".format(len(list(data.feature_names))))
print("Samples per Class: {}".format(dict(Counter(list(data.target)))))

print(data.data[0])  # prints feature vector

print(data.data.shape)  # prints matrix shape for data
print(data.target.shape)  # print matrix shape for labels

# print(data.DESCR)  # prints full data set description
# print(data.data)  # prints features
# print(data.target) # prints labels

Classes: 3
Samples: 150
Dimensionality: 4
Samples per Class: {0: 50, 1: 50, 2: 50}
[5.1 3.5 1.4 0.2]
(150, 4)
(150,)


### 3.2. Splitting the Dataset

- Random K-Fold Split
- Stratified K-Fold Split

In [2]:
from sklearn.model_selection import KFold

random_split = KFold(n_splits=5, shuffle=True)

for train_index, test_index in random_split.split(data.data):
    
    print("Samples per Class in Training: {}".format(dict(Counter(list(data.target[train_index])))))
    print("Samples per Class in Testing: {}".format(dict(Counter(list(data.target[test_index])))))
    

Samples per Class in Training: {0: 41, 1: 37, 2: 42}
Samples per Class in Testing: {0: 9, 1: 13, 2: 8}
Samples per Class in Training: {0: 41, 1: 40, 2: 39}
Samples per Class in Testing: {0: 9, 1: 10, 2: 11}
Samples per Class in Training: {0: 40, 1: 39, 2: 41}
Samples per Class in Testing: {0: 10, 1: 11, 2: 9}
Samples per Class in Training: {0: 35, 1: 43, 2: 42}
Samples per Class in Testing: {0: 15, 1: 7, 2: 8}
Samples per Class in Training: {0: 43, 1: 41, 2: 36}
Samples per Class in Testing: {0: 7, 1: 9, 2: 14}


In [3]:
from sklearn.model_selection import StratifiedKFold

stratified_split = StratifiedKFold(n_splits=5, shuffle=True)

for train_index, test_index in stratified_split.split(data.data, data.target):
    
    print("Samples per Class in Training: {}".format(dict(Counter(list(data.target[train_index])))))
    print("Samples per Class in Testing: {}".format(dict(Counter(list(data.target[test_index])))))

Samples per Class in Training: {0: 40, 1: 40, 2: 40}
Samples per Class in Testing: {0: 10, 1: 10, 2: 10}
Samples per Class in Training: {0: 40, 1: 40, 2: 40}
Samples per Class in Testing: {0: 10, 1: 10, 2: 10}
Samples per Class in Training: {0: 40, 1: 40, 2: 40}
Samples per Class in Testing: {0: 10, 1: 10, 2: 10}
Samples per Class in Training: {0: 40, 1: 40, 2: 40}
Samples per Class in Testing: {0: 10, 1: 10, 2: 10}
Samples per Class in Training: {0: 40, 1: 40, 2: 40}
Samples per Class in Testing: {0: 10, 1: 10, 2: 10}


### 3.3. Training and Testing the Model

#### 3.3.1. Classification Process

- Select the classification algorithm from [Supervised Learning](https://scikit-learn.org/stable/supervised_learning.html)
- Train on training data
- Predict labels on testing data
- Score the predictions by comparing predicted and reference labels

In [4]:
from sklearn.naive_bayes import GaussianNB

# choose classification algorithm & initialize it
clf = GaussianNB()

# for each training/testing fold
for train_index, test_index in stratified_split.split(data.data, data.target):
    # train (fit) model
    clf.fit(data.data[train_index], data.target[train_index])
    # predict test labels
    clf.predict(data.data[test_index])
    # score the model (using average accuracy for now)
    accuracy = clf.score(data.data[test_index], data.target[test_index])
    print("Accuracy: {:.3}".format(accuracy))



Accuracy: 0.933
Accuracy: 0.967
Accuracy: 0.967
Accuracy: 0.967
Accuracy: 0.967


#### 3.3.2. Baselines

Scikit-learn provides baselines via the `DummyClassifier` class, which takes a `strategy` argument. The following baselines can be obtained:

- random baseline: `uniform`
- chance baseline: `stratified`
- majority baseline: `most_frequent`


In [5]:
from sklearn.dummy import DummyClassifier

random_clf = DummyClassifier(strategy="uniform")

for train_index, test_index in stratified_split.split(data.data, data.target):
    random_clf.fit(data.data[train_index], data.target[train_index])
    random_clf.predict(data.data[test_index])
    accuracy = random_clf.score(data.data[test_index], data.target[test_index])
    
    print("Accuracy: {:.3}".format(accuracy))


Accuracy: 0.5
Accuracy: 0.4
Accuracy: 0.333
Accuracy: 0.167
Accuracy: 0.4


#### Exercise

Try the `stratified` and `most_frequent` strategies and compare their performance.

#### 3.3.3. Better Classification Report

scikit-learn can report more informative performance values (per-class precision, recall, F1 and support, plus their averages) using [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

In [6]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# choose classification algorithm & initialize it
clf = GaussianNB()

# for each training/testing fold
for train_index, test_index in stratified_split.split(data.data, data.target):
    # train (fit) model
    clf.fit(data.data[train_index], data.target[train_index])
    # predict test labels
    hyps = clf.predict(data.data[test_index])
    refs = data.target[test_index]
    
    report = classification_report(refs, hyps, target_names=data.target_names)
    
    print(report)
    

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00        10
   virginica       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.82      0.90      0.86        10
   virginica       0.89      0.80      0.84        10

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       0.90      0.90      0.90        10
   virginica       0.90      0.90      0.90        10

    accuracy        

#### 3.3.4. Cross-Validation Evaluation

The cross-validation procedure and function of scikit-learn are described in [the documentation](https://scikit-learn.org/stable/modules/cross_validation.html).

In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# choose classification algorithm & initialize it
clf = GaussianNB()
# get scores
scores = cross_val_score(clf, data.data, data.target, cv=5)

print(scores)


[0.93333333 0.96666667 0.93333333 0.93333333 1.        ]


Cross-Validation using custom split and scoring.

In [8]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate

# choose classification algorithm & initialize it
clf = GaussianNB()
# score using our custom split & macro-averaged F1
scores = cross_validate(clf, data.data, data.target, cv=stratified_split, scoring=['f1_macro'])

print(sum(scores['test_f1_macro'])/len(scores['test_f1_macro']))


0.9531819447608921


#### Exercise
- Read the [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html)
- Try different evaluation scores

### 3.4. Vectorizing Text

> The raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length.

Consequently, text classification requires an additional step: vectorization, which converts text into a vector of numerical values. `scikit-learn` provides several vectorization methods in the `sklearn.feature_extraction` [module](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction). The most commonly used ones are:

- Count Vectorization
- TF-IDF Vectorization


#### 3.4.1. Bag-of-Words Representation

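The bag-of-words representation is built with the following procedure. [Count Vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) covers the first two steps; the third is handled by the TF-IDF transformation (see 3.4.2).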

- *tokenizing* strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.

- *counting* the occurrences of tokens in each document.

- *normalizing* and *weighting* with diminishing importance tokens that occur in the majority of samples / documents.

Each token is considered to be a __feature__ and the vector of all the token frequencies for a given document is considered a multivariate __sample__. Consequently, a corpus of documents is represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

> If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
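
To make the tokenizing and counting steps concrete, here is a minimal pure-Python sketch. It is not the `CountVectorizer` implementation: `tokenize` and `vectorize` are hypothetical helpers, and the regular expression (tokens of two or more word characters) is chosen to resemble the library's default token pattern.

```python
import re
from collections import Counter

corpus = [
    "who plays luke on star wars new hope",
    "show credits for the godfather",
]


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Give every token seen in the corpus an integer id (alphabetical order)
all_tokens = sorted({tok for doc in corpus for tok in tokenize(doc)})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}


def vectorize(text):
    """Count vocabulary tokens in the text; unknown tokens are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]


print(vocabulary)
print([vectorize(doc) for doc in corpus])       # one row per document
print(vectorize("who was the lead in the godfather"))
```

Each row has one column per vocabulary entry, so the matrix width equals the vocabulary size found in the corpus.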

The [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) implements both tokenization and occurrence counting in a single class, and it is possible to provide many parameters. 

It can take an external preprocessor or perform the following preprocessing steps (read documentation for details):

- __strip_accents__: remove accents and perform other character normalization during the preprocessing step.
- __lowercase__: convert all characters to lowercase before tokenizing.
- __stop_words__: remove stop words; passing `'english'` applies the built-in stop word list for English.
- __token_pattern__: regular expression denoting what constitutes a *token* for tokenization
- __ngram_range__: The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. (We will see n-grams in the next lab)
- __max_df__: maximum frequency cut-off: When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). 
- __min_df__: minimum frequency cut-off: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. 
- __vocabulary__: externally provided vocabulary
- __binary__: If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.
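
A short sketch of how some of these parameters might be combined is given below. The three-sentence corpus and the particular parameter values are chosen only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Who plays Luke on Star Wars",
    "show the credits for the Godfather",
    "who was the main actor in the Exorcist",
]

vectorizer = CountVectorizer(
    lowercase=True,        # convert to lowercase before tokenizing
    stop_words="english",  # drop words on the built-in English stop word list
    binary=True,           # record presence/absence instead of counts
    min_df=1,              # keep terms that occur in at least 1 document
)

vectors = vectorizer.fit_transform(corpus)

print(sorted(vectorizer.vocabulary_))  # learned features (tokens)
print(vectors.toarray())               # one row per document
```

Because `binary=True`, repeated tokens within a document still yield 1, and stop words such as "the" do not appear among the features.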

#### 3.4.2. [TF-IDF Vectorization](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)  
TF-IDF Vectorization = Count Vectorization + TF-IDF Transformation

> Transforms a count matrix to a normalized tf or tf-idf representation

> __Tf__ means term-frequency while __tf-idf__ means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.

> The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

(Please refer to the documentation for the transformation formulas).
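
For intuition, the weighting can be sketched in plain Python. This sketch assumes a smoothed idf, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, followed by L2 normalization of each document vector; to my understanding this corresponds to the library defaults, but the documentation remains the reference. The documents and helper names are made up.

```python
import math
from collections import Counter

# Tokenized toy documents (hypothetical data)
docs = [
    ["the", "godfather"],
    ["the", "exorcist"],
    ["the", "man"],
]

n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents that contain each term
df = Counter(tok for doc in docs for tok in set(doc))


def idf(term):
    """Smoothed inverse document frequency (assumed variant, see above)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(doc):
    """Term count times idf for every vocabulary term, L2-normalized."""
    tf = Counter(doc)
    raw = [tf[t] * idf(t) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm else raw


print(vocab)
for doc in docs:
    print([round(v, 3) for v in tfidf(doc)])
```

The term "the" occurs in every document and so receives the lowest idf; within each document the rarer term ends up with the larger weight.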

#### 3.4.3. Vectorization Example

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'who plays luke on star wars new hope',
    'show credits for the godfather',
    'who was the main actor in the exorcist',
    'find the female actress from the movie she \'s the man',
    'who played dory on finding nemo'
]

vectorizer = CountVectorizer()

# use fit_transform to 'learn' the features and vectorize the data
vectors = vectorizer.fit_transform(corpus)

print(vectors.toarray())  # print numpy vectors

[[0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 1]
 [0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0]
 [1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 2 0 1 1]
 [0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 3 0 0 0]
 [0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 1]]


In [6]:
test_corpus = [
    'who was the female lead in resident evil',
    'who played guido in life is beautiful'
]

# 'trained' vectorizer can be later used to transform the test set 
test_vectors = vectorizer.transform(test_corpus)
print(test_vectors.toarray())

[[0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1]]


## Lab Exercise: Text Classification

- Using the Newsgroup dataset from `scikit-learn`, train and evaluate a Multinomial Naive Bayes model
- Experiment with different vectorization methods and parameters:
    - `binary` of Count Vectorization
    - TF-IDF Transformation
    - min and max cut-offs
    - using stop-words
    - lowercasing

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

vectorizer = CountVectorizer()
classifier = MultinomialNB()

trn_vectors = vectorizer.fit_transform(newsgroups_train.data)
tst_vectors = vectorizer.transform(newsgroups_test.data)

classifier.fit(trn_vectors, newsgroups_train.target)
predictions = classifier.predict(tst_vectors)

print(classification_report(newsgroups_test.target, predictions, target_names=newsgroups_train.target_names))


                          precision    recall  f1-score   support

             alt.atheism       0.79      0.77      0.78       319
           comp.graphics       0.67      0.74      0.70       389
 comp.os.ms-windows.misc       0.20      0.00      0.01       394
comp.sys.ibm.pc.hardware       0.56      0.77      0.65       392
   comp.sys.mac.hardware       0.84      0.75      0.79       385
          comp.windows.x       0.65      0.84      0.73       395
            misc.forsale       0.93      0.65      0.77       390
               rec.autos       0.87      0.91      0.89       396
         rec.motorcycles       0.96      0.92      0.94       398
      rec.sport.baseball       0.96      0.87      0.91       397
        rec.sport.hockey       0.93      0.96      0.95       399
               sci.crypt       0.67      0.95      0.78       396
         sci.electronics       0.79      0.66      0.72       393
                 sci.med       0.87      0.82      0.85       396
         