## Working with text data + Evaluation

After this notebook you should know:

* how to represent text data as features
* how to evaluate your ML classifier

<small>Tutorial adapted from scikit-learn. See [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)</small>

## Recap ML 101: What we need

1. Data
  * input $X$ and output (labels) $Y$ 
2. Features
  * the actual features: how $X$ is decomposed into its parts by the vectorizer/featurizer $\phi$ --- **How do we extract features from text data?**
3. Model/Algorithm
  * the machine learning algorithm used 
4. Evaluation
  * how to measure how good your model is --- **How do we evaluate our model?**

## Extracting features from text data

* Iris example from last lecture: we were already given the features (do you remember how many there were?)


* In NLP, we typically have many more features, and we typically first need to define **how to represent the text**.
    * There is an extra step from the **raw text input** to the actual features that are used.
    * This step of extracting features from raw (text) input is called **featurization** or **vectorization**

    * It means that we turn the original content into a feature vector. Each dimension of the vector contains a numerical value and corresponds to a particular **feature**.
    
Let us look at a concrete example, the 20 Newsgroups dataset.

### Example: Loading the 20 Newsgroups dataset

This notebook downloads the dataset automatically. Alternatively, it is possible to download the dataset manually (see how-to on the scikit-learn [tutorial website](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)).

In [1]:
import numpy as np


In [2]:
categories = ['soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Notice: the data has been shuffled randomly, using a fixed seed.

In [4]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
twenty_train['target_names']

['comp.graphics', 'sci.med', 'soc.religion.christian']

### Let's look at the first document (data instance) in the dataset

In [6]:
twenty_train['data'][0]

"From: tbrent@ecn.purdue.edu (Timothy J Brent)\nSubject: Am I going to Hell?\nOrganization: Purdue University Engineering Computer Network\nLines: 12\n\nI have stated before that I do not consider myself an atheist, but \ndefinitely do not believe in the christian god.  The recent discussion\nabout atheists and hell, combined with a post to another group (to the\neffect of 'you will all go to hell') has me interested in the consensus \nas to how a god might judge men.  As a catholic, I was told that a jew,\nbuddhist, etc. might go to heaven, but obviously some people do not\nbelieve this.  Even more see atheists and pagans (I assume I would be \nlumped into this category) to be hellbound.  I know you believe only\ngod can judge, and I do not ask you to, just for your opinions.\n\nThanks,\n-Tim\n"

What is $X$? 

The data ($X$) is still in its raw (original) input format; no **featurizer** has been applied to it yet. It is still an entire "chunk" of data (text in this case).


What are the $Y$s? 

As we already saw with the Iris dataset, the targets (classes/labels/categories) are encoded as integers. These are the labels we are going to predict; they correspond to the `target_names` given above.

In [7]:
twenty_train['target']

array([2, 0, 0, ..., 0, 2, 2])

We can get the original names back. Let's say we want to look at the first 10 data instances and get their category/label:

In [8]:
for target_idx in twenty_train['target'][:10]:
    print(twenty_train.target_names[target_idx], "=", target_idx)

soc.religion.christian = 2
comp.graphics = 0
comp.graphics = 0
comp.graphics = 0
comp.graphics = 0
comp.graphics = 0
soc.religion.christian = 2
comp.graphics = 0
comp.graphics = 0
sci.med = 1


### Extracting features from text data

In order to run a machine learning algorithm, we first need to **transform** the original text data into a **set of features**.
* This process is called featurization (or extracting features from data).
* It goes from the raw input to a vector of some fixed size $d$, where each dimension of the vector corresponds to a particular **feature**.

<img src="pics/learning.png">

#### Bag-of-words 

A very simple way to decompose the input text is to make a 'bag-of-words' (BOW) representation.
* The input text is broken down into single words
* The feature vector encodes the words it has seen for a given instance.


<img src="pics/bow1.png" width=300>

For example, the following five instances would be represented in a BOW model as:

<img src="pics/bow2.png">

**Note**: with BoW the sequence information is lost. E.g. the representations for "boring just" and "just boring" are the same.
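The loss of word order can be checked with a few lines of plain Python. This is a minimal sketch that assumes whitespace tokenization and lowercasing; the helper `bag_of_words` is a hypothetical name introduced only for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Return a word -> count mapping for lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

bow_a = bag_of_words("just boring")
bow_b = bag_of_words("boring just")

# Both word orders produce exactly the same bag.
print(bow_a == bow_b)  # True
```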

You can decide **which features** to include; not all words are necessarily good predictors for your target variable. For example, in the case of sentiment analysis we could decide to only use content words and punctuation (including emoticons) as features:

<img src="pics/bow3.png">

Note, however, that typically the ML system does not store large feature vectors. In particular, when working with text data **most features in X will be zero**, i.e., only a few words actually occur in a particular instance/example. Storing the long vector would be very inefficient. Thus, internally sklearn keeps a **sparse** representation of the features. See more [here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).
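To get a feeling for why a sparse representation saves space, here is a minimal sketch that stores only the non-zero entries of a vector in a dictionary. It illustrates the idea only; it is not the Compressed Sparse Row format that sklearn uses internally, and the helpers `to_sparse` and `to_dense` are hypothetical names.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {feature_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

def to_dense(sparse, size):
    """Rebuild the full vector from the sparse mapping."""
    return [sparse.get(i, 0) for i in range(size)]

# A toy document vector over a 20-word vocabulary with only 3 words present.
dense = [0] * 20
dense[2], dense[7], dense[15] = 1, 3, 1

sparse = to_sparse(dense)
print(sparse)  # {2: 1, 7: 3, 15: 1}
print(len(sparse), "of", len(dense), "entries stored")
```

With a vocabulary of tens of thousands of words and documents that contain only a few dozen distinct tokens, the difference becomes substantial.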

What features to use is crucial for a machine learning system, and can make a big performance difference. This includes not only which tokens to keep but also whether to transform them in some way, e.g. lowercasing, stemming, lemmatization.

Scikit-learn (sklearn) includes a range of built-in featurizers. Here we only look at a very simple vectorizer for count data, the `CountVectorizer`. But please have a look at the other vectorizers available in scikit-learn, such as the `TfidfVectorizer` or the customizable `DictVectorizer`.

#### The `CountVectorizer`

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(twenty_train.data) #stop_words="english"
X_train_counts = count_vectorizer.transform(twenty_train.data)
X_train_counts.shape

(1777, 31638)

In [10]:
len(twenty_train.target)

1777

In [11]:
X_train_counts[0]

<1x31638 sparse matrix of type '<class 'numpy.int64'>'
	with 93 stored elements in Compressed Sparse Row format>

The `CountVectorizer` stores the data in a **sparse** matrix format. After fitting, its `vocabulary_` attribute maps each feature (token) to its feature number (column index). We can look up the feature id of a particular feature with:

In [12]:
count_vectorizer.vocabulary_.get("the")

28369

The [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) has many options. By default it stores the frequency of a word unigram (lowercased), where a word is defined by `token_pattern=u'(?u)\b\w\w+\b'`. 
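To see what this default token pattern does, we can apply it by hand with Python's `re` module. This is a sketch of the lowercasing and regular-expression step only; the function `tokenize` is a hypothetical helper, and the real `CountVectorizer` has further options that are not covered here.

```python
import re

# The default token pattern quoted above: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text, then extract tokens with the default pattern."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("I don't believe a word, OK?")
print(tokens)  # ['don', 'believe', 'word', 'ok']
```

Note that single-character tokens ("I", "a", the "t" of "don't") and punctuation are dropped, which matters if you want features such as emoticons.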

However, storing the number of occurrences of a token has a side effect: longer documents will typically have higher average count values than shorter ones, even though they might talk about the same topic. How can we avoid this issue?

1. Using **binary** (1/0 or on/off) feature values: instead of accounting for frequency, each token gets the same `weight`, it is either present or not. You can achieve binary (indicator) features by setting the `binary` option of the `CountVectorizer` to `binary=True`. 

2. Using **relative** term frequencies: instead of using the raw counts, divide them by the total number of words in the document. Typically, you then also want to downplay the importance of features that occur in many documents. This is achieved by weighting the term frequency by the inverse document frequency, so that tokens that appear in many documents count less. This is the idea behind the `TfidfVectorizer`.
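The second option can be worked through by hand on a toy corpus. The sketch below uses a textbook formulation (relative term frequency times the logarithm of the inverse document frequency). It is an assumption made for illustration: sklearn's `TfidfVectorizer` applies a smoothed idf and vector normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, already tokenized.
docs = [
    "the graphics card is fast".split(),
    "the doctor is in".split(),
    "the card".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each token occur?
df = Counter(token for doc in docs for token in set(doc))

def tf_idf(doc):
    """Relative term frequency times log inverse document frequency."""
    counts = Counter(doc)
    return {
        tok: (cnt / len(doc)) * math.log(n_docs / df[tok])
        for tok, cnt in counts.items()
    }

weights = tf_idf(docs[0])
print(weights["the"])                         # 0.0, "the" occurs in every document
print(weights["graphics"] > weights["card"])  # True, the rarer token weighs more
```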

In [13]:
## using binary feature values
count_vectorizer_binary = CountVectorizer(binary=True).fit(twenty_train.data)
X_train_counts_binary = count_vectorizer_binary.transform(twenty_train.data)
X_train_counts_binary.shape

(1777, 31638)

**A shortcut - `fit_transform`:** Note that the `sklearn` vectorizers have a shortcut, `fit_transform`, which does the two steps above in one go: `fit` creates the vocabulary from the data, then `transform` converts the raw input data into feature vectors, given that vocabulary. Using `fit_transform` is at times faster.

Note, however, that `fit` (and therefore `fit_transform`) should only ever be called on the training data. Otherwise you would create a new vocabulary from your test data, and the resulting feature indices would no longer match the ones the classifier was trained on. You *decide* on your features using the training data and then test them on your development/test data; you don't pick features based on the dev/test set!
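To make the division of labour between `fit` and `transform` concrete, here is a simplified stand-in written in plain Python. `TinyCountVectorizer` is a hypothetical class for illustration only (whitespace tokenization, dense lists); it is not how sklearn implements its vectorizers.

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, docs):
        # Build the vocabulary: one column index per distinct token.
        tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        # Count tokens using the existing vocabulary; unknown tokens are ignored.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

train = ["good movie", "boring movie"]
test = ["good plot"]

vec = TinyCountVectorizer().fit(train)
print(vec.vocabulary_)      # {'boring': 0, 'good': 1, 'movie': 2}
print(vec.transform(test))  # [[0, 1, 0]], the unseen token "plot" is dropped

# Fitting on the test data instead yields a different, incompatible feature space.
print(TinyCountVectorizer().fit(test).vocabulary_)  # {'good': 0, 'plot': 1}
```

In the second vocabulary, column 0 means "good" rather than "boring", so a classifier trained on the first feature space would misread every column.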

### Writing your own vectorizer


With the `DictVectorizer` you can add your own features and have full control over them.
As the name suggests, it expects dictionaries: one per instance, where the keys are your feature names and the values are the feature values (binary, frequencies, or whatever you want to use).
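A minimal sketch of how this could look is given below. The featurizer `extract_features` and the feature names it produces (`word=...`, `num_tokens`, `has_exclamation`) are made up for illustration; only `DictVectorizer` itself comes from sklearn.

```python
from sklearn.feature_extraction import DictVectorizer

def extract_features(text):
    """Hypothetical hand-written featurizer: returns {feature_name: value}."""
    tokens = text.lower().split()
    features = {"word=" + tok: 1 for tok in tokens}  # binary word indicators
    features["num_tokens"] = len(tokens)             # a numeric feature
    features["has_exclamation"] = int("!" in text)   # a custom indicator
    return features

train_texts = ["great movie !", "boring plot"]

dict_vectorizer = DictVectorizer()
X = dict_vectorizer.fit_transform([extract_features(t) for t in train_texts])

print(dict_vectorizer.feature_names_)  # learned feature names, one per column
print(X.toarray())                     # dense view of the sparse feature matrix
```

As with the `CountVectorizer`, new data would be converted with `dict_vectorizer.transform(...)`, not with `fit_transform`.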

## Training a classifier

Now that we have converted our data into features, we can train our classifier to predict the category of a post. Let us try to use a logistic regression classifier, and use a binary word unigram BOW representation.

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
clf = LogisticRegression()

In [16]:
## using binary feature values
X_train = CountVectorizer(binary=True).fit_transform(twenty_train.data)
X_train.shape

(1777, 31638)

Now let's train the classifier and use it to predict the label of a new post. We again need to extract the features from the document, using the vectorizer. Then we can use the classifier to `predict` the label of the document.

In [18]:
## train the classifier, and use it on a new document:
clf.fit(X_train, twenty_train.target)
document = ["don't believe"]
X_test = count_vectorizer_binary.transform(document)

* **Note** the use of `transform` here (**not** `fit_transform`). What would have happened if we were to call `fit_transform`?

In [19]:
y_predicted = clf.predict(X_test)

In [20]:
print(y_predicted)
print(twenty_train.target_names[y_predicted[0]])

[2]
soc.religion.christian


Cool! We trained our classifier, using a BOW feature representation (with unigram word features as binary indicator values). In the example above we gave the classifier just a single new test instance. You can also give a list of examples to the classifier:

In [21]:
documents = ["the graphic card sucks", "health / glucose is ..", "the right word"]
X_test = count_vectorizer_binary.transform(documents)
y_predicted = clf.predict(X_test)

In [22]:
for y_hat in y_predicted:
    print(twenty_train.target_names[y_hat])

comp.graphics
sci.med
soc.religion.christian


### Evaluating performance on the test set

We want to build a classifier that generalizes (rather than memorizes), i.e., it works *beyond* the training data.

A classifier generalizes reasonably well if it can predict with acceptable performance on new **unseen** test cases.

In [23]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
## convert test to vectors
X_test = count_vectorizer_binary.transform(twenty_test.data)
y_predicted = clf.predict(X_test)

* **Accuracy**: out of all predictions, how many are correct


Evaluating accuracy is easy:

In [24]:
from sklearn.metrics import accuracy_score
y_true = twenty_test.target
print(accuracy_score(y_true, y_predicted))

0.9306846999154691


In [25]:
# from scratch 
correct, total = 0, 0.0
for gold, pred in zip(y_true,y_predicted):
    if gold==pred:
        correct+=1
    total+=1
print("Accuracy {0:.2f} (correct/total: {1}/{2})".format(correct/total*100, correct, total))

Accuracy 93.07 (correct/total: 1101/1183.0)


Or, alternatively, even easier:

In [26]:
import numpy as np
np.mean(y_true == y_predicted)

0.9306846999154691

However, accuracy alone (= how many predictions are correct, out of all predictions) often tells us just part of the story. Why?

<img src="pics/accuracy.png">

$\mathrm{accuracy} = \frac{TP+TN}{TP+TN+FP+FN}$

In [27]:
# per-class breakdown
from sklearn import metrics
print(metrics.classification_report(y_true, y_predicted,
     target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

         comp.graphics       0.87      0.97      0.92       389
               sci.med       0.97      0.84      0.90       396
soc.religion.christian       0.96      0.99      0.98       398

             micro avg       0.93      0.93      0.93      1183
             macro avg       0.93      0.93      0.93      1183
          weighted avg       0.93      0.93      0.93      1183



Precision, Recall and F1:

* **precision**: out of those predicted as a label, how many were correct
* **recall**: how many instances, out of all instances of a specific label, did the classifier predict correctly
* **f1-score**: harmonic mean of precision and recall (f1 has beta=1, i.e., both precision and recall are equally important)
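These three definitions can also be computed from scratch. The sketch below treats one label at a time as the "positive" class and uses the usual formulas: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). The function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one label, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for g, p in pairs if p == label and g == label)
    fp = sum(1 for g, p in pairs if p == label and g != label)
    fn = sum(1 for g, p in pairs if p != label and g == label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy gold labels and predictions for three classes.
gold = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 2, 2]

for label in sorted(set(gold)):
    p, r, f = precision_recall_f1(gold, pred, label)
    print("label {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(label, p, r, f))
```

For label 0, for example, every prediction of 0 is correct (precision 1.00), but one of the three gold 0 instances is missed (recall 0.67).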

<img src="pics/precision_recall.png">

From Wikipedia: <img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg">

<img src="pics/fscore.png">

* pay attention when using accuracy if the categories are very skewed (one class that is much more frequent than others)
* in such a case, how can you achieve high accuracy?
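One way to see the problem is a toy example with a skewed label distribution and a "classifier" that always predicts the majority class. The numbers (95 vs. 5 instances) are invented for illustration.

```python
# Toy skewed data: 95 negative (0) and 5 positive (1) instances.
gold = [0] * 95 + [1] * 5

# A "classifier" that ignores its input and always predicts the majority class.
pred = [0] * len(gold)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
true_pos = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
recall_pos = true_pos / gold.count(1)

print(accuracy)    # 0.95
print(recall_pos)  # 0.0, not a single positive instance is found
```

The accuracy looks high, yet the model is useless for the minority class, which is why the per-class precision and recall are worth reporting.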

## Evaluation - two data scenarios

* Having a pre-split **train**/**dev**/**test** data: 
    * Build system using training data, use dev (development) data to find the right features, parameters, etc.
    * Only at the very end evaluate your system on the final (held-out) test data
    
* **Cross-validation** (also called $k$-fold cross-validation): split the data into $k$ folds (parts), train a model on $k-1$ folds, and evaluate on the remaining fold; repeat this $k$ times so that each fold serves as the evaluation set once, then report the average score (and its standard deviation)
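The splitting step of $k$-fold cross-validation can be sketched in plain Python as follows. The helper `k_fold_indices` is a hypothetical name; it only produces the index splits and assumes the data has already been shuffled. Training and scoring a model on each split is left out. In practice, sklearn provides `KFold` and `cross_val_score` in `sklearn.model_selection` for this.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Distribute the items as evenly as possible over the k folds.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_items=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every instance ends up in the test part exactly once, so the averaged score uses all of the data for evaluation without ever testing on an instance the corresponding model was trained on.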

Summary:

* We have seen how to build a classifier with `sklearn` using word unigrams BOW features (exercise: examine the vectorizers of `sklearn`, try to use `DictVectorizer`)
* Evaluation is important (how to measure performance: accuracy, precision, recall, f1 score; as well as how the evaluation is set up, i.e., the evaluation scenario)

## Building a Pipeline


In order to make the steps from input data to vectorizer to training a model easier, `sklearn` provides a `Pipeline`.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer(binary=True)
#vectorizer = TfidfVectorizer()
clf = LogisticRegression()
classifier = Pipeline( [('vec', vectorizer),
                        ('clf', clf)] )
print(clf)
classifier.fit(twenty_train.data, twenty_train.target)
y_predicted = classifier.predict(twenty_test.data)
print(accuracy_score(twenty_test.target, y_predicted))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
0.9306846999154691


# References

* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* http://scikit-learn.org/stable/modules/feature_extraction.html

# Read More

* Chapter 2 (pp. 29--38) of D. Rao and B. McMahan. 2019. NLP with PyTorch.
* Chapters 6 and 7 (pp. 65--85) of Y. Goldberg. 2017. Neural Network Methods for NLP.