## Working with text data

This tutorial is adapted from the scikit-learn documentation; see [http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

### Loading the 20 newsgroup dataset

This notebook assumes you have downloaded the 20 newsgroup data into a folder called `data` in your current directory (see the how-to on the website above).

In [4]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [5]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [6]:
twenty_train.keys()

dict_keys(['target_names', 'target', 'data', 'filenames', 'DESCR', 'description'])

In [7]:
twenty_train['target_names']

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [8]:
twenty_train['description']

'the 20 newsgroups by date dataset'

As we already saw with the IRIS dataset, the targets (classes/labels/categories) are encoded as integers. These are the labels we are going to predict; they correspond to the `target_names` given above.

In [7]:
twenty_train['target']

array([1, 1, 3, ..., 2, 2, 2])

In [8]:
import numpy as np
np.unique(twenty_train['target'])

array([0, 1, 2, 3])

In fact, we can get the original names of the instances back. Let's say we want to look at the first 10 data instances and get their original category/label:

In [9]:
for target_idx in twenty_train['target'][:10]:
    print(twenty_train.target_names[target_idx])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


Notice: the data has been shuffled randomly, using a fixed seed.

The `data` attribute holds the input data. Let's look at the first data example/instance. We see that it is still in raw (original) input format; no **featurizer** has been applied to the data yet. It is still an entire "chunk" of data.

In [10]:
twenty_train['data'][0]

'From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n'

### Extracting features from text data

In order to run a machine learning algorithm, we first need to **decompose** the original text data into a **set of features**. This process is called featurization (or extracting features from data). It means that we turn the original content into a feature vector: a vector of numerical values where each dimension corresponds to a particular **feature** that we had in mind.

#### Bag-of-words 

A very simple way to decompose the input text is to make a 'bag-of-words' (BOW) representation. Here we break the input text down into single words, and the feature vector encodes which words were seen in a given instance.

<img src="pics/bow1.png" width=300>

For example, the following two instances would be represented in a BOW model as:

<img src="pics/bow2.png">

You can decide which features to include; not all words are necessarily good predictors of your target variable. For example, in sentiment analysis we could decide to use only content words and punctuation as features, e.g.,

<img src="pics/bow3.png">

Note, however, that typically the ML system does not store large feature vectors. In particular, when working with text data **a lot of features in X will be zero**, i.e., only a few words actually occur in a particular instance/example. Storing the long vector would be very inefficient. Thus, internally sklearn keeps a **sparse** representation of the features. See more [here](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).
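
To make the difference between a dense and a sparse bag-of-words vector concrete, here is a minimal sketch in plain Python. The two toy documents and the helper names `dense_vector` and `sparse_vector` are made up for illustration; sklearn's own implementation uses SciPy sparse matrices rather than dictionaries.

```python
from collections import Counter

# Two toy documents (hypothetical, not taken from the 20 newsgroups data).
docs = ["the cat sat on the mat", "the dog barked"]

# Vocabulary: one feature index per distinct word, assigned in alphabetical order.
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

def dense_vector(doc):
    """Full-length count vector: one entry per vocabulary word, mostly zeros."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

def sparse_vector(doc):
    """Sparse form: only the non-zero entries, as {feature index: count}."""
    return {vocabulary[word]: count for word, count in Counter(doc.split()).items()}

print(vocabulary)
print(dense_vector(docs[0]))
print(sparse_vector(docs[0]))
```

With a realistic vocabulary of tens of thousands of words, the dense vector grows accordingly, while the sparse form only grows with the number of distinct words in the document.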

What features to take is crucial for a machine learning system, and can make a big performance difference. 

Scikit-learn (sklearn) includes a range of built-in featurizers.

#### The `CountVectorizer`

In [11]:
len(twenty_train.data)

2257

In [59]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer().fit(twenty_train.data)
X_train_counts = count_vectorizer.transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [60]:
X_train_counts[0]

<1x35788 sparse matrix of type '<class 'numpy.int64'>'
	with 73 stored elements in Compressed Sparse Row format>

The `CountVectorizer` stores the data in sparse matrix format. It has a `vocabulary_` attribute that maps features to their feature numbers. We can get the feature id (number) of a particular feature by:

In [61]:
count_vectorizer.vocabulary_.get("the")

32142

The `CountVectorizer` has many options. By default it stores the frequency of a word unigram (lowercased), where a word is defined by `token_pattern=u'(?u)\b\w\w+\b'`. 
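
To see what this default pattern does, the following sketch applies the same regular expression with Python's `re` module to a lowercased example sentence (the sentence is made up). It only illustrates the pattern; it is not a trace of the vectorizer's internals.

```python
import re

# The default token pattern quoted above: runs of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

text = "I bought a new HP LaserJet III in 1993!"
tokens = token_pattern.findall(text.lower())
print(tokens)
```

Single-character tokens such as "I" and "a" do not match the pattern, and punctuation is treated as a separator, so neither becomes a feature under the default settings.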

However, storing the raw occurrence counts of tokens has a side effect: longer documents will typically have higher average count values than shorter ones, even though they might talk about the same topic. How can we avoid this issue?

1. Using **binary** (1/0 or on/off) feature values: instead of accounting for frequency, each token gets the same weight; it is either present or not. You can get binary (indicator) features by setting the `binary` option of the `CountVectorizer` to `binary=True`.

2. Using **relative** term frequencies: instead of using the raw counts, divide by the total number of words in a document. Typically, you then also want to downplay the importance of features that occur in many documents. This is achieved by weighting the frequency by the inverse document frequency (hence, tokens that appear in many documents are less important).
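
The second option can be sketched by hand as follows. This is a simplified illustration that assumes the textbook definition idf = log(N / df); sklearn's `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its numbers will differ. The toy documents are made up.

```python
import math
from collections import Counter

# Three toy documents, already tokenized (hypothetical data).
docs = [
    "the graphics card renders the image".split(),
    "the doctor treats the patient".split(),
    "the image shows the patient".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Relative term frequency times inverse document frequency, per word."""
    counts = Counter(doc)
    total = len(doc)
    return {word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()}

weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

The word "the" occurs in every document, so its weight drops to zero, while words that occur in only one document receive the highest weights.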

In [62]:
## using binary feature values
count_vectorizer_binary = CountVectorizer(binary=True).fit(twenty_train.data)
X_train_counts_binary = count_vectorizer_binary.transform(twenty_train.data)
X_train_counts_binary.shape

(2257, 35788)

**A shortcut - `fit_transform`:** The `sklearn` vectorizers provide a shortcut, `fit_transform`, which performs the two steps above in one go: `fit` creates the vocabulary from the data, then `transform` converts the raw input data into feature vectors, given that vocabulary. Using `fit_transform` is at times faster.

Note, however, that `fit` should only ever be called on the training data; otherwise you would create a new vocabulary from your test data, and that would screw things up. You *decide* on your features using the training data and then test them on your development/test data; you don't pick features based on the dev/test set!
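
A small sketch of this train/test discipline, using made-up toy documents: the vocabulary is fitted on the training documents only, and the test document is merely transformed, so words that were never seen during training are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data for illustration only.
train_docs = ["the monitor is new", "the doctor is here"]
test_docs = ["the new keyboard"]

vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary, then vectorizes
X_test = vectorizer.transform(test_docs)        # reuses the training vocabulary

print(sorted(vectorizer.vocabulary_))
print(X_train.shape, X_test.shape)
print("keyboard" in vectorizer.vocabulary_)
```

Both matrices have the same number of columns, which is what allows a classifier trained on `X_train` to be applied to `X_test`.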

#### The `TfidfVectorizer`

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer #note: there is also a TfidfTransformer (that takes counts as input)
tfidf_vectorizer = TfidfVectorizer() 
X_train_tfidf = tfidf_vectorizer.fit_transform(twenty_train.data)
X_train_tfidf.shape

(2257, 35788)

## Training a classifier

Now that we have converted our data into features, we can train our classifier to predict the category of a post.

A good starting point is to use the LogisticRegression classifier.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

Let us train the classifier on the binary BOW representation.

In [14]:
clf = LogisticRegression()

In [15]:
## using binary feature values
count_vectorizer_binary = CountVectorizer(binary=True).fit(twenty_train.data)
X_train_counts_binary = count_vectorizer_binary.transform(twenty_train.data)
X_train_counts_binary.shape

(2257, 35788)

In [16]:
## train the classifier, needs X and Y:
clf.fit(X_train_counts_binary, twenty_train.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now, let's use the classifier to predict the label of a new post. First, we need to extract the features from the document, using the right vectorizer. Then we can use the classifier to `predict` the label of the document.

In [23]:
document = ["I bought a new monitor"]
X_test = count_vectorizer_binary.transform(document)

**Note** the use of `transform` here (**not** `fit_transform`). Why?

In [24]:
y_predicted = clf.predict(X_test)

In [25]:
print(y_predicted)

[1]


In [26]:
twenty_train.target_names[y_predicted[0]]

'comp.graphics'

Cool! We trained our first logistic regression classifier, using a BOW feature representation (with binary indicator features). In the example above we gave the classifier just a single new test instance. You can also give it a list of examples to classify, as the following code shows.

In [27]:
documents = ["the graphic card sucks", "health / glucose is ..", "the right word"]
X_test = count_vectorizer_binary.transform(documents)
y_predicted = clf.predict(X_test)

In [28]:
for y_hat in y_predicted:
    print(twenty_train.target_names[y_hat])

comp.graphics
sci.med
soc.religion.christian


### Evaluating performance on the test set

In [29]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
## convert test to vectors
X_test = count_vectorizer_binary.transform(twenty_test.data)
y_predicted = clf.predict(X_test)

Evaluating accuracy is easy:

In [30]:
from sklearn.metrics import accuracy_score
y_true = twenty_test.target
print(accuracy_score(y_true, y_predicted))

0.906790945406


Or, alternatively, even easier:

In [31]:
import numpy as np
np.mean(y_true == y_predicted)

0.90679094540612515

However, accuracy alone (= how many predictions are correct, out of all predictions) often tells us just part of the story. Why?

Precision, Recall and F1 score give us a more complete picture. 

* precision: out of those predicted as a label, how many were correct?
* recall: how many instances, out of all instances of a specific label, did the classifier predict correctly?
* f1-score: harmonic mean of precision and recall (f1 has beta=1, i.e., both precision and recall are equally important)
* support: the number of occurrences of each class in `y_true`.
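
The definitions above can be computed by hand, as in this minimal sketch. The two label lists are made up for illustration, and `precision_recall_f1` is a hypothetical helper, not an sklearn function.

```python
# Toy gold labels and predictions (hypothetical data).
y_true_toy = ["med", "med", "med", "graphics", "graphics", "graphics", "graphics", "graphics"]
y_pred_toy = ["med", "med", "graphics", "graphics", "graphics", "graphics", "graphics", "graphics"]

def precision_recall_f1(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single label."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    n_predicted = sum(p == label for p in y_pred)  # how often the label was predicted
    support = sum(t == label for t in y_true)      # how often the label truly occurs
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / support if support else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1, support

for label in ["graphics", "med"]:
    p, r, f1, support = precision_recall_f1(y_true_toy, y_pred_toy, label)
    print(f"{label:10s} precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
```

In this toy example every "med" prediction is correct (precision 1.0), but one true "med" instance is missed, so its recall is lower than its precision.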

In [32]:
from sklearn import metrics
print(metrics.classification_report(y_true, y_predicted,
     target_names=twenty_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.96      0.84      0.90       319
         comp.graphics       0.85      0.96      0.90       389
               sci.med       0.93      0.84      0.88       396
soc.religion.christian       0.90      0.97      0.94       398

           avg / total       0.91      0.91      0.91      1502



In [34]:
import pandas as pd
def crosstab(pred, gold):
    y_true = pd.Series(gold)
    y_pred = pd.Series(pred)
    print(pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))
    
# note the argument order: predictions first, then gold labels
crosstab(y_predicted, y_true)

Predicted    0    1    2    3   All
True                               
0          267   11   14   27   319
1            4  373    9    3   389
2            4   47  334   11   396
3            2    6    2  388   398
All        277  437  359  429  1502


Precision, Recall and F1-score are important concepts to understand. Let us formalize (and visualize) them.

<img src="pics/precision_recall.png">

<img src="pics/fscore.png">

<img src="pics/accuracy.png">

The document collection D:
<img src="pics/accuracy2.png">

* pay attention when using accuracy if the categories are very skewed (one class that is much more frequent than others)
* in such a case, how can you achieve high accuracy?
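
One way to explore the question above is to look at a trivial baseline that always predicts the most frequent class. The sketch below uses made-up labels with a 95/5 split; the class names `frequent` and `rare` are placeholders.

```python
from collections import Counter

# Hypothetical, heavily skewed gold labels: 95 vs. 5 instances.
y_true_skewed = ["frequent"] * 95 + ["rare"] * 5

# Baseline: always predict the majority class, ignoring the input entirely.
majority_label = Counter(y_true_skewed).most_common(1)[0][0]
y_pred_baseline = [majority_label] * len(y_true_skewed)

accuracy = sum(t == p for t, p in zip(y_true_skewed, y_pred_baseline)) / len(y_true_skewed)
rare_found = sum(t == "rare" and p == "rare" for t, p in zip(y_true_skewed, y_pred_baseline))
rare_recall = rare_found / y_true_skewed.count("rare")

print("accuracy:", accuracy)
print("recall on the rare class:", rare_recall)
```

The baseline reaches high accuracy without ever finding a single instance of the rare class, which is why per-class precision and recall are worth reporting alongside accuracy.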

## Building a Pipeline


In order to make the steps from input data to vectorizer to training a model easier, `sklearn` provides a `Pipeline`.

In [42]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

vectorizer = CountVectorizer(binary=True)
#vectorizer = TfidfVectorizer()
clf = LogisticRegression()
classifier = Pipeline( [('vec', vectorizer),
                        ('clf', clf)] )
print(clf)
classifier.fit(twenty_train.data, twenty_train.target)
y_predicted = classifier.predict(twenty_test.data)
print(accuracy_score(twenty_test.target, y_predicted))

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.906790945406


### Writing your own vectorizer


With the `DictVectorizer` you can add your own features; you have full control.
As the name suggests, it expects dictionaries, where the keys are your feature names and the values are the feature values (binary, frequencies, or whatever you want to use).

In [48]:
#todo
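
As a starting point, here is a minimal sketch of how a hand-written featurizer could be combined with the `DictVectorizer`. The function `featurize` and its feature names (`word=...`, `n_tokens`, `has_question_mark`) are hypothetical choices for illustration, not part of sklearn.

```python
from sklearn.feature_extraction import DictVectorizer

def featurize(text):
    """Turn a raw text into a {feature name: value} dictionary (illustrative features)."""
    tokens = text.lower().split()
    features = {"word=" + token: 1 for token in tokens}   # binary word indicators
    features["n_tokens"] = len(tokens)                     # a numeric feature
    features["has_question_mark"] = int("?" in text)       # a hand-crafted indicator
    return features

# Toy documents for illustration only.
docs = ["Is this the correct group ?", "Thanks in advance"]

dict_vectorizer = DictVectorizer()
X = dict_vectorizer.fit_transform([featurize(doc) for doc in docs])

print(X.shape)
print(sorted(dict_vectorizer.vocabulary_))
```

As with the other vectorizers, you would call `fit_transform` on the training dictionaries and only `transform` on the test dictionaries.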

## References

* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
* http://scikit-learn.org/stable/modules/feature_extraction.html