# Working with text data

## Links

* http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## Get data

Scikit-learn has a built-in data loader that fetches postings from the 20 newsgroups dataset.

In [3]:
import textwrap
import sklearn

import warnings
warnings.filterwarnings('ignore')
import pandas as pd

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

import numpy as np

This notebook uses four of the categories

In [4]:
categories = [
    'alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

Retrieve the training data for the four newsgroups

In [5]:
newsgroups_train = fetch_20newsgroups(
    subset='train', categories=categories, shuffle=True, random_state=42)

The names of the newsgroups are in the *target_names* member

In [6]:
newsgroups_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

Each newsgroup has a numerical id in the *target* member

In [7]:
newsgroups_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

Create functions to display the first lines of a newsgroup post and its category

In [8]:
def get_post(newsgroups, post_idx, lines=2):
    text = newsgroups.data[post_idx].split('\n')
    return '\n'.join(text[:min(lines, len(text))])
    
def display_category(newsgroups, post_idx):
    print(newsgroups.target_names[newsgroups.target[post_idx]])

Display the first three posts: category and text

In [9]:
for idx in range(3):
    display_category(newsgroups_train, idx)
    print(textwrap.indent(get_post(newsgroups_train, idx), '\t'))

comp.graphics
	From: sd345@city.ac.uk (Michael Collier)
	Subject: Converting images to HP LaserJet III?
comp.graphics
	From: ani@ms.uky.edu (Aniruddha B. Deglurkar)
	Subject: help: Splitting a trimming region along a mesh 
soc.religion.christian
	From: djohnson@cs.ucsd.edu (Darin Johnson)
	Subject: Re: harrassed at work, could use some prayers


Display the total number of posts

In [10]:
len(newsgroups_train.target)

2257

Display the number of posts in each category

In [11]:
name_map = dict(zip(range(4), newsgroups_train.target_names))

srs = pd.Series(newsgroups_train.target, dtype='category')
srs.cat.rename_categories(name_map).value_counts()

soc.religion.christian    599
sci.med                   594
comp.graphics             584
alt.atheism               480
dtype: int64

## Convert the newsgroup posts to features

### Calculate occurrences of words

Count the occurrences of each word in each document

In [12]:
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(newsgroups_train.data)
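
To make the counting step concrete, here is a minimal pure-Python sketch of a bag-of-words count. It is not how *CountVectorizer* is implemented; it only imitates its default behaviour of lowercasing the text and keeping tokens of two or more word characters. The toy corpus and the `tokenize` helper are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not taken from the 20 newsgroups data.
docs = ["The cat sat on the mat", "The dog chased the cat"]

def tokenize(text):
    # Lowercase the text and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = {token for doc in docs for token in tokenize(doc)}
vocabulary = {word: col for col, word in enumerate(sorted(all_tokens))}

# One row per document, one column per vocabulary word.
counts = []
for doc in docs:
    freq = Counter(tokenize(doc))
    counts.append([freq.get(word, 0) for word in vocabulary])

print(list(vocabulary))
print(counts)
```

Each row has one entry per vocabulary word, so most entries are zero once the vocabulary grows, which is why the real matrix is stored in sparse form.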

Display the dimensions of the sparse matrix

In [13]:
print('(number of documents, vocabulary size) = ', X_train_counts.shape)

(number of documents, vocabulary size) =  (2257, 35788)


Display the number of non-zero entries, which shows how sparse the matrix is

In [14]:
total = X_train_counts.shape[0] * X_train_counts.shape[1]
print('There are {:,} out of {:,} entries'.format(X_train_counts.count_nonzero(), total))

There are 365,886 out of 80,773,516 entries
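
As a quick arithmetic check, the figures reported above can be turned into a density and an average number of distinct terms per post. The sketch below only reuses the numbers printed by the notebook; the variable names are illustrative.

```python
# Figures copied from the notebook output above.
n_docs, n_terms = 2257, 35788
non_zero = 365886

total = n_docs * n_terms
density = non_zero / total              # fraction of entries that are non-zero
avg_terms_per_doc = non_zero / n_docs   # distinct vocabulary terms per post

print('Density = {:.3%}'.format(density))
print('Average distinct terms per post = {:.0f}'.format(avg_terms_per_doc))
```

Fewer than one entry in a hundred is non-zero, so a dense array would spend almost all of its memory on zeros.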


Display the words in one post (the post at index 10)

In [24]:
post_id = 10
doc_idx, word_idx = X_train_counts.getrow(post_id).nonzero()
# Vocabulary terms, indexed by column number
all_words = count_vect.get_feature_names_out()
words = [all_words[i] for i in word_idx]
print(', '.join(sorted(words)))

03, 08, 1993, 1am, 21, 24855, 25, 53, 63, 73, about, accept, across, after, afterlife, again, age, all, allegory, alps, also, am, an, anasaz, anasazi, and, answer, answers, any, anyone, anything, apologize, are, article, as, ask, associated, at, athos, attain, attempts, away, az, bad, be, beacuse, beliefs, believing, between, bible, both, bringing, brought, browbeat, but, by, called, can, cannot, central, certain, certainly, chose, christ, christian, christianity, christians, chromosome, churches, close, com, come, comprise, conflict, consider, contemplate, correct, could, credit, crossed, cs, decent, detailed, did, difficult, dirty, dis, discriminatory, discussion, dislike, disparages, do, documents, does, don, done, driving, dussik, eastern, edited, edu, else, english, enter, entire, ess, etc, even, everyone, evil, except, faith, far, father, feel, felt, female, few, finds, fit, for, form, frame, from, further, furthermore, get, give, given, god, grace, greatest, had, hannibal, happe

Display the text of the same post

In [27]:
newsgroups_train.data[post_id].replace('\n', '')

'From: anasaz!karl@anasazi.com (Karl Dussik)Subject: Re: Is "Christian" a dirty word?Organization: Anasazi Inc Phx Az USALines: 73In article <Mar.25.03.53.08.1993.24855@athos.rutgers.edu> @usceast.cs.scarolina.edu:moss@cs.scarolina.edu (James Moss) writes:>I was brought up christian, but I am not christian any longer.>I also have a bad taste in my mouth over christianity.  I (in>my own faith) accept and live my life by many if not most of the>teachings of christ, but I cannot let myself be called a christian,>beacuse to me too many things are done on the name of christianity,>that I can not be associated with.  A question for you - can you give me the name of an organization or aphilosophy or a political movement, etc., which has never had anythingevil done in its name?  You\'re missing a central teaching of Christianity -man is inherently sinful.  We are saved through faith by grace.  Knowingthat, believing that, does not make us without sin.  Furthermore, not allwho consider themselv

### Convert occurrences to tf-idf

Because a word is likely to occur more often in a long document than in a short one, the number of occurrences is divided by the total number of words in the document to give the term frequency **tf**. This is then multiplied by the inverse document frequency **idf**, which decreases as the number of documents containing the word increases. Terms that occur in few documents therefore get higher weights than terms that occur in many documents.

The *TfidfTransformer* can calculate the tf-idf values from the word counts.

In [28]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
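
The weighting can be reproduced by hand on a tiny matrix. The sketch below assumes the transformer's default settings, where raw counts are multiplied by a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit Euclidean length (a normalisation that plays the role of dividing by document size). The count matrix is made up for illustration.

```python
import math

# Hypothetical toy count matrix: 2 documents x 3 terms (not from the notebook).
counts = [[2, 1, 0],
          [2, 0, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# Weight the counts by idf, then scale each row to unit Euclidean length.
tfidf = []
for row in counts:
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])

print(idf)
print(tfidf)
```

The first term appears in both documents and gets the lowest idf, while the two terms that each appear in a single document are weighted up.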

The tf-idf values for the first document are shown below

In [29]:
doc_num = 0
row = X_train_tfidf.getrow(doc_num)
doc_idx, word_idx = row.nonzero()
tfidf = row.toarray().ravel()
tfidf[word_idx]

array([0.13487106, 0.03127031, 0.03448147, 0.04334165, 0.099941  ,
       0.06567578, 0.09313008, 0.11819702, 0.07283774, 0.12806013,
       0.02387114, 0.08865416, 0.04910237, 0.10218403, 0.08631915,
       0.01679781, 0.06866113, 0.13487106, 0.04727158, 0.06899051,
       0.08497461, 0.03429071, 0.11947938, 0.04935883, 0.0979625 ,
       0.07802072, 0.01776232, 0.01824994, 0.03637492, 0.19622799,
       0.10960586, 0.01686489, 0.03590436, 0.24645541, 0.03478259,
       0.01996488, 0.04525255, 0.01894546, 0.1232277 , 0.0744441 ,
       0.18626015, 0.1232277 , 0.17358472, 0.03604415, 0.11382739,
       0.05417404, 0.04270369, 0.01679781, 0.07691883, 0.13635772,
       0.06666452, 0.1256015 , 0.13487106, 0.03779319, 0.03104295,
       0.06350566, 0.21567206, 0.07830787, 0.04967185, 0.38418039,
       0.31440007, 0.08413454, 0.043162  , 0.03109515, 0.03900412,
       0.06866113, 0.12491818, 0.13487106, 0.10783603, 0.24645541,
       0.05966162, 0.13487106, 0.25612026])

## Train a classifier

### Train a Naive Bayes classifier

In [30]:
clf = MultinomialNB().fit(X_train_tfidf, newsgroups_train.target)

Create features for two example documents

In [31]:
doc_examples = ['God is love', 'OpenGL on the GPU is fast']
X_example_counts = count_vect.transform(doc_examples)
X_example_tfidf = tfidf_transformer.transform(X_example_counts)

Use the trained classifier to predict the newsgroup category

In [32]:
predictions = clf.predict(X_example_tfidf)
for prediction, doc_example in zip(predictions, doc_examples):
    print("{} - {}".format(doc_example, newsgroups_train.target_names[prediction]))

God is love - soc.religion.christian
OpenGL on the GPU is fast - comp.graphics


## Build a Naive Bayes pipeline

Build a three-stage pipeline: word counts, tf-idf weighting, and the classifier

In [33]:
text_clf = Pipeline(
    [('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', MultinomialNB())])

Use the pipeline to train the model

In [34]:
text_clf.fit(newsgroups_train.data, newsgroups_train.target)  

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

### Evaluate the performance on a test set

Get a test set

In [35]:
newsgroups_test = fetch_20newsgroups(
    subset='test', categories=categories, shuffle=True, random_state=42)

Predict the category of the test set

In [36]:
prediction = text_clf.predict(newsgroups_test.data)

Calculate the accuracy

In [37]:
print('Accuracy = {:.2f}'.format(np.mean(prediction == newsgroups_test.target) * 100))

Accuracy = 83.49


## Build a linear SVM pipeline

Build a three-stage pipeline with the same feature steps and a linear SVM trained by stochastic gradient descent

In [38]:
text_clf = Pipeline(
    [('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                           alpha=1e-3, random_state=42,
                           max_iter=5, tol=None))])

### Evaluate the performance

In [39]:
text_clf.fit(newsgroups_train.data, newsgroups_train.target)  
predicted = text_clf.predict(newsgroups_test.data)
print('Accuracy = {:.2f}'.format(np.mean(predicted == newsgroups_test.target) * 100))

Accuracy = 91.28


### Display a detailed performance report

Display the precision and recall of the classifier

In [40]:
print(metrics.classification_report(newsgroups_test.target, predicted,
    target_names=newsgroups_test.target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.95      0.81      0.87       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.94      0.90      0.92       396
soc.religion.christian       0.90      0.95      0.93       398

           avg / total       0.92      0.91      0.91      1502



The test set contains 319 posts in the alt.atheism category (the support column above). Count the number of posts predicted in each category

In [47]:
srs = pd.Series(predicted, dtype='category')
srs.cat.rename_categories(name_map).value_counts()

comp.graphics             433
soc.religion.christian    420
sci.med                   377
alt.atheism               272
dtype: int64

The classifier predicted 272 posts in the alt.atheism category. The confusion matrix shows the correctly predicted counts on the diagonal (from top left to bottom right) and the misclassified counts off the diagonal

In [97]:
metrics.confusion_matrix(newsgroups_test.target, predicted)

array([[258,  11,  15,  35],
       [  4, 379,   3,   3],
       [  5,  33, 355,   3],
       [  5,  10,   4, 379]])

The true categories are in the rows and the predicted categories are in the columns.

* For atheism there are 258 true positives out of 319 actual posts, giving a recall of 0.81.
* For atheism there are 258 true positives out of 272 predicted posts, giving a precision of 0.95.
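
The same calculation can be repeated for every category directly from the confusion matrix above. This is a minimal sketch in plain Python that copies the matrix from the notebook output; row sums give the actual counts and column sums give the predicted counts.

```python
# Confusion matrix copied from the output above:
# rows are true categories, columns are predicted categories.
labels = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
cm = [[258,  11,  15,  35],
      [  4, 379,   3,   3],
      [  5,  33, 355,   3],
      [  5,  10,   4, 379]]

results = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    actual = sum(cm[i])                          # row sum
    predicted_count = sum(row[i] for row in cm)  # column sum
    precision = true_positives / predicted_count
    recall = true_positives / actual
    results[label] = (precision, recall)
    print('{:<24} precision = {:.2f}  recall = {:.2f}'.format(
        label, precision, recall))

# Accuracy is the sum of the diagonal divided by the total number of posts.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print('Accuracy = {:.2f}'.format(accuracy * 100))
```

The values agree with the classification report and with the accuracy printed earlier for the linear SVM pipeline.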

## Tune parameters using grid search

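The best parameter values can be found by searching a grid of candidate values with the *GridSearchCV* class. Each parameter name is prefixed with the name of the pipeline step it belongs to, followed by a double underscore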

In [109]:
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-3, 1e-4)
}
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(newsgroups_train.data, newsgroups_train.target)
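
To see how large the search is, the candidate settings can be enumerated by hand. The sketch below is not what *GridSearchCV* runs internally; it only uses `itertools.product` to list every combination of the values in the `parameters` dictionary above, each of which is then cross-validated by the search.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-3, 1e-4),
}

# Cartesian product of the value lists, one dict per candidate setting.
names = sorted(parameters)
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

print('{} candidate settings'.format(len(grid)))
for candidate in grid:
    print(candidate)
```

With two values for each of three parameters the grid has 2 x 2 x 2 = 8 candidates, so the cost of the search grows multiplicatively with every parameter added.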

Calculate the accuracy on the test set using the classifier with the best parameters

In [110]:
predicted = gs_clf.predict(newsgroups_test.data)
print('Accuracy = {:.2f}'.format(np.mean(predicted == newsgroups_test.target) * 100))

Accuracy = 93.01


Display the best parameters

In [120]:
gs_clf.best_params_

{'clf__alpha': 0.0001, 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}