# Working with text data

The 20 Newsgroups dataset is a collection of roughly 20,000 newsgroup documents spread across 20 categories.

The imported *fetch_20newsgroups* function is a built-in dataset loader. In this example we load only a subset of four news categories. The function returns a scikit-learn object called a "Bunch": a dict-like holder whose fields can be accessed either as dict keys or as object attributes. For example, the target_names field is retrieved as an attribute in the last line below.

This notebook follows a tutorial on the sklearn site; many others are available there.

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', 
                                  categories=categories, shuffle=True, random_state=42)
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
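
To make the two access styles concrete, here is a minimal sketch that builds a small Bunch by hand. The contents of `toy` are made up for illustration and are not part of the dataset.

```python
from sklearn.utils import Bunch

# A hand-built Bunch with made-up contents, only to show the two access styles.
toy = Bunch(data=["first document", "second document"],
            target_names=["comp.graphics", "sci.med"])

# Attribute access and dict-style access return the same object.
print(toy.target_names)
print(toy["target_names"])

# Because a Bunch behaves like a dict, its fields can be listed with keys().
print(sorted(toy.keys()))
```

The same pattern applies to the object returned by the loader, so `twenty_train["target_names"]` should be equivalent to `twenty_train.target_names`.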

The document texts are stored in the *data* attribute, and the corresponding filenames are available in the *filenames* attribute.

In [2]:
print(len(twenty_train.data))
print(len(twenty_train.filenames))

2257
2257


Next we print the first three lines of the first document in the training set, followed by its label, which is also the name of its folder.

In [3]:
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.target_names[twenty_train.target[0]])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
comp.graphics


For efficiency the target is stored as an integer id.

In [4]:
twenty_train.target[:10]

array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

However, we can look up the category names like this:

In [5]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med
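
As an alternative to the loop, the ids can be mapped to names in one step with NumPy integer indexing. This is a small sketch that hard-codes the ten ids and four category names shown above, so it runs without downloading the dataset.

```python
import numpy as np

# Category names and the first ten target ids, copied from the outputs above.
target_names = np.array(['alt.atheism', 'comp.graphics',
                         'sci.med', 'soc.religion.christian'])
target = np.array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

# Indexing an array with an integer array looks up every id at once.
labels = target_names[target]
print(labels)

# Count how many of the ten documents fall into each category.
names, counts = np.unique(labels, return_counts=True)
for name, count in zip(names, counts):
    print(name, count)
```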


### Preprocessing text

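To train a classifier, the text must first be turned into numerical feature vectors. CountVectorizer tokenizes each document, builds a dictionary that maps every word to an integer id, and counts how often each word occurs in each document. The result is a scipy.sparse matrix, which stores only the nonzero counts. Its shape shows that we have 2257 documents and a vocabulary of 35788 unique words.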

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)
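
To see what the count matrix contains, the following sketch runs the same vectorizer on a two-document toy corpus. The documents and the names `toy_docs`, `toy_vect`, and `toy_counts` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two tiny made-up documents.
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# Rows are documents, columns are vocabulary words.
print(toy_counts.shape)

# Mapping from each word to its column index (assigned in alphabetical order).
print(toy_vect.vocabulary_)

# Dense view of the counts; only sensible for very small matrices.
print(toy_counts.toarray())

# Number of entries actually stored in the sparse matrix.
print(toy_counts.nnz)
```

On the real corpus most entries are zero, since each document uses only a small part of the vocabulary, which is why a sparse format is used.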

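As an example, let's look up the word *algorithm* in the vocabulary. The value returned is the word's integer id, that is, its column index in the count matrix, not its count: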

In [7]:
count_vect.vocabulary_.get('algorithm')

4690
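
Note that `vocabulary_` maps each word to its column index in the count matrix rather than to a count. To get actual counts, that index can be used to select a column. The sketch below does this on a made-up three-document corpus; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus in which "algorithm" occurs three times across two documents.
toy_docs = ["sorting algorithm",
            "a faster algorithm beats a slower algorithm",
            "no match here"]

vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)

# Column index of the word in the count matrix.
col = vect.vocabulary_.get("algorithm")

# Summing that column gives the total number of occurrences in the corpus.
total = counts[:, col].sum()

# The number of stored (nonzero) entries in the column is the number of
# documents that contain the word.
docs_with_word = counts[:, col].nnz

# Words that were never seen during fitting are simply absent.
missing = vect.vocabulary_.get("zebra")

print(col, total, docs_with_word, missing)
```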

### Term frequencies

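Raw counts favor long documents, so instead we use term frequencies (tf): the counts divided by the total number of words in the document. The tf-idf weighting goes one step further and downweights words that occur in many documents of the corpus, since such words carry less information about the category.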

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35788)
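
The sketch below reproduces the transformer's output by hand on a small made-up count matrix. It assumes the default settings (`smooth_idf=True`, `norm='l2'`), under which scikit-learn computes the idf as ln((1 + n) / (1 + df)) + 1, multiplies the raw counts by it, and then scales each row to unit length instead of dividing by the document length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 6 documents (rows) and 3 words (columns).
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]])

n_docs = counts.shape[0]

# Document frequency: in how many documents does each word appear?
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then normalize every row to unit Euclidean length.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result.
from_sklearn = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The first word appears in every document, so it receives the smallest idf and contributes least to each row after weighting.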

### Train a classifier

We next train a Naive Bayes classifier.

In [9]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

Try the classifier on some new data.

In [10]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print(doc, '=> ', twenty_train.target_names[category])


God is love =>  soc.religion.christian
OpenGL on the GPU is fast =>  comp.graphics
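
Note that the new documents are passed through `transform`, not `fit_transform`, so they are mapped onto the vocabulary learned from the training set. The sketch below illustrates the effect on a made-up corpus: the new matrix keeps the training columns, and words that were not seen during fitting are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and new documents.
train_docs = ["the gpu renders graphics", "the doctor prescribes medicine"]
new_docs = ["the gpu renders holograms"]

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)   # learns the vocabulary
X_new = vect.transform(new_docs)           # reuses it unchanged

# Both matrices have the same number of columns.
print(X_train.shape, X_new.shape)

# "holograms" was not in the training data, so it has no column ...
print("holograms" in vect.vocabulary_)

# ... and only the three known words are counted.
print(X_new.sum())
```

Calling `fit_transform` on the new documents instead would build a different vocabulary, and the classifier's feature columns would no longer line up.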


### Create a pipeline

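We recreate the classifier using a pipeline, which chains the vectorizer, the tf-idf transformer, and the classifier into a single object. The pipeline is fitted and applied directly to raw text, so the intermediate steps no longer have to be called by hand.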

In [11]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB())])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
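
As a sanity check, the following sketch fits the three steps manually and as a pipeline on a tiny made-up corpus and compares the predictions. The documents, labels, and variable names are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: label 0 = graphics, label 1 = medicine.
docs = ["opengl shaders render graphics",
        "gpu rendering of images",
        "the patient needs medicine",
        "doctors treat the disease"]
labels = [0, 0, 1, 1]
new_docs = ["render images on the gpu", "medicine for the disease"]

# Manual version: each step is fitted and applied explicitly.
vect = CountVectorizer()
tfidf = TfidfTransformer()
nb = MultinomialNB()
X = tfidf.fit_transform(vect.fit_transform(docs))
nb.fit(X, labels)
manual_pred = nb.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline version: the same steps behind one fit/predict interface.
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))

# Individual steps remain reachable by the names given above.
print(type(pipe.named_steps['clf']).__name__)
```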

### Evaluation on the test set

In [12]:
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
        categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)

0.83488681757656458

The classifier reaches about 83% accuracy on the test set. Not bad for such a simple model.
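
Accuracy alone does not show which categories get confused with each other. The sketch below uses made-up label arrays (not the newsgroup results) to show that the mean of the element-wise comparison matches `accuracy_score`, and how `confusion_matrix` breaks the errors down per class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for three classes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Fraction of correct predictions, computed two equivalent ways.
acc_mean = np.mean(y_pred == y_true)
acc_sklearn = accuracy_score(y_true, y_pred)
print(acc_mean, acc_sklearn)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

On the real data, the same calls could be applied to `twenty_test.target` and `predicted`.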
