# The 20 newsgroups text dataset (sklearn)


The **[20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/)** is a common benchmark dataset in machine learning. It comprises around 18,000 newsgroup posts on 20 topics. The `sklearn` package offers functionality to easily download, parse and analyze this dataset. The set has already been split into a training and a test set, based on whether a message was posted before or after a specific date.


## Loading the data

Let's download the dataset (~14MB) to disk. It will be saved in your `~/scikit_learn_data/20news_home` folder.

In [1]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
print(newsgroups_train.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


We just called `fetch_20newsgroups`, which returns a dictionary-like object whose `data` attribute holds the raw texts. You could also call `fetch_20newsgroups_vectorized`, which returns ready-to-use features for modelling, so that a feature extractor such as `CountVectorizer` is not needed.

The `filenames` and `target` attributes hold, for each post, the path of its file on disk and its label. The `target` value is the integer index of the category in `target_names`.

In [2]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)
print(newsgroups_train.filenames[:5])
print(newsgroups_train.target[:5])
# Map the integer labels back to their category names
print([newsgroups_train.target_names[no] for no in newsgroups_train.target[:5]])

(11314,) (11314,)
[ '/Users/ruben/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994'
 '/Users/ruben/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861'
 '/Users/ruben/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51879'
 '/Users/ruben/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38242'
 '/Users/ruben/scikit_learn_data/20news_home/20news-bydate-train/sci.space/60880']
[ 7  4  4  1 14]
['rec.autos', 'comp.sys.mac.hardware', 'comp.sys.mac.hardware', 'comp.graphics', 'sci.space']


The actual raw text can be found in the `data` attribute:

In [3]:
newsgroups_train.data[0]

u"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

It is possible to load only a sub-selection of the categories:

In [4]:
categories = ['alt.atheism', 'sci.space', 'talk.religion.misc', 'comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
print(newsgroups_train.target_names)
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
(2034,) (2034,)


In [5]:
len(newsgroups_train.data)

2034

## Converting text to vectors

`sklearn` comes with many built-in feature extraction and manipulation tools. For text data, the `sklearn.feature_extraction.text` module provides the **`CountVectorizer`**, which we mentioned above, and the `TfidfVectorizer`, which we discuss further below.

### `CountVectorizer`

This class transforms an array-like (list, dataframe column, array) of strings into a matrix where each column represents a token (word or phrase) and each row represents a sample.

For example, if we had a two-element array ["Hello good day", "Good day to you"], we would create a matrix with 2 rows (one for each sample) and 5 columns (one for every unique word). The matrix would look like this:

hello|good|day|to|you
--|--|--|--|--
1|1|1|0|0
0|1|1|1|1
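
The toy example above can be reproduced directly. This is a minimal sketch; the variable names are illustrative. Note that `CountVectorizer` lowercases the text and assigns columns in alphabetical order, so the column order differs from the table.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Hello good day", "Good day to you"]

toy_cv = CountVectorizer()
toy_matrix = toy_cv.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort tokens by that index
vocabulary = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
counts = toy_matrix.toarray().tolist()

print(vocabulary)  # ['day', 'good', 'hello', 'to', 'you']
print(counts)      # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```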

The `CountVectorizer` (and most feature extraction methods in sklearn) follows a very simple interface:
- `fit` takes a dataset and learns the features it's trying to extract. In this case, that means the algorithm learns the vocabulary of all samples.
- `transform` takes a dataset and produces the matrix as described above, based on the vocabulary (or feature elements) it learned.
- `fit_transform` combines the two steps at once.

For example, you may want to fit a vocabulary to a training set, transform the training set to train a model and then continually transform any new incoming examples you want to classify. You will generally only perform the fit step once but the transform step many times for any new datasets.
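
The fit-once, transform-many pattern might look as follows on toy data (the documents are made up for illustration). Words that were not in the fitted vocabulary are silently ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["Hello good day", "Good day to you"]
new_docs = ["Good morning to you"]  # "morning" was never seen during fit

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)              # learn the vocabulary once

X_new = vectorizer.transform(new_docs)  # reuse it for any new data
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
new_counts = X_new.toarray().tolist()

print(columns)     # ['day', 'good', 'hello', 'to', 'you']
print(new_counts)  # [[0, 1, 0, 1, 1]]
```

The new matrix has the same five columns as the training matrix, which is what allows a model trained on the training matrix to score new samples.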

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X_train = cv.fit_transform(newsgroups_train.data)
X_train

<2034x34118 sparse matrix of type '<type 'numpy.int64'>'
	with 323433 stored elements in Compressed Sparse Row format>

Note that the returned matrix is a sparse matrix: almost all of its entries are zero. Only the non-zero entries are stored, which takes far less memory than an ordinary dense array.
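
To get a feel for what "sparse" means here, the sketch below builds a small SciPy CSR matrix by hand (the toy values are arbitrary) and then computes the density of the matrix reported in the output above from its shape and number of stored elements.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
])
sparse = csr_matrix(dense)

# Only the non-zero entries are stored
print(sparse.nnz)  # 3
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(density)     # 0.25

# Converting back gives the original dense array
round_trip = sparse.toarray()

# Density of the 2034 x 34118 matrix with 323433 stored elements
reported_density = 323433 / (2034 * 34118)
print(round(reported_density, 4))  # 0.0047
```

Fewer than 0.5% of the entries in the document-term matrix are non-zero, which is why a dense representation would be wasteful.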

Besides counting single words, we can set `ngram_range` to count sequences of consecutive words as well. With `ngram_range=(1,3)`, the features are all sequences of one, two, or three words.

In [7]:
cv = CountVectorizer(ngram_range=(1,3))
X_train = cv.fit_transform(newsgroups_train.data)
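
To see which features an n-gram range produces, here is a small sketch on a single toy sentence, using `ngram_range=(1, 2)` to keep the output short.

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_cv = CountVectorizer(ngram_range=(1, 2))
bigram_cv.fit(["Hello good day"])

# Tokens ordered by their column index
features = sorted(bigram_cv.vocabulary_, key=bigram_cv.vocabulary_.get)
print(features)  # ['day', 'good', 'good day', 'hello', 'hello good']
```

Each extra n-gram length adds many new columns, so the vocabulary grows quickly on a real corpus.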

### `TfidfVectorizer`

`Tfidf` stands for _Term Frequency - Inverse Document Frequency_, or TF-IDF representation.

- The _Term Frequency_ is simply the number of times that a word appears in a sample. This is equivalent to the `CountVectorizer` features, and is our most basic representation of text.
- The _Document Frequency_ is the fraction of samples that a particular word appears in. Note that 'document' means sample here. For example, you could expect `the` to appear in nearly 100% of samples, while a word like `Syria` would have a low document frequency. The _Inverse Document Frequency_ is simply 1 / Document Frequency (although frequently the log is also taken).

The TF-IDF representation computes the Term Frequency / Document Frequency ratio, i.e., Term Frequency × Inverse Document Frequency. Words that appear often in a specific sample and in few other samples get a high score. The intuition is that these words are characteristic of that sample.
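
The definitions above can be worked through by hand. The sketch below uses plain Python on two toy documents and computes both the simple ratio and the logged variant. The helper `tf_idf` is hypothetical and written only for this illustration. By default `TfidfVectorizer` uses a smoothed logarithmic IDF and normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = [["hello", "good", "day"], ["good", "day", "to", "you"]]
n_docs = len(docs)

# Document frequency: fraction of documents that contain each word
doc_counts = Counter(word for doc in docs for word in set(doc))
df = {word: count / n_docs for word, count in doc_counts.items()}

def tf_idf(doc, use_log=False):
    """Score each word of one document as term frequency times IDF."""
    scores = {}
    for word, count in Counter(doc).items():
        idf = 1.0 / df[word]
        if use_log:
            idf = math.log(idf)
        scores[word] = count * idf
    return scores

plain = tf_idf(docs[0])
logged = tf_idf(docs[0], use_log=True)

print(plain)  # {'hello': 2.0, 'good': 1.0, 'day': 1.0}
print(logged["good"])  # 0.0
```

Here `hello` scores highest for the first document because it appears in only one of the two documents. With the unsmoothed log, words that occur in every document (`good`, `day`) drop to zero.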

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer(stop_words='english', ngram_range=(1,3))
X_train = tfidfv.fit_transform(newsgroups_train.data)

## Naive Bayes

Let's combine the above text representations with a Naive Bayes model. (Please see the other course materials for an extensive explanation of Bayes' Theorem and the Naive Bayes algorithm.)

In [9]:
from sklearn.naive_bayes import MultinomialNB
# `sklearn.cross_validation` was removed in scikit-learn 0.20; `model_selection` replaces it
from sklearn.model_selection import cross_val_score

model = MultinomialNB()
cross_val_score(model, X_train, newsgroups_train.target, cv=5)
# model.fit(X_train, newsgroups_train.target)

array([ 0.91421569,  0.93382353,  0.93120393,  0.92364532,  0.92839506])
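
One caveat about the cell above: the vectorizer was fitted on the full training set before cross-validation, so each held-out fold has already influenced the vocabulary and the IDF weights. A possible way around this is to wrap the vectorizer and the model in a pipeline, so that the vectorizer is refitted within each fold. The sketch below assumes scikit-learn 0.18 or later and uses a tiny made-up corpus purely to stay self-contained; the texts and labels are not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy corpus: label 0 = space, label 1 = graphics
texts = [
    "the rocket launch was delayed", "nasa plans a new orbit mission",
    "the shuttle reached orbit", "a satellite launch by nasa",
    "render the image with shading", "the polygon mesh uses textures",
    "graphics card renders the image", "shading and textures in 3d graphics",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The vectorizer is fitted on the training part of each fold only
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(pipeline, texts, labels, cv=2)
print(scores.shape)  # (2,)
```

With the real data, one would pass `newsgroups_train.data` and `newsgroups_train.target` instead of the toy lists.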

Let's now try the same model with `CountVectorizer` features instead of TF-IDF features.

In [10]:
cv = CountVectorizer(stop_words='english', ngram_range=(1,3))
X_train = cv.fit_transform(newsgroups_train.data)
cross_val_score(model, X_train, newsgroups_train.target, cv=5)

array([ 0.93382353,  0.9754902 ,  0.96314496,  0.95566502,  0.97037037])

Let's fit the model and analyze the coefficients.

In [11]:
model.fit(X_train, newsgroups_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [12]:
print(newsgroups_train.target_names)
print(model.classes_)
print(model.class_count_)
print(model.class_log_prior_)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
[0 1 2 3]
[ 480.  584.  593.  377.]
[-1.44397347 -1.24785859 -1.23256518 -1.68551439]
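
The log priors follow directly from the class counts. Assuming the default settings shown in the model's repr above (`fit_prior=True`, `class_prior=None`), each prior is the log of the class's share of the training samples, which can be checked with NumPy:

```python
import numpy as np

# Class counts copied from the output above
class_count = np.array([480.0, 584.0, 593.0, 377.0])

# Log of each class's fraction of the 2034 training samples
class_log_prior = np.log(class_count / class_count.sum())

print(class_count.sum())  # 2034.0
print(class_log_prior)
```

The printed values should agree with `model.class_log_prior_` up to rounding.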


For each class, the model stores the raw count of every feature and the corresponding smoothed log probability of that feature given the class.

In [13]:
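print(model.feature_count_)
# `coef_` is no longer available on MultinomialNB in recent scikit-learn releases;
# with more than two classes, `feature_log_prob_` holds the same values
print(model.feature_log_prob_)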

[[  3.   0.   0. ...,   0.   0.   0.]
 [ 34.   2.   1. ...,   0.   0.   0.]
 [ 42.   1.   1. ...,   0.   0.   0.]
 [  4.   0.   0. ...,   1.   1.   1.]]
[[-12.14041884 -13.5267132  -13.5267132  ..., -13.5267132  -13.5267132
  -13.5267132 ]
 [ -9.97298563 -12.4297214  -12.83518651 ..., -13.52833369 -13.52833369
  -13.52833369]
 [ -9.86476123 -12.93281416 -12.93281416 ..., -13.62596134 -13.62596134
  -13.62596134]
 [-11.85321754 -13.46265545 -13.46265545 ..., -12.76950827 -12.76950827
  -12.76950827]]
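
The two arrays above are linked by additive smoothing: with `alpha=1.0`, the log probability of a feature is the log of (count + 1) divided by the smoothed total for that class. The sketch below checks this on a small made-up count matrix, and then against two values from the output above (a count of 3 versus a count of 0 in the first class should differ by log 4).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count matrix: 4 samples, 3 features, 2 classes (values are arbitrary)
X = np.array([[2, 1, 0], [1, 0, 0], [0, 1, 3], [0, 0, 2]])
y = np.array([0, 0, 1, 1])

nb = MultinomialNB(alpha=1.0).fit(X, y)

# Add alpha to every count, then normalize per class and take the log
smoothed = nb.feature_count_ + 1.0
manual_log_prob = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
print(np.allclose(manual_log_prob, nb.feature_log_prob_))  # True

# Same relation in the notebook output: count 3 vs. count 0 in class 0
gap = -12.14041884 - (-13.5267132)
print(np.isclose(gap, np.log(4), atol=1e-6))  # True
```

Features that never occur in a class therefore all share the same minimum log probability for that class.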


In [14]:
model.feature_log_prob_.shape

(4, 498763)

In [15]:
len(cv.get_feature_names_out())

498763

In [16]:
import pandas as pd
# One row per feature, one column per newsgroup
coef = pd.DataFrame(model.feature_log_prob_, columns=cv.get_feature_names_out(), index=newsgroups_train.target_names).T

In [17]:
top = 4
for newsgroup in coef:
    # `DataFrame.sort` was removed from pandas; `sort_values` replaces it
    s = coef[[newsgroup]].sort_values(newsgroup)
    print("%-20s  (+) %s" % (newsgroup, ", ".join(s.iloc[-top:].index)))
    print("%-20s  (-) %s" % ("", ", ".join(s.iloc[:top].index)))

alt.atheism           (+) writes, people, god, edu
                      (-) know good book, noticed article messianic, noticed article, noticeably worse viewers
comp.graphics         (+) graphics, subject, lines, edu
                      (-) ºnd sun eclipsed, launch coverage, launch countdown abort, launch countdown
sci.space             (+) subject, nasa, edu, space
                      (-) know good book, multiverse feel, multiverse, multitudes commoners said
talk.religion.misc    (+) subject, god, com, edu
                      (-) know good book, logically conclusion example, logically conclusion, talking natural


## Further reading 

- [20 Newsgroups Dataset](http://qwone.com/~jason/20Newsgroups/)
- [sklearn's 20 newsgroups](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)
- [sklearn's `CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [sklearn's `TfidfVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [sklearn's `MultinomialNB`](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)