In this example, we use the Naive Bayes implementation from the scikit-learn machine learning library to classify newsgroup postings. We have chosen two newsgroups from the 20 Newsgroups dataset that scikit-learn provides (alt.atheism and sci.med), and we use Naive Bayes to predict which newsgroup a particular posting comes from.

In [1]:
from sklearn.datasets import fetch_20newsgroups
import numpy as np
from sklearn.naive_bayes import MultinomialNB
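from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection also provides cross_val_score
from sklearn import model_selection as cross_validation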
from sklearn.feature_extraction.text import TfidfVectorizer

cats = ['alt.atheism', 'sci.med']
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=cats)



In [2]:
# newsgroups.data is a plain Python list, so it has no .shape attribute; use len() instead
len(newsgroups.data)

1789

We first load the newsgroup data using the utility function provided by scikit-learn (this downloads the dataset from the internet and may take some time). The returned newsgroups object is dictionary-like: the postings are stored as a plain Python list of strings under 'data', and the target labels are stored as a NumPy array in newsgroups.target.

In [5]:
newsgroups.target

array([1, 0, 0, ..., 0, 0, 0])

Since the features are words, we transform them into a numeric representation using Term Frequency-Inverse Document Frequency (TF-IDF). The purpose of TF-IDF is to de-emphasize words that occur in nearly all postings (such as 'the', 'by', 'for', etc.) and instead emphasize words that occur in only a few postings, which tend to be characteristic of a particular class (such as 'religion' or 'creationism', which are typical of the alt.atheism newsgroup).
We do this by creating a TfidfVectorizer and then transforming all the newsgroup data into a vector representation.

In [6]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups.data)
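
To make the weighting idea concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. The function name tfidf and the toy postings are made up for illustration. The smoothed IDF formula and the L2 normalisation are meant to resemble scikit-learn's documented defaults, but treat that correspondence as an assumption; TfidfVectorizer also tokenizes differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: terms found everywhere get the lowest value.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical toy postings, not taken from the real dataset.
toy_docs = [
    "the doctor prescribed the medicine",
    "the argument about religion",
    "the patient saw the doctor",
]
toy_weights = tfidf(toy_docs)
# In the second posting every word occurs once, so only the IDF separates them.
print(toy_weights[1]["the"] < toy_weights[1]["religion"])
```

In the second toy posting, 'the' appears in every document and therefore receives a lower weight than 'religion', which appears in only one.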

The vectors variable now contains features that we can use as input data to the Naive Bayes classifier. A shape query reveals that it contains 1789 instances, and each instance has about 24k features. However, most of those features are 0 for any given posting, indicating words that do not appear in it, so the result is stored as a sparse matrix that lists only the non-zero entries.

In [7]:
print(vectors)
vectors.shape

  (0, 23691)	0.152415950747
  (0, 7722)	0.0695804237688
  (0, 22613)	0.167288064641
  (0, 19619)	0.167288064641
  (0, 12782)	0.167288064641
  (0, 905)	0.167288064641
  (0, 14657)	0.140408876394
  (0, 14373)	0.167288064641
  (0, 14512)	0.158588435707
  (0, 22822)	0.0943122288393
  (0, 18652)	0.115839003601
  (0, 12694)	0.107799609066
  (0, 13603)	0.147628199455
  (0, 3853)	0.158588435707
  (0, 16543)	0.137543836854
  (0, 22023)	0.113096220375
  (0, 21259)	0.128844207919
  (0, 11999)	0.158588435707
  (0, 15205)	0.0796196634826
  (0, 2283)	0.0281542176195
  (0, 11473)	0.0973825898024
  (0, 18632)	0.12267172296
  (0, 11166)	0.158588435707
  (0, 2117)	0.0898124808636
  (0, 3763)	0.158588435707
  :	:
  (1788, 15267)	0.0180677872389
  (1788, 7321)	0.0230870711222
  (1788, 23596)	0.0267994711824
  (1788, 19187)	0.0671890133007
  (1788, 15818)	0.0262676402168
  (1788, 21859)	0.0353337060673
  (1788, 23753)	0.0309516349007
  (1788, 24122)	0.0244816065864
  (1788, 2075)	0.0234785474981
  :	:

(1789, 24202)
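
Each printed line above has the form (row, column) value, that is, one non-zero cell of the matrix. As a rough illustration of that coordinate layout, the sketch below lists the non-zero cells of a small dense matrix. The helper to_coordinates and the toy matrix are hypothetical; the real vectors object is a SciPy sparse matrix with its own internal storage.

```python
def to_coordinates(dense_rows):
    """List the non-zero cells of a dense matrix as ((row, column), value) pairs."""
    return [((i, j), value)
            for i, row in enumerate(dense_rows)
            for j, value in enumerate(row)
            if value != 0]

# Hypothetical 3 x 4 matrix: 3 "postings", 4 "words".
toy_matrix = [
    [0.0, 0.7, 0.0, 0.7],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
entries = to_coordinates(toy_matrix)
for (row, col), value in entries:
    print((row, col), value)

# Fraction of cells that actually need to be stored.
n_cells = len(toy_matrix) * len(toy_matrix[0])
density = len(entries) / n_cells
print("stored entries:", len(entries), "of", n_cells)
```

Only 3 of the 12 cells are stored here. With roughly 24k columns and short postings, the saving for the newsgroup data is far larger.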

Scikit-learn provides several Naive Bayes classifiers; the one we use is MultinomialNB. Using a classifier typically involves splitting the dataset into training, validation and test sets, training on the training set and then measuring performance on data the model has not seen. Scikit-learn provides a utility that does this for us.
The cross_validation.cross_val_score function splits the data into several folds, trains on all but one fold, scores the classifier on the held-out fold, and returns one score per fold. Here we ask for the F1 score (the harmonic mean of precision and recall) and average it over the folds.
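
The fold splitting described above can be sketched in a few lines of plain Python. The generator k_fold_indices is a hypothetical helper, not a scikit-learn function, and it uses simple contiguous folds; scikit-learn's own splitter for classifiers is stratified, and its default number of folds depends on the version.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of n_folds contiguous folds."""
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample lands in exactly one test fold, so each posting is used for evaluation once and for training in all other rounds.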

In [8]:
clf = MultinomialNB(alpha=.01)
scores = cross_validation.cross_val_score(clf, vectors, newsgroups.target, scoring='f1')
print("CrossValidation Score: ", np.mean(scores))

CrossValidation Score:  0.954420443456
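
For reference, the F1 score for a binary problem can be computed by hand from true positives, false positives and false negatives. The sketch below uses a hypothetical helper f1_score_binary and made-up labels; it is not how the score above was produced, only an illustration of the formula.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 1 stands for sci.med, 0 for alt.atheism.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(score)
```

With three true positives, one false positive and one false negative, precision and recall are both 0.75, and so is their harmonic mean.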


We can see that, despite the naive assumption that all features are conditionally independent given the class, the classifier achieves a good F1 score of about 0.95.
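
To show what the independence assumption buys us, here is a simplified pure-Python sketch of multinomial Naive Bayes. The functions train_multinomial_nb and predict and the toy postings are hypothetical, and the code is not scikit-learn's implementation. The alpha argument plays the same role as the additive smoothing parameter passed to MultinomialNB above.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate a log prior and per-word log likelihoods for each class."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in docs_per_class.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Additive smoothing keeps unseen words from getting zero probability.
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Independence assumption: per-word log probabilities simply add up.
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood)
    return max(scores, key=scores.get)

# Made-up training postings, not taken from the real dataset.
toy_docs = ["god religion belief", "religion atheism argument",
            "doctor patient medicine", "medicine disease doctor"]
toy_labels = ["alt.atheism", "alt.atheism", "sci.med", "sci.med"]
model = train_multinomial_nb(toy_docs, toy_labels)
print(predict(model, "the doctor prescribed medicine"))
```

Because each word contributes independently, training reduces to counting words per class and prediction reduces to a sum, which is why the method scales easily to tens of thousands of features.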