# Text Classification using scikit-learn

We explore some ideas in text classification with scikit-learn, following the examples in the tutorial at
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

In [1]:
# We use the 20 newsgroups dataset
from sklearn.datasets import fetch_20newsgroups

In [2]:
# Select some categories from the complete list
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=101)

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)


In [5]:
twenty_train.description

'the 20 newsgroups by date dataset'

In [6]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [8]:
# Number of records in the training subset; twenty_train itself is a Bunch (a dict-like object with attribute access)
len(twenty_train.data)

2257

In [19]:
print('\n'.join(twenty_train.data[7].split('\n')[:5]))

From: ger@cv.ruu.nl (Ger Timmens)
Subject: Re: Postscript drawing prog
Nntp-Posting-Host: triton.cv.ruu.nl
Organization: University of Utrecht, 3D Computer Vision Research Group
Lines: 30


In [20]:
# target holds one integer label per document; that integer indexes into
# target_names to give the category name
twenty_train.target_names[twenty_train.target[7]]

'comp.graphics'
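
The relationship between `data`, `target`, and `target_names` can be sketched on a small hand-made Bunch, which avoids downloading the dataset. The three documents and two category names below are hypothetical stand-ins, not taken from 20 newsgroups.

```python
import numpy as np
from sklearn.utils import Bunch

# Hypothetical toy Bunch that mimics the layout of twenty_train
toy = Bunch(
    data=["post about shading", "post about vaccines", "post about meshes"],
    target=np.array([0, 1, 0]),
    target_names=["comp.graphics", "sci.med"],
)

# The integer in target indexes into target_names
label_of_second = toy.target_names[toy.target[1]]

# Category name for every document
all_labels = [toy.target_names[i] for i in toy.target]

# Number of documents per category, in target_names order
per_class = np.bincount(toy.target, minlength=len(toy.target_names))
```

Here `per_class` gives a quick check of class balance, which is worth looking at before training a classifier.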

## Extracting features from textual data

Now that we have loaded the data, we extract features for use in our machine learning algorithms.

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
count_vec = CountVectorizer()

In [23]:
X_train_counts = count_vec.fit_transform(twenty_train.data)

In [24]:
X_train_counts.shape

(2257, 35788)

In [26]:
# vocabulary_ maps each term to its column index in the count matrix; the index is not a frequency
count_vec.vocabulary_.get(u'computer')

9338

In [30]:
count_vec.vocabulary_.get(u'car')

7860
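
To see what `vocabulary_` holds, here is a minimal sketch on a hypothetical three-document corpus. It suggests that the stored values are column positions in sorted term order, and that actual occurrence counts come from the count matrix itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from 20 newsgroups
toy_docs = ["the car is red", "the computer is fast", "car car computer"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the count matrix
vocab = toy_vec.vocabulary_
car_col = vocab["car"]
computer_col = vocab["computer"]

# Terms ordered by their column index; this matches alphabetical order
terms_by_index = sorted(vocab, key=vocab.get)

# Total occurrences of a term are the sum of its column
car_total = int(toy_counts[:, car_col].sum())
computer_total = int(toy_counts[:, computer_col].sum())
```

In this toy corpus `car` gets a lower index than `computer` only because it sorts first alphabetically.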

Raw occurrence counts favor longer documents, so they are converted to frequencies to make documents comparable.
Terms that occur in many documents are then downweighted, which gives tf-idf: term frequency * inverse document frequency.

In [32]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [33]:
X_train_tfidf.shape

(2257, 35788)
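
As a rough check of what the transformer computes, the sketch below rebuilds tf-idf by hand on a hypothetical 3x3 count matrix and compares it with `TfidfTransformer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [3, 0, 0]])

n_docs = counts.shape[0]

# Document frequency: number of documents that contain each term
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default in TfidfTransformer)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the raw counts by idf, then L2-normalize each row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Same computation through scikit-learn
from_sklearn = TfidfTransformer().fit_transform(counts).toarray()

matches = np.allclose(manual, from_sklearn)
```

The first term appears in every document, so its idf is the minimum value of 1, while the rarer terms receive a larger weight.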

In [35]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train_tfidf, twenty_train.target)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Now that we have a trained classifier, we can try to predict the category of some new documents.
The new data has to undergo the same transformations as the training data.

In [41]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
# Each element of the list is one document

In [42]:
X_new_counts = count_vec.transform(docs_new)

In [43]:
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

In [44]:
predictions = mnb.predict(X_new_tfidf)

In [46]:
for doc, category in zip(docs_new, predictions):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
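
The reason for calling `transform` rather than `fit_transform` on new documents can be illustrated with a small sketch. The documents below are hypothetical. Refitting on the new data learns a different vocabulary, so the columns no longer line up with what the classifier was trained on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and new documents
train_docs = ["graphics card renders images", "doctor treats patient"]
new_docs = ["new graphics card", "patient sees doctor"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learns the vocabulary from training data
X_new = vec.transform(new_docs)          # reuses that vocabulary, same columns

# Refitting on the new documents gives a different feature space
X_refit = CountVectorizer().fit_transform(new_docs)

same_width = X_new.shape[1] == X_train.shape[1]
refit_width_differs = X_refit.shape[1] != X_train.shape[1]

# Words unseen during fitting ("new", "sees") are silently dropped
known_tokens_per_doc = X_new.sum(axis=1).A1.tolist()
```

Each new document keeps only the two tokens that were present in the training vocabulary, which is why very short or out-of-domain inputs can end up with few or no features.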


A Pipeline chains these steps together: tokenization and counting with the count vectorizer, the tf-idf transform, and naive Bayes fitting.

In [50]:
from sklearn.pipeline import Pipeline
text_mnb = Pipeline([('vec', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('mnb', MultinomialNB())])

In [51]:
# Can be trained in a single statement
text_mnb.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_...inear_tf=False, use_idf=True)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
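
Once fitted, a pipeline of this shape accepts raw strings directly, so no separate transform calls are needed at prediction time. The sketch below uses a hypothetical four-document corpus with two made-up labels instead of the 20 newsgroups data, so the names `toy_docs`, `toy_labels`, and `toy_names` are illustrative only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = graphics, label 1 = medicine
toy_docs = [
    "opengl rendering of 3d images",
    "graphics card and image rendering",
    "doctor prescribed medicine to the patient",
    "the patient has a disease",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["comp.graphics", "sci.med"]

pipe = Pipeline([("vec", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("mnb", MultinomialNB())])
pipe.fit(toy_docs, toy_labels)

# predict() runs vectorizing, tf-idf and classification in one call
preds = pipe.predict(["opengl rendering images", "patient doctor medicine"])
pred_names = [toy_names[p] for p in preds]

# Individual fitted steps remain accessible by name
fitted_vocab_size = len(pipe.named_steps["vec"].vocabulary_)
```

Accessing a step through `named_steps` is handy for inspecting the learned vocabulary or the classifier's parameters after fitting.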