## Loading the 20 newsgroups dataset

In [1]:
categories = ['alt.atheism', 'soc.religion.christian', 
              'comp.graphics', 'sci.med']

In [3]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=42)

In [4]:
type(twenty_train)

sklearn.datasets.base.Bunch

In [5]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [116]:
atheist_docs = np.array(twenty_train.data)[twenty_train.target == 0]

In [6]:
len(twenty_train.data)

2257

In [15]:
file_ind = 0
print(twenty_train.filenames[file_ind],
      "\n*****\n",
      twenty_train.target_names[twenty_train.target[file_ind]],
      "\n*****\n",
      twenty_train.data[file_ind])

/Users/wah/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38440 
*****
 comp.graphics 
*****
 From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



## Extracting features from text files

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

In [19]:
type(X_train_counts)

scipy.sparse.csr.csr_matrix

In [44]:
vocab = set(count_vect.vocabulary_.keys())
len(vocab)

35788

### Stopwords
By default, `CountVectorizer` does not filter out stopwords, so they stay in the vocabulary.

In [36]:
# the sklearn.feature_extraction.stop_words module was removed in later releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
stops = ENGLISH_STOP_WORDS
len(stops)

318

In [37]:
len(vocab & stops)

305

Passing `'english'` or a list of words to the `stop_words` parameter changes this.  

In [42]:
count_vect_with_stopwords = CountVectorizer(stop_words='english')
X_train_counts_stopped = count_vect_with_stopwords.fit_transform(twenty_train.data)
X_train_counts_stopped.shape

(2257, 35483)

In [43]:
vocab_stop = set(count_vect_with_stopwords.vocabulary_.keys())
len(vocab_stop & stops)

0

Another option is to detect corpus-specific stopwords automatically by filtering on document frequency (the number of documents a word appears in).  Lowering the `max_df` parameter filters out words that are very common across documents, and raising `min_df` ignores words that appear in very few documents, which can help avoid overfitting.  Both parameters accept either a `float` in the range `[0.0, 1.0]`, read as a proportion of documents, or an `int`, read as an absolute document count.

In [47]:
count_vect_tfidf = CountVectorizer(max_df=0.8, 
                                   min_df=2)
X_train_counts_tfidf = count_vect_tfidf.fit_transform(twenty_train.data)
X_train_counts_tfidf.shape

(2257, 18484)
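
To make the `max_df` / `min_df` thresholds concrete, here is a pure-Python sketch of document-frequency filtering on a made-up five-sentence corpus. It illustrates the idea only and is not scikit-learn's implementation: `document_frequencies`, `filter_vocabulary` and `toy_docs` are hypothetical names, and the regular expression is an approximation of the default tokenizer (lowercased tokens of two or more word characters).

```python
import re
from collections import Counter

def document_frequencies(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so that a token repeated within one document counts once
        df.update(set(re.findall(r"\b\w\w+\b", doc.lower())))
    return df

def filter_vocabulary(docs, max_df=1.0, min_df=1):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    A float threshold is read as a proportion of documents,
    an int as an absolute document count.
    """
    n_docs = len(docs)
    upper = max_df * n_docs if isinstance(max_df, float) else max_df
    lower = min_df * n_docs if isinstance(min_df, float) else min_df
    df = document_frequencies(docs)
    return sorted(tok for tok, count in df.items() if lower <= count <= upper)

# Toy corpus (invented for this example)
toy_docs = [
    "the horse ate the hay",
    "the cat ate the fish",
    "the dog chased the cat",
    "the horse chased the dog",
    "the zebra slept",
]
kept = filter_vocabulary(toy_docs, max_df=0.8, min_df=2)
print(kept)
```

With these settings "the" is dropped because it appears in all five documents (more than 80%), and words such as "zebra" are dropped because they appear in only one.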

In [50]:
count_vect.vocabulary_['horse']

16865

In [69]:
import scipy.sparse  # import the submodule explicitly; `import scipy` alone may not load it
import numpy as np

In [65]:
horses = scipy.sparse.find(X_train_counts[:, 16865])

In [68]:
horses[0]

array([ 218,  826, 1762, 1779, 1789, 1833, 1986, 2129], dtype=int32)

In [88]:
doc_matrix = np.array(twenty_train.data)
print(*doc_matrix[horses[0]])

From: eb3@world.std.com (Edwin Barkdoll)
Subject: Re: Blindsight
Organization: The World Public Access UNIX, Brookline, MA
Lines: 64

In article <19382@pitt.UUCP> geb@cs.pitt.edu (Gordon Banks) writes:
>In article <werner-240393161954@tol7mac15.soe.berkeley.edu> werner@soe.berkeley.edu (John Werner) writes:
>>In article <19213@pitt.UUCP>, geb@cs.pitt.edu (Gordon Banks) wrote:
>>> 
>>> Explain.  I thought there were 3 types of cones, equivalent to RGB.
>>
>>You're basically right, but I think there are just 2 types.  One is
>>sensitive to red and green, and the other is sensitive to blue and yellow. 
>>This is why the two most common kinds of color-blindness are red-green and
>>blue-yellow.
>>
>
>Yes, I remember that now.  Well, in that case, the cones are indeed
>color sensitive, contrary to what the original respondent had claimed.


	I'm not sure who the "original respondent" was but to
reiterate cones respond to particular portions of the spectrum, just
as _rods_ respond to certain 

### Using TF-IDF instead of counts

TF-IDF can help address two problems: one, it normalises over varying document lengths (so that a long document that mentions something in passing is not considered equivalent to a very short, focussed document); and two, it down-weights words that appear in a large number of documents.  
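
For reference, here is a sketch of the weighting as I understand scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). The symbols are notation introduced here rather than taken from the text above: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Each document vector is then divided by its Euclidean norm, which is the step that evens out document length. Under these assumptions, a word that occurs in every one of the 2257 training documents gets $\mathrm{idf} = 1$, while a word that occurs in only two gets $\ln(2258/3) + 1 \approx 7.6$.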

In [93]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts_tfidf)
X_train_tfidf.shape

(2257, 18484)

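The two steps, `fit` and `transform`, can be separated out.  `fit` learns the IDF weights from a corpus and `transform` applies them, so I suppose this is useful if you want to weight a smaller set of documents (or new documents at prediction time) using statistics from a larger corpus.  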
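
The split between the two steps can be sketched in plain Python. This is a minimal illustration under assumptions, not the library's code: `fit_idf` and `transform_tfidf` are hypothetical helpers, rows are `{token: count}` dicts instead of sparse matrices, and the smoothed IDF formula is the one I believe scikit-learn uses by default.

```python
import math
from collections import Counter

def fit_idf(count_rows):
    """Learn smoothed IDF weights from a list of {token: count} dicts."""
    n_docs = len(count_rows)
    df = Counter()
    for row in count_rows:
        df.update(row.keys())  # each document counts once per token
    return {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

def transform_tfidf(count_rows, idf):
    """Apply previously learned IDF weights, then L2-normalise each row."""
    out = []
    for row in count_rows:
        # tokens that were not seen during fitting have no weight and are dropped
        weighted = {t: c * idf[t] for t, c in row.items() if t in idf}
        norm = math.sqrt(sum(w * w for w in weighted.values()))
        out.append({t: w / norm for t, w in weighted.items()} if norm else {})
    return out

# Invented counts standing in for a "larger" training corpus and one new document
train_counts = [
    {"horse": 2, "the": 3},
    {"cat": 1, "the": 2},
    {"dog": 1, "the": 1},
]
new_counts = [{"horse": 1, "the": 4, "zebra": 1}]

idf = fit_idf(train_counts)                   # "fit": looks at the training rows only
new_tfidf = transform_tfidf(new_counts, idf)  # "transform": reuses those weights
print(new_tfidf)
```

The new document is weighted with statistics from the training rows alone, and "zebra", which those rows never contained, simply drops out.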

Another class, `TfidfVectorizer`, combines `CountVectorizer` and `TfidfTransformer` into a single step.

In [94]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df = 0.8,
                                   min_df = 2)
X_train_tfidf = tfidf_vectorizer.fit_transform(twenty_train.data)
X_train_tfidf.shape

(2257, 18484)

## Training a classifier

The variant of naive Bayes most suitable for word counts is `MultinomialNB`.

In [99]:
from sklearn.naive_bayes import MultinomialNB

clf_count = MultinomialNB().fit(X_train_counts, twenty_train.target)
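
For intuition about what is being fitted, here is a small pure-Python sketch of multinomial naive Bayes with additive smoothing. It is an illustration under assumptions rather than scikit-learn's implementation: `train_multinomial_nb`, `predict` and the toy documents are hypothetical, tokens are split on whitespace, and `alpha=1.0` mirrors what I believe is the `MultinomialNB` default.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # additive smoothing: every vocabulary word gets alpha pseudo-counts
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for c in log_prior:
        # words outside the training vocabulary are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Invented toy data, two classes with two documents each
toy_docs = ["god faith church", "faith prayer god",
            "pixel render gpu", "gpu shader pixel"]
toy_labels = ["religion", "religion", "graphics", "graphics"]
model = train_multinomial_nb(toy_docs, toy_labels)
print(predict("gpu render", *model))
print(predict("faith church", *model))
```

Each class is scored by its prior plus the summed log probabilities of the words in the document, which is why raw word counts are a natural input for this model.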

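To classify a new document we have to transform it in the same way as the training set, but without refitting: we call `transform` rather than `fit_transform`, so the vocabulary learned from the training data is reused.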

In [106]:
new_docs = ['God is love',
            'OpenGL on the GOU is fast',
            'Faith and dogma are barriers to progress']

In [100]:
X_new_counts = count_vect.transform(new_docs)
pred_count = clf_count.predict(X_new_counts)

In [107]:
for doc, cat in zip(new_docs, pred_count):
    print("'{}' => {}".format(doc, twenty_train.target_names[cat]))

'God is love' => soc.religion.christian
'OpenGL on the GOU is fast' => comp.graphics
'Faith and dogma are barriers to progress' => soc.religion.christian
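
One consequence of transforming with a fixed vocabulary is that a token never seen in training, which may well be the case for "GOU" in the second sentence, contributes nothing to the feature vector. The sketch below shows this in plain Python. It assumes a simplified tokenizer (lowercased tokens of two or more word characters); `fit_vocabulary` and `transform` are hypothetical helpers, not the `CountVectorizer` API.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def fit_vocabulary(docs):
    """Map each token seen in the training documents to a column index."""
    tokens = sorted({t for d in docs for t in TOKEN.findall(d.lower())})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(doc, vocabulary):
    """Count tokens using a fixed vocabulary; unknown tokens are skipped."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(TOKEN.findall(doc.lower())).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

# Invented two-sentence "training set"
vocabulary = fit_vocabulary(["OpenGL on the GPU", "the GPU is fast"])
row = transform("OpenGL on the GOU is fast", vocabulary)
print(vocabulary)
print(row)
```

The column for "gpu" stays at zero and "gou" has no column at all, so in this sketch the classifier would see only the remaining words.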
