# Classification on the 20 Newsgroups dataset


*The [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.*

*The 20 newsgroups collection is a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.*

# 1. Import and inspect the data

In [78]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier


from sklearn.pipeline import Pipeline


In [74]:
categories = ['sci.space','comp.graphics', 'sci.med', 'rec.motorcycles', 'rec.sport.baseball']
twenty_train = fetch_20newsgroups(subset='train',  categories=categories, shuffle=True, random_state=42)

# Inspect the category names and the number of training documents
print(twenty_train.target_names)
print(len(twenty_train.data))

# test data
twenty_test = fetch_20newsgroups(subset='test',categories=categories, shuffle=True, random_state=42)



In [58]:
print(twenty_train.target_names[twenty_train.target[1]])
print(twenty_train.data[1])

rec.sport.baseball
From: rkoffler@ux4.cso.uiuc.edu (Bighelmet)
Subject: Re: Best Sportwriters...
Keywords: Sportswriters
Organization: University of Illinois at Urbana
Lines: 21

csc2imd@cabell.vcu.edu (Ian M. Derby) writes:


>Since someone brought up sports radio, howabout sportswriting???

I happen to be a big fan of Jayson Stark.  He is a baseball writer for the 
Philadelphia Inquirer.  Every tuesday he writes a "Week in Review" column.  
He writes about unusual situations that occured during the week.  Unusual
stats.  He has a section called "Kinerisms of the Week" which are stupid
lines by Mets brodcaster Ralph Kiner.  Every year he has the LGTGAH contest.
That stands for "Last guy to get a hit."  He also writes for Baseball 
America.  That column is sort of a highlights of "Week in Review."  If you 
can, check his column out sometime.  He might make you laugh.

Rob Koffler

-- 
******************************************************************
|You live day to day and           

# 2. Extract features from data

Transform the *twenty_train.data* with 

* [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
    * remove English stop words (stop_words='english')
    * limit the vocabulary to 1000 words (max_features=1000)

Then apply

* [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)

What do the outputs of CountVectorizer and TfidfTransformer contain, and how do you interpret them?
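
As a rough illustration of the two steps above, here is a minimal sketch on a hypothetical three-document corpus (toy_docs is an assumption made for this example, not part of the dataset). It shows that CountVectorizer yields a sparse document-term matrix of raw counts, and that TfidfTransformer reweights those counts so that terms shared by many documents count for less.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, standing in for twenty_train.data
toy_docs = [
    "the rocket reached orbit",
    "the pitcher threw the ball",
    "the rocket engine failed",
]

# Step 1: raw term counts; English stop words are dropped and the
# vocabulary is capped at 1000 terms
count_vect = CountVectorizer(stop_words="english", max_features=1000)
toy_counts = count_vect.fit_transform(toy_docs)

# Step 2: reweight the counts by tf-idf (rows are L2-normalised by default)
tfidf_transformer = TfidfTransformer()
toy_tfidf = tfidf_transformer.fit_transform(toy_counts)

print(sorted(count_vect.vocabulary_))  # the learned vocabulary
print(toy_counts.shape)                # (n_documents, n_terms)
print(toy_counts.toarray())            # raw counts per document
print(toy_tfidf.toarray().round(2))    # tf-idf weights per document
```

Each row is one document and each column one vocabulary term; the same layout applies to the matrices built from *twenty_train.data*, only much larger.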

In [1]:
# Count words
# ...
# X_train_counts = 

# Frequency tf-idf
# ...
# X_train_tfidf = 


# 3. Fit a model and predict some sentences

* Fit a Multinomial Naive Bayes model (MultinomialNB) to the output of the CountVectorizer + TfidfTransformer
* Predict the category of some new sentences

For instance:

        new_sentences = ['Space the final frontier  where no one has gone before', 
                         'OpenGL on the GPU is fast', 
                         'I bought my new honda at the local dealer ', 
                        ]
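
The point that is easy to miss here is that new sentences must go through transform, not fit_transform, so that they are mapped onto the vocabulary and idf weights learned from the training data. The sketch below shows this on a hypothetical two-class toy set (toy_docs, toy_target and toy_names are assumptions for illustration only).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set; labels are indices into toy_names
toy_names = ["baseball", "space"]
toy_docs = [
    "the pitcher threw a fast ball",
    "the batter hit a home run",
    "the rocket reached orbit",
    "the shuttle docked with the station",
]
toy_target = [0, 0, 1, 1]

count_vect = CountVectorizer(stop_words="english")
tfidf_transformer = TfidfTransformer()

# Training data: learn the vocabulary and idf weights, then fit the model
X_toy_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(toy_docs))
clf = MultinomialNB().fit(X_toy_tfidf, toy_target)

# New data: reuse the fitted transformers with transform() only
new_docs = ["a rocket in orbit", "the pitcher and the batter"]
X_new_tfidf = tfidf_transformer.transform(count_vect.transform(new_docs))
predicted = clf.predict(X_new_tfidf)

for doc, category in zip(new_docs, predicted):
    print("%r => %s" % (doc, toy_names[category]))
```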



In [72]:
# Fit the model on X_train_tfidf, twenty_train.target
clf = MultinomialNB()

# ...


In [3]:
new_sentences = ['Space the final frontier where no one has gone before',  'OpenGL on the GPU is fast',  'I bought my new honda at the local dealer ' ]
# Transform the sentences 
# count_vectorize + tfidf transform on the new_sentences

# predict

# predicted_categories = ...




# and print
# for doc, category in zip(new_sentences, predicted_categories):
#     print('%r => %s' % (doc, twenty_train.target_names[category]))

# 4. Pipelines

Let's build a pipeline with

* CountVectorizer
* TfidfTransformer
* MultinomialNB

        from sklearn.pipeline import Pipeline
        text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
        ])

Fit the Pipeline on the training data (twenty_train.data, twenty_train.target) and predict on the test data (twenty_test.data).

Assess the prediction performance:

            np.mean(predicted == twenty_test.target)
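
To show the fit, predict and accuracy steps end to end, here is a small sketch on hypothetical toy data (train_docs, test_docs and their labels are assumptions for this example). A Pipeline takes raw text directly, so the vectorizer and transformer no longer need to be called by hand.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = baseball, 1 = space
train_docs = [
    "the pitcher threw a fast ball",
    "the batter hit a home run",
    "the rocket reached orbit",
    "the shuttle docked with the station",
]
train_target = [0, 0, 1, 1]
test_docs = ["the rocket left orbit", "the batter hit the ball"]
test_target = [1, 0]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# fit() runs fit_transform on each transformer, then fits the classifier
text_clf.fit(train_docs, train_target)

# predict() runs transform on each transformer, then predicts
predicted = text_clf.predict(test_docs)

# Accuracy: fraction of test documents whose prediction matches the label
accuracy = np.mean(predicted == np.asarray(test_target))
print(accuracy)
```

The same three calls apply to the real data, with twenty_train and twenty_test in place of the toy lists.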

# 5. Different Classifier

Try changing the classifier to the [stochastic gradient descent classifier (SGDClassifier)](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) with

* loss='hinge'
* penalty='l2'
* alpha=1e-3
* max_iter=5 (called n_iter in older scikit-learn releases; tol=None makes the classifier run exactly max_iter epochs)

        SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None, random_state=42)

What's the accuracy on test data?

Look at the classification report:

        from sklearn import metrics
        print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
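
A possible shape for this step, sketched on hypothetical toy data (all documents, labels and toy_names below are assumptions for illustration). In recent scikit-learn releases the number of epochs is passed as max_iter, and tol=None disables early stopping. Note that classification_report returns a string, so it has to be printed to be readable.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data; labels are indices into toy_names
toy_names = ["baseball", "space"]
train_docs = [
    "the pitcher threw a fast ball",
    "the batter hit a home run",
    "the rocket reached orbit",
    "the shuttle docked with the station",
]
train_target = [0, 0, 1, 1]
test_docs = ["the rocket left orbit", "the batter hit the ball"]
test_target = [1, 0]

# Same pipeline as before, with a linear SVM trained by SGD as the last step
sgd_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          max_iter=5, tol=None, random_state=42)),
])
sgd_clf.fit(train_docs, train_target)
predicted = sgd_clf.predict(test_docs)

# Per-class precision, recall and F1, returned as a formatted string
report = metrics.classification_report(test_target, predicted,
                                       target_names=toy_names)
print(report)
```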




# 6. GridSearch on pipeline

        from sklearn.model_selection import GridSearchCV
        parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
                      'tfidf__use_idf': (True, False),
                      'clf__alpha': (1e-2, 1e-3),
        }
        gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

Fit on twenty_train.data, twenty_train.target, and look at the results in gs_clf.best_score_, gs_clf.best_params_ and gs_clf.cv_results_.
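
As a sketch of how the search and its results might be inspected, the example below runs the same parameter grid on a hypothetical eight-document toy set (toy_docs and toy_target are assumptions; cv=2 is chosen only because the toy set is tiny). It assumes a recent scikit-learn, where GridSearchCV lives in sklearn.model_selection and the per-combination scores are in cv_results_. The pipeline here ends in MultinomialNB, whose smoothing parameter is also called alpha.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = baseball, 1 = space
toy_docs = [
    "the pitcher threw a fast ball",
    "the batter hit a home run",
    "the catcher dropped the ball",
    "the team won the baseball game",
    "the rocket reached orbit",
    "the shuttle docked with the station",
    "the probe left earth orbit",
    "the rocket engine fired in space",
]
toy_target = [0, 0, 0, 0, 1, 1, 1, 1]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# Keys are <step name>__<parameter name>
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": (True, False),
    "clf__alpha": (1e-2, 1e-3),
}

gs_clf = GridSearchCV(text_clf, parameters, cv=2)
gs_clf.fit(toy_docs, toy_target)

# Best cross-validated accuracy and the parameters that produced it
print(gs_clf.best_score_)
for name in sorted(parameters):
    print("%s: %r" % (name, gs_clf.best_params_[name]))

# Mean test score for every parameter combination that was tried
mean_scores = gs_clf.cv_results_["mean_test_score"]
for params, score in zip(gs_clf.cv_results_["params"], mean_scores):
    print(params, round(score, 3))
```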

