# Newsgroup Classification with SVM

## MultiClass classification
MultiClass classification is one of the most tedious learning problems for any classification method.

There are a few major ways these problems are learnt:
1. One vs Rest   : Treats each class as a binary classification problem, with all remaining classes grouped as the "other" class.
2. One vs One    : Treats each pair of classes as a separate binary classification problem.
3. Crammer-Singer: Jointly optimises over all classes at the same time.
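
To make the first two strategies concrete, here is a minimal sketch in plain Python that lists the binary problems each scheme creates. The helper names `ovr_tasks`, `ovo_tasks` and `ovr_labels` are illustrative and not part of any library. For K classes, One vs Rest needs K binary classifiers and One vs One needs K(K-1)/2.

```python
from itertools import combinations

def ovr_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, "rest") for c in classes]

def ovo_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

def ovr_labels(y, positive):
    """Relabel a multiclass target vector for a single One vs Rest problem."""
    return [1 if label == positive else 0 for label in y]

classes = list(range(20))          # 20 classes, as in the newsgroup data
n_ovr = len(ovr_tasks(classes))    # K binary problems
n_ovo = len(ovo_tasks(classes))    # K * (K - 1) / 2 binary problems
print(n_ovr, n_ovo)                # 20 190

print(ovr_labels([0, 2, 1, 2], positive=2))  # [0, 1, 0, 1]
```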

The models which support MultiClass classification include (not exhaustive):
1. Support Vector Machines (SVM)
2. Naive Bayes (NB)
3. K-Nearest Neighbours (KNN)
4. Decision Trees (All variants)
5. Neural Networks (ANN and DNN)
6. Quadratic Discriminant Analysis (QDA) and Linear Discriminant Analysis (LDA)
7. Learning Vector Quantization (LVQ)

More Details:
* [Wikipedia](https://en.wikipedia.org/wiki/Multiclass_classification)
* [Survey on Multiclass Classification Methods](https://www.cs.utah.edu/~piyush/teaching/aly05multiclass.pdf)
* [Crammer and Singer](http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf)


## News Group Data
This dataset contains posts from 20 newsgroups. More details: [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/)

In [1]:
# Imports

# For Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# For ML models
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# For Model Selection based on GridSearch
from sklearn.model_selection import GridSearchCV

# other imports
import numpy as np

# for stopwords and data cleaning
import nltk
# nltk.download() # use this if you haven't downloaded any stemmer
from nltk.stem.snowball import SnowballStemmer

# for Pipeline flow of processing with sklearn
from sklearn.pipeline import Pipeline


### The News Group data can be fetched directly using scikit-learn.


In [2]:
# For fetching the dataset
from sklearn.datasets import fetch_20newsgroups

#Loading the data set 
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

#print labels (categories).
print('Target Names:\n{}\n'.format(twenty_train.target_names))

# print some data
print('Sample data:')
print(twenty_train.data[0])

# Print distribution of classes
unique, counts = np.unique(twenty_train.target, return_counts=True)
print('Getting Class distributions:')
print(np.asarray((twenty_train.target_names, counts)).T)

Target Names:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Sample data:
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history,

##### The data is in the form of mail-style posts from people.
Now let's try to extract some information from the data.

In [3]:
# Extracting features from text files - tf-idf

# Fitting on training data
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

# Processing on test data
X_test_tfidf = tfidf_transformer.transform(count_vect.transform(twenty_test.data))


#### As we can observe, it's a very wide sparse matrix. We'll use this matrix as our features to train our models.
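
As a rough illustration of what the two transformers compute, the sketch below rebuilds tf-idf by hand for a toy corpus in plain Python. It assumes the default `TfidfTransformer` settings (`smooth_idf=True`, `norm='l2'`), for which the idf of a term is `ln((1 + n) / (1 + df)) + 1`. The toy documents and the helper name `tfidf_rows` are made up for the example.

```python
import math
from collections import Counter

docs = ["car engine car", "engine specs", "space shuttle engine"]  # toy corpus

def tfidf_rows(docs):
    """Return the sorted vocabulary and one L2-normalised tf-idf row per document."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}
    # Smoothed idf, as used by TfidfTransformer(smooth_idf=True)
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 2) for v in row])
```

Even in this tiny example most entries of each row are zero, which is why the real matrix is stored in sparse form.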


In [4]:
%%time
# Let's make a very basic linear SVM classifier
clf_svm = LinearSVC(dual=False, tol=0.0001, C=1.0, multi_class='ovr',verbose=0, random_state=None, max_iter=1000)

# Fitting the classifier on the training data
clf_svm = clf_svm.fit(X_train_tfidf, twenty_train.target)

# Making predictions and computing the accuracy
predicted = clf_svm.predict(X_test_tfidf)
np.mean(predicted == twenty_test.target)

CPU times: user 13.6 s, sys: 120 ms, total: 13.7 s
Wall time: 6.91 s
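
Under the `ovr` scheme the fitted model holds one weight vector and bias per class, and a sample is assigned to the class with the largest linear score. The sketch below shows that decision rule in plain Python. The weights are invented for illustration and `ovr_predict` is a hypothetical helper, not a scikit-learn function.

```python
def ovr_predict(weights, biases, x):
    """Return the index of the class whose linear score w.x + b is largest, plus all scores."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy model: 3 classes, 2 features (values invented for illustration)
weights = [[1.0, -0.5], [-0.2, 0.8], [0.1, 0.1]]
biases = [0.0, 0.1, -0.3]

label, scores = ovr_predict(weights, biases, x=[0.5, 1.0])
print(label, [round(s, 2) for s in scores])
```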



Without any preprocessing we can see a very good accuracy.
Another way to look at the behaviour of the model across classes in sklearn is via `classification_report`.

Let's see what the report looks like.



In [5]:
%%time
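# importing classification report
from sklearn.metrics import classification_report

# Note: the signature is classification_report(y_true, y_pred); with predictions passed first,
# precision and recall are swapped in the output and support counts predicted labels.
print(classification_report(predicted,twenty_test.target))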

             precision    recall  f1-score   support

          0       0.80      0.82      0.81       309
          1       0.80      0.76      0.78       412
          2       0.73      0.77      0.75       376
          3       0.76      0.71      0.74       416
          4       0.86      0.84      0.85       392
          5       0.76      0.87      0.81       347
          6       0.91      0.83      0.87       424
          7       0.91      0.92      0.91       390
          8       0.95      0.95      0.95       399
          9       0.95      0.92      0.93       412
         10       0.98      0.96      0.97       408
         11       0.94      0.93      0.93       400
         12       0.79      0.81      0.80       384
         13       0.87      0.90      0.88       380
         14       0.93      0.90      0.92       407
         15       0.93      0.84      0.88       445
         16       0.92      0.75      0.82       447
         17       0.89      0.97      0.93   

As we can see, a few classes (for example 2 and 3) score noticeably lower than the rest. As a rule of thumb:
### "Accuracy is not a sufficient statistic for a MultiClass classification problem. We should look deeper."

The report above shows why.
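
To see what the report measures, the following plain-Python sketch computes per-class precision and recall for a toy example. The labels are invented and `per_class_scores` is a hypothetical helper. It also shows the effect of argument order: `classification_report` takes the true labels first, so passing the predictions first swaps the two columns.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, support)} computed from scratch."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted_c = sum(p == c for p in y_pred)   # tp + fp
        actual_c = sum(t == c for t in y_true)      # tp + fn (the support)
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        scores[c] = (precision, recall, actual_c)
    return scores

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 2))
print("true labels first:", per_class_scores(y_true, y_pred))
print("predictions first:", per_class_scores(y_pred, y_true))
```

In this toy case four of six predictions are correct, yet class 2 is never predicted at all, which is the kind of failure a single accuracy number hides.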

Now let's try to improve this a bit.

In [6]:
# Let's create a pipeline to make iterating a bit easier.
# We'll also remove stopwords from the text to shrink the vocabulary and reduce the noise in the data.
news_clf_svm = Pipeline([('counter', CountVectorizer(stop_words='english')),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', LinearSVC(dual=False, # primal problem (scikit-learn suggests dual=False mainly when n_samples > n_features)
                                            class_weight='balanced', # as class sizes differ
                                            C=10, # larger C means weaker regularisation (default is 1.0)
                                            verbose=1, random_state=234, max_iter=1000)) 
                         ])

# Now let's run a grid search to see what works best on this data set.
grid_param = {'counter__ngram_range': [(1, 1), (1, 2)],
                'tfidf__use_idf': (True, False),
                'clf-svm__multi_class': ('ovr','crammer_singer'),
                'clf-svm__tol':(1e-2, 1e-4)
             }

# Fitting the data on the models generated by the grid search
news_gs_clf = GridSearchCV(news_clf_svm, grid_param, n_jobs=-1,verbose=5)
news_gs_clf = news_gs_clf.fit(twenty_train.data, twenty_train.target)


Fitting 3 folds for each of 16 candidates, totalling 48 fits
[CV] tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[CV] tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[CV] tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[CV] tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.915762, total=  16.7s
[CV] tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.917572, total=  17.8s
[CV] tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi

[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  2.1min


[CV] tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.913113, total=  38.1s
[CV] tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.917307, total=  38.8s
[CV] tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.912536, total= 1.2min
[CV] tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.909719, total= 1.2min
[CV] tfidf__use_



[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.912583, total=  20.6s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.916247, total=  23.4s
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.918216, total=  23.0s
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.896954, total=  24.4s
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.900080, total=  21.9s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 1), score=0.898566, total=  28.5s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.912536, total= 2.6min
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=ovr, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.912108, total= 2.6min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 
[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.921060, total= 1.1min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.923403, total= 1.2min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.921402, total=  43.4s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.911741, total=  36.7s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.909669, total=  42.4s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.01, counter__ngram_range=(1, 2), score=0.909453, total=  34.9s
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.916247, total=  59.4s
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.918216, total= 1.0min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1) 
[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.896954, total=  42.6s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.900080, total=  30.4s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.898566, total=  41.1s
[CV] tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 1), score=0.912583, total= 2.0min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.921060, total= 3.0min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.923403, total= 3.9min
[CV] tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2) 




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.909669, total= 4.0min




[LibLinear][CV]  tfidf__use_idf=True, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.921402, total= 4.4min




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.911476, total= 2.9min




[LibLinear][CV]  tfidf__use_idf=False, clf-svm__multi_class=crammer_singer, clf-svm__tol=0.0001, counter__ngram_range=(1, 2), score=0.909453, total= 1.8min


[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed: 18.2min finished


[LibLinear]



In [7]:
# Print the best cross-validation score and the parameters that produced it
print(news_gs_clf.best_score_)
print(news_gs_clf.best_params_)

0.921955099876
{'tfidf__use_idf': True, 'clf-svm__multi_class': 'crammer_singer', 'clf-svm__tol': 0.01, 'counter__ngram_range': (1, 2)}
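
The log line "Fitting 3 folds for each of 16 candidates, totalling 48 fits" follows directly from the grid: four parameters with two values each give 2^4 = 16 combinations, and each one is fitted once per fold. A small sketch with `itertools.product` (variable names are illustrative) reproduces that count:

```python
from itertools import product

grid_param = {
    'counter__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf-svm__multi_class': ('ovr', 'crammer_singer'),
    'clf-svm__tol': (1e-2, 1e-4),
}
n_folds = 3  # number of folds reported in the log above

# Every combination of one value per parameter is one candidate model
names = list(grid_param)
candidates = [dict(zip(names, values)) for values in product(*grid_param.values())]

print(len(candidates))             # 16
print(len(candidates) * n_folds)   # 48
print(candidates[0])
```

Each additional parameter value multiplies the number of fits, which helps explain the 18.2 minutes reported for this search.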


In [8]:
%%time
# Making predictions with the best model from the grid search
predicted = news_gs_clf.predict(twenty_test.data)
print('Accuracy: {}\n Classification Report:'.format(np.mean(predicted == twenty_test.target)))

# Generating the classification report
print(classification_report(predicted,twenty_test.target))

Accuracy: 0.8601964949548593
 Classification Report:
             precision    recall  f1-score   support

          0       0.80      0.85      0.82       298
          1       0.82      0.76      0.79       419
          2       0.75      0.78      0.77       377
          3       0.77      0.72      0.74       417
          4       0.87      0.82      0.84       410
          5       0.77      0.87      0.81       350
          6       0.90      0.82      0.86       429
          7       0.90      0.93      0.91       383
          8       0.96      0.95      0.96       399
          9       0.95      0.91      0.93       411
         10       0.98      0.97      0.97       405
         11       0.95      0.93      0.94       404
         12       0.79      0.82      0.80       375
         13       0.86      0.90      0.88       378
         14       0.92      0.90      0.91       406
         15       0.94      0.88      0.91       425
         16       0.92      0.78      0.84   



#### As we can see, class-wise prediction has improved for several classes compared with the previous report.

Note: We have used the whole raw text of each post as is. With the same settings and some preprocessing we can likely get even better results. How to do that is up to you.
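
One simple preprocessing step is to drop the header block of each post, so the model cannot lean on fields such as `Organization`. The sketch below assumes the headers are separated from the body by the first blank line, as in the sample shown earlier. The function name `strip_headers` is hypothetical. scikit-learn's loader offers a comparable option through `fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))`.

```python
def strip_headers(post):
    """Return the text after the first blank line, or the whole post if there is none."""
    head, sep, body = post.partition("\n\n")
    return body if sep else post

sample = (
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Lines: 15\n"
    "\n"
    " I was wondering if anyone out there could enlighten me on this car I saw\n"
    "the other day."
)

body = strip_headers(sample)
print(body)
```

The cleaned strings can be passed to the same pipeline in place of the raw posts, for example `[strip_headers(d) for d in twenty_train.data]`.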

In [9]:
# Let's also apply stemming, from the nltk package, to the data and see if it improves the results
stemmer = SnowballStemmer("english", ignore_stopwords=True)

# Subclass CountVectorizer so that everything the analyzer emits is stemmed
class StemmedVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

stemmed_counter = StemmedVectorizer(ngram_range=(1,2), # from grid cv 
                                    stop_words='english')
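
One detail worth checking: the function returned by `build_analyzer` yields the final n-grams, so with `ngram_range=(1,2)` the lambda above also passes two-word strings such as `'sports car'` to the stemmer. A possible alternative, sketched below in plain Python, stems each token first and builds the bigrams afterwards. The `toy_stem` function is a deliberately crude stand-in for `SnowballStemmer` so the example runs without NLTK, and `stemmed_ngrams` is a hypothetical helper.

```python
def toy_stem(word):
    """Crude suffix stripper used only for illustration (not the Snowball algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stemmed_ngrams(tokens, stem=toy_stem, max_n=2):
    """Stem every token, then emit all n-grams for n = 1..max_n."""
    stems = [stem(t) for t in tokens]
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(stems) - n + 1):
            grams.append(" ".join(stems[i:i + n]))
    return grams

tokens = ["doors", "looked", "small"]
print(stemmed_ngrams(tokens))
```

With scikit-learn, a similar ordering could probably be obtained by overriding `build_tokenizer` rather than `build_analyzer`, since the tokenizer runs before the n-grams are formed.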

In [10]:
%%time
# Now let's create a pipeline using the n-gram range found by the grid search
news_clf = Pipeline([('counter', stemmed_counter),
                      ('tfidf', TfidfTransformer()),
                      ('clf-svm', LinearSVC(dual=False, # solve the primal problem
                                            class_weight='balanced', # as class sizes differ
                                            C=100, # larger C means weaker regularisation (default is 1.0)
                                            verbose=5, random_state=505, max_iter=1000)) 
                         ])

# Fitting on the training data
news_clf= news_clf.fit(twenty_train.data, twenty_train.target)

# Making predictions on the test data
predicted = news_clf.predict(twenty_test.data)
print('Accuracy: {}'.format(np.mean(predicted == twenty_test.target)))

# Generating the classification report
print(classification_report(predicted,twenty_test.target))



[LibLinear]Accuracy: 0.8586032926181625
             precision    recall  f1-score   support

          0       0.80      0.85      0.83       299
          1       0.80      0.78      0.79       403
          2       0.74      0.78      0.76       370
          3       0.75      0.71      0.73       413
          4       0.86      0.81      0.84       409
          5       0.78      0.87      0.82       356
          6       0.88      0.80      0.84       431
          7       0.91      0.93      0.92       389
          8       0.96      0.96      0.96       398
          9       0.94      0.92      0.93       404
         10       0.98      0.96      0.97       409
         11       0.96      0.92      0.94       413
         12       0.80      0.82      0.81       385
         13       0.86      0.89      0.88       384
         14       0.93      0.90      0.92       407
         15       0.93      0.88      0.91       421
         16       0.93      0.78      0.84       434
     

#### Although stemming takes longer, it has little effect on the overall results (accuracy 0.859 here versus 0.860 before). This suggests the data needs more cleaning and processing before modelling.

### Bonus Part: Let's use a Naive Bayes classifier for the same task

In [13]:
%%time
# A pipeline with MultinomialNB
news_clf_nb = Pipeline([('counter', CountVectorizer(ngram_range=(1,2),stop_words='english')),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MultinomialNB(alpha=10.0, fit_prior=False))
                       ])

# Fitting the NB pipeline on the training data
news_clf_nb= news_clf_nb.fit(twenty_train.data, twenty_train.target)

# Making predictions on the test data
predicted = news_clf_nb.predict(twenty_test.data)
print('Accuracy: {}'.format(np.mean(predicted == twenty_test.target)))

# Generating the classification report
print(classification_report(predicted,twenty_test.target))

Accuracy: 0.7914232607541157
             precision    recall  f1-score   support

          0       0.72      0.75      0.74       309
          1       0.69      0.76      0.72       352
          2       0.76      0.73      0.74       408
          3       0.73      0.65      0.69       442
          4       0.75      0.82      0.78       354
          5       0.74      0.84      0.79       348
          6       0.77      0.85      0.81       354
          7       0.86      0.88      0.87       385
          8       0.94      0.90      0.92       415
          9       0.88      0.85      0.86       411
         10       0.97      0.78      0.87       496
         11       0.96      0.74      0.84       512
         12       0.53      0.84      0.65       249
         13       0.71      0.90      0.80       313
         14       0.93      0.78      0.85       471
         15       0.95      0.70      0.80       542
         16       0.93      0.66      0.77       511
         17     
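
For intuition about the `alpha` and `fit_prior` settings, here is a minimal multinomial Naive Bayes in plain Python. It is a sketch on invented toy counts, assuming the usual additive smoothing `(count + alpha) / (total + alpha * V)` and a uniform class prior (the effect of `fit_prior=False`). The names `train_nb` and `predict_nb` are hypothetical, not scikit-learn functions.

```python
import math

def train_nb(X, y, alpha=1.0):
    """X: list of term-count rows, y: class labels. Returns per-class log P(term | class)."""
    n_terms = len(X[0])
    log_prob = {}
    for c in sorted(set(y)):
        # Total count of each term over all documents of class c
        totals = [sum(row[j] for row, label in zip(X, y) if label == c)
                  for j in range(n_terms)]
        denom = sum(totals) + alpha * n_terms
        log_prob[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prob

def predict_nb(log_prob, row):
    """Pick the class with the highest log-likelihood under a uniform prior."""
    return max(log_prob, key=lambda c: sum(n * lp for n, lp in zip(row, log_prob[c])))

# Columns: counts of "engine", "orbit", "goal" (toy vocabulary)
X = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 2, 1]]
y = ["autos", "autos", "space", "space"]

model = train_nb(X, y, alpha=1.0)
print(predict_nb(model, [1, 0, 0]))   # autos
print(predict_nb(model, [0, 2, 0]))   # space
```

A large `alpha`, such as the 10.0 used above, pulls all term probabilities towards a uniform distribution, so rare terms are smoothed heavily.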

#### The next steps would ideally include data preprocessing, feature generation and more hyperparameter tuning.