# Part 5: Multiclass Classification: One vs One and One vs Rest

In this part, we aim to classify documents belonging to four newsgroups:
* comp.sys.ibm.pc.hardware
* comp.sys.mac.hardware
* misc.forsale
* soc.religion.christian

We will use Naive Bayes classification and SVM classification (with both One vs One and One vs Rest).
A Naive Bayes classifier inherently picks the most probable class given the data, no matter how many classes there are.
An SVM, however, is a binary classifier, so there are two ways to extend it to more than two classes:
1. One vs One: train a binary classifier for every pair of classes and, given a document, assign the class that wins the majority of the pairwise votes.
2. One vs Rest: train one classifier per class, each fitted to separate that class from all the other classes, and assign the class whose classifier is most confident.
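
The two strategies can be sketched without any trained model. The snippet below is only an illustration: the short class labels, the `scores` dictionary and the `pairwise_winner` function are hypothetical stand-ins for real binary classifiers, not part of this notebook's pipeline.

```python
from itertools import combinations
from collections import Counter

# Short hypothetical labels for the four newsgroups.
classes = ["ibm", "mac", "forsale", "christian"]

# Hypothetical confidence scores for a single document (higher = more confident).
scores = {"ibm": 0.2, "mac": 0.9, "forsale": -0.4, "christian": -1.1}

def one_vs_rest(scores):
    """Return the class whose 'this class vs. all others' score is highest."""
    return max(scores, key=scores.get)

def pairwise_winner(a, b):
    """Stand-in for a trained binary classifier on the pair (a, b)."""
    return a if scores[a] >= scores[b] else b

def one_vs_one(classes, pairwise_winner):
    """Run one binary decision per pair of classes and return the majority vote."""
    votes = Counter(pairwise_winner(a, b) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0], votes

n_pairs = len(list(combinations(classes, 2)))  # K*(K-1)/2 pairs for K classes
ovr_choice = one_vs_rest(scores)
ovo_choice, votes = one_vs_one(classes, pairwise_winner)
print(n_pairs, ovr_choice, ovo_choice, dict(votes))
```

With four classes, One vs One needs six pairwise classifiers while One vs Rest needs only four, which is the main practical trade-off between the two.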

* Let's import required libraries

In [1]:
import numpy as np
from sklearn import svm
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from nltk import SnowballStemmer
from sklearn import metrics
from sklearn.multiclass import OneVsRestClassifier,OneVsOneClassifier
import re

In [2]:
#fetch the train and test subsets of 20 Newsgroups, restricted to the given categories
def train_test_data_fill(categories):
    train=fetch_20newsgroups(subset='train',shuffle=True,random_state=42,categories=categories)
    test=fetch_20newsgroups(subset='test',shuffle=True,random_state=42,categories=categories)
    return train,test

In [3]:
category=['comp.sys.ibm.pc.hardware','comp.sys.mac.hardware','misc.forsale','soc.religion.christian']
training_data,testing_data=train_test_data_fill(category)    
all_data=training_data.data+testing_data.data

In [4]:
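#preprocess data
stemmer=SnowballStemmer("english")
#delimiter character class: a space plus ASCII punctuation (raw string, so the backslashes reach the regex engine intact)
punctuations=r'[! \" # $ % \& \' \( \) \* + , \- \. \/ : ; <=> ? @ \[ \\ \] ^ _ ` { \| } ~]'
def preprocess_data(data_list):
    #modifies data_list in place: split on the delimiters, stem each token, rejoin with spaces
    for i in range(len(data_list)):
        data_list[i]=" ".join([stemmer.stem(data) for data in re.split(punctuations,data_list[i])])
        #drop newline, tab and carriage-return characters
        data_list[i]=data_list[i].replace('\n','').replace('\t','').replace('\r','')
preprocess_data(all_data)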
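
To see what the splitting step produces on its own, here is a minimal sketch. It assumes a shortened, hypothetical delimiter class (`simple_delims`) instead of the full `punctuations` pattern, and it skips stemming so that it runs without NLTK.

```python
import re

# Hypothetical, shortened delimiter class: a space and a few punctuation marks.
simple_delims = r'[ !,.:;?]'

sample = "Hello, world! For sale: disk."

# Every delimiter character is a split point, so two delimiters in a row
# (for example ", ") leave an empty string between them.
tokens = re.split(simple_delims, sample)
print(tokens)

# Rejoining with single spaces keeps those empty tokens as extra spaces.
rejoined = " ".join(tokens)
print(repr(rejoined))
```

The empty tokens only add extra spaces to the rejoined string, which a whitespace-insensitive tokenizer such as the default one in `CountVectorizer` ignores.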

In [5]:
#Feature extraction using TFxIDF
count_vect=CountVectorizer(min_df=5,stop_words ='english')
X_counts=count_vect.fit_transform(all_data)
tfidf_transformer=TfidfTransformer()
X_tfidf=tfidf_transformer.fit_transform(X_counts)
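
As a rough illustration of what the TFxIDF weighting computes, the sketch below reproduces it by hand on a toy corpus. It assumes the smoothed IDF formula and L2 row normalization that `TfidfTransformer` documents as its defaults; the two toy documents and the helper `tfidf_row` are made up for this example.

```python
import math
from collections import Counter

# Toy corpus of already tokenized documents (hypothetical).
docs = [["disk", "drive", "disk"], ["drive", "sale"]]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}

def tfidf_row(doc):
    """Return the L2-normalized tf * idf weights of one document."""
    tf = Counter(doc)
    raw = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

A term that occurs in every document ("drive" here) gets the minimum IDF of 1, so rarer terms such as "disk" dominate the row they appear in.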

In [6]:
#LSI to reduce dimensionality
svd=TruncatedSVD(n_components=50,n_iter=10,random_state=42)
svd.fit(X_tfidf)
LSI=svd.transform(X_tfidf)

In [7]:
LSI_train=LSI[0:len(training_data.data)]
LSI_test=LSI[len(training_data.data):]
print("Size of Transformed Training Dataset: {0}".format(LSI_train.shape))
print("Size of Transformed Testing Dataset: {0}".format(LSI_test.shape))

Size of Transformed Training Dataset: (2352, 50)
Size of Transformed Testing Dataset: (1565, 50)


In [8]:
#classifiers to compare: Naive Bayes and linear SVM, each wrapped in One vs One and One vs Rest
clf_list=[OneVsOneClassifier(GaussianNB()),OneVsOneClassifier(svm.LinearSVC()),
          OneVsRestClassifier(GaussianNB()),OneVsRestClassifier(svm.LinearSVC())]
clf_name=['One vs One Classifier: Naive Bayes','One vs One Classifier: SVM',
          'One vs Rest Classifier: Naive Bayes','One vs Rest Classifier: SVM']

In [11]:
#print the classification report, confusion matrix and accuracy for each classifier listed above
for clf,clf_n in zip(clf_list,clf_name):
    print(clf_n)
    clf.fit(LSI_train,training_data.target)
  
    test_predicted=clf.predict(LSI_test)
    print('Classification Report:')
    print(metrics.classification_report(testing_data.target,test_predicted,target_names=testing_data.target_names))
    
    print('Confusion Matrix:')
    print(metrics.confusion_matrix(testing_data.target,test_predicted)) 
    
    print('Total Accuracy: ')
    print(np.mean(testing_data.target==test_predicted))
    print("\n\n")

One vs One Classifier: Naive Bayes
Classification Report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.63      0.63      0.63       392
   comp.sys.mac.hardware       0.64      0.58      0.61       385
            misc.forsale       0.68      0.73      0.70       390
  soc.religion.christian       0.95      0.97      0.96       398

               micro avg       0.73      0.73      0.73      1565
               macro avg       0.73      0.73      0.73      1565
            weighted avg       0.73      0.73      0.73      1565

Confusion Matrix:
[[247  81  61   3]
 [ 91 223  67   4]
 [ 51  41 286  12]
 [  2   2   9 385]]
Total Accuracy: 
0.729073482428115
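
The figures in this report follow directly from the confusion matrix. The sketch below recomputes them by hand from the One vs One Naive Bayes matrix printed above, reading rows as true classes and columns as predicted classes; the variable names are illustrative only.

```python
labels = ["comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware",
          "misc.forsale", "soc.religion.christian"]

# Confusion matrix copied from the One vs One Naive Bayes output.
cm = [[247,  81,  61,   3],
      [ 91, 223,  67,   4],
      [ 51,  41, 286,  12],
      [  2,   2,   9, 385]]

total = sum(sum(row) for row in cm)              # number of test documents
correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal = correct predictions
accuracy = correct / total

precisions, recalls = [], []
for i, name in enumerate(labels):
    support = sum(cm[i])                   # documents that truly belong to class i
    predicted = sum(row[i] for row in cm)  # documents predicted as class i
    precision = cm[i][i] / predicted
    recall = cm[i][i] / support
    f1 = 2 * precision * recall / (precision + recall)
    precisions.append(precision)
    recalls.append(recall)
    print("{0:>25} {1:.2f} {2:.2f} {3:.2f} {4}".format(name, precision, recall, f1, support))

print("accuracy:", accuracy)
```

Most of the errors sit in the top-left 3x3 block of the matrix, that is, among the two hardware groups and misc.forsale, while soc.religion.christian is rarely confused with anything else.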



One vs One Classifier: SVM
Classification Report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.81      0.85      0.83       392
   comp.sys.mac.hardware       0.86      0.81      0.83       385
            misc.forsale     

**Results:**
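* There is a considerable gap between SVM and Naive Bayes. In the One vs One setting, Naive Bayes reaches an overall accuracy of about 0.73 with F1-scores of 0.63 and 0.61 on the two hardware classes, whereas SVM scores 0.83 on both. SVM performs better than Naive Bayes.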
* O
