<a href="https://colab.research.google.com/github/dubeyabhi07/hands-on-scikit-learn/blob/master/SVM/multiClassSVC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Multi-class classification with SVC

- Logistic regression extends naturally to multi-class problems by replacing the sigmoid function with the softmax function. The KNN algorithm is also straightforward to extend to the multi-class case: for an input x, we find the k closest examples under a distance metric such as Euclidean distance and return the class that occurs most often among them. Multi-class labeling is also trivial with a Naive Bayes classifier.

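- SVM is inherently a binary classifier and cannot be naturally extended to multi-class problems.

- To carry out multi-class classification with an SVM, the problem has to be decomposed into binary problems using either a one-vs-one (OvO) or a one-vs-rest (OvR, also called one-vs-all) strategy.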

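- SVC always trains one-vs-one classifiers internally. Its default decision_function_shape='ovr' only converts the pairwise scores into an array of shape (n_samples, n_classes), while decision_function_shape='ovo' returns the raw pairwise scores of shape (n_samples, n_classes * (n_classes - 1) / 2). To train a true one-vs-rest SVC, wrap it in the OneVsRestClassifier meta-estimator.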

- By default, LinearSVC uses the one-vs-rest strategy (multi_class='ovr'). The only alternative it offers is multi_class='crammer_singer'; for one-vs-one, wrap it in the OneVsOneClassifier meta-estimator.
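A minimal sketch of what decision_function_shape changes in practice. The toy dataset, its size, and all variable names below are illustrative assumptions, not part of the notebook. The same SVC is fitted twice, differing only in decision_function_shape, and the decision-function shapes and predictions are compared.

```python
import numpy as np
from sklearn import datasets, svm

# Illustrative toy data (assumption): 4 classes, so there are 4 * 3 / 2 = 6 class pairs.
X_demo, y_demo = datasets.make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=1,
    n_classes=4, random_state=0)

svc_ovr = svm.SVC(kernel='linear', C=1, decision_function_shape='ovr').fit(X_demo, y_demo)
svc_ovo = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_demo, y_demo)

# One column per class vs. one column per pair of classes.
shape_ovr = svc_ovr.decision_function(X_demo).shape
shape_ovo = svc_ovo.decision_function(X_demo).shape

# With break_ties left at its default (False), predict does not depend on the shape option.
same_predictions = np.array_equal(svc_ovr.predict(X_demo), svc_ovo.predict(X_demo))

print(shape_ovr, shape_ovo, same_predictions)
```

Only the layout of the decision function differs between the two models; the fitted pairwise classifiers are the same.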

In [None]:
from sklearn import datasets
from sklearn import svm
data = datasets.make_classification(n_samples=1000, n_features=5, n_informative=4, n_redundant=1, n_classes=5,random_state=3)
print("first 5 samples : ")
print(data[0][0:5])
print("\nlabels for first 5 samples : ")
print(data[1][0:5])


X = data[0]
Y = data[1]

first 5 samples : 
[[ 1.92675881 -0.37142864  0.26895069  1.01306378  0.34952276]
 [ 0.7173925   0.98164137  1.20622939  0.2535553   0.42079308]
 [ 1.70973199  0.14127233 -0.34242828  0.99092091  0.35059309]
 [ 1.62178036  1.92497436 -2.07616403  1.08555836  0.52694852]
 [-1.28404986  1.14046739 -1.21193321 -0.5686877  -0.14887834]]

labels for first 5 samples : 
[0 2 1 1 2]


In [None]:
print("SVM with ovr strategy :\n")
svc1 = svm.SVC(kernel='linear', C=1).fit(X, Y)
y_pred1 = svc1.predict(X)
y_pred_values1 = svc1.decision_function(X)
print(svc1)
print(y_pred1[0:5])
print(y_pred_values1[0:5,:])

SVM with ovr strategy :

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
[2 2 2 1 1]
[[ 3.11669179  1.07540878  4.2762562   0.73911708  0.77523688]
 [ 2.22105721  3.2607182   4.27356788  0.70589377 -0.24173596]
 [ 0.76813905  3.23459427  4.27944253  1.82668017 -0.26914591]
 [-0.30684585  4.30445585  3.29516366  2.27745189  0.69300379]
 [ 0.7289235   3.23464736  3.15417717  3.27478203 -0.26050765]]


In [None]:
print("\n\nSVM with ovo strategy :\n")
svc2 = svm.SVC(kernel='linear', C=100, decision_function_shape='ovo').fit(X, Y)
y_pred2 = svc2.predict(X)
y_pred_values2 = svc2.decision_function(X)
print(svc2)
print(y_pred2[0:5])
print(y_pred_values2[0:5,:])



SVM with ovo strategy :

SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
[2 2 2 1 1]
[[ 0.24369285 -0.96036787  0.98270991  0.34616354 -1.54583284  2.29244034
  -0.17696521  0.78065044  1.58580342  0.3572326 ]
 [-0.17937947 -0.77248011  1.96539447  1.08045099 -1.03961443  4.34172268
   0.22359312  1.62373971  1.14668286  0.24114778]
 [-0.4795382  -1.60520186 -0.3398489   0.14666031 -0.74937691  2.11549806
   0.57752191  0.69208279  2.18468592  1.35184474]
 [-2.94854295 -4.04361382 -4.1097323  -0.68773018  1.65799454  2.77197582
   3.28377012  0.76206365  4.68507087  4.52625659]
 [-1.24313816 -1.17633647 -2.20383541  0.23975948  1.15531141 -0.80414725
   0.80729401  0.1398409   0.70911704  1.89976366]]


#### Interpretation of ovo decision function :

 - https://datascience.stackexchange.com/questions/18374/predicting-probability-from-scikit-learn-svc-decision-function-with-decision-fun
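 - In the first example, the ten pairwise scores are ordered AB, AC, AD, AE, BC, BD, BE, CD, CE, DE (classes 0 to 4 are labelled A to E). A positive score is a vote for the first class of the pair, a negative score a vote for the second. This gives AB->A, AC->C, AD->A, AE->A, BC->C, BD->B, BE->E, CD->C, CE->C, DE->D. Thus votes for A = 3, B = 1, C = 4, D = 1, E = 1, and C (class 2) is the output.
 - When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. These are estimated with Platt scaling, i.e. logistic regression on the SVM scores fitted by an additional cross-validation on the training data, and extended to the multi-class case.
 - Breaking ties by the confidence values of decision_function is only available with decision_function_shape='ovr' (break_ties=True). It is computationally costly compared to a simple predict and is therefore not enabled by default. By default, the first class among the tied classes is returned.
 - break_ties must be False when decision_function_shape is 'ovo'.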
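The vote counting described above can be sketched in a few lines. The helper name `ovo_votes` is hypothetical (it is not a scikit-learn function), and the scores are copied from the first row of the ovo decision function printed earlier.

```python
from itertools import combinations

def ovo_votes(pairwise_scores, n_classes):
    """Hypothetical helper: one vote per class pair, positive score -> first class."""
    votes = [0] * n_classes
    # Pairs come out in the order (0,1), (0,2), (0,3), (0,4), (1,2), ..., (3,4),
    # which matches AB, AC, AD, AE, BC, BD, BE, CD, CE, DE.
    pairs = combinations(range(n_classes), 2)
    for score, (first, second) in zip(pairwise_scores, pairs):
        # Simplification: a score of exactly zero is counted for the second class here.
        votes[first if score > 0 else second] += 1
    return votes

# First row of the ovo decision function shown in the output above.
row_1 = [0.24369285, -0.96036787, 0.98270991, 0.34616354, -1.54583284,
         2.29244034, -0.17696521, 0.78065044, 1.58580342, 0.3572326]

votes_1 = ovo_votes(row_1, n_classes=5)
winner_1 = votes_1.index(max(votes_1))  # index() returns the first class with the most votes

print(votes_1)   # [3, 1, 4, 1, 1]
print(winner_1)  # 2, i.e. class C
```

Because `list.index` returns the first maximum, this sketch also mimics the default behaviour of returning the first class when several classes are tied.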

In [None]:
print("\n\nSVM with ovo strategy and predict_proba enabled :\n")
svc3 = svm.SVC(kernel='linear', C=100, decision_function_shape='ovo', probability=True, random_state=3).fit(X, Y)
y_pred3 = svc3.predict(X)
y_pred_values3 = svc3.decision_function(X)
y_pred_prob = svc3.predict_proba(X)
print(svc3)
print(y_pred3[0:5])
print(y_pred_values3[0:5,:])
print(y_pred_prob[0:5,:])



SVM with ovo strategy and predict_proba enabled :

SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=True, random_state=3, shrinking=True, tol=0.001,
    verbose=False)
[2 2 2 1 1]
[[ 0.24369285 -0.96036787  0.98270991  0.34616354 -1.54583284  2.29244034
  -0.17696521  0.78065044  1.58580342  0.3572326 ]
 [-0.17937947 -0.77248011  1.96539447  1.08045099 -1.03961443  4.34172268
   0.22359312  1.62373971  1.14668286  0.24114778]
 [-0.4795382  -1.60520186 -0.3398489   0.14666031 -0.74937691  2.11549806
   0.57752191  0.69208279  2.18468592  1.35184474]
 [-2.94854295 -4.04361382 -4.1097323  -0.68773018  1.65799454  2.77197582
   3.28377012  0.76206365  4.68507087  4.52625659]
 [-1.24313816 -1.17633647 -2.20383541  0.23975948  1.15531141 -0.80414725
   0.80729401  0.1398409   0.70911704  1.89976366]]
[[0.18996058 0.13920525 0.45993617 0.07530171 0.1355963 ]
 [0

- The predicted class for the 5th sample is 1, whereas its highest predict_proba value points to class 3. Hence, the class with the highest predict_proba value does not always coincide with the output of predict.

- In the fifth example: AB->B, AC->C, AD->D, AE->A, BC->B, BD->D, BE->B, CD->C, CE->C, DE->D. Thus votes for A = 1, B = 3, C = 3, D = 3, E = 0, which is a three-way tie between B, C and D.

- Despite the probability of D being highest, the predict method picked B (class 1), the first of the tied classes.
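When such ties matter, SVC's break_ties=True option (valid only together with decision_function_shape='ovr' and more than two classes) makes predict follow the confidence values of decision_function. The following is a minimal sketch on illustrative toy data; the dataset size and all variable names are assumptions, and the number of changed predictions depends on the data.

```python
import numpy as np
from sklearn import datasets, svm

# Illustrative toy data (assumption), 5 classes as in the examples above.
X_demo, y_demo = datasets.make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=1,
    n_classes=5, random_state=3)

# Tie breaking requires the 'ovr' decision function shape.
svc_tb = svm.SVC(kernel='linear', C=1, decision_function_shape='ovr',
                 break_ties=True).fit(X_demo, y_demo)

pred_tb = svc_tb.predict(X_demo)

# With break_ties=True, predict returns the class with the largest decision value.
argmax_pred = svc_tb.classes_[np.argmax(svc_tb.decision_function(X_demo), axis=1)]
follows_confidence = np.array_equal(pred_tb, argmax_pred)

# Compare against the default behaviour (first tied class wins).
svc_default = svm.SVC(kernel='linear', C=1).fit(X_demo, y_demo)
n_changed = int(np.sum(pred_tb != svc_default.predict(X_demo)))

print(follows_confidence, n_changed)
```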

# Using meta-estimators

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_ovr = OneVsRestClassifier(svm.SVC(random_state=0, kernel='linear', C=1)).fit(X, Y)

print(ovr_ovr.n_classes_)
print(ovr_ovr.estimators_)

y_pred_ovr_ovr = ovr_ovr.predict(X)
y_pred_values_ovr_ovr = ovr_ovr.decision_function(X)
print(y_pred_ovr_ovr[0:5])
print(y_pred_values_ovr_ovr[0:5,:])


5
[SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), S

In [None]:
from sklearn.multiclass import OneVsRestClassifier
ovr_ovo = OneVsRestClassifier(svm.SVC(random_state=0, decision_function_shape='ovo',kernel='linear', C=1)).fit(X, Y)

print(ovr_ovo.n_classes_)
print(ovr_ovo.estimators_)

y_pred_ovr_ovo = ovr_ovo.predict(X)
y_pred_values_ovr_ovo = ovr_ovo.decision_function(X)
print(y_pred_ovr_ovo[0:5])
print(y_pred_values_ovr_ovo[0:5,:])

5
[SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), S

In [None]:
from sklearn.multiclass import OneVsOneClassifier
ovo_ovr = OneVsOneClassifier(svm.SVC(random_state=0, kernel='linear', C=1)).fit(X, Y)

print(ovo_ovr.n_classes_)
print(ovo_ovr.estimators_)

y_pred_ovo_ovr = ovo_ovr.predict(X)
y_pred_values_ovo_ovr = ovo_ovr.decision_function(X)
print(y_pred_ovo_ovr[0:5])
print(y_pred_values_ovo_ovr[0:5,:])


5
(SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), S

In [None]:
ovo_ovo = OneVsOneClassifier(svm.SVC(random_state=0, decision_function_shape='ovo', kernel='linear', C=1)).fit(X, Y)

print(ovo_ovo.n_classes_)
print(ovo_ovo.estimators_)

y_pred_ovo_ovo = ovo_ovo.predict(X)
y_pred_values_ovo_ovo = ovo_ovo.decision_function(X)
print(y_pred_ovo_ovo[0:5])
print(y_pred_values_ovo_ovo[0:5,:])

5
(SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False), S

### Observations :
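- OneVsOneClassifier's decision_function returns an array of shape (n_samples, n_classes). For binary classification the shape is (n_samples,); this was changed in version 0.19 to conform to scikit-learn conventions for binary classification.

- When a meta-estimator is applied, the results follow the meta-estimator's strategy and the decision_function_shape of the wrapped SVC becomes irrelevant, because each wrapped SVC only ever sees a binary problem.

- When the meta-estimator **OneVsOneClassifier** is used, its decision function resembles that of a plain SVC with decision_function_shape='ovr'. This is consistent with SVC using the OvO strategy internally, which the scikit-learn documentation confirms: one-vs-one is always used to train the models, and the 'ovr' matrix is only constructed from the 'ovo' matrix.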
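These observations can be checked with a short sketch. The toy dataset and the helper `make_base` are illustrative assumptions; only the scikit-learn classes themselves come from the notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Illustrative toy data (assumption), 5 classes.
X_demo, y_demo = datasets.make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=1,
    n_classes=5, random_state=3)

def make_base(shape):
    """Hypothetical helper: a linear SVC that differs only in decision_function_shape."""
    return svm.SVC(kernel='linear', C=1, random_state=0, decision_function_shape=shape)

ovr_with_ovr = OneVsRestClassifier(make_base('ovr')).fit(X_demo, y_demo)
ovr_with_ovo = OneVsRestClassifier(make_base('ovo')).fit(X_demo, y_demo)
ovo_with_ovr = OneVsOneClassifier(make_base('ovr')).fit(X_demo, y_demo)

# One binary model per class for OvR, one per pair of classes for OvO.
n_ovr_models = len(ovr_with_ovr.estimators_)
n_ovo_models = len(ovo_with_ovr.estimators_)

# The wrapped SVCs are binary, so their decision_function_shape has no effect.
shape_is_irrelevant = np.allclose(ovr_with_ovr.decision_function(X_demo),
                                  ovr_with_ovo.decision_function(X_demo))

# The OvO meta-estimator still reports one column per class.
ovo_decision_shape = ovo_with_ovr.decision_function(X_demo).shape

print(n_ovr_models, n_ovo_models, shape_is_irrelevant, ovo_decision_shape)
```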

# LinearSVC :

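- Does not support multi_class='ovo'.
- Supports multi_class='ovr' (the default) and multi_class='crammer_singer'. 'ovr' is preferred: crammer_singer is seldom used in practice because it rarely leads to better accuracy and is more expensive to compute.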
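A minimal sketch comparing the two supported options. The toy dataset and variable names are illustrative assumptions, and no claim is made here about which option scores better. Both variants learn one weight vector per class, so the fitted coefficients and the decision function have the same shapes.

```python
from sklearn import datasets, svm

# Illustrative toy data (assumption), 5 classes and 5 features.
X_demo, y_demo = datasets.make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=1,
    n_classes=5, random_state=3)

lin_ovr = svm.LinearSVC(C=1, multi_class='ovr', max_iter=10000,
                        random_state=0).fit(X_demo, y_demo)
lin_cs = svm.LinearSVC(C=1, multi_class='crammer_singer', max_iter=10000,
                       random_state=0).fit(X_demo, y_demo)

# Both store coefficients as (n_classes, n_features).
coef_shape_ovr = lin_ovr.coef_.shape
coef_shape_cs = lin_cs.coef_.shape

# Both return one decision value per class.
dec_shape_ovr = lin_ovr.decision_function(X_demo).shape
dec_shape_cs = lin_cs.decision_function(X_demo).shape

print(coef_shape_ovr, coef_shape_cs, dec_shape_ovr, dec_shape_cs)
```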

In [None]:
print("LinearSVC with ovr strategy :\n")
linearSvc1 = svm.LinearSVC(C=1,max_iter=10000).fit(X, Y)
y_pred_l1 = linearSvc1.predict(X)
y_pred_values_l1 = linearSvc1.decision_function(X)
print(linearSvc1)
print(y_pred_l1[0:5])
print(y_pred_values_l1[0:5,:])

LinearSVC with ovr strategy :

LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=10000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
[2 2 2 1 1]
[[-0.55545043 -0.79848614 -0.40586362 -0.75607719 -0.85880374]
 [-0.52974059 -0.55989459 -0.3733283  -1.17978197 -0.75180764]
 [-0.78366129 -0.49687015 -0.4172447  -0.69429914 -0.89170184]
 [-1.50433942  0.50801998 -0.3658002  -0.59658152 -1.06110211]
 [-1.11333877 -0.05802819 -0.70283318 -0.68056006 -0.56469538]]


In [None]:
linear_ovo = OneVsOneClassifier(svm.LinearSVC(random_state=0, C=1, max_iter=10000)).fit(X, Y)

print(linear_ovo.n_classes_)
print(linear_ovo.estimators_)

y_pred_linear_ovo = linear_ovo.predict(X)
y_pred_values_linear_ovo = linear_ovo.decision_function(X)
print(y_pred_linear_ovo[0:5])
print(y_pred_values_linear_ovo[0:5,:])

5
(LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=10000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0), LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=10000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0), LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=10000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0), LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=10000,
          multi_class='ovr', penalty='l2', random_state=0, tol=0.0001,
          verbose=0), LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
          intercept_scal