# Multiclass Classification Strategies
In this task you deal with a multiclass classification problem for the [Glass Classification Data](https://www.kaggle.com/uciml/glass). Let's load the dataset.

In [1]:
import pandas as pd

In [2]:
# If you are using colab, uncomment this cell

# ! wget https://raw.githubusercontent.com/girafe-ai/ml-mipt/master/datasets/glass.csv
# ! mkdir data
# ! mv glass.csv data

In [3]:
data = pd.read_csv("../datasets/glass.csv")
data

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,Type
0,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.0,1
1,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.0,1
...,...,...,...,...,...,...,...,...,...,...
209,1.51623,14.14,0.00,2.88,72.61,0.08,9.18,1.06,0.0,7
210,1.51685,14.92,0.00,1.99,73.06,0.00,8.40,1.59,0.0,7
211,1.52065,14.36,0.00,2.02,73.42,0.00,8.44,1.64,0.0,7
212,1.51651,14.38,0.00,1.94,73.61,0.00,8.48,1.57,0.0,7


In [4]:
feats, labels = data.drop("Type", axis=1), data.Type

In [5]:
labels.value_counts()

2    76
1    70
7    29
3    17
5    13
6     9
Name: Type, dtype: int64

The features of each glass object give the fraction of a particular chemical element in the object (except RI, which is the refractive index). The target variable is the type of glass (6 classes).

In this problem you have to empirically compare the computation time and predictive performance of several multiclass strategies for different algorithms. Consider the following algorithms:
* KNearestNeighbors (5 neighbors)
* Logistic Regression
* SVC \[Support Vector Classification\] (linear kernel)

Note that all of these algorithms support **multiclass labeling** by default. Nevertheless, compare this native approach with the **OneVSRest** and **OneVSOne** approaches applied to these algorithms. More precisely, for every pair (algorithm, approach), perform 5-fold cross-validation on the data and report the validation score and the computation time in a table.
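One possible shape for this comparison is sketched below. It is not a reference solution: the data is a synthetic stand-in generated with `make_classification` (an assumption, so the block runs without `glass.csv`), and the feature scaling, `max_iter=1000`, the shuffled stratified folds, and the `balanced_accuracy` scorer are illustrative choices rather than requirements of the task.

```python
from time import perf_counter

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic stand-in with the same shape as the glass data
# (214 samples, 9 features, 6 classes). Swap in feats/labels for the real run.
X, y = make_classification(
    n_samples=214, n_features=9, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)

algorithms = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC (linear)": SVC(kernel="linear"),
}

# "native" leaves the estimator untouched; the other two wrap it.
approaches = {
    "native": lambda est: est,
    "OneVsRest": OneVsRestClassifier,
    "OneVsOne": OneVsOneClassifier,
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

rows = []
for algo_name, estimator in algorithms.items():
    for approach_name, wrap in approaches.items():
        # Scaling is an illustrative choice, fitted inside each CV fold.
        model = make_pipeline(StandardScaler(), wrap(estimator))
        start = perf_counter()
        scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
        elapsed = perf_counter() - start
        rows.append({
            "algorithm": algo_name,
            "approach": approach_name,
            "balanced_accuracy": scores.mean(),
            "seconds": elapsed,
        })

results = pd.DataFrame(rows)
print(results)
```

The timing here covers the whole 5-fold run (fit plus scoring) for each pair, so a single measurement is noisy on a dataset this small; repeating the loop a few times and averaging gives steadier numbers.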

Note that the dataset is both multiclass and imbalanced (we will give some pointers on how to deal with imbalanced datasets later), so it is important to choose an appropriate quality metric. Try different metrics to optimize during CV (e.g. accuracy, balanced accuracy, f1, roc-auc).
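To see how the metrics can disagree on imbalanced data, several scorers can be evaluated in one pass with `cross_validate`. The following is a minimal sketch under stated assumptions: the data is again synthetic, with class weights chosen to roughly mimic the imbalance shown in `labels.value_counts()`, and logistic regression is used only because the `roc_auc_ovr` scorer needs `predict_proba`. The macro-averaged F1 and one-vs-rest ROC AUC are one reasonable reading of "f1" and "roc-auc" for a multiclass target, not the only one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic, imbalanced stand-in for the glass data.
# flip_y=0 keeps the class sizes fixed by the weights.
X, y = make_classification(
    n_samples=214, n_features=9, n_informative=6, n_redundant=0,
    n_classes=6, n_clusters_per_class=1,
    weights=[0.35, 0.33, 0.14, 0.08, 0.06, 0.04],
    flip_y=0, random_state=0,
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep every class present in each validation split,
# which the multiclass ROC AUC scorer requires.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scoring = ["accuracy", "balanced_accuracy", "f1_macro", "roc_auc_ovr"]
cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring)

# Mean validation score per metric across the 5 folds.
summary = {name: cv_results[f"test_{name}"].mean() for name in scoring}
for name, value in summary.items():
    print(f"{name:>18}: {value:.3f}")
```

Plain accuracy is dominated by the two largest classes, while balanced accuracy and macro F1 weight every class equally, so a gap between them hints that the rare classes are being predicted poorly.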

After that, answer the following questions:
* Which metric would you choose to optimize during cross-validation, and why?
* For which algorithms does the OneVSRest/OneVSOne approach provide significantly better performance without a significant increase in computation time?

In [6]:
# proper way to measure performance in modern Python! No time.time()!
# see https://docs.python.org/3/library/time.html#time.perf_counter
from time import perf_counter


# from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

In [7]:
start = perf_counter()

# YOUR CODE HERE

end = perf_counter()
print(f"taken {end - start} seconds")

taken 2.4942000000027775e-05 seconds
