# Overview of multiclass training meta-estimators

In this example, we discuss the problem of classification when the target variable is composed of more than two classes. This is called multiclass classification.

In scikit-learn, all classifiers support multiclass classification out of the box: each one already implements the most sensible strategy for the end user. The sklearn.multiclass module provides explicit strategies that are useful for experimenting, or for developing third-party estimators that only support binary classification.
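
As a quick illustration of the out-of-the-box behaviour described above, here is a minimal sketch that fits a plain LogisticRegression on a three-class target with no wrapper. The iris dataset and the names `X_iris`, `y_iris` and `native_clf` are assumptions chosen for illustration; they are not part of this example's data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Iris has three classes; it ships with scikit-learn, so no download is needed.
X_iris, y_iris = load_iris(return_X_y=True)

# No meta-estimator: the classifier handles the three classes by itself.
native_clf = LogisticRegression(max_iter=1000).fit(X_iris, y_iris)

print(native_clf.classes_)
# One probability column per class.
print(native_clf.predict_proba(X_iris[:2]).shape)
```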

sklearn.multiclass includes the one-vs-one (OvO) and one-vs-rest (OvR) strategies, which train a multiclass classifier by fitting a set of binary classifiers (the OneVsOneClassifier and OneVsRestClassifier meta-estimators). This example reviews them.
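
The two strategies can be sketched on a synthetic four-class problem. The dataset from `make_classification` and the names `X_toy`, `y_toy`, `ovr` and `ovo` are assumptions made for illustration only. With K classes, one-vs-rest fits K binary classifiers and one-vs-one fits K * (K - 1) / 2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with four classes (illustrative, not this example's data).
X_toy, y_toy = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=6,
    n_classes=4,
    random_state=0,
)

# Each meta-estimator clones the binary classifier it is given.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X_toy, y_toy)

print(len(ovr.estimators_))  # one binary classifier per class
print(len(ovo.estimators_))  # one binary classifier per pair of classes
```

Both wrappers expose the usual `fit` and `predict` interface, so they can be used wherever a classifier is expected.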

In [1]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(0)

In [2]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

X.shape, y.shape

((1309, 13), (1309,))

## Use ColumnTransformer by selecting columns by name

We will train our classifier with the following features:

Numeric Features:
- age: float;
- fare: float.

Categorical Features:
- embarked: categories encoded as strings {'C', 'S', 'Q'};
- sex: categories encoded as strings {'female', 'male'};
- pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could be treated as either a categorical or a numeric feature.

In [3]:
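# Numeric columns: fill missing values with the median, then standardize.
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

# Categorical columns: one-hot encode, then keep the top 50% of the encoded
# columns ranked by their chi-squared statistic against the target.
categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)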

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
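
The remark that pclass can be treated as either categorical or numeric can be made concrete with a small sketch. The five-column `toy` frame and the helper `make_preprocessor` below are hypothetical and exist only for illustration. The chi-squared selector is left out on purpose, because it needs a target. The only thing compared is the number of output columns under each treatment.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame with the same column names as the features above.
toy = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0],
        "fare": [7.25, 71.28, np.nan, 26.55],
        "embarked": ["S", "C", "Q", "S"],
        "sex": ["male", "female", "female", "male"],
        "pclass": [3, 1, 2, 1],
    }
)


def make_preprocessor(numeric_cols, categorical_cols):
    """Illustrative helper: impute and scale numerics, one-hot encode categoricals."""
    numeric = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )
    return ColumnTransformer(
        transformers=[
            ("num", numeric, numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ]
    )


# pclass one-hot encoded: 2 numeric + 3 (embarked) + 2 (sex) + 3 (pclass) columns.
as_categorical = make_preprocessor(
    ["age", "fare"], ["embarked", "sex", "pclass"]
).fit_transform(toy)

# pclass scaled as a number: 3 numeric + 3 (embarked) + 2 (sex) columns.
as_numeric = make_preprocessor(
    ["age", "fare", "pclass"], ["embarked", "sex"]
).fit_transform(toy)

print(as_categorical.shape)
print(as_numeric.shape)
```

Encoding pclass as a category lets the model fit a separate effect per class, while the numeric treatment assumes an ordered, evenly spaced effect.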

In [4]:
clf = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression())
    ]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print(f"model score: {clf.score(X_test, y_test):.3f}")

model score: 0.798
