# Multi-class classification

Suppose we draw k samples from a sample space containing N classes. Our goal is to estimate the probability of each class from how often it appears among those k samples. A classification problem with more than two classes, like this one, is called **multi-class classification**. In scikit-learn, the target values for multiple classes can be given directly as character strings.

One approach is to use the k-nearest neighbors (KNN) algorithm to estimate these probabilities. KNN finds the training samples nearest to a given point, and the proportion of each class among those neighbors can serve as the probability of that class for the point.
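The voting idea described above can be sketched from scratch with NumPy. This is a minimal illustration rather than scikit-learn's implementation: the helper `knn_class_probabilities` and the toy arrays are hypothetical, and the sketch assumes Euclidean distance with every neighbor counting equally.

```python
import numpy as np

def knn_class_probabilities(train_x, train_y, query, k=3):
    """Return (classes, probabilities) based on the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Sorted unique class labels
    classes = np.unique(train_y)
    # Fraction of the k neighbors that belong to each class
    probs = np.array([(train_y[nearest] == c).mean() for c in classes])
    return classes, probs

# Toy data, made up for illustration only
toy_x = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]])
toy_y = np.array(['a', 'a', 'b', 'b', 'c'])

toy_classes, toy_probs = knn_class_probabilities(
    toy_x, toy_y, np.array([0.4, 0.4]), k=3)
print(dict(zip(toy_classes, toy_probs)))
```

Here two of the three nearest points belong to class `a` and one to class `b`, so the estimated probabilities are 2/3, 1/3, and 0.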

In [9]:
import pandas as pd

fish = pd.read_csv('fish.csv')
print(fish.head())

  Species  Weight  Length  Diagonal   Height   Width
0   Bream   242.0    25.4      30.0  11.5200  4.0200
1   Bream   290.0    26.3      31.2  12.4800  4.3056
2   Bream   340.0    26.5      31.1  12.3778  4.6961
3   Bream   363.0    29.0      33.5  12.7300  4.4555
4   Bream   430.0    29.0      34.0  12.4440  5.1340


To list the species present in this CSV dataset, we can use the `unique()` function from pandas.

In [14]:
print(pd.unique(fish['Species']))
print(fish.columns.tolist())

['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']
['Species', 'Weight', 'Length', 'Diagonal', 'Height', 'Width']


We use the `Species` column as the target data and the remaining columns as the input data. Indexing a DataFrame `df` with a list of column names, as in `df[array]`, selects those columns and returns a new DataFrame. We can then convert the result into a NumPy array with `df[array].to_numpy()`.

In [21]:
fish_target = fish['Species'].to_numpy()
fish_input = fish[['Weight', 'Length', 'Diagonal', 'Height', 'Width']].to_numpy()

print(fish_target[:5])
print(fish_input[:5])

['Bream' 'Bream' 'Bream' 'Bream' 'Bream']
[[242.      25.4     30.      11.52     4.02  ]
 [290.      26.3     31.2     12.48     4.3056]
 [340.      26.5     31.1     12.3778   4.6961]
 [363.      29.      33.5     12.73     4.4555]
 [430.      29.      34.      12.444    5.134 ]]


Next, we split the input data into training data and test data. After the split, we preprocess the inputs by standardizing each feature. Note that the test set must be transformed with the statistics (mean and standard deviation) calculated from the training set, not with its own.
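To make the point about training-set statistics concrete, here is a small sketch that scales a test row by hand and compares the result with `StandardScaler`. The `demo_*` arrays are made-up values standing in for the real training and test inputs.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays standing in for train_input / test_input (values are made up)
demo_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
demo_test = np.array([[4.0, 30.0]])

# Statistics come from the training set only
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the training statistics to the test row by hand
manual_test_scaled = (demo_test - train_mean) / train_std

# StandardScaler fitted on the training set should give the same values
demo_scaler = StandardScaler().fit(demo_train)
print(np.allclose(manual_test_scaled, demo_scaler.transform(demo_test)))
```

Because the scaler is fitted on the training set alone, the test row is measured on the same scale the model will be trained on.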

Once the preprocessing is complete, we train the KNN model and evaluate its performance by checking the scores.

We are using the species names (character strings) as target values. When these names are passed to a scikit-learn model, the order of the target classes does not necessarily match the order we obtained from `pd.unique(fish['Species'])`. In `KNeighborsClassifier`, the `classes_` attribute holds the unique target classes in the order the model uses, which is sorted (alphabetical for strings). After fitting the model, we can access this ordered list using `kn.classes_`.
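The difference between the two orderings can be seen on a tiny example. The labels and one-column features below are invented for illustration; the sketch assumes a scikit-learn version in which `classes_` comes out sorted, as it does in the output shown in this section.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up labels and a single made-up feature column
demo_labels = np.array(['Roach', 'Bream', 'Roach', 'Perch', 'Bream', 'Perch'])
demo_features = np.array([[0.0], [1.0], [0.1], [2.0], [1.1], [2.1]])

# pd.unique keeps the order of first appearance
appearance_order = pd.unique(pd.Series(demo_labels))
# np.unique returns the labels sorted
sorted_order = np.unique(demo_labels)

demo_kn = KNeighborsClassifier(n_neighbors=1).fit(demo_features, demo_labels)

print(list(appearance_order))
print(list(sorted_order))
print(list(demo_kn.classes_))
```

The columns of `predict_proba()` follow the `classes_` order, so that attribute is the one to use when labelling probability columns.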

In [25]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

train_input, test_input, train_target, test_target = train_test_split(
    fish_input, fish_target, random_state = 42)

# Normalizer
ss = StandardScaler()
# Train the normalizer
ss.fit(train_input)

# Normalizing
train_scaled = ss.transform(train_input)
test_scaled = ss.transform(test_input)

# Model
kn = KNeighborsClassifier(n_neighbors=3) # default: n_neighbors=5
kn.fit(train_scaled, train_target)

# Check the unique classes
print(kn.classes_)

# Scores
score_by_training = kn.score(train_scaled, train_target)
score = kn.score(test_scaled, test_target)
print(f"score : {score_by_training} (training data)")
print(f"score : {score} (test data)")

['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']
score : 0.8907563025210085 (training data)
score : 0.85 (test data)


We can use the `kn.predict()` method to obtain the predicted class for specific data. For example, calling `kn.predict(test_scaled[:5])` returns the predictions for the first five test samples. If we want to examine the probability of each class instead of the predictions themselves, we can use `proba = kn.predict_proba(test_scaled[:5])`.

To obtain rounded numeric values, we can use the `round()` function from NumPy. By default, it rounds to zero decimal places, that is, to a whole number. We can specify the desired number of decimal places using the `decimals` parameter. For instance, `np.round(number or array, decimals=4)` rounds the number or array to four decimal places.
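A quick check of the rounding behaviour, using an arbitrary value and array chosen for illustration:

```python
import numpy as np

value = 2 / 3

# Without `decimals`, the result is rounded to a whole number
rounded_default = np.round(value)
# With decimals=4, four decimal places are kept
rounded_four = np.round(value, decimals=4)

# The same call works element-wise on arrays
rounded_array = np.round(np.array([1 / 3, 2 / 3]), decimals=4)

print(rounded_default, rounded_four, rounded_array)
```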

In [43]:
import pandas as pd
import numpy as np

classes = kn.classes_ # The unique classes
proba = np.round(kn.predict_proba(test_scaled[:5]), decimals=4)

# DataFrame to combine the result into a table
result_df = pd.DataFrame(proba, columns = classes)

print(result_df)
print(kn.predict(test_scaled[:5]))

   Bream  Parkki   Perch  Pike   Roach  Smelt  Whitefish
0    0.0     0.0  1.0000   0.0  0.0000    0.0        0.0
1    0.0     0.0  0.0000   0.0  0.0000    1.0        0.0
2    0.0     0.0  0.0000   1.0  0.0000    0.0        0.0
3    0.0     0.0  0.6667   0.0  0.3333    0.0        0.0
4    0.0     0.0  0.6667   0.0  0.3333    0.0        0.0
['Perch' 'Smelt' 'Pike' 'Perch' 'Perch']


To check these probabilities against the actual nearest neighbors, let's list the nearest neighbors of a specific sample point. For example, we can extract the fourth sample point from the test data using `test_scaled[3:4]`. The slice keeps the array two-dimensional, which is the shape `kneighbors()` expects.

In [47]:
# Neighbor 'training samples' for a 'test point'
distances, indexes = kn.kneighbors(test_scaled[3:4]) # After training

print(train_target[indexes])

[['Roach' 'Perch' 'Perch']]


This confirms the probabilities in the fourth row of the table above: one of the three neighbors is Roach (1/3) and two are Perch (2/3). However, when using `KNeighborsClassifier(n_neighbors=3)` with three nearest neighbors, the possible probabilities are limited to 0/3, 1/3, 2/3, and 3/3. To obtain more refined probabilities, we may need to consider alternative methods.
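The same check can be done by counting the neighbor labels directly. This sketch copies the three labels from the `kneighbors()` output above and assumes each neighbor has an equal vote, which is the default `weights='uniform'` setting.

```python
import numpy as np

# Neighbor labels copied from the kneighbors output above
neighbor_labels = np.array(['Roach', 'Perch', 'Perch'])

# Count how many neighbors fall in each class
labels, counts = np.unique(neighbor_labels, return_counts=True)
# Each probability is the share of neighbors in that class
fractions = counts / counts.sum()
print(dict(zip(labels, np.round(fractions, decimals=4))))

# With k equally weighted neighbors, only multiples of 1/k can occur
k = 3
possible_values = np.arange(k + 1) / k
print(possible_values)
```

Classes with no neighbor among the three, such as Bream or Pike for this sample, simply get a probability of 0.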