# Multiclass Classification

The dataset we will be working with contains information on cars. For each car, we have its technical specifications, and we will use them to predict the vehicle's origin: North America, Europe, or Asia. Unlike our previous classification datasets, this one has three categories to choose from.

The dataset is hosted by the University of California, Irvine, on its Machine Learning Repository.

Here are the columns in the dataset:

    mpg -- Miles per gallon, Continuous.
    cylinders -- Number of cylinders in the motor, Integer, Ordinal, and Categorical.
    displacement -- Size of the motor, Continuous.
    horsepower -- Horsepower produced, Continuous.
    weight -- Weight of the car, Continuous.
    acceleration -- Acceleration, Continuous.
    year -- Year the car was built, Integer and Categorical.
    origin -- Integer and Categorical. 1: North America, 2: Europe, 3: Asia.
    car_name -- Name of the car.


In [1]:
import pandas as pd
import numpy as np
cars = pd.read_csv("data/auto.csv")
unique_regions = cars['origin'].unique()
unique_regions

array([1, 3, 2])

We can use the pandas.get_dummies() function to return a DataFrame containing binary columns built from the values in the cylinders column. If we also set the prefix parameter to cyl, each new column is named with that prefix followed by the cylinder count (cyl_3, cyl_4, and so on).

In [2]:
dummy_df = pd.get_dummies(cars["cylinders"], prefix="cyl")
dummy_df.head(3)

Unnamed: 0,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
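
To see the naming and encoding on a smaller scale, here is a minimal sketch on a made-up Series of cylinder counts (not the cars data). Passing dtype=int is an assumption about your pandas version: recent releases return boolean columns by default, while the outputs in this section show 0/1 integers.

```python
import pandas as pd

# Made-up cylinder counts for illustration only.
toy_cylinders = pd.Series([4, 8, 6, 4], name="cylinders")

# One binary column per distinct value, each named "<prefix>_<value>".
# dtype=int requests 0/1 integers instead of booleans.
toy_dummies = pd.get_dummies(toy_cylinders, prefix="cyl", dtype=int)

print(toy_dummies.columns.tolist())
print(toy_dummies)
```

Each row has exactly one column set to 1, marking that row's cylinder count.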


In [3]:
cars = pd.concat([cars, dummy_df], axis=1)
cars.head(3)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,0,0,0,0,1
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,0,0,0,0,1
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,0,0,0,0,1


In [4]:
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop(['year', 'cylinders'], axis=1)
cars.head(5)

Unnamed: 0,mpg,displacement,horsepower,weight,acceleration,origin,cyl_3,cyl_4,cyl_5,cyl_6,...,year_73,year_74,year_75,year_76,year_77,year_78,year_79,year_80,year_81,year_82
0,18.0,307.0,130.0,3504.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,15.0,350.0,165.0,3693.0,11.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,18.0,318.0,150.0,3436.0,11.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,16.0,304.0,150.0,3433.0,12.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,17.0,302.0,140.0,3449.0,10.5,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The one-versus-all method is a technique where we choose a single category as the Positive case and group the rest of the categories as the Negative case. This splits the problem into multiple binary classification problems, one per category. For each observation, we then pick the category whose model returns the highest probability.
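
To make the relabeling concrete, here is a small sketch on made-up origin labels (the toy_origins values are hypothetical, not drawn from the dataset). For each category, it builds the binary target that a one-versus-all model for that category would be trained on.

```python
import pandas as pd

# Hypothetical origin labels: 1 = North America, 2 = Europe, 3 = Asia.
toy_origins = pd.Series([1, 3, 2, 1, 3])

# One binary target per category: True marks the Positive case,
# False groups every other category together.
binary_targets = {
    category: (toy_origins == category)
    for category in sorted(toy_origins.unique())
}

for category, target in binary_targets.items():
    print(category, target.tolist())
```

Every row is Positive in exactly one of the three targets, which is why three binary models are enough to cover the three categories.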

To start, let's split our data into a training set and a test set. We shuffle the rows of the cars DataFrame, then use the first 70% for training and the remaining 30% for testing.

In [20]:
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
nb_train = int(0.7 * cars.shape[0])
train = shuffled_cars.iloc[:nb_train, :]
test = shuffled_cars.iloc[nb_train:, :]
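
The same shuffle-then-slice split can be sketched on a toy frame. This version uses DataFrame.sample with a fixed random_state so the shuffle is repeatable; that is a swapped-in alternative to np.random.permutation, and the toy_df data and the seed value are made up for illustration.

```python
import pandas as pd

# Made-up data: 10 rows with a single feature column.
toy_df = pd.DataFrame({"feature": range(10)})

# frac=1 returns every row in random order; random_state fixes the shuffle.
shuffled = toy_df.sample(frac=1, random_state=1)

# First 70% of the shuffled rows for training, the rest for testing.
nb_train = int(0.7 * shuffled.shape[0])
toy_train = shuffled.iloc[:nb_train]
toy_test = shuffled.iloc[nb_train:]

print(len(toy_train), len(toy_test))
```

Because both slices come from the same shuffled frame, no row can appear in both the training and the test set.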


In [21]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars["origin"].unique()
unique_origins.sort()

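models = {}
# Per the output above, the dummy columns (cyl_*, year_*) start at column position 6.
X_train = train.iloc[:, 6:]
for origin in unique_origins:
    lr = LogisticRegression()
    # True for rows of the current origin, False for every other origin.
    y_train = (train['origin'] == origin)
    lr.fit(X_train, y_train)
    models[origin] = lr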

In [22]:
y_predict = models[3].predict(test.iloc[:,6:])

In [29]:
testing_probs = pd.DataFrame(columns=unique_origins)
X_test = test.iloc[:,6:]
for origin in unique_origins:
    # predict_proba returns two columns that sum to 1: P(False) and P(True).
    # Column 1 is the probability that the row belongs to this model's origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
testing_probs[:4]

Unnamed: 0,1,2,3
0,0.969063,0.02371,0.028716
1,0.934341,0.020469,0.095321
2,0.947359,0.040957,0.027
3,0.535786,0.282537,0.17639
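
As a small illustration of the predict_proba layout, the sketch below fits a binary model on made-up one-feature data (toy_X and toy_y are hypothetical). It assumes scikit-learn's usual behavior for boolean targets: the columns follow the order of the model's classes_ attribute, so column 0 is the probability of False and column 1 the probability of True.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, boolean target (True = Positive case).
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = np.array([False, False, True, True])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

probs = toy_model.predict_proba(toy_X)

print(toy_model.classes_)      # order of the probability columns
print(probs.sum(axis=1))       # each row sums to 1
positive_probs = probs[:, 1]   # probability of the True class
```

Note that the three one-versus-all probabilities for a single car come from separate models, so they do not have to sum to 1 across origins.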


In [30]:
predicted_origins = testing_probs.idxmax(axis=1)
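
To illustrate what idxmax(axis=1) returns, the sketch below applies it to a small probability table. The first two rows are copied from the output above; the third row is made up so that a different category wins.

```python
import pandas as pd

# Columns are origin codes; each row holds one car's three probabilities.
toy_probs = pd.DataFrame(
    [
        [0.969063, 0.023710, 0.028716],  # from the output above
        [0.535786, 0.282537, 0.176390],  # from the output above
        [0.100000, 0.200000, 0.700000],  # made-up row
    ],
    columns=[1, 2, 3],
)

# For each row, return the label of the column with the largest value.
toy_predicted = toy_probs.idxmax(axis=1)
print(toy_predicted.tolist())
```

Since the column labels are the origin codes themselves, the result can be read directly as the predicted origin for each row.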

Note: origin is coded as 1: North America, 2: Europe, 3: Asia.

In [31]:
predicted_origins[:5]

0    1
1    1
2    1
3    1
4    1
dtype: int64