# Project - Introduction to machine learning
### Francesco Carzaniga and Sonia Donati












The purpose of the project is to find the best machine learning algorithm for a particular dataset. 

## 1. Problem description


The dataset 'dataset32.csv' contains records of car accidents, classified according to their severity.
* Number of samples: 499
* Number of features: 13

The features are as follows:

0. '**time_to_aid**': time before receiving first aid (in minutes)
1. '**time_from_road_check**': time from last road maintenance (in years)
2. '**avg_speed**': average speed at impact
3. '**road_state**': average number of injured people per vehicle
4. '**ppl_vehicle**': average number of people per vehicle
5. '**avg_time_in_care**': average time spent in hospital care per injured person
6. '**num_rescue**': number of rescuers on the scene
7. '**time_to_hospital**': time to reach the hospital (in minutes)
8. '**age_vehicles**': average age of vehicles involved
9. '**time_from_vehicle_check**': time from last vehicle safety check
10. '**road_type**': road network type (local, regional, national)

The goal is to predict the severity of an accident. 

Remarks:
* '**class**': accident severity (0 = no injuries, 1 = non-fatal, 2 = fatal injuries); this is the target, not a feature
* '**vehicle_number**': vehicle registration number; not useful for prediction

## 2. Data preprocessing

First of all we have to import the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, ClassifierMixin, MetaEstimatorMixin

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

Then we load the dataset using pandas.

In [2]:
dataset = pd.read_csv('dataset32.csv', delimiter = ";").values

In this way, we obtain our dataset as an array. The last column contains the classes, i.e. 0, 1, 2.

In [3]:
y = dataset[:,13]

Moreover, it is important to notice that the three classes are well balanced, as shown below.

In [4]:
unique, counts = np.unique(y, return_counts=True)
print([counts[i]/np.sum(counts) for i in range(len(counts))])

[0.33867735470941884, 0.34468937875751504, 0.3166332665330661]


We reshape *y*:

In [5]:
y = y.astype(np.float64).reshape((dataset.shape[0],1))  # np.float was removed from recent NumPy versions

Finally, we can select the features. In this case, we omit column 11 (the vehicle registration number, which is not useful) and the last column (the class labels) of the dataset.

In [6]:
dataset = dataset[:,[0,1,2,3,4,5,6,7,8,9,10,12]]
print(dataset.shape)

(499, 12)


Unfortunately, the dataset contains some strings and missing values. In what follows, we encode the strings as integers. Can we then fill the missing values with the mean of the other values in the same column? No: imputing on the full dataset would leak information from the test set into training, so we have to split first and then do the imputation.

In [7]:
le = preprocessing.LabelEncoder()
dataset[:,3] = le.fit_transform(dataset[:,3])  # 'road_state': average = 0, bad = 1, good = 2
dataset[:,11] = le.fit_transform(dataset[:,11])  # 'road_type': local = 0 , national = 1, regional = 2

dataset = np.asarray(dataset, dtype=np.float64)  # all values of the dataset are float now

#dataset has 10 NaN values, where?
for i in range(12):
    if any(np.isnan(dataset[:,i])):
       print("Feature", i, "has", sum(np.isnan(dataset[:,i])), "NaN value(s)")

# TO BE DELETED
#substitute the missing values by the mean value of the feature
#imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
#dataset = imp_mean.fit_transform(dataset)

Feature 6 has 4 NaN value(s)
Feature 7 has 6 NaN value(s)
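
The rule "split first, then impute" can be illustrated with a minimal sketch. The numbers below are made up (they are not taken from 'dataset32.csv'), and `fill_missing` is a hypothetical helper written only for this illustration. The fill values are learned from the training rows and then reused, unchanged, on the test rows.

```python
import numpy as np

# Toy feature matrix with missing values (made-up numbers, not from dataset32.csv)
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 40.0],
              [5.0, 50.0],
              [100.0, np.nan]])

# Split first: the first four rows act as "train", the last two as "test"
X_train, X_test = X[:4], X[4:]

# Learn the fill values from the training rows only
train_medians = np.nanmedian(X_train, axis=0)


def fill_missing(data, fill_values):
    """Replace NaNs column by column with the given fill values."""
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = fill_values[cols]
    return filled


# The same training medians are applied to both parts
X_train_filled = fill_missing(X_train, train_medians)
X_test_filled = fill_missing(X_test, train_medians)

print(train_medians)
print(X_test_filled)
```

The outlier 100.0 in the test rows does not influence the fill values, which is the point of imputing after the split.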


As usual, we shuffle the now-ready dataset. The split into train and test sets is left commented out for now, because it has to be combined with the imputation.

In [8]:
#shuffling
def shuffle(dataset, y):
    z = np.hstack((dataset, y))
    np.random.shuffle(z)
    return np.hsplit(z, [dataset.shape[1]])

dataset, y = shuffle(dataset, y)

# TO DO AFTER IMPUTATION!
#splitting
#def splitting(x, y, test_size=0.2):
#    n = x.shape[0]
#    train_size = int(n * (1 - test_size))
#    return x[:train_size, ], x[train_size:, ], y[:train_size, ], y[train_size:, ]

#x_train, x_test, y_train, y_test = splitting(dataset, y)
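
As a rough sketch of how the commented-out helper would behave once enabled, the snippet below runs the same `splitting` function on random stand-in data. The arrays `x_demo` and `y_demo` are assumptions that only reproduce the shape of the real dataset (499 samples, 12 features).

```python
import numpy as np


def splitting(x, y, test_size=0.2):
    """Hold out the last `test_size` fraction of the (already shuffled) rows."""
    n = x.shape[0]
    train_size = int(n * (1 - test_size))
    return x[:train_size], x[train_size:], y[:train_size], y[train_size:]


# Random stand-in data with the same shape as the real dataset (assumption)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(499, 12))
y_demo = rng.integers(0, 3, size=(499, 1))

x_train, x_test, y_train, y_test = splitting(x_demo, y_demo)
print(x_train.shape, x_test.shape)
```

Because the helper slices by position, it relies on the rows having been shuffled beforehand.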

## 3. Model implementations

In this section we implement the machinery shared by all the models of the project. We have a multiclass problem, but we want to train binary classifiers. To do that, we build a class that decomposes the multiclass problem into one-vs-one problems: one binary classifier per pair of classes.

In [9]:
class OneVsOne(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
    def __init__(self, model=None, n_jobs=-1, **parameters):
        self.model = model
        self.n_jobs = n_jobs
        self.parameters = parameters
        self.classes = None
        self.model_list = None

    def get_params(self, deep=True):
        return {**{"model": self.model}, **{"n_jobs": self.n_jobs}, **self.parameters}

    def __fit_ovo_estimator(self, X, y, class_one, class_two):
        class_selection = np.logical_or(y == class_one, y == class_two)
        current_model = self.model().set_params(**self.parameters)
        y = y[class_selection]
        y_binarized = np.zeros_like(y)
        y_binarized[y == class_one] = 0
        y_binarized[y == class_two] = 1
        X = X[class_selection]
        current_model.fit(X, y_binarized)
        return current_model, class_one, class_two

    def fit(self, X, y):
        self.classes = np.unique(y)
        models = Parallel(n_jobs=self.n_jobs)(delayed(self.__fit_ovo_estimator)
                                              (X, y, self.classes[i], self.classes[j]) for i in range(len(self.classes))
                                              for j in range(i + 1, len(self.classes)))
        self.model_list = list(zip(*models))
        return self  # scikit-learn convention: fit returns the estimator

    @staticmethod
    def __predict_ovo_estimator(X, model):
        return model.predict(X)

    @staticmethod
    def __predict_proba_ovo_estimator(X, model):
        try:
            confidence = np.max(model.predict_proba(X), axis=1)
        except (AttributeError, NotImplementedError):
            confidence = model.decision_function(X)
        return confidence

    def predict(self, X):
        models = self.model_list[0]
        predictions = np.stack(Parallel(n_jobs=self.n_jobs)(delayed(self.__predict_ovo_estimator)(X, models[i])
                                                            for i in range(len(models)))).astype(dtype=np.int32).T
        confidences = np.stack(Parallel(n_jobs=self.n_jobs)(delayed(self.__predict_proba_ovo_estimator)(X, models[i])
                                                            for i in range(len(models)))).T
        votes = np.zeros((X.shape[0], self.classes.size))
        total_confidences = np.zeros_like(votes)
        for model in range(len(models)):
            class_one_m = self.model_list[1][model]
            class_two_m = self.model_list[2][model]
            votes[predictions[:, model] == 0, np.argwhere(self.classes == class_one_m)[0]] += 1
            votes[predictions[:, model] == 1, np.argwhere(self.classes == class_two_m)[0]] += 1
            total_confidences[predictions[:, model] == 0, np.argwhere(self.classes == class_one_m)[0]] += \
                confidences[predictions[:, model] == 0, model]
            total_confidences[predictions[:, model] == 1, np.argwhere(self.classes == class_two_m)[0]] += \
                confidences[predictions[:, model] == 1, model]
        transformed_confidences = (total_confidences /
                                   (3 * (np.abs(total_confidences) + 1)))
        winners = self.classes[np.argmax(votes+transformed_confidences, axis=1)]
        return winners
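
The pairwise decomposition and the vote counting done in `predict` can be illustrated in isolation. This is a simplified sketch: the binary outputs in `pairwise_predictions` are hypothetical, and the confidence-based tie-breaking of the class above is left out.

```python
from itertools import combinations

import numpy as np

classes = np.array([0, 1, 2])

# One binary problem per unordered pair of classes: K * (K - 1) / 2 models
pairs = list(combinations(classes.tolist(), 2))
print(pairs)

# Hypothetical outputs of the three pairwise models for two samples:
# 0 means "first class of the pair", 1 means "second class of the pair"
pairwise_predictions = np.array([[0, 0, 1],
                                 [1, 1, 1]])

# Each pairwise model casts one vote per sample.
# The class labels double as column indices here because they are 0, 1, 2.
votes = np.zeros((pairwise_predictions.shape[0], classes.size), dtype=int)
for m, (class_one, class_two) in enumerate(pairs):
    votes[pairwise_predictions[:, m] == 0, class_one] += 1
    votes[pairwise_predictions[:, m] == 1, class_two] += 1

# The class with the most votes wins
winners = classes[np.argmax(votes, axis=1)]
print(votes)
print(winners)
```

With three classes, three pairwise models are trained, and a sample can receive at most two votes for any single class.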

## 4. Validation

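Afterwards, we run 5-fold cross-validation for all the models. The imputer is part of each pipeline, so the median used to fill missing values is learned from the training folds only.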

In [10]:
# SVM (Default: Kernel "rbf")
model_SVM = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_SVM))])
val_SVM = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_SVM))

mean_val_SVM = val_SVM["test_score"].mean()
print("This is the mean of the test_score:", mean_val_SVM)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 2.326981 | 4.533875 | 0.56 |
| 1 | 2.240013 | 4.451095 | 0.65 |
| 2 | 2.211087 | 4.509725 | 0.63 |
| 3 | 2.281853 | 4.429887 | 0.62 |
| 4 | 2.218066 | 4.528541 | 0.707071 |


This is the mean of the test_score: 0.6334141414141414


In [11]:
# Polynomially kernelized SVM
model_poly = svm.SVC
imputer = SimpleImputer(missing_values= np.nan,strategy="median")
fitter = OneVsOne(model_poly, kernel = 'poly')
estimator = Pipeline([("imputer", imputer),("onevsonefitter", fitter)])
val_poly = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_poly))

mean_val_poly = val_poly["test_score"].mean()
print("This is the mean of the test_score:", mean_val_poly)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 2.315416 | 4.407249 | 0.69 |
| 1 | 2.209062 | 4.425925 | 0.76 |
| 2 | 2.211086 | 4.429462 | 0.74 |
| 3 | 2.232069 | 4.7263 | 0.74 |
| 4 | 2.239044 | 4.436809 | 0.808081 |


This is the mean of the test_score: 0.7476161616161615


In [12]:
# Linear kernelized SVM
model_linear = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_linear, kernel = "linear"))])
val_linear = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_linear))

mean_val_linear = val_linear["test_score"].mean()
print("This is the mean of the test_score:", mean_val_linear)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 2.498288 | 4.397242 | 0.85 |
| 1 | 2.265939 | 4.525898 | 0.82 |
| 2 | 2.426589 | 4.453091 | 0.83 |
| 3 | 2.28888 | 4.427163 | 0.93 |
| 4 | 2.240008 | 4.418676 | 0.878788 |


This is the mean of the test_score: 0.8617575757575757


In [13]:
# K-nearest neighbour algorithm
model_K = KNeighborsClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_K,n_neighbors=5))])
val_K = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_K))

mean_val_K = val_K["test_score"].mean()
print("This is the mean of the test_score:", mean_val_K)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 2.369661 | 4.47907 | 0.83 |
| 1 | 2.244962 | 4.540857 | 0.83 |
| 2 | 2.303987 | 4.506757 | 0.74 |
| 3 | 2.266937 | 4.502961 | 0.85 |
| 4 | 2.258497 | 4.496978 | 0.787879 |


This is the mean of the test_score: 0.8075757575757576


In [14]:
# Artificial neural network
model_ANN = MLPClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_ANN,hidden_layer_sizes=(16,), activation='tanh', solver='adam', learning_rate='adaptive', early_stopping = True))])
val_ANN = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_ANN))

mean_val_ANN = val_ANN["test_score"].mean()
print("This is the mean of the test_score:", mean_val_ANN)

| | fit_time | score_time | test_score |
|---|---|---|---|
| 0 | 2.276912 | 4.776227 | 0.31 |
| 1 | 2.517268 | 5.11066 | 0.51 |
| 2 | 2.283554 | 4.465679 | 0.32 |
| 3 | 2.259954 | 4.374416 | 0.46 |
| 4 | 2.240011 | 4.379553 | 0.454545 |


This is the mean of the test_score: 0.41090909090909095
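
To compare the models evaluated so far, the mean test scores printed above can be collected and ranked. The snippet below is a small sketch that copies those means by hand (rounded to four decimals); the model labels are our own shorthand.

```python
# Mean cross-validation test scores reported above, rounded to four decimals
mean_scores = {
    "SVM (rbf)": 0.6334,
    "SVM (poly)": 0.7476,
    "SVM (linear)": 0.8618,
    "KNN (k=5)": 0.8076,
    "ANN (MLP)": 0.4109,
}

# Rank the models from best to worst mean test score
ranking = sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<14}{score:.4f}")

best_name, best_score = ranking[0]
```

Among these five models, the linear-kernel SVM has the highest mean test score and the neural network the lowest.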


In [15]:
# Random forest
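
The random forest cell is still empty. A possible starting point, following the same pipeline pattern as the cells above, is sketched here. It is only an assumption of how the cell could look: it runs on synthetic data from `make_classification` rather than on the accident dataset, it uses `RandomForestClassifier` directly (it handles multiclass problems natively) instead of the `OneVsOne` wrapper, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the accident data (assumption: 499 samples, 12 features, 3 classes)
X_demo, y_demo = make_classification(n_samples=499, n_features=12, n_informative=6,
                                     n_classes=3, random_state=0)
X_demo[::50, 6] = np.nan  # inject a few missing values to exercise the imputer

# Same structure as the other models: impute inside the pipeline, then classify
estimator = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
val_RF = cross_validate(estimator, X_demo, y_demo, cv=5)

mean_val_RF = val_RF["test_score"].mean()
print("Mean test_score on the synthetic data:", mean_val_RF)
```

The score printed by this sketch describes the synthetic data only and says nothing about the accident dataset.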

## 5. Testing

## 6. Conclusion

From the sections above, we can conclude that the best model for our dataset is: Random forest, ...