# Project - Introduction to machine learning
### Francesco Carzaniga and Sonia Donati












The purpose of the project is to find the best machine learning algorithm for a particular dataset. 

## 1. Problem description


The data 'dataset32.csv' is compiled from car accidents, classified according to their severity.
* Number of samples: 499
* Number of features: 13

The features are as follows:

0. '**time_to_aid**': time before receiving first aid (in minutes)
1. '**time_from_road_check**': time from last road maintenance (in years)
2. '**avg_speed**': average speed at impact
3. '**road_state**': average number of injured people per vehicle
4. '**ppl_vehicle**': average number of people per vehicle
5. '**avg_time_in_care**': average time spent in hospital care per injured person
6. '**num_rescue**': number of rescuers on the scene
7. '**time_to_hospital**': time to reach the hospital (in minutes)
8. '**age_vehicles**': average age of vehicles involved
9. '**time_from_vehicle_check**': time from last vehicle safety check
10. '**road_type**': road network type (local, regional, national)

The goal is to predict the severity of an accident. 

Remarks:
* '**class**': accident severity (0 = no injuries, 1 = non-fatal injuries, 2 = fatal injuries); this is the target, not a feature
* '**vehicle_number**': vehicle registration number; not useful for prediction

## 2. Data preprocessing

First of all we have to import the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

Then we load the dataset using pandas.

In [2]:
dataset = pd.read_csv('dataset32.csv', delimiter = ";").values

In this way, we obtain our dataset as an array. The last column contains the classes, i.e. 0, 1, 2.

In [3]:
y = dataset[:,13]

Moreover, it is important to note that the three classes are well balanced, as shown below.

In [4]:
unique, counts = np.unique(y, return_counts=True)
print([counts[i]/np.sum(counts) for i in range(len(counts))])

[0.33867735470941884, 0.34468937875751504, 0.3166332665330661]


We reshape *y*:

In [5]:
y = y.astype(np.float64).reshape((dataset.shape[0], 1))  # np.float was removed in NumPy 1.24

Finally, we can select the features. We drop column 13 (the class labels, already stored in *y*) and column 11, which leaves 12 feature columns.

In [6]:
dataset = dataset[:,[0,1,2,3,4,5,6,7,8,9,10,12]]
print(dataset.shape)

(499, 12)


Unfortunately, the dataset contains some strings and missing values. In what follows, we encode the strings as integers. Can we then fill the missing values with the mean of the other values in the same column? No: we have to split first and then impute, otherwise information from the test samples leaks into the values used for training.

In [7]:
le = preprocessing.LabelEncoder()
dataset[:,3] = le.fit_transform(dataset[:,3])  # 'road_state': average = 0, bad = 1, good = 2
dataset[:,11] = le.fit_transform(dataset[:,11])  # 'road_type': local = 0 , national = 1, regional = 2

dataset = np.asarray(dataset, dtype=np.float64)  # all values of the dataset are float now

# the dataset has 10 NaN values; find out in which features
for i in range(dataset.shape[1]):
    if np.isnan(dataset[:, i]).any():
        print("Feature", i, "has", np.isnan(dataset[:, i]).sum(), "NaN value(s)")

# TO BE DELETED
# substitute the missing values by the mean value of the feature
#imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
#dataset = imp_mean.fit_transform(dataset)

Feature 6 has 4 NaN value(s)
Feature 7 has 6 NaN value(s)
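
The "split first, then impute" order mentioned above can be sketched as follows. This is only an illustration on hypothetical toy data (the array `X`, the helper `fill_missing` and the 80/20 split are assumptions, not part of the project code). It uses the median, as the pipelines in the validation chapter do; `SimpleImputer(strategy="median")` fitted on the training part and then applied to the test part should behave the same way.

```python
import numpy as np

# Hypothetical toy data standing in for the accident features: 10 samples, 2 columns
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
X[1, 0] = np.nan  # missing value that ends up in the training part
X[9, 1] = np.nan  # missing value that ends up in the test part

# 1) split first (80 % train, 20 % test)
train_size = int(X.shape[0] * 0.8)
X_train, X_test = X[:train_size].copy(), X[train_size:].copy()

# 2) compute the fill values on the training part only
train_median = np.nanmedian(X_train, axis=0)

# 3) fill the gaps of both parts with the training medians
def fill_missing(block, fill_values):
    """Return a copy of `block` whose NaNs are replaced column-wise by `fill_values`."""
    filled = block.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = fill_values[cols]
    return filled

X_train_imp = fill_missing(X_train, train_median)
X_test_imp = fill_missing(X_test, train_median)
```

The test part never contributes to `train_median`, so no information flows from the test samples into the training data.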


Next, we shuffle the now-ready dataset. The split into training and test sets is prepared below but not yet applied.

In [8]:
#shuffling
def shuffle(dataset, y):
    z = np.hstack((dataset, y))
    np.random.shuffle(z)
    return np.hsplit(z, [dataset.shape[1]])

dataset, y = shuffle(dataset, y)

# TO DO AFTER IMPUTATION!
#splitting
#def splitting(x, y, test_size=0.2):
#    n = x.shape[0]
#    train_size = int(n * (1 - test_size))
#    return x[:train_size, ], x[train_size:, ], y[:train_size, ], y[train_size:, ]

#x_train, x_test, y_train, y_test = splitting(dataset, y)
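
As a possible variant of the shuffling and splitting above, the sketch below draws one seeded permutation and applies it to features and labels alike, so that repeated runs give the same split. The function name `shuffle_and_split`, the `seed` parameter and the demo arrays are assumptions made for illustration.

```python
import numpy as np

def shuffle_and_split(x, y, test_size=0.2, seed=0):
    """Shuffle x and y with the same seeded permutation, then split them."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(x.shape[0])  # one permutation keeps rows and labels paired
    x, y = x[order], y[order]
    train_size = int(x.shape[0] * (1 - test_size))
    return x[:train_size], x[train_size:], y[:train_size], y[train_size:]

# Hypothetical demo data: row i of x_demo is [2*i, 2*i + 1] and its label is i
x_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float).reshape(10, 1)
x_train, x_test, y_train, y_test = shuffle_and_split(x_demo, y_demo)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```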

## 3. Model implementations

In this chapter we implement all the methods for the project. We have a multiclass problem, but we want to treat it as a set of binary problems. To do that, we build a class that transforms the multiclass problem into a one-vs-all (1vsAll) problem.

In [9]:
class Transform(object):  # class to transform the multiclass problem to 1vsAll problem
    def __init__(self, model=None, **parameters):
        self.model = model
        self.model_list = []
        self.classes = None
        self.parameters = parameters

    def get_params(self, deep=True):
        return {**{"model": self.model}, **self.parameters}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def fit(self, X, y):
        classes = np.unique(y)
        self.classes = classes
        self.model_list = []  # start from scratch on every call to fit
        for position, item in enumerate(classes):
            y_mod = np.copy(y)
            actual_model = self.model().set_params(**self.parameters)
            # one-vs-all: every label other than `item` is replaced by the next class label
            y_mod[y_mod != item] = classes[(position + 1) % classes.size]
            actual_model.fit(X, y_mod)
            self.model_list.append(actual_model)
        return self

    def predict(self, X):
        predict_array = np.stack([model.predict(X) for model in self.model_list])
        val = []
        #for i in range(self.classes.size):
            #for k in range(X.shape[0]):
               # if predict_array[i,k] == self.classes[i]:
                #    val.append(self.classes[i])

               # elif predict_array[(i+1) % self.classes.size,k] == self.classes[(i+1) % self.classes.size]:
               #     val.append(self.classes[(i+1) % self.classes.size])
               # else:
               #     val.append(self.classes[(i+2) % self.classes.size])
        for k in range(X.shape[0]):
            index = np.where(predict_array[:, k] == self.classes)
            val.append(self.classes[index])
        val_array = np.asarray(val)
        return val_array

    def score(self, X, y):
        label_predict = self.predict(X)
        # mean of the elementwise comparison with the true labels
        accuracy = np.mean(y.ravel() == label_predict)
        return accuracy
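
In `predict`, each entry appended to `val` is `self.classes[index]`, which can hold zero, one or several labels for a sample, so the result is not guaranteed to be a flat vector with one label per sample, which is what the comparison in `score` needs. One possible way to resolve the votes is sketched below. The helper `resolve_votes` and its tie-breaking rule (take the first class whose own model voted for it, fall back to the first class when no model did) are assumptions made for illustration, not the method of the project.

```python
import numpy as np

def resolve_votes(predict_array, classes):
    """Turn one-vs-all predictions into exactly one label per sample.

    predict_array[i, k] is the label that the model trained for classes[i]
    predicted for sample k.
    """
    # votes[i, k] is True when model i claims sample k for its own class
    votes = predict_array == classes[:, None]
    # argmax returns the first True per column, or 0 when the column has no True
    first_vote = votes.argmax(axis=0)
    return classes[first_vote]

classes = np.array([0.0, 1.0, 2.0])
# Hypothetical predictions of the three models for five samples
predict_array = np.array([
    [0.0, 1.0, 1.0, 1.0, 0.0],  # model for class 0 (rest is labelled 1)
    [2.0, 1.0, 2.0, 2.0, 1.0],  # model for class 1 (rest is labelled 2)
    [0.0, 0.0, 2.0, 0.0, 0.0],  # model for class 2 (rest is labelled 0)
])
labels = resolve_votes(predict_array, classes)
print(labels)  # one label per sample, shape (5,)
```

The fourth sample receives no vote and the fifth receives two, which are exactly the cases where a per-sample list of matches would have length zero or two.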

## 4. Validation

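Next, we run 5-fold cross-validation for all the models. Each model sits in a pipeline behind a median imputer, so the imputer is fitted on the training folds only.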

In [10]:
# SVM (Default: Kernel "rbf")
model_SVM = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",Transform(model_SVM))])
val_SVM = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_SVM))

mean_val_SVM = val_SVM["test_score"].mean()
print("This is the mean of the test_score:", mean_val_SVM)

| fold | fit_time (s) | score_time (s) | test_score |
|---|---|---|---|
| 0 | 0.018073 | 0.003075 | 0.36 |
| 1 | 0.014596 | 0.003027 | 0.42 |
| 2 | 0.014913 | 0.003114 | 0.32 |
| 3 | 0.014289 | 0.002882 | 0.3 |
| 4 | 0.015091 | 0.003185 | 0.323232 |


This is the mean of the test_score: 0.3446464646464647
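
For comparison, the same kind of cross-validation can be sketched with scikit-learn's own one-vs-rest wrapper, `OneVsRestClassifier`, in place of the custom class. This is only a sketch on synthetic data: the arrays `X_demo` and `y_demo` come from `make_classification` and are not the accident dataset, so the score it prints says nothing about the project data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_validate
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in: 300 samples, 12 features, 3 classes (an assumption for the demo)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_demo[::25, 6] = np.nan  # inject a few missing values into one feature

# Imputation inside the pipeline is refitted on the training folds of every split
estimator = Pipeline([
    ("imputer", SimpleImputer(missing_values=np.nan, strategy="median")),
    ("ovr", OneVsRestClassifier(SVC())),  # one binary SVM (rbf kernel) per class
])
scores = cross_validate(estimator, X_demo, y_demo, cv=5)
print("Mean test score:", scores["test_score"].mean())
```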


In [11]:
# Polynomially kernelized SVM
model_poly = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",Transform(model_poly, kernel = 'poly'))])
val_poly = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_poly))

mean_val_poly = val_poly["test_score"].mean()
print("This is the mean of the test_score:", mean_val_poly)

| fold | fit_time (s) | score_time (s) | test_score |
|---|---|---|---|
| 0 | 0.02178 | 0.002951 | 0.01 |
| 1 | 0.013972 | 0.002293 | 0.03 |
| 2 | 0.012135 | 0.002089 | 0.02 |
| 3 | 0.012029 | 0.002052 | 0.3 |
| 4 | 0.011947 | 0.002084 | 0.010101 |


This is the mean of the test_score: 0.07402020202020201


In [12]:
# Linear kernelized SVM
model_linear = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",Transform(model_linear, kernel = "linear"))])
val_linear = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_linear))

mean_val_linear = val_linear["test_score"].mean()
print("This is the mean of the test_score:", mean_val_linear)

| fold | fit_time (s) | score_time (s) | test_score |
|---|---|---|---|
| 0 | 0.104789 | 0.001405 | 0.28 |
| 1 | 0.065908 | 0.001345 | 0.28 |
| 2 | 0.078549 | 0.001409 | 0.27 |
| 3 | 0.120723 | 0.001397 | 0.33 |
| 4 | 0.09893 | 0.001429 | 0.232323 |


This is the mean of the test_score: 0.27846464646464647


In [13]:
# K-nearest neighbour algorithm
model_K = KNeighborsClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",Transform(model_K,n_neighbors=5))])
val_K = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_K))

mean_val_K = val_K["test_score"].mean()
print("This is the mean of the test_score:", mean_val_K)

| fold | fit_time (s) | score_time (s) | test_score |
|---|---|---|---|
| 0 | 0.006762 | 0.021048 | 0.25 |
| 1 | 0.003435 | 0.013103 | 0.23 |
| 2 | 0.003364 | 0.012648 | 0.24 |
| 3 | 0.0033 | 0.012514 | 0.31 |
| 4 | 0.003447 | 0.013175 | 0.20202 |


This is the mean of the test_score: 0.2464040404040404


In [14]:
# Artificial neural network
model_ANN = MLPClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",Transform(model_ANN,hidden_layer_sizes=(16,), activation='tanh', solver='adam', learning_rate='adaptive', early_stopping = True))])
val_ANN = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_ANN))

mean_val_ANN = val_ANN["test_score"].mean()
print("This is the mean of the test_score:", mean_val_ANN)

| fold | fit_time (s) | score_time (s) | test_score |
|---|---|---|---|
| 0 | 0.11839 | 0.001013 | 0.34 |
| 1 | 0.081654 | 0.001189 | 0.31 |
| 2 | 0.061238 | 0.001125 | 0.32 |
| 3 | 0.079506 | 0.001072 | 0.29 |
| 4 | 0.061015 | 0.001054 | 0.373737 |


This is the mean of the test_score: 0.32674747474747473
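
As a quick sanity check, the mean test scores printed above can be put side by side with the score of always predicting the most frequent class. The numbers below are copied from the outputs in this notebook. Using the class frequencies of the whole dataset as the baseline is an assumption, since the frequencies inside each fold may differ slightly.

```python
# Mean test scores as printed by the cross-validation cells above
mean_scores = {
    "SVM (rbf)": 0.3446464646464647,
    "SVM (poly)": 0.07402020202020201,
    "SVM (linear)": 0.27846464646464647,
    "k-nearest neighbours": 0.2464040404040404,
    "Neural network": 0.32674747474747473,
}

# Class frequencies printed in the preprocessing chapter
class_frequencies = [0.33867735470941884, 0.34468937875751504, 0.3166332665330661]
baseline = max(class_frequencies)  # score of always predicting the largest class

ranking = sorted(mean_scores.items(), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    verdict = "above" if score > baseline else "not above"
    print(f"{name:22s} {score:.3f}  ({verdict} the majority-class baseline {baseline:.3f})")
```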


In [15]:
# Random forest

## 5. Testing

## 6. Conclusion

From the chapters above, we can conclude that the best model for our dataset is: Random forest, ...