# Project - Introduction to machine learning
### Francesco Carzaniga and Sonia Donati












The purpose of the project is to find the best machine learning algorithm for a particular dataset. 

## 1. Problem description


The data 'dataset32.csv' is compiled from car accidents, classified according to their severity.
* Number of samples: 499
* Number of features: 13

The feature are as follows:

0. '**time_to_aid**': time before receiving first aid (in minutes)
1. '**time_from_road_check**': time from last road maintenance (in years)
2. '**avg_speed**': average speed at impact
3. '**road_state**': average number of injured people per vehicle
4. '**ppl_vehicle**': average number of people per vehicle
5. '**avg_time_in_care**': average time spent in hospital care per injured person
6. '**num_rescue**': number of rescuers on the scene
7. '**time_to_hospital**': time to reach the hospital (in minutes)
8. '**age_vehicles**': average age of vehicles involved
9. '**time_from_vehicle_check**': time from last vehicle safety check
10. '**road_type**': road network type (local, regional, national)

The goal is to predict the severity of an accident. 

Remarks:
* '**class**': accident severity (0 = no injuries, 1 = non-fatal, 2 = fatal injuries) is not a feature
* '**vehicle_number**': vehicle registration number is not useful

## 2. Data preprocessing

First of all we have to import the necessary packages.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from joblib import Parallel, delayed
from sklearn.base import BaseEstimator, ClassifierMixin, MetaEstimatorMixin

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn import linear_model
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

Then, we obvious upload the dataset using pandas.

In [2]:
dataset = pd.read_csv('dataset32.csv', delimiter = ";").values

In this way, we obtain our dataset as an array. The last column contain the classes, i.e 0,1,2.

In [3]:
y = dataset[:,13]

Moreover, is important to notice that the three classes are good balanced, as it is shown below

In [4]:
unique, counts = np.unique(y, return_counts=True)
print([counts[i]/np.sum(counts) for i in range(len(counts))])

[0.33867735470941884, 0.34468937875751504, 0.3166332665330661]


We reshape *y*:

In [5]:
y = y.astype(np.float).reshape((dataset.shape[0],1))

Finally, we can select the features. In this case, we omit the last and second-last column of the dataset. 

In [6]:
dataset = dataset[:,[0,1,2,3,4,5,6,7,8,9,10,12]]
print(dataset.shape)

(499, 12)


Unfortunately, the dataset presents some strings and empty spaces. In what follows, we transform string to integer. After, can we fill up the voids with the mean of the other values (in the same column)? NO! We have to split first and the do imputation!

In [7]:
le = preprocessing.LabelEncoder()
dataset[:,3] = le.fit_transform(dataset[:,3])  # 'road_state': average = 0, bad = 1, good = 2
dataset[:,11] = le.fit_transform(dataset[:,11])  # 'road_type': local = 0 , national = 1, regional = 2

dataset = np.asarray(dataset, dtype=np.float64)  # all values of the dataset are float now

#dataset has 10 NaN values, where?
for i in range(12):
    if any(np.isnan(dataset[:,i])):
       print("Feature", i, "has", sum(np.isnan(dataset[:,i])), "NaN value(s)")

#DA CANCELLARE
#substitute the missing values by the mean value of the feature
#imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
#dataset = imp_mean.fit_transform(dataset)

Feature 6 has 4 NaN value(s)
Feature 7 has 6 NaN value(s)


As usual, we make shuffle and split over the now-ready dataset to obtain train sets and test sets.

In [8]:
#shuffling
def shuffle(dataset, y):
    z = np.hstack((dataset, y))
    np.random.shuffle(z)
    return np.hsplit(z, [dataset.shape[1]])

dataset, y = shuffle(dataset, y)

# DA FARE DOPO IMPUTATION!
#splitting
#def splitting(x, y, test_size=0.2):
#    n = x.shape[0]
#    train_size = int(n * (1 - test_size))
#    return x[:train_size, ], x[train_size:, ], y[:train_size, ], y[train_size:, ]

#x_train, x_test, y_train, y_test = splitting(dataset, y)

## 3. Model implementations

In this chapter we will implement all the methods for the project. We have a multiclass problem, but we want it to be binary. For do that, we build a class that transform the multiclass problem to OnevsOne and OnevsAll problem.

In [9]:
class OneVsOne(BaseEstimator, ClassifierMixin, MetaEstimatorMixin):
    def __init__(self, model=None, n_jobs=-1, **parameters):
        self.model = model
        self.n_jobs = n_jobs
        self.parameters = parameters
        self.classes = None
        self.model_list = None

    def get_params(self, deep=True):
        return {**{"model": self.model}, **{"n_jobs": self.n_jobs}, **self.parameters}

    def __fit_ovo_estimator(self, X, y, class_one, class_two):
        class_selection = np.logical_or(y == class_one, y == class_two)
        current_model = self.model().set_params(**self.parameters)
        y = y[class_selection]
        y_binarized = np.zeros_like(y)
        y_binarized[y == class_one] = 0
        y_binarized[y == class_two] = 1
        X = X[class_selection]
        current_model.fit(X, y_binarized)
        return current_model, class_one, class_two

    def fit(self, X, y):
        self.classes = np.unique(y)
        models = Parallel(n_jobs=self.n_jobs)(delayed(self.__fit_ovo_estimator)
                                              (X, y, self.classes[i], self.classes[j]) for i in range(len(self.classes))
                                              for j in range(i + 1, len(self.classes)))
        self.model_list = list(zip(*models))
        return

    @staticmethod
    def __predict_ovo_estimator(X, model):
        return model.predict(X)

    @staticmethod
    def __predict_proba_ovo_estimator(X, model):
        try:
            confidence = np.max(model.predict_proba(X), axis=1)
        except (AttributeError, NotImplementedError):
            confidence = model.decision_function(X)
        return confidence

    def predict(self, X):
        models = self.model_list[0]
        predictions = np.stack(Parallel(n_jobs=self.n_jobs)(delayed(self.__predict_ovo_estimator)(X, models[i])
                                                            for i in range(len(models)))).astype(dtype=np.int32).T
        confidences = np.stack(Parallel(n_jobs=self.n_jobs)(delayed(self.__predict_proba_ovo_estimator)(X, models[i])
                                                            for i in range(len(models)))).T
        votes = np.zeros((X.shape[0], self.classes.size))
        total_confidences = np.zeros_like(votes)
        for model in range(len(models)):
            class_one_m = self.model_list[1][model]
            class_two_m = self.model_list[2][model]
            votes[predictions[:, model] == 0, np.argwhere(self.classes == class_one_m)[0]] += 1
            votes[predictions[:, model] == 1, np.argwhere(self.classes == class_two_m)[0]] += 1
            total_confidences[predictions[:, model] == 0, np.argwhere(self.classes == class_one_m)[0]] += \
                confidences[predictions[:, model] == 0, model]
            total_confidences[predictions[:, model] == 1, np.argwhere(self.classes == class_two_m)[0]] += \
                confidences[predictions[:, model] == 1, model]
        transformed_confidences = (total_confidences /
                                   (3 * (np.abs(total_confidences) + 1)))
        winners = self.classes[np.argmax(votes+transformed_confidences, axis=1)]
        return winners

In [None]:
#OnevsAll
class OnevsAll(object):
    def __init__(self, model=None, **parameters):
        self.model = model
        self.model_list = []
        self.classes = None
        self.parameters = parameters
        self.popular = None

    def get_params(self, deep=True):
        return {**{"model": self.model}, **self.parameters}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    # def most_frequent(self, List):
        occurrence_count = Counter(List)
        return occurrence_count.most_common(1)[0][0]

    def fit(self, X, y):
        classes = np.unique(y)
        self.classes = classes
        # self.popular = self.most_frequent(y.ravel().tolist())
        for item in classes:
            y_mod = np.copy(y)
            actual_model = self.model().set_params(**self.parameters)
            y_mod[y_mod != item] = classes[(np.where(classes == item)[0]+1) % classes.size] # to obtain 1vs.all
            # (da correggere) y_mod = np.place(y_mod, y_mod != item, classes[(np.where(classes == item)[0]+1) % classes.size])
            actual_model.fit(X, y_mod)
            self.model_list.append(actual_model)
        return

    def predict(self, X):
        predict_array = np.stack([model.predict(X) for model in self.model_list])
        # DA FARE predict_prob =
        val = []
        #for i in range(self.classes.size):
        # for k in range(X.shape[0]):
        #     if predict_array[0,k] == self.classes[0]:
        #         val.append(self.classes[0])
        #
        #     elif predict_array[(i+1) % self.classes.size,k] == self.classes[(i+1) % self.classes.size]:
        #         val.append(self.classes[(i+1) % self.classes.size])
        #     else:
        #         val.append(self.classes[(i+2) % self.classes.size])

        # for k in range(X.shape[0]):
        #     index = np.where(predict_array[:,k] == self.classes)
        #     if index[0].size == 1:
        #         val.append(index[0][0])
        #     else:
        #         val.append(self.most_frequent(self.popular))

        val_array = np.asarray(val)
        return val_array

    def score(self, X, y):
        label_predict = self.predict(X)
        loss = np.mean(y.ravel() == label_predict)
        return loss

Last thing to implement is RandomForest.

In [10]:
from sklearn.preprocessing import LabelEncoder

def label_to_numerical(array):
    label_numerical = []
    for column in range(array.shape[1]):
        try:
            feature = np.asarray(array[:, column]).astype(float)
            label_numerical.append(feature)
        except ValueError:
            le = LabelEncoder()
            feature = le.fit_transform(array[:, column])
            label_numerical.append(feature)
    return np.stack(label_numerical, axis=1)


def impute_whole(array):
    dataset = []
    for column in range(array.shape[1]):
        try:
            imp = SimpleImputer(strategy='median')
            feature = np.asarray(array[:, column]).astype(float)
            feature = imp.fit_transform(feature)
            dataset.append(feature.ravel())
        except ValueError:
            imp = SimpleImputer(strategy='most_frequent')
            feature = np.asarray(array[:, column]).astype(object).reshape(-1, 1)
            feature = imp.fit_transform(feature)
            dataset.append(feature.ravel())
    return np.stack(dataset, axis=1)

In [11]:
from utils.preprocessing import label_to_numerical, impute_whole
from joblib import Parallel, delayed
#from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#from sklearn.tree import plot_tree
# import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, ClassifierMixin
from time import perf_counter

class Tree(object):
    def __init__(self, parent=None, children=None, feature=None, threshold=None, direction=None, excluded_samples=None,
                 is_leaf=False, decision=None, confidence=None):
        if children is None:
            children = []
        self.parent = parent
        self.children = children
        self.feature = feature
        self.threshold = threshold
        self.direction = direction
        self.excluded_samples = excluded_samples
        self.is_leaf = is_leaf
        self.decision = decision
        self.confidence = confidence
        self.depth = self.compute_depth()

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    def compute_depth(self):
        depth = 0
        node = self
        while node.parent is not None:
            node = node.parent
            depth += 1
        return depth

    def add_child(self, child):
        self.children.append(child)

    def get_parent(self):
        return self.parent

    def get_children(self):
        return self.children

    def get_feature(self):
        return self.feature

    def get_threshold(self):
        return self.threshold

    def get_all_features(self):
        node = self
        features_list = []
        while node is not None:
            features_list.append(int(node.get_feature()))
            node = node.parent
        return np.asarray(features_list)

    def get_direction(self):
        return self.direction

    def set_parent(self, parent):
        self.parent = parent

    def set_feature(self, feature):
        self.feature = feature

    def set_threshold(self, threshold):
        self.threshold = threshold

    def set_direction(self, direction):
        self.direction = direction

    def __max_depth(self, tree):
        if tree.is_leaf:
            return 0
        elif len(tree.children) == 0:
            return 0
        else:
            depth = []
            for child in tree.children:
                depth.append(self.__max_depth(child))
            return np.amax(depth)+1.

    def get_max_depth(self):
        return self.__max_depth(self)

    def get_depth(self):
        return self.depth

    def set_excluded_samples(self, excluded_samples):
        self.excluded_samples = excluded_samples
        return

    def get_all_excluded_samples(self):
        node = self
        samples_list = np.asarray([])
        while node is not None:
            samples_list = np.concatenate([samples_list, node.get_excluded_samples().ravel()])
            node = node.parent
        return np.asarray(samples_list)

    def get_is_leaf(self):
        return self.is_leaf

    def set_is_leaf(self, is_leaf):
        self.is_leaf = is_leaf

    def get_decision(self):
        return self.decision

    def set_decision(self, decision):
        self.decision = decision

    def get_excluded_samples(self):
        return np.asarray(self.excluded_samples)

    def get_confidence(self):
        return self.confidence

In [12]:
class DecisionTree(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=None, max_features=None):
        self.max_depth = max_depth
        self.max_features = max_features
        self.tree = None
        self._queue = []
        self.classes = None

    @staticmethod
    def __entropy(labels):
        if labels.size == 0:
            return 0
        unique, counts = np.unique(labels, return_counts=True)
        return np.sum([-counts[i]/np.sum(counts)*np.log2(counts[i]/np.sum(counts)) for i in range(len(unique))])

    def __gain(self, y, subsets):
        entropy_node = self.__entropy(y)
        total_length = np.sum([subset.size for subset in subsets], dtype=np.float64)
        weights = [subset.size/total_length for subset in subsets]
        entropy_child = np.sum([weights[i]*self.__entropy(y[subsets[i]]) for i in range(len(subsets))])
        return entropy_node-entropy_child

    def __split(self, X, y, node, excluded_samples=None, direction=None):
        # Choose remaining features and samples to be tested
        dataset_size, label_size = X.shape
        if excluded_samples is None:
            excluded_samples = []
        if node is not None:
            excluded_features = node.get_all_features()
            features = np.delete(np.arange(label_size), excluded_features)
            all_excluded_samples = node.get_all_excluded_samples()
            all_excluded_samples = np.concatenate([all_excluded_samples, excluded_samples]).astype(dtype=np.int32)
            samples = np.delete(np.arange(dataset_size), all_excluded_samples)
        else:
            features = np.arange(label_size)
            samples = np.arange(dataset_size)
        y_orig = np.copy(y)
        X = X[samples]
        y = y[samples]
        classes, counts = np.unique(y, return_counts=True)
        confidence = np.zeros(2)
        # Base case 1, labels are all the same so create leaf where decision is label
        if classes.size == 1:
            confidence[np.argwhere(self.classes == classes[0])] = 1.
            leaf = Tree(parent=node, decision=classes[0], direction=direction, is_leaf=True, confidence=confidence)
            node.add_child(leaf)
            return 1
        # Base case 2, no labels associated to this class so create failure decision (should never happen)
        elif classes.size == 0:
            unique, counts = np.unique(y_orig[samples], return_counts=True)
            total_labels = np.sum(counts)
            for u in range(unique.size):
                confidence[np.argwhere(self.classes == unique[u])] = counts[u]/total_labels
            all_excluded_samples = node.get_all_excluded_samples().astype(dtype=np.int32)
            samples = np.delete(np.arange(dataset_size), all_excluded_samples)
            leaf = Tree(parent=node, decision=int(np.median(y_orig[samples]).round()), direction=direction,
                        is_leaf=True, confidence=confidence)
            node.add_child(leaf)
            return 2
        # Max depth parameter must be respected
        if self.max_depth is not None and node is not None and node.get_max_depth() == self.max_depth - 1:
            unique, counts = np.unique(y, return_counts=True)
            total_labels = np.sum(counts)
            for u in range(unique.size):
                confidence[np.argwhere(self.classes == unique[u])] = counts[u]/total_labels
            leaf = Tree(parent=node, decision=int(np.median(y).round()), direction=direction, is_leaf=True,
                        confidence=confidence)
            node.add_child(leaf)
            return 4
        # Max_features must be respected
        max_features = self.max_features
        if max_features is not None and features.size > max_features:
            random_features = np.random.choice(features, max_features, replace=False)
        else:
            random_features = features
        # Try all the chosen features
        max_gain = 0.
        max_feature = -1
        best_threshold = None
        for feature in random_features:
            feature_vector = X[:, feature]
            try:
                feature_vector = np.array(feature_vector, dtype=np.float64)
            except ValueError:
                feature_vector = np.array(feature_vector, dtype=object)
            unique, counts = np.unique(feature_vector, return_counts=True)
            if feature_vector.dtype == 'object':
                subsets = [np.argwhere(feature_vector == u) for u in unique]
                gain = self.__gain(y, subsets)
                threshold = None
            elif feature_vector.dtype == 'float64':
                threshold_gains = []
                for u in unique:
                    below = np.argwhere(feature_vector <= u)
                    above = np.argwhere(feature_vector > u)
                    threshold_gains.append(self.__gain(y, [below, above]))
                gain = np.nanmax(threshold_gains)
                threshold = unique[np.nanargmax(threshold_gains)]
            if gain > max_gain:
                max_gain = gain
                max_feature = feature
                best_threshold = threshold
        # Base case 3
        if max_gain == 0.:
            unique, counts = np.unique(y_orig[samples], return_counts=True)
            total_labels = np.sum(counts)
            for u in range(unique.size):
                confidence[np.argwhere(self.classes == unique[u])] = counts[u] / total_labels
            all_excluded_samples = node.get_all_excluded_samples().astype(dtype=np.int32)
            samples = np.delete(np.arange(dataset_size), all_excluded_samples)
            new_node = Tree(parent=node.parent, decision=int(np.median(y_orig[samples]).round()),
                            direction=node.get_direction(), is_leaf=True, confidence=confidence)
            substitute = node.parent.children.index(node)
            node.parent.children[substitute] = new_node
            return 3
        # Create new node with best feature
        new_node = Tree(parent=node, direction=direction, feature=max_feature, threshold=best_threshold,
                        excluded_samples=excluded_samples)
        if node is not None:
            node.add_child(new_node)
        return new_node

    def __prune(self, X, y):
        return

    def __create_nodes_numerical(self, X, y, feature_vector, node_thresh, node):
        less = np.argwhere(feature_vector <= node_thresh).ravel()
        great = np.argwhere(feature_vector > node_thresh).ravel()
        case = self.__split(X, y, node, great, 'l')
        if isinstance(case, Tree):
            self._queue.append(case)
        elif case == 3:
            return
        case = self.__split(X, y, node, less, 'g')
        if isinstance(case, Tree):
            self._queue.append(case)
        elif case == 3:
            return

    def __create_nodes_categorical(self, X, y, feature_vector, unique, node):
        for u in unique:
            excluded_samples = np.argwhere(feature_vector != u).ravel()
            case = self.__split(X, y, node, excluded_samples, u)
            if isinstance(case, Tree):
                self._queue.append(case)
            elif case == 3:
                return

    def fit(self, X, y):
        if self.max_features is None:
            self.max_features = X.shape[1]
        self.classes = np.unique(y)
        self.tree = self.__split(X, y, self.tree)
        self._queue.append(self.tree)
        while len(self._queue) > 0:
            node = self._queue.pop()
            node_feat = node.get_feature()
            node_thresh = node.get_threshold()
            feature_vector = X[:, node_feat]
            unique, counts = np.unique(feature_vector, return_counts=True)
            if node_thresh is None:
                self.__create_nodes_categorical(X, y, feature_vector, unique, node)
            else:
                self.__create_nodes_numerical(X, y, feature_vector, node_thresh, node)
        return

    def predict(self, X):
        prediction = []
        for sample in X:
            node = self.tree
            while not node.is_leaf:
                feature = node.get_feature()
                threshold = node.get_threshold()
                if threshold is None:
                    value = sample[feature]
                    children_direction = [child.direction for child in node.children]
                    direction = children_direction.index(value)
                    node = node.children[direction]
                else:
                    if sample[feature] - threshold < 0:
                        direction = 0
                    else:
                        direction = 1
                    node = node.children[direction]
            prediction.append(node.get_decision())
        return np.asarray(prediction)

    def predict_proba(self, X):
        proba = []
        for sample in X:
            node = self.tree
            while not node.is_leaf:
                feature = node.get_feature()
                threshold = node.get_threshold()
                if threshold is None:
                    value = sample[feature]
                    children_direction = [child.direction for child in node.children]
                    direction = children_direction.index(value)
                    node = node.children[direction]
                else:
                    if sample[feature] - threshold < 0:
                        direction = 0
                    else:
                        direction = 1
                    node = node.children[direction]
            proba.append(node.get_confidence())
        return np.asarray(proba)

In [13]:
class RandomForest(BaseEstimator, ClassifierMixin):
    def __init__(self, max_depth=None, max_features=None, n_estimators=10, bootstrap=1., n_jobs=-1):
        self.max_depth = max_depth
        self.max_features = max_features
        self.n_estimators = n_estimators
        self.n_jobs = n_jobs
        self.bootstrap = bootstrap
        self._estimators = []

    def __make_estimators(self):
        estimators = Parallel(n_jobs=self.n_jobs)\
            (delayed(DecisionTree)(max_depth=self.max_depth, max_features=self.max_features)
             for i in range(self.n_estimators))
        return estimators

    @staticmethod
    def __parallel_build_trees(tree, X, y, bootstrap):
        if bootstrap:
            samples = np.random.choice(np.arange(X.shape[0]), int(bootstrap*X.shape[0]))
            X = X[samples]
            y = y[samples]
        tree.fit(X, y)
        return tree

    def fit(self, X, y):
        estimators = self.__make_estimators()
        result = Parallel(n_jobs=self.n_jobs)\
            (delayed(self.__parallel_build_trees)(tree, X, y, self.bootstrap) for tree in estimators)
        self._estimators = result
        return

    def predict(self, X):
        results = Parallel(n_jobs=self.n_jobs)(delayed(element.predict)(X) for element in self._estimators)
        return np.median(np.stack(results), axis=0).round()

    def predict_proba(self, X):
        results = Parallel(n_jobs=self.n_jobs)(delayed(element.predict_proba)(X) for element in self._estimators)
        return np.mean(np.stack(results, axis=2), axis=2)

In [14]:
def get_leaf_decisions(tree, leaf_decisions):
    if tree.is_leaf:
        leaf_decisions.append(tree.decision)
    elif len(tree.children) == 0:
        return 0
    else:
        for child in tree.children:
            get_leaf_decisions(child, leaf_decisions)

Testing RandomForest:

In [24]:
if __name__ == '__main__':
    datasett = pd.read_csv('dataset32.csv', delimiter=';').drop('vehicle_number', axis=1).values
    X = datasett[:, :-1]
    Y = datasett[:, -1]
    # X = label_to_numerical(X)
    # X[np.isnan(X)] = 0.
    X = impute_whole(X)
    Y = np.asarray(Y).astype(float)
    Y[Y == 2] = 0.
    dataset_train, dataset_test, label_train, label_test = train_test_split(X, Y, test_size=0.2, stratify=Y,
                                                                            random_state=42)
    start = perf_counter()
    forest = RandomForest()
    forest.fit(dataset_train, label_train)
    # forest.predict_proba(dataset_test)
    print(forest.score(dataset_test, label_test))
    print(perf_counter()-start)

#TO COMPARE WITH SKLEARN MODEL

X = label_to_numerical(X)
dataset_train, dataset_test, label_train, label_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=42)
start = perf_counter()
sk_tree = RandomForestClassifier(criterion='entropy')
sk_tree.fit(dataset_train, label_train)
#print(sk_tree.score(dataset_train, label_train))
print(sk_tree.score(dataset_test, label_test))
print(perf_counter()-start)

0.88
18.72369138399995
0.92
0.2350870530000293


## 4. Validation

Afterwards, we do cross-validation for all the models. First with OnevsOne, then with OnevsAll.

In [31]:
# SVM (Default: Kernel "rbf")
model_SVM = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_SVM))])
val_SVM = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_SVM))

mean_val_SVM = val_SVM["test_score"].mean()
print("This is the mean of the test_score:", mean_val_SVM)

Unnamed: 0,fit_time,score_time,test_score
0,1.835683,3.820881,0.67
1,1.476233,2.913,0.53
2,1.461483,2.88993,0.67
3,1.439638,2.935963,0.57
4,1.456801,2.90877,0.626263


This is the mean of the test_score: 0.6132525252525253


In [30]:
# Polynomially kernelized SVM
model_poly = svm.SVC
imputer = SimpleImputer(missing_values= np.nan,strategy="median")
fitter = OneVsOne(model_poly, kernel = 'poly')
estimator = Pipeline([("imputer", imputer),("onevsonefitter", fitter)])
val_poly = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_poly))

mean_val_poly = val_poly["test_score"].mean()
print("This is the mean of the test_score:", mean_val_poly)

Unnamed: 0,fit_time,score_time,test_score
0,1.837349,3.167274,0.77
1,1.488858,2.902991,0.66
2,1.45016,2.91173,0.74
3,1.502975,2.907327,0.81
4,1.45641,2.893763,0.676768


This is the mean of the test_score: 0.7313535353535354


In [29]:
# Linear kernelized SVM
model_linear = svm.SVC
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_linear, kernel = "linear"))])
val_linear = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_linear))

mean_val_linear = val_linear["test_score"].mean()
print("This is the mean of the test_score:", mean_val_linear)

Unnamed: 0,fit_time,score_time,test_score
0,1.781837,4.238268,0.82
1,1.522762,2.85315,0.91
2,1.490815,3.098062,0.85
3,1.622261,3.331492,0.9
4,1.795409,3.556631,0.828283


This is the mean of the test_score: 0.8616565656565657


In [28]:
# K-nearest neighbour algorithm
model_K = KNeighborsClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_K,n_neighbors=5))])
val_K = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_K))

mean_val_K = val_K["test_score"].mean()
print("This is the mean of the test_score:", mean_val_K)

Unnamed: 0,fit_time,score_time,test_score
0,1.870907,3.13476,0.81
1,1.494621,2.960857,0.8
2,1.482608,2.977376,0.79
3,1.478766,2.975547,0.86
4,1.502105,2.975457,0.79798


This is the mean of the test_score: 0.8115959595959596


In [27]:
# Artificial neural network
model_ANN = MLPClassifier
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(model_ANN,hidden_layer_sizes=(16,), activation='tanh', solver='adam', learning_rate='adaptive', early_stopping = True))])
val_ANN = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_ANN))

mean_val_ANN = val_ANN["test_score"].mean()
print("This is the mean of the test_score:", mean_val_ANN)

Unnamed: 0,fit_time,score_time,test_score
0,1.847968,3.209176,0.6
1,1.445399,2.838582,0.34
2,1.462063,2.860548,0.32
3,1.451113,2.828343,0.34
4,1.454585,2.83015,0.454545


This is the mean of the test_score: 0.41090909090909095


In [26]:
# Random forest
model_Rf = RandomForest
estimator = Pipeline([("imputer", SimpleImputer(missing_values= np.nan,strategy="median")),("Transform",OneVsOne(RandomForest))])
val_Rf = cross_validate(estimator, dataset, y.ravel(), cv=5)

display(pd.DataFrame(val_Rf))

mean_val_Rf = val_Rf["test_score"].mean()
print("This is the mean of the test_score:", mean_val_Rf)

Unnamed: 0,fit_time,score_time,test_score
0,32.890335,3.165063,0.81
1,34.384315,3.192123,0.85
2,45.732219,3.174469,0.84
3,34.691736,3.226812,0.9
4,31.507678,3.212778,0.868687


This is the mean of the test_score: 0.8537373737373738


## 5. Testing

## 6. Conclusion

By the chapters above, we can conclude that the best model for our dataset is: Random forest, ...