# Autonomous Work

Clément Antheaume, Camille-Amaury Juge.

## Definitions of indicators

### Classification (categorical class)

Let TP = number of positive individuals correctly classified as positive, TN = number of negative individuals correctly classified as negative, FP = number of negative individuals wrongly classified as positive, and FN = number of positive individuals wrongly classified as negative.

#### Precision
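* TP / (TP + FP)
* Represents the share of predicted positives that are truly positive (positive predictive value). It is distinct from specificity, which is TN / (TN + FP).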



#### Recall
* TP / (TP + FN)
* Represents the sensitivity of the model.



#### F-measure
* 2 * ( (Precision * Recall) / (Precision + Recall) )
* Mathematically, the harmonic mean of recall and precision.
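As a small sketch of these three indicators, the function below computes them from raw counts, using the usual definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The function name `precision_recall_f1` and the example counts are illustrative assumptions, not part of the original work.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F-measure) from confusion-matrix counts."""
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Returning 0 when a denominator is empty is a design choice of this sketch; other conventions (raising an error, returning NaN) are equally defensible.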



#### Rand index
* (TP + TN) / (TP + FP + FN + TN)
* Proportion of correct decisions made by the classification algorithm.
* Can be used in clustering to measure the similarity between two clusterings.
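For the clustering use, a "decision" is made on every pair of individuals: either the two are put in the same cluster or they are not. The sketch below counts the pairs on which two labelings agree. The function name `rand_index` and the two example labelings are assumptions made for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Pair-counting Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair is a correct decision when both labelings agree on
        # whether the two items belong together.
        agree += same_a == same_b
        total += 1
    return agree / total if total else 1.0


# Hypothetical example: two clusterings of six items.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 2, 2]
print(rand_index(truth, found))
```

Because only pair membership is compared, the index does not depend on how the clusters are named.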



#### ROC Curve
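* Function giving the true positive rate (y) against the false positive rate (x) as the decision threshold varies.
* The goal is to have a curve as close as possible to the top-left corner (0, 1); the diagonal y = x corresponds to a random classifier.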
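A minimal sketch of how such a curve can be traced: sweep the decision threshold over the classifier scores and record the pair (false positive rate, true positive rate) at each step. The names `roc_points` and `auc` and the example scores are hypothetical, and the sketch assumes both classes are present in `y_true`.

```python
def roc_points(y_true, scores):
    """Return the ROC curve as a list of (FPR, TPR) points.

    y_true holds 1 for positive and 0 for negative individuals;
    scores holds the classifier score (higher means "more positive").
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Sweep the decision threshold from the highest score to the lowest.
    thresholds = sorted(set(scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores for three positives and three negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
print(curve)
print("AUC =", auc(curve))
```

An area of 0.5 corresponds to the diagonal of a random classifier, and an area of 1 to a curve passing through the top-left corner.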

### Regression (numeric class)

#### Mean Squared Error

* MSE ∈ ℝ<sup>+</sup>
* MSE = Average((Y<sub>label</sub> - Y<sub>predicted</sub>)<sup>2</sup>), where the Indicator is the estimator producing Y<sub>predicted</sub>
* ⇔ MSE = Bias(Indicator)<sup>2</sup> + Variance(Indicator)
* As we can see, it can be defined as a measure of both the bias and the variance of the Indicator
* It evaluates the quadratic risk of the Indicator
* Sensitive to outliers (large error values), thus useful when large errors must be penalized heavily, i.e. when we want our model to be quite stable
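The bias/variance reading can be checked numerically on the prediction errors: the mean of the squared errors equals the squared mean error plus the (population) variance of the errors. This is a sketch of the empirical analogue of the decomposition, with made-up labels and predictions; the helper names `mean` and `mse` are assumptions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def mse(y_label, y_predicted):
    """Mean of the squared differences between labels and predictions."""
    return mean([(y - p) ** 2 for y, p in zip(y_label, y_predicted)])


# Hypothetical labels and predictions.
y_label = [3.0, 5.0, 2.0, 7.0]
y_predicted = [2.5, 5.5, 2.0, 9.0]

errors = [y - p for y, p in zip(y_label, y_predicted)]
bias = mean(errors)
# Population variance of the errors (divide by n, not n - 1).
variance = mean([(e - bias) ** 2 for e in errors])

print("MSE          =", mse(y_label, y_predicted))
print("bias^2 + var =", bias ** 2 + variance)
```

Most of the MSE here comes from the single error of size 2, which illustrates the sensitivity to large errors.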



#### Root Mean Squared-Error

* RMSE ∈ ℝ<sup>+</sup>
* RMSE = √(MSE) = √(Average((Y<sub>label</sub> - Y<sub>predicted</sub>)<sup>2</sup>))
* ⇔ RMSE = √(Bias(Indicator)<sup>2</sup> + Variance(Indicator))
* As we can see, it can be defined as a measure of the standard deviation of the errors (exactly so when the bias is zero)
* It evaluates the quadratic risk of the Indicator, expressed in the same unit as the output
* As sensitive to outliers (large error values) as the MSE, thus useful when we want our model to avoid large errors

#### Mean Bias Error

* MBE ∈ ℝ
* MBE = Average(Y<sub>label</sub> - Y<sub>predicted</sub>)
* As we can see, it can be defined as a measure of the bias of the error between labels and predictions
* It indicates whether the model overestimates (MBE < 0) or underestimates (MBE > 0) the output

#### Systematic Error

* SE or SD ∈ ℝ<sup>+</sup>
* SD = √(RMSE(error)<sup>2</sup> - MBE(error)<sup>2</sup>)
* As we can see, it can be defined as the MSE with the squared bias removed, i.e. the standard deviation of the error around its mean
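To make the relation between MBE, RMSE and SD concrete, the sketch below computes all three on made-up data and compares SD with the population standard deviation of the errors. The data values are assumptions chosen for illustration.

```python
import math
import statistics

# Hypothetical labels and predictions; the model predicts too high everywhere.
y_label = [10.0, 12.0, 9.0, 15.0]
y_predicted = [11.0, 14.0, 9.5, 15.5]

errors = [y - p for y, p in zip(y_label, y_predicted)]

# Mean bias error: negative here, because predictions exceed the labels.
mbe = statistics.fmean(errors)
# Root mean squared error.
rmse = math.sqrt(statistics.fmean([e ** 2 for e in errors]))
# What is left of the RMSE once the bias is removed.
sd = math.sqrt(rmse ** 2 - mbe ** 2)

print("MBE  =", mbe)
print("RMSE =", rmse)
print("SD   =", sd)
print("population std of errors =", statistics.pstdev(errors))
```

The last two printed values agree up to rounding, which is the identity RMSE<sup>2</sup> = MBE<sup>2</sup> + SD<sup>2</sup> read backwards.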

#### Mean Absolute Error

* MAE ∈ ℝ<sup>+</sup>
* MAE = Average(|Y<sub>label</sub> - Y<sub>predicted</sub>|)
* As we can see, it can be defined as a measure of the average magnitude of the error, regardless of its sign
* MAE is less sensitive to outliers than the MSE, because the errors are not squared

#### Mean Absolute Percentage Error

* MAPE ∈ ℝ<sup>+</sup>
* MAPE = Average(|(Y<sub>label</sub> - Y<sub>predicted</sub>) / Y<sub>label</sub>|)
* As we can see, it can be defined as a measure of the average relative error, regardless of its sign
* It has the advantage of showing relative errors rather than absolute ones, but it is undefined when a label equals 0
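The two absolute-error indicators can be sketched side by side. The function names `mae` and `mape` and the data are assumptions; the sketch raises an error when a label is 0, since the ratio is not defined in that case.

```python
def mae(y_label, y_predicted):
    """Mean absolute error, in the unit of the output."""
    return sum(abs(y - p) for y, p in zip(y_label, y_predicted)) / len(y_label)


def mape(y_label, y_predicted):
    """Mean absolute percentage error, as a ratio (0.1 means 10%)."""
    if any(y == 0 for y in y_label):
        raise ValueError("MAPE is undefined when a label equals 0")
    ratios = [abs((y - p) / y) for y, p in zip(y_label, y_predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical labels of very different magnitudes.
y_label = [100.0, 200.0, 50.0, 400.0]
y_predicted = [110.0, 180.0, 55.0, 400.0]

print("MAE  =", mae(y_label, y_predicted))
print("MAPE =", mape(y_label, y_predicted))
```

The first three predictions are each off by 10% although their absolute errors differ (10, 20 and 5), which is the difference between a ratio error and a value error.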

#### R<sup>2</sup>

* R<sup>2</sup> ∈ ℝ, R<sup>2</sup> ∈ [0, 1] for a least-squares linear model with an intercept
* R<sup>2</sup> = Correlation(Y<sub>predicted</sub>, Y<sub>label</sub>)<sup>2</sup>
* ⇔ R<sup>2</sup> = Sum((Y<sub>predicted</sub> - Average(Y<sub>label</sub>))<sup>2</sup>) / Sum((Y<sub>label</sub> - Average(Y<sub>label</sub>))<sup>2</sup>)
* As we can see, it can be defined as the share of the variance of the labels that is explained by the predictions
* It has the advantage of being unitless, which puts models fitted on different scales on a common footing
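The sketch below uses the residual form R<sup>2</sup> = 1 - SS<sub>res</sub> / SS<sub>tot</sub>, which is a different but common way to write the indicator: it coincides with the explained-variance ratio for a least-squares linear fit with an intercept, and it stays meaningful (possibly negative) for any other predictor. The function name `r_squared` and the data are assumptions.

```python
def r_squared(y_label, y_predicted):
    """Coefficient of determination in the form 1 - SS_res / SS_tot."""
    mean_label = sum(y_label) / len(y_label)
    # Residual sum of squares: what the predictions fail to explain.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_label, y_predicted))
    # Total sum of squares: spread of the labels around their mean.
    ss_tot = sum((y - mean_label) ** 2 for y in y_label)
    return 1 - ss_res / ss_tot


# Hypothetical labels and three sets of predictions.
y_label = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_label, [1.1, 1.9, 3.2, 3.8]))  # close fit
print(r_squared(y_label, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(r_squared(y_label, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

Always predicting the mean of the labels is the reference point: any model that does worse than that gets a negative value in this form.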

### Validation Techniques

#### Hold Out Cross Validation

Split the dataset into two sub-datasets:
* The training set is used to train the model.
* The testing set is used to validate the model with the indicators.

The split is defined as a percentage of the initial dataset (for instance 80%/20%).

This method has the advantage of revealing overfitting, since the model is evaluated on data it has not seen. However, it is not stable: the result depends on a single random split, and there is still a small probability of drawing an unrepresentative configuration of the sub-datasets.
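A minimal sketch of such a split on row indices, using only the standard library. The function name `hold_out_split`, its parameters and the seed values are assumptions made for illustration.

```python
import random


def hold_out_split(n_individuals, test_ratio=0.2, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    indices = list(range(n_individuals))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(test_ratio * n_individuals)
    return indices[n_test:], indices[:n_test]


# 80%/20% split of ten individuals.
train_idx, test_idx = hold_out_split(10, test_ratio=0.2, seed=42)
print("train:", train_idx)
print("test: ", test_idx)
```

Calling the function with different seeds gives different testing sets, and therefore different values of the indicators, which is the instability described above.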

#### K-Fold Cross Validation

Split the dataset S of N individuals into k sub-datasets, so that each subset contains about N/k individuals.
Then we iterate k times over the sub-datasets:
* we create a training set with k-1 sub-datasets
* we create a testing set with the remaining sub-dataset
* we compute the empirical error

When the k iterations are done, we compute the mean of the empirical errors. Then we have a stable validation indicator over multiple configurations (k) of our dataset.
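The procedure can be sketched on row indices with the standard library only. The names `k_fold_indices` and `cross_validated_error` are assumptions, and `error_of_fold` is a hypothetical callback that would train a model on the training indices and return its empirical error on the testing indices.

```python
import random


def k_fold_indices(n_individuals, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_individuals))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_error(n_individuals, k, error_of_fold):
    """Mean of the empirical errors returned by error_of_fold(train, test)."""
    errors = [error_of_fold(train, test)
              for train, test in k_fold_indices(n_individuals, k)]
    return sum(errors) / k


# Show the folds for ten individuals and k = 5.
for train, test in k_fold_indices(10, 5):
    print("train:", sorted(train), "test:", sorted(test))
```

Every individual lands in exactly one testing set, so each one contributes to the averaged error once.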

## Classification with Python

### Mushroom Decision Tree

I converted the Mushroom dataset to an xlsx file because there were too many different inputs available and I couldn't find one that would make the process easier for us.
Here's the download link for the spreadsheet version: https://docs.google.com/spreadsheets/d/1SrkBzu4FxGNXaMYn7NFx7PTjc8VA6N4c3Cua8HEsYyc/edit?usp=sharing

In [32]:
import pandas as pd
from sklearn import tree
import graphviz
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
import random

features = ['Odorant','Anneau','Chapeau bombé','Pied large','Taches']
classes = ['Comestible','Non comestible']
data = pd.read_excel('Mushroom.xlsx')
X = data[features]
Y = data[['Comestible']]
dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X, Y)

In [28]:
dot_data = tree.export_graphviz(dtree, out_file=None, feature_names=features, class_names=classes)
graphviz.Source(dot_data)

TypeError: can only concatenate str (not "bytes") to str

### Weather Decision Tree

In [18]:
features = ['Outlook','Tempreature','Humidity','Windy']
classes = ['Umbrella','No umbrella']

dic_X = {'sunny' : 0, 'overcast' : 1, 'rain' : 2, 'cool' : 0, 
         'mild' : 1, 'hot' : 2, True : 1, False : 0, 'high' : 1, 'normal' : 0}
dic_Y = {'N' : 0, 'P' : 1}

data = pd.read_excel('Meteo.xls')
X = data[features]
Y = data[['Class']]

X2 = X.replace(dic_X)
Y2 = Y.replace(dic_Y)

dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X2, Y2)

FileNotFoundError: [Errno 2] No such file or directory: 'Meteo.xls'

In [None]:
dot_data = tree.export_graphviz(dtree, out_file=None, feature_names=features, class_names=classes)
graphviz.Source(dot_data)

### Some comparisons on the Iris dataset

#### Decision Tree Classification

In [None]:
features = ['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']
classes = ['Iris-setosa','Iris-versicolor','Iris-virginica']
data = pd.read_csv('Iris.csv')

size = round(0.3*data.shape[0])

randomlist = random.sample(range(0, data.shape[0]), size)
train = data.drop(randomlist)
test = data.loc[randomlist]
X_train = train[features]
Y_train = train[['Species']]
X_test = test[features]
Y_test = test[['Species']]

dtree = tree.DecisionTreeClassifier()
dtree = dtree.fit(X_train, Y_train)


dot_data = tree.export_graphviz(dtree, out_file=None, feature_names=features, class_names=classes)
graphviz.Source(dot_data)

In [None]:
Y_pred = dtree.predict(X_test)
# Count the test individuals whose predicted species differs from the label.
errors = 0
for i in range(len(Y_pred)):
    if Y_pred[i] != Y_test.iloc[i, 0]:
        errors += 1
error_rate = errors / size

print("Error rate = ", error_rate)
print("Confusion matrix = \n", confusion_matrix(Y_test, Y_pred))

#### AdaBoost Classification

In [None]:
ada = AdaBoostClassifier(n_estimators=100, random_state=0)
ada.fit(X_train, Y_train)

In [None]:
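Y_pred = ada.predict(X_test)
# Count the test individuals whose predicted species differs from the label.
errors = 0
for i in range(len(Y_pred)):
    if Y_pred[i] != Y_test.iloc[i, 0]:
        errors += 1
error_rate = errors / size

print("Error rate = ", error_rate)
print("Confusion matrix = \n", confusion_matrix(Y_test, Y_pred))
# Mean accuracy on the test set, i.e. 1 - error rate.
ada.score(X_test, Y_test)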

We cannot draw any firm conclusion from this example since it is based on random samples, but in our experiments the decision tree classifier was more accurate on the Iris dataset.

In general, few elements are misclassified in these examples, which gives a good AdaBoost score (usually > 0.9) and an error rate around 0.1, often even below this threshold.

In [36]:
%load_ext rpy2.ipython

In [37]:
%%R
library(namespace)
registerNamespace('psy', loadNamespace('psych'))
library(ggplot2)
library(reshape2)
library(lattice)
registerNamespace('ml', loadNamespace('caret'))
registerNamespace('metrics', loadNamespace('Metrics'))
registerNamespace('mlmetrics', loadNamespace('MLmetrics'))
library("IRdisplay")

In [41]:
%%R
csv <- read.csv("Admission_Predict.csv", header = TRUE)
print(head(csv[,2:ncol(csv)]))