<h1>Machine learning project</h1>
<hr>
<p>This Jupyter notebook summarizes all our programming work on binary classification, applied to the Banknote Authentication Dataset and the Kidney Disease Dataset.</p>
<p>Students:<br>
    <li>Ettoré Hidoux</li>
    <li>Agathe Fernandes Machado</li>
    <li>Yasmine Diouri</li>
    <li>Clément Mathé</li></p>

## Summary :

<p>Part I : Data Cleaning (function preprocessing)</p>
<p>Part II : Result evaluation (function accuracy)</p>
<p>Part III : Models used<br>
    <li> 1-cross-validation (function models_1cv)</li>
    <li> N-cross-validation (function models_Ncv)</li>
    <li> Model mean (function model_mean)</li></p>
<p>Part IV : Global function (function finalFunction)</p>
    

## Part I : Data Cleaning

This part contains one function that cleans the data: string values are replaced by booleans (0/1) and NaN values by the median of their column. It then drops the first feature of every pair of features whose absolute correlation is higher than 0.75. At the end of the function, the features are standardized and the data is split into two sets: the training data (80%) and the test data (20%).

In [19]:
# Libraries: Standard ones
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random as rnd
# Libraries: scikit learn for preprocessing 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def preprocessing(abcd_csv, clas, lp, sep): # made by Ettoré Hidoux and Agathe Fernandes Machado
    # abcd_csv is the path of the CSV file
    # clas is the name of the class column
    # lp is the index of a row with no missing value
    # sep is the symbol used as separator in the file
    
    # Load the data
    data = pd.read_csv(abcd_csv, sep=sep)
    
    # X/Y separation
    # transform the class column from string into 0/1 if necessary
    # (1 for rows sharing the label of the first row, 0 otherwise)
    if isinstance(data[clas][lp], str):
        Y = np.multiply([data[clas]==data[clas][0]],1)[0]
    else:
        Y = data[clas]
    data.drop(columns=clas, inplace=True)
    
    # transform string feature columns into 0/1 if necessary
    for c in data:
        if isinstance(data[c][lp], str):
            a = np.multiply([data[c] == data[c][lp]],1)
            data.drop(columns=c, inplace=True)
            data[c] = a[0]
        # replace NaN values by the median of the column
        data[c] = np.nan_to_num(data[c], copy=True, nan=data[c].median())
    
    # Absolute correlation matrix of the data columns 
    corr_matrix = data.corr().abs()
    high_corr_var = np.where(corr_matrix > 0.75)
    # keep each correlated pair once (x < y excludes the diagonal and the mirrored pairs)
    high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x < y]
    
    # Drop the first column of each correlated pair (the set avoids dropping a column twice)
    data.drop(columns=list({pair[0] for pair in high_corr_var}), inplace=True)
        
    X = data
    X = StandardScaler().fit_transform(X) # standardize the data (zero mean, unit variance)
    
    # Creation of a dataset to train and another to test
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
    return x_train, x_test, y_train, y_test, X, Y
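
To make the two cleaning steps more concrete, here is a minimal sketch of median imputation followed by correlation-based pruning. The toy DataFrame and its column names are hypothetical and only serve the illustration; `fillna` is used here as a simple stand-in for the median replacement done in the function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is almost a copy of "a", "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],
    "c": [3.0, np.nan, 1.0, 4.0, 1.0, 5.0],
})

# Median imputation: the NaN in "c" becomes the median of the observed values.
toy["c"] = toy["c"].fillna(toy["c"].median())

# Absolute correlations, then the (row, column) positions above the threshold.
corr = toy.corr().abs()
rows, cols = np.where(corr > 0.75)

# Keep each pair once (row < col) and collect the first column of every pair.
to_drop = sorted({corr.columns[r] for r, c in zip(rows, cols) if r < c})
pruned = toy.drop(columns=to_drop)

print(to_drop)               # columns removed because of a high correlation
print(list(pruned.columns))  # columns kept
```

Collecting the columns in a set before dropping them means a column that is correlated with several others is only removed once.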

Below, we apply our preprocessing function to our two datasets.

In [3]:
# test data_set1
x_train1, x_test1, y_train1, y_test1, X1, Y1 = preprocessing("data_banknote_authentication.csv", "class", 4, ";")

In [4]:
#test data_set2
x_train2, x_test2, y_train2, y_test2, X2, Y2 = preprocessing("kidney_disease.csv", "classification", 4, ',')


## Part II : Result evaluation 

For the evaluation part, we implement a function that returns the accuracy of a model, i.e. the fraction of predictions that are equal to the true labels.

In [5]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

# returns the fraction of correct predictions (the same value as .score() for a scikit-learn classifier)
def accuracy(y_predict, y_test): # made by Clément Mathé
    return np.mean(np.asarray(y_predict) == np.asarray(y_test))
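
As a quick sanity check, the sketch below compares a restatement of this function (with its inputs converted to arrays) to `accuracy_score` from scikit-learn. The label vectors are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def accuracy(y_predict, y_test):
    # fraction of positions where the prediction and the true label agree
    return np.mean(np.asarray(y_predict) == np.asarray(y_test))

# Hypothetical labels, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

ours = accuracy(y_pred, y_true)             # 4 correct predictions out of 5
reference = accuracy_score(y_true, y_pred)  # scikit-learn's implementation
print(ours, reference)
```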

## Part III : Models used

## 1) 1-cross-validation

In this part, we implement a function that fits several scikit-learn Machine Learning methods on the training part of our cleaned dataset and scores them on the test part (a single train/test split, which we call 1-cross-validation). This function returns the method with the highest score (model='best') or the scores of all our methods (model='scores').

In [6]:
import numpy as np

# Models import list
from sklearn import svm
from sklearn.linear_model import SGDClassifier  
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# model can take the value "best" or "scores"
def models_1cv(x_train, x_test, y_train, y_test, model): # made by Ettoré Hidoux and Clément Mathé
    # Models list
    models = [svm.SVC(kernel='linear'),
              svm.SVC(kernel='poly', degree=2, gamma='auto'),
              svm.SVC(kernel='rbf', gamma='auto'),
              svm.SVC(kernel='sigmoid', gamma=1./150),
              SGDClassifier(),
              DecisionTreeClassifier(),
              GaussianNB(),
              # a regressor: its .score() is the R^2 coefficient, not an accuracy
              RandomForestRegressor(n_estimators = 1000, random_state = 42),
              MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,2), random_state=42),
              LogisticRegression(random_state=0)]
    # Fit every model on the training dataset
    models = [clf.fit(x_train, y_train) for clf in models]
    # Score of every model on the test dataset
    scores = [clf.score(x_test,y_test) for clf in models]

    # Models name list 
    titles = ['SVC with linear kernel',
              'SVC with polynomial (degree 2) kernel',
              'SVC with RBF kernel',
              'SVC with sigmoid kernel',
              'Stochastic Gradient Descent',
              'Desicion Trees',
              'Bayesien Network: Gnb',
              'Random Forest',
              'Neural Network',
              'Probit model']
    
    if model == 'best':
        # return the name and the score of the method that obtains the best result 
        return titles[scores.index(max(scores))], max(scores)
    if model == 'scores':
        # return the name and the score of every method
        return [(titles[i],scores[i]) for i in range(len(titles))]
    raise ValueError("model must be 'best' or 'scores'")
    

Below, we apply our function to our two datasets, each split into a training part and a test part (given by the function preprocessing), to see the results given by the score method of scikit-learn.

In [10]:
models_1cv(x_train1, x_test1, y_train1, y_test1, 'best')

('SVC with RBF kernel', 0.9709090909090909)

In [11]:
models_1cv(x_train2, x_test2, y_train2, y_test2, 'best')

('SVC with RBF kernel', 0.9875)

In [12]:
models_1cv(x_train1, x_test1, y_train1, y_test1, 'scores')

[('SVC with linear kernel', 0.9090909090909091),
 ('SVC with polynomial (degree 2) kernel', 0.7236363636363636),
 ('SVC with RBF kernel', 0.9709090909090909),
 ('SVC with sigmoid kernel', 0.8909090909090909),
 ('Stochastic Gradient Descent', 0.92),
 ('Desicion Trees', 0.9563636363636364),
 ('Bayesien Network: Gnb', 0.8036363636363636),
 ('Random Forest', 0.9055074922855927),
 ('Neural Network', 0.9672727272727273),
 ('Probit model', 0.9054545454545454)]
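
The 'Random Forest' entry is built with `RandomForestRegressor`, whose `.score()` returns the R² coefficient of its continuous predictions rather than an accuracy, so that entry is not on the same scale as the others. The sketch below illustrates the difference on synthetic data. The data, the smaller `n_estimators` and the 0.5 threshold are assumptions made for the illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption, not one of the notebook's datasets)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=50, random_state=42).fit(x_tr, y_tr)
clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(x_tr, y_tr)

# For a regressor, .score() is the R^2 of its continuous predictions
r2 = reg.score(x_te, y_te)
# Thresholding the regressor's output at 0.5 gives labels, hence an accuracy
reg_acc = accuracy_score(y_te, (reg.predict(x_te) >= 0.5).astype(int))
# For a classifier, .score() is already the accuracy
clf_acc = clf.score(x_te, y_te)

print(r2, reg_acc, clf_acc)
```

Using `RandomForestClassifier`, or thresholding the regressor's output as above, would give a value comparable with the other scores.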

## 2) N-cross-validation

In this part, we implement a function that evaluates the same scikit-learn Machine Learning methods as in 1-cross-validation on our cleaned dataset, this time with N-fold cross-validation (N chosen by the user). This function returns the method with the highest mean score (model='best') or the mean scores of all our methods (model='scores').

In [13]:
# Models import list
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.linear_model import SGDClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

# model can take the value "best" or "scores"
def models_Ncv(X, Y, N, model): # made by Yasmine Diouri and Agathe Fernandes Machado
    # Models list
    models = [svm.SVC(kernel='linear'),
              svm.SVC(kernel='poly', degree=2, gamma='auto'),
              svm.SVC(kernel='rbf', gamma='auto'),
              svm.SVC(kernel='sigmoid', gamma=1./150),
              SGDClassifier(),
              DecisionTreeClassifier(),
              GaussianNB(),
              # a regressor: its score is the R^2 coefficient, not an accuracy
              RandomForestRegressor(n_estimators = 1000, random_state = 42),
              MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,2), random_state=42),
              LogisticRegression(random_state=0)]
    # Models score list with N-cross-validation (each method is trained and scored N times, then the N scores are averaged)
    scores = [np.mean(cross_val_score(clf, X, Y, cv=N)) for clf in models]
    
    # Models name list
    titles = ['SVC with linear kernel',
              'SVC with polynomial (degree 2) kernel',
              'SVC with RBF kernel',
              'SVC with sigmoid kernel',
              'Stochastic Gradient Descent',
              'Desicion Trees',
              'Bayesien Network: Gnb',
              'Random Forest',
              'Neural Network',
              'Probit model']
    
    if model == 'best':
        # return the name and the score of the method that obtains the best result 
        return titles[scores.index(max(scores))], max(scores)
    if model == 'scores':
        # return the name and the score of every method
        return [(titles[i],scores[i]) for i in range(len(titles))]
    raise ValueError("model must be 'best' or 'scores'")
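
To show what a single entry of `scores` contains, here is a minimal sketch of `cross_val_score` on synthetic data. The data, the choice of `LogisticRegression` and N=5 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary data (an assumption, not one of the notebook's datasets)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

N = 5
# One score per fold: the model is refitted N times, each time tested on a different fold
fold_scores = cross_val_score(LogisticRegression(random_state=0), X, y, cv=N)
# The value kept by the function is the mean over the N folds
mean_score = np.mean(fold_scores)

print(fold_scores, mean_score)
```

With an integer `cv` and a classifier, scikit-learn builds stratified folds, so each fold keeps roughly the class proportions of the full dataset.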

Below, we apply our function to our two datasets, not split this time because we use N-cross-validation (here with N=10), to see the mean of the scores given by the function cross_val_score of scikit-learn.

In [14]:
models_Ncv(X1,Y1,10, 'best')

('SVC with RBF kernel', 0.9832487041150959)

In [37]:
models_Ncv(X2,Y2,10, 'best')

('SVC with RBF kernel', 0.9875)

In [45]:
models_Ncv(X1,Y1,10, 'scores')

[('SVC with linear kernel', 0.9082196128213266),
 ('SVC with polynomial (degree 2) kernel', 0.7377975245953665),
 ('SVC with RBF kernel', 0.9832487041150959),
 ('SVC with sigmoid kernel', 0.8965672273352375),
 ('Stochastic Gradient Descent', 0.8936898339151593),
 ('Desicion Trees', 0.9628530625198349),
 ('Bayesien Network: Gnb', 0.8325187771077964),
 ('Random Forest', 0.0922050889032258),
 ('Neural Network', 0.9766952290278219),
 ('Probit model', 0.9067703374590078)]

## 3) Model Mean

In this part, we implement the same function as the one for 1-cross-validation, but we add a method that averages the predictions of all the Machine Learning methods used previously and rounds the result to obtain a class. This function returns the method with the highest score (model='best') or the scores of all our methods (model='scores').

In [20]:
# Models import list
from sklearn import svm
from sklearn.linear_model import SGDClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression


# model can take the value "best" or "scores"
def model_mean(x_train, x_test, y_train, y_test, model): # made by Ettoré Hidoux
    # Models list
    models = [svm.SVC(kernel='linear'),
              svm.SVC(kernel='poly', degree=2, gamma='auto'),
              svm.SVC(kernel='rbf', gamma='auto'),
              svm.SVC(kernel='sigmoid', gamma=1./150),
              SGDClassifier(),
              DecisionTreeClassifier(),
              GaussianNB(),
              # a regressor: continuous predictions, and its .score() is the R^2 coefficient
              RandomForestRegressor(n_estimators = 1000, random_state = 42),
              MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5,2), random_state=42),
              LogisticRegression(random_state=0)]
    # Fit every model on the training dataset
    models = [clf.fit(x_train, y_train) for clf in models]
    # Models predictions for the input x_test
    predicts = [clf.predict(x_test) for clf in models]
    # Score of every model on the test dataset
    scores = [clf.score(x_test,y_test) for clf in models]
    # Mean of the models predictions, rounded to obtain a 0/1 class
    mean_predict = sum(predicts)/len(predicts)
    mean_predict = np.round(mean_predict)
    # Add model mean score to scores
    scores.append(accuracy(mean_predict, y_test))

    # Models name list
    titles = ['SVC with linear kernel',
              'SVC with polynomial (degree 2) kernel',
              'SVC with RBF kernel',
              'SVC with sigmoid kernel',
              'Stochastic Gradient Descent',
              'Desicion Trees',
              'Bayesien Network: Gnb',
              'Random Forest',
              'Neural Network',
              'Probit model',
              'Model Mean']

    if model == 'best':
        # return the name and the score of the method that obtains the best result 
        return titles[scores.index(max(scores))], max(scores)
    if model == 'scores':
        # return the name and the score of every method
        return [(titles[i],scores[i]) for i in range(len(titles))]
    raise ValueError("model must be 'best' or 'scores'")
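
Averaging 0/1 predictions and rounding the result behaves like a vote between the models. The sketch below shows the mechanism on hypothetical predictions from four models, including the case of a tie.

```python
import numpy as np

# Hypothetical 0/1 predictions of four models on five samples
predicts = [np.array([1, 0, 1, 1, 0]),
            np.array([1, 0, 1, 1, 0]),
            np.array([1, 1, 0, 0, 0]),
            np.array([0, 0, 0, 1, 1])]

# Element-wise mean of the predictions: the share of models voting for class 1
mean_predict = sum(predicts) / len(predicts)
# np.round rounds halves to the nearest even value, so a tie (0.5) becomes 0
voted = np.round(mean_predict).astype(int)

print(mean_predict)  # [0.75 0.25 0.5  0.75 0.25]
print(voted)         # [1 0 0 1 0]
```

With an even number of models, such as the ten used here, ties are possible and are resolved in favour of class 0 by this rounding rule.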

Below, we apply our function to our two datasets, each split into a training part and a test part (given by the function preprocessing). The scores of our previous methods are given by the score method of scikit-learn; for the model mean, we use the function accuracy defined above.

In [21]:
model_mean(x_train1, x_test1, y_train1, y_test1, 'best')

('SVC with RBF kernel', 0.9709090909090909)

In [38]:
model_mean(x_train2, x_test2, y_train2, y_test2, 'best')

('Model Mean', 1.0)

In [49]:
model_mean(x_train1, x_test1, y_train1, y_test1, 'scores')

[('SVC with linear kernel', 0.9090909090909091),
 ('SVC with polynomial (degree 2) kernel', 0.7236363636363636),
 ('SVC with RBF kernel', 0.9709090909090909),
 ('SVC with sigmoid kernel', 0.8909090909090909),
 ('Stochastic Gradient Descent', 0.8945454545454545),
 ('Desicion Trees', 0.9527272727272728),
 ('Bayesien Network: Gnb', 0.8036363636363636),
 ('Random Forest', 0.9055074922855927),
 ('Neural Network', 0.9672727272727273),
 ('Probit model', 0.9054545454545454),
 ('Model Mean', 0.9345454545454546)]

## Part IV : Global function

Here, we implement our final function, which can be applied to any binary classification dataset and lets the user choose which evaluation to run (1-cross-validation with model mean, or N-cross-validation). In this function, we use the functions defined above.

In [18]:
# Libraries: Standard ones
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random as rnd
# Libraries: scikit learn for preprocessing 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Models import list
from sklearn import svm
from sklearn.linear_model import SGDClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

def finalFunction(abcd_csv, clas, lp, sep, N, model): # made by Yasmine Diouri
    # abcd_csv is the path of the CSV file
    # clas is the name of the class column
    # lp is the index of a row with no missing value
    # sep is the symbol used as separator in the file
    # N is the number of cross-validation folds to apply to the data, N >= 1
    # model tells us if we want the best score ('best') or all the scores of the models ('scores')
    # Data cleaning 
    x_train, x_test, y_train, y_test, X, Y = preprocessing(abcd_csv, clas, lp, sep)
    # Show the result 
    if N == 1: # 1-cross-validation
        return model_mean(x_train, x_test, y_train, y_test, model) # use of the model_mean function defined above
    if N > 1: # N-cross-validation
        return models_Ncv(X, Y, N, model) # use of the models_Ncv function defined above
    raise ValueError("N must be a positive integer")
