In [70]:
import sklearn
import numpy as np
import pandas as pd
from scipy import stats
import json
import os

# Machine Learning Final Project
**Student name:** Mohammad Amin Dadgar

**Student Id:** 4003624016

**Instructor:** Dr. Peyman Adibi

**Date:** Tir 1401 | June, July 2022

The Final Project is the implementation of [dependency analysis of accuracy estimates in k-fold cross validation](https://ieeexplore.ieee.org/document/8012491) article.

**Abstract:** K-fold cross-validation is a method to evaluate the performance of classification algorithms. In this report, we are going to show the appropriateness of K-fold cross-validation for the dependence of fold accuracies. 

**Introduction**
Several studies have shown that in k-fold cross-validation, fold accuracies are dependent, but there is no formal definition for this fact. This report is the reproduction results of referenced article that in section 2, we introduced the statistical methods needed for this experiment. In section 3 sampling the distribution from fold accuracies is introduced and using this in section 4, a statistical method is shown to evaluate the fold accuracies independence. In section 5, the method introduced in section 4 is tested on 20 UCI datasets, and in section 6 we’ve concluded the work done before.

To inference the idea behind this article, in section 4 the variances are given as below

sample variance:
\begin{equation}
s^2 = \frac{\sum_{i=0}^{k} (\bar p_i - \bar{\bar{x}})^2}{k-1}
\end{equation}
And the variance for leave-one-out cross validation
\begin{equation}
\sigma_I^2 = \frac{p(1-p)}{n}
\end{equation}

to explain the equations above we can say that $p$ is the accuracy of the classifier and to find it, we can devide the correct classifications by the sample size.

$\bar p_i$ refers to the accuracy estimate in fold i, and can be written as below
\begin{equation}
\bar{p_i} = \frac{\sum_{j=1}^{m} x_{ij}}{m}
\end{equation}
where $m$ is the count of each fold sample. For example if we had $200$ sample in our dataset and using $5$ folds, then we would have $m=\frac{200}{5}$.

where $x_{ij}$ is a function that outputs $1$ where the classification is correct and outputs $0$ when the classification is wrong.

The $\bar{\bar{x}}$ represents the sample mean and it's possible to find it using the equation below
\begin{equation}
\bar{\bar{x}} = \frac{\sum_{i=1}^{k} \bar{p_i}}{k}
\end{equation}

We will start the dependency test of folds with a hypothesis $H_0: s^2/k=\sigma_l^2$. the test statistics can be found as $\chi^2 = \frac{(k-1) s^2}{k \sigma_l^2}$ when $\chi$ has $k-1$ degrees of freedom.

In [25]:
######### Let's re-calculate the example 2 and 3 in the article
total = 200
classification_res = np.array([32, 30, 27, 30, 25])
sample_mean = np.sum(classification_res / 200) 

for true_classified in classification_res:
    print(f"acc: {true_classified / 40}")

print(f"sample mean is {sample_mean}")

acc: 0.8
acc: 0.75
acc: 0.675
acc: 0.75
acc: 0.625
sample mean is 0.72


In [57]:
##### implementing equation 1
def find_sample_variance(true_classification, fold_samples_count):
    """
    Find sample variance of k-fold cross validation
    
    Parameters:
    -----------
    true_classification : array of floats
        the portion of true classified samples in each fold
        note that the folds count is the length of this array
    fold_samples_count : positive integer
        the count of folds data samples
        
    Returns:
    ---------
    sample_variance : float
        the sample variance of all the folds
    """
    ## find the total sample count
    total_sample_count = fold_samples_count * len(true_classification)
    
    sample_mean = find_sample_mean(true_classification, total_sample_count)
    
    ## the subtraction of true classified portion from sample mean
    subtraction_arr = np.subtract(true_classification / fold_samples_count, sample_mean)

    ## to divide the found value by `k-1`
    sample_variance = np.sum(np.power(subtraction_arr, 2)) / (len(true_classification) - 1)
    
    return sample_variance
    
def find_sample_mean(true_classification, total_samples_count):
    """
    Find the sample mean of k-fold cross validation using the true classification results
    
    Parameters:
    -----------
    true_classification : array of floats
        the portion of true classified samples in each fold
        note that the folds count is the length of this array
    total_samples_count : positive integer
        the count of total sample size (dataset length maybe)
    
    Returns:
    ---------
    sample_mean : float
        the sample mean calculated for all the dataset
    """
    sample_mean = np.sum(true_classification / total_samples_count)
    
    return sample_mean

In [64]:
m_sample_variance = find_sample_variance(classification_res, 40)
m_sample_variance

0.004812500000000001

In [59]:
def find_total_variance(accuracy, total_sample_count):
    """
    Find the variance of a set using its accuracy and samples count
    the equation is `accuracy(1-accuracy)/total_sample_count`
    
    Parameters:
    ------------
    accuracy : float between 0 and 1
        the floating value that rerpresent the portion of true classified samples over all samples
    total_sample_count : integer
        the total samples in a dataset
    
    Returns:
    ---------
    variance : float
        the calculated variance for the dataset
    """
    
    variance = (accuracy * (1 - accuracy)) / total_sample_count
    
    return variance

In [65]:
m_variance = find_total_variance(0.7, 200)
m_variance

0.0010500000000000002

In [62]:
def chi_independence_test(k_folds, total_variance, sample_variance):
    """
    the chi square independence test introduced in the article
    equation is `(`k-1` folds * sample_varience) / `k` folds  * total_variance`
    
    Parameters:
    ------------
    k_folds : positive integer
        the count of folds applied for a model
    total_variance : float
        the variance represented by leave-one-out cross validation
    sample_variance : float
        the variance that is found by aggregation of fold accuracies
        
    Returns:
    ---------
    chi_square : float
        the chi square value 
    """
    
    chi_square = ((k_folds - 1) * sample_variance) / (k_folds * total_variance)
    
    return chi_square

In [71]:
chisquare_value = chi_independence_test(5, m_variance, m_sample_variance)
chisquare_value

3.6666666666666665

And finding the exact values for `Example 3.`, we can now go on to real tests for real datasets, but Before going to experiments in real datasets another thing is to find out the p-value for our test.

So we are going to find the p-value first

In [73]:
stats.chi2.cdf(chisquare_value, 4)

0.5470073861075335

# 1-NN Liver disorders dataset
Let's try KNN method with K=1 with different datasets. minkowski distance with $p=2$ is the euclidean distance and we are using euclidean distance as our nearest neighbour metric.

In [211]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate, cross_val_score, LeaveOneOut, KFold
from sklearn.naive_bayes import GaussianNB

In [82]:
ds_liver = pd.read_csv('Datasets/liver-disorders/bupa.data', names=['mcv', 'alkphos', 'sgpt', 'sgot', 'gammagt', 'drinks', 'selector'])
ds_liver.head()

Unnamed: 0,mcv,alkphos,sgpt,sgot,gammagt,drinks,selector
0,85,92,45,27,31,0.0,1
1,85,64,59,32,23,0.0,2
2,86,54,33,16,54,0.0,2
3,91,78,34,24,36,0.0,2
4,87,70,12,28,10,0.0,2


In [85]:
ds_liver.selector.unique()

array([1, 2], dtype=int64)

we have two classess for each data, selector is the label in our dataset.

In [90]:
KNN_1_liver = KNeighborsClassifier(n_neighbors=1, p=2)
# KNN_1_liver.fit(ds_liver[ds_liver.columns[:-1]], ds_liver.selector)

In [109]:
ds_liver_X = ds_liver[ds_liver.columns[:-1]]
ds_liver_Y = ds_liver.selector 

In [113]:
KNN_1_5cv_scores = cross_validate(KNN_1_liver, 
                                  ds_liver_X,
                                  ds_liver_Y,
                                  cv=5,
                                  return_train_score=True,
                                 return_estimator=True)
KNN_1_5cv_scores

{'fit_time': array([0.0026772 , 0.00232077, 0.00202894, 0.00267959, 0.00164056]),
 'score_time': array([0.00377774, 0.00390458, 0.00401592, 0.00382686, 0.00284386]),
 'estimator': [KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1)],
 'test_score': array([0.65217391, 0.76811594, 0.62318841, 0.66666667, 0.53623188]),
 'train_score': array([1., 1., 1., 1., 1.])}

In [131]:
m_sample_mean = np.mean(KNN_1_5cv_scores['test_score'])
m_sample_mean

0.6492753623188406

In [136]:
## find sample variance
def find_sample_variance_using_accuracies(fold_accuracies):
    """
    Find variances using the accuracies got from k-fold cross validation
    
    Parameters:
    ------------
    fold_accuracies : float between 0 and 1
        the accuracies in each fold of k-fold cross validation
        note that the k folds count will be computed using the length of this parameter  
    
    Returns:
    ---------
    variance : float
        the variance of accuracies
    """
    sample_mean = np.mean(fold_accuracies)
    
    ## the subtraction of true classified portion from sample mean
    subtraction = np.subtract(fold_accuracies, sample_mean)
    
    variance = np.sum(np.power(subtraction, 2)) / (len(fold_accuracies) - 1)
    
    return variance

In [139]:
sample_variance = find_sample_variance_using_accuracies(KNN_1_5cv_scores['test_score'])
sample_variance

0.006973324931737026

## Leave one out method
In this method the folds count is eqaul to the data size and test size in each fold is 1.https://www.statology.org/leave-one-out-cross-validation/

In [152]:
KNN_leaveOneOut = KNeighborsClassifier(n_neighbors=1, p=2)
KNN_leaveOneOut_result = cross_validate(KNN_leaveOneOut, 
                                  ds_liver_X,
                                  ds_liver_Y,
                                  cv=LeaveOneOut(),
                                  return_train_score=True,
                                 return_estimator=False)

In [186]:
KNN_leaveOneOut_result['test_score']

array([0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1.,
       1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0.,
       0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1.,
       1., 0., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 1., 1., 0.,
       1., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
       1., 0., 0., 1., 1.

In [197]:
leaveOneOut_acc = sum(KNN_leaveOneOut_result['test_score']) / len(ds_liver)
leaveOneOut_acc

0.6202898550724638

In [178]:
## implementing leave-one-out method by hand
kf = KFold(n_splits=len(ds_liver))

KNN_leaveOneOut = KNeighborsClassifier(n_neighbors=1, p=2)


test_result = []
for train, test in kf.split(ds_liver):
    ## predict using the classifier
    KNN_leaveOneOut.fit(ds_liver_X.loc[train], ds_liver_Y.loc[train])
    ## predict the test data
    y_pred = KNN_leaveOneOut.predict(ds_liver_X.loc[test])
    
    test_result.append(y_pred == ds_liver_Y.loc[test].values)

In [196]:
sum(test_result) / len(ds_liver)

array([0.62028986])

We can see that the accuracy of leave-one-out compited by hand is the same as the `LeaveOneOut()` method in sklearn. So we will continue with sklearn method because of its simplicity. (We just implement it by hand to see we are going in the right direction)

Now as Example 3, we can compute the chi value

In [198]:
sigma = (leaveOneOut_acc * (1 - leaveOneOut_acc)) / len(ds_liver)
sigma

0.000682696668888828

In [200]:
## the folds was chosen 5 above
chi_square = ((5-1) * sample_variance) / (5* sigma)
chi_square

8.171506028394088

In [201]:
## number of freedom is one less than 5
stats.chi2.cdf(chi_square, 4)

0.9145060709028664

With significance level $\alpha = 0.05$, our null hypothesis cannot be reject because the p-value is $0.91$. And we can say that here 5 fold accuracies are independent.


# Table 2 Reproduction
Here we will reproduce table 2 results. To do this we need to find the leave-one-out accuracies of `1-NN`, `3-NN`, `5-NN`, `7-NN` and `NBC` of different datasets.

## Livers-Disorders

In [203]:
ds_liver = pd.read_csv('Datasets/liver-disorders/bupa.data', names=['mcv', 'alkphos', 'sgpt', 'sgot', 'gammagt', 'drinks', 'selector'])

ds_liver_X = ds_liver[ds_liver.columns[:-1]]
ds_liver_Y = ds_liver.selector

In [214]:
def find_leave_one_out_results(dataset_X, dataset_Y, verbose=False):
    """
    Find the leave-one-out results of a dataset for 1NN, 3NN, 5NN, 7NN and Naive Bayes algorithm
    
    Parameters:
    ------------
    dataset_X : array_like
        the features vector
    dataset_Y : array_like
        the label for each feature
    verbose : bool
        print the progress if True
        default is False
        
    Returns:
    ---------
    acc_dict : dictionary
        the leave-one-out method applied on the dataset and the results of each method accuracy is returned
    """
    ## initialize a dicitonary
    acc_dict = {}
    
    ## 1-NN, 3-NN, 5-NN, 7-NN
    for neighbours_count in [1, 3, 5, 7]:
        ## initialize KNN classifier with neighbours_count value
        KNN_Classifier = KNeighborsClassifier(n_neighbors=neighbours_count, p=2)
        ## find the results of leave-one-out cross validation
        KNN_leaveOneOut_result = cross_validate(KNN_Classifier, 
                                      dataset_X,
                                      dataset_Y,
                                      cv=LeaveOneOut(),
                                      return_train_score=False,
                                      return_estimator=False)

        ## both dataset_X and dataset_Y have the same length
        ## here just use one of it to find the accuracy of classification
        acc_score = sum(KNN_leaveOneOut_result['test_score']) / len(dataset_X)
        
        if verbose:
            print(f'{neighbours_count}-NN finished with accuracy score: {acc_score}')
        
        ## add the accuracy of the classifier to our results
        acc_dict[f'{neighbours_count}NN'] = acc_score
    
    ## Appying Naive Bayes afterwards
    
    ## Naive Bayes
    NB_Classifier = GaussianNB()
    NB_leaveOneOut_result = cross_validate(NB_Classifier, 
                                  dataset_X,
                                  dataset_Y,
                                  cv=LeaveOneOut(),
                                  return_train_score=False,
                                  return_estimator=False)
    ## get the accuracy of Nive Bayes and save it in the dictionary
    NB_acc_score = sum(KNN_leaveOneOut_result['test_score']) / len(dataset_X)
    acc_dict['NBC'] = NB_acc_score
    
    if verbose:
        print(f'Naive Bayes finished with accuracy score: {NB_acc_score}')
    
    return acc_dict

In [262]:
def read_if_available_else_produce(dataset_X, dataset_Y, file_name):
    """
    Read the data if available and else run the leave-one-out method on dataset
    
    Parameters:
    ------------
    dataset_X : array_like
        the features vector
    dataset_Y : array_like
        the label for each feature
    file_name : string
        where to read or save the data
    
    Returns:
    ---------
    acc_dict : dictionary
        the leave-one-out method applied on the dataset and the results of each method accuracy is returned
    """
    ## initialize out of the conditions to have access to it
    table2_results = {}
    ## check if the results is not available then run the method to produce results
    if os.path.isfile(file_name) == False:  
        print("Results not available, Producing them\n")
        table2_results = find_leave_one_out_results(dataset_X,
                                                    dataset_Y,
                                                    verbose=True)

        ## save the results to use later
        file = open(file_name, 'w')
        json.dump(table2_results, file)
        file.close()
    else:
        print(f"Reading From previous data\nResults from file {file_name}")
        file = open(file_name, 'r')
        table2_results = file.read()
        file.close()

    return table2_results


In [264]:
livers_fileName = 'results/LeaveOneOut_liverDisorders.txt'
table2_results_livers_disorder = read_if_available_else_produce(ds_liver_X, ds_liver_Y, livers_fileName)
table2_results_livers_disorder

Reading From previous data
Results from file results/LeaveOneOut_liverDisorders.txt


'{"1NN": 0.6202898550724638, "3NN": 0.6376811594202898, "5NN": 0.6608695652173913, "7NN": 0.6869565217391305, "NBC": 0.6869565217391305}'

## Letter-Recognition dataset
dataset link: https://archive.ics.uci.edu/ml/machine-learning-databases/letter-recognition/

In [217]:
letter_recognition_cols = ['letter', 'x-box', 'y-box', 
                            'width', 'height', 'onpix', 'x-bar', 'y-bar',
                            'x2-bar', 'y2-bar', 'xy-bar', 'x2y-bar','xy2-bar',
                            'x-ege', 'xegvy', 'y-ege','yegvx']
ds_letter_reco = pd.read_csv('Datasets/letter recognition/letter-recognition.data',
                            names=letter_recognition_cols)

In [252]:
ds_letter_reco.head()

Unnamed: 0,letter,x-box,y-box,width,height,onpix,x-bar,y-bar,x2-bar,y2-bar,xy-bar,x2y-bar,xy2-bar,x-ege,xegvy,y-ege,yegvx
0,T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
1,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
2,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
3,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
4,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10


In [222]:
ds_letter_reco_X = ds_letter_reco[letter_recognition_cols[1:]]
ds_letter_reco_Y = ds_letter_reco[letter_recognition_cols[0]]

In [266]:
letter_recognition_result_fileName = 'results/LeaveOneOut_letter_recognition.txt'
table2_results_letter_recognition = read_if_available_else_produce(ds_letter_reco_X,
                               ds_letter_reco_Y
                              ,letter_recognition_result_fileName)
table2_results_letter_recognition

Reading From previous data
Results from file results/LeaveOneOut_letter_recognition.txt


'{"1NN": 0.96245, "3NN": 0.9595, "5NN": 0.958, "7NN": 0.95585, "NBC": 0.95585}'

## MAGIC gamma telescope data
Link: https://archive.ics.uci.edu/ml/datasets/magic+gamma+telescope

In [283]:
ds_magic_gamma = pd.read_csv('Datasets/magic/magic04.data', 
                             names=['fLength', 'fWidth', 'fSize', 'fConc',
                                   'fConc1', 'fAsym', 'fM3Long', 'fM3Trans',
                                   'fAlpha', 'fDist', 'class'])

In [284]:
ds_magic_gamma.head()

Unnamed: 0,fLength,fWidth,fSize,fConc,fConc1,fAsym,fM3Long,fM3Trans,fAlpha,fDist,class
0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
1,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.261,g
2,162.052,136.031,4.0612,0.0374,0.0187,116.741,-64.858,-45.216,76.96,256.788,g
3,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.449,116.737,g
4,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.648,356.462,g


In [285]:
ds_magic_gamma_X = ds_magic_gamma[ds_magic_gamma.columns[:-1]]
ds_magic_gamma_Y = ds_magic_gamma['class']

In [286]:
magic_gamma_FileName = 'results/LeaveOneOut_magic_gamma.txt'
table2_results_magic_gamma = read_if_available_else_produce(ds_letter_reco_X,
                                                            ds_letter_reco_Y,
                                                            magic_gamma_FileName)

1-NN finished with accuracy score: 0.96245
3-NN finished with accuracy score: 0.9595
5-NN finished with accuracy score: 0.958
7-NN finished with accuracy score: 0.95585
Naive Bayes finished with accuracy score: 0.95585


In [287]:
table2_results_magic_gamma

{'1NN': 0.96245, '3NN': 0.9595, '5NN': 0.958, '7NN': 0.95585, 'NBC': 0.95585}

## Page-Blocks dataset
Link: https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification

In [288]:
ds_page_blocks = pd.read_csv('Datasets/page-blocks/page-blocks.data', 
                             sep=' +',
                             engine='python',
                             names=['height', 'length', 'area', 'eccen',
                                   'p_black', 'p_and', 'mean_tr', 'blackpix',
                                   'blackand', 'wb_trans'])

In [289]:
ds_page_blocks.head()


Unnamed: 0,height,length,area,eccen,p_black,p_and,mean_tr,blackpix,blackand,wb_trans
5,7,35,1.4,0.4,0.657,2.33,14,23,6,1
6,7,42,1.167,0.429,0.881,3.6,18,37,5,1
6,18,108,3.0,0.287,0.741,4.43,31,80,7,1
5,7,35,1.4,0.371,0.743,4.33,13,26,3,1
6,3,18,0.5,0.5,0.944,2.25,9,17,4,1


In [290]:
ds_page_blocks_X = ds_page_blocks[ds_page_blocks.columns[:-1]]
ds_page_blocks_Y = ds_page_blocks.wb_trans

In [291]:
page_blocks_FileName = 'results/LeaveOneOut_page_blocks.txt'
table2_results_page_blocks = read_if_available_else_produce(ds_page_blocks_X,
                                                            ds_page_blocks_Y,
                                                            page_blocks_FileName)

1-NN finished with accuracy score: 0.9572446555819477
3-NN finished with accuracy score: 0.9574273707290334
5-NN finished with accuracy score: 0.957061940434862
7-NN finished with accuracy score: 0.9552347889640052
Naive Bayes finished with accuracy score: 0.9552347889640052


## Wine Quality dataset
Link: https://archive.ics.uci.edu/ml/datasets/wine+quality

### White Wine

In [298]:
ds_redWine = pd.read_csv('Datasets/wine-quality/winequality-red.csv', sep=';')
ds_redWine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [299]:
ds_redWine_X = ds_redWine[ds_redWine.columns[:-1]]
ds_redWine_Y = ds_redWine.quality

In [300]:
redWine_FileName = 'results/LeaveOneOut_redWine.txt'
table2_results_redWine = read_if_available_else_produce(ds_redWine_X,
                                                        ds_redWine_Y,
                                                        redWine_FileName)
table2_results_redWine

1-NN finished with accuracy score: 0.6153846153846154
3-NN finished with accuracy score: 0.5234521575984991
5-NN finished with accuracy score: 0.5103189493433395
7-NN finished with accuracy score: 0.4996873045653533
Naive Bayes finished with accuracy score: 0.4996873045653533


{'1NN': 0.6153846153846154,
 '3NN': 0.5234521575984991,
 '5NN': 0.5103189493433395,
 '7NN': 0.4996873045653533,
 'NBC': 0.4996873045653533}

### Red Wine

In [301]:
ds_whiteWine = pd.read_csv('Datasets/wine-quality/winequality-white.csv', sep=';')
ds_whiteWine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [302]:
ds_whiteWine_X = ds_whiteWine[ds_whiteWine.columns[:-1]]
ds_whiteWine_Y = ds_whiteWine.quality

In [303]:
white_Wine_FileName = 'results/LeaveOneOut_white_Wine.txt'
table2_results_White_Wine = read_if_available_else_produce(ds_whiteWine_X,
                                                        ds_whiteWine_Y,
                                                        white_Wine_FileName)
table2_results_White_Wine

1-NN finished with accuracy score: 0.6161698652511229
3-NN finished with accuracy score: 0.5065332788893426
5-NN finished with accuracy score: 0.4959167006941609
7-NN finished with accuracy score: 0.4881584320130666
Naive Bayes finished with accuracy score: 0.4881584320130666


{'1NN': 0.6161698652511229,
 '3NN': 0.5065332788893426,
 '5NN': 0.4959167006941609,
 '7NN': 0.4881584320130666,
 'NBC': 0.4881584320130666}

## TODO: Another Dataset