In [70]:
import sklearn
import numpy as np
import pandas as pd
from scipy import stats

# Machine Learning Final Project
**Student name:** Mohammad Amin Dadgar

**Student Id:** 4003624016

**Instructor:** Dr. Peyman Adibi

**Date:** Tir 1401 | June, July 2022

This final project implements the article [dependency analysis of accuracy estimates in k-fold cross validation](https://ieeexplore.ieee.org/document/8012491).

**Abstract:** K-fold cross-validation is a method for evaluating the performance of classification algorithms. In this report, we assess the appropriateness of K-fold cross-validation by testing whether its fold accuracies are dependent.

**Introduction**

Several studies have shown that fold accuracies in k-fold cross-validation are dependent, but this dependence has no formal definition. This report reproduces the results of the referenced article. Section 2 introduces the statistical methods needed for the experiment. Section 3 introduces the sampling distribution of fold accuracies, and section 4 uses it to present a statistical method for testing the independence of fold accuracies. In section 5, that method is tested on 20 UCI datasets, and section 6 concludes the work.

To follow the idea behind the article, section 4 defines two variances.

The sample variance of the fold accuracies:
\begin{equation}
s^2 = \frac{\sum_{i=1}^{k} (\bar p_i - \bar{\bar{x}})^2}{k-1}
\end{equation}
And the variance for leave-one-out cross-validation:
\begin{equation}
\sigma_I^2 = \frac{p(1-p)}{n}
\end{equation}

In the equations above, $p$ is the accuracy of the classifier, obtained by dividing the number of correct classifications by the sample size $n$.

$\bar p_i$ is the accuracy estimate in fold $i$ and can be written as
\begin{equation}
\bar{p_i} = \frac{\sum_{j=1}^{m} x_{ij}}{m}
\end{equation}
where $m$ is the number of samples in each fold. For example, with $200$ samples in the dataset and $5$ folds, we have $m=\frac{200}{5}=40$.

Here $x_{ij}$ is an indicator that equals $1$ when the classification of sample $j$ in fold $i$ is correct and $0$ when it is wrong.

$\bar{\bar{x}}$ is the sample mean of the fold accuracies, given by
\begin{equation}
\bar{\bar{x}} = \frac{\sum_{i=1}^{k} \bar{p_i}}{k}
\end{equation}
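
As a minimal sketch of the two definitions above, the snippet below computes $\bar p_i$ and $\bar{\bar{x}}$ from a small indicator matrix. The matrix `x` is hypothetical toy data (3 folds of 4 samples), not taken from the article.

```python
import numpy as np

# Hypothetical 0/1 outcomes x_ij: k = 3 folds (rows), m = 4 test samples per fold (columns).
x = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
])

k, m = x.shape
fold_accuracies = x.sum(axis=1) / m   # accuracy estimate of each fold
sample_mean = fold_accuracies.mean()  # mean of the k fold accuracies

print(fold_accuracies)
print(sample_mean)
```

Because every fold has the same size $m$, the mean of the fold accuracies equals the overall fraction of correct classifications.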

We start the dependency test of the folds with the null hypothesis $H_0: s^2/k=\sigma_I^2$. The test statistic is $\chi^2 = \frac{(k-1) s^2}{k \sigma_I^2}$, which has $k-1$ degrees of freedom.

In [25]:
######### Let's re-calculate examples 2 and 3 from the article
total = 200      # total number of samples
fold_size = 40   # samples per fold (200 / 5)
classification_res = np.array([32, 30, 27, 30, 25])  # correct classifications per fold
sample_mean = np.sum(classification_res / total)

for true_classified in classification_res:
    print(f"acc: {true_classified / fold_size}")

print(f"sample mean is {sample_mean}")

acc: 0.8
acc: 0.75
acc: 0.675
acc: 0.75
acc: 0.625
sample mean is 0.72


In [57]:
##### implementing equation 1
def find_sample_variance(true_classification, fold_samples_count):
    """
    Find sample variance of k-fold cross validation
    
    Parameters:
    -----------
    true_classification : array of ints
        the number of correctly classified samples in each fold
        note that the folds count is the length of this array
    fold_samples_count : positive integer
        the number of data samples in each fold
        
    Returns:
    ---------
    sample_variance : float
        the sample variance of all the folds
    """
    ## find the total sample count
    total_sample_count = fold_samples_count * len(true_classification)
    
    sample_mean = find_sample_mean(true_classification, total_sample_count)
    
    ## subtract the sample mean from each fold accuracy
    subtraction_arr = np.subtract(true_classification / fold_samples_count, sample_mean)

    ## sum of squared deviations divided by `k-1`
    sample_variance = np.sum(np.power(subtraction_arr, 2)) / (len(true_classification) - 1)
    
    return sample_variance
    
def find_sample_mean(true_classification, total_samples_count):
    """
    Find the sample mean of k-fold cross validation using the true classification results
    
    Parameters:
    -----------
    true_classification : array of ints
        the number of correctly classified samples in each fold
        note that the folds count is the length of this array
    total_samples_count : positive integer
        the total sample size (the dataset length)
    
    Returns:
    ---------
    sample_mean : float
        the sample mean calculated for all the dataset
    """
    sample_mean = np.sum(true_classification / total_samples_count)
    
    return sample_mean

In [64]:
m_sample_variance = find_sample_variance(classification_res, 40)
m_sample_variance

0.004812500000000001
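
As a quick cross-check, the same sample variance can be obtained from NumPy directly. This is only a sketch using the fold counts of the example above; `np.var` with `ddof=1` divides by $k-1$, which matches the sample variance formula.

```python
import numpy as np

# Correct-classification counts per fold from the example above (40 samples per fold).
counts = np.array([32, 30, 27, 30, 25])
fold_accuracies = counts / 40

# ddof=1 divides the sum of squared deviations by k - 1 instead of k.
variance_check = np.var(fold_accuracies, ddof=1)
print(variance_check)
```

The result should agree with `find_sample_variance(classification_res, 40)` up to floating-point rounding.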

In [59]:
def find_total_variance(accuracy, total_sample_count):
    """
    Find the variance of a set using its accuracy and samples count
    the equation is `accuracy(1-accuracy)/total_sample_count`
    
    Parameters:
    ------------
    accuracy : float between 0 and 1
        the fraction of correctly classified samples over all samples
    total_sample_count : integer
        the total samples in a dataset
    
    Returns:
    ---------
    variance : float
        the calculated variance for the dataset
    """
    
    variance = (accuracy * (1 - accuracy)) / total_sample_count
    
    return variance

In [65]:
m_variance = find_total_variance(0.7, 200)
m_variance

0.0010500000000000002

In [62]:
def chi_independence_test(k_folds, total_variance, sample_variance):
    """
    the chi square independence test introduced in the article
    equation is `((k - 1) * sample_variance) / (k * total_variance)`
    
    Parameters:
    ------------
    k_folds : positive integer
        the count of folds applied for a model
    total_variance : float
        the variance represented by leave-one-out cross validation
    sample_variance : float
        the variance that is found by aggregation of fold accuracies
        
    Returns:
    ---------
    chi_square : float
        the chi square value 
    """
    
    chi_square = ((k_folds - 1) * sample_variance) / (k_folds * total_variance)
    
    return chi_square

In [71]:
chisquare_value = chi_independence_test(5, m_variance, m_sample_variance)
chisquare_value

3.6666666666666665
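
Written out with the numbers computed above ($k=5$, $s^2=0.0048125$, $\sigma_I^2=0.00105$), the arithmetic behind this value is:
\begin{equation}
\chi^2 = \frac{(k-1)\, s^2}{k\, \sigma_I^2} = \frac{4 \times 0.0048125}{5 \times 0.00105} = \frac{0.01925}{0.00525} \approx 3.667
\end{equation}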

Having reproduced the values of `Example 3.`, we can move on to tests on real datasets. Before that, one more quantity is needed: the probability associated with the test statistic.

We evaluate the chi-square CDF with $k-1=4$ degrees of freedom at the test statistic; the upper-tail p-value is one minus this value.

In [73]:
stats.chi2.cdf(chisquare_value, 4)

0.5470073861075335
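
The sketch below shows how the CDF value relates to the tail probabilities and to critical values. The two-sided rejection region at $\alpha = 0.05$ is an assumption made for illustration; the statistic is the one printed above.

```python
from scipy import stats

chi_square_value = 3.6666666666666665  # statistic computed above
dof = 4                                # k - 1 with k = 5 folds

lower_tail = stats.chi2.cdf(chi_square_value, dof)  # P(X <= statistic)
upper_tail = stats.chi2.sf(chi_square_value, dof)   # P(X > statistic) = 1 - cdf

# Critical values of a two-sided test at alpha = 0.05 (assumed for illustration).
alpha = 0.05
lower_critical = stats.chi2.ppf(alpha / 2, dof)
upper_critical = stats.chi2.ppf(1 - alpha / 2, dof)

print(lower_tail, upper_tail)
print(lower_critical, upper_critical)
```

A statistic that lies between the two critical values does not lead to rejecting the null hypothesis under this assumption.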

## 1-NN
Let's try the KNN method with K=1 on different datasets. The Minkowski distance with $p=2$ is the Euclidean distance, which we use as the nearest-neighbour metric.
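
To illustrate why `p=2` gives the Euclidean metric, here is a small sketch with a hand-written Minkowski distance. The helper `minkowski_distance` and the two points are hypothetical and are not part of sklearn.

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two 1-D points (illustrative helper)."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

d_minkowski = minkowski_distance(a, b, p=2)
d_euclidean = np.linalg.norm(a - b)  # Euclidean (L2) norm of the difference

print(d_minkowski, d_euclidean)
```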

In [157]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate, cross_val_score, LeaveOneOut, KFold

### liver-disorders dataset

In [82]:
ds_liver = pd.read_csv('Datasets/liver-disorders/bupa.data', names=['mcv', 'alkphos', 'sgpt', 'sgot', 'gammagt', 'drinks', 'selector'])
ds_liver.head()

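| | mcv | alkphos | sgpt | sgot | gammagt | drinks | selector |
|---|---|---|---|---|---|---|---|
| 0 | 85 | 92 | 45 | 27 | 31 | 0.0 | 1 |
| 1 | 85 | 64 | 59 | 32 | 23 | 0.0 | 2 |
| 2 | 86 | 54 | 33 | 16 | 54 | 0.0 | 2 |
| 3 | 91 | 78 | 34 | 24 | 36 | 0.0 | 2 |
| 4 | 87 | 70 | 12 | 28 | 10 | 0.0 | 2 |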


In [85]:
ds_liver.selector.unique()

array([1, 2], dtype=int64)

There are two classes; `selector` is the label column in our dataset.

In [90]:
KNN_1_liver = KNeighborsClassifier(n_neighbors=1, p=2)
# KNN_1_liver.fit(ds_liver[ds_liver.columns[:-1]], ds_liver.selector)

In [109]:
ds_liver_X = ds_liver[ds_liver.columns[:-1]]
ds_liver_Y = ds_liver.selector 

In [113]:
KNN_1_5cv_scores = cross_validate(KNN_1_liver,
                                  ds_liver_X,
                                  ds_liver_Y,
                                  cv=5,
                                  return_train_score=True,
                                  return_estimator=True)
KNN_1_5cv_scores

{'fit_time': array([0.0026772 , 0.00232077, 0.00202894, 0.00267959, 0.00164056]),
 'score_time': array([0.00377774, 0.00390458, 0.00401592, 0.00382686, 0.00284386]),
 'estimator': [KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1),
  KNeighborsClassifier(n_neighbors=1)],
 'test_score': array([0.65217391, 0.76811594, 0.62318841, 0.66666667, 0.53623188]),
 'train_score': array([1., 1., 1., 1., 1.])}

In [131]:
m_sample_mean = np.mean(KNN_1_5cv_scores['test_score'])
m_sample_mean

0.6492753623188406

In [136]:
## find sample variance
def find_sample_variance_using_accuracies(fold_accuracies):
    """
    Find the sample variance of the accuracies obtained from k-fold cross validation
    
    Parameters:
    ------------
    fold_accuracies : array of floats between 0 and 1
        the accuracy in each fold of k-fold cross validation
        note that the folds count is computed from the length of this parameter
    
    Returns:
    ---------
    variance : float
        the variance of accuracies
    """
    sample_mean = np.mean(fold_accuracies)
    
    ## subtract the sample mean from each fold accuracy
    subtraction = np.subtract(fold_accuracies, sample_mean)
    
    variance = np.sum(np.power(subtraction, 2)) / (len(fold_accuracies) - 1)
    
    return variance

In [139]:
sample_variance = find_sample_variance_using_accuracies(KNN_1_5cv_scores['test_score'])
sample_variance

0.006973324931737026

#### Leave one out method
In this method the number of folds is equal to the data size, and the test set in each fold contains a single sample. See https://www.statology.org/leave-one-out-cross-validation/
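
A minimal sketch of this splitting behaviour on toy data: `X_toy` is a hypothetical array of 5 samples, so `LeaveOneOut` should produce 5 splits, each holding out exactly one sample.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_toy = np.arange(10).reshape(5, 2)  # 5 hypothetical samples with 2 features each

loo = LeaveOneOut()
splits = list(loo.split(X_toy))

print(loo.get_n_splits(X_toy))
for train_idx, test_idx in splits:
    # each split trains on n - 1 samples and tests on the remaining one
    print(train_idx, test_idx)
```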

In [152]:
KNN_leaveOneOut = KNeighborsClassifier(n_neighbors=1, p=2)
KNN_leaveOneOut_result = cross_validate(KNN_leaveOneOut,
                                        ds_liver_X,
                                        ds_liver_Y,
                                        cv=LeaveOneOut(),
                                        return_train_score=True,
                                        return_estimator=False)

In [186]:
KNN_leaveOneOut_result['test_score']

array([0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 0., 1.,
       1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0.,
       0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1.,
       1., 0., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 0., 1., 1., 0.,
       1., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 1.,
       1., 1., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
       1., 0., 0., 1., 1.

In [197]:
leaveOneOut_acc = sum(KNN_leaveOneOut_result['test_score']) / len(ds_liver)
leaveOneOut_acc

0.6202898550724638

In [178]:
## implementing leave-one-out method by hand
## KFold with n_splits equal to the dataset size leaves one sample out per fold
kf = KFold(n_splits=len(ds_liver))

KNN_leaveOneOut = KNeighborsClassifier(n_neighbors=1, p=2)

test_result = []
for train, test in kf.split(ds_liver):
    ## fit the classifier on the training samples (positional indices, hence iloc)
    KNN_leaveOneOut.fit(ds_liver_X.iloc[train], ds_liver_Y.iloc[train])
    ## predict the single held-out sample
    y_pred = KNN_leaveOneOut.predict(ds_liver_X.iloc[test])
    
    ## True when the prediction matches the label
    test_result.append(y_pred == ds_liver_Y.iloc[test].values)

In [196]:
sum(test_result) / len(ds_liver)

array([0.62028986])

We can see that the leave-one-out accuracy computed by hand is the same as that of the `LeaveOneOut()` method in sklearn, so we will continue with the sklearn method because of its simplicity. (We implemented it by hand only to check that we are going in the right direction.)

Now, as in Example 3, we can compute the chi-square value.

In [198]:
## variance of the accuracy estimate, p(1-p)/n (this is sigma squared, despite the name)
sigma = (leaveOneOut_acc * (1 - leaveOneOut_acc)) / len(ds_liver)
sigma

0.000682696668888828

In [200]:
## 5 folds were used above
chi_square = ((5-1) * sample_variance) / (5* sigma)
chi_square

8.171506028394088

In [201]:
## degrees of freedom: one less than the 5 folds
stats.chi2.cdf(chi_square, 4)

0.9145060709028664

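With significance level $\alpha = 0.05$, our null hypothesis cannot be rejected: the chi-square CDF at the test statistic is about $0.91$ (an upper-tail probability of about $0.09$), which does not fall in the rejection region of either a one-sided or a two-sided test. We can therefore say that the 5 fold accuracies are independent here.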
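
The steps of this section can be collected into one helper, sketched below. The function name `fold_independence_test` is hypothetical; the inputs are the fold accuracies and leave-one-out accuracy printed above, and `n_samples = 345` is inferred from those outputs rather than stated in the notebook.

```python
import numpy as np
from scipy import stats

def fold_independence_test(fold_accuracies, loo_accuracy, n_samples):
    """Return the chi-square statistic with its lower- and upper-tail probabilities."""
    fold_accuracies = np.asarray(fold_accuracies, dtype=float)
    k = len(fold_accuracies)
    s2 = np.var(fold_accuracies, ddof=1)                    # sample variance of fold accuracies
    sigma2 = loo_accuracy * (1 - loo_accuracy) / n_samples  # p(1-p)/n from leave-one-out
    statistic = (k - 1) * s2 / (k * sigma2)
    dof = k - 1
    return statistic, stats.chi2.cdf(statistic, dof), stats.chi2.sf(statistic, dof)

# Values printed earlier for 1-NN on the liver-disorders dataset.
fold_acc = [0.65217391, 0.76811594, 0.62318841, 0.66666667, 0.53623188]
statistic, lower_tail, upper_tail = fold_independence_test(
    fold_acc, loo_accuracy=0.6202898550724638, n_samples=345  # 345 is an inferred assumption
)
print(statistic, lower_tail, upper_tail)
```

Returning both tails keeps the decision rule explicit, since the choice between a one-sided and a two-sided test is left to the reader here.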
