<a href="https://colab.research.google.com/github/Yaminipampana/FMML_LABS/blob/main/FMML_M1L2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning terms and metrics

FMML Module 1, Lab 2

In this lab, we will walk through part of the ML pipeline using the California Housing dataset. It has 20640 samples, each with 8 attributes such as the median income and the median house age of a block group. The task is to predict the median house value of each district. We will use the scikit-learn library to load the data and perform some basic data preprocessing and model training. We will also show how to evaluate the model using some common metrics, split the data into training and testing sets, and use cross-validation to get a better estimate of the model's performance.

In [1]:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

In [2]:
dataset = datasets.fetch_california_housing()
# Dataset description
print(dataset.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

Given below is the list of target values. Each is the median house value of the district described by the 8 input features, and the values are continuous. We should use regression models to predict these values, but we will start with a simple classification model for the sake of simplicity. To do so, we convert the values to integers with `astype(int)`, which truncates the fractional part rather than rounding (4.526 becomes 4 and 0.923 becomes 0), and use a classification model to predict the resulting integer house value.

In [3]:
print("Original target values:", dataset.target)

dataset.target = dataset.target.astype(int)

print("Target values after conversion:", dataset.target)
print("Input variables shape:", dataset.data.shape)
print("Output variables shape:", dataset.target.shape)

Original target values: [4.526 3.585 3.521 ... 0.923 0.847 0.894]
Target values after conversion: [4 3 3 ... 0 0 0]
Input variables shape: (20640, 8)
Output variables shape: (20640,)
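Note that `astype(int)` drops the fractional part instead of rounding to the nearest integer. The short sketch below compares the two on three of the target values printed above; the variable names are illustrative and not part of the lab code.

```python
import numpy as np

# Three values taken from the printed target array above.
values = np.array([4.526, 3.585, 0.923])

truncated = values.astype(int)          # drops the fractional part
rounded = np.rint(values).astype(int)   # rounds to the nearest integer

print("astype(int):", truncated)  # [4 3 0]
print("np.rint:    ", rounded)    # [5 4 1]
```

Either conversion gives usable class labels. The lab uses truncation, so a label of 4 means a value in the range [4, 5).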


One of the simplest models for classification is the K-Nearest Neighbors model. We will use it with K = 1 to predict the integer house value, and we will evaluate it with the accuracy metric.

In [4]:
def NN1(traindata, trainlabel, query):
    """
    This function takes in the training data, training labels and a query point
    and returns the predicted label for the query point using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    query: numpy array of shape (d,) where d is the number of features

    returns: the predicted label for the query point which is the label of the training data which is closest to the query point
    """
    diff = (
        traindata - query
    )  # NumPy broadcasting subtracts the query from every training row
    sq = diff * diff  # square the differences
    dist = sq.sum(1)  # squared Euclidean distance to each training sample
    label = trainlabel[np.argmin(dist)]
    return label


def NN(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the nearest neighbour algorithm

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is the label of the training data which is closest to each test point
    """
    predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
    return predlabel
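The loop in `NN` calls `NN1` once per test sample. As a hedged alternative, the sketch below computes the same squared distances for a batch of test rows at once using broadcasting. The function name `nn_batched`, the `batch` parameter, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np


def nn_batched(traindata, trainlabel, testdata, batch=64):
    """1-nearest-neighbour predictions, processing `batch` test rows at a time."""
    preds = []
    for start in range(0, len(testdata), batch):
        chunk = testdata[start:start + batch]  # shape (b, d)
        # (b, 1, d) - (1, n, d) broadcasts to (b, n, d)
        diff = chunk[:, None, :] - traindata[None, :, :]
        dist = (diff * diff).sum(axis=2)  # squared distances, shape (b, n)
        preds.append(trainlabel[np.argmin(dist, axis=1)])
    return np.concatenate(preds)


# Toy data: two points near the origin (label 0), one far away (label 1).
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])
toy_test = np.array([[0.2, 0.1], [4.0, 4.5]])

toy_pred = nn_batched(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1]
```

The batch size trades memory for speed: each batch holds a temporary array of shape (batch, n, d), so very large batches can exhaust memory on the full dataset.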

We will also define a 'random classifier', which randomly allots a label to each sample.

In [5]:
def RandomClassifier(traindata, trainlabel, testdata):
    """
    This function takes in the training data, training labels and test data
    and returns the predicted labels for the test data using the random classifier algorithm

    In reality, we don't need these arguments but we are passing them to keep the function signature consistent with other classifiers

    traindata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    trainlabel: numpy array of shape (n,) where n is the number of samples
    testdata: numpy array of shape (m,d) where m is the number of test samples and d is the number of features

    returns: the predicted labels for the test data which is a random label from the training data
    """

    classes = np.unique(trainlabel)
    rints = rng.integers(low=0, high=len(classes), size=len(testdata))
    predlabel = classes[rints]
    return predlabel

We need a metric to evaluate the performance of the model. Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. We will use the accuracy metric to evaluate and compare the performance of the K-Nearest Neighbors model and the random classifier.

In [6]:
def Accuracy(gtlabel, predlabel):
    """
    This function takes in the ground-truth labels and predicted labels
    and returns the accuracy of the classifier

    gtlabel: numpy array of shape (n,) where n is the number of samples
    predlabel: numpy array of shape (n,) where n is the number of samples

    returns: the accuracy of the classifier which is the number of correct predictions divided by the total number of predictions
    """
    assert len(gtlabel) == len(
        predlabel
    ), "Length of the ground-truth labels and predicted labels should be the same"
    correct = (
        gtlabel == predlabel
    ).sum()  # count the number of times the groundtruth label is equal to the predicted label.
    return correct / len(gtlabel)

Let us make a function to split the dataset with the desired probability. We will use this function to split the dataset into training and testing sets. We will use the training set to train the model and the testing set to evaluate the model.

In [7]:
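def split(data, label, percent):
    """
    This function randomly splits the data and labels into two parts

    data: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    label: numpy array of shape (n,) where n is the number of samples
    percent: float between 0 and 1, the probability that a sample goes to the first split

    returns: split1data, split1label, split2data, split2label
    """
    # generate a random number for each sample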
    rnd = rng.random(len(label))
    split1 = rnd < percent
    split2 = rnd >= percent

    split1data = data[split1, :]
    split1label = label[split1]
    split2data = data[split2, :]
    split2label = label[split2]
    return split1data, split1label, split2data, split2label

We will reserve about 20% of our dataset as the test set (the split is random, so the exact share varies slightly). We will not change this portion throughout our experiments.

In [8]:
testdata, testlabel, alltraindata, alltrainlabel = split(
    dataset.data, dataset.target, 20 / 100
)
print("Number of test samples:", len(testlabel))
print("Number of train samples:", len(alltrainlabel))
print("Percent of test data:", len(testlabel) * 100 / len(dataset.target), "%")

Number of test samples: 4144
Number of train samples: 16496
Percent of test data: 20.07751937984496 %
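The test share is 20.08% rather than exactly 20% because `split` assigns each sample independently with the given probability. If an exact size is wanted, one option is to shuffle the indices and cut them at a fixed position. This is a minimal sketch, not the lab's method; `split_exact` and the toy arrays are hypothetical names.

```python
import numpy as np


def split_exact(data, label, fraction, rng):
    """Split into two parts where the first has exactly round(fraction * n) samples."""
    n_first = int(round(fraction * len(label)))
    order = rng.permutation(len(label))  # random ordering of all sample indices
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]


demo_rng = np.random.default_rng(0)
demo_data = np.arange(40, dtype=float).reshape(20, 2)
demo_label = np.arange(20)

a_data, a_label, b_data, b_label = split_exact(demo_data, demo_label, 0.2, demo_rng)
print(len(a_label), len(b_label))  # 4 16
```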


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [9]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)

What is the accuracy of our classifiers on the train dataset?

In [10]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using nearest neighbour algorithm:", trainAccuracy*100, "%")

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Training accuracy using random classifier: ", trainAccuracy*100, "%")

Training accuracy using nearest neighbour algorithm: 100.0 %
Training accuracy using random classifier:  16.4375808538163 %
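The random classifier's accuracy sits near 1/(number of classes) even when the classes are imbalanced, because a uniform guess matches any given true label with probability 1/K. The simulation below is a small sketch of this with 6 classes and made-up class frequencies; all names and numbers in it are illustrative assumptions.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
num_classes = 6

# Deliberately imbalanced, made-up class frequencies (they sum to 1).
class_probs = np.array([0.05, 0.35, 0.30, 0.15, 0.10, 0.05])
true_labels = demo_rng.choice(num_classes, size=100_000, p=class_probs)

# Uniform random guesses, as in RandomClassifier.
guesses = demo_rng.integers(low=0, high=num_classes, size=true_labels.size)
guess_accuracy = (guesses == true_labels).mean()

print("simulated accuracy:", guess_accuracy)
print("1 / num_classes:   ", 1 / num_classes)
```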


For nearest neighbour, the train accuracy is always 100% (barring identical samples that carry different labels), because every training sample is its own nearest neighbour. The accuracy of the random classifier is close to 1/(number of classes), which is about 0.1667 in our case since there are 6 classes. This is because the random classifier randomly assigns a label to each sample and the probability of assigning the correct label is 1/(number of classes). Let us predict the labels for our validation set and get the accuracy. This accuracy is a much better estimate of the accuracy of our model on unseen data than the train accuracy.

In [11]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")


valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.10852713178294 %
Validation accuracy using random classifier: 16.884689922480618 %


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier. Now let us try another random split and check the validation accuracy. We will see that the validation accuracy changes with the split. This is because the accuracy depends on which samples happen to land in the validation set, and the effect grows as the validation set gets smaller. We can get a better estimate of the accuracy by using cross-validation.

In [12]:
traindata, trainlabel, valdata, vallabel = split(
    alltraindata, alltrainlabel, 75 / 100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour algorithm:", valAccuracy*100, "%")

Validation accuracy using nearest neighbour algorithm: 34.048257372654156 %


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.
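A rough way to see why the runs stay close together is the binomial standard error of an accuracy estimate, $\sqrt{p(1-p)/n}$, where $p$ is the accuracy and $n$ the number of validation samples. The sketch below plugs in an accuracy of about 34% and an assumed validation-set size of 4000; that size is an illustrative guess, not a value measured in this lab, and the formula only covers the randomness of the validation sample, not the change in the training set.

```python
import math

p = 0.34   # validation accuracy of roughly 34%, as observed above
n = 4000   # ASSUMED validation-set size, for illustration only

standard_error = math.sqrt(p * (1 - p) / n)
print(f"standard error: {standard_error:.4f}")  # about 0.0075, i.e. 0.75 percentage points
```

Under these assumptions, differences of well under one percentage point between runs are within what sampling noise alone would produce.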

Now let us compare it with the accuracy we get on the test dataset.

In [13]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)

print("Test accuracy:", testAccuracy*100, "%")

Test accuracy: 34.91795366795367 %


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.
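One possible way to organise this experiment is to loop over several training fractions and record the validation accuracy for each. The sketch below does this on small synthetic data so that it runs quickly; `nearest_neighbour`, `random_split`, and the two-blob dataset are stand-ins assumed for illustration, not the lab's housing data.

```python
import numpy as np

demo_rng = np.random.default_rng(0)


def nearest_neighbour(traindata, trainlabel, testdata):
    """Label each test row with the label of its closest training row."""
    preds = [
        trainlabel[np.argmin(((traindata - query) ** 2).sum(axis=1))]
        for query in testdata
    ]
    return np.array(preds)


def random_split(data, label, fraction):
    """Send each sample to the first part with probability `fraction`."""
    mask = demo_rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]


# Synthetic two-class data: two Gaussian blobs shifted apart by the label.
n_samples = 600
demo_label = demo_rng.integers(0, 2, size=n_samples)
demo_data = demo_rng.normal(size=(n_samples, 2)) + 2.0 * demo_label[:, None]

fractions = [0.1, 0.3, 0.5, 0.7, 0.9]  # share of data used for training
val_accuracies = []
for fraction in fractions:
    tr_x, tr_y, va_x, va_y = random_split(demo_data, demo_label, fraction)
    pred = nearest_neighbour(tr_x, tr_y, va_x)
    val_accuracies.append((pred == va_y).mean())

for fraction, acc in zip(fractions, val_accuracies):
    print(f"train fraction {fraction:.1f}: validation accuracy {acc:.3f}")
```

The two lists can then be passed to `plt.plot(fractions, val_accuracies, marker="o")`. With extreme fractions such as 0.001 or 0.999 one of the two parts can end up empty, so those cases need a guard before calling the classifier.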



1. Increasing the Percentage of the Validation Set

 Smaller Training Set: Increasing the size of the validation set reduces the amount of data available for training. Since machine learning models learn patterns from training data, having less data may result in the model learning less effectively, potentially reducing its ability to generalize well.

 Validation Set Accuracy:

 Stabilization of Metrics: With a larger validation set, performance metrics (like accuracy, precision, recall, etc.) are typically more stable. Larger datasets better represent the overall population, leading to a more reliable estimation of the model's true performance.

 Possibly Lower Model Performance: If the model was trained on a reduced training set, the performance metrics might slightly decrease because the model had fewer samples to learn from.

2. Reducing the Percentage of the Validation Set

 Larger Training Set: With more training data, the model generally learns more patterns, potentially improving its performance on unseen data. For nearest neighbour in particular, more training samples make it more likely that a query has a genuinely similar neighbour.

 Validation Set Accuracy:

 Higher Variance in Metrics: A smaller validation set may not represent the overall distribution of the data well, leading to high variability in performance metrics. This makes it harder to get an accurate estimate of how the model will perform on new, unseen data.

 Overfitting Risks: If the validation set is too small, any choice tuned against it (for example, the number of neighbours) may overfit to that particular sample, and performance metrics on the validation set may not reflect the model's true generalization ability.

 Experimentation with Different Splits

 Common training-validation splits include 80-20, 90-10, and 70-30. For example:

 80-20 Split: A well-balanced trade-off between training and validation. There's enough data for the model to learn, and the validation metrics are relatively reliable.

 90-10 Split: A larger training set benefits the learning process, but validation metrics may show higher variability due to a smaller validation set.

 70-30 Split: A larger validation set gives more confidence in validation metrics but may reduce model learning if the training set becomes too small.

 Metrics Impacted

 Validation Accuracy: With more data in the validation set, accuracy becomes more reliable, but the model's ability to generalize may suffer if there’s not enough training data.
 Precision/Recall: These metrics might fluctuate more with smaller validation sets due to imbalance or underrepresentation of certain classes.
 F1-Score: A balance between precision and recall, it can also be sensitive to validation set size, especially in imbalanced datasets.


The size of the training and validation sets directly impacts how well we can predict the accuracy on the test set using the validation set, because these sets control both model learning and evaluation. Here’s a breakdown of how they affect the prediction of test set accuracy:


1. Impact of Training Set Size on Predicting Test Set Accuracy

 Larger Training Set:

 Better Learning: A larger training set allows the model to learn more patterns and generalize better. This leads to better performance on unseen data, including both the validation and test sets.

 More Reliable Validation Accuracy: If the model learns well from a large training set, its performance on the validation set will more closely mirror its performance on the test set.

 More Robust Models: A larger training set reduces overfitting, as the model is less likely to memorize the training data and instead captures the underlying patterns, making validation accuracy a good proxy for test accuracy.

 Smaller Training Set:

 Underfitting Risk: With less data, the model may underfit, meaning it doesn’t capture enough of the patterns in the data. As a result, the validation accuracy may be lower, and there will be less confidence that it reflects the test set accuracy.

 Greater Performance Gap: A smaller training set might cause a gap between validation accuracy and test accuracy. If the model is not well-trained, it may perform poorly on the test set even if it seems to perform decently on a small validation set.

2. Impact of Validation Set Size on Predicting Test Set Accuracy

 Larger Validation Set:

 More Reliable Estimate: A larger validation set offers a better estimate of the model’s performance, as it represents a broader portion of the data. If the validation set accurately reflects the test set distribution, its accuracy will be closer to the test set accuracy.

 Less Variability: With more data in the validation set, the variance in accuracy metrics (e.g., accuracy, precision, recall) is lower. This reduces the likelihood of overestimating or underestimating the model's performance.

 Improved Generalization: A larger validation set minimizes the likelihood of spurious results due to a small sample size, leading to more confidence that the validation accuracy will closely mirror the test accuracy.

 Smaller Validation Set:

 Higher Variability: A smaller validation set has more variance in performance metrics due to fewer data points. This increases the risk that validation accuracy is either overestimated or underestimated, leading to poor predictions about test set accuracy.

 Risk of Overfitting to Validation Data: If the validation set is too small, the model might perform well on the validation set (due to being accidentally well-suited to the small sample), but poorly on the test set, leading to inaccurate predictions of performance on unseen data.

3. Training-Validation-Testing Relationships

 Large Training Set + Small Validation Set: This setup might produce a well-trained model, but the small validation set could give misleading performance metrics (high variance), making it hard to confidently predict the model’s accuracy on the test set.

 Small Training Set + Large Validation Set: This might lead to a model that underfits the data, giving lower validation accuracy, which might underestimate test accuracy. A well-represented validation set may reflect test accuracy better, but if the model hasn't learned well due to the small training set, both validation and test accuracy may be suboptimal.

 Balanced Training and Validation Set: With a balanced split (e.g., 80-20), there’s enough data for the model to learn well and for the validation set to give an accurate estimate of the model's performance on unseen test data.

 Practical Implications

  Smaller Datasets: In smaller datasets, finding a good balance between training and validation data is crucial. In these cases, using techniques like cross-validation (where multiple subsets are used as validation sets) is helpful in accurately predicting test set accuracy.

 Larger Datasets: In larger datasets, a smaller validation set (e.g., 10%) may still yield reliable estimates of test set accuracy since both training and validation data represent enough of the data distribution.



> Exercise: Try to implement a 3 nearest neighbour classifier and compare the accuracy of the 1 nearest neighbour classifier and the 3 nearest neighbour classifier on the test dataset. You can use the KNeighborsClassifier class from the scikit-learn library to implement the K-Nearest Neighbors model. You can set the number of neighbors using the n_neighbors parameter. You can also use the accuracy_score function from the scikit-learn library to calculate the accuracy of the model.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute. You can reduce the number of splits to make it faster.

In [14]:
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
    """
    This function takes in the data, labels, split percentage, number of iterations and classifier function
    and returns the average accuracy of the classifier

    alldata: numpy array of shape (n,d) where n is the number of samples and d is the number of features
    alllabel: numpy array of shape (n,) where n is the number of samples
    splitpercent: float between 0 and 1 which is the fraction of data to be used for training
    iterations: int which is the number of iterations to run the classifier
    classifier: function which is the classifier function to be used

    returns: the average accuracy of the classifier
    """
    accuracy = 0
    for ii in range(iterations):
        traindata, trainlabel, valdata, vallabel = split(
            alldata, alllabel, splitpercent
        )
        valpred = classifier(traindata, trainlabel, valdata)
        accuracy += Accuracy(vallabel, valpred)
    return accuracy / iterations  # average of all accuracies

In [15]:
avg_acc = AverageAccuracy(alltraindata, alltrainlabel, 75 / 100, 10, classifier=NN)
print("Average validation accuracy:", avg_acc*100, "%")
testpred = NN(alltraindata, alltrainlabel, testdata)

print("Test accuracy:", Accuracy(testlabel, testpred)*100, "%")

Average validation accuracy: 33.58463539517022 %
Test accuracy: 34.91795366795367 %


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.
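When averaging over repeated random splits, it can help to keep the individual accuracies so that their spread is visible alongside the mean. The sketch below is a variant of the idea behind `AverageAccuracy` that returns every score; it runs on small synthetic data, and `split_accuracies`, `nearest_neighbour`, and the blob dataset are assumptions made for illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(0)


def nearest_neighbour(traindata, trainlabel, testdata):
    """Label each test row with the label of its closest training row."""
    preds = [
        trainlabel[np.argmin(((traindata - query) ** 2).sum(axis=1))]
        for query in testdata
    ]
    return np.array(preds)


def split_accuracies(data, label, fraction, iterations):
    """Validation accuracy of 1-NN for each of `iterations` random splits."""
    scores = []
    for _ in range(iterations):
        mask = demo_rng.random(len(label)) < fraction  # True -> training sample
        pred = nearest_neighbour(data[mask], label[mask], data[~mask])
        scores.append((pred == label[~mask]).mean())
    return np.array(scores)


# Synthetic two-class data: two Gaussian blobs shifted apart by the label.
n_samples = 400
demo_label = demo_rng.integers(0, 2, size=n_samples)
demo_data = demo_rng.normal(size=(n_samples, 2)) + 2.0 * demo_label[:, None]

scores = split_accuracies(demo_data, demo_label, 0.75, 10)
print("mean accuracy:", scores.mean())
print("std across splits:", scores.std(ddof=1))
```

A small standard deviation across splits suggests that a single split would already have given a similar number, while a large one signals that the average is doing real work.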

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Averaging Across Multiple Splits Provides Consistency

 Reduces Variance: When you evaluate a model on a single validation split, the performance may vary depending on how representative that specific subset is of the overall data. Some splits might be easier for the model to predict than others, leading to variability in performance metrics. By averaging the validation accuracy across multiple splits, you reduce this variance and get a more reliable estimate of the model's performance.

 More Representative of Data Distribution: Each split (or fold) provides a different subset of data for validation. By training and evaluating the model on different subsets, you ensure that all parts of the dataset contribute to both training and validation. This makes the average validation accuracy a better reflection of the model’s ability to generalize across the entire dataset.

 Mitigates Bias from a Single Split: A single validation split might contain outliers or an unrepresentative sample of the data, which can bias the accuracy estimate. Averaging across multiple splits helps cancel out the effect of such anomalies, giving a more stable and consistent result.

2. k-Fold Cross-Validation for Multiple Splits

 k-Fold Cross-Validation is a commonly used method to average validation accuracy across multiple splits. In this technique:

 The data is divided into k equal-sized "folds".

 The model is trained on k−1 folds and validated on the remaining fold.

 This process is repeated k times, each time using a different fold as the validation set.

 The final validation accuracy is the average accuracy across all folds, which provides a more reliable performance estimate.
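The fold construction described above can be sketched as a small index generator. This is an illustrative version only (scikit-learn provides a `KFold` class for the same purpose); `kfold_indices` and its arguments are hypothetical names.

```python
import numpy as np


def kfold_indices(n_samples, k, rng):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    order = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(order, k)     # k nearly equal-sized folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


fold_list = list(kfold_indices(10, 5, np.random.default_rng(0)))
for train_idx, val_idx in fold_list:
    print("train:", np.sort(train_idx), "val:", np.sort(val_idx))
```

Each sample lands in the validation fold exactly once, which is the main difference from the repeated random splits used by `AverageAccuracy`.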

 The value of k (e.g., 5, 10) controls how many splits are performed. Higher values of k provide more splits but require more computation.

3. Leave-One-Out Cross-Validation (LOO-CV)

 A special case of k-fold cross-validation is leave-one-out cross-validation (LOO-CV), where k equals the number of data points. This method provides the most exhaustive validation, as the model is trained on all but one data point and validated on the remaining one, repeated for every point in the dataset.
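For a 1-nearest-neighbour classifier, leave-one-out can be computed without retraining: each point is classified by its nearest neighbour among all the other points. The sketch below does this with a full pairwise distance matrix, which is only practical for small datasets; `loo_accuracy_1nn` and the toy data are assumptions for illustration.

```python
import numpy as np


def loo_accuracy_1nn(data, label):
    """Leave-one-out accuracy of 1-NN using a full pairwise distance matrix."""
    diff = data[:, None, :] - data[None, :, :]   # shape (n, n, d)
    dist = (diff * diff).sum(axis=2)             # squared distances, shape (n, n)
    np.fill_diagonal(dist, np.inf)               # a point may not vote for itself
    nearest = np.argmin(dist, axis=1)
    return (label[nearest] == label).mean()


# Toy 1-D data: the last point (at 6.0, label 0) is closest to a label-1 point.
toy_data = np.array([[0.0], [1.0], [10.0], [11.0], [6.0]])
toy_label = np.array([0, 0, 1, 1, 0])

loo_acc = loo_accuracy_1nn(toy_data, toy_label)
print(loo_acc)  # 0.8
```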

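 Because almost all of the data is used for training in every round, the LOO-CV estimate has low bias, but the estimate itself can have high variance and the method is computationally expensive for large datasets.

4. Effects on Model Selection and Hyperparameter Tuning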

 More Reliable Hyperparameter Tuning: When hyperparameters are tuned based on the performance of a single validation set, there's a risk that the chosen hyperparameters may be overfitted to that specific split. Averaging across multiple splits mitigates this risk, leading to hyperparameter choices that generalize better across different data.

 More Stable Model Selection: Similarly, choosing between different models based on the average performance over multiple splits ensures that the selected model performs consistently well across different subsets of the data, rather than excelling on just one.

5. Considerations for Consistent Results

 Number of Splits (k): The value of k in k-fold cross-validation affects the consistency of results. A larger value of k (e.g., 10) typically provides a more accurate estimate of the model's performance but requires more computation. A lower value (e.g., 5) reduces the computational cost but may increase variability.

 Dataset Size: For smaller datasets, cross-validation is particularly important because a single split may not represent the overall data well. Larger datasets generally allow more reliable performance estimates even with a single split.


1. *Reduced Variance*: By splitting the data multiple times, the model is trained and tested on different subsets of the data. This reduces the variance of the performance estimate, leading to more robust and reliable results.

2. *Use of Full Dataset*: With multiple splits, every observation in the dataset gets a chance to be in both the training and testing sets, providing a more comprehensive view of model performance.

3. *Less Bias*: A single train-test split can lead to a biased estimate of performance if the data is not representative. Multiple splits help mitigate this issue.

For example, *k-fold cross-validation* divides the data into k subsets, trains the model on k-1 folds, and tests on the remaining fold. This process repeats k times, and the final accuracy is averaged across all runs, giving a more accurate estimate.

1. *More Accurate Estimate*: More iterations usually give a more stable estimate of model performance, because the average is taken over more train/validation splits and depends less on any single one. In k-fold cross-validation specifically, more folds also mean that each model is trained on a larger share of the data and validated on a smaller fold.

2. *Trade-Off with Computation*: While more iterations lead to better estimates, they also increase the computational cost. For example, in k-fold cross-validation, the number of iterations is equal to *k*, and each iteration involves training and testing the model. Thus, more folds mean more computation.

3. *Diminishing Returns*: After a certain point, increasing the number of folds yields diminishing returns. For instance, going beyond 10 folds often does not improve the estimate substantially, but it does increase the computational demands.

Increasing the number of iterations in cross-validation (like using more folds) helps in better utilizing the limited data but doesn’t fully solve the problem of having very small datasets. Here’s a closer look:

1. *Better Data Utilization*: With more folds (higher iterations), each data point gets to be in both the training and validation sets across different iterations. This means every sample contributes to the training and validation processes multiple times, potentially leading to a more reliable estimate of model performance.

2. *Stability in Estimates*: More iterations (or folds) can give you a more stable estimate of model performance by reducing variance due to the specific choice of training and validation splits. This is particularly useful when the dataset is small.

3. *Not a Complete Solution*:
   - *Overfitting Risks*: Even with increased iterations, a very small dataset may still lead to overfitting, as the model may learn noise or specific patterns that do not generalize well.
   - *Computational Cost*: Increasing the number of folds increases the number of model fits. Each fit is cheap on a very small dataset, but the extra folds still cannot add information that the data does not contain.

4. *Other Strategies*: For very small datasets, consider additional strategies like:
   - *Data Augmentation*: Generating more data from existing samples through techniques such as transformations or synthetic data.
   - *Regularization*: Applying techniques that prevent overfitting by penalizing overly complex models.
   - *Transfer Learning*: Leveraging pre-trained models and fine-tuning them on your small dataset.

> Exercise: How does the accuracy of the 3 nearest neighbour classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbour classifier.

To analyze how the accuracy of the k-nearest neighbors (k-NN) classifier changes with the number of splits and split size, and to compare the results between 1-nearest neighbor (1-NN) and 3-nearest neighbor (3-NN), follow these steps:

### 1. *Dataset Preparation:*
   - Use a dataset appropriate for classification, preferably one with a moderate size to observe the effects more clearly.

### 2. *Perform Cross-Validation:*
   - *1-NN Classifier:*
     - Perform cross-validation with different numbers of splits (e.g., 5-fold, 10-fold) and compute the accuracy for each.
   - *3-NN Classifier:*
     - Repeat the above steps for the 3-NN classifier.
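
A minimal sketch of this step, assuming a toy two-class dataset rather than the lab's data. The helpers `knn_predict` and `cv_accuracy` are hypothetical names, and the majority vote below only handles binary 0/1 labels with an odd number of neighbours.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    # Squared Euclidean distances, shape (n_test, n_train).
    dists = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]  # indices of the k closest training points
    votes = train_y[nearest]                    # labels of those neighbours, shape (n_test, k)
    # Majority vote for 0/1 labels; odd k means no ties.
    return (votes.mean(axis=1) > 0.5).astype(int)

def cv_accuracy(X, y, n_neighbors, n_splits, rng):
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], n_neighbors)
        scores.append((pred == y[test_idx]).mean())
    return float(np.mean(scores)), float(np.std(scores))

rng = np.random.default_rng(seed=42)
# Toy data (an assumption): two overlapping Gaussian blobs with 150 points each.
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(2.0, 1.0, (150, 2))])
y = np.repeat([0, 1], 150)

results = {}
for n_neighbors in (1, 3):
    for n_splits in (5, 10):
        mean, std = cv_accuracy(X, y, n_neighbors, n_splits, rng)
        results[(n_neighbors, n_splits)] = (mean, std)
        print(f"{n_neighbors}-NN, {n_splits}-fold: mean={mean:.3f}, std across folds={std:.3f}")
```

The standard deviation across folds gives a rough sense of how much the accuracy depends on the particular split. scikit-learn's `KNeighborsClassifier` together with `cross_val_score` would be the usual way to run the same comparison on real data.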

### 3. *Analysis:*

#### *Accuracy with Number of Splits:*
   - *Effect of More Splits:*
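     - Generally, increasing the number of splits (e.g., moving from 5-fold to 10-fold cross-validation) gives each model more training data, (k-1)/k of the dataset, so the averaged accuracy is closer to what a model trained on all the data would achieve. Each validation fold is smaller, so per-fold accuracies vary more, but the final figure is averaged over more folds.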
   - *1-NN vs. 3-NN:*
     - *1-NN* tends to have high variance and can be sensitive to noise in the training data. Its performance may fluctuate more with different splits because it relies on a single nearest neighbor.
     - *3-NN* typically smooths out the decision boundary by taking a majority vote among the 3 nearest neighbors, potentially leading to more stable performance across different splits.

#### *Accuracy with Split Size:*
   - *Effect of Split Size:*
     - Smaller splits (larger number of folds) mean that each validation set is smaller, which might increase variability in the accuracy estimates. However, the model benefits from having more training data.
     - Larger splits (fewer folds) result in larger validation sets and smaller training sets, which may make the accuracy estimate more stable but could suffer from reduced training data.

   - *1-NN vs. 3-NN:*
     - *1-NN* may be more sensitive to the size of the validation set due to its dependence on the closest single point, leading to higher variability in accuracy.
     - *3-NN* tends to be less sensitive to individual data points, so its accuracy might be more stable with different split sizes.
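
One way to probe the effect of split size is to repeat random train/validation splits at several training fractions and look at the mean and spread of the accuracy. The sketch below assumes a toy two-class dataset; `knn_predict` and `repeated_split_accuracy` are hypothetical helper names, and the vote only handles binary 0/1 labels with odd k.

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k):
    dists = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote for 0/1 labels; odd k means no ties.
    return (train_y[nearest].mean(axis=1) > 0.5).astype(int)

def repeated_split_accuracy(X, y, n_neighbors, train_fraction, n_repeats, rng):
    n_train = round(train_fraction * len(X))
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(X))
        train_idx, val_idx = order[:n_train], order[n_train:]
        pred = knn_predict(X[train_idx], y[train_idx], X[val_idx], n_neighbors)
        scores.append((pred == y[val_idx]).mean())
    return float(np.mean(scores)), float(np.std(scores))

rng = np.random.default_rng(seed=42)
# Toy data (an assumption): two overlapping Gaussian blobs with 150 points each.
X = np.vstack([rng.normal(0.0, 1.0, (150, 2)), rng.normal(2.0, 1.0, (150, 2))])
y = np.repeat([0, 1], 150)

results = {}
for n_neighbors in (1, 3):
    for train_fraction in (0.5, 0.75, 0.9):
        mean, std = repeated_split_accuracy(X, y, n_neighbors, train_fraction, 20, rng)
        results[(n_neighbors, train_fraction)] = (mean, std)
        print(f"{n_neighbors}-NN, train fraction {train_fraction}: "
              f"mean={mean:.3f}, std over repeats={std:.3f}")
```

A smaller validation set (larger training fraction) would be expected to show a larger spread over repeats, which is the variability described above.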

### 4. *Comparison:*

- *1-NN*:
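  - Often has higher variance in accuracy across different splits because it overfits: every prediction depends on a single neighbor, so one noisy training point can flip it. It classifies its own training points almost perfectly (each point is its own nearest neighbor) but may not generalize well, and its measured accuracy fluctuates most with small validation sets.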

- *3-NN*:
  - Provides more stable performance across different splits as it averages out the influence of nearest neighbors. It may have better generalization and be less sensitive to the exact split of the data compared to 1-NN.
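
The difference in noise sensitivity can be seen in a tiny hand-made example. The one-dimensional data below is an assumption made for illustration: points at x = 0, 1, 2 belong to class 0 and points at x = 3, 4, 5 to class 1, except that the point at x = 1 carries a deliberately wrong label.

```python
import numpy as np

def knn_predict_1d(train_x, train_y, query, k):
    # Indices of the k training points closest to the query.
    nearest = np.argsort(np.abs(train_x - query))[:k]
    # Majority vote for 0/1 labels; odd k means no ties.
    return int(train_y[nearest].mean() > 0.5)

train_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
train_y = np.array([0, 1, 0, 1, 1, 1])  # the label at x=1.0 is noise (should be 0)

query = 1.1  # lies in the class-0 region, right next to the noisy point
print("1-NN ->", knn_predict_1d(train_x, train_y, query, k=1))  # 1-NN -> 1
print("3-NN ->", knn_predict_1d(train_x, train_y, query, k=3))  # 3-NN -> 0
```

1-NN copies the noisy label, while 3-NN is outvoted by the two correctly labelled neighbours at x = 2.0 and x = 0.0.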