<a href="https://colab.research.google.com/github/Vijayanirmala1234567890/FMML-LAB-ASSIGNMENT/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training a classifier, and testing it.

In [13]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income of the block and the median age of the houses per district. The task is to predict the median house value per district.

Let us download and examine the dataset.

In [None]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# np.int was deprecated in NumPy 1.20 and later removed; the built-in int does the same job
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label using the 1-nearest-neighbour rule, followed by a wrapper that applies it to every test sample.

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
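To see what the nearest-neighbour rule does, here is a minimal sketch on made-up data. The function name `nn1_predict` and the toy arrays are hypothetical and only for illustration; the logic mirrors `NN1` above (squared Euclidean distance, then the label of the closest training point).

```python
import numpy as np

def nn1_predict(traindata, trainlabel, query):
    # Broadcasting subtracts the query (shape (d,)) from every training row (shape (n, d))
    diff = traindata - query
    dist = (diff * diff).sum(axis=1)  # squared Euclidean distance to each training point
    return trainlabel[np.argmin(dist)]  # label of the closest training point

# Hypothetical toy data: four 2-D points belonging to two classes
toy_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
toy_label = np.array([0, 0, 1, 1])

print(nn1_predict(toy_train, toy_label, np.array([0.2, 0.1])))  # closest point is [0, 0], so this prints 0
print(nn1_predict(toy_train, toy_label, np.array([5.4, 4.9])))  # closest point is [5, 5], so this prints 1
```

The square root is skipped because it does not change which point is closest.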

We will also define a 'random classifier', which randomly assigns a label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; it is kept so that the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
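As a quick worked example of the accuracy computation, the sketch below uses two short made-up label arrays (`gt` and `pred` are hypothetical). It shows the same element-wise comparison that `Accuracy` relies on.

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])    # made-up ground-truth labels
pred = np.array([0, 1, 1, 2, 0])  # made-up predictions

matches = (gt == pred)             # element-wise comparison gives a boolean array
accuracy = matches.sum() / len(gt) # True counts as 1, so the sum is the number of correct predictions
print(matches)
print(accuracy)  # 3 correct out of 5, so this prints 0.6
```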

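Let us make a function that splits the dataset at random. Each sample goes to the first split with the desired probability, which is passed as a fraction between 0 and 1, so the split sizes are only approximately the requested proportion.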

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
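Because each sample is assigned independently, the sizes of the two splits vary a little from call to call. The sketch below illustrates this on made-up data; `split_demo` and `rng_demo` are hypothetical names, and the function only mirrors the idea of `split` above rather than replacing it.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator so the lab's rng is untouched

def split_demo(data, label, fraction):
    # Draw one uniform number per sample; samples below the threshold go to the first split
    mask = rng_demo.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]

# Made-up data: 1000 samples with 2 features and 3 classes
data = np.arange(2000).reshape(1000, 2)
label = np.arange(1000) % 3

sizes = []
for _ in range(3):
    d1, l1, d2, l2 = split_demo(data, label, 0.2)
    sizes.append((len(l1), len(l2)))
    print(len(l1), len(l2), len(l1) / len(label))
```

If an exact split size is needed, shuffling the indices with `rng.permutation` and slicing them is a common alternative.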

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


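For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (this can fail only when identical feature vectors carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.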
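The 1/(number of classes) figure for the random classifier can be checked with a small simulation. The sketch below is an illustration under assumptions: the class proportions are made up and deliberately imbalanced, and `rng_demo` is a hypothetical separate generator. It suggests that uniform random guessing lands near 1/6 for six classes regardless of how the true labels are distributed.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
n_classes = 6

# Made-up, deliberately imbalanced true labels
true = rng_demo.choice(n_classes, size=100_000, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# Uniform random guesses, as in RandomClassifier
pred = rng_demo.integers(low=0, high=n_classes, size=len(true))

acc = (true == pred).mean()
print(acc, 1 / n_classes)  # the two numbers should be close
```

Each guess matches the true label with probability 1/6 whatever that label is, which is why the class imbalance does not matter here.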

Let us predict the labels for our validation set and get the accuracy.

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour (about 0.34) is considerably lower than its train accuracy (1.0), while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is roughly twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
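One way to run such an experiment is sketched below. It is only a sketch under stated assumptions: it uses a small synthetic dataset (three noisy blobs) instead of the housing data so that it runs quickly, and the names `nn_predict`, `random_split`, `rng_demo`, and `results` are hypothetical. For each train fraction it repeats the split 20 times and records the mean and standard deviation of the validation accuracy.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # Squared Euclidean distance between every test row and every training row
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]  # label of the closest training row

def random_split(data, label, fraction):
    # Each sample goes to the first split with probability `fraction`
    mask = rng_demo.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]

# Synthetic stand-in data: 600 points, 2 features, 3 classes drawn around 3 centres
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
label = rng_demo.integers(0, 3, size=600)
data = centers[label] + rng_demo.normal(scale=1.0, size=(600, 2))

train_fractions = [0.1, 0.3, 0.5, 0.7, 0.9]
results = {}
for frac in train_fractions:
    accs = []
    for _ in range(20):
        trd, trl, vad, val = random_split(data, label, frac)
        accs.append((nn_predict(trd, trl, vad) == val).mean())
    results[frac] = (np.mean(accs), np.std(accs))
    print(f"train fraction {frac:.1f}: mean {results[frac][0]:.3f}, std {results[frac][1]:.3f}")
```

The means and standard deviations in `results` can be passed to `plt.plot` or `plt.errorbar` to draw the graph. To study the lab's data instead, swap in `alltraindata`, `alltrainlabel`, and the lab's own `NN` and `split`, keeping in mind that `NN` is slow on the full dataset.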

1. The share of the data given to the validation set affects both the validation accuracy itself and how reliable that number is. Here's how it typically behaves when you increase or reduce the percentage of the validation set:

Increasing the Percentage of the Validation Set:

Effect: When you allocate a larger percentage of the data to the validation set, you reduce the size of the training set.

Impact on Accuracy: With fewer training samples, nearest neighbour has fewer close neighbours to draw on, so its validation accuracy tends to drop. The estimate itself becomes less noisy, because it is computed over more validation samples.

Reducing the Percentage of the Validation Set:

Effect: More data is available for training, which can help improve the model's ability to learn from the data.

Impact on Accuracy: The accuracy of nearest neighbour tends to rise, but the accuracy measured on a small validation set is less reliable, as it is more influenced by randomness and by the specific data points included in the validation set. It may not provide an accurate estimate of model generalization.

Random Classifier: Its accuracy stays close to 1/(number of classes) whatever the split, because it ignores the training data. Only the run-to-run fluctuation changes, and it grows as the validation set shrinks.

The choice of the percentage of data allocated to the validation set is a trade-off between having enough data to estimate model performance reliably and having enough data to train a good model. The optimal percentage can vary depending on factors like the size of your dataset, the complexity of your model, and the nature of your problem.

To mitigate the impact of this choice and provide a more robust estimate of model performance, you can use techniques like k-fold cross-validation. Cross-validation involves repeatedly splitting the dataset into different subsets for training and validation, and averaging the results over multiple iterations. This can help in achieving a better balance between training and validation while reducing the impact of randomness in the validation set.






2. The sizes of the training and validation sets can indeed affect how well you can predict the accuracy on the test set using the validation set. Here's how these factors interplay:

Training Set Size: A larger training set generally allows your model to learn more about the underlying patterns in the data. This can lead to a model that generalizes better to the test set, as it has seen more diverse examples during training.

Validation Set Size: A larger validation set can provide a more accurate estimate of your model's performance on unseen data. It helps in reducing the variability in performance estimates that can occur when you have a small validation set. A larger validation set is especially valuable when you want a reliable indicator of how well your model will perform on the test set.

Predicting Test Set Accuracy: When the training set and validation set are representative of the overall dataset, and they are both reasonably sized, the accuracy on the validation set can serve as a good predictor of the accuracy on the test set. This is because the validation set, if large enough and representative, provides a reliable estimate of how well the model generalizes to unseen data.

Overfitting and Underfitting: If the training set is too small, the model may underfit, meaning it fails to capture important patterns in the data. This can lead to poor performance on both the validation and test sets. If the validation set is too small, there's a risk of overfitting to the validation set. The model may perform well on the validation set but fail to generalize to the test set. This can result in an optimistic estimate of test set performance.

Balancing Training and Validation Sizes: The size of the training and validation sets is often a trade-off. If you allocate too much data to the validation set, you have less data for training, which can hinder model learning. If you allocate too little data to the validation set, the performance estimate may be unreliable.

In summary, to predict test set accuracy effectively using the validation set, it's important to strike a balance between the sizes of the training and validation sets. Both should be large enough and representative of the overall dataset to provide reliable estimates of model performance. Additionally, the choice of validation set size should consider the trade-offs between model learning and performance estimation. Techniques like k-fold cross-validation can also be used to mitigate the impact of validation set size and provide more robust estimates.








3. There's no one-size-fits-all answer to what percentage of the dataset should be reserved for the validation set, as the ideal split depends on various factors, including the size of your dataset, the complexity of your model, and the nature of your problem. However, a common practice is to split your data into training and validation sets using a 70-30 or 80-20 ratio. Here are some considerations:

Dataset Size: If you have a large dataset, you can afford to allocate a smaller percentage (e.g., 10-20%) to the validation set without significantly reducing the training data. Conversely, with a small dataset, you might want to allocate a larger percentage (e.g., 30-40%) to the validation set to ensure you have enough data for reliable validation.

Complexity of Model: More complex models tend to benefit from larger validation sets because they have a higher risk of overfitting. If you're using a simple model, you might be able to get away with a smaller validation set.

Nature of Problem: The difficulty of your machine learning problem can also influence the validation set size. For complex problems with intricate patterns, you might need a larger validation set to ensure a reliable estimate of performance.

Cross-Validation: If you're concerned about the variability of your performance estimate due to the validation set size, consider using k-fold cross-validation. This technique divides your data into k subsets and performs k rounds of training and validation, rotating through different subsets as the validation set in each round. It can provide a more robust estimate of model performance and mitigate the impact of the validation set size.

In practice, it's often a good starting point to reserve around 20-30% of your data for validation. However, you should adjust this percentage based on the specific characteristics of your dataset and problem. Experimentation and validation with different split ratios and cross-validation techniques can help you find the optimal balance between training and validation set sizes for your particular task.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [14]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [15]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ', Accuracy(testlabel, testpred))

Average validation accuracy is  0.34390158143724614
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
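For comparison with the repeated random splits used above, here is a minimal sketch of k-fold cross-validation written with NumPy only. It is an illustration under assumptions rather than the method used in this lab: the data is synthetic, and `kfold_accuracy`, `nn_predict`, and `rng_demo` are hypothetical names. The samples are shuffled once and cut into k folds, and each fold serves as the validation set exactly once.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour with squared Euclidean distance
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def kfold_accuracy(data, label, k):
    idx = rng_demo.permutation(len(label))  # shuffle the sample indices once
    folds = np.array_split(idx, k)          # k nearly equal, non-overlapping folds
    accs = []
    for i in range(k):
        val_idx = folds[i]                  # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = nn_predict(data[train_idx], label[train_idx], data[val_idx])
        accs.append((pred == label[val_idx]).mean())
    return np.array(accs)

# Synthetic stand-in data: 300 points, 2 features, 3 classes
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
label = rng_demo.integers(0, 3, size=300)
data = centers[label] + rng_demo.normal(scale=1.0, size=(300, 2))

fold_accs = kfold_accuracy(data, label, k=5)
print(fold_accs, fold_accs.mean())
```

Unlike repeated random splits, every sample is used for validation exactly once, so no sample is over- or under-represented in the average.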

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Yes, averaging the validation accuracy across multiple splits, such as using k-fold cross-validation, can indeed provide more consistent and reliable results when assessing your model's performance. This is because it helps mitigate the impact of randomness and variability in the data splits. Here's how it works:

Variability Reduction: In a single train-validation split, the specific data points chosen for the validation set can influence the validation accuracy. By repeating this process with different random splits of the data into training and validation sets (as in k-fold cross-validation), you reduce the impact of any particular random split.

Better Estimation: Averaging the results from multiple splits provides a more stable estimate of your model's performance. It gives you a more reliable indication of how well your model is likely to perform on unseen data.

Effective Use of Data: Cross-validation allows you to make better use of your data. Instead of designating a fixed percentage of your data as a validation set (which can be problematic for small datasets), you cycle through different subsets as validation, ensuring that all data points contribute to both training and validation.

Robustness to Outliers: If your dataset contains outliers or particularly challenging samples, cross-validation helps ensure that these are not overly influential in the assessment of your model's performance. Each fold is likely to contain a mix of different data points.

Model Selection and Hyperparameter Tuning: Cross-validation is especially valuable when you are comparing multiple models or tuning hyperparameters. It helps you make more informed decisions about which model or set of hyperparameters is likely to perform better on unseen data.

Consistent Performance Estimates: Averaging the results over multiple splits reduces the risk of getting overly optimistic or pessimistic performance estimates based on the particular random split of the data. It provides a more representative and stable assessment.

In summary, cross-validation, particularly k-fold cross-validation, provides a more consistent and robust estimate of your model's performance. While it doesn't change the accuracy itself, it helps ensure that the accuracy estimate is more reliable and less influenced by the randomness in data splits, making it a valuable technique for assessing and comparing models.
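The consistency claim can be checked empirically. The sketch below is an illustration under assumptions: it uses synthetic data, and `average_accuracy`, `nn_predict`, `rng_demo`, and `spread` are hypothetical names that only mirror the idea of `AverageAccuracy`. It computes the estimate 30 times for several iteration counts and compares how much the estimates scatter.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour with squared Euclidean distance
    d = ((testdata[:, None, :] - traindata[None, :, :]) ** 2).sum(axis=2)
    return trainlabel[d.argmin(axis=1)]

def average_accuracy(data, label, fraction, iterations):
    # Average the validation accuracy over several random train/val splits
    total = 0.0
    for _ in range(iterations):
        mask = rng_demo.random(len(label)) < fraction
        pred = nn_predict(data[mask], label[mask], data[~mask])
        total += (pred == label[~mask]).mean()
    return total / iterations

# Synthetic stand-in data: 400 points, 2 features, 3 classes
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])
label = rng_demo.integers(0, 3, size=400)
data = centers[label] + rng_demo.normal(scale=1.0, size=(400, 2))

spread = {}
for iterations in (1, 5, 20):
    estimates = [average_accuracy(data, label, 0.75, iterations) for _ in range(30)]
    spread[iterations] = np.std(estimates)  # scatter of the estimate across repeated runs
    print(f"{iterations:2d} iterations: std of the estimate = {spread[iterations]:.4f}")
```

The scatter should shrink as the number of iterations grows. The value that the average settles on is unchanged, so averaging makes the estimate more consistent without correcting any systematic gap to the test accuracy.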

2. Cross-validation, such as k-fold cross-validation, does not necessarily provide a more accurate estimate of test accuracy in the sense of being closer to the true population accuracy (which is typically unknown). However, it does offer several advantages that make it a more trustworthy and informative estimate of a model's generalization performance:

Reliability: Cross-validation provides a more reliable estimate of model performance. By averaging the results over multiple data splits, it reduces the impact of randomness and variability associated with a single train-validation split. This means the estimate is less likely to be overly optimistic or pessimistic due to a particular random split.

Robustness: It helps in finding a performance estimate that is robust to variations in the data. If your dataset contains outliers or particularly challenging samples, cross-validation ensures that the performance estimate accounts for these variations.

Effective Use of Data: Cross-validation makes better use of your data by repeatedly cycling through different subsets for training and validation. This is especially valuable when you have limited data.

Model Selection and Hyperparameter Tuning: Cross-validation is particularly useful when comparing multiple models or tuning hyperparameters. It helps you make more informed decisions about which model or set of hyperparameters is likely to perform better on unseen data.

Early Detection of Overfitting: It can help in detecting overfitting. If the training accuracy is much higher than the accuracy averaged over the validation folds, as with nearest neighbour in this lab (1.0 on the train set versus about 0.34 on validation), it's a sign that the model fits the training data far better than it generalizes.

Consistent Performance Estimates: By providing a more stable and consistent estimate of model performance, cross-validation aids in obtaining a performance measure that is less sensitive to the particular data split. This is important for obtaining a fair assessment of model generalization.

In summary, while cross-validation does not directly provide a more accurate estimate of test accuracy in an absolute sense (as the true population accuracy is usually unknown), it offers several benefits that make its estimate more reliable, robust, and informative for assessing how well your model is likely to perform on unseen data. This makes it a valuable technique for model evaluation and selection.



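3. In this lab, the number of iterations is the number of random train/validation splits that AverageAccuracy averages over. Each split gives a slightly different validation accuracy, so averaging over more of them makes the estimate less sensitive to any single split. The returns diminish: each additional split adds less stability, while the running time grows in proportion to the number of splits. More iterations also do not remove a systematic gap between validation and test accuracy, such as the one caused by training on only part of the data. The word 'iterations' is also commonly used for the number of times a learning algorithm iterates over the training dataset, as in gradient descent or stochastic gradient descent. Nearest neighbour has no such training loop, but for models that do, the effect of the number of training iterations on model performance depends on several factors: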

Early Iterations Improve Fit: In the early stages of training, increasing the number of iterations generally helps the model fit the training data better. This can result in improved performance on both the training and validation sets. The model becomes better at capturing patterns in the data.

Diminishing Returns: However, as you continue to increase the number of iterations, the improvements in model performance tend to diminish. The model may start overfitting the training data, meaning it learns the noise or specific examples in the data rather than general patterns. This can lead to a decrease in performance on the validation set.

Finding the Optimal Point: There is usually an optimal number of iterations where the model achieves the best trade-off between underfitting and overfitting. This optimal point depends on factors like the complexity of the model, the size of the dataset, and the learning rate. The goal is to find the right balance that allows the model to generalize well to new, unseen data.

Early Stopping: To prevent overfitting, practitioners often employ early stopping, which involves monitoring the validation performance during training and stopping the training process when the validation performance starts to degrade. This technique helps identify the point at which the model has learned meaningful patterns from the data without overfitting.

Computational Resources: The number of iterations can also be influenced by computational resources. Training a model with a very large number of iterations may require significant time and computational power, and there might be diminishing returns beyond a certain point.

In summary, the effect of the number of iterations on the estimate of model performance is not linear. Increasing iterations initially tends to improve the model's performance on the training and validation sets, but there's a point of diminishing returns where the model starts to overfit. The optimal number of iterations depends on various factors and is often determined through experimentation, monitoring validation performance, and using techniques like early stopping to prevent overfitting.







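4. Only partly. Averaging over more splits reduces the randomness that comes from the choice of split, and nothing else. With a very small validation set, each individual estimate is very noisy, and more iterations do help to average that noise out. With a very small training set, every split trains the classifier on too little data, so the average settles on an accuracy lower than what the full training set would give, and no number of iterations fixes that. The same limits apply to the number of training iterations in iteratively trained models: more of them can help to some extent with a small dataset, but they are not a complete solution. Here are the key considerations: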

Benefits of More Iterations: With more iterations, the model has more opportunities to learn from the limited data available. It can improve its ability to capture patterns and potentially perform better on both the training and validation datasets.

Risk of Overfitting: Increasing iterations may help the model fit the training data better, but it also increases the risk of overfitting. With a very small training dataset, the model might start memorizing the training examples rather than generalizing from them. This can result in poor performance on the validation set and reduced generalization to new data.

Limited Data Information: Small datasets inherently contain limited information about the underlying patterns in the data. No amount of additional iterations can create information that is not present in the dataset.

Validation Set Size: When dealing with a very small validation dataset, there's a risk of overfitting to the validation set itself. Increasing iterations can exacerbate this risk, as the model has more chances to adapt specifically to the validation set rather than generalize to unseen data.

Alternative Strategies: Rather than relying solely on increasing iterations, it's often more effective to consider other strategies, such as data augmentation (for image data), regularization techniques (e.g., dropout, L1/L2 regularization), or using transfer learning (especially for deep learning) to leverage knowledge from larger datasets.

Cross-Validation: Cross-validation, particularly with k-fold validation, can be useful in mitigating the impact of a small training dataset. It divides the data into multiple folds, rotating through different subsets as the validation set in each iteration. This helps in obtaining a more robust estimate of model performance and reducing the influence of the small dataset size.

In summary, while increasing the number of iterations can aid in learning from small datasets to some extent, it's not a comprehensive solution, and it comes with the risk of overfitting. Dealing with very small training or validation datasets often requires a combination of strategies, including regularization, data augmentation, transfer learning, and robust validation techniques like cross-validation, to ensure a more reliable and effective model.