# Machine Learning 1 - Nearest Neighbors and Decision Trees

## Lab objectives

* Classification with decision trees and random forests.
* Cross-validation and evaluation.

In [46]:
from lab_tools import CIFAR10, get_hog_image

dataset = CIFAR10('./CIFAR10')

Pre-loading training data
Pre-loading test data


## 1. Nearest Neighbor

The following example uses the Nearest Neighbor algorithm on the Histogram of Oriented Gradients (HoG) descriptors in the dataset.

Preprocessing Considerations:

K-Nearest Neighbors is sensitive to the scale of the features. If the features have different scales, it might be beneficial to scale the data.
Standardizing or normalizing the data (e.g., using StandardScaler or MinMaxScaler in scikit-learn) can often improve the performance of K-NN.
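To make the scale sensitivity concrete, here is a minimal sketch on made-up toy data (the five points below are hypothetical and unrelated to CIFAR-10). It looks up the Euclidean nearest neighbor of the first point before and after `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: feature 0 spans thousands, feature 1 spans [0, 1].
X = np.array([
    [1000.0, 0.10],  # query point A
    [1020.0, 0.90],  # B: close to A on feature 0, far on feature 1
    [1100.0, 0.12],  # C: further on feature 0, almost identical on feature 1
    [2000.0, 0.50],  # two extra points that widen the range of feature 0
    [0.0, 0.50],
])

def nearest_to_first(data):
    """Return the row index of the point closest to row 0 (Euclidean distance)."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

raw_neighbor = nearest_to_first(X)
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X))

print("Nearest neighbor of A on raw data:   ", raw_neighbor)     # 1 (point B)
print("Nearest neighbor of A on scaled data:", scaled_neighbor)  # 2 (point C)
```

On the raw data the large-range feature decides the neighbor almost alone. After standardization both features contribute comparably, and the neighbor changes.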

In [47]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit( dataset.train['hog'], dataset.train['labels'] )

In [49]:
from sklearn.metrics import accuracy_score, confusion_matrix
# Predict on the training data
train_preds = clf.predict(dataset.train['hog'])

# Compute descriptive performance (accuracy)
train_accuracy = accuracy_score(dataset.train['labels'], train_preds)
print("Descriptive Performance (Accuracy) on Training Data:", train_accuracy)
# Why is the accuracy on the training data exactly 1.0?
# With n_neighbors=1, every training sample is its own nearest neighbor (distance 0),
# so the classifier simply returns the label it has memorized.
# This score says nothing about generalization: predictive performance
# must be estimated on data the model has not seen.


Descriptive Performance (Accuracy) on Training Data: 1.0
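As a sanity check on how to read a training accuracy of 1.0, the sketch below fits a 1-NN classifier on purely random labels. The data is synthetic and hypothetical, so there is nothing to learn. The training accuracy is still perfect, while the held-out accuracy stays near chance:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption for illustration): labels are independent of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Each training point is its own nearest neighbor, so this is 1.0 by construction.
train_acc = clf.score(X_train, y_train)
# On unseen points the classifier can only guess.
test_acc = clf.score(X_test, y_test)

print("Training accuracy:", train_acc)
print("Held-out accuracy:", test_acc)
```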


In [51]:
from sklearn.preprocessing import StandardScaler

# Standardize the training and test data
scaler = StandardScaler()
scaled_train_data = scaler.fit_transform(dataset.train['hog'])
scaled_test_data = scaler.transform(dataset.test['hog'])
# Replace any NaN values with 0 so that K-NN can compute distances
import numpy as np
scaled_train_data = np.nan_to_num(scaled_train_data)
scaled_test_data = np.nan_to_num(scaled_test_data)

# Use scaled data to train and test K-NN
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(scaled_train_data, dataset.train['labels'])
test_preds = clf.predict(scaled_test_data)

# Evaluate accuracy
test_accuracy = accuracy_score(dataset.test['labels'], test_preds)
print("Predictive Performance (Accuracy) on Scaled Test Data:", test_accuracy)
# This is the predictive performance (test set); it is not comparable to the
# training accuracy of 1.0 obtained above.
# Why can scaling matter for K-NN?
# K-NN is a distance-based algorithm: features with a larger numeric range dominate
# the distance, so standardizing gives every feature a comparable weight.
# K-NN has no iterative training, so scaling does not affect convergence here;
# it only changes which samples count as neighbors.
# Whether scaling actually helps on this dataset should be checked by running
# the same evaluation on the unscaled HoG features.


Predictive Performance (Accuracy) on Scaled Test Data: 0.704


In [52]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# Create an instance of the classifier
clf = KNeighborsClassifier()

# Create an instance of GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=5)

# Fit the GridSearchCV instance on the training data
grid_search.fit(scaled_train_data, dataset.train['labels'])

# Get the best hyperparameters
best_params = grid_search.best_params_

# Get the best score
best_score = grid_search.best_score_

print("Best hyperparameters:", best_params)
print("Best score:", best_score)

Best hyperparameters: {'n_neighbors': 5}
Best score: 0.7234
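One caveat with scaling before the grid search: the scaler has already seen every training sample, including those that later serve as validation folds. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the lab dataset), puts the scaler inside a `Pipeline` so that each fold fits its own scaler:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HoG features (hypothetical data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# The scaler is refit on the training part of every fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"knn__n_neighbors": [1, 3, 5, 7]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

With the lab data, `X` and `y` would be replaced by `dataset.train['hog']` and `dataset.train['labels']`.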


In [53]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler

# Normalize the training and test data
scaler = MinMaxScaler()
normalized_train_data = scaler.fit_transform(dataset.train['hog'])
normalized_test_data = scaler.transform(dataset.test['hog'])

# Use normalized data to train and test K-NN
knn_clf = KNeighborsClassifier()

# Define the hyperparameters grid to search through
param_grid = {
    'n_neighbors': [4, 5, 6],
    'weights': ['uniform', 'distance'],
    'algorithm': ['ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40],
    'p': [1, 2]  # For Minkowski distance
}

# Initialize GridSearchCV with the classifier, hyperparameters, and cross-validation
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, scoring='accuracy')

# Train the grid search to find the best combination of hyperparameters
grid_search.fit(normalized_train_data, dataset.train['labels'])

# Get the best estimator (classifier) from the grid search
best_knn_clf = grid_search.best_estimator_

# Now, you can use this best classifier to make predictions on the test set
knn_test_preds = best_knn_clf.predict(normalized_test_data)
knn_test_accuracy = accuracy_score(dataset.test['labels'], knn_test_preds)

# Print the best parameters and test accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("K-Nearest Neighbors Predictive Performance (Accuracy) on Normalized Test Data:", knn_test_accuracy)


In [37]:
# Standardize the training and test data
scaler = StandardScaler()
scaled_train_data = scaler.fit_transform(dataset.train['hog'])
scaled_test_data = scaler.transform(dataset.test['hog'])
# Replace any NaN values with 0 so that K-NN can compute distances
import numpy as np
scaled_train_data = np.nan_to_num(scaled_train_data)
scaled_test_data = np.nan_to_num(scaled_test_data)

# Use scaled data to train and test K-NN
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(scaled_train_data, dataset.train['labels'])
test_preds = clf.predict(scaled_test_data)

# Evaluate accuracy
test_accuracy = accuracy_score(dataset.test['labels'], test_preds)
print("Predictive Performance (Accuracy) on Scaled Test Data:", test_accuracy)

Predictive Performance (Accuracy) on Scaled Test Data: 0.728


In [None]:
# Difference between cross-validation and grid search:
# - Cross-validation estimates the performance of ONE model configuration by
#   repeatedly training on part of the training set and validating on the rest.
# - Grid search enumerates hyper-parameter combinations and needs a score for each.
# Can cross-validation be used to find the best hyper-parameters for this method?
# Yes: GridSearchCV scores every combination of the grid with cross-validation
# and keeps the combination with the best mean validation score.

* What is the **descriptive performance** of this classifier ?
* Modify the code to estimate the **predictive performance**.
* Use cross-validation to find the best hyper-parameters for this method.
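The relation between cross-validation and grid search can be sketched as follows. The data is synthetic and the candidate values of `n_neighbors` are arbitrary choices for illustration. The loop cross-validates each candidate by hand with `cross_val_score`, and `GridSearchCV` then runs the same procedure in one call:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (hypothetical, stands in for the HoG descriptors).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

candidates = [1, 5, 15]

# Cross-validation: estimate the accuracy of one configuration at a time.
manual_scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidates
}
best_k_manual = max(manual_scores, key=manual_scores.get)

# Grid search: the same loop, wrapped in a single estimator.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": candidates}, cv=5)
search.fit(X, y)

print("Manual CV scores:", manual_scores)
print("Best k (manual loop):", best_k_manual)
print("Best k (GridSearchCV):", search.best_params_["n_neighbors"])
```

Both approaches use the same unshuffled stratified 5-fold split here, so the best mean score agrees.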

## 2. Decision Trees

[Decision Trees](http://scikit-learn.org/stable/modules/tree.html#tree) classify the data by splitting the feature space according to simple, single-feature rules. Scikit-learn uses the [CART](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees_.28CART.29) algorithm for [its implementation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of the classifier. 

* **Create a simple Decision Tree classifier** using scikit-learn and train it on the HoG training set.
* Use cross-validation to find the best hyper-parameters for this method.
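To see what "simple, single-feature rules" look like in practice, the sketch below trains a shallow tree on the Iris dataset bundled with scikit-learn and prints its rules with `export_text`. Iris is used only because its feature names are readable; it is an assumption for illustration, not part of the lab.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A depth-2 tree keeps the printed rule set small.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_clf.fit(iris.data, iris.target)

# Each internal node tests a single feature against a threshold.
rules = export_text(tree_clf, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line of the form `feature <= threshold` is one split of the feature space. The HoG descriptors would give the same kind of rules with anonymous feature indices.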

In [36]:
from sklearn import tree

# Decision trees compare one feature at a time against a threshold,
# so no feature scaling is needed here
dt_clf = tree.DecisionTreeClassifier()

dt_clf.fit(dataset.train['hog'], dataset.train['labels'])

dt_test_preds = dt_clf.predict(dataset.test['hog'])
dt_test_accuracy = accuracy_score(dataset.test['labels'], dt_test_preds)

print("Decision Tree Predictive Performance (Accuracy) on Test Data:", dt_test_accuracy)



Decision Tree Predictive Performance (Accuracy) on Test Data: 0.572
Best hyperparameters: {'max_depth': 10}
Best score: 0.6001333333333333


In [38]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Initialize the DecisionTreeClassifier
dt_clf = tree.DecisionTreeClassifier()

# Define the hyperparameters grid to search through
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

# Initialize GridSearchCV with the classifier, hyperparameters, and cross-validation
grid_search = GridSearchCV(dt_clf, param_grid, cv=5, scoring='accuracy')

# Train the grid search to find the best combination of hyperparameters
grid_search.fit(dataset.train['hog'], dataset.train['labels'])

# Get the best estimator (classifier) from the grid search
best_dt_clf = grid_search.best_estimator_

# Now, you can use this best classifier to make predictions on the test set
dt_test_preds = best_dt_clf.predict(dataset.test['hog'])
dt_test_accuracy = accuracy_score(dataset.test['labels'], dt_test_preds)

# Print the best parameters and test accuracy
print("Best Hyperparameters:", grid_search.best_params_)
print("Decision Tree Predictive Performance (Accuracy) on Test Data:", dt_test_accuracy)


Best Hyperparameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 2, 'min_samples_split': 10}
Decision Tree Predictive Performance (Accuracy) on Test Data: 0.601


## 3. Random Forests

[Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) classifiers combine multiple decision trees, each trained on a "weaker" dataset (less data and/or fewer features), and average their results to reduce over-fitting.

* Use scikit-learn to **create a Random Forest classifier** on the CIFAR data. 
* Use cross-validation to find the best hyper-parameters for this method.
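The averaging step can be inspected directly. This is a minimal sketch on synthetic data, with the forest size and other settings chosen arbitrarily for illustration. It fits a small forest, averages the class probabilities of its individual trees by hand, and compares the result with the forest's own `predict_proba`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical, stands in for the HoG descriptors).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each tree sees a bootstrap sample and considers sqrt(n_features) features per split.
forest = RandomForestClassifier(n_estimators=25, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Average the per-tree class probabilities manually.
tree_probas = np.stack([t.predict_proba(X_test) for t in forest.estimators_])
manual_avg = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X_test)

print("Manual average matches forest:", np.allclose(manual_avg, forest_proba))
print("Forest test accuracy:", forest.score(X_test, y_test))
```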

In [43]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from lab_tools import CIFAR10, get_hog_image

# Load CIFAR-10 dataset
dataset = CIFAR10('./CIFAR10')

# Define the Random Forest classifier
rf_clf = RandomForestClassifier()

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(dataset.train['hog'], dataset.train['labels'])

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

# Train the Random Forest classifier with the best hyperparameters
# (GridSearchCV refits by default, so grid_search.best_estimator_ could be used instead)
best_rf_clf = RandomForestClassifier(**best_params)
best_rf_clf.fit(dataset.train['hog'], dataset.train['labels'])

# Make predictions on the test data
rf_test_preds = best_rf_clf.predict(dataset.test['hog'])

# Calculate the accuracy
rf_test_accuracy = accuracy_score(dataset.test['labels'], rf_test_preds)
print("Random Forest Predictive Performance (Accuracy) on Test Data:", rf_test_accuracy)


Pre-loading training data
Pre-loading test data
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}
Best Score: 0.7719333333333334
Random Forest Predictive Performance (Accuracy) on Test Data: 0.7806666666666666
