# Assignment 1: Classification

Please include any imports (allowed by Ed) you require throughout your notebook in the first cell.

In [2]:
# Import all libraries
# to make this notebook's output stable across runs
import numpy as np
np.random.seed(0)

## Data loading

The dataset for this assignment is the Pima Indian Diabetes dataset. It contains 768 instances described by 8 numeric attributes. There are two classes, class1 and class2, corresponding to whether the individual has diabetes or not. Each entry in the dataset corresponds to a patient's record; the attributes are personal characteristics and test measurements; the class indicates whether the person shows signs of diabetes. The patients are of Pima Indian heritage, hence the name of the dataset.
A copy of the dataset is provided with this scaffold in this directory, named **pima.csv**. This file includes the attribute (feature) headings, and each row corresponds to one individual. Missing attributes in the dataset are recorded with a '?'. Your task is to predict the class (class1 or class2).
 
You will need to pre-process the dataset before you can apply the classification algorithms. **Load the pima.csv** dataset and set the X and y variables to the data and class respectively.

You will need to load this file into numpy arrays for the attribute data and the labels. So that we can test your code more effectively, please complete this task inside the given function scaffold, and have your function return these arrays (X, y).

While there are multiple ways to load the file correctly, a suggested function to use is [`pd.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). Look through the documentation to check which arguments you will need to pass to the function to load the file correctly. If you choose to use this approach, you will need to extract the appropriate numpy arrays from the pandas dataframe, and exclude any headers.

The X array returned by your function should have shape **(number of examples, number of attributes)**, and the y array returned by your function should have shape **(number of examples,)**. We will also test your function with some different datasets with the same data types, delimiters, and encoding of missing values. However, these files may have a different filename, number of examples and/or attributes, so you should not hard code these values in your solution. There will not be any missing class values, and the class values will always be in the final column.
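As a rough illustration of the loading step (not the required solution), the sketch below parses a tiny in-memory CSV. The column names and values are made up for the example; only `pd.read_csv` and its `na_values` argument come from the description above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-column file: two attributes and the class in the final column.
csv_text = "a,b,class\n1,?,class1\n3,4,class2\n"

# na_values="?" makes pandas read the '?' marker as a missing value (NaN).
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Everything except the last column is attribute data; the last column is the label.
X_demo = df.iloc[:, :-1].to_numpy()
y_demo = df.iloc[:, -1].to_numpy()

print(X_demo.shape, y_demo.shape)
```

Slicing by position rather than by column name keeps the sketch independent of the number of attributes, which matters because the text says other files may have a different shape.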


In [3]:
### TEST FUNCTION: test_data_loading
# DO NOT REMOVE THE LINE ABOVE
import pandas as pd
def load_data(filename):
    """Load the dataset located at the filename string as described above."""
    # Read '?' entries as missing values (np.nan)
    dataset = pd.read_csv(filename, na_values='?')
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values
    return X, y
filename = 'pima.csv'
X, y = load_data(filename)
# X, y = None,None

In [4]:
### SKIP
# This cell won't be marked. Use it to try out your code.
filename = 'pima.csv'
X, y = load_data(filename)
#y

## Data pre-processing 
Three types of pre-processing are required: filling in the missing values, normalisation, and changing the class values. After this is done, you need to print the first 10 rows of the pre-processed dataset.

1. Filling in the missing attribute values - The missing attribute values should be replaced with the mean value of the column using [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html).
2. Normalising the data - Normalisation of each attribute should be performed using a min-max scaler to normalise the values between [0,1] with [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).
3. Changing the class values - The classes class1 and class2 should be changed to 0 and 1 respectively.
4. Printing - Print the first 10 rows of the pre-processed dataset. The feature values should be formatted to 4 decimal places using .4f; the class value is an integer.

For example, if your normalised data looks like this:
![alt text](normalised_data.png)


The data should be printed as a csv in this format:

0.1343,0.4333,0.5432,0.8589,0.3737,0.9485,0.4834,0.9456,0.4329,0

0.1345,0.4432,0.4567,0.4323,0.1111,0.3456,0.3213,0.8985,0.3456,1

0.4948,0.4798,0.2543,0.1876,0.9846,0.3345,0.4567,0.4983,0.2845,0
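The three pre-processing steps and the print format can be sketched on a toy array. The values below are invented for illustration and are not taken from pima.csv.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data (assumed): one missing value in the first column.
X_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 40.0]])
y_demo = np.array(["class1", "class2", "class1"])

# 1. Replace the missing value with the column mean: mean(1, 3) = 2.
X_filled = SimpleImputer(strategy="mean").fit_transform(X_demo)

# 2. Rescale each column to [0, 1] via (x - min) / (max - min).
X_scaled = MinMaxScaler().fit_transform(X_filled)

# 3. Map class1 -> 0 and everything else (class2) -> 1.
y_int = np.where(y_demo == "class1", 0, 1)

# 4. Format each row as CSV: features to 4 decimal places, class as an integer.
rows = [",".join(f"{v:.4f}" for v in x) + f",{label}" for x, label in zip(X_scaled, y_int)]
for row in rows:
    print(row)
```

The second row shows both effects at once: the imputed 2 scales to 0.5000, and 20 in a column ranging from 10 to 40 scales to 0.3333.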



In [5]:
### TEST FUNCTION: test_preprocessing
# DO NOT REMOVE THE LINE ABOVE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
def process_data(X,y):
    """Fill missing (np.nan) values, scale the data with MinMaxScaler and encode the class labels as described above."""
    
    # Replace missing values with mean
    imputer = SimpleImputer(strategy='mean')
    features_filled = imputer.fit_transform(X)
    
    # Normalize feature values to [0,1] range
    scaler = MinMaxScaler()
    X_norm = scaler.fit_transform(features_filled)
    
    # Encode class labels to integers
    y_encoded = np.where(y == 'class1', 0, 1)
    
    return X_norm, y_encoded
# X_norm,y_encoded = None,None


filename = 'pima.csv'
X, y = load_data(filename)
X_norm, y_encoded = process_data(X, y)

# print first 10 samples
for i in range(10):
    feature_lst = []
    # loop through each feature value
    for value in X_norm[i]:
        feature_lst.append("{:.4f}".format(value))
    feature_print = ",".join(feature_lst)
    print(f"{feature_print},{y_encoded[i]}")

0.3529,0.7437,0.5662,0.3535,0.0000,0.5007,0.2344,0.4833,0
0.0588,0.4271,0.5410,0.2929,0.0000,0.3964,0.1166,0.1667,1
0.4706,0.9196,0.5246,0.0000,0.0000,0.4768,0.2536,0.1833,0
0.0588,0.4472,0.5410,0.2323,0.1111,0.4188,0.0380,0.0000,1
0.0000,0.6078,0.3279,0.3535,0.1986,0.6423,0.9436,0.2000,0
0.2941,0.5829,0.6066,0.0000,0.0000,0.3815,0.1686,0.1500,1
0.1765,0.6078,0.4098,0.3232,0.1040,0.4620,0.0726,0.0833,0
0.5882,0.5779,0.0000,0.0000,0.0000,0.5261,0.0239,0.1333,1
0.1176,0.9899,0.5738,0.2065,0.6418,0.4545,0.0342,0.2029,0
0.4706,0.6281,0.7869,0.0000,0.0000,0.0000,0.0658,0.2026,0


In [6]:
### SKIP
# This cell won't be marked. Use it to try out your code.
# X_norm
#y_encoded

## Defining functions for the classification algorithms

### Cross-validation without parameter tuning
You will now apply multiple classifiers to the pre-processed dataset, in particular: Nearest Neighbor, Logistic Regression, Naïve Bayes, Decision Tree, Bagging, Ada Boost and Gradient Boosting. All classifiers should use the sklearn modules from the tutorials. All random states in the classifiers should be set to **random_state=0**. 

For the following tasks, you are required to implement functions which create algorithms and evaluate them with 10 fold cross validation. 

Use the function definitions below, so that any appropriate hyperparameters can optionally be passed in and accessed as a dictionary. **Note:** you can pass arguments as a dictionary to functions (such as sklearn constructors) using the ** syntax. 

e.g. hyperparams = {"param1":p1,"param2":p2}

exampleClassifier(X,y,**hyperparams)
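To make the `**` syntax concrete, the short sketch below shows that unpacking a dictionary is equivalent to writing the keyword arguments out by hand. `KNeighborsClassifier` is used only as a convenient example constructor.

```python
from sklearn.neighbors import KNeighborsClassifier

hyperparams = {"n_neighbors": 7, "metric": "manhattan"}

# Unpacking the dictionary with ** ...
knn_from_dict = KNeighborsClassifier(**hyperparams)
# ... passes exactly the same keyword arguments as spelling them out.
knn_explicit = KNeighborsClassifier(n_neighbors=7, metric="manhattan")

same_settings = knn_from_dict.get_params() == knn_explicit.get_params()
print(same_settings)
```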

In order to make this reproducible, it is important that the folds are kept consistent across runs. You can utilise [`StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).

You will need to pass cvKFold (the stratified folds) with random_state=0 as an argument when calculating the cross-validation accuracy, not cv=10 as in the tutorials.
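The sketch below illustrates why a seeded `StratifiedKFold` gives consistent folds. It uses made-up labels (60 of one class, 40 of the other) rather than the assignment data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Dummy data (assumed): 100 samples with a 60/40 class balance.
X_demo = np.zeros((100, 1))
y_demo = np.array([0] * 60 + [1] * 40)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# With an integer random_state, every call to split() yields the same folds.
first_pass = [test_idx.tolist() for _, test_idx in folds.split(X_demo, y_demo)]
second_pass = [test_idx.tolist() for _, test_idx in folds.split(X_demo, y_demo)]
folds_are_stable = first_pass == second_pass

# Stratification keeps the class balance in every test fold.
positives_per_fold = [int(y_demo[idx].sum()) for idx in first_pass]
print(folds_are_stable, positives_per_fold)
```

Passing this object as `cv=` (instead of `cv=10`) is what ties the cross-validation accuracy to `random_state=0`.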


In [7]:
### TEST FUNCTION: test_cvkfold 
# cvKFold=None
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
cvKFold = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

In [39]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**K-Nearest Neighbors**

We have seen how to implement a KNN classifier in the lab. Your task is to implement KNN for classification using [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).


Fill in the function to perform K-nearest neighbors. Test the function with K=7 and Manhattan distance.

The format of your output should be:

Mean cross-validation score: x.xx

In [41]:
### TEST FUNCTION: test_k_nearest_neighbors
# DO NOT REMOVE THE LINE ABOVE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
#K-Nearest Neighbors
def knnClassifier(X, y, **hyperparams):
    """Fill this function to run the KNN classifier as described above"""
    # initialize KNN classifier
    # print("Received hyperparameters:", hyperparams)
    knn = KNeighborsClassifier(**hyperparams)
    
    # use cross validation to evaluate the classifier
    cv_scores = cross_val_score(knn, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return knn, mean_cv_score
# knn, score = None,None

# Test hyperparams with given testcase
hyperparams = {'n_neighbors': 7, 'metric': 'manhattan'}
knn, score = knnClassifier(X_norm, y_encoded, **hyperparams)

Mean cross-validation score: 0.74


In [42]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Naive Bayes**

Fill in the function to use the Gaussian Naive Bayes function on all attributes. Use the sklearn implementation in [`GaussianNB`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html). 

The format of your output should be:

Mean cross-validation score: x.xx

In [43]:
### TEST FUNCTION: test_naive_bayes
# DO NOT REMOVE THE LINE ABOVE
from sklearn.naive_bayes import GaussianNB
#Naïve Bayes
def nbClassifier(X, y, **hyperparams): 
    """Fill this function to run the Naive Bayes classifier as described above"""
    # initialize Naive Bayes classifier, forwarding any optional hyperparameters
    nb = GaussianNB(**hyperparams)
    # use cross validation to evaluate the classifier
    cv_scores = cross_val_score(nb, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return nb, mean_cv_score

# the function returns (model, score), so unpack both values
nb, score = nbClassifier(X_norm, y_encoded)

Mean cross-validation score: 0.75


In [44]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Decision Tree** 

As shown in the tutorials, decision trees can often perform well in classification tasks. Fill in the function to perform classification with a decision tree. Test the function with the log loss criterion, a max depth of 3 and sqrt max features. Read through [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

The format of your output should be:

Mean cross-validation score: x.xx

In [45]:
### TEST FUNCTION: test_decision_tree_classifier
# DO NOT REMOVE THE LINE ABOVE
from sklearn.tree import DecisionTreeClassifier
#Decision Trees
def dtClassifier(X, y, **hyperparams):   
    """Fill this function to run the Decision Tree classifier as described above"""
    # Default to random_state=0; values passed by the caller take precedence
    fixed_hyperparams = {'random_state': 0}
    fixed_hyperparams.update(hyperparams)
    dt = DecisionTreeClassifier(**fixed_hyperparams)
    # Same stratified folds as the module-level cvKFold
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_scores = cross_val_score(dt, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    # cross_val_score fits clones, so fit the returned model on the full data
    dt.fit(X, y)
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return dt, mean_cv_score

# 'entropy' and 'log_loss' both select the Shannon information gain criterion
hyperparams = {
    'criterion': 'entropy',
    'max_depth': 3,
    'max_features': 'sqrt',
    'random_state': 0
}

dt, score = dtClassifier(X_norm, y_encoded, **hyperparams)

Mean cross-validation score: 0.72


In [46]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Support Vector Machine**

Fill in the function to perform a linear support vector machine classifier using [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC). Test the function with lasso regularization with C = 0.05 and set dual to "auto".

The format of your output should be:

Mean cross-validation score: x.xx

In [47]:
### TEST FUNCTION: test_svm
# DO NOT REMOVE THE LINE ABOVE
from sklearn.svm import LinearSVC
#Support Vector Machine
def svmClassifier(X, y, **hyperparams):
    """Fill this function to run the SVM classifier as described above"""
    # Initialize SVM classifier
    svm = LinearSVC(**hyperparams)
    # use cross validation to evaluate the classifier
    cv_scores = cross_val_score(svm, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return svm, mean_cv_score

hyperparams = {
    'C': 0.05,
    # the l1 (lasso) penalty is only supported with dual=False
    'dual': False,
    'random_state': 0,
    'penalty':'l1'
}
svm, score = svmClassifier(X_norm, y_encoded, **hyperparams)

# svm, score = None,None

Mean cross-validation score: 0.76


In [48]:
### SKIP
# This cell won't be marked. Use it to try out your code.

## Ensemble Methods

Ensembles are powerful tools in machine learning that seek to improve predictive performance by combining predictions from multiple models.

**Bagging with logistic regression**

Fill in the function to perform bagging using [`BaggingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html). 

Test the bagging with 20 estimators and a maximum of half of the samples. The logistic regression should be set with a C of 2 and ridge regularisation.

*Hint:* The hyperparams dict should be split to only pass the relevant hyperparameters to bagging and the logistic regression.
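One possible way to split the dictionary is sketched below. The helper name `split_hyperparams` and its routing rule are assumptions made for this illustration: a key goes to the logistic regression only if the bagging wrapper does not also accept it.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression


def split_hyperparams(hyperparams):
    """Hypothetical helper: route each key to the logistic regression or to bagging."""
    lr_names = set(LogisticRegression().get_params())
    bagging_names = set(BaggingClassifier().get_params())
    # Names accepted by both (e.g. random_state) are treated as bagging settings here.
    lr_only = lr_names - bagging_names
    lr_params = {k: v for k, v in hyperparams.items() if k in lr_only}
    bagging_params = {k: v for k, v in hyperparams.items() if k not in lr_only}
    return lr_params, bagging_params


lr_params, bagging_params = split_hyperparams(
    {"penalty": "l2", "C": 2, "n_estimators": 20, "max_samples": 0.5}
)
print(lr_params, bagging_params)
```

Listing the expected keys explicitly is an equally valid design and is easier to read; looking the names up with `get_params()` just avoids hard-coding them.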

The format of your output should be:

Mean cross-validation score: x.xx

In [49]:
### TEST FUNCTION: test_bagging_lr
# DO NOT REMOVE THE LINE ABOVE
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
#Bagging
def baggingClassifier(X, y,**hyperparams):
    
    """Fill this function to run the Logistic regression classifier with Bagging as described above"""
    lr_hyperparams = {key: hyperparams[key] for key in ['C', 'penalty'] if key in hyperparams}
    bagging_hyperparams = {key: hyperparams[key] for key in ['n_estimators', 'max_samples'] if key in hyperparams}
    
    logistic_regressor = LogisticRegression(**lr_hyperparams, random_state=0)
    
    bagging_classifier = BaggingClassifier(
        estimator=logistic_regressor, 
        **bagging_hyperparams,
        random_state=0
    )
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    
    cv_scores = cross_val_score(bagging_classifier, X, y, cv=cvKFold)
    
    mean_cv_score = np.mean(cv_scores)
    
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return bagging_classifier, mean_cv_score

hyperparams = {
    'penalty': 'l2',
    'C': 2,
    'n_estimators': 20,
    'max_samples': 0.5
}

bagging, score = baggingClassifier(X_norm, y_encoded, **hyperparams)

# bagging, score = None,None

Mean cross-validation score: 0.77


In [50]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Gradient boosting**

Fill in the function to perform boosting with [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).

Test your function with 25 estimators and a learning rate of 0.1. The decision trees should have a max depth of 4 with a squared error criterion.

The format of your output should be:

Mean cross-validation score: x.xx


In [51]:
### TEST FUNCTION: test_gb
# DO NOT REMOVE THE LINE ABOVE
from sklearn.ensemble import GradientBoostingClassifier
#Gradient Boosting
def gbClassifier(X, y, **hyperparams):
    """Fill this function to run the Gradient Boosting ensemble as described above"""
    # Initialize Gradient Boosting classifier
    fixed_hyperparams = {'random_state': 0}
    fixed_hyperparams.update(hyperparams)
    gb = GradientBoostingClassifier(**fixed_hyperparams)
    
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_scores = cross_val_score(gb, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return gb, mean_cv_score

hyperparams = {
    'n_estimators': 25,
    'learning_rate': 0.1,
    'max_depth': 4,
    'criterion': 'squared_error'
}

gb, score = gbClassifier(X_norm, y_encoded, **hyperparams)
# gb, score = None,None

Mean cross-validation score: 0.75


In [52]:
### SKIP
# This cell won't be marked. Use it to try out your code.

**Random Forest**

Fill in the function to build a random forest with [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

Test your function with a Random Forest of 200 estimators, the entropy (log loss) criterion, a max depth of 4 and 12 max leaf nodes.

The format of your output should be:

Mean cross-validation score: x.xx


In [53]:
### TEST FUNCTION: test_rf
# DO NOT REMOVE THE LINE ABOVE
from sklearn.ensemble import RandomForestClassifier
#Random Forest
def rfClassifier(X, y, **hyperparams):
    """Fill this function to run the Random Forest classifier as described above"""
    # Initialize Random Forest classifier
    fixed_hyperparams = {'random_state': 0}
    fixed_hyperparams.update(hyperparams)
    rf = RandomForestClassifier(**fixed_hyperparams)
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    cv_scores = cross_val_score(rf, X, y, cv=cvKFold)
    mean_cv_score = np.mean(cv_scores)
    
    print(f"Mean cross-validation score: {mean_cv_score:.2f}")
    return rf, mean_cv_score

hyperparams = {
    'n_estimators': 200,
    'criterion': 'entropy',
    'max_depth': 4,
    'max_leaf_nodes': 12,
    'random_state': 0
}

rf, score = rfClassifier(X_norm, y_encoded, **hyperparams)
# rf, score = None,None

Mean cross-validation score: 0.77


In [54]:
### SKIP
# This cell won't be marked. Use it to try out your code.

## Parameter tuning **without** cross-validation

Cross-validation is an excellent tool for estimating generalisation performance and selecting the best hyperparameters, but it is not always practical for large datasets and/or a large number of hyperparameters.

For one classifier, Adaboost, we would like to find the best parameters using grid search without using cross-validation.

The data should be **split** into a full training subset and a hold-out test subset using [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) as below. Make sure to use stratification and random_state=0. You should then split the full training set into a training set and validation set for manual grid search without cross-validation.
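As a sanity check on the two-stage split, the sketch below applies it to dummy arrays with the same number of rows as the dataset (768). The 25% test fraction and 10% validation fraction are assumptions chosen for this illustration; the text above does not fix them.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy arrays (assumed): 768 rows, 8 attributes, balanced labels.
X_dummy = np.zeros((768, 8))
y_dummy = np.array([0, 1] * 384)

# Stage 1: hold out 25% of the data as the test set.
X_full_demo, X_test_demo, y_full_demo, y_test_demo = train_test_split(
    X_dummy, y_dummy, test_size=0.25, stratify=y_dummy, random_state=0
)
# Stage 2: hold out 10% of the full training set for validation.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_full_demo, y_full_demo, test_size=0.1, stratify=y_full_demo, random_state=0
)

sizes = (len(X_train_demo), len(X_val_demo), len(X_test_demo))
print(sizes)  # (518, 58, 192)
```

The test and validation sizes are rounded up (0.1 x 576 = 57.6 becomes 58), which is why the training set ends up with 518 rows.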

In [8]:
### TEST FUNCTION: test_assert_splitting
# Create this function to use on any subset of this data
from sklearn.model_selection import train_test_split
def train_val_test_split(X,y):
    """Split the data into training, validation and test sets as described above"""
    # Hold out 25% of the data as the test set
    X_train_all, X_test, y_train_all, y_test = train_test_split(X, y, random_state=0, test_size=0.25, stratify=y)
    # Hold out 10% of the full training set for validation
    X_train, X_val, y_train, y_val = train_test_split(X_train_all, y_train_all, stratify=y_train_all, random_state=0, test_size=0.1)

    return X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test
X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test = train_val_test_split(X, y)

In [56]:
### SKIP
# This cell won't be marked. Use it to try out your code.
print("Training set size: ", X_train.shape, y_train.shape)
print("Validation set size: ", X_val.shape, y_val.shape)
print("Test set size: ", X_test.shape, y_test.shape)
print("All Training set size: ", X_train_all.shape, y_train_all.shape)


Training set size:  (518, 8) (518,)
Validation set size:  (58, 8) (58,)
Test set size:  (192, 8) (192,)
All Training set size:  (576, 8) (576,)


**Adaboost**

Perform a grid search for AdaBoost using a linear SVM as the base classifier. You should select and handle appropriate hyperparameters for both methods, and return the best set of hyperparameters found.

In the following cell, you should define a grid with at least three parameters for the AdaBoost and/or the Linear SVM.

Use the variable names provided in the scaffold. Try different ranges of hyperparameters to improve your classifier's performance.

Output format:

Best hyperparameter combination: {'param1': value, 'param2': value, ...}

Best model's test set score: x.xx

In [14]:
### TEST FUNCTION: test_parameter_tuning_no_cv
# DO NOT REMOVE THE LINE ABOVE
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def adaBoostGrid(X, y, **hyperparams):
    """Fill this function to run the Adaboost grid search on linear SVM without cross-validation as described above"""
    X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test = train_val_test_split(X, y)
    best_val_score = -1
    best_params = {}
    best_model = None
    # The grid supplies 'algorithm' as a one-element list; take its value
    algorithm = hyperparams.get('algorithm', ['SAMME'])[0]

    # Fall back to the library defaults for any parameter missing from the grid
    default_C = [1.0]
    default_n_estimators = [50]
    default_learning_rate = [1.0]
    if 'dual' not in hyperparams:
        # Manual grid search: fit on the training set, score on the validation set
        for C in hyperparams.get('estimator__C', default_C):
            for n_estimators in hyperparams.get('n_estimators', default_n_estimators):
                for learning_rate in hyperparams.get('learning_rate', default_learning_rate):
                    linear_svc = LinearSVC(C=C, dual=False, random_state=0)
                    ada = AdaBoostClassifier(estimator=linear_svc, n_estimators=n_estimators, learning_rate=learning_rate, algorithm=algorithm, random_state=0)

                    ada.fit(X_train, y_train)
                    val_predict = ada.predict(X_val)
                    val_score = accuracy_score(y_val, val_predict)

                    # Strict '>' keeps the first combination on ties
                    if val_score > best_val_score:
                        best_val_score = val_score
                        best_params = {'estimator__C': C, 'n_estimators': n_estimators, 'learning_rate': learning_rate, 'algorithm': algorithm}

        # Rebuild the best configuration (same dual setting as during the search)
        linear_svc_best = LinearSVC(C=best_params['estimator__C'], random_state=0, dual=False)
        best_model = AdaBoostClassifier(estimator=linear_svc_best, n_estimators=best_params['n_estimators'], learning_rate=best_params['learning_rate'],
                                        algorithm=best_params['algorithm'], random_state=0)
    else:
        # NOTE: when 'dual' is passed, no search is run and AdaBoost's default base estimator is used
        best_model = AdaBoostClassifier(algorithm=algorithm, random_state=0)
        best_params = hyperparams

    # Refit on the full training set and evaluate on the hold-out test set
    best_model.fit(X_train_all, y_train_all)
    test_predict = best_model.predict(X_test)
    test_score = accuracy_score(y_test, test_predict)
    print("Best hyperparameter combination:", best_params)
    print(f"Best model's test set score: {test_score:.2f}")

    return best_model, best_params, best_val_score, test_score

param_grid = {
     'algorithm': ['SAMME'],
    'estimator__C': [0.01, 0.1], 
    'learning_rate': [0.1, 1],
    'n_estimators': [50, 100]
    
    
}

best_model, best_params, best_val_score, test_score = adaBoostGrid(X_norm, y_encoded, **param_grid)





1 Best hyperparameter combination: {'algorithm': ['SAMME'], 'dual': ['auto']}
1 Best model's test set score: 0.76


In [58]:
### TEST FUNCTION: test_adaboost_grid_no_cv
# DO NOT REMOVE THE LINE ABOVE
# best_model, best_params, best_val_score, test_score = adaBoostGrid(X_norm, y_encoded)
# print("Best hyperparameter combination:", best_params)
# print(f"Best model's test set score: {test_score:.2f}")


In [59]:
### SKIP
# This cell won't be marked. Use it to try out your code.

## Parameter tuning **with** cross-validation

Repeat the grid search above for Adaboost using Linear SVM as the base classifier with cross-validation using grid search with 10-fold stratified cross-validation with [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 

The full training set from above should be used, and the trained classifier performance should be evaluated on the hold-out test set. 

You will need to pass cvKFold (the stratified folds) as an argument to GridSearchCV, not cv=10 as in the tutorials. This ensures that random_state=0 for the cross-validation. random_state=0 will still need to be set in the method constructors.
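The sketch below shows the mechanics of passing a fold object to `GridSearchCV`. To keep it short it tunes a plain `LinearSVC` on synthetic data. Both choices are stand-ins for illustration; with AdaBoost the grid keys for the base classifier would carry the `estimator__` prefix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumed), with a stratified hold-out test set.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Pass the fold object itself rather than cv=10 so the shuffling is seeded.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(
    LinearSVC(dual=False, random_state=0),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=folds,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best model is refit on all of X_tr,
# so the search object can score the hold-out test set directly.
test_score = search.score(X_te, y_te)
print("Best hyperparameter combination:", search.best_params_)
print(f"Best model's test set score: {test_score:.2f}")
```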


In [60]:
### TEST FUNCTION: test_parameter_tuning_cv
# DO NOT REMOVE THE LINE ABOVE
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
def adaBoostGrid(X, y, **hyperparams):
    """Fill this function to run the Adaboost Grid search with cross-validation as described above"""
    X_train_all, X_train, X_val, X_test, y_train_all, y_train, y_val, y_test = train_val_test_split(X, y)

    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    linear_svc = LinearSVC(random_state=0, dual=False)
    ada = AdaBoostClassifier(estimator=linear_svc, random_state=0)

    # 'dual' is fixed on the base estimator above, so drop it from the search grid
    hyperparams.pop('dual', None)
    grid_search = GridSearchCV(estimator=ada, param_grid=hyperparams, cv=cvKFold, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train_all, y_train_all)

    best_params = grid_search.best_params_
    best_cv_score = grid_search.best_score_

    # GridSearchCV refits the best model on the full training set (refit=True by default)
    y_pred = grid_search.predict(X_test)
    test_score = accuracy_score(y_test, y_pred)

    print("Best hyperparameter combination:", best_params)
    print(f"Best model's test set score: {test_score:.2f}")
    return grid_search, best_params, best_cv_score, test_score

param_grid = {
    'algorithm': ['SAMME'],
    'estimator__C': [0.01, 0.1], 
    'n_estimators': [50, 100], 
    'learning_rate': [0.1, 1],
    
}

best_model, best_params, best_cv_score, test_score = adaBoostGrid(X_norm, y_encoded, **param_grid)


In [36]:
### TEST FUNCTION: test_adaboost_grid_cv
# DO NOT REMOVE THE LINE ABOVE




In [30]:
### SKIP
# This cell won't be marked. Use it to try out your code.