# Assignment 2

Please create a `data` folder and put all the datasets in the folder.

In this assignment, you may use sklearn and other Python libraries unless the assignment explicitly states otherwise.

## Part-A: Regression - Polynomial Regression with Regularization

1. (2 pts) Load the airfoil dataset and split the data into training and test datasets in ratio 80:20 using random splitting. 
2. (2 pts) Perform Z-score standardization on the dataset. Hint: first compute the mean and standard deviation from the training dataset, then apply them to the test dataset.
3. (16 pts) Implement polynomial regression with $l_2$ regularization on the airfoil dataset to predict "Scaled sound pressure level". Try different degrees and regularization parameters and compare the Mean Squared Error (MSE) values. Select the best model using 5-fold cross-validation based on MSE.
4. (5 pts) After selecting the best model, evaluate it on the test dataset.

The dataset (dat file) is available on the Canvas Assignment page.

In [94]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge, LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report
from sklearn.pipeline import Pipeline

In [95]:
df = pd.read_csv("data/airfoil_self_noise.dat",sep='\t',names=['frequency','Angle of attack',
                 'Chord length','Free-stream velocity','Suction side displacement thickness','Scaled sound pressure level'])
print(df.head())

# Your code here.
X = df.iloc[:, :-1] # features
y = df.iloc[:, -1] # response

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

   frequency  Angle of attack  Chord length  Free-stream velocity  \
0        800              0.0        0.3048                  71.3   
1       1000              0.0        0.3048                  71.3   
2       1250              0.0        0.3048                  71.3   
3       1600              0.0        0.3048                  71.3   
4       2000              0.0        0.3048                  71.3   

   Suction side displacement thickness  Scaled sound pressure level  
0                             0.002663                      126.201  
1                             0.002663                      125.201  
2                             0.002663                      125.951  
3                             0.002663                      127.591  
4                             0.002663                      127.461  


In [96]:
# Standardize the training set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Standardize the test set
X_test = scaler.transform(X_test)
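
To make the hint from step 2 concrete, here is a minimal NumPy sketch of what the cell above does: the mean and standard deviation are estimated on the training rows only and then reused for the test rows. The toy arrays `X_tr_demo` and `X_te_demo` are made up for illustration and are not the airfoil data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not the airfoil dataset)
X_tr_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
X_te_demo = np.array([[2.0, 30.0], [4.0, 0.0]])

# Statistics are estimated on the training rows only
mu = X_tr_demo.mean(axis=0)
sigma = X_tr_demo.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

X_tr_manual = (X_tr_demo - mu) / sigma
X_te_manual = (X_te_demo - mu) / sigma  # reuse training statistics, never refit on test data

# Cross-check the manual result against StandardScaler
demo_scaler = StandardScaler().fit(X_tr_demo)
print(np.allclose(X_tr_manual, demo_scaler.transform(X_tr_demo)))
print(np.allclose(X_te_manual, demo_scaler.transform(X_te_demo)))
```

Note that the standardized test columns generally do not have exactly zero mean or unit variance, because the statistics come from the training set.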

In [97]:
# Define the hyperparameters for grid search
degrees = [1, 2, 3, 4, 5]
alphas = [0.01, 0.1, 1, 10, 100]

# Create a pipeline
pipe = Pipeline([('poly', PolynomialFeatures()), ('ridge', Ridge())])

param = {'poly__degree': degrees, 'ridge__alpha': alphas}
grid = GridSearchCV(pipe, param, cv=5, scoring='neg_mean_squared_error')

# Fit to the training set
grid.fit(X_train, y_train)

# Select the best model
best_model = grid.best_estimator_

print(f'Best model: {best_model}')

Best model: Pipeline(steps=[('poly', PolynomialFeatures(degree=4)),
                ('ridge', Ridge(alpha=0.01))])
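
The task also asks to compare MSE across degrees and regularization parameters, which `best_estimator_` alone does not show. Below is a sketch of how the mean CV MSE of every combination could be tabulated from `cv_results_`. It uses synthetic data from `make_regression` as a stand-in for the airfoil features, and the names `X_demo`, `y_demo`, `demo_grid`, and `mse_table` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the standardized airfoil features (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

demo_pipe = Pipeline([("poly", PolynomialFeatures()), ("ridge", Ridge())])
demo_param = {"poly__degree": [1, 2, 3], "ridge__alpha": [0.1, 1.0]}
demo_grid = GridSearchCV(demo_pipe, demo_param, cv=5, scoring="neg_mean_squared_error")
demo_grid.fit(X_demo, y_demo)

# mean_test_score holds the negative MSE, so flip the sign to get the MSE
results = pd.DataFrame(demo_grid.cv_results_)
results["cv_mse"] = -results["mean_test_score"]

# Rows: polynomial degree, columns: ridge alpha, cells: mean 5-fold CV MSE
mse_table = results.pivot(index="param_poly__degree", columns="param_ridge__alpha", values="cv_mse")
print(mse_table.round(3))
```

Applied to the real `grid` object, the same pivot would give one row per degree and one column per alpha.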


In [98]:
# Test on the test set
y_hat = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_hat)

print(f'Test MSE: {round(test_mse, 4)}')

Test MSE: 12.3033


## Part-B: Classification - KNN and Logistic Regression

1. (2 pts) Load the iris dataset and split the data into training and test datasets in ratio 80:20 using random splitting.
2. (18 pts) Implement the KNN algorithm without using sklearn. Try different K values (K=2, 5, 10, 20). Select the best model using 5-fold cross-validation based on accuracy and evaluate it on the test dataset. Report the test accuracy for each class and the overall test accuracy. Hint: matrix operations are much faster than for-loops!
3. (10 pts) Train a Logistic Regression model on the iris dataset. Evaluate the model on the test dataset. Report the test accuracy for each class and the overall test accuracy.

In [99]:
data = load_iris()

# your code here
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [100]:
# Standardize the training set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Standardize the test set
X_test = scaler.transform(X_test)

### KNN

In [101]:
# Implement KNN
def knn(X_train, y_train, X_test, k):
  m = X_test.shape[0] # number of test samples
  n = X_train.shape[0] # number of train samples
  y_hat = np.zeros(m) # initialize y_hat

  # calculate Euclidean distance between each test sample and all train samples
  for i in range(m):
    dist = np.sqrt(np.sum((X_train - X_test[i])**2, axis=1))
    # find k nearest neighbors
    indx = np.argsort(dist)[:k]
    k_label = y_train[indx]
    # find the most common label
    y_hat[i] = np.bincount(k_label).argmax() # index
  
  return y_hat
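
The function above loops over the test samples. Following the hint about matrix operations, here is a sketch of a fully vectorized variant. The name `knn_vectorized` and the toy arrays are illustrative, and the code assumes integer class labels `0..C-1`, as in the iris targets.

```python
import numpy as np

def knn_vectorized(X_train, y_train, X_test, k):
    """Predict labels for X_test by majority vote among the k nearest training points."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    # Shape (n_test, n_train); the square root is skipped because it preserves the ordering.
    d2 = (
        np.sum(X_test**2, axis=1)[:, None]
        + np.sum(X_train**2, axis=1)[None, :]
        - 2.0 * X_test @ X_train.T
    )
    # Indices of the k nearest training samples for every test sample
    nn_idx = np.argsort(d2, axis=1)[:, :k]
    nn_labels = y_train[nn_idx]  # shape (n_test, k)

    # Count the votes per class; labels are assumed to be integers 0..C-1
    n_test = X_test.shape[0]
    n_classes = int(y_train.max()) + 1
    votes = np.zeros((n_test, n_classes), dtype=int)
    np.add.at(votes, (np.arange(n_test)[:, None], nn_labels), 1)

    # argmax picks the smallest label on ties, like np.bincount(...).argmax()
    return votes.argmax(axis=1)

# Toy check with two well-separated clusters (hypothetical data)
X_tr_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
y_tr_toy = np.array([0, 0, 0, 1, 1, 1])
X_te_toy = np.array([[0.05, 0.05], [4.9, 5.2]])
pred_toy = knn_vectorized(X_tr_toy, y_tr_toy, X_te_toy, k=3)
print(pred_toy)
```

The distance matrix needs memory proportional to the number of test samples times the number of training samples, which is negligible for iris but worth keeping in mind for larger datasets.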

In [102]:
# Function for Grid Search CV
def GS_CV(X, y, grid, cv=5):
  best_k = None
  best_score = 0

  # Build the folds once so that every k is scored on the same splits
  cross_valid = KFold(n_splits=cv, random_state=100, shuffle=True)

  for k in grid:
    scores = []
    for tr_idx, val_idx in cross_valid.split(X):
      X_tr, y_tr = X[tr_idx], y[tr_idx]
      X_val, y_val = X[val_idx], y[val_idx]
      y_predicted = knn(X_tr, y_tr, X_val, k)
      scores.append(accuracy_score(y_val, y_predicted))
    # Calculate average score for each k and return the best k
    avg_score = np.mean(scores)
    if avg_score > best_score:
      best_k = k
      best_score = avg_score

  return (best_k, best_score)

In [103]:
# Find best K using grid search CV
grid = [2, 5, 10, 20]
best_parameters = GS_CV(X_train, y_train, grid)
best_k, best_score = best_parameters[0], best_parameters[1]
print(f"Best k: {best_k}")
print(f"Best score: {best_score}")

Best k: 5
Best score: 0.9583333333333334


In [104]:
# Test on the test set
y_hat = knn(X_train, y_train, X_test, best_k)
test_accuracy = accuracy_score(y_test, y_hat)

print(f'Overall test accuracy: {round(test_accuracy, 4)}')
print(f'Accuracy for each class:\n {classification_report(y_test, y_hat)}')

Overall test accuracy: 1.0
Accuracy for each class:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      1.00      1.00         6
           2       1.00      1.00      1.00        13

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
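
`classification_report` prints precision, recall, and F1 rather than a column called accuracy. If "accuracy for each class" is read as the fraction of samples of that class that were predicted correctly, it coincides with the recall column. A small sketch of computing it directly from the confusion matrix follows; the helper name `per_class_accuracy` and the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
    return cm.diagonal() / cm.sum(axis=1)

# Toy labels (hypothetical): one sample of class 1 is misclassified as class 2
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2])
acc_toy = per_class_accuracy(y_true_toy, y_pred_toy)
print(acc_toy)
```

On the notebook's variables this would be called as `per_class_accuracy(y_test, y_hat)`.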



### Logistic Regression

In [105]:
# Train Logistic Regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict on the test set
y_hat = lr.predict(X_test)

# Test on the test set
test_accuracy = accuracy_score(y_test, y_hat)
class_accuracy = classification_report(y_test, y_hat)

print(f'Overall test accuracy: {round(test_accuracy, 4)}')
print(f'Accuracy for each class:\n {class_accuracy}')

Overall test accuracy: 0.9667
Accuracy for each class:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        11
           1       1.00      0.83      0.91         6
           2       0.93      1.00      0.96        13

    accuracy                           0.97        30
   macro avg       0.98      0.94      0.96        30
weighted avg       0.97      0.97      0.97        30



## Part-C: Classification - SVM

1. (2 pts) Load the australian dataset and split the data into training and test datasets in ratio 80:20.
2. (18 pts) Use a support vector machine (SVM) to predict whether a credit card should be approved for a person or not. Consider two different kernels and two different values of C. Select the best kernel using 5-fold cross-validation. Evaluate the best model on the test dataset and report the accuracy.

The dataset (dat file) is available on the Canvas Assignment page.

In [106]:
df = pd.read_table("data/australian.dat", sep=" ", names=['c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12','c13','c14','response'])
print(df.head())

# Your code here.
X = df.iloc[:, :-1] # features
y = df.iloc[:, -1] # response

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

   c1     c2     c3  c4  c5  c6     c7  c8  c9  c10  c11  c12  c13   c14  \
0   1  22.08  11.46   2   4   4  1.585   0   0    0    1    2  100  1213   
1   0  22.67   7.00   2   8   4  0.165   0   0    0    0    2  160     1   
2   0  29.58   1.75   1   4   4  1.250   0   0    0    1    2  280     1   
3   0  21.67  11.50   1   5   3  0.000   1   1   11    1    2    0     1   
4   1  20.17   8.17   2   6   4  1.960   1   1   14    0    2   60   159   

   response  
0         0  
1         0  
2         0  
3         1  
4         1  


In [107]:
# Standardize the training set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Standardize the test set
X_test = scaler.transform(X_test)

In [108]:
# Define SVM
svm = SVC()

# Define hyperparameters for the grid search
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'kernel': ['linear', 'poly', 'rbf']}

# Define cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=100)

# Find the best model
svm_grid_search = GridSearchCV(svm, param_grid=param_grid, cv=cv, n_jobs=-1)
svm_grid_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validation score
print(f'Best parameters: {svm_grid_search.best_params_}')
print(f'Average CV score: {round(svm_grid_search.best_score_, 4)}')  # best_score_ is already the mean over folds

# Select the best model
best_svm = svm_grid_search.best_estimator_

Best parameters: {'C': 0.01, 'kernel': 'linear'}
Average CV score: 0.8551
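
In the cells above the scaler is fitted on the full training set before cross-validation, so each validation fold has already contributed to the scaling statistics. One possible alternative, sketched here under the assumption of synthetic data from `make_classification` (not the australian dataset), is to put the scaler inside a `Pipeline` so that it is refitted within every fold. The grid is restricted to two kernels and two values of C, as the task is worded; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem with 14 features (stand-in for the credit data)
X_syn, y_syn = make_classification(n_samples=300, n_features=14, n_informative=6, random_state=0)
X_syn_tr, X_syn_te, y_syn_tr, y_syn_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=100
)

# The scaler is a pipeline step, so it is fitted on the training part of each fold only
svm_pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
svm_param = {"svm__kernel": ["linear", "rbf"], "svm__C": [0.1, 1]}
svm_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)

svm_search = GridSearchCV(svm_pipe, svm_param, cv=svm_cv)
svm_search.fit(X_syn_tr, y_syn_tr)

print(svm_search.best_params_)
print(round(svm_search.best_score_, 4))          # mean CV accuracy of the best setting
syn_test_acc = svm_search.score(X_syn_te, y_syn_te)  # accuracy of the refitted best pipeline
print(round(syn_test_acc, 4))
```

`StratifiedKFold` is used here instead of `KFold` to keep the class ratio similar in every fold; that is a design choice of this sketch, not something the notebook does.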


In [109]:
# Evaluate the best SVM on the test set
y_hat = best_svm.predict(X_test)
accuracy = accuracy_score(y_test, y_hat)
print(f'Test accuracy of SVM: {round(accuracy, 4)}')

Test accuracy of SVM: 0.8623


## Part-D: Classification - Ensembling

We continue using the australian dataset.
1. (5 pts) Train a KNN model on the dataset. Evaluate both models (KNN and the SVM from Part-C) on the test dataset and report the test accuracy.
2. (20 pts) Try two different ensemble methods on the dataset. Compare the performance of the ensemble models with the single models (KNN, SVM) and describe your observations.

In [110]:
# Your code here
# Define KNN (named knn_clf so it does not shadow the knn() function from Part-B)
knn_clf = KNeighborsClassifier()

# Define hyperparameters for the grid search
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]}

# Define cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=100)

# Find the best model
knn_grid_search = GridSearchCV(knn_clf, param_grid=param_grid, cv=cv, n_jobs=-1)
knn_grid_search.fit(X_train, y_train)

# Print the best parameters and the mean cross-validation score
print(f'Best parameters: {knn_grid_search.best_params_}')
print(f'Average CV score: {round(knn_grid_search.best_score_, 4)}')  # best_score_ is already the mean over folds

# Select the best model
best_knn = knn_grid_search.best_estimator_

Best parameters: {'n_neighbors': 7}
Average CV score: 0.8551


In [111]:
# Evaluate the best KNN on the test set
y_hat = best_knn.predict(X_test)
accuracy = accuracy_score(y_test, y_hat)
print(f'Test accuracy of KNN: {round(accuracy, 4)}')

Test accuracy of KNN: 0.8768


### Ensemble models

***Hard voting***

In [112]:
# Define the voting classifier
vote = VotingClassifier(estimators=[('svm', best_svm), ('knn', best_knn)], voting='hard')

# Train the voting classifier
vote.fit(X_train, y_train)

# Test on the test set
y_hat = vote.predict(X_test)
accuracy = accuracy_score(y_test, y_hat)
print(f'Test accuracy of the Hard Vote Ensemble Model: {round(accuracy, 4)}')

Test accuracy of the Hard Vote Ensemble Model: 0.8986
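
With only two base models, every disagreement between them is a tie. To my understanding of the scikit-learn documentation, `VotingClassifier` with `voting='hard'` resolves ties in ascending order of the class labels. The following NumPy sketch mimics that rule; the helper `hard_vote` and the toy predictions are illustrative and not part of the notebook.

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over integer class predictions of shape (n_samples, n_models).

    Ties go to the smallest class label, because argmax returns the first maximum.
    """
    n_classes = int(predictions.max()) + 1
    return np.array([np.bincount(row, minlength=n_classes).argmax() for row in predictions])

# Columns: hypothetical predictions of model A and model B for four samples
toy_preds = np.array([
    [0, 0],  # both say 0
    [1, 1],  # both say 1
    [0, 1],  # disagreement -> tie
    [1, 0],  # disagreement -> tie
])
toy_vote = hard_vote(toy_preds)
print(toy_vote)
```

Under this rule, a two-model ensemble on a binary problem predicts class 1 only when both models agree on class 1.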


***Stacked ensemble***

In [114]:
# Define the stacking classifier
stack = StackingClassifier(estimators=[('svm', best_svm), ('knn', best_knn)], final_estimator=LogisticRegression())

# Define hyperparameters for grid search to tune the final estimator
parameters = {'final_estimator__C': [0.01, 0.1, 1, 10]}

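# Tune hyperparameters using 5-fold CV
# (named stack_grid_search so it does not shadow the GS_CV() function from Part-B)
stack_grid_search = GridSearchCV(stack, parameters, cv=5)
stack_grid_search.fit(X_train, y_train)

# Select the best stacked model
best_stack = stack_grid_search.best_estimator_

# Test on the test set
y_hat = best_stack.predict(X_test)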
accuracy = accuracy_score(y_test, y_hat)
print(f'Test accuracy of the Stacked Ensemble Model: {round(accuracy, 4)}')

Test accuracy of the Stacked Ensemble Model: 0.8623
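
To illustrate what stacking does conceptually, here is a hand-rolled sketch: the base models produce out-of-fold scores on the training set, and a logistic regression is trained on those scores as its only features. It is a simplified illustration on synthetic data, not a reimplementation of `StackingClassifier`; the data, the variable names, and the base-model hyperparameters (copied from the best parameters printed earlier) are assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary problem (stand-in for the credit data)
X_st, y_st = make_classification(n_samples=300, n_features=14, n_informative=6, random_state=0)
X_st_tr, X_st_te, y_st_tr, y_st_te = train_test_split(X_st, y_st, test_size=0.2, random_state=100)

base_svm = SVC(kernel="linear", C=0.01)
base_knn = KNeighborsClassifier(n_neighbors=7)

# Level-1 training features: out-of-fold scores, so the meta-model never sees
# a base-model prediction for a sample that base model was trained on
svm_oof = cross_val_predict(base_svm, X_st_tr, y_st_tr, cv=5, method="decision_function")
knn_oof = cross_val_predict(base_knn, X_st_tr, y_st_tr, cv=5, method="predict_proba")[:, 1]
meta_tr = np.column_stack([svm_oof, knn_oof])

meta_model = LogisticRegression().fit(meta_tr, y_st_tr)

# For prediction, the base models are refitted on the full training set
base_svm.fit(X_st_tr, y_st_tr)
base_knn.fit(X_st_tr, y_st_tr)
meta_te = np.column_stack([
    base_svm.decision_function(X_st_te),
    base_knn.predict_proba(X_st_te)[:, 1],
])
stack_acc = meta_model.score(meta_te, y_st_te)
print(round(stack_acc, 4))
```

Because the meta-model sees only two input features here, its decision boundary is a single line in the (SVM score, KNN probability) plane.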


| Model | Test Accuracy |
| -------- | -------- |
| SVM | 0.8623 |
| KNN | 0.8768 |
| Hard Voting Ensemble | 0.8986 |
| Stacked Ensemble | 0.8623 |

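The hard (majority) voting ensemble achieved the highest test accuracy (0.8986), ahead of both KNN (0.8768) and SVM (0.8623).

The stacked ensemble (0.8623) performed worse than KNN and matched the SVM. One possible reason is that I used logistic regression as the final estimator; a different final estimator might perform better. Since all four scores come from a single 80:20 split, differences of this size should be interpreted with caution.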