# Homework 5: Supervised Learning, part 1

### Review

In class #6, we spoke about the importance of having a strong process for building supervised models. It can be very easy to fool yourself into thinking you have a much stronger model than you do!  Some of the common issues we discussed were:

**Not having a strong process for evaluating model performance**

* It is absolutely required to have a separate "test" set that you look at as few times as possible.
* To avoid using your test set, you should use one of the following strategies to choose your model hyperparameters, features, etc:
    * A separate "validation set"
    * Cross-folds validation (CV)
* The modeling task you are working on may help you decide which strategy to use. Some forecasting/prediction problems may benefit from both the test set and validation set being "out of time" samples. In other cases, random sampling (of either the test set or through CV can be useful
* Standard prediction metrics are very useful, but may not be sufficient to capture whether the model works for your business problem
    * Standard metrics can include:
        * Precision, Recall, F-score, Area Under the ROC Curve, Area under the Precision-Recall Curve for classification
        * Mean Squared Error (MSE), Mean Absolute Error (MAE), etc. for regression (we will discuss this more later on)

**Not doing enough work with _Feature Engineering_ or _Feature Selection_**

* Feature engineering is the 'art' of generating features that are better aligned with the domain and prediction problem you are working with.
    * We brainstormed a number of features for the airplane delay prediction, including external data like weather forecasts and deriving features based on the recent number of flights or delays at an airport (or by aircraft type)
* Feature selection is the process of narrowing down the features that you use in a model
    * Some classifiers are more sensitive to "noisy" features than others (like the _KNN_ method we discussed last class)
    * Even for classifiers that are more robust to noisy features, they add no value to the model and can cause issues later, either higher variance or being more prone to data errors
 

### Assignment: building a classification model for airline delays

We want to predict whether our flight is going to be delayed so we know whether we should leave for the aiport yet.

We will make the prediction 6 hours before it is supposed to be delayed. We're choosing 6 hours to balance two things:

* It is useful (we haven't left for the airport yet) unlike a model that predicts 5 minutes in advance.
* It is close enough to the flight time that we can incorporate information about the _current day_ in terms of delays
  

# Feature Brainstorm (from last class)

Data and features:

* Weather data - if we could get it
* recent delays, busy-ness --> by airport (origin and destination) and airline
* how to handle covid
* holidays
* seasonality - make sure you have a full year of training data (and test data?)
* scheduled flights?
* how much data for each split

Modeling choices:

* Feature Scaling
* distance metric and 'k'
* modeling techniques
* how much data for each split


In [69]:
import duckdb
import pandas as pd

from collections import Counter
from matplotlib import pyplot as plt
from sklearn.neighbors import  KNeighborsClassifier

from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, roc_auc_score


pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

## Features

Previously, we created a bunch of indicator variables for features. We did not get particularly far.

Let's instead, do the following:

For AIRLINE CODE, ORIGIN, DESTINATION, MONTH, etc. we want to create lagging features that are the rates of delays for that value.

So, for United Airlines (UA) and for a flight on 6/1/2022, we want features like:

* UA % delayed yesterday
* UA % delayed in past 7 days
* UA % delayed in past 30 days
* UA % delayed in past 365 days

Because we are now using 365 days of history for features, we can't use the first 365 days of data.

Additionally, we may want to "throw out" all the "COVID era" data, so our training data will start on 2022-01-01

I've already written a version of these features using window functions in SQL/DuckDB to get you started


In [71]:
from hw5_helpers import prepare_data

file_path = "/Users/yashwanth/Documents/GWU/Sem 3/Data Mining/Class 5/flights_sample_3m.csv" ## please replace with your own path!

df = prepare_data(file_path) # This returns a dataframe, but also creates a table called flights_w_date in duckdb that has the same data 

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

In [72]:
# Let's inspect the data

df

Unnamed: 0,datetime,FL_DATE,hour,week_of_year,day_name,month_key,month_of_year,AIRLINE_CODE,ORIGIN,DEST,delay_ind,train_test_split,feature_AIRLINE_CODE_1d_delays,feature_AIRLINE_CODE_1d_avg_delays,feature_AIRLINE_CODE_1d_cnt_delays,feature_ORIGIN_1d_delays,feature_ORIGIN_1d_avg_delays,feature_ORIGIN_1d_cnt_delays,feature_DEST_1d_delays,feature_DEST_1d_avg_delays,feature_DEST_1d_cnt_delays,feature_AIRLINE_CODE_7d_delays,feature_AIRLINE_CODE_7d_avg_delays,feature_AIRLINE_CODE_7d_cnt_delays,feature_ORIGIN_7d_delays,feature_ORIGIN_7d_avg_delays,feature_ORIGIN_7d_cnt_delays,feature_DEST_7d_delays,feature_DEST_7d_avg_delays,feature_DEST_7d_cnt_delays,feature_AIRLINE_CODE_30d_delays,feature_AIRLINE_CODE_30d_avg_delays,feature_AIRLINE_CODE_30d_cnt_delays,feature_ORIGIN_30d_delays,feature_ORIGIN_30d_avg_delays,feature_ORIGIN_30d_cnt_delays,feature_DEST_30d_delays,feature_DEST_30d_avg_delays,feature_DEST_30d_cnt_delays,feature_AIRLINE_CODE_365d_delays,feature_AIRLINE_CODE_365d_avg_delays,feature_AIRLINE_CODE_365d_cnt_delays,feature_ORIGIN_365d_delays,feature_ORIGIN_365d_avg_delays,feature_ORIGIN_365d_cnt_delays,feature_DEST_365d_delays,feature_DEST_365d_avg_delays,feature_DEST_365d_cnt_delays
0,2019-01-02 16:50:00,2019-01-02,16,1,Wednesday,201901,1,NK,PBI,ACY,0,feature_only,10.0,0.175439,57,2.0,0.250000,8,,,0,13.0,0.178082,73,2.0,0.181818,11,,,0,13.0,0.178082,73,2.0,0.181818,11,,,0,13.0,0.178082,73,2.0,0.181818,11,,,0
1,2019-01-05 20:30:00,2019-01-05,20,1,Saturday,201901,1,NK,FLL,ACY,0,feature_only,9.0,0.191489,47,5.0,0.217391,23,,,0,32.0,0.135021,237,27.0,0.194245,139,0.0,0.000000,1,32.0,0.135021,237,27.0,0.194245,139,0.0,0.000000,1,32.0,0.135021,237,27.0,0.194245,139,0.0,0.000000,1
2,2019-01-05 21:45:00,2019-01-05,21,1,Saturday,201901,1,NK,MYR,ACY,0,feature_only,7.0,0.145833,48,0.0,0.000000,2,,,0,32.0,0.132780,241,1.0,0.100000,10,0.0,0.000000,1,32.0,0.132780,241,1.0,0.100000,10,0.0,0.000000,1,32.0,0.132780,241,1.0,0.100000,10,0.0,0.000000,1
3,2019-01-06 10:00:00,2019-01-06,10,1,Sunday,201901,1,NK,RSW,ACY,0,feature_only,5.0,0.102041,49,2.0,0.125000,16,0.0,0.000000,2,35.0,0.132075,265,15.0,0.185185,81,0.0,0.000000,3,35.0,0.132075,265,15.0,0.185185,81,0.0,0.000000,3,35.0,0.132075,265,15.0,0.185185,81,0.0,0.000000,3
4,2019-01-07 07:15:00,2019-01-07,07,2,Monday,201901,1,NK,FLL,ACY,0,feature_only,12.0,0.272727,44,9.0,0.321429,28,0.0,0.000000,1,47.0,0.152104,309,42.0,0.224599,187,0.0,0.000000,4,47.0,0.152104,309,42.0,0.224599,187,0.0,0.000000,4,47.0,0.152104,309,42.0,0.224599,187,0.0,0.000000,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2999995,2019-12-11 12:05:00,2019-12-11,12,50,Wednesday,201912,12,DL,ATL,SEA,0,feature_only,28.0,0.102941,272,13.0,0.130000,100,8.0,0.160000,50,207.0,0.111290,1860,69.0,0.094391,731,47.0,0.161512,291,1129.0,0.136188,8290,390.0,0.124521,3132,193.0,0.159901,1207,13976.0,0.145483,96066,6366.0,0.167478,38011,2246.0,0.164542,13650
2999996,2019-12-11 13:22:00,2019-12-11,13,50,Wednesday,201912,12,DL,LAS,SEA,0,feature_only,28.0,0.104869,267,8.0,0.200000,40,8.0,0.160000,50,207.0,0.111470,1857,59.0,0.187302,315,47.0,0.161512,291,1129.0,0.136270,8285,276.0,0.198418,1391,193.0,0.159504,1210,13977.0,0.145463,96086,3145.0,0.196415,16012,2246.0,0.164494,13654
2999997,2019-12-11 13:35:00,2019-12-11,13,50,Wednesday,201912,12,AS,SAN,SEA,0,feature_only,11.0,0.150685,73,0.0,0.000000,16,8.0,0.166667,48,88.0,0.171540,513,26.0,0.151163,172,47.0,0.162630,289,350.0,0.161887,2162,135.0,0.175553,769,193.0,0.159504,1210,3927.0,0.154224,25463,1489.0,0.165592,8992,2246.0,0.164494,13654
2999998,2019-12-11 14:10:00,2019-12-11,14,50,Wednesday,201912,12,AS,SJC,SEA,1,feature_only,8.0,0.115942,69,2.0,0.095238,21,7.0,0.159091,44,88.0,0.172211,511,23.0,0.203540,113,47.0,0.164336,286,350.0,0.161887,2162,73.0,0.145418,502,193.0,0.159636,1209,3927.0,0.154212,25465,862.0,0.142644,6043,2246.0,0.164494,13654


In [73]:
query = '''
SELECT
    month_key,
    COUNT(*) AS num_flights,
    SUM(delay_ind) AS delayed_flights,
    SUM(delay_ind)/COUNT(*) AS pct_delayed_flights
FROM 
    flights_w_date
GROUP BY 
    month_key
ORDER BY 
    month_key'''

duckdb.sql(query).df()

Unnamed: 0,month_key,num_flights,delayed_flights,pct_delayed_flights
0,201901,59412,10044.0,0.169057
1,201902,54565,11461.0,0.210043
2,201903,64894,10913.0,0.168167
3,201904,62080,11052.0,0.178028
4,201905,64966,12536.0,0.192962
5,201906,64777,15312.0,0.23638
6,201907,67213,13835.0,0.205838
7,201908,67469,13415.0,0.198832
8,201909,61987,8215.0,0.132528
9,201910,64869,10297.0,0.158735


# Remaining Data Prep

We are going to need a _Design Matrix_ that includes columns for all the features we want to include.
We will also need the _target variable_ that we are predicting (`delay_ind`)

We are also going to use a hybrid of the evaluation method that we talked about in class:

* We will take the _training_ data and use Cross-folds validation for determining any hyperparameters. Our training data was chosen based on time
* We will keep the "out of time" _test_ data to use at the end to see how well believe our model will perform on unseen data

Our data is also very large! So, we can afford to do some sampling for efficiency


## Question 1: Data Prep

**Question 1a:** create 2 dataframes from the above set of full data:

* train_df (should be where `train_test_split=='train'`
* test_df (should be where `train_test_split=='test'`

**Question 1b:** using those data sets, create 4 dataframes, sampled down:

* X_train (only columns are features)
* y_train (just the target)
* X_test (only columns are features)
* y_test (just the target)

Use a sampling of 50,000 for the training data and 25,000 for the test data

**Question 1c:** Make sure to handle nulls and choose whether to do feature scaling!:

* You can choose your own method for null handling
* You can choose whether or not to do feature scaling
* Please justify your choice either way! (Could be based on intuition/hypotheses at this point. Later, we may ask for more grounding)
  

In [75]:
# 1a: Creating separate DataFrames for training and testing sets based on 'train_test_split'

train_set = df[df['train_test_split'] == 'train'].reset_index(drop=True)
test_set = df[df['train_test_split'] == 'test'].reset_index(drop=True)

# Display the shapes of the newly created DataFrames
train_set_shape = train_set.shape
test_set_shape = test_set.shape
print(f"Training DataFrame shape: {train_set_shape}")
print(f"Testing DataFrame shape: {test_set_shape}")

Training DataFrame shape: (686220, 48)
Testing DataFrame shape: (463484, 48)


In [76]:
# 1b: Sampling 50,000 from train_set and 25,000 from test_set

features_train = train_set.sample(n=50000, random_state=42).drop(columns=['delay_ind'])
target_train = train_set.loc[features_train.index, 'delay_ind']

features_test = test_set.sample(n=25000, random_state=42).drop(columns=['delay_ind'])
target_test = test_set.loc[features_test.index, 'delay_ind']

# Checking the shapes of the sampled DataFrames
print(f"Features_train shape: {features_train.shape}, Target_train shape: {target_train.shape}")
print(f"Features_test shape: {features_test.shape}, Target_test shape: {target_test.shape}")

Features_train shape: (50000, 47), Target_train shape: (50000,)
Features_test shape: (25000, 47), Target_test shape: (25000,)


In [77]:
# 1c: Handling missing values (nulls)

# Identifying numerical and categorical columns
num_columns = features_train.select_dtypes(include=['float64', 'int64']).columns
cat_columns = features_train.select_dtypes(include=['object']).columns

# Filling missing values in numerical columns with the mean
features_train[num_columns] = features_train[num_columns].fillna(features_train[num_columns].mean())

# Filling missing values in categorical columns with the mode
for column in cat_columns:
    features_train[column] = features_train[column].fillna(features_train[column].mode()[0])

# Repeat the same for features_test
features_test[num_columns] = features_test[num_columns].fillna(features_test[num_columns].mean())

for column in cat_columns:
    features_test[column] = features_test[column].fillna(features_test[column].mode()[0])

# Checking if there are any remaining missing values
print(f"Remaining null values in Features_train: {features_train.isnull().sum().sum()}")
print(f"Remaining null values in Features_test: {features_test.isnull().sum().sum()}")

Remaining null values in Features_train: 0
Remaining null values in Features_test: 0


In [78]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler_model = StandardScaler()

# Fit the scaler on training numerical features and transform both train and test numerical data
features_train_scaled = scaler_model.fit_transform(features_train[num_columns])
features_test_scaled = scaler_model.transform(features_test[num_columns])

# Convert the scaled data back to DataFrame for consistency
features_train_scaled = pd.DataFrame(features_train_scaled, columns=num_columns)
features_test_scaled = pd.DataFrame(features_test_scaled, columns=num_columns)

# Combine scaled numerical features with categorical features
train_set_final = pd.concat([features_train_scaled, features_train[cat_columns].reset_index(drop=True)], axis=1)
test_set_final = pd.concat([features_test_scaled, features_test[cat_columns].reset_index(drop=True)], axis=1)

# Display the shapes of the final datasets
print(f"Final Features_train shape: {train_set_final.shape}")
print(f"Final Features_test shape: {test_set_final.shape}")

Final Features_train shape: (50000, 45)
Final Features_test shape: (25000, 45)


## Question 2: KNN Classifier and Cross-validation process

Just like we did in class, we are going to train a K-Nearest-Neighbor Classifier

Because we have only created a "test" and "train" split, we will be using cross-validation on the training data for all of our experiementation.

Let's go through how to get this set up:

We are going to use the [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) method from scikit-learn.

We also want to generate multiple scores:

* precision
* recall
* f1_score
* AUROC

Each of these have a function in sklern.metrics

To piece them together, we will use:

[`make_scorer`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html)

Here are the basic steps:

1. Import KNeighborsClassifier, cross_validate, make_scorer, and each of the metrics.
2. Create a dictionary called `scoring` where each key is the name of a metric and each value is the output of a `make_scorer` call:
    * e.g., `scoring = {'metric_name': make_scorer(score_function, average='binary'}`
3. Instantiate the model. Use `n_neighbors=1 and weights='distance'` You can optionally use `n_jobs` to speed performance
4. Call `cross_validate` with the model, the training data/target, and scoring dictionary. Use 5-folds
5. Print out all results

In [80]:
# One-Hot Encode categorical variables in the training and test datasets
cat_columns = train_set_final.select_dtypes(include=['object']).columns
train_encoded = pd.get_dummies(train_set_final, columns=cat_columns, drop_first=True)
test_encoded = pd.get_dummies(test_set_final, columns=cat_columns, drop_first=True)

# Align the train and test dataframes
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1, fill_value=0)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

# Create a custom scorer for ROC AUC
def custom_auc_roc(true_labels, predicted_scores):
    return roc_auc_score(true_labels, predicted_scores)

# Define the scoring dictionary
scoring_metrics = {
    'precision': make_scorer(precision_score, average='binary', response_method='predict'),
    'recall': make_scorer(recall_score, average='binary', response_method='predict'),
    'f1': make_scorer(f1_score, average='binary', response_method='predict'),
    'roc_auc': make_scorer(custom_auc_roc)  
}

# Instantiate the KNN model with distance-based weighting
knn_model = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)

# Perform cross-validation using the encoded features and target values
cv_result = cross_validate(knn_model, train_encoded, target_train, scoring=scoring_metrics, cv=5, return_train_score=False)

# Print the results
print("Cross-validation results:")
for metric_name in scoring_metrics.keys():
    print(f"{metric_name.capitalize()}: Mean = {cv_result[f'test_{metric_name}'].mean():.4f}, "
          f"Std = {cv_result[f'test_{metric_name}'].std():.4f}")


Cross-validation results:
Precision: Mean = 0.2692, Std = 0.0088
Recall: Mean = 0.2686, Std = 0.0125
F1: Mean = 0.2689, Std = 0.0105
Roc_auc: Mean = 0.5379, Std = 0.0061


## Question 3

OK. If your results are like mine, these are still some unimpressive results, with an auroc barely above the "random guessing" value of 0.5.  This is a tough prediction task with the data we have, but maybe we can eek out a little more performance.  Next week, we will have a few more tools (modeling) techniques in our toolbox that may be better suited to this problem.

First, we know k-nearest neighbor can be sensitive to noisy features and that the curse of dimensionality may also make having too many features problematic. Let's try some feature selection!

Let's use two additional scikit-learn components:

[`f_classif`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_selection.f_classif.html)
and
[`SelectKBest`](https://scikit-learn.org/dev/modules/generated/sklearn.feature_selection.SelectKBest.html)

The f-test comes from regression, but is one way to get some measure of feature importance.

**Question 3a** Using those two functions, choose the top 5 features and see if the results improve.

**Question 3b** What are the most predictive features? Do they make intuitive sense? 


In [82]:
# Question 3a:

from sklearn.feature_selection import SelectKBest, f_classif

# Select the top 5 features using the f-test
feature_selector = SelectKBest(score_func=f_classif, k=5)
train_selected_features = feature_selector.fit_transform(train_encoded, target_train)
test_selected_features = feature_selector.transform(test_encoded)

# Get the mask of selected features
selected_mask = feature_selector.get_support()
selected_feature_names = train_encoded.columns[selected_mask]

print("Top 5 Selected Features:")
print(selected_feature_names)

Top 5 Selected Features:
Index(['feature_AIRLINE_CODE_1d_delays', 'feature_AIRLINE_CODE_1d_avg_delays',
       'feature_AIRLINE_CODE_7d_avg_delays',
       'feature_AIRLINE_CODE_30d_avg_delays',
       'feature_AIRLINE_CODE_365d_avg_delays'],
      dtype='object')


In [83]:
# Question 3b:

# Fit a KNN model on the selected features
knn_model_selected = KNeighborsClassifier(n_neighbors=1, weights='distance', n_jobs=-1)

# Perform cross-validation on the selected features
cv_results_selected_features = cross_validate(knn_model_selected, train_selected_features, target_train, scoring=scoring_metrics, cv=5, return_train_score=False)

# Print the results for selected features
print("Cross-validation results with selected features:")
for metric_name in scoring_metrics.keys():
    print(f"{metric_name.capitalize()}: Mean = {cv_results_selected_features[f'test_{metric_name}'].mean():.4f}, "
          f"Std = {cv_results_selected_features[f'test_{metric_name}'].std():.4f}")

Cross-validation results with selected features:
Precision: Mean = 0.2876, Std = 0.0086
Recall: Mean = 0.2843, Std = 0.0103
F1: Mean = 0.2859, Std = 0.0091
Roc_auc: Mean = 0.5491, Std = 0.0055


# Analysis

The selected features offer key insights into an airline’s performance in relation to flight delays across various timeframes. Features such as 1-day delays and 1-day average delays provide an understanding of immediate operational issues, reflecting short-term disruptions in the airline’s performance. In contrast, the 7-day average delays indicate emerging short-term trends, capturing how performance evolves over a slightly longer period. Additionally, the 30-day and 365-day average delays highlight patterns in the airline's performance over longer timeframes, offering a view of sustained issues and historical trends.

Together, these features suggest that both recent and historical delay data are critical for predicting future delays. The combination of short-term indicators and longer-term averages captures both immediate operational challenges and the broader, sustained patterns that may be influencing the airline's overall performance. This makes them particularly valuable in understanding and forecasting potential delays in the future.

## Question 4: Grid Search

OK, we have gotten perhaps marginally better results. Still not impressive!

Let's get a bit more systematic!

We will do another grid search in order to do the best we can (for now!)

Please loop over the following points in a grid:

* n_neighbors: from 1 to 10
* weights: 'uniform' and 'distance'
* number of top features: from 1 to 10

Create a dataframe of all your results with the choice of parameters, mean values of precision, recall, f1, and auroc. Sort the table from best to worst, according to auroc.

This may take a while to run! You may want to print out what grid combo it is on.


In [86]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, roc_auc_score

# Define the scoring metrics
scoring_metrics = {
    'precision': make_scorer(precision_score, average='binary', needs_proba=False),
    'recall': make_scorer(recall_score, average='binary', needs_proba=False),
    'f1': make_scorer(f1_score, average='binary', needs_proba=False),
    'roc_auc': make_scorer(roc_auc_score)
}

# Create a list to store results
grid_results = []

# Define the parameter grid for KNN
param_grid = {
    'n_neighbors': range(1, 11),
    'weights': ['uniform', 'distance'],
    'k_best': range(1, 11)
}

# Loop through each combination of parameters
for n_neighbors in param_grid['n_neighbors']:
    for weight in param_grid['weights']:
        for k in param_grid['k_best']:
            # Feature selection with the top k features
            feature_selector = SelectKBest(score_func=f_classif, k=k)
            X_train_top_features = feature_selector.fit_transform(train_selected_features, target_train)
            
            # Initialize the KNN classifier
            knn_model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weight, n_jobs=-1)
            
            # Perform cross-validation
            cv_results = cross_validate(knn_model, X_train_top_features, target_train, scoring=scoring_metrics, cv=5, return_train_score=False)
            
            # Print current parameter combination being evaluated
            print(f"Evaluating: n_neighbors={n_neighbors}, weights={weight}, k_best={k}")
            
            # Store results
            grid_results.append({
                'n_neighbors': n_neighbors,
                'weights': weight,
                'k_best': k,
                'precision_mean': cv_results['test_precision'].mean(),
                'precision_std': cv_results['test_precision'].std(),
                'recall_mean': cv_results['test_recall'].mean(),
                'recall_std': cv_results['test_recall'].std(),
                'f1_mean': cv_results['test_f1'].mean(),
                'f1_std': cv_results['test_f1'].std(),
                'roc_auc_mean': cv_results['test_roc_auc'].mean(),
                'roc_auc_std': cv_results['test_roc_auc'].std()
            })

# Convert results to a DataFrame
results_df = pd.DataFrame(grid_results)

# Sort results by AUROC
sorted_results_df = results_df.sort_values(by='roc_auc_mean', ascending=False)

# Display the sorted results
print(sorted_results_df)



Evaluating: n_neighbors=1, weights=uniform, k_best=1
Evaluating: n_neighbors=1, weights=uniform, k_best=2
Evaluating: n_neighbors=1, weights=uniform, k_best=3
Evaluating: n_neighbors=1, weights=uniform, k_best=4
Evaluating: n_neighbors=1, weights=uniform, k_best=5




Evaluating: n_neighbors=1, weights=uniform, k_best=6




Evaluating: n_neighbors=1, weights=uniform, k_best=7




Evaluating: n_neighbors=1, weights=uniform, k_best=8




Evaluating: n_neighbors=1, weights=uniform, k_best=9




Evaluating: n_neighbors=1, weights=uniform, k_best=10
Evaluating: n_neighbors=1, weights=distance, k_best=1
Evaluating: n_neighbors=1, weights=distance, k_best=2
Evaluating: n_neighbors=1, weights=distance, k_best=3
Evaluating: n_neighbors=1, weights=distance, k_best=4
Evaluating: n_neighbors=1, weights=distance, k_best=5




Evaluating: n_neighbors=1, weights=distance, k_best=6
Evaluating: n_neighbors=1, weights=distance, k_best=7




Evaluating: n_neighbors=1, weights=distance, k_best=8
Evaluating: n_neighbors=1, weights=distance, k_best=9




Evaluating: n_neighbors=1, weights=distance, k_best=10
Evaluating: n_neighbors=2, weights=uniform, k_best=1
Evaluating: n_neighbors=2, weights=uniform, k_best=2
Evaluating: n_neighbors=2, weights=uniform, k_best=3
Evaluating: n_neighbors=2, weights=uniform, k_best=4
Evaluating: n_neighbors=2, weights=uniform, k_best=5




Evaluating: n_neighbors=2, weights=uniform, k_best=6




Evaluating: n_neighbors=2, weights=uniform, k_best=7




Evaluating: n_neighbors=2, weights=uniform, k_best=8




Evaluating: n_neighbors=2, weights=uniform, k_best=9




Evaluating: n_neighbors=2, weights=uniform, k_best=10
Evaluating: n_neighbors=2, weights=distance, k_best=1
Evaluating: n_neighbors=2, weights=distance, k_best=2
Evaluating: n_neighbors=2, weights=distance, k_best=3
Evaluating: n_neighbors=2, weights=distance, k_best=4
Evaluating: n_neighbors=2, weights=distance, k_best=5




Evaluating: n_neighbors=2, weights=distance, k_best=6
Evaluating: n_neighbors=2, weights=distance, k_best=7




Evaluating: n_neighbors=2, weights=distance, k_best=8
Evaluating: n_neighbors=2, weights=distance, k_best=9




Evaluating: n_neighbors=2, weights=distance, k_best=10
Evaluating: n_neighbors=3, weights=uniform, k_best=1
Evaluating: n_neighbors=3, weights=uniform, k_best=2
Evaluating: n_neighbors=3, weights=uniform, k_best=3
Evaluating: n_neighbors=3, weights=uniform, k_best=4
Evaluating: n_neighbors=3, weights=uniform, k_best=5




Evaluating: n_neighbors=3, weights=uniform, k_best=6




Evaluating: n_neighbors=3, weights=uniform, k_best=7




Evaluating: n_neighbors=3, weights=uniform, k_best=8




Evaluating: n_neighbors=3, weights=uniform, k_best=9




Evaluating: n_neighbors=3, weights=uniform, k_best=10
Evaluating: n_neighbors=3, weights=distance, k_best=1
Evaluating: n_neighbors=3, weights=distance, k_best=2
Evaluating: n_neighbors=3, weights=distance, k_best=3
Evaluating: n_neighbors=3, weights=distance, k_best=4
Evaluating: n_neighbors=3, weights=distance, k_best=5




Evaluating: n_neighbors=3, weights=distance, k_best=6
Evaluating: n_neighbors=3, weights=distance, k_best=7




Evaluating: n_neighbors=3, weights=distance, k_best=8
Evaluating: n_neighbors=3, weights=distance, k_best=9




Evaluating: n_neighbors=3, weights=distance, k_best=10
Evaluating: n_neighbors=4, weights=uniform, k_best=1
Evaluating: n_neighbors=4, weights=uniform, k_best=2
Evaluating: n_neighbors=4, weights=uniform, k_best=3
Evaluating: n_neighbors=4, weights=uniform, k_best=4
Evaluating: n_neighbors=4, weights=uniform, k_best=5




Evaluating: n_neighbors=4, weights=uniform, k_best=6




Evaluating: n_neighbors=4, weights=uniform, k_best=7




Evaluating: n_neighbors=4, weights=uniform, k_best=8




Evaluating: n_neighbors=4, weights=uniform, k_best=9




Evaluating: n_neighbors=4, weights=uniform, k_best=10
Evaluating: n_neighbors=4, weights=distance, k_best=1
Evaluating: n_neighbors=4, weights=distance, k_best=2
Evaluating: n_neighbors=4, weights=distance, k_best=3
Evaluating: n_neighbors=4, weights=distance, k_best=4
Evaluating: n_neighbors=4, weights=distance, k_best=5




Evaluating: n_neighbors=4, weights=distance, k_best=6
Evaluating: n_neighbors=4, weights=distance, k_best=7




Evaluating: n_neighbors=4, weights=distance, k_best=8
Evaluating: n_neighbors=4, weights=distance, k_best=9




Evaluating: n_neighbors=4, weights=distance, k_best=10
Evaluating: n_neighbors=5, weights=uniform, k_best=1
Evaluating: n_neighbors=5, weights=uniform, k_best=2
Evaluating: n_neighbors=5, weights=uniform, k_best=3
Evaluating: n_neighbors=5, weights=uniform, k_best=4
Evaluating: n_neighbors=5, weights=uniform, k_best=5




Evaluating: n_neighbors=5, weights=uniform, k_best=6




Evaluating: n_neighbors=5, weights=uniform, k_best=7




Evaluating: n_neighbors=5, weights=uniform, k_best=8




Evaluating: n_neighbors=5, weights=uniform, k_best=9




Evaluating: n_neighbors=5, weights=uniform, k_best=10
Evaluating: n_neighbors=5, weights=distance, k_best=1
Evaluating: n_neighbors=5, weights=distance, k_best=2
Evaluating: n_neighbors=5, weights=distance, k_best=3
Evaluating: n_neighbors=5, weights=distance, k_best=4
Evaluating: n_neighbors=5, weights=distance, k_best=5




Evaluating: n_neighbors=5, weights=distance, k_best=6




Evaluating: n_neighbors=5, weights=distance, k_best=7
Evaluating: n_neighbors=5, weights=distance, k_best=8




Evaluating: n_neighbors=5, weights=distance, k_best=9




Evaluating: n_neighbors=5, weights=distance, k_best=10
Evaluating: n_neighbors=6, weights=uniform, k_best=1
Evaluating: n_neighbors=6, weights=uniform, k_best=2
Evaluating: n_neighbors=6, weights=uniform, k_best=3
Evaluating: n_neighbors=6, weights=uniform, k_best=4
Evaluating: n_neighbors=6, weights=uniform, k_best=5




Evaluating: n_neighbors=6, weights=uniform, k_best=6




Evaluating: n_neighbors=6, weights=uniform, k_best=7




Evaluating: n_neighbors=6, weights=uniform, k_best=8




Evaluating: n_neighbors=6, weights=uniform, k_best=9




Evaluating: n_neighbors=6, weights=uniform, k_best=10
Evaluating: n_neighbors=6, weights=distance, k_best=1
Evaluating: n_neighbors=6, weights=distance, k_best=2
Evaluating: n_neighbors=6, weights=distance, k_best=3
Evaluating: n_neighbors=6, weights=distance, k_best=4
Evaluating: n_neighbors=6, weights=distance, k_best=5




Evaluating: n_neighbors=6, weights=distance, k_best=6




Evaluating: n_neighbors=6, weights=distance, k_best=7




Evaluating: n_neighbors=6, weights=distance, k_best=8




Evaluating: n_neighbors=6, weights=distance, k_best=9




Evaluating: n_neighbors=6, weights=distance, k_best=10
Evaluating: n_neighbors=7, weights=uniform, k_best=1
Evaluating: n_neighbors=7, weights=uniform, k_best=2
Evaluating: n_neighbors=7, weights=uniform, k_best=3
Evaluating: n_neighbors=7, weights=uniform, k_best=4
Evaluating: n_neighbors=7, weights=uniform, k_best=5




Evaluating: n_neighbors=7, weights=uniform, k_best=6




Evaluating: n_neighbors=7, weights=uniform, k_best=7




Evaluating: n_neighbors=7, weights=uniform, k_best=8




Evaluating: n_neighbors=7, weights=uniform, k_best=9




Evaluating: n_neighbors=7, weights=uniform, k_best=10
Evaluating: n_neighbors=7, weights=distance, k_best=1
Evaluating: n_neighbors=7, weights=distance, k_best=2
Evaluating: n_neighbors=7, weights=distance, k_best=3
Evaluating: n_neighbors=7, weights=distance, k_best=4
Evaluating: n_neighbors=7, weights=distance, k_best=5




Evaluating: n_neighbors=7, weights=distance, k_best=6




Evaluating: n_neighbors=7, weights=distance, k_best=7




Evaluating: n_neighbors=7, weights=distance, k_best=8




Evaluating: n_neighbors=7, weights=distance, k_best=9




Evaluating: n_neighbors=7, weights=distance, k_best=10
Evaluating: n_neighbors=8, weights=uniform, k_best=1
Evaluating: n_neighbors=8, weights=uniform, k_best=2
Evaluating: n_neighbors=8, weights=uniform, k_best=3
Evaluating: n_neighbors=8, weights=uniform, k_best=4
Evaluating: n_neighbors=8, weights=uniform, k_best=5




Evaluating: n_neighbors=8, weights=uniform, k_best=6




Evaluating: n_neighbors=8, weights=uniform, k_best=7




Evaluating: n_neighbors=8, weights=uniform, k_best=8




Evaluating: n_neighbors=8, weights=uniform, k_best=9




Evaluating: n_neighbors=8, weights=uniform, k_best=10
Evaluating: n_neighbors=8, weights=distance, k_best=1
Evaluating: n_neighbors=8, weights=distance, k_best=2
Evaluating: n_neighbors=8, weights=distance, k_best=3
Evaluating: n_neighbors=8, weights=distance, k_best=4
Evaluating: n_neighbors=8, weights=distance, k_best=5




Evaluating: n_neighbors=8, weights=distance, k_best=6




Evaluating: n_neighbors=8, weights=distance, k_best=7




Evaluating: n_neighbors=8, weights=distance, k_best=8




Evaluating: n_neighbors=8, weights=distance, k_best=9




Evaluating: n_neighbors=8, weights=distance, k_best=10
Evaluating: n_neighbors=9, weights=uniform, k_best=1
Evaluating: n_neighbors=9, weights=uniform, k_best=2
Evaluating: n_neighbors=9, weights=uniform, k_best=3
Evaluating: n_neighbors=9, weights=uniform, k_best=4
Evaluating: n_neighbors=9, weights=uniform, k_best=5




Evaluating: n_neighbors=9, weights=uniform, k_best=6




Evaluating: n_neighbors=9, weights=uniform, k_best=7




Evaluating: n_neighbors=9, weights=uniform, k_best=8




Evaluating: n_neighbors=9, weights=uniform, k_best=9




Evaluating: n_neighbors=9, weights=uniform, k_best=10
Evaluating: n_neighbors=9, weights=distance, k_best=1
Evaluating: n_neighbors=9, weights=distance, k_best=2
Evaluating: n_neighbors=9, weights=distance, k_best=3
Evaluating: n_neighbors=9, weights=distance, k_best=4
Evaluating: n_neighbors=9, weights=distance, k_best=5




Evaluating: n_neighbors=9, weights=distance, k_best=6




Evaluating: n_neighbors=9, weights=distance, k_best=7




Evaluating: n_neighbors=9, weights=distance, k_best=8




Evaluating: n_neighbors=9, weights=distance, k_best=9




Evaluating: n_neighbors=9, weights=distance, k_best=10
Evaluating: n_neighbors=10, weights=uniform, k_best=1
Evaluating: n_neighbors=10, weights=uniform, k_best=2
Evaluating: n_neighbors=10, weights=uniform, k_best=3
Evaluating: n_neighbors=10, weights=uniform, k_best=4
Evaluating: n_neighbors=10, weights=uniform, k_best=5




Evaluating: n_neighbors=10, weights=uniform, k_best=6




Evaluating: n_neighbors=10, weights=uniform, k_best=7




Evaluating: n_neighbors=10, weights=uniform, k_best=8




Evaluating: n_neighbors=10, weights=uniform, k_best=9




Evaluating: n_neighbors=10, weights=uniform, k_best=10
Evaluating: n_neighbors=10, weights=distance, k_best=1
Evaluating: n_neighbors=10, weights=distance, k_best=2
Evaluating: n_neighbors=10, weights=distance, k_best=3
Evaluating: n_neighbors=10, weights=distance, k_best=4
Evaluating: n_neighbors=10, weights=distance, k_best=5




Evaluating: n_neighbors=10, weights=distance, k_best=6




Evaluating: n_neighbors=10, weights=distance, k_best=7




Evaluating: n_neighbors=10, weights=distance, k_best=8




Evaluating: n_neighbors=10, weights=distance, k_best=9




Evaluating: n_neighbors=10, weights=distance, k_best=10
     n_neighbors   weights  k_best  precision_mean  precision_std  \
32             2  distance       3        0.317480       0.006241   
33             2  distance       4        0.315335       0.007567   
34             2  distance       5        0.313647       0.009160   
35             2  distance       6        0.313647       0.009160   
36             2  distance       7        0.313647       0.009160   
37             2  distance       8        0.313647       0.009160   
38             2  distance       9        0.313647       0.009160   
39             2  distance      10        0.313647       0.009160   
79             4  distance      10        0.338496       0.004489   
78             4  distance       9        0.338496       0.004489   
77             4  distance       8        0.338496       0.004489   
76             4  distance       7        0.338496       0.004489   
75             4  distance       6        0.338

## Summary

OK, so we have gotten marginally better, but in the process we have tried out a few new tools/steps.

We will build on this kind of framework in the next assignment and introduce the concept of a pipeline to make for less boilerplate.

And, hopefully, with better feature engineering, classification algorithms better suited to the problem and possibly some external data, we will be able to improve the results even more!
