In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, precision_score, recall_score

In [None]:
# get data
data = pd.read_parquet("combined_air_travel_data.parquet")  # Parquet file, so read_csv cannot parse it
train_set, test_set = train_test_split(data, test_size=0.1, random_state=42)

In [None]:
# extra code
from sklearn.metrics import make_scorer, precision_score, recall_score, precision_recall_curve
def precision_at_recall(y, y_pred, *, recall, **kwargs):
    """
    Calculate the precision at a given recall level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9)  # 0.9 is the minimum recall level
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default only works for binary classification problems. To use it for
    multiclass classification, you must change the average parameter. For example:

        scorer = make_scorer(precision_at_recall, recall=0.9, average="micro")
    """
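    # Score 0.0 unless the recall requirement is met (>= because `recall` is a minimum level).
    meets_recall = recall_score(y, y_pred, **kwargs) >= recall
    return precision_score(y, y_pred, **kwargs) if meets_recall else 0.0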

def recall_at_precision(y, y_pred, *, precision, **kwargs):
    """
    Calculate the recall at a given precision level.

    To use this with cross_val_score, you need to use the make_scorer function to
    create a scoring function that can be used with cross_val_score. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9)
        cross_val_score(model, X, y, cv=5, scoring=scorer)

    The default only works for binary classification problems. To use it for
    multiclass classification, you must change the average parameter. For example:

        scorer = make_scorer(recall_at_precision, precision=0.9, average="micro")
    """
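    # Score 0.0 unless the precision requirement is met (>= because `precision` is a minimum level).
    meets_precision = precision_score(y, y_pred, **kwargs) >= precision
    return recall_score(y, y_pred, **kwargs) if meets_precision else 0.0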
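
The two helpers above score the hard predictions a model makes at its default decision threshold. As a complementary sketch, precision_recall_curve (imported above) can report the best precision over all thresholds that reach a minimum recall. The helper name best_precision_at_recall and the synthetic data are assumptions for illustration, not part of the project code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split


def best_precision_at_recall(y_true, y_score, *, min_recall):
    """Highest precision over all thresholds whose recall is >= min_recall."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    mask = recall >= min_recall
    return float(precision[mask].max()) if mask.any() else 0.0


# Synthetic stand-in data (an assumption; not the air travel dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class
p_at_r = best_precision_at_recall(y_te, scores, min_recall=0.9)
print(f"best precision with recall >= 0.9: {p_at_r:.3f}")
```

Unlike the scorers above, this needs continuous scores (predict_proba or decision_function output) rather than predicted labels.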

In [None]:
# pipeline

Classifiers to Try for Project
==============================
List of classifiers to try along with their most critical hyperparameters. Expect shortlisting models to take on the order of a day and hyperparameter tuning to take multiple days.

Note: look into each classifier briefly to check whether it is appropriate for our dataset (considering the number of samples vs. features). Read the documentation for the default values and to understand which ranges are worth examining.
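
A minimal sketch of what shortlisting could look like: cross-validate a few candidates with mostly default settings and compare one metric. The synthetic data, the choice of candidates, and the F1 metric are assumptions for illustration; substitute the project's training set and scorer.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the air travel dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scale-sensitive models get a StandardScaler; the forest does not need one.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {results[name]:.3f}")
```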

 

Linear Classifiers
------------------

All of these expose a coef_ attribute after fitting, which can be useful during exploration, especially with penalty='l1', which drives the coefficients of unimportant features to (near) 0. The coefficients are only meaningful if the data is properly scaled.

LogisticRegression
  - penalty: l2 (default), l1 (for sparse models), elasticnet, or None (no penalty; spelled 'none' before scikit-learn 1.2)
  - C: 1 (default), decrease for stronger regularization
  - dual: False (default)
  - solver: different solvers suit different problems, and not every solver supports every penalty
SGDClassifier (basically an online version of LogisticRegression; if online learning is not needed, you probably don't need this)
  - loss: hinge (default, like LinearSVC) or log_loss (like LogisticRegression; named log in older scikit-learn versions)
  - penalty: l2 (default), l1, elasticnet, or None (no penalty)
  - alpha: default 0.0001, increase for stronger regularization
  - l1_ratio: ratio of Lasso vs. Ridge if penalty='elasticnet'
  - max_iter/tol: max number of steps to attempt and target tolerance to achieve
  - learning_rate: constant, optimal (default), invscaling, adaptive
  - eta0: initial learning rate for the constant, invscaling, and adaptive schedules (not used by optimal); the default depends on the scikit-learn version, so check the documentation
  - shuffle: default True, shuffle data between iterations
  - early_stopping/validation_fraction/n_iter_no_change: early stopping regularization
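
To illustrate the point about coef_ and the l1 penalty, here is a small sketch on synthetic data (an assumption; the feature counts and C=0.05 are arbitrary choices for illustration). The data is scaled first, and the liblinear solver is used because it supports penalty='l1'.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 20 features, only 3 of which carry signal (synthetic stand-in data).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# A small C means strong regularization, which pushes weak coefficients to exactly 0.
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", C=0.05, solver="liblinear"),
)
l1_model.fit(X, y)

coefs = l1_model[-1].coef_.ravel()  # coefficients of the final pipeline step
n_zero = int(np.sum(coefs == 0))
print(f"{n_zero} of {coefs.size} coefficients are exactly zero")
print("features kept:", np.flatnonzero(coefs))
```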
 

Neural Network Classifiers
--------------------------

MLPClassifier (i.e. neural network)
  - hidden_layer_sizes: default is (100,), i.e. one hidden layer of 100 units
  - activation: 'relu' (default), 'identity', 'logistic', 'tanh'
  - alpha: default 0.0001, increase for stronger regularization (always L2)
  - max_iter/tol: max number of steps to attempt and target tolerance to achieve
  - learning_rate_init: initial learning rate, default is 0.001
  - batch_size: minibatch size; the default 'auto' uses min(200, n_samples)
  - shuffle: default True, shuffle data between iterations
  - early_stopping/validation_fraction/n_iter_no_change: early stopping regularization
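
As a rough sketch of how the early stopping hyperparameters fit together, the example below trains a small MLPClassifier on synthetic data (an assumption) with scaled inputs. The layer sizes and alpha are arbitrary illustrative values, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the air travel dataset).
X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

mlp = make_pipeline(
    StandardScaler(),  # neural networks are sensitive to feature scale
    MLPClassifier(
        hidden_layer_sizes=(32, 16),
        alpha=1e-3,
        early_stopping=True,       # hold out part of the training data...
        validation_fraction=0.1,   # ...10% of it here...
        n_iter_no_change=10,       # ...and stop after 10 epochs without improvement
        max_iter=500,
        random_state=1,
    ),
)
mlp.fit(X_tr, y_tr)

mlp_accuracy = mlp.score(X_te, y_te)
n_epochs = mlp[-1].n_iter_  # epochs actually run before stopping
print(f"test accuracy: {mlp_accuracy:.3f} after {n_epochs} epochs")
```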
 

Tree-Based Classifiers
----------------------

All of these expose a feature_importances_ attribute after fitting, which can be useful during exploration. The features do not need to be scaled for these importances to be meaningful.

DecisionTreeClassifier
  - criterion: gini (default) or entropy
  - splitter: best (default) or random (faster)
  - max_depth, max_leaf_nodes, min_samples_split, min_samples_leaf, min_impurity_decrease, etc.: control tree growth; decrease max_* or increase min_* to regularize
  - presort: in older versions, setting this to True could speed up small datasets or restricted depths; it was deprecated in scikit-learn 0.22 and has since been removed
RandomForestClassifier and ExtraTreesClassifier
  - n_estimators: default 100
  - Supports all hyperparameters of DecisionTreeClassifier listed above except splitter (best for RandomForestClassifier, random for ExtraTreesClassifier) and presort
  - max_features: defaults to sqrt; also supports log2, an int (for a count), or a float (for a fraction of the features)
  - max_samples & bootstrap: by default each tree is given as many samples as the training set; RandomForestClassifier draws them with bootstrapping (bootstrap=True), while ExtraTreesClassifier defaults to bootstrap=False
GradientBoostingClassifier (or XGBClassifier: an improved version, but it requires an external library to be installed and has slightly different hyperparameters)
  - learning_rate: default 0.1; lower to increase regularization, higher to train faster
  - n_estimators: default 100; balance with learning_rate, can be fairly high though
  - subsample: default 1.0; values < 1.0 enable stochastic gradient boosting
  - n_iter_no_change/validation_fraction: early stopping regularization
  - Supports most max_* and min_* hyperparameters of DecisionTreeClassifier listed above
  - max_features: defaults to None (all features); also supports sqrt, log2, an int (for a count), or a float (for a fraction of the features)
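
A short sketch of feature_importances_ on unscaled synthetic data (an assumption), together with gradient boosting's early stopping. The hyperparameter values are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in data, deliberately left unscaled (tree models do not need scaling).
X, y = make_classification(n_samples=400, n_features=10, n_informative=3, random_state=2)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical column names

forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=2).fit(X, y)

boost = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    subsample=0.8,             # < 1.0 enables stochastic gradient boosting
    n_iter_no_change=10,       # stop early if the validation score stalls
    validation_fraction=0.1,
    random_state=2,
).fit(X, y)

# Rank features by the forest's impurity-based importances (they sum to 1).
order = np.argsort(forest.feature_importances_)[::-1]
for i in order[:5]:
    print(f"{feature_names[i]}: {forest.feature_importances_[i]:.3f}")

print("boosting stages actually fitted:", boost.n_estimators_)
```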
 

Instance-Based Classifiers
--------------------------

KNeighborsClassifier
  - n_neighbors: 5 (default)
  - weights: 'uniform' (default) or 'distance'
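
Both KNeighborsClassifier hyperparameters are cheap to search together. The sketch below does so with GridSearchCV on synthetic data (an assumption); the step names "scale" and "knn" and the grid values are illustrative choices. Scaling is included because distance-based models depend on feature scale.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the air travel dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=3)

knn_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(knn_pipe, param_grid, cv=3, scoring="f1")
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best mean F1: {search.best_score_:.3f}")
```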

In [None]:
# models

In [None]:
# save