## Read in train and test sets

In [1]:
import pandas as pd
import numpy as np

robots_train = pd.read_csv('robots_train.csv')
robots_test = pd.read_csv('robots_test.csv')
evil_train = pd.read_csv('evil_train.csv')
evil_test = pd.read_csv('evil_test.csv')
                           

# Drop the leftover index column that was saved along with each CSV
robots_train = robots_train.drop('Unnamed: 0', axis = 1)
robots_test = robots_test.drop('Unnamed: 0', axis = 1)
evil_train = evil_train.drop('Unnamed: 0', axis = 1)

### Exploratory Data Analysis

1. Get the info and summary stats for the `robots_train` dataframe.
    - What kinds of data types do you have?
    - Are there null values?
    - What's the scale of the variables?
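
For reference, here is a minimal sketch of the kind of calls that answer those three questions. It runs on a made-up toy frame (the column names `age` and `weight` are assumptions for illustration, not necessarily the real `robots_train` columns), so treat it as a pattern rather than the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "age": [1.0, 2.5, np.nan, 40.0],
    "weight": [100.0, 120.0, 110.0, 105.0],
})

toy.info()                      # dtypes and non-null counts per column
summary = toy.describe()        # count, mean, std, min, quartiles, max
null_counts = toy.isna().sum()  # number of missing values per column

print(summary)
print(null_counts)
```

Comparing the `min`/`max` and `std` rows of the summary across columns is a quick way to see whether the variables live on very different scales.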

In [71]:
# Answer here

1. Make a seaborn pairplot to compare the variables by filling in the blanks below. Make sure the plot shows whether each robot is evil or not.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


train_whole = pd.concat([robots_train, evil_train], axis = 1)

sns.pairplot(data= 'TYPE HERE' , hue = 'TYPE HERE' , plot_kws = {'alpha': 0.5})
plt.show()

2. Wait, what's going on with the `age` variable? Looks like a boxplot wouldn't capture that weirdness. Use a violin plot to show the distributions of the non-target variables, split by evilness.

In [None]:
# ANSWER HERE BY FILLING IN APPROPRIATE CODE

# Note: to get the data into a form seaborn likes for this, we have to use pd.melt - this one should be pretty easy
train_long = pd.melt(train_whole, id_vars = ['evil'])

sns.violinplot(data=train_long[train_long['variable'] == 'TYPE HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart")
sns.violinplot(data=train_long[train_long['variable'] == 'TYPE HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
sns.violinplot(data=train_long[train_long['variable'] == 'TYPE HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
sns.violinplot(data=train_long[train_long['variable'] == 'TYPE HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
plt.show()
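
If the `pd.melt` step feels opaque, the sketch below shows what it does to a tiny made-up frame (the `age` and `weight` columns are assumptions for illustration): the `id_vars` column is kept as-is, and every other column is stacked into `variable`/`value` pairs, which is the long format that the `x = 'variable', y = 'value'` arguments above expect.

```python
import pandas as pd

# Hypothetical wide frame: two feature columns plus the 'evil' target
wide = pd.DataFrame({
    "age": [3, 7],
    "weight": [100, 120],
    "evil": [0, 1],
})

# id_vars stay as columns; every other column is stacked into (variable, value) pairs
long = pd.melt(wide, id_vars=["evil"])
print(long)
```

Each original row now appears once per feature, so filtering on `long['variable']` picks out a single feature's values together with their `evil` labels.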


## Preprocessing

### Outliers

3. Ok, between the `pairplot` and `violinplot` above, it looks like there may be outliers. Let's implement a detection scheme that replaces them with nulls, which we will impute in later steps. Note we'll do this all in one fell swoop.

In [None]:
# Outlier Detection:

# Create an empty dataframe to hold the robots_train data with outliers replaced by NaNs
robots_train_nans = pd.DataFrame()

# Loop over training columns - note: we don't z-score the target
for pred in robots_train.columns:
    # Compute the column's z-scores once, then keep only the values with z <= 3
    z_scores = (robots_train[pred] - robots_train[pred].mean()) / robots_train[pred].std()
    robots_train_nans[pred] = robots_train[pred].where(z_scores <= 3)


#Use .value_counts() to examine the number of Nones after detection and replacement
print(robots_train_nans['ENTER TEXT HERE'].#INSERT APPROPRIATE METHOD HERE
    .value_counts())
print(robots_train_nans['ENTER TEXT HERE'].#INSERT APPROPRIATE METHOD HERE
    .value_counts())
print(robots_train_nans['ENTER TEXT HERE'].#INSERT APPROPRIATE METHOD HERE
    .value_counts())
print(robots_train_nans['ENTER TEXT HERE'].#INSERT APPROPRIATE METHOD HERE
    .value_counts())
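
To see the z-score rule in isolation, here is a small sketch on a made-up column (`speed` is a hypothetical name). It uses the same one-sided `z <= 3` cut as the loop above; using the absolute z-score instead would catch unusually low values as well, which is a design choice to think about.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 19 ordinary values plus one extreme reading
toy = pd.DataFrame({"speed": list(range(19)) + [1000]}, dtype=float)

cleaned = pd.DataFrame()
for col in toy.columns:
    z = (toy[col] - toy[col].mean()) / toy[col].std()
    # Keep values whose z-score is at most 3; everything else becomes NaN
    cleaned[col] = toy[col].where(z <= 3)

# Count how many values were replaced with NaN
n_removed = int(cleaned["speed"].isna().sum())
print(n_removed)
```

Only the extreme reading is masked here; the ordinary values pass through unchanged.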


4. Just for a sanity check, let's see how the data look now after outlier removal. Recreate your violin plots from above using the new outlier-free data. Is there anything you notice about the distributions now that outliers are gone? Note that the quartiles are inside of the violin plots

In [None]:
# First put the outlier-free train data back together with the target variables

train_and_targ_with_nans = #SOME PANDAS FUNCTION HERE#([robots_train_nans, evil_train], axis = 1)

# Do the melting again like above
train_long = pd.melt(train_and_targ_with_nans, id_vars = ['evil'])

sns.violinplot(data=train_long[train_long['variable'] == 'ENTER TEXT HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart")
sns.violinplot(data=train_long[train_long['variable'] == 'ENTER TEXT HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
sns.violinplot(data=train_long[train_long['variable'] == 'ENTER TEXT HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
sns.violinplot(data=train_long[train_long['variable'] == 'ENTER TEXT HERE'], x = 'variable', y = 'value', hue = 'evil', split =True ,inner="quart", legend = False)
plt.show()


### Scaling, Imputing, Rebalancing

5. Looking at the plots above, our data seems to be on different scales. We should get all the fields on a comparable scale (Why?). Also we have all these null values from the original data and the outlier removal, so let's impute them. Let's make this a little more interesting and use a KNNImputer. Below, I've sketched out the steps of putting this all together in a `Pipeline` object along with dealing with the imbalance via SMOTE by oversampling the evil class. Fill in the necessaries.
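
As a warm-up before wiring up the full pipeline, the sketch below chains just the scaling and imputing steps by hand on a made-up array. The data and the choice of `n_neighbors=2` are assumptions for illustration only. It relies on two documented behaviours: `StandardScaler` ignores NaNs when fitting and leaves them in place when transforming, and `KNNImputer` fills each NaN using the nearest rows.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, with one missing value
X = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, np.nan],
    [4.0, 4000.0],
])

# Scale first: NaNs are skipped during fit and preserved during transform
X_scaled = StandardScaler().fit_transform(X)

# Then impute: each NaN is filled from that feature's values in the nearest rows
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_scaled)
print(X_imputed)
```

Scaling before imputing matters because the imputer picks neighbours by distance, and on the raw data the large-valued column would dominate that distance.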

In [None]:
# Note that for imbalanced data pipelines, we have to use the imblearn package's Pipeline objects
# Uncomment the following line and run it to install that package - should only have to do this once per new codespace
# pip install imblearn


# Scale variables - Mostly look normalish - for age, let's just see what happens
from sklearn.preprocessing import StandardScaler
# KNN Impute Outliers and nulls
from sklearn.impute import KNNImputer
#Import imblearn pipeline and SMOTE classes
from imblearn.pipeline import Pipeline as imbPipe
from imblearn.over_sampling import SMOTE


preprocess = imbPipe(
    [
        ('scaler', # PUT SCALING STEP IN HERE WITH ANY OPTIONS YOU THINK APPROPRIATE),
        ('knn_imp', #PUT KNN IMPUTER HERE WITH ANY OPTIONS YOU THINK APPROPRIATE)
        ('smote', SMOTE(random_state=1234)) # I went ahead and did the SMOTE for you
    ]
)



# I did this part -note we use fit_resample instead of fit_transform
X_train_smote, y_train_smote = preprocess.fit_resample(robots_train_nans, evil_train)

# Verify that the data has been rebalanced in the next line:
print(#Balance confirmation here)

# Lastly, just verify SMOTE worked by doing a new pairplot
sns.pairplot(data = pd.concat([pd.DataFrame(X_train_smote, columns = #Some Object Here.get_feature_names_out()), y_train_smote], axis = 1), hue = 'evil')#, alpha = 0.5)

## Modeling

In [None]:
# Set up some stuff that'll be used in all the models
from sklearn.metrics import confusion_matrix  # needed by confusion_matrix_scorer below
from sklearn.model_selection import StratifiedKFold

n_folds = ? # choose your k
kf = StratifiedKFold(n_splits=n_folds, random_state=43, shuffle=True)

scoring_dict = {
     'accuracy': 'accuracy',
     'precision': 'precision',
     'recall': 'recall',
     'f1': 'f1'
}

def confusion_matrix_scorer(clf, X, y):
     y_pred = clf.predict(X)
     cm = confusion_matrix(y, y_pred)
     return {'tn': cm[0, 0], 'fp': cm[0, 1],
             'fn': cm[1, 0], 'tp': cm[1, 1]}
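
Here is a small sketch of what a dict-returning scorer like this produces when handed to `cross_validate`. Everything in it (the synthetic data from `make_classification`, the plain `LogisticRegression`, the 3 folds, the name `cm_scorer`) is an assumption chosen just to make the example run quickly; the point is the shape of the output.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_validate

def cm_scorer(clf, X, y):
    # Same idea as confusion_matrix_scorer: one entry per confusion-matrix cell
    cm = confusion_matrix(y, clf.predict(X))
    return {"tn": cm[0, 0], "fp": cm[0, 1], "fn": cm[1, 0], "tp": cm[1, 1]}

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=60, n_features=4, random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = cross_validate(LogisticRegression(), X, y,
                         scoring=cm_scorer, cv=cv, return_train_score=True)

# Each dict key comes back as a per-fold array, prefixed with test_ or train_
print(sorted(k for k in results if k.startswith("test_")))
held_out_total = sum(results["test_" + k].sum() for k in ("tn", "fp", "fn", "tp"))
print(held_out_total)
```

Because every sample lands in exactly one held-out fold, the four `test_` cells summed over the folds add up to the number of samples.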



In [None]:
# Here's a quick function to display our results

def get_eval_metrics_report(reg_cv_results, cm_cv_results):
    cm_cv_results_tr = pd.DataFrame(data = [cm_cv_results['train_tn'], cm_cv_results['train_fp'], cm_cv_results['train_fn'], cm_cv_results['train_tp']]).T
    cm_cv_results_tr.columns= ['train_tn', 'train_fp', 'train_fn', 'train_tp']
    
    print("Training Evaluation Metrics: ")
    print("=============================")
    avg_cm_tr = np.array([(np.mean(cm_cv_results_tr['train_tn']), np.mean(cm_cv_results_tr['train_fp'])),
                         (np.mean(cm_cv_results_tr['train_fn']), np.mean(cm_cv_results_tr['train_tp']))])
    print("Average Confusion Matrix across training Folds: \n{}".format(avg_cm_tr))

    sum_cm_tr = np.array([(np.sum(cm_cv_results_tr['train_tn']), np.sum(cm_cv_results_tr['train_fp'])),
                          (np.sum(cm_cv_results_tr['train_fn']), np.sum(cm_cv_results_tr['train_tp']))])
    print("Overall confusion matrix across training folds: \n{}".format(sum_cm_tr))

    for col in reg_cv_results.keys():
        if col.startswith('train'):
            print('Average Training {}: {}'.format(col, round(np.mean(reg_cv_results[col]), 3)))

    print('\n')
    
    cm_cv_results_te = pd.DataFrame(data = [cm_cv_results['test_tn'], cm_cv_results['test_fp'], cm_cv_results['test_fn'], cm_cv_results['test_tp']]).T
    cm_cv_results_te.columns= ['test_tn', 'test_fp', 'test_fn', 'test_tp']

    print("Validation Evaluation Metrics: ")
    print("=============================")
    avg_cm_te = np.array([(np.mean(cm_cv_results_te['test_tn']), np.mean(cm_cv_results_te['test_fp'])),
                         (np.mean(cm_cv_results_te['test_fn']), np.mean(cm_cv_results_te['test_tp']))])
    print("Average Confusion Matrix across validation Folds: \n{}".format(avg_cm_te))

    sum_cm_te = np.array([(np.sum(cm_cv_results_te['test_tn']), np.sum(cm_cv_results_te['test_fp'])),
                          (np.sum(cm_cv_results_te['test_fn']), np.sum(cm_cv_results_te['test_tp']))])
    print("Overall confusion matrix across validation folds: \n{}".format(sum_cm_te))

    for col in reg_cv_results.keys():
        if col.startswith('test'):
            print('Average Validation {}: {}'.format(col, round(np.mean(reg_cv_results[col]), 3)))
    return None

In [None]:
# Function to use cross-validation to get roc curve across folds
# NOTE: I completely stole this from an answer to the question at https://stackoverflow.com/questions/29656550/how-to-plot-pr-curve-over-10-folds-of-cross-validation-in-scikit-learn
from numpy import interp
from sklearn.metrics import accuracy_score, auc, average_precision_score, confusion_matrix, roc_curve, precision_recall_curve
def draw_cv_roc_curve(classifier, cv, X, y, title='ROC Curve'):
    """
    Draw a Cross Validated ROC Curve.
    Args:
        classifier: Classifier Object
        cv: StratifiedKFold Object: (https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation)
        X: Feature Pandas DataFrame
        y: Response Pandas Series
    Example largely taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc_crossval.html#sphx-glr-auto-examples-model-selection-plot-roc-crossval-py
    """
    # Creating ROC Curve with Cross Validation
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)

    i = 0
    for train, test in cv.split(X, y):
        probas_ = classifier.fit(X.iloc[train], y.iloc[train]).predict_proba(X.iloc[test])
        # Compute the ROC curve and the area under the curve
        fpr, tpr, thresholds = roc_curve(y.iloc[test], probas_[:, 1])
        tprs.append(interp(mean_fpr, fpr, tpr))
        
        tprs[-1][0] = 0.0
        roc_auc = auc(fpr, tpr)
        aucs.append(roc_auc)
        plt.plot(fpr, tpr, lw=1, alpha=0.3,
                 label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))

        i += 1
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
             label='Luck', alpha=.8)
    
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    plt.plot(mean_fpr, mean_tpr, color='b',
             label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
             lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                     label=r'$\pm$ 1 std. dev.')

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(title)
    plt.legend(loc="lower right")
    plt.show()
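
The one non-obvious trick in the function above is the interpolation: each fold's ROC curve is evaluated at different FPR values, so the TPRs are resampled onto a shared grid (`mean_fpr`) before averaging. A minimal sketch with two made-up fold curves (the numbers are hypothetical):

```python
import numpy as np

# Two hypothetical folds whose ROC curves were evaluated at different FPR values
fold_curves = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.8, 1.0])),
    (np.array([0.0, 0.25, 1.0]), np.array([0.0, 0.5, 1.0])),
]

# Shared grid of FPR values: 0, 0.25, 0.5, 0.75, 1
mean_fpr = np.linspace(0, 1, 5)

# Linearly interpolate each fold's TPR onto the shared grid, then average pointwise
tprs = [np.interp(mean_fpr, fpr, tpr) for fpr, tpr in fold_curves]
mean_tpr = np.mean(tprs, axis=0)
print(mean_tpr)
```

The function above does the same thing with a 100-point grid and additionally pins the first and last points of the mean curve to 0 and 1.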

In [None]:
# Also stolen - Precision-Recall curve
def draw_cv_pr_curve(classifier, cv, X, y, title='PR Curve'):
    """
    Draw a Cross Validated PR Curve.
    Keyword Args:
        classifier: Classifier Object
        cv: StratifiedKFold Object: (https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation)
        X: Feature Pandas DataFrame
        y: Response Pandas Series
        
    Largely taken from: https://stackoverflow.com/questions/29656550/how-to-plot-pr-curve-over-10-folds-of-cross-validation-in-scikit-learn
    """
    y_real = []
    y_proba = []

    i = 0
    for train, test in cv.split(X, y):
        probas_ = classifier.fit(X.iloc[train], y.iloc[train]).predict_proba(X.iloc[test])
        # Compute the precision-recall curve for this fold
        precision, recall, _ = precision_recall_curve(y.iloc[test], probas_[:, 1])
        
        # Plotting each individual PR Curve
        plt.plot(recall, precision, lw=1, alpha=0.3,
                 label='PR fold %d (AUC = %0.2f)' % (i, average_precision_score(y.iloc[test], probas_[:, 1])))
        
        y_real.append(y.iloc[test])
        y_proba.append(probas_[:, 1])

        i += 1
    
    y_real = np.concatenate(y_real)
    y_proba = np.concatenate(y_proba)
    
    precision, recall, _ = precision_recall_curve(y_real, y_proba)

    plt.plot(recall, precision, color='b',
             label=r'Precision-Recall (AUC = %0.2f)' % (average_precision_score(y_real, y_proba)),
             lw=2, alpha=.8)

    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(title)
    plt.legend(loc="lower right")
    plt.show()

### Logistic Regression

6. Let's start with detecting evil robots through logistic regression. Before going ahead though, let me give a little overview of it in lecture. After that's done, implement a logistic regression model for detection below using k-fold CV.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.metrics import confusion_matrix, classification_report

num_folds = # Put your choice for k here

# We're going to do something a little weird by including a confusion matrix in our scores to be used in evaluation below
# see https://scikit-learn.org/stable/modules/model_evaluation.html#using-multiple-metric-evaluation

lr_pipe = imbPipe(
    [
        ('scaler', StandardScaler()), # PUT SCALING STEP IN HERE WITH ANY OPTIONS YOU THINK APPROPRIATE),
        ('knn_imp', KNNImputer(n_neighbors= 5)), #PUT KNN IMPUTER HERE WITH ANY OPTIONS YOU THINK APPROPRIATE)
        ('smote', SMOTE(random_state=1234)), # I went ahead and did the SMOTE for you
        ('lr', LogisticRegression(random_state=1234)) # Anything else we should specify here?
    ]
)



lr_cv = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= scoring_dict, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting

# Couldn't get the confusion matrix scorer to work in conjunction with others, so let's keep it separate
lr_cv_cm = cross_validate(# What goes here,
                        #X = ?, y = ?, 
                        scoring= confusion_matrix_scorer, # Let's get a bunch of metrics 
                        cv = kf , return_train_score=True) # Let's return train scores to detect overfitting


### K-Nearest Neighbors Classification

8. Let's do the same thing using the k-Nearest Neighbors algorithm.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k = #? # Pick your preferred k

knn_pipe = imbPipe(
    [
        ('scaler', StandardScaler()), # PUT SCALING STEP IN HERE WITH ANY OPTIONS YOU THINK APPROPRIATE),
        ('knn_imp', KNNImputer(n_neighbors= 5)), #PUT KNN IMPUTER HERE WITH ANY OPTIONS YOU THINK APPROPRIATE)
        ('smote', SMOTE(random_state=1234)), # I went ahead and did the SMOTE for you
        ('knn', KNeighborsClassifier(n_neighbors = k, metric = 'euclidean')) # Anything else we should specify here?
    ]
)


knn_cv = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= scoring_dict, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting

# Couldn't get the confusion matrix scorer to work in conjunction with others, so let's keep it separate
knn_cv_cm = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= confusion_matrix_scorer, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting

### Naïve Bayes Classification

9. Will Naïve Bayes classification work? Let's figure out what we're doing first.

In [None]:
from sklearn.naive_bayes import GaussianNB


# TODO figure out what to do with age variable - not really Gaussian now, is it?

gnb_pipe = imbPipe(
    [
        ('scaler', StandardScaler()), # PUT SCALING STEP IN HERE WITH ANY OPTIONS YOU THINK APPROPRIATE),
        ('knn_imp', KNNImputer(n_neighbors= 5)), #PUT KNN IMPUTER HERE WITH ANY OPTIONS YOU THINK APPROPRIATE)
        ('smote', SMOTE(random_state=1234)), # I went ahead and did the SMOTE for you
        ('gnb', GaussianNB()) # Anything else we should specify here?
    ]
)



gnb_cv = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= scoring_dict, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting


# Couldn't get the confusion matrix scorer to work in conjunction with others, so let's keep it separate
gnb_cv_cm = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= confusion_matrix_scorer, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting



### Support Vector Machines

10. For our last classification algorithm of the day, let's take a look at support vector machines

In [None]:
from sklearn.svm import SVC


svc_pipe = imbPipe(
    [
        ('scaler', StandardScaler()), # PUT SCALING STEP IN HERE WITH ANY OPTIONS YOU THINK APPROPRIATE),
        ('knn_imp', KNNImputer(n_neighbors= 5)), #PUT KNN IMPUTER HERE WITH ANY OPTIONS YOU THINK APPROPRIATE)
        ('smote', SMOTE(random_state=1234)), # I went ahead and did the SMOTE for you
        ('svc', SVC()) # Anything else we should specify here? (Hint: the curve-plotting functions call predict_proba on this pipeline)
    ]
)

num_folds = 5 # Put your choice for k here


svc_cv = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= scoring_dict, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting

# Couldn't get the confusion matrix scorer to work in conjunction with others, so let's keep it separate
svc_cv_cm = cross_validate(# What goes here?,
                        #X = ?, y = ?, 
                        scoring= confusion_matrix_scorer, # Let's get a bunch of metrics 
                        cv = kf , return_estimator=True, return_train_score=True) # Let's return train scores to detect overfitting


### Evaluating on Training/Validation Data

7. Now, just run the next two cells to see some evaluation metrics. Note that the "test" or "validation" data is measured only on the holdout folds of the training data - we still haven't touched the true test set!

In [None]:
full_results_dict = {'Logistic Regression': [lr_cv, lr_cv_cm, lr_pipe],
                     'K-Nearest-Neighbors': [knn_cv, knn_cv_cm, knn_pipe],
                     'Gaussian Naive Bayes': [gnb_cv, gnb_cv_cm, gnb_pipe],
                     'Support Vector Machine': [svc_cv, svc_cv_cm, svc_pipe]}

In [None]:
for model in full_results_dict.keys():
    print("+++++++++++++++++++++++++++++++++++++++++++++++++")
    print('+++ EVALUATION METRICS FOR {0} +++'.format(model))
    print('+++++++++++++++++++++++++++++++++++++++++++++++++')
    get_eval_metrics_report(full_results_dict[model][0], full_results_dict[model][1])
    draw_cv_roc_curve(full_results_dict[model][2], cv = kf, X = robots_train_nans, y = evil_train['evil'], title = '{0} ROC Curve on Train and Validation Folds'.format(model))
    draw_cv_pr_curve(full_results_dict[model][2], cv = kf, X = robots_train_nans, y = evil_train['evil'], title = '{0} PR Curve on Train and Validation Folds'.format(model))
    print('++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('++++++++++++++++++++++++++++++++++++++++++++++++++')
    print('\n\n\n')



### PAUSE:
In a real-life situation, what should we have done, and what should we do next?

### Putting it all together and evaluating on test data

11. Let's put a bow on it, look at our metrics, get some visuals, evaluate on test data, then call it a day for this exercise, ok? First, let's get all our metrics together and make some predictions, and then use those to get a ROC curve. Normally, at this point we'd start evaluating and choosing hyperparameters for our model, but that's coming in a couple of weeks.

In [None]:
# Loop over the models and evaluate on Test - get ROC curve:

pipeline_dict = {'Logistic Regression': lr_pipe,
                'K-Nearest-Neighbors': knn_pipe,
                'Gaussian Naive Bayes': gnb_pipe,
                'Support Vector Machine': svc_pipe
}

# Finally, let's see how we do on the test set.
# Note: each pipeline must already be fitted before predict_proba is called on it.
from sklearn.metrics import auc, precision_recall_curve, roc_curve
for model in pipeline_dict.keys():
    y_pred = pipeline_dict[model].predict_proba(X = robots_test)
    fpr, tpr, thresholds = roc_curve(evil_test['evil'].to_numpy(), y_pred[:,1])
    auc_val = auc(fpr, tpr)

    plt.plot(fpr, tpr, label='{0} (AUC = {1})'.format(model, round(auc_val,3)))
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC on Test Set')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()


In [None]:
# And lastly get some classification reports
from sklearn.metrics import classification_report

for model in full_results_dict.keys():
    print(model)
    y_pred = full_results_dict[model][2].predict(X = robots_test)
    print(classification_report(evil_test['evil'].to_numpy(), y_pred))