# Module 4: Exercise B

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

## Explore and Prepare Data

We will be using a cleaned-up dataset from an income survey, which includes demographic data. The target variable indicates whether the income exceeds $50K per year, based on census data.

Let's import and explore the "income_cleaned.csv" file:

In [None]:
income = pd.read_csv('income_cleaned.csv')

>__Task 1__
>
>Display the meta information and the first 5 rows of the DataFrame

In [None]:
# Check meta information about the features
...

In [None]:
# Check the first 5 lines
...

Let's first visualize these data.

>__Task 2__
>
>Create a jointplot for each pair of these features:
>
>- __age__ and __education-num__
>- __age__ and __hours-per-week__
>- __education-num__ and __hours-per-week__

In [None]:
...

>__Task 3__
>
> Convert the categorical columns __workclass__, __occupation__, __race__, and __sex__ to numerical variables
>
>- Add the encoded columns to the original DataFrame (Hint: set `drop_first` parameter)
>- Drop these four categorical columns
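To illustrate what the `drop_first` hint does, here is a minimal sketch on a hypothetical toy DataFrame (the `size` and `color` columns are made up and are not part of the income dataset). It uses `pd.get_dummies` with the `columns` argument, which is one possible way to encode a column and replace the original in a single step:

```python
import pandas as pd

# Hypothetical toy data; not the income dataset.
toy = pd.DataFrame({
    "size": [1, 2, 3, 4],
    "color": ["red", "green", "blue", "green"],
})

# One-hot encode "color". drop_first=True drops the first category
# (alphabetically "blue"), which is implied when all other dummies are 0.
encoded = pd.get_dummies(toy, columns=["color"], drop_first=True)

print(encoded.columns.tolist())
```

With three categories, only two dummy columns remain (`color_green` and `color_red`), which avoids a redundant column.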

In [None]:
...

## Split Data

>__Task 4__
>
>- Assign the __income_50k__ column to `y` and the remaining columns to `X`
>- Apply the `train_test_split` function with an 80:20 (train:test) ratio and set `random_state` to 144
>- Make sure your function returns `X_train`, `X_test`, `y_train`, `y_test`

In [None]:
...

## Apply a Naive Bayes Classifier

>__Task 5__
>
> Train and evaluate a Gaussian NB model
>
>- Initiate the model
>- Fit the model on a train set pair: `X_train` and `y_train`
>- Predict on test set `X_test`
>- Compare predictions with actual `y_test` values using accuracy score

In [None]:
...

# Initiate the model
...

# Fit the model
...

# Predict
...

# Calculate the accuracy
...

print(...)

## Apply a KNN Classifier

Remember, __feature scaling__ is a crucial step in the data preprocessing pipeline, particularly for distance-based algorithms like KNN. For instance, if we have two features with scales [0,1] and [1000,2000], the latter feature with larger magnitudes will dominate the distance calculations. Consequently, the feature with the smaller scale will have minimal impact on the distance calculation and, therefore, the KNN predictions.
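The effect can be sketched with two hypothetical points whose features lie in the ranges mentioned above. The points and the min/max values are assumptions chosen only for illustration:

```python
import numpy as np

# Two hypothetical points: feature 0 lies in [0, 1], feature 1 in [1000, 2000].
a = np.array([0.1, 1200.0])
b = np.array([0.9, 1250.0])

# Raw Euclidean distance is driven almost entirely by feature 1.
raw_dist = np.linalg.norm(a - b)

# Min-max scale each feature to [0, 1] using the stated ranges.
mins = np.array([0.0, 1000.0])
maxs = np.array([1.0, 2000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)
scaled_dist = np.linalg.norm(a_scaled - b_scaled)

print(f"raw distance:    {raw_dist:.3f}")
print(f"scaled distance: {scaled_dist:.3f}")
```

Before scaling, the large difference in feature 0 (0.8 of its full range) barely changes the distance. After scaling, it is the main contributor.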

>__Task 6__
>
>Apply MinMax normalization to the data
>
>- Initiate a `MinMaxScaler` object with a scale of [0,1]
>- Fit on the train set and transform it to `X_train_scaled`
>- Transform the test set to `X_test_scaled`
>- Check `X_train_scaled`
>
>__Note:__ You should always fit the scaler on the training set only and then apply the fitted scaler to the test set. Fitting the scaler on the test set (or on the full dataset before splitting) could result in data leakage, which would bias the performance results of the model.
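The fit-then-transform order can be sketched on hypothetical toy arrays (these stand in for a train/test split and are not the exercise data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy arrays standing in for a train/test split.
train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [6.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# Learn min and max from the training data only, then transform it.
train_scaled = scaler.fit_transform(train)

# Reuse the training min and max on the test data; do not refit.
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Because the test values here fall outside the training range, their scaled values fall outside [0, 1]. That is expected: the scaler only knows the minimum and maximum it saw during fitting.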

In [None]:
...

# Initiate the scaler
...

# Fit the scaler to the train set and transform it
...

# Apply the same scaler to the test set
...

X_train_scaled

>__Task 7__
>
>Train and evaluate a KNN model
>
>- Initiate the model with a k value of 15
>- Fit the model on a train set pair: `X_train` and `y_train`
>- Predict on test set `X_test`
>- Compare predictions with actual `y_test` values using accuracy score

In [None]:
...

# Initiate the model
...

# Fit the model
...

# Predict
...

# Calculate the accuracy
...

print(...)

>__Task 8__
>
> Repeat the above steps with scaled train set and test set
>
>- Fit the model on a train set pair: `X_train_scaled` and `y_train`
>- Predict on test set `X_test_scaled`
>- Compare predictions with actual `y_test` values using accuracy score

In [None]:
# Fit the model
...

# Predict
...

# Calculate the accuracy
...

print(...)

Tasks 7 and 8 illustrate the impact of feature scaling on KNN. Interestingly, the accuracy of the KNN model with scaled features is slightly lower. This discrepancy could occur when a large-scale or dominant feature happens to hold high predictive power. 

Scaling levels the playing field by treating all features equally, which may inadvertently reduce the overall predictive power, especially if a dominant feature played a crucial role in predictions before scaling. This is an uncommon case, however, and scaling the features remains generally accepted practice for KNN models.

## Evaluate the Performance

### Plot ROC Curves

>__Task 9__
>
>Use the model from Task 5 to predict the probability of `income_50k` on `X_test` and plot the ROC curve for the NB model's predictions

In [None]:
...

>__Task 10__
>
>Use the model from Task 8 to predict the probability of `income_50k` on `X_test_scaled` and plot the ROC curve for the predictions of the KNN model (with scaled features)

In [None]:
...

### Calculate Evaluation Metrics

>__Task 11__
>
>Use `classification_report` to print out _accuracy_, _precision_, _recall_, and _F1_ scores for both the NB and KNN (scaled features) models

In [None]:
...

If you’re not using `classification_report` and are directly calculating each metric instead, you can specify how the average will be calculated for precision, recall, and F1 scores. For example, you could set `average='micro'`. 

For multiclass or multilabel targets, the `average` parameter must be set explicitly, because the default (`average='binary'`) applies only to binary targets. If you set `average=None`, the scores for each class are returned.
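As a small sketch of how the `average` argument changes the result, the example below computes precision on hypothetical three-class labels (the label lists are made up for illustration):

```python
from sklearn.metrics import precision_score

# Hypothetical three-class labels, made up for illustration.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2]

# average=None: one precision value per class, in sorted label order.
per_class = precision_score(y_true, y_pred, average=None)

# 'micro': pool all predictions, then compute a single precision.
micro = precision_score(y_true, y_pred, average="micro")

# 'macro': unweighted mean of the per-class values.
macro = precision_score(y_true, y_pred, average="macro")

# 'weighted': mean of the per-class values weighted by class support.
weighted = precision_score(y_true, y_pred, average="weighted")

print(per_class, micro, macro, weighted)
```

The three averaged values differ here because the classes have unequal support and unequal per-class precision. The same `average` options are accepted by `recall_score` and `f1_score`.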

>__Task 12__
>
>Calculate _accuracy_, _precision_, _recall_, and _F1_ scores for the NB model
>
>- Set `average` to `weighted` for _precision_, _recall_, and _F1_
>
>Are these metrics the same as those in the classification report from Task 11?

In [None]:
...

>__Task 13__
>
>Repeat the above task and calculate _accuracy_, _precision_, _recall_, and _F1_ scores for the KNN model

In [None]:
...