## Part 1: Dealing with Class Imbalance

### SMOTE: Synthetic Minority Oversampling Technique

Our first example shows how to use SMOTE on an imbalanced dataset to generate a new, rebalanced dataset.
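
To give a feel for what "synthetic" means here, the following is a minimal sketch of the core SMOTE idea in plain `numpy`: pick a minority sample, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them. The function name `smote_like_sample` and its parameters are made up for illustration; this is not the `imblearn` implementation, which handles many more details.

```python
import numpy as np

def smote_like_sample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    # Pairwise Euclidean distances between all minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)                     # random base sample per new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor of each base
    gap = rng.random((n_new, 1))                              # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Hypothetical minority class: 20 points with 2 features.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_like_sample(X_min, n_new=30)
print(X_new.shape)
```

Because every new point lies between two existing minority points, the synthetic data stays inside the region already covered by the minority class.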

### Alert:
1. SMOTE can be slow when the data is large.
1. It works on both binary and multiclass classification data.

### Imbalanced Learn Library
See [Imbalanced-Learn documentation](https://imbalanced-learn.org/stable/).

In [None]:
import imblearn
from sklearn.datasets import make_classification
import numpy as np
import warnings
warnings.filterwarnings('ignore')

We use the following code snippet to generate a classification dataset with an imbalanced target, where the degree of imbalance is set by the `weights` argument below.

In [None]:
%%time
from imblearn.over_sampling import SMOTE 
sample_size = 10**5
X, y = make_classification(n_classes = 3, 
                           class_sep = 2, 
                           weights = [0.05, 0.1, 0.85], 
                           n_informative = 3, 
                           n_redundant = 1, 
                           flip_y = 0, 
                           n_features = 20, 
                           n_clusters_per_class = 1, 
                           n_samples = sample_size, 
                           random_state = 10)

We can use `np.unique` to get counts for each class.

In [None]:
np.unique(y, return_counts = True)
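
As a quick reference, here is a small sketch of what `np.unique` returns when `return_counts = True`, using a made-up label array rather than the dataset above.

```python
import numpy as np

# Hypothetical label array with three classes.
labels = np.array([2, 0, 2, 1, 2, 2, 1, 2])

# classes holds the sorted unique labels; counts holds how often each one occurs.
classes, counts = np.unique(labels, return_counts=True)
print(classes)
print(counts)
```

The two arrays line up by position, so `counts[i]` is the number of times `classes[i]` appears.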

We now use SMOTE to rebalance the dataset.

In [None]:
%%time
sm = SMOTE(random_state = 42)
# fit_sample was removed from imbalanced-learn; fit_resample is the current method name
X_res, y_res = sm.fit_resample(X, y)

### Exercise

1. The new dataset should show equal counts for each class. Verify that that is the case.
1. Return to the above data and increase the sample size to `10**6` (one million). Find out how long it takes to generate the data, and how long it takes to run `SMOTE`.

### End of exercise

## Part 2: Binary classification of Boston housing data

The Boston dataset has housing data including median price. We create a binary label to flag the most expensive houses and build a classifier to predict the likelihood of a house being expensive.

In [None]:
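# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import requires scikit-learn < 1.2.
from sklearn.datasets import load_boston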
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
boston = load_boston()
boston.keys()

Note that the `boston` object above behaves like a Python dictionary, which includes the features in `boston['data']`, the target (the housing price) in `boston['target']`, and additional metadata, such as a description.

### Exercise

Print a description of the `boston` dataset and read what each column represents.

### End of exercise

Let's visualize the first few rows of the `boston` data.

In [None]:
boston['data'][:5]

As you can see, this is not very pretty. The reason is that `boston['data']` is a NumPy array, similar to a matrix. Of course, at the end of the day all tabular data is turned into an array so that we can do linear algebra with it, but for the sake of visualization this is not ideal. The solution is to take the raw array and turn it into a `DataFrame` using the `pandas` library, which was created for this purpose. It allows us to interact with the data in a more code-friendly and intuitive way.

In [None]:
df_boston = pd.DataFrame(boston['data'], columns = boston['feature_names'])
df_boston.head()

So `df_boston` is a `DataFrame` that represents the `boston['data']` array. In fact, if we need to go back to the array, we can just type `df_boston.values`, but we rarely need to do that. Using `pandas` we can visualize, process, and summarize the data in an easier way than if we had to do it using `numpy` directly. Of course, `pandas` itself uses `numpy` to do this behind the scenes, but this is mostly hidden from us. This is why we say `DataFrame` is an **abstraction layer** on top of `numpy` so data scientists can do their most common tasks without having to use `numpy` directly.
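
The round trip between an array and a `DataFrame` can be sketched on a tiny made-up array. The column names `a` and `b` are placeholders, and `.to_numpy()` is used here as the method form of `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array standing in for a raw feature matrix.
arr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
df = pd.DataFrame(arr, columns=['a', 'b'])

col_means = df.mean()   # labeled result: one mean per named column
back = df.to_numpy()    # the same data as a plain NumPy array again
print(col_means)
print(np.array_equal(back, arr))
```

The summary comes back labeled by column name, which is the kind of convenience the abstraction layer provides.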

### Exercise

As one example, recall that earlier in the notebook we used `np.unique(...)` to get counts. Use it to get counts for each unique value of the `RAD` column in `df_boston`, in other words `df_boston['RAD']`.

Now turn the counts into percentages instead.

Since getting counts and turning them into percentages is such a common data-related task, there's got to be an easier way to do it. And there is. Search online to see if `pandas` offers a function for getting unique counts for a column in the data. Can you turn the counts into percentages?

### End of exercise

Let's now visualize the target variable, housing price.

In [None]:
%matplotlib inline
ax = sns.distplot(boston['target'])

Say we're interested in training a classification algorithm to predict whether or not a house is worth 40k or more. The target is expressed in thousands of dollars, so first we create a target column in the data that flags houses that sold for 40k or more.

In [None]:
df_boston['is_above_40k'] = boston['target'] >= 40

In [None]:
df_boston.head()

We start by splitting `df_boston` into a training set and a testing set. The easiest way to do this is using the `train_test_split` function.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df_boston.drop(columns = 'is_above_40k'), 
                                                    df_boston['is_above_40k'], 
                                                    test_size = 0.20, 
                                                    random_state = 0)
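
When the label is imbalanced, a purely random split can leave the test set with very few positive cases. One option, not used in the cell above, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. The sketch below uses made-up data (`X_demo`, `y_demo`) with 10% positives.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))        # hypothetical features
y_demo = np.array([1] * 20 + [0] * 180)   # hypothetical label with 10% positives

# stratify=y_demo asks for the same class mix in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo)

# Share of positives in each part; both should be close to 0.10.
print(y_tr.mean(), y_te.mean())
```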

### Exercise

Find counts for `is_above_40k`.

Train a logistic regression classifier to predict when the price of a house is above 40k. Begin by loading the library as such: `from sklearn.linear_model import LogisticRegression`. Then create an instance of the algorithm and train it by invoking the `.fit(x_train, y_train)` method.

Once the model is trained, pass it the testing data to see if we get predictions back. To do so, we invoke the `.predict(x_test)` method. We can also invoke the `.predict_proba(x_test)` method if we wish to get the raw probabilities instead of the final predictions.

Get the accuracy of the model by loading `from sklearn.metrics import accuracy_score` and calling the `accuracy_score` function. What two arguments do we pass to this function to evaluate the model's accuracy?

Is accuracy a good metric for evaluating this model? Why or why not? To give some context, let's say you're a developer and want to predict house prices. You would rather bid low and lose a bid than bid high for a house that's not worth it.

### End of exercise

Let's find some more useful evaluation metrics. The most direct metric to look at is the confusion matrix.

In [None]:
from sklearn import metrics

# Assumes the predictions from .predict(x_test) in the exercise above were stored in y_test_pred
cm = metrics.confusion_matrix(y_test, y_test_pred)
print(cm)

From the confusion matrix, we can derive accuracy, precision, recall, and the F1-score, which is the harmonic mean of precision and recall. We don't have time to get into all of them in detail, but [here](http://www.win-vector.com/blog/2009/11/i-dont-think-that-means-what-you-think-it-means-statistics-to-english-translation-part-1-accuracy-measures/) is an excellent article I recommend you read.
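
As a small worked example, here is how these four numbers fall out of a 2x2 confusion matrix. The counts in `cm_demo` are made up for illustration. The layout follows scikit-learn's convention, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm_demo = np.array([[90, 5],
                    [3, 2]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()                  # share of all predictions that are correct
precision = tp / (tp + fp)                            # of predicted positives, how many are real
recall = tp / (tp + fn)                               # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of precision and recall
print(accuracy, precision, recall, f1)
```

In this made-up case accuracy is 0.92 even though the model finds only 2 of the 5 positive cases, which is why accuracy alone can be misleading on imbalanced data.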

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

One way to visually evaluate a binary classification model is the ROC curve. A single ROC curve is hard to interpret on its own, but by comparing the ROC curves of multiple models we can start seeing which models are better. The area under the ROC curve is called AUC (area under the curve), and the closer it is to 1, the better the model.

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_test_pred)
roc_auc = metrics.auc(fpr, tpr)

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
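
One thing worth knowing: `roc_curve` accepts either hard 0/1 predictions or continuous scores, but with hard predictions the curve collapses to at most three points. Passing the positive-class probabilities from `.predict_proba` gives a curve with one point per useful threshold. The sketch below shows the difference on a synthetic dataset; the data and variable names are made up and separate from the Boston example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary dataset.
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard = model.predict(X_te)                 # 0/1 labels
scores = model.predict_proba(X_te)[:, 1]   # probability of the positive class

fpr_hard, tpr_hard, _ = roc_curve(y_te, hard)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)
auc_soft = roc_auc_score(y_te, scores)

# Number of points on each curve, then the AUC computed from the scores.
print(len(fpr_hard), len(fpr_soft))
print(auc_soft)
```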

### Exercise

1. The `LogisticRegression` classifier we trained above has an argument called `class_weight`. Read the documentation to see what it does, then train a new model this time by providing the class weights. 
1. Does it change any of the results? In what way?

### End of exercise

## Part 3: Cross-validation for binary classification

In the last section, we trained a logistic regression classifier on the Boston housing data. In this section, we train the same logistic regression classifier, but use cross-validation to tune it.

In [None]:
from sklearn.linear_model import LogisticRegressionCV
clr = LogisticRegressionCV(cv = 5, random_state = 13, max_iter = 1000)
clr.fit(x_train, y_train)
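
What `LogisticRegressionCV` tunes is the inverse regularization strength `C`: it tries a grid of candidate values, scores each one with cross-validation, and refits using the best one. The sketch below, on a synthetic dataset with made-up names, shows where to find the grid and the selected value after fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical binary dataset.
X_demo, y_demo = make_classification(n_samples=300, random_state=0)

# Cs=5 asks for a grid of 5 candidate C values; cv=5 scores each with 5 folds.
cv_model = LogisticRegressionCV(Cs=5, cv=5, max_iter=1000).fit(X_demo, y_demo)

candidates = np.asarray(cv_model.Cs_)       # grid of C values that were tried
best_c = float(np.ravel(cv_model.C_)[0])    # C selected by cross-validation
print(candidates)
print(best_c)
```

More folds or a larger grid means more model fits, which is where the extra run-time comes from.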

In [None]:
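# Predict on the test set so the results line up with y_test in the cells below
predictions = clr.predict(x_test)

In [None]:
np.unique(predictions, return_counts = True)

In [None]:
# accuracy_score expects the true labels first, then the predictions
metrics.accuracy_score(y_test, predictions)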

In [None]:
metrics.confusion_matrix(y_test, predictions)

In [None]:
print(classification_report(y_test, predictions))

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(y_test, predictions)
roc_auc = metrics.auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

### Exercise

1. Does cross-validation seem to make a difference in the results we get?
1. Change the number of folds from 5 to 10 and train the CV model again. Do you notice any difference in performance? Note that *performance* here refers to the model's overall accuracy, based on your choice of metric; it does NOT refer to run-time. What was the cost of increasing the number of folds?

### End of exercise