# Pair-coding lab: Introducing class imbalance

In this **lab**, you'll work in pairs or small groups to explore the issue of class imbalance!

As usual, we'll start by loading relevant libraries and setting up our figure aesthetics:

In [None]:
import pandas as pd
import numpy as np
import os

import seaborn as sns
import cmocean
import matplotlib.pyplot as plt
import pylab
%matplotlib inline
%config InlineBackend.figure_format = 'svg'


viz_style = {
    'font.family': 'sans-serif',
    'font.size':11,
    'axes.titlesize':'large',
    'axes.labelsize':'medium',
    'xtick.labelsize':'small',
    'ytick.labelsize':'small',
    'text.color':'#5B5654',
    'axes.labelcolor':'#5B5654',
    'xtick.color':'#5B5654',
    'ytick.color':'#5B5654',
    'axes.edgecolor':'#5B5654',
    'xtick.top':False,
    'ytick.right':False,
    'axes.spines.top':False,
    'axes.spines.right':False,
    'axes.grid':False,
    'boxplot.showfliers':False,
    'boxplot.patchartist':True
}

plt.style.use(viz_style)

case_dir = '/path/to/materials'

## Balanced classes

For the purpose of comparison, we'll start out with a balanced dataset.

scikit-learn has a very handy function, `make_classification`, that lets us generate our own datasets with a lot of flexibility. We'll use it to compare balanced and imbalanced datasets, and to explore how the properties of an imbalanced dataset can affect the difficulty of a project.

Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html) for `make_classification`, and in particular pay attention to the many aspects that we can control with various parameters!

In [None]:
from sklearn.datasets import make_classification

### Build a balanced dataset

Our balanced dataset will contain 5000 samples, split evenly into 3 classes. We'll make 5 features, and to be kind to ourselves we'll allow 4 of them to be informative (the 5th will be redundant, i.e., a linear combination of the informative ones). We'll also impose the simplifying constraint that each class occupies only *one* cluster in feature space. Finally, we'll keep the noise in the dataset low: only 1% of the samples will have their class label assigned at random.

---

Run the cells below to create the dataset and convert the results to a pandas dataframe:

In [None]:
# create dataset
X_balanced, y_balanced = make_classification(n_classes=3, weights=None, n_samples=5000, 
                                             n_features=5, n_redundant=1, n_informative=4,
                                             n_clusters_per_class=1, random_state=0, flip_y=0.01)

In [None]:
# convert the dataset to a pandas dataframe
df_balanced = pd.concat([pd.DataFrame(data=X_balanced), 
                         pd.DataFrame(data=y_balanced, columns=['label'])], 
                        axis=1)
df_balanced.head()
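
Before plotting, it can be reassuring to confirm that the classes really are split (almost) evenly. The cell below is a minimal sketch: it regenerates the dataset with the same settings so that it runs on its own, and the names `X_check`, `y_check`, `class_counts`, and `class_fractions` are made up for this check. Because 1% of labels are assigned at random, expect the counts to be close to equal rather than exactly equal.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the balanced dataset above (names are placeholders for this check)
X_check, y_check = make_classification(n_classes=3, weights=None, n_samples=5000,
                                       n_features=5, n_redundant=1, n_informative=4,
                                       n_clusters_per_class=1, random_state=0, flip_y=0.01)

# Count the samples in each class and convert the counts to fractions
class_counts = np.bincount(y_check)
class_fractions = class_counts / class_counts.sum()

for label, (count, frac) in enumerate(zip(class_counts, class_fractions)):
    print('Class {}: {} samples ({:.1%})'.format(label, count, frac))
```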

Let's take a look at what our data look like via a pair grid:

In [None]:
g = sns.PairGrid(df_balanced, diag_sharey=False, corner=True, hue='label', palette='crest_r',
                 height=1.7, aspect=1)
g.map_lower(sns.scatterplot)
g.map_diag(sns.kdeplot)
g.add_legend();

### Train and evaluate a classifier

We'll do the usual train-test split:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_balanced[df_balanced.columns[:-1]], 
                                                    df_balanced['label'], 
                                                    random_state=0, test_size=0.25, 
                                                    stratify=df_balanced['label'])

And now we can set up our logistic regression model!

**Note:** you can choose whether to use OVR or multinomial here! The following cell uses the default (which is multinomial), but it's up to you.
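
If you'd like to compare the two approaches side by side, here is one possible sketch. It regenerates the balanced dataset so that it runs on its own, and it gets OVR behavior by wrapping the model in `OneVsRestClassifier` from `sklearn.multiclass`. That wrapper is our choice for illustration, not something the lab requires, and the `_demo` style variable names are made up to avoid overwriting the lab's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same settings as the balanced dataset
X_demo, y_demo = make_classification(n_classes=3, weights=None, n_samples=5000,
                                     n_features=5, n_redundant=1, n_informative=4,
                                     n_clusters_per_class=1, random_state=0, flip_y=0.01)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0,
                                      test_size=0.25, stratify=y_demo)

# Fit the scaler on the training split only, then apply it to both splits
scaler_demo = StandardScaler()
Xtr_sc = scaler_demo.fit_transform(Xtr)
Xte_sc = scaler_demo.transform(Xte)

# Multinomial: a single model that scores all three classes jointly
clf_multinomial = LogisticRegression(random_state=0).fit(Xtr_sc, ytr)

# OVR: one binary "this class vs. everything else" model per class
clf_ovr = OneVsRestClassifier(LogisticRegression(random_state=0)).fit(Xtr_sc, ytr)

for name, model in [('multinomial', clf_multinomial), ('OVR', clf_ovr)]:
    score = recall_score(yte, model.predict(Xte_sc), average='macro')
    print('{} test macro recall: {:.3f}'.format(name, score))
```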

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

clf = LogisticRegression(random_state=0).fit(X_train_sc, y_train)

We want to do a more robust evaluation of the model than accuracy alone, so you can use this convenience function (below) to compactly summarize your results:

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def summarize_scores(y_true, y_pred, settype='Training'):
    print('\n--- {} SET ---'.format(settype.upper()))
    print('Accuracy:', accuracy_score(y_true, y_pred))
    print('Precision:', precision_score(y_true, y_pred, average='macro'))
    print('Recall:', recall_score(y_true, y_pred, average='macro'))
    print('F1:', f1_score(y_true, y_pred, average='macro'))

In [None]:
# see how your model did:
summarize_scores(y_train, clf.predict(X_train_sc), settype='Training')
summarize_scores(y_test, clf.predict(X_test_sc), settype='Test')

And, as before, we have the confusion matrix:

In [None]:
from sklearn.metrics import confusion_matrix

heatmap_train = sns.heatmap(confusion_matrix(y_train, clf.predict(X_train_sc), normalize="true"), 
                            annot=True, fmt='.2%', cmap="PuBu");
heatmap_train.set_xlabel('Predicted label')
heatmap_train.set_ylabel('True label')
heatmap_train.set_title('Balanced Training set');

In [None]:
heatmap_test = sns.heatmap(confusion_matrix(y_test, clf.predict(X_test_sc), normalize="true"), 
                           annot=True, fmt='.2%', cmap="PuBu")
heatmap_test.set_xlabel('Predicted label')
heatmap_test.set_ylabel('True label')
heatmap_test.set_title('Balanced Test set');

Let's see if we can do better with optimized parameters:

In [None]:
from sklearn.model_selection import GridSearchCV

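parameters = {'C':10**np.linspace(-2,2, 11)}
# multinomial is already the default here, so the multi_class argument
# (deprecated in recent scikit-learn releases) is left out
gs = GridSearchCV(LogisticRegression(random_state=0), 
                  parameters, scoring='recall_macro').fit(X_train_sc, y_train)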
print('Best params:', gs.best_params_)

summarize_scores(y_train, gs.predict(X_train_sc), settype='Training')
summarize_scores(y_test, gs.predict(X_test_sc), settype='Test')

Notice how the scores are uniformly high!

## Imbalanced classes

In this next section, we'll look at a dataset with class imbalance and see how it differs from the balanced case. You'll start with a pre-defined imbalanced dataset, but later you'll make your own and explore!

### Build an imbalanced dataset

Like the balanced dataset, our **imbalanced** dataset has 5000 samples and 3 classes. We still have 5 features (4 of which are informative), one cluster per class, and 1% of the samples with randomly assigned class labels.

However, this time, the samples are *not* split evenly between the 3 classes. Instead we have:    
>Class 0: 90%  
Class 1: 5%  
Class 2: 5%

In [None]:
# create dataset
from sklearn.datasets import make_classification
X_imbalanced, y_imbalanced = make_classification(n_classes=3, weights=[0.9, 0.05, 0.05], n_samples=5000,
                                                 n_features=5, n_redundant=1, n_informative=4,
                                                 n_clusters_per_class=1, random_state=0, flip_y=0.01)

In [None]:
# convert to pandas dataframe
df_imbalanced = pd.concat([pd.DataFrame(data=X_imbalanced), 
                           pd.DataFrame(data=y_imbalanced, columns=['label'])], 
                          axis=1)
df_imbalanced.head()

Make a pair grid:

In [None]:
g = sns.PairGrid(df_imbalanced, diag_sharey=False, corner=True, hue='label', palette='crest_r',
                 height=1.7, aspect=1)
g.map_lower(sns.scatterplot)
g.map_diag(sns.kdeplot)
g.add_legend();

How is this different from the first dataset?

### Train and evaluate a classifier

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_imbalanced[df_imbalanced.columns[:-1]], 
                                                    df_imbalanced['label'], 
                                                    random_state=0, test_size=0.25, 
                                                    stratify=df_imbalanced['label'])

In [None]:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

clf = LogisticRegression(random_state=0).fit(X_train_sc, y_train)

# see how your model did:
summarize_scores(y_train, clf.predict(X_train_sc), settype='Training')
summarize_scores(y_test, clf.predict(X_test_sc), settype='Test')

Compare the different evaluation metric scores. Which ones are high? Which ones are low? Discuss with your partner why this might be.
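
A useful reference point for this discussion is a "model" that ignores the features entirely and always predicts the most common class. The sketch below uses scikit-learn's `DummyClassifier` for that; it regenerates the imbalanced dataset so that it runs on its own, and the variable names (`X_demo`, `baseline`, etc.) are placeholders rather than part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Stand-in data with the same settings as the imbalanced dataset
X_demo, y_demo = make_classification(n_classes=3, weights=[0.9, 0.05, 0.05], n_samples=5000,
                                     n_features=5, n_redundant=1, n_informative=4,
                                     n_clusters_per_class=1, random_state=0, flip_y=0.01)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0,
                                      test_size=0.25, stratify=y_demo)

# Always predict whichever class is most common in the training set
baseline = DummyClassifier(strategy='most_frequent').fit(Xtr, ytr)
baseline_pred = baseline.predict(Xte)

baseline_accuracy = accuracy_score(yte, baseline_pred)
baseline_recall = recall_score(yte, baseline_pred, average='macro')

print('Baseline accuracy:     {:.3f}'.format(baseline_accuracy))
print('Baseline macro recall: {:.3f}'.format(baseline_recall))
```

How do your logistic regression scores compare with this baseline on each metric?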

--- 


Check out the confusion matrices for additional context:

In [None]:
heatmap_train = sns.heatmap(confusion_matrix(y_train, clf.predict(X_train_sc), normalize="true"), 
                            annot=True, fmt='.2%', cmap="PuBu");
heatmap_train.set_xlabel('Predicted label')
heatmap_train.set_ylabel('True label')
heatmap_train.set_title('Imbalanced Training set');

In [None]:
heatmap_test = sns.heatmap(confusion_matrix(y_test, clf.predict(X_test_sc), normalize="true"), 
                           annot=True, fmt='.2%', cmap="PuBu")
heatmap_test.set_xlabel('Predicted label')
heatmap_test.set_ylabel('True label')
heatmap_test.set_title('Imbalanced Test set');

Let's see if we can do better by optimizing parameters, but this time, **experiment with different scoring metrics for optimization.** (Check out the `GridSearchCV` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), specifically the `scoring` parameter.)

Different choices of scoring metrics for your grid search will lead to different optimal parameters, which in turn will lead to different model performances. How do your results compare when you set `scoring='recall_macro'` vs. `scoring='precision_macro'` vs. `scoring='accuracy'` (the default) vs. `scoring='balanced_accuracy'`?
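
One way to organize this comparison is to loop over the scoring options and collect the results in a dictionary. This is only a sketch: it regenerates the imbalanced dataset so that it runs on its own, the `results` dictionary and other variable names are made up, and reporting test-set macro recall for every run is just one of several reasonable choices. You may see warnings for `precision_macro` if a model never predicts one of the classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same settings as the imbalanced dataset
X_demo, y_demo = make_classification(n_classes=3, weights=[0.9, 0.05, 0.05], n_samples=5000,
                                     n_features=5, n_redundant=1, n_informative=4,
                                     n_clusters_per_class=1, random_state=0, flip_y=0.01)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0,
                                      test_size=0.25, stratify=y_demo)
scaler_demo = StandardScaler()
Xtr_sc = scaler_demo.fit_transform(Xtr)
Xte_sc = scaler_demo.transform(Xte)

param_grid = {'C': 10**np.linspace(-2, 2, 11)}
scorers = ['recall_macro', 'precision_macro', 'accuracy', 'balanced_accuracy']

# Run one grid search per scoring metric and record what each one picks
results = {}
for scorer in scorers:
    search = GridSearchCV(LogisticRegression(random_state=0),
                          param_grid, scoring=scorer).fit(Xtr_sc, ytr)
    results[scorer] = {
        'best_C': search.best_params_['C'],
        'test_macro_recall': recall_score(yte, search.predict(Xte_sc), average='macro'),
    }

for scorer, res in results.items():
    print('{:>18}: best C = {:.3g}, test macro recall = {:.3f}'.format(
        scorer, res['best_C'], res['test_macro_recall']))
```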

Discuss with your partner why we might want to use `_macro` rather than `_micro` for imbalanced classes (see [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) for a refresher).
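
To ground the discussion, here is a small worked example on toy labels. The labels and predictions are invented purely for illustration: a 90/5/5 split where the classifier does well on class 0 and poorly on the two small classes.

```python
from sklearn.metrics import recall_score

# Toy ground truth (made up): 90 samples of class 0, 5 of class 1, 5 of class 2
y_true_toy = [0] * 90 + [1] * 5 + [2] * 5

# Toy predictions (made up):
#   class 0: 88 of 90 correct, class 1: 2 of 5 correct, class 2: 1 of 5 correct
y_pred_toy = [0] * 88 + [1] * 2 + [1] * 2 + [0] * 3 + [2] * 1 + [0] * 4

# Micro: pool every sample together, so large classes dominate
recall_micro = recall_score(y_true_toy, y_pred_toy, average='micro')

# Macro: compute recall per class, then take the unweighted mean
recall_macro = recall_score(y_true_toy, y_pred_toy, average='macro')

print('micro recall: {:.3f}'.format(recall_micro))  # (88 + 2 + 1) / 100
print('macro recall: {:.3f}'.format(recall_macro))  # (88/90 + 2/5 + 1/5) / 3
```

Which of the two numbers better reflects how the classifier treats the small classes?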

In [None]:
parameters = {'C':10**np.linspace(-2,2, 11)}
gs = GridSearchCV(LogisticRegression(random_state=0), 
                  parameters, scoring='recall_macro').fit(X_train_sc, y_train)
print('Best params:', gs.best_params_)

# see how your model did:
summarize_scores(y_train, gs.predict(X_train_sc), settype='Training')
summarize_scores(y_test, gs.predict(X_test_sc), settype='Test')

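Next, try balancing class weights in your logistic regression model. By default (`class_weight=None`), every class gets the same weight. You can specify weights yourself with a dictionary, or set `class_weight='balanced'` to weight each class inversely to its frequency (so that smaller classes get larger weights).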
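
To see the actual numbers behind `class_weight='balanced'`, you can compute them with `compute_class_weight` from `sklearn.utils.class_weight`. The sketch below uses toy labels with a 90/5/5 split (invented for illustration) and checks the result against the formula from the scikit-learn docs, `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (made up for illustration) with a 90 / 5 / 5 split
y_toy = np.array([0] * 90 + [1] * 5 + [2] * 5)
classes = np.unique(y_toy)

# Weights scikit-learn derives for class_weight='balanced'
balanced_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# The same calculation by hand: n_samples / (n_classes * count of each class)
manual_weights = len(y_toy) / (len(classes) * np.bincount(y_toy))

for label, weight in zip(classes, balanced_weights):
    print('Class {}: weight {:.3f}'.format(label, weight))
```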

In [None]:
parameters = {'C':10**np.linspace(-2,2, 11)}
gs = GridSearchCV(LogisticRegression(class_weight='balanced', random_state=0), 
                  parameters, scoring='recall_macro').fit(X_train_sc, y_train)
print('Best params:', gs.best_params_)

# see how your model did:
summarize_scores(y_train, gs.predict(X_train_sc), settype='Training')
summarize_scores(y_test, gs.predict(X_test_sc), settype='Test')

How did each evaluation metric change relative to the model without class weights?

### Explore!

Make your own datasets using `make_classification` to answer the following questions:

#### Degree of imbalance

In the first example we used, the class breakdown was 90%/5%/5%. What happens as you increase or decrease the imbalance of classes in the dataset? (Note: the smaller classes don't have to share the same weight! All of the class weights together just need to sum to 100%.)
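
As a possible starting point, the sketch below wraps `make_classification` in a small helper so that only the class weights change between runs. The helper `make_weighted_dataset` and the particular weight options are hypothetical; swap in whatever breakdowns you want to test.

```python
import numpy as np
from sklearn.datasets import make_classification

def make_weighted_dataset(weights, n_samples=5000, random_state=0):
    """Hypothetical helper: same settings as the lab's datasets, custom class weights."""
    return make_classification(n_classes=len(weights), weights=weights, n_samples=n_samples,
                               n_features=5, n_redundant=1, n_informative=4,
                               n_clusters_per_class=1, random_state=random_state, flip_y=0.01)

# Illustrative breakdowns, from mild to severe imbalance
weight_options = [[0.6, 0.2, 0.2], [0.9, 0.05, 0.05], [0.98, 0.015, 0.005]]

# Record how many samples land in each class for every breakdown
class_counts_by_option = {}
for weights in weight_options:
    X_new, y_new = make_weighted_dataset(weights)
    class_counts_by_option[tuple(weights)] = np.bincount(y_new, minlength=len(weights))
    print(weights, '->', class_counts_by_option[tuple(weights)])
```

From here, you can feed each `X_new, y_new` pair through the same split, scale, fit, and score steps used earlier in the lab.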

#### Number and similarity of classes

Try increasing the number of classes and/or varying the `class_sep` parameter to see what happens as the boundaries between classes change.

#### Noisiness of data

What happens if we increase or decrease the fraction of samples that are assigned random classes?
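
One way to approach this is to sweep the `flip_y` parameter and track a single metric. The sketch below is an assumption-laden starting point: the `flip_y` values are arbitrary, macro recall on the test set is just one metric you could track, and the variable names are made up. Keep in mind that the test labels are noisy too, so the scores reflect both a harder training problem and a noisier answer key.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

flip_fractions = [0.0, 0.01, 0.1, 0.3]  # illustrative values only
macro_recall_by_flip = {}

for flip_y in flip_fractions:
    # Same imbalanced settings as before, changing only the label noise
    X_noisy, y_noisy = make_classification(n_classes=3, weights=[0.9, 0.05, 0.05],
                                           n_samples=5000, n_features=5, n_redundant=1,
                                           n_informative=4, n_clusters_per_class=1,
                                           random_state=0, flip_y=flip_y)
    Xtr, Xte, ytr, yte = train_test_split(X_noisy, y_noisy, random_state=0,
                                          test_size=0.25, stratify=y_noisy)

    # The pipeline fits the scaler on the training split and reuses it at predict time
    model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0)).fit(Xtr, ytr)
    macro_recall_by_flip[flip_y] = recall_score(yte, model.predict(Xte), average='macro')

for flip_y, score in macro_recall_by_flip.items():
    print('flip_y = {:<4}: test macro recall = {:.3f}'.format(flip_y, score))
```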

#### Number and usefulness of features

Try changing the number of features and the number of informative features! How does this affect the feasibility of a classification project?

#### Etc!

Keep playing with the `make_classification` parameters. How complicated can you make things? Think about how messy real-world data is. What are the biggest challenges?