# Managing Unbalanced Targets and Preventing Data Leakage

## Objectives

- avoid letting information about test sets get into the training of models
- use best practices for building non-leaky workflows
- repair leaky workflows
- recognize imbalanced classification targets 
- describe sampling techniques that address unbalanced targets

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_validate, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, mean_squared_error

# First: Avoiding Data Leakage

We have encountered the idea of splitting our data in two, *training* our model on one part and then *testing* it on the other. The goal is an unbiased assessment of our model, so we want to make sure that nothing about our test data sneaks into the training of the model. That way our test behaves like a true test on unseen data.

### What's Wrong With This Picture?

Look at the code below. We were sure to fit our model on our training data - does that mean we did everything right?

In [None]:
X, y = load_diabetes(return_X_y=True)

In [None]:
ss = StandardScaler().fit(X)
X_scld = ss.transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scld, y, random_state=42)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.coef_, lr.intercept_)

#### Discuss:

- 


In [None]:
# If you found an error/mistake, let's fix it!


In [None]:
# Be sure to name your model something different
lr2 = None

In [None]:
# Then let's see if our coefficients are different:


### Error Comparison

It's worth pointing out that, **for linear models**, there is **no** difference in modeling error:

In [None]:
y_test_hat = lr.predict(X_test)
# squared=False returns the root mean squared error (RMSE)
rmse = mean_squared_error(y_test, y_test_hat, squared=False)
print(f"Our test RMSE for this model is {round(rmse, 2)}.")

In [None]:
y_test2_hat = lr2.predict(X_test_sc)
rmse2 = mean_squared_error(y_test2, y_test2_hat, squared=False)
print(f"Our test RMSE for this model is {round(rmse2, 2)}.")

This will **NOT** be true in general for other sorts of models, such as those that add a regularization penalty to the loss function.

## Preprocessing

In general all preprocessing steps are subject to the same dangers here. Consider the preprocessing step of one-hot-encoding:

In [None]:
gun_poll = pd.read_csv('data/guns-polls.csv')

In [None]:
gun_poll.head()

In [None]:
gun_poll['Pollster'].value_counts()

Now if I were to fit a one-hot encoder to the whole `Pollster` column here, the encoder would learn all the categories. But I need to prepare myself for the real-world possibility that unfamiliar categories may show up in future records. Let's explore this.

In [None]:
# First I'll do a split
X_train, X_test = train_test_split(gun_poll, random_state=42)

Fit a `OneHotEncoder` to the `Pollster` column in my training data, then check to see which categories are represented.

In [None]:
# Instantiate the one hot encoder, fit it just to the Pollster column
ohe = None

# Can see what categories it learned
ohe.get_feature_names()

In [None]:
# Transform our train and test sets


In [None]:
# Look at the counts across those columns 


There are categories in the testing data that don't appear in the training data! What should we do about that?

### Approaches

**Strategy 1**: Divide up the categories proportionally when we do our train_test_split. If we're using `sklearn`'s tool, that means taking advantage of the `stratify` parameter:

In [None]:
new_X_train, new_X_test = train_test_split(gun_poll,
                                           stratify=gun_poll['Pollster'],
                                           random_state=42)

Unfortunately, in this case, we can't use this since some categories have only a single member.
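As a small sketch of why the stratified split fails, here is a toy label list (made up for illustration, not taken from the poll data) in which category `"C"` has only one member. With labels like these, `train_test_split` raises a `ValueError`, because a single row cannot be placed in both the training and the testing portion.

```python
from sklearn.model_selection import train_test_split

# Toy labels (an assumption for illustration): "C" appears only once
labels = ["A"] * 6 + ["B"] * 5 + ["C"]

try:
    train_test_split(labels, stratify=labels, random_state=42)
    error_message = None
except ValueError as exc:
    # sklearn refuses to stratify on a class with a single member
    error_message = str(exc)

print(error_message)
```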

**Strategy 2**: Drop the categories with very few representatives.

In the present case, let's try dropping the single-member categories.

In [None]:
# Using value_counts, let's grab a list of all our 'bad categories' with only 1 instance

In [None]:
# Now, use a lambda function in map to change values to "Small Pollster" if it's a bad category
gun_poll['Pollster'] = gun_poll['Pollster'].map(lambda x: "Small Pollster" if x in bad_cols else x)

In [None]:
# Explore how that looks
gun_poll['Pollster'].value_counts()

We could now split this carefully so that new categories don't show up in the testing data. In fact, now we can try the stratified split:

In [None]:
X_train3, X_test3 = train_test_split(gun_poll,
                                     stratify=gun_poll['Pollster'],
                                     test_size=0.3,
                                     random_state=42)

In [None]:
X_train3['Pollster'].value_counts()

In [None]:
X_test3['Pollster'].value_counts()

Now every category that appears in the test data appears also in the training data.

**Strategy 3**: Adjust the settings on the one-hot-encoder.

For `sklearn`'s tool, we'll tweak the `handle_unknown` parameter:
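A minimal sketch of that setting, using made-up pollster names rather than the real column: with `handle_unknown='ignore'`, a category the encoder never saw during fitting is encoded as a row of all zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (an assumption for illustration): "Pollster C" never appears in training
toy_train = np.array([["Pollster A"], ["Pollster B"], ["Pollster A"]])
toy_test = np.array([["Pollster B"], ["Pollster C"]])

ohe_demo = OneHotEncoder(handle_unknown="ignore")
ohe_demo.fit(toy_train)

# .toarray() converts the sparse result to a dense array for easy viewing
encoded_test = ohe_demo.transform(toy_test).toarray()
print(ohe_demo.categories_)
print(encoded_test)
```

The trade-off is that every unknown category looks identical to the model, so this works best when unseen categories are rare.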

## Leakage into Validation Data

If we employ cross-validation, then our training data points will be serving both for training and for validation. So there's a sense in which we can't help but let some information about our validation data sneak into the model.

But strictly speaking, cross-validation means building *multiple* models, and we still want each to be blind to its validation set.

The dangers of data leakage, therefore, are still very much real in the case of validation data. And they are often more subtle as well. Consider the following code:

In [None]:
# Going back to our diabetes data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ss = StandardScaler().fit(X_train)
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

cv_results = cross_validate(estimator=LinearRegression(),
                X=X_train_sc,
                y=y_train,
                return_estimator=True)

In [None]:
# Looking at model coefficients on the first predictor
[model.coef_[0] for model in cv_results['estimator']]

We've built five models here, and none of them saw any points from the test data, so we have no leaks, right?

Wrong! We fit the `StandardScaler` to the whole training set, which means that information about *every* fold affects every cross-validation iteration, including the fold being held out for validation. A better practice here would be to split our data into its cross-validation folds *first*. Then we can fit the scaler to only the training folds for each iteration.

Of course, the more preprocessing steps we have, the more tedious it becomes to do this work! For such tasks it is often greatly beneficial to take advantage of `sklearn`'s `Pipeline`s, which we'll have more to say about later.
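As a brief preview, and only a sketch, here is one way a pipeline can keep the scaler inside each cross-validation iteration. It assumes `make_pipeline` from `sklearn.pipeline` and a five-fold split; the variable names are chosen for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_diabetes(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, random_state=42)

# The pipeline bundles scaling and modeling, so cross_validate refits the
# scaler on the training folds only in every iteration
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe_results = cross_validate(pipe, X_tr_demo, y_tr_demo, cv=5)
print(pipe_results["test_score"])
```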

The strategy of breaking the data into cross-validation folds would look like this:

- Split the training data into five folds using `KFold()`
- For each split:
    - (i) fit a `StandardScaler` to the four-fold training chunk, then use it to transform both that chunk and the held-out validation fold
    - (ii) fit a `LinearRegression` to the scaled four-fold chunk and score it on the scaled validation fold

In [None]:
# Would look like:
# KFold spits out index numbers, let's grab them to loop
for train_ind, val_ind in KFold().split(X_train):
    # Getting our train and val X
    train = X_train[train_ind, :]
    val = X_train[val_ind, :]
    # Then our train and val y
    target_train = y_train[train_ind]
    target_val = y_train[val_ind]
    
    ss = StandardScaler().fit(train)
    train_scld = ss.transform(train)
    val_scld = ss.transform(val)
    
    lr = LinearRegression().fit(train_scld, target_train)
    print(lr.score(val_scld, target_val))

# Now: Handling Class Imbalances

## Scenario: Identifying Fraudulent Credit Card Transactions

Credit card companies often try to identify whether a transaction is fraudulent at the time when it occurs, in order to decide whether to approve it. Let's build a classification model to try to classify fraudulent transactions! 

The data for this example came from [this Kaggle dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud), but has been downsampled to just 10,000 rows.

The dataset contains features for the transaction amount, the relative time of the transaction, and 28 other features formed using PCA. The target 'Class' is 1 if the transaction was fraudulent and 0 otherwise.

In [None]:
data = pd.read_csv('data/credit_fraud_small.csv')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
# Define X and y
X = data.drop(columns='Class')
y = data['Class']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=.25, random_state=1)
# Scale the data for modeling
scaler = StandardScaler()
scaler.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)

# Train a logistic regression model with the train data
cred_model = LogisticRegression(random_state=42)
cred_model.fit(X_train_sc, y_train)

### Evaluate

In [None]:
cred_model.score(X_test_sc, y_test)

We got 99.88% accuracy, meaning that 99.88% of our predictions were correct! That seems great, right? Maybe... too great? Let's dig in deeper.

In [None]:
plot_confusion_matrix(cred_model, X_test_sc, y_test);

#### Discuss: What do you notice?

- 


## Class Imbalance

In [None]:
# What does a class imbalance look like?
y_train.value_counts()

### Why do we care?

Think about it - you're asking a computer, which has NO idea what you're talking about or how to identify anything in any way other than how you tell it to identify things, to look at something completely new and categorize it. If you feed it 1000 emails, 950 of which are 'not spam' and 50 of which are 'spam,' and ask it to identify which are 'not spam,' it can just label everything as 'not spam' and be 95% correct! Not bad!

And yet... that doesn't do what you want at all. You want your model to learn the characteristics of 'spam' emails and identify the features that are reliable predictors of 'spam' in general. The computer has less and less incentive to do that as the majority class in your dataset grows relative to the minority. If your target is really imbalanced, your model will have to work increasingly hard to do better than the model-less baseline of just predicting the majority class.
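Here is a small sketch of that model-less baseline, using the 50 spam vs 950 not-spam counts from the example. It assumes `sklearn`'s `DummyClassifier` as the "always predict the majority" model and a placeholder feature column, since the baseline ignores the features anyway.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1 = spam (minority), 0 = not spam (majority)
y_spam = np.array([1] * 50 + [0] * 950)
X_spam = np.zeros((1000, 1))  # placeholder features: the baseline never looks at them

baseline = DummyClassifier(strategy="most_frequent").fit(X_spam, y_spam)
baseline_preds = baseline.predict(X_spam)

baseline_acc = accuracy_score(y_spam, baseline_preds)
baseline_recall = recall_score(y_spam, baseline_preds)
print(f"Accuracy: {baseline_acc:.2f}")
print(f"Recall on spam: {baseline_recall:.2f}")
```

The accuracy looks strong, yet the recall shows that not a single spam email was caught.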

## What can we do about it?

### Under-Sampling

Basically, take a sample to reduce the majority class to be the same size as the minority class.

Example:
```
minority = df.loc[df["category"] == "minority"]
majority = df.loc[df["category"] == "majority"].sample(n=len(minority))
```

Problems?

- Losing a lot of observations (in the 50 spam vs 950 not-spam example, we'd lose 900 rows!)


### Over-Sampling

The opposite - keep resampling from our minority class until it's the same size as the majority class.

Example:
```
majority = df.loc[df["category"] == "majority"]
minority = df.loc[df["category"] == "minority"].sample(n=len(majority), replace=True)
```

Problems?

- Will over-fit to the minority class, since the model will see the same minority examples over and over again (in the same 50 spam vs 950 not-spam example, each minority row would be repeated 19 times on average!)


### Split The Difference

Basically, combine under- and over-sampling so that you do a bit of both. This might work better than relying on just one of the above strategies.
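A rough sketch of doing a bit of both, on a toy DataFrame built for illustration with the same 50 vs 950 split. The meeting point of 300 rows per class (`target_size`) is an arbitrary assumption, not a recommended value.

```python
import pandas as pd

# Toy data (an assumption for illustration): 50 minority rows, 950 majority rows
toy_df = pd.DataFrame({
    "feature": range(1000),
    "category": ["minority"] * 50 + ["majority"] * 950,
})

target_size = 300  # hypothetical middle ground between 50 and 950

# Under-sample the majority down to target_size (without replacement)
majority_part = toy_df.loc[toy_df["category"] == "majority"].sample(
    n=target_size, random_state=42)
# Over-sample the minority up to target_size (with replacement)
minority_part = toy_df.loc[toy_df["category"] == "minority"].sample(
    n=target_size, replace=True, random_state=42)

balanced = pd.concat([majority_part, minority_part]).reset_index(drop=True)
balanced_counts = balanced["category"].value_counts()
print(balanced_counts)
```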

### Implementing Over-Sampling

In [None]:
# First, train test split
# We only implement these techniques on training data!
X = data.drop(columns='Class')
y = data['Class']

X_tr_samp, X_te_samp, y_tr_samp, y_te_samp = train_test_split(
    X, y, test_size=.25, random_state=1)

In [None]:
# Need to put our training data back together
train_data = X_tr_samp.copy()
train_data['Class'] = y_tr_samp
train_data.head()

In [None]:
# Let's try over-sampling our minority class and see how we do
# Copy the provided code above, then adjust to our context


# Then use pd.concat to combine, resetting the index using .reset_index(drop=True)
oversampled_train = None
oversampled_train.shape

In [None]:
# Split out oversampled_train back out into X and y
X_tr_oversamp = oversampled_train.drop(columns='Class')
y_tr_oversamp = oversampled_train['Class']

In [None]:
# Scale the data for modeling
scaler = StandardScaler()
scaler.fit(X_tr_oversamp)
X_tr_over_sc = scaler.transform(X_tr_oversamp)
X_te_sc = scaler.transform(X_te_samp)

# Train a logistic regression model with the train data
over_model = LogisticRegression(random_state=42)
over_model.fit(X_tr_over_sc, y_tr_oversamp)

In [None]:
over_model.score(X_te_sc, y_te_samp)

In [None]:
plot_confusion_matrix(over_model, X_te_sc, y_te_samp);

#### Discuss:

- 


### Synthetic Data Creation - ADASYN and SMOTE

The **Synthetic Minority Oversampling Technique (SMOTE)** over-samples the minority class by creating new, synthetic observations. For each minority-class instance, SMOTE finds its nearest minority-class neighbors, draws a line between the instance and one of those neighbors, and then creates a new observation somewhere along that line.

![SMOTE visualized](images/SMOTE_R_visualisation_3.png)

Image source is a great explainer on SMOTE (but uses R for the examples): https://rikunert.com/SMOTE_explained

This is often better than a simple random over-sample, but these synthetic samples are not real data, and they are built from your existing minority observations. So the new, synthetic samples can still result in over-fitting. An additional pitfall: if one of your minority observations is an outlier, SMOTE will create synthetic points along the line between that outlier and another minority point, and those new points may be outliers too.
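To illustrate the "points along a line" idea, here is a simplified sketch written with `numpy` and `sklearn`'s `NearestNeighbors`. It is not the `imblearn` implementation; the toy minority points, `k = 5`, and `n_new = 30` are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
minority_pts = rng.normal(size=(20, 2))  # toy minority-class observations

k = 5
# Ask for k + 1 neighbors because each point is its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_pts)
_, neighbor_idx = nn.kneighbors(minority_pts)

n_new = 30
# Pick a random base point, then one of its k true neighbors (columns 1..k)
base_idx = rng.integers(0, len(minority_pts), size=n_new)
chosen_idx = neighbor_idx[base_idx, rng.integers(1, k + 1, size=n_new)]

# Place each synthetic point a random fraction of the way along the line
gap = rng.random(size=(n_new, 1))
synthetic = minority_pts[base_idx] + gap * (
    minority_pts[chosen_idx] - minority_pts[base_idx])
print(synthetic.shape)
```

Because every synthetic point sits between two existing minority points, the new data never leaves the region the minority class already occupies, which is why outliers and over-fitting remain a concern.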

Another way to create synthetic data to over-sample our minority category is the **Adaptive Synthetic approach, ADASYN**. ADASYN works similarly to SMOTE, but it concentrates on the minority points that sit closest to the majority class, aka the ones that are most likely to be confused. It tries to help out your model where 'spam' and 'not spam' are the closest by making more data in your 'spam' minority category there.


Check out the library [imblearn](https://imbalanced-learn.org/stable/) for implementation of these!

### Implementing SMOTE:

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

Reminder: go back to our original train/test split:

```
X_tr_samp, X_te_samp, y_tr_samp, y_te_samp
```

In [None]:
# New import - note, not SKLearn!


In [None]:
# Instantiate our SMOTE

# Fit on the training data! X_tr_samp, y_tr_samp
X_tr_smote, y_tr_smote = None

In [None]:
X_tr_smote.shape

In [None]:
# Still need to scale
scaler = StandardScaler()
scaler.fit(X_tr_smote)
X_tr_smote_sc = scaler.transform(X_tr_smote)
X_te_sc = scaler.transform(X_te_samp)

# Train a logistic regression model with the train data
smote_model = LogisticRegression(random_state=42)
smote_model.fit(X_tr_smote_sc, y_tr_smote)

In [None]:
smote_model.score(X_te_sc, y_te_samp)

In [None]:
plot_confusion_matrix(smote_model, X_te_sc, y_te_samp);

#### Discuss:

- 


### One More Trick: `class_weight='balanced'`

And then, of course, sklearn has some methods to handle imbalanced datasets built right into some models - including logistic regression!

Check out the documentation to find it: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
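To see what the `'balanced'` setting does to the class weights, here is a small sketch using `compute_class_weight` from `sklearn.utils.class_weight` on toy labels with a 950 vs 50 split (made up for illustration). Each class is weighted by `n_samples / (n_classes * count_of_that_class)`, so the rare class counts for more in the loss.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (an assumption for illustration): 950 majority, 50 minority
y_toy = np.array([0] * 950 + [1] * 50)

balanced_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# The same numbers computed by hand: n_samples / (n_classes * class counts)
manual_weights = len(y_toy) / (2 * np.bincount(y_toy))
print(balanced_weights)
print(manual_weights)
```

With these counts, each minority row weighs about 19 times as much as a majority row, which is the same 19-to-1 ratio that over-sampling would reach by repeating rows.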

Reminder: go back to our original train/test split:

```
X_tr_samp, X_te_samp, y_tr_samp, y_te_samp
```

In [None]:
# Let's try a model with an adjusted hyperparameter...
logreg_b = None

In [None]:
# Scale the data for modeling
scaler = StandardScaler()
scaler.fit(X_tr_samp)
X_tr_sc = scaler.transform(X_tr_samp)
X_te_sc = scaler.transform(X_te_samp)

# Now, fitting our model and grabbing our training and testing predictions
logreg_b.fit(X_tr_sc, y_tr_samp)

train_preds = logreg_b.predict(X_tr_sc)
test_preds = logreg_b.predict(X_te_sc)

In [None]:
# Plotting the confusion matrix using SKLearn
plot_confusion_matrix(logreg_b, X_te_sc, y_te_samp);

In [None]:
# Printing the metrics nicely
metrics = {"Accuracy": accuracy_score,
           "Recall": recall_score,
           "Precision": precision_score,
           "F1-Score": f1_score}

for name, metric in metrics.items():
    print(f"{name}:"); print("="*len(name))
    print(f"TRAIN: {metric(y_tr_samp, train_preds):.4f}")
    print(f"TEST: {metric(y_te_samp, test_preds):.4f}")
    print("*" * 15)

## Resources:

- [SMOTE Explained for Noobs](https://rikunert.com/SMOTE_explained) (the R tutorial I linked earlier)
- [Resampling Strategies for Imbalanced Datasets](https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets)
- Machine Learning Mastery: [8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/)
- [Handling Imbalanced Datasets in Deep Learning](https://towardsdatascience.com/handling-imbalanced-datasets-in-deep-learning-f48407a0e758)