## Lecture 18: Handling Imbalanced Datasets

## Table of contents

1. [Introduction](#intro)
2. [Algorithmic Level Approaches](#algo)
3. [Data Level Approaches I (Undersampling)](#undersampling)
    1. [Random Undersampling](#undersampling)
    2. [NearMiss](#nearmiss)
    3. [TomekLinks](#tomeklinks)
4. [Data Level Approaches II (Oversampling)](#oversampling)
    1. [Random Oversampling](#oversampling)
    2. [SMOTE](#smote)

## Introduction<a id="intro">

Data imbalance occurs when class distributions in a dataset are skewed, with one class significantly outnumbering others.
This challenge is prevalent in various applications, including:
- Fraud Detection
- Medical Diagnosis
- Text Classification
- Image Recognition

**Consequences:**
- Models prioritize overall accuracy, often favoring the majority class.
- This bias leads to underrepresentation of the minority class in the predictions, resulting in:
    - Skewed predictions
    - Poor generalization

Please refer to the [presentation](https://drive.google.com/file/d/1-BLvqwQuIN95GCKwTTD2Sa_t8V24qeMh/view?usp=drive_link) for more theoretical notes.

We will be using a dataset of credit card transactions. You can find the data and a complete dataset description at this [link](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud/data).

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
path = 'data/creditcard.csv'

In [4]:
df = pd.read_csv(path)

In [5]:
df.shape

(284807, 31)

In [6]:
pd.set_option("display.max_columns", 50)

In [7]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

In [8]:
df.Class.value_counts(normalize=True)*100

0    99.827251
1     0.172749
Name: Class, dtype: float64

We observe that only 0.17% of the transactions are fraudulent (positive class), indicating a highly imbalanced dataset.
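To see why accuracy alone is misleading on such data, consider a trivial classifier that labels every transaction as non-fraudulent. The following is a minimal sketch using the class counts printed above; the variable names and the "always predict 0" baseline are ours, for illustration only.

```python
# Class counts taken from the value_counts() output above
n_negative = 284315  # legitimate transactions (class 0)
n_positive = 492     # fraudulent transactions (class 1)

# Hypothetical baseline that predicts class 0 for every transaction
true_negatives = n_negative
false_negatives = n_positive
true_positives = 0

accuracy = (true_positives + true_negatives) / (n_negative + n_positive)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.4%}, recall = {recall:.0%}")
```

Such a baseline is right on more than 99.8% of the transactions while detecting no fraud at all, which is why we also track recall, precision, and F1 in the rest of the notebook.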

We'll begin with a brief data cleaning process before moving on to model building.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [10]:
df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.918649e-15,5.682686e-16,-8.761736e-15,2.811118e-15,-1.552103e-15,2.04013e-15,-1.698953e-15,-1.893285e-16,-3.14764e-15,1.772925e-15,9.289524e-16,-1.803266e-15,1.674888e-15,1.475621e-15,3.501098e-15,1.39246e-15,-7.466538e-16,4.258754e-16,9.019919e-16,5.126845e-16,1.47312e-16,8.042109e-16,5.282512e-16,4.456271e-15,1.426896e-15,1.70164e-15,-3.662252e-16,-1.217809e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [11]:
df.duplicated().sum() # checking for duplicates

1081

We have 1081 duplicated rows in our dataset. Let's remove them.

In [12]:
df.drop_duplicates(inplace=True)

In [13]:
df.groupby('Class').Amount.mean()

Class
0     88.413575
1    123.871860
Name: Amount, dtype: float64

In [14]:
# Drop the Time column for now; it can be used later to try to improve the model
df = df.drop('Time', axis=1)

In [15]:
X = df.drop(columns='Class')  # `columns=` already implies axis=1
y = df['Class']

In [16]:
X.shape, y.shape

((283726, 29), (283726,))

We’ll split the dataset into training, validation, and test sets with a 60/20/20% ratio. To ensure consistent class distribution across each partition, we’ll use the `stratify` parameter in `train_test_split`, preserving the proportion of instances in each class in every subset. The combined `X_train_val` set will include both training and validation data, allowing for direct use during cross-validation.
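The second split uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full dataset. The arithmetic can be checked with a short sketch. It assumes that a fractional test size is rounded up to the next whole sample, and the helper name `split_sizes` is ours, not part of any library.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 283726  # rows left after removing duplicates

# First split: hold out 20% as the test set
n_train_val, n_test = split_sizes(n_total, 0.2)
# Second split: 25% of the remaining 80% becomes the validation set
n_train, n_val = split_sizes(n_train_val, 0.25)

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```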

In [17]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, stratify=y_train_val, test_size=0.25, random_state=42)

In [18]:
X_train.shape, X_val.shape, X_test.shape

((170235, 29), (56745, 29), (56746, 29))

We’ll start by building a logistic regression model directly on our imbalanced data to establish a baseline. Then, we’ll apply various strategies for handling data imbalance and assess the improvements. For cross-validation, we’ll use a `StratifiedKFold` object to maintain consistent class distribution across folds.
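As a small illustration of what stratification does, the sketch below splits toy labels (990 negatives and 10 positives, invented for this example) into five folds and counts the positives that land in each validation fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 990 negatives and 10 positives
y_toy = np.array([0] * 990 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not affect how folds are formed

skf = StratifiedKFold(n_splits=5, shuffle=False)

# Count the positive samples that end up in each validation fold
positives_per_fold = [int(y_toy[val_idx].sum()) for _, val_idx in skf.split(X_toy, y_toy)]
print(positives_per_fold)
```

Each validation fold receives the same share of the rare class, so no fold is left without positive samples to evaluate on.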

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

lr = LogisticRegression(max_iter=1000)
kf = StratifiedKFold(n_splits=5, shuffle=False)

In [20]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    lr.fit(X_fold_train, y_fold_train)
    
    y_pred = lr.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [01:51, 22.28s/it]


In [21]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)

pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])

|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.621474 | 0.857103 | 0.718459 | 0.999194 |


Let’s experiment with different prediction thresholds. By default, logistic regression classifies a sample as positive if its predicted probability exceeds 0.5. We’ll refit the model on the training set and adjust the threshold on the validation set to achieve a recall of 0.9 (90%).

In [22]:
lr.fit(X_train, y_train)

In [23]:
y_prob = lr.predict_proba(X_val)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [24]:
# original 0.5 threshold
recall = recall_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)

pd.DataFrame(data=[(recall, precision, f1, accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])

|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.691489 | 0.878378 | 0.77381 | 0.99933 |


In [25]:
y_prob = lr.predict_proba(X_val)[:, 1]
y_pred = (y_prob >= 0.0015).astype(int)

recall = recall_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)

pd.DataFrame(data=[(recall, precision, f1, accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])

|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.904255 | 0.027887 | 0.054106 | 0.947625 |


Setting the threshold to 0.0015 allows us to achieve a recall of 90%, meaning we identify 90% of fraudulent transactions. However, note that precision drops to 2.8%, indicating that only 2.8% of our positive predictions are correct.
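Rather than finding such a threshold by trial and error, it can be derived from the predicted probabilities of the positive samples. The following is a minimal sketch on invented toy scores; the helper `threshold_for_recall` is ours and is not a scikit-learn function.

```python
import numpy as np

def threshold_for_recall(y_true, y_prob, target_recall):
    """Largest threshold t such that predicting 1 when y_prob >= t reaches target_recall."""
    pos_scores = np.sort(y_prob[y_true == 1])[::-1]  # positive-class scores, descending
    # Number of positives that must be flagged (the epsilon guards against float noise)
    k = int(np.ceil(target_recall * len(pos_scores) - 1e-9))
    return pos_scores[max(k, 1) - 1]

# Toy labels and scores, invented for illustration
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_prob_toy = np.array([0.02, 0.05, 0.3, 0.6, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.01])

t = threshold_for_recall(y_true_toy, y_prob_toy, target_recall=0.9)
y_pred_toy = (y_prob_toy >= t).astype(int)
recall_toy = y_pred_toy[y_true_toy == 1].mean()  # share of positives that were flagged
print(t, recall_toy)
```

A threshold chosen this way should be selected on validation data, as we do here, and then kept fixed when evaluating on the test set.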

Let’s examine the confusion matrix to analyze the counts of true negatives, false positives, false negatives, and true positives.

In [26]:
from sklearn.metrics import confusion_matrix


cm = confusion_matrix(y_val, y_pred)
cm

# (tn, fp, 
#  fn, tp)

array([[53688,  2963],
       [    9,    85]], dtype=int64)
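The metrics reported for the 0.0015 threshold follow directly from these four counts. A short sketch of the arithmetic, with the counts copied from the matrix above:

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 53688, 2963, 9, 85

recall = tp / (tp + fn)                      # share of actual frauds that were caught
precision = tp / (tp + fp)                   # share of fraud alerts that were correct
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that were correct

print(f"recall={recall:.6f} precision={precision:.6f} f1={f1:.6f} accuracy={accuracy:.6f}")
```

Out of 3048 transactions flagged as fraud, only 85 are actually fraudulent, which is where the low precision comes from.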

## Algorithmic Level Approach for Handling Data Imbalance<a id="algo">

We can adjust the learning process of logistic regression to give more emphasis to the minority class by setting `class_weight='balanced'`. This approach modifies the cost function by assigning higher penalties to misclassified instances of the minority class, effectively increasing their impact on the model’s optimization process.
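To get a feel for the size of these penalties, the sketch below applies the heuristic that scikit-learn documents for the `'balanced'` mode, `n_samples / (n_classes * np.bincount(y))`. The class counts are those of our training split (169951 negatives and 284 positives) and serve only as an illustration; the counts inside each cross-validation fold differ slightly.

```python
import numpy as np

# Class counts of the training split, used here for illustration
class_counts = np.array([169951, 284])  # [class 0, class 1]
n_samples = class_counts.sum()
n_classes = len(class_counts)

# Heuristic documented for class_weight='balanced'
weights = n_samples / (n_classes * class_counts)

print(dict(zip([0, 1], weights.round(4))))
print("ratio of minority to majority weight:", round(weights[1] / weights[0], 1))
```

Each class ends up contributing the same total weight to the loss, so a single misclassified fraud costs roughly as much as several hundred misclassified legitimate transactions.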

In [27]:
lr = LogisticRegression(max_iter=1000, class_weight='balanced')

In [28]:
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    lr.fit(X_fold_train, y_fold_train)
    
    y_pred = lr.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [01:53, 22.72s/it]


In [29]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)

pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])

|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.912526 | 0.05931 | 0.111321 | 0.975601 |


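We observe a significant improvement in recall compared to the baseline model at the default threshold (0.91 vs. 0.62), at the cost of a sharp drop in precision (from 85.7% to 5.9%). Precision is still roughly twice the 2.8% we obtained when we lowered the threshold of the unweighted model to reach a similar recall, although that figure was measured on a single validation split rather than with cross-validation.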

## Data Level Approaches. Undersampling <a id="undersampling">

The primary data-level approaches for handling class imbalance are undersampling and oversampling. Let’s begin with undersampling.

Through undersampling, we reduce the number of instances in the majority class to better balance it with the minority class. There are different undersampling strategies, with the simplest being random undersampling, which randomly removes instances from the majority class. 

We’ll use the `RandomUnderSampler` implementation from the `imbalanced-learn` library (imported as `imblearn`); please install this library if you haven’t already.

In [30]:
# !pip install imbalanced-learn

In [31]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

rus = RandomUnderSampler(random_state=42)

In [32]:
X_under, y_under = rus.fit_resample(X_train, y_train)

In [33]:
y_under.value_counts()

0    284
1    284
Name: Class, dtype: int64

After applying the undersampler with `fit_resample` on the training set, the majority class instances are reduced to match the count of the minority class. However, since our data contains very few minority class instances (284 in the training set), undersampling may not be ideal; with such limited data, the model may struggle to learn effectively.
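Conceptually, random undersampling amounts to drawing, without replacement, as many rows from every class as the smallest class contains. The following NumPy sketch illustrates the idea on toy data; it is a hand-rolled simplification, not the `imblearn` implementation, and the function name `random_undersample` is ours.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data, invented for illustration: 17 majority rows and 3 minority rows
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 17 + [1] * 3)

X_under_toy, y_under_toy = random_undersample(X_toy, y_toy)
print(X_under_toy.shape, np.bincount(y_under_toy))
```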

We'll create a pipeline that includes both the undersampling step and the model, so that resampling is repeated inside each cross-validation fold using only that fold's training data. We're using the `Pipeline` from `imblearn`, as `sklearn`'s version does not support steps that change the number of rows in the dataset.

Importantly, when using the pipeline on the validation set, we only call `predict`, ensuring that undersampling is not applied during validation. This approach is crucial, as we need to evaluate model performance on real-world data, not undersampled data.

In [34]:
random_under_pipeline = Pipeline(steps=[
    ('random_under', RandomUnderSampler(random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=13))
])

In [35]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    random_under_pipeline.fit(X_fold_train, y_fold_train)
    
    y_pred = random_under_pipeline.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [00:02,  2.30it/s]


In [36]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)
pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])

|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.917895 | 0.039725 | 0.076146 | 0.962829 |


### NearMiss undersampling<a id="nearmiss">

Another undersampling strategy is NearMiss. It keeps only the majority class instances that are closest to the minority class instances and discards the rest. The intent is to retain majority samples near the decision boundary, where the two classes are hardest to tell apart. Note that the retained samples are no longer representative of the majority class as a whole.
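The selection rule can be sketched in a few lines of NumPy. The sketch assumes the NearMiss-1 variant (rank majority samples by their mean distance to the `k` nearest minority samples), labels the minority class as 1, and uses brute-force distances, so it is an illustration rather than the `imblearn` implementation. The function name `nearmiss1` and the toy data are ours.

```python
import numpy as np

def nearmiss1(X, y, k=3):
    """Keep the majority samples with the smallest mean distance to their k nearest minority samples."""
    X_min, X_maj = X[y == 1], X[y == 0]
    # Pairwise Euclidean distances, shape (n_majority, n_minority)
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    # Mean distance from each majority sample to its k nearest minority samples
    mean_k = np.sort(d, axis=1)[:, :k].mean(axis=1)
    # Keep as many majority samples as there are minority samples
    keep_maj = np.argsort(mean_k)[: len(X_min)]
    X_res = np.vstack([X_maj[keep_maj], X_min])
    y_res = np.concatenate([np.zeros(len(keep_maj), dtype=int), np.ones(len(X_min), dtype=int)])
    return X_res, y_res

# Toy 1-D data, invented for illustration: four majority points, two minority points
X_nm = np.array([[0.5], [2.0], [10.0], [-8.0], [0.0], [1.0]])
y_nm = np.array([0, 0, 0, 0, 1, 1])

X_nm_res, y_nm_res = nearmiss1(X_nm, y_nm, k=2)
kept_majority = sorted(X_nm_res[y_nm_res == 0].ravel().tolist())
print(kept_majority)
```

The two majority points far from the minority class (10.0 and -8.0) are discarded, so the model never sees what typical majority samples look like.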

In [37]:
from imblearn.under_sampling import NearMiss

random_under_pipeline = Pipeline(steps=[
    ('nearmiss_under', NearMiss()),
    ('lr', LogisticRegression(max_iter=1000, random_state=13))
])

In [38]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    random_under_pipeline.fit(X_fold_train, y_fold_train)
    
    y_pred = random_under_pipeline.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [00:03,  1.46it/s]


In [39]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)
pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])



|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.973509 | 0.002355 | 0.004698 | 0.310644 |


### Tomek Links<a id="tomeklinks">

Another undersampling approach is Tomek Links. A Tomek link is a pair of samples from opposite classes that are each other’s nearest neighbors. The method removes the majority class sample from each such pair, which helps to clean the boundary between classes, reducing overlap and making the classes more separable. Because it typically removes only a small number of samples, it does not balance the classes on its own.
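The definition translates into a short brute-force search. The sketch below is an illustration on toy data, not the `imblearn` implementation; the function name `tomek_link_majority_indices` is ours, and the full pairwise distance matrix it builds would be too large for a dataset of our size.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that form a Tomek link with a minority sample."""
    # Full pairwise distance matrix; a sample must not be its own neighbor
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)            # nearest neighbor of every sample
    idx = np.arange(len(X))
    mutual = nn[nn] == idx           # the neighbor's nearest neighbor is the sample itself
    different = y[nn] != y           # the two samples belong to different classes
    is_link = mutual & different
    return idx[is_link & (y == majority_label)]

# Toy 1-D data, invented for illustration
X_tl = np.array([[0.0], [0.1], [1.0], [1.2], [5.0], [5.3]])
y_tl = np.array([0, 0, 0, 1, 1, 1])

to_remove = tomek_link_majority_indices(X_tl, y_tl)
print(to_remove)
```

Only the majority point at 1.0, which sits right next to the minority point at 1.2, is marked for removal; all other samples are left untouched.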

In [40]:
from imblearn.under_sampling import TomekLinks

random_under_pipeline = Pipeline(steps=[
    ('tl_under', TomekLinks()),
    ('lr', LogisticRegression(max_iter=1000, random_state=13))
])

In [41]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    random_under_pipeline.fit(X_fold_train, y_fold_train)
    
    y_pred = random_under_pipeline.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [09:56, 119.23s/it]


In [42]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)
pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])



|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.62414 | 0.857642 | 0.720438 | 0.999198 |


## Oversampling<a id="oversampling">

Oversampling focuses on increasing the number of minority class instances to balance them with the majority class. There are several approaches to achieve this, with one of the simplest being `RandomOverSampler`, which randomly duplicates minority class samples.

In [43]:
from imblearn.over_sampling import RandomOverSampler

In [44]:
ros = RandomOverSampler(random_state=42)

In [45]:
X_over, y_over = ros.fit_resample(X_train, y_train)

In [46]:
X_over.shape

(339902, 29)

In [47]:
y_over.value_counts()

0    169951
1    169951
Name: Class, dtype: int64

After applying random oversampling, the number of instances in the minority class has increased to match the majority class, creating a balanced dataset.
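Random oversampling can be pictured as drawing extra rows, with replacement, from every smaller class until it matches the largest one. The following NumPy sketch shows the idea on toy data; it is a hand-rolled simplification rather than the `imblearn` implementation, and the function name `random_oversample` is ours.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts = []
    for c, n in zip(classes, counts):
        idx_c = np.flatnonzero(y == c)
        # Draw the missing rows with replacement (nothing is drawn for the largest class)
        extra = rng.choice(idx_c, size=n_max - n, replace=True)
        parts.append(np.concatenate([idx_c, extra]))
    idx = np.concatenate(parts)
    return X[idx], y[idx]

# Toy data, invented for illustration: 17 majority rows and 3 minority rows
X_ro = np.arange(40).reshape(20, 2)
y_ro = np.array([0] * 17 + [1] * 3)

X_over_toy, y_over_toy = random_oversample(X_ro, y_ro)
print(X_over_toy.shape, np.bincount(y_over_toy))
```

No new information is created: every added row is an exact copy of an existing minority row.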

Let’s create a pipeline that includes `StandardScaler`, `RandomOverSampler`, and the model. Standardizing features can speed up the convergence of the solver used by logistic regression, leading to faster model training.

In [50]:
from sklearn.preprocessing import StandardScaler

random_over_pipeline = Pipeline(steps=[
    ('standard_scaler', StandardScaler()),
    ('random_over', RandomOverSampler(random_state=42)),
    ('lr', LogisticRegression(max_iter=1000, random_state=13))
])

In [51]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    random_over_pipeline.fit(X_fold_train, y_fold_train)
    
    y_pred = random_over_pipeline.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [00:13,  2.64s/it]


In [52]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)
pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])



|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.912526 | 0.059951 | 0.112454 | 0.975892 |


### SMOTE<a id="smote">

Another oversampling strategy is SMOTE (Synthetic Minority Over-sampling Technique). SMOTE generates additional instances for the minority class by creating synthetic samples. It does this by selecting instances from the minority class and interpolating between them, generating new samples that lie along the line segments connecting nearest neighbors.
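The interpolation step can be sketched with NumPy as follows. This is a simplified illustration under our own assumptions (brute-force neighbor search, uniform choice of base sample and neighbor) and not the `imblearn` implementation; the function name `smote_sketch` and the toy points are ours.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise distances between minority samples; a sample is not its own neighbor
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]       # k nearest minority neighbors of each sample
    base = rng.integers(0, len(X_min), size=n_new)  # randomly chosen base samples
    nb = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor per base sample
    gap = rng.random((n_new, 1))                    # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority samples, invented for illustration: the corners of the unit square
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

X_synth = smote_sketch(X_min_toy, n_new=10, k=2)
print(X_synth.round(2))
```

Unlike random oversampling, the new points are not copies: each one lies somewhere between two existing minority samples.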

In [53]:
from imblearn.over_sampling import SMOTE

random_over_pipeline = Pipeline(steps=[
    ('standard_scaler', StandardScaler()),
    ('smote', SMOTE(random_state=42)),  # step name now matches the sampler in use
    ('lr', LogisticRegression(max_iter=1000, random_state=13))
])

In [54]:
from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from tqdm import tqdm
recall_scores = []
precision_scores = []
f1_scores = []
accuracy_scores = []

for train_index, val_index in tqdm(kf.split(X_train_val, y_train_val)):
    X_fold_train, X_fold_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_fold_train, y_fold_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]
    
    random_over_pipeline.fit(X_fold_train, y_fold_train)
    
    y_pred = random_over_pipeline.predict(X_fold_val)
    
    recall = recall_score(y_fold_val, y_pred)
    precision = precision_score(y_fold_val, y_pred)
    f1 = f1_score(y_fold_val, y_pred)
    accuracy = accuracy_score(y_fold_val, y_pred)
    
    recall_scores.append(recall)
    precision_scores.append(precision)
    f1_scores.append(f1)
    accuracy_scores.append(accuracy)

5it [00:17,  3.51s/it]


In [55]:
average_recall = np.mean(recall_scores)
average_precision = np.mean(precision_scores)
average_f1 = np.mean(f1_scores)
average_accuracy = np.mean(accuracy_scores)
pd.DataFrame(data=[(average_recall, average_precision, average_f1, average_accuracy)], columns=['Recall', 'Precision', 'F1', 'Accuracy'])



|   | Recall | Precision | F1 | Accuracy |
|---|--------|-----------|----|----------|
| 0 | 0.915123 | 0.05511 | 0.103914 | 0.973619 |
