
# Lab | Handling Data Imbalance in Classification Models

For this lab and the next lessons, we will use the 'Healthcare For All' dataset to build a model that predicts who will donate (TargetB) and how much they will give (TargetD); the latter will be used in Friday's lab. You will use the `files_for_lab/learningSet.csv` file, which you have already downloaded in class.

### Scenario

You are revisiting the Healthcare for All case study. You are provided with historical data about donors and how much they donated. Your task is to build a machine learning model that will help the company identify people who are more likely to donate and then try to predict the donation amount.

### Instructions

In this lab, we will first look at the degree of imbalance in the data and then correct it using the techniques we learned in class.

Here is the list of steps to be followed (building a simple model without balancing the data):

- Import the required libraries and modules that you would need.
- Read that data into Python and call the dataframe `donors`.
- Check the datatypes of all the columns in the data. 
- Check for null values in the dataframe. Replace the null values using the methods learned in class.
- Split the data into numerical and categorical. Decide if any columns need their dtype changed.
- Concatenate numerical and categorical back together again for your X dataframe.  Designate the Target as y.
  
  - Split the data into a training set and a test set.
  - Split further into train_num and train_cat.  Also test_num and test_cat.
  - Scale the features either by using normalizer or a standard scaler. (train_num, test_num)
  - Encode the categorical features using One-Hot Encoding or Ordinal Encoding.  (train_cat, test_cat)
      - **fit** only on train data, then transform both train and test
      - again re-concatenate train_num and train_cat as X_train as well as test_num and test_cat as X_test
  - Fit a logistic regression model on the training data.
  - Check the accuracy on the test data.

**Note**: So far we have not balanced the data.

Managing imbalance in the dataset:

- Check for the imbalance.
- Use the resampling strategies used in class for upsampling and downsampling to create a balance between the two classes.
- Each time fit the model and see how the accuracy of the model has changed.
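
The upsampling step in the list above can be sketched with `sklearn.utils.resample` by drawing minority rows with replacement. This is a minimal illustration on a toy frame; the `feature` column and its values are hypothetical, and only the `TARGET_B` name is taken from the dataset.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training frame (hypothetical data): 8 non-donors, 2 donors
toy_train = pd.DataFrame({
    "feature": range(10),
    "TARGET_B": [0] * 8 + [1] * 2,
})

majority = toy_train[toy_train["TARGET_B"] == 0]
minority = toy_train[toy_train["TARGET_B"] == 1]

# Upsample: draw minority rows with replacement until they match the majority
minority_upsampled = resample(
    minority,
    replace=True,               # sampling with replacement duplicates rows
    n_samples=len(majority),    # match the majority class size
    random_state=0,             # reproducible draw
)

toy_balanced = pd.concat([majority, minority_upsampled]).reset_index(drop=True)
print(toy_balanced["TARGET_B"].value_counts())
```

Resampling should be applied to the training split only, so that the test set keeps the original class distribution.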





In [103]:
# Import the libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')


In [104]:
pd.set_option('display.max_rows', False)
pd.set_option('display.max_columns', None)



In [105]:
# Read the data
numerical = pd.read_csv('/Users/leozinho.air/Desktop/Ironhack/class_30/numerical.csv')
categorical = pd.read_csv('/Users/leozinho.air/Desktop/Ironhack/class_30/categorical.csv')
targets = pd.read_csv('/Users/leozinho.air/Desktop/Ironhack/class_30/target.csv')
donors = pd.concat([numerical, categorical, targets], axis = 1)
print(donors['TARGET_B'].value_counts())



0    90569
1     4843
Name: TARGET_B, dtype: int64


In [106]:
# Check the dtypes
donors.dtypes

TCODE             int64
AGE             float64
INCOME            int64
WEALTH1           int64
HIT               int64
MALEMILI          int64
MALEVET           int64
VIETVETS          int64
WWIIVETS          int64
LOCALGOV          int64
STATEGOV          int64
FEDGOV            int64
                 ...   
DOB_YR            int64
DOB_MM            int64
MINRDATE_YR       int64
MINRDATE_MM       int64
MAXRDATE_YR       int64
MAXRDATE_MM       int64
LASTDATE_YR       int64
LASTDATE_MM       int64
FIRSTDATE_YR      int64
FIRSTDATE_MM    float64
TARGET_B          int64
TARGET_D        float64
Length: 339, dtype: object

In [107]:
# Check for null
print('Amount of null values: ')
print(donors.isnull().sum().sum())
print('\n')

# Fraction of nulls per column, with the column names moved out of the index
nulls_percent_df = (donors.isna().sum() / len(donors)).reset_index()

# Rename the columns
nulls_percent_df.columns = ['columns_name', 'nulls_percentage']

# Sort descending by fraction of nulls
sorted_nulls_percent_df = nulls_percent_df.sort_values(by='nulls_percentage', ascending=False)

# Display the dataframe
print('Dataframe with % of null values per column sorted descending:')
display(sorted_nulls_percent_df)





Amount of null values: 
2


Dataframe with % of null values per column sorted descending:


| | columns_name | nulls_percentage |
|---|---|---|
| 336 | FIRSTDATE_MM | 0.000021 |
| 0 | TCODE | 0.000000 |
| 223 | OEDC4 | 0.000000 |
| 231 | EC5 | 0.000000 |
| 230 | EC4 | 0.000000 |
| 229 | EC3 | 0.000000 |
| 228 | EC2 | 0.000000 |
| 227 | EC1 | 0.000000 |
| 226 | OEDC7 | 0.000000 |
| ... | ... | ... |


In [108]:
# Drop null values
donors.dropna(inplace = True)
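
Dropping rows is reasonable here because only two values are missing. The lab instructions also allow replacing nulls, and one possible way to do that is median imputation. The sketch below uses a toy frame with hypothetical values; only the column names `FIRSTDATE_MM` and `AGE` come from the dataset, and the choice of the median is an assumption, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry, like FIRSTDATE_MM
toy = pd.DataFrame({
    "FIRSTDATE_MM": [1.0, 6.0, np.nan, 9.0, 6.0],
    "AGE": [60.0, 45.0, 72.0, 38.0, 51.0],
})

# Fill the nulls of each numeric column with that column's median
numeric_cols = toy.select_dtypes(include=[np.number]).columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy_filled.isna().sum().sum())
```

Unlike `dropna`, this keeps every row, which matters more when many rows have a missing value.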

In [109]:
# X y split 
y = donors['TARGET_B']
X = donors.drop(['TARGET_B', 'TARGET_D'], axis = 1)

# Split the data into numerical and categorical
numericalX = X.select_dtypes(include = [np.number])
categoricalX = X.select_dtypes(exclude = [np.number])

In [110]:
# One-hot encoding categorical features
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(drop='first').fit(categoricalX)
encoded_categorical = encoder.transform(categoricalX).toarray()
encoded_categorical = pd.DataFrame(encoded_categorical)
encoded_categorical.columns = [str(col) if isinstance(col, int) else col for col in encoded_categorical.columns] # columns as strings

# Normalizing
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
numericalX_norm = min_max_scaler.fit_transform(numericalX)
numericalX_norm = pd.DataFrame(numericalX_norm, columns=numericalX.columns)  # Ensure to use the original column names

# Concat again the data
X = pd.concat([numericalX_norm, encoded_categorical], axis = 1)
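
The cell above fits the encoder and the scaler on the full dataset before splitting, whereas the lab instructions ask to fit only on the training data. A minimal sketch of that ordering follows, on toy data. The column names `AGE` and `STATE`, the values, and the helper `prepare` are hypothetical, and `handle_unknown="ignore"` (without `drop='first'`) is an assumption made so that a category missing from the training split does not raise an error.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (hypothetical columns and values)
toy_X = pd.DataFrame({
    "AGE": [60, 45, 72, 38, 51, 66, 29, 80],
    "STATE": ["FL", "CA", "FL", "TX", "CA", "TX", "FL", "CA"],
})
toy_y = pd.Series([0, 0, 1, 0, 0, 1, 0, 0], name="TARGET_B")

# Split first, so the test rows never influence the fitted transformers
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

num_cols, cat_cols = ["AGE"], ["STATE"]

# Fit on the training rows only
scaler = MinMaxScaler().fit(X_tr[num_cols])
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_tr[cat_cols])

def prepare(X):
    """Apply the already fitted scaler and encoder, then re-join the parts."""
    scaled = scaler.transform(X[num_cols])
    encoded = encoder.transform(X[cat_cols]).toarray()
    return np.hstack([scaled, encoded])

X_tr_ready = prepare(X_tr)
X_te_ready = prepare(X_te)
print(X_tr_ready.shape, X_te_ready.shape)
```

With this ordering, the minimum and maximum used for scaling and the category list used for encoding come from the training split alone.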



In [111]:
# Train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0, test_size = 0.2)

X_train.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)

train = pd.concat([X_train, y_train],axis=1)
print(train['TARGET_B'].value_counts())







0    72496
1     3832
Name: TARGET_B, dtype: int64


In [112]:
# Logistic Regression on imbalanced data
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 100)

# Fit the model
lr.fit(X_train,y_train)

# Predictions
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}") # 0.947



Accuracy: 0.9469657268630123


In [122]:
from sklearn.metrics import precision_score #precision metrics
from sklearn.metrics import recall_score #recall metrics
from sklearn.metrics import f1_score #f1 metrics
from sklearn.metrics import classification_report #classification metrics

# In a healthcare donation setting, recall on the minority class (donors) is the metric to maximize

pred = lr.predict(X_test)

print("Precision is : ", precision_score(y_test, pred))
print("Recall is : ", recall_score(y_test, pred))
print("F1 is : ", f1_score(y_test, pred))

print(classification_report(y_test, pred))



Precision is :  0.06921086675291074
Recall is :  0.5291790306627102
F1 is :  0.12241162338405218
              precision    recall  f1-score   support

           0       0.96      0.60      0.74     18071
           1       0.07      0.53      0.12      1011

    accuracy                           0.60     19082
   macro avg       0.51      0.57      0.43     19082
weighted avg       0.91      0.60      0.71     19082




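#### The logistic regression model trained on the imbalanced data reaches an accuracy of 0.947 (94.7%) on the test set, which might initially seem excellent. That figure is almost exactly the share of class 0 in the test set (18071 of 19082 rows), so it is what a model predicting the majority class for nearly every row would score, with precision, recall, and F1 for the minority class (class 1) close to 0. The classification report printed above does not show this: it reports an accuracy of 0.60 and a class 1 recall of 0.53, and its cell (In [122]) carries a higher execution number than the resampling cells, which suggests it was run after `lr` had been refit on resampled data. Either way, the lesson holds: on imbalanced datasets the model becomes biased towards the majority class (class 0 in this case), so high overall accuracy can hide poor performance on the minority class.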
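
To put the accuracy of about 0.947 in context, it helps to compute what a trivial classifier that always predicts class 0 would score. The short calculation below uses only the class supports printed in the classification report (18071 and 1011); the variable names are illustrative.

```python
# Class supports in the test set, as printed in the classification report
n_class_0 = 18071
n_class_1 = 1011
n_total = n_class_0 + n_class_1

# Accuracy of a trivial "model" that predicts class 0 for every row
baseline_accuracy = n_class_0 / n_total

# Such a model never predicts a donor, so its recall on class 1 is zero
baseline_recall_class_1 = 0 / n_class_1

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall on class 1: {baseline_recall_class_1:.1f}")
```

Any accuracy at or below this baseline means the model adds nothing over always predicting "no donation", which is why precision, recall, and F1 on class 1 are the more informative metrics here.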

## Managing imbalance in the dataset




In [114]:
# Check for the imbalance
display(train['TARGET_B'].value_counts()) # 0 -> 72496 # 1 -> 3832

# Separate majority and minority classes
category_0 = train[train['TARGET_B'] == 0]
category_1 = train[train['TARGET_B'] == 1]



0    72496
1     3832
Name: TARGET_B, dtype: int64

### Downsampling

In [115]:
# Downsample the majority class
from sklearn.utils import resample

category_0_downsampled = resample(category_0, 
                                  replace=False,    # sample without replacement
                                  n_samples=len(category_1),  # to match minority class
                                  random_state=123) # reproducible results

# Combine minority class with downsampled majority class
train_downsampled = pd.concat([category_0_downsampled, category_1])
train_downsampled = train_downsampled.reset_index(drop = True)

train_downsampled["TARGET_B"].value_counts()

0    3832
1    3832
Name: TARGET_B, dtype: int64

In [116]:
# X y split
X_down = train_downsampled.drop(['TARGET_B'], axis = 1)
y_down = train_downsampled['TARGET_B']

# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 100)

# Fit the model
lr.fit(X_down,y_down)

# Predictions (note: made on the downsampled training data, not on X_test)
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_down)
accuracy = accuracy_score(y_down, y_pred)

print(f"Accuracy: {accuracy}") # 0.609

Accuracy: 0.6092118997912317


In [117]:

print("Precision is : ", precision_score(y_down, y_pred))
print("Recall is : ", recall_score(y_down, y_pred))
print("F1 is : ", f1_score(y_down, y_pred))

print(classification_report(y_down, y_pred))

Precision is :  0.6121082239485668
Recall is :  0.596294363256785
F1 is :  0.604097818902842
              precision    recall  f1-score   support

           0       0.61      0.62      0.61      3832
           1       0.61      0.60      0.60      3832

    accuracy                           0.61      7664
   macro avg       0.61      0.61      0.61      7664
weighted avg       0.61      0.61      0.61      7664
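
The scores above are computed on the downsampled training data. The lab instructions ask for accuracy on the test data, so a fairer check is to resample only the training split and evaluate on the untouched test split. The sketch below shows that pattern on synthetic data from `make_classification`; the dataset, the column names `f0` to `f9` and `target`, and `max_iter=1000` are all assumptions for illustration, not the notebook's own values.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data (hypothetical, roughly 5% positives)
X_arr, y_arr = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])
y_syn = pd.Series(y_arr, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# Downsample the majority class in the training split only
train_syn = pd.concat([X_tr, y_tr], axis=1)
maj = train_syn[train_syn["target"] == 0]
mino = train_syn[train_syn["target"] == 1]
maj_down = resample(maj, replace=False, n_samples=len(mino), random_state=0)
train_bal = pd.concat([maj_down, mino])

model = LogisticRegression(max_iter=1000)
model.fit(train_bal.drop("target", axis=1), train_bal["target"])

# Evaluate on the untouched, still imbalanced test split
test_pred = model.predict(X_te)
print(classification_report(y_te, test_pred))
```

In the notebook, the equivalent change would be to call `lr.predict(X_test)` and compare against `y_test` after fitting on `X_down`, `y_down`.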



### Oversampling

In [118]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_over = train.drop('TARGET_B', axis = 1)
y_over = train['TARGET_B']

X_sm, y_sm = smote.fit_resample(X_over, y_over)
y_sm.value_counts()


0    72496
1    72496
Name: TARGET_B, dtype: int64
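
Unlike random upsampling, SMOTE does not duplicate rows: it creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. The NumPy sketch below illustrates that single idea on toy 2D points. It is a simplified illustration, not imblearn's implementation, and the function name `synth_point` and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2D (hypothetical values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [1.1, 1.4]])

def synth_point(points, i, k=2):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    dists = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # position 0 is the point itself
    j = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight, uniform in [0, 1)
    return points[i] + lam * (points[j] - points[i])

new_point = synth_point(minority, i=0)
print(new_point)
```

Each synthetic point lies on the segment between two real minority points, which is why SMOTE can reach a 72496 to 72496 balance without repeating any original row.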

In [119]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter = 100)
# Fit the model
lr.fit(X_sm,y_sm)

# Predictions (note: made on the SMOTE-resampled training data, not on X_test)
from sklearn.metrics import accuracy_score

y_pred = lr.predict(X_sm)
accuracy = accuracy_score(y_sm, y_pred)

print(f"Accuracy: {accuracy}") # 0.619


Accuracy: 0.6188548333701169


#### In the downsampled scenario, the training set is reduced so that the classes are balanced (3832 instances each). The metrics, which are computed on the downsampled training data rather than on the test set, are similar for both classes (precision, recall, and F1 of about 0.60 to 0.62), so the model treats the two classes evenly. This balance matters in scenarios where both false positives and false negatives carry significant consequences.

In [120]:


print("Precision is : ", precision_score(y_sm, y_pred))
print("Recall is : ", recall_score(y_sm, y_pred))
print("F1 is : ", f1_score(y_sm, y_pred))

print(classification_report(y_sm, y_pred))

Precision is :  0.6168988861604418
Recall is :  0.6272208121827412
F1 is :  0.6220170308812969
              precision    recall  f1-score   support

           0       0.62      0.61      0.62     72496
           1       0.62      0.63      0.62     72496

    accuracy                           0.62    144992
   macro avg       0.62      0.62      0.62    144992
weighted avg       0.62      0.62      0.62    144992



#### With the upsampled dataset, SMOTE increases the minority class to match the majority class, resulting in a much larger training set (72,496 instances per class). Measured on the resampled training data, the metrics are marginally higher than for the downsampled data (for example, class 1 recall of 0.63 versus 0.60). Because neither set of scores was computed on the held-out test set, this small difference does not by itself show that the upsampled model generalizes better.