# Activity: Comparing Imbalanced Classifiers

In this activity, you’ll fit various balanced and imbalanced models to small business loan data. You’ll then compare the results by using the metrics that you’ve learned.


In [7]:
# Import the required modules
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.metrics import balanced_accuracy_score
from imblearn.metrics import classification_report_imbalanced

import warnings
warnings.filterwarnings("ignore")

## Step 1: Read in the CSV file from the `Resources` folder into a Pandas DataFrame. 

In [8]:
# Read the sba_loans.csv file from the hosted URL into a Pandas DataFrame
loans_df = pd.read_csv('https://static.bc-edx.com/mbc/ai/m5/datasets/sba_loans.csv')

# Review the DataFrame
loans_df

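| Unnamed: 0 | Year | Month | Amount | Term | Zip | CreateJob | NoEmp | RealEstate | RevLineCr | UrbanRural | Default |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001 | 11 | 32812 | 36 | 92801 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 2001 | 4 | 30000 | 56 | 90505 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2001 | 4 | 30000 | 36 | 92103 | 0 | 10 | 0 | 1 | 0 | 0 |
| 3 | 2003 | 10 | 50000 | 36 | 92108 | 0 | 6 | 0 | 1 | 0 | 0 |
| 4 | 2006 | 7 | 343000 | 240 | 91345 | 3 | 65 | 1 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1541 | 2006 | 6 | 150000 | 60 | 92346 | 0 | 5 | 0 | 0 | 2 | 0 |
| 1542 | 1997 | 4 | 99000 | 300 | 92021 | 0 | 4 | 1 | 0 | 0 | 0 |
| 1543 | 1997 | 2 | 50000 | 84 | 93012 | 0 | 2 | 0 | 0 | 0 | 0 |
| 1544 | 1997 | 1 | 251150 | 120 | 91352 | 0 | 3 | 0 | 0 | 0 | 0 |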


## Step 2: Create a Series named `y` that contains the data from the "Default" column of the original DataFrame. Note that this Series will contain the labels. Create a new DataFrame named `X` that contains the remaining columns from the original DataFrame. Note that this DataFrame will contain the features.

In [9]:
# Split the data into X (features) and y (labels)

# The y variable should focus on the Default column
y = loans_df['Default']

# The X variable should include all features except the Default column
X = loans_df.drop(columns='Default')


## Step 3: Split the features and labels into training and testing sets, and scale the X data with `StandardScaler`.

In [10]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [11]:
# Scale the data
scaler = StandardScaler()
X_scaler = scaler.fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
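
As a rough illustration of what the scaling cell does: `StandardScaler` learns each column's mean and population standard deviation from the training data, then reuses those same statistics to transform the testing data. The sketch below reproduces that arithmetic for a single column in plain Python. The loan amounts are made-up illustrative values, and `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import fmean, pstdev

# Illustrative loan amounts only (not taken from the actual split)
train_amounts = [30000.0, 50000.0, 150000.0, 250000.0]
test_amounts = [99000.0, 343000.0]

# "Fit": learn the mean and population standard deviation from training data only
mean = fmean(train_amounts)
std = pstdev(train_amounts)


def standardize(values, mean, std):
    """Hypothetical helper: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]


# "Transform": apply the training statistics to both sets
train_scaled = standardize(train_amounts, mean, std)
test_scaled = standardize(test_amounts, mean, std)

print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and unit variance by construction. The testing column generally does not, because it is transformed with the training statistics rather than its own.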

## Step 4: Check the magnitude of imbalance in the dataset by viewing the count of each distinct value (`value_counts`) for the labels.

In [12]:
# Count the distinct values in the original labels data
y.value_counts()

0    1411
1     135
Name: Default, dtype: int64

In [13]:
# Count the distinct values in the training labels
y_train.value_counts()

0    1063
1      96
Name: Default, dtype: int64
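
To put a number on the imbalance, the counts above can be turned into a minority-class share and a majority-to-minority ratio. This is a minimal sketch that hard-codes the counts from the two `value_counts()` outputs; `minority_share` is a hypothetical helper name.

```python
# Class counts copied from the value_counts() outputs above
full_counts = {0: 1411, 1: 135}
train_counts = {0: 1063, 1: 96}


def minority_share(counts):
    """Fraction of rows that belong to the default class (label 1)."""
    return counts[1] / sum(counts.values())


full_share = minority_share(full_counts)
train_share = minority_share(train_counts)
imbalance_ratio = full_counts[0] / full_counts[1]

print(f"Defaults in full data:     {full_share:.1%}")
print(f"Defaults in training data: {train_share:.1%}")
print(f"Non-defaults per default:  {imbalance_ratio:.1f}")
```

Fewer than one loan in ten defaulted, and the training share is slightly lower than the overall share because the split in Step 3 is not stratified. Passing `stratify=y` to `train_test_split` would keep the two shares nearly equal.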

## Step 5: Fit two versions of a random forest model to the data: the first, a regular `RandomForest` classifier, and the second, a `BalancedRandomForest` classifier.

In [15]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=1)

# Fitting the model
rf_model = rf_model.fit(X_train_scaled, y_train)

# Making predictions using the testing data
rf_predictions = rf_model.predict(X_test_scaled)


In [16]:
# Import BalancedRandomForestClassifier from imblearn
from imblearn.ensemble import BalancedRandomForestClassifier

# Instantiate a BalancedRandomForestClassifier instance
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)

# Fit the model to the training data
brf = brf.fit(X_train_scaled, y_train)

In [17]:
# Predict labels for testing features
brf_predictions = brf.predict(X_test_scaled)
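
Conceptually, a balanced random forest trains each tree on a sample in which the majority class has been randomly undersampled to match the minority class. The sketch below illustrates only that sampling idea, using the training counts from Step 4. It is not imblearn's implementation, whose defaults for sampling strategy and replacement differ between versions, and `balanced_sample` is a hypothetical helper.

```python
import random
from collections import Counter

# Training labels rebuilt from the counts in Step 4: 1,063 non-defaults, 96 defaults
labels = [0] * 1063 + [1] * 96


def balanced_sample(labels, rng):
    """Hypothetical helper: return row indices with equal counts per class."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_minority = min(len(rows) for rows in by_class.values())
    sample = []
    for rows in by_class.values():
        # Undersample every class down to the minority-class size
        sample.extend(rng.sample(rows, n_minority))
    return sample


rng = random.Random(1)
tree_rows = balanced_sample(labels, rng)
tree_class_counts = Counter(labels[i] for i in tree_rows)
print(tree_class_counts)
```

Each tree would then see 96 rows of each class, so defaults are no longer outvoted during training, at the cost of discarding most non-default rows for any single tree.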

## Step 6: Resample and fit the training data by one additional method for imbalanced data, such as `RandomOverSampler`, undersampling, or a synthetic technique. Re-estimate with `RandomForest`.

In [27]:
# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# Instantiate the SMOTE model instance
smote = SMOTE(random_state=1, sampling_strategy='auto')

# Fit the SMOTE model to the training data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Fit the RandomForestClassifier on the resampled data
model_resampled_rf = RandomForestClassifier()
model_resampled_rf.fit(X_resampled, y_resampled)

# Generate predictions based on the resampled data model
rf_resampled_predictions = model_resampled_rf.predict(X_test)
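
For intuition about what SMOTE adds to the training set: each synthetic minority row is an interpolation between an existing minority row and one of its nearest minority-class neighbors. The sketch below shows only that interpolation step on two made-up rows. It skips the nearest-neighbor search that the real algorithm performs, and `smote_point` is a hypothetical helper, not an imblearn function.

```python
import random


def smote_point(sample, neighbor, rng):
    """Hypothetical helper: interpolate between a minority row and a neighbor."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


rng = random.Random(1)

# Two made-up minority-class rows with features (Amount, Term)
sample = [30000.0, 36.0]
neighbor = [50000.0, 60.0]

synthetic = smote_point(sample, neighbor, rng)
print(synthetic)
```

The synthetic row always lies on the line segment between the two original rows, which is why SMOTE produces new but plausible minority examples instead of exact duplicates.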

## Step 7: Print the confusion matrices, balanced accuracy scores, and classification reports for the three different models.

In [28]:
# Print the confusion matrix for RandomForest on the original data
confusion_matrix(y_test, rf_predictions)

array([[338,  10],
       [ 13,  26]], dtype=int64)

In [29]:
# Print the confusion matrix for the balanced random forest model
confusion_matrix(y_test, brf_predictions)

array([[309,  39],
       [  5,  34]], dtype=int64)

In [30]:
# Print the confusion matrix for your additional model on the resampled data
confusion_matrix(y_test, rf_resampled_predictions)

array([[330,  18],
       [  8,  31]], dtype=int64)
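
One way to read these matrices: scikit-learn puts true labels on the rows and predicted labels on the columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below copies the three matrices from the outputs above and spells out how many defaults each model caught or missed. The model names used as dictionary keys are labels chosen for this example.

```python
# Confusion matrices copied from the outputs above
# Rows are true labels, columns are predicted labels: [[TN, FP], [FN, TP]]
matrices = {
    "RandomForest": [[338, 10], [13, 26]],
    "BalancedRandomForest": [[309, 39], [5, 34]],
    "SMOTE + RandomForest": [[330, 18], [8, 31]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    summary[name] = {"caught": tp, "missed": fn, "false_alarms": fp}
    print(f"{name}: caught {tp} of {tp + fn} defaults, "
          f"missed {fn}, raised {fp} false alarms")
```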

In [31]:
# Print the balanced accuracy score for the original data
balanced_accuracy_score(y_test, rf_predictions)

0.8189655172413792

In [32]:
# Print the balanced accuracy score for the balanced random forest model
balanced_accuracy_score(y_test, brf_predictions)

0.8798629531388152

In [33]:
# Print the balanced accuracy score for your additional model with resampled data
balanced_accuracy_score(y_test, rf_resampled_predictions)

0.8715738284703802
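
The scores above are balanced accuracy, which for a binary problem is the average of the recall on each class. The sketch below recomputes it from the confusion matrices printed earlier and sets it beside plain accuracy to show why the two can rank models differently on imbalanced data. The helper functions and dictionary keys are hypothetical names for this example.

```python
# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
matrices = {
    "RandomForest": [[338, 10], [13, 26]],
    "BalancedRandomForest": [[309, 39], [5, 34]],
    "SMOTE + RandomForest": [[330, 18], [8, 31]],
}


def plain_accuracy(matrix):
    """Share of all test rows that were classified correctly."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


def balanced_accuracy(matrix):
    """Average of recall on class 0 (specificity) and recall on class 1."""
    (tn, fp), (fn, tp) = matrix
    specificity = tn / (tn + fp)
    recall = tp / (tp + fn)
    return (specificity + recall) / 2


scores = {
    name: (plain_accuracy(m), balanced_accuracy(m)) for name, m in matrices.items()
}
for name, (acc, bal_acc) in scores.items():
    print(f"{name}: accuracy={acc:.4f}, balanced accuracy={bal_acc:.4f}")
```

Plain accuracy is dominated by the 348 non-default rows, so the unbalanced random forest looks best on that measure even though it misses the most defaults. Balanced accuracy weights both classes equally and reverses the ranking.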

In [34]:
# Print the classification report for the original data
print(classification_report_imbalanced(y_test, rf_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.96      0.97      0.67      0.97      0.80      0.67       348
          1       0.72      0.67      0.97      0.69      0.80      0.63        39

avg / total       0.94      0.94      0.70      0.94      0.80      0.66       387



In [35]:
# Print the classification report for the balanced random forest data
print(classification_report_imbalanced(y_test, brf_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.98      0.89      0.87      0.93      0.88      0.78       348
          1       0.47      0.87      0.89      0.61      0.88      0.77        39

avg / total       0.93      0.89      0.87      0.90      0.88      0.78       387



In [36]:
# Print the classification report for your additional model with resampled data
print(classification_report_imbalanced(y_test, rf_resampled_predictions))

                   pre       rec       spe        f1       geo       iba       sup

          0       0.98      0.95      0.79      0.96      0.87      0.77       348
          1       0.63      0.79      0.95      0.70      0.87      0.74        39

avg / total       0.94      0.93      0.81      0.94      0.87      0.76       387
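
The class 1 rows of these reports can be traced back to the confusion matrices. The sketch below recomputes precision (`pre`), recall (`rec`), specificity (`spe`), F1 (`f1`), and the geometric mean of recall and specificity (`geo`) for the default class. It leaves out `iba`, and `default_class_metrics` is a hypothetical helper, not an imblearn function.

```python
from math import sqrt

# Confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
matrices = {
    "RandomForest": [[338, 10], [13, 26]],
    "BalancedRandomForest": [[309, 39], [5, 34]],
    "SMOTE + RandomForest": [[330, 18], [8, 31]],
}


def default_class_metrics(matrix):
    """Hypothetical helper: report-style metrics for class 1 (default)."""
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)    # flagged loans that really defaulted
    recall = tp / (tp + fn)       # defaults that were flagged
    specificity = tn / (tn + fp)  # non-defaults that were left unflagged
    f1 = 2 * precision * recall / (precision + recall)
    geo = sqrt(recall * specificity)
    return {"pre": precision, "rec": recall, "spe": specificity, "f1": f1, "geo": geo}


report = {name: default_class_metrics(m) for name, m in matrices.items()}
for name, metrics in report.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

For flagging loans that eventually default, the `rec` column for class 1 is the most direct measure, while `pre` shows how many of the flags turn out to be false alarms.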



## Step 8: Evaluate the effectiveness of `RandomForest`, `BalancedRandomForest`, and your one additional imbalanced classifier for predicting the minority class. 

**Question:** Does the model generated using one of the imbalanced methods more accurately flag all the loans that eventually defaulted?
    
**Answer:**