Imbalanced Classification
a) **Defining the Question**
Using past behaviour data, predict whether a customer will leave the bank soon.

b) **Defining the Metric for Success**
Build a model with the maximum possible F1 score. To pass the project, you need an F1 score of at least 0.59. Check the F1 for the test set.
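For reference, F1 is the harmonic mean of precision and recall. The sketch below checks that definition against `f1_score` on made-up labels; the arrays are illustrative and are not taken from the bank data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for illustration only (not from the bank dataset)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"manual F1={f1_manual:.3f} sklearn F1={f1_sklearn:.3f}")
```

Because F1 ignores true negatives, a model can score high accuracy on an imbalanced target and still have a very low F1.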

c) **Understanding the Context**
Beta Bank customers are leaving: little by little, churning away every month. The bank has figured out that it is easier to keep existing customers from churning than to attract new ones. We need to predict whether a customer will leave the bank soon. You have the data on clients’ past behavior and termination of contracts with the bank.

d) **Recording the Experimental Design**
Describe the steps/approach that you will use to answer the given question.

Data Exploration

Data Preparation

Data Modeling

Summary of Findings and Recommendations


e) **Data Relevance**
How relevant was the provided data? Answer: the provided data was relevant to the question.

In [30]:

# Importing libraries
# ---
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn import tree

In [31]:
#Reading the Data
url = 'https://bit.ly/2XZK7Bo'
model_df = pd.read_csv(url)

model_df.head()


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [None]:
#Exploring the data
print("Number of rows:", len(model_df))
print("Number of columns:", len(model_df.columns))


In [33]:
model_df.shape

(10000, 14)

In [34]:
#Exploring the Data types
print(model_df.dtypes)


RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object


In [35]:
#overview of the data and identifying missing values.
model_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [36]:
#checking for missing values in the DataFrame
model_df.isnull().sum()


RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [37]:
#checking for duplicate values in the DataFrame
model_df.duplicated().sum()

0

In [38]:
#overview of the distribution and range of values in each numerical column
model_df.describe()


| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 9091.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 15690940.0 | 650.5288 | 38.9218 | 4.99769 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 2886.89568 | 71936.19 | 96.653299 | 10.487806 | 2.894723 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 1.0 | 15565700.0 | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 2500.75 | 15628530.0 | 584.0 | 32.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 5000.5 | 15690740.0 | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 7500.25 | 15753230.0 | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 10000.0 | 15815690.0 | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


**Observations are as below**

> The dataset contains 10,000 rows and 14 columns, 11 of which are numeric and appear in the summary above.

> The Age column has a mean value of 38.92 and a standard deviation of 10.49, with a minimum value of 18 and a maximum value of 92.

> The Tenure column has missing values, as the count for this column is 9091 instead of 10,000.

> The Balance column has a mean value of 76,485.89 and a standard deviation of 62,397.41, with a minimum value of 0 and a maximum value of 250,898.09.

**Data Preparation and Cleaning**



In [39]:
#Converting all column names to lowercase and remove any leading/trailing whitespaces in column names
model_df.columns = [col.lower().strip() for col in model_df.columns]
model_df.columns

Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited'],
      dtype='object')

In [40]:

#Filling missing values in the 'tenure' column with the mean value of the column.
#Only 'tenure' has missing values; restricting the fill to it avoids calling mean() on the text columns.
model_df['tenure'] = model_df['tenure'].fillna(model_df['tenure'].mean())
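Filling with the mean of the whole column lets test rows influence the fill value. One common alternative, sketched here under the assumption that the split happens before imputation, is to learn the mean on the training rows only with `SimpleImputer`. The small `demo` frame is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up tenure values with gaps (illustrative, not the bank data)
demo = pd.DataFrame({"tenure": [2.0, 1.0, np.nan, 8.0, 5.0, np.nan, 3.0, 7.0]})
train, test = train_test_split(demo, test_size=0.25, random_state=12)

# Learn the fill value from the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train[["tenure"]])
test_filled = imputer.transform(test[["tenure"]])

print("Fill value learned from train:", imputer.statistics_[0])
```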


In [41]:
#Remove unnecessary columns such as 'RowNumber', 'CustomerId', and 'Surname' as they do not provide any useful information for our analysis.
model_df = model_df.drop(['rownumber', 'customerid', 'surname'], axis=1)

model_df.head()

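| | creditscore | geography | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |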


In [42]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [43]:
#Encode the 'geography' and 'gender' text columns as integer labels
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

model_df["gender"] = le.fit_transform(model_df["gender"])
model_df["geography"] = le.fit_transform(model_df["geography"])
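`LabelEncoder` maps each category to an integer code, which a linear model such as logistic regression reads as an ordered quantity. A possible alternative for a nominal column like `geography` is one-hot encoding. The sketch below uses `pd.get_dummies` on a small made-up frame; it is not what the notebook ran, and switching to it would change the downstream scores.

```python
import pandas as pd

# Small illustrative frame; column names mirror the notebook's lowercase names
demo = pd.DataFrame({
    "geography": ["France", "Spain", "Germany", "France"],
    "gender": ["Female", "Female", "Male", "Male"],
    "age": [42, 41, 39, 43],
})

# One indicator column per category; drop_first=True drops the first category
# of each column, which would otherwise be redundant
encoded = pd.get_dummies(demo, columns=["geography", "gender"], drop_first=True)
print(encoded.columns.tolist())
```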

In [44]:
#Inspecting the new data
model_df.head()

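| | creditscore | geography | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 0 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | 2 | 0 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | 0 | 0 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | 0 | 0 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | 2 | 0 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |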


In [45]:
#Splitting the dataframe into features (X) and target variable (y) for modeling purposes.


features = model_df.drop("exited", axis = 1)
target = model_df["exited"]

X_train, X_test, Y_train, Y_test = train_test_split(features, target, test_size = 0.25, random_state = 12)

print(f'X_train has {X_train.shape[0]} rows, Y_train also has  {Y_train.shape[0]} rows')

print(f'X_test has {X_test.shape[0]} rows, Y_test also has  {Y_test.shape[0]} rows')


X_train has 7500 rows, Y_train also has  7500 rows
X_test has 2500 rows, Y_test also has  2500 rows
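With an imbalanced target, a plain random split can leave the train and test sets with slightly different class shares. A minimal sketch of a stratified split follows, using synthetic data with roughly the same 80/20 ratio; `X_demo` and `y_demo` are illustrative names, not notebook variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and target with an 80/20 class ratio (illustrative only)
rng = np.random.default_rng(12)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```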


In [46]:
#checking for class imbalance in the target variable of the dataset. 
# Compute and print the count of examples in each class
class_counts = model_df["exited"].value_counts()
print("Exited:", class_counts[1])
print("Did not exit:", class_counts[0])


Exited: 2037
Did not exit: 7963
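The raw counts can also be expressed as class shares and as a majority-to-minority ratio, which is handy when choosing resampling factors later. The sketch rebuilds a target series from the two counts printed above rather than reading the data again.

```python
import pandas as pd

# Target rebuilt from the counts reported above: 2037 exited, 7963 did not
exited = pd.Series([1] * 2037 + [0] * 7963, name="exited")

# Share of each class and the majority-to-minority ratio
shares = exited.value_counts(normalize=True)
ratio = (exited == 0).sum() / (exited == 1).sum()

print(shares)
print(f"Majority-to-minority ratio: {ratio:.2f}")
```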


In [47]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score

# Split the data into train and test sets (assuming this has already been done)
# X_train, X_test, Y_train, Y_test = ...

# Train a logistic regression model on the imbalanced data
imbalanced_log_model = LogisticRegression(solver='liblinear', random_state=12345)
imbalanced_log_model.fit(X_train, Y_train)

# Evaluate the model on the test set
Y_pred = imbalanced_log_model.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)
auc = roc_auc_score(Y_test, imbalanced_log_model.predict_proba(X_test)[:,1])
print(f"Accuracy: {accuracy:.4f}")
print(f"F1 score: {f1:.4f}")
print(f"AUC: {auc:.4f}")

# Evaluate the model using cross-validation (optional)
cv_scores = cross_val_score(imbalanced_log_model, X_train, Y_train, cv=5, scoring='roc_auc')
print(f"Cross-validation AUC: {np.mean(cv_scores):.4f}")


Accuracy: 0.7764
F1 score: 0.0851
AUC: 0.6758
Cross-validation AUC: 0.6645


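**Evaluating the logistic regression model on imbalanced data**

The accuracy of 0.7764 is close to the 0.7963 share of customers who did not exit (7,963 of 10,000), which is roughly what always predicting "did not exit" would achieve. The F1 score of 0.0851 shows that the model identifies almost none of the customers who leave.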
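A confusion matrix makes the gap between accuracy and F1 easier to see. The sketch below uses made-up labels for a hypothetical classifier that flags the positive class only once; it does not reproduce the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Illustrative labels: 80 negatives followed by 20 positives
y_true = np.array([0] * 80 + [1] * 20)

# Hypothetical classifier that predicts the positive class for a single row
y_pred = np.zeros(100, dtype=int)
y_pred[80] = 1

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy: {accuracy:.4f}, F1: {f1:.4f}")
```

The many true negatives keep accuracy high, while the missed positives (false negatives) pull recall, and therefore F1, close to zero.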

In [48]:
#upsampling the minority class in our imbalanced dataset 
from sklearn.utils import shuffle
def upsample(X, y, repeat):
    features_zeros = X[y == 0]
    features_ones = X[y == 1]
    target_zeros = y[y == 0]
    target_ones = y[y == 1]
    
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=42)

    return features_upsampled, target_upsampled


features_upsampled, target_upsampled = upsample(X_train, Y_train, 10)

model_up = LogisticRegression(random_state=42, solver='liblinear')
model_up.fit(features_upsampled, target_upsampled)
predicted_valid = model_up.predict(X_test)
print('F1 score:', f1_score(Y_test, predicted_valid))

F1 score: 0.3957337256344244


**Observations**
The `upsample` function repeats every minority-class row (`exited == 1`) `repeat` times, here 10, and then shuffles the result. Because roughly one customer in five exited, ten copies make the exited class the larger one in the training data instead of matching the two classes exactly.

The upsampled data is then used to fit a logistic regression model (`LogisticRegression` from `sklearn.linear_model`), which raises the test F1 score from 0.0851 to 0.3957.
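If the goal is to bring the two classes to roughly equal size, the repeat factor can be derived from the class counts instead of being fixed by hand. A minimal sketch, assuming a binary target; `balancing_repeat` is a hypothetical helper and `y_demo` is made-up data.

```python
import pandas as pd

def balancing_repeat(y):
    """Return how many copies of each minority row roughly equalize the classes."""
    counts = y.value_counts()
    return max(1, int(round(counts.max() / counts.min())))

# Illustrative target with an 80/20 split (not the bank data)
y_demo = pd.Series([0] * 800 + [1] * 200)
repeat = balancing_repeat(y_demo)

# Same concatenation pattern as upsample(), applied to the target only
y_up = pd.concat([y_demo[y_demo == 0]] + [y_demo[y_demo == 1]] * repeat)
print("repeat:", repeat)
print(y_up.value_counts().to_dict())
```

Whether an exactly balanced training set gives the best F1 is an empirical question, so the repeat factor is worth treating as a tunable value.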



In [49]:
#using down-sampling to balance the class distribution in the training data
def downsample(X, y, fraction):
    features_zeros = X[y == 0]
    features_ones = X[y == 1]
    target_zeros = y[y == 0]
    target_ones = y[y == 1]

    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=42)]+ [features_ones])

    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=42)]+ [target_ones])

    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=42)

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(X_train, Y_train, 0.1)

model_down = LogisticRegression(random_state=42, solver='liblinear')
model_down.fit(features_downsampled, target_downsampled)
predicted_valid = model_down.predict(X_test)
print('F1 score:', f1_score(Y_test, predicted_valid))



F1 score: 0.37289278489548205


**Observation**
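The downsampled data is created by randomly selecting a fraction of the majority class (here `fraction=0.1`, i.e. 10% of the customers who did not exit) and concatenating it with all instances of the minority class. The resulting data is then shuffled before being returned. The test F1 score of 0.3729 is slightly lower than the 0.3957 obtained with upsampling.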
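The fraction that would leave both classes the same size can likewise be computed from the class counts. The sketch below assumes class 0 is the majority and uses made-up data; only the target is resampled to keep it short.

```python
import pandas as pd

# Illustrative target with an 80/20 split (not the bank data)
y_demo = pd.Series([0] * 800 + [1] * 200)

# Fraction of majority rows to keep so that both classes end up the same size
counts = y_demo.value_counts()
fraction = counts.min() / counts.max()

# Same pattern as downsample(): sample the zeros, keep all the ones
zeros_kept = y_demo[y_demo == 0].sample(frac=fraction, random_state=42)
y_down = pd.concat([zeros_kept, y_demo[y_demo == 1]])

print("fraction:", fraction)
print(y_down.value_counts().to_dict())
```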

In [59]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score

# Create and fit the random forest classifier
rfc = RandomForestClassifier(random_state=42, class_weight='balanced')
rfc.fit(X_train, Y_train)

# Make predictions on the test set
y_pred = rfc.predict(X_test)
y_proba = rfc.predict_proba(X_test)[:,1]

# Evaluate the classifier using different metrics
print("F1 score:", f1_score(Y_test, y_pred))
print('Accuracy:', accuracy_score(Y_test, y_pred))
print("AUC-ROC:", roc_auc_score(Y_test, y_proba))

     





F1 score: 0.5750873108265425
Accuracy: 0.854
AUC-ROC: 0.8605980955975959
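The F1 score of 0.5751 is just below the 0.59 target. One option that is sometimes tried at this point is moving the decision threshold away from the default 0.5. The sketch below shows the idea on synthetic data with a separate validation split, so that the threshold is not chosen on the test set. The data and every variable name are illustrative, and the sketch makes no claim about what score the bank data would reach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
model.fit(X_tr, y_tr)
proba_val = model.predict_proba(X_val)[:, 1]

# Score each candidate threshold on the validation split, not on the test set
thresholds = np.round(np.arange(0.10, 0.91, 0.05), 2)
scores = [f1_score(y_val, (proba_val >= t).astype(int)) for t in thresholds]
best_idx = int(np.argmax(scores))
best_threshold, best_f1 = thresholds[best_idx], scores[best_idx]

print(f"Best threshold: {best_threshold:.2f}, validation F1: {best_f1:.4f}")
```

The selected threshold would then be applied once to the test-set probabilities to report the final F1.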


In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# RandForestOpt is not defined earlier in the notebook, so the random forest `rfc`
# fitted in the previous cell is used here instead.
probabilities_valid = rfc.predict_proba(X_test)
probabilities_one_valid = probabilities_valid[:, 1]

fpr, tpr, thresholds = roc_curve(Y_test, probabilities_one_valid)

plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.show()
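The area under the plotted curve can be computed from the same `fpr` and `tpr` arrays with `auc`, and it should agree with `roc_auc_score`. A small sketch on made-up scores, not the notebook's predictions:

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve, then the area under it by the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# The same quantity computed directly from labels and scores
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)
```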








