## Binary Classification with a Bank Churn Dataset
The objective of this model is to predict whether a customer keeps their account or closes it (i.e., churns).

In [16]:
# One-time setup
# ! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Download the dataset from Kaggle.

In [18]:
! kaggle competitions download -c playground-series-s4e1
! unzip playground-series-s4e1.zip

playground-series-s4e1.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  playground-series-s4e1.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


First, we import the training data and pre-process it. The data contains non-numerical columns: we drop the irrelevant ones (id, CustomerId, Surname) and one-hot encode the categorical ones (Geography, Gender). We then scale the features, since they originally span different ranges. Finally, we hold out 20% of the training data for validation.

In [28]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_data_path = './train.csv'
train_data = pd.read_csv(train_data_path)

# Drop irrelevant columns (id, CustomerId, Surname)
train_data = train_data.drop(columns=["id", "CustomerId", "Surname"])

# one-hot encode categorical variables
train_data = pd.get_dummies(train_data, columns=["Geography", "Gender"])

# Separate features and labels
train_labels = train_data["Exited"]
train_data = train_data.drop(columns=["Exited"])

# Standardize every feature column to zero mean and unit variance (this includes the one-hot columns)
scaler = StandardScaler()
train_data_scaled = scaler.fit_transform(train_data)

# Split the data into training and validation sets
X_train, X_validation, y_train, y_validation = train_test_split(train_data_scaled, train_labels, test_size=0.2, random_state=0)
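
The cell above fits the scaler on the full training set before splitting, so the validation rows contribute to the scaling statistics, and it standardizes the one-hot columns along with the numeric ones. As a minimal alternative sketch (not the notebook's method), one could split first and fit `StandardScaler` on the training portion only. The helper name `split_then_scale` and the `numeric_cols` argument are hypothetical and introduced here for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_then_scale(features, labels, numeric_cols, test_size=0.2, random_state=0):
    """Split first, then fit the scaler on the training portion only.

    `split_then_scale` and `numeric_cols` are illustrative names, not part
    of the original notebook.
    """
    # Cast the columns to be scaled to float so scaled values fit their dtype.
    features = features.astype({col: float for col in numeric_cols})

    X_tr, X_val, y_tr, y_val = train_test_split(
        features, labels, test_size=test_size, random_state=random_state
    )
    X_tr, X_val = X_tr.copy(), X_val.copy()

    scaler = StandardScaler()
    # Mean and standard deviation are estimated from the training rows only.
    X_tr[numeric_cols] = scaler.fit_transform(X_tr[numeric_cols])
    # Validation rows are transformed with those same training statistics.
    X_val[numeric_cols] = scaler.transform(X_val[numeric_cols])
    return X_tr, X_val, y_tr, y_val, scaler
```

With the notebook's variables this might be called as `split_then_scale(train_data, train_labels, numeric_cols)`, where `numeric_cols` lists the columns that are not one-hot dummies. The returned scaler can then be reused on the test set.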

We apply the same pre-processing to the test dataset.

In [None]:
# Pre-process test data as well
test_data_path = './test.csv'
test_data = pd.read_csv(test_data_path)

test_data = test_data.drop(columns=["id", "CustomerId", "Surname"])
test_data = pd.get_dummies(test_data, columns=["Geography", "Gender"])

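# The competition's test.csv carries no "Exited" labels, so there is no y_test;
# errors="ignore" makes the drop a no-op when the column is absent
X_test = test_data.drop(columns=["Exited"], errors="ignore")

# Reuse the scaler fitted on the training data instead of refitting it on the test set
X_test = scaler.transform(X_test)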
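
`pd.get_dummies` only creates columns for the categories present in the frame it is given, so the encoded test frame could in principle end up with different columns, or a different column order, than the training frame. A small sketch of one way to guard against that, using `DataFrame.reindex`, follows. The helper `align_to_training_columns` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def align_to_training_columns(test_features, train_columns):
    """Give the test frame exactly the training columns, in the same order.

    Columns missing from the test frame are added and filled with 0;
    columns the training frame never had are dropped.
    """
    return test_features.reindex(columns=train_columns, fill_value=0)


# Toy frames with made-up values: the test frame has no "Germany" rows.
train_toy = pd.get_dummies(pd.DataFrame({"Geography": ["France", "Spain", "Germany"]}))
test_toy = pd.get_dummies(pd.DataFrame({"Geography": ["Spain", "France"]}))

aligned = align_to_training_columns(test_toy, train_toy.columns)
print(list(aligned.columns))
```

In the notebook, the training feature columns are still available as `train_data.columns` after the label column is dropped, so the test frame could be aligned to them before it is passed to the scaler.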

Now, we define the model.