# **Customer Churn Prediction**


## **Problem Statement:**

Develop a predictive model to identify customers at risk of churning from an investment bank, enabling proactive retention strategies to minimize customer loss and maximize revenue growth.


## **About the Dataset**

There are 14 columns and 10,000 rows (samples): 13 candidate features plus the target column, **Exited**.

**RowNumber**—corresponds to the record (row) number and has no effect on the output.

**CustomerId**—contains random values and has no effect on customer leaving the bank.

**Surname**—the surname of a customer has no impact on their decision to leave the bank.

**CreditScore**—can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.

**Geography**—a customer’s location can affect their decision to leave the bank.

**Gender**—it’s interesting to explore whether gender plays a role in a customer leaving the bank.

**Age**—this is certainly relevant, since older customers are less likely to leave their bank than younger ones.

**Tenure**—refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.

**Balance**—also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.

**NumOfProducts**—refers to the number of products that a customer has purchased through the bank.

**HasCrCard**—denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.

**IsActiveMember**—active customers are less likely to leave the bank.

**EstimatedSalary**—as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.

**Exited**—whether or not the customer left the bank.


## **KNN**

The K-Nearest Neighbors (KNN) algorithm is a simple and effective machine learning technique that classifies data points by finding the K most similar instances to a new input and voting for the target class or value.
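The voting idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation; the function name `knn_predict` and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two features, two well-separated classes
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_points, toy_labels, np.array([0.15, 0.15])))  # near the class-0 cluster
print(knn_predict(toy_points, toy_labels, np.array([0.95, 1.0])))   # near the class-1 cluster
```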

### **The most commonly used hyperparameters for K-Nearest Neighbors (KNN) algorithm:**

n_neighbors: The number of nearest neighbors to consider when making a prediction. Small values give a flexible decision boundary that is sensitive to noise; larger values smooth the boundary but can blur genuine class differences and increase prediction time.

weights: The weight function applied to each neighbor's vote. Supported weights are 'uniform' (all neighbors count equally) and 'distance' (closer neighbors count more, weighted by the inverse of their distance).

algorithm: The algorithm used to compute the nearest neighbors. Supported algorithms are 'auto' (choose the most appropriate method from the training data), 'brute' (exhaustive search), 'kd_tree' (k-d tree search), and 'ball_tree' (ball tree search).

leaf_size: The leaf size passed to the k-d tree or ball tree. It affects the speed of tree construction and queries and the memory needed to store the tree, not the predictions themselves.

p: The power parameter for the Minkowski metric. When p=1, it is the Manhattan distance, and when p=2, it is the Euclidean distance.

metric: The distance metric used to calculate the distance between samples. Supported metrics are 'minkowski' (Minkowski distance), 'euclidean' (Euclidean distance), 'manhattan' (Manhattan distance), and 'chebyshev' (Chebyshev distance).
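To make the `p` and `metric` options concrete, the following sketch computes the Manhattan, Euclidean, and Chebyshev distances between two points by hand. The helper `minkowski` and the two vectors are hypothetical and only illustrate the formulas.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

point_a = np.array([0.0, 0.0])
point_b = np.array([3.0, 4.0])

manhattan = minkowski(point_a, point_b, p=1)          # |3| + |4| = 7
euclidean = minkowski(point_a, point_b, p=2)          # sqrt(9 + 16) = 5
chebyshev = float(np.max(np.abs(point_a - point_b)))  # max(|3|, |4|) = 4, the limit as p grows

print(manhattan, euclidean, chebyshev)
```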

### **Here are some common values for these hyperparameters:**

n_neighbors: 3, 5, 10, 20

weights: 'uniform', 'distance'

algorithm: 'auto', 'brute', 'kd_tree', 'ball_tree'

leaf_size: 10, 20, 30

p: 1, 2

metric: 'minkowski', 'euclidean', 'manhattan', 'chebyshev'
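One way to choose among candidate values like these is a cross-validated grid search. The sketch below assumes synthetic data from `make_classification` as a stand-in for the churn dataset, and the grid is a small illustrative subset of the values listed above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical; not the churn dataset)
demo_X, demo_y = make_classification(n_samples=300, n_features=6, random_state=0)

param_grid = {
    'n_neighbors': [3, 5, 10, 20],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

# 5-fold cross-validation, scored by ROC AUC
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='roc_auc')
search.fit(demo_X, demo_y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refitted on all of the data passed to `fit` using the best combination found.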


In [1]:
import warnings

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')



In [2]:
# Colab-only import: outside Google Colab this raises
# ModuleNotFoundError: No module named 'google.colab'
from google.colab import drive

In [None]:
drive.mount('/content/drive')

In [None]:

# Load data
data = pd.read_csv('/content/drive/My Drive/Churn Project/churn.csv')



In [None]:
data.head()

In [None]:
data.info()

In [None]:
# Count missing values per column
isnull = data.isnull().sum()
isnull

In [None]:
# Preprocess data
selected_features = [
    'CreditScore', 'Geography', 'Gender', 'Age',
    'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
    'IsActiveMember', 'EstimatedSalary'
]
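X = data[selected_features].copy()  # copy so later column assignments do not write into a view of `data`
y = data['Exited']  # 1-D target, as scikit-learn estimators expect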



In [None]:
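# Label encoding (one encoder per column, so each fitted mapping stays available for new data)
geo_encoder = LabelEncoder()
gender_encoder = LabelEncoder()
X['Geography'] = geo_encoder.fit_transform(X['Geography'])
X['Gender'] = gender_encoder.fit_transform(X['Gender'])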



In [None]:
# Scaling: rescale the numeric columns to the [0, 1] range
num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = MinMaxScaler()
X[num_cols] = scaler.fit_transform(X[num_cols])



In [None]:
# Split data
train_X, val_X, train_y, val_y = train_test_split(
    X, y, random_state=0, train_size=0.8
)
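The encoding, scaling, and splitting steps can also be bundled into a scikit-learn `Pipeline`, so that the scaling ranges are learned from the training rows only. The sketch below is one possible arrangement, not the notebook's method: it swaps in `OneHotEncoder` for `LabelEncoder`, and the small synthetic frame with a subset of the churn column names is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Small synthetic frame (hypothetical values) with a subset of the churn columns
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    'CreditScore': rng.integers(350, 850, n_rows),
    'Age': rng.integers(18, 90, n_rows),
    'Balance': rng.uniform(0, 250000, n_rows),
    'Geography': rng.choice(['France', 'Spain', 'Germany'], n_rows),
    'Gender': rng.choice(['Male', 'Female'], n_rows),
    'Exited': rng.integers(0, 2, n_rows),
})

numeric_cols = ['CreditScore', 'Age', 'Balance']
categorical_cols = ['Geography', 'Gender']

# Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer([
    ('scale', MinMaxScaler(), numeric_cols),
    ('encode', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
pipeline = Pipeline([
    ('preprocess', preprocess),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo.drop(columns='Exited'), demo['Exited'], random_state=0, train_size=0.8
)

# fit() learns the scaling ranges and categories from the training rows only
pipeline.fit(demo_train_X, demo_train_y)
demo_predictions = pipeline.predict(demo_val_X)
print(demo_predictions[:5])
```

Because the preprocessing lives inside the pipeline, saving it with `joblib.dump` keeps the fitted scaler and encoder together with the model.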



In [None]:
# Train model
model = KNeighborsClassifier(
    n_neighbors=2, metric='euclidean', weights='uniform',
    algorithm='auto', leaf_size=50, p=2
)
model.fit(train_X, train_y)



In [None]:
# Evaluate model
val_prediction = model.predict(val_X)
y_pred_proba = model.predict_proba(val_X)[:,1]
accuracy = accuracy_score(val_y, val_prediction)
print(f'Model accuracy: {accuracy}')



In [None]:
print(confusion_matrix(val_y, val_prediction))
print(classification_report(val_y, val_prediction))

In [None]:
auc = roc_auc_score(val_y, y_pred_proba)
print(auc)

In [None]:
# Save model
joblib.dump(model, 'churn_model.pkl')

### **OOP Approach**

In [None]:
class ChurnPrediction:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
        self.X = None
        self.y = None
        self.train_X = None
        self.val_X = None
        self.train_y = None
        self.val_y = None
        self.model = None

    def load_data(self):
        self.data = pd.read_csv(self.file_path)

    def preprocess_data(self):
        selected_features = [
            'CreditScore', 'Geography', 'Gender', 'Age',
            'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
            'IsActiveMember', 'EstimatedSalary'
        ]
        # Copy so the column assignments below do not write into a view of self.data
        self.X = self.data[selected_features].copy()
        self.y = self.data['Exited']  # 1-D target, as scikit-learn estimators expect

        # Encoding categorical variables
        le = LabelEncoder()
        self.X['Geography'] = le.fit_transform(self.X['Geography'])
        self.X['Gender'] = le.fit_transform(self.X['Gender'])

        # Scaling numerical variables
        num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
        scaler = MinMaxScaler()
        self.X[num_cols] = scaler.fit_transform(self.X[num_cols])

    def split_data(self):
        self.train_X, self.val_X, self.train_y, self.val_y = train_test_split(
            self.X, self.y, random_state=0, train_size=0.8
        )

    def train_model(self):
        self.model = KNeighborsClassifier(n_neighbors=5)
        self.model.fit(self.train_X, self.train_y)

    def evaluate_model(self):
        val_prediction = self.model.predict(self.val_X)
        accuracy = accuracy_score(self.val_y, val_prediction)
        print(f'Model accuracy: {accuracy}')
        y_pred_proba = self.model.predict_proba(self.val_X)[:,1]
        auc = roc_auc_score(self.val_y, y_pred_proba)
        print(f'Model auc score: {auc}')
        return accuracy, auc

    def save_model(self, model_path):
        joblib.dump(self.model, model_path)

    def load_model(self, model_path):
        self.model = joblib.load(model_path)

# Usage
churn = ChurnPrediction('/content/drive/My Drive/Churn Project/churn.csv')
churn.load_data()
churn.preprocess_data()
churn.split_data()
churn.train_model()
accuracy, auc = churn.evaluate_model()

# Save the model
churn.save_model('churn_model.pkl')

### **Procedural Approach**

In [None]:
def load_data(file_path):
    data = pd.read_csv(file_path)
    return data

def preprocess_data(data):
    selected_features = [
        'CreditScore', 'Geography', 'Gender', 'Age',
        'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
        'IsActiveMember', 'EstimatedSalary'
    ]
    # Copy so the column assignments below do not write into a view of `data`
    X = data[selected_features].copy()
    y = data['Exited']  # 1-D target, as scikit-learn estimators expect

    # Label encoding
    le = LabelEncoder()
    X['Geography'] = le.fit_transform(X['Geography'])
    X['Gender'] = le.fit_transform(X['Gender'])

    # Scaling
    num_cols = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
    scaler = MinMaxScaler()
    X[num_cols] = scaler.fit_transform(X[num_cols])

    return X, y

def split_data(X, y):
    train_X, val_X, train_y, val_y = train_test_split(
        X, y, random_state=0, train_size=0.8
    )
    return train_X, val_X, train_y, val_y

def train_model(train_X, train_y):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(train_X, train_y)
    return model

def evaluate_model(model, val_X, val_y):
    val_prediction = model.predict(val_X)
    accuracy = accuracy_score(val_y, val_prediction)
    print(f'Model accuracy: {accuracy}')

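    # AUC is computed from class-1 probabilities, not from hard 0/1 predictions
    y_pred_proba = model.predict_proba(val_X)[:, 1]
    auc = roc_auc_score(val_y, y_pred_proba)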
    print(f'Model auc score: {auc}')
    return accuracy, auc

def save_model(model, model_path):
    joblib.dump(model, model_path)

def load_model(model_path):
    model = joblib.load(model_path)
    return model

# Usage
file_path = '/content/drive/My Drive/Churn Project/churn.csv'
data = load_data(file_path)
X, y = preprocess_data(data)
train_X, val_X, train_y, val_y = split_data(X, y)
model = train_model(train_X, train_y)
accuracy, auc = evaluate_model(model, val_X, val_y)
save_model(model, 'churn_model.pkl')

## **Decision Tree**

A Decision Tree Classifier is a supervised learning algorithm that models decisions as a tree: each internal node tests one input feature against a threshold, each branch corresponds to an outcome of that test, and each leaf assigns a class. The tree is built by recursively partitioning the training data into subsets based on the values of the input features, choosing at each step the split that most reduces impurity.
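The recursive partitioning rests on one repeated step: finding the threshold on a feature that gives the purest child nodes. The sketch below shows that single step using Gini impurity. The helper names and the toy arrays are hypothetical, and a real tree repeats this search over every feature and every node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def best_threshold(feature, labels):
    """Return (threshold, weighted impurity) for the best split on one feature."""
    best = (None, float('inf'))
    values = np.unique(feature)
    # Candidate thresholds are midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = labels[feature <= t], labels[feature > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (float(t), weighted)
    return best

# Toy data (hypothetical): ages and whether the customer exited
toy_ages = np.array([22, 25, 30, 35, 48, 52, 60, 65])
toy_exited = np.array([1, 1, 1, 1, 0, 0, 0, 0])

print(gini(toy_exited))                      # 0.5 for an evenly mixed binary node
print(best_threshold(toy_ages, toy_exited))  # (41.5, 0.0): both children are pure
```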

### **The most commonly used hyperparameters for Decision Tree Classifier**:

criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

max_depth: The maximum depth of the tree. Increasing this number can improve the model's performance, but also increases the risk of overfitting.

min_samples_split: The minimum number of samples required to split an internal node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

min_samples_leaf: The minimum number of samples required to be at a leaf node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

max_features: The maximum number of features to consider at each split. Increasing this number can improve the model's performance, but also increases the computation time.

random_state: The seed that controls the randomness of the estimator. Features are randomly permuted at each split, so ties between equally good splits can be broken differently from run to run. Setting this to a fixed value ensures reproducibility of the results.

class_weight: The weight assigned to each class during training. This can be useful for imbalanced datasets, where one class has a much larger number of instances than the others.
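As an illustration of `class_weight`, the 'balanced' setting weights each class by `n_samples / (n_classes * class_count)`. The sketch below checks that formula against scikit-learn's `compute_class_weight` on hypothetical toy labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels (hypothetical): 8 customers stayed, 2 exited
toy_labels = np.array([0] * 8 + [1] * 2)

# 'balanced' uses n_samples / (n_classes * count_per_class)
manual_weights = len(toy_labels) / (2 * np.bincount(toy_labels))
sklearn_weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=toy_labels
)

print(manual_weights)   # [0.625 2.5]
print(sklearn_weights)  # same values
```

The minority class receives the larger weight, so misclassifying an exited customer costs more during training.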

### **Here are some common values for these hyperparameters:**

criterion: 'gini', 'entropy'

max_depth: 3, 5, 10, None (None means no limit)

min_samples_split: 2, 5, 10

min_samples_leaf: 1, 5, 10

max_features: 'sqrt', 'log2', None (None means all features are considered; 'auto' was removed in scikit-learn 1.3)

random_state: 0, 42, 100

class_weight: 'balanced', None (None means all classes have equal weight)
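The effect of `max_depth` on overfitting can be seen by comparing training and validation accuracy across the depths listed above. This sketch assumes noisy synthetic data from `make_classification` rather than the churn dataset, so the exact numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), with 10% of labels flipped as noise
demo_X, demo_y = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0
)
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=0, train_size=0.8)

depth_scores = {}
for depth in [3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(tr_X, tr_y)
    # (training accuracy, validation accuracy) for this depth
    depth_scores[depth] = (tree.score(tr_X, tr_y), tree.score(va_X, va_y))
    print(depth, depth_scores[depth])
```

An unlimited tree fits the training rows perfectly, noise included, so a large gap between its two scores is a sign of overfitting.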


### **OOP Approach**

In [None]:
class ChurnPrediction:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
        self.X = None
        self.y = None
        self.train_X = None
        self.val_X = None
        self.train_y = None
        self.val_y = None
        self.model = None

    def load_data(self):
        self.data = pd.read_csv(self.file_path)

    def preprocess_data(self):
        selected_features = [
            'CreditScore', 'Geography', 'Gender', 'Age',
            'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
            'IsActiveMember', 'EstimatedSalary'
        ]
        self.X = self.data[selected_features]
        self.y = self.data[['Exited']]

        # One-hot encode the categorical columns (get_dummies also drops the originals)
        self.X = pd.get_dummies(self.X, columns=["Geography", "Gender"])

    def split_data(self):
        self.train_X, self.val_X, self.train_y, self.val_y = train_test_split(
            self.X, self.y, random_state=0, train_size=0.8
        )

    def train_model(self):
        self.model = DecisionTreeClassifier(random_state=0)
        self.model.fit(self.train_X, self.train_y)

    def evaluate_model(self):
        val_prediction = self.model.predict(self.val_X)
        accuracy = accuracy_score(self.val_y, val_prediction)
        print(f'Model accuracy: {accuracy}')
        y_pred_proba = self.model.predict_proba(self.val_X)[:,1]
        auc = roc_auc_score(self.val_y, y_pred_proba)
        print(f'Model auc score: {auc}')
        return accuracy, auc

    def save_model(self, model_path):
        joblib.dump(self.model, model_path)

    def load_model(self, model_path):
        self.model = joblib.load(model_path)

# Usage
churn = ChurnPrediction('/content/drive/My Drive/Churn Project/churn.csv')
churn.load_data()
churn.preprocess_data()
churn.split_data()
churn.train_model()
accuracy, auc = churn.evaluate_model()

# Save the model
churn.save_model('churn_model.pkl')

### **Procedural Approach**

In [None]:
def load_data(file_path):
    return pd.read_csv(file_path)

def preprocess_data(data):
    selected_features = [
        'CreditScore', 'Geography', 'Gender', 'Age',
        'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
        'IsActiveMember', 'EstimatedSalary'
    ]
    X = data[selected_features]
    y = data[['Exited']]

    # One-hot encode the categorical columns (get_dummies also drops the originals)
    X = pd.get_dummies(X, columns=["Geography", "Gender"])

    return X, y

def split_data(X, y):
    return train_test_split(
        X, y, random_state=0, train_size=0.8
    )

def train_model(X, y):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X, y)
    return model

def evaluate_model(model, X, y):
    val_prediction = model.predict(X)
    accuracy = accuracy_score(y, val_prediction)
    print(f'Model accuracy: {accuracy}')
    y_pred_proba = model.predict_proba(X)[:,1]
    auc = roc_auc_score(y, y_pred_proba)
    print(f'Model auc score: {auc}')
    return accuracy, auc

def save_model(model, model_path):
    joblib.dump(model, model_path)

# Usage
file_path = '/content/drive/My Drive/Churn Project/churn.csv'
data = load_data(file_path)
X, y = preprocess_data(data)
train_X, val_X, train_y, val_y = split_data(X, y)
model = train_model(train_X, train_y)
accuracy, auc = evaluate_model(model, val_X, val_y)
save_model(model, 'churn_model.pkl')

## **Random Forest**

Random Forest is a supervised learning algorithm that combines many decision trees to produce a more accurate and stable model than any single tree. Each tree is trained on a bootstrap sample of the training data and considers only a random subset of the features at each split. For classification, the final prediction combines the trees' predictions, by majority vote or by averaging their predicted class probabilities.
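The idea of training many trees on random subsets and combining their predictions can be sketched by hand. This is a simplified illustration on synthetic data, not scikit-learn's implementation: it uses a hard majority vote, whereas `RandomForestClassifier` averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical; not the churn dataset)
demo_X, demo_y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(demo_X), size=len(demo_X))
    # max_features='sqrt' makes each split consider a random subset of the features
    tree = DecisionTreeClassifier(
        max_features='sqrt', random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(demo_X[idx], demo_y[idx]))

# Majority vote across the trees (labels are 0/1, so the mean of the votes suffices)
all_votes = np.array([t.predict(demo_X) for t in trees])
forest_prediction = (all_votes.mean(axis=0) >= 0.5).astype(int)

print((forest_prediction == demo_y).mean())  # accuracy on the training rows
```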

### **The most commonly used hyperparameters for Random Forest Classifier:**

n_estimators: The number of trees in the forest. Increasing this number can improve the model's performance, but also increases the computation time.

criterion: The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain.

max_depth: The maximum depth of each tree. Increasing this number can improve the model's performance, but also increases the risk of overfitting.

min_samples_split: The minimum number of samples required to split an internal node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

min_samples_leaf: The minimum number of samples required to be at a leaf node. Decreasing this number can lead to overfitting, while increasing it can lead to underfitting.

max_features: The maximum number of features to consider at each split. Increasing this number can improve the model's performance, but also increases the computation time.

max_leaf_nodes: The maximum number of leaf nodes in each tree. Trees are grown best-first until this limit is reached; a larger limit allows more complex trees, which raises both the risk of overfitting and the training time.

min_impurity_decrease: The minimum decrease in impurity required to split an internal node. Increasing this number can lead to underfitting, while decreasing it can lead to overfitting.

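bootstrap: Whether to use bootstrap sampling to build each tree. If True, each tree is trained on a sample drawn with replacement from the training data; if False, every tree is trained on the whole dataset.

oob_score: Whether to use out-of-bag samples (the rows left out of a tree's bootstrap sample) to estimate the generalization accuracy. Only available when bootstrap=True.

random_state: The seed that controls the bootstrap sampling of the rows used to build each tree and the sampling of the features considered at each split. Setting this to a fixed value ensures reproducibility of the results.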

class_weight: The weight assigned to each class during training. This can be useful for imbalanced datasets, where one class has a much larger number of instances than the others.
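The `bootstrap` and `oob_score` options work together: rows left out of a tree's bootstrap sample act as a built-in validation set. The sketch below assumes synthetic data and compares the out-of-bag estimate with the accuracy on the training rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical; not the churn dataset)
demo_X, demo_y = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(demo_X, demo_y)

print(round(forest.oob_score_, 3))   # accuracy estimated from out-of-bag rows
print(forest.score(demo_X, demo_y))  # accuracy on the rows the forest was trained on
```

The out-of-bag score is usually the more honest of the two numbers, since each row is scored only by trees that never saw it.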

### **Here are some common values for these hyperparameters:**

n_estimators: 10, 50, 100, 200

criterion: 'gini', 'entropy'

max_depth: 3, 5, 10, None (None means no limit)

min_samples_split: 2, 5, 10

min_samples_leaf: 1, 5, 10

max_features: 'sqrt', 'log2', None (None means all features are considered; 'auto' was removed in scikit-learn 1.3)

max_leaf_nodes: 10, 50, 100, None (None means no limit)

min_impurity_decrease: 0.0, 0.1, 0.5

bootstrap: True, False

oob_score: True, False

random_state: 0, 42, 100

class_weight: 'balanced', 'balanced_subsample', None (None means all classes are equal)


### **OOP Approach**

In [None]:
class ChurnPrediction:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
        self.X = None
        self.y = None
        self.train_X = None
        self.val_X = None
        self.train_y = None
        self.val_y = None
        self.model = None

    def load_data(self):
        self.data = pd.read_csv(self.file_path)

    def preprocess_data(self):
        selected_features = [
            'CreditScore', 'Geography', 'Gender', 'Age',
            'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
            'IsActiveMember', 'EstimatedSalary'
        ]
        self.X = self.data[selected_features]
        self.y = self.data['Exited']  # 1-D target avoids a column-vector warning in fit()

        self.X = pd.get_dummies(self.X, columns = ["Geography", "Gender"])

    def split_data(self):
        self.train_X, self.val_X, self.train_y, self.val_y = train_test_split(
            self.X, self.y, random_state=0, train_size=0.8
        )

    def train_model(self):
        self.model = RandomForestClassifier(random_state=0)
        self.model.fit(self.train_X, self.train_y)

    def evaluate_model(self):
        val_prediction = self.model.predict(self.val_X)
        accuracy = accuracy_score(self.val_y, val_prediction)
        print(f'Model accuracy: {accuracy}')
        y_pred_proba = self.model.predict_proba(self.val_X)[:,1]
        auc = roc_auc_score(self.val_y, y_pred_proba)
        print(f'Model auc score: {auc}')
        return accuracy, auc

    def save_model(self, model_path):
        joblib.dump(self.model, model_path)

    def load_model(self, model_path):
        self.model = joblib.load(model_path)

# Usage
churn = ChurnPrediction('/content/drive/My Drive/Churn Project/churn.csv')
churn.load_data()
churn.preprocess_data()
churn.split_data()
churn.train_model()
accuracy, auc = churn.evaluate_model()

# Save the model
churn.save_model('churn_model.pkl')

### **Procedural Approach**

In [None]:
def load_data(file_path):
    return pd.read_csv(file_path)

def preprocess_data(data):
    selected_features = [
        'CreditScore', 'Geography', 'Gender', 'Age',
        'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
        'IsActiveMember', 'EstimatedSalary'
    ]
    X = data[selected_features]
    y = data['Exited']  # 1-D target avoids a column-vector warning in fit()

    X = pd.get_dummies(X, columns = ["Geography", "Gender"])

    return X, y

def split_data(X, y):
    return train_test_split(
        X, y, random_state=0, train_size=0.8
    )

def train_model(X, y):
    model = RandomForestClassifier(random_state=0)
    model.fit(X, y)
    return model

def evaluate_model(model, X, y):
    val_prediction = model.predict(X)
    accuracy = accuracy_score(y, val_prediction)
    print(f'Model accuracy: {accuracy}')
    y_pred_proba = model.predict_proba(X)[:,1]
    auc = roc_auc_score(y, y_pred_proba)
    print(f'Model auc score: {auc}')
    return accuracy, auc

def save_model(model, model_path):
    joblib.dump(model, model_path)

# Usage
file_path = '/content/drive/My Drive/Churn Project/churn.csv'
data = load_data(file_path)
X, y = preprocess_data(data)
train_X, val_X, train_y, val_y = split_data(X, y)
model = train_model(train_X, train_y)
accuracy, auc = evaluate_model(model, val_X, val_y)
save_model(model, 'churn_model.pkl')

## **SVM**

Support Vector Machine (SVM) is a supervised learning algorithm that can be used for classification and regression tasks. For classification, it works by finding the hyperplane that separates the classes with the largest possible margin in the feature space.
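The separating hyperplane and its margin can be inspected on a small linearly separable example. In this sketch the toy points are hypothetical, and a large `C` is used to approximate a hard-margin SVM; the margin width follows from the learned weight vector as `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical)
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                       [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data
svm = SVC(kernel='linear', C=1000.0).fit(toy_points, toy_labels)

w = svm.coef_[0]                          # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print(svm.support_vectors_)  # the training points that define the hyperplane
print(margin_width)
```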

### **The most commonly used hyperparameters for Support Vector Machines (SVMs) are:**

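C: The regularization parameter. It controls the trade-off between a wide margin and the misclassification error on the training data; the strength of the regularization is inversely proportional to C.

kernel: The kernel function used to implicitly map the data into a higher-dimensional space.

gamma: The kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels. It controls how far the influence of a single training example reaches.

degree: The degree of the polynomial kernel ('poly'); ignored by the other kernels.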
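To show what `gamma` does numerically, the sketch below evaluates the RBF kernel `exp(-gamma * ||a - b||^2)` by hand, compares it with scikit-learn's `rbf_kernel`, and computes the value that `gamma='scale'` corresponds to, `1 / (n_features * X.var())`. The sample vectors are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

vec_a = np.array([[1.0, 2.0]])
vec_b = np.array([[2.0, 4.0]])
gamma_value = 0.1

# RBF kernel: exp(-gamma * squared Euclidean distance); here the squared distance is 5
manual_kernel = float(np.exp(-gamma_value * np.sum((vec_a - vec_b) ** 2)))
sklearn_kernel = float(rbf_kernel(vec_a, vec_b, gamma=gamma_value)[0, 0])
print(manual_kernel, sklearn_kernel)

# gamma='scale' in SVC corresponds to 1 / (n_features * X.var())
sample = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
gamma_scale = 1.0 / (sample.shape[1] * sample.var())
print(gamma_scale)
```

A larger gamma makes the kernel value fall off faster with distance, so each training example influences a smaller region.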


### **Here are some common values for these hyperparameters:**

C: 1.0, 10.0, 100.0, 1000.0

kernel: 'rbf', 'linear', 'poly', 'sigmoid'

gamma: 'scale', 'auto', 0.1, 1.0, 10.0

degree: 2, 3, 4, 5


### **OOP Approach**

In [None]:
class ChurnPrediction:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
        self.X = None
        self.y = None
        self.train_X = None
        self.val_X = None
        self.train_y = None
        self.val_y = None
        self.model = None

    def load_data(self):
        self.data = pd.read_csv(self.file_path)

    def preprocess_data(self):
        selected_features = [
            'CreditScore', 'Geography', 'Gender', 'Age',
            'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
            'IsActiveMember', 'EstimatedSalary'
        ]
        self.X = self.data[selected_features]
        self.y = self.data['Exited']  # 1-D target avoids a column-vector warning in fit()

        self.X = pd.get_dummies(self.X, columns = ["Geography", "Gender"])

    def split_data(self):
        self.train_X, self.val_X, self.train_y, self.val_y = train_test_split(
            self.X, self.y, random_state=0, train_size=0.8
        )

    def train_model(self):
        self.model = SVC(probability=True, random_state=0)
        self.model.fit(self.train_X, self.train_y)

    def evaluate_model(self):
        val_prediction = self.model.predict(self.val_X)
        accuracy = accuracy_score(self.val_y, val_prediction)
        print(f'Model accuracy: {accuracy}')
        y_pred_proba = self.model.predict_proba(self.val_X)[:,1]
        auc = roc_auc_score(self.val_y, y_pred_proba)
        print(f'Model auc score: {auc}')
        return accuracy, auc

    def save_model(self, model_path):
        joblib.dump(self.model, model_path)

    def load_model(self, model_path):
        self.model = joblib.load(model_path)

# Usage
churn = ChurnPrediction('/content/drive/My Drive/Churn Project/churn.csv')
churn.load_data()
churn.preprocess_data()
churn.split_data()
churn.train_model()
accuracy, auc = churn.evaluate_model()

# Save the model
churn.save_model('churn_model.pkl')