# `DEMO` Use of the functions in the `tree` module
## Purpose
This notebook briefly demonstrates how to use the functions that display and analyse a classification tree. A model is first created and trained (without much attention paid to its performance) so that there is a system to analyse and to apply the functions to.

In [170]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [171]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier

In [172]:
# Global variables
DATASET_FILE = "Dataset/Bank_Churn.csv"

## Dataset and Model Training
### Dataset preparation
The dataset used in this project contains information about bank customers and whether they churned. It was downloaded from [Maven Analytics](https://mavenanalytics.io/data-playground). The dataset is already cleaned but still needs to be prepared for use in the ML model.

In [173]:
# Load dataset
data_df = pd.read_csv(DATASET_FILE)

# Remove unused columns
cleaned_df = data_df.drop(columns=['CustomerId', 'Surname'])

# Split target and features
y = cleaned_df['Exited']
X = cleaned_df.drop(columns=['Exited'])

In [174]:
# Check how many countries are covered under the Geography column
X['Geography'].value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

In [175]:
# Keep the columns the same (i.e. no binning) if there are fewer than 5 countries
nb_countries = len(X['Geography'].value_counts())
if nb_countries < 5:
    print(f"The dataset contains clients from fewer than 5 countries: {nb_countries} countries in the dataset.")
else:
    print(f"The dataset contains clients from 5 or more countries: {nb_countries} countries in the dataset. It is recommended to perform some binning.")

The dataset contains clients from fewer than 5 countries: 3 countries in the dataset.


In [176]:
# Identify the columns based on their type
cat_columns = []
num_columns = []

for c in X.columns:
    dtype = X[c].dtypes

    if dtype == 'object':
        # The column is categorical
        cat_columns.append(c)
    else:
        # The column is numerical
        num_columns.append(c)

# Display the types of columns
print(f"Numerical columns ({len(num_columns)}):", num_columns)
print(f"Categorical columns ({len(cat_columns)}):", cat_columns)

Numerical columns (8): ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
Categorical columns (2): ['Geography', 'Gender']
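The same split between numerical and categorical columns can also be sketched with `DataFrame.select_dtypes`. The toy frame below is an illustrative stand-in for `X` (its values are not taken from the dataset), and the variable names are assumptions made for this example only.

```python
import pandas as pd

# Toy frame standing in for the feature table (illustrative values only)
toy = pd.DataFrame({
    "CreditScore": [619, 608, 502],
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Female", "Female", "Male"],
})

# Numerical columns are selected by dtype; everything else is treated as categorical
toy_num_columns = toy.select_dtypes(include="number").columns.tolist()
toy_cat_columns = [c for c in toy.columns if c not in toy_num_columns]

print("Numerical columns:", toy_num_columns)
print("Categorical columns:", toy_cat_columns)
```

Note that `include="number"` does not select boolean columns, so a frame containing `bool` flags would need them handled explicitly.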


In [177]:
for c in cat_columns:
    # Count the number of unique values
    nb_values = X[c].nunique()

    # For binary columns:
    if nb_values == 2:
        # - simply encode as 1 and 0
        encoded_col = pd.get_dummies(X[c], drop_first=True)
        X[c] = encoded_col

        # - use the column name based on the 1-value
        X = X.rename(columns={c: encoded_col.columns[0]})

    # For other columns:
    else:
        # - keep all values as dummies
        dummies = pd.get_dummies(X[c], drop_first=False)

        # - drop the original column
        X = X.drop(columns=[c])

        # - append the dummies to the feature DataFrame
        X = pd.merge(X, dummies, how='outer', left_index=True, right_index=True)
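To make the effect of the encoding rules above easier to see, here is a minimal sketch on a three-row toy frame. It uses `pd.concat` rather than `pd.merge` to attach the dummy columns, which is a swapped-in simplification and not the approach of the cell above; the data and variable names are hypothetical.

```python
import pandas as pd

# Toy frame with one binary and one multi-valued categorical column (illustrative only)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Spain", "Germany"],
    "Age": [42, 41, 39],
})

# Binary column: a single 0/1 indicator, named after the value encoded as 1
gender = pd.get_dummies(toy["Gender"], drop_first=True, dtype=int)

# Multi-valued column: one indicator per distinct value
geo = pd.get_dummies(toy["Geography"], dtype=int)

# Replace the original categorical columns with their indicators
encoded = pd.concat([toy.drop(columns=["Gender", "Geography"]), gender, geo], axis=1)
print(encoded)
```

The binary `Gender` column becomes a single `Male` column, while `Geography` expands into one column per country.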

In [178]:
X

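| | CreditScore | Male | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | France | Germany | Spain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 0 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 |
| 1 | 608 | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 |
| 2 | 502 | 0 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 |
| 3 | 699 | 0 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 1 | 0 | 0 |
| 4 | 850 | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 771 | 1 | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 1 | 0 | 0 |
| 9996 | 516 | 1 | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 1 | 0 | 0 |
| 9997 | 709 | 0 | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 | 0 | 0 |
| 9998 | 772 | 1 | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 0 | 1 | 0 |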


### Split between training and test sets and normalise

In [179]:
# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

# Normalise the training set
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_train_normalized = pd.DataFrame(X_train_normalized, columns=X.columns)

# Apply the normaliser to the test set
X_test_normalized = scaler.transform(X_test)
X_test_normalized = pd.DataFrame(X_test_normalized, columns=X.columns)
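As a small illustration of what the scaler does, the sketch below fits a `StandardScaler` on a toy training column and applies it to an unseen value. The numbers and the `demo_` names are made up for this example. It shows that the test data is transformed with the mean and standard deviation learned from the training data only, and that the transformation can be reversed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (illustrative only)
demo_train = np.array([[10.0], [20.0], [30.0]])
demo_test = np.array([[40.0]])

# Fit on the training data only, then reuse the fitted statistics on the test data
demo_scaler = StandardScaler().fit(demo_train)
z_test = demo_scaler.transform(demo_test)

# The transform is z = (x - mean) / scale, using the training mean and scale
manual = (demo_test - demo_scaler.mean_) / demo_scaler.scale_
print(np.allclose(z_test, manual))

# inverse_transform maps standardised values back to the original units
recovered = demo_scaler.inverse_transform(z_test)
print(np.allclose(recovered, demo_test))
```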

### Model training

In [180]:
# Train a classification tree
clf = DecisionTreeClassifier(random_state=42)
model = clf.fit(X_train_normalized, y_train)

In [181]:
# Get the predictions for the test dataset
y_pred = model.predict(X_test_normalized)

## Model analysis

In [182]:
performance = pd.DataFrame({'test': y_test, 'pred': y_pred})

In [183]:
performance['check'] = performance['test'] == performance['pred']

In [184]:
performance['check'].sum()/len(performance)

0.79375
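The ratio above is the accuracy on the test set. As a hedged aside, the same figure can be obtained with `sklearn.metrics.accuracy_score`, and a confusion matrix shows how the errors are distributed between the two classes, which accuracy alone does not reveal. The labels below are toy values chosen for illustration, not results from this model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (illustrative only): 0 = stayed, 1 = churned
toy_true = np.array([0, 0, 0, 0, 1, 1])
toy_pred = np.array([0, 0, 0, 1, 1, 0])

# Fraction of predictions that match the true labels
toy_accuracy = accuracy_score(toy_true, toy_pred)

# Rows are true classes, columns are predicted classes
toy_confusion = confusion_matrix(toy_true, toy_pred)

print(toy_accuracy)
print(toy_confusion)
```

With the notebook's own variables, the equivalent calls would be `accuracy_score(y_test, y_pred)` and `confusion_matrix(y_test, y_pred)`.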

In [185]:
from ML_Utils_Analysis import get_decision_boundaries

In [187]:
boundaries_df, split_nodes = get_decision_boundaries(model, scaler, X.columns)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  node_switch_leaf['branch_true'] = node_switch_leaf['node']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  node_switch_leaf['branch_false'] = node_switch_leaf['node']


In [None]:
boundaries_df
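The internals of `get_decision_boundaries` are not shown in this notebook, so the following is not its implementation. It is only a minimal sketch, under the assumption that the scaler is passed in so that split thresholds learned on standardised features can be expressed in the original units. It relies on the fitted tree's `tree_` attribute and the scaler's `mean_` and `scale_`; the toy data and `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy single-feature data (illustrative only): the class flips between 30 and 40
X_raw = np.array([[10.0], [20.0], [30.0], [40.0], [50.0], [60.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Train on standardised features, as in the notebook
toy_scaler = StandardScaler().fit(X_raw)
toy_tree = DecisionTreeClassifier(random_state=42).fit(toy_scaler.transform(X_raw), y_toy)

# Split nodes are the nodes that have children; leaves have children_left == -1
tree = toy_tree.tree_
split_mask = tree.children_left != -1
split_features = tree.feature[split_mask]
thresholds_scaled = tree.threshold[split_mask]

# Undo the standardisation per feature: x = z * scale + mean
thresholds_raw = (
    thresholds_scaled * toy_scaler.scale_[split_features]
    + toy_scaler.mean_[split_features]
)
print(thresholds_raw)
```

In this toy case the tree needs a single split, and its threshold lands roughly halfway between the last sample of class 0 and the first sample of class 1 once converted back to the original scale.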