In [35]:
import pandas as pd
import numpy as np
#default to .3f for pandas floats
pd.options.display.float_format = '{:.3f}'.format

models = {}
n_jobs=16



# Decision Trees

We were able to extract some signal from our dataset during the logistic regression exploration and have settled on three features to use:
- Ethnicity
- Gender
- Search Reason
- Area Command

We will now explore these in a decision tree model and see if we can boost performance.

# Load and Split Data

In [36]:
df_init = pd.read_csv("../data/merged_data.csv", index_col=0)
df_init.info()

label = "Search Result"
s_labels = df_init[label]
df_init = df_init.drop(label, axis=1)
df_init = df_init.drop("ward_code", axis=1)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74409 entries, 0 to 74408
Columns: 224 entries, ward_code to employment_count_log
dtypes: float64(44), int64(176), object(4)
memory usage: 127.7+ MB


In [37]:
from sklearn.model_selection import train_test_split

X_train_init, X_val_init, y_train, y_val = train_test_split(df_init.copy(), s_labels.copy(), random_state=42, test_size=0.5)
X_test_init, X_val_init, y_test, y_val = train_test_split(X_val_init, y_val, random_state=42, test_size=0.5)

# Model

We'll run a decision tree model over a range of depths and plot some key metrics.

In [40]:
#setup

def list_onehot_columns(df, column_prefix):
    """returns a list of columns from a dataframe that begin wih the given column_prefix"""
    return df.columns[df.columns.str.contains(f"{column_prefix}.*", regex=True)].to_list()

cols = []

cols += list_onehot_columns(X_train_init, "Nominal Ethnicity")
cols += list_onehot_columns(X_train_init, "Search Reason")
cols += list_onehot_columns(X_train_init, "Area Command")
cols += ["Nominal Gender_Male"]

X_train = X_train_init[cols]
X_val = X_val_init[cols]

In [123]:
#run models and calculate metrics

from sklearn.metrics import confusion_matrix, plot_confusion_matrix

df_metrics_train = pd.DataFrame()
df_metrics_val = pd.DataFrame()

depths = range(2,10)

for depth in depths:
    model_name = "initial_d=" + str(depth)
    models[model_name] = tree.DecisionTreeClassifier(max_depth=depth)
    models[model_name].fit(X_train, y_train)
    y_pred = models[model_name].predict(X_train)
    tn,fp,fn,tp = confusion_matrix(y_train, y_pred).ravel()
    precision = tp / (tp + fp)
    fpr = fp / (tn + fp)
    fnr = fn / (tp + tn)
    npv = tn / (tn + fn) 

    df_metrics_train = df_metrics_train.append({"depth":depth, "precision":precision, "fpr":fpr, "fnr":fnr, "npv":npv, "fnr+fpr": fnr + fpr}, ignore_index=True)


    y_pred = models[model_name].predict(X_val)
    tn,fp,fn,tp = confusion_matrix(y_val, y_pred).ravel()
    precision = tp / (tp + fp)
    fpr = fp / (tn + fp)
    fnr = fn / (tp + tn)
    npv = tn / (tn + fn) 

    df_metrics_val = df_metrics_val.append({"depth":depth, "precision":precision, "fpr":fpr, "fnr":fnr, "npv":npv, "fnr+fpr": fnr + fpr}, ignore_index=True)


df_metrics_train = df_metrics_train.set_index("depth")
df_metrics_val = df_metrics_val.set_index("depth")

In [124]:
import plotly.express as px

metricx_train = px.line(df_metrics_train, title="Decision Tree Training Metrics")
metricx_train.update_layout(yaxis_range=[0,1])
metrics_val = px.line(df_metrics_val, title="Decision Tree Validation Metrics")
metrics_val.update_layout(yaxis_range=[0,1])

metricx_train.show()
metrics_val.show()

Our precision is much improved, scoring 0.69 at a depth of 3 in both training and validation. For reference, the best precision from the logistic regression (LR) models was 0.44. Precision then diversion and decreases at higher depths suggesting that more depth is not useful.

Negative predictive value stays approximately constant at all depths at ~0.65. This is slightly under our previous LR best of 0.72.

Our false positive rate is 0 at this depth which is excellent, but the false negative rate sits high at 0.55. This is higher than our best LR model at 0.39.

Overall, with minimal optimisation effort, this tree model is perfoming significantly better in some metrics (precision, false positive rate) but slightly worse in others (negative predictive value, false negative rate).

Further optimisation of a decision tree model could yeiled more imporvements, or we may be able to use some sort of ensemble method to combine the ability of the tree to predict positive well, and the logisitic regressor to predict negative.