# Interpretable Clusters !

In [80]:
# libraries we're using
using CSV
using DataFrames
using CategoricalArrays

In [116]:
# load data
df = CSV.read("demo_clusters_final.csv", DataFrame)

# cluster is categorical variable
df.Cluster = categorical(df.Cluster)

# select all feature columns (numeric!)
df_numeric = df[:, 5:end-1]

print(df_numeric)

# set seed to get reproducible results
seed = 42

[1m1361×53 DataFrame
[1m  Row │[1m total population [1m prop males [1m prop females [1m prop of people under 5 [1m prop of people aged 5-9 [1m prop of people aged 10-14 [1m prop of people aged 15-19 [1m prop of people aged 20-24 [1m prop of people aged 25-34 [1m prop of people aged 35-44 [1m prop of people aged 45-54 [1m prop of people aged 55-59 [1m prop of people aged 60-64 [1m prop of people aged 65-74 [1m prop of people aged 75-84 [1m prop of people aged over 85 [1m median aged [1m prop of white people [1m prop of black people [1m prop of native american people [1m prop of asian people(including indian and chinese) [1m prop of pacific islander [1m prop of employeed people 16 and over [1m prop of commuters who drive to work [1m prop of commuter who carpool to work [1m prop of commuter who took public transport to work [1m prop of commuters who walk to work [1m prop of commuters who work from home [1m prop with health insurance  [1m mean commute time t

Excessive output truncated after 524288 bytes.

42

### Check different types of error - Which performs best

In [87]:
accuracy_tracker = []
loss_functions = ["misclassification", "gini", "entropy"]

for loss_fn in loss_functions

    lnr = IAI.OptimalTreeClassifier(
                    random_seed = seed,
                    criterion = loss_fn,
                    max_depth = 3,
                    minbucket = 5)
    
    grid = IAI.GridSearch(lnr,
        max_depth = 3,
        minbucket = 5
    )

    IAI.fit!(grid, df_numeric, df.Cluster)

    lnr = IAI.get_learner(grid)
    
    val_accuracy = IAI.score(lnr_depth, df_numeric, df.Cluster)
    
    # Add AUC score to AUC tracker
    append!(accuracy_tracker, val_accuracy)

    # adding tree visualization for interpretation
    IAI.show_in_browser(lnr)
end

In [88]:
println("--------AUC scores in Validation Set--------")
for i in range(1,length(loss_functions))
    println(loss_functions[i], ": ", round(accuracy_tracker[i], digits=4))
end

--------AUC scores in Validation Set--------
misclassification: 0.061
gini: 0.061
entropy: 0.061


Since we have similar performance across the splitting criteria loss functions, we can choose any. We use misclassification since it is the most straight-forward and our purposes don't require further granularity. 

### Check different depths - Which performs best?

In [100]:
# build classification tree

depths = [3, 4, 5]

grid = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed = seed,
        criterion=:misclassification,
        missingdatamode=:separate_class,
        minbucket = 5
    ),
    max_depth = depths,
    minbucket = 5,
)

# select all feature columns (numeric!)
df_selected = df[:, 4:end-1]

IAI.fit!(grid, df_numeric, df.Cluster)

In [101]:
lnr_depth = IAI.get_learner(grid)

# Evaluate the classification accuracy
accuracy = IAI.score(lnr_depth, df_numeric, df.Cluster)
println("Classification Accuracy: ", accuracy)
best_depth = lnr_depth.max_depth
println("Best tree depth: ", best_depth)

# Visualize the tree
IAI.show_in_browser(grid)

Classification Accuracy: 0.916972814107274
Best tree depth: 5


"/var/folders/p_/j5379bnd7t7f3l97gslk8q_80000gn/T/jl_2coujx.html"

Even though this is the best tree depth of the values we tested is 5, this is not super interpretable to distill into insights. Thus, we will explicitly print out trees for depth of 3 and 4 as well as see if those might be better at explaining the phenomena with acceptable accuracy scores (>80%).

#### Depth 4, Misclassification split

In [117]:
grid_4 = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed = seed,
        max_depth=4,             
        criterion=:misclassification, 
        missingdatamode=:separate_class 
    )
)

IAI.fit!(grid_4, df_numeric, df.Cluster)

In [118]:
lnr_depth_4 = IAI.get_learner(grid_4)

# Evaluate the classification accuracy
accuracy = IAI.score(lnr_depth_4, df_numeric, df.Cluster)
println("Classification Accuracy: ", accuracy)

IAI.show_in_browser(grid_4)

Classification Accuracy: 0.8714180749448934


"/var/folders/p_/j5379bnd7t7f3l97gslk8q_80000gn/T/jl_BYoc3o.html"

#### Depth 3, Misclassification split

In [119]:
grid_3 = IAI.GridSearch(
    IAI.OptimalTreeClassifier(
        random_seed = seed,
        max_depth=3,               
        criterion=:misclassification,  
        missingdatamode=:separate_class  
    )
)

IAI.fit!(grid_3, df_numeric, df.Cluster)

In [120]:
lnr_depth_3 = IAI.get_learner(grid_3)

# Evaluate the classification accuracy
accuracy = IAI.score(lnr_depth_3, df_numeric, df.Cluster)
println("Classification Accuracy: ", accuracy)

IAI.show_in_browser(grid_3)

Classification Accuracy: 0.7832476120499633


"/var/folders/p_/j5379bnd7t7f3l97gslk8q_80000gn/T/jl_K5n0Bi.html"

#### Interpretation with Tree of Depth 3
Cluster 1 - small population ---> less populated suburbs
Cluster 2 - large population, not taking public transit, educated ---> young professionals in the city
Cluster 3 - works from home, takes public transit, does not have high school degree ---> ????
Cluster 4 - large population, takes public transit, has high school degree ---> inner-city workers (????)
Cluster 5 - does not take public transit or work frm home, low college grads ---> ?????

#### Interpretation with Tree of Depth 4
Cluster 1 - small/medium population ---> less populated suburbs
Cluster 2 - large population, not walking to work, houses built in 1990-99 ---> wealthy millennials
Cluster 3 - large asian population that walk/not drive to work ---> chinatowns in various cities ?
Cluster 4 - large population, ethnic, educated (hs) ---> city
Cluster 5 - medium population, drive to work, very white areas --->