### Question 4: Apply a Decision Tree Classifier to the Dataset

Let us now try a Decision Tree Classifier to predict High Income vs Low Income. Use only the first 500 rows for training.



# Result

- Create a function `dec_tree(path)`, where `path` is the location of the CSV file
- This function should load the file, train the model, and then test the model
- The function should return a 2-tuple: (feature importances array, score of the model on the last 500 rows). For a classifier, this score is the mean accuracy rather than R^2
- E.g. `([0.5, 0.5], 0.8)`
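
The second element of the tuple comes from the classifier's `score` method, which in scikit-learn reports mean accuracy (the fraction of correctly labelled rows), not a regression-style R^2. As a minimal sketch of that calculation, using a hypothetical `accuracy` helper and made-up labels rather than the real dataset:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = High Income, 0 = Low Income
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 predictions match
```

A value of 0.8 in the example tuple would therefore mean that 80% of the test rows were classified correctly.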

In [1]:
# data management

import pandas as pd
import numpy as np

# ML prediction

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [2]:
def dec_tree(path):
    """Train a decision tree on the first 500 rows and score it on the remaining rows."""
    df = pd.read_csv(path, index_col=0)
    # binary target: 1 = High Income (Income >= 70000), 0 = Low Income
    df['High Income'] = np.where(df['Income'] >= 70000, 1, 0)
    features = ['Experience', 'Age']
    train = df.iloc[:500]
    test = df.iloc[500:]
    # 10-fold cross-validated search over criterion and splitter only
    # (max_depth, min_samples_split, ... are left at their defaults because there are only 2 features)
    try_grid = [{'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random']}]
    dec_init = GridSearchCV(DecisionTreeClassifier(), param_grid=try_grid, cv=10)
    dec_init.fit(train[features], train['High Income'])
    # fit a fresh tree on the full training set with the best parameters found
    best = dec_init.best_params_
    dec = DecisionTreeClassifier(criterion=best['criterion'], splitter=best['splitter'])
    dec.fit(train[features], train['High Income'])
    # score() returns the mean accuracy on the held-out rows
    return (dec.feature_importances_, dec.score(test[features], test['High Income']))
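
The grid search above chooses between the `'gini'` and `'entropy'` criteria, which are two impurity measures used to rate candidate splits. The following is a rough pure-Python sketch of both measures on made-up label lists. The helper names `gini` and `entropy` are illustrative only, and this is not scikit-learn's internal implementation:

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


# Hypothetical node contents: 1 = High Income, 0 = Low Income
mixed = [1, 1, 0, 0]   # evenly mixed node, maximally impure for two classes
pure = [1, 1, 1, 1]    # single-class node, impurity is zero
print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so the two criteria often select similar splits.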

In [3]:
dec_tree('HKUST_FinTech_Income_Dataset.csv')

(array([0.99077552, 0.00922448]), 0.992)
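
The importances are reported in the same order as the columns passed to `fit`, here `['Experience', 'Age']`, and they are normalised to sum to 1. A small sketch of pairing the values from the output above with their feature names (the variable names are illustrative):

```python
feature_names = ['Experience', 'Age']      # column order used when fitting
importances = [0.99077552, 0.00922448]     # copied from the output above

# Sort features from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, value in ranked:
    print(f"{name}: {value:.1%}")
```

Read this way, nearly all of the tree's impurity reduction in this run comes from splits on `Experience`, with `Age` contributing under 1%.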