```
     _                                     ____             _       _   ____                                    
    / \   _ __ ___   __ _ _______  _ __   / ___|  ___   ___(_) __ _| | |  _ \ _ __ ___   __ _ _ __ ___  ___ ___ 
   / _ \ | '_ ` _ \ / _` |_  / _ \| '_ \  \___ \ / _ \ / __| |/ _` | | | |_) | '__/ _ \ / _` | '__/ _ \/ __/ __|
  / ___ \| | | | | | (_| |/ / (_) | | | |  ___) | (_) | (__| | (_| | | |  __/| | | (_) | (_| | | |  __/\__ \__ \
 /_/   \_\_| |_| |_|\__,_/___\___/|_| |_| |____/ \___/ \___|_|\__,_|_| |_|   |_|  \___/ \__, |_|  \___||___/___/
                                                                                        |___/                   
```

### Module
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.

### Goal
Create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

### Tools
1. Pandas
2. scikit-learn
3. DecisionTreeClassifier Algorithm

### Requirements
1. File Definition
2. Data Preparation
3. hotspot_spi.csv generated
 
### Data Source
__${WORKDIR}__/data/ouptut/hotspot_spi.csv


In [1]:
import os
import sys

supervised_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
sys.path.append(supervised_dir)
sys.path

['/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised/classifier',
 '/opt/anaconda3/lib/python39.zip',
 '/opt/anaconda3/lib/python3.9',
 '/opt/anaconda3/lib/python3.9/lib-dynload',
 '',
 '/opt/anaconda3/lib/python3.9/site-packages',
 '/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised']

In [2]:
import os
import pandas as pd

import functions_classifier as func
from load_dataset import LoadDataset, SpiType

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, scale
from sklearn.model_selection import (cross_validate, train_test_split)

## Get the data

In [3]:
load_dataset = LoadDataset()
X, y = load_dataset.return_X_y_clf()

### Split dataset into train and test sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape, "y_test.shape:", y_test.shape)

X_train.shape: (1619, 49) y_train.shape: (1619,)
X_test.shape: (694, 49) y_test.shape: (694,)
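The split above draws a different random partition on every run. A minimal sketch, assuming reproducibility and equal class proportions in both subsets are wanted, passes `random_state` and `stratify` to `train_test_split`. The synthetic `X_demo` and `y_demo` are stand-ins for the real `X` and `y`, and the seed value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real X, y (assumption: four balanced classes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.repeat(["Q1", "Q2", "Q3", "Q4"], 50)

# random_state fixes the partition; stratify keeps class proportions
# identical in the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print("train:", X_tr.shape, "test:", X_te.shape)
labels, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, test_counts)))
```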


## Modeling

**Build and train the model**

In [5]:
params = {
    "criterion": "entropy", 
    "max_depth": 100, 
    "min_samples_leaf": 40, 
    "min_samples_split": 2, 
    "splitter": "best"
}

decision_tree_classifier = DecisionTreeClassifier(**params)
pipeline = make_pipeline(
    StandardScaler(),
    decision_tree_classifier
)
_ = pipeline.fit(X_train, y_train)
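With `min_samples_leaf=40`, every leaf must hold at least 40 training samples, so a tree fitted on 1619 rows can have at most 40 leaves and `max_depth=100` is never the binding constraint. The sketch below checks that bound on synthetic data (an assumption, standing in for the real training set) and shows how to reach the fitted tree through the step name that `make_pipeline` generates.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 400 rows, 6 features, 4 classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = rng.integers(0, 4, size=400)

demo_pipeline = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(
        criterion="entropy", max_depth=100, min_samples_leaf=40, random_state=0
    ),
)
demo_pipeline.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name.
tree = demo_pipeline.named_steps["decisiontreeclassifier"]
n_leaves = tree.get_n_leaves()
depth = tree.get_depth()
max_leaves = len(X_demo) // 40  # each leaf needs at least 40 samples

print(f"leaves: {n_leaves} (bound: {max_leaves}), depth: {depth}")
```

In practice this means `min_samples_leaf` is the parameter that controls the size of this tree; lowering `max_depth` below the actual depth would be needed for it to have any effect.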

**Hyperparameter optimization**

*Note: The execution of the code below may take a few minutes or hours.*

*Uncomment and run it when you need to optimize hyperparameters.*

In [6]:
# from sklearn.model_selection import (GridSearchCV)
# import warnings

# warnings.filterwarnings("ignore")

# parameters = {
#     "criterion": ("gini", "entropy"),
#     "splitter": ("best", "random"),
#     "max_depth":[20, 40, 60, 80, 100],
#     "min_samples_leaf":[20, 40, 60, 80, 100],
#     "min_samples_split": [2, 4, 6, 8, 10]
# }

# gridsearch = GridSearchCV(DecisionTreeClassifier(), parameters)
# gridsearch.fit(scale(X_train), y_train)

# print("Tuned Hyperparameters :", gridsearch.best_params_)
# print("Best Score:", gridsearch.best_score_)
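The commented cell scales `X_train` once before the search, so each cross-validation fold is standardized with statistics that include its own validation rows. A possible alternative, sketched here on synthetic data with a deliberately small grid, searches over the whole pipeline so the scaler is refit inside each fold. The data and the reduced grid are assumptions for illustration; parameter names take the `decisiontreeclassifier__` prefix that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class problem standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=4, random_state=0
)

search_pipeline = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(random_state=0)
)

# Reduced grid so the sketch runs in seconds; extend it as needed.
param_grid = {
    "decisiontreeclassifier__criterion": ["gini", "entropy"],
    "decisiontreeclassifier__min_samples_leaf": [20, 40],
}

search = GridSearchCV(search_pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

print("Tuned hyperparameters:", search.best_params_)
print("Best score:", search.best_score_)
```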

__Check the most relevant features for the training model__

In [7]:
func.get_feature_importances(decision_tree_classifier, X_train)

| Index | Features | Relevance (%) |
|---|---|---|
| 35 | Focos de calor por habitantes | 53 |
| 1 | Mortalidade materna | 9 |
| 14 | Homicídios | 8 |
| 30 | Mortalidade por suicídios | 7 |
| 34 | Emissões CO2 | 4 |
| 3 | Mortalidade por doenças infecciosas | 4 |
| 31 | Áreas Protegidas | 2 |
| 25 | Densidade TV por assinatura | 2 |
| 29 | Mortalidade por doenças respiratórias | 2 |
| 24 | Densidade telefonia movel | 1 |
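The source of `func.get_feature_importances` is not shown in this notebook. A minimal sketch of how such a table could be produced, assuming it reads the fitted tree's `feature_importances_` attribute and the column names of `X_train`, follows. The function name `feature_importance_table` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def feature_importance_table(model, X, top_n=10):
    """Hypothetical stand-in for func.get_feature_importances."""
    table = pd.DataFrame({
        "Features": X.columns,
        # feature_importances_ sums to 1.0; express it as whole percentages.
        "Relevance (%)": np.round(model.feature_importances_ * 100).astype(int),
    })
    return table.sort_values("Relevance (%)", ascending=False).head(top_n)


# Toy data in which only column "a" determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["a", "b", "c", "d"])
y_demo = (X_demo["a"] > 0).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
table = feature_importance_table(model, X_demo)
print(table)
```

Impurity-based importances of this kind measure how much each feature reduced impurity in the fitted tree, so they describe this particular tree rather than the predictive value of a feature in general.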


__Predict and show model result__

In [8]:
y_predict = pipeline.predict(X_test)
func.show_model_result(pipeline, X, y, y_test, y_predict)


Computing cross-validated metrics
----------------------------------------------------------------------
Scores: [0.44924406 0.46436285 0.46652268 0.46536797 0.55194805]
Mean = 0.48 / Standard Deviation = 0.04
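As a quick check, the summary line can be recomputed from the five fold scores printed above. The sketch assumes the helper reports the population standard deviation (NumPy's default); the sample standard deviation rounds to the same two decimals here.

```python
import numpy as np

# Fold scores copied from the output above.
scores = np.array([0.44924406, 0.46436285, 0.46652268, 0.46536797, 0.55194805])

mean = scores.mean()
std = scores.std()  # population standard deviation (ddof=0)

print(f"Mean = {mean:.2f} / Standard Deviation = {std:.2f}")
```

Four of the five folds score between 0.45 and 0.47; the last fold, at 0.55, accounts for most of the spread.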

Confusion Matrix
----------------------------------------------------------------------
[[129  27  22   4]
 [ 57  68  41  18]
 [ 24  51  40  45]
 [  8  27  16 117]]

Classification Report
----------------------------------------------------------------------
              precision    recall  f1-score   support

          Q1       0.59      0.71      0.65       182
          Q2       0.39      0.37      0.38       184
          Q3       0.34      0.25      0.29       160
          Q4       0.64      0.70      0.66       168

    accuracy                           0.51       694
   macro avg       0.49      0.51      0.49       694
weighted avg       0.49      0.51      0.50       694
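The figures in the report follow directly from the confusion matrix, where rows are true classes and columns are predicted classes. The sketch below recomputes them with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([
    [129, 27, 22, 4],
    [57, 68, 41, 18],
    [24, 51, 40, 45],
    [8, 27, 16, 117],
])
labels = ["Q1", "Q2", "Q3", "Q4"]

true_pos = np.diag(cm)
support = cm.sum(axis=1)              # samples per true class
recall = true_pos / support           # correct / all samples of that class
precision = true_pos / cm.sum(axis=0) # correct / all predictions of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f} support={s}")
print(f"accuracy={accuracy:.3f}")
```

Read this way, the matrix shows that the model separates the extreme quartiles Q1 and Q4 reasonably well, while Q2 and Q3 are frequently confused with their neighbours, which is what pulls accuracy down to 0.51.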

----------------------------------------------------------------------
Accuracy: 0.51
Precici