```
     _                                     ____             _       _   ____                                    
    / \   _ __ ___   __ _ _______  _ __   / ___|  ___   ___(_) __ _| | |  _ \ _ __ ___   __ _ _ __ ___  ___ ___ 
   / _ \ | '_ ` _ \ / _` |_  / _ \| '_ \  \___ \ / _ \ / __| |/ _` | | | |_) | '__/ _ \ / _` | '__/ _ \/ __/ __|
  / ___ \| | | | | | (_| |/ / (_) | | | |  ___) | (_) | (__| | (_| | | |  __/| | | (_) | (_| | | |  __/\__ \__ \
 /_/   \_\_| |_| |_|\__,_/___\___/|_| |_| |____/ \___/ \___|_|\__,_|_| |_|   |_|  \___/ \__, |_|  \___||___/___/
                                                                                        |___/                   
 Supervised Decision Tree Classifier
```

### Module
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset.

### Goal
Create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
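To make "simple decision rules" concrete, here is a minimal sketch that fits a shallow tree and prints the rules it learned. It uses scikit-learn's bundled Iris data purely as a stand-in, since the notebook's own dataset is loaded later; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris is a stand-in dataset for illustration; it is not the notebook's data.
iris = load_iris()

# A shallow tree keeps the printed rule set short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold tests on features.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is one threshold test on a single feature, and each leaf line names the predicted class.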

### Tools
1. Pandas
2. scikit-learn
3. Decision tree algorithm (`DecisionTreeClassifier`)

### Requirements
1. File Definition
2. Data Preparation
3. hotspot_spi.csv generated
 
### Data Source
__${WORKDIR}__/data/ouptut/hotspot_spi.csv


In [1]:
import os
import pandas as pd

import functions as func
from load_dataset import LoadDataset

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import (cross_validate, train_test_split)

## Get the data

In [3]:
load_dataset = LoadDataset()
X, y = load_dataset.get_data()

### Split dataset into train and test sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape, "y_test.shape:", y_test.shape)

X_train.shape: (1737, 49) y_train.shape: (1737,)
X_test.shape: (579, 49) y_test.shape: (579,)
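The classes in this dataset are not equally frequent (the classification report further down lists test-set supports between 65 and 167). One option worth considering is a stratified split, which keeps the class proportions roughly equal in both subsets. The sketch below uses synthetic labels, so `X_demo` and `y_demo` are assumptions rather than the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with five unevenly sized classes.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = np.repeat([0, 1, 2, 3, 4], [60, 30, 60, 20, 30])

# stratify=y_demo asks the splitter to preserve class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Compare the per-class share in each subset.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share.round(2))
print("test: ", test_share.round(2))
```

Without `stratify`, a rare class can end up under-represented in the test set by chance, which makes per-class metrics noisier.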


## Modeling

**Build, train, and predict with the model**

**Find the best hyperparameters**

*Note: the code below may take anywhere from a few minutes to several hours to run.*

*Uncomment and run it only when you need to optimize the hyperparameters.*

In [5]:
# space = dict()
# space['criterion'] = ["gini", "entropy"]
# space['splitter'] = ["best", "random"]
# space['max_depth'] = list(range(1, 10))  # max_depth must be >= 1
# # space['min_samples_split'] = list(range(2, 10))  # an integer value must be >= 2
# space['min_samples_leaf'] = list(range(1, 10))  # an integer value must be >= 1

# func.show_best_hyperparameter_optimization(
#     DecisionTreeClassifier(), 
#     space, 
#     X_train, 
#     y_train
# )
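The helper `func.show_best_hyperparameter_optimization` is not shown in this notebook, so its behavior is unknown. As a rough idea of what such a search can look like, the sketch below runs scikit-learn's `GridSearchCV` over a smaller version of the same search space. The synthetic data, the reduced grid, and the `accuracy` scoring are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would pass X_train and y_train instead.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# A reduced version of the search space above, using only valid values.
space = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 10],
}

# Exhaustive search: every combination is scored with 5-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), space, scoring="accuracy", cv=5
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

The cost grows with the product of the list lengths, which is why the full grid can take a long time. `RandomizedSearchCV` is a common alternative when the grid is large.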

In [6]:
decision_tree_classifier = DecisionTreeClassifier(max_depth=6)
pipeline = make_pipeline(
    StandardScaler(),
    decision_tree_classifier
)

_ = pipeline.fit(X_train, y_train)

__Check the most relevant features for the trained model__

In [7]:
func.get_feature_importances(decision_tree_classifier, X_train)

|    | Features | Relevance (%) |
|---:|:---|---:|
| 35 | Focos de calor por habitantes | 39 |
| 11 | Moradias com piso adequado | 8 |
| 37 | Transporte Público | 4 |
| 24 | Densidade telefonia movel | 4 |
| 18 | Distorção idade-série ensino fundamental | 4 |
| 34 | Emissões CO2 | 3 |
| 33 | Desmatamento recente | 3 |
| 32 | Desmatamento acumulado | 3 |
| 3 | Mortalidade por doenças infecciosas | 3 |
| 27 | Mortalidade por câncer | 3 |
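The implementation of `func.get_feature_importances` is not included here. A table like the one above can be built from the fitted tree's `feature_importances_` attribute, as in the following sketch. The Iris data and the column names are assumptions chosen to mirror the output format, not the author's code.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data as a DataFrame so that feature names are available.
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# feature_importances_ sums to 1; scale to whole percentages and rank.
importances = (
    pd.DataFrame(
        {
            "Features": X_demo.columns,
            "Relevance (%)": (model.feature_importances_ * 100).round().astype(int),
        }
    )
    .sort_values("Relevance (%)", ascending=False)
    .head(10)
)
print(importances)
```

These importances are impurity-based, so they reflect how much each feature reduced impurity in this particular tree rather than a causal effect.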


__Predict and show the model results__

In [8]:
y_predict = pipeline.predict(X_test)
func.show_model_result(pipeline, X, y, y_test, y_predict)


Computing cross-validated metrics
----------------------------------------------------------------------
Scores: [0.3987069  0.39308855 0.44708423 0.41900648 0.44492441]
Mean = 0.42 / Standard Deviation = 0.02

Confusion Matrix
----------------------------------------------------------------------
[[76  6 63 11 11]
 [ 5  9 59  0 28]
 [38  7 93  1 26]
 [38  0  7 19  1]
 [ 4  2 19  0 56]]

Classification Report
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.47      0.46      0.46       167
           1       0.38      0.09      0.14       101
           2       0.39      0.56      0.46       165
           3       0.61      0.29      0.40        65
           4       0.46      0.69      0.55        81

    accuracy                           0.44       579
   macro avg       0.46      0.42      0.40       579
weighted avg       0.44      0.44      0.41       579

-------------------------

ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].
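The `ValueError` above is what scikit-learn raises when `precision_score`, `recall_score`, or `f1_score` is called on a target with more than two classes while `average` is left at its default of `'binary'`. Which of these calls `func.show_model_result` makes is not visible here, so the following is only a sketch of the usual remedy: pass an explicit multiclass average. The synthetic data and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class stand-in for the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
y_pred = DecisionTreeClassifier(max_depth=6, random_state=42).fit(X_tr, y_tr).predict(X_te)

# average="macro" weights every class equally; "weighted" weights by support.
# zero_division=0 avoids warnings if some class is never predicted.
scores = {
    "precision": precision_score(y_te, y_pred, average="macro", zero_division=0),
    "recall": recall_score(y_te, y_pred, average="macro", zero_division=0),
    "f1": f1_score(y_te, y_pred, average="macro", zero_division=0),
}
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With unequal class sizes, `'macro'` and `'weighted'` can differ noticeably, as the last two rows of the classification report above already show.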

**Show the ROC curve and the area under it (AUC)**

In [None]:
func.show_curve_roc(pipeline, X_test, y_test, y_predict)
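ROC analysis is defined for binary problems, so a five-class target needs an extension such as one-vs-rest, and it needs class probabilities rather than hard predictions. How `func.show_curve_roc` handles this is not shown. The sketch below is one possible approach on synthetic data: it computes a macro-averaged one-vs-rest AUC and one ROC curve per class. All names in it are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.tree import DecisionTreeClassifier

# Synthetic five-class stand-in for the notebook's data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
model = DecisionTreeClassifier(max_depth=6, random_state=42).fit(X_tr, y_tr)

# ROC needs scores: one probability column per class.
y_score = model.predict_proba(X_te)

# One-vs-rest AUC, averaged with equal weight per class.
macro_auc = roc_auc_score(y_te, y_score, multi_class="ovr", average="macro")
print(f"Macro one-vs-rest AUC: {macro_auc:.2f}")

# Per-class curves: treat each class in turn as the positive label.
y_te_bin = label_binarize(y_te, classes=model.classes_)
curves = {}
for i, cls in enumerate(model.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], y_score[:, i])
    curves[int(cls)] = (fpr, tpr)
```

Each `(fpr, tpr)` pair in `curves` can then be drawn with any plotting library, for example one line per class on shared axes.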