```
     _                                     ____             _       _   ____                                    
    / \   _ __ ___   __ _ _______  _ __   / ___|  ___   ___(_) __ _| | |  _ \ _ __ ___   __ _ _ __ ___  ___ ___ 
   / _ \ | '_ ` _ \ / _` |_  / _ \| '_ \  \___ \ / _ \ / __| |/ _` | | | |_) | '__/ _ \ / _` | '__/ _ \/ __/ __|
  / ___ \| | | | | | (_| |/ / (_) | | | |  ___) | (_) | (__| | (_| | | |  __/| | | (_) | (_| | | |  __/\__ \__ \
 /_/   \_\_| |_| |_|\__,_/___\___/|_| |_| |____/ \___/ \___|_|\__,_|_| |_|   |_|  \___/ \__, |_|  \___||___/___/
                                                                                        |___/                   
```

### Module
__ExtraTreesClassifier__ implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
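
To make the "averaging" in the description concrete, here is a minimal sketch on synthetic data (the `make_classification` dataset and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration, not part of this project). It checks that the ensemble's class probabilities are the mean of the probabilities predicted by its individual trees.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (an assumption; this is not hotspot_spi.csv).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

# Compare that average with the probabilities reported by the ensemble itself.
averaging_matches = np.allclose(tree_probas, forest.predict_proba(X_demo))
print("Forest output equals the mean of its trees:", averaging_matches)
```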

### Goal
Create a model that combines the predictions of several decision tree estimators.

### Tools
1. Pandas
2. scikit-learn
3. ExtraTreesClassifier

### Requirements
1. File Definition
2. Data Preparation
3. hotspot_spi.csv has been generated
 
### Data Source
__${WORKDIR}__/data/ouptut/hotspot_spi.csv

In [1]:
import os
import sys

supervised_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
sys.path.append(supervised_dir)
sys.path

['/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised/classifier',
 '/opt/anaconda3/lib/python39.zip',
 '/opt/anaconda3/lib/python3.9',
 '/opt/anaconda3/lib/python3.9/lib-dynload',
 '',
 '/opt/anaconda3/lib/python3.9/site-packages',
 '/home/fausto/Development/workspace/amazon-social-progress/ml_models/supervised']

In [2]:
import pandas as pd

import functions_classifier as func
from load_dataset import LoadDataset

from sklearn.ensemble import ExtraTreesClassifier

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, scale
from sklearn.model_selection import train_test_split

## Get the data

In [3]:
load_dataset = LoadDataset()
X, y = load_dataset.return_X_y_clf()

### Split dataset into train and test sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

print("X_train.shape:", X_train.shape, "y_train.shape:", y_train.shape)
print("X_test.shape:", X_test.shape, "y_test.shape:", y_test.shape)

X_train.shape: (1619, 49) y_train.shape: (1619,)
X_test.shape: (694, 49) y_test.shape: (694,)
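
The split above draws a new random partition on every run, so the scores reported later will vary slightly between executions. The sketch below shows one way to make the split reproducible and to keep class proportions equal in both subsets. The `random_state=42` value, the `stratify` argument, and the randomly generated `X_demo` and `y_demo` arrays are assumptions for illustration; only the array sizes are taken from the output above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data with the same shape as the notebook's dataset
# (2313 rows, 49 features); the values themselves are made up.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2313, 49))
y_demo = rng.choice(["Q1", "Q2", "Q3", "Q4"], size=2313)

# random_state fixes the shuffle; stratify keeps the class mix in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train:", X_tr.shape, "test:", X_te.shape)
```

Calling `train_test_split` again with the same `random_state` returns the same partition, which makes later metrics comparable across runs.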


## Modeling

### Finding the best hyperparameters

*Note: Running the code below may take anywhere from a few minutes to several hours.*

*Uncomment and run it when you need to optimize hyperparameters.*

In [5]:
# from sklearn.model_selection import GridSearchCV
# import warnings

# warnings.filterwarnings("ignore")

# parameters = {
#     "n_estimators":[100, 200, 300, 400, 500, 600],
#     "criterion": ("gini", "entropy", "log_loss"),
#     "max_depth":[20, 40, 60, 80, 100],
#     "min_samples_split": [2, 4, 6, 8, 10],
#     "min_samples_leaf":[20, 40, 60, 80, 100],
# }

# gridsearch = GridSearchCV(ExtraTreesClassifier(), parameters)
# gridsearch.fit(scale(X_train), y_train)

# print("Tuned Hyperparameters :", gridsearch.best_params_)
# print("Best Score:", gridsearch.best_score_)
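
The commented grid above spans 6 × 3 × 5 × 5 × 5 = 2,250 parameter combinations. With the default 5-fold cross-validation of `GridSearchCV`, that is 11,250 model fits, which explains the long runtime. The sketch below is a much smaller variant, not the search used in this notebook: it runs on synthetic data and tunes the classifier inside a pipeline, so the scaler is refitted on each training fold rather than on the whole training set. The grid values, `cv=3`, and the names `search_pipeline` and `small_grid` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not hotspot_spi.csv).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

search_pipeline = make_pipeline(StandardScaler(), ExtraTreesClassifier(random_state=0))

# make_pipeline names each step after its lowercased class name, so the
# classifier's parameters are addressed with the "extratreesclassifier__" prefix.
small_grid = {
    "extratreesclassifier__n_estimators": [10, 20],
    "extratreesclassifier__min_samples_leaf": [1, 5],
}

search = GridSearchCV(search_pipeline, small_grid, cv=3)
search.fit(X_demo, y_demo)

print("Tuned hyperparameters:", search.best_params_)
print("Best score:", round(search.best_score_, 3))
```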

### Build and train the model

In [32]:
params = {
    "criterion": "gini", 
    "max_depth": 100, 
    "min_samples_leaf": 20, 
    "min_samples_split": 3, 
    "n_estimators": 200
}
classifier = ExtraTreesClassifier(**params)
pipeline = make_pipeline(
    StandardScaler(),
    classifier
)

_ = pipeline.fit(X_train, y_train)
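
A detail worth noting: `make_pipeline` keeps a reference to the estimator object it is given, so fitting the pipeline also fits `classifier` itself. That is why the fitted `classifier` can be inspected directly afterwards. The sketch below illustrates this on synthetic data; the names `clf`, `pipe`, and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not hotspot_spi.csv).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

clf = ExtraTreesClassifier(n_estimators=20, random_state=0)
pipe = make_pipeline(StandardScaler(), clf)
pipe.fit(X_demo, y_demo)

# The last pipeline step is the very same object as clf, now fitted.
same_object = pipe[-1] is clf
print("Pipeline step is the original classifier:", same_object)
print("Number of importances:", clf.feature_importances_.shape[0])
```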

### Check the most relevant features for the trained model

In [33]:
func.get_feature_importances(classifier, X_train)

| Index | Features | Relevance (%) |
|---|---|---|
| 34 | Emissões CO2 | 15 |
| 35 | Focos de calor por habitantes | 12 |
| 32 | Desmatamento acumulado | 9 |
| 31 | Áreas Protegidas | 6 |
| 15 | Homicídios Taxa | 5 |
| 14 | Homicídios | 3 |
| 33 | Desmatamento recente | 3 |
| 18 | Distorção idade-série ensino fundamental | 3 |
| 27 | Mortalidade por câncer | 2 |
| 26 | Mortalidade por diabetes mellitus | 2 |
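
The implementation of `func.get_feature_importances` is not shown in this notebook. As a rough idea of how such a ranking can be produced, the sketch below builds a similar table from the `feature_importances_` attribute of a fitted `ExtraTreesClassifier`. The helper `top_feature_importances`, the synthetic data, and the `feature_0` ... `feature_5` column names are hypothetical and only illustrate the approach.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier


def top_feature_importances(model, feature_names, top=10):
    """Hypothetical helper: rank features by impurity-based importance (in %)."""
    table = pd.DataFrame(
        {
            "Features": list(feature_names),
            "Relevance (%)": (model.feature_importances_ * 100).round().astype(int),
        }
    )
    return table.sort_values("Relevance (%)", ascending=False).head(top)


# Synthetic stand-in data (an assumption; this is not hotspot_spi.csv).
X_arr, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
ranking = top_feature_importances(model, X_demo.columns, top=3)
print(ranking)
```

The raw importances sum to 1 across all features, so each percentage is that feature's share of the total impurity reduction.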


### Predict and show model result

In [34]:
y_predict = pipeline.predict(X_test)
func.show_model_result(pipeline, X, y, y_test, y_predict)


Computing cross-validated metrics
----------------------------------------------------------------------
Scores: [0.51403888 0.51403888 0.5399568  0.53679654 0.49350649]
Mean = 0.52 / Standard Deviation = 0.02

Confusion Matrix
----------------------------------------------------------------------
[[140  12  10   3]
 [ 69  50  43  28]
 [ 20  31  55  54]
 [  5  20  25 129]]

Classification Report
----------------------------------------------------------------------
              precision    recall  f1-score   support

          Q1       0.60      0.85      0.70       165
          Q2       0.44      0.26      0.33       190
          Q3       0.41      0.34      0.38       160
          Q4       0.60      0.72      0.66       179

    accuracy                           0.54       694
   macro avg       0.51      0.54      0.52       694
weighted avg       0.51      0.54      0.51       694
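
The figures in this report can be cross-checked against the confusion matrix above. Its row sums (165, 190, 160, 179) match the support column, so rows are true classes and columns are predicted classes. The sketch below recomputes accuracy, recall, and precision from those counts in plain Python; the variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix above:
# rows are true classes, columns are predicted classes.
confusion = [
    [140, 12, 10, 3],
    [69, 50, 43, 28],
    [20, 31, 55, 54],
    [5, 20, 25, 129],
]
labels = ["Q1", "Q2", "Q3", "Q4"]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: correct predictions divided by the number of true samples (row sum).
recall = {
    label: confusion[i][i] / sum(confusion[i]) for i, label in enumerate(labels)
}

# Precision: correct predictions divided by the number of predictions (column sum).
precision = {
    label: confusion[i][i] / sum(row[i] for row in confusion)
    for i, label in enumerate(labels)
}

print(f"accuracy = {correct}/{total} = {accuracy:.2f}")
for label in labels:
    print(f"{label}: precision = {precision[label]:.2f}, recall = {recall[label]:.2f}")
```

The middle classes Q2 and Q3 have the lowest recall (50/190 and 55/160), because most of their samples are assigned to neighbouring classes.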

----------------------------------------------------------------------
Accuracy: 0.54
Precici