## Documentation
1. The focus is on combining models to obtain consistent predictions rather than an attractive position on the leaderboard (`lb`).

2. The approach will be data-centric.

3. This notebook will be updated gradually to improve the final result.


## Models

1. XGBoost:
- It requires class labels that start at zero, so 1 is subtracted from the target column `y`.
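
The label shift can be sketched on a few made-up values. The arrays below are hypothetical stand-ins for `damage_grade` and for model predictions; the only thing taken from this notebook is that the original labels run from 1 to 3 and are shifted to 0 to 2. Shifting the predictions back before writing a submission is an assumption about what the competition expects.

```python
import numpy as np

# Hypothetical labels in the original 1..3 encoding of `damage_grade`.
y_raw = np.array([1, 3, 2, 2, 1])

# Shift to 0..2 for training, since XGBoost expects classes to start at 0.
y_shifted = y_raw - 1

# Stand-in for model predictions, which come out in the shifted encoding.
y_pred_shifted = np.array([0, 2, 1, 1, 0])

# Shift back to the original encoding (assumed to be needed for a submission).
y_pred = y_pred_shifted + 1

print(y_shifted.tolist(), y_pred.tolist())
```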

## Pipeline
1. Numeric
- StandardScaler()

2. Categorical
- OneHotEncoder
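
To make the effect of the two transformers concrete, here is a minimal sketch on a toy frame. The column names (`age`, `area`, `roof_type`) and values are invented for illustration and are not taken from the competition data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical columns: two numeric, one categorical.
toy = pd.DataFrame({
    "age": [10, 20, 30, 40],
    "area": [5.0, 7.0, 9.0, 11.0],
    "roof_type": ["a", "b", "a", "c"],
})

# Scale the numeric columns and one-hot encode the categorical one.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "area"]),
    ("cat", OneHotEncoder(), ["roof_type"]),
])

Xt = preprocessor.fit_transform(toy)

# Two scaled columns plus one indicator column per category (a, b, c).
print(Xt.shape)
```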

## GridSearch
1. Cross-Validation
- KFold: 5 folds
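
As a rough illustration of what a 5-fold split does, the sketch below runs `KFold` on ten made-up rows and records the size of each train/validation partition. The array `X_toy` is hypothetical. In scikit-learn, a splitter like this is handed to `GridSearchCV` through its `cv` parameter.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples with two features each.
X_toy = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_toy)):
    # Each fold holds out a different fifth of the rows for validation.
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(fold, train_idx, val_idx)
```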

# Code
Structure for finding patterns in the data to predict the `damage_grade` column.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = 10, 4
plt.rcParams['font.size'] = 12

from sklearn.model_selection import train_test_split

import xgboost as xgb

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import roc_auc_score, make_scorer

from sklearn.metrics import classification_report


URL_TRAIN = '/kaggle/input/ml-olympiad-predicting-earthquake-damage/train.csv'
URL_SAMPLE = '/kaggle/input/ml-olympiad-predicting-earthquake-damage/sample_submission.csv'
URL_TEST = '/kaggle/input/ml-olympiad-predicting-earthquake-damage/test.csv'

N_SAMPLE = 0
TARGET = 'damage_grade'
SEED = 101

In [2]:
%%time

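# Load the data
train = pd.read_csv(URL_TRAIN)

# Split features and target; shift labels so classes start at 0 (required by XGBoost)
X, y = train.drop(TARGET, axis=1), train[TARGET] - 1
# Hold out 20% for validation; SEED makes the split reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=SEED)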

# Parameter grid for XGBoost (all candidates are currently commented out)
params =  {

    #"classifier__booster":['dart', 'gbtree', 'gblinear'], # Booster
    #"classifier__eta": np.logspace(0, -2, num=2), # Learning rate
    #"classifier__max_depth": np.arange(1, 100, 10) # Tree depth
}

# Model
base_classifier = xgb.XGBClassifier(device="cpu")

# Define the data transformations
numeric_features = X.select_dtypes(exclude=['object']).columns
numeric_transformer = Pipeline( # <- Scale the numeric features
        steps=[("scaler", StandardScaler())])

categorical_features = X.select_dtypes(include=['object']).columns
# Note: handle_unknown='error' raises if transform sees a category that was absent during fit
categorical_transformer = Pipeline(# <- Encode the categorical variables
        steps=[("encoder", OneHotEncoder(handle_unknown='error'))])

preprocessor = ColumnTransformer( # <- Combine the transformations
        transformers=[
            ("num", numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

clf = Pipeline( # <- Combine the transformations with the chosen model
        steps=[
            ("preprocessor", preprocessor),
            ("classifier", base_classifier)])

# Cross-validation method
kfold = KFold(n_splits=5) # <- Five folds

# Search for the best parameters
grid = GridSearchCV(
        clf,
        param_grid=params, # <- Empty for now, so only the default model is fitted
        cv=kfold)

model = grid.fit(X_train, y_train)

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/metrics/_scorer.py", line 444, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/pipeline.py", line 718, in score
    Xt = transform.transform(Xt)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py", line 800, in transform
    Xs = self._fit_transform(
  File "/opt/conda/lib/python3.10/site-packages/sklearn/compose/_column_transformer.py", line 658, in _fit_transform
    return Parallel(n_jobs=self.n_jobs)(
  File "/opt/conda/lib/python3.10/site-packages/sklearn/utils/parallel.py", line 63, i

CPU times: user 13.6 s, sys: 155 ms, total: 13.8 s
Wall time: 4.07 s
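
The traceback above is truncated, so its cause cannot be confirmed from the output. One possible explanation for a failure inside `ColumnTransformer.transform` during scoring is a category that appears in a scoring fold but not in the corresponding fitting fold, which `OneHotEncoder(handle_unknown='error')` rejects. The sketch below reproduces that behaviour on made-up categories and shows how `handle_unknown='ignore'` encodes the unseen value as an all-zero row instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories: "c" is never seen while fitting.
train_cat = np.array([["a"], ["b"]])
new_cat = np.array([["c"]])

# With handle_unknown="error", transform raises on the unseen category.
strict = OneHotEncoder(handle_unknown="error").fit(train_cat)
try:
    strict.transform(new_cat)
    strict_failed = False
except ValueError as exc:
    strict_failed = True
    print("strict encoder raised:", exc)

# With handle_unknown="ignore", the unseen category becomes all zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_cat)
encoded = lenient.transform(new_cat).toarray()
print(encoded)
```

Whether an all-zero encoding is acceptable depends on how often unseen categories occur in the real data, which this sketch does not measure.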


In [3]:
# Evaluate on the training data
y_hat_train = model.predict(X_train)
report_train = classification_report(y_train, y_hat_train)
print(report_train)

              precision    recall  f1-score   support

           0       0.99      0.94      0.96       581
           1       0.95      0.97      0.96      1590
           2       0.96      0.94      0.95      1029

    accuracy                           0.96      3200
   macro avg       0.96      0.95      0.96      3200
weighted avg       0.96      0.96      0.96      3200



In [4]:
# Evaluate on the held-out validation data
y_hat_val = model.predict(X_val)
report_val = classification_report(y_val, y_hat_val)
print(report_val)

              precision    recall  f1-score   support

           0       0.65      0.52      0.58       148
           1       0.50      0.59      0.54       378
           2       0.44      0.38      0.41       274

    accuracy                           0.51       800
   macro avg       0.53      0.50      0.51       800
weighted avg       0.51      0.51      0.50       800

