## 🏠 House Prices - XGBoost 2

### XGBoost Model Highlights (Score: 0.13202)

* **Initial preprocessing:** reads the data and drops columns with more than 10% missing values, so the training set is consistent and free of heavily incomplete variables.

* **Feature transformation:** numeric features get median imputation and standardization, while categorical features are converted with One-Hot Encoding. The whole flow lives in a single ColumnTransformer to keep training and inference consistent.

* **Model training:** builds a pipeline around XGBRegressor, whose hyperparameters were tuned with RandomizedSearchCV. The model is trained on the log-transformed target for better stability and predictive performance.

* **Generating predictions:** the test set is reindexed to match the training columns, then predictions are made and converted back from the log scale to the original scale. The final result reached a Kaggle score of **0.13202**, placing this experiment among the best in the project.

## 1. Initial Setup


In [1]:
# =====================================================
# 🏠 House Prices - XGBoost
# =====================================================
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning, 
                       message='Found unknown categories in columns')
import time
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)

# Scikit-learn - Model selection and evaluation
from sklearn.model_selection import train_test_split, cross_val_score, KFold, RandomizedSearchCV

# Scikit-learn - Preprocessing and pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scikit-learn - Linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV

# Scikit-learn - Ensemble methods
from sklearn.ensemble import RandomForestRegressor

# Scikit-learn - Evaluation metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# XGBoost
from xgboost import XGBRegressor

# Distributions for hyperparameter search
from scipy.stats import randint, uniform, loguniform

from setup_notebook import setup_path
setup_path()
from src.model_utils import *

# =====================================================
# 📁 1. Read the data
# =====================================================
dfo = pd.read_csv("/home/akel/PycharmProjects/Kaggle/HousePrices/data/train.csv")

df_train = dfo.copy()
# =====================================================
# 🧹 2. Initial preprocessing
# =====================================================
# drop columns with many missing values (> 10% of the rows)
colnull_train = df_train.columns[df_train.isnull().mean() > 0.1]
df_train = df_train.drop(columns=colnull_train)

id_train=df_train['Id']

# collect the names of the categorical and numeric features
num_features = df_train.select_dtypes(include=['number']).columns.drop(['Id', 'SalePrice'])
cat_features = df_train.select_dtypes(include=['object']).columns

# =====================================================
# 🧩 3. Preprocessors
# =====================================================
# numeric: impute NaN with the median, then standardize
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# categorical -> one-hot encoded binary columns
cat_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first',
                             sparse_output=False,
                             handle_unknown='ignore'))
])

preprocessador = ColumnTransformer(transformers=[
    ('cat', cat_transformer, cat_features),
    ('num', num_transformer, num_features)   
],verbose_feature_names_out=False) 


# =====================================================
#  🤖 4. Models & pipeline
# =====================================================
model_xg2 = XGBRegressor(objective='reg:squarederror', n_estimators=700,
                         subsample=0.6, reg_lambda=0.5, reg_alpha=1.0,
                         max_depth=3, learning_rate=0.073,
                         colsample_bytree=0.7,
                         n_jobs=-1)
model_XGB2 = Pipeline([('preprocess', preprocessador),
                       ('model', model_xg2)])

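The hyperparameters above are described as the outcome of a RandomizedSearchCV run that the notebook does not show. A hedged sketch of what such a search could look like follows. It is not the author's search: the data is synthetic, the ranges and `n_iter` are assumptions, and scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` so the example runs without XGBoost installed.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the House Prices features)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Stand-in pipeline; the step name "model" matches the notebook's pipeline
search_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingRegressor(random_state=42)),
])

# Illustrative ranges only, not the ones used to obtain the tuned values
param_distributions = {
    "model__n_estimators": randint(50, 151),         # integers 50..150
    "model__max_depth": randint(2, 5),               # integers 2..4
    "model__learning_rate": loguniform(0.01, 0.3),   # log-uniform on [0.01, 0.3]
    "model__subsample": uniform(0.5, 0.5),           # uniform on [0.5, 1.0]
}

search = RandomizedSearchCV(
    search_pipe,
    param_distributions=param_distributions,
    n_iter=5,                                  # number of sampled configurations
    scoring="neg_root_mean_squared_error",     # higher (closer to 0) is better
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=1,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("CV RMSE:", -search.best_score_)
```

Parameters of a pipeline step are addressed as `<step>__<param>`, so with the notebook's pipeline the keys would also start with `model__` but use XGBoost's parameter names, for example `model__colsample_bytree`.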
## 2. Training

In [2]:
# features and log-transformed target
X = df_train.drop(['Id', 'SalePrice'], axis=1)
y_log = np.log1p(df_train['SalePrice'])

# hold out 30% of the rows for validation
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.3, random_state=42)
model_XGB2.fit(X_train, y_train)
y_pred = model_XGB2.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
resultados = {
    'Modelo': 'xg2',
    'MAE': round(mae, 4),
    'RMSE': round(rmse, 4),
    'R²': round(r2, 4)}
print('✅ XGBOOST 2')
print(resultados)

✅ XGBOOST 2
{'Modelo': 'xg2', 'MAE': 0.0838, 'RMSE': 0.124, 'R²': 0.9093}

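Because the target is `log1p(SalePrice)`, the MAE and RMSE above are in log units rather than in currency. Assuming the leaderboard metric is the RMSE between the logarithms of predicted and observed prices, as the competition describes, the hold-out RMSE is directly comparable to the Kaggle score. The sketch below uses made-up prices to show the round trip through `log1p`/`expm1` and how little `log1p` differs from `log` at this price scale.

```python
import numpy as np

# Hypothetical prices for illustration only
prices_true = np.array([208500.0, 181500.0, 223500.0, 140000.0])
prices_pred = np.array([200000.0, 190000.0, 215000.0, 150000.0])

# RMSE on the log1p scale, as computed in the training cell
log_true = np.log1p(prices_true)
log_pred = np.log1p(prices_pred)
rmse_log1p = np.sqrt(np.mean((log_pred - log_true) ** 2))

# RMSE on the plain log scale (assumed to be the leaderboard metric)
rmse_log = np.sqrt(np.mean((np.log(prices_pred) - np.log(prices_true)) ** 2))

# expm1 is the exact inverse of log1p, so prices are recovered on the original scale
recovered = np.expm1(log_true)

print(rmse_log1p, rmse_log)
print(recovered)
```

Since an RMSE in log space measures relative error, a value near 0.12 corresponds to typical prediction errors on the order of 12% of the sale price.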

## 3. Submission

In [4]:
# 📁 1. Read the data
base = pd.read_csv('/home/akel/PycharmProjects/Kaggle/HousePrices/data/test.csv')

# 🧹 2. Initial preprocessing: drop the same columns removed from the training set
df_test = base.drop(colnull_train, axis=1, errors="ignore")

id_test = df_test['Id']
# Test features
df_testX = df_test.drop(["Id"], axis=1)

# keep the same columns, in the same order, used for training
X_train_cols = X_train.columns
df_testX = df_testX.reindex(columns=X_train_cols, fill_value=0)

# Predict with the model_XGB2 pipeline and undo the log1p transform
y_final_log = model_XGB2.predict(df_testX)
y_final = np.expm1(y_final_log)

submission = pd.DataFrame({
    'Id': id_test,
    'SalePrice': y_final
})

# 5. Save the CSV file (without the index); uncomment the next line to write it
#submission.to_csv('/home/akel/PycharmProjects/Kaggle/HousePrices/data/submission_XGB2_tunned2.csv', index=False)
print('file saved!')

file saved!

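The column alignment step relies on `DataFrame.reindex`. The toy example below, with made-up column names and values, illustrates what it does: columns are reordered to match the training layout, extra columns are dropped, and columns absent from the test frame are created with `fill_value`.

```python
import pandas as pd

# Hypothetical training column order
train_cols = pd.Index(["LotArea", "Street", "YearBuilt"])

# Hypothetical test frame: different order, one extra column, one missing column
toy_test = pd.DataFrame({"YearBuilt": [2003], "Extra": [1], "LotArea": [8450]})

aligned = toy_test.reindex(columns=train_cols, fill_value=0)
print(aligned)
```

A missing categorical column filled with 0 would hold a value the encoder never saw during training, so it is worth confirming that the test frame really contains every training column before relying on this fallback.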