<a href="https://colab.research.google.com/github/Vitor104/ads-machineLearningQ6/blob/main/Q6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A real estate agency wants to predict property values from characteristics such as location, number of bedrooms, and lot size, among others.
Tasks:
- Use a house-price dataset (for example, California Housing from Scikit-Learn).
- Apply feature engineering techniques to improve model performance.
- Test different regression algorithms, such as Linear Regression, XGBoost, and Artificial Neural Networks (ANNs).
- Evaluate the models with metrics such as RMSE and R².

Question: Which model had the lowest prediction error? How can performance be optimized further?

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neural_network import MLPRegressor

In [2]:
# Import and load the dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)

In [3]:
# Show the dataset
housing.data

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [4]:
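# Combine columns into ratio features to improve model performance
# Note: Population / AveOccup is the number of households in each block group
housing.data['population_per_household'] = housing.data['Population'] / housing.data['AveOccup']
# Note: AveRooms / AveBedrms is rooms per bedroom, despite the column name
housing.data['bedrooms_per_room'] = housing.data['AveRooms'] / housing.data['AveBedrms']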
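
The same ratio features can be sketched on a tiny hand-built frame, with names that state what each ratio actually measures. The function `add_ratio_features` and the column names `households` and `bedrooms_per_room` used here are illustrative assumptions, not part of the notebook; the sample values are the first two rows displayed above.

```python
import pandas as pd


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two derived columns (hypothetical names)."""
    out = df.copy()  # do not mutate the caller's frame
    # Population / average occupancy = number of households in the block group
    out["households"] = out["Population"] / out["AveOccup"]
    # Share of rooms that are bedrooms (the inverse of AveRooms / AveBedrms)
    out["bedrooms_per_room"] = out["AveBedrms"] / out["AveRooms"]
    return out


# First two rows of the dataset as shown in the notebook output
sample = pd.DataFrame({
    "Population": [322.0, 2401.0],
    "AveOccup": [2.555556, 2.109842],
    "AveRooms": [6.984127, 6.238137],
    "AveBedrms": [1.023810, 0.971880],
})
features = add_ratio_features(sample)
print(features[["households", "bedrooms_per_room"]].round(4))
```

Returning a copy keeps the raw data intact, which makes it easier to compare models trained with and without the derived columns.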

In [5]:
housing.data

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,population_per_household,bedrooms_per_room
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,126.0,6.821705
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,1138.0,6.418626
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,177.0,7.721053
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,219.0,5.421277
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,259.0,5.810714
...,...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,330.0,4.451872
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,114.0,4.646667
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,433.0,4.647423
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,349.0,4.547677


In [6]:
# Create the scaler and list the columns to standardize
scaler = StandardScaler()
colunas_para_normalizar = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'population_per_household', 'bedrooms_per_room']

In [7]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.2, random_state=42)

In [8]:
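# Standardize the selected columns: fit the scaler on the training data only, then apply it to the test data
X_train, X_test = X_train.copy(), X_test.copy()  # explicit copies avoid SettingWithCopyWarning
X_train[colunas_para_normalizar] = scaler.fit_transform(X_train[colunas_para_normalizar])
X_test[colunas_para_normalizar] = scaler.transform(X_test[colunas_para_normalizar])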
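
One way to keep the "fit on training data only" rule from being broken by accident is to wrap the scaler and the model in a single scikit-learn `Pipeline`. The sketch below is not the notebook's code: it uses synthetic data and hypothetical column names (`income`, `rooms`, `latitude`) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two columns to scale, one passed through
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "income": rng.normal(4.0, 2.0, 500),
    "rooms": rng.normal(5.0, 1.0, 500),
    "latitude": rng.uniform(32.0, 42.0, 500),
})
y = 0.5 * X["income"] + 0.1 * X["rooms"] + rng.normal(0.0, 0.1, 500)

cols_to_scale = ["income", "rooms"]
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), cols_to_scale)],
    remainder="passthrough",  # leave the remaining columns untouched
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model.fit(X_train, y_train)            # scaler statistics come from X_train only
test_r2 = model.score(X_test, y_test)  # R² on the held-out split
print(f"R² on the test split: {test_r2:.3f}")
```

Because the scaler lives inside the pipeline, the same object can later be passed to cross-validation utilities, which refit the scaler on each training fold.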

In [9]:
housing.data

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,population_per_household,bedrooms_per_room
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,126.0,6.821705
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,1138.0,6.418626
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,177.0,7.721053
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,219.0,5.421277
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,259.0,5.810714
...,...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,330.0,4.451872
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,114.0,4.646667
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,433.0,4.647423
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,349.0,4.547677


Here I used Linear Regression and then evaluated the model with RMSE and R².
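
For reference, both metrics can be written out by hand. The helper names `rmse` and `r_squared` and the four toy values below are assumptions made for illustration; the notebook itself relies on `mean_squared_error` and `r2_score` from scikit-learn, which the sketch uses only as a cross-check.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


# Toy values (assumption), small enough to check on paper
y_true = [3.0, 2.0, 4.0, 5.0]
y_hat = [2.5, 2.0, 4.5, 4.0]

manual_rmse = rmse(y_true, y_hat)
manual_r2 = r_squared(y_true, y_hat)
print("RMSE:", manual_rmse)
print("R²:", manual_r2)

# Cross-check against scikit-learn
sk_rmse = float(np.sqrt(mean_squared_error(y_true, y_hat)))
sk_r2 = float(r2_score(y_true, y_hat))
```

RMSE is expressed in the units of the target, so lower is better, while R² is unitless and higher is better.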

In [10]:
# Fit Linear Regression and predict on the test set
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [16]:
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Mean Squared Error (Linear Regression):", mse)
print("Root Mean Squared Error (Linear Regression):", rmse)

Mean Squared Error (Linear Regression): 0.4882299904699801
Root Mean Squared Error (Linear Regression): 0.6987345636720572


In [12]:
r2 = r2_score(y_test, y_pred)
print("R² (Linear Regression):", r2)

R² (Linear Regression): 0.627421668672186


Here I used XGBoost and then evaluated the model with RMSE and R².

In [13]:
# Fit XGBoost and predict on the test set
import xgboost as xgb
xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
xgb_regressor.fit(X_train, y_train)
y_pred_xgb = xgb_regressor.predict(X_test)

In [17]:
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mse_xgb)
print("Mean Squared Error (XGBoost):", mse_xgb)
print("Root Mean Squared Error (XGBoost):", rmse_xgb)

Mean Squared Error (XGBoost): 0.22822947764645476
Root Mean Squared Error (XGBoost): 0.4777336890428126


In [15]:
r2 = r2_score(y_test, y_pred_xgb)
print("R² (XGBoost):", r2)

R² (XGBoost): 0.8258333990104133


Here I used an Artificial Neural Network and then evaluated the model with RMSE and R².

In [18]:
# Fit an Artificial Neural Network (multilayer perceptron) and predict on the test set
mlp_regressor = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
mlp_regressor.fit(X_train, y_train)
y_pred_mlp = mlp_regressor.predict(X_test)

In [20]:
# Evaluate with RMSE and R²
mse_mlp = mean_squared_error(y_test, y_pred_mlp)
rmse_mlp = np.sqrt(mse_mlp)
print("Mean Squared Error (MLP):", mse_mlp)
print("Root Mean Squared Error (MLP):", rmse_mlp)

Mean Squared Error (MLP): 0.4968130824967447
Root Mean Squared Error (MLP): 0.7048496878744749


In [21]:
r2 = r2_score(y_test, y_pred_mlp)
print("R² (ANNs):", r2)

R² (ANNs): 0.6208717348963304


# Which model had the lowest prediction error?

A: Judged by RMSE, XGBoost had the lowest prediction error of the three models, with an RMSE of **0.4777**. Linear Regression and the neural network performed worse and very close to each other, with RMSE of **0.6987** and **0.7048**, respectively. The R² scores give the same ranking: 0.8258 for XGBoost against 0.6274 and 0.6209.
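
The comparison can also be laid out as a small table. This is only a convenience sketch: the numbers are copied from the cell outputs above and rounded to four decimals, and the names `results`, `ranked`, and `best_model` are assumptions introduced here.

```python
import pandas as pd

# Metrics copied from the cell outputs above, rounded to four decimals
results = pd.DataFrame(
    {
        "model": ["Linear Regression", "XGBoost", "MLP"],
        "rmse": [0.6987, 0.4777, 0.7048],
        "r2": [0.6274, 0.8258, 0.6209],
    }
).set_index("model")

ranked = results.sort_values("rmse")  # lowest error first
best_model = ranked.index[0]
print(ranked)
print("Lowest RMSE:", best_model)
```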

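# How can performance be optimized further?

A: Performance could be improved further with more data and additional feature engineering. Two more options follow from the setup above: tuning the hyperparameters of XGBoost and the neural network, which were trained without any tuning, and standardizing `Latitude` and `Longitude`, which were left unscaled and may have held back the neural network.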
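
As one possible direction, a cross-validated hyperparameter search could look like the sketch below. It makes several assumptions that are not in the notebook: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so that the example needs no extra package, the data is synthetic, and the grid values are illustrative rather than recommended.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (assumption) so the sketch runs without downloads
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid: 2 x 2 x 2 = 8 candidate configurations
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
    cv=3,
)
search.fit(X_train, y_train)

best_cv_rmse = -search.best_score_            # mean RMSE across the 3 folds
test_rmse = -search.score(X_test, y_test)     # RMSE of the refit best model
print("Best parameters:", search.best_params_)
print(f"Cross-validated RMSE: {best_cv_rmse:.3f}")
print(f"Test RMSE: {test_rmse:.3f}")
```

Selecting hyperparameters by cross-validation on the training split keeps the test set untouched until the final evaluation, so the reported test error stays an honest estimate.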