# Practical Exercise on Cross-Validation

### Import the Dataset

### California Housing Dataset Attributes

- **MedInc**        median income in the block group
- **HouseAge**      median house age in the block group
- **AveRooms**      average number of rooms per household
- **AveBedrms**     average number of bedrooms per household
- **Population**    block group population
- **AveOccup**      average number of household members
- **Latitude**      block group latitude
- **Longitude**     block group longitude
- **MedHouseVal**   median house value (target), in units of $100,000

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Add the target variable (median house value)
df[data.target_names[0]] = data.target

### Dataset Information

In [2]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

### View the Dataset

In [3]:
df.head()

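|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |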


### View Correlations Between Variables

In [3]:
correlacoes = df.corr(method='spearman')
correlacoes

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| MedInc | 1.0 | -0.147308 | 0.643671 | -0.252426 | 0.006268 | -0.044171 | -0.088029 | -0.009928 | 0.676778 |
| HouseAge | -0.147308 | 1.0 | -0.231409 | -0.120981 | -0.283879 | -0.024833 | 0.03244 | -0.150752 | 0.074855 |
| AveRooms | 0.643671 | -0.231409 | 1.0 | 0.082046 | -0.105385 | 0.018807 | 0.127134 | -0.044783 | 0.263367 |
| AveBedrms | -0.252426 | -0.120981 | 0.082046 | 1.0 | 0.027027 | -0.132315 | 0.047197 | 0.010884 | -0.125187 |
| Population | 0.006268 | -0.283879 | -0.105385 | 0.027027 | 1.0 | 0.242337 | -0.123626 | 0.123527 | 0.003839 |
| AveOccup | -0.044171 | -0.024833 | 0.018807 | -0.132315 | 0.242337 | 1.0 | -0.150954 | 0.181468 | -0.256594 |
| Latitude | -0.088029 | 0.03244 | 0.127134 | 0.047197 | -0.123626 | -0.150954 | 1.0 | -0.879203 | -0.165739 |
| Longitude | -0.009928 | -0.150752 | -0.044783 | 0.010884 | 0.123527 | 0.181468 | -0.879203 | 1.0 | -0.069667 |
| MedHouseVal | 0.676778 | 0.074855 | 0.263367 | -0.125187 | 0.003839 | -0.256594 | -0.165739 | -0.069667 | 1.0 |
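
The matrix above uses `method='spearman'`, which measures how monotonic a relationship is rather than how linear it is. In the table, MedInc has the strongest rank correlation with MedHouseVal (0.676778). As a rough illustration of the idea, Spearman's coefficient can be computed as the Pearson correlation of the ranks. The helper names below (`average_ranks`, `pearson`, `spearman`) are hypothetical and written only for this sketch; they are not the pandas implementation.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# y = x^2 is not linear in x, but it is perfectly monotonic for positive x.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only the ordering matters, any strictly increasing transformation of a column leaves its Spearman correlations unchanged.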


### Separate the Features from the Target and View Their Dimensions

In [4]:
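# Keep only the first 1000 rows; every column except the last is a feature
X = df.iloc[:1000, :-1].values
# The last column (MedHouseVal) is the target
Y = df.iloc[:1000, -1].values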

X.shape, Y.shape

((1000, 8), (1000,))

### Train a Linear Regression Model with Cross-Validation

In [5]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
import numpy as np

linear_regression = LinearRegression()
scores = cross_val_score(linear_regression, X, Y, cv=5)
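
When `cv` is an integer and the estimator is a regressor, `cross_val_score` splits the rows with an unshuffled K-fold scheme: each fold is a contiguous block of rows, used once as the test set while the remaining rows form the training set. The sketch below mimics that splitting in plain Python to make the row ranges visible. `kfold_indices` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + 1 if fold < extra else base
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

# With 1000 rows and 5 folds, each test fold is a block of 200 consecutive rows.
for train, test in kfold_indices(1000, 5):
    print(f"test rows {test[0]}-{test[-1]}, train size {len(train)}")
```

Since the folds follow the row order, any ordering in the dataset carries over into the folds. Passing `cv=KFold(n_splits=5, shuffle=True, random_state=0)` instead of `cv=5` is one way to randomize the assignment; whether that changes the scores here would need to be checked by running it.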

##### Metrics Obtained

In [6]:
print("R^2 values for each fold:", scores)
print("Mean of the scores:", np.mean(scores))

R^2 values for each fold: [0.65085152 0.14866723 0.27853963 0.32934572 0.29032607]
Mean of the scores: 0.3395460339939757
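
With no `scoring` argument, `cross_val_score` calls the estimator's own `score` method, which for `LinearRegression` is the coefficient of determination R^2. A minimal sketch of that formula in plain Python follows; the function name `r2` is an assumption made for this example.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean
print(r2([1, 2, 3], [3, 2, 1]))  # worse than predicting the mean
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean of the test fold, and negative values mean worse than that baseline. Read this way, the fold scores above range from a moderate fit (0.65) to a weak one (0.15).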


### Train an SVR Model with Cross-Validation

In [7]:
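from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Standardize the features inside a pipeline so that the scaler is
# fitted on the training part of each fold only
modelSVR = make_pipeline(StandardScaler(), SVR(kernel='linear'))

# scikit-learn reports the negated MSE so that higher scores are always better
scores = cross_val_score(modelSVR, X, Y, cv=5, scoring='neg_mean_squared_error')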
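
The SVR cell refers to normalizing the data. As a rough sketch of what per-feature standardization involves, the example below learns the mean and standard deviation from training values only and then applies them to unseen values, which is the behavior a scaler has when it is placed inside a cross-validated pipeline. It assumes the population standard deviation (the convention `StandardScaler` uses); `fit_scaler` and `transform` are hypothetical names and the numbers are made up for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training values only."""
    return fmean(train_column), pstdev(train_column)

def transform(column, mean, std):
    """Apply previously learned statistics to any column of values."""
    return [(v - mean) / std for v in column]

train = [2.0, 4.0, 6.0, 8.0]   # illustrative training values
test = [10.0]                  # illustrative held-out value

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
# The test value is scaled with the training statistics, never its own.
test_scaled = transform(test, mean, std)

print(train_scaled)
print(test_scaled)
```

Scaling the whole dataset before splitting would let information from the test folds influence the training statistics, which is why the fit step is kept separate from the transform step.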

In [None]:
print(scores)

# Negate the scores to obtain positive MSE values, then compute the mean
mse_scores = -scores
print("Mean squared error in each fold:", mse_scores)
print("Mean squared error averaged over folds:", np.mean(mse_scores))

### Train Ridge and Lasso Regression with Cross-Validation

In [None]:
from sklearn.linear_model import LassoCV, RidgeCV

modelR = RidgeCV()
modelR.fit(X, Y)

modelL = LassoCV()
modelL.fit(X, Y)

print("Ridge model:")
print("Coefficients:", modelR.coef_)
print("Intercept:", modelR.intercept_)
print("Alpha:", modelR.alpha_)

print()

print("Lasso model:")
print("Coefficients:", modelL.coef_)
print("Intercept:", modelL.intercept_)
print("Alpha:", modelL.alpha_)
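
`RidgeCV` and `LassoCV` pick the regularization strength `alpha_` by cross-validation and then report the fitted coefficients. To show what that strength does to a coefficient, here is a deliberately simplified sketch for a single centered feature with no intercept. It assumes the objectives `sum((y - w*x)^2) + alpha * w^2` for Ridge and `(1/(2n)) * sum((y - w*x)^2) + alpha * |w|` for Lasso; the function names and the toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clamping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def ridge_1d(x, y, alpha):
    """Closed-form Ridge coefficient for one centered feature, no intercept."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + alpha)

def lasso_1d(x, y, alpha):
    """Closed-form Lasso coefficient for one centered feature, no intercept."""
    n = len(x)
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return soft_threshold(xy / n, alpha) / (xx / n)

x = [-1.0, 0.0, 1.0]   # toy centered feature
y = [-2.0, 0.0, 2.0]   # toy target, exactly y = 2x

for alpha in (0.0, 0.5, 2.0):
    print(alpha, ridge_1d(x, y, alpha), lasso_1d(x, y, alpha))
```

In this simplified setting Ridge shrinks the coefficient smoothly toward zero as alpha grows, while Lasso sets it exactly to zero once alpha is large enough, which is why Lasso can act as a feature selector.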

### Make Predictions

In [None]:
previsaoRidge = modelR.predict(X)
previsaoLasso = modelL.predict(X)

### Performance Metrics

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("RIDGE REGRESSION")
# Mean absolute error
print("Mean absolute error:", mean_absolute_error(Y, previsaoRidge))
# Mean squared error
print("Mean squared error:", mean_squared_error(Y, previsaoRidge))
# Root mean squared error
print("Root mean squared error:", np.sqrt(mean_squared_error(Y, previsaoRidge)))

print()

print("LASSO REGRESSION")
# Mean absolute error
print("Mean absolute error:", mean_absolute_error(Y, previsaoLasso))
# Mean squared error
print("Mean squared error:", mean_squared_error(Y, previsaoLasso))
# Root mean squared error
print("Root mean squared error:", np.sqrt(mean_squared_error(Y, previsaoLasso)))
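
The three metrics printed above can be written out directly, which makes their relationship easier to see. The sketch below is a plain-Python illustration with made-up values; the function names `mae`, `mse`, and `rmse` are assumptions for this example, not the scikit-learn functions.

```python
from math import sqrt

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: average squared error, penalizing large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of MSE, back in target units."""
    return sqrt(mse(y_true, y_pred))

y_true = [1.0, 2.0, 3.0]   # illustrative targets
y_pred = [1.0, 2.0, 5.0]   # illustrative predictions with one large miss

print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Note that the cells above call `predict` on the same `X` used for fitting, so the printed values are training errors and are likely to be optimistic compared with errors on held-out data.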