# Project 3 Regression

Predicting house prices is a critical task in the real estate industry, as it directly impacts buyers, sellers, and investors. The goal of this project is to develop a predictive model that accurately estimates the sale price of residential properties in Ames, Iowa. This challenge is particularly significant given the wide range of factors that can influence property prices, from physical characteristics to neighborhood features.

The dataset used in this project comes from the Kaggle competition titled “House Prices: Advanced Regression Techniques.” It contains 79 explanatory variables that describe almost every aspect of residential homes in Ames, Iowa. These features include lot size, building dimensions, year built, quality ratings, and more. The competition challenges participants to predict the final sale price of each house based on these variables. The dataset was compiled by Dean De Cock for educational purposes and serves as a modern alternative to the Boston Housing dataset.

The project’s objective is to predict the sale price of each house given its features. The competition evaluates predictions using the Root Mean Squared Error (RMSE) between the logarithm of the predicted sale price and the logarithm of the actual sale price. Working in log space measures errors in relative terms, so a 10% miss on an inexpensive house counts as much as a 10% miss on an expensive one, preventing bias toward high-priced properties.
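
The log-based metric can be sketched in a few lines. This is a minimal illustration rather than the competition's scoring code; the function name `rmse_log` and the sample prices are made up for the example.

```python
import math

def rmse_log(actual, predicted):
    """Root mean squared error between the logs of actual and predicted prices."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: both predictions are 10% too high
cheap = rmse_log([100_000], [110_000])
expensive = rmse_log([500_000], [550_000])

# The two scores agree up to floating-point rounding: only the relative error matters
print(cheap, expensive)
```

On raw prices the second error would be five times larger than the first, which is why a plain RMSE is dominated by expensive houses.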

Regression analysis is at the core of this project, as it is well-suited for predicting continuous numerical values such as house prices. In regression, a mathematical model is constructed to estimate the relationship between one or more independent variables (features) and a dependent variable (sale price). One commonly used regression technique is linear regression, which models the relationship between the dependent and independent variables as a linear combination of the input features. However, more advanced techniques like random forest regression and gradient boosting regression can also be applied to capture complex patterns and interactions within the data. These techniques build multiple decision trees and combine their outputs to achieve better accuracy and robustness.

By leveraging feature engineering and advanced regression techniques, this project aims to create a reliable model that accurately predicts house prices, offering valuable insights for real estate professionals and stakeholders.



In [None]:
# Import necessary libraries for data manipulation and analysis
import pandas as pd  # Library for data manipulation and analysis
import numpy as np  # Library for numerical computations

# Import libraries for data visualization
import matplotlib.pyplot as plt  # Library for creating static and interactive plots
import seaborn as sns  # Library for creating informative and attractive statistical graphics

# Enable inline plotting for Jupyter notebooks
%matplotlib inline

# Import libraries for machine learning
from sklearn.linear_model import LinearRegression  # Library for linear regression models
from sklearn.model_selection import train_test_split  # Library for splitting data into training and testing sets
from sklearn.metrics import mean_squared_error  # Library for evaluating model performance using mean squared error

# Import feature selection utilities and the decision tree regressor
# (classification-only imports are omitted because this notebook only fits regression models)
from sklearn.feature_selection import RFE, SequentialFeatureSelector, VarianceThreshold, SelectKBest, f_regression
from sklearn.tree import DecisionTreeRegressor

In [None]:
# Load training and testing data from CSV files
train_df = pd.read_csv('train.csv')  # Load training data into a Pandas DataFrame
test_df = pd.read_csv('test.csv')  # Load testing data into a Pandas DataFrame

# Display the first few rows of the training data to verify loading and formatting
train_df.head()  # Print the first few rows of the training DataFrame

In [None]:
# Display information about the training data, including data types and counts of non-null values
train_df.info()  # Print a concise summary of the training DataFrame

In [None]:
# Display information about the testing data, including data types and counts of non-null values
test_df.info()  # Print a concise summary of the testing DataFrame

In [None]:
# Generate descriptive statistics for the training data, including mean, std, min, 25%, 50%, 75%, and max values
train_df.describe()  # Print summary statistics for the training DataFrame

In [None]:
# Generate descriptive statistics for the testing data, including mean, std, min, 25%, 50%, 75%, and max values
test_df.describe()  # Print summary statistics for the testing DataFrame

In [None]:
# Print the count of missing values in each column of the training data
print(train_df.isnull().sum())  # Display the total number of null values in each column of the training DataFrame

In [None]:
# Print the count of missing values in each column of the testing data
print(test_df.isnull().sum())  # Display the total number of null values in each column of the testing DataFrame

In [None]:
# Remove columns that have more than 45% missing values
# The columns are chosen from the training data only, and the same columns are then dropped
# from the testing data so that both DataFrames keep an identical set of features
missing_ratio = train_df.isnull().mean()  # Fraction of missing values in each column
cols_to_drop = missing_ratio[missing_ratio > 0.45].index

train_df = train_df.drop(columns=cols_to_drop)
# Drop columns with high missing value rates from the training DataFrame

test_df = test_df.drop(columns=cols_to_drop)
# Drop the same columns from the testing DataFrame

# Verify the updated count of missing values in each column of the training and testing data
print(train_df.isnull().sum())  # Display the updated total number of null values in each column of the training DataFrame
print(test_df.isnull().sum())  # Display the updated total number of null values in each column of the testing DataFrame

In [None]:
# Replace missing values in the 'LotFrontage' column with the median of the training data
# The training median is reused for the testing data so that both sets are imputed with the same value
lotfrontage_median = train_df['LotFrontage'].median()

train_df['LotFrontage'] = train_df['LotFrontage'].fillna(lotfrontage_median)
# Fill missing values in the 'LotFrontage' column of the training DataFrame with the training median

test_df['LotFrontage'] = test_df['LotFrontage'].fillna(lotfrontage_median)
# Fill missing values in the 'LotFrontage' column of the testing DataFrame with the training median

# Verify the updated count of missing values in each column of the training and testing data
print(train_df.isnull().sum())  # Display the updated total number of null values in each column of the training DataFrame
print(test_df.isnull().sum())  # Display the updated total number of null values in each column of the testing DataFrame

In [None]:
# Replace the remaining missing values in every column (categorical and numerical) with the most frequent value (mode)
train_df = train_df.fillna(train_df.mode().iloc[0])
# Fill missing values in the training DataFrame with the most frequent value (mode) for each column

test_df = test_df.fillna(test_df.mode().iloc[0])
# Fill missing values in the testing DataFrame with the most frequent value (mode) for each column

# Verify the updated count of missing values in each column of the training and testing data
print(train_df.isnull().sum())  # Display the updated total number of null values in each column of the training DataFrame
print(test_df.isnull().sum())  # Display the updated total number of null values in each column of the testing DataFrame

In [None]:
train_df.head()

In [None]:
# Remove non-informative columns from the training and testing data
train_df = train_df.drop(['Id', 'Street', 'LotShape', 'LandContour', 'Utilities', 
                          'LotConfig', 'LandSlope', 'RoofMatl', 'Heating', 'CentralAir',
                          'Electrical', 'Functional', 'GarageQual', 'GarageCond', 'BsmtCond',], axis=1)
# Drop columns that do not provide useful information for modeling from the training DataFrame

test_df = test_df.drop(['Id', 'Street', 'LotShape', 'LandContour', 'Utilities', 
                        'LotConfig', 'LandSlope', 'RoofMatl', 'Heating', 'CentralAir',
                        'Electrical', 'Functional', 'GarageQual', 'GarageCond', 'BsmtCond',], axis=1)
# Drop columns that do not provide useful information for modeling from the testing DataFrame

# Display the first few rows of the updated training data to verify changes
train_df.head()  # Print the first few rows of the updated training DataFrame

In [None]:
# Remove columns 'Condition1' and 'Condition2' from the training and testing data
# These columns have dominant values for normal conditions (86% and 99% respectively)
# and are suggested to have little to no predictive power
train_df = train_df.drop(['Condition1', 'Condition2'], axis=1)
# Drop columns 'Condition1' and 'Condition2' from the training DataFrame

test_df = test_df.drop(['Condition1', 'Condition2'], axis=1)
# Drop columns 'Condition1' and 'Condition2' from the testing DataFrame

# Display the first few rows of the updated training data to verify changes
train_df.head()  # Print the first few rows of the updated training DataFrame

In [None]:
# One-hot encode categorical columns to ensure a 1 or 0 output
# This will convert categorical variables into numerical variables that can be used in modeling
columns_to_encode = ['BldgType', 'HouseStyle', 'Exterior1st', 'Exterior2nd', 'RoofStyle', 'ExterQual',
                     'ExterCond', 'Foundation',  'HeatingQC', 'KitchenQual',
                     'GarageType', 'GarageFinish', 'SaleType', 'SaleCondition', 'PavedDrive', 'MSZoning', 
                     'BsmtQual',  'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',]
# Define the list of columns to be one-hot encoded

train_df = pd.get_dummies(train_df, columns=columns_to_encode, drop_first=True, dtype=int)
# One-hot encode the specified columns in the training DataFrame, dropping the first category to avoid multicollinearity
# and ensuring a 1 or 0 output by specifying dtype=int

test_df = pd.get_dummies(test_df, columns=columns_to_encode, drop_first=True, dtype=int)
# One-hot encode the specified columns in the testing DataFrame, dropping the first category to avoid multicollinearity
# and ensuring a 1 or 0 output by specifying dtype=int

# Display the first few rows of the updated training data to verify changes
train_df.head()  # Print the first few rows of the updated training DataFrame
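
Encoding the training and testing DataFrames separately can produce different dummy columns when a category appears in only one of them, and `drop_first=True` may drop a different baseline in each. A minimal sketch of one way to keep the columns identical is to fix the category list from the training data before encoding. The toy frames and the names `toy_train`, `toy_test`, and `train_categories` are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Toy frames (hypothetical): the test set lacks 'Stone' and contains an unseen 'Wood'
toy_train = pd.DataFrame({'Foundation': ['Brick', 'Slab', 'Stone'],
                          'LotArea': [8000, 9500, 7200]})
toy_test = pd.DataFrame({'Foundation': ['Slab', 'Wood'],
                         'LotArea': [8800, 10100]})

# Fix the category list from the training data and apply it to both frames;
# values not seen in training ('Wood') become missing
train_categories = toy_train['Foundation'].astype('category').cat.categories
for frame in (toy_train, toy_test):
    frame['Foundation'] = pd.Categorical(frame['Foundation'], categories=train_categories)

# Both frames now yield the same dummy columns and drop the same baseline ('Brick')
train_enc = pd.get_dummies(toy_train, columns=['Foundation'], drop_first=True, dtype=int)
test_enc = pd.get_dummies(toy_test, columns=['Foundation'], drop_first=True, dtype=int)

print(list(train_enc.columns))
print(test_enc)
```

With this approach a category that never appears in training is encoded as all zeros, which makes it indistinguishable from the dropped baseline; whether that is acceptable depends on how often such categories occur.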

In [None]:
# Step 1: Calculate the overall mean house price
# This will be used as a baseline to compare the mean house prices of different neighborhoods
overall_mean = train_df['SalePrice'].mean()
# Calculate the mean of the 'SalePrice' column in the training DataFrame

# Step 2: Calculate the mean house price for each neighborhood
# This will help us understand how the mean house price varies across different neighborhoods
neighborhood_means = train_df.groupby('Neighborhood')['SalePrice'].mean()
# Group the training DataFrame by the 'Neighborhood' column and calculate the mean of the 'SalePrice' column for each group

# Step 3: Map the mean values to the 'Neighborhood' column
# This will create a new column 'Neighborhood_Encoded' that contains the mean house price for each neighborhood
train_df['Neighborhood_Encoded'] = train_df['Neighborhood'].map(neighborhood_means)
# Create a new column 'Neighborhood_Encoded' in the training DataFrame by mapping the 'Neighborhood' column to the mean house prices calculated earlier

test_df['Neighborhood_Encoded'] = test_df['Neighborhood'].map(neighborhood_means)
# Create a new column 'Neighborhood_Encoded' in the testing DataFrame by mapping the 'Neighborhood' column to the mean house prices calculated earlier

# Optional: Track the positive and negative influences
# This will help us understand how each neighborhood affects the house price compared to the overall mean
neighborhood_influence = neighborhood_means - overall_mean
# Calculate the difference between the mean house price of each neighborhood and the overall mean

print("Neighborhood Influence on Price:")
print(neighborhood_influence.sort_values(ascending=False))
# Print the neighborhood influences in descending order (i.e., from most positive to most negative)

# Drop the original 'Neighborhood' column after encoding
# This is because we no longer need the original 'Neighborhood' column after creating the 'Neighborhood_Encoded' column
train_df = train_df.drop(['Neighborhood'], axis=1)
# Drop the 'Neighborhood' column from the training DataFrame

test_df = test_df.drop(['Neighborhood'], axis=1)
# Drop the 'Neighborhood' column from the testing DataFrame

# Display the top 5 rows of the updated DataFrame
print(train_df[['Neighborhood_Encoded', 'SalePrice']].head())
# Print the top 5 rows of the updated training DataFrame, showing the 'Neighborhood_Encoded' and 'SalePrice' columns
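
Mapping neighborhood means onto the testing data leaves a missing value for any neighborhood that does not occur in the training data. A minimal sketch of one common fallback is to fill those gaps with the overall training mean. The toy frames and the `toy_`-prefixed names below are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'B', 'B'],
    'SalePrice': [100_000, 120_000, 200_000, 240_000],
})
toy_test = pd.DataFrame({'Neighborhood': ['A', 'C']})  # 'C' never appears in training

toy_overall_mean = toy_train['SalePrice'].mean()
toy_means = toy_train.groupby('Neighborhood')['SalePrice'].mean()

# Map the training means onto the test rows; unseen neighborhoods fall back to the overall mean
toy_test['Neighborhood_Encoded'] = (
    toy_test['Neighborhood'].map(toy_means).fillna(toy_overall_mean)
)
print(toy_test)
```

Because the encoded value is derived from `SalePrice`, computing the means on rows that are later used for validation lets target information leak into the features. Computing them on the training split only is the more cautious choice.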


In [None]:
# Calculate the correlation matrix of the training DataFrame
# This will help us understand the relationships between different columns in the DataFrame
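train_df.corr(numeric_only=True)
# Calculate the correlation coefficient for each pair of numeric columns in the training DataFrame
# (numeric_only=True skips any remaining non-numeric columns, which would otherwise raise an error in pandas 2.x)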

In [None]:
# Import the variance_inflation_factor function from statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Select numerical features for VIF calculation
# This will exclude the 'SalePrice' column and only consider numerical columns
numerical_features = train_df.select_dtypes(include=[np.number]).drop(['SalePrice'], axis=1)
# Select numerical columns from the training DataFrame and drop the 'SalePrice' column

# Calculate VIF for each feature
# This will help us identify features that are highly correlated with each other
vif_data = pd.DataFrame()
# Create an empty DataFrame to store the VIF results

vif_data['Feature'] = numerical_features.columns
# Add a column to the DataFrame with the feature names

vif_data['VIF'] = [variance_inflation_factor(numerical_features.values, i) for i in range(numerical_features.shape[1])]
# Calculate the VIF for each feature and add it to the DataFrame

# Print the VIF results in descending order (i.e., from highest to lowest)
print(vif_data.sort_values(by='VIF', ascending=False))
# Sort the DataFrame by the VIF column in descending order and print the results

# Save the VIF results to a CSV file
vif_data.to_csv('vif_output.csv', index=False)
# Write the DataFrame to a CSV file named 'vif_output.csv' without including the index column
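
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes it by hand with NumPy on synthetic data. The function `vif_by_hand` and the toy columns are hypothetical, and the regression here includes an intercept.

```python
import numpy as np

def vif_by_hand(X, i):
    """VIF of column i: regress it on the remaining columns plus an intercept."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a
toy_X = np.column_stack([a, b, c])

for name, i in zip('abc', range(3)):
    print(name, round(vif_by_hand(toy_X, i), 2))
```

The independent column `b` gets a VIF close to 1, while `a` and `c` get large values because each is almost fully explained by the other. Note that statsmodels' `variance_inflation_factor` regresses on the columns exactly as they are passed, so its values can differ from this intercept-based version unless a constant column (for example via `sm.add_constant`) is included.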


In [None]:
# Identify features with high VIF values (> 25)
# Note: 25 is a lenient cutoff; thresholds of 5 or 10 are more commonly used to flag multicollinearity
high_vif = vif_data[vif_data['VIF'] > 25]
# Filter the VIF DataFrame to include only rows where the VIF value is greater than 25

# Print the features with high VIF values
print(high_vif)
# Print the filtered DataFrame to display the features with high VIF values



# Split the dataset into training and testing sets
- also define the feature set (X) and the target variable (y)

In [None]:
# Define the feature set (X) and the target variable (y)
X = train_df.drop('SalePrice', axis=1)
y = train_df['SalePrice']

# Split the data into training and testing sets with an 80-20 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Method 1: Variance Threshold (remove features with low variance)

In [None]:
# Create a dictionary to store the selected features from different feature selection methods
selected_features = {}

# Use VarianceThreshold to select features with a variance threshold of 0.1
from sklearn.feature_selection import VarianceThreshold

selector_variance = VarianceThreshold(threshold=0.1)
selector_variance.fit(X_train)

# Get the selected features and store them in the dictionary
selected_features['Variance Threshold'] = X_train.columns[selector_variance.get_support()].tolist()

### Method 2: Correlation (check and remove highly correlated features)

In [None]:
# Calculate the correlation matrix of the training data
correlation_matrix = X_train.corr().abs()

# Get the upper triangle of the correlation matrix
# This is done to avoid duplicate comparisons and to only consider the correlations between different features
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Identify the features with high correlation (above 0.9)
high_corr_features = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.9)]

# Select the features that are not highly correlated with each other
selected_features['Correlation'] = X_train.columns.difference(high_corr_features).tolist()

### Method 3: Statistical Test (SelectKBest with univariate F-test)

In [None]:
# Use SelectKBest to select the top 8 features based on the F-regression score
# f_regression runs a univariate linear regression F-test between each feature and the target variable
selector_kbest = SelectKBest(score_func=f_regression, k=8)

# Fit the SelectKBest selector to the training data
selector_kbest.fit(X_train, y_train)

# Get the selected features and store them in the dictionary
selected_features['Statistical Test (SelectKBest)'] = X_train.columns[selector_kbest.get_support()].tolist()

### Method 4: Forward Selection with Linear Regression

In [None]:
# Use SequentialFeatureSelector to perform forward selection with linear regression
# Starting from no features, it greedily adds the feature that most improves the cross-validated score until 8 are selected
lr_model = LinearRegression()  # Create a linear regression model
forward_selector = SequentialFeatureSelector(lr_model, n_features_to_select=8, direction='forward', scoring='neg_mean_squared_error', cv=8)

# Fit the forward selector to the training data
forward_selector.fit(X_train, y_train)

# Get the selected features and store them in the dictionary
selected_features['Forward Selection (Linear Regression)'] = X_train.columns[forward_selector.get_support()].tolist()

### Method 5: Recursive Feature Elimination (RFE) with Decision Tree Regressor

In [None]:
# Use Recursive Feature Elimination (RFE) with a decision tree regressor to select the top 8 features
dt_model = DecisionTreeRegressor(random_state=42)  # Create a decision tree regressor model (seeded so the selection is reproducible)
rfe_selector = RFE(dt_model, n_features_to_select=8)  # Initialize the RFE selector with the decision tree model and the number of features to select

# Fit the RFE selector to the training data
rfe_selector.fit(X_train, y_train)

# Get the selected features and store them in the dictionary
selected_features['RFE (Decision Tree Regressor)'] = X_train.columns[rfe_selector.get_support()].tolist()

### Display the Features Selected by Each Method

In [None]:
selected_features_df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in selected_features.items()]))
selected_features_df

## KNN

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

X_selected = ['LotFrontage', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd', 
              'BsmtFinSF1', 'BsmtUnfSF', '1stFlrSF', '2ndFlrSF', 
              'GrLivArea', 'GarageArea', 'MoSold', 'Neighborhood_Encoded']

X = train_df[X_selected]  # Extract the actual feature values
y = train_df['SalePrice']  # Target variable

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

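knn = KNeighborsRegressor(n_neighbors=5)  # Predict from the 5 nearest neighbors
knn.fit(X_scaled, y)

# Note: these predictions are made on the same rows the model was fitted on,
# so the metrics below measure training fit, not generalization
y_pred_train = knn.predict(X_scaled)

mae = mean_absolute_error(y, y_pred_train)
rmse = np.sqrt(mean_squared_error(y, y_pred_train))  # RMSE (squared=False is deprecated in recent scikit-learn releases)
r2 = r2_score(y, y_pred_train)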

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


## KNN Plot
At k = 5, the model captures the underlying patterns in the data with a small number of neighbors, resulting in more accurate predictions.
As the number of neighbors (k) increases, each prediction averages over more houses, including more distant ones. The model becomes smoother and starts to underfit, losing local detail rather than fitting noise (overfitting is the risk at very small k). This results in less accurate predictions.
At k = 15, the model has smoothed the data to the point where it makes the least accurate predictions.

In [None]:
# Define a list of k values to test for the KNeighborsRegressor
# These values will be used to determine the number of neighbors to consider when making predictions
k_values = [3, 5, 7, 9, 11, 13, 15, 19]

# Initialize an empty list to store the mean squared error (MSE) values for each k value
mse_k = []

# Iterate over each k value in the list
for i in k_values:
  # Create a new KNeighborsRegressor instance with the current k value
  # This will determine the number of neighbors to consider when making predictions
  knn = KNeighborsRegressor(n_neighbors=i)
  
  # Train the KNeighborsRegressor model using the training data
  # Note: X_train holds all features unscaled, unlike the scaled X_selected subset used in the previous cell
  knn.fit(X_train, y_train)
  
  # Use the trained model to make predictions on the test data
  # These predictions will be used to calculate the MSE
  y_pred = knn.predict(X_test)
  
  # Calculate the mean squared error (MSE) between the predicted values and the actual values
  # This will give us a measure of the model's performance for the current k value
  mse = mean_squared_error(y_test, y_pred)
  
  # Append the MSE value to the list of MSE values
  # This will allow us to plot the MSE values for each k value later
  mse_k.append(mse)

# Create a plot of the MSE values against the k values
# This will allow us to visualize the relationship between the number of neighbors and the model's performance
plt.plot(k_values, mse_k)

# Add a label to the x-axis to indicate that it represents the number of neighbors (k)
plt.xlabel('Number of Neighbors (k)')

# Add a label to the y-axis to indicate that it represents the mean squared error (MSE)
plt.ylabel('Mean Squared Error (MSE)')

# Add a title to the plot to indicate that it shows the MSE vs k for the KNeighborsRegressor
plt.title('KNeighborsRegressor MSE vs k')

# Display the plot
plt.show()
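
KNN relies on distances, so features measured on large scales can dominate the neighbor search when the data are not standardized. A minimal sketch of the same k sweep with scaling inside a pipeline and cross-validation is shown here on synthetic data. The `toy_`-prefixed names and the generated features are hypothetical and are not taken from the housing data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy_X = np.column_stack([
    rng.uniform(0, 1, size=300),        # small-scale feature that drives the target
    rng.uniform(0, 10_000, size=300),   # large-scale feature unrelated to the target
])
toy_y = 5.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=300)

toy_cv_mse = {}
for k in [3, 5, 7, 9, 11, 13, 15, 19]:
    # The scaler is refitted on each training fold, so no test-fold statistics leak in
    pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = cross_val_score(pipeline, toy_X, toy_y, cv=5, scoring='neg_mean_squared_error')
    toy_cv_mse[k] = -scores.mean()

best_k = min(toy_cv_mse, key=toy_cv_mse.get)
print(best_k, toy_cv_mse[best_k])
```

Averaging the error over several folds also makes the choice of k less dependent on one particular train/test split.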

In [None]:
import statsmodels.api as sm
X = train_df[X_selected]
y = train_df['SalePrice']

X = sm.add_constant(X)  # Adds an intercept term to the model

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())

## Decision Tree

In [None]:
X_selected = ['LotArea', 'OverallQual', 'YearRemodAdd', 
              'BsmtFinSF1', 'BsmtUnfSF', '1stFlrSF', 
              'GarageArea', 'Neighborhood_Encoded']
# Ensure X_train and X_test only contain the selected features
X_train_selected = X_train[X_selected]
X_test_selected = X_test[X_selected]

# Initialize the Decision Tree Regressor model
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(max_depth=7, random_state=42)

# Train the model with the selected features
model.fit(X_train_selected, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_selected)

# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Calculate MAE, RMSE, and R² score
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE (squared=False is deprecated in recent scikit-learn releases)
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


## Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
model = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, min_samples_split=2, random_state=42)
cv_scores = cross_val_score(model, X_train_selected, y_train, cv=8, scoring='neg_mean_absolute_error')
print(f"Cross-validated MAE: {-cv_scores.mean():.4f}")


In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
    'max_depth': [5, 6, 7, 8],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5],
}

grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, scoring='neg_mean_absolute_error')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation MAE: {-grid_search.best_score_:.4f}")
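
Rather than copying the best parameters by hand, the fitted search object can supply the tuned model directly. The sketch below uses synthetic data from `make_regression`. The `toy_`-prefixed names and the small parameter grid are hypothetical and only illustrate the pattern.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

toy_X, toy_y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

toy_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),   # seeded so repeated runs pick the same tree
    {'max_depth': [3, 5, 7], 'min_samples_leaf': [1, 5]},
    cv=5,
    scoring='neg_mean_absolute_error',
)
toy_grid.fit(toy_X_train, toy_y_train)

# With the default refit=True, best_estimator_ is already refitted on the full training split
toy_best = toy_grid.best_estimator_
toy_test_mae = mean_absolute_error(toy_y_test, toy_best.predict(toy_X_test))
print(toy_grid.best_params_, round(toy_test_mae, 2))
```

Using `best_estimator_` keeps the evaluated model in sync with the search, so the hyperparameters and the feature set cannot drift apart between cells.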


In [None]:
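# Initialize the model with the chosen parameters
# (hard-coded here; compare them with grid_search.best_params_ printed above)
best_dt_model = DecisionTreeRegressor(max_depth=8, min_samples_leaf=5, min_samples_split=2, random_state=42)
# Fit on the selected-feature training data (X_selected is only a list of column names)
best_dt_model.fit(X_train_selected, y_train)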

# Predict on the training and test data
y_pred_train = best_dt_model.predict(X_train_selected)
y_pred_test = best_dt_model.predict(X_test_selected)

# Calculate MAE for both training and test data
train_mae = mean_absolute_error(y_train, y_pred_train)
test_mae = mean_absolute_error(y_test, y_pred_test)

print(f"Training MAE: {train_mae:.4f}")
print(f"Test MAE: {test_mae:.4f}")


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Initialize and fit the regressor
dt_reg = DecisionTreeRegressor(max_depth=7, min_samples_leaf=7, min_samples_split=2, random_state=42)
dt_reg.fit(X_train, y_train)

# Make predictions on the test data
y_pred_reg = dt_reg.predict(X_test)

# Calculate MAE, RMSE, and R²
mae = mean_absolute_error(y_test, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_reg))  # RMSE (squared=False is deprecated in recent scikit-learn releases)
r2 = r2_score(y_test, y_pred_reg)

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Initialize lists to store results
mae_results = []
rmse_results = []
r2_results = []
max_depth_range = range(1, 21)  # Adjust the range based on your preference

# Loop through different max_depth values
for max_depth in max_depth_range:
    # Initialize the Decision Tree Regressor with current max_depth
    dt_regressor = DecisionTreeRegressor(max_depth=max_depth, random_state=42)

    # Fit the model to the training data
    dt_regressor.fit(X_train, y_train)

    # Predict on the test data
    y_pred = dt_regressor.predict(X_test)

    # Calculate MAE, RMSE, and R²
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # RMSE (squared=False is deprecated in recent scikit-learn releases)
    r2 = r2_score(y_test, y_pred)

    # Append results to lists
    mae_results.append(mae)
    rmse_results.append(rmse)
    r2_results.append(r2)

    # Print the results for the current max_depth
    print(f"max_depth: {max_depth}, MAE: {mae:.4f}, RMSE: {rmse:.4f}, R²: {r2:.4f}")

# Find the best max_depth based on lowest MAE
best_max_depth = max_depth_range[mae_results.index(min(mae_results))]

print(f"Best max_depth: {best_max_depth}")


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Initialize the Decision Tree Regressor
dt = DecisionTreeRegressor(max_depth=8)

# Perform k-fold cross-validation (e.g., 5 folds)
cv_scores = cross_val_score(dt, X_scaled, y, cv=5, scoring='neg_mean_absolute_error')

# scikit-learn maximizes scores, so the MAE is returned negated; flip the sign to recover it
cv_scores = -cv_scores

# Print the mean and standard deviation of the MAE across the folds
print(f"Cross-validated MAE: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
