## House Prices Prediction with Advanced Regression Techniques

### Project Overview

This project predicts house prices using the **House Prices: Advanced Regression Techniques** dataset from Kaggle. We use three regularized linear models, **Ridge Regression**, **Lasso Regression**, and **ElasticNet**, to improve generalization and reduce overfitting on high-dimensional data.

By the end of this project, we will:
- Preprocess the data, handle missing values, and encode categorical variables.
- Implement Ridge, Lasso, and ElasticNet models to predict house prices.
- Evaluate the performance of the models using metrics like **Mean Absolute Error (MAE)** and **R-squared**.
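
As a quick reference for the two metrics named above, here is a minimal sketch that computes MAE and R-squared by hand and compares them with scikit-learn's helpers. The toy arrays `y_true` and `y_pred` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy values chosen for illustration only; they are not from the dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# MAE: average absolute difference between predictions and targets.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The hand-computed values should agree with scikit-learn's helpers.
print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

MAE is expressed in the same units as the target (here, sale price), while R-squared is unitless, which is why the two are reported together.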

Let's begin by loading the dataset and exploring its structure.


In [11]:
# Import necessary libraries
import pandas as pd

# Load the training dataset
train_df = pd.read_csv('train.csv')

# Display the first few rows of the dataset
train_df.head()


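| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |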


## Data Preprocessing

In this section, we will preprocess the data by:
1. Handling missing values.
2. Encoding categorical variables.
3. Scaling the numerical features.


In [3]:
# Import necessary libraries for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate target (SalePrice) and features
X = train_df.drop(columns=['SalePrice', 'Id'])
y = train_df['SalePrice']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns
numerical_cols = X.select_dtypes(exclude=['object']).columns

# Preprocessing for numerical data: Impute missing values and scale features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data: Impute missing values and encode categories
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine numerical and categorical transformers into a ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols)
])

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply the transformations on the dataset
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Display the shape of the processed training and testing sets
X_train.shape, X_test.shape


((1168, 285), (292, 285))

## Ridge, Lasso, and ElasticNet Regression Models

We will now implement three regularization techniques:
1. **Ridge Regression (L2 Regularization)**: This penalizes large coefficients, reducing overfitting.
2. **Lasso Regression (L1 Regularization)**: This drives some coefficients to zero, effectively selecting features.
3. **ElasticNet Regression**: Combines L1 and L2 regularization for a balanced approach.
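
To illustrate the difference between the three penalties, the sketch below fits each model on synthetic data in which only 5 of 50 features are informative, then counts how many coefficients are exactly zero. The data and the `alpha` values are assumptions chosen for illustration, not settings from this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (an assumption for illustration): 50 features, only 5 informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=5, noise=10.0, random_state=0
)

demo_models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=2.0),
    "ElasticNet": ElasticNet(alpha=2.0, l1_ratio=0.5),
}

# Fit each model and count coefficients that were shrunk all the way to zero.
zero_counts = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {zero_counts[name]} of {model.coef_.size} coefficients are exactly zero")
```

Ridge shrinks coefficients toward zero but rarely makes them exactly zero, whereas the L1 term in Lasso and ElasticNet can remove features entirely.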


In [5]:
# Import the necessary libraries for models and evaluation
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.1)
elasticnet_model = ElasticNet(alpha=0.1, l1_ratio=0.5)

# Train the models
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
elasticnet_model.fit(X_train, y_train)

# Predictions on the test set
ridge_pred = ridge_model.predict(X_test)
lasso_pred = lasso_model.predict(X_test)
elasticnet_pred = elasticnet_model.predict(X_test)

# Evaluate the models
ridge_mae = mean_absolute_error(y_test, ridge_pred)
lasso_mae = mean_absolute_error(y_test, lasso_pred)
elasticnet_mae = mean_absolute_error(y_test, elasticnet_pred)

ridge_r2 = r2_score(y_test, ridge_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
elasticnet_r2 = r2_score(y_test, elasticnet_pred)

# Display the results
print(f"Ridge - MAE: {ridge_mae}, R2: {ridge_r2}")
print(f"Lasso - MAE: {lasso_mae}, R2: {lasso_r2}")
print(f"ElasticNet - MAE: {elasticnet_mae}, R2: {elasticnet_r2}")


Ridge - MAE: 19006.27520683145, R2: 0.8838815282935925
Lasso - MAE: 18015.553395821902, R2: 0.8950852908903248
ElasticNet - MAE: 18645.46557155519, R2: 0.8675482800896648


  model = cd_fast.sparse_enet_coordinate_descent(


## Improving Lasso and ElasticNet Convergence

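We will address the **ConvergenceWarning** raised in the previous cell for **Lasso** and **ElasticNet**. The warning indicates that the coordinate descent solver did not converge within the default number of iterations (`max_iter=1000`), which might have affected the performance of the models.

### Adjustments:
1. **Increase the number of iterations**: Lasso and ElasticNet can need more iterations to converge, especially with many correlated one-hot features. The cell below uses `max_iter=50000` for Lasso and `max_iter=5000` for ElasticNet.
2. **Adjust the tolerance**: The cell below sets `tol=0.001`, which is looser than the scikit-learn default of `0.0001`. A looser tolerance lets the solver stop earlier; it makes the warning less likely but does not by itself produce a more accurate solution.
3. **Experiment with `alpha`**: A larger `alpha` strengthens the penalty and usually speeds up convergence. The cell below keeps `alpha=0.1` for both models so the results stay comparable with the first run; trying other values is left as a next step.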
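
One way to check whether these settings actually resolve the warning is to record warnings during fitting and inspect the fitted model's `n_iter_` attribute. The sketch below does this on synthetic data; the helper `fit_and_report` and the data are hypothetical and only illustrate the idea.

```python
import warnings

from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

# Synthetic data used only for illustration.
X_cd, y_cd = make_regression(
    n_samples=200, n_features=20, noise=5.0, random_state=0
)

def fit_and_report(max_iter, tol):
    """Fit a Lasso model; return (iterations used, whether a ConvergenceWarning fired)."""
    model = Lasso(alpha=0.1, max_iter=max_iter, tol=tol)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no warning is suppressed
        model.fit(X_cd, y_cd)
    warned = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model.n_iter_, warned

# Compare a deliberately tiny iteration budget with the larger settings.
for max_iter, tol in [(1, 1e-4), (50000, 1e-4), (50000, 1e-3)]:
    n_iter, warned = fit_and_report(max_iter, tol)
    print(f"max_iter={max_iter}, tol={tol}: n_iter_={n_iter}, warning={warned}")
```

If `n_iter_` equals `max_iter` and a warning fired, the solver ran out of iterations; if `n_iter_` is well below `max_iter`, the budget was sufficient.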


In [8]:
# Import necessary libraries for the three models
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the Ridge, Lasso, and ElasticNet models with tuned parameters
ridge_model = Ridge(alpha=1.0)
lasso_model = Lasso(alpha=0.1, max_iter=50000, tol=0.001)
elasticnet_model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000, tol=0.001)

# Train the models
ridge_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
elasticnet_model.fit(X_train, y_train)

# Predictions on the test set
ridge_pred = ridge_model.predict(X_test)
lasso_pred = lasso_model.predict(X_test)
elasticnet_pred = elasticnet_model.predict(X_test)

# Evaluate the models
ridge_mae = mean_absolute_error(y_test, ridge_pred)
lasso_mae = mean_absolute_error(y_test, lasso_pred)
elasticnet_mae = mean_absolute_error(y_test, elasticnet_pred)

ridge_r2 = r2_score(y_test, ridge_pred)
lasso_r2 = r2_score(y_test, lasso_pred)
elasticnet_r2 = r2_score(y_test, elasticnet_pred)

# Display the results
print(f"Ridge - MAE: {ridge_mae}, R2: {ridge_r2}")
print(f"Lasso - MAE: {lasso_mae}, R2: {lasso_r2}")
print(f"ElasticNet - MAE: {elasticnet_mae}, R2: {elasticnet_r2}")


Ridge - MAE: 19006.27520683145, R2: 0.8838815282935925
Lasso - MAE: 18012.98620018198, R2: 0.895121527044661
ElasticNet - MAE: 18645.466678960183, R2: 0.8675482666709546


After raising the maximum iterations (`max_iter`) and adjusting the tolerance (`tol`), the results are almost unchanged from the first run: Ridge is identical, Lasso's MAE drops from about 18,016 to 18,013, and ElasticNet's metrics are effectively the same. Rounded, the results for Ridge, Lasso, and ElasticNet are:

- **Ridge Regression**:
  - Mean Absolute Error (MAE): *19,006*
  - R-squared (R²): *0.884*

- **Lasso Regression**:
  - Mean Absolute Error (MAE): *18,013*
  - R-squared (R²): *0.895*

- **ElasticNet Regression**:
  - Mean Absolute Error (MAE): *18,645*
  - R-squared (R²): *0.868*

### Analysis

- **Ridge Regression**:
  Ridge regression, with an MAE of 19,006 and R² of 0.884, performed well; its L2 penalty shrinks coefficients to reduce overfitting. However, it trailed Lasso on both metrics on this hold-out split.

- **Lasso Regression**:
  Lasso performed the best, with an MAE of 18,013 and R² of 0.895, meaning it explains around **89.5%** of the variance in house prices on the hold-out set. The L1 regularization in Lasso may have helped by driving the coefficients of irrelevant features to zero, although this was not verified by inspecting the coefficients.

- **ElasticNet Regression**:
  ElasticNet, with an MAE of 18,645 and R² of 0.868, provided a balance between L1 and L2 regularization but still underperformed compared to Lasso. Its MAE is lower than Ridge's, yet its R² is the lowest of the three. Fine-tuning the **l1_ratio** or **alpha** could improve its performance.


## Conclusion

- **Lasso Regression** outperformed both Ridge and ElasticNet, achieving the lowest Mean Absolute Error (18,013) and highest R² (0.895).
- **Ridge Regression** provided a solid performance, reducing overfitting with L2 regularization, but with a higher error (MAE: 19,006; R²: 0.884).
- **ElasticNet Regression** showed balanced regularization but underperformed compared to Lasso, with an MAE of 18,645 and R² of 0.868.

### Next Possible Steps:
1. **Hyperparameter Tuning**: Further fine-tuning **alpha** and **l1_ratio** in ElasticNet may improve its performance.
2. **Cross-Validation**: Perform cross-validation to ensure that the models generalize well to unseen data.
3. **Feature Engineering**: Consider creating interaction features or addressing skewness in numerical variables to improve predictions.
4. **Model Comparison**: Explore more advanced models like **Random Forest** or **XGBoost** to handle non-linear relationships in the dataset.
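
The first two steps can be combined in a single cross-validated grid search. The following is a minimal sketch on synthetic data, with an assumed parameter grid and an assumed 5-fold split; in this project, the pipeline would instead start with the `preprocessor` defined earlier and be fitted on the raw feature frame.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (illustration only).
X_cv, y_cv = make_regression(
    n_samples=300, n_features=30, n_informative=8, noise=15.0, random_state=42
)

# Scaling lives inside the pipeline so it is re-fitted on each training fold.
search_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", ElasticNet(max_iter=10000)),
])

# Hypothetical grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "model__alpha": [0.01, 0.1, 1.0],
    "model__l1_ratio": [0.2, 0.5, 0.8],
}

search = GridSearchCV(
    search_pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error"
)
search.fit(X_cv, y_cv)

# scikit-learn maximizes scores, so MAE is reported as a negative number.
print("Best parameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

Keeping preprocessing inside the pipeline means each fold is transformed using statistics from its own training portion only, which avoids leaking validation data into the fit.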

By further tuning and exploring these models, we can improve accuracy and reduce the error margin.
