# Linear Regression Models for Car Price Prediction

**Objective:** Compare multiple linear regression variants to predict car prices.

**Models Tested:**
1. **Linear Regression:** Basic OLS (Ordinary Least Squares)
2. **Ridge Regression:** L2 regularization (shrinks coefficients to reduce overfitting)
3. **Lasso Regression:** L1 regularization (feature selection)
4. **ElasticNet:** Combination of L1 and L2
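
For reference, the four variants differ only in the penalty added to the squared-error loss. The objectives below follow the parameterization documented for scikit-learn's `Ridge`, `Lasso`, and `ElasticNet`. The symbols are introduced here for illustration and do not appear elsewhere in this notebook: $w$ is the coefficient vector, $n$ the number of samples, $\alpha$ the regularization strength, and $\rho$ the `l1_ratio`.

$$
\begin{aligned}
\text{OLS:} &\quad \min_w \; \lVert y - Xw \rVert_2^2 \\
\text{Ridge:} &\quad \min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{ElasticNet:} &\quad \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
\end{aligned}
$$

Because the loss terms are scaled differently, the same numeric `alpha` does not mean the same penalty strength for Ridge as it does for Lasso or ElasticNet.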

**Why Linear Models?**
- Simple and interpretable
- Fast to train
- Good baseline for comparison
- Work well with linear relationships

**Key Steps:**
1. Load preprocessed data
2. Compare multiple models with GridSearchCV
3. Train best model (Ridge)
4. Evaluate performance
5. Compare with Random Forest


In [None]:
# Import necessary libraries for linear regression modeling
import pandas as pd  # Data manipulation
import matplotlib.pyplot as plt  # Visualization
import seaborn as sns  # Statistical plots
from sklearn.model_selection import train_test_split  # Data splitting
from sklearn.metrics import mean_absolute_error, r2_score, root_mean_squared_error  # Evaluation metrics
from sklearn.linear_model import LinearRegression  # Basic linear regression
from sklearn.model_selection import GridSearchCV  # Hyperparameter tuning
import numpy as np  # Numerical operations (reverse log transformation)
from sklearn.linear_model import Ridge, Lasso, ElasticNet  # Regularized linear models
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings for cleaner output

In [None]:
# Load the preprocessed data (cleaned, encoded, log-transformed)
data = pd.read_csv("../Data/processed_data.csv")
data.head()  # Preview the data

Unnamed: 0,Price,Mileage,Brand,Model,Automatic Transmission,Air Conditioner,Power Steering,Remote Control,Years_of_usage
0,14.84513,5.70711,59,738,True,True,True,True,0
1,13.997833,11.78296,32,789,True,True,True,True,3
2,14.533351,11.736077,47,805,True,True,True,True,3
3,14.533351,11.338584,5,842,True,True,True,True,6
4,13.458837,11.225257,18,618,True,True,True,True,3


In [None]:
# Separate features (X) from target variable (y)
x = data.drop(columns="Price")  # All features except Price
y = data["Price"]  # Target variable (log-transformed price)

In [None]:
# Split data into training (70%) and testing (30%) sets
# random_state=42 ensures same split as Random Forest for fair comparison
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

## Hyperparameter Tuning with Grid Search

**Models Compared:**
1. **LinearRegression:** No regularization
2. **Ridge:** L2 regularization (shrinks coefficients)
3. **Lasso:** L1 regularization (can zero out coefficients)
4. **ElasticNet:** Combines L1 and L2
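
The difference between "shrinks" and "zeroes out" is easier to see on a small example. The sketch below uses synthetic data from `make_regression` (not the car dataset), and the `alpha` values are picked only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data (not the car dataset): 8 features, only 3 carry signal
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=10.0, random_state=0
)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# Ridge pulls the whole coefficient vector toward zero
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

# Lasso can set individual coefficients to exactly zero
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))

print(f"OLS coefficient norm   : {ols_norm:.2f}")
print(f"Ridge coefficient norm : {ridge_norm:.2f}")
print(f"Lasso zero coefficients: {n_zero_lasso} of {lasso.coef_.size}")
```

Ridge keeps every feature but with smaller weights, while Lasso drops features it judges uninformative, which is why it is often described as doing feature selection.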

The commented code below shows the Grid Search process that found **Ridge with α=10** as the best model.

In [None]:
# Grid Search code (commented after finding best parameters)
# This tested 4 different linear regression variants with various hyperparameters
# 
# Parameters tested:
# - alpha: Regularization strength (0.1, 1.0, 10.0)
# - fit_intercept: Whether to calculate intercept
# - solver: Optimization algorithm
# - l1_ratio: Balance between L1 and L2 for ElasticNet
#
# Grid Search used 5-fold cross-validation and MAE as scoring metric
# Result: Ridge with alpha=10.0, solver='svd' performed best

# models = {
#     "LinearRegression": {
#         "model": LinearRegression(),
#         "param_grid": {
#             "fit_intercept": [True, False],  # Calculate intercept or not
#         },
#     },
#     "Ridge": {  # L2 regularization
#         "model": Ridge(),
#         "param_grid": {
#             "alpha": [0.1, 1.0, 10.0],  # Regularization strength
#             "fit_intercept": [True, False],
#             "solver": ["auto", "svd", "sag", "saga"],  # Optimization algorithm
#             "max_iter": [100, 1000, 10000]  # Maximum iterations
#         },
#     },
#     "Lasso": {  # L1 regularization (feature selection)
#         "model": Lasso(),
#         "param_grid": {
#             "alpha": [0.1, 1.0, 10.0],
#             "fit_intercept": [True, False],
#             "max_iter": [100, 1000, 10000]
#         },
#     },
#     "ElasticNet": {  # Combination of L1 and L2
#         "model": ElasticNet(),
#         "param_grid": {
#             "alpha": [0.1, 1.0, 10.0],
#             "l1_ratio": [0.1, 0.5, 0.9],  # Balance between L1 and L2
#             "fit_intercept": [True, False],
#             "max_iter": [100, 1000, 10000]
#         },
#     },
# }

# # Test each model with Grid Search
# for model, model_info in models.items():
#     print(f"\nPerforming Grid Search for {model}...")
#
#     grid_search = GridSearchCV(
#         estimator=model_info["model"],
#         param_grid=model_info["param_grid"],
#         cv=5,  # 5-fold cross-validation
#         scoring="neg_mean_absolute_error",  # Minimize MAE
#     )
#
#     grid_search.fit(x_train, y_train)
#
#     # Print best parameters and scores
#     print(f"Best parameters for {model}: {grid_search.best_params_}")
#     print(f"Best cross-validation score for {model}: {-grid_search.best_score_}")
#     
#     # Evaluate on test set
#     y_pred = grid_search.predict(x_test)
#     mae = mean_absolute_error(y_test, y_pred)
#     r2 = r2_score(y_test, y_pred)
#     rmse = root_mean_squared_error(y_test, y_pred)
#
#     print(f"Test MAE for {model}: {mae}")
#     print(f"Test R2 for {model}: {r2}")
#     print(f"Test RMSE for {model}: {rmse}")
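
The search above selected `alpha=10.0`, which is the largest value in the grid, so the true optimum may lie beyond it. As a hedged sketch (synthetic data, a reduced parameter grid, and variable names such as `candidates` and `results` that are not part of this notebook), the same loop can use a wider log-spaced grid and collect results into one table for easier comparison:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the car data (assumption: any numeric X, y works here)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Log-spaced alphas reach past 10 so the best value is less likely to sit on the edge
alphas = np.logspace(-1, 3, 5)  # 0.1, 1, 10, 100, 1000

candidates = {
    "Ridge": (Ridge(), {"alpha": alphas}),
    "Lasso": (Lasso(max_iter=10_000), {"alpha": alphas}),
    "ElasticNet": (
        ElasticNet(max_iter=10_000),
        {"alpha": alphas, "l1_ratio": [0.1, 0.5, 0.9]},
    ),
}

rows = []
for name, (estimator, param_grid) in candidates.items():
    search = GridSearchCV(
        estimator, param_grid, cv=5, scoring="neg_mean_absolute_error"
    )
    search.fit(X_train, y_train)
    rows.append(
        {
            "model": name,
            "best_params": search.best_params_,
            "cv_mae": -search.best_score_,  # scorer returns negative MAE
        }
    )

# One row per model family, best cross-validated MAE first
results = pd.DataFrame(rows).sort_values("cv_mae").reset_index(drop=True)
print(results)
```

If the best `alpha` still lands on the upper or lower edge of the grid, that is a hint to extend the grid in that direction.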

In [None]:
# Create Ridge regression model with best parameters from Grid Search
model = Ridge(
    alpha=10.0,  # Strong regularization to prevent overfitting
    fit_intercept=True,  # Include intercept term
    solver="svd",  # Singular Value Decomposition solver (stable)
    max_iter=100  # Not used by the "svd" solver (a direct method); kept from the Grid Search result
)

# Train the model on training data
model.fit(x_train, y_train)

# Make predictions on test set
y_pred = model.predict(x_test)
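
The Ridge penalty acts on coefficient magnitudes, and those depend on each feature's units. The data preview shows integer codes in the hundreds (`Brand`, `Model`) next to log-scale values (`Mileage`), so standardizing the features before fitting may be worth trying. The sketch below is an assumption-laden illustration on synthetic data, not something this notebook ran; whether it helps on the car data is untested.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on very different scales (hypothetical, not the car data)
X = np.column_stack([
    rng.normal(0, 1, 500),     # small-scale feature
    rng.normal(0, 1000, 500),  # large-scale feature
])
# Both features contribute about equally to y once their scale is accounted for
y = 2.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.1, 500)

# Fitting the scaler inside the pipeline means it only ever sees the data
# passed to fit(), which keeps test-fold statistics out of cross-validation
scaled_ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0))
scaled_ridge.fit(X, y)

coefs = scaled_ridge.named_steps["ridge"].coef_
print(coefs)
```

After standardization the two coefficients come out at a similar size, so the penalty treats both features evenly instead of favoring whichever happens to have larger units.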

## Model Evaluation

Assess Ridge regression performance using standard regression metrics.

In [None]:
# Recalculate predictions (redundant but ensures consistency)
y_pred = model.predict(x_test)

# Calculate evaluation metrics on the log-price scale, then multiply by 100 for display
mae = mean_absolute_error(y_test, y_pred) * 100  # Mean absolute error in log units x 100 (not a percentage of price)
r2 = r2_score(y_test, y_pred) * 100  # Share of log-price variance explained (0-100%)
rmse = root_mean_squared_error(y_test, y_pred) * 100  # Root mean squared error in log units x 100

# Display metrics
print(f"Test MAE : {mae:.2f}")  # Lower is better
print(f"Test R2 : {r2:.2f}")  # Higher is better (max 100%)
print(f"Test RMSE : {rmse:.2f}")  # Lower is better

Test MAE : 40.45
Test R2 : 61.86
Test RMSE : 53.16
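
Because the target is log-transformed and the printed errors are multiplied by 100, an MAE of 40.45 corresponds to 0.4045 log units. Exponentiating gives roughly exp(0.4045) ≈ 1.50, so a typical prediction is off by a factor of about 1.5 in either direction. The helper below is a hypothetical sketch (the name `price_scale_report` is not part of this notebook) that reports the error on both scales, assuming the target was produced with `np.log1p`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def price_scale_report(y_true_log, y_pred_log):
    """Summarize errors for a log1p-transformed target (hypothetical helper)."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    log_mae = mean_absolute_error(y_true_log, y_pred_log)
    # Undo the log1p transform to measure the error in currency units
    price_mae = mean_absolute_error(np.expm1(y_true_log), np.expm1(y_pred_log))
    return {
        "log_mae": log_mae,
        "typical_ratio": float(np.exp(log_mae)),  # multiplicative error factor
        "price_mae": price_mae,
    }


# Toy values: every prediction is off by exactly 0.4 in log space
y_true_log = np.log1p([500_000.0, 1_000_000.0, 2_000_000.0])
y_pred_log = y_true_log + np.array([0.4, -0.4, 0.4])

report = price_scale_report(y_true_log, y_pred_log)
print(report)
```

In the notebook itself the call would be `price_scale_report(y_test, y_pred)`.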


In [None]:
# Create comparison table of actual vs predicted prices
# np.expm1() reverses the log transformation to get actual prices in EGP
# round(2) shows prices with 2 decimal places
pd.DataFrame({
    'Actual': np.expm1(y_test).round(2),  # Actual price
    'Predicted': np.expm1(y_pred).round(2)  # Predicted price
}).head(20)  # Display first 20 predictions

Unnamed: 0,Actual,Predicted
10534,6800000.0,3071642.15
11227,250000.0,560986.11
14319,1270000.0,1041592.48
17615,820000.0,932498.59
16922,2000000.0,581209.66
14855,460000.0,400473.85
3643,4200000.0,1150722.92
1595,1450000.0,539567.73
18046,450000.0,434042.74
12528,850000.0,536719.83
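
Raw prices make it hard to judge how large each miss is relative to the car's value. As a small sketch, the first three rows of the table above can be extended with an absolute percentage error column (the column name `AbsPctError` is chosen here for illustration):

```python
import pandas as pd

# First three rows copied from the comparison table above
comparison = pd.DataFrame(
    {
        "Actual": [6_800_000.0, 250_000.0, 1_270_000.0],
        "Predicted": [3_071_642.15, 560_986.11, 1_041_592.48],
    },
    index=[10534, 11227, 14319],
)

# Absolute error as a percentage of the actual price
comparison["AbsPctError"] = (
    (comparison["Predicted"] - comparison["Actual"]).abs()
    / comparison["Actual"]
    * 100
).round(1)

print(comparison)
```

On these rows the misses range from under 20% to over 100% of the actual price, which matches the moderate R² reported for Ridge.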


## Model Comparison: Ridge vs Random Forest

Let's compare the performance of our Ridge Regression with Random Forest.

### Ridge Regression Performance:
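```
Test MAE  : 40.45
Test R²   : 61.86%
Test RMSE : 53.16
```

**Interpretation:**
- R² = 61.86% means the model explains about 62% of the variance in log-transformed price
- MAE and RMSE are printed multiplied by 100; on the log-price scale they are 0.40 and 0.53
- Moderate performance: a usable baseline with clear room for improvement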

### Random Forest Performance:
```
Test RMSE : 0.30
Test MAE  : 0.21
Test R²   : 0.88 (88%)
```

**Interpretation:**
- R² = 88% means the model explains 88% of price variance! ⭐
- Much better performance than Ridge Regression


## Conclusion: Random Forest Wins! 🏆

### Performance Comparison:

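| Metric | Ridge Regression | Random Forest | Winner |
|--------|-----------------|---------------|--------|
| **R² Score** | 61.86% | **88.00%** | Random Forest |
| **MAE** (log-price units) | 0.40 | **0.21** | Random Forest |
| **RMSE** (log-price units) | 0.53 | **0.30** | Random Forest |

Ridge MAE and RMSE are the printed values (40.45 and 53.16) divided by 100, which undoes the display scaling so both models are compared in the same units.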
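
The Ridge cell prints MAE and RMSE multiplied by 100, while the Random Forest figures are quoted unscaled, so the two sets of numbers need a common scale before they can be compared. A minimal sketch of that conversion, using only the values reported in this notebook (the variable names are illustrative):

```python
import pandas as pd

# Ridge values as printed by the evaluation cell (all multiplied by 100)
ridge_printed = {"R2": 61.86, "MAE": 40.45, "RMSE": 53.16}
# Random Forest values as quoted in this notebook (unscaled)
forest_quoted = {"R2": 0.88, "MAE": 0.21, "RMSE": 0.30}

metrics = ["R2", "MAE", "RMSE"]
comparison = pd.DataFrame(
    {
        "Ridge": [ridge_printed[m] / 100 for m in metrics],  # undo the x100
        "RandomForest": [forest_quoted[m] for m in metrics],
    },
    index=metrics,
)

# Higher is better for R2, lower is better for the error metrics
higher_is_better = {"R2": True, "MAE": False, "RMSE": False}
comparison["Winner"] = [
    row.idxmax() if higher_is_better[metric] else row.idxmin()
    for metric, row in comparison.iterrows()
]

# Gap in R2 expressed in percentage points
r2_gap_points = (
    comparison.loc["R2", "RandomForest"] - comparison.loc["R2", "Ridge"]
) * 100

print(comparison)
print(f"R2 gap: {r2_gap_points:.2f} percentage points")
```

On a common scale the Ridge MAE is about 0.40 against 0.21 for Random Forest, which is a large gap but far smaller than a naive reading of 40.45 versus 0.21 would suggest.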

### Why Random Forest Outperforms Ridge:

1. **Non-linear Relationships:** Car prices have complex non-linear patterns
   - Ridge assumes linear relationships → limited accuracy
   - Random Forest captures non-linearity → higher accuracy

2. **Feature Interactions:** 
   - Brand + Model + Year interact in complex ways
   - Random Forest automatically learns these interactions
   - Ridge requires manual feature engineering

3. **Flexibility:**
   - Ridge: Simple, interpretable, but limited
   - Random Forest: Complex, powerful, captures nuances
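
To make the "manual feature engineering" point concrete, here is a hedged sketch of one common option: adding pairwise interaction terms with `PolynomialFeatures` before Ridge. It uses synthetic data where the target depends on a product of two features; it was not run on the car dataset, and the gain there is unknown.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic target driven by a product of two features (hypothetical data)
X = rng.normal(size=(400, 3))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=400)

# Plain Ridge sees only the three original columns
plain = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Same model, but with pairwise products x_i * x_j added as extra columns
with_interactions = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

r2_plain = cross_val_score(plain, X, y, cv=5, scoring="r2").mean()
r2_inter = cross_val_score(with_interactions, X, y, cv=5, scoring="r2").mean()

print(f"Ridge, original features : R2 = {r2_plain:.2f}")
print(f"Ridge, with interactions : R2 = {r2_inter:.2f}")
```

The linear model can only fit the product term once it is supplied as a column, whereas a tree ensemble can approximate it from the raw features through successive splits.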

### When to Use Each Model:

**Use Ridge When:**
- ✅ Need interpretability (see coefficient importance)
- ✅ Fast training required
- ✅ Linear relationships expected
- ✅ Building a baseline model
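
As an illustration of the interpretability point, the coefficients of a fitted Ridge model can be paired with their column names and ranked. This sketch uses synthetic data with made-up feature names; in this notebook the equivalent inputs would be `model.coef_` and `x_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data with made-up feature names (not the car dataset)
feature_names = ["f0", "f1", "f2", "f3"]
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=feature_names)

model = Ridge(alpha=10.0).fit(X, y)

# Pair each coefficient with its column name and rank by absolute size
coefs = pd.Series(model.coef_, index=X.columns)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)

print(ranked)
```

Coefficient sizes are only comparable across features when the features share a scale, so on unscaled data the ranking reflects units as much as importance.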

**Use Random Forest When:**
- ✅ Maximum accuracy needed
- ✅ Complex non-linear data
- ✅ Don't need to explain predictions
- ✅ Have sufficient computational resources

### Final Recommendation:
**Use Random Forest for car price prediction** - an R² about 26 percentage points higher (88% vs 61.86%) justifies the added complexity!