<p><a href="https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%209%20Notebooks/GDAN%205400%20-%20Week%209%20Class%20Notebook.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

# Kaggle Competition: Housing Prices – Advanced Regression Techniques

In today's class, as well as coding assignment #5 and the final project, we will be using the [Housing Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition on Kaggle.

### Competition Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

In this fifth assignment, we are switching to another competition on *Kaggle*, an online platform for data science and machine learning that provides datasets, competitions, collaborative notebooks, and learning resources.

### Evaluation

#### Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

#### Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

---

These exercises will help strengthen your ability to explore, preprocess, and model real-world datasets using machine learning. You will gain hands-on experience with data cleaning, feature engineering, and predictive modeling, all while working with a classic dataset in a competitive Kaggle environment.


---

### Data Dictionary

As a reference, here is a data dictionary describing the variables you will find in the dataset:

---


| **Feature**         | **Description** |
|---------------------|----------------|
| SalePrice      | Property's sale price in dollars (target variable) |
| MSSubClass     | Building class |
| MSZoning       | General zoning classification |
| LotFrontage    | Linear feet of street connected to property |
| LotArea        | Lot size in square feet |
| Street         | Type of road access |
| Alley          | Type of alley access |
| LotShape       | General shape of property |
| LandContour    | Flatness of the property |
| Utilities      | Type of utilities available |
| LotConfig      | Lot configuration |
| LandSlope      | Slope of property |
| Neighborhood   | Physical locations within Ames city limits |
| Condition1     | Proximity to main road or railroad |
| Condition2     | Proximity to main road or railroad (if a second is present) |
| BldgType       | Type of dwelling |
| HouseStyle     | Style of dwelling |
| OverallQual    | Overall material and finish quality |
| OverallCond    | Overall condition rating |
| YearBuilt      | Original construction date |
| YearRemodAdd   | Remodel date |
| RoofStyle      | Type of roof |
| RoofMatl       | Roof material |
| Exterior1st    | Exterior covering on house |
| Exterior2nd    | Exterior covering on house (if more than one material) |
| MasVnrType     | Masonry veneer type |
| MasVnrArea     | Masonry veneer area in square feet |
| ExterQual      | Exterior material quality |
| ExterCond      | Present condition of exterior material |
| Foundation     | Type of foundation |
| BsmtQual       | Height of the basement |
| BsmtCond       | General condition of the basement |
| BsmtExposure   | Walkout or garden level basement walls |
| BsmtFinType1   | Quality of basement finished area |
| BsmtFinSF1     | Type 1 finished square feet |
| BsmtFinType2   | Quality of second finished area (if present) |
| BsmtFinSF2     | Type 2 finished square feet |
| BsmtUnfSF      | Unfinished square feet of basement area |
| TotalBsmtSF    | Total square feet of basement area |
| Heating        | Type of heating |
| HeatingQC      | Heating quality and condition |
| CentralAir     | Central air conditioning (Yes/No) |
| Electrical     | Electrical system type |
| 1stFlrSF       | First floor square feet |
| 2ndFlrSF       | Second floor square feet |
| LowQualFinSF   | Low quality finished square feet (all floors) |
| GrLivArea      | Above grade (ground) living area square feet |
| BsmtFullBath   | Basement full bathrooms |
| BsmtHalfBath   | Basement half bathrooms |
| FullBath       | Full bathrooms above grade |
| HalfBath       | Half bathrooms above grade |
| BedroomAbvGr   | Number of bedrooms above basement level |
| KitchenAbvGr   | Number of kitchens |
| KitchenQual    | Kitchen quality |
| TotRmsAbvGrd   | Total rooms above grade (excludes bathrooms) |
| Functional     | Home functionality rating |
| Fireplaces     | Number of fireplaces |
| FireplaceQu    | Fireplace quality |
| GarageType     | Garage location |
| GarageYrBlt    | Year garage was built |
| GarageFinish   | Interior finish of the garage |
| GarageCars     | Garage size in car capacity |
| GarageArea     | Garage size in square feet |
| GarageQual     | Garage quality |
| GarageCond     | Garage condition |
| PavedDrive     | Paved driveway presence |
| WoodDeckSF     | Wood deck area in square feet |
| OpenPorchSF    | Open porch area in square feet |
| EnclosedPorch  | Enclosed porch area in square feet |
| 3SsnPorch      | Three-season porch area in square feet |
| ScreenPorch    | Screen porch area in square feet |
| PoolArea       | Pool area in square feet |
| PoolQC         | Pool quality |
| Fence          | Fence quality |
| MiscFeature    | Miscellaneous feature not covered in other categories |
| MiscVal        | Dollar value of miscellaneous feature |
| MoSold         | Month sold |
| YrSold         | Year sold |
| SaleType       | Type of sale |
| SaleCondition  | Condition of sale |

---

### Access Full Codebook

The link below describes what the codes mean for each of the variables.

https://github.com/gdsaxton/GDAN5400/blob/main/Housing_Prices/data_description.txt


# Machine Learning Step 1: Understanding the Problem 

In line with the class lecture and exercises, you now have a good sense of the first task in tackling a machine learning project: *understanding the problem*. Specifically, you understand that this is a *regression* problem requiring you to predict housing sales prices by minimizing *RMSLE*. Moreover, you have a preliminary *conceptual model* to guide your efforts. Now you are ready to start coding!


# Machine Learning Step 2: Exploratory Data Analysis (EDA)

The `Exploratory Data Analysis (EDA)` stage helps uncover patterns, relationships, and potential issues within the dataset before modeling. This involves summarizing key statistics, visualizing distributions, identifying correlations, and detecting anomalies or missing values. EDA provides critical insights that guide feature selection, preprocessing strategies, and model choice. A thorough EDA ensures a deeper understanding of the data, leading to more informed decision-making in subsequent stages.

#### Load the Housing Prices `Training` Dataset  

I have uploaded the training and test datasets onto the class GitHub repository.

In [None]:
import numpy as np
import pandas as pd

# See http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  # Show all columns in a DataFrame
pd.set_option('display.max_colwidth', 500)  # Allow up to 500 characters per cell

In [None]:
train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

#### Run `info()` to inspect variables

In [None]:
train.info()

#### Identify variables with missing data

In [None]:
train.isnull().sum()[train.isnull().sum() > 0]
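
The raw counts can be easier to read as percentages, sorted from most to least missing. The sketch below uses a small made-up DataFrame; `demo` and the `missing_summary` helper are illustrative names, not part of the course code. The same function could be applied to `train`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return count and percent of missing values for columns that have any."""
    counts = df.isnull().sum()
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (100 * counts / len(df)).round(1),
    })

# Hypothetical toy data standing in for the training set
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

summary = missing_summary(demo)
print(summary)
```

Columns without missing values (here `LotArea`) are left out of the summary.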

#### Explore Numeric Variables with Histograms  

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix

train.select_dtypes(include='number').hist(figsize=(13, 8))
plt.tight_layout()
plt.show()

#### Generate an Automated Data Report  
- Install and use `ydata-profiling` to create a detailed report of the dataset.  
- This report will provide insights into **missing values, distributions, correlations, and more**.  
- **Tip:** Instead of manually exploring each variable, use this **automated tool** to summarize the data in one step.  
- Save the report as an **HTML file** for easy viewing.

In [None]:
# Install ydata-profiling
!pip install ydata_profiling --quiet
# Import the report class
from ydata_profiling import ProfileReport

In [None]:
# Generate the report
profile = ProfileReport(train, title="Housing_Prices")

In [None]:
# Save the report to an HTML file
profile.to_file("housing_prices.html")

# Machine Learning Step 3: Data Preprocessing and Feature Engineering
The `Data Preprocessing & Feature Engineering` stage is crucial for ensuring that the dataset is clean, structured, and optimized for machine learning models. This involves handling missing values, encoding categorical variables, scaling numerical features, and detecting outliers. Additionally, feature engineering enhances predictive performance by creating new meaningful variables, transforming existing ones, or selecting the most relevant features. Effective preprocessing and engineering can significantly improve model accuracy and generalization.

#### Example: Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  

In [None]:
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())

In [None]:
train['LotFrontage'] = train['LotFrontage'].fillna(train["LotFrontage"].median())
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())
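
One point to keep in mind for later: the test set may also contain missing `LotFrontage` values, and a common convention is to fill them with the median learned from the training data rather than a median recomputed on the test set. A minimal sketch with made-up frames (`train_demo` and `test_demo` are hypothetical stand-ins for `train` and `test`):

```python
import numpy as np
import pandas as pd

train_demo = pd.DataFrame({"LotFrontage": [60.0, 70.0, np.nan, 80.0, 200.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 55.0, np.nan]})

# Learn the fill value from the training data only
lot_frontage_median = train_demo["LotFrontage"].median()

# Apply the same value to both datasets
train_demo["LotFrontage"] = train_demo["LotFrontage"].fillna(lot_frontage_median)
test_demo["LotFrontage"] = test_demo["LotFrontage"].fillna(lot_frontage_median)

print("Fill value:", lot_frontage_median)
print(test_demo)
```

Note that the outlier (200) barely moves the median, which is the reason the median is preferred over the mean here.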

#### Create a `High_Quality` Variable

From the codebook we see that the values of `OverallQual` are the following:

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

In [None]:
#Check for missing values
print("Missing values in OverallQual column:", train["OverallQual"].isnull().sum())

In [None]:
#Frequencies 
train['OverallQual'].value_counts().sort_index()

In [None]:
#Create the binary variable
train['High_Quality'] = train['OverallQual'].apply(lambda x: 1 if x >= 7 else 0)
train['High_Quality'].value_counts()

In [None]:
#Cross-tabulation to verify coding
pd.crosstab(train['High_Quality'], train['OverallQual'])

In [None]:
#Check for missing values
print("Missing values in High_Quality column:", train["High_Quality"].isnull().sum())
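
As a side note, the same 0/1 flag can be built without `apply` by comparing the whole column at once. This is only an alternative sketch on toy data (`quality_demo` is a hypothetical frame); it should give the same result as the lambda version above.

```python
import pandas as pd

quality_demo = pd.DataFrame({"OverallQual": [3, 6, 7, 10]})

# The comparison yields True/False; astype(int) turns that into 1/0
quality_demo["High_Quality"] = (quality_demo["OverallQual"] >= 7).astype(int)
print(quality_demo)
```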

#### Create Variable `Age`

In [None]:
train['Age'] = 2025 - train['YearBuilt']
train[['YearBuilt', 'Age']].describe().T

In [None]:
#Check for missing values
print("Missing values in Age column:", train["Age"].isnull().sum())
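
Note that `2025 - YearBuilt` measures age as of 2025, not at the time of sale. If all sales fall within a few years of each other, the two definitions differ by a roughly constant offset, which mainly affects the intercept. An alternative, shown here only as a sketch on made-up rows, is to measure age at the time of sale using `YrSold` (the column name `AgeAtSale` is hypothetical and is not used elsewhere in this notebook):

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "YrSold": [2008, 2007, 2006],
})

# Age measured when the house was sold, rather than as of a fixed year
sales_demo["AgeAtSale"] = sales_demo["YrSold"] - sales_demo["YearBuilt"]
print(sales_demo)
```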

# Machine Learning Step 4: Model Selection, Training, Tuning, & Evaluation

The `Model Selection, Training, Tuning, & Evaluation` stage focuses on choosing the most suitable machine learning algorithms, fitting them to the data, optimizing performance, and assessing their effectiveness. It begins with selecting baseline models and training them on the processed dataset. Hyperparameter tuning—using techniques like grid search, random search, or Bayesian optimization—helps refine model performance. Evaluation metrics such as RMSE, accuracy, or F1-score are used to measure success and compare models. This iterative process ensures that the best-performing and most generalizable model is identified for final predictions.


#### Select Predictors and Split the Data into Training and Validation Sets

We will start with a single predictor: `Age`

In [None]:
from sklearn.model_selection import train_test_split

features = ['Age']

X = train[features]
y = train['SalePrice']
print(X.shape, y.shape)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

#### Train and Evaluate a Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_squared_log_error

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Initiate and run linear regression model
model =  LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on validation set
val_predictions = model.predict(X_val)

# Calculate regression performance metrics
mae = mean_absolute_error(y_val, val_predictions)
mse = mean_squared_error(y_val, val_predictions)
rmse = np.sqrt(mse)
rmsle_score = rmsle(y_val, val_predictions)  # Use the helper above so negative predictions are clipped to 0
r2 = r2_score(y_val, val_predictions)

# Print results
print("Model Performance Metrics:")
print(f"R² Score: {r2:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_score:.4f}")

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)


# Generate predictions on validation set
val_predictions = model.predict(X_val) 

# Evaluate model performance on validation data
score = rmsle(y_val, val_predictions)
print("RMSLE:", score)

### Evaluating Model Performance 

Our regression model uses **only "Age" of a house** to predict **Sale Price**, and the performance metrics suggest that **Age alone is not a great predictor** of home prices. Let’s break down what these numbers mean in plain terms.

---

#### R² Score: 0.2898 (Weak Fit)
- **What it means:** The model **only explains about 29% of the variation** in home prices.
- **Why this is a problem:** This suggests that **other important factors** (like square footage, location, condition, and number of bedrooms) are missing from the model.
- **Analogy:** Imagine trying to predict someone’s weight **using only their height**—it helps, but it’s far from perfect.

---

#### Mean Absolute Error (MAE): \\$51,148.47 (Large Errors)
- **What it means:** On average, the model’s predictions are **off by about \\$51,000**.
- **Why this is a problem:** In real estate, an error this large could make a big difference in pricing decisions.
- **Example:** If a house is actually worth **\\$250,000**, the model might predict something like **\\$200,000** or **\\$300,000**, which is a major miss.

---

#### Mean Squared Error (MSE): 5.45 Billion & Root Mean Squared Error (RMSE): \\$73,809.45 (Even Larger Errors for Some Houses)
- **What it means:** RMSE squares each error before averaging, so large misses count disproportionately. By this measure, a typical error is about **\\$74,000**.
- **Why this is a problem:** RMSE is well above MAE (\\$73,809 vs. \\$51,148), which indicates that **a few very bad predictions are pulling up the error** (the model struggles particularly with some houses).

---

#### Root Mean Squared Logarithmic Error (RMSLE): 0.3595 (Better for Relative Errors)
- **What it means:** This number tells us how **far off the predictions are in relative (percentage-like) terms** instead of dollar amounts.
- **Why it’s useful:** If RMSLE is **close to zero**, the model is doing well at predicting houses **relative to their actual value**. For example, predicting a \\$100,000 home as \\$110,000 counts about the same as predicting a \\$1M home as \\$1.1M, because both are 10% too high.
- **For comparison:** An RMSLE of **0.3595** indicates sizable relative errors. Because it is on a log scale, it cannot be compared directly with the dollar-valued RMSE.

---

#### Interpreting RMSLE: 0.3595 in More Detail

The **Root Mean Squared Logarithmic Error (RMSLE) = 0.3595** tells us how far off our **predictions are in percentage terms**, rather than absolute dollar differences. 

#### What Does RMSLE Actually Measure?
RMSLE compares the **log-transformed actual prices** with the **log-transformed predicted prices**, then calculates the **square root of their mean squared difference**:

\\[
RMSLE = \sqrt{\frac{1}{n} \sum \left( \log(1 + \hat{y}_i) - \log(1 + y_i) \right)^2 }
\\]

Where:
- \\( y_i \\) = actual price
- \\( \hat{y}_i \\) = predicted price
- **Taking the log** helps **reduce the impact of very large errors** and focuses more on **relative differences**.
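
To make the formula concrete, here is a minimal sketch that computes RMSLE directly with NumPy. The function name `rmsle_manual` and the prices are made up for illustration. It also shows the "relative error" property: a prediction that is 10% too high gives almost the same score for a cheap house as for an expensive one.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """RMSLE computed directly from the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(log_diff ** 2))

# Hypothetical prices: each prediction is 10% too high
actual = np.array([100_000, 1_000_000])
predicted = actual * 1.10

cheap = rmsle_manual(actual[:1], predicted[:1])
expensive = rmsle_manual(actual[1:], predicted[1:])
print(f"Cheap house:     {cheap:.4f}")
print(f"Expensive house: {expensive:.4f}")
```

Both values are close to log(1.1), even though the dollar errors differ by a factor of ten.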

#### How Do We Interpret RMSLE = 0.3595?
Since RMSLE is a measure of **relative error**, it can be approximately interpreted as:

\\[
e^{0.3595} - 1 \approx 0.432
\\]

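This means that a typical prediction is off by roughly **±43.2%** of the actual price. Treating the error as symmetric is a simplification: strictly, a log error of 0.3595 corresponds to overpredicting by about 43% or underpredicting by about 30% (since \\( e^{-0.3595} \approx 0.70 \\)).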

---

#### What Does a ±43.2% Error Mean in Real Terms?
For different price ranges, this error translates to the following **expected prediction errors**:

| **Actual Sale Price** | **Typical Predicted Range** (±43.2% error) |
|----------------------|----------------------------------|
| **\\$100,000** | **\\$56,800 – \\$143,200** |
| **\\$200,000** | **\\$113,600 – \\$286,400** |
| **\\$300,000** | **\\$170,400 – \\$429,600** |
| **\\$500,000** | **\\$284,000 – \\$716,000** |


This means that for a house actually worth \\$300,000, the model might predict anything from \\$170K to \\$430K, which is a very large and impractical range.
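
The ranges in the table follow from simple arithmetic on the 43.2% figure. The sketch below recomputes them and, for comparison, also applies the log error as a multiplicative factor in both directions, which gives an asymmetric range. The function names are illustrative, and the multiplicative view is offered as a complementary reading rather than a replacement for the table.

```python
import math

RMSLE = 0.3595
PCT = 0.432  # relative error figure used in the table above

def symmetric_range(price, pct=PCT):
    """Price minus/plus a fixed percentage (the simplified reading)."""
    return price * (1 - pct), price * (1 + pct)

def multiplicative_range(price, log_err=RMSLE):
    """Price divided/multiplied by e^log_err (the log-scale reading)."""
    return price * math.exp(-log_err), price * math.exp(log_err)

for price in [100_000, 200_000, 300_000, 500_000]:
    lo_s, hi_s = symmetric_range(price)
    lo_m, hi_m = multiplicative_range(price)
    print(f"${price:,}: symmetric ${lo_s:,.0f} to ${hi_s:,.0f} | "
          f"multiplicative ${lo_m:,.0f} to ${hi_m:,.0f}")
```

The upper ends of the two ranges nearly coincide; the lower end is where they differ.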

---

#### How Does This Compare to Other Models?
| **RMSLE Value** | **Interpretation** |
|---------------|----------------|
| **0.1 or less** | Excellent – very close predictions |
| **0.2 - 0.3** | Good – reasonable accuracy |
| **0.3 - 0.4** | Acceptable – useful but not great |
| **Above 0.4** | Poor – large deviations in predictions |

At **0.3595**, our model is in the **"borderline acceptable"** range, but it’s **not good enough for real estate pricing** because of the high variation.


#### Sidebar – Calculating a Single Prediction for a 10-year-old House

I will go into some technical detail here. You don't need to try to understand the math, but hopefully you will get the intuition of how we are using the "trained" regression model to generate new predictions. This is similar to what we covered in the lecture, but with only one predictor variable (`Age`).

First, we will extract the 'intercept' (aka 'constant term') from the above regression model. 

In [None]:
# Get intercept (constant term)
intercept = model.intercept_

# Get coefficients
coefficients = model.coef_

# Print results
print(f"Intercept (Constant): {intercept}")
print("Coefficients:")
for feature, coef in zip(X_train.columns, coefficients):
    print(f"{feature}: {coef}")

<br>Now we can use those numbers to calculate the predicted sale price for a hypothetical 10-year-old house in Ames, Iowa. 

In [None]:
# Given updated coefficients from trained model
intercept = 251736.368624  # β₀
age_coefficient = -1300.9310044469596  # β₁

# Input values for prediction
age = 10  # Age of the house in years

# Compute predicted sale price using the regression equation
predicted_price = intercept + (age_coefficient * age) 

# Display result
predicted_price
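
Rather than copying the intercept and coefficient by hand, the same kind of number can be obtained from a fitted model's `predict` method. The sketch below fits a fresh model on made-up, perfectly linear data (so it does not reproduce the values above; `toy`, `toy_price`, and `toy_model` are hypothetical names) and checks that `predict` agrees with the manual equation.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: price falls by exactly 1,000 per year of age
toy = pd.DataFrame({"Age": [0, 10, 20, 30, 40]})
toy_price = 250_000 - 1_000 * toy["Age"]

toy_model = LinearRegression().fit(toy[["Age"]], toy_price)

# Manual calculation: intercept + coefficient * age
manual = toy_model.intercept_ + toy_model.coef_[0] * 10

# Same calculation via predict(); the input keeps the column name used in training
new_house = pd.DataFrame({"Age": [10]})
via_predict = toy_model.predict(new_house)[0]

print(manual, via_predict)
```

In this notebook the equivalent call would be `model.predict(pd.DataFrame({'Age': [10]}))`, assuming `model` is still the version trained on the untransformed `Age`.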

---

Here is a technical description of how the trained regression model uses information to generate predictions:

#### Regression Equation

The linear regression equation is:

\\[
\hat{y} = \beta_0 + \beta_1 X_1
\\]

where:
- \\( \hat{y} \\) (y-hat) represents the **predicted** sale price of the house.
- \\( \beta_0 \\) is the **intercept** (the predicted price when all independent variables are zero).
- \\( \beta_1 \\) is the **coefficient** for the house age variable.
- \\( X_1 \\) represents the **age of the house** in years.

#### Substituting the Given Values:

\\[
\hat{y} = 251736.37 + (-1300.93 \times \text{Age})
\\]

For a house that is 10 years old:

\\[
\hat{y} = 251736.37 + (-1300.93 \times 10)
\\]

\\[
\hat{y} = 251736.37 - 13009.31
\\]

\\[
\hat{y} = 238727.06
\\]

Thus, the predicted sale price for a 10-year-old house in Ames, Iowa is **\\$238,727.06**.



### How Good is Our Model?
Above, we saw that an RMSLE of 0.3595 is mediocre: this is clearly not a strong model. Let's look at some graphs that can help identify the issues.

#### Plot Actual vs. Predicted Values

In [None]:
# Plot predicted vs. actual prices
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_val, y=val_predictions, alpha=0.7)
plt.plot([y_val.min(), y_val.max()], [y_val.min(), y_val.max()], '--r', linewidth=2)  # Identity line
plt.xlabel("Actual Sale Price", labelpad=15)
plt.ylabel("Predicted Sale Price", labelpad=15)
plt.title("Predicted vs. Actual Sale Price (Linear Regression – Age)", pad=20)
plt.show()

#### Plot Age vs. Actual Sale Price (with Regression Line for Predicted Price)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set a modern Seaborn style
sns.set_style("whitegrid")
sns.set_palette("muted")

# Create figure with better size
plt.figure(figsize=(9, 6))

# Scatter plot of Age vs. Actual Sale Price
sns.scatterplot(x=X_val['Age'], y=y_val, alpha=0.6, s=50, label="Actual Prices", color="royalblue")

# Fit a regression line (Predicted SalePrice vs. Age) with a slightly thicker line
sns.regplot(x=X_val['Age'], y=val_predictions, scatter=False, color="red",
            label="Regression Line", line_kws={"linewidth": 1.5})

# Select example (idx = 5)
idx = 5  
actual = y_val.iloc[idx]
predicted = val_predictions[idx]
age_value = X_val.iloc[idx]['Age']

# Improve axis labels and title
plt.xlabel("Age of House (Years)", fontsize=13, labelpad=15)
plt.ylabel("Sale Price ($)", fontsize=13, labelpad=15)
plt.title("Age vs. Sale Price with Regression Line", fontsize=15, pad=20, fontweight="bold")

# Add a legend with a better location
plt.legend(frameon=True, fontsize=11, loc="upper right")

# Show the final plot
plt.show()

#### Plot Age vs. Actual Sale Price, with Regression Line for Predicted Price – Highlighting One Data Point

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set a modern Seaborn style
sns.set_style("whitegrid")
sns.set_palette("muted")

# Create figure with better size
plt.figure(figsize=(9, 6))

# Scatter plot of Age vs. Actual Sale Price
sns.scatterplot(x=X_val['Age'], y=y_val, alpha=0.6, s=50, label="Actual Prices", color="royalblue")

# Fit a regression line (Predicted SalePrice vs. Age) with a slightly thicker line
sns.regplot(x=X_val['Age'], y=val_predictions, scatter=False, color="red",
            label="Regression Line", line_kws={"linewidth": 1.5})

# Select example (idx = 5)
idx = 5  
actual = y_val.iloc[idx]
predicted = val_predictions[idx]
age_value = X_val.iloc[idx]['Age']

# Highlight idx = 5 with distinct colors
plt.scatter(age_value, actual, color="blue", s=50, edgecolor="black", label="Actual (Idx 5)", zorder=3)
plt.scatter(age_value, predicted, color="red", s=50, edgecolor="black", label="Predicted (Idx 5)", zorder=3)

# Draw a vertical error line between actual and predicted values
plt.plot([age_value, age_value], [actual, predicted], linestyle="--", color="black", linewidth=1.5)

# Annotate the actual and predicted values with better positioning
plt.text(age_value, actual - 11500, f"Actual: {actual:,.0f}", fontsize=11, color="blue", ha="right", va='top', fontweight="bold")
plt.text(age_value, predicted + 9500, f"Predicted: {predicted:,.0f}", fontsize=11, color="darkred", ha="left", va='bottom', fontweight="bold")

# Improve axis labels and title
plt.xlabel("Age of House (Years)", fontsize=13, labelpad=15)
plt.ylabel("Sale Price ($)", fontsize=13, labelpad=15)
plt.title("Age vs. Sale Price with Regression Line", fontsize=15, pad=20, fontweight="bold")

# Add a legend with a better location
plt.legend(frameon=True, fontsize=11, loc="upper right")

# Show the final plot
plt.show()


### Interpreting Figures
Our scatter plots suggest that our linear regression model is **systematically underpredicting** sale prices, especially for higher-priced homes. There are a few key reasons why this might be happening:

#### Feature Selection: Using Only 'Age' as a Predictor
   - The house **age alone is likely not sufficient** to predict sale price accurately.
   - Housing prices depend on multiple factors like square footage, number of bedrooms, location, lot size, condition, and more.
   - The model might be too simple (high bias), leading to poor predictive performance.

#### Non-Linear Relationship
   - Housing prices **may not have a simple linear relationship with age**.
   - Older houses could either be more valuable (historic homes) or less valuable (due to depreciation), creating **a non-monotonic relationship**.
   - A **log transformation** on price might help capture a non-linear pattern.

#### Skewed Data Distribution
   - If sale prices have a **right-skewed distribution** (a long tail of high-priced houses), linear regression may struggle.
   - You can check this with `sns.histplot(y, bins=50, kde=True)`.
   - A log transformation (`np.log1p(y)`) might improve predictions.

#### Heteroscedasticity
   - The variance in errors seems to increase as actual sale price increases.
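   - This violates the constant-variance assumption of linear regression. The coefficient estimates remain unbiased, but they become less efficient and the usual standard errors are unreliable.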
   - **A log-transformed regression** could stabilize variance.

#### Solutions:
1. **Add more features** (e.g., square footage, number of bedrooms, location dummy variables).
2. **Use a log-transformed target variable:**
   ```python
   y = np.log1p(train['SalePrice'])
   ```
   And then exponentiate predictions when interpreting:
   ```python
   y_pred = np.expm1(model.predict(X_test))
   ```
3. **Try polynomial regression or non-linear models** if `Age` has a complex effect.
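
As an illustration of the third option, here is a minimal sketch of polynomial regression using scikit-learn's `PolynomialFeatures` in a pipeline. The data are synthetic and deliberately U-shaped (an assumption made for the demo, not a finding about the Ames data), so a straight line fits poorly while a quadratic fits well. Both scores are in-sample R² values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical U-shaped data: very new and very old houses cost more
rng = np.random.default_rng(0)
age = rng.uniform(0, 100, size=200).reshape(-1, 1)
price = 200_000 + 40 * (age[:, 0] - 50) ** 2 + rng.normal(0, 5_000, size=200)

# Straight-line fit vs. quadratic fit on the same data
linear_fit = LinearRegression().fit(age, price)
poly_fit = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(age, price)

r2_linear = linear_fit.score(age, price)
r2_poly = poly_fit.score(age, price)
print(f"R² linear: {r2_linear:.3f}, R² quadratic: {r2_poly:.3f}")
```

On real data the gain is usually far less dramatic, and it should be judged on the validation set rather than on the training data.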

# Machine Learning Step 5: Generate Predictions and Submit
In this stage, the trained and optimized model is used to generate predictions on the test dataset. These predictions are then formatted according to the competition’s submission requirements, ensuring they align with the expected structure. Before submitting, it’s important to perform sanity checks to avoid common mistakes, such as data leakage or incorrect indexing. Once submitted, the Kaggle leaderboard provides feedback on the model’s real-world performance, helping assess its competitiveness.


#### Make Predictions on `test.csv` and Generate Submission File

Let's load the `test.csv` file from the GitHub repository.

In [None]:
test_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/test.csv'
test = pd.read_csv(test_url)
print(len(test))
test.head()

#### Create `Age`

In [None]:
test['Age'] = 2025 - test['YearBuilt']
test[['Age']].describe().T

In [None]:
#Check for missing values
print("Missing values in Age column:", test["Age"].isnull().sum())

In [None]:
# Select the same predictor variables as in training
X_test = test[features]
X_test

In [None]:
# Generate predictions for Kaggle test set using the trained model
test_predictions = model.predict(X_test)
print('# of predictions:', len(test_predictions))
test_predictions[:5]

In [None]:
# Check whether any predictions are negative (house prices cannot be negative)
print(f"Min SalePrice: {test_predictions.min()}")
print(f"Max SalePrice: {test_predictions.max()}")

# If any predictions are negative, run the following line:
#test_predictions = np.maximum(test_predictions, 0)

In [None]:
# Add predictions to test dataset
test['SalePrice'] = test_predictions

In [None]:
# Create submission file
submission_df = pd.DataFrame({"Id": test["Id"], "SalePrice": test_predictions})
submission_df.info()

In [None]:
submission_df

#### Save File

In [None]:
#Save file
submission_df.to_csv("submission.csv", index=False)
print("Submission file saved as 'submission.csv'")
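
Before uploading, it can help to run a few sanity checks of the kind mentioned at the start of this step. The helper below is a hypothetical sketch (`check_submission` is not part of any library); it assumes the submission needs exactly the `Id` and `SalePrice` columns, as in the file created above, and is demonstrated on two tiny made-up frames.

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(submission)}")
    if submission.isnull().any().any():
        problems.append("missing values present")
    if "Id" in submission and submission["Id"].duplicated().any():
        problems.append("duplicate Id values")
    if "SalePrice" in submission and (submission["SalePrice"] <= 0).any():
        problems.append("non-positive SalePrice values")
    return problems

# Made-up examples: one clean file, one with a repeated Id and a negative price
good = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, 150000.0]})
bad = pd.DataFrame({"Id": [1461, 1461], "SalePrice": [120000.0, -5.0]})

print(check_submission(good, expected_rows=2))
print(check_submission(bad, expected_rows=2))
```

For the real file, the call would look like `check_submission(submission_df, expected_rows=len(test))`.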

#### Sidebar – Exploring the Predictions

In [None]:
#Summary statistics for the predicted sale prices
submission_df['SalePrice'].describe()

#### Histogram of Predicted Sales Prices
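As you can see below, the histogram hints at our model not being particularly strong. Because the model has a single predictor, each predicted price is just a linear function of `Age`, so the histogram mirrors the distribution of house ages. A stronger model would produce a distribution closer to that of actual sale prices.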

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set Seaborn style
sns.set_theme(style="whitegrid")

# Create a histogram with KDE (density) curve
plt.figure(figsize=(10, 6))
sns.histplot(submission_df['SalePrice'], bins=30, kde=True, color='royalblue', edgecolor='black')

# Add labels and title
plt.xlabel("Predicted SalePrice", fontsize=14, labelpad=15)
plt.ylabel("Frequency", fontsize=14, labelpad=15)
plt.title("Distribution of Predicted House Prices", fontsize=16, pad=20)

# Show the plot
plt.show()

# Machine Learning Step 6: Iterate and Improve
Machine learning is an iterative process, and refining the approach is key to achieving better results. After reviewing leaderboard scores and validation metrics, potential improvements—such as trying different models, engineering new features, fine-tuning hyperparameters, or adjusting preprocessing techniques—can be explored. Comparing multiple approaches and leveraging ensemble methods often lead to better generalization. Continuous iteration and learning from past submissions help enhance performance and ranking over time.

### Generic Steps to Improve the Model
#### 1. Transform Variables
  - Log-transform `SalePrice`
  - Normalization
  - Standardization

#### 2. Add Variables
  - Test different variables

#### 3. Test Different Models
  - Instead of Linear Regression, try Decision Tree, Random Forest, XGBoost, etc.

#### 4. Hyperparameter Tuning
  - Use `GridSearchCV` to test different hyperparameters
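
A minimal sketch of what a grid search can look like, using a decision tree on synthetic data. The data, the parameter grid, and the variable names are all assumptions chosen for illustration; they are not tuned recommendations for the housing dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for the housing features and target
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(300, 1))
y_demo = 250_000 - 1_300 * X_demo[:, 0] + rng.normal(0, 20_000, size=300)

# Every combination of these values is tried with 5-fold cross-validation
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV RMSE:", -search.best_score_)
```

scikit-learn reports error-based scores as negative numbers so that "higher is better" holds for every scorer, hence the sign flip when printing.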

#### Our Original Model with `Age` as Sole Predictor

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

features = ['Age']

X = train[features]
y = train['SalePrice']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Initiate and run linear regression model
model =  LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on validation set
val_predictions = model.predict(X_val)

#Calculate RMSLE
rmsle_score = rmsle(y_val, val_predictions)  # Use the helper above so negative predictions are clipped to 0
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_score:.4f}")

### Log Transform `SalePrice`
Log-transforming `SalePrice` helps stabilize the variance and ensures that the model treats price differences proportionally, meaning a \\$10,000 increase matters more for a \\$100,000 house than for a \\$1,000,000 house.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# Define features
features = ['Age']

# Select independent (X) and dependent (y) variables
X = train[features]
y = np.log1p(train['SalePrice'])  # Apply log transformation to target variable

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(np.expm1(y_true), np.expm1(y_pred)))  # Convert back to original scale

# Initiate and train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on validation set
val_predictions = model.predict(X_val)

# Calculate RMSLE on original scale
rmsle_value = rmsle(y_val, val_predictions)
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_value:.4f}")

### Log Transform `Age`
Log-transforming `Age` compresses the gaps between very old houses, which can make the relationship between age and sale price more linear and therefore easier for a linear model to fit.
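
As a quick illustration of what `np.log1p` does to a feature like `Age`, here is a small sketch with made-up ages. `log1p(x)` computes `log(1 + x)`, so a brand-new house with an age of 0 is handled without error.

```python
import numpy as np

ages = np.array([0, 1, 10, 50, 100])   # made-up house ages in years
log_ages = np.log1p(ages)              # log(1 + age); age 0 maps to 0

gap_young = log_ages[2] - log_ages[1]  # going from 1 to 10 years
gap_old = log_ages[4] - log_ages[3]    # going from 50 to 100 years

print(np.round(log_ages, 3))
print(f"Gap 1 -> 10 years:   {gap_young:.3f}")
print(f"Gap 50 -> 100 years: {gap_old:.3f}")
```

On the log scale, the 50-year jump between two old houses is smaller than the 9-year jump between two young ones, so very old houses pull less on the fitted line.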

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

# Define features
features = ['Age']

# Select independent (X) and dependent (y) variables
X = np.log1p(train[features])  # Log-transform Age
y = train['SalePrice']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Initiate and train linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on validation set
val_predictions = model.predict(X_val)

# Calculate RMSLE
rmsle_value = rmsle(y_val, val_predictions)
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_value:.4f}")


### Add More Variables

Let's add the variables we modified: `LotFrontage` and `High_Quality`

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error

features = ['Age', 'LotFrontage', 'High_Quality']

X = train[features]
y = train['SalePrice']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))  # Ensure predictions are non-negative

# Initiate and run linear regression model
model =  LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on validation set
val_predictions = model.predict(X_val)

# Calculate RMSLE with the helper defined above (it clips negative predictions to 0)
rmsle_score = rmsle(y_val, val_predictions)
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_score:.4f}")

### Feature Importance Analysis

Now let's analyze *feature importance* to understand which factors have the greatest impact on house prices. This helps us *interpret the model's decisions*, prioritize key variables, and identify potential areas for improvement (e.g., adding more relevant features or transforming existing ones). By ranking coefficients, we gain insights into which characteristics buyers value most, guiding both predictive modeling and real-world decision-making in housing markets.


---

##### Explanation 

Above we have performed a *linear regression analysis* to understand how three different features (`Age`, `LotFrontage`, and `High_Quality`) influence house sale prices. Linear regression finds the best-fitting line that predicts the sale price based on these features.  

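The regression model assigns a *coefficient* to each feature, representing how much the sale price changes when that feature increases by one unit, *holding all else constant*. Larger coefficients (in absolute terms) mean a larger price change per one-unit increase. Because our features are measured in different units (years, feet, and a 0/1 flag), raw coefficients are only a rough guide to relative importance.  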

Let's first create a dataframe holding the trained regression coefficients.


---


In [None]:
coef_df = pd.DataFrame({"Feature": features, "Coefficient": model.coef_})
coef_df = coef_df.sort_values(by="Coefficient", key=abs, ascending=False)  # Sort by absolute value
coef_df

#### Interpreting the Coefficients
| Feature       | Coefficient | Interpretation |
|--------------|------------|---------------|
| *High_Quality* | 79,696.51 | Houses classified as "High Quality" sell for **\\$79,696 more** on average, compared to non-high-quality houses, all else being equal. This is the most important factor in the model. |
| *LotFrontage* | 712.52 | For each additional foot of street frontage, the sale price **increases by \\$712.52**, assuming all other factors stay the same. |
| *Age* | -568.40 | Each additional year of house age **reduces** the sale price by **\\$568.40**, meaning older houses tend to sell for less. The negative sign confirms this expected trend. |

---

#### **Key Takeaways**
1. *High_Quality* has the *largest positive impact* on sale price, making it the most important factor in the model.  
2. *LotFrontage* also has a positive effect—larger frontages increase sale prices, but its impact is much smaller than High_Quality.  
3. *Age has a negative coefficient*, meaning older homes tend to be valued lower, though the impact is moderate.  

The visualization further reinforces these insights by showing the magnitude of each coefficient. 
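
One caveat when ranking raw coefficients: a one-unit change means something different for each feature (one year, one foot, or switching a 0/1 flag). A common way to put them on a comparable footing is to express each effect per standard deviation of the feature. The sketch below does this on synthetic data; the variable names mirror our features, but the "true" effects and noise level are invented for illustration and are not the notebook's results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for our three features (all numbers are invented)
age = rng.uniform(0, 100, n)
lot_frontage = rng.uniform(20, 150, n)
high_quality = rng.integers(0, 2, n).astype(float)
price = (150_000 - 500 * age + 700 * lot_frontage
         + 80_000 * high_quality + rng.normal(0, 10_000, n))

X_demo = np.column_stack([age, lot_frontage, high_quality])
design = np.column_stack([np.ones(n), X_demo])       # prepend an intercept column
beta, *_ = np.linalg.lstsq(design, price, rcond=None)  # ordinary least squares
raw_coefs = beta[1:]                                  # drop the intercept

# Effect of a one-standard-deviation change in each feature
per_sd_effects = raw_coefs * X_demo.std(axis=0)

for name, raw, per_sd in zip(["Age", "LotFrontage", "High_Quality"], raw_coefs, per_sd_effects):
    print(f"{name:>12}: raw = {raw:>10.2f}   per SD = {per_sd:>10.2f}")
```

In this synthetic example the ranking stays the same, but the gap between `High_Quality` and `LotFrontage` shrinks considerably once each effect is measured per standard deviation rather than per unit.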

--- 

#### Visualization

To further highlight the relative importance of the variables, let's generate a Feature Importance Plot.


In [None]:
plt.figure(figsize=(8, 6))
sns.barplot(x="Coefficient", y="Feature", data=coef_df, hue="Feature", palette="Blues_r")

plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.title("Feature Importance (Regression Coefficients)")
plt.show()

## Try Different Models

Now let's try out different models. We will test multiple machine learning models to compare their performance in predicting house prices. Instead of relying only on linear regression, we are evaluating a mix of **linear models (Ridge, Lasso, ElasticNet), tree-based models (DecisionTree, RandomForest, XGBoost), and a kernel-based model (SVR)**.  

By training each model on the same dataset and computing the **Root Mean Squared Logarithmic Error (RMSLE)** for validation predictions, we can determine which model generalizes best. This process helps us **identify the most accurate and robust approach** for this specific problem, guiding model selection for final predictions.  

We will be using the same **train-validation split and features** – `Age`, `LotFrontage`, `High_Quality` – so we will not re-run those parts of the code.  

---


We will test the following **eight machine learning models**:  

#### **1️⃣ Linear Regression (The Straight-Line Approach)**  
- **How it works**: Assumes that house prices change in a **straight-line relationship** with features (e.g., if `YearBuilt` goes up, price increases by a fixed amount).  
- **Pros**: Simple, interpretable.  
- **Cons**: Can't capture complex patterns.  

---

#### **2️⃣ Ridge Regression (Prevents Overfitting)**  
- **How it works**: A linear regression model that **reduces overfitting** by adding a penalty (L2) that shrinks large coefficient values toward zero.  
- **Why it's useful**: Helps stabilize predictions when features are highly correlated.  

---

#### **3️⃣ Lasso Regression (Feature Selection Model)**  
- **How it works**: Similar to Ridge Regression, but **removes less important features** by shrinking some coefficients to zero.  
- **Why it's useful**: Automatically selects the most important features, simplifying the model.  

---

#### **4️⃣ ElasticNet (Balanced Regularization Model)**  
- **How it works**: A combination of **Lasso and Ridge Regression** that balances feature selection and coefficient shrinkage.  
- **Why it's useful**: Helps when **some features should be removed** while others need **regularization**.  
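
The difference between the three penalties is easiest to see on a toy problem. The sketch below uses synthetic data in which only the first of three features actually drives the target; the data and the `alpha` values are illustrative assumptions, not tuned settings for the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=1.0, size=200)  # only feature 0 matters

fitted = {
    "LinearRegression": LinearRegression().fit(X_toy, y_toy),
    "Ridge": Ridge(alpha=1.0).fit(X_toy, y_toy),                      # L2 penalty
    "Lasso": Lasso(alpha=1.0).fit(X_toy, y_toy),                      # L1 penalty
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_toy, y_toy),  # mix of L1 and L2
}

for name, est in fitted.items():
    print(f"{name:>16}: {np.round(est.coef_, 3)}")
```

Ridge keeps all three coefficients but makes them slightly smaller, while Lasso sets the coefficients of the two irrelevant features to exactly zero. ElasticNet sits between the two, with `l1_ratio` controlling the mix.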

---

#### **5️⃣ Decision Tree (The Rule-Based Approach)**  
- **How it works**: Think of this model as a series of **Yes/No questions** that split the data into groups based on features.  
  - Example: *Is the house built after 2000?* → If yes, go to the next rule.  
- **Why it's useful**: Can handle **non-linear relationships** in the data.  
- **Cons**: Can overfit if the tree is too deep.  

---

#### **6️⃣ Random Forest (The Team Decision Tree)**  
- **How it works**: Instead of using just one decision tree, this model **combines multiple decision trees** and takes the average of their predictions.  
- **Why it's useful**: More stable, less prone to overfitting than a single tree, and captures **complex relationships** in the data.  

---

#### **7️⃣ XGBoost (The Smartest Tree Model)**  
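- **How it works**: Like Random Forest, it combines many trees, but instead of building them independently and averaging, XGBoost builds them **one after another**, with each new tree trained to correct the errors of the trees before it.  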
- **Why it's useful**: Often one of the **most powerful models** for structured data.  
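
The "learning from mistakes" idea can be sketched in a few lines: start from the average price, then repeatedly fit a small tree to the current residuals and add a fraction of its predictions. This is a simplified illustration of residual-based boosting on synthetic data, not XGBoost's actual algorithm, which adds regularization and other refinements. The learning rate, tree depth, and number of rounds are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 1))
y_toy = 200_000 + 50_000 * np.sin(X_toy[:, 0]) + rng.normal(0, 5_000, 300)  # synthetic prices

learning_rate = 0.5
prediction = np.full_like(y_toy, y_toy.mean())      # round 0: predict the average price
mse_per_round = [np.mean((y_toy - prediction) ** 2)]

for _ in range(20):
    residuals = y_toy - prediction                   # the current "mistakes"
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)  # nudge predictions toward the targets
    mse_per_round.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE before boosting: {mse_per_round[0]:,.0f}")
print(f"Training MSE after 20 rounds: {mse_per_round[-1]:,.0f}")
```

Each round reduces the training error a little further. In practice the number of rounds and the learning rate are tuned on validation data, because training error alone keeps falling even when the model starts to overfit.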

---

#### **8️⃣ Support Vector Regression (SVR)**  
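- **How it works**: Instead of penalizing every error, SVR fits a function with a **tolerance band (margin)** around it; errors that fall inside the band are ignored, and only points outside it are penalized.  
- **Why it's useful**: With a nonlinear kernel (the scikit-learn default is RBF), it can capture **nonlinear relationships** that standard linear regression cannot.  
- **Cons**: Can be slow on large datasets, is sensitive to the scale of the features and target, and performed worst in our analysis.  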

---


#### **📊 Summary of Models**
| Model | Purpose |
|-------|---------|
| **Linear Regression** | Baseline model, assumes a linear relationship | 
| **Ridge Regression** | Reduces overfitting by **shrinking coefficients** |
| **Lasso Regression** | Shrinks **and removes** irrelevant features | 
| **ElasticNet** | Balances **feature selection (L1) and shrinkage (L2)** | 
| **Decision Tree** | Splits data using **rule-based conditions** | 
| **Random Forest** | Uses **multiple decision trees** to improve stability | 
| **XGBoost** | Learns from **previous mistakes** to improve predictions | 
| **SVR** | Uses a **flexible margin** instead of a single line | 

---


## Key Question: Which Model Will Perform Best?

---


### Code Explanation

1. **Define a dictionary of models (`models`)**  
   - Several regression models are stored in a dictionary with their names as keys and model objects as values.  
   - Models include **Linear Regression, Ridge, Lasso, ElasticNet, Decision Tree, Random Forest, XGBoost, and SVR**.  

2. **Create an empty list (`results`)**  
   - This list will store the performance of each model.  

3. **Loop through each model, train it, and evaluate it**  
   - For each model:  
     - It is trained using `X_train` and `y_train`.  
     - It makes predictions on `X_val` (validation data).  
     - The **Root Mean Squared Logarithmic Error (RMSLE)** is calculated to measure model performance.  
     - The model's name and its RMSLE score are added to the `results` list.  

4. **Convert results into a Pandas DataFrame (`results_df`)**  
   - The results are stored in a DataFrame and sorted in **ascending order by RMSLE**, so the best-performing model appears first.  
   
#### **Why Are We Doing This?**  
This approach allows us to **compare multiple models efficiently** and determine which one gives the best predictions for house prices. It helps us make an informed decision on **which model to use in the final analysis**.     
   

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_log_error

# Define models with default parameters
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'DecisionTree': DecisionTreeRegressor(),
    'RandomForest': RandomForestRegressor(),
    'XGBoost': XGBRegressor(verbosity=0),
    'SVR': SVR(),
}

# Create empty list for storing results
results = []

# Train each model in a loop, saving the model name and RMSLE score into the results list
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    rmsle_score = rmsle(y_val, y_pred)
    results.append({'Model': name, 'RMSLE': rmsle_score})

# Convert results to DataFrame; sort by RMSLE and output
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSLE')
results_df

#### Summary of Model Performance

- **🏆 Best Overall Model** → **Random Forest** (lowest RMSLE).  
- **Worst Model** → **SVR** (highest RMSLE).  
- **Tree-based models (Random Forest, XGBoost, Decision Tree) outperformed linear models.**  
- **Regularized linear models (Ridge, Lasso, ElasticNet) performed similarly.**  

---

#### Takeaways

- **Random Forest** was the best-performing model for this dataset.  
- **Tree-based models** (Random Forest, XGBoost, Decision Tree) captured complex patterns better than linear models.  
- **Regularized linear models (Ridge, Lasso, ElasticNet) performed similarly** and were slightly better than standard Linear Regression.  
- **SVR performed the worst**, most likely because it was run with default hyperparameters on unscaled features and an unscaled, dollar-valued target, a setting that SVR handles poorly.  
- **Feature selection matters**—adding more meaningful features (e.g., neighborhood, number of bathrooms) could improve accuracy.  
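
SVR's weak showing is worth a closer look, because it is sensitive to the scale of both the features and the target. The sketch below compares a default `SVR` with one wrapped in scaling steps, using synthetic data whose columns stand in for `Age` and `LotFrontage`. The data, the plain `rmse` helper, and the choice of `StandardScaler` for both inputs and target are assumptions for illustration; we have not run this on the housing data.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 400
X_demo = np.column_stack([rng.uniform(0, 100, n), rng.uniform(20, 150, n)])
y_demo = 150_000 - 500 * X_demo[:, 0] + 700 * X_demo[:, 1] + rng.normal(0, 10_000, n)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

def rmse(y_true, y_pred):
    """Plain root mean squared error, in dollars."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Default SVR on raw features and a raw dollar target
default_svr = SVR().fit(X_tr, y_tr)

# Same SVR, but with standardized features and a standardized target
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

rmse_default = rmse(y_va, default_svr.predict(X_va))
rmse_scaled = rmse(y_va, scaled_svr.predict(X_va))
print(f"Default SVR RMSE: {rmse_default:,.0f}")
print(f"Scaled SVR RMSE:  {rmse_scaled:,.0f}")
```

`TransformedTargetRegressor` scales the target before fitting and converts predictions back to dollars automatically, so the two errors are directly comparable.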

**Next Steps:** Try adding more meaningful features and/or hyperparameter tuning, to further improve accuracy.
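
As a starting point for hyperparameter tuning, here is a minimal `GridSearchCV` sketch for a Random Forest. It runs on synthetic data so that it is self-contained; the parameter grid is a small illustrative assumption rather than a recommended search space. With our data you would pass `X_train` and `y_train` to `search.fit` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import GridSearchCV

def rmsle(y_true, y_pred):
    """Same RMSLE helper as in the cells above."""
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))

# Synthetic stand-in data (columns mimic Age and LotFrontage; numbers are invented)
rng = np.random.default_rng(0)
n = 300
X_demo = np.column_stack([rng.uniform(0, 100, n), rng.uniform(20, 150, n)])
y_demo = 150_000 - 500 * X_demo[:, 0] + 700 * X_demo[:, 1] + rng.normal(0, 10_000, n)

# Small illustrative grid: 2 x 2 = 4 combinations, each evaluated with 3-fold CV
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring=make_scorer(rmsle, greater_is_better=False),  # lower RMSLE is better
    cv=3,
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated RMSLE: {-search.best_score_:.4f}")
```

Because scikit-learn always maximizes the score, `greater_is_better=False` makes the scorer return the negative RMSLE, which is why the sign is flipped when printing `best_score_`.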


### Select the Best Model for Generating Updated Submission File
First, select the best model and re-generate predictions on the validation set without retraining.

In [None]:
# Retrieve the trained model without retraining
model = models['RandomForest']   # No re-fitting, just using the stored model
#model.fit(X_train, y_train)  #If you want to retrain the model, uncomment this line

# Generate predictions on validation set
val_predictions = model.predict(X_val) # Model is already trained, just predict

# Evaluate model performance on validation data
rmsle_score = rmsle(y_val, val_predictions)
print("RMSLE:", rmsle_score)

<br>Alternative: Extract best model from `results_df` programmatically

In [None]:
# Extract the best model from results_df based on lowest RMSLE
best_model_name = results_df.loc[results_df['RMSLE'].idxmin(), 'Model']
print(f"Best Model: {best_model_name}")

model = models[best_model_name] # No re-fitting, just using the stored model
#model.fit(X_train, y_train)  #If you want to retrain the model, uncomment this line

# Generate predictions on validation set
val_predictions = model.predict(X_val) # Model is already trained, just predict

# Evaluate model performance on validation data
rmsle_score = rmsle(y_val, val_predictions)
print("RMSLE:", rmsle_score)

#### Generate Updated Predictions on `test.csv` and Submission File
Now we can apply the model to `test.csv`. We'll also reproduce the descriptives and the histogram to see if our distribution of predicted `SalePrice` has changed.

In [None]:
#Create the binary variable
test['High_Quality'] = test['OverallQual'].apply(lambda x: 1 if x >= 7 else 0)
test['High_Quality'].value_counts()

In [None]:
print("Missing values in High_Quality column:", test["High_Quality"].isnull().sum())

In [None]:
test['LotFrontage'] = test['LotFrontage'].fillna(test["LotFrontage"].median())
print("Missing values in LotFrontage column:", test["LotFrontage"].isnull().sum())
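
The cell above fills the gaps with the median of the test data itself. A common alternative is to reuse the median computed on the training data, so that both datasets are preprocessed with the same value. The sketch below shows the idea on two tiny made-up DataFrames (`train_demo` and `test_demo` are hypothetical names, not our real data).

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for train and test
train_demo = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0]})
test_demo = pd.DataFrame({"LotFrontage": [np.nan, 120.0, np.nan]})

# Compute the fill value once, on the training data only
train_median = train_demo["LotFrontage"].median()

# Apply the same value to both datasets
train_demo["LotFrontage"] = train_demo["LotFrontage"].fillna(train_median)
test_demo["LotFrontage"] = test_demo["LotFrontage"].fillna(train_median)

print("Fill value from training data:", train_median)
print(test_demo)
```

Either approach removes the missing values. Reusing the training statistic simply keeps the test set from influencing how the data are prepared.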

In [None]:
# Select the same predictor variables as in training
X_test = test[features]
X_test[:2]

In [None]:
X_test.info()

In [None]:
# Generate predictions for Kaggle test set using the trained model
test_predictions = model.predict(X_test)
print('# of predictions:', len(test_predictions))
test_predictions[:5]

In [None]:
# Check the range of the predictions (house prices cannot be negative)
print(f"Min SalePrice: {test_predictions.min()}")
print(f"Max SalePrice: {test_predictions.max()}")

In [None]:
# Add predictions to test dataset
test['SalePrice'] = test_predictions

# Create submission file
submission_df = pd.DataFrame({"Id": test["Id"], "SalePrice": test_predictions})
submission_df.info()

In [None]:
submission_df

In [None]:
#Save file
submission_df.to_csv("submission.csv", index=False)
print("Submission file saved as 'submission.csv'")
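
Before uploading, it can help to run a few sanity checks on the submission. The helper below, `check_submission`, is a hypothetical function written for this sketch; it checks the column names, the `Id` values, and that every `SalePrice` is present and positive. The demo rows are made up.

```python
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        problems.append("columns should be exactly ['Id', 'SalePrice']")
        return problems  # the remaining checks need both columns
    if len(submission) != len(expected_ids) or set(submission["Id"]) != set(expected_ids):
        problems.append("Id values do not match the test set")
    if submission["SalePrice"].isnull().any():
        problems.append("SalePrice contains missing values")
    if (submission["SalePrice"] <= 0).any():
        problems.append("SalePrice contains non-positive values")
    return problems

# Made-up example with one deliberately bad prediction
demo = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [125000.0, 180500.0, -10.0]})
demo_problems = check_submission(demo, expected_ids=[1461, 1462, 1463])
print(demo_problems)
```

With our data, the call would look like `check_submission(submission_df, test["Id"])`, and an empty list would indicate that none of these basic problems were found.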

In [None]:
# Summary statistics (count, mean, quartiles, etc.) of the predicted prices
submission_df['SalePrice'].describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Set Seaborn style
sns.set_theme(style="whitegrid")

# Create a histogram with KDE (density) curve
plt.figure(figsize=(10, 6))
sns.histplot(submission_df['SalePrice'], bins=30, kde=True, color='royalblue', edgecolor='black')

# Add labels and title
plt.xlabel("Predicted SalePrice", fontsize=14, labelpad=15)
plt.ylabel("Frequency", fontsize=14, labelpad=15)
plt.title("Distribution of Predicted House Prices", fontsize=16, pad=20)

# Show the plot
plt.show()