<p><a href="https://colab.research.google.com/github/gdsaxton/GDAN5400/blob/main/Week%209%20Notebooks/GDAN%205400%20-%20Week%209%20Class%20Notebook_Quick_Start_Code.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a></p>

# Kaggle Competition: Housing Prices – Advanced Regression Techniques

In today's class, as well as in coding assignment #5 and the final project, we will use the [Housing Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition on Kaggle.


---


**Data Dictionary**

| **Feature**         | **Description** |
|---------------------|----------------|
| SalePrice      | Property's sale price in dollars (target variable) |
| MSSubClass     | Building class |
| MSZoning       | General zoning classification |
| LotFrontage    | Linear feet of street connected to property |
| LotArea        | Lot size in square feet |
| Street         | Type of road access |
| Alley          | Type of alley access |
| LotShape       | General shape of property |
| LandContour    | Flatness of the property |
| Utilities      | Type of utilities available |
| LotConfig      | Lot configuration |
| LandSlope      | Slope of property |
| Neighborhood   | Physical locations within Ames city limits |
| Condition1     | Proximity to main road or railroad |
| Condition2     | Proximity to main road or railroad (if a second is present) |
| BldgType       | Type of dwelling |
| HouseStyle     | Style of dwelling |
| OverallQual    | Overall material and finish quality |
| OverallCond    | Overall condition rating |
| YearBuilt      | Original construction date |
| YearRemodAdd   | Remodel date |
| RoofStyle      | Type of roof |
| RoofMatl       | Roof material |
| Exterior1st    | Exterior covering on house |
| Exterior2nd    | Exterior covering on house (if more than one material) |
| MasVnrType     | Masonry veneer type |
| MasVnrArea     | Masonry veneer area in square feet |
| ExterQual      | Exterior material quality |
| ExterCond      | Present condition of exterior material |
| Foundation     | Type of foundation |
| BsmtQual       | Height of the basement |
| BsmtCond       | General condition of the basement |
| BsmtExposure   | Walkout or garden level basement walls |
| BsmtFinType1   | Quality of basement finished area |
| BsmtFinSF1     | Type 1 finished square feet |
| BsmtFinType2   | Quality of second finished area (if present) |
| BsmtFinSF2     | Type 2 finished square feet |
| BsmtUnfSF      | Unfinished square feet of basement area |
| TotalBsmtSF    | Total square feet of basement area |
| Heating        | Type of heating |
| HeatingQC      | Heating quality and condition |
| CentralAir     | Central air conditioning (Yes/No) |
| Electrical     | Electrical system type |
| 1stFlrSF       | First floor square feet |
| 2ndFlrSF       | Second floor square feet |
| LowQualFinSF   | Low quality finished square feet (all floors) |
| GrLivArea      | Above grade (ground) living area square feet |
| BsmtFullBath   | Basement full bathrooms |
| BsmtHalfBath   | Basement half bathrooms |
| FullBath       | Full bathrooms above grade |
| HalfBath       | Half bathrooms above grade |
| BedroomAbvGr   | Number of bedrooms above basement level |
| KitchenAbvGr   | Number of kitchens |
| KitchenQual    | Kitchen quality |
| TotRmsAbvGrd   | Total rooms above grade (excludes bathrooms) |
| Functional     | Home functionality rating |
| Fireplaces     | Number of fireplaces |
| FireplaceQu    | Fireplace quality |
| GarageType     | Garage location |
| GarageYrBlt    | Year garage was built |
| GarageFinish   | Interior finish of the garage |
| GarageCars     | Garage size in car capacity |
| GarageArea     | Garage size in square feet |
| GarageQual     | Garage quality |
| GarageCond     | Garage condition |
| PavedDrive     | Paved driveway presence |
| WoodDeckSF     | Wood deck area in square feet |
| OpenPorchSF    | Open porch area in square feet |
| EnclosedPorch  | Enclosed porch area in square feet |
| 3SsnPorch      | Three-season porch area in square feet |
| ScreenPorch    | Screen porch area in square feet |
| PoolArea       | Pool area in square feet |
| PoolQC         | Pool quality |
| Fence          | Fence quality |
| MiscFeature    | Miscellaneous feature not covered in other categories |
| MiscVal        | Dollar value of miscellaneous feature |
| MoSold         | Month sold |
| YrSold         | Year sold |
| SaleType       | Type of sale |
| SaleCondition  | Condition of sale |

---

### Access Full Codebook

The link below describes what the codes mean for each of the variables.

https://github.com/gdsaxton/GDAN5400/blob/main/Housing_Prices/data_description.txt


#### Load the Housing Prices `Training` Dataset  

I have uploaded the training and test datasets onto the class GitHub repository.

In [None]:
import numpy as np
import pandas as pd

#http://pandas.pydata.org/pandas-docs/stable/options.html
pd.set_option('display.max_columns', None)  #Set PANDAS to show all columns in DataFrame
pd.set_option('display.max_colwidth', 500)  #Show up to 500 characters per cell

In [None]:
train_url = 'https://raw.githubusercontent.com/gdsaxton/GDAN5400/refs/heads/main/Housing_Prices/train.csv'
train = pd.read_csv(train_url)
print('# of rows in training dataset:', len(train), '\n')
train[:2]

#### Run `info()` to inspect variables

In [None]:
train.info()

#### Identify variables with missing data

In [None]:
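# Count missing values per column once, then keep only columns that have any
missing_counts = train.isnull().sum()
missing_counts[missing_counts > 0]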

#### Fill in Missing Values for `LotFrontage`
- The `LotFrontage` column contains missing values that must be filled before modeling.  
- Use the **median** value to replace missing values, as it is less affected by outliers.  
- After filling in the missing values, verify that `LotFrontage` no longer has any missing entries.  

In [None]:
train['LotFrontage'] = train['LotFrontage'].fillna(train["LotFrontage"].median())
print("Missing values in LotFrontage column:", train["LotFrontage"].isnull().sum())
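To see why the median is the safer fill value, here is a minimal sketch on made-up frontage numbers (not the Ames data). The values, including the one extreme lot, are assumptions chosen only to illustrate the point.

```python
import numpy as np
import pandas as pd

# Hypothetical frontage values in feet: 313 is one extreme lot, NaN is a missing entry
frontage = pd.Series([60.0, 65.0, 70.0, 75.0, 80.0, 313.0, np.nan])

mean_fill = frontage.mean()      # pulled upward by the single extreme value
median_fill = frontage.median()  # middle of the observed values, ignores NaN

# Fill the missing entry with the median, as done for LotFrontage above
filled = frontage.fillna(median_fill)

print(f"Mean of observed values:   {mean_fill:.1f}")
print(f"Median of observed values: {median_fill:.1f}")
print("Missing values after fill:", filled.isnull().sum())
```

In this toy series the mean lands above five of the six observed values, while the median stays near the typical lot.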

#### Create a `High_Quality` Variable

From the codebook we see that the values of `OverallQual` are the following:

OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor

In [None]:
#Frequencies 
train['OverallQual'].value_counts().sort_index()

In [None]:
#Create the binary variable: 1 if OverallQual is 7 ("Good") or higher, otherwise 0
train['High_Quality'] = (train['OverallQual'] >= 7).astype(int)
train['High_Quality'].value_counts()

In [None]:
#Check for missing values
print("Missing values in High_Quality column:", train["High_Quality"].isnull().sum())

#### Create Variable `Age`

In [None]:
#Age of the house as of 2025 (the reference year is hard-coded)
train['Age'] = 2025 - train['YearBuilt']
train[['YearBuilt', 'Age']].describe().T

In [None]:
#Check for missing values
print("Missing values in Age column:", train["Age"].isnull().sum())

# Machine Learning Step 4: Model Selection, Training, Tuning, & Evaluation

The `Model Selection, Training, Tuning, & Evaluation` stage focuses on choosing suitable machine learning algorithms, fitting them to the data, optimizing performance, and assessing their effectiveness. It begins with selecting baseline models and training them on the processed dataset. Hyperparameter tuning (using techniques like grid search, random search, or Bayesian optimization) helps refine model performance. Evaluation metrics measure success and allow models to be compared: RMSE or RMSLE for regression problems like this one, accuracy or F1-score for classification problems. This iterative process helps identify the best-performing and most generalizable model for final predictions.
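As a rough illustration of the tuning step, the sketch below runs a grid search over one hyperparameter. It is not part of today's workflow: plain linear regression has nothing to tune, so `Ridge` is swapped in as a stand-in model, and the data, the `alpha` grid, and the variable names are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: three predictors with known weights plus noise)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Candidate values for Ridge's regularization strength (illustrative grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation; scikit-learn maximizes scores, so RMSE is negated
search = GridSearchCV(Ridge(), param_grid, scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print(f"Cross-validated RMSE: {-search.best_score_:.3f}")
```

Random search and Bayesian optimization follow the same pattern: propose hyperparameter values, score each with cross-validation, and keep the best.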


#### Select Predictors and Split the Data into Training and Validation Datasets

We will use three predictors: `Age`, `LotFrontage`, and `High_Quality`

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_squared_log_error

# Define RMSLE scoring function
def rmsle(y_true, y_pred):
    # Clip negative predictions to 0, since the log error cannot handle negative values
    return np.sqrt(mean_squared_log_error(y_true, np.maximum(y_pred, 0)))


features = ['Age', 'LotFrontage', 'High_Quality']

X = train[features]
y = train['SalePrice']

# Split the training data into training and validation sets (80% / 20%)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Generate predictions on the validation set
val_predictions = model.predict(X_val)

# Calculate regression performance metrics
rmsle_score = rmsle(y_val, val_predictions)
r2 = r2_score(y_val, val_predictions)

# Print results
print(f"Root Mean Squared Logarithmic Error (RMSLE): {rmsle_score:.4f}")
print(f"R-squared: {r2:.4f}")
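To make the RMSLE metric less of a black box, here is a small sketch that computes it by hand and compares the result with scikit-learn. The three prices and predictions are invented for the example.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical actual and predicted sale prices
y_true = np.array([100000.0, 200000.0, 300000.0])
y_pred = np.array([110000.0, 180000.0, 330000.0])

# RMSLE by hand: take log(1 + value) of both, then compute an ordinary RMSE
log_diff = np.log1p(y_pred) - np.log1p(y_true)
rmsle_by_hand = np.sqrt(np.mean(log_diff ** 2))

# Same quantity via scikit-learn
rmsle_sklearn = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"RMSLE by hand:      {rmsle_by_hand:.6f}")
print(f"RMSLE from sklearn: {rmsle_sklearn:.6f}")
```

Because the error is measured on the log scale, missing by 10% on a cheap house costs about as much as missing by 10% on an expensive one.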

#### Extract Regression Coefficients

In [None]:
# Get intercept (constant term)
intercept = model.intercept_

# Get coefficients
coefficients = model.coef_

# Print results
print(f"Intercept (Constant): {intercept}")
print("Coefficients:")
for feature, coef in zip(X_train.columns, coefficients):
    print(f"{feature}: {coef}")

#### We can use the above to save coefficient values

In [None]:
# Coefficients copied from the output of the trained model above
intercept = 132051.68970351096  # β0
age_coefficient = -568.4004881644231  # β1
lot_frontage_coefficient = 712.5244879924448  # β2
quality_coefficient = 79696.51403482091  # β3

## Regression Equation

The predicted sale price (\\(\hat{Y}\\)) is given by:

\\[
\hat{Y} = \beta_0 + \beta_1 \times \text{Age} + \beta_2 \times \text{LotFrontage} + \beta_3 \times \text{Quality}
\\]

where:
- \\( \hat{Y} \\) (Y-hat) represents the *predicted* sale price of the house.
- \\( \beta_0 \\) is the *intercept* (the predicted price when all independent variables are zero).
- \\( \beta_1 \\) is the *coefficient* for the age variable.
- \\( \beta_2 \\) is the *coefficient* for the lot frontage variable.
- \\( \beta_3 \\) is the *coefficient* for Quality, the binary `High_Quality` indicator (1 if `OverallQual` is 7 or higher, 0 otherwise).

Substituting the given coefficients:

\\[
\hat{Y} = 132051.69 - 568.40 \times \text{Age} + 712.52 \times \text{LotFrontage} + 79696.51 \times \text{Quality}
\\]
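One way to convince yourself that the equation is all the model does: fit a regression, plug one house into the equation by hand, and compare with `predict()`. The sketch below uses a small synthetic dataset so that it runs on its own. The `demo` DataFrame, its generating weights, and the `demo_*` names are assumptions for illustration, not the Ames results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with the same three predictors (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(1, 100, size=50),
    "LotFrontage": rng.uniform(30, 120, size=50),
    "High_Quality": rng.integers(0, 2, size=50),
})
demo["SalePrice"] = (
    150000
    - 500 * demo["Age"]
    + 700 * demo["LotFrontage"]
    + 80000 * demo["High_Quality"]
    + rng.normal(scale=10000, size=50)
)

demo_features = ["Age", "LotFrontage", "High_Quality"]
demo_model = LinearRegression().fit(demo[demo_features], demo["SalePrice"])

# Pair each feature name with its fitted coefficient
demo_coefs = dict(zip(demo_features, demo_model.coef_))

# Apply the regression equation by hand to the first house
house = demo[demo_features].iloc[[0]]  # one-row DataFrame
by_hand = demo_model.intercept_ + sum(demo_coefs[f] * house[f].iloc[0] for f in demo_features)
by_model = demo_model.predict(house)[0]

print(f"By hand:   {by_hand:,.2f}")
print(f"predict(): {by_model:,.2f}")
```

The two numbers agree up to floating-point rounding, which is why the coefficients alone are enough to reproduce any prediction.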


#### Now let's get mean values by running descriptives

We'll plug these values into the above equation

In [None]:
train[features].describe().T

#### Predicted Price for a House in Ames, IA of `Average Age` and `LotFrontage` and Low Quality (`High_Quality=0`)

In [None]:
# Input values for prediction
mean_age = train['Age'].mean() #53.73 
mean_frontage = train['LotFrontage'].mean() #69.86

# Compute predicted sale price using the regression equation
predicted_price = (
    intercept +
    (age_coefficient * mean_age) +
    (lot_frontage_coefficient * mean_frontage) +
    (0 * quality_coefficient)
    )

# Display result
print(f'Predicted Sale Price for House with Average Age and Lot Frontage but Low Quality: ${predicted_price:,.2f}')

#### Predicted Price for the Same House with High Quality (`High_Quality=1`)

In [None]:
# Compute predicted sale price using the regression equation
predicted_price = (
    intercept +
    (age_coefficient * mean_age) +
    (lot_frontage_coefficient * mean_frontage) +
    (1 * quality_coefficient)
    )

# Display result
print(f'Predicted Sale Price for House with Average Age and Lot Frontage and High Quality: ${predicted_price:,.2f}')
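The two predictions differ only in the quality term, so their gap should equal the quality coefficient. The sketch below checks this with a small helper. `predict_price` is a hypothetical function written for this example, the coefficients are the ones saved earlier, and the means are the rounded values from the comments above (53.73 and 69.86), so the dollar amounts may differ slightly from the cells that use the exact means.

```python
# Coefficients copied from the saved values earlier in the notebook
intercept = 132051.68970351096
age_coefficient = -568.4004881644231
lot_frontage_coefficient = 712.5244879924448
quality_coefficient = 79696.51403482091

# Rounded means taken from the comments above (assumption: close to the exact means)
mean_age_rounded = 53.73
mean_frontage_rounded = 69.86

def predict_price(age, frontage, high_quality):
    """Apply the regression equation to one house (hypothetical helper)."""
    return (
        intercept
        + age_coefficient * age
        + lot_frontage_coefficient * frontage
        + quality_coefficient * high_quality
    )

low_quality_price = predict_price(mean_age_rounded, mean_frontage_rounded, 0)
high_quality_price = predict_price(mean_age_rounded, mean_frontage_rounded, 1)

print(f"Low quality:  ${low_quality_price:,.2f}")
print(f"High quality: ${high_quality_price:,.2f}")
print(f"Difference:   ${high_quality_price - low_quality_price:,.2f}")
```

The difference matches `quality_coefficient`: holding age and lot frontage fixed, the model predicts that a high-quality house sells for about $79,697 more.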