# House Prices - Advanced Regression Techniques
Predict sales prices and practice feature engineering, RFs, and gradient boosting

**Author:** Lingsong Zeng<br>
**Date:** 12/31/2024

## Table of Contents

1. [Introduction](#Introduction)
2. [EDA](#EDA)
3. [Feature Engineering](#Feature-Engineering)
4. [Preprocessing](#Preprocessing)
5. [Modeling](#Modeling)   
    - [kNN](#kNN)
    - [SVM](#SVM)
    - [Linear Regression](#Linear-Regression)
    - [Lasso](#Lasso)
    - [Ridge](#Ridge)
    - [ElasticNet](#ElasticNet)
    - [Decision Tree](#Decision-Tree)
    - [Bagging](#Bagging)
    - [Random Forest](#Random-Forest)
    - [XGBoost](#XGBoost)
    - [LightGBM](#LightGBM)
    - [Stacking](#Stacking)
6. [References](#References)

## Introduction

### Overview

> This competition runs indefinitely with a rolling leaderboard. [Learn more](https://www.kaggle.com/docs/competitions#getting-started)

### Description

![](https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png)

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Practice Skills

- Creative feature engineering 
- Advanced regression techniques like random forest and gradient boosting

### Acknowledgments

The [Ames Housing dataset](https://www.amstat.org/publications/jse/v19n3/decock.pdf) was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

Photo by [Tom Thain](https://unsplash.com/@tthfilms) on Unsplash.

### Evaluation

#### Goal

The goal is to predict the sale price of each house: for each `Id` in the test set, predict the value of the `SalePrice` variable.
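As a rough sketch of what that output could look like, the base R snippet below builds a two-column table with one row per test-set `Id`. The Ids, the prices, the variable names (`ids`, `predicted_price`, `submission`, `out_path`) and the CSV layout are all assumptions made for illustration; none of them come from the notebook.

```r
# Hypothetical test-set Ids and predictions, for illustration only.
ids <- c(1461, 1462, 1463)
predicted_price <- c(120000, 165000.5, 189900)

# One row per test-set Id, with the predicted SalePrice alongside it.
submission <- data.frame(Id = ids, SalePrice = predicted_price)

# Write to a temporary file; a real run would choose its own path.
out_path <- tempfile(fileext = ".csv")
write.csv(submission, out_path, row.names = FALSE)
```

In practice the predictions would come from a fitted model rather than being typed in by hand.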

#### Metric

Submissions are evaluated on [Root-Mean-Squared-Error (RMSE)](https://en.wikipedia.org/wiki/Root_mean_square_deviation) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
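To make the metric concrete, here is a minimal base R sketch of an RMSE computed on log prices. The helper name `rmse_log` and the example prices are hypothetical, and this is not Kaggle's actual scoring code, only an illustration of the definition above.

```r
# Hypothetical helper: RMSE between log(predicted) and log(observed).
# Assumes both vectors are strictly positive and of equal length.
rmse_log <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed),
            all(predicted > 0), all(observed > 0))
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Made-up prices: a cheap and an expensive house, each over-predicted by 10%.
observed  <- c(100000, 500000)
predicted <- observed * 1.10

# Both houses contribute the same squared log error, log(1.10)^2,
# even though the absolute errors differ (10,000 vs 50,000).
score <- rmse_log(predicted, observed)
```

Because each house is off by the same ratio, `score` comes out at approximately `log(1.10)`, which illustrates why errors on cheap and expensive houses carry equal weight under this metric.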

In [1]:
```r
library(readr)
library(dplyr)
```


## Preprocessing

In [3]:
```r
# Construct file paths
train_path <- file.path("data", "house-prices", "raw", "train.csv")
test_path <- file.path("data", "house-prices", "raw", "test.csv")

# Read data
train <- read_csv(train_path)
test <- read_csv(test_path)

# Extract the Id column
Id <- test$Id

# Remove the Id column
train <- train %>% select(-Id)
test <- test %>% select(-Id)
```

Rows: 1460 Columns: 81
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1459 Columns: 80
── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
dbl (37): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Y