# Module 1.1: Machine Learning Workflow Fundamentals

## 🎯 Learning Objectives

By the end of this module, you will be able to:
1. **Load and explore** datasets using R's data manipulation functions
2. **Apply sampling techniques** for handling large datasets and class imbalance
3. **Preprocess data** including handling missing values and creating dummy variables
4. **Partition data** properly to detect overfitting and estimate performance on unseen data
5. **Build and evaluate** predictive models using best practices

---

## 📊 Why These Skills Matter in Business

Machine learning is transforming how businesses make decisions. But the **quality of your model depends heavily on how you prepare your data**. Consider these real-world scenarios:

| Business Problem | ML Skill Needed | This Module |
|------------------|-----------------|-------------|
| "Which customers will churn?" | Handling imbalanced classes | Part 2: Sampling |
| "How much is this property worth?" | Building regression models | Part 5: Modeling |
| "Is this transaction fraudulent?" | Proper train/test splits | Part 4: Partitioning |
| "What drives customer satisfaction?" | Feature engineering | Part 3: Preprocessing |

### The Data Science Workflow

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   COLLECT   │ →  │   EXPLORE   │ →  │  PREPROCESS │ →  │   MODEL     │ →  │  EVALUATE   │
│    Data     │    │    Data     │    │    Data     │    │   Build     │    │   & Deploy  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      ↑                                     │                                      │
      └─────────────────────────────────────┴──────────────────────────────────────┘
                                    Iterate and Improve
```

**Key Insight**: Data scientists are commonly reported to spend **60-80% of their time** on data preparation (Parts 1-4). The modeling (Part 5) is often the easy part!

### 🔑 Essential Formulas You'll Learn

Throughout this module, we'll use several key formulas:

| Formula | Description |
|---------|-------------|
| $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | Root Mean Squared Error |
| $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$ | Mean Absolute Error |
| $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \epsilon$ | Linear Regression Model |

Where:
- $y_i$ = actual value for observation $i$
- $\hat{y}_i$ = predicted value for observation $i$
- $n$ = number of observations
- $x_1, x_2, \dots$ = predictor variables
- $\beta_0, \beta_1, \dots$ = model coefficients ($\beta_0$ is the intercept)
- $\epsilon$ = error term
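
The RMSE and MAE formulas above can be computed directly in base R. The following is a minimal sketch; the `actual` and `predicted` vectors are made-up numbers chosen for illustration, not values from the course dataset.

```r
# Hypothetical actual and predicted values (illustrative numbers only)
actual    <- c(300, 350, 400, 450)
predicted <- c(310, 340, 420, 430)

# Residuals: y_i - y_hat_i
errors <- actual - predicted

# RMSE: square the errors, average them, then take the square root
rmse <- sqrt(mean(errors^2))

# MAE: average of the absolute errors
mae <- mean(abs(errors))

rmse
mae
```

Because RMSE squares each error before averaging, large errors pull it up more than they pull up MAE, so RMSE is never smaller than MAE on the same predictions.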

## 🛠️ Setup: Installing Required Packages

Before we begin, we need to install and load the required packages. We'll use the `mlba` package (Machine Learning for Business Analytics), which contains datasets and helper functions designed for learning ML concepts.

### Required Packages Overview

| Package | Purpose |
|---------|---------|
| `mlba` | Contains example datasets (WestRoxbury housing data) and helper functions |
| `tidyverse` | Data manipulation and visualization (includes dplyr, ggplot2, tidyr) |
| `caret` | Classification and Regression Training - model building and evaluation |
| `fastDummies` | Quick creation of dummy/indicator variables from categorical data |

> **Note**: The `mlba` package is installed from GitHub using `devtools::install_github()`. This is a common pattern for packages not yet on CRAN (the official R package repository).

In [12]:
# ==============================================================================
# PACKAGE INSTALLATION AND SETUP
# ==============================================================================
# This cell installs and loads ALL packages needed for this notebook.
# Run this cell first before running any other cells.
# ==============================================================================

# Set CRAN mirror for package installation
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Function to install and load packages
load_package <- function(pkg) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg)
    library(pkg, character.only = TRUE)
  }
}

# Install/load devtools first (needed for GitHub packages)
load_package("devtools")

# Install mlba from GitHub if not already installed
if (!require(mlba, quietly = TRUE)) {
  devtools::install_github("gedeck/mlba/mlba", force = TRUE)
}
library(mlba)

# Core packages for data manipulation and visualization
# tidyverse includes: dplyr, ggplot2, tidyr, readr, purrr, tibble, stringr, forcats
load_package("tidyverse")

# Additional packages used in this notebook
load_package("fastDummies")  # For creating dummy variables
load_package("caret")        # For data partitioning and model metrics

# Disable scientific notation for easier reading of large numbers
# scipen is a penalty against scientific notation; a value of 999 makes R
# print fixed notation in practically all cases
options(scipen = 999)

cat("\n✓ All packages installed and loaded successfully!\n")
cat("  - mlba: Course datasets\n")
cat("  - tidyverse: Data manipulation (dplyr, tidyr, ggplot2, etc.)\n")
cat("  - fastDummies: Dummy variable creation\n")
cat("  - caret: Data partitioning and model evaluation\n")


✓ All packages installed and loaded successfully!
  - mlba: Course datasets
  - tidyverse: Data manipulation (dplyr, tidyr, ggplot2, etc.)
  - fastDummies: Dummy variable creation
  - caret: Data partitioning and model evaluation


---

## Part 1: Preliminary Steps - Loading and Exploring Data

### 📚 Loading and Looking at the Data in R

The first step in any data science project is **understanding your data**. This exploratory phase answers critical questions:
- How many observations (rows) and features (columns) do we have?
- What types of variables are present (numeric, categorical, text)?
- Are there any obvious data quality issues (missing values, outliers)?

### About the West Roxbury Housing Dataset

We'll use the **West Roxbury Housing** dataset, which contains property assessment data from the West Roxbury neighborhood of Boston, Massachusetts.

| Feature | Description |
|---------|-------------|
| `TOTAL.VALUE` | Total assessed value of the property, in thousands of dollars (our target variable) |
| `TAX` | Annual property tax |
| `LOT.SQFT` | Lot size in square feet |
| `LIVING.AREA` | Living area in square feet |
| `FLOORS` | Number of floors |
| `ROOMS` | Total number of rooms |
| `BEDROOMS` | Number of bedrooms |
| `FULL.BATH` | Number of full bathrooms |
| `HALF.BATH` | Number of half bathrooms |
| `REMODEL` | Remodeling status (None, Old, Recent) |
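| `YR.BUILT` | Year the property was built |
| `GROSS.AREA` | Gross floor area in square feet |
| `KITCHEN` | Number of kitchens |
| `FIREPLACE` | Number of fireplaces |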

In [13]:
# ==============================================================================
# LOADING AND INITIAL DATA EXPLORATION
# ==============================================================================

# Load the West Roxbury housing dataset from the mlba package
# The :: operator accesses the WestRoxbury dataset directly from mlba
housing.df = mlba::WestRoxbury

# Get the dimensions of the data frame
# Returns: c(number_of_rows, number_of_columns)
# Here: 5802 rows (observations) and 14 columns (variables)
dim(housing.df)

# Display the first 10 rows of the dataset
# head() is essential for quickly understanding the data structure
# head(df) shows 6 rows by default; head(df, n) shows n rows
head(housing.df, 10)

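| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2.0 | 6 | 3 | 1 | 1 | 1 | 0 | None |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2.0 | 10 | 4 | 2 | 1 | 1 | 0 | Recent |
| 3 | 330.1 | 4152 | 7500 | 1890 | 2294 | 1371 | 2.0 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 4 | 498.6 | 6272 | 13773 | 1957 | 5032 | 2608 | 1.0 | 9 | 5 | 1 | 1 | 1 | 1 | None |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2.0 | 7 | 3 | 2 | 0 | 1 | 0 | None |
| 6 | 337.4 | 4244 | 5142 | 1950 | 2124 | 1060 | 1.0 | 6 | 3 | 1 | 0 | 1 | 1 | Old |
| 7 | 359.4 | 4521 | 5000 | 1954 | 3220 | 1916 | 2.0 | 7 | 3 | 1 | 1 | 1 | 0 | None |
| 8 | 320.4 | 4030 | 10000 | 1950 | 2208 | 1200 | 1.0 | 6 | 3 | 1 | 0 | 1 | 0 | None |
| 9 | 333.5 | 4195 | 6835 | 1958 | 2582 | 1092 | 1.0 | 5 | 3 | 1 | 0 | 1 | 1 | Recent |
| 10 | 409.4 | 5150 | 5093 | 1900 | 4818 | 2992 | 2.0 | 8 | 4 | 2 | 0 | 1 | 0 | None |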


### 📋 Understanding the Data Exploration Functions

| Function | Purpose | When to Use |
|----------|---------|-------------|
| `dim(df)` | Returns (rows, columns) | First check - understand data size |
| `head(df)` | Shows first 6 rows | Quick visual inspection |
| `head(df, n)` | Shows first n rows | When 6 rows isn't enough |
| `tail(df)` | Shows last 6 rows | Check for data truncation issues |
| `View(df)` | Opens interactive viewer | RStudio only - full data exploration |
| `str(df)` | Shows structure and types | Understanding variable types |
| `summary(df)` | Statistical summary | Quick stats on all numeric variables |

### 🔍 Subsetting Data: Accessing Rows and Columns

R provides multiple ways to access subsets of data using `[row, column]` notation. This is fundamental to data manipulation:

**Bracket Notation Syntax**: `dataframe[rows, columns]`
- Leave empty to select all: `df[1:10, ]` = first 10 rows, ALL columns
- Use `:` for ranges: `1:10` = 1, 2, 3, ... 10
- Use `c()` for specific selections: `c(1, 3, 5)` = columns 1, 3, and 5

In [14]:
# ==============================================================================
# DATA SUBSETTING WITH BRACKET NOTATION
# ==============================================================================

# Example 1: Select first 10 rows of the FIRST column only
# [1:10, 1] = rows 1-10, column 1 (TOTAL.VALUE)
housing.df[1:10, 1]

# Example 2: Select first 10 rows of ALL columns
# [1:10, ] = rows 1-10, leave column blank to get all columns
housing.df[1:10, ]

# Example 3: Select a single row (5th row) of the first 10 columns
# [5, 1:10] = row 5 only, columns 1 through 10
housing.df[5, 1:10]

# Example 4: Select specific non-consecutive columns using c()
# c(1:2, 4, 8:10) creates vector: 1, 2, 4, 8, 9, 10
# This is useful when you need specific features that aren't adjacent
housing.df[5, c(1:2, 4, 8:10)]

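| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | None |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | Recent |
| 3 | 330.1 | 4152 | 7500 | 1890 | 2294 | 1371 | 2 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 4 | 498.6 | 6272 | 13773 | 1957 | 5032 | 2608 | 1 | 9 | 5 | 1 | 1 | 1 | 1 | None |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 | 0 | 1 | 0 | None |
| 6 | 337.4 | 4244 | 5142 | 1950 | 2124 | 1060 | 1 | 6 | 3 | 1 | 0 | 1 | 1 | Old |
| 7 | 359.4 | 4521 | 5000 | 1954 | 3220 | 1916 | 2 | 7 | 3 | 1 | 1 | 1 | 0 | None |
| 8 | 320.4 | 4030 | 10000 | 1950 | 2208 | 1200 | 1 | 6 | 3 | 1 | 0 | 1 | 0 | None |
| 9 | 333.5 | 4195 | 6835 | 1958 | 2582 | 1092 | 1 | 5 | 3 | 1 | 0 | 1 | 1 | Recent |
| 10 | 409.4 | 5150 | 5093 | 1900 | 4818 | 2992 | 2 | 8 | 4 | 2 | 0 | 1 | 0 | None |

| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 |

| | TOTAL.VALUE | TAX | YR.BUILT | ROOMS | BEDROOMS | FULL.BATH |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 5 | 331.5 | 4170 | 1910 | 7 | 3 | 2 |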


In [15]:
# ==============================================================================
# ACCESSING COLUMNS BY NAME USING $ NOTATION
# ==============================================================================

# The $ operator extracts a single column as a vector
# This is more readable than bracket notation for single columns

# Extract the first 10 values from TOTAL.VALUE column
# $TOTAL.VALUE returns the entire column; [1:10] subsets it
housing.df$TOTAL.VALUE[1:10]

# Get the total number of observations (rows) in TOTAL.VALUE
# length() returns the number of elements in a vector
length(housing.df$TOTAL.VALUE)

# Calculate the arithmetic mean of property values
# mean() = sum of all values / number of values
# Formula: mean = (1/n) * Σxᵢ
mean(housing.df$TOTAL.VALUE)

# Generate comprehensive summary statistics for ALL columns
# For numeric: Min, 1st Quartile, Median, Mean, 3rd Quartile, Max
# For factors: Frequency counts of each level
summary(housing.df)

  TOTAL.VALUE          TAX           LOT.SQFT        YR.BUILT      GROSS.AREA  
 Min.   : 105.0   Min.   : 1320   Min.   :  997   Min.   :   0   Min.   : 821  
 1st Qu.: 325.1   1st Qu.: 4090   1st Qu.: 4772   1st Qu.:1920   1st Qu.:2347  
 Median : 375.9   Median : 4728   Median : 5683   Median :1935   Median :2700  
 Mean   : 392.7   Mean   : 4939   Mean   : 6278   Mean   :1937   Mean   :2925  
 3rd Qu.: 438.8   3rd Qu.: 5520   3rd Qu.: 7022   3rd Qu.:1955   3rd Qu.:3239  
 Max.   :1217.8   Max.   :15319   Max.   :46411   Max.   :2011   Max.   :8154  
  LIVING.AREA       FLOORS          ROOMS           BEDROOMS      FULL.BATH    
 Min.   : 504   Min.   :1.000   Min.   : 3.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:1308   1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:3.00   1st Qu.:1.000  
 Median :1548   Median :2.000   Median : 7.000   Median :3.00   Median :1.000  
 Mean   :1657   Mean   :1.684   Mean   : 6.995   Mean   :3.23   Mean   :1.297  
 3rd Qu.:1874   3rd Qu.:2.000   3rd Qu.:

### 📋 Complete Subsetting Syntax Reference

| Syntax | Description | Example Output |
|--------|-------------|----------------|
| `df[1:10, ]` | First 10 rows, all columns | Full rows of data |
| `df[, 1:5]` | All rows, first 5 columns | Subset of features |
| `df[5, 3]` | Single cell (row 5, column 3) | One value |
| `df$column` | Access column by name | Vector of values |
| `df[, c(1,3,5)]` | Specific columns (1, 3, 5) | Non-consecutive columns |
| `df[df$col > 100, ]` | Filter rows by condition | Conditional subset |

### 💡 Pro Tips for Data Subsetting

1. **Use column names** when possible: `df$TOTAL.VALUE` is clearer than `df[, 1]`
2. **Combine conditions** with `&` (AND) and `|` (OR): `df[df$ROOMS > 5 & df$BEDROOMS > 2, ]`
3. **Negative indexing** excludes elements: `df[, -1]` removes the first column
4. **which()** finds indices: `which(df$TOTAL.VALUE > 500)` returns the row numbers where the value exceeds 500 (`TOTAL.VALUE` is recorded in thousands of dollars)
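
The tips above can be tried on a small data frame. This is a sketch using a made-up `df` with three of the housing columns; the values and the threshold of 500 are illustrative assumptions, not results from the real dataset.

```r
# Toy data frame (hypothetical values; TOTAL.VALUE in thousands of dollars)
df <- data.frame(
  TOTAL.VALUE = c(344.2, 412.6, 330.1, 498.6, 531.5),
  ROOMS       = c(6, 10, 8, 9, 7),
  BEDROOMS    = c(3, 4, 4, 5, 2)
)

# Tip 1: access a column by name
values <- df$TOTAL.VALUE

# Tip 2: combine conditions with & (AND)
big <- df[df$ROOMS > 5 & df$BEDROOMS > 2, ]

# Tip 3: negative indexing drops the first column
no_value <- df[, -1]

# Tip 4: which() returns the positions where the condition is TRUE
expensive_rows <- which(df$TOTAL.VALUE > 500)

big
names(no_value)
expensive_rows
```

Note that `which()` returns positions, while the logical vector inside `df[..., ]` filters rows directly; both approaches select the same rows when no values are missing.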

---

## Part 2: Sampling from a Database

### 🎲 Why Sampling Matters in Machine Learning

Sampling is a fundamental technique used throughout the ML workflow:

| Scenario | Problem | Sampling Solution |
|----------|---------|-------------------|
| **Big Data** | Dataset has millions of rows | Random sample for faster exploration |
| **Class Imbalance** | 99% normal, 1% fraud | Oversample fraud or undersample normal |
| **Cross-Validation** | Need multiple train/test splits | Repeated random sampling |
| **Bootstrap** | Estimate model uncertainty | Sample with replacement |

### Understanding Sampling Types

1. **Simple Random Sampling**: Each observation has equal probability of selection
2. **Stratified Sampling**: Maintains the proportions of subgroups (e.g., if 30% of the population is fraud, about 30% of the sample is too)
3. **Weighted Sampling**: Assigns different probabilities to different observations
4. **Cluster Sampling**: Randomly selects groups, then includes all members
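
To make the difference between simple random and stratified sampling concrete, here is a base R sketch. The `population` data frame, its 70/30 class split, and the 10% sampling rate are all assumptions made up for this illustration.

```r
set.seed(42)

# Hypothetical population: 700 "legit" and 300 "fraud" records
population <- data.frame(
  id    = 1:1000,
  class = rep(c("legit", "fraud"), times = c(700, 300))
)

# Simple random sample: class proportions are only approximately preserved
simple_idx <- sample(nrow(population), 100)

# Stratified sample: draw 10% from each class separately
strata    <- split(seq_len(nrow(population)), population$class)
strat_idx <- unlist(lapply(strata, function(idx) {
  sample(idx, round(0.10 * length(idx)))
}))

table(population$class[simple_idx])   # roughly 70/30, varies with the seed
table(population$class[strat_idx])    # 70 legit and 30 fraud by construction
```

The stratified draw fixes the class counts in advance, whereas the simple random sample leaves them to chance.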

### Random Sampling in R

The `sample()` function is the foundation for all sampling operations in R.

In [16]:
# ==============================================================================
# SIMPLE RANDOM SAMPLING
# ==============================================================================

# Reload the housing data to ensure we're working with the original
housing.df = mlba::WestRoxbury

# Set seed for reproducibility
# CRITICAL: Without set.seed(), you'll get different results each time!
# The number 42 is arbitrary - use any integer, just be consistent
set.seed(42)

# Randomly select 10 row names (indices) from the data frame
# row.names() returns all row identifiers as a character vector
# sample(x, n) randomly selects n elements from vector x
s = sample(row.names(housing.df), 10)

# Display the sampled rows
# Use the sampled indices to extract those specific rows
housing.df[s, ]

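| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 2609 | 410.9 | 5169 | 4700 | 1968 | 2822 | 1586 | 2.0 | 12 | 3 | 2 | 1 | 1 | 1 | None |
| 4069 | 418.3 | 5262 | 7000 | 1912 | 4006 | 2430 | 2.0 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 2369 | 354.7 | 4462 | 5385 | 1925 | 2878 | 1248 | 1.0 | 6 | 2 | 1 | 0 | 1 | 1 | None |
| 5273 | 357.6 | 4498 | 5000 | 1930 | 2549 | 1632 | 2.0 | 7 | 3 | 1 | 0 | 1 | 1 | None |
| 1098 | 317.7 | 3996 | 4976 | 1948 | 1872 | 1248 | 2.0 | 6 | 3 | 1 | 0 | 1 | 0 | None |
| 1252 | 289.1 | 3636 | 6000 | 1950 | 2532 | 1160 | 1.0 | 5 | 2 | 1 | 1 | 1 | 0 | None |
| 634 | 289.2 | 3638 | 6214 | 1963 | 2731 | 1438 | 1.5 | 6 | 3 | 2 | 0 | 1 | 0 | None |
| 2097 | 409.6 | 5152 | 5750 | 1925 | 3140 | 1928 | 2.0 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 5248 | 508.5 | 6396 | 6169 | 1955 | 3577 | 1971 | 2.0 | 8 | 4 | 2 | 1 | 1 | 2 | Old |
| 5423 | 372.9 | 4691 | 7240 | 1950 | 3006 | 1644 | 1.5 | 8 | 3 | 1 | 1 | 1 | 0 | None |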


### ⚖️ Weighted Sampling

Sometimes simple random sampling isn't enough. Consider these scenarios:

- **Rare events**: Fraud occurs in 0.1% of transactions - random sampling might miss them entirely
- **Important subgroups**: Luxury homes (10+ rooms) may be rare but crucial for high-value predictions
- **Research requirements**: Need minimum representation from each category

**Weighted sampling** assigns different selection probabilities to different observations, ensuring rare but important cases are adequately represented.

### The `prob` Parameter in sample()

The `prob` argument specifies the probability weight for each element:
- Higher probability = more likely to be selected
- Probabilities don't need to sum to 1 (R normalizes them automatically)
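
A quick way to see the normalization is to compare unnormalized weights with the frequencies they produce. This sketch uses a made-up three-item vector and arbitrary weights; drawing with `replace = TRUE` is a choice made here so that the observed frequencies can be compared with the implied probabilities.

```r
set.seed(42)

items <- c("a", "b", "c")

# Unnormalized weights 1, 1, 2 behave like probabilities 0.25, 0.25, 0.50
weights      <- c(1, 1, 2)
implied_prob <- weights / sum(weights)

# Draw many times with replacement and compare observed frequencies
draws    <- sample(items, 10000, replace = TRUE, prob = weights)
observed <- prop.table(table(draws))

implied_prob
round(observed, 2)
```

When sampling without replacement (the default for `sample()`), the weights are applied sequentially: after each draw, the next item is chosen with probability proportional to the weights of the items that remain.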

In [17]:
# ==============================================================================
# WEIGHTED SAMPLING - OVERSAMPLING RARE CASES
# ==============================================================================

# Goal: Sample houses with over 10 rooms more frequently
# These large houses are rare but may have different valuation patterns

set.seed(42)  # Reproducibility

# Create weighted sample with probability vector
# ifelse(condition, value_if_true, value_if_false):
#   - Houses with ROOMS > 10 get probability 0.9 (90x more likely!)
#   - Houses with ROOMS <= 10 get probability 0.01
# R automatically normalizes these to sum to 1

s = sample(row.names(housing.df), 20, 
           prob = ifelse(housing.df$ROOMS > 10, 0.9, 0.01))

# Display the weighted sample - notice most houses have many rooms!
housing.df[s, ]

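| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 4354 | 490.1 | 6165 | 6735 | 1915 | 4385 | 2439 | 2.0 | 7 | 3 | 1 | 1 | 1 | 1 | None |
| 4733 | 404.7 | 5091 | 6250 | 1930 | 3299 | 1802 | 2.0 | 7 | 3 | 1 | 0 | 1 | 0 | Recent |
| 1838 | 409.5 | 5151 | 4864 | 1950 | 3998 | 2176 | 1.5 | 11 | 4 | 2 | 1 | 2 | 1 | None |
| 2932 | 534.2 | 6720 | 11700 | 1880 | 4521 | 2642 | 2.0 | 10 | 4 | 1 | 1 | 1 | 1 | None |
| 3548 | 349.3 | 4394 | 5792 | 1890 | 3000 | 2032 | 2.0 | 12 | 3 | 2 | 0 | 1 | 0 | None |
| 3872 | 576.1 | 7247 | 6501 | 1920 | 5197 | 3636 | 3.0 | 12 | 4 | 3 | 1 | 2 | 1 | Old |
| 1390 | 462.9 | 5823 | 8720 | 1994 | 3182 | 1738 | 2.0 | 6 | 3 | 2 | 1 | 1 | 1 | None |
| 3489 | 771.0 | 9699 | 27408 | 1914 | 5786 | 3050 | 2.0 | 12 | 4 | 3 | 0 | 1 | 0 | Recent |
| 2784 | 579.5 | 7290 | 12246 | 1900 | 4138 | 2414 | 2.0 | 11 | 4 | 2 | 1 | 1 | 0 | Recent |
| 915 | 316.7 | 3984 | 5229 | 1957 | 2608 | 1434 | 1.5 | 6 | 3 | 2 | 0 | 1 | 1 | Old |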


### 📊 Rebalancing Classes with Upsampling

**Class imbalance** is one of the most common problems in machine learning:

| Dataset | Majority Class | Minority Class | Imbalance Ratio |
|---------|---------------|----------------|-----------------|
| Fraud Detection | Legitimate (99.9%) | Fraud (0.1%) | 999:1 |
| Disease Diagnosis | Healthy (95%) | Disease (5%) | 19:1 |
| Customer Churn | Retained (85%) | Churned (15%) | ~6:1 |

**Why imbalance matters**: Models tend to predict the majority class because it minimizes overall error. A model that always predicts "no fraud" is 99.9% accurate but completely useless!

### Upsampling vs Downsampling

| Technique | Method | Pros | Cons |
|-----------|--------|------|------|
| **Upsampling** | Duplicate minority class | Preserves all information | Larger dataset, overfitting risk |
| **Downsampling** | Remove majority class | Faster training | Loses potentially useful data |
| **SMOTE** | Synthetic minority examples | Creates new data points | Can introduce noise |
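
Downsampling, the counterpart of the upsampling shown next, can be sketched in base R. The `toy` data frame and its 90/10 split are made up for illustration. The `caret` package also provides a `downSample()` function that mirrors `upSample()`; the manual version here is only meant to show the mechanics.

```r
set.seed(42)

# Hypothetical imbalanced data: 90 "None" rows and 10 "Recent" rows
toy <- data.frame(
  x       = rnorm(100),
  REMODEL = factor(rep(c("None", "Recent"), times = c(90, 10)))
)

# Size of the smallest class
n_min <- min(table(toy$REMODEL))

# Keep n_min randomly chosen rows from every class
rows_by_class <- split(seq_len(nrow(toy)), toy$REMODEL)
keep    <- unlist(lapply(rows_by_class, function(idx) sample(idx, n_min)))
down.df <- toy[keep, ]

table(toy$REMODEL)       # 90 vs 10
table(down.df$REMODEL)   # 10 vs 10
```

The balanced result has only 20 rows, which illustrates the main cost of downsampling: most of the majority-class data is discarded.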

In [18]:
# ==============================================================================
# UPSAMPLING TO BALANCE CLASSES
# ==============================================================================

# Convert REMODEL to a factor (categorical variable)
# factor() tells R this is a categorical variable with distinct levels
housing.df$REMODEL = factor(housing.df$REMODEL)

# Check current class distribution - likely imbalanced!
cat("Original distribution:\n")
table(housing.df$REMODEL)

# Upsample using caret's upSample function
# Parameters:
#   x = the data frame to upsample
#   y = the factor variable to balance
#   list = TRUE returns a list; we extract $x for the data frame
upsampled.df = caret::upSample(housing.df, housing.df$REMODEL, list = TRUE)$x

# Verify the new balanced distribution
# After upsampling, all classes should have equal counts
cat("\nAfter upsampling:\n")
table(upsampled.df$REMODEL)

Original distribution:



  None    Old Recent 
  4346    581    875 


After upsampling:



  None    Old Recent 
  4346   4346   4346 

### 📋 Interpreting the Rebalancing Results

**Before upsampling**: Classes have different sizes (imbalanced)
- The majority class dominates the dataset
- Models trained on this data may ignore minority classes

**After upsampling**: All classes have equal representation
- Minority classes are duplicated until they match the majority
- Models now have equal examples of each class to learn from

### 🏢 Business Impact of Class Balancing

| Application | Without Balancing | With Balancing |
|-------------|-------------------|----------------|
| Fraud Detection | Misses rare fraud cases | Catches more fraud at cost of false positives |
| Churn Prediction | Underestimates churn risk | Better identifies at-risk customers |
| Disease Diagnosis | Under-diagnoses rare conditions | More sensitive to disease indicators |
| Loan Default | Underestimates default risk | More conservative lending decisions |

> **⚠️ Warning**: Always balance your **training data only**! Keep your test/validation sets in their natural proportions to evaluate real-world performance.
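
One way to follow this warning is to split first and rebalance afterwards. The sketch below uses base R only; the `toy` data, the 70/30 split, and the helper logic are illustrative assumptions rather than the notebook's own procedure.

```r
set.seed(42)

# Hypothetical imbalanced data: 950 "no" and 50 "yes"
toy <- data.frame(
  x = rnorm(1000),
  y = factor(rep(c("no", "yes"), times = c(950, 50)))
)

# 1. Split first (70% training, 30% test)
train_idx <- sample(nrow(toy), round(0.7 * nrow(toy)))
train.df  <- toy[train_idx, ]
test.df   <- toy[-train_idx, ]

# 2. Upsample the minority class in the training set only
#    (sampling with replacement until every class matches the largest one)
n_major  <- max(table(train.df$y))
by_class <- split(seq_len(nrow(train.df)), train.df$y)
up_rows  <- unlist(lapply(by_class, function(idx) {
  if (length(idx) < n_major) sample(idx, n_major, replace = TRUE) else idx
}))
train.bal <- train.df[up_rows, ]

table(train.bal$y)   # balanced classes
table(test.df$y)     # natural proportions, untouched
```

Rebalancing before the split would let duplicated minority rows appear in both training and test sets, which makes test performance look better than it really is.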

---

## Part 3: Preprocessing and Cleaning the Data

### 🔧 Why Data Preprocessing is Critical

Raw data is rarely ready for machine learning. Common issues include:

| Problem | Example | Solution |
|---------|---------|----------|
| Missing values | `NA` in BEDROOMS column | Imputation or removal |
| Wrong data types | Numeric ZIP codes | Convert to factor |
| Categorical variables | "Yes"/"No" strings | Create dummy variables |
| Different scales | Age (0-100) vs Income (0-1M) | Standardization |
| Outliers | Income = $1 billion | Winsorization or removal |

### Understanding Variable Types in R

| R Type | Description | Example | ML Treatment |
|--------|-------------|---------|--------------|
| `numeric` | Continuous values | 345000.50 | Use directly |
| `integer` | Whole numbers | 6 | Use directly |
| `character` | Text strings | "Recent" | Convert to factor |
| `factor` | Categorical | 3 levels | Create dummy variables |
| `logical` | TRUE/FALSE | TRUE | Treat as 0/1 |
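
The types in the table can be checked with `class()`. A small sketch, using made-up values that mirror the examples in the table:

```r
# One example value per type (illustrative values only)
values <- list(
  numeric   = 345000.50,
  integer   = 6L,                # the L suffix marks an integer literal
  character = "Recent",
  factor    = factor("Recent"),
  logical   = TRUE
)

# class() reports the type R assigned to each value
types <- sapply(values, class)
types

# Logical values behave as 0/1 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))
n_true
```

Without the `L` suffix, a whole number such as `6` is stored as `numeric`, which is why imported data often shows `num` where an integer was expected.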

### Using `str()` to Understand Data Structure

In [19]:
# ==============================================================================
# EXAMINING DATA STRUCTURE WITH str()
# ==============================================================================

# tidyverse was loaded in the first cell to avoid namespace conflicts

# Reload clean data
housing.df = mlba::WestRoxbury

# str() shows the structure of any R object
# For data frames, it displays:
#   - Number of observations and variables
#   - Variable names and types
#   - First few values of each variable
# CRITICAL: Always run str() when loading new data to understand types!
str(housing.df)

'data.frame':	5802 obs. of  14 variables:
 $ TOTAL.VALUE: num  344 413 330 499 332 ...
 $ TAX        : int  4330 5190 4152 6272 4170 4244 4521 4030 4195 5150 ...
 $ LOT.SQFT   : int  9965 6590 7500 13773 5000 5142 5000 10000 6835 5093 ...
 $ YR.BUILT   : int  1880 1945 1890 1957 1910 1950 1954 1950 1958 1900 ...
 $ GROSS.AREA : int  2436 3108 2294 5032 2370 2124 3220 2208 2582 4818 ...
 $ LIVING.AREA: int  1352 1976 1371 2608 1438 1060 1916 1200 1092 2992 ...
 $ FLOORS     : num  2 2 2 1 2 1 2 1 1 2 ...
 $ ROOMS      : int  6 10 8 9 7 6 7 6 5 8 ...
 $ BEDROOMS   : int  3 4 4 5 3 3 3 3 3 4 ...
 $ FULL.BATH  : int  1 2 1 1 2 1 1 1 1 2 ...
 $ HALF.BATH  : int  1 1 1 1 0 0 1 0 0 0 ...
 $ KITCHEN    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ FIREPLACE  : int  0 0 0 1 0 1 0 0 1 0 ...
 $ REMODEL    : chr  "None" "Recent" "None" "None" ...


In [20]:
# ==============================================================================
# CONVERTING CHARACTER TO FACTOR
# ==============================================================================

# REMODEL is currently a character variable (text)
# We need to convert it to a factor for proper ML treatment

# Convert REMODEL to factor
# factor() creates a categorical variable with defined levels
housing.df$REMODEL = factor(housing.df$REMODEL)

# Verify the conversion worked
# str() should now show "Factor" instead of "chr"
str(housing.df$REMODEL)

# Show the factor's levels (unique categories)
# levels() returns the distinct values in order
# This is important: the first level is often the "reference" in regression
levels(housing.df$REMODEL)

 Factor w/ 3 levels "None","Old","Recent": 1 3 1 1 1 2 1 1 3 1 ...


### 📋 Interpreting `str()` Output

The `str()` function reveals the internal structure of your data:

| Type Shown | Meaning | Example |
|------------|---------|---------|
| `num` | Numeric (continuous) | 345678.5 |
| `int` | Integer (whole numbers) | 6 |
| `chr` | Character (text strings) | "Recent" |
| `Factor w/ 3 levels` | Categorical with 3 categories | "None", "Old", "Recent" |

### Why Factors Matter in Machine Learning

R treats factors specially in statistical models:

1. **Automatic dummy encoding**: `lm()` and `glm()` automatically create dummy variables
2. **Proper contrasts**: Factor levels are compared against a reference level
3. **Controlled ordering**: You can specify level order for ordinal data
4. **Memory efficiency**: Factors are stored as integers internally
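
The points above can be observed with `model.matrix()`, which builds the same design matrix that `lm()` uses internally. The `toy` data frame below is a made-up six-row example, not the housing data.

```r
# Hypothetical mini data set
toy <- data.frame(
  value   = c(300, 320, 410, 390, 350, 360),
  REMODEL = factor(c("None", "None", "Recent", "Recent", "Old", "Old"))
)

# Automatic dummy encoding: one column per non-reference level
X <- model.matrix(value ~ REMODEL, data = toy)
cols_default <- colnames(X)
cols_default

# Change the reference level with relevel()
toy$REMODEL   <- relevel(toy$REMODEL, ref = "Recent")
cols_releveled <- colnames(model.matrix(value ~ REMODEL, data = toy))
cols_releveled

# Factors are stored as integer codes plus a table of labels
codes <- as.integer(toy$REMODEL)
codes
levels(toy$REMODEL)
```

With the default alphabetical level order, "None" is the reference and does not get its own column; after `relevel()`, "Recent" takes that role instead.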

### Converting Variables to Factors

When R imports data, it may read categorical variables as text (`chr`). Common conversions:

```r
# Single column
df$column = factor(df$column)

# Multiple columns with mutate
df = df %>% mutate(col1 = factor(col1), col2 = factor(col2))

# Specify level order (first level = reference)
df$column = factor(df$column, levels = c("Low", "Medium", "High"))
```

### 🔗 Using Tidyverse Pipes for Clean, Readable Code

The `%>%` (pipe) operator is a game-changer for data manipulation. It takes the output of one function and passes it as the **first argument** to the next function.

**Without pipes** (nested, hard to read):
```r
result = function3(function2(function1(data)))
```

**With pipes** (linear, easy to read):
```r
result = data %>%
  function1() %>%
  function2() %>%
  function3()
```

> **💡 Tip**: Read `%>%` as "and then" - for example: "Take the data AND THEN mutate AND THEN filter"
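
The equivalence between nested calls and a pipeline can be checked on a tiny vector. To keep this sketch free of package dependencies it uses R's native pipe `|>` (available in R 4.1 and later) instead of the `%>%` operator discussed above; for simple chains like this one the two behave the same way. The vector `x` is made up.

```r
x <- c(4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: read top to bottom ("take x, and then sqrt, and then mean, ...")
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both forms compute the same value; the piped version simply lists the steps in the order they happen.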

In [21]:
# ==============================================================================
# PIPE OPERATOR FOR DATA PREPROCESSING CHAINS
# ==============================================================================

# dplyr (from tidyverse) provides mutate, %>%, and other data manipulation functions
# All packages were loaded in the first cell

# Load and preprocess data in one elegant statement using pipes
# The %>% operator passes the result as the first argument to the next function

housing.df = mlba::WestRoxbury %>%
  # mutate() creates or modifies columns
  # Here we convert REMODEL from character to factor
  mutate(REMODEL = factor(REMODEL))

# Verify the transformation
# REMODEL should now show as "Factor" with 3 levels
str(housing.df$REMODEL)

 Factor w/ 3 levels "None","Old","Recent": 1 3 1 1 1 2 1 1 3 1 ...


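### 🔢 Handling Categorical Variables: Dummy (One-Hot) Encoding

Many machine learning algorithms **require numeric inputs**. We convert categorical variables to **dummy variables** (also called indicator variables). Strictly speaking, one-hot encoding keeps one column per category, while the dummy coding used here drops one category as a reference.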

### How Dummy Encoding Works

For a variable with $k$ categories, we create $k-1$ binary (0/1) columns:

**Example**: `REMODEL` with levels [None, Old, Recent] becomes:

| Original | REMODEL_Old | REMODEL_Recent | Interpretation |
|----------|-------------|----------------|----------------|
| None | 0 | 0 | Reference category (both 0) |
| Old | 1 | 0 | Old = 1 |
| Recent | 0 | 1 | Recent = 1 |

### Why $k-1$ Columns (Not $k$)?

Creating $k$ columns alongside an intercept would introduce **perfect multicollinearity**: the $k$ columns always sum to 1, so any one of them is perfectly predictable from the others, and linear regression can no longer estimate unique coefficients. The omitted category becomes the **reference level**.

> **Rule**: If a house has REMODEL_Old = 0 AND REMODEL_Recent = 0, it must be "None"
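
The multicollinearity argument can be verified numerically by comparing the rank of the design matrix with its number of columns. This is a sketch on a made-up six-value factor, using base R's `model.matrix()` and `qr()`.

```r
# Hypothetical factor with k = 3 levels
REMODEL <- factor(c("None", "Old", "Recent", "None", "Old", "Recent"))

# All k indicator columns (one-hot), built without an intercept
one_hot <- model.matrix(~ REMODEL - 1)

# Every row has exactly one 1, so the columns sum to 1 in each row
row_totals <- rowSums(one_hot)

# Intercept + all k columns: 4 columns, but they are linearly dependent
X_full    <- cbind(Intercept = 1, one_hot)
rank_full <- qr(X_full)$rank

# Intercept + k-1 columns: no redundancy
X_ref    <- model.matrix(~ REMODEL)
rank_ref <- qr(X_ref)$rank

c(columns = ncol(X_full), rank = rank_full)
c(columns = ncol(X_ref),  rank = rank_ref)
```

A rank lower than the number of columns means at least one column carries no new information, which is exactly what dropping the reference level avoids.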

In [22]:
# ==============================================================================
# CREATING DUMMY VARIABLES WITH fastDummies
# ==============================================================================

# fastDummies package was loaded in the first cell

# Create dummy variables from the REMODEL column
# Parameters explained:
#   select_columns = "REMODEL": encode only this column
#                               (by default, all character and factor columns are encoded)
#   remove_selected_columns = TRUE: removes the original REMODEL column
#   remove_first_dummy = TRUE: removes one dummy to avoid multicollinearity
#                              (the removed level becomes the reference)
housing.df = dummy_cols(mlba::WestRoxbury,
                        select_columns = "REMODEL",      # Encode REMODEL only
                        remove_selected_columns = TRUE,  # Drop original column
                        remove_first_dummy = TRUE)       # Avoid multicollinearity

# Display first 2 rows to see the new dummy columns
# Look for REMODEL_Old and REMODEL_Recent columns (REMODEL_None was removed)
housing.df %>% head(2)

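| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL_Old | REMODEL_Recent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | 0 | 1 |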


### 🚫 Handling Missing Values (NA)

Missing values (`NA` in R) are a reality in real-world data. They can occur due to:
- Data entry errors
- Survey non-responses
- Sensor malfunctions
- System integration issues

### Common Strategies for Missing Values

| Strategy | Method | Pros | Cons |
|----------|--------|------|------|
| **Deletion** | Remove rows with NA | Simple, clean | Loses valuable data |
| **Mean Imputation** | Replace with column mean | Preserves sample size | Distorts distribution |
| **Median Imputation** | Replace with column median | Robust to outliers | Reduces variance |
| **Mode Imputation** | Replace with most common value | Good for categorical | May not make sense for numeric |
| **Predictive Imputation** | Model NA values from other columns | Most sophisticated | Complex, risk of overfitting |

### Mathematical Formulas for Imputation

- **Mean**: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sensitive to outliers)
- **Median**: Middle value when sorted (robust to outliers)
- **Mode**: Most frequently occurring value
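
The three statistics can be compared on a small vector. In this sketch the vector `x` is made up (one missing value, one outlier), and `stat_mode()` is a hypothetical helper written for the example, since base R has no built-in function for the statistical mode.

```r
# Hypothetical numeric column with one missing value and one outlier (9)
x <- c(2, 3, 3, 4, 3, NA, 9)

x_mean   <- mean(x, na.rm = TRUE)     # pulled upward by the outlier
x_median <- median(x, na.rm = TRUE)   # unaffected by the outlier

# Hypothetical helper: most frequent non-missing value
stat_mode <- function(v) {
  v      <- v[!is.na(v)]
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
x_mode <- stat_mode(x)

# Median imputation: fill NA with the median of the observed values
x_imputed <- ifelse(is.na(x), x_median, x)

c(mean = x_mean, median = x_median, mode = x_mode)
x_imputed
```

Here the single outlier moves the mean to 4 while the median stays at 3, so the imputed value depends noticeably on which statistic is chosen.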

In [23]:
# ==============================================================================
# SIMULATING AND DETECTING MISSING VALUES
# ==============================================================================

# Reload clean data
housing.df = mlba::WestRoxbury

# Simulate missing data by randomly setting some BEDROOMS values to NA
# In practice, you'd be dealing with actual missing values from your data source
set.seed(1)  # Reproducibility

# Randomly select 10 row indices to make missing
rows.to.missing = sample(row.names(housing.df), 10)

# Set those BEDROOMS values to NA (Not Available)
housing.df[rows.to.missing, ]$BEDROOMS = NA

# Check the result using summary()
# For a single column, summary() appends an NA's count to the usual statistics
# Look for NA's = 10 at the end of the output
cat("Summary with missing values:\n")
summary(housing.df$BEDROOMS)

Summary with missing values:


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    3.00    3.00    3.23    4.00    9.00      10 

In [24]:
# ==============================================================================
# IMPUTING MISSING VALUES WITH MEDIAN
# ==============================================================================

# Replace NA values in BEDROOMS with the median value
# Using tidyr's replace_na() function for clean syntax

# IMPORTANT: Use na.rm = TRUE when calculating median to ignore NA values!
# Without na.rm = TRUE, median() would return NA if any NA exists

# Calculate the median of non-missing BEDROOMS values
bedroom_median = median(housing.df$BEDROOMS, na.rm = TRUE)

# Replace NA values with the calculated median
housing.df = housing.df %>%
  replace_na(list(BEDROOMS = bedroom_median))

# Verify imputation worked - no more NA's should appear
cat("Summary after imputation:\n")
summary(housing.df$BEDROOMS)

Summary after imputation:


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   3.000   3.229   4.000   9.000 

### 📋 Why Median is Often Better Than Mean for Imputation

Consider this example with house prices:
- Houses: $200K, $250K, $300K, $350K, **$10M** (outlier mansion)

| Statistic | Value | Effect of Outlier |
|-----------|-------|-------------------|
| **Mean** | $2.22M | Heavily skewed by $10M outlier |
| **Median** | $300K | Unaffected - still the middle value |
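
The figures in the table can be reproduced in a few lines of R. This is a minimal sketch using the five example prices (expressed in thousands of dollars); the variable name `prices` is an assumption for illustration.

```r
# House prices in thousands of dollars; the last one is the outlier mansion
prices <- c(200, 250, 300, 350, 10000)

mean(prices)    # 2220, i.e. $2.22M
median(prices)  # 300, i.e. $300K

# Without the mansion, mean and median agree
mean(prices[-5])    # 275
median(prices[-5])  # 275
```

Adding one extreme value shifts the median from 275 to 300 but shifts the mean from 275 to 2220.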

For BEDROOMS, median is appropriate because:
- It's robust to any outlier properties (mansions with 20+ bedrooms)
- It maintains the typical bedroom count for imputation
- It's a whole number, which makes sense for bedrooms

### 📝 Documentation Best Practice

**Always document your imputation strategy!** Stakeholders need to know:
1. Which variables had missing values
2. How many observations were affected
3. What imputation method was used
4. The rationale for choosing that method

```r
# Example documentation comment:
# BEDROOMS: 10 of 5,802 values missing (0.17%), imputed with median = 3
# Median chosen over mean because it is robust to outlier properties
```
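
The counts and percentages for such a note can be generated rather than typed by hand. The following is a sketch only: `na_report` is a hypothetical helper name, and the toy data frame stands in for a real dataset.

```r
# Hypothetical helper: count and percentage of missing values per column,
# keeping only the columns that actually contain NA
na_report <- function(df) {
  n_missing <- colSums(is.na(df))
  report <- data.frame(
    variable    = names(n_missing),
    n_missing   = as.integer(n_missing),
    pct_missing = round(100 * n_missing / nrow(df), 2),
    row.names   = NULL
  )
  report[report$n_missing > 0, ]
}

# Toy data frame (illustrative): 2 of 8 BEDROOMS values are missing
toy.df <- data.frame(
  BEDROOMS = c(3, NA, 4, 2, NA, 3, 3, 5),
  ROOMS    = c(6, 7, 8, 5, 6, 6, 7, 9)
)

na_report(toy.df)
```

Running the report before imputation gives you the first two items of the documentation list (which variables, how many observations) directly from the data.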

---

## Part 4: Predictive Power and Overfitting

### ⚠️ The Overfitting Problem

**Overfitting** is one of the most common pitfalls in machine learning. It occurs when a model fits the training data too closely, memorizing noise and random patterns that do not generalize to new data.

### Visual Understanding of Fit Quality

```
UNDERFITTING          GOOD FIT              OVERFITTING
    ↓                    ↓                      ↓
  ~~~~               ~~~~~~~~               ~~~~~~~~
 ~    ~  (data)     ~        ~             ~  ~~  ~
~      ~           ~    ✓     ~            ~  ~~ ~~
 ──────           ~~~~~~~~~~~~~           ~~~~  ~~~~
(too simple)      (captures trend)        (memorizes noise)
```

### Diagnosing Model Fit

| Training Accuracy | Test Accuracy | Diagnosis | Action |
|-------------------|---------------|-----------|--------|
| 95% | 93% | ✅ Good generalization | Deploy model |
| 99% | 70% | ❌ Overfitting | Simplify model, add regularization |
| 65% | 63% | ⚠️ Underfitting | Add features, use a more complex model |
| 60% | 80% | 🔄 Suspicious split or data leakage (rare) | Check the partitioning, the evaluation code, and target leakage |
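
The same diagnosis works for regression if you compare error instead of accuracy. The sketch below uses simulated data (an assumption, not the housing data) and fits a simple and a very flexible polynomial model; the helper names `rmse` and `fit_and_score` are hypothetical.

```r
set.seed(1)

# Simulated data (illustrative): a straight-line trend plus noise
n <- 60
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
sim.df <- data.frame(x = x, y = y)

# 60/40 split into training and holdout rows
train.idx   <- sample(n, size = 0.6 * n)
train.sim   <- sim.df[train.idx, ]
holdout.sim <- sim.df[-train.idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit a polynomial of the given degree on training data, score both partitions
fit_and_score <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data = train.sim)
  c(degree       = degree,
    train_rmse   = rmse(train.sim$y, fitted(fit)),
    holdout_rmse = rmse(holdout.sim$y, predict(fit, newdata = holdout.sim)))
}

# Degree 1 is the simple model; degree 12 is flexible enough to chase noise
results <- rbind(fit_and_score(1), fit_and_score(12))
results
```

The flexible model can never have a higher training error than the simple one, because it contains the simple model as a special case. Whether it also does better on the holdout rows is exactly what the comparison is meant to reveal.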

### 🛡️ Solution: Data Partitioning

To detect overfitting, we must evaluate on data the model has **never seen**:

| Partition | % of Data | Purpose |
|-----------|-----------|---------|
| **Training** | 50-70% | Build/fit the model |
| **Validation** | 15-25% | Tune hyperparameters |
| **Test/Holdout** | 15-25% | Final unbiased evaluation |

> **Golden Rule**: NEVER use test data during model development. It must remain "unseen" until final evaluation.

### 📊 Holdout Partition (60% Train / 40% Test)

The simplest partitioning scheme divides data into two sets:
- **Training set (60%)**: Used to fit model parameters
- **Holdout set (40%)**: Reserved for final evaluation

This is appropriate when:
- Dataset is large enough that 40% provides reliable estimates
- No hyperparameter tuning is needed
- Quick prototyping is the goal

In [25]:
# ==============================================================================
# TWO-WAY DATA PARTITION: 60% TRAINING / 40% HOLDOUT
# ==============================================================================

# Reload and preprocess the data
housing.df = mlba::WestRoxbury %>%
  mutate(REMODEL = factor(REMODEL))

# CRITICAL: Set seed for reproducibility!
# This ensures the same split every time you run the code
set.seed(1)

# Randomly sample 60% of row IDs for training
# rownames() returns the row identifiers (usually "1", "2", "3", ...)
# nrow() * 0.6 calculates 60% of total observations
train.rows = sample(rownames(housing.df), nrow(housing.df) * 0.6)

# Create training data frame by selecting those rows
train.df = housing.df[train.rows, ]

# Holdout = all rows NOT in training set
# setdiff(A, B) returns elements in A but not in B
holdout.rows = setdiff(rownames(housing.df), train.rows)
holdout.df = housing.df[holdout.rows, ]

# Verify the split sizes
cat("Training set:", nrow(train.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")

Training set: 3481 rows
Holdout set: 2321 rows


### 📊 Three-Way Partition (50% Train / 30% Validation / 20% Test)

When tuning hyperparameters (e.g., tree depth, regularization strength), we need THREE partitions:

| Partition | Purpose | Used During |
|-----------|---------|-------------|
| **Training (50%)** | Fit model parameters | Model building |
| **Validation (30%)** | Compare model variants, tune hyperparameters | Model selection |
| **Test/Holdout (20%)** | Final unbiased performance estimate | After all decisions are made |

> **Why three sets?** If you use the test set to tune hyperparameters, you're essentially "peeking" at it, and your final performance estimate will be optimistically biased.

In [26]:
# ==============================================================================
# THREE-WAY DATA PARTITION: 50% TRAIN / 30% VALIDATION / 20% HOLDOUT
# ==============================================================================

set.seed(1)  # Reproducibility

# Step 1: Select 50% for training
train.rows = sample(rownames(housing.df), nrow(housing.df) * 0.5)

# Step 2: From the REMAINING rows, select 30% of original for validation
# setdiff() gets rows not in training set
# Then sample 30% of original dataset size from those remaining rows
valid.rows = sample(setdiff(rownames(housing.df), train.rows),
                    nrow(housing.df) * 0.3)

# Step 3: Holdout = everything not in training or validation
# union(A, B) combines both sets
# setdiff removes those from all rows, leaving holdout
holdout.rows = setdiff(rownames(housing.df), union(train.rows, valid.rows))

# Create the three data frames
train.df = housing.df[train.rows, ]
valid.df = housing.df[valid.rows, ]
holdout.df = housing.df[holdout.rows, ]

# Verify the partition sizes (should sum to original total)
cat("Training set:", nrow(train.df), "rows\n")
cat("Validation set:", nrow(valid.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")

Training set: 2901 rows
Validation set: 1740 rows
Holdout set: 1161 rows


### 🎯 Using caret for Stratified Partitioning

Simple random partitioning can produce unbalanced splits by chance, especially on small datasets or with rare classes. **Stratified sampling** ensures that each partition has a similar distribution of the target variable.

**Example Problem**: 
- If 20% of houses are "Recently Remodeled"
- A random split might give Training = 25%, Test = 10%
- The model then learns from a different class mix than the one it is tested on!

The `caret` package's `createDataPartition()` function performs stratified sampling on the variable you pass as `y`. For a factor, it samples within each class; for a numeric variable such as `TOTAL.VALUE`, it samples within groups based on percentiles.

In [27]:
# ==============================================================================
# STRATIFIED PARTITIONING WITH caret
# ==============================================================================

set.seed(1)  # Reproducibility

# createDataPartition creates stratified samples
# Parameters:
#   y = the variable to stratify by (a numeric y is binned into percentile
#       groups, and rows are sampled within each group)
#   p = proportion for training (0.6 = 60%)
#   list = FALSE returns a one-column matrix of row indices instead of a list
idx = caret::createDataPartition(housing.df$TOTAL.VALUE, p = 0.6, list = FALSE)

# Use positive indexing for training, negative for holdout
# idx contains row numbers for training
# -idx means "all rows EXCEPT those in idx"
train.df = housing.df[idx, ]
holdout.df = housing.df[-idx, ]

# Verify partition sizes
cat("Training set:", nrow(train.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")

Training set: 3483 rows
Holdout set: 2319 rows
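
To see what stratification does for a categorical variable, the idea can be sketched in base R without `caret`. The toy data frame and its class mix below are assumptions for illustration; this is not how `createDataPartition()` is implemented, only the same principle of sampling within each group.

```r
set.seed(1)

# Toy data (illustrative): 20% "Recent", 30% "Old", 50% "None"
toy.df <- data.frame(
  id      = 1:200,
  REMODEL = rep(c("Recent", "Old", "None"), times = c(40, 60, 100))
)

# Sample 60% of the row indices separately within each REMODEL level
rows_by_level <- split(seq_len(nrow(toy.df)), toy.df$REMODEL)
train.idx <- unlist(lapply(rows_by_level, function(rows) {
  rows[sample.int(length(rows), size = round(0.6 * length(rows)))]
}))

train.toy   <- toy.df[train.idx, ]
holdout.toy <- toy.df[-train.idx, ]

# Class proportions are identical in both partitions
prop.table(table(train.toy$REMODEL))
prop.table(table(holdout.toy$REMODEL))
```

Because the sampling happens inside each level, both partitions keep the 50/30/20 mix of the full data regardless of the seed.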


---

## Part 5: Building a Predictive Model

### 🏗️ Complete Modeling Workflow

Now we put everything together into a complete machine learning pipeline:

```
┌────────────────────────────────────────────────────────────────┐
│  Step 1: LOAD & PREPROCESS                                     │
│  • Load data → Remove NA → Convert types → Create dummies      │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│  Step 2: PARTITION DATA                                        │
│  • Split into Training (60%) and Holdout (40%)                 │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│  Step 3: BUILD MODEL                                           │
│  • Train linear regression on training data ONLY               │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│  Step 4: EVALUATE                                              │
│  • Calculate metrics on holdout data → Check for overfitting   │
└───────────────────────────────┬────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│  Step 5: PREDICT                                               │
│  • Apply model to new unseen data                              │
└────────────────────────────────────────────────────────────────┘
```

In [28]:
# ==============================================================================
# STEP 1-2: COMPLETE DATA PREPROCESSING AND PARTITIONING
# ==============================================================================

# All required packages (tidyverse, mlba, fastDummies, caret) were loaded in the first cell

# Step 1: Load and preprocess data using a single pipe chain
housing.df = mlba::WestRoxbury %>%
  # Remove any rows with missing values
  # drop_na() is from tidyr package (part of tidyverse)
  drop_na() %>%
  # Remove TAX column - it's perfectly correlated with TOTAL.VALUE
  # (Tax is calculated directly from value, so it would cause data leakage!)
  select(-TAX) %>%
  # Convert REMODEL to factor for proper categorical handling
  mutate(REMODEL = factor(REMODEL)) %>%
  # Create 0/1 dummy variables for the REMODEL factor
  dummy_cols(select_columns = c('REMODEL'),  # Only dummy-encode REMODEL
             remove_selected_columns = TRUE,  # Remove original column
             remove_first_dummy = TRUE)       # Avoid multicollinearity

# Step 2: Create stratified train/test split
set.seed(1)  # ALWAYS set seed for reproducibility!
idx = caret::createDataPartition(housing.df$TOTAL.VALUE, p = 0.6, list = FALSE)
train.df = housing.df[idx, ]
holdout.df = housing.df[-idx, ]

# Status report
cat("Data prepared. Training:", nrow(train.df), "| Holdout:", nrow(holdout.df))

Data prepared. Training: 3483 | Holdout: 2319

### 📋 Understanding the Preprocessing Pipeline

Each step in our pipeline has a specific purpose:

| Function | What It Does | Why It Matters |
|----------|--------------|----------------|
| `drop_na()` | Remove rows with missing values | Many algorithms can't handle NA |
| `select(-TAX)` | Remove TAX column | TAX is calculated from TOTAL.VALUE (target leakage!) |
| `mutate(REMODEL=factor())` | Convert to factor | Makes the categorical levels explicit before dummy encoding |
| `dummy_cols()` | Create 0/1 indicators | Many ML algorithms need numeric inputs |

### ⚠️ Why We Removed TAX (Target Leakage)

**Target leakage** occurs when your features contain information about the target that wouldn't be available at prediction time.

Property TAX is calculated as: $\text{TAX} = \text{TOTAL.VALUE} \times \text{tax\_rate}$

Including TAX would make predicting TOTAL.VALUE trivially easy (just divide TAX by tax_rate), but in real predictions, you won't know the tax until you know the value!
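
The effect is easy to demonstrate on simulated data. In the sketch below, every number (the value formula, the noise level, and the `tax_rate` of 12.5) is an assumption chosen for illustration, not a property of the WestRoxbury data.

```r
set.seed(1)

# Simulated properties (illustrative): value in $000s driven by living area plus noise
n <- 500
living_area <- round(runif(n, 800, 3500))
total_value <- 150 + 0.12 * living_area + rnorm(n, sd = 40)

# Hypothetical tax rate: tax in dollars is a fixed multiple of the value
tax_rate <- 12.5
tax <- round(total_value * tax_rate)

sim.df <- data.frame(total_value, living_area, tax)

# Honest model: uses only information available before the value is known
r2_honest <- summary(lm(total_value ~ living_area, data = sim.df))$r.squared

# Leaky model: tax encodes the target, so the fit is almost perfect
r2_leaky <- summary(lm(total_value ~ living_area + tax, data = sim.df))$r.squared

c(honest = r2_honest, leaky = r2_leaky)
```

An R² that jumps to nearly 1 after adding a single feature is a strong hint that the feature was derived from the target.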

### Step 3: Building the Linear Regression Model

**Linear Regression** models the relationship:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p$$

Where:
- $\hat{y}$ = predicted value (TOTAL.VALUE)
- $\beta_0$ = intercept (predicted value when all features are 0)
- $\beta_i$ = coefficient (how much $\hat{y}$ changes when $x_i$ increases by 1, holding the other features fixed)
- $x_i$ = feature values (LIVING.AREA, BEDROOMS, etc.)
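
To make the formula concrete, the sketch below applies it by hand and compares the result with `predict()`. It uses R's built-in `mtcars` data rather than the housing data so that it runs on its own; the new observation is hypothetical.

```r
# Fit a small model on R's built-in mtcars data (illustrative, not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(fit)  # named vector: (Intercept), wt, hp

# A hypothetical new car: weight 3.0 (in 1000 lbs), 120 horsepower
new.car <- data.frame(wt = 3.0, hp = 120)

# Apply the formula by hand: y_hat = b0 + b1 * x1 + b2 * x2
manual <- b["(Intercept)"] + b["wt"] * new.car$wt + b["hp"] * new.car$hp

# predict() performs the same calculation
auto <- predict(fit, newdata = new.car)

all.equal(unname(manual), unname(auto))  # TRUE
```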

In [29]:
# ==============================================================================
# STEP 3: BUILD LINEAR REGRESSION MODEL
# ==============================================================================

# Build linear regression model using lm() (linear model)
# Formula syntax: target ~ predictors
#   TOTAL.VALUE ~ . means "predict TOTAL.VALUE using ALL other columns"
#   The "." is shorthand for "all remaining variables"
reg = lm(TOTAL.VALUE ~ ., data = train.df)

# Create a data frame of training results for analysis
# This helps us understand how well the model fits the training data
train.res = data.frame(
  actual = train.df$TOTAL.VALUE,     # True values from data
  predicted = reg$fitted.values,      # Model's predictions
  residuals = reg$residuals           # Errors (actual - predicted)
)

# Display first 6 training predictions
cat("Training set predictions (first 6 rows):\n")
head(train.res)

Training set predictions (first 6 rows):


| | actual | predicted | residuals |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 344.2 | 384.4206 | -40.220638 |
| 4 | 498.6 | 546.4628 | -47.862759 |
| 5 | 331.5 | 347.917 | -16.417031 |
| 12 | 344.5 | 380.4297 | -35.929727 |
| 13 | 315.5 | 313.1879 | 2.312083 |
| 15 | 326.2 | 345.3751 | -19.175064 |


### 📋 Understanding the lm() Output

The `lm()` function returns a model object containing:

| Component | Access With | Description |
|-----------|-------------|-------------|
| Coefficients | `reg$coefficients` | The $\beta$ values for each feature |
| Fitted Values | `reg$fitted.values` | Predictions on training data |
| Residuals | `reg$residuals` | Errors: $y - \hat{y}$ |
| R-squared | `summary(reg)$r.squared` | Proportion of variance explained |

### Interpreting the Results Table

- **actual**: The true property value from our data
- **predicted**: What the model estimates the value should be
- **residuals**: The error (actual - predicted)
  - Positive residual = model underestimated
  - Negative residual = model overestimated
  - Residuals should be randomly distributed around 0

### Step 4: Evaluating on Holdout Data

Now the critical test: how does our model perform on **data it has never seen**? This is the true test of generalization.

In [30]:
# ==============================================================================
# STEP 4: EVALUATE ON HOLDOUT DATA
# ==============================================================================

# Use predict() to generate predictions for holdout data
# predict(model, newdata) applies the model to new observations
# The model has NEVER seen holdout.df during training!
pred = predict(reg, newdata = holdout.df)

# Create holdout results data frame
holdout.res = data.frame(
  actual = holdout.df$TOTAL.VALUE,             # True values
  predicted = pred,                             # Model predictions
  residuals = holdout.df$TOTAL.VALUE - pred    # Calculate residuals manually
)

# Display first 6 holdout predictions
cat("Holdout set predictions (first 6 rows):\n")
head(holdout.res)

Holdout set predictions (first 6 rows):


| | actual | predicted | residuals |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 412.6 | 460.2777 | -47.677744 |
| 3 | 330.1 | 359.392 | -29.291958 |
| 6 | 337.4 | 290.0277 | 47.372303 |
| 7 | 359.4 | 402.5332 | -43.133242 |
| 8 | 320.4 | 314.0683 | 6.331652 |
| 9 | 333.5 | 339.8206 | -6.320582 |


### 📋 Why Holdout Evaluation Matters

**Training performance can be misleading!** A model can memorize training data perfectly (100% accuracy) but fail completely on new data.

| Observed Pattern | Training Set | Holdout Set | Interpretation |
|------------------|--------------|-------------|----------------|
| Training looks good, holdout not checked | ✓ Good | Unknown | The model may still overfit. Compare both! |
| Similar performance | ✓ Good | ✓ Good | Model generalizes well |
| Training much worse than holdout | ✗ Bad | ✓ Good | Rare; possible data leakage or an unrepresentative split |
| Training much better than holdout | ✓ Good | ✗ Bad | Overfitting detected |

### Understanding Residual Patterns

Good residuals should show:
- **Mean near 0**: No systematic bias
- **Random scatter**: No patterns
- **Similar variance**: Homoscedasticity
- **No extreme outliers**: Model handles edge cases
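
These checks can be run in a few lines. The sketch below uses R's built-in `mtcars` data so that it runs on its own; the 3-standard-deviation cutoff for flagging outliers is a common rule of thumb chosen here as an assumption, not a fixed standard.

```r
# Illustrative model on R's built-in mtcars data (not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Mean near 0: guaranteed on training data when the model has an intercept
mean(res)

# Random scatter and similar variance: look for curves or funnel shapes
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Extreme outliers: flag residuals more than 3 standard deviations from 0
which(abs(res) > 3 * sd(res))
```

On holdout data the mean residual is not forced to zero, so a value far from 0 there points to systematic over- or underestimation.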

### Step 4 (continued): Calculating Performance Metrics

We need quantitative measures to compare models objectively.

### 📐 Key Regression Metrics

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **R²** (R-squared) | $1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | Proportion of variance explained; at most 1, closer to 1 is better (can be negative on holdout data) |
| **Adj. R²** (Adjusted R-squared) | $1 - \frac{(1-R^2)(n-1)}{n-p-1}$ | R² adjusted for number of predictors; penalizes predictors that do not improve the fit |
| **ME** (Mean Error) | $\frac{1}{n}\sum(y_i - \hat{y}_i)$ | Average error; should be near 0. Positive = underestimate |
| **RMSE** (Root Mean Squared Error) | $\sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$ | Typical error magnitude in the target's units (thousands of dollars for TOTAL.VALUE) |
| **MAE** (Mean Absolute Error) | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i\rvert$ | Average absolute deviation; less sensitive to outliers than RMSE |
| **MAPE** (Mean Absolute Percentage Error) | $\frac{100}{n}\sum\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert$ | Average percentage error; scale-independent |
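
Each formula translates directly into one line of base R. The sketch below is a minimal version under the assumption that `actual` contains no zeros (MAPE divides by it); `regression_metrics` is a hypothetical helper name, and the four toy values are made up so the arithmetic can be checked by hand.

```r
# Hypothetical helper: regression metrics from actual and predicted values
regression_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    R.squared = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    ME        = mean(err),
    RMSE      = sqrt(mean(err^2)),
    MAE       = mean(abs(err)),
    MAPE      = 100 * mean(abs(err / actual))
  )
}

# Toy example (illustrative values): errors are -10, 10, -10, 20
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 380)

metrics <- regression_metrics(actual, predicted)
round(metrics, 3)
# R.squared = 0.986, ME = 2.5, RMSE = 13.229, MAE = 12.5, MAPE = 5.833
```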

### Understanding R-squared (R²)

**R-squared** is one of the most important metrics for regression models:

- **R² = 0.85** means the model explains 85% of the variance in property values
- **R² = 1.0** means perfect prediction (suspicious: likely overfitting or target leakage!)
- **R² = 0.0** means the model is no better than predicting the mean

**Adjusted R²** is preferred when comparing models with different numbers of features because it penalizes adding variables that don't improve the model.

As a rough guide (acceptable values vary by domain):

| R² Value | Interpretation |
|----------|----------------|
| 0.90+ | Excellent fit (verify not overfitting) |
| 0.70-0.90 | Good fit |
| 0.50-0.70 | Moderate fit |
| < 0.50 | Weak fit - consider more features or different model |
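
The adjusted R² formula can be checked against this module's own numbers. The sketch below assumes the training-set figures reported in the metrics cell further down (R² = 0.822098 on 3,483 rows) and p = 13 predictors, which is the number of columns left after dropping TAX and dummy-encoding REMODEL; `adj_r_squared` is a hypothetical helper name.

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Training figures from this module: R-squared = 0.822098, n = 3483, p = 13
adj_r_squared(r2 = 0.822098, n = 3483, p = 13)  # about 0.82143

# With few rows, the same R-squared is penalized much more heavily
adj_r_squared(r2 = 0.822098, n = 30, p = 13)
```

With thousands of rows and only 13 predictors the penalty is tiny, which is why R² and adjusted R² are nearly identical for this model.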

### RMSE vs MAE vs MAPE: When to Use Which?

- **RMSE**: Penalizes large errors more heavily (squared). Use when big errors are much worse than small ones.
- **MAE**: Treats all errors equally. Use when errors have linear cost regardless of magnitude.
- **MAPE**: Expresses error as a percentage. Use when comparing across different scales or communicating to stakeholders.

**Example**: For house pricing:

- RMSE = $45,000 means typical predictions are about $45K off
- MAPE = 10% means predictions are off by about 10% on average
- MAE = $35,000 means average absolute error is $35K (RMSE is never smaller than MAE; a wide gap like this suggests a few large errors)
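
A small numeric sketch shows why the gap between RMSE and MAE is informative. The two error vectors below are made-up values chosen so that both have the same MAE.

```r
rmse <- function(err) sqrt(mean(err^2))
mae  <- function(err) mean(abs(err))

# Two sets of errors with the same MAE (illustrative values, in $000s)
even_errors   <- c(10, 10, 10, 10)  # every prediction is off by 10
uneven_errors <- c(1, 1, 1, 37)     # three small errors and one large miss

c(MAE = mae(even_errors),   RMSE = rmse(even_errors))    # MAE 10, RMSE 10
c(MAE = mae(uneven_errors), RMSE = rmse(uneven_errors))  # MAE 10, RMSE about 18.5
```

When all errors are the same size, RMSE equals MAE; the more the errors are concentrated in a few large misses, the further RMSE rises above MAE.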

In [31]:
# ==============================================================================
# CALCULATE TRAINING SET METRICS
# ==============================================================================

# caret provides RMSE() and MAE() functions (loaded in first cell)
# R-squared metrics come from the model summary

# Get R-squared values from the model summary
model_summary = summary(reg)

# Compute metrics on training set
# These tell us how well the model fits the data it was trained on
cat("=== Training Set Metrics ===\n")

data.frame(
  # R-squared: proportion of variance explained by the model
  # Ranges from 0 to 1; higher is better
  R.squared = model_summary$r.squared,
  
  # Adjusted R-squared: R² adjusted for number of predictors
  # Penalizes adding variables that don't improve the model
  Adj.R.squared = model_summary$adj.r.squared,
  
  # Mean Error: average of residuals
  # Should be near 0; positive means we underestimate on average
  ME = round(mean(train.res$residuals), 5),
  
  # Root Mean Squared Error: sqrt of average squared error
  # Sensitive to large errors; in the units of TOTAL.VALUE (thousands of dollars)
  RMSE = RMSE(pred = train.res$predicted, obs = train.res$actual),
  
  # Mean Absolute Error: average of absolute errors
  # More robust to outliers than RMSE
  MAE = MAE(pred = train.res$predicted, obs = train.res$actual),
  
  # Mean Absolute Percentage Error: average percentage error
  # Scale-independent; useful for comparing across different targets
  MAPE = mean(abs((train.res$actual - train.res$predicted) / train.res$actual)) * 100
)

=== Training Set Metrics ===


| R.squared | Adj.R.squared | ME | RMSE | MAE | MAPE |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.822098 | 0.8214314 | 0 | 42.14665 | 31.98717 | 8.356575 |


In [32]:
# ==============================================================================
# CALCULATE HOLDOUT SET METRICS
# ==============================================================================

# Compute metrics on holdout set
# This is the TRUE test - performance on data the model never saw
cat("=== Holdout Set Metrics ===\n")

# Calculate R-squared for holdout data manually
# R² = 1 - (SS_residual / SS_total)
ss_res_holdout = sum((holdout.res$actual - holdout.res$predicted)^2)
ss_tot_holdout = sum((holdout.res$actual - mean(holdout.res$actual))^2)
r_squared_holdout = 1 - (ss_res_holdout / ss_tot_holdout)

# Calculate Adjusted R-squared for holdout
n_holdout = nrow(holdout.res)
p = length(reg$coefficients) - 1  # number of predictors (excluding intercept)
adj_r_squared_holdout = 1 - ((1 - r_squared_holdout) * (n_holdout - 1) / (n_holdout - p - 1))

data.frame(
  # R-squared on holdout - the TRUE test of explanatory power
  R.squared = r_squared_holdout,
  
  # Adjusted R-squared on holdout
  Adj.R.squared = adj_r_squared_holdout,
  
  # Mean Error on holdout data
  ME = round(mean(holdout.res$residuals), 5),
  
  # RMSE on holdout - compare this to training RMSE!
  RMSE = RMSE(pred = holdout.res$predicted, obs = holdout.res$actual),
  
  # MAE on holdout
  MAE = MAE(pred = holdout.res$predicted, obs = holdout.res$actual),
  
  # MAPE on holdout - percentage error for easy interpretation
  MAPE = mean(abs((holdout.res$actual - holdout.res$predicted) / holdout.res$actual)) * 100
)

=== Holdout Set Metrics ===


| R.squared | Adj.R.squared | ME | RMSE | MAE | MAPE |
|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 0.7993859 | 0.7982544 | -1.04237 | 43.90381 | 33.05476 | 8.609297 |


### 📋 Interpreting Model Performance

**Compare Training vs Holdout Metrics:**

| Scenario | Training R² | Holdout R² | Diagnosis |
|----------|-------------|------------|-----------|
| Similar values | 0.82 | 0.80 | ✅ Good generalization |
| Training >> Holdout | 0.95 | 0.65 | ❌ Overfitting |
| Both low | 0.45 | 0.42 | ⚠️ Underfitting - need better features |

| Scenario | Training RMSE | Holdout RMSE | Diagnosis |
|----------|---------------|--------------|-----------|
| Similar values | $40,000 | $42,000 | ✅ Good generalization |
| Training << Holdout | $25,000 | $55,000 | ❌ Overfitting |
| Both very high | $80,000 | $82,000 | ⚠️ Underfitting |

**Practical Interpretation:**
- "Our predictions are typically off by approximately $[RMSE] from actual property values"
- `TOTAL.VALUE` is recorded in thousands of dollars, so the holdout RMSE of 43.9 above corresponds to roughly $43,900
- For a $400,000 house, an RMSE of $40,000 represents ~10% error
- R² of 0.80 means the model explains 80% of the variance in property values
- Whether this is acceptable depends on the business context

### 🎯 Model Improvement Strategies

If holdout performance is poor, consider:
1. **Add features**: More relevant predictors
2. **Feature engineering**: Create interaction terms, polynomial features
3. **Different algorithms**: Try random forest, gradient boosting
4. **Handle outliers**: Identify and address extreme values
5. **More data**: If possible, collect more training examples

---

## Part 6: Making Predictions on New Data

### 🚀 Deploying Your Model

Once your model is validated and performing well on holdout data, you can use it to predict values for completely new observations. This is the ultimate goal of predictive modeling!

**Real-world applications:**
- Estimate property values for new listings
- Price recommendations for sellers
- Investment opportunity analysis
- Property tax assessment updates

### Preparing New Data for Prediction

**Critical**: New data must have the **exact same features** as the training data:
- Same column names
- Same data types
- Same dummy variable structure
- Factors must have the same levels
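
A quick structural check before calling `predict()` can catch missing columns early. The sketch below uses R's built-in `mtcars` data so that it runs on its own; `check_new_data` is a hypothetical helper name, and the two test data frames are made up.

```r
# Illustrative model on R's built-in mtcars data (not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predictor names the model expects (response removed from the formula terms)
required <- all.vars(delete.response(terms(fit)))

# Hypothetical helper: stop with a clear message if columns are missing
check_new_data <- function(new.df, required) {
  missing_cols <- setdiff(required, names(new.df))
  if (length(missing_cols) > 0) {
    stop("New data is missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

good.df <- data.frame(wt = 2.8, hp = 110, cyl = 4)  # extra columns are harmless
bad.df  <- data.frame(wt = 2.8)                     # hp is missing

check_new_data(good.df, required)      # passes silently
try(check_new_data(bad.df, required))  # reports the missing column
```

This only verifies column names; data types and factor levels still need their own checks.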

In [33]:
# ==============================================================================
# PREPARING NEW DATA FOR PREDICTION
# ==============================================================================

# For demonstration, we'll use rows 100-102 from original data
# In practice, this would be brand new properties you want to value
housing.df = mlba::WestRoxbury

# Create "new" data by selecting some rows and removing the target variable
# [100:102, -1] means rows 100-102, all columns EXCEPT column 1 (TOTAL.VALUE)
# We remove TOTAL.VALUE because that's what we're trying to predict!
# Note: TAX is still present here; predict() ignores columns the model formula does not use
new.data = housing.df[100:102, -1] %>%
  # Convert REMODEL to factor with explicit levels
  # CRITICAL: levels must match what the model was trained on!
  mutate(REMODEL = factor(REMODEL, levels = c("None", "Old", "Recent"))) %>%
  # Create dummy variables with same structure as training data
  dummy_cols(select_columns = c('REMODEL'),
             remove_selected_columns = TRUE, 
             remove_first_dummy = TRUE)

# Display the prepared new data
cat("New data to predict:\n")
new.data

New data to predict:


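| TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL_Old | REMODEL_Recent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 3818 | 4200 | 1960 | 2670 | 1710 | 2.0 | 10 | 4 | 1 | 1 | 1 | 1 | 0 | 0 |
| 3791 | 6444 | 1940 | 2886 | 1474 | 1.5 | 6 | 3 | 1 | 1 | 1 | 1 | 0 | 0 |
| 4275 | 5035 | 1925 | 3264 | 1523 | 1.0 | 6 | 2 | 1 | 0 | 1 | 0 | 0 | 1 |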


In [34]:
# ==============================================================================
# MAKING PREDICTIONS ON NEW DATA
# ==============================================================================

# Use predict() with the trained model and new data
# predict(model, newdata) returns a vector of predictions
pred = predict(reg, newdata = new.data)

# Display the predicted property values
# Each value is the model's estimate of TOTAL.VALUE for that property
cat("\nPredicted property values:\n")
pred


Predicted property values:


### 📋 Interpreting Predictions on New Data

**What the output means:**
- Each predicted value is the model's estimate of `TOTAL.VALUE` for that property
- Predictions are based on the property characteristics (LIVING.AREA, BEDROOMS, etc.)
- Values are in thousands of dollars (same units as the target variable)

**Using predictions in practice:**
```r
# Add predictions to the new data
new.data$predicted_value = predict(reg, newdata = new.data)

# Create 95% prediction intervals (columns: fit, lwr, upr)
prediction_interval = predict(reg, newdata = new.data, interval = "prediction")
```
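
`predict()` for `lm` models supports two kinds of interval, and they answer different questions. The sketch below contrasts them on R's built-in `mtcars` data so that it runs on its own; the new observation is hypothetical.

```r
# Illustrative model on R's built-in mtcars data (not the housing data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new.car <- data.frame(wt = 3.0, hp = 120)

# Confidence interval: uncertainty about the AVERAGE response for such cars
conf_int <- predict(fit, newdata = new.car, interval = "confidence", level = 0.95)

# Prediction interval: uncertainty about ONE new observation, so it is wider
pred_int <- predict(fit, newdata = new.car, interval = "prediction", level = 0.95)

# Both share the same point estimate (fit) but differ in lwr and upr
rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])
```

For valuing an individual property, the prediction interval is usually the relevant one, because it includes the scatter of single observations around the regression line.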

### 🏢 Production Deployment Considerations

When deploying ML models to production, consider these critical factors:

| Factor | Key Question | Best Practice |
|--------|--------------|---------------|
| **Monitoring** | How will you track accuracy over time? | Log predictions vs actuals, alert on drift |
| **Retraining** | When should you update the model? | Regular schedule or when performance drops |
| **Fallback** | What if the model fails or produces outliers? | Default values, human review for edge cases |
| **Explainability** | Can you explain predictions? | SHAP values, coefficient interpretation |
| **Bias** | Does the model treat all groups fairly? | Regular fairness audits |
| **Versioning** | How do you track model versions? | MLflow, DVC, or similar tools |

### ⚠️ Common Pitfalls in Production

1. **Feature drift**: Input data distribution changes over time
2. **Concept drift**: Relationship between features and target changes
3. **Missing features**: New data lacks some columns
4. **Novel categories**: New factor levels not seen in training
5. **Scale changes**: Feature ranges differ from training
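
Two of these pitfalls, novel categories and scale changes, can be screened for before scoring a batch. The sketch below is an illustration under stated assumptions: the incoming batch, the "Gut renovation" category, and the training range limits are all made up; only the three REMODEL levels come from this module.

```r
# Levels seen during training (from this module's REMODEL variable)
train_levels <- c("None", "Old", "Recent")

# Hypothetical incoming batch containing a category the model never saw
incoming <- data.frame(
  REMODEL     = c("Old", "None", "Gut renovation"),
  LIVING.AREA = c(1500, 1720, 9800)
)

# Novel categories: values outside the training levels
novel <- setdiff(unique(incoming$REMODEL), train_levels)
novel

# Scale changes: values outside the range seen in training
# (these limits are assumed placeholders, not taken from WestRoxbury)
train_min <- 500
train_max <- 5000
out_of_range <- incoming$LIVING.AREA < train_min | incoming$LIVING.AREA > train_max
incoming[out_of_range, ]
```

Rows flagged by either check are good candidates for the human-review fallback described in the deployment table.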

---

## 📚 Summary: Key Takeaways

### Quick Reference: Data Exploration Functions

| Task | Function | Example |
|------|----------|---------|
| Dimensions | `dim(df)` | `dim(housing.df)` → rows, cols |
| Structure | `str(df)` | Shows types and first values |
| Summary stats | `summary(df)` | Min, Max, Mean, Median, Quartiles |
| First rows | `head(df)` | First 6 rows |
| Last rows | `tail(df)` | Last 6 rows |

### Quick Reference: Data Preprocessing Functions

| Task | Function | Example |
|------|----------|---------|
| Create factor | `factor(x)` | `factor(df$col)` |
| Dummy variables | `dummy_cols()` | Converts categories to 0/1 |
| Handle NA | `replace_na()`, `drop_na()` | Imputation or removal |
| Rebalance classes | `caret::upSample()` | Duplicate minority class |
| Pipe operator | `%>%` | Chain operations cleanly |

### Quick Reference: Model Building Functions

| Task | Function | Example |
|------|----------|---------|
| Data partition | `caret::createDataPartition()` | Stratified train/test split |
| Linear regression | `lm(y ~ ., data)` | `lm(TOTAL.VALUE ~ ., train.df)` |
| Model summary | `summary(model)` | Get R², coefficients, p-values |
| Make predictions | `predict(model, newdata)` | Apply model to new data |
| RMSE | `caret::RMSE()` | Root Mean Squared Error |
| MAE | `caret::MAE()` | Mean Absolute Error |

### ✅ Machine Learning Best Practices Checklist

- [ ] **Always use `set.seed()`** for reproducible results
- [ ] **Never evaluate on training data alone** - always use holdout/test set
- [ ] **Document preprocessing steps** for stakeholders and future reference
- [ ] **Check for overfitting** by comparing training vs holdout performance
- [ ] **Handle missing values** explicitly with documented strategy
- [ ] **Encode categorical variables** appropriately (dummy variables or factors)
- [ ] **Remove target leakage** - features that reveal the target
- [ ] **Stratify partitions** to maintain class proportions
- [ ] **Monitor model performance** in production over time
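
Several checklist items (seeding, holding out data, checking for overfitting) fit into a few lines of R. The sketch below uses simulated data rather than the module's housing dataset, and it uses a plain random split with base R's `sample()` in place of `caret::createDataPartition()`, so it runs without extra packages.

```r
set.seed(1)  # reproducible simulated data and split

# Simulated stand-in for a housing dataset (coefficients are invented)
n <- 200
sim.df <- data.frame(
  LIVING.AREA = runif(n, 800, 3000),
  BEDROOMS    = sample(1:5, n, replace = TRUE)
)
sim.df$TOTAL.VALUE <- 50 + 0.15 * sim.df$LIVING.AREA +
  10 * sim.df$BEDROOMS + rnorm(n, sd = 20)

# 60/40 random split into training and holdout sets
train.rows <- sample(seq_len(n), size = round(0.6 * n))
train.df   <- sim.df[train.rows, ]
holdout.df <- sim.df[-train.rows, ]

# Fit on the training set only
reg <- lm(TOTAL.VALUE ~ ., data = train.df)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
train.rmse   <- rmse(train.df$TOTAL.VALUE,   predict(reg, newdata = train.df))
holdout.rmse <- rmse(holdout.df$TOTAL.VALUE, predict(reg, newdata = holdout.df))

# A holdout RMSE far above the training RMSE would suggest overfitting
print(c(train = train.rmse, holdout = holdout.rmse))
```

For a classification target, a stratified split would be the better choice so that class proportions are preserved in both sets.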

### 🔢 Key Formulas Summary

| Formula | Description |
|---------|-------------|
| $\hat{y} = \beta_0 + \sum_{i=1}^{p} \beta_i x_i$ | Linear Regression |
| $R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ | R-squared (Coefficient of Determination) |
| $R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$ | Adjusted R-squared |
| $\text{RMSE} = \sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$ | Root Mean Squared Error |
| $\text{MAE} = \frac{1}{n}\sum\lvert y_i - \hat{y}_i\rvert$ | Mean Absolute Error |
| $\text{ME} = \frac{1}{n}\sum(y_i - \hat{y}_i)$ | Mean Error (Bias) |
| $\text{MAPE} = \frac{100}{n}\sum\left\lvert\frac{y_i - \hat{y}_i}{y_i}\right\rvert$ | Mean Absolute Percentage Error |
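
To make the formulas concrete, the following sketch computes each metric by hand in base R on four invented actual/predicted pairs. The number of predictors `p` used for adjusted R-squared is an assumption for illustration.

```r
# Invented values for illustration
actual    <- c(200, 250, 400, 500)
predicted <- c(210, 240, 380, 520)

errors <- actual - predicted               # -10, 10, 20, -20

rmse <- sqrt(mean(errors^2))               # sqrt(250), about 15.81
mae  <- mean(abs(errors))                  # 15
me   <- mean(errors)                       # 0: over- and under-predictions cancel
mape <- 100 * mean(abs(errors / actual))   # 4.5 (percent)

# R-squared: 1 minus residual sum of squares over total sum of squares
r2 <- 1 - sum(errors^2) / sum((actual - mean(actual))^2)

# Adjusted R-squared, assuming a model with p = 1 predictor
n <- length(actual)
p <- 1
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(RMSE = rmse, MAE = mae, ME = me, MAPE = mape, R2 = r2, R2_adj = r2_adj))
```

Note that ME is zero here even though every prediction is off, which is why ME is read as a bias measure and not as an accuracy measure.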

---

### 🎓 Next Steps

Apply these techniques to your own datasets! Consider:
1. Loading your own CSV/Excel data with `read.csv()` or `readxl::read_excel()`
2. Exploring different target variables
3. Trying other algorithms: decision trees, random forests, gradient boosting
4. Adding feature engineering: interactions, polynomial terms, transformations
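
As a starting point for item 4, R's formula syntax can express all three kinds of engineered terms directly. The sketch below uses the built-in `mtcars` dataset as a stand-in, since it ships with R; the choice of variables is arbitrary.

```r
# wt * hp     expands to wt + hp + wt:hp (main effects plus interaction)
# I(wt^2)     adds a polynomial (squared) term
# log(disp)   applies a transformation inside the formula
fit <- lm(mpg ~ wt * hp + I(wt^2) + log(disp), data = mtcars)

# Inspect which terms were created
print(names(coef(fit)))
```

Compare such a model against the simpler one on holdout data before keeping the extra terms, since added flexibility can also mean overfitting.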

**Remember**: The goal isn't just to build models, but to build models that **generalize well** and provide **business value**!