In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# Step 1: Reading the File
# Load the Boston Housing Dataset from OpenML using scikit-learn's fetch_openml

boston = fetch_openml(name="boston", version=1)
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # Median house value in $1000s
print("Step 1: Dataset loaded with", df.shape[0], "houses,", df.shape[1] - 1, "features, and the MEDV target column.")

In [None]:
# Step 2: Exploratory Data Analysis (EDA)
pd.set_option('display.width', 1000)
print("\nStep 2: EDA - First few rows of the dataset:")
print(df.head())
print("\nSummary statistics:")
print(df.describe())

# Understanding the `describe()` Output for CRIM

The `describe()` function in pandas gives us a quick summary of the `CRIM` variable from the Boston Housing Dataset. Let’s break it down in simple terms to see what it tells us about crime rates in different areas of Boston and how it might help us predict house values.

## What is CRIM?
- **Definition**: `CRIM` stands for per capita crime rate by town in Boston. It measures how many crimes happen per person in each neighborhood.
- **Relevance**: This can affect house prices—areas with higher crime might have lower values, while safer areas might cost more. As a data analyst, you could use this to advise real estate decisions!

## Summary Statistics
- **count: 506.000000**
  - There are 506 neighborhoods in the dataset, so we have data for every one—no missing values here!
- **mean: 3.613524**
  - The average crime rate across all neighborhoods is about 3.61 crimes per capita. This gives us a baseline to compare others.
- **std: 8.601545**
  - The standard deviation is 8.6, meaning crime rates vary a lot—some areas are much safer or more dangerous than the average.
- **min: 0.006320**
  - The lowest crime rate is 0.0063, a very safe neighborhood—think of a quiet suburb!
- **25%: 0.082045**
  - The 25th percentile shows 25% of areas have a crime rate of 0.08 or lower, still pretty safe.
- **50%: 0.256510**
  - The median (middle value) is 0.26, meaning half the neighborhoods have crime rates below this. It sits far below the mean of 3.61, which tells us a few high-crime areas pull the average up.
- **75%: 3.677083**
  - The 75th percentile is 3.68, so 75% of areas have crime rates at or below this, getting into higher crime zones.
- **max: 88.976200**
  - The highest crime rate is 88.98, an extreme outlier—likely a very rough neighborhood!
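
To see why the mean (3.61) can sit so far above the median (0.26), here is a minimal sketch using Python's built-in `statistics` module. The five rates are made-up illustrative values chosen to echo the summary above; they are not rows from the dataset.

```python
import statistics

# Hypothetical crime rates for five areas (illustrative values only)
rates = [0.01, 0.08, 0.26, 3.68, 88.98]

mean_rate = statistics.mean(rates)      # pulled upward by the one extreme area
median_rate = statistics.median(rates)  # the middle value, unaffected by the extreme

print(f"mean = {mean_rate:.2f}, median = {median_rate:.2f}")
if mean_rate > median_rate:
    print("Mean is above the median: a few high values stretch the average.")
```

When the mean is much larger than the median, as with `CRIM`, the data is skewed by a small number of very high values.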

## What Does This Mean?
- **Variation**: Crime rates range from almost zero to nearly 89, showing big differences across Boston. This variety helps our model learn how crime impacts house prices.
- **Applicability**: As a future data scientist, you could use `CRIM` to predict lower prices in high-crime areas or target safety improvements. We’ll see this in action with our Linear Regression model!
- **Next Steps**: We’ll explore how `CRIM` correlates with `MEDV` (median house value) in the correlation matrix, so stay tuned for that insight.

*Note*: This dataset has historical bias (e.g., `CRIM` ties to socioeconomic factors), which we’ll discuss in Week 4 for fair AI use. For now, focus on the numbers and their story!

In [None]:
# Step 3: Data Cleaning
# Check for missing values and handle them (none in this dataset, but shown as example)
print("\nStep 3: Checking for missing values:")
print(df.isnull().sum())
# If missing, use: df.fillna(df.mean(numeric_only=True), inplace=True) - not needed here

In [None]:
# Step 4: Feature Engineering and Selecting Final Features
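# Create a new feature: average rooms (RM) divided by pupil-teacher ratio (PTRATIO)
# Note: despite its name, RM_PER_HOUSEHOLD is a rooms-to-PTRATIO ratio, not a true per-household count
df['RM_PER_HOUSEHOLD'] = df['RM'] / df['PTRATIO']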
# Select final features: RM (rooms) and RM_PER_HOUSEHOLD as predictors
selected_features = ['RM', 'RM_PER_HOUSEHOLD']
X = df[selected_features]
y = df['MEDV']

# Understanding Step 4: Feature Engineering and Selecting Final Features

This step is like being a chef who tweaks ingredients to make a better recipe! We’re improving our data to help predict house prices more accurately. Let’s break down the code to see what’s happening with the Boston Housing Dataset.

## What is Feature Engineering?
- **Definition**: We create new data pieces (features) from existing ones to give our model more clues.
- **Why It Matters**: Better features lead to smarter predictions, a key skill for data analysts in real estate or sales!

## Breaking Down the Code
- **New Feature: `RM_PER_HOUSEHOLD`**
  - **Code**: `df['RM_PER_HOUSEHOLD'] = df['RM'] / df['PTRATIO']`
  - **What It Does**: Divides the average number of rooms per house (`RM`) by the pupil-teacher ratio (`PTRATIO`, a measure of school crowding).
  - **Meaning**: Despite its name, this is not a literal count of rooms per household. It is the average room count scaled by the pupil-teacher ratio, so the value goes up when houses have more rooms or when classes are smaller. Both tend to go with higher house prices.
  - **Example**: If a neighborhood averages 6 rooms per house and has a PTRATIO of 20, `RM_PER_HOUSEHOLD` is 6 / 20 = 0.3. With the same 6 rooms and a PTRATIO of 15 (smaller classes), it rises to 0.4, possibly a nicer area!
- **Selecting Final Features**
  - **Code**: 
    - `selected_features = ['RM', 'RM_PER_HOUSEHOLD']`
    - `X = df[selected_features]`
    - `y = df['MEDV']`
  - **What It Does**: Picks `RM` (average rooms) and `RM_PER_HOUSEHOLD` as the data we’ll use to predict (`X`), and sets `MEDV` (median house value in $1000s) as what we want to predict (`y`).
  - **Meaning**: We’re telling the model, “Look at rooms and space per household to guess house prices.” This focuses our prediction on key factors, like a detective choosing the best clues!
- **Relevance**: As a future data scientist, you could use this to advise homebuyers or builders on what makes a house valuable.
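
The arithmetic in the example bullet can be checked with a tiny sketch. The helper name `rooms_to_ptratio` is made up for illustration and is not part of the notebook; it only mirrors the division `RM / PTRATIO`.

```python
def rooms_to_ptratio(rm, ptratio):
    """Average rooms divided by pupil-teacher ratio (what the notebook calls RM_PER_HOUSEHOLD)."""
    return rm / ptratio

# Same house size, two different school situations (illustrative numbers)
crowded_classes = rooms_to_ptratio(6, 20)  # 20 pupils per teacher
smaller_classes = rooms_to_ptratio(6, 15)  # 15 pupils per teacher

print(f"PTRATIO 20 -> {crowded_classes:.2f}")
print(f"PTRATIO 15 -> {smaller_classes:.2f}")
```

The ratio grows when rooms go up or when the pupil-teacher ratio goes down, which is why it can carry extra signal beyond `RM` alone.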

## What’s Next?
- We’ll use these features to build our Linear Regression model. Stay tuned to see how they help predict prices—and how well they work!
- *Note*: This dataset has historical bias (e.g., tied to school data), which we’ll explore in Week 4 for fair AI use. For now, focus on the prediction process!


In [None]:
# Step 5: Correlation Matrix
print("\nStep 5: Correlation Matrix:")
correlation_matrix = df[['MEDV', 'RM', 'RM_PER_HOUSEHOLD']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Correlation Matrix")
plt.show()

# Understanding Step 5: Correlation Matrix

This step is like playing detective with numbers! We’re figuring out how our data pieces (features) are connected to house prices and to each other. Let’s break down the code and look at the colorful chart it creates to see what it tells us about the Boston Housing Dataset.

## What is a Correlation Matrix?
- **Definition**: A chart that shows how much one thing (like rooms) relates to another (like house price). The closer to 1 or -1, the stronger the connection!
- **Why It Matters**: Helps data analysts pick the best features for predicting prices, a key skill in real estate or sales jobs!

## Breaking Down the Code
- **Code**: 
  - `print("\nStep 5: Correlation Matrix:")`
    - Tells us we’re starting this step and prints a message.
  - `correlation_matrix = df[['MEDV', 'RM', 'RM_PER_HOUSEHOLD']].corr()`
    - Calculates how `MEDV` (median house value), `RM` (average rooms), and `RM_PER_HOUSEHOLD` (rooms per household) are linked.
    - Like checking if more rooms mean higher prices!
  - `sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)`
    - Creates a heatmap, a color grid showing these connections, with numbers inside.
    - `annot=True` adds the numbers, `cmap='coolwarm'` uses red for positive and blue for negative links.
  - `plt.title("Correlation Matrix")`
    - Labels the chart so we know what it’s about.
  - `plt.show()`
    - Displays the chart for us to see.

## Explaining the Output
- **Image Description**: The heatmap shows a 3x3 grid with `MEDV`, `RM`, and `RM_PER_HOUSEHOLD` on both rows and columns. Colors range from red (positive correlation) to blue (negative correlation), with a scale from -1 to 1 on the right.
- **Key Values**:
  - **MEDV vs. MEDV**: 1.0 (red) – Perfect match, as it’s comparing a value to itself.
  - **MEDV vs. RM**: 0.7 (reddish) – A strong positive link, meaning more rooms often mean higher house prices.
  - **MEDV vs. RM_PER_HOUSEHOLD**: 0.72 (reddish) – Also a strong positive link, suggesting more rooms per household boost prices.
  - **RM vs. RM**: 1.0 (red) – Perfect match, as it’s self-comparison.
  - **RM vs. RM_PER_HOUSEHOLD**: 0.78 (reddish) – A strong positive link, showing rooms and rooms per household move together.
  - **RM_PER_HOUSEHOLD vs. RM_PER_HOUSEHOLD**: 1.0 (red) – Perfect match again.
- **What It Means**:
  - **Positive Correlations**: Red colors (0.7 to 0.78) show that as `RM` or `RM_PER_HOUSEHOLD` increases, `MEDV` tends to increase too. This is good news for our model—it can use these features to predict higher prices!
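  - **No Negative Links**: No blue cells, meaning none of these three variables tends to fall as another rises.
  - **Applicability**: As a data scientist, you could tell a builder that homes with more rooms tend to be valued higher in this data. Keep in mind that correlation alone does not prove that adding rooms causes the higher value.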
- **Next Steps**: We’ll use these strong links to train our Linear Regression model. Watch how they help predict prices!
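
To make the numbers in the heatmap less mysterious, here is a minimal sketch of the Pearson correlation (the default method behind pandas `.corr()`) computed by hand. The room counts and values are hypothetical, and the helper name `pearson` is chosen just for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: how closely two lists move together, from -1 to 1."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (co-movement)
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_movement / (spread_x * spread_y)

rooms = [4, 5, 6, 7, 8]         # hypothetical average room counts
values = [12, 18, 21, 30, 39]   # hypothetical median values in $1000s

r_rooms_values = pearson(rooms, values)            # strong positive link
r_self = pearson(rooms, rooms)                     # a variable vs. itself
r_opposite = pearson(rooms, [-v for v in values])  # same strength, opposite direction

print(f"rooms vs. values: {r_rooms_values:.2f}")
print(f"rooms vs. rooms: {r_self:.2f}")
print(f"rooms vs. flipped values: {r_opposite:.2f}")
```

A variable compared with itself always gives 1, which is why the diagonal of the heatmap is solid red.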

*Note*: This dataset has historical bias (e.g., tied to school and crime data), which we’ll explore in Week 4 for fair AI use. For now, enjoy spotting these patterns!

In [None]:
# Step 6: Create Histograms of Various Features
print("\nStep 6: Histograms of Features:")
df[['MEDV', 'RM', 'RM_PER_HOUSEHOLD']].hist(bins=20, figsize=(10, 6))
plt.suptitle("Histograms of Median Value, Rooms, and Rooms per Household")
plt.show()

# Understanding Step 6: Create Histograms of Various Features

This step is like looking at a picture of how house data is spread out! We’re creating charts to see the range of house prices, rooms, and space per household. Let’s break down the code and explore the colorful graphs it makes with the Boston Housing Dataset.

## What is a Histogram?
- **Definition**: A chart that shows how many houses fall into different value groups, like buckets of prices or room counts.
- **Why It Matters**: Helps data analysts spot trends (e.g., most houses are cheap) or oddities (e.g., very few with lots of rooms), a key skill for predicting prices in real estate!

## Breaking Down the Code
- **Code**: 
  - `print("\nStep 6: Histograms of Features:")`
    - Tells us we’re starting this step and prints a message.
  - `df[['MEDV', 'RM', 'RM_PER_HOUSEHOLD']].hist(bins=20, figsize=(10, 6))`
    - Makes histograms for `MEDV` (median house value), `RM` (average rooms), and `RM_PER_HOUSEHOLD` (rooms per household).
    - `bins=20` splits data into 20 buckets, `figsize=(10, 6)` sets the chart size.
    - Like counting how many houses fit in each price or room range!
  - `plt.suptitle("Histograms of Median Value, Rooms, and Rooms per Household")`
    - Adds a title to the chart so we know what it shows.
  - `plt.show()`
    - Displays the charts for us to see.

## Explaining the Output
- **Image Description**: The output shows three histograms arranged in a grid, one per column. Each chart has blue bars representing how many houses fall into different ranges.
- **Key Observations**:
  - **MEDV (Median Value)**: 
    - X-axis: Median house value in $1000s (from 0 to 50).
    - Y-axis: Number of houses (from 0 to 80).
    - Most houses cluster around $20,000–$30,000, with fewer above $40,000, so many are moderately priced!
  - **RM (Rooms)**: 
    - X-axis: Average number of rooms (from 4 to 9).
    - Y-axis: Number of houses (from 0 to 100).
    - Peaks around 6 rooms, with fewer having 4 or 9—most houses have a typical room count!
  - **RM_PER_HOUSEHOLD (Rooms per Household)**: 
    - X-axis: Rooms per household (from 0 to 0.6).
    - Y-axis: Number of houses (from 0 to 100).
    - Clusters around 0.3–0.4, with fewer at 0 or 0.6—shows a common space-per-family range!
- **What It Means**:
  - **Trends**: The tall bars show where most data piles up (e.g., $20–30K houses, 6 rooms), helping our model focus on common cases.
  - **Outliers**: Short bars at the edges (e.g., $50K, 9 rooms) indicate rare, expensive, or large houses.
  - **Applicability**: As a data scientist, you could use this to target marketing for mid-priced homes or spot luxury markets. We’ll use these patterns to predict prices next!
- **X and Y Axes**:
  - **X-axis**: The range of values for each feature (e.g., $1000s for `MEDV`, rooms for `RM`, rooms per household for `RM_PER_HOUSEHOLD`).
  - **Y-axis**: The count of houses in each range, showing how common those values are.
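
The bucket counting behind a histogram can be sketched in plain Python. This is a simplified illustration, assuming equal-width buckets and values that lie between `low` and `high`; the helper `bucket_counts` and the sample room counts are hypothetical.

```python
def bucket_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width buckets."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = int((v - low) / width)
        index = min(index, bins - 1)  # a value equal to `high` goes in the last bucket
        counts[index] += 1
    return counts

# Hypothetical average room counts for eight areas
room_counts = [5.1, 5.9, 6.0, 6.2, 6.4, 6.8, 7.5, 8.7]

# Five buckets covering 4 to 9 rooms: 4-5, 5-6, 6-7, 7-8, 8-9
counts = bucket_counts(room_counts, low=4, high=9, bins=5)
for start, count in zip(range(4, 9), counts):
    print(f"{start}-{start + 1} rooms: {'#' * count} ({count})")
```

Each bar height in the notebook's charts is one of these counts; `bins=20` simply uses twenty narrower buckets.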

## What’s Next?
- We’ll use these patterns to train our Linear Regression model. Get ready to see how they help guess house prices!
- *Note*: This dataset has historical bias (e.g., tied to school and crime data), which we’ll explore in Week 4 for fair AI use. For now, enjoy these charts!


In [None]:
# Step 7: How to Create Training, Test, and Validation Data
# Split into training (70%), test (15%), and validation (15%)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.1765, random_state=42)  # 0.15/(0.85) ≈ 0.1765
print("\nStep 7: Data Split - Training:", X_train.shape, "Test:", X_test.shape, "Validation:", X_val.shape)

# Understanding Step 7: How to Create Training, Test, and Validation Data

This step is like dividing a big study group into practice, quiz, and final exam teams! We’re splitting our house data into parts to train our model, test it, and check it with fresh eyes. Let’s break down the code to see how we do this with the Boston Housing Dataset.

## What is Data Splitting?
- **Definition**: We divide our data into three groups: training (to learn), test (to check), and validation (to fine-tune).
- **Why It Matters**: Helps data analysts build a fair model that works well on new data, a key skill for jobs like predicting sales or house prices!

## Breaking Down the Code
- **Code**: 
  - `# Split into training (70%), test (15%), and validation (15%)`
    - Tells us the plan: 70% for learning, 15% for testing, 15% for validating.
  - `X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.15, random_state=42)`
    - Splits all data (`X` is features, `y` is house prices) into a big training set (`X_train_full`, `y_train_full`) and a test set (`X_test`, `y_test`).
    - `test_size=0.15` means 15% goes to test, leaving 85% for training.
    - `random_state=42` keeps the split the same each time, like using a fixed shuffle.
  - `X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.1765, random_state=42)`
    - Takes the 85% training data and splits it again: about 70% of the original data goes to `X_train` and `y_train`, and about 15% of the original goes to `X_val` and `y_val`.
    - `test_size=0.1765` is calculated as 0.15 / 0.85: the validation set should be 15% of the original data, which is about 17.65% of the remaining 85%. This makes the final split 70-15-15.
  - `print("\nStep 7: Data Split - Training:", X_train.shape, "Test:", X_test.shape, "Validation:", X_val.shape)`
    - Shows the size of each group (e.g., number of houses in each set) so we know what we’re working with.

## What Does This Mean?
- **Training Data (70%)**: The model learns from this big group, like practicing with most of the study notes.
- **Test Data (15%)**: Held back until the end to check how well the finished model predicts on data it has never seen, like the final exam.
- **Validation Data (15%)**: Used while building the model to compare options and tweak settings so it does not overfit, like a practice quiz before the exam.
- **Applicability**: As a data scientist, you’d use this to build a house price predictor that works on new neighborhoods, not just old data!
- **Example Output**: If the dataset has 506 houses, you should see Training: (354, 2), Test: (76, 2), Validation: (76, 2). The sizes depend only on the row count and `test_size`; `random_state` only controls which houses land in each set.
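
The set sizes can be worked out by hand. This sketch assumes that `train_test_split` rounds the held-out share up to a whole number of rows and gives the rest to training; the helper `split_sizes` is hypothetical and only mirrors the two calls in the code cell.

```python
import math

def split_sizes(n_rows, test_share=0.15, val_share_of_rest=0.1765):
    """Mirror the two-step split: carve off test first, then validation from what is left."""
    n_test = math.ceil(n_rows * test_share)        # first split: held-out test rows
    n_rest = n_rows - n_test                       # rows left for training + validation
    n_val = math.ceil(n_rest * val_share_of_rest)  # second split: validation rows
    n_train = n_rest - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(506)
print(f"Training: {n_train}, Validation: {n_val}, Test: {n_test}")
print(f"Shares: {n_train / 506:.1%} / {n_val / 506:.1%} / {n_test / 506:.1%}")
```

Because 0.1765 is applied to the remaining 85% of the rows, the validation set ends up the same size as the test set.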

## What’s Next?
- We’ll use these data groups to train our Linear Regression model and see how well it predicts. Get ready to test your skills!
- *Note*: This dataset has historical bias (e.g., tied to school and crime data), which we’ll explore in Week 4 for fair AI use. For now, focus on splitting the data right!


In [None]:
X_test.head()  # Display first few rows of test data

In [None]:
# Step 8: Model Training and Selection
# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Train Bagging (Random Forest)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)  # 100 trees
rf_model.fit(X_train, y_train)

# Train Boosting (Gradient Boosting)
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)

print("\nStep 8: Models trained with features:", selected_features)

# Understanding Step 8: Model Training and Selection

This step is like teaching a group of helpers how to guess house prices! We’re training different models—Linear Regression, Bagging, and Boosting—using our split data. Let’s break down the code to see how we teach them and pick the best one with the Boston Housing Dataset.

## What is Model Training and Selection?
- **Definition**: We show our data to models so they can learn to predict house prices, then choose the one that does the best job.
- **Why It Matters**: Helps data analysts build tools to forecast prices or detect trends, a key skill for jobs in real estate or finance!

## Breaking Down the Code
- **Code**: 
  - `# Train Linear Regression`
    - `lr_model = LinearRegression()`
      - Creates a simple model that draws a straight line to predict prices.
      - Like a basic calculator guessing based on a trend!
    - `lr_model.fit(X_train, y_train)`
      - Teaches the model using the training data (`X_train` is features like rooms, `y_train` is prices).
      - Like practicing with study notes to get better at guessing.
  - `# Train Bagging (Random Forest)`
    - `rf_model = RandomForestRegressor(n_estimators=100, random_state=42)`
      - Makes a Bagging model with 100 “tree” helpers, each looking at different data parts.
      - `random_state=42` keeps the training consistent, like using the same study group.
      - Like asking 100 friends to guess and averaging their answers!
    - `rf_model.fit(X_train, y_train)`
      - Teaches all 100 trees with the training data.
      - Like training a team to vote on the best price guess.
  - `# Train Boosting (Gradient Boosting)`
    - `gb_model = GradientBoostingRegressor(random_state=42)`
      - Makes a Boosting model where each helper fixes the last one’s mistakes.
      - Like a coach training runners, improving step by step!
    - `gb_model.fit(X_train, y_train)`
      - Teaches the model with the training data, building on errors.
      - Like practicing harder on tough questions to get smarter.
  - `print("\nStep 8: Models trained with features:", selected_features)`
    - Shows which features (e.g., `RM`, `RM_PER_HOUSEHOLD`) the models used.
    - Like listing the clues the helpers learned from!

## What Does This Mean?
- **Linear Regression**: Learns a straight-line rule, good for simple patterns.
- **Bagging (Random Forest)**: Uses a team of 100 trees to average guesses, reducing mistakes.
- **Boosting (Gradient Boosting)**: Builds a stronger model by fixing errors, step by step.
- **Applicability**: As a data scientist, you could use these to predict house prices or sales, helping businesses make smart choices!
- **Example**: If `RM = 6` and `RM_PER_HOUSEHOLD = 0.5`, each model will guess a price—we’ll compare them next!
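
The idea that each Boosting helper "fixes the last one's mistakes" can be sketched in plain Python. This is a toy illustration, not how `GradientBoostingRegressor` is implemented: it uses one feature, one-split "stumps" as helpers, and hypothetical data. The names `fit_stump` and `boost` are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the smallest x so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the average, then let each stump correct the remaining errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still wrong
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

rooms = [4, 5, 6, 7, 8]         # hypothetical feature values
values = [12, 18, 21, 30, 39]   # hypothetical prices in $1000s

model = boost(rooms, values)
average = sum(values) / len(values)
baseline_mse = mse(values, [average] * len(values))
boosted_mse = mse(values, [model(x) for x in rooms])
print(f"Always guessing the average: MSE {baseline_mse:.2f}")
print(f"After 20 boosting rounds:    MSE {boosted_mse:.2f}")
```

Each round looks only at the errors left over from the previous rounds, which is the "fix the last helper's mistakes" behavior described above.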

## What’s Next?
- We’ll check how well these models predict by looking at their scores and charts. Get ready to see which helper does the best job!
- *Note*: This dataset has historical bias (e.g., tied to school and crime data), which we’ll explore in Week 4 for fair AI use. For now, focus on training these models!


# Student Notes: Understanding Random Forest

Welcome to Week 2, Chapter 1! Today, we’re diving into **Random Forest**, a cool tool that helps us predict numbers or categories using a team approach. Let’s explore what it is, why it’s a regressor, and how it stacks up against Linear Regression from last week. These skills will make you ready for jobs like data scientist or analyst!

## What is Random Forest?
- **Definition**: Random Forest is a type of machine learning model that uses lots of “decision trees” working together, like a forest of helpers.
- **Simple Analogy**: Imagine you’re guessing how many candies are in a jar at a fair. You ask 100 friends to guess, but each friend looks at a different mix of clues (e.g., jar size, color). Then, you take the average of their guesses to get a smarter answer. That’s Random Forest—lots of trees vote, and we average their predictions!
- **Why It’s Cool**: It’s part of **Bagging** (Bootstrap Aggregating), which we talked about, making it great for reducing mistakes by teamwork.

## Why is Random Forest a Regressor?
- **Regressor Role**: A regressor predicts numbers, like house prices or sales amounts, instead of categories (which is a classifier).
- **How It Works as a Regressor**: Each tree in the forest guesses a number based on the data (e.g., rooms and space per household). The final prediction is the average of all tree guesses, giving us a single number.
- **Example**: If you give it data like “6 rooms and 0.5 rooms per household,” it might predict $22,000, averaging the trees’ votes.
- **Job Connection**: As a data scientist, you’d use this to forecast things like next month’s revenue or a house’s value!
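
The "ask many helpers and average" idea can be sketched with the standard library. This is a deliberately simplified illustration of Bagging, not the real Random Forest algorithm: each helper here just guesses the average of its own random resample instead of growing a decision tree. The prices and the helper name `bagged_prediction` are hypothetical.

```python
import random

def bagged_prediction(prices, n_helpers=100, seed=42):
    """Average the guesses of many helpers, each trained on its own bootstrap sample."""
    rng = random.Random(seed)  # fixed seed, like random_state=42
    guesses = []
    for _ in range(n_helpers):
        # Bootstrap: draw as many prices as we have, with replacement
        sample = [rng.choice(prices) for _ in prices]
        # Toy helper: guess the mean of its sample (a real forest grows a tree here)
        guesses.append(sum(sample) / len(sample))
    return sum(guesses) / len(guesses), guesses

prices = [18.0, 21.5, 22.0, 24.3, 30.1]  # hypothetical median values in $1000s
final_guess, guesses = bagged_prediction(prices)

print(f"Individual guesses range from {min(guesses):.2f} to {max(guesses):.2f}")
print(f"Averaged guess: {final_guess:.2f}")
```

Individual helpers disagree because each saw a different resample, but their average is steadier than any single guess.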

## Comparing Random Forest to Linear Regression
- **Linear Regression (From Week 1)**:
  - Draws a straight line through data points to predict numbers.
  - Like using a ruler to guess house prices based on rooms alone.
  - Good for simple patterns but can miss twists in the data.
  - Example: Might predict $21,000 for 6 rooms, assuming a steady increase.
- **Random Forest**:
  - Uses a forest of trees, each looking at different data parts, then averages.
  - Like asking a team of friends with different clues to guess, getting a balanced answer.
  - Handles complex patterns (e.g., if more rooms don’t always mean higher prices due to location).
  - Example: Might predict $22,500 for 6 rooms, adjusting for other factors.
- **Key Differences**:
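  - **Accuracy**: Random Forest often beats Linear Regression when the data has tricky, non-linear patterns, but this is not guaranteed. With only two features in this notebook, the Step 9 scores are mixed: Linear Regression scores higher on the test set, Random Forest on the validation set.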
  - **Flexibility**: Linear Regression needs a straight-line fit, while Random Forest adapts to curves and outliers.
  - **Speed**: Linear Regression is faster to train, but Random Forest takes more time with 100 trees—worth it for better results!
- **When to Use**: Use Linear Regression for quick, simple guesses (e.g., basic sales trends). Use Random Forest for tougher jobs (e.g., predicting house prices with many factors), a skill for real-world data science roles.

## Why Learn This?
- **Applicability**: Random Forest helps businesses predict sales, detect fraud, or set house prices accurately—skills you can use in jobs!
- **Motivation**: Next week, we’ll see how Boosting builds on this. Check the OpenClass notes for a demo code to try it yourself!
- *Note*: The Boston dataset has bias (e.g., crime data), which we’ll tackle in Week 4. For now, enjoy exploring these predictions!


In [None]:
# Step 9: Model Performance Metrics and Graphs
# Predict for all models
lr_test_pred = lr_model.predict(X_test)
lr_val_pred = lr_model.predict(X_val)
rf_test_pred = rf_model.predict(X_test)
rf_val_pred = rf_model.predict(X_val)
gb_test_pred = gb_model.predict(X_test)
gb_val_pred = gb_model.predict(X_val)

In [None]:
# Metrics for Linear Regression
lr_test_mse = mean_squared_error(y_test, lr_test_pred)
lr_val_mse = mean_squared_error(y_val, lr_val_pred)
lr_test_r2 = r2_score(y_test, lr_test_pred)
lr_val_r2 = r2_score(y_val, lr_val_pred)

In [None]:
# Metrics for Random Forest (Bagging)
rf_test_mse = mean_squared_error(y_test, rf_test_pred)
rf_val_mse = mean_squared_error(y_val, rf_val_pred)
rf_test_r2 = r2_score(y_test, rf_test_pred)
rf_val_r2 = r2_score(y_val, rf_val_pred)

In [None]:
# Metrics for Gradient Boosting (Boosting)
gb_test_mse = mean_squared_error(y_test, gb_test_pred)
gb_val_mse = mean_squared_error(y_val, gb_val_pred)
gb_test_r2 = r2_score(y_test, gb_test_pred)
gb_val_r2 = r2_score(y_val, gb_val_pred)

In [None]:
print("\nStep 9: Performance Metrics:")
print("Linear Regression - Test MSE: {:.2f}, R-squared: {:.2f}".format(lr_test_mse, lr_test_r2))
print("Linear Regression - Validation MSE: {:.2f}, R-squared: {:.2f}".format(lr_val_mse, lr_val_r2))
print("Random Forest (Bagging) - Test MSE: {:.2f}, R-squared: {:.2f}".format(rf_test_mse, rf_test_r2))
print("Random Forest (Bagging) - Validation MSE: {:.2f}, R-squared: {:.2f}".format(rf_val_mse, rf_val_r2))
print("Gradient Boosting (Boosting) - Test MSE: {:.2f}, R-squared: {:.2f}".format(gb_test_mse, gb_test_r2))
print("Gradient Boosting (Boosting) - Validation MSE: {:.2f}, R-squared: {:.2f}".format(gb_val_mse, gb_val_r2))

# Student Notes: Understanding Step 9 - Model Performance Metrics

Welcome to Step 9 in Week 1, Chapter 3, and Week 2, Chapter 1! We’ve trained our models—Linear Regression, Random Forest (Bagging), and Gradient Boosting (Boosting)—to predict house prices using the Boston Housing Dataset. Now, let’s check how good they are at guessing, like grading a test! We’ll use simple analogies to make this clear and see how they compare.

## What Are Performance Metrics?
- **Definition**: Performance metrics are scores that tell us how well our models predict house prices. They help us decide which model is the best helper!
- **Why It Matters**: As a data scientist, you’ll use these scores to pick the right tool for jobs like setting house prices or forecasting sales, making your work valuable to businesses!

## Key Metrics Explained
- **Mean Squared Error (MSE)**:
  - **Definition**: Measures the average squared difference between the model’s guesses and the real house prices. Smaller MSE means better predictions.
  - **Analogy**: Imagine you’re throwing darts at a target. MSE is like counting how far off each dart is from the bullseye, squaring those distances, and averaging them. The closer to zero, the better your aim!
- **R-squared (R²)**:
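  - **Definition**: Shows how much of the house price variation our model explains. A score of 1 is a perfect fit, 0 means no better than always guessing the average price, and on new data the score can even drop below 0 if the model does worse than that. Higher R² means better fit.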
  - **Analogy**: Think of R² as a grade out of 100% on how well your model captures the ups and downs of house prices, like a weather forecast predicting sunny days. The closer to 1, the more accurate!
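
Both metrics can be computed by hand to see what they measure. This is a minimal sketch with made-up prices; the helper names are chosen for illustration, and the formulas follow the definitions above.

```python
def mean_squared_error_by_hand(actual, predicted):
    """Average of the squared gaps between real and guessed values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared_by_hand(actual, predicted):
    """1 minus (model's squared error / squared error of always guessing the average)."""
    average = sum(actual) / len(actual)
    model_error = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    average_error = sum((a - average) ** 2 for a in actual)
    return 1 - model_error / average_error

actual = [20, 25, 30]        # hypothetical real prices in $1000s
good_guess = [22, 24, 27]    # close to the real prices
bad_guess = [30, 25, 20]     # trend is backwards

mse_good = mean_squared_error_by_hand(actual, good_guess)
r2_good = r_squared_by_hand(actual, good_guess)
r2_bad = r_squared_by_hand(actual, bad_guess)

print(f"Good guesses: MSE = {mse_good:.2f}, R-squared = {r2_good:.2f}")
print(f"Backwards guesses: R-squared = {r2_bad:.2f}")
```

The backwards guesses give a negative R², which is what happens when a model does worse than simply guessing the average.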

## Breaking Down the Metrics
- **Linear Regression**:
  - **Test MSE: 18.20, R-squared: 0.72**
    - Test: MSE is 18.20 in squared $1000s. Taking the square root gives a typical error of about 4.27, or roughly $4,270 per prediction. Decent aim!
    - R-squared: Explains 72% of price changes, like getting a B+ on understanding the data.
  - **Validation MSE: 46.36, R-squared: 0.42**
    - Validation: MSE jumps to 46.36 on the validation set.
    - R-squared: Drops to 42%, like a C. Both sets are unseen data, so a gap this large suggests the score depends heavily on which houses land in each small set.
- **Random Forest (Bagging)**:
  - **Test MSE: 23.73, R-squared: 0.64**
    - Test: MSE is 23.73, a bit higher (worse) than Linear Regression’s 18.20 on this set.
    - R-squared: Explains 64% of changes, like a B-, still good but not perfect.
  - **Validation MSE: 32.16, R-squared: 0.60**
    - Validation: Errors are 32.16, better than Linear Regression’s 46.36.
    - R-squared: Holds at 60%, like a steady B—consistent with new data!
- **Gradient Boosting (Boosting)**:
  - **Test MSE: 26.78, R-squared: 0.59**
    - Test: MSE is 26.78, the highest of the three models on this set.
    - R-squared: Explains 59% of changes, like a C+, showing room to grow.
  - **Validation MSE: 32.34, R-squared: 0.59**
    - Validation: Errors are 32.34, similar to Random Forest.
    - R-squared: Stays at 59%, like a steady C+. The score barely moves between the two sets.

## What Does This Mean?
- **Linear Regression**: Scores well on the test set (72% R²) but much lower on the validation set (42% R²). Neither set was used for training, so this is not memorization; it shows that the straight-line model’s score swings a lot between two small samples. Good for simple patterns but misses complex twists.
- **Random Forest (Bagging)**: Less accurate on test (64% R²) but holds up better on validation (60% R²), like a team averaging guesses to stay steady. Great for balancing errors!
- **Gradient Boosting (Boosting)**: Lowest test accuracy (59% R²) but matches validation (59% R²), like a coach fixing mistakes over time. It is consistent here, and since default settings were used, tuning might improve it.
- **Applicability**: As a data scientist, you’d lean toward Random Forest or Gradient Boosting here because their scores are more consistent across the two sets, while noting that Linear Regression had the best single test score.
- **Comparison**: Linear Regression is fast but limited; Random Forest and Boosting handle complexity better, skills you’ll use in real jobs!
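
MSE is in squared units, which is hard to picture. Taking the square root (RMSE) turns each score back into $1000s. The sketch below uses only the MSE values quoted in these notes; the dictionary name `reported_mse` is made up for illustration.

```python
import math

# MSE values quoted in the notes above (units: squared $1000s)
reported_mse = {
    "Linear Regression (test)": 18.20,
    "Linear Regression (validation)": 46.36,
    "Random Forest (test)": 23.73,
    "Random Forest (validation)": 32.16,
    "Gradient Boosting (test)": 26.78,
    "Gradient Boosting (validation)": 32.34,
}

# Square root brings the error back to $1000s; multiply by 1000 for dollars
rmse_dollars = {name: math.sqrt(mse) * 1000 for name, mse in reported_mse.items()}

for name, dollars in rmse_dollars.items():
    print(f"{name}: typical error of about ${dollars:,.0f}")
```

Read this way, the models are typically off by a few thousand dollars per prediction, which is easier to explain to a client than a squared number.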

## What’s Next?
- We’ll use these scores to pick the best model and predict future house prices. Check the OpenClass notes for charts and try predicting yourself!
- *Note*: This dataset has historical bias (e.g., tied to crime data), which we’ll address in Week 4. For now, focus on these helpful scores!


In [None]:
# Visualize predictions vs. actual values (Linear Regression shown; uncomment the lines below to add the ensembles)
plt.figure(figsize=(12, 8))
plt.scatter(y_test, lr_test_pred, color='blue', label='Linear Regression Test')
plt.scatter(y_val, lr_val_pred, color='lightblue', label='Linear Regression Validation')
# plt.scatter(y_test, rf_test_pred, color='green', label='Random Forest Test')
# plt.scatter(y_val, rf_val_pred, color='lightgreen', label='Random Forest Validation')
# plt.scatter(y_test, gb_test_pred, color='red', label='Gradient Boosting Test')
# plt.scatter(y_val, gb_val_pred, color='pink', label='Gradient Boosting Validation')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel("Actual Median Value ($1000s)")
plt.ylabel("Predicted Median Value ($1000s)")
plt.title("Actual vs. Predicted Values (Linear Regression)")
plt.legend()
plt.show()


In [None]:

# Interactive Prediction for Given Values
print("\nInteractive Prediction: Enter values for RM and RM_PER_HOUSEHOLD to predict MEDV")
while True:
    try:
        rm = float(input("Enter average number of rooms (RM, e.g., 6.0): "))
        rm_per_household = float(input("Enter rooms per household (RM_PER_HOUSEHOLD, e.g., 0.5): "))
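        # Use a DataFrame with the training column names so scikit-learn does not warn about missing feature names
        new_data = pd.DataFrame([[rm, rm_per_household]], columns=selected_features)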

        # Predict with all models
        lr_pred = lr_model.predict(new_data)[0]
        rf_pred = rf_model.predict(new_data)[0]
        gb_pred = gb_model.predict(new_data)[0]

        print(f"Linear Regression Predicted Median Value: ${lr_pred:.2f}K")
        print(f"Random Forest (Bagging) Predicted Median Value: ${rf_pred:.2f}K")
        print(f"Gradient Boosting (Boosting) Predicted Median Value: ${gb_pred:.2f}K")

        # Compare with the dataset row whose RM is closest to the input
        closest_row = df.loc[(df['RM'] - rm).abs().idxmin()]
        actual_value = closest_row['MEDV']  # MEDV is already in $1000s, so no conversion is needed
        print(f"Closest Actual Value: ${actual_value:.2f}K (based on RM={closest_row['RM']:.2f})")
        print(f"Differences: LR: ${(lr_pred - actual_value):.2f}K, RF: ${(rf_pred - actual_value):.2f}K, GB: ${(gb_pred - actual_value):.2f}K")

        # What-if example: predict again with both inputs increased by 10%
        future_rm = rm * 1.1
        future_rm_per_household = rm_per_household * 1.1
        future_data = pd.DataFrame([[future_rm, future_rm_per_household]], columns=selected_features)
        lr_future = lr_model.predict(future_data)[0]
        rf_future = rf_model.predict(future_data)[0]
        gb_future = gb_model.predict(future_data)[0]
        print(f"What-If Value (both inputs increased by 10%) - LR: ${lr_future:.2f}K, RF: ${rf_future:.2f}K, GB: ${gb_future:.2f}K")

        if input("Continue predicting? (yes/no): ").strip().lower() != 'yes':
            break
    except ValueError:
        print("Please enter valid numbers!")

# Student Notes: Understanding Model Predictions

Welcome to an exciting part of Week 1, Chapter 3, and Week 2, Chapter 1! We’ve trained our models—Linear Regression, Random Forest (Bagging), and Gradient Boosting (Boosting)—and now we’re testing them with real data from the Boston Housing Dataset. Let’s look at their predictions for a house with 6 rooms and 0.5 rooms per household, compare them to the actual price, and imagine future values. We’ll use simple analogies to make this fun and clear!

## What Are We Doing Here?
- **Definition**: We’re asking our models to guess a house’s price based on features like rooms and space per household, then checking how close they get to the real price. We’ll also peek at what prices might be in the future!
- **Why It Matters**: As a data scientist, you’ll use these predictions to help real estate agents set prices or businesses plan budgets—a super useful skill!

## Breaking Down the Predictions
- **Input Data**: We gave the models a house with an average of 6 rooms (`RM`) and 0.5 rooms per household (`RM_PER_HOUSEHOLD`).
- **Analogies**: Think of this like asking three friends to guess the weight of a dog. Each friend uses a different method, and we’ll see who’s closest!

### Model Predictions
- **Linear Regression Predicted Median Value: $29.22K**
  - Guesses $29,220 based on a straight-line trend.
  - Like a friend using a ruler to estimate weight from the dog’s length: simple, but it can miss curves in the data!
- **Random Forest (Bagging) Predicted Median Value: $22.77K**
  - Guesses $22,770 by averaging 100 tree helpers.
  - Like a team of friends pooling their guesses after looking at different angles—steady and balanced!
- **Gradient Boosting (Boosting) Predicted Median Value: $23.80K**
  - Guesses $23,800 by fixing mistakes step-by-step.
  - Like a coach training friends to improve their guess with each try—smart but still learning!

### Closest Actual Value
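- **Closest Actual Value: $0.02K (based on RM=6.00)**
  - Read literally, this says a similar house sold for just $20. That is a unit error, not an outlier: `MEDV` is already stored in $1000s, and $0.02K is what appears when it is divided by 1000 a second time.
  - Like reading the dog’s weight off a scale set to the wrong unit: the dog is fine, the label is wrong!
  - *Note*: Treat the real figure as roughly $20K. The printout is rounded to two decimals, so the exact `MEDV` of the matched row cannot be recovered from it.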
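
The nearest-row lookup behind this output can be sketched on a small table. This is a minimal illustration, assuming a hypothetical `toy_df` with made-up values in place of the full dataset; the helper name `closest_actual_value` is also an assumption, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the full DataFrame `df`.
toy_df = pd.DataFrame({
    "RM": [5.2, 6.0, 6.8, 7.4],
    "MEDV": [17.5, 20.0, 28.1, 35.2],  # already in $1000s, like the real column
})

def closest_actual_value(frame, rm):
    """Return (RM, MEDV) for the row whose RM is nearest to `rm`."""
    nearest_label = (frame["RM"] - rm).abs().idxmin()  # an index label, so use .loc
    row = frame.loc[nearest_label]
    return float(row["RM"]), float(row["MEDV"])  # no division by 1000

closest_rm, closest_medv = closest_actual_value(toy_df, 6.1)
print(f"Closest Actual Value: ${closest_medv:.2f}K (based on RM={closest_rm:.2f})")
```

Keeping `MEDV` in its stored unit means the printed value can be compared directly with the model predictions, which are also in $1000s.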

### Differences
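- **Differences: LR: $29.19K, RF: $22.75K, GB: $23.78K**
  - These printed figures subtract the mis-scaled $0.02K, so each one is almost the whole prediction and does not measure real error.
  - Against an actual value of roughly $20K, Linear Regression is off by about $9.2K ($29.22K - $20K).
  - Random Forest is off by about $2.8K ($22.77K - $20K).
  - Gradient Boosting is off by about $3.8K ($23.80K - $20K).
  - Analogy: It’s like each friend guessing too heavy. Linear Regression overshot the most, while Random Forest and Boosting were closer but still high!
  - Meaning: All three models overestimated this house. One house is too little evidence to rank them; the test-set metrics are the fair comparison.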
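
The error of each model can be recomputed by hand once everything is in the same unit. This is a small sketch that uses the predictions quoted in these notes and assumes an actual value of 20.0 (that is, $20K); the exact figure is not recoverable from the rounded printout, so treat the results as approximate.

```python
# Predictions quoted in these notes, in $1000s.
predictions = {"LR": 29.22, "RF": 22.77, "GB": 23.80}

# Assumption: the closest actual value is about 20.0 ($20K).
assumed_actual = 20.0

# Signed error per model: positive means the model guessed too high.
differences = {
    name: round(pred - assumed_actual, 2)
    for name, pred in predictions.items()
}

for name, diff in differences.items():
    print(f"{name}: off by about ${diff:.2f}K")
```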

### Expected Future Value (10% more rooms)
- **Expected Future Value (10% more rooms) - LR: $34.88K, RF: $30.18K, GB: $27.40K**
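  - Linear Regression predicts $34,880 if rooms increase to 6.6 and rooms per household to 0.55, up $5.66K from $29.22K.
  - Random Forest predicts $30,180 with the same increase, up $7.41K from $22.77K. This is the largest jump of the three.
  - Gradient Boosting predicts $27,400, up $3.60K from $23.80K. This is the smallest.
  - Analogy: It’s like asking friends to guess the dog’s weight after it grows 10%. Random Forest revises its guess the most, Linear Regression moves along its straight line, and Boosting adjusts the least.
  - Meaning: This is a what-if scenario (the same models given larger inputs), not a forecast over time. Linear Regression changes by the same amount for every equal step in the inputs, while tree-based models change in steps, so their response depends on which training houses sit near the new inputs.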
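
A quick arithmetic check of the what-if output, using only the values printed above. The variable names are illustrative, not from the notebook.

```python
# Baseline and what-if predictions quoted in these notes, in $1000s.
baseline = {"LR": 29.22, "RF": 22.77, "GB": 23.80}
what_if = {"LR": 34.88, "RF": 30.18, "GB": 27.40}

changes = {}
for name in baseline:
    delta = what_if[name] - baseline[name]   # absolute change in $1000s
    pct = 100 * delta / baseline[name]       # change relative to the baseline
    changes[name] = (round(delta, 2), round(pct, 1))
    print(f"{name}: +${delta:.2f}K ({pct:.1f}%)")

# Model whose prediction moved the most in absolute terms.
largest_jump = max(changes, key=lambda name: changes[name][0])
print("Largest jump:", largest_jump)
```

Comparing the change, not only the final number, shows which model reacts most strongly to the larger inputs.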

## What Does This Mean?
- **Linear Regression**: Gives the highest estimate ($29.22K vs. roughly $20K) and rises $5.66K in the what-if scenario ($34.88K), like a simple guesser that follows one straight trend and misses complex twists.
- **Random Forest (Bagging)**: Closer to the roughly $20K reference ($22.77K), but it shows the largest what-if jump ($30.18K, up $7.41K), like a team whose pooled guess can shift a lot when the question moves to unfamiliar ground.
- **Gradient Boosting (Boosting)**: Also close ($23.80K) and the most moderate in the what-if scenario ($27.40K, up $3.60K), like a coach refining guesses step by step.
- **Applicability**: As a data scientist, you’d choose between these models using error on the test set, not a single prediction. Ensembles are a common choice when the relationship is not a straight line.
- **Comparison**: Linear Regression is fast and easy to interpret but was furthest off for this one house; ensembles can capture non-linear patterns, skills you’ll use in real-world projects!

## What’s Next?
- We’ll explore why these differences happen and how to improve them. Check the OpenClass notes for more details and try your own predictions!
- *Note*: This dataset carries historical bias (for example, its `B` feature is derived from the racial composition of each town), which we’ll address in Week 4. For now, enjoy testing these models!
