# Day 6: Introduction to Machine Learning - Linear Regression

Welcome to your first real Machine Learning lesson! Today we'll learn the fundamentals of ML and implement our first predictive model using **Linear Regression**.

### What is Machine Learning?
- Teaching computers to learn patterns from data without being explicitly programmed.
- Instead of writing rules, we let the algorithm **discover rules** from examples.
- The model usually improves as it sees more (and more representative) data.

### Topics Covered:
1. **The ML Pipeline**
2. **Types of Machine Learning**
3. **Linear Regression Theory**
4. **Train/Test Split**
5. **Building Your First Model**
6. **Loss Functions & Evaluation Metrics**
7. **Making Predictions**
8. **Mini Project: House Price Prediction**

In [None]:
# Import essential libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-Learn imports
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Settings
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
np.random.seed(42)

print("Libraries loaded successfully!")
print("Ready to start your ML journey!")

## 1. The Machine Learning Pipeline

Every ML project follows a similar workflow:

```
Data Collection → Data Preprocessing → Feature Engineering → Train/Test Split → 
Model Training → Evaluation → Prediction → Deployment
```

Let's understand each step!

In [None]:
# Visual representation of the ML Pipeline
pipeline_steps = [
    ('1. Data Collection', 'Gather raw data from sources'),
    ('2. Data Cleaning', 'Handle missing values, remove duplicates'),
    ('3. Feature Engineering', 'Create/select meaningful features'),
    ('4. Train/Test Split', 'Divide data for training and evaluation'),
    ('5. Model Training', 'Fit the algorithm on training data'),
    ('6. Evaluation', 'Measure model performance on test data'),
    ('7. Prediction', 'Make predictions on new, unseen data'),
]

print("THE MACHINE LEARNING PIPELINE")
print("=" * 60)
for step, description in pipeline_steps:
    print(f"\n{step}")
    print(f"   → {description}")
print("\n" + "=" * 60)

## 2. Types of Machine Learning

| Type | Description | Examples |
|------|-------------|----------|
| **Supervised Learning** | Learn from labeled data | Regression, Classification |
| **Unsupervised Learning** | Find patterns in unlabeled data | Clustering, Dimensionality Reduction |
| **Reinforcement Learning** | Learn through trial and error | Game AI, Robotics |

Today we focus on **Supervised Learning - Regression** (predicting continuous values).

In [None]:
# Visualize the difference between Classification and Regression
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Classification Example
np.random.seed(42)
class_a_x = np.random.normal(2, 0.5, 50)
class_a_y = np.random.normal(2, 0.5, 50)
class_b_x = np.random.normal(4, 0.5, 50)
class_b_y = np.random.normal(4, 0.5, 50)

axes[0].scatter(class_a_x, class_a_y, c='#3498db', label='Class A', s=60, alpha=0.7)
axes[0].scatter(class_b_x, class_b_y, c='#e74c3c', label='Class B', s=60, alpha=0.7)
axes[0].plot([0, 6], [6, 0], 'k--', linewidth=2, label='Decision Boundary')
axes[0].set_title('Classification: Predicting Categories', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].legend()
axes[0].set_xlim(0, 6)
axes[0].set_ylim(0, 6)

# Regression Example
x_reg = np.linspace(0, 10, 50)
y_reg = 2 * x_reg + 5 + np.random.normal(0, 2, 50)

axes[1].scatter(x_reg, y_reg, c='#2ecc71', s=60, alpha=0.7, label='Data Points')
axes[1].plot(x_reg, 2 * x_reg + 5, 'r-', linewidth=2, label='Regression Line')
axes[1].set_title('Regression: Predicting Continuous Values', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Feature (e.g., Square Feet)')
axes[1].set_ylabel('Target (e.g., Price)')
axes[1].legend()

plt.tight_layout()
plt.show()

## 3. Linear Regression Theory

### The Concept
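Linear Regression finds the best-fitting straight line through your data points. "Best" means the line that minimizes the sum of squared vertical distances (residuals) between the points and the line.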

### The Equation
```
y = mx + b
```
Or in ML terminology:
```
y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
```

Where:
- **y** = Target variable (what we're predicting)
- **x** = Features (input variables)
- **w** = Weights (coefficients - learned from data)
- **b** = Bias/Intercept (where the line crosses Y-axis)
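
The multi-feature equation is a dot product between each row of features and the weights, plus the bias. The sketch below assumes made-up values (`w`, `b`, and `X_demo` are illustrative names and numbers, not output from a fitted model). It computes predictions for several houses at once, then shows that least squares recovers the same parameters from noise-free data:

```python
import numpy as np

# Hypothetical weights and bias, chosen only for illustration
w = np.array([150.0, 20000.0])   # dollars per sqft, dollars per bedroom
b = 50000.0                      # base price

# Each row is one house: [square feet, bedrooms]
X_demo = np.array([
    [1000.0, 2.0],
    [1500.0, 3.0],
    [2000.0, 3.0],
    [2500.0, 4.0],
])

# y = w1*x1 + w2*x2 + b, computed for every row at once
y_demo = X_demo @ w + b
print(y_demo)

# Least squares recovers w and b from (X_demo, y_demo).
# Appending a column of ones lets the solver treat the bias as one more weight.
X_aug = np.column_stack([X_demo, np.ones(len(X_demo))])
params, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
w_hat, b_hat = params[:2], params[2]
print(w_hat, b_hat)
```

With real, noisy data the recovered parameters only approximate the underlying relationship, which is why the later sections measure the remaining error.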

In [None]:
# Understanding Linear Regression Visually
np.random.seed(42)

# Create sample data: Square Feet vs Price
sqft = np.array([1000, 1200, 1400, 1600, 1800, 2000, 2200, 2400, 2600, 2800])
price = np.array([150, 180, 210, 240, 280, 310, 350, 380, 420, 450])

# Add some noise
price_noisy = price + np.random.normal(0, 15, len(price))

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Plot 1: Just the data
axes[0].scatter(sqft, price_noisy, c='#3498db', s=100, edgecolors='white')
axes[0].set_title('Step 1: Our Data', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Square Feet')
axes[0].set_ylabel('Price ($K)')

# Plot 2: Finding the best line
axes[1].scatter(sqft, price_noisy, c='#3498db', s=100, edgecolors='white')
# Try different lines
for slope, intercept, color, alpha in [(0.1, 100, 'gray', 0.3), (0.2, 50, 'gray', 0.3), (0.15, 10, 'red', 1)]:
    y_line = slope * sqft + intercept
    axes[1].plot(sqft, y_line, color=color, linewidth=2, alpha=alpha)
axes[1].set_title('Step 2: Finding Best Fit Line', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Square Feet')
axes[1].set_ylabel('Price ($K)')

# Plot 3: Best fit with residuals
# Calculate best fit
z = np.polyfit(sqft, price_noisy, 1)
p = np.poly1d(z)
y_pred = p(sqft)

axes[2].scatter(sqft, price_noisy, c='#3498db', s=100, edgecolors='white', label='Actual')
axes[2].plot(sqft, y_pred, 'r-', linewidth=2, label=f'Best Fit: y = {z[0]:.3f}x + {z[1]:.1f}')
# Draw residuals (errors)
for xi, yi, yi_pred in zip(sqft, price_noisy, y_pred):
    axes[2].plot([xi, xi], [yi, yi_pred], 'g--', alpha=0.5)
axes[2].set_title('Step 3: Best Line with Errors', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Square Feet')
axes[2].set_ylabel('Price ($K)')
axes[2].legend()

plt.tight_layout()
plt.show()

print(f"\nThe Best Fit Equation: Price = {z[0]:.4f} × SqFt + {z[1]:.2f}")
print(f"This means: For every 1 sqft increase, price increases by ${z[0]*1000:.2f}")

## 4. Train/Test Split - Why It's Critical!

### The Problem: Overfitting
If we train and test on the same data, we cannot tell whether the model learned general patterns or simply "memorized" the examples it saw, so its score looks better than it really is.

### The Solution: Split Your Data
- **Training Set (70-80%)**: Data the model learns from
- **Test Set (20-30%)**: Data the model has never seen - used for evaluation
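
To make the idea concrete, here is a minimal sketch of an 80/20 split done by hand with NumPy. It illustrates the concept and is not a description of how scikit-learn implements it; the toy data and the `_demo` names are assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, so the shuffle is repeatable

n_rows = 10
X_demo = np.arange(n_rows).reshape(-1, 1)   # toy feature column: 0..9
y_demo = 3 * X_demo.ravel() + 1             # toy target

# Shuffle the row indices, then cut them at the 80% mark
indices = rng.permutation(n_rows)
n_train = int(0.8 * n_rows)
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train_demo, y_train_demo = X_demo[train_idx], y_demo[train_idx]
X_test_demo, y_test_demo = X_demo[test_idx], y_demo[test_idx]

print(len(X_train_demo), len(X_test_demo))

# No row appears in both sets
print(set(train_idx) & set(test_idx))
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example, sorted by price), because an unshuffled cut would give the two sets different distributions.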

In [None]:
# Create a larger dataset
np.random.seed(42)
n_samples = 200

# Features
X = np.random.uniform(800, 3500, n_samples).reshape(-1, 1)  # Square feet

# Target (with realistic relationship)
y = 50000 + 150 * X.flatten() + np.random.normal(0, 30000, n_samples)  # Price

print(f"Total samples: {len(X)}")
print(f"Feature shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# The Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing
    random_state=42     # For reproducibility
)

print("DATA SPLIT SUMMARY")
print("=" * 40)
print(f"Training set size: {len(X_train)} samples ({len(X_train)/len(X)*100:.0f}%)")
print(f"Test set size: {len(X_test)} samples ({len(X_test)/len(X)*100:.0f}%)")
print("\nTraining Data:")
print(f"  X_train shape: {X_train.shape}")
print(f"  y_train shape: {y_train.shape}")
print("\nTest Data:")
print(f"  X_test shape: {X_test.shape}")
print(f"  y_test shape: {y_test.shape}")

In [None]:
# Visualize the split
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(X_train, y_train, c='#3498db', alpha=0.6, label=f'Training Data ({len(X_train)} samples)', s=50)
ax.scatter(X_test, y_test, c='#e74c3c', alpha=0.8, label=f'Test Data ({len(X_test)} samples)', s=50, marker='s')

ax.set_title('Train/Test Split Visualization', fontsize=14, fontweight='bold')
ax.set_xlabel('Square Feet', fontsize=12)
ax.set_ylabel('Price ($)', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()

## 5. Building Your First ML Model!

Using Scikit-Learn, building a model takes just three steps:
1. **Create** the model
2. **Fit** (train) on data
3. **Predict** on new data

In [None]:
# Step 1: Create the model
model = LinearRegression()

print("Model created!")
print(f"Model type: {type(model).__name__}")

In [None]:
# Step 2: Train (fit) the model
model.fit(X_train, y_train)

print("Model trained!")
print("\n LEARNED PARAMETERS:")
print(f"   Coefficient (slope): {model.coef_[0]:.4f}")
print(f"   Intercept (bias): {model.intercept_:.4f}")
print(f"\n The Equation: Price = {model.coef_[0]:.2f} × SqFt + {model.intercept_:.2f}")
print(f"\n Interpretation: For every 1 sqft increase, price goes up by ${model.coef_[0]:.2f}")

In [None]:
# Step 3: Make predictions
y_pred_train = model.predict(X_train)  # Predictions on training data
y_pred_test = model.predict(X_test)    # Predictions on test data

print("Predictions made!")
print("\nSample Predictions vs Actual (Test Set):")
print("-" * 50)
print(f"{'SqFt':>10} | {'Actual':>12} | {'Predicted':>12} | {'Error':>10}")
print("-" * 50)
for i in range(5):
    sqft = X_test[i][0]
    actual = y_test[i]
    predicted = y_pred_test[i]
    error = actual - predicted
    print(f"{sqft:>10.0f} | ${actual:>10,.0f} | ${predicted:>10,.0f} | ${error:>+9,.0f}")

In [None]:
# Visualize the model's predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Regression line on training data
axes[0].scatter(X_train, y_train, c='#3498db', alpha=0.6, label='Training Data')
axes[0].plot(np.sort(X_train.flatten()), model.predict(np.sort(X_train, axis=0)), 
             'r-', linewidth=2, label='Model')
axes[0].set_title('Model Fit on Training Data', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Square Feet')
axes[0].set_ylabel('Price ($)')
axes[0].legend()

# Plot 2: Predictions vs Actual on test data
axes[1].scatter(y_test, y_pred_test, c='#2ecc71', alpha=0.6)
# Perfect prediction line
min_val = min(y_test.min(), y_pred_test.min())
max_val = max(y_test.max(), y_pred_test.max())
axes[1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
axes[1].set_title('Predicted vs Actual (Test Set)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Actual Price ($)')
axes[1].set_ylabel('Predicted Price ($)')
axes[1].legend()

plt.tight_layout()
plt.show()

## 6. Loss Functions & Evaluation Metrics

How do we know if our model is any good? We use **metrics**!

### Common Regression Metrics:

| Metric | Full Name | What It Measures | Ideal Value |
|--------|-----------|------------------|-------------|
| **MSE** | Mean Squared Error | Average of squared errors | Lower = Better |
| **RMSE** | Root Mean Squared Error | Square root of MSE (same units as target) | Lower = Better |
| **MAE** | Mean Absolute Error | Average of absolute errors | Lower = Better |
| **R²** | R-Squared (Coefficient of Determination) | Variance explained by model | Closer to 1 = Better |
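
R² compares the model's squared error with the squared error of a baseline that always predicts the mean of the target. The sketch below computes it by hand; `r2_by_hand` is a hypothetical helper written for this illustration, and the second and third prediction lists are made up to show the baseline and a worse-than-baseline case:

```python
import numpy as np

def r2_by_hand(y_true, y_hat):
    """R² = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always guessing the mean
    return 1 - ss_res / ss_tot

actual_demo = np.array([100, 200, 300, 400, 500])

good = r2_by_hand(actual_demo, [110, 190, 320, 380, 510])      # close predictions
baseline = r2_by_hand(actual_demo, [300, 300, 300, 300, 300])  # always predict the mean
bad = r2_by_hand(actual_demo, [500, 400, 300, 200, 100])       # worse than the mean

print(f"{good:.3f}  {baseline:.3f}  {bad:.3f}")
# prints: 0.989  0.000  -3.000
```

The last case shows that R² is not bounded below by 0: a model that does worse than predicting the mean gets a negative score, which can happen on a test set.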

In [None]:
# Understanding Loss Functions
print(" LOSS FUNCTIONS EXPLAINED")
print("=" * 60)

# Example predictions and actuals
actual = np.array([100, 200, 300, 400, 500])
predicted = np.array([110, 190, 320, 380, 510])
errors = actual - predicted

print("\nSimple Example:")
print(f"Actual:    {actual}")
print(f"Predicted: {predicted}")
print(f"Errors:    {errors}")

# Calculate metrics step by step
print("\n--- Mean Absolute Error (MAE) ---")
mae = np.mean(np.abs(errors))
print(f"MAE = mean(|errors|) = mean({np.abs(errors)}) = {mae:.2f}")
print("Interpretation: On average, predictions are off by $" + f"{mae:.0f}")

print("\n--- Mean Squared Error (MSE) ---")
squared_errors = errors ** 2
mse = np.mean(squared_errors)
print(f"MSE = mean(errors²) = mean({squared_errors}) = {mse:.2f}")
print("Note: Penalizes large errors more heavily")

print("\n--- Root Mean Squared Error (RMSE) ---")
rmse = np.sqrt(mse)
print(f"RMSE = √MSE = √{mse:.2f} = {rmse:.2f}")
print("Interpretation: Typical prediction error is about $" + f"{rmse:.0f}")

In [None]:
# Calculate metrics for our model
print(" MODEL EVALUATION")
print("=" * 60)

# Training metrics
train_mse = mean_squared_error(y_train, y_pred_train)
train_rmse = np.sqrt(train_mse)
train_mae = mean_absolute_error(y_train, y_pred_train)
train_r2 = r2_score(y_train, y_pred_train)

# Test metrics
test_mse = mean_squared_error(y_test, y_pred_test)
test_rmse = np.sqrt(test_mse)
test_mae = mean_absolute_error(y_test, y_pred_test)
test_r2 = r2_score(y_test, y_pred_test)

print("\n TRAINING SET METRICS:")
print(f"   MSE:  {train_mse:,.0f}")
print(f"   RMSE: ${train_rmse:,.0f}")
print(f"   MAE:  ${train_mae:,.0f}")
print(f"   R²:   {train_r2:.4f} ({train_r2*100:.1f}% variance explained)")

print("\n TEST SET METRICS:")
print(f"   MSE:  {test_mse:,.0f}")
print(f"   RMSE: ${test_rmse:,.0f}")
print(f"   MAE:  ${test_mae:,.0f}")
print(f"   R²:   {test_r2:.4f} ({test_r2*100:.1f}% variance explained)")

print("\n INTERPRETATION:")
print(f"   On average, our predictions are off by ${test_mae:,.0f}")
print(f"   The model explains {test_r2*100:.1f}% of the variance in house prices")

In [None]:
# Visualize the residuals (errors)
residuals = y_test - y_pred_test

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Residuals vs Predicted
axes[0].scatter(y_pred_test, residuals, c='#3498db', alpha=0.6)
axes[0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0].set_title('Residuals vs Predicted', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Predicted Price')
axes[0].set_ylabel('Residual (Error)')

# Plot 2: Residual Distribution
axes[1].hist(residuals, bins=20, color='#2ecc71', edgecolor='white', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_title('Residual Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Frequency')

# Plot 3: Metrics comparison
metrics = ['RMSE', 'MAE']
train_values = [train_rmse, train_mae]
test_values = [test_rmse, test_mae]

x = np.arange(len(metrics))
width = 0.35

axes[2].bar(x - width/2, train_values, width, label='Train', color='#3498db')
axes[2].bar(x + width/2, test_values, width, label='Test', color='#e74c3c')
axes[2].set_title('Train vs Test Metrics', fontsize=12, fontweight='bold')
axes[2].set_xticks(x)
axes[2].set_xticklabels(metrics)
axes[2].set_ylabel('Error ($)')
axes[2].legend()

plt.tight_layout()
plt.show()

## 7. Making Predictions on New Data

Now let's use our trained model to predict prices for houses we've never seen!

In [None]:
# Predict prices for new houses
new_houses = np.array([[1500], [2000], [2500], [3000], [3500]])

predictions = model.predict(new_houses)

print(" PRICE PREDICTIONS FOR NEW HOUSES")
print("=" * 50)
print(f"\nModel: Price = ${model.coef_[0]:.2f} × SqFt + ${model.intercept_:,.2f}")
print("\n" + "-" * 50)
print(f"{'Square Feet':^15} | {'Predicted Price':^20}")
print("-" * 50)

for sqft, price in zip(new_houses.flatten(), predictions):
    print(f"{sqft:^15,.0f} | ${price:^18,.0f}")

print("-" * 50)

In [None]:
# Visualize the new predictions against the data and the fitted line
fig, ax = plt.subplots(figsize=(12, 6))

# Plot original data
ax.scatter(X, y, c='#3498db', alpha=0.4, label='Original Data')

# Plot regression line
X_line = np.linspace(500, 4000, 100).reshape(-1, 1)
y_line = model.predict(X_line)
ax.plot(X_line, y_line, 'r-', linewidth=2, label='Model')

# Plot new predictions
ax.scatter(new_houses, predictions, c='#2ecc71', s=200, marker='*', 
           edgecolors='black', linewidth=2, label='New Predictions', zorder=5)

# Add price labels
for sqft, price in zip(new_houses.flatten(), predictions):
    ax.annotate(f'${price/1000:.0f}K', 
                xy=(sqft, price), 
                xytext=(sqft + 100, price + 30000),
                fontsize=10, fontweight='bold')

ax.set_title('House Price Predictions', fontsize=14, fontweight='bold')
ax.set_xlabel('Square Feet', fontsize=12)
ax.set_ylabel('Price ($)', fontsize=12)
ax.legend()

plt.tight_layout()
plt.show()

---

## Mini Project: Complete House Price Prediction System

**Goal:** Build a complete ML system to predict house prices using multiple features.

Let's put everything together!

In [None]:
# Create a comprehensive housing dataset
np.random.seed(42)
n = 500

# Generate features
sqft = np.random.normal(2000, 500, n).clip(800, 4500)
bedrooms = np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.40, 0.30, 0.10])
bathrooms = np.random.choice([1, 1.5, 2, 2.5, 3], n, p=[0.15, 0.20, 0.35, 0.20, 0.10])
age = np.random.uniform(0, 50, n)
has_garage = np.random.choice([0, 1], n, p=[0.3, 0.7])

# Generate target with realistic relationships
price = (
    50000 +                          # Base price
    sqft * 150 +                     # $150 per sqft
    bedrooms * 20000 +               # $20K per bedroom
    bathrooms * 15000 +              # $15K per bathroom
    (50 - age) * 1000 +              # Newer = more expensive
    has_garage * 30000 +             # $30K for garage
    np.random.normal(0, 25000, n)    # Noise
).clip(100000, 1000000)

# Create DataFrame
housing_df = pd.DataFrame({
    'SqFt': sqft.astype(int),
    'Bedrooms': bedrooms,
    'Bathrooms': bathrooms,
    'Age': age.astype(int),
    'Has_Garage': has_garage,
    'Price': price.astype(int)
})

print(" HOUSING DATASET")
print("=" * 60)
print(f"Total Records: {len(housing_df)}")
print(f"Features: {list(housing_df.columns[:-1])}")
print(f"Target: Price")
print("\nFirst 10 Records:")
print(housing_df.head(10))
print("\nStatistical Summary:")
print(housing_df.describe().round(0))

In [None]:
# Step 1: Exploratory Data Analysis
print(" STEP 1: EXPLORATORY DATA ANALYSIS")
print("=" * 60)

# Correlation analysis
correlations = housing_df.corr()['Price'].drop('Price').sort_values(ascending=False)
print("\nCorrelation with Price:")
for feature, corr in correlations.items():
    bar = '█' * int(abs(corr) * 20)  # bar length scales with |correlation|
    print(f"  {feature:12}: {corr:+.3f} {bar}")

In [None]:
# Visualize correlations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

features = ['SqFt', 'Bedrooms', 'Bathrooms', 'Age', 'Has_Garage']

for i, feature in enumerate(features):
    row, col = i // 3, i % 3
    
    if feature in ['Has_Garage']:
        # Box plot for categorical
        housing_df.boxplot(column='Price', by=feature, ax=axes[row, col])
        axes[row, col].set_title(f'Price by {feature}', fontsize=11)
    else:
        # Scatter for continuous
        axes[row, col].scatter(housing_df[feature], housing_df['Price'], alpha=0.5, c='#3498db')
        # Add trend line
        z = np.polyfit(housing_df[feature], housing_df['Price'], 1)
        p = np.poly1d(z)
        x_line = np.linspace(housing_df[feature].min(), housing_df[feature].max(), 100)
        axes[row, col].plot(x_line, p(x_line), 'r-', linewidth=2)
        axes[row, col].set_title(f'Price vs {feature}', fontsize=11)
    
    axes[row, col].set_xlabel(feature)
    axes[row, col].set_ylabel('Price')

# Correlation heatmap
sns.heatmap(housing_df.corr(), annot=True, cmap='RdYlBu_r', center=0, 
            ax=axes[1, 2], fmt='.2f', square=True)
axes[1, 2].set_title('Correlation Matrix', fontsize=11)

plt.suptitle('Feature Analysis', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Step 2: Prepare Data
print(" STEP 2: PREPARE DATA")
print("=" * 60)

# Features and Target
X = housing_df[['SqFt', 'Bedrooms', 'Bathrooms', 'Age', 'Has_Garage']]
y = housing_df['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"\nFeatures used: {list(X.columns)}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

In [None]:
# Step 3: Train the Model
print(" STEP 3: TRAIN THE MODEL")
print("=" * 60)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

print("\n Model Trained Successfully!")
print("\n LEARNED COEFFICIENTS:")
print("-" * 40)

for feature, coef in zip(X.columns, model.coef_):
    print(f"  {feature:12}: ${coef:+,.2f}")
print(f"  {'Intercept':12}: ${model.intercept_:+,.2f}")

print("\n INTERPRETATION (holding the other features constant):")
print(f"  • Each additional sqft adds ${model.coef_[0]:.2f} to the price")
print(f"  • Each bedroom adds ${model.coef_[1]:,.0f} to the price")
print(f"  • Each bathroom adds ${model.coef_[2]:,.0f} to the price")
print(f"  • Each year of age {'decreases' if model.coef_[3] < 0 else 'increases'} price by ${abs(model.coef_[3]):,.0f}")
print(f"  • Having a garage adds ${model.coef_[4]:,.0f} to the price")

In [None]:
# Step 4: Evaluate the Model
print(" STEP 4: EVALUATE THE MODEL")
print("=" * 60)

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Calculate metrics
metrics = {
    'Training': {
        'RMSE': np.sqrt(mean_squared_error(y_train, y_pred_train)),
        'MAE': mean_absolute_error(y_train, y_pred_train),
        'R²': r2_score(y_train, y_pred_train)
    },
    'Test': {
        'RMSE': np.sqrt(mean_squared_error(y_test, y_pred_test)),
        'MAE': mean_absolute_error(y_test, y_pred_test),
        'R²': r2_score(y_test, y_pred_test)
    }
}

print("\n" + "-" * 50)
print(f"{'Metric':^15} | {'Training':^15} | {'Test':^15}")
print("-" * 50)
print(f"{'RMSE':^15} | ${metrics['Training']['RMSE']:>12,.0f} | ${metrics['Test']['RMSE']:>12,.0f}")
print(f"{'MAE':^15} | ${metrics['Training']['MAE']:>12,.0f} | ${metrics['Test']['MAE']:>12,.0f}")
print(f"{'R²':^15} | {metrics['Training']['R²']:>13.4f} | {metrics['Test']['R²']:>13.4f}")
print("-" * 50)

print(f"\n MODEL PERFORMANCE SUMMARY:")
print(f"   • The model explains {metrics['Test']['R²']*100:.1f}% of price variance")
print(f"   • Average prediction error: ${metrics['Test']['MAE']:,.0f}")
print(f"   • Typical error range: ±${metrics['Test']['RMSE']:,.0f}")

In [None]:
# Visualization of model performance
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# 1. Actual vs Predicted
axes[0].scatter(y_test, y_pred_test, c='#3498db', alpha=0.5)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
axes[0].set_title('Predicted vs Actual Prices', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Actual Price ($)')
axes[0].set_ylabel('Predicted Price ($)')

# 2. Residual Distribution
residuals = y_test - y_pred_test
axes[1].hist(residuals, bins=30, color='#2ecc71', edgecolor='white', alpha=0.7)
axes[1].axvline(x=0, color='r', linestyle='--', linewidth=2)
axes[1].set_title('Residual Distribution', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Prediction Error ($)')
axes[1].set_ylabel('Frequency')

# 3. Feature Importance
# Raw coefficients depend on each feature's units (dollars per sqft vs. dollars
# per bedroom), so scale each by its feature's standard deviation to compare them.
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': np.abs(model.coef_ * X_train.std().values)
}).sort_values('Importance', ascending=True)

colors = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(feature_importance)))
axes[2].barh(feature_importance['Feature'], feature_importance['Importance'], color=colors)
axes[2].set_title('Feature Importance (|Coefficient| × Feature Std)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Price Change per 1 Std of Feature ($)')

plt.tight_layout()
plt.show()

In [None]:
# Step 5: Make Predictions on New Houses
print(" STEP 5: PREDICT NEW HOUSE PRICES")
print("=" * 60)

# Create new houses to predict
new_houses = pd.DataFrame({
    'SqFt': [1500, 2000, 2500, 3000, 3500],
    'Bedrooms': [2, 3, 3, 4, 5],
    'Bathrooms': [1.5, 2, 2.5, 3, 3],
    'Age': [5, 10, 15, 0, 20],
    'Has_Garage': [1, 1, 1, 1, 0]
})

# Make predictions
predictions = model.predict(new_houses)

# Display results
results = new_houses.copy()
results['Predicted_Price'] = predictions.astype(int)

print("\n NEW HOUSE PRICE PREDICTIONS:")
print(results.to_string(index=False))

print("\n SAMPLE CALCULATION (House #1):")
house = new_houses.iloc[0]
calculated = (
    model.intercept_ +
    model.coef_[0] * house['SqFt'] +
    model.coef_[1] * house['Bedrooms'] +
    model.coef_[2] * house['Bathrooms'] +
    model.coef_[3] * house['Age'] +
    model.coef_[4] * house['Has_Garage']
)
print(f"   Intercept: ${model.intercept_:,.2f}")
print(f"   + SqFt ({house['SqFt']}): ${model.coef_[0] * house['SqFt']:,.2f}")
print(f"   + Bedrooms ({house['Bedrooms']}): ${model.coef_[1] * house['Bedrooms']:,.2f}")
print(f"   + Bathrooms ({house['Bathrooms']}): ${model.coef_[2] * house['Bathrooms']:,.2f}")
print(f"   + Age ({house['Age']}): ${model.coef_[3] * house['Age']:,.2f}")
print(f"   + Garage ({house['Has_Garage']}): ${model.coef_[4] * house['Has_Garage']:,.2f}")
print(f"   = Total: ${calculated:,.2f}")

In [None]:
# Final Summary
print("\n" + "=" * 70)
print(" DAY 6 COMPLETE: LINEAR REGRESSION SUMMARY")
print("=" * 70)

print(f"""
 WHAT WE LEARNED:

   1. THE ML PIPELINE
      Data → Preprocessing → Split → Train → Evaluate → Predict

   2. LINEAR REGRESSION
      y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
      Finds the best line (or hyperplane) through your data

   3. TRAIN/TEST SPLIT
      Detects overfitting by evaluating on unseen data

   4. EVALUATION METRICS
      • RMSE: Typical prediction error
      • MAE: Average absolute error
      • R²: Variance explained (at most 1, higher is better; can be negative on test data)

   5. SCIKIT-LEARN WORKFLOW
      model = LinearRegression()
      model.fit(X_train, y_train)
      predictions = model.predict(X_test)

 OUR MODEL'S PERFORMANCE:
   • R² Score: {metrics['Test']['R²']:.4f} ({metrics['Test']['R²']*100:.1f}% variance explained)
   • MAE: ${metrics['Test']['MAE']:,.0f} average error
   • RMSE: ${metrics['Test']['RMSE']:,.0f} typical error range
""")

print("=" * 70)
print(" Ready for Day 7: Classification Algorithms!")
print("=" * 70)

---

## Practice Exercises

Try these on your own:

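1. **Feature Engineering**: Create a new feature from the inputs, such as `SqFt_Per_Bedroom`, and see if it improves predictions. Avoid features computed from `Price` (for example `Price_Per_SqFt`), because they leak the target into the inputs.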
2. **Different Split Ratios**: Try 70/30 and 90/10 splits and compare results
3. **Polynomial Features**: Use `sklearn.preprocessing.PolynomialFeatures` to capture non-linear relationships
4. **Cross-Validation**: Use `sklearn.model_selection.cross_val_score` for more robust evaluation
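
As a starting point for exercise 4, the sketch below runs 5-fold cross-validation on stand-in data. The synthetic dataset and the `_demo` names are assumptions made for this example; swap in the housing features and prices from the mini project when you try it yourself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): price depends linearly on sqft plus noise
sqft_demo = rng.uniform(800, 3500, size=200).reshape(-1, 1)
price_demo = 50000 + 150 * sqft_demo.ravel() + rng.normal(0, 30000, size=200)

# 5-fold CV: fit on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(LinearRegression(), sqft_demo, price_demo, cv=5, scoring="r2")

print(scores.round(3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several folds gives a steadier estimate than a single train/test split, because the result no longer depends on which rows happened to land in the test set.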

---

## Key Takeaways

| Concept | Key Points |
|---------|------------|
| **Linear Regression** | Predicts continuous values using linear equation |
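| **Train/Test Split** | Essential for detecting overfitting; evaluates on unseen data |
| **Coefficients** | Show direction and per-unit effect; compare magnitudes only after scaling features |
| **R² Score** | At most 1, higher means better fit; can be negative on test data |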
| **RMSE/MAE** | Lower is better, in same units as target |

### When to Use Linear Regression:
- Target variable is continuous
- Relationship between features and target is approximately linear
- You need interpretable results (coefficients explain relationships)

---

**Next Up:** Day 7 - Classification with Logistic Regression, Decision Trees & Random Forests!