# Week 3: Introduction to Supervised Learning and Linear Regression

## Learning Objectives:
- Understand ML concepts and terminology
- Learn supervised vs unsupervised learning
- Implement first ML algorithms
- Understand model evaluation

## Topics Covered:
- Types of ML: supervised, unsupervised, reinforcement
- Simple (single-feature) and multiple linear regression
- Feature engineering
- Feature scaling and selection
- Model evaluation metrics (RMSE, MAE, R², MAPE)
- Train/validation/test splits
- Cross-validation
- Overfitting and underfitting

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_regression
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. What is Machine Learning?

Machine Learning is a subset of artificial intelligence that enables computers to learn and make decisions from data without being explicitly programmed for every scenario.

### Key Terminology:
- **Algorithm**: The procedure that learns a mapping from features to target from data (for example, ordinary least squares)
- **Model**: The output of an algorithm after training on data
- **Features**: Input variables used to make predictions
- **Target/Label**: The output variable we want to predict
- **Training**: The process of teaching the algorithm using historical data
- **Prediction**: Using the trained model to make forecasts on new data
- **Overfitting**: When a model fits noise in the training data and therefore generalizes poorly to new data
- **Underfitting**: When a model is too simple to capture the underlying pattern
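
To make these terms concrete, here is a minimal sketch on made-up numbers. The data and the names `X_demo`, `y_demo`, and `demo_model` are illustrative assumptions, not part of the course dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: one input column (hypothetical house sizes in square feet)
X_demo = np.array([[800], [1000], [1200], [1500], [2000]])
# Target/label: the value we want to predict (hypothetical prices)
y_demo = np.array([100_000, 120_000, 140_000, 170_000, 220_000])

# Algorithm: ordinary least squares. Training: calling fit() on labeled data.
# Model: the fitted object returned by fit().
demo_model = LinearRegression().fit(X_demo, y_demo)

# Prediction: apply the trained model to an input it has not seen
demo_prediction = demo_model.predict(np.array([[1800]]))
print(f"Predicted price for 1,800 sq ft: ${demo_prediction[0]:,.0f}")
```

The five hypothetical points lie exactly on a straight line, so the fitted model reproduces that line.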

## 2. Types of Machine Learning

### 2.1 Supervised Learning
Learning with labeled data (input-output pairs)

**Classification**: Predicting categories/classes
- Examples: Email spam detection, image recognition, medical diagnosis
- Target variable: Discrete/categorical

**Regression**: Predicting continuous numerical values
- Examples: House price prediction, stock price forecasting, temperature prediction
- Target variable: Continuous/numerical

### 2.2 Unsupervised Learning
Learning patterns in data without labeled examples
- Examples: Customer segmentation, anomaly detection, data compression
- No target variable provided

### 2.3 Reinforcement Learning
Learning through interaction with an environment using rewards and penalties
- Examples: Game playing, autonomous vehicles, recommendation systems
- Learning through trial and error

In [None]:
# Create a simple dataset to demonstrate supervised learning
np.random.seed(42)

# Generate synthetic house price data
n_samples = 200
house_size = np.random.normal(1500, 500, n_samples)  # Square feet
bedrooms = np.random.randint(1, 6, n_samples)  # Number of bedrooms
age = np.random.randint(0, 50, n_samples)  # House age in years

# Create a realistic relationship for house prices
price = (house_size * 120 + bedrooms * 15000 - age * 1000 + 
         np.random.normal(0, 20000, n_samples))

# Create DataFrame
house_data = pd.DataFrame({
    'Size_SqFt': house_size,
    'Bedrooms': bedrooms,
    'Age_Years': age,
    'Price': price
})

print("House Price Dataset:")
print(house_data.head())
print(f"\nDataset shape: {house_data.shape}")

In [None]:
# Visualize the relationships
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Size vs Price
axes[0, 0].scatter(house_data['Size_SqFt'], house_data['Price'], alpha=0.6, color='blue')
axes[0, 0].set_xlabel('Size (Square Feet)')
axes[0, 0].set_ylabel('Price ($)')
axes[0, 0].set_title('House Size vs Price')
axes[0, 0].grid(True, alpha=0.3)

# Bedrooms vs Price
bedroom_avg = house_data.groupby('Bedrooms')['Price'].mean()
axes[0, 1].bar(bedroom_avg.index, bedroom_avg.values, color='green', alpha=0.7)
axes[0, 1].set_xlabel('Number of Bedrooms')
axes[0, 1].set_ylabel('Average Price ($)')
axes[0, 1].set_title('Bedrooms vs Average Price')
axes[0, 1].grid(True, alpha=0.3)

# Age vs Price
axes[1, 0].scatter(house_data['Age_Years'], house_data['Price'], alpha=0.6, color='red')
axes[1, 0].set_xlabel('Age (Years)')
axes[1, 0].set_ylabel('Price ($)')
axes[1, 0].set_title('House Age vs Price')
axes[1, 0].grid(True, alpha=0.3)

# Price distribution
axes[1, 1].hist(house_data['Price'], bins=25, color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].set_xlabel('Price ($)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Price Distribution')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate correlations
correlation_matrix = house_data.corr()
print("\nCorrelation Matrix:")
print(correlation_matrix)

## 3. Linear Regression: Your First ML Algorithm

Linear regression is one of the simplest and most interpretable machine learning algorithms. It assumes a linear relationship between the input features and the target variable.

### Mathematical Foundation:
For simple linear regression (one feature):
```
y = β₀ + β₁x + ε
```

For multiple linear regression:
```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε
```

Where:
- y = target variable (price)
- β₀ = intercept
- β₁, β₂, ..., βₙ = coefficients
- x₁, x₂, ..., xₙ = features
- ε = error term
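
As a hedged illustration of how the coefficients can be estimated, the sketch below fits the simple model by ordinary least squares with NumPy's `np.linalg.lstsq`. The toy data and the names `x_toy`, `y_toy`, `X_design`, and `beta_hat` are assumptions made for this example. scikit-learn's `LinearRegression` also solves a least-squares problem, though its internals may differ.

```python
import numpy as np

# Toy data generated from y = 2 + 3x with no noise (illustrative assumption)
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x_toy

# Design matrix: a column of ones (for the intercept β₀) next to the feature
X_design = np.column_stack([np.ones_like(x_toy), x_toy])

# The least-squares solution minimizes the sum of squared residuals
beta_hat, residuals, rank, singular_values = np.linalg.lstsq(
    X_design, y_toy, rcond=None
)

print(f"Estimated intercept (β₀): {beta_hat[0]:.3f}")
print(f"Estimated slope (β₁):     {beta_hat[1]:.3f}")
```

Because the toy data contain no noise, the estimates recover the intercept and slope used to generate them. With real data the error term ε means the fit is only approximate.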

In [None]:
# Simple Linear Regression - Single Feature
print("=== SIMPLE LINEAR REGRESSION ===")

# Use only house size as feature
X_simple = house_data[['Size_SqFt']]
y = house_data['Price']

# Split the data
X_train_simple, X_test_simple, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

# Create and train the model
simple_model = LinearRegression()
simple_model.fit(X_train_simple, y_train)

# Make predictions
y_pred_simple = simple_model.predict(X_test_simple)

# Print model parameters
print(f"Intercept (β₀): ${simple_model.intercept_:,.2f}")
print(f"Coefficient (β₁): ${simple_model.coef_[0]:.2f} per square foot")
print(f"\nModel equation: Price = ${simple_model.intercept_:,.2f} + ${simple_model.coef_[0]:.2f} × Size")

In [None]:
# Visualize the simple linear regression
plt.figure(figsize=(12, 8))

# Plot data points
plt.scatter(X_test_simple, y_test, alpha=0.6, color='blue', label='Actual Prices')
plt.scatter(X_test_simple, y_pred_simple, alpha=0.6, color='red', label='Predicted Prices')

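# Plot regression line over the observed size range.
# A DataFrame with the training column name matches what the model was fitted on.
size_grid = np.linspace(house_data['Size_SqFt'].min(), house_data['Size_SqFt'].max(), 100)
X_line = pd.DataFrame({'Size_SqFt': size_grid})
y_line = simple_model.predict(X_line)
plt.plot(size_grid, y_line, color='green', linewidth=2, label='Regression Line')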

plt.xlabel('Size (Square Feet)')
plt.ylabel('Price ($)')
plt.title('Simple Linear Regression: House Size vs Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Calculate and display metrics
rmse_simple = np.sqrt(mean_squared_error(y_test, y_pred_simple))
mae_simple = mean_absolute_error(y_test, y_pred_simple)
r2_simple = r2_score(y_test, y_pred_simple)

print(f"\nModel Performance:")
print(f"RMSE: ${rmse_simple:,.2f}")
print(f"MAE: ${mae_simple:,.2f}")
print(f"R² Score: {r2_simple:.3f}")

## 4. Multiple Linear Regression

Multiple linear regression uses several features at once. When the extra features carry information about the target, as bedrooms and age do in this dataset, it usually predicts more accurately than a single-feature model.

In [None]:
# Multiple Linear Regression - All Features
print("=== MULTIPLE LINEAR REGRESSION ===")

# Use all features
X_multiple = house_data[['Size_SqFt', 'Bedrooms', 'Age_Years']]

# Split the data
X_train_multiple, X_test_multiple, y_train, y_test = train_test_split(
    X_multiple, y, test_size=0.2, random_state=42
)

# Create and train the model
multiple_model = LinearRegression()
multiple_model.fit(X_train_multiple, y_train)

# Make predictions
y_pred_multiple = multiple_model.predict(X_test_multiple)

# Print model parameters
print(f"Intercept (β₀): ${multiple_model.intercept_:,.2f}")
print("\nCoefficients:")
for feature, coef in zip(X_multiple.columns, multiple_model.coef_):
    print(f"  {feature}: {coef:.2f}")

# Model equation
print(f"\nModel equation:")
print(f"Price = ${multiple_model.intercept_:,.2f} + ")
print(f"        ${multiple_model.coef_[0]:.2f} × Size + ")
print(f"        ${multiple_model.coef_[1]:,.2f} × Bedrooms + ")
print(f"        ${multiple_model.coef_[2]:.2f} × Age")

In [None]:
# Compare model performance
rmse_multiple = np.sqrt(mean_squared_error(y_test, y_pred_multiple))
mae_multiple = mean_absolute_error(y_test, y_pred_multiple)
r2_multiple = r2_score(y_test, y_pred_multiple)

print("=== MODEL COMPARISON ===")
print(f"{'Metric':<15} {'Simple':<15} {'Multiple':<15} {'Improvement':<15}")
print("-" * 60)
print(f"{'RMSE':<15} ${rmse_simple:<14,.0f} ${rmse_multiple:<14,.0f} {((rmse_simple-rmse_multiple)/rmse_simple)*100:<14.1f}%")
print(f"{'MAE':<15} ${mae_simple:<14,.0f} ${mae_multiple:<14,.0f} {((mae_simple-mae_multiple)/mae_simple)*100:<14.1f}%")
print(f"{'R² Score':<15} {r2_simple:<15.3f} {r2_multiple:<15.3f} {((r2_multiple-r2_simple)/r2_simple)*100:<14.1f}%")

## 5. Train/Validation/Test Splits

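Proper data splitting is crucial for reliable model evaluation. The training set is used to fit the model, the validation set to compare models and tune settings, and the test set is held back for a final estimate of performance on unseen data.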
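
One reason the order of operations matters: any preprocessing that learns statistics from the data, such as feature scaling, should be fitted on the training portion only. The sketch below shows one way to arrange that with a scikit-learn `Pipeline`. The synthetic data from `make_regression` and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative, unrelated to the house dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# fit() learns the scaler's means and standard deviations from the training
# data only; score() reuses those same statistics to transform the test data
scaled_model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
scaled_model.fit(X_tr, y_tr)

test_r2_scaled = scaled_model.score(X_te, y_te)
print(f"Test R² with scaling inside the pipeline: {test_r2_scaled:.3f}")
```

For ordinary least squares, standardizing the features does not change the predictions. It does put the coefficients on a comparable scale, and it matters for regularized models.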

In [None]:
# Train/Validation/Test Split
print("=== TRAIN/VALIDATION/TEST SPLIT ===")

# First split: separate test set (20%)
X_temp, X_test, y_temp, y_test = train_test_split(
    X_multiple, y, test_size=0.2, random_state=42
)

# Second split: separate train and validation from remaining data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42  # 0.25 of 0.8 = 0.2 of total
)

print(f"Total dataset size: {len(X_multiple)}")
print(f"Training set size: {len(X_train)} ({len(X_train)/len(X_multiple)*100:.1f}%)")
print(f"Validation set size: {len(X_val)} ({len(X_val)/len(X_multiple)*100:.1f}%)")
print(f"Test set size: {len(X_test)} ({len(X_test)/len(X_multiple)*100:.1f}%)")

# Train model on training set
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on all three sets
train_pred = model.predict(X_train)
val_pred = model.predict(X_val)
test_pred = model.predict(X_test)

train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, val_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

train_r2 = r2_score(y_train, train_pred)
val_r2 = r2_score(y_val, val_pred)
test_r2 = r2_score(y_test, test_pred)

print(f"\nModel Performance:")
print(f"Training RMSE: ${train_rmse:,.2f}, R²: {train_r2:.3f}")
print(f"Validation RMSE: ${val_rmse:,.2f}, R²: {val_r2:.3f}")
print(f"Test RMSE: ${test_rmse:,.2f}, R²: {test_r2:.3f}")

## 6. Cross-Validation

Cross-validation provides a more robust estimate of model performance by using multiple train/validation splits.
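
To show what happens under the hood, here is a minimal sketch of k-fold cross-validation written as an explicit loop with `KFold`. The synthetic data and the names `X_cv`, `y_cv`, and `fold_r2` are illustrative assumptions. `cross_val_score` does similar bookkeeping for you, although with an integer `cv` it does not shuffle a regression dataset by default.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

# Synthetic data for illustration only
X_cv, y_cv = make_regression(n_samples=100, n_features=2, noise=5.0, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
fold_r2 = []

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_cv), start=1):
    # Each fold takes one turn as the validation set; the rest is training data
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    score = r2_score(y_cv[val_idx], fold_model.predict(X_cv[val_idx]))
    fold_r2.append(score)
    print(f"Fold {fold}: validation size = {len(val_idx)}, R² = {score:.3f}")

print(f"Mean R² across folds: {np.mean(fold_r2):.3f}")
```

Every sample is used for validation exactly once, which is why the averaged score depends less on one lucky or unlucky split.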

In [None]:
# Cross-Validation
print("=== CROSS-VALIDATION ===")

# 5-fold cross-validation
cv_scores = cross_val_score(LinearRegression(), X_multiple, y, cv=5, scoring='r2')
# scikit-learn scorers follow "higher is better", so MSE is returned negated
cv_neg_mse = cross_val_score(LinearRegression(), X_multiple, y, cv=5,
                             scoring='neg_mean_squared_error')
cv_rmse_scores = np.sqrt(-cv_neg_mse)

print(f"5-Fold Cross-Validation Results:")
print(f"R² Scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")
print(f"\nRMSE Scores: {cv_rmse_scores}")
print(f"Mean RMSE: ${cv_rmse_scores.mean():,.2f} (+/- ${cv_rmse_scores.std() * 2:,.2f})")

# Visualize cross-validation results
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# R² scores
axes[0].bar(range(1, 6), cv_scores, alpha=0.7, color='skyblue')
axes[0].axhline(y=cv_scores.mean(), color='red', linestyle='--', 
               label=f'Mean: {cv_scores.mean():.3f}')
axes[0].set_xlabel('Fold')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Cross-Validation R² Scores')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# RMSE scores
axes[1].bar(range(1, 6), cv_rmse_scores, alpha=0.7, color='lightcoral')
axes[1].axhline(y=cv_rmse_scores.mean(), color='red', linestyle='--', 
               label=f'Mean: ${cv_rmse_scores.mean():,.0f}')
axes[1].set_xlabel('Fold')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Cross-Validation RMSE Scores')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Overfitting and Underfitting

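Understanding the bias-variance tradeoff is crucial for building good models. An underfit model has high bias: it is too simple and scores poorly on both training and test data. An overfit model has high variance: it scores well on training data but noticeably worse on unseen data.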
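
One way to choose model complexity without touching the test set is to compare cross-validated scores across candidate settings. The sketch below uses scikit-learn's `validation_curve` for this. The quadratic toy data, the candidate degrees, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Illustrative data: a quadratic signal plus noise (an assumption for this sketch)
rng = np.random.default_rng(0)
x_curve = rng.uniform(-3, 3, size=120).reshape(-1, 1)
y_curve = 1.0 + 2.0 * x_curve[:, 0] ** 2 + rng.normal(0, 1.0, size=120)

poly_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures()),
    ("regressor", LinearRegression()),
])

candidate_degrees = [1, 2, 3, 5, 8]
curve_train_r2, curve_val_r2 = validation_curve(
    poly_pipeline, x_curve, y_curve,
    param_name="poly__degree", param_range=candidate_degrees,
    cv=5, scoring="r2",
)

# Rows are degrees, columns are folds, so average across axis 1
mean_train = curve_train_r2.mean(axis=1)
mean_val = curve_val_r2.mean(axis=1)
for degree, tr, va in zip(candidate_degrees, mean_train, mean_val):
    print(f"Degree {degree}: mean train R² = {tr:.3f}, mean validation R² = {va:.3f}")

best_degree = candidate_degrees[int(np.argmax(mean_val))]
print(f"Degree with the highest mean validation R²: {best_degree}")
```

On this toy data a straight line (degree 1) underfits the curved signal, so its validation score should sit well below that of the higher degrees.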

In [None]:
# Demonstrate overfitting with polynomial features
print("=== OVERFITTING AND UNDERFITTING ===")

# Use only size feature for simplicity
X_simple = house_data[['Size_SqFt']].values
y_simple = house_data['Price'].values

# Split data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Test different polynomial degrees
degrees = [1, 2, 3, 5, 10, 15]
train_scores = []
test_scores = []

for degree in degrees:
    # Scale first: raw square footage raised to the 10th or 15th power is so
    # large that the least-squares fit becomes numerically unstable
    poly_model = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=degree)),
        ('regressor', LinearRegression())
    ])
    
    # Fit model
    poly_model.fit(X_train_simple, y_train_simple)
    
    # Calculate scores
    train_score = poly_model.score(X_train_simple, y_train_simple)
    test_score = poly_model.score(X_test_simple, y_test_simple)
    
    train_scores.append(train_score)
    test_scores.append(test_score)
    
    print(f"Degree {degree}: Train R² = {train_score:.3f}, Test R² = {test_score:.3f}")

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_scores, 'o-', label='Training Score', color='blue')
plt.plot(degrees, test_scores, 'o-', label='Testing Score', color='red')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Overfitting vs Underfitting')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

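print("\nKey Observations:")
print("- Underfitting: a model too simple for the pattern scores poorly on both train and test data")
print("  (these synthetic prices are linear in size, so degree 1 is already adequate here)")
print("- High degrees (10+): May overfit - training score can rise while test score falls")
print("- Choose the degree with a validation set or cross-validation, not the final test set")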

## 8. Model Evaluation Metrics Deep Dive

Understanding different evaluation metrics helps choose the right model for your problem.
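
As a small worked example, the sketch below computes each metric by hand with NumPy on four made-up values, so the formulas are visible. The numbers and the `*_demo` names are illustrative assumptions.

```python
import numpy as np

# Hypothetical actual and predicted values (illustrative numbers only)
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_demo - y_pred_demo             # [-10, 10, -20, 20]

mse_demo = np.mean(errors ** 2)                # mean of squared errors
rmse_demo = np.sqrt(mse_demo)                  # back in the target's units
mae_demo = np.mean(np.abs(errors))             # mean of absolute errors

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

# MAPE: mean absolute error relative to the actual values, in percent
mape_demo = np.mean(np.abs(errors / y_true_demo)) * 100

print(f"MSE:  {mse_demo:.1f}")
print(f"RMSE: {rmse_demo:.3f}")
print(f"MAE:  {mae_demo:.1f}")
print(f"R²:   {r2_demo:.3f}")
print(f"MAPE: {mape_demo:.2f}%")
```

RMSE comes out slightly larger than MAE here because squaring gives the two larger errors more weight.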

In [None]:
# Comprehensive evaluation metrics
print("=== EVALUATION METRICS EXPLAINED ===")

# Use our multiple regression model
y_pred = y_pred_multiple

# Calculate various metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Mean Absolute Percentage Error (MAPE); unstable when actual values are close to zero
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

print(f"Mean Squared Error (MSE): {mse:,.0f}")
print(f"  - Measures average squared difference between actual and predicted")
print(f"  - Penalizes large errors more heavily")
print(f"  - Units: squared dollars")

print(f"\nRoot Mean Squared Error (RMSE): ${rmse:,.2f}")
print(f"  - Square root of MSE")
print(f"  - Same units as target variable (dollars)")
print(f"  - Easier to interpret than MSE")

print(f"\nMean Absolute Error (MAE): ${mae:,.2f}")
print(f"  - Average absolute difference between actual and predicted")
print(f"  - Less sensitive to outliers than RMSE")
print(f"  - Same units as target variable")

print(f"\nR² Score (Coefficient of Determination): {r2:.3f}")
print(f"  - Proportion of variance in target explained by model")
print(f"  - Range: -∞ to 1 (1 is perfect, 0 is no better than mean)")
print(f"  - {r2*100:.1f}% of price variation is explained by our model")

print(f"\nMean Absolute Percentage Error (MAPE): {mape:.2f}%")
print(f"  - Average percentage error between actual and predicted")
print(f"  - Useful for understanding relative error size")
print(f"  - Our model is off by an average of {mape:.1f}%")

## 9. Practice Exercises

Now it's your turn to practice what you've learned!

In [None]:
### Exercise 1: Feature Engineering
# Create a new feature that represents the average size per bedroom
# and train a model using this feature along with the original features.
# Note: price per square foot is computed from the target (Price), so using
# it as a model input would leak the answer; it is kept for exploration only.

# Your code here
# Solution:
house_data_extended = house_data.copy()
house_data_extended['Price_per_SqFt'] = house_data_extended['Price'] / house_data_extended['Size_SqFt']  # not a feature
house_data_extended['Room_Size'] = house_data_extended['Size_SqFt'] / house_data_extended['Bedrooms']

# Train model with engineered features
X_engineered = house_data_extended[['Size_SqFt', 'Bedrooms', 'Age_Years', 'Room_Size']]
X_train_eng, X_test_eng, y_train_eng, y_test_eng = train_test_split(
    X_engineered, house_data_extended['Price'], test_size=0.2, random_state=42
)

model_eng = LinearRegression()
model_eng.fit(X_train_eng, y_train_eng)
y_pred_eng = model_eng.predict(X_test_eng)

print(f"Model with engineered features - R²: {r2_score(y_test_eng, y_pred_eng):.3f}")
print(f"Original model - R²: {r2_multiple:.3f}")
print(f"Improvement: {((r2_score(y_test_eng, y_pred_eng) - r2_multiple) / r2_multiple * 100):.1f}%")

In [None]:
### Exercise 2: Model Comparison
# Compare the performance of different train/test split ratios

# Your code here
# Solution:
split_ratios = [0.1, 0.2, 0.3, 0.4, 0.5]
results = []

for ratio in split_ratios:
    X_train_split, X_test_split, y_train_split, y_test_split = train_test_split(
        X_multiple, y, test_size=ratio, random_state=42
    )
    
    model_split = LinearRegression()
    model_split.fit(X_train_split, y_train_split)
    y_pred_split = model_split.predict(X_test_split)
    
    r2_split = r2_score(y_test_split, y_pred_split)
    results.append(r2_split)
    
    print(f"Test size: {ratio*100:.0f}% - R²: {r2_split:.3f}")

# Plot results
plt.figure(figsize=(10, 6))
plt.plot([r*100 for r in split_ratios], results, 'o-', color='blue', linewidth=2, markersize=8)
plt.xlabel('Test Set Size (%)')
plt.ylabel('R² Score')
plt.title('Model Performance vs Test Set Size')
plt.grid(True, alpha=0.3)
plt.show()

## 10. Summary

Congratulations! You've completed your introduction to supervised learning and linear regression. Here's what you learned:

### Key Concepts Mastered:
1. **Machine Learning Types**: Supervised, unsupervised, and reinforcement learning
2. **Linear Regression**: Simple and multiple linear regression
3. **Feature Engineering**: Creating new features to improve model performance
4. **Model Evaluation**: RMSE, MAE, R², and MAPE metrics
5. **Data Splitting**: Train/validation/test splits and cross-validation
6. **Overfitting/Underfitting**: Understanding the bias-variance tradeoff

### Key Skills Acquired:
- Building and interpreting linear regression models
- Evaluating model performance using multiple metrics
- Implementing proper data splitting strategies
- Understanding overfitting and underfitting
- Creating and engineering features for better predictions

### Next Week Preview: Multiple Linear Regression and Fine Tuning
- Multiple Linear Regression concepts
- Ridge Regression for regularization
- Lasso Regression for feature selection
- ElasticNet combining Ridge and Lasso
- Model selection and hyperparameter tuning

### Best Practices to Remember:
- Always split your data before training
- Use cross-validation for robust performance estimates
- Start with simple models before adding complexity
- Feature engineering can significantly improve performance
- Understand your evaluation metrics and choose appropriate ones
- Watch out for overfitting with complex models

Great job on completing your first machine learning algorithms! 🎉