# Module 17: Linear Regression

## Topics Covered
1. Simple Linear Regression
2. Multiple Linear Regression
3. Assumptions of Linear Regression
4. Ordinary Least Squares (OLS)
5. Gradient Descent for Regression
6. Polynomial Regression
7. Regularization (Ridge, Lasso, Elastic Net)
8. Regression Evaluation Metrics
9. Residual Analysis
10. Feature Scaling and Transformation

## Learning Objectives

By the end of this module, you will be able to:
- Build and interpret simple and multiple linear regression models
- Understand the mathematical foundations of OLS and gradient descent
- Apply polynomial regression for non-linear relationships
- Use regularization techniques to reduce overfitting
- Evaluate regression models using appropriate metrics
- Diagnose model issues through residual analysis
- Prepare features for optimal model performance

---
# Section 1: Simple Linear Regression
---

## What is Simple Linear Regression?

**Simple linear regression** models the relationship between two variables using a straight line. The goal is to find the best-fitting line that describes how one variable (dependent/target) changes with another variable (independent/feature).

The equation is: **y = β₀ + β₁x + ε**

Where:
- y is the dependent variable (what we predict)
- x is the independent variable (what we use to predict)
- β₀ is the intercept (the expected value of y when x = 0)
- β₁ is the slope (the expected change in y for a one-unit increase in x)
- ε is the error term (the variation in y that the line does not explain)
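
The intercept and slope have a closed-form least-squares solution: the slope is the sum of the products of the x and y deviations from their means, divided by the sum of squared x deviations, and the intercept follows from the two means. A minimal NumPy sketch, assuming made-up numbers (the `x` and `y` arrays below are hypothetical, not from a real dataset):

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates for y = b0 + b1 * x
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# Residuals: the part of y that the fitted line does not explain
residuals = y - (b0 + b1 * x)

print(f"Intercept (b0): {b0:.3f}")
print(f"Slope (b1): {b1:.3f}")
print(f"Largest absolute residual: {np.abs(residuals).max():.3f}")
```

scikit-learn's `LinearRegression` solves the same least-squares problem, so fitting it on these arrays should give matching values up to floating-point error.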

### Why This Matters in Data Science

Linear regression is one of the most widely used algorithms in data science. It's interpretable, fast, and serves as the foundation for more complex models. Use cases include sales forecasting, price prediction, and understanding relationships between business metrics.

In [None]:
# Example 1: Simple Linear Regression - Advertising Spend vs Sales
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic data
np.random.seed(42)
advertising_spend = np.array([10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])
sales = 1000 + 50 * advertising_spend + np.random.normal(0, 200, len(advertising_spend))

# Create DataFrame
df = pd.DataFrame({'Advertising_Spend': advertising_spend, 'Sales': sales})
print(df.head())

# Visualize relationship
plt.figure(figsize=(10, 6))
plt.scatter(advertising_spend, sales, alpha=0.7, s=100)
plt.xlabel('Advertising Spend ($1000s)')
plt.ylabel('Sales ($)')
plt.title('Advertising Spend vs Sales')
plt.grid(alpha=0.3)
plt.show()

print(f"\nCorrelation: {np.corrcoef(advertising_spend, sales)[0,1]:.3f}")

In [None]:
# Example 2: Building and Evaluating a Linear Regression Model

# Reshape data for sklearn
X = advertising_spend.reshape(-1, 1)
y = sales

# Split data (with only 15 points, the test set holds just 3, so test metrics will be noisy)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Model parameters
print("Model Parameters:")
print(f"Intercept (β₀): ${model.intercept_:.2f}")
print(f"Slope (β₁): ${model.coef_[0]:.2f}")
print(f"\nInterpretation: For every $1000 increase in advertising, sales increase by ${model.coef_[0]:.2f}")

# Evaluate model
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print(f"\nTraining R²: {train_r2:.3f}")
print(f"Testing R²: {test_r2:.3f}")
print(f"Training RMSE: ${train_rmse:.2f}")
print(f"Testing RMSE: ${test_rmse:.2f}")

# Visualize predictions
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, alpha=0.6, label='Training Data', s=100)
plt.scatter(X_test, y_test, alpha=0.6, label='Test Data', s=100, color='orange')
plt.plot(X, model.predict(X), 'r-', linewidth=2, label='Fitted Line')
plt.xlabel('Advertising Spend ($1000s)')
plt.ylabel('Sales ($)')
plt.title('Linear Regression: Advertising vs Sales')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

## Practice Exercise 1.1

**Task:** Given this data about study hours and exam scores:
```
study_hours = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
exam_scores = [50, 55, 60, 65, 70, 72, 78, 83, 88, 90]
```

1. Create a scatter plot
2. Build a linear regression model
3. Print the equation (intercept and slope)
4. Calculate R² score
5. Predict the score for someone who studies 7.5 hours

**Expected Output:**
```
Equation: score = 41.7 + 4.5 × hours
R² ≈ 0.995
Prediction for 7.5 hours ≈ 75.6
```

In [None]:
# Your code here


In [None]:
# Solution 1.1

study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11]).reshape(-1, 1)
exam_scores = np.array([50, 55, 60, 65, 70, 72, 78, 83, 88, 90])

# Scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(study_hours, exam_scores, s=100, alpha=0.7)
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Study Hours vs Exam Scores')
plt.grid(alpha=0.3)
plt.show()

# Build model
model = LinearRegression()
model.fit(study_hours, exam_scores)

# Print equation
print(f"Equation: score = {model.intercept_:.1f} + {model.coef_[0]:.1f} × hours")

# Calculate R²
y_pred = model.predict(study_hours)
r2 = r2_score(exam_scores, y_pred)
print(f"R² Score: {r2:.3f}")

# Predict for 7.5 hours
prediction = model.predict([[7.5]])[0]
print(f"Prediction for 7.5 hours: {prediction:.1f}")
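
The R² in this solution is computed on the same data the model was fitted on. For a simple regression with an intercept, that value equals the square of the Pearson correlation between x and y. The sketch below checks this on the Exercise 1.1 data; it uses `np.polyfit` rather than `LinearRegression` as a stand-in, since both fit the same least-squares line.

```python
import numpy as np

# Data from Exercise 1.1
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
exam_scores = np.array([50, 55, 60, 65, 70, 72, 78, 83, 88, 90], dtype=float)

# Fit a straight line; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(study_hours, exam_scores, 1)
predicted = intercept + slope * study_hours

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((exam_scores - predicted) ** 2)
ss_tot = np.sum((exam_scores - exam_scores.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Pearson correlation between study hours and exam scores
r = np.corrcoef(study_hours, exam_scores)[0, 1]

print(f"R²: {r2:.4f}")
print(f"Correlation squared: {r ** 2:.4f}")
```

This identity holds only for one feature with an intercept, and only on the data used for fitting; it does not carry over to a held-out test set.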

---
# Section 2: Multiple Linear Regression
---

## What is Multiple Linear Regression?

**Multiple linear regression** extends simple linear regression to use multiple features for prediction. Instead of one line, we're fitting a hyperplane in multi-dimensional space.

The equation is: **y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ + ε**

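Each slope coefficient (β₁ through βₙ) represents the expected change in y for a one-unit increase in its feature, holding all other features constant. The intercept β₀ is the expected value of y when every feature equals zero.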
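
To make "holding all other features constant" concrete, the sketch below fits a two-feature model by least squares and then changes one feature by a single unit while keeping the other fixed. The data is hypothetical and noise-free, generated from y = 3 + 2·x₁ − 1·x₂, and `predict` is a helper defined only for this illustration.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3 + 2 * X[:, 0] - 1 * X[:, 1]

# Add a column of ones so the first coefficient acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution for [b0, b1, b2]
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = beta

def predict(x1, x2):
    """Prediction from the fitted coefficients (illustrative helper)."""
    return b0 + b1 * x1 + b2 * x2

# Increase x1 by one unit while holding x2 fixed at 4
change = predict(6, 4) - predict(5, 4)

print(f"Coefficients: b0={b0:.2f}, b1={b1:.2f}, b2={b2:.2f}")
print(f"Change in prediction for +1 in x1: {change:.2f}")
```

The change in the prediction matches the coefficient on x₁, whatever value x₂ is held at. With noisy real data the fitted coefficients would only approximate the generating values.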

### Why This Matters in Data Science

Real-world predictions rarely depend on just one variable. Multiple regression lets you model outcomes that depend on several factors at once, such as house prices based on size, location, and age, or sales based on advertising, season, and competition.

In [None]:
# Example 1: Multiple Linear Regression - House Price Prediction

# Generate synthetic housing data
np.random.seed(42)
n_samples = 100
size_sqft = np.random.randint(1000, 3500, n_samples)
bedrooms = np.random.randint(2, 6, n_samples)
age_years = np.random.randint(0, 50, n_samples)

# Price formula: base + size_effect + bedroom_effect - age_effect + noise
price = (50000 + 
        150 * size_sqft + 
        25000 * bedrooms - 
        1000 * age_years + 
        np.random.normal(0, 50000, n_samples))

# Create DataFrame
df_houses = pd.DataFrame({
    'Size_SqFt': size_sqft,
    'Bedrooms': bedrooms,
    'Age_Years': age_years,
    'Price': price
})

print("House Data Sample:")
print(df_houses.head())
print(f"\nDataset shape: {df_houses.shape}")
print(f"\nCorrelations with Price:")
print(df_houses.corr()['Price'].sort_values(ascending=False))

In [None]:
# Example 2: Building Multiple Regression Model

# Prepare features and target
X = df_houses[['Size_SqFt', 'Bedrooms', 'Age_Years']]
y = df_houses['Price']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model_multi = LinearRegression()
model_multi.fit(X_train, y_train)

# Model coefficients
print("Multiple Regression Model:")
print(f"Intercept: ${model_multi.intercept_:,.2f}")
for feature, coef in zip(X.columns, model_multi.coef_):
    print(f"{feature}: ${coef:,.2f}")

print("\nInterpretation (holding the other features constant):")
units = ['sqft', 'bedroom', 'year of age']
for unit, coef in zip(units, model_multi.coef_):
    # Take the direction from the fitted sign rather than assuming it
    direction = "increases" if coef >= 0 else "reduces"
    print(f"- Each additional {unit} {direction} price by ${abs(coef):,.2f}")

# Evaluate
y_pred_train = model_multi.predict(X_train)
y_pred_test = model_multi.predict(X_test)

print(f"\nModel Performance:")
print(f"Training R²: {r2_score(y_train, y_pred_train):.3f}")
print(f"Testing R²: {r2_score(y_test, y_pred_test):.3f}")
print(f"Testing RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred_test)):,.2f}")

# Example prediction
sample_house = pd.DataFrame({'Size_SqFt': [2500], 'Bedrooms': [4], 'Age_Years': [10]})
predicted_price = model_multi.predict(sample_house)[0]
print(f"\nPrediction for 2500 sqft, 4 bed, 10 years old: ${predicted_price:,.2f}")

## Practice Exercise 2.1

**Task:** Build a multiple regression model to predict student performance:
```
data = {
    'study_hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'attendance_pct': [70, 75, 80, 85, 88, 90, 92, 95, 97, 98],
    'prev_score': [50, 55, 60, 62, 68, 70, 75, 78, 82, 85],
    'final_score': [55, 60, 65, 70, 75, 78, 83, 87, 90, 93]
}
```

1. Build a multiple regression model
2. Print all coefficients with interpretation
3. Calculate R² score
4. Which feature is most important by absolute coefficient? (This is only a rough guide, because raw coefficients depend on each feature's units.)

**Expected Output:**
```
R² > 0.95
Most important feature: study_hours
```

In [None]:
# Your code here


In [None]:
# Solution 2.1

data = {
    'study_hours': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'attendance_pct': [70, 75, 80, 85, 88, 90, 92, 95, 97, 98],
    'prev_score': [50, 55, 60, 62, 68, 70, 75, 78, 82, 85],
    'final_score': [55, 60, 65, 70, 75, 78, 83, 87, 90, 93]
}

df_students = pd.DataFrame(data)
X = df_students[['study_hours', 'attendance_pct', 'prev_score']]
y = df_students['final_score']

# Build model
model = LinearRegression()
model.fit(X, y)

# Print coefficients
print("Model Coefficients:")
print(f"Intercept: {model.intercept_:.2f}")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.3f}")

# Calculate R²
y_pred = model.predict(X)
r2 = r2_score(y, y_pred)
print(f"\nR² Score: {r2:.3f}")

# Rank features by raw absolute coefficient (scale-dependent, so only a rough guide)
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': np.abs(model.coef_)
}).sort_values('Coefficient', ascending=False)

print(f"\nMost important feature: {feature_importance.iloc[0]['Feature']}")
print("\nFeature Importance (by absolute coefficient):")
print(feature_importance)
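
Raw coefficients are measured in target units per feature unit, so comparing them across features on different scales (hours, percentages, points) can mislead. One common alternative is to standardize the features first, so that each coefficient is expressed per standard deviation of its feature. A minimal NumPy sketch on the Exercise 2.1 data follows; with only 10 rows and strongly correlated features, treat any ranking as illustrative rather than conclusive.

```python
import numpy as np

# Data from Exercise 2.1
study_hours = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
attendance_pct = np.array([70, 75, 80, 85, 88, 90, 92, 95, 97, 98], dtype=float)
prev_score = np.array([50, 55, 60, 62, 68, 70, 75, 78, 82, 85], dtype=float)
final_score = np.array([55, 60, 65, 70, 75, 78, 83, 87, 90, 93], dtype=float)

names = ["study_hours", "attendance_pct", "prev_score"]
X = np.column_stack([study_hours, attendance_pct, prev_score])

# Standardize each feature to mean 0 and standard deviation 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit by least squares with an intercept column
X_design = np.column_stack([np.ones(len(X_std)), X_std])
beta, *_ = np.linalg.lstsq(X_design, final_score, rcond=None)
std_coefs = beta[1:]

# Rank features by absolute standardized coefficient
for idx in np.argsort(-np.abs(std_coefs)):
    print(f"{names[idx]}: {std_coefs[idx]:.3f} points per standard deviation")
```

In scikit-learn, the same preprocessing is available through `StandardScaler` followed by `LinearRegression`.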

---
# Module Summary

## Key Takeaways

- **Simple linear regression** models the relationship between two variables using a straight line
- **Multiple linear regression** extends this to multiple features, enabling more realistic models
- **OLS (Ordinary Least Squares)** finds the best-fit line by minimizing the sum of squared residuals
- **Gradient descent** is an iterative optimization algorithm that can also find regression coefficients
- **Polynomial regression** captures non-linear relationships while remaining linear in its coefficients
- **Regularization** (Ridge, Lasso, Elastic Net) reduces overfitting by penalizing large coefficients
- **Evaluation metrics** like R², RMSE, and MAE quantify model performance
- **Residual analysis** helps check model assumptions and identify issues
- **Feature scaling** speeds up gradient descent and keeps regularization penalties comparable across features, while **transformation** can improve the fit when relationships are not linear
- Coefficients in linear regression are **interpretable**: each represents the expected effect of one feature on the target, holding the other features constant

## Next Module

In Module 18: Logistic Regression, we'll extend regression concepts to classification problems. You'll learn how to predict binary outcomes and evaluate classification models using metrics like accuracy, precision, recall, and ROC curves.

## Additional Practice

For extra practice, try these challenges:

1. **Feature Engineering**: Create polynomial features and interaction terms to improve model performance
2. **Real Dataset**: Use the California Housing dataset from sklearn (`sklearn.datasets.fetch_california_housing`) and build a comprehensive regression model
3. **Regularization Comparison**: Compare Ridge, Lasso, and Elastic Net on a dataset with many features
4. **Gradient Descent**: Implement gradient descent from scratch and compare with sklearn's solution
5. **Residual Analysis**: Create a complete residual diagnostic plot (residuals vs fitted, Q-Q plot, scale-location, leverage)