# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 8 - Lab 01: Linear Regression Fundamentals
**Instructor:** Amir Charkhi | **Type:** Hands-On Practice

> Master Linear and Polynomial Regression

## 🎯 Lab Objectives

In this lab, you'll practice:
- Building and interpreting Linear Regression models
- Understanding coefficients and feature importance
- Detecting and handling overfitting
- Implementing Polynomial Regression
- Using regularization (Ridge, Lasso)

**Time**: 35-45 minutes  
**Difficulty**: ⭐⭐⭐☆☆ (Intermediate)

---

## 📚 Quick Reference

**Linear Regression:**
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures

# Simple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Coefficients
coefficients = model.coef_
intercept = model.intercept_

# Polynomial Features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Regularization
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=1.0)
```

---

In [None]:
# Setup - Run this cell first!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
print("✅ Setup complete! Let's master Linear Regression!")

---

## 📊 Exercise 1: Simple Linear Regression

Let's start with the basics and build our first regression model!

### Task 1.1: Load and Explore the Dataset

In [None]:
# Load California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')

print(f"Dataset loaded: {len(X)} samples, {X.shape[1]} features")
print(f"\nFeatures: {list(X.columns)}")

# TODO 1.1: Explore the data
# - Display first 5 rows
# - Show summary statistics
# - Display target variable statistics (min, max, mean)

# Your code here:


print("\n✅ Task 1.1 Complete!")

### Task 1.2: Visualize Feature Relationships

In [None]:
# TODO 1.2: Create visualizations
# Requirements:
#   1. Create a correlation heatmap
#   2. Create scatter plot: MedInc vs target (most correlated feature)
#   3. Identify the top 3 features most correlated with target

# Your code here:
# Hint: Combine X and y into a single DataFrame for correlation
df_combined = pd.concat([X, y], axis=1)

# 1. Correlation heatmap


# 2. Scatter plot for top feature


# 3. Print top 3 correlated features
correlations = # Calculate correlations with target
print("\nTop 3 features by correlation:")
# Your code here


print("\n✅ Task 1.2 Complete!")
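
One way to rank features by correlation with the target is sketched below. It uses a small synthetic DataFrame rather than the housing data, and the names `demo`, `strong`, `weak`, and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the housing features)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo["target"] = 3.0 * demo["strong"] - 0.5 * demo["weak"] + rng.normal(scale=0.5, size=n)

# Correlation of each feature with the target (drop the target's self-correlation)
target_corr = demo.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not pushed to the bottom
ranked = target_corr.abs().sort_values(ascending=False)
top_features = ranked.index.tolist()
print(ranked)
```

Ranking on the absolute value matters because a feature with a correlation of -0.6 is more informative than one with +0.1.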

### Task 1.3: Build Your First Model

In [None]:
# TODO 1.3: Train a Linear Regression model
# Steps:
#   1. Split data (80/20, random_state=42)
#   2. Create and train LinearRegression model
#   3. Make predictions on test set
#   4. Calculate MAE, RMSE, and R² score

# Your code here:
X_train, X_test, y_train, y_test = # Split the data

model = # Create model
# Train model

y_pred = # Make predictions

# Calculate metrics
mae = # Mean Absolute Error
rmse = # Root Mean Squared Error
r2 = # R² Score

# Validation (Don't modify)
print("Model Performance:")
print(f"  MAE:  ${mae*100:.2f}k")
print(f"  RMSE: ${rmse*100:.2f}k")
print(f"  R²:   {r2:.4f}")

if r2 > 0.5:
    print(f"\n✅ Good start! Model explains {r2*100:.1f}% of variance")
    print("🎉 Task 1.3 Complete!")
else:
    print("\n⚠️ Check your code - R² should be above 0.5")
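
As a reference for the split, fit, predict, and evaluate pattern, here is a minimal sketch on synthetic data from `make_regression`. The `demo`-style variable names and the data settings are illustrative assumptions, not part of the lab's dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumed settings, for illustration only)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=15.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

demo_mae = mean_absolute_error(yd_test, yd_pred)
demo_rmse = np.sqrt(mean_squared_error(yd_test, yd_pred))  # RMSE is the square root of MSE
demo_r2 = r2_score(yd_test, yd_pred)

print(f"MAE: {demo_mae:.2f}  RMSE: {demo_rmse:.2f}  R2: {demo_r2:.4f}")
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two suggests a few big errors dominate.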

---

## 🔍 Exercise 2: Understanding Coefficients

Let's interpret what the model learned!

### Task 2.1: Analyze Feature Importance

In [None]:
# TODO 2.1: Extract and visualize coefficients
# Requirements:
#   1. Get model coefficients and create a DataFrame
#   2. Sort by absolute coefficient value
#   3. Create a horizontal bar plot
#   4. Print the intercept

# Your code here:
coefficients = # Get coefficients
intercept = # Get intercept

# Create DataFrame of coefficients
coef_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': # Your coefficients here
})

# Sort by absolute value


# Visualize
plt.figure(figsize=(10, 6))
# Create horizontal bar plot


plt.tight_layout()
plt.show()

print(f"\nIntercept: {intercept:.4f}")
print("\n💡 Interpretation:")
print("  Positive coefficient = feature increases prediction")
print("  Negative coefficient = feature decreases prediction")
print("  Larger absolute value = stronger effect per unit of that feature")
print("  (compare magnitudes across features only when they share a scale)")
print("\n✅ Task 2.1 Complete!")
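
Coefficient magnitudes depend on each feature's units, which the following sketch tries to show. It uses two hypothetical features, `small_scale` and `large_scale`, and the helper `coef_table` is an illustrative name, not something from the lab.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400
# Hypothetical features measured on very different scales
small_scale = rng.normal(0, 1, n)      # standard deviation around 1
large_scale = rng.normal(0, 1000, n)   # standard deviation around 1000
X_demo = pd.DataFrame({"small_scale": small_scale, "large_scale": large_scale})
y_demo = 2.0 * small_scale + 0.01 * large_scale + rng.normal(0, 0.5, n)

def coef_table(model, names):
    """Return coefficients as a DataFrame sorted by absolute value (largest first)."""
    table = pd.DataFrame({"Feature": names, "Coefficient": model.coef_})
    order = table["Coefficient"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)

# Raw features: the coefficient for large_scale looks tiny (about 0.01)
raw_table = coef_table(LinearRegression().fit(X_demo, y_demo), X_demo.columns)

# Standardized features: coefficients are now per standard deviation
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_table = coef_table(LinearRegression().fit(X_scaled, y_demo), X_demo.columns)

print(raw_table)
print(scaled_table)
```

On the raw data `small_scale` ranks first. After standardizing, `large_scale` ranks first because a typical change in it moves the prediction more.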

### Task 2.2: Manual Prediction

In [None]:
# TODO 2.2: Make a manual prediction using the equation
# Linear Regression: y = intercept + coef1*x1 + coef2*x2 + ... + coefn*xn

# Get the first test sample
sample = X_test.iloc[0]
actual = y_test.iloc[0]
# Pass a one-row DataFrame so the feature names match those seen during fit
model_pred = model.predict(X_test.iloc[[0]])[0]

print("Sample features:")
print(sample)

# Your code here:
# Calculate prediction manually: intercept + sum of (coefficient * feature value)
manual_pred = # Start with intercept, then add each (coef * feature value)


# Validation (Don't modify)
print(f"\nManual prediction: {manual_pred:.4f}")
print(f"Model prediction:  {model_pred:.4f}")
print(f"Actual value:      {actual:.4f}")

if abs(manual_pred - model_pred) < 0.01:
    print("\n✅ Perfect! You understand the linear equation!")
    print("🎉 Task 2.2 Complete!")
else:
    print("\n⚠️ Manual and model predictions should match")
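
The equation `y = intercept + coef1*x1 + ... + coefn*xn` can be checked with a dot product. The sketch below does this on synthetic data, with `demo_model` and `row` as illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed settings, for illustration only)
X_demo, y_demo = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)
demo_model = LinearRegression().fit(X_demo, y_demo)

row = X_demo[0]  # one sample as a 1-D array of feature values

# intercept + sum(coef_i * x_i), written as a dot product
manual = demo_model.intercept_ + np.dot(demo_model.coef_, row)

# predict() expects a 2-D array, hence the reshape to (1, n_features)
from_model = demo_model.predict(row.reshape(1, -1))[0]

print(f"Manual: {manual:.6f}  Model: {from_model:.6f}")
```

The two values should agree up to floating-point rounding, because `predict` evaluates the same linear equation.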

---

## 📈 Exercise 3: Polynomial Regression

Sometimes relationships aren't perfectly linear - let's add some curves!

### Task 3.1: Generate Non-Linear Data

In [None]:
# Create synthetic non-linear data
np.random.seed(42)
X_simple = np.linspace(0, 10, 100).reshape(-1, 1)
y_simple = 3 + 2*X_simple + 0.5*X_simple**2 + np.random.normal(0, 2, X_simple.shape)
y_simple = y_simple.flatten()

# TODO 3.1: Visualize the non-linear relationship
# Create a scatter plot to see the curve

# Your code here:
plt.figure(figsize=(10, 6))
# Create scatter plot


print("💡 Notice the curved pattern - linear regression won't fit perfectly!")
print("✅ Task 3.1 Complete!")

### Task 3.2: Compare Linear vs Polynomial Regression

In [None]:
# TODO 3.2: Fit both linear and polynomial regression
# Requirements:
#   1. Fit simple Linear Regression
#   2. Create polynomial features (degree=2)
#   3. Fit Linear Regression on polynomial features
#   4. Compare R² scores
#   5. Visualize both fits

# Split data
X_train_simple, X_test_simple, y_train_simple, y_test_simple = train_test_split(
    X_simple, y_simple, test_size=0.2, random_state=42
)

# Your code here:
# 1. Linear model
linear_model = # Create and fit linear model

y_pred_linear = # Predict
r2_linear = # Calculate R²

# 2. Polynomial model
poly = # Create PolynomialFeatures(degree=2)
X_train_poly = # Transform training data
X_test_poly = # Transform test data

poly_model = # Create and fit on polynomial features

y_pred_poly = # Predict
r2_poly = # Calculate R²

# Validation and Visualization
print("Model Comparison:")
print(f"  Linear Regression R²:     {r2_linear:.4f}")
print(f"  Polynomial Regression R²: {r2_poly:.4f}")
print(f"  Improvement: {(r2_poly - r2_linear):.4f}")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_test_simple, y_test_simple, alpha=0.5, label='Actual')
plt.scatter(X_test_simple, y_pred_linear, alpha=0.5, label='Linear Fit')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Linear Regression (R²={r2_linear:.3f})')
plt.legend()

plt.subplot(1, 2, 2)
plt.scatter(X_test_simple, y_test_simple, alpha=0.5, label='Actual')
plt.scatter(X_test_simple, y_pred_poly, alpha=0.5, label='Polynomial Fit')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Polynomial Regression (R²={r2_poly:.3f})')
plt.legend()

plt.tight_layout()
plt.show()

if r2_poly > r2_linear:
    print("\n✅ Polynomial fits better for non-linear data!")
    print("🎉 Task 3.2 Complete!")
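
Polynomial regression is ordinary linear regression on expanded features, as this sketch tries to illustrate. The data is a synthetic quadratic curve chosen for this example, and `include_bias=False` is a choice made here because `LinearRegression` already fits an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data (assumed coefficients and noise level)
rng = np.random.default_rng(7)
x_demo = np.linspace(-3, 3, 120).reshape(-1, 1)
x_flat = x_demo.ravel()
y_demo = 1.0 + 0.5 * x_flat + 2.0 * x_flat**2 + rng.normal(0, 1.0, 120)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

# Straight-line fit on the raw feature
line = LinearRegression().fit(xd_train, yd_train)
r2_line = r2_score(yd_test, line.predict(xd_test))

# Expand x into [x, x^2]; fit the transformer on training data only
poly_demo = PolynomialFeatures(degree=2, include_bias=False)
xd_train_poly = poly_demo.fit_transform(xd_train)
xd_test_poly = poly_demo.transform(xd_test)

curve = LinearRegression().fit(xd_train_poly, yd_train)
r2_curve = r2_score(yd_test, curve.predict(xd_test_poly))

print(f"Linear R2: {r2_line:.3f}  Polynomial R2: {r2_curve:.3f}")
```

Calling `transform` (not `fit_transform`) on the test split keeps the same pattern you will need later for scalers.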

### Task 3.3: Detecting Overfitting

In [None]:
# TODO 3.3: Explore what happens with very high polynomial degrees
# Requirements:
#   1. Try polynomial degrees from 1 to 15
#   2. Calculate train and test R² for each
#   3. Plot the results
#   4. Identify the overfitting point

# Your code here:
degrees = range(1, 16)
train_scores = []
test_scores = []

for degree in degrees:
    # Create polynomial features
    
    # Transform data
    
    # Train model
    
    # Calculate R² for both train and test
    
    # Append to lists
    pass

# Plot results
plt.figure(figsize=(10, 6))
plt.plot(degrees, train_scores, 'o-', label='Training R²', linewidth=2)
plt.plot(degrees, test_scores, 's-', label='Test R²', linewidth=2)
plt.xlabel('Polynomial Degree', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Overfitting Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Signs of Overfitting:")
print("  - Training R² keeps improving")
print("  - Test R² starts decreasing")
print("  - Large gap between train and test scores")
print("\n✅ Task 3.3 Complete!")
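
The degree sweep can be sketched as a loop over pipelines, shown below on a small synthetic dataset. The `StandardScaler` step is an addition assumed here to keep high-degree features numerically manageable. It is not required by the task.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Small synthetic quadratic dataset (assumed settings); few points make overfitting easier to see
rng = np.random.default_rng(3)
x_demo = np.linspace(-3, 3, 40).reshape(-1, 1)
x_flat = x_demo.ravel()
y_demo = 1.0 + 0.5 * x_flat + 2.0 * x_flat**2 + rng.normal(0, 2.0, 40)

xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

demo_degrees = range(1, 16)
demo_train_r2, demo_test_r2 = [], []

for d in demo_degrees:
    pipe = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    pipe.fit(xd_train, yd_train)
    # For regressors, score() returns R^2
    demo_train_r2.append(pipe.score(xd_train, yd_train))
    demo_test_r2.append(pipe.score(xd_test, yd_test))

# The degree with the highest test R^2 is a reasonable candidate for "just enough" complexity
best_degree = demo_degrees[int(np.argmax(demo_test_r2))]
print(f"Best degree by test R2: {best_degree}")
```

Degrees beyond `best_degree` where the training score keeps rising while the test score falls mark the overfitting region.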

---

## 🛡️ Exercise 4: Regularization

Prevent overfitting with Ridge and Lasso regression!

### Task 4.1: Compare Ridge and Lasso

In [None]:
# Back to California Housing with all features
# TODO 4.1: Compare Linear, Ridge, and Lasso Regression
# Requirements:
#   1. Standardize features (important for regularization!)
#   2. Train LinearRegression, Ridge(alpha=1.0), and Lasso(alpha=1.0)
#   3. Calculate test R² for each
#   4. Compare number of non-zero coefficients

# Use the housing data from Exercise 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Your code here:
# 1. Standardize features
scaler = StandardScaler()
X_train_scaled = # Fit and transform training data
X_test_scaled = # Transform test data

# 2. Train three models
models = {
    'Linear': # LinearRegression()
    'Ridge': # Ridge(alpha=1.0)
    'Lasso': # Lasso(alpha=1.0)
}

results = []
for name, reg in models.items():  # 'reg' avoids overwriting the Exercise 1 'model'
    # Train model
    
    # Predict and evaluate
    
    # Count non-zero coefficients
    
    pass

# Display results
results_df = pd.DataFrame(results)
print("\nModel Comparison:")
print(results_df.to_string(index=False))

print("\n💡 Key Differences:")
print("  Ridge: Shrinks all coefficients but keeps all features")
print("  Lasso: Can reduce some coefficients to exactly zero (feature selection!)")
print("\n✅ Task 4.1 Complete!")
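
The comparison loop might look like the sketch below, which uses synthetic data with only 3 informative features out of 10 (an assumption chosen so that Lasso has something to remove). All `demo`-style names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=10.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on training data only, then reuse it for the test data
scaler_demo = StandardScaler()
Xd_train_s = scaler_demo.fit_transform(Xd_train)
Xd_test_s = scaler_demo.transform(Xd_test)

demo_models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
}

demo_results = []
for name, reg in demo_models.items():
    reg.fit(Xd_train_s, yd_train)
    demo_results.append({
        "Model": name,
        "Test R2": reg.score(Xd_test_s, yd_test),
        "Non-zero coefs": int(np.sum(reg.coef_ != 0)),
    })

for row in demo_results:
    print(row)
```

How hard a given `alpha` bites depends on the scale of the target. If Lasso reports zero non-zero coefficients on your own data, that usually means `alpha` is too large for that target, so try a smaller value before assuming a bug.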

### Task 4.2: Tuning Regularization Strength

In [None]:
# TODO 4.2: Find optimal alpha for Ridge regression
# Requirements:
#   1. Try different alpha values: [0.01, 0.1, 1, 10, 100]
#   2. Use cross-validation to evaluate each
#   3. Plot CV scores vs alpha
#   4. Identify best alpha

# Your code here:
alphas = [0.01, 0.1, 1, 10, 100]
cv_scores_list = []

for alpha in alphas:
    # Create Ridge model with this alpha
    
    # Perform 5-fold cross-validation
    
    # Store mean CV score
    pass

# Visualize
plt.figure(figsize=(10, 6))
# Create plot of alpha vs CV scores


# Find and print best alpha
best_idx = # Index of best score
best_alpha = alphas[best_idx]
best_score = cv_scores_list[best_idx]

print(f"\nBest alpha: {best_alpha}")
print(f"Best CV R²: {best_score:.4f}")
print("\n✅ Task 4.2 Complete!")
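
A hedged sketch of the alpha search follows, again on synthetic data. Wrapping the scaler and Ridge in a pipeline is a design choice assumed here so that scaling is refit inside each CV fold. The task itself does not mandate it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed settings, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=1)

demo_alphas = [0.01, 0.1, 1, 10, 100]
demo_cv_means = []

for a in demo_alphas:
    # The pipeline refits the scaler on each training fold, avoiding leakage
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=a))
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2")
    demo_cv_means.append(scores.mean())

best_i = int(np.argmax(demo_cv_means))  # index of the highest mean CV score
demo_best_alpha = demo_alphas[best_i]

print(f"Best alpha: {demo_best_alpha}  mean CV R2: {demo_cv_means[best_i]:.4f}")
```

Because the alphas span several orders of magnitude, a logarithmic x-axis (for example `plt.xscale('log')`) tends to make the plot easier to read.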

---

## 🎯 Exercise 5: Model Selection Challenge

Put everything together!

### Task 5.1: Choose the Best Approach

In [None]:
# TODO 5.1: Complete workflow to find the best model
# Given scenario: Predicting house prices with limited features
# Requirements:
#   1. Select only top 3 most important features from correlation
#   2. Try: Linear, Ridge, Lasso, Polynomial(degree=2) + Ridge
#   3. Use cross-validation for fair comparison
#   4. Select best model and evaluate on test set

print("🏠 Housing Price Prediction Challenge")
print("="*50)

# Your complete solution here:
# Step 1: Feature selection


# Step 2: Train/test split


# Step 3: Compare models with CV


# Step 4: Train best model and evaluate


# Validation
print("\n" + "="*50)
if r2_test > 0.5:  # Replace with your test R²
    print("\n✅ Excellent! You've built a solid regression model!")
    print("\n💡 What you learned:")
    print("  - Linear regression basics")
    print("  - Polynomial features for non-linearity")
    print("  - Regularization to prevent overfitting")
    print("  - Model selection with cross-validation")
    print("\n🎉 Task 5.1 Complete!")
    print("🎉 Lab 01 Complete!")
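
One possible shape for the model selection workflow is sketched below on a synthetic three-feature dataset. The `candidates` dictionary, the Lasso `alpha=0.1`, and the other settings are assumptions for illustration, not the lab's reference solution.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for "top 3 features" (assumed settings)
X_demo, y_demo = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=5)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each candidate is a pipeline, so scaling happens inside every CV fold
candidates = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.1)),
    "Poly2 + Ridge": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

# Compare on the training split only; the test split stays untouched until the end
cv_means = {
    name: cross_val_score(est, Xd_train, yd_train, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}

best_name = max(cv_means, key=cv_means.get)
best_est = candidates[best_name].fit(Xd_train, yd_train)
demo_r2_test = best_est.score(Xd_test, yd_test)

print(f"Best by CV: {best_name}  test R2: {demo_r2_test:.4f}")
```

Selecting with cross-validation on the training split and scoring the winner once on the test split keeps the final number an honest estimate.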

---

## 🏆 Lab Complete!

### What You Practiced:

✅ **Exercise 1**: Simple Linear Regression basics  
✅ **Exercise 2**: Understanding and interpreting coefficients  
✅ **Exercise 3**: Polynomial Regression and overfitting  
✅ **Exercise 4**: Ridge and Lasso regularization  
✅ **Exercise 5**: Complete model selection workflow  

### Key Takeaways:

1. **Coefficients show each feature's effect** - sign gives direction, magnitude gives effect per unit; compare magnitudes across features only after standardizing
2. **Polynomial features capture non-linear relationships** but can overfit
3. **Regularization prevents overfitting** by penalizing large coefficients
4. **Ridge shrinks coefficients**, Lasso can **eliminate features**
5. **Always standardize** before applying regularization
6. **Cross-validation** helps find optimal hyperparameters

### When to Use Each:

- **Linear Regression**: Fast baseline, interpretable, works when relationships are linear
- **Polynomial**: When you see curves in scatter plots
- **Ridge**: When all features might be useful, prevents overfitting
- **Lasso**: When you want automatic feature selection

### Next Steps:

- Move to **Lab 02** for Decision Trees and Random Forests
- Try these techniques on your own datasets
- Experiment with different polynomial degrees and alpha values

**Outstanding work! 🎉**