# CE49X: Introduction to Computational Thinking and Data Science for Civil Engineers
## Week 6: Introduction to Machine Learning

**Instructor:** Dr. Eyuphan Koc  
**Department of Civil Engineering, Bogazici University**  
**Semester:** Spring 2026

---

Based on *Python Data Science Handbook* by Jake VanderPlas  
Chapter 5: Machine Learning (Sections 5.0--5.6)  
https://jakevdp.github.io/PythonDataScienceHandbook/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

## Table of Contents

1. [What is Machine Learning?](#1.-What-is-Machine-Learning?)
2. [Introducing Scikit-Learn](#2.-Introducing-Scikit-Learn)
3. [Hyperparameters and Model Validation](#3.-Hyperparameters-and-Model-Validation)
4. [Linear Regression](#4.-Linear-Regression)
5. [Summary and Next Steps](#5.-Summary-and-Next-Steps)

---
## 1. What is Machine Learning?

### What is Machine Learning?

**Definition:**
- Machine Learning is about building **mathematical models** to understand data
- Fundamentally, it's a **data-driven approach** to learning patterns
- Models learn from examples rather than explicit programming

**Key Idea:**

> *"Instead of programming explicit rules, we provide examples and let the algorithm discover the patterns."*

> **Example: Civil Engineering**  
> Rather than coding rules for predicting concrete strength, we provide examples of mix designs and their measured strengths -- the model learns the relationship.

### Categories of Machine Learning

| | **Supervised Learning** | **Unsupervised Learning** |
|---|---|---|
| **Data** | Learn from **labeled** data | Learn from **unlabeled** data |
| **Setup** | Have input-output pairs | No predefined outputs |
| **Goal** | Predict outputs for new inputs | Discover structure |
| **Type 1** | **Classification**: Predict discrete labels | **Clustering**: Group similar items |
| **Type 2** | **Regression**: Predict continuous values | **Dimensionality Reduction**: Compress data |

> **Key Insight: Supervised vs Unsupervised**  
> Supervised: "Here are examples with answers" vs Unsupervised: "Find patterns in this data"

### Supervised Learning: Classification vs Regression

| | **Classification** | **Regression** |
|---|---|---|
| **Predict** | **Discrete categories** | **Continuous values** |
| **Output** | A label or class | A number |
| **General Examples** | Email: spam or not spam? | House price prediction |
| | Image: cat or dog? | Temperature forecasting |
| **CE Examples** | Soil type: clay, sand, or silt? | Concrete strength from mix design |
| | Structure: safe or unsafe? | Bridge deflection under load |
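
To make the distinction in the table concrete, the sketch below builds both kinds of target from the same measurements. The strength values and the 30 MPa acceptance threshold are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical measured compressive strengths (MPa) for six specimens
strength_mpa = np.array([18.0, 24.5, 31.0, 36.2, 42.8, 47.1])

# Regression target: the continuous value itself
y_regression = strength_mpa

# Classification target: a discrete label derived from an assumed 30 MPa threshold
y_classification = np.where(strength_mpa >= 30.0, "adequate", "inadequate")

print(y_regression)
print(y_classification)
```

The features would be identical in both cases; only the type of target decides whether the task is regression or classification.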

### Visualizing Classification and Regression

The following plot shows the key difference between classification (left) and regression (right).

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Classification plot
ax = axes[0]
# Class A (blue)
ax.scatter([0.5, 0.8, 0.6], [0.5, 0.7, 1.0], c='steelblue', s=80, label='Class A', edgecolors='k', zorder=3)
# Class B (red)
ax.scatter([2.0, 2.3, 1.8], [1.5, 1.8, 2.0], c='indianred', s=80, label='Class B', edgecolors='k', zorder=3)
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('Classification', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim(0, 3)
ax.set_ylim(0, 2.5)
ax.grid(True, alpha=0.3)

# Regression plot
ax = axes[1]
x_pts = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y_pts = np.array([0.6, 1.0, 1.3, 1.7, 2.0])
ax.scatter(x_pts, y_pts, c='indianred', s=80, edgecolors='k', zorder=3, label='Data points')
# Regression line
x_line = np.linspace(0.3, 2.7, 100)
y_line = 0.3 + 0.7 * x_line
ax.plot(x_line, y_line, 'steelblue', linewidth=2, label='Fit line')
ax.set_xlabel('$x$', fontsize=12)
ax.set_ylabel('$y$', fontsize=12)
ax.set_title('Regression', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim(0, 3)
ax.set_ylim(0, 2.5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Unsupervised Learning: Clustering vs Dimensionality Reduction

| | **Clustering** | **Dimensionality Reduction** |
|---|---|---|
| **Goal** | Group similar data points | Reduce number of features |
| **Key property** | No predefined labels | Preserves important information |
| **Use** | Discover natural groupings | Visualization & compression |
| **General Examples** | Customer segmentation | Image compression |
| | Document organization | Feature extraction |
| **CE Examples** | Grouping similar bridge designs | Compress sensor data streams |
| | Identifying failure patterns | Visualize high-D material properties |

### Visualizing Clustering and Dimensionality Reduction

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Clustering plot
ax = axes[0]
cluster1_x = [0.5, 0.7, 0.4]
cluster1_y = [0.5, 0.8, 0.9]
cluster2_x = [2.0, 2.2, 1.9]
cluster2_y = [1.5, 1.7, 1.9]
ax.scatter(cluster1_x, cluster1_y, c='steelblue', s=80, edgecolors='k', zorder=3, label='Cluster 1')
ax.scatter(cluster2_x, cluster2_y, c='indianred', s=80, edgecolors='k', zorder=3, label='Cluster 2')
# Dashed circles around clusters
circle1 = plt.Circle((0.55, 0.73), 0.35, fill=False, linestyle='--', color='steelblue', linewidth=2)
circle2 = plt.Circle((2.03, 1.7), 0.35, fill=False, linestyle='--', color='indianred', linewidth=2)
ax.add_patch(circle1)
ax.add_patch(circle2)
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('Clustering', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim(0, 3)
ax.set_ylim(0, 2.5)
ax.set_aspect('equal')
ax.grid(True, alpha=0.3)

# Dimensionality Reduction plot
ax = axes[1]
pts_x = [0.8, 1.2, 1.6, 2.0]
pts_y = [0.7, 1.0, 1.3, 1.6]
ax.scatter(pts_x, pts_y, c='indianred', s=80, edgecolors='k', zorder=3, label='Data points')
ax.annotate('', xy=(2.5, 2.0), xytext=(0.3, 0.3),
            arrowprops=dict(arrowstyle='->', color='steelblue', lw=2.5))
ax.text(2.6, 2.05, 'PC1', fontsize=12, color='steelblue', fontweight='bold')
ax.set_xlabel('$x_1$', fontsize=12)
ax.set_ylabel('$x_2$', fontsize=12)
ax.set_title('Dimensionality Reduction', fontsize=14)
ax.legend(fontsize=10)
ax.set_xlim(0, 3)
ax.set_ylim(0, 2.5)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Machine Learning Workflow

**Key Steps:**
1. **Data Collection & Preprocessing:** Gather, clean, normalize data
2. **Feature Engineering:** Select/create informative features
3. **Model Selection & Training:** Choose algorithm, fit to data
4. **Validation:** Test on unseen data, tune hyperparameters
5. **Deployment:** Use model in production

### Machine Learning Workflow Diagram

```
 [Raw Data] --> [Preprocess & Clean] --> [Feature Engineering]
                                                  |
                                                  v
                    [Validate] <-- [Train Model] <-- [Choose Model]
                        |                ^                       
                        |                |  (adjust)             
                        +----------------+                       
                        |
                        v
                     [Deploy]
```

---
## 2. Introducing Scikit-Learn

### What is Scikit-Learn?

**Scikit-Learn** is the most widely used general-purpose machine learning library for Python.

**Key Features:**
- **Consistent API:** All models follow the same interface
- **Comprehensive:** Classification, regression, clustering, dimensionality reduction
- **Well-documented:** Excellent documentation and examples
- **Built on NumPy/SciPy:** Fast and efficient
- **Open-source:** Free and actively maintained

**Installation:**
```bash
pip install scikit-learn
# or
conda install scikit-learn
```

> **Example: Why Scikit-Learn?**  
> A unified interface means the workflow you learn for one model carries over to every other model.

### Data Representation in Scikit-Learn

**Two fundamental data structures:**

| | **Features Matrix: $\mathbf{X}$** | **Target Array: $\mathbf{y}$** |
|---|---|---|
| **Shape** | $[n_{\text{samples}}, n_{\text{features}}]$ | $[n_{\text{samples}}]$ |
| **Content** | Each row = one sample, each column = one feature | Labels (classification) or values (regression) |
| **Type** | 2D NumPy array or pandas DataFrame | 1D NumPy array or pandas Series |
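
A minimal sketch of these two structures, assuming a small hypothetical mix-design table (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical mix-design table: one row per specimen, one column per variable
df = pd.DataFrame({
    "cement_kg_m3": [300, 350, 400, 450],
    "water_cement_ratio": [0.60, 0.50, 0.45, 0.40],
    "age_days": [28, 28, 28, 28],
    "strength_mpa": [25.0, 35.0, 40.0, 45.0],
})

# Features matrix: 2D, shape (n_samples, n_features)
X = df[["cement_kg_m3", "water_cement_ratio", "age_days"]]
# Target array: 1D, shape (n_samples,)
y = df["strength_mpa"]

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)

# A single feature must still be 2D: reshape the 1D array into one column
X_single = df["water_cement_ratio"].to_numpy().reshape(-1, 1)
print(X_single.shape)  # (4, 1)
```

Passing a 1D array as `X` is a common source of errors, which is why the single-feature examples in this notebook use `reshape(-1, 1)` or nested lists.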

### The Estimator API

**All Scikit-Learn models follow the same pattern:**

1. **Choose a model class** and import it
2. **Choose hyperparameters** by instantiating the class
3. **Arrange data** into features matrix $\mathbf{X}$ and target vector $\mathbf{y}$
4. **Fit the model** to your data with `.fit()`
5. **Apply the model** with `.predict()` or `.transform()`

**Universal Interface:**
```python
from sklearn.some_module import SomeModel

# 1. Choose model and hyperparameters
model = SomeModel(hyperparameter1=value1,
                  hyperparameter2=value2)

# 2. Fit to data
model.fit(X, y)

# 3. Predict on new data
predictions = model.predict(X_new)
```
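
The template above ends with `.predict()`. Preprocessing steps follow the same pattern but end with `.transform()` instead. A minimal sketch using `StandardScaler` on hypothetical values (cement content and water-cement ratio):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales: cement (kg/m^3), w/c ratio
X_raw = np.array([[300.0, 0.60],
                  [350.0, 0.50],
                  [400.0, 0.45],
                  [450.0, 0.40]])

# 1. Choose the transformer (no hyperparameters needed here)
scaler = StandardScaler()

# 2. Fit: learn each column's mean and standard deviation
scaler.fit(X_raw)

# 3. Transform: rescale each column to zero mean and unit variance
X_scaled = scaler.transform(X_raw)

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

As with models, the transformer should be fitted on training data only and then applied to the test data.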

### [TOGETHER] Example: Simple Linear Regression (Water-Cement Ratio vs Concrete Strength)

**Problem:** Predict concrete strength from water-cement ratio

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate sample data (water-cement ratio vs strength)
X = np.array([[0.4], [0.45], [0.5], [0.55], [0.6], [0.65]])
y = np.array([45, 40, 35, 30, 25, 20])  # Strength in MPa

# 1. Choose model
model = LinearRegression()

# 2. Fit model
model.fit(X, y)

# 3. Make predictions
X_new = np.array([[0.48], [0.58]])
predictions = model.predict(X_new)

print(f"Predictions: {predictions}")
print(f"Slope: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")

### Visualizing the Regression Line

Let's plot the original data points and the fitted regression line.

In [None]:
# Create a range of values for plotting the regression line
X_line = np.linspace(0.30, 0.70, 100).reshape(-1, 1)
y_line = model.predict(X_line)

# Create the plot
plt.figure(figsize=(10, 6))

# Plot the original data points
plt.scatter(X, y, color='blue', s=100, alpha=0.6, label='Training Data')

# Plot the regression line
plt.plot(X_line, y_line, color='red', linewidth=2, label='Regression Line')

# Plot the predictions
plt.scatter(X_new, predictions, color='green', s=100, marker='s',
            label='Predictions', zorder=5)

# Add labels and title
plt.xlabel('Water-Cement Ratio', fontsize=12)
plt.ylabel('Concrete Strength (MPa)', fontsize=12)
plt.title('Linear Regression: Water-Cement Ratio vs Concrete Strength', fontsize=14)
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

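**Results Interpretation:**
- **Slope:** Change in strength per unit change in water-cement ratio. For this data it is -100 MPa, i.e. about 1 MPa lost for every 0.01 increase in the ratio
- **Intercept:** The fitted strength extrapolated to a water-cement ratio of zero (85 MPa here); it anchors the line and has no physical meaning
- Higher water-cement ratio leads to lower concrete strength (negative slope)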

### [QUICK] Example: Classification with Iris Dataset

**Problem:** Classify iris flowers based on petal/sepal measurements

> **Example: Civil Engineering Analogy**  
> Replace iris measurements with soil properties (grain size, moisture, density) to classify soil types (clay, silt, sand).

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load data
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Split data: 80% training, 20% testing
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42)

# Create and train model (k=3 nearest neighbors)
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train_iris, y_train_iris)

# Evaluate accuracy
accuracy = knn_model.score(X_test_iris, y_test_iris)
print(f"Test Accuracy: {accuracy:.2%}")

### Unsupervised Learning: PCA Example

**Principal Component Analysis (PCA):** Reduce dimensionality while preserving variance

> **Example: Engineering Application**  
> Compress multi-sensor structural health monitoring data, for example from 100 sensor channels down to 5 principal components, while retaining most of the variance (say 95%; the actual figure depends on the data).

In [None]:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load high-dimensional data (4 features)
iris = load_iris()
X_iris_full = iris.data  # Shape: (150, 4)

# Reduce to 2 dimensions for visualization
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_iris_full)  # Shape: (150, 2)

# How much variance is explained?
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total: {sum(pca.explained_variance_ratio_):.2%}")

### Unsupervised Learning: K-Means Clustering

**K-Means:** Group data into $k$ clusters

> **Example: Civil Engineering Application**  
> Cluster bridge inspection data to identify structures with similar damage patterns for targeted maintenance strategies.

In [None]:
from sklearn.cluster import KMeans
import numpy as np

# Sample data: structural damage measurements
X_damage = np.array([[1, 2], [1.5, 1.8], [5, 8],
                     [8, 8], [1, 0.6], [9, 11]])

# Create 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X_damage)

# Get cluster labels
labels = kmeans.labels_
print(f"Cluster assignments: {labels}")

# Get cluster centers
centers = kmeans.cluster_centers_
print(f"Cluster centers:\n{centers}")

---
## 3. Hyperparameters and Model Validation

### Why Model Validation?

**The Fundamental Problem:**
- We want models that **generalize** to new, unseen data
- Simply fitting training data is not enough
- Need to estimate performance on future data

> **Key Insight: Common Mistake -- Evaluating on Training Data**  
> **WRONG:** Evaluate the model on the same data used for training  
> **Result:** Overly optimistic performance estimates

**Solution:** Hold out a separate **test set**

### Train-Test Split Visualization

In [None]:
fig, ax = plt.subplots(figsize=(10, 2))

# Training data rectangle
ax.barh(0, 80, left=0, height=0.5, color='steelblue', alpha=0.4, edgecolor='k')
ax.text(40, 0, 'Training Data (80%)', ha='center', va='center', fontsize=13, fontweight='bold')

# Test data rectangle
ax.barh(0, 20, left=80, height=0.5, color='indianred', alpha=0.4, edgecolor='k')
ax.text(90, 0, 'Test (20%)', ha='center', va='center', fontsize=13, fontweight='bold')

ax.set_xlim(-2, 102)
ax.set_ylim(-0.6, 0.6)
ax.set_xlabel('All Available Data', fontsize=12)
ax.set_yticks([])
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.tight_layout()
plt.show()

### Train-Test Split

**Basic Approach:** Split data into training and testing sets

> **Key Points**  
> - `random_state`: ensures reproducibility  
> - **Never** use test data during training or hyperparameter tuning  
> - Test set estimates performance on unseen data
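
The role of `random_state` can be checked directly. The sketch below uses a toy array (an assumption, not course data) to show that the same seed reproduces the same split and that the two sets do not overlap.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features; targets are just the sample indices
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two splits with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Same seed -> identical train and test arrays
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(f"Identical splits: {same_split}")

# Training and test samples are disjoint
X_tr, X_te, y_tr, y_te = split_a
overlap = set(y_tr.tolist()) & set(y_te.tolist())
print(f"Train size: {len(y_tr)}, test size: {len(y_te)}, overlap: {overlap}")
```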

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Use the iris data for a regression-style demo
# Predict petal width from other measurements
iris = load_iris()
X_demo = iris.data[:, :3]  # sepal length, sepal width, petal length
y_demo = iris.data[:, 3]   # petal width

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Fit on training data only
model_demo = LinearRegression()
model_demo.fit(X_train, y_train)

# Evaluate on test data
train_score = model_demo.score(X_train, y_train)
test_score = model_demo.score(X_test, y_test)

print(f"Training R^2: {train_score:.3f}")
print(f"Test R^2: {test_score:.3f}")

### [TOGETHER] Cross-Validation

**Problem with single train-test split:**
- Performance depends on which samples ended up in test set
- Wastes data (only 80% used for training)

**Solution: K-Fold Cross-Validation**

**Process:**
1. Split data into $k$ equal parts (folds)
2. Train on $k-1$ folds, test on the remaining fold
3. Repeat $k$ times, each fold used as test set once
4. Average the $k$ performance scores
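
The four steps above can be written out by hand with `KFold`, which makes it clear what the convenience functions do internally. This is a sketch on the iris petal-width regression task; the variable names are illustrative, and the folds are shuffled because the iris rows are sorted by species.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

iris = load_iris()
X_all = iris.data[:, :3]  # sepal length, sepal width, petal length
y_all = iris.data[:, 3]   # petal width

# Step 1: split into k folds (shuffled, since the rows are ordered by species)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X_all):
    # Step 2: train on k-1 folds, test on the remaining fold
    fold_model = LinearRegression()
    fold_model.fit(X_all[train_idx], y_all[train_idx])
    fold_scores.append(fold_model.score(X_all[test_idx], y_all[test_idx]))
    # Step 3: the loop repeats this k times, once per fold

# Step 4: average the k scores
fold_scores = np.array(fold_scores)
print(f"Fold R^2 scores: {fold_scores.round(3)}")
print(f"Mean: {fold_scores.mean():.3f}, Std: {fold_scores.std():.3f}")
```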

### K-Fold Cross-Validation Diagram

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))

n_folds = 5
colors_train = 'steelblue'
colors_test = 'indianred'

for i in range(n_folds):
    y_pos = n_folds - 1 - i
    for j in range(n_folds):
        x_start = j * (1.0 / n_folds)
        width = 1.0 / n_folds
        if j == i:
            color = colors_test
            alpha = 0.5
        else:
            color = colors_train
            alpha = 0.4
        rect = plt.Rectangle((x_start, y_pos - 0.35), width, 0.7,
                             facecolor=color, alpha=alpha, edgecolor='k', linewidth=1)
        ax.add_patch(rect)
        if j == i:
            ax.text(x_start + width / 2, y_pos, 'Test', ha='center', va='center',
                    fontsize=10, fontweight='bold')
    ax.text(-0.08, y_pos, f'Fold {i+1}:', ha='right', va='center', fontsize=11)

# Legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=colors_train, alpha=0.4, edgecolor='k', label='Train'),
                   Patch(facecolor=colors_test, alpha=0.5, edgecolor='k', label='Test')]
ax.legend(handles=legend_elements, loc='upper right', fontsize=11)

ax.set_xlim(-0.15, 1.05)
ax.set_ylim(-0.6, n_folds - 0.2)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title('5-Fold Cross-Validation', fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.spines['left'].set_visible(False)

plt.tight_layout()
plt.show()

### K-Fold Cross-Validation in Scikit-Learn

**Advantages:**
- More robust performance estimate
- Uses all data for both training and validation
- Provides variance estimate (standard deviation)

> **Example: Typical Choice**  
> 5-fold or 10-fold cross-validation is standard. Use more folds for small datasets.

In [None]:
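from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Create model
model_cv = LinearRegression()

# The iris rows are sorted by species, so shuffle before splitting;
# an integer cv would use contiguous, unshuffled folds for a regressor
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Perform 5-fold cross-validation on the iris regression task
scores = cross_val_score(model_cv, X_demo, y_demo, cv=cv, scoring='r2')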

print(f"Cross-validation scores: {scores}")
print(f"Mean R^2: {scores.mean():.3f}")
print(f"Std Dev: {scores.std():.3f}")

### Bias-Variance Tradeoff

**Two sources of model error:**

| | **Bias (Underfitting)** | **Variance (Overfitting)** |
|---|---|---|
| **Model** | Too simple | Too complex |
| **Issue** | Cannot capture true pattern | Fits noise in training data |
| **Training error** | High | Low |
| **Test error** | High | High |
| **Example** | Linear model for nonlinear data | High-degree polynomial |

**Goal:** Find the sweet spot with minimum test error!
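
The table can be reproduced numerically. The sketch below assumes synthetic data (a noisy sine curve, not course data) and compares training and test error for polynomial models of increasing degree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): one period of a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)
X = x.reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# degree -> (training MSE, test MSE)
errors = {}
for degree in [1, 4, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    errors[degree] = (
        mean_squared_error(y_tr, model.predict(X_tr)),
        mean_squared_error(y_te, model.predict(X_te)),
    )
    print(f"Degree {degree:2d}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```

The degree-1 model should show high error on both sets (underfitting). A very high degree typically drives training error down while test error stops improving or gets worse, although the exact numbers depend on the random sample.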

### Bias-Variance Tradeoff Visualization

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

x = np.linspace(0.5, 10, 200)

# Training error: decreases with complexity
train_err = 2.5 + 8 * np.exp(-x / 2)
# Test error: decreases then increases (U-shape)
test_err = 2.5 + 8 * np.exp(-x / 2) + 0.5 * x**2 / 8

ax.plot(x, train_err, color='steelblue', linewidth=2.5, label='Training Error')
ax.plot(x, test_err, color='indianred', linewidth=2.5, label='Test Error')

# Mark regions
ax.axvline(x=2.5, color='gray', linestyle=':', alpha=0.5)
ax.axvline(x=6.5, color='gray', linestyle=':', alpha=0.5)
ax.text(1.2, 1.0, 'Underfitting', fontsize=11, ha='center', style='italic')
ax.text(4.5, 1.0, 'Good', fontsize=11, ha='center', style='italic', fontweight='bold')
ax.text(8.2, 1.0, 'Overfitting', fontsize=11, ha='center', style='italic')

ax.set_xlabel('Model Complexity', fontsize=13)
ax.set_ylabel('Error', fontsize=13)
ax.set_title('Bias-Variance Tradeoff', fontsize=14)
ax.legend(fontsize=12)
ax.set_xlim(0, 10.5)
ax.set_ylim(0, 12)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Validation Curves

**Validation Curve:** Plot performance vs. a single hyperparameter

**Purpose:**
- Visualize bias-variance tradeoff
- Select optimal hyperparameter value
- Diagnose under/overfitting
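
A validation curve over polynomial degree can be computed with `validation_curve` by addressing the pipeline step's parameter. This is a sketch on synthetic data (a noisy sine curve, an assumption made for illustration); `polynomialfeatures__degree` is the name `make_pipeline` generates for that step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): one period of a sine curve plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 80)
X = x.reshape(-1, 1)

pipe = make_pipeline(PolynomialFeatures(), LinearRegression())
degree_range = np.arange(1, 11)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One row per degree, one column per fold
train_scores, val_scores = validation_curve(
    pipe, X, y,
    param_name="polynomialfeatures__degree",
    param_range=degree_range,
    cv=cv,
    scoring="r2",
)

val_mean = val_scores.mean(axis=1)
best_degree = int(degree_range[val_mean.argmax()])
print(f"Mean validation R^2 per degree: {val_mean.round(3)}")
print(f"Best degree: {best_degree}")
```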

### Example Validation Curve

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

degrees = np.arange(1, 16)

# Simulated training score (always increases)
train_score_sim = 0.3 + 0.6 * (1 - np.exp(-(degrees - 1) / 3))
# Simulated validation score (increases then decreases)
val_score_sim = 0.3 + 0.5 * (1 - np.exp(-(degrees - 1) / 3)) - 0.05 * (degrees - 5)**2 / 10

ax.plot(degrees, train_score_sim, 'o-', color='steelblue', linewidth=2, markersize=6, label='Training Score')
ax.plot(degrees, val_score_sim, 's-', color='indianred', linewidth=2, markersize=6, label='Validation Score')

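# Optimal line: mark the degree where the simulated validation score peaks
best_degree = degrees[val_score_sim.argmax()]
ax.axvline(x=best_degree, color='green', linestyle='--', linewidth=2, alpha=0.7)
ax.text(best_degree + 0.3, 0.9, 'Optimal', fontsize=12, color='green', fontweight='bold')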

ax.set_xlabel('Polynomial Degree', fontsize=13)
ax.set_ylabel('Score ($R^2$)', fontsize=13)
ax.set_title('Validation Curve', fontsize=14)
ax.legend(fontsize=12)
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Creating Validation Curves in Scikit-Learn

> **Example: Engineering Application**  
> Tune regularization strength when predicting structural response to prevent overfitting to measurement noise.

In [None]:
from sklearn.model_selection import KFold, validation_curve
from sklearn.linear_model import Ridge
import numpy as np

# Test different regularization strengths
param_range = np.logspace(-4, 4, 10)

# Shuffle the folds: the iris rows are sorted by species
cv = KFold(n_splits=5, shuffle=True, random_state=42)

train_scores, val_scores = validation_curve(
    Ridge(), X_demo, y_demo,
    param_name='alpha',
    param_range=param_range,
    cv=cv,
    scoring='r2'
)

# Average across folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

# Find best alpha
best_alpha = param_range[val_mean.argmax()]
print(f"Best alpha: {best_alpha:.4f}")

### Learning Curves

**Learning Curve:** Plot performance vs. training set size

**Purpose:**
- Diagnose whether more data will help
- Identify high bias vs. high variance

**Diagnosis:**
- **Large gap:** High variance $\rightarrow$ more data or regularization
- **Converged low:** High bias $\rightarrow$ more complex model
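
Scikit-Learn computes these curves with `learning_curve`. The sketch below applies it to the iris petal-width regression task; the training sizes are an arbitrary choice, and shuffling is switched on because the iris rows are sorted by species.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, learning_curve

iris = load_iris()
X_lc = iris.data[:, :3]  # sepal length, sepal width, petal length
y_lc = iris.data[:, 3]   # petal width

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Absolute training sizes; each fold leaves 120 samples for training
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X_lc, y_lc,
    train_sizes=[24, 48, 72, 96, 120],
    cv=cv,
    scoring="r2",
    shuffle=True,      # draw each training subset at random
    random_state=0,
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"n = {n:3d}: train R^2 = {tr:.3f}, validation R^2 = {va:.3f}, gap = {tr - va:.3f}")
```

A gap that stays large as `n` grows points to high variance; two curves that meet at a low score point to high bias.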

### Learning Curves: High Variance vs High Bias

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

train_sizes = np.linspace(10, 100, 50)

# High Variance (Overfitting)
ax = axes[0]
# Training score stays high; validation score improves but remains well below it
train_score_hv = 0.95 + 0.05 * np.exp(-train_sizes / 30)
val_score_hv = 0.40 + 0.30 * (1 - np.exp(-train_sizes / 30))
ax.plot(train_sizes, train_score_hv, color='steelblue', linewidth=2.5, label='Training')
ax.plot(train_sizes, val_score_hv, color='indianred', linewidth=2.5, label='Validation')
ax.set_xlabel('Training Set Size', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('High Variance (Overfitting)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim(0.3, 1.05)
ax.text(70, 0.5, 'Large gap\nMore data helps', fontsize=10, ha='center',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
ax.grid(True, alpha=0.3)

# High Bias (Underfitting)
ax = axes[1]
# Training score falls and validation score rises until both level off at a low value
train_score_hb = 0.60 + 0.25 * np.exp(-train_sizes / 20)
val_score_hb = 0.55 - 0.20 * np.exp(-train_sizes / 20)
ax.plot(train_sizes, train_score_hb, color='steelblue', linewidth=2.5, label='Training')
ax.plot(train_sizes, val_score_hb, color='indianred', linewidth=2.5, label='Validation')
ax.set_xlabel('Training Set Size', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('High Bias (Underfitting)', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim(0.3, 1.05)
ax.text(70, 0.8, 'Small gap\nMore complexity\nneeded', fontsize=10, ha='center',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### [PRACTICE] Grid Search for Hyperparameter Tuning

**Problem:** Many models have multiple hyperparameters to tune

**Grid Search:** Try all combinations of hyperparameters
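
Because every combination is tried, the cost grows multiplicatively with each hyperparameter. The sketch below counts the candidates for a grid over `C`, `gamma` and `kernel` using `ParameterGrid`, assuming 5-fold cross-validation. It also shows a list-of-dicts grid that avoids redundant candidates, since `gamma` is ignored by the linear kernel.

```python
from sklearn.model_selection import ParameterGrid

# Full grid: every C x gamma x kernel combination
full_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear'],
}
n_candidates = len(ParameterGrid(full_grid))
n_fits = n_candidates * 5  # one fit per candidate per fold
print(f"Candidates: {n_candidates}, fits with 5-fold CV: {n_fits}")

# Split grid: gamma is only searched for the RBF kernel
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]},
]
n_split = len(ParameterGrid(split_grid))
print(f"Candidates with split grid: {n_split}")
```

`GridSearchCV` accepts either form as its `param_grid` argument.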

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Use iris data for classification
iris = load_iris()
X_grid, y_grid = iris.data, iris.target
X_train_grid, X_test_grid, y_train_grid, y_test_grid = train_test_split(
    X_grid, y_grid, test_size=0.2, random_state=42)

# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'linear']
}

# Create grid search with 5-fold CV
grid = GridSearchCV(SVC(), param_grid, cv=5,
                    scoring='accuracy')

# Fit searches all combinations
grid.fit(X_train_grid, y_train_grid)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")

# Use best model for predictions
best_model = grid.best_estimator_
print(f"Test score: {best_model.score(X_test_grid, y_test_grid):.3f}")

---
## 4. Linear Regression

### Linear Regression: The Foundation

**Goal:** Fit a linear relationship between features and target

**Model:**

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p = w_0 + \sum_{j=1}^{p} w_j x_j$$

Where: $\hat{y}$ = predicted value, $x_j$ = features, $w_j$ = weights, $w_0$ = intercept

**Learning Objective:** Find weights that minimize error

$$\text{minimize} \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - w_0 - \sum_{j=1}^{p} w_j x_{ij}\right)^2$$

> **Example: Civil Engineering**  
> Predict concrete compressive strength from: cement content, water ratio, age, aggregate size, etc.
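
The minimization above is an ordinary least-squares problem, so it can be solved directly with linear algebra. As a sketch, the code below solves it with `np.linalg.lstsq` on the water-cement data from Section 2 and checks that `LinearRegression` returns the same weights.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Water-cement ratio vs strength (MPa), as in the Section 2 example
X = np.array([[0.40], [0.45], [0.50], [0.55], [0.60], [0.65]])
y = np.array([45.0, 40.0, 35.0, 30.0, 25.0, 20.0])

# Prepend a column of ones so that w[0] plays the role of the intercept w_0
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimizes sum_i (y_i - A_i . w)^2
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"lstsq:            w0 = {w[0]:.2f}, w1 = {w[1]:.2f}")

# Scikit-Learn solves the same problem
model = LinearRegression().fit(X, y)
print(f"LinearRegression: w0 = {model.intercept_:.2f}, w1 = {model.coef_[0]:.2f}")
```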

### Simple Linear Regression Example: Age vs Concrete Strength

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Data: Age vs Concrete Strength
age = np.array([3, 7, 14, 28, 56, 90])
strength = np.array([20, 32, 38, 45, 50, 52])

X_age = age.reshape(-1, 1)
y_strength = strength

# Fit model
model_age = LinearRegression()
model_age.fit(X_age, y_strength)

# Coefficients
print(f"Slope: {model_age.coef_[0]:.2f} MPa/day")
print(f"Intercept: {model_age.intercept_:.2f} MPa")
print(f"R^2: {model_age.score(X_age, y_strength):.3f}")

# Predict
age_new = np.array([[21], [42]])
pred = model_age.predict(age_new)
print(f"\nPredicted strength at 21 days: {pred[0]:.1f} MPa")
print(f"Predicted strength at 42 days: {pred[1]:.1f} MPa")

### Linear Regression Fit Visualization

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Data points
ax.scatter(age, strength, c='indianred', s=100, edgecolors='k', zorder=3, label='Measured Data')

# Regression line
x_plot = np.linspace(0, 100, 200).reshape(-1, 1)
y_plot = model_age.predict(x_plot)
ax.plot(x_plot, y_plot, color='steelblue', linewidth=2.5, label='Linear Fit')

ax.set_xlabel('Age (days)', fontsize=13)
ax.set_ylabel('Strength (MPa)', fontsize=13)
ax.set_title('Concrete Age vs Compressive Strength', fontsize=14)
ax.legend(fontsize=12)
ax.set_xlim(0, 100)
ax.set_ylim(0, 60)
ax.grid(True, alpha=0.3)

# Add interpretation
ax.text(60, 15, f'$y = {model_age.coef_[0]:.2f}x + {model_age.intercept_:.2f}$\n$R^2 = {model_age.score(X_age, y_strength):.3f}$',
        fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

**Interpretation:**
- Each day adds approximately 0.30 MPa of strength
- Base strength (intercept): ~29.5 MPa
- $R^2 \approx 0.71$ -- a mediocre fit, because the relationship between age and concrete strength is actually nonlinear (roughly logarithmic): strength rises quickly at early ages and then levels off
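
One way to check the logarithmic behaviour is to regress strength on $\ln(\text{age})$ instead of age. This sketch reuses the six data points from the example above; the log transform is one possible choice, not the only one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as the age vs strength example
age = np.array([3, 7, 14, 28, 56, 90])
strength = np.array([20, 32, 38, 45, 50, 52])

# Linear in age
X_lin = age.reshape(-1, 1)
r2_lin = LinearRegression().fit(X_lin, strength).score(X_lin, strength)

# Linear in ln(age): the model itself is unchanged, only the feature is transformed
X_log = np.log(age).reshape(-1, 1)
model_log = LinearRegression().fit(X_log, strength)
r2_log = model_log.score(X_log, strength)

print(f"R^2 with age:     {r2_lin:.3f}")
print(f"R^2 with ln(age): {r2_log:.3f}")
print(f"Strength gain per unit of ln(age): {model_log.coef_[0]:.2f} MPa")
```

The transformed feature fits these points much more closely, which is a simple example of feature engineering with a linear model.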

### Polynomial Regression

**Idea:** Use polynomial features to fit nonlinear relationships

**Transform:**

$$x \rightarrow [x, x^2, x^3, \ldots, x^d]$$

Then apply linear regression: $\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d$

> **Key Insight: Warning**  
> Higher degree $\rightarrow$ more flexibility $\rightarrow$ risk of overfitting!

### [LIVE] Polynomial Regression: Degree Comparison

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Sample data
np.random.seed(42)
x_data = np.array([-2, -1, 0, 1, 2])
y_data = np.array([3.8, 1.2, 0.1, 1.8, 6.2])
X_poly_data = x_data.reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
x_fine = np.linspace(-2.5, 2.5, 200).reshape(-1, 1)

degrees = [1, 2, 5]
titles = ['Degree 1 (Linear)', 'Degree 2 (Quadratic)', 'Degree 5 (Overfit Risk)']

for ax, deg, title in zip(axes, degrees, titles):
    # Create and fit pipeline
    pipe = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    pipe.fit(X_poly_data, y_data)
    y_fine = pipe.predict(x_fine)
    r2 = pipe.score(X_poly_data, y_data)

    ax.scatter(x_data, y_data, c='indianred', s=80, edgecolors='k', zorder=3, label='Data')
    ax.plot(x_fine, y_fine, 'steelblue', linewidth=2, label=f'Fit ($R^2$={r2:.3f})')
    ax.set_xlabel('$x$', fontsize=12)
    ax.set_ylabel('$y$', fontsize=12)
    ax.set_title(title, fontsize=13)
    ax.legend(fontsize=10)
    ax.set_xlim(-3, 3)
    ax.set_ylim(-2, 8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Polynomial Features in Scikit-Learn

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Original data
X_poly = np.array([[x] for x in range(10)])
y_poly = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Create polynomial regression pipeline
# Degree 3: [1, x, x^2, x^3]
poly_model = make_pipeline(
    PolynomialFeatures(degree=3),
    LinearRegression()
)

# Fit and predict
poly_model.fit(X_poly, y_poly)
y_pred_poly = poly_model.predict(X_poly)

# Evaluate
r2 = poly_model.score(X_poly, y_poly)
print(f"R^2 score: {r2:.3f}")
print(f"\nPipeline chains transformations automatically!")
print(f"PolynomialFeatures(degree=3) transforms x -> [1, x, x^2, x^3]")
print(f"Then LinearRegression fits the transformed features.")

### Regularization: Controlling Complexity

**Problem:** Complex models overfit to training data

**Solution:** Add penalty for large coefficients

| | **Ridge Regression (L2)** | **Lasso Regression (L1)** |
|---|---|---|
| **Objective** | $\text{minimize} \sum (y_i - \hat{y}_i)^2 + \alpha \sum w_j^2$ | $\text{minimize} \sum (y_i - \hat{y}_i)^2 + \alpha \sum \lvert w_j \rvert$ |
| **Effect** | Shrinks coefficients | Shrinks some to exactly zero |
| **Features** | Keeps all features | Performs feature selection |
| **Parameter** | $\alpha$: regularization strength | $\alpha$: regularization strength |

**Key:** Larger $\alpha$ $\rightarrow$ stronger regularization $\rightarrow$ simpler model
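
Both effects in the table can be observed on a small synthetic problem. The data below is an assumption made for illustration: three features, the third of which has a true weight of zero. Scikit-Learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` values are not directly comparable to those of `Ridge`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): the third feature does not influence y
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
true_w = np.array([3.0, -2.0, 0.0])
y_syn = X_syn @ true_w + rng.normal(0, 0.5, 200)

# Ridge: the coefficient vector shrinks steadily as alpha grows
ridge_norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=alpha).fit(X_syn, y_syn)
    ridge_norms.append(float(np.linalg.norm(ridge.coef_)))
    print(f"Ridge alpha = {alpha:>8}: coef = {ridge.coef_.round(3)}")

# Ridge keeps a small nonzero weight on the irrelevant feature
ridge_one = Ridge(alpha=1.0).fit(X_syn, y_syn)

# Lasso: the irrelevant feature is removed entirely
lasso = Lasso(alpha=0.5).fit(X_syn, y_syn)
print(f"Lasso alpha = 0.5: coef = {lasso.coef_.round(3)}")
```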

### Effect of Regularization on Coefficients

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

alpha_range = np.logspace(-2, 2, 100)

# Simulated coefficient paths
ridge_w1 = 5 / (1 + 0.2 * alpha_range)
ridge_w2 = 3 / (1 + 0.2 * alpha_range)
lasso_w1 = np.maximum(0, 5 - 0.5 * alpha_range)

ax.plot(alpha_range, ridge_w1, color='steelblue', linewidth=2.5, label='Ridge: $w_1$')
ax.plot(alpha_range, ridge_w2, color='green', linewidth=2.5, label='Ridge: $w_2$')
ax.plot(alpha_range, lasso_w1, color='indianred', linewidth=2.5, linestyle='--', label='Lasso: $w_1$')

ax.set_xscale('log')
ax.set_xlabel('Regularization $\\alpha$', fontsize=13)
ax.set_ylabel('Coefficient Value', fontsize=13)
ax.set_title('Effect of Regularization on Coefficients', fontsize=14)
ax.legend(fontsize=12)
ax.set_ylim(0, 6)
ax.grid(True, alpha=0.3)

# Annotations
ax.annotate('Lasso drives\ncoefficient to 0', xy=(10, 0.05), fontsize=10,
            xytext=(20, 1.5), arrowprops=dict(arrowstyle='->', color='indianred'),
            color='indianred', fontweight='bold')

plt.tight_layout()
plt.show()

### Ridge and Lasso in Scikit-Learn

> **Example: When to Use Which?**  
> **Ridge:** When all features are potentially relevant  
> **Lasso:** When you want automatic feature selection

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Use iris regression task (predict petal width from other features)
iris = load_iris()
X_reg = iris.data[:, :3]
y_reg = iris.data[:, 3]
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42)

# Ridge Regression (L2)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_reg, y_train_reg)
ridge_score = ridge.score(X_test_reg, y_test_reg)

# Lasso Regression (L1)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_reg, y_train_reg)
lasso_score = lasso.score(X_test_reg, y_test_reg)

# Compare coefficients
print(f"Ridge R^2: {ridge_score:.3f}")
print(f"Ridge coefficients: {ridge.coef_}")
print(f"\nLasso R^2: {lasso_score:.3f}")
print(f"Lasso coefficients: {lasso.coef_}")
print(f"Non-zero Lasso features: {np.sum(lasso.coef_ != 0)}")

### Real-World Example: Bicycle Traffic Prediction

**Problem:** Predict daily bicycle traffic on Seattle's Fremont Bridge

**Features:**
- Temperature, precipitation
- Day of week, month
- Holiday indicator
- Hour of day (if hourly data)

**Approach:**
1. Feature engineering: add polynomial features for temperature
2. Add interaction terms (e.g., temp x weekend)
3. Use Ridge regression to prevent overfitting
4. Validate with cross-validation
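
The four steps can be sketched as a single pipeline. The data below is synthetic, generated from an invented formula, and is not the real Fremont Bridge dataset. The feature names, coefficients and noise level are all assumptions. `PolynomialFeatures(degree=2)` produces both the squared temperature term and the pairwise interaction terms such as temperature x weekend.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# SYNTHETIC stand-in data: not the real Fremont Bridge counts
rng = np.random.default_rng(0)
n_days = 365
temp_c = rng.uniform(0, 30, n_days)
precip_mm = rng.exponential(2.0, n_days)
is_weekend = (np.arange(n_days) % 7 >= 5).astype(float)

# Invented relationship, used only so the example has a signal to recover
riders = (2000 + 80 * temp_c - 1.5 * temp_c**2 - 60 * precip_mm
          - 800 * is_weekend + 20 * temp_c * is_weekend
          + rng.normal(0, 150, n_days))

X_bike = pd.DataFrame({
    "temp_c": temp_c,
    "precip_mm": precip_mm,
    "is_weekend": is_weekend,
})

# Steps 1-3: polynomial and interaction features, scaling, Ridge regression
bike_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# Step 4: validate with shuffled 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=0)
bike_scores = cross_val_score(bike_model, X_bike, riders, cv=cv, scoring="r2")
print(f"Mean CV R^2: {bike_scores.mean():.3f} (+/- {bike_scores.std():.3f})")
```

With real daily counts, shuffled folds ignore the time ordering of the data, so a time-aware split may be more appropriate.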

---
## 5. Summary and Next Steps

### Key Takeaways

**1. Machine Learning Fundamentals**
- Supervised (classification, regression) vs Unsupervised (clustering, dim reduction)
- Data-driven approach to building predictive models

**2. Scikit-Learn Workflow**
- Consistent API: `fit()`, `predict()`, `transform()`
- Data representation: features matrix $\mathbf{X}$, target vector $\mathbf{y}$

**3. Model Validation**
- Never judge a model by its score on the training data
- Use train-test split or cross-validation
- Understand bias-variance tradeoff

**4. Linear Regression**
- Foundation for many ML algorithms
- Polynomial features for nonlinearity
- Regularization (Ridge/Lasso) prevents overfitting

### Civil Engineering Applications

**Machine Learning is transforming civil engineering:**

1. **Structural Health Monitoring**
   - Classify damage types from sensor data
   - Predict remaining service life

2. **Material Science**
   - Predict concrete/steel properties from composition
   - Optimize mix designs

3. **Traffic & Transportation**
   - Traffic flow prediction and optimization
   - Route planning and demand forecasting

4. **Construction Management**
   - Project cost and duration estimation
   - Risk assessment and safety prediction

5. **Environmental Engineering**
   - Water quality prediction
   - Climate impact assessment

### Next Steps in Machine Learning

**Coming in Week 7:**
- **Naive Bayes:** Probabilistic classification
- **Support Vector Machines:** Maximum-margin classifiers
- **Decision Trees & Random Forests:** Ensemble methods
- **Clustering:** K-Means, hierarchical clustering
- **Dimensionality Reduction:** PCA deep dive

**Practice Resources:**
- **Scikit-Learn Documentation:** https://scikit-learn.org
- **Kaggle:** Real-world datasets and competitions
- **Course Notebooks:** Hands-on examples in repository

---

### Questions?

**Dr. Eyuphan Koc**  
eyuphan.koc@bogazici.edu.tr  

*Office Hours: By appointment*

**Next Lecture:** Advanced ML Algorithms (Naive Bayes, SVM, Random Forests)