This activity uses `SGDRegressor` on a built‑in dataset to show why **scaling your features is crucial** for gradient‑descent‑based methods. [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)

### Goal of the activity

- Use `SGDRegressor` to model a real regression problem.  
- Compare performance **with and without** feature scaling.  
- Observe how scaling affects convergence and model quality.

We’ll use the **Diabetes** regression dataset built into scikit‑learn. [codecademy](https://www.codecademy.com/article/linear-regression-with-scikit-learn-a-step-by-step-guide-using-python)

### Step 1: Load the dataset and do a basic train/test split

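Note: `load_diabetes()` returns features that are already mean‑centered and rescaled so that each column's sum of squares is 1. All ten features therefore share one scale, so adding `StandardScaler` may shift the scores only slightly on this dataset. The effect is much larger on data whose features span different ranges, because `SGDRegressor` is very sensitive to scale. [scikit-learn](https://scikit-learn.org/stable/modules/sgd.html)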

In [6]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import warnings
# Silence all warnings for cleaner output (this also hides any ConvergenceWarning)
warnings.simplefilter(action='ignore')

# Load diabetes dataset as numpy arrays
diabetes = load_diabetes()           # built-in dataset
X = diabetes.data                    # shape (442, 10); already centered and rescaled, but treated as raw here
y = diabetes.target                  # disease progression measure

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
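
Before fitting anything, it can help to check what "already scaled" means for this dataset. The sketch below is an addition to the activity, not part of the original notebook. It assumes the default `load_diabetes()` behaviour described in the scikit-learn docs (mean-centered columns whose sum of squares is 1) and prints per-column statistics so you can confirm it on your installed version.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Load the features exactly as Step 1 does (default loader settings).
X = load_diabetes().data

col_means = X.mean(axis=0)           # expected to be close to 0 for every column
col_sq_norms = (X ** 2).sum(axis=0)  # expected to be close to 1 for every column
col_stds = X.std(axis=0)             # identical across columns if the norms match

print("column means    :", np.round(col_means, 6))
print("sum of squares  :", np.round(col_sq_norms, 6))
print("column std devs :", np.round(col_stds, 4))
```

If every column reports the same standard deviation, `StandardScaler` only multiplies all features by one common factor. That changes the effective step size of SGD but not the balance between features.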

### Step 2: Fit `SGDRegressor` **without** additional scaling

What to look for:

- Performance may be **unstable or mediocre** (e.g., poor R², relatively large MSE).  
- Coefficients may be noisy or show signs that the optimizer struggled to converge, depending on the random state and learning rate. [sdsawtelle.github](https://sdsawtelle.github.io/blog/output/week2-andrew-ng-machine-learning-with-python.html)


In [7]:
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score

sgd_plain = SGDRegressor(
    loss="squared_error",
    max_iter=5000,
    tol=1e-3,
    random_state=42
)

sgd_plain.fit(X_train, y_train)

y_pred_plain = sgd_plain.predict(X_test)
mse_plain = mean_squared_error(y_test, y_pred_plain)
r2_plain = r2_score(y_test, y_pred_plain)

print("Without extra scaling:")
print("  MSE:", mse_plain)
print("  R^2:", r2_plain)
print("  Coefficients:", sgd_plain.coef_)

Without extra scaling:
  MSE: 2867.9671711478036
  R^2: 0.4586853477817451
  Coefficients: [  48.54964032 -154.7957172   447.94627464  295.81644927  -41.55911569
  -87.92352684 -204.32972962  145.29889217  337.15113445  135.06416597]


### Step 3: Fit `SGDRegressor` **with** proper scaling via a pipeline

The scikit‑learn docs emphasize: “Always scale the input. The most convenient way is to use a pipeline.” [scikit-learn](https://scikit-learn.org/1.0/modules/generated/sklearn.linear_model.SGDRegressor.html)

What to look for:
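- On features with very different scales, expect **better R²** and **lower MSE** than the unscaled version. The diabetes features already share one scale, so the two runs can land close together, as the outputs recorded in this activity show (R² ≈ 0.459 without the scaler, ≈ 0.456 with it).  
- More stable training across different runs and hyperparameters. [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)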


In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

sgd_scaled = make_pipeline(
    StandardScaler(),              # standardize features: mean 0, variance 1
    SGDRegressor(
        loss="squared_error",
        max_iter=5000,
        tol=1e-3,
        random_state=42
    )
)

sgd_scaled.fit(X_train, y_train)

y_pred_scaled = sgd_scaled.predict(X_test)
mse_scaled = mean_squared_error(y_test, y_pred_scaled)
r2_scaled = r2_score(y_test, y_pred_scaled)

print("\nWith StandardScaler in pipeline:")
print("  MSE:", mse_scaled)
print("  R^2:", r2_scaled)


With StandardScaler in pipeline:
  MSE: 2883.720046438128
  R^2: 0.4557120702996995


### Step 4: Discussion prompts for learning

Use these questions to reflect (or as written questions in a worksheet):

1. **Why does scaling matter for SGD?**  
   - In gradient descent, the step size in each direction depends on both the gradient and the learning rate. When features have very different scales, one feature can dominate the gradient, making it hard to choose a single learning rate that works well for all parameters. Scaling puts features on similar numerical ranges, so the optimizer progresses more evenly in every dimension. [scikit-learn](https://scikit-learn.org/stable/modules/sgd.html)

2. **What did you observe about performance before and after scaling?**  
   - Compare the MSE and R² values. Did scaling improve test performance? Did it affect stability or, once you remove the warnings filter from Step 1, any convergence warnings?

3. **Why use a `Pipeline` rather than calling `StandardScaler` manually?**  
   - A pipeline ensures that exactly the **same scaling learned from the training data** is applied to the test data, preventing data leakage and making your workflow more robust and concise. [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)

4. **How is this tied to stochastic gradient descent concepts?**  
   - `SGDRegressor` is a practical implementation of SGD for linear regression. It:
     - Updates the weights after each individual training sample rather than after a pass over the full dataset.  
     - Is sensitive to the scale of features.  
     - Benefits strongly from preprocessing like standardization.
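
The first prompt's argument can be made concrete with a single sample. The following is a minimal NumPy sketch, separate from scikit-learn's internals: `squared_loss_gradient` is a hypothetical helper, and the two-feature sample is invented for illustration. For the loss 0.5 * (y - w·x)², the gradient with respect to w is -(y - w·x) * x, so each component is proportional to its feature's magnitude.

```python
import numpy as np


def squared_loss_gradient(w, x, y):
    """Gradient of 0.5 * (y - w.x)^2 with respect to w for one sample."""
    residual = y - w @ x
    return -residual * x


w = np.zeros(2)  # start from zero weights
y = 1.0          # target for this sample

x_raw = np.array([1.0, 1000.0])  # second feature is 1000x larger than the first
x_scaled = np.array([1.0, 1.0])  # same sample with both features on one scale

grad_raw = squared_loss_gradient(w, x_raw, y)
grad_scaled = squared_loss_gradient(w, x_scaled, y)

print("gradient, raw features   :", grad_raw)     # [-1. -1000.]
print("gradient, scaled features:", grad_scaled)  # [-1. -1.]
```

With the raw features, a learning rate small enough to keep the second weight stable moves the first weight 1000 times more slowly. With scaled features, one learning rate suits both.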