<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/1_Lab_StatisticalML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab: Statistical ML
Using linear regression as the sample model, this first lab is a warm-up on:
* coding,
* data handling,
* the SGD optimizer,
* model fitting: splitting data into train/test sets and batching,
* regularization.

**Review**

* [Overview of linear regression for prediction](https://developers.google.com/machine-learning/crash-course/linear-regression)
    * [Model architecture](https://developers.google.com/machine-learning/crash-course/linear-regression): linear equation
    $\hat{y} = b + w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_N x_N$
    * [Model fitting or Objective function](https://developers.google.com/machine-learning/crash-course/linear-regression/loss): minimize "Loss" between actual values ($y$) and predicted values ($\hat{y}$)
    $$\min_{\mathbf{w}} L(\mathbf{w})$$

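    $$L(\mathbf{w}) = \sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2$$

    Here the bias $b$ is absorbed into $\mathbf{w}$ by appending a constant feature equal to 1 to each $\mathbf{x}_i$, so that $\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i$.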

    * [Optimizer (solution method)](https://colab.research.google.com/drive/1pPB_YTQ93pXyXctHPP-TMBN5woWJvV6J#scrollTo=ZbhdrHKpaeAl) or [Link 2](https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent): stochastic gradient descent
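
The loss above and its gradient can be sketched in NumPy as follows. This is a minimal illustration, not code from the linked tutorials: the synthetic data, the `true_w` values, and the function names are all assumptions made for the example, and the bias $b$ is handled by prepending a column of ones to the feature matrix.

```python
import numpy as np

def predict(X, w):
    """Linear model y_hat = X @ w; the bias b is the weight on the constant-1 column."""
    return X @ w

def squared_error_loss(X, y, w):
    """L(w) = sum_i (y_i - w^T x_i)^2."""
    residual = y - predict(X, w)
    return float(residual @ residual)

def loss_gradient(X, y, w):
    """Gradient of L with respect to w: -2 * X^T (y - X w)."""
    return -2.0 * X.T @ (y - predict(X, w))

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])  # prepend the constant column for b
true_w = np.array([4.0, 3.0, -2.0])          # hypothetical [b, w_1, w_2]
y = X @ true_w + 0.1 * rng.normal(size=50)

w = np.zeros(3)
print("loss at w = 0:", squared_error_loss(X, y, w))
print("gradient at w = 0:", loss_gradient(X, y, w))
```

A gradient-based optimizer repeatedly moves `w` a small step in the direction opposite to `loss_gradient`, which is what lowers the loss.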



**Review**
* Data processing overview: [Google ML crash course on Data](https://developers.google.com/machine-learning/crash-course/numerical-data)
  1. Visualize the data: review data statistics, check for outliers, check for bad data
  * [Visualize the data](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/numerical_data_stats.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=numerical_data_stats#scrollTo=HYn5jBq2_Onh)
  * [Modify bad data](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/numerical_data_bad_values.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=numerical_data_bad_values#scrollTo=RlPvUoDDsLoT)
  2. [Data preparation](https://developers.google.com/machine-learning/crash-course/numerical-data/normalization): normalize (min-max scaling, z-score normalization, log transformation, clipping)
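
The four normalization techniques listed above can be sketched with plain NumPy. This is an illustrative sketch under assumptions: the helper names and the sample array `x` are made up for the example, and `log_transform` uses `log(1 + x)`, which assumes non-negative inputs.

```python
import numpy as np

def min_max_scale(x):
    """Rescale values linearly to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

def log_transform(x):
    """Compress a long right tail; log(1 + x) assumes x >= 0."""
    return np.log1p(x)

def clip(x, lower, upper):
    """Cap extreme values at the given bounds."""
    return np.clip(x, lower, upper)

# Hypothetical feature with one outlier
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

print("min-max:", min_max_scale(x))
print("z-score:", z_score(x))
print("log:    ", log_transform(x))
print("clipped:", clip(x, 0.0, 10.0))
```

When the data is split into train and test sets, the statistics (minimum, maximum, mean, standard deviation) are usually computed on the training set only and then reused to transform the test set.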

**Review**
* [Stochastic gradient descent Optimizer](https://colab.research.google.com/drive/1pPB_YTQ93pXyXctHPP-TMBN5woWJvV6J#scrollTo=ZbhdrHKpaeAl) or [Link 2](https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent)
  * Review the SGD algorithm
  * Review data batching
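
As a companion to the review items above, here is a minimal sketch of mini-batch SGD for linear regression, written in plain NumPy. It is not the code from the linked notebooks. The function name, the hyperparameter values, and the synthetic data are assumptions chosen for illustration, and the gradient is averaged over each batch (mean squared error) so the step size does not depend on the batch size.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.02, batch_size=16, epochs=300, seed=0):
    """Fit y ~ X @ w + b by mini-batch SGD on the mean squared error."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]  # indices of the current batch
            X_batch, y_batch = X[idx], y[idx]
            error = X_batch @ w + b - y_batch
            grad_w = 2.0 * X_batch.T @ error / len(idx)  # d(MSE)/dw on this batch
            grad_b = 2.0 * error.mean()                  # d(MSE)/db on this batch
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# Synthetic data (assumed): y = 4 + 3 x + noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=100)

w, b = minibatch_sgd(X, y)
print("w:", w, "b:", b)
```

With a constant learning rate the estimates keep fluctuating slightly around the least-squares solution; a smaller learning rate or a larger batch reduces that fluctuation at the cost of slower progress per step.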

**Review codes**
* [Predict fuel efficiency - TensorFlow](https://www.tensorflow.org/guide/core/quickstart_core)
  * Does not use built-in functions for the regression model or SGD
* [Linear regression_Taxi_Analysis steps with programming guide](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/cc/exercises/linear_regression_taxi.ipynb?utm_source=mlcc&utm_campaign=colab-external&utm_medium=referral&utm_content=linear_regression#scrollTo=W6a7dtcCob-n)
  * This has more detailed data visualization and processing, and hyperparameter tuning exercises. It does not use built-in packages either.


**Review**
* [Regularization techniques (to avoid overfitting)](https://aunnnn.github.io/ml-tutorial/html/blog_content/linear_regression/linear_regression_regularized.html): Ridge, Lasso, Elastic Net
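
The three techniques differ only in the penalty added to the squared-error loss: Ridge adds a multiple of $\sum_j w_j^2$, Lasso adds a multiple of $\sum_j |w_j|$, and Elastic Net adds both. A minimal sketch follows. The function name and the parameters `l1` and `l2` are hypothetical, the bias is left unpenalized by a common convention, and libraries may scale the terms differently (scikit-learn's `Lasso`, for example, divides the squared-error term by twice the number of samples).

```python
import numpy as np

def regularized_loss(X, y, w, b, l1=0.0, l2=0.0):
    """Squared-error loss plus an L1 and/or L2 penalty on the weights.

    l1 > 0, l2 = 0 -> Lasso
    l1 = 0, l2 > 0 -> Ridge
    l1 > 0, l2 > 0 -> Elastic Net
    The bias b is not penalized (an assumed convention).
    """
    residual = y - (X @ w + b)
    data_term = np.sum(residual ** 2)
    penalty = l1 * np.sum(np.abs(w)) + l2 * np.sum(w ** 2)
    return data_term + penalty

# Tiny hand-made example (hypothetical values)
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 0.0])
w = np.array([1.0, -1.0])
b = 0.5

print("no penalty: ", regularized_loss(X, y, w, b))
print("lasso:      ", regularized_loss(X, y, w, b, l1=0.1))
print("ridge:      ", regularized_loss(X, y, w, b, l2=0.1))
print("elastic net:", regularized_loss(X, y, w, b, l1=0.1, l2=0.1))
```

Because the penalty grows with the size of the weights, minimizing this loss trades a slightly worse fit on the training data for smaller weights.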

**Programming exercise**
* Modify the [Predict fuel efficiency - TensorFlow](https://www.tensorflow.org/guide/core/quickstart_core) code to add regularizations

**FYI: Statistical package**
* scikit-learn is the package typically used for statistical machine learning. It has built-in regression functions.
* [Sample linear regression code](https://scikit-learn.org/1.5/auto_examples/linear_model/plot_ols.html)

**Deep learning packages**
* Keras, PyTorch, and TensorFlow are used for deep learning.

In [None]:
## AI Generated
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Generate some sample data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = (4 + 3 * X + np.random.randn(100, 1)).ravel()  # 1-D target so coef_ has the same shape for every model

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (important for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 1. Linear Regression without Regularization
print("--- Linear Regression (No Regularization) ---")
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)
y_pred_lin_reg = lin_reg.predict(X_test_scaled)
rmse_lin_reg = np.sqrt(mean_squared_error(y_test, y_pred_lin_reg))
print(f"RMSE: {rmse_lin_reg:.2f}")
print(f"Coefficients: {lin_reg.coef_}")

# 2. Lasso Regression (L1 Regularization)
print("\n--- Lasso Regression (L1 Regularization) ---")
# alpha is the regularization strength; higher alpha means stronger regularization
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train_scaled, y_train)
y_pred_lasso = lasso_reg.predict(X_test_scaled)
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
print(f"RMSE: {rmse_lasso:.2f}")
print(f"Coefficients: {lasso_reg.coef_}")

# 3. Ridge Regression (L2 Regularization)
print("\n--- Ridge Regression (L2 Regularization) ---")
# alpha is the regularization strength; higher alpha means stronger regularization
ridge_reg = Ridge(alpha=1.0)
ridge_reg.fit(X_train_scaled, y_train)
y_pred_ridge = ridge_reg.predict(X_test_scaled)
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
print(f"RMSE: {rmse_ridge:.2f}")
print(f"Coefficients: {ridge_reg.coef_}")