
---

**Module 7: Introduction to Regression Modeling with Scikit-Learn**

This module introduces fundamental regression models available in the Scikit-learn library. Regression analysis is a statistical method used to model the relationship between a dependent (target) variable and one or more independent (feature) variables. The goal is typically to predict a continuous outcome. We'll cover the concepts, Scikit-learn implementation, and how to interpret the model coefficients for Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, and Ridge Regression.

**7.1 Overview of Scikit-learn for Machine Learning**

Scikit-learn is a cornerstone library for machine learning in Python, offering a vast array of tools for both supervised (like regression and classification) and unsupervised (like clustering and dimensionality reduction) learning. It's known for its clean, consistent API and comprehensive documentation. *[13]*

* **Key Concepts in Scikit-learn's API:**
    * **Estimators:** The core objects in Scikit-learn. Any object that can learn from data (either to predict, classify, transform, or cluster) is an estimator.
        * All estimators implement a `fit(X, y)` method (for supervised learning) or `fit(X)` (for unsupervised learning) to learn from the training data. `X` represents the features and `y` the target variable. *[13]*
        * Supervised estimators (like regressors and classifiers) typically have a `predict(X_new)` method to make predictions on new, unseen data.
        * Most supervised estimators also have a `score(X, y)` method to evaluate performance; for regressors it returns R-squared by default.
    * **Transformers:** Estimators that can also transform data. Examples include scalers (`StandardScaler`, `MinMaxScaler`), encoders (`OneHotEncoder`), and feature extractors/creators (`PolynomialFeatures`).
        * Transformers have a `transform(X)` method to apply the learned transformation.
        * They often have a `fit_transform(X, y=None)` method, which is a convenient shortcut for fitting and then transforming the same data. *[13]*
    * **Pipelines (`sklearn.pipeline.Pipeline`):**
        * Allow chaining multiple steps (e.g., a scaler followed by a feature creator, then a regressor) into a single estimator object.
        * This simplifies workflows, makes code cleaner, and is crucial for preventing data leakage (e.g., ensuring transformations are learned only on training data and consistently applied). *[13]*
        * A pipeline itself has `fit()`, `predict()`, and `score()` methods.
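
The API pieces above can be sketched in a few lines. This is a minimal illustration on made-up toy data (the arrays and the names `scaler_demo` and `pipe_demo` are hypothetical, not from the course material); it shows that `fit_transform` matches `fit` followed by `transform`, and that a `Pipeline` exposes the same `fit`/`predict`/`score` interface as a single estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 6 samples, 2 features on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 150.0], [3.0, 130.0],
                   [4.0, 210.0], [5.0, 190.0], [6.0, 260.0]])
y_demo = np.array([3.0, 5.0, 6.0, 9.0, 10.0, 13.0])

# Transformer: fit() learns the column means/stds, transform() applies them
scaler_demo = StandardScaler()
X_two_step = scaler_demo.fit(X_demo).transform(X_demo)
X_one_step = StandardScaler().fit_transform(X_demo)  # shortcut for the two calls above
print("fit_transform matches fit + transform:", np.allclose(X_two_step, X_one_step))

# Pipeline: scaler and regressor chained into one estimator object
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
pipe_demo.fit(X_demo, y_demo)            # fits the scaler, then the regressor
preds_demo = pipe_demo.predict(X_demo)   # reuses the stored means/stds before predicting
print("Pipeline R^2 on the toy data:", pipe_demo.score(X_demo, y_demo))
```

Because the scaler's statistics are learned inside `pipe_demo.fit`, calling `predict` on new rows never re-estimates them, which is how a pipeline helps avoid leakage.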

* **General Workflow for Supervised Learning in Scikit-learn:**
    1.  **Import Model:** Import the desired model class (e.g., `from sklearn.linear_model import LinearRegression`).
    2.  **Prepare Data:**
        * Separate features (`X`) and target (`y`).
        * Split data into training and testing sets (e.g., using `train_test_split` from `sklearn.model_selection`).
        * Perform necessary preprocessing (scaling, encoding) – often done within a Pipeline.
    3.  **Instantiate Model:** Create an instance of the model, specifying any hyperparameters (e.g., `model = LinearRegression()`). Hyperparameters are settings that are not learned from the data directly but are set before training.
    4.  **Fit Model:** Train the model on the **training data**: `model.fit(X_train, y_train)`. *[90]*
    5.  **Make Predictions:** Use the trained model to make predictions on **new data** (typically the test set): `y_pred = model.predict(X_test)`. *[90]*
    6.  **Evaluate Model:** Assess the model's performance using appropriate metrics (e.g., Mean Squared Error, R-squared for regression).
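
The six steps can be put together as one short script. The sketch below uses synthetic data invented for illustration (the `_wf` names and the true coefficients are assumptions), and evaluates with `mean_squared_error` and `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression           # Step 1: import the model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 2: prepare data (synthetic, purely illustrative)
rng = np.random.default_rng(0)
X_wf = rng.uniform(0, 10, size=(80, 2))
y_wf = 4.0 + 1.5 * X_wf[:, 0] - 2.0 * X_wf[:, 1] + rng.normal(0, 1.0, size=80)
X_wf_train, X_wf_test, y_wf_train, y_wf_test = train_test_split(
    X_wf, y_wf, test_size=0.25, random_state=42
)

# Step 3: instantiate (LinearRegression needs no hyperparameters here)
wf_model = LinearRegression()

# Step 4: fit on the training data only
wf_model.fit(X_wf_train, y_wf_train)

# Step 5: predict on the held-out test set
y_wf_pred = wf_model.predict(X_wf_test)

# Step 6: evaluate with regression metrics
wf_mse = mean_squared_error(y_wf_test, y_wf_pred)
wf_r2 = r2_score(y_wf_test, y_wf_pred)
print(f"Test MSE: {wf_mse:.3f}")
print(f"Test R^2: {wf_r2:.3f}")
```

For a regressor, `wf_model.score(X_wf_test, y_wf_test)` returns the same R-squared value as `r2_score`.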

**7.2 Simple Linear Regression (SLR)**

Simple Linear Regression (SLR) models the linear relationship between a **single independent variable** (feature, `x`) and a **continuous dependent variable** (target, `y`). It aims to find the best-fitting straight line through the data points. *[90]*

* **Concept and Equation:**
    The relationship is described by the equation:
    $$y = \beta_0 + \beta_1 x + \epsilon$$
    Where: *[90]*
    * `y`: Dependent variable (what we want to predict).
    * `x`: Independent variable (the predictor).
    * `\beta_0` (beta-zero): **Intercept** – the predicted value of `y` when `x` is 0. It's where the regression line crosses the y-axis.
    * `\beta_1` (beta-one): **Slope** (or coefficient for `x`) – the change in the predicted value of `y` for a one-unit increase in `x`. It represents the steepness and direction of the line.
    * `\epsilon` (epsilon): **Error term** – the difference between the observed `y` value and the value given by the line `\beta_0 + \beta_1 x`. It accounts for random variability or factors not included in the model.
    The goal of SLR is to find the estimates `\hat{\beta_0}` and `\hat{\beta_1}` that minimize the sum of squared residuals, where each residual is the difference between an observed `y` and its fitted value `\hat{y} = \hat{\beta_0} + \hat{\beta_1} x`. This is the Ordinary Least Squares (OLS) method.
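
For a single feature, the least-squares estimates have a well-known closed form: the slope is the sum of products of the centered `x` and `y` values divided by the sum of squared centered `x` values, and the intercept is `mean(y) - slope * mean(x)`. The sketch below (synthetic data, hypothetical variable names) computes both by hand and checks them against `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 5 + 2.5x + noise
rng = np.random.default_rng(0)
x_ols = rng.uniform(0, 10, size=50)
y_ols = 5.0 + 2.5 * x_ols + rng.normal(0, 5.0, size=50)

# Closed-form OLS estimates for one feature
x_mean, y_mean = x_ols.mean(), y_ols.mean()
beta1_hat = np.sum((x_ols - x_mean) * (y_ols - y_mean)) / np.sum((x_ols - x_mean) ** 2)
beta0_hat = y_mean - beta1_hat * x_mean

# Same fit through scikit-learn (X must be 2D, hence the reshape)
check_model = LinearRegression().fit(x_ols.reshape(-1, 1), y_ols)

print(f"By hand:      intercept={beta0_hat:.4f}, slope={beta1_hat:.4f}")
print(f"scikit-learn: intercept={check_model.intercept_:.4f}, slope={check_model.coef_[0]:.4f}")
```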

* **Implementation with Scikit-learn:**
    The `LinearRegression` class from `sklearn.linear_model` is used. *[90]*

In [None]:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error, r2_score

    # Sample data for SLR
    np.random.seed(0) # for reproducibility
    # X_slr_feature holds a single feature; scikit-learn expects a 2D array of shape (n_samples, n_features)
    X_slr_feature = np.random.rand(50, 1) * 10  # Single feature, e.g., Hours Studied
    # y = 2.5x + 5 + noise (True relationship: intercept=5, slope=2.5)
    y_slr_target = 2.5 * X_slr_feature.squeeze() + np.random.randn(50) * 5 + 5 # Target, e.g., Exam Score

    # Split data
    X_slr_train, X_slr_test, y_slr_train, y_slr_test = train_test_split(
        X_slr_feature, y_slr_target, test_size=0.2, random_state=42
    )

    # 1. Instantiate the model
    slr_model = LinearRegression()

    # 2. Fit the model to the training data
    slr_model.fit(X_slr_train, y_slr_train)
    print("SLR model fitting complete.")

* **Interpreting Coefficients:**
    After fitting the model, the estimated intercept (`\hat{\beta_0}`) and coefficient (`\hat{\beta_1}`) can be accessed:
    * `slr_model.intercept_`: Stores the estimated intercept `\hat{\beta_0}`. *[92]*
    * `slr_model.coef_`: Stores the estimated coefficient(s) as a NumPy array. For SLR, it will be an array with one element, `\hat{\beta_1}`. *[92]*

    * **Sign of Coefficient (`\hat{\beta_1}`):** *[94]*
        * Positive `\hat{\beta_1}`: Indicates a positive linear relationship (as `x` increases, `y` tends to increase).
        * Negative `\hat{\beta_1}`: Indicates a negative linear relationship (as `x` increases, `y` tends to decrease).
    * **Magnitude of Coefficient (`\hat{\beta_1}`):** *[94]*
        * The value of `\hat{\beta_1}` represents the average change in the dependent variable (`y`) for a one-unit increase in the independent variable (`x`).

In [None]:
    estimated_intercept_slr = slr_model.intercept_
    estimated_coefficient_slr = slr_model.coef_[0] # For SLR, coef_ is an array with one value

    print(f"SLR Estimated Intercept (beta_0_hat): {estimated_intercept_slr:.4f}")
    print(f"SLR Estimated Coefficient for X (beta_1_hat): {estimated_coefficient_slr:.4f}")

    # Example Interpretation:
    # If intercept is 5.123 and coefficient is 2.450 for 'Hours Studied' vs 'Exam Score':
    # - For 0 hours studied, the predicted exam score is 5.123. (Often, intercept interpretation needs context, as x=0 might be unrealistic).
    # - For each additional hour studied, the exam score is predicted to increase by 2.450 points, on average.

    # 3. Make predictions
    y_slr_pred_train = slr_model.predict(X_slr_train)
    y_slr_pred_test = slr_model.predict(X_slr_test)

* **Visualization:**
    A scatter plot of the data points (`x` vs `y`) with the fitted regression line (`\hat{y} = \hat{\beta_0} + \hat{\beta_1} x`) is a common way to visualize an SLR model. Seaborn's `regplot` can do this directly, or you can plot the scatter of actuals and then overlay the predicted line. *[91]*

In [None]:
    plt.figure(figsize=(10, 6))
    # Scatter plot of the actual test data
    plt.scatter(X_slr_test, y_slr_test, color='skyblue', label='Actual Test Data', alpha=0.7)
    # Plot the regression line using test predictions
    plt.plot(X_slr_test, y_slr_pred_test, color='red', linewidth=2, label='Fitted Regression Line (Predictions)')

    plt.title('Simple Linear Regression Fit', fontsize=16)
    plt.xlabel('Independent Variable (X - e.g., Hours Studied)', fontsize=12)
    plt.ylabel('Dependent Variable (y - e.g., Exam Score)', fontsize=12)
    plt.legend()
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()
    # [Image: slr_fit.png - A scatter plot of X_slr_test vs y_slr_test with the red regression line (y_slr_pred_test) plotted over it. Clear labels for axes and legend.]

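Alternatively, Seaborn's `regplot` draws the scatter plot and fits its own regression line to whatever data it is given (here the test set), so the line shown is not the one learned by `slr_model`: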

In [None]:
    plt.figure(figsize=(10, 6))
    # Combine X_slr_test and y_slr_test into a DataFrame for Seaborn
    df_slr_test = pd.DataFrame({'X': X_slr_test.squeeze(), 'y': y_slr_test})
    sns.regplot(x='X', y='y', data=df_slr_test, scatter_kws={'color': 'skyblue', 'alpha':0.7}, line_kws={'color':'red'})
    plt.title('Simple Linear Regression using Seaborn regplot', fontsize=16)
    plt.xlabel('Independent Variable (X)', fontsize=12)
    plt.ylabel('Dependent Variable (y)', fontsize=12)
    plt.show()
    # [Image: slr_seaborn_regplot.png - Similar to slr_fit.png, but generated using sns.regplot, possibly showing a confidence interval band around the regression line.]

**7.3 Multiple Linear Regression (MLR)**

Multiple Linear Regression (MLR) extends SLR to model the relationship between **two or more independent variables** (`x_1, x_2, ..., x_p`) and a single continuous dependent variable (`y`). *[96]*

* **Concept and Equation:**
    The relationship is described by:
    $$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon$$
    Where: *[92]*
    * `\beta_0`: Intercept.
    * `\beta_1, \beta_2, ..., \beta_p`: Coefficients for each independent variable `x_1, x_2, ..., x_p`.
    * `\epsilon`: Error term.
* **Implementation with Scikit-learn:**
    The same `LinearRegression` class is used. The input `X` will now be a DataFrame or NumPy array with multiple columns (features). *[96]*

In [None]:
    # Sample data for MLR
    np.random.seed(1) # for reproducibility
    # X_mlr_features_df will have 3 features
    X_mlr_features_df = pd.DataFrame(np.random.rand(100, 3) * 10, columns=['Feature1_Hrs', 'Feature2_PrevScore', 'Feature3_Projects'])
    # True relationship: y = 3*x1 + 1.5*x2 - 0.8*x3 + 7 + noise
    y_mlr_target = (3 * X_mlr_features_df['Feature1_Hrs'] +
                    1.5 * X_mlr_features_df['Feature2_PrevScore'] -
                    0.8 * X_mlr_features_df['Feature3_Projects'] +
                    np.random.randn(100) * 7 + 7)

    X_mlr_train, X_mlr_test, y_mlr_train, y_mlr_test = train_test_split(
        X_mlr_features_df, y_mlr_target, test_size=0.25, random_state=42
    )

    mlr_model = LinearRegression()
    mlr_model.fit(X_mlr_train, y_mlr_train)
    print("\nMLR model fitting complete.")

* **Interpreting Coefficients:**
    * `mlr_model.intercept_`: Estimated `\hat{\beta_0}`.
    * `mlr_model.coef_`: A NumPy array containing the estimated coefficients `\hat{\beta_1}, \hat{\beta_2}, ..., \hat{\beta_p}`. The order corresponds to the order of columns in `X_mlr_train`.
    * **Crucial Interpretation Point:** Each coefficient `\hat{\beta_j}` represents the average change in the dependent variable `y` for a one-unit increase in the independent variable `x_j`, **holding all other independent variables in the model constant**. *[94]* This "ceteris paribus" condition is vital.

In [None]:
    estimated_intercept_mlr = mlr_model.intercept_
    estimated_coefficients_mlr = mlr_model.coef_

    print(f"\nMLR Estimated Intercept: {estimated_intercept_mlr:.4f}")
    # Create a Series for easier viewing of coefficients with feature names
    coeffs_df = pd.Series(estimated_coefficients_mlr, index=X_mlr_features_df.columns)
    print("MLR Estimated Coefficients (betas):\n", coeffs_df)

    # Example Interpretation (hypothetical coefficients):
    # Feature1_Hrs: 2.95  => For a one-unit increase in Feature1_Hrs, y is predicted to increase by 2.95 units,
    #                         assuming Feature2_PrevScore and Feature3_Projects are held constant.
    # Feature2_PrevScore: 1.48 => For a one-unit increase in Feature2_PrevScore, y is predicted to increase by 1.48 units,
    #                         assuming Feature1_Hrs and Feature3_Projects are held constant.
    # Feature3_Projects: -0.75 => For a one-unit increase in Feature3_Projects, y is predicted to decrease by 0.75 units,
    #                         assuming Feature1_Hrs and Feature2_PrevScore are held constant.

    y_mlr_pred_test = mlr_model.predict(X_mlr_test)
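
The "holding all other variables constant" reading can be checked numerically. In this sketch (synthetic data and the names `cp_model`, `base_row`, `bumped_row` are assumptions for illustration), two inputs differ by exactly one unit in the first feature, and the difference in their predictions equals that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic three-feature data (illustrative only)
rng = np.random.default_rng(1)
X_cp = rng.uniform(0, 10, size=(100, 3))
y_cp = (7.0 + 3.0 * X_cp[:, 0] + 1.5 * X_cp[:, 1] - 0.8 * X_cp[:, 2]
        + rng.normal(0, 7.0, size=100))
cp_model = LinearRegression().fit(X_cp, y_cp)

# Two rows that are identical except for a one-unit increase in feature 0
base_row = np.array([[5.0, 5.0, 5.0]])
bumped_row = base_row.copy()
bumped_row[0, 0] += 1.0

pred_change = cp_model.predict(bumped_row)[0] - cp_model.predict(base_row)[0]
print(f"Change in prediction:       {pred_change:.4f}")
print(f"Coefficient of feature 0:   {cp_model.coef_[0]:.4f}")
```

The same check works for any feature and any starting row, because the model is linear in each feature.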

* **Multicollinearity:** If independent variables in an MLR model are highly correlated with each other (multicollinearity), the coefficient estimates can become unstable, have large standard errors, and be difficult to interpret reliably. The overall model might still predict well, but individual coefficient interpretations become suspect. *[90]*
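
A small simulation can make the instability visible. The sketch below is an illustration under assumed settings (two features, 100 samples, 200 repeated draws; the helper `coef_std` is hypothetical): it refits OLS on fresh samples and compares how much the first coefficient varies when the second feature is independent versus when it is nearly a copy of the first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_repeats = 100, 200

def coef_std(correlated):
    """Standard deviation of the first OLS coefficient across repeated samples."""
    first_coefs = []
    for _ in range(n_repeats):
        x1 = rng.normal(size=n_samples)
        if correlated:
            x2 = x1 + rng.normal(scale=0.05, size=n_samples)  # nearly a copy of x1
        else:
            x2 = rng.normal(size=n_samples)                   # independent of x1
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n_samples)
        model = LinearRegression().fit(np.column_stack([x1, x2]), y)
        first_coefs.append(model.coef_[0])
    return np.std(first_coefs)

std_independent = coef_std(correlated=False)
std_collinear = coef_std(correlated=True)
print(f"Std of coefficient, independent features: {std_independent:.3f}")
print(f"Std of coefficient, collinear features:   {std_collinear:.3f}")
```

The true coefficient is 2.0 in both settings; only the spread of the estimate changes.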

**7.4 Polynomial Regression**

Polynomial Regression is a type of regression analysis where the relationship between the independent variable `x` and the dependent variable `y` is modeled as a polynomial of degree *d* in `x`. Although it models non-linear relationships in the original feature space, it's considered a **special case of multiple linear regression** because the model is linear in terms of its *coefficients*. *[98]*

* **Concept and Equation:**
    For a single feature `x`, a polynomial regression of degree `d` is:
    $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + ... + \beta_d x^d + \epsilon$$
    New features (`x^2, x^3, ..., x^d`) are created from the original feature `x`. The model then fits a linear relationship to these original and transformed features. *[98]*
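
To see why this is still linear regression, the squared column can be built by hand. The sketch below (synthetic quadratic data; names such as `X_manual` and `pipe_model` are assumptions) fits `LinearRegression` on the columns `[x, x^2]` and confirms that a `PolynomialFeatures` pipeline yields the same coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative): y = 2 + x + 0.5x^2 + noise
rng = np.random.default_rng(0)
x_raw = rng.uniform(-5, 5, size=60)
y_quad = 2.0 + 1.0 * x_raw + 0.5 * x_raw**2 + rng.normal(0, 3.0, size=60)

# Manual route: treat x and x^2 as two ordinary features
X_manual = np.column_stack([x_raw, x_raw**2])
manual_model = LinearRegression().fit(X_manual, y_quad)

# Pipeline route: PolynomialFeatures creates the same two columns
pipe_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=2, include_bias=False)),
    ("lin_reg", LinearRegression()),
]).fit(x_raw.reshape(-1, 1), y_quad)
pipe_coefs = pipe_model.named_steps["lin_reg"].coef_

print("Manual coefficients:  ", manual_model.coef_)
print("Pipeline coefficients:", pipe_coefs)
```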

* **Scikit-learn `PolynomialFeatures`:**
    This transformer from `sklearn.preprocessing` generates a new feature matrix consisting of all polynomial combinations of the features with a degree less than or equal to the specified degree. *[98]*
    * `degree`: The degree of the polynomial features to generate (e.g., `degree=2` with input `[x]` generates `[1, x, x^2]`; with input `[x1, x2]` generates `[1, x1, x2, x1^2, x1*x2, x2^2]`). *[99]*
    * `include_bias=True` (default): Includes a column of ones (the bias or intercept term). Usually set to `False` when used with `LinearRegression` (which handles its own intercept), or within a Pipeline where `LinearRegression` is the final step.
    * `interaction_only=False` (default): If `True`, only interaction features (e.g., `x1*x2`) are produced, not higher-order terms of single features (e.g., `x1^2`). *[101]*

* **Implementation:**
    Typically involves a two-step process (or streamlined using a Scikit-learn `Pipeline` *[89]*):
    1.  Transform the input features using `PolynomialFeatures`.
    2.  Fit a `LinearRegression` model on these new, transformed features.

In [None]:
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.pipeline import Pipeline # For easily combining steps

    # Generate some non-linear data for better illustration
    np.random.seed(0)
    X_poly_feature = np.sort(np.random.rand(70, 1) * 10 - 5, axis=0) # Single feature from -5 to 5
    # True relationship: y = 0.5x^2 + x + 2 + noise
    y_poly_target = 0.5 * X_poly_feature.squeeze()**2 + X_poly_feature.squeeze() + 2 + np.random.randn(70) * 3

    X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(
        X_poly_feature, y_poly_target, test_size=0.3, random_state=42
    )

    # Define the degree of the polynomial
    degree = 2 # Try changing to 1 (linear), 3, 10 (overfitting) to see effect

    # Create a pipeline: 1. PolynomialFeatures, 2. LinearRegression
    poly_reg_pipeline = Pipeline([
        ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)), # No bias needed for LinearRegression step
        ("lin_reg", LinearRegression())
    ])

    poly_reg_pipeline.fit(X_poly_train, y_poly_train)
    y_poly_pred_test = poly_reg_pipeline.predict(X_poly_test)

    # Accessing coefficients (from the linear regression step in the pipeline)
    # poly_transformer = poly_reg_pipeline.named_steps['poly_features']
    linear_regressor_in_pipe = poly_reg_pipeline.named_steps['lin_reg']
    # print(f"Polynomial features names (example): {poly_transformer.get_feature_names_out(['x'])}") # ['x', 'x^2'] for degree 2
    print(f"\nPolynomial Regression (degree {degree}) Intercept: {linear_regressor_in_pipe.intercept_:.4f}")
    print(f"Polynomial Regression (degree {degree}) Coefficients: {linear_regressor_in_pipe.coef_}")
    # For degree 2, coef_ will have two values: for x and x^2 terms.

* Polynomial regression can capture more complex, non-linear patterns in data than simple linear regression. *[100]*
    * **Risk of Overfitting:** Choosing a very high degree can lead to overfitting. The model might fit the training data noise too closely and perform poorly on unseen test data. *[100]*
    * `[Diagram: A plot showing underfitting (a straight line trying to fit curved data), good fit (a curve of appropriate degree fitting the data well), and overfitting (a very wiggly curve passing through almost all training points, likely to generalize poorly).]`
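
One way to look for overfitting is to compare training and test error as the degree grows. The sketch below is an illustration on synthetic quadratic data (the degrees 1, 2, and 8 and all variable names are arbitrary choices): training error can only go down as terms are added, while test error typically stops improving and may rise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative): true relationship is quadratic
rng = np.random.default_rng(0)
X_deg = rng.uniform(-3, 3, size=(60, 1))
y_deg = 2.0 + X_deg[:, 0] + 0.5 * X_deg[:, 0] ** 2 + rng.normal(0, 1.0, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X_deg, y_deg, test_size=0.3, random_state=42)

train_mse, test_mse = {}, {}
for deg in (1, 2, 8):
    model = Pipeline([
        ("poly_features", PolynomialFeatures(degree=deg, include_bias=False)),
        ("lin_reg", LinearRegression()),
    ]).fit(X_tr, y_tr)
    train_mse[deg] = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse[deg] = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={deg}: train MSE={train_mse[deg]:.3f}, test MSE={test_mse[deg]:.3f}")
```

A widening gap between training and test MSE at higher degrees is the usual sign that the model is fitting noise.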

* **Visualization for Polynomial Regression:**

In [None]:
    plt.figure(figsize=(12, 7))
    plt.scatter(X_poly_train, y_poly_train, color='skyblue', label='Training Data', alpha=0.6)
    plt.scatter(X_poly_test, y_poly_test, color='orange', label='Test Data', alpha=0.8)

    # For a smooth line, predict on a sorted range of X values
    X_plot_range = np.linspace(X_poly_feature.min(), X_poly_feature.max(), 100).reshape(-1, 1)
    y_plot_pred_poly = poly_reg_pipeline.predict(X_plot_range)

    plt.plot(X_plot_range, y_plot_pred_poly, color='red', linewidth=2,
             label=f'Polynomial Fit (degree={degree})')

    # For comparison: fit and plot a simple linear regression
    simple_lin_reg_on_poly = LinearRegression()
    simple_lin_reg_on_poly.fit(X_poly_train, y_poly_train)
    y_plot_pred_simple = simple_lin_reg_on_poly.predict(X_plot_range)
    plt.plot(X_plot_range, y_plot_pred_simple, color='green', linestyle='--', linewidth=2,
             label='Simple Linear Fit')

    plt.title(f'Polynomial Regression (Degree {degree}) vs. Simple Linear Regression', fontsize=16)
    plt.xlabel('Independent Variable (X)', fontsize=12)
    plt.ylabel('Dependent Variable (y)', fontsize=12)
    plt.legend()
    plt.ylim(y_poly_target.min()-10, y_poly_target.max()+10) # Adjust y-limits for better view
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()
    # [Image: polynomial_fit_comparison.png - A scatter plot of non-linear data points (X_poly vs y_poly).
    # A red curve shows the polynomial regression fit (e.g., degree 2 or 3), following the data's curvature.
    # A green dashed line shows a simple linear regression fit, which poorly captures the trend.
    # Axes, legend, and title included.]

**7.5 Ridge Regression (L2 Regularization)**

Ridge Regression is a linear regression model that includes an **L2 regularization penalty**. Regularization is a technique used to prevent overfitting (especially in models with many features or multicollinearity) by adding a penalty term to the loss function that discourages overly complex models (i.e., models with very large coefficient values). *[104]*

* **Concept and Loss Function:**
    The standard Ordinary Least Squares (OLS) method (used in `LinearRegression`) minimizes the sum of squared residuals (RSS). Ridge Regression modifies this by adding a penalty proportional to the sum of the squares of the coefficients:
    $$\text{Loss Function (Ridge)} = \text{RSS} + \alpha \sum_{j=1}^{p} \beta_j^2$$
    $$\text{Loss Function (Ridge)} = \sum_{i=1}^{n} (y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}))^2 + \alpha \sum_{j=1}^{p} \beta_j^2$$
    Where: *[105]*
    * `\alpha` (alpha, also sometimes denoted as lambda `\lambda`): The **regularization strength parameter**. It's a non-negative hyperparameter that controls the amount of shrinkage applied to the coefficients. *[104]*
        * If `\alpha = 0`, Ridge Regression becomes identical to OLS (Simple/Multiple Linear Regression).
        * As `\alpha \rightarrow \infty` (alpha gets very large), the coefficients are shrunk strongly towards zero. This gives a simpler model with higher bias but lower variance. Generalization improves only up to a point: an overly large `\alpha` underfits. *[106]*
        * If `\alpha` is small, the penalty has less effect.
    * `\beta_j^2`: The square of the coefficient for the *j*-th feature. The intercept `\beta_0` is typically *not* penalized.
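
Minimizing this loss has a standard closed-form solution: after centering `X` and `y` (which takes care of the unpenalized intercept), the coefficients solve `(X^T X + alpha * I) beta = X^T y`. The sketch below is a check under assumed synthetic data and an arbitrary `alpha_demo = 10.0`; it compares that formula with scikit-learn's `Ridge` and with unpenalized OLS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_r = rng.normal(size=(60, 3))
y_r = 1.0 + X_r @ np.array([3.0, 1.5, -0.8]) + rng.normal(0, 1.0, size=60)
alpha_demo = 10.0  # arbitrary regularization strength for the demo

# Center the data so the intercept drops out of the penalized problem
X_c = X_r - X_r.mean(axis=0)
y_c = y_r - y_r.mean()

# Closed-form ridge coefficients: solve (X^T X + alpha I) beta = X^T y
beta_closed = np.linalg.solve(X_c.T @ X_c + alpha_demo * np.eye(3), X_c.T @ y_c)
intercept_closed = y_r.mean() - X_r.mean(axis=0) @ beta_closed

ridge_check = Ridge(alpha=alpha_demo).fit(X_r, y_r)
ols_check = LinearRegression().fit(X_r, y_r)

print("Closed-form ridge:", beta_closed)
print("sklearn Ridge:    ", ridge_check.coef_)
print("OLS (alpha = 0):  ", ols_check.coef_)
```

The ridge coefficient vector has a smaller L2 norm than the OLS one, which is the shrinkage the penalty is designed to produce.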

* **Purpose of Ridge Regression:**
    * **Reduces Model Complexity & Prevents Overfitting:** Especially useful when the number of predictors (`p`) is large relative to the number of observations (`n`), or when features are highly correlated.
    * **Handles Multicollinearity:** More effectively than OLS. When predictors are highly correlated, OLS coefficients can have high variance (be very sensitive to small changes in data). Ridge tends to shrink correlated coefficients towards each other and stabilize them.
    * **Coefficient Shrinkage:** Shrinks coefficients towards zero but **does not set them exactly to zero** (unless `\alpha` is infinitely large). This means it keeps all features in the model but reduces their impact. This is a key difference from Lasso Regression (L1 regularization), which can perform feature selection by setting some coefficients to exactly zero.

* **Scikit-learn: `Ridge` class from `sklearn.linear_model`.** *[104]*
    * The `alpha` parameter corresponds to `\alpha` in the loss function.
    * **Important:** Feature scaling (e.g., standardization using `StandardScaler`) is generally **highly recommended** before applying Ridge Regression. The L2 penalty (`\sum \beta_j^2`) depends on the scale of the features: a feature recorded in small numeric units needs a larger coefficient to have the same effect on `y`, so it is penalized more heavily, while a feature with large numeric values gets a small coefficient and is barely penalized. Scaling puts features on a comparable basis before the penalty is applied.

In [None]:
    from sklearn.linear_model import Ridge
    from sklearn.preprocessing import StandardScaler # Scaling is important for Ridge

    # Using MLR data (X_mlr_train, y_mlr_train, X_mlr_test)
    # 1. Scale data first (fit on train, transform train and test)
    scaler_for_ridge = StandardScaler()
    X_mlr_train_scaled = scaler_for_ridge.fit_transform(X_mlr_train)
    X_mlr_test_scaled = scaler_for_ridge.transform(X_mlr_test) # Use same scaler

    # 2. Instantiate Ridge model with a chosen alpha
    # The optimal alpha is usually found via cross-validation (covered in Module 8)
    alpha_value = 1.0 # Common default starting point
    ridge_model = Ridge(alpha=alpha_value)

    # 3. Fit the Ridge model
    ridge_model.fit(X_mlr_train_scaled, y_mlr_train)

    print(f"\nRidge Regression (alpha={alpha_value}) Intercept: {ridge_model.intercept_:.4f}")
    ridge_coeffs_df = pd.Series(ridge_model.coef_, index=X_mlr_features_df.columns) # Assuming X_mlr_features_df had original column names
    print(f"Ridge Regression (alpha={alpha_value}) Coefficients:\n", ridge_coeffs_df)

    # Compare with OLS coefficients (from mlr_model, assuming X_mlr_train was unscaled there)
    # For a fair comparison, fit an OLS model on SCALED data too:
    ols_on_scaled_data = LinearRegression()
    ols_on_scaled_data.fit(X_mlr_train_scaled, y_mlr_train)
    ols_coeffs_scaled_df = pd.Series(ols_on_scaled_data.coef_, index=X_mlr_features_df.columns)
    print("\nOLS (LinearRegression) Coefficients on SCALED data for comparison:\n", ols_coeffs_scaled_df)
    # Expect the Ridge coefficients to be shrunk towards zero relative to OLS on the same scaled data
    # (for any alpha > 0 the coefficient vector has a smaller L2 norm); with alpha=1.0 the difference here is small.

    y_ridge_pred_test = ridge_model.predict(X_mlr_test_scaled)
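
The scaling recommendation can be tested with a small experiment. In this sketch (synthetic data; the metres/millimetres framing, `alpha=100.0`, and the helper names are assumptions for illustration), the same feature is expressed in two different units. Unscaled Ridge gives different predictions for the two versions, while a `StandardScaler` + `Ridge` pipeline gives the same predictions for both.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_m = rng.normal(size=(50, 2))                       # hypothetical: feature 0 in metres
y_m = 3.0 * X_m[:, 0] + 2.0 * X_m[:, 1] + rng.normal(0, 0.5, size=50)
X_mm = X_m * np.array([1000.0, 1.0])                 # same data, feature 0 in millimetres

def fit_predict(model, X):
    """Fit on (X, y_m) and return in-sample predictions."""
    return model.fit(X, y_m).predict(X)

def scaled_ridge():
    """Ridge preceded by standardization, as a single estimator."""
    return Pipeline([("scaler", StandardScaler()), ("ridge", Ridge(alpha=100.0))])

# Unscaled Ridge: the penalty treats the two unit choices differently
raw_m = fit_predict(Ridge(alpha=100.0), X_m)
raw_mm = fit_predict(Ridge(alpha=100.0), X_mm)

# Scaled Ridge: standardization removes the unit choice before the penalty applies
scaled_m = fit_predict(scaled_ridge(), X_m)
scaled_mm = fit_predict(scaled_ridge(), X_mm)

print("Unscaled predictions agree across units:", np.allclose(raw_m, raw_mm))
print("Scaled predictions agree across units:  ", np.allclose(scaled_m, scaled_mm))
```

Putting the scaler inside a pipeline also guarantees that its statistics are learned from the training data only.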

* **Effect of `alpha` on Coefficients:**
    Higher values of `alpha` lead to greater shrinkage of coefficients towards zero. A common practice is to try a range of `alpha` values and use cross-validation to find the one that yields the best performance on unseen data. *[106]*

In [None]:
    # Visualize effect of different alpha values on coefficients
    n_alphas = 200
    alphas_range = np.logspace(-4, 4, n_alphas) # Range of alpha values from 0.0001 to 10000
    coefs_ridge_path = []

    for current_alpha in alphas_range:
        ridge_path_model = Ridge(alpha=current_alpha, fit_intercept=True) # fit_intercept is True by default
        ridge_path_model.fit(X_mlr_train_scaled, y_mlr_train) # Using SCALED training data
        coefs_ridge_path.append(ridge_path_model.coef_)

    plt.figure(figsize=(12, 7))
    ax = plt.gca()
    ax.plot(alphas_range, coefs_ridge_path)
    ax.set_xscale('log') # Alpha is often viewed on a log scale
    # ax.set_xlim(ax.get_xlim()[::-1]) # Optional: reverse x-axis for some conventions
    plt.xlabel('Alpha (Regularization Strength) - Log Scale', fontsize=12)
    plt.ylabel('Coefficient Values', fontsize=12)
    plt.title('Ridge Coefficients as a Function of Regularization Strength (Alpha)', fontsize=16)
    plt.axis('tight')
    plt.legend(X_mlr_features_df.columns, loc='upper right', title="Features") # Add legend for features
    plt.grid(True, linestyle='--', alpha=0.5)
    plt.show()
    # [Image: ridge_coefficients_vs_alpha.png - A plot with Alpha on a log-scaled x-axis and Coefficient values on the y-axis.
    # Multiple lines (one for each feature in X_mlr_features_df) show how coefficients shrink towards zero as alpha increases.
    # A legend identifies each line with its feature name.]
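
For the cross-validation step mentioned above, scikit-learn provides `RidgeCV`, which tries a list of candidate `alpha` values and keeps the best one. The sketch below is one possible setup on synthetic data (the candidate grid and the step names `scaler` and `ridge_cv` are arbitrary choices); with the default `cv=None`, `RidgeCV` uses an efficient form of leave-one-out cross-validation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
rng = np.random.default_rng(1)
X_cv = rng.uniform(0, 10, size=(100, 3))
y_cv = 7.0 + X_cv @ np.array([3.0, 1.5, -0.8]) + rng.normal(0, 7.0, size=100)

candidate_alphas = np.logspace(-4, 4, 50)  # grid of alphas to try

cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge_cv", RidgeCV(alphas=candidate_alphas)),
])
cv_pipeline.fit(X_cv, y_cv)

best_alpha = cv_pipeline.named_steps["ridge_cv"].alpha_  # alpha selected by cross-validation
cv_preds = cv_pipeline.predict(X_cv)
print(f"Selected alpha: {best_alpha:.4g}")
```

Strictly speaking, the scaler here is fit once on all training rows before the internal cross-validation runs; wrapping the whole pipeline in a search such as `GridSearchCV` would refit the scaler within each fold.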

Ridge regression is a valuable tool for improving the stability and generalization of linear models, particularly in situations with many features, multicollinearity, or when trying to prevent overfitting.

---

**Module 7: Practice Questions**

116. **Scikit-learn API:** What is the primary method called on an estimator object in Scikit-learn to train it on data? What are the typical arguments to this method for supervised learning?
117. **Scikit-learn API:** After training a supervised model, which method is used to get predictions on new data?
118. **SLR Equation:** Write the basic equation for Simple Linear Regression and briefly define each term (`y`, `x`, `\beta_0`, `\beta_1`, `\epsilon`).
119. **SLR Coefficient Interpretation:** In an SLR model predicting 'Salary' (in thousands of dollars) based on 'YearsExperience', if the coefficient `\hat{\beta_1}` for 'YearsExperience' is 5.2, how would you interpret this value?
120. **Coding SLR:** You have `X_train_feature` (a single feature, 2D) and `y_train_target`. Write the Scikit-learn code to instantiate and fit a Simple Linear Regression model.
121. **MLR Concept:** How does Multiple Linear Regression differ from Simple Linear Regression?
122. **MLR Coefficient Interpretation:** In an MLR model, if the coefficient for feature `x_k` is `\hat{\beta_k} = -2.5`, how is this interpreted regarding the relationship between `x_k` and `y`, considering other features in the model?
123. **Polynomial Regression:** Why is Polynomial Regression considered a special case of Multiple Linear Regression, even though it can model non-linear relationships?
124. **`PolynomialFeatures`:** If you use `PolynomialFeatures(degree=2, include_bias=False)` on a dataset with two original features `[a, b]`, what new features would be generated?
125. **Overfitting:** What is a major risk when using a high degree in Polynomial Regression?
126. **Pipeline:** Why is it often beneficial to use a `Pipeline` in Scikit-learn when combining steps like `PolynomialFeatures` and `LinearRegression`?
127. **Regularization:** What is the primary purpose of regularization techniques like Ridge Regression?
128. **Ridge Regression (Alpha):** In Ridge Regression, what does the hyperparameter `\alpha` (alpha) control? What happens if `\alpha = 0`? What happens if `\alpha` is very large?
129. **Ridge vs. OLS:** How do the coefficient estimates from Ridge Regression typically compare to those from Ordinary Least Squares (OLS) Linear Regression, especially when `\alpha > 0`?
130. **Feature Scaling for Ridge:** Why is feature scaling (e.g., standardization) particularly important before applying Ridge Regression?
131. **Coding Ridge:** Assume you have scaled training data `X_train_scaled` and target `y_train`. Write Scikit-learn code to instantiate and fit a Ridge Regression model with `alpha = 0.5`.
132. **Multicollinearity:** How might Ridge Regression help in situations where multicollinearity exists among predictor variables?
133. **L2 Penalty:** The penalty term in Ridge Regression (`\alpha \sum \beta_j^2`) is known as an L2 penalty. What does this mean in terms of how it penalizes coefficients?
134. **Model Complexity:** Does increasing the `alpha` value in Ridge Regression increase or decrease the complexity of the resulting model? Explain.
135. **Visualization:** What kind of plot is typically used to visualize the effect of different `alpha` values on the coefficients in Ridge Regression?

