# Supervised Machine Learning: Linear Regression

## Linear Regression: Unscaled vs. Scaled Data
In this demo, we follow the ML process:
1. **Remember:** Load and inspect the data.
2. **Formulate:** Build a linear regression model first on raw (unscaled) data.
3. **Predict:** Evaluate the model's performance.

Then we apply feature scaling and rebuild the model to compare results.
We use the Student Performance dataset from Kaggle to predict the "Performance Index" of students.

In [2]:
# import necessary libraries
import pandas as pd
import numpy as np

# Download data from Kaggle
!kaggle datasets download -d nikhil7280/student-performance-multiple-linear-regression
!unzip student-performance-multiple-linear-regression.zip

# Import dataframe
df = pd.read_csv("Student_Performance.csv")
df

Dataset URL: https://www.kaggle.com/datasets/nikhil7280/student-performance-multiple-linear-regression
License(s): other
Downloading student-performance-multiple-linear-regression.zip to /content
  0% 0.00/48.5k [00:00<?, ?B/s]
100% 48.5k/48.5k [00:00<00:00, 41.2MB/s]
Archive:  student-performance-multiple-linear-regression.zip
  inflating: Student_Performance.csv  


Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,Yes,9,1,91.0
1,4,82,No,4,2,65.0
2,8,51,Yes,7,2,45.0
3,5,52,Yes,5,2,36.0
4,7,75,No,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,Yes,4,2,23.0
9996,7,64,Yes,8,5,58.0
9997,6,83,Yes,8,5,74.0
9998,9,97,Yes,7,0,95.0


In [3]:
# Convert extracurricular activities to numeric
df["Extracurricular Activities"] = df["Extracurricular Activities"].map({"Yes":1,
                                                                          "No":0})

In [4]:
# Define the features and target variable based on the dataset
feature_vars = ["Hours Studied", "Previous Scores", "Sleep Hours",
                "Sample Question Papers Practiced", "Extracurricular Activities"]

X = df[feature_vars]
y = df["Performance Index"]

# Display a preview of the dataset
df

Unnamed: 0,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7,99,1,9,1,91.0
1,4,82,0,4,2,65.0
2,8,51,1,7,2,45.0
3,5,52,1,5,2,36.0
4,7,75,0,8,5,66.0
...,...,...,...,...,...,...
9995,1,49,1,4,2,23.0
9996,7,64,1,8,5,58.0
9997,6,83,1,8,5,74.0
9998,9,97,1,7,0,95.0


## Part 1: Linear Regression on Unscaled Data
In this section, we build a [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) model on the raw data.
This helps us see the effect of differing scales on the coefficients.
We start by [splitting our data into training and testing sets](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split).

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the raw data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Initialize and train the linear regression model on unscaled data
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lin_reg.predict(X_test)
y_pred

array([54.71185392, 22.61551294, 47.90314471, ..., 16.79341955,
       63.34327368, 45.94262301])

In [6]:
from sklearn.metrics import mean_squared_error, root_mean_squared_error, r2_score

# Evaluate model performance
mse_lin = mean_squared_error(y_test, y_pred)
rmse_lin = root_mean_squared_error(y_test, y_pred)
r2_lin = r2_score(y_test, y_pred)

print("Unscaled Data Model:")
print(f"Mean Squared Error: {mse_lin:.2f}")
print(f"Root Mean Squared Error: {rmse_lin:.2f}")
print(f"R² Score: {r2_lin:.2f}")

Unscaled Data Model:
Mean Squared Error: 4.08
Root Mean Squared Error: 2.02
R² Score: 0.99


### Notes on Unscaled Model:
- **Coefficients (Unscaled):**
    - Each coefficient represents the change in the Performance Index for a one-unit change in the respective feature, holding all other features constant.
    - For example, if "Hours Studied" has a coefficient of 2.85, it implies that for each additional hour studied, the Performance Index increases by 2.85 points (assuming other factors remain constant).
    - However, because features are in different units (e.g., hours vs. scores), comparing these coefficients directly may be misleading.

- **R² Score:**
    - This metric indicates the proportion of the variance in the target variable explained by the model.
    - An R² close to 1 suggests a very good fit, while an R² near 0 indicates the model fails to capture much variance.

- **MSE & RMSE:**
    - MSE measures the average squared difference between actual and predicted values.
    - RMSE, being the square root of MSE, gives an error metric in the same units as the target.
    - Lower RMSE values indicate better predictive performance.
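
The three metrics described above can be reproduced by hand. The following is a minimal sketch using NumPy on made-up toy values (`y_true` and `y_hat` are illustrative names and numbers, not taken from the dataset):

```python
import numpy as np

# Toy values chosen for illustration only (not from the Student Performance data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: average of the squared differences between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)

# RMSE: square root of MSE, expressed in the same units as the target
rmse_manual = np.sqrt(mse_manual)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.3f}")
print(f"RMSE: {rmse_manual:.3f}")
print(f"R²:   {r2_manual:.3f}")
```

The same formulas underlie `mean_squared_error`, `root_mean_squared_error`, and `r2_score`, so the hand-computed values should agree with what those functions return for the same inputs.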

### Manually Computing a Prediction from Our Model
- In this section, we'll calculate a predicted value by hand (i.e., by multiplying the model's coefficients by the original feature values and adding the intercept).
- This mirrors exactly what the model does internally.

- **Why is this helpful?**
   - It reinforces how linear regression makes its predictions using the equation: `prediction = intercept + (coef_1 * x_1) + (coef_2 * x_2) + ...`
   - It helps us see the individual impact of each feature on the final prediction.
   - It confirms that the manual approach matches the `model.predict()` output.

#### 1. Extract the coefficients and intercept from our trained model

In [9]:
# View our model's coefficients
coef_series = pd.Series(lin_reg.coef_, index = X.columns)
intercept = pd.Series(lin_reg.intercept_)
print(intercept)
coef_series

0   -33.921946
dtype: float64


Unnamed: 0,0
Hours Studied,2.852484
Previous Scores,1.016988
Sleep Hours,0.476941
Sample Question Papers Practiced,0.191831
Extracurricular Activities,0.608617


#### 2. Select a single row of our data (e.g., the second row)
- We select only the columns that were used as features in our model.
- The row's values represent the actual data for Hours Studied, Previous Scores, etc.

In [11]:
row_index = 1
row_features = X.iloc[row_index]
row_features

Unnamed: 0,1
Hours Studied,4
Previous Scores,82
Sleep Hours,4
Sample Question Papers Practiced,2
Extracurricular Activities,0


#### 3. Compute the manual prediction

In [21]:
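# Multiply each feature by its coefficient, sum, and add the intercept
manual_prediction = (row_features * coef_series).sum() + intercept
print(manual_prediction)

# Actual Performance Index for the same row
print(y.iloc[row_index])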

0    63.172451
dtype: float64


65.0

**Explanation:**
- We multiply each feature value by its corresponding coefficient and sum them up.
- Then, we add the intercept.
- This is precisely the linear regression equation:
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

Where:
 - $\beta_0$ is the intercept
 - $\beta_i$ is the coefficient for feature $x_i$

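Thus, the manual prediction (about 63.17) should match what the model predicts internally. The actual value for this row is 65.0, so the model's error (residual) on this row is about 1.83.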
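
As a worked illustration, plugging the printed intercept, the printed coefficients, and the feature values of the selected row (4, 82, 4, 2, 0) into the equation gives the following. The coefficients are rounded as displayed, so the last digits may differ slightly from the model's full-precision result:

$$
\hat{y} \approx -33.921946 + (2.852484)(4) + (1.016988)(82) + (0.476941)(4) + (0.191831)(2) + (0.608617)(0) \approx 63.17
$$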

#### 4. Compare to `model.predict()` for confirmation

In [26]:
# Pass a one-row DataFrame so the feature names match those seen during fit
print(lin_reg.predict(X.iloc[[row_index]]))

[63.17245064]




### **Observation:**
- The manual prediction (63.172451) and the `lin_reg.predict()` output (63.17245064) agree; they differ only in how many digits are printed.
- Because they match, we've confirmed our understanding of how the model uses coefficients and intercept to make a prediction.
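
The same check can be run for every row at once. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo`, and the chosen true weights are hypothetical, not the notebook's dataset) and compares the manual dot-product predictions with `predict()` using `np.allclose`, which tolerates minor floating-point differences:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical; not the Student Performance dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_weights = np.array([2.0, -1.0, 0.5])
y_demo = 1.5 + X_demo @ true_weights + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction for every row: intercept + (features · coefficients)
manual_predictions = model.intercept_ + X_demo @ model.coef_
model_predictions = model.predict(X_demo)

# np.allclose allows for tiny floating-point differences between the two
print(np.allclose(manual_predictions, model_predictions))
```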

### Why This Matters
- **Transparency:** It shows exactly how each feature influences the final predicted value.
- **Verification:** Confirms our "manual" math aligns with the model's internal computation.
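- **Interpretability:** By inspecting the coefficients, we see the direction (positive or negative) and per-unit size of each feature's effect on the Performance Index, and we can discuss whether the magnitudes make sense given the domain context. Because the features use different units, ranking features by raw coefficient size alone can be misleading.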

In [33]:
# statsmodels

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Part 2: Linear Regression on Scaled Data
Now we apply feature scaling using StandardScaler and rebuild the model.
Scaling brings all features to a similar scale, which aids in the interpretation of the coefficients.
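
To make the transformation concrete, here is a small sketch of what standardization computes: each column has its mean subtracted and is divided by its standard deviation. The toy matrix `X_toy` is illustrative only; the comparison against `StandardScaler` assumes its default settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (illustrative values): two columns on different scales
X_toy = np.array([[1.0, 40.0],
                  [5.0, 70.0],
                  [9.0, 100.0]])

# Manual standardization: subtract the column mean, divide by the column std.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X_toy)

# The two results should agree
print(np.allclose(X_manual, X_sklearn))

# After scaling, each column has mean 0 and standard deviation 1
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```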

In [46]:
from sklearn.preprocessing import StandardScaler

# initialize the scaler and apply it to the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

# split the scaled data
X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X_scaled, y,
                                                 test_size = 0.2, random_state = 42)

# fit scaled data
lin_reg_scaled = LinearRegression()
lin_reg_scaled.fit(X_train_scaled, y_train_scaled)

# make predictions
y_pred_scaled = lin_reg_scaled.predict(X_test_scaled)
y_pred_scaled

# evaluate model performance
mse_scaled = mean_squared_error(y_test_scaled, y_pred_scaled)
r2_scaled = r2_score(y_test_scaled, y_pred_scaled)
rmse_scaled = root_mean_squared_error(y_test_scaled, y_pred_scaled)

print("\nScaled Data Model:")
print(f"Mean Squared Error: {mse_scaled:.2f}")
print(f"Root Mean Squared Error: {rmse_scaled:.2f}")
print(f"R² Score: {r2_scaled:.2f}")
print("Model Coefficients (Scaled):")
print(pd.Series(lin_reg_scaled.coef_, index=X_scaled.columns))


Scaled Data Model:
Mean Squared Error: 4.08
Root Mean Squared Error: 2.02
R² Score: 0.99
Model Coefficients (Scaled):
Hours Studied                        7.385592
Previous Scores                     17.636899
Sleep Hours                          0.808787
Sample Question Papers Practiced     0.550020
Extracurricular Activities           0.304292
dtype: float64


### Notes on Scaled Model:
- **Coefficients (Scaled):**
    - After scaling, each coefficient indicates the change in the Performance Index for a one standard deviation change in that feature.
    - This standardization makes it easier to compare the relative importance of features.
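    - For example, a coefficient with a larger absolute value means that feature has a larger effect on the target per standard deviation change. Here, Previous Scores (17.64) and Hours Studied (7.39) have much larger effects than the other three features (all below 1).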

- **R² and RMSE Comparison:**
    - For ordinary least squares, standardizing the features does not change the predictions (up to floating-point error), so the performance metrics are the same for both models here (MSE 4.08, RMSE 2.02, R² 0.99).
    - What scaling changes is the coefficients: it puts them on a common footing so they can be compared, which matters when features are on different scales.
    - It is also a critical preprocessing step for many other algorithms.
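
When scaling is used as a preprocessing step, one common pattern is to wrap the scaler and the model in a pipeline so that the scaler's means and standard deviations are learned from the training rows only. The notebook cell above fits the scaler on all of `X` before splitting; the sketch below shows the pipeline alternative on synthetic stand-in data (`X_demo`, `y_demo`, and the data-generating numbers are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), with features on very different scales
rng = np.random.default_rng(42)
X_demo = np.column_stack([rng.uniform(1, 9, 500), rng.uniform(40, 99, 500)])
y_demo = -30 + 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

# fit() learns the scaling statistics from the training rows only;
# score() and predict() reuse those statistics to transform new rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

print(f"R² on held-out data: {pipe.score(X_te, y_te):.3f}")
```

For plain linear regression this choice does not affect the predictions, but for algorithms that are sensitive to feature scale it keeps information from the test set out of the training step.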

# Conclusion
In this demo, we:
- Built and evaluated a linear regression model on unscaled data.
- Re-trained the model after applying feature scaling.
- Observed that the overall performance metrics (**MSE**, **RMSE**, and **R²**) were the same for both models, while scaling made the coefficients directly comparable across features. Scaling matters more for algorithms that are sensitive to feature scale.
  
### Key Takeaways:
- **Coefficients:** On unscaled data, coefficients are tied to the original units, which can be hard to compare.
  After scaling, coefficients represent the effect of a one standard deviation change in the feature.
- **R² Score:** Reflects the proportion of variance in the target variable explained by the model.
- **MSE (and RMSE):** Lower values indicate better model performance; RMSE provides an error measure in the target's units.
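
The link between the two sets of coefficients can be checked directly: with standardized features, each coefficient equals the unscaled coefficient multiplied by that feature's standard deviation. The sketch below demonstrates this on synthetic stand-in data (all names and numbers are hypothetical, not the notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): two features on different scales
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.uniform(1, 9, 300), rng.uniform(40, 99, 300)])
y_demo = -30 + 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + rng.normal(scale=2.0, size=300)

scaler = StandardScaler().fit(X_demo)
X_demo_scaled = scaler.transform(X_demo)

raw_model = LinearRegression().fit(X_demo, y_demo)
scaled_model = LinearRegression().fit(X_demo_scaled, y_demo)

# Scaled coefficient = unscaled coefficient * the feature's standard deviation
print(np.allclose(scaled_model.coef_, raw_model.coef_ * scaler.scale_))

# Both models give the same predictions, so MSE, RMSE, and R² are unchanged
print(np.allclose(raw_model.predict(X_demo), scaled_model.predict(X_demo_scaled)))
```

The notebook's own numbers are consistent with this relationship: for Hours Studied, 7.385592 / 2.852484 ≈ 2.59, which would be the standard deviation the scaler learned for that feature.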

This process reflects the "remember-formulate-predict" approach in machine learning.