In [None]:
#Supervised Learning: Regression Models and Performance Metrics

1. What is Simple Linear Regression (SLR)? Explain its purpose.
  - Simple Linear Regression (SLR) is a statistical method used to model the relationship between two variables — one independent variable (X) and one dependent variable (Y) — by fitting a straight line through the data points.
  - Formula:
$$Y = a + bX + \varepsilon$$

Where:

Y → Dependent (response) variable

X → Independent (predictor) variable

a → Intercept (value of Y when X = 0)

b → Slope (change in Y for one-unit change in X)

ε → Error term (difference between actual and predicted values)

  - Purpose:

The main goal of SLR is to:

Understand the relationship between two variables (how X affects Y).

Predict the value of Y based on a given value of X.

Measure the strength and direction of the relationship (through the slope and correlation).

  - Example:
If you want to predict a student’s exam score (Y) based on the number of study hours (X), simple linear regression can help you find a linear equation that best describes this relationship.

2. What are the key assumptions of Simple Linear Regression?
  - Key Assumptions of Simple Linear Regression

Simple Linear Regression (SLR) is based on several statistical assumptions that must hold true for the model’s results to be valid, reliable, and unbiased. Violating these assumptions can lead to incorrect conclusions or poor predictions.

The main assumptions are explained below:

1. Linearity

Meaning: The relationship between the independent variable (X) and the dependent variable (Y) should be linear.

Explanation: A one-unit change in X is associated with the same expected change in Y at every level of X.

Example: If study hours (X) increase, exam scores (Y) should increase (or decrease) in a straight-line pattern.

How to Check: Scatter plot of X vs Y should show a straight-line trend, not a curve.

2. Independence of Errors

Meaning: The residuals (errors) — the differences between actual and predicted Y values — should be independent of each other.

Explanation: One observation’s error should not influence another’s.

Example: In time-series data (like sales over months), residuals may be correlated — this violates independence.

How to Check: Use the Durbin-Watson test to detect autocorrelation in residuals.

3. Homoscedasticity (Constant Variance of Errors)

Meaning: The variance of residuals should be constant across all levels of X.

Explanation: The spread of residuals should be roughly the same for all predicted values — not wider or narrower at some points.

Violation: When the variance changes with X, it is called heteroscedasticity. The coefficient estimates remain unbiased, but standard errors, confidence intervals, and hypothesis tests become unreliable.

How to Check: Plot residuals vs. fitted values — the spread should look random and uniform.

4. Normality of Errors

Meaning: The residuals should be normally distributed (bell-shaped curve).

Explanation: This assumption is crucial for valid hypothesis testing and confidence intervals.

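Violation: If residuals are heavily skewed or contain outliers, p-values and confidence intervals can be misleading, especially in small samples. Non-normal errors do not by themselves bias the coefficient estimates, although influential outliers can pull the fitted line toward themselves.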

How to Check: Use a histogram, Q-Q plot, or Shapiro–Wilk test on residuals.

5. No Perfect Multicollinearity (not relevant in SLR but conceptually important)

Meaning: In SLR, there is only one independent variable, so multicollinearity doesn’t arise. However, in multiple regression, independent variables should not be highly correlated.

Note: The analogous requirement in SLR is that X takes at least two distinct values; if X is constant, the slope cannot be estimated.

6. No Measurement Error in the Independent Variable

Meaning: The predictor variable (X) should be measured accurately.

Explanation: Errors in X can bias the slope coefficient (b) and weaken the relationship between X and Y.

Example: If "study hours" are self-reported inaccurately, it can distort results.

7. Observations are Randomly Sampled

Meaning: The data should be collected using random sampling methods.

Explanation: This ensures that the results are representative of the population and not biased by selection methods.

Example: If only top-performing students are sampled, the regression will not generalize to all students.
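
As a rough illustration of the residual checks described above, the sketch below fits a least-squares line in plain Python and computes the Durbin-Watson statistic by hand. The data and the helper names (`fit_line`, `durbin_watson`) are made up for illustration; in practice a statistics library such as statsmodels would normally be used for this test.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. It lies between 0 and 4."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical data: study hours and exam scores (illustrative only)
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [44, 51, 54, 61, 64, 71, 74, 81]

a, b = fit_line(hours, scores)
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
dw = durbin_watson(residuals)

print(f"intercept={a:.2f}, slope={b:.2f}")
print(f"Durbin-Watson statistic: {dw:.2f}")
```

A value near 2 suggests little first-order autocorrelation; values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. The same residual list can be plotted against the fitted values to judge homoscedasticity by eye.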

3. Write the mathematical equation for a simple linear regression model and
explain each term.

  - Mathematical Equation of a Simple Linear Regression Model

The general form of a Simple Linear Regression (SLR) model is:

$$Y = a + bX + \varepsilon$$

| **Term**        | **Meaning**                                   | **Explanation**                                                                                                            |
| --------------- | --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- |
| **Y**           | **Dependent Variable (Response Variable)**    | The outcome we are trying to predict or explain (e.g., exam score, sales, height).                                         |
| **X**           | **Independent Variable (Predictor Variable)** | The variable used to predict Y (e.g., study hours, advertising spend, age).                                                |
| **a**           | **Intercept (Constant Term)**                 | The expected value of Y when X = 0. It represents where the regression line crosses the Y-axis.                            |
| **b**           | **Slope Coefficient**                         | It measures the change in Y for a one-unit increase in X. If b = 2, Y increases by 2 units for every 1 unit increase in X. |
| **ε (epsilon)** | **Error Term or Residual**                    | Represents random variation in Y that cannot be explained by X. It accounts for other factors or measurement errors.       |

  - Interpretation Example:

Suppose the regression equation is:

$$\text{Exam Score} = 40 + 5 \times (\text{Study Hours})$$

Here:

a = 40 → Even without studying (0 hours), the expected score is 40.

b = 5 → For every additional hour studied, the score increases by 5 points.

ε → Captures differences due to other factors like motivation or difficulty level.
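
The interpretation above can be sketched as a small function. The name `predict_score` is hypothetical and the coefficients are the example values 40 and 5 from the equation, not estimates from real data.

```python
def predict_score(study_hours, intercept=40.0, slope=5.0):
    """Predicted exam score from the example line: score = 40 + 5 * hours."""
    return intercept + slope * study_hours


# The intercept is the prediction at 0 hours; each extra hour adds the slope.
for hours in (0, 1, 2, 6):
    print(f"{hours} h -> predicted score {predict_score(hours):.0f}")
```

The error term does not appear in the function because it is unobserved; the function returns only the predicted (average) score for a given number of hours.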

4. Provide a real-world example where simple linear regression can be applied.
  - Real-World Example of Simple Linear Regression
Example: Predicting House Prices Based on Size

Scenario:
A real estate company wants to predict the price of a house (Y) based on its size in square feet (X).

Regression Equation:
$$\text{House Price} = a + b \times (\text{Size}) + \varepsilon$$

Where:

Y (House Price): Dependent variable — the value we want to predict.

X (Size): Independent variable — the factor used to make the prediction.

a: Intercept — average price when the size is 0 (base value).

b: Slope — shows how much the price increases for each additional square foot.

ε: Error term — accounts for other factors like location, age, or condition of the house.

Example Interpretation:

Suppose the regression model is:

$$\text{Price} = 50{,}000 + 200 \times (\text{Size})$$

Intercept (50,000): Base price of a house (even if very small).

Slope (200): For every extra square foot, the house price increases by ₹200.

Prediction: A 1,000 sq. ft. house is estimated to cost:

$$50{,}000 + 200 \times 1000 = \text{₹}\,2{,}50{,}000$$

Other Real-World Examples:

Predicting exam scores from study hours.

Estimating sales revenue from advertising spending.

Predicting crop yield based on rainfall.

Estimating car mileage (km/l) from engine size.

5. What is the method of least squares in linear regression?
  - Method of Least Squares in Linear Regression

The method of least squares is a mathematical technique used to find the best-fitting straight line in a simple linear regression model. It determines the line that minimizes the sum of the squared differences (errors) between the actual data points and the values predicted by the line.

Mathematical Form:

The simple linear regression equation is:

$$Y = a + bX + \varepsilon$$

Where:

Y = Dependent variable (value to be predicted)

X = Independent variable (predictor)

a = Intercept (value of Y when X = 0)

b = Slope (change in Y for a one-unit change in X)

ε = Error term (difference between observed and predicted Y)

Least Squares Principle:

The method minimizes the sum of squared residuals:

$$S = \sum_{i=1}^{n} (Y_i - a - bX_i)^2$$

This ensures the total error between actual and predicted values is as small as possible.
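
For readers who want the intermediate step, the minimum of S can be found by setting its partial derivatives with respect to a and b to zero. This is the standard calculus argument, sketched here as a supplement rather than taken from the text above:

$$
\begin{aligned}
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n} (Y_i - a - bX_i) = 0
&&\Rightarrow\quad \sum Y = na + b\sum X \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} X_i\,(Y_i - a - bX_i) = 0
&&\Rightarrow\quad \sum XY = a\sum X + b\sum X^2
\end{aligned}
$$

Solving these two linear equations (the normal equations) for a and b gives the closed-form least-squares estimates.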

Formulas for Coefficients:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$

$$a = \bar{Y} - b\bar{X}$$

Where:

$\bar{X}$ = Mean of X values

$\bar{Y}$ = Mean of Y values

$n$ = Number of observations

Example:

If we study how study hours (X) affect exam scores (Y), the least squares method gives a line like:

$$\hat{Y} = 40 + 5X$$

Meaning: for every extra study hour, the predicted score increases by 5 marks.
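
The coefficient formulas can be checked with a few lines of plain Python. This is a minimal sketch: the function name `least_squares` and the data are hypothetical, with the points chosen to lie exactly on the line Y = 40 + 5X so the result is easy to verify by hand.

```python
def least_squares(xs, ys):
    """Intercept a and slope b from the raw-sum formulas for simple linear regression."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denominator = n * sum_x2 - sum_x ** 2
    if denominator == 0:
        raise ValueError("X must take at least two distinct values")

    b = (n * sum_xy - sum_x * sum_y) / denominator
    a = sum_y / n - b * (sum_x / n)   # a = mean(Y) - b * mean(X)
    return a, b


# Hypothetical data: study hours (X) and exam scores (Y)
X = [1, 2, 3, 4, 5]
Y = [45, 50, 55, 60, 65]   # lies exactly on Y = 40 + 5X

a, b = least_squares(X, Y)
print(f"a = {a}, b = {b}")   # a = 40.0, b = 5.0
```

With noisy real data the points will not sit exactly on a line, and the same function returns the line that minimizes the sum of squared residuals.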

6. What is Logistic Regression? How does it differ from Linear Regression?
  - Logistic Regression is a statistical and machine learning technique used to predict categorical outcomes, especially binary outcomes (like Yes/No, 0/1, Pass/Fail).

It estimates the probability that a given input (X) belongs to a particular category (Y).

Mathematical Form:

Unlike linear regression, logistic regression does not model Y directly.
Instead, it models the probability (p) that Y = 1 using the logistic (sigmoid) function:

$$p = \frac{1}{1 + e^{-(a + bX)}}$$


Taking the log of the odds (called logit transformation):

$$\log\left(\frac{p}{1 - p}\right) = a + bX$$

Where:

p → Probability that Y = 1

a → Intercept

b → Coefficient for predictor X

e → Base of natural logarithm (~2.718)

Purpose:

Used when the dependent variable is categorical (usually binary).

Predicts the likelihood or probability of an event occurring.

Common in classification problems (e.g., spam detection, disease prediction, etc.)

Difference Between Logistic and Linear Regression

| Basis | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Type of Output | Continuous (e.g., marks, salary, temperature) | Categorical (e.g., Yes/No, 0/1, Pass/Fail) |
| Equation Form | $Y = a + bX$ | $\log\left(\frac{p}{1 - p}\right) = a + bX$ |
| Predicted Values | Any real number (−∞ to +∞) | Probability between 0 and 1 |
| Error Measurement | Uses least squares method | Uses maximum likelihood estimation (MLE) |
| Line Type | Straight line | S-shaped (sigmoid) curve |
| Use Case | Regression problems (predicting quantity) | Classification problems (predicting class or probability) |

Example:

Linear Regression: Predicting a student’s exam score based on study hours.
→ Output: 85 marks

Logistic Regression: Predicting whether a student passes or fails based on study hours.
→ Output: Probability (e.g., 0.85 = 85% chance of passing)
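
To make the sigmoid and logit relationship concrete, here is a small sketch using only the standard library. The coefficients a = -4 and b = 1 are hypothetical values chosen for illustration, not fitted estimates, and the 0.5 cut-off for a pass/fail label is a common convention rather than part of the model.

```python
import math


def predict_probability(x, a, b):
    """Logistic model: p = 1 / (1 + exp(-(a + b*x)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def log_odds(p):
    """Logit transformation: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration
a, b = -4.0, 1.0

for hours in (2, 4, 6):
    p = predict_probability(hours, a, b)
    label = "pass" if p >= 0.5 else "fail"
    print(f"{hours} h -> p(pass) = {p:.3f}, "
          f"log-odds = {log_odds(p):.2f}, predicted: {label}")
```

The log-odds column is linear in study hours (it equals a + bX), while the probability column follows the S-shaped curve and always stays between 0 and 1.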

7. Name and briefly describe three common evaluation metrics for regression models.
  - Three Common Evaluation Metrics for Regression Models

Regression models are evaluated by measuring how well predicted values match the actual values.
Below are three widely used metrics:

1. Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|$$

Meaning: It calculates the average of absolute differences between actual values ($Y_i$) and predicted values ($\hat{Y}_i$).

Interpretation: Lower MAE means better accuracy.

Advantage: Easy to interpret — shows the average prediction error in the same units as the target variable.

2. Mean Squared Error (MSE)
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

Meaning: It measures the average squared difference between actual and predicted values.

Interpretation: Penalizes larger errors more heavily (due to squaring).

Advantage: Good for detecting large deviations in predictions.

3. R-squared (Coefficient of Determination)
$$R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

Meaning: It represents the proportion of variance in the dependent variable that is explained by the model.

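Range: 0 ≤ R² ≤ 1 for a least-squares fit with an intercept, evaluated on the data it was fitted to. On new data, or for a model without an intercept, R² can be negative.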

Interpretation:

R² = 1 → Perfect fit

R² = 0 → Model explains no variability

Advantage: Gives an overall measure of how well the regression line fits the data.
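
The three metrics can be computed directly from their definitions. The sketch below uses plain Python and made-up actual and predicted values; the function name `regression_metrics` is hypothetical.

```python
def regression_metrics(actual, predicted):
    """Return (MAE, MSE, R^2) computed from their definitions."""
    n = len(actual)
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n

    mean_y = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)               # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in actual)    # total variation
    r2 = 1 - ss_res / ss_tot
    return mae, mse, r2


# Hypothetical actual and predicted exam scores
actual = [50, 60, 70, 80]
predicted = [52, 58, 73, 79]

mae, mse, r2 = regression_metrics(actual, predicted)
print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, R^2 = {r2:.3f}")
# MAE = 2.00, MSE = 4.50, R^2 = 0.964
```

MAE is in the same units as the target (marks), while MSE is in squared units, which is why the single 3-mark error contributes 9 of the 18 squared-error total.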

8. What is the purpose of the R-squared metric in regression analysis?
  - Purpose of the R-squared Metric in Regression Analysis

The R-squared (R²) metric, also known as the coefficient of determination, is used to measure how well a regression model explains the variability of the dependent variable (Y) based on the independent variable(s) (X).

Mathematical Formula:
$$R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$$

Where:

$Y_i$ = Actual values

$\hat{Y}_i$ = Predicted values

$\bar{Y}$ = Mean of actual values

Purpose and Interpretation:

Explains Goodness of Fit:
R² shows how well the regression line fits the data. It tells us the proportion of total variation in Y that is explained by X.

Measures Predictive Power:
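A higher R² means the model accounts for more of the variation in the data it was fitted to. It does not by itself guarantee accurate predictions on new data, so it is usually checked on a held-out test set as well.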

Interpretation Range:

R² = 1: Perfect fit — model explains 100% of variation in Y.

R² = 0: Model explains none of the variation (no linear relationship).

Example: R² = 0.80 → 80% of the variation in Y is explained by X.

Example:

If we build a regression model to predict house prices from house size,
and the R² value = 0.85, it means 85% of the variation in house prices can be explained by house size.
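
The reference points of R² can be illustrated with three hypothetical sets of predictions for the same house prices (in thousands). The numbers are invented for illustration, and the function name `r_squared` is an assumption. The third case shows that predictions worse than simply guessing the mean produce a negative value, which can happen when a model is scored on data it was not fitted to.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical house prices in thousands
prices = [200, 250, 300, 350, 400]

perfect = list(prices)                      # every prediction is exact
mean_only = [300] * len(prices)             # always predict the mean price
reversed_order = prices[::-1]               # systematically wrong predictions

print(r_squared(prices, perfect))           # 1.0
print(r_squared(prices, mean_only))         # 0.0
print(r_squared(prices, reversed_order))    # -3.0
```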

In [6]:
#9. Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (replace with your actual data)
# X: Independent variable (e.g., study hours)
# Y: Dependent variable (e.g., exam scores)
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Reshape for scikit-learn
Y = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 75])

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, Y)

# Print the slope (coefficient) and intercept
print(f"Slope (coefficient): {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")

Slope (coefficient): 5.00
Intercept: 25.00


10. How do you interpret the coefficients in a simple linear regression model?
  - Interpretation of Coefficients in a Simple Linear Regression Model

The simple linear regression equation is:

$$Y = a + bX + \varepsilon$$

Where:

Y → Dependent variable (the outcome being predicted)

X → Independent variable (the predictor)

a → Intercept (constant term)

b → Slope (regression coefficient)

ε → Error term (random noise)

1. Intercept (a)

Meaning: The value of Y when X = 0.

It represents the starting point or baseline level of the dependent variable.

Example:
In the model

$$\text{Exam Score} = 40 + 5(\text{Study Hours})$$

the intercept (a = 40) means that even if a student studies 0 hours, their expected score is 40 marks.

2. Slope (b)

Meaning: The rate of change in Y for every one-unit increase in X.

The sign of b gives the direction of the relationship, and its magnitude gives the size of the change in Y per unit of X:

Positive b: As X increases, Y increases.

Negative b: As X increases, Y decreases.

Example:
In the same model,
$b = 5$ means that for every additional hour studied, the exam score increases by 5 marks on average.

3. Error Term (ε)

Represents the difference between the actual and predicted Y values.

Captures effects of other variables not included in the model.