## Supervised Learning: Regression Models and Performance Metrics

Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

* Simple Linear Regression (SLR) is a statistical method used to study the relationship between two continuous variables — one independent variable (X) and one dependent variable (Y).
It helps us understand how the dependent variable changes when the independent variable changes.

* The relationship between X and Y is represented by a straight line, called the regression line, given by the equation:
* $Y = \beta_0 + \beta_1 X + \varepsilon$

Where:

* Y = Dependent variable (the one we want to predict)

* X = Independent variable (the predictor)

* β₀ = Intercept (value of Y when X = 0)

* β₁ = Slope (rate of change of Y for a one-unit change in X)

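* ε = Error term (the part of Y that the line does not explain; in a fitted model it is estimated by the residual, the difference between actual and predicted values)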

**Example:**
* Suppose we want to predict a student’s marks (Y) based on the number of study hours (X). If the regression equation is

* **Marks = 20 + 5 × (Study Hours)**

* then for every extra hour of study, marks increase by 5 on average, and a student who studies 0 hours is predicted to score 20.
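
As a minimal sketch, the example equation can be written as a small Python function. The name `predict_marks` and the sample study hours are illustrative choices, not part of the original example.

```python
# Coefficients taken from the example equation: Marks = 20 + 5 × (Study Hours)
INTERCEPT = 20  # predicted marks at 0 study hours
SLOPE = 5       # extra marks per additional study hour


def predict_marks(study_hours):
    """Return the marks predicted by the example regression line."""
    return INTERCEPT + SLOPE * study_hours


# Illustrative inputs (hypothetical study hours)
for hours in [0, 2, 4]:
    print(hours, "hours ->", predict_marks(hours), "marks")
```

Moving from 2 to 4 study hours raises the prediction by 2 × 5 = 10 marks, which is the slope at work.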

* **Purpose of Simple Linear Regression:**
* The main purposes of SLR are:

* **Prediction:**

* To predict the value of one variable based on another.

* Example: Predict sales based on advertising expenditure.

* **Understanding Relationships:**

* To measure how strongly two variables are related and in which direction (positive or negative).

* **Trend Analysis:**

* To find trends or patterns in data over time.

* **Decision Making:**

* Helps businesses, researchers, and analysts make data-driven decisions.

Question 2: What are the key assumptions of Simple Linear Regression?

* Simple Linear Regression (SLR) is based on certain key assumptions to ensure that the model gives accurate and reliable results.
If these assumptions are violated, the predictions and interpretations of the model may not be valid.

* **1. Linearity:**

* There should be a linear relationship between the independent variable (X) and the dependent variable (Y).

* The change in Y should be proportional to the change in X.

**Example:**
* If study hours increase, marks should increase (or decrease) in a roughly straight-line pattern.

* **2. Independence of Errors:**

* The residuals (errors) should be independent of each other.

* This means that the error made for one observation should not influence the error made for another.

**Example:**
* In time-series data, this assumption is often violated, because the error at one time point tends to be related to the error at the previous one.

* **3. Homoscedasticity (Constant Variance of Errors):**

* The variance of errors should be constant for all values of X.

* In other words, the spread of residuals should be the same across all predicted values.

* If the spread increases or decreases, it indicates heteroscedasticity, which violates this assumption.

* **4. Normality of Errors:**

* The residuals (differences between actual and predicted values) should be normally distributed.

* This assumption matters mainly for statistical inference, such as confidence intervals and hypothesis tests on the coefficients.

* **5. No Multicollinearity (for multiple regression):**

* This assumption applies to Multiple Linear Regression, where the independent variables should not be highly correlated with one another.

* In SLR, since there is only one X, this assumption is automatically satisfied.

* **6. No Autocorrelation:**

* This is the time-ordered form of the independence assumption: in time-based data, the residuals should not show any pattern or correlation over time.

* If errors are correlated (for example, a run of positive residuals followed by a run of negative ones), the model violates this assumption.
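
The residual-based assumptions above can be checked numerically once a line has been fitted. The sketch below uses plain Python and a small illustrative dataset (the values are an assumption for demonstration). It computes the residuals, the Durbin-Watson statistic (values near 2 suggest little first-order autocorrelation; the statistic ranges from 0 to 4), and a rough comparison of residual spread at low and high X. With only five points the numbers are purely illustrative.

```python
# Illustrative data, assumed to be listed in time order and sorted by x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Closed-form least-squares estimates for one predictor
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

# Residual = actual value minus fitted value
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Durbin-Watson statistic: sum of squared successive differences / sum of squares
dw = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, n)) / sum(
    e ** 2 for e in residuals
)

# Rough homoscedasticity check: mean squared residual in the lower vs upper part of x
half = n // 2
spread_low = sum(e ** 2 for e in residuals[:half]) / half
spread_high = sum(e ** 2 for e in residuals[-half:]) / half

print("Residuals:", [round(e, 2) for e in residuals])
print("Durbin-Watson:", round(dw, 3))
print("Mean squared residual (low x, high x):", round(spread_low, 3), round(spread_high, 3))
```

In practice these checks are usually paired with a residual plot, since a visible curve or funnel shape is easier to spot by eye than from summary numbers.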

Question 3: Write the mathematical equation for a simple linear regression model and explain each term.

* **1. Mathematical Equation:**

* The mathematical form of a Simple Linear Regression (SLR) model is:
* $Y = \beta_0 + \beta_1 X + \varepsilon$

* **2. Explanation of Each Term:**

* **1. Y (Dependent Variable):**

* The variable we want to predict or explain.

* Example: Sales, Marks, Height, etc.

* **2. X (Independent Variable):**

* The variable used to predict the value of Y.

* Example: Advertising cost, Study hours, Age, etc.

* **3. β₀ (Beta Zero) – Intercept:**

* It represents the value of Y when X = 0.

* In other words, it is the point where the regression line crosses the Y-axis.

* Example: If study hours = 0, then β₀ shows the base marks a student might get.

* **4. β₁ (Beta One) – Slope Coefficient:**

* It shows the change in Y for a one-unit change in X.

* If β₁ = 5, it means that when X increases by 1 unit, Y increases by 5 units (assuming a positive relationship).

* **5. ε (Error Term / Residual):**

* Represents the difference between actual and predicted values of Y.

* It captures all other factors that affect Y but are not included in the model.

Question 4: Provide a real-world example where simple linear regression can be applied.

* **1. Real-World Example:**
* **Predicting Sales based on Advertising Expenditure**

* In a business scenario, companies often want to know how advertising spending affects sales.
Simple Linear Regression can be used to model this relationship and predict future sales.

* **2. Explanation:**

* Let’s assume a company collects the following data:

| Advertising (₹ in lakhs) | Sales (₹ in lakhs) |
| ------------------------ | ------------------ |
| 1                        | 10                 |
| 2                        | 20                 |
| 3                        | 28                 |
| 4                        | 40                 |
| 5                        | 50                 |

* By applying Simple Linear Regression (least squares) to this data, the company gets the equation:
* **Sales = −0.4 + 10 × (Advertising)**

* **3. Interpretation:**

* Intercept (β₀ = −0.4):
* When advertising is ₹0, the line predicts sales of −₹0.4 lakhs. Negative sales are not meaningful: X = 0 lies outside the observed range (₹1 to ₹5 lakhs), so here the intercept only positions the line.

* Slope (β₁ = 10):
* For every additional ₹1 lakh spent on advertising, sales increase by ₹10 lakhs on average.

* So, if the company plans to spend ₹6 lakhs on advertising, the predicted sales will be −0.4 + 10 × 6 = ₹59.6 lakhs.

* **4. Other Real-World Examples:**

* Predicting house prices based on area (in sq. ft).

* Estimating student marks based on study hours.

* Predicting electricity consumption based on temperature.

* Forecasting salary based on years of experience.
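
The coefficients for the advertising example can be checked directly from its table. The following is a minimal sketch in plain Python using the closed-form least-squares formulas for one predictor; the variable names are illustrative. It prints the slope, the intercept, and the prediction for a ₹6 lakh spend, so the numbers can be compared with the equation quoted in the interpretation.

```python
# Data from the advertising table (both columns in ₹ lakhs)
advertising = [1, 2, 3, 4, 5]
sales = [10, 20, 28, 40, 50]

n = len(advertising)
x_mean = sum(advertising) / n
y_mean = sum(sales) / n

# Least-squares slope and intercept for a single predictor
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(advertising, sales))
s_xx = sum((x - x_mean) ** 2 for x in advertising)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Prediction for a planned spend of ₹6 lakhs
planned_spend = 6
predicted_sales = intercept + slope * planned_spend

print("Slope:", round(slope, 2))
print("Intercept:", round(intercept, 2))
print("Predicted sales at 6 lakhs:", round(predicted_sales, 2))
```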

Question 5: What is the method of least squares in linear regression?

* The Method of Least Squares is a mathematical technique used in Linear Regression to find the best-fitting line through a set of data points.
It determines the line (regression line) that minimizes the sum of the squares of the errors (residuals) — i.e., the difference between the actual and predicted values of the dependent variable.

* **Concept Explanation:**

* In Simple Linear Regression, the equation of a straight line is:

* $Y = \beta_0 + \beta_1 X + \varepsilon$

Where:

* Y = Dependent variable

* X = Independent variable

* β₀ = Intercept

* β₁ = Slope

* ε = Error (difference between actual and predicted value)

* For each observation i, the error (residual) is:
* $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$
* The Method of Least Squares minimizes the sum of squared errors (SSE):
* $S = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$

* **Objective:**

* Find the values of β₀ and β₁ that make the sum of squared differences between the actual ($Y$) and predicted ($\hat{Y}$) values as small as possible.
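
As a sketch of how this minimization works out (standard calculus, not spelled out in the text above), setting the partial derivatives of S with respect to β₀ and β₁ to zero gives two equations. The bars ($\bar{X}$, $\bar{Y}$ for sample means) and hats ($\hat{\beta}_0$, $\hat{\beta}_1$ for the estimates) are notation introduced here.

$$
\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i) = 0,
\qquad
\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} X_i\,(Y_i - \beta_0 - \beta_1 X_i) = 0
$$

Solving them together yields the closed-form estimates:

$$
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
$$

For example, with X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5 (the data used in Question 9), $\bar{X} = 3$ and $\bar{Y} = 4$, the numerator is 6 and the denominator is 10, so $\hat{\beta}_1 = 0.6$ and $\hat{\beta}_0 = 4 - 0.6 \times 3 = 2.2$.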

Question 6: What is Logistic Regression? How does it differ from Linear Regression?

* **1. Definition of Logistic Regression:**

* Logistic Regression is a statistical method used to predict the probability of a binary outcome (two possible results) based on one or more independent variables.

* It is mainly used when the dependent variable is categorical, such as:

* Yes / No

* Pass / Fail

* 0 / 1

* Disease / No Disease
* **2. Equation of Logistic Regression:**

* Instead of fitting a straight line, Logistic Regression fits an S-shaped curve (Sigmoid Function) to model the probability of an event.

* The equation is:
* $P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$

* **3. Differences between Linear and Logistic Regression:**

| **Basis**         | **Linear Regression**                             | **Logistic Regression**                                        |
| ----------------- | ------------------------------------------------- | -------------------------------------------------------------- |
| **Purpose**       | Predicts continuous values (e.g., marks, salary). | Predicts categorical outcomes (e.g., yes/no, 0/1).             |
| **Output Range**  | Output can be any real number (−∞ to +∞).         | Output is between 0 and 1 (probability).                       |
| **Equation Type** | Straight line: $Y = \beta_0 + \beta_1 X$          | Sigmoid curve: $P(Y=1) = \frac{1}{1+e^{-(\beta_0+\beta_1 X)}}$ |
| **Error Type**    | Measured by Mean Squared Error (MSE).             | Measured by Log-Loss or Cross-Entropy.                         |
| **Use Case**      | Regression problems (predicting amounts).         | Classification problems (predicting classes).                  |
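
To make the sigmoid equation concrete, here is a minimal sketch in plain Python. The coefficients `beta_0 = -4.0` and `beta_1 = 1.0`, the pass/fail framing, and the 0.5 classification threshold are all hypothetical choices for illustration, not values fitted from data.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: returns a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, chosen only for illustration
beta_0 = -4.0  # intercept
beta_1 = 1.0   # change in log-odds per extra study hour


def pass_probability(study_hours):
    """P(Y = 1) for a given X under the logistic model."""
    return sigmoid(beta_0 + beta_1 * study_hours)


for hours in [0, 4, 8]:
    p = pass_probability(hours)
    label = 1 if p >= 0.5 else 0  # common 0.5 threshold, an assumption here
    print(hours, "hours -> P(pass) =", round(p, 3), "-> class", label)
```

Unlike the straight line of linear regression, the output never leaves the interval (0, 1), which is what allows it to be read as a probability.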


Question 7: Name and briefly describe three common evaluation metrics for regression models.

* To measure how well a regression model performs, we use certain evaluation metrics that compare the predicted values with the actual values.
Here are three commonly used evaluation metrics for regression models:

* **1. Mean Absolute Error (MAE):**

* MAE measures the average absolute difference between the predicted and actual values.
* $MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert$

* **2. Mean Squared Error (MSE):**

* MSE measures the average of the squared differences between predicted and actual values.
* $MSE = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
* Example:
If MSE = 9, the average squared error between predicted and actual values is 9; its square root (RMSE) is 3, which is in the same units as Y.

* **3. R-Squared (Coefficient of Determination):**

* R² measures how well the regression model explains the variability of the dependent variable.
* $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
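
The three metrics can be computed by hand in a few lines. This sketch uses plain Python with the actual and predicted values from the Question 9 example; the variable names are illustrative.

```python
# Actual values and fitted predictions from the Question 9 example
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

# Mean Absolute Error and Mean Squared Error
mae = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n

# R² = 1 - SS_res / SS_tot
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print("MAE:", round(mae, 3))
print("MSE:", round(mse, 3))
print("R²:", round(r2, 3))
```

For this data the values work out to MAE = 0.64, MSE = 0.48, and R² = 0.6.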

Question 8: What is the purpose of the R-squared metric in regression analysis?

* R-squared (R²), also called the Coefficient of Determination, is a statistical measure that indicates how well the independent variable(s) explain the variability in the dependent variable in a regression model.

* It tells us how well the regression line fits the data.
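* Formula:
* $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, where $SS_{res} = \sum (Y_i - \hat{Y}_i)^2$ is the residual sum of squares and $SS_{tot} = \sum (Y_i - \bar{Y})^2$ is the total sum of squares.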

* **Interpretation:**

* R² = 0: The model explains none of the variability in the dependent variable.

* R² = 1: The model explains all the variability perfectly.

* 0 < R² < 1: The model explains part of the variability in the dependent variable.

* **Purpose of R-squared:**

* **1. Measure of Goodness of Fit:**

* Shows how well the regression line fits the actual data points.

* **2. Model Performance Evaluation:**

* Helps compare different regression models fitted to the same data: a higher R² usually means a better fit.

* **3. Explains Predictive Power:**

* Indicates how much of the outcome variation is explained by the predictors.

* **4. Decision Making:**

* Helps analysts and researchers decide whether the model is suitable for prediction.

* **Limitations:**

* A high R² does not always mean the model is good — it may overfit.

* R² cannot tell whether the relationship is causal or not.
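
The boundary cases in the interpretation can be reproduced with a short sketch. It uses plain Python; the helper name `r_squared` and all three prediction lists are hypothetical. It also shows a property of the formula worth knowing: when predictions are worse than simply predicting the mean of Y, R² computed this way becomes negative.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [2, 4, 5, 4, 5]  # illustrative values
mean_value = sum(actual) / len(actual)

perfect = list(actual)                  # predictions equal to the actual values
mean_only = [mean_value] * len(actual)  # always predict the mean of Y
poor = [5, 4, 2, 5, 4]                  # hypothetical predictions worse than the mean

print("Perfect predictions:", r_squared(actual, perfect))
print("Mean-only predictions:", r_squared(actual, mean_only))
print("Poor predictions:", round(r_squared(actual, poor), 3))
```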



Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

```python
# Import necessary libraries
import numpy as np
from sklearn.linear_model import LinearRegression

# Example dataset
# X -> Independent variable (2D array)
# Y -> Dependent variable
X = np.array([[1], [2], [3], [4], [5]])   # Study Hours
Y = np.array([2, 4, 5, 4, 5])             # Marks

# Create a Linear Regression model
model = LinearRegression()

# Fit the model to the data
model.fit(X, Y)

# Print the slope (coefficient) and intercept
print("Slope (β₁):", model.coef_[0])
print("Intercept (β₀):", model.intercept_)

# Predicting values (optional)
Y_pred = model.predict(X)
print("Predicted Values:", Y_pred)
```

Output:

```
Slope (β₁): 0.6
Intercept (β₀): 2.2
Predicted Values: [2.8 3.4 4.  4.6 5.2]
```


Question 10: How do you interpret the coefficients in a simple linear regression model?

* In a Simple Linear Regression (SLR) model, the equation is:
* $Y = \beta_0 + \beta_1 X + \varepsilon$

* **1. Intercept (β₀):**

* The intercept is the value of Y when X = 0.

* Interpretation: It represents the baseline value of the dependent variable.

* Example:
* If we are predicting student marks based on study hours, and β₀ = 20, this means that a student who studies 0 hours is expected to score 20 marks on average.

* **2. Slope (β₁):**

* The slope indicates the change in Y for a one-unit increase in X.

* **Interpretation:**

* Positive β₁ → Y increases as X increases.

* Negative β₁ → Y decreases as X increases.

* Example:
* If β₁ = 5, it means that for every additional hour of study, the student’s marks increase by 5 on average.

* **3. Combined Interpretation:**

* The fitted regression line $\hat{Y} = \beta_0 + \beta_1 X$ shows how X influences Y.

* By looking at the coefficients, we can:

* Quantify the effect of X on Y.

* Make predictions using the equation.