#1) What is Simple Linear Regression (SLR)? Explain its purpose.

- **Simple Linear Regression (SLR)** is a **statistical method** used to study the **relationship between two variables** — one **independent variable (X)** and one **dependent variable (Y)**.

**Definition**

Simple Linear Regression estimates how the dependent variable $Y$ changes as the independent variable $X$ changes.
It fits a **straight line** (called the regression line) through the data points to predict $Y$ from $X$.

The equation for SLR is:

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$

Where:

* $Y$ = Dependent variable (the one we’re trying to predict)
* $X$ = Independent variable (the predictor)
* $\beta_0$ = Intercept (expected value of Y when X = 0)
* $\beta_1$ = Slope (expected change in Y for each one-unit change in X)
* $\varepsilon$ = Error term (difference between the actual value of Y and the value given by the line $\beta_0 + \beta_1 X$)

**Purpose of Simple Linear Regression**

1. **Prediction:**
   To predict the value of one variable based on the value of another.
   Example: Predicting **house price (Y)** based on **size (X)**.

2. **Understanding relationships:**
   To determine whether and how strongly two variables are related.
   Example: Understanding the relationship between **advertising spend** and **sales**.

3. **Trend analysis:**
   To identify general trends in data — for instance, if sales increase linearly with marketing expenditure.

**Example**

Suppose we have data on hours studied (X) and exam score (Y).
If the regression line comes out as:

$$
Y = 30 + 5X
$$

This means:

* Base score (when hours studied = 0) is 30.
* For every additional hour studied, the score increases by 5 points.

**Key Assumptions**

1. Linear relationship between X and Y
2. Independence of observations
3. Constant variance of errors (homoscedasticity)
4. Errors are normally distributed


#2) What are the key assumptions of Simple Linear Regression?

- The **key assumptions of Simple Linear Regression (SLR)** ensure that the model’s results are **valid, reliable, and interpretable**.


1. **Linearity**

* The relationship between the **independent variable (X)** and the **dependent variable (Y)** is **linear**.
* This means changes in $Y$ are proportional to changes in $X$.
* **Check:** Scatter plot of X vs. Y should show a roughly straight-line pattern.

*Example:* If each additional unit of X changes Y by roughly the same amount, whatever the starting value of X, the relationship is likely linear.


2. **Independence of Errors**

* The **residuals (errors)** — the differences between actual and predicted Y — should be **independent** of each other.
* No pattern or correlation should exist among errors.

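*Why it matters:* If errors are correlated (e.g., in time-series data), the coefficient estimates typically remain unbiased, but their standard errors become unreliable, so confidence intervals and significance tests can be misleading.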

3. **Homoscedasticity (Constant Variance of Errors)**

* The **variance of residuals** should be **constant** across all levels of X.
* In other words, the spread of errors should be uniform along the regression line.

*Check:* A plot of residuals vs. predicted values should show random scatter (no funnel shape).


4. **Normality of Errors**

* The residuals should be **normally distributed** around the regression line.
* This assumption is especially important for hypothesis testing and constructing confidence intervals.

*Check:* Histogram or Q-Q plot of residuals should look approximately normal.

5. **No Measurement Error in X**

* The independent variable (X) is assumed to be measured **without error**.
* In practice, small measurement errors are okay, but large ones can distort the relationship.

**Summary Table**

| **Assumption**            | **Meaning**                                   | **How to Check**                  |
| ------------------------- | --------------------------------------------- | --------------------------------- |
| Linearity                 | Relationship between X and Y is straight-line | Scatter plot                      |
| Independence of errors    | Residuals not correlated                      | Durbin–Watson test, residual plot |
| Homoscedasticity          | Equal variance of errors                      | Residuals vs. fitted values plot  |
| Normality of errors       | Errors are normally distributed               | Q-Q plot, histogram               |
| No measurement error in X | X values are accurate                         | Data collection methods           |
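Some of the checks in the table can be sketched numerically. The example below is a minimal illustration, not a full diagnostic suite: it reuses the five data points and the fitted line (slope 0.6, intercept 2.2) from the scikit-learn example at the end of these notes, computes the residuals, and evaluates the Durbin-Watson statistic by hand. The helper name `durbin_watson` is chosen here for illustration; values near 2 are conventionally read as little first-order autocorrelation.

```python
# Data and fitted coefficients from the scikit-learn example in these notes
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = 2.2, 0.6

# Residual = actual Y minus the value predicted by the fitted line
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def durbin_watson(res):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    denominator = sum(e ** 2 for e in res)
    return numerator / denominator


print("residuals:", [round(e, 3) for e in residuals])
print("mean residual:", round(sum(residuals) / len(residuals), 6))
print("Durbin-Watson:", round(durbin_watson(residuals), 3))
```

With only five points this is purely a demonstration of the arithmetic; real diagnostics need far more data and are usually paired with the plots listed in the table.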



#3) Write the mathematical equation for a simple linear regression model and explain each term.

- **Equation:**

$$
Y = \beta_0 + \beta_1 X + \varepsilon
$$


**Explanation of Each Term:**

| **Term** | **Meaning** | **Description** |
| -------- | ----------- | --------------- |
| $Y$ | **Dependent Variable** | The variable we want to predict or explain (e.g., sales, marks, salary). |
| $X$ | **Independent Variable** | The predictor or explanatory variable used to predict Y (e.g., advertising spend, hours studied, experience). |
| $\beta_0$ | **Intercept (Constant Term)** | The expected value of Y when X = 0. It represents where the regression line crosses the Y-axis. |
| $\beta_1$ | **Slope Coefficient** | The amount by which Y changes for a **one-unit increase** in X. Its sign gives the direction of the relationship and its magnitude gives the rate of change. |
| $\varepsilon$ | **Error Term** | The difference between the **actual value** of Y and the value given by the line $\beta_0 + \beta_1 X$. It captures random variation and factors not included in the model; its sample estimate is the residual. |

**Example:**

Suppose the regression equation is:
$$
Y = 25 + 4X
$$

This means:

* Intercept ($\beta_0 = 25$): When X = 0, the predicted Y = 25.
* Slope ($\beta_1 = 4$): For every 1-unit increase in X, Y increases by 4 units.

*Example interpretation:*
If X = hours studied and Y = exam score, then each additional hour studied increases the expected score by 4 marks.



#4) Provide a real-world example where simple linear regression can be applied.

- **Example: Predicting House Prices Based on Size**

**Scenario:**
A real estate analyst wants to predict the **price of a house (Y)** based on its **size in square feet (X)**.

**Step 1: Define the Variables**

* **Dependent Variable (Y):** House Price (in ₹ or $)
* **Independent Variable (X):** House Size (in square feet)

**Step 2: Model Equation**

$$
\text{Price} = \beta_0 + \beta_1(\text{Size}) + \varepsilon
$$

Where:

* $\beta_0$: Base price (price when size = 0, often theoretical)
* $\beta_1$: Price increase per additional square foot
* $\varepsilon$: Error term (factors like location, condition, etc.)

**Step 3: Example Outcome**

Suppose after fitting the regression model, we get:
$$
\text{Price} = 5{,}00{,}000 + 3{,}000 \times (\text{Size})
$$

This means:

* The base price of any property (intercept) = ₹5,00,000
* For every additional **1 sq. ft.**, the price increases by **₹3,000**

**Step 4: Prediction Example**

If a house is **1,200 sq. ft**, then:
$$
\text{Predicted Price} = 5{,}00{,}000 + 3{,}000 \times 1{,}200 = \text{₹}41{,}00{,}000
$$
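The same prediction can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and the coefficients are the ones assumed in Step 3, not estimates from real market data.

```python
# Coefficients from the fitted line in Step 3 (illustrative values, in rupees)
BASE_PRICE = 500_000       # intercept: ₹5,00,000
PRICE_PER_SQFT = 3_000     # slope: ₹3,000 per additional square foot


def predicted_price(size_sqft):
    """Evaluate Price = intercept + slope * Size for a given size."""
    return BASE_PRICE + PRICE_PER_SQFT * size_sqft


print(predicted_price(1200))  # 4100000, i.e. ₹41,00,000
```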

**Why Use SLR Here**

* The relationship between **size** and **price** is approximately **linear**.
* It helps in **estimating market prices** and **making investment decisions**.

**Other Real-World Examples**

| **Scenario**               | **Dependent Variable (Y)** | **Independent Variable (X)** |
| -------------------------- | -------------------------- | ---------------------------- |
| Predicting student scores  | Exam Marks                 | Hours Studied                |
| Predicting sales           | Monthly Sales              | Advertising Budget           |
| Predicting crop yield      | Yield (kg)                 | Rainfall (mm)                |
| Predicting fuel efficiency | Mileage (km/l)             | Engine Size (cc)             |



#5) What is the method of least squares in linear regression?

- The **Method of Least Squares** is the **most common technique** used to find the **best-fitting line** in a **linear regression model** — that is, the line that best represents the relationship between the independent variable (X) and the dependent variable (Y).

**Definition**

The **method of least squares** determines the regression line by **minimizing the sum of the squared differences** between the **actual values ($Y$)** and the **predicted values ($\hat{Y}$)** from the line.

In simple terms:
It finds the line where the total error between the observed and predicted points is **as small as possible**.

**Mathematical Idea**

For each data point $(X_i, Y_i)$, the model predicts:
$$
\hat{Y}_i = \beta_0 + \beta_1 X_i
$$

The **residual (error)** for each point is:
$$
e_i = Y_i - \hat{Y}_i
$$

The **method of least squares** minimizes the **sum of squared residuals (errors):**
$$
S = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2
$$

We find $\beta_0$ and $\beta_1$ such that $S$ is **as small as possible**.

**Formulas for Coefficients**

By minimizing $S$, we derive the formulas:

$$
\beta_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
$$

$$
\beta_0 = \bar{Y} - \beta_1 \bar{X}
$$

Where:

* $\bar{X}$ = Mean of X values
* $\bar{Y}$ = Mean of Y values
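For readers who want the intermediate step, the formulas follow from setting the two partial derivatives of the sum of squared residuals S to zero. The block below is a sketch of that standard calculus argument, using the same symbols as the formulas above.

```latex
\begin{aligned}
\frac{\partial S}{\partial \beta_0}
  &= -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0
  \;\Rightarrow\; \beta_0 = \bar{Y} - \beta_1 \bar{X}, \\
\frac{\partial S}{\partial \beta_1}
  &= -2 \sum X_i \,(Y_i - \beta_0 - \beta_1 X_i) = 0 .
\end{aligned}
% Substituting \beta_0 = \bar{Y} - \beta_1 \bar{X} into the second equation:
\sum X_i \,\bigl[(Y_i - \bar{Y}) - \beta_1 (X_i - \bar{X})\bigr] = 0
\;\Rightarrow\;
\beta_1 = \frac{\sum X_i (Y_i - \bar{Y})}{\sum X_i (X_i - \bar{X})}
        = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
```

The last equality holds because $\sum (Y_i - \bar{Y}) = 0$ and $\sum (X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ from $X_i$ in the numerator and denominator changes neither sum.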

**Purpose**

* To **find the most accurate regression line** that minimizes overall prediction error.
* To ensure the line fits the data in a way that the squared deviations between observed and predicted values are minimized.

**Example**

If we have data on:

| X (Hours studied) | Y (Marks scored) |
| ----------------- | ---------------- |
| 2                 | 40               |
| 3                 | 50               |
| 4                 | 65               |
| 5                 | 70               |

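Using the least squares method, we can calculate β₀ and β₁ to get the best-fitting line:
$$
Y = \beta_0 + \beta_1 X
$$

For this data, $\bar{X} = 3.5$ and $\bar{Y} = 56.25$, so $\beta_1 = 52.5 / 5 = 10.5$ and $\beta_0 = 56.25 - 10.5 \times 3.5 = 19.5$. The fitted line is:
$$
Y = 19.5 + 10.5X
$$
→ Predicts that each extra hour of study increases the score by 10.5 marks.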
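As a check on the arithmetic, the coefficient formulas can be applied to the four-point table directly. The sketch below uses plain Python with no libraries; the function name `least_squares` is an illustrative choice, not part of any package.

```python
# Data from the table: hours studied (X) and marks scored (Y)
hours = [2, 3, 4, 5]
marks = [40, 50, 65, 70]


def least_squares(xs, ys):
    """Return (intercept, slope) from the closed-form least squares formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares(hours, marks)
print(f"intercept = {intercept}, slope = {slope}")  # intercept = 19.5, slope = 10.5
```

For these four points the formulas give a slope of 10.5 marks per hour and an intercept of 19.5.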

**In summary:**

| **Concept**   | **Meaning**                                               |
| ------------- | --------------------------------------------------------- |
| Objective     | Minimize total squared prediction errors                  |
| What it finds | Best-fitting regression line                              |
| Why squares?  | Squaring avoids negative errors canceling positive ones   |
| Result        | Provides optimal values for intercept (β₀) and slope (β₁) |



#6) What is Logistic Regression? How does it differ from Linear Regression?

- **Logistic Regression — Overview**

**Logistic Regression** is a **statistical method** used to model the relationship between one or more independent variables (X) and a **categorical dependent variable (Y)** — usually **binary** (i.e., having two possible outcomes like *Yes/No*, *0/1*, *Success/Failure*).

Despite its name, **it is used mainly as a classification technique**: the model estimates a probability, which is then converted into a class label.

**Purpose**

To **predict the probability** that a given input belongs to a particular class.
For example:

* Will a customer **buy** a product? (Yes/No)
* Will a student **pass** or **fail** an exam?
* Is an email **spam** or **not spam**?

**Mathematical Form**

Instead of predicting a continuous value (like Linear Regression), Logistic Regression predicts a **probability** between 0 and 1 using the **logistic (sigmoid) function**:

$$
P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}
$$

Where:

* $P(Y=1)$: Probability that the output is 1 (success)
* $\beta_0, \beta_1$: Coefficients estimated from data
* $e$: Base of natural logarithms (~2.718)

**Interpretation**

* If $P(Y=1) \ge 0.5$, predict **class = 1**
* If $P(Y=1) < 0.5$, predict **class = 0**
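The probability formula and the 0.5 cut-off can be sketched in plain Python. The coefficients below are hypothetical values picked for illustration, not estimates from any data set, and assigning an exact tie at 0.5 to class 1 is an assumed convention.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, beta0, beta1, threshold=0.5):
    """Return (probability, class label) for a single input x."""
    p = sigmoid(beta0 + beta1 * x)
    # Assumed convention: a probability exactly at the threshold counts as class 1
    return p, int(p >= threshold)


# Hypothetical coefficients, chosen so that the probability is 0.5 at x = 4
beta0, beta1 = -4.0, 1.0
for x in (1, 4, 8):
    p, label = predict_class(x, beta0, beta1)
    print(f"x = {x}: P(Y=1) = {p:.3f}, class = {label}")
```

Changing `threshold` trades false positives against false negatives; 0.5 is only the usual default.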

**Key Differences Between Linear and Logistic Regression**

| **Feature**              | **Linear Regression**                         | **Logistic Regression**                                       |
| ------------------------ | --------------------------------------------- | ------------------------------------------------------------- |
| **Type of Output**       | Continuous (any real number)                  | Categorical (usually binary: 0/1)                             |
| **Goal**                 | Predict a numeric value (e.g., sales, salary) | Predict a probability or class (e.g., yes/no)                 |
| **Equation Form**        | $Y = \beta_0 + \beta_1 X + \varepsilon$       | $P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$           |
| **Linearity Assumption** | Assumes a linear relationship between X and Y | Assumes a linear relationship between X and **log-odds of Y** |
| **Error Distribution**   | Errors are normally distributed               | Follows a binomial distribution                               |
| **Use Case**             | Regression (continuous prediction)            | Classification (categorical prediction)                       |
| **Range of Output**      | (−∞ to +∞)                                    | (0 to 1) — represents probability                             |

**Example**

| Hours Studied (X) | Pass (Y) |
| ----------------- | -------- |
| 1                 | 0        |
| 3                 | 0        |
| 4                 | 1        |
| 6                 | 1        |
| 8                 | 1        |

* **Linear Regression** might try to fit a straight line (and can predict impossible values like −0.2 or 1.3).
* **Logistic Regression** fits an **S-shaped (sigmoid) curve** and gives probabilities like:

  * $P(\text{Pass}) = 0.1, 0.3, 0.6, 0.9$, etc.
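The problem with a straight line on 0/1 data can be seen by fitting one to the table with ordinary least squares. The sketch below uses plain Python; `fit_line` and `linear_prediction` are illustrative helper names, and the inputs 0 and 10 hours are hypothetical values outside the table.

```python
# Data from the table: hours studied and pass (1) / fail (0)
hours = [1, 3, 4, 6, 8]
passed = [0, 0, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(hours, passed)


def linear_prediction(x):
    """Straight-line prediction, which is not constrained to lie in [0, 1]."""
    return intercept + slope * x


for x in (0, 4, 8, 10):
    print(f"hours = {x}: linear prediction = {linear_prediction(x):.3f}")
```

For this data the line falls below 0 at 0 hours and rises above 1 at 8 hours, so its output cannot be read as a probability, which is the motivation for the sigmoid curve.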

**In short:**

> **Linear Regression** → Predicts a *continuous value*

> **Logistic Regression** → Predicts a *probability of belonging to a category*



#7) Name and briefly describe three common evaluation metrics for regression models.

- **1. Mean Absolute Error (MAE)**

**Definition:**
MAE measures the **average absolute difference** between the actual and predicted values.

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert Y_i - \hat{Y}_i \rvert
$$

**Interpretation:**

* It tells how far predictions are, on average, from the actual values.
* Lower MAE = better accuracy.
* Easy to understand because it’s in the same unit as the target variable.

*Example:*
If MAE = 5, the model’s predictions are off by 5 units on average.

**2. Mean Squared Error (MSE)**

**Definition:**
MSE measures the **average of squared differences** between actual and predicted values.

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
$$

**Interpretation:**

* Penalizes **larger errors more strongly** because errors are squared.
* Useful for emphasizing big mistakes.
* Lower MSE = better model performance.

*Note:* Units are **squared**, so not directly interpretable in original scale.

**3. R-squared (R²) – Coefficient of Determination**

**Definition:**
R² represents the **proportion of variance in the dependent variable (Y)** that is explained by the independent variable(s) in the model.

$$
R^2 = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}
$$

**Interpretation:**

* $R^2 = 1$: Perfect prediction
* $R^2 = 0$: Model explains none of the variance
* Higher R² = better model fit

*Example:*
If $R^2 = 0.85$, it means **85% of the variation** in Y is explained by the model.

**Summary Table**

| **Metric** | **Formula**                                          | **Interpretation**                                 |
| ---------- | ---------------------------------------------------- | -------------------------------------------------- |
| **MAE**    | $\frac{1}{n}\sum \lvert Y_i - \hat{Y}_i \rvert$      | Average absolute error; easy to interpret          |
| **MSE**    | $\frac{1}{n}\sum (Y_i - \hat{Y}_i)^2$                | Penalizes large errors; sensitive to outliers      |
| **R²**     | $1 - \frac{SS_{res}}{SS_{tot}}$                      | Measures proportion of variance explained by model |
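The three metrics can be computed by hand in a few lines. The sketch below uses plain Python and, as an illustrative input, the actual values and fitted predictions from the scikit-learn example at the end of these notes (line Y = 2.2 + 0.6X on X = 1 to 5); the function names are chosen here and are not a library API.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in the units of Y."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average squared error, in squared units of Y."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


y_true = [2, 4, 5, 4, 5]
y_pred = [2.8, 3.4, 4.0, 4.6, 5.2]   # predictions of Y = 2.2 + 0.6X for X = 1..5

print("MAE:", round(mae(y_true, y_pred), 3))        # 0.64
print("MSE:", round(mse(y_true, y_pred), 3))        # 0.48
print("R^2:", round(r_squared(y_true, y_pred), 3))  # 0.6
```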



#8) What is the purpose of the R-squared metric in regression analysis?

- **Purpose of the R-squared (R²) Metric in Regression Analysis**

**R-squared (R²)** — also called the **coefficient of determination** — is a statistical measure that explains **how well a regression model fits the data**.

**Definition**

R² represents the **proportion of the variance** in the **dependent variable (Y)** that is **explained by the independent variable(s) (X)** in the model.

$$
R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}}
$$

Where:

* $\text{SS}_{res} = \sum (Y_i - \hat{Y}_i)^2$ → Residual sum of squares (unexplained variance)
* $\text{SS}_{tot} = \sum (Y_i - \bar{Y})^2$ → Total sum of squares (total variance in Y)

**Purpose**

1. **Measures Goodness of Fit**
   R² shows how well the regression line represents the actual data points.

   * The closer R² is to **1**, the better the model fits the data.
   * An R² of **0** means the model does not explain any variation in Y.

2. **Explains Variance**
   It quantifies the percentage of the dependent variable’s variation that is explained by the independent variable(s).

   * Example: $R^2 = 0.80$ → 80% of the variation in Y is explained by X; 20% is due to other factors or random noise.

3. **Model Comparison**
   Helps compare models: a higher R² generally means a better fit. Use it carefully with multiple predictors, because R² never decreases when more variables are added, even irrelevant ones.

**Interpretation Example**

If you’re predicting **house prices** based on **size**, and your model gives:
$$
R^2 = 0.90
$$
→ It means **90% of the variation** in house prices can be explained by house size.
Only **10%** of the variation is due to other factors (like location, age, etc.).

**Important Notes**

* R² alone **does not indicate** if the model is good or if it makes **accurate predictions** — a high R² can still come from an overfitted model.
* For multiple regression, we often use **Adjusted R²**, which accounts for the number of predictors.
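Adjusted R² is commonly defined as 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. The sketch below applies that standard formula; the sample size of 50 and the predictor counts are hypothetical values used only to show the penalty.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Same raw R-squared, different numbers of predictors (hypothetical n = 50)
for p in (1, 5, 10):
    print(f"p = {p}: adjusted R^2 = {adjusted_r2(0.90, 50, p):.4f}")
```

With the raw R² held fixed, the adjusted value drops as predictors are added, which is why it is preferred when comparing models of different sizes.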

**In summary:**

> **R² measures how much of the variation in the dependent variable is explained by the model — helping you judge how well your regression line fits the data.**


#9) Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

# Import necessary libraries
from sklearn.linear_model import LinearRegression
import numpy as np

# Example data
# X: independent variable (e.g., hours studied)
# Y: dependent variable (e.g., exam score)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # reshaped to 2D array for sklearn
Y = np.array([2, 4, 5, 4, 5])

# Create and fit the linear regression model
model = LinearRegression()
model.fit(X, Y)

# Print slope (coefficient) and intercept
print("Slope (β₁):", model.coef_[0])
print("Intercept (β₀):", model.intercept_)


Slope (β₁): 0.6
Intercept (β₀): 2.2
