# Question 1: What is Simple Linear Regression (SLR)? Explain its purpose.

Answer:

**Simple Linear Regression (SLR)** is one of the most basic yet powerful statistical methods used to model and analyze the relationship between **two continuous variables** — one **independent variable (predictor)** and one **dependent variable (response)**. It assumes that the relationship between these variables can be expressed using a **straight-line equation**, given by:

**Y = b₀ + b₁X + e**

Where:  
- **Y** = Dependent variable (the outcome or value to be predicted)  
- **X** = Independent variable (the predictor or input variable)  
- **b₀** = Intercept (the value of Y when X = 0)  
- **b₁** = Slope (the amount of change in Y for a one-unit change in X)  
- **e** = Error term (the part of Y that the line does not explain; after fitting, it is estimated by the residual, the difference between observed and predicted values)

The main **purpose of Simple Linear Regression** is to:  
1. **Understand** the relationship between two variables.  
2. **Quantify** how strongly one variable affects another.  
3. **Predict** the value of the dependent variable based on the independent variable.  

SLR works by fitting the **best possible straight line** through the data points using the **least squares method**, which minimizes the sum of squared differences between the observed and predicted values. This ensures that the fitted line represents the overall trend in the data as accurately as possible.

**Interpretation of coefficients:**  
- The **intercept (b₀)** indicates the predicted value of Y when X equals zero.  
- The **slope (b₁)** shows how much Y changes for each one-unit increase in X. A positive slope means a direct relationship, while a negative slope indicates an inverse relationship.

**Applications:**  
- In **business**, it helps predict future sales based on advertising expenditure.  
- In **economics**, it can forecast demand based on changes in price.  
- In **education**, it can estimate student performance based on study hours.  
- In **science and engineering**, it helps identify relationships between variables such as temperature and pressure.

Overall, **Simple Linear Regression** is a foundational technique in both **statistics and machine learning**. It provides a clear, interpretable model for identifying trends, quantifying associations between variables, and making informed predictions. A fitted line on its own describes association and does not establish cause and effect. Though simple, it forms the basis for more complex regression models such as multiple and polynomial regression.

# Question 2: What are the key assumptions of Simple Linear Regression?

Answer:

**Simple Linear Regression (SLR)**, like all statistical models, is based on certain key **assumptions** that must be satisfied for the model’s results to be reliable and accurate. These assumptions ensure that the estimated relationship between the independent and dependent variables is valid and that the conclusions drawn from the regression are meaningful.

The **key assumptions of Simple Linear Regression** are as follows:

1. **Linearity:**  
   The relationship between the **independent variable (X)** and the **dependent variable (Y)** is assumed to be *linear*. This means that changes in X produce proportional changes in Y. If the relationship is non-linear, the linear regression model will not fit the data accurately.

2. **Independence of Errors:**  
   The residuals (errors) should be *independent* of each other. This means that the prediction errors for one observation should not influence the errors for another. Violation of this assumption often occurs in time-series data, where values are correlated over time.

3. **Homoscedasticity (Constant Variance of Errors):**  
   The variance of the residuals should remain *constant across all levels of X*. In other words, the spread of errors should be roughly the same for small and large values of the independent variable. If the variance changes (heteroscedasticity), it can lead to unreliable significance tests and biased standard errors.

4. **Normality of Errors:**  
   The residuals should be *normally distributed*. This assumption matters mainly for hypothesis tests and confidence intervals, especially in small samples. If the residuals are far from normal, the least squares coefficient estimates remain unbiased, but p-values and confidence intervals may be unreliable.

5. **No Multicollinearity (Multiple Regression Only):**  
   This assumption applies to **Multiple Linear Regression**, where the predictors should not be highly correlated with one another. Because SLR has only one independent variable, it is satisfied automatically and is listed here only for completeness.

6. **No Autocorrelation:**  
   The residuals should not exhibit *systematic patterns over time*. This is the time-ordered case of the independence assumption (point 2) and is particularly important in time-dependent data. Autocorrelation can make standard errors too small and mislead the interpretation of regression results.

7. **Measurement Accuracy:**  
   Both the independent and dependent variables should be measured accurately. Errors in measurement can distort the relationship between X and Y, leading to biased parameter estimates.

**Importance of These Assumptions:**  
When these assumptions are met, the Simple Linear Regression model provides **unbiased, consistent, and efficient estimates** of the regression coefficients. If these assumptions are violated, it can result in misleading predictions, unreliable confidence intervals, and incorrect statistical conclusions.

In short, checking these assumptions through diagnostic plots (like residual vs fitted plots, Q-Q plots, etc.) is a crucial step in regression analysis. Only when these assumptions hold true can the results of a Simple Linear Regression model be trusted for accurate prediction and interpretation.
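
The diagnostic plots mentioned above can be complemented with a few numeric checks on the residuals. The sketch below is illustrative only: it assumes NumPy is available, uses a five-point toy dataset (the same values as the Question 9 example), and adds the Durbin-Watson statistic, which is not discussed in the text, as one common way to look for first-order autocorrelation. With so few points, the numbers show the mechanics rather than support any conclusion.

```python
import numpy as np

# Toy data (assumption: same values as the Question 9 example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Fit the least squares line; for degree 1, np.polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least squares residuals average to zero.
print("Mean residual:", round(float(residuals.mean()), 10))

# Durbin-Watson statistic: lies between 0 and 4; values near 2 suggest
# little first-order autocorrelation in the residuals.
durbin_watson = float(np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2))
print("Durbin-Watson:", round(durbin_watson, 3))

# Rough homoscedasticity check: compare residual spread for low vs high x.
order = np.argsort(x)
half = len(x) // 2
low_spread = float(residuals[order][:half].std())
high_spread = float(residuals[order][half:].std())
print("Residual std (low x, high x):", round(low_spread, 3), round(high_spread, 3))
```

For a visual check, plotting `fitted` against `residuals` gives the residual-vs-fitted plot; a random scatter around zero is what the linearity and constant-variance assumptions lead one to expect.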

# Question 3: Write the mathematical equation for a simple linear regression model and explain each term.


Answer:

The **mathematical equation** for a **Simple Linear Regression (SLR)** model is expressed as:

**Y = b₀ + b₁X + e**

This equation represents the relationship between a **dependent variable (Y)** and an **independent variable (X)**, assuming the relationship between them is *linear*.

**Explanation of Each Term:**

1. **Y (Dependent Variable):**  
   It is the *output variable* or the *response* we are trying to predict or explain.  
   Example: Sales, Marks, Salary, etc.  

2. **X (Independent Variable):**  
   It is the *predictor variable* or *input* that influences the dependent variable.  
   Example: Advertising expenditure, Hours studied, Experience, etc.  

3. **b₀ (Intercept or Constant Term):**  
   It represents the *expected value of Y when X = 0*.  
   In other words, it is the point where the regression line crosses the Y-axis.  
   It indicates the baseline value of Y when there is no influence from X.  

4. **b₁ (Slope or Regression Coefficient):**  
   It measures the *rate of change* in Y for every *one-unit change* in X.  
   - If **b₁ > 0**, there is a *positive relationship* (Y increases as X increases).  
   - If **b₁ < 0**, there is a *negative relationship* (Y decreases as X increases).  
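   The sign of the slope gives the *direction* of the relationship and its magnitude gives the *rate of change*. The *strength* of the linear association is measured separately, for example by the correlation coefficient or R².  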

5. **e (Error Term):**  
   It represents the *unexplained variation* in Y: the deviation of an observed Y from the true regression line. After fitting, it is estimated by the *residual*, the difference between the *observed value* and the *predicted value* of Y.  
   It accounts for random factors or influences not captured by the independent variable.  

**Graphical Representation:**  
In a scatter plot, the regression equation represents the *best-fitting straight line* through the data points, where:  
- The line minimizes the sum of squared differences between the observed and predicted values (least squares method).  
- The slope determines the tilt of the line, while the intercept sets its position on the Y-axis.

*Example:*  
If we are predicting a student’s score based on hours studied, the regression equation might be:  
**Score = 35 + 5 * (Hours Studied)**  
Here,  
- **b₀ = 35** → Base score when study hours = 0  
- **b₁ = 5** → Each additional study hour increases the score by 5 marks  
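
As a small illustration of how the example equation is used for prediction, here is a minimal sketch. The function name `predict_score` is hypothetical; the coefficients 35 and 5 are the illustrative values from the example, not estimates from real data.

```python
def predict_score(hours_studied: float) -> float:
    """Predicted score from the illustrative line Score = 35 + 5 * Hours."""
    intercept = 35.0  # b0: predicted score at zero study hours
    slope = 5.0       # b1: extra marks per additional study hour
    return intercept + slope * hours_studied


for hours in (0, 2, 4):
    print(hours, "hours ->", predict_score(hours))
# 0 hours -> 35.0
# 2 hours -> 45.0
# 4 hours -> 55.0
```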


# Question 4: Provide a real-world example where simple linear regression can be applied.

Answer:

A **real-world example** of applying **Simple Linear Regression (SLR)** can be seen in the field of **marketing and sales prediction** — specifically, in estimating how **advertising expenditure** affects **product sales**.

*Example:*  **Predicting Sales Based on Advertising Spend**

A company wants to understand how much its **sales revenue (Y)** depends on the **amount spent on advertising (X)**. The goal is to find a linear relationship between these two variables and use it to predict future sales.

**Scenario:**
Suppose a retail company collects the following data for several months:
- X = Amount spent on advertising (in thousand rupees)
- Y = Monthly sales revenue (in thousand rupees)

After plotting the data, it appears that as advertising spending increases, sales also tend to increase linearly. The company then applies the **Simple Linear Regression model**:

**Y = b₀ + b₁X + e**

Where:  
- **Y** = Sales revenue  
- **X** = Advertising expenditure  
- **b₀** = Intercept (sales when advertising = 0)  
- **b₁** = Slope (change in sales for a one-unit increase in advertising)  
- **e** = Error term (unexplained variation)

**Interpretation of Results:**
Let’s assume the model output is:
**Sales = 50 + 8 * (Advertising Spend)**  

This means:  
- When the company spends **₹0** on advertising, expected sales are **₹50,000** (the intercept).  
- For every additional **₹1,000 spent on advertising**, sales are expected to increase by **₹8,000** (the slope).  

**Insights and Usefulness:**
- The company can use this model to **forecast future sales** based on different advertising budgets.  
- It helps in **budget planning** by estimating the return on investment (ROI) from advertising.  
- Management can use it to **evaluate marketing effectiveness** and make data-driven decisions.

**Other Real-World Applications of SLR include:**
1. Predicting **student performance** based on study hours.  
2. Estimating **crop yield** based on rainfall or fertilizer use.  
3. Predicting **house prices** based on size or location.  
4. Forecasting **electricity consumption** based on temperature.  
5. Estimating **employee attrition risk** based on years of service.

In short, Simple Linear Regression provides an effective and interpretable way to model real-world relationships between two quantitative variables, enabling organizations to make informed predictions, optimize strategies, and identify key influencing factors.

# Question 5: What is the method of least squares in linear regression?

Answer:

The **method of least squares** is a fundamental mathematical technique used in **linear regression** to find the *best-fitting line* through a set of data points. It determines the values of the regression coefficients (**b₀** and **b₁**) such that the line minimizes the total error between the observed and predicted values of the dependent variable.

**Concept:**
In Simple Linear Regression, the relationship between the dependent variable (Y) and the independent variable (X) is given by the equation:

**Y = b₀ + b₁X + e**

Where:
- **Y** = Observed value (actual data)
- **b₀** = Intercept of the regression line
- **b₁** = Slope of the regression line
- **X** = Independent variable
- **e** = Error term (difference between observed and predicted Y)

The **method of least squares** minimizes the *sum of squared residuals*, where each residual represents the vertical distance between the actual and predicted value of Y. Mathematically, this can be written as:

**Minimize: Σ (Yᵢ − Ŷᵢ)²**

where:
- **Yᵢ** = Actual observed value  
- **Ŷᵢ = b₀ + b₁Xᵢ** = Predicted value from the regression line  
- The goal is to find the values of **b₀** and **b₁** that make this sum as small as possible.
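
To make the minimization concrete, the following sketch evaluates the sum of squared residuals for a few candidate lines. It assumes NumPy and reuses the toy data and the fitted values b₀ = 2.2, b₁ = 0.6 from the Question 9 example; the helper name `sum_squared_residuals` is made up for this illustration.

```python
import numpy as np

# Toy data (assumption: same values as the Question 9 example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def sum_squared_residuals(b0: float, b1: float) -> float:
    """Sum of squared differences between observed y and the line b0 + b1 * x."""
    predicted = b0 + b1 * x
    return float(np.sum((y - predicted) ** 2))


best = sum_squared_residuals(2.2, 0.6)     # least squares line
shifted = sum_squared_residuals(2.0, 0.6)  # same slope, lower intercept
steeper = sum_squared_residuals(2.2, 0.7)  # same intercept, steeper slope

print("Least squares line:", round(best, 2))
print("Shifted line:      ", round(shifted, 2))
print("Steeper line:      ", round(steeper, 2))
```

Any change to either coefficient away from the least squares values increases the sum, which is what "best-fitting" means here.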

**Formulas for Coefficients:**
The regression coefficients obtained using the least squares method are:

**b₁ = Σ[(Xᵢ − X̄)(Yᵢ − Ȳ)] / Σ[(Xᵢ − X̄)²]**  
**b₀ = Ȳ − b₁X̄**

Where:
- **X̄** = Mean of X values  
- **Ȳ** = Mean of Y values  
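
The two formulas can be applied directly. The sketch below assumes NumPy and uses the same toy data as the Question 9 example, so the result can be compared with the slope and intercept reported there.

```python
import numpy as np

# Toy data (assumption: same values as the Question 9 example).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean = x.mean()
y_mean = y.mean()

# b1 = sum((Xi - X_bar) * (Yi - Y_bar)) / sum((Xi - X_bar)^2)
b1 = float(np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2))

# b0 = Y_bar - b1 * X_bar
b0 = float(y_mean - b1 * x_mean)

print("Slope (b1):", round(b1, 4))
print("Intercept (b0):", round(b0, 4))

# The fitted line passes through the point of means (X_bar, Y_bar).
print("Line at X_bar:", round(b0 + b1 * x_mean, 4), "vs Y_bar:", y_mean)
```

Here the numerator is 6 and the denominator is 10, giving b₁ = 0.6 and b₀ = 4 − 0.6 × 3 = 2.2.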

**Interpretation:**
- The **slope (b₁)** represents how much Y changes for each unit change in X.  
- The **intercept (b₀)** represents the predicted value of Y when X = 0.  

**Why It’s Called “Least Squares”:**
The method is called *least squares* because it minimizes the *sum of the squares* of the differences between observed and predicted values, ensuring that large errors are penalized more heavily than small ones.

**Advantages of the Least Squares Method:**
1. Provides an objective way to determine the best-fitting line.  
2. Ensures the line passes through the center of the data (mean values of X and Y).  
3. Produces unbiased and efficient estimates under standard regression assumptions.  
4. Simple to compute and interpret.

*Example:*

If we plot study hours (X) against test scores (Y), the least squares method will find the line that best represents the overall trend of scores increasing with study time. The line minimizes the total squared difference between the actual and predicted scores.

To summarize, the **method of least squares** is the cornerstone of regression analysis. It selects the regression line with the smallest possible sum of squared residuals on the observed data, giving a well-defined and reproducible basis for understanding and forecasting relationships between variables.

# Question 6: What is Logistic Regression? How does it differ from Linear Regression?

Answer:

**Logistic Regression** is a type of **supervised learning algorithm** used to model the relationship between one or more **independent variables (predictors)** and a **categorical dependent variable (outcome)**. Unlike **Linear Regression**, which predicts continuous numerical values, **Logistic Regression** predicts the *probability* that an observation belongs to a particular class (such as 0 or 1).

It is widely used for **classification problems**, such as predicting whether an email is spam or not, whether a customer will buy a product, or whether a patient has a disease.

---

**Mathematical Form of Logistic Regression**

The logistic regression model uses the **logistic (sigmoid) function** to map predicted values between 0 and 1.  
The model is expressed as:

**p = 1 / (1 + e^-(b₀ + b₁X))**

Where:  
- **p** = Probability that the dependent variable belongs to class 1  
- **b₀** = Intercept  
- **b₁** = Coefficient of the independent variable  
- **X** = Independent variable  
- **e** = Base of the natural logarithm (~2.718)

The output **p** represents the probability of success (Y = 1), and **1 - p** represents the probability of failure (Y = 0).  
A common threshold (like 0.5) is used to classify outcomes:
- If **p ≥ 0.5**, predict class **1**
- If **p < 0.5**, predict class **0**
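
The sigmoid mapping and the 0.5 threshold can be sketched in a few lines. The coefficients below (b₀ = −4, b₁ = 1) are hypothetical values chosen for illustration, not fitted to any data, and `predict_class` is a made-up helper name.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x: float, b0: float, b1: float, threshold: float = 0.5):
    """Return (probability of class 1, predicted class) for one input value."""
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)


# Hypothetical coefficients, for illustration only.
b0, b1 = -4.0, 1.0

for x in (2.0, 4.0, 6.0):
    p, label = predict_class(x, b0, b1)
    print(f"x = {x}: p = {p:.3f}, predicted class = {label}")
```

At x = 4 the linear part b₀ + b₁x is exactly 0, so p = 0.5 and the rule p ≥ 0.5 assigns class 1. This simple form of `sigmoid` can overflow for very large negative inputs, which is acceptable for a sketch.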

---

**Difference Between Logistic and Linear Regression**

| **Aspect** | **Linear Regression** | **Logistic Regression** |
|-------------|------------------------|---------------------------|
| **Type of Output** | Predicts *continuous numerical values* | Predicts *categorical outcomes* (probabilities between 0 and 1) |
| **Equation Form** | Y = b₀ + b₁X | log(p / (1 - p)) = b₀ + b₁X |
| **Nature of Relationship** | Models a *linear relationship* between X and Y | Models a *non-linear (S-shaped)* relationship using the logistic function |
| **Error Measurement** | Uses *Mean Squared Error (MSE)* | Uses *Log-Loss* or *Cross-Entropy* |
| **Assumptions About Residuals** | Errors are normally distributed | Residuals are not normally distributed; model uses probabilities |
| **Use Case** | Regression (prediction of continuous outcomes) | Classification (prediction of categorical outcomes) |
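
The log-loss (cross-entropy) named in the table can be written out as below. This is a minimal sketch for binary labels; the labels and probabilities are made-up values, and the small `eps` clipping is an implementation choice to avoid taking log(0).

```python
import math


def log_loss(y_true, p_pred, eps: float = 1e-12) -> float:
    """Average binary cross-entropy: -(1/n) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Made-up labels and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
uncertain = [0.6, 0.4, 0.5, 0.5]  # probabilities close to 0.5

print("Log-loss (confident):", round(log_loss(labels, confident), 4))
print("Log-loss (uncertain):", round(log_loss(labels, uncertain), 4))
```

Predictions that put high probability on the correct class produce a lower loss, which is why the confident set scores better than the uncertain one.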

---

*Example:*
Suppose a bank wants to predict whether a loan applicant will default *(Yes/No)* based on income level.  
- **Linear Regression** could give an output like 1.2 or -0.3, neither of which is a valid probability.  
- **Logistic Regression**, on the other hand, would output a probability like **0.85**, meaning there is an 85% chance that the applicant will default.

---

**Summary-**
- **Linear Regression** is used when the dependent variable is *continuous* (e.g., predicting house prices, temperature, sales).  
- **Logistic Regression** is used when the dependent variable is *categorical* (e.g., yes/no, pass/fail, churn/not churn).  
- Logistic Regression transforms the linear combination of inputs using the *sigmoid function* to keep predictions within the range of 0 to 1, making it ideal for classification tasks.

Thus, while both models aim to find relationships between variables, **Logistic Regression** specializes in predicting **probabilities and class memberships**, not continuous numerical outcomes.

# Question 7: Name and briefly describe three common evaluation metrics for regression models.


Answer:

Evaluating a **regression model** is essential to measure how well it predicts continuous outcomes. Several metrics are used to assess the performance and accuracy of regression models. The three most common evaluation metrics are **Mean Absolute Error (MAE)**, **Mean Squared Error (MSE)**, and **R-squared (R²)**.

---

**1. Mean Absolute Error (MAE)**

**Definition:**  
MAE measures the *average magnitude of errors* between predicted and actual values, without considering their direction (positive or negative).  
It gives a straightforward measure of how far predictions are from the true values on average.

**Formula:**  
**MAE = (1/n) × Σ |Yᵢ − Ŷᵢ|**

Where:  
- **Yᵢ** = Actual value  
- **Ŷᵢ** = Predicted value  
- **n** = Number of observations  

**Interpretation:**  
- A *lower MAE* value indicates better model performance.  
- It is easy to interpret since it uses the same units as the dependent variable.

*Example:*

If the MAE is 3.5, it means that on average, predictions differ from actual values by 3.5 units.

---

**2. Mean Squared Error (MSE)**

**Definition:**  
MSE measures the *average of squared differences* between predicted and actual values. It gives more weight to larger errors because the differences are squared.

**Formula:**  
**MSE = (1/n) × Σ (Yᵢ − Ŷᵢ)²**

**Interpretation:**  
- A *lower MSE* indicates better predictive accuracy.  
- Since errors are squared, MSE penalizes large deviations more heavily, making it sensitive to outliers.

*Example:*

If the MSE is 12.5, this means that on average, the squared difference between actual and predicted values is 12.5.

---
**3. R-squared (R²) – Coefficient of Determination**

**Definition:**  
R² measures the *proportion of variance* in the dependent variable that is explained by the independent variable(s) in the model.

**Formula:**  
**R² = 1 − (Σ (Yᵢ − Ŷᵢ)² / Σ (Yᵢ − Ȳ)²)**

Where:  
- **Ȳ** = Mean of actual values  

**Interpretation:**  
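- On the data used to fit a least squares model with an intercept, **R²** ranges from **0 to 1**.  
- An **R² of 1** means the model perfectly fits the data, while **R² of 0** means it explains no more variation than predicting the mean Ȳ for every observation.  
- On new data, or for a model without an intercept, R² can be negative, meaning the model predicts worse than the mean.  
- Higher R² indicates better fit.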

*Example:*

If R² = 0.85, it means that 85% of the variation in the dependent variable is explained by the model.

---

**Summary-**

| **Metric** | **Measures** | **Ideal Value** | **Sensitivity** |
|-------------|--------------|-----------------|-----------------|
| **MAE** | Average absolute error | Closer to 0 | Low (robust to outliers) |
| **MSE** | Average squared error | Closer to 0 | High (penalizes large errors) |
| **R²** | Proportion of variance explained | Closer to 1 | High (built from squared errors) |
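
All three metrics can be computed directly from their formulas. The sketch below assumes NumPy; the actual values are the toy data from the Question 9 example and the predictions come from its fitted line Ŷ = 2.2 + 0.6X at X = 1, ..., 5.

```python
import numpy as np

# Actual values and predictions from the line 2.2 + 0.6 * x at x = 1..5
# (assumption: toy data from the Question 9 example).
y_true = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_pred = np.array([2.8, 3.4, 4.0, 4.6, 5.2])

errors = y_true - y_pred

mae = float(np.mean(np.abs(errors)))   # Mean Absolute Error
mse = float(np.mean(errors ** 2))      # Mean Squared Error

ss_res = float(np.sum(errors ** 2))                     # residual sum of squares
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
r_squared = 1.0 - ss_res / ss_tot

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, R^2 = {r_squared:.2f}")
# MAE = 0.64, MSE = 0.48, R^2 = 0.60
```

With scikit-learn installed, `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` compute the same quantities.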


# Question 8: What is the purpose of the R-squared metric in regression analysis?

Answer:

The **R-squared (R²)** metric, also known as the **Coefficient of Determination**, is a key statistical measure used in **regression analysis** to evaluate how well the independent variable(s) explain the variation in the dependent variable. It provides an overall indication of the **goodness of fit** of the regression model.

---

**Definition and Formula:**

**R² = 1 − (Σ (Yᵢ − Ŷᵢ)² / Σ (Yᵢ − Ȳ)²)**

Where:  
- **Yᵢ** = Actual observed values  
- **Ŷᵢ** = Predicted values from the regression model  
- **Ȳ** = Mean of the actual values  

---
**Purpose of R-squared:**

1. **Measures Model Fit:**  
   R² indicates how well the regression line represents the observed data. A higher R² value means the model fits the data better.

2. **Explains Variability:**  
   It quantifies the **proportion of variance** in the dependent variable (**Y**) that is explained by the independent variable(s) (**X**).  
   - For example, R² = 0.80 means that **80% of the variation** in Y can be explained by X, and the remaining 20% is due to random errors or factors not included in the model.

3. **Evaluates Predictive Power:**  
   Computed on the training data, R² describes fit rather than predictive power. To assess how well the model predicts unseen data, compute R² on a held-out test set; a high training R² alone can reflect overfitting.

4. **Model Comparison:**  
   When comparing multiple regression models built on the same dataset, R² can be used to determine which model better explains the data.

---
**Interpretation of R-squared Values (rough rules of thumb; what counts as a good R² depends on the field):**

| **R² Value** | **Interpretation** |
|---------------|--------------------|
| 0 | The model explains none of the variance in the dependent variable. |
| 0 < R² < 0.5 | The model explains a small to moderate portion of the variance. |
| 0.5 ≤ R² < 0.8 | The model has a reasonably good fit. |
| 0.8 ≤ R² < 1 | The model fits the data very well. |
| 1 | Perfect fit (rare in real-world data). |

---

**Limitations of R-squared:**
1. **Does not indicate causation** — A high R² does not mean that X causes Y.  
2. **Sensitive to number of variables** — Adding more predictors never decreases R² on the training data, and usually increases it, even if the predictors are not meaningful.  
3. **Does not measure bias or overfitting** — A model can have a high R² but still perform poorly on unseen data.  
4. **Adjusted R-squared** is often preferred for multiple regression since it penalizes unnecessary predictors.
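
The adjustment mentioned in the last point uses the standard formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A minimal sketch follows; the function name and the example numbers (R² = 0.85, n = 50) are illustrative assumptions.

```python
def adjusted_r_squared(r_squared: float, n_observations: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_predictors - 1
    if degrees_of_freedom <= 0:
        raise ValueError("Need more observations than predictors plus one.")
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / degrees_of_freedom


# Same R^2 and sample size, different numbers of predictors (made-up values).
one_predictor = adjusted_r_squared(0.85, n_observations=50, n_predictors=1)
ten_predictors = adjusted_r_squared(0.85, n_observations=50, n_predictors=10)

print("Adjusted R^2 with 1 predictor:  ", round(one_predictor, 4))
print("Adjusted R^2 with 10 predictors:", round(ten_predictors, 4))
```

For the same R², the model with more predictors receives the lower adjusted value, which is the penalty described above.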

---

*Example:*

Suppose a regression model predicts house prices based on area.  
If **R² = 0.85**, it means **85% of the variability** in house prices is explained by the model using the “area” variable, while 15% is due to other unexplained factors like location, design, or market trends.

---

**Summary-**

The **purpose of R-squared** is to measure how well the regression model captures the variability of the dependent variable. It serves as a concise indicator of model effectiveness — the higher the R², the better the model explains and fits the data. However, it should always be interpreted alongside other metrics such as **Adjusted R², MAE, or MSE** to ensure a complete understanding of model performance.

# Question 9: Write Python code to fit a simple linear regression model using scikit-learn and print the slope and intercept.

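Answer:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Example data: scikit-learn expects X as a 2-D array (n_samples, n_features)
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Fit the model by ordinary least squares
model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per feature; intercept_ is the constant term
print("Slope (b1):", model.coef_[0])
print("Intercept (b0):", model.intercept_)
```

Output:

```
Slope (b1): 0.6
Intercept (b0): 2.2
```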
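
As a cross-check on the slope and intercept reported above, the same line can be obtained without scikit-learn by solving the least squares problem on a design matrix. This sketch assumes only NumPy; the prediction at X = 6 is an added illustration and extrapolates one step beyond the observed data.

```python
import numpy as np

# Same example data as in the scikit-learn code.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Design matrix: a column of ones (for the intercept) next to the x values.
design = np.column_stack([np.ones_like(x), x])

# np.linalg.lstsq returns (solution, residuals, rank, singular_values).
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = coefficients

print("Slope (b1):", round(float(slope), 4))
print("Intercept (b0):", round(float(intercept), 4))

# Use the fitted line for a new value of X.
new_x = 6.0
print("Prediction at X = 6:", round(float(intercept + slope * new_x), 4))
```

With the fitted scikit-learn model, `model.predict(np.array([[6]]))` gives the corresponding prediction.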


# Question 10: How do you interpret the coefficients in a simple linear regression model?

Answer:

In a **Simple Linear Regression (SLR)** model, the relationship between the **dependent variable (Y)** and the **independent variable (X)** is expressed using the equation:

**Y = b₀ + b₁X + e**

Where:  
- **b₀** = Intercept (constant term)  
- **b₁** = Coefficient or slope of X  
- **e** = Error term  

The **coefficients (b₀ and b₁)** are the most important parameters in a regression model, as they define the fitted line: its baseline level and the direction and rate of change of Y with respect to X.

---

**1. Intercept (b₀)**
- The **intercept** represents the predicted value of the dependent variable (**Y**) when the independent variable (**X**) equals zero.  
- It indicates the starting point of the regression line on the Y-axis.  
- Mathematically, when X = 0, **Y = b₀**.  
- The intercept helps establish the baseline level of Y before any influence from X.  

*Example:*

If the regression equation is **Y = 50 + 5X**, then **b₀ = 50** means that when X = 0, the expected value of Y is 50.

---

**2. Slope (b₁)**
- The **slope** (or regression coefficient) represents the *change in Y* for a *one-unit change in X*, assuming all other factors remain constant.  
- It indicates both the **direction** and **magnitude** of the relationship between X and Y.  
  - If **b₁ > 0**, there is a *positive relationship*: as X increases, Y increases.  
  - If **b₁ < 0**, there is a *negative relationship*: as X increases, Y decreases.  
- The absolute value of b₁ shows how sensitive Y is to changes in X.

*Example:*

If **b₁ = 5**, it means that for every 1-unit increase in X, the predicted value of Y increases by 5 units.

---
**3. Overall Interpretation:**

Together, **b₀ and b₁** define the regression line that best fits the data.  
- **b₀** determines where the line crosses the Y-axis.  
- **b₁** determines the slope or tilt of that line.  

The model uses these coefficients to make predictions for Y based on any given value of X:
**Ŷ = b₀ + b₁X**

---

**4. Example Interpretation:**

Consider the regression equation:

**Salary = 30,000 + 2,000 × Experience**

Here:  
- **Intercept (b₀ = 30,000):** When experience = 0, the predicted starting salary is ₹30,000.  
- **Slope (b₁ = 2,000):** For every additional year of experience, the predicted salary increases by ₹2,000.

Thus, both coefficients together describe how salary (Y) changes in response to experience (X).

---

**Summary-**

The **coefficients in a simple linear regression model** provide a clear and interpretable description of the relationship between the independent and dependent variables.  
- The **intercept** shows the baseline value of Y.  
- The **slope** shows the rate and direction of change in Y with respect to X.  
Together, they allow the model to make predictions, explain trends, and quantify the impact of one variable on another in a straightforward and interpretable way.