# Assignment (26th March): Regression - 1

### Q1. Explain the difference between simple linear regression and multiple linear regression. Provide an example of each.

**ANS:** **`Simple Linear Regression:`**

- **Definition:** Models the relationship between two variables: one independent variable (predictor) and one dependent variable (response).
- **Equation:** \( y = b_0 + b_1x \)
  - \( y \): Dependent variable
  - \( x \): Independent variable
  - \( b_0 \): Intercept
  - \( b_1 \): Slope

**Example:**
Predicting a person's weight based on their height.
\[ \text{Weight} = b_0 + b_1 \times \text{Height} \]

**`Multiple Linear Regression:`**

- **Definition:** Models the relationship between one dependent variable and two or more independent variables.
- **Equation:** <p align = "center"> \[ y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n \] </p>
  - \( y \): Dependent variable
  - \( x_1, x_2, \ldots, x_n \): Independent variables
  - \( b_0 \): Intercept
  - \( b_1, b_2, \ldots, b_n \): Coefficients

**Example:**
Predicting a person's weight based on their height and age.
<p align = "center">
\[ \text{Weight} = b_0 + b_1 \times \text{Height} + b_2 \times \text{Age} \]
</p>
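
The two examples above can be sketched in code. This is a minimal illustration, assuming made-up, noise-free height, age, and weight values and two hypothetical helpers (`solve_linear_system`, `fit_ols`) that solve the least-squares normal equations in plain Python; it is not tied to any particular library.

```python
def solve_linear_system(A, b):
    """Solve A @ coeffs = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    X = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


# Made-up, noise-free data: weight = -50 + 0.7 * height + 0.1 * age
heights = [150.0, 160.0, 165.0, 172.0, 180.0, 185.0]
ages = [25.0, 40.0, 31.0, 52.0, 28.0, 45.0]
weights = [-50 + 0.7 * h + 0.1 * a for h, a in zip(heights, ages)]

# Simple linear regression: one predictor (height).
simple_coefs = fit_ols([[h] for h in heights], weights)

# Multiple linear regression: two predictors (height and age).
multiple_coefs = fit_ols(list(zip(heights, ages)), weights)

print("simple   [b0, b1]:", [round(c, 3) for c in simple_coefs])
print("multiple [b0, b1, b2]:", [round(c, 3) for c in multiple_coefs])
```

Because the data is generated without noise, the multiple regression recovers the generating coefficients, while the simple regression folds the omitted age effect into its intercept and slope.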

### Q2. Discuss the assumptions of linear regression. How can you check whether these assumptions hold in a given dataset?

**ANS:** **`Assumptions of Linear Regression:`**

1. **Linearity:** The relationship between the independent and dependent variables should be linear.
2. **Independence:** Observations should be independent of each other.
3. **Homoscedasticity:** The residuals (errors) have constant variance at every level of the independent variable.
4. **Normality:** The residuals of the model should be normally distributed.
5. **No Multicollinearity:** Independent variables should not be highly correlated with each other.

**`Checking Assumptions:`**

1. **Linearity:**
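   - **Method:** Plot observed vs. predicted values, or residuals vs. predicted values.
   - **Check:** Observed vs. predicted points should fall around a straight diagonal line, and the residuals should show no curved pattern.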


2. **Independence:**
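   - **Method:** Use the Durbin-Watson test on the residuals.
   - **Check:** The statistic ranges from 0 to 4. Values close to 2 indicate no first-order autocorrelation; values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.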


3. **Homoscedasticity:**
   - **Method:** Plot residuals vs. predicted values.
   - **Check:** Look for a random scatter with roughly constant spread. A funnel or fan shape indicates heteroscedasticity.


4. **Normality:**
   - **Method:** Use a Q-Q plot or a histogram of the residuals.
   - **Check:** Points on the Q-Q plot should fall close to the reference line, and the histogram should be roughly bell-shaped.


5. **No Multicollinearity:**
   - **Method:** Calculate Variance Inflation Factor (VIF).
   - **Check:** VIF values below 10 (or below 5, under a stricter rule of thumb) indicate acceptable multicollinearity.
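
As an illustration of the independence check, the Durbin-Watson statistic can be computed directly from a sequence of residuals. The sketch below assumes two made-up residual sequences and a hypothetical helper named `durbin_watson`; it computes only the statistic, not the test's critical values.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Illustrative residual sequences (made up, not from a real model).
trending = [1.0, 1.0, -1.0, -1.0]      # neighbours tend to share a sign
alternating = [1.0, -1.0, 1.0, -1.0]   # neighbours tend to flip sign

dw_trending = durbin_watson(trending)        # below 2: positive autocorrelation
dw_alternating = durbin_watson(alternating)  # above 2: negative autocorrelation
print(dw_trending, dw_alternating)
```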



### Q3. How do you interpret the slope and intercept in a linear regression model? Provide an example using a real-world scenario.

**ANS:** **`Interpreting the Slope and Intercept in a Linear Regression Model:`**

1. **Intercept (\(b_0\)):**
   - **Definition:** The predicted value of the dependent variable when all independent variables are zero.
   - **Interpretation:** Represents the baseline level of the dependent variable.

2. **Slope (\(b_1\), \(b_2\), etc.):**
   - **Definition:** The change in the dependent variable for a one-unit change in the corresponding independent variable, holding other variables constant.
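   - **Interpretation:** The sign gives the direction of the relationship, and the magnitude gives the expected change in the dependent variable per unit of the predictor. Because the magnitude depends on the predictor's units, it does not by itself measure the strength of the relationship.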

**`Example: Predicting House Prices`**

**Scenario:** We are predicting house prices (\(y\)) based on the size of the house in square feet (\(x\)).

**Linear Regression Equation:**
<p align = "center">
\[ \text{Price} = b_0 + b_1 \times \text{Size} \]
</p> 

Let's say we have the following estimated equation:
<p align = "center">
\[ \text{Price} = 50{,}000 + 200 \times \text{Size} \]
</p>

**`Interpretation:`**

- **Intercept (50,000):** 
  - When the size of the house is 0 square feet, the predicted price is $50,000. A house with 0 square feet is not realistic, so the intercept is best read as the model's baseline value rather than as a real price.

- **Slope (200):**
  - For each additional square foot of size, the predicted house price increases by $200. This is a positive relationship between house size and price: larger houses are predicted to be more expensive.


**`Real-World Example:`**

If a house is 1,000 square feet:
<p align = "center">
\[ \text{Price} = 50{,}000 + 200 \times 1{,}000 = 50{,}000 + 200{,}000 = 250{,}000 \]
</p>
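
The same arithmetic can be written as a small function. This is only a sketch of the example equation; the names `INTERCEPT`, `SLOPE`, and `predict_price` are illustrative.

```python
INTERCEPT = 50_000  # b0: predicted price at 0 square feet, in dollars
SLOPE = 200         # b1: predicted dollars per additional square foot


def predict_price(size_sqft):
    """Predicted price in dollars from Price = 50,000 + 200 * Size."""
    return INTERCEPT + SLOPE * size_sqft


price_1000 = predict_price(1_000)
price_1001 = predict_price(1_001)
print(price_1000)               # 250000
print(price_1001 - price_1000)  # 200, the slope
```

Adding one square foot changes the prediction by exactly the slope, which is what the slope interpretation states.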


### Q4. Explain the concept of gradient descent. How is it used in machine learning?

**ANS:** Gradient descent is an optimization algorithm used to minimize the cost function in machine learning models. It iteratively adjusts the model parameters to find the values that minimize the cost function, thereby improving the model's predictions.

**`How Gradient Descent Works:`**

1. **Initialization:** Start with initial values for the model parameters (often set randomly).
2. **Compute Gradient:** Calculate the gradient (partial derivatives) of the cost function with respect to each parameter. This gradient indicates the direction and rate of the steepest ascent.
3. **Update Parameters:** Adjust the parameters in the opposite direction of the gradient to move towards the minimum of the cost function. The size of the step is determined by the learning rate: too large a value can overshoot the minimum, while too small a value slows convergence.
4. **Iterate:** Repeat steps 2 and 3 until convergence, meaning the parameters change very little or the cost function stops decreasing meaningfully.

**`Update Rule:`**
<p align = "center">
\[ \theta := \theta - \alpha \nabla J(\theta) \]
</p>

Where:
- \( \theta \) represents the model parameters.
- \( \alpha \) is the learning rate.
- \( \nabla J(\theta) \) is the gradient of the cost function with respect to \( \theta \).
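
A minimal sketch of this update rule applied to simple linear regression is given below. It assumes the cost \( J = \frac{1}{2n} \sum (b_0 + b_1x - y)^2 \), a fixed number of iterations instead of a convergence test, and made-up noise-free data; the learning rate and iteration count are illustrative choices, not recommended defaults.

```python
def gradient_descent(xs, ys, alpha=0.1, n_iters=2000):
    """Fit y = b0 + b1 * x by batch gradient descent on J = (1 / 2n) * sum(error^2)."""
    b0, b1 = 0.0, 0.0  # initialization (zeros here rather than random values)
    n = len(xs)
    for _ in range(n_iters):
        errors = [b0 + b1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to b0 and b1.
        grad_b0 = sum(errors) / n
        grad_b1 = sum(e * x for e, x in zip(errors, xs)) / n
        # theta := theta - alpha * grad J(theta)
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1 + 2 * x for x in xs]  # noise-free line with intercept 1 and slope 2
b0, b1 = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

On this data the estimates approach the generating intercept and slope. With a much larger learning rate the updates would diverge instead of converging.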

**`Usage in Machine Learning:`**

1. **Linear Regression:** Minimize the mean squared error between predicted and actual values.
2. **Logistic Regression:** Minimize the log loss (cross-entropy) to improve classification accuracy.
3. **Neural Networks:** Adjust weights and biases to minimize the loss function and improve model performance.


### Q5. Describe the multiple linear regression model. How does it differ from simple linear regression?

**ANS:** **`Multiple Linear Regression:`**

- **Model:** Models the relationship between one dependent variable and two or more independent variables.
- **Equation:** <p align = "center"> \( y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n \) </p>
  - \( y \): Dependent variable
  - \( x_1, x_2, \ldots, x_n \): Independent variables
  - \( b_0 \): Intercept
  - \( b_1, b_2, \ldots, b_n \): Coefficients for each independent variable

**`Difference from Simple Linear Regression:`**

- **Simple Linear Regression:** Involves one independent variable and one dependent variable.
  - **Equation:** <p align = "center"> \( y = b_0 + b_1x \) </p>
- **Multiple Linear Regression:** Involves multiple independent variables predicting the dependent variable.
  - **Equation:** <p align = "center"> \( y = b_0 + b_1x_1 + b_2x_2 + \ldots + b_nx_n \) </p>
- **Complexity:** Multiple linear regression can capture the combined effect of several predictors, whereas simple linear regression captures the effect of only one predictor.

### Q6. Explain the concept of multicollinearity in multiple linear regression. How can you detect and address this issue?

**ANS:** Multicollinearity occurs in multiple linear regression when two or more independent variables are highly correlated, meaning they provide redundant information about the response variable. This can make it difficult to determine the individual effect of each predictor on the dependent variable and can lead to unstable estimates of regression coefficients.


**`Detecting Multicollinearity:`**

1. **Correlation Matrix:**
   - Calculate the correlation matrix of the independent variables.
   - High correlations (close to 1 or -1) between pairs of variables indicate potential multicollinearity.

2. **Variance Inflation Factor (VIF):**
   - Compute the VIF for each independent variable.
   - <p align = "center"> \( \text{VIF}_j = \frac{1}{1 - R_j^2} \) </p>
     - \( R_j^2 \) is the coefficient of determination of the regression of \( X_j \) on all other predictors.
   - VIF values greater than 10 (sometimes 5) indicate high multicollinearity.

3. **Condition Number:**
   - Compute the condition number of the feature matrix.
   - High condition numbers (greater than 30) indicate potential multicollinearity.
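
To make the VIF formula concrete, the sketch below handles the special case of exactly two predictors, where \( R_j^2 \) is simply the squared Pearson correlation between them. The data is made up and the helper names (`pearson_r`, `vif_two_predictors`) are illustrative; with more predictors, each \( R_j^2 \) would come from a full auxiliary regression instead.

```python
def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def vif_two_predictors(a, b):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 is the squared correlation."""
    return 1.0 / (1.0 - pearson_r(a, b) ** 2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_redundant = [2.0, 4.0, 6.0, 8.0, 11.0]  # nearly a multiple of x1
x2_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]   # uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_redundant)  # far above 10
vif_low = vif_two_predictors(x1, x2_unrelated)   # about 1
print(vif_high, vif_low)
```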

**`Addressing Multicollinearity:`**

1. **Remove Highly Correlated Predictors:**
   - Identify and remove one of the highly correlated variables.
   - This can simplify the model and reduce redundancy.

2. **Combine Variables:**
   - Combine the correlated variables into one or more uncorrelated components through techniques like Principal Component Analysis (PCA).

3. **Ridge Regression (L2 Regularization):**
   - Apply ridge regression, which adds a penalty to the size of the coefficients.
   - This can help reduce the variance of the estimates without removing variables.

4. **Increase Sample Size:**
   - If feasible, increase the sample size. More data does not remove the correlation between predictors, but it reduces the variance of the coefficient estimates.
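
The effect of ridge regression on correlated predictors can be sketched for the two-predictor case, where the penalized normal equations \( (X^TX + \lambda I)\,b = X^Ty \) form a 2x2 system. The example assumes made-up, already centered data (so no intercept is needed), an arbitrary penalty \( \lambda = 1 \), and a hypothetical helper `ridge_two_predictors`; in practice predictors are usually standardized first and \( \lambda \) is chosen by cross-validation.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam * I) b = X'y for two centered predictors (no intercept)."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * t for a, t in zip(x1, y))
    t2 = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    # Cramer's rule for the 2x2 system.
    return (t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det


# Made-up centered data: x2 is almost a copy of x1, so the two are highly correlated.
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.0, 1.1, 1.9]
y = [-4.3, -1.8, 0.1, 2.2, 3.8]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)    # lam = 0 reduces to ordinary least squares
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)  # lam = 1.0 is an arbitrary penalty strength
print("OLS:  ", ols)
print("ridge:", ridge)
```

Without the penalty, the fit splits the shared effect unevenly between the two near-duplicate predictors. With the penalty, the coefficients shrink and move closer to each other.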


### Q7. Describe the polynomial regression model. How is it different from linear regression?

**ANS:** **`Polynomial Regression:`**

- **Model:** Extends linear regression by including polynomial terms of the independent variables.
- **Equation:** \( y = b_0 + b_1x + b_2x^2 + \ldots + b_nx^n \)
- **Purpose:** Captures non-linear relationships between the independent and dependent variables.

**`Difference from Linear Regression:`**

- **Linear Regression:** Models a straight-line relationship (\( y = b_0 + b_1x \)).
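- **Polynomial Regression:** Models a curved relationship by including higher-order terms (\( x^2, x^3, \ldots \)). The model is still linear in the coefficients \( b_0, \ldots, b_n \), so it can be fitted with the same least-squares method.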
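
The difference can be seen by fitting a straight line and a quadratic to the same curved data. The sketch below assumes made-up, noise-free quadratic data and hypothetical helpers (`solve_linear_system`, `fit_polynomial`, `sse`); the polynomial fit simply treats \( 1, x, x^2, \ldots \) as columns of a least-squares problem.

```python
def solve_linear_system(A, b):
    """Solve A @ coeffs = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients [b0, b1, ..., b_degree]."""
    X = [[x ** p for p in range(degree + 1)] for x in xs]  # columns 1, x, x^2, ...
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)


def sse(xs, ys, coefs):
    """Sum of squared errors of the fitted polynomial on the data."""
    return sum(
        (y - sum(c * x ** p for p, c in enumerate(coefs))) ** 2
        for x, y in zip(xs, ys)
    )


xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]  # noise-free quadratic

line = fit_polynomial(xs, ys, degree=1)
quadratic = fit_polynomial(xs, ys, degree=2)
print("line SSE:     ", sse(xs, ys, line))
print("quadratic SSE:", sse(xs, ys, quadratic))
```

The straight line leaves a large error on this data, while the degree-2 fit recovers the generating coefficients up to floating-point error.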

### Q8. What are the advantages and disadvantages of polynomial regression compared to linear regression? In what situations would you prefer to use polynomial regression?

**ANS:** **`Advantages of Polynomial Regression:`**

1. **Captures Non-Linearity:** Can model complex, non-linear relationships between variables.
2. **Flexibility:** Higher-degree polynomials can fit a wider range of data patterns.

**`Disadvantages of Polynomial Regression:`**

1. **Overfitting:** High-degree polynomials can fit the noise in the data, leading to poor generalization.
2. **Complexity:** Higher-degree models are more difficult to interpret and require more computational resources.
3. **Extrapolation:** Predictions outside the range of the observed data can be unreliable because high-order terms grow rapidly.

**`When to Prefer Polynomial Regression:`**

- **Non-Linear Relationships:** When the relationship between variables is clearly non-linear and cannot be captured by a straight line.
- **Data Patterns:** When exploratory data analysis shows a curved pattern that linear regression fails to fit well.