### **Regression and Machine Learning Key Concepts**

---

#### **1. What does R-squared represent in a regression model?**
- R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables.  
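- For an ordinary least-squares model with an intercept, evaluated on its training data, the value ranges from 0 to 1:
  - **0**: Model explains none of the variance.
  - **1**: Model explains all the variance.
- On held-out data, or for a model fitted without an intercept, R-squared can be negative, meaning the model predicts worse than simply using the mean of the target.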
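
As a minimal sketch of the definition, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample arrays are illustrative assumptions, not part of the original notebook. The second set of predictions is deliberately worse than predicting the mean, which pushes the value below zero.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.8, 5.1, 7.2, 8.9]   # close to the true values
bad_pred = [9.0, 7.0, 5.0, 3.0]    # worse than always predicting the mean (6.0)

r2_good = r_squared(y_true, good_pred)   # 1 - 0.10 / 20 = 0.995
r2_bad = r_squared(y_true, bad_pred)     # 1 - 80 / 20  = -3.0
print(r2_good, r2_bad)
```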

---

#### **2. What are the assumptions of linear regression?**
1. **Linearity**: The relationship between predictors and outcome is linear.
2. **Independence**: Residuals are independent of each other.
3. **Homoscedasticity**: Residuals have constant variance.
4. **Normality**: Residuals are normally distributed.
5. **No Multicollinearity**: Predictors are not highly correlated with each other.

---

#### **3. What is the difference between R-squared and Adjusted R-squared?**
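- **R-squared**: Measures the proportion of variance explained by all predictors. It never decreases when a predictor is added, even an irrelevant one.
- **Adjusted R-squared**: Accounts for the number of predictors and penalizes the addition of irrelevant variables, so it rises only when a new predictor improves the fit by more than chance alone would.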

---

#### **4. Why do we use Mean Squared Error (MSE)?**
- MSE measures the average squared difference between predicted and actual values.  
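- Squaring amplifies larger errors, so MSE penalizes a few large mistakes more heavily than many small ones. It is also differentiable, which makes it a convenient loss function for optimization.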
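
The following sketch computes MSE by hand on a small made-up sample and shows how much of the total squared error comes from the single largest residual. The helper name `mean_squared_error_manual` and the numbers are assumptions chosen for illustration.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared residuals (hypothetical helper)."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(residuals ** 2))

y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 20.0]   # residuals: -1, 0, 1, -4

mse_value = mean_squared_error_manual(y_true, y_pred)   # (1 + 0 + 1 + 16) / 4 = 4.5

# Share of the total squared error contributed by the largest residual.
squared = (np.array(y_true) - np.array(y_pred)) ** 2
largest_share = float(squared.max() / squared.sum())    # 16 / 18
print(mse_value, largest_share)
```

One residual of size 4 accounts for roughly 89% of the total squared error here, which is the amplification effect described above.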

---

#### **5. What does an Adjusted R-squared value of 0.85 indicate?**
- Approximately 85% of the variance in the dependent variable is explained by the model, adjusted for the number of predictors.

---

#### **6. How do we check for normality of residuals in linear regression?**
- **Histogram** of residuals.
- **Q-Q Plot**: Compare residuals to a normal distribution.
- **Shapiro-Wilk or Kolmogorov-Smirnov Tests**: Statistical tests for normality.
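
A Q-Q comparison can be sketched with only NumPy and the standard library: sort the standardized residuals, pair them with the corresponding quantiles of a standard normal distribution, and measure how close the pairs are to a straight line. The function names and the simulated residuals below are assumptions for illustration. In practice a plotting library or a formal test from a statistics package would normally be used.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical normal quantiles, sorted standardized residuals)."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    z = (r - r.mean()) / r.std(ddof=1)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions in (0, 1)
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, z

def qq_correlation(residuals):
    """Correlation of the Q-Q points; values near 1 are consistent with normality."""
    theoretical, z = qq_points(residuals)
    return float(np.corrcoef(theoretical, z)[0, 1])

rng = np.random.default_rng(0)
normal_resid = rng.normal(0.0, 1.0, 500)     # simulated, roughly normal residuals
skewed_resid = rng.exponential(1.0, 500)     # simulated, strongly right-skewed residuals

corr_normal = qq_correlation(normal_resid)
corr_skewed = qq_correlation(skewed_resid)
print(corr_normal, corr_skewed)
```

The skewed sample should produce a visibly lower correlation, which corresponds to points bending away from the diagonal on a Q-Q plot.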

---

#### **7. What is multicollinearity, and how does it impact regression?**
- **Multicollinearity** occurs when predictors are highly correlated.  
- Impacts:
  - Unstable coefficient estimates.
  - Inflated standard errors.
  - Reduced interpretability.

---

#### **8. What is Mean Absolute Error (MAE)?**
- MAE is the average absolute difference between predicted and actual values.
- Less sensitive to outliers compared to MSE.
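
To illustrate the difference in outlier sensitivity, the sketch below computes MAE and MSE on a made-up sample, then replaces one prediction with a badly wrong value. The helper names and data are assumptions for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean of the absolute residuals."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return float(np.mean((np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2))

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])   # every error has magnitude 1

y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 28.0                            # one error of magnitude 10

mae_clean, mse_clean = mae(y_true, y_pred), mse(y_true, y_pred)                      # 1.0, 1.0
mae_outlier, mse_outlier = mae(y_true, y_pred_outlier), mse(y_true, y_pred_outlier)  # 2.8, 20.8
print(mae_clean, mse_clean, mae_outlier, mse_outlier)
```

A single outlier raises MAE from 1.0 to 2.8 but raises MSE from 1.0 to 20.8.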

---

#### **9. What are the benefits of using an ML pipeline?**
- **Automation**: Streamlines data preprocessing and model building.
- **Reproducibility**: Ensures consistent results.
- **Scalability**: Efficiently handles large datasets.
- **Modularity**: Simplifies the addition or removal of steps.
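
A minimal sketch of a pipeline, assuming scikit-learn is installed. The synthetic data and the step names `"scale"` and `"model"` are illustrative choices. Because scaling and fitting are chained, the scaler is fitted on the training split only and then reused unchanged at prediction time.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step is a (name, estimator) pair; steps can be added or swapped freely.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X_train, y_train)              # fits the scaler, then the regressor
test_r2 = pipe.score(X_test, y_test)    # R-squared on held-out data
print(test_r2)
```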

---

#### **10. Why is RMSE considered more interpretable than MSE?**
- RMSE is in the same units as the target variable, making it easier to understand and compare.

---

#### **11. What is pickling in Python, and how is it useful in ML?**
- **Pickling**: Serializes Python objects into a binary format for storage.  
- Usefulness:
  - Saves trained models for future use.
  - Enables transfer of models across systems.
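
The round trip can be sketched with the standard library alone. Here a plain dictionary of coefficients stands in for a trained model. The `predict` helper and the file name `model.pkl` are assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model: parameters of a linear model (hypothetical values).
model = {"coef": [2.0, -1.0], "intercept": 0.5}

def predict(params, row):
    """Apply the stored linear model to one row of features."""
    return params["intercept"] + sum(c * x for c, x in zip(params["coef"], row))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    with open(path, "wb") as f:       # pickle files must be opened in binary mode
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = predict(restored, [1.0, 3.0])   # 0.5 + 2.0 * 1.0 - 1.0 * 3.0 = -0.5
print(restored == model, prediction)
```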

---

#### **12. What does a high R-squared value mean?**
- Indicates the model explains a large proportion of the variance.
- Caution: A high value doesn’t guarantee a good model; check for overfitting.

---

#### **13. What happens if linear regression assumptions are violated?**
- Results may be biased or unreliable:
  - Inaccurate coefficients.
  - Misleading p-values.
  - Poor predictive performance.

---

#### **14. How can we address multicollinearity in regression?**
- Use **Variance Inflation Factor (VIF)** to detect it.
- Remedies:
  - Remove highly correlated predictors.
  - Use dimensionality reduction (e.g., PCA).
  - Regularization methods like Ridge or Lasso.
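
As a rough sketch of how VIF works, each predictor is regressed on all the others, and the VIF is 1 / (1 - R²) of that auxiliary regression. The `vif` function and the simulated predictors below are assumptions for illustration. Commonly cited rules of thumb flag values above about 5 or 10, though the threshold is a judgment call.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])        # add an intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))                 # undefined if r2 == 1 exactly
    return np.array(factors)

rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)                   # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

The first two columns should show very large VIFs, while the third stays close to 1.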

---

#### **15. How can feature selection improve model performance in regression analysis?**
- Reduces noise and overfitting.
- Improves computational efficiency.
- Enhances model interpretability by focusing on the most relevant predictors.

---

#### **16. How is Adjusted R-squared calculated?**
\[
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}
\]
- **n**: Number of observations.
- **p**: Number of predictors.
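
A small sketch of the calculation, with n observations and p predictors. The function name and the example numbers are assumptions for illustration. Holding R-squared fixed at 0.90, the adjusted value drops as more predictors are used to achieve it.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

adj_small = adjusted_r_squared(0.90, n=50, p=5)    # 1 - 0.1 * 49 / 44, about 0.8886
adj_large = adjusted_r_squared(0.90, n=50, p=20)   # 1 - 0.1 * 49 / 29, about 0.8310
print(adj_small, adj_large)
```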

---

#### **17. Why is MSE sensitive to outliers?**
- Squaring residuals magnifies the impact of larger errors caused by outliers.

---

#### **18. What is the role of homoscedasticity in linear regression?**
- Ensures that residuals have constant variance.
- Violations can lead to biased standard errors and unreliable hypothesis tests.

---

#### **19. What is Root Mean Squared Error (RMSE)?**
- RMSE is the square root of the MSE.
\[
\text{RMSE} = \sqrt{\text{MSE}}
\]
- Provides a measure of model accuracy in the same units as the target variable.
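
A short sketch of why the units matter, using made-up house prices in thousands of dollars. The variable names and values are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared residual."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Prices in thousands of dollars (hypothetical data).
price_true = [200.0, 250.0, 300.0, 350.0]
price_pred = [210.0, 240.0, 320.0, 330.0]   # residuals: -10, 10, -20, 20

mse_value = float(np.mean((np.array(price_true) - np.array(price_pred)) ** 2))  # 250.0, in squared units
rmse_value = rmse(price_true, price_pred)                                       # sqrt(250), about 15.81
print(mse_value, rmse_value)
```

An MSE of 250 is expressed in "thousand dollars squared", whereas an RMSE of about 15.8 reads directly as a typical error of roughly 15.8 thousand dollars.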

---

#### **20. Why is pickling considered risky?**
- **Security risks**: Deserializing untrusted files may execute malicious code.
- **Compatibility issues**: Pickle files are Python-specific, and a file may fail to load if the Python version or the library that defined the object has changed since it was saved.
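
The security risk can be shown with a deliberately harmless sketch. A class can define `__reduce__` to tell pickle which callable to invoke when the data is loaded. Here that callable is just `print`, but an untrusted file could name any importable function. The class name `Payload` is hypothetical.

```python
import pickle

class Payload:
    """Stand-in for an object crafted by an untrusted party (hypothetical)."""

    def __reduce__(self):
        # pickle stores "call print(...)" instead of the object's state.
        return (print, ("this ran during pickle.loads",))

blob = pickle.dumps(Payload())

# Loading the bytes calls print(...); no Payload instance is rebuilt.
result = pickle.loads(blob)
print(result)   # print returns None, so the "unpickled object" is None
```

For this reason, pickle files should only be loaded from sources that are trusted.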

---

#### **21. What alternatives exist to pickling for saving ML models?**
- **Joblib**: Efficient for objects that hold large NumPy arrays. It builds on pickle, so the same security caveat applies.
- **ONNX**: Platform-independent model serialization.
- **PMML**: Standard format for predictive models.

---

#### **22. What is heteroscedasticity, and why is it a problem?**
- Heteroscedasticity occurs when residual variance is not constant.  
- Problems:
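  - Coefficient estimates remain unbiased but are no longer efficient (their variance is higher than necessary).
  - Standard errors are biased, so hypothesis tests and confidence intervals become invalid.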
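
One way to detect non-constant variance is sketched below: regress the squared residuals on the predictors and compute n times the R-squared of that auxiliary regression. This is the studentized (Koenker) form of the Breusch-Pagan statistic. With one predictor it is compared against a chi-square distribution with 1 degree of freedom, whose 5% critical value is about 3.84. The helper names and simulated data are assumptions for illustration.

```python
import numpy as np

def ols_fit(X, y):
    """Fit OLS with an intercept; return (fitted values, residuals)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    fitted = A @ coef
    return fitted, y - fitted

def breusch_pagan_lm(X, y):
    """n * R^2 from regressing squared residuals on the predictors."""
    _, resid = ols_fit(X, y)
    sq = resid ** 2
    aux_fitted, aux_resid = ols_fit(X, sq)
    aux_r2 = 1.0 - aux_resid @ aux_resid / np.sum((sq - sq.mean()) ** 2)
    return len(y) * aux_r2

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, 500)
y_const = 2.0 + 3.0 * x + rng.normal(scale=2.0, size=500)       # constant noise level
y_fan = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x, size=500)     # noise grows with x

lm_const = breusch_pagan_lm(x, y_const)
lm_fan = breusch_pagan_lm(x, y_fan)
print(lm_const, lm_fan)
```

The fan-shaped data should give a statistic far above 3.84, signalling heteroscedasticity.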

---

#### **23. How can interaction terms enhance a regression model's predictive power?**
- Captures relationships where the effect of one predictor depends on another.  
- Example: Sales = β₀ + β₁·Price + β₂·Advertising + β₃·(Price × Advertising), where β₃ measures how the effect of advertising on sales changes with price.
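
The Sales example can be sketched on synthetic data where the true relationship includes an interaction. The coefficients, noise level, and helper name `fit_r2` are assumptions chosen for illustration. Adding the product column lets the model recover the interaction and improves the fit.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
price = rng.uniform(5.0, 15.0, n)
advertising = rng.uniform(0.0, 10.0, n)

# Synthetic ground truth: the payoff from advertising shrinks as price rises.
sales = (100.0 - 4.0 * price + 6.0 * advertising
         - 0.3 * price * advertising + rng.normal(scale=2.0, size=n))

def fit_r2(features, y):
    """OLS with an intercept on the given feature columns; return (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y))] + features)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

coef_main, r2_main = fit_r2([price, advertising], sales)
coef_int, r2_int = fit_r2([price, advertising, price * advertising], sales)

print(r2_main, r2_int)
print(coef_int[3])   # estimate of the interaction coefficient (true value -0.3)
```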