1. What is Simple Linear Regression?

Simple Linear Regression is a statistical method that models the relationship between two continuous variables: one independent variable (X) and one dependent variable (Y). It assumes a linear relationship and fits a straight line (Y = mX + c) to the data by minimizing the sum of squared differences between the observed and predicted values of Y.

2. What are the key assumptions of Simple Linear Regression?

- Linear relationship between the independent and dependent variables
- Homoscedasticity (constant variance of residuals)
- Independence of observations
- Normal distribution of residuals
- No significant outliers

3. What does the coefficient m represent in the equation Y = mX + c?

m is the slope of the regression line. It represents the change in the dependent variable (Y) for a one-unit change in the independent variable (X).

4. What does the intercept c represent in the equation Y = mX + c?

c is the intercept: the predicted value of Y when X = 0. It is the baseline level of Y, and it is only meaningful when X = 0 lies within or near the observed range of X.

5. How do we calculate the slope m in Simple Linear Regression?

m = Σ((Xi - X̄)(Yi - Ȳ)) / Σ((Xi - X̄)^2), that is, the ratio of the covariance of X and Y to the variance of X.
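
The slope formula can be checked by hand on a small dataset. The sketch below is a minimal illustration, assuming made-up data that lies exactly on y = 2x + 1; it also uses the standard companion result c = Ȳ - m·X̄ for the intercept, which follows from the fitted line passing through the point (X̄, Ȳ).

```python
import numpy as np

# Hypothetical toy data, chosen so the exact line is y = 2x + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = X.mean(), Y.mean()

# Slope: sum of cross-deviations divided by sum of squared X deviations
m = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)

# Intercept: the fitted line passes through (x_mean, y_mean)
c = y_mean - m * x_mean

print(m, c)  # 2.0 1.0
```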

6. What is the purpose of the least squares method in Simple Linear Regression?

It minimizes the sum of squared residuals (differences between observed and predicted Y values), ensuring the best-fitting regression line.
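
One way to see what "best-fitting" means here is to compare the sum of squared residuals at the least squares solution with the same quantity for slightly different lines. The following is a sketch on invented data; `ssr` is a hypothetical helper name, and `np.polyfit` with `deg=1` is used as the least squares fitter.

```python
import numpy as np

# Hypothetical noisy data that roughly follows a straight line
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def ssr(m, c):
    """Sum of squared residuals for the line y = m*x + c."""
    residuals = Y - (m * X + c)
    return float(np.sum(residuals ** 2))

# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
m_hat, c_hat = np.polyfit(X, Y, deg=1)
best = ssr(m_hat, c_hat)

# Nudge the slope and intercept away from the fitted values
nudged = [ssr(m_hat + dm, c_hat + dc) for dm in (-0.1, 0.1) for dc in (-0.1, 0.1)]

print(best, min(nudged))
```

Every nudged line has a larger sum of squared residuals than the fitted one, which is exactly the property the least squares criterion guarantees.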

7. How is the coefficient of determination (R²) interpreted in Simple Linear Regression?

R² indicates the proportion of variance in the dependent variable that is explained by the independent variable. R² = 1 implies perfect prediction; R² = 0 means the model explains none of the variance, so it does no better than always predicting the mean of Y.
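
The two boundary cases can be reproduced directly from the usual definition R² = 1 - SS_res / SS_tot. This is a minimal sketch; `r_squared` is a hypothetical helper, and the data is invented.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r_squared(y, y)                              # predictions match exactly
mean_only = r_squared(y, np.full_like(y, y.mean()))    # always predict the mean

print(perfect, mean_only)  # 1.0 0.0
```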

8. What is Multiple Linear Regression?

It models the relationship between one dependent variable and two or more independent variables using the equation: Y = b0 + b1X1 + b2X2 + ... + bnXn.
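
As a rough illustration of the equation, the coefficients b0, b1, b2 can be recovered with a plain least squares solve. The sketch assumes noise-free, made-up data generated from y = 1 + 2·x1 + 3·x2, so the fit should return those values.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept b0
design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of design @ coef = y
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1, b2 = coef

print(b0, b1, b2)
```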

9. What is the main difference between Simple and Multiple Linear Regression?

Simple Linear Regression involves one independent variable, whereas Multiple Linear Regression involves two or more independent variables.

10. What are the key assumptions of Multiple Linear Regression?

- Linearity between each predictor and the response
- Normally distributed residuals (sometimes stated as multivariate normality)
- No or little multicollinearity
- Homoscedasticity
- Independence of errors

11. What is heteroscedasticity, and how does it affect the results of a Multiple Linear Regression model?

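Heteroscedasticity refers to non-constant variance of residuals. It violates the constant-variance assumption: the coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors are biased, which distorts t-tests and confidence intervals.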

12. How can you improve a Multiple Linear Regression model with high multicollinearity?

- Remove or combine correlated variables
- Use Principal Component Analysis (PCA)
- Apply Ridge or Lasso regression
- Center the variables (mean subtraction), which mainly helps when the collinearity comes from interaction or polynomial terms
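
To make the Ridge option concrete, here is a sketch using the closed-form ridge solution (XᵀX + λI)⁻¹Xᵀy on synthetic data with two nearly identical predictors. The data, the penalty value λ = 1, and the helper name `ridge` are all assumptions chosen for illustration; setting λ = 0 gives ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # nearly a copy of x1
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Center predictors and response so no intercept column is needed
X = np.column_stack([x1, x2])
X = X - X.mean(axis=0)
y_c = y - y.mean()

def ridge(X, y, lam):
    """Closed-form ridge coefficients; lam = 0 reduces to ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

ols_coef = ridge(X, y_c, 0.0)
ridge_coef = ridge(X, y_c, 1.0)

print("OLS:  ", ols_coef)
print("Ridge:", ridge_coef)
```

With collinear predictors the individual OLS coefficients tend to be unstable, while their sum stays close to the true combined effect. The ridge penalty shrinks the coefficient vector, which typically pulls the two estimates toward each other.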

13. What are some common techniques for transforming categorical variables for use in regression models?

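- One-hot encoding (one 0/1 column per category)
- Label encoding (integer codes; this implies an order, so it suits ordinal categories best)
- Dummy variable creation (one-hot encoding with one category dropped as the reference level)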
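
The three encodings can be sketched with NumPy alone on a made-up `colors` column. In practice, library helpers such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` are the usual route; the version below only shows what they produce.

```python
import numpy as np

# Hypothetical categorical column
colors = np.array(["red", "green", "blue", "green", "red"])

# Label encoding: map each category to an integer code.
# np.unique sorts the categories, so blue=0, green=1, red=2.
categories, codes = np.unique(colors, return_inverse=True)

# One-hot encoding: one 0/1 column per category
one_hot = np.eye(len(categories), dtype=int)[codes]

# Dummy variables: drop the first column ("blue" becomes the reference level)
# so the columns are not perfectly collinear with an intercept term
dummies = one_hot[:, 1:]

print(categories)
print(codes)
print(one_hot)
print(dummies)
```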

14. What is the role of interaction terms in Multiple Linear Regression?

Interaction terms (e.g., X₁×X₂) capture the effect of variables interacting with each other, helping to model complex relationships that are not purely additive.
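
An interaction term is just an extra column holding the product of two predictors. The sketch below assumes synthetic, noise-free data generated from y = 1 + 2·x1 + 3·x2 + 4·x1·x2 and recovers all four coefficients with a least squares solve.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 5, size=50)
x2 = rng.uniform(0, 5, size=50)

# Hypothetical noise-free response with a genuine interaction effect
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 4.0 * x1 * x2

# Columns: intercept, x1, x2, and the interaction x1*x2
design = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

print(coef)
```

With the interaction column present, the effect of a one-unit change in x1 is b1 + b3·x2, so it depends on the level of x2 rather than being a single fixed number.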

15. How can the interpretation of intercept differ between Simple and Multiple Linear Regression?

In Simple Linear Regression, the intercept is Y when X = 0. In Multiple Linear Regression, it’s the expected Y value when all predictors are 0, which may not be meaningful if 0 is outside the data range.

16. What is the significance of the slope in regression analysis, and how does it affect predictions?

The slope quantifies the effect of a one-unit change in the predictor on the response variable. It directly influences the predictions made by the regression model.

17. How does the intercept in a regression model provide context for the relationship between variables?

It establishes the baseline value of the dependent variable and provides context for how predictions behave when all independent variables are zero.

18. What are the limitations of using R² as a sole measure of model performance?

- Doesn’t indicate whether a regression model is appropriate
- Never decreases when predictors are added, so it can look artificially high for an overfitted model
- Doesn’t reveal bias or variance issues

19. How would you interpret a large standard error for a regression coefficient?

A large standard error relative to the size of the coefficient implies that the estimate is imprecise: its confidence interval is wide, and the coefficient may not be statistically significant.
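
For reference, coefficient standard errors in ordinary least squares come from the diagonal of σ̂²(XᵀX)⁻¹, where σ̂² is the residual variance. The following sketch computes them on synthetic data; the data-generating values and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(0, 10, size=n)
y = 1.5 * x + 4.0 + rng.normal(scale=2.0, size=n)   # hypothetical noisy line

design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ coef

# Residual variance, using n minus the number of fitted coefficients
sigma2 = np.sum(resid ** 2) / (n - design.shape[1])

# Covariance matrix of the coefficient estimates and their standard errors
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))

# t-statistic: estimate divided by its standard error
t_stats = coef / std_err

print("coefficients:", coef)
print("std errors:  ", std_err)
print("t-statistics:", t_stats)
```

A coefficient whose standard error is large compared with the estimate itself gets a small t-statistic, which is what "may not be statistically significant" refers to.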

20. How can heteroscedasticity be identified in residual plots, and why is it important to address it?

If residual plots show a funnel shape (residuals increasing/decreasing with fitted values), it indicates heteroscedasticity. Addressing it is essential to ensure valid inference and accurate confidence intervals.
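
The funnel shape can also be summarized numerically. As a crude stand-in for eyeballing the plot, the sketch below measures the correlation between the absolute residuals and the fitted values on two synthetic datasets, one with noise that grows with x and one with constant noise. The data and the helper `abs_resid_vs_fitted` are hypothetical, and this check is an informal diagnostic rather than a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1, 10, size=n)

# Noise whose spread grows with x (heteroscedastic) vs. constant spread
y_hetero = 2.0 * x + rng.normal(size=n) * x
y_homo = 2.0 * x + rng.normal(size=n) * 3.0

def abs_resid_vs_fitted(x, y):
    """Correlation between |residual| and fitted value after a straight-line fit."""
    m, c = np.polyfit(x, y, deg=1)
    fitted = m * x + c
    resid = y - fitted
    return float(np.corrcoef(np.abs(resid), fitted)[0, 1])

hetero_corr = abs_resid_vs_fitted(x, y_hetero)
homo_corr = abs_resid_vs_fitted(x, y_homo)

print(hetero_corr, homo_corr)
```

A clearly positive correlation means the residuals spread out as the fitted values increase, which is the numeric counterpart of a funnel opening to the right.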

21. What does it mean if a Multiple Linear Regression model has a high R² but low adjusted R²?

It suggests that some predictors do not add meaningful value, and the model may be overfitting the data.
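
The gap between the two measures comes from the penalty in the standard formula adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The sketch below fits one model with a single real predictor and one padded with five pure-noise predictors; the data and the helpers `fit_r2` and `adjusted_r2` are assumptions for illustration.

```python
import numpy as np

def fit_r2(design, y):
    """R^2 of an ordinary least squares fit on the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(42)
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 5))           # predictors unrelated to y

base = np.column_stack([np.ones(n), x])   # intercept + 1 real predictor
bloated = np.column_stack([base, noise])  # intercept + 1 real + 5 noise predictors

r2_base, r2_bloated = fit_r2(base, y), fit_r2(bloated, y)
adj_base = adjusted_r2(r2_base, n, p=1)
adj_bloated = adjusted_r2(r2_bloated, n, p=6)

print("R2:      ", r2_base, r2_bloated)
print("adj. R2: ", adj_base, adj_bloated)
```

R² for the padded model cannot be lower than for the base model, while adjusted R² only rises if the extra predictors reduce the residual error by more than the penalty they incur.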

22. Why is it important to scale variables in Multiple Linear Regression?

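Predictions from ordinary least squares do not depend on the scale of the predictors, but scaling still matters: it makes coefficients comparable across predictors, improves convergence of gradient-based methods, and is crucial when using regularization, because the penalty depends on coefficient magnitude.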
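
A small experiment makes the effect of scaling visible. The sketch below standardizes two synthetic predictors on very different scales (the names `income` and `age` and all numeric values are invented) and fits ordinary least squares before and after.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
income = rng.normal(50_000, 10_000, size=n)   # large-scale predictor
age = rng.normal(40, 10, size=n)              # small-scale predictor
y = 0.001 * income + 0.5 * age + rng.normal(size=n)

X = np.column_stack([income, age])

# Standardize: subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

def ols(X, y):
    """Return (coefficients including intercept, fitted values)."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

coef_raw, pred_raw = ols(X, y)
coef_std, pred_std = ols(X_std, y)

print("raw slopes:         ", coef_raw[1:])
print("standardized slopes:", coef_std[1:])
```

The fitted values are the same in both cases; only the coefficients change. After standardization each slope is the change in y per one standard deviation of its predictor, which is what makes the slopes comparable.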

23. What is polynomial regression?

Polynomial Regression is an extension of linear regression where the relationship between the independent variable and the dependent variable is modeled as an nth-degree polynomial.

24. How does polynomial regression differ from linear regression?

Linear regression models straight-line relationships, while polynomial regression can model curved relationships. The polynomial model is still linear in its coefficients, so it is fitted with the same least squares method.

25. When is polynomial regression used?

When data shows a nonlinear relationship that cannot be captured by a straight line.

26. What is the general equation for polynomial regression?

Y = b₀ + b₁X + b₂X² + ... + bₙXⁿ

27. Can polynomial regression be applied to multiple variables?

Yes, by including polynomial terms for each variable and their interactions, it becomes a multivariate polynomial regression.
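
To show which terms such an expansion contains, here is a sketch that builds every monomial up to a given degree by hand. `poly_features` is a hypothetical helper written for this illustration; for two variables at degree 2 it produces the columns 1, x1, x2, x1², x1·x2, x2².

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(X, degree):
    """Expand X into all monomials of its columns up to `degree`, plus a bias column."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    columns = [np.ones(n_samples)]  # bias (degree-0) term
    for d in range(1, degree + 1):
        # Each combination picks d column indices, with repetition allowed
        for combo in combinations_with_replacement(range(n_features), d):
            columns.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(columns)

# One sample with x1 = 2, x2 = 3
expanded = poly_features([[2.0, 3.0]], degree=2)
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```

Fitting an ordinary linear regression on the expanded columns then gives a multivariate polynomial regression.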

28. What are the limitations of polynomial regression?

- Overfitting at high degrees
- Sensitive to outliers
- Poor extrapolation behavior
- Complexity in interpretation
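
The overfitting and extrapolation points can be seen in a small experiment. The sketch assumes synthetic data that is truly linear (y = 2x plus noise) on the interval [-1, 1], fits polynomials of degree 1 and degree 7, and then predicts at x = 2, which is outside the training range. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = np.linspace(-1, 1, 12)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)  # truly linear

def train_mse(degree):
    """Mean squared error on the training data for a polynomial of this degree."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    return float(np.mean((y_train - np.polyval(coefs, x_train)) ** 2))

def predict_at(degree, x_new):
    """Prediction at x_new from a polynomial of this degree fitted on the training data."""
    coefs = np.polyfit(x_train, y_train, deg=degree)
    return float(np.polyval(coefs, x_new))

mse_deg1, mse_deg7 = train_mse(1), train_mse(7)

# The noise-free value at x = 2 is 4.0; compare how far each model drifts
extrap_deg1, extrap_deg7 = predict_at(1, 2.0), predict_at(7, 2.0)

print("training MSE:     ", mse_deg1, mse_deg7)
print("prediction at x=2:", extrap_deg1, extrap_deg7)
```

The higher degree always fits the training points at least as well, since it contains the straight line as a special case. That lower training error says nothing about behavior outside the data, where high-degree terms can dominate the prediction.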

29. What methods can be used to evaluate model fit when selecting the degree of a polynomial?

- Cross-validation
- Adjusted R²
- AIC/BIC
- Residual analysis
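
Of these, cross-validation is the most direct to sketch. The example below runs a simple 5-fold cross-validation by hand over degrees 1 to 5 on synthetic data generated from y = x² plus noise. The data, the fold count, and the helper `cv_mse` are assumptions for illustration; scikit-learn offers ready-made cross-validation utilities for real use.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = x ** 2 + rng.normal(scale=0.5, size=x.size)   # hypothetical quadratic signal

def cv_mse(x, y, degree, k=5, seed=0):
    """Average held-out mean squared error over k folds for a given degree."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(order, fold)           # indices not in the held-out fold
        coefs = np.polyfit(x[train], y[train], deg=degree)
        pred = np.polyval(coefs, x[fold])
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

scores = {d: cv_mse(x, y, d) for d in range(1, 6)}
best_degree = min(scores, key=scores.get)

for d, s in scores.items():
    print(d, round(s, 3))
print("selected degree:", best_degree)
```

The degree with the lowest held-out error is the one to prefer. A straight line scores poorly here because it cannot follow the curvature, and degrees above the true one gain little or nothing on held-out data.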

30. Why is visualization important in polynomial regression?

It helps in understanding the fit of the model, detecting overfitting/underfitting, and conveying model behavior intuitively.


31. How is polynomial regression implemented in Python?

With scikit-learn, by chaining PolynomialFeatures and LinearRegression in a pipeline:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Example dataset: five points lying exactly on y = x^2
X = np.array([[1], [2], [3], [4], [5]])  # 2-D array of shape (n_samples, 1)
y = np.array([1, 4, 9, 16, 25])

# Create a model with polynomial features
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

# Predict
predictions = model.predict(X)
print(predictions)



[ 1.  4.  9. 16. 25.]
