In [1]:
'''


### **1. What does R-squared represent in a regression model?**
- **Answer:** R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. It indicates the goodness of fit, with values closer to 1 implying a better fit.
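
The definition can be made concrete with a minimal sketch that computes R-squared as one minus the ratio of unexplained to total variation. The helper name `r_squared` and the four data points are assumptions made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R-squared = 1 - SS_res / SS_tot
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical values, chosen only for illustration
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.5, 9.5]
r2 = r_squared(y_true, y_pred)
print(f"R-squared: {r2:.2f}")   # SS_res = 1.0, SS_tot = 20.0, so this prints 0.95
```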

---

### **2. What are the assumptions of linear regression?**
- **Answer:**
  1. **Linearity:** The relationship between independent and dependent variables is linear.
  2. **Independence:** Residuals are independent of each other.
  3. **Homoscedasticity:** The variance of residuals is constant across all levels of the independent variable.
  4. **Normality:** Residuals are normally distributed.
  5. **No multicollinearity:** Independent variables should not be highly correlated with each other.

---

### **3. What is the difference between R-squared and Adjusted R-squared?**
- **Answer:** R-squared never decreases when a predictor is added, even an irrelevant one. Adjusted R-squared penalizes the number of predictors, so it rises only when a new predictor improves the fit by more than chance alone would, and it can fall otherwise.
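
A rough sketch of this behaviour on simulated data: a predictor that is pure noise is added to a one-feature model. The helpers `fit_r2` and `adjusted`, the seed, and the data-generating process are all assumptions for illustration. R-squared cannot go down, while the adjusted value depends on how much the extra column happens to help.

```python
import numpy as np

def fit_r2(X, y):
    # Ordinary least squares with an intercept column, then R-squared
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def adjusted(r2, n, k):
    # n observations, k predictors
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)        # unrelated to y by construction
y = 2.0 * x + rng.normal(size=n)

r2_one = fit_r2(x.reshape(-1, 1), y)
r2_two = fit_r2(np.column_stack([x, noise_feature]), y)

print(f"1 predictor : R2 = {r2_one:.4f}, adjusted = {adjusted(r2_one, n, 1):.4f}")
print(f"2 predictors: R2 = {r2_two:.4f}, adjusted = {adjusted(r2_two, n, 2):.4f}")
```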

---

### **4. Why do we use Mean Squared Error (MSE)?**
- **Answer:** MSE measures the average squared difference between actual and predicted values. It penalizes larger errors more heavily, making it useful for assessing a model’s accuracy.

---

### **5. What does an Adjusted R-squared value of 0.85 indicate?**
- **Answer:** An Adjusted R-squared value of 0.85 indicates that roughly 85% of the variance in the dependent variable is explained by the predictors, after penalizing for the number of predictors relative to the sample size.

---

### **6. How do we check for normality of residuals in linear regression?**
- **Answer:** Normality can be checked using:
  1. **Q-Q plots:** Residuals should align along a diagonal line.
  2. **Shapiro-Wilk test:** A statistical test for normality.
  3. **Histogram:** Residuals should form a bell-shaped curve.
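
The Q-Q idea can be sketched without a plotting library: pair each sorted residual with the standard-normal quantile at the same rank, then measure how straight the resulting line is. The helper `qq_correlation` and both simulated residual sets are assumptions for illustration. This is a numeric stand-in for eyeballing the plot, not a replacement for a formal test such as Shapiro-Wilk.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    # Sort the residuals and compute matching standard-normal quantiles
    observed = np.sort(np.asarray(residuals, dtype=float))
    n = len(observed)
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions strictly inside (0, 1)
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    # Correlation near 1 means the Q-Q points hug a straight line
    return float(np.corrcoef(theoretical, observed)[0, 1])

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=500)          # simulated, roughly bell-shaped
skewed_resid = rng.exponential(size=500)     # simulated, strongly right-skewed

corr_normal = qq_correlation(normal_resid)
corr_skewed = qq_correlation(skewed_resid)
print(f"normal residuals: Q-Q correlation = {corr_normal:.3f}")
print(f"skewed residuals: Q-Q correlation = {corr_skewed:.3f}")
```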

---

### **7. What is multicollinearity, and how does it impact regression?**
- **Answer:** Multicollinearity occurs when independent variables are highly correlated. It can inflate standard errors, making it difficult to determine the significance of individual predictors.
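
One common way to quantify this, not named in the answer above, is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below uses a hypothetical `vif` helper and simulated predictors, where `x2` is deliberately built as a near copy of `x1`.

```python
import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # independent of the others
vifs = vif(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 1))
```

A frequently quoted rule of thumb treats values above about 5 to 10 as a warning sign; here the first two columns land far above that and the third stays close to 1.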

---

### **8. What is Mean Absolute Error (MAE)?**
- **Answer:** MAE is the average of the absolute differences between actual and predicted values. It provides a straightforward measure of prediction error.

---

### **9. What are the benefits of using an ML pipeline?**
- **Answer:**
  1. **Automation:** Simplifies repetitive tasks like preprocessing and model training.
  2. **Consistency:** Ensures the same steps are applied each time.
  3. **Scalability:** Easily handles large datasets and multiple models.
  4. **Reproducibility:** Makes the workflow repeatable.

---

### **10. Why is RMSE considered more interpretable than MSE?**
- **Answer:** RMSE is in the same unit as the dependent variable, making it easier to interpret, whereas MSE is in squared units.

---

### **11. What is pickling in Python, and how is it useful in ML?**
- **Answer:** Pickling is a process of serializing and saving Python objects, like trained ML models, for reuse. It helps save time and computational resources by avoiding retraining.

---

### **12. What does a high R-squared value mean?**
- **Answer:** A high R-squared value indicates that a large proportion of the variance in the dependent variable is explained by the model, suggesting a good fit.

---

### **13. What happens if linear regression assumptions are violated?**
- **Answer:**
  - Violations can lead to biased or inefficient estimates.
  - Predictions may be unreliable, and statistical tests might not hold true.

---

### **14. How can we address multicollinearity in regression?**
- **Answer:**
  1. Remove highly correlated variables.
  2. Use techniques like Principal Component Analysis (PCA).
  3. Use regularization methods such as Ridge or Lasso regression.
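
As a sketch of the third option, ridge regression has a closed form that can be written in a few lines of NumPy. The helper `ridge_fit`, the penalty `alpha=10.0`, and the simulated data are assumptions for illustration; in practice a library implementation such as scikit-learn's `Ridge` would normally be used, with the penalty tuned by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    # Closed-form ridge on centered data: solve (X'X + alpha * I) b = X'y.
    # alpha = 0 reduces to ordinary least squares.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(k), Xc.T @ yc)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)     # almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=200)       # only the shared signal matters

ols_coef = ridge_fit(X, y, alpha=0.0)
ridge_coef = ridge_fit(X, y, alpha=10.0)
print("OLS coefficients  :", np.round(ols_coef, 2))
print("Ridge coefficients:", np.round(ridge_coef, 2))
```

With two nearly identical columns, the unpenalized fit can split the effect between them almost arbitrarily, while the penalty pulls the pair toward a stable, shared value.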

---

### **15. How can feature selection improve model performance in regression analysis?**
- **Answer:** Feature selection removes irrelevant or redundant predictors, reducing overfitting, improving interpretability, and increasing computational efficiency.

---

### **16. How is Adjusted R-squared calculated?**
- **Answer:** Adjusted R-squared is calculated using:
  \[
  \text{Adjusted } R^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right)
  \]
  where \( n \) is the number of observations and \( k \) is the number of predictors.
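
  As a worked example with assumed values (not taken from any dataset here), a model with \( R^2 = 0.90 \), \( n = 50 \), and \( k = 5 \) gives:
  \[
  \text{Adjusted } R^2 = 1 - \left( \frac{(1 - 0.90)(50 - 1)}{50 - 5 - 1} \right) = 1 - \frac{4.9}{44} \approx 0.889
  \]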

---

### **17. Why is MSE sensitive to outliers?**
- **Answer:** MSE squares the errors, amplifying the impact of large deviations (outliers) on the overall error calculation.
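
A small made-up example shows the effect. Every prediction starts out exactly 1 unit off, then a single prediction is moved so that it is 10 units off. The arrays and the helper names `mse` and `mae` are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])    # every error is exactly 1 unit
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 28.0                             # one prediction is now off by 10

print(f"clean       : MSE = {mse(y_true, y_pred):.1f}, MAE = {mae(y_true, y_pred):.1f}")
print(f"with outlier: MSE = {mse(y_true, y_pred_outlier):.1f}, MAE = {mae(y_true, y_pred_outlier):.1f}")
# clean       : MSE = 1.0, MAE = 1.0
# with outlier: MSE = 20.8, MAE = 2.8
```

The single large error multiplies the MSE by about 21 but the MAE by less than 3.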

---

### **18. What is the role of homoscedasticity in linear regression?**
- **Answer:** Homoscedasticity means the variance of the residuals is constant. Ordinary least squares coefficients remain unbiased without it, but their standard errors do not, so constant variance is essential for efficient estimates and valid hypothesis tests.

---

### **19. What is Root Mean Squared Error (RMSE)?**
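- **Answer:** RMSE is the square root of the MSE. When the residuals average zero, as they do for a least-squares fit with an intercept evaluated on its training data, it equals the standard deviation of the residuals. It measures the model’s prediction error in the same units as the dependent variable.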
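
The link between RMSE and the spread of the residuals can be checked numerically. This sketch fits a straight line to simulated data with `np.polyfit`; the true slope, intercept, and noise level are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(scale=3.0, size=100)    # noise has standard deviation 3

slope, intercept = np.polyfit(x, y, deg=1)              # least-squares line (highest power first)
residuals = y - (intercept + slope * x)

mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

print(f"MSE  = {mse:.3f} (squared units of y)")
print(f"RMSE = {rmse:.3f} (same units as y)")
print(f"Std of residuals = {np.std(residuals):.3f}")    # matches RMSE because residuals average zero
```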

---

### **20. Why is pickling considered risky?**
- **Answer:** Pickling is risky because it can execute arbitrary code during deserialization, making it vulnerable to malicious code injection.
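
The mechanism can be demonstrated safely. An object's `__reduce__` method tells pickle which callable to invoke when the bytes are loaded. The hypothetical `Payload` class below points it at the harmless builtin `sorted`; a malicious file could name any importable callable instead.

```python
import pickle

class Payload:
    # __reduce__ returns (callable, args); pickle calls callable(*args) at load time
    def __reduce__(self):
        return (sorted, ([3, 1, 2],))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)     # runs sorted([3, 1, 2]) instead of rebuilding a Payload

print(type(result).__name__, result)   # list [1, 2, 3]
```

The loaded value is not a `Payload` at all, which is why pickle files should only be loaded from sources you trust.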

---

### **21. What alternatives exist to pickling for saving ML models?**
- **Answer:**
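  1. **Joblib:** Efficient for objects holding large NumPy arrays. It builds on pickle, so loading untrusted files carries the same risk.
  2. **ONNX:** Open, framework-neutral format for ML models.
  3. **HDF5:** Binary format for large numerical arrays, used for example by Keras to store model weights.
  4. **PMML:** XML-based model format.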

---

### **22. What is heteroscedasticity, and why is it a problem?**
- **Answer:** Heteroscedasticity occurs when residual variance changes across levels of an independent variable. It violates regression assumptions, leading to inefficient estimates and unreliable hypothesis testing.
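
A rough numeric check, in the spirit of a Goldfeld-Quandt comparison, is to fit a line and compare residual variance between the lower and upper halves of the predictor. The helper `variance_ratio` and both simulated datasets are assumptions for illustration, not a formal test.

```python
import numpy as np

def variance_ratio(x, y):
    # Fit a line, then compare residual variance in the upper vs lower half of x.
    # Assumes x is already sorted in ascending order.
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (intercept + slope * x)
    half = len(x) // 2
    return float(resid[half:].var() / resid[:half].var())

rng = np.random.default_rng(0)
n = 400
x = np.sort(rng.uniform(1, 10, size=n))
y_homo = 2.0 * x + rng.normal(scale=1.0, size=n)          # constant noise
y_hetero = 2.0 * x + rng.normal(scale=0.5 * x, size=n)    # noise grows with x

ratio_homo = variance_ratio(x, y_homo)
ratio_hetero = variance_ratio(x, y_hetero)
print(f"constant-variance data: ratio = {ratio_homo:.2f}")
print(f"growing-variance data : ratio = {ratio_hetero:.2f}")
```

A ratio near 1 is consistent with constant variance, while a ratio well above 1 suggests the spread widens as the predictor grows.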

---

### **23. How can interaction terms enhance a regression model's predictive power?**
- **Answer:** Interaction terms capture the combined effect of two or more predictors, revealing relationships that are not evident from individual effects.
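
A minimal sketch on simulated data, where the response is built with an explicit `x1 * x2` effect. The helper `fit_r2`, the coefficients, and the seed are assumptions for illustration. An additive model misses most of the signal, and adding the product column recovers it.

```python
import numpy as np

def fit_r2(A, y):
    # Least-squares fit on design matrix A (which already includes the intercept column)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The effect of x1 on y depends on x2 through the product term
y = 1.0 + x1 + x2 + 2.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
r2_additive = fit_r2(np.column_stack([ones, x1, x2]), y)
r2_interaction = fit_r2(np.column_stack([ones, x1, x2, x1 * x2]), y)

print(f"without interaction: R2 = {r2_additive:.3f}")
print(f"with interaction   : R2 = {r2_interaction:.3f}")
```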

---

'''

In [None]:
'''
### **1. Visualize the distribution of residuals (Seaborn's diamonds dataset):**
```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load dataset
diamonds = sns.load_dataset('diamonds')

# Prepare data
X = diamonds[['carat', 'depth', 'table']]  # Select some features
y = diamonds['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit linear regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate residuals
residuals = y_test - y_pred

# Plot residuals distribution
sns.histplot(residuals, kde=True, bins=30)
plt.title("Distribution of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
```

---

### **2. Calculate MSE, MAE, and RMSE for a linear regression model:**
```python
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
```

---

### **3. Check linear regression assumptions:**
```python
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Linearity check
sns.scatterplot(x='carat', y='price', data=diamonds)
plt.title("Linearity Check")
plt.show()

# Residuals plot for homoscedasticity
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Residuals vs Fitted")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

# Correlation matrix for multicollinearity
# (numeric_only=True skips the categorical columns cut, color, and clarity)
corr_matrix = diamonds.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
```

---

### **4. ML pipeline with feature scaling and regression models:**
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

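# Create pipeline (scaling does not change tree-based models; it is kept to show the pattern)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=42))  # fixed seed for reproducible results
])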

# Fit model and evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"R-squared score: {r2_score(y_test, y_pred)}")
```

---

### **5. Simple linear regression with coefficients and R-squared:**
```python
# Print coefficients, intercept, and test R-squared of the model fitted in step 1 (three features)
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
print(f"R-squared: {model.score(X_test, y_test)}")
```

---

### **6. Analyze total bill and tip relationship (Seaborn's tips dataset):**
```python
tips = sns.load_dataset('tips')

# Linear regression
X = tips[['total_bill']]
y = tips['tip']
model = LinearRegression()
model.fit(X, y)

# Plot
sns.regplot(x='total_bill', y='tip', data=tips, line_kws={"color": "red"})
plt.title("Total Bill vs Tip")
plt.show()
```

---

### **7. Fit linear regression on synthetic data:**
```python
from sklearn.datasets import make_regression

# Generate data
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
model.fit(X, y)

# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.title("Regression Line")
plt.show()
```

---

### **8. Pickle a linear regression model:**
```python
import pickle

# Save model
with open('linear_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load model (only unpickle files from a source you trust)
with open('linear_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
```

---

### **9. Polynomial regression (degree 2):**
```python
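import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Polynomial features
poly_model = make_pipeline(PolynomialFeatures(2), LinearRegression())
poly_model.fit(X, y)

# Plot (sort X first so the fitted curve is drawn left to right instead of zigzagging)
X_sorted = np.sort(X, axis=0)
plt.scatter(X, y)
plt.plot(X_sorted, poly_model.predict(X_sorted), color='red')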
plt.title("Polynomial Regression (Degree 2)")
plt.show()
```

---

### **10. Generate synthetic data for linear regression:**
```python
# Fixed random_state so the printed coefficient is reproducible
X, y = make_regression(n_samples=200, n_features=1, noise=20, random_state=42)
model.fit(X, y)
print(f"Coefficient: {model.coef_}, Intercept: {model.intercept_}")
```

---

### **11. Compare polynomial regression models:**
```python
# R-squared is measured on the training data here, so it cannot decrease as the degree grows
for degree in range(1, 5):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    poly_model.fit(X, y)
    print(f"Degree {degree}: R-squared = {poly_model.score(X, y)}")
```

---

### **12. Fit multiple linear regression with two features:**
```python
X = diamonds[['carat', 'depth']]
y = diamonds['price']
model.fit(X, y)
print(f"Coefficients: {model.coef_}, Intercept: {model.intercept_}, R-squared: {model.score(X, y)}")
```

---

'''