### Machine Learning Regression: Theory + Practical (Pwskills Java + DSA)

---

## ✅ Theoretical Questions

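**1. What does R-squared represent in a regression model?**
R-squared measures the proportion of variance in the dependent variable that can be predicted from the independent variables. For an ordinary least squares model with an intercept, evaluated on its training data, it ranges from 0 to 1. On held-out data it can be negative when the model predicts worse than the mean of the target.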
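
The definition can be checked by hand. The following is a minimal sketch in plain Python; the function name `r_squared` and the toy numbers are illustrative assumptions, not part of the original notes. It computes R-squared as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy data (made up for illustration)
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 6.5, 9.5]   # close to the targets
bad_pred = [9.0, 7.0, 5.0, 3.0]    # worse than predicting the mean

r2_good = r_squared(y_true, good_pred)
r2_bad = r_squared(y_true, bad_pred)
print(f"good: {r2_good:.2f}, bad: {r2_bad:.2f}")
```

The second case shows that the score drops below zero once the predictions are worse than simply predicting the mean.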

**2. What are the assumptions of linear regression?**

* Linearity
* Independence of errors
* Homoscedasticity (equal variance)
* Normality of residuals
* No multicollinearity

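**3. Difference between R-squared and Adjusted R-squared:**

* R-squared never decreases when predictors are added, even if they are irrelevant.
* Adjusted R-squared corrects for the number of predictors, so it can fall when a new predictor adds little explanatory power. This penalizes overfitting.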

**4. Why use Mean Squared Error (MSE)?**
MSE penalizes large errors more than small ones and is easy to differentiate for optimization.

**5. What does Adjusted R-squared = 0.85 mean?**
85% of the variability in the dependent variable is explained by the model, adjusted for the number of predictors.

**6. How to check for normality of residuals?**

* Histogram or Q-Q plot
* Shapiro-Wilk test
* Skewness/Kurtosis
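
The skewness/kurtosis check can be sketched without any library. The code below is an illustrative assumption (function name and data are hypothetical) and uses the population moment formulas. For normally distributed residuals both values should be close to 0.

```python
def skewness_and_excess_kurtosis(residuals):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3

# Toy residuals (made up for illustration)
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [-1.0, -1.0, -1.0, -1.0, 4.0]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)
print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

A clearly positive skewness, as in the second sample, suggests a long right tail in the residuals.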

**7. What is multicollinearity?**
When predictors are highly correlated with each other, causing instability in coefficient estimation. It inflates standard errors.
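
One common way to quantify this is the variance inflation factor (VIF). The sketch below covers only the special case of exactly two predictors, where VIF = 1 / (1 - r^2) and r is the Pearson correlation between them. The helper names and the data are assumptions made for illustration.

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Toy predictors (made up for illustration)
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [2.0, 4.0, 6.0, 8.0, 11.0]    # almost 2 * x1
x2_unrelated = [2.0, 1.0, 3.0, 1.0, 2.0]     # uncorrelated with x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)
print(f"collinear: {vif_high:.1f}, unrelated: {vif_low:.1f}")
```

A VIF near 1 means the predictor shares little variance with the other one. Values above roughly 5 to 10 are a frequently quoted warning sign, though that threshold is a rule of thumb rather than a strict cutoff.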

**8. What is Mean Absolute Error (MAE)?**
MAE measures the average absolute difference between predicted and actual values. It is less sensitive to outliers than MSE because errors are not squared.

**9. Benefits of using ML pipeline:**

* Modular code
* Automation of preprocessing
* Reduces errors in repeated experiments
* Enables reproducibility

**10. Why is RMSE more interpretable than MSE?**
RMSE is in the same unit as the target variable, unlike MSE, which is in squared units.

**11. What is pickling in Python (ML)?**
Pickling serializes objects like models to a file so they can be reloaded later. Useful for deployment.
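
A minimal round trip with the standard-library `pickle` module might look like the sketch below. The dictionary `model_params` is a hypothetical stand-in for a fitted model, and the file name `model.pkl` is an assumption. A scikit-learn estimator can be pickled the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: learned parameters in a dict
model_params = {"coef": [2.5, -1.0], "intercept": 0.75}

def predict(params, features):
    """Linear prediction: intercept + sum(coef_i * feature_i)."""
    return params["intercept"] + sum(
        c * x for c, x in zip(params["coef"], features)
    )

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Serialize to disk
    with open(path, "wb") as f:
        pickle.dump(model_params, f)

    # Reload later, e.g. in a serving process
    with open(path, "rb") as f:
        restored = pickle.load(f)

prediction = predict(restored, [2.0, 1.0])
print(prediction)
```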

**12. High R-squared value means:**
The model explains a large proportion of the variance in the target, but this does not guarantee accurate predictions on new data or that the regression assumptions hold.

**13. What happens if regression assumptions are violated?**

* Biased or inefficient estimates
* Unreliable predictions
* Invalid statistical inference

**14. How to address multicollinearity?**

* Drop correlated features
* Use PCA or regularization (e.g., Ridge)
* Combine features

**G. How does feature selection help?**

* Reduces overfitting
* Speeds up training
* Improves interpretability
* Removes noise

**G. Adjusted R-squared formula:**

```
Adjusted R^2 = 1 - [(1 - R^2)(n - 1) / (n - k - 1)]
```

Where n = number of samples and k = number of predictors.
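
The formula translates directly into code. This is a small sketch with hypothetical names and made-up inputs. It shows how the same R-squared is penalized more heavily as the number of predictors grows.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof

# Same raw R^2 and sample size, different numbers of predictors
adj_few = adjusted_r_squared(0.90, n_samples=50, n_predictors=5)
adj_many = adjusted_r_squared(0.90, n_samples=50, n_predictors=20)
print(f"k=5: {adj_few:.4f}, k=20: {adj_many:.4f}")
```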

**17. Why is MSE sensitive to outliers?**
Because it squares the error, making large errors disproportionately affect the total loss.
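
A small numeric comparison makes the effect concrete. The sketch below uses made-up data and hand-written helpers (assumptions for illustration). It evaluates MSE, MAE, and RMSE on predictions with and without a single large error.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 12.0, 14.0, 16.0]
clean_pred = [11.0, 11.0, 15.0, 15.0]      # every error has size 1
outlier_pred = [11.0, 11.0, 15.0, 26.0]    # last error has size 10

mse_clean, mae_clean = mse(y_true, clean_pred), mae(y_true, clean_pred)
mse_out, mae_out = mse(y_true, outlier_pred), mae(y_true, outlier_pred)
rmse_out = math.sqrt(mse_out)

print(f"clean:   MSE={mse_clean}, MAE={mae_clean}")
print(f"outlier: MSE={mse_out}, MAE={mae_out}, RMSE={rmse_out:.3f}")
```

One bad prediction multiplies MSE by about 26 here, while MAE grows by a factor of 3.25.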

**G. Role of homoscedasticity:**
Equal variance of residuals across predictions ensures valid hypothesis testing. Violations lead to biased standard errors.

**G. What is RMSE?**
Square root of MSE. Indicates average prediction error in actual units.

**20. Why is pickling risky?**
Unpickling can execute arbitrary code, so loading a pickle file from an untrusted source is a security risk.
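
The mechanism can be demonstrated harmlessly. In the sketch below, the class `Payload` is hypothetical. Its `__reduce__` method tells `pickle` which callable to run at load time. Here that callable is `eval` on a harmless arithmetic string, but an attacker could substitute any function.

```python
import pickle

class Payload:
    """Hypothetical object that controls what runs during unpickling."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading runs the stored call; the result is not a Payload instance
loaded = pickle.loads(blob)
print(loaded)  # 42
```

The loader never checks what the callable does, which is why pickles should only be loaded from trusted sources.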

**21. Alternatives to pickling:**

* `joblib` (faster for large numpy arrays)
* `ONNX`
* `PMML`

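**22. What is heteroscedasticity?**
Unequal residual variance across the range of predictions. It violates the constant-variance assumption: OLS coefficient estimates remain unbiased but become inefficient, and the usual standard errors are biased, so hypothesis tests and confidence intervals become unreliable.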
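
A crude numeric check is to compare the residual variance for low and high predictions. The sketch below is an informal illustration with made-up data and hypothetical helper names. It is not a formal test such as Breusch-Pagan or Goldfeld-Quandt.

```python
def variance(values):
    """Population variance of a sequence."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_ratio(y_pred, residuals):
    """Residual variance in the upper half of predictions over the lower half."""
    pairs = sorted(zip(y_pred, residuals), key=lambda pair: pair[0])
    ordered = [r for _, r in pairs]
    half = len(ordered) // 2
    return variance(ordered[-half:]) / variance(ordered[:half])

y_pred = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
even_residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
fanning_residuals = [1.0, -1.0, 1.0, -1.0, 3.0, -3.0, 3.0, -3.0]

ratio_even = variance_ratio(y_pred, even_residuals)
ratio_fanning = variance_ratio(y_pred, fanning_residuals)
print(ratio_even, ratio_fanning)
```

A ratio near 1 is consistent with constant variance. A ratio far from 1, as in the fanning case, hints that the spread of the residuals changes with the prediction.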

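**23. How do interaction terms help?**
They model the combined effect of two or more variables on the outcome, so the effect of one variable can depend on the level of another. This captures non-additive relationships that a purely additive model misses.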
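
As a sketch, consider a model with two predictors and a product term. The coefficients below are made-up assumptions chosen for easy arithmetic. With the interaction coefficient `b3`, the effect of increasing `x1` by one unit is `b1 + b3 * x2`, so it changes with `x2`.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=0.5, b3=3.0):
    """y = b0 + b1*x1 + b2*x2 + b3*(x1*x2), with hypothetical coefficients."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2):
    """Change in the prediction when x1 goes from 0 to 1 at a fixed x2."""
    return predict(1.0, x2) - predict(0.0, x2)

effect_at_x2_0 = effect_of_x1(0.0)   # b1 only
effect_at_x2_1 = effect_of_x1(1.0)   # b1 + b3
print(effect_at_x2_0, effect_at_x2_1)
```

Without the `b3` term, both effects would equal `b1`.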

---

## ✅ Practical Tasks (Python Scripts Outline)

**1. Visualize residual distribution (diamonds dataset):**

```python
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Load diamonds dataset (seaborn downloads it on first use)
df = sns.load_dataset('diamonds')
df = df[['carat', 'price']].dropna()
X = df[['carat']]  # 2-D feature matrix, as scikit-learn expects
y = df['price']

# Fit on the full dataset and compute residuals (actual - predicted)
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)
residuals = y - y_pred

sns.histplot(residuals, kde=True)
plt.title("Residual Distribution")
plt.show()
```

**2. Calculate MSE, MAE, RMSE**

```python
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

mse = mean_squared_error(y, y_pred)
mae = mean_absolute_error(y, y_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse}, MAE: {mae}, RMSE: {rmse}")
```

**3. Check regression assumptions**

```python
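import seaborn as sns
import matplotlib.pyplot as plt

# Linearity and homoscedasticity: residuals should scatter evenly around 0
# (reuses y_pred and residuals from task 1)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red')
plt.title("Residuals vs Predictions")
plt.show()

# Multicollinearity: look for strongly correlated predictors.
# Here df holds only carat and price, so this shows just their correlation.
sns.heatmap(df.corr(), annot=True)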
plt.title("Correlation Matrix")
plt.show()
```

**4. ML pipeline with feature scaling**

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', Ridge())
])
# For regressors, cross_val_score reports R-squared per fold by default
scores = cross_val_score(pipe, X, y, cv=5)
print("Cross-validated R-squared scores:", scores)
```

**5. Simple regression model**

```python
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("R-squared:", model.score(X, y))
```

**6. Linear regression on 'tips' dataset**

```python
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.title / plt.show below
from sklearn.linear_model import LinearRegression

df = sns.load_dataset('tips')
X = df[['total_bill']]
y = df['tip']
model = LinearRegression().fit(X, y)

sns.regplot(x='total_bill', y='tip', data=df)
plt.title("Tip vs Total Bill")
plt.show()
```


