
# 01RAD Exercise 7 - team work

Authors: name1, name 2, name3


## Description of the Assignment

The dataset `Boston` contains a total of 506 records from towns in the suburbs of Boston, MA, USA. The data originates from the study by Harrison, D., and Rubinfeld, D.L. (1978), *Hedonic prices and the demand for clean air*, J. Environ. Economics and Management, 5, 81–102.

The dataset includes 14 variables. The goal is to explore the influence of 13 of them on the median value of owner-occupied homes (`medv`). Below is a description of the variables:

| Feature   | Description                                                                 |
|-----------|-----------------------------------------------------------------------------|
| `crim`    | Per capita crime rate by town                                              |
| `zn`      | Proportion of residential land zoned for lots over 25,000 sq.ft            |
| `indus`   | Proportion of non-retail business acres per town                           |
| `chas`    | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)      |
| `nox`     | Nitrogen oxides concentration (parts per 10 million)                       |
| `rm`      | Average number of rooms per dwelling                                       |
| `age`     | Proportion of owner-occupied units built prior to 1940                     |
| `dis`     | Weighted mean of distances to five Boston employment centres               |
| `rad`     | Index of accessibility to radial highways                                  |
| `tax`     | Full-value property-tax rate per \$10,000                                   |
| `ptratio` | Pupil-teacher ratio by town                                                 |
| `black_tra`   | $1000\left(\text{black_pop} - 0.63\right)^2$ where `black_pop` is the proportion of blacks by town       |
| `lstat`   | Lower status of the population (percent)                                   |
| `medv`    | Median value of owner-occupied homes in $1000s                             |

---

## Conditions and Scoring

- Collaboration in the team is allowed and recommended.
- This homework includes 14 questions.
- Submit the homework in the corresponding `.ipynb` file, via MS Teams by the next week.
---


In [None]:
# Import libraries
import pandas as pd
import numpy as np


In [None]:
# URL for the Boston housing dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"

# Reading the dataset
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Processing the dataset into features and target
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Column names
columns = [
    "crim", "zn", "indus", "chas", "nox", "rm", "age",
    "dis", "rad", "tax", "ptratio", "black_tra", "lstat"
]
boston_df = pd.DataFrame(data, columns=columns)
boston_df["medv"] = target


boston_df
boston_df.describe()


## Exploratory and Graphical Analysis

### Question 01

- Check for missing values and verify the dimensions of the dataset.
- Summarize the descriptive statistics of all variables.
- Plot a histogram and density estimate for the response variable `medv`.
- Examine the frequency table of `medv` values and discuss whether rounding, truncation, or other issues are present.
- Remove measurements deemed unreliable and discuss what this implies for the response model.
---





In [None]:
boston_df.info()
print(boston_df.describe())  # Descriptive statistics of all variables
print(boston_df.isnull().sum())  # Check for missing values
print(boston_df.shape)  # Dimensions: (rows, columns)

table = boston_df['medv'].value_counts().reset_index()
print(table)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
sns.histplot(boston_df['medv'], kde=True, bins=30, color='blue', edgecolor='black')
plt.title("Histogram and Density Estimate for MEDV")
plt.xlabel("MEDV")
plt.ylabel("Frequency")
plt.show()

In [None]:
print(boston_df['medv'].value_counts().sort_index())
unreliable_measurements = boston_df[boston_df['medv'] == 50].index
df_cleaned = boston_df.drop(unreliable_measurements)

print(f"Removed {len(unreliable_measurements)} unreliable measurements.")

plt.figure(figsize=(10, 6))
sns.histplot(df_cleaned['medv'], kde=True, bins=30, color='orange', edgecolor='black')
plt.title("Histogram and Density Estimate for MEDV")
plt.xlabel("MEDV")
plt.ylabel("Frequency")
plt.show()

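Observations with `medv` equal to 50 were removed. This is the maximum recorded value, and the concentration of records exactly at 50 suggests that higher values were censored (top-coded) rather than measured. The fitted model therefore describes only towns with `medv` below 50, and extrapolating it above that value is not reliable.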
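
A quick way to spot top-coding of this kind is to measure how many observations sit exactly at the maximum. The sketch below uses a small synthetic series, not the Boston data, and `share_at_max` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def share_at_max(values):
    """Return the maximum and the fraction of observations equal to it."""
    values = np.asarray(values, dtype=float)
    top = values.max()
    return top, float(np.mean(values == top))

# Synthetic example: true values above 50 are recorded as exactly 50.
rng = np.random.default_rng(0)
true_values = rng.normal(loc=40, scale=10, size=1000)
recorded = np.minimum(true_values, 50.0)

top, share = share_at_max(recorded)
print(f"max = {top}, share at max = {share:.3f}")
```

For a continuous quantity, a noticeable share of observations exactly at the maximum is a hint that the values were capped.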

## Simple Regression Model: Median Price and Crime

### Question 2

- Build a simple linear regression model to examine if the crime rate (`crim`) affects the median value of homes (`medv`).
- If there is an effect, determine how much the housing price decreases as the crime rate increases.
---



In [None]:
import statsmodels.formula.api as smf
formula = 'medv ~ (crim)'

# Fit a simple OLS model (intercept and crim, no interactions)
model = smf.ols(formula, data=df_cleaned).fit()

# Display model summary
print(model.summary())

The intercept is the expected median home value when the crime rate is 0, and the slope describes how that value changes as the crime rate increases. The $R^2$ statistic is quite low, although for raw data and a single predictor it is not bad.
For every one-unit increase in the crime rate, the median house value drops by roughly \$400 (a slope of about $-0.4$, with `medv` measured in \$1000s).


### Question 3

- Experiment with power and logarithmic transformations of the response variable (`medv`).
- To find the optimal power transformation, plot the log-likelihood profile for the Box-Cox transformation and compare it with a logarithmic transformation.

---

In [None]:
df_cleaned['log_medv'] = np.log(df_cleaned['medv'])
sns.histplot(df_cleaned['log_medv'], kde=True, bins=30, color='blue', edgecolor='black')
plt.title("Histogram of Log-Transformed MEDV")
plt.xlabel("log(MEDV)")
plt.ylabel("Frequency")
plt.show()

In [None]:
from scipy.stats import boxcox
from scipy.stats import boxcox_llf
from scipy.stats import chi2
import scipy.stats as stats

df_cleaned['medv_positive'] = df_cleaned['medv']
boxcox_transformed, lambda_opt = boxcox(df_cleaned['medv_positive'])

df_cleaned['boxcox_medv'] = boxcox_transformed

print(f"Optimal lambda for Box-Cox Transformation: {lambda_opt}")

lambda_values = np.linspace(-2, 2, 100)

# Store the profile as an array so it can be compared with the threshold elementwise
log_likelihoods = np.array([boxcox_llf(lmb, df_cleaned['medv_positive']) for lmb in lambda_values])

max_log_likelihood = log_likelihoods.max()

# Likelihood-ratio cutoff for an approximate 95% confidence interval
threshold = max_log_likelihood - chi2.ppf(0.95, df=1) / 2

lambda_conf_interval = lambda_values[log_likelihoods >= threshold]

lambda_low = lambda_conf_interval.min()
lambda_high = lambda_conf_interval.max()

log_likelihoods_function = stats.boxcox_llf(0, df_cleaned['medv_positive'])

plt.figure(figsize=(10, 6))
plt.plot(lambda_values, log_likelihoods, label='Log-Likelihood')
plt.axvline(lambda_opt, color='red', linestyle='--', label=f"Optimal lambda = {lambda_opt:.2f}")
plt.axvline(lambda_low, color="green", linestyle="--", label=f"95% CI Lower = {lambda_low:.2f}")
plt.axvline(lambda_high, color="blue", linestyle="--", label=f"95% CI Upper = {lambda_high:.2f}")
plt.axhline(threshold, color="gray", linestyle="--", label="95% CI Threshold")
plt.scatter([0], [log_likelihoods_function], color='orange', label=f"lambda = {0}")
plt.title("Log-Likelihood Profile for Box-Cox Transformation")
plt.xlabel("Lambda")
plt.ylabel("Log-Likelihood")
plt.legend()
plt.grid()
plt.show()

In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)
sns.histplot(df_cleaned['log_medv'], kde=True, color='blue', bins=30)
plt.title("Logarithmic Transformation (lambda = 0)")
plt.xlabel("log(MEDV)")

plt.subplot(1, 2, 2)
sns.histplot(df_cleaned['boxcox_medv'], kde=True, color='green', bins=30)
plt.title(f"Box-Cox Transformation (Optimal lambda = {lambda_opt:.2f})")
plt.xlabel("Box-Cox Transformed MEDV")

plt.tight_layout()
plt.show()

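The estimated optimal lambda is close to 0.4. Lambda 0.5 (a square-root transformation) was chosen as a nearby value that is easier to apply and interpret. Note that `boxcox_medv`, which is used in the models below, was computed with the estimated optimal lambda.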
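
To make the link between Box-Cox and the familiar transformations concrete, the following sketch applies `scipy.stats.boxcox` with a fixed lambda to a few illustrative values (made up, not taken from the dataset). It checks that lambda = 0.5 is a shifted and scaled square root and that lambda = 0 is the natural logarithm.

```python
import numpy as np
from scipy.stats import boxcox

y = np.array([5.0, 10.0, 20.0, 35.0, 48.0])  # illustrative positive values

# Box-Cox with a fixed lambda: (y**lam - 1) / lam
bc_half = boxcox(y, lmbda=0.5)
sqrt_version = 2.0 * (np.sqrt(y) - 1.0)

# With lambda = 0 the transformation is the natural logarithm
bc_zero = boxcox(y, lmbda=0.0)

print(np.allclose(bc_half, sqrt_version))
print(np.allclose(bc_zero, np.log(y)))
```

Because the lambda = 0.5 version differs from the square root only by a linear rescaling, a regression on it has the same fit quality as a regression on `sqrt(medv)`.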

In [None]:
import statsmodels.api as sm

X = sm.add_constant(df_cleaned['crim'])

model_log = sm.OLS(df_cleaned['log_medv'], X).fit()
model_boxcox = sm.OLS(df_cleaned['boxcox_medv'], X).fit()

print("Logarithmic Transformation Model Summary:")
print(model_log.summary())

print(f"\nBox-Cox Transformation with lambda = {lambda_opt:.2f} Model Summary:")
print(model_boxcox.summary())


In [None]:
plt.figure(figsize=(14, 6))

plt.subplot(1, 2, 1)

model_log_resid = model_log.resid
sns.histplot(model_log_resid, kde=True, color='blue')
plt.title("Residuals of Log-Transformed Model")

model_boxcox_resid = model_boxcox.resid
plt.subplot(1, 2, 2)
sns.histplot(model_boxcox_resid, kde=True, color='green')
plt.title("Residuals of Box-Cox-Transformed Model")
plt.tight_layout()
plt.show()


# Calculate studentized residuals for each model
residuals_log_student = model_log.get_influence().resid_studentized_internal
residuals_boxcox_student = model_boxcox.get_influence().resid_studentized_internal


residuals_df = pd.DataFrame({
    'Logarithmic': residuals_log_student,
    'Box-Cox': residuals_boxcox_student
})
fig, axes = plt.subplots(1,2, figsize=(12, 5))
axes = axes.flatten()

for i, col in enumerate(residuals_df.columns):
    axes[i].plot(residuals_df.index, residuals_df[col], marker='o', linestyle='', alpha=0.5)
    axes[i].axhline(0, color='gray', linestyle='--')
    axes[i].set_title(f"Studentized Residuals: {col} transformation")
    axes[i].set_xlabel("Index")
    axes[i].set_ylabel("Studentized Residuals")

plt.tight_layout()
plt.show()

# QQ-plots for studentized residuals
fig, axes = plt.subplots(1,2,  figsize=(12, 5))
for i, col in enumerate(residuals_df.columns):
    sm.qqplot(residuals_df[col], line='45', ax=axes[i])
    axes[i].set_title(f"QQ-plot: {col} transformation")

plt.tight_layout()
plt.show()

### Question 4

- Based on the simple linear model and on the model with logarithmic transformations of the response variable, estimate the increase or decrease in housing prices for a one-unit change in the crime rate (`crim`).
- Provide the correct interpretation from both models.
---

The price change under the model with only an intercept and the crime rate, without transforming the response, was explained in Question 2. After the logarithmic transformation of the response, the crime rate coefficient has to be interpreted as a percentage change, calculated as $100\left(e^{\beta_{\text{crim}}} - 1\right)$,
because the fitted values need to be transformed back with the exponential function, which turns the additive effect on `log(medv)` into a proportional change in `medv`.
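
The back-transformation can be sketched in a few lines. The coefficient values below are illustrative, and `percent_change` is a hypothetical helper name.

```python
import math

def percent_change(beta):
    """Percentage change in the response for a one-unit increase in the
    predictor when the response is log-transformed."""
    return 100.0 * (math.exp(beta) - 1.0)

print(round(percent_change(-0.02), 2))  # -1.98
print(round(percent_change(-0.01), 2))  # -1.0
```

For small coefficients the result is close to $100\,\beta$, which is why a log-model slope is often read directly as an approximate percentage.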

### Question 5

- Keep the logarithmic transformation of the response (`medv`) and try transforming the independent variable (`crim`).
- Use techniques such as piecewise constant transformations, or polynomial transformations (quadratic and cubic).
- Use information from plots such as Component-Residual Plots (Partial Residual Plots) and Partial Regression Plots to guide your transformations.
- Discuss whether these models can be compared using an F-test. If applicable, perform the test and interpret the results.
---

In [None]:
df_cleaned['crim_binned'] = pd.cut(df_cleaned['crim'], bins=[-np.inf, 1, 5, 10, np.inf], labels=['low', 'medium', 'high', 'very_high'])

crim_dummies = pd.get_dummies(df_cleaned['crim_binned'], drop_first=True)
X_piecewise = sm.add_constant(crim_dummies)

model_piecewise = sm.OLS(df_cleaned['log_medv'], X_piecewise.astype(float)).fit()
print(model_piecewise.summary())

I am not sure what this means.
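
One way to read a piecewise constant model: the intercept is the mean response in the baseline bin, and each dummy coefficient is the difference between the mean of its bin and the baseline mean. The sketch below checks this on synthetic data (hypothetical values, not the Boston data), using the same cut points as above and plain NumPy least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic predictor and response (hypothetical data)
x = rng.uniform(0, 20, size=300)
y = 3.2 - 0.03 * x + rng.normal(scale=0.2, size=300)

# Right-closed bins: (-inf, 1], (1, 5], (5, 10], (10, inf)
edges = [1, 5, 10]
bin_idx = np.digitize(x, edges, right=True)  # 0 = low, ..., 3 = very_high

# Design matrix: intercept plus dummies for every bin except the baseline
X = np.column_stack(
    [np.ones_like(x)] + [(bin_idx == k).astype(float) for k in (1, 2, 3)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

group_means = np.array([y[bin_idx == k].mean() for k in range(4)])
print("intercept:", coef[0], "| mean of baseline bin:", group_means[0])
print("intercept + dummy coefficients:", coef[0] + coef[1:])
print("means of the remaining bins:   ", group_means[1:])
```

With a log-transformed response, each dummy coefficient therefore describes the approximate proportional difference in median value between a crime-rate bin and the lowest bin.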

In [None]:
df_cleaned['crim_squared'] = df_cleaned['crim'] ** 2
df_cleaned['crim_cubed'] = df_cleaned['crim'] ** 3

X_poly2 = sm.add_constant(df_cleaned[['crim', 'crim_squared']])
model_poly2 = sm.OLS(df_cleaned['log_medv'], X_poly2.astype(float)).fit()

X_poly3 = sm.add_constant(df_cleaned[['crim', 'crim_squared', 'crim_cubed']])
model_poly3 = sm.OLS(df_cleaned['log_medv'], X_poly3.astype(float)).fit()

print("Quadratic Model Summary:")
print(model_poly2.summary())

print("\nCubic Model Summary:")
print(model_poly3.summary())

Even the higher-order terms appear to be statistically significant, and both $R^2$ and adjusted $R^2$ improved when the cubic term was added. The other statistics remained roughly the same.

In [None]:
from statsmodels.graphics.regressionplots import plot_ccpr
from statsmodels.graphics.regressionplots import plot_partregress

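# Component-residual plots for crim, one panel per model
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
plot_ccpr(model_poly2, 'crim', ax=axes[0])
axes[0].set_title("Component-Residual Plot for crim (quadratic model)")
plot_ccpr(model_poly3, 'crim', ax=axes[1])
axes[1].set_title("Component-Residual Plot for crim (cubic model)")
plt.tight_layout()
plt.show()

# Partial regression plots for crim, adjusted for the other terms of each model
fig, axes = plt.subplots(1, 2, figsize=(18, 8))
plot_partregress('log_medv', 'crim', ['crim_squared'], data=df_cleaned, ax=axes[0])
axes[0].set_title("Partial Regression Plot for crim (quadratic model)")
plot_partregress('log_medv', 'crim', ['crim_squared', 'crim_cubed'], data=df_cleaned, ax=axes[1])
axes[1].set_title("Partial Regression Plot for crim (cubic model)")
plt.tight_layout()
plt.show()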

In [None]:
sm.graphics.plot_partregress_grid(model_poly2)
sm.graphics.plot_partregress_grid(model_poly3)
plt.show()

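The linear, quadratic, and cubic models are nested (each lower-order model is the higher-order one with the extra coefficients set to zero) and share the response `log_medv`, so they can be compared using an F-test. Null hypothesis: the lower-order model is sufficient. Alternative: the higher-order terms improve the fit. The squared and cubed crime terms were individually significant, and the F-test checks their contribution jointly. The piecewise constant model is not nested in the polynomial models, so it cannot be compared with them this way.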
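
The F-statistic for two nested least-squares models can also be computed by hand from their residual sums of squares. The sketch below does this on synthetic data with NumPy and SciPy; the data and the helper name `ssr` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

def ssr(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 5, size=n)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.3, size=n)  # truly quadratic

X_restricted = np.column_stack([np.ones(n), x])   # linear model
X_full = np.column_stack([np.ones(n), x, x**2])   # quadratic model

ssr_r, ssr_f = ssr(X_restricted, y), ssr(X_full, y)
df_diff = X_full.shape[1] - X_restricted.shape[1]  # number of extra terms
df_resid = n - X_full.shape[1]                     # residual df of the full model

F = ((ssr_r - ssr_f) / df_diff) / (ssr_f / df_resid)
p_value = f_dist.sf(F, df_diff, df_resid)
print(f"F = {F:.2f}, p-value = {p_value:.3g}")
```

The formula only makes sense when both models are fitted to the same response and the restricted design is a subset of the full one.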

In [None]:
# Compare linear model to quadratic model (both fitted to log_medv, so they are nested)
f_test_result = model_poly2.compare_f_test(model_log)
print(f"F-test result (linear vs. quadratic): {f_test_result}")

# Compare linear to cubic model
f_test_result2 = model_poly3.compare_f_test(model_log)
print(f"F-test result (linear vs. cubic): {f_test_result2}")

Each result is a tuple (F statistic, p-value, difference in degrees of freedom). Small p-values indicate that the quadratic and cubic models fit significantly better than the linear one; the improvement from the quadratic to the cubic model, however, is not large.

### Question 6

- Select one of the previous models, justify your choice, and validate it using the appropriate hypothesis tests for residuals (normality, homoscedasticity, etc.).
- Use diagnostic plots such as Q-Q plots, residuals vs. fitted values, and others to evaluate the model's assumptions.
---


In [None]:
# Compare quadratic to cubic
f_test_result3 = model_poly3.compare_f_test(model_poly2)
print(f"F-test result (quadratic vs. cubic): {f_test_result3}")

In [None]:
# Residuals vs. Fitted Values
plt.figure(figsize=(8, 6))
residuals_poly2 = model_poly2.resid
sns.residplot(x=model_poly2.fittedvalues, y=residuals_poly2, lowess=True, line_kws={'color': 'red'})
plt.title('Residuals for quadratic model vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()

In [None]:
# Residuals vs. Fitted Values
plt.figure(figsize=(8, 6))
residuals_poly3 = model_poly3.resid
sns.residplot(x=model_poly3.fittedvalues, y=residuals_poly3, lowess=True, line_kws={'color': 'red'})
plt.title('Residuals for cubic model vs Fitted Values')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()

In [None]:
import scipy.stats as stats

# Q-Q plot
plt.figure(figsize=(8, 6))
stats.probplot(residuals_poly2, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()

In [None]:
import scipy.stats as stats

# Q-Q plot
plt.figure(figsize=(8, 6))
stats.probplot(residuals_poly3, dist="norm", plot=plt)
plt.title('Q-Q Plot')
plt.show()

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

X_quad_const = sm.add_constant(df_cleaned[['crim', 'crim_squared']])

bp_test_stat, bp_test_p_value, _, _ = het_breuschpagan(residuals_poly2, X_quad_const)
print(f"Breusch-Pagan test statistic (quadratic model): {bp_test_stat}, p-value: {bp_test_p_value}")

# The cubic model needs its own design matrix, including crim_cubed
X_cubic_const = sm.add_constant(df_cleaned[['crim', 'crim_squared', 'crim_cubed']])

bp_test_stat, bp_test_p_value, _, _ = het_breuschpagan(residuals_poly3, X_cubic_const)
print(f"Breusch-Pagan test statistic (cubic model): {bp_test_stat}, p-value: {bp_test_p_value}")

Judging by the residual plots alone, there is no significant heteroskedasticity. The Breusch-Pagan test was added for completeness; its null hypothesis is homoskedasticity, and we failed to reject it. The Durbin-Watson statistic for both models was around 0.8. Values near 2 indicate no autocorrelation, while values well below 2 (approaching 0) indicate positive autocorrelation, so 0.8 suggests some positive autocorrelation of the residuals in the order of the rows.

The D-W statistics, Q-Q plots, and residual plots are largely the same for the quadratic and the cubic model. $R^2$ did not improve much, and although the F-test indicates that the cubic term is statistically significant, the improvement is not worth the added complexity. I would pick the quadratic model, since it is simpler, its interpretation is more straightforward, and it still fits well.
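
The scale of the Durbin-Watson statistic is easier to judge with a small experiment. The sketch below computes it by hand for synthetic white noise and for a synthetic AR(1) series with strong positive autocorrelation; `durbin_watson_stat` is a hypothetical helper that follows the textbook definition.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(3)
white_noise = rng.normal(size=5000)

# AR(1) series with strong positive autocorrelation (rho = 0.9)
ar1 = np.zeros(5000)
for t in range(1, 5000):
    ar1[t] = 0.9 * ar1[t - 1] + white_noise[t]

dw_white = durbin_watson_stat(white_noise)
dw_ar1 = durbin_watson_stat(ar1)
print(f"white noise: {dw_white:.2f}, AR(1) with rho = 0.9: {dw_ar1:.2f}")
```

The statistic is approximately $2(1 - \rho)$, where $\rho$ is the lag-one autocorrelation of the residuals, so it depends on the order of the rows in the data.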


## Multivariate Regression Model

### Question 7

- Build a multivariate linear regression model with a logarithmic transformation of the response (`medv`).
- Explore relationships between housing prices and other independent variables in an additive model (no interactions).
- Use criteria such as AIC, BIC, $ R^2 $, and F-statistics to select the best model.
- Investigate whether the relationship between `crim` and `medv` can be explained by other variables, such as proximity to highways or pollution levels.
---

In [None]:
df_cleaned.head()

In [None]:
df_cleaned['log_medv'] = np.log(df_cleaned['medv'])

X = df_cleaned[['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black_tra', 'lstat']]

# Adds constant term to fit intercept
X = sm.add_constant(X)

model = sm.OLS(df_cleaned['log_medv'], X).fit()

print(model.summary())

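`zn`, `indus`, `chas` and `age` do not appear to be statistically significant in the model without interactions. $R^2$ is quite high at 0.8. The D-W statistic is near 1, which is well below the value of 2 expected without autocorrelation and therefore points to some positive autocorrelation of the residuals.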

In [None]:
X = df_cleaned[['crim', 'nox', 'rm', 'dis', 'rad', 'tax', 'ptratio', 'black_tra', 'lstat']]

# Adds constant term to fit intercept
X = sm.add_constant(X)

model = sm.OLS(df_cleaned['log_medv'], X).fit()

print(model.summary())

Both AIC and BIC are larger for this reduced model (lower values indicate the preferred model). All remaining variables are now statistically significant, and $R^2$ and adjusted $R^2$ did not drop. D-W remains the same, while the overall F-statistic increased, which is expected when insignificant predictors are removed and does not by itself show that the reduced model is better.
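
AIC and BIC can be reproduced from the residual sum of squares under a Gaussian likelihood. The sketch below uses synthetic data and a hypothetical helper `gaussian_aic_bic`. It counts only the regression coefficients as parameters, which is an assumption about the convention (some software also counts the error variance).

```python
import numpy as np

def gaussian_aic_bic(X, y):
    """AIC and BIC of a least-squares fit under a Gaussian likelihood."""
    n, k = X.shape
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ssr = float(resid @ resid)
    # Maximized Gaussian log-likelihood with sigma^2 estimated as ssr / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return -2 * llf + 2 * k, -2 * llf + k * np.log(n)

rng = np.random.default_rng(5)
n = 300
x1 = rng.normal(size=n)
noise_cols = rng.normal(size=(n, 3))  # predictors unrelated to y
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_cols])

aic_small, bic_small = gaussian_aic_bic(X_small, y)
aic_big, bic_big = gaussian_aic_bic(X_big, y)
print(f"small model: AIC = {aic_small:.1f}, BIC = {bic_small:.1f}")
print(f"big model:   AIC = {aic_big:.1f}, BIC = {bic_big:.1f}")
```

Both criteria reward a smaller residual sum of squares and penalize extra parameters; BIC penalizes each parameter by $\ln n$ instead of 2, so it tends to favor smaller models.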

### Question 8

- Incorporate `crim` (crime rate) into the final model and compare how its influence on the median housing price differs from the simple regression model with a logarithmic transformation of the response (from Question 4).
- Estimate the reduction in median housing price for a one-unit increase in the crime rate per 1,000 residents.
---

In [None]:
X_simple = sm.add_constant(df_cleaned[['crim']])
model_simple = sm.OLS(df_cleaned['log_medv'], X_simple).fit()
print(model_simple.summary())

Percentage change in median house value when crime rate per 1000 residents changes by one unit: $100\left(e^{\beta_{\text{crim}}} - 1\right) = 100\left(e^{-0.02} - 1\right) \approx -1.98\%$

In [None]:
X_multivariate = df_cleaned[['crim', 'nox', 'dis', 'rad', 'zn', 'indus', 'chas', 'rm', 'age', 'ptratio', 'black_tra', 'lstat']]
X_multivariate = sm.add_constant(X_multivariate)

model_multivariate = sm.OLS(df_cleaned['log_medv'], X_multivariate).fit()
print(model_multivariate.summary())

Percentage change in median house value in the multivariate model, when crime rate per 1000 residents changes by one unit: $100\left(e^{\beta_{\text{crim}}} - 1\right) = 100\left(e^{-0.01} - 1\right) \approx -1\%$

In [None]:
selected_vars = df_cleaned[['crim', 'nox', 'dis', 'rad', 'log_medv']]
correlation_matrix = selected_vars.corr()
print(correlation_matrix)

plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

### Question 9

- Present your final predictive model for `medv` and discuss the key parameters such as $ R^2 $, $ \sigma $, and F-statistics.
- Compare the final model with the simple linear model from Question 6. Discuss how these parameters have changed and whether this change was expected.
- Validate the model both graphically and using hypothesis tests.
---

In [None]:
X_no_nox_dis_rad = df_cleaned[['crim', 'zn', 'indus', 'chas', 'rm', 'age', 'ptratio', 'black_tra', 'lstat']]
# Keep the intercept so that this model is nested in model_multivariate
X_no_nox_dis_rad = sm.add_constant(X_no_nox_dis_rad)
model_no_nox_dis_rad = sm.OLS(df_cleaned['log_medv'], X_no_nox_dis_rad).fit()
f_test_result = model_multivariate.compare_f_test(model_no_nox_dis_rad)
print(f"F-statistic: {f_test_result[0]:.4f}, p-value: {f_test_result[1]:.4f}")
print("===============================================")
print(model_no_nox_dis_rad.summary())

A small p-value means that we reject the null hypothesis that the simpler model is as good as the full model, i.e. `nox`, `dis` and `rad` jointly improve the fit. The reduced model must contain the intercept: without it the two models are not nested, and statsmodels reports an uncentered $R^2$ that is not comparable with the $R^2$ of the other models.

In [None]:
def residual_sigma(fitted_model):
    """Residual standard error: sqrt(SSE / (n - p)), with p counting the intercept."""
    sse = np.sum(fitted_model.resid**2)
    n_obs = len(fitted_model.model.endog)
    n_params = fitted_model.df_model + 1
    return np.sqrt(sse / (n_obs - n_params))

# Simple linear model from Question 8: log_medv ~ crim
residuals = model_simple.resid
sigma = residual_sigma(model_simple)

residuals_mul = model_multivariate.resid
sigma_mul = residual_sigma(model_multivariate)

residuals_simpler = model_no_nox_dis_rad.resid
sigma_simpler = residual_sigma(model_no_nox_dis_rad)

print(f"Standard Error of Residuals (sigma) for Linear Model: {sigma}")
print(f"R2 for Linear Model: {model_simple.rsquared:.3f}")
print(f"F-stat for Linear Model: {model_simple.fvalue:.0f}")
print("===============")

print(f"Standard Error of Residuals (sigma) for Multivariate Model: {sigma_mul}")
print(f"R2 for Multivariate Model: {model_multivariate.rsquared:.3f}")
print(f"F-stat for Multivariate Model: {model_multivariate.fvalue:.0f}")

print("===============")
print("For completeness we include the model without variables: nox, dis, rad")
print(f"Standard Error of Residuals (sigma) for Simple Model: {sigma_simpler}")
print(f"R2 for Simple Model: {model_no_nox_dis_rad.rsquared:.3f}")
print(f"F-stat for Simple Model: {model_no_nox_dis_rad.fvalue:.0f}")

The simple linear model captures only about 32% of the variability in the data, whereas the multivariate model explains about 78%. The model without `nox`, `dis` and `rad` is included only for comparison with the full multivariate model.
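
An $R^2$ very close to 1 can be an artifact of fitting a model without an intercept, because the total sum of squares is then measured around zero instead of around the mean. The sketch below shows the effect on synthetic data (hypothetical values) using plain NumPy.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(4, 9, size=n)                       # strictly positive predictor
y = 3.0 + 0.02 * x + rng.normal(scale=0.3, size=n)  # response with a large mean

# Least-squares fit WITHOUT an intercept column
X_no_const = x.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
resid = y - X_no_const @ coef
ssr = float(resid @ resid)

r2_uncentered = 1 - ssr / float(y @ y)                      # total SS around zero
r2_centered = 1 - ssr / float(((y - y.mean()) ** 2).sum())  # total SS around the mean
print(f"uncentered R2 = {r2_uncentered:.3f}, centered R2 = {r2_centered:.3f}")
```

The uncentered value looks impressive only because the response is far from zero, which is why $R^2$ values should be compared only between models that all include an intercept.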

In [None]:
plt.figure(figsize=(8, 6))
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot, Linear Model')
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
stats.probplot(residuals_mul, dist="norm", plot=plt)
plt.title('Q-Q Plot, Multivariate Model')
plt.show()

In [None]:
plt.scatter(model_simple.fittedvalues, model_simple.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs. Fitted Values, Linear Model")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

In [None]:
plt.scatter(model_multivariate.fittedvalues, model_multivariate.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs. Fitted Values, Multivariate Model")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

In [None]:
# "Linear Model" is the simple model log_medv ~ crim (model_simple)
shapiro_test_linear_model = stats.shapiro(model_simple.resid)
print("Shapiro-Wilk test statistic for Linear Model:", shapiro_test_linear_model.statistic, "p-value:", shapiro_test_linear_model.pvalue)

shapiro_test_mul = stats.shapiro(model_multivariate.resid)
print("Shapiro-Wilk test statistic for Multivariate Model:", shapiro_test_mul.statistic, "p-value:", shapiro_test_mul.pvalue)

import statsmodels.stats.api as sms
bp_test = sms.het_breuschpagan(model_simple.resid, model_simple.model.exog)
print("Breusch-Pagan test statistic for Linear Model:", bp_test[0], "p-value:", bp_test[1])

bp_test = sms.het_breuschpagan(model_multivariate.resid, model_multivariate.model.exog)
print("Breusch-Pagan test statistic for Multivariate Model:", bp_test[0], "p-value:", bp_test[1])
dw_test = sm.stats.durbin_watson(model_simple.resid)
print("Durbin-Watson test statistic for Linear Model:", dw_test)
dw_test = sm.stats.durbin_watson(model_multivariate.resid)
print("Durbin-Watson test statistic for Multivariate Model:", dw_test)

Both the linear model and the multivariate model show signs of problems with normality and heteroscedasticity.

### Question 10

- Based on your final model, answer whether reducing the crime rate in an area would lead to an increase in housing prices in that area.
- Provide an explanation based on your findings.
---

In [None]:
df_cleaned['crim_squared'] = df_cleaned['crim'] ** 2
X_multivariate_square_crim = df_cleaned[['crim', 'nox', 'dis', 'rad', 'rm', 'ptratio', 'black_tra', 'lstat', 'crim_squared']]
X_multivariate_square_crim = sm.add_constant(X_multivariate_square_crim)

model_multivariate_square_crim = sm.OLS(df_cleaned['log_medv'], X_multivariate_square_crim).fit()
print(model_multivariate_square_crim.summary())

shapiro_test_mul_sq = stats.shapiro(model_multivariate_square_crim.resid)
print("Shapiro-Wilk test statistic for Multivariate Model with Square crim:", shapiro_test_mul_sq.statistic, "p-value:", shapiro_test_mul_sq.pvalue)

bp_test_mul_sq = sms.het_breuschpagan(model_multivariate_square_crim.resid, model_multivariate_square_crim.model.exog)
print("Breusch-Pagan test statistic for Multivariate Model with Square crim:", bp_test_mul_sq[0], "p-value:", bp_test_mul_sq[1])
dw_test_mul_sq = sm.stats.durbin_watson(model_multivariate_square_crim.resid)
print("Durbin-Watson test statistic for Multivariate Model with Square crim:", dw_test_mul_sq)

residuals_mul_sq = model_multivariate_square_crim.resid
plt.figure(figsize=(8, 6))
stats.probplot(residuals_mul_sq, dist="norm", plot=plt)
plt.title('Q-Q Plot, Multivariate Model with Squared crim')
plt.show()

plt.scatter(model_multivariate_square_crim.fittedvalues,model_multivariate_square_crim.resid)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Residuals vs. Fitted Values, Multivariate Model with Square crim")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()


### Question 11: Compare Coefficients in Simple Models

Investigate whether the transformation of `black_pop` into `black_tra` was misleading and suggestive. Add a new variable `black_pop` to the data frame by inverting the original transformation.

- Build two separate simple linear regression models:
  1. Predicting `medv` using `black_tra`.
  2. Predicting `medv` using `black_pop`.
- Compare the coefficients from both models and interpret the differences.
- Discuss whether the transformation of `black_tra` appears to exaggerate or diminish its relationship with `medv`.
---

In [None]:
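# The transformation is not one-to-one; the lower branch keeps black_pop in [0, 0.63], i.e. a valid proportion
df_cleaned['black_pop'] = 0.63 - np.sqrt(df_cleaned['black_tra'] / 1000)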
X_black_tra = sm.add_constant(df_cleaned['black_tra'])
y_medv = df_cleaned['log_medv']
model_black_tra = sm.OLS(y_medv, X_black_tra).fit()

X_black_pop = sm.add_constant(df_cleaned['black_pop'])
model_black_pop = sm.OLS(y_medv, X_black_pop).fit()

# Print the summary for both models
print("Model 1 (black_tra):\n", model_black_tra.summary())
print("Model 2 (black_pop):\n", model_black_pop.summary())

The transformation of `black_pop` into `black_tra` diminishes the relationship with `medv`, because it introduces non-linearity and compresses the range of values. That is why I think the transformation is unnecessary and that `black_tra` has less direct explainability than `black_pop`.
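
A further complication is that the transformation is not one-to-one, so it cannot be inverted without choosing a branch. The sketch below illustrates this with hypothetical helper functions and an illustrative proportion of 0.10.

```python
import numpy as np

def black_tra_from_pop(p):
    """Forward transformation: 1000 * (p - 0.63)^2."""
    return 1000.0 * (p - 0.63) ** 2

def black_pop_candidates(b):
    """Both solutions of 1000 * (p - 0.63)^2 = b."""
    root = np.sqrt(b / 1000.0)
    return 0.63 - root, 0.63 + root

b = black_tra_from_pop(0.10)
low, high = black_pop_candidates(b)
print(f"black_tra = {b:.1f} -> candidates {low:.2f} and {high:.2f}")

# Both candidates map back to the same transformed value
print(np.isclose(black_tra_from_pop(low), black_tra_from_pop(high)))
```

Here the upper candidate exceeds 1 and cannot be a proportion, so only the lower branch is admissible. For small values of `black_tra` both candidates lie in $[0, 1]$ and the original proportion cannot be recovered from the transformed value alone.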

In [None]:
# Response for the stepwise regressions (untransformed medv)
y = df_cleaned['medv']

X_all = df_cleaned.drop(columns=['log_medv', 'black_pop', 'crim_binned', 'medv_positive', 'boxcox_medv', 'medv', 'crim_cubed'])
X_all = sm.add_constant(X_all)
print(X_all.head())

`crim_binned` had to be removed, since it is a categorical variable that is not compatible with the code.



### Question 12: Stepwise Regression with `black_tra`

- Perform stepwise regression starting with all independent variables, including `black_tra`, as predictors of `medv`.
- Evaluate whether `black_tra` remains significant in the final model after stepwise variable selection.
- Discuss whether its significance changes when considered alongside other predictors.
---

In [None]:
def backward_elimination_aic_bic(X, y, criterion="AIC"):
    """Backward elimination: drop the least significant feature while the criterion improves."""
    included = list(X.columns)
    best_metric = float("inf")
    best_model = None
    best_included = list(included)

    while True:
        model_stepwise = sm.OLS(y, sm.add_constant(X[included])).fit()
        current_metric = model_stepwise.aic if criterion.upper() == "AIC" else model_stepwise.bic

        if current_metric < best_metric:
            best_metric = current_metric
            best_model = model_stepwise
            best_included = list(included)  # features that belong to best_model
        else:
            break
        # Skip the constant (first entry) when looking for the least significant feature
        pvalues = model_stepwise.pvalues.iloc[1:]
        worst_feature = pvalues.idxmax()

        if pvalues[worst_feature] > 0.05:
            print(f"Removing '{worst_feature}' with p-value {pvalues[worst_feature]:.4f} and {criterion} {current_metric:.2f}")
            included.remove(worst_feature)
        else:
            break

    return best_model, best_included

model_backward_aic, selected_features_aic = backward_elimination_aic_bic(X_all, y, criterion="AIC")
print("Selected features based on AIC:", selected_features_aic)
print(model_backward_aic.summary())

model_backward_bic, selected_features_bic = backward_elimination_aic_bic(X_all, y, criterion="BIC")
print("Selected features based on BIC:", selected_features_bic)
print(model_backward_bic.summary())



In [None]:
# Refit the AIC-selected model without black_tra (the intercept stays in, so the models are nested)
features_no_black_tra = [f for f in selected_features_aic if f != 'black_tra']
model_backward_aic_no_black_tra = sm.OLS(df_cleaned['medv'], X_all[features_no_black_tra]).fit()
f_test_result_aic = model_backward_aic.compare_f_test(model_backward_aic_no_black_tra)
print(f"F-statistic comparing model from stepwise to one without black_tra: {f_test_result_aic[0]:.4f}, p-value: {f_test_result_aic[1]:.4f}")

With a small p-value, the F-test rejects the null hypothesis that the model without `black_tra` is as good as the one including it, so `black_tra` is a significant predictor.


### Question 13: Stepwise Regression with `black_pop`

- Repeat the stepwise regression from Question 12, but this time replace `black_tra` with `black_pop`.
- Evaluate whether `black_pop` remains significant in the final model.
- Compare its significance to that of `black_tra` from Question 12.
---

In [None]:
y = df_cleaned['medv']

# log_medv is a transformation of the response, so it must not be used as a predictor
X_all_pop = df_cleaned.drop(columns=['log_medv', 'crim_binned', 'black_tra', 'medv_positive', 'boxcox_medv', 'medv', 'crim_cubed'])
print(X_all_pop.head())

In [None]:
# Reuses backward_elimination_aic_bic defined in Question 12
model_backward_aic_pop, selected_features_aic_pop = backward_elimination_aic_bic(X_all_pop, y, criterion="AIC")
print("Selected features based on AIC:", selected_features_aic_pop)
print(model_backward_aic_pop.summary())

model_backward_bic_pop, selected_features_bic_pop = backward_elimination_aic_bic(X_all_pop, y, criterion="BIC")
print("Selected features based on BIC:", selected_features_bic_pop)
print(model_backward_bic_pop.summary())

If `black_pop` is removed by the stepwise regression, there is no point in running the F-test, and we conclude that `black_tra` is statistically significant while `black_pop` is not. If it is retained, its p-value in the summary should be compared with that of `black_tra` from Question 12.

### Question 14: Impact on Predictions

- For both the models from Questions 12 and 13 (stepwise regression with `black_tra` and `black_pop`), compare their predictions for `medv`.
- Specifically:
  1. Calculate predictions for a range of values of `black_tra` and `black_pop`.
  2. Plot the predictions and interpret whether the two variables result in substantially different predicted values.
- Discuss whether the transformed variable (`black_tra`) or its proportion counterpart (`black_pop`) leads to any noticeable bias or distortion in predictions.
---
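
Step 1 can be sketched by evaluating both predictors on a common grid of proportions while every other predictor is held fixed. The coefficients below are hypothetical placeholders, not estimates from the fitted models; in practice they would be replaced by the fitted parameters.

```python
import numpy as np

# Hypothetical coefficients for illustration only (not estimates from the data)
intercept = 20.0
coef_tra = 0.01    # slope on black_tra in the first model
coef_pop = -10.0   # slope on black_pop in the second model

# Grid of proportions and the corresponding transformed values
black_pop_grid = np.linspace(0.0, 0.63, 50)
black_tra_grid = 1000.0 * (black_pop_grid - 0.63) ** 2

# Predictions with all other predictors held fixed (absorbed into the intercept)
pred_tra = intercept + coef_tra * black_tra_grid
pred_pop = intercept + coef_pop * black_pop_grid

# A model linear in black_tra is quadratic in the underlying proportion
print("second differences (pop):", np.abs(np.diff(pred_pop, 2)).max())
print("second differences (tra):", np.diff(pred_tra, 2).min())
```

Plotting `pred_tra` and `pred_pop` against `black_pop_grid` shows directly how the curvature introduced by the transformation changes the predicted values.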

In [None]:
# Fitted values of the BIC-selected models on the training data
predicted_values_pop = model_backward_bic_pop.fittedvalues.values
predicted_values_tra = model_backward_bic.fittedvalues.values

# Plot predicted values against the observation index
plt.figure(figsize=(12, 6))
plt.plot(range(len(predicted_values_pop)), predicted_values_pop, label="Predicted (pop)", color="blue", alpha=0.9)
plt.plot(range(len(predicted_values_tra)), predicted_values_tra, label="Predicted (tra)", color="orange", alpha=0.9)
plt.plot(range(len(df_cleaned['medv'])), df_cleaned['medv'], label="Medv", color="green", alpha=0.3)

plt.xlabel("Index", fontsize=12)
plt.ylabel("medv", fontsize=12)
plt.title("Predicted Values: pop vs tra", fontsize=14)
plt.legend(fontsize=10)
plt.grid(True)
plt.show()

