<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex07_hw.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01RAD Exercise 7 - team work

Authors: name1, name 2, name3


## Description of the Assignment

The dataset `Boston` contains a total of 506 records from towns in the suburbs of Boston, MA, USA. The data originates from the study by Harrison, D., and Rubinfeld, D.L. (1978), *Hedonic prices and the demand for clean air*, J. Environ. Economics and Management, 5, 81–102.

The dataset includes 14 variables. The goal is to explore the influence of 13 of them on the median value of owner-occupied homes (`medv`). Below is a description of the variables:

| Feature   | Description                                                                 |
|-----------|-----------------------------------------------------------------------------|
| `crim`    | Per capita crime rate by town                                              |
| `zn`      | Proportion of residential land zoned for lots over 25,000 sq.ft            |
| `indus`   | Proportion of non-retail business acres per town                           |
| `chas`    | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)      |
| `nox`     | Nitrogen oxides concentration (parts per 10 million)                       |
| `rm`      | Average number of rooms per dwelling                                       |
| `age`     | Proportion of owner-occupied units built prior to 1940                     |
| `dis`     | Weighted mean of distances to five Boston employment centres               |
| `rad`     | Index of accessibility to radial highways                                  |
| `tax`     | Full-value property-tax rate per \$10,000                                  |
| `ptratio` | Pupil-teacher ratio by town                                                |
| `black_tra` | $1000\left(\text{black_pop} - 0.63\right)^2$, where `black_pop` is the proportion of Black residents by town |
| `lstat`   | Lower status of the population (percent)                                   |
| `medv`    | Median value of owner-occupied homes in \$1000s                            |

---

## Conditions and Scoring

- Collaboration in the team is allowed and recommended.
- This homework includes 14 questions.
- Submit the homework as the corresponding `.ipynb` file via MS Teams by next week.

---


In [None]:
# Import libraries
import pandas as pd
import numpy as np


In [None]:
# URL for the Boston housing dataset
data_url = "http://lib.stat.cmu.edu/datasets/boston"

# Reading the dataset
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Processing the dataset into features and target
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Column names
columns = [
    "crim", "zn", "indus", "chas", "nox", "rm", "age",
    "dis", "rad", "tax", "ptratio", "black_tra", "lstat"
]
boston_df = pd.DataFrame(data, columns=columns)
boston_df["medv"] = target


boston_df



## Exploratory and Graphical Analysis

### Question 1

- Check for missing values and verify the dimensions of the dataset.
- Summarize the descriptive statistics of all variables.
- Plot a histogram and density estimate for the response variable `medv`.
- Examine the frequency table of `medv` values and discuss whether rounding, truncation, or other issues are present.
- Remove measurements deemed unreliable and discuss what this implies for the response model.
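
The checks listed above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the helper `describe_response` is a hypothetical name, and the toy data frame stands in for `boston_df` with values capped at 50 purely for demonstration.

```python
import numpy as np
import pandas as pd

def describe_response(df: pd.DataFrame, col: str = "medv") -> dict:
    """Basic checks for a response column: shape, missing values, pile-up at the maximum."""
    counts = df[col].value_counts()  # frequency table, most frequent value first
    return {
        "shape": df.shape,
        "n_missing": int(df.isna().sum().sum()),
        "max_value": float(df[col].max()),
        "n_at_max": int((df[col] == df[col].max()).sum()),
        "most_frequent": float(counts.index[0]),
    }

# Synthetic stand-in for `boston_df` (hypothetical values, capped at 50)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"medv": np.minimum(rng.normal(30, 12, 300).round(1), 50.0)})
checks = describe_response(toy)
print(checks)
```

A large count at the maximum value would suggest censoring of the response, which is the kind of issue the frequency table is meant to reveal.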


## Simple Regression Model: Median Price and Crime

### Question 2

- Build a simple linear regression model to examine if the crime rate (`crim`) affects the median value of homes (`medv`).
- If there is an effect, determine how much the housing price decreases as the crime rate increases.

---

### Question 3

- Experiment with power and logarithmic transformations of the response variable (`medv`).
- To find the optimal power transformation, plot the log-likelihood profile for the Box-Cox transformation and compare it with a logarithmic transformation.
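
One way to obtain the log-likelihood profile is to compute it directly from the residual sum of squares of the transformed model. The sketch below assumes the standard Box-Cox profile log-likelihood for a linear model; `boxcox_profile` is a hypothetical helper, and the data are simulated so that the log transformation ($\lambda = 0$) is correct by construction.

```python
import numpy as np

def boxcox_profile(y, X, lambdas):
    """Profile log-likelihood of the Box-Cox parameter for the linear model y^(lambda) ~ X."""
    y = np.asarray(y, dtype=float)
    n = y.size
    log_y_sum = np.log(y).sum()
    values = []
    for lam in lambdas:
        # Box-Cox transform; lambda = 0 corresponds to the logarithm
        z = np.log(y) if abs(lam) < 1e-12 else (y**lam - 1.0) / lam
        beta, *_ = np.linalg.lstsq(X, z, rcond=None)
        rss = np.sum((z - X @ beta) ** 2)
        # Gaussian log-likelihood plus the Jacobian term of the transformation
        values.append(-0.5 * n * np.log(rss / n) + (lam - 1.0) * log_y_sum)
    return np.array(values)

# Simulated data (assumption): log(y) is linear in x
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 1000)
y = np.exp(3.0 - 0.2 * x + rng.normal(0, 0.5, 1000))
X = np.column_stack([np.ones_like(x), x])

lambdas = np.linspace(-1, 1, 81)
profile = boxcox_profile(y, X, lambdas)
best_lambda = lambdas[np.argmax(profile)]
print(round(float(best_lambda), 3))
```

Plotting `profile` against `lambdas` gives the profile curve; if $\lambda = 0$ lies close to the maximum, the logarithmic transformation is a reasonable choice.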

---

### Question 4

- Based on the simple linear model and on the model with a logarithmic transformation of the response variable, estimate the increase or decrease in housing prices for a one-unit change in the crime rate (`crim`).
- Provide the correct interpretation from both models.
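
As a reminder of how the two interpretations differ, here is a short derivation. The symbols $\beta_0, \beta_1$ (linear model) and $\gamma_0, \gamma_1$ (log model) are notation introduced here for illustration only.

$$
\text{medv} = \beta_0 + \beta_1\,\text{crim} + \varepsilon
\quad\Rightarrow\quad
\widehat{\text{medv}}(\text{crim}+1) - \widehat{\text{medv}}(\text{crim}) = \beta_1
$$

$$
\log(\text{medv}) = \gamma_0 + \gamma_1\,\text{crim} + \varepsilon
\quad\Rightarrow\quad
\frac{\widehat{\text{medv}}(\text{crim}+1)}{\widehat{\text{medv}}(\text{crim})} = e^{\gamma_1}
$$

In the linear model the effect is additive and expressed in the units of `medv`. In the log model it is multiplicative: a one-unit increase in `crim` changes the predicted value by $100\,(e^{\gamma_1} - 1)\,\%$, which is approximately $100\,\gamma_1\,\%$ when $\gamma_1$ is small.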

---

### Question 5

- Keep the logarithmic transformation of the response (`medv`) and try transforming the independent variable (`crim`).
- Use techniques such as piecewise constant transformations, or polynomial transformations (quadratic and cubic).
- Use information from plots such as Component-Residual Plots (Partial Residual Plots) and Partial Regression Plots to guide your transformations.
- Discuss whether these models can be compared using an F-test. If applicable, perform the test and interpret the results.
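
An F-test applies only when one model is nested in the other, for example a linear term in `crim` inside a cubic polynomial; a piecewise constant model and a polynomial model are generally not nested. The sketch below shows the nested case on simulated data. The helper `nested_f_test` is a hypothetical name, and the data-generating values are assumptions.

```python
import numpy as np
from scipy import stats

def rss(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def nested_f_test(X_small, X_big, y):
    """F-test of a submodel against a larger model that contains it."""
    n = len(y)
    rss0, rss1 = rss(X_small, y), rss(X_big, y)
    df_num = X_big.shape[1] - X_small.shape[1]
    df_den = n - X_big.shape[1]
    f_stat = ((rss0 - rss1) / df_num) / (rss1 / df_den)
    return f_stat, float(stats.f.sf(f_stat, df_num, df_den))

# Simulated data (assumption): the true relationship has a quadratic component
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 300)
y = 3.0 - 0.4 * x + 0.05 * x**2 + rng.normal(0, 0.2, 300)

X_lin = np.column_stack([np.ones_like(x), x])
X_cub = np.column_stack([np.ones_like(x), x, x**2, x**3])
f_stat, p_value = nested_f_test(X_lin, X_cub, y)
print(f_stat, p_value)
```

A small p-value indicates that the added polynomial terms improve the fit beyond what the linear term explains.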

---

### Question 6

- Select one of the previous models, justify your choice, and validate it using the appropriate hypothesis tests for residuals (normality, homoscedasticity, etc.).
- Use diagnostic plots such as Q-Q plots, residuals vs. fitted values, and others to evaluate the model's assumptions.
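
The hypothesis tests could look roughly like this. The sketch uses the Shapiro-Wilk test for normality and a hand-written studentized (Koenker) variant of the Breusch-Pagan test for homoscedasticity; `residual_checks` is a hypothetical helper, and the simulated data are heteroscedastic by construction.

```python
import numpy as np
from scipy import stats

def residual_checks(X, y):
    """Normality and homoscedasticity tests for OLS residuals (X includes the intercept column)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # Shapiro-Wilk test of normality
    sw_stat, sw_p = stats.shapiro(resid)
    # Koenker version of Breusch-Pagan: n * R^2 from regressing squared residuals on X
    u2 = resid**2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    r2 = 1.0 - np.sum((u2 - X @ gamma) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    bp_stat = n * r2
    bp_p = stats.chi2.sf(bp_stat, X.shape[1] - 1)
    return {"shapiro_p": float(sw_p), "bp_stat": float(bp_stat), "bp_p": float(bp_p)}

# Simulated data (assumption): error standard deviation grows with x
rng = np.random.default_rng(4)
x = rng.uniform(1, 5, 300)
y = 1.0 + 2.0 * x + x * rng.normal(0, 1, 300)
X = np.column_stack([np.ones_like(x), x])
checks = residual_checks(X, y)
print(checks)
```

A small `bp_p` points to non-constant residual variance; the graphical diagnostics should be read alongside these p-values rather than replaced by them.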

---



## Multivariate Regression Model

### Question 7

- Build a multivariate linear regression model with a logarithmic transformation of the response (`medv`).
- Explore relationships between housing prices and other independent variables in an additive model (no interactions).
- Use criteria such as AIC, BIC, $R^2$, and F-statistics to select the best model.
- Investigate whether the relationship between `crim` and `medv` can be explained by other variables, such as proximity to highways or pollution levels.
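
For comparing candidate models, AIC and BIC can be computed from the residual sum of squares. The sketch below assumes a Gaussian likelihood and counts only the regression coefficients as parameters; `gaussian_ic` is a hypothetical helper and the data are simulated. Note that such criteria are comparable only between models fitted to the same response, so a model for `log(medv)` cannot be compared directly with a model for `medv`.

```python
import numpy as np

def gaussian_ic(X, y):
    """Return (AIC, BIC) of an OLS fit, counting the regression coefficients as parameters."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

# Simulated data (assumption): x1 matters, x2 is pure noise
rng = np.random.default_rng(3)
n = 400
x1, x2 = rng.normal(size=n), rng.normal(size=n)
log_y = 3.0 - 0.3 * x1 + rng.normal(0, 0.2, n)
ones = np.ones(n)

aic_null, bic_null = gaussian_ic(ones[:, None], log_y)
aic_small, bic_small = gaussian_ic(np.column_stack([ones, x1]), log_y)
aic_big, bic_big = gaussian_ic(np.column_stack([ones, x1, x2]), log_y)
print(aic_null, aic_small, aic_big)
```

Lower values are better; BIC penalizes each additional coefficient by $\log n$ instead of 2, so it tends to favor smaller models.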

---

### Question 8

- Incorporate `crim` (crime rate) into the final model and compare how its influence on the median housing price differs from the simple regression model with a logarithmic transformation of the response (from Question 4).
- Estimate the reduction in median housing price for a one-unit increase in the crime rate per 1,000 residents.

---

### Question 9

- Present your final predictive model for `medv` and discuss the key parameters such as $R^2$, $\sigma$, and F-statistics.
- Compare the final model with the simple linear model from Question 6. Discuss how these parameters have changed and whether this change was expected.
- Validate the model both graphically and using hypothesis tests.

---

### Question 10

- Based on your final model, answer whether reducing the crime rate in an area would lead to an increase in housing prices in that area.
- Provide an explanation based on your findings.


## Investigating the Transformation of the `black_tra` Variable

<!--
# Add a new variable `black_pop` representing the proportion of Black population
boston_df["black_pop"] = 0.63 - np.sqrt(boston_df["black_tra"] / 1000)
 -->

In [None]:
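# Motivation for this section: scikit-learn deprecated `load_boston` in version 1.0
# and removed it in 1.2 because of ethical concerns about this variable.
# On recent versions the import itself raises an ImportError that explains why.
try:
    from sklearn.datasets import load_boston
    boston_data = load_boston()
except ImportError as err:
    print(err)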


### Question 11: Compare Coefficients in Simple Models

Investigate whether the transformation of `black_pop` into `black_tra` was misleading and suggestive. Add a new variable `black_pop` to the data frame by inverting the original transformation.

- Build two separate simple linear regression models:
  1. Predicting `medv` using `black_tra`.
  2. Predicting `medv` using `black_pop`.
- Compare the coefficients from both models and interpret the differences.
- Discuss whether the transformation into `black_tra` appears to exaggerate or diminish the relationship with `medv`.
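
The inverse of the transformation is not unique, because $1000\,(p - 0.63)^2$ has two roots, $p = 0.63 \pm \sqrt{\text{black\_tra}/1000}$. The sketch below makes the choice of branch explicit; `black_pop_from_tra` is a hypothetical helper and taking the lower root by default is an assumption. Since a proportion cannot exceed 1, values of `black_tra` above $1000 \cdot 0.37^2 = 136.9$ can only come from the lower branch.

```python
import numpy as np

def black_pop_from_tra(black_tra, lower_branch=True):
    """Invert black_tra = 1000 * (black_pop - 0.63)**2 on the chosen branch."""
    root = np.sqrt(np.asarray(black_tra, dtype=float) / 1000.0)
    return 0.63 - root if lower_branch else 0.63 + root

# Round trip on a few hypothetical proportions at or below 0.63
p = np.array([0.0, 0.1, 0.3, 0.63])
tra = 1000.0 * (p - 0.63) ** 2
recovered = black_pop_from_tra(tra)
print(recovered)
```

With the data frame from this notebook, the same function could be applied as `boston_df["black_pop"] = black_pop_from_tra(boston_df["black_tra"])`.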

---

### Question 12: Stepwise Regression with `black_tra`

- Perform stepwise regression starting with all independent variables, including `black_tra`, as predictors of `medv`.
- Evaluate whether `black_tra` remains significant in the final model after stepwise variable selection.
- Discuss whether its significance changes when considered alongside other predictors.
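
Stepwise selection can be implemented in several ways. The sketch below shows one of them, backward elimination by AIC, on simulated data. The helpers `ols_aic` and `backward_aic` are hypothetical names, and the choice of AIC as the criterion is an assumption rather than a requirement of the assignment.

```python
import numpy as np
import pandas as pd

def ols_aic(df, response, predictors):
    """AIC of an OLS fit with intercept (Gaussian likelihood, coefficients counted as parameters)."""
    y = df[response].to_numpy(dtype=float)
    X = np.column_stack([np.ones(len(df))] + [df[p].to_numpy(dtype=float) for p in predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n = len(y)
    return n * (np.log(2 * np.pi * rss / n) + 1.0) + 2 * X.shape[1]

def backward_aic(df, response, predictors):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    current = list(predictors)
    best = ols_aic(df, response, current)
    while current:
        trials = [(ols_aic(df, response, [p for p in current if p != drop]), drop)
                  for drop in current]
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the criterion
        best = trial_aic
        current.remove(drop)
    return current, best

# Simulated data (assumption): only x1 and x2 influence the response
rng = np.random.default_rng(5)
n = 300
all_predictors = ["x1", "x2", "x3", "x4", "x5"]
toy = pd.DataFrame(rng.normal(size=(n, 5)), columns=all_predictors)
toy["log_y"] = 3.0 + 0.5 * toy["x1"] - 0.4 * toy["x2"] + rng.normal(0, 0.3, n)

selected, final_aic = backward_aic(toy, "log_y", all_predictors)
print(selected, final_aic)
```

Whether a variable such as `black_tra` survives the selection can then be read off the returned list; its significance in the final model still needs to be checked separately.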

---

### Question 13: Stepwise Regression with `black_pop`

- Repeat the stepwise regression from Question 12, but this time replace `black_tra` with `black_pop`.
- Evaluate whether `black_pop` remains significant in the final model.
- Compare its significance to that of `black_tra` from Question 12.

---

### Question 14: Impact on Predictions

- For both the models from Questions 12 and 13 (stepwise regression with `black_tra` and `black_pop`), compare their predictions for `medv`.
- Specifically:
  1. Calculate predictions for a range of values of `black_tra` and `black_pop`.
  2. Plot the predictions and interpret whether the two variables result in substantially different predicted values.
- Discuss whether the transformed variable (`black_tra`) or its proportion counterpart (`black_pop`) leads to any noticeable bias or distortion in predictions.
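
The comparison of predictions can be organized as below: fit both models, then predict along a grid of `black_pop` values and the matching `black_tra` values while holding the other predictors fixed. Everything in this sketch is simulated; the variable `other` is a stand-in for the remaining predictors, and the coefficients are assumptions chosen for illustration.

```python
import numpy as np

# Simulated data (assumption): log(medv) is linear in black_tra
rng = np.random.default_rng(6)
n = 500
black_pop = rng.uniform(0.0, 0.63, n)
black_tra = 1000.0 * (black_pop - 0.63) ** 2
other = rng.normal(size=n)  # stand-in for the remaining predictors
log_medv = 3.0 + 0.1 * other + 0.0005 * black_tra + rng.normal(0, 0.1, n)

def fit(columns):
    """OLS coefficients for log_medv on an intercept plus the given columns."""
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, log_medv, rcond=None)
    return beta

beta_tra = fit([other, black_tra])
beta_pop = fit([other, black_pop])

# Matching grids of the two variables, with `other` held at its mean
grid_pop = np.linspace(0.0, 0.63, 50)
grid_tra = 1000.0 * (grid_pop - 0.63) ** 2
other_ref = other.mean()
pred_tra = np.exp(beta_tra[0] + beta_tra[1] * other_ref + beta_tra[2] * grid_tra)
pred_pop = np.exp(beta_pop[0] + beta_pop[1] * other_ref + beta_pop[2] * grid_pop)
max_gap = float(np.max(np.abs(pred_tra - pred_pop)))
print(max_gap)
```

Plotting `pred_tra` and `pred_pop` against `grid_pop` shows where the two parameterizations diverge; the curvature introduced by the quadratic transformation is the main thing to look for.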
