# Interpretable Machine Learning
## Exercise Sheet 2: Interpretable Models

### This exercise sheet covers lecture 2 on interpretable models
Sophie Langbein (langbein@leibniz-bips.de)<br>
Pegah Golchian (golchian@leibniz-bips.de)

This exercise sheet will be discussed in the Exercise Session on the 6th of November 2023.
<hr style="border:1.5px solid gray"> </hr>

# Linear Regression
The linear model is one of the first and best-studied models for regression
problems. However, it makes some (strong) assumptions about the given data. 
In the following exercise, you will learn how to fit and interpret a linear 
model and investigate its limitations. 

Consider a linear regression model 
with $p=3$ features of the form
$$
\hat{f}(x) = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_3\beta_3 = X \beta
$$
The data are given in the CSV file `sheet_2_lin_data.csv`.

**a)** Load the data as a `pandas` data frame and visualize in a 2D plot how each 
feature ($X_1, X_2, X_3$) affects the output $Y$. Reshape the loaded data frame
with `pandas.melt` and use the function `seaborn.scatterplot` with the 
grouping argument (`hue`) set to the feature name for the visualization.
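
The reshaping step can be sketched as follows. This is only an illustration on synthetic data: the column names `X1`, `X2`, `X3`, `Y` and the relationship between them are assumptions, not the contents of the CSV file.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the exercise data; column names are assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["X1", "X2", "X3"])
df["Y"] = df["X1"] - df["X2"] + rng.normal(scale=0.5, size=50)

# Long format: one row per (observation, feature) pair; Y is kept as id column.
df_long = pd.melt(df, id_vars=["Y"], var_name="feature", value_name="value")

# A single scatter plot, colored by the feature each point belongs to.
ax = sns.scatterplot(data=df_long, x="value", y="Y", hue="feature")
```

After `melt`, every original row appears three times, once per feature, which is what allows `hue` to separate the features within one plot.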

**b)** Determine the _design matrix_ $X$ of the given data as a `numpy` array (don't 
forget the column of ones for the intercept $\beta_0$). Using this matrix $X$, 
estimate the regression coefficients $\beta$ of the linear model by the method of least 
squares and calculate the corresponding standard error of the estimates. Do this from scratch without using any preprogrammed functions for $\beta$ or the standard errors.

**Hint:** The estimated linear regression coefficients are given by 
$\hat{\beta} = (X^T X)^{-1} X^T Y$ and the standard errors
are given by the square roots of the diagonal entries of $(X^TX)^{-1} \hat{\sigma}^2$ with 
$\hat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i- \hat{y}_i)^2$, where $p + 1$ is the number of columns of $X$ (the $p$ features plus the intercept).
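
A minimal sketch of these formulas on synthetic data (two made-up features with arbitrarily chosen true coefficients, not the exercise data). The degrees of freedom are taken as $n$ minus the number of columns of the design matrix.

```python
import numpy as np

# Synthetic data with known coefficients (an assumption for illustration).
rng = np.random.default_rng(42)
n = 200
features = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * features[:, 0] - 0.5 * features[:, 1] + rng.normal(scale=0.1, size=n)

# Design matrix: leading column of ones for the intercept.
X = np.column_stack([np.ones(n), features])

# Least-squares estimate (X^T X)^{-1} X^T y.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance, using n minus the number of estimated coefficients.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])

# Standard errors: square roots of the diagonal of (X^T X)^{-1} * sigma^2.
std_err = np.sqrt(np.diag(XtX_inv * sigma2_hat))
print(beta_hat, std_err)
```

Explicitly inverting $X^T X$ mirrors the formula in the hint; for ill-conditioned data a solver such as `np.linalg.solve` would usually be the more stable choice.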

**c)** Calculate the $R^2$-value. What does the $R^2$-value mean in this context?
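
As a reminder of the usual definition, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. A small sketch with a hypothetical helper `r_squared` and toy numbers:

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance of y that is explained by the predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r_squared(y, y)                     # perfect predictions
r2_mean = r_squared(y, np.full(4, y.mean()))     # always predicting the mean
```

Perfect predictions give a value of 1, while a model that always predicts the mean of $Y$ gives 0.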

**d)** What happens to the outcome if we increase the first feature value $X_1$ by 3? 

**e)** Which feature is the most important one?

**f)** The data was sampled by the data generating process 
$$f(x) = x_1 + 2 x_2 + 3 x_3 + \varepsilon.$$ 
Does the most important feature from part **e)** make sense regarding the data 
generating process? Explain your answer!

<hr style="border:1.5px solid gray"> </hr>

# Logistic Regression

## Introduction of Odds Ratio

The IML course is taught at three universities: Bremen, Hannover and Munich. 

The following table shows the absolute joint and marginal frequencies of the exam outcomes and the university locations for students who took the course in 2021 (totally real data of course).

|          | Bremen   | Hannover | Munich   | Σ     |
|----------|----------|----------|----------|-------|
| Pass     |    14    |    25    |    42    |   81  |
| Fail     |    21    |    15    |     7    |   43  |
| Σ        |    35    |    40    |    49    |  124  |

**a)** Calculate and interpret the odds of passing the IML exam vs. failing the IML exam in 2021.

**b)** Calculate and interpret the odds of passing the IML exam vs. failing the IML exam in Bremen in 2021.

**c)** Calculate and interpret the odds ratio of passing vs. failing of a student in Munich vs. Bremen in 2021.
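
The mechanics of odds and odds ratios can be sketched with two hypothetical helpers. The counts below are made up for illustration and are deliberately not the values from the table.

```python
def odds(successes, failures):
    """Odds of success vs. failure from absolute counts."""
    return successes / failures

def odds_ratio(successes_a, failures_a, successes_b, failures_b):
    """Odds of group A divided by the odds of group B."""
    return odds(successes_a, failures_a) / odds(successes_b, failures_b)

# Hypothetical counts: group A has 30 passes and 10 fails,
# group B has 20 passes and 20 fails.
odds_a = odds(30, 10)
odds_b = odds(20, 20)
or_a_vs_b = odds_ratio(30, 10, 20, 20)
```

In this toy example, passing is three times as likely as failing in group A, equally likely in group B, and the odds of passing in A are three times those in B.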

## Logistic Regression
In this exercise, a logistic model is fitted on the [wine dataset](https://archive.ics.uci.edu/dataset/109/wine), which contains wines derived from three different cultivars (`first cultivar`, `second cultivar`, `third cultivar`). We want to consider a binary classification problem, so we are only interested in whether a wine is of the first cultivar (`first cultivar`) or not (`second cultivar` or `third cultivar`). For 
this purpose, we use the [`scikit-learn`](https://scikit-learn.org/stable/) package.

**a)** Load the wine dataset from `sklearn.datasets` and save the features with 
the corresponding output in one `pandas` dataframe. Adjust the output to the 
binary classification problem described above. The resulting dataframe should contain the new binary target in a column named `target_binary` and, apart from that, only the four feature columns `alcohol`, `malic_acid`, `magnesium` and `color_intensity`. 
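
One possible way to build such a dataframe is sketched below. It assumes that class `0` in `sklearn`'s encoding corresponds to `first cultivar`; this mapping should be checked against `target_names`.

```python
import pandas as pd
from sklearn.datasets import load_wine

# as_frame=True returns the features as a DataFrame and the target as a Series.
wine = load_wine(as_frame=True)

feature_cols = ["alcohol", "malic_acid", "magnesium", "color_intensity"]
df = wine.data[feature_cols].copy()

# Assumption: class 0 is the first cultivar; all other classes are coded as 0.
df["target_binary"] = (wine.target == 0).astype(int)
print(df.head())
```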

**b)** Visualize this dataframe with `seaborn.pairplot`, using `target_binary` as the grouping argument.
Which feature do you think increases the odds for the class `first cultivar` the most when it is incremented by one unit? Explain your choice.

**c)** Fit a logistic regression model with `sklearn.linear_model.LogisticRegression` and extract the estimated model parameters $\beta$ from it. Now we want to determine 
the odds ratios of the parameters. Similar to examples **b)** and **c)** in the section "Introduction of Odds Ratio", 
a logistic model provides a probability conditional on the input data $x$, i.e. 
the odds are given by

$$
\text{odds}_\text{logreg} = \frac{\mathbb{P}(Y = 1 \mid x)}{\mathbb{P}(Y = 0 \mid x)}.
$$

The input is now modified by increasing one component of $x$. Denote by $x_{x_i \to x_i +1}$ the vector in which the $i$-th component (feature) is increased by one unit. This yields the following representation of the odds ratio for the $i$-th parameter:

$$
\text{OR}_i = \frac{\frac{\mathbb{P}(Y = 1 \, \mid \, x_{x_i \to x_i +1})}{\mathbb{P}(Y = 0 \, \mid \, x_{x_i \to x_i +1})}}{\frac{\mathbb{P}(Y = 1 \, \mid \, x)}{\mathbb{P}(Y = 0 \, \mid \, x)}} = \exp(\beta_i).
$$

What are the odds ratios of your fitted logistic regression model, and which 
feature increases the odds for the class `first cultivar` the most? How does your 
choice fit with your thoughts from **b)**?
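
The identity $\text{OR}_i = \exp(\beta_i)$ can be checked numerically. The sketch below uses arbitrary, hypothetical coefficients rather than a fitted model, and a hypothetical helper `odds`.

```python
import numpy as np

def odds(x, beta0, beta):
    """Odds P(Y=1|x) / P(Y=0|x) under a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))
    return p / (1.0 - p)

# Hypothetical intercept and coefficients, chosen only for illustration.
beta0 = -0.3
beta = np.array([0.8, -1.2])

x = np.array([0.5, 1.5])
x_plus = x + np.array([1.0, 0.0])  # first feature increased by one unit

ratio = odds(x_plus, beta0, beta) / odds(x, beta0, beta)
print(ratio, np.exp(beta[0]))
```

The ratio does not depend on the chosen point `x`, because the odds of a logistic model are $\exp(\beta_0 + x^T \beta)$ and everything except $\exp(\beta_i)$ cancels.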

<hr style="border:1.5px solid gray"> </hr>

# Linear Regression with Interactions and Generalized Additive Models
 
Load the dataset `house_dataset_2.csv`. This dataset contains the purchase prices of 10,000 houses in Germany and includes the size in m² (`Size`), the location (`Location`; $0=$ bad, $1=$ good, $2=$ very good and $3=$ luxury), the year of construction (`Year of Construction`), the number of rooms (`Number of Rooms`), the size of the garden in m² (`Garden Size`) and whether the house is in an urban area or not (`urban`).

**a)** Fit a linear regression model and calculate the mean squared error (MSE) on the dataset.

**b)** How can you examine this dataset to see whether the interactions between the features `Size` ~ `Garden Size`, `Size` ~ `Number of Rooms` and `Garden Size` ~ `Number of Rooms` affect the output? A visual explanation is enough.

**c)** Fit a linear regression model on these data by manually adding the relevant interactions. Then calculate the MSE again.
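
Adding an interaction manually means appending the product of two features as an extra column. A minimal sketch on synthetic data (the features `a`, `b` and their coefficients are made up and unrelated to the house dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data whose outcome depends on the product a * b (an assumption).
rng = np.random.default_rng(1)
a = rng.uniform(0, 10, size=500)
b = rng.uniform(0, 10, size=500)
y = 2 * a + 3 * b + 1.5 * a * b + rng.normal(scale=1.0, size=500)

X_main = np.column_stack([a, b])          # main effects only
X_inter = np.column_stack([a, b, a * b])  # main effects plus interaction column

mse_main = mean_squared_error(y, LinearRegression().fit(X_main, y).predict(X_main))
mse_inter = mean_squared_error(y, LinearRegression().fit(X_inter, y).predict(X_inter))
print(mse_main, mse_inter)
```

Both errors are computed on the training data, as in the exercise, so they describe the fit and not the generalization performance.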

**d)** Fit a generalized additive model on these data using `LinearGAM` from the [`pygam`](https://pygam.readthedocs.io/en/latest/) package. Account for non-linearity in the features by using spline terms `s` and add the relevant interactions using tensor product terms `te`. Then calculate the MSE again. What do you notice about the MSE? Why is this happening?

<hr style="border:1.5px solid gray"> </hr>

# Decision Trees

For this exercise, we will use the [wine dataset](https://archive.ics.uci.edu/dataset/109/wine) from before. Now we will look at the original dataset from `sklearn.datasets`. 

**a)** Load the data and familiarize yourself with the explanatory variables. Hint: have a look at `feature_names` and `target_names`.

**b)** We now want to build a decision tree model that predicts whether a wine is of kind `first cultivar`, `second cultivar` or `third cultivar`. For this purpose, do we have to fit a regression or a classification decision tree? 

**c)** Fit a decision tree model to the wine dataset using the [`scikit-learn`](https://scikit-learn.org/stable/) package.

**d)** Display the fitted tree graphically using the `plot_tree()` function from `sklearn.tree`. Explain how to read the output and interpret the results.
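
A rough sketch of fitting and plotting a tree is given below. The settings `max_depth=2` and `random_state=0` are arbitrary choices to keep the plot small and reproducible, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, plot_tree

wine = load_wine()

# Shallow tree (assumption: depth 2) so that the plot stays readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(wine.data, wine.target)

fig, ax = plt.subplots(figsize=(10, 6))
plot_tree(
    clf,
    feature_names=list(wine.feature_names),
    class_names=list(wine.target_names),
    filled=True,  # color nodes by majority class
    ax=ax,
)
```

Each node in the plot shows the split condition, the impurity, the number of samples reaching the node, the class counts and the majority class.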