# Interpretable Machine Learning
## Exercise Sheet: 2

### This exercise sheet covers chapters 5.1+5.2 from the IML book by Christoph Molnar
Kristin Blesch (blesch@leibniz-bips.de)<br>
Niklas Koenen (koenen@leibniz-bips.de)
<hr style="border:1.5px solid gray"> </hr>

# 1) Linear Regression
The linear model is one of the first and best-studied models for regression
problems. However, it makes some (strong) assumptions about the given data. 
In the following exercise, you will learn how to fit and interpret a linear 
model and investigate its limitations. 

Let's consider a linear regression model 
with $p=3$ features of the form
$$
\hat{f}(x) = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_3\beta_3 = X \beta
$$
and the data is given in the csv-file `sheet_2_lin_data.csv`.

**a)** Load the data as a `pandas` data frame and visualize in a 2D plot how each 
feature ($X_1, X_2, X_3$) affects the output $Y$. Preprocess the loaded data frame
with `pandas.melt` and use the function `seaborn.scatterplot` with specified 
grouping argument (`hue`) for the visualization.
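
The reshaping step can be sketched on a small toy frame. The column names `X1`, `X2`, `X3`, `Y` and all values are assumptions made for illustration; the actual header of the csv-file may differ.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions,
# not the content of sheet_2_lin_data.csv.
df = pd.DataFrame({
    "X1": [0.1, 0.4, 0.7],
    "X2": [1.0, 0.5, 0.2],
    "X3": [3.0, 2.0, 1.0],
    "Y":  [1.2, 2.3, 3.1],
})

# Wide -> long: one row per (observation, feature) pair.
# Y is kept as identifier so every feature value stays paired with its output.
long_df = pd.melt(
    df,
    id_vars="Y",
    value_vars=["X1", "X2", "X3"],
    var_name="feature",
    value_name="value",
)
print(long_df)
```

A frame in this long format can be passed to `seaborn.scatterplot(data=long_df, x="value", y="Y", hue="feature")`, which colours the points by feature.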

**Solution:**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_theme() # Because seaborn theme looks awesome
import pandas as pd

# Your solution here

**b)** Determine the _design matrix_ $X$ of the given data as a `numpy` array (don't 
forget the column of ones for the intercept $\beta_0$). Using this matrix $X$, 
estimate the regression coefficients $\beta$ of the linear model by the method of least 
squares and calculate the corresponding standard error of the estimates.  
**Hint:** The estimated linear regression coefficients are given by 
$\hat{\beta} = (X^T X)^{-1} X^T Y$ and the standard errors
are given by the square roots of the diagonal entries of $(X^TX)^{-1} \hat{\sigma}^2$ with 
$\hat{\sigma}^2 = \frac{1}{n - p - 1} \sum_{i=1}^n (y_i- \hat{y}_i)^2$, where $p + 1$ is the number of columns of $X$ (the $p$ features plus the intercept).
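
The formulas from the hint can be tried out on synthetic data first. The following is only a sketch: the data, the coefficients in `true_beta` and all variable names are assumptions, and the residual degrees of freedom are taken as $n$ minus the number of columns of $X$ (intercept included).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the exercise data).
n, p = 200, 3
features = rng.normal(size=(n, p))
true_beta = np.array([0.5, 1.0, -2.0, 3.0])  # intercept first

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), features])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Least-squares estimate via the normal equations.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance; X.shape[1] equals p + 1 because of the intercept column.
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])

# Standard errors: square roots of the diagonal of sigma^2 * (X^T X)^{-1}.
std_err = np.sqrt(np.diag(sigma2_hat * XtX_inv))

print(beta_hat)
print(std_err)
```

Explicitly inverting $X^T X$ mirrors the hint; for ill-conditioned design matrices, `np.linalg.lstsq` is usually the numerically safer choice.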

**Solution:**

In [None]:
import numpy as np

# Your solution here

**c)** Calculate the $R^2$-value. What does the $R^2$-value mean in this context?

**Solution:**

In [None]:
# Your solution here

**d)** What happens to the prediction if we increase the first feature value $X_1$ by three? Which feature
is the most important one?

**Solution:**

In [None]:
# Your solution here

**e)** The data was sampled by the data generating process 
$$f(x) = x_1 + 2 x_2 + 3 x_3 + \varepsilon.$$ 
Does the most important feature from part d) make sense regarding the data 
generating process? Explain your answer!

**Solution:**

In [None]:
# Your solution here

<hr style="border:1.5px solid gray"> </hr>

# 2) Logistic Regression

## a) Introduction of odds ratio
**i)** Suppose we have a fair six-sided die
described by a random variable $Y_1$. What are the odds of rolling a 
$5$ or a $6$?

**Solution:**

<span style="color:#AAAEBC"> - your solution here - </span>

**ii)** Let's consider a loaded die $Y_2$ that rolls a $6$ with a probability of $\frac{1}{3}$
and a $5$ with a probability of $\frac{1}{12}$. What are the odds for the event 
$Y_2 \geq 5$ now?

**Solution:**

<span style="color:#AAAEBC"> - your solution here - </span>

**iii)** How have the odds changed from die $Y_1$ to die $Y_2$, i.e. by what factor are they
higher or lower? This factor is called the _odds ratio_.
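
As a sketch of the terminology, the odds of an event with probability $p$ are $p / (1 - p)$, and an odds ratio is the quotient of two such odds. The example below uses two hypothetical coins rather than the dice from the exercise; the probabilities are assumptions chosen for illustration.

```python
from fractions import Fraction


def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1 - p)


# Hypothetical coins, deliberately different from the dice above.
p_fair = Fraction(1, 2)    # fair coin shows heads
p_biased = Fraction(3, 4)  # biased coin shows heads

odds_fair = odds(p_fair)
odds_biased = odds(p_biased)

# Factor by which the odds change from the fair to the biased coin.
odds_ratio = odds_biased / odds_fair

print(odds_fair, odds_biased, odds_ratio)
```

Using `Fraction` keeps the arithmetic exact, which makes it easy to compare with a calculation by hand.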

**Solution:**

<span style="color:#AAAEBC"> - your solution here - </span>

## b) Logistic Regression in Python
In this exercise, a logistic model is fitted on the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) to predict whether 
a given plant of the iris genus is a member of the species 'virginica' or not. For 
this purpose, we use the mighty package [`scikit-learn`](https://scikit-learn.org/stable/).

**i)** Load the iris dataset from `sklearn.datasets` and save the features with 
corresponding output in one `pandas` dataframe. Then adjust the output to the 
binary classification problem described above.

**Solution:**

In [None]:
from sklearn.datasets import load_iris
import pandas as pd

# Your solution here

**ii)** Visualize this dataframe with `seaborn.pairplot`, grouped by `virginica` (argument `hue`).
Which feature do you think increases the odds for the class `virginica` the most when it is incremented by one unit? Explain your choice.

**Solution:**

In [None]:
# Your solution here

**iii)** Fit a logistic regression model with the class `sklearn.linear_model.LogisticRegression` and extract the estimated model parameters $\beta$ from it. Now we want to determine 
the odds ratio of the parameters. Compared to the simple example from part a), 
a logistic model provides a probability conditional on the input data $x$, i.e. 
the odds are given by
$$
\text{odds}_\text{logreg} = \frac{\mathbb{P}(Y = 1 \mid x)}{1- \mathbb{P}(Y = 1 \mid x)}.
$$
Therefore, instead of using another die, the stochastic model is now modified by increasing a component of $x$. Denote by $x_{x_i \to x_i +1}$ a vector where the $i$-th component (feature) is increased by one unit. Thus, the following representation of the odds ratio for the $i$-th parameter results from the lecture:
$$
\text{odds ratio}_i = \frac{\frac{\mathbb{P}(Y = 1 \mid x_{x_i \to x_i +1})}{1- \mathbb{P}(Y = 1 \mid x_{x_i \to x_i +1})}}{\frac{\mathbb{P}(Y = 1 \mid x)}{1- \mathbb{P}(Y = 1 \mid x)}} = \exp(\beta_i).
$$
What are the odds ratios of your fitted logistic regression model, and which 
feature increases the odds for the class 'virginica' the most? How does your 
choice fit with your thoughts from part ii)?
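
The identity $\text{odds ratio}_i = \exp(\beta_i)$ can be checked numerically. The sketch below uses hypothetical coefficients and a hypothetical input vector (not values fitted on the iris data) and compares the quotient of the two odds with $\exp(\beta_i)$.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(x, beta0, beta):
    """Odds P(Y=1|x) / (1 - P(Y=1|x)) of a logistic model without interactions."""
    p = logistic(beta0 + sum(b * xi for b, xi in zip(beta, x)))
    return p / (1.0 - p)


# Hypothetical parameters and input (assumptions for illustration only).
beta0 = -1.0
beta = [0.8, -0.3]
x = [1.5, 2.0]

# Increase the i-th component by one unit.
i = 0
x_shifted = list(x)
x_shifted[i] += 1.0

ratio = odds(x_shifted, beta0, beta) / odds(x, beta0, beta)
print(ratio, math.exp(beta[i]))
```

For a fitted `LogisticRegression` object, the coefficients are stored in the attribute `coef_` and the intercept in `intercept_`. Note that scikit-learn applies L2 regularization by default, which shrinks the estimated coefficients.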

**Solution:**

In [None]:
from sklearn.linear_model import LogisticRegression

# Your solution here

# 3) Prediction of Purchase Prices of Houses
Load the dataset `sheet_2_house_price_data.csv`. This dataset contains the purchase prices of $10{,}000$ houses in Germany and includes the `size` (in m²), the `location` ($0=$ bad, $1=$ good, $2=$ very good and $3=$ luxury) and the year of construction (`year`).

**a)** Fit a linear regression model and calculate the mean squared error (MSE) on the dataset.

In [None]:
from sklearn.linear_model import LinearRegression

# Your solution here

**b)** How can this dataset be examined to see whether feature interactions affect the output and, if so, which ones? A visual explanation is enough.

In [None]:
# Your solution here

**c)** Fit a linear regression model on these data by manually adding the relevant interactions. Then calculate the MSE again.
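
Adding an interaction manually amounts to appending the product of two feature columns to the design matrix. The sketch below shows this on synthetic data; the features `a` and `b`, the coefficients and the helper `fit_mse` are hypothetical and do not describe the house price dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data with a true interaction between a and b
# (an assumption for illustration, not the house price data).
n = 500
a = rng.uniform(0, 1, size=n)
b = rng.integers(0, 4, size=n).astype(float)
y = 1.0 + 2.0 * a + 0.5 * b + 3.0 * a * b + rng.normal(scale=0.1, size=n)


def fit_mse(X, y):
    """Fit by least squares and return the in-sample mean squared error."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)


ones = np.ones(n)
X_main = np.column_stack([ones, a, b])          # main effects only
X_inter = np.column_stack([ones, a, b, a * b])  # plus the product column

mse_main = fit_mse(X_main, y)
mse_inter = fit_mse(X_inter, y)
print(mse_main, mse_inter)
```

With `sklearn.linear_model.LinearRegression`, the same product column can be appended to the feature matrix before calling `fit`.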

In [None]:
# Your solution here

# 4) Advanced: Logistic Regression
Similar to linear regression, logistic regression assumes that there are no interactions between the features, even though this is rarely the case in reality. This also means that the odds ratio of each estimated parameter $\beta_i$ is constant, i.e. independent of the input data $x$, although the logistic model outputs a probability conditional on the data ($\mathbb{P}(Y = 1 \mid x)$).
Let's consider the following logistic model with an interaction term, a quadratic term and numerical features
$$
\mathbb{P}(Y = 1 \mid x) = f(x) = \text{logistic}\left(\beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_3 \beta_3 + \beta_{12} x_1 x_2 + \beta_{33} x_3^2\right).
$$

Calculate the odds ratio for each parameter $\beta_1, \beta_2$ and $\beta_3$ if we
increase the respective feature value by one unit.
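
A hand-derived result can be validated numerically. The sketch below does this for a smaller, hypothetical model with two features and a single interaction term (the coefficients are assumptions); it evaluates the odds ratio for $x_1$ at two different inputs, which shows whether the ratio is constant in $x$.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical two-feature model with one interaction term.
b0, b1, b2, b12 = -0.5, 0.7, -0.2, 0.4


def odds(x1, x2):
    """Odds P(Y=1|x) / (1 - P(Y=1|x)) of the hypothetical model."""
    p = logistic(b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2)
    return p / (1.0 - p)


def odds_ratio_x1(x1, x2):
    """Odds ratio when x1 is increased by one unit at the point (x1, x2)."""
    return odds(x1 + 1.0, x2) / odds(x1, x2)


ratio_a = odds_ratio_x1(0.0, 0.0)
ratio_b = odds_ratio_x1(0.0, 2.0)
print(ratio_a, ratio_b)
```

The same pattern extends to the model above by adding the remaining terms to `odds` and comparing the printed ratios with the closed-form expressions.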

**Solution:**

<span style="color:#AAAEBC"> - your solution here - </span>