# Initial exploration and removing a "NaN" (+ doing some variable selection)

This notebook demonstrates two exploratory plots:

1. The [scatter plot matrix](https://www.itl.nist.gov/div898/handbook/eda/section3/scatplma.htm)
   (see also page 235 in our textbook).
   
2. The correlation heatmap. This will show correlation coefficients calculated between pairs of variables
   in a colorful plot. We can choose the type of correlation coefficient - one typical choice is
   the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
   
As an example, we will use data for 21 individuals with high blood pressure. The variables are (see table 1):

| Column  | Description                                                                 |             Unit |
|:--------|:----------------------------------------------------------------------------|-----------------:|
| BP      | Blood pressure                                                              |             mmHg |
| Age     | Age                                                                         |            years |
| Weight  | Weight                                                                      |               kg |
| BSA     | Body surface area                                                           |               m² |
| DUR     | Duration of hypertension                                                    |            years |
| Pulse   | Basal heart rate                                                            | beats per minute |
| Stress  | Stress index                                                                |              --- |
| random1 | Some random numbers                                                         |              --- |
| tide    | Forecasted water levels at high tide the next 20 days in Trondheim          |              m   |
||**Table 1:** *Data columns present in the file [bloodpress.csv](bloodpress.csv)*|

We will also see what we can do if one value is missing
(say, because we made a mistake when measuring a variable).

## Loading the data and fixing the missing value

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(style="ticks", context="notebook", palette="muted")
%matplotlib notebook

In [None]:
data = pd.read_csv("bloodpress.csv")
data

In [None]:
# Describe the data:
data.describe()

We see here that we have 21 observations for all columns, except for the weight. If we look closer at the data table
above, we can see that this column contains a [Not a number (NaN)](https://en.wikipedia.org/wiki/NaN).
We can ask pandas if this is the case:

In [None]:
print("Do we have a NaN?", data.isnull().values.any())

We do have a NaN in our data. Now, we have to decide what we should do with that. Two common "solutions" are:

1. Remove this observation (the whole row).
2. Remove the affected variable (weight).

Usually, we prefer option 1 since the variable might be important and we would like to keep it!
We can tell pandas to do this with [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html),
which by default removes every row that contains a NaN.

In [None]:
data.dropna(inplace=True)

In [None]:
data.describe()

In [None]:
print("Do we have a NaN?", data.isnull().values.any())
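
To make the two options above concrete, here is a minimal sketch on a small made-up table (the values are invented for illustration and are not read from bloodpress.csv). It counts the NaNs per column and compares what is left after dropping the affected row with what is left after dropping the affected column:

```python
import numpy as np
import pandas as pd

# Toy data (invented values) with one missing weight:
toy = pd.DataFrame(
    {
        "BP": [105.0, 115.0, 116.0, 117.0],
        "Age": [47, 49, 49, 50],
        "Weight": [85.4, np.nan, 89.4, 99.5],
    }
)

nan_per_column = toy.isna().sum()  # Number of NaNs in each column
rows_dropped = toy.dropna()  # Option 1: remove rows containing a NaN
cols_dropped = toy.dropna(axis="columns")  # Option 2: remove columns containing a NaN

print(nan_per_column)
print("Shape after dropping rows:", rows_dropped.shape)
print("Shape after dropping columns:", cols_dropped.shape)
```

Option 1 costs us one observation but keeps all variables, while option 2 keeps all observations but loses the weight entirely.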

## Exploring correlations between pairs of variables - Scatter Plot Matrix
Before we do any modeling, we should check if some variables are correlated. To do this,
we will create a [Scatter Plot Matrix](https://www.itl.nist.gov/div898/handbook/eda/section3/scatplma.htm) using [seaborn](https://seaborn.pydata.org/).

The Scatter Plot Matrix can be used
to identify possible variables we can use for prediction or variables that explain the same thing.

In [None]:
grid = sns.pairplot(
    data, kind="reg"
)  # Create the scatter plot matrix! Add regression line to help with reading.

From the above, we see, for instance, that blood pressure is (positively) correlated with weight.
So it was good that we did not remove that column to get rid of the NaN
since the weight seems to predict the blood pressure!

## Exploring correlations between pairs of variables - Correlations
The Scatter Plot Matrix can be difficult to read for many variables. We can reduce the plots to just numbers by
calculating correlations between different pairs of variables. We will here use the
[Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). This
is a number between -1 and 1 that quantifies the correlation between a pair of variables. Here is a picture
from Wikipedia that shows different situations:

![Pearson correlation coefficient - picture](https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Correlation_coefficient.png/600px-Correlation_coefficient.png)

In [None]:
corr = data.corr()  # Calculate correlations between all pairs of variables
corr.style.background_gradient(
    cmap="Blues"
)  # Show the correlations in a colored table:
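
To see what the Pearson correlation coefficient actually measures, the sketch below computes it by hand for two synthetic variables (generated here purely for illustration) and compares the result with `numpy.corrcoef`. It assumes the usual definition, the covariance divided by the product of the standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)  # y depends linearly on x, plus noise

# Pearson r = covariance(x, y) / (std(x) * std(y)), written with centered values:
x_c = x - x.mean()
y_c = y - y.mean()
r_manual = np.sum(x_c * y_c) / np.sqrt(np.sum(x_c**2) * np.sum(y_c**2))

r_numpy = np.corrcoef(x, y)[0, 1]  # The same number from NumPy
print(r_manual, r_numpy)
```

If a rank-based coefficient is a better fit for the data, pandas also accepts `data.corr(method="spearman")` or `data.corr(method="kendall")`.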

We can also make a nice plot as follows:

In [None]:
fig, ax = plt.subplots(constrained_layout=True)
sns.heatmap(corr, cmap="PiYG", vmin=-1, vmax=1, annot=True, ax=ax);

From the plot above, we see, for instance, that the blood pressure is most strongly correlated with weight, but
also that it is positively correlated with age, body surface area, and heart rate.

## Creating a model for predicting the blood pressure
Let us also create a least squares model for the blood pressure, to check if we can predict it!

In [None]:
from sklearn.preprocessing import scale
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

y = data["BP"].to_numpy()
variables = [i for i in data.columns if i != "BP"]
X = data[variables].to_numpy()

In [None]:
X_scaled = scale(X)
model = LinearRegression(fit_intercept=True)
model.fit(X_scaled, y)
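
Scaling the columns of `X` before fitting matters when we want to compare the coefficients with each other. The following sketch uses two synthetic variables (invented for illustration; the names `x_a` and `x_b` are hypothetical) that vary on very different scales, and fits the same least squares model with and without scaling:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import scale

rng = np.random.default_rng(1)
n = 200
x_a = rng.normal(loc=2.0, scale=0.1, size=n)  # Varies little (think: an area in m²)
x_b = rng.normal(loc=70.0, scale=15.0, size=n)  # Varies a lot (think: a weight in kg)
y_demo = 5.0 * x_a + 0.5 * x_b + rng.normal(scale=0.5, size=n)
X_demo = np.column_stack([x_a, x_b])

# Same model, fitted on the raw and on the scaled variables:
coef_raw = LinearRegression().fit(X_demo, y_demo).coef_
coef_scaled = LinearRegression().fit(scale(X_demo), y_demo).coef_

print("Raw coefficients:   ", coef_raw)
print("Scaled coefficients:", coef_scaled)
```

On the raw variables, `x_a` gets the largest coefficient simply because of its units. After scaling, each coefficient describes the change in `y_demo` per standard deviation of the variable, and `x_b` turns out to explain most of the variation.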

In [None]:
def score_model(y_true, y_pred, k=0):
    """Calculate some scores for predicted y-values"""
    r2 = r2_score(y_true, y_pred)  # R²
    mse = mean_squared_error(y_true, y_pred)  # Mean squared error
    n = len(y_true)  # Number of observations (use the argument, not the global y)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # R²-adjusted
    return r2, r2_adj, mse

In [None]:
y_hat = model.predict(X_scaled)
scores = score_model(y, y_hat, k=len(variables))
print(scores)
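
The adjusted R² penalizes models for using many variables. As a small worked example with invented numbers (not results from this data set), the sketch below keeps R² fixed and only changes the number of variables, `k`. The helper `adjusted_r2` is a hypothetical stand-alone version of the formula used in `score_model`:

```python
def adjusted_r2(r2, n, k):
    """R²-adjusted for n observations and k predictor variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# The same (invented) R² = 0.90 for 20 observations, with more and more variables:
for k in (1, 3, 8):
    print(f"k = {k}: R²-adjusted = {adjusted_r2(0.90, n=20, k=k):.3f}")
```

With few observations, each extra variable lowers the adjusted R² noticeably unless it also raises R².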

It is hard to plot the blood pressure as a function of all the variables we have used. This would be a 9-dimensional
plot! One useful plot we can make is to plot the predicted and measured y-values against each other.
If the prediction is perfect, these points will all fall on the $x=y$ line:

In [None]:
fig, axi = plt.subplots(constrained_layout=True)
axi.scatter(y_hat, y)
axi.set_aspect("equal")  # Make the plot square
axi.plot(
    [100, 130], [100, 130], ls=":", color="k"
)  # Plot x=y to help us read the plot
sns.despine(fig=fig)

We can also show the parameters of the linear model. Since we have scaled the variables, this will
tell us something about the importance of the different variables:

In [None]:
fig, axi = plt.subplots(constrained_layout=True)
pos = range(len(variables))
axi.bar(pos, model.coef_)
axi.axhline(y=0, ls=":", color="k")
axi.set_xticks(pos)
axi.set_xticklabels(variables)
sns.despine(fig=fig)

Here, we see that the highest coefficients are for age, weight, BSA, and heart rate. This fits well with what
we have seen in the correlation plots. But, in those plots, we also see that some of these variables are
correlated. We, therefore, expect that we can make simpler models that are almost as good as the one we have just
made. Let us try this:

In [None]:
selections = [  # Try some more selections here!
    ["Weight", "Age", "BSA"],
    ["Weight", "Age", "BSA", "Pulse"],
]

table = {"variables": [], "r2": [], "r2(adj)": [], "mse": []}

all_models = []

for selection in selections:
    X_scaled = scale(data[selection].to_numpy())
    model_sel = LinearRegression(fit_intercept=True)
    model_sel.fit(X_scaled, y)
    all_models.append(model_sel)

    y_hat = model_sel.predict(X_scaled)
    r2, r2_adj, mse = score_model(y, y_hat, k=len(selection))

    table["variables"].append(" & ".join(selection))
    table["r2"].append(r2)
    table["r2(adj)"].append(r2_adj)
    table["mse"].append(mse)


table = pd.DataFrame(table)

fig, axi = plt.subplots(constrained_layout=True)
pos = range(len(table["variables"]))
axi.plot(pos, table["r2"], marker="o", label="R²")
axi.plot(pos, table["r2(adj)"], marker="X", label="R²-adjusted")
axi.legend()
axi.set_xticks(pos)
axi.set_xticklabels(table["variables"].values)
axi.set_ylabel("R² & R²-adjusted")
sns.despine(fig=fig)

In [None]:
table

If we adhere to [Occam's razor](https://en.wikipedia.org/wiki/Occam%27s_razor), we are happy with a model
predicting the blood pressure from just the weight, or the weight & age.

**PS!** There are ways of automating the variable (or feature) selection. Please see the scikit-learn documentation
on [feature selection](https://scikit-learn.org/stable/modules/feature_selection.html).
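
As one possible starting point, here is a minimal sketch using `SelectKBest` with the `f_regression` score from scikit-learn. It runs on synthetic data generated for this example (not on the blood pressure data), where only the first two of four variables influence the response:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(2)
n = 200
X_demo = rng.normal(size=(n, 4))  # Four synthetic variables
# Only the first two variables influence y; the last two are pure noise:
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2)  # Keep the 2 best-scoring variables
selector.fit(X_demo, y_demo)
selected = selector.get_support(indices=True)  # Column indices that were kept
print("Selected columns:", selected)
```

Note that this scores each variable on its own, so it does not account for correlations between the variables. Whether that is acceptable for the blood pressure data is something we would have to check.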

## Alternative to least squares
It can be a lot of work to compare different models and try different selections of variables. Let us
try an alternative, the [least absolute shrinkage and selection operator (LASSO)](https://en.wikipedia.org/wiki/Lasso_(statistics)).
This one modifies the error we minimize. In least squares we minimize the
squared errors,

\begin{equation}
J = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2,
\end{equation}

where $\hat{y}_i = b_0 + b_1 x_{i1} + \ldots = b_0 + \sum_{j=1}^m b_j x_{ij}$ is the prediction for observation $i$ from its $m$ variables,
while in LASSO, we minimize

\begin{equation}
J = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^m | b_j | .
\end{equation}

In practice, the extra term penalizes large coefficients, and the minimization can now find solutions where some of the $b_j$ are
exactly zero (that is, the corresponding variables are not important for the model).

In [None]:
from sklearn.linear_model import Lasso

In [None]:
X_scaled = scale(X)
model_lasso = Lasso(alpha=2)
model_lasso.fit(X_scaled, y)

In [None]:
fig, axi = plt.subplots(constrained_layout=True)
pos = range(len(variables))
axi.bar(pos, model_lasso.coef_)
axi.axhline(y=0, ls=":", color="k")
axi.set_xticks(pos)
axi.set_xticklabels(variables)
sns.despine(fig=fig)

**Conclusion:** The LASSO method "automatically" figures out that the age and weight are the variables we need here.
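
How many variables survive depends on the value of `alpha`. The sketch below illustrates this on synthetic data (generated for this example, with only the first two of five variables influencing the response) by fitting `Lasso` for a few values of `alpha` and counting the non-zero coefficients. One caveat: scikit-learn documents its `Lasso` objective with a factor $\frac{1}{2N}$ in front of the squared errors, rather than the $\frac{1}{N}$ written above, so `alpha` values are not directly comparable between the two forms. The formula for `alpha_max` in the code assumes scikit-learn's form.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import scale

rng = np.random.default_rng(3)
n = 100
X_demo = scale(rng.normal(size=(n, 5)))  # Five synthetic, scaled variables
# Only the first two variables influence y:
y_demo = 100 + 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n)

# Smallest alpha for which scikit-learn's Lasso sets every coefficient to zero:
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / n

n_nonzero = {}
for alpha in (0.01, 0.5, 1.01 * alpha_max):
    model_demo = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_nonzero[alpha] = int(np.sum(model_demo.coef_ != 0))
    print(f"alpha = {alpha:.2f}: {n_nonzero[alpha]} non-zero coefficients")
```

A very small `alpha` gives a model close to ordinary least squares, while an `alpha` above `alpha_max` removes every variable. In practice, `alpha` is usually chosen by cross-validation, for instance with `LassoCV` from scikit-learn.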