In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from typing import Protocol

from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler


from training_rl.nb_utils import display_dataframes_side_by_side
from training_rl.config import get_config


In [None]:
c = get_config(reload=True)

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title">TfL Training Template</div>
<div class="md-slide title">Include title and greeting with divs</div>

# Template for TfL Notebooks

This notebook is a template/readme for TfL trainings. 
Please make sure to **read all of it** before creating content
for the trainings!

We use rise for supporting a presentation mode inside jupyter.
Try it out - enter the presentation mode by pressing `Alt + R`
(and exit it with the same shortcut). 

For presenting, it is highly recommended to enter the browser's
full-screen mode (`F11` on Windows with chrome-based browsers).

For viewing cell metadata related to slides or other aspects, check 
out `View -> Cell Toolbar`. When you receive this template, this
is set to `None` to demonstrate how a committed notebook should
look.


The slides are customized with header, footer and backimage as
described [here](https://rise.readthedocs.io/en/stable/customize.html).

Don't forget to customize the title-slide's and last slide's texts :).


## Extension goodies

To make the notebook prettier and more functional, we use some 
[jupyter extensions](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/).

1. *Equation numbering* for rendering numbers and links to equations. So this works:
\begin{equation}
  \mathcal{X}= \{1, \dots, N \} \label{x_def}.
\end{equation}
Now we can refer to the equation as: $\ref{x_def}$. Note however that:

- You have to use `\begin{equation} ... \end{equation}`, just `$$` won't work.
- Sometimes things will have to be refreshed, use the button `Reset equation numbering` in the toolbar.
- Links to equations don't work in presentation mode.
    
2. *Hinterland* for immediate autocompletion (without `Tab`).

3. *Spellchecker* (most extensions won't work in an IDE, so you will probably write quite a lot directly from the notebook interface). You should probably always keep it on.
4. The *toc2* extension for a table of contents (in TfL style). This extension is a bit buggy on Windows (on linux or inside docker everything is fine), but still usable and of value. It also provides a `Navigation` button in the toolbar which works fine on any OS. **Important**: The ToC does not automatically disappear when going into presentation mode. Thus, before opening presentation mode, first close the ToC. For using the ToC on Windows:  

    - When the ToC is first opened, drag the resize button a bit to make it actually turn into a sidebar.
    - When it is closed, the remaining jupyter window doesn't resize automatically. Drag the browser and maximize it again to trigger resizing.
    
5. *Initialization cells*. Nice extension to define which cells should be executed automatically as the notebook is loaded or the kernel is restarted. We use it to automatically load styles and data. In order to mark cells as initialization cells, go to `View -> Cell Toolbar -> Initialization Cell`. Switch back to viewing `Slideshow` after.
6. *Hide input*. Used in conjunction with initialization cells and the `%%capture` magic command (which also hides output) to hide style and data loading from the users. Note that the cells don't fully disappear; they are turned into a single white line. Select the white line and untoggle it with the `^` button in the toolbar to view and edit it. Only works on code cells.
7. The *exercise2* extension for displaying hints and solutions.

## A note on styles

Styles are defined in the [rise.css](rise.css) file. They are used in presentation
automatically, but must be loaded to be available in the notebook mode.
This is achieved by using the custom magic command `%presentation_style`
in a hidden initialization cell, see above.

The styles are intended to be in sync with the [TfL website](https://tfl.appliedai.de/). 
This means headings h3-h6 look like this:

### Some text in h3 - in TfL grey.

The styles in the notebook are similar to the presentation mode but not exactly the same.

When fully done with preparing a notebook, set `View -> Cell Toolbar` to `None` before
committing it. This will look much prettier :)


## Basics of rise

Have a look at the [documentation of rise](https://rise.readthedocs.io/en/stable/). 
Be sure to declare the types of the cells such that the presentation looks nice.

For example, it might be a good idea to have the first cell in a section (with the title as h1)
declared as slide, and all following cells therein as sub-slide or fragment.

Here is a sub-slide that contains some code as fragments.

In [None]:
df1 = pd.DataFrame({"a1": [1, 2], "b1": [4, 5]})

df2 = pd.DataFrame({"a2": [1, 2], "b2": [4, 5]})

In [None]:
display_dataframes_side_by_side([df1, df2])

# Handling data

We use [accsr](https://github.com/appliedAI-Initiative/accsr) 
for downloading and uploading data from/to a bucket. As a recipient of the template,
there is very little that you have to do to make everything work.

1. Either set the env-var `TFL_STORAGE_ADMIN_SECRET` or overwrite the `secret` entry in (the gitignored) `config_local.yml`. Ask someone for the value, in case you don't have it.
2. Structure your data as `data/first_nb`, `data/second_nb` etc. This allows data to be downloaded selectively, so users are not forced to download all data if they just want to execute a single notebook after the workshop. You can also have a `data/common` directory.
3. Once data is stored on your disk, call `python scripts/upload_data.py` to upload everything to the cloud. If you made changes to files and want to overwrite them, call `python scripts/upload_data.py -f`. Don't worry about doing this repeatedly, accsr takes care of never re-uploading or re-downloading everything.
4. In notebooks, include magic commands a la `%download_data first_nb` in hidden initialization cells, as done here. This ensures that required data will be pulled when the user opens the notebook. Calling `%download_data` without an argument will download everything.
5. During the actual presentation, data will be downloaded on every spun-up server automatically in the background. Thus, users can get started right away. This is realized by calling the `download_data.py` script as part of the entrypoint. For everything to go smoothly, ask the workshop participants to spin up a server right away when the workshop begins. Then all downloads should finish during the introductory first presentation.


# Writing content

Below we give some guidelines on how to design content for a smooth and beautiful
experience for everybody.

## Pretty Markdown

The way markdown is rendered is, unfortunately, slightly dependent on the renderer.
It behaves differently in the live notebook environment and in the html created
by nbconvert or by jupyter-book. Especially lists, nested lists and equations
within lists are treated differently.

Have a look at the way the list in the [Extension goodies](#Extension-goodies) section is
written. While difficult to decipher in the un-rendered form, it does render
correctly in all environments.

### Custom latex

It is possible to define latex macros and use them in equations inside jupyter 
markdown. The template already comes with some predefined macros, they
were loaded by the magic command `%load_latex_macros`. To extend the set
of macros, simply extend the corresponding string in `constants.py` AND
the mathjax configuration in `notebooks/_config.yaml`.

Because the macros were loaded, this works: $\E$ and $\P$.

### Rendering and Cells

There are three renderings that are important for us:

1. Notebook and presentation mode (live rendering)
2. The html created by nbconvert (closest to the notebook)
3. The html created by jupyter-book (final delivery)

They are similar but unfortunately not exactly equal. For example,
the configuration of mathjax (used for latex macros) has different
mechanisms in the notebook and the jupyter-book.

Also, we might want to remove certain cells depending on the renderer.
E.g., it might not be needed to include the cells with the hints in the
jupyter-book (since the solution is included anyway), but they should
be included in the nbconvert-created html, that is used internally to
review the training.

How cells are processed by the renderer is defined by **tags** in
cell metadata. For jupyter-book tags, see the [documentation](https://jupyterbook.org/en/stable/interactive/hiding.html).
For nbconvert, the `build_docs.sh` script is configured to take into
account tags that end in `nbconv`. So, a cell tagged with
`["remove-cell", "hide-input-nbconv"]` will be removed in the
jupyter-book rendering and input-hidden in the nbconvert rendering.
We cannot use the same tags for nbconvert as for jupyter-book, since
we sometimes need to treat cells differently depending on the renderer.
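
The tag convention can be illustrated with a small, hypothetical helper. This is not the logic of `build_docs.sh` (which is not shown here); the function name `split_tags_by_renderer` and the exact suffix `-nbconv` are assumptions made purely for illustration.

```python
NBCONVERT_SUFFIX = "-nbconv"  # assumption: nbconvert tags end in "-nbconv"


def split_tags_by_renderer(tags: list[str]) -> dict[str, list[str]]:
    """Split cell tags into jupyter-book tags and nbconvert tags.

    Tags ending in the nbconvert suffix are assigned to nbconvert (with the
    suffix stripped), all other tags are assigned to jupyter-book.
    """
    result: dict[str, list[str]] = {"jupyter-book": [], "nbconvert": []}
    for tag in tags:
        if tag.endswith(NBCONVERT_SUFFIX):
            result["nbconvert"].append(tag[: -len(NBCONVERT_SUFFIX)])
        else:
            result["jupyter-book"].append(tag)
    return result


# cell metadata as it would appear in the notebook's JSON
cell_metadata = {"tags": ["remove-cell", "hide-input-nbconv"]}
print(split_tags_by_renderer(cell_metadata["tags"]))
# {'jupyter-book': ['remove-cell'], 'nbconvert': ['hide-input']}
```

Read this way, each renderer only sees the tags meant for it, which is why a single cell can be removed in one rendering and merely input-hidden in the other.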


## Code Guidelines


The most evil thing inside a notebook is state. It is very typical to encounter code
like

```python
df = pd.read_csv("some/path.csv")
df["my_col"] = df["my_col"] / 10

# Exercise 1: do something to df

# Exercise 2 (in a new cell): now do something else to df
```

This kind of approach, while maybe looking innocent, leads to confusion about
what is actually behind the variable `df`. Every training participant will have
a different value stored in it, and re-executing code from earlier cells after
new cells have been executed might give junk results because `df` has been
modified. It is very easy to completely lose track of what has happened to
mutable objects in notebooks, due to jupyter's non-linear execution model.


The antidote to the above conundrum is to rely on state as little as possible.
Thus, only expensive operations, like loading data, are stored in state, and
everything else is done with pure functions.

Note that participants will mostly base their solutions on copy-pasted code
from the notebook or from hints! This means all hints, proposals, and 
exercise solutions should be written as functions and not as state-altering 
sequences of commands.
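
To make the difference concrete, here is a minimal sketch that uses a plain list instead of a data frame so that it runs without any dependencies. The names `rescale_in_place` and `rescaled` are made up for this illustration.

```python
# Stateful style: the result depends on how often the "cell" has been executed.
prices = [100.0, 250.0, 400.0]


def rescale_in_place(values: list[float]) -> None:
    """Mutate the list, like `df["my_col"] = df["my_col"] / 10` would."""
    for i, value in enumerate(values):
        values[i] = value / 10


rescale_in_place(prices)
rescale_in_place(prices)  # accidental re-execution of the same cell
print(prices)  # [1.0, 2.5, 4.0] and not [10.0, 25.0, 40.0]


# Functional style: the input is never modified, re-execution is harmless.
raw_prices = [100.0, 250.0, 400.0]


def rescaled(values: list[float]) -> list[float]:
    """Return a rescaled copy and leave the input untouched."""
    return [value / 10 for value in values]


print(rescaled(raw_prices))  # [10.0, 25.0, 40.0]
print(rescaled(raw_prices))  # [10.0, 25.0, 40.0], same result every time
```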

In [None]:
# Example: we want to normalize the housing data set.
# We load the data into the notebook's state, but all processing is done by functions

housing_df = pd.read_csv(c.housing_data)
housing_df.head()

Now we write functions for all processing steps, as below:

In [None]:
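def get_numerical_columns(df: pd.DataFrame) -> pd.Index:
    # np.number already covers all integer and float dtypes
    return df.select_dtypes(include=[np.number]).columns


def normalize_and_get_scaler(
    df: pd.DataFrame, columns: list[str] = None
) -> tuple[pd.DataFrame, StandardScaler]:
    # explicit None check: `columns or ...` would raise for a pd.Index
    if columns is None:
        columns = get_numerical_columns(df)
    scaler = StandardScaler()
    result = df.copy()  # leave the input data frame untouched
    result[columns] = scaler.fit_transform(df[columns])
    return result, scaler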

To visualize this processing step, we create a new variable 
instead of overwriting `housing_df`.

However, in further processing steps, one probably should not use
this new variable and instead just call `normalize_and_get_scaler`
again. Proceeding in this fashion guarantees that the state of
the notebook is always transparent. 

*You will thank yourself later
if you consistently avoid state and instead recompute things if 
necessary!*

For expensive computations with simple inputs, consider using `@cache` from `functools`.
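
A minimal sketch of what this could look like, assuming Python 3.9 or newer (where `functools.cache` is available). The function `expensive_square` and the counter `call_count` are made up for illustration.

```python
from functools import cache

call_count = 0  # only used to show how often the function body actually runs


@cache
def expensive_square(x: int) -> int:
    """Stand-in for an expensive computation with a simple, hashable input."""
    global call_count
    call_count += 1
    return x * x


expensive_square(12)
expensive_square(12)  # served from the cache, the function body is not run again
print(call_count)  # 1
```

Note that `@cache` requires hashable arguments, so a `pd.DataFrame` cannot be passed directly. Passing a file path or another simple identifier and loading the data inside the function is one way around this.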

In [None]:
normalized_housing_df, scaler = normalize_and_get_scaler(housing_df)
normalized_housing_df.head()

### A note on software engineering

Apart from keeping the structure functional to avoid state, we are also committed to 
writing the best possible code. This is not only good for us internally, and demonstrates
our competence to training participants, but also means that the code received by participants
later on is something they can build on.

We should not back down from advanced programming concepts if they are appropriate. Of course,
during the training you can tell participants that it is not very important to understand
some parts of the code (especially if the participants don't have a solid software engineering
background). But this does not mean that we have to compromise on code quality, just because
some participants won't appreciate it.

So, feel free to use things like `Generic` or `Protocol`, add precise types everywhere,
write good tests and docstrings, and structure your code as if the training 
was a project delivered to a knowledgeable customer - which is exactly what a training is!
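
As a rough illustration of what this can look like in practice, here is a small sketch combining `Protocol` and `Generic`. All names (`SupportsPredict`, `Experiment`, `DoublingModel`, `run`) are hypothetical and not part of the template's code base.

```python
from typing import Generic, Protocol, TypeVar

T = TypeVar("T")


class SupportsPredict(Protocol):
    """Anything with a matching `predict` method satisfies this protocol."""

    def predict(self, x: float) -> float:
        ...


class Experiment(Generic[T]):
    """A named result container that is generic in the type of its payload."""

    def __init__(self, name: str, payload: T) -> None:
        self.name = name
        self.payload = payload


class DoublingModel:
    """Toy model; it does not inherit from SupportsPredict but still matches it."""

    def predict(self, x: float) -> float:
        return 2 * x


def run(model: SupportsPredict, x: float) -> Experiment[float]:
    """Apply the model to `x` and wrap the prediction in a typed container."""
    return Experiment(name="doubling", payload=model.predict(x))


print(run(DoublingModel(), 1.5).payload)  # 3.0
```

The protocol keeps `run` independent of any concrete model class, while the generic container lets a type checker track what kind of payload each experiment holds.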

## Exercises

The exercises, apart from being interesting and illuminating, should work on three levels.

- A motivated, **highly experienced** engineer should be able to **write the solution** from scratch, maybe after looking through the source code in some modules. Note that you can reference source code inside markdown as: [regression_housing_data.py](../../edit/scripts/regression_housing_data.py).
- A less motivated, **somewhat experienced** engineer can **open the hint**, which will explain how to write the solution and provide a proposal for signatures of the functions that constitute it. They should be able to solve the exercise with this support.
- Finally, a non-motivated or **less experienced** engineer can directly **load the solution** and go through it. The code should be well documented and invite the user to read through it line by line, try out different parametrizations, or even to modify it and to see what happens.

## Example exercise

We will use the `%load` and the custom `%view_hint` magic commands for displaying
solutions and hints. This means that the solutions and hints should be
saved as code/markdown snippets. It makes things a bit hard to debug,
because the snippets are not valid code on their own, but there is
currently no better solution.

Consider also writing a script in addition to the notebook where
the logic of all exercises is done in a single, normal, linear execution
environment. This will not only be a useful reference for participants
but also make your own life easier, as you won't have to re-execute
the notebook all the time or keep track of what is happening there.
You will also have full IDE and Copilot support when writing a script,
which can be very nice ;).

Once such a script is finished, transferring the code to the notebook
is rather quick.

### Exercise 1: regression on housing data

Use some regression model to predict the `median_house_value` from other attributes.

## Hint for Exercise 1

If you want, you can view a hint for the solution by pressing the button below.

You should encode the categorical values and remove nans.

Then, create a train-test split, normalize the data frame and train
some sklearn model, for example a linear regression, on the target column.

The results can be visualized with `plt.scatter`, you should also
compute some metrics like the mean squared error and the r2 score.

We found writing functions with the following signatures useful (you can
ignore the `Protocol` part, it's just a way to define a type hint):

```python

class SKlearnModelProtocol(Protocol):
    def fit(self, X: pd.DataFrame, y: pd.DataFrame):
        ...

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        ...


def get_categorical_columns(df: pd.DataFrame) -> list[str]:
    pass


def one_hot_encode_categorical(
    df: pd.DataFrame, columns: list[str] = None
) -> pd.DataFrame:
    pass


def train_sklearn_regression_model(
    model: SKlearnModelProtocol, df: pd.DataFrame, target_column: str
) -> SKlearnModelProtocol:
    pass


def remove_nans(df: pd.DataFrame) -> pd.DataFrame:
    pass


def get_normalized_train_test_df(df: pd.DataFrame, test_size: float = 0.2) -> tuple[pd.DataFrame, pd.DataFrame]:
    pass


def evaluate_model(
    model: SKlearnModelProtocol, X_test: pd.DataFrame, y_test: pd.DataFrame
) -> np.ndarray:
    pass
```

You can also jump ahead to the full solution: either have a look at
[regression_housing_data.py](../../edit/scripts/regression_housing_data.py) or
click the button below.

## Solution Exercise 1

In [None]:
class SKlearnModelProtocol(Protocol):
    def fit(self, X: pd.DataFrame, y: pd.DataFrame):
        ...

    def predict(self, X: pd.DataFrame) -> np.ndarray:
        ...


def get_categorical_columns(df: pd.DataFrame):
    return df.select_dtypes(include=["object", "category"]).columns


def one_hot_encode_categorical(
    df: pd.DataFrame, columns: list[str] = None
) -> pd.DataFrame:
    # explicit None check: `columns or ...` would raise for a pd.Index
    if columns is None:
        columns = get_categorical_columns(df)
    for column in columns:
        df = pd.concat([df, pd.get_dummies(df[column], prefix=column)], axis=1)
        df = df.drop(column, axis=1)
    return df


def train_sklearn_regression_model(
    model: SKlearnModelProtocol, df: pd.DataFrame, target_column: str
):
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    model.fit(X, y)
    return model


def remove_nans(df: pd.DataFrame):
    # count NaN entries across all columns (not the number of affected rows)
    nans_count = df.isna().sum().sum()
    if nans_count > 0:
        print(f"Warning: {nans_count} NaNs were found and removed")
    return df.dropna()


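def get_normalized_train_test_df(df: pd.DataFrame, test_size: float = 0.2):
    df = remove_nans(df)
    # NOTE: for brevity the scaler is fit on the full data set (including the
    # target column) before splitting; to avoid leaking test-set statistics,
    # fit it on the training split only in real projects.
    df, _ = normalize_and_get_scaler(df)
    df = one_hot_encode_categorical(df)
    train_df = df.sample(frac=1 - test_size)
    test_df = df.drop(train_df.index)
    return train_df, test_df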


def evaluate_model(
    model: SKlearnModelProtocol, X_test: pd.DataFrame, y_test: pd.DataFrame
):
    y_pred = model.predict(X_test)
    print(f"Mean squared error: {mean_squared_error(y_test, y_pred)}")
    print(f"R2 score: {r2_score(y_test, y_pred)}")
    return y_pred


# normalize and split data
train_df, test_df = get_normalized_train_test_df(housing_df)

# train
trained_model = train_sklearn_regression_model(
    LinearRegression(), train_df, "median_house_value"
)

# evaluate
y_pred = evaluate_model(
    trained_model,
    test_df.drop("median_house_value", axis=1),
    test_df["median_house_value"],
)

# visualize results
plt.scatter(test_df["median_house_value"], y_pred)
plt.xlabel("True median house value")
plt.ylabel("Predicted median house value")
plt.show()


<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title">Thank you for the attention!</div>

<div class="md-slide title">Say goodbye in your preferred way :) </div>