<img src="../figures/HeaDS_logo_large_withTitle.png" width="300">

<img src="../figures/tsunami_logo.PNG" width="600">

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Center-for-Health-Data-Science/PythonTsunami/blob/intro/ML/scikit-learn.ipynb)

# Scikit Learn (`sklearn`)

*Prepared by Henry Webel at [NNF CPR](https://www.cpr.ku.dk/staff/rasmussen-group/?pure=en/persons/662319)  [![Twitter](https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social&label=Follow%20%40Henrywebel)](https://twitter.com/henrywebel)*

- Pre-requisites: Python Intro, NumPy, minimal Pandas, matplotlib

### Saving the notebook in Drive
Save a copy in your drive if you want to save your changes: `File` -> `Save a copy in Drive`

![Save Colab Notebook in Google Drive](figures/colab_save_in_drive_2.png)


**Table of Contents in Colab**
> Allows easier navigation

![Table of content in Colab](figures/colab_toc.png)

## Contents

1. Scikit-learn API introduction.
2. If needed: Machine Learning
3. Use-Case with different objects from scikit-learn
    - this includes some exercises
4. Further material

### Resources


- [Glossary](https://scikit-learn.org/stable/glossary.html#glossary)
- [examples](https://github.com/scikit-learn/scikit-learn/tree/master/examples)
- [API design for machine learning software: experiences from the scikit-learn project](https://arxiv.org/abs/1309.0238)
- [Géron, Aurélien (2019): Hands on Machine Learning in Scikit-Learn, Keras and TensorFlow, Vol. 2, Ch. 1- 9](https://github.com/ageron/handson-ml2)

## Scikit-learn

A library of data science algorithms with a unified interface.

This notebook is based on the available [tutorials](https://scikit-learn.org/stable/tutorial/index.html), which are interesting to read but unfortunately not based on executable notebooks.

We will try to predict `Age` using several `RNA`-measurements:

|    |   RPA2_3 |   ZYG11A_4 |   F5_2 |   HOXC4_1 |   NKIRAS2_2 |   MEIS1_1 |   SAMD10_2 |   GRM2_9 |   TRIM59_5 |   LDB2_3 |   ELOVL2_6 |   DDO_1 |   KLF14_2 |   Age |
|---:|---------:|-----------:|-------:|----------:|------------:|----------:|-----------:|---------:|-----------:|---------:|-----------:|--------:|----------:|------:|
|  sample 0|    52.36 |      11.95 |  47.48 |     36.08 |        35.1 |     70.16 |      43.46 |    23.31 |      33.64 |    74.44 |      36.12 |   70.65 |      2.46 |    20 |

### Motivating example and Linear Regression recap

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)

In [None]:
data_univariate = {
    'KLF14_2': {0: 2.46, 1: 1.77, 2: 2.0, 3: 5.44, 4: 4.48, 5: 0.98, 6: 2.12, 7: 2.29, 8: 1.69, 9: 2.54},
    'Age': {0: 20, 1: 29, 2: 49, 3: 67, 4: 65, 5: 20, 6: 31, 7: 37, 8: 35, 9: 39}
} # first 10 samples from real data example below
X_sample = pd.DataFrame(data_univariate)
ax = X_sample.plot(kind='scatter', x='KLF14_2', y='Age')

We can fit a linear regression function $y=w \cdot x+b$ to the point cloud, predicting age using only `KLF14_2` as a feature. We determine $w$ (coefficient) and $b$ (bias or intercept) using `numpy`.

In [None]:
x = X_sample['KLF14_2']
y = X_sample['Age']
w, b = np.polyfit(x, y, deg=1)
linear_function = np.poly1d([w, b])

ax.plot(x.unique(), linear_function(x.unique())) # interpolate only unique values linearly
ax.get_figure()

Later we will use several features (_multi_) in a _multivariate regression_, predicting age using several gene abundances.

## Scikit Learn API main principles

This section gives a brief, high-level overview. Skip it if you want a practical example first.

> Géron (2019): 64f. and [scikit-learn-paper](https://arxiv.org/abs/1309.0238)

First some theory and names

> An **application programming interface (API)** is a computing interface which defines interactions between multiple software intermediaries. It defines the kinds of calls or requests that can be made, how to make them, the data formats that should be used, the conventions to follow, etc. It can also provide extension mechanisms so that users can extend existing functionality in various ways and to varying degrees.[1] An API can be entirely custom, specific to a component, or it can be designed based on an industry-standard to ensure interoperability. Through information hiding, APIs enable modular programming, which allows users to use the interface independently of the implementation. ([Wikipedia](https://en.wikipedia.org/wiki/API))
>  
>Loosely defined, API describes everything an application programmer needs to know about piece of code to know how to use it. ([wiki.python.org](https://wiki.python.org/moin/API#:~:text=API%20is%20a%20shortcut%20for,know%20how%20to%20use%20it.))

### Consistency

> Have another look at the introduction to classes in the notebook [2_modules_classes.ipynb](https://colab.research.google.com/github/pythontsunami/teaching/blob/intro/2_modules_classes.ipynb).  
> By convention, capitalized CamelCase names are reserved for classes in Python!  
> `DerivedClass(BaseClass)` describes inheritance: `DerivedClass` inherits everything from `BaseClass`, but can override any of it.

- `Estimators`: Interface for building and fitting models
    - `fit` method returns fitted models
    - supervised: `fit(X_train, y_train)`
    - unsupervised: `fit(X_train)`
    - factory to produce model objects


- `Predictors(Estimator)`: Interface for making predictions
    - `fit`, `predict` and `score`
    - supervised and unsupervised: `predict(X_test)`
    - performance assessment: `score` (the higher, the better)
    - clustering: `fit_predict` exists
    - extends `Estimator`


- `Transformers(Estimator)`: Interface for converting data
    - `fit`, `transform`, and `fit_transform`
    - extends `Estimator`
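
The three roles above are conventions about which methods an object offers, so they can be inspected directly. The following is a minimal sketch; the helper `interface_of` is a hypothetical name introduced only for this illustration, and `KMeans` is used as an example of an object that fills several roles at once.

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler


def interface_of(estimator):
    """List which of the main API methods an object provides (hypothetical helper)."""
    methods = ["fit", "predict", "score", "transform", "fit_transform", "fit_predict"]
    return [m for m in methods if hasattr(estimator, m)]


# no fitting is needed: we only look at which methods exist
interfaces = {
    type(est).__name__: interface_of(est)
    for est in (LinearRegression(), MinMaxScaler(), KMeans(n_clusters=2))
}
for name, methods in interfaces.items():
    print(f"{name:>16}: {methods}")
```

`KMeans` offers both `predict` (the cluster label) and `transform` (the distances to the cluster centers), which may help with the question that follows.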

    
> Transformer which is also a predictor? Where is the difference between transform and predict?

Let's look at our previous example

In [None]:
x

In [None]:
x.to_frame().values.shape

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression() # instance
lin_reg.fit(x.to_frame(), y)  # it can fit, so it is an Estimator
lin_reg.predict(x.to_frame()) # it can predict, so it is also a Predictor

In [None]:
linear_function(x) # numpy polynomial of degree 1

And a short Transformer example

In [None]:
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
min_max_scaler.fit(x.to_frame())
x_scaled = min_max_scaler.transform(x.to_frame())
print("\n".join(f"old: {old:3.2f}, new: {new[0]:3.2f}" for old, new  in zip(x, x_scaled)))

### Composition  
- `Pipeline` objects are built from a sequence of `Transformers` and, optionally, a final `Predictor`
- `FeatureUnion` objects run two or more `Pipeline`s (or single `Transformers`) in parallel, yielding concatenated outputs.
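
As a rough sketch of both composition objects, the example below combines two transformers in parallel and then chains the result with a predictor. The synthetic data and the names `X_demo`, `y_demo`, `union` and `pipe` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))               # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])    # synthetic target

# two transformers applied in parallel; outputs are concatenated column-wise
union = FeatureUnion([
    ("scaled", StandardScaler()),    # contributes 3 columns
    ("pca", PCA(n_components=2)),    # contributes 2 columns
])

# transformers in sequence, followed by a final predictor
pipe = Pipeline([("features", union), ("lin_reg", LinearRegression())])
pipe.fit(X_demo, y_demo)

print(union.fit_transform(X_demo).shape)
print(pipe.predict(X_demo[:3]))
```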

### Inspection
- learned attributes (e.g. `coef_`) have an underscore suffix `_`

In [None]:
lin_reg.intercept_, lin_reg.coef_

In [None]:
b, w # from numpy.polyfit

### Sensible defaults
 - get your first models running quickly
 - sensible defaults for construction of `Estimators`

> Side Note: "A _hyperparameter_ is a parameter of a learning algorithm (not of the model).   
> As such, it is not affected by the learning algorithm itself;   
> it must be set prior to training and remains constant during training." (Géron 2019: 29)  
> Constructor parameters of scikit-learn objects are hyperparameters
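
To make the distinction concrete, here is a small sketch on made-up data (`X_demo` and `y_demo` are assumptions for this example): `get_params()` returns the constructor parameters, while the learned values appear only after `fit` as attributes with a trailing underscore.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)
hyperparameters = model.get_params()   # constructor parameters, set before training
print(hyperparameters)

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0
model.fit(X_demo, y_demo)

# learned attributes: public names ending in an underscore
learned = {
    name: value
    for name, value in vars(model).items()
    if name.endswith("_") and not name.startswith("_")
}
print(learned)

# fitting did not modify the hyperparameters
print(model.get_params() == hyperparameters)
```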

In [None]:
# LinearRegression?

## Website

Let's have a look at the [website](https://scikit-learn.org) and see what it offers.

In [None]:
from IPython.display import IFrame, display

# does not show in colab, just use the link and go to the website
display(IFrame(src="https://scikit-learn.org",
               width=1024, height=1024))

### User Guide

Some parts of the [User Guide](https://scikit-learn.org/stable/user_guide.html) will be discussed.

> The User Guide is an overall reference which can be followed in different orders.

- [Different estimators](https://scikit-learn.org/stable/supervised_learning.html)
- [preprocessing data](https://scikit-learn.org/stable/data_transforms.html): `sklearn.impute`, `sklearn.preprocessing`
- [model selection (incl. metrics)](https://scikit-learn.org/stable/model_selection.html): `sklearn.model_selection`
- [Pipeline](https://scikit-learn.org/stable/data_transforms.html): `sklearn.pipeline`

### Some imports for later

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    StratifiedShuffleSplit,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

> If you are interested: for a short description, use the IPython question mark

In [None]:
# Pipeline?

In [None]:
# import sklearn.base
# sklearn.base??

## Machine Learning Overview

> Science of learning from data.  

In practice, this means that the computer is not explicitly taught how to make each decision.
Explicitly coding the decisions (using conditional statements) is sometimes called a rule-based approach.

1. Supervised
    1. Regression: Continuous variable prediction
        - How old is someone?
        - How much income can someone expect?
    2. Classification: Category prediction
        - Disease, yes or no?
        - disease stage: How serious is it on a scale from 0 to 4?
2. Unsupervised: Finding groups
    - No labels
    - Put samples into undefined categories

### Classification vs Regression

What is the difference?


- check out [logistic function](https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html#sphx-glr-auto-examples-linear-model-plot-logistic-py)

![Scikit-learn example logistic function](https://scikit-learn.org/stable/_images/sphx_glr_plot_logistic_001.png)

### Unsupervised

Check out the [clustering Guide](https://scikit-learn.org/stable/modules/clustering.html)

![Clustering](https://scikit-learn.org/stable/_images/sphx_glr_plot_kmeans_assumptions_001.png)

## Case Study: Age-prediction
> Thanks to [Sam Bradley](https://www.dtu.dk/english/service/phonebook/person?id=145074&cpid=266426&tab=0)
for telling me, and to [Denis Shepelin](https://www.dtu.dk/english/service/phonebook/person?id=126180&tab=2&qt=dtupublicationquery)
for telling him. There I stop the tracking :)

A paper presenting age predictions based on RNA measurements made its data publicly available:
- [paper](https://www.sciencedirect.com/science/article/pii/S1872497317301643)
- [data](https://zenodo.org/record/2545213/#.X43R0dAzb-g)

> For now view the data as a set of features and labels.  
>
> For first predictions you do not need to understand the biology,  
> but to explain _odd_ things, more knowledge is usually helpful

### Feel free to re-implement your own paper of interest 

> If you are interested in a paper which you have the data for, go on and try to adapt the following code.

## Data

In [None]:
url_train_data = "https://zenodo.org/record/2545213/files/train_rows.csv"
url_test_data = "https://zenodo.org/record/2545213/files/test_rows_labels.csv"

# additional data not used for now
url_train_normal = "https://zenodo.org/record/2545213/files/training_data_normal.tsv"
url_test_data_wo_labels = "https://zenodo.org/record/2545213/files/test_rows.csv"

In [None]:
train_data = pd.read_csv(url_train_data, sep="\t")
train_data

In [None]:
test_data = pd.read_table(url_test_data)
test_data

In [None]:
# train_normal = pd.read_csv(url_train_normal, sep='\t')
# train_normal

In [None]:
# test_data_wo_label = pd.read_table(url_test_data_wo_labels) # tab separated data is often in tsv format
# test_data_wo_label

In [None]:
TARGET_COLUMN = "Age"

y_train = train_data[TARGET_COLUMN]
# pop() if you want to modify test_data inplace
y_test = test_data[TARGET_COLUMN]
y_test

In [None]:
test_data

In [None]:
X_test = test_data.drop(TARGET_COLUMN, axis=1)
X_train = train_data.drop(TARGET_COLUMN, axis=1)
X_train

### Is there any missing data?

In [None]:
_df = X_train  # from here it's easy to write a function that displays what you are interested in
n_na = _df.isna().sum().sum()
print(f"Found # NAs: {n_na}")
if n_na:
    row_with_nas = _df.isna().any(axis=1)
    display(_df.loc[row_with_nas])

In [None]:
_ = X_train.hist(figsize=(15, 15), sharex=True, sharey=True)
# _ = X_test.hist(figsize=(15,15), sharex=True, sharey=True)

### Exercise: Familiarizing with the `Age` variable

> skip on the first try, as it's covered later

- Check if the distribution of `Age` is the same in the predefined test and train set (there are several possibilities to do that)

## A first model

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X_train, y_train)

> The factory is replaced by the fitted model. Calling `fit` again first erases the previously fitted parameters, see the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit).  
>
> `fit` always re-fits (and discards the previously fitted weights). It returns the estimator itself (`self`), not a new instance.
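
A quick way to check what `fit` returns and what happens on a second call is sketched below. The toy data `X_demo` is an assumption for this example and is unrelated to the age data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
model = LinearRegression()

fitted = model.fit(X_demo, 2.0 * X_demo.ravel())   # data with slope 2
first_coef = model.coef_.copy()
print(fitted is model)   # fit returns the very same object

model.fit(X_demo, -1.0 * X_demo.ravel())           # refit on data with slope -1
print(first_coef, model.coef_)                     # the old coefficient is gone
```

This is why `lin_reg = lin_reg.fit(...)` and a plain `lin_reg.fit(...)` leave you with the same fitted object.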

In [None]:
y_test_pred = lin_reg.predict(X_test)
y_test_pred[:10]

In [None]:
y_test[:10].values

In [None]:
lin_reg.score(X_train, y_train), lin_reg.score(X_test, y_test),

In [None]:
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(y_true=y_test, y_pred=y_test_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse

### Exercise: Replace the model and see if this improves your results.

1. Select a different [model](https://scikit-learn.org/stable/supervised_learning.html)
2. Copy the first code block of this section below and adapt only that block

## Simple pipeline

Let's add a standardiser.

In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
std_scaler.fit(X_train)
X_train_scaled = std_scaler.transform(X_train)
X_train_scaled[:3]

In [None]:
lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X_train_scaled, y_train)

In [None]:
X_test_transformed = std_scaler.transform(X_test)
y_test_pred = lin_reg.predict(X_test_transformed)
y_test_pred[:10]

> Can you imagine why this is error prone?

In [None]:
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(y_true=y_test, y_pred=y_test_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse # exactly the same as above (affine transformations of the features are irrelevant for linear regression)

> The result shows a property of Linear models :)
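
The property can be checked in isolation. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions for this example) and ordinary least squares with an intercept: standardising rescales the coefficients, but the predictions stay the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 4))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0, 0.0]) + rng.normal(size=100)

# model on the raw features
model_raw = LinearRegression().fit(X_demo, y_demo)
pred_raw = model_raw.predict(X_demo)

# model on the standardised features
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)
model_scaled = LinearRegression().fit(X_scaled, y_demo)
pred_scaled = model_scaled.predict(X_scaled)

print(np.allclose(pred_raw, pred_scaled))      # predictions agree
print(model_raw.coef_, model_scaled.coef_)     # coefficients differ by the feature scale
```

Note that this does not carry over to regularised linear models such as `Ridge` or `Lasso`, where the penalty depends on the scale of the features.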

Now let's build a `Pipeline` to avoid too many intermediate assignments

In [None]:
from sklearn.pipeline import Pipeline
simple_pipeline = Pipeline(
    [("scaler", StandardScaler()), ("lin_reg", LinearRegression())]
)
simple_pipeline = simple_pipeline.fit(X_train, y_train)

Alternative:
```python
from sklearn.pipeline import make_pipeline
simple_pipeline = make_pipeline(StandardScaler(), LinearRegression())
```


In [None]:
y_test_pred = simple_pipeline.predict(X_test)
y_test_pred[:10]

In [None]:
from sklearn.metrics import mean_squared_error

lin_mse = mean_squared_error(y_true=y_test, y_pred=y_test_pred)
lin_rmse = np.sqrt(lin_mse)
lin_rmse


### Exercise: Add an imputation step or feature selector

- If you like, mask some data and add an imputer to the pipeline
- If you have many more features, you could add a feature selector before (the `train_normal` data would have it)

In [None]:
mask_keep = np.random.random(size=X_train.shape) > 0.1
# X_train itself has not changed yet. Assign the result to a new name!
X_train.where(mask_keep)
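
One possible way to approach the imputation part is sketched here on synthetic data, so that it runs without the downloaded files. The names `X_full`, `X_masked` and `imputing_pipeline` as well as the median strategy are assumptions for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_full = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
y_demo = X_full.sum(axis=1)

# mask roughly 10% of the entries; `where` returns a new DataFrame with NaNs
mask_keep = rng.random(size=X_full.shape) > 0.1
X_masked = X_full.where(mask_keep)

# the imputer is fitted on the training data only, as part of the pipeline
imputing_pipeline = make_pipeline(
    SimpleImputer(strategy="median"), StandardScaler(), LinearRegression()
)
imputing_pipeline.fit(X_masked, y_demo)

print(f"masked entries: {X_masked.isna().sum().sum()}")
print(f"R^2 on masked training data: {imputing_pipeline.score(X_masked, y_demo):.3f}")
```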

## Excursus: Combining pipelines

What if we had an additional categorical feature?

```python
# NOTE: DataFrameSelector and AttributesAdder are custom transformers,
# not part of scikit-learn; they need to be defined before this runs.
num_attribs = ['cont_var_1', 'cont_var_2']
cat_attribs = ['cat_var_1']

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', AttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        # scikit-learn >= 1.2 uses `sparse_output`; older versions use `sparse`
        ('cat_encoder', OneHotEncoder(sparse_output=False)),
    ])

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])
```

> Check out the [notebook](https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb) of Ch.2 of Géron 2019 for an extended example using a housing dataset of California, USA
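
The same idea can also be written with `ColumnTransformer`, which selects columns by name and therefore needs no custom selector class. This is a sketch on a made-up toy table; the DataFrame `toy` and its values are assumptions for this example, the column names are taken from the snippet above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "cont_var_1": [1.0, 2.0, None, 4.0],   # contains one missing value
    "cont_var_2": [10.0, 20.0, 30.0, 40.0],
    "cat_var_1": ["a", "b", "a", "c"],
})

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

# each entry: (name, transformer, columns the transformer is applied to)
preprocess = ColumnTransformer([
    ("num", num_pipeline, ["cont_var_1", "cont_var_2"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["cat_var_1"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)   # 2 numeric columns + 3 one-hot columns
```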

## Excursus: CustomTransformer

> scikit-learn is based on duck-typing, although we inherit some additional features for the interface from base classes.

Ref: Tutorial on [Developing Estimators in Scikit-Learn](https://scikit-learn.org/stable/developers/develop.html)
 
> **Duck typing** in computer programming is an application of the duck test—"If it walks like a duck and it quacks like a duck, then it must be a duck"—to determine if an object can be used for a particular purpose. With normal typing, suitability is determined by an object's type. In duck typing, an object's suitability is determined by the presence of certain methods and properties, rather than the type of the object itself. (Wikipedia)

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin


class CustomTransformer(BaseEstimator, TransformerMixin):
    """Don't use this. This is an example."""

    def __init__(self, my_bias=0):  # no *args or **kwargs
        """Add a bias/ intercept"""
        self.my_bias = my_bias

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X):
        return np.c_[X, np.array([self.my_bias] * len(X))]

In [None]:
X = pd.DataFrame(range(4, 14))
X

In [None]:
custom_transformer = CustomTransformer(my_bias=10)
custom_transformer.transform(X)  # returns a numpy.array

> Scikit-learn uses the underlying numpy.arrays of a DataFrame.

### Exercise: Custom `Transformer`

Create a custom Transformer that adds the square $x^2$ of each feature to the training data.

##### !!! Don't use this
To add interaction effects, please use [`sklearn.preprocessing.PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html).

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_moment=None):  # no *args or **kwargs
        self.add_moment = add_moment

    def fit(self, X, y=None):
        return self  # nothing to fit

    def transform(self, X):
        # your code here
        return X


attr_adder = CombinedAttributesAdder(add_moment=True)
extended_data = attr_adder.transform(X_train.values)

> Can you think of better transformations?

## Combine training and test data

We generate our own training set which we will use for model selection, and a test set which will be used for evaluation.

In [None]:
data = pd.concat([train_data, test_data])
old_index = pd.Series(data.index)
data.index = old_index.index
data

In [None]:
X = data.drop(TARGET_COLUMN, axis=1)
y = data[TARGET_COLUMN]

## Model Selection

Agenda:
1. On the combined data set, split the data into a balanced train and test data set of 80/20 (i.e. 80% of the data goes into the training data set). 
2. Perform cross-validation
3. Perform model-selection 

#### Hints
- [model-selection tutorial](https://scikit-learn.org/stable/model_selection.html)

> The aim is to get you started reading the documentation and understand the function signatures  
> while you are able to ask as many questions as you like:)

## Combined data set

In [None]:
from sklearn.model_selection import train_test_split

X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
    X, y, test_size=0.2, random_state=42  # 80/20 split, as stated in the agenda
)

### Exercise: Stratification

- Can you stratify the data?
- Check if the distribution of `Age` is the same (there are several possibilities to do that)
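
One possible approach, sketched on synthetic ages so that it runs on its own: a continuous target cannot be stratified directly, but it can be binned first and the bins passed to `stratify`. The bin edges and the names `age`, `features` and `age_bins` are assumptions for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
age = pd.Series(rng.integers(18, 90, size=500), name="Age")   # stand-in for y
features = pd.DataFrame(rng.normal(size=(500, 3)))            # stand-in for X

# bin the continuous target; labels=False yields integer bin codes
age_bins = pd.cut(age, bins=[0, 30, 45, 60, 75, 120], labels=False)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, age, test_size=0.2, stratify=age_bins, random_state=42
)

# compare the share of each age bin in both splits
comparison = pd.DataFrame({
    "train": age_bins.loc[y_tr.index].value_counts(normalize=True),
    "test": age_bins.loc[y_te.index].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

Histograms of `y_tr` and `y_te`, or `describe()` on both, are other simple ways to compare the distributions.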

## Cross-Validation

- meta-estimators `GridSearchCV` and `RandomizedSearchCV`
- `best_estimator_` attribute

- [Diabetes example](https://scikit-learn.org/stable/auto_examples/exercises/plot_cv_diabetes.html#sphx-glr-auto-examples-exercises-plot-cv-diabetes-py)

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_validate

dict_scores = cross_validate(
    lin_reg,
    X_train_new,
    y=y_train_new,
    groups=y_train_new,
    cv=StratifiedKFold(5),
    scoring=None,
)
dict_scores

In [None]:
pd.DataFrame(
    dict_scores
)  # you can create nice tables if you work on the DataFrame further

What does `test_score` correspond to?

### Exercise

- Replace `scoring=None` by other metrics by reading the documentation.
- Extend this to several estimators and record the results

> Try to google the metrics. A possible solution is this [link](https://scikit-learn.org/stable/modules/model_evaluation.html)  
> Make sure that "greater is better" for a score (However: You will be reminded if you forget:).

In [None]:
# scoring = ['metric1', 'metric2'] # replace strings
# scoring = {'key': metric_fct}    # set key and metric_fct
# dict_scores = cross_validate(lin_reg, X, y=y, groups=y, cv=StratifiedKFold(5), scoring=scoring)
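
As a hedged illustration of the list form of `scoring`, the sketch below evaluates three regression metrics at once on synthetic data from `make_regression`. The choice of metrics and the plain `KFold` splitter are assumptions for this example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# error metrics are negated so that "greater is better" holds for all scorers
scoring = ["r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"]

cv_results = cross_validate(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring=scoring,
)

cv_table = pd.DataFrame(cv_results)   # one row per fold, one test_<metric> column per scorer
print(cv_table.filter(like="test_").mean())
```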

## Fine tuning models

`GridSearchCV` and `RandomizedSearchCV` on model hyperparameters.

> Side Note: "A _hyperparameter_ is a parameter of a learning algorithm (not of the model).   
> As such, it is not affected by the learning algorithm itself;   
> it must be set prior to training and remains constant during training." (Géron 2019: 29)  
> Constructor parameters of scikit-learn objects are hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

# GridSearchCV?

In [None]:
rf_reg = RandomForestRegressor(random_state=42)
param_grid = {"n_estimators": [3, 10, 20], "max_features": [2, 6, 8, 10, 13]}

grid_search = GridSearchCV(
    estimator=rf_reg,
    param_grid=param_grid,
    cv=5,
    n_jobs=3,
    scoring=None,
    return_train_score=True,
    verbose=1,
)
grid_search.fit(X=X_train_new, y=y_train_new)

In [None]:
grid_search.best_params_

In [None]:
param_grid = [
    param_grid,
    # then try a different set
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [10, 13]},
]

grid_search = GridSearchCV(
    estimator=rf_reg,
    param_grid=param_grid,
    cv=5,
    n_jobs=3,
    scoring=None,
    return_train_score=True,
    verbose=1,
)
grid_search.fit(X=X_train_new, y=y_train_new)

In [None]:
grid_search.best_params_

In [None]:
grid_search = GridSearchCV(
    estimator=rf_reg,
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    n_jobs=4,
    return_train_score=True,
)
grid_search.fit(X=X_train_new, y=y_train_new)

In [None]:
grid_search.best_estimator_

### Exercise 

- Use `RandomizedSearchCV` 
- Try a different model if you know one
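
A possible starting point for the randomized search, sketched on synthetic data so that it runs without the age data set. The distributions, `n_iter=8` and `cv=3` are arbitrary assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_regression(
    n_samples=150, n_features=13, noise=5.0, random_state=42
)

# distributions to sample from instead of a fixed grid
param_distributions = {
    "n_estimators": randint(3, 30),   # integers 3..29
    "max_features": randint(2, 14),   # integers 2..13
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,                         # number of sampled parameter combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_estimator_.predict(X_demo[:3]))
```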

### Exercise: Final model

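With the default `refit=True`, the grid or randomized search refits the best parameter combination on all data passed to `fit` and exposes the result as `best_estimator_`.

If you set `refit=False`, or want to train on a different data set (e.g. the combined data), you need to retrain the final estimator yourself.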

In [None]:
# ToDo

## Model persistence

To save the model, you can use this [tutorial](https://scikit-learn.org/stable/modules/model_persistence.html).

A model saved to disk is self-contained: you do not need the original code that built the instance to reload the model in the state it had when it was saved. You do need a compatible scikit-learn version installed when loading it.

Models can be deployed, e.g. to Google Cloud (and probably every other one, but I have not yet tried that).

In [None]:
from joblib import dump, load

lin_reg.fit(X_train, y_train)  # the initial data splits
dump(lin_reg, "lin_reg_model.joblib")

clf = load("lin_reg_model.joblib")
clf

## Scikit-learn API wrappers

If you would like to use a model from another library in your scikit-learn workflow, you can use
predefined wrappers, e.g. for deep learning:

- [`tf.keras.wrappers.scikit_learn`](https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn) for TensorFlow with the Keras API
- [skorch](https://github.com/skorch-dev/skorch) for PyTorch ([PyData Berlin 2019](https://www.youtube.com/watch?v=Qbu_DCBjVEk))

> What would it take to wrap any model?
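
As one possible answer, the sketch below wraps `numpy.polyfit` from the motivating example so that it can be used with `cross_val_score`. The class `PolyfitRegressor` is hypothetical and exists only for this illustration; it uses the first column of `X` only, and the data is synthetic.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score


class PolyfitRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical wrapper exposing numpy.polyfit through the scikit-learn API."""

    def __init__(self, deg=1):
        # hyperparameters only, stored unchanged so that get_params/clone work
        self.deg = deg

    def fit(self, X, y):
        X = np.asarray(X)
        # learned attribute, hence the trailing underscore
        self.coefs_ = np.polyfit(X[:, 0], y, deg=self.deg)
        return self

    def predict(self, X):
        return np.polyval(self.coefs_, np.asarray(X)[:, 0])


rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(60, 1))
y_demo = 10 * X_demo[:, 0] + 15 + rng.normal(size=60)

# `score` (R^2) is inherited from RegressorMixin; cloning relies on BaseEstimator
scores = cross_val_score(PolyfitRegressor(deg=1), X_demo, y_demo, cv=5)
print(scores)
```

So the essentials are a constructor that only stores hyperparameters, a `fit` that returns `self`, and `predict` (or `transform`); the base classes add `get_params`, `set_params` and `score`.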

## Classification


### Exercise: Image Classification 

Run [image-classification example](https://github.com/scikit-learn/scikit-learn/tree/master/examples/classification) and exchange the classifier.

### Heart Disease Categories (Multiclass)
Datasets (only the Swiss one is loaded here)
- [Heart-Disease data](https://archive.ics.uci.edu/ml/datasets/heart+disease)

In [None]:
import pandas as pd
heart_disease_swiss = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data",
                                  index_col=False, sep=",", names=[i for i in range(14)], na_values=["?"])
heart_disease_swiss.columns = ["age", "sex", "cp", "trestbps", "chol", "fbs",
                               "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]
heart_disease_swiss

### Exercise: All data loaded. 

- Column labels are still not set

In [None]:
import requests
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/switzerland.data'
r = requests.get(url, allow_redirects=True)

In [None]:
raw = r.content.decode("utf-8").splitlines()
raw[:10]

In [None]:
data = []
row = []
i = 0
for line in raw:
    line = line.split()
    row.extend(line)
    if line[-1] == "name":
        i += 1
        data.append(row[:-1])
        row = []
print(data[0])

In [None]:
df = pd.DataFrame(data)
df = df.replace('-9.', '-9').convert_dtypes()
df

### Exercise: automated stratified Cross-Validation 
Goals:
- Understand documentation of [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) function
- apply stratified KFold data splitting for imbalanced data


In [None]:
OLDER_THAN = 60
print(
    f"Binary (Dummy) Variable assigning 1 if someone is older than {OLDER_THAN} years old."
)
y_binary = (y > OLDER_THAN).astype(int)
y_binary.value_counts()

This will result in an imbalanced classification problem, where the aim is to predict if someone is older than `OLDER_THAN`.

> Is Stratified Splitting the default for [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate)?

- try to test your assumptions
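
One way to test the assumption without fitting a model is to ask scikit-learn which splitter an integer `cv` resolves to. The sketch below uses `check_cv` from `sklearn.model_selection` on made-up imbalanced labels (`y_demo_binary` is an assumption for this example).

```python
import numpy as np
from sklearn.model_selection import check_cv

y_demo_binary = np.array([0] * 80 + [1] * 20)   # imbalanced synthetic labels
X_dummy = np.zeros((len(y_demo_binary), 1))     # features are irrelevant for splitting

# resolve cv=5 once for a classifier and once for any other estimator
cv_classifier = check_cv(5, y_demo_binary, classifier=True)
cv_regressor = check_cv(5, y_demo_binary, classifier=False)
print(type(cv_classifier).__name__, type(cv_regressor).__name__)

# count the positive samples in each test fold of the classifier splitter
positives_per_fold = [
    int(y_demo_binary[test_idx].sum())
    for _, test_idx in cv_classifier.split(X_dummy, y_demo_binary)
]
print(positives_per_fold)
```

For an integer `cv`, stratification is therefore only applied when the estimator is a classifier and the target is binary or multiclass; in all other cases a plain `KFold` is used.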