# Task: Extend Kedro Pipeline

In this exercise, you will get more familiar with Kedro by extending the workflow pipeline shown in the introduction. Run the introduction notebook before starting this exercise.

Let's first change the working directory to the existing project.

In [None]:
import os
os.chdir("/workshop/kedro_intro/workflow-tutorial")

## Subtask I: Add additional node to pipeline

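After training the model, it should be evaluated. Create a new Kedro `node` that takes the trained model, the features `x_test`, and the target `y_test` as inputs.

The output should be `evaluation_metric`: a JSON-serializable dictionary containing several metrics.

The following function can be used. Because the pipeline imports it with `from .nodes import evaluate_model`, it has to live in the `nodes` module next to `pipeline.py`.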

In [None]:
import logging

import mlflow
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def evaluate_model(pipe: Pipeline, x_test: np.ndarray, y_test: np.ndarray):
    """Calculate RMSE, MAE and the coefficient of determination (R^2) and log the results.

        Args:
            pipe: Trained model.
            x_test: Testing data of independent features.
            y_test: Target.
        Returns:
            Dictionary with the scores (JSON-serializable).

    """
    y_pred = pipe.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mae", mae)
    mlflow.log_metric("r2", r2)

    logger = logging.getLogger(__name__)
    logger.info("Model has a coefficient R^2 of %.3f.", r2)

    return {"train": {"rmse": float(rmse),
                      "mae": float(mae),
                      "r2": float(r2)}}
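
To make the three scores easier to check by hand, here is a minimal sketch that recomputes them in plain Python. The helper name `regression_metrics` and the sample values are made up for illustration; the notebook itself relies on the scikit-learn functions shown above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                          # assumes y_true is not constant
    return {"rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Made-up quality scores, for illustration only.
scores = regression_metrics([5, 6, 7, 5], [5, 6, 6, 6])
print(scores)
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.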

### Extend existing pipeline

In [None]:
%%writefile src/workflow_tutorial/pipelines/pipeline.py

from kedro.pipeline import Pipeline, node

from .nodes import evaluate_model, split_data, train_model


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["wines-red", "parameters"],
                outputs=["x_train", "x_test", "y_train", "y_test"],
                name="splitting_data",
            ),
            node(
                func=train_model,
                inputs=["x_train", "y_train", "parameters"],
                outputs="model",
                name="training_model",
            ),
            node(
                func=evaluate_model,
                inputs=["model", "x_test", "y_test"],
                outputs="evaluation_metric",
                name="evaluating_model",
            ),
        ]
    )
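
The nodes above are connected only through dataset names: the output name of one node is the input name of another. The following is a rough sketch of that idea in plain Python, not Kedro's implementation. The toy runner `run_nodes` and the three stand-in functions are hypothetical, and unlike Kedro (which derives the execution order from the dependencies) the sketch simply runs the nodes in list order.

```python
def run_nodes(nodes, datasets):
    """Run (func, inputs, outputs) triples in list order, passing data by name."""
    for func, inputs, outputs in nodes:
        results = func(*[datasets[name] for name in inputs])
        if isinstance(outputs, str):
            datasets[outputs] = results             # single named output
        else:
            datasets.update(zip(outputs, results))  # several named outputs
    return datasets


def split(values):       # hypothetical stand-in for split_data
    return values[:-1], values[-1:]


def mean(train):         # hypothetical stand-in for train_model
    return sum(train) / len(train)


def error(model, test):  # hypothetical stand-in for evaluate_model
    return {"abs_error": abs(test[0] - model)}


nodes = [
    (split, ["raw"], ["train", "test"]),
    (mean, ["train"], "model"),
    (error, ["model", "test"], "metric"),
]
result = run_nodes(nodes, {"raw": [4, 6, 8]})
print(result["metric"])
```

Renaming an output without renaming the matching input breaks the chain, which is why the names in `inputs` and `outputs` must agree exactly.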

### Test and visualize pipeline

In [None]:
!kedro run

In [None]:
!kedro viz --host=0.0.0.0 --no-browser

## Subtask II: Add second pipeline

### Set up the data
In the introduction, we built a pipeline that predicts the quality of **red** wine.
Let's now build a second pipeline that predicts the quality of **white** wine.

Download the [Wine Quality Data Set](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv) for white wines and add the data to the corresponding directory!

In [None]:
!wget -O data/01_raw/winequality-white.csv http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

### Register the datasets
Register the dataset in the catalog!

In [None]:
%%writefile conf/base/catalog.yml

wines-red:
  type: pandas.CSVDataSet
  filepath: data/01_raw/winequality-red.csv
  load_args:
    sep: ';'

wines-white:
  type: pandas.CSVDataSet
  filepath: data/01_raw/winequality-white.csv
  load_args:
    sep: ';'
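
The `load_args` entry matters because the wine files use `;` rather than `,` as the column separator, and the catalog forwards these arguments to the pandas CSV reader. As a small illustration using only the standard library, the sketch below parses a made-up two-row sample in a similar semicolon-separated layout, once with the right delimiter and once with the default comma.

```python
import csv
import io

# Made-up sample rows, for illustration only.
sample = "fixed acidity;volatile acidity;quality\n7.0;0.27;6\n6.3;0.30;6\n"

# Correct: split on ';' so that each column gets its own key.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))

# Wrong: the default ',' leaves every line as one single column.
wrong = list(csv.DictReader(io.StringIO(sample)))

print(list(rows[0].keys()))
print(len(wrong[0]))
```

If a loaded data frame shows a single oddly named column, a missing or wrong `sep` is a likely cause.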

Let's have a look at the data.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

from pathlib import Path
from kedro.framework.context import load_context

context = load_context(Path.cwd())
df = context.catalog.load("wines-white")

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("data/08_reporting/wines-white.html")

### Create the pipeline
Create the new pipeline in `src/workflow_tutorial/pipelines/` and register it in `src/workflow_tutorial/hooks.py`.

In [None]:
%%writefile src/workflow_tutorial/pipelines/pipeline.py

from kedro.pipeline import Pipeline, node

from .nodes import evaluate_model, split_data, train_model


def create_red_wine_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["wines-red", "parameters"],
                outputs=["x_train_red", "x_test_red", "y_train_red", "y_test_red"],
                name="splitting_red_wine_data",
            ),
            node(
                func=train_model,
                inputs=["x_train_red", "y_train_red", "parameters"],
                outputs="model_red",
                name="training_red_wine_model",
            ),
            node(
                func=evaluate_model,
                inputs=["model_red", "x_test_red", "y_test_red"],
                outputs="evaluation_metrics_red",
                name="evaluating_red_wine_model",
            ),
        ]
    )

def create_white_wine_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                func=split_data,
                inputs=["wines-white", "parameters"],
                outputs=["x_train_white", "x_test_white", "y_train_white", "y_test_white"],
                name="splitting_white_wine_data",
            ),
            node(
                func=train_model,
                inputs=["x_train_white", "y_train_white", "parameters"],
                outputs="model_white",
                name="training_white_wine_model",
            ),
            node(
                func=evaluate_model,
                inputs=["model_white", "x_test_white", "y_test_white"],
                outputs="evaluation_metrics_white",
                name="evaluating_white_wine_model",
            ),
        ]
    )

### Register the pipeline

Note that `register_pipelines` returns a `Dict[str, Pipeline]`, so you can register one pipeline per type of wine under its own name.

The default pipeline usually comprises all registered pipelines. Since Kedro pipelines can be combined with `+`, you can simply use `red_wine_pipeline + white_wine_pipeline`.
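
As a rough mental model of this registry, the sketch below uses plain lists of node names in place of real `Pipeline` objects. The helper `select` is hypothetical and only mimics how a name passed via `--pipeline` picks an entry, with `__default__` used when no name is given.

```python
# Lists of node names stand in for Pipeline objects (illustration only).
red = [
    "splitting_red_wine_data",
    "training_red_wine_model",
    "evaluating_red_wine_model",
]
white = [
    "splitting_white_wine_data",
    "training_white_wine_model",
    "evaluating_white_wine_model",
]

registry = {
    "red": red,
    "white": white,
    "__default__": red + white,  # the combined pipeline holds the nodes of both
}


def select(registry, name=None):
    """Return the named entry, falling back to the default one."""
    return registry[name or "__default__"]


print(len(select(registry)))         # all nodes
print(len(select(registry, "red")))  # red wine nodes only
```

The `_red` and `_white` suffixes on node and dataset names keep the two pipelines from clashing once they are combined.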

In [None]:
%%writefile src/workflow_tutorial/hooks.py

"""Project hooks."""
from typing import Any, Dict, Iterable, Optional

from kedro.config import ConfigLoader
from kedro.framework.hooks import hook_impl
from kedro.io import DataCatalog
from kedro.pipeline import Pipeline
from kedro.versioning import Journal

from workflow_tutorial.pipelines import pipeline

class ProjectHooks:
    @hook_impl
    def register_pipelines(self) -> Dict[str, Pipeline]:
        """Register the project's pipeline.

        Returns:
            A mapping from a pipeline name to a ``Pipeline`` object.

        """
        red_wine_pipeline = pipeline.create_red_wine_pipeline()
        white_wine_pipeline = pipeline.create_white_wine_pipeline()
        
        return {
            "red": red_wine_pipeline,
            "white": white_wine_pipeline,
            "__default__": red_wine_pipeline + white_wine_pipeline,
        }

    @hook_impl
    def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
        return ConfigLoader(conf_paths)

    @hook_impl
    def register_catalog(
        self,
        catalog: Optional[Dict[str, Dict[str, Any]]],
        credentials: Dict[str, Dict[str, Any]],
        load_versions: Dict[str, str],
        save_version: str,
        journal: Journal,
    ) -> DataCatalog:
        return DataCatalog.from_config(
            catalog, credentials, load_versions, save_version, journal
        )


project_hooks = ProjectHooks()



### Set Parameters

Set the parameters.

In [None]:
%%writefile conf/base/parameters.yml
test_size: 0.25
random_state: 42

alpha: 0.5
l1_ratio: 0.5
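
Kedro hands the contents of `parameters.yml` to every node that lists `"parameters"` among its inputs, as a plain dictionary. The real `split_data` is defined in the introduction notebook and is not reproduced here; the sketch below uses a hypothetical stand-in, `split_rows`, only to show how `test_size` and `random_state` might be read from that dictionary.

```python
import random


def split_rows(rows, parameters):
    """Hypothetical stand-in for split_data: shuffle reproducibly, then split."""
    rng = random.Random(parameters["random_state"])  # fixed seed, same split every run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * parameters["test_size"])
    return shuffled[n_test:], shuffled[:n_test]      # (train, test)


# The two values used by the split, as set in parameters.yml.
parameters = {"test_size": 0.25, "random_state": 42}

train, test = split_rows(range(8), parameters)
print(len(train), len(test))
```

Keeping such values in `parameters.yml` instead of hard-coding them means a new experiment only needs a configuration change, not a code change.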

### Run the pipeline
You can either run the full (default) project pipeline or a pipeline specified with the `--pipeline` option.

In [None]:
#!kedro run --pipeline=red

In [None]:
!kedro run -p

### Kedro Visualization

In [None]:
!kedro viz --host=0.0.0.0 --no-browser

## Optional Subtask: Add data version control
Add git and data version control (DVC - already installed) to the project!

In [None]:
!git init
!dvc init

Add a local directory as DVC remote storage.

In [None]:
!dvc remote add -d -f local_storage /tmp/kedro

In [None]:
# for now let's not track the mlflow runs..
!echo $'\nmlruns/**' >> .gitignore
!git add .

In [None]:
# If necessary, configure your git..
#!git config --global user.email "you@example.com"
#!git config --global user.name "Your Name"
!git commit -m "initial commit"

You could add data directly or as dependencies in a DVC pipeline.

In [None]:
#!dvc add data/01_raw/winequality-red.csv
#!dvc add data/01_raw/winequality-white.csv

We can add the model pickle and the metrics file to the catalog so that they are persisted to disk instead of only being held in memory as a Kedro `MemoryDataSet`.

In [None]:
%%writefile conf/base/catalog.yml

wines-red:
  type: pandas.CSVDataSet
  filepath: data/01_raw/winequality-red.csv
  load_args:
    sep: ';'

wines-white:
  type: pandas.CSVDataSet
  filepath: data/01_raw/winequality-white.csv
  load_args:
    sep: ';'

model_red:
  type: pickle.PickleDataSet
  filepath: data/06_models/model_red.pickle

model_white:
  type: pickle.PickleDataSet
  filepath: data/06_models/model_white.pickle

evaluation_metrics_red:
  type: yaml.YAMLDataSet
  filepath: data/08_reporting/scores_red.yaml
    
evaluation_metrics_white:
  type: yaml.YAMLDataSet
  filepath: data/08_reporting/scores_white.yaml

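Create one DVC pipeline stage each for red and white wine. Each stage declares its name (`-n`), the tracked parameters (`-p`), its dependencies (`-d`), the metrics file (`-m`), and its output (`-o`), followed by the command to run.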

In [None]:
%%bash
dvc run -n kedro_red \
        -p conf/base/parameters.yml:test_size,random_state,alpha,l1_ratio \
        -d data/01_raw/winequality-red.csv \
        -d src/workflow_tutorial/pipelines \
        -m data/08_reporting/scores_red.yaml \
        -o data/06_models/model_red.pickle \
        'kedro run --pipeline=red'

In [None]:
%%bash
dvc run -n kedro_white \
        -p conf/base/parameters.yml:test_size,random_state,alpha,l1_ratio \
        -d data/01_raw/winequality-white.csv \
        -d src/workflow_tutorial/pipelines \
        -m data/08_reporting/scores_white.yaml \
        -o data/06_models/model_white.pickle \
        'kedro run --pipeline=white'

Commit your changes and update the DVC remote storage.

In [None]:
!git add .
!git commit -m "add dvc pipelines"

In [None]:
!dvc status -c

In [None]:
!dvc push

Everything should now be up to date.

In [None]:
!git status
!dvc status -c

## Optional Subtask: Create Airflow DAG from Kedro pipeline

### Install more project dependencies

We want to use the *kedro-airflow* plugin. Please install this new project dependency using `kedro install`.
Note that to further update the project requirements, you should modify `src/requirements.in` (not `src/requirements.txt`).

In [None]:
#%%writefile src/requirements.in
#kedro-airflow

In [None]:
#!kedro install --build-reqs

In [None]:
#!kedro airflow create

In [None]:
#!kedro airflow deploy