# General considerations

## How is a data project structured?

### The steps _inside_ a data project
<img src="http://drive.google.com/uc?export=view&id=16iaQgS0aCURH0FHJs-rPwI72n7Eq3nY_" width=75%>

### But there is a larger context!

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/800px-CRISP-DM_Process_Diagram.png" width=45%>

[Cross-industry standard process for data mining](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining)

**STARTS from business relevance!!!**

Actionability and usability are key!

### More detailed process

<img src="http://drive.google.com/uc?export=view&id=1xSE3jvRsMwHnr3dpCv8ZP3DwMP2SzeA4" width=75%>

### Who works on it?

<img src="http://drive.google.com/uc?export=view&id=1MOUFo8VjcA4E1JvsxbC6LbXHnWeLCrP2" width=75%>

### What is the "end" of it?

Someone, whether the data scientist or (more often) a service developer, has to "deliver" the results of the data project.

## In what form can a data project be delivered?

### One time insight

We have a vested interest in an answer in the context of a _current_ business decision, which may be a one-time event:

Example: Based on past data, what will be the return on investment of this merger?

(Should we buy this company now, or never?)

In this case the circumstances will most likely never repeat, and since M&A is not a daily activity for most companies, the deliverable of the data science activity is a one-time analysis / prediction, which is consumed to make a single decision and maybe never used again.

### Continuous decision support tool

Most often, though, the insights / predictions derived from data science activity are consumed continuously. They can (and should) become tools for continuous decision support.

Below are some typical forms of "consumption" for such data products:

#### Regular report

Regularly, e.g. every quarter, management needs a report that summarizes, elaborates on and/or predicts certain processes.

The most frequent delivery mechanism for these kinds of data products is the generation of a static report.
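To make the idea of a generated static report a bit more tangible, here is a minimal sketch using only the Python standard library. The function name `render_report` and the revenue figures are made up for illustration; a real report would pull its numbers from the data project and would most likely be rendered to PDF or HTML by a dedicated tool.

```python
from statistics import mean


def render_report(title, quarterly_revenue):
    """Render a Markdown summary for a dict of {quarter: revenue}."""
    quarters = list(quarterly_revenue)
    values = [quarterly_revenue[q] for q in quarters]
    # Relative change between the first and the last quarter, in percent.
    growth = (values[-1] - values[0]) / values[0] * 100

    lines = [f"# {title}", "", "| Quarter | Revenue |", "|---|---|"]
    lines += [f"| {q} | {v:,.0f} |" for q, v in zip(quarters, values)]
    lines += [
        "",
        f"Average revenue: {mean(values):,.0f}",
        f"Change from {quarters[0]} to {quarters[-1]}: {growth:+.1f}%",
    ]
    return "\n".join(lines)


# Hypothetical figures, for illustration only.
report = render_report(
    "Quarterly summary",
    {"Q1": 100_000, "Q2": 110_000, "Q3": 121_000},
)
print(report)
```

Because the report is just a function of the data, it can be regenerated on a schedule without any manual editing.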

<img src="https://cdn.pixabay.com/photo/2018/09/07/14/42/download-pdf-3660827_960_720.png" width=15%>



#### Dashboard

Very often, users also need to access data products interactively, so that they can gain insights, "dig deeper", and base their decisions on what they find.

The most frequent solution in this case is some kind of dedicated data dashboard.

<img src="https://i1.wp.com/artofwifi.net/wp-content/uploads/2018/10/Monthly_report_in_Acrobat.png?fit=1027%2C1022&ssl=1" width=30%>

#### Integrated prediction

It can also happen that users would like to consume the predictions of data science models inside their normal workflow, e.g. as a "pre-populated field" in the ERP system of the company.

This approach requires heavier integration with the given system.

### Automation solutions

The distinguishing feature of such approaches is that the output of the data solution includes not just predictions but also some kind of action, as in control problems.

#### Delivering automated predictions "en masse"

One of the most widespread solutions of this kind is the recommendation engine. It does not just predict that someone who liked this item will also like that one: a potentially large mass of customers is actively targeted with product offers (e.g. in a webshop).

Arguably, the boundary between prediction and the "action" (recommending it) is getting quite blurry here.
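As a toy illustration of "who liked this item will also like that one", the sketch below counts how often items were liked together and turns the counts into offers for every customer at once. It is a deliberately naive co-occurrence approach, not how production recommendation engines work, and the customers, items and function names are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_occurrence(likes):
    """Count how often each ordered pair of items was liked by the same customer."""
    co = defaultdict(Counter)
    for items in likes.values():
        for a, b in permutations(set(items), 2):
            co[a][b] += 1
    return co


def recommend(customer, likes, co, k=2):
    """Return up to k items the customer has not liked yet, ranked by co-occurrence."""
    seen = set(likes[customer])
    scores = Counter()
    for item in seen:
        for other, count in co[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


# Hypothetical customers and items.
likes = {
    "ann": ["book", "lamp"],
    "bob": ["book", "lamp", "desk"],
    "cid": ["book", "desk"],
    "dee": ["lamp"],
}
co = build_co_occurrence(likes)
# The "action": an offer list is produced for every customer, en masse.
offers = {customer: recommend(customer, likes, co) for customer in likes}
print(offers)
```

The last step is where prediction turns into action: the scores are not inspected by an analyst, they directly decide what each customer is shown.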

#### Executing controlling actions

An even more clear-cut and extreme case is when software "agents" produced during the machine learning project are operated in live environments to take actions and control other systems.

Some good examples: Industrial control agents, self-driving cars, ...

The latter two approaches highlight the essential need to form **reliable and scalable services** out of our data science models, so that they can operate robustly in external environments (be it a webshop or a road). 

# Tools for delivering results

## Solutions for one time insights

One-time insights are typically consumed in the form of presentations and PDFs.

Luckily, with some clever formatting, Jupyter based data science environments can generate such PDFs. In fact, that was one of their original design aims.

Tools like [nbconvert](https://github.com/jupyter/nbconvert) are ideal for this purpose.


We can go even further with Jupyter based presentations.

<img src="https://rise.readthedocs.io/en/stable/_images/basic_usage.gif" width=75%>

Fun: is it healthy to show an animation about rise in Jupyter, via [Rise](https://rise.readthedocs.io/en/stable/)? :-P

In fact, complete online books can be produced with [JupyterBook](https://jupyterbook.org/intro.html).

## Dashboarding

One of the most widespread solutions for delivering data products is to create a dashboard, where the users can interact with the analysis and predictions created during the data project.

The number of dashboard creation solutions is _enormous_. Full-fledged business analytics solutions exist, the discussion of which is out of scope for the current lecture.

We restrict ourselves to pointing out that the Python / Jupyter ecosystem also has some nice solutions for developing interactive applications.

Two examples:

### [Streamlit](https://github.com/streamlit/streamlit)

<img src="http://drive.google.com/uc?export=view&id=1hf-YMLJN194dUY8ipbzqFfafPQiLVz2l"
 width=55%>

### [Voilà](https://github.com/voila-dashboards/voila)

<img src="http://drive.google.com/uc?export=view&id=1TjCQ_GT58DQGDuj889wU4wSkyua9Nmil"
width=55%>

## Prediction service

The most scalable way to make a data product available for consumption is to turn it into a **prediction service**, which typically interacts with other software services: it consumes inputs and, on request, produces predictions as outputs.

These outputs can then be consumed by other software services for later use, display, or decisions.
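The request / response pattern can be sketched with nothing but the Python standard library. Everything specific here is an assumption made for illustration: the "model" is a hard-coded linear function standing in for a trained one, and the `/predict` path, the `features` key and the port are arbitrary choices. A real service would normally use a web framework rather than `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a trained model: hypothetical weights of a linear model.
WEIGHTS = [0.5, -1.0]
BIAS = 2.0


def predict(features):
    """Return the linear model's prediction for one feature vector."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_request(body):
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        payload = json.loads(body)
        result = predict(payload["features"])
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key or a bad feature vector all end up here.
        return 400, json.dumps({"error": str(exc)})
    return 200, json.dumps({"prediction": result})


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response.encode("utf-8"))


def serve(port=8000):
    """Block and answer prediction requests until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()


# The core logic can be exercised without starting the server.
status, response = handle_request(b'{"features": [2.0, 1.0]}')
print(status, response)
```

Calling `serve()` would start the server; any other service could then POST a JSON body to `/predict` and read the prediction from the response, regardless of the language it is written in.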

### Advantage of service based solutions

The main advantage of a service based approach is that a well defined service encapsulates functionality and **ensures independence** (e.g. of programming language or environment) from all other services. The separate service thus **can run on dedicated infrastructure**, which is crucial in case of high computational demand or special hardware requirements, like GPUs.

For ensuring scalability, this is a must.

### General structure of a service

The concept of software services is heavily influenced by the dominant practices of the Web. 

<img src="https://blogs.mulesoft.com/wp-content/uploads/apis-versus-web-services-1.png" width=55%>

 

### Sidenote: Microservice architecture

It is worth mentioning that the very notion of "corporate software" is undergoing a shift: the advent of "web scale" companies made it necessary to develop new approaches to software craftsmanship, which changed the view and best practices of software development.

The "traditional" approach envisioned software as a complex "monolith" structure that encompasses all the functionality needed.

<img src="http://drive.google.com/uc?export=view&id=1a4PpNrnWURjxehvgB9_odInHpnJZD_6P" width=15%>

The current view has shifted towards understanding a software environment as a web of interconnected software services.

<img src="http://drive.google.com/uc?export=view&id=1yo-HNwnBWfKAzBU8G8xWChdAsYK78Q4Y" width=55%>

This allows for the decoupled, quasi-independent development and scaling of the separate services, and even has implications for organizational structure (see e.g. [the Spotify model for scaling agile](https://medium.com/serious-scrum/spotify-engineering-culture-spotify-model-an-introduction-500837f04010)) and for infrastructure operations (see [containers](https://medium.com/@lakwarus/containers-a-paradigm-shift-614ee3b88372)).

<img src="http://drive.google.com/uc?export=view&id=1nCPUaBNIPQXP5XmgYIrW8WbZJJJzeNQm" width=65%>

[Source](https://microservices.io/patterns/decomposition/service-per-team.html)

### Anatomy of a webservice

<img src="http://drive.google.com/uc?export=view&id=1xvyRJzdhuBKwtmA2mWZNM79zgpOxFsfv" width=45%>

[(source)](https://web.archive.org/web/20180817201622/http://ericmacdougall.com/microservices-02-anatomy-of-a-microservice/)

This looks quite scary, but let's focus only on the "logic" part!

### How to create a service from a model?

<img src="http://drive.google.com/uc?export=view&id=1zG_ZkATkil6XoW7fe5pUsv31Xkvzjfyb" width=80%>

The above example used [Flask](https://github.com/pallets/flask) as an easy and flexible webserver tool. Luckily, a nice stack is available for deploying such applications.

<img src="https://miro.medium.com/max/2692/1*nFxyDwJ2DEH1G5PMKPMj1g.png" width=55%>

### Some additional tooling 

If one has to produce more scalable code and would rather not implement much of the "boilerplate" needed for such a service, there are nice tools available, like [FastAPI](https://fastapi.tiangolo.com/), that make service development a pleasant experience.

A nice "walkthrough" of building a Keras based ML service can be found [here](https://medium.com/analytics-vidhya/deploy-machine-learning-models-with-keras-fastapi-redis-and-docker-4940df614ece).

A good introduction to FastAPI can be found [here](https://www.youtube.com/watch?v=3DLwPcrE5mA).

### "...at Google scale..."

Clearly, the challenge of operating machine learning models at "web scale" was apparent to framework developers.

Thus, e.g. the TensorFlow ecosystem incorporates a library called [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving), built for this exact purpose.

#### The TF ecosystem
<img src="http://drive.google.com/uc?export=view&id=18sayJe87HgVX1UpLeLDucX6TsWQoI9xt" width=75%>

#### The goals of TF Serving

<img src="http://drive.google.com/uc?export=view&id=1nWyoXsOxlQ_t0nBho0JsjDOat6ZveJ97" width=50%>

#### The architecture of TF Serving 
<img src="http://drive.google.com/uc?export=view&id=1soU6nslbx4NinZVE7HxcRAu4-UErNF4s" width=55%>

TF Serving can be considered an end-to-end solution for scalable deployment.

[Introduction to Tensorflow Serving](https://www.youtube.com/watch?v=q_IkJcPyNl0)
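From the client's side, a model hosted by TF Serving looks like any other web service. The sketch below builds a request for its REST predict endpoint with the standard library. The model name `my_model` is hypothetical, and host and port are assumptions (8501 is the REST port used throughout the TF Serving documentation); the request is only constructed here, since sending it requires a running server.

```python
import json
from urllib.request import Request, urlopen


def build_predict_request(instances, model_name, host="localhost", port=8501):
    """Build a POST request for TF Serving's REST predict endpoint."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(instances, model_name, **kwargs):
    """Send the request and return the list under the response's "predictions" key."""
    with urlopen(build_predict_request(instances, model_name, **kwargs)) as resp:
        return json.loads(resp.read())["predictions"]


# "my_model" is a hypothetical model name; each inner list is one input row.
request = build_predict_request([[1.0, 2.0], [3.0, 4.0]], "my_model")
print(request.full_url)
```

With a server running and a model of that name loaded, `predict([[1.0, 2.0]], "my_model")` would return one prediction per input row.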

# “NEVER DEPLOY AN ML MODEL ONCE!”

“You should never deploy a machine learning model once,

you should __deploy it never, or prepare to deploy it over and over again!__ ”

“If the problem is not important enough to keep working on it and deploying new models, it is not important enough to pay the cost of putting it into production in the first place.”

Josh Wills, director of Data Engineering at Slack 

[Talk](https://www.youtube.com/watch?v=zbS9jBB8fz8&feature=emb_logo)

As discussed in the CRISP-DM model, we have to be prepared to continuously evaluate the performance of our models (see e.g. [_What is your ML test score?_](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45742.pdf)), and if it degrades (or if business requirements change), to re-train and re-deploy them.

Two aspects are crucial for this:

- __Reproducible research__ and
- __Continuous Delivery (CD) / Continuous Integration (CI)__ (see e.g. [here](https://towardsdatascience.com/how-were-applying-ci-cd-principles-to-machine-learning-619ce3f1162c))

These approaches are integral parts of a data science workflow.
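The "evaluate continuously, re-train when performance degrades" loop can be reduced to a very small decision rule. The sketch below is one possible version under simplifying assumptions: performance is measured as plain accuracy, and the tolerance and minimum sample size are arbitrary illustrative values, not recommendations.

```python
def needs_retraining(baseline_accuracy, recent_outcomes, tolerance=0.05, min_samples=20):
    """Decide whether live accuracy has dropped far enough to trigger retraining.

    recent_outcomes is a sequence of booleans: True where the deployed
    model's prediction matched the outcome that was observed later.
    """
    if len(recent_outcomes) < min_samples:
        return False  # too little evidence to act on
    live_accuracy = sum(1 for ok in recent_outcomes if ok) / len(recent_outcomes)
    return live_accuracy < baseline_accuracy - tolerance


# Hypothetical monitoring windows for a model validated at 92% accuracy.
healthy_window = [True] * 18 + [False] * 2    # 90% correct
degraded_window = [True] * 15 + [False] * 5   # 75% correct

print(needs_retraining(0.92, healthy_window))
print(needs_retraining(0.92, degraded_window))
```

In a CI/CD setup a check like this would run on a schedule, and a positive result would kick off the (reproducible) training pipeline and a new deployment.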



# Epilogue

Please always remember how [Netflix never used its 1 million USD algorithm due to engineering cost](https://www.wired.com/2012/04/netflix-prize-costs/). Investigate the deployability of models as early as possible!

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcR_bX-M0LVMbLyS9R_hJXSWu75Rt_bYXeP4pw&usqp=CAU" width=50%>

