# Data & ML @Go-Jek 

### 3 December 2018, 7:35pm - 8:05pm

#### Guo Jun
#### Data Science, Gojek

## Content

1. The problem of surge pricing
2. Why we needed clustering for better surge
3. Why we picked Python to write the clustering service
4. What we learnt

# The problem of surge pricing

## Why Surge Pricing?

1. Helps ensure a reliable service.
2. Protects our supply of drivers.

Surge pricing adjusts prices in real time when the market (demand & supply)
is in disequilibrium, to bring it back to equilibrium.


## Economics 101
![Economics 101](assets/surge_econ.jpg)

We maximise the number of completed bookings when we set the price at P*.

## Our initial version of surge

It divides the city into grid-shaped buckets and applies a formula to each bucket, where the main inputs are the demand and supply in that area.
It looks something like this:

![Naive Surge](assets/naive_surge.png)

It actually works pretty a-okay!
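To make the grid idea concrete, here is a minimal sketch. The cell size, the cap and the ratio-based formula are all assumptions made up for illustration; the talk does not show the actual formula.

```python
import math
from collections import Counter

CELL_SIZE_DEG = 0.01  # hypothetical cell size in degrees (roughly 1 km near the equator)


def cell_of(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a coordinate to the (row, col) index of its grid cell."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))


def surge_multiplier(demand, supply, cap=2.0):
    """Hypothetical formula: demand/supply ratio, clamped to [1.0, cap]."""
    if supply == 0:
        return cap if demand > 0 else 1.0
    return min(cap, max(1.0, demand / supply))


def grid_surge(bookings, drivers):
    """Count demand and supply per cell and return a multiplier per cell.

    bookings and drivers are iterables of (lat, lon) tuples.
    """
    demand = Counter(cell_of(lat, lon) for lat, lon in bookings)
    supply = Counter(cell_of(lat, lon) for lat, lon in drivers)
    cells = demand.keys() | supply.keys()
    return {cell: surge_multiplier(demand[cell], supply[cell]) for cell in cells}
```

Each cell is priced independently of its neighbours, which is exactly what makes this version simple to build and easy to reason about.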

# Why we needed clustering for better surge

### What are some issues?

* Easily hackable (Walk to the next street etc...)

![Surge Hacking 1](assets/surge_hacking_2.png)

![Surge Hacking 2](assets/surge_hacking_1.png)

We need a way to group similar "geospatial buckets". That sounds like a clustering problem.

## Clustering constraints

* Clusters must be spatially contiguous.
* Larger clusters on the outskirts of the city, where information is sparser and more erratic, so that it can be pooled for more efficient surge pricing.
* Dynamic
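The contiguity constraint is easy to check mechanically. The sketch below assumes cells are `(row, col)` grid indices and that "contiguous" means connected through edge-sharing neighbours; both are assumptions for illustration, not details from the talk.

```python
from collections import deque


def is_contiguous(cells):
    """Return True if all cells form one connected region.

    Two cells are neighbours when they share an edge (4-connectivity).
    An empty cluster is treated as trivially contiguous.
    """
    cells = set(cells)
    if not cells:
        return True
    start = next(iter(cells))
    seen = {start}
    queue = deque([start])
    # Breadth-first search restricted to cells that belong to the cluster.
    while queue:
        row, col = queue.popleft()
        for neighbour in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            if neighbour in cells and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == cells
```

A check like this could run as a validation step on any clustering output before it is used for pricing.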

## Clustering
![Clusters](assets/clusters.png)

<center>Credits: Peter Richens</center>

<center>Check out his Medium post about how we do allocation!</center>

![Jaeger](assets/jaeger.jpeg)

https://blog.gojekengineering.com/how-we-use-machine-learning-to-match-drivers-riders-b06d617b9e5

## Cluster surge

This is how the initial formula works on the cluster level:

![cluster surge](assets/cluster_surge.png)

Looks awesome! Let's deploy this!
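A minimal sketch of what "the formula on the cluster level" could mean: pool demand and supply over all cells in a cluster, then give every cell the cluster's multiplier. The ratio formula, the cap and the data shapes are hypothetical.

```python
from collections import defaultdict


def surge_multiplier(demand, supply, cap=2.0):
    """Hypothetical formula: demand/supply ratio, clamped to [1.0, cap]."""
    if supply == 0:
        return cap if demand > 0 else 1.0
    return min(cap, max(1.0, demand / supply))


def cluster_surge(cell_counts, cell_to_cluster):
    """Return a multiplier per cell, computed from pooled cluster totals.

    cell_counts:     {cell: (demand, supply)}
    cell_to_cluster: {cell: cluster_id}
    """
    totals = defaultdict(lambda: [0, 0])
    for cell, (demand, supply) in cell_counts.items():
        cluster = cell_to_cluster[cell]
        totals[cluster][0] += demand
        totals[cluster][1] += supply
    per_cluster = {c: surge_multiplier(d, s) for c, (d, s) in totals.items()}
    return {cell: per_cluster[cell_to_cluster[cell]] for cell in cell_counts}
```

Because neighbouring cells in one cluster share a multiplier, walking to the next street no longer changes the price.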

# Why we picked Python to write the clustering service

## Shall we do it in Python?

1. The EDA was already done in Python. Jupyter <3
2. Pre-built machine learning libraries
3. Data scientists are usually more familiar with Python, so it is easier to debug

Seems like a no brainer...

### Let's get started...

**First question**: 
Python 2 or 3?

![Headache](assets/Jackie-Chan-WTF.jpg)
<center>Having a headache already...</center>

**Second question**: 
How do we manage dependencies and environment?

* pyenv + pipenv?
* conda?
* pyenv + conda?
* pip + virtualenv?

![Python Environment](assets/xkcd_python_environment_2x.png)

<center>"Python Environment" by xkcd</center>

**Third question**: 
Seems like we have to maintain the features in our service. Do we have a database migration strategy for it?

How do we monitor the web service?

Do we have a rollback strategy for a bad model?

...

# What we learnt

## Are you ready to deploy a Python model? :)

Most of the time, it might be easier to work with a language that your organization is more familiar with. 

If so... here are some of the things my team learned.

## A project structure 

A standardized way of managing the Python project and environment
makes it reproducible across machines and across the team.
You only have to set it up once.

Project structure for this presentation:
```
❯ tree
.
├── Makefile                     <- Makefile with commands like `make present`
├── README.md                    <- The top-level README for developers using this project.
├── environment.yaml             <- The specification files to build a conda environment
└── notebooks                    <- Jupyter notebooks for this presentation.
    ├── assets                   <- Assets to store images, video.
    │   ├── cluster_surge.png
    │   ├── clusters.png
    │   ├── naive_surge.png
    │   ├── osm_vector_small.png
    │   ├── surge_econ.jpg
    │   └── xkcd_python_environment_2x.png
    ├── main.ipynb
    └── main.slides.html
```

We manage this with pyenv + conda.

### Check out Cookiecutter Data Science for some inspiration

```
├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
```

https://drivendata.github.io/cookiecutter-data-science/

## Python style guide

Defining a team style guide early makes Python code more readable and documentation easier to write.

Doctest is super helpful! 

```python
def add_five(i):
    """Dummy function to add five to integer

    Args:
      i (int):
          Input integer

    Returns:
      int: input integer + 5

    Example:
      >>> i = 5
      >>> add_five(i)
      10
      >>> x = 20
      >>> add_five(x)
      21
    """
    return i + 5

import doctest
doctest.run_docstring_examples(add_five, globals())
```

The second example expects the wrong value on purpose, so doctest reports the failure:

```
**********************************************************************
File "__main__", line 16, in NoName
Failed example:
    add_five(x)
Expected:
    21
Got:
    25
```
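If you want doctest failures to be counted programmatically (for example to fail a build), one option is to run the examples through `doctest.DocTestParser` and `doctest.DocTestRunner` from the standard library. This is a sketch only; the helper name `count_doctest_failures` is made up for illustration.

```python
import doctest


def add_five(i):
    """Add five to an integer.

    >>> add_five(5)
    10
    >>> add_five(20)
    25
    """
    return i + 5


def count_doctest_failures(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    # globs gives the examples access to the function under test.
    test = parser.get_doctest(func.__doc__, {func.__name__: func}, func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted
```

For whole modules, `python -m doctest your_module.py` or `doctest.testmod()` does the same job with less code.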


## Environment Config

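Be careful: `os.getenv` quietly returns `None` when a variable is not set, while `os.environ[...]` raises a `KeyError`.

```python
import os
x = os.getenv('FOO')
print(x)
```

```
None
```

```python
import os
x = os.environ['FOO']
print(x)
```

```
KeyError: 'FOO'
```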
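One way to avoid a `None` leaking deep into the service is to validate all required variables once at startup. This is a minimal sketch; `load_config` and `MissingConfigError` are hypothetical names, not part of any library.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when required environment variables are absent."""


def load_config(required_keys, environ=None):
    """Return {key: value} for all required keys, or fail listing every missing one.

    environ defaults to os.environ; passing a plain dict makes this easy to test.
    """
    environ = os.environ if environ is None else environ
    missing = [key for key in required_keys if key not in environ]
    if missing:
        raise MissingConfigError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[key] for key in required_keys}
```

Reporting every missing key in one error saves a round of fix-and-redeploy per variable.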

## Start an internal Python library

You will be surprised how many people are actually solving the same problem on the same team... 

* Pulling data from and pushing data to the database
* Data visualization (Customizing your matplotlib)
* Common utils

https://matplotlib.org/users/customizing.html

```python
# Config package
import logging
import os


def get_string_list_or_raise_exception(key):
    """Get env string from key and split by comma

    Args:
      key (str):
          Input env key

    Returns:
      list:
          Env value split by comma

    Example:
      >>> os.environ["FRUITS"] = "apple,banana"
      >>> get_string_list_or_raise_exception("FRUITS")
      ['apple', 'banana']
    """
    try:
        return os.environ[key].split(',')
    except Exception as err:
        # Log the problem, then re-raise so the caller fails fast.
        logging.error("Error getting string value from key: %s", err)
        raise
```

## Multiprocessing

Consider multiprocessing to sidestep the Global Interpreter Lock (GIL) if you need parallelism for CPU-bound work.

At this point, I would also consider a more performant language...
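A minimal sketch of the multiprocessing route, using `multiprocessing.Pool` from the standard library. The workload `sum_of_squares` is a toy stand-in for real CPU-bound work, and the function names are made up for illustration.

```python
from multiprocessing import Pool


def sum_of_squares(n):
    """CPU-bound toy workload: sum of i*i for i in range(n)."""
    return sum(i * i for i in range(n))


def run_in_parallel(inputs, processes=2):
    """Spread the inputs over worker processes; each worker has its own GIL."""
    with Pool(processes=processes) as pool:
        # map preserves input order in the returned list.
        return pool.map(sum_of_squares, inputs)
```

Call `run_in_parallel(...)` from inside an `if __name__ == "__main__":` guard, since platforms that spawn workers re-import the main module. Arguments and results are pickled between processes, so this pays off only when the computation outweighs that overhead.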

## Summary

Take steps to build:
* A reproducible development environment
* Common toolkit for the team
* Idiomatic style guide for readability

Consider:
* Throughput and latency
* Organization's familiarity with the language

# Thank You 🐍