## Visualising and retrieving large amounts of data

Carlos Valiente ([@carletes](https://twitter.com/carletes)), on behalf of ECMWF's web people.

https://github.com/ecmwf/ecmwf-pydata-talk

## About ECMWF

We are the [European Centre for Medium-range Weather Forecasts](http://www.ecmwf.int/).

We are both a research institute and a 24x7 operational centre, funded by 34 states.

We provide, among other services:

  * Twice-daily global numerical weather forecasts, up to 2 weeks ahead.
  * Longer-range forecasts, up to one year ahead.
  * Access to our archive of meteorological data (about 100 PB as of late 2015).
  
Main users of our forecasts:

  * Meteorologists from our member states
  * Scientists from universities and other research centres
  * Commercial customers

## Our web applications

Under http://apps.ecmwf.int we offer several web applications:

  * ecCharts: A visualisation tool for our daily forecasts (restricted access)
  * Access to our public data sets (public access, requires registration).
  * Access to our full meteorological archive (restricted access).
  
![apps.ecmwf.int](./img/apps-home-page.png)

## ecCharts: Visualising weather forecasts

A web application for visualising our twice-daily global forecasts.

![Main view](./img/forecaster-home.png)

## ecCharts: Visualising weather forecasts (2)

ecCharts offers pre-defined meteorological _products_:

![ecCharts products](./img/forecaster-products.png)


## ecCharts: Visualising weather forecasts (3)

Products are composed of several _layers_:

![ecCharts layers](./img/forecaster-layers.png)

## ecCharts: Visualising weather forecasts (4)

We provide access to the last ten model runs:

![ecCharts availability](./img/forecaster-availability.png)

## The full stack

ecCharts comprises:

  * A JavaScript front-end for the UI
  * A Django HTTP back-end
  * Several Python services for retrieving data, doing computations and plotting.
  * Nginx instances on all cluster nodes to deliver data to the Python services.
  * MongoDB databases for keeping track of available data.

## The Django HTTP back-end

Things we handle here:

  * Access control
  * User preferences

The Django processes dispatch requests to the Python services layer.

## The Python services

A collection of 50+ Python services for doing data retrievals, computations on meteorological data and plotting.

One instance of each service runs on each of the 20+ cluster nodes.

First version (ca. 2009) implemented with [Twisted](https://twistedmatrix.com/trac/).

Now using [Celery](http://www.celeryproject.org), with [RabbitMQ](https://www.rabbitmq.com) as message broker, and [Redis](http://redis.io) as results backend.

## The Python services: The Twisted days

A central _broker_ process, written using Twisted, accepted HTTP requests from the Django layer.

The broker process dispatched requests to the service implementations using raw sockets.

A library called `servicelib` encapsulated all this.

```python
# The inevitable `echo` and `sum` services
from servicelib import errors, start_services


def echo_service(context, *args):
    context.log.debug("Executing echo() request from: %s",
                      context.user)
    return " ".join(args)

def sum_service(context, *args):
    try:
        args = [float(a) for a in args]
    except (TypeError, ValueError):
        raise errors.BadRequest("Invalid args: %s" % (args,))
    return sum(args)


if __name__ == "__main__":
    start_services({"name": "sum", "execute": "sum_service"},
                   {"name": "echo", "execute": "echo_service"})
```

## The Python services: The orchestrator service

A service called `orchestrator` let us build complex call trees:

![A service call tree](./img/forecaster-calltree.png)

## The Python services: The orchestrator service (II)

The same kind of call tree, expressed as a request to the `orchestrator` service:

```python
from servicelib.client import Broker


broker = Broker()
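# The tree is written as nested single-key dicts so that it is valid Python;
# the exact request format the orchestrator accepted may have differed.
broker.execute("orchestrator",
               {
                 "render": [
                     {"plot": [
                         {"retrieve": ["mslp"]},
                     ]},
                     {"plot": [
                         {"wind_speed": [
                             {"retrieve": ["10mw_u"]},
                             {"retrieve": ["10mw_v"]},
                         ]},
                     ]},
                     {"plot": ["coastlines"]},
                 ]
               })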
```
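
To illustrate how such a call tree could be evaluated, here is a minimal sketch. It is not the actual `orchestrator` code: the `run_tree` helper, the tuple-based tree format and the toy `services` table are all assumptions made for the example. Each node is resolved depth-first, so the results of the inner services become the arguments of the outer ones.

```python
def run_tree(node, services):
    """Evaluate a call tree depth-first.

    A node is either a plain string (a leaf argument) or a
    (service_name, [child, ...]) tuple. Both conventions are hypothetical.
    """
    if isinstance(node, str):
        return node
    name, children = node
    args = [run_tree(child, services) for child in children]
    return services[name](*args)


# Toy stand-ins for the real services: each one just describes its call.
services = {
    "retrieve": lambda param: "field(%s)" % param,
    "wind_speed": lambda u, v: "speed(%s, %s)" % (u, v),
    "plot": lambda layer: "layer[%s]" % layer,
    "render": lambda *layers: " + ".join(layers),
}

tree = ("render", [
    ("plot", [("retrieve", ["mslp"])]),
    ("plot", [("wind_speed", [("retrieve", ["10mw_u"]),
                              ("retrieve", ["10mw_v"])])]),
    ("plot", ["coastlines"]),
])

result = run_tree(tree, services)
```

In a real deployment the leaves and intermediate results would be meteorological fields and plot layers rather than strings, and independent branches could be dispatched in parallel.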

## The Python services: High-level services

We also built a macro library to let our meteorologists write higher-level services:

```python
from metview.macro import retrieve, sqrt

def wind_speed(r):
    if r['levtype'] == 'sfc':
        u = '165.128'
        v = '166.128'
    else:
        u = '131.128'
        v = '132.128'

    r['param'] = u
    u = retrieve(r)

    r['param'] = v
    v = retrieve(r)

    return sqrt(u * u + v * v)

if __name__ == "__main__":
    import sys
    from metview.macro import run

    run(wind_speed, sys.argv[1:])
```

## The Python services: High-level services (II)

```python
# `exp` is assumed to be exported by the macro library alongside `sqrt`.
from metview.macro import exp, retrieve

RD     = 287.05
RV     = 461.51
VTMPC1 = RV/RD-1.
TMELT  = 273.16
C1ES   = 610.78
C2ES   = C1ES*RD/RV
C3LES  = 17.269
C3IES  = 21.875
C4LES  = 35.86
C4IES  = 7.66

def relative_humidity(r):
    level = int(r["levelist"])

    r['param'] = '130.128'  # temperature
    t = retrieve(r)

    r['param'] = '133.128'  # specific humidity
    q = retrieve(r)

    # Boolean masks select the ice or water constants point by point.
    ice = (t <  TMELT)
    water = (t >= TMELT)

    z1 = ice*C3IES + water*C3LES
    z2 = ice*C4IES + water*C4LES

    relhuma = C2ES * exp(z1*(t-TMELT) / (t-z2))
    relhuma = level * 100. / relhuma

    return q * 100. * (relhuma - VTMPC1)
```
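
The `ice` and `water` masks in the service above act as switches between two sets of constants. The following sketch shows the same trick on plain Python floats instead of meteorological fields. The `coefficients` helper is hypothetical, and pairing `C4IES` with the ice branch is an assumption about the intended constants.

```python
TMELT = 273.16
C3LES, C3IES = 17.269, 21.875
C4LES, C4IES = 35.86, 7.66


def coefficients(t):
    """Pick the ice or water constants for a temperature in kelvin."""
    ice = t < TMELT      # True and False behave as 1 and 0 in arithmetic
    water = t >= TMELT
    z1 = ice * C3IES + water * C3LES
    z2 = ice * C4IES + water * C4LES
    return z1, z2


cold = coefficients(263.16)   # below melting point: ice constants
warm = coefficients(283.16)   # above melting point: water constants
```

With fields instead of scalars, the same expressions would pick the right constant at every grid point without an explicit loop.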

## The Python services: Caching

We cache all service requests with `memcached`. Caching is done with a Python decorator, based on the MD5 hash of the request arguments.

```python
from cache import cache_control

# A sample Python service
@cache_control(time=24*60*60)
def retrieve(context, *args):
    # ..
```
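
As an illustration of the idea, here is a minimal sketch of such a decorator. It is not our actual `cache` module: an in-memory dictionary stands in for `memcached`, the argument is called `time_to_live` rather than `time` to avoid shadowing the standard `time` module, and the key format is an assumption.

```python
import functools
import hashlib
import json
import time


def cache_control(time_to_live, store=None):
    """Cache a service's results, keyed on an MD5 hash of its arguments."""
    store = {} if store is None else store  # stand-in for memcached

    def decorator(func):
        @functools.wraps(func)
        def wrapper(context, *args):
            # Serialise the request deterministically before hashing it.
            payload = json.dumps([func.__name__, args], sort_keys=True)
            key = hashlib.md5(payload.encode("utf-8")).hexdigest()

            hit = store.get(key)
            if hit is not None and hit[0] > time.time():
                return hit[1]  # still fresh: skip the service call

            result = func(context, *args)
            store[key] = (time.time() + time_to_live, result)
            return result

        return wrapper

    return decorator


calls = []


@cache_control(time_to_live=24 * 60 * 60)
def retrieve(context, *args):
    calls.append(args)  # record real executions, for demonstration only
    return "data for " + "/".join(args)


first = retrieve(None, "mslp", "sfc")
second = retrieve(None, "mslp", "sfc")  # served from the cache
```

Note that the `context` argument is deliberately left out of the key in this sketch, so that identical requests from different users share one cache entry.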

## The Python services: Caching (2)

A request with no caching:

![A request with no caching](./img/forecaster-no-cache.png)

## The Python services: Caching (3)

A request with caching:

![A request with caching](./img/forecaster-cache.png)

## The Python services: From Twisted to Celery

In 2011-2012 we switched from Twisted to Celery for the services layer, because:

  * Not everybody felt comfortable with Twisted's asynchronous programming model
  * The services broker was difficult to debug
  * We needed queueing and QoS in the services broker: a lot of work that Celery had already done.
  
Moving to Celery meant a rewrite of our `servicelib` library. All services remained unchanged.

We got rid of our services broker process, since Celery uses RabbitMQ. Rock solid setup now!

## Data storage and indexing

As soon as the supercomputer runs the model, we push the data for the new cycle into our web cluster.

We push about 1 TB of fresh data every day.

Within the web cluster, data is available to all services via HTTP, served by local instances of [Nginx](http://nginx.org).

Data becomes available to our users about 30 minutes after it has been pushed, following ECMWF's official schedule of data availability.

We use [MongoDB](https://www.mongodb.org) for keeping track of the data.

## A MongoDB index entry

```
> db.fields.findOne()
{
    "domain" : "g",
    "class" : "od",
    "type" : "em",
    "param" : "151.128",
    "param_alt" : "msl",
    "stream" : "enfo",
    "levtype" : "sfc",
    "expver" : "0001",
    "base_time" : ISODate("2013-04-08T00:00:00Z"),
    "step" : 354,
    "valid_time" : ISODate("2013-04-22T18:00:00Z"),
    "active" : true,
    "locations" : [
        {
            "offset" : 0,
            "url" : "http://host42.ecmwf.int/data0000.grib",
            "length" : 4158
        }
    ]
}
>
```
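
The three time fields in this entry are related: assuming `step` is expressed in hours, `valid_time` is simply `base_time` plus `step`. A short check of that arithmetic, with a hypothetical `valid_time` helper:

```python
from datetime import datetime, timedelta


def valid_time(base_time, step_hours):
    """Return the time a forecast field is valid for."""
    return base_time + timedelta(hours=step_hours)


base = datetime(2013, 4, 8, 0, 0)
valid = valid_time(base, 354)  # 354 hours is 14 days and 18 hours
```

Storing `valid_time` alongside `base_time` and `step` lets queries filter on any of the three without computing dates on the fly.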

## The Python MongoDB API

Very clean API, with no impedance mismatch with the requests coming from the JavaScript UI, so there is no need for a complex object-relational mapping:

```python
import pymongo

client = pymongo.MongoClient("mongodb://host:27017/")
db = client["fields"]

# `now` (the base time of interest) and `download` are defined elsewhere.
for rec in db.fields.find({"param": "151.128", "base_time": now}):
    for loc in rec["locations"]:
        download(url=loc["url"],
                 offset=loc["offset"],
                 length=loc["length"])
        # ...
```
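
The `download` helper is not shown in these slides. Since each location carries an `offset` and a `length`, one plausible implementation is an HTTP range request, which fetches a single field out of a larger file. The sketch below assumes that approach; the function bodies are illustrative, not our production code.

```python
import urllib.request


def range_header(offset, length):
    """Return the HTTP Range header value for `length` bytes at `offset`."""
    # Byte ranges are inclusive at both ends, hence the -1.
    return "bytes=%d-%d" % (offset, offset + length - 1)


def download(url, offset, length):
    """Fetch one field from a larger file with a ranged GET request."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For the index entry shown earlier, `range_header(0, 4158)` asks for bytes 0 to 4157 of `data0000.grib`.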

## MongoDB issues

  * Write operations block the whole collection (makes pushing data slow).
  * Update queries also block the whole collection (makes activation of data slow).
  * When a new cycle is available from our supercomputer, we need to do lots of insertions and updates. Most of our users are active at that same time, when data is fresh.
  * Every day we remove 20% of the database entries, and add a new 20%. Fragmentation hurts!
  * MongoDB indexes are crucial, but they slow down write operations. Heavy tuning needed here.
  
We are now moving to MongoDB 3 (no more collection-level locks, better handling of fragmentation), and things look better.

## Public data sets: Batch access to data

http://apps.ecmwf.int/datasets: A free service to download public data sets (requires registration)

![Public datasets](./img/datasets-home.png)

## Public data sets: Batch access to data (2)

The UI lets you choose what data to download:

![Public datasets](./img/datasets-menu.png)

## Public data sets: Batch access to data (3)

Users' requests are queued and processed, and the results eventually become available:

![Public datasets](./img/datasets-completed.png)

## Public data sets: HTTP REST API

Downloads are also available through an [HTTP REST API](https://software.ecmwf.int/wiki/display/WEBAPI/ECMWF+Web+API+Home).

Sample clients in several languages, including Python.

Steps to use it:

1. If you don't have an account, get one at https://apps.ecmwf.int/registration/.
2. Log in: https://apps.ecmwf.int/auth/login/
3. Retrieve your API key at https://api.ecmwf.int/v1/key/
4. Copy the information in that page and save it as `$HOME/.ecmwfapirc`:
    ```
    {
        "url": "https://api.ecmwf.int/v1",
        "key": "XXXXXXXXXXXXXXXXXXXXXX",
        "email": "john.smith@example.com"
    }
    ```
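
Before running a long retrieval it can be handy to check that the key file parses and contains the three entries listed above. The `load_api_settings` helper below is a hypothetical convenience, not part of the `ecmwfapi` client:

```python
import json
import os


def load_api_settings(path="~/.ecmwfapirc"):
    """Read the API key file and check that it has the expected entries."""
    with open(os.path.expanduser(path)) as handle:
        settings = json.load(handle)

    missing = [name for name in ("url", "key", "email")
               if name not in settings]
    if missing:
        raise ValueError(
            "Missing entries in %s: %s" % (path, ", ".join(missing)))
    return settings
```

Calling `load_api_settings()` raises an exception if the file is absent, is not valid JSON, or lacks one of the entries.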

```python
import os
import sys

from ecmwfapi import ECMWFDataServer


apirc = os.path.expanduser("~/.ecmwfapirc")
if not os.access(apirc, os.F_OK):
    print("API key file '{}' not found".format(apirc))
    sys.exit(1)

server = ECMWFDataServer()
server.retrieve({
    "class": "s2",
    "dataset": "s2s",
    "hdate": "1981-01-01",
    "date": "2014-01-01",
    "expver": "prod",
    "levtype": "sfc",
    "origin": "ammc",
    "param": "tp",
    "step": "24/to/1488/by/24",
    "stream": "enfh",
    "target": "output-file.grib",
    "time": "00",
    "type": "cf",
})
```

```
2015-08-31 16:18:31 ECMWF API python library 1.3
2015-08-31 16:18:31 ECMWF API at https://api.ecmwf.int/v1
2015-08-31 16:18:31 Welcome Carlos Valiente
2015-08-31 16:18:31 In case of problems, please check https://software.ecmwf.int/wiki/display/WEBAPI/Troubleshooting or contact calldesk@ecmwf.int
2015-08-31 16:18:32 Request is queued
Calling nice mars /tmp/tmp-marsiLKRiN.req
PPDIR is /var/tmp/ppdir/x86_64
mars - INFO   - 20150831.151832 - Using odb_api version: 0.10.2 (file format version: 0.5)
mars - INFO   - 20150831.151832 - Maximum retrieval size is 20.00 G
mars - INFO   - 20150831.151832 - Using grib_api version 1.13.1
mars - INFO   - 20150831.151832 - odb_api created on  20140527
mars - INFO   - 20150831.151832 - EMOSLIB version: 395
mars - INFO   - 20150831.151832 - Welcome to MARS with grib_api and ODB
mars - INFO   - 20150831.151832 - grib_api created on  20141119
retrieve,origin=ammc,hdate=1981-01-01,stream=enfh,levtype=sfc,expver=prod,padding=0,step=24/to/1488/by/24,param=tp
```

```python
import os

print(os.stat("output-file.grib"))
```

```
posix.stat_result(st_mode=33188, st_ino=1538608, st_dev=16777220, st_nlink=1, st_uid=501, st_gid=20, st_size=1298838, st_atime=1441034314, st_mtime=1441034314, st_ctime=1441034314)
```
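
The `step` value in the request, `24/to/1488/by/24`, uses the compact range notation of MARS requests: first value, last value and increment. The hypothetical `expand_steps` helper below sketches how such a value expands into individual forecast steps; it is not part of the API client.

```python
def expand_steps(spec):
    """Expand 'first/to/last/by/increment' into a list of integers."""
    parts = spec.split("/")
    if len(parts) == 5 and parts[1] == "to" and parts[3] == "by":
        first, last, increment = int(parts[0]), int(parts[2]), int(parts[4])
        # The last value is included in the range.
        return list(range(first, last + 1, increment))
    # Otherwise treat the value as a plain slash-separated list.
    return [int(part) for part in parts]


steps = expand_steps("24/to/1488/by/24")
```

Here `steps` holds one entry per day, from 24 hours up to 1488 hours ahead.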


## Links

|What|Where|
|----|-----|
|Public API documentation|[https://software.ecmwf.int/wiki/display/WEBAPI/ECMWF+Web+API+Home](https://software.ecmwf.int/wiki/display/WEBAPI/ECMWF+Web+API+Home)|
|Requesting access to our charts|[http://ecmwf.int/en/forecasts/accessing-forecasts](http://ecmwf.int/en/forecasts/accessing-forecasts)|