# Existing Standard Format: PMML

<img src="images/PMML_Logo.png" align='left'>

## What is PMML?

PMML (Predictive Model Markup Language) is a Data Mining Group (http://dmg.org/dmg-members.html) standard that has existed and evolved for over 20 years and is used widely throughout the world.

Formally, it's an XML dialect that describes a model and/or pipeline.
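To make "XML dialect" concrete, here is a minimal sketch that parses a tiny PMML document with Python's standard library. The document is hand-written for illustration only: the field names and the numeric values are assumptions, not output from any real tool. The element names (`DataDictionary`, `RegressionModel`, `RegressionTable`, `NumericPredictor`) follow the PMML regression schema.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative PMML document (values are made up).
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="carat" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="carat"/>
      <MiningField name="price" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="-2000.0">
      <NumericPredictor name="carat" coefficient="7500.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(PMML_DOC)

# Top-level sections of the document: metadata, data description, the model.
top_level = [local_name(child.tag) for child in root]

# The data dictionary declares every field the model reads or predicts.
fields = [
    (el.get("name"), el.get("dataType"))
    for el in root.iter()
    if local_name(el.tag) == "DataField"
]

print(top_level)
print(fields)
```

The data dictionary and the model definition live side by side in one text file, which is what makes PMML easy to diff, review, and store in version control.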

## Example

Here is an example of a logistic regression classifier trained using R on the Iris dataset:

(http://dmg.org/pmml/pmml_examples/rattle_pmml_examples/IrisMultinomReg.xml)

<img src="images/pmml_example.png">

## Where do we get a PMML model?

A partial list of products supporting PMML is at http://dmg.org/pmml/products.html

Focusing on the *producing PMML* side, we can see there are a lot of products that can create PMML, even if most of them are commercial or have effectively commercial licensing schemes (e.g. JPMML).

In the open-source world (again, excluding AGPL code like JPMML), we have
* R -- strongest open-source export support
* Spark -- very limited support: the listed models are only supported under the *old/deprecated* RDD MLlib API
  * There is work in progress to add PMML export to the new API but it has just begun and may not make progress
* Python 
  * Historically, aside from the wrapper around the above-mentioned JPMML, there is little support
      * e.g., https://pypi.org/project/scikit2pmml/ from https://github.com/vaclavcadek/scikit2pmml
  * Recently, Software AG has created Nyoka, a permissively licensed Python-to-PMML export tool
      * Source https://github.com/nyoka-pmml/nyoka
      * Docs at https://nyoka-pmml.github.io/nyoka/index.html
  
It is important to note that
* although there are plenty of commercial products with at least some PMML support
* and although large enterprises can (and for support/legal reasons prefer to) pay for a product
* the lack of openness and community is leaving commercial-only ML tooling far behind
  * e.g., all of the top deep learning tools are FOSS
  * this means most of the performance-focused work is tied to the FOSS tools
  * scaling is owned by FOSS (kubeflow, Horovod, etc.)
  
### Let's create a PMML model

In [None]:
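import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Load the diamonds data; we predict price from a single feature, carat.
data = pd.read_csv('data/diamonds.csv')
X = data.carat
y = data.price

# scikit-learn expects a 2-D feature array, hence the reshape to one column.
# The estimator is wrapped in a Pipeline because that is the object we export.
pipeline_model = Pipeline([('lin_reg', LinearRegression())]).fit(X.values.reshape(-1, 1), y)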

In [None]:
from nyoka import skl_to_pmml

skl_to_pmml(pipeline_model, ['carat'], 'price', 'diamonds.pmml')

In [None]:
! cat diamonds.pmml

### How do we run a PMML model?

Enterprise-grade permissive OSS support for running PMML models is effectively nonexistent, so we need to architect in tandem with business decisions around a vendor's analytics server product. These business decisions will go beyond licensing and support, because they will affect all of our enterprise architectures: hardware, network, software, management/monitoring/operations, reliability/continuity, compliance, etc.
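To show what "running" a PMML model involves at its simplest, here is a hand-rolled sketch that scores one narrow case: a `RegressionModel` with `functionName="regression"` and only `NumericPredictor` terms. It is an illustration, not a substitute for a real scoring engine: it ignores transformations, missing-value handling, normalization, and every other model type. The document and its coefficients are made up, and `score_regression` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative PMML document (coefficients are made up).
PMML_DOC = """
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="carat" optype="continuous" dataType="double"/>
    <DataField name="price" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="carat"/>
      <MiningField name="price" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="-2000.0">
      <NumericPredictor name="carat" coefficient="7500.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def score_regression(pmml_text, inputs):
    """Score a plain linear RegressionModel; refuse anything else."""
    root = ET.fromstring(pmml_text)
    model = next(
        (el for el in root if local_name(el.tag) == "RegressionModel"), None
    )
    if model is None or model.get("functionName") != "regression":
        raise NotImplementedError("only linear RegressionModel is handled")

    table = next(el for el in model if local_name(el.tag) == "RegressionTable")
    result = float(table.get("intercept"))
    for term in table:
        if local_name(term.tag) != "NumericPredictor":
            raise NotImplementedError(f"unsupported term: {local_name(term.tag)}")
        # The exponent attribute is optional and defaults to 1.
        exponent = int(term.get("exponent", "1"))
        value = inputs[term.get("name")]
        result += float(term.get("coefficient")) * value ** exponent
    return result


print(score_regression(PMML_DOC, {"carat": 0.5}))
```

Even this toy scorer has to reject most valid PMML documents. A production consumer must cover many model types, transformations, and edge cases, which is why the choice of scoring engine matters so much.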

The most comprehensive set of tools around PMML is almost certainly JPMML.

#### JPMML

JPMML (https://github.com/jpmml) is a set of AGPL-licensed OSS projects that
* form the de facto Java implementation of PMML (and also include tools for Python, etc.)
* offer interop with key FOSS tools like Apache Spark, R, Scikit-learn, XGBoost, TensorFlow, etc.
* provide easy scoring in your own apps, via a "scoring wrapper", or hosted in the cloud
* are maintained and licensed in connection with https://openscoring.io/
* *note: there is an older, abandoned, version of JPMML under a more friendly Apache 2.0 license*
  * this older version has many features and might be suitable for some organizations with a higher risk/ownership appetite
  * https://github.com/jpmml/jpmml

#### Pros and Cons: PMML

Pros:
* In wide use / well-accepted / large community
* Core XML dialect can be human readable
* Models can be processed/managed by text-based tools (VCS/CMS/etc.)
* Covers the majority of modeling cases companies use today
* *Formally* interoperable (reading/writing the container file format)

Cons:
* Support for producing models in the open-source world is spotty
* Support for consuming models in the OSS world is sparse/minimal
* The growing importance of modern open-source tooling, where PMML support is weakest, has been dragging PMML down
* Some modern model types and pipelines are not supported, or not supported efficiently/compactly
* *Semantic* interop is limited

In practice, PMML -- even with commercial/enterprise, supported products -- is more like USB-C than USB 3.

I.e., like USB-C, it's very versatile in theory, and the plug always fits, but that tells you little or nothing about whether the two connected devices can have any conversation, let alone the specific conversation you need them to have.

Despite its imperfections, it has many advantages over single-product formats, so we often use it even if it cannot fulfil a promise of being the "universal" tool.