# Next-Gen Standard Format: PFA

<img src="images/pfa.png" align='left'>

## PFA (Portable Format for Analytics) is a Modern Replacement for PMML

__"As data analyses mature, they must be hardened — they must have fewer dependencies, a more maintainable structure, and they must be robust against errors." - DMG__

PFA, created in 2015, is intended to improve upon PMML.

From http://dmg.org/pfa/docs/motivation/:

*Tools such as Hadoop and Storm provide automated data pipelines, separating the data flow from the functions that are performed on data (mappers and reducers in Hadoop, spouts and bolts in Storm). Ordinarily, these functions are written in code that has access to the pipeline internals, the host operating system, the remote filesystem, the network, etc. However, all they should do is math.*

*PFA completes the abstraction by encapsulating these functions as PFA documents. From the point of view of the pipeline system, the documents are configuration files that may be loaded or replaced independently of the pipeline code.*

*This separation of concerns allows the data analysis to evolve independently of the pipeline. Since scoring engines written in PFA are not capable of accessing or manipulating their environment, they cannot jeopardize the production system. Data analysts can focus on the mathematical correctness of their algorithms and security reviews are only needed when the pipeline itself changes.*

*This decoupling is important because statistical models usually change more quickly than pipeline frameworks. Model details are often tweaked in response to discoveries about the data and models frequently need to be refreshed with new training samples.*

<img src="images/pfa-line.png" width=800>

(summarized from DMG)

### Overview of PFA capabilities

PFA flexibility:
* Control structures, such as conditionals, loops, and user-defined functions
* Entirely expressed within JSON, and can therefore be easily generated and manipulated by other programs
* Fine-grained function library supporting extensibility callbacks

The following contribute to PFA’s safety:
* Strict numerical compatibility: the same PFA document and the same input produce the same output, regardless of platform.
* The spec only defines functions that transform data; all I/O is controlled by the host system.
* Type system that can be statically checked. ... This system has a type-safe null and PFA only performs type-safe casting, which ensure that missing data never cause run-time errors.
* The callbacks that generalize PFA’s statistical models are not first-class functions
  * The set of functions that a PFA document might call can be predicted before it runs
  * A PFA host may choose to only allow certain functions.
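
Because callbacks are not first-class, the functions a document can call are visible in its JSON structure before anything runs. Here is a minimal sketch of how a host might pre-screen a document. It assumes a deliberately simplified view of PFA in which every single-key JSON object is a function call; a real host would rely on a compliant PFA implementation's own analysis. The names `called_functions` and `ALLOWED` are hypothetical.

```python
import json

# Hypothetical allowlist chosen by the host; not part of the PFA spec.
ALLOWED = {"+", "**", "m.sqrt"}

def called_functions(node):
    """Collect the keys of all single-key objects in a parsed JSON tree.

    Simplifying assumption: every single-key object is a function call.
    Real PFA also has special forms and literals that this toy ignores.
    """
    found = set()
    if isinstance(node, dict):
        if len(node) == 1:
            found.update(node.keys())
        for value in node.values():
            found |= called_functions(value)
    elif isinstance(node, list):
        for item in node:
            found |= called_functions(item)
    return found

# An action fragment in PFA's JSON style: sqrt(x**2 + y**2)
action = json.loads(
    '{"m.sqrt": {"+": [{"**": ["input.x", 2]}, {"**": ["input.y", 2]}]}}'
)

calls = called_functions(action)
print("calls:", sorted(calls))
print("disallowed:", sorted(calls - ALLOWED))
```

An empty `disallowed` list means every function the fragment can call is on the host's allowlist, so the check passes without executing the document.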

__Example__

Here are some data records:

<img src="images/pfa-doc-1.png" width=600>

And a PFA document which returns the square-root of the sum of the squares of a record's x, y, and z values:

<img src="images/pfa-doc-2.png" width=600>

The above example -- along with numerous other tutorials -- can be viewed, *modified*, and run live online at http://dmg.org/pfa/docs/tutorial2/ and other dmg.org pages.

Although it may not be obvious from this small example, PFA is effectively a programming language, albeit a restricted one, and as such can express complex transformations and aggregations of data. A compliant PFA scoring system must implement the full spec properly: http://dmg.org/pfa/docs/library/
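
To make the "restricted programming language" point concrete, here is a toy sketch in Python. It writes a document in the spirit of the square-root example as a plain dictionary and evaluates its action with a hand-rolled interpreter for three functions. The top-level `input`/`output`/`action` layout and the function names `+`, `**`, and `m.sqrt` follow the PFA documentation; the record name `Datum`, the helpers `evaluate` and `score`, and the sample record are illustrative assumptions. This is not a compliant PFA runtime.

```python
import json
import math

# A PFA-style document as a JSON-serializable Python dict.
# Assumption: this mirrors the square-root example; "Datum" is illustrative.
pfa_doc = {
    "input": {
        "type": "record",
        "name": "Datum",
        "fields": [
            {"name": "x", "type": "double"},
            {"name": "y", "type": "double"},
            {"name": "z", "type": "double"},
        ],
    },
    "output": "double",
    "action": [
        {"m.sqrt": {"+": [
            {"+": [{"**": ["input.x", 2]}, {"**": ["input.y", 2]}]},
            {"**": ["input.z", 2]},
        ]}}
    ],
}

# Toy function table covering only what this document uses.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "**": lambda a, b: a ** b,
    "m.sqrt": math.sqrt,
}

def evaluate(expr, record):
    """Evaluate a tiny subset of PFA expressions against one input record."""
    if isinstance(expr, (int, float)):
        return expr                            # numeric literal
    if isinstance(expr, str) and expr.startswith("input."):
        return record[expr.split(".", 1)[1]]   # field access, e.g. "input.x"
    if isinstance(expr, dict) and len(expr) == 1:
        (name, args), = expr.items()
        if not isinstance(args, list):         # single argument given without a list
            args = [args]
        return FUNCTIONS[name](*(evaluate(a, record) for a in args))
    raise ValueError(f"unsupported expression: {expr!r}")

def score(doc, record):
    """Run every expression in the action; the last value is the output."""
    result = None
    for expr in doc["action"]:
        result = evaluate(expr, record)
    return result

print(json.dumps(pfa_doc["action"]))
print(score(pfa_doc, {"x": 1.0, "y": 2.0, "z": 2.0}))
```

The whole "program" is data: it survives a JSON round trip unchanged, and the interpreter can only reach the functions listed in `FUNCTIONS`.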

A PFA document is a serialized description of a scoring engine; a runtime can create one or more engine instances from it.

The Avro, JSON, and YAML representations are interchangeable: JSON and YAML work better for humans and text tools, while Avro is better suited to performance, type checking, etc. That said, a PFA document is still intended to be machine-generated and machine-consumed, whichever representation is used.

### Let's make a PFA version of our Diamonds model

In [None]:
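import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the diamonds data and use carat weight as the only feature
data = pd.read_csv('data/diamonds.csv')
X = data.carat
y = data.price

# scikit-learn expects a 2-D feature array, so reshape carat into a single column
model = LinearRegression().fit(X.values.reshape(-1,1), y)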

In [None]:
from skompiler import skompile

# Translate the fitted model's predict function into a SKompiler expression
expr = skompile(model.predict)

# Render the expression as a PFA document in YAML
print(expr.to('pfa/yaml'))

#### Try Scoring Some Records Online

1. Go to: http://dmg.org/pfa/docs/tutorial1/ (in another browser tab)
2. Copy the PFA engine document above, and paste it in the block marked "PFA Document (YAML)"
3. Try scoring with the "Run" button
4. You'll notice that you get an Avro-related type error, because the input records are not yet in the form the engine expects
5. To fix this, format your scoring records as JSON, one per line, using the input name ("x") as the field name for the carat weight.
    * For example, `{ "x" : [1.0] }` to represent a 1-carat diamond
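
Step 5 can be scripted. Here is a small sketch that turns carat weights into one JSON record per line. It assumes, as in the example above, that the engine's input is a record with a single field `x` holding a list of feature values. The helper name `to_scoring_records` and the sample weights are made up for illustration.

```python
import json

def to_scoring_records(carats):
    """Format carat weights as one JSON scoring record per line."""
    return "\n".join(json.dumps({"x": [float(c)]}) for c in carats)

# Made-up carat weights, for illustration only.
print(to_scoring_records([0.5, 1.0, 1.5]))
```

The printed lines can be pasted directly into the tutorial page's input box.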

Hadrian (https://github.com/opendatagroup/hadrian) is a permissively licensed open-source implementation of a compliant PFA runtime. You can build it from source, or use the pre-built `.war` file to create a JVM-based scoring server.

If you want to build your own Python scoring server for PFA, see the Titus installation instructions here: https://github.com/opendatagroup/hadrian/wiki/Installation#case-4-you-want-to-install-titus-in-python

#### Pros and Cons: PFA

Pros:
* Flexible, extensible
* Permissive open-source scoring engine and some OSS export support (SKompile, Aardpfark)
* Addresses many of the shortcomings of PMML

Cons:
* Limited OSS export support
* Timing appears to have been "unlucky"
    * Perhaps the project is too new?
    * Or perhaps other, more "modern" open initiatives have overtaken the DMG and swept this approach away
* In any case, the community had not embraced PFA as of late 2019