# Saving and sharing data

Many data science applications need an intermediate storage format for saving data or transferring it between programs and machines. The data to be stored may be large, structurally complex, or both. Converting data into such a format is called serialization.

## Serialization

From [Wikipedia](https://en.wikipedia.org/wiki/Serialization)

> In computing, serialization (US spelling) or serialisation (UK spelling) is the process of translating a data structure or object state into a format that can be stored (for example, in a file or memory data buffer) or transmitted (for example, across a computer network) and reconstructed later (possibly in a different computer environment)


### ML example


For example, in ML applications, we often need to store details about a machine learning model (including the train/test data) so that we can compare it with other models. These may then need to be transferred across computers to perform comparative analysis.

Note that TensorFlow and PyTorch provide their own model serialization protocols. We will cover them later.

We illustrate with an example from the `scikit-learn` docs.

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

We monkey-patch the pipeline to give it a name.

In [None]:
pipe.name = 'my_pipeline_0.0.1'

A pipeline has several parameters.

In [None]:
pipe.get_params()

We also want to know the data used to train and test the model. Here 2 samples of training data are shown.

In [None]:
X_train[:2]

In [None]:
y_train[:2]

We combine these into a single data structure.

In [None]:
python_model = {
    'model': pipe,
    'X_train': X_train,
    'y_train': y_train,
    'X_test': X_test,
    'y_test': y_test
}

In [None]:
import pendulum

filename_base = f'{pipe.name}_{pendulum.now()}'
filename_base
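
The string form of `pendulum.now()` typically contains colons, which some filesystems (notably on Windows) do not accept in file names. A minimal sketch of a more portable alternative, using only the standard library, is shown here. The helper name `timestamped_name` and the compact timestamp layout are assumptions made for illustration, not part of the original example.

```python
from datetime import datetime, timezone

def timestamped_name(base, when=None):
    """Return `base` plus a compact UTC timestamp with no colons or spaces.

    Hypothetical helper: the layout YYYYMMDDTHHMMSSZ is an illustrative choice.
    """
    # Default to the current time, then normalise whatever we have to UTC.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"{base}_{when.strftime('%Y%m%dT%H%M%SZ')}"

# A fixed timestamp makes the result reproducible.
fixed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(timestamped_name("my_pipeline_0.0.1", fixed))
# prints: my_pipeline_0.0.1_20240102T030405Z
```

A name built this way also sorts chronologically when file names are listed alphabetically.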

## Python native data formats

If you only ever use Python and don't need to share your data with anyone else, you can use the efficient serialization formats native to Python.

### Pickle

In [None]:
import pickle

In [None]:
# Note that the file must be opened in binary write mode ('wb')
pickle_file = f'{filename_base}.pickle'
with open(pickle_file, 'wb') as f:
    pickle.dump(python_model, f)

In [None]:
! head -c 200 $pickle_file

In [None]:
with open(pickle_file, 'rb') as f:
    m_pickle = pickle.load(f)
print(m_pickle.keys())

This is super convenient because the model is immediately usable!

In [None]:
m_pickle['model'].score(m_pickle['X_test'], m_pickle['y_test'])
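
The convenience comes from `pickle` restoring Python objects with their original types. The sketch below illustrates this with a few built-in types standing in for a model (the dictionary contents are made up for illustration) and contrasts it with JSON, which rejects the same structure.

```python
import json
import pickle
from datetime import date

# Illustrative stand-in for model metadata: a tuple, a set and a date.
original = {
    "shape": (75, 20),
    "labels": {0, 1},
    "trained_on": date(2024, 1, 2),
}

# Round trip through an in-memory bytes object instead of a file.
restored = pickle.loads(pickle.dumps(original))
print(restored == original)
print(type(restored["shape"]).__name__, type(restored["labels"]).__name__)

# JSON has no representation for sets or dates, so this raises TypeError.
try:
    json.dumps(original)
except TypeError as e:
    print("JSON failed:", e)
```

The flip side is that unpickling can execute arbitrary code, so it is only safe to load pickle files from sources you trust.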

### Joblib

Joblib is more efficient for objects that contain large `numpy` arrays. Behind the scenes it builds on `pickle`, adding special handling for array data and optional compression.

In [None]:
import joblib

In [None]:
joblib_file = f'{filename_base}.joblib'
joblib.dump(python_model, joblib_file)

In [None]:
! head -c 200 $joblib_file

In [None]:
m_joblib = joblib.load(joblib_file)

In [None]:
m_joblib['model'].score(m_joblib['X_test'], m_joblib['y_test'])

## Portable data formats

Here we generally cannot automatically store Python objects, so we create a generic data structure to store. Serialization using these non-native formats usually takes more work. 

**Note**. Some Python libraries, such as `pyyaml`, provide mechanisms for directly storing and recreating Python objects, as `pickle` and `joblib` do. These are not covered in these lecture notes.

In [None]:
generic_model = {
    'name': pipe.name,
    'params': pipe.get_params(),
    'X_train': X_train,
    'y_train': y_train,
    'X_test': X_test,
    'y_test': y_test
}

### CSV

CSV cannot handle non-tabular data structures, so we would have to do something like store 5 different files:

- model key, value pairs (one per line)
- X\_train
- X\_test
- y\_train
- y\_test

In [None]:
import csv

csv_file = f'{filename_base}.csv'
# newline='' lets the csv module control line endings, as its docs recommend
with open(csv_file, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',', quotechar='"')
    writer.writerow(['name', pipe.name])
    for k, v in pipe.get_params().items():
        writer.writerow([k, v])

In [None]:
! head -c 200 $csv_file

Reading back with the `csv` module handles the problem of commas embedded in quoted fields.

In [None]:
with open(csv_file, 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:
            break
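
To see why the `csv` module matters here, the following sketch writes a value containing commas to an in-memory buffer and compares a naive `split(',')` with `csv.reader`. The two rows are illustrative strings resembling pipeline parameters, not actual output from the cells above.

```python
import csv
import io

# Illustrative rows: the second value contains commas.
rows = [
    ["svc__kernel", "rbf"],
    ["steps", "[('scaler', StandardScaler()), ('svc', SVC())]"],
]

# Write to an in-memory text buffer instead of a file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

# Naive splitting breaks the quoted field apart at every comma.
naive = [line.split(",") for line in text.splitlines()]

# csv.reader respects the quotes and recovers the original two fields.
parsed = list(csv.reader(io.StringIO(text, newline="")))

print(len(naive[1]), len(parsed[1]))
# prints: 5 2
```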

We can write the `numpy` arrays to CSV in the same way, but it's easier to do so directly with `numpy`.

In [None]:
import numpy as np

In [None]:
X_train_filename = f'X_train_{filename_base}'
np.savetxt(X_train_filename, X_train, delimiter=',')

In [None]:
! head -c 200 $X_train_filename

Reading back into `numpy` is also straightforward.

In [None]:
np.loadtxt(X_train_filename, delimiter=',').shape

### JSON

JSON is ubiquitous as a data format, and is the usual payload format for REST APIs. JSON only understands a few basic data types: string, number, object (like a Python dictionary), array (like a Python list), boolean and null. It is therefore inefficient for transferring large binary objects such as `numpy` arrays.
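
A short sketch of how Python values map onto these JSON types, and what comes back after a round trip. The dictionary is made up for illustration.

```python
import json

# Illustrative value mixing a tuple, a non-string key, a boolean and None.
python_value = {"shape": (75, 20), 1: "int key", "fitted": True, "seed": None}

text = json.dumps(python_value)
print(text)
# prints: {"shape": [75, 20], "1": "int key", "fitted": true, "seed": null}

restored = json.loads(text)
print(restored)
# prints: {'shape': [75, 20], '1': 'int key', 'fitted': True, 'seed': None}
```

The round trip is lossy: the tuple comes back as a list and the integer key comes back as a string, so `restored != python_value`.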

In [None]:
import json
import numpy as np

Unfortunately, the `get_params` method returns values that are Python objects, such as `StandardScaler()`, that cannot be directly serialized to JSON.

In [None]:
json_file = f'{filename_base}.json'

with open(json_file, 'w') as f:
    try:
        json.dump(generic_model, f)
    except TypeError as e:
        print(e)

We need to convert arrays to lists and everything else to strings first.

In [None]:
def serialize(m):
    """Convert arrays to nested lists and all other values to their string representation."""
    d = {}
    for k, v in m.items():
        if isinstance(v, np.ndarray):
            d[k] = v.tolist()
        else:
            d[k] = str(v)
    return d

In [None]:
with open(json_file, 'w') as f:
    json.dump(serialize(generic_model), f)

In [None]:
! head -c 200 $json_file

The price is that every non-array value is now a string, so you have to do the reconstruction yourself.
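
As a rough illustration of what reconstruction involves, the sketch below uses `ast.literal_eval` to turn stringified values back into plain Python literals where that is possible. It assumes each parameter was stringified individually; the helper name `revive` and the sample values are hypothetical. Values that are not literals, such as the `repr` of an estimator, are left as strings and would need library-specific handling.

```python
import ast

def revive(text):
    """Best-effort inverse of str() for plain Python literals (hypothetical helper)."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a literal (e.g. a bare word or a constructor call): keep the string.
        return text

# Illustrative stringified parameters, one string per value.
stringified = {
    "svc__C": "1.0",
    "svc__probability": "False",
    "memory": "None",
    "svc__kernel": "rbf",
    "scaler": "StandardScaler()",
}

revived = {k: revive(v) for k, v in stringified.items()}
for k, v in revived.items():
    print(k, repr(v), type(v).__name__)
```

`ast.literal_eval` only evaluates literals, so unlike `eval` it does not run arbitrary code found in the file.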

See this [tutorial](https://stackabuse.com/scikit-learn-save-and-restore-models/) for how to restore `scikit-learn` models.

It is simple to restore `numpy` arrays.

In [None]:
with open(json_file, 'r') as f:
    m_json = json.load(f)

In [None]:
X_test_json = np.asarray(m_json['X_test'])

### YAML

- YAML Ain't Markup Language
- YAML is often used for configuration - for example, in `docker-compose` to specify containers

YAML 1.2 is a superset of JSON, so anything that can be serialized as JSON can also be serialized as YAML. However, YAML is more flexible. See the YAML [docs](https://yaml.org/spec/1.2/spec.html) for more information, especially on how to use YAML anchors and aliases.

In [None]:
import yaml

In [None]:
yaml_file = f'{filename_base}.yaml'

with open(yaml_file, 'w') as f:
    yaml.safe_dump(serialize(generic_model), f)

In [None]:
! head -c 200 $yaml_file

In [None]:
with open(yaml_file, 'r') as f:
    m_yaml = yaml.safe_load(f)

In [None]:
m_yaml.keys()

### XML

XML represents data as a tree: each element can contain text, attributes and further child elements, so the structure is recursive.
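
The recursion can be made concrete with a small sketch that converts a nested Python structure into an element tree using only the standard library. The function name `to_element`, the root tag `all` and the list tag `item` are illustrative choices. The sketch also assumes every dictionary key is a valid XML tag name.

```python
import xml.etree.ElementTree as ET

def to_element(tag, value):
    """Recursively convert dicts, lists and scalars into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # Each key becomes a child element.
        for k, v in value.items():
            elem.append(to_element(k, v))
    elif isinstance(value, list):
        # Each list entry becomes an <item> child.
        for v in value:
            elem.append(to_element("item", v))
    else:
        # Scalars end the recursion and are stored as text.
        elem.text = str(value)
    return elem

data = {"name": "my_pipeline_0.0.1", "y_test": [0, 1, 1]}
root_elem = to_element("all", data)
print(ET.tostring(root_elem, encoding="unicode"))
# prints: <all><name>my_pipeline_0.0.1</name><y_test><item>0</item><item>1</item><item>1</item></y_test></all>
```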

In [None]:
import xml.etree.ElementTree as ET

XML is painful to create manually so I will convert from JSON instead.

In [None]:
! python3 -m pip install --quiet json2xml

In [None]:
from json2xml import json2xml
from json2xml.utils import readfromjson

In [None]:
xml_file = f'{filename_base}.xml'

data = readfromjson(json_file)
xml = json2xml.Json2xml(data).to_xml()

In [None]:
with open(xml_file, 'w') as f:
    f.write(xml)

In [None]:
! head -c 200 $xml_file

In [None]:
tree = ET.parse(xml_file)
root = tree.getroot()

In [None]:
for item in root:
    print(item)

Use [XPath](https://www.w3schools.com/xml/xpath_syntax.asp) notation to navigate the XML tree.

In [None]:
name = root.find('.//name')
name.tag, name.text

In [None]:
len(root.findall('.//item'))
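
`ElementTree` supports only a subset of XPath. The sketch below shows a few of the supported forms on a small hand-written document. The document and the name `small_root` are made up for illustration and only loosely resemble the file produced above.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative document: a 2x2 matrix and a label vector.
doc = """<all>
  <name>my_pipeline_0.0.1</name>
  <X_test>
    <item><item>0.5</item><item>-1.2</item></item>
    <item><item>0.1</item><item>0.7</item></item>
  </X_test>
  <y_test><item>0</item><item>1</item></y_test>
</all>"""
small_root = ET.fromstring(doc)

# A direct child, selected by tag name.
print(small_root.find("name").text)

# './/item' matches <item> elements at any depth: 2 rows + 4 values + 2 labels.
print(len(small_root.findall(".//item")))

# './X_test/item' matches only the direct children of <X_test>, i.e. the rows.
print(len(small_root.findall("./X_test/item")))

# 'item[1]' selects the first row (positions are 1-based in XPath).
first_row = [float(e.text) for e in small_root.findall("./X_test/item[1]/item")]
print(first_row)
```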

### HDF5

HDF5 was designed to store large and heterogeneous data sets. It is ideal if you need to store lots of numerical data with annotation.

There are two popular libraries in Python:

- [h5py](https://docs.h5py.org/en/stable/)
- [pytables](https://www.pytables.org)

I find `h5py` to have a friendlier interface, but the implementation supported by `pandas` is `pytables`.

In [None]:
h5_file = f'{filename_base}.h5'

In [None]:
import h5py

In [None]:
with h5py.File(h5_file, 'w') as f:
    g = f.create_group(pipe.name)
    g.create_dataset(name='X_train', data=python_model['X_train'])
    g.create_dataset(name='y_train', data=python_model['y_train'])
    g.create_dataset(name='X_test', data=python_model['X_test'])
    g.create_dataset(name='y_test', data=python_model['y_test'])
    g.attrs['name'] = pipe.name
    for k, v in pipe.get_params().items():
        g.attrs[k] = str(v)

In [None]:
! head -c 200 $h5_file

In [None]:
with h5py.File(h5_file, 'r') as f:
    for k in f:
        g = f[k]
        print(g)
        for attr in g.attrs:
            print(attr, g.attrs[attr])
        for item in g:
            print(item, g[item])

In [None]:
with h5py.File(h5_file, 'r') as f:
    xs = f['my_pipeline_0.0.1/X_train']
    print(xs[:2, :5])

### Google Protocol Buffer (protobuf)

This is typically used to transmit data for ML prediction, especially for ML deployments on a cloud platform. It is a binary buffer, so much more efficient than JSON for large data sets.
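
The size advantage of a binary encoding over text can be illustrated without protobuf itself. The sketch below packs a list of floats as fixed-width 8-byte doubles with the standard `struct` module and compares the result with the same list encoded as JSON. This is not the protobuf wire format, only an assumption-laden illustration of binary versus text encoding, and the values are arbitrary.

```python
import json
import struct

# Arbitrary illustrative data: 1000 floating point values.
values = [i / 7 for i in range(1000)]

# Text encoding: every digit of every number costs one byte.
as_json = json.dumps(values).encode("utf-8")

# Binary encoding: each value is exactly 8 bytes (little-endian double).
as_binary = struct.pack(f"<{len(values)}d", *values)

print(len(as_json), len(as_binary))

# The binary form round-trips exactly.
restored_values = list(struct.unpack(f"<{len(values)}d", as_binary))
print(restored_values == values)
```

Here the binary form takes 8000 bytes, while the JSON form is noticeably larger because most values need 17 or more characters.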

From the [official docs](https://developers.google.com/protocol-buffers/docs/pythontutorial), there are 3 steps:

- Define message formats in a .proto file.
- Use the protocol buffer compiler (`protoc`) to generate Python classes from it.
- Use the Python protocol buffer API to write and read messages

You will rarely have to work with protocol buffers directly in practice, but under the hood, TensorFlow uses this serialization method in the SavedModel protocol buffer.