In [1]:
import pandas as pd

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import lumipy as lm
import lumipy.ml as ml
from lusidjam import RefreshingToken as rt

from sklearn import set_config

set_config(display='diagram')

# Tutorial 10 - Scikit Learn II

More complex models built as sklearn pipelines can also be converted to ONNX and served from Luminesce, in the same way as in the previous tutorial.

This tutorial is based on an example from the sklearn documentation:

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html

## Get Titanic Dataset

In [2]:
data = fetch_openml("titanic", version=1, as_frame=True)
print(data['DESCR'])

**Author**: Frank E. Harrell Jr., Thomas Cason  
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)  
**Please cite**:   

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variable

In [3]:
df = pd.concat([data['data'], data['target']], axis=1)
df['sex'] = df.sex.astype(str)
df['embarked'] = df.embarked.astype(str)
df['pclass'] = df.pclass.astype(str)
df['survived'] = df.survived.astype(int)

train, test = train_test_split(df, test_size=0.3, random_state=0, stratify=df.iloc[:, -1])
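
The `stratify` argument in the split above keeps the proportion of survivors roughly the same in the train and test sets. The idea can be sketched in pure Python as follows. The helper `stratified_split` and the labels are made up for illustration, and this is not how sklearn implements it internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=0):
    """Split row indices so each class contributes ~test_size of its rows to the test set."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 60 rows of class 0 and 40 rows of class 1
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels, test_size=0.3)

# Fraction of class 1 in each split
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
print(test_rate, train_rate)
```

Both splits keep the 40% share of class 1 present in the full label list, which is the property `stratify` is used for here.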

## Upload Data to Drive

Upload the test set to Drive so it can be passed through the model we upload later in this tutorial.

In [4]:
drive = lm.get_drive(token=rt())

In [5]:
drive.create_folder('/lumipy_test/ml_tutorial/titanic')
drive.upload(test, '/lumipy_test/ml_tutorial/titanic/test.csv', overwrite=True)

## Construct a Model Pipeline and Train

Construct the sklearn model pipeline and train it as normal. 

In [6]:
X_train, y_train = train.iloc[:,:-1], train.iloc[:,-1]
X_test, y_test = test.iloc[:,:-1], test.iloc[:,-1]

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegressionCV

In [8]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer()),
        ('scaler', StandardScaler())
    ]
)

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(
    steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('categorical', categorical_transformer, categorical_features)
    ]
)

pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('pca', PCA()),
        ('classifier', LogisticRegressionCV())
    ]
)

# Display the pipeline graph
pipe
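
To make the preprocessing concrete, here is a dependency-free sketch of what the two branches of the `ColumnTransformer` compute with the settings used above: `SimpleImputer()` fills missing values with the column mean, `StandardScaler()` rescales to zero mean and unit variance, and `OneHotEncoder(handle_unknown='ignore')` maps an unseen category to an all-zero vector. The helper names `impute_and_scale` and `fit_one_hot` are hypothetical and the sample values are made up. The sketch illustrates the arithmetic only and is not sklearn's implementation.

```python
import math
from statistics import fmean, pstdev


def impute_and_scale(values):
    """Mean-impute missing entries (None or NaN), then standardize the column.

    For brevity this fits and transforms the same list; sklearn learns the
    statistics on the training data and reuses them on new data.
    """
    def is_missing(v):
        return v is None or math.isnan(v)

    mean = fmean(v for v in values if not is_missing(v))
    filled = [mean if is_missing(v) else v for v in values]

    # Population standard deviation (ddof=0); fall back to 1.0 for a constant column
    sigma = pstdev(filled) or 1.0
    return [(v - mean) / sigma for v in filled]


def fit_one_hot(train_values):
    """Learn the sorted category list and return it with an encoder function."""
    categories = sorted(set(train_values))

    def encode(value):
        # A category not seen during fitting matches nothing: all zeros
        return [1.0 if value == c else 0.0 for c in categories]

    return categories, encode


# Made-up numeric column with one missing entry
scaled = impute_and_scale([22.0, None, 38.0, 30.0])
print(scaled)

# Made-up categorical column; 'X' was never seen when fitting
categories, encode = fit_one_hot(['S', 'C', 'S', 'Q'])
print(categories)
print(encode('S'))
print(encode('X'))
```

The imputed entry lands exactly on the column mean, so it becomes 0 after scaling, and the unseen category `'X'` encodes to all zeros rather than raising an error.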

In [9]:
pipe = pipe.fit(X_train, y_train)
print(f"Test set accuracy: {pipe.score(X_test, y_test):1.3f}")

Test set accuracy: 0.799


## Convert Pipeline and Upload to Drive

To use the pipeline within Luminesce, it must be converted to an ONNX graph and then uploaded to Drive.

In [10]:
onnx_model = ml.sklearn_to_onnx(pipe, X_train)

In [11]:
drive.upload(onnx_model, '/lumipy_test/ml_tutorial/titanic/pipeline.onnx', overwrite=True)

## Serve the Model

Now that the test data and model file are uploaded, we can use them in a Luminesce query. Integration into the Lumipy fluent expression syntax will be added in a future update; for now, this is only available in text-based queries.

In [12]:
client = lm.get_client(token=rt())

In [13]:
test_infer_df = client.query_and_fetch("""
    @titanic_data = use Drive.csv
        --file=/lumipy_test/ml_tutorial/titanic/test.csv
    enduse;

    @titanic_features = select 
        [pclass], [sex], [age], [fare], [embarked]
    from @titanic_data;

    @inference = use Tools.ML.Inference.Sklearn with @titanic_features
        --onnxFilePath=/lumipy_test/ml_tutorial/titanic/pipeline.onnx
    enduse;

    select * from @inference
""")

In [14]:
test_infer_df.head()

|   | label | probabilities_0 | probabilities_1 |
|---|-------|-----------------|-----------------|
| 0 | 0 | 0.836992 | 0.163008 |
| 1 | 0 | 0.711165 | 0.288835 |
| 2 | 0 | 0.783509 | 0.216491 |
| 3 | 1 | 0.439441 | 0.560559 |
| 4 | 0 | 0.783546 | 0.216454 |
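
Because model serving is currently limited to text-based queries, it can be convenient to assemble the query text with a small helper when the file paths or feature columns change. The sketch below is only string formatting: `build_inference_query` is a hypothetical helper, not part of Lumipy, and it reuses only the statements that appear in the query above.

```python
def build_inference_query(data_path, model_path, features):
    """Return query text that runs an ONNX model in Drive over a CSV in Drive.

    Hypothetical helper: it fills in the same statements used in the
    hand-written query and performs no validation of its inputs.
    """
    columns = ", ".join(f"[{name}]" for name in features)
    return f"""
    @data = use Drive.csv
        --file={data_path}
    enduse;

    @features = select
        {columns}
    from @data;

    @inference = use Tools.ML.Inference.Sklearn with @features
        --onnxFilePath={model_path}
    enduse;

    select * from @inference
"""


query = build_inference_query(
    data_path='/lumipy_test/ml_tutorial/titanic/test.csv',
    model_path='/lumipy_test/ml_tutorial/titanic/pipeline.onnx',
    features=['pclass', 'sex', 'age', 'fare', 'embarked'],
)
print(query)
```

The resulting string can be passed to `client.query_and_fetch(query)` in the same way as the hand-written query.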


In [15]:
lumi_cls1_prob = test_infer_df.values[:, -1]
local_cls1_prob = pipe.predict_proba(X_test)[:, -1]
diff = lumi_cls1_prob - local_cls1_prob

# Show that the predicted probabilities agree to within floating-point precision
print(diff.mean(), diff.std())

7.279000550315503e-09 2.6317966388897202e-08
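
Differences of order 1e-8 are about what we would expect if the served model evaluates in single precision (float32) while sklearn computes in float64. That the ONNX graph uses float32 is an assumption here, not something established above. The sketch below only shows how large the rounding error is when probabilities of this size are stored as float32; the sample values are the class-1 probabilities from the `head()` output.

```python
import struct


def to_float32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


# Class-1 probabilities from the first five rows of the query result
probs = [0.163008, 0.288835, 0.216491, 0.560559, 0.216454]

# Absolute error introduced by storing each value in single precision
errors = [abs(to_float32(p) - p) for p in probs]
print(max(errors))
```

For values between 0 and 1, the float32 rounding error is at most 2**-25 (about 3e-8), which is the same order as the mean and standard deviation printed above.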


Computing the accuracy for both shows that they are identical.

In [16]:
lumi_labels = test_infer_df.values[:,0]
local_labels = pipe.predict(X_test)
print(accuracy_score(y_test, lumi_labels))
print(accuracy_score(y_test, local_labels))

0.7989821882951654
0.7989821882951654
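
The identical accuracies are consistent with the tiny probability differences seen earlier. For a binary logistic regression, sklearn predicts class 1 when its probability exceeds 0.5, so a shift of order 1e-8 can only change a label if the probability sits within that distance of 0.5. The sketch below assumes the served model assigns labels by the same rule. `labels_from_probs` and `agreement` are hypothetical helpers, and the perturbation size is chosen for illustration.

```python
def labels_from_probs(class1_probs, threshold=0.5):
    """Assign label 1 when the class-1 probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in class1_probs]


def agreement(a, b):
    """Fraction of positions where two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Class-1 probabilities from the first five rows of the query result
probs = [0.163008, 0.288835, 0.216491, 0.560559, 0.216454]

# Shift every probability by more than the observed differences
perturbed = [p + 1e-7 for p in probs]

local_like = labels_from_probs(probs)
served_like = labels_from_probs(perturbed)
print(local_like, served_like, agreement(local_like, served_like))
```

The thresholded labels match the `label` column in the `head()` output and are unchanged by the perturbation, so any accuracy computed from them is unchanged too.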
