# Open Neural Network Exchange (ONNX)

ONNX is an open format designed to facilitate interoperability between machine learning frameworks. It provides a standardized way to represent models, both deep learning and classical machine learning, so that developers can move them between frameworks and runtimes.

In this tutorial, we will train a model with scikit-learn and export it to ONNX, so that we can use it somewhere else (for example on a mobile phone, an edge device, or just in a container).

### Scenario

For this tutorial, we will use the popular loan prediction task. The [loan prediction dataset](https://www.kaggle.com/datasets/altruistdelhite04/loan-prediction-problem-dataset) contains features such as applicant demographics, income, credit history, and loan details. This dataset is widely used in machine learning tutorials for a classification task: will a loan be granted to the applicant? Controversially, it also serves as a showcase for responsible / ethical AI, since the historical loan data disadvantages women.

## 1. Data Preparation

Data preprocessing and training are not part of the tutorial per se, but we want to be transparent about which preparatory steps have been performed.

In [1]:
import pandas as pd

# NOTE: although we load a file called `train.csv`, we will also use it for model validation.
# we will use the `test.csv` later for the use with ONNX
df = pd.read_csv('./data/train.csv')
# lowercase columns for simplicity
df.columns = map(str.lower, df.columns)
# let's have a look at our data
df.head()

|   | loan_id | gender | married | dependents | education | self_employed | applicantincome | coapplicantincome | loanamount | loan_amount_term | credit_history | property_area | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [2]:
# loan_status is our target column
y = df["loan_status"]

# drop the loan_status (target), loan_id (irrelevant for training) and gender (discriminatory)
X = df.drop(["loan_status", "loan_id", "gender"], axis=1)

### DType-specific transformations

We need to apply different kinds of preprocessing depending on the data type, so we first determine which columns are numerical and which are categorical.

In [3]:
from pandas.api.types import is_numeric_dtype

categorical_features = []
numerical_features = []

for column_name, column in X.items():
    if is_numeric_dtype(column):
        numerical_features.append(column_name)
    else:
        categorical_features.append(column_name)

print(f"Numerical features: {numerical_features}")
print(f"Categorical features: {categorical_features}")

Numerical features: ['applicantincome', 'coapplicantincome', 'loanamount', 'loan_amount_term', 'credit_history']
Categorical features: ['married', 'dependents', 'education', 'self_employed', 'property_area']
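As an aside, pandas can do the same split without an explicit loop. The following is a minimal sketch on a hypothetical two-row frame (`toy` is made up for illustration and only mimics a few of the loan columns); for the column types in this dataset, `select_dtypes` should yield the same two lists as the loop above.

```python
import pandas as pd

# hypothetical miniature of the loan data, for illustration only
toy = pd.DataFrame({
    "applicantincome": [5849, 4583],
    "loanamount": [float("nan"), 128.0],
    "married": ["No", "Yes"],
    "property_area": ["Urban", "Rural"],
})

# "number" selects all integer and float columns; everything else is treated as categorical
toy_numerical = toy.select_dtypes(include="number").columns.tolist()
toy_categorical = toy.select_dtypes(exclude="number").columns.tolist()

print(f"Numerical features: {toy_numerical}")
print(f"Categorical features: {toy_categorical}")
```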


### Train / Test Split

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Model

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    # for numerical features, we impute missing values with the mean.
    ('imputer', SimpleImputer(strategy='mean')),
    # we rescale each feature into the 0-1 range (min-max normalization)
    ('scaler', MinMaxScaler())])

categorical_transformer = Pipeline(steps=[
    # for categorical features, we apply one-hot encoding, since our model - logistic regression - cannot handle categoricals natively
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numerical_features),
        ('categorical', categorical_transformer, categorical_features)])

# our classifier is a logistic regression
clf = LogisticRegression()

# our model pipeline consists of the preprocessors and the classifier
# NOTE: this means that preprocessing is also applied automatically during inference -> no need for a separate preprocessor
pipeline = Pipeline(
    steps=[('preprocessor', preprocessor), ('classifier', clf)]
)
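To make the effect of these preprocessing steps tangible, here is a small sketch that applies the same kind of transformers to a hypothetical four-row frame. The column names `income` and `area` and the values are invented for illustration; they are not taken from the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# hypothetical toy data with one missing numerical value
toy = pd.DataFrame({
    "income": [1000.0, 3000.0, np.nan, 5000.0],
    "area": ["Urban", "Rural", "Urban", "Rural"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("numeric", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ]), ["income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["area"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# the result may be sparse or dense depending on its density, so densify defensively
dense = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)

# one scaled numerical column followed by one column per category of `area`
print(dense.shape)
print(dense)
```

The missing income is replaced by the column mean before scaling, and the single categorical column expands into one indicator column per category.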

In [6]:
model = pipeline.fit(X_train, y_train)

## Evaluate model

In [15]:
# .score() returns the mean accuracy.
model.score(X_test, y_test)

0.7837837837837838

# Convert model to ONNX

There are several Python libraries for converting machine learning artifacts to ONNX. `skl2onnx` is one of the most popular and, unsurprisingly, is used to convert scikit-learn artifacts. Other libraries support TensorFlow, PyTorch, XGBoost and more. GenAI models are supported as well.

## Data Types

The ONNX specification is optimized for numerical computation with tensors. As such, we need to map our pandas data types to ONNX tensor types.
NOTE: ONNX is strongly typed and its definition does not support implicit casts. It is impossible to add two tensors or matrices of different types, even if other languages allow it. That is why an explicit cast must be inserted into the graph.
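The contrast with a dynamically promoting library can be illustrated with plain numpy. This is only an analogy under the assumption that numpy arrays stand in for ONNX tensors; the values and the `offset` array are made up.

```python
import numpy as np

# an integer column (as pandas would store applicant incomes) and a float32 tensor
applicant_income = np.array([5849, 4583, 3000], dtype=np.int64)
offset = np.array([0.5, 0.5, 0.5], dtype=np.float32)

# numpy silently promotes both operands to a common type
implicit = applicant_income + offset
print(implicit.dtype)  # float64

# ONNX-style: cast explicitly so that both operands share one element type
explicit = applicant_income.astype(np.float32) + offset
print(explicit.dtype)  # float32
```

An ONNX graph would reject the first addition; the second form corresponds to having an explicit cast in front of it.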

For the list of scikit-learn estimators and transformers that `skl2onnx` can convert, see https://onnx.ai/sklearn-onnx/supported.html#l-converter-list.

In [None]:
from skl2onnx.common.data_types import FloatTensorType, StringTensorType

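# each column becomes its own model input with shape [None, 1]:
# `None` leaves the number of rows (batch size) open, `1` is the single column.
initial_types = [(categorical, StringTensorType([None, 1])) for categorical in categorical_features] + \
                [(numerical, FloatTensorType([None, 1])) for numerical in numerical_features]
initial_types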

In [None]:
from skl2onnx import to_onnx

# the actual conversion to an ONNX graph is done by `to_onnx()`, which needs the model and the input type information
model_onnx = to_onnx(pipeline, initial_types=initial_types)
with open("model/loan_model.onnx", "wb") as f:
    f.write(model_onnx.SerializeToString())

# Test ONNX model

Now that we have created an ONNX model, let's see how we can infer predictions with it. For this demonstration, we will use `onnxruntime` from Python. This is obviously unnecessarily complicated, since we could just use the sklearn model directly, and it won't make use of hardware acceleration. Yet it allows us to test how we can use a model standardized by ONNX.

In [16]:
import numpy as np

# unfortunately, onnxruntime does not accept pandas DataFrames out of the box. We need to convert our input into a python dict with
# the structure { input_name: input_value }, where input_value is either python-native or a numpy array

input_data = {column: X_test[column].values for column in X_test.columns}
# numerical inputs were declared as FloatTensorType, so they must be float32
for numeric in numerical_features:
    input_data[numeric] = input_data[numeric].astype(np.float32)
# every input was declared with shape [None, 1], so reshape each column from (n,) to (n, 1)
for name in input_data:
    input_data[name] = input_data[name].reshape((input_data[name].shape[0], 1))

In [17]:
import onnxruntime as rt

# initialise the session. A session can be used repeatedly on an inference server/backend
sess = rt.InferenceSession("model/loan_model.onnx")

# invoke the session with the test data
# the first parameter is the list of output names to fetch; None returns all outputs.
# for this classifier, output 0 holds the predicted labels and output 1 the per-class probabilities
pred_onnx = sess.run(None, input_data)
pred_onnx[1]

[{'N': 0.09480604529380798, 'Y': 0.9051939249038696},
 {'N': 0.13759106397628784, 'Y': 0.8624089360237122},
 {'N': 0.18181517720222473, 'Y': 0.8181848526000977},
 {'N': 0.31672602891921997, 'Y': 0.68327397108078},
 {'N': 0.14742478728294373, 'Y': 0.8525751829147339},
 {'N': 0.289922297000885, 'Y': 0.710077702999115},
 {'N': 0.07458186149597168, 'Y': 0.9254181385040283},
 {'N': 0.16964206099510193, 'Y': 0.8303579092025757},
 {'N': 0.16818147897720337, 'Y': 0.8318185210227966},
 {'N': 0.2463846206665039, 'Y': 0.7536153793334961},
 {'N': 0.6307817101478577, 'Y': 0.36921828985214233},
 {'N': 0.21318262815475464, 'Y': 0.7868173718452454},
 {'N': 0.0636473000049591, 'Y': 0.9363527297973633},
 {'N': 0.21156004071235657, 'Y': 0.7884399890899658},
 {'N': 0.14816853404045105, 'Y': 0.8518314361572266},
 {'N': 0.2685273289680481, 'Y': 0.7314726710319519},
 {'N': 0.27646714448928833, 'Y': 0.7235328555107117},
 {'N': 0.15659794211387634, 'Y': 0.8434020280838013},
 {'N': 0.13404718041419983, 'Y': 0.8