sklearn binary classifier: compiled and Python model discrepancies? #87

blokhin · 2019-03-26T15:17:51Z

Dear treelite community,

What could cause discrepancies between the predictions of a sklearn random forest binary classifier in the original Python code and in the compiled code?

I used the gallery interface (treelite.gallery.sklearn) for compilation and treelite.runtime for deployment, in case that matters.

PS: Thanks for treelite. I really enjoy the 10-15x speedup in my compiled regression models (they work just fine).

hcho3 · 2019-03-28T00:07:25Z

@blokhin Is it possible to post your model for debugging purposes? If not, can you make a toy example to demonstrate the issue?

hcho3 · 2019-04-05T00:15:19Z

@blokhin Any updates?

blokhin · 2019-04-05T22:32:15Z

@hcho3 I figured out that the compiled random forest binary classifier outputs not a 0 or 1 class label (as I expected) but a float between 0 and 1. Not a big deal, since this float can be rounded to obtain the label. That was the actual cause of the discrepancy.

hcho3 · 2019-04-05T22:34:37Z

@blokhin Treelite will produce a number between 0 and 1, representing the fraction of the votes among the decision trees for the positive class (e.g. 0.75 means 75% of the trees predicted the positive class).
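To make the arithmetic in that example concrete, here is a minimal sketch that takes the description above at face value: each tree casts one 0/1 vote and the score is the share of positive votes. The helper name `positive_fraction` is hypothetical and is not part of Treelite or scikit-learn.

```python
def positive_fraction(tree_votes):
    """Return the share of trees that voted for the positive class (label 1).

    `tree_votes` is an iterable of 0/1 votes, one per tree. This is an
    illustrative helper, not a Treelite or scikit-learn function.
    """
    votes = list(tree_votes)
    if not votes:
        raise ValueError("at least one tree vote is required")
    return sum(votes) / len(votes)


# Three of four trees vote for the positive class.
score = positive_fraction([1, 1, 1, 0])
print(score)
```

A score of 0.75 here corresponds to the "75% of the trees" case mentioned above.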

blokhin · 2019-04-05T22:34:40Z

Here is sample code to reproduce the behavior:

import random
import numpy as np

from sklearn.ensemble import RandomForestClassifier

import treelite.gallery.sklearn
import treelite.runtime


# First, generate sample data
my_range = range(1, 100)
X_data = [[random.choice(my_range) for _ in range(125)] for _ in range(76)]
X_test = X_data.pop()
y_data = [random.choice([0, 1]) for _ in range(75)]

# Second, prepare a classifier and compile it
model_py = RandomForestClassifier(
    n_estimators=100,
    max_features=2,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=5,
    bootstrap=True,
    n_jobs=-1
)
model_py.fit(X_data, y_data)

model_file = './compiled.so'
icompiler = treelite.gallery.sklearn.import_model(model_py)
icompiler.export_lib(toolchain='clang', libpath=model_file, verbose=True, params={'parallel_comp': 8})
model_tr = treelite.runtime.Predictor(model_file, verbose=True)

# Third, compare: py vs. tr
result_py = model_py.predict([X_test])[0]

batch = treelite.runtime.Batch.from_npy2d(np.array([X_test]))
result_tr = model_tr.predict(batch)

print(result_py, result_tr)
assert result_py == result_tr # I would expect that, but...

hcho3 · 2019-04-05T22:35:20Z

@blokhin Also, try using predict_proba() to get probability values from scikit-learn. Those are comparable to Treelite's output, unlike the class labels returned by predict().
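A sketch of how the two outputs could be reconciled, using only NumPy. The helper names `scores_to_labels` and `scores_match` are hypothetical. Two things are assumptions rather than anything stated in this thread: the 0.5 threshold with a strict comparison, chosen so that a tie maps to class 0, and the tolerance `atol`, which allows for small floating-point differences between the two runtimes.

```python
import numpy as np


def scores_to_labels(scores, threshold=0.5):
    """Convert positive-class scores in [0, 1] to hard 0/1 labels.

    Scores strictly above `threshold` map to 1; everything else,
    including an exact tie, maps to 0 (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float).ravel()
    return (scores > threshold).astype(int)


def scores_match(proba_positive, compiled_scores, atol=1e-6):
    """Check that two arrays of positive-class scores agree within `atol`."""
    a = np.asarray(proba_positive, dtype=float).ravel()
    b = np.asarray(compiled_scores, dtype=float).ravel()
    return a.shape == b.shape and bool(np.allclose(a, b, atol=atol))


# Stand-in values; in the script above these would come from the two models.
print(scores_to_labels([0.2, 0.75, 0.5]))
print(scores_match([0.75], [0.75]))
```

With the reproduction script, `model_py.predict_proba([X_test])[:, 1]` would take the place of `proba_positive` and `result_tr` the place of `compiled_scores`. To compare class labels instead, `scores_to_labels(result_tr)` could be checked against `model_py.predict([X_test])`.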

blokhin · 2019-04-05T22:36:40Z

OK, got it!

hcho3 closed this as completed Apr 5, 2019
