MLeap Transformer schema is wrong #618
Comments
@Ben-Epstein is it possible to send me some sample data so that I can try this out?
Or if you could please send the serialized zip, that would be good as well.
I followed the MLeap demo for PySpark (https://github.com/ancasarb/mleap-demo/blob/master/notebooks/PySpark%20-%20AirBnb.ipynb), serialized the pipeline, and re-loaded it in Scala as below
using the frame.airbnb.json file as an example,
and everything worked fine. Am I missing anything? How are you doing the prediction?
@ancasarb the dataset is in the code linked above. It's the load_breast_cancer() dataset from sklearn. Let me know if you need any help!
Is there any update on this?
@Ben-Epstein Looking at the code for the pipeline schema here, it does seem that we use a hash map to collect the schema for the pipeline, so the field order could differ. However, this shouldn't be an issue at scoring time. In the serialized vector assembler, I can see
so the order is correct. If I then create a leap frame
then the model scores fine, even if the order we provide in the JSON differs from the order in the schema or from the ordering of the columns in the training data. As long as the "fields" and "rows" in the JSON follow the same ordering (any ordering), the scoring will be correct. Hope this helps; let me know if you have further questions.
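The point about "fields" and "rows" can be illustrated with a small Python sketch. It assumes the usual leap frame JSON layout (a "schema" object holding a "fields" list, plus a "rows" list of value lists). The feature names and values are made up for illustration, and `reorder_frame` and `as_records` are hypothetical helpers, not part of MLeap.

```python
import json


def reorder_frame(frame, new_order):
    """Return a copy of a leap-frame-like dict with its columns permuted.

    The same permutation is applied to the schema fields and to every row,
    so each value stays attached to its field name.
    """
    fields = frame["schema"]["fields"]
    index = {field["name"]: i for i, field in enumerate(fields)}
    positions = [index[name] for name in new_order]
    return {
        "schema": {"fields": [fields[i] for i in positions]},
        "rows": [[row[i] for i in positions] for row in frame["rows"]],
    }


def as_records(frame):
    """View each row as a name -> value mapping, independent of column order."""
    names = [field["name"] for field in frame["schema"]["fields"]]
    return [dict(zip(names, row)) for row in frame["rows"]]


# Illustrative frame; the names and values are assumptions, not thread data.
frame = {
    "schema": {"fields": [
        {"name": "mean_radius", "type": "double"},
        {"name": "mean_texture", "type": "double"},
        {"name": "mean_area", "type": "double"},
    ]},
    "rows": [[17.99, 10.38, 1001.0], [20.57, 17.77, 1326.0]],
}

shuffled = reorder_frame(frame, ["mean_area", "mean_radius", "mean_texture"])
print(json.dumps(shuffled, indent=2))
```

Both frames describe the same name-to-value pairs, which is the property the comment above relies on: what matters is that the header and the rows agree with each other, not which ordering they share.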
Closing this in preparation for the 0.16.0 release. Please reopen if you still have questions; hopefully the clarification above makes sense.
@ancasarb thank you for checking the code; however, I'm still getting incorrect predictions with the code above. Did you run it and confirm that predictions from the MLeap transformer match the original Spark model? All of the necessary code is above.
@Ben-Epstein Yes, I ran it without issues. Could you please share a leap frame that you use for the scoring?
Worked fine for me with the latest version, ml.combust.mleap:mleap-spark_2.11:0.16.0.
Closing this; please re-open if you're still struggling with it.
After creating a PySpark model and serializing it to a bundle, I read in the MLeap transformer and make a prediction, but the prediction is wrong.
Upon investigation, I've found that the inputSchema of the model has been modified, so the features are in the wrong order. If you simply print out the PipelineModel, it shows the features in the correct order, but calling inputSchema gives an incorrect order.
@abaveja313
Code to reproduce:
Model:
Serialization and movement to HDFS:
Reading in the model:
Printing the inputSchema:
As you can see, the inputSchema is wrong, causing all predictions to be wrong. I've reproduced the same behavior with LogisticRegression models as well. I'm stuck here because, without being able to generate the schema, I have to specify it each time, which makes the code non-reproducible.
Is there something I'm doing wrong here or missing? Help would be greatly appreciated!
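One possible way to avoid hard-coding the column order is to key the input by field name and build each row in whatever order the loaded schema reports. The following is a minimal Python sketch under that assumption; how the field names are obtained from the transformer is not shown, and `rows_for_schema` and the sample names are hypothetical.

```python
def rows_for_schema(field_names, records):
    """Build row lists whose column order follows field_names.

    Each record is a dict keyed by field name, so the caller never has to
    know or repeat the order the schema happens to use.
    """
    rows = []
    for record in records:
        missing = [name for name in field_names if name not in record]
        if missing:
            raise KeyError(f"record is missing fields: {missing}")
        rows.append([record[name] for name in field_names])
    return rows


# Hypothetical order, as it might be reported by a loaded schema.
schema_order = ["mean_area", "mean_radius", "mean_texture"]

# Input keyed by name; the key order here is irrelevant.
records = [
    {"mean_radius": 17.99, "mean_texture": 10.38, "mean_area": 1001.0},
]

rows = rows_for_schema(schema_order, records)
print(rows)
```

With this approach the rows always line up with the reported field order, whatever that order turns out to be.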