New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: [VRD-711] Add batch prediction method to client #3645
Conversation
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpicknitpick
creds=creds, | ||
token=TOKEN, | ||
) | ||
# the input below is entirely irrelevant since it"s smaller than the batch size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you cmd-F replace '
with "
# the input below is entirely irrelevant since it"s smaller than the batch size | |
# the input below is entirely irrelevant since it's smaller than the batch size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya
# the input below is entirely irrelevant since it"s smaller than the batch size | ||
prediction_df = dm.batch_predict(pd.DataFrame({"hi": "bye"}, index=[1]), 10) | ||
# Since no index was provided, we can"t guarantee the index type for assertions | ||
pd.testing.assert_frame_equal(expected_df.reset_index(drop=True), prediction_df.reset_index(drop=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I feel like resetting the index conceptually weakens the test (now we can't check index values if we ever wanted to).
assert_frame_equal()
appears to have a check_index_type
parameter. Would that have worked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different tests with indexes, does that change your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the check_index_type
and it didn't do enough 😭
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, somewhere along the line, number indexes are converting to string indexes and that's sketchy. I should look into that.
input_df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [11, 12, 13, 14, 15]}) | ||
prediction_df = dm.batch_predict(input_df, 1) | ||
expected_final_df = pd.concat(expected_d_list) | ||
# Since no index was provided, we can"t guarantee the index type for assertions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Since no index was provided, we can"t guarantee the index type for assertions | |
# Since no index was provided, we can't guarantee the index type for assertions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you tell I find-replaced all the single quotes? 😅
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
Co-authored-by: Liu <96442646+liuverta@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
if not isinstance(body, bytes): | ||
body = body.encode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can omit this since we're firmly in Python 3, now. json.dumps()
will give us a utf-8
str
token=TOKEN, | ||
) | ||
input_df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [11, 12, 13, 14, 15]}) | ||
prediction_df = dm.batch_predict(input_df, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is a bit hard to read (and therefore probably hard to main/debug); it's unclear that the batch_size=1
here forces us to have five batches because input_df
has five rows, and that it's related to expected_df_list
having five DataFrames.
But I don't know how I'd write it any better 🤷
Impact and Context
Create a
batch_predict
method that accepts and returnspandas.DataFrame
s. This method takes a dataframe, splits it into smaller dataframes of the providedbatch_size
to make predictions against the model, then reassembles the output to return to the user as one dataframe.Risks and Area of Effect
Doesn't affect existing features.
Testing
Added unit tests to cover the basic functionality. Planning to add more complex property tests shortly, in a separate PR. Also deployed to a dev env to ensure the pipeline works from end to end.
Also, python model tests look good with the updated json request with nans thing: https://jenkins.dev.verta.ai/job/test/job/python-models/job/python-models/34/
Reverting