Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for large numbers of features #103

Merged
merged 32 commits into from
Oct 21, 2021
Merged

Conversation

wasade
Copy link
Member

@wasade wasade commented Apr 27, 2021

This pull requests expands on the server side logic to support bulk load operations for obtaining indices from identifiers, and the load of identifier specific data.

The motivation is to support tables containing millions of identifiers. When performed individually, these operations require milliseconds per query, and 1 million times 1 millisecond begins to get large. What we're in effect doing here is packing in more data into the individual requests to reduce the HTTP request/response overhead.

@wasade wasade requested a review from antgonza April 28, 2021 01:53
Copy link
Collaborator

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wasade, looks good, thank you.

Some minor comments. Also, would it be worth adding the batch sizes as global or env variables and defaulting to the current values in the code if not present? This will allow changing that values on the fly for debugging or testing the batch sizes. If you agree and out of the scope of this PR, fine to open as an issue.

redbiom/_requests.py Outdated Show resolved Hide resolved
redbiom/admin.py Outdated Show resolved Hide resolved
redbiom/admin.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of extra comments + the one about batch size.

req = s.post(config['hostname'],
data=_format_request(context, cmd, payload))

if verbose:
print(context, cmd, payload[:100])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessary but have you consider the logging python package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I would like to consider that out of scope of this PR


Notes
-----
This method only supports count data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be checked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, added

@wasade
Copy link
Member Author

wasade commented Oct 21, 2021

Sorry, missed the batchsize comment. I don't think it's understood well enough to motivate centralizing. At this point, based on what I know, I suspect that effort will not be for gain.

1 similar comment
@wasade
Copy link
Member Author

wasade commented Oct 21, 2021

Sorry, missed the batchsize comment. I don't think it's understood well enough to motivate centralizing. At this point, based on what I know, I suspect that effort will not be for gain.

@antgonza antgonza merged commit d4af5a2 into biocore:master Oct 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants