Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Optimizations for Non-Bulk Indexing #26

Merged
merged 11 commits into from
Mar 15, 2021

Conversation

ric-evans
Copy link
Member

@ric-evans ric-evans commented Mar 10, 2021

The indexer was designed for bulk-indexing. However, it may become necessary to point the indexer at a single file (or a small batch). Until now, that would require a not insignificant overhead.

  • Replaced --no-patch with --patch
    • patching is not common
  • Check if the file already exists in the FC before collecting its metadata
    • unless given --patch, skip the file
    • this will save significant time
    • also decreases (but does not eliminate) the need for blacklists (--blacklist-file)
  • Optimize for single-processed indexing (--processes 1 / the default case)
    • successive child processes are no longer spawned when there's only one process
  • Added option for not recursively indexing (-n/--non-recursive)
    • the indexer will only index the filepaths it's directly given, and will ignore any directories
    • it's a freeby, so might as well add it
  • Added --blacklist for shorter blacklists

closes #23
closes #24

@ric-evans ric-evans added the enhancement New feature or request label Mar 10, 2021
@ric-evans ric-evans self-assigned this Mar 10, 2021
@ric-evans ric-evans changed the title Performance Optimization Performance Optimizations Mar 10, 2021
@ric-evans ric-evans changed the title Performance Optimizations Performance Optimizations for Non-Bulk Indexing Mar 10, 2021
Copy link
Contributor

@dsschult dsschult left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good.

try:
if is_processable_path(p):
if os.path.isfile(p):
await process_file(p, manager, fc_rc, no_patch, dryrun)
await index_file(p, manager, fc_rc, patch, dryrun)
Copy link
Contributor

@dsschult dsschult Mar 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want even more performance, you can run multiple index_file operations in parallel, which may help if it is waiting on a FC response. Example: https://github.com/WIPACrepo/iceprod/blob/master/iceprod/server/scheduled_tasks/update_task_priority.py#L66

Copy link
Member Author

@ric-evans ric-evans Mar 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep that in mind if it becomes a noticeable issue. The FC turnaround-time actually isn't all that bad usually. The checksums and I3Reader reading are the most expensive ops by far.

Plus, as a consequence of multi-threading, I think parallel I/O operations would actually slow things down.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be nice is to re-design the multi-processing because currently the parent process isn't doing much, which just lowers CPU utilization.

@ric-evans
Copy link
Member Author

depends on WIPACrepo/iceprod#294 for git pips

@ric-evans ric-evans merged commit 0d39ab1 into master Mar 15, 2021
@ric-evans ric-evans deleted the performance-improvements branch March 15, 2021 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Query First Before Checksum (Performance Optimization) By Default Don't Patch
2 participants