Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP][Dependent] Fix task creation with cloud storage data #7903

Open
wants to merge 17 commits into
base: develop
Choose a base branch
from

Conversation

Marishka17
Copy link
Contributor

@Marishka17 Marishka17 commented May 17, 2024

Motivation and context

This PR contains several fixes:

  • Optimized memory usage by not keeping all downloaded images/part of images in memory while creating a manifest file
  • Optimized the number of requests to CS providers by downloading only images from a specified range (use_cache in True/False)

How has this been tested?

This table contains metrics (required time and memory usage) obtained before this fix and after.

with/without fix Threads count Time for partially files downloading & preparing manifest Memory usage diff after _create_task_manifest_from_cloud_data (MB)
without 1 0:01:02.797817 222
without 2 0:00:11.307688 225
with 1 0:00:12.622483 16.5
with 2 0:00:13.004692 17.2

Testing dataset contains 4000 images, its size is about 400MB, 64 KB was downloaded on average for each image.

Details
  after fix:
  1 thread
  Required time:  0:00:12.622483
  280.7 MiB      0.0 MiB           1                   if data['stop_frame'] is not None:
     727                                                             stop_frame = min(stop_frame, data['stop_frame'])
     728    297.2 MiB     16.5 MiB           2                   _create_task_manifest_from_cloud_data(
     729    280.7 MiB      0.0 MiB           1                       db_data.cloud_storage, sorted_media, manifest,
     730    280.7 MiB      0.0 MiB           1                       start_frame=db_data.start_frame, stop_frame=stop_frame, step=db_data.get_frame_step()
     731                                                         )
     732                                         
     733    297.2 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  2 threads
  Required time:  0:00:13.004692
  726    281.1 MiB      0.0 MiB           1                   if data['stop_frame'] is not None:
     727                                                             stop_frame = min(stop_frame, data['stop_frame'])
     728    298.3 MiB     17.2 MiB           2                   _create_task_manifest_from_cloud_data(
     729    281.1 MiB      0.0 MiB           1                       db_data.cloud_storage, sorted_media, manifest,
     730    281.1 MiB      0.0 MiB           1                       start_frame=db_data.start_frame, stop_frame=stop_frame, step=db_data.get_frame_step()
     731                                                         )
     732                                         
     733    298.3 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  without fix:
  1 thread
  Required time:  0:01:02.797817
  688    280.5 MiB      0.0 MiB           1               if manifest_file:
     689                                                         # Define task manifest content based on cloud storage manifest content and uploaded files
     690                                                         _create_task_manifest_based_on_cloud_storage_manifest(
     691                                                             sorted_media, cloud_storage_manifest_prefix,
     692                                                             cloud_storage_manifest, manifest)
     693                                                     else: # without manifest file but with use_cache option
     694                                                         # Define task manifest content based on list with uploaded files
     695    502.6 MiB    222.1 MiB           1                   _create_task_manifest_from_cloud_data(db_data.cloud_storage, sorted_media, manifest)
     696                                         
     697    502.6 MiB      0.0 MiB           1       av_scan_paths(upload_dir)
  
  2 threads
  Required time:  0:00:11.307688
  280.8 MiB      0.0 MiB           1               if manifest_file:
     689                                                         # Define task manifest content based on cloud storage manifest content and uploaded files
     690                                                         _create_task_manifest_based_on_cloud_storage_manifest(
     691                                                             sorted_media, cloud_storage_manifest_prefix,
     692                                                             cloud_storage_manifest, manifest)
     693                                                     else: # without manifest file but with use_cache option
     694                                                         # Define task manifest content based on list with uploaded files
     695    506.0 MiB    225.2 MiB           1                   _create_task_manifest_from_cloud_data(db_data.cloud_storage, sorted_media, manifest)
     696                                         
     697    506.0 MiB      0.0 MiB           1       av_scan_paths(upload_dir)

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary
    (cvat-canvas,
    cvat-core,
    cvat-data and
    cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

  • New Features

    • Introduced threading logic for bulk downloads to enhance performance based on CPU capabilities and file count.
    • Added new classes Named and NamedBytesIO for better handling of file attributes in image processing.
  • Improvements

    • Optimized memory usage and request handling for creating manifest files.
    • Enhanced handling of image sources and iteration logic in DatasetImagesReader.
  • Configuration

    • Added global settings for cloud data downloading thread management.
  • Bug Fixes

    • Improved retrieval of preview information from manifests to ensure robustness.
  • Documentation

    • Updated internal documentation to reflect new threading logic and image processing enhancements.

@Marishka17 Marishka17 added bug Something isn't working server labels May 17, 2024
Copy link
Contributor

coderabbitai bot commented May 17, 2024

Walkthrough

The recent updates enhance the CVAT application by optimizing cloud data handling, improving memory usage, and refining threading logic. Key changes include updating methods for preview retrieval, bulk downloading with threading, and updating image processing in manifests. New classes and global variables were introduced to support these enhancements.

Changes

File(s) Change Summary
cvat/apps/engine/cache.py Modified preview retrieval method to use get_first_not_empty_item().
cvat/apps/engine/cloud_provider.py Added threading logic for bulk downloads and updated return type of bulk_download_to_memory.
cvat/apps/engine/task.py Introduced new parameters and methods for task creation and progress updates.
utils/dataset_manifest/core.py Enhanced DatasetImagesReader to handle different source types and improved iteration logic.
cvat/settings/base.py Introduced global variables for cloud data downloading thread management.
changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md Documented optimizations in memory usage and request handling for image processing.
utils/dataset_manifest/types.py Added Named and NamedBytesIO classes for better data handling.
utils/dataset_manifest/__init__.py Added import statement for InvalidManifestError.

In the world of code, where bytes do flow,
Changes bring a brighter glow.
Threads now dance, in harmony they play,
Making tasks swift, in a cloud ballet.
With names and streams, the data sings,
Progress leaps on rapid wings.
🌟✨🐇

Tip

New Features and Improvements

Review Settings

Introduced new personality profiles for code reviews. Users can now select between "Chill" and "Assertive" review tones to tailor feedback styles according to their preferences. The "Assertive" profile posts more comments and nitpicks the code more aggressively, while the "Chill" profile is more relaxed and posts fewer comments.

AST-based Instructions

CodeRabbit offers customizing reviews based on the Abstract Syntax Tree (AST) pattern matching. Read more about AST-based instructions in the documentation.

Community-driven AST-based Rules

We are kicking off a community-driven initiative to create and share AST-based rules. Users can now contribute their AST-based rules to detect security vulnerabilities, code smells, and anti-patterns. Please see the ast-grep-essentials repository for more information.

New Static Analysis Tools

We are continually expanding our support for static analysis tools. We have added support for biome, hadolint, and ast-grep. Update the settings in your .coderabbit.yaml file or head over to the settings page to enable or disable the tools you want to use.

Tone Settings

Users can now customize CodeRabbit to review code in the style of their favorite characters or personalities. Here are some of our favorite examples:

  • Mr. T: "You must talk like Mr. T in all your code reviews. I pity the fool who doesn't!"
  • Pirate: "Arr, matey! Ye must talk like a pirate in all yer code reviews. Yarrr!"
  • Snarky: "You must be snarky in all your code reviews. Snark, snark, snark!"

Revamped Settings Page

We have redesigned the settings page for a more intuitive layout, enabling users to find and adjust settings quickly. This change was long overdue; it not only improves the user experience but also allows our development team to add more settings in the future with ease. Going forward, the changes to .coderabbit.yaml will be reflected in the settings page, and vice versa.

Miscellaneous

  • Turn off free summarization: You can switch off free summarization of PRs opened by users not on a paid plan using the enable_free_tier setting.
  • Knowledge-base scope: You can now set the scope of the knowledge base to either the repository (local) or the organization (global) level using the knowledge_base setting. In addition, you can specify Jira project keys and Linear team keys to limit the knowledge base scope for those integrations.
  • High-level summary placement: You can now customize the location of the high-level summary in the PR description using the high_level_summary_placeholder setting (default @coderabbitai summary).
  • Revamped request changes workflow: You can now configure CodeRabbit to auto-approve or request changes on PRs based on the review feedback using the request_changes_workflow setting.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to full the review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Marishka17
Copy link
Contributor Author

/check

Copy link
Contributor

github-actions bot commented May 17, 2024

❌ Some checks failed
📄 See logs here

@Marishka17
Copy link
Contributor Author

/check

Copy link
Contributor

github-actions bot commented May 17, 2024

❌ Some checks failed
📄 See logs here

Copy link
Contributor

github-actions bot commented May 17, 2024

❌ Some checks failed
📄 See logs here

1 similar comment
Copy link
Contributor

github-actions bot commented May 17, 2024

❌ Some checks failed
📄 See logs here

utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
Comment on lines 716 to 717
stop_frame = len(sorted_media)
if data['stop_frame'] is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor code to remove duplicates

@Marishka17 Marishka17 marked this pull request as ready for review May 20, 2024 07:58
@Marishka17 Marishka17 requested a review from azhavoro as a code owner May 20, 2024 07:58
@Marishka17
Copy link
Contributor Author

@zhiltsov-max, Could you please take a look at the PR? I've added small comments that should be fixed but I can fix them only in the evening probably.

@Marishka17 Marishka17 requested review from zhiltsov-max and removed request for azhavoro May 20, 2024 08:00
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Out of diff range and nitpick comments (1)
cvat/apps/engine/task.py (1)

Line range hint 893-893: Change the membership test to use not in for correct logical evaluation.

- if not chunk_path.endswith(f"{properties['name']}{properties['extension']}"):
+ if chunk_path not in f"{properties['name']}{properties['extension']}":
Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 1a86ccd and 40fb577.
Files selected for processing (4)
  • cvat/apps/engine/cache.py (1 hunks)
  • cvat/apps/engine/cloud_provider.py (4 hunks)
  • cvat/apps/engine/task.py (4 hunks)
  • utils/dataset_manifest/core.py (8 hunks)
Additional Context Used
Ruff (2)
cvat/apps/engine/task.py (1)

893-893: Test for membership should be not in

utils/dataset_manifest/core.py (1)

788-788: Test for membership should be not in

Additional comments not posted (6)
cvat/apps/engine/cache.py (1)

315-315: Change to use manifest.get_first_not_empty_item() improves robustness by ensuring a non-empty item is retrieved.

Verification successful

The _prepare_cloud_preview method is properly integrated within the _get_or_set_cache_item method, ensuring that the returned values are correctly handled and cached.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify integration of `_prepare_cloud_preview` with other components.

# Test: Search for usages of `_prepare_cloud_preview`. Expect: Proper handling of the returned values.
rg --type python '_prepare_cloud_preview'

Length of output: 72



Script:

#!/bin/bash
# Description: Verify integration of `_prepare_cloud_preview` with other components.

# Test: Search for usages of `_prepare_cloud_preview`. Expect: Proper handling of the returned values.
rg --type py '_prepare_cloud_preview'

Length of output: 220



Script:

#!/bin/bash
# Extract and review the context around the usages of `_prepare_cloud_preview`.

# Extract 5 lines before and after the first usage
rg --type py '_prepare_cloud_preview' -A 5 -B 5

Length of output: 1472



Script:

#!/bin/bash
# Extract the `_get_or_set_cache_item` method for review.

# Search for the `_get_or_set_cache_item` method definition and its content
ast-grep --lang python --pattern $'def _get_or_set_cache_item($_, $_):\n    $$$\n'

Length of output: 81



Script:

#!/bin/bash
# Search for the `_get_or_set_cache_item` method definition and its content.

# Search for the method definition
rg 'def _get_or_set_cache_item' -A 20

Length of output: 1417

utils/dataset_manifest/core.py (1)

196-222: Updates to _get_img_properties enhance handling of different image source types. Verify integration with other components.

Verification successful

The _get_img_properties method is used multiple times within the utils/dataset_manifest/core.py file, primarily in generator expressions. The context shows that the method is called to yield image properties, and the returned values are handled appropriately.

  • Lines where _get_img_properties is called:
    • utils/dataset_manifest/core.py: yield self._get_img_properties(image)

The integration appears to be correct, with the method's return values being used as expected.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify integration of `_get_img_properties` with other components.

# Test: Search for usages of `_get_img_properties`. Expect: Proper handling of the returned values.
rg --type python '_get_img_properties'

Length of output: 69



Script:

#!/bin/bash
# Description: Verify integration of `_get_img_properties` with other components.

# Test: Search for usages of `_get_img_properties`. Expect: Proper handling of the returned values.
rg --type py '_get_img_properties'

Length of output: 510



Script:

#!/bin/bash
# Description: Extract and review the context of `_get_img_properties` calls.

# Extract lines where `_get_img_properties` is called along with some surrounding lines for context.
rg --type py '_get_img_properties' -A 3 -B 3

Length of output: 2351

cvat/apps/engine/cloud_provider.py (2)

9-9: Introduce threading and constants for performance optimization.

The introduction of threading using ThreadPoolExecutor and the constants CPU_NUMBER, MAX_THREADS_NUMBER, and NUMBER_OF_FILES_PER_THREAD are well-aligned with the PR's objective to optimize performance. This should effectively manage the number of threads based on the CPU capabilities and the number of files, which is crucial for handling large volumes of data efficiently.

Also applies to: 14-15, 40-42


185-185: Enhance file downloading functions with optimal threading.

The modifications in bulk_download_to_memory and bulk_download_to_dir to incorporate threading logic are commendable. These changes should significantly reduce the time required for downloading files by parallelizing the process. The use of ThreadPoolExecutor and the conditional logic to determine the number of threads based on the workload are appropriate and should ensure that the system resources are used efficiently.

Also applies to: 205-219, 225-237

cvat/apps/engine/task.py (2)

484-502: Optimized task manifest creation from cloud data to handle large datasets more efficiently.


716-722: Refactor to remove duplicate code as suggested in previous comments.

cvat/apps/engine/task.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented May 20, 2024

Codecov Report

Attention: Patch coverage is 90.51095% with 13 lines in your changes are missing coverage. Please review.

Project coverage is 83.37%. Comparing base (f892bd4) to head (c9867b0).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7903      +/-   ##
===========================================
- Coverage    83.37%   83.37%   -0.01%     
===========================================
  Files          381      382       +1     
  Lines        40053    40123      +70     
  Branches      3769     3769              
===========================================
+ Hits         33395    33451      +56     
- Misses        6658     6672      +14     
Components Coverage Δ
cvat-ui 79.25% <ø> (-0.06%) ⬇️
cvat-server 87.15% <90.51%> (+0.02%) ⬆️

Copy link
Contributor

@zhiltsov-max zhiltsov-max left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a few places where stop is changed from (the last valid index) to (the first index after the last valid index). In most places in CVAT it's used in the first meaning, I suggest avoiding changing this to avoid confusion.

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
cvat/apps/engine/task.py Outdated Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Show resolved Hide resolved
cvat/apps/engine/task.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
@Marishka17 Marishka17 requested a review from nmanovic as a code owner May 23, 2024 11:02
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Outside diff range and nitpick comments (3)
utils/dataset_manifest/core.py (2)

144-153: Consider documenting the DatasetImagesReader constructor parameters.

Adding docstrings to the constructor would improve code readability and maintainability by clearly explaining the purpose and expected types of each parameter, especially for complex functions like this one.


Line range hint 761-761: Correct the membership test in the condition.

- if not _dict['version'] in _Manifest.SupportedVersion.choices():
+ if _dict['version'] not in _Manifest.SupportedVersion.choices():

This change corrects the syntax for the membership test to follow Python best practices and improve readability.

cvat/apps/engine/task.py (1)

Line range hint 895-895: Consider using not in for membership tests to enhance readability and performance.

- if not manifest_file or not os.path.isfile(os.path.join(manifest_root, manifest_file)):
+ if manifest_file not in os.path.isfile(os.path.join(manifest_root, manifest_file)):
Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 40fb577 and 64e3218.
Files selected for processing (4)
  • changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md (1 hunks)
  • cvat/apps/engine/cloud_provider.py (4 hunks)
  • cvat/apps/engine/task.py (4 hunks)
  • utils/dataset_manifest/core.py (8 hunks)
Files skipped from review due to trivial changes (1)
  • changelog.d/20240523_120716_maria_fix_task_creation_with_cs_data.md
Additional Context Used
Ruff (2)
cvat/apps/engine/task.py (1)

895-895: Test for membership should be not in

utils/dataset_manifest/core.py (1)

761-761: Test for membership should be not in

Additional comments not posted (5)
utils/dataset_manifest/core.py (1)

168-170: Clarify the necessity of the stop parameter with generators.

The current implementation raises a ValueError if the stop parameter is not passed when a generator is used. This is crucial for preventing runtime errors due to undefined behavior when the end of the generator is not known.

Verification successful

The DatasetImagesReader class constructor includes a check for whether the sources parameter is a generator using the isgenerator function. This check sets the _is_generator_used attribute. The code snippet provided in the review comment ensures that if a generator is used, the stop parameter must be provided to avoid runtime errors due to undefined behavior when the end of the generator is not known.

  • The stop parameter is crucial when using generators to define the endpoint of the iteration, preventing potential runtime errors.
Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Test: Search for usages of DatasetImagesReader with a generator. Expect: Proper handling of the `stop` parameter.
rg --type python 'DatasetImagesReader\([^)]*isgenerator'

Length of output: 89



Script:

#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Test: Search for usages of DatasetImagesReader with a generator. Expect: Proper handling of the `stop` parameter.
rg 'DatasetImagesReader\([^)]*isgenerator'

Length of output: 44



Script:

#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Search for the `isgenerator` parameter usage in the codebase.
rg 'isgenerator'

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern $'class DatasetImagesReader {
  $$$
}'

Length of output: 696



Script:

#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern $'class DatasetImagesReader {
  $$$
}' utils/dataset_manifest/core.py

Length of output: 561



Script:

#!/bin/bash
# Description: Verify the handling of the `stop` parameter when generators are used.

# Inspect the `DatasetImagesReader` class definition to understand how it handles generators.
ast-grep --lang python --pattern 'class DatasetImagesReader { $$$ }' utils/dataset_manifest/core.py

# Fallback: Search for the `DatasetImagesReader` class definition using `rg`.
rg 'class DatasetImagesReader' utils/dataset_manifest/core.py

Length of output: 648



Script:

#!/bin/bash
# Description: Extract the `DatasetImagesReader` class definition for manual inspection.

# Extract the `DatasetImagesReader` class definition from `utils/dataset_manifest/core.py`.
sed -n '/class DatasetImagesReader:/,/^$/p' utils/dataset_manifest/core.py

Length of output: 511

cvat/apps/engine/cloud_provider.py (1)

191-191: Ensure proper error handling in image download logic.

The method optimally_image_download attempts to parse an image and falls back to downloading the entire file if the initial chunk is insufficient. Verify that this fallback mechanism works correctly across different image formats and sizes to ensure robustness.

cvat/apps/engine/task.py (3)

484-504: The updates to _create_task_manifest_from_cloud_data function align well with the PR objectives to optimize cloud storage interactions.


677-688: The implementation of _download_data_from_cloud_storage is efficient and uses cloud storage capabilities effectively.


484-504: The data validation logic in _validate_data is robust, ensuring that the uploaded data conforms to expected formats and quantities.

utils/dataset_manifest/core.py Show resolved Hide resolved
utils/dataset_manifest/core.py Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Outside diff range and nitpick comments (1)
utils/dataset_manifest/core.py (1)

144-153: Ensure consistent use of type hints.

The constructor for DatasetImagesReader uses type hints for some parameters but not for others. It would improve readability and maintainability if all parameters had explicit type hints.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 64e3218 and 914b7bc.
Files selected for processing (3)
  • cvat/apps/engine/cloud_provider.py (4 hunks)
  • cvat/settings/base.py (1 hunks)
  • utils/dataset_manifest/core.py (8 hunks)
Additional Context Used
Ruff (5)
cvat/settings/base.py (4)

34-34: Module level import not at top of file


36-36: Module level import not at top of file


80-80: keys.secret_key.SECRET_KEY imported but unused


718-718: Module level import not at top of file

utils/dataset_manifest/core.py (1)

761-761: Test for membership should be not in

Additional comments not posted (3)
utils/dataset_manifest/core.py (1)

162-163: Good enforcement of method constraints.

Raising a ValueError when a non-predefined sorting method is used with a generator is a good practice as it enforces the method's constraints explicitly.

cvat/apps/engine/cloud_provider.py (2)

223-231: Implementation of bulk_download_to_dir handles threading effectively.


223-231: Cloud storage instance creation is well-handled with comprehensive support for different providers and attributes.

cvat/apps/engine/cloud_provider.py Show resolved Hide resolved
cvat/settings/base.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
utils/dataset_manifest/core.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Outside diff range and nitpick comments (4)
utils/dataset_manifest/core.py (2)

Line range hint 7-7: Remove unused import io.BytesIO.

- from io import StringIO, BytesIO
+ from io import StringIO

This change cleans up the imports by removing the unused BytesIO, which is not referenced anywhere in this file.


Line range hint 763-763: Correct the membership test in is_empty.

- if self._index.is_empty():
+ if not self._index.is_empty():

This change corrects the logic to properly check if the index is not empty before loading it, which aligns with the intended functionality.

cvat/apps/engine/cloud_provider.py (2)

9-9: Consider making the math import configurable or ensure it's necessary here as it's only used for thread normalization.


Line range hint 184-212: The method optimally_image_download is well-implemented with detailed logging and error handling. However, consider adding more comments to explain the logic, especially around the conditions that determine whether the entire file needs to be downloaded.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 914b7bc and c1b6c8b.
Files selected for processing (3)
  • cvat/apps/engine/cloud_provider.py (8 hunks)
  • utils/dataset_manifest/core.py (10 hunks)
  • utils/dataset_manifest/types.py (1 hunks)
Additional Context Used
Ruff (2)
utils/dataset_manifest/core.py (2)

7-7: io.BytesIO imported but unused


763-763: Test for membership should be not in

Additional comments not posted (5)
utils/dataset_manifest/types.py (1)

8-9: The Named protocol is well-defined and serves its purpose effectively.

utils/dataset_manifest/core.py (2)

199-225: Refactor the _get_img_properties method for clarity.

This refactor simplifies the method by reducing the number of conditionals and improving the structure of the code for better readability and maintainability.


227-233: Optimize the iteration logic in __iter__.

This change uses islice directly in the loop to simplify the iteration and avoid unnecessary checks and range creations inside the loop.

cvat/apps/engine/cloud_provider.py (2)

35-43: The implementation of NamedBytesIO with filename as a property is well-done. This encapsulation enhances the maintainability of the code.


149-149: The method download_fileobj in different storage classes (_CloudStorage, AWS_S3, AzureBlobContainer, GoogleCloudStorage) is consistently implemented. This ensures uniform behavior across different cloud providers, which is good for maintainability and debugging.

Also applies to: 533-541, 735-744, 849-854

cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
utils/dataset_manifest/types.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (1)
utils/dataset_manifest/core.py (1)

Line range hint 763-763: Use not in for membership tests to improve readability.

- if not _dict['version'] in _Manifest.SupportedVersion.choices():
+ if _dict['version'] not in _Manifest.SupportedVersion.choices():

This change makes the membership test more readable and aligns with common Python practices.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between c1b6c8b and b0efba5.
Files selected for processing (1)
  • utils/dataset_manifest/core.py (10 hunks)
Additional Context Used
Ruff (1)
utils/dataset_manifest/core.py (1)

763-763: Test for membership should be not in

utils/dataset_manifest/core.py Show resolved Hide resolved
utils/dataset_manifest/core.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between b0efba5 and 0e71a05.
Files selected for processing (1)
  • cvat/apps/engine/cloud_provider.py (8 hunks)
Additional comments not posted (1)
cvat/apps/engine/cloud_provider.py (1)

Line range hint 200-228: The method optimally_image_download attempts to download just enough of the image to parse its properties, which is a good optimization. However, ensure that the error handling is robust, especially in cases where the image properties cannot be determined with the initial chunk. Consider logging or handling these cases explicitly.

Verification successful

The optimally_image_download method does include a warning log when the initial chunk is insufficient to determine the image properties. This log provides detailed information about the image size and resolution, which is a good practice for error handling and debugging.

  • The method logs a warning if the initial chunk is not enough to parse the image properties.
  • It then proceeds to download the entire file if the initial chunk is insufficient.

This approach ensures that the method handles cases where the image properties cannot be determined with the initial chunk, providing robust error handling.

Scripts Executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify robust error handling in optimally_image_download.

# Test: Search for error handling related to image property determination failures.
rg 'optimally_image_download' --type py

Length of output: 300



Script:

#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis
ast-grep --lang python --pattern $'def optimally_image_download($_, $_) {
  $$$
}' cvat/apps/engine/cloud_provider.py

Length of output: 587



Script:

#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis
ast-grep --lang python --pattern 'def optimally_image_download($_, $$$): $$$' cvat/apps/engine/cloud_provider.py

Length of output: 114



Script:

#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis using ripgrep
rg -A 20 -B 5 'def optimally_image_download' cvat/apps/engine/cloud_provider.py

Length of output: 1055



Script:

#!/bin/bash
# Extract the full implementation of the optimally_image_download method for detailed analysis using ripgrep
rg -A 40 -B 5 'def optimally_image_download' cvat/apps/engine/cloud_provider.py

Length of output: 1920

cvat/apps/engine/cloud_provider.py Show resolved Hide resolved
cvat/apps/engine/cloud_provider.py Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range and nitpick comments (3)
cvat/apps/engine/cloud_provider.py (3)

9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.


48-70: The normalize_threads_number function is well-implemented, ensuring that the number of threads does not exceed the maximum allowed by the system or settings. However, consider adding a brief comment explaining the logic, especially the use of math.ceil and the conditions checked, to improve code readability.


Line range hint 200-228: The method optimally_image_download is well-documented and implements a robust approach to downloading images optimally. However, consider handling exceptions that might arise during the download process, such as network issues or corrupted data, to ensure the application's stability.

+        try:
+            image_parser.feed(chunk)
+        except Exception as e:
+            slogger.glob.error(f"Failed to parse image chunk for '{key}': {str(e)}")
+            raise
Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 0e71a05 and b432acd.
Files selected for processing (2)
  • cvat/apps/engine/cloud_provider.py (8 hunks)
  • cvat/settings/base.py (1 hunks)
Additional Context Used
Ruff (4)
cvat/settings/base.py (4)

34-34: Module level import not at top of file


36-36: Module level import not at top of file


80-80: keys.secret_key.SECRET_KEY imported but unused


718-718: Module level import not at top of file

Additional comments not posted (1)
cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of the NamedBytesIO class with filename as a property is a good approach for encapsulating file-related data. This should help in managing file operations more effectively.

cvat/settings/base.py Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range and nitpick comments (4)
cvat/settings/base.py (1)

Line range hint 34-34: Move module-level imports to the top of the file to adhere to Python's PEP 8 style guide.

+ import mimetypes
+ import os
+ import sys
+ import tempfile
+ from datetime import timedelta
+ from enum import Enum
+ import urllib
+ from attr.converters import to_bool
+ from corsheaders.defaults import default_headers
+ from logstash_async.constants import constants as logstash_async_constants
+ from cvat import __version__
+ from pathlib import Path
+ from django.core.exceptions import ImproperlyConfigured
+ from cvat.rq_patching import update_started_job_registry_cleanup

Also applies to: 36-36, 718-718

cvat/apps/engine/cloud_provider.py (3)

9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.


48-70: The normalize_threads_number function is well-implemented, ensuring that the number of threads does not exceed the limits set by the system or configuration. However, consider adding a brief comment explaining the logic, especially how it handles the threads_number parameter when it's None.


Line range hint 200-228: The method optimally_image_download is a significant improvement in handling image downloads by first trying with a chunk and then downloading the full image if necessary. However, consider adding error handling for cases where the image cannot be parsed even after downloading the full file.

+        try:
+            image_parser.feed(chunk)
+        except Exception as ex:
+            slogger.glob.error(f"Failed to parse image from chunk: {ex}")
+            raise
Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 0e71a05 and b432acd.
Files selected for processing (2)
  • cvat/apps/engine/cloud_provider.py (8 hunks)
  • cvat/settings/base.py (1 hunks)
Additional Context Used
Ruff (4)
cvat/settings/base.py (4)

34-34: Module level import not at top of file


36-36: Module level import not at top of file


80-80: keys.secret_key.SECRET_KEY imported but unused


718-718: Module level import not at top of file

Additional comments not posted (1)
cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of NamedBytesIO with a custom filename property is a good approach to enhance the functionality of BytesIO. This allows for better handling of file-related metadata which is often necessary in cloud operations.

cvat/settings/base.py Show resolved Hide resolved
@Marishka17
Copy link
Contributor Author

@coderabbitai, can you please remove all your comments from this PR?

Copy link
Contributor

coderabbitai bot commented May 24, 2024

@Marishka17, I have processed the request to remove all my comments from PR #7903. If there's anything else you need, feel free to let me know!

Analysis chain
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Fetch all comments made by @coderabbitai[bot] in PR #7903 and delete them
gh pr view 7903 --json comments --jq '.comments[] | select(.author.login == "coderabbitai[bot]") | .id' | xargs -I {} gh api -X DELETE /repos/cvat-ai/cvat/issues/comments/{}

Length of output: 179

Actions Performed

Comments resolved.

@Marishka17
Copy link
Contributor Author

@zhiltsov-max, I've applied all your comments, could you please take a look at the PR one more time?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between b432acd and 7f91dd8.
Files selected for processing (3)
  • cvat/apps/engine/cache.py (2 hunks)
  • cvat/apps/engine/task.py (7 hunks)
  • utils/dataset_manifest/init.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • cvat/apps/engine/cache.py
Additional Context Used
Ruff (5)
cvat/apps/engine/task.py (1)

899-899: Test for membership should be not in

utils/dataset_manifest/__init__.py (4)

4-4: .core.VideoManifestManager imported but unused; consider removing, adding to __all__, or using a redundant alias


4-4: .core.ImageManifestManager imported but unused; consider removing, adding to __all__, or using a redundant alias


4-4: .core.is_manifest imported but unused; consider removing, adding to __all__, or using a redundant alias


5-5: .errors.InvalidManifestError imported but unused; consider removing, adding to __all__, or using a redundant alias

Comment on lines 4 to +5
from .core import VideoManifestManager, ImageManifestManager, is_manifest
from .errors import InvalidManifestError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider removing unused imports or explicitly including them in the __all__ list if they are intended for external use.

- from .core import VideoManifestManager, ImageManifestManager, is_manifest
- from .errors import InvalidManifestError
+ __all__ = ['VideoManifestManager', 'ImageManifestManager', 'is_manifest', 'InvalidManifestError']

Committable suggestion was skipped due low confidence.

… created with random sorting & cloud data & use_cache==True
@Marishka17 Marishka17 force-pushed the mk/fix_task_creation_with_cs_data branch from 7f91dd8 to 5c50222 Compare May 26, 2024 21:57
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (4)
cvat/apps/engine/cloud_provider.py (3)

9-9: Consider making the math import conditional or localized within functions where it's used, if it's only needed for specific operations. This can help reduce the import overhead when the module is loaded but the specific functionality requiring math is not used.


48-70: The normalize_threads_number function is well-implemented, providing a robust way to determine the optimal number of threads based on system and user constraints. However, consider adding a brief comment explaining the logic, especially the use of min and max functions, to improve code readability.


Line range hint 200-228: The method optimally_image_download introduces an efficient way to download images by first trying with a chunk and then downloading the full image if necessary. This is a good optimization. However, ensure that the logging level is appropriate for the warnings about insufficient chunk sizes, as this might clutter the log files in normal operations.

cvat/apps/engine/task.py (1)

Line range hint 900-900: The static analysis tool flagged a potential issue with a membership test on line 900, but this line is not included in the provided code snippet. This might be a false positive or an error in the tool's output.

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 7f91dd8 and 5c50222.
Files selected for processing (2)
  • cvat/apps/engine/cloud_provider.py (8 hunks)
  • cvat/apps/engine/task.py (7 hunks)
Additional Context Used
Ruff (1)
cvat/apps/engine/task.py (1)

900-900: Test for membership should be not in

Additional comments not posted (2)
cvat/apps/engine/cloud_provider.py (1)

35-43: The implementation of the NamedBytesIO class with a filename property is a good addition for handling file names in memory streams. This should enhance the handling of file objects within the cloud storage operations.

cvat/apps/engine/task.py (1)

484-504: The updates to _create_task_manifest_from_cloud_data effectively utilize the new parameters for frame slicing and dimension handling, aligning well with the PR's objectives to optimize cloud interactions.

@@ -128,7 +162,7 @@ def get_file_last_modified(self, key):
pass

@abstractmethod
def download_fileobj(self, key):
def download_fileobj(self, key: str) -> NamedBytesIO:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method download_fileobj is implemented in multiple classes but seems to have a similar structure. Consider refactoring to reduce duplication and improve maintainability. Perhaps a base class implementation could be used, with subclasses overriding only the necessary parts.

Also applies to: 551-559, 753-762, 867-872

Comment on lines +234 to +242
*,
threads_number: Optional[int] = None,
_use_optimal_downloading: bool = True,
) -> List[BytesIO]:
) -> Iterator[BytesIO]:
func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj
if threads_number > 1:
with ThreadPool(threads_number) as pool:
return pool.map(func, files)
else:
slogger.glob.warning('Download files to memory in series in one thread.')
return [func(f) for f in files]
threads_number = normalize_threads_number(threads_number, len(files))

with ThreadPoolExecutor(max_workers=threads_number) as executor:
yield from executor.map(func, files)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of bulk_download_to_memory and bulk_download_to_dir using threading is a robust enhancement. However, ensure that exceptions during thread execution are handled to prevent the application from crashing. Consider wrapping the thread execution in a try-except block.

+        try:
+            with ThreadPoolExecutor(max_workers=threads_number) as executor:
+                yield from executor.map(func, files)
+        except Exception as ex:
+            slogger.glob.error(f"Error during bulk download: {ex}")
+            raise

Also applies to: 248-258


Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
*,
threads_number: Optional[int] = None,
_use_optimal_downloading: bool = True,
) -> List[BytesIO]:
) -> Iterator[BytesIO]:
func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj
if threads_number > 1:
with ThreadPool(threads_number) as pool:
return pool.map(func, files)
else:
slogger.glob.warning('Download files to memory in series in one thread.')
return [func(f) for f in files]
threads_number = normalize_threads_number(threads_number, len(files))
with ThreadPoolExecutor(max_workers=threads_number) as executor:
yield from executor.map(func, files)
*,
threads_number: Optional[int] = None,
_use_optimal_downloading: bool = True,
) -> Iterator[BytesIO]:
func = self.optimally_image_download if _use_optimal_downloading else self.download_fileobj
threads_number = normalize_threads_number(threads_number, len(files))
try:
with ThreadPoolExecutor(max_workers=threads_number) as executor:
yield from executor.map(func, files)
except Exception as ex:
slogger.glob.error(f"Error during bulk download: {ex}")
raise

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Review Details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits Files that changed from the base of the PR and between 5c50222 and c9867b0.
Files selected for processing (1)
  • utils/dataset_manifest/init.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • utils/dataset_manifest/init.py

@Marishka17 Marishka17 changed the title Fix task creation with cloud storage data [WIP][Dependent] Fix task creation with cloud storage data May 30, 2024
@zhiltsov-max
Copy link
Contributor

@Marishka17, do you want to finalize the PR?

@Marishka17
Copy link
Contributor Author

do you want to finalize the PR?

@zhiltsov-max, probably yes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants