Refactor parallel subpackage to use dataframes and grouping #887

Merged
merged 58 commits into from
Jul 15, 2022

Member

@nfahlgren nfahlgren commented Apr 26, 2022

Describe your changes
This PR makes several major changes to the parallel subpackage.

.github/workflows/continuous-integration.yml:

  • Use pip to install PlantCV instead of setup.py.

plantcv/parallel/__init__.py:

  • Remove deprecated functions convert_datetime_to_unixtime and check_date_range.
  • Import the new workflow_inputs function and WorkflowInputs class. These come from a new module for handling workflow inputs in Jupyter notebooks and scripts, added to make it easier to migrate from Jupyter to a parallel workflow script.
  • WorkflowConfig default timestamp format was updated to an ISO 8601 UTC datetime.
  • Remove WorkflowConfig coprocess attribute and replace with groupby and group_name attributes. The new attributes are used to group images in the new dataframe-based metadata parser framework and name the image inputs to parallel workflows.
  • A new rotation metadata attribute was added to WorkflowConfig.
  • The plantbarcode metadata attribute in WorkflowConfig was renamed to barcode to be more general.
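
To illustrate the new default timestamp format, here is a minimal sketch that parses a string with the ISO 8601 UTC format quoted in this PR's commit list. The helper name `parse_utc_timestamp` and the sample timestamp are hypothetical and are not part of PlantCV.

```python
from datetime import datetime, timezone

# Format string quoted in this PR's commit list (the new WorkflowConfig default).
ISO_UTC_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc_timestamp(text, fmt=ISO_UTC_FORMAT):
    """Parse a timestamp string and mark it as UTC (hypothetical helper)."""
    # strptime matches the trailing "Z" as a literal character, so the
    # result is naive; attach the UTC timezone explicitly.
    return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)


# Illustrative value only, not taken from a real dataset.
stamp = parse_utc_timestamp("2022-04-26T13:45:10.250000Z")
print(stamp.isoformat())
```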

plantcv/parallel/parsers.py:

  • The parsers module was completely rewritten to reduce complexity and utilize dataframes for easier filtering and grouping.
  • A new input data format (phenodata) was added to the parsers module.
  • All data formats in the parsers module are read into a common data structure, which allows them to all be converted to a dataframe using a single method. New data formats can be added by adding a new private function and reading the data into the common data structure.
  • Dataframes in the parser module can be grouped on one or more metadata terms; the grouped dataframe is then passed to the job builder module.
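
The grouping idea can be sketched in plain Python, used here as a stand-in for the dataframe grouping so the example has no dependencies. The function `group_records`, the metadata keys other than `barcode`, and the file names are all illustrative assumptions, not the parser's actual code.

```python
from collections import defaultdict


def group_records(records, groupby):
    """Group metadata records (dicts) by the values of one or more terms."""
    groups = defaultdict(list)
    for record in records:
        # The group key is the tuple of values for the requested terms.
        key = tuple(record[term] for term in groupby)
        groups[key].append(record)
    return dict(groups)


# Hypothetical image metadata; each dict plays the role of one dataframe row.
records = [
    {"filepath": "p1_vis.png", "barcode": "plant1", "timestamp": "t1", "camera": "VIS"},
    {"filepath": "p1_nir.png", "barcode": "plant1", "timestamp": "t1", "camera": "NIR"},
    {"filepath": "p2_vis.png", "barcode": "plant2", "timestamp": "t1", "camera": "VIS"},
]

grouped = group_records(records, groupby=["barcode", "timestamp"])
for key, members in grouped.items():
    print(key, [m["filepath"] for m in members])
```

Each group would then correspond to one job, with all images in the group passed to the workflow together.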

plantcv/parallel/job_builder.py:

  • The job builder module was updated to handle grouped dataframes and remove coprocessing logic.
  • The job builder module was updated to pass arguments to the new workflow_inputs-based argparse framework.
  • The job builder module can automatically name workflow input images if requested (e.g., image1, image2, etc.).
  • Temp JSON files are now named with unique alphanumeric identifiers (UUID) since they can represent data for multiple images.
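
A rough sketch of the last two points, assuming sequential names with an `image` prefix and UUID-based temp file names. The helpers `name_images` and `temp_json_path` are hypothetical and only illustrate the idea; they are not the job builder's real functions.

```python
import os
import uuid


def name_images(filepaths, prefix="image"):
    """Return automatic names (image1, image2, ...) for a group of images."""
    return [f"{prefix}{i}" for i in range(1, len(filepaths) + 1)]


def temp_json_path(tmp_dir):
    """Return a temp JSON path with a UUID-based file name."""
    # One file can hold results for several images, so the name is not
    # derived from any single image file name.
    return os.path.join(tmp_dir, f"{uuid.uuid4()}.json")


group = ["p1_vis.png", "p1_nir.png"]
names = name_images(group)
result_file = temp_json_path("tmp")
print(names, result_file)
```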

plantcv/parallel/workflow_inputs.py:

  • New module for defining/handling standard workflow inputs in both Jupyter and parallel workflow script contexts.
  • The WorkflowInputs class is used to set Jupyter notebook input variables in a framework that is compatible with the command-line arguments used in parallel workflow scripts.
  • The workflow_inputs function creates a standardized argparse command-line argument parser for workflows.
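
The notebook/script compatibility described above can be sketched with a small class and an argparse parser that expose the same attribute names. This is not PlantCV's implementation: `NotebookInputs`, `build_parser`, and the argument names are assumptions chosen for illustration.

```python
import argparse


class NotebookInputs:
    """Hypothetical stand-in for a notebook-side inputs object."""

    def __init__(self, images, names, result, outdir=".", debug=None):
        self.images = images
        self.names = names
        self.result = result
        self.outdir = outdir
        self.debug = debug


def build_parser():
    """Hypothetical command-line parser exposing the same attribute names."""
    parser = argparse.ArgumentParser(description="Workflow inputs (sketch).")
    parser.add_argument("--names", required=True)
    parser.add_argument("--result", required=True)
    parser.add_argument("--outdir", default=".")
    parser.add_argument("--debug", default=None)
    parser.add_argument("images", nargs="+")
    return parser


def workflow(args):
    """Workflow body that works with either kind of args object."""
    return f"{len(args.images)} image(s) -> {args.result}"


# In a notebook: set inputs directly.
notebook_args = NotebookInputs(images=["plant.png"], names="image1", result="out.json")
# In a script: parse the equivalent command line.
script_args = build_parser().parse_args(
    ["--names", "image1", "--result", "out.json", "plant.png"]
)
print(workflow(notebook_args), "|", workflow(script_args))
```

Because the workflow body only reads attributes, the same code can move from a notebook to a script by swapping how `args` is created.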

plantcv/parallel/process_results.py:

  • The JSON output file is now formatted with line returns and indentation for easier viewing.
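
For reference, indented JSON output of this kind can be produced with the standard library as sketched below. The `results` dictionary and the indent width of 4 are assumptions, not the actual PlantCV output schema or setting.

```python
import json

# Hypothetical results structure, for illustration only.
results = {"entities": [{"metadata": {"barcode": "plant1"}, "observations": {"area": 1200}}]}

# indent adds line returns and nested indentation to the serialized text.
pretty = json.dumps(results, indent=4)
print(pretty)
```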

plantcv/utils/converters.py:

  • The json2csv util function was updated to handle grouped output data.
  • json2csv now only outputs a single CSV file in long format.
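
As an illustration of long format, the sketch below flattens nested results into one row per sample and trait and writes a single CSV. The input structure, the column names, and both helper functions are hypothetical; json2csv's real schema may differ.

```python
import csv
import io


def to_long_rows(entities):
    """Flatten results into long format: one row per (sample, trait) pair."""
    rows = []
    for entity in entities:
        for trait, value in entity["observations"].items():
            rows.append({"sample": entity["sample"], "trait": trait, "value": value})
    return rows


def write_long_csv(rows, handle):
    """Write long-format rows to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=["sample", "trait", "value"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical grouped results, for illustration only.
entities = [
    {"sample": "image1", "observations": {"area": 1200, "height": 45}},
    {"sample": "image2", "observations": {"area": 900, "height": 40}},
]

rows = to_long_rows(entities)
buffer = io.StringIO()
write_long_csv(rows, buffer)
print(buffer.getvalue())
```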

Additionally, relevant tests were added/updated. Documentation was updated where necessary.

Type of update
Is this a: New feature or feature enhancement

Associated issues
Closes #474
Closes #423
Closes #538
Replaces #759

* Adds a new term rotation - we previously used frame but frame is separately used so a new term made sense
* Replaced plantbarcode with barcode to fit a broader range of applications
* Replaces "%Y-%m-%d %H:%M:%S.%f" with "%Y-%m-%dT%H:%M:%S.%fZ"
* Replaces metadata_parser with a new modular workflow that parses three types of datasets and uses a dataframe structure to do metadata filtering
* The workflow configuration template needed to be updated to match updates to WorkflowConfig
* Add a new module for standardizing and implementing workflow command-line and notebook input arguments
* Update job_builder to plug inputs into the new argument framework
* Update multiprocess tests
* Still needs work to reduce complexity
@nfahlgren nfahlgren added the new feature, work in progress, and update labels Apr 26, 2022
@nfahlgren nfahlgren added this to the PlantCV v4.x milestone Apr 26, 2022
@nfahlgren nfahlgren added this to Pull Requests in PlantCV4 via automation Apr 26, 2022

codecov bot commented Apr 27, 2022

Codecov Report

Merging #887 (6c88106) into 4.x (19beef2) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##               4.x      #887   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          159       160    +1     
  Lines         6738      6714   -24     
=========================================
- Hits          6738      6714   -24     
Flag        Coverage Δ
unittests   100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
plantcv/parallel/multiprocess.py      100.00% <ø> (ø)
plantcv/parallel/__init__.py          100.00% <100.00%> (ø)
plantcv/parallel/job_builder.py       100.00% <100.00%> (ø)
plantcv/parallel/parsers.py           100.00% <100.00%> (ø)
plantcv/parallel/process_results.py   100.00% <100.00%> (ø)
plantcv/parallel/workflow_inputs.py   100.00% <100.00%> (ø)
plantcv/utils/converters.py           100.00% <100.00%> (ø)

@HaleySchuhl
Contributor

Error handling for "images" input in Workflow Image when input is not a list.

New version breaks something in acute_vertex that we can figure out later
@HaleySchuhl HaleySchuhl requested review from jdavidpeery and removed request for jdavidpeery June 28, 2022 12:52
@nfahlgren
Member Author

@JorgeGtz found an issue with the other_args configuration property. In the new workflow_inputs function, other is an input keyword argument, but other_args are individual arguments and values that are appended to the workflow without being parsed by workflow_inputs.

I think I can refactor other_args to be a dictionary and then have workflow_inputs parse the keywords and values so that they are available in the argparse object.
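
One possible shape for that refactor, sketched under assumptions: other_args is a dictionary, each key becomes a command-line flag, and the parser registers those flags so the values end up on the argparse namespace. The names `to_cli` and `build_parser` and the example keys are hypothetical; this is not the code that was merged.

```python
import argparse


def to_cli(other_args):
    """Turn a dictionary of extra arguments into command-line tokens."""
    tokens = []
    for key, value in other_args.items():
        tokens.extend([f"--{key}", str(value)])
    return tokens


def build_parser(other_args=None):
    """Build a parser that also accepts each extra keyword as a flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--result", required=True)
    for key in (other_args or {}):
        # Register every extra keyword so its value is parsed, not dropped.
        parser.add_argument(f"--{key}")
    return parser


# Hypothetical extra arguments from a workflow configuration.
other_args = {"threshold": 120, "label": "tray1"}
argv = ["--result", "out.json"] + to_cli(other_args)
args = build_parser(other_args).parse_args(argv)
print(args.threshold, args.label)
```

In this sketch the values arrive as strings, so a workflow would need to convert them (for example with `int`) where a number is expected.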

@nfahlgren nfahlgren merged commit 5f249a3 into 4.x Jul 15, 2022
PlantCV4 automation moved this from Pull Requests to Done Jul 15, 2022
@nfahlgren nfahlgren deleted the revise-parallel-parsers branch July 15, 2022 16:46
Labels
new feature, ready to review, update
Projects
PlantCV4
  
Done
2 participants