New function: Add queuing system/batch processing option #28

Closed · nfahlgren opened this issue Apr 29, 2016 · 5 comments
Labels: Epic (Discussions and broad multi-issue ideas) · new feature (New feature ideas and solutions)

@nfahlgren (Member):

Develop a queuing system/batch processing option in addition to the current local/multiprocessing capabilities.

@nfahlgren (Member, Author):

Will investigate Parsl as a potential solution for parallelizing PlantCV across different environments, rather than building a unique solution for each system.
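
For reference, a minimal sketch of the Parsl pattern under consideration; the per-image run_workflow function and the image list are placeholder assumptions, not an existing PlantCV API:

import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # swap in an HPC config as needed

parsl.load(config)

@python_app
def run_workflow(image):
    # Hypothetical per-image PlantCV workflow
    from plantcv import plantcv as pcv
    img, path, filename = pcv.readimage(filename=image)
    return filename

futures = [run_workflow(img) for img in ["image1.png", "image2.png"]]  # placeholder list
results = [f.result() for f in futures]  # block until all workflows finish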

nfahlgren added the "new feature" label and removed the "New Function Proposal" label on Jun 7, 2019
nfahlgren added the "Epic" label on Dec 10, 2019
@nfahlgren (Member, Author):

My initial inclination was to use an existing workflow engine (e.g. Nextflow, Parsl, Snakemake, etc.). I have tried them all out, at least a bit. I like them a lot, but I am not sure they work for precisely what I have been trying to achieve. That said, @gsainsbury86 has developed a Nextflow workflow that we need to look at (https://github.com/aus-plant-phenomics-facility/plantcv-pipeline), and Parsl 0.9 (now released) includes some planned features I was waiting for, so I should check them out again. More ideas below...

@nfahlgren (Member, Author):

Another approach I started mapping out uses Dask, specifically dask-jobqueue. Going that route, one option would be to eliminate plantcv-workflow.py completely. plantcv-workflow.py is just a script that parses (extensive) command-line options and then runs functions in the plantcv.parallel subpackage.
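
As a rough illustration of that direction, a minimal dask-jobqueue sketch; the cluster resources and the workflow function are placeholder assumptions, not settled design:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # PBSCluster, SGECluster, etc. also available

def workflow(image):
    # Hypothetical per-image PlantCV workflow
    from plantcv import plantcv as pcv
    img, path, filename = pcv.readimage(filename=image)
    return filename

# Each Dask worker runs inside a SLURM job (placeholder resources)
cluster = SLURMCluster(cores=1, memory="2GB", walltime="01:00:00")
cluster.scale(jobs=10)  # request 10 worker jobs from the scheduler

client = Client(cluster)
futures = client.map(workflow, ["image1.png", "image2.png"])  # placeholder list
results = client.gather(futures)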

Right now, a user develops a workflow, likely in Jupyter. If it is in Jupyter, they export it to a Python script and then have to reshape it into a workflow script, adding argument parsing and plugging their code into the main() function. Then they run plantcv-workflow.py over their data with that script.

What if we turned this around a bit? Rather than using plantcv-workflow.py to execute a workflow script, we just add the parallelization components to the workflow script itself.

Rather than command-line arguments, we could take inputs from a configuration dictionary (though a user could easily make the config an input). We discussed this in #470. Then we package the plantcv.parallel functions into a single function that runs through the process of parallelizing the workflow.

I would imagine a user starts with a script downloaded from Jupyter:

from plantcv import plantcv as pcv
img, path, filename = pcv.readimage(image)

Then this gets converted (roughly) to this (perhaps automatically with a converter):

from plantcv import plantcv as pcv
from plantcv.parallel import parallelize

def main():
    # Parallelization settings, replacing the plantcv-workflow.py CLI options
    config = {
        "dir": "./images",           # input image directory
        "json": "pcv2.output.json",  # aggregated results file
        "outdir": "./output",        # directory for output images
        "meta": "imgtype,camera,frame,zoom,lifter,gain,exposure,id",
        "match": "imgtype:VIS,camera:SV,zoom:z1,frame:0",
        "cpu": 1,                    # number of parallel processes
        "coprocess": "NIR",
        "writeimg": True,
        "create": True
    }
    parallelize(config)

def workflow(image, result, outdir, coresult, writeimg, debug):
    # Per-image analysis; called once for each matched image
    img, path, filename = pcv.readimage(image)

if __name__ == '__main__':
    main()
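
To make this concrete, here is one possible internal sketch of a parallelize() helper. Everything below (glob-based image discovery, multiprocessing, and passing the workflow function explicitly) is a hypothetical illustration, not the plantcv.parallel implementation:

import glob
import os
from multiprocessing import Pool

def parallelize(config, workflow):
    # Hypothetical driver: find the input images and fan the user's
    # workflow out over a process pool, writing one result file per
    # image. Metadata parsing/matching is omitted for brevity.
    images = sorted(glob.glob(os.path.join(config["dir"], "*.png")))
    os.makedirs(config["outdir"], exist_ok=True)
    jobs = [(workflow, image, os.path.join(config["outdir"], f"{i}.json"), config)
            for i, image in enumerate(images)]
    with Pool(processes=config["cpu"]) as pool:
        pool.map(_run_one, jobs)

def _run_one(job):
    # Unpack one job and run the user-defined workflow on a single image
    workflow, image, result, config = job
    workflow(image, result, config["outdir"], None, config["writeimg"], debug=None)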

If people have thoughts on this, let us know!

@dschneiderch (Collaborator):

In favor. I'm doing something like this already because I don't like notebooks much. .py scripts run through Jupyter are easier to deal with and can also produce interactive output with JupyterLab or VS Code. When I am starting out, I create workflowargs.py, which I can run independently and before my workflow. I have some logic that lets me run the workflow for a single image (although it only works until I get to print_results, and then it fails because there is no metadata in results.json).

workflowargs.py:

class options():
    def __init__(self):
        self.image = "data/vistest/B3_MicroTom_20191115T111221_VIS0_0.png"
        self.outdir = "output/vis"
        self.result = "result.json"
        self.regex = r"(.{2})_(.+)_(\d{8}T\d{6})_(.+)_(\d+)"  # raw string so \d is not treated as an escape
        self.debug = 'plot'
        self.debugdir = 'debug/vis'

args = options()

Then I jump into main() of the workflow script:

import os
import re
import json

from plantcv import plantcv as pcv
from workflowargs import options  # the options class defined above

# Main workflow
def main():
    # Get options
    args = options()

    if args.debug:
        pcv.params.debug = args.debug  # set debug mode
        if args.debugdir:
            pcv.params.debug_outdir = args.debugdir  # set debug directory
            os.makedirs(args.debugdir, exist_ok=True)

    # pixel_resolution
    # mm
    # see pixel_resolution.xlsx for calibration curve for pixel to mm translation
    pixelresolution = 0.055
    # plt.rcParams["font.family"] = "Arial"  # All text is Arial

    # The result file should exist if plantcv-workflow.py was run
    if os.path.exists(args.result):
        # Open the result file
        results = open(args.result, "r")
        # The result file would have image metadata in it from plantcv-workflow.py, read it into memory
        metadata = json.load(results)
        # Close the file
        results.close()
        # Delete the file, we will create new ones
        os.remove(args.result)
        plantbarcode = metadata['metadata']['plantbarcode']['value']
        print(plantbarcode,
              metadata['metadata']['timestamp']['value'], sep=' - ')

    else:
        # If the file did not exist (for testing), initialize metadata as an empty JSON string
        metadata = "{}"
        regpat = re.compile(args.regex)
        plantbarcode = re.search(regpat, args.image).groups()[0]

    # read images and create mask
....

@gsainsbury86 (Collaborator):

I'd be happy to provide some more insight into my Nextflow implementation. One of my goals was to structure it so that the actual PlantCV process could be isolated and run on the command line as a single instance.

The main reason I chose to do it this way is that I don't anticipate being the one doing the analysis configuration or tweaking parameters and thresholds.

My ideal setup is that there's a relatively easy process for our image analyst to modify an existing script for the experiment in question, and I can then take that script and plug it into a pipeline/workflow that runs the job for the whole experiment.

As for configuration etc., mine broadly works like this:

  1. [single] query a database to work out which jobs to run, and store some metadata (plant ID, image path, etc.) in a JSON file.
  2. [parallel] gather the images (fetching via SFTP or symlink).
  3. [parallel] analyse each image.
  4. [single] collate the results using the plantcv.parallel utilities (a rough sketch of this step follows below).
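
To illustrate step 4, a rough collation sketch in plain Python; the per-image result layout and merge logic are assumptions for illustration, not the plantcv.parallel implementation:

import glob
import json

def collate_results(result_dir, output_file):
    # Merge per-image JSON result files into a single output file
    merged = []
    for path in sorted(glob.glob(f"{result_dir}/*.json")):
        with open(path, "r") as fh:
            merged.append(json.load(fh))
    with open(output_file, "w") as fh:
        json.dump(merged, fh, indent=2)

collate_results("output/results", "experiment.output.json")  # placeholder paths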
