New function: Add queuing system/batch processing option #28

Closed · nfahlgren opened this issue Apr 29, 2016 · 5 comments
Labels: Epic (Discussions and broad multi-issue ideas) · new feature (New feature ideas and solutions)

@nfahlgren (Member):

Develop a queuing system/batch processing option in addition to the current local/multiprocessing capabilities.

@nfahlgren (Member, Author):

Will investigate Parsl as a potential solution for parallelizing PlantCV across different environments, rather than building a unique solution for each system.
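
For reference, a minimal sketch of the Parsl pattern under consideration; the per-image run_workflow function and the image list are placeholder assumptions, not an existing PlantCV API:

import parsl
from parsl import python_app
from parsl.configs.local_threads import config  # swap in an HPC config as needed

parsl.load(config)

@python_app
def run_workflow(image):
    # Hypothetical per-image PlantCV workflow
    from plantcv import plantcv as pcv
    img, path, filename = pcv.readimage(filename=image)
    return filename

futures = [run_workflow(img) for img in ["image1.png", "image2.png"]]  # placeholder list
results = [f.result() for f in futures]  # block until all workflows finish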

nfahlgren added the "new feature" label and removed the "New Function Proposal" label on Jun 7, 2019
nfahlgren added the "Epic" label on Dec 10, 2019
@nfahlgren (Member, Author):

My initial inclination was to use an existing workflow engine (e.g. Nextflow, Parsl, Snakemake, etc.). I have tried them all out, at least a bit. I like them a lot, but I am not sure they work for precisely what I have been trying to achieve. That said, @gsainsbury86 has developed a Nextflow workflow that we need to look at (https://github.com/aus-plant-phenomics-facility/plantcv-pipeline), and Parsl 0.9 (now released) includes some planned features I was waiting for, so I should check them out again. More ideas below...

@nfahlgren (Member, Author):

Another approach I started mapping out uses Dask, specifically dask-jobqueue. Going that route, one option would be to eliminate plantcv-workflow.py completely. plantcv-workflow.py is just a script that parses (extensive) command-line options and then runs functions in the plantcv.parallel subpackage.
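
As a rough illustration of that direction, a minimal dask-jobqueue sketch; the cluster resources and the workflow function are placeholder assumptions, not settled design:

from dask.distributed import Client
from dask_jobqueue import SLURMCluster  # PBSCluster, SGECluster, etc. also available

def workflow(image):
    # Hypothetical per-image PlantCV workflow
    from plantcv import plantcv as pcv
    img, path, filename = pcv.readimage(filename=image)
    return filename

# Each Dask worker runs inside a SLURM job (placeholder resources)
cluster = SLURMCluster(cores=1, memory="2GB", walltime="01:00:00")
cluster.scale(jobs=10)  # request 10 worker jobs from the scheduler

client = Client(cluster)
futures = client.map(workflow, ["image1.png", "image2.png"])  # placeholder list
results = client.gather(futures)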

Right now, a user develops a workflow, likely in Jupyter. If it is in Jupyter, they export it to a Python script and then have to reshape it into a workflow script, adding argument parsing and plugging their code into the main() function. Then they run plantcv-workflow.py over their data with that script.

What if we turned this around a bit? Rather than using plantcv-workflow.py to execute a workflow script, we just add the parallelization components to the workflow script itself.

Rather than command-line arguments, we could take inputs from a configuration dictionary (though a user could easily make the config an input). We discussed this in #470. Then we package the plantcv.parallel functions into a single function that runs through the process of parallelizing the workflow.

I would imagine a user starts with a script downloaded from Jupyter:

from plantcv import plantcv as pcv
img, path, filename = pcv.readimage(image)

Then this gets converted (roughly) to this (perhaps automatically with a converter):

from plantcv import plantcv as pcv
from plantcv.parallel import parallelize

def main():
    # Parallelization settings, replacing the plantcv-workflow.py CLI options
    config = {
        "dir": "./images",           # input image directory
        "json": "pcv2.output.json",  # aggregated results file
        "outdir": "./output",        # directory for output images
        "meta": "imgtype,camera,frame,zoom,lifter,gain,exposure,id",
        "match": "imgtype:VIS,camera:SV,zoom:z1,frame:0",
        "cpu": 1,                    # number of parallel processes
        "coprocess": "NIR",
        "writeimg": True,
        "create": True
    }
    parallelize(config)

def workflow(image, result, outdir, coresult, writeimg, debug):
    # Per-image analysis; called once for each matched image
    img, path, filename = pcv.readimage(image)

if __name__ == '__main__':
    main()
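
To make this concrete, here is one possible internal sketch of a parallelize() helper. Everything below (glob-based image discovery, multiprocessing, and passing the workflow function explicitly) is a hypothetical illustration, not the plantcv.parallel implementation:

import glob
import os
from multiprocessing import Pool

def parallelize(config, workflow):
    # Hypothetical driver: find the input images and fan the user's
    # workflow out over a process pool, writing one result file per
    # image. Metadata parsing/matching is omitted for brevity.
    images = sorted(glob.glob(os.path.join(config["dir"], "*.png")))
    os.makedirs(config["outdir"], exist_ok=True)
    jobs = [(workflow, image, os.path.join(config["outdir"], f"{i}.json"), config)
            for i, image in enumerate(images)]
    with Pool(processes=config["cpu"]) as pool:
        pool.map(_run_one, jobs)

def _run_one(job):
    # Unpack one job and run the user-defined workflow on a single image
    workflow, image, result, config = job
    workflow(image, result, config["outdir"], None, config["writeimg"], debug=None)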

If people have thoughts on this, let us know!

@dschneiderch (Collaborator):

In favor. I'm doing something like this already because I don't like notebooks much. .py scripts run through Jupyter are easier to deal with and can also produce interactive output with JupyterLab or VS Code. When I am starting out, I create workflowargs.py, which I can run independently and before my workflow. I have some logic that lets me run the workflow for a single image (although it only works until I get to print_results, and then it fails because there is no metadata in results.json).

workflowargs.py:

class options():
    def __init__(self):
        self.image = "data/vistest/B3_MicroTom_20191115T111221_VIS0_0.png"
        self.outdir = "output/vis"
        self.result = "result.json"
        self.regex = r"(.{2})_(.+)_(\d{8}T\d{6})_(.+)_(\d+)"  # raw string so \d is not treated as an escape
        self.debug = 'plot'
        self.debugdir = 'debug/vis'

args = options()

Then I jump into main() of the workflow script:

import os
import re
import json

from plantcv import plantcv as pcv
from workflowargs import options  # the options class defined above

# Main workflow
def main():
    # Get options
    args = options()

    if args.debug:
        pcv.params.debug = args.debug  # set debug mode
        if args.debugdir:
            pcv.params.debug_outdir = args.debugdir  # set debug directory
            os.makedirs(args.debugdir, exist_ok=True)

    # pixel_resolution
    # mm
    # see pixel_resolution.xlsx for calibration curve for pixel to mm translation
    pixelresolution = 0.055
    # plt.rcParams["font.family"] = "Arial"  # All text is Arial

    # The result file should exist if plantcv-workflow.py was run
    if os.path.exists(args.result):
        # Open the result file
        results = open(args.result, "r")
        # The result file would have image metadata in it from plantcv-workflow.py, read it into memory
        metadata = json.load(results)
        # Close the file
        results.close()
        # Delete the file, we will create new ones
        os.remove(args.result)
        plantbarcode = metadata['metadata']['plantbarcode']['value']
        print(plantbarcode,
              metadata['metadata']['timestamp']['value'], sep=' - ')

    else:
        # If the file did not exist (for testing), initialize metadata as an empty JSON string
        metadata = "{}"
        regpat = re.compile(args.regex)
        plantbarcode = re.search(regpat, args.image).groups()[0]

    # read images and create mask
....

@gsainsbury86 (Collaborator):

I'd be happy to provide some more insight into my Nextflow implementation. One of my goals was to structure it so that the actual PlantCV process could be isolated and run on the command line as a single instance.

The main reason I chose to do it this way is that I don't anticipate being the one doing the analysis configuration or tweaking parameters and thresholds.

My ideal setup is that there's a relatively easy process for our image analyst to modify an existing script for the experiment in question, and I can then take that script and plug it into a pipeline/workflow that runs the job for the whole experiment.

As for configuration etc., mine broadly works like this:

  1. [single] query a database to work out which jobs to run, and store some metadata (plant ID, image path, etc.) in a JSON file.
  2. [parallel] gather the images (fetching via SFTP or symlink).
  3. [parallel] analyse each image.
  4. [single] collate the results using the plantcv.parallel utilities (a rough sketch of this step follows below).
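
To illustrate step 4, a rough collation sketch in plain Python; the per-image result layout and merge logic are assumptions for illustration, not the plantcv.parallel implementation:

import glob
import json

def collate_results(result_dir, output_file):
    # Merge per-image JSON result files into a single output file
    merged = []
    for path in sorted(glob.glob(f"{result_dir}/*.json")):
        with open(path, "r") as fh:
            merged.append(json.load(fh))
    with open(output_file, "w") as fh:
        json.dump(merged, fh, indent=2)

collate_results("output/results", "experiment.output.json")  # placeholder paths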
