job-ingest: validate jobspec using python/jsonschema #1922

Merged: 11 commits into flux-framework:master from garlick:jobspec_val, Jan 17, 2019

Conversation

@garlick (Member) commented Jan 15, 2019

This modifies job-ingest to spawn a python script to validate incoming jobspec, instead of calling into the C++ libjobspec library. That library (which incidentally has been copied to flux-sched) is no longer needed in flux-core, so we get to drop the yaml-cpp build dependency.

The python script parses the jobspec schema once, then applies it multiple times to candidate jobspecs received on stdin, emitting results on stdout. While considerably more flexible than the hardwired C++, the python/jsonschema script is slower. For comparison, I timed submitting 100K jobs with submitbench (without munge signing):

  • before this PR (use libjobspec inline with submit request): 23.45s
  • one C++ external program: 31.6s
  • one python script: 451s
  • crew of 4 python scripts: 116s (speedup 3.9)
  • crew of 8 python scripts: 59s (speedup 7.6)
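For reference, here is a minimal sketch of the protocol described above (schema parsed once, one candidate jobspec per line on stdin, one JSON result per line on stdout). This is illustrative only; the actual validate-schema.py differs in details such as option handling (it takes the schema via -s):

#!/usr/bin/env python3
# Illustrative sketch of the validator protocol, not the real script:
# parse the schema once, then validate one jobspec per line of stdin,
# emitting one JSON result per line on stdout.
import json
import sys

import jsonschema

def main():
    with open(sys.argv[1]) as f:        # the real script uses -s SCHEMA
        schema = json.load(f)

    for line in sys.stdin:              # one candidate jobspec per line
        try:
            jsonschema.validate(json.loads(line), schema)
            result = {"errnum": 0}
        except ValueError as e:         # jobspec was not valid JSON
            result = {"errnum": 1, "errstr": str(e)}
        except jsonschema.exceptions.ValidationError as e:
            result = {"errnum": 1, "errstr": e.message}
        print(json.dumps(result), flush=True)

if __name__ == "__main__":
    main()

Exiting on stdin EOF is what lets the module retire a worker simply by closing its stdin.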

Job submission probably won't be our main bottleneck, so I "tuned" the code that manages the submissions to spawn up to 4 python scripts on demand, and time them out after 5s of inactivity so there aren't a bunch of python interpreter instances hanging around on compute nodes looking menacing. These tunings are currently hardwired.

The work crew uses libsubprocess to do the heavy lifting of process and stdio management, and that made it easy to get this started. The interface to job-ingest is just a "request" function that returns a future. It seems like eventually we might want to circle back and look at making this a generically useful "work crew class", so I did try to structure the code with that in mind, but probably that's a job for another time.
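To illustrate the work crew bookkeeping (spawn on demand up to a cap, retire after an idle timeout), here is a blocking python sketch; the real implementation is asynchronous C built on libsubprocess, and the names here are made up:

import subprocess
import time

MAX_WORKERS = 4      # hardwired tunings described above
IDLE_TIMEOUT = 5.0   # seconds of inactivity before a worker is retired

class WorkCrew:
    def __init__(self, argv):
        self.argv = argv    # validator command line
        self.idle = []      # (proc, last_used) pairs awaiting work

    def _retire_idle(self):
        now = time.time()
        keep = []
        for proc, last_used in self.idle:
            if now - last_used > IDLE_TIMEOUT:
                proc.stdin.close()   # EOF on stdin tells the worker to exit
                proc.wait()
            else:
                keep.append((proc, last_used))
        self.idle = keep

    def request(self, jobspec_line):
        self._retire_idle()
        if self.idle:
            proc, _ = self.idle.pop()
        else:
            # The async version caps total workers at MAX_WORKERS and
            # queues requests for a free worker; here we just spawn.
            proc = subprocess.Popen(self.argv, stdin=subprocess.PIPE,
                                    stdout=subprocess.PIPE, text=True)
        proc.stdin.write(jobspec_line + "\n")
        proc.stdin.flush()
        result = proc.stdout.readline()   # the C version returns a future
        self.idle.append((proc, time.time()))
        return result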

I left the C++ validator "coprocess" in the PR rather than squashing it after libjobspec was removed, in case we want to go back to it. It occurred to me that flux-sched could easily provide one of those if we wanted to outsource this work at some point.

Ultimately I think we will want a way to run a series of validators, possibly in parallel, or pass additional schemas to this python script to be applied serially to the preparsed (JSON) jobspec. For now, I thought this was a reasonable chunk of work to call a PR.

@codecov-io commented Jan 15, 2019

Codecov Report

Merging #1922 into master will decrease coverage by 0.04%.
The diff coverage is 80.56%.

@@            Coverage Diff             @@
##           master    #1922      +/-   ##
==========================================
- Coverage    80.1%   80.06%   -0.05%     
==========================================
  Files         195      195              
  Lines       35087    35066      -21     
==========================================
- Hits        28105    28074      -31     
- Misses       6982     6992      +10
Impacted Files Coverage Δ
src/common/libflux/conf.c 100% <ø> (ø) ⬆️
src/common/libflux/future.c 87.54% <100%> (+0.25%) ⬆️
src/cmd/flux.c 84.21% <100%> (+0.16%) ⬆️
src/modules/job-ingest/worker.c 72.72% <72.72%> (ø)
src/modules/job-ingest/validate.c 85% <85%> (ø)
src/modules/job-ingest/job-ingest.c 73.28% <89.85%> (+2.02%) ⬆️
src/broker/modservice.c 78.84% <0%> (-0.97%) ⬇️
src/common/libutil/veb.c 98.28% <0%> (-0.58%) ⬇️
src/common/libflux/message.c 81.15% <0%> (-0.37%) ⬇️
src/cmd/flux-module.c 83.72% <0%> (-0.24%) ⬇️
... and 6 more
@grondo (Contributor) commented Jan 15, 2019

This is so cool!

Before dropping yaml-cpp from our base docker images, we should be sure that this dependency is pulled in by flux-sched's Dockerfiles, and add it if not.

@garlick (Member, Author) commented Jan 15, 2019

One problem this thing does have: somehow, python unicode strings from jsonschema exceptions are being converted to JSON strings with the leading u included. Example:

$ cat broke.json | flux job submit
flux-job: submit: Additional properties are not allowed (u'xslot' was unexpected)

Anybody have a hint on how to solve that?

@garlick (Member, Author) commented Jan 15, 2019

Another issue I ran into: when I added the following to the valgrind sharness test, I got some mem leaks in the broker's exec service (in the subprocess code).

diff --git a/t/valgrind/valgrind-workload.sh b/t/valgrind/valgrind-workload.sh
index 9523b759..fedd209b 100755
--- a/t/valgrind/valgrind-workload.sh
+++ b/t/valgrind/valgrind-workload.sh
@@ -7,3 +7,8 @@ echo FLUX_URI=$FLUX_URI
 for i in `seq 1 $NJOBS`; do
     flux wreckrun --ntasks $size /bin/true
 done
+
+# exercise job-ingest/job-management
+flux job submit <<EOT
+{"attributes":null,"tasks":[{"slot":"foo","count":{"per_slot":1},"command":"app"}],"version":1,"resources":[{"count":1,"type":"slot","with":[{"count":1,"type":"node"}],"label":"foo"}]}
+EOT

The leaks disappeared when I sent a cmb.disconnect message to the broker in validate_destroy(). (Unlike flux-exec, which is a process on the local connector, there is no automatic disconnect on module unload for job-ingest.)

I guess that means I'm not cleaning up subprocesses properly. What I'm doing is closing STDIN of the filter, then destroying the flux_subprocess_t without waiting for it, since I don't really have any way to wait for it. @chu11 is there a way to do this?

@garlick garlick force-pushed the garlick:jobspec_val branch from 3068c39 to 4363b0a Jan 16, 2019
@garlick (Member, Author) commented Jan 16, 2019

Rebased on current master and added some commits to address some memory issues.

I added a refcount to futures, since the validation futures in this case are placed on a queue and (in error cases) can be destroyed while still on the queue, triggering a use-after-free. With the refcount, that can't happen.
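In python terms, the mechanism is roughly the following sketch (illustrative only; the real API is flux_future_incref()/flux_future_decref() with an internal usecount, added in a commit below):

class Future:
    def __init__(self):
        self.usecount = 1       # creator holds the initial reference

    def incref(self):
        self.usecount += 1      # e.g. taken when the future is queued

    def decref(self):
        self.usecount -= 1
        if self.usecount == 0:
            self._free()        # storage is released only at zero

    def _free(self):
        pass                    # release internal state

With the queue holding its own reference, an early destroy just drops one reference instead of freeing memory out from under the queue.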

Then I switched the valgrind sharness test over to use the new system. (It submits 10 jobs; of course they don't run yet.) That uncovered some obvious memory leaks, which are fixed.

Finally I added the above cmb.disconnect hack to request that the broker exec service clean up any leftover subprocess state from the ingest module.

@grondo (Contributor) commented Jan 16, 2019

Then I switched the valgrind sharness test over to use the new system. (It submits 10 jobs; of course they don't run yet.) That uncovered some obvious memory leaks, which are fixed.

The existing valgrind test doesn't just test the "old stuff" but exercises the kvs, aggregator, and other subsystems used by the wreck prototype, so why not just keep the tests until we remove the code for the prototype?

Somewhere I had a (trivial) change to the valgrind tests that ran through a workload.d/*.sh directory instead of a static script. This allowed me to drop in tests for any subsystem (e.g. cron) to check for memleaks unrelated to a job workflow. Why don't we do something like that, and then the wreckrun script could just be deleted when we actually remove the implementation?

@grondo (Contributor) commented Jan 16, 2019

Finally I added the above cmb.disconnect hack to request that the broker exec service clean up any leftover subprocess state from the ingest module.

Is there an open issue on this? Maybe the subprocess api isn't quite right for this use case?

@garlick (Member, Author) commented Jan 16, 2019

Somewhere I had a (trivial) change to the valgrind tests that ran through a workload.d/*.sh directory instead of a static script. This allowed me to drop in tests for any subsystem (e.g. cron) to check for memleaks unrelated to a job workflow. Why don't we do something like that, and then the wreckrun script could just be deleted when we actually remove the implementation?

Yes that sounds good. I'll do that.

Is there an open issue on this?

Not yet; I was hoping that maybe I was missing something obvious, but even if so, an issue is probably the best place to discuss.

@garlick garlick force-pushed the garlick:jobspec_val branch 2 times, most recently from 5d932ba to f2969fa Jan 16, 2019
@garlick (Member, Author) commented Jan 16, 2019

Rebased on current master, squashed the valgrind test changes into one commit, squashed the two future changes into one commit, and fixed a merge error from the last push.

@grondo (Contributor) left a comment

This worked very well when I was just poking at it on my desktop. Especially nice is the ability for the validation script to push errors all the way back to the user running flux job submit. That is way nicer than how other systems work! (e.g. generic error numbers etc.)

On that note, I know the current validate-jobspec.py is just a first cut, so this shouldn't block the PR, but some of the errors returned are a bit vague, such as when I tried to submit a jobspec with "version": 2:

$ cat t/jobspec/valid/basic.yaml | t/jobspec/y2j.py | jq '.version=2' | flux job submit
flux-job: submit: 2 is not one of [1]

I've also made another couple of very minor comments inline.

src/common/libflux/conf.c (inline comment, resolved)
validators/validate-schema.py

fluxschemadir = $(datadir)/flux/
dist_fluxschema_DATA = schemas/jobspec.jsonschema

@grondo (Contributor) commented Jan 16, 2019

Given that future flux data files might go into $(datadir)/flux, should fluxschemadir instead be something like $(datadir)/flux/schema/jobspec/? (We could have other schemas as well in the future.)

This could be changed in the future (but we're already sharing $datadir with the help.d files...)

@garlick (Member, Author) commented Jan 16, 2019

Sure, I'll do that.

@garlick (Member, Author) commented Jan 16, 2019

Right, I think the validator is probably in an early state here and there may be things we can do to improve error messages down the road.

The first thing I would like to do is look into the fairly sophisticated exception model for jsonschema.validate(). There is a stack of exceptions, and we are only returning a summary (or maybe the top of the stack?).

Then perhaps investigate whether some of the jobspec schema might need to be restructured to avoid certain constructs that make useful errors elusive. For example, the oneOf operator tends to result in "none of the available schemas could validate" or similar, rather than anything concrete.

Finally, it would be pretty easy to short-circuit some common errors using hardwired python code, like the version one with the comically unhelpful message you quoted.
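For instance, a hypothetical pre-check (not part of this PR) could catch the version mismatch before handing the jobspec to jsonschema:

# Hypothetical short-circuit check run before jsonschema.validate();
# illustrative only, not part of this PR.
def precheck_version(jobspec):
    """Return (errnum, errstr) for a common error, or None to fall
    through to full schema validation."""
    version = jobspec.get("version")
    if version != 1:
        return (1, "only jobspec version 1 is supported (got %r)" % version)
    return None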

But I think maybe we are OK for now, especially since for the near term we'll be generating jobspec from simple commands.

@SteVwonder (Member) commented Jan 16, 2019

python unicode strings from jsonschema exceptions are being converted to JSON strings with the leading u included

Is this happening with python 2, 3, or both? I suspect it doesn't happen in 3 but does in 2. Can you post the contents of broke.json as a reproducer?

@garlick (Member, Author) commented Jan 16, 2019

Looks like python 2 does it:

$ python --version
Python 2.7.15rc1
$ cd src/modules/job-ingest
$ echo '{"version":2}' | validators/validate-schema.py -s schemas/jobspec.jsonschema
{"errnum":1,"errstr":"u'resources' is a required property"}

If I change the shebang to #!/usr/bin/env python3, the problem goes away.
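The root cause appears to be that jsonschema interpolates the offending name with %r (repr) formatting, and under Python 2 the repr of a unicode string carries the u prefix:

# Demonstration of the suspected root cause (runs under either python):
template = "Additional properties are not allowed (%r was unexpected)"
print(template % u"xslot")
# Python 2.7 -> Additional properties are not allowed (u'xslot' was unexpected)
# Python 3   -> Additional properties are not allowed ('xslot' was unexpected)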

@garlick garlick force-pushed the garlick:jobspec_val branch from f2969fa to a2e349e Jan 16, 2019
@garlick (Member, Author) commented Jan 16, 2019

Just forced a push after squashing the incremental bug fixes and the installed schema path change suggested by @grondo.

@SteVwonder's suggestion for validate-schema.py was added but not squashed.

@garlick garlick force-pushed the garlick:jobspec_val branch from a2e349e to b1333b6 Jan 16, 2019
@garlick (Member, Author) commented Jan 17, 2019

Fixed up the work crew load management a bit. I can squash the rest of this down if we think it's getting close.

@garlick garlick force-pushed the garlick:jobspec_val branch from b6df127 to 2058133 Jan 17, 2019
@garlick (Member, Author) commented Jan 17, 2019

I went ahead and squashed and forced a push.

@garlick garlick referenced this pull request Jan 17, 2019
@SteVwonder (Member) commented Jan 17, 2019

python unicode strings from jsonschema exceptions are being converted to JSON strings with the leading u included

Thanks for the reproducer @garlick. The error message coming out of jsonschema is already encoded into a str, so the easy fix I was hoping for isn't doable. There is a thread on the jsonschema repo about this problem: Julian/jsonschema#243. The dev has decided to focus on explicit, dev-focused error messages rather than user-friendly ones, so the u"string" is considered a feature, not a bug.

Conveniently though, someone has already made a pretty printer for jsonschema errors that should fix this problem: https://github.com/ccpgames/jsonschema-errorprinter

but some of the errors returned are a bit vague, such as when I tried to submit a jobspec with "version": 2:

The above pretty printer could potentially solve this problem as well by giving context to the vague error message.

@garlick (Member, Author) commented Jan 17, 2019

Thanks! They are just substituting out the u', so maybe I can just do that? (Seems to work.)

diff --git a/src/modules/job-ingest/validators/validate-schema.py b/src/modules/job-ingest/validators/validate-schema.py
index 7a1047a0..9e979da3 100755
--- a/src/modules/job-ingest/validators/validate-schema.py
+++ b/src/modules/job-ingest/validators/validate-schema.py
@@ -31,7 +31,7 @@ def validate(schema, line):
     except ValueError as e:
         return (1, str(e))
     except jsonschema.exceptions.ValidationError as e:
-        return (1, e.message)
+        return (1, e.message.replace("u'", "'"))
     return (0, None)
@garlick (Member, Author) commented Jan 17, 2019

I'm still a little wary of creating a slew of python package dependencies, so maybe it makes sense to hold off on the pretty printer and just make the above change for now?

@garlick garlick force-pushed the garlick:jobspec_val branch from 2058133 to 3b5b4b3 Jan 17, 2019
@garlick (Member, Author) commented Jan 17, 2019

Incorporated that validate-schema.py output substitution change, squashed, and rebased on current master.

garlick added 11 commits Jan 10, 2019. Commit messages:

• In preparation for making validation an asynchronous operation, move json+jobspec validation to validate.[ch] and wrap it in a future. For now, just use the future synchronously, as before.

• Add a continuation for jobspec validation and move "batch addition" from request handler to continuation.

• Problem: job-ingest module needs in-tree and installed paths to jobspec validator and jobspec schema. Rather than adding environment variables for each new file that needs to be located, set FLUX_CONF_INTREE=1 if running "in tree" and let users call flux_conf_get() with the appropriate flag. Add flux_conf_get() paths 'jobspec_validate_path' and 'jobspec_schema_path'.

• Instead of calling directly into the C++ jobspec validator, start the python validator script as a separate process, feed jobspec to its stdin, and receive a yes/no+reason validation result from its stdout. The C wrapper for libjobspec can go away, and the job-ingest module can be linked with the C compiler. Up to 4 validator processes are started when the load demands, and they time out after 5s of inactivity.

• Add the validate-schema.py python jobspec validator script that can function as a coprocess worker with job-ingest.

• Move the jobspec schema from the test directory to the job-ingest module directory, and install it. Update the path in the sharness test script.

• Drop the C++ jobspec library, the 'validate-jobspec' validator worker, its sharness test, and the yaml-cpp build dependency. In addition, drop the configure.ac support for configuring the C++ compiler, since there is no other C++ code in the flux-core project. The variant of the jobspec test that used this validator is removed. Update the build dependency table in README.md.

• Problem: it can be awkward if a future implementation that is not standalone needs to hold a reference on the future to prevent a use-after-free access. Add flux_future_incref(), flux_future_decref(), and an internal usecount.

• Have the valgrind sharness test execute everything under a workload.d subdirectory: workload.d/wreck (content originally from workload.sh), workload.d/job (new coverage for job-ingest, job-manager).
@garlick garlick force-pushed the garlick:jobspec_val branch from 3b5b4b3 to c2343f7 Jan 17, 2019
@garlick (Member, Author) commented Jan 17, 2019

Rebased.

@SteVwonder (Member) commented Jan 17, 2019

Thanks @garlick for all the work on this. The worker pool is really cool!

They are just substituting out the u' so maybe I can just do that?

Oh! Nice catch. I didn't look closely at their solution. It's convenient that almost no English words end in u 😆

I'm still a little wary of creating a slew of python package dependencies

Agreed! For the pretty printer, I don't think it is even in any package managers, not even pip. So if we do eventually go that route, vendoring would probably be the way to go.

@grondo grondo merged commit 372d41b into flux-framework:master Jan 17, 2019
3 checks passed:
• codecov/patch: 80.56% of diff hit (target 80.1%)
• codecov/project: absolute coverage decreased by -0.04% but relative coverage increased by +0.46% compared to f88d1cd
• continuous-integration/travis-ci/pr: the Travis CI build passed
@garlick (Member, Author) commented Jan 17, 2019

Thanks everybody!

@garlick garlick deleted the garlick:jobspec_val branch Jan 18, 2019