Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add readdata() function for generic data input #2162

Merged
merged 4 commits into from May 6, 2015

Conversation

tzz
Copy link
Contributor

@tzz tzz commented Apr 3, 2015

This PR has three components:

  • unify CSV, JSON, and YAML parsing in one FnCallReadData function
  • fix a bug with CSV parsing, where a null result from the parse could result in a segfault. Also the byte limit is raised to 50 MB.
  • introduce the function readdata() which takes two parameters: the filename and the mode (auto or CSV or YAML or JSON). In auto mode the file extension is checked. Note that readdata() has no file size parameter.

readdata() is handy for cases where the policy author wants to handle multiple types of files without special cases, like in cfengine/masterfiles#397 as @nickanderson will confirm.

If this function is acceptable, I will write docs and acceptance tests.

@cfengine-review-bot
Copy link

Can one of the admins verify this patch?

@nickanderson
Copy link
Member

This is neat, it can make def.json support yaml and I guess csv without any code differences.

So maxbytes for JSON and YAML is inf, but CSV is 50MB? Is there a good reason for the difference? If there is no technical reason to limit CSV then I would suggest it operates the same as JSON and YAML, now we are just missing XML ;-0

@nickanderson
Copy link
Member

Also, does this mean we should deprecate readyaml, readjson, readcsv ?

@tzz
Copy link
Contributor Author

tzz commented Apr 3, 2015

The current readyaml, readcsv, and readjson functions have two reasons to stay: first, explicit size limits (except for readcsv). Second, using them gives you static validation that you'll read the right format instead of deciding it dynamically.

I can make CSV unlimited too, I just thought 50MB was quite large.

I did think about XML! XML is much harder to read into a data container. The mapping is not one-to-one and CFEngine doesn't yet have the tools to extract both content and attributes if I mix them in the data (the mapjson() proposal would help). Also I don't know of anyone that needs it. But we do have libxml linkage... so maybe a future feature :)

I also thought about naming the function data_read. Not sure, what do you think?

@nickanderson
Copy link
Member

For naming I prefer readdata or read_data. @estenberg opinion on CSV data limit?

50MB csv is big and annoying, but so is knowing that CSV is a special case. So unless there is a technical reason to limit it, I would vote for consistency (yes, there is a hobgoblin in there).

@ediosyncratic
Copy link
Contributor

It is only petty consistency that's a hobgoblin. Where consistency is actually useful, as here, there are no hobgoblins.

@tzz
Copy link
Contributor Author

tzz commented Apr 15, 2015

ping?

@tzz
Copy link
Contributor Author

tzz commented Apr 24, 2015

ping? This is a pretty safe isolated change IMO.

@kacf kacf self-assigned this Apr 28, 2015
@kacf
Copy link
Contributor

kacf commented Apr 28, 2015

I agree that CSV should not be size limited, so that we are consistent with the others. Other than that this is great, no concerns from me!

What's missing is an acceptance test and some docs.

@tzz
Copy link
Contributor Author

tzz commented Apr 30, 2015

For the acceptance test, I'll wait for #1485 to get merged because it will make the acceptance test much easier.

@tzz
Copy link
Contributor Author

tzz commented Apr 30, 2015

Added examples. Docs in cfengine/documentation#1026

@tzz
Copy link
Contributor Author

tzz commented May 4, 2015

The auto-guessing feature didn't feel right so I've removed it. Now we simply try reading as JSON if the mode is auto and the extension is unknown.

(This is different from the inline JSON feature, which does auto-guess because the string is inline. Here we'd have to peek into the file, which is much more complicated and less useful.)

@tzz
Copy link
Contributor Author

tzz commented May 4, 2015

For JSON and YAML I had to set a max read size, and chose 50MB just like CSV. I think that's reasonable but we can raise it.

@tzz
Copy link
Contributor Author

tzz commented May 4, 2015

Acceptance test added. A small fix for bundlestate() in DefaultTemplateData() was needed to remove mangled var refs from the bundle state.

@tzz
Copy link
Contributor Author

tzz commented May 4, 2015

The state-based acceptance test functionality is really nice here--almost entirely data-driven.

@tzz
Copy link
Contributor Author

tzz commented May 4, 2015

Docs updated as well.

if (NULL == strchr(lval_key, '#')) // don't collect mangled refs
{
JsonObjectAppendElement(scope_obj, lval_key, RvalToJson(var->rval));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't quite understand this. Can you explain why this happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this#promise_filename shows up in the bundle's variables (claims its scope is the bundle). I don't know why it happens since I didn't write the variable-gathering code, but since we can't have variables with # in the name, the fix is good regardless of the underlying issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely recall that there are some exceptions to this rule, and '#' is used to mark those exceptions, but I don't remember any details (or even if I remember correctly..). I suppose we're ok if the tests pass, and this is anyway limited to the template data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# is used to replace . when variables are "mangled." Do git grep -i mangle in core.git and prepare yourself :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, you're right, I wish I hadn't. :-)
I suppose this is appropriate then, since the data for the bundle is not supposed to include any variables from outside the bundle. The this#promise_filename is an exception merely because it is in fact local to that bundle.

@kacf
Copy link
Contributor

kacf commented May 5, 2015

I tried adding some test cases for failed parsing, like here: https://gist.github.com/kacfengine/0e0abe107a8b63061394, but this caused an infinite loop. Would be nice if you included it in your test.

@kacf
Copy link
Contributor

kacf commented May 5, 2015

And btw you're right, the state based testing is really nice!

@tzz
Copy link
Contributor Author

tzz commented May 5, 2015

OK; acceptance test expanded as you suggested. That exposed a pre-existing bug in the YAML parser: it wouldn't break on YAML_NO_EVENT like it's supposed to (on malformed JSON). Fixed; thanks for the suggestion.

@tzz
Copy link
Contributor Author

tzz commented May 5, 2015

I think it makes sense to have a failed CSV parse return the empty array [] but for failed JSON and YAML parses, to fail the variable promise.

The reason is that CSV has no structure: anything can be CSV. Whereas JSON and YAML have to obey strict rules to be parseable.

If you disagree, I can change that behavior.

@tzz
Copy link
Contributor Author

tzz commented May 5, 2015

@nickanderson the data limit is 50MB for all formats now.

@kacf
Copy link
Contributor

kacf commented May 5, 2015

I think it makes sense to have a failed CSV parse return the empty array [] but for failed JSON and YAML parses, to fail the variable promise.

The reason is that CSV has no structure: anything can be CSV. Whereas JSON and YAML have to obey strict rules to be parseable.

If you disagree, I can change that behavior.

Nope, I agree as well. Empty CSV is still valid CSV, so the only possible error is being unable to open the file.

@kacf
Copy link
Contributor

kacf commented May 5, 2015

trigger build

kacf added a commit that referenced this pull request May 6, 2015
Add readdata() function for generic data input
@kacf kacf merged commit 8bba516 into cfengine:master May 6, 2015
@tzz
Copy link
Contributor Author

tzz commented May 6, 2015

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
5 participants