-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --json-file/-j option to read from file w/ json contents #311
Conversation
brew install...
not leaking
cleanup action cleanup trurl.1 cleanup trurl.1
okay I need to figure out whats going on with the windows builds, the cygwin was was working on my local branch until very recently, might be something i changed on the way we are doing the file i/o. I will take a closer look this evening. Edit: Okay, i got the cygwin build working. I just need to add libjson-c to the curl-for-win build. @vszakats would this be done by manually compiling and linking it in here: https://github.com/curl/curl-for-win/blob/main/trurl.sh ? |
It's a new dependency, kind of a big deal to add support for one in general. One approach is to make it a build option, disabled by default, so existing trurl builds (including curl-for-win) continue to work without modification. Another option is to delete/disable the curl-for-win CI job till I (or someone else) get around adding support for it. A 3rd one is to add an option to disable this feature, and I update curl-for-win to set it. |
...that said, have you thought of using a JSON parser that's a self-contained C89 source (with a fitting license)? It's still a dependency, but perhaps easier to integrate and carry. |
I was looking a lot at https://github.com/DaveGamble/cJSON, but I didn't like how it looks and json-c has some nice features. Both this and json-c are in vcpkg, freebsd repos, and most linux distro repositories so i was hoping it wouldn't be too much work either way for maintainers. If folks want a smaller ANSI c json library instead of the one I chose, I think that's a fair request and I can start working this again. for now, I will put it behind a compile time flag as you suggested Edit: |
Co-authored-by: Viktor Szakats <vszakats@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We made the -f
function possible to work as a filter, as in it reads URLs and outputs the results in the mean time. This JSON reader reads the entire file before it acts on it, which makes it not work as a filter and if you want to work an a huge number of JSON URLs this will allocate a lot of memory...
@bagder, i've made a few big modifications. Mainly I made it so it reads one json object at a time, so it now only allocates as much memory as required to parse the largest single json object, instead of attempting to read until the of the file. I also made it so it reads in a fixed buffer on the stack, and then allocates more data on the heap if need be as you suggested. |
Co-authored-by: Viktor Szakats <vszakats@users.noreply.github.com>
@@ -33,6 +33,14 @@ CFLAGS += -Wconversion -Wmissing-prototypes -Wwrite-strings -Wsign-compare -Wno- | |||
ifndef NDEBUG | |||
CFLAGS += -g | |||
endif | |||
ifdef TRURL_JSON_IN | |||
CFLAGS += -DTRURL_JSON_IN -Wno-gnu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any info on -Wno-gnu
is it necessary? because it breaks the build on older GCCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm thats not good. It was required by json-c , I think some of there macros are what required this. I'm wondering if we should jump out of json-c, and use one that is purely c89 (and small enough we can just include the source in the repo or something like @vszakats has been saying)
I think a lot of the scaffolding can stay in place for parsing single objects / streaming stuff. it would really just be the guts of from_json() that would need to be swapped out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What OS / compiler have you been using?
With libjson-c installed (I tested against v0.17), I was able to build your branch using Worth setting |
*usedarg = gap; | ||
o->json_in = true; | ||
#else | ||
trurl_warnf(o, "not built with support for JSON input."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose using errorf()
here instead.
if(json_object_object_get_ex(wholeurl, "params", ¶ms)) { | ||
size_t params_length = json_object_array_length(params); | ||
for(size_t j = 0; j < params_length; j++) { | ||
json_object *param = json_object_array_get_idx(params, (int)j); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this function call fail and return NULL? Shouldn't this exit if so?
qpair[key_length] = '='; | ||
memcpy(qpair + key_length + 1, param_v, value_length); | ||
} | ||
this_query = realloc(this_query, this_q_size + qpair_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No check if this returns NULL.
if(this_q_size) { | ||
this_query[this_q_size - 1] = '\0'; | ||
const char *qss = "query:="; /* do not encode the url */ | ||
char *query_set_str = malloc(sizeof(char) * (this_q_size + strlen(qss))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here
size_t key_len = strlen(key); | ||
size_t val_len = strlen(val_str); | ||
/* +2, one char for '=' and one for null terminator. */ | ||
char *set_str = malloc(val_len + key_len + 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here
size_t val_len = strlen(val_str); | ||
/* +2, one char for '=' and one for null terminator. */ | ||
char *set_str = malloc(val_len + key_len + 2); | ||
memset(set_str, 0, val_len + key_len + 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why clear it first if you copy over the contents immediately below?
free(set_str); | ||
} | ||
if(!scheme_set) { | ||
setone(uh, "scheme=http", o); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it differ from how it acts on setting a plain URL. Then we let libcurl "guess" the scheme, which may not always be HTTP...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, I thought we just defaulted to HTTP. The way this is working is its constructing a CURLU with curl_url_set() via setone. From this we are setting setting every component manually and if we don't set a scheme libcurl throws an error not enough input for a url
.
Maybe the approach i took was a bit too low level. If instead we just concatenate all the components together in parts
and pass that string to single url then it would just do all the scheme guessing.
if(!scheme_set) { | ||
setone(uh, "scheme=http", o); | ||
} | ||
struct iterinfo iinfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declaring a variable in the middle of a block is not C89 compliant. I believe we still try to keep it C89...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From past experience with GCC 4.0.1, no, the way loops are done in the code base assume C99, which is why I added -std=gnu99
to CFLAGS
in the past (now gone, not a problem). If things are meant to build with a C89 compiler, I can raise issues for that separately as I have ancient GCC at hand to attempt things.
char reading_buff[JSON_INIT_SIZE]; | ||
size_t last_write = 0; | ||
size_t json_buf_size = JSON_INIT_SIZE; | ||
char *json_string = calloc(sizeof(char), json_buf_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no error check
size_t last_write = 0; | ||
size_t json_buf_size = JSON_INIT_SIZE; | ||
char *json_string = calloc(sizeof(char), json_buf_size); | ||
memset(reading_buff, 0, sizeof(char) * JSON_INIT_SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why clear this? It seems unnecessary.
"input": { | ||
"arguments": [ | ||
"-j", | ||
"testfiles/test0001.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
testfiles/test0001.txt
is not in this PR
|
char reading_buff[JSON_INIT_SIZE]; | ||
size_t last_write = 0; | ||
size_t json_buf_size = JSON_INIT_SIZE; | ||
char *json_string = calloc(sizeof(char), json_buf_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arguments in wrong order. From GCC 14.2.0 :
trurl.c:1744:37: error: 'calloc' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args]
1744 | char *json_string = calloc(sizeof(char), json_buf_size);
| ^~~~
trurl.c:1744:37: note: earlier argument should specify number of elements, later size of each element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh that's funny clang 15 didn't make a stink. I wonder if its a new warning in gcc 14?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
either way i can fix it. good catch.
Hello, sorry for letting this PR go stale. I have been thinking about this for a while. I agree with bagder and others that we don't get much from having JSON input. It just an easy feature to suggest. At first glance it makes sense that if we produce JSON we should be able to consume it, but as has been mentioned in this PR as well as the related issues, we wouldn't get much mileage out of it. The only real use case I can think of is using some external program to act as a complex filter / manipulator in between two trurl calls: $ trurl -f some_urls.txt --json | a_very_complicated_filter.py | trurl -j - > filtered_urls.txt but like has been said so many times this can be done by adding in a Moving away from the user experience side into the technical implementation, I don't like these changes. I think the library I chose is really nice to use but it's overkill for this use case and a much simpler one should have been used. @bagder I think we should close this and open a discussion around it, i do not think this code should be merged in. |
Thanks for being so open and flexible @jacobmealey, even after having spent the time doing this PR. Closing this now. I remain unconvinced that JSON input is very useful because the reasons mentioned above. |
This adds
--json-file
as an option to read from a file that contains a json representation of the urls. It works very similarly to--url-file
. The--json-file
input structure is designed to match roughly what the output of--json
produces. the simplest json object possible is:You can pass an arbitrary number of these url JSON objects and perform operations on them like you would with any other url in trurl.
Something that I did that may be up for debate is how the query is handled. I chose to make the query only be able to be set by an array of "keys" and "values" (the "params" section of --json") and it will throw a warning if the "query" component of "parts" is set.
A few examples:
Currently the feature is disabled by default, do turn it on you need to add
TRURL_JSON_IN
to your environment to enable it.fixes: #291