
ENH: addurls: Support reading records from stdin #4669

Merged
merged 2 commits into datalad:master from addurls-stdin on Jul 8, 2020

Conversation

@kyleam (Contributor) commented on Jul 2, 2020

URL-FILE can be a CSV file or a JSON file with an array of records.
To allow the user to supply records on stdin, avoiding an intermediate
file in some cases, read from stdin when URL-FILE is "-".
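
For illustration only (not part of the change itself; the field names "url" and "filename" and the dataset path are made-up assumptions), feeding a JSON array on stdin could look like this:

# Hypothetical usage sketch: pipe a JSON array of records to addurls via "-".
# The "url" and "filename" fields are assumed names, not mandated by addurls.
echo '[{"url": "https://example.com/f1.dat", "filename": "f1.dat"}]' \
    | datalad addurls -d . - '{url}' '{filename}'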

There are two main options for the JSON input on stdin: JSON lines or
a JSON array. Go with an array, which has the advantage that it
matches what addurls expects for a JSON file, which is good for both
user expectations and not introducing divergent code paths. It also
makes it clearer that addurls slurps all the records into memory
rather than working with them as they come in. The disadvantage is the
need for another intermediate reformatting step, though it's trivial
to reformat JSON lines as an array with jq or similar tools.
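
As a sketch of that reformatting step (the file name records.jsonl and the format fields are assumptions), jq's --slurp option wraps JSON lines into a single array:

# Hypothetical example: records.jsonl holds one JSON object per line;
# jq --slurp reads them all and emits one JSON array that addurls accepts.
jq --slurp . records.jsonl | datalad addurls - '{url}' '{filename}'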

The --input-type default value is "ext", which means to infer the
input type based on the file extension. For stdin, we could rework
things to try json.load() and then fall back to processing the input
as CSV. But for now, assume that stdin should be treated as JSON by
default unless --input-type specifies that it is CSV.
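
A sketch of the CSV case under that assumption (records.csv and the format fields are made up):

# Hypothetical example: stdin defaults to JSON, so CSV input has to be
# declared explicitly with --input-type.
datalad addurls --input-type csv - '{url}' '{filename}' < records.csv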

Closes #3557.


Note: Using this triggers what I think is a mistake (even if deliberate) in our "is interactive?" check: the check says "no" when feeding stdin even when output streams are still attached to a tty. I'll have to look more into the history of that and will open a separate PR if I think we can drop the stdin part from is_interactive.

@yarikoptic (Member) commented

Thank you @kyleam !

> The disadvantage is the need for another intermediate reformatting step, though it's trivial to reformat JSON lines as an array with jq or similar tools.

I keep getting lost in jq "language" unless I use it consistently... Having an explicit option for json-lines or automagically determining the format (starts with [ -- pure json list, starts with { -- lines) would be of great assistance.

@kyleam (Contributor, Author) commented on Jul 2, 2020

> I keep getting lost in jq "language" unless I use it consistently... Having an explicit option for json-lines or automagically determining the format (starts with [ -- pure json list, starts with { -- lines) would be of great assistance.

In my view this isn't a compelling reason to complicate things on DataLad's side.

@kyleam (Contributor, Author) commented on Jul 2, 2020

The failing job on Travis is due to a timeout for the codecov submission: https://travis-ci.org/github/datalad/datalad/jobs/704417042#L2227

The failing AppVeyor runs are the now-regular codecov submission failures. (I'm not sure there's much to do on our end, but I'll open an issue.)

@yarikoptic (Member) commented

>> I keep getting lost in jq "language" unless I use it consistently... Having an explicit option for json-lines or automagically determining the format (starts with [ -- pure json list, starts with { -- lines) would be of great assistance.

> In my view this isn't a compelling reason to complicate things on DataLad's side.

Is it complicated to just slurp JSON lines into an array? It is much easier to do within DataLad than to install an additional tool and/or erect a complicated pipeline outside.

@kyleam (Contributor, Author) commented on Jul 2, 2020

> Is it complicated to just slurp JSON lines into an array?

It's more complicated than what is there now, introduces more code when there are tools that can readily do the job, and conceptually introduces yet another way to input data.

> It is much easier to do within DataLad than to install an additional tool and/or erect a complicated pipeline outside.

There's a balance, and I'd argue that DataLad is already on the wrong side of that balance and has unnecessary bloat.

@mih (Member) commented on Jul 3, 2020

My 2ct: I think in general it would be nice to have it read line by line, but there is little point in faking incremental processing when nothing like that is actually happening. A practical way to avoid forgetting the right jq command is to add an example:

It seems that this is all it would need:

proc_to_yield_lines | jq --slurp . | datalad addurls ...

Simple enough IMHO.

@yarikoptic (Member) commented

OK, it would suffice for now -- I just need to not forget to add jq to any container I might like to build with DataLad.

@kyleam merged commit d8ce049 into datalad:master on Jul 8, 2020
@kyleam deleted the addurls-stdin branch on July 8, 2020 at 21:24