
Conversation

@pitrou (Member) commented Jun 11, 2019

No description provided.

==================

Arrow supports reading columnar data from JSON files. A JSON file consists
of multiple JSON objects, one per line, representing individual data rows.
Member:

Does Arrow support only such line-delimited JSON files (http://ndjson.org/, http://jsonlines.org/)?
Although it is probably rather common in certain file types (e.g. logs), it is not the standard "JSON file" (AFAIK). If so, that might be worth clarifying.

Member Author:

Is there a standard "JSON file"? In any case, the clarification is just above, so I'm not sure what needs explaining.

Member Author:

I'll try to make the wording a bit more explicit about this.

Member:

Not too familiar with standards, but at least in pandas the default for a JSON file is to consist of a single JSON object (which can then still take many different forms). That's also what the JSON Table Schema does.

But thanks for the update! My only remaining suggestion would be to explicitly use the term "newline-delimited JSON files" somewhere, as that seems to be the accepted term for this.

Member:

Data warehousing systems generally only support newline-delimited JSON, as it's the standard in "big data systems", e.g. for web application logging:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json

but we'll have to support more things in Arrow
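As a sketch of the newline-delimited format being discussed: each line is one complete JSON object, so every row can be parsed independently. The data below is made up for illustration, using only the standard library:

```python
import json

# Two rows of newline-delimited JSON (NDJSON): one object per line.
ndjson_text = '{"a": 1, "b": "foo"}\n{"a": 2, "b": "bar"}\n'

# Each non-empty line parses as an independent JSON object,
# which is what makes the format easy to split and stream.
rows = [json.loads(line) for line in ndjson_text.splitlines() if line]
print(rows)  # [{'a': 1, 'b': 'foo'}, {'a': 2, 'b': 'bar'}]
```

This per-line independence is what distinguishes NDJSON from a "standard" JSON file holding a single top-level object.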

Arrow :ref:`data types <data.types>` are inferred from the JSON types and
values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any other type.
Member:

With "fall back", you mean that it can be "upcasted" (maybe also not the correct term) if a subsequent row has a different type?
If my understanding is correct, "fall back" reads a bit strange. Maybe "get promoted"?

Member:

Although, now reading further, the "fall back" in the timestamp example is very logical... so maybe it is a good term after all.

Member Author:

"upcasted" and "promoted" are technical jargon that I find a bit confusing myself :-)
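To illustrate the "fall back" behaviour under discussion, here is a hypothetical pure-Python sketch (not Arrow's actual inference code): a column starts out as the ``null`` type and falls back to a concrete type at the first non-null value:

```python
def infer_column_type(values):
    """Sketch of null-fallback inference: a column of all nulls stays
    'null'; otherwise it falls back to the type of the first non-null
    value encountered."""
    inferred = "null"
    for v in values:
        if v is None:
            continue  # null is compatible with any type
        # Check bool before int: in Python, bool is a subclass of int.
        if isinstance(v, bool):
            candidate = "bool"
        elif isinstance(v, int):
            candidate = "int64"
        elif isinstance(v, float):
            candidate = "float64"
        else:
            candidate = "string"
        if inferred == "null":
            inferred = candidate  # fall back from null to a concrete type
    return inferred

print(infer_column_type([None, 2.5]))  # float64
print(infer_column_type([None, None]))  # null
```

The type names and the single-pass logic here are illustrative only; Arrow's real inference handles more cases (e.g. the timestamp fallback mentioned above).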

@codecov-io

Codecov Report

Merging #4521 into master will decrease coverage by 23.46%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #4521       +/-   ##
==========================================
- Coverage   88.26%   64.8%   -23.47%     
==========================================
  Files         846     475      -371     
  Lines      104224   61040    -43184     
  Branches     1253       0     -1253     
==========================================
- Hits        91993   39555    -52438     
- Misses      11986   21485     +9499     
+ Partials      245       0      -245
Impacted Files Coverage Δ
python/pyarrow/_csv.pyx 99.15% <ø> (ø) ⬆️
python/pyarrow/_json.pyx 90.47% <ø> (ø) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/extension_type.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/kernels/compare.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util-internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/sse-util.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
... and 602 more

Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7b5656...40064de.

@dhirschfeld

I'm keen to give the Python JSON parser a go; however, my data is in a "data" subkey, e.g.:

{
  "metadata": { ... },
  "data" : [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}

Will I be able to supply the key where the data is contained? Will pyarrow also be able to return the metadata (not columnar data) as a Python dict?

@wesm (Member) commented Jun 12, 2019

@dhirschfeld no, not yet, it only supports line-delimited JSON for the moment. In principle the reader-parser should be able to be fed a general stream of JSON records (possibly all coming from the same RapidJSON reader object) but it will have to be refactored to generalize for these other cases. Can you open a JIRA issue?
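Until such support lands, one possible workaround (a stdlib-only sketch; the structure and key names follow the example above) is to pull out the nested record list and re-serialize it as line-delimited JSON, which a line-delimited reader accepts, while keeping the metadata as a plain Python dict:

```python
import io
import json

doc = {
    "metadata": {"source": "example"},
    "data": [
        {"a": 1, "b": 2.0, "c": "foo", "d": False},
        {"a": 4, "b": -5.5, "c": None, "d": True},
    ],
}

# Keep the non-columnar part as a plain Python dict...
metadata = doc["metadata"]

# ...and flatten the "data" list into newline-delimited JSON,
# one serialized object per line.
buf = io.StringIO()
for record in doc["data"]:
    buf.write(json.dumps(record) + "\n")

ndjson = buf.getvalue()
print(ndjson.count("\n"))  # one newline per record
```

The resulting buffer could then be handed to any newline-delimited JSON reader; for large files a streaming JSON parser would be preferable to loading the whole document first.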

@dhirschfeld

I opened https://issues.apache.org/jira/browse/ARROW-5568 to track the feature request...

@wesm (Member) left a comment:
+1

@wesm wesm closed this in ac4a9ef Jun 12, 2019
@pitrou pitrou deleted the ARROW-5556-py-json-docs branch June 12, 2019 16:02

6 participants