
Conversation

@pitrou (Member) commented Jun 11, 2019

No description provided.

==================

Arrow supports reading columnar data from JSON files. A JSON file consists
of multiple JSON objects, one per line, representing individual data rows.
Member:

Does Arrow support only such line-delimited JSON files (http://ndjson.org/, http://jsonlines.org/)?
Although it is probably rather common in certain file types (e.g. logs), it is not the standard "JSON file" (AFAIK). If so, that might be worth clarifying.

Member Author:

Is there a standard "JSON file"? In any case, the clarification is just above, so I'm not sure what needs explaining.

Member Author:

I'll try to make the wording a bit more explicit about this.

Member:

Not too familiar with standards, but at least in pandas the default for a JSON file is to consist of a single JSON object (which can then still take many different forms). That's also what the JSON Table Schema does.

But thanks for the update! My only remaining suggestion would be to explicitly use the term "newline-delimited JSON files" somewhere, as that seems to be the accepted term for this.

Member:

Data warehousing systems generally only support newline-delimited JSON, as it's the standard in "big data systems", e.g. for web application logging:

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json

but we'll have to support more things in Arrow
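As a sketch of the newline-delimited format being discussed: each line is one complete JSON object, so every row can be parsed independently. The data below is made up for illustration, using only the standard library:

```python
import json

# Two rows of newline-delimited JSON (NDJSON): one object per line.
ndjson_text = '{"a": 1, "b": "foo"}\n{"a": 2, "b": "bar"}\n'

# Each non-empty line parses as an independent JSON object,
# which is what makes the format easy to split and stream.
rows = [json.loads(line) for line in ndjson_text.splitlines() if line]
print(rows)  # [{'a': 1, 'b': 'foo'}, {'a': 2, 'b': 'bar'}]
```

This per-line independence is what distinguishes NDJSON from a "standard" JSON file holding a single top-level object.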

Arrow :ref:`data types <data.types>` are inferred from the JSON types and
values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any other type.
Member:

With "fall back", you mean that it can be "upcasted" (maybe also not the correct term) if a subsequent row has a different type?
If my understanding is correct, "fall back" reads a bit strange. Maybe "get promoted"?

Member:

Although, now reading further, the "fall back" in the timestamp example is very logical... so maybe it is a good term after all.

Member Author:

"upcasted" and "promoted" are technical jargon that I find a bit confusing myself :-)
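To illustrate the "fall back" behaviour under discussion, here is a hypothetical pure-Python sketch (not Arrow's actual inference code): a column starts out as the ``null`` type and falls back to a concrete type at the first non-null value:

```python
def infer_column_type(values):
    """Sketch of null-fallback inference: a column of all nulls stays
    'null'; otherwise it falls back to the type of the first non-null
    value encountered."""
    inferred = "null"
    for v in values:
        if v is None:
            continue  # null is compatible with any type
        # Check bool before int: in Python, bool is a subclass of int.
        if isinstance(v, bool):
            candidate = "bool"
        elif isinstance(v, int):
            candidate = "int64"
        elif isinstance(v, float):
            candidate = "float64"
        else:
            candidate = "string"
        if inferred == "null":
            inferred = candidate  # fall back from null to a concrete type
    return inferred

print(infer_column_type([None, 2.5]))  # float64
print(infer_column_type([None, None]))  # null
```

The type names and the single-pass logic here are illustrative only; Arrow's real inference handles more cases (e.g. the timestamp fallback mentioned above).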

@codecov-io

Codecov Report

Merging #4521 into master will decrease coverage by 23.46%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #4521       +/-   ##
==========================================
- Coverage   88.26%   64.8%   -23.47%     
==========================================
  Files         846     475      -371     
  Lines      104224   61040    -43184     
  Branches     1253       0     -1253     
==========================================
- Hits        91993   39555    -52438     
- Misses      11986   21485     +9499     
+ Partials      245       0      -245
Impacted Files Coverage Δ
python/pyarrow/_csv.pyx 99.15% <ø> (ø) ⬆️
python/pyarrow/_json.pyx 90.47% <ø> (ø) ⬆️
cpp/src/arrow/util/memory.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/date_utils.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/extension_type.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/compute/kernels/compare.h 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/memory.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/filesystem/util-internal.cc 0% <0%> (-100%) ⬇️
cpp/src/arrow/util/sse-util.h 0% <0%> (-100%) ⬇️
cpp/src/gandiva/decimal_type_util.h 0% <0%> (-100%) ⬇️
... and 602 more

Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c7b5656...40064de.

@dhirschfeld

I'm keen to give the Python JSON parser a go; however, my data is in a "data" subkey, e.g.:

{
  "metadata": { ... },
  "data" : [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}

Will I be able to supply the key where the data is contained? Will pyarrow also be able to return the metadata (not columnar data) as a Python dict?

@wesm (Member) commented Jun 12, 2019

@dhirschfeld no, not yet, it only supports line-delimited JSON for the moment. In principle the reader-parser should be able to be fed a general stream of JSON records (possibly all coming from the same RapidJSON reader object) but it will have to be refactored to generalize for these other cases. Can you open a JIRA issue?
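Until such support lands, one possible workaround (a stdlib-only sketch; the structure and key names follow the example above) is to pull out the nested record list and re-serialize it as line-delimited JSON, which a line-delimited reader accepts, while keeping the metadata as a plain Python dict:

```python
import io
import json

doc = {
    "metadata": {"source": "example"},
    "data": [
        {"a": 1, "b": 2.0, "c": "foo", "d": False},
        {"a": 4, "b": -5.5, "c": None, "d": True},
    ],
}

# Keep the non-columnar part as a plain Python dict...
metadata = doc["metadata"]

# ...and flatten the "data" list into newline-delimited JSON,
# one serialized object per line.
buf = io.StringIO()
for record in doc["data"]:
    buf.write(json.dumps(record) + "\n")

ndjson = buf.getvalue()
print(ndjson.count("\n"))  # one newline per record
```

The resulting buffer could then be handed to any newline-delimited JSON reader; for large files a streaming JSON parser would be preferable to loading the whole document first.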

@dhirschfeld

I opened https://issues.apache.org/jira/browse/ARROW-5568 to track the feature request...

@wesm (Member) left a comment:
+1

@wesm wesm closed this in ac4a9ef Jun 12, 2019
@pitrou pitrou deleted the ARROW-5556-py-json-docs branch June 12, 2019 16:02

6 participants