ARROW-5556: [Doc] [Python] Document JSON reader #4521
docs/source/python/json.rst
Outdated
==================

Arrow supports reading columnar data from JSON files. A JSON file consists
of multiple JSON objects, one per line, representing individual data rows.
Does Arrow support only such line-delimited JSON files? (http://ndjson.org/, http://jsonlines.org/)
Although it is probably rather common in certain file types (e.g. logs), it is not the standard "JSON file" (AFAIK). If so, that might be worth clarifying.
Is there a standard "JSON file"? In any case, the clarification is just above, so I'm not sure what needs explaining.
I'll try to make the wording a bit more explicit about this.
Not too familiar with standards, but at least in pandas the default for a JSON file is to consist of a single JSON object (which can then still take many different forms). That's also what the JSON Table Schema does.
But thanks for the update! My only remaining suggestion would be to explicitly use the term "newline-delimited JSON files" somewhere, as that seems to be the accepted term for this.
Data warehousing systems generally only support newline-delimited JSON, as it's the standard in "big data systems", e.g. for web application logging:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json
But we'll have to support more things in Arrow.
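To make the "one JSON object per line" layout discussed in this thread concrete, here is a minimal sketch. The first part uses only the standard library to show what the format means; `rows_to_columns` is a hypothetical helper written for illustration, not part of Arrow. The last part calls `pyarrow.json.read_json` on the same bytes, and is skipped if `pyarrow` is not installed.

```python
import io
import json

# Two rows in newline-delimited form: one complete JSON object per line.
NDJSON = (
    b'{"a": 1, "b": 2.0, "c": "foo", "d": false}\n'
    b'{"a": 4, "b": -5.5, "c": null, "d": true}\n'
)


def rows_to_columns(raw):
    """Parse one JSON object per line and pivot the rows into columns.

    Hypothetical helper for illustration; it assumes every row has the
    same keys, which a real reader would not require.
    """
    columns = {}
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = rows_to_columns(NDJSON)
print(columns)

# Optional: read the same bytes with the Arrow JSON reader, if available.
try:
    from pyarrow import json as pa_json
except ImportError:
    pa_json = None

if pa_json is not None:
    table = pa_json.read_json(io.BytesIO(NDJSON))
    print(table.schema)
```

A file holding a single top-level JSON object or array, which is the pandas default mentioned above, does not fit this layout, hence the suggestion to name the format explicitly.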
Arrow :ref:`data types <data.types>` are inferred from the JSON types and
values of each column:

* JSON null values convert to the ``null`` type, but can fall back to any
With "fall back", do you mean that it can be "upcasted" (maybe also not the correct term) if a subsequent row has a different type?
If my understanding is correct, "fall back" reads a bit strange. Maybe "get promoted"?
Although, now that I read further, the "fall back" in the timestamp example is very logical, so maybe it is a good term after all.
"upcasted" and "promoted" are technical jargon that I find a bit confusing myself :-)
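As a rough illustration of the "fall back" wording debated here, the toy model below tracks one kind per column and lets a column that has only seen nulls adopt the first non-null kind it meets. This is an assumption-laden sketch, not Arrow's inference code: `json_kind`, `merge_kinds`, and `infer_column_kinds` are hypothetical names, and the kinds are JSON kinds rather than Arrow data types.

```python
import json


def json_kind(value):
    """Name the JSON kind of a decoded value (hypothetical helper)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def merge_kinds(current, new):
    """Combine the kind seen so far with the kind of a new value.

    "null" acts as a placeholder that falls back to any other kind.
    Any other mismatch is rejected in this toy model.
    """
    if new == "null":
        return current
    if current == "null":
        return new
    if current == new:
        return current
    raise ValueError(f"conflicting kinds: {current} vs {new}")


def infer_column_kinds(lines):
    """Infer one kind per column from newline-delimited JSON rows."""
    kinds = {}
    for line in lines:
        for key, value in json.loads(line).items():
            kinds[key] = merge_kinds(kinds.get(key, "null"), json_kind(value))
    return kinds


lines = [
    '{"a": null, "c": "foo", "e": null}',
    '{"a": 4, "c": null, "e": null}',
]
kinds = infer_column_kinds(lines)
print(kinds)
```

In this model, column `a` starts as null and falls back to number, `c` stays string despite the later null, and `e` remains null because no other kind ever appears.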
Codecov Report
@@ Coverage Diff @@
## master #4521 +/- ##
==========================================
- Coverage 88.26% 64.8% -23.47%
==========================================
Files 846 475 -371
Lines 104224 61040 -43184
Branches 1253 0 -1253
==========================================
- Hits 91993 39555 -52438
- Misses 11986 21485 +9499
+ Partials 245 0 -245
Continue to review full report at Codecov.
I'm keen to give the Python json parser a go however my data is in a
{
  "metadata": { ... },
  "data" : [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}
Will I be able to supply the key where the data is contained? Will
@dhirschfeld no, not yet, it only supports line-delimited JSON for the moment. In principle the reader-parser should be able to be fed a general stream of JSON records (possibly all coming from the same RapidJSON reader object) but it will have to be refactored to generalize for these other cases. Can you open a JIRA issue?

I opened https://issues.apache.org/jira/browse/ARROW-5568 to track the feature request...
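Until something like ARROW-5568 lands, one possible workaround is to unwrap the document in plain Python and re-emit the records one per line, which is the layout the reader accepts. The sketch below uses only the standard library. `records_to_ndjson` and its `key` parameter are hypothetical names, and the `metadata` content is a placeholder, since the original comment elides it.

```python
import json


def records_to_ndjson(document_text, key="data"):
    """Extract the list stored under `key` and serialize it as
    newline-delimited JSON bytes (one object per line).

    Hypothetical helper; it loads the whole document into memory, so it
    only suits inputs that fit comfortably in RAM.
    """
    document = json.loads(document_text)
    records = document[key]
    return "".join(json.dumps(record) + "\n" for record in records).encode("utf-8")


# Same shape as the example above; the metadata value is a placeholder.
wrapped = """
{
  "metadata": {"note": "placeholder"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}
"""

ndjson = records_to_ndjson(wrapped)
print(ndjson.decode("utf-8"), end="")
```

The resulting bytes can be wrapped in `io.BytesIO` and passed to `pyarrow.json.read_json`, or written to a file first.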
wesm
left a comment
+1
No description provided.