Malformed JSON error, suggestion is to increase "maximum_object_size" #6569

Closed
bartek opened this issue Mar 4, 2023 · 6 comments

@bartek

bartek commented Mar 4, 2023

What happens?

I was reading the latest blog post on shredding deeply nested JSON [1] and could not read my JSON file (though I could read newline-delimited JSON) following the examples.

The file I am attempting to read is a structured log file. The original output is actually newline-delimited, but I used jq --slurp to convert it to an array. Here's the head of the file, showing one typical record (snipped, of course):

[
  {
    "@timestamp": "2023-03-02T03:09:27.402Z",
    "bucket": "logs",
    "ecs.version": "1.6.0",
    "git_commit": "77bf9460",
    "key": "notifications@github.com",
    "log.level": "warn",
    "log.origin": {
      "file.line": 106,
      "file.name": "mail/transform.go"
    },
    "log_group": "awslogs/bartek/fleet",
    "message": "transformMessage contributor: normalization failed",
    "run_id": 1677691571,
    "service.name": "bartek.fleet",
    "service.version": "0.1.0",
    "source": "github"
  },
  ...
]

When trying to do the simple query provided in the blog post, I get the following:

D select count(*) from 'out.json';
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".

I thought maybe the actual error was being obscured by this. Since I need a manual schema to read this file (the keys and nesting are not consistent), I tried a simple schema, but got the same result:

D select * from read_json('out.json', columns={message: 'VARCHAR'});
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".

No luck! There is nothing of interest at byte 2097149; it's likely just where DuckDB gave up on the record (I assume DuckDB sees everything up to that byte as one record; 2097149 is a few bytes shy of 2 MiB, which looks like a buffer or object-size limit rather than anything in the data itself). The file is 147 MB.

So, I tried the newline delimited JSON file, providing a schema:

D select * from read_json_auto('newline.json', columns={message: 'VARCHAR'});
┌───────────────┐
│    message    │
│    varchar    │
├───────────────┤
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │

It worked! But why was my JSON-array-formatted file not working? I tried the todos.json [2] from the blog post and had no problem parsing it:

D select * from 'todos.json';
┌────────┬────────┬───────────────────────────────────────────────────────────────────────────┬───────────┐
│ userId │   id   │                                   title                                   │ completed │
│ uint64 │ uint64 │                                  varchar                                  │  boolean  │
├────────┼────────┼───────────────────────────────────────────────────────────────────────────┼───────────┤
│      1 │      1 │ delectus aut autem                                                        │ false     │
│      1 │      2 │ quis ut nam facilis et officia qui                                        │ false     │
│      1 │      3 │ fugiat veniam minus                                                       │ false     │
│      1 │      4 │ et porro tempora                                                          │ true      │

Perhaps this has to do with the varying structure of my JSON file, or some of the deep nesting within it, or maybe I'm missing something obvious and silly on this Saturday morning?

Regardless, super impressed with the speed and intuitive nature of the tool. Reading the JSONL data was snappy and I was able to produce the view I wanted in seconds. That's awesome.

[1] https://duckdb.org/2023/03/03/json.html
[2] https://jsonplaceholder.typicode.com/todos

To Reproduce

Provided in "What happens?" above.

OS:

macOS

DuckDB Version:

v0.7.1 b00b93f

DuckDB Client:

Shell (duckdb invoke)

Full Name:

Bartek Ciszkowski

Affiliation:

Myself!

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@Alex-Monahan
Contributor

Thanks for the bug report! There are a few options that you could try. If these do not work, would you be open to emailing the file directly so that we can troubleshoot?

  1. Pass in maximum_object_size as a parameter to read_json_auto with a higher value
  2. Give read_json_auto a hint as to the structure with json_format='array_of_records'

Let us know what you find!

@bartek
Author

bartek commented Mar 4, 2023

@Alex-Monahan Thanks for the ideas

D select * from read_json_auto('out.json', columns={message: 'VARCHAR'}, json_format='array_of_records');
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".
D 

Seems to get a count if I boost the size to something ridiculous:

D select count(*) from read_json_auto('array.json', columns={message: 'VARCHAR'}, maximum_object_size=104857600);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       239815 │
└──────────────┘

But when trying to read the data:

D select * from read_json_auto('out.json', columns={message: 'VARCHAR'}, maximum_object_size=104857600);
Error: INTERNAL Error: Unexpected yyjson tag in ValTypeToString
D 

Not sure off the bat which field this may be struggling with, and it's interesting that I don't see this internal error when reading the JSONL.

If these do not work, would you be open to emailing the file directly so that we can troubleshoot?

Should not be a problem. This output is serialized from a zap logger, but I'll need to obfuscate some things.

@Alex-Monahan
Contributor

Alex-Monahan commented Mar 4, 2023

Very interesting that the count worked!

Would you mind trying message: 'JSON'? I wonder if it is coming in as something other than a VARCHAR and we aren't falling back to it.

Thanks for being open to sending the file! It truly is the best way to solve these kinds of errors.
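
A minimal sketch of that experiment, reusing the columns syntax and the 100 MB maximum_object_size value from the earlier queries in this thread (both parameter values are illustrative, not requirements):

-- Hypothetical follow-up: read the message field as JSON instead of VARCHAR
-- to see whether the type fallback is what trips things up.
select *
from read_json_auto('out.json',
                    columns={message: 'JSON'},
                    maximum_object_size=104857600);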

@lnkuiper
Contributor

lnkuiper commented Mar 8, 2023

Thanks for the bug report! The issue here is that maximum_object_size determines the maximum size of a single JSON document. In your case, the file contains exactly one JSON document: one array of objects. So, with a large enough maximum_object_size, it should parse correctly. The INTERNAL Error, which happens when transforming the parsed JSON to columnar data, should not be thrown. I will look into this.
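
As a sketch of the workaround on v0.7.1, the whole ~147 MB file can be parsed as one JSON value by raising maximum_object_size above the file size (the 150 MiB figure below is only an illustrative choice, not a recommended default):

-- Parse the entire array as a single JSON value; 157286400 bytes = 150 MiB,
-- picked only because it exceeds the ~147 MB file.
select count(*)
from read_json_auto('out.json',
                    json_format='array_of_records',
                    columns={message: 'VARCHAR'},
                    maximum_object_size=157286400);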

@bartek
Author

bartek commented Mar 8, 2023

@lnkuiper Cool. To confirm, the file does indeed contain one array of many objects. The first snippet I showed is just one record of thousands, similar to the todos.json presented in the blog post (albeit more complex).

@lnkuiper
Contributor

Closed via #7478

We now support streaming reads of JSON arrays.
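
Assuming a build that includes #7478 (i.e. newer than the v0.7.1 used in this report), the queries from the original report should then work on the array file without raising maximum_object_size:

-- The original queries from this report, unchanged:
select count(*) from 'out.json';
select * from read_json('out.json', columns={message: 'VARCHAR'});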
