Malformed JSON error, suggestion is to increase "maximum_object_size" #6569

Closed
bartek opened this issue Mar 4, 2023 · 6 comments

@bartek

bartek commented Mar 4, 2023

What happens?

I was reading the latest blog post on shredding deeply nested JSON [1] and could not read my JSON file (though I could read newline-delimited JSON) following the examples.

The file I am attempting to read is a structured log file. The original output is actually newline-delimited, but I used jq --slurp to convert it to an array. Here's the head of the file, showing one typical record (snipped, of course):

[
  {
    "@timestamp": "2023-03-02T03:09:27.402Z",
    "bucket": "logs",
    "ecs.version": "1.6.0",
    "git_commit": "77bf9460",
    "key": "notifications@github.com",
    "log.level": "warn",
    "log.origin": {
      "file.line": 106,
      "file.name": "mail/transform.go"
    },
    "log_group": "awslogs/bartek/fleet",
    "message": "transformMessage contributor: normalization failed",
    "run_id": 1677691571,
    "service.name": "bartek.fleet",
    "service.version": "0.1.0",
    "source": "github"
  },
  ...
]

When trying to do the simple query provided in the blog post, I get the following:

D select count(*) from 'out.json';
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".

I thought maybe the actual error was being obscured by this. Since I need a manual schema to read this file (the keys and nesting are not consistent), I tried a simple schema, but got the same result:

D select * from read_json('out.json', columns={message: 'VARCHAR'});
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".

No luck! There is nothing of interest at byte 2097149; it's likely just where DuckDB gave up on the record (I assume DuckDB sees everything up to that byte as one record; 2097149 is a few bytes shy of 2 MiB, which looks like a buffer or object-size limit rather than anything in the data itself). The file is 147 MB.

So, I tried the newline delimited JSON file, providing a schema:

D select * from read_json_auto('newline.json', columns={message: 'VARCHAR'});
┌───────────────┐
│    message    │
│    varchar    │
├───────────────┤
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │
│ http-request  │

It worked! But why was my JSON-array-formatted file not working? I tried the todos.json [2] from the blog post and had no problem parsing it:

D select * from 'todos.json';
┌────────┬────────┬───────────────────────────────────────────────────────────────────────────┬───────────┐
│ userId │   id   │                                   title                                   │ completed │
│ uint64 │ uint64 │                                  varchar                                  │  boolean  │
├────────┼────────┼───────────────────────────────────────────────────────────────────────────┼───────────┤
│      1 │      1 │ delectus aut autem                                                        │ false     │
│      1 │      2 │ quis ut nam facilis et officia qui                                        │ false     │
│      1 │      3 │ fugiat veniam minus                                                       │ false     │
│      1 │      4 │ et porro tempora                                                          │ true      │

Perhaps this has to do with the varying structure of my JSON file, or some of the deep nesting within it, or maybe I'm missing something obvious and silly on this Saturday morning?

Regardless, super impressed with the speed and intuitive nature of the tool. Reading the JSONL data was snappy and I was able to produce the view I wanted in seconds. That's awesome.

[1] https://duckdb.org/2023/03/03/json.html
[2] https://jsonplaceholder.typicode.com/todos

To Reproduce

Provided in "What happens?" above.

OS:

macOS

DuckDB Version:

v0.7.1 b00b93f

DuckDB Client:

Shell (duckdb invoke)

Full Name:

Bartek Ciszkowski

Affiliation:

Myself!

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@Alex-Monahan
Contributor

Thanks for the bug report! There are a few options that you could try. If these do not work, would you be open to emailing the file directly so that we can troubleshoot?

  1. Pass in maximum_object_size as a parameter to read_json_auto with a higher value
  2. Give read_json_auto a hint as to the structure with json_format='array_of_records'

Let us know what you find!

@bartek
Author

bartek commented Mar 4, 2023

@Alex-Monahan Thanks for the ideas

D select * from read_json_auto('out.json', columns={message: 'VARCHAR'}, json_format='array_of_records');
Error: Invalid Input Error: Malformed JSON in file "out.json", at byte 2097149 in object 2: unexpected end of data. Try increasing "maximum_object_size".
D 

Seems to get a count if I boost the size to something ridiculous:

D select count(*) from read_json_auto('array.json', columns={message: 'VARCHAR'}, maximum_object_size=104857600);
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│       239815 │
└──────────────┘

But when trying to read the data:

D select * from read_json_auto('out.json', columns={message: 'VARCHAR'}, maximum_object_size=104857600);
Error: INTERNAL Error: Unexpected yyjson tag in ValTypeToString
D 

Not sure off the bat which field this may be struggling with, and it's interesting that I don't see this internal error when reading the JSONL.

If these do not work, would you be open to emailing the file directly so that we can troubleshoot?

Should not be a problem. This output is serialized from a zap logger, but I'll need to obfuscate some things.

@Alex-Monahan
Contributor

Alex-Monahan commented Mar 4, 2023

Very interesting that the count worked!

Would you mind trying message: 'JSON'? I wonder if it is coming in as something other than a VARCHAR and we aren't falling back to it.

Thanks for being open to sending the file! It truly is the best way to solve these kinds of errors.
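
A minimal sketch of that experiment, reusing the columns syntax and the 100 MB maximum_object_size value from the earlier queries in this thread (both parameter values are illustrative, not requirements):

-- Hypothetical follow-up: read the message field as JSON instead of VARCHAR
-- to see whether the type fallback is what trips things up.
select *
from read_json_auto('out.json',
                    columns={message: 'JSON'},
                    maximum_object_size=104857600);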

@lnkuiper
Contributor

lnkuiper commented Mar 8, 2023

Thanks for the bug report! The issue here is that maximum_object_size determines the maximum size of a single JSON document. In your case, the file contains exactly one JSON document: one array of objects. So, with a large enough maximum_object_size, it should parse correctly. The INTERNAL Error, which happens when transforming the parsed JSON to columnar data, should not be thrown. I will look into this.
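
As a sketch of the workaround on v0.7.1, the whole ~147 MB file can be parsed as one JSON value by raising maximum_object_size above the file size (the 150 MiB figure below is only an illustrative choice, not a recommended default):

-- Parse the entire array as a single JSON value; 157286400 bytes = 150 MiB,
-- picked only because it exceeds the ~147 MB file.
select count(*)
from read_json_auto('out.json',
                    json_format='array_of_records',
                    columns={message: 'VARCHAR'},
                    maximum_object_size=157286400);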

@bartek
Author

bartek commented Mar 8, 2023

@lnkuiper Cool. To confirm, the file does indeed contain one array of many objects. The first snippet I showed is just one record of thousands, similar to the todos.json presented in the blog post (albeit more complex).

@lnkuiper
Contributor

Closed via #7478

We now support streaming reads of JSON arrays.
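
Assuming a build that includes #7478 (i.e. newer than the v0.7.1 used in this report), the queries from the original report should then work on the array file without raising maximum_object_size:

-- The original queries from this report, unchanged:
select count(*) from 'out.json';
select * from read_json('out.json', columns={message: 'VARCHAR'});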
