keys are dropped when reading in json newline #490

isichei · 2020-10-20T17:39:14Z

Overview

Additional keys (columns) in a JSON Lines (.jsonl) file are dropped if they do not appear in the first JSON object of the file. I have two files:

test_data1.jsonl

{"employee_name": "Michael Roth", "employee_id": 1}
{"employee_id": 2, "employee_name": null}
{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}

test_data2.jsonl (same as test_data1.jsonl but with rows 1 and 3 swapped)

{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}
{"employee_name": "Michael Roth", "employee_id": 1}
{"employee_id": 2, "employee_name": null}

When the files are read via the Table class, the columns appear to be defined from the first row only. So table1 is read incorrectly (the extra column is dropped), while table2 is read correctly because the key exists in the first row and is filled with null in the subsequent rows.

from pprint import pprint
from frictionless import Table

with Table("test_data1.jsonl") as table:
    print("table1...")
    pprint(table.read_rows())
    
with Table("test_data2.jsonl") as table:
    print("table2...")
    pprint(table.read_rows())

Outputs:

table1...
[Row([('employee_id', 1), ('employee_name', 'Michael Roth')]),
 Row([('employee_id', 2), ('employee_name', None)]),
 Row([('employee_id', 3), ('employee_name', 'bobby bob')])]

table2...
[Row([('employee_id', 3), ('employee_name', 'bobby bob'), ('new_col', 'x')]),
 Row([('employee_id', 1), ('employee_name', 'Michael Roth'), ('new_col', None)]),
 Row([('employee_id', 2), ('employee_name', None), ('new_col', None)])]

Using frictionless 3.23.3

Please preserve this line to notify @roll (lead of this repository)


roll · 2020-10-22T07:10:06Z

Hi @isichei,

Thanks, it's expected behavior as the first row is used to detect keys.

You can provide the keys manually to control which fields are read:

from frictionless import Table, dialects

dialect = dialects.JsonDialect(keys=["employee_name", "employee_id", "new_col"])
with Table("tmp/issue490.jsonl", dialect=dialect) as table:
    print(table.read_rows())
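To illustrate why the first row matters and what an explicit key list changes, here is a pure-Python sketch. It is not frictionless's actual implementation, only a model of the behavior described in this thread; `rows_from_jsonl` is a hypothetical helper name.

```python
import json


def rows_from_jsonl(lines, keys=None):
    """Project each JSON object onto a fixed key list (hypothetical helper).

    If no keys are given, they are taken from the first record only,
    which mimics the behavior reported in this issue.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keys is None:
            keys = list(record)  # keys are fixed by the first record
        # Keys missing from a record become None; unknown keys are ignored.
        rows.append({key: record.get(key) for key in keys})
    return rows


lines = [
    '{"employee_name": "Michael Roth", "employee_id": 1}',
    '{"employee_id": 2, "employee_name": null}',
    '{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}',
]

# Keys inferred from the first record: "new_col" is lost.
first_row_only = rows_from_jsonl(lines)

# Keys given explicitly: "new_col" is kept and filled with None where absent.
explicit = rows_from_jsonl(lines, keys=["employee_name", "employee_id", "new_col"])
```

In this model, supplying the keys up front plays the same role as the `keys` argument of `JsonDialect` above.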

Please re-open if needed

isichei · 2020-10-23T11:03:28Z

Hi @roll,

Happy to keep this closed if you think it should be. I guess the workaround for our needs would be to parse the data once to collect the keys and then pass them to the Table class (via the dialect).
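A minimal sketch of that first pass, assuming every non-blank line of the file is a JSON object; `collect_jsonl_keys` is a hypothetical helper name, not part of frictionless:

```python
import json


def collect_jsonl_keys(path):
    """Return every key seen in a JSON Lines file, in first-seen order.

    Hypothetical helper: assumes each non-blank line is a JSON object.
    """
    keys = {}  # a dict keeps insertion order, so it works as an ordered set
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            for key in json.loads(line):
                keys.setdefault(key, None)
    return list(keys)
```

The returned list could then be passed as `keys` to `dialects.JsonDialect`, as in the example from @roll above. The cost is that the file is read twice.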

However, the same keys are skipped when using describe, which I would expect to catch all columns of the data:

from frictionless import describe_schema

schema = describe_schema("test_data1.jsonl")
schema # No new_col in the schema

For example, pandas picks up the extra column (my assumption is that it must parse the data twice). I'm not sure whether you would want to do the same for describe, or offer a "greedy" option to scan all of the data.

import pandas as pd

df = pd.read_json("test_data1.jsonl", lines=True)
df.columns

roll added the question label Oct 22, 2020

roll closed this as completed Oct 22, 2020
