keys are dropped when reading in json newline #490

Closed
isichei opened this issue Oct 20, 2020 · 2 comments
Labels
question Further information is requested


isichei commented Oct 20, 2020

Overview

Additional keys (columns) in a JSON Lines file are dropped if they do not appear in the first JSON object of the file. I have two files:

test_data1.jsonl

{"employee_name": "Michael Roth", "employee_id": 1}
{"employee_id": 2, "employee_name": null}
{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}

test_data2.jsonl (same as test_data1.jsonl but with rows 1 and 3 swapped)

{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}
{"employee_name": "Michael Roth", "employee_id": 1}
{"employee_id": 2, "employee_name": null}

When the files are read via the Table class, the columns appear to be defined from the first row only. As a result, table1 is read incorrectly (the extra column is dropped), while table2 is read correctly: the key exists in the first row, and the rows that lack it get null.

from pprint import pprint
from frictionless import Table

with Table("test_data1.jsonl") as table:
    print("table1...")
    pprint(table.read_rows())
    
with Table("test_data2.jsonl") as table:
    print("table2...")
    pprint(table.read_rows())

Outputs:

table1...
[Row([('employee_id', 1), ('employee_name', 'Michael Roth')]),
 Row([('employee_id', 2), ('employee_name', None)]),
 Row([('employee_id', 3), ('employee_name', 'bobby bob')])]

table2...
[Row([('employee_id', 3), ('employee_name', 'bobby bob'), ('new_col', 'x')]),
 Row([('employee_id', 1), ('employee_name', 'Michael Roth'), ('new_col', None)]),
 Row([('employee_id', 2), ('employee_name', None), ('new_col', None)])]

Using frictionless 3.23.3
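
The outputs above are consistent with the column set being fixed from the first record. A minimal plain-Python sketch of that mechanism follows. It is an illustration only, not frictionless's actual implementation, and `read_with_first_row_keys` is a hypothetical helper name.

```python
import json


def read_with_first_row_keys(lines):
    """Illustration only: fix the column set from the first record."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return []
    # Keys are detected from the first record only.
    keys = list(records[0])
    # Every record is projected onto those keys: missing keys become None,
    # and keys that first appear later are silently dropped.
    return [{key: record.get(key) for key in keys} for record in records]


# Same rows as test_data1.jsonl
lines = [
    '{"employee_name": "Michael Roth", "employee_id": 1}',
    '{"employee_id": 2, "employee_name": null}',
    '{"employee_id": 3, "employee_name": "bobby bob", "new_col": "x"}',
]
rows = read_with_first_row_keys(lines)
print(rows[-1])  # new_col is absent from the last row
```

Reversing the order of `lines` (as in test_data2.jsonl) keeps `new_col`, because it is then present in the first record.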



@roll roll added the question Further information is requested label Oct 22, 2020

roll commented Oct 22, 2020

Hi @isichei,

Thanks, it's expected behavior as the first row is used to detect keys.

You can provide the keys manually to control which fields are read:

from frictionless import Table, dialects

dialect = dialects.JsonDialect(keys=["employee_name", "employee_id", "new_col"])
with Table("tmp/issue490.jsonl", dialect=dialect) as table:
    print(table.read_rows())

Please re-open if needed

@roll roll closed this as completed Oct 22, 2020

isichei commented Oct 23, 2020

Hi @roll,

Happy to keep this closed if you think it should be. I guess the workaround for our needs would be to parse the data once to collect the keys, and then pass them to the Table class (via the dialect).
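
A rough sketch of that first pass, using only the standard library. `collect_jsonl_keys` is a hypothetical helper, not part of frictionless, and it assumes each non-blank line holds one JSON object.

```python
import json


def collect_jsonl_keys(path):
    """Return every key seen in a JSON Lines file, in first-seen order."""
    keys = {}  # a dict used as an ordered set
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record, dict):
                # Existing keys keep their position; new ones are appended.
                keys.update(dict.fromkeys(record))
    return list(keys)
```

The resulting list could then be passed as `keys=` to `dialects.JsonDialect`, as in the snippet from @roll above. The cost is reading the file twice.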

However, the same keys are skipped when using describe, which I would expect to catch every column in the data:

from frictionless import describe_schema

schema = describe_schema("test_data1.jsonl")
schema # No new_col in the schema

For example, pandas gives you the extra column (my assumption is that it must parse the data twice). I'm not sure whether you would want to do the same for describe, or offer a "greedy" option to scan the whole file.

import pandas as pd

df = pd.read_json("test_data1.jsonl", lines=True)
df.columns
