Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Null fields are omitted by infer_json_schema_from_seekable #4814

Closed
kskalski opened this issue Sep 13, 2023 · 6 comments · Fixed by #4894
Closed

Null fields are omitted by infer_json_schema_from_seekable #4814

kskalski opened this issue Sep 13, 2023 · 6 comments · Fixed by #4894
Labels
arrow Changes to the arrow crate bug help wanted

Comments

@kskalski
Copy link
Contributor

Describe the bug
Fields with only null values are skipped when inferring Schema, which makes reader in strict mode fail as it stumbles upon field which is not included in the schema. In any case, silently removing the fields that are in the input seems wrong - maybe this should be controlled by an option to inference function or it should be left up to the user to filter out null fields.

To Reproduce

#[cfg(test)]
mod tests {
    const DATA2: &str = r#"{"a": 1, "b": "str", "c": null}"#;

    #[test]
    fn test_json_infers_null_schema() {
        let input_buf = std::io::Cursor::new(DATA2.as_bytes());
        let mut buf_reader = std::io::BufReader::new(input_buf);
        let schema = arrow::json::reader::infer_json_schema_from_seekable(&mut buf_reader, None).unwrap();
        let field = schema
            .field_with_name("a")
            .expect("should contain numeric field");
        assert_eq!(&arrow::datatypes::DataType::Int64, field.data_type());
        let field = schema
            .field_with_name("c")
            .expect("should contain null field");
        assert_eq!(&arrow::datatypes::DataType::Null, field.data_type());
    }
}

produces

thread parquet::tests::test_json_infers_null_schema panicked at lakeshore-history/src/parquet.rs:197:14:
should contain null field: SchemaError("Unable to get field named \"c\". Valid fields: [\"a\", \"b\"]")
stack backtrace:

Expected behavior
test passes

Additional context
tested with arrow 46

@kskalski kskalski added the bug label Sep 13, 2023
@tustvold
Copy link
Contributor

What type would you expect to be inferred for null fields?

@kskalski
Copy link
Contributor Author

I think DataType::Null is a correct data type in this situation and compatible with other inference behavior, e.g.

    const DATA: &str = r#"{"a": 1, "b": "str", "c": null, "d": []}"#;

will create a schema with d field, which has a type List(Field { name: "item", data_type: Null, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })

Creating data frames with such schema is a bit different topic, I guess not all output formats support null columns, however there is https://docs.rs/arrow/latest/arrow/array/struct.NullArray.html so:

  • up to a point it could still be used to create / process data
  • if user wants to use schema for writing to format not allowing nulls (e.g. parquet seems to be one, though "Parquet Logical Type Definitions" mention column of UNKNOWN type, which relate to arrow null type), they could modify schema to assign some other datatype with nullable option and this should work fine when parsing data in strict mode

@tustvold
Copy link
Contributor

Does the JSON reader support DataType::Null?

@kskalski
Copy link
Contributor Author

Yes, in fact it does and it generates RecordBatch that with parquet::arrow::ArrowWriter produces column like this:

############ Column(c) ############
name: c
path: c
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Null
converted_type (legacy): NONE
compression: SNAPPY (space_saved: -7%)

@tustvold
Copy link
Contributor

In which case I see no issue with the proposed change

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #4894

@tustvold tustvold added the arrow Changes to the arrow crate label Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants