
Type Inference in inconsistent lists of dictionaries #31

Closed
georgejerzy opened this issue May 20, 2019 · 7 comments

@georgejerzy

Hi, I would expect that the module would pass the following test:

DATA
{ "r" : [{ "i": 4 },{ "i": "4px" }] }
SCHEMA
[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "i",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "r",
    "type": "RECORD"
  }
]
END

Unfortunately, the type returned for the "i" field is INTEGER. I'm having trouble understanding whether this is a bug - it seems technically doable and useful, but it also seems to be the case mentioned somewhere in the README - "but bq load does not support it, so we follow its behavior".
Is it a bug to be fixed or not?

@bxparks
Owner

bxparks commented May 20, 2019

Hi,

The 1st element in the array says that i is an INTEGER. The 2nd element says that i is a STRING. What is your preference for handling this? I see 3 options:

$ cat > mismatched_repeated.json
{ "r" : [{ "i": 4 },{ "i": "4px" }] }
^D

Option 1: Hard error. This is what bq load does, when it barfs with an error message:

$ bq load --source_format NEWLINE_DELIMITED_JSON --autodetect tmp.tmp mismatched_repeated.json
[...]
BigQuery error in load operation: Error processing job '[snip]': Error while reading data, error message:
JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON parsing error in row
starting at position 0: Could not convert value to string. Field:
i; Value: 4

Option 2: Soft error. This is the current implementation in generate-schema which prints out an error message, but continues to infer the type to be an INTEGER:

$ generate-schema < mismatched_repeated.json 
INFO:root:Processed 1 lines
INFO:root:Problem on line 1: Mismatched type: old=(hard,i,NULLABLE,INTEGER); new=(hard,i,NULLABLE,STRING)
[...your schema above...]

Option 3: Remove the field from the schema, and also print a warning.

Maybe Option 3 is the better choice, since it allows the rest of the data file to be imported.
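Option 3 could be sketched roughly as follows - infer a type per field, and when two records disagree on an incompatible type, drop the field from the schema and warn. (All function names and the simplified type mapping here are illustrative, not the actual generate-schema internals.)

```python
import json

def infer_type(value):
    """Map a JSON scalar to a (simplified) BigQuery type."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"

def build_schema(records):
    """Return {field: type}, removing fields with conflicting types."""
    seen = {}          # field -> first inferred type
    conflicted = set() # fields with incompatible types across records
    for record in records:
        for key, value in record.items():
            new_type = infer_type(value)
            if key in seen and seen[key] != new_type:
                print(f"Warning: mismatched type for '{key}': "
                      f"{seen[key]} vs {new_type}; removing field")
                conflicted.add(key)
            seen.setdefault(key, new_type)
    return {k: t for k, t in seen.items() if k not in conflicted}

records = [json.loads(line) for line in
           ['{"i": 4}', '{"i": "4px"}', '{"j": 1}']]
print(build_schema(records))  # {'j': 'INTEGER'} - "i" was dropped
```

With the conflicting field removed, the rest of each record can still be imported.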

@georgejerzy
Author

Hi, many thanks for the answer! I've just realized that even in the simpler case:

{ "x": 30 }
{ "x": "50px" }

schema-generator will infer "type": "INTEGER".
And in case:

{ "x": "50px" }
{ "x": 30 }

schema-generator will infer "type": "STRING".
I naturally followed Option 3 that you mentioned, but in my use case I have a massive collection with inconsistent typing (an export from MongoDB), and that leads to losing 75% of the entries.
I see an Option 4 - assume the STRING type and return no error. That would allow placing almost all of the data in BigQuery and handling all the weird parsing cases for this column in an SQL view later on.

What drawbacks does casting to string have in such a case?
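Option 4 amounts to a widening rule: whenever two records disagree, widen the inferred type to STRING instead of erroring. A hypothetical illustration of that rule (this is not the module's actual behavior):

```python
def widen(old_type, new_type):
    """Option 4 widening rule: any type conflict resolves to STRING."""
    if old_type is None:
        return new_type
    return old_type if old_type == new_type else "STRING"

def infer_column_type(values):
    """Infer a single column's type over all observed values."""
    inferred = None
    for v in values:
        t = "INTEGER" if isinstance(v, int) else "STRING"
        inferred = widen(inferred, t)
    return inferred

print(infer_column_type([30, "50px"]))   # STRING
print(infer_column_type(["50px", 30]))   # STRING - order no longer matters
```

Note that the schema alone does not solve the load problem: the data itself would still need its integer values stringified before bq load accepts it under a STRING column.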

@bxparks
Owner

bxparks commented May 21, 2019

Yes, your results look correct: generate-schema picks the type of the first record, and if it sees another record with an incompatible type (one that cannot be cast to a compatible type), it prints a warning and continues processing the remaining records.

I'm not sure I understand your Option 4: why is STRING better than INTEGER? What if you had 1000 INTEGERs but only a single record with a STRING? Even if generate-schema selected STRING, how does this help, since bq load will refuse to load a dataset with inconsistent data types?

It seems like the solution for you is to pre-filter your dataset to eliminate the records that you don't want, and keep the ones that you do want, then generate the schema using generate-schema.

In the meantime, I think I will implement Option 3 and make generate-schema remove the inconsistent field from the resulting schema. Because it does not seem to help much to keep the field assigned to be the type of the first encountered record, when bq load cannot import the resulting schema and dataset.

@bxparks
Owner

bxparks commented May 21, 2019

FYI: I won't be able to look at this again for the next 1-2 weeks... in case you start wondering about the lack of responses.

@georgejerzy
Author

Hey, thanks anyway for the response! 👍
I think we've ended up discussing quite a debatable matter - you gave an example of 1000 INTs and 1 string, but what if I had 500 integers and 500 strings? :) Can I afford to throw out half of the dataset?
Probably because of my business and technical case, I am a strong supporter of putting "everything" into BQ and handling the corner cases later, e.g. with a regex inside a BQ view. Without going into details - in my case every row counts from a business perspective, it's too early in the project to decide which column is not important, and 50/50 combinations of strings/ints can, unfortunately, happen.
That leads me to the conclusion that I also need something like a "bigquery-entry-sanitizer" module that would force the up-typing of INTs to STRINGs, both in the schema and in the new entries.
I hope it's a 1-2 week vacation, thanks again!

@bxparks
Owner

bxparks commented Jun 6, 2019

(Thanks, back from vacation.)

I'm also a fan of putting everything in BQ then figuring out things later. For this particular problem of inconsistent types for a given field/column, it's not clear to me what generate-schema should do. Even if it spits out STRING as the type, bq load would refuse to load the data.

I'm planning on implementing Option 3 as described above, which removes the offending field from the generated schema. That will allow the dataset to be loaded using the --ignore_unknown_values flag with bq load. For your case, you would need to write a pre-filter (i.e. an input sanitizer) to transform your dataset so the offending field has a consistent type, then pipe that dataset into generate-schema.

(Keeping this issue open so that I can attach my commits to it.)

@bxparks
Owner

bxparks commented Jun 6, 2019

Pushed 0.5 to PyPI. Closing.

@bxparks bxparks closed this as completed Jun 6, 2019