
Type Inference in inconsistent lists of dictionaries #31

Closed
georgejerzy opened this issue May 20, 2019 · 7 comments

@georgejerzy

Hi, I would expect that the module would pass the following test:

DATA
{ "r" : [{ "i": 4 },{ "i": "4px" }] }
SCHEMA
[
  {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "i",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "r",
    "type": "RECORD"
  }
]
END

Unfortunately, the type returned for the "i" field is INTEGER. I'm having trouble understanding whether this is a bug - it seems technically doable and useful, but it also seems to be the case mentioned somewhere in the README - "but bq load does not support it, so we follow its behavior".
Is it a bug to be fixed or not?

@bxparks
Owner

bxparks commented May 20, 2019

Hi,

The 1st element in the array says that i is an INTEGER. The 2nd element says that i is a STRING. What is your preference for handling this? I see 3 options:

$ cat > mismatched_repeated.json
{ "r" : [{ "i": 4 },{ "i": "4px" }] }
^D

Option 1: Hard error. This is what bq load does, when it barfs with an error message:

$ bq load --source_format NEWLINE_DELIMITED_JSON --autodetect tmp.tmp mismatched_repeated.json
[...]
BigQuery error in load operation: Error processing job '[snip]': Error while reading data, error message:
JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
Failure details:
- Error while reading data, error message: JSON parsing error in row
starting at position 0: Could not convert value to string. Field:
i; Value: 4

Option 2: Soft error. This is the current implementation in generate-schema which prints out an error message, but continues to infer the type to be an INTEGER:

$ generate-schema < mismatched_repeated.json 
INFO:root:Processed 1 lines
INFO:root:Problem on line 1: Mismatched type: old=(hard,i,NULLABLE,INTEGER); new=(hard,i,NULLABLE,STRING)
[...your schema above...]

Option 3: Remove the field from the schema, and also print a warning.

Maybe Option 3 is the better choice, since it allows the rest of the data file to be imported.
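Option 3 could be sketched roughly as follows - infer a type per field, and when two records disagree on an incompatible type, drop the field from the schema and warn. (All function names and the simplified type mapping here are illustrative, not the actual generate-schema internals.)

```python
import json

def infer_type(value):
    """Map a JSON scalar to a (simplified) BigQuery type."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"

def build_schema(records):
    """Return {field: type}, removing fields with conflicting types."""
    seen = {}          # field -> first inferred type
    conflicted = set() # fields with incompatible types across records
    for record in records:
        for key, value in record.items():
            new_type = infer_type(value)
            if key in seen and seen[key] != new_type:
                print(f"Warning: mismatched type for '{key}': "
                      f"{seen[key]} vs {new_type}; removing field")
                conflicted.add(key)
            seen.setdefault(key, new_type)
    return {k: t for k, t in seen.items() if k not in conflicted}

records = [json.loads(line) for line in
           ['{"i": 4}', '{"i": "4px"}', '{"j": 1}']]
print(build_schema(records))  # {'j': 'INTEGER'} - "i" was dropped
```

With the conflicting field removed, the rest of each record can still be imported.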

@georgejerzy
Author

Hi, many thanks for the answer! I've just realized that even in the simpler case:

{ "x": 30 }
{ "x": "50px" }

schema-generator will infer "type": "INTEGER".
And in case:

{ "x": "50px" }
{ "x": 30 }

schema-generator will infer "type": "STRING".
I naturally followed Option 3 that you mentioned, but in my use case I have a massive collection with inconsistent typing (an export from MongoDB), and that leads to losing 75% of the entries.
I see an Option 4 - assume the STRING type and return no error. That would allow placing almost all of the data in BigQuery and handling all the weird parsing cases for this column in an SQL view later on.

What drawbacks does casting to string have in such a case?
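Option 4 amounts to a widening rule: whenever two records disagree, widen the inferred type to STRING instead of erroring. A hypothetical illustration of that rule (this is not the module's actual behavior):

```python
def widen(old_type, new_type):
    """Option 4 widening rule: any type conflict resolves to STRING."""
    if old_type is None:
        return new_type
    return old_type if old_type == new_type else "STRING"

def infer_column_type(values):
    """Infer a single column's type over all observed values."""
    inferred = None
    for v in values:
        t = "INTEGER" if isinstance(v, int) else "STRING"
        inferred = widen(inferred, t)
    return inferred

print(infer_column_type([30, "50px"]))   # STRING
print(infer_column_type(["50px", 30]))   # STRING - order no longer matters
```

Note that the schema alone does not solve the load problem: the data itself would still need its integer values stringified before bq load accepts it under a STRING column.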

@bxparks
Owner

bxparks commented May 21, 2019

Yes, your results look correct: generate-schema picks the type of the first record, and if it sees another record with an incompatible type (one that cannot be cast to a compatible type), it prints a warning and continues processing the remaining records.

I'm not sure I understand your Option 4: why is STRING better than INTEGER? What if you had 1000 INTEGERs but only a single record with a STRING? Even if generate-schema selected STRING, how does this help, since bq load will refuse to load a dataset with inconsistent data types?

It seems like the solution for you is to pre-filter your dataset to eliminate the records that you don't want, and keep the ones that you do want, then generate the schema using generate-schema.

In the meantime, I think I will implement Option 3 and make generate-schema remove the inconsistent field from the resulting schema. Because it does not seem to help much to keep the field assigned to be the type of the first encountered record, when bq load cannot import the resulting schema and dataset.

@bxparks
Owner

bxparks commented May 21, 2019

FYI: I won't be able to look at this again for the next 1-2 weeks... in case you start wondering about the lack of responses.

@georgejerzy
Author

Hey, thanks anyway for the response! 👍
I think we've ended up discussing quite a debatable matter - you gave an example of 1000 INTs and 1 string, but what if I had 500 integers and 500 strings? :) Can I afford to throw out half of the dataset?
Probably because of my business and technical case, I am a strong supporter of putting "everything" into BQ and handling the corner cases later, e.g. with a regex inside a BQ view. Without going into details - in my case every row counts from a business perspective, it's too early in the project to decide which column is not important, and 50/50 combinations of strings/ints can, unfortunately, happen.
That leads me to the conclusion that I also need something like a "bigquery-entry-sanitizer" module that would force the up-typing of INTs to STRINGs, both in the schema and in the new entries.
I hope it's a 1-2 week vacation, thanks again!

@bxparks
Owner

bxparks commented Jun 6, 2019

(Thanks, back from vacation.)

I'm also a fan of putting everything in BQ then figuring out things later. For this particular problem of inconsistent types for a given field/column, it's not clear to me what generate-schema should do. Even if it spits out STRING as the type, bq load would refuse to load the data.

I'm planning on implementing Option 3 as described above, which removes the offending field from the generated schema. That will allow the dataset to be loaded using the --ignore_unknown_values flag with bq load. For your case, you would need to write a pre-filter (i.e. an input sanitizer) to transform your dataset so the offending field has a consistent type, then pipe that dataset into generate-schema.

(Keeping this issue open so that I can attach my commits to it.)

@bxparks
Owner

bxparks commented Jun 6, 2019

Pushed 0.5 to PyPI. Closing.

@bxparks bxparks closed this as completed Jun 6, 2019