Infer 'REQUIRED' mode with a flag for consistently filled in values #28

korotkevics · 2019-03-03T00:01:29Z

Hello Brian!

Sorry for making you wait.

Here is what I wanted to have.

I've been rereading code and I've understood your confusion.
You use bq load and it expects one to loose mode to either NULLABLE or REPEATED. The thing is that I do not use bq load. :) I first analyze a few given files, decide on a schema, create a table, and then start loading a lot of data continuously (I wouldn't like to discuss it too deeply). If we allow users to decide on mode only when a given flag is set, it should remain backward compatible, however please do check if I haven't broken things :)

To me it seems like a general purpose functionality therefore I would prefer it to be open source so I really hope you won't mind having it.

If you agree to this, feel free to commit directly to my branch or branch out unless you strongly prefer to rewrite :) Also, if you just do a code review could be helpful as I am less experienced with Python.

Have a nice weekend.

bxparks · 2019-03-04T06:34:42Z

Thanks. I made a small change to implement this during the schema output (i.e. in flatten_schema_map()) instead of converting the NULLABLE into a REQUIRED during input data processing. I tried to upload the change directly into your forked repo, but didn't have permission (I'm not sure I fully understand how GitHub implements this workflow), so I uploaded it to my repo here:
https://github.com/bxparks/bigquery-schema-generator/tree/korotkevics-feature/infer_mode
(i.e. the branch name is the same as yours).

I also had to restrict --infer_mode to CSV files only. It doesn't really make sense for JSON files since JSON fields are often completely missing in some records compared to other records.

korotkevics · 2019-03-04T07:38:14Z

@bxparks nice, I've run it locally, it seems to be what I wanted and written in a shorter way :) You may consider adding tests :)

Could you please tell me when it would normally become available on pip (to be installed from repo).

bxparks · 2019-03-04T07:46:18Z

Cool. I'll merge into develop, then let it sit there for a day or two, to give me a chance to clean up small loose ends (There always something). Then push to PyPI in a few days.

korotkevics · 2019-03-04T07:52:28Z

@bxparks hm, yes it seems there's :P, I've just run it over a different file and it does not seem to support TIMESTAMP (example: "2018-07-23 12:42:21.6830000") as it recognizes it as STRING. Does the library support this type?

bxparks · 2019-03-04T08:04:40Z

BigQuery (more precisely 'bq load') supports only 6 decimal places for the 'second' component. You should write a filter to normalize the data. You might say that this belongs in bigquery-schema-generator but I would push back a little on that. To me, this is a data-cleansing operation that belongs in the 'T' part of the ETL pipeline.

korotkevics · 2019-03-04T08:19:40Z

@bxparks ah I see it is not a TIMESTAMP which'd be recognized by BQ as such. Thank you!

korotkevics added 6 commits March 2, 2019 12:31

Define an argument

48ecc70

Sketch out the concept

eab8c43

Fix minor errors

7594727

Fix minor errors

584be40

Add two tests

005a4d7

Use store_false

479f7c5

bxparks merged commit 6c52455 into bxparks:develop Mar 4, 2019

korotkevics deleted the feature/infer_mode branch March 4, 2019 08:20

bxparks mentioned this pull request Mar 6, 2019

0.4 - Support CSV files #29

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Infer 'REQUIRED' mode with a flag for consistently filled in values #28

Infer 'REQUIRED' mode with a flag for consistently filled in values #28

korotkevics commented Mar 3, 2019

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019 •

edited

Loading

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019

Infer 'REQUIRED' mode with a flag for consistently filled in values #28

Infer 'REQUIRED' mode with a flag for consistently filled in values #28

Conversation

korotkevics commented Mar 3, 2019

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019 • edited Loading

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019

bxparks commented Mar 4, 2019

korotkevics commented Mar 4, 2019

korotkevics commented Mar 4, 2019 •

edited

Loading