Stack overflow when BigQueryDenormalizedDestination reads a source schema with an empty array. #5486

Closed
jadetr opened this issue Aug 18, 2021 · 5 comments · Fixed by #5813

@jadetr
Contributor

jadetr commented Aug 18, 2021

Environment

  • Airbyte version: 0.29.7-alpha
  • OS Version / Instance: Ubuntu 20.10 / dedicated
  • Deployment: Docker
  • Source Connector and version: Harvest V0.1.3
  • Destination Connector and version: BigQueryDenormalizedDestination V0.1.3
  • Severity: Critical
  • Step where error happened: Sync job

Current Behavior

Data is never synced.

Expected Behavior

Data from Harvest source should be synced to BigQuery.

Logs

Uploaded.
This is the line that fails:
2021-08-18 02:28:52 ERROR () LineGobbler(voidCall):85 - at io.airbyte.integrations.destination.bigquery.BigQueryDenormalizedDestination.getField(BigQueryDenormalizedDestination.java:122)

Steps to Reproduce

  1. Set up the Harvest source with the estimate_messages stream (its schema has an empty array)
  2. Set up BigQueryDenormalizedDestination
  3. Connect the source and the destination

Are you willing to submit a PR?

I don't program in Java, but I was able to make BigQueryDenormalizedDestination crash in dev with a JSON file that I cannot upload here.

./gradlew :airbyte-integrations:connectors:destination-bigquery-denormalized:build
docker run --rm -v $(pwd)/secrets:/secrets -v $(pwd)/sample_files:/sample_files airbyte/destination-bigquery-denormalized:dev write --config /secrets/config.json --catalog /sample_files/configured_catalog.json > output.txt 2>&1

Here is the JSON configured_catalog file:
{

"streams": [
	{
		"stream": {
			"name": "clients",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"id": {
						"type": ["null", "integer"]
					},
					"name": {
						"type": ["null", "string"]
					},
					"is_active": {
						"type": ["null", "boolean"]
					},
					"address": {
						"type": ["null", "string"]
					},
					"statement_key": {
						"type": ["null", "string"]
					},
					"created_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"updated_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"currency": {
						"type": ["null", "string"],
						"format": "date-time"
					}
				}
			},
			"supported_sync_modes": ["full_refresh", "incremental"],
			"source_defined_cursor": true,
			"default_cursor_field": ["updated_at"],
			"source_defined_primary_key": [["id"]]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}, {
		"stream": {
			"name": "contacts",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"id": {
						"type": ["null", "integer"]
					},
					"title": {
						"type": ["null", "string"]
					},
					"first_name": {
						"type": ["null", "string"]
					},
					"last_name": {
						"type": ["null", "string"]
					},
					"email": {
						"type": ["null", "string"]
					},
					"phone_office": {
						"type": ["null", "string"]
					},
					"phone_mobile": {
						"type": ["null", "string"]
					},
					"fax": {
						"type": ["null", "string"]
					},
					"created_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"updated_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"client": {
						"type": ["null", "object"],
						"properties": {
							"id": {
								"type": ["null", "integer"]
							},
							"name": {
								"type": ["null", "string"]
							}
						}
					}
				}
			},
			"supported_sync_modes": ["full_refresh", "incremental"],
			"source_defined_cursor": true,
			"default_cursor_field": ["updated_at"],
			"source_defined_primary_key": [["id"]]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}, {
		"stream": {
			"name": "company",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"base_uri": {
						"type": ["null", "string"]
					},
					"full_domain": {
						"type": ["null", "string"]
					},
					"name": {
						"type": ["null", "string"]
					},
					"is_active": {
						"type": ["null", "boolean"]
					},
					"week_start_day": {
						"type": ["null", "string"]
					},
					"wants_timestamp_timers": {
						"type": ["null", "boolean"]
					},
					"time_format": {
						"type": ["null", "string"]
					},
					"plan_type": {
						"type": ["null", "string"]
					},
					"expense_feature": {
						"type": ["null", "boolean"]
					},
					"invoice_feature": {
						"type": ["null", "boolean"]
					},
					"estimate_feature": {
						"type": ["null", "boolean"]
					},
					"approval_required": {
						"type": ["null", "boolean"]
					},
					"clock": {
						"type": ["null", "string"]
					},
					"decimal_symbol": {
						"type": ["null", "string"]
					},
					"thousands_separator": {
						"type": ["null", "string"]
					},
					"color_scheme": {
						"type": ["null", "string"]
					},
					"weekly_capacity": {
						"type": ["null", "integer"]
					}
				}
			},
			"supported_sync_modes": ["full_refresh"]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}, {
		"stream": {
			"name": "estimates",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"id": {
						"type": ["null", "integer"]
					},
					"client_key": {
						"type": ["null", "string"]
					},
					"number": {
						"type": ["null", "string"]
					},
					"purchase_order": {
						"type": ["null", "string"]
					},
					"amount": {
						"type": ["null", "number"]
					},
					"tax": {
						"type": ["null", "number"]
					},
					"tax_amount": {
						"type": ["null", "number"]
					},
					"tax2": {
						"type": ["null", "number"]
					},
					"tax2_amount": {
						"type": ["null", "number"]
					},
					"discount": {
						"type": ["null", "number"]
					},
					"discount_amount": {
						"type": ["null", "number"]
					},
					"subject": {
						"type": ["null", "string"]
					},
					"notes": {
						"type": ["null", "string"]
					},
					"state": {
						"type": ["null", "string"]
					},
					"issue_date": {
						"type": ["null", "string"],
						"format": "date"
					},
					"sent_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"created_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"updated_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"accepted_at": {
						"type": ["null", "string"]
					},
					"declined_at": {
						"type": ["null", "string"]
					},
					"currency": {
						"type": ["null", "string"]
					},
					"client": {
						"type": ["null", "object"],
						"properties": {
							"id": {
								"type": ["null", "integer"]
							},
							"name": {
								"type": ["null", "string"]
							}
						}
					},
					"creator": {
						"type": ["null", "object"],
						"properties": {
							"id": {
								"type": ["null", "integer"]
							},
							"name": {
								"type": ["null", "string"]
							}
						}
					},
					"line_items": {
						"type": ["null", "array"]
					}
				}
			},
			"supported_sync_modes": ["full_refresh", "incremental"],
			"source_defined_cursor": true,
			"default_cursor_field": ["updated_at"],
			"source_defined_primary_key": [["id"]]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}, {
		"stream": {
			"name": "estimate_messages",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"id": {
						"type": ["null", "integer"]
					},
					"sent_by": {
						"type": ["null", "string"]
					},
					"sent_by_email": {
						"type": ["null", "string"]
					},
					"sent_from": {
						"type": ["null", "string"]
					},
					"sent_from_email": {
						"type": ["null", "string"]
					},
					"send_me_a_copy": {
						"type": ["null", "boolean"]
					},
					"created_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"updated_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"recipients": {
						"type": ["null", "array"]
					},
					"event_type": {
						"type": ["null", "string"]
					},
					"subject": {
						"type": ["null", "string"]
					},
					"body": {
						"type": ["null", "string"]
					}
				}
			},
			"supported_sync_modes": ["full_refresh", "incremental"],
			"source_defined_cursor": true,
			"default_cursor_field": ["updated_at"],
			"source_defined_primary_key": [["id"]]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}, {
		"stream": {
			"name": "estimate_item_categories",
			"json_schema": {
				"$schema": "http://json-schema.org/draft-07/schema#",
				"type": "object",
				"properties": {
					"id": {
						"type": ["null", "integer"]
					},
					"name": {
						"type": ["null", "string"]
					},
					"created_at": {
						"type": ["null", "string"],
						"format": "date-time"
					},
					"updated_at": {
						"type": ["null", "string"],
						"format": "date-time"
					}
				}
			},
			"supported_sync_modes": ["full_refresh", "incremental"],
			"source_defined_cursor": true,
			"default_cursor_field": ["updated_at"],
			"source_defined_primary_key": [["id"]]
		},
		"sync_mode": "full_refresh",
		"destination_sync_mode": "overwrite"
	}
]

}

@jadetr
Contributor Author

jadetr commented Aug 28, 2021

I just realised that I changed the source schema while investigating. estimates#line_items was initially an empty array; I changed the source schema manually to see whether that would fix the problem. This attribute was initially:
"line_item": { "type": ["null", "array"] },

@etsybaev etsybaev self-assigned this Aug 30, 2021
@etsybaev
Contributor

Hi @jadetr.
Could you please clarify your point about changing the schema and "items"? As far as I can see so far, the destination fails because we don't get the "items" element for some arrays from the source. Obviously, it's bad that the destination fails with such an unclear error, but I'm still trying to understand why we didn't get it from the source.
Did you change anything in it manually?
Thanks!

@jadetr
Contributor Author

jadetr commented Aug 30, 2021

@etsybaev, I'll try to give you more details. Hopefully it will be clearer. The source schema comes from the Harvest source connector.

When testing locally, I first got the source schema from Harvest, and the arrays in the estimates schema were all empty. I tried modifying the estimates line_items manually to test the destination connector and see whether I could get the BigQueryDenormalizedDestination code to "pass" instead of failing. I was hacking around with both the destination code and the source schema. I stopped there because I didn't know how to fix the destination connector, and I reported the bug instead.

On August 28, I realised that I had sent the "manually edited" source schema in the bug report. I just wanted to highlight that the source schema I sent was manually edited and wasn't what the actual Harvest source connector returned.

More investigation
It seems that in this case the source schema wasn't properly mapped to the API. Here is the line item object as documented in the Harvest API:
https://help.getharvest.com/api-v2/estimates-api/estimates/estimates/#the-estimate-line-item-object

Here is the line_items schema in Airbyte (the array is empty):
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-harvest/source_harvest/schemas/estimates.json

During this investigation, I saw that the Harvest invoices schema was properly implemented: line_items are in the schema.
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-harvest/source_harvest/schemas/invoices.json

Hope this long message clarifies things.

@etsybaev
Contributor

etsybaev commented Aug 30, 2021

@jadetr, did I understand you correctly? For now, the description contains this statement:
"line_items": { "type": ["null", "array"], **"items": { "type": ["null", "object"], "properties": { "id": { "type": ["null", "integer"] }, "kind": { "type": ["null", "string"] } }** } }

But you're saying that originally it didn't contain the "items" part? Could you please update the ticket's description so that it contains only the original schema/values, to avoid confusion? Many thanks in advance!

The StackOverflow error on the destination side happens when an array object in the schema doesn't contain the "items" block. Basically, the schema is supposed to always have it.

So, as far as I can see from the description, "line_items" contains an "items" block, which seems to be as expected. Meanwhile, "recipients" doesn't have one, which may cause an issue.
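
To check a catalog for the condition described above, a small script can walk each stream's json_schema and report every property that is typed as an array but has no "items" block. This is only a sketch and not part of Airbyte: the function name find_arrays_without_items is hypothetical, and the sample catalog is trimmed to a few properties from the Harvest schemas discussed in this thread.

```python
import json


def find_arrays_without_items(schema, path=""):
    """Return dotted paths of array-typed nodes that lack an "items" block.

    Hypothetical helper for illustration; it is not part of Airbyte.
    """
    found = []
    if not isinstance(schema, dict):
        return found

    # "type" may be a single string or a list such as ["null", "array"].
    declared = schema.get("type", [])
    types = [declared] if isinstance(declared, str) else list(declared)

    if "array" in types and "items" not in schema:
        found.append(path)

    # Recurse into object properties.
    for name, child in schema.get("properties", {}).items():
        child_path = f"{path}.{name}" if path else name
        found.extend(find_arrays_without_items(child, child_path))

    # Recurse into the element schema of arrays that do declare "items".
    if isinstance(schema.get("items"), dict):
        found.extend(find_arrays_without_items(schema["items"], path + "[]"))

    return found


# Trimmed-down catalog; property definitions follow the Harvest schemas in this thread.
catalog = json.loads("""
{
  "streams": [
    {"stream": {"name": "estimates", "json_schema": {
      "type": "object",
      "properties": {
        "id": {"type": ["null", "integer"]},
        "line_items": {"type": ["null", "array"]}
      }}}},
    {"stream": {"name": "estimate_messages", "json_schema": {
      "type": "object",
      "properties": {
        "id": {"type": ["null", "integer"]},
        "recipients": {"type": ["null", "array"]}
      }}}}
  ]
}
""")

report = {
    entry["stream"]["name"]: find_arrays_without_items(entry["stream"]["json_schema"])
    for entry in catalog["streams"]
}
print(report)
# prints {'estimates': ['line_items'], 'estimate_messages': ['recipients']}
```

Running a check like this against a full configured_catalog.json before a sync would list the properties most likely to trigger the failure, without touching the destination code.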

@jadetr
Contributor Author

jadetr commented Aug 30, 2021

@etsybaev, I've modified the ticket to reflect the actual Harvest schema.

If you look at the Harvest invoices schema, no array is defined for line_items. They are defined as an object, but they are in fact an array of objects. I'll test the invoice sync and let you know in a later message whether it works.
ref:
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-harvest/source_harvest/schemas/invoices.json
"line_items": { "type": ["null", "object"], "properties": { "id": { "type": ["null", "integer"] }, "kind": { "type": ["null", "string"] }, "description": { "type": ["null", "string"] }, "quantity": { "type": ["null", "integer"] }, "unit_price": { "type": ["null", "integer"] }, "amount": { "type": ["null", "integer"] }, "taxed": { "type": ["null", "boolean"] }, "taxed2": { "type": ["null", "boolean"] }, "project": { "type": ["null", "string"] } } }
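
If line_items really is an array of objects, one way to express that in JSON Schema is to move the existing object definition under the "items" key of an array-typed property. The sketch below shows that transformation only; wrap_as_array is a hypothetical helper, the sample keeps just two of the properties quoted above, and this is not necessarily how the Harvest connector schema was or should be changed.

```python
import copy
import json


def wrap_as_array(object_schema):
    """Return a nullable array schema whose elements follow object_schema.

    Hypothetical helper for illustration only.
    """
    return {
        "type": ["null", "array"],
        # Deep copy so later edits to the result do not alter the input schema.
        "items": copy.deepcopy(object_schema),
    }


# Shortened version of the object-typed definition quoted above.
line_items_as_object = {
    "type": ["null", "object"],
    "properties": {
        "id": {"type": ["null", "integer"]},
        "kind": {"type": ["null", "string"]},
    },
}

line_items_as_array = wrap_as_array(line_items_as_object)
print(json.dumps(line_items_as_array, indent=2))
```

The result has the same shape as the "line_items" definition with an "items" block quoted earlier in this thread.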

Please note that the same behaviour happens with the Hubspot source as well. There are plenty of empty arrays in the Hubspot source schemas. Here is one example.

Hubspot company schema
https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-hubspot/source_hubspot/schemas/companies.json
