
Source Mixpanel: Sync failure with high volume #8427

Closed
Tracked by #12239
pranavhegde4 opened this issue Dec 2, 2021 · 10 comments · Fixed by #13372

Comments

@pranavhegde4

pranavhegde4 commented Dec 2, 2021

Environment

  • Airbyte version: 0.32.6-alpha
  • OS Version / Instance: GCP e2-standard-4
  • Deployment: Docker
  • Source Connector and version: Mixpanel 0.1.5
  • Destination Connector and version: BigQuery 0.5.0
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

When syncing a large number of rows from Mixpanel to BigQuery, the sync fails partway through.
Reducing the date window size can help alleviate this problem. However, I've noticed that even with a date window size of 2,
if the number of rows in that window exceeds 3 million, the same issue occurs.
I suspect it's due to OOM, as there is a brief spike in memory usage of the Python process before it is killed.
However, the machine has 16 GB of RAM, and on average only around 6 GB was in use before the spike.

Another point of concern is that the ability to configure the date window size was removed in Mixpanel 0.1.6, where it now defaults to 30 days.
Previously, the only workaround was to reduce the date window size, which is no longer possible.
Thus OOM is bound to happen, as 30 days of data will be synced at once.
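
For context, the workaround described above amounts to slicing the requested date range into smaller windows, so that each Mixpanel export request (and the rows buffered for it) stays small. A minimal sketch of that slicing logic in Python, using a hypothetical date_windows helper rather than the connector's actual implementation:

from datetime import date, timedelta

def date_windows(start: date, end: date, window_days: int):
    # Yield (window_start, window_end) pairs covering [start, end] in chunks
    # of at most window_days days, so each export request stays bounded.
    cursor = start
    step = timedelta(days=window_days)
    while cursor <= end:
        window_end = min(cursor + step - timedelta(days=1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)

# Example: cover November 2021 two days at a time instead of one 30-day request.
for window_start, window_end in date_windows(date(2021, 11, 1), date(2021, 11, 30), 2):
    print(window_start, window_end)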

Expected Behavior

The sync should succeed regardless of the number of rows being synced in the date window.

Logs


2021-11-25 17:07:37 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):220 - Records read: 1241000
2021-11-25 17:07:38 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):220 - Records read: 1242000
2021-11-25 17:07:38 INFO () DefaultReplicationWorker(lambda$getReplicationRunnable$2):220 - Records read: 1243000
2021-11-25 17:08:52 INFO () DefaultReplicationWorker(run):138 - Source thread complete.
destination - 2021-11-25 17:08:52 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:08:52 INFO i.a.i.b.FailureTrackingAirbyteMessageConsumer(close):60 - {} - Airbyte message consumer: succeeded.
2021-11-25 17:08:52 INFO () DefaultReplicationWorker(run):139 - Waiting for destination thread to join.
destination - 2021-11-25 17:08:52 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:08:52 INFO i.a.i.d.b.BigQueryRecordConsumer(close):163 - {} - Started closing all connections
destination - 2021-11-25 17:08:53 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:08:53 INFO i.a.i.d.b.BigQueryRecordConsumer(closeNormalBigqueryStreams):278 - {} - Waiting for jobs to be finished/closed
destination - 2021-11-25 17:09:02 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:02 INFO i.a.i.d.b.BigQueryRecordConsumer(closeNormalBigqueryStreams):295 - {} - Replication finished with no explicit errors. Copying data from tmp tables to permanent
destination - 2021-11-25 17:09:06 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:06 INFO i.a.i.d.b.BigQueryRecordConsumer(copyTable):363 - {} - successfully copied table: GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=vayu_staging, tableId=_airbyte_tmp_xvo_mixpanel_export}} to table: GenericData{classInfo=[datasetId, projectId, tableId], {datasetId=vayu_staging, tableId=_airbyte_raw_mixpanel_export}}
2021-11-25 17:09:06 INFO () DefaultReplicationWorker(lambda$getDestinationOutputRunnable$3):248 - state in DefaultReplicationWorker from Destination: io.airbyte.protocol.models.AirbyteMessage@72408d4f[type=STATE,log=<null>,spec=<null>,connectionStatus=<null>,catalog=<null>,record=<null>,state=io.airbyte.protocol.models.AirbyteStateMessage@4fc98083[data={"export":{"date":"2019-03-03T23:59:59"}},additionalProperties={}],additionalProperties={}]
destination - 2021-11-25 17:09:06 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:06 INFO i.a.i.d.b.BigQueryRecordConsumer(closeNormalBigqueryStreams):312 - {} - Removing tmp tables...
destination - 2021-11-25 17:09:06 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:06 INFO i.a.i.d.b.BigQueryRecordConsumer(closeNormalBigqueryStreams):314 - {} - Finishing destination process...completed
destination - 2021-11-25 17:09:06 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:06 INFO i.a.i.b.IntegrationRunner(run):133 - {} - Completed integration: io.airbyte.integrations.destination.bigquery.BigQueryDestination
destination - 2021-11-25 17:09:06 INFO () DefaultAirbyteStreamFactory(lambda$create$0):61 - 2021-11-25 17:09:06 INFO i.a.i.d.b.BigQueryDestination(main):336 - {} - completed destination: class io.airbyte.integrations.destination.bigquery.BigQueryDestination
2021-11-25 17:09:06 INFO () DefaultReplicationWorker(run):141 - Destination thread complete.
2021-11-25 17:09:06 ERROR () DefaultReplicationWorker(run):145 - Sync worker failed.

Steps to Reproduce

  1. Set up a Mixpanel source and a BigQuery destination.
  2. Set up the connection to output only raw data (without normalization).
  3. Ensure a large number of rows is synced in each date window (this can be achieved either by setting a large date window or by allocating less memory).

Are you willing to submit a PR?

Yes, but it might take time as I am new to this.

pranavhegde4 added the needs-triage and type/bug labels on Dec 2, 2021
alafanechere added the area/connectors label and removed the needs-triage label on Dec 2, 2021
@alafanechere
Contributor

Another user reported hitting this bug on Slack.

alafanechere changed the title from "Sync Failure from Mixpanel to BigQuery when there are large number of rows" to "Source Mixpanel: Sync failure with high volume" on Dec 3, 2021
sherifnada added this to the "Connectors Dec 24 2021" milestone on Dec 10, 2021
eliziario self-assigned this on Dec 10, 2021
@augan-rymkhan
Contributor

augan-rymkhan commented Jan 4, 2022

I could not reproduce this bug because there is not enough data in our Mixpanel test account. I also decreased the RAM available to WSL (Windows Subsystem for Linux) on my laptop and ran the sync again, with the same result. The next step is to create/import as much data as possible into the Mixpanel account, ideally more than 1 million records for one stream, and then retry reproducing.

@augan-rymkhan
Contributor

augan-rymkhan commented Jan 11, 2022

To reproduce the issue, 1.2 million records were imported into User Profiles (the engage stream) from a CSV file, and then I ran the sync.
1,272,982 records were synced successfully.
During the sync, RAM usage increased from 7.44 GB to 9.64 GB.
I used a local PostgreSQL instance as the destination in this test to make it run faster; the two clients who faced this issue used BigQuery and Snowflake destinations, so I assume the problem does not depend on the destination type.

[Screenshot: mixpanel_sync_650K_records_read]
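
For anyone else trying to reproduce this, a rough way to capture the spike is to sample the resident memory of the source connector's Python process while the sync runs. A minimal sketch using psutil; the PID lookup and sampling interval are illustrative assumptions, not part of the connector:

import time
import psutil  # third-party: pip install psutil

def sample_rss(pid: int, interval: float = 1.0):
    # Print the resident set size of the given process once per interval
    # until it exits, to spot a spike just before an OOM kill.
    proc = psutil.Process(pid)
    try:
        while proc.is_running():
            rss_mb = proc.memory_info().rss / (1024 * 1024)
            print(f"{time.strftime('%H:%M:%S')} rss={rss_mb:.1f} MB")
            time.sleep(interval)
    except psutil.NoSuchProcess:
        print("process exited")

# Usage: sample_rss(<pid of the source connector's python process>)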

@misteryeo
Contributor

@pranavhegde4 based on the latest comment, it sounds like this hasn't been reproducible. Are you able to confirm that this issue still persists?

bleonard added the autoteam and team/tse (Technical Support Engineers) labels on Apr 26, 2022
misteryeo added the team/connectors-python label and removed the team/tse and community labels on May 3, 2022
@roman-romanov-o
Contributor

@pranavhegde4 are there any updates? Can you confirm the problem still persists?

@pranavhegde4
Author

We had to limit the Mixpanel ingestion so that it queries only one day's worth of data at a time. This has been working so far, as there is less data in a single day. But I suspect the issue still persists (unless a code change has been made), and we will encounter it again if we try to ingest more than one day's worth of data (maybe around 5 to 10 days at a time).

roman-romanov-o self-assigned this on May 27, 2022
@pranavhegde4
Author

Update: this issue is occurring again. Even one day's worth of data has now grown in volume, and we are not able to ingest data at all.

@roman-romanov-o
Contributor

@pranavhegde4 can you please provide the latest logs?

@pranavhegde4
Author

pranavhegde4 commented May 30, 2022

This one
logs-501-0.txt

lazebnyi linked a pull request on Jun 1, 2022 that will close this issue
@roman-romanov-o
Contributor

@pranavhegde4 I could not find a specific memory bottleneck to fix.

Regarding your issue: I've added date_window_size to the connector parameters.
You can reduce it to change the amount of data the connector processes in each iteration.
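
For readers landing here later: once that change is released, a smaller window can be set directly in the source configuration instead of patching the connector. A sketch of what that might look like; only date_window_size is the parameter named above, and the other fields and their exact names are illustrative assumptions:

# Hypothetical Mixpanel source configuration; only date_window_size is the
# parameter referenced above, the remaining fields are placeholders.
mixpanel_config = {
    "api_secret": "<project api secret>",   # placeholder credential
    "start_date": "2021-11-01T00:00:00Z",   # placeholder start date
    "date_window_size": 2,                  # days per export window; smaller means less data held per iteration
}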
