
Trying to read a 14GB table fails without reading any rows. It seems to time out after 5 minutes while waiting for query results #8890

Closed
tuliren opened this issue Dec 18, 2021 · 28 comments · Fixed by #17236

Comments

@tuliren
Contributor

tuliren commented Dec 18, 2021

Environment

  • Is this your first time deploying Airbyte: No
  • OS Version / Instance: macOS 11.6.1
  • Memory / Disk: 16GB / 500GB SSD
  • Deployment: Docker
  • Airbyte Version: 0.34.0-alpha
  • Source name/version: MySQL 0.4.13 (0.5.0 and 0.5.1 fail the connection test)
  • Destination name/version: S3 0.1.16
  • Step: On sync
  • Description: Extracting MySQL tables to S3 as Parquet files with Snappy compression has worked for tables as large as 6GB so far, but a 14GB table fails without reading any rows; it seems to time out after 5 minutes while waiting for query results.

Problems

There are two issues:

  1. When using MySQL 0.5.x, the connection check fails:

    javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)

  2. When using MySQL 0.4.13, the connection check passes. However, when the database is relatively large, data sync fails before any row is read.

Expected Behavior

Tell us what should happen.

Logs

Slack thread.

@tuliren tuliren added type/bug Something isn't working needs-triage area/connectors Connector related issues area/databases labels Dec 18, 2021
@tuliren tuliren self-assigned this Dec 18, 2021
@grishick grishick changed the title MySQL source cannot connect (0.5.x) or sync (0.4.13) data 🐛MySQL source cannot connect (0.5.x) or sync (0.4.13) data May 3, 2022
@grishick grishick changed the title 🐛MySQL source cannot connect (0.5.x) or sync (0.4.13) data Trying to read a 14GB table fails without reading any rows. It seems to time out after 5 minutes while waiting for query results Aug 26, 2022
@grishick
Contributor

Assigning to GL to test using the new e2e testing tool.

  1. Please create a data set (or a script that generates a data set, e.g. TPC-DI or TPC-DS)
  2. create a test scenario
  3. report on test results

@DoNotPanicUA
Contributor

The Simple sync scenario is enough to reproduce the issue.
We only need credentials for a MySQL source with a 14GB dataset and the S3 destination.

@grishick
Contributor

grishick commented Sep 7, 2022

@DoNotPanicUA @davinchia @evantahler I think this is a perfect use case for a perf test that should be run periodically to catch regressions. I think the right approach here would be to:

  1. Save a data set in S3/GCP as a db dump that can be loaded into MySQL
  2. create a GH action that spins up a MySQL DB, loads the dataset into it, runs the E2E tool, saves the results in GCS or S3 and emails them to DBS & DWs team

I am open to suggestions though, and if you think we should have a quick call to lay out the solution - let's do that.

@DoNotPanicUA
Contributor

DoNotPanicUA commented Sep 7, 2022

@grishick
small input from my side.
We are currently developing a docker version of the E2E tool and the integration with GH actions. So, right now, we can focus on the dataset preparation, etc.
Then we can easily configure a GH action with the latest version.

P.S. We have plans on the roadmap for the E2E tool to handle dataset management, but there is a lot of stuff that needs to be done before that :)

@davinchia
Contributor

@grishick Approach sounds sane to me.

Some thoughts:

  1. Instead of emailing, we can do Slack.
  2. We can eventually reuse this dump for more than just MySQL.

Sounds like we should wait for the E2E testing tool to be ready as per @DoNotPanicUA's comment to leverage that work.

@grishick
Contributor

grishick commented Sep 7, 2022

@DoNotPanicUA that's great.

@davinchia I agree that Slack is better.

Yes, I think we should let the GL team make the E2E tool into a Docker container, create a GH action, figure out how to spin up external resources (such as a source or a destination database instance), and figure out how to load test datasets; then we can have repeatable runs for data sets of various sizes and complexities.

@davinchia
Contributor

Great! Very excited!

@supertopher @git-phu @cpdeethree are around for any infra/tooling requests!

@suhomud
Contributor

suhomud commented Sep 15, 2022

@DoNotPanicUA @grishick, I think I'll open sub-tickets for the part related to the testing tool.

Regarding the issue itself, I'm on it. I found that if I use a local MySQL DB with a 16GB table, the sync passes successfully, so the issue happens only with the Amazon RDS instance. My assumption is that it has something to do with RDS configs. Investigating...

@grishick
Contributor

> @DoNotPanicUA @grishick, I think I'll open sub-tickets for the part related to the testing tool.
>
> Regarding the issue itself, I'm on it. I found that if I use a local MySQL DB with a 16GB table, the sync passes successfully, so the issue happens only with the Amazon RDS instance. My assumption is that it has something to do with RDS configs. Investigating...

Makes sense. There are also networking and RDS-specific configs involved.

@suhomud
Contributor

suhomud commented Sep 16, 2022

Some new input from my side. I was able to get the sync to pass with RDS MySQL and S3 by changing the PreparedStatement fetch size:
preparedStatement.setFetchSize(Integer.MIN_VALUE);
This allows replication to pass, but it means we are fetching one row at a time, which affects the sync time significantly. A finding from the MySQL docs:
ResultSet

By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.

To enable this functionality, create a Statement instance in the following manner:


stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,  java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
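
As a rough, self-contained illustration of this streaming pattern (not the connector's actual code; the JDBC URL, credentials, and table name below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamingReadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for a MySQL instance.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://<host>:3306/<database>", "<user>", "<password>");
         PreparedStatement stmt = conn.prepareStatement(
            "SELECT * FROM <large_table>",
            ResultSet.TYPE_FORWARD_ONLY,
            ResultSet.CONCUR_READ_ONLY)) {
      // Integer.MIN_VALUE tells MySQL Connector/J to stream rows one at a time
      // instead of buffering the whole result set in client memory.
      stmt.setFetchSize(Integer.MIN_VALUE);
      try (ResultSet rs = stmt.executeQuery()) {
        long rows = 0;
        while (rs.next()) {
          rows++; // emit each record downstream here
        }
        System.out.println("Read " + rows + " rows");
      }
    }
  }
}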

I have an idea: check the local sync with resources similar to Airbyte Cloud. If it is a resource issue, then we can solve it by increasing the connector resources.

@grishick
Contributor

@tuliren @suhomud my understanding is that fetch size is set dynamically based on an estimate of the average row size. So, what I wonder is:

  • Why doesn't dynamic fetch size work in this scenario, and how can we make it work?
  • How large is a single row, and what is the distribution of row sizes in this table?
  • How does the behavior of this scenario change when we increase the memory allocation for the source container to 1GB, 1.5GB, 2GB?
  • Is the data set in the local test exactly the same as the data set in the RDS test?
  • If the problem is that moving a result set over the network takes more than 5 minutes, is there a timeout setting we can adjust?

@suhomud
Contributor

suhomud commented Sep 19, 2022

@grishick
I made a local DB with a table of one million records and a size of 14344 MB, so each record is roughly 14 kB. I set the same resource quota as we have in the Airbyte prod env:

JOB_MAIN_CONTAINER_CPU_REQUEST=2
JOB_MAIN_CONTAINER_CPU_LIMIT=2
JOB_MAIN_CONTAINER_MEMORY_REQUEST="200Mi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT="200Mi"

NORMALIZATION_JOB_MAIN_CONTAINER_MEMORY_LIMIT: "1.5Gi"
NORMALIZATION_JOB_MAIN_CONTAINER_MEMORY_REQUEST: "1.5Gi"

Are these the right values for the prod env? Do we have a separate memory quota for MySQL?

The local sync failed. It looks like the fetch size has been ignored and the select query tries to fetch everything into memory.
There are two things I will test:

  • stmt.setFetchSize(Integer.MIN_VALUE) disables client-side caching. What I'm going to check is the sync performance impact
  • Also, I'll test increasing the resource quota. I assume this will also work, but it can be an issue with tables larger than 14GB

@suhomud
Contributor

suhomud commented Sep 19, 2022

@grishick @tuliren ,
Let me summarise what I found.

MySQL JDBC driver does not respect fetch size ⛔️

This part is ignored by the MySQL driver:
preparedStatement.setFetchSize(FetchSizeConstants.INITIAL_SAMPLE_SIZE);
https://github.com/airbytehq/airbyte/blob/master/airbyte-db/db-lib/src/main/java/io/airbyte/db/jdbc/streaming/AdaptiveStreamingQueryConfig.java#L36
The MySQL connector tries to fetch all table records into memory.

JOB_MAIN_CONTAINER_MEMORY_REQUEST="500Mi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT="500Mi"
fetchSize = 10

fetch_size_10_500MB.txt

Increasing MySQL connector memory does not solve the issue ⛔️
I tried increasing the connector memory to 1 GB, but it only increased the waiting time before the sync failed. My guess is that because we are using the JDBC param ?useCursorFetch=true here, we stress the RDS MySQL DB, since it tells the server to use server-side cursors; however, the fetch size is still ignored 🙉

JOB_MAIN_CONTAINER_MEMORY_REQUEST="1Gi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT="1Gi"
fetchSize = 10

fetch_size_10_1GB.txt
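
For reference, ?useCursorFetch=true is a JDBC URL parameter; with placeholder host and database it looks roughly like:

jdbc:mysql://<host>:3306/<database>?useCursorFetch=true

When it is set, the driver asks the server for a server-side cursor, which can materialize the result set on the server side - consistent with the RDS disk usage observed later in this thread.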

The destination S3 connector fails with OutOfMemoryError ⛔️

I changed the fetch size to preparedStatement.setFetchSize(Integer.MIN_VALUE). This solved the mysql-source memory issue, but destination-S3 failed with an OutOfMemoryError while consuming 200 MB of data

JOB_MAIN_CONTAINER_MEMORY_REQUEST="500Mi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT="500Mi"
fetchSize = Integer.MIN_VALUE

fetch_size_Integer.MIN_VALUE_500MB.txt

Working sync ✅🥇
The sync was successful with the following changes:

JOB_MAIN_CONTAINER_MEMORY_REQUEST="700Mi"
JOB_MAIN_CONTAINER_MEMORY_LIMIT="700Mi"
fetchSize = Integer.MIN_VALUE

fetch_size_Integer.MIN_VALUE_700MB.txt

My proposal would be:

  • Add a custom MySqlStreamingQueryConfig that extends AdaptiveStreamingQueryConfig and overrides the initialize method to use preparedStatement.setFetchSize(Integer.MIN_VALUE) for MySQL only (see the sketch after this list).
  • Set a 700Mi resource quota for the MySQL source and S3 destination.
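
A minimal sketch of what the first bullet could look like, assuming the parent class exposes an initialize hook that receives the statement (the exact class layout and signature in the Airbyte codebase may differ):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// MySQL-only streaming config: always stream rows one at a time instead of
// adapting the fetch size, since the driver was observed to ignore positive fetch sizes.
public class MySqlStreamingQueryConfig extends AdaptiveStreamingQueryConfig {

  @Override
  public void initialize(final Connection connection, final Statement statement) throws SQLException {
    // Integer.MIN_VALUE switches MySQL Connector/J into row-by-row streaming mode.
    statement.setFetchSize(Integer.MIN_VALUE);
  }
}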

Please let me know your thoughts/ideas 🤔

@DoNotPanicUA
Contributor

Please note that preparedStatement.setFetchSize(Integer.MIN_VALUE) will affect the source performance.

@suhomud
Contributor

suhomud commented Sep 19, 2022

> Please note that preparedStatement.setFetchSize(Integer.MIN_VALUE) will affect the source performance.

@DoNotPanicUA I'm not sure about previous metrics. The one I found is this.
In my case the sync took about 1.3 hours for a 14GB table with 8 columns.
One finding from Stack Overflow:
setFetchSize(Integer.MIN_VALUE) is not as bad as the docs might imply. It actually just disables client-side caching of the entire response and gives you responses as they arrive; there is no round-trip per row required.

@DoNotPanicUA
Contributor

If we stop using caching somewhere, it will work less efficiently. The question is how dramatic it will be.
To figure that out, we need to compare two runs with different FetchSize values. May I ask you to run them, as you already did before?

@suhomud
Contributor

suhomud commented Sep 19, 2022

> If we stop using caching somewhere, it will work less efficiently. The question is how dramatic it will be.
> To figure that out, we need to compare two runs with different FetchSize values. May I ask you to run them, as you already did before?

I ran it first without any connector limitations, so it was the Docker limit in my case, 8 GB for each connector. Also, the DB was local, where the CPU is much better than on RDS. I noticed that when I use the default fetch size (10), it stresses the CPU a lot because of ?useCursorFetch=true. Let me try to find the connector resources that will be enough for the sync; the next step will be changing RDS resources to handle select * from table with ?useCursorFetch=true.

@suhomud
Contributor

suhomud commented Sep 20, 2022

Benchmark

  1. PreparedStatement FetchSize Integer.MIN_VALUE Local MySql DB instance

Database: Local MySql DB instance
Connectors: Mysql Source -> S3 Destination 700MB memory limit
Table: 12 - 14 GB, 1 000 000 records
PreparedStatement FetchSize: Integer.MIN_VALUE
Result: 32m
Screenshot 2022-09-20 at 14 45 30

Note: This approach does not load the DB CPU or disk

  2. PreparedStatement FetchSize 10 Local Mysql DB instance
    Database: Local MySql DB instance
    Connectors: Mysql Source -> S3 Destination 1GB memory limit
    Table: 12 - 14 GB, 1 000 000 records
    PreparedStatement FetchSize: 10
    Result: 37m

local_mysql_db_fetchSize_10

Note: this approach loads the CPU and disk of the local MySQL DB

  3. PreparedStatement FetchSize Integer.MIN_VALUE RDS MySql instance

Database: MySql RDS instance
Connectors: Mysql Source -> S3 Destination 700MB memory limit
Table: 12 - 14 GB, 1 000 000 records
PreparedStatement FetchSize: Integer.MIN_VALUE
Result: 1 h 33m
mysql_RDS_DB

  4. PreparedStatement FetchSize 10 RDS Mysql DB
    Database: MySql RDS instance
    Connectors: Mysql Source -> S3 Destination 1GB memory limit
    Table: 12 - 14 GB, 1 000 000 records
    PreparedStatement FetchSize: 10
    Result: The sync failed with "The last packet successfully received from the server was 300,800 milliseconds ago"
    I tried with two RDS DB types:

rsd_instance1

and:

Screenshot 2022-09-20 at 12 43 04

I also changed the default RDS MySQL configs with a custom Parameter Group to override different timeout params like wait_timeout and net_read_timeout, but was not able to complete the sync successfully.

Summary

  • The FetchSize Integer.MIN_VALUE sync was faster than FetchSize=10 (32m vs 37m) with the local MySQL DB instance.
  • With FetchSize 10, clients must deal with DB configuration to be able to make a successful sync, because this approach uses a DB cursor (?useCursorFetch=true) that consumes DB disk space; in my case it was nearly 12 GB for a 12 GB table

RDS_DB_select_query_affect_memory

I think we should definitely go with FetchSize Integer.MIN_VALUE because performance did not change and maybe even improved.
@DoNotPanicUA @grishick, please let me know your opinion on this

@DoNotPanicUA
Contributor

@suhomud, thank you for such a stunning report 👍
It looks like the cache stores the result data locally to send it in chunks. When you run the source locally (near Airbyte), we get a boost because network traffic is minimal and we skip storing data in cache files.
So, we can use FetchSize Integer.MIN_VALUE safely if we consider that user sources are not far from their Airbyte instances.

@grishick
Contributor

These are interesting findings. Most of the time Airbyte will not be colocated with the database, and it is never colocated in the case of Airbyte Cloud, so the defaults should optimize for a remote DB. I would like to understand more about the failure in the 4th scenario (remote DB and FetchSize = 10).

  • Could you capture the CPU and memory utilization graph from RDS during the sync job?
  • Could you post the entire stack trace with "The last packet successfully received from the server was 300,800 milliseconds ago" error?
  • Could you also document the values for wait_timeout and net_read_timeout?

cc @davinchia @evantahler @bleonard tagging you here, because you expressed interest in the project

@suhomud
Contributor

suhomud commented Sep 26, 2022

@grishick,

  1. The resources.

Before the sync:

Screenshot 2022-09-26 at 12.29.29.png

After the sync:

Screenshot 2022-09-26 at 12.49.35.png

CPU:
Screenshot 2022-09-26 at 12.38.48.png
Memory RAM:
Screenshot 2022-09-26 at 12.41.27.png
NOTE: After the sync failed and was canceled, the memory was not released; only rebooting helped
Memory HD:
Screenshot 2022-09-26 at 12.52.35.png

  2. Log file

  3. Here is the list of params that I changed for RDS:

connect_timeout = 1200
innodb_lock_wait_timeout = 1200
interactive_timeout = 1200
net_read_timeout = 1200
wait_timeout = 1200

And RDS Instance is:
Screenshot 2022-09-26 at 12.48.24.png

Do you have an idea what else can be changed for RDS?

cc @DoNotPanicUA @evantahler @bleonard

@suhomud
Contributor

suhomud commented Sep 26, 2022

Update: I was able to start the sync successfully with fetch size 10 by increasing the RDS instance disk from 50 GB to 65 GB. Let me finish the sync and I'll post the result.

@suhomud
Contributor

suhomud commented Sep 26, 2022

So it seems like the extra disk space partially solved the issue. The sync is still in progress: in 5 hours it has fetched only 2 GB.
With FetchSize Integer.MIN_VALUE the full sync was done in 1.5 hours

@grishick
Contributor

@suhomud AFAIK Integer.MIN_VALUE is a negative number - what does it mean for a MySQL JDBC driver to set fetch size as a negative number? My guess is that this setting is equivalent to not setting fetch size at all or telling MySQL to automatically set it. @tuliren do you have any insight on this?

@grishick
Contributor

@suhomud never mind my question - I see you have already answered it previously with a quote from the MySQL docs

@grishick
Contributor

Looking at the test results above, I am leaning towards agreeing with the conclusion that we have to set the fetch size to Integer.MIN_VALUE for MySQL and not attempt to set it dynamically.

@suhomud
Contributor

suhomud commented Sep 27, 2022

@grishick I made a draft PR with the required change: #17236
I also asked @alexandr-shegeda to run a test sync with this change, since I'm located in the Asia region now and maybe that affects Amazon availability for me.

@suhomud
Contributor

suhomud commented Sep 28, 2022

Hi @grishick, the PR is ready for review.
We ran additional tests. I did not notice a big difference in fetch time between the two approaches. However, fetch size 10 was really slow to start the sync and loads RDS disk and memory. With Integer.MIN_VALUE the sync starts immediately for big tables and does not affect the DB.
