
🐛 Redshift Destination: copies only 1 csv stream part from S3 #10646

Closed
hellmund opened this issue Feb 24, 2022 · 5 comments · Fixed by #11254

@hellmund

Environment

  • Airbyte version: 0.35.36-alpha
  • OS Version / Instance: Amazon Linux 2 AMI, EC2 t3a.large
  • Deployment: Docker
  • Source Connector and version: Redshift 0.3.8
  • Destination Connector and version: Redshift 0.3.25
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

  • Only the last stream part from S3 actually gets copied into the Redshift destination
  • The "Succeeded" summary on the status web page falsely indicates that all records were emitted and committed

Expected Behavior

  • All parts get copied into the Redshift destination
  • The "Succeeded" summary on the status web page correctly indicates that all records were emitted and committed

Logs

The logs and the S3 bucket show that the multiple CSV staging files contain the full data. The log output doesn't mention copying the individual files; it only reports RedshiftStreamCopier(copyStagingFileToTemporaryTable) "Copy to tmp table complete".

Steps to Reproduce

  1. Use a data source larger than 150 MB (i.e. larger than the S3 Stream Part Size)
  2. Configure the Redshift destination with S3 staging
  3. Trigger the connection
  4. All CSV parts are loaded into S3 correctly
  5. Observe how quickly RedshiftStreamCopier(copyStagingFileToTemporaryTable) completes
  6. Verify that only the last part has actually been copied to Redshift

Are you willing to submit a PR?

No, as I don't have context on the surrounding recent changes.

Additional observations: this seems to be related to the changes in #9920 affecting entries in "stagingWritersByFile":

@Override
public void closeNonCurrentStagingFileWriters() throws Exception {
  Set<String> removedKeys = new HashSet<>();
  for (String key : activeStagingWriterFileNames) {
    if (!key.equals(currentFile)) {
      stagingWritersByFile.get(key).close(false);
      // The closed writer is also removed from stagingWritersByFile, so the
      // copier no longer knows about any part other than currentFile.
      stagingWritersByFile.remove(key);
      removedKeys.add(key);
    }
  }
  activeStagingWriterFileNames.removeAll(removedKeys);
}

"stagingWritersByFile" gets used in RedshiftStreamCopier to generate the manifest of which csv files to copy to Redshift

Another user experienced similar issues with Snowflake using S3 intermediary: https://airbytehq.slack.com/archives/C01MFR03D5W/p1645735115553539?thread_ts=1645734289.113269&cid=C01MFR03D5W

Temporary work-around: disable the S3 staging (bucket + COPY) option on the Redshift destination and fall back to the less efficient batch INSERT mode instead.

@BenoitFayolle
Contributor

Having the same experience with:
  • Airbyte version: 0.35.49-alpha
  • Deployment: Docker
  • Destination Connector and version: Redshift 0.3.27

Verified that only one CSV file was loaded into Redshift by querying the STL_LOAD_COMMITS table.
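
For reference, a minimal sketch of that check over JDBC (the cluster URL and credentials are placeholders, and the Redshift JDBC driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CheckLoadCommits {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details for the target Redshift cluster.
    String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev";
    try (Connection conn = DriverManager.getConnection(url, "awsuser", "password");
         Statement stmt = conn.createStatement();
         // STL_LOAD_COMMITS records every file loaded by COPY; with this bug,
         // only one staged CSV part shows up per sync instead of all of them.
         ResultSet rs = stmt.executeQuery(
             "SELECT filename, lines_scanned, curtime "
                 + "FROM stl_load_commits ORDER BY curtime DESC LIMIT 20")) {
      while (rs.next()) {
        System.out.printf("%s  %d rows  %s%n",
            rs.getString("filename").trim(),
            rs.getLong("lines_scanned"),
            rs.getTimestamp("curtime"));
      }
    }
  }
}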

@drewfustin

drewfustin commented Mar 16, 2022

My experience in #11158:

#11158 (comment):

Airbyte processed 88,769 emitted records into 2 CSV files in staging S3: one with 41,147 records and one with 47,622 records. Only one of those two files was written to Redshift, as both the _airbyte_raw_shortened_urls table and the shortened_urls table have only 41,147 records. If "Purge Staging Files and Tables" is set to True on the Redshift destination, both those CSV files are then deleted. I turned it to False to check the records in each CSV, and voila.

Airbyte version: 0.35.54-alpha
OS Version / Instance: AWS EC2 (t3.large instance with amzn2-ami-hvm-2.0.20220218.3-x86_64-ebs AMI)
Deployment: Docker on EC2
Source Connector and version: Postgres 0.4.9
Destination Connector and version: Redshift 0.3.27

@kyle-cheung

Having the exact same issue but on the Snowflake destination. I thought it was because I upgraded to 0.35.55-alpha, but even after a downgrade back to 0.35.31-alpha the issue still persists. At the end of a successful run Airbyte throws this:

errors: $.access_key_id: is not defined in the schema and the schema does not allow additional properties,
$.s3_bucket_name: is not defined in the schema and the schema does not allow additional properties,
$.s3_bucket_region: is not defined in the schema and the schema does not allow additional properties,
$.secret_access_key: is not defined in the schema and the schema does not allow additional properties,
$.method: does not have a value in the enumeration [Standard]

@drewfustin

drewfustin commented Mar 18, 2022

I'm not sure if @hellmund's observation in the initial comment that this is caused by #9920 is correct or not. I WILL say that I rolled back to v0.35.29-alpha and all seems to be right in the world, but I don't have multiple parts for my S3 CSVs anymore, so it's hard to say. (cc @andriikorotkov)

@VitaliiMaltsev
Contributor

Fixed in destination-redshift 0.3.28
