Option to disable hive partitioning wild cards #232

Open
niydt opened this issue Aug 6, 2021 · 8 comments

Comments

@niydt

niydt commented Aug 6, 2021

The Avro files we are trying to load into Redshift are stored in folders with "=" in their names, e.g.

    event_type=users.behaviors.app.FirstSession/

We tried to load data from the following S3 prefix:

com.hoopladigital.brazecurrentsstaging/StagingCurrentFull/dataexport.prod-03.S3.integration.60d3692fcab9ca5f83919aab/event_type%3Dusers.behaviors.app.FirstSession

The lambda failed with this error:

            error: No Configuration Found for com.hoopladigital.brazecurrentsstaging/StagingCurrentFull/dataexport.prod-03.S3.integration.60d3692fcab9ca5f83919aab/event_type=*/date=*/399/prod-03

As shown in the error message above, the "event_type=*/date=*" portion of the prefix was transformed on the assumption that we are using Hive partitioning wildcards (https://github.com/awslabs/aws-lambda-redshift-loader#hive-partitioning-style-wildcards), which replaces the event_type value with *.

We don't want to use this feature. We need the Lambda to use the exact folder name that we provided in the prefix. Is there a way to configure the Lambda so that it does not use Hive partitioning wildcards?

Line 1584 of index.js:
inputInfo.prefix = inputInfo.bucket + '/' + searchKey.transformHiveStylePrefix();

Line 78 of index.js:
transformHiveStylePrefix()
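
For readers following along, the behavior described above can be reproduced with a short sketch. This is not the loader's actual transformHiveStylePrefix() implementation; the function name toHiveWildcardPrefix, the bucket name, and the date value are assumptions made up for illustration. It only shows how replacing everything after each "=" in a path segment turns a literal folder name into a wildcard pattern like the one in the error message.

```javascript
// Illustrative only: NOT the loader's real transformHiveStylePrefix().
// Replaces the value of every "key=value" path segment with "*".
function toHiveWildcardPrefix(prefix) {
  // "=" followed by any run of non-"/" characters becomes "=*"
  return prefix.replace(/=[^/]*/g, '=*');
}

// Hypothetical key; the bucket name and date value are made up.
const exampleKey =
  'my-bucket/export/event_type=users.behaviors.app.FirstSession/date=2021-08-06/399/prod-03';

console.log(toHiveWildcardPrefix(exampleKey));
// prints "my-bucket/export/event_type=*/date=*/399/prod-03"
```

A configuration stored under the literal event_type=users.behaviors.app.FirstSession prefix would never match the transformed string, which is consistent with the "No Configuration Found" error.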

@IanMeyers
Contributor

Understood. Could you live with the ability to turn this on and off at the function level? That is, for the whole installation of the loader, it either would or would not perform Hive wildcard transforms. This would be relatively easy to support, while doing it selectively per prefix will require a bit more thinking.

@IanMeyers
Contributor

Also - how many of these event types do you have? If you could suppress specific prefixes from Hive wildcard transforms, would that be achievable or do you have too many event types to list?

@jbrew8

jbrew8 commented Aug 6, 2021

I am the user that reported this to AWS support (and they logged this issue on my behalf). Thank you for looking into this.

Disabling this feature at the function level would be fine as we currently do not have plans to use hive wildcards.

Managing a list of prefixes to exclude is also fine. We currently have 13 event types, and while this might increase slightly in the future, it should remain easy to maintain a list.

@niydt
Author

niydt commented Aug 6, 2021

Hi Ian,

Thank you for looking into this issue. The problem was raised on jbrew8's behalf, and we would really appreciate a quick workaround or a fix in the near future.

@IanMeyers
Contributor

OK. So here's my proposal. Please download version 2.8.0 from https://awslabs-code-us-east-1.s3.amazonaws.com/LambdaRedshiftLoader/AWSLambdaRedshiftLoader-2.8.0.zip, which has not yet been pushed to GitHub. Set the environment variable SuppressWildcardExpansionPrefixList to one of the following values:

  • * to suppress all Hive wildcard expansions
  • a comma-separated list of prefixes for which wildcard expansion should be suppressed
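
As a rough sketch of how a setting like this could be interpreted (not the code shipped in 2.8.0): the helper names below are hypothetical, and the matching rule, a leading-string match against each list entry, is an assumption, since the thread does not say how entries are compared.

```javascript
// Hypothetical helpers; the real loader's logic may differ.
// Assumption: a list entry suppresses expansion when the input prefix
// starts with that entry.
function shouldSuppressWildcardExpansion(prefix, listValue) {
  if (!listValue) return false; // variable not set: expand as before
  const entries = listValue
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0);
  if (entries.includes('*')) return true; // "*" suppresses everything
  return entries.some((entry) => prefix.startsWith(entry));
}

function resolvePrefix(prefix, listValue) {
  if (shouldSuppressWildcardExpansion(prefix, listValue)) {
    return prefix; // keep "key=value" folder names literal
  }
  // Illustrative stand-in for the Hive-style wildcard transform.
  return prefix.replace(/=[^/]*/g, '=*');
}
```

Inside a Lambda handler this could be called as resolvePrefix(searchKey, process.env.SuppressWildcardExpansionPrefixList).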

Given that you only have 13 prefixes, you could load those directly into the variable, but note that all environment variables together cannot exceed 4 KB.
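
To sanity-check the size before deploying, something like the following could be used. It is a rough estimate only: the helper names are hypothetical, the base prefix and the second event type are placeholders, and counting the UTF-8 bytes of every key and value is an assumption about how the 4 KB total is measured.

```javascript
// Rough estimate only. Assumption: the limit is compared against the
// combined UTF-8 byte length of all variable names and values.
const ENV_LIMIT_BYTES = 4 * 1024;

// Builds the comma-separated value from a base prefix and event types.
function buildPrefixList(basePrefix, eventTypes) {
  return eventTypes.map((type) => `${basePrefix}/event_type=${type}`).join(',');
}

// Sums the byte length of every key and value in an environment map.
function estimateEnvSize(env) {
  return Object.entries(env).reduce(
    (total, [key, value]) =>
      total + Buffer.byteLength(key, 'utf8') + Buffer.byteLength(String(value), 'utf8'),
    0
  );
}

// Placeholder inputs; substitute the real bucket, path, and event types.
const listValue = buildPrefixList('my-bucket/export', [
  'users.behaviors.app.FirstSession',
  'users.behaviors.ExampleEvent', // hypothetical event type
]);

const size = estimateEnvSize({ SuppressWildcardExpansionPrefixList: listValue });
console.log(`${size} of ${ENV_LIMIT_BYTES} bytes used`);
```

If the estimate gets close to the limit, setting the variable to * avoids listing prefixes at all.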

I have tested this within my account and it works great, but would like to validate with you first before shipping.

@jbrew8

jbrew8 commented Aug 10, 2021

Thanks @IanMeyers. I'll give this a shot and let you know how it works.

@jbrew8

jbrew8 commented Aug 11, 2021

@IanMeyers I had a chance to try your changes, and it looks like they solve our problem. With the new environment variable set, prefixes that contain an equals sign are treated literally, and the data is loaded into Redshift correctly. Thank you for your quick turnaround on this issue.

@IanMeyers
Contributor

IanMeyers commented Aug 11, 2021 via email


3 participants