Allow a user to filter fields/columns from a table #2227
Comments
Created a duplicate, will just add my description:
Tell us about the problem you're trying to solve
Users currently cannot select which fields (columns) they want to sync from their streams -- if you pick a stream, it's all or nothing. However, users want the ability to project which fields they want. A simple use case is sensitive data, as described in #2463. We can do this at the Airbyte level without having to do it separately for every connector by filtering the incoming records to contain only the selected fields in the catalog. Sources can still implement projection as needed (e.g. a DB source will still want to select only the relevant columns for efficiency reasons), but this can be implemented "for free" outside the connectors.
Describe the alternative you've considered or used
Projection at the destination level |
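The record-level filtering proposed above could look roughly like the following. This is a minimal sketch, not Airbyte's actual implementation: the function names `selected_fields` and `project_records` are hypothetical, and it assumes the selected fields are the top-level `properties` keys of the stream's JSON schema in the catalog.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def selected_fields(json_schema: Dict[str, Any]) -> Set[str]:
    """Return the top-level field names present in a stream's JSON schema.

    Assumption: a deselected field is simply absent from "properties".
    """
    return set(json_schema.get("properties", {}))


def project_records(
    records: Iterable[Dict[str, Any]], json_schema: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    """Yield each record restricted to the selected top-level fields."""
    keep = selected_fields(json_schema)
    for record in records:
        # Drop every key that is not part of the selected schema.
        yield {name: value for name, value in record.items() if name in keep}
```

Note that a filter like this runs after the source has already emitted the full record, so it keeps unselected fields out of the destination but not out of the platform itself.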
Bumping this issue as it's very relevant to us (and probably most users affected by GDPR or similar local legislation). In our case, we're pulling data from Salesforce which contains sensitive information and writing it to S3 where it should be considered immutable, so preventing the sensitive stuff from entering our systems at all would be a big plus. |
Fairly important for our use cases too. For many of our key tables less than 30-40% of the columns are actually required for analytics use cases. Selecting all of them significantly adds to transfer load and warehouse size. |
Very important for us too, would be really great to be able to select specific fields the same way we select models. |
+1 for this too, I want to bring Airbyte to our org but cannot if we can't deselect extremely sensitive fields, as we don't want those living in our Snowflake warehouse |
Related #4400 |
Another reason why this feature would be useful: my source (as well as my destination) is Postgres and I have a column whose type is not understood by Airbyte (it's |
@b4stien for DB based sources, one workaround is to define a DB view with only required columns and then use it as source stream. |
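A minimal sketch of that view-based workaround, under stated assumptions: the table `employees`, the view `employees_sync`, and the column names are hypothetical, and an in-memory SQLite database stands in for the real source so the example runs on its own. On Postgres the `CREATE VIEW ... AS SELECT ...` statement has the same shape.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the real source DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, ssn TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', '000-00-0000')")

# The view exposes only the columns that should be replicated.
conn.execute("CREATE VIEW employees_sync AS SELECT id, name FROM employees")

# Reading from the view never returns the sensitive column.
cursor = conn.execute("SELECT * FROM employees_sync")
view_columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
```

The view would then be selected as the source stream in place of the underlying table.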
@ppatali in large organizations access/modifications to the production DB are slow or impossible. |
@sherifnada Would it be possible to pass the selected fields as a catalog to a connector? We would like to adapt the API calls based on the provided catalog, depending on which fields are selected certain parameters have to be sent with an API request. |
@n0rritt depending on the source it might work. DB sources should be able to respond appropriately, but most API sources won't. |
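For a DB source, responding to the selected fields could amount to building the column list of the query from the catalog. The sketch below is an illustration only: `quote_identifier` and `build_select` are hypothetical helpers, not part of any Airbyte connector, and it assumes the selected fields are the top-level `properties` keys of the stream's JSON schema.

```python
from typing import Any, Dict


def quote_identifier(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table: str, json_schema: Dict[str, Any]) -> str:
    """Build a SELECT that reads only the fields present in the schema."""
    columns = list(json_schema.get("properties", {}))
    if not columns:
        raise ValueError("no fields selected for stream " + table)
    column_list = ", ".join(quote_identifier(column) for column in columns)
    return f"SELECT {column_list} FROM {quote_identifier(table)}"
```

With this approach the unselected columns are never read from the database, which matters both for sensitive data and for very wide tables.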
Just want to comment again as this has become very important for us. We're trying to set up HR/People analytics using the BambooHR and Greenhouse source connectors; however, there is some extremely sensitive information, especially from Bamboo, including SSNs, salaries, addresses, etc. We don't need this information, and as it stands we can't set up this connection if we have to replicate the full table. Thanks! |
I'm happy someone is bumping this - I'm a bit concerned that this issue has been open for a year now, especially since it's the 3rd-most (and until recently, 2nd-most) upvoted issue.
@kyle-cheung we also pull data from BambooHR and, luckily, Bamboo lets you restrict which fields the API token can access. We use that to avoid pulling sensitive data. Might be a solution in your case. |
Hello team 👋 I'd also like to be able to add fields that are not mentioned in the default catalog. |
This is a real pain point for us. If we're selecting only certain columns to sync, we don't want them filtered after they get to us. They should never reach Airbyte in the first place. |
Same here— this has pretty big implications for security and PII |
Same here and confirmed that when using BQ as the destination, the column can still show up even after editing the catalog... |
We are on the self-hosted version and we really need this functionality to keep using Airbyte because of the PII implications and security |
AFAIK this is already implemented and released |
@grishick can you show me where this is the case? I just tried running the latest Airbyte version (0.40.32) with a connection from a postgres database to a file. When setting up the connection, I don't have the option to filter fields/columns (see screenshot). |
@andyjih can you update this ticket with the status of the feature? |
Hi all, this feature isn't yet available, but it's actively being worked on. We'll follow up on this issue when there's an update. Sorry for the confusion! |
Hi @andyjih, do you have a timeline for this feature? It's been repeatedly pushed back in the roadmap, which is unfortunate given that it's been the 2nd-most popular issue for a long time now (and the 1st has been delivered). |
This feature is much needed! |
Hi @b4stien, could you please let me know where I can find the objects Edit: I found this was in the Airbyte metadata db. |
Would appreciate this feature a lot |
Apologies for adding to the pile-on here, but this is an issue for our team as well. Poking around in the catalog feels a bit brittle, so something baked in to Airbyte would be excellent. |
Can you show us one example of how you're editing it? |
This feature would help my squad and team to address GDPR, PII and other regulations! |
Very important feature to address data protection and also allow ability to only load what is needed. |
This is extremely important for us. |
It is now available according to this post: https://airbyte.com/blog/airbyte-column-selection-control-over-the-exact-data-to-sync |
Indeed, you can now select which fields you want to replicate when setting up a new connection. |
@malikdiarra are the selected columns available to the source despite being applied by the infrastructure workers? I have a 900-column table (yes, I know, unfortunately it's a third-party vendor platform) that blows open any approach at quickly reading from it. If I could pass through the selected columns from the UI, I could significantly reduce the amount of data being processed from the database. |
Tell us about the problem you're trying to solve
As an operator, I want to be able to select a subset of the fields in my table to be replicated.