Allow a user to filter fields/columns from a table #2227

Closed
cgardens opened this issue Feb 26, 2021 · 51 comments

Comments

@cgardens
Contributor

cgardens commented Feb 26, 2021

Tell us about the problem you're trying to solve

As an operator, I want to be able to select a subset of the fields in my table to be replicated.


@sherifnada
Contributor

Created a duplicate, will just add my description:

Tell us about the problem you're trying to solve

Users currently cannot select which fields (columns) they want to sync from their streams: if you pick a stream, it's all or nothing. However, users want the ability to project which fields they want. A simple use case is sensitive data, as described in #2463. We can do this at the Airbyte level, without having to do it separately for every connector, by filtering the incoming records to contain only the fields selected in the catalog.

Sources can still implement projection as needed (e.g: a DB source will still want to select only the relevant columns for efficiency reasons), but this can be implemented "for free" outside the connectors.
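As a rough illustration of the platform-level filtering described above, here is a minimal sketch. It assumes records are plain dictionaries and that the selection is a set of top-level field names; `project_records` and the sample data are hypothetical and are not taken from the Airbyte codebase.

```python
from typing import Any, Dict, Iterable, Iterator, Set

Record = Dict[str, Any]


def project_records(records: Iterable[Record], selected_fields: Set[str]) -> Iterator[Record]:
    """Yield a copy of each record that keeps only the selected top-level fields."""
    for record in records:
        yield {key: value for key, value in record.items() if key in selected_fields}


# Hypothetical records emitted by a source; "ssn" is the field we do not want to sync.
incoming = [
    {"id": 1, "email": "a@example.com", "ssn": "000-00-0000"},
    {"id": 2, "email": "b@example.com", "ssn": "111-11-1111"},
]
projected = list(project_records(incoming, {"id", "email"}))
```

Note that with this approach the source still reads the unselected fields; they are only dropped before being passed on, which is why a source-side projection can still be worthwhile.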

Describe the alternative you’ve considered or used

Projection at the destination level

@olivermeyer
Contributor

Bumping this issue as it's very relevant to us (and probably most users affected by GDPR or similar local legislation). In our case, we're pulling data from Salesforce which contains sensitive information and writing it to S3 where it should be considered immutable, so preventing the sensitive stuff from entering our systems at all would be a big plus.

@marcosmarxm changed the title from "Allow a user to filter fields from a table" to "Allow a user to filter fields/columns from a table" on Jun 29, 2021

@ashishgupta-97065

Fairly important for our use cases too. For many of our key tables, less than 30-40% of the columns are actually required for analytics use cases. Selecting all of them adds significantly to transfer load and warehouse size.

@jamesutton

+1 for this request! Same issue as @olivermeyer called out: we would rather not ingest PII data at all than have to mask it at the destination. Other integration methods (custom Python scripts or Stitch) accomplish this by letting us select columns.

@agrass
Contributor

agrass commented Jul 9, 2021

Very important for us too. It would be really great to be able to select specific fields the same way we select models.

@kyle-cheung

+1 for this too. I want to bring Airbyte to our org but cannot if we can't deselect extremely sensitive fields, as we don't want those living in our Snowflake warehouse.

@sotte

sotte commented Oct 1, 2021

Related #4400

@b4stien
Contributor

b4stien commented Oct 13, 2021

Another reason why this feature would be useful: my source (as well as my destination) is Postgres and I have a column whose type is not understood by Airbyte (it's bit(343)) and causes a sync failure. For now I have to exclude the whole table to have a successful sync (ideally I'd just have to exclude the faulty column).

@ppatali
Contributor

ppatali commented Oct 14, 2021

@b4stien for DB based sources, one workaround is to define a DB view with only required columns and then use it as source stream.
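A minimal sketch of the view workaround mentioned above. It uses Python's built-in `sqlite3` module only so that the example is self-contained; the thread is about Postgres, where the equivalent `CREATE VIEW` statement would be run against the source database. The table, view, and column names are hypothetical.

```python
import sqlite3

# In-memory database standing in for the real source database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, ssn TEXT, salary REAL);
    INSERT INTO employees VALUES (1, 'Ada', '000-00-0000', 100000.0);
    -- The view exposes only the columns that should be replicated.
    CREATE VIEW employees_sync AS SELECT id, name FROM employees;
    """
)

# Reading from the view returns only the projected columns.
cursor = conn.execute("SELECT * FROM employees_sync")
view_columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()
```

The view would then be selected as the stream to sync instead of the underlying table.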

@mrbungie

@ppatali in large organizations access/modifications to the production DB are slow or impossible.

@t0hai
Contributor

t0hai commented Oct 27, 2021

@sherifnada Would it be possible to pass the selected fields as a catalog to a connector? We would like to adapt the API calls based on the provided catalog: depending on which fields are selected, certain parameters have to be sent with an API request.

@sherifnada
Contributor

@n0rritt depending on the source it might work. DB sources should be able to respond appropriately, but most API sources won't.
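To make the idea in this exchange concrete, here is a hedged sketch of how a connector might derive request parameters from the catalog it receives. The catalog shape (`streams[].stream.json_schema.properties`) is loosely modeled on a configured catalog; whether the platform actually passes only the selected fields to the connector is an assumption, and the `fields` query parameter belongs to a hypothetical API.

```python
from typing import Any, Dict, List


def selected_field_names(configured_catalog: Dict[str, Any], stream_name: str) -> List[str]:
    """Return the sorted field names listed for one stream in a catalog-like dict."""
    for configured_stream in configured_catalog.get("streams", []):
        stream = configured_stream.get("stream", {})
        if stream.get("name") == stream_name:
            properties = stream.get("json_schema", {}).get("properties", {})
            return sorted(properties)
    raise KeyError(f"stream {stream_name!r} not found in catalog")


def build_request_params(fields: List[str]) -> Dict[str, str]:
    """Build query parameters for a hypothetical API that accepts a 'fields' list."""
    return {"fields": ",".join(fields)}


# Hypothetical catalog in which only "id" and "email" remain selected.
catalog = {
    "streams": [
        {
            "stream": {
                "name": "employees",
                "json_schema": {
                    "properties": {"id": {"type": "integer"}, "email": {"type": "string"}}
                },
            }
        }
    ]
}
params = build_request_params(selected_field_names(catalog, "employees"))
```

As noted in the reply above, this only helps when the upstream API can restrict the fields it returns.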

@kyle-cheung

Just want to comment again as this has become very important for us. We're trying to set up HR/People analytics using the BambooHR and Greenhouse source connectors; however, there is some extremely sensitive information, especially from Bamboo, including SSNs, salaries, addresses, etc. We don't need this information, and as it is we can't set up this connection if we have to replicate the full table.

Thanks!

@olivermeyer
Contributor

I'm happy someone is bumping this - I'm a bit concerned that this issue has been open for a year now, especially since it's the 3rd-most (and until recently, 2nd-most) upvoted issue.

> Just want to comment again as this has become very important for us. We're trying to set up HR/People analytics using the BambooHR and Greenhouse source connectors; however, there is some extremely sensitive information, especially from Bamboo, including SSNs, salaries, addresses, etc. We don't need this information, and as it is we can't set up this connection if we have to replicate the full table.
>
> Thanks!

@kyle-cheung we also pull data from BambooHR and, luckily, Bamboo allows restricting which fields the API token can access. We use that to avoid pulling sensitive data. Might be a solution in your case.

@YohanGrember

YohanGrember commented Nov 29, 2022

Hello team 👋
I'm also interested in this field filtering for GDPR reasons.

I'd also like to be able to add fields that are not listed in the default catalog.
For example, for the Stripe connector, our AM unlocked fields that are not listed in the catalog.
We are able to ingest these fields through our legacy Python stack, but as the Airbyte Stripe catalog is hardcoded, these fields are not retrieved during the Airbyte ingestion.
Will the filtering feature, which gives more control over the catalog, also make it possible to add fields?

@will-gp

will-gp commented Feb 2, 2023

This is a real pain point for us. If we're selecting only certain columns to sync, we don't want them filtered after they get to us. They should never reach Airbyte in the first place.

@mattppal

mattppal commented Feb 2, 2023

Same here— this has pretty big implications for security and PII

@jzhang-georgian

#2227 (comment)

Same here and confirmed that when using BQ as the destination, the column can still show up even after editing the catalog...

@annafedotova

We are on the self-hosted version and we really need this functionality to keep using Airbyte because of the PII implications and security

@grishick
Contributor

grishick commented Feb 3, 2023

AFAIK this is already implemented and released

@will-gp

will-gp commented Feb 3, 2023

> AFAIK this is already implemented and released

@grishick can you show me where this is the case? I just tried running the latest Airbyte version (0.40.32) with a connection from a postgres database to a file. When setting up the connection, I don't have the option to filter fields/columns (see screenshot).
[Screenshot: airbyte_sync]

@sherifnada
Contributor

@andyjih can you update this ticket with the status of the feature?

@andyjih
Contributor

andyjih commented Feb 6, 2023

Hi all, this feature isn't yet available, but it's actively being worked on. We'll follow up on this issue when there's an update. Sorry for the confusion!

@olivermeyer
Contributor

> Hi all, this feature isn't yet available, but it's actively being worked on. We'll follow up on this issue when there's an update. Sorry for the confusion!

Hi @andyjih, do you have a timeline for this feature? It's been repeatedly pushed back in the roadmap, which is unfortunate given that it's been the 2nd-most popular issue for a long time now (and the 1st has been delivered).

@piyushsingariya

This feature is much needed!

@yuvaraj91

yuvaraj91 commented Mar 23, 2023

> airbyte_config.connection table, the data is in the catalog

Hi @b4stien, could you please let me know where I can find the objects airbyte_config.connection and catalog?

Edit: I found this was in the Airbyte metadata db.

@vaahtio

vaahtio commented Apr 14, 2023

Would appreciate this feature a lot

@catkins

catkins commented Apr 20, 2023

Apologies for adding to the pile-on here, but this is an issue for our team as well. Poking around in the catalog feels a bit brittle, so something baked into Airbyte would be excellent.

@jefferson-roth-mayd

> > airbyte_config.connection table, the data is in the catalog
>
> Hi @b4stien, could you please let me know where I can find the objects airbyte_config.connection and catalog?
>
> Edit: I found this was in the Airbyte metadata db.

Can you show us one example of how you're editing it?
Thanks

@adolitos

This feature would help my squad and team to address GDPR, PII and other regulations!

@constah

constah commented May 29, 2023

Very important feature for addressing data protection and for loading only what is needed.

@dro248

dro248 commented Jun 6, 2023

This is extremely important for us.

@jzhang-georgian

It is now available according to this post: https://airbyte.com/blog/airbyte-column-selection-control-over-the-exact-data-to-sync

@malikdiarra
Contributor

Indeed, you can now select which fields you want to replicate when setting up a new connection.

@walker-philips

@malikdiarra are the selected columns available to the source, even though the selection is applied by the infrastructure workers? I have a 900-column table (yes, I know; unfortunately it's a third-party vendor platform) that defeats any approach to reading from it quickly. If I could pass the selected columns through from the UI, I could significantly reduce the amount of data being processed from the database.
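For illustration only, this is roughly what source-side pushdown of a column selection could look like if the selected columns were made available to a database source. The helper names and the table and column names are hypothetical; nothing here reflects how Airbyte's database connectors are actually implemented.

```python
from typing import Iterable


def quote_identifier(name: str) -> str:
    """Quote an identifier in standard SQL style, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_projected_query(table: str, columns: Iterable[str]) -> str:
    """Build a SELECT statement that reads only the given columns from the table."""
    column_list = ", ".join(quote_identifier(column) for column in columns)
    if not column_list:
        raise ValueError("at least one column must be selected")
    return f"SELECT {column_list} FROM {quote_identifier(table)}"


# Hypothetical selection of two columns out of a very wide table.
query = build_projected_query("vendor_table", ["id", "updated_at"])
```

With a query like this, the database returns only the selected columns, so the unselected ones never leave the source.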
