Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Programmatic and Manual pathways for table and column descriptions #147

Closed
samshuster opened this issue Oct 23, 2019 · 35 comments
Closed

Programmatic and Manual pathways for table and column descriptions #147

samshuster opened this issue Oct 23, 2019 · 35 comments
Labels
keep fresh status:review_needed

Comments

@samshuster
Copy link
Contributor

samshuster commented Oct 23, 2019

Overview

As a data engineer, there are quite a few properties that we can extract programmatically that are currently not supported by amundsen as first class properties in the UI. For some of these properties that are likely widely useful, I can understand creating issues to ingest them so that they appear in the panel on the right. However, some of these properties are very company specific and wouldn't be needed by other companies. Therefore, I think that it would be useful to allow users to ingest structured data without needing to make changes to amundsen infrastructure.

What do we do now? And why it won't work in long run?

Currently, we get around this by programmatically updating the table description with prepared markdown, however in the long run we also want users to be able to edit table and column descriptions through amundsen which will put us in a bind. We no longer would able to update programmatically without a lot of added complexity concerning reconciliation and merging of changes.

Proposal

My proposal is that there are two types of descriptions for tables and columns

One would be programmatic and cannot be modified manually.
The other would be the current description.
By default, the programmatic description would not appear on the page unless it is populated.

Note about column level

This is true on the column level as well, where we may want to include company specific attributes, but I can understand why column level programmatic descriptions maybe too much to ask for in terms of cluttering the UI.

@jornh
Copy link
Contributor

jornh commented Oct 23, 2019

On table level

For the User model in databuilder, I noticed it already takes an optional kwargs parameter: https://github.com/lyft/amundsendatabuilder/blame/1.4.10/databuilder/models/user.py#L62 so you can give keynames that can double as field description(s) in the UI and values

I'd imagine something similar could be added to the Table model in https://github.com/lyft/amundsendatabuilder/blob/1.4.10/databuilder/models/table_metadata.py

Note: I haven't tried out the kwargs mechanism though - but the PR amundsen-io/amundsendatabuilder#95 gives some hints in the unit test

On column level

I wonder to what extents Stats already has whats needed? https://github.com/lyft/amundsendatabuilder/blob/1.4.10/databuilder/models/table_stats.py

The metadata model in Architecture.md: https://github.com/lyft/amundsen/blob/master/docs/architecture.md#metadata

If Stats doesn't suit the needs then Column could of course also get a kwargs parameter.

@feng-tao feng-tao added the keep fresh label Oct 24, 2019
@samshuster
Copy link
Contributor Author

samshuster commented Oct 24, 2019

thanks for the detailed thoughts @jornh!

If I understand you correctly, we should be able to use the kwargs of the TableMetadata model (looks like there is already one there with the comment:
:param kwargs: Put additional attributes to the table model if there is any.)
to ingest other programmatic data.

I will take a look at this tomorrow hopefully and see how it works and get back to this issue with my research.
Thanks!

@samshuster
Copy link
Contributor Author

samshuster commented Oct 27, 2019

Ok @jornh I looked into it a bit today and I was able to get table kwarg properties into neo4j as extra properties of a given node. However, nothing showed up in the table ui. Looking through the frontend code, unless I am missing something, I don't see how these extra properties get wired up to the frontend code even for User (I know next to nothing about frontend though so me missing something could be quite likely).

If it is true that the frontend currently does not display the extra kwarg properties, would this story then be frontend only changes, plus possibly additions to sample_data_loader to show how it can take any key value pairs?

UPDATE: now that I have taken a look at column_stats, I definitely think this takes care of what we need on column level.

I wonder if having a table_stats analogous to table_column_stats would be a better approach even over the kwargs?

Thanks for your help and sorry for any dumb questions!

@javamonkey79
Copy link
Contributor

javamonkey79 commented Dec 7, 2019

@feng-tao @markgrover @jinhyukchang we could really use your insights on this one. idk if it would be best to bring this up at the next community meeting, or, best to solicit this on slack, or, just submit a PR.

Overall, I think this could be adopted without too much disruption. What it would take:

  • on the backend: a new table field (lets call it "noneditable_description" for simplicities sake)
  • on the front end: a new div section above (or perhaps below) the existing description section populated with noneditable_description data; I'm not a front end engineer, so, I'll submit that there could be a better way.

To reiterate, why do we want to do this? Consider these images:

image

image

First, you will notice, that all of this markdown data is quite useful, but, it was not created by user entry in amundsen. Instead if was machine generated upstream. Some of this information, such as description, documentation, dashboards we would want editable. Other parts, such as "Describe Extended Table Info" should not be editable, since it is metadata (similar to column info).

Please lmk what you think. We'd really like to do this work, and we think others in the community could benefit, but, we don't want to do the work in such a way as to be unsatisfactory (e.g. if you feel like it is unwanted\unneeded).

jakemongaya pushed a commit to jakemongaya/amundsen that referenced this issue Dec 11, 2019
Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)
@feng-tao
Copy link
Member

feng-tao commented Dec 14, 2019

@javamonkey79 not sure if everyone sees the discussion in the issue, would you mind posting above into the slack channel so that everyone could take a look? Thanks.

@javamonkey79
Copy link
Contributor

javamonkey79 commented Dec 16, 2019

hi @feng-tao yes, I'll solicit feedback there

@javamonkey79
Copy link
Contributor

javamonkey79 commented Dec 16, 2019

There was some discussion in slack about this today. The general consensus is that, instead of just 2 description sections, we should allow for N description sections and whether or not the section will be editable or not would be set as a property of that section. This solution would be quite versatile, and should fit everyones needs. However, it comes with a bit of cost, namely complexity.

Here is one example, of how this might look (courtesy of Matt Spiel):
Screen Shot 2019-12-16 at 2 14 39 PM
Screen Shot 2019-12-16 at 2 18 35 PM

And here are some possible concerns Tamika has raised as far as implementation:

How to change the metadata model to support multiple “descriptions”, and each piece would also have a concept title, content, and whether or not it is editable.
How that data would be ingested.
The implications for search and if there is a requirement for these new sections to be searchable as well.
Whether or not frontend apis have to change given the outcomes of the first and third items.

Further discussion should happen here.

@javamonkey79
Copy link
Contributor

javamonkey79 commented Dec 20, 2019

@ttannis I've been thinking about this some more, and based on the backing neo4j structure, it would appear to be possible to assign multiple description nodes for each table. The only issue with this approach, is that we would need some ordering id assigned to each description. Further, we would also likely want to set a new flag on each description node "editable" which could then be propagated to the front end. wdyt? @feng-tao is there another more suitable approach?

@ttannis
Copy link
Contributor

ttannis commented Dec 20, 2019

That sounds good on my end. The descriptions would be rendered according to the ordering id -- with what we currently know as description being the first one.They would be rendered in order on the table detail page, and support markdown if they are editable.

My current thoughts are:

  1. Our APIs between metadata and frontend for getting and editing descriptions will have to change, so I'd be interested to talk about that next if we all support this approach.
  2. Also interested to talk about what this approach would mean for the search service having to consider the content of all descriptions when searching.

@javamonkey79
Copy link
Contributor

javamonkey79 commented Dec 20, 2019

@ttannis

Our APIs between metadata and frontend for getting and editing descriptions will have to change, so I'd be interested to talk about that next if we all support this approach.

What needs to be talked about? I think it would need to be updated per the new logic, but, that's about it. Did I miss something?

Also interested to talk about what this approach would mean for the search service having to consider the content of all descriptions when searching.

We'd just need to update the logic that populates elasticsearch to look at all description nodes, instead of the one. Or, is there more to it?

@markgrover
Copy link
Contributor

markgrover commented Dec 21, 2019

Thanks @javamonkey79 for this, and @ttannis for all your help. Would love to see this problem to be fixed, let me know if there's something that I can help with.

@samshuster
Copy link
Contributor Author

samshuster commented Jan 24, 2020

Just writing here that I am going to take a stab at this to at least get some more clarity on what could be done. I imagine that my thoughts below are not going to be the final solution because I am sure I am missing important details.

Given this, here are my thoughts on more specific implementation details:

I am going to start first with the data loader changes to populate more description objects in neo4j. I am going to make it backwards compatible.
EDIT: after trying to use Node = Description for all Descriptions, I felt that this was too disruptive to the exsting cipher queries in the metadata service. Because of this, I have currently opted to only use Description nodes for the original Description and have come up with a new Node Type for this new type of metadata. Currently it is called "Programmatic_Description", but a better name would be great.

At first, I thought it would be simpler to only support just one is_editable text box, but thinking through it more, I am only saying this because I don't know enough about frontend. I could see how it would be great to have different boxes that are editable to help guide users to curate specific things. We could even lock down security on different boxes based on roles. Only users who are owners can edit table_description box, but all users can edit another box for example.

    DESCRIPTION_NODE_LABEL = 'Description'
    DESCRIPTION_KEY_FORMAT = '{description}'
    DESCRIPTION_ORDER = 'description_order'

    def __init__(self,
                 source, # type: str
                 text,  # type: Union[None, str]
                 order, # type: int
                 is_editable # type: boolean
                 ):
        """
        :param source: The unique source of what is populating this description
        :param text: the description text. Markdown supported.
        :param order: the display order. Lower numbers mean it will be displayed first. Ties are handled alphabetically by source.
        """

Two options here:

  1. Is that TableMetadata and ColumnMetadata object can take a list of "programmatic_descriptions"
  2. TableMetadata and ColumnMetadata will now both take optional "description_source" parameter which is either None or a string that is where the description comes from.

I actually think approach 2 is better, because that way we can more easily decouple the different "sources" of programmatic descriptions and re-use the existing description parameter.

Will then construct the neo4j nodes accordingly using the "source" as part of the description node id.

After above is done, I will modify Metadata Service to have a new method to fetch all of the programmatic description nodes and concat them together based on order as a first attempt. It is possible that it would be better to return all separately for more frontend flexibility.
NOTE: one difficulty I found here is that this will require a modification to amundsen_common as the Table object itself will need to be modified.

Then the frontend would need to be modified to make these requests and display them.
NOTE: I am not a frontend person, but I wonder if it would be possible to have source based pluggable css or custom javascript with custom frontend builds?

As a first pass for frontend, I will just concatenate all of the programmatic descriptions and display them in a single box.

@javamonkey79 @ttannis

@samshuster
Copy link
Contributor Author

samshuster commented Jan 27, 2020

A couple of other thoughts that came up since working on it on Friday.

I think we should get rid of description_order as part of data loader as that seems like we are coupling presentation logic to the data. Instead, I think the frontend should be configurable to display the different descriptions based on the "description_source" (where is this description metadata coming from? i.e. s3_scraper, extended table details etc.)

@samshuster
Copy link
Contributor Author

samshuster commented Jan 28, 2020

Hey @javamonkey79 @ttannis, as promised I made some headway on this and was able to get a very early stages prototype together here:

Screen Shot 2020-01-27 at 5 39 56 PM

This contrived example shows 2 different sources of metadata (s3 status and quality report) that now populate dedicated boxes in the left pane.

The data is being populated from rows like this:

database,cluster,schema_name,name,description,tags,description_source,description_editable,description_order
hive,gold,test_schema,test_table1,"**Size**: 50T\n\n**Monthly Cost**: $5000","expensive","s3_crawler",false,1
dynamo,gold,test_schema,test_table2,"**Size**: 1T\n\n**Monthly Cost**: $500",,"s3_crawler",false,1
hive,gold,test_schema,test_table1,"### Quality Report:\n --- \n Ipsus enom. Ipsus enom ipsus lorenum.\n ---\n[![Build Status](https://api.travis-ci.com/lyft/amundsendatabuilder.svg?branch=master)](https://travis-ci.com/lyft/amundsendatabuilder)","low_quality","quality_service",false,1

I have not tested out columns yet.

Notice that tags can also come from these client specific metadata sources.

A big thing that needs to be still decided is a name for this new concept. I think "programmatic descriptions" is not a good name. Other possible names that have been suggested are:

  • Custom_Metadata
  • Metadata_Properties
  • Generic_Metadata

Please loop in anybody else that should be part of this conversation. cc @markgrover

@ttannis
Copy link
Contributor

ttannis commented Jan 28, 2020

@samshuster do you have any branches that you can share with this, it's alright if things aren't clean or perfect.

@samshuster
Copy link
Contributor Author

samshuster commented Jan 28, 2020

sure thing @ttannis I will get up some WIP branches for all of the modules that had to be touched... basically all of them! I imagine that there will be some major rework that is asked by you guys, so no worries there.

NOTE: I didn't bother with search at this point, I figure that can be a separate story down the line to index these new datapoints. But happy to try and do it if you feel differently.

@samshuster
Copy link
Contributor Author

samshuster commented Jan 28, 2020

ok @ttannis you can check out the above WIP MRs. Let me know if you have any questions!

@jinhyukchang
Copy link
Contributor

jinhyukchang commented Jan 28, 2020

It seems it first started with un-editable description, but now it became generic metadata. I understand the value of being flexible, but on the other side, I am afraid we are opening the door such that most of new metadata could just go through this approach, not being captured as a new use case model. (And also, we would need control on which metadata should be indexed in search service or not which adds more complexity).

In other words, if I understand it correctly, most of current metadata (and future metadata) can be represented by this flexible model, and I wonder if that is better path as it would easily make new metadata model hide in each companies private repo in databuilder configuration. I think there's trade-off between being flexible on new metadata introduction vs we may see less contribution on new metadata model. May be I am too worried. WDYT?

cc: @markgrover @feng-tao @ttannis

@samshuster
Copy link
Contributor Author

samshuster commented Jan 28, 2020

hey @jinhyukchang I guess this is an aged old question flexibility vs explicit. You raised some great points!

Here are some of my thoughts for why I believe there will always be a use case for having a flexible option like this approach, but I could be wrong.

  1. This is the main one. There are certain attributes that are too company specific to ever make sense to be contributed back to open source. For example, if this issue is closed without being eventually accepted, I would either need to customize our build, or put out separate PRs for new attributes that may never be used by other companies.

  2. The amount of effort to add a new piece of metadata to the data model is currently quite large. If a company is prototyping a new piece of metadata, it might not be worth the effort.

  3. I highly doubt that these flexible options would ever look as nice (currently only markdown! although I suppose that could be changed in the future with some customization on frontend as I mentioned earlier) or be as powerful as an explicit addition to the data model. Given this I think there will still be a drive to add new content explicitly.

@jinhyukchang
Copy link
Contributor

jinhyukchang commented Jan 28, 2020

Fair points, @samshuster ! Let's not make it too powerful :)

@samshuster
Copy link
Contributor Author

samshuster commented Jan 30, 2020

If there are no major concerns with the approach (happy to make changes if there are!) I am going to revisit where I left off on this a few days ago and focus on:

  • adding tests where appropriate
  • adding frontend non-editable section. (NOTE: for now I think I will skip on making these editable as that is considerably more work and might not even be needed).
  • cleaning up and refactoring where applicable

@samshuster
Copy link
Contributor Author

samshuster commented Feb 3, 2020

hey @ttannis @jinhyukchang checking in here again with a couple more updates.

  • I think I am going to skip implementing the editable properties of these descriptions on frontend as it is considerably more work with not much benefit in my opinion. Let me know if I should persevere, but given that we don't want to make these too powerful, it seems not necessary. I did however currently keep the properties in the data_loader, metadata and api but can get rid of these if it is desired.
  • I think column stats can pretty much serve this same purpose for columns, so I removed this new functionality for columns. It would be for tables only.

With those above 2 changes, I think this is almost ready to go! I should have some complete MRs by tomorrow for you guys to take a look at.
The order of how they should be looked at and eventually merged if accepted should be:

  • data_builder
  • amundsen-common
  • metadata
  • frontend

Given this, I moved databuilder MR out of WIP but anticipate that there will still be some major changes left: https://github.com/lyft/amundsendatabuilder/pull/187/files

Thanks for your feedback and help!

@jinhyukchang
Copy link
Contributor

jinhyukchang commented Feb 4, 2020

Thanks @samshuster !
Let us know when it's ready to review.

On the editable properties, I prefer to remove it as it won't be an active code. Let me know what you think.

@samshuster
Copy link
Contributor Author

samshuster commented Feb 4, 2020

sounds good @jinhyukchang that makes sense, i will remove the is_editable property from the data model and will tag you tomorrow when they are all ready to be reviewed

@samshuster
Copy link
Contributor Author

samshuster commented Feb 4, 2020

ok @jinhyukchang @ttannis https://github.com/lyft/amundsendatabuilder/pull/187/files this change is ready for you guys to review.

NOTE: the other MR's are also ready for you to review too! but the builds won't pass until amundsen-common is pushed

@samshuster
Copy link
Contributor Author

samshuster commented Feb 12, 2020

ok @jinhyukchang amundsen-common and metadata are ready for review (sorry to keep tagging you!)

amundsen-io/amundsencommon#18

below will build once above is merged in
amundsen-io/amundsenmetadatalibrary#104

@samshuster
Copy link
Contributor Author

samshuster commented Mar 2, 2020

hey @ttannis just checking in to see when we can start revisiting this MR
amundsen-io/amundsenfrontendlibrary#381

@ttannis
Copy link
Contributor

ttannis commented Mar 2, 2020

hey @ttannis just checking in to see when we can start revisiting this MR
lyft/amundsenfrontendlibrary#381

Last Thursday one of our designers took a look at this concept and is currently working on some suggestions. They will either respond directly on your PR, or I'll relay the information once it's ready.

@samshuster
Copy link
Contributor Author

samshuster commented Mar 6, 2020

hey @ttannis @feng-tao would you guys be open to having my plain design get pushed through as a starting point? It would only show up on amundsen if people ingest programmatic data on the backend, so it would be low risk. We could consider it "beta".

That could give you guys more time to rework the design.

Just an idea, understood if that isn't going to work!

@samshuster
Copy link
Contributor Author

samshuster commented Mar 19, 2020

i believe this issue can be closed now. Thanks everybody!

@jornh
Copy link
Contributor

jornh commented Mar 19, 2020

👏👏👏 thanks for pulling through on this feature Sam.

I’m sure it will benefit a lot of other orgs using Amundsen.

@javamonkey79
Copy link
Contributor

javamonkey79 commented Mar 19, 2020

I agree, nice work on a Herculean effort @samshuster 💪 💯 💥

@markgrover
Copy link
Contributor

markgrover commented Mar 19, 2020

@ttannis
Copy link
Contributor

ttannis commented Mar 20, 2020

i believe this issue can be closed now. Thanks everybody!

Thanks Sam, I'll be closing this issue after creating and publishing the new release for frontend, and updating the quick starts.

@ttannis
Copy link
Contributor

ttannis commented Apr 3, 2020

@ttannis ttannis closed this as completed Apr 3, 2020
@feng-tao feng-tao added the status:review_needed label Apr 15, 2020
benrifkind pushed a commit to benrifkind/amundsen that referenced this issue Dec 16, 2020
* 🎉 Initial commit.

* ♻️ Refactoring code.

* 👌 Updating code due to code review changes.

* 👌 Updating code due to code review changes. (tests)
dorianj pushed a commit to dorianj/amundsen that referenced this issue Apr 25, 2021
Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)
dorianj pushed a commit to dorianj/amundsen that referenced this issue Apr 25, 2021
* 🎉 Initial commit.

* ♻️ Refactoring code.

* 👌 Updating code due to code review changes.

* 👌 Updating code due to code review changes. (tests)
feng-tao pushed a commit that referenced this issue May 7, 2021
Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)
feng-tao pushed a commit that referenced this issue May 7, 2021
* 🎉 Initial commit.

* ♻️ Refactoring code.

* 👌 Updating code due to code review changes.

* 👌 Updating code due to code review changes. (tests)
zacr pushed a commit to SaltIO/amundsen that referenced this issue May 13, 2022
* 🎉 Initial commit.

* ♻️ Refactoring code.

* 👌 Updating code due to code review changes.

* 👌 Updating code due to code review changes. (tests)
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this issue Jun 30, 2022
Will scroll past GitHub repo folders (specially more handy on small screens like phones/tablets)
hansadriaans pushed a commit to DataChefHQ/amundsen that referenced this issue Jun 30, 2022
* 🎉 Initial commit.

* ♻️ Refactoring code.

* 👌 Updating code due to code review changes.

* 👌 Updating code due to code review changes. (tests)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
keep fresh status:review_needed
Projects
None yet
Development

No branches or pull requests

7 participants