Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize component reading and processing for BYO catalog-types #2241

Merged
merged 78 commits into from Nov 2, 2021

Conversation

kiersten-stokes
Copy link
Member

@kiersten-stokes kiersten-stokes commented Oct 19, 2021

Closes #2220. This PR generalizes the ComponentReader class in order to better support additional catalog types for which we may not know the structure.

Related:

What changes were proposed in this pull request?

This PR touches the following elements:

  • Entrypoints in setup.py
    • A new entrypoint group is declared (elyra.component.catalog_types) and contains one name for each type of catalog (see schemas section below)
  • Component catalog schemas and corresponding ComponentRegistrySchemas class
    • The former component-registry schema name has been removed in favor of creating one schema per component catalog type
    • The schema_name must match the entrypoint name in order to be found during the call to get_schemas()
    • All keys except one in the schema for the built-in catalog types (file, directory, or url) have been preserved from when it existed as a single component-registry schema in order to avoid some larger migration issues
      • This doesn't completely remove the need for some sort of migration, however, as the schema_name field still must change to reflect the new schemas -- more discussion needed
    • Added an optional field to the local-file-catalog and local-directory-catalog schema called base_path, which allows a user to enter an optional base directory from which the rest of the paths values may be resolved (otherwise, absolute directories are assumed)
    • Added an optional field to the local-directory-catalog type called include_subdirs, which allows a user to indicate whether they want to search subdirectories for additional component specs
      • Discussion point: I'm not sure if there's a boolean control (like a checkbox or something) already available for this purpose on the frontend in the metadata editor or if one will have to be created
    • Some frontend considerations rise from these changes:
      • Changed the naming on the frontend from 'Pipeline Components' to 'Component Catalogs'; someone from frontend should probably take a look to make sure I haven't missed anything or screwed something up
      • Discussion point: The addition of certain fields in the schema sometimes may look less-than ideal; see the UI when adding/modifying a local-file-catalog instance (note that Base Path doesn't line up with Paths since the paths array type required some different formatting options)
      • We'd like to remove the duplicate title/display_name information in the schema (title, display_name, and uihints.title)
  • ComponentRegistry
    • The schema_name field now determines the appropriate reader class by comparing with entrypoint keys
    • The component_entry SimpleNamespace object includes a few changes:
      • A component_identifier field is added, which is passed on to the Component object after parsing and saved for later access
      • location is removed from the top-level of the component_entry object, as it now optionally resides in the component_metadata value
      • The component_id value is added, the value of which now is determined in and returned from the reader (see below)
  • ComponentReader classes
    • The logic has been moved out of component.py and into it's own file component_reader.py since it got pretty bulky
    • The max_readers configurable Integer has been converted to one of potentially/eventually several override-able settings (stored in a configurable dictionary)
    • read_component_definitions() still drives the below functionality using threads
      • get_catalog_entries() (formerly get_absolute_locations()) takes the registry/catalog instance metadata and, for each component in that registry, builds a dictionary containing all the relevant data that will be required in order to read a component in the call to read_component_definition(); these dictionaries are returned in a list
      • on return to the read_component_definitions() function, the catalog entry dictionaries are added to the thread queue
        • The id-ing method used currently takes the following form
          • <catalog-type-name>:hash(<hashing_key1>:<hashing_key2>:<...>)
          • The hash portion of the above id only includes the first 12 characters for brevity
          • Each catalog-type class can set the hashing_keys as above for their component instances by implementing get_hash_keys() to return a list of keys available in the component_entry_data object returned from get_catalog_entries() for each component
      • when an entry is pulled from the queue, read_component_definition() is passed the catalog entry data and registry metadata used to return the catalog entry's definition in string form
    • Adds some logging functions so that classes can implement custom messages -- this might be overkill and will likely change when we finalize what we are doing with the component_source property
  • ComponentParser classes
    • Determination of the component_id is removed from the parser classes and takes place in the reader
    • Minor changes to include the component identifier field and other field changes in the constructors
    • The AirflowComponentParser is refactored to only include the portion of the operator definition for a component of a single class
  • Component object class
    • Now includes a definition attribute that can be used during processing to access the component definition without re-reading/parsing
    • Some simple name changes:
      • To start, the location_type attribute has been changed to catalog_type
      • Discussion point: We may want to consider changing the name of location to something like location_identifier or something that better conveys where/how a particular component can be defined (consider the MLX example where the "location" is something like the component access id instead)
        • component_identifier is another naming option for this attribute
        • FYI, the value of this attribute is what is displayed under the Component Source property in the node properties panel
  • RuntimePipelineProcessor classes
    • These differ a lot from each other, so I'll explain separately
    • KFP
      • When a component needs to be loaded, KFP's load_component_from_text() is now used for every type of component, where the definition attribute of the Component object is given as the argument
    • Airflow
      • This is a bit of a sticky situation, even outside of the scope of this PR; the issue is that the Airflow DAG template requires us to have the module name in order to insert the appropriate import statement, so we need a way for all catalog types to be able to import the correct modules
        • We will use configuration to handle this; more details to be added later
  • Migration
    With these changes, the existing component-registry schema is replaced by three schemas reflecting the previous location_type property. Any user-maintained instances will need to be adjusted in the following ways (note, factory instances will immediately reflect the new forms):
    • Based on the old location_type value, the appropriate schema name is applied
    • The location_type property is then removed
    • A new version property is added with a value of 1
    • If we choose to rename paths to urls and directories for url-catalog and local-directory-catalog instances, respectively, then those properties will be "renamed".
    • The user-maintained instances are migrated via the metadata-class hook relative to the old component-registry schema. Since the ComponentRegistry initialization accesses all instances of the ComponentRegistries schemaspace, each instance is essentially "touched".
      • It is anticipated that each migration relative to a given schema will be different. In the case of component-registry instances, they are transitioning to a new schema entirely, dropping and adding a property (as noted previously). In this case, the migration will hook the post_load() method and perform an update of the updated instance prior to its return. As a result, what is returned is an instance corresponding to the new schema that has also been persisted in the metadata storage (filesystem).
      • Since the component-registry schema is still required to be present, support for schema deprecation has been added and a deprecated = true property has been added to the component-registry schema. The behavior is that deprecated schemas can be accessed by name only and their instances retrieved, but deprecated schemas are NOT present when retrieving the set of schemas of a given schemaspace. This way, they will not show up in metadata-driven applications (such as Elyra's frontend) but internal code can still reference them.
      • With these changes, a version property has been added to all component-registries schemaspace schemas. We may want a meta-version property at some point that reflects the version of the schema itself, but, for now, version is meant to reflect the version of instances that must adhere to the schema.
      • We will need to decide when we can safely remove the component-registry schema. If we deliver these changes in 3.3, then perhaps 3.4 could be a removal point. This way, anyone jumping from 3.2 to 3.4 will be asked to go through 3.3 first.

Still Left To Do

  • Update tests
  • Update the schema $id fields to be the appropriate path closer to merge time
  • Support metadata migration
  • Remove test fields from directory schema
  • Finalize naming and docstrings, etc. (including abstract methods, "catalog" vs "registry" usage, etc.)

Frontend concerns (not must haves):

Docs will be updated in a separate PR; see #2249

How to implement a new catalog type

  • Add a new schema for your type to the elyra/metadata/schemas directory
    • The name of the schema file must match the value of the schema_name field in the JSON file, e.g. url-catalog.json and 'schema_name': 'url-catalog'
    • Follow the format of the existing catalog schemas, including the noted required keys
      • "schema_name", "display_name", "metadata" must be present in the schema, as should "runtime", "categories", and "description"
      • add other fields as needed to the "metadata" properties that will make it possible to access the components associated with this catalog type
      • Ensure that the schemaspace_name is component-registries and the schemaspace_id is ae79159a-489d-4656-83a6-1adfbc567c70
      • Implement and register a SchemasProvider associated with the ComponentRegistries schemaspace that makes the schema available via its get_schemas() method.
  • Implement a custom ComponentReader subclass, including the following functions for...
    • get_catalog_entries() -- gets a dictionary of relevant data from each component in the catalog that will be used to access component definitions
    • read_catalog_entry() -- reads and returns a component definition (in string form) given the data for that catalog entry/component
    • get_hash_keys() -- provides a list of keys available in the 'catalog_entry_data' dictionary the values of which will be used in constructing a unique hash id for each entry with the given catalog type
    • (More details for all of these functions can be found in their docstrings)
  • Add the new schema_name to the entrypoint for elyra.component.catalog_types in setup.py
    • The entrypoint name must match the schema_name defined in the schema
    • e.g., 'my-type-catalog = elyra.pipeline.component_reader:MyTypeComponentReader'

How was this pull request tested?

This PR will definitely require updates to the tests, but I am holding off on making those until we've finalized the design elements.

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

@elyra-bot
Copy link

elyra-bot bot commented Oct 19, 2021

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@kiersten-stokes kiersten-stokes added the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Oct 20, 2021
@ptitzler ptitzler self-requested a review October 20, 2021 15:51
@ptitzler
Copy link
Member

I will provide review comments based on a proof of concept catalog connector implementation for MLX. My initial main evaluation focus:

  • Does the API provide the artifacts a developer needs to implement a custom catalog connector? (Example: can a connector be configured by the user?)
  • Does Elyra's catalog processing have a significant impact on the catalog system? (Example: frequent polling)

Follow up:

  • Is the developer documentation sufficient to accomplish the task of implementing a custom catalog connector without in-person help?

@ptitzler
Copy link
Member

ptitzler commented Oct 20, 2021

Schema/UI for directory catalogs

image

  • (1) Note: this applies to all catalog types; Runtime is not one of the "common properties" (like name, description, tag) and should probably live in ints own category (There are two now, e.g. "component categories" and "source")

  • (2) By combining absolute and relative path into one catalog type we are creating unnecessary usability issues.

    • For directory catalogs all we need is a directory entry (which must be an absolute path) because the directory value itself makes components in this type of catalog portable:
      • user1 defines dir catalog /users/user1
      • user1 creates pipeline P1; adds component (pipeline file => source:dir_catalog, source_key:path-within-dir-catalog)
      • user2 defines dir catalog /users/user2
      • user2 opens pipeline P1; Edit (no user action required if the local dir catalog contains an entry that matches "source_key"): -> needs to map for the node the source to a compatible catalog entry for the pipeline to be valid and runnable

@ptitzler
Copy link
Member

ptitzler commented Oct 20, 2021

Addendum for directory catalog type processing:

  • if a node's component [catalog] source type is directory
    • determine whether source_key is present in the palette for those components that are associated with a directory catalog (using the hash part of the op)
    • if one matching entry -> no user action required
    • if no matching entry -> user must (add/update catalog) to resolve
    • if > 1 matching entry -> user must choose one

@kiersten-stokes kiersten-stokes marked this pull request as draft October 20, 2021 19:04
@kiersten-stokes kiersten-stokes marked this pull request as draft October 20, 2021 19:04
@kevin-bates kevin-bates mentioned this pull request Oct 31, 2021
8 tasks
@kiersten-stokes kiersten-stokes marked this pull request as ready for review November 1, 2021 15:59
@kiersten-stokes kiersten-stokes added status:Test in Progress Test creation or testing is still in progress. A PR tagged with this label is ready for review. and removed status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. labels Nov 1, 2021
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and will be a super useful feature - thank you.

elyra/elyra_app.py Outdated Show resolved Hide resolved
elyra/metadata/schemas/url-catalog.json Show resolved Hide resolved
elyra/pipeline/airflow/component_parser_airflow.py Outdated Show resolved Hide resolved
elyra/pipeline/catalog_connector.py Outdated Show resolved Hide resolved
elyra/pipeline/catalog_connector.py Show resolved Hide resolved
elyra/pipeline/catalog_connector.py Outdated Show resolved Hide resolved
elyra/pipeline/catalog_connector.py Show resolved Hide resolved
elyra/pipeline/catalog_connector.py Show resolved Hide resolved
elyra/tests/pipeline/test_validation.py Outdated Show resolved Hide resolved
@ptitzler
Copy link
Member

ptitzler commented Nov 2, 2021

It appears that component_source property in the .pipeline file does not yet include the catalog type information (even though the information is displayed in the components' properties view).

image

As a result non GUI clients can only display partial information (catalog type information is "lost") :

$ elyra-pipeline describe mlx.pipeline 

────────────────────────────────────────────────────────────────
 Elyra Pipeline details
────────────────────────────────────────────────────────────────

Name: mlx
Description: None
Type: kfp
Version: 5
Nodes: 2
File Dependencies:
    None Listed
Component Dependencies:
    - {'mlx_component_id': 'count-rows'}
    - {'mlx_component_id': 'download-file'}

So does the GUI if there is no matching catalog registered yet:

image

@akchinSTC akchinSTC linked an issue Nov 2, 2021 that may be closed by this pull request
catalog entries in parallel in read_component_definitions(); default 3
"""
).tag(config=True)
max_readers = Integer(3, config=True, allow_none=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Since we see more containerized installations, I think it would be good to generally provide env variable support for configurable values. This can be in a follow-up, but I think we'd add an ELYRA_MAX_CATALOG_CONNECTOR_READERS (or something named similarly) that provides the "default value" (and this env defaults to 3 if not present). An example of this can be found where we deal with the cache ttl value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I can open an issue for that

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thank you @kiersten-stokes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:Test in Progress Test creation or testing is still in progress. A PR tagged with this label is ready for review.
Projects
None yet
5 participants