Generalize component reading and processing for BYO catalog-types #2241

kiersten-stokes · 2021-10-19T18:50:56Z

Closes #2220. This PR generalizes the ComponentReader class in order to better support additional catalog types for which we may not know the structure.

This PR touches the following elements:

Entrypoints in setup.py
- A new entrypoint group is declared (elyra.component.catalog_types) and contains one name for each type of catalog (see schemas section below)
Component catalog schemas and corresponding ComponentRegistrySchemas class
- The former component-registry schema name has been removed in favor of creating one schema per component catalog type
- The schema_name must match the entrypoint name in order to be found during the call to get_schemas()
- All keys except one in the schema for the built-in catalog types (file, directory, or url) have been preserved from when it existed as a single component-registry schema in order to avoid some larger migration issues
  - This doesn't completely remove the need for some sort of migration, however, as the schema_name field still must change to reflect the new schemas -- more discussion needed
- Added an optional field to the local-file-catalog ~~and local-directory-catalog~~ schema called base_path, which allows a user to enter an optional base directory from which the rest of the paths values may be resolved (otherwise, absolute directories are assumed)
- Added an optional field to the local-directory-catalog type called include_subdirs, which allows a user to indicate whether they want to search subdirectories for additional component specs
  - Discussion point: I'm not sure if there's a boolean control (like a checkbox or something) already available for this purpose on the frontend in the metadata editor or if one will have to be created
- Some frontend considerations rise from these changes:
  - Changed the naming on the frontend from 'Pipeline Components' to 'Component Catalogs'; someone from frontend should probably take a look to make sure I haven't missed anything or screwed something up
  - Discussion point: The addition of certain fields in the schema sometimes may look less-than ideal; see the UI when adding/modifying a local-file-catalog instance (note that Base Path doesn't line up with Paths since the paths array type required some different formatting options)
  - We'd like to remove the duplicate title/display_name information in the schema (title, display_name, and uihints.title)
ComponentRegistry
- The schema_name field now determines the appropriate reader class by comparing with entrypoint keys
- The component_entry SimpleNamespace object includes a few changes:
  - A component_identifier field is added, which is passed on to the Component object after parsing and saved for later access
  - location is removed from the top-level of the component_entry object, as it now optionally resides in the component_metadata value
  - The component_id value is added, the value of which now is determined in and returned from the reader (see below)
ComponentReader classes
- The logic has been moved out of component.py and into it's own file component_reader.py since it got pretty bulky
- The max_readers configurable Integer has been converted to one of potentially/eventually several override-able settings (stored in a configurable dictionary)
- read_component_definitions() still drives the below functionality using threads
  - get_catalog_entries() (formerly get_absolute_locations()) takes the registry/catalog instance metadata and, for each component in that registry, builds a dictionary containing all the relevant data that will be required in order to read a component in the call to read_component_definition(); these dictionaries are returned in a list
  - on return to the read_component_definitions() function, the catalog entry dictionaries are added to the thread queue
    - The id-ing method used currently takes the following form
      - <catalog-type-name>:hash(<hashing_key1>:<hashing_key2>:<...>)
      - The hash portion of the above id only includes the first 12 characters for brevity
      - Each catalog-type class can set the hashing_keys as above for their component instances by implementing get_hash_keys() to return a list of keys available in the component_entry_data object returned from get_catalog_entries() for each component
  - when an entry is pulled from the queue, read_component_definition() is passed the catalog entry data and registry metadata used to return the catalog entry's definition in string form
- ~~Adds some logging functions so that classes can implement custom messages -- this might be overkill and will likely change when we finalize what we are doing with the component_source property~~
ComponentParser classes
- Determination of the component_id is removed from the parser classes and takes place in the reader
- Minor changes to include the component identifier field and other field changes in the constructors
- The AirflowComponentParser is refactored to only include the portion of the operator definition for a component of a single class
Component object class
- Now includes a definition attribute that can be used during processing to access the component definition without re-reading/parsing
- Some simple name changes:
  - To start, the location_type attribute has been changed to catalog_type
  - Discussion point: We may want to consider changing the name of location to something like location_identifier or something that better conveys where/how a particular component can be defined (consider the MLX example where the "location" is something like the component access id instead)
    - component_identifier is another naming option for this attribute
    - FYI, the value of this attribute is what is displayed under the Component Source property in the node properties panel
RuntimePipelineProcessor classes
- These differ a lot from each other, so I'll explain separately
- KFP
  - When a component needs to be loaded, KFP's load_component_from_text() is now used for every type of component, where the definition attribute of the Component object is given as the argument
- Airflow
  - This is a bit of a sticky situation, even outside of the scope of this PR; the issue is that the Airflow DAG template requires us to have the module name in order to insert the appropriate import statement, so we need a way for all catalog types to be able to import the correct modules
    - We will use configuration to handle this; more details to be added later
Migration
With these changes, the existing component-registry schema is replaced by three schemas reflecting the previous location_type property. Any user-maintained instances will need to be adjusted in the following ways (note, factory instances will immediately reflect the new forms):
- Based on the old location_type value, the appropriate schema name is applied
- The location_type property is then removed
- A new version property is added with a value of 1
- If we choose to rename paths to urls and directories for url-catalog and local-directory-catalog instances, respectively, then those properties will be "renamed".
- The user-maintained instances are migrated via the metadata-class hook relative to the old component-registry schema. Since the ComponentRegistry initialization accesses all instances of the ComponentRegistries schemaspace, each instance is essentially "touched".
  - It is anticipated that each migration relative to a given schema will be different. In the case of component-registry instances, they are transitioning to a new schema entirely, dropping and adding a property (as noted previously). In this case, the migration will hook the post_load() method and perform an update of the updated instance prior to its return. As a result, what is returned is an instance corresponding to the new schema that has also been persisted in the metadata storage (filesystem).
  - Since the component-registry schema is still required to be present, support for schema deprecation has been added and a deprecated = true property has been added to the component-registry schema. The behavior is that deprecated schemas can be accessed by name only and their instances retrieved, but deprecated schemas are NOT present when retrieving the set of schemas of a given schemaspace. This way, they will not show up in metadata-driven applications (such as Elyra's frontend) but internal code can still reference them.
  - With these changes, a version property has been added to all component-registries schemaspace schemas. We may want a meta-version property at some point that reflects the version of the schema itself, but, for now, version is meant to reflect the version of instances that must adhere to the schema.
  - We will need to decide when we can safely remove the component-registry schema. If we deliver these changes in 3.3, then perhaps 3.4 could be a removal point. This way, anyone jumping from 3.2 to 3.4 will be asked to go through 3.3 first.

Still Left To Do

Update tests
Update the schema $id fields to be the appropriate path closer to merge time
Support metadata migration
Remove test fields from directory schema
Finalize naming and docstrings, etc. (including abstract methods, "catalog" vs "registry" usage, etc.)

Frontend concerns (not must haves):

Handle pipeline migration concerns for BYO catalog type changes #2262
Add a control for dictionary types in the node properties panel #2256
Add support for all JSON schema property types in metadata editor #2254
Remove display_name property references from schema and frontend logic #2255
~~Formatting for schema fields that appear side-by-side~~ this will be coming in JL 4.0

Docs will be updated in a separate PR; see #2249

How to implement a new catalog type

Add a new schema for your type to the elyra/metadata/schemas directory
- The name of the schema file must match the value of the schema_name field in the JSON file, e.g. url-catalog.json and 'schema_name': 'url-catalog'
- Follow the format of the existing catalog schemas, including the noted required keys
  - "schema_name", "display_name", "metadata" must be present in the schema, as should "runtime", "categories", and "description"
  - add other fields as needed to the "metadata" properties that will make it possible to access the components associated with this catalog type
  - Ensure that the schemaspace_name is component-registries and the schemaspace_id is ae79159a-489d-4656-83a6-1adfbc567c70
  - Implement and register a SchemasProvider associated with the ComponentRegistries schemaspace that makes the schema available via its get_schemas() method.
Implement a custom ComponentReader subclass, including the following functions for...
- get_catalog_entries() -- gets a dictionary of relevant data from each component in the catalog that will be used to access component definitions
- read_catalog_entry() -- reads and returns a component definition (in string form) given the data for that catalog entry/component
- get_hash_keys() -- provides a list of keys available in the 'catalog_entry_data' dictionary the values of which will be used in constructing a unique hash id for each entry with the given catalog type
- (More details for all of these functions can be found in their docstrings)
Add the new schema_name to the entrypoint for elyra.component.catalog_types in setup.py
- The entrypoint name must match the schema_name defined in the schema
- e.g., 'my-type-catalog = elyra.pipeline.component_reader:MyTypeComponentReader'

How was this pull request tested?

This PR will definitely require updates to the tests, but I am holding off on making those until we've finalized the design elements.

Developer's Certificate of Origin 1.1

   By making a contribution to this project, I certify that:

   (a) The contribution was created in whole or in part by me and I
       have the right to submit it under the Apache License 2.0; or

   (b) The contribution is based upon previous work that, to the best
       of my knowledge, is covered under an appropriate open source
       license and I have the right under that license to submit that
       work with modifications, whether created in whole or in part
       by me, under the same open source license (unless I am
       permitted to submit under a different license), as indicated
       in the file; or

   (c) The contribution was provided directly to me by some other
       person who certified (a), (b) or (c) and I have not modified
       it.

   (d) I understand and agree that this project and the contribution
       are public and that a record of the contribution (including all
       personal information I submit with it, including my sign-off) is
       maintained indefinitely and may be redistributed consistent with
       this project or the open source license(s) involved.

elyra-bot · 2021-10-19T18:51:05Z

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link:

ptitzler · 2021-10-20T17:02:05Z

I will provide review comments based on a proof of concept catalog connector implementation for MLX. My initial main evaluation focus:

Does the API provide the artifacts a developer needs to implement a custom catalog connector? (Example: can a connector be configured by the user?)
Does Elyra's catalog processing have a significant impact on the catalog system? (Example: frequent polling)

Follow up:

Is the developer documentation sufficient to accomplish the task of implementing a custom catalog connector without in-person help?

ptitzler · 2021-10-20T18:05:56Z

Schema/UI for directory catalogs

(1) Note: this applies to all catalog types; Runtime is not one of the "common properties" (like name, description, tag) and should probably live in ints own category (There are two now, e.g. "component categories" and "source")
(2) By combining absolute and relative path into one catalog type we are creating unnecessary usability issues.
- For directory catalogs all we need is a directory entry (which must be an absolute path) because the directory value itself makes components in this type of catalog portable:
  - user1 defines dir catalog /users/user1
  - user1 creates pipeline P1; adds component (pipeline file => source:dir_catalog, source_key:path-within-dir-catalog)
  - user2 defines dir catalog /users/user2
  - user2 opens pipeline P1; Edit (no user action required if the local dir catalog contains an entry that matches "source_key"): ~~-> needs to map for the node the source to a compatible catalog entry for the pipeline to be valid and runnable~~

ptitzler · 2021-10-20T18:23:30Z

Addendum for directory catalog type processing:

if a node's component [catalog] source type is directory
- determine whether source_key is present in the palette for those components that are associated with a directory catalog (using the hash part of the op)
- if one matching entry -> no user action required
- if no matching entry -> user must (add/update catalog) to resolve
- if > 1 matching entry -> user must choose one

elyra/metadata/schemas/local-directory-catalog.json

elyra/metadata/schemasproviders.py

elyra/pipeline/airflow/component_parser_airflow.py

elyra/templates/components/canvas_properties_template.jinja2

kevin-bates

This looks good and will be a super useful feature - thank you.

elyra/elyra_app.py

elyra/metadata/schemas/local-directory-catalog.json

elyra/metadata/schemas/url-catalog.json

elyra/pipeline/airflow/component_parser_airflow.py

elyra/pipeline/catalog_connector.py

elyra/tests/pipeline/test_validation.py

ptitzler · 2021-11-02T00:13:08Z

It appears that component_source property in the .pipeline file does not yet include the catalog type information (even though the information is displayed in the components' properties view).

As a result non GUI clients can only display partial information (catalog type information is "lost") :

$ elyra-pipeline describe mlx.pipeline 

────────────────────────────────────────────────────────────────
 Elyra Pipeline details
────────────────────────────────────────────────────────────────

Name: mlx
Description: None
Type: kfp
Version: 5
Nodes: 2
File Dependencies:
    None Listed
Component Dependencies:
    - {'mlx_component_id': 'count-rows'}
    - {'mlx_component_id': 'download-file'}

So does the GUI if there is no matching catalog registered yet:

kevin-bates · 2021-11-02T16:30:29Z

elyra/pipeline/catalog_connector.py

-                           catalog entries in parallel in read_component_definitions(); default 3
-        """
-    ).tag(config=True)
+    max_readers = Integer(3, config=True, allow_none=True,


This looks good. Since we see more containerized installations, I think it would be good to generally provide env variable support for configurable values. This can be in a follow-up, but I think we'd add an ELYRA_MAX_CATALOG_CONNECTOR_READERS (or something named similarly) that provides the "default value" (and this env defaults to 3 if not present). An example of this can be found where we deal with the cache ttl value.

Sounds good! I can open an issue for that

kevin-bates

LGTM - thank you @kiersten-stokes!

…test

kiersten-stokes added 7 commits October 18, 2021 14:50

Promote component-registrsy schema catalog type

8534d61

Update resource locations

28aaa95

Fix schema const fields

445265b

Generalize the reader classes

d6ea8a2

Refine generalization of processing

494d5e7

Remove reader from Component object

e8822c8

Change hashing method and add processing-specific function

e1c3498

kiersten-stokes and others added 9 commits October 19, 2021 13:53

Merge branch 'master' into registry-updates

1bfffd1

Add first-pass support for airflow processor

85e91ef

Change location_type var to catalog_type

27add8b

Change names of things on the frontend

4ed39d1

Add component-registry schema back to schemas folder

e077dfd

Initial migration changes for BYO catalog types

107e373

Merge branch 'registry-migration' into kiersten-registry-updates

66a57bb

Add support for deprecated schemas

dbb1eb6

linting

2a41f26

kiersten-stokes added the status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. label Oct 20, 2021

Add version to component-registry factory instances

81942b9

ptitzler self-requested a review October 20, 2021 15:51

Apply constant values if missing, rename post_load to on_load

2d6c98e

kiersten-stokes marked this pull request as draft October 20, 2021 19:04

ptitzler reviewed Oct 20, 2021

View reviewed changes

elyra/metadata/schemas/local-directory-catalog.json Show resolved Hide resolved

ptitzler reviewed Oct 20, 2021

View reviewed changes

elyra/metadata/schemasproviders.py Outdated Show resolved Hide resolved

kevin-bates and others added 2 commits October 20, 2021 16:29

Auto-discover local schemas for each SchemasProvider

7066aea

Address minor review comments

4059d73

kiersten-stokes added 2 commits October 29, 2021 15:28

Add required numberm field to directory schema for testing

790518f

Remove error-raising in catalog_connector and component_registry

70c060a

kevin-bates mentioned this pull request Oct 31, 2021

Add support Runtime types #2263

Merged

8 tasks

Remove test fields from directory catalog

c81d8dc

kiersten-stokes marked this pull request as ready for review November 1, 2021 15:59

kiersten-stokes added status:Test in Progress Test creation or testing is still in progress. A PR tagged with this label is ready for review. and removed status:Work in Progress Development in progress. A PR tagged with this label is not review ready unless stated otherwise. labels Nov 1, 2021

kiersten-stokes added 2 commits November 1, 2021 12:10

Fix typo in directory catalog

5852363

Update test resources to reflect new id for papermill component

e27c438

kiersten-stokes commented Nov 1, 2021

View reviewed changes

elyra/pipeline/airflow/component_parser_airflow.py Outdated Show resolved Hide resolved

kiersten-stokes commented Nov 1, 2021

View reviewed changes

elyra/templates/components/canvas_properties_template.jinja2 Outdated Show resolved Hide resolved

Address review comments

0cb46ed

kevin-bates reviewed Nov 1, 2021

View reviewed changes

kiersten-stokes added 6 commits November 1, 2021 19:45

Address review comments

54a228b

Include catalog_type in component_source field

bb38ecd

Update integration tests with new component ids

136a8dc

Add metadata version field to jest test

c19e96c

Merge branch 'master' into registry-updates

d4c5416

Update server tests to reflect change in component_source property

3d656f9

akchinSTC linked an issue Nov 2, 2021 that may be closed by this pull request

Add support for all JSON schema property types in metadata editor #2254

Closed

Change max_readers back to a configurable Integer

f67a5d8

kevin-bates reviewed Nov 2, 2021

View reviewed changes

kevin-bates approved these changes Nov 2, 2021

View reviewed changes

kiersten-stokes added 2 commits November 2, 2021 13:00

Fix name of Pipeline Components to Component Catalogs in integration …

975467b

…test

Escape colons in component ids for integration tests

763a286

kiersten-stokes mentioned this pull request Nov 2, 2021

Handle pipeline migration concerns for BYO catalog type changes #2262

Closed

2 tasks

akchinSTC merged commit 5eceb17 into elyra-ai:master Nov 2, 2021

kiersten-stokes mentioned this pull request Nov 4, 2021

Sort palette nodes by component label #2277

Merged

kiersten-stokes deleted the registry-updates branch April 1, 2022 15:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalize component reading and processing for BYO catalog-types #2241

Generalize component reading and processing for BYO catalog-types #2241

kiersten-stokes commented Oct 19, 2021 •

edited

elyra-bot bot commented Oct 19, 2021

ptitzler commented Oct 20, 2021

ptitzler commented Oct 20, 2021 •

edited

ptitzler commented Oct 20, 2021 •

edited

kevin-bates left a comment

ptitzler commented Nov 2, 2021

kevin-bates Nov 2, 2021

kiersten-stokes Nov 2, 2021

kevin-bates left a comment

Generalize component reading and processing for BYO catalog-types #2241

Generalize component reading and processing for BYO catalog-types #2241

Conversation

kiersten-stokes commented Oct 19, 2021 • edited

What changes were proposed in this pull request?

Still Left To Do

How to implement a new catalog type

How was this pull request tested?

elyra-bot bot commented Oct 19, 2021

ptitzler commented Oct 20, 2021

ptitzler commented Oct 20, 2021 • edited

ptitzler commented Oct 20, 2021 • edited

kevin-bates left a comment

Choose a reason for hiding this comment

ptitzler commented Nov 2, 2021

kevin-bates Nov 2, 2021

Choose a reason for hiding this comment

kiersten-stokes Nov 2, 2021

Choose a reason for hiding this comment

kevin-bates left a comment

Choose a reason for hiding this comment

kiersten-stokes commented Oct 19, 2021 •

edited

ptitzler commented Oct 20, 2021 •

edited

ptitzler commented Oct 20, 2021 •

edited