
[Proposal] Introduce the notion of Runtime Processor Type #2258

Closed
kevin-bates opened this issue Oct 27, 2021 · 8 comments · Fixed by #2263

@kevin-bates (Member) commented Oct 27, 2021

This issue presents a proposal for introducing the notion of a Runtime Processor Type so that implementations of the same runtime can be more easily distinguished.

Definition:
The following uses the term runtime processor. This term refers to the platform or orchestration tool that drives the execution of a given pipeline. Support for two common runtime processors is embedded in Elyra: Apache Airflow and Kubeflow. Although others exist outside of Elyra, they could be implemented using our BYO (bring your own) model.

Problem:
With the ability to bring your own runtimes and, as of #2241, bring your own component catalog connectors, it is important that we have the ability to specify that a given entity supports a type of runtime processor. Today, Elyra only defines runtime processor names. Although each instance of a PipelineProcessor has a type property, that property value is actually the name of a runtime configuration schema, not a type of runtime processor.

For example, Elyra ships with two runtime schema definitions - airflow and kfp. Runtime configuration instances of these schemas can be created, but each schema equates to a specific implementation of a RuntimePipelineProcessor (or RPP) instance (which is a subclass of the aforementioned PipelineProcessor class). However, if someone wanted to bring their own implementation of RuntimePipelineProcessor that also used Kubeflow to drive the execution of the pipeline, there really isn't a way for that implementation to indicate that it, too, is a Kubeflow-based processor similar to the processor named kfp.

Likewise, Component Catalog Connectors (or CCCs) want the ability to state that the components served from their implementation support a particular processor type, like Kubeflow or Apache Airflow, irrespective of how many RPP implementations are registered.

As a result, we need to formally introduce the notion of runtime processor type.

Proposal:
The first issue with introducing a runtime processor type is its general conflict with the bring-your-own paradigm. How can we possibly enumerate types - which must be known at development time - when we don't know what runtime processor implementations we will have until run time? This can be solved by first defining the set of known processor types, irrespective of whether an implementation of that type exists. This must be clearly conveyed to users: just because a specific processor type is defined does NOT imply that such an implementation is available or that one even exists.

A quick Google search for runtime processor types yields this result. Here we find six task/orchestration tools, including the three we have some knowledge about (Apache Airflow, Kubeflow, and Argo), with the others being Luigi, Prefect, and MLFlow.

We can add some of these as development-time processor types with the idea that anyone wishing to introduce a runtime processor implementation (i.e., BYO RPP) of an undefined (unlisted) type must first open a pull request adding that type to the enumeration.

Implementation (high-level):
We'd like a central location that houses the various processor types. The type names should have string representations that can be referenced in schemas and the UI. Ideally, simple comparisons could be made without regard to case. It should be easy for users to introduce a new type prior to implementing their Pipeline Processor or Catalog Connector. Python's Enum construct seems like a nice fit. In particular, we should use the @unique decorator to ensure values are unique.

Using the types referenced in the google search result, we'd have an enum like the following:

from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    Local = 1
    Kubeflow = 2
    ApacheAirflow = 4
    Argo = 8
    Luigi = 16
    Prefect = 32
    MLFlow = 64

There are a couple of items worth noting.

  1. The integer values correspond to different bits of an integer. I'm not entirely sure this is necessary, but it would allow a composition of processor types to be stored in a single integer, with simple bit manipulation applied to determine membership (see the sketch after this list).
  2. I've added Local as a type - which I think we want to do - even though it's not an official processor type in Elyra. In addition, the nice thing about using bits is that Local having a value of 1 would be the only odd value - which is somewhat fitting 😄, but also unnecessary.
  3. The downside to using bits is that we'd be limited to 31 task/orchestration engines, so it may be unnecessarily limiting, although, practically speaking, probably not an issue.
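For illustration, here is a minimal sketch of that bit-manipulation idea, using the enum above (the supports() helper is an illustrative name, not a proposed API):

# Compose a set of supported processor types into a single integer mask.
supported_mask = RuntimeProcessorType.Kubeflow.value | RuntimeProcessorType.ApacheAirflow.value  # == 6

def supports(mask: int, rpt: RuntimeProcessorType) -> bool:
    """Return True if the given processor type is a member of the mask."""
    return bool(mask & rpt.value)

assert supports(supported_mask, RuntimeProcessorType.Kubeflow)
assert not supports(supported_mask, RuntimeProcessorType.Argo)

# Since Local is the only odd value, it can be filtered out by parity.
non_local = [t for t in RuntimeProcessorType if t.value % 2 == 0]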

We could discuss this further; I don't really have an affinity for the integer values.

Enum classes also have a built-in dictionary that uses the member name as a key and returns the enum instance. This dictionary is __members__, so, using the definition above, RuntimeProcessorType.__members__.get('Kubeflow') will return RuntimeProcessorType.Kubeflow. We will certainly wrap __members__ access into a get_instance() kind of method.

The string-value is accessible via a built-in name property, so RuntimeProcessorType.Kubeflow.name yields 'Kubeflow'. Likewise, there's a built-in value property where RuntimeProcessorType.Kubeflow.value yields 2. As a result, I think using an Enum subclass would give us the flexibility and central location we (and our users) would need.
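As a sketch, that get_instance() wrapper might look like the following; writing it as a module-level function is just for brevity here, and the case-insensitive matching reflects the comparison goal stated above:

def get_instance(name: str) -> RuntimeProcessorType:
    """Look up an enum member by name, without regard to case."""
    for member_name, member in RuntimeProcessorType.__members__.items():
        if member_name.lower() == name.lower():
            return member
    raise KeyError(f"'{name}' is not a valid RuntimeProcessorType")

assert get_instance('kubeflow') is RuntimeProcessorType.Kubeflow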

Schemas of the Runtimes schemaspace will require a "type" property. This property will be a constant because each runtime schema is associated with exactly one processor type. This will require a migration, but that can be easily performed by introducing metadata_class_name values for each of our runtime schemas. When a given instance is loaded, the class implementation will check whether there's a processor_type field and, if not, inject that field, persist the update, then return from the load. (We should introduce a version field at this time as well.) This same migration approach is used in the Catalog Connector PR (#2241).
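A minimal sketch of that migrate-on-load flow follows; the Metadata stand-in base class and the names RuntimesMetadata, on_load, and save are illustrative assumptions, not Elyra's actual API:

class Metadata:  # minimal stand-in for Elyra's metadata instance class
    def __init__(self, schema_name: str, metadata: dict):
        self.schema_name = schema_name
        self.metadata = metadata

    def save(self) -> None:
        pass  # persist the instance to storage

class RuntimesMetadata(Metadata):
    """Hypothetical class referenced via metadata_class_name in the runtime schemas."""

    SCHEMA_TO_TYPE = {'kfp': 'KUBEFLOW_PIPELINES', 'airflow': 'APACHE_AIRFLOW'}

    def on_load(self) -> None:
        # Pre-migration instances lack runtime_type: infer it from the schema
        # name (one-to-one), persist the update, then return from the load.
        if 'runtime_type' not in self.metadata:
            self.metadata['runtime_type'] = self.SCHEMA_TO_TYPE[self.schema_name]
            self.save()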

Users implementing their own Catalog Connectors should explicitly list the processor types they support. We could introduce the notion of a wildcard value (e.g., '*') to indicate any processor type is supported, but, given the potential plethora of task/orchestration engines, I think it would be best to be explicit. When those providers add support for another engine, they simply expose an updated schema whose enum-valued property contains the new reference. Likewise, they may choose to drop support for a given processor type.

There are locations within the server and UI where the processor name is used today. These will need to be updated and replaced with formal type-based names. In addition, we'll want a new endpoint that can be used to retrieve the types for any registered runtime processors. For example, if there are two Kubeflow processor implementations registered (and nothing more), this endpoint would return Kubeflow (corresponding to the name property of RuntimeProcessorType.Kubeflow) and not the names of the two registered processors (i.e., schemas).
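As a rough sketch, the de-duplication that endpoint performs might amount to the following (the registered_processors collection and its type attribute are assumed shapes, not actual Elyra internals):

def registered_runtime_types(registered_processors) -> list:
    """Return the unique processor-type names across all registered processors."""
    # Two registered Kubeflow processors collapse into a single 'Kubeflow' entry.
    return sorted({p.type.name for p in registered_processors})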

Alternative approach:
Rather than introduce a RuntimeProcessorType enumeration, we could introduce type-based base classes that derive from RuntimePipelineProcessor. For example, an ApacheAirflowBase could be inserted between RuntimePipelineProcessor and AirflowPipelineProcessor. This would provide a home for Airflow-specific functionality that is agnostic to the actual implementations. These base classes would then expose a processor_type property that reflects their type. In addition, code could also use isinstance(implementation_instance, ApacheAirflowBase) to determine "type".
The problem with this is that we'd still want to introduce "empty" implementations for future-supported types, even though one may never exist. This seems a little heavyweight.
Another caveat is that the "types" are scattered about and not in a central location like that of an Enum class. As a result, references to multiple types would require imports for the various implementations - which is way too heavyweight.

Front-end/pipeline changes related to this proposal:

  1. The runtime platform icon/tile to display should be predicated on the runtime_type field from the schema within the Runtimes schemaspace. Currently, this determination is based on the name of the schema (kfp, airflow). If the runtime_type specifies KUBEFLOW_PIPELINES the Kubeflow icon is displayed, etc. The schema title should be used as the icon 'name' or hover information. The name may also serve that purpose.
  2. When a tile is selected to create a new pipeline, the pipeline contents should include both a runtime: property (which equates to the schema name, as is the case today) and a runtime_type: (sibling) property which reflects the schema's runtime_type value.
  3. Pipeline files (today) contain the following information:
      "app_data": {
        "ui_data": {
          "comments": []
        },
        "version": 5,
        "runtime": "airflow",
        "properties": {
          "name": "untitled13",
          "runtime": "Apache Airflow"
        }
      },

It is unclear what the second (embedded) runtime value is used for, or why it is located in a sub-object. We should probably reformat this as:

      "app_data": {
        "ui_data": {
          "comments": []
        },
        "version": 5,
        "runtime": "airflow",
        "runtime_type": "APACHE_AIRFLOW",
        "name": "untitled13"
      },

barring reasons for the sub-object. Do other items get placed in the "properties": sub-object?
Also, note the use of the enumerated type name field rather than the displayable value. Wherever items are persisted (like in the schema runtime_type field as well) we want to use the name so that a level of indirection is introduced for obtaining the values. This enables the values to change whenever necessary.
  4. Migration will need to infer the runtime_type value from the schema name - which should be one-to-one.
  5. I think we will want a uihint that can convert the type name to its value. For example, consider this portion of the catalog connector schema...

        "runtime_type": {
        "title": "Runtime Processor Type",
        "description": "The runtime type associated with this Component Catalog",
        "type": "string",
        "enum": ["KUBEFLOW_PIPELINES", "APACHE_AIRFLOW"],
        "uihints": {
          "field_type": "dropdown",
          "category": "Runtime",
          "value-map": {"KUBEFLOW_PIPELINES":"Kubeflow Pipelines", "APACHE_AIRFLOW":"Apache Airflow"}
        }
      },

It would be nice if the editor could look up the up-cased values in a value-map of this kind to use in the dropdown and, similarly, set the up-cased value into the field once selected - i.e., only use the "displayable" value for display. We can get by with just the up-cased values, but it's a better UX if we can display the displayable values.
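In Python terms, the round trip the editor would perform amounts to something like this (a sketch; the function names are illustrative):

value_map = {"KUBEFLOW_PIPELINES": "Kubeflow Pipelines", "APACHE_AIRFLOW": "Apache Airflow"}

def to_display(stored: str) -> str:
    """Map the persisted (up-cased) name to its displayable value."""
    return value_map.get(stored, stored)

def to_stored(displayed: str) -> str:
    """Map the displayed selection back to the persisted (up-cased) name."""
    inverse = {v: k for k, v in value_map.items()}
    return inverse.get(displayed, displayed)

assert to_stored(to_display("APACHE_AIRFLOW")) == "APACHE_AIRFLOW"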

@kevin-bates added the kind:enhancement (New feature or request) label Oct 27, 2021
@kevin-bates (Member Author) commented:

Wondering if Local should even be represented at all. It currently straddles the border of being a real processor type, yet it is registered like the others, so, technically speaking, one could introduce their own local runtime. Yet there is no schema for local runtimes, etc., and I suspect some larger issues would surface when drilled into.

One nice thing about the bit-based value approach is, indeed, that Local is the only odd value and could be easily filtered out or identified from its value property.

@kevin-bates (Member Author) commented:

After discussing in the scrum we decided to move forward with the following changes to what is documented above:

  1. The enum will contain only the runtime types of orchestration tools we have knowledge about relative to Elyra (i.e., ApacheAirflow, Kubeflow, Argo). Development for additional types will require a pull request to introduce that type.
  2. Rather than use integer values, we'll use string values which will be used for display purposes.
  3. I'll begin working on this from the branch that corresponds to Generalize component reading and processing for BYO catalog-types #2241 since it will leverage changes within that PR.

@lresende (Member) commented:

Sorry I missed the discussion in scrum today. Could we revisit this at the community dev meeting tomorrow?

@kevin-bates (Member Author) commented:

@lresende - would you mind posting some concerns or questions so we have an idea of the issue prior to the meeting? (I may be late if my prior meeting runs late.) Thank you.

@lresende (Member) commented:

> Definition:
> The following uses the term runtime processor. This term refers to the platform or orchestration tool that drives the execution of a given pipeline. Support for two common runtime processors is embedded in Elyra: Apache Airflow and Kubeflow. Although others exist outside of Elyra, they could be implemented using our BYO (bring your own) model.

I view the following as the drivers to runtime association:

  • Pipeline should have a "human understandable" value to define a runtime
    • This enables easy troubleshooting and the ability to read/update the pipeline file if necessary
  • The value on the pipeline is used to drive the discovery of which runtime processor to use
  • The processor serves as a facade/factory for other things related to a runtime
    • e.g. what catalog to use, etc

> Problem: With the ability to bring your own runtimes and, as of #2241, bring your own component catalog connectors, it is important that we have the ability to specify that a given entity supports a type of runtime processor. Today, Elyra only defines runtime processor names. Although each instance of a PipelineProcessor has a type property, that property value is actually the name of a runtime configuration schema, not a type of runtime processor.
>
> For example, Elyra ships with two runtime schema definitions - airflow and kfp. Runtime configuration instances of these schemas can be created, but each schema equates to a specific implementation of a RuntimePipelineProcessor (or RPP) instance (which is a subclass of the aforementioned PipelineProcessor class). However, if someone wanted to bring their own implementation of RuntimePipelineProcessor that also used Kubeflow to drive the execution of the pipeline, there really isn't a way for that implementation to indicate that it, too, is a Kubeflow-based processor similar to the processor named kfp.

I believe that most, if not all, deployments will focus on one runtime. In case there are multiple runtime processors that support a given runtime, I would focus on solving the problem via #2136 and only installing the runtime processor that is desired. Note that this would also enable users to continue to use the existing catalogs, etc., as they shouldn't be impacted by a different implementation of a "kfp" runtime processor.

> Likewise, Component Catalog Connectors (or CCCs) want the ability to state that the components served from their implementation support a particular processor type, like Kubeflow or Apache Airflow, irrespective of how many RPP implementations are registered.

Agree on the Component Catalog Connector parts; they are associated with a given runtime and are not necessarily different if there are multiple runtime processor implementations available for a given runtime.

Having said that, I don't think we should support ever having more than one live implementation on a deployment.

> As a result, we need to formally introduce the notion of runtime processor type.

> Proposal: The first issue with introducing a runtime processor type is its general conflict with the bring-your-own paradigm. How can we possibly enumerate types - which must be known at development time - when we don't know what runtime processor implementations we will have until run time? This can be solved by first defining the set of known processor types, irrespective of whether an implementation of that type exists. This must be clearly conveyed to users: just because a specific processor type is defined does NOT imply that such an implementation is available or that one even exists.
>
> A quick Google search for runtime processor types yields this result. Here we find six task/orchestration tools, including the three we have some knowledge about (Apache Airflow, Kubeflow, and Argo), with the others being Luigi, Prefect, and MLFlow.
>
> We can add some of these as development-time processor types with the idea that anyone wishing to introduce a runtime processor implementation (i.e., BYO RPP) of an undefined (unlisted) type must first open a pull request adding that type to the enumeration.
>
> Implementation (high-level): We'd like a central location that houses the various processor types. The type names should have string representations that can be referenced in schemas and the UI. Ideally, simple comparisons could be made without regard to case. It should be easy for users to introduce a new type prior to implementing their Pipeline Processor or Catalog Connector. Python's Enum construct seems like a nice fit. In particular, we should use the @unique decorator to ensure values are unique.
>
> Using the types referenced in the Google search result, we'd have an enum like the following:
>
>     @unique
>     class RuntimeProcessorType(Enum):
>         Local = 1
>         Kubeflow = 2
>         ApacheAirflow = 4
>         Argo = 8
>         Luigi = 16
>         Prefect = 32
>         MLFlow = 64

Opening a pipeline and seeing runtime=2 will not be very user-friendly and will require people to go look at the docs to figure out what runtime it maps to.

I also don't like that people would need to change the code to introduce new runtimes (e.g., the proposed list already misses Tekton, Flyte, etc.). This would be an issue if people have proprietary ones as well.

I also think that our UI is probably not ready to support multiple runtimes of the same type.

@lresende (Member) commented:

After discussion at the community meeting, @kevin-bates explained in more detail the scenario that would require this functionality, and we agreed on it. Thank you @kevin-bates

@kevin-bates (Member Author) commented:

Based on the discussions and the decision that we only include enum entries relative to runtime processors that have been implemented or we know are under development, here's the current incantation of the enum (along with a proposed help-string):

from enum import Enum, unique

@unique
class RuntimeProcessorType(Enum):
    """RuntimeProcessorType enumerates the set of platforms targeted by runtime processors.

    Each runtime processor implementation (subclass of PipelineProcessor) will reflect one
    of these values.  Users implementing their own runtime processor that corresponds to a
    type not listed in this enumeration are responsible for appropriately extending this
    enumeration and reflecting that entry in the corresponding runtime schema in order to
    fully integrate their processor with Elyra.
    """
    LOCAL = 'Local'
    KUBEFLOW_PIPELINES = 'Kubeflow Pipelines'
    APACHE_AIRFLOW = 'Apache Airflow'
    ARGO = 'Argo'
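With display strings as the enum values, persisting by name and displaying by value fall out naturally:

rpt = RuntimeProcessorType.KUBEFLOW_PIPELINES
assert rpt.name == 'KUBEFLOW_PIPELINES'   # the persisted form (schemas, pipeline files)
assert rpt.value == 'Kubeflow Pipelines'  # the form used for display purposes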

@bourdakos1 (Member) commented:

[image: runtime]

Today we discussed persisting runtime type vs. runtime processor information in the pipeline. We decided that eventually it would be best to only persist one or the other, but due to time constraints we will persist everything for now.

We were a bit unsure about which value to persist, but are leaning towards only persisting type. There are a couple of trade-offs and assumptions made depending on which we choose.

persisting type:

  • assumes any pipeline of type APACHE_AIRFLOW can run with any processor (airflow, airflow-no-cos)
    • allows better portability, a user can still run the pipeline with the airflow processor if they don't have airflow-no-cos installed
  • assumes the set of components available is fully dependent on type (APACHE_AIRFLOW)
  • only one tile APACHE_AIRFLOW available when creating pipeline
  • processor (airflow, airflow-no-cos) and config are chosen at submission time
    • note: processor doesn't need to be explicitly chosen at submission (it could just show all three configs available)

persisting processor:

  • assumes a pipeline MUST run with persisted processor (airflow, airflow-no-cos)
  • component list could theoretically change depending on processor
    • airflow and airflow-no-cos could have different components available
  • two tiles (airflow and airflow-no-cos) available when creating pipeline
  • only config is chosen at submission time

akchinSTC pushed a commit that referenced this issue Nov 10, 2021
Introduces a new RuntimeProcessorType enumeration class. The 
enumerated names (i.e., constants) are then referenced (as constants) 
within the Runtimes schemas and their instances (this PR 
addresses the migration of existing instances).

Fixes #2258 
Fixes #2261