
AppFlow L2 Constructs #483

Closed · 3 of 11 tasks
rpawlaszek opened this issue Feb 16, 2023 · 6 comments
rpawlaszek commented Feb 16, 2023

Description

I've been working on PoC L2 constructs for AppFlow recently and would like to discuss their structure as a proposal for inclusion in the core AWS CDK library. Below I present the current state and the reasoning behind some of the choices made.

Flows

AppFlow is a fully managed integration service for transferring data between SaaS services and AWS. From the technical standpoint it provides its functionality through so-called flows and distinguishes three separate types (so-called flow triggers): OnDemand, Event, and Scheduled. I would like to propose translating these into separate types; the current approach is to have OnDemandFlow, OnEventFlow, and OnScheduleFlow. This type division comes from the fact that some CFN elements are determined by the particular flow (trigger) type chosen. The naming convention On{type}Flow is just to keep things consistent (I know, it's subjective).

The snippet below shows a minimal definition of a flow in the proposed implementation (omitting the source and destination definitions for brevity; they are presented further in the description):

const flow = new OnDemandFlow(stack, 'OnDemandFlow', {
  source: source,
  destination: destination,  
  mappings: [Mapping.mapAll()]
});

A single source and a single destination

The Cfn AWS::AppFlow::Flow resource allows providing a single source and multiple destinations. However, AppFlow as of now works with only one source and only one destination, so the implementation enables setting a single source and a single destination only. This could be changed by renaming the destination property to destinations (and changing the property type to an array), but that would give the wrong impression that multiple destinations are an option.

EventBridge notifications

Each flow publishes events to the default EventBridge bus. The proposed approach follows the one used for CodeCommit events and would be:

  • onStarted
  • onFinished
  • onDeactivated (only for the OnEventFlow and the OnScheduleFlow)
  • onStatus (only for the OnEventFlow)

This way one can consume the notifications as in the example below:

flow.onFinished('OnFinished', {
  target: new targets.SnsTopic(myTopic),
});

Automatic activation/deactivation of OnEventFlow and OnScheduleFlow types

AppFlow doesn't automatically activate OnEventFlow and OnScheduleFlow flows. After creation one needs to call an SDK method to start a particular flow. Additionally, the deletion of a flow will fail if the flow is not deactivated, which requires yet another SDK call. In the proposed implementation both flow types have an autoActivate boolean property that, when set to true, makes sure the flow is activated during deployment and deactivated before deletion, as in the sketch below.
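For illustration, a hedged sketch of how this could look (the surrounding source, destination, and mappings follow the earlier examples; only autoActivate is the property described above):

const flow = new OnEventFlow(stack, 'OnEventFlow', {
  source: source,
  destination: destination,
  mappings: [Mapping.mapAll()],
  // when true, the flow is activated during deployment and deactivated before deletion
  autoActivate: true,
});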

NOTE: currently the implementation uses a custom resource calling the AWS API, which might require shifting to the more general Provider Framework, but current (preliminary) tests show that this simple approach works.

Start/End windows for OnScheduleFlow

The current issue with CloudFormation templates for AppFlow is that the start and end times for the schedule (if provided) are required to be in the future (evaluated at deployment time), which means that re-applying a CloudFormation template can result in an error. This, in turn, means that a Cfn template cannot be immutable, because any update to the flow potentially requires updating the start/end times. With CDK we can use the Provider Framework to make sure the windows are updated for us. For rate-based flows this requires shifting a startTime specified in the past forward by a whole number of cycles, both to make sure the deployment succeeds and to preserve any cadence the user expects (say, a flow triggered at minute 00 of every hour). This means that with the proposed implementation the CDK code is immutable, but the (resolved) deployment definition, in general, is not.
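As a rough sketch of the shifting mechanics for rate-based flows (this illustrates the reasoning only; the helper name is made up and is not the PoC code):

function shiftStartTime(startTime: Date, rateMillis: number, now: Date = new Date()): Date {
  // a startTime that is still in the future can be used as-is
  if (startTime.getTime() >= now.getTime()) {
    return startTime;
  }
  // otherwise move it forward by a whole number of cycles so the original cadence is kept
  const cyclesBehind = Math.ceil((now.getTime() - startTime.getTime()) / rateMillis);
  return new Date(startTime.getTime() + cyclesBehind * rateMillis);
}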

The minimal example below shows how to declare an OnScheduleFlow with a cron schedule.

const flow = new OnScheduleFlow(stack, 'OnScheduleFlow', {
  source: source,
  destination: destination,
  schedule: Schedule.cron({ hour: '1', minute: '1' }),
  pullConfig: { mode: DataPullMode.INCREMENTAL },
  scheduleProperties: {
    startTime: new Date(Date.parse('2022-01-01')),
  },
});

In Cfn the startTime definition above would make the deployment fail. This means that for any fixed date a future deployment might fail and, due to the current approach to handling the date in CloudFormation, the IaC definition would have to be modified. With the proposed implementation the startTime property in the definition above is shifted to the nearest future point in time, so the user will not face a failure with such a definition.

NOTE: this still depends on the deployment time. If we shift the time to the nearest future point in time, but the other parts of the deployment take a long time, the actual execution of the flow deployment might still happen after the re-calculated startTime. An option here could be specifying an artificial offset into the future (say, a hint to add some additional Duration when calculating the future point in time).

Applications

The approach is to put the core functionality in a core folder, and the per-application implementations of connector profiles, sources, and destinations under dedicated application folders (e.g. salesforce, marketo, sapodata, etc.) with a structure like:

appdir/
|-index.ts
|-profile.ts
|-source.ts
|-destination.ts

Connector Profiles

A connector profile is an instance of a connector. There are native and custom connectors. From the perspective of a consumer there should be no difference in how a connector profile is defined in the CDK. Common functionality is placed in the ConnectorProfileBase class, which, due to the restrictions from JSII, does not use generics.

Example of importing a connector profile for a Salesforce integration

const profile = SalesforceConnectorProfile.fromConnectorProfileName(stack, 'TestProfile', 'appflow-tester');

fromConnectorProfileName uses an internal _fromConnectorProfileName method of ConnectorProfileBase, casting the result to the appropriate type.
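A minimal sketch of the pattern (the public and internal method names follow the description above; the exact signature and return type are assumptions):

export class SalesforceConnectorProfile extends ConnectorProfileBase {
  public static fromConnectorProfileName(scope: Construct, id: string, name: string): SalesforceConnectorProfile {
    // the shared base-class helper creates the imported resource; the subclass only narrows the type
    return this._fromConnectorProfileName(scope, id, name) as SalesforceConnectorProfile;
  }
}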

An example of instantiating the connector profile for Salesforce integration.

const profile = new SalesforceConnectorProfile(stack, 'TestProfile', {
  oAuth: {
    accessToken: `${accessToken}`,
    flow: {
      refreshTokenGrant: {
        refreshToken: `${refreshToken}`,
        client: <clientSecret>,
      },
    },
  },
  instanceUrl: `${instanceUrl}`,
  isSandbox: false,
});

Mind that the elements put there can be IResolvable, so a Secrets Manager Secret can be used via the secretValueFromJson method.
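For example, a hedged sketch of feeding the profile from a Secrets Manager Secret (the secret name and JSON key names are assumptions, and the client element from the previous example is omitted for brevity):

import * as secretsmanager from 'aws-cdk-lib/aws-secretsmanager';

// hypothetical secret holding the Salesforce OAuth material as JSON keys
const secret = secretsmanager.Secret.fromSecretNameV2(stack, 'SalesforceSecret', 'appflow/salesforce');

const profile = new SalesforceConnectorProfile(stack, 'TestProfile', {
  oAuth: {
    accessToken: secret.secretValueFromJson('accessToken').toString(),
    flow: {
      refreshTokenGrant: {
        refreshToken: secret.secretValueFromJson('refreshToken').toString(),
      },
    },
  },
  instanceUrl: secret.secretValueFromJson('instanceUrl').toString(),
  isSandbox: false,
});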

The properties contain an oAuth element somewhat influenced by how OAuth flows are defined for the Amazon Cognito UserPool client in the CDK. That is why there is a sub-level element flow and, within it, refreshTokenGrant - the flow used here is the refresh token grant flow.

ConnectorType

Connector profiles use the connectorProfileName/connectorLabel properties to identify the actual connectors. Custom connectors have the ConnectorType property set to CustomConnector, and the actual connector type is passed in ConnectorLabel. This is why this logic is abstracted away into the ConnectorType type. Additionally, connector types are used in Tasks (described in more detail further in the text) as the operation's origin (the term used in this implementation), so ConnectorType handles this as well.
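A rough sketch of the idea behind ConnectorType (the member names here are illustrative assumptions, not the PoC API): native connectors map directly to a connector type, while custom connectors carry the CustomConnector type plus a label.

class ConnectorType {
  // a native connector only needs its type name
  public static readonly SALESFORCE = new ConnectorType('Salesforce', false);

  // a custom connector is always of type CustomConnector, with the real type passed as a label
  public static custom(label: string): ConnectorType {
    return new ConnectorType('CustomConnector', true, label);
  }

  private constructor(
    public readonly asProfileConnectorType: string,
    public readonly isCustom: boolean,
    public readonly asProfileConnectorLabel?: string,
  ) {}
}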

Public/Private modes

Currently the focus is on the public connection mode, as the private mode is not supported by most of the connections and needs more analysis (a more in-depth view can be found, for example, in this blog).

Sources & Destinations

A connector profile can be used for either a source or a destination. In the proposed implementation they are both called vertices and each contains a connectorType property. The reasoning behind it is that tasks, when they specify which operator is to be used, require the (currently: source) connector type. Making it a required element of a vertex removes the need for the user to specify the connector type each time a task is defined, as it will be done on their behalf.

Sources and destinations are not Cfn resources themselves, which is why the approach follows the usage of a bind method.

OnScheduleFlow and incrementalPullConfig

In Cfn the definition of the incrementalPullConfig (which effectively gives the name of the field used for tracking the last pulled timestamp) sits on the SourceFlowConfig property. As this is effectively a responsibility of the OnScheduleFlow, it has been moved to that flow type's properties.
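A hedged sketch of how this could look on the flow (the timestampField name and the pullConfig shape are assumptions for illustration, not the confirmed PoC API):

const flow = new OnScheduleFlow(stack, 'OnScheduleFlow', {
  source: source,
  destination: destination,
  schedule: Schedule.rate(Duration.hours(1)),
  pullConfig: {
    mode: DataPullMode.INCREMENTAL,
    // the source field used for tracking the last pulled timestamp
    timestampField: 'LastModifiedDate',
  },
});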

S3Destination and Glue Catalog

Although in Cfn the Glue catalog configuration is settable at the flow level, it works only when the destination is S3. That is why the implementation shifts the properties to the S3Destination, which in turn requires using Lazy for populating metadataCatalogConfig in the flow.

Tasks

Note: in the current implementation the proposed naming convention is to talk about records, which are tuples in a set, and fields, which are particular elements of a record.

Tasks are steps that can be taken upon fields. Tasks compose into higher-level constructs that in this implementation are named operations. The operations are the founding elements of the proposed implementation. There are four operations identified:

  • Transforms - 1-1 transforms on source fields, like truncation or masking
  • Mappings - 1-1 or many-to-1 operations from source fields to a destination field
  • Filters - operations that limit the source data based on particular conditions
  • Validations - operations that work on a per-record level and can have either a record-level consequence (i.e. dropping the record) or a global one (terminating the flow).

The approach to the usage is heavily influenced by the experience of defining Step Functions Choice states. See the example below:

const flow = new OnDemandFlow(stack, 'OnDemandFlow', {
  source: source,
  destination: destination,
  transforms: [
    Transform.mask({ name: 'Name' }, '*'),
  ],
  mappings: [
    Mapping.map({ name: 'Name', dataType: 'String' }, { name: 'Name', dataType: 'string' }),
  ],
  filters: [
    Filter.when(FilterCondition.timestampLessThanEquals({ name: 'LastModifiedDate', dataType: 'datetime' }, new Date(Date.parse('2022-02-02')))),
  ],
  validations: [
    Validation.when(ValidationCondition.isNull('Name'), ValidationAction.ignoreRecord()),
  ]  
});

The operations are not resources themselves, but parts of the definition of the Cfn AWS::AppFlow::Flow resource; that is why they have a bind method. In order to remove the necessity of specifying the (source) connector type in them, the bind method takes both the IFlow and ISource types, where the latter, having the connectorType property, allows specifying the task operator's origin.

Note on field types: the current implementation doesn't remove the need to set field types via dataType. The describe-connector-entity SDK call could be investigated to remove this, but for now the assumption is that users have to know the record fields and their types in order to use the implementation.

Mappings

When using mappings with the MAP task type it is required to have a Filter task with the PROJECTION operator covering the fields involved. The current implementation automates building this projection: the user defines only the maps, and the proposed implementation takes care of the rest, as the sketch below illustrates.
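For illustration, a single Mapping.map of the Name field might synthesize into roughly the following two task entries in the AWS::AppFlow::Flow resource (a hedged sketch of the shape, not the exact output of the PoC):

// sketch of the Tasks array in the synthesized template
const synthesizedTasks = [
  {
    // projection task generated automatically from the mapped fields
    TaskType: 'Filter',
    ConnectorOperator: { Salesforce: 'PROJECTION' },
    SourceFields: ['Name'],
  },
  {
    // the map task the user actually declared
    TaskType: 'Map',
    ConnectorOperator: { Salesforce: 'NO_OP' },
    SourceFields: ['Name'],
    DestinationField: 'Name',
    TaskProperties: [
      { Key: 'SOURCE_DATA_TYPE', Value: 'String' },
      { Key: 'DESTINATION_DATA_TYPE', Value: 'string' },
    ],
  },
];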

Many-to-one mappings

When creating many-to-one mappings the current AppFlow mechanism creates an intermediate field named ${field1Name},${field2Name}. The current implementation follows that, with the consequence that only one many-to-one definition is allowed. Whether this can be extended will be investigated.

Filters

Similar to Mappings, Filters also require the field to be present in the PROJECTION; that is why they are included in the projection builder in FlowBase.

AppFlow permissions

AppFlowPermissionsManager is a class inspired by the AWS CDK core TokenMap. It utilises Lazy to make sure we generate only one input to the S3 Bucket, KMS Key, and Secrets Manager Secret resource policies no matter how many times we reuse the resource. The reason behind this was to remove the need to manually set the right permissions for AppFlow in the resource policies, and the fact that sometimes a resource is used multiple times in the stack definition. For example, an S3 Bucket can be used for the ErrorHandling definition with a different prefix, so when we pass the bucket to different source/destination definitions and potentially flows, we need a single set of permissions for the service principal.
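A rough illustration of the idea (the class name comes from the description above; the method name, the action list, and the deduplication details are assumptions, and the real implementation relies on Lazy rather than an eager set):

import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';

class AppFlowPermissionsManager {
  // ARNs of buckets that already received an AppFlow statement in their resource policy
  private readonly granted = new Set<string>();

  public grantBucketReadWrite(bucket: s3.IBucket): void {
    if (this.granted.has(bucket.bucketArn)) {
      return; // the bucket is reused, but the policy statement is added only once
    }
    this.granted.add(bucket.bucketArn);
    bucket.addToResourcePolicy(new iam.PolicyStatement({
      principals: [new iam.ServicePrincipal('appflow.amazonaws.com')],
      actions: ['s3:PutObject', 's3:GetBucketAcl', 's3:PutObjectAcl'],
      resources: [bucket.bucketArn, bucket.arnForObjects('*')],
    }));
  }
}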

PoC implementation

https://github.com/rpawlaszek/appflow-cdk

Roles

  • Role: User
  • Proposed by: @rpawlaszek
  • Author(s): @rpawlaszek
  • API Bar Raiser: @iliapolo
  • Stakeholders:

See RFC Process for details

Workflow

  • Tracking issue created (label: status/proposed)
  • API bar raiser assigned (ping us at #aws-cdk-rfcs if needed)
  • Kick off meeting
  • RFC pull request submitted (label: status/review)
  • Community reach out (via Slack and/or Twitter)
  • API signed-off (label api-approved applied to pull request)
  • Final comments period (label: status/final-comments-period)
  • Approved and merged (label: status/approved)
  • Execution plan submitted (label: status/planning)
  • Plan approved and merged (label: status/implementing)
  • Implementation complete (label: status/done)

The author is responsible for progressing the RFC according to this checklist and for applying the relevant labels to this issue so that the RFC table in the README gets updated.


konls commented Feb 27, 2023

Re: "Automatic activation/deactivation of OnEventFlow and OnScheduleFlow types"

There's a feature request to add a CFN template field to create Event & Scheduled flows as "activated". It's relatively simple to implement; we will try to have it done within a month. So hopefully the L2 CDK constructs will be able to use that to keep their code simpler.

iliapolo (Contributor) commented:

Adding some notes here for discussion:

  • What happens if we don't set ScheduleStartTime at all? What will CFN default to?
  • In CFN, the Tasks property is an array, and I imagine the order of tasks in it makes a difference - how is this conveyed in the proposed API of separate transform, mapping, etc...?

rpawlaszek (Author) commented:

What happens if we don't set ScheduleStartTime at all? what will CFN default to?

The ScheduleStartTime and ScheduleEndTime put a window constraint on the running schedule. If we don't specify them, there is no constraint. If they are set, the behavior depends on the ScheduleExpression: if it is a rate-based expression, the flow will run once it's activated and then keep the rate as its period, but when it is a cron-based expression, the flow will run according to the cron definition.

In CFN, the Tasks property is an array, and I imagine the order of tasks in it makes a difference

I haven't experienced that. What matters is the existence of particular items in the array, not the order itself. Could I ask you to confirm that, @konls?


konls commented Apr 20, 2023

What happens if we don't set ScheduleStartTime at all? what will CFN default to?

In the Console, ScheduleStartTime must be specified - otherwise a scheduled flow can't be created. Though it doesn't seem to be necessary - the validation code doesn't have ScheduleStartTime as mandatory. I just verified that a flow with ScheduleStartTime omitted can indeed be created with CFN. The CFN handler itself doesn't alter values - it just translates them to the AppFlow SDK object model.

In CFN, the Tasks property is an array, and I imagine the order of tasks in it makes a difference

Again, AppFlow's CFN handler doesn't care about the Tasks order - it just translates the collection into the SDK's field. AppFlow itself executes tasks in the following order:

  • mathOperationsTasks
  • stringConcatTasks
  • truncationTasks
  • maskingTasks
  • filterTasks
  • projectionFilterTasks
  • mappingTasks

AppFlow doesn't make promises about the task execution order within each task type group, though the original task order is likely to be maintained.

iliapolo (Contributor) commented:

We decided to publish this as a third-party library while we conduct the RFC review. All are welcome to try it out and provide feedback, it will really help us out in the review :).

evgenyka added the l2-request (request for new L2 construct) and bar-raiser/assigned labels on Aug 10, 2023
awsmjs (Contributor) commented Dec 14, 2023

Closing this ticket. We believe the functionality is beneficial, but it does not intersect with the core framework and should be vended and maintained separately.
