diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..d1d0ac8 --- /dev/null +++ b/.gitignore @@ -0,0 +1,4 @@ +# Default ignored files +/.idea/workspace.xml +/.idea +/.idea/* diff --git a/README.md b/README.md index b0e26be..f3a3f94 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ # Data Product Specification This repository wants to define an open specification to define data products with the following principles in mind: -- Data Product as an indipendent unit of deployment -- Technology indipendence +- Data Product as an independent unit of deployment +- Technology independence - Extensibility With an open specification it will be possible to create services for automatic deployment and interoperable components to build a Data Mesh platform. @@ -15,10 +15,10 @@ The Data Product is composed by a general section with Data Product level inform * **Output Ports**: representing all the different interfaces of the Data Product to expose the data to consumers * **Workloads**: internal jobs/processes to feed the Data Product and to perform housekeeping (GDPR, regulation, audit, data quality, etc) * **Storage Areas**: internal data storages where the Data Product is deployed, not exposed to consumers -* **Observability**: provides transparency to the data conusmer about how the Data Product is currently working. This is not declarative, but exposing runtime data. +* **Observability**: provides transparency to the data consumer about how the Data Product is currently working. This is not declarative, but exposing runtime data. -Each Data Product component trait (output ports, workloads, observabilities, etc) will have a well defined and fixed structure and a "specific" one to handle technology specific stuff. -The fixed structure must be technology agnostic. 
The first fields of teh fixed structure are more technical and linked to how the platform will handle them, while the last fields (specific excluded) are to be treated as pure metadata that will simplify the management and consumption. +Each Data Product component trait (output ports, workloads, observabilities, etc.) will have a well-defined and fixed structure and a "specific" one to handle technology-specific stuff. +The fixed structure must be technology-agnostic. The first fields of the fixed structure are more technical and linked to how the platform will handle them, while the last fields (specific excluded) are to be treated as pure metadata that will simplify the management and consumption.
+* `Domain: [String]*` the identifier of the domain this Data Product belongs to. +* `Version: [String]*` this represents the version of the Data Product. Displayed as `X.Y.Z` where X is the major version, Y is the minor version, and Z is the patch. Major version (X) is also shown in the Data Product ID and those fields (version and ID) must always be aligned with one another. We consider a Data Product as an independent unit of deployment, so if a breaking change is needed, we create a brand-new version of it by changing the major version. If we introduce a new feature (or patch) we will not create a new major version, but we can just change Y (new feature) or Z (patch), thus not creating a new ID (and hence not creating a new Data Product). Constraints: * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component IDs. * `Environment: [String]*`: logical environment where the Data Product will be deployed. * `DataProductOwner: [String]` Data Product owner, the unique identifier of the actual user that owns, manages, and receives notifications about the Data Product. To make it technology independent it is usually the email address of the owner. -* `DataProductOwnerDisplayName [String]`: the human readable version of `DataProductOwner`. +* `DataProductOwnerDisplayName [String]`: the human-readable version of `DataProductOwner`. * `Email: [Option[String]]` point of contact between consumers and maintainers of the Data Product. It could be the owner or a distribution list, but must be reliable and responsive. * `OwnerGroup [String]`: LDAP user/group that is owning the data product. * `DevGroup [String]`: LDAP user/group that is in charge to develop and maintain the data product. @@ -45,7 +45,7 @@ The fixed structure must be technology agnostic. 
The first fields of teh fixed s * `Maturity: [Option[String]]` this is an enum to let the consumer understand if it is a tactical solution or not. It is really useful during migration from Data Warehouse or Data Lake. Allowed values are: `[Tactical|Strategic]`. * `Billing: [Option[Yaml]]` this is a free form key-value area where is possible to put information useful for resource tagging and billing. * `Tags: [Array[Yaml]]` Tag labels at DP level ( please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel). -* `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific execution environment. It can also refer to an additional file. At this level we also embed all the information to provision the general infrastructure (resource groups, networking, etc) needed for a specific Data Product. For example if a company decides to create a ResourceGroup for each data product and have a subscription reference for each domain and environment, it will be specified at this level. Also it is reccommended to put general security here, Azure Policy or IAM policies, VPC/Vnet, Subnet. This will be filled merging data defined at common level with values defined specifically for the selected environment. +* `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific execution environment. It can also refer to an additional file. At this level we also embed all the information to provision the general infrastructure (resource groups, networking, etc.) needed for a specific Data Product. For example if a company decides to create a ResourceGroup for each data product and have a subscription reference for each domain and environment, it will be specified at this level. Also, it is recommended to put general security here, Azure Policy or IAM policies, VPC/Vnet, Subnet. 
This will be filled by merging data defined at common level with values defined specifically for the selected environment. The **unique identifier** of a Data Product is the concatenation of Domain, Name and Version. So we will refer to the `DP_UK` as a URN which ends in the following way: `$DPDomain:$DPName:$DPMajorVersion`. @@ -57,42 +57,44 @@ The **unique identifier** of a Data Product is the concatenation of Domain, Name * allowed characters are `[a-zA-Z0-9]` and `[_-]`. * the ID is a URN of the form `urn:dmb:cmp:$DPDomain:$DPName:$DPMajorVersion:$OutputPortName`. * `Name: [String]*` the name of the Output Port. This name is used also for display purposes, so it can contain all kind of characters. When used inside the Output Port ID all special characters are replaced with standard ones and spaces are replaced with dashes. -* `FullyQualifiedName: [Option[String]]` human-readable name that describes better the Output Port. It can also contain specific details (if this is a table this field could contain also indications regarding the databse and the schema). +* `FullyQualifiedName: [Option[String]]` human-readable name that better describes the Output Port. It can also contain specific details (if this is a table this field could also contain indications regarding the database and the schema). * `Description: [String]` detailed explanation about the function and the meaning of the output port. * `Kind: [String]*` type of the entity. Since this is an Output Port the only allowed value is `outputport`. -* `Version: [String]*` specific version of the output port. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields( version and ID) are always aligned with one another. Please note that the major version of the component *must always* corresponde to the major version of the Data Product it belongs to. 
+* `Version: [String]*` specific version of the output port. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor version and Z is the patch. Major version (X) is also shown in the component ID and those fields (version and ID) are always aligned with one another. Please note that the major version of the component *must always* correspond to the major version of the Data Product it belongs to. Constraints: - * Major version of the Data Product is always the same as the major version of all of its components and it is the same version that is shown in both Data Product ID and component ID. + * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component ID. * `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates. * `UseCaseTemplateId: [Option[String]]*` the id of the template used in the builder to create the component. Could be empty in case the component was not created from a builder template. * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components. Constraints: * This array will only contain IDs of other components of the same Data Product. -* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field but it is useful to understand better the platform where the component will be running. +* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. 
It is a free field, but it is useful to understand better the platform where the component will be running. * `Technology: [Option[String]]` represents which technology is used to define the output port, like: Athena, Impala, Dremio, etc. The underlying technology is useful for the consumer to understand better how to consume the output port. -* `OutputPortType: [String]` the kind of output port: Files, SQL, Events, etc. This should be extendible with other values, like GraphQL or others. +* `OutputPortType: [String]` the kind of output port: Files, SQL, Events, etc. This should be extensible with other values, like GraphQL or others. * `CreationDate: [Optional[String]]` when this output port has been created. -* `StartDate: [Optional[String]]` the first business date present in the dataset, leave it empty for events or we can use some standard semantic like: "-7D, -1Y". +* `StartDate: [Optional[String]]` the first business date present in the dataset, leave it empty for events, or we can use some standard semantic like: "-7D, -1Y". * `ProcessDescription: [Option[String]]` what is the underlying process that contributes to generate the data exposed by this output port. * `DataContract: [Yaml]`: In case something is going to change in this section, it represents a breaking change because the producer is breaking the contract, this will require to create a new version of the data product to keep backward compatibility - * `Schema: [Array[Yaml]]` when it comes to describe a schema we propose to leverage OpenMetadata specification: Ref https://docs.open-metadata.org/metadata-standard/schemas/entities/table#column. Each column can have a tag array and you can choose between simples LabelTags, ClassificationTags or DescriptiveTags. Here an example of classification Tag https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json. 
+ * `Schema: [Array[Yaml]]` when it comes to describing a schema we propose to leverage the OpenMetadata specification: Ref https://docs.open-metadata.org/metadata-standard/schemas/entities/table#column. Each column can have a tag array, and you can choose between simple LabelTags, ClassificationTags or DescriptiveTags. Here is an example of a classification Tag https://github.com/open-metadata/OpenMetadata/blob/main/catalog-rest-service/src/main/resources/json/data/tags/piiTags.json. * `SLA: [Yaml]` Service Level Agreement, describe the quality of data delivery and the output port in general. It represents the producer's overall promise to the consumers. * `IntervalOfChange: [Option[String]]` how often changes in the data are reflected. - * `Timeliness: [Option[String]]` the skew between the time that a business fact occuts and when it becomes visibile in the data. + * `Timeliness: [Option[String]]` the skew between the time that a business fact occurs and when it becomes visible in the data. * `UpTime: [Option[String]]` the percentage of port availability. * `TermsAndConditions: [Option[String]]` If the data is usable only in specific environments. * `Endpoint: [Option[URL]]` this is the API endpoint that self-describe the output port and provide insightful information at runtime about the physical location of the data, the protocol must be used, etc. -* `DataSharingAgreement: [Yaml]` This part is covering usage, privacy, purpose, limitations and is indipendent by the data contract. 
+ * `biTempBusinessTs: [Option[String]]` name of the field representing the business timestamp, as per the "bi-temporality" definition; it should match a field in the related `Schema` + * `biTempWriteTs: [Option[String]]` name of the field representing the technical (write) timestamp, as per the "bi-temporality" definition; it should match a field in the related `Schema` +* `DataSharingAgreement: [Yaml]` This part covers usage, privacy, purpose, limitations and is independent of the data contract. * `Purpose: [Option[String]]` what is the goal of this data set. * `Billing: [Option[String]]` how a consumer will be charged back when it consumes this output port. - * `Security: [Option[String]]` additional information related to security aspects, like restrictions, maskings, sensibile information and privacy. + * `Security: [Option[String]]` additional information related to security aspects, like restrictions, masking, sensitive information and privacy. * `IntendedUsage: [Option[String]]` any other information needed by the consumer in order to effectively consume the data, it could be related to technical stuff (e.g. extract no more than one year of data for good performances ) or to business domains (e.g. this data is only useful in the marketing domains). * `Limitations: [Option[String]]` If any limitation is present it must be made super clear to the consumers. * `LifeCycle: [Option[String]]` Describe how the data will be historicized and how and when it will be deleted. * `Confidentiality: [Option[String]]` Describe what a consumer should do to keep the information confidential, how to process and store it. Permission to share or report it. * `Tags: [Array[Yaml]]` Tag labels at OutputPort level, here we can have security classification for example (please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel). 
* `SampleData: [Option[Yaml]]` provides a sample data of your Output Port (please refer to OpenMetadata specification: https://docs.open-metadata.org/metadata-standard/schemas/entities/table#tabledata). -* `SemanticLinking: [Option[Yaml]]` here we can express semantic relationships between this output port and other outputports (also coming from other domains and data products). For example we could say that column "customerId" of our SQL Output Port references the column "id" of the SQL Output Port of the "Customer" Data Product. +* `SemanticLinking: [Option[Yaml]]` here we can express semantic relationships between this output port and other outputports (also coming from other domains and data products). For example, we could say that column "customerId" of our SQL Output Port references the column "id" of the SQL Output Port of the "Customer" Data Product. * `Specific: [Yaml]` this is a custom section where we must put all the information strictly related to a specific technology or dependent from a standard/policy defined in the federated governance. @@ -104,22 +106,22 @@ Constraints: * the ID is a URN of the form `urn:dmb:cmp:$DPDomain:$DPName:$DPMajorVersion:$WorkloadName`. * `Name: [String]*` the name of the Workload. This name is used also for display purposes, so it can contain all kind of characters. When used inside the Workload ID all special characters are replaced with standard ones and spaces are replaced with dashes. * `FullyQualifiedName: [Optional[String]]` human-readable name that describes better the Workload. -* `Description: [String]` detailed explaination about the purpose of the workload, what sources is reading, what business logic is applying, etc. +* `Description: [String]` detailed explanation about the purpose of the workload, what sources it reads, what business logic it applies, etc. * `Kind: [String]*` type of the entity. Since this is a Workload the only allowed value is `workload`. 
-* `Version: [String]*` specific version of the workload. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor feature and Z is the patch. Major version (X) is also shown in the component ID and those fields( version and ID) are always aligned with one another. Please note that the major version of the component *must always* corresponde to the major version of the Data Product it belongs to. +* `Version: [String]*` specific version of the workload. Displayed as `X.Y.Z` where X is the major version of the Data Product, Y is the minor version and Z is the patch. Major version (X) is also shown in the component ID and those fields (version and ID) are always aligned with one another. Please note that the major version of the component *must always* correspond to the major version of the Data Product it belongs to. Constraints: - * Major version of the Data Product is always the same as the major version of all of its components and it is the same version that is shown in both Data Product ID and component ID. + * Major version of the Data Product is always the same as the major version of all of its components, and it is the same version that is shown in both Data Product ID and component ID. * `InfrastructureTemplateId: [String]*` the id of the microservice responsible for provisioning the component. A microservice may be capable of provisioning several components generated from different use case templates. * `UseCaseTemplateId: [Option[String]]*` the id of the template used in the builder to create the component. Could be empty in case the component was not created from a builder template. * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components. 
Constraints: * This array will only contain IDs of other components of the same Data Product. -* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field but it is useful to understand better the platform where the component will be running. +* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field, but it is useful to understand better the platform where the component will be running. * `Technology: [Option[String]]` represents which technology is used to define the workload, like: Spark, Flink, pySpark, etc. The underlying technology is useful to understand better how the workload process data. * `WorkloadType: [Option[String]]` explains what type of workload is: Ingestion ETL, Streaming, Internal Process, etc. * `ConnectionType: [Option[String]]` an enum with allowed values: `[HouseKeeping|DataPipeline]`; `Housekeeping` is for all the workloads that are acting on internal data without any external dependency. `DataPipeline` instead is for workloads that are reading from outputport of other DP or external systems. * `Tags: [Array[Yaml]]` Tag labels at Workload level ( please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel). -* `ReadsFrom: [Array[String]]` This is filled only for `DataPipeline` workloads and it represents the list of Output Ports or external systems that the workload uses as input. Output Ports are identified with `DP_UK:$OutputPortName`, while external systems will be defined by a URN in the form `urn:dmb:ex:$SystemName`. This filed can be elaborated more in the future and create a more semantic struct. +* `ReadsFrom: [Array[String]]` This is filled only for `DataPipeline` workloads, and it represents the list of Output Ports or external systems that the workload uses as input. 
Output Ports are identified with `DP_UK:$OutputPortName`, while external systems will be defined by a URN in the form `urn:dmb:ex:$SystemName`. This field can be elaborated further in the future to create a more semantic structure. Constraints: * This array will only contain Output Port IDs and/or external systems identifiers. * `Specific: [Yaml]` this is a custom section where we can put all the information strictly related to a specific technology or dependent from a standard/policy defined in the federated governance. @@ -141,7 +143,7 @@ Constraints: * `DependsOn: [Array[String]]*` A component could depend on other components belonging to the same Data Product, for example a SQL Output port could be dependent on a Raw Output Port because it is just an external table. This is also used to define the provisioning order among components. Constraints: * This array will only contain IDs of other components of the same Data Product. -* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field but it is useful to understand better the platform where the component will be running. +* `Platform: [Option[String]]` represents the vendor: Azure, GCP, AWS, CDP on AWS, etc. It is a free field, but it is useful to understand better the platform where the component will be running. * `Technology: [Option[String]]` represents which technology is used to define the storage area, like: S3, Kafka, Athena, etc. The underlying technology is useful to understand better how the data is internally stored. * `StorageType: [Option[String]]` the specific type of storage: Files, SQL, Events, etc. * `Tags: [Array[Yaml]]` Tag labels at Storage area level ( please refer to OpenMetadata https://docs.open-metadata.org/metadata-standard/schemas/types/taglabel). 
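The `DependsOn` arrays described above are what drives provisioning order. As a hedged illustration (the function and the shortened component IDs below are ours, not part of the specification), a provisioning service could derive a valid deployment order with a topological sort:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+


def provisioning_order(components):
    """Return component IDs ordered so that every dependsOn entry
    is provisioned before the components that depend on it."""
    graph = {c["id"]: set(c.get("dependsOn", [])) for c in components}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical components with shortened IDs (real IDs are full URNs).
components = [
    {"id": "sql-port", "dependsOn": ["raw-port"]},  # external table over raw data
    {"id": "raw-port", "dependsOn": ["storage"]},
    {"id": "storage", "dependsOn": []},
]

print(provisioning_order(components))  # → ['storage', 'raw-port', 'sql-port']
```

Since the constraint says the array only contains IDs of components in the same Data Product, the sort never has to reach outside the descriptor.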
@@ -159,7 +161,7 @@ Anyway is good to formalize what kind of information should be included and veri * `Description: [String]` detailed explanation about what this observability is exposing * `Endpoint: [URL]` this is the API endpoint that will expose the observability for each OutputPort * `Completeness: [Yaml]` degree of availability of all the necessary information along the entire history -* `DataProfiling: [Yaml]` volume, distribution of volume along time, range of values, column values distribution and other statistics. Please refer to OpenMetadata to get our default implementation https://docs.open-metadata.org/openmetadata/schemas/entities/table#tableprofile. Keep in mind that this is the kind of standard that a company need to set based on its needs. +* `DataProfiling: [Yaml]` volume, distribution of volume over time, range of values, column values distribution and other statistics. Please refer to OpenMetadata to get our default implementation https://docs.open-metadata.org/openmetadata/schemas/entities/table#tableprofile. Keep in mind that this is the kind of standard that a company needs to set based on its needs. * `Freshness: [Yaml]` * `Availability: [Yaml]` * `DataQuality: [Yaml]` describe data quality rules will be applied to the data, using the format you prefer. @@ -177,7 +179,7 @@ In general the version should be used to notify users of the changes between the - a change in the patch version means that there are no significant changes, but just bug fixes or small corrections (e.g. an improvement in the field description, a typo that was fixed, an improvement in the validation files) CUE offers also a [standard way](https://cuelang.org/docs/usecases/datadef/#validating-backwards-compatibility) to check if new versions of a schema are backwards-compatible with older versions. -It is highly reccommended to check for schema compatibilities when multiple and/or complex changes are introduced. 
+It is highly recommended to check for schema compatibility when multiple and/or complex changes are introduced. In the following sections we will list all the extensions and modifications of this specification and the impact they have on the overall contract: @@ -206,14 +208,14 @@ These changes are the ones that should not be performed since they impact compat Any change of this kind should **always** increase the major version number. Discouraged customizations are: - change in the name or type of existing fields. This kind of change breaks compatibility with previous versions, and should be performed by keeping in mind that they will impact for sure all the logics based on those fields. -- moving fields as sub-fields of other sections (e.g. moving the "workload type" field as a sub-field of a new "type" field). This is actually a specific case of the one above, and should be treated accordingly. -- deletion of existing fields. This is generally somethign that will impact a lot modules that are leveraging the specification, and you must think very carefully before doing deletions. Think that you can always make a field optional, and this choice will impact the specification way less. +- moving fields as subfields of other sections (e.g. moving the "workload type" field as a subfield of a new "type" field). This is actually a specific case of the one above, and should be treated accordingly. +- deletion of existing fields. This is generally something that will impact a lot of modules that are leveraging the specification, and you must think very carefully before doing deletions. Keep in mind that you can always make a field optional instead, a choice that impacts the specification far less. **N.B.: all the changes described above are allowed only if they do not affect reserved fields which are treated in the Forbidden customization.** ### Forbidden -These changes are not allowed and you **must** always check that your changes do not fall in this category. 
+These changes are not allowed, and you **must** always check that your changes do not fall in this category. Forbidden changes are the ones that affect reserved fields, that should not change in any way (name, type, structure, etc). All the reserved fields are highlighted with a `*` character in the specification above, like `ID: [String]*` and `Name: [String]*`. Since these are the fields that are usually leveraged by downstream platform modules, any change could break the agreed contract between them. diff --git a/data-product-specification.cue b/data-product-specification.cue index 2d745d3..c370f79 100644 --- a/data-product-specification.cue +++ b/data-product-specification.cue @@ -13,163 +13,165 @@ import "strings" #OM_Constraint: string & =~"(?i)^(NULL|NOT_NULL|UNIQUE|PRIMARY_KEY)$" #OM_TableData: { - columns: [... string] - rows: [... [...]] + columns: [... string] + rows: [... [...]] } #OM_Tag: { - tagFQN: string - description?: string | null - source: string & =~"(?i)^(Tag|Glossary)$" - labelType: string & =~"(?i)^(Manual|Propagated|Automated|Derived)$" - state: string & =~"(?i)^(Suggested|Confirmed)$" - href?: string | null + tagFQN: string + description?: string | null + source: string & =~"(?i)^(Tag|Glossary)$" + labelType: string & =~"(?i)^(Manual|Propagated|Automated|Derived)$" + state: string & =~"(?i)^(Suggested|Confirmed)$" + href?: string | null } #OM_Column: { - name: string - dataType: #OM_DataType - if dataType =~ "(?i)^(ARRAY)$" { - arrayDataType: #OM_DataType - } - if dataType =~ "(?i)^(CHAR|VARCHAR|BINARY|VARBINARY)$" { - dataLength: number - } - dataTypeDisplay?: string | null - description?: string | null - fullyQualifiedName?: string | null - tags?: [... #OM_Tag] - constraint?: #OM_Constraint | null - ordinalPosition?: number | null - if dataType =~ "(?i)^(JSON)$" { - jsonSchema: string - } - if dataType =~ "(?i)^(MAP|STRUCT|UNION)$" { - children: [... 
#OM_Column] - } + name: string + dataType: #OM_DataType + if dataType =~ "(?i)^(ARRAY)$" { + arrayDataType: #OM_DataType + } + if dataType =~ "(?i)^(CHAR|VARCHAR|BINARY|VARBINARY)$" { + dataLength: number + } + dataTypeDisplay?: string | null + description?: string | null + fullyQualifiedName?: string | null + tags?: [... #OM_Tag] + constraint?: #OM_Constraint | null + ordinalPosition?: number | null + if dataType =~ "(?i)^(JSON)$" { + jsonSchema: string + } + if dataType =~ "(?i)^(MAP|STRUCT|UNION)$" { + children: [... #OM_Column] + } } #DataContract: { - schema: [... #OM_Column] - SLA: { - intervalOfChange?: string | null - timeliness?: string | null - upTime?: string | null - ... - } - termsAndConditions?: string | null - endpoint?: #URL | null - ... + schema: [... #OM_Column] + SLA: { + intervalOfChange?: string | null + timeliness?: string | null + upTime?: string | null + ... + } + termsAndConditions?: string | null + endpoint?: #URL | null + biTempBusinessTs?: string | null + biTempWriteTs?: string | null + ... } #DataSharingAgreement: { - purpose?: string | null - billing?: string | null - security?: string | null - intendedUsage?: string | null - limitations?: string | null - lifeCycle?: string | null - confidentiality?: string | null - ... + purpose?: string | null + billing?: string | null + security?: string | null + intendedUsage?: string | null + limitations?: string | null + lifeCycle?: string | null + confidentiality?: string | null + ... 
} #OutputPort: { - id: #ComponentId - name: string - fullyQualifiedName?: string | null - description: string - version: #Version & =~"^\(majorVersion)+\\..+$" - infrastructureTemplateId: string - useCaseTemplateId?: string | null - dependsOn: [...#ComponentId] - platform?: string | null - technology?: string | null - outputPortType: string - creationDate?: string | null - startDate?: string | null - processDescription?: string | null - dataContract: #DataContract - dataSharingAgreement: #DataSharingAgreement - tags: [... #OM_Tag] - sampleData?: #OM_TableData | null - semanticLinking?: {...} | null - specific: {...} - ... + id: #ComponentId + name: string + fullyQualifiedName?: string | null + description: string + version: #Version & =~"^\(majorVersion)+\\..+$" + infrastructureTemplateId: string + useCaseTemplateId?: string | null + dependsOn: [...#ComponentId] + platform?: string | null + technology?: string | null + outputPortType: string + creationDate?: string | null + startDate?: string | null + processDescription?: string | null + dataContract: #DataContract + dataSharingAgreement: #DataSharingAgreement + tags: [... #OM_Tag] + sampleData?: #OM_TableData | null + semanticLinking?: {...} | null + specific: {...} + ... } #Workload: { - id: #ComponentId - name: string - fullyQualifiedName?: string | null - description: string - version: #Version & =~"^\(majorVersion)+\\..+$" - infrastructureTemplateId: string - useCaseTemplateId?: string | null - dependsOn: [...#ComponentId] - platform?: string | null - technology?: string | null - workloadType?: string | null - connectionType?: string & =~"(?i)^(housekeeping|datapipeline)$" | null - tags: [... #OM_Tag] - readsFrom: [... string] - specific: {...} | null - ... 
+    id: #ComponentId
+    name: string
+    fullyQualifiedName?: string | null
+    description: string
+    version: #Version & =~"^\(majorVersion)+\\..+$"
+    infrastructureTemplateId: string
+    useCaseTemplateId?: string | null
+    dependsOn: [...#ComponentId]
+    platform?: string | null
+    technology?: string | null
+    workloadType?: string | null
+    connectionType?: string & =~"(?i)^(housekeeping|datapipeline)$" | null
+    tags: [... #OM_Tag]
+    readsFrom: [... string]
+    specific: {...} | null
+    ...
 }
 #Storage: {
-    id: #ComponentId
-    name: string
-    fullyQualifiedName?: string | null
-    description: string
-    version: #Version & =~"^\(majorVersion)+\\..+$"
-    owners: [...string]
-    infrastructureTemplateId: string
-    useCaseTemplateId?: string | null
-    dependsOn: [...#ComponentId]
-    platform?: string | null
-    technology?: string | null
-    storageType?: string | null
-    tags: [... #OM_Tag]
-    specific: {...} | null
-    ...
+    id: #ComponentId
+    name: string
+    fullyQualifiedName?: string | null
+    description: string
+    version: #Version & =~"^\(majorVersion)+\\..+$"
+    owners: [...string]
+    infrastructureTemplateId: string
+    useCaseTemplateId?: string | null
+    dependsOn: [...#ComponentId]
+    platform?: string | null
+    technology?: string | null
+    storageType?: string | null
+    tags: [... #OM_Tag]
+    specific: {...} | null
+    ...
 }
 #Observability: {
-    id: #ComponentId
-    name: string
-    fullyQualifiedName: string
-    description: string
-    version: #Version & =~"^\(majorVersion)+\\..+$"
-    infrastructureTemplateId: string
-    useCaseTemplateId?: string | null
-    dependsOn: [...#ComponentId]
-    endpoint: #URL
-    completeness: {...} | null
-    dataProfiling: {...} | null
-    freshness: {...} | null
-    availability: {...} | null
-    dataQuality: {...} | null
-    specific: {...} | null
-    ...
+    id: #ComponentId
+    name: string
+    fullyQualifiedName: string
+    description: string
+    version: #Version & =~"^\(majorVersion)+\\..+$"
+    infrastructureTemplateId: string
+    useCaseTemplateId?: string | null
+    dependsOn: [...#ComponentId]
+    endpoint: #URL
+    completeness: {...} | null
+    dataProfiling: {...} | null
+    freshness: {...} | null
+    availability: {...} | null
+    dataQuality: {...} | null
+    specific: {...} | null
+    ...
 }
 #Component: {
-    kind: string & =~"(?i)^(outputport|workload|storage|observability)$"
-    if kind != _|_ {
-        if kind =~ "(?i)^(outputport)$" {
-            #OutputPort
-        }
-        if kind =~ "(?i)^(workload)$" {
-            #Workload
-        }
-        if kind =~ "(?i)^(storage)$" {
-            #Storage
-        }
-        if kind =~ "(?i)^(observability)$" {
-            #Observability
-        }
-    }
-    ...
+    kind: string & =~"(?i)^(outputport|workload|storage|observability)$"
+    if kind != _|_ {
+        if kind =~ "(?i)^(outputport)$" {
+            #OutputPort
+        }
+        if kind =~ "(?i)^(workload)$" {
+            #Workload
+        }
+        if kind =~ "(?i)^(storage)$" {
+            #Storage
+        }
+        if kind =~ "(?i)^(observability)$" {
+            #Observability
+        }
+    }
+    ...
 }
 id: #DataProductId
diff --git a/example.yaml b/example.yaml
index 5569095..b4779e2 100644
--- a/example.yaml
+++ b/example.yaml
@@ -30,7 +30,7 @@ components:
     platform: CDP on AWS
     technology: s3_cdp
     outputPortType: Files
-    creationDate: 05-04-2022 16:53:00
+    creationDate: 2022-04-05T16:53:00.000Z
     startDate:
     processDescription: this output port is generated by a Spark Job scheduled every day at 2AM and it lasts for approx 2 hours
     dataContract:
@@ -45,7 +45,7 @@
      purpose: this output port want to provide a rich set of profitability KPIs related to the customer
      billing: 5$ for each full scan
      security: In order to consume this output port an additional security check with compliance must be done
-     intendedUsage: the dataset is huge so it is reccomended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
+     intendedUsage: the dataset is huge so it is recommended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
      limitations: is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of january
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
@@ -75,26 +75,95 @@
     platform: CDP on AWS
     technology: impala_cdp
     outputPortType: SQL
-    creationDate: 05-04-2022 17:00:00
+    creationDate: 2022-04-05T17:00:00.000Z
     startDate:
     processDescription:
     dataContract:
       schema:
-        - name: name
+        - name: employeeId
           dataType: string
-        - name: surname
+          description: globally addressable identifier for an employee
+          constraint: PRIMARY_KEY
+          tags:
+            - tagFQN: GlobalAddressableIdentifier
+              source: Tag
+              labelType: Manual
+              state: Confirmed
+        - name: first_name
           dataType: string
+          description: employee's first name
+          constraint: NOT_NULL
+          tags:
+            - tagFQN: PII
+              source: Tag
+              labelType: Manual
+              state: Confirmed
+        - name: last_name
+          dataType: string
+          description: employee's last name
+          constraint: NOT_NULL
+          tags:
+            - tagFQN: PII
+              source: Tag
+              labelType: Manual
+              state: Confirmed
+        - name: birthdate
+          dataType: date
+          description: employee's birthdate
+          constraint: NOT_NULL
+          tags: []
+        - name: gender
+          dataType: string
+          description: employee's gender
+          constraint: NOT_NULL
+          tags: []
+        - name: residential_address
+          dataType: struct
+          description: employee's residential address
+          constraint: NOT_NULL
+          tags:
+            - tagFQN: PII
+              source: Tag
+              labelType: Manual
+              state: Confirmed
+        - name: first_hire_date
+          dataType: date
+          description: the date of the employee's first hire at mybank, whether on a temporary or a permanent contract
+          constraint: NOT_NULL
+          tags: []
+        - name: last_working_date
+          dataType: date
+          description: the last day the employee worked for mybank
+          constraint: NULL
+          tags: []
+        - name: last_update
+          dataType: date
+          description: the last date the record has been updated
+          constraint: NULL
+          tags: []
+        - name: businessTs
+          dataType: timestamp
+          description: the business timestamp, to be leveraged for time-travelling
+          constraint: NOT_NULL
+          tags: []
+        - name: writeTs
+          dataType: timestamp
+          description: the technical (write) timestamp, to be leveraged for time-travelling
+          constraint: NOT_NULL
+          tags: []
       SLA:
         intervalOfChange: 1 hours
-        timeliness: 1 minutes
+        timeliness: 1 minutes
        upTime: 99.9%
       termsAndConditions: only usable in development environment
       endpoint: https://myurl/development/my_domain/my_data_product/1.0.0/my_raw_s3_port
+      biTempBusinessTs: businessTs
+      biTempWriteTs: writeTs
     dataSharingAgreements:
      purpose: this output port want to provide a rich set of profitability KPIs related to the customer
      billing: 5$ for each full scan
      security: In order to consume this output port an additional security check with compliance must be done
-     intendedUsage: the dataset is huge so it is reccomended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
+     intendedUsage: the dataset is huge so it is recommended to extract maximum 1 year of data and to use these KPIs in the marketing or sales domain, but not for customer care
      limitations: is not possible to use this data without a compliance check
      lifeCycle: the maximum retention is 10 years, and eviction is happening on the first of january
      confidentiality: if you want to store this data somewhere else, PII columns must be masked
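A note on the constraints this patch introduces (not part of the patch itself): the `kind` field is validated with a case-insensitive regular expression, and the corrected `creationDate` values follow ISO 8601. A minimal Python sketch of both checks, with hypothetical helper names, might look like:

```python
import re
from datetime import datetime

# The same case-insensitive pattern the CUE schema applies to `kind`.
KIND_RE = re.compile(r"(?i)^(outputport|workload|storage|observability)$")

def is_valid_kind(kind: str) -> bool:
    """Return True if `kind` names one of the four component traits."""
    return KIND_RE.match(kind) is not None

def parse_creation_date(value: str) -> datetime:
    """Parse an ISO-8601 creationDate such as '2022-04-05T16:53:00.000Z'."""
    # fromisoformat() on Python < 3.11 rejects a trailing 'Z',
    # so normalize it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

ok = is_valid_kind("OutputPort")  # True: the pattern is case-insensitive
dt = parse_creation_date("2022-04-05T16:53:00.000Z")
```

This is only an illustration of the declared formats; the specification itself stays technology-agnostic and does not prescribe a validator.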