
Create seed project to define a 'data segment' component #55

Open
dnvriend opened this issue Dec 26, 2017 · 0 comments
Data segments can be defined using the Kappa architecture, in which there are a fast and a slow lane: the fast lane for real-time processing/analytics and the slow lane for big-data processing. Both are necessary to provide the data stakeholder with several views on the stakeholder's domain. The real-time view provides information about the domain within seconds to minutes, allowing for some inaccuracy. The slow lane allows for refinements of the stakeholder's view of the domain, providing accurate information about the domain as fast as technology allows; currently, depending on the size of the data set and the data architecture in place, that means minutes to hours to days.
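The two lanes can be sketched in a few lines of Python. This is a hypothetical illustration, not this project's API: both lanes read the same append-only event log, the fast lane trades accuracy for latency by looking only at a recent window, and the slow lane recomputes the exact answer over everything.

```python
# Minimal sketch of the fast/slow-lane idea: both lanes consume the same
# append-only event log; all names here are illustrative assumptions.
from statistics import mean

event_log = []  # append-only log shared by both lanes

def ingest(value):
    event_log.append(value)

class FastLane:
    """Approximate, low-latency view: mean over a small recent window."""
    def __init__(self, window=3):
        self.window = window
    def view(self):
        recent = event_log[-self.window:]
        return mean(recent) if recent else None

class SlowLane:
    """Accurate, high-latency view: full recomputation over all events."""
    def view(self):
        return mean(event_log) if event_log else None

for v in [10, 20, 30, 40, 50]:
    ingest(v)

fast = FastLane(window=2).view()   # approximate: mean of the last 2 events
slow = SlowLane().view()           # exact: mean of all events
```

The fast view answers immediately but only reflects recent events; the slow view is exact but must rescan the whole log, which is the size-dependent latency described above.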

As data stakeholders rely more on data to drive their decision making, and more and more companies become data-driven in their decision making, data architecture and data science become more important. The underlying data lake, providing both 'data-as-a-service' and 'analytics-as-a-service', must support these requirements.

Data, being shapeless in nature, must be captured and governed accordingly. Decisions driven by numbers must be traceable to their origins, and data must be kept safe at all times. Having a good data architecture in place, and thus structured data, allows for governing data in all aspects of its life cycle and consumption. Just-in-time enforcement is an appropriate strategy, where consumption and transformation must be in accordance with policies defined on the data itself.

Encryption has always been a way to secure data in its broadest definition. Amazon KMS provides a way to encrypt structured data. Governance is provided by means of shape definitions, implemented by a schema repository, and by means of policies delivered to data consumers just-in-time, at the point of consumption.

Data can only be governed when components are part of the platform, so data lakes must become a platform in themselves, the foundation that other components can build upon. These components are verticals with a business focus. Business services, i.e. verticals, are vertical because they specialize, but data is horizontal: it is cross-cutting. Verticals, however, need specialized data to function; moreover, they generate specialized data.

Data from the data lake can become specialized by fitting the data into a shape by means of a transform. Transforms rotate data 90 degrees in order to make it fit a vertical. Verticals generate data, and that data must be rotated 90 degrees in order to become horizontal again.
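The 'rotation' described above can be sketched as a declarative field mapping. All record shapes and field names below are hypothetical, assuming JSON-like records; the point is only that a transform remaps a vertical's specialized fields into the cross-cutting horizontal shape.

```python
# Hypothetical sketch: a transform "rotates" a vertical record into the
# horizontal (master-data) shape by selecting and renaming fields.
def make_transform(field_map):
    """Build a transform from a {target_field: source_field} mapping."""
    def transform(record):
        return {target: record[source] for target, source in field_map.items()}
    return transform

# A vertical record produced by an assumed 'orders' vertical.
order = {"order_id": "o-1", "cust": "c-42", "cust_name": "Acme", "total": 99.5}

# Rotate it into an assumed cross-cutting master-data 'customer' shape.
to_master_customer = make_transform({"customer_id": "cust", "name": "cust_name"})
customer = to_master_customer(order)
# customer == {"customer_id": "c-42", "name": "Acme"}
```

The same mechanism runs in both directions: a vertical defines a mapping from master data to its own shape, and the platform defines mappings from vertical output back to master data.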

Horizontal data is called master data, and its definitions are cross-cutting across the company. There must be business-data alignment in the vocabulary that both speak. The business consists of the business processes of a company and models the world in which business stakeholders operate. Business processes need data and generate data. Some data is plain process output; other data is derivative, like KPIs. Analytics provide KPIs, and outputs are what vertical data generators produce. The data lake must support both vertical and horizontal data and must provide data to data verticals in the form of a self-service API.

Using a menu-card system, a data architect defines master data, which acts as a pivot point for the data lake to reason about data shapes (the structured data); the data architect thereby gives the system possible ways to rotate (transform) vertically generated data into horizontal data (master data). The self-service API gives verticals a way to define transforms that rotate horizontal data (master data) into vertical data. The data lake then generates the data and provides an endpoint to the vertical where the data is made available.

The data lake, inspired by 'product thinking', provides data stakeholders with data products to choose from, e.g. a search product, a report product, analytical products, or a key-value product, and manages the lifecycle of each data product. It also provides information about the data product and enforces the policies set on it.

The system as a whole must function to provide the data stakeholder with data-as-a-product that can only be consumed when every aspect of the data product is in order. By consuming data just-in-time, policies can be enforced in a way that does not interfere with the operations of the vertical. Therefore a data architecture must be in place that defines the shape of the data by means of schemas, providing full compatibility of the data for the vertical - the data stakeholder.
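Just-in-time enforcement can be sketched as a chain of policy functions applied at the moment of consumption. The policy names and record fields below are invented for illustration; the idea is only that the record is materialised for the consumer after every policy attached to the data has run.

```python
# Hypothetical sketch of just-in-time policy enforcement: each policy
# either rewrites the record for this consumer or denies access outright.
class PolicyViolation(Exception):
    pass

def enforce(policies, consumer, record):
    """Apply each policy at the point of consumption; deny by raising."""
    for policy in policies:
        record = policy(consumer, record)
    return record

def deny_untrusted(consumer, record):
    if not consumer.get("trusted"):
        raise PolicyViolation("consumer is not trusted")
    return record

def mask_pii(consumer, record):
    # Consumers without the (assumed) 'pii' grant see masked fields.
    if "pii" not in consumer["grants"]:
        record = {**record, "email": "***"}
    return record

analyst = {"name": "analyst", "grants": [], "trusted": True}
record = {"customer_id": "c-42", "email": "a@example.com"}
safe = enforce([deny_untrusted, mask_pii], analyst, record)
# safe == {"customer_id": "c-42", "email": "***"}
```

Because the policies run inside the deserialization path rather than in the vertical's own code, the vertical's operations are untouched while governance is still applied on every read.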

Possible implementation candidates are:

  • Apache Avro to provide full compatibility,
  • Apache Avro to provide structured data,
  • Apache Avro to separate schema from data,
  • Apache Avro to define rules for the data format,
  • a schema repository to provide systems a way to reason about data,
  • a data governance repository to link policies to applications/users and schemas,
  • the Apache Avro binary data format to encode data in its simplest form,
  • a custom data record format to, among other things, 'tag' the binary data format with its origin,
  • custom serializers and deserializers to generate the custom data record format and to enforce governance policies by means of just-in-time enforcement,
  • vertical component definitions, leveraging SBT to set up (legacy) components to become data,
  • AWS services such as CloudFormation, Lambda, S3, SNS, Kinesis, Firehose, DynamoDB, Elasticsearch, Aurora, Redshift, KMS, and many more, preferring managed resources.
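The custom, origin-tagged record format from the list above could look something like the following sketch. The magic byte, header layout, and field names are assumptions for illustration, not a finished wire format: an opaque (e.g. Avro-encoded) payload is framed with a schema id and an origin tag so the platform can reason about any record it stores.

```python
# Hypothetical tagged record format: magic byte, schema id, origin tag,
# then the opaque (e.g. Avro-encoded) payload.
import struct

MAGIC = 0xD5  # assumed magic byte marking this record format

def serialize(schema_id, origin, payload):
    origin_bytes = origin.encode("utf-8")
    # header: magic (1 byte), schema id (4 bytes), origin length (2 bytes)
    header = struct.pack(">BIH", MAGIC, schema_id, len(origin_bytes))
    return header + origin_bytes + payload

def deserialize(data):
    magic, schema_id, origin_len = struct.unpack_from(">BIH", data)
    if magic != MAGIC:
        raise ValueError("not a tagged data record")
    origin = data[7:7 + origin_len].decode("utf-8")
    payload = data[7 + origin_len:]
    return schema_id, origin, payload

blob = serialize(12, "orders-vertical", b"\x02\x06foo")
assert deserialize(blob) == (12, "orders-vertical", b"\x02\x06foo")
```

Carrying the schema id in the frame is what lets a schema repository resolve the payload's shape at read time, and the origin tag is what the governance repository would key its policies on.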
@dnvriend dnvriend self-assigned this Dec 26, 2017