CDAP 5.0

@prinam released this 31 Jul 19:54 · 12 commits to release/5.0 since this release · fe2ddae

Summary

  1. Cloud Runtime

    • Cloud Runtimes allow you to configure batch pipelines to run in a cloud environment:
      - Before the pipeline runs, a cluster is provisioned in the cloud. The pipeline is executed on that cluster, and the cluster is deleted after the run finishes.
      - Cloud Runtimes let you use compute resources only when you need them, enabling you to make better use of your resources.
  2. Metadata

    • Metadata Driven Processing:
      - Annotate metadata on custom entities such as fields in a dataset, partitions of a dataset, or files in a fileset.
      - Access metadata from a program or plugin at runtime to facilitate metadata-driven processing.
    • Field Level Lineage:
      - APIs to register operations performed on fields from a program or a pipeline plugin.
      - Platform feature to compute field-level lineage based on those operations.
  3. Analytics

    • A simple, interactive, UI-driven approach to machine learning:
      - Lowers the bar for machine learning, allowing users of any skill level to understand their data and train models while preserving the switches and levers that advanced users may want to tweak.
  4. Operational Dashboard

    • A real-time interactive interface that visualizes program run statistics.
    • Reporting for comprehensive insights into program runs over long periods of time.

New Features

Cloud Runtime
........................

  • Added Cloud Runtimes, which allow users to assign profiles to batch pipelines that control what environment the pipeline will run in. For each program run, a cluster in a cloud environment can be created for just that run, allowing efficient use of resources. (CDAP-13089)

  • Added a way for users to create compute profiles from UI to run programs in remote (cloud) environments using one of the available provisioners. (CDAP-13213)

  • Allowed users to specify a compute profile in the UI to run pipelines in cloud environments. A compute profile can be specified when running a pipeline manually, via a time schedule, or via a pipeline-state-based trigger. (CDAP-13206)

  • Added a provisioner that allows users to run pipelines on Google Cloud Dataproc clusters. (CDAP-13094)

  • Added a provisioner that can run pipelines on remote Apache Hadoop clusters (CDAP-13774)

  • Added an Amazon Elastic MapReduce provisioner that can run pipelines on AWS EMR. (CDAP-13709)

  • Added support for viewing logs in CDAP for programs executing using the Cloud Runtime. (CDAP-13380)

  • Added metadata, such as the pipelines, schedules, and triggers associated with a profile. Also added metrics, such as the total number of runs of a pipeline using a profile. (CDAP-13432)

  • Added the ability to disable and enable a profile (CDAP-13494)

  • Added the capability to export or import compute profiles (CDAP-13276)

  • Added the ability to set the default profile at namespace and instance levels. (CDAP-13359)
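
A compute profile pairs a provisioner with its configuration. As an illustrative sketch of how such a payload could be assembled (the function, property names, and payload layout are hypothetical, not CDAP's exact profile schema):

```python
# Illustrative sketch only: the property names and payload layout are
# hypothetical, not CDAP's exact compute-profile schema.
import json

def make_profile(name, provisioner, properties):
    """Pair a provisioner name with its configuration properties."""
    return {
        "name": name,
        "provisioner": {
            "name": provisioner,
            "properties": [
                {"name": k, "value": v} for k, v in properties.items()
            ],
        },
    }

# A hypothetical profile for a Google Cloud Dataproc provisioner.
profile = make_profile(
    "dev-dataproc", "gcp-dataproc",
    {"region": "us-central1", "workerCPUs": "2"},
)
print(json.dumps(profile, indent=2))
```

Once defined, such a profile can then be assigned when running a pipeline manually, via a schedule, or via a trigger, as described above.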

Metadata
................

  • Added support for annotating metadata to custom entities. For example now a field in a dataset can be annotated with metadata. (CDAP-13260)

  • Added programmatic APIs for users to register field level operations from programs and plugins. (CDAP-13264)

  • Added REST APIs to retrieve the fields that were updated for a given dataset in a given time range, a summary of how those fields were computed, and details about the operations responsible for updating those fields. (CDAP-13269)

  • Added the ability to view Field Level Lineage for datasets. (CDAP-13511)
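
Field-level lineage can be thought of as chaining the registered operations together: a field's lineage is the set of source fields it transitively derives from. A simplified, CDAP-agnostic model of that computation (not the actual programmatic API):

```python
# Simplified model of field-level lineage: each operation reads some
# fields and writes others; the lineage of a field is the set of
# source fields it transitively derives from. This is an illustrative
# sketch, not CDAP's actual lineage API.

def field_lineage(operations, target):
    """Return the set of input fields that `target` derives from.

    `operations` is a list of (inputs, outputs) pairs applied in order.
    """
    lineage = {target}
    # Walk the operations backwards, replacing each derived field with
    # the inputs that produced it.
    for inputs, outputs in reversed(operations):
        derived = lineage & set(outputs)
        if derived:
            lineage = (lineage - derived) | set(inputs)
    return lineage

ops = [
    (["first_name", "last_name"], ["full_name"]),  # concat
    (["full_name"], ["full_name_upper"]),          # uppercase
]
print(field_lineage(ops, "full_name_upper"))  # {'first_name', 'last_name'}
```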

Analytics
...............

  • Added CDAP Analytics, an interactive, UI-driven application that allows users to train machine learning models and use them in their pipelines to make predictions. (CDAP-13921)

Operational Dashboard
......................................

  • Added a Dashboard for real-time monitoring of programs and pipelines (CDAP-12865)

  • Added a UI to generate reports on programs and pipelines that ran over a period of time (CDAP-12901)

  • Added support for Reports and the Dashboard. The Dashboard provides the real-time status of program runs and future schedules. Reports are a tool for administrators to take a historical look at their applications' program runs, statistics, and performance. (CDAP-13147)

Other New Features
.................................

Data Pipelines
^^^^^^^^^^^^^^

  • Added 'Error' and 'Alert' ports for plugins that support this functionality. To enable it in a plugin, in addition to emitting alerts and errors from the plugin code, set "emit-errors": true and "emit-alerts": true in the plugin JSON. Connections can then be created from the 'Error' port to Error Handler plugins and from the 'Alert' port to Alert plugins. (CDAP-12839)

  • Added support for Apache Phoenix as a source in Data Pipelines. (CDAP-13045)

  • Added support for Apache Phoenix database as a sink in Data Pipelines. (CDAP-13499)

  • Added the ability to support macro behavior for all widget types (CDAP-12944)

  • Added the ability to view all the concurrent runs of a pipeline (CDAP-13057)

  • Added the ability to view the runtime arguments, logs and other details of a particular run of a pipeline. (CDAP-13006)

  • Added UI support for Splitter plugins (CDAP-13242)
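
For the 'Error' and 'Alert' ports above, the flags in the plugin JSON might appear as follows (a minimal fragment; the plugin's other configuration keys are elided, and the exact placement of these flags in the file is not shown in these notes):

```json
{
  "emit-errors": true,
  "emit-alerts": true
}
```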

Data Preparation
^^^^^^^^^^^^^^^^

  • Added a Google BigQuery connection for Data Preparation (CDAP-13100)

  • Added a point-and-click interaction to change the data type of a column in the Data Preparation UI (CDAP-12880)

Miscellaneous
^^^^^^^^^^^^^

  • Added a page to view and manage a namespace. Users can click the current namespace card in the namespace dropdown to go to the namespace's detail page, which shows the entities and profiles created in the namespace, as well as its preferences, mapping, and security configurations. (CDAP-13180)

  • Added the ability to restart CDAP programs to make it resilient to YARN outages. (CDAP-12951)

  • Implemented a new Administration page with two tabs: Configuration and Management. In the Configuration tab, users can view and manage all namespaces, system preferences, and system profiles. In the Management tab, users can get an overview of system services in CDAP and scale them. (CDAP-13242)

Improvements

  • Added Spark 2 support for Kafka realtime source (CDAP-13280)

  • Added support for CDH 5.13 and 5.14. (CDAP-12727)

  • Added support for EMR 5.4 through 5.7 (CDAP-11805)

  • Upgraded CDAP Router to use Netty 4.1 (CDAP-6308)

  • Added support for automatically restarting long running program types (Service and Flow) upon application master process failure in YARN (CDAP-13179)

  • Added support for specifying custom consumer configs in Kafka source (CDAP-12549)

  • Added support for specifying recursive schemas (CDAP-13143)

  • Added support for passing the YARN application ID in the logging context. This helps correlate the ID of a program run in CDAP with the ID of the corresponding YARN application, facilitating better debugging. (CDAP-12275)

  • Added the ability to deploy plugin artifacts without requiring a parent artifact. Such plugins are available for use with any parent artifact. (CDAP-9080)

  • Added the ability to import pipelines from the add entity modal (plus button) (CDAP-12274)

  • Added the ability to save the runtime arguments of a pipeline as preferences, so that they do not have to be entered again. (CDAP-11844)

  • Added the ability to specify dependencies to ScalaSparkCompute Action (CDAP-12724)

  • Added the ability to update the keytab URI for namespace's impersonation configuration. (CDAP-12426)

  • Added the ability to upload a User Defined Directive (UDD) using the plus button (CDAP-12279)

  • Allowed CDAP user programs to talk to Kerberos enabled HiveServer2 in the cluster without using a keytab (CDAP-12963)

  • Allowed users to configure the transaction isolation level in database plugins (CDAP-11096)

  • Configured sandbox to have secure store APIs enabled by default (CDAP-13573)

  • Improved robustness of unit test framework by fixing flaky tests (CDAP-13411)

  • Increased the default Twill reserved memory from 300 MB to 768 MB in order to prevent YARN from killing containers in standard cluster setups. (CDAP-13405)

  • Macro enabled all fields in the HTTP Callback plugin (CDAP-13116)

  • Removed concurrent upgrades of HBase coprocessors, since they could lead to regions getting stuck in transition. (CDAP-12974)

  • Updated the CDAP sandbox to use Spark 2.1.0 as the default Spark version. (CDAP-13409)

  • Improved the documentation for defining Apache Ranger policies for CDAP entities (CDAP-13157)

  • Improved resiliency of router to zookeeper outages. (CDAP-12992)

  • Improved the performance of metadata upgrade by adding a dataset cache. (CDAP-13756)

  • Added CLI command to fetch service logs (CDAP-7644)

  • Added rate limiting to router logs in the event of zookeeper outages (CDAP-12989)

  • Renamed the system metadata tables to v2.system.metadata_index.d and v2.system.metadata_index.i, and the business metadata tables to v2.business.metadata_index.d and v2.business.metadata_index.i. (CDAP-13759)

  • Reduced CDAP Master's local storage usage by deleting temporary directories created for programs as soon as programs are launched on the cluster. (CDAP-6032)
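
Recursive schema support (CDAP-13143) follows the Avro convention of a field referring back to its enclosing named record type. A hedged sketch in Avro-style schema JSON (CDAP's schema format is Avro-like, so this is illustrative rather than verbatim CDAP syntax):

```python
import json

# A recursive record schema: the "children" field refers back to the
# named type "TreeNode". Avro-style; CDAP's schema format is Avro-like,
# so treat this as an illustrative sketch, not verbatim CDAP syntax.
tree_schema = {
    "type": "record",
    "name": "TreeNode",
    "fields": [
        {"name": "value", "type": "long"},
        {"name": "children", "type": {"type": "array", "items": "TreeNode"}},
    ],
}

print(json.dumps(tree_schema, indent=2))
```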

Bug Fixes

  • Fixed a bug in TMS that prevented consumers from correctly consuming multiple events emitted in the same transaction. (CDAP-13033)

  • Fixed a bug that caused errors in the File source if it read parquet files that were not generated through Hadoop. (CDAP-12875)

  • Fixed a bug that caused PySpark to fail to run with Spark 2 in local sandbox. (CDAP-12693)

  • Fixed a bug that could cause the status of a running program to be falsely returned as stopped if the run happened to change state in the middle of calculating the program state. Also fixed a bug where the state for a suspended workflow was stopped instead of running. (CDAP-13296)

  • Fixed a bug that prevented the MapReduce application master logs in YARN from showing the right URI. (CDAP-7052)

  • Fixed a bug that prevented Spark jobs from running after CDAP upgrade due to caching of jars. (CDAP-12973)

  • Fixed a bug that prevented a Parquet snapshot source and sink from being used in the same pipeline. (CDAP-13026)

  • Fixed a race condition in which running a pipeline preview could cause the CDAP process to shut down. (CDAP-13593)

  • Fixed a bug where a Spark program would fail to run when Spark authentication is turned on. (CDAP-12752)

  • Fixed a bug where an ad-hoc exploration query on streams would fail in an impersonated namespace. (CDAP-13123)

  • Fixed a bug where pipelines with conditions on different branches could not be deployed. (CDAP-13463)

  • Fixed a bug where the Scala Spark compiler was missing classes from its classloader, causing compilation failures. (CDAP-12743)

  • Fixed a bug where the upgrade tool did not upgrade the owner meta table (CDAP-13372)

  • Fixed a bug in the artifact count for a namespace: the count previously included system artifacts, making the total much larger than the real count. (CDAP-12647)

  • Fixed a class loading issue and a schema mismatch issue in the whole-file-ingest plugin. (CDAP-13364)

  • Fixed a dependency bug that could cause HBase region servers to deadlock during a cold start (CDAP-12970)

  • Fixed an issue that caused pipeline failures if a Spark plugin tried to read or write a DataFrame using csv format. (CDAP-12742)

  • Fixed an issue that prevented user runtime arguments from being used in CDAP programs (CDAP-13532)

  • Fixed an issue where Spark 2.2 batch pipelines with HDFS sinks would fail with delegation token issue error (CDAP-13281)

  • Fixed an issue that caused the HBase sink to fail when used alongside other sinks with the Spark execution engine. (CDAP-12731)

  • Fixed an issue with the retrieval of non-ASCII strings from Table datasets. (CDAP-13002)

  • Fixed avro fileset plugins so that reserved hive keywords can be used as column names (CDAP-13040)

  • Fixed macro enabled properties in plugin configuration to only have macro behavior if the entire value is a macro. (CDAP-13331)

  • Fixed the logs REST API to return a valid json object when filters are specified (CDAP-12988)

  • Fixed an issue where a dataset's classloader was closed before the dataset itself, preventing the dataset from closing properly. (CDAP-13110)
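
The macro fix (CDAP-13331) means a plugin property gets macro behavior only when the entire value is a single macro expression. A rough sketch of that check (the `${...}` syntax matches CDAP's macro notation; the function itself is illustrative, not CDAP code):

```python
import re

# A value gets macro behavior only when the whole value is one macro
# expression, e.g. "${host}" qualifies but "http://${host}" does not.
# Illustrative check; not CDAP's actual implementation.
_MACRO_RE = re.compile(r"^\$\{[^}]+\}$")

def is_whole_value_macro(value: str) -> bool:
    return bool(_MACRO_RE.match(value))

print(is_whole_value_macro("${host}"))         # True
print(is_whole_value_macro("http://${host}"))  # False
```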

Deprecated and Removed Features

  • Deprecated the aggregation of metadata annotated on all the entities (application, programs, datasets, streams) associated with a run. From this release onwards, metadata for program runs behaves like that of any other entity: metadata can be annotated directly on a run and retrieved from it. For backward compatibility, the new behavior requires setting the additional query parameter 'runAggregation' to false when making the REST call to retrieve metadata of program runs. (CDAP-13721)

  • Dropped support for CDH 5.1, 5.2, 5.3 and HDP 2.0, 2.1 due to security vulnerabilities identified in them (CDAP-8141)

  • Removed HDFS, YARN, and HBase operational stats. These stats were not very useful, could generate confusing log warnings, and were confusing when used in conjunction with cloud profiles. (CDAP-13493)

  • Removed analytics plugins such as Decision Tree, Naive Bayes, and Logistic Regression from the Hub. The new Analytics flow in the UI should be used as a substitute for this functionality. (CDAP-13720)

  • Removed deprecated cdap sdk commands. Use cdap sandbox commands instead. (CDAP-12584)

  • Removed deprecated cdap.sh and cdap-cli.sh scripts. Use cdap sandbox or cdap cli instead. (CDAP-13680)

  • Removed deprecated error datasets from pipelines. Error transforms should be used instead of error datasets, as they offer more functionality and flexibility. (CDAP-11870)

  • Deprecated HDFS Sink. Use the File sink instead. (CDAP-13353)

  • Removed deprecated stream size based schedules (CDAP-12692)

  • Deprecated streams and flows. Use Apache Kafka as a replacement technology for streams and Spark Streaming as a replacement technology for flows. Streams and flows will be removed in the 6.0 release. (CDAP-13419)

  • Removed multiple deprecated programmatic and RESTful APIs in CDAP. (CDAP-5966)

Known Issues

  • Updating the compute profile used to manually run a pipeline through the UI can remove the pipeline's existing schedules and triggers. (CDAP-13853)

  • The Reports feature does not currently work with Apache Spark 2.0. As a workaround, upgrade to Spark 2.1 or later. (CDAP-13919)

  • Plugins that are not supported when running a pipeline using a Cloud Runtime produce unclear error messages at runtime. (CDAP-13896)

  • While some built-in plugins have been updated to emit operations for capturing field level lineage, a number of them do not yet emit these operations. (CDAP-13274)

  • Pipelines cannot propagate dynamic schemas at runtime. (CDAP-13326)

  • Reading metadata is not supported when pipelines or programs run using a cloud runtime. (CDAP-13963)

  • Creating a pipeline from Data Preparation when using an Apache Kafka plugin fails. As a workaround, after clicking the Create Pipeline button, manually update the schema of the Kafka plugin to set a single field named body as a non-nullable string. (CDAP-13971)

  • Metadata for a custom entity is not deleted when its nearest known ancestor entity (parent) is deleted. (CDAP-13910)