## 👉 START HERE: How to use this notebook

### Step 1: Create synthetic evaluation data

To measure your Agent's quality, you need a diverse, representative evaluation set.  This notebook turns your unstructured documents into a high-quality synthetic evaluation set so that you can start to evaluate and improve your Agent's quality before subject matter experts are available to label data.

This notebook does the following:
1. <TODO>

**Note:** This notebook does not yet work from a local IDE.

**Important note:** Throughout this notebook, we indicate which cells you:
- ✅✏️ *should* customize - these cells contain config settings to change
- 🚫✏️ *typically will not* customize - these cells contain code that is parameterized by your configuration.

*Cells that don't require customization still need to be run!*

### 🚫✏️ Install Python libraries

In [0]:
%pip install -qqqq -U -r requirements.txt
dbutils.library.restartPython()

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain 0.1.20 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.3.24 which is incompatible.
langchain 0.1.20 requires langsmith<0.2.0,>=0.1.17, but you have langsmith 0.2.2 which is incompatible.
langchain-community 0.0.38 requires langchain-core<0.2.0,>=0.1.52, but you have langchain-core 0.3.24 which is incompatible.
langchain-community 0.0.38 requires langsmith<0.2.0,>=0.1.0, but you have langsmith 0.2.2 which is incompatible.
langchain-text-splitters 0.0.2 requires langchain-core<0.3,>=0.1.28, but you have langchain-core 0.3.24 which is incompatible.
ydata-profiling 4.5.1 requires pandas!=1.4.0,<2.1,>1.1, but you have pandas 2.2.3 which is incompatible.
ydata-profiling 4.5.1 requires pydantic<2,>=1.8.1, but you have pydantic 2.10.3 which is incompatible.
Note: you

### 🚫✏️ Connect to Databricks

If you are running locally in an IDE using Databricks Connect, this cell connects the Spark client and configures MLflow to use Databricks Managed MLflow. If you are running in a Databricks Notebook, these values are already set.

In [0]:
from mlflow.utils import databricks_utils as du
import os

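# Outside a Databricks notebook (e.g., a local IDE), create the Spark session through Databricks Connect
if not du.is_in_databricks_notebook():
    from databricks.connect import DatabricksSession

    spark = DatabricksSession.builder.getOrCreate()
    # Point MLflow at Databricks Managed MLflow
    os.environ["MLFLOW_TRACKING_URI"] = "databricks"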

### 🚫✏️ Load the Agent's storage locations

This notebook writes to the evaluation set table that you specified in the [Agent setup](02_agent_setup.ipynb) notebook.

In [0]:
from cookbook.config.shared.agent_storage_location import AgentStorageConfig
from cookbook.databricks_utils import get_table_url
from cookbook.config import load_serializable_config_from_yaml_file

# Load the Agent's storage configuration
agent_storage_config: AgentStorageConfig = load_serializable_config_from_yaml_file('./configs/agent_storage_config.yaml')

# Check if the evaluation set already exists
try:
    eval_dataset = spark.table(agent_storage_config.evaluation_set_uc_table)
    if eval_dataset.count() > 0:
        print(f"Evaluation set {get_table_url(agent_storage_config.evaluation_set_uc_table)} already exists!  By default, this notebook will append to the evaluation dataset.  If you would like to overwrite the existing evaluation set, please delete the table before running this notebook.")
    else:
        print(f"Evaluation set {get_table_url(agent_storage_config.evaluation_set_uc_table)} exists, but is empty!  By default, this notebook will NOT change the schema of this table - if you experience schema related errors, drop this table before running this notebook so it can be recreated with the correct schema.")
except Exception:
    print(f"Evaluation set `{agent_storage_config.evaluation_set_uc_table}` does not exist.  This notebook will create a new Delta Table at this location.")

Evaluation set `casaman_ssa.demos.my_agent_autogen_eval_set` does not exist.  This notebook will create a new Delta Table at this location.
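
If you want to overwrite an existing evaluation set rather than append to it, the table has to be dropped first, as the messages in the cell above suggest. The following is a minimal sketch and not part of the cookbook: `build_drop_table_statement` and the example table name are hypothetical, and the helper assumes a plain three-level `catalog.schema.table` name with no dots inside any part.

```python
def build_drop_table_statement(uc_table_name: str) -> str:
    """Build a DROP TABLE statement for a three-level Unity Catalog table name.

    Hypothetical helper for illustration; assumes `catalog.schema.table`
    with no dots inside the individual parts.
    """
    parts = uc_table_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected catalog.schema.table, got: {uc_table_name!r}")
    # Backtick-quote each part so names with special characters stay valid SQL
    quoted = ".".join(f"`{part.strip('`')}`" for part in parts)
    return f"DROP TABLE IF EXISTS {quoted}"


# Placeholder table name; substitute the value from your own configuration
statement = build_drop_table_statement("example_catalog.example_schema.example_eval_set")
print(statement)

# In a Databricks notebook, the statement could then be executed with:
# spark.sql(statement)
```

Building the statement separately from running it lets you inspect exactly which table will be dropped before anything is deleted.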


### ✅✏️ Load the source documents for synthetic evaluation data generation

Most often, this will be the document output table from the [data pipeline](01_data_pipeline.ipynb). The code below loads that table.

Alternatively, this can be a Spark DataFrame, Pandas DataFrame, or list of dictionaries with the following keys/columns:
- `doc_uri`: A URI pointing to the document.
- `content`: The content of the document.
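
If you supply your own documents instead of the pipeline table, a quick check that every record carries both fields can catch problems before generation starts. This is a minimal sketch assuming list-of-dictionaries input; `find_invalid_documents`, `REQUIRED_KEYS`, and the example URIs and text are hypothetical and not part of the cookbook.

```python
REQUIRED_KEYS = ("doc_uri", "content")


def find_invalid_documents(documents):
    """Return (index, problem) pairs for records missing a required key or holding an empty value."""
    problems = []
    for index, doc in enumerate(documents):
        missing = [key for key in REQUIRED_KEYS if key not in doc]
        if missing:
            problems.append((index, f"missing keys: {missing}"))
            continue
        empty = [key for key in REQUIRED_KEYS if not str(doc[key]).strip()]
        if empty:
            problems.append((index, f"empty values: {empty}"))
    return problems


# Placeholder records; the URIs and text are made up for illustration
example_documents = [
    {
        "doc_uri": "/Volumes/example_catalog/example_schema/docs/intro.md",
        "content": "Delta Lake adds reliability features on top of cloud object storage.",
    },
    {
        "doc_uri": "/Volumes/example_catalog/example_schema/docs/faq.md",
        "content": "Unity Catalog provides centralized governance for data and AI assets.",
    },
]

print(find_invalid_documents(example_documents))  # prints [] when every record is well-formed
```

The same records could also be wrapped in a Pandas or Spark DataFrame with `doc_uri` and `content` columns, as described above.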

In [0]:
from cookbook.config.data_pipeline import DataPipelineConfig
from cookbook.config import load_serializable_config_from_yaml_file

datapipeline_config: DataPipelineConfig = load_serializable_config_from_yaml_file('./configs/data_pipeline_config.yaml')

source_documents = spark.table(datapipeline_config.output.parsed_docs_table)

display(source_documents.toPandas())

content,parser_status,doc_uri,last_modified
"**EBOOK** ## The Big Book of Data Engineering 2nd Edition A collection of technical blogs, including code samples and notebooks ##### With all-new content ----- #### Contents **S E CTI ON 1** **Introduction to Data Engineering on Databricks** ............................................................................................................. **03** **S E CTI ON 2** **Guidance and Best Practices** ........................................................................................................................................................................... **10** **2 .1** Top 5 Databricks Performance Tips ................................................................................................................................................. 11 **2 . 2** How to Profile PySpark ........................................................................................................................................................................ 16 **2 . 3** Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka .......................................................... 20 **2 . 4** Streaming in Production: Collected Best Practices ................................................................................................................... 25 **2 . 5** Streaming in Production: Collected Best Practices, Part 2 ...................................................................................................... 32 **2 .6** Building Geospatial Data Products ................................................................................................................................................. 37 **2 .7** Data Lineage With Unity Catalog .................................................................................................................................................... 47 **2 . 
8** Easy Ingestion to Lakehouse With COPY INTO ............................................................................................................................ 50 **2 .9** Simplifying Change Data Capture With Databricks Delta Live Tables .................................................................................. 57 **2 .1 0** Best Practices for Cross-Government Data Sharing ................................................................................................................. 65 **S E CTI ON 3** **Ready-to-Use Notebooks and Data Sets** ...................................................................................................................................... **74** **S E CTI ON 4** **Case Studies** ................................................................................................................................................................................................................................. **76** **4 . 1** Akamai .................................................................................................................................................................................................... 77 **4 . 2** Grammarly ........................................................................................................................................................................................... 80 **4 . 3** Honeywell .............................................................................................................................................................................................. 84 **4 . 4** Wood Mackenzie ................................................................................................................................................................................. 87 **4 . 
5** Rivian .................................................................................................................................................................................................... 90 **4 . 6** AT&T ....................................................................................................................................................................................................... 94 ----- **SECTION** # 01 ### Introduction to Data Engineering on Databricks ----- Organizations realize the value data plays as a strategic asset for various business-related initiatives, such as growing revenues, improving the customer experience, operating efficiently or improving a product or service. However, accessing and managing data for these initiatives has become increasingly complex. Most of the complexity has arisen with the explosion of data volumes and data types, with organizations amassing an estimated [80% of data in](https://www.forbes.com/sites/forbestechcouncil/2019/01/29/the-80-blind-spot-are-you-ignoring-unstructured-organizational-data/?sh=681651dc211c) [unstructured and semi-structured format](https://www.forbes.com/sites/forbestechcouncil/2019/01/29/the-80-blind-spot-are-you-ignoring-unstructured-organizational-data/?sh=681651dc211c) . As the collection of data continues to increase, 73% of the data goes unused for analytics or decision-making. In order to try and decrease this percentage and make more data usable, data engineering teams are responsible for building data pipelines to efficiently and reliably deliver data. 
But the process of building these complex data pipelines comes with a number of difficulties: **•** In order to get data into a data lake, data engineers are required to spend immense time hand-coding repetitive data ingestion tasks **•** Since data platforms continuously change, data engineers spend time building and maintaining, and then rebuilding, complex scalable infrastructure **•** As data pipelines become more complex, data engineers are required to find reliable tools to orchestrate these pipelines **•** With the increasing importance of real-time data, low latency data pipelines are required, which are even more difficult to build and maintain **•** Finally, with all pipelines written, data engineers need to constantly focus on performance, tuning pipelines and architectures to meet SLAs **How can Databricks help?** With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting, transforming, processing, scheduling and delivering data. The Lakehouse Platform automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake so data engineers can focus on quality and reliability to drive valuable insights. Lakehouse Platform **One platform to support multiple personas** **BI & Data** **Warehousing** **Data** **Engineering** **Data** **Streaming** **Data** **Science & ML** ©2023 Databricks Inc. — All rights reserved **Unity Catalog** **Fine-grained governance for data and AI** **Delta Lake** **Data reliability and performance** **Cloud Data Lake** All Raw Data (Logs, Texts, Audio, Video, Images) Figure 1 The Databricks Lakehouse Platform unifies your data, analytics and AI on one common platform for all your data use cases ----- **Key differentiators for successful data engineering** **with Databricks** By simplifying on a lakehouse architecture, data engineers need an enterprise-grade and enterprise-ready approach to building data pipelines. 
To be successful, a data engineering solution team must embrace these eight key differentiating capabilities: **Data ingestion at scale** With the ability to ingest petabytes of data with auto-evolving schemas, data engineers can deliver fast, reliable, scalable and automatic data for analytics, data science or machine learning. This includes: **•** Incrementally and efficiently processing data as it arrives from files or streaming sources like Kafka, DBMS and NoSQL **•** Automatically inferring schema and detecting column changes for structured and unstructured data formats **•** Automatically and efficiently tracking data as it arrives with no manual intervention **•** Preventing data loss by rescuing data columns **Declarative ETL pipelines** Data engineers can reduce development time and effort and instead focus on implementing business logic and data quality checks within the data pipeline using SQL or Python. This can be achieved by: **•** Using intent-driven declarative development to simplify “how” and define “what” to solve **•** Automatically creating high-quality lineage and managing table dependencies across the data pipeline **•** Automatically checking for missing dependencies or syntax errors, and managing data pipeline recovery **Real-time data processing** Allow data engineers to tune data latency with cost controls without the need to know complex stream processing or implement recovery logic. 
**•** Avoid handling batch and real-time streaming data sources separately **•** Execute data pipeline workloads on automatically provisioned elastic Apache Spark™-based compute clusters for scale and performance **•** Remove the need to manage infrastructure and focus on the business logic for downstream use cases ----- **Unified orchestration of data workflows** Simple, clear and reliable orchestration of data processing tasks for data, analytics and machine learning pipelines with the ability to run multiple non-interactive tasks as a directed acyclic graph (DAG) on a Databricks compute cluster. Orchestrate tasks of any kind (SQL, Python, JARs, Notebooks) in a DAG using Databricks Workflows, an orchestration tool included in the lakehouse with no need to maintain or pay for an external orchestration service. **•** Easily create and manage multiple tasks with dependencies via UI, API or from your IDE **•** Have full observability to all workflow runs and get alerted when tasks fail for fast troubleshooting and efficient repair and rerun **•** Leverage high reliability of 99.95% uptime **•** Use performance optimization clusters that parallelize jobs and minimize data movement with cluster reuse **Data quality validation and monitoring** Improve data reliability throughout the data lakehouse so data teams can confidently trust the information for downstream initiatives by: **•** Defining data quality and integrity controls within the pipeline with defined data expectations **•** Addressing data quality errors with predefined policies (fail, drop, alert, quarantine) **•** Leveraging the data quality metrics that are captured, tracked and reported for the entire data pipeline Data Sources Data Warehouses On-premises Systems SaaS Applications Machine & Application Logs Application Events Mobile & IoT Data Cloud Storage Messag e Buses **Lakehouse Platform** **Workflows** for end-to-end orchestration Real-Time BI Apps Real-Time AI Apps Real-Time Analytics with 
**Databricks SQL** Real-Time Machine Learning with **Databricks ML** Streaming ETL with **Delta Live Tables** Predictive Maintenance Personalized Offers Patient Diagnostics Real-Time Operational Apps Real-Time Applications with **Spark Structured Streaming** **Photon** for lightning-fast data processing **Unity Catalog** for data governance and sharing **Delta Lake** for open and reliable data storage Alerts Detection Fraud Dynamic Pricing ©2023 Databricks Inc. — All rights reserved Figure 2 A unified set of tools for real-time data processing ----- **Fault tolerant and automatic recovery** Handle transient errors and recover from most common error conditions occurring during the operation of a pipeline with fast, scalable automatic recovery that includes: **•** Fault tolerant mechanisms to consistently recover the state of data **•** The ability to automatically track progress from the source with checkpointing **•** The ability to automatically recover and restore the data pipeline state **Data pipeline observability** Monitor overall data pipeline status from a dataflow graph dashboard and visually track end-to-end pipeline health for performance, quality and latency. Data pipeline observability capabilities include: **•** A high-quality, high-fidelity lineage diagram that provides visibility into how data flows for impact analysis **•** Granular logging with performance and status of the data pipeline at a row level **•** Continuous monitoring of data pipeline jobs to ensure continued operation **Automatic deployments and operations** Ensure reliable and predictable delivery of data for analytics and machine learning use cases by enabling easy and automatic data pipeline deployments and rollbacks to minimize downtime. 
Benefits include: **•** Complete, parameterized and automated deployment for the continuous delivery of data **•** End-to-end orchestration, testing and monitoring of data pipeline deployment across all major cloud providers **Migrations** Accelerating and de-risking the migration journey to the lakehouse, whether from legacy on-prem systems or disparate cloud services. The migration process starts with a detailed discovery and assessment to get insights on legacy platform workloads and estimate migration as well as Databricks platform consumption costs. Get help with the target architecture and how the current technology stack maps to Databricks, followed by a phased implementation based on priorities and business needs. Throughout this journey companies can leverage: **•** Automation tools from Databricks and its ISV partners **•** Global and/or regional SIs who have created Brickbuilder migration solutions **•** Databricks Professional Services and training This is the recommended approach for a successful migration, whereby customers have seen a 25-50% reduction in costs and 2-3x faster time to value for their use cases. ----- **Unified governance** With Unity Catalog, data engineering and governance teams benefit from an enterprisewide data catalog with a single interface to manage permissions, centralize auditing, automatically track data lineage down to the column level, and share data across platforms, clouds and regions. 
Benefits: **•** Discover all your data in one place, no matter where it lives, and centrally manage fine-grained access permissions using an ANSI SQL-based interface **•** Leverage automated column-level data lineage to perform impact analysis of any data changes across the pipeline and conduct root cause analysis of any errors in the data pipelines **•** Centrally audit data entitlements and access **•** Share data across clouds, regions and data platforms, while maintaining a single copy of your data in your cloud storage ©2023 Databricks Inc. — All rights reserved Figure 3 The Databricks Lakehouse Platform integrates with a large collection of technologies **A rich ecosystem of data solutions** The Databricks Lakehouse Platform is built on open source technologies and uses open standards so leading data solutions can be leveraged with anything you build on the lakehouse. A large collection of technology partners make it easy and simple to integrate the technologies you rely on when migrating to Databricks and to know you are not locked into a closed data technology stack. ----- **Conclusion** As organizations strive to become data-driven, data engineering is a focal point for success. To deliver reliable, trustworthy data, data engineers shouldn’t need to spend time manually developing and maintaining an end-to-end ETL lifecycle. Data engineering teams need an efficient, scalable way to simplify ETL development, improve data reliability and manage operations. As described, the eight key differentiating capabilities simplify the management of the ETL lifecycle by automating and maintaining all data dependencies, leveraging built-in quality controls with monitoring and by providing deep visibility into pipeline operations with automatic recovery. 
Data engineering teams can now focus on easily and rapidly building reliable end-to-end production-ready data pipelines using only SQL or Python for batch and streaming that deliver high-value data for analytics, data science or machine learning. **Follow proven best practices** In the next section, we describe best practices for data engineering end-to end use cases drawn from real-world examples. From data ingestion and real-time processing to analytics and machine learning, you’ll learn how to translate raw data into actionable data. As you explore the rest of this guide, you can find data sets and code samples in the various **[Databricks Solution Accelerators](https://www.databricks.com/solutions/accelerators)** , so you can get your hands dirty as you explore all aspects of the data lifecycle on the Databricks Lakehouse Platform. **Start experimenting with these** **free Databricks** **notebooks** **.** ----- **SECTION** # 02 ### Guidance and Best Practices **2.1** Top 5 Databricks Performance Tips **2.2** How to Profile PySpark **2.3** Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka **2.4** Streaming in Production: Collected Best Practices **2.5** Streaming in Production: Collected Best Practices, Part 2 **2.6** Building Geospatial Data Products **2.7** Data Lineage With Unity Catalog **2.8** Easy Ingestion to Lakehouse With COPY INTO **2.9** Simplifying Change Data Capture With Databricks Delta Live Tables **2.10** Best Practices for Cross-Government Data Sharing ----- SECTION 2.1 **Top 5 Databricks Performance Tips** by **B R YA N S M I T H** and **R O B S A K E R** March 10, 2022 As solutions architects, we work closely with customers every day to help them get the best performance out of their jobs on Databricks — and we often end up giving the same advice. It’s not uncommon to have a conversation with a customer and get double, triple, or even more performance with just a few tweaks. So what’s the secret? 
How are we doing this? Here are the top 5 things we see that can make a huge impact on the performance customers get from Databricks. Here’s a TLDR: **•** **Use larger clusters.** It may sound obvious, but this is the number one problem we see. It’s actually not any more expensive to use a large cluster for a workload than it is to use a smaller one. It’s just faster. If there’s anything you should take away from this article, it’s this. Read section 1. Really. **•** **Use** **[Photon](https://databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databricks-lakehouse-platform.html?itm_data=product-cta-announcingPhotonBlog)** , Databricks’ new, super-fast execution engine. Read section 2 to learn more. You won’t regret it. **•** **Clean out your configurations** . Configurations carried from one Apache Spark™ version to the next can cause massive problems. Clean up! Read section 3 to learn more. **•** **Use** **[Delta Caching](https://docs.databricks.com/delta/optimizations/delta-cache.html)** . There’s a good chance you’re not using caching correctly, if at all. See Section 4 to learn more. **•** **Be aware of lazy evaluation** . If this doesn’t mean anything to you and you’re writing Spark code, jump to section 5. **•** **Bonus tip! Table design is super important** . We’ll go into this in a future blog, but for now, check out the [guide on Delta Lake best practices](https://docs.databricks.com/delta/best-practices.html) . **1. Give your clusters horsepower!** This is the number one mistake customers make. Many customers create tiny clusters of two workers with four cores each, and it takes forever to do anything. The concern is always the same: they don’t want to spend too much money on larger clusters. Here’s the thing: **it’s actually not any more expensive to use a** **large cluster for a workload than it is to use a smaller one. 
It’s just faster.** ----- The key is that you’re renting the cluster for the length of the workload. So, if you spin up that two worker cluster and it takes an hour, you’re paying for those workers for the full hour. However, if you spin up a four worker cluster and it takes only half an hour, the cost is actually the same! And that trend continues as long as there’s enough work for the cluster to do. Here’s a hypothetical scenario illustrating the point: **Number of Workers** **Cost Per Hour** **Length of Workload (hours)** **Cost of Workload** 1 $1 2 $2 2 $2 1 $2 4 $4 0.5 $2 8 $8 0.25 $2 Notice that the total cost of the workload stays the same while the real-world time it takes for the job to run drops significantly. So, bump up your Databricks cluster specs and speed up your workloads without spending any more money. It can’t really get any simpler than that. **2. Use Photon** Our colleagues in engineering have rewritten the Spark execution engine in C++ and dubbed it Photon. The results are impressive! Beyond the obvious improvements due to running the engine in native code, they’ve also made use of CPU-level performance features and better memory management. On top of this, they’ve rewritten the Parquet writer in C++. So this makes writing to Parquet and Delta (based on Parquet) super fast as well! But let’s also be clear about what Photon is speeding up. It improves computation speed for any built-in functions or operations, as well as writes to Parquet or Delta. So joins? Yep! Aggregations? Sure! ETL? Absolutely! That UDF (user-defined function) you wrote? Sorry, but it won’t help there. The job that’s spending most of its time reading from an ancient on-prem database? Won’t help there either, unfortunately. ----- The good news is that it helps where it can. So even if part of your job can’t be sped up, it will speed up the other parts. 
Also, most jobs are written with the native operations and spend a lot of time writing to Delta, and Photon helps a lot there. So give it a try. You may be amazed by the results! **3. Clean out old configurations** You know those Spark configurations you’ve been carrying along from version to version and no one knows what they do anymore? They may not be harmless. We’ve seen jobs go from running for hours down to minutes simply by cleaning out old configurations. There may have been a quirk in a particular version of Spark, a performance tweak that has not aged well, or something pulled off some blog somewhere that never really made sense. At the very least, it’s worth revisiting your Spark configurations if you’re in this situation. Often the default configurations are the best, and they’re only getting better. Your configurations may be holding you back. **4. The Delta Cache is your friend** This may seem obvious, but you’d be surprised how many people are not using the [Delta Cache](https://docs.databricks.com/delta/optimizations/delta-cache.html) , which loads data off of cloud storage (S3, ADLS) and keeps it on the workers’ SSDs for faster access. If you’re using Databricks SQL Endpoints you’re in luck. Those have caching on by default. In fact, we recommend using [CACHE SELECT * FROM table](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-cache.html) to preload your “hot” tables when you’re starting an endpoint. This will ensure blazing fast speeds for any queries on those tables. If you’re using regular clusters, be sure to use the i3 series on Amazon Web Services (AWS), L series or E series on Azure Databricks, or n2 in GCP. These will all have fast SSDs and caching enabled by default. Of course, your mileage may vary. If you’re doing BI, which involves reading the same tables over and over again, caching gives an amazing boost. 
However, if you’re simply reading a table once and writing out the results as in some ETL jobs, you may not get much benefit. You know your jobs better than anyone. Go forth and conquer. ----- **5. Be aware of lazy evaluation** However, there is a catch here. Every time you try to display or write out results, it runs the execution plan again. Let’s look at the same block of code but extend it and do a few more operations. —------- _# Build an execution plan._ _# This returns in less than a second but does no work_ df2 = (df .join(...) .select(...) .filter(...) ) _# Now run the execution plan to get results_ df2.display() _# Unfortunately this will run the plan again, including filtering, joining,_ _etc_ df2.display() _# So will this…_ df2.count() —------ If you’re a data analyst or data scientist only using SQL or doing BI you can skip this section. However, if you’re in data engineering and writing pipelines or doing processing using Databricks/Spark, read on. When you’re writing Spark code like select, groupBy, filter, etc., you’re really building an execution plan. You’ll notice the code returns almost immediately when you run these functions. That’s because it’s not actually doing any computation. So even if you have petabytes of data, it will return in less than a second. However, once you go to write your results out you’ll notice it takes longer. This is due to lazy evaluation. It’s not until you try to display or write results that your execution plan is actually run. —------- _# Build an execution plan._ _# This returns in less than a second but does no work_ df2 = (df .join(...) .select(...) . filter (...) _# Now run the execution plan to get results_ df2.display() —------ ----- The developer of this code may very well be thinking that they’re just printing out results three times, but what they’re really doing is kicking off the same processing three times. Oops. That’s a lot of extra work. This is a very common mistake we run into. 
So why is there lazy evaluation, and what do we do about it? In short, processing with lazy evaluation is way faster than without it. Databricks/Spark looks at the full execution plan and finds opportunities for optimization that can reduce processing time by orders of magnitude. So that’s great, but how do we avoid the extra computation? The answer is pretty straightforward: save computed results you will reuse. This works especially well when [Delta Caching](https://docs.databricks.com/delta/optimizations/delta-cache.html) is turned on. In short, you benefit greatly from lazy evaluation, but it’s something a lot of customers trip over. So be aware of its existence and save results you reuse in order to avoid unnecessary computation. **Start experimenting with these** **free Databricks** **notebooks** **.** Let’s look at the same block of code again, but this time let’s avoid the recomputation: _# Build an execution plan._ _# This returns in less than a second but does no work_ df2 = (df .join(...) .select(...) . filter (...) ) _# save it_ df2.write.save(path) _# load it back in_ df3 = spark.read.load(path) _# now use it_ df3.display() _# this is not doing any extra computation anymore. No joins, filtering,_ _etc. It’s already done and saved._ df3.display() _# nor is this_ df3.count() ----- SECTION 2.2  **How to Profile PySpark** by **X I N R O N G M E N G , TA K U YA U E S H I N , H Y U K J I N K W O N** and **A L L A N F O LT I N G** October 6, 2022 In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore, PySpark UDFs offer more flexibility since they enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state “what to do”; PySpark, as a sandbox, encapsulates “how to do it.” That makes PySpark easier to use, but it can be difficult to identify performance bottlenecks and apply custom optimizations. 
To address the difficulty mentioned above, PySpark supports various profiling tools, which are all based on [cProfile](https://docs.python.org/3/library/profile.html#module-cProfile), one of the standard Python [profiler implementations](https://docs.python.org/3/library/profile.html). PySpark Profilers provide information such as the number of function calls, total time spent in the given function, and filename, as well as line number to help navigation. That information is essential to exposing tight loops in your PySpark programs, and allowing you to make performance improvement decisions.

**Driver profiling**

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process; thus, we can profile it as a normal Python program using cProfile as illustrated below:

```
import cProfile

with cProfile.Profile() as pr:
    # Your code
    pass

pr.print_stats()
```

**Workers profiling**

Executors are distributed on worker nodes in the cluster, which introduces complexity because we need to aggregate profiles. Furthermore, a Python worker process is spawned per executor for PySpark UDF execution, which makes the profiling more intricate.

The UDF profiler, which is introduced in Spark 3.3, overcomes all those obstacles and becomes a major tool to profile workers for PySpark applications. We’ll illustrate how to use the UDF profiler with a simple Pandas UDF example.

First, a PySpark DataFrame with 8,000 rows is generated, as shown below.

```
from pyspark.sql.functions import col, rand

sdf = spark.range(0, 8 * 1000).withColumn(
    'id', (col('id') % 8).cast('integer')  # 1000 rows x 8 groups (if group by 'id')
).withColumn('v', rand())
```

Later, we will group by the id column, which results in 8 groups with 1,000 rows per group.
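The group sizes follow directly from the modulo expression. As a quick sanity check in plain Python (no Spark needed; the variable names are illustrative only), the same `id % 8` mapping over 8,000 integers can be tallied with `collections.Counter`:

```python
from collections import Counter

# Mirror of spark.range(0, 8 * 1000) followed by id % 8, using plain integers.
group_sizes = Counter(i % 8 for i in range(8 * 1000))

num_groups = len(group_sizes)               # distinct values of id % 8
rows_per_group = set(group_sizes.values())  # distinct group sizes
```

Every remainder from 0 to 7 appears equally often, so a function applied once per group runs 8 times while a function applied once per row runs 8,000 times.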
The Pandas UDF plus_one is then created and applied as shown below:

```
import pandas as pd

def plus_one(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.apply(lambda x: x + 1, axis=1)

res = sdf.groupby("id").applyInPandas(plus_one, schema=sdf.schema)
res.collect()
```

Note that plus_one takes a pandas DataFrame and returns another pandas DataFrame. For each group, all columns are passed together as a pandas DataFrame to the plus_one UDF, and the returned pandas DataFrames are combined into a PySpark DataFrame.

Executing the example above and running sc.show_profiles() prints the profile for the UDF. The profile can also be dumped to disk by sc.dump_profiles(path). The UDF id in the profile (271 in the original run) matches that in the Spark plan for res. The Spark plan can be shown by calling res.explain().

The first line in the profile’s body indicates the total number of calls that were monitored. The column heading includes

**•** ncalls, for the number of calls

**•** tottime, for the total time spent in the given function (excluding time spent in calls to sub-functions)

**•** percall, the quotient of tottime divided by ncalls

**•** cumtime, the cumulative time spent in this and all subfunctions (from invocation till exit)

**•** percall, the quotient of cumtime divided by primitive calls

**•** filename:lineno(function), which provides the respective information for each function

Digging into the column details: plus_one is triggered once per group, 8 times in total; _arith_method of pandas Series is called once per row, 8,000 times in total. pandas.DataFrame.apply applies the function lambda x: x + 1 row by row, thus suffering from high invocation overhead.

We can reduce such overhead by substituting the pandas.DataFrame.apply with pdf + 1, which is vectorized in pandas.
The optimized Pandas UDF looks as follows:

```
import pandas as pd

def plus_one_optimized(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf + 1

res = sdf.groupby("id").applyInPandas(plus_one_optimized, schema=sdf.schema)
res.collect()
```

Comparing the updated profile with the previous one, we can summarize the optimizations as follows:

**•** Arithmetic operation from 8,000 calls to 8 calls

**•** Total function calls from 2,898,160 calls to 2,384 calls

**•** Total execution time from 2.300 seconds to 0.004 seconds

The short example above demonstrates how the UDF profiler helps us deeply understand the execution, identify the performance bottleneck and enhance the overall performance of the user-defined function.

The UDF profiler was implemented based on the executor-side profiler, which is designed for PySpark RDD API. The executor-side profiler is available in all active Databricks Runtime versions.

Both the UDF profiler and the executor-side profiler run on Python workers. They are controlled by the spark.python.profile Spark configuration, which is false by default. We can enable it by setting spark.python.profile to true in the Spark configuration of a Databricks Runtime cluster.

**Conclusion**

PySpark profilers are implemented based on cProfile; thus, the profile reporting relies on the [Stats](https://docs.python.org/3/library/profile.html#the-stats-class) class. [Spark Accumulators](https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators) also play an important role when collecting profile reports from Python workers.

Powerful profilers are provided by PySpark in order to identify hot loops and suggest potential improvements. They are easy to use and critical to enhance the performance of PySpark programs. The UDF profiler, which is available starting from Databricks Runtime 11.0 (Spark 3.3), overcomes all the technical challenges and brings insights to user-defined functions.
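Because the reports are built on the standard `Stats` class, the effect of per-row invocation can be explored locally with nothing but the Python standard library. The sketch below is a stand-in for the UDF example, not the PySpark UDF profiler itself: it profiles plain Python functions (the hypothetical `plus_one_rowwise` and `plus_one_batched`) on the driver and contrasts one function call per element with one call per group.

```python
import cProfile
import io
import pstats


def add_one(x):
    return x + 1


def plus_one_rowwise(group):
    # One Python-level function call per element, analogous to a row-by-row apply.
    return [add_one(x) for x in group]


def plus_one_batched(group):
    # The arithmetic happens inline, so there is no per-element function call.
    return [x + 1 for x in group]


def profile_calls(func, groups):
    """Run func over every group under cProfile and return a pstats.Stats object."""
    profiler = cProfile.Profile()
    profiler.enable()
    for group in groups:
        func(group)
    profiler.disable()
    # Send any printed report to an in-memory buffer instead of stdout.
    return pstats.Stats(profiler, stream=io.StringIO())


groups = [list(range(1000)) for _ in range(8)]  # 8 groups x 1,000 values

rowwise_stats = profile_calls(plus_one_rowwise, groups)
batched_stats = profile_calls(plus_one_batched, groups)

# total_calls is the number reported at the top of a printed profile.
rowwise_calls = rowwise_stats.total_calls
batched_calls = batched_stats.total_calls
```

Calling `rowwise_stats.sort_stats("cumulative").print_stats(5)` writes the familiar table (ncalls, tottime, percall, cumtime, filename:lineno(function)) to the stream. The row-wise version records at least 8,000 calls for `add_one`, while the batched version records only a handful, which mirrors the shape of the improvement described for the UDF, though not its exact numbers.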
In addition, there is an ongoing effort in the Apache Spark™ open source community to introduce memory profiling on executors; see [SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281) for more information.

-----

SECTION 2.3  **Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka**

by **Frank Munz**

August 9, 2022

[Delta Live Tables (DLT)](https://databricks.com/product/delta-live-tables) is the first ETL framework that uses a simple declarative approach for creating reliable data pipelines and fully manages the underlying infrastructure at scale for batch and [streaming data](https://www.databricks.com/product/data-streaming). Many use cases require actionable insights derived from near real-time data. Delta Live Tables enables low-latency streaming data pipelines to support such use cases by directly ingesting data from event buses like [Apache Kafka](https://kafka.apache.org/), [AWS Kinesis](https://aws.amazon.com/kinesis/), [Confluent Cloud](https://www.confluent.io/confluent-cloud), [Amazon MSK](https://www.youtube.com/watch?v=HtU9pb18g5Q), or [Azure Event Hubs](https://docs.microsoft.com/en-us/azure/event-hubs/).

This article will walk through using DLT with Apache Kafka while providing the required Python code to ingest streams. The recommended system architecture will be explained, and related DLT settings worth considering will be explored along the way.

**Streaming platforms**

Event buses or message buses decouple message producers from consumers. A popular streaming use case is the collection of click-through data from users navigating a website where every user interaction is stored as an event in Apache Kafka. The event stream from Kafka is then used for real-time streaming data analytics.
Multiple message consumers can read the same data from Kafka and use the data to learn about audience interests, conversion rates, and bounce reasons. The real-time, streaming event data from the user interactions often also needs to be correlated with actual purchases stored in a billing database.

**Apache Kafka**

[Apache Kafka](https://kafka.apache.org/) is a popular open source event bus. Kafka uses the concept of a topic, an append-only distributed log of events where messages are buffered for a certain amount of time. Although messages in Kafka are not deleted once they are consumed, they are also not stored indefinitely. The message retention for Kafka can be configured per topic and defaults to 7 days. Expired messages will be deleted eventually.

This article is centered around Apache Kafka; however, the concepts discussed also apply to many other event buses or messaging systems.

**Streaming data pipelines**

In a data flow pipeline, Delta Live Tables and their dependencies can be declared with a standard SQL Create Table As Select (CTAS) statement and the DLT keyword “live.”

When developing DLT with Python, the @dlt.table decorator is used to create a Delta Live Table. To ensure the data quality in a pipeline, DLT uses [Expectations](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-expectations.html), which are simple SQL constraint clauses that define the pipeline’s behavior with invalid records.

Since streaming workloads often come with unpredictable data volumes, Databricks employs [enhanced autoscaling](https://databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performance-optimizations.html) for data flow pipelines to minimize the overall end-to-end latency while reducing cost by shutting down unnecessary infrastructure.

**Delta Live Tables** are fully recomputed, in the right order, exactly once for each pipeline run.
In contrast, **streaming Delta Live Tables** are stateful, incrementally computed and only process data that has been added since the last pipeline run. If the query that defines a streaming live table changes, new data will be processed based on the new query but existing data is not recomputed. Streaming live tables always use a streaming source and only work over append-only streams, such as Kafka, Kinesis, or Auto Loader. Streaming DLTs are built on top of Spark Structured Streaming.

You can chain multiple streaming pipelines, for example, for workloads with very large data volume and low latency requirements.

**Direct ingestion from streaming engines**

Delta Live Tables written in Python can directly ingest data from an event bus like Kafka using Spark Structured Streaming. You can set a short retention period for the Kafka topic to avoid compliance issues and reduce costs, and then benefit from the cheap, elastic and governable storage that Delta provides.

As a first step in the pipeline, we recommend ingesting the data as is to a Bronze (raw) table and avoiding complex transformations that could drop important data. Like any Delta table, the Bronze table will retain the history and allow you to perform GDPR and other compliance tasks.

**Ingest streaming data from Apache Kafka**

When writing DLT pipelines in Python, you use the @dlt.table annotation to create a DLT table. There is no special attribute to mark streaming DLTs in Python; simply use spark.readStream to access the stream. Example code for creating a DLT table with the name kafka_bronze that is consuming data from a Kafka topic looks as follows:

```
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

TOPIC = "tracker-events"
KAFKA_BROKER = spark.conf.get("KAFKA_SERVER")

# subscribe to TOPIC at KAFKA_BROKER
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .load()
)

@dlt.table(table_properties={"pipelines.reset.allowed": "false"})
def kafka_bronze():
    return raw_kafka_events
```

**pipelines.reset.allowed**

Note that event buses typically expire messages after a certain period of time, whereas Delta is designed for infinite retention.

This might lead to the effect that source data on Kafka has already been deleted when running a full refresh for a DLT pipeline. In this case, not all historic data could be backfilled from the messaging platform, and data would be missing in DLT tables. To prevent dropping data, use the following DLT table property:

pipelines.reset.allowed=false

Setting pipelines.reset.allowed to false prevents refreshes to the table but does not prevent incremental writes to the table or new data from flowing into the table.

**Checkpointing**

If you are an experienced Spark Structured Streaming developer, you will notice the absence of checkpointing in the above code. In Spark Structured Streaming, checkpointing is required to persist progress information about what data has been successfully processed; upon failure, this metadata is used to restart a failed query exactly where it left off.

Whereas checkpoints are necessary for failure recovery with exactly-once guarantees in Spark Structured Streaming, DLT handles state automatically without any manual configuration or explicit checkpointing required.

**Mixing SQL and Python for a DLT pipeline**

A DLT pipeline can consist of multiple notebooks but one DLT notebook is required to be written entirely in either SQL or Python (unlike other Databricks notebooks where you can have cells of different languages in a single notebook).
Now, if your preference is SQL, you can code the data ingestion from Apache Kafka in one notebook in Python and then implement the transformation logic of your data pipelines in another notebook in SQL.

**Schema mapping**

When reading data from a messaging platform, the data stream is opaque and a schema has to be provided.

The Python example below shows the schema definition of events from a fitness tracker, and how the value part of the [Kafka message is mapped](https://docs.databricks.com/spark/latest/structured-streaming/kafka.html) to that schema.

```
event_schema = StructType([
    StructField("time", TimestampType(), True),
    StructField("version", StringType(), True),
    StructField("model", StringType(), True),
    StructField("heart_bpm", IntegerType(), True),
    StructField("kcal", IntegerType(), True)
])

# temporary table, visible in pipeline but not in data browser,
# cannot be queried interactively
@dlt.table(comment="real schema for Kafka payload", temporary=True)
def kafka_silver():
    return (
        # kafka streams are (timestamp,value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
        .select(col("timestamp"), from_json(col("value").cast("string"), event_schema).alias("event"))
        .select("timestamp", "event.*")
    )
```

**Benefits**

Reading streaming data in DLT directly from a message broker minimizes the architectural complexity and provides lower end-to-end latency since data is directly streamed from the messaging broker and no intermediary step is involved.

**Streaming ingest with cloud object store intermediary**

For some specific use cases, you may want to offload data from Apache Kafka, e.g., using a Kafka connector, and store your streaming data in a cloud object store intermediary. In a Databricks workspace, the cloud vendor-specific object store can then be mapped via the Databricks File System (DBFS) as a cloud-independent folder.
Once the data is offloaded, [Databricks Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) can ingest the files.

Auto Loader can ingest data with a single line of SQL code. The syntax to ingest JSON files into a DLT table is shown below (it is wrapped across two lines for readability).

    -- INGEST with Auto Loader
    create or replace streaming live table raw
    as select * FROM cloud_files("dbfs:/data/twitter", "json")

Note that Auto Loader itself is a streaming data source and all newly arrived files will be processed exactly once, hence the streaming keyword for the raw table that indicates data is ingested incrementally to that table.

Since offloading streaming data to a cloud object store introduces an additional step in your system architecture, it will also increase the end-to-end latency and create additional storage costs. Keep in mind that the Kafka connector writing event data to the cloud object store needs to be managed, increasing operational complexity.

Therefore Databricks recommends as a best practice to directly access event bus data from DLT using [Spark Structured Streaming](https://www.databricks.com/blog/2022/08/09/low-latency-streaming-data-pipelines-with-delta-live-tables-and-apache-kafka.html#described) as described above.

**Other event buses or messaging systems**

This article is centered around Apache Kafka; however, the concepts discussed also apply to other event buses or messaging systems. DLT supports any data source that Databricks Runtime directly supports.

**Amazon Kinesis**

In Kinesis, you write messages to a fully managed serverless stream. As with Kafka, Kinesis does not permanently store messages. The default message retention in Kinesis is one day.

When using Amazon Kinesis, replace format("kafka") with format("kinesis") in the Python code for streaming ingestion above and add Amazon Kinesis-specific settings with option().
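A minimal sketch of that substitution follows. It rests on assumptions not stated in this article: the option names `streamName`, `region`, and `initialPosition` are taken to be the Kinesis connector's settings and should be checked against the connector documentation, and the stream name and region are hypothetical. The helper functions are illustrative names, not part of DLT or Spark.

```python
def kinesis_reader_options(stream_name, region, initial_position="earliest"):
    """Collect the Kinesis-specific settings that are passed to the stream reader."""
    return {
        "streamName": stream_name,            # assumed connector option name
        "region": region,                     # assumed connector option name
        "initialPosition": initial_position,  # assumed connector option name
    }


def read_kinesis_events(spark, options):
    """Same shape as the Kafka reader above, with format("kinesis") swapped in."""
    return (
        spark.readStream
        .format("kinesis")
        .options(**options)
        .load()
    )


# Hypothetical stream name and region, for illustration only.
options = kinesis_reader_options("tracker-events", "us-west-2")
```

Inside a DLT notebook, the result of `read_kinesis_events(spark, options)` would be returned from a function decorated with `@dlt.table`, in the same way `raw_kafka_events` is returned from `kafka_bronze` above.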
For more information, check the section about Kinesis Integration in the Spark Structured Streaming documentation. **Azure Event Hubs** For Azure Event Hubs settings, check the official [documentation at Microsoft](https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-kafka-spark-tutorial) and the article [Delta Live Tables recipes: Consuming from Azure Event Hubs](https://alexott.blogspot.com/2022/06/delta-live-tables-recipes-consuming.html) . **Summary** DLT is much more than just the “T” in ETL. With DLT, you can easily ingest from streaming and batch sources, cleanse and transform data on the Databricks Lakehouse Platform on any cloud with guaranteed data quality. Data from Apache Kafka can be ingested by directly connecting to a Kafka broker from a DLT notebook in Python. Data loss can be prevented for a full pipeline refresh even when the source data in the Kafka streaming layer expired. **Get started** If you are a Databricks customer, simply follow the [guide to get started](https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables) . Read the release notes to learn more about what’s included in this GA release. If you are not an existing Databricks customer, [sign up for a free trial](https://www.databricks.com/try-databricks) , and you can view our detailed [DLT pricing here](https://www.databricks.com/product/pricing) . Join the conversation in the [Databricks Community](https://community.databricks.com/s/topic/0TO8Y000000VJEhWAO/summit22) where data-obsessed peers are chatting about Data + AI Summit 2022 announcements and updates. Learn. Network. Last but not least, enjoy the [Dive Deeper into Data Engineering](https://youtu.be/uhZabeKxXBw) session from the summit. In that session, I walk you through the code of another streaming data example with a Twitter livestream, Auto Loader, Delta Live Tables in SQL, and Hugging Face sentiment analysis. 
-----

SECTION 2.4  **Streaming in Production: Collected Best Practices**

by **Angela Chu** and **Tristen Wentling**

December 12, 2022

Releasing any data pipeline or application into a production state requires planning, testing, monitoring, and maintenance. Streaming pipelines are no different in this regard; in this blog we present some of the most important considerations for deploying streaming pipelines and applications to a production environment.

At Databricks, we offer two different ways of building and running streaming pipelines and applications: [Delta Live Tables (DLT)](https://www.databricks.com/product/delta-live-tables) and [Databricks Workflows](https://www.databricks.com/product/workflows). DLT is our flagship, fully managed ETL product that supports both batch and streaming pipelines. It offers declarative development, automated operations, data quality, advanced observability capabilities, and more. Workflows enable customers to run Apache Spark™ workloads in Databricks’ optimized runtime environment (i.e., Photon) with access to unified governance (Unity Catalog) and storage (Delta Lake). Regarding streaming workloads, both DLT and Workflows share the same core streaming engine: Spark Structured Streaming. In the case of DLT, customers program against the DLT API and DLT uses the Structured Streaming engine under the hood. In the case of Jobs, customers program against the Spark API directly.

The recommendations in this blog post are written from the Structured Streaming engine perspective, most of which apply to both DLT and Workflows (although DLT does take care of some of these automatically, like Triggers and Checkpoints). We group the recommendations under the headings “Before Deployment” and “After Deployment” to highlight when these concepts will need to be applied and are releasing this blog series with this split between the two.
There will be additional deep-dive content for some of the sections beyond as well. We recommend reading all sections before beginning work to productionalize a streaming pipeline or application, and revisiting these recommendations as you promote it from dev to QA and eventually production.

**Before deployment**

There are many things you need to consider when creating your streaming application to improve the production experience. Some of these topics, like unit testing, checkpoints, triggers, and state management, will determine how your streaming application performs. Others, like naming conventions and how many streams to run on which clusters, have more to do with managing multiple streaming applications in the same environment.

**Unit testing**

The cost associated with finding and fixing a bug goes up exponentially the farther along you get in the SDLC process, and a Structured Streaming application is no different. When you’re turning that prototype into a hardened production pipeline you need a CI/CD process with built-in tests. So how do you create those tests?

At first you might think that unit testing a streaming pipeline requires something special, but that isn’t the case. The general guidance for streaming pipelines is no different than [guidance you may have heard for Spark batch jobs](https://docs.databricks.com/notebooks/testing.html). It starts by organizing your code so that it can be unit tested effectively:

**•** Divide your code into testable chunks

**•** Organize your business logic into functions calling other functions. If you have a lot of logic in a [foreachBatch](https://docs.databricks.com/structured-streaming/foreach.html) or you’ve implemented [mapGroupsWithState](https://docs.databricks.com/structured-streaming/initial-state-map-groups-with-state.html) or flatMapGroupsWithState, organize that code into multiple functions that can be individually tested.

**•** Do not code in dependencies on the global state or external systems

**•** Any function manipulating a DataFrame or data set should be organized to take the DataFrame/data set/configuration as input and output the DataFrame/data set

Once your code is separated out in a logical manner you can implement unit tests for each of your functions. Spark-agnostic functions can be tested like any other function in that language. For testing UDFs and functions with DataFrames and data sets, there are multiple Spark testing frameworks available. These frameworks support all of the DataFrame/data set APIs so that you can easily create input, and they have specialized assertions that allow you to compare DataFrame content and schemas. Some examples are:

**•** The built-in Spark test suite, designed to test all parts of Spark

**•** spark-testing-base, which has support for both Scala and Python

**•** spark-fast-tests, for testing Scala Spark 2 & 3

**•** chispa, a Python version of spark-fast-tests

Code examples for each of these libraries can be found [here](https://github.com/alexott/spark-playground/tree/master/testing).

But wait! I’m testing a streaming application here. Don’t I need to make streaming DataFrames for my unit tests? The answer is no; you do not! Even though a streaming DataFrame represents a data set with no defined ending, when functions are executed on it they are executed on a microbatch, which is a discrete set of data. You can use the same unit tests that you would use for a batch application, for both stateless and stateful streams.

One of the advantages of Structured Streaming over other frameworks is the ability to use the same transformation code for both streaming and batch operations writing to the same sink. This allows you to simplify some operations, like backfilling data, for example, where rather than trying to sync the logic between two different applications, you can just modify the input sources and write to the same destination.
If the sink is a Delta table, you can even do these operations concurrently if both processes are append-only operations.

**Triggers**

Now that you know your code works, you need to determine how often your stream will look for new data. This is where [triggers](https://docs.databricks.com/structured-streaming/triggers.html) come in. Setting a trigger is one of the options for the writeStream command, and it looks like this:

    // Scala/Java
    .trigger(Trigger.ProcessingTime("30 seconds"))

    # Python
    .trigger(processingTime='30 seconds')

In the above example, if a microbatch completes in less than 30 seconds, then the engine will wait for the rest of the time before kicking off the next microbatch. If a microbatch takes longer than 30 seconds to complete, then the engine will start the next microbatch immediately after the previous one finishes.

The two factors you should consider when setting your trigger interval are how long you expect your stream to process a microbatch and how often you want the system to check for new data. You can lower the overall processing latency by using a shorter trigger interval and increasing the resources available for the streaming query by adding more workers or using compute or memory optimized instances tailored to your application’s performance. These increased resources come with increased costs, so if your goal is to minimize costs, then a longer trigger interval with less compute can work. Normally you would not set a trigger interval longer than what it would typically take for your stream to process a microbatch in order to maximize resource utilization, but setting the interval longer would make sense if your stream is running on a shared cluster and you don’t want it to constantly take the cluster resources.

If you do not need your stream to run continuously, either because data doesn’t come that often or your SLA is 10 minutes or greater, then you can use the Trigger.Once option. This option will start up the stream, check for anything new since the last time it ran, process it all in one big batch, and then shut down. Just like with a continuously running stream, when using Trigger.Once the checkpoint that guarantees fault tolerance (see below) will guarantee exactly-once processing.

Spark has a new version of Trigger.Once called Trigger.AvailableNow. While Trigger.Once will process everything in one big batch, which depending on your data size may not be ideal, Trigger.AvailableNow will split up the data based on maxFilesPerTrigger and maxBytesPerTrigger settings. This allows the data to be processed in multiple batches. Those settings are ignored with Trigger.Once. You can see examples for setting triggers [here](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers).

**Pop quiz:** how do you turn your streaming process into a batch process that automatically keeps track of where it left off with just one line of code?

**Answer:** change your processing time trigger to Trigger.Once/Trigger.AvailableNow! Exact same code, running on a schedule, that will neither miss nor reprocess any records.

**Name your stream**

You name your children, you name your pets, now it’s time to name your streams. There’s a writeStream option called .queryName that allows you to provide a friendly name for your stream. Why bother? Well, suppose you don’t name it. In that case, all you’ll have to go on in the Structured Streaming tab in the Spark UI is a default placeholder string and the unintelligible GUID that is automatically generated as the stream’s unique identifier. If you have more than one stream running on a cluster, and all of them have the same placeholder and unintelligible strings as identifiers, how do you find the one you want? If you’re exporting metrics how do you tell which is which?

Make it easy on yourself, and name your streams.
When you’re managing them in production you’ll be glad you did, and while you’re at it, go and name your batch queries in any foreachBatch() code you have.

**Fault tolerance**

How does your stream recover from being shut down? There are a few different cases where this can come into play, like cluster node failures or intentional halts, but the solution is to set up checkpointing. Checkpoints with write-ahead logs provide a degree of protection from your streaming application being interrupted, ensuring it will be able to pick up again where it last left off.

Checkpoints store the current offsets and state values (e.g., aggregate values) for your stream. Checkpoints are stream specific so each should be set to its own location. Doing this will let you recover more gracefully from shutdowns, failures from your application code or unexpected cloud provider failures or limitations.

To configure checkpoints, add the checkpointLocation option to your stream definition:

    // Scala/Java/Python
    streamingDataFrame.writeStream
      .format("delta")
      .option("path", "")
      .queryName("TestStream")
      .option("checkpointLocation", "")
      .start()

To keep it simple: every time you call .writeStream, you must specify the checkpoint option with a unique checkpoint location. Even if you’re using foreachBatch and the writeStream itself doesn’t specify a path or table option, you must still specify that checkpoint. It’s how Spark Structured Streaming gives you hassle-free fault tolerance.

Efforts to manage the checkpointing in your stream should be of little concern in general. As [Tathagata Das has said](https://youtu.be/rl8dIzTpxrI?t=454), “The simplest way to perform streaming analytics is not having to reason about streaming at all.” That said, one setting deserves mention as questions around the maintenance of checkpoint files come up occasionally.
Though it is an internal setting that doesn’t require direct configuration, the setting spark.sql.streaming.minBatchesToRetain (default 100) controls the number of checkpoint files that get created. Basically, the number of files will be roughly this number times two, as there is a file created noting the offsets at the beginning of the batch (offsets, a.k.a. write-ahead logs) and another on completing the batch (commits). The number of files is checked periodically for cleanup as part of the internal processes. This simplifies at least one aspect of long-term streaming application maintenance for you.

It is also important to note that some changes to your application code can invalidate the checkpoint. Checking for any of these changes during code reviews before deployment is recommended. You can find examples of changes where this can happen in [Recovery Semantics after Changes in a Streaming Query](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#recovery-semantics-after-changes-in-a-streaming-query). If you want to look at checkpointing in more detail or consider whether asynchronous checkpointing might improve the latency in your streaming application, these topics are covered in greater depth in [Speed Up Streaming Queries With Asynchronous State Checkpointing](https://www.databricks.com/blog/2022/05/02/speed-up-streaming-queries-with-asynchronous-state-checkpointing.html).

**State management and RocksDB**

Stateful streaming applications are those where current records may depend on previous events, so Spark has to retain data in between microbatches. The data it retains is called state, and Spark will store it in a state store and read, update and delete it during each microbatch.
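To make the idea concrete, here is a toy model in plain Python, not Spark's state store API: a dictionary plays the role of the state store, and the hypothetical `process_microbatch` function reads, updates, and deletes entries as each microbatch of `(key, value)` events arrives. The `"close"` marker used to expire a key is likewise invented for the illustration.

```python
def process_microbatch(state, batch):
    """Update running per-key totals in `state` with one microbatch of (key, value) events."""
    for key, value in batch:
        if value == "close":
            # Delete: the key is finished, so its state can be dropped.
            state.pop(key, None)
        else:
            # Read the previous total (0 if unseen) and write back the update.
            state[key] = state.get(key, 0) + value
    return state


state_store = {}  # survives between microbatches, unlike the batches themselves
microbatches = [
    [("user_a", 1), ("user_b", 2)],
    [("user_a", 3)],
    [("user_b", "close"), ("user_c", 5)],
]

snapshots = []
for batch in microbatches:
    process_microbatch(state_store, batch)
    snapshots.append(dict(state_store))  # copy of the state after each microbatch
```

The second snapshot shows `user_a` at 4 only because the total from the first microbatch was retained. The number of entries tracks the number of distinct keys that are still open, which is why key cardinality drives the size of the state.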
Typical stateful operations are streaming aggregations, streaming dropDuplicates, stream-stream joins, mapGroupsWithState, or flatMapGroupsWithState. Common examples where you’ll need to think about your application state are sessionization or hourly aggregation using group by methods to calculate business metrics. Each record in the state store is identified by a key that is used as part of the stateful computation, and the more unique keys that are required, the larger the amount of state data that will be stored.

When the amount of state data needed to enable these stateful operations grows large and complex, it can degrade your workloads’ performance, leading to increased latency or even failures. A typical indicator of the state store being the culprit of added latency is large amounts of time spent in garbage collection (GC) pauses in the JVM. If you are monitoring the microbatch processing time, this could look like a continual increase or wildly varying processing time across microbatches.

The default configuration for a state store, which is sufficient for most general streaming workloads, is to store the state data in the executors’ JVM memory. A large number of keys (typically millions, see the Monitoring & Instrumentation section in part 2 of this blog) can add excessive memory pressure on the machine and increase the frequency of hitting these GC pauses as the JVM tries to free up resources.

On the Databricks Runtime (now also supported in [Apache Spark 3.2+](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#rocksdb-state-store-implementation)) you can use [RocksDB](http://rocksdb.org/) as an alternative state store provider to alleviate this source of memory pressure. RocksDB is an embeddable persistent key-value store for fast storage. It features high performance through a log-structured database engine written entirely in C++ and optimized for fast, low-latency storage.
Leveraging RocksDB as the state store provider still uses machine memory but no longer occupies space in the JVM and makes for a more efficient state management system for large amounts of keys. This doesn’t come for free, however, as it introduces an extra step in processing every microbatch. Introducing RocksDB shouldn’t be expected to reduce latency except when it is related to memory pressure from state data storage in the JVM. The RocksDB-backed state store still provides the same degree of fault tolerance as the regular state storage as it is included in the stream checkpointing.

RocksDB configuration, like checkpoint configuration, is minimal by design and so you only need to declare it in your overall Spark configuration:

    spark.conf.set(
      "spark.sql.streaming.stateStore.providerClass",
      "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")

If you are monitoring your stream using the StreamingQueryListener class, then you will also notice that RocksDB metrics will be included in the stateOperators field. For more detailed information on this see the [RocksDB State Store Metrics section](https://docs.databricks.com/spark/latest/structured-streaming/production.html#rocksdb-state-store-metrics) of “Structured Streaming in Production.”

It’s worth noting that large numbers of keys can have other adverse impacts in addition to raising memory consumption, especially with unbounded or nonexpiring state keys. With or without RocksDB, the state from the application also gets backed up in checkpoints for fault tolerance. So if state is created in a way that never expires, you will keep accumulating files in the checkpoint, increasing the amount of storage required and potentially the time to write it or recover from failures as well.
For the data in memory (see the Monitoring & Instrumentation section in part 2 of this blog) this situation can lead to somewhat vague out-of-memory errors, and for the checkpointed data written to cloud storage you might observe unexpected and unreasonable growth. Unless you have a business need to retain streaming state for all the data that has been processed (and that is rare), read the [Spark Structured Streaming documentation](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) and make sure to implement your stateful operations so that the system can drop state records that are no longer needed (pay close attention to dropDuplicates and stream-stream joins).

**Running multiple streams on a cluster**

Once your streams are fully tested and configured, it’s time to figure out how to organize them in production. It’s a common pattern to stack multiple streams on the same Spark cluster to maximize resource utilization and save cost. This is fine to a point, but there are limits to how much you can add to one cluster before performance is affected. The driver has to manage all of the streams running on the cluster, and all streams will compete for the same cores across the workers. You need to understand what your streams are doing and plan your capacity appropriately to stack effectively.

Here is what you should take into account when you’re planning on stacking multiple streams on the same cluster:

**•** Make sure your driver is big enough to manage all of your streams. Is your driver struggling with a high CPU utilization and garbage collection? That means it’s struggling to manage all of your streams. Either reduce the number of streams or increase the size of your driver.

**•** Consider the amount of data each stream is processing.
The more data you are ingesting and writing to a sink, the more cores you will need in order to maximize your throughput for each stream. You’ll need to reduce the number of streams or increase the number of workers depending on how much data is being processed. For sources like Kafka you will need to configure how many cores are being used to ingest with the minPartitions option if you don’t have enough cores for all of the partitions across all of your streams.

**•** Consider the complexity and data volume of your streams. If all of the streams are doing minimal manipulation and just appending to a sink, then each stream will need fewer resources per microbatch and you’ll be able to stack more. If the streams are doing stateful processing or computation/memory-intensive operations, that will require more resources for good performance and you’ll want to stack fewer streams.

**•** Consider [scheduler pools](https://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools). When stacking streams they will all be contending for the same workers and cores, and one stream that needs a lot of cores will cause the other streams to wait. Scheduler pools enable you to have different streams execute on different parts of the cluster. This will enable streams to execute in parallel with a subset of the available resources.

**•** Consider your SLA. If you have mission critical streams, isolate them as a best practice so lower-criticality streams do not affect them.

On Databricks we typically see customers stack between 10-30 streams on a cluster, but this varies depending on the use case. Consider the factors above so that you can have a good experience with performance, cost and maintainability.

**Conclusion**

Some of the ideas we’ve addressed here certainly deserve their own time and special treatment with a more in-depth discussion, which you can look forward to in later deep dives. However, we hope these recommendations are useful as you begin your journey or seek to enhance your production streaming experience. Be sure to continue with the next post, “Streaming in Production: Collected Best Practices, Part 2.”

**[Review Databricks’ Structured Streaming Getting Started Guide](https://www.databricks.com/spark/getting-started-with-apache-spark/streaming)**

**Start experimenting with these free Databricks notebooks.**

-----

SECTION 2.5

**Streaming in Production: Collected Best Practices, Part 2**

by **Angela Chu** and **Tristen Wentling**

January 10, 2023

This is the second article in our two-part blog series titled “Streaming in Production: Collected Best Practices.” Here we discuss the “After Deployment” considerations for a Structured Streaming Pipeline. The majority of the suggestions in this post are relevant to both Structured Streaming Jobs and Delta Live Tables (our flagship and fully managed ETL product that supports both batch and streaming pipelines).

**After deployment**

After the deployment of your streaming application, there are typically three main things you’ll want to know:

**•** How is my application running?

**•** Are resources being used efficiently?

**•** How do I manage any problems that come up?

We’ll start with an introduction to these topics, followed by a deeper dive later in this blog series.

**Monitoring and instrumentation (How is my application running?)**

Streaming workloads should be pretty much hands-off once deployed to production.
However, one thing that may sometimes come to mind is: “how is my application running?” Monitoring applications can take on different levels and forms depending on:

**•** the metrics collected for your application (batch duration/latency, throughput, …)

**•** where you want to monitor the application from

At the simplest level, there is a streaming dashboard ([A Look at the New Structured Streaming UI](https://www.databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html)) and built-in logging directly in the Spark UI that can be used in a variety of situations. This is in addition to setting up failure alerts on jobs running streaming workloads. If you want more fine-grained metrics or to create custom actions based on these metrics as part of your code base, then the StreamingQueryListener is better aligned with what you’re looking for.

If you want the Spark metrics to be reported (including machine level traces for drivers or workers) you should use the platform’s [metrics sink](https://spark.apache.org/docs/latest/monitoring.html#metrics).

Figure: The Apache Spark Structured Streaming UI

Another point to consider is where you want to surface these metrics for observability. There is a Ganglia dashboard at the cluster level, integrated partner applications like [Datadog](https://www.datadoghq.com/blog/databricks-monitoring-datadog/) for monitoring streaming workloads, or even more open source options you can build using tools like Prometheus and Grafana. Each has advantages and disadvantages to consider around cost, performance, and maintenance requirements.

Whether you have low volumes of streaming workloads where interactions in the UI are sufficient or have decided to invest in a more robust monitoring platform, you should know how to observe your production streaming workloads.
Further “Monitoring and Alerting” posts later in this series will contain a more thorough discussion. In particular, we’ll see different measures on which to monitor streaming applications and then later take a deeper look at some of the tools you can leverage for observability.

**Application optimization (Are resources being used effectively? Think “cost”)**

The next concern we have after deploying to production is “is my application using resources effectively?” As developers, we understand (or quickly learn) the distinction between working code and well-written code. Improving the way your code runs is usually very satisfying, but what ultimately matters is the overall cost of running it. Cost considerations for Structured Streaming applications will be largely similar to those for other Spark applications. One notable difference is that failing to optimize for production workloads can be extremely costly, as these workloads are frequently “always-on” applications, and thus wasted expenditure can quickly compound. Because assistance with cost optimization is frequently requested, a separate post in this series will address it. The key points that we’ll focus on will be efficiency of usage and sizing.

Getting the cluster sizing right is one of the most significant differences between efficiency and wastefulness in streaming applications. This can be particularly tricky because in some cases it’s difficult to estimate the full load conditions of the application in production before it’s actually there. In other cases, it may be difficult due to natural variations in volume handled throughout the day, week, or year. When first deploying, it can be beneficial to oversize slightly, incurring the extra expense to avoid inducing performance bottlenecks. Utilize the monitoring tools you chose to employ after the cluster has been running for a few weeks to ensure proper cluster utilization.
For example, are CPU and memory heavily used during peak load, or is the load generally small enough that the cluster could be downsized? Maintain regular monitoring of this and keep an eye out for changes in data volume over time; if either changes, a cluster resize may be required to maintain cost-effective operation.

As a general guideline, you should avoid excessive shuffle operations, joins, or an excessive or extreme watermark threshold (don’t exceed your needs), as each can increase the number of resources you need to run your application. A large watermark threshold will cause Structured Streaming to keep more data in the state store between batches, leading to an increase in memory requirements across the cluster. Also, pay attention to the type of VM configured: are you using memory-optimized for your memory-intense stream? Compute-optimized for your computationally-intensive stream? If not, look at the utilization levels for each and consider trying a machine type that could be a better fit. Newer families of servers from cloud providers with more optimal CPUs often lead to faster execution, meaning you might need fewer of them to meet your SLA.

**Troubleshooting (How do I manage any problems that come up?)**

The last question we ask ourselves after deployment is “how do I manage any problems that come up?” As with cost optimization, troubleshooting streaming applications in Spark often looks the same as other applications since most of the mechanics remain the same under the hood. For streaming applications, issues usually fall into two categories: failure scenarios and latency scenarios.

**Failure scenarios**

Failure scenarios typically manifest with the stream stopping with an error, executors failing or a driver failure causing the whole cluster to fail. Common causes for this are:

**•** Too many streams running on the same cluster, causing the driver to be overwhelmed.
On Databricks, this can be seen in Ganglia, where the driver node will show up as overloaded before the cluster fails.

**•** Too few workers in a cluster or a worker size with too small of a core-to-memory ratio, causing executors to fail with an Out Of Memory error. This can also be seen on Databricks in Ganglia before an executor fails, or in the Spark UI under the executors tab.

**•** Using a collect to send too much data to the driver, causing it to fail with an Out Of Memory error.

**Latency scenarios**

For latency scenarios, your stream will not execute as fast as you want or expect. A latency issue can be intermittent or constant. Too many streams or too small of a cluster can be the cause of this as well. Some other common causes are:

**•** Data skew: when a few tasks end up with much more data than the rest of the tasks. With skewed data, these tasks take longer to execute than the others, often spilling to disk. Your stream can only run as fast as its slowest task.

**•** Executing a stateful query without defining a watermark or defining a very long one will cause your state to grow very large, slowing down your stream over time and potentially leading to failure.

**•** Poorly optimized sink. For example, performing a merge into an over-partitioned Delta table as part of your stream.

**•** Stable but high latency (batch execution time). Depending on the cause, adding more workers to increase the number of cores concurrently available for Spark tasks can help. Increasing the number of input partitions and/or decreasing the load per core through batch size settings can also reduce the latency.

Just like troubleshooting a batch job, you’ll use Ganglia to check cluster utilization and the Spark UI to find performance bottlenecks.
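As a rough illustration of the data skew cause listed above, the sketch below compares the busiest task against the median task. It is plain Python working on per-task row counts that you would read off the Spark UI stage view yourself; the `skew_ratio` helper, the sample numbers and the idea of treating a large ratio as suspicious are assumptions for illustration, not a Spark feature.

```python
from statistics import median


def skew_ratio(rows_per_task):
    """Return max/median rows per task; values far above 1 suggest skew."""
    if not rows_per_task:
        raise ValueError("rows_per_task must not be empty")
    mid = median(rows_per_task)
    if mid == 0:
        # All-empty tasks are not skewed; any non-empty task among empty ones is.
        return float("inf") if max(rows_per_task) > 0 else 1.0
    return max(rows_per_task) / mid


# Hypothetical row counts for the tasks of one stage.
balanced = [1000, 1100, 950, 1050]
skewed = [1000, 1100, 950, 25000]

print(round(skew_ratio(balanced), 2))
print(round(skew_ratio(skewed), 2))
```

Because a microbatch finishes only when its slowest task does, a single task carrying many times the median load is usually worth investigating before adding more workers.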
There is a specific [Structured Streaming tab](https://www.databricks.com/blog/2020/07/29/a-look-at-the-new-structured-streaming-ui-in-apache-spark-3-0.html) in the Spark UI created to help monitor and troubleshoot streaming applications. On that tab each stream that is running will be listed, and you’ll see either your stream name if you named your stream or a placeholder if you didn’t. You’ll also see a stream ID that will be visible on the Jobs tab of the Spark UI so that you can tell which jobs are for a given stream.

You’ll notice above we said which jobs are for a given stream. It’s a common misconception that if you were to look at a streaming application in the Spark UI you would just see one job in the Jobs tab running continuously. Instead, depending on your code, you will see one or more jobs that start and complete for each microbatch. Each job will have the stream ID from the Structured Streaming tab and a microbatch number in the description, so you’ll be able to tell which jobs go with which stream. You can click into those jobs to find the longest running stages and tasks, check for disk spills, and search by Job ID in the SQL tab to find the slowest queries and check their explain plans.

Figure: The Jobs tab in the Apache Spark UI

If you click on your stream in the Structured Streaming tab you’ll see how much time the different streaming operations are taking for each microbatch, such as adding a batch, query planning and committing (see earlier screenshot of the Apache Spark Structured Streaming UI). You can also see how many rows are being processed as well as the size of your state store for a stateful stream. This can give insights into where potential latency issues are.

We will go more in-depth with troubleshooting later in this blog series, where we’ll look at some of the causes and remedies for both failure scenarios and latency scenarios as we outlined above.
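The same per-microbatch numbers shown in the Structured Streaming tab are also reported as a progress record for each batch. The sketch below is plain Python and assumes you already have one such record as a dictionary (for example from a query's last progress); the field names follow the progress JSON, while `summarize_progress`, the sample values and the `max_batch_ms` threshold are hypothetical.

```python
def summarize_progress(progress, max_batch_ms=60_000):
    """Condense one microbatch progress record into a few health signals."""
    durations = progress.get("durationMs", {})
    # triggerExecution is the total time for the microbatch; the rest are phases.
    phases = {k: v for k, v in durations.items() if k != "triggerExecution"}
    slowest_phase = max(phases, key=phases.get) if phases else None
    state_rows = sum(
        op.get("numRowsTotal", 0) for op in progress.get("stateOperators", [])
    )
    total_ms = durations.get("triggerExecution", 0)
    return {
        "batch_id": progress.get("batchId"),
        "input_rows": progress.get("numInputRows", 0),
        "total_ms": total_ms,
        "slowest_phase": slowest_phase,
        "state_rows": state_rows,
        "too_slow": total_ms > max_batch_ms,
    }


# Hypothetical progress record with made-up values.
sample_progress = {
    "batchId": 42,
    "numInputRows": 5000,
    "durationMs": {
        "addBatch": 7000,
        "queryPlanning": 300,
        "walCommit": 200,
        "triggerExecution": 7800,
    },
    "stateOperators": [{"numRowsTotal": 120000, "memoryUsedBytes": 52428800}],
}

summary = summarize_progress(sample_progress, max_batch_ms=5000)
print(summary)
```

A summary like this can be logged or sent to an alerting tool; tracking `state_rows` over time is one way to notice unbounded state growth before it turns into latency or memory problems.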
**Conclusion**

You may have noticed that many of the topics covered here are very similar to how other production Spark applications should be deployed. Whether your workloads are primarily streaming applications or batch processes, the majority of the same principles will apply. We focused more on things that become especially important when building out streaming applications, but as we’re sure you’ve noticed by now, the topics we discussed should be included in most production deployments.

Across the majority of industries in the world today information is needed faster than ever, but that won’t be a problem for you. With Spark Structured Streaming you’re set to make it happen at scale in production. Be on the lookout for more in-depth discussions on some of the topics we’ve covered in this blog, and in the meantime keep streaming!

**[Review Databricks Structured Streaming in Production Documentation](https://docs.databricks.com/structured-streaming/production.html)**

**Start experimenting with these free Databricks notebooks.**

-----

SECTION 2.6

**Building Geospatial Data Products**

by **Milos Colic**

January 6, 2023

Geospatial data has been driving innovation for centuries, through use of maps, cartography and more recently through digital content. For example, the oldest map has been found etched in a piece of mammoth tusk and dates to [approximately 25,000 BC](https://en.wikipedia.org/wiki/History_of_cartography). This makes geospatial data one of the oldest data sources used by society to make decisions.
A more recent example, labeled as the birth of spatial analysis, is that of Charles Picquet in 1832 who used geospatial data to analyze [Cholera outbreaks in Paris](https://gallica.bnf.fr/ark:/12148/bpt6k842918.image); a couple of decades later John Snow in 1854 followed the same approach for [Cholera outbreaks in London](https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak). These two individuals used geospatial data to solve one of the toughest problems of their times and in effect save countless lives.

Fast-forwarding to the 20th century, the concept of [Geographic Information Systems (GIS)](https://education.nationalgeographic.org/resource/geographic-information-system-gis) was [first introduced](https://gisandscience.files.wordpress.com/2012/08/3-an-introduction-to-the-geo-information-system-of-the-canada-land-inventory.pdf) in 1967 in Ottawa, Canada, by the Department of Forestry and Rural Development.

Today we are in the midst of the cloud computing industry revolution: supercomputing scale available to any organization, virtually infinitely scalable for both storage and compute. Concepts like [data mesh](https://www.databricks.com/blog/2022/10/19/building-data-mesh-based-databricks-lakehouse-part-2.html) and [data marketplace](https://www.databricks.com/blog/2022/06/28/introducing-databricks-marketplace-an-open-marketplace-for-all-data-and-ai-assets.html) are emerging within the data community to address questions like platform federation and interoperability.

How can we apply these concepts to geospatial data, spatial analysis and GIS? By adopting the concept of data products and approaching the design of geospatial data as a product.
In this blog we will provide a point of view on how to design scalable geospatial data products that are modern and robust. We will discuss how Databricks Lakehouse Platform can be used to unlock the full potential of geospatial products that are one of the most valuable assets in solving the toughest problems of today and the future.

**What is a data product? And how to design one?**

The broadest and most concise definition of a “data product” was coined by DJ Patil (the first U.S. Chief Data Scientist) in _Data Jujitsu: The Art of Turning Data into Product_: “a product that facilitates an end goal through the use of data.” The complexity of this definition (as admitted by Patil himself) is needed to encapsulate the breadth of possible products, to include dashboards, reports, Excel spreadsheets, and even CSV extracts shared via emails. You might notice that the examples provided deteriorate rapidly in quality, robustness and governance.

What are the concepts that differentiate a successful product versus an unsuccessful one? Is it the packaging? Is it the content? Is it the quality of the content? Or is it only the product adoption in the market? Forbes defines the 10 must-haves of a successful product. A good framework to summarize this is through the value pyramid.

Figure 1: Product value pyramid (source)

The value pyramid provides a priority on each aspect of the product. Not every value question we ask about the product carries the same amount of weight. If the output is not useful, none of the other aspects matter: the output isn’t really a product but becomes more of a data pollutant to the pool of useful results. Likewise, scalability only matters after simplicity and explainability are addressed.

How does the value pyramid relate to the data products?
Each data output, in order to be a data product:

**•** **Should have clear usefulness.** The amount of the data society is generating is rivaled only by the amount of data pollutants we are generating. These are outputs lacking clear value and use, much less a strategy for what to do with them.

**•** **Should be explainable.** With the emergence of AI/ML, explainability has become even more important for data driven decision-making. Data is as good as the metadata describing it. Think of it in terms of food: taste does matter, but a more important factor is the nutritional value of ingredients.

**•** **Should be simple.** An example of product misuse is using a fork to eat cereal instead of using a spoon. Furthermore, simplicity is essential but not sufficient: beyond simplicity the products should be intuitive. Whenever possible both intended and unintended uses of the data should be obvious.

**•** **Should be scalable.** Data is one of the few resources that grows with use. The more data you process the more data you have. If both inputs and outputs of the system are unbounded and ever-growing, then the system has to be scalable in compute power, storage capacity and compute expressive power. Cloud data platforms like Databricks are in a unique position to answer for all of the three aspects.

**•** **Should generate habits.** In the data domain we are not concerned with customer retention as is the case for the retail products. However, the value of habit generation is obvious if applied to best practices. The systems and data outputs should exhibit the best practices and promote them: it should be easier to use the data and the system in the intended way than the opposite.

The geospatial data should adhere to all the aforementioned aspects, as any data product should. On top of this tall order, geospatial data has some specific needs.
**Geospatial data standards**

Geospatial data standards are used to ensure that geographic data is collected, organized, and shared in a consistent and reliable way. These standards can include guidelines for things like data formatting, coordinate systems, map projections, and metadata. Adhering to standards makes it easier to share data between different organizations, allowing for greater collaboration and broader access to geographic information.

The Geospatial Commission (UK government) has defined the UK Geospatial Data Standards Register as a central repository for data standards to be applied in the case of geospatial data. Furthermore, the mission of this registry is to:

**•** **“Ensure UK geospatial data is more consistent and coherent and usable across a wider range of systems.”** These concepts are a callout for the importance of explainability, usefulness and habit generation (possibly other aspects of the value pyramid).

**•** **“Empower the UK geospatial community to become more engaged with the relevant standards and standards bodies.”** Habit generation within the community is as important as the robust and critical design of the standard. If not adopted, standards are useless.

**•** **“Advocate the understanding and use of geospatial data standards within other sectors of government.”** The value pyramid applies to the standards as well: concepts like ease of adherence (usefulness/simplicity), purpose of the standard (explainability/usefulness) and adoption (habit generation) are critical for the value generation of a standard.

A critical tool for achieving the data standards mission is the [FAIR](https://www.go-fair.org/fair-principles/) data principles:

**•** **Findable**: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of data sets and services.

**•** **Accessible**: Once users find the required data, they need to know how it can be accessed, possibly including authentication and authorization.

**•** **Interoperable**: The data usually needs to be integrated with other data. In addition, the data needs to interoperate with applications or workflows for analysis, storage, and processing.

**•** **Reusable**: The ultimate goal of FAIR is to optimize the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

We share the belief that the FAIR principles are crucial for the design of scalable data products we can trust. To be fair, FAIR is based on common sense, so why is it key to our considerations? _“What I see in FAIR is not new in itself, but what it does well is to articulate, in an accessible way, the need for a holistic approach to data improvement. This ease in communication is why FAIR is being used increasingly widely as an umbrella for data improvement — and not just in the geospatial community.”_ ([A FAIR wind sets our course for data improvement](https://geospatialcommission.blog.gov.uk/2022/03/02/a-fair-wind-sets-our-course-for-data-improvement/))

To further support this approach, the [Federal Geographic Data Committee](https://www.fgdc.gov/standards) has developed the [National Spatial Data Infrastructure (NSDI) Strategic Plan](https://www.fgdc.gov/nsdi-plan/nsdi-strategic-plan-2021-2024.pdf) that covers the years 2021-2024 and was approved in November 2020. The goals of NSDI are in essence FAIR principles and convey the same message of designing systems that promote the circular economy of data: data products that flow between organizations following common standards and in each step through the data supply chain unlock new value and new opportunities.
The fact that these principles are permeating different jurisdictions and are adopted across different regulators is a testament to the robustness and soundness of the approach.

Figure 2: NSDI Strategic Goals

The FAIR concepts weave really well together with the data product design. In fact, FAIR traverses the whole product value pyramid and forms a value cycle. By adopting both the value pyramid and FAIR principles we design data products with both internal and external outlook. This promotes data reuse as opposed to data accumulation.

Why do FAIR principles matter for geospatial data and geospatial data products? FAIR transcends geospatial data; in fact it transcends data. It is a simple yet coherent system of guiding principles for good design, and that good design can be applied to anything, including geospatial data and geospatial systems.

**Grid index systems**

In traditional GIS solutions, performant spatial operations are usually achieved by building tree structures ([KD trees](https://en.wikipedia.org/wiki/K-d_tree), [ball trees](https://www.researchgate.net/publication/283471105_Ball-tree_Efficient_spatial_indexing_for_constrained_nearest-neighbor_search_in_metric_spaces), [Quad trees](https://en.wikipedia.org/wiki/Quadtree), etc). The issue with tree approaches is that they eventually break the scalability principle: when the data is too big to be processed in order to build the tree, the computation required to build it takes too long and defeats the purpose. This also negatively affects the accessibility of data; if we cannot construct the tree we cannot access the complete data and in effect we cannot reproduce the results. In this case, grid index systems provide a solution.

Grid index systems are built from the start with the scalability aspects of the geospatial data in mind. Rather than building the trees, they define a series of grids that cover the area of interest.
In the case of [H3](https://h3geo.org/) (pioneered by Uber), the grid covers the area of the Earth; in the case of local grid index systems (e.g., [British National Grid](https://en.wikipedia.org/wiki/Ordnance_Survey_National_Grid)) they may only cover the specific area of interest. These grids are composed of cells that have unique identifiers. There is a mathematical relationship between location and the cell in the grid. This makes the grid index systems very scalable and parallel in nature.

Figure 4: Grid Index Systems (H3, British National Grid)

Another important aspect of grid index systems is that they are open source, allowing index values to be universally leveraged by data producers and consumers alike. Data can be enriched with the grid index information at any step of its journey through the data supply chain. This makes the grid index systems an example of community driven data standards. Community driven data standards by nature do not require enforcement, which fully adheres to the habit generation aspect of the value pyramid and meaningfully addresses interoperability and accessibility principles of FAIR.

Databricks has recently announced [native support for the H3 grid index system](https://www.databricks.com/blog/2022/09/14/announcing-built-h3-expressions-geospatial-processing-and-analytics.html) following the same value proposition. Adopting common industry standards driven by the community is the only way to properly drive habit generation and interoperability. To strengthen this statement, organizations like [CARTO](https://carto.com/blog/hexagons-for-location-intelligence/), [ESRI](https://www.esri.com/arcgis-blog/products/bus-analyst/analytics/using-uber-h3-hexagons-arcgis-business-analyst-pro/) and [Google](https://opensource.googleblog.com/2017/12/announcing-s2-library-geometry-on-sphere.html) have been promoting the usage of grid index systems for scalable GIS system design.
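The "mathematical relationship between location and the cell" can be sketched with a deliberately simplified grid. The example below is plain Python using square latitude/longitude cells; it is not H3 (which uses hexagons and its own identifiers) and not the British National Grid, and the `cell_id` function, the 1-degree cell size and the sample coordinates are assumptions made purely for illustration.

```python
import math
from collections import Counter


def cell_id(lat, lon, cell_size_deg=1.0):
    """Map a WGS84 coordinate to the ID of the square cell that contains it."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinate out of range")
    # Shift to non-negative values, then integer-divide by the cell size.
    row = math.floor((lat + 90.0) / cell_size_deg)
    col = math.floor((lon + 180.0) / cell_size_deg)
    return f"{row}_{col}"


# Hypothetical points as (lat, lon): two near London, one in Paris.
points = [(51.5074, -0.1278), (51.9, -0.9), (48.8566, 2.3522)]

# Each point is indexed independently, so this step parallelizes trivially.
counts = Counter(cell_id(lat, lon) for lat, lon in points)
print(counts)
```

Because the cell is computed from the coordinate alone, no shared structure such as a tree has to be built first, and records that share a cell ID can be grouped or joined with ordinary equality operations.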
In addition, Databricks Labs project [Mosaic](https://databrickslabs.github.io/mosaic/) supports the [British National Grid](https://en.wikipedia.org/wiki/Ordnance_Survey_National_Grid) as the standard grid index system that is widely used in the UK government. Grid index systems are key for the scalability of geospatial data processing and for properly designing solutions for complex problems (e.g., figure 5 — flight holding patterns using H3). **Geospatial data diversity** Geospatial data standards spend a solid amount of effort regarding data format standardization, and format for that matter is one of the most important considerations when it comes to interoperability and reproducibility. Furthermore, if the reading of your data is complex — how can we talk about simplicity? Unfortunately geospatial data formats are typically complex, as data can be produced in a number of formats including both open source and vendor-specific formats. Considering only vector data, we can expect data to arrive in WKT, WKB, GeoJSON, web CSV, CSV, Shape File, GeoPackage, and many others. On the other hand, if we are considering raster data we can expect data to arrive in any number of formats such as GeoTiff, netCDF, GRIB, or GeoDatabase; for a comprehensive list of formats please consult this [blog](https://gisgeography.com/gis-formats/) . Figure 5: Example of using H3 to express flight holding patterns ----- Geospatial data domain is so diverse and has organically grown over the years around the use cases it was addressing. Unification of such a diverse ecosystem is a massive challenge. A recent effort by the Open Geospatial Consortium (OGC) to standardize to [Apache Parquet](https://parquet.apache.org/) and its geospatial schema specification [GeoParquet](https://geoparquet.org/) is a step in the right direction. 
Simplicity is one of the key aspects of designing a good, scalable and robust product. Unification leads to simplicity and addresses one of the main sources of friction in the ecosystem: data ingestion. Standardizing to GeoParquet brings a lot of value that addresses all of the aspects of FAIR data and the value pyramid.

Figure 6: GeoParquet as a geospatial standard data format

Why introduce another format into an already complex ecosystem? GeoParquet isn’t a new format; it is a schema specification for the Apache Parquet format, which is already widely adopted and used by the industry and the community. Parquet as the base format supports binary columns and allows for storage of arbitrary data payloads. At the same time, the format supports structured data columns that can store metadata together with the data payload. This makes it a choice that promotes interoperability and reproducibility. Finally, the [Delta Lake](https://delta.io/) format has been built on top of Parquet and brings [ACID](https://en.wikipedia.org/wiki/ACID) properties to the table. ACID properties of a format are crucial for reproducibility and for trusted outputs. In addition, Delta is the format used by the scalable data sharing solution [Delta Sharing](https://www.databricks.com/product/delta-sharing).

Delta Sharing enables enterprise-scale data sharing between any public cloud using Databricks (DIY options for private cloud are available using open source building blocks). Delta Sharing removes the need for custom-built REST APIs for exposing data to third parties. Any data asset stored in Delta (using the GeoParquet schema) automatically becomes a data product that can be exposed to external parties in a controlled and governed manner. Delta Sharing has been built from the ground up with [security best practices in mind](https://www.databricks.com/blog/2022/08/01/security-best-practices-for-delta-sharing.html?utm_source=bambu&utm_medium=social&utm_campaign=advocacy&blaid=3352307).
-----

Figure 7: Delta Sharing simplifying data access in the ecosystem

**Circular data economy**

Borrowing the concepts from the sustainability domain, we can define a circular data economy as a system in which data is collected, shared and used in a way that maximizes its value while minimizing waste and negative impacts, such as unnecessary compute time, untrustworthy insights, or biased actions based on data pollutants. Reusability is the key concept in this consideration: how can we minimize the "reinvention of the wheel"? There are countless data assets out in the wild that represent the same area and the same concepts with only slight alterations to better match a specific use case. Is this due to actual optimizations, or because it was easier to create a new copy of the assets than to reuse the existing ones? Or was it too hard to find the existing data assets, or too complex to define data access patterns?

Data asset duplication has many negative aspects in both FAIR considerations and data value pyramid considerations. Having many disparate, similar (but different) data assets that represent the same area and the same concepts can erode the simplicity of the data domain: it becomes hard to identify the data asset we can actually trust. It can also have very negative implications for habit generation. Many niche communities will emerge that standardize among themselves while ignoring the best practices of the wider ecosystem, or worse yet, do not standardize at all.

In a circular data economy, data is treated as a valuable resource that can be used to create new products and services, as well as to improve existing ones. This approach encourages the reuse and recycling of data, rather than treating it as a disposable commodity. Once again, we are using the sustainability analogy in a literal sense; we argue that this is the correct way of approaching the problem.
Data pollutants are a real challenge for organizations both internally and externally. An article by The Guardian states that less than 1% of collected data is actually analyzed. There is too much data duplication, the majority of data is hard to access, and deriving actual value is too cumbersome. A circular data economy promotes best practices and reusability of existing data assets, allowing for more consistent interpretation and insights across the wider data ecosystem.

Figure 8: Databricks Marketplace

-----

Interoperability is a key component of FAIR data principles, and interoperability raises the question of circularity. How can we design an ecosystem that maximizes data utilization and data reuse? Once again, FAIR together with the value pyramid holds answers. Findability of the data is key to data reuse and to solving for data pollution. With data assets that can be discovered easily, we can avoid recreating the same data assets in multiple places with only slight alterations. Instead, we gain a coherent data ecosystem with data that can be easily combined and reused.

Databricks has recently announced the [Databricks Marketplace](https://www.databricks.com/blog/2022/06/28/introducing-databricks-marketplace-an-open-marketplace-for-all-data-and-ai-assets.html). The idea behind the marketplace is in line with the original definition of a data product by DJ Patil. The marketplace will support sharing of data sets, notebooks, dashboards and machine learning models. The critical building block for such a marketplace is Delta Sharing: the scalable, flexible and robust channel for sharing any data, geospatial data included.

Designing scalable data products that will live in the marketplace is crucial. In order to maximize the value add of each data product, one should strongly consider FAIR principles and the product value pyramid.
Without these guiding principles we will only increase the issues that are already present in the current systems. Each data product should solve a unique problem and should solve it in a simple, reproducible and robust way.

**You can read more on how Databricks Lakehouse Platform can help you accelerate time to value from your data products in the eBook: [A New Approach to Data Sharing](https://www.databricks.com/p/ebook/a-new-approach-to-data-sharing).**

**Start experimenting with these free Databricks notebooks.**

-----

SECTION 2.7

**Data Lineage With Unity Catalog**

by **Paul Roome, Tao Feng and Sachin Thakur**

June 8, 2022

This blog will discuss the importance of data lineage, some of the common use cases, and our vision for better data transparency and data understanding with data lineage.

**What is data lineage and why is it important?**

Data lineage describes the transformations and refinements of data from source to insight. Lineage includes capturing all the relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, what other data sets leverage it, and many other events and attributes. With a data lineage solution, data teams get an end-to-end view of how data is transformed and how it flows across their data estate.

As more and more organizations embrace a data-driven culture and set up processes and tools to democratize and scale data and AI, data lineage is becoming an essential pillar of a pragmatic data management and governance strategy.

To understand the importance of data lineage, we have highlighted some of the common use cases we have heard from our customers below.
**Impact analysis** Data goes through multiple updates or revisions over its lifecycle, and understanding the potential impact of any data changes on downstream consumers becomes important from a risk management standpoint. With data lineage, data teams can see all the downstream consumers — applications, dashboards, machine learning models or data sets, etc. — impacted by data changes, understand the severity of the impact, and notify the relevant stakeholders. Lineage also helps IT teams proactively communicate data migrations to the appropriate teams, ensuring business continuity. **Data understanding and transparency** Organizations deal with an influx of data from multiple sources, and building a better understanding of the context around data is paramount to ensure the trustworthiness of the data. Data lineage is a powerful tool that enables data leaders to drive better transparency and understanding of data in their organizations. Data lineage also empowers data consumers such as data scientists, data engineers and data analysts to be context-aware as they perform analyses, resulting in better quality outcomes. Finally, data stewards can see which data sets are no longer accessed or have become obsolete to retire unnecessary data and ensure data quality for end business users . ----- **Debugging and diagnostics** You can have all the checks and balances in place, but something will eventually break. Data lineage helps data teams perform a root cause analysis of any errors in their data pipelines, applications, dashboards, machine learning models, etc., by tracing the error to its source. This significantly reduces the debugging time, saving days, or in many cases, months of manual effort. 
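Impact analysis and root cause analysis are both graph traversals over lineage: one walks downstream from a changed asset, the other walks upstream from a broken one. The sketch below illustrates that idea only. The `LINEAGE` dictionary, the asset names and the helper functions are hypothetical, and nothing here uses or represents the Unity Catalog API.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed in its value.
LINEAGE = {
    "raw.orders": ["silver.orders"],
    "raw.customers": ["silver.customers"],
    "silver.orders": ["gold.revenue", "ml.churn_features"],
    "silver.customers": ["ml.churn_features"],
    "gold.revenue": ["dashboard.sales"],
    "ml.churn_features": ["model.churn"],
}


def downstream(graph, start):
    """Impact analysis: every asset reachable from `start` (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def invert(graph):
    """Flip every edge so that the same traversal walks toward the sources."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return reverse


def upstream(graph, start):
    """Root cause analysis: every asset that `start` depends on."""
    return downstream(invert(graph), start)


impacted = downstream(LINEAGE, "silver.orders")
suspects = upstream(LINEAGE, "model.churn")
```

Here `impacted` lists the consumers to notify before changing `silver.orders`, and `suspects` lists the candidates to inspect when `model.churn` misbehaves.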
**Compliance and audit readiness**

Many compliance regulations, such as the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), Health Insurance Portability and Accountability Act (HIPAA), Basel Committee on Banking Supervision (BCBS) 239, and Sarbanes-Oxley Act (SOX), require organizations to have a clear understanding and visibility of data flow. As a result, data traceability becomes a key requirement in order for their data architecture to meet legal regulations. Data lineage helps organizations be compliant and audit-ready, thereby alleviating the operational overhead of manually creating the trails of data flows for audit reporting purposes.

**Effortless transparency and proactive control with data lineage**

The [lakehouse](https://www.databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) provides a pragmatic data management architecture that substantially simplifies enterprise data infrastructure and accelerates innovation by unifying your data warehousing and AI use cases on a single platform. We believe data lineage is a key enabler of better data transparency and data understanding in your lakehouse, surfacing the relationships between data, jobs and consumers, and helping organizations move toward proactive data management practices. For example:

**•** As the owner of a dashboard, do you want to be notified the next time a table your dashboard depends upon wasn’t loaded correctly?

**•** As a machine learning practitioner developing a model, do you want to be alerted that a critical feature in your model will be deprecated soon?

**•** As a governance admin, do you want to automatically control access to data based on its provenance?

All of these capabilities rely upon the automatic collection of data lineage across all use cases and personas, which is why the lakehouse and data lineage are a powerful combination.
-----

Figures: Data lineage for tables; data lineage for table columns; data lineage for notebooks, workflows and dashboards

**Built-in security:** Lineage graphs in Unity Catalog are privilege-aware and share the same permission model as Unity Catalog. If users do not have access to a table, they will not be able to explore the lineage associated with the table, adding an additional layer of security for privacy considerations.

**Easily exportable via REST API:** Lineage can be visualized in the Data Explorer in near real time, and retrieved via REST API to support integrations with our catalog partners.

**Getting started with data lineage in Unity Catalog**

Data lineage is available with Databricks Premium and Enterprise tiers for no additional cost. If you are already a Databricks customer, follow the data lineage guides ([AWS](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html) | [Azure](https://docs.microsoft.com/azure/databricks/data-governance/unity-catalog/data-lineage)) to get started. If you are not an existing Databricks customer, sign up for a [free trial](https://www.databricks.com/try-databricks) with a Premium or Enterprise workspace.

-----

SECTION 2.8

**Easy Ingestion to Lakehouse With COPY INTO**

by **Aemro Amare, Emma Liu, Amit Kara** and **Jasraj Dange**

January 17, 2023

A new data management architecture known as the data lakehouse emerged independently across many organizations and use cases to support AI and BI directly on vast amounts of data. One of the key success factors for using the data lakehouse for analytics and machine learning is the ability to quickly and easily ingest data of various types, including data from on-premises storage platforms (data warehouses, mainframes), real-time streaming data, and bulk data assets.
As data ingestion into the lakehouse is an ongoing process that feeds the proverbial ETL pipeline, you will need multiple options to ingest various formats, types and latency of data. For data stored in cloud object stores such as AWS S3, Google Cloud Storage and Azure Data Lake Storage, Databricks offers Auto Loader, a natively integrated feature, that allows data engineers to ingest millions of files from the cloud storage continuously. In other streaming cases (e.g., IoT sensor or clickstream data), Databricks provides native connectors for Apache Spark Structured Streaming to quickly ingest data from popular message queues, such as [Apache Kafka](https://docs.databricks.com/spark/latest/structured-streaming/kafka.html?_ga=2.117268486.126296912.1643033657-734003504.1641217794) , Azure Event Hubs or AWS Kinesis at low latencies. Furthermore, many customers can leverage popular ingestion tools that integrate with Databricks, such as Fivetran — to easily ingest data from enterprise applications, databases, mainframes and more into the lakehouse. Finally, analysts can use the simple “COPY INTO” command to pull new data into the lakehouse automatically, without the need to keep track of which files have already been processed. This blog focuses on COPY INTO, a simple yet powerful SQL command that allows you to perform batch file ingestion into Delta Lake from cloud object stores. It’s idempotent, which guarantees to ingest files with exactly-once semantics when executed multiple times, supporting incremental appends and simple transformations. It can be run once, in an ad hoc manner, and can be scheduled through Databricks Workflows. 
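The idempotent, incremental behavior described above can be pictured with a toy model. This is only a sketch of the idea and makes no claim about how COPY INTO stores its ingestion state internally; the `TargetTable` class, its attributes and the file names are all hypothetical.

```python
class TargetTable:
    """Toy stand-in for a target table plus a record of ingested files."""

    def __init__(self):
        self.rows = []             # data already appended to the table
        self.loaded_files = set()  # names of files that were already ingested

    def copy_into(self, source_files):
        """Append rows from files not seen before; return how many files loaded."""
        newly_loaded = 0
        for name in sorted(source_files):
            if name in self.loaded_files:
                continue  # already ingested: skipping keeps reruns idempotent
            self.rows.extend(source_files[name])
            self.loaded_files.add(name)
            newly_loaded += 1
        return newly_loaded


source = {"part-1.csv": [("a", 1)], "part-2.csv": [("b", 2)]}
table = TargetTable()
first_run = table.copy_into(source)    # loads both files
second_run = table.copy_into(source)   # loads nothing, so no duplicates
source["part-3.csv"] = [("c", 3)]
third_run = table.copy_into(source)    # loads only the newly arrived file
```

Rerunning the same command is therefore safe to schedule: a run with no new files changes nothing, and a run after new files arrive appends only those files.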
In recent Databricks [Runtime releases](https://docs.databricks.com/release-notes/runtime/releases.html), COPY INTO introduced new functionality for data preview, validation and enhanced error handling, as well as a new way to copy into a schemaless Delta Lake table, so that users can get started quickly and complete the end-to-end journey of ingesting from cloud object stores. Let’s take a look at the popular COPY INTO use cases.

-----

**1. Ingesting data for the first time**

COPY INTO requires a table to exist as it ingests the data into a target Delta table. However, you may have no idea what your data looks like. You first create an empty Delta table.

```
CREATE TABLE my_example_data;
```

Before you write out your data, you may want to preview it and ensure the data looks correct. The COPY INTO Validate mode is a new feature in Databricks Runtime [10.3](https://docs.databricks.com/release-notes/runtime/10.3.html) and above that allows you to preview and validate source data before ingesting many files from the cloud object stores. These validations include:

**•** if the data can be parsed

**•** the schema matches that of the target table or if the schema needs to be evolved

**•** all nullability and check constraints on the table are met

```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleData'
FILEFORMAT = CSV
VALIDATE
COPY_OPTIONS ('mergeSchema' = 'true')
```

The default for data validation is to parse all the data in the source directory to ensure that there aren’t any issues, but the rows returned for preview are limited. Optionally, you can provide the number of rows to preview after VALIDATE.

The COPY_OPTION “mergeSchema” specifies that it is okay to evolve the schema of your target Delta table. Schema evolution only allows the addition of new columns, and does not support data type changes for existing columns. In other use cases, you can omit this option if you intend to manage your table schema more strictly, as your data pipeline may have strict schema requirements and you may not want to evolve the schema at all times. However, our target Delta table in the example above is an empty, columnless table at the moment; therefore, we have to specify the COPY_OPTION “mergeSchema” here.

Figure 1: COPY INTO VALIDATE mode output

-----

**2. Configuring COPY INTO**

When looking over the results of VALIDATE (see Figure 1), you may notice that your data doesn’t look like what you want. Aren’t you glad you previewed your data set first? The first thing you notice is that the column names are not what is specified in the CSV header. What’s worse, the header is shown as a row in your data. You can configure the CSV parser by specifying FORMAT_OPTIONS. Let’s add those next.

```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleData'
FILEFORMAT = CSV
VALIDATE
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```

Figure 2 shows the validate output, in which the header is properly parsed.

Figure 2: COPY INTO VALIDATE mode output with enabled header and inferSchema

When using FORMAT_OPTIONS, you can tell COPY INTO to infer the data types of the CSV file by specifying the inferSchema option; otherwise, all default data types are STRINGs. On the other hand, binary file formats like AVRO and PARQUET do not need this option since they define their own schema. Another option, “mergeSchema”, states that the schema should be inferred over a comprehensive sample of CSV files rather than just one. The comprehensive list of format-specific options can be found in the [documentation](https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-copy-into#format-options).

**3. Appending data to a Delta table**

Now that the preview looks good, we can remove the VALIDATE keyword and execute the COPY INTO command.

```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleData'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true', 'mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```

-----

COPY INTO keeps track of the state of files that have been ingested. Unlike commands like INSERT INTO, users get idempotency with COPY INTO, which means users won’t get duplicate data in the target table when running COPY INTO multiple times from the same source data.

COPY INTO can be run once, in an ad hoc manner, and can be scheduled with Databricks Workflows. While COPY INTO does not natively support low-latency ingestion, you can trigger COPY INTO through orchestrators like Apache Airflow.

Figure 3: Databricks workflow UI to schedule a task

-----

**4. Secure data access with COPY INTO**

COPY INTO supports secure access in several ways. In this section, we want to highlight two new options you can use in both [Databricks SQL](https://dbricks.co/dbsql) and notebooks from recent releases:

**Unity Catalog**

With the general availability of Databricks Unity Catalog, you can use COPY INTO to ingest data to Unity Catalog managed or external tables from any source and file format supported by COPY INTO. Unity Catalog also adds new options for configuring secure access to raw data, allowing you to use Unity Catalog external locations or storage credentials to access data in cloud object storage. Learn more about how to use [COPY INTO with Unity Catalog](https://docs.databricks.com/ingestion/copy-into/unity-catalog.html#use-copy-into-to-load-data-with-unity-catalog).

**Temporary Credentials**

What if you have not configured Unity Catalog or an instance profile? How about data from a trusted third-party bucket? Here is a convenient COPY INTO feature that allows you to [ingest data with inline temporary credentials](https://docs.databricks.com/ingestion/copy-into/temporary-credentials.html) to handle the ad hoc bulk ingestion use case.
```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath' WITH (
  CREDENTIAL (AWS_ACCESS_KEY = '...', AWS_SECRET_KEY = '...', AWS_SESSION_TOKEN = '...')
)
FILEFORMAT = CSV
```

**5. Filtering files for ingestion**

What about ingesting a subset of files where the filenames match a pattern? You can apply a glob pattern that identifies the files to load from the source directory. For example, let’s filter and ingest files which contain the word `raw_data` in the filename below.

```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
PATTERN = '*raw_data*.csv'
FORMAT_OPTIONS ('header' = 'true')
```

**6. Ingest files in a time period**

In data engineering, it is frequently necessary to ingest files that have been modified before or after a specific timestamp. Data between two timestamps may also be of interest. The ‘modifiedAfter’ and ‘modifiedBefore’ format options offered by COPY INTO allow users to ingest data from a chosen time window into a Delta table.

```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
PATTERN = '*raw_data_*.csv'
FORMAT_OPTIONS ('header' = 'true', 'modifiedAfter' = '2022-09-12T10:53:11.000+0000')
```

-----

**7. Correcting data with the force option**

Because COPY INTO is by default idempotent, running the same query against the same source files more than once has no effect on the destination table after the initial execution. In real-world circumstances, however, source data files in cloud object storage may be altered for correction at a later time, and those changes must be propagated to the target table. In such a case, you can first erase the data from the target table and then ingest the more recent data files from the source. To make COPY INTO reload files it has already ingested, you only need to set the copy option ‘force’ to ‘true’.
```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
PATTERN = '*raw_data_2022*.csv'
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true')
```

**8. Applying simple transformations**

What if you want to rename columns? Or the source data has changed and a previous column has been renamed to something else? You don’t want to ingest that data as two separate columns, but as a single column. We can leverage the SELECT statement in COPY INTO to perform simple transformations.

```
COPY INTO demo.my_example_data
FROM (
  SELECT concat(first_name, " ", last_name) AS full_name,
         * EXCEPT (first_name, last_name)
  FROM 's3://my-bucket/exampleDataPath'
)
FILEFORMAT = CSV
PATTERN = '*.csv'
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true')
```

**9. Error handling and observability with COPY INTO**

**Error handling:**

How about ingesting data with file corruption issues? Common examples of file corruption are:

**•** Files with an incorrect file format

**•** Failure to decompress

**•** Unreadable files (e.g., invalid Parquet)

-----

COPY INTO’s format option ignoreCorruptFiles helps skip those files while processing. The result of the COPY INTO command returns the number of files skipped in the num_skipped_corrupt_files column. In addition, these corrupt files aren’t tracked by the ingestion state in COPY INTO; therefore, they can be reloaded in a subsequent execution once the corruption is fixed. This option is available in Databricks [Runtime 11.0+](https://docs.databricks.com/release-notes/runtime/11.0.html).

You can see which files have been detected as corrupt by running COPY INTO in VALIDATE mode.
```
COPY INTO my_example_data
FROM 's3://my-bucket/exampleDataPath'
FILEFORMAT = CSV
VALIDATE ALL
FORMAT_OPTIONS ('ignoreCorruptFiles' = 'true')
```

**Observability:**

In Databricks Runtime 10.5, the [file metadata column](https://docs.databricks.com/ingestion/file-metadata-column.html) was introduced to provide input file metadata information, which allows users to monitor and get key properties of the ingested files, like path, name, size and modification time, by querying a hidden STRUCT column called _metadata. To include this information in the destination, you must explicitly reference the _metadata column in your query in COPY INTO.

```
COPY INTO my_example_data
FROM (
  SELECT *, _metadata source_metadata
  FROM 's3://my-bucket/exampleDataPath'
)
FILEFORMAT = CSV
```

**How does it compare to Auto Loader?**

COPY INTO is a simple and powerful command to use when your source directory contains a small number of files (i.e., thousands of files or fewer), and if you prefer SQL. In addition, COPY INTO can be used over JDBC to push data into Delta Lake at your convenience, a common pattern among ingestion partners. To ingest a larger number of files, both in streaming and batch, we recommend using [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html). In addition, for a modern data pipeline based on the [medallion architecture](https://www.databricks.com/glossary/medallion-architecture), we recommend using Auto Loader in [Delta Live Tables pipelines](https://docs.databricks.com/ingestion/auto-loader/dlt.html), leveraging advanced capabilities of automatic error handling, quality control, data lineage and setting [expectations](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html) in a declarative approach.
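As a recap of the file-selection options from sections 5 and 6, the following sketch mimics filtering by a glob pattern and by a modification-time window. It is an illustration under assumptions: it uses Python's `fnmatch` matching, which may differ in detail from the glob syntax COPY INTO accepts, it treats both time bounds as exclusive, and the `select_files` helper and the file listing are hypothetical.

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase


def select_files(files, pattern="*", modified_after=None, modified_before=None):
    """Return the names in `files` (name -> modification time) that qualify."""
    selected = []
    for name, mtime in sorted(files.items()):
        if not fnmatchcase(name, pattern):
            continue  # filename does not match the glob pattern
        if modified_after is not None and mtime <= modified_after:
            continue  # too old (bound assumed exclusive)
        if modified_before is not None and mtime >= modified_before:
            continue  # too new (bound assumed exclusive)
        selected.append(name)
    return selected


# Hypothetical directory listing with modification timestamps in UTC.
files = {
    "raw_data_2022_a.csv": datetime(2022, 9, 10, tzinfo=timezone.utc),
    "raw_data_2022_b.csv": datetime(2022, 9, 13, tzinfo=timezone.utc),
    "summary.csv": datetime(2022, 9, 14, tzinfo=timezone.utc),
}
cutoff = datetime(2022, 9, 12, 10, 53, 11, tzinfo=timezone.utc)

by_name = select_files(files, pattern="*raw_data_*.csv")
recent = select_files(files, pattern="*raw_data_*.csv", modified_after=cutoff)
```

Combining both filters narrows the load to files that match the naming convention and were changed after the cutoff, which is the same intent as pairing PATTERN with ‘modifiedAfter’.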
**How to get started?**

To get started, you can go to the **[Databricks SQL](https://dbricks.co/dbsql)** query editor, then update and run the example SQL commands to ingest from your cloud object stores. Check out the options in section 4 to establish secure access to your data for querying it in Databricks SQL. To get familiar with COPY INTO in Databricks SQL, you can also follow this [quickstart tutorial.](https://docs.databricks.com/ingestion/copy-into/tutorial-dbsql.html)

As an alternative, you can use this [notebook](https://www.databricks.com/wp-content/uploads/notebooks/db-385-demo_copy_into.html) in Data Science & Engineering and Machine Learning workspaces to learn most of the COPY INTO features in this blog, where source data and target Delta tables are generated in DBFS.

More tutorials for COPY INTO can be found [here](https://docs.databricks.com/ingestion/copy-into/index.html).

-----

SECTION 2.9

**Simplifying Change Data Capture With Databricks Delta Live Tables**

by **Mojgan Mazouchi**

April 25, 2022

This guide will demonstrate how you can leverage change data capture in Delta Live Tables pipelines to identify new records and capture changes made to the data set in your data lake. Delta Live Tables pipelines enable you to develop scalable, reliable and low-latency data pipelines, while performing change data capture in your data lake with minimum required computation resources and seamless out-of-order data handling.

**Note:** We recommend following [Getting Started with Delta Live Tables](https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables), which explains creating scalable and reliable pipelines using Delta Live Tables (DLT) and its declarative ETL definitions.
**Background on change data capture**

Change data capture ([CDC](https://en.wikipedia.org/wiki/Change_data_capture)) is a process that identifies and captures incremental changes (data deletes, inserts and updates) in databases, like tracking customer, order or product status for near-real-time data applications. CDC provides real-time data evolution by processing data in a continuous, incremental fashion as new events occur.

Since [over 80% of organizations plan on implementing multicloud strategies by 2025](https://solutionsreview.com/data-integration/whats-changed-2020-gartner-magic-quadrant-for-data-integration-tools/), choosing the right approach for your business that allows seamless real-time centralization of all data changes in your ETL pipeline across multiple environments is critical.

By capturing CDC events, Databricks users can re-materialize the source table as a Delta table in the Lakehouse and run their analysis on top of it, while being able to combine data with external systems. The MERGE INTO command in Delta Lake on Databricks enables customers to efficiently upsert and delete records in their data lakes; you can check out our previous deep dive on the topic [here](https://www.databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html). This is a common use case: we observe many Databricks customers leveraging Delta Lake to keep their data lakes up to date with real-time business data.

While Delta Lake provides a complete solution for real-time CDC synchronization in a data lake, we are now excited to announce the change data capture feature in Delta Live Tables that makes your architecture even simpler, more efficient and scalable. DLT allows users to ingest CDC data seamlessly using SQL and Python.
Earlier CDC solutions with Delta tables used the MERGE INTO operation, which requires manually ordering the data to avoid failure when multiple rows of the source data set match while attempting to update the same rows of the target Delta table. To handle out-of-order data, an extra step was required to preprocess the source table using a foreachBatch implementation to eliminate the possibility of multiple matches, retaining only the latest change for each key (see the [change data capture example](https://www.databricks.com/blog/2022/04/25/simplifying-change-data-capture-with-databricks-delta-live-tables.html#)). The new APPLY CHANGES INTO operation in DLT pipelines automatically and seamlessly handles out-of-order data without any need for manual data engineering intervention.

**CDC with Databricks Delta Live Tables**

In this blog, we will demonstrate how to use the APPLY CHANGES INTO command in Delta Live Tables pipelines for a common CDC use case where the CDC data is coming from an external system. A variety of CDC tools are available, such as Debezium, Fivetran, Qlik Replicate, Talend and StreamSets. While specific implementations differ, these tools generally capture and record the history of data changes in logs; downstream applications consume these CDC logs. In our example, data from a CDC tool such as Debezium or Fivetran lands in cloud object storage or in a message queue like Apache Kafka.

Typically we see CDC used in ingestion to what we refer to as the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a Lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture.
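The preprocessing described earlier in this section, keeping only the latest change for each key before applying it, can be sketched in plain Python. This is an illustration of the idea, not the foreachBatch implementation or how APPLY CHANGES INTO works internally. The `apply_changes` helper is hypothetical, and the event fields (`id`, `operation`, `operation_date`) and operation codes are assumed for the example.

```python
def apply_changes(target, events, key="id", sequence_by="operation_date"):
    """Apply a batch of CDC events to `target` (a dict of key -> row).

    Events may arrive in any order: for each key, only the event with the
    highest `sequence_by` value is applied, so multiple matches cannot occur.
    """
    latest = {}
    for event in events:
        k = event[key]
        if k not in latest or event[sequence_by] > latest[k][sequence_by]:
            latest[k] = event  # newer change for this key wins

    for k, event in latest.items():
        if event["operation"] == "DELETE":
            target.pop(k, None)  # remove the row if it exists
        else:
            # Upsert the row, dropping the CDC bookkeeping field.
            target[k] = {f: v for f, v in event.items() if f != "operation"}
    return target


events = [
    {"id": 1, "name": "Ann", "operation": "APPEND", "operation_date": "2022-04-01 10:00:00"},
    {"id": 2, "name": "Bob", "operation": "APPEND", "operation_date": "2022-04-01 10:05:00"},
    {"id": 1, "name": "Ann B.", "operation": "UPDATE", "operation_date": "2022-04-01 11:00:00"},
    # Arrives late: older than the update above, so it must not win.
    {"id": 1, "name": "Ann A.", "operation": "UPDATE", "operation_date": "2022-04-01 10:30:00"},
    {"id": 2, "name": "Bob", "operation": "DELETE", "operation_date": "2022-04-01 12:00:00"},
]
silver = apply_changes({}, events)
```

The `key` and `sequence_by` parameters loosely echo the roles that a key column and an ordering column play in CDC processing. The sketch only resolves ordering within one batch; it does not guard against a stale event arriving in a later batch.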
Delta Live Tables allows you to seamlessly apply changes from CDC feeds to tables in your Lakehouse; combining this functionality with the medallion architecture allows incremental changes to flow easily through analytical workloads at scale. Using CDC together with the medallion architecture provides multiple benefits to users since only changed or added data needs to be processed. Thus, it enables users to cost-effectively keep Gold tables up to date with the latest business data.

**NOTE:** The example here applies to both the SQL and Python versions of CDC and shows one specific way to use the operations; to evaluate variations, please see the official documentation [here](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html#python).

**Prerequisites**

To get the most out of this guide, you should have a basic familiarity with:

**•** SQL or Python

**•** Delta Live Tables

**•** Developing ETL pipelines and/or working with Big Data systems

**•** Databricks interactive notebooks and clusters

**•** You must have access to a Databricks Workspace with permissions to create new clusters, run jobs, and save data to a location on external cloud object storage or [DBFS](https://docs.gcp.databricks.com/data/databricks-file-system.html)

**•** For the pipeline we are creating in this blog, the “Advanced” product edition, which supports enforcement of data quality constraints, needs to be selected

**The data set**

Here we are consuming realistic-looking CDC data from an external database. In this pipeline, we will use the [Faker](https://github.com/joke2k/faker) library to generate the data set that a CDC tool like Debezium can produce and bring into cloud storage for the initial ingest in Databricks. Using [Auto Loader](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html) we incrementally load the messages from cloud object storage and store them as raw messages in the Bronze table.
The Bronze tables are intended for data ingestion, which enables quick access to a single source of truth. Next we perform APPLY CHANGES INTO from the cleaned Bronze layer table to propagate the updates downstream to the Silver table. As data flows to Silver tables, it generally becomes more refined and optimized (“just-enough”) to provide an enterprise a view of all its key business entities. See the diagram below.

This blog focuses on a simple example that requires a JSON message with four fields (the customer’s name, email, address and id) along with two fields that describe the changed data: operation (which stores the operation code: DELETE, APPEND, UPDATE or CREATE) and operation_date (which stores the date and timestamp at which the record arrived for each operation).

To generate a sample data set with the above fields, we are using Faker, a Python package that generates fake data. You can find the notebook related to this data generation section [here](https://www.databricks.com/wp-content/uploads/notebooks/DB-129/1-cdc-data-generator.html). In this notebook we provide the name and storage location to which the generated data is written. We are using the DBFS functionality of Databricks; see the [DBFS documentation](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) to learn more about how it works. Then, we use a PySpark user-defined function to generate the synthetic data set for each field, and write the data back to the defined storage location, which we will refer to in other notebooks for accessing the synthetic data set.

**Ingesting the raw data set using Auto Loader**

According to the medallion architecture paradigm, the Bronze layer holds the rawest data. At this stage we can incrementally read new data using Auto Loader from a location in cloud storage. Here we are adding the path to our generated data set to the configuration section under pipeline settings, which allows us to load the source path as a variable.
So now our configuration under pipeline settings looks like below:

```
"configuration": {
  "source": "/tmp/demo/cdc_raw"
}
```

Then we load this configuration property in our notebooks. Let’s take a look at the Bronze table we will ingest, a. in SQL, and b. using Python.

**A. SQL**

```
SET spark.source;
CREATE STREAMING LIVE TABLE customer_bronze
(
  address string,
  email string,
  id string,
  firstname string,
  lastname string,
  operation string,
  operation_date string,
  _rescued_data string
)
TBLPROPERTIES ("quality" = "bronze")
COMMENT "New customer data incrementally ingested from cloud object storage landing zone"
AS
SELECT *
FROM cloud_files("${source}/customers", "json", map("cloudFiles.inferColumnTypes", "true"));
```

**B. PYTHON**

```
import dlt
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Read the source path from the pipeline configuration
source = spark.conf.get("source")

@dlt.table(
    name="customer_bronze",
    comment="New customer data incrementally ingested from cloud object storage landing zone",
    table_properties={"quality": "bronze"},
)
def customer_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source}/customers")
    )
```

The above statements use Auto Loader to create a streaming live table called customer_bronze from JSON files. When using Auto Loader in Delta Live Tables, you do not need to provide any location for schema or checkpoint, as those locations will be managed automatically by your DLT pipeline. Auto Loader provides a Structured Streaming source called cloud_files in SQL and cloudFiles in Python, which takes a cloud storage path and format as parameters. To reduce compute costs, we recommend running the DLT pipeline in Triggered mode as a micro-batch, assuming you do not have very low latency requirements.
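
For orientation, the sketch below shows what JSON messages with the fields of customer_bronze could look like. It is not the Faker-based generator notebook referenced in the data set section: it uses only the Python standard library, and the helper names, the placeholder values, and the timestamp format are assumptions made for illustration.

```python
import json
import random
from datetime import datetime, timedelta

# Operation codes named in the data set description.
OPERATIONS = ["APPEND", "UPDATE", "DELETE", "CREATE"]


def make_cdc_record(customer_id, operation, when):
    """Build one CDC message with the fields used by the customer_bronze table."""
    return {
        "id": str(customer_id),
        "firstname": f"first_{customer_id}",       # placeholder value
        "lastname": f"last_{customer_id}",         # placeholder value
        "email": f"user{customer_id}@example.com",
        "address": f"{customer_id} Example Street",
        "operation": operation,
        # Assumed timestamp format; the real generator may use a different one.
        "operation_date": when.strftime("%Y-%m-%d %H:%M:%S"),
    }


def generate_json_lines(n, seed=0):
    """Return n JSON strings, one message per line, as a CDC tool might land them."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    start = datetime(2022, 1, 1)
    lines = []
    for i in range(n):
        record = make_cdc_record(
            customer_id=rng.randint(1, 5),
            operation=rng.choice(OPERATIONS),
            when=start + timedelta(minutes=i),
        )
        lines.append(json.dumps(record))
    return lines


sample_lines = generate_json_lines(3)
for line in sample_lines:
    print(line)
```

Writing such lines to files under the configured source path would give Auto Loader something of the right shape to pick up, although the real notebook produces far more realistic values.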

**Expectations and high-quality data**

In the next step, to create a high-quality, diverse, and accessible data set, we impose quality-check expectation criteria using constraints. Currently, a constraint can be either retain, drop, or fail. For more detail see [here](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-expectations.html). All constraints are logged to enable streamlined quality monitoring.

**A. SQL**

```
CREATE TEMPORARY STREAMING LIVE TABLE customer_bronze_clean_v(
  CONSTRAINT valid_id EXPECT (id IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_address EXPECT (address IS NOT NULL),
  CONSTRAINT valid_operation EXPECT (operation IS NOT NULL) ON VIOLATION DROP ROW
)
TBLPROPERTIES ("quality" = "silver")
COMMENT "Cleansed bronze customer view (i.e. what will become Silver)"
AS SELECT *
FROM STREAM(LIVE.customer_bronze);
```

**B. PYTHON**

```
@dlt.view(
    name="customer_bronze_clean_v",
    comment="Cleansed bronze customer view (i.e. what will become Silver)",
)
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
@dlt.expect("valid_address", "address IS NOT NULL")
@dlt.expect_or_drop("valid_operation", "operation IS NOT NULL")
def customer_bronze_clean_v():
    return dlt.read_stream("customer_bronze").select(
        "address", "email", "id", "firstname", "lastname",
        "operation", "operation_date", "_rescued_data",
    )
```

**Using APPLY CHANGES INTO statement to propagate changes to downstream target table**

Prior to executing the Apply Changes Into query, we must ensure that a target streaming table, which will hold the most up-to-date data, exists. If it does not exist, we need to create one. The cells below are examples of creating a target streaming table. Note that at the time of publishing this blog, the target streaming table creation statement is required along with the Apply Changes Into query, and both need to be present in the pipeline; otherwise your table creation query will fail.

**A. SQL**

```
CREATE STREAMING LIVE TABLE customer_silver
TBLPROPERTIES ("quality" = "silver")
COMMENT "Clean, merged customers";
```

**B. PYTHON**

```
dlt.create_target_table(
    name="customer_silver",
    comment="Clean, merged customers",
    table_properties={"quality": "silver"},
)
```

Now that we have a target streaming table, we can propagate changes to the downstream target table using the Apply Changes Into query. While a CDC feed comes with INSERT, UPDATE and DELETE events, the DLT default behavior is to apply INSERT and UPDATE events from any record in the source data set, matched on primary keys and sequenced by a field that identifies the order of events. More specifically, it updates any row in the existing target table that matches the primary key(s), or inserts a new row when a matching record does not exist in the target streaming table. We can use APPLY AS DELETE WHEN in SQL, or its equivalent apply_as_deletes argument in Python, to handle DELETE events.

In this example we used "id" as our primary key, which uniquely identifies the customers and allows CDC events to apply to those identified customer records in the target streaming table. Since "operation_date" keeps the logical order of CDC events in the source data set, we use "SEQUENCE BY operation_date" in SQL, or its equivalent "sequence_by = col("operation_date")" in Python, to handle change events that arrive out of order. Keep in mind that the field value we use with SEQUENCE BY (or sequence_by) should be unique among all updates to the same key. In most cases, the sequence by column will be a column with timestamp information.
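
The interplay of keys, sequencing and deletes described above can be pictured with a toy in-memory model. This is only a sketch of the semantics as described in the text, not how DLT implements APPLY CHANGES INTO. The function name `apply_changes_sketch` and the sample events are hypothetical, and sequence values are assumed to be unique per key and to sort correctly as strings.

```python
def apply_changes_sketch(events, key, sequence_by, is_delete):
    """Keep, for each key, only the event with the highest sequence value,
    then drop keys whose winning event is a delete."""
    latest = {}
    for event in events:
        k = event[key]
        current = latest.get(k)
        # A late-arriving older event loses against a newer one already seen.
        if current is None or event[sequence_by] > current[sequence_by]:
            latest[k] = event
    return {k: e for k, e in latest.items() if not is_delete(e)}


# Events listed in arrival order, which differs from the logical order.
events = [
    {"id": "1", "email": "new@example.com", "operation": "UPDATE",
     "operation_date": "2022-01-03 10:00:00"},
    {"id": "1", "email": "old@example.com", "operation": "APPEND",
     "operation_date": "2022-01-01 10:00:00"},  # arrives late, must not win
    {"id": "2", "email": "gone@example.com", "operation": "DELETE",
     "operation_date": "2022-01-02 10:00:00"},
    {"id": "2", "email": "gone@example.com", "operation": "APPEND",
     "operation_date": "2022-01-01 09:00:00"},  # older than the delete
]

silver = apply_changes_sketch(
    events,
    key="id",
    sequence_by="operation_date",
    is_delete=lambda e: e["operation"] == "DELETE",
)
print(silver)
```

Customer 1 ends up with the newer email even though the older event arrived last, and customer 2 is absent because its most recent event is a delete. This is also why the sequencing value needs to be unique per key: two events with the same value would leave the winner undefined.
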
Finally, we used "COLUMNS * EXCEPT (operation, operation_date, _rescued_data)" in SQL, or its equivalent "except_column_list = ["operation", "operation_date", "_rescued_data"]" in Python, to exclude the three columns "operation", "operation_date" and "_rescued_data" from the target streaming table. By default, all the columns are included in the target streaming table when we do not specify the "COLUMNS" clause.

**A. SQL**

```
APPLY CHANGES INTO LIVE.customer_silver
FROM stream(LIVE.customer_bronze_clean_v)
  KEYS (id)
  APPLY AS DELETE WHEN operation = "DELETE"
  SEQUENCE BY operation_date
  COLUMNS * EXCEPT (operation, operation_date, _rescued_data);
```

**B. PYTHON**

```
dlt.apply_changes(
    target = "customer_silver",
    source = "customer_bronze_clean_v",
    keys = ["id"],
    sequence_by = col("operation_date"),
    apply_as_deletes = expr("operation = 'DELETE'"),
    except_column_list = ["operation", "operation_date", "_rescued_data"])
```

To check out the full list of available clauses see [here](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html#requirements). Please note that, at the time of publishing this blog, a table that reads from the target of an APPLY CHANGES INTO query or apply_changes function must be a live table, and cannot be a streaming live table.

A [SQL](https://www.databricks.com/wp-content/uploads/notebooks/DB-129/2-retail-dlt-cdc-sql.html) and a [Python](https://www.databricks.com/wp-content/uploads/notebooks/DB-129/2-Retail_DLT_CDC_Python.html) notebook are available for reference for this section. Now that we have all the cells ready, let’s create a pipeline to ingest data from cloud object storage.
Open Jobs in a new tab or window in your workspace, and select “Delta Live Tables.”

The pipeline associated with this blog has the following DLT pipeline settings:

```
{
  "clusters": [
    {
      "label": "default",
      "num_workers": 1
    }
  ],
  "development": true,
  "continuous": false,
  "edition": "advanced",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/mojgan.mazouchi@databricks.com/Delta-Live-Tables/notebooks/1-CDC_DataGenerator"
      }
    },
    {
      "notebook": {
        "path": "/Repos/mojgan.mazouchi@databricks.com/Delta-Live-Tables/notebooks/2-Retail_DLT_CDC_sql"
      }
    }
  ],
  "name": "CDC_blog",
  "storage": "dbfs:/home/mydir/myDB/dlt_storage",
  "configuration": {
    "source": "/tmp/demo/cdc_raw",
    "pipelines.applyChangesPreviewEnabled": "true"
  },
  "target": "my_database"
}
```

1. Select “Create Pipeline” to create a new pipeline
2. Specify a name such as “Retail CDC Pipeline”
3. Specify the Notebook Paths that you created earlier: one for the data set generated using the Faker package, and another for the ingestion of the generated data in DLT. The second notebook path can refer to the notebook written in SQL or Python, depending on your language of choice.
4. To access the data generated in the first notebook, add the data set path in configuration. Here we stored data in “/tmp/demo/cdc_raw/customers”, so we set “source” to “/tmp/demo/cdc_raw” to reference “source/customers” in our second notebook.
5. Specify the Target (optional; this is the target database), where you can query the resulting tables from your pipeline
6. Specify the Storage Location in your object storage (optional), to access your DLT-produced data sets and metadata logs for your pipeline
7. Set Pipeline Mode to Triggered. In Triggered mode, the DLT pipeline will consume new data in the source all at once, and once the processing is done it will terminate the compute resource automatically.
You can toggle between Triggered and Continuous modes when editing your pipeline settings. Setting “continuous”: false in the JSON is equivalent to setting the pipeline to Triggered mode.
8. For this workload you can disable autoscaling under Autopilot Options, and use a cluster with only one worker. For production workloads, we recommend enabling autoscaling and setting the maximum number of workers needed for cluster size.
9. Select “Start”
10. Your pipeline is created and running now!

**DLT pipeline lineage observability and data quality monitoring**

All DLT pipeline logs are stored in the pipeline’s storage location. You can specify your storage location only when you are creating your pipeline. Note that once the pipeline is created you can no longer modify the storage location.

You can check out our previous deep dive on the topic [here](https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables#pipeline-observability). Try this [notebook](https://www.databricks.com/wp-content/uploads/notebooks/DB-129/3-retail-dlt-cdc-monitoring.html) to see pipeline observability and data quality monitoring on the example DLT pipeline associated with this blog.

**Conclusion**

In this blog, we showed how we made it seamless for users to efficiently implement change data capture (CDC) in their lakehouse platform with Delta Live Tables (DLT). DLT provides built-in quality controls with deep visibility into pipeline operations, observing pipeline lineage, monitoring schema, and quality checks at each step in the pipeline. DLT supports automatic error handling and best-in-class autoscaling capability for streaming workloads, which enables users to have quality data with the optimum resources required for their workload.

Data engineers can now easily implement CDC with a new declarative [APPLY CHANGES INTO API](https://www.databricks.com/discover/pages/getting-started-with-delta-live-tables#pipeline-observability) with DLT in either SQL or Python. This new capability lets your ETL pipelines easily identify changes and apply those changes across tens of thousands of tables with low-latency support.

**Ready to get started and try out CDC in Delta Live Tables for yourself?**

Please watch this [webinar](https://www.databricks.com/p/webinar/tackle-data-transformation) to learn how Delta Live Tables simplifies the complexity of data transformation and ETL, see our [Change data capture with Delta Live Tables](https://docs.databricks.com/data-engineering/delta-live-tables/delta-live-tables-cdc.html) document and the official [github](https://github.com/databricks/delta-live-tables-notebooks), and follow the steps in this [video](https://vimeo.com/700994477) to create your pipeline!
-----

SECTION 2.10

**Best Practices for Cross-Government Data Sharing**

by **MILOS COLIC, PRITESH PATEL, ROBERT WHIFFIN, RICHARD JAMES WILSON, MARCELL FERENCZ** and **EDWARD KELLY**

February 21, 2023

Government data exchange is the practice of sharing data between different government agencies and often partners in commercial sectors. Government can share data for various reasons, such as to improve government operations’ efficiency, provide better services to the public, or support research and policymaking. In addition, data exchange in the public sector can involve sharing with the private sector or receiving data from the private sector. The considerations span multiple jurisdictions and almost all industries. In this blog, we will address the needs disclosed as part of national data strategies and how modern technologies, particularly Delta Sharing, Unity Catalog, and clean rooms, can help you design, implement and manage a future-proof and sustainable data ecosystem.

**Data sharing and public sector**

“The miracle is this: the more we share the more we have.” ([Leonard Nimoy](https://en.wikipedia.org/wiki/Leonard_Nimoy)). This is probably the quote about sharing that applies most profoundly to the topic of data sharing, to the extent that the purpose of sharing the data is to create new information, new insights, and new data. The importance of data sharing is even more amplified in the government context, where federation between departments allows for increased focus. Still, the very same federation introduces challenges around data completeness, data quality, data access, security and control, [FAIR](https://en.wikipedia.org/wiki/FAIR_data)-ness of data, etc. These challenges are far from trivial and require a strategic, multifaceted approach to be addressed appropriately.
Technology, people, process, legal frameworks, etc., require dedicated consideration when designing a robust data sharing ecosystem. [The National Data Strategy](https://www.gov.uk/government/publications/uk-national-data-strategy/national-data-strategy) (NDS) by the UK government outlines five actionable missions through which we can materialize the value of data for the citizen and society-wide benefits.

It comes as no surprise that each and every one of the missions is strongly related to the concept of data sharing, or more broadly, data access both within and outside of government departments:

**1. Unlocking the value of the data across the economy.** Mission 1 of the NDS aims to assert government and the regulators as enablers of the value extraction from data through the adoption of best practices. The UK data economy was estimated to be near [£125 billion in 2021](https://www.gov.uk/government/publications/uks-digital-strategy/uk-digital-strategy) with an upwards trend. In this context, it is essential to understand that the government-collected and provided open data can be crucial for addressing many of the challenges across all industries. For example, insurance providers can better assess the risk of insuring properties by ingesting and integrating [Flood areas](https://environment.data.gov.uk/flood-monitoring/doc/reference#flood-areas) provided by [DEFRA](https://www.gov.uk/government/organisations/department-for-environment-food-rural-affairs). On the other hand, capital market investors could better understand the risk of their investments by ingesting and integrating the [Inflation Rate Index](https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/l55o/mm23) by [ONS](https://www.ons.gov.uk/). Conversely, it is crucial for regulators to have well-defined data access and data sharing patterns for conducting their regulatory activities. This clarity truly enables the economic actors that interact with government data.

**2. Securing a pro-growth and trusted data regime.** The key aspect of Mission 2 is data trust, or more broadly, adherence to data quality norms. Data quality considerations become further amplified for data sharing and data exchange use cases where we are considering the whole ecosystem at once, and quality implications transcend the boundaries of our own platform. This is precisely why we have to adopt “data sustainability.” By sustainable data products we mean data products that harness existing sources rather than reinventing the same or similar assets, that avoid accumulating unnecessary data (data pollutants), and that anticipate future uses. Ungoverned and unbounded data sharing could negatively impact data quality and hinder the growth and value of data. The quality of how the data is shared should be a key consideration of data quality frameworks. For this reason, we require a solid set of standards and best practices for data sharing with governance and quality assurance built into the process and technologies. Only this way can we ensure the sustainability of our data and secure a pro-growth trusted data regime.

**3. Transforming government’s use of data to drive efficiency and improve public services.** “By 2025 data assets are organized and supported as products, regardless of whether they’re used by internal teams or external customers… Data products continuously evolve in an agile manner to meet the needs of consumers… these products provide data solutions that can more easily and repeatedly be used to meet various business challenges and reduce the time and cost of delivering new AI-driven capabilities.” ([The data-driven enterprise of 2025](https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-data-driven-enterprise-of-2025) by McKinsey.) AI and ML can be powerful enablers of digital transformation for both the public and private sectors.
AI, ML, reports, and dashboards are just a few examples of data products and services that extract value from data. The quality of these solutions directly reflects the quality of the data used to build them and our ability to access and leverage available data assets both internally and externally. Whilst there is a vast amount of data available for us to build new intelligent solutions for driving efficiency for better processes, better decision-making, and better policies, there are numerous barriers that can trap the data, such as legacy systems, data silos, fragmented standards, proprietary formats, etc. Modeling data solutions as data products and standardizing them to a unified format allows us to abstract such barriers and truly leverage the data ecosystem.

**4. Ensuring the security and resilience of the infrastructure on which data relies.** Reflecting on the vision of the year 2025: this isn’t that far from now, and even in the not so distant future we will be required to rethink our approach to data. More specifically, what is our digital supply chain infrastructure, our data sharing infrastructure? Data and data assets are products and should be managed as products. If data is a product, we need a coherent and unified way of providing those products. If data is to be used across industries and across both private and public sectors, we need an open protocol that drives adoption and habit generation. To drive adoption, the technologies we use must be resilient, robust, trusted and usable by and for all. Vendor lock-in, platform lock-in and cloud lock-in are all barriers to achieving this vision.

**5. Championing the international flow of data.** Data exchange between jurisdictions and across governments will likely be one of the most transformative applications of data at scale.
Some of the world’s toughest challenges depend on the efficient exchange of data between governments: prevention of criminal activities, counterterrorism activities, net-zero emission goals, international trade; the list goes on and on. Some steps in this direction are already materializing: the U.S. federal government and UK government have agreed on data exchange for countering serious crime activities. This is a true example of championing the international flow of data and using data for good. It is imperative that for these use cases, we approach data sharing from a security-first angle. Data sharing standards and protocols need to adhere to security and privacy best practices.

While originally built with a focus on the UK government and how to better integrate data as a key asset of a modern government, these concepts apply in a much wider global public sector context. In the same spirit, the U.S. Federal Government proposed the [Federal Data Strategy](https://strategy.data.gov/overview/) as a collection of principles, practices, action steps and timeline through which government can leverage the full value of Federal data for mission, service and the public good.

The principles are grouped into three primary topics:

**•** **Ethical governance.** Within the domain of ethics, the sharing of data is a fundamental tool for promoting transparency, accountability and explainability of decision-making. It is practically impossible to uphold ethics without some form of audit conducted by an independent party. Data (and metadata) exchange is a critical enabler for continuous robust processes that ensure we are using the data for good and we are using data we can trust.

**•** **Conscious design.** These principles are strongly aligned with the idea of data sustainability. The guidelines promote forward thinking around usability and interoperability of the data and user-centric design principles of sustainable data products.

**•** **Learning culture.** Data sharing, or alternatively knowledge sharing, has an important role in building a scalable learning ecosystem and learning culture. Data is front and center of knowledge synthesis, and from a scientific angle, data proves factual knowledge. Another critical component of knowledge is the “Why?”, and data is what we need to address the “Why?” component of any decision we make: which policy to enforce, who to sanction, who to support with grants, how to improve the efficiency of government services, how to better serve citizens and society.

In contrast to the qualitative analysis of the value of data sharing across governments discussed above, the European Commission forecasts that the economic value of the European data economy will [exceed €800 billion by 2027](https://commission.europa.eu/strategy-and-policy/priorities-2019-2024/europe-fit-digital-age/european-data-strategy_en), roughly the same size as the [Dutch economy in 2021](https://ec.europa.eu/eurostat/databrowser/view/NAMA_10_GDP/default/table?lang=en&category=na10.nama10.nama_10_ma)! Furthermore, they predict more than 10 million data professionals in Europe alone.

The technology and infrastructure to support the data society have to be accessible to all, interoperable, extensible, flexible and open. Imagine a world in which you’d need a different truck to transport products between different warehouses because each road requires a different set of tires: the whole supply chain would collapse. When it comes to data, we often experience the “one set of tires for one road” paradox. REST APIs and data exchange protocols have been proposed in the past but have failed to address the need for simplicity, ease of use and cost of scaling up with the number of data products.

**Delta Sharing: the new data highway**

Delta Sharing provides an open protocol for secure data sharing to any computing platform.
The protocol is based on the Delta data format and is agnostic concerning the cloud of choice.

Delta is an open source data format that avoids vendor, platform and cloud lock-in, thus fully adhering to the principles of data sustainability, conscious design of the U.S. Federal Data Strategy and mission 4 of the UK National Data Strategy. Delta provides a governance layer on top of the Parquet data format. Furthermore, it provides many performance optimizations not available in Parquet out of the box. The openness of the data format is a critical consideration. It is the main factor for driving the habit generation and adoption of best practices and standards.

Delta Sharing is a protocol based on a lean set of REST APIs to manage sharing, permissions and access to any data asset stored in Delta or Parquet formats. The protocol defines two main actors, the data provider (data supplier, data owner) and the data recipient (data consumer). The recipient, by definition, is agnostic to the data format at the source. Delta Sharing provides the necessary abstractions for governed data access in many different languages and tools.

Delta Sharing is uniquely positioned to answer many of the challenges of data sharing in a scalable manner within the context of highly regulated domains like the public sector:

**• Privacy and security concerns.** Personally identifiable data or otherwise sensitive or restricted data is a major part of the data exchange needs of a data-driven and modernized government. Given the sensitive nature of such data, it is paramount that the governance of data sharing is maintained in a coherent and unified manner. Any unnecessary process and technological complexities increase the risk of over-sharing data. With this in mind, Delta Sharing has been designed with [security best practices](https://www.databricks.com/blog/2022/08/01/security-best-practices-for-delta-sharing.html) from the very inception.
The protocol provides end-to-end encryption, short-lived credentials, and accessible and intuitive audit and governance features. All of these capabilities are available in a centralized way across all your Delta tables across all clouds.

**• Quality and accuracy.** Another challenge of data sharing is ensuring that the data being shared is of high quality and accuracy. Given that the underlying data is stored as Delta tables, we can guarantee that the [transactional nature of data](https://docs.delta.io/latest/concurrency-control.html#concurrency-control) is respected; Delta ensures ACID properties of data. Furthermore, Delta supports [data constraints](https://docs.delta.io/latest/delta-constraints.html#constraints) to guarantee data quality requirements at storage. Unfortunately, other formats such as [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), [CSVW](https://csvw.org/), [ORC](https://www.google.com/search?q=orc+data+format), [Avro](https://en.wikipedia.org/wiki/Apache_Avro), [XML](https://en.wikipedia.org/wiki/XML), etc., do not have such properties without significant additional effort. The issue becomes even more emphasized by the fact that data quality cannot be ensured in the same way on both the data provider and data recipient side without the exact reimplementation of the source systems. It is critical to embed quality and metadata together with data to ensure quality travels together with data.
Any decoupled approach to managing data, metadata and quality separately increases the risk of sharing and can lead to undesirable outcomes.

**• Lack of standardization:** Another challenge of data sharing is the lack of standardization in how data is collected, organized, and stored. This is particularly pronounced in the context of governmental activities. While governments have proposed standard formats (e.g., the Office for National Statistics [promotes usage of CSVW](https://www.ons.gov.uk/aboutus/transparencyandgovernance/datastrategy/datastandards#metadata-exchange)), aligning all private and public sector companies to standards proposed by such initiatives is a massive challenge. Other industries may have different requirements for scalability, interoperability, format complexity, lack of structure in data, etc. Most of the currently advocated standards are lacking in multiple such aspects. Delta is the most mature candidate for assuming the central role in the standardization of data exchange format. It has been built as a transactional and scalable data format, it supports structured, semi-structured and unstructured data, it stores data schema and metadata together with data and it provides a scalable enterprise-grade sharing protocol through Delta Sharing. Finally, Delta is one of the most popular open source projects in the ecosystem and, since May 2022, has surpassed [7 million monthly downloads](https://delta.io/blog/2022-08-02-delta-2-0-the-foundation-of-your-data-lake-is-open/).

-----

**• Cultural and organizational barriers:** These challenges can be summarized by one word: friction. Unfortunately, it’s a common problem for civil servants to struggle to obtain access to both internal and external data due to over-cumbersome processes, policies and outdated standards.
The principles we are using to build our data platforms and our data sharing platforms have to be self-promoting, have to drive adoption and have to generate habits that adhere to best practices. If there is friction with standard adoption, the only way to ensure standards are respected is by enforcement and that itself is yet another barrier to achieving data sustainability. Organizations have already adopted Delta Sharing both in the private and public sectors. For example, [U.S. Citizenship and Immigration Services](https://www.uscis.gov/) (USCIS) uses Delta Sharing to satisfy several [interagency data-sharing](https://delta.io/blog/2022-12-08-data-sharing-across-government-delta-sharing/) requirements. Similarly, Nasdaq describes Delta Sharing as the “[future of financial data sharing](https://www.nasdaq.com/articles/delta-sharing-protocol%3A-the-evolution-of-financial-data-sharing-2021-05-26),” and that future is open and governed.

**• Technical challenges:** Federation at the government scale or even further across multiple industries and geographies poses technical challenges. Each organization within this federation owns its platform and drives technological, architectural, platform and tooling choices. How can we promote interoperability and data exchange in this vast, diverse technological ecosystem? The data is the only viable integration vehicle. As long as the data formats we utilize are scalable, open and governed, we can use them to abstract from individual platforms and their intrinsic complexities.

Delta format and Delta Sharing solve this wide array of requirements and challenges in a scalable, robust and open way. This positions Delta Sharing as the strongest choice for unification and simplification of the protocol and mechanism through which we share data across both private and public sectors.
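
To make the provider and recipient roles described above more concrete, here is a minimal sketch of the recipient side using the open source `delta-sharing` Python connector. It assumes the provider has already issued a profile file with the sharing server endpoint and a credential; the profile path and the share, schema and table names are hypothetical placeholders, not values from this eBook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedTable:
    """Coordinates of a table that a data provider exposes through a share."""
    share: str
    schema: str
    table: str


def table_url(profile_path: str, ref: SharedTable) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address that the
    delta-sharing connector uses to locate a shared table."""
    return f"{profile_path}#{ref.share}.{ref.schema}.{ref.table}"


def load_shared_table(profile_path: str, ref: SharedTable):
    """Read a shared table into a pandas DataFrame as a data recipient.

    Needs `pip install delta-sharing` and a real profile file from the
    provider. The import is local so the helpers above work without it.
    """
    import delta_sharing  # open source connector for the Delta Sharing protocol

    return delta_sharing.load_as_pandas(table_url(profile_path, ref))


if __name__ == "__main__":
    # Hypothetical profile path and names, for illustration only.
    ref = SharedTable(share="agency_share", schema="public", table="permits")
    print(table_url("config.share", ref))
```

The recipient never needs to know how or where the provider stores the data; it only needs the profile file and the three-part table name, which is what keeps the recipient agnostic to the platform at the source.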
----- **Data Sharing through data clean rooms**

Taking the complexities of data sharing within highly regulated space and the public sector one step further: what if we need to share the knowledge contained in the data without ever granting direct access to the source data to external parties? These requirements may prove achievable and desirable where the data sharing risk appetite is very low.

In many public sector contexts, there are concerns that combining the data that describes citizens could lead to a big brother scenario where simply too much data about an individual is concentrated in a single data asset. If it were to fall into the wrong hands, such a hypothetical data asset could lead to immeasurable consequences for individuals and the trust in public sector services could erode. On the other hand, the value of a 360 view of the citizen could accelerate important decision-making. It could immensely improve the quality of policies and services provided to the citizens.

[Data clean rooms](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html) address this particular need. With data clean rooms you can share data with third parties in a privacy-safe environment. With Unity Catalog, you can enable fine-grained access controls on the data and meet your privacy requirements. In this architecture, the data participants never get access to the raw data. The only outputs from the clean rooms are those data assets generated in a pre-agreed, governed and fully controlled manner that ensures compliance with the requirements of all parties involved.

Finally, data clean rooms and Delta Sharing can address hybrid on-premises and off-premises deployments, where the data with the most restricted access remains on premises, while less restricted data is free to leverage the power of the cloud offerings. In such a scenario, there may be a need to combine the power of the cloud with the restricted data to solve advanced use cases where capabilities are unavailable on the on-premises data platforms. Data clean rooms can ensure that no physical data copies of the raw restricted data are created, results are produced within the clean room’s controlled environment and results are shared back to the on-premises environment (if the results maintain the restricted access within the defined policies) or are forwarded to any other compliant and predetermined destination system.

-----

**Citizen value of data sharing**

Every decision made by the government is a decision that affects its citizens. Whether the decision is a change to a policy, granting a benefit or preventing crime, it can significantly influence the quality of our society. Data is a key factor in making the right decisions and justifying the decisions made. Simply put, we can’t expect high-quality decisions without high-quality data and a complete view of the data (within the permitted context). Without data sharing, we will remain in a highly fragmented position where our ability to make those decisions is severely limited or even completely compromised. In this blog, we have covered several technological solutions available within the lakehouse that can derisk and accelerate how the government is leveraging the data ecosystem in a sustainable and scalable way.

For more details on the industry use cases that Delta Sharing is addressing, please consult the [A New Approach to Data Sharing](https://www.databricks.com/product/unity-catalog) eBook.
**Start experimenting with these free Databricks notebooks.**

-----

**SECTION**

# 03

### Ready-to-Use Notebooks and Data Sets

-----

This section includes several Solution Accelerators: free, ready-to-use examples of data solutions from different industries ranging from retail to manufacturing and healthcare. Each of the following scenarios includes notebooks with code and step-by-step instructions to help you get started. Get hands-on experience with the Databricks Lakehouse Platform by trying the following for yourself:

**Digital Twins:** Leverage digital twins (virtual representations of devices and objects) to optimize operations and gain insights. **[Explore the Solution](https://databricks.com/solutions/accelerators/digital-twins)**

**Overall Equipment Effectiveness:** Ingest equipment sensor data for metric generation and data-driven decision-making. **[Explore the Solution](https://www.databricks.com/solutions/accelerators/overall-equipment-effectiveness)**

**Real-time point of sale analytics:** Calculate current inventories for various products across multiple store locations with Delta Live Tables. **[Explore the Solution](https://www.databricks.com/solutions/accelerators/real-time-point-of-sale-analytics)**

**Recommendation Engines for Personalization:** Improve customers’ user experience and conversion with personalized recommendations. **[Explore the Solution](https://www.databricks.com/solutions/accelerators/recommendation-engines)**

**Understanding Price Transparency Data:** Efficiently ingest large healthcare data sets to create price transparency for better understanding of healthcare costs. **[Explore the Solution](https://www.databricks.com/solutions/accelerators/price-transparency-data)**

Additional Solution Accelerators with ready-to-use notebooks can be found here: **[Databricks Solution Accelerators](https://www.databricks.com/solutions/accelerators)**

-----

**SECTION**

# 04

### Case Studies

**4.1** Akamai

**4.2** Grammarly

**4.3** Honeywell

**4.4** Wood Mackenzie

**4.5** Rivian

**4.6** AT&T

-----

SECTION 4.1

**Akamai delivers real-time security analytics using Delta Lake**

###### <1

**Min ingestion time, reduced from 15 min**

###### >85%

**Of queries have a response time of 7 seconds or less**

**INDUSTRY** [Technology and Software](https://www.databricks.com/solutions/industries/technology-and-software)

**SOLUTION** [Threat Detection](https://databricks.com/solutions/accelerators/threat-detection)

**PLATFORM USE CASE** Delta Lake, Data Streaming, Photon, [Databricks SQL](https://databricks.com/product/databricks-sql)

**CLOUD** [Azure](https://www.databricks.com/product/azure)

Akamai runs a pervasive, highly distributed content delivery network (CDN). Its CDN uses approximately 345,000 servers in more than 135 countries and over 1,300 networks worldwide to route internet traffic for some of the largest enterprises in media, commerce, finance, retail and many other industries. About 30% of the internet’s traffic flows through Akamai servers. Akamai also provides cloud security solutions.

In 2018, the company launched a web security analytics tool that offers Akamai customers a single, unified interface for assessing a wide range of streaming security events and performing analysis of those events. The web analytics tool helps Akamai customers to take informed actions in relation to security events in real time. Akamai is able to stream massive amounts of data and meet the strict SLAs it provides to customers by leveraging Delta Lake and the Databricks Lakehouse Platform for the web analytics tool.

-----

**Ingesting and streaming enormous amounts of data**

Akamai’s web security analytics tool ingests approximately 10GB of data related to security events per second. Data volume can increase significantly when retail customers conduct a large number of sales, or on big shopping days like Black Friday or Cyber Monday.
The web security analytics tool stores several petabytes of data for analysis purposes. Those analyses are performed to protect Akamai’s customers and provide them with the ability to explore and query security events on their own.

The web security analytics tool initially relied on an on-premises architecture running Apache Spark™ on Hadoop. Akamai offers strict service level agreements (SLAs) to its customers of 5 to 7 minutes from when an attack occurs until it is displayed in the tool. The company sought to improve ingestion and query speed to meet those SLAs. “Data needs to be as real-time as possible so customers can see what is attacking them,” says Tomer Patel, Engineering Manager at Akamai. “Providing queryable data to customers quickly is critical. We wanted to move away from on-prem to improve performance and our SLAs so the latency would be seconds rather than minutes.”

**Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume. We’ve seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical.**

**Tomer Patel** Engineering Manager, Akamai

After conducting proofs of concept with several companies, Akamai chose to base its streaming analytics architecture on Spark and the Databricks Lakehouse Platform. “Because of our scale and the demands of our SLA, we determined that Databricks was the right solution for us,” says Patel. “When we consider storage optimization, and data caching, if we went with another solution, we couldn’t achieve the same level of performance.”

**Improving speed and reducing costs**

Today, the web security analytics tool ingests and transforms data, stores it in cloud storage, and sends the location of the file via Kafka. It then uses a Databricks Job as the ingest application. Delta Lake, the open source storage format at the base of the Databricks Lakehouse Platform, supports real-time querying on the web security analytics data. Delta Lake also enables Akamai to scale quickly. “Delta Lake allows us to not only query the data better but to also acquire an increase in the data volume,” says Patel. “We’ve seen an 80% increase in traffic and data in the last year, so being able to scale fast is critical.”

Akamai also uses Databricks SQL (DBSQL) and Photon, which provide extremely fast query performance. Patel added that Photon provided a significant boost to query performance. Overall, Databricks’ streaming architecture combined with DBSQL and Photon enables Akamai to achieve real-time analytics, which translates to real-time business benefits.

-----

Patel says he likes that Delta Lake is open source, as the company has benefitted from a community of users working to improve the product. “The fact that Delta Lake is open source and there’s a big community behind it means we don’t need to implement everything ourselves,” says Patel. “We benefit from fixed bugs that others have encountered and from optimizations that are contributed to the project.” Akamai worked closely with Databricks to ensure Delta Lake can meet the scale and performance requirements Akamai defined. These improvements have been contributed back to the project (many of which were made available as part of Delta Lake 2.0), and so any user running Delta Lake now benefits from the technology being tested at such a large scale in a real-world production scenario.

**Meeting aggressive requirements for scale, reliability and performance**

Using Spark Structured Streaming on the Databricks Lakehouse Platform enables the web security analytics tool to stream vast volumes of data and provide low-latency, real-time analytics-as-a-service to Akamai’s customers. That way Akamai is able to make security event data available to customers within the SLA of 5 to 7 minutes from when an attack occurs. “Our focus is performance, performance, performance,” says Patel.
“The platform’s performance and scalability are what drives us.” Using the Databricks Lakehouse Platform, it now takes under 1 minute to ingest the security event data. “Reducing ingestion time from 15 minutes to under 1 minute is a huge improvement,” says Patel. “It benefits our customers because they can see the security event data faster and they have a view of what exactly is happening as well as the capability to filter all of it.”

Akamai’s biggest priority is to provide customers with a good experience and fast response times. To date, Akamai has moved about 70% of security event data from its on-prem architecture to Databricks, and the SLA for customer query and response time has improved significantly as a result. “Now, with the move to Databricks, our customers experience much better response time, with over 85% of queries completing under 7 seconds.” Providing that kind of real-time data means Akamai can help its customers stay vigilant and maintain an optimal security configuration.

-----

SECTION 4.2

**Grammarly uses Databricks Lakehouse to improve user experience**

###### 110%

**Faster querying, at 10% of the cost to ingest, than a data warehouse**

###### 5 billion

**Daily events available for analytics in under 15 minutes**

Grammarly’s mission is to improve lives by improving communication. The company’s trusted AI-powered communication assistance provides real-time suggestions to help individuals and teams write more confidently and achieve better results. Its comprehensive offerings ([Grammarly Premium](https://www.grammarly.com/premium), [Grammarly Business](https://www.grammarly.com/business), [Grammarly for Education](https://www.grammarly.com/edu) and [Grammarly for Developers](https://developer.grammarly.com/)) deliver leading communication support wherever writing happens.
As the company grew over the years, its legacy, homegrown analytics system made it challenging to evaluate large data sets quickly and cost-effectively. By migrating to the Databricks Lakehouse Platform, Grammarly is now able to sustain a flexible, scalable and highly secure analytics platform that helps 30 million people and 50,000 teams worldwide write more effectively every day.

**INDUSTRY** [Technology and Software](https://www.databricks.com/solutions/industries/technology-and-software)

**SOLUTION** Recommendation Engines, Advertising Effectiveness, Customer Lifetime Value

**PLATFORM USE CASE** Lakehouse, Delta Lake, Unity Catalog, [Machine Learning](https://www.databricks.com/product/machine-learning), ETL

**CLOUD** [AWS](https://www.databricks.com/product/aws)

-----

**Harnessing data to improve communications for millions of users and thousands of teams**

When people use Grammarly’s AI communication assistance, they receive suggestions to help them improve multiple dimensions of communication, including spelling and grammar correctness, clarity and conciseness, word choice, style, and tone. Grammarly receives feedback when users accept, reject or ignore its suggestions through app-created events, which total about 5 billion events per day.

Historically, Grammarly relied on a homegrown legacy analytics platform and leveraged an in-house SQL-like language that was time-intensive to learn and made it challenging to onboard new hires. As the company grew, Grammarly data analysts found that the platform did not sufficiently meet the needs of its essential business functions, especially marketing, sales and customer success.
Analysts found themselves copying and pasting data from spreadsheets because the existing system couldn’t effectively ingest the external data needed to answer questions such as, “Which marketing channel delivers the highest ROI?” Reporting proved challenging because the existing system didn’t support Tableau dashboards, and company leaders and analysts needed to ensure they could make decisions quickly and confidently.

**Databricks Lakehouse has given us the flexibility to unleash our data without compromise. That flexibility has allowed us to speed up analytics to a pace we’ve never achieved before.**

**Chris Locklin** Engineering Manager, Data Platforms, Grammarly

Grammarly also sought to unify its data warehouses in order to scale and improve data storage and query capabilities. As it stood, large Amazon EMR clusters ran 24/7 and drove up costs. With the various data sources, the team also needed to maintain access control. “Access control in a distributed file system is difficult, and it only gets more complicated as you ingest more data sources,” says Chris Locklin, Engineering Manager, Data Platforms at Grammarly. Meanwhile, reliance on a single streaming workflow made collaboration among teams challenging. Data silos emerged as different business areas implemented analytics tools individually. “Every team decided to solve their analytics needs in the best way they saw fit,” says Locklin. “That created challenges in consistency and knowing which data set was correct.”

-----

As its data strategy was evolving, Grammarly’s priority was to get the most out of analytical data while keeping it secure. This was crucial because security is Grammarly’s number-one priority and most important feature, both in how it protects its users’ data and how it ensures its own company data remains secure. To accomplish that, Grammarly’s data platform team sought to consolidate data and unify the company on a single platform.
That meant sustaining a highly secure infrastructure that could scale alongside the company’s growth, improving ingestion flexibility, reducing costs and fueling collaboration.

**Improving analytics, visualization and decision-making with the lakehouse**

After conducting several proofs of concept to enhance its infrastructure, Grammarly migrated to the Databricks Lakehouse Platform. Bringing all the analytical data into the lakehouse created a central hub for all data producers and consumers across Grammarly, with Delta Lake at the core.

Using the lakehouse architecture, data analysts within Grammarly now have a consolidated interface for analytics, which leads to a single source of truth and confidence in the accuracy and availability of all data managed by the data platform team. Across the organization, teams are using Databricks SQL to conduct queries within the platform on both internally generated product data and external data from digital advertising platform partners. Now, they can easily connect to Tableau and create dashboards and visualizations to present to executives and key stakeholders.

“Security is of utmost importance at Grammarly, and our team’s number-one objective is to own and protect our analytical data,” says Locklin. “Other companies ask for your data, hold it for you, and then let you perform analytics on it. Just as Grammarly ensures our users’ data always remains theirs, we wanted to ensure our company data remained ours. Grammarly’s data stays inside of Grammarly.”

With its data consolidated in the lakehouse, different areas of Grammarly’s business can now analyze data more thoroughly and effectively. For example, Grammarly’s marketing team uses advertising to attract new business. Using Databricks, the team can consolidate data from various sources to extrapolate a user’s lifetime value, compare it with customer acquisition costs and get rapid feedback on campaigns.
Elsewhere, data captured from user interactions flows into a set of tables used by analysts for ad hoc analysis to inform and improve the user experience.

By consolidating data onto one unified platform, Grammarly has eliminated data silos. “The ability to bring all these capabilities, data processing and analysis under the same platform using Databricks is extremely valuable,” says Sergey Blanket, Head of Business Intelligence at Grammarly. “Doing everything from ETL and engineering to analytics and ML under the same umbrella removes barriers and makes it easy for everyone to work with the data and each other.”

-----

To manage access control, enable end-to-end observability and monitor data quality, Grammarly relies on the data lineage capabilities within Unity Catalog. “Data lineage allows us to effectively monitor usage of our data and ensure it upholds the standards we set as a data platform team,” says Locklin. “Lineage is the last crucial piece for access control. It allows analysts to leverage data to do their jobs while adhering to all usage standards and access controls, even when recreating tables and data sets in another environment.”

**Faster time to insight drives more intelligent business decisions**

Using the Databricks Lakehouse Platform, Grammarly’s engineering teams now have a tailored, centralized platform and a consistent data source across the company, resulting in greater speed and efficiency and reduced costs. The lakehouse architecture has led to 110% faster querying, at 10% of the cost to ingest, than a data warehouse. Grammarly can now make its 5 billion daily events available for analytics in under 15 minutes rather than 4 hours, enabling low-latency data aggregation and query optimization. This allows the team to quickly receive feedback about new features being rolled out and understand if they are being adopted as expected.
Ultimately, it helps them understand how groups of users engage with the UX, improving the experience and ensuring features and product releases bring the most value to users. “Everything my team does is focused on creating a rich, personalized experience that empowers people to communicate more effectively and achieve their potential,” says Locklin.

Moving to the lakehouse architecture also solved the challenge of access control over distributed file systems, while Unity Catalog enabled fine-grained, role-based access controls and real-time data lineage. “Unity Catalog gives us the ability to manage file permissions with more flexibility than a database would allow,” says Locklin. “It solved a problem my team couldn’t solve at scale. While using Databricks allows us to keep analytical data in-house, Unity Catalog helps us continue to uphold the highest standards of data protection by controlling access paradigms inside our data. That opens a whole new world of things that we can do.”

Ultimately, migrating to the Databricks Lakehouse Platform has helped Grammarly to foster a data-driven culture where employees get fast access to analytics without having to write complex queries, all while maintaining Grammarly’s enterprise-grade security practices. “Our team’s mission is to help Grammarly make better, faster business decisions,” adds Blanket. “My team would not be able to effectively execute on that mission if we did not have a platform like Databricks available to us.” Perhaps most critically, migrating off its rigid legacy infrastructure gives Grammarly the adaptability to do more while knowing the platform will evolve as its needs evolve. “Databricks has given us the flexibility to unleash our data without compromise,” says Locklin.
“That flexibility has allowed us to speed up analytics to a pace we’ve never achieved before.”

-----

SECTION 4.3

**Honeywell selects Delta Live Tables for streaming data**

Companies are under growing pressure to reduce energy use, while at the same time they are looking to lower costs and improve efficiency. Honeywell delivers industry-specific solutions that include aerospace products and services, control technologies for buildings and industry, and performance materials globally. Honeywell’s Energy and Environmental Solutions division uses IoT sensors and other technologies to help businesses worldwide manage energy demand, reduce energy consumption and carbon emissions, optimize indoor air quality, and improve occupant well-being.

Accomplishing this requires Honeywell to collect vast amounts of data. Using Delta Live Tables on the Databricks Lakehouse Platform, Honeywell’s data team can now ingest billions of rows of sensor data into Delta Lake and automatically build SQL endpoints for real-time queries and multilayer insights into data at scale, helping Honeywell improve how it manages data and extract more value from it, both for itself and for its customers.

**INDUSTRY** [Manufacturing](https://databricks.com/solutions/industries/manufacturing-industry-solutions)

**PLATFORM USE CASE** Lakehouse, Delta Lake, Delta Live Tables

**CLOUD** [Azure](https://databricks.com/product/azure)

**Databricks helps us pull together many different data sources, do aggregations, and bring the significant amount of data we collect from our buildings under control so we can provide customers value.**

**Dr. Chris Inkpen** Global Solutions Architect, Honeywell Energy and Environmental Solutions

-----

**Processing billions of IoT data points per day**

Honeywell’s solutions and services are used in millions of buildings around the world.
Helping its customers create buildings that are safe, more sustainable and productive can require thousands of sensors per building. Those sensors monitor key factors such as temperature, pressure, humidity and air quality. In addition to the data collected by sensors inside a building, data is also collected from outside, such as weather and pollution data. Another data set consists of information about the buildings themselves, such as building type, ownership, floor plan, square footage of each floor and square footage of each room. That data set is combined with the two disparate data streams, adding up to a lot of data across multiple structured and unstructured formats, including images and video streams, telemetry data, event data, etc. At peaks, Honeywell ingests anywhere between 200 and 1,000 events per second for any building, which equates to billions of data points per day. Honeywell’s existing data infrastructure was challenged to meet such demand. It also made it difficult for Honeywell’s data team to query and visualize its disparate data so it could provide customers with fast, high-quality information and analysis.

**ETL simplified: high-quality, reusable data pipelines**

With Delta Live Tables (DLT) on the Databricks Lakehouse Platform, Honeywell’s data team can now ingest billions of rows of sensor data into Delta Lake and automatically build SQL endpoints for real-time queries and multilayer insights into data at scale. “We didn’t have to do anything to get DLT to scale,” says Dr. Chris Inkpen, Global Solutions Architect at Honeywell Energy and Environmental Solutions. “We give the system more data, and it copes. Out of the box, it’s given us the confidence that it will handle whatever we throw at it.”

Honeywell credits the Databricks Lakehouse Platform for helping it to unify its vast and varied data (batch, streaming, structured and unstructured) into one platform. “We have many different data types.
The Databricks Lakehouse Platform allows us to use things like Apache Kafka and Auto Loader to load and process multiple types of data and treat everything as a stream of data, which is awesome. Once we’ve got structured data from unstructured data, we can write standardized pipelines.”

Honeywell data engineers can now build and leverage their own ETL pipelines with Delta Live Tables and gain insights and analytics quickly. ETL pipelines can be reused regardless of environment, and data can run in batches or streams. It’s also helped Honeywell’s data team transition from a small team to a larger team. “When we wrote our first few pipelines before DLT existed, only one person could work in one part of the functionality. Now that we’ve got DLT and the ability to have folders with common functionality, we’ve got a really good platform where we can easily spin off different pipelines.”

DLT also helped Honeywell establish standard log files to monitor and cost-justify its product pipelines. “Utilizing DLT, we can analyze which parts of our pipeline need optimization,” says Inkpen. “With standard pipelines, that was much more chaotic.”

-----

**Enabling ease, simplicity and scalability across the infrastructure**

Delta Live Tables has helped Honeywell’s data team consistently query complex data while offering simplicity of scale. It also enables end-to-end data visualization of Honeywell’s data streams as they flow into its infrastructure, are transformed, and then flow out. “Ninety percent of our ETL is now captured in diagrams, so that’s helped considerably and improves data governance. DLT encourages, and almost enforces, good design,” says Inkpen.

Using the lakehouse as a shared workspace has helped promote teamwork and collaboration at Honeywell. “The team collaborates beautifully now, working together every day to divvy up the pipeline into their own stories and workloads,” says Inkpen.
Meanwhile, the ability to manage streaming data with low latency and better throughput has improved accuracy and reduced costs. “Once we’ve designed something using DLT, we’re pretty safe from scalability issues — certainly a hundred times better than if we hadn’t written it in DLT,” says Inkpen. “We can then go back and look at how we can take a traditional job and make it more performant and less costly. We’re in a much better position to try and do that from DLT.” Using Databricks and DLT also helps the Honeywell team perform with greater agility, which allows them to innovate faster while empowering developers to respond to user requirements almost immediately. “Our previous architecture made it impossible to know what bottlenecks we had and what we needed to scale. Now we can do data science in near real-time.” Ultimately, Honeywell can now more quickly provide its customers with the data and analysis they need to make their buildings more efficient, healthier and safer for occupants. “I’m continuously looking for ways to improve our lifecycles, time to market, and data quality,” says Inkpen. “Databricks helps us pull together many different data sources, do aggregations, and bring the significant amount of data we collect from our buildings under control so we can provide customers value.” **Ready to get started? Learn more about** **[Delta Live Tables here](https://www.databricks.com/product/delta-live-tables)** **.** ----- SECTION 4.4 **Wood Mackenzie helps customers transition to a more** **sustainable future** ###### 12 Billion **Data points processed** **each week** ###### 80-90% **Reduction in** **processing time** ###### Cost Savings **In operations through** **workflow automation** Wood Mackenzie offers customized consulting and analysis for a wide range of clients in the energy and natural resources sectors. 
Founded in Edinburgh, the company first cultivated deep expertise in upstream oil and gas, then broadened its focus to deliver detailed insight for every interconnected sector of the energy, chemicals, metals and mining industries. Today it sees itself playing an important role in the transition to a more sustainable future. Using Databricks Workflows to automate ETL pipelines helps Wood Mackenzie ingest and process massive amounts of data. Using a common workflow provided higher visibility to engineering team members, encouraging better collaboration. With an automated, transparent workflow in place, the team saw improved productivity and data quality and an easier path to fix pipeline issues when they arise. **I N D U S T R Y** [Energy and Utilities](https://www.databricks.com/solutions/industries/oil-and-gas) **P L AT F O R M U S E C A S E** Lakehouse, Workflows **C LO U D** [AWS](https://www.databricks.com/product/aws) ----- **Delivering insights to the energy industry** Fulfilling Wood Mackenzie’s mission, the Lens product is a data analytics platform built to deliver insights at key decision points for customers in the energy sector. Feeding into Lens are vast amounts of data collected from various data sources and sensors used to monitor energy creation, oil and gas production, and more. Those data sources update about 12 billion data points every week that must be ingested, cleaned and processed as part of the input for the Lens platform. Yanyan Wu, Vice President of Data at Wood Mackenzie, manages a team of big data professionals that build and maintain the ETL pipeline that provides input data for Lens. The team is leveraging the Databricks Lakehouse Platform and uses Apache Spark™ for parallel processing, which provides greater performance and scalability benefits compared to an earlier single-node system working sequentially. 
“We saw a reduction of 80-90% in data processing time, which results in us providing our clients with more up-to-date, more complete and more accurate data,” says Wu. **Our mission is to transform the way we power the planet.** **Our clients in the energy sector need data, consulting services** **and research to achieve that transformation. Databricks** **Workflows gives us the speed and flexibility to deliver the** **insights our clients need.** **Improved collaboration and transparency with a common** **workflow** The data pipeline managed by the team includes several stages for standardizing and cleaning raw data, which can be structured or unstructured and may be in the form of PDFs or even handwritten notes. Different members of the data team are responsible for different parts of the pipeline, and there is a dependency between the processing stages each team member owns. Using [Databricks Workflows](https://www.databricks.com/product/workflows) , the team defined a common workstream that the entire team uses. Each stage of the pipeline is implemented in a Python notebook, which is run as a job in the main workflow. Each team member can now see exactly what code is running on each stage, making it easy to find the cause of the issue. Knowing who owns the part of the pipeline that originated the problem makes fixing issues much faster. “Without a common workflow, different members of the team would run their notebooks independently, not knowing that failure in their run affected stages downstream,” says Meng Zhang, Principal Data Analyst at Wood Mackenzie. “When trying to rerun notebooks, it was hard to tell which notebook version was initially run and the latest version to use.” **Yanyan Wu** Vice President of Data, Wood Mackenzie ----- Using Workflows’ alerting capabilities to notify the team when a workflow task fails ensures everyone knows a failure occurred and allows the team to work together to resolve the issue quickly. 
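
The common-workflow idea described above (stages that depend on one another, each owned by a team member, with an alert when a stage fails) can be sketched in plain Python. This is only an illustration of the dependency-and-alert pattern, not the Databricks Workflows API: the stage names, owner names, `run_workflow` function and `notify` callback are all hypothetical and do not come from the case study.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each key maps to the stages it depends on.
STAGES = {
    "ingest_raw": [],
    "standardize": ["ingest_raw"],
    "clean": ["standardize"],
    "publish": ["clean"],
}

# Hypothetical owners, so an alert can say who owns the failing stage.
OWNERS = {
    "ingest_raw": "engineer_a",
    "standardize": "engineer_b",
    "clean": "engineer_c",
    "publish": "engineer_d",
}


def run_workflow(stages, tasks, notify):
    """Run each stage after its dependencies.

    If a stage raises, call notify(stage, exception) and mark every
    downstream stage as skipped instead of running it on bad input.
    """
    blocked = set()  # stages that failed or were skipped
    results = {}
    for name in TopologicalSorter(stages).static_order():
        if any(dep in blocked for dep in stages[name]):
            blocked.add(name)
            results[name] = "skipped"
            continue
        try:
            tasks[name]()
            results[name] = "succeeded"
        except Exception as exc:
            blocked.add(name)
            results[name] = "failed"
            notify(name, exc)
    return results


def broken_stage():
    raise ValueError("bad input")


# Every stage is a no-op except one that fails on purpose.
tasks = {name: (lambda: None) for name in STAGES}
tasks["standardize"] = broken_stage

alerts = []
results = run_workflow(
    STAGES,
    tasks,
    lambda stage, exc: alerts.append((stage, OWNERS[stage], str(exc))),
)
print(results)
print(alerts)
```

Here the failure in `standardize` is reported once, with its owner, and the two downstream stages are skipped rather than run independently, which is the visibility the team describes gaining from a shared workflow.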
The definition of a common workflow created consistency and transparency that made collaboration easier. “Using Databricks Workflows allowed us to encourage collaboration and break up the walls between different stages of the process,” explains Wu. “It allowed us all to speak the same language.” Creating transparency and consistency is not the only advantage the team saw. Using Workflows to automate notebook runs also led to cost savings compared to running interactive notebooks manually. **Improved code development productivity** The team’s ETL pipeline development process involves iteration on PySpark notebooks. Leveraging [interactive notebooks](https://www.databricks.com/product/collaborative-notebooks) in the Databricks UI makes it easy for data professionals on the team to manually develop and test a notebook. Because Databricks Workflows supports running notebooks as a task type (along with Python files, JAR files and other types), when the code is ready for production, it’s easy and cost effective to automate it by adding it to a workflow. The workflow can then be easily revised by adding or removing any steps to or from the defined flow. This way of working keeps the benefit of manually developing notebooks with the interactive notebook UI while leveraging the power of automation, which reduces potential issues that may happen when running notebooks manually. The team has gone even further in increasing productivity by developing a CI/CD process. “By connecting our source control code repository, we know the workflow always runs the latest code version we committed to the repo,” explains Zhang. “It’s also easy to switch to a development branch to develop a new feature, fix a bug and run a development workflow. When the code passes all tests, it is merged back to the main branch and the production workflow is automatically updated with the latest code.” Going forward, Wood Mackenzie plans to optimize its use of Databricks Workflows to automate machine learning processes such as model training, model monitoring and handling model drift. The firm uses ML to improve its data quality and extract insights to provide more value to its clients. 
“Our mission is to transform how we power the planet,” Wu says. “Our clients in the energy sector need data, consulting services and research to achieve that transformation. Databricks Workflows gives us the speed and flexibility to deliver the insights our clients need.” ----- SECTION 4.5 **Rivian redefines driving experience with** **the Databricks Lakehouse** ###### 250 platform users **A 50x increase from a year ago** Rivian is preserving the natural world for future generations with revolutionary Electric Adventure Vehicles (EAVs). With over 25,000 EAVs on the road generating multiple terabytes of IoT data per day, the company is using data insights and machine learning to improve vehicle health and performance. However, with legacy cloud tooling, it struggled to scale pipelines cost-effectively and spent significant resources on maintenance — slowing its ability to be truly data driven. Since moving to the Databricks Lakehouse Platform, Rivian can now understand how a vehicle is performing and how this impacts the driver using it. Equipped with these insights, Rivian is innovating faster, reducing costs, and ultimately, delivering a better driving experience to customers. 
**I N D U S T R Y** [Manufacturing](https://www.databricks.com/solutions/industries/manufacturing-industry-solutions) **S O L U T I O N** Predictive Maintenance, Scaling ML Models for IoT, Data-Driven ESG **P L AT F O R M** [Lakehouse](https://www.databricks.com/product/data-lakehouse) , [Delta Lake](https://www.databricks.com/product/delta-lake-on-databricks) , [Unity Catalog](https://www.databricks.com/product/unity-catalog) **C LO U D** [AWS](https://www.databricks.com/product/aws) ----- **Struggling to democratize data on a legacy platform** Building a world that will continue to be enjoyed by future generations requires a shift in the way we operate. At the forefront of this movement is Rivian — an electric vehicle manufacturer focused on shifting our planet’s energy and transportation systems entirely away from fossil fuel. Today, Rivian’s fleet includes personal vehicles and involves a partnership with Amazon to deliver 100,000 commercial vans. Each vehicle uses IoT sensors and cameras to capture petabytes of data ranging from how the vehicle drives to how various parts function. With all this data at its fingertips, Rivian is using machine learning to improve the overall customer experience with predictive maintenance so that potential issues are addressed before they impact the driver. Before Rivian even shipped its first EAV, it was already up against data visibility and tooling limitations that decreased output, prevented collaboration and increased operational costs. It had 30 to 50 large and operationally complicated compute clusters at any given time, which was costly. Not only was the system difficult to manage, but the company experienced frequent cluster outages as well, forcing teams to dedicate more time to troubleshooting than to data analysis. Additionally, data silos created by disjointed systems slowed the sharing of data, which further contributed to productivity issues. Required data languages and specific expertise of toolsets created a barrier to entry that limited developers from making full use of the data available. Jason Shiverick, Principal Data Scientist at Rivian, said the biggest issue was the data access. “I wanted to open our data to a broader audience of less technical users so they could also leverage data more easily.” Rivian knew that once its EAVs hit the market, the amount of data ingested would explode. In order to deliver the reliability and performance it promised, Rivian needed an architecture that would not only democratize data access, but also provide a common platform to build innovative solutions that can help ensure a reliable and enjoyable driving experience. **Databricks Lakehouse empowers us to lower the barrier of** **entry for data access across our organization so we can build** **the most innovative and reliable electric vehicles in the world.** **Wassym Bensaid** Vice President of Software Development, Rivian 
----- **Predicting maintenance issues with Databricks Lakehouse** Rivian chose to modernize its data infrastructure on the Databricks Lakehouse Platform, giving it the ability to unify all of its data into a common view for downstream analytics and machine learning. Now, unique data teams have a range of accessible tools to deliver actionable insights for different use cases, from predictive maintenance to smarter product development. Venkat Sivasubramanian, Senior Director of Big Data at Rivian, says, “We were able to build a culture around an open data platform that provided a system for really democratizing data and analysis in an efficient way.” Databricks’ flexible support of all programming languages and seamless integration with a variety of toolsets eliminated access roadblocks and unlocked new opportunities. 
Wassym Bensaid, Vice President of Software Development at Rivian, explains, “Today we have various teams, both technical and business, using Databricks Lakehouse to explore our data, build performant data pipelines, and extract actionable business and product insights via visual dashboards.” Rivian’s ADAS (advanced driver-assistance systems) Team can now easily prepare telemetric accelerometer data to understand all EAV motions. This core recording data includes information about pitch, roll, speed, suspension and airbag activity, to help Rivian understand vehicle performance, driving patterns and connected car system predictability. Based on these key performance metrics, Rivian can improve the accuracy of smart features and the control that drivers have over them. Designed to take the stress out of long drives and driving in heavy traffic, features like adaptive cruise control, lane change assist, automatic emergency driving, and forward collision warning can be honed over time to continuously optimize the driving experience for customers. Secure data sharing and collaboration was also facilitated with the Databricks Unity Catalog. Shiverick describes how unified governance for the lakehouse benefits Rivian productivity. “Unity Catalog gives us a truly centralized data catalog across all of our different teams,” he said. “Now we have proper access management and controls.” Venkat adds, “With Unity Catalog, we are centralizing data catalog and access management across various teams and workspaces, which has simplified governance.” End-to-end version controlled governance and auditability of sensitive data sources, like the ones used for autonomous driving systems, produces a simple but secure solution for feature engineering. This gives Rivian a competitive advantage in the race to capture the autonomous driving grid. 
----- **Accelerating into an electrified and sustainable world** By scaling its capacity to deliver valuable data insights with speed, efficiency and cost-effectiveness, Rivian is primed to leverage more data to improve operations and the performance of its vehicles to enhance the customer experience. Venkat says, “The flexibility that lakehouse offers saves us a lot of money from a cloud perspective, and that’s a huge win for us.” With Databricks Lakehouse providing a unified and open source approach to data and analytics, the Vehicle Reliability Team is able to better understand how people are using their vehicles, and that helps to inform the design of future generations of vehicles. By leveraging the Databricks Lakehouse Platform, they have seen a 30%–50% increase in runtime performance, which has led to faster insights and model performance. Shiverick explains, “From a reliability standpoint, we can make sure that components will withstand appropriate lifecycles. It can be as simple as making sure door handles are beefy enough to endure constant usage, or as complicated as predictive and preventative maintenance to eliminate the chance of failure in the field. Generally speaking, we’re improving software quality based on key vehicle metrics for a better customer experience.” From a design optimization perspective, Rivian’s unobstructed data view is also producing new diagnostic insights that can improve fleet health, safety, stability and security. Venkat says, “We can perform remote diagnostics to triage a problem quickly, or have a mobile service come in, or potentially send an OTA to fix the problem with the software. All of this needs so much visibility into the data, and that’s been possible with our partnership and integration on the platform itself.” With developers actively building vehicle software to improve issues along the way. 
Moving forward, Rivian is seeing rapid adoption of Databricks Lakehouse across different teams — increasing the number of platform users from 5 to 250 in only one year. This has unlocked new use cases including using machine learning to optimize battery efficiency in colder temperatures, increasing the accuracy of autonomous driving systems, and serving commercial depots with vehicle health dashboards for early and ongoing maintenance. As more EAVs ship, and its fleet of commercial vans expands, Rivian will continue to leverage the troves of data generated by its EAVs to deliver new innovations and driving experiences that revolutionize sustainable transportation. ----- SECTION 4.6 **Migrating to the cloud to better serve** **millions of customers** ###### 300% **ROI from OpEx savings** **and cost avoidance** ###### 3X **Faster delivery of ML/data** **science use cases** Consistency in innovation is what keeps customers with a telecommunications company and is why AT&T is ranked among the best. However, AT&T’s massive on-premises legacy Hadoop system proved complex and costly to manage, impeding operational agility and efficiency and engineering resources. The need to pivot to cloud to better support hundreds of millions of subscribers was apparent. Migrating from Hadoop to Databricks on the Azure cloud, AT&T experienced significant savings in operating costs. Additionally, the new cloud-based environment has unlocked access to petabytes of data for correlative analytics and an AI-as-a-Service offering for 2,500+ users across 60+ business units. AT&T can now leverage all its data — without overburdening its engineering team or exploding operational costs — to deliver new features and innovations to its millions of end users. 
**I N D U S T R Y** [Communication Service Providers](https://www.databricks.com/solutions/industries/telco-industry-solutions) **S O L U T I O N** Customer Retention, Subscriber Churn Prediction, Threat Detection **P L AT F O R M** Lakehouse, Data Science, Machine Learning, [Data Streaming](https://www.databricks.com/product/data-streaming) **C LO U D** [Azure](https://www.databricks.com/product/azure) ----- **Hadoop technology adds operational complexity and** **unnecessary costs** AT&T is a technology giant with hundreds of millions of subscribers and ingests 10+ petabytes[ [a](https://www.databricks.com/blog/2022/04/11/data-att-modernization-lakehouse.html) ] of data across the entire data platform each day. To harness this data, it has a team of 2,500+ data users across 60+ business units to ensure the business is data powered — from building analytics to ensure decisions are based on the best data-driven situation awareness to building ML models that bring new innovations to its customers. To support these requirements, AT&T needed to democratize and establish a data single version of truth (SVOT) while simplifying infrastructure management to increase agility and lower overall costs. However, physical infrastructure was too resource intensive. The combination of a highly complex hardware setup (12,500 data sources and 1,500+ servers) coupled with an on-premises Hadoop architecture proved complex to maintain and expensive to manage. Not only were the operational costs to support workloads high, but there were also additional capital costs around data centers, licensing and more. Up to 70% of the on-prem platform had to be prioritized to ensure 50K data pipeline jobs succeeded and met SLAs and data quality objectives. 
Engineers’ time was focused on managing updates, fixing performance issues or simply provisioning resources rather than focusing on higher-valued tasks. The resource constraints of physical infrastructure also drove serialization of data science activities, slowing innovation. Another hurdle faced in operationalizing petabytes of data was the challenge of building streaming data pipelines for real-time analytics, an area that was key to supporting innovative use cases required to better serve its customers. With these deeply rooted technology issues, AT&T was not in the best position to achieve its goals of increasing its use of insights for improving its customer experience and operating more efficiently. “To truly democratize data across the business, we needed to pivot to a cloud-native technology environment,” said Mark Holcomb, Distinguished Solution Architect at AT&T. “This has freed up resources that had been focused on managing our infrastructure and move them up the value chain, as well as freeing up capital for investing in growth-oriented initiatives.” **A seamless migration journey to Databricks** As part of its due diligence, AT&T ran a comprehensive cost analysis and concluded that Databricks was both the fastest and achieved the best price/performance for data pipelines and machine learning workloads. AT&T knew the migration would be a massive undertaking. As such, the team did a lot of upfront planning — they prioritized migrating their largest workloads first to immediately reduce their infrastructure footprint. They also decided to migrate their data before migrating users to ensure a smooth transition and experience for their thousands of data practitioners. 
**The migration from Hadoop to Databricks enables us to bring** **more value to our customers and do it more cost-efficiently** **and much faster than before.** **Mark Holcomb** Distinguished Solution Architect, AT&T ----- They spent a year deduplicating and synchronizing data to the cloud before migrating any users. This was a critical step in ensuring the successful migration of such a large, complex multi-tenant environment of 2,500+ users from 60+ business units and their workloads. The user migration process occurred over nine months and enabled AT&T to retire on-premises hardware in parallel with migration to accelerate savings as early as possible. Plus, due to the horizontal, scalable nature of Databricks, AT&T didn’t need to have everything in one contiguous environment. Separating data and compute, and across multiple accounts and workspaces, ensured analytics worked seamlessly without any API call limits or bandwidth issues and consumption clearly attributed to the 60+ business units. All in all, AT&T migrated over 1,500 servers, more than 50,000 production CPUs, 12,500 data sources and 300 schemas. The entire process took about two and a half years. And it was able to manage the entire migration with the equivalent of 15 full-time internal resources. “Databricks was a valuable collaborator throughout the process,” said Holcomb. “The team worked closely with us to resolve product features and security concerns to support our migration timeline.” **Databricks reduces TCO and opens new paths to** **innovation** One of the immediate benefits of moving to Databricks was huge cost savings. AT&T was able to rationalize about 30% of its data by identifying and not migrating underutilized and duplicate data. And prioritizing the migration of the largest workloads allowed half the on-prem equipment to be rationalized during the course of the migration. 
“By prioritizing the migration of our most compute-intensive workloads to Databricks, we were able to significantly drive down costs while putting us in position to scale more efficiently moving forward,” explained Holcomb. The result is an anticipated 300% five-year migration ROI from OpEx savings and cost avoidance (e.g., not needing to refresh data center hardware). With data readily available and the means to analyze data at any scale, teams of citizen data scientists and analysts can now spend more time innovating, instead of serializing analytics efforts or waiting on engineering to provide the necessary resources — or having data scientists spend their valuable time on less complex or less insightful analyses. Data scientists are now able to collaborate more effectively and speed up machine learning workflows so that teams can deliver value more quickly, with a 3x faster time to delivery for new data science use cases. “Historically you would have had operations in one system and analytics in a separate one,” said Holcomb. “Now we can do more use cases like operational analytics in a platform that fosters cross-team collaboration, reduces cost and improves the consistency of answers.” Since migrating to Databricks, AT&T now has a single version of truth to create new data-driven opportunities, including a self-serve AI-as-a-Service analytics platform that will enable new revenue streams and help it continue delivering exceptional innovations to its millions of customers. ----- #### About Databricks Databricks is the data and AI company. More than 9,000 organizations worldwide — including Comcast, Condé Nast and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. 
To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . **[START YOUR FREE TRIAL](https://databricks.com/try-databricks)** Contact us for a personalized demo **databricks.com/contact** -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf,2024-09-19T16:57:20Z
"##### EBOOK # 8 Steps to Becoming an AI-Forward Retailer ----- ## Contents Introduction .............................................................................................................................................................................................. **3** The State of the Retail Industry: The Diverging Performance of Data Leaders vs. Data Laggards ...................................................................................... **4** Begin With a Shared Vision of Success ....................................................................................................................................... **6** Why Companies Struggle With Setting Clear Business Outcomes for AI ................................................................... **7** Before Diving In: Assess Your Readiness ..................................................................................................................................... **9** Getting Started: Putting Some Wins on the Board .................................................................................................................. **11** Going Big: Learning to Embrace Transformational Change ............................................................................................... **12** Normalizing the Process: Engraining a Data-Driven Mindset Into the Fabric of the Business ...................................................................................................................................................... **14** From Hindsight to Foresight: The Journey to Becoming a Data-Forward Enterprise .......................................... **16** The 8 Steps to Building a Data-Forward Retailer ................................................................................................................... 
**17** Transform Retail Data Into Actionable Insights ....................................................................................................................... **21** ----- ## Introduction In a world where data is king, retailers have historically been trailblazers, pioneering data technology adoption to supercharge their operations, enhance customer understanding and sharpen personalization. The journey began with the simple cash register about 150 years ago, progressed to standardized product reporting with the introduction of the UPC and EAN, and has evolved to include cutting-edge technologies such as RFID and machine learning. Today, we stand on the brink of “Generation AI,” defined by sophisticated language models and images. Retailers, with their history of embracing data technologies, find themselves in a strong position to reap the benefits of this new era. Automation of customer service, supply chain modeling with digital twins and delivering hyper-personalized experiences in real time are all in the cards, promising to bolster revenue, improve margins and slash costs for early adopters. According to an internal analysis by Databricks, data pioneers are already outstripping their competition. The “Databricks 30” — an index tracking the publicly traded data and AI leaders across six major industry sectors, including retail — shows these front-runners outperforming the rest of the market by an impressive and increasing margin. It’s clear: retailers integrating data and AI strategies are setting themselves up for significant gains and a robust competitive advantage. However, for retailers mired in the landscape of outdated data platforms, the transformation into an AI-driven organization can seem a Herculean task. Embracing this wave of innovative technologies may feel overwhelming, yet it’s clear that those who make the leap stand to gain significantly in the rapidly evolving retail landscape. 
To help you navigate the rapidly evolving world of retail and consumer goods, this eBook provides a road map for organizations embarking on digital transformation journeys — a shift that is as much about culture as it is about technology, if not more so. The core advice? Start with a crystal-clear vision for transformation, outlining a compelling case for why such change is vital for the company’s long-term survival. Then, initiate the process by introducing AI to make gradual enhancements in critical business procedures. ----- ## The State of the Retail Industry: The Diverging Performance of Data Leaders vs. Data Laggards The pandemic’s fallout has led to a widening chasm between the retail industry’s leaders and laggards. McKinsey & Company encapsulated this trend succinctly: “Companies with tech-forward business models, who were already pulling ahead pre-crisis, left their competitors in the dust.” But what exactly is a “tech-forward business model”? It isn’t a simple narrative of digital natives dethroning traditional retailers. Heavyweights like Walmart, Target and Costco held their own against Amazon. Nor was it purely a matter of scale — smaller brands like Warby Parker or Everlane managed to carve out substantial consumer bases, competing against larger, established players. **The common denominator among all victors** **was their ability to harness data, analytics and AI** **to rapidly react to shifts in consumer behavior.** These businesses deftly used consumer demand insights to understand the effects of supply chain disruptions and labor shortages and reallocate resources to mitigate the most harmful impacts. They adeptly introduced new delivery methods, optimizing operations to alleviate the pressure these modes exerted on margins. They successfully established tighter partnerships with suppliers and logistic entities, collaborating toward shared triumphs. In all these instances, it was their timely access to information, foresight driven by this data, and the exploration of probable outcomes that set these organizations apart. 
Infusing data-driven decision-making into core processes within the organization, as well as those crossing partner boundaries, unlocked this approach’s full potential. To illustrate the significance of prioritizing data and AI, we developed the Databricks 30 Index. Drawing inspiration from Morgan Stanley’s “Data Era” stocks research, this index tracks marquee customers across our top five verticals and partners. The Databricks 30 is an equal-weight price index, composed of five marquee customers each across Retail/Consumer Products, Financial Services, Healthcare, Media/Entertainment, Manufacturing/Logistics, plus five strategic partners. ----- Our analysis reveals that companies in the Databricks 30 Index outpaced the S&P 500 by an impressive +21 percentage points (pp) over the past three years. In other words, if the stock market rose by 50% during this period, the Databricks 30 Index would have soared by 71% (outperforming by 21pp). Even more remarkable, excluding tech entirely from the Databricks 30, the Databricks 30 ex-Tech index outperforms the S&P 500 by an even larger margin over the same time frame: +23pp. Similar to Morgan Stanley’s analysis, we find that non-tech U.S. companies that are investing in cloud, data and innovation do, in fact, win. So now that we see the impact, let’s dive into the steps retail organizations can take to put themselves on a trajectory of continued growth and success amid an ever-changing landscape. [Figure: DB30 and Dow30 series plotted by date, 01-01-2019 to 01-01-2023] ----- ## Begin With a Shared Vision of Success The most overlooked activity in becoming an AI-forward retailer is the most crucial. 
In the rush to secure a position on the AI frontier, many companies are leaping before they look, embarking on AI initiatives without a clear understanding of what they want to achieve. Simply adopting the newest, shiniest tech tools isn’t a silver bullet. Many companies set themselves up for failure by neglecting to clearly define the expected business outcomes at the onset of the initiative, a strategic move that can effectively reduce project risk and costs and lead to the ultimate success of the program. In fact, in an attempt to accelerate results, this cavalier approach can instead spiral into expensive mistakes, wasted resources and a decrease in trust for stakeholders from unmet expectations. It’s like setting sail on an open ocean without a destination in mind; the journey might provide some interesting detours, but it lacks direction and purpose. However, when organizations take the time to articulate their expected business outcomes before deploying AI and data-driven programs, they position themselves to reduce project risk and costs. By aligning AI initiatives with specific business objectives and creating a shared vision with stakeholders, the focus becomes less about the technology itself and more about how it can be used to reach these defined goals. Technology decisions, too, are improved by having a known target. Without clear business outcomes in mind, companies tend to design, develop and implement technologies that _might_ be needed to solve the problem. Aligning the technical road map and activities with business outcomes mitigates the risk of misallocated resources and the potential fallout from the unfulfilled promise of AI. Furthermore, a clear understanding of expected business outcomes allows for efficient project management and cost control. Companies can set key performance indicators (KPIs) tied directly to these outcomes. 
This not only provides a means to measure progress, but also helps control costs by ensuring that resources are targeted toward initiatives that deliver value. It’s not just about numbers either; having explicit objectives aids in cultivating stakeholder buy-in. Clear communication about the purpose and potential benefits of an AI initiative can foster support from executives, employees, investors and customers alike. This collective backing can further mitigate risk and cut costs by ensuring that everyone is pulling in the same direction. ----- ## Why Companies Struggle With Setting Clear Business Outcomes for AI Getting started with AI at your organization might be daunting, and that’s because it is a big undertaking! Struggling to define clear outcomes for AI projects is a common issue among many businesses for a variety of reasons. Here are some key factors that contribute to this challenge: **They believe the data strategy is a technology problem.** Companies often hire a chief data officer, or make the data strategy the responsibility of the technology organization. **They lack an understanding of their business processes** An alarming number of businesses jump onto the AI bandwagon without understanding how their business operates. Decisions are made at the leadership level, but how they translate to operational decisions is muddled. Data and AI are fundamentally business process technologies, and without fully understanding how the business works, any initiative in data and AI is bound to have limited success. **They lack a data culture** Somewhat related to the previous point, many companies have teams that make decisions based on experience and intuition. These should not be discounted, but the reason for intuition is often a result of a poor definition of processes, which prevents the ability to measure and improve processes. **They struggle to get high-quality data** AI projects require good-quality, relevant data. 
Many businesses struggle with issues related to data access, quality, privacy and security, which can complicate the process of defining clear outcomes. **They lack the organizational structures required** Implementing AI often requires significant changes in business processes, organizational structures and even corporate culture. Many companies find it hard to manage these changes, leading to difficulties in setting and achieving clear outcomes. ----- **They don’t have the right people in place** There’s often a gap between the skills available within a company and the skills needed to define and achieve AI outcomes. Without team members who understand AI, data analysis and project management, businesses can struggle to set clear objectives for AI initiatives. **They struggle to quantify the value of AI projects** AI’s benefits can sometimes be intangible or long-term, making them difficult to quantify. Companies may struggle to define outcomes in measurable terms, complicating the process of setting objectives and monitoring progress. Data and AI programs are a business process problem first, and a technology problem last. Familiarity with technology is important, but irrelevant if companies do not understand it. Addressing these challenges often requires companies to invest in education about AI capabilities, to formulate clear strategies, to manage change effectively, and to bring on board the necessary skills either by hiring new talent or upskilling existing employees. It’s a journey that requires commitment, but the potential benefits of successful AI initiatives make it a worthwhile venture. ----- ## Before Diving In: Assess Your Readiness There is a growing sense of urgency for organizations relatively new to data and AI-driven enablement to “get in the game.” Profiles of top performers and headline-making achievements create a clearer sense of what is possible and what can be gained, leaving those entering into the space eager to achieve similar results. 
But what’s missing in those articles are the sustained investments in process, people and technology and the numerous challenges, missteps and outright failures that had to occur before success was achieved. Data-driven transformation is a journey, and before any successful journey is pursued, it’s wise to reflect on the organization’s readiness so that you can anticipate challenges and identify areas for remediation and improvement that will deliver you to your intended destination. With this in mind, we encourage organizations new to this space to assess their maturity in terms of the use and management of their existing information assets: 1. How easily discoverable and accessible are data in your environment? 2. How well understood are these information assets? 3. Is the quality of these data formally verified? 4. Are key entities such as products and customers actively managed, and can data related to these items be easily linked across various data sources? 5. How quickly are data made available for analysis following their creation or modification? Is this latency aligned with how you might use this data? 6. Are processes established for determining appropriate uses of data, governing access and providing oversight on consumption? 7. Is there one individual responsible for effective data management across the enterprise, and has this person established a process for receiving and responding to feedback and shifting organizational priorities? This list of questions is by no means exhaustive, but it should help to identify blockers that are likely to become impediments down the road. ----- Similarly, we would encourage organizations to assess their maturity in terms of analytics capabilities: 1. Is business performance at all levels assessed in terms of key metrics? 2. How frequently are data-driven analyses used in making key business decisions? 3. To what degree are advanced analytics techniques — i.e., data science — used in decision-making processes? 4. 
Are predictive models regularly leveraged as part of operational business processes? 5. How is experimentation used to assess the performance of various initiatives? 6. Are predictive models used to automate key business decisions? 7. Has the organization embraced a model of continuous deployment for the regular update of model-driven processes? Lastly, and probably most importantly, we’d encourage the organization to perform a frank assessment of its readiness to embrace change. Becoming a data-driven enterprise is fundamentally about operating differently than before. Decision-making authority becomes more diffuse and often more automated. Project outcomes become less certain as the organization focuses on innovation where learning is emphasized over predictable results. Process silos often become more intertwined as new modes of engagement evolve. When done right, this transition creates a healthy tension between what’s needed to be successful today and what’s needed to be successful tomorrow. But this can also manifest itself as employee resistance and political infighting as processes and organizational structures evolve. What’s often needed to overcome this is strong leadership, a clear vision and mandate for change as well as a reassessment of incentive structures and active organizational change management as the organization transitions into this new way of working. [Figure: Traditional approach (upfront reqs, technical implementation, production) compared with iterative approach (business questions, testing, production, optimization, with continuous feedback and continuous learning and optimization).] An iterative approach involves the use of data to continually optimize the performance of data products. ----- ## Getting Started: Putting Some Wins on the Board With the organization ready to proceed, the next phase is about learning to deliver new solutions within your organization. 
There will be new technologies to deploy and new skills to develop, and there will be new patterns for integration into business workflows and procedures for incremental updates and improvements. But most importantly, there will need to be a new level of partnership and trust between the business and the technology sides of the organization that needs to be carefully nurtured. The best way we have found to do this is to start with projects that improve on existing operational workflows, i.e., do what you do, but do it smarter. The business is often familiar with existing pain points and can more clearly envision how a new capability can be folded into its processes. They are also familiar with how to assess the impact a new approach may have on their business and can help design tests to validate whether the intended results are or are not being delivered. As capabilities demonstrating value over the status quo are developed, they are folded into business processes. This is not a one-and-done effort but part of an ongoing cycle of deployment to continue so long as the team has a line of sight to meaningful gains. The team does not wait for the ideal solution but instead focuses on incremental improvements that deliver measurable value along the way. Oversight for this process is provided by another body, one tasked with the success of the overall transformative efforts within the business. As success is delivered, there will be growing demand for the time and talents of these teams, and the organization will need to prioritize resources across an increasing number of opportunities. This steering committee will need to be responsible for allocating limited resources and advocating for additional ones as well to strike the right balance of investments for the organization. **DEMAND FORECASTING** Demand forecasting is a massive challenge for retail and consumer goods organizations. 
And one where even an incremental change can have a massive impact, so it’s often one of the first projects organizations identify to put a win on the board. According to [McKinsey](https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning) , a 10% to 20% improvement in supply chain forecasting accuracy is likely to produce a 5% reduction in inventory costs and a 2% to 3% increase in revenues. To hit the ground running, check out the [Databricks Solution](https://www.databricks.com/solutions/accelerators/demand-forecasting) [Accelerators for Demand Forecasting](https://www.databricks.com/solutions/accelerators/demand-forecasting) — pre-built notebooks and best practices for key use cases. Work on these projects is a collaborative effort between the business and IT. Together, the project team explores a potential solution with a notion of how it may be integrated in mind from the outset. As the project unfolds, all members are part of the iterative cycles and help to steer the solution in new directions until an item of value is derived. ----- ## Going Big: Learning to Embrace Transformational Change With some experience under your belt, it’s time to build on the organizational muscle developed during initial efforts and flex for more transformative impact. Again, the focus is on established functions within the business, but instead of pointed, incremental improvements, the team begins to create a vision for the part of the organization that would operate if it were to fully embrace data and AI enablement. It’s at this phase that many of the concerns about organizational resistance mentioned earlier are most likely to manifest themselves. Ideally, initial implementation efforts have built champions within the business, but it’s still important to be mindful of pushback that can emerge as the organization more fully begins to change. 
Having and maintaining strong business sponsorship in this phase is critical, and having that sponsor articulate and regularly reinforce a clear vision for the change that’s now underway can help everyone understand the need to support these efforts. So far in this exploration of the journey to data and AI transformation, we’ve minimized the importance of technology in order to focus on the business and organizational aspects that often get neglected in this conversation. But it’s at this stage that the organization needs to have established its preference for data and analytics platforms. Because of the breadth of needs that will have to be addressed and the ongoing innovation taking place in the data science community, we strongly suggest standardizing on a platform that is open and flexible while also providing cost-effective use of both infrastructure and people resources and strong data governance and protection. For many organizations, the Databricks Lakehouse Platform has proven itself to be the ideal platform to meet these needs. **WHY STANDARDIZE ON DATABRICKS?** The Databricks Lakehouse is the only enterprise data and AI platform that allows retailers to leverage all of their data, from any source, on any workload to always offer more engaging customer experiences driven by real-time data, at the lowest cost and with the greatest investment protection. ----- But simply standardizing on a platform is not enough. The organization needs to work through the roles and responsibilities around the use of this platform and processes for moving things from experimentation and formal development to testing and operationalization. The importance of having an MLOps strategy really comes to life at this phase. This doesn’t mean your strategy around MLOps can’t change, but this phase is when you want to think about and define your answers to some key questions such as the following: 1. 
How do we evaluate new and existing (retrained) models as part of their movement from development to production? 2. How do we determine when a model should be retrained? 3. What are the preferred mechanisms for production deployment? 4. How do we fall back should we have a deployment problem? 5. What are the service level expectations for the deployment processes? ###### ”Databricks Lakehouse has simplified the adoption of AI so that we can deliver better shopping experiences for our customers.” **Numan Ali** Solutions Architect, Data and Analytics Center of Excellence at Pandora ----- ## Normalizing the Process: Engraining a Data-Driven Mindset Into the Fabric of the Business Too often, leadership views innovation as a destination and not a process (“Let’s launch an LLM app!”). An enterprise doesn’t simply transform into a data-driven organization overnight and then it’s done. Yes, there will be an upfront investment, but there will also be ongoing investment in order to support sustained innovation. Ironically, one of the major obstacles to this change is viewing the goal as simply delivering a project or projects. Think about it — just 12 months ago, only a few specialists in academia and industry were talking about generative AI and large language models (LLMs). Today, [retailers have to integrate this](https://www.databricks.com/blog/2023/04/13/retail-age-generative-ai.html) [new technology](https://www.databricks.com/blog/2023/04/13/retail-age-generative-ai.html) or fall behind others who will find a way to create more personalized consumer experiences with it. Technology, especially when it comes to data and AI, moves far too quickly. What retailer tech teams need to deliver at the end of the day is applications, of course, but also the ability to react quickly to change. What sort of ongoing investments in terms of people, process and technology do retailers need to foster in order to ingrain an innovation mindset? 
This is an ongoing balancing act where organizations need to innovate and look for new opportunities but also sustain that innovation in a way that is realistic for the business. For this, let’s consider the 70-20-10 rule: the idea that companies should allocate 70% of innovation investment to core initiatives, 20% to adjacent ones and 10% to transformational ones, or “moonshots.” While not a hard-and-fast rule, this concept was touted by Google co-founder Larry Page in a [Fortune magazine article](https://www.google.com/url?q=https://money.cnn.com/2008/04/29/magazines/fortune/larry_page_change_the_world.fortune/&sa=D&source=editors&ust=1690998645852122&usg=AOvVaw2AHj-fx8XkEeMKP2Ts5gDu) , and was validated by a [study conducted](https://hbr.org/2012/05/managing-your-innovation-portfolio) [by Harvard Business Review](https://hbr.org/2012/05/managing-your-innovation-portfolio) , which found that companies following the rule outperformed their peers, typically realizing a P/E premium of 10% to 20%. ----- The goal of the 70-20-10 rule is to help guide the organization toward sustained innovation and spend the bulk of time on the core business. This is part of why we recommend starting first with fast (just 2- to 3-month total) pilot projects to use AI on existing business use cases like demand forecasting and call center optimization. By working in these areas with a focus on learning and iterating, retailers will soon find where data silos and rigidity exist in the system. As these foundational barriers are knocked down, it then makes it possible to tackle more transformational use cases and start to build the characteristics of a data-forward enterprise. In other words, start to utilize data and data-driven insights as a primary driver for decision-making and operations, while also prioritizing continuous data analysis and improvement. 
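The 70-20-10 split described above is simple arithmetic, and a minimal sketch can make it concrete. The tier names, the function name and the example budget of 1,000,000 are assumptions made for illustration; they do not come from the eBook, and the rule itself is described as a guideline rather than a hard-and-fast formula.

```python
# Illustrative only: tier shares follow the 70-20-10 rule quoted in the text.
# The tier names and the example budget are assumptions for this sketch.
SPLIT_PERCENT = {'core': 70, 'adjacent': 20, 'transformational': 10}


def allocate_innovation_budget(total, split_percent=SPLIT_PERCENT):
    '''Return the amount assigned to each tier for a given total budget.'''
    if sum(split_percent.values()) != 100:
        raise ValueError('tier percentages must sum to 100')
    return {tier: total * pct / 100 for tier, pct in split_percent.items()}


# Hypothetical annual innovation budget.
allocation = allocate_innovation_budget(1_000_000)
for tier, amount in allocation.items():
    print(f'{tier}: {amount:,.0f}')
```

Keeping the shares in one dictionary makes it easy to try a different mix (for example, a more aggressive 60-25-15) while the check on the total guards against shares that do not add up.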
[Figure: innovation portfolio with three tiers: Transformative, Adjacent, Core] ###### Companies that allocated about 70% of their innovation activity to core initiatives, 20% to adjacent ones and 10% to transformational ones outperformed their peers. **Bansi Nagji & Geoff Tuff** _Managing Your Innovation Portfolio_ Harvard Business Review, May 2012 ----- ## From Hindsight to Foresight: The Journey to Becoming a Data-Forward Enterprise So what does it take to successfully embark on this journey to becoming a data-forward enterprise? First and foremost, you need to not only establish a baseline understanding of what has occurred by examining historical data but leverage advancements in technologies (e.g., streaming, computer vision, voice recognition) to make predictions of the future. Through the use of both historical data and predictive techniques such as forecasting, recommendations, prescriptive care and next-best-action, organizations can begin to improve decisions and, in some cases, automate certain decision-making processes. But rather than moving from historical views to predictive actions in a linear fashion, this journey involves addressing both approaches simultaneously. Once you are able to unify historical and predictive analysis, you can then take significant steps toward becoming a data-forward enterprise. [Figure: stages of data maturity. **Data Siloed**: data and teams are segregated into different systems. **Data Purgatory**: things are better, but data isn’t driving the business. **Data Maturity**: every aspect of the business is supported by insights and AI. **The Data-Forward Enterprise**: data, analytics and AI working in concert.] Being data-forward means silos cease to exist, and data, analytics and AI are informing every aspect of the business. ----- ## The 8 Steps to Building a Data-Forward Retailer Before you start your data-forward journey, a few critical steps must be considered to establish a solid foundation to build upon. 
Based on our work with the largest and most successful retailers in the world, spanning startups to global giants, we at Databricks have seen that the most successful followed these steps to effectively gain wallet share, whereas those who couldn’t would often leave major gaps that competitors could take advantage of. These steps are the basics to prepare businesses for where they need to be both now and in the near future. **1** **Set the foundation: Define goals and objectives** The best way to avoid shiny object syndrome (where you start out with a technology and then try to figure out what to do with it) is to first identify the problems you want to solve. From there, you can set goals around innovation to align incentives, and, most importantly, ensure you are driving specific business outcomes such as improving customer engagement, optimizing inventory management or increasing sales. **2** **Get grounded: Understand the technology** To start, business leaders need to ground themselves in technology, especially when it comes to AI. AI can do amazing things, but it is not magical and vendors are prone to overpromising and underdelivering. Less than getting deep into code, the purpose is to understand the limitations and ideal use cases. Databricks provides several [free resources for retailers](https://www.databricks.com/explore/retail-resources) , but we recommend starting with [The Big Book of Retail & Consumer Goods Use Cases](https://www.databricks.com/resources/ebook/big-book-of-retail-consumer-goods-use-cases) for a C-level perspective of how different brands are using data, analytics and AI to drive revenue or cut operational costs. 
**3** **Understand the skills and processes in your business** As we will get into in step 4, starting with smaller pilot projects enables you to not just deliver a quick win and validate the use of AI in the enterprise, but also understand the in-house capabilities in terms of people, process and technology to deliver technical projects. And if required, be willing and ready to hire people with the right skill sets that can help you make the most of your data. For example, building a core team of data analysts can help extract deep insights that lead to better decision-making and identify opportunities for growth. It is critical at this step to define the roles you need, determine how you will source for those roles (via external hiring or internal transfer), and ensure those roles have opportunities for career progression. ----- **4** **Start small: Pilot a project** There is no substitute for rolling your sleeves up and running a pilot project to evaluate the feasibility and potential impact of a project before implementing it on a larger scale. When selecting a pilot project, we recommend starting with a project that will deliver clear business value, such as incremental revenue or clear cost savings, yet only takes 2-3 months to complete. The more time there is between project inception and seeing results, the more likely it will lose momentum internally. For inspiration and a head start, check out our [Solution Accelerators for Retail](https://www.databricks.com/solutions/accelerators?industry=Retail%20and%20Consumer%20Goods) [& Consumer Goods](https://www.databricks.com/solutions/accelerators?industry=Retail%20and%20Consumer%20Goods) . These free resources were created to help our customers save hours of discovery, design, development and testing. Our purpose-built guides — fully functional notebooks and best practices — speed up results across your most common and high-impact use cases and enable you to go from idea to proof of concept (PoC) in as little as two weeks. We have over 20 accelerators built specifically for critical retail and consumer goods use cases, from Demand Forecasting and On-Shelf Availability to Recommendation Engines and Customer Lifetime Value. 
We also have a set of Solution Accelerators specifically for [LLMs in Retail & Consumer Goods.](https://www.databricks.com/solutions/accelerators/large-language-models-retail) **5** **Implement data management and governance early** The first step to successfully implementing AI/ML in your business broadly is to ensure you have accurate, reliable and current data to train your models against. This data can (and should) come from a variety of sources, so it’s key to unify all data types and sources (sales transactions, customer feedback, social media) in a centralized location that is easily accessible, while not losing sight of data security to maintain customer trust. Setting up data governance parameters to control who has which kinds of access to what data, and being able to audit the history of this access, will actually accelerate innovation while ensuring data security and compliance. **Delivering exactly what customers want,** **every time, and on time** Data is at the heart of Gousto’s mission to change the way people eat through the delivery of boxes of fresh ingredients and easy-to-follow recipes. However, even as their business exploded at the start of the pandemic, their systems couldn’t ingest data fast enough, couldn’t talk to each other and wouldn’t scale — forcing them to temporarily stop accepting new customers. Now Gousto is set up to achieve exciting ambitions for menu expansion, sophisticated personalization and next-day delivery. Learn how they did it. **[READ THE FULL GOUSTO STORY](https://www.databricks.com/customers/gousto)** 
----- **6** **Incorporate AI across the business (starting with daily tasks)** Given the large upfront investment in data scientists and engineers to build an AI program, the ROI will come from using it at scale. Constantly look to uncover patterns and repeatable processes that can be optimized or fully automated with AI. **Building a global fashion icon with a** **customer-first approach** British luxury brand Burberry was seeking an efficient way to annotate its thousands of highly specific marketing assets for better targeting. Working with Labelbox within Databricks Lakehouse, they are now able to complete image annotation projects in hours instead of months. And marketing team members now have access to powerful content insights without needing to ask data scientists for help. **[READ THE FULL BURBERRY STORY](https://www.databricks.com/customers/burberry)** **Customizing interactions that convert clicks** **to revenue with Databricks Lakehouse** Global jewelry manufacturer and retailer Pandora needed a unified view of all their data where they could easily segment, categorize and analyze to deliver custom messaging to consumers. With Databricks Lakehouse, they now have the insights they need to deliver highly targeted messaging — increasing consumer engagement from the initial opening of a marketing email to maximizing shopping bag conversions to driving revenue on the website. **[READ THE FULL PANDORA STORY](https://www.databricks.com/customers/pandora)** **Building an operationally efficient** **omnichannel business** The Hershey Company analyzes the data they need to stay in front of changing human behavior and delight their customers. With Databricks Lakehouse, they can analyze data feeds from their largest retail customer — uncovering insights that will help extend their industry leadership. 
**[READ THE FULL HERSHEY STORY](https://www.databricks.com/customers/hershey)** **Ushering in a new era** **of data-driven retailing** Outdoor apparel brand Columbia Sportswear has enabled data and analytics self-service throughout the organization in a way that ensures everyone is working from a single source of truth. Whichever data team needs access to the data, Databricks Lakehouse gives them the confidence that the data is reliable and consistent. **[READ THE FULL COLUMBIA SPORTSWEAR STORY](https://www.google.com/url?q=https://www.databricks.com/customers/columbia&sa=D&source=editors&ust=1690998645853115&usg=AOvVaw0_kRasuzyi4ESz1SMB0n-K)** ----- **7** **Foster a culture of data-driven decision-making** What does it mean to have a culture of data-driven decision-making? In practice, it means empowering all employees to use data to inform their decisions. Only some strategic decisions will be based on complete and accurate information. It’s unwise to assume otherwise. The right approach is to leverage as much data as possible, from past tests or current efforts, to mitigate risk. Leaders need to not only ask for data but also ensure that their employees will be able to find the data they need. **Unlocking critical trends and insights** **needed to serve our 180 million customers** Reckitt, the maker of Lysol as well as hundreds of other household brands, was looking to deliver best-in-class customer experiences to their over 180 million customers spanning the globe. With Databricks Lakehouse, Reckitt has established a data-first culture by surfacing real-time, highly accurate, deep customer data insights that have led to a better understanding of international market trends and demand across the multiple product lines they support. 
**[READ THE FULL RECKITT STORY](https://www.databricks.com/customers/reckitt)** **Customer 360 to enable faster speed** **to market, better results** The Middle East’s Al-Futtaim serves as a local distributor for global brands such as Toyota, IKEA and Ace Hardware. With Databricks Lakehouse serving as a unified platform to aggregate and analyze various data sources on all customers, they have created a “golden customer record” that improves all decision-making, from forecasting demand to powering their global loyalty program. **[READ THE FULL AL-FUTTAIM STORY](https://www.google.com/url?q=https://www.databricks.com/customers/al-futtaim&sa=D&source=editors&ust=1690998645853527&usg=AOvVaw3cs-6mM2ANTKDCzTdTvEYH)** **8** **Continuously evaluate and improve** Recognize that establishing a data-driven culture is an ongoing journey and never a set destination. Constantly evaluate your data collection, analysis and decision-making process to identify areas for improvement. Even small and constant incremental improvements will deliver large gains in absolute terms when applied at scale. You can always personalize more, forecast better, or better manage your supply chain as you bring in better data sources and refine your models. ----- ## Transform Retail Data Into Actionable Insights Becoming data forward is not a crazy idea. Too often, leaders or organizations allow themselves to be intimidated by focusing on large-scale transformations. But it’s the small operational changes that can make your business more efficient as well as shift the larger culture forward. Once you’ve set this foundation, it then allows you to move toward bigger things. These steps may fail, but it’s actually positive to have these setbacks to learn from to try again. The bigger risk is to not try and thus fall behind competitors who are embracing the internal changes needed to take advantage of AI and machine learning. 
Core to delivering on these steps to become a data-forward retailer is a solid data foundation that can unify your data and AI workloads with sharing and governance built in, so internal and external teams can get access to the data they need when they need it. With the [Databricks Lakehouse for Retail](https://www.databricks.com/solutions/industries/retail-industry-solutions) , companies gain valuable insights into customer behavior, optimize supply chain operations and make informed business decisions in real time. EXPLORE DATABRICKS LAKEHOUSE FOR RETAIL Access key resources to understanding how a lakehouse for retail can set you on the path toward becoming a data-forward organization. **[LEARN MORE](https://www.databricks.com/explore/retail-resources)** #### Visit our website to learn more about Databricks Lakehouse for Retail. ----- ## About Databricks Databricks is the data and AI company. More than 9,000 organizations worldwide — including Comcast, Condé Nast, and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . **[START YOUR FREE TRIAL](https://www.databricks.com/try-databricks#account)** Contact us for a personalized demo **databricks.com/contact** -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/8-steps-to-becoming-a-ai-forward-retailer-ebook.pdf,2024-09-19T16:57:19Z
"### eBook # The Big Book of MLOps #### A data-centric approach to build and scale AI, including LLMOps ModelOps DataOps DevOps ----- **AUTHORS:** **Joseph Bradley** Lead Product Specialist **Rafi Kurlansik** Lead Product Specialist **Matt Thomson** Director, EMEA Product Specialists **Niall Turbitt** Lead Data Scientist ## Contents **CHAPTER 1:** **Introduction** ###### People and process · People · Process · Why should I care about MLOps? · Guiding principles **CHAPTER 2:** **Fundamentals of MLOps** ###### Semantics of dev, staging and prod · ML deployment patterns **CHAPTER 3:** **MLOps Architecture and Process** ###### Architecture components · Data Lakehouse · MLflow · Databricks and MLflow Autologging · Feature Store · MLflow Model Serving · Databricks SQL · Databricks Workflows and Jobs · Reference architecture · Overview · Dev · Staging · Prod **CHAPTER 4:** **LLMOps – Large Language Model Operations** ###### Discussion of key topics for LLMOps · Reference architecture · Looking ahead ----- **CHAPTER 1:** ## Introduction **Note:** Our prescription for MLOps is general to any set of tools and applications, though we give concrete examples using Databricks features and functionality. We also note that no single architecture or prescription will work for all organizations or use cases. Therefore, while we provide guidelines for building MLOps, we call out important options and variations. This whitepaper is written primarily for ML engineers and data scientists wanting to learn more about MLOps, with high-level guidance and pointers to more resources. The past decade has seen rapid growth in the adoption of machine learning (ML). While the early adopters were a small number of large technology companies that could afford the necessary resources, in recent times ML-driven business cases have become ubiquitous in all industries. 
Indeed, according to MIT Sloan Management Review, 83% of CEOs report that [artificial intelligence (AI) is a strategic priority](https://sloanreview.mit.edu/projects/artificial-intelligence-in-business-gets-real/) . This democratization of ML across industries has brought huge economic benefits, with [Gartner estimating](https://www.gartner.com/en/newsroom/press-releases/2018-04-25-gartner-says-global-artificial-intelligence-business-value-to-reach-1-point-2-trillion-in-2018) [that $3.9T in business value](https://www.gartner.com/en/newsroom/press-releases/2018-04-25-gartner-says-global-artificial-intelligence-business-value-to-reach-1-point-2-trillion-in-2018) will be created by AI in 2022. However, building and deploying ML models is complex. There are many options available for achieving this but little in the way of well-defined and accessible standards. As a result, over the past few years we have seen the emergence of the machine learning operations (MLOps) field. **MLOps is a set of processes** **and automation for managing models, data and code to improve performance stability and long-term** **efficiency in ML systems.** Put simply, MLOps = [ModelOps](https://en.wikipedia.org/wiki/ModelOps) + [DataOps](https://en.wikipedia.org/wiki/DataOps) + [DevOps](https://en.wikipedia.org/wiki/DevOps) . The concept of developer operations (DevOps) is nothing new. It has been used for decades to deploy software applications, and the deployment of ML applications has much to gain from it. However, strong DevOps practices and tooling alone are insufficient because ML applications rely on a constellation of artifacts (e.g., models, data, code) that require special treatment. Any MLOps solution must take into account the various people and processes that interact with these artifacts. Here at Databricks we have seen firsthand how customers develop their MLOps approaches, some of which work better than others. 
We launched the open source [MLflow](https://www.mlflow.org/) project to help make our customers successful with MLOps, and with over 10 million downloads/month from PyPI as of May 2022, MLflow’s adoption is a testament to the appetite for operationalizing ML models. This whitepaper aims to explain how your organization can build robust MLOps practices incrementally. First, we describe the people and process involved in deploying ML applications and the need for operational rigor. We also provide general principles to help guide your planning and decision-making. Next, we go through the fundamentals of MLOps, defining terms and broad strategies for deployment. Finally, we introduce a general MLOps reference architecture, the details of its processes, and best practices. ----- #### People and process **Figure 1:** ML workflow and personas. [Diagram: the personas (Data Governance Officer, Data Engineer, Data Scientist, ML Engineer, Business Stakeholder) mapped to the workflow stages Data Preparation, Exploratory Data Analysis, Feature Engineering, Model Training, Model Validation, Deployment and Monitoring.] ----- #### People Building ML applications is a team sport, and while in the real world people “wear many hats,” it is still useful to think in terms of archetypes. They help us understand roles and responsibilities and where handoffs are required, and they highlight areas of complexity within the system. We distinguish between the following personas: **ML PERSONAS** **Data Governance Officer:** Responsible for ensuring that data governance, data privacy and other compliance measures are adhered to across the model development and deployment process. Not typically involved in day-to-day operations. **Data Engineer:** Responsible for building data pipelines to process, organize and persist data sets for machine learning and other downstream applications. 
**Data Scientist:** Responsible for understanding the business problem, exploring available data to understand if machine learning is applicable, and then training, tuning and evaluating a model to be deployed. **ML Engineer:** Responsible for deploying machine learning models to production with appropriate governance, monitoring and software development best practices such as continuous integration and continuous deployment ([CI/CD](https://en.wikipedia.org/wiki/CI/CD)). **Business Stakeholder:** Responsible for using the model to make decisions for the business or product, and responsible for the business value that the model is expected to generate. ----- #### Process Together, these people develop and maintain ML applications. While the development process follows a distinct pattern, it is not entirely monolithic. The way you deploy a model has an impact on the steps you take, and using techniques like reinforcement learning or online learning will change some details. Nevertheless, these steps and personas involved are variations on a core theme, as illustrated in Figure 1 above. Let’s walk through the process step by step. Keep in mind that this is an iterative process, the frequency of which will be determined by the particular business case and data. **ML PROCESS:** Data Preparation → Exploratory Data Analysis → Feature Engineering → Model Training → Model Validation → Deployment → Monitoring ###### Data preparation Prior to any data science or ML work lies the data engineering needed to prepare production data and make it available for consumption. This data may be referred to as “raw data,” and in later steps, data scientists will extract features and labels from the raw data. ###### Exploratory data analysis (EDA) Analysis is conducted by data scientists to assess statistical properties of the data available, and determine if they address the business question. This requires frequent communication and iteration with business stakeholders. 
----- ###### Feature engineering Data scientists clean data and apply business logic and specialized transformations to engineer features for model training. These data, or features, are split into training, testing and validation sets. ###### Model training Data scientists explore multiple algorithms and hyperparameter configurations using the prepared data, and a best-performing model is determined according to predefined evaluation metric(s). ###### Model validation Prior to deployment a selected model is subjected to a validation step to ensure that it exceeds some baseline level of performance, in addition to meeting any other technical, business or regulatory requirements. This necessitates collaboration between data scientists, business stakeholders and ML engineers. ###### Deployment ML engineers will deploy a validated model via batch, streaming or online serving, depending on the requirements of the use case. ###### Monitoring ML engineers will monitor deployed models for signs of performance degradation or errors. Data scientists will often be involved in early monitoring phases to ensure that new models perform as expected after deployment. This will inform if and when the deployed model should be updated by returning to earlier stages in the workflow. The data governance officer is ultimately responsible for making sure this entire process is compliant with company and regulatory policies. ----- #### Why should I care about MLOps? Consider that the typical ML application depends on the aforementioned people and process, as well as regulatory and ethical requirements. These dependencies change over time — and your models, data and code must change as well. The data that were a reliable signal yesterday become noise; open source libraries become outdated; regulatory environments evolve; and teams change. ML systems must be resilient to these changes. Yet this broad scope can be a lot for organizations to manage — there are many moving parts! 
Addressing these challenges with a defined MLOps strategy can dramatically reduce the iteration cycle of delivering models to production, thereby accelerating time to business value. There are two main types of risk in ML systems: **technical risk** inherent to the system itself and **risk of** **noncompliance** with external systems. Both of these risks derive from the dependencies described above. For example, if data pipeline infrastructure, KPIs, model monitoring and documentation are lacking, then you risk your system becoming destabilized or ineffective. On the other hand, even a well-designed system that fails to comply with corporate, regulatory and ethical requirements runs the risk of losing funding, receiving fines or incurring reputational damage. Recently, one private company’s data collection practices were found to have violated the Children’s Online Privacy Protection Rule (COPPA). The [FTC fined](https://www.protocol.com/policy/ftc-algorithm-destroy-data-privacy) the company $1.5 million and [ordered](https://www.ftc.gov/system/files/ftc_gov/pdf/wwkurbostipulatedorder.pdf) it to destroy or delete the illegally harvested data, and all models or algorithms developed with that data. With respect to efficiency, the absence of MLOps is typically marked by an overabundance of manual processes. These steps are slower and more prone to error, affecting the quality of models, data and code. Eventually they form a bottleneck, capping the ability for a data team to take on new projects. Seen through these lenses, the aim of MLOps becomes clear: improve the long-term performance stability and success rate of ML systems while maximizing the efficiency of teams who build them. In the introduction, we defined MLOps to address this aim: MLOps is a **set of processes and automation** to manage **models, data and code** to meet the two goals of **stable performance and long-term efficiency in** **ML systems** . _MLOps = ModelOps + DataOps + DevOps_ . 
With clear goals we are ready to discuss principles that guide design decisions and planning for MLOps. ----- #### Guiding principles Given the complexity of ML processes and the different personas involved, it is helpful to start from simpler, high-level guidance. We propose several broadly applicable principles to guide MLOps decisions. They inform our design choices in later sections, and we hope they can be adapted to support whatever your business use case may be. ###### Always keep your business goals in mind Just as the core purpose of ML in a business is to enable data-driven decisions and products, the core purpose of MLOps is to ensure that those data-driven applications remain stable, are kept up to date and continue to have positive impacts on the business. When prioritizing technical work on MLOps, consider the business impact: Does it enable new business use cases? Does it improve data teams’ productivity? Does it reduce operational costs or risks? ###### Take a data-centric approach to machine learning Feature engineering, training, inference and monitoring pipelines are data pipelines. As such, they need to be as robust as other production data engineering processes. Data quality is crucial in any ML application, so ML data pipelines should employ systematic approaches to monitoring and mitigating data quality issues. Avoid tools that make it difficult to join data from ML predictions, model monitoring, etc., with the rest of your data. The simplest way to achieve this is to develop ML applications on the same platform used to manage production data. For example, instead of downloading training data to a laptop, where it is hard to govern and reproduce results, secure the data in cloud storage and make that storage available to your training process. ----- ###### Implement MLOps in a modular fashion As with any software application, code quality is paramount for an ML application. 
Modularized code enables testing of individual components and mitigates difficulties with future code refactoring. Define clear steps (e.g., training, evaluation or deployment), supersteps (e.g., training-to-deployment pipeline) and responsibilities to clarify the modular structure of your ML application. ###### Process should guide automation We automate processes to improve productivity and lower risk of human error, but not every step of a process can or should be automated. People still determine the business question, and some models will always need human oversight before deployment. Therefore, the development process is primary and each module in the process should be automated as needed. This allows incremental build-out of automation and customization. Furthermore, when it comes to particular automation tools, choose those that align to your people and process. For example, instead of building a model logging framework around a generic database, you can choose a specialized tool like MLflow, which has been designed with the ML model lifecycle in mind. ----- **CHAPTER 2:** ## Fundamentals of MLOps **Note:** In our experience with customers, there can be variations in these three stages, such as splitting staging into separate “test” and “QA” substages. However, the principles remain the same and we stick to a dev, staging and prod setup within this paper. #### Semantics of dev, staging and prod ML workflows include the following key assets: code, models and data. These assets need to be developed (dev), tested (staging) and deployed (prod). For each stage, we also need to operate within an execution environment. Thus, all the above — execution environments, code, models and data — are divided into dev, staging and prod. These divisions can best be understood in terms of quality guarantees and access control. On one end, assets in prod are generally business critical, with the highest guarantee of quality and tightest control on who can modify them. 
Conversely, dev assets are more widely accessible to people but offer no guarantee of quality. For example, many data scientists will work together in a dev environment, freely producing dev model prototypes. Any flaws in these models are relatively low risk for the business, as they are separate from the live product. In contrast, the staging environment replicates the execution environment of production. Here, code changes made in the dev environment are tested prior to code being deployed to production. The staging environment acts as a gateway for code to reach production, and accordingly, fewer people are given access to staging. Code promoted to production is considered a live product. In the production environment, human error can pose the greatest risk to business continuity, and so the least number of people have permission to modify production models. One might be tempted to say that code, models and data each share a one-to-one correspondence with the execution environment — e.g., all dev code, models and data are in the dev environment. That is often close to true but is rarely correct. Therefore, we will next discuss the precise semantics of dev, staging and prod for execution environments, code, models and data. We also discuss mechanisms for restricting access to each. ----- ###### Execution environments An execution environment is the place where models and data are created or consumed by code. Each execution environment consists of compute instances, their runtimes and libraries, and automated jobs. With Databricks, an “environment” can be defined via dev/staging/prod separation at a few levels. An organization could create distinct environments across multiple cloud accounts, multiple Databricks workspaces in the same cloud account, or within a single Databricks workspace. These separation patterns are illustrated in Figure 2 below. 
**Figure 2:** Environment separation patterns. [Diagram: dev, staging and prod separated by (1) multiple cloud accounts, (2) multiple Databricks workspaces, or (3) access controls within a single Databricks workspace.] ----- ###### Code ML project code is often stored in a version control repository (such as Git), with most organizations using branches corresponding to the lifecycle phases of development, staging or production. There are a few common patterns. Some use only development branches (dev) and one main branch (staging/prod). Others use main and development branches (dev), branches cut for testing potential releases (staging), and branches cut for final releases (prod). Regardless of which convention you choose, separation is enforced through Git repository branches. As a best practice, code should only be run in an execution environment that corresponds to it or in one that’s higher. For example, the dev environment can run any code, but the prod environment can only run prod code. ###### Models While models are usually marked as dev, staging or prod according to their lifecycle phase, **it is important to note that model and code lifecycle phases often operate asynchronously**. That is, you may want to push a new model version before you push a code change, and vice versa. Consider the following scenarios: To detect fraudulent transactions, you develop an ML pipeline that retrains a model weekly. 
Deploying the code can be a relatively infrequent process, but each week a new model undergoes its own lifecycle of being generated, tested and marked as “production” to predict on the most recent transactions. In this case the code lifecycle is slower than the model lifecycle. To classify documents using large deep neural networks, training and deploying the model is often a one-time process due to cost. Updates to the serving and monitoring code in the project may be deployed more frequently than a new version of the model. In this case the model lifecycle is slower than the code. Since model lifecycles do not correspond one-to-one with code lifecycles, it makes sense for model management to have its own service. [MLflow](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.1yd956s4db32) and its Model Registry support managing model artifacts directly via UI and APIs. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases. Model artifacts are secured using MLflow access controls or cloud storage permissions. **Note:** Databricks released Delta Lake to the open source community in 2019. Delta Lake provides all the data lifecycle management functions that are needed to make cloud-based object stores reliable and performant. This design allows clients to update multiple objects at once and to replace a subset of the objects with another, etc., in a serializable manner that still achieves high parallel read/write performance from the objects, while offering advanced capabilities like time travel (e.g., query point-in-time snapshots or rollback of erroneous updates), automatic data layout optimization, upserts, caching and audit logs. ----- ###### Data Some organizations label data as either dev, staging or prod, depending on which environment it originated in. For example, all prod data is produced in the prod environment, but dev and staging environments may have read-only access to them. Marking data this way also indicates a guarantee of data quality: dev data may be temporary or not meant for wider use, whereas prod data may offer stronger guarantees around reliability and freshness. 
Access to data in each environment is controlled with table access controls ([AWS](https://docs.databricks.com/security/access-control/table-acls/index.html) | [Azure](https://docs.microsoft.com/en-us/azure/databricks/security/access-control/table-acls/) | [GCP](https://docs.gcp.databricks.com/security/access-control/table-acls/index.html)) or cloud storage permissions. In summary, when it comes to MLOps, you will always have operational separation between dev, staging and prod. Assets in dev will have the least restrictive access controls and quality guarantees, while those in prod will be the highest quality and tightly controlled. |ASSET|SEMANTICS|SEPARATED BY| |---|---|---| |Execution environments|Labeled according to where development, testing and connections with production systems happen|Cloud provider and Databricks Workspace access controls| |Models|Labeled according to model lifecycle phase|MLflow access controls or cloud storage permissions| |Data|Labeled according to its origin in dev, staging or prod execution environments|Table access controls or cloud storage permissions| |Code|Labeled according to software development lifecycle phase|Git repository branches| **Table 1** ----- #### ML deployment patterns The fact that models and code can be managed separately results in multiple possible patterns for getting ML artifacts through staging and into production. We explain two major patterns below. **DEPLOY MODELS:** dev → staging → prod **DEPLOY CODE:** dev → staging → prod These two patterns differ in terms of whether the model artifact or the training code that produces the model artifact is promoted toward production. ----- ###### Deploy models In the first pattern, the model artifact is generated by training code in the development environment. This artifact is then tested in staging for compliance and performance before finally being deployed into production. 
This is a simpler handoff for data scientists, and in cases where model training is prohibitively expensive, training the model once and managing that artifact may be preferable. However, this simpler architecture comes with limitations. If production data is not accessible from the development environment (e.g., for security reasons), this architecture may not be viable. This architecture does not naturally support automated model retraining. While you could automate retraining in the development environment, you would then be treating “dev” training code as production ready, which many deployment teams would not accept. This option hides the fact that ancillary code for featurization, inference and monitoring needs to be deployed to production, requiring a separate code deployment path. ###### Deploy code In the second pattern, the code to train models is developed in the dev environment, and this code is moved to staging and then production. Models will be trained in each environment: initially in the dev environment as part of model development, in staging (on a limited subset of data) as part of integration tests, and finally in the production environment (on the full production data) to produce the final model. If an organization restricts data scientists’ access to production data from dev or staging environments, deploying code allows training on production data while respecting access controls. Since training code goes through code review and testing, it is safer to set up automated retraining. Ancillary code follows the same pattern as model training code, and both can go through integration tests in staging. However, the learning curve for handing code off to collaborators can be steep for many data scientists, so opinionated project templates and workflows are helpful. Finally, data scientists need visibility into training results from the production environment, for only they have the knowledge to identify and fix ML-specific issues. 
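The “deploy code” pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the reference implementation: `EnvConfig`, `ENVIRONMENTS`, `train` and `run_pipeline` are hypothetical names, the sample fractions are made up, and the “model” is just a mean so that the example stays self-contained. The point it tries to show is that the same training code is promoted unchanged, and only the environment configuration (here, how much data is used) differs.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical per-environment settings; names and values are illustrative only.
@dataclass(frozen=True)
class EnvConfig:
    name: str
    sample_fraction: float  # share of the available rows to train on


ENVIRONMENTS = {
    'dev': EnvConfig('dev', 1.0),          # whatever data the dev environment can see
    'staging': EnvConfig('staging', 0.1),  # limited subset, used for integration tests
    'prod': EnvConfig('prod', 1.0),        # full production data
}


def train(rows, config):
    # Stand-in for real training: the 'model' is just the mean of the sampled rows.
    n = max(1, int(len(rows) * config.sample_fraction))
    sample = rows[:n]
    return {'env': config.name, 'n_rows': n, 'mean': mean(sample)}


def run_pipeline(env, rows):
    # The same code path runs in every environment; only the config differs.
    return train(rows, ENVIRONMENTS[env])
```

Calling `run_pipeline('staging', rows)` and `run_pipeline('prod', rows)` exercises the same `train` function, which is what makes automated retraining in production safer once the code has passed review and staging tests.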
----- The diagram below contrasts the code lifecycle for the above deployment patterns across the different execution environments. [Figure: code lifecycle stages (code development, unit tests, integration tests, model training, continuous deployment, deploy pipelines) placed in the development, staging and production environments for the “deploy models” and “deploy code” patterns.] **In general we recommend following the “deploy code” approach, and the reference architecture in this document is aligned to it.** Nevertheless, there is no perfect process that covers every scenario, and the options outlined above are not mutually exclusive. Within a single organization, you may find some use cases deploying training code and others deploying model artifacts. Your choice of process will depend on the business use case, resources available and what is most likely to succeed. ----- | | |DEPLOY MODELS|DEPLOY CODE| |---|---|---|---| |Process|Dev|Develop training code. Develop ancillary code.¹ Train model on prod data. Promote model and ancillary code.|Develop training code. Develop ancillary code. Promote code.| ||Staging|Test model and ancillary code. Promote model and ancillary code.|Train model on data subset. Test ancillary code. Promote code.| ||Prod|Deploy model. Deploy ancillary pipelines.|Train model on prod data. Test model. Deploy model. 
Deploy ancillary pipelines.| |Trade-offs|Automation| Does not support automated retraining in locked-down env.| Supports automated retraining in locked-down env.| ||Data access control| Dev env needs read access to prod training data.| Only prod env needs read access to prod training data.| ||Reproducible models| Less eng control over training env, so harder to ensure reproducibility.| Eng control over training env, which helps to simplify reproducibility.| ||Data science familiarity| DS team builds and can directly test models in their dev env.| DS team must learn to write and hand off modular code to eng.| ||Support for large projects| This pattern does not force the DS team to use modular code for model training, and it has less iterative testing.| This pattern forces the DS team to use modular code and iterative testing, which helps with coordination and development in larger projects.| ||Eng setup and maintenance| Has the simplest setup, with less CI/CD infra required.| Requires CI/CD infra for unit and integration tests, even for one-off models.| |When to use||Use this pattern when your model is a one-off or when model training is very expensive. Use when dev, staging and prod are not strictly separated envs.|Use this pattern by default. Use when dev, staging and prod are strictly separated envs.| **Table 2** **1** “Ancillary code” refers to code for ML pipelines other than the model training pipeline. Ancillary code could be featurization, inference, monitoring or other pipelines. ----- **CHAPTER 3:** ## MLOps Architecture  and Process ###### Lakehouse Platform #### Architecture components Before unpacking the reference architecture, take a moment to familiarize yourself with the Databricks features used to facilitate MLOps in the workflow prescribed. 
###### Data Lakehouse A [Data Lakehouse architecture](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) unifies the best elements of data lakes and data warehouses, delivering data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes. Data in the lakehouse are typically organized using a “medallion” architecture of Bronze, Silver and Gold tables of increasing refinement and quality. [Figure: Lakehouse Platform. Workloads (Data Warehousing, Data Engineering, Data Streaming, Data Science and ML) on top of Unity Catalog (fine-grained governance for data and AI), Delta Lake (data reliability and performance) and a cloud data lake (all structured and unstructured data).] ###### MLflow [MLflow](https://www.mlflow.org/) is an open source project for managing the end-to-end machine learning lifecycle. It has the following primary components: **Tracking:** Allows you to track experiments to record and compare parameters, metrics and model artifacts. See documentation for [AWS](https://docs.databricks.com/applications/mlflow/tracking.html) | [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/tracking) | [GCP](https://docs.gcp.databricks.com/applications/mlflow/tracking.html). **Models (“MLflow flavors”):** Allows you to store and deploy models from any ML library to a variety of model serving and inference platforms. See documentation for [AWS](https://docs.databricks.com/applications/mlflow/models.html) | [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/models) | [GCP](https://docs.gcp.databricks.com/applications/mlflow/models.html). **Model Registry:** Provides a centralized model store for managing models’ full lifecycle stage transitions: from staging to production, with capabilities for versioning and annotating. The registry also provides webhooks for automation and continuous deployment. 
See documentation for [AWS](https://docs.databricks.com/applications/mlflow/model-registry.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/model-registry) and [GCP](https://docs.gcp.databricks.com/applications/mlflow/model-registry.html).

Databricks also provides a fully managed and hosted version of MLflow with enterprise security features, high availability, and other Databricks workspace features such as experiment and run management and notebook revision capture. MLflow on Databricks offers an integrated experience for tracking and securing machine learning model training runs and running machine learning projects.

-----

###### Databricks and MLflow Autologging

Databricks Autologging is a no-code solution that extends [MLflow automatic logging](https://mlflow.org/docs/latest/tracking.html#automatic-logging) to deliver automatic experiment tracking for machine learning training sessions on Databricks. Databricks Autologging automatically captures model parameters, metrics, files and lineage information when you train models with training runs recorded as MLflow tracking runs. See documentation for [AWS](https://docs.databricks.com/applications/mlflow/databricks-autologging.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/databricks-autologging) and [GCP](https://docs.gcp.databricks.com/applications/mlflow/databricks-autologging.html).

###### Feature Store

The Databricks Feature Store is a centralized repository of features. It enables feature sharing and discovery across an organization and also ensures that the same feature computation code is used for model training and inference. See documentation for [AWS](https://docs.databricks.com/applications/machine-learning/feature-store/index.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/machine-learning/feature-store/) and [GCP](https://docs.gcp.databricks.com/applications/machine-learning/feature-store/index.html).
###### MLflow Model Serving

MLflow Model Serving allows you to host machine learning models from Model Registry as REST endpoints that are updated automatically based on the availability of model versions and their stages. See documentation for [AWS](https://docs.databricks.com/applications/mlflow/model-serving.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/model-serving) and [GCP](https://docs.gcp.databricks.com/applications/mlflow/model-serving.html).

###### Databricks SQL

Databricks SQL provides a simple experience for SQL users who want to run quick ad hoc queries on their data lake, create multiple visualization types to explore query results from different perspectives, and build and share dashboards. See documentation for [AWS](https://docs.databricks.com/sql/index.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/sql/) and [GCP](https://docs.gcp.databricks.com/sql/index.html).

###### Databricks Workflows and Jobs

Databricks Workflows (Jobs and Delta Live Tables) can execute pipelines in automated, non-interactive ways. For ML, Jobs can be used to define pipelines for computing features, training models, or other ML steps or pipelines. See documentation for [AWS](https://docs.databricks.com/data-engineering/jobs/index.html), [Azure](https://docs.microsoft.com/en-us/azure/databricks/data-engineering/jobs/) and [GCP](https://docs.gcp.databricks.com/data-engineering/jobs/index.html).

-----

#### Reference architecture

We are now ready to review a general reference architecture for implementing MLOps on the Databricks Lakehouse platform using the recommended “deploy code” pattern from earlier. This is intended to cover the majority of use cases and ML techniques, but it is by no means comprehensive. When appropriate, we will highlight alternative approaches to implementing different parts of the process.
We begin with an overview of the system end-to-end, followed by more detailed views of the process in development, staging and production environments. These diagrams show the system as it operates in a steady state, with the finer details of iterative development cycles omitted. This structure is summarized below.

**OVERVIEW**

- **dev:** Data; Exploratory data analysis (EDA); Project code; Feature table refresh; Model training; Commit code
- **staging:** Merge request; Unit tests (CI); Integration tests (CI); Merge; Cut release branch
- **prod:** Feature table refresh; Model training; Continuous deployment (CD); Online serving (REST APIs); Inference: batch or streaming; Monitoring; Retraining

-----

###### Overview

**Figure 3:** Overview diagram. It shows source control (dev, staging (main) and release branches, with a merge request to staging, a cut release branch and a pull from the release branch to production); the development, staging and production environments with their pipelines; the Model Registry stages (None, Staging, Production); and the Lakehouse data tables, feature tables and metrics tables.

Here we see the overall process for deploying code and model artifacts, the inputs and outputs for pipelines, and model lifecycle stages in production. Code source control is the primary conduit for deploying ML pipelines from development to production. Pipelines and models are prototyped on a dev branch in the development environment, and changes to the codebase are committed back to source control.
Upon merge request to the staging branch (usually the “main” branch), a continuous integration (CI) process tests the code in the staging environment. If the tests pass, new code can be deployed to production by cutting a code release. In production, a model is trained on the full production data and pushed to the MLflow Model Registry. A continuous deployment (CD) process tests the model and promotes it toward the production stage in the registry. The Model Registry’s production model can be served via batch, streaming or REST API.

-----

###### Dev

In the development environment, data scientists and ML engineers can collaborate on all pipelines in an ML project, committing their changes to source control. While engineers may help to configure this environment, data scientists typically have significant control over the libraries, compute resources and code that they use.

**Figure 4:** Development environment diagram. It shows exploratory data analysis; the dev branch in source control (create dev branch, commit code); project code (train.py, deploy.py, inference.py, monitoring.py, featurization.py, plus unit and integration tests); the feature table refresh pipeline (data preparation, featurization); the model training pipeline (training and tuning, evaluation); inference (streaming or batch); the tracking server (metrics, parameters, models); and Lakehouse storage (Bronze / Silver / Gold prod data, feature tables, and temp tables with dev data).

-----

###### Data

Data scientists working in the dev environment possess read-only access to production data. They also require read-write access to a separate dev storage environment to develop and experiment with new features and other data tables.

###### Exploratory data analysis (EDA)

The data scientist explores and analyzes data in an interactive, iterative process. This process is used to assess whether the available data has the potential to address the business problem.
EDA is also where the data scientist will begin discerning what data preparation and featurization are required for model training. This ad hoc process is generally not part of a pipeline that will be deployed in other execution environments.

###### Project code

This is a code repository containing all of the pipelines or modules involved in the ML system. Dev branches are used to develop changes to existing pipelines or to create new ones. Even during EDA and initial phases of a project, it is recommended to develop within a repository to help with tracking changes and sharing code.

-----

###### Feature table refresh

This pipeline reads from raw data tables and feature tables and writes to tables in the Feature Store. The pipeline consists of two steps:

**Data preparation** This step checks for and corrects any data quality issues prior to featurization.

**Featurization** In the dev environment, new features and updated featurization logic can be tested by writing to feature tables in dev storage, and these dev feature tables can be used for model prototyping. Once this featurization code is promoted to production, these changes will affect the production feature tables. Features already available in production feature tables can be read directly for development.

In some organizations, feature engineering pipelines are managed separately from ML projects. In such cases, the featurization pipeline can be omitted from this architecture.

-----

###### Model training

Data scientists develop the model training pipeline in the dev environment with dev or prod feature tables.

**Training and tuning** The training process reads features from the feature store and/or Silver- or Gold-level Lakehouse tables, and it logs model parameters, metrics and artifacts to the [MLflow tracking server](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.1yd956s4db32).
After training and hyperparameter tuning, the final model artifact is logged to the tracking server to record a robust link between the model, its input data, and the code used to generate it.

**Evaluation** Model quality is evaluated by testing on held-out data. The results of these tests are logged to the MLflow tracking server. If governance requires additional metrics or supplemental documentation about the model, this is the time to add them using MLflow tracking. Model interpretations (e.g., plots produced by [SHAP](https://shap.readthedocs.io/en/latest/index.html) or [LIME](https://arxiv.org/abs/1602.04938)) and plain text descriptions are common, but defining the specifics for such governance requires input from business stakeholders or a data governance officer.

**Model output** The output of this pipeline is an ML model artifact stored in the MLflow tracking server. When this training pipeline is run in staging or production, ML engineers (or their CI/CD code) can load the model via the model URI (or path) and then push the model to the Model Registry for management and testing.

###### Commit code

After developing code for featurization, training, inference and other pipelines, the data scientist or ML engineer commits the dev branch changes into source control. This section does not discuss the continuous deployment, inference or monitoring pipelines in detail; see the “Prod” section below for more information on those.

-----

###### Staging

The transition of code from development to production occurs in the staging environment. This code includes model training code and ancillary code for featurization, inference, etc. Both data scientists and ML engineers are responsible for writing tests for code and models, but ML engineers manage the continuous integration pipelines and orchestration.
**Figure 5:** Staging environment diagram. It shows source control (dev, staging (main) and release branches, with a merge request to staging and a cut release branch); the CI trigger and merge; unit tests (CI); integration tests (CI) covering Feature Store tests, model training tests, model deployment tests, inference tests and model monitoring tests; the tracking server and Model Registry; and Lakehouse storage (feature tables and temp tables with staging data).

-----

###### Data

The staging environment may have its own storage area for testing feature tables and ML pipelines. This data is generally temporary and only retained long enough to run tests and to investigate test failures. This data can be made readable from the development environment for debugging.

###### Merge code

**Merge request** The deployment process begins when a merge (or pull) request is submitted against the staging branch of the project in source control. It is common to use the “main” branch as the staging branch.

**Unit tests (CI)** This merge request automatically builds source code and triggers unit tests. If tests fail, the merge request is rejected.

-----

###### Integration tests (CI)

The merge request then goes through integration tests, which run all pipelines to confirm that they function correctly together. The staging environment should mimic the production environment as much as is reasonable, running and testing pipelines for featurization, model training, inference and monitoring.

Integration tests can trade off fidelity of testing for speed and cost. For example, when models are expensive to train, it is common to test model training on small data sets or for fewer iterations to reduce cost. When models are deployed behind REST APIs, some high-SLA models may merit full-scale load testing within these integration tests, whereas other models may be tested with small batch jobs or a few queries to temporary REST endpoints.

Once integration tests pass on the staging branch, the code may be promoted toward production.
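As a rough illustration of trading test fidelity for speed, the sketch below trains on a small seeded sample for only a few iterations and checks that the result is sane. The `train_model` function, the toy linear data and the tolerance are all hypothetical stand-ins chosen for this example; a real integration test would call the project's own training pipeline against temporary staging tables.

```python
import random
from typing import List, Tuple

Row = Tuple[float, float]  # (feature, label)


def train_model(rows: List[Row], n_iterations: int, learning_rate: float = 0.5) -> float:
    """Hypothetical training step: fit y = w * x with plain gradient descent."""
    w = 0.0
    for _ in range(n_iterations):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in rows) / len(rows)
        w -= learning_rate * grad
    return w


def run_training_smoke_test(full_data: List[Row], sample_size: int = 50,
                            n_iterations: int = 30, seed: int = 0) -> float:
    """Train on a small, seeded sample with few iterations to keep CI cheap."""
    rng = random.Random(seed)
    sample = rng.sample(full_data, min(sample_size, len(full_data)))
    w = train_model(sample, n_iterations)
    # Sanity check only: the toy data follows y = 3x, so w should be near 3.
    assert abs(w - 3.0) < 0.1, f"unexpected weight {w}"
    return w


# Toy stand-in for staging data (assumption for illustration).
staging_data = [(i / 1000, 3 * i / 1000) for i in range(1000)]
weight = run_training_smoke_test(staging_data)
```

Fixing the seed keeps the test reproducible, and the sample size and iteration count are the two knobs that set how much fidelity is given up for speed.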

**Merge** If all tests pass, the new code is merged into the staging branch of the project. If tests fail, the CI/CD system should notify users and post results on the merge (pull) request.

Note: It can be useful to schedule periodic integration tests on the staging branch, especially if the branch is updated frequently with concurrent merge requests.

###### Cut release branch

Once CI tests have passed on a commit in the staging branch, ML engineers can cut a release branch from that commit.

-----

###### Prod

The production environment is typically managed by a select set of ML engineers and is where ML pipelines directly serve the business or application. These pipelines compute fresh feature values, train and test new model versions, publish predictions to downstream tables or applications, and monitor the entire process to avoid performance degradation and instability. While we illustrate batch and streaming inference alongside online serving below, most ML applications will use only one of these methods, depending on the business requirements.

**Figure 6**
*Production environment diagram. It shows the Model Registry (Stage: None, Stage: Staging, Stage: Production); feature table refresh (data preparation, featurization); model training (training and tuning, evaluation, register and request transition); continuous deployment (compliance checks, compare Staging vs. Production, request model transition to Production); online serving (load model for online serving, enable online serving, log requests and predictions); inference, batch or streaming (data ingest, model inference, publish predictions); monitoring (data ingest, check model performance and data drift, publish metrics, trigger model training); and the Lakehouse data tables, feature tables and monitoring tables.*

-----

Though data scientists may not have write or compute access in the production environment, it is important to provide them with visibility to test results, logs, model artifacts and the status of ML pipelines in production. This visibility allows them to identify and diagnose problems in production.

###### Feature table refresh

This pipeline transforms the latest production Lakehouse data into production feature tables. It can use batch or streaming computation, depending on the freshness requirements for downstream training and inference. The pipeline can be defined as a [Databricks Job](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.rxs6npet1ull) which is scheduled, triggered or continuously running.

###### Model training

The model training pipeline runs either when code changes affect upstream featurization or training logic, or when automated retraining is scheduled or triggered. This pipeline runs on the full production data.

**Training and tuning** During the training process, logs are recorded to the [MLflow tracking server](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.1yd956s4db32). These include model metrics, parameters, tags and the model itself. During development, data scientists may test many algorithms and hyperparameters, but it is common to restrict those choices to the top-performing options in the production training code. Restricting tuning can reduce the variance from tuning in automated retraining, and it can make training and tuning faster.

**Evaluation** Model quality is evaluated by testing on held-out production data. The results of these tests are logged to the MLflow tracking server. During development, data scientists will have selected meaningful evaluation metrics for the use case, and those metrics or their custom logic will be used in this step.

**Register and request transition** Following model training, the model artifact is registered to the [MLflow Model Registry](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.1yd956s4db32) of the production environment, set initially to ‘stage=None’. The final step of this pipeline is to request a transition of the model to ‘stage=Staging’.

-----

###### Continuous deployment (CD)

The CD pipeline is executed when the training pipeline finishes and requests to transition the model to ‘stage=Staging’. There are three key tasks in this pipeline:

**Compliance checks** These tests load the model from the Model Registry, perform compliance checks (for tags, documentation, etc.), and approve or reject the request based on test results. If compliance checks require human expertise, this automated step can compute statistics or visualizations for people to review in a manual approval step at the end of the CD pipeline.
Regardless of the outcome, results for that model version are recorded to the Model Registry through metadata in tags and comments in descriptions.

The MLflow UI can be used to manage stage transition requests manually, but requests and transitions can be automated via MLflow APIs and [webhooks](https://docs.databricks.com/applications/mlflow/model-registry-webhooks.html). If the model passes the compliance checks, then the transition request is approved and the model is promoted to ‘stage=Staging’. If the model fails, the transition request is rejected and the model is moved to ‘stage=Archived’ in the Model Registry.

**Compare staging vs. production** To prevent performance degradation, models promoted to ‘stage=Staging’ must be compared to the ‘stage=Production’ models they are meant to replace. The metric(s) for comparison should be defined according to the use case, and the method for comparison can vary from canary deployments to A/B tests. All comparison results are saved to metrics tables in the lakehouse.

If this is the first deployment and there is no ‘stage=Production’ model yet, the ‘stage=Staging’ model should be compared to a business heuristic or other threshold as a baseline. For a new version of an existing ‘stage=Production’ model, the ‘stage=Staging’ model is compared with the current ‘stage=Production’ model.

-----

**Request model transition to production** If the candidate model passes the comparison tests, a request is made to transition it to ‘stage=Production’ in the Model Registry. As with other stage transition requests, notifications, approvals and rejections can be managed manually via the MLflow UI or automatically through APIs and webhooks. This is also a good time to consider human oversight, as it is the last step before a model is fully available to downstream applications. A person can manually review the compliance checks and performance comparisons to perform checks which are difficult to automate.
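The decision logic of the three CD tasks above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the tag-based compliance rule, the "at least as good as the reference" comparison and the choice to leave a non-improving model in Staging are all hypothetical, and a real pipeline would read and write this state through the Model Registry rather than return strings.

```python
from typing import Dict, Optional, Set


def missing_required_tags(tags: Dict[str, str], required: Set[str]) -> Set[str]:
    """Compliance check: return required tags that are absent or empty."""
    return {name for name in required if not tags.get(name)}


def beats_reference(candidate: float, production: Optional[float],
                    baseline: float, higher_is_better: bool = True) -> bool:
    """Compare the candidate to production, or to a baseline on first deployment."""
    reference = baseline if production is None else production
    return candidate >= reference if higher_is_better else candidate <= reference


def cd_decision(tags: Dict[str, str], required: Set[str], candidate: float,
                production: Optional[float], baseline: float) -> str:
    """Return the stage the candidate should be moved to or requested for."""
    if missing_required_tags(tags, required):
        return "Archived"    # failed compliance checks
    if beats_reference(candidate, production, baseline):
        return "Production"  # request the transition to production
    return "Staging"         # assumption: compliant but not better stays in Staging


# Hypothetical example values.
required = {"owner", "documentation"}
tags = {"owner": "ml-team", "documentation": "model card link"}
first_release = cd_decision(tags, required, candidate=0.91, production=None, baseline=0.85)
```

Returning a requested stage rather than performing the transition leaves room for the manual approval step described above.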
###### Online serving (REST APIs)

For lower throughput and lower latency use cases, online serving is generally necessary. With MLflow, it is simple to deploy models to [Databricks Model Serving](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.72shqep1kelf), cloud provider serving endpoints, or on-prem or custom serving layers.

In all cases, the serving system loads the production model from the Model Registry upon initialization. On each request, it fetches features from an online Feature Store, scores the data and returns predictions. The serving system, data transport layer or the model itself could log requests and predictions.

###### Inference: batch or streaming

This pipeline is responsible for reading the latest data from the Feature Store, loading the model from ‘stage=Production’ in the Model Registry, performing inference and publishing predictions. For higher throughput, higher latency use cases, batch or streaming inference is generally the most cost-effective option.

A batch job would likely publish predictions to Lakehouse tables, over a JDBC connection, or to flat files. A streaming job would likely publish predictions either to Lakehouse tables or to message queues like Apache Kafka®.

-----

###### Monitoring

Input data and model predictions are monitored, both for statistical properties (data drift, model performance, etc.) and for computational performance (errors, throughput, etc.). These metrics are published for dashboards and alerts.

**Data ingestion** This pipeline reads in logs from batch, streaming or online inference.

**Check accuracy and data drift** The pipeline then computes metrics about the input data, the model’s predictions and the infrastructure performance. Metrics that measure statistical properties are generally chosen by data scientists during development, whereas metrics for infrastructure are generally chosen by ML engineers.

**Publish metrics** The pipeline writes to Lakehouse tables for analysis and reporting. Tools such as [Databricks SQL](https://docs.google.com/document/d/1yCODhUuimWJHR8Sc-sd6xY7vJuN6nPek2pNrfhv7hU4/edit#heading=h.nsthucrt9k77) are used to produce monitoring dashboards, allowing for health checks and diagnostics. The monitoring job or the dashboarding tool issues notifications when health metrics surpass defined thresholds.

**Trigger model training** When the model monitoring metrics indicate performance issues, or when a model inevitably becomes out of date, the data scientist may need to return to the development environment and develop a new model version.

-----

###### Retraining

This architecture supports automatic retraining using the same model training pipeline above. While we recommend beginning with manually triggered retraining, organizations can add scheduled and/or triggered retraining when needed.

**Scheduled** If fresh data are regularly made available, rerunning model training on a defined schedule can help models to keep up with changing trends and behavior.

**Triggered** If the monitoring pipeline can identify model performance issues and send alerts, it can additionally trigger retraining. For example, if the distribution of incoming data changes significantly or if the model performance degrades, automatic retraining and redeployment can boost model performance with minimal human intervention.

When the featurization or retraining pipelines themselves begin to exhibit performance issues, the data scientist may need to return to the dev environment and resume experimentation to address such issues.

**Note:** While automated retraining is supported in this architecture, it isn’t required, and caution must be taken in cases where it is implemented. It is inherently difficult to automate selecting the correct action to take from model monitoring alerts.
For example, if data drift is observed, does it indicate that we should automatically retrain, or does it indicate that we should engineer additional features to encode some new signal in the data?

-----

**CHAPTER 4:**

## LLMOps – Large Language Model Operations

#### Large language models

LLMs have splashed into the mainstream of business and news, and there is no doubt that they will disrupt countless industries. In addition to bringing great potential, they present a new set of questions for MLOps:

- Is prompt engineering part of operations, and if so, what is needed?
- Since the “large” in “LLM” is an understatement, how do cost/performance trade-offs change?
- Is it better to use paid APIs or to fine-tune one’s own model?
- …and many more!

The good news is that “LLMOps” (MLOps for LLMs) is not that different from traditional MLOps. However, some parts of your MLOps platform and process may require changes, and your team will need to learn a mental model of how LLMs coexist alongside traditional ML in your operations.

In this section, we will explain what may change for MLOps when introducing LLMs. We will discuss several key topics in detail, from prompt engineering to packaging, to cost/performance trade-offs. We also provide a reference architecture diagram to illustrate what may change in your production environment.

###### What changes with LLMs?

For those not familiar with large language models (LLMs), see [this summary](https://www.databricks.com/product/machine-learning/large-language-models) for a quick introduction. The one-sentence summary is: LLMs are a new class of natural language processing (NLP) models that have significantly surpassed their predecessors in performance across a variety of tasks, such as open-ended question answering, summarization and execution of near-arbitrary instructions.

From the perspective of MLOps, LLMs bring new requirements, with implications for MLOps practices and platforms.
We briefly summarize key properties of LLMs and the implications for MLOps here, and we delve into more detail in the next section.

-----

**Table 3**

|KEY PROPERTIES OF LLMS|IMPLICATIONS FOR MLOPS|
|---|---|
|LLMs are available in many forms: very general proprietary models behind paid APIs; open source models that vary from general to specific applications; custom models fine-tuned for specific applications.|**Development process:** Projects often develop incrementally, starting from existing, third-party or open source models and ending with custom fine-tuned models.|
|Many LLMs take general natural language queries and instructions as input. Those queries can contain carefully engineered “prompts” to elicit the desired responses.|**Development process:** Designing text templates for querying LLMs is often an important part of developing new LLM pipelines. **Packaging ML artifacts:** Many LLM pipelines will use existing LLMs or LLM serving endpoints; the ML logic developed for those pipelines may focus on prompt templates, agents or “chains” instead of the model itself. The ML artifacts packaged and promoted to production may frequently be these pipelines, rather than models.|
|Many LLMs can be given prompts with examples and context, or additional information to help answer the query.|**Serving infrastructure:** When augmenting LLM queries with context, it is valuable to use previously uncommon tooling such as vector databases to search for relevant context.|
|LLMs are very large deep learning models, often ranging from gigabytes to hundreds of gigabytes.|**Serving infrastructure:** Many LLMs may require GPUs for real-time model serving. **Cost/performance trade-offs:** Since larger models require more computation and are thus more expensive to serve, techniques for reducing model size and computation may be required.|
|LLMs are hard to evaluate via traditional ML metrics since there is often no single “right” answer.|**Human feedback:** Since human feedback is essential for evaluating and testing LLMs, it must be incorporated more directly into the MLOps process, both for testing and monitoring and for future fine-tuning.|

-----

The list above may look long, but as we will see in the next section, many existing tools and processes only require small adjustments in order to adapt to these new requirements. Moreover, many aspects do not change:

- The separation of development, staging and production remains the same
- Git version control and model registries remain the primary conduits for promoting pipelines and models toward production
- The lakehouse architecture for managing data remains valid and essential for efficiency
- Existing CI/CD infrastructure should not require changes
- The modular structure of MLOps remains the same, with pipelines for data refresh, model tuning, model inference, etc.

-----

#### Discussion of key topics for LLMOps

So far, we have listed top potential changes to MLOps as you introduce LLMs. In this section, we will dive into more details about selected topics.

###### Prompt engineering

Prompt engineering is the practice of adjusting the text prompt given to an LLM in order to elicit better responses, using engineering techniques. It is a very new practice, but some best practices are emerging. We will cover a few tips and best practices and link to useful resources.

**1** Prompts and prompt engineering are model-specific. A prompt given to two different models will generally _not_ produce the same results. Similarly, prompt engineering tips do not apply to all models. In the extreme case, many LLMs have been fine-tuned for specific NLP tasks and do not even require prompts.
On the other hand, very general LLMs benefit greatly from carefully crafted prompts.

**2** When approaching prompt engineering, go from simple to complex: track, templatize and automate.

Start by tracking queries and responses so that you can compare them and iterate to improve prompts. Existing tools such as MLflow provide tracking capabilities; see [MLflow LLM Tracking](https://mlflow.org/docs/latest/llm-tracking.html) for more details. Checking structured LLM pipeline code into version control also helps with prompt development, since git diffs allow you to review changes to prompts over time. Also see the section below on packaging models or pipelines for more information about tracking prompt versions.

Then, consider using tools for building prompt templates, especially if your prompts become complex. Newer LLM-specific tools such as [LangChain](https://python.langchain.com/en/latest/index.html) and [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/) provide such templates and more.

Finally, consider automating prompt engineering by replacing manual engineering with automated tuning. Prompt tuning turns prompt development into a data-driven process akin to hyperparameter tuning for traditional ML. The [Demonstrate-Search-Predict (DSP) Framework](https://github.com/stanfordnlp/dsp) is a good example of a tool for prompt tuning.
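The "track" and "templatize" steps can be illustrated with the standard library alone. In this minimal sketch, `fake_llm`, the two prompt versions and the `PromptRun` record are hypothetical stand-ins; in practice the model call would go to a real LLM and the records would be logged with a tracking tool such as MLflow.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, List

# Two hypothetical prompt templates to compare against each other.
PROMPTS = {
    "v1": Template("Answer the question.\nQuestion: $question"),
    "v2": Template("Answer in one sentence. If you do not know, say so.\nQuestion: $question"),
}


@dataclass
class PromptRun:
    """One tracked exchange: which template was used, what was sent and returned."""
    prompt_version: str
    prompt: str
    response: str


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return f"(model output for a {len(prompt)}-character prompt)"


def run_and_track(version: str, question: str, log: List[PromptRun],
                  llm: Callable[[str], str] = fake_llm) -> str:
    """Fill the template, query the model and record the exchange."""
    prompt = PROMPTS[version].substitute(question=question)
    response = llm(prompt)
    log.append(PromptRun(version, prompt, response))
    return response


runs: List[PromptRun] = []
for version in PROMPTS:
    run_and_track(version, "What is a feature store?", runs)
```

Because every run records the template version next to the full prompt and response, the outputs of two templates for the same question can be compared side by side.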

-----

###### Resources

There are lots of good resources about prompt engineering, especially for popular models and services:

- DeepLearning.AI course on [ChatGPT Prompt Engineering](https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/)
- DAIR.AI [Prompt Engineering Guide](https://www.promptingguide.ai/)
- [Best practices for prompt engineering with the OpenAI API](https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api)

**3** Most prompt engineering tips currently published online are for ChatGPT, due to its immense popularity. Some of these generalize to other models as well. We will provide a few tips here:

- Use clear, specific prompts, which may include an instruction, context (if needed), a user query or input, and a description of the desired output type or format
- Provide examples in your prompt (“few-shot learning”) to help the LLM understand what you want
- Tell the model how to behave, such as telling it to admit if it cannot answer a question
- Tell the model to think step-by-step or explain its reasoning
- If your prompt includes user input, use techniques to prevent prompt hacking, such as making it very clear which parts of the prompt correspond to your instruction vs. user input

-----

###### Packaging models or pipelines for deployment

In traditional ML, there are generally two types of ML logic to package for deployment: models and pipelines. These artifacts are generally managed toward production via a Model Registry and Git version control, respectively.

With LLMs, it is common to package ML logic in new forms. These may include:

- A lightweight call to an LLM API service (third party or internal)
- A “chain” from LangChain or an analogous pipeline from another tool. The chain may call an LLM API or a local LLM model.
- An LLM or an LLM+tokenizer pipeline, such as a [Hugging Face](https://huggingface.co/) pipeline. This pipeline may use a pretrained model or a custom fine-tuned model.
- An engineered prompt, possibly stored as a template in a tool such as LangChain

Though LLMs add new terminology and tools for composing ML logic, all of the above still constitute models and pipelines. Thus, the same tooling such as [MLflow](https://mlflow.org/) can be used to package LLMs and LLM pipelines for deployment. [Built-in model flavors](https://mlflow.org/docs/latest/models.html) include:

- PyTorch and TensorFlow
- Hugging Face Transformers (relatedly, see Hugging Face Transformers’s [MLflowCallback](https://huggingface.co/docs/transformers/en/main_classes/callback#transformers.integrations.MLflowCallback))
- LangChain
- OpenAI API
- (See the [documentation](https://mlflow.org/docs/latest/models.html) for a complete list)

For other LLM pipelines, MLflow can package the pipelines via the [MLflow pyfunc flavor](https://mlflow.org/docs/latest/models.html#python-function-python-function), which can store arbitrary Python code.

**Note about prompt versioning:** Just as it is helpful to track model versions, it is helpful to track prompt versions (and LLM pipeline versions, more generally). Packaging prompts and pipelines as MLflow Models simplifies versioning. Just as a newly retrained model can be tracked as a new model version in the MLflow Model Registry, a newly updated prompt can be tracked as a new model version.

**Note about deploying models vs. code:** Your decisions around packaging ML logic as version controlled code vs. registered models will help to inform your decision about choosing between the deploy models, deploy code and hybrid architectures. Review the subsection below about human feedback, and make sure that you have a well-defined testing process for whatever artifacts you choose to deploy.
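To make the prompt versioning idea concrete, here is a toy in-memory sketch. It is a stand-in written for illustration only: `PromptRegistry`, `register` and `load` are hypothetical names and are not the MLflow Model Registry API. It only shows the behavior described above, where an updated prompt becomes a new version under the same registered name.

```python
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """Toy in-memory registry (hypothetical; not the MLflow API)."""

    _versions: dict = field(default_factory=dict)

    def register(self, name: str, prompt_text: str) -> int:
        """Store the prompt and return its 1-based version number.

        Re-registering the latest text unchanged does not create a new version.
        """
        versions = self._versions.setdefault(name, [])
        if versions and versions[-1] == prompt_text:
            return len(versions)
        versions.append(prompt_text)
        return len(versions)

    def load(self, name: str, version: int) -> str:
        """Return the prompt text stored under the given version number."""
        return self._versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.register("support-bot", "Answer briefly.")
v2 = registry.register("support-bot", "Answer briefly and cite the context.")
```

Because older versions stay loadable, a deployment can pin a specific prompt version and roll back in the same way it would for a model version.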
----- ###### Managing cost/performance trade-offs One of the big Ops topics for LLMs is managing cost/performance trade-offs, especially for inference and serving. With “small” LLMs having hundreds of millions of parameters and large LLMs having hundreds of billions of parameters, computation can become a major expense. Thankfully, there are many ways to manage and reduce costs when needed. We will review some key tips for balancing productivity and costs. **1** Start simple, but plan for scaling. When developing a new LLM-powered application, speed of development is key, so it is acceptable to use more expensive options, such as paid APIs for existing models. As you go, make sure to collect data such as queries and responses. In the future, you can use that data to fine-tune a smaller, cheaper model which you can own. **2** Scope out your costs. How many queries per second do you expect? Will requests come in bursts? How much does each query cost? These estimates will inform you about project feasibility and will help you to decide when to consider bringing the model in-house with open source models and fine-tuning. **3** Reduce costs by tweaking LLMs and queries. There are many LLM-specific techniques for reducing computation and costs. These include shortening queries, tweaking inference configurations and using smaller versions of models. **4** Get human feedback. It is easy to reduce costs but hard to say how changes impact your results, unless you get human feedback from end users. 
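The scoping questions in tip 2 reduce to simple arithmetic. The sketch below is a back-of-the-envelope estimate under stated assumptions: token-based pricing, a 30-day month, and steady traffic with no bursts. The function names and the example price are hypothetical and do not reflect any vendor's actual rates.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def cost_per_query(tokens_per_query: float, price_per_1k_tokens: float) -> float:
    """Cost of one query when pricing is quoted per 1,000 tokens."""
    return tokens_per_query / 1000 * price_per_1k_tokens


def estimate_monthly_cost(
    queries_per_second: float,
    tokens_per_query: float,
    price_per_1k_tokens: float,
) -> float:
    """Monthly spend for steady traffic at the given average query rate."""
    queries_per_month = queries_per_second * SECONDS_PER_MONTH
    return queries_per_month * cost_per_query(tokens_per_query, price_per_1k_tokens)


# Hypothetical inputs: 2 queries/second, 500 tokens each, 0.002 per 1K tokens.
monthly_cost = estimate_monthly_cost(
    queries_per_second=2, tokens_per_query=500, price_per_1k_tokens=0.002
)
print(f"Estimated monthly cost: {monthly_cost:,.2f}")
```

Comparing this figure against the fixed cost of hosting a smaller fine-tuned model gives a rough signal for when bringing the model in-house is worth considering.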

-----

###### Methods for reducing costs of inference

**Use a smaller model**

- Pick a different existing model. Try smaller versions of models (such as “t5-small” instead of “t5-base”) or alternate architectures.
- Fine-tune a custom model. With the right training data, a fine-tuned model can often be smaller and/or perform better than a generic model.
- Use model distillation (or knowledge distillation). This technique “distills” the knowledge of the original model into a smaller model.
- Reduce floating point precision (quantization). Models can sometimes use lower precision arithmetic without losing much in quality.

**Reduce computation for a given model**

- Shorten queries and responses. Computation scales with input and output sizes, so using more concise queries and responses reduces costs.
- Tweak inference configurations. Some types of inference, such as beam search, require more computation.

**Other**

- Split traffic. If your return on investment (ROI) for an LLM query is low, then consider splitting traffic so that low ROI queries are handled by simpler, faster models or methods. Save LLM queries for high ROI traffic.
- Use pruning techniques. If you are training your own LLMs, there are pruning techniques that allow models to use sparse computation during inference. This reduces computation for most or all queries.

###### Resources

**Fine-tuning**

- [Fine-Tuning Large Language Models with Hugging Face and DeepSpeed](https://www.databricks.com/blog/2023/03/20/fine-tuning-large-language-models-hugging-face-and-deepspeed.html)
- Webinar: [Build Your Own Large Language Model Like Dolly: How to fine-tune and deploy your custom LLM](https://www.databricks.com/resources/webinar/build-your-own-large-language-model-dolly)

**Model distillation, quantization and pruning**

- [Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration)
- [Large Transformer Model Inference Optimization](https://lilianweng.github.io/posts/2023-01-10-inference-optimization/)
- [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes)

-----

###### Human feedback, testing, and monitoring

While human feedback is important in many traditional ML applications, it becomes much more important for LLMs. Since most LLMs output natural language, it is very difficult to evaluate the outputs via traditional metrics. For example, suppose an LLM were used to summarize a news article. Two equally good summaries might have almost completely different words and word orders, so even defining a “ground-truth” label becomes difficult or impossible.

Humans, ideally your end users, become essential for validating LLM output. While you can pay human labelers to compare or rate model outputs, the best practice for user-facing applications is to build human feedback into the applications from the outset. For example, a tech support chatbot may have a “click here to chat with a human” option, which provides implicit feedback indicating whether the chatbot’s responses were helpful.
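The chatbot example above can be turned into a simple metric. The following sketch assumes each chat session is logged with the pipeline version that served it and a flag for whether the user clicked through to a human; the `ChatSession` fields and the function name are hypothetical. A lower escalation rate is treated here as a rough, implicit signal that responses were more helpful.

```python
from dataclasses import dataclass


@dataclass
class ChatSession:
    """One logged chatbot session (hypothetical schema)."""

    session_id: str
    model_version: str
    escalated_to_human: bool


def escalation_rate_by_version(sessions):
    """Return {model_version: fraction of sessions escalated to a human}."""
    totals = {}
    escalations = {}
    for session in sessions:
        version = session.model_version
        totals[version] = totals.get(version, 0) + 1
        escalations[version] = escalations.get(version, 0) + int(
            session.escalated_to_human
        )
    return {version: escalations[version] / totals[version] for version in totals}


# Made-up sessions for illustration only.
sessions = [
    ChatSession("s1", "v1", True),
    ChatSession("s2", "v1", False),
    ChatSession("s3", "v1", True),
    ChatSession("s4", "v1", False),
    ChatSession("s5", "v2", False),
    ChatSession("s6", "v2", False),
    ChatSession("s7", "v2", True),
    ChatSession("s8", "v2", False),
]
rates = escalation_rate_by_version(sessions)
```

Grouping by version is what makes the metric usable for A/B tests and incremental rollouts: two versions serving live traffic can be compared on the same implicit signal.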
In terms of operations, not much changes from traditional MLOps:

- **Data:** Human feedback is simply data, and it should be treated like any other data. Store it in your lakehouse, and process it using the same data pipeline tooling as other data.
- **Testing and monitoring:** A/B testing and incremental rollouts of new models and pipelines may become more important, superseding offline quality tests. If you can collect user feedback, then these rollout methods can validate models before they are fully deployed.
- **Fine-tuning:** Human feedback becomes especially important for LLMs when it can be incorporated into fine-tuning models via techniques like Reinforcement Learning from Human Feedback (RLHF). Even if you start with an existing or generic model, you can eventually customize it for your purposes via fine-tuning.

###### Resources

**Reinforcement Learning from Human Feedback (RLHF)**

- Chip Huyen blog post on [“RLHF: Reinforcement Learning from Human Feedback”](https://huyenchip.com/2023/05/02/rlhf.html)
- Hugging Face blog post on [“Illustrating Reinforcement Learning from Human Feedback (RLHF)”](https://huggingface.co/blog/rlhf)
- [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback)

-----

###### Other topics

- **Scaling out:** Practices around scaling out training, fine-tuning and inference are similar to traditional ML, but some of your tools may change. Tools like [Apache Spark™](https://spark.apache.org/) and [Delta Lake](https://delta.io/) remain general enough for your LLM data pipelines and for batch and streaming inference, and they may be helpful for distributing fine-tuning.
To handle LLM fine-tuning and training, you may need to adopt some new tools such as [distributed PyTorch](https://pytorch.org/tutorials/beginner/dist_overview.html), [distributed TensorFlow](https://www.tensorflow.org/guide/distributed_training), and [DeepSpeed](https://www.deepspeed.ai/).

- **[Model serving:](https://www.databricks.com/product/model-serving)** If you manage the serving system for your LLMs, then you may need to make adjustments to handle larger models. While serving with CPUs can work for smaller deep learning models, most LLMs will benefit from or require GPUs for serving and inference.
- **Vector databases:** Some but not all LLM applications require vector databases for efficient similarity-based lookups of documents or other data. Vector databases may be an important addition to your serving infrastructure. Operationally, a vector database is analogous to a feature store: it is a specialized tool for storing preprocessed data which can be queried by inference jobs or model serving systems.

-----

#### Reference architecture

To illustrate potential adjustments to your reference architecture from traditional MLOps, we provide a modified version of the previous production architecture.

**Figure 7** [Diagram of the production environment. Recoverable labels: Model Registry (Stage: None, Stage: Staging, Stage: Production); Load model for testing; Load model for inference; Push model to registry; Promote to production; Model Serving; LLM API request; Fine-Tune LLM; Vector Database Update; Continuous Deployment (CD); Monitoring & Evaluation; Internal/External Model Hub; data tables; vector database; metrics tables; human feedback.]

The primary changes to this production architecture are:

- **Internal/External Model Hub:** Since LLM applications often make use of existing, pretrained models, an internal or external model hub becomes a valuable part of the infrastructure. It appears here in production to illustrate using an existing base model that is then fine-tuned in production. Without fine-tuning, this hub would mainly be used in development.
- **Fine-Tune LLM:** Instead of de novo Model Training, LLM applications will generally fine-tune an existing model (or use an existing model without any tuning). Fine-tuning is a lighter-weight process than training, but it is similar operationally.
- **Vector Database:** Some (but not all) LLM applications use vector databases for fast similarity searches, most often to provide context or domain knowledge in LLM queries. We replaced the Feature Store (and its Feature Table Refresh job) with the Vector Database (and its Vector Database Update job) to illustrate that these data stores and jobs are analogous in terms of operations.
- **Model Serving:** The architectural change illustrated here is that some LLM pipelines will make external API calls, such as to internal or third-party LLM APIs. Operationally, this adds complexity in terms of potential latency or flakiness from third-party APIs, as well as another layer of credential management.
- **Human Feedback in Monitoring and Evaluation:** Human feedback loops may be used in traditional ML but become essential in most LLM applications. Human feedback should be managed like other data, ideally incorporated into monitoring based on near real-time streaming.

-----

###### Additional resources

With LLMs being such a novel field, we link to several LLM resources below, which are not necessarily “LLMOps” but may prove useful to you.

- [edX: Professional Certificate in Large Language Models](https://www.edx.org/professional-certificate/databricks-large-language-models)
- Chip Huyen blog post on [“Building LLM applications for production”](https://huyenchip.com/2023/04/11/llm-engineering.html)
- LLM lists and leaderboards
  - [LMSYS Leaderboard](https://chat.lmsys.org/?leaderboard)
  - [Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
  - [Stanford Center for Research on Foundation Models](https://crfm.stanford.edu/)
    - [Ecosystem graphs](https://crfm.stanford.edu/ecosystem-graphs/index.html)
    - [HELM](https://crfm.stanford.edu/helm/latest/?)
  - Blog post on [“Open Source ChatGPT Alternatives”](https://www.saattrupdan.com/posts/2023-04-16-open-source-chatgpt-alternatives)

-----

#### Looking ahead

LLMs only became mainstream in late 2022, and countless libraries and technologies are being built to support and leverage LLM use cases. You should expect rapid changes. More powerful LLMs will be open-sourced; tools and techniques for customizing LLMs and LLM pipelines will become more plentiful and flexible; and an explosion of techniques and ideas will gradually coalesce into more standardized practices.

While this technological leap provides us all with great opportunities, the use of cutting-edge technologies requires extra care in LLMOps to build and maintain stable, reliable LLM-powered applications. The good news is that much of your existing MLOps tooling, practices and knowledge will transfer smoothly over to LLMs. With the additional tips and practices mentioned in this section, you should be well set up to harness the power of large language models.

-----

##### About Databricks

Databricks is the data and AI company.
More than 9,000 organizations worldwide — including Comcast, Condé Nast and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark ™ , Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . **[Sign up for a free trial](https://databricks.com/try-databricks)** -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/the-big-book-of-mlops-v10-072023.pdf,2024-09-19T16:57:22Z
"----- # TABLE OF CONTENTS ##### Welcome to Data, Analytics and AI ....... 02 **Do you know what you’re getting into?** ............................................ **02** **How to use this book** ��������������������������������������������������������������������������������������� **02** ##### Business Value .......................................................................... 03 **Talking to the business (feels like combat)** ����������������������������� **03** **Creating Value Alignment** ������������������������������������������������������������������ **03** **Goals and Outcomes** ���������������������������������������������������������������������������� **04** ##### Ultimate Class Build Guide .................................. 04 **Creating a character** ����������������������������������������������������������������������������� **04** - Data Engineers ������������������������������������������������������������������������������������� **04** - Data Scientists ������������������������������������������������������������������������������������� **05** - Data Analysts ���������������������������������������������������������������������������������������� **05** ##### Diving In ............................................................................................... 05 **Producing game data** ���������������������������������������������������������������������������� **05** **And receiving it in cloud** ��������������������������������������������������������������������� **08** **Getting data from your game to the cloud** ������������������������������ **08** ##### The Value of Data Throughout the Game Developer Lifecycle ................................... 
- **Lifecycle overview**
- **Use data to develop a next-generation customer experience**

##### Getting Started with Gaming Use Cases

- **Where do I start? Start with Game Analytics**
- **Understand your audience**
  - Player Segmentation
  - Player Lifetime Value
  - Social Media Monitoring
  - Player Feedback Analysis
  - Toxicity Detection
- **Find your audience**
  - Multi-Touch Attribution
- **Activating Your Playerbase**
  - Player Recommendations
  - Next Best Offer/Action
  - Churn Prediction & Prevention
  - Real-time Ad Targeting
- **Operational Use Cases**
  - Anomaly Detection
  - Build Pipeline
  - Crash Analytics

##### Things to Look Forward To

##### Appendix

- **Ultimate Class Build Guide**
  - Creating a Character
    - Data Engineers
    - Data Scientists
    - Data Analysts
- **Data Access and the Major Cloud Providers**
  - Cloud Rosetta Stone
  - Jargon Glossary
  - Getting started with the major cloud providers
- **Getting Started with Detailed Use Cases**
  - Game analytics
  - Player Segmentation
  - Player Lifetime Value
  - Social Media Monitoring
  - Player Feedback Analysis
  - Toxicity Detection
  - Multi-Touch Attribution and Media Mix Modeling
  - Player Recommendations
  - Next Best Offer/Action
  - Churn Prediction & Prevention
  - Real-time Ad Targeting
- **Getting Started with Operational Use Cases**
  - Anomaly Detection
  - Build Pipeline
  - Crash Analytics

-----

# Welcome to Data, Analytics, and AI

### Do you know what you’re getting into?

You may have heard the stories of game studios spending countless hours trying to more effectively acquire, engage, and retain players. Well, did you know that data, analytics, and AI play a central role in the development and operation of today’s top-grossing video games? Studios globally struggle with fragmented views of their audience, with data often outpacing legacy technologies. Today, the need for real-time capabilities and the leap from descriptive to predictive analytics have made it so that data, analytics, and AI are no longer a “nice-to-have”, but table stakes for success.

The objective of this handbook is to guide you on the role data, analytics, and AI play in the development and operations of video games. We’ll cover who the key stakeholders are and how to align people across business units.
Then we’ll talk through strategies to help you successfully advocate for data, analytics, and AI projects internally. Finally, we dive deep into the most common use cases. We want to give you enough information to feel empowered to make a demonstrable impact. Just by reading this you are adding incredible insight and value to yourself as an industry professional. Quest on!

### How to use this book

This book is primarily intended for technical professionals who are engaging with data within game studios. No matter your role in the gaming industry, you will be able to glean key takeaways that will make you more effective in your individual role and within the larger team, be that production, art, engineering, marketing, or otherwise.

Begin your journey by reviewing the “**Data, Analytics, and AI Ground Rules**” section below, which presents some rules and guidelines for interpreting the role that data plays in the game development lifecycle.

Next, it’s time to learn about the key professions (aka character classes) that interact and engage with data, analytics, and AI on a consistent basis within a game studio. This section breaks down each of the classes, providing an overview of each character’s strengths and weaknesses as well as helpful tips when operating as or working with one of these classes.

We follow this with the fundamentals for building a Proof of Concept (POC) or Minimum Viable Product (MVP). That is, connecting to the cloud; accessing your data; and most importantly, being able to represent the value you’re seeking to unlock as you sell your project into your team and broader organization.

Finally, we’ll dive into the most common use cases for data, analytics, and AI within game development. Similar to a tech tree in a video game, we begin with the most basic use cases: setting up your game analytics. Then we progress through more advanced data use cases such as player segmentation, assessing lifetime value, detecting and mitigating toxicity, multi-touch attribution, recommendation engines, player churn prediction and prevention, and more.

Don’t forget to review the Appendix. You’ll find a handy “Jargon Glossary”, “Cloud Rosetta Stone”, and “get started guide for the three major cloud providers”. All incredibly helpful assets to keep as hotkeys.

**Data, Analytics, and AI Ground Rules**

This guide assumes you understand the following:

- You understand the basics of data, analytics, and AI: how and why data is stored in a system, why data is transformed, and the different types of output that data can feed into, such as a report, an analysis answering a question, or a machine learning model. If this is the first time you’re creating a character, we highly recommend reviewing our data, analytics, and AI tutorial (aka getting started training and documentation), available at [dbricks.co/training](https://www.databricks.com/learn/training/home)
- You have a basic understanding of cloud infrastructure. Specifically what it is, who the key players are, and associated terms (e.g., virtual machines, APIs, applications)
- You are generally aware of the game development lifecycle: pre-production, production, testing/QA, launch, operation

-----

# Business Value

Demonstrating business value is important when working on data, analytics, and AI projects because it helps ensure that the efforts of the project are aligned with the goals and objectives of the business. By showing how the project can positively impact a game’s key performance indicators (KPIs) and bottom-line metrics, such as game revenue, player satisfaction, and operational efficiency, studio stakeholders are more likely to support and invest in the project.
Additionally, demonstrating business value can help justify the resources, time, and money that are required to execute the project, and can also help prioritize which projects should be pursued. By focusing on business value, data, analytics, and AI projects can become strategic initiatives that contribute to the long-term success of your game studio.

### Talking to the business (feels like combat)

While we highly encourage everyone to read this section, you may already feel confident understanding the needs and concerns of your internal stakeholders, and how to sell in a project successfully. If so, feel free to skip this section.

We would love to dive into the data to explore and discover as much as possible. Unfortunately, in most environments we are limited by resources and time. Understanding both the business’s pain points and its strategic goals is crucial to choosing projects that will benefit the business, create value and make your message much easier to sell.

Whenever we embark on a proof-of-concept (PoC) or minimum viable product (MVP) to prove out a new methodology or technology, we will need to pitch it back for adoption. The technology could be revolutionary and absolutely amazing, but without the value proposition and a tie back to goals, it is likely to fall flat or fail to be adopted.

It is key to talk to your stakeholders to understand their perception of pain points and positions on potential projects to add value. Much like stopping at the Tavern when the adventuring party gets to town, these can be informal conversations where you socialize potential solutions while gathering information about what matters.

**Questions to ask:**

- What other strategic goals and pain points can you list out and how would you prioritize them as a business leader?
- Does your prioritization match how your team, manager and/or leadership would prioritize? Typically the closer the match, the easier initial projects will be to “sell”.
### Creating value alignment

So what are your strategic goals and pain points, and how might they be addressed through a use case from a PoC or MVP leveraging your data? A few examples of strategic goals that are top of mind for our customers at the beginning of any fiscal or calendar year:

- Reduce costs
- Simplify your infrastructure
- Acquire more players
- Monetize your playerbase
- Retain your players (aka prevent churn)

Here are four ways the Databricks Lakehouse can provide value that aligns with your strategic goals and pain points:

`1.` **Improved collaboration:** The Databricks platform allows data scientists, engineers and business users to share and collaborate on data, notebooks and models. This enables a more efficient and streamlined process for data analysis and decision making.

`2.` **Find and explore your data:** The data in the Lakehouse is cataloged and accessible, which enables business users to explore and query the data easily and discover insights by themselves.

`3.` **Uncover actionable business insights:** By putting your game’s data into a Lakehouse architecture, it can be better analyzed using various tools provided by Databricks such as SQL, dashboards, notebooks, visualization and machine learning to better understand your playerbase, providing valuable insights into player behavior and performance. These insights can help the ----- and retention, and use that information to improve the game and grow monetization.

`4.` **Lead with data-driven decisions:** A Lakehouse architecture provides a single source of truth for your organization’s data. Data engineers write once, data analysts interpret the data, and data scientists can run machine learning models on the same data.
_This cannot be overstated in the value this provides an organization from a total cost of ownership perspective._ With the ability to access and analyze all the data in one place, the business can make unified data-driven decisions, rather than relying on intuition or fragmented data.

### Goals and outcomes

Like many projects, starting with a strong foundation of ‘what success looks like’ will significantly improve your likelihood of achieving your objectives. Here are a few best practices we recommend:

`1.` **Set goals:** Define your hypothesis, then use your data and process to prove or disprove it. You have a goal in mind, so make it part of the experiment. If the outcome differs from the expectation, that is part of experimenting, and we can learn from it to improve the next experiment. This is all about shortening the feedback loop between insight and action.

project appropriately. For example, are you doing this as a side project? Do you have 2 sprints to show progress? It’s important to scope your project based on the time, resources, and quality needed for the project to be a success.

`3.` **Scope down:** Ruthlessly control scope for any PoC or MVP. Prioritization is your best friend. Stakeholders and your own internal team will naturally want to increase scope because there’s no shortage of good ideas. But by controlling scope, you improve your chances of shipping on time and on budget. Don’t let perfection be the enemy of good. There are always exceptions to this, but that is what the next sprint is for.

`4.` **Deliver on time:** Recovering lost goodwill is incredibly difficult, so strive to always deliver on time. Make sure your goals, constraints and scope creep will not explode your timeline, because tight feedback loops and iteration cycles are what will make you more agile than the competition.

`5.` **Socialize early, and often:** Show quantifiable value as quickly as possible, both to your immediate team and business stakeholders. Measure the value as frequently as makes sense, and socialize early and often to promote visibility of the project and ensure tight alignment across teams. This will empower you to create tighter feedback loops that will help improve any future iterations of your product, platform, or technology.

# Ultimate Class Build Guide

### Creating a character

Have you rolled your character already? Data engineers, data scientists, and data analysts form the heart of mature game data teams. Depending on studio size and resources, though, game developers may also be pulled in from time to time to perform data engineering and/or data science tasks. For the sake of this guide, we’ll keep the focus on the roles of data engineers, data scientists, and data analysts. There are many aspects to these roles, but they can be summarized as follows: Data Engineers create and maintain critical data workflows, Data Analysts interpret data and create reports that keep the business teams running seamlessly, and Data Scientists are responsible for making sense of large amounts of data. Depending on the size of the organization, individuals may be required to multiclass in order to address needs of the team. In smaller studios, it’s often developers who wear multiple hats, including those in data engineering, analytics and data science. Key characters include:

**Data Engineers**

Data engineers build systems that collect, manage, and convert source data into usable information for data scientists and business analysts to interpret. Their ultimate goal is to make data accessible so that teams can use it to evaluate and optimize a goal or objective.

**Data Scientists**

Data scientists determine the questions their team should be asking and figure out how to answer those questions using data. They often develop predictive models for theorizing and forecasting.
**Data Analysts**

A data analyst reviews data to identify key insights into a game studio’s customers and ways the data can be used to solve problems.

[...] to report on the health of a title or building a recommendation engine for your players, this guide will help you better understand the unique classes required to develop and maintain an effective data, analytics, and AI platform.

**Learn more about these character classes**

# Diving In

Before we get to the primary use cases of game data, analytics, and AI, we need to cover some basics: the different types of game data and how they are produced, and the subsequent receiving of that data in the cloud to collect, clean, and prepare for analysis.

### Producing game data…

Speaking in generalities, there are four buckets of data as it relates to your video game.

**1. Game Telemetry**

Game telemetry refers to the data collected about player behavior and interactions within a video game. The primary data source is the game engine. The goal of game telemetry is to gather information that can help game developers understand player behavior and improve the overall game experience.

Some of the primary metrics that are typically tracked in game telemetry include:

- **Player engagement:** Track the amount of time players spend playing the game, and their level of engagement with different parts of the game.
- **Game progress:** Monitor player progress through different levels and milestones in the game.
- **In-game purchases:** Track the number and value of in-game purchases made by players.
- **Player demographics:** Collect demographic information about players, such as age, gender, location, and device type.
- **Session length:** Monitor the length of each player session, and how often players return to the game.
- **Retention:** Track the percentage of players who return to the game after their first session.
- [...] such as the types of actions taken, the number of deaths, and the use of power-ups.
- **User Acquisition:** Track the number of new players acquired through different marketing channels.

**2. Business KPIs**

The second bucket of data is business key performance indicators (or KPIs). Business KPIs are metrics that measure the performance and success of a video game from a business perspective. The primary data sources for business KPIs include game telemetry, stores, and marketplaces. These KPIs help game studios understand the financial and operational performance of their games and make informed decisions about future development and growth.

Some of the primary business metrics that are typically tracked include:

- **Revenue:** Track the total revenue generated by the game, including sales of the game itself, in-game purchases, and advertising.
- **Player Acquisition Cost (CAC):** Calculate the cost of acquiring a new player, including marketing and advertising expenses.
- **Lifetime Value (LTV):** Estimate the amount of revenue a player will generate over the course of their time playing the game.
- **Player Retention:** Track the percentage of players who continue to play the game over time, and how long they play for.
- **Engagement:** Measure the level of engagement of players with the game, such as the number of sessions played, time spent playing, and in-game actions taken.
- **User Acquisition:** Track the number of new players acquired through different marketing channels and the cost of acquiring each player.
- **Conversion Rate:** Measure the percentage of players who make an in-game purchase or complete a specific action.
- **Gross Margin:** Calculate the profit generated by the game after subtracting the cost of goods sold, such as the cost of game development and server hosting.

**3. Game Services**

Similar to game telemetry, game services provide critical infrastructure that requires careful monitoring and management. These services include things like game server hosting, and more. Here the source of data is the game services used.
Some of the common metrics game teams typically track for these services include:

- **Concurrent Players:** Track the number of players who are simultaneously connected to the game servers to ensure that the servers have enough capacity to handle the player demand.
- **Server Availability:** Monitor the uptime and downtime of the game servers to ensure that players have access to the game when they want to play. This is particularly important for global live service games where demand fluctuates throughout the day.
- **Latency:** Measure the time it takes for data to travel from the player’s device to the game server and back, to ensure that players have a smooth and responsive gaming experience.
- **Network Bandwidth:** Monitor the amount of data being transmitted between the player’s device and the game server to ensure that players have a high-quality gaming experience, even on slow internet connections.
- **Live Operations:** Monitor the success of in-game events, promotions, and other live operations to understand what resonates with players and what doesn’t.
- **Player Feedback:** Monitor player feedback and reviews, including ratings and comments on social media, forums, and app stores, to understand what players like and dislike about the game.
- **Chat Activity:** Track the number of messages and interactions between players in the game’s chat channels to understand the level of social engagement and community building in the game.

**4. Data beyond the game**

The last bucket comes from data sources beyond the video game. These typically include the following:

- **Social Media Data:** Social media platforms, such as Facebook, Twitter, TikTok and Instagram, can provide valuable insights into player behavior, feedback and preferences, as well as help game teams understand how players are talking about their games online with different communities.
- **Forum Data:** Online forums and discussion boards, such as Reddit and Discord, can be rich sources of player feedback and opinions about the game.
- **Player Reviews:** Ratings and reviews on app stores, such as Steam, Epic, Google Play and the Apple App Store, can provide valuable feedback on player experiences and help game teams identify areas for improvement.
- **Third-Party Data:** Third-party data sources, such as market research firms and industry data providers, can provide valuable insights into broader gaming trends and help game teams make informed decisions about their games and marketing strategies.

#### The secret to success is bringing all of the disparate data sources together, so you have as complete a 360-degree view as possible of what’s happening in and around your game.

This is a lot of data. And it’s no wonder that studios globally struggle with fragmented views of their audience, with data often outpacing legacy technologies. Today, the need for real-time capabilities and the leap from descriptive to predictive analytics have made data, analytics, and AI table stakes for a game to be successful.

Tapping into these four buckets of data sources, you’ll find actionable insights that drive better understanding of your playerbase, more efficient acquisition, stronger and longer lasting engagement, and monetization that deepens the relationship with your players. That’s what we’re going to dig into throughout the rest of this book.

**Let’s begin with how to get data out of your game!**

There are a variety of ways to get data out of the game and into cloud resources. In this section, we will provide resources for producing data streams in Unity and Unreal. In addition, we will also provide a generic approach that will work for any game engine, as long as you are able to send HTTP requests.

**Unity**

Since Unity supports C#, you would use a .NET SDK from the cloud provider of your choice.
All three major cloud providers have .NET SDKs to use, and I have linked the documentation for each below. No matter the cloud provider, if you want to use an SDK, you install it through the NuGet package manager into your Unity project. [A walkthrough of how to implement the .NET SDK using AWS](https://www.youtube.com/watch?v=yv4ynyCytdU) is provided here.

- **AWS:** [AWS .NET SDK - Unity considerations](https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/unity-special.html)
- **GCP:** [GCP .NET SDK Documentation](https://cloud.google.com/dotnet/docs/reference)
- **Azure:** [Azure .NET SDK Overview](https://learn.microsoft.com/en-us/dotnet/azure/sdk/azure-sdk-for-dotnet)
- **Kafka (Open-source alternative):** [Kafka .NET connector](https://github.com/confluentinc/confluent-kafka-dotnet)

From here, the SDK is used to send data to a messaging service. These messaging services will be covered in more detail in the next section.

**Unreal Engine**

Unreal supports development with C++, so you could use C++ SDKs or Blueprint interfaces to those SDKs. The resources for each SDK are provided here:

- **AWS:** [How to integrate AWS C++ SDK with Unreal Engine](https://aws.amazon.com/blogs/gametech/how-to-integrate-the-aws-c-sdk-with-unreal-engine/)
- **Azure:** [Azure C++ SDK with PlayFab](https://learn.microsoft.com/en-us/gaming/playfab/sdks/unreal/)
- **Kafka (Open-source alternative):** [Getting started with Kafka and C++](https://docs.confluent.io/kafka-clients/librdkafka/current/overview.html#ak-cplus)

Just like with the Unity example above, from here the data is sent to a messaging streaming service.

Other engines may not support C++ or C#, but there is still a way to get your data into the cloud, no matter the language! By hitting an API Gateway with an HTTP POST request, you are able to send data to cloud services from many more types of applications. A sample high level architecture of this solution in AWS and Azure can be seen below:

**AWS:**

**Azure:**

Once the data has been sent from the game into an event-streaming service, how do we get that data to a more permanent home? Here we will start by outlining what these messaging services do and how we can use them to point our data to a desired location.

Messaging services ingest real-time event data, streamed to them from a number of different sources, and then send it to the appropriate target locations. These target locations can be databases, compute clusters or cloud object stores. A key property of messaging services is that they preserve the time at which events arrive, so that the order in which events occurred is always known.

Examples of cloud messaging services include AWS Kinesis Firehose, Google PubSub, and Azure Event Hubs Messaging. If you prefer to use open-source products, Apache Kafka is a very popular open-source alternative.

### Getting data from your game to the cloud

Moving to the cloud platform part of the journey involves building a gaming Lakehouse. The gaming Lakehouse allows gaming companies to store, manage, and analyze large volumes of gaming data, such as player behavior, performance metrics, and financial transactions, to gain valuable insights and make data-driven decisions to improve their business outcomes.

**Next, here are the basics of the Databricks platform, simplified.**

**Data Ingestion:**

- Data can be ingested into the Gaming Lakehouse using various built-in data ingestion capabilities provided by Databricks such as Structured Streaming and Delta Live Tables for a single simple API that handles streaming or batch pipelines.
- Data can be ingested in real-time or batch mode from various sources such as game clients, servers or APIs. Data can be cleaned, transformed and enriched with additional data sources, making it ready for analysis.

**Data Storage:**

- Data is stored in object storage such as S3, Azure Storage or GCP Buckets using Delta Lake.
- Delta Lake is an open-source storage framework that makes it easy to maintain data consistency and track changes.

**Data Governance & Cataloging:**

- Unity Catalog in Databricks provides tools for data governance that help with compliance and controlling access to data in the lake.
- Unity Catalog also supports data lineage tracking, auditing and data discovery with the use of data catalogs and governance.
- Metadata about the data, including the structure, format, and location of the data, can be stored in a data catalog.

**Data Quality:**

- The Databricks platform enables you to validate, clean and enrich data using built-in libraries and rule-based validation using Delta Live Tables.
- It also allows tracking data quality issues and missing values by using Delta Live Tables.

**Data Security:**

- Databricks provides a comprehensive security model to secure data stored in the lake.
- Access to data can be controlled through robust access controls on objects such as catalogs, schemas, tables, rows, columns, models, experiments, and clusters.

**Analytics:**

- The processed data can be analyzed using various tools provided by Databricks such as SQL Dashboards, Notebooks, visualizations and ML.
- Game studios can gain insights into player performance and behavior to better engage players and improve their games.

**Get started with your preferred cloud**

# The Value of Data Throughout the Game Development Lifecycle

### Lifecycle overview

Over the last decade, the way games have been developed and monetized has changed dramatically.
Most if not all top grossing games are now built using a games-as-a-service strategy, meaning titles are shipped in cycles of constant iteration to increase engagement and monetization of players over time. Games-as-a-Service models have the ability to create sticky, high-margin games, but they also heavily depend on cloud-based services such as game play analytics, multiplayer servers and matchmaking, player relationship management, performance marketing and more.

Data plays an integral role in the development and operation of video games. Teams need tools and services to optimize player lifetime value (LTV) with databases that can process terabytes to petabytes of evolving data, analytics solutions that can access that data with near real-time latency, and machine learning (ML) models that can translate insights into actionable and innovative gameplay features.

A game’s development lifecycle is unique to each studio. With different skillsets, resources, and genres of games, there is no one model. Below is a simplified view of a game development lifecycle for a studio running a games-as-a-service model.

What’s important to remember is that throughout your title’s development lifecycle, there is data that can help you better understand your audience, more effectively find and acquire players, and more easily activate and engage them. Whether using game play data to optimize creative decision making during pre-production, tapping machine learning models to predict and prevent churn, or identifying the next best offer or action for your players in real-time, **data is your friend**.

### Use data to develop a next-generation customer experience

In the game industry, customer experience (CX) is an important factor that can impact a player’s enjoyment of a game and the length of time they choose to play that game.
In today’s highly competitive and fast-paced games industry, a game studio’s ability to deliver exceptional and seamless customer experiences can be a strategic differentiator when it comes to cutting through the noise and winning a gamer’s [...]

**Figure: Game Development Lifecycle, Games-as-a-Service (GaaS) / Games-as-a-Community (GaaC)**

- **1. Pre-Production:** Brainstorm how to give life to the many ideas laid out in the planning phase.
- **2. Production:** Most of the time, effort, and resources spent on developing video games are spent in the production stage.
- **3. Testing:** Every feature and mechanic in the game needs to be tested for game loop and quality control.
- **4. Launch:** Whether developing alongside the community with alpha and beta releases, or launching into general availability, a game launch is a critical milestone.
- **5. Operation:** As studios increasingly adopt games-as-a-service models, the ongoing operation of a video game is as critical as the launch itself.

Stage labels in the figure include discovery, compatibility, integration, release, publish, awareness, onboarding, build and test, flighting and experimentation, operate, measure, engage, and monetize.

[...] can help drive value through customer experience:

`1.` **Personalization:** Game studios can use data analytics and machine learning to personalize the game experience for each player based on their preferences and behavior. This can include personalized recommendations for content, in-game events, and other features that are tailored to the player’s interests.

`2.` **Omnichannel support:** Players often use multiple channels, such as social media, forums, and in-game support, to communicate with game studios.
Next generation customer experience involves providing a seamless and integrated support experience across all these channels in near-real time.

`3.` **Continuous improvement:** Game studios can use data and feedback from players to continuously improve gathering feedback on new features and using it to refine and optimize the game over time.

In summary, defining what a next generation customer experience looks like for your game is important because it can help you create a more personalized, seamless, and enjoyable experience for your players, which can lead to increased engagement, monetization, and loyalty. There are many ways teams can use data throughout a game’s development lifecycle, but far and away the most valuable focus area will be in building and refining the customer experience.

Throughout the rest of this guide, we will dig into the most common use cases for data, analytics, and AI in game development, starting with where we recommend everyone begins: game analytics.

# Getting Started with Gaming Use Cases

### Where do I start? Start with game analytics

**Overview**

Big question: Where’s the best place to start when it comes to game data, analytics, and AI? For most game studios, the best place to start is with game analytics. Setting up a dashboard for your game analytics that helps you correlate data across disparate sources is extremely valuable in a world where there is no one gaming data source to rule them all. An effective dashboard should include your game telemetry data, data from any game services you’re running, and data sources outside of your game such as stores, marketplaces, and social media. See below.

**Data Sources:** Game Telemetry, Game Services, Other Sources

**What we’re trying to solve/achieve**

Getting a strong foundation in game analytics unlocks more advanced data, analytics, and AI use cases. For example, concurrent player count plus store and marketplace data [...] and lifetime value.
Usage telemetry combined with crash reporting and social media listening helps you more quickly uncover where players might be getting frustrated. And correlating chat logs, voice transcriptions, and/or Discord and Reddit forums can help you identify disruptive behavior before it gets out of hand, giving you the tools to take actionable steps to mitigate toxicity within your community.

**Get started and set up your Analytics Dashboard**

### Understand your audience

With your analytics pipelines set up, the first area of focus is to better understand your audience. This can help you inform a variety of key business decisions, from the highest macro order of “what game(s) to develop”, to how to market and monetize those games, and how to optimize the player experience.

By understanding the demographics, preferences, and behaviors of their audience, a game studio can create games that are more likely to appeal to their target market and be successful. You can also use this understanding to tailor your marketing and monetization strategies to the needs and preferences of your players.

Additionally, understanding your audience can help you identify potential pain points or areas for improvement within your games, allowing you to proactively make changes to address these issues and improve the player experience before a player potentially churns.

[...] that are relevant and engaging to your players, giving you tools to effectively market and monetize with your audience.

**Let’s start with Player Segmentation.**

##### Player Segmentation

**Overview**

Player segmentation is the practice of dividing players into groups based on shared characteristics or behaviors. Segmentation has a number of benefits. You can better understand your players, create more personalized content, improve player retention, and optimize monetization, all of which contribute to an improved player experience.

**What we’re trying to solve/achieve**

The primary objective of segmentation is to ensure you’re not treating your entire playerbase exactly the same. Humans are different, and your players have different motivations, preferences and behaviors. Recognizing this and engaging with them in a way that meets them where they’re at is one of the most impactful ways you can cultivate engagement with your game.

As we mentioned above, the benefits of segmentation are broad reaching. Through better understanding of your playerbase, you can better personalize experiences, tailoring content and customer experience to specific groups of players in a way that increases engagement and satisfaction. Better understanding of your players also helps in improving player retention. By identifying common characteristics of players who are at risk of churning (i.e., stopping play), you can develop targeted strategies that only reach specific audiences.

Create advanced customer segments to build out more effective user stories, and identify potential purchasing predictions based on behaviors. Leverage existing sales data, campaigns and promotions systems to create robust segments with actionable behavior insights to inform your product roadmap. You can then use this information to build useful customer clusters that are targetable with different promos and offers to drive more efficient acquisition and deeper engagement with existing players.

**Get started with Player Segmentation**

##### Player Lifetime Value

**Overview**

Player lifetime value (LTV) is a measure of the value that a player brings to a game over the lifetime they play that game. It is typically calculated by multiplying the average revenue per user (ARPU) by the average player lifespan. For example, if the average player spends $50 per year and plays the game for 2 years, their LTV would be $50 * 2 = $100.
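The LTV arithmetic above can be written as a small helper. This is a minimal sketch of the simple ARPU times lifespan definition given here; the function and parameter names are illustrative assumptions, not part of any Databricks API.

```python
def player_ltv(arpu_per_year: float, lifespan_years: float) -> float:
    """Simple LTV: average revenue per user per year times average lifespan in years."""
    if arpu_per_year < 0 or lifespan_years < 0:
        raise ValueError("ARPU and lifespan must be non-negative")
    return arpu_per_year * lifespan_years


# The worked example from the text: $50 per year for 2 years.
example_ltv = player_ltv(50.0, 2.0)
print(f"LTV: ${example_ltv:.2f}")  # LTV: $100.00
```

The same helper can be applied per segment by passing segment-level averages instead of playerbase-wide ones.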
**What we’re trying to solve/achieve**

Game studios care about LTV because it helps them understand the long-term value of their players and make informed decisions about how to invest in player acquisition and retention. For example, if the LTV of a player is higher than the cost of acquiring them (e.g., through advertising), it may be worth investing more in player acquisition. On the other hand, if the LTV of a player is lower than the cost of acquiring them, it may be more cost-effective to focus on retaining existing players rather than acquiring new ones.

LTV is one of the more important metrics that game studios, particularly those building live service games, can use to understand the value of their players. It is important to consider other metrics as well, such as player retention, monetization, and engagement.

**Get started with Player Lifetime Value**

##### Social Media Monitoring

**Overview**

As the great Warren Buffett once said, “It takes 20 years to build a reputation and five minutes to ruin it. If you think about that, you’ll do things differently.” Now more than ever, people are able to use social media and instantly amplify their voices to thousands of people who share similar interests and hobbies.

Take Reddit as an example. r/gaming, the largest video game community (also called a subreddit), has over 35 million members with nearly 500 new posts and 10,000 new comments per day, while over 120 game-specific subreddits have more than 10,000 members each, the largest being League of Legends with over 700,000 members. The discourse that takes place on online social platforms generates massive amounts of raw and organic data that can be used to understand how customers think and discover exactly what they want.

The act and process of monitoring content online across the internet and social media for keyword mentions and trends for downstream processing and analytics is called media monitoring.
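To make the idea of tracking keyword mentions concrete, here is a minimal sketch that counts how many posts mention each keyword in a watch list. The sample posts and keywords are made up for illustration, and the function name is hypothetical; a real pipeline would pull posts from a platform API and handle far larger volumes.

```python
import re
from collections import Counter


def count_keyword_mentions(posts, keywords):
    """Count, per keyword, how many posts mention it (case-insensitive, whole words)."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for post in posts:
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, post, flags=re.IGNORECASE):
                counts[keyword] += 1
    return counts


# Hypothetical sample data, not real player posts.
sample_posts = [
    "The new matchmaking update is great",
    "Matchmaking takes forever since the patch",
    "Love the new skins, but the servers keep crashing",
]
mentions = count_keyword_mentions(sample_posts, ["matchmaking", "servers", "refund"])
print(mentions.most_common())
```

Counting posts rather than raw occurrences keeps one very repetitive post from dominating a trend; that is a design choice for this sketch, not a requirement of media monitoring.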
By applying media monitoring to social media platforms, game developers are able to gain new advantages that previously might not have been possible, including:

- Programmatically aggregate product ideas for new feature prioritization
- Promote a better user experience by automatically responding to positive or negative comments
- Understand the top influencers in the industry who can sway public opinion
- Monitor broader industry trends and emerging segments such as free-to-play games
- Detect and react to controversies or crises as they begin
- Get organic and unfiltered feedback of games and features
- Understand customer sentiment at scale
- Make changes faster to keep customer satisfaction high and prevent churn

By failing to monitor, understand, and act on what customers are saying about the games and content you release as well as broader industry trends, you risk those customers leaving for a better experience that meets the demands and requirements of what customers want.

**What we’re trying to solve/achieve**

By monitoring and listening to what existing and potential customers are saying on social media, game developers are able to get a natural and organic understanding of how customers actually feel about the games and products they release, or gauge consumer interest before investing time and money in a new idea.

The main process for social media monitoring is to gather data from different social media platforms, such as Twitter or YouTube, process those comments or tweets, then take action on the processed data. While customer feedback can be manually discovered and processed in search of certain keyword mentions or feedback, it is a much better idea to automate it and do it programmatically.

**Get started with Social Media Monitoring**

##### Player Feedback Analysis

**Overview**

Player feedback analysis is the process of collecting, analyzing, and acting on player feedback to inform game development.
It involves collecting player feedback from multiple sources, such as in-game surveys, customer support tickets, social media, marketplace reviews, and forums, and using data analytics tools to identify patterns, trends, and insights. The goal of player feedback analysis is to better understand player needs, preferences, and pain points, and use this information to inform game development decisions and improve the overall player experience.

Player feedback analysis is an important part of game development as it helps ensure that the game continues to meet player needs and expectations. By regularly collecting and analyzing player feedback, game studios can make data-driven decisions to improve the game, increase player engagement and retention, and ultimately drive success and growth.

For this use case, we’re going to focus on taking online reviews for your video game and categorizing the different topics players are talking about (bucketing topics) in order to better understand the themes (via positive or negative sentiment) affecting your community.

**What we’re trying to solve/achieve**

This is incredibly helpful, providing data-driven customer insight into your development process. Whether used in pre-production, such as looking at reviews of similar games to learn where those games have strengths and weaknesses, or with a live service title to identify themes that can apply to your product roadmap, player feedback analysis helps teams better support and cultivate engagement with the player community.

Ultimately, player feedback analysis does two things: 1) it can help you stack rank themes according to positive and negative sentiment, and 2) you can weight those themes according to impact on player engagement, toxicity, monetization, churn, and more. We’ve all read reviews that are overly positive, or overly negative. The process of player feedback analysis helps to normalize feedback across the community (keeping in mind, only for those who have written a review), so you’re not over-indexing on one review, or a single theme that may seem in the moment very pressing.

**Get started with Player Feedback Analysis**

##### Toxicity Detection

**Overview**

Across massively multiplayer online video games (MMOs), multiplayer online battle arena games (MOBAs) and other forms of online gaming, players continuously interact in real time to either coordinate or compete as they move toward a common goal: winning. This interactivity is integral to game play dynamics, but at the same time, it’s a prime opening for toxic behavior, an issue pervasive throughout the online video gaming sphere.

Toxic behavior manifests in many forms, such as the varying degrees of griefing, cyberbullying and sexual harassment that are illustrated in the matrix below from [Behaviour Interactive](http://gamestudies.org/2004/articles/deslauriers_iseutlafrancestmartin_bonenfant), which lists the types of interactions seen within the multiplayer game, _Dead by Daylight_.

_Figure: matrix of Survivor and Killer interactions in Dead by Daylight, ordered from less toxic to most toxic. Labels include gen rushing, hiding, looping, emotes, blinding, sandbagging, unhooking, teabagging, reporting, text chatting, hatch, disconnecting, farming, camping, being away from keyboard (AFK), dribbling, tunneling, lobby dodging, body blocking, face camping, and slugging._

In addition to the [personal toll](https://msutoday.msu.edu/news/2021/faculty-voice-gaming-and-toxicity) that toxic behavior can have on gamers and the community -- an issue that cannot be [...] game studios. For example, a study from [Michigan State University](https://msutoday.msu.edu/news/2021/faculty-voice-gaming-and-toxicity) revealed that 80% of players recently experienced toxicity, and of those, 20% reported leaving the game due to these interactions. Similarly, a study from [Tilburg University](https://arno.uvt.nl/show.cgi?fid=145375) showed that having a disruptive or toxic encounter in the first session of the game led to players being over three times more likely to leave the game without returning. Given that player retention is a top priority for many studios, particularly as game delivery transitions from physical media releases to long-lived services, it’s clear that toxicity must be curbed.

Compounding this issue related to churn, some companies face challenges related to toxicity early in development, even before launch.
For example, [Amazon’s Crucible](https://www.wired.com/story/amazon-crucible-release-first-big-videogame/) was released into testing without text or voice chat due in part to not having a system in place to monitor or manage toxic gamers and interactions. This illustrates that the scale of the gaming space has far surpassed most teams’ ability to manage such behavior through reports or by intervening in disruptive interactions. Given this, it’s essential for studios to integrate analytics into games early in the development lifecycle and then design for the ongoing management of toxic interactions.

**What we’re trying to solve/achieve**

Toxicity in gaming is clearly a multifaceted issue that has become a part of video game culture and cannot be addressed universally in a single way. That said, addressing toxicity within in-game chat can have a huge impact given the frequency of toxic behavior and the ability to automate the detection of it using natural language processing (NLP). In summary, leveraging machine learning helps teams better identify disruptive behavior so that better-informed decisions about how to handle it can be made.

**Get started with Toxicity Detection**

In this section, we’re going to talk about how to use your data to more effectively find your target audience across the web. Whether you’re engaging in paid advertising, influencer or referral marketing, PR, cross promotion, community building, or something else, use data to separate activity from impact. You want to focus on the channels and strategies that leverage your resources most effectively, be that time or money.

Say you have a cohort of highly engaged players who are spending money on your title, and you want to find more gamers just like that. Doing an analysis on the demographic and behavioral data of this cohort will give you the information needed to use an ad platform (such as Meta, Google, or Unity) to do lookalike modeling and target those potential gamers for acquisition.

##### Multi-Touch Attribution

**Overview**

Multi-touch attribution is a method of attributing credit to different marketing channels or touchpoints that contribute to a sale or conversion. In other words, it is a way of understanding how different marketing efforts influence a customer’s decision to make a purchase or take a desired action. There are a variety of different attribution models that can be used to assign credit to different touchpoints, each with its own strengths and limitations. For example, the last-click model attributes all credit to the last touchpoint that the customer interacted with before making a purchase, while the first-click model attributes all credit to the first touchpoint. Other models, such as the linear model or the time decay model, distribute credit across multiple touchpoints based on different algorithms.

**What we’re trying to solve/achieve**

Multi-touch attribution can be useful for game studios because it can help them understand which marketing channels or efforts are most effective at driving conversions and inform their marketing strategy. However, it is important to choose the right attribution model for your title based on your business model (one-time purchase, subscription, free-to-play, freemium, in-game advertising, etc.) and to regularly review and optimize your attribution efforts to ensure they are accurate and effective.

**Get started with Multi-Touch Attribution**

-----

### Activating Your Playerbase

So far, we’ve discussed how to better understand your players, and how to acquire more of your target audience. Next, we’re going to dig into how to better activate your players to create a more engaged and loyal playerbase that stays with your game for the long term. Here, we’re going to focus on strategies that differentiate your gamer experience.

##### Player Recommendations

**Overview**

Player recommendations are suggestions for content or actions that a game studio makes to individual players based on their interests and behaviors. These recommendations can be used to promote specific in-game items, encourage players to try new features, or simply provide a personalized experience.

**What we’re trying to solve/achieve**

Player recommendations matter to game studios because they can help improve player retention, engagement, and monetization. By providing players with recommendations that are relevant and engaging, studios can increase the likelihood that players will continue to play their games and make in-game purchases. Additionally, personalized recommendations can help improve the overall player experience and increase satisfaction.

Game studios can use a variety of techniques to create player recommendations, such as machine learning algorithms, collaborative filtering, and manual curation. It is important to regularly review and optimize these recommendations to ensure that they are effective and relevant to players.

**Get started with Player Recommendations**

##### Next Best Offer/Action

**Overview**

Next best offer (NBO) and next best action (NBA) are techniques that businesses use to make personalized recommendations to their customers. NBO refers to the practice of recommending the most relevant product or service to a customer based on their past purchases and behaviors. NBA refers to the practice of recommending the most relevant action or interaction to a customer based on the same information. For example, a game studio might use NBO to recommend an in-game purchase to a player based on their past spending habits and the items they have shown an interest in. They might use NBA to recommend a specific level or event to a player based on their progress and interests.
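
As a rough illustration of the NBO idea above, the sketch below ranks candidate offers for a single player by expected value (an estimated acceptance probability multiplied by the offer price). This is a simplified stand-in rather than a method described in this guide: the `Offer` type, the `acceptance_probability` heuristic, the catalog, and every number are hypothetical, and a production system would more likely learn these probabilities from player data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str
    category: str  # hypothetical categories, e.g. "currency", "cosmetic"
    price: float   # price in USD


def acceptance_probability(offer, past_purchases, baseline=0.02):
    """Hypothetical heuristic: the larger the share of a player's past
    purchases that fall in the offer's category, the likelier they accept."""
    if not past_purchases:
        return baseline
    share = past_purchases.count(offer.category) / len(past_purchases)
    return min(1.0, baseline + 0.5 * share)


def next_best_offer(offers, past_purchases):
    """Return the offer with the highest expected value for this player."""
    return max(
        offers,
        key=lambda o: acceptance_probability(o, past_purchases) * o.price,
    )


# Illustrative catalog and purchase history (made-up values).
catalog = [
    Offer("Gem pack", "currency", 4.99),
    Offer("Dragon skin", "cosmetic", 9.99),
    Offer("New hero", "character", 14.99),
]
history = ["cosmetic", "cosmetic", "currency"]

best = next_best_offer(catalog, history)
print(best.name)  # prints "Dragon skin"
```

With no purchase history, every offer falls back to the same baseline probability, so the sketch simply picks the highest-priced item. That is one reason a real implementation would want richer behavioral signals than category counts.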
**What we’re trying to solve/achieve**

It’s important to remember that next best offer is a specific use case within personalization that involves making recommendations to players on the most valuable in-game item or action they should take next. For example, a next best offer recommendation in a mobile game might suggest that a player purchase a specific in-game currency or unlock a new character.

Both NBO and NBA can be used to improve customer retention, engagement, and monetization by providing personalized recommendations that are more likely to be relevant and appealing to individual customers. They can be implemented using a variety of techniques, such as machine learning algorithms or manual curation.

**Get started with Next Best Offer/Action**

##### Churn Prediction & Prevention

**Overview**

Video games live and die by their player base. For Games-as-a-Service (GaaS) titles, engagement is the most important metric a team can measure. Naturally, proactively preventing churn is critical to sustained engagement and growth. Through churn prediction and prevention, you will be able to analyze behavioral data to identify subscribers with an increased risk of churn. Next, you will use machine learning to quantify the likelihood of a subscriber to churn, as well as indicate which factors create that risk.

**What we’re trying to solve/achieve**

Balancing customer acquisition and retention is critical. This is the central challenge to the long-term success of any live service game. This is particularly challenging in that successful customer acquisition strategies needed to get games to scale tend to be followed by service disruptions or declines in quality and customer experience, accelerating player abandonment. To replenish lost subscribers, the acquisition engine continues to grind and expenses mount. As games reach for customers beyond the core playerbase they may have initially targeted, the title may not resonate [...] may overwhelm the ability of these players to consume, reinforcing the overall problem of player churn.

At some point, it becomes critical for teams to take a cold, hard look at the cost of acquisition relative to the subscriber lifetime value (LTV) earned. These figures need to be brought into a healthy balance, and retention needs to be actively managed, not as a point-in-time problem to be solved, but as a “chronic condition” which needs to be managed for the ongoing health of the title.

Headroom for continued acquisition-driven growth can be created by carefully examining why some players leave and some players stay. When centered on factors known at the time of acquisition, gaming studios may have the opportunity to rethink key aspects of their acquisition strategy that promote higher average retention rates, which can lead to higher average revenue per user.

**Prerequisites for use case**

This use case assumes a certain level of existing data collection infrastructure in the studio. Notably, a studio ready to implement a churn prediction and prevention model should have:

- A cloud environment where player data is stored
- Source data that contains player behavior and session telemetry events from within the game. This is the foundation that insights can be built on top of.

**Get started with Churn Prediction & Prevention**

##### Real-time Ad Targeting

**Overview**

Real-time ad targeting in the context of game development focuses on using data to deliver personalized and relevant advertisements to players in near real-time, while they are playing a game. Real-time targeting is performance based, using highly personalized messaging that is achieved by using data to precisely determine the most opportune moments to display ads, based on factors such as player behavior, game state, and other contextual information. Knowing when to send those ads is based on data.
This use case is specific to titles using in-game advertising as a business model. It’s important to note that in-game real-time ad targeting requires a sophisticated tech stack, with [...] with bigger ad ecosystem, ad networks and partners. The Databricks Lakehouse platform is an optimal foundation as it already contains many of the connectors required to enable this use case.

**What we’re trying to solve/achieve**

The goal of in-game real-time ad targeting is to provide a more immersive and relevant advertising experience for players, while also increasing the effectiveness of the ads for advertisers. By delivering targeted ads that are relevant to each player’s interests, game developers can create a more enjoyable and personalized gaming experience, which can help to reduce churn and increase the lifetime value of each player. Additionally, real-time ad targeting can help game developers monetize their games more effectively, as advertisers are willing to pay a premium for hyper-targeted and engaged audiences.

**Get started with Real-time Ad Targeting**

### Operational use cases

In the game development industry, operational analytics are essential for ensuring a smooth and efficient production process. One common use case is anomaly detection, where data analytics is utilized to identify any unusual patterns or behaviors in the game, such as crashes or performance issues. This helps developers quickly identify and fix problems, improving the overall quality of the game. Another example is build pipelines, where data analytics can be used to monitor and optimize the process of creating new builds of the game. By tracking key metrics such as build time, error rates, and resource utilization, developers can make informed decisions about how to optimize the build process for maximum efficiency. Other operational use cases in game development include tracking player behavior, measuring server performance, and analyzing sales and marketing data. Let’s explore a few of these below.

##### Anomaly Detection

**Overview**

Anomaly detection plays an important role in the operation of a live service video game by helping to identify and diagnose unexpected behaviors in real-time. By identifying patterns and anomalies in player behavior, system performance, and network traffic, teams can detect and diagnose server crashes, performance bottlenecks, and hacking attempts. The ability to understand if there will be an issue before it becomes widespread is immensely valuable. Without anomaly detection, which is a form of advanced analytics, you’re always in a reactive (rather than proactive) state. Anomaly detection is a type of quality of service solution.

**What we’re trying to solve/achieve**

The goal of anomaly detection is to ensure that players have a stable and enjoyable gaming experience. This has an impact across your game, from reducing downtime, to minimizing player churn, to improving your game’s reputation and revenue. Additionally, the insights gained from anomaly detection can be used to mitigate cheating and disruptive behavior.

**Get started with Anomaly Detection**

##### Build Pipeline

**Overview**

A build pipeline is a set of automated processes that are used to compile and assemble the code, assets, and resources that make up a game project. The build pipeline typically includes several stages, such as code compilation, optimization, testing, and release. The purpose of a build pipeline is to streamline the game development process and ensure that each stage of development is completed efficiently and effectively. A build pipeline can be configured to run automatically, so that new builds are generated whenever changes are made to the code or assets. This helps to ensure that the game is always up-to-date and ready for testing and release. The logs are collected in near real time from build servers.
A simplified example: Dev X is committing code on title Y, submitted on day Z, along with the log files from the pipeline and build server. Builds typically take multiple hours to complete, requiring significant amounts of compute via build farms. Being able to [...] are wasting compute, and being able to predict which builds will fail as they go through the pipeline, are ways to curb operational expenses.

**What we’re trying to solve/achieve**

With this use case, we’re seeking to reduce wasted compute and build a foundational view of what was developed, by whom, when, and how testing performed. In an ideal state, our automated build pipeline could send a notification to the developer with a confidence metric on the build making it through, allowing them to decide whether to continue or move another build through the pipeline. Often, developers do not have clear visibility until the build has completed or failed. By providing more insight to devs into the build pipeline process, we can increase the rate at which builds are completed efficiently and effectively.

**Get started with Build Pipeline**

##### Crash Analytics

**Overview**

Games crash; it is a fact of game development. The combination of drivers, hardware, software, and configurations creates unique challenges in tracking, resolving and managing the user experience. Crash analytics and reporting is the process of collecting information about crashes or unexpected failures in a software application, in this case, a video game. A crash report typically includes information about the state of the game at the time of the crash, such as what the player was doing and what resources were being used.

How long crash testing takes can vary, depending on the game’s business model, amount of content, and scope. For a title with a one-time release, where there is a large amount of content and a complex storyline, the chances of hidden crashes causing errors while in development are high, so more time is required for testing before the game can be published. For titles built in a game-as-a-service model, i.e. a game shipped in cycles of constant iteration, crash detection should be done continuously, since errors in newly released content might affect the base game and lead to crashes.

Increasingly, titles are being released in alpha (where developers do the testing), closed beta (which includes a limited group of testers/sample users who do the gameplay testing) and open beta (where anyone interested can register to try the game), all of which happens before the game is “officially” released. Regardless of alpha, beta, or GA, players may stumble over game crashes, which trigger crash reports that are sent to the developers for fixing. But sometimes, it can be challenging to understand the issue that caused the crash from crash reports provided by your game’s platform.

**What we’re trying to solve/achieve**

Ultimately, the purpose of crash analytics is to identify the root cause of a crash, and help you take steps to prevent similar crashes from happening in the future. This feedback loop can be tightened through automation in the data pipeline. For example, by tracking crashes caused on builds from committers, the data can provide build suggestions to improve crash rate. Furthermore, teams can automate deduplication when multiple players experience the same errors, helping to reduce noise in the alerts received.

**Get started with Crash Analytics**

-----

# Things to look forward to

This eBook was created to help game developers better wrap their heads around the general concepts in which data, analytics, and AI can be used to support the development and growth of video games.
**If you only have 5 minutes, these takeaways are critical to your success.**

For more information on advanced data, analytics, and AI use cases, as well as education resources, we highly recommend the Databricks training portal [dbricks.co/training](http://dbricks.co/training).

**Top takeaways:**

If you take nothing else from this guide, here are the most important takeaways we want to leave with you on your journey.

`1.` **Data is fundamental.** Data, analytics, and AI play a role throughout the entire game development lifecycle, from discovery to pre-production, and from development to operating a game as a live service. Build better games, cultivate deeper player engagements, and operate more effectively by utilizing the full potential of your data.

`2.` **Define your goals.** Start by establishing the goals of what you’re hoping to learn or understand around your game. Clear goals make it easier to identify key metrics to track. Example goals include developing high-quality games that provide engaging and satisfying player experiences, increasing player engagement and retention by analyzing and improving gameplay and mechanics, and building a strong and positive brand reputation through effective marketing and community outreach.

`3.` **Identify and understand your data sources.** Spend time to identify and understand the breadth of data sources you are already collecting, be that game telemetry, marketplace, game services, or sources beyond the game like social media. It is critical to collect the right data, and track the right metrics based on the goals and objectives you have set for your game.

`4.` **Start small, and iterate quickly.** Recognize that goals and objectives evolve as you learn more about the interaction [...] are most effective when scoped small with tight feedback loops, allowing you to quickly adapt with your community and alongside shifting market conditions.

`5.` **Game analytics forms the foundation.** Start by getting a game analytics dashboard up and running. The process of building out a dashboard will naturally require connecting and transforming your data in a way that unlocks more advanced use cases down the road.

`6.` **Plan and revisit your data strategy frequently.** Once dashboarding is set up, you’ll have a better picture of what downstream data use cases make the most sense for your game and business objectives. As you move to use cases such as player segmentation, churn analysis, and player lifetime value, revisit your data strategy frequently to ensure you’re spending time on use cases that drive actionable insights for you and your team.

`7.` **Show value broad and wide.** Whether your data strategy is new or well established on the team, build the habit of communicating broadly to stakeholders across the company. Early in the process, it is important to gather critical feedback on what data is helpful and where there are opportunities for improvement. The worst thing that can happen is you create something that no one uses. That is a waste of everyone’s time and money.

`8.` **Ask for help.** Engage with your technical partners. There are humans who can help ensure you’re developing your data and analytics platform in a way that is efficient and effective. There are numerous partners with domain expertise in data science and data engineering that can accelerate your data journey. Here is our recommended partner list for [data, analytics, and AI workloads](https://www.databricks.com/company/partners/consulting-and-si).

`9.` **Participate in the community.** The community for game analytics is large and growing. It is important to research and [...] your needs and interests. Here are a few of our favorites:

   `a.` [IGDA Game Analytics](https://igda.org/sigs/analytics/): The IGDA has a number of Special Interest Groups that bring together user researchers, designers, data engineers and data scientists focused on understanding player behavior and experiences. They offer resources and events for those working in games user research, including a yearly Games User Research Summit.

   `b.` [Data Science Society](https://www.datasciencesociety.net/): The Data Science Society is a global community of data scientists and engineers. While not specifically focused on game development, they offer a wealth of resources and opportunities for learning, networking, and collaboration in the field of data science.

   `c.` [Hugging Face](https://huggingface.co/): Hugging Face is a hub of open source models for natural language processing, computer vision, and other fields where AI plays a role. They also provide an online platform where users can access pre-trained models and tools, share their own models and datasets, and collaborate with other developers in the community.

   `d.` [Data Engineering subreddit](https://www.reddit.com/r/dataengineering/): The Data Engineering subreddit is a forum for data engineers to discuss topics related to building and managing data pipelines, data warehousing, and related technologies. While not specifically focused on game development, it can be a valuable resource for those working on data engineering in the gaming industry.

`10.` **Go beyond dashboards.** Looking at dashboards is only the first step in your data journey. Imagine how the output of your data can be presented in a way that helps stakeholders across your company achieve more. For example, you could drop data into an application that helps game designers make balancing decisions based on player events.

-----

# APPENDIX

Ultimate class build guide

### Creating a character

The heart and soul of mature data teams is formed by this trio of classes.
There are many aspects to these roles, but they can be summarized in that Data Engineers create and maintain critical data workflows, Data Analysts interpret data and create reports that keep the business teams running seamlessly, and Data Scientists are responsible for making sense of large amounts of data. Depending on the size of the organization, individuals may be required to multiclass in order to address the needs of the team. In smaller studios, it’s often developers who wear multiple hats, including those in data engineering, analytics and data science.

Whether you’re looking to stand up an analytics dashboard to report on the health of a title or building a recommendation engine for your players, this guide will help you better understand the unique classes required to develop and maintain an effective data, analytics, and AI platform.

##### Data Engineers

**Goals and Priorities of Data Engineers**

- Enable access to usable data for real-time insights: data that both enables timely decision-making and is accurate and reproducible
- Increase user confidence and trust in data. This involves ensuring high consistency and reliability in ETL processes
- Limit the issues and failures experienced by other engineers and data scientists, allowing those roles to focus less on troubleshooting and more on drawing meaningful conclusions from data and building new products / features

**What Data Engineers care about:**

- Enabling access to data for real-time insights: data that both enables timely decision-making and is accurate and reproducible
- Building high-performance, reliable and scalable pipelines for data processing
- Delivering data for consumption from a variety of sources by Data Analysts and Data Scientists against tight SLAs
- A Data Engineer’s biggest challenge? Collaboration across teams

Data engineers build systems that collect, manage, and convert source data into usable information for data scientists and business analysts to interpret.
Their ultimate goal is to make data accessible so that teams can use it to evaluate and optimize a goal or objective.

**Responsibilities:**

- Data Engineers are responsible for data migration, manipulation, and integration of data (joining dissimilar data systems)
- Setup and maintenance of ETL pipelines to convert source data into actionable data for insights. It is the responsibility of the data engineer to make sure these pipelines run efficiently and are well orchestrated.
- The Data Engineer sets up the workflow process to orchestrate pipelines for the studio’s data and continuously validates it
- Managing workflows to enable data scientists and data analysts, and ensuring workflows are well-integrated with different parts of the studio (e.g., marketing, test/QA, etc.)

##### Data Scientists

Data scientists determine the questions their team should be asking and figure out how to answer those questions using data. They often develop predictive models for theorizing and forecasting.

**Responsibilities:**

- Responsible for making sense of the large amounts of data collected for a given game title, such as game telemetry, business KPIs, game health and quality, and sources beyond the game such as social media listening
- The analytics portion of a Data Scientist’s job means looking at new and existing data to try and discover new things within it
- The engineering component may include writing out pipeline code and deploying it to a repository
- Data Scientists are responsible for building, maintaining, and monitoring models used for analytics and/or data products

**Goals and Priorities:**

- Developing new business capabilities (such as behavioral segmentation, churn prediction, recommendations) and optimizing processes around those capabilities
- Increase ROI by building algorithms and tools that are maintainable and reusable
- Exploring (or further expanding) the use of machine learning models for specific use cases
- Bridges the gap between engineering and analytics, between the technology teams and business teams
- Provides the business side of the studio with data that is crucial in decision-making, for example a churn model that helps predict the impact of a new feature set

**What Data Scientists care about:**

- Creating exploratory analysis or models to accurately predict business metrics, e.g., customer spend, churn, etc., and provide data-driven recommendations
- Enable the team with actionable insights that are easy to understand and well curated
- Create and move models from experimentation to production
- A Data Scientist’s biggest challenge? Keeping up with advancements and innovation in data science, and knowing which tools and libraries to use

##### Data Analysts

A data analyst reviews data to identify key insights into a game studio’s customers and ways the data can be used to solve problems.

**Responsibilities:**

- Often serves as the go-to point of contact for non-technical business / operations colleagues for data access / analysis questions
- Analysts often interpret data and create reports or other documentation for studio leadership
- Analysts typically are responsible for mining and compiling data
- Streamline or simplify processes when possible

**Goals and Priorities:**

- Empower stakeholder and business teams with actionable data
- “Catch things before they break”. Proactively mitigate potential data issues before they occur (for internal and external customers)
- Analysts are often recruited to assist other teams (i.e., BI teams) with their domain knowledge
- Driving business impact through documentation and reliable data

**What Data Analysts care about:**

- Easy access to high quality data.
- Quickly finding insights from data with SQL queries and interactive visualizations.
- The ability to easily share insights while creating impactful assets for others to consume (dashboards, reports).
- A Data Analyst’s biggest challenge? Working with complex processes and complicated technologies that are filled with messy data. While fighting these challenges, Analysts are often left alone or forced through paths that prevent collaboration with others across the team or organization.
- Untrustworthy data: Analysts often get asked to provide answers to leadership, who will leverage the data to determine the direction of the company. When the data is untrustworthy or incorrect due to the previously mentioned challenges, this can eventually lead to a lack of trust in the data teams from leadership or the business.

-----

# Data access and the major cloud providers

### Cloud Rosetta Stone

[AWS / Azure / GCP Service Comparison - Click Here](https://cloud.google.com/free/docs/aws-azure-gcp-service-comparison)

If you are newer to the cloud computing space, it is easy to get lost among the hundreds of different services offered by the three major cloud providers. The table below is meant to highlight the important data, analytics, and AI services used by the hyperscale service providers Amazon, Microsoft, and Google. In addition, it aims to pair up services from different cloud providers that serve the same purpose.

|Service Type|Service Description|AWS Service|Azure Service|GCP Service|
|---|---|---|---|---|
|Storage|Object storage for various file types and artifacts (CSV, JSON, Delta, JAR). Objects can be retrieved by other services|Amazon Simple Storage Service (S3)|Azure Blob Storage|Google Cloud Storage|
|Compute|High-performance VMs to run applications. Platform where data transformations are run in Big Data apps.|Amazon Elastic Compute Cloud (EC2)|Azure Virtual Machines|Google Compute Engine|
|Messaging|Real-time event streaming services to write data to object stores or data warehouses. One OSS version is Kafka|Amazon Kinesis|Azure Service Bus Messaging|Google Pub/Sub|
|Data Warehouse|Traditional data storage layer for structured data, to then be used by data analysts. Often used to read from a Data Lake, which acts as a single source of truth|Redshift or Databricks|Synapse or Databricks|BigQuery or Databricks|

**Jargon Glossary**

|Term|Definition|
|---|---|
|CDP|Customer Data Platform (CDP). A CDP is a piece of software that combines data from multiple tools to create a single centralized customer database containing data on all touch points and interactions with your product or service.|
|ETL|Extract, Transform, Load. In computing, extract, transform, load is a three-phase process where data is extracted, transformed and loaded into an output data container. The data can be collated from one or more sources and it can also be output to one or more destinations.|
|KPI|Key Performance Indicator, a quantifiable measure of performance over time for a specific objective. KPIs provide targets for teams to shoot for, milestones to gauge progress, and insights that help people across the organization make better decisions.|
|POC|Proof of Concept (PoC). A proof of concept is a prototype or initial implementation of a solution that is developed to demonstrate the feasibility of a concept or idea. It is often used to test the effectiveness of a new tool or approach to data analysis or machine learning before investing in a full-scale implementation.|
|MVP|Minimum Viable Product (MVP). An MVP refers to the smallest possible solution that can be delivered to meet a specific business need. The goal of an MVP is to quickly validate assumptions and prove the potential value of a larger project. By delivering a smaller solution first, stakeholders can gain confidence in the project and see a return on investment sooner, while also providing feedback to improve the larger project.|
|ROI|Return on investment (ROI), which is calculated by dividing the profit earned on an investment by the cost of that investment.|
|Serverless computing|Using compute platforms that are completely managed by service providers. When using serverless computing, you simply execute queries or deploy applications and the service provider (AWS, Databricks, etc.) handles necessary server maintenance.|
|VPC|Virtual Private Cloud. A VPC is a virtual cloud networking environment, which helps organize and give you control of your resources. You also define how resources within your VPC can communicate with other regions, VPCs, and the public internet with traffic rules and security groups.|

### Getting started with the major cloud providers

Here are some quick ways to get started with the three major cloud providers: AWS, Azure, and GCP.

**AWS:**

`1.` **[Create an AWS account](https://portal.aws.amazon.com/billing/signup):** The first step is to create an account on the AWS website. This will give you access to the AWS Management Console, which is the web-based interface for managing your AWS resources.

`2.` **Use the AWS free tier:** AWS offers a free tier of service that provides a limited amount of free resources each month. This is a great way to get started and try out various AWS services without incurring any charges.

`3.` **Explore the AWS Management Console:** Once you have an account and are logged in, take some time to explore the AWS Management Console and familiarize yourself with the various services that are available.

`4.` **Next you can search for Databricks:** In the AWS Management Console, use the search bar in the top-left corner of the page and search for “Databricks”.

`5.` **Navigate to the Databricks page:** Once you have found the Databricks page, you can access it to get started with the Databricks service.

`6.` **Launch Databricks Workspace:** To launch the Databricks Workspace on AWS, you can use the CloudFormation template provided by Databricks. The Databricks CloudFormation template creates an IAM role, security group, and Databricks Workspace in your AWS account.

**Azure:**

`1.` **[Create an Azure account](https://azure.microsoft.com/en-us/free/gaming/):** The first step is to create an account on the Azure portal. This will give you access to the Azure portal, which is the web-based interface for managing your Azure resources.

`2.` **Take Azure tutorials:** Azure provides tutorials, documentation, and sample templates to help you get started. These resources can help you understand the basics of Azure and how to use its services.

`3.` **You can search for Databricks:** In the Azure portal, use the search bar at the top of the page and search for “Databricks”.

`4.` **Navigate to the Databricks page:** Once you have found the Databricks page, you can access it to get started with the Databricks service.

`5.` **Create a new Databricks workspace:** To create a new Databricks workspace, you can use the Azure portal, Azure CLI or Azure PowerShell. Once created, you’ll be able to access your Databricks Workspace through the Azure portal.

`6.` **Other Azure Services:** Once you have a Databricks workspace set up, you can easily connect it to other Azure services such as Azure Storage, Event Hubs, Azure Data Lake Storage, Azure SQL and Cosmos DB.

**GCP:**

`1.` **[Create a GCP account](https://console.cloud.google.com/freetrial):** The first step is to create an account on the GCP portal. This will give you access to the GCP Console, which is the web-based interface for managing your GCP resources.

`2.` **Explore the GCP Console:** Once you have an account and are logged in, take some time to explore the GCP Console and familiarize yourself with the various services that are available.

`3.` **Search for Databricks:** In the GCP Console, use the search bar in the top-left corner of the page and search for “Databricks”.
4. **Navigate to the Databricks page:** Once you have found the Databricks page, you can access it to get started with the Databricks service.
5. **Create a new Databricks workspace:** To create a new Databricks workspace, you can use the GCP Console or the gcloud command-line tool. Once created, you’ll be able to access your Databricks Workspace through the GCP Console.

-----

# Detailed Use Cases

### Getting started with game analytics

Fortunately, standing up an effective analytics dashboard is getting easier. It all starts with getting your data into an architecture that sets your team up for success. With any of the major cloud providers ([AWS](https://portal.aws.amazon.com/billing/signup), [Azure](https://azure.microsoft.com/en-us/free/gaming/), [GCP](https://console.cloud.google.com/freetrial)), you can land all your data in a cloud data lake, then use the Databricks Lakehouse architecture to run real-time and reliable processing. Databricks can then help you visualize that data in a dashboard, or send it to a visual analytics platform such as Tableau.

1. **Sign up for a Databricks account:** You’ll need to create an account on the Databricks website in order to use the platform.
2. **Access the Databricks portal:** Interact with the Databricks platform and run tasks such as creating clusters, running jobs, and accessing data.
3. **Set up a development environment:** You’ll need a development environment where you can write and test your code, whether you’re using a local IDE or the Databricks Workspace.
4. **Collect data:** Once you have your development environment set up, you can start collecting data from your game. This can involve integrating or building an SDK into your game code, or using another tool to send data to cloud storage.
5. **Process and analyze the data:** Once you have collected your data, you can use Databricks to process and analyze it. This can involve cleaning and transforming the data, running queries or machine learning algorithms, or creating visualizations.
6. **Monitor and optimize:** Regularly monitor your analytics to ensure that they are accurate and relevant, and use the insights you gain to optimize your game.

Keep in mind that these are just general steps to get started with Databricks for game analytics. The specific steps you’ll need to take will depend on your specific use case and needs.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://databricks.com/company/contact) to us.

**Tips / Best Practices**

- **Define your goals:** What do you want to learn from your analytics data? Having clear goals will help you focus on collecting the right data and making meaningful use of it.
- **Plan your data collection:** Determine what data you need to collect, how you will collect it, and how you will store it.
- **Consider privacy:** Make sure you are transparent with your players about what data you are collecting and how you will use it, and give them the option to opt out if they wish.
- **Use analytics to inform design:** Leverage your analytics data to inform decisions around game design, such as any balance changes or new content targeting a specific audience.
- **Monitor and test your analytics implementation:** Regularly check your analytics to ensure that data is being collected correctly, and conduct tests to validate the accuracy of your data.
- **Visualize your data:** Dashboarding your data is one of the most effective ways to quickly and effectively make sense of what’s happening at a given moment in time.
- **Use data to improve player retention:** Analyze player behavior and use the insights you gain to improve player retention, such as by identifying and addressing pain points or by providing personalized content.
- **Collaborate with your team:** Share your analytics findings with your team and encourage them to use the data to inform their work.
- **Keep it simple:** Don’t try to collect too much data or create overly complex analytics systems. Keep it simple and focused on your goals.
- **Start where you are:** If you’ve yet to gather all of your data, don’t go build some fancy model. Start with the data you have available to you and build from there.

### Getting started with Player Segmentation

Player segmentation is crucial to studios as it allows them to better understand their audience and tailor their game experience to meet players’ specific needs and preferences. By dividing players into different segments based on factors such as demographics, playing styles, and in-game behavior, studios can gain valuable insights into what motivates and engages their players. This information can then be used to design games that not only provide a more enjoyable experience for players, but also drive player retention and increase revenue for the studio. In a competitive industry where player satisfaction is key to success, player segmentation is an essential tool for studios to stay ahead of the game.

Start by evaluating your segmentation goals, such as:

- **Personalize the experience:** Change or create experience designs that are specific to the player.
- **Create relevant content:** Surface the best content to players based on the features and behaviors that matter most given the player’s place in the game’s life cycle.
- **Monetization:** Create tailored monetization strategies that effectively reach and convert each player group.
  For example, you may have a group of highly engaged players who are more likely to make in-app purchases, while another group is less likely to spend money but may be more receptive to advertisements.

The next steps are to identify, collect, and analyze player data. By gathering information on player behavior, preferences, and demographics, you can gain insights into players’ motivations, pain points, and what drives their engagement with your game.

There are multiple types of player data to collect, including:

- **Player Behavior:** Track player behavior and actions within your game to gain insights into their play style, preferences, and patterns.
- **Surveys:** Ask players directly about their preferences, motivations, and feedback through in-game surveys, email questionnaires, or other forms of direct communication.
- **Focus groups:** Gather a small group of players to discuss and provide feedback on specific aspects of your game and player experience.
- **Social media listening:** Monitor social media platforms to gather insights into how players are engaging with and talking about your game.

**[Customer Segmentation solution accelerator](https://www.databricks.com/solutions/accelerators/customer-segmentation)**

**Tips / Best Practices**

- **Define your segmentation goals:** Determine what you want to learn about your players and why. This will help you focus your analysis and ensure that your segments are meaningful and actionable.
- **Use meaningful criteria:** Choose criteria that are relevant to your goals and that differentiate players in meaningful ways. This could include demographic information, in-game behavior, spending habits, or a combination of factors.
- **Analyze player data:** Use data from your players to inform your segmentation strategy. This could include data on in-game behavior, spending habits, or demographic information.
- **Use multiple methods:** We recommend using a combination of methods, such as clustering, to create segments that are statistically meaningful and actionable for your game.
- **Validate your segments:** Test your segments to ensure that they accurately reflect the differences you observed in your player data. This could involve comparing the segments to each other, or validating the segments against external data sources.
- **Consider ethical and privacy concerns:** Ensure that your segmentation strategy is ethical and complies with privacy laws and regulations. This could involve anonymizing your player data, obtaining consent from players, or other measures to protect player privacy.
- **Monitor and refine your segments:** Regularly review your segments to ensure that they remain relevant and meaningful. Refine your segments as necessary to reflect changes in your player data or your goals.

### Getting Started with Player Lifetime Value

Assuming you’ve followed the steps for collecting, storing, and preparing your player data for analysis, the quick and dirty way of assessing overall player lifetime value (LTV) is to divide the total revenue by the total number of registered players. Note that LTV is a critical input for return on investment, which compares player lifetime spend with the amount spent on player acquisition. Ideally, you want lifetime spend to be equal to or more than the cost of acquisition.

As long as your game and its community are currently active, any player lifetime value calculations should be considered models, not exact numbers. This is because many of the players you’re considering are likely actively registered and actively playing, so the exact player LTV number is a moving target.

[Figure: LTV modeling approaches along an accuracy axis: advanced predictive models, simple predictive models, historical average and benchmarks]

But these models are not entirely accurate, since they don’t take into account the players who are registered but have yet to generate any revenue. Instead, a data-driven approach pivoted around player segmentation or cohorts will generally yield more actionable insight, far more than calculating a single LTV for the entire player base.

You can define your game’s cohorts in multiple ways. Perhaps the most obvious in terms of calculating LTV is going by daily active cohorts, or users who joined your game on the same day. You could also organize cohorts by users who joined your game through a certain ad campaign or promotional effort, by country or geographic location, or by the type of device used.

The first calculation is relatively simple. We suggest using average revenue per user (ARPU), which is a game’s daily revenue divided by the number of active users, to help you calculate lifetime value. First, you’ll need to define what an active player is using retention values, which can be set to a week, multi-day, or multi-week period of time depending on how your game has performed to date. You can then look at the users who churn on a given day and average the number of days from each player’s first visit to the current date (or the specific date you’ve considered the end for said exercise). This average lifetime, measured in days, is your playerbase lifetime value (note: not Player Lifetime Value). To get your player LTV, divide daily revenue by the number of daily active users (your ARPU) and multiply the result by the playerbase lifetime value.

It’s important to note that the term player lifetime value is not entirely accurate, since most player lifetimes are not over (particularly for live service games). But for the purpose of modeling, we recommend keeping the amount of time that you consider a lifetime relatively short, allowing you to extrapolate. Keeping the time period shorter helps mitigate inaccuracies: the longer you stretch out what you consider a lifetime, the more likely you are to collect inactive users in your count.

**[Lifetime Value solution accelerator](https://www.databricks.com/solutions/accelerators/customer-lifetime-value)**

**Tips / Best Practices**

- **Use multiple data sources:** To get a complete picture of a player’s value, be sure to consider data from a variety of sources, including in-game purchases, ad revenue, and other monetization strategies.
- **Consider player retention:** Player retention is a key factor in LTV, so be sure to consider how long players are likely to play your game when calculating LTV.
- **Use accurate data:** Make sure you are using accurate data when calculating LTV. This might involve cleaning and processing your data, or using trusted sources such as in-game analytics tools.
- **Regularly review and update your LTV estimates:** Player LTV can change over time, so be sure to regularly review and update your estimates to ensure they are accurate.
- **Test and optimize:** Use experimentation methods such as A/B testing to see how different variables, such as in-game events or pricing strategies, affect LTV. Use the insights you gain to optimize your LTV calculations.
- **Be aware of outside factors:** Your calculations should consider the many outside factors that can affect your LTV, such as the virality of your game, any spikes or surges in visitors due to unexpected promotions (influencers or reviewers talking about your game), any significant changes to your game that users respond well to, and other organic lifts that are difficult to predict with existing data.

-----

### Getting Started with Social Media Monitoring

Social media monitoring has three primary components: collecting the data, processing the results, and taking action on the findings.
When it comes to collecting the data, whether you’re looking for tweets, YouTube comments, or Reddit posts, it can be very easy to get started: social media platforms such as Twitter, YouTube, and Reddit all provide their own detailed and comprehensive APIs, with documentation and code examples to help along the way.

Once the data has been collected, the next step is to process it and prepare it for analysis. Processing your data can range in complexity from a simple keyword filter to a more complicated approach such as filtering by location, removing emojis, and censoring and substituting words. With the data collected and processed, it can move to the final stage and be analyzed for downstream use and actionable insights by applying sentiment analysis or text mining.

If a game studio is looking to save time and have the above steps performed for them, it may be appealing to buy a pre-built tool. The primary benefits of buying an off-the-shelf solution are that it is often faster and easier to get started with, and that the development of the tool is handled by a third party who will have experience in building media monitoring solutions. On the other hand, building your own custom solution will provide more flexibility and control. Many pre-built media monitoring tools might not have the capabilities required to effectively process video, audio, and image data, and may not let you control the frequency at which data is processed, whether near real-time or batch. Additionally, pre-built solutions tend to take a generalist approach to NLP, whether it be keyword extraction, topic filtering, or sentiment analysis, which often leads to poor results and feedback, especially for an industry as unique as the gaming industry, where certain industry-specific slang or terminology is frequently used.
Overall, building your own media monitoring tool will provide greater control and flexibility, leading to a better-tailored return on investment, and luckily Databricks makes it even easier to get started. With the Databricks Lakehouse platform, all data engineering, data science, machine learning, and data analytics can be done in a single place without having to stitch multiple systems and tools together.

Data engineers can use Workflows and Jobs to call social media platform APIs on a scheduled basis and use Delta Live Tables to create declarative data pipelines for cleaning and processing the data that comes in. Data scientists can use tools such as ML-specific Databricks runtimes (DBRs), which come with many of the most popular and common libraries already installed; MLflow, which makes model development, tracking, and serving easy and efficient; and various other tools such as AutoML and Bamboolib. Data analysts are able to create real-time alerts, dashboards, and visualizations using Databricks SQL. Each of the three personas will be able to effectively collaborate with each other and integrate each piece of their work into the broader data architecture.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://databricks.com/company/contact) to us.

**Tips / Best Practices**

While social media monitoring can be easy to get started with, there are a few key points to keep in mind.

- Remember the Pareto principle (roughly 80% of impact comes from 20% of activity) and diminishing returns. While it’s important to monitor large platforms such as Reddit, Twitter, and YouTube, it might not be worthwhile to monitor smaller platforms (in terms of engagement) as the bulk of customer feedback will be on those major platforms.
- Monitor other sources of information.
  It is also useful to monitor mentions of key company personnel such as executives or public-facing employees.
- While follower count does matter on platforms such as Twitter, don’t ignore users with low follower counts. It only takes one or two retweets from other users for a complaint to become a large issue.
- On social media, customers can see through generic corporate responses to complaints, so it is important to get a clear understanding of the issue and provide a clear response.

### Getting Started with Player Feedback Analysis

The easiest place to start is gathering your data. With accounts set up on Steam, Epic, Apple, Google, Xbox, Sony, Nintendo (or whatever platform you’re using), identify the ID for your game(s), and pull the reviews corresponding to that game into Databricks through an API call. From here, you clean the data using some of the pre-processing available in Python to remove any emojis and non-ASCII characters. Once complete, run the data through a Spark NLP pipeline, which performs the basic natural language processing steps such as normalization, stemming, and lemmatization. We recommend running it through pre-trained models, such as the Word Embeddings and Named Entity Recognition models from John Snow Labs. This completes the pipeline and generates the aspects for the reviews provided by the community. This data is then loaded into a Delta table for further analysis, such as using a visual dashboard (built on SQL queries inside Databricks) to analyze and understand the aspects the community is talking about, which can then be shared back with the development team for analysis and action. This is a great exercise to run once per month.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.
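
As a rough illustration of the review-cleaning step described above, the sketch below strips emojis and other non-ASCII characters from raw review text, which is one reading of that step. It uses only the Python standard library rather than Spark NLP, and the `clean_review` helper and the sample reviews are hypothetical.

```python
import re
import unicodedata


def clean_review(text: str) -> str:
    """Return review text reduced to plain ASCII with tidy whitespace."""
    # Decompose accented letters (for example "é" becomes "e" plus a combining
    # accent) so the base letter survives the ASCII filter below.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything outside ASCII: emojis, combining accents, other symbols.
    ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
    # Collapse the gaps left behind by the removed characters.
    return re.sub(r"\s+", " ", ascii_only).strip()


# Hypothetical sample reviews, standing in for the result of an API call.
reviews = ["Great game!!! 🎮🔥 10/10", "Pokémon-style   battles 😍", "😡"]
cleaned = [clean_review(review) for review in reviews]
# Reviews that consisted only of emojis are now empty strings; drop them.
cleaned = [review for review in cleaned if review]
print(cleaned)
```

Dropping non-ASCII characters outright is a blunt choice: it also removes non-English text, so a studio with an international audience may prefer to remove only emoji ranges instead.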
**Tips / Best Practices**

- **Check for word groupings:** Make sure your word groupings are accurate to improve the analysis. For example, if your game is called Football Manager, and the shorthand is FM, make sure both of those are grouped appropriately.
- **Leverage domain knowledge:** Clean the reviews based on your domain knowledge. There are generic steps one could take, but they will not be as effective as the work of someone with domain knowledge and specific knowledge of your title.
- **Experiment with models:** Feel free to try multiple pre-trained models, and/or tweak the pre-trained models based on your understanding of the domain to improve the accuracy of your results.
- **Work one title at a time:** This process works best when pulling reviews for a single title, specifically one version of one title at a time.
- **Let the model do the heavy lifting, but use humans to double-check:** The sentiment corresponding to the aspects in the model will be labeled as Positive or Negative. In the case of a neutral review, the model will do its best to determine whether that is more positive or negative. A best practice is to spend time going back through the aspects early to determine model accuracy and make updates accordingly.

-----

### Getting Started with Toxicity Detection

Our recommendation for tackling the toxicity issue is to leverage cloud-agnostic and flexible tooling that can consume chat data from a variety of sources, such as chat logs, voice transcriptions, or sources like Discord and Reddit forums. Whether the data is in log form from game servers or events from a messaging system, Databricks can provide quick and easy ways to ingest the data. As the simplified architecture in the diagram above shows, getting chat data for inferencing and model development can be simple no matter the source. While we leveraged a pre-built model from John Snow Labs to accelerate development, you can bring the ML framework of your choice to the platform.
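
To illustrate the "no matter the source" idea above, the sketch below normalizes chat from two differently shaped sources into one record type that a toxicity model could then score. It is plain Python rather than a Databricks ingestion pipeline, and the log layout, the JSON field names, and the `ChatMessage` type are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatMessage:
    """Common shape for chat text, whatever system produced it (hypothetical)."""
    source: str
    player_id: str
    text: str


def from_server_log(line: str) -> ChatMessage:
    # Assumed log layout: "<timestamp>|<player_id>|<message>".
    # maxsplit=2 keeps any "|" characters that appear inside the message.
    _timestamp, player_id, text = line.split("|", 2)
    return ChatMessage("server_log", player_id, text.strip())


def from_message_event(payload: str) -> ChatMessage:
    # Assumed event layout: a JSON object with "user" and "body" fields.
    event = json.loads(payload)
    return ChatMessage("message_event", event["user"], event["body"].strip())


# Two sources, one uniform batch ready to hand to whichever model you choose.
batch = [
    from_server_log("2023-01-05T10:00:00Z|p42|gg wp | rematch?"),
    from_message_event('{"user": "p77", "body": " nice shot "}'),
]
for message in batch:
    print(message)
```

Keeping the `source` field on each record makes it possible to compare toxicity rates per channel later without re-parsing the raw data.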
**[Gaming Toxicity solution accelerator](https://notebooks.databricks.com/notebooks/CME/Toxicity_Detection_in_Gaming/index.html)**

**Tips / Best Practices - things to consider**

- **Define what toxic and disruptive behavior looks like within your community:** Clearly define what you consider to be toxic behavior, as this will determine how you measure and detect it. This might include things like hateful language, harassment, or cheating.
- **Collect relevant data:** Make sure you are collecting the right data to help you detect toxicity. This might include data on in-game chat, player reports, and other sources.
- **Use machine learning:** Use machine learning algorithms to analyze your data and identify patterns of toxic behavior. This will allow you to more accurately detect toxicity and prioritize cases for review.
- **Test and optimize:** Regularly review and test your toxicity detection systems to ensure they are accurate and effective. Use experimentation methods such as A/B testing to see how different strategies impact toxicity rates.
- **Be transparent:** Make sure you are transparent with your players about how you are detecting toxicity, and give them the option to opt out if they wish.
- **Take action:** When toxic behavior is detected, take appropriate action to address it. This might involve banning players, issuing warnings, or taking other disciplinary measures. The health and wellness of your community depends on it.

-----

### Getting Started with Multi-Touch Attribution and Media Mix Modeling

To get started with multi-touch attribution, you need to first select an attribution model. There are a variety of different attribution models to choose from, each with its own strengths and limitations.

1. **Last-click model:** This model attributes all credit to the last touchpoint that the customer interacted with before making a purchase or taking a desired action.
2. **First-click model:** This model attributes all credit to the first touchpoint that the customer interacted with.
3. **Linear model:** This model attributes equal credit to each touchpoint that the customer interacted with.
4. **Time decay model:** This model attributes more credit to touchpoints that are closer in time to the purchase or desired action.
5. **Position-based model:** This model attributes a portion of the credit to the first and last touchpoints, and the remainder is distributed evenly among the other touchpoints.
6. **Custom model:** Some businesses create their own attribution model based on specific business needs or goals.

Each attribution model has its own strengths and limitations, and the right model for a particular video game will depend on a variety of factors, including the goals of your title, the customer journey, and the types of marketing channels being used. It is important to carefully consider the pros and cons of each model and choose the one that best aligns with the needs of your game.

Next, you’re going to want to set up tracking. In order to attribute credit to different touchpoints, you’ll need to set up tracking to capture data on customer interactions. This might involve integrating tracking code into the game, or using a third-party tracking tool. With tracking set up, you’ll start collecting data on player interactions and be able to use that information to calculate attribution credit according to your chosen model (above).

We highly recommend you regularly review and test your attribution efforts to ensure they are accurate and effective. Use experimentation methods such as A/B testing to see how different strategies impact conversion rates.

**[Multi-Touch Attribution solution accelerator](https://notebooks.databricks.com/notebooks/CME/Multi-touch_Attribution/index.html#Multi-touch_Attribution_1.html)**

**Tips / Best Practices - things to consider**

- **Define clear goals:** It sounds simple, but by clearly defining the goals of your acquisition campaign and what success looks like, you will be able to guide your decision-making and ensure that you are measuring the right metrics, such as cost per install, return on ad spend, conversion rate, lifetime value, retention rate, and more.
- **Use a data-driven approach:** Use data to inform your decision-making. Collect data on all touchpoints in the player journey, including ad impressions, clicks, installs, and in-game actions.
- **Choose the right attribution model:** Select the attribution model that accurately reflects the player journey for your specific genre of game. This can be a complex process. A couple of things to keep in mind:
  - Consider the touchpoints that are most important for your player journey, such as first ad impression, first click, or first in-game action.
  - Consider the business goals you’re trying to achieve. For example, if you are focused on maximizing return on investment, a last-click attribution model may be most appropriate. On the other hand, if you are looking to understand the impact of each touchpoint, a multi-touch attribution model may be more appropriate.
  - Consider the data you have available, including ad impressions, clicks, installs, and in-game actions.
- **Continuously monitor and optimize:** Continuously monitor and optimize your acquisition campaigns based on the data. Test different approaches, make adjustments as needed, and use A/B testing to determine what works best.

-----

### Getting Started with Player Recommendations

Recommendations is an advanced use case.
We don’t recommend (hehe) that you start here. Instead, we’re assuming that you’ve done the work to set up your game analytics (collecting, cleaning, and preparing data for analysis) and that you’ve done basic segmentation to place your players in cohorts based on their interests and behaviors.

Recommendations can come in many forms for video games. For this context, we’re going to focus on wide-and-deep learning for recommender systems, which has the ability to both memorize and generalize recommendations based on player behavior and interactions. First [introduced by Google](https://arxiv.org/abs/1606.07792) for use in its Google Play app store, the wide-and-deep machine learning (ML) model has become popular in a variety of online scenarios for its ability to personalize user engagements, even in ‘cold start problem’ scenarios with sparse data inputs.

The goal with wide-and-deep recommenders is to provide an intimate level of player understanding. This model uses explicit and implicit feedback to expand the consideration set for players. Wide-and-deep recommenders go beyond the simple weighted averaging of player feedback found in some collaborative filters to balance what is understood about the individual with what is known about similar gamers. If done properly, the recommendations make the gamer feel understood (by your title), and this should translate into greater value for both the player and you as the business.

**Understanding the model design**

To understand the concept of wide-and-deep recommendations, it’s best to think of it as two separate, but collaborating, engines. The wide model, often referred to in the literature as the linear model, memorizes users and their past choices. Its inputs may consist simply of a user identifier and a product identifier, though other attributes relevant to the pattern (such as time of day) may also be incorporated.

The deep portion of the model, so named as it is a deep neural network, examines the generalizable attributes of a user and their choices. From these, the model learns the broader characteristics that tend to favor user selections.

Together, the wide and deep submodels are trained on historical product selections by individual users to predict future selections. The end result is a single model capable of calculating the probability with which a user will purchase a given item, given both memorized past choices and generalizations about a user’s preferences. These probabilities form the basis for user-specific rankings, which can be used for making recommendations.

**Building the model**

The intuitive logic of the wide-and-deep recommender belies the complexity of its actual construction. Inputs must be defined separately for each of the wide and deep portions of the model, and each must be trained in a coordinated manner to arrive at a single output, but tuned using optimizers specific to the nature of each submodel. Thankfully, the [TensorFlow DNNLinearCombinedClassifier estimator](https://www.tensorflow.org/api_docs/python/tf/estimator/DNNLinearCombinedClassifier) provides a pre-packaged architecture, greatly simplifying the assembly of an overall model.

[Figure: Wide & Deep Model. Inputs: User A (user identity, user attributes) and Product B (product identity, product attributes). Components: Wide Sub-Model and Deep Sub-Model. Output: Probability of User A + Product B]

**Training**

The challenge for most teams is then training the recommender on the large number of user-product combinations found within their data.
Using [Petastorm](https://petastorm.readthedocs.io/en/latest/), an open-source library for serving large datasets assembled in Apache Spark™ to TensorFlow (and other ML libraries), one can cache the data on high-speed, temporary storage and then read that data in manageable increments to the model during training. In doing so, we limit the memory overhead associated with the training exercise while preserving performance.

**Tuning**

Tuning the model becomes the next challenge. Various model parameters control its ability to arrive at an optimal solution. The most efficient way to work through the potential parameter combinations is simply to iterate through some number of training cycles, comparing the models’ evaluation metrics with each run to identify the ideal parameter combinations. By running these trials in parallel, we can spread this work across many compute nodes, allowing the optimizations to be performed in a timely manner.

**Deploying**

Finally, we need to deploy the model for integration with various retail applications. Leveraging [MLflow](https://www.mlflow.org/) allows us to both persist our model and package it for deployment across a wide variety of microservices layers, including Azure Machine Learning, AWS SageMaker, Kubernetes and Databricks Model Serving.

While this seems like a large number of technologies to bring together just to build a single model, Databricks integrates all of these technologies within a single platform, providing data scientists, data engineers and [MLOps](https://www.databricks.com/glossary/mlops) engineers a unified experience. The pre-integration of these technologies means various personas can work faster and leverage additional capabilities, such as the [automated tracking](https://docs.databricks.com/machine-learning/automl-hyperparam-tuning/index.html#automated-mlflow-tracking) of models, to enhance the transparency of the organization’s model building efforts.
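
To make the model design described in this section more concrete, here is a minimal, library-free sketch of how a wide part and a deep part can be combined into a single probability. It is not the TensorFlow estimator mentioned above: the weights are hand-picked rather than trained, and every identifier in it (`WIDE_WEIGHTS`, `purchase_probability`, the player and item names, and the three-value feature vector) is hypothetical and exists only for illustration.

```python
import math

def sigmoid(z):
    # Squash a real-valued score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    # One fully connected layer; each row of weights feeds one output unit.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

# Wide part (memorization): one weight per (user, product) pair seen before.
# Hypothetical, hand-picked values; a real model would learn these.
WIDE_WEIGHTS = {
    ('player_a', 'sword_skin'): 1.2,
    ('player_a', 'xp_boost'): -0.4,
}

# Deep part (generalization): a tiny network over generic attributes.
# Assumed feature order: [engagement, is_cosmetic, normalized_price].
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
HIDDEN_B = [0.0, 0.1]
OUT_W = [[0.7, -0.5]]
OUT_B = [0.0]
JOINT_BIAS = -0.2

def wide_logit(user_id, product_id):
    # Unseen pairs contribute nothing, so the deep part carries them.
    return WIDE_WEIGHTS.get((user_id, product_id), 0.0)

def deep_logit(features):
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    return dense(hidden, OUT_W, OUT_B)[0]

def purchase_probability(user_id, product_id, features):
    # The two scores are summed before a single sigmoid is applied.
    score = wide_logit(user_id, product_id) + deep_logit(features) + JOINT_BIAS
    return sigmoid(score)

def rank_products(user_id, candidates):
    # Highest predicted probability first.
    scored = [(pid, purchase_probability(user_id, pid, feats))
              for pid, feats in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

candidates = {
    'sword_skin': [0.9, 1.0, 0.3],
    'xp_boost': [0.9, 0.0, 0.5],
    'map_pack': [0.9, 0.0, 0.8],
}
ranking = rank_products('player_a', candidates)
for product_id, probability in ranking:
    print(product_id, round(probability, 3))
```

Note that `map_pack` has no memorized weight for this player, so its score comes entirely from the deep part. This is the sense in which the combined model can still rank items that the wide part has never seen for a given player.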
To see an end-to-end example of how a wide and deep recommender model may be built on Databricks, please check out the following notebooks:

[Get the notebook](https://d1r5llqwmkrl74.cloudfront.net/notebooks/RCG/Wide_and_Deep/index.html#Wide_and_Deep_1.html)

**[Recommendation Engines solution accelerator](https://www.databricks.com/solutions/accelerators/recommendation-engines)**

**Tips / Best Practices - things to consider**

- **Use data to inform recommendations:** Use data from your analytics, player feedback, and other sources to understand what players like and dislike. This will help you create recommendations that are more likely to be relevant and engaging for individual players.
- **Segment your players:** Consider segmenting your players based on characteristics such as playstyle, spending habits, and demographic information. This will allow you to create more targeted recommendations for different groups of players.
- **Consider the player’s current context:** When creating recommendations, consider the player’s current context, such as what they are doing in the game and what content they have already consumed. This will help you create recommendations that are more likely to be relevant and timely.
- **Test and optimize your recommendations:** Use experimentation methods such as A/B testing to see how different recommendations perform with different player segments. Use the insights you gain to optimize your recommendations.
- **Be transparent:** Make sure you are transparent with players about how you are creating recommendations and give them the option to opt out if they wish.
- **Use recommendations to improve the player experience:** Use personalized recommendations to improve the player experience and increase engagement and satisfaction.
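
The tip above about testing and optimizing recommendations mentions A/B testing. As a rough illustration, the sketch below compares the conversion rate of two recommendation variants with a two-proportion z-test, which is one common way (not the only one) to judge whether an observed difference is likely to be more than noise. The function name and the sample counts are hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    # conv_* = players who accepted a recommendation, n_* = players exposed.
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both variants perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # No variation at all (for example, zero conversions in both arms).
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: variant B converts 7% of players versus 5% for A.
z, p_value = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=168, n_b=2400)
print(f'z = {z:.2f}, p = {p_value:.4f}')
```

A small p-value suggests the difference between variants is unlikely to be chance alone. In practice you would also fix the sample size in advance and run the test per player segment, as the tip suggests.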
### Getting Started with Next Best Offer/Action

Since NBO/NBA is a specific use case of personalization, how a team might get started implementing this will look very similar to how they would with broader personalization activities. Begin by ensuring you are appropriately collecting player data (behavior, preferences, in-game purchases, etc.) and storing it in your cloud data lake using a service such as Delta Lake from Databricks. From here, you’ll use Databricks to clean, transform, and prepare the data for analysis. This may include aggregating data from multiple sources, removing duplicates and outliers, and transforming the data into a format suitable for analysis. As you analyze the player data, seek to identify patterns and trends in player behavior and preferences that will give you signal on which actions are more likely to be successful. From here, you can build a recommendation model based on the player data analysis, and incorporate information on in-game items and player preferences to make personalized recommendations.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

**Tips / Best Practices**

- **Define your goals:** Like every use case, starting with clearly defined goals helps to ensure your implementation of NBO and NBA will be as effective and efficient as possible. Your goals will also help you determine what data to collect and how it will be used.
- **Collect relevant data:** Based on your goals, make sure you are collecting the right data to inform your NBO and NBA recommendations. This might include data on player behavior, engagement, and spending habits.
- **Leverage machine learning to scale your recommendations:** Use machine learning algorithms to analyze your data and make personalized recommendations to your players. This will allow you to identify trends and patterns that might not be immediately apparent.
- **Test and optimize:** THIS IS CRITICAL. Use experimentation methods such as A/B testing to see how different recommendations perform with different player segments. Past performance is not a perfect indicator of future success. Consistent testing allows you to tune your NBO and NBA recommendations so they evolve with your playerbase.
- **Consider the player’s context:** When making recommendations, consider the player’s current context, such as what they are doing in the game and what content they have already consumed. This will help you create recommendations that are more likely to be relevant and timely.
- **Be transparent:** Make sure you are transparent with your players about how you are using their data to make recommendations, and give them the option to opt out if they wish.
- **Collaborate with your team:** Share your NBO and NBA efforts with your team and encourage them to use the data to inform their work.

### Getting Started with Churn Prediction & Prevention

The exciting part of this analysis is that not only does it help to quantify the risk of customer churn, but it also paints a quantitative picture of exactly which factors explain that risk. It’s important that we not draw too rash a conclusion with regard to the causal linkage between a particular attribute and its associated hazard, but it’s an excellent starting point for identifying where an organization needs to focus its attention for further investigation.

The hard part in this analysis is not the analytic techniques. The Kaplan-Meier curves and Cox Proportional Hazard models used to perform the analysis above are well established and widely supported across analytics platforms. The principal challenge is organizing the input data.

The vast majority of subscription services are fairly new as businesses. As such, the data required to examine customer attrition may be scattered across multiple systems, making an integrated analysis more difficult. Data lakes are a starting point for solving this problem, but the complex transformations required to cleanse and restructure data that has evolved as the business itself has (often rapidly) evolved require considerable processing power. This is certainly the case with the KKBox information assets and is a point noted by the data provider in their public challenge.

The key to successfully completing this work is the establishment of transparent, maintainable data processing pipelines executed on an elastically scalable (and therefore cost-efficient) infrastructure, a key driver behind the [Delta Lake pattern](https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html). While most organizations may not be overly cost-conscious in their initial approach, it’s important to remember the point made above that churn is a chronic condition to be managed. As such, this is an analysis that should be periodically revisited to ensure acquisition and retention practices are aligned.

To support this, we are making the code behind our analysis available for download and review. If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

**[Churn Prediction solution accelerator](https://www.databricks.com/solutions/accelerators/survivorship-and-churn)**

-----

**Tips / Best Practices**

- **Define churn:** Clearly define what you consider to be player churn, as this will determine how you measure and predict it. For example, you might consider churn to be when a player stops playing your game for a certain number of days, or when they uninstall it.
- **Collect relevant data:** Make sure you are collecting the right data to help you predict and prevent churn.
This might include data on player behavior, engagement, and spending habits.
- **Use machine learning:** Use machine learning algorithms to analyze your data and predict which players are at risk of churning. This will allow you to identify trends and patterns that might not be immediately apparent.
- **Test and optimize:** Use experimentation methods such as A/B testing to see how different strategies impact churn rates. Use the insights you gain to optimize your churn prevention efforts.
- **Focus on retention:** Implement retention strategies that are tailored to the needs and preferences of your players. This might involve providing personalized content, addressing pain points, or offering incentives to continue playing.
- **Be transparent:** Make sure you are transparent with your players about how you are using their data to predict and prevent churn, and give them the option to opt out if they wish.
- **Collaborate with your team:** Share your churn prediction and prevention efforts with your team and encourage them to use the data to inform their work.

### Getting Started with Real-time Ad Targeting

Typically, implementing a real-time ad targeting strategy begins outside of your game (in services such as Google Ads or Unity Advertising), where your game becomes the delivery point for the advertisement. Here, you will need to integrate with ad networks that provide real-time ad targeting capabilities. That will allow you to access a range of available ad assets and dynamically select and display the most relevant ads to players. Both Google AdMob and Unity Ads are great for banner ads, native ads, and rewarded video ads. Your role is to ensure that the data you’re collecting is fed back into the advertising platform to better serve targeted ads to your playerbase.
To use a service like Databricks to manage the data needed to provide real-time ad targeting in your application, you can follow the steps below:

`1.` **Collect and store player data:** Collect data on player behavior, preferences, and demographics, and store it in a data lake using Databricks. Popular analytics tools such as Google Analytics or Mixpanel can be integrated into the game to collect data on player behavior. Much as they do when tracking website traffic, these tools can track in-game events and provide insights on player behavior and demographics, and they give you access to detailed reports and dashboards. Another option is to build in-house tracking systems to collect data on player behavior: logging events (e.g., in-game purchases or player actions) and activities such as “at which level does a player quit playing” and storing this in a database for analysis. The downside of building in-house tracking systems is that you will need to host and maintain your own logging servers.

`2.` **Prepare the data:** Use Databricks to clean, transform, and prepare the player data for analysis. This may include aggregating data from multiple sources, removing duplicates and outliers, and transforming the data into a format suitable for analysis.

`3.` **Analyze the data:** Use Databricks’ built-in machine learning and data analytics capabilities to analyze the player data and identify patterns and trends.

`4.` **Create audience segments:** Based on the analysis, use Databricks to create audience segments based on common characteristics such as interests, behaviors, and preferences.

`5.` **Integrate with the ad server:** When an ad opportunity presents itself within the game, a call is made to the ad server. This call includes information about the player, such as the audience segment that they belong to. The ad server then uses this information to decide what ad to deliver to the player.
`6.` **Monitor and optimize:** Use Databricks to monitor the performance of the ad targeting and make optimizations as needed, such as adjusting the audience segments or adjusting the targeting algorithms.

By using a service like Databricks to manage the data needed for real-time ad targeting, game developers can effectively leverage their player data to create more personalized and engaging experiences, increase revenue, and reduce churn.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

**Tips / Best Practices**

- **Focus on player data:** Make player data the center of your targeting strategy by collecting and storing comprehensive information on player behavior, preferences, and demographics. Here, it’s critical to ensure the game code data trackers are properly implemented in order to collect this data (see Game Analytics section for detail).
- **Segment your audience:** Create audience segments based on common characteristics such as interests, behaviors, and preferences, and use these segments to deliver targeted ads.
- **Test and iterate:** Continuously test and iterate your targeting strategy to refine your audience segments and improve targeting accuracy.
- **Balance relevance and privacy:** Balance the need for relevant, personalized ads with players’ privacy by only collecting and using data that is necessary for targeting and obtaining player consent.
- **Monitor performance:** Regularly monitor the performance of your targeting strategy to ensure that it is delivering the desired results and make optimizations as needed.
- **Partner with the right ad platform:** Choose an ad platform that is well-suited to your needs and aligns with your goals, and work closely with them to ensure that your targeting strategy is delivering the best results.
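
The segment and ad server steps in the numbered list above can be sketched in a few lines. This is only an illustration under stated assumptions: the segment rules, thresholds, field names (`total_spend_usd`, `sessions_last_7d`, `ad_personalization_consent`) and the shape of the request payload are all hypothetical, and a real ad network defines its own request format.

```python
import json

def assign_segment(player):
    # Hypothetical rule-based segments; in practice these would come from
    # the analysis step (for example, clustering on player behavior).
    if player['total_spend_usd'] >= 50:
        return 'high_spender'
    if player['sessions_last_7d'] >= 10:
        return 'engaged_non_spender'
    return 'casual'

def build_ad_request(player, placement):
    # Payload the game client might send when an ad opportunity appears.
    # No player identifier is included, only the coarse audience segment.
    request = {'placement': placement, 'audience_segment': None}
    # Only attach the segment when the player consented to personalization.
    if player.get('ad_personalization_consent', False):
        request['audience_segment'] = assign_segment(player)
    return request

player = {
    'total_spend_usd': 12.0,
    'sessions_last_7d': 14,
    'ad_personalization_consent': True,
}
ad_request = build_ad_request(player, placement='rewarded_video')
print(json.dumps(ad_request, sort_keys=True))
```

Sending a coarse segment instead of raw behavioral data is one way to act on the relevance and privacy tip above, since the ad server only learns which group the player belongs to.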
# Operational use cases

### Anomaly Detection

The first step is to begin collecting data, such as game server and client logs, from your project. Then ingest this data into Databricks Delta so that a continuous anomaly detection model can run against it. Focus on the key pieces of information you want to monitor. For live service games, for example, this is going to be infrastructure and network-related metrics such as ping and server health (# of clients connected, server uptime, server usage, CPU/RAM, # of sessions, time of sessions).

Once the model is ingesting data and has been tuned for the metrics identified above, you would build out alerts or notifications that fire when these metrics hit a threshold that you define as needing attention. From here, you can build out automated systems to mitigate those effects, such as migrating players to a different server, canceling matches, scaling infrastructure, or creating tickets for admins to review.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

**Tips / Best Practices**

- **Define the problem and objectives clearly:** Before implementing an anomaly detection solution, it is important to define the problem you are trying to solve and your specific objectives. This will help ensure that you have the right data sources and use the appropriate algorithms to achieve your goals.
- **Choose the right data sources:** To effectively detect anomalies, you need to have the right data sources. Consider data from player behavior, system performance, and network traffic, as well as any other data sources that are relevant to your problem and objectives.
- **Clean and preprocess the data:** To ensure that the data you use for anomaly detection is accurate and meaningful, it is important to clean and preprocess the data.
This includes removing any irrelevant or invalid data, handling missing values, and normalizing the data if necessary.
- **Choose the right algorithms:** There are many algorithms that can be used for anomaly detection, including statistical methods, machine learning algorithms, and rule-based systems. Choose the algorithms that are best suited to your data and problem, and that provide the right level of accuracy, speed, and scalability.
- **Validate the results:** Before deploying the anomaly detection solution in production, it is important to validate the results by testing the solution on a small subset of data and comparing the results to expected outcomes.
- **Monitor and update the solution:** Once the anomaly detection solution is deployed, it is important to monitor its performance and accuracy, and update the solution as needed. This may include retraining the algorithms, adding or removing data sources, and updating the parameters and thresholds used by the algorithms.

Additionally, there are some key gotchas to look out for when implementing an anomaly detection solution.

- **Avoid overfitting:** Overfitting occurs when the anomaly detection solution is too complex and learns the noise in the data rather than the underlying patterns. To avoid overfitting, it is important to choose algorithms that are appropriate for the size and complexity of the data, and to validate the results using a separate test dataset.
- **False positive and false negative results:** False positive and false negative results can occur when the anomaly detection solution is not properly calibrated, or when the solution is applied to data that is significantly different from the training data. To minimize the risk of false positive and false negative results, it is important to validate the results using a separate test dataset, and to fine-tune the parameters and thresholds used by the algorithms as needed.
- **Scalability:** Scalability can be a concern when implementing an anomaly detection solution, especially when dealing with large amounts of data. To ensure that the solution can scale to meet the demands of a growing player base, it is important to choose algorithms that are fast and scalable, and to deploy the solution using a scalable infrastructure.

### Getting Started with Build Pipeline

An operational goal for game projects is to make sure builds are generated and delivered quickly and efficiently to internal testers and external users.

A few of the key metrics and capabilities to consider when analyzing your build pipelines are:

- **Build time and speed:** This includes metrics such as the time it takes to create a build, number of builds, and compute spent.
- **Build size and storage:** This includes the size of the builds, the amount of storage, and network costs.
- **Bug tracking and resolution:** This includes metrics such as the number of bugs reported, the time it takes to resolve them, and the number of bugs that are resolved in each build.
- **Code quality and efficiency:** This includes metrics such as code complexity, code duplication, and the number of code lines written.
- **Collaboration and communication:** This includes metrics such as the number of code reviews, the number of team meetings, and the number of code commits.
- **Advanced capabilities:** These include predicting build failures in real time to reduce spend, and combining build data with Crash Analytics (see below) to have “commit to build” visibility for accelerated bug fixing.

Before you start implementing your build pipeline, it’s important to define your requirements. What are the key goals of your build pipeline?

Choosing the right CI/CD tools is critical to the success of your build pipeline. There are many different tools available, including Jenkins, Azure DevOps, Perforce, GitLab and more. When choosing a CI/CD tool, consider factors such as ease of use, scalability, and cost.
In addition, consider the specific needs of your game project, and choose a tool that can meet those needs.

The general recommendation is to look at automating your build process early. Once you’ve chosen your CI/CD tools, you can automate your build process by setting up a build server, configuring your CI/CD tool, and creating a script to build your game project. The build process should be automated as much as possible, and it should include steps to compile your code, run automated tests, and generate a build of your project.

Once you have automated your build process, often the next step is to implement CD (Continuous Delivery). This involves automating the delivery of your game builds to stakeholders, such as QA testers, beta testers, or end-users via publishing platforms. CD can help ensure that stakeholders have access to the latest version of your game as soon as possible, allowing them to provide feedback and help drive the development process forward.

Finally, it’s important to monitor and measure your build pipeline to ensure that it’s working as expected. This can involve using tools such as Databricks Dashboards to visualize the status of your pipeline, or using metrics such as build times, test results, and deployment success rates to evaluate the performance of your pipeline. By monitoring and measuring your build pipeline, you can identify areas for improvement and make changes as needed to ensure that your pipeline continues to meet your needs.

If you have any questions about how Databricks can integrate into your DevOps solution, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

**Tips / Best Practices**

- **Seek to automate early and often:** Automate as much of the build process as possible, from checking code into version control to generating builds and distributing them to stakeholders. This can help reduce errors and save time, allowing game teams to focus on more high value tasks.
- **Version control, version control, version control:** Use a version control system to manage the source code and other assets. This ensures that changes to the codebase are tracked and can be easily undone if needed.
- **Implement continuous integration and delivery:** Continuous integration (CI) involves automatically building and testing after code changes are checked into version control. CI helps ensure that new changes to the codebase do not break existing functionality. By automating the build process, CI helps to reduce errors and save time. CD, on the other hand, involves automatically delivering builds to stakeholders, such as QA testers, beta testers, or end-users, after they have passed the automated tests. By combining CI and CD, a video game project can ensure that builds are generated and delivered quickly and efficiently, without the need for manual intervention.
- **Build for scalability:** As your game project grows, you will need a build pipeline solution that is scalable and can handle the needs of your game team.
- **Integration with other tools:** Integrate the build pipeline solution with other tools and systems, such as issue tracking, testing, and deployment tools, to ensure a smooth and efficient workflow.

**Reference Architecture**

*Figure: reference architecture diagram showing game infrastructure, Databricks SQL, Power BI and AWS QuickSight.*

-----

### Getting Started with Crash Analytics

Building a pipeline that provides a holistic view for crash analytics means ingesting data from multiple sources at different velocities and joining it together. The number of data sources depends on your game project’s publishing platforms: some data may come from console-based providers such as Sony, Xbox, and Nintendo, or from PC platforms like Steam, Epic Games Marketplace, GOG and many others.

**High level steps**

- Determine what platforms your game is running on and how to interface to collect data.
- **Collect crash data:** Implement crash reporting tools in your game to collect data on crashes. The source data may be delivered in varying formats such as JSON or CSV.
- **Load crash data into Databricks:** Use Databricks’ data ingestion tools to load the crash data into your workspace. This could involve using Databricks’ built-in data source connectors, or programmatically ingesting files to load the data.
- **Transform and clean the crash data:** Use Databricks’ data processing and transformation tools to clean and prepare the crash data for analysis. This could involve using Databricks’ capabilities like DLT, or using SQL to perform custom transformations.
- **Visualize crash data:** Use Databricks’ dashboarding tools to create visualizations that help you understand the patterns and trends in your crash data. This could involve using Databricks’ built-in visualization tools, or integrating with external visualization tools like Tableau or Power BI.
- **Analyze crash data:** Use Databricks’ machine learning and statistical analysis tools to identify the root causes of crashes. This could involve using Spark MLlib or many of the popular tools to build machine learning models, or using SQL to perform custom analyses.
- **Monitor and refine your pipeline:** Regularly review your pipeline to ensure that it remains relevant and useful. Refine your pipeline as necessary to reflect changes in your crash data or your goals.

If you have any questions about how this solution can be deployed in your environment, please don’t hesitate to [reach out](https://www.databricks.com/company/contact) to us.

-----

**Tips / Best Practices**

- **Automated collection and aggregation of crash reports:** Collecting crash reports should be an automated process that is integrated into the output of the build pipeline for the game. The crash reports should be automatically aggregated and made available for analysis in near real-time.
- **Clear reporting and prioritization of issues:** The solution should provide clear reporting on the most common issues and allow game developers to prioritize fixing the most impactful problems first.
- **Integration with other analytics tools:** The crash analytics solution should integrate with other analytics tools, such as player behavior tracking, to provide a more complete picture of how crashes are impacting the player experience.
- **Flexibility and scalability:** As the game grows, the solution should be able to scale to accommodate an increasing number of players and crashes.
- **Data privacy and security:** It’s important to consider data privacy and security when implementing a crash analytics solution. This may involve implementing measures to anonymize crash reports, or taking steps to ensure that sensitive information is not included in the reports.

Additionally, there are some key gotchas to look out for when implementing a crash analytics solution.

- **Data privacy and security:** Ensure that crash reports do not contain sensitive information that could be used to identify individual players.
- **Scalability:** As the number of players and crashes increases, it may become difficult to manage and analyze the growing volume of data.
- **Integration with other tools:** Take care when integrating crash analytics with other tools and systems, especially if the tools use different data formats or data structures.
- **Prioritization of issues:** Determine which crashes are the most impactful and prioritize fixes accordingly. This can be a complex process, especially if there are a large number of different crash types and causes.

**Reference Architecture**

*Figure: reference architecture diagram showing Databricks SQL, Power BI and AWS QuickSight.*

-----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks_ultimate_gaming_data_guide_2023.pdf,2024-09-19T16:57:21Z
"### Executive Guide # Transform and Scale Your Organization With Data and AI #### A guide for CIOs, CDOs, and  data and AI executives ----- ## Contents **A U T H O R :** **Chris D’Agostino** Global Field CTO Databricks **E D I T O R S :** Manveer Sahota **C H A P T E R 1 :**  **Executive Summary** 3 **C H A P T E R 2 :**  **Define the Strategy** 6 **1.** Establish the goals and business value 8 **2.** Identify and prioritize use cases 19 **3.** Build successful data teams 22 **4.** Deploy a modern data stack 28 **5.** Improve data governance and compliance 36 **6.** Democratize access to quality data 41 **7.** Dramatically increase productivity of your workforce 47 **8.** Make informed build vs. buy decisions 52 **9.** Allocate, monitor and optimize costs 55 **10.** Move to production and scale adoption 58 Jessica Barbieri Toby Balfre **C H A P T E R 3 :** **Conclusion**  63 ----- **CHAPTER 1:** ## Executive Summary Data and AI leaders are faced with the challenge of future-proofing their architecture and platform investments. The Lakehouse implementation from Databricks combines the best features of EDWs and data lakes by enabling all their workloads using open source and open standards — avoiding the vendor lock-in, black box design and proprietary data formats of other cloud vendors. It’s not surprising that many industry experts say data is the most valuable resource in the modern economy — some even go so far as to describe it as the “new oil.” But at Databricks, we think of data as water. Its core compound never changes, and it can be transformed to whatever use case is desired, with the ability to get it back to its original form. Furthermore, just as water is essential to life, data is now essential to survival, competitive differentiation and innovation for every business. 
Clearly, the impact and importance of data are growing exponentially in both our professional and personal lives, while artificial intelligence (AI) is being infused in more of our daily digital interactions. The explosion in data availability over the last decade and the forecast for growth at a compounded [annual growth rate (CAGR) of 23%](https://www.google.com/url?q=https://www.idc.com/getdoc.jsp?containerId%3DprUS47560321&sa=D&source=docs&ust=1651117260200496&usg=AOvVaw3jdZ_6YHlXGQlUMJK8ULux) over 2020–2025 — combined with low-cost cloud storage, compute, open source software and machine learning (ML) environments — have caused a major shift in how organizations leverage data and AI to improve data governance and the user experience, plus satisfy more AI/ML-based use cases to drive future growth. Every organization is working to improve business outcomes while effectively managing a variety of risks — including economic, compliance, security and fraud, financial, reputational, operational and competitive risk. Your organization’s data and the systems that process it play a critical role in not only enabling your financial goals but also in minimizing these seven key business risks. Businesses have realized that their legacy information technology (IT) platforms are not able to scale and meet the increasing demands for better data analytics. As a result, they are looking to transform how their organizations use and process data. Successful data transformation initiatives for data, analytics and AI involve not only the design of hardware and software systems but also the alignment of people, processes and platforms. These initiatives always require a major financial investment and, therefore, need to yield a significant return on investment (ROI) — one that starts in months, not years. To guide these initiatives, many organizations are adding the role of chief data officer (CDO) to their C-suite. 
Despite this structural change and focused resources, [87% of organizations](https://databricks.com/discover/mit-infographic) still face many challenges to deliver on their data strategy — including how to deploy a modern data architecture, leverage data efficiently and securely, stay compliant with an ever-increasing set of regulations, hire the right talent, and identify and execute on AI opportunities. ----- To successfully lead data and AI transformation initiatives, organizations need to develop and execute a comprehensive strategy that enables them to easily deploy a modern data architecture, unlock the full potential of all their data, and future-proof their investments to provide the greatest ROI. Today, organizations have the option of moving away from closed, proprietary systems offered by a variety of cloud vendors and adopting a strategy that emphasizes open, nonproprietary solutions built using industry standards. At Databricks, we have helped over 7,000 companies achieve data, analytics and AI breakthroughs, and we’ve hired industry experts and thought leaders to help organizations better understand the steps involved in successful digital transformation initiatives. We are the first vendor to propose the data lakehouse architecture, which decouples data storage from compute while providing the best price/performance metrics for all your data workloads — including data warehousing. We have captured the lessons learned and summarized them in this series of Executive Guides — which are designed to serve as blueprints for CIOs, CDOs, CTOs and other data and technology executives to implement successful digital transformation initiatives for data, analytics and AI using a _modern data stack_ . 
Databricks is the first company to deliver a unified data platform that realizes the data lakehouse architecture and enables the data personas in your organization to run their data, analytics and AI workloads in a simple, open and collaborative environment, as shown in Figure 1. ###### Lakehouse Platform Data Warehousing Data Engineering Data Streaming Data Science and ML Unity Catalog Fine-grained governance for data and AI Delta Lake Data reliability and performance Cloud Data Lake All structured and unstructured data **Figure 1:** The Databricks Lakehouse Platform ----- **The lakehouse architecture benefits organizations in several ways:** **1.** It leverages low-cost cloud object stores to store ALL enterprise data. **2.** It provides the ability to run different data workloads efficiently and in a cost-effective manner. **3.** It uses open formats and standards that provide greater data portability — thus avoiding vendor lock-in. Our intention is to present key considerations and equip you with the knowledge to ask informed questions, make the most critical decisions early in the process, and develop the comprehensive strategy that most organizations lack. In addition, we have created an easy-to-follow Data and AI Maturity Model and provided a comprehensive professional services offering that organizations can leverage to measure their readiness, reskill their staff and track progress as they embark on their data transformation initiative. ----- **CHAPTER 2:** ## Define the Strategy The most critical step to enable data, analytics and AI at scale is to develop a comprehensive and executable strategy for how your organization will leverage people, processes and platforms to drive measurable business results against your corporate priorities. The strategy serves as a set of principles that every member of your organization can refer to when making decisions. 
The strategy should cover the roles and responsibilities of teams within your organization for how you capture, store, curate and process data to run your business — including the internal and external resources (labor and budget) needed to be successful. Establish the goals and business value Build successful data teams Ease data governance and compliance Simplify the user experience Allocate, monitor and optimize costs Identify and prioritize use cases Deploy a modern data architecture Democratize access to quality data Make informed build vs. buy decisions Move to production and drive adoption **Figure 2:** The 10 steps to a winning data and AI strategy ----- #### Here are 10 key considerations **1.** Secure buy-in and alignment on the overall business goals, timeline and appetite for the initiative. **2.** Identify, evaluate and prioritize use cases that actually provide a significant ROI. **3.** Create high-performing teams and empower your business analyst, data scientist, machine learning and data engineering talent. **4.** Future-proof your technology investment with a modern data architecture. **5.** Ensure you satisfy the European Union’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA) and other emerging data compliance and governance regulations. **6.** Implement needed policies, procedures and technology to guarantee data quality and enable secure data access and the sharing of all your data across the organization. **7.** Streamline the user experience (UX), improve collaboration and simplify the complexity of your tooling. **8.** Make informed build vs. buy decisions and ensure you are focusing your limited resources on the most important problems. **9.** Establish the initial budgets and allocate and optimize costs based on SLAs and usage patterns. **10.** Codify best practices for moving into production and how to measure progress, rate of adoption and user satisfaction. 
The strategy should clearly address these 10 topics and more, and should be captured in a living document, owned and governed by the CDO and made available for everyone in the organization to review and provide feedback on. The strategy will evolve based on the changing market/political conditions, evolving business, the technology landscape or a combination of any of these — but it should serve as the North Star for how you will navigate the many decisions and trade-offs that you will need to make over the course of the transformation. This guide takes a stepwise approach to addressing each of these 10 topics. ----- #### 1. Establish the goals and business value Most organizations on a data, analytics and AI journey establish a set of goals for the resulting investment. The goals generally fall into one of three categories: **1.** **Business outcomes** **2.** **People** **3.** **Technology** In terms of business outcomes, organizations need to adapt more quickly to market opportunities and emerging risks, and their legacy-based information systems make that difficult to achieve. As a result, business leaders see the digital transformation as an opportunity to build a new technology foundation from which to run their business and increase business value: one that is more agile, scalable, secure and easier to use, making the organization better positioned to adapt, innovate and thrive in the modern and dynamic economy. For organizations today, people are one of their most valuable assets — you cannot succeed in data, analytics and AI without them. The battle for top talent is as fierce as ever, and the way that people work impacts your ability to hire and retain the skills you need to succeed. 
It is important to make sure that employees work in a frictionless data environment, to the extent possible, so they feel productive each day and can do their best work. Finally, from a technology perspective, organizations have grown tired of the high costs associated with complex system architectures, vendor lock-in, and proprietary solutions that are slow to evolve. The industry trend is to move away from large capital expenditures (capex) to pay for network and server capacity in advance — and toward a “just-in-time” and “pay-for-what-you-use” operating expense (opex) approach. Your data analytics environment should support this trend as well — using open standards, low-cost storage and on-demand compute that efficiently spins up to perform data workloads and spins down once they are complete. Studies have shown that data scientists spend 80% of their time collecting and compiling data sets and only 20% of their time developing insights and algorithms. Organizations that are able to invert these numbers benefit in two ways — happier employees and improved time to market for use cases. These employers create more favorable working environments and lower the risk of burnout and the resulting regrettable attrition. ----- **Executive buy-in and support** Large organizations are difficult to change — but it’s not impossible. In order to be successful, you need to have unwavering buy-in and support from the highest levels of management — including the CEO and board of directors. With this support, you have the leverage you need to develop the strategy, decide on an architecture and implement a solution that can truly change the way your business is run. Without it, you have a very expensive science project that has little hope of succeeding. Why? Because the majority of people in your organization are busy doing their day jobs. The added work to support the initiative must be offset by a clear articulation of the resulting benefits — not only for the business but for the personnel within it. The transformation should result in a positive change to how people do their jobs on a daily basis. 
Transformation for data, analytics and AI needs to be a company-wide initiative that has the support from all the leaders. Even if the approach is to enable data and AI one business unit (BU) at a time, the plan needs to be something that is fully embraced in order to succeed. Ideally, the senior-most executives serve as vocal proponents. ----- **Evolve to an AI-first company — not just a data-first company** Data and AI transformations should truly transform the way organizations use data, not just evolve it. For decades, businesses have operated using traditional business processes and leveraged Structured Query Language (SQL) and business intelligence (BI) tools to query, manipulate and report on a subset of their data. There are five major challenges with this approach: **1.** A true self-assessment of where your organization is on the AI maturity curve. Most organizations will use pockets of success with analytics and AI to move higher up the maturity curve, but in reality the ability to replicate and scale the results is nearly impossible. #### Tech leaders are to the right of the Data Maturity Curve [Figure 3 stages, in order of increasing data and AI maturity: Clean Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics, Automated Decision-Making. From hindsight to foresight: What happened? What will happen? How should we respond? Automatically make the best decision.] **Figure 3:** The Data Maturity Curve ----- **2.** Data volumes and types have outgrown even the most modern approaches to SQL-based data processing. **3.** These large data volumes also make it nearly impossible for your workforce to continue to programmatically state, in a priority manner, how data insights can be achieved or how the business should react to changing data. **4.** Organizations need to reduce the costs of processing all this data. You simply cannot afford to hire the number of people needed to respond to every piece of data flowing into your environment. Machines scale, people do not. 
**5.** Advances in machine learning and AI have simplified the steps and reduced the expertise needed to gain game-changing insights. For these reasons, plus many others, the organizations that thrive in the 21st century will do so based on their ability to leverage all the data at their disposal. Traditional ways of processing and managing data will not work. Using ML and AI will empower your workforce to leverage data to make better decisions for managing risk, helping your organization succeed in the modern economy. **Go “all in” on the cloud** The COVID-19 pandemic has caused rapid adoption of cloud-based solutions for collaboration and videoconferencing — and organizations are now using this time to reevaluate their use of on-premises and cloud-based services. The cloud vendors provide many benefits to organizations, including Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS) solutions. These benefits, especially when combined with the use of open source software (OSS), increase the speed at which organizations can use the latest technologies while also reducing their capex in these budget-conscious times. For AWS, Microsoft, Google and other cloud providers, the game is about data acquisition. The more corporate data that resides in a specific cloud, the more sticky the customer is to the vendor. At the same time, multicloud support is both a selling point and an aspirational goal for many organizations. Companies are well aware of vendor lock-in and want to abstract their applications so they can be moved across clouds if there is a compelling business reason. 
----- Approaching your technology choices with a multicloud point of view gives the organization more sovereignty over the data — flexibility to run workloads anywhere, ease of integration when acquiring businesses that run on different cloud providers and simplified compliance with emerging regulations that may require companies to be multicloud — as part of a mandate to reduce risk to the consumer’s personal information. As a result, data portability and the ability to run workloads on different cloud providers are becoming increasingly important. **Modernize business applications** As organizations begin to accelerate the adoption of the cloud, they should avoid a simple “lift and shift” approach. The majority of on-premises applications are not built with the cloud in mind. They usually differ in the way that they handle security, resiliency, scalability and failover. Their application designs often store data in ways that make it difficult to adhere to regulatory requirements such as the GDPR and CCPA standards. Finally, the features and capabilities of the application may be monolithic in nature and, therefore, tightly coupled. In contrast, modern cloud applications are modular in design and use RESTful web services and APIs to easily provide access to an application’s functionality. Cloud-based architectures, commodity databases and software application development frameworks make it easier for developers to build scalable, secure end-to-end applications to run all your internal business processes. Building n-tiered applications (e.g., mobile and web-based applications with RESTful APIs and a backing database) has become straightforward with the latest tooling available to your application development teams. As a first step, organizations should inventory their business-critical applications, prioritize them based on business impact and modernize them in a consistent manner for cloud-based deployments. 
It is these applications that generate and store a significant amount of the data consumed within an organization. Using a consistent approach to cloud-based application design makes it easier to extract data when it is needed. “We are on an amazing journey. Being among the fastest-growing enterprise software cloud companies on record was unimaginable when we started Databricks. To get here, we’ve stayed focused on the three big bets we made when founding the company — cloud, open source and machine learning. Fast-forward seven years, thousands of data teams around the globe are working better together on Databricks.” **Ali Ghodsi** Co-founder and CEO Databricks ----- The next step is to identify which applications are viewed as the system of record (SOR) for a given data set. A good architectural principle is to only allow data sets to be stored inside their declared SOR and not allow other applications within your environment to store copies of the data — unless absolutely necessary for performance reasons. In this case, it is best to “cache” the data for use in the non-SOR application and sync the data from the actual SOR. Data from these SORs should be made available in three ways: **1.**  Expose a set of RESTful APIs for applications to invoke at any given time. **2.** Ensure that copies of the data land in the data lake. **3.** Change data capture (CDC) and other business events should be streamed in real time for immediate consumption by downstream applications. **Move toward real-time decisioning** The value of data should be viewed through two different lenses. The first is to view data in the aggregate, and the second is to view data as an individual event. This so-called “time value of data” is an important concept in the world of data, analytics and AI. To be effective, you need to be able to leverage both — on the same data platform. On the one hand, data in aggregate becomes more valuable over time — as you collect more of it. 
The aggregate data provides the ability to look back in time and see the complete history of an aspect of your business and to discover trends. Real-time data is most valuable the moment it is captured. In contrast, a newly created or arriving data event gives you the opportunity to make decisions — in the moment — that can positively affect your ability to reduce risk, better service your customers or lower your operating costs. The goal is to act immediately — with reliability and accuracy — upon the arrival of a new streaming event. This “time value of data” is shown in Figure 4 on the next page. ----- For example, real-time processing of clickstream data from your customer-facing mobile application can indicate when the customer is having trouble and may need to call into your call center. This insight gives you the opportunity to interject with a digital assistant or to pass on “just-in-time” information to your call center agents — improving the customer experience and lowering customer churn. Data, analytics and AI rely on the “time value of data” — a powerful concept that allows you to train your machine learning models using historical data and provides you with the ability to make real-time decisions as new events take place. For example, credit card fraud models can use deep historical data about a given customer’s buying patterns (location, day of week, time of day, retailer, average purchase amount, etc.) to build rich models that are then executed for each new credit card transaction. This real-time execution, combined with historical data, enables the best possible customer experience. #### Time Value of Data The Databricks Lakehouse Platform allows you to combine real-time streaming and batch processing using one architecture and a consistent set of programming APIs. 
**Figure 4:** Time Value of Data Value of an individual data record is very high once created but decreases over time Value of data records in aggregate increases over time Real-Time Decisioning Real-Time Analysis Trend Analysis Model Training ----- **Land** **_all_** **data in a data lake** In order to effectively drive data, analytics and AI adoption, relevant data needs to be made available to the user as quickly as possible. Data is often siloed in various business applications and is hard and/or slow to access. Likewise, organizations can no longer afford to wait for data to be loaded into data stores like a data warehouse, with predefined schemas that are designed to allow you to ask very specific questions about that data only. What do you do when you want to ask a different question? To further complicate matters, how do you handle new data sets that cannot easily be manipulated to fit into your predefined data stores? How do you find new insights as quickly as possible? The overall goal is to gain insights from the data as quickly as possible — which can happen at any step along the data pipeline — including raw, refined and curated data states. This phenomenon has led to the concept known as the four Vs of data — specifically, _volume_ , _velocity_ , _variety_ and _veracity_ . Data-, analytics- and AI-driven organizations need to be able to store and process all their data, regardless of size, shape or speed. In addition, data lineage and provenance are critical to knowing whether or not you can trust the data. **Change the way people work** When done correctly, organizations get value from data, analytics and AI in three ways — infrastructure savings, productivity gains and business-impacting use cases. Productivity gains require a true focus on minimizing the number of steps needed to produce results with data. 
This can be accomplished by: **1.** Making data more accessible and ensuring it can be trusted **2.** Minimizing the number of tools/systems needed to perform work **3.** Creating a flywheel effect by leveraging the work of others “We believe that the data lakehouse architecture presents an opportunity comparable to the one we saw during early years of the data warehouse market. The unique ability of the lakehouse to manage data in an open environment, blend all varieties of data from all parts of the enterprise and combine the data science focus of the data lake with the end-user analytics of the data warehouse will unlock incredible value for organizations.” **Bill Inmon** The father of the data warehouse ----- In large organizations, it’s understandable why application and data silos are prevalent. Each business unit is laser-focused on achieving their goals, and the use of information technology is viewed as an enabler. Systems and applications get built over time to satisfy specific needs within a line of business. As a result, it’s not surprising to learn that employees must jump through a large number of hoops to get access to the data they need to do their jobs. It should be as simple as getting your identity and PC. A primary goal of your data and AI transformation should be to focus on improving the user experience — in other words, improving how your entire organization interacts with data. Data must be easily discoverable with default access to users based on their role(s) — with a simple process to compliantly request access to data sets that are currently restricted. The tooling you make available should satisfy the principal needs of the various personas — data engineers, data scientists, machine learning engineers, business analysts, etc. 
Finally, the results of the work performed by a user or system upstream should be made available to users and systems downstream as “data assets” that can drive business value. Organizations that maximize the productivity of their workforce and enable employees to do their best work under optimal conditions are the ones that have the greatest chance to recruit and retain top talent. With Databricks, users can collaborate and perform their work more efficiently, regardless of their persona or role. The user experience is designed to support the workloads of data analysts, SQL developers, data engineers, data scientists and machine learning professionals. **Minimize time in the “seam”** As you begin your data transformation, it is important to know that the longer it takes, the more risk and cost you introduce into your organization. The stepwise approach to migrating your existing data ecosystem to a modern data stack will require you to operate in two environments simultaneously, the old and the new, for some period of time. This will have a series of momentary adverse effects on your business: - It will increase your operational costs substantially, as you will run two sets of infrastructure - It will increase your data governance risk, since you will have multiple copies of your data sitting in two very different ecosystems ----- - It increases the cyberattack footprint and vectors, as the platforms will likely have very different security models and cyber defenses - It will cause strain on your IT workforce due to the challenges of running multiple environments - It will require precise communications to ensure that your business partners know which environment to use and for what data workloads To mitigate some of the strain on the IT workforce, some organizations hire staff augmentation firms to “keep the lights on” for the legacy systems while the new systems are being implemented and rolled out. It’s important to remember this is a critical but short-lived experience for business continuity. 
**Shut down legacy platforms** In keeping with the goal of minimizing time in the seam, the project plan and timeline must include the steps and sequencing for shutting down legacy platforms. For example, many companies migrate their on-premises Apache Hadoop data lake to a cloud-based object store. The approach for shutting down the on-premises Hadoop system is generally as follows: **1.** Identify the stakeholders (business and IT) who own the jobs that run in the Hadoop environment. **2.** Declare that no changes can be made to the Hadoop environment — with the exception of emergency fixes or absolutely critical new business use cases. **3.** Inventory the data flow paths that feed data into the Hadoop environment. **4.** Identify the source systems that feed the data. **5.** Inventory the data that is currently stored in the Hadoop environment and understand the rate of change. **6.** Inventory the software processes (aka jobs) that handle the data and understand the output of the jobs. **7.** Determine the downstream consumers of the output from the jobs. ----- **8.** Prioritize the jobs to move to the modern data architecture. **9.** One by one, port the data input, job execution, job output and downstream consumers to the new architecture. **10.** Run legacy and new jobs in parallel for a set amount of time — in order to validate that things are working smoothly. **11.** Shut down the legacy data feeds, job execution and consumption. Wait. Look for smoke. **12.** Rinse and repeat — until all jobs are migrated. **13.** Shut down the Hadoop cluster. A similar model can also be applied to legacy on-premises enterprise data warehouses. You can follow the same process for other legacy systems in your environment. Some of these systems may be more complex and require the participation of more stakeholders to identify the fastest way to rationalize the data and processes. 
It is important, however, to make sure that the organization has the fortitude to hold the line when there is pressure to make changes to the legacy environments or extend their lifespan. Setting firm dates for when these legacy systems will be retired will serve as a forcing function for teams when they onboard to the new modern data architecture. Having the executive buy-in from page 9 plays a crucial role in seeing the shutdown of legacy platforms through. ----- #### 2. Identify and prioritize use cases An important next step in enabling data, analytics and AI to transform your business is to identify use cases that drive business value — while prioritizing the ones that are achievable under the current conditions (people, processes, data and infrastructure). There are typically hundreds of use cases within an organization that could benefit from better data and AI — but not all use cases are of equal importance or feasibility. Leaders require a systematic approach for identifying, evaluating, prioritizing and implementing use cases. **Establish the list of potential use cases** The first step is to ideate by bringing together various stakeholders from across the organization and understand the overall business drivers — especially those that are monitored by the CEO and board of directors. The second step is to identify use case opportunities in collaboration with business stakeholders, and understand the business processes and the data required to implement the use case. After steps one and two, the next step is to prioritize these cases by calculating the expected ROI. To avoid this becoming a pet project within the data/IT teams, it’s important to have a line of business champion at the executive level. There needs to be a balance between use cases that are complex and ones that are considered low-hanging fruit. 
For example, determining if a web visitor is an existing or net new customer requires a fairly straightforward algorithm that uses web browser cookie data and the correlation of the devices used by a given individual or household. However, developing a sophisticated credit card fraud model that takes into account geospatial, temporal, merchant and customer-purchasing behavior requires a broader set of data to perform the analytics. In terms of performance, thought should be given to the speed at which the use case must execute. In general, the greater the performance, the higher the cost. Therefore, it’s worth considering grouping use cases into three categories: **1.** Sub-second response **2.** Multi-second response **3.** Multi-minute response ----- Being pragmatic about the true service level agreement (SLA) will save time and money by avoiding over-engineering the design and infrastructure. **Thinking in terms of “data assets”** Machine learning algorithms require data — data that is readily available, of high quality and relevant — to perform the experiments, train the models, and then execute the model when it is deployed to production. The quality and veracity of the data used to perform these machine learning steps are key to deploying models into production that produce a tangible ROI. It is critical to understand what steps are needed in order to make the data available for a given use case. One point to consider is to prioritize use cases that make use of similar or adjacent data. If your engineering teams need to perform work to make data available for one use case, then look for opportunities to have the engineers do incremental work in order to surface data for adjacent use cases. Mature data and AI companies embrace the concept of “data assets” or “data products” to indicate the importance of adopting a design strategy and data asset roadmap for the organization. 
Taking this approach helps stakeholders avoid fit-for-purpose data sets that drive only a single use case — and raise the level of thinking to focus on data assets that can fuel many more business functions. The “data asset” roadmap helps data source owners understand the priority and complexity of the data assets that need to be created. Using this approach, data becomes part of the fabric of the company, evolves the culture, and influences the design of business applications and other systems within the organization. **Determine the highest impact/priority** As shown in Figure 5, organizations can evaluate a given use case using a scorecard approach that takes into account three factors: strategic importance, feasibility and tangible ROI. Strategic importance measures whether or not the use case helps meet immediate corporate goals and has the potential to drive growth or reduce risk. Feasibility measures whether or not the organization has the data and IT infrastructure, plus the data science talent readily available, to implement the use case. The ROI score indicates whether or not the organization can easily measure the impact to the P/L. 
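As a rough illustration of the scorecard approach, the sketch below averages relative scores (1, 5 or 10) within each of the three factors and then ranks use cases by an overall score. The equal weighting of factors, the function names and the example scores are assumptions made for illustration; the guide does not prescribe a formula.

```python
from dataclasses import dataclass
from statistics import mean

# Relative scale from the scoring guidelines: 1 = low, 5 = average, 10 = high.
ALLOWED_SCORES = {1, 5, 10}


@dataclass(frozen=True)
class UseCaseScores:
    name: str
    strategic_importance: tuple  # one score per criterion
    feasibility: tuple
    roi: tuple


def factor_score(scores):
    """Average the criterion scores for one factor after validating them."""
    if not scores:
        raise ValueError("each factor needs at least one score")
    for score in scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score must be one of {sorted(ALLOWED_SCORES)}, got {score}")
    return mean(scores)


def total_score(use_case):
    """Equal-weight average of the three factors (the weighting is an assumption)."""
    return mean([
        factor_score(use_case.strategic_importance),
        factor_score(use_case.feasibility),
        factor_score(use_case.roi),
    ])


def prioritize(use_cases):
    """Return use cases ordered from highest to lowest total score."""
    return sorted(use_cases, key=total_score, reverse=True)


# Hypothetical scores, for illustration only.
candidates = [
    UseCaseScores("credit card fraud model", (10, 10, 5), (5, 1, 5), (10,)),
    UseCaseScores("web visitor identification", (5, 5, 5), (10, 10, 10), (5,)),
]
for use_case in prioritize(candidates):
    print(f"{use_case.name}: {total_score(use_case):.1f}")
```

An organization that values one factor more than the others could replace the equal-weight average in `total_score` with a weighted one.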
Scoring guidelines (relative scoring). Each criterion is scored by either business stakeholders or technology stakeholders.

| Score | Criterion | 1 = Low score, do later | 5 = Average, nice to have | 10 = High, must have |
|---|---|---|---|---|
| **Strategic Importance Score:** How important is it to business success? | Business Alignment | Not required for any corporate goals | Not required for immediate corporate goals | Required for immediate corporate goals |
| | Business Driver | Does not drive growth/profitability (P&L) or competitiveness | Could drive some growth/profitability (P&L) | Significantly drives growth/profitability (P&L) and competitiveness |
| | IT Foundation | No BI/IT dependencies | BI/IT best practice | BI/IT foundational element |
| **Feasibility Score:** What is the current data and AI readiness? | Data Access and Trust (Adjusting Based on Availability) | Low awareness of available data (internal and external) or the problems it can solve | Some ingestion and exploration of large-scale data is possible | Large-scale data is available for exploration in the cloud |
| | Delivery (Data Engineers, Data Scientists, Data Analysts) | Limited in-house resources | Hiring plan for data science and engineering resources, few available in-house | Scaled data science, engineering, cloud and deployment organization |
| | Architecture | Current thinking on architecture resembles on-prem traditional data warehousing solution with batch processes rather than a data lakehouse approach | Architecture has been built and tested, some use cases are underway with multiple data sources now available in the cloud | The platform is utilized at scale across the business and is able to evolve to meet the demands of new business lines and services driven by data |
| **ROI Score:** How tangible and large is the ROI? | ROI Potential | Mostly productivity gains, “soft intangible benefits” | Some P&L impact, not easily tangible | Significant P&L impact, “hard measured benefits” |

**Figure 5:** Methodology for scoring use cases

**Ensure business and technology leadership alignment**

Prioritizing use cases requires striking a balance between offensive- and defensive-oriented use cases. It is important for executives to evaluate use cases in terms of opportunity growth (offensive) and risk reduction (defensive). For example, data governance and compliance use cases should take priority over offensive-oriented use cases when the cost of a data breach or noncompliance is higher than the acquisition of a new customer.

The Databricks Professional Services team can help customers identify revenue-generating and cost-saving opportunities for data and AI use cases that provide a significant ROI when adopting the lakehouse architecture.

#### 3. Build successful data teams

In order to succeed with data, analytics and AI, companies must find and organize the right talent into high-performing teams — ones that can execute against a well-defined strategy with the proper tools, processes, training and leadership. Digital transformations require executive-level support and are likely to fail without it — especially in large organizations.

However, it’s not enough to simply hire the best data and AI talent — the organization must want to succeed, at an enterprise level. In other words, they must also evolve their company culture into one that embraces data, data literacy, collaboration, experimentation and agile principles. We define these companies as “data native.”

**Chief information officers and chief data officers — two sides of the data coin**

Data native companies generally have a single, accountable executive who is responsible for areas such as data science, business analytics, data strategy, data governance and data management. The data management aspects include registering data sets in a data catalog, tracing data lineage as data sets flow through the environment, performing data quality checks and scanning for sensitive data in the clear.
Many organizations are rapidly adding the chief data officer (CDO) role to their executive ranks in order to oversee and manage these responsibilities. The CDO works closely with CIOs and other business stakeholders to establish the overall project plan, design and implementation — and to align project management, product management, business analysis, data engineering, data scientist and machine learning talent. The CDO and CIO will need to build a broad coalition of support from stakeholders who are incentivized to make the transformation a success and help drive organization-wide adoption. To do this, the stakeholders must understand the benefits of — and their role and responsibilities in — supporting the initiative. ----- There are two organizational constructs that are found in most successful data native companies. The first is the creation of an _AI/ML center of excellence_ (COE) that is designed to establish in-house expertise around ML and AI, and which is then used to educate the rest of the organization on best practices. The second is the formation of a _data and AI transformation steering committee_ that will oversee and guide decisions and priorities for the transformative data, analytics and AI initiatives, plus help remove obstacles. Furthermore, CDOs need to bring their CIOs along early in the journey. **Creating an AI/ML COE** Data science is a fast-evolving discipline with an ever-growing set of frameworks and algorithms to enable everything from statistical analysis to supervised learning to deep learning using neural networks. While it is difficult to establish specific and exact boundaries between the various disciplines, for the purposes of this document, we use “data science” as an umbrella term to cover machine learning and artificial intelligence. However, the general distinction is that data science is used to produce insights, machine learning is used to produce predictions, and artificial intelligence is used to produce actions. 
In contrast, while a data scientist is expected to forecast the future based on past patterns, data analysts extract meaningful insights from various data sources. A data scientist creates questions, while a data analyst finds answers to the existing set of questions.

Organizations wanting to build a data science competency should consider hiring talent into a centralized organization, or COE, for the purposes of establishing the tools, techniques and processes for performing data science. The COE works with the rest of the organization to educate and promote the appropriate use of data science for various use cases.

A common approach is to have the COE report into the CDO, but still have data scientists dotted line into the business units or department. Using this approach, you achieve two goals:

- The data scientists are closer to the business stakeholders, have a better understanding of the data within a business unit and can help identify use cases that drive value
- Having the data scientists reporting into the CDO provides a structure that encourages collaboration and consistency in how work is performed among the cohort and brings that to the entire organization

**Data and AI transformation steering committee**

The purpose of the steering committee is to provide governance and guidance to the data transformation initiative. The CDO and CIO should co-chair the committee along with one business executive who can be a vocal advocate and help drive adoption. The level of executive engagement is critical to the success of the initiative.

The steering committee should meet regularly with leaders from across the organization to hear status reports, resolve any conflicts and remove obstacles, if possible.
The leaders should represent a broad group of stakeholders, including:

- **Program/project management:** To report the status of progress for deploying the new data ecosystem and driving adoption through use cases
- **Business partners:** To provide insight and feedback on how easy or difficult it is to drive adoption of the platform
- **Engineering:** To report the status of the implementation and what technology trade-offs need to be made
- **Data science:** To report on the progress made by the COE on educating the organization about use cases for ML and to report the status of various implementations
- **InfoSec:** To review the overall security, including network, storage, application and data encryption and tokenization
- **Architecture:** To oversee that the implementation adheres to architectural standards and guardrails
- **Risk, compliance and legal:** To oversee the approach to data governance and ethics in ML
- **User experience:** To serve as the voice of the end users who will perform their jobs using the new data ecosystem
- **Communication:** To provide up-to-date communications to the organization about next steps and how to drive adoption

**Partnering with architecture and InfoSec**

Early on, the CDO and CIO should engage the engineering and architecture community within the organization to ensure that everyone understands the technical implications of the overall strategy. This minimizes the chances that the engineering teams will build separate and competing data platforms. In regulated industries that require a named enterprise architect (EA), this will be a key relationship to foster. The EA is responsible for validating that the overall technology design and data management features support the performance and regulatory compliance requirements — specifically, whether the proposed design can meet the anticipated SLAs of the most demanding use cases and support the volume, velocity, variety and veracity (four Vs) of the data environment.
It is important to fully understand which environments and accounts your data is stored in. The goal is to minimize the number of copies of your data and to keep the data within your cloud account — and not the vendor’s. Make sure the architecture and security model for protecting data is well understood. ----- From an InfoSec perspective, the CDO must work to ensure that the proper controls and security are applied to the new data ecosystem and that the authentication, authorization and access control methods meet all the data governance requirements. An industry best practice is to enable self-service registration of data sets, by the data owner, and support the assignment of security groups or roles to help automate the access control process. This allows data sets to be accessible only to the personnel that belong to a given group. The group membership could be based primarily on job function or role within the organization. This approach provides fast onboarding of new employees, but caution should be taken not to proliferate too many access control groups — in other words, do not get too fine grained with group permissions, as they will become increasingly difficult to manage. A better strategy is to be more coarse-grained and use row- and column-level security sparingly. **Centralized vs. federated labor strategy** In most organizations today, managers work in silos, making decisions with the best intentions but focused on their own functional areas. The primary risk to the status quo is that there will be multiple competing and conflicting approaches to creating enterprise data and AI platforms. This duplication of effort will waste time and money and potentially erode the confidence and motivation of the various teams. 
While it certainly is beneficial to compare and contrast different approaches to implementing an architecture, the approaches should be strictly managed, with everyone designing for the same goals and requirements — as described in this strategy document and adhering to the architectural principles and best practices. Even so, the CDO and CIO together should deliver a data analytics and AI platform with as little complexity as possible, and one that can easily scale across the organization.

It is very challenging to merge disparate data platform efforts into a single, cohesive design. It is best to get out in front of this wave of innovation and take input from the various teams to create a single, centralized platform. Having the data engineering teams centralized, reporting into a CIO, makes it easier to design a modern data stack — while ensuring that there is no duplication of effort when implementing the platform components. Figure 6 shows one possible structure.

**Figure 6:** Centralized teams with matrixed responsibilities

- **Data Scientist:** Model and predict with data
- **Data Analyst:** Visualize and describe data
- **Data Engineer:** Store, process, maintain data
- **Business Partners and Domain Experts**
- Teams shown: **Team A ($1.1M)**, **Team B ($1.3M)**, **Team C ($1.5M)**
- Centralize data scientists under CDO — embed in lines of business for day-to-day tasking
- Centralize data engineers under CIO/CTO — initially as an enterprise function

**Hiring, training and upskilling your talent**

While this guide does not cover recruiting strategies, it is important to note that data engineering and data science talent is very difficult to find in this competitive market. As a result, every organization should consider what training and upskilling opportunities exist for their current staff. A large number of online courses, at relatively low cost, teach the fundamentals of data science and AI.
It will still be important to augment your existing staff with experienced data scientists and machine learning experts. You will then need to establish clear training paths, resources and timelines to upskill your talent.

Using the COE construct, it is easier to upskill a mix of data science talent by having the experts mentor the less experienced staff. The majority of Ph.D.-level talent comes from academia and has a vested interest in educating others. It’s important to set up the structure and allow time in the schedule for knowledge transfer, experimentation and a safe environment in which to fail. A key aspect in accelerating the experience of your talent is to enable data science using production-like data and creating a collaborative environment for code sharing.

The Databricks training, [documentation](https://docs.databricks.com) and [certification](https://databricks.com/learn/certification) available to customers is industry-leading, and our [Solution Accelerators](https://databricks.com/solutions/accelerators) provide exemplar code for organizations to hit the ground running with data and AI.

#### 4. Deploy a modern data stack

The modern data architecture can most easily be described as the evolution of the enterprise data warehouse (EDW) from the 1980s and the Hadoop-style data lakes from the mid-2000s. The capabilities, limitations and lessons learned from working with these two legacy data architectures inspired the next generation of data architecture — what the industry now refers to as the lakehouse.

Figure 7 shows how the architectures have evolved as networking, storage, memory and CPU performance have improved over time.
**Figure 7:** A brief history of data architectures

**Evolving beyond the enterprise data warehouse and data lake**

The EDW provided organizations with the ability to easily load structured and semi-structured data into well-organized tables — like rows and columns in a spreadsheet — and execute Structured Query Language (SQL) queries and generate business intelligence (BI) reports to measure the health and performance of the business. Though the EDW coupled storage and compute, it provided organizations with the ability to catalog data, apply robust security and audit, monitor costs and support a large number of simultaneous users — while still being performant.

The EDW served its purpose for decades. However, most of the recent advances in AI have been in better models to process unstructured data (text, images, video, audio), but these are precisely the types of data that an EDW is not optimized for.

Therefore, in the mid-2000s, organizations wanted to take advantage of new data sets — _ones that contained unstructured data_ — and apply new analytics — _ones that leveraged emerging data science algorithms_. In order to accomplish this, massive investments in on-premises data lakes occurred — most often leveraging Apache Hadoop and its distributed file system, known as HDFS, running on low-cost, commodity hardware. The Hadoop-style data lake provided the separation of compute from storage that organizations were seeking — thus eliminating the risk of vendor lock-in and opening the doors to a wide range of new analytics. Despite all these benefits, the architecture proved to be difficult to use, with a complex programming model known as MapReduce, and the performance fell short of the majority of real-time use cases.

Over time, Hadoop workloads were often migrated to Apache Spark™ workloads, which can run up to 100x faster by processing data in-memory across a cluster — with the ability to massively scale.
The Spark programming model was also simpler to use and provided a consistent set of application programming interfaces (APIs) for languages such as Python, SQL, R, Java and Scala. Spark was the first major step in separating compute from storage and providing the scale needed for distributed workloads.

A data lakehouse combines the best of data lakes and data warehouses, enabling BI and ML on all data on a simple, open and multicloud modern data stack.

**Cloud-based data lakes**

More than a decade ago, the cloud opened a new frontier for data storage. Cloud object stores like Amazon S3 and Azure Data Lake Storage (ADLS) have become some of the largest, most cost-effective storage systems in the world — which make them an attractive platform to serve as the next generation of data lakes. Object stores excel at massively parallel reads — an essential requirement for modern data warehouses.

However, data lakes lack some critical features: They do not support transactions, they do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, and batch and streaming jobs. Also, performance is hampered by expensive metadata operations — for example, efficiently listing the millions of files (objects) that make up most large data lakes.

**Lakehouse — the modern data architecture**

What if it were possible to combine the best of both worlds? The performance, concurrency and data management of EDWs with the scalability, low cost and workload flexibility of the data lake. This is exactly the target architecture described by CDOs, CIOs and CTOs when asked how they would envision reducing the complexity of their current data ecosystems while enabling data and AI, at scale. The building blocks of this architecture are shown in Figure 8 and are what inspired the innovations that make the lakehouse architecture possible.
**Figure 8:** The building blocks for a modern data architecture

- **Data Sources (Batch and Real-Time):**
  - **Unstructured:** Image, Video, Audio; Free Text, Blob
  - **Semi-Structured:** Logs, Clickstream; CSV, JSON, XML
  - **Structured:** Systems of Record; Operational DBs
- **Curated Data Lake**, in order of increasing **data quality:** **Raw Data Ingest** (“Bronze”), **Filtered/Cleaned/Augmented** (“Silver”), **Business-Level Aggregates** (“Gold”)
- Consumers: **Exploratory Data Scientist**, **Production Machine Learning**, **BI/Ad Hoc SQL Analytics**

The lakehouse architecture provides a flexible, high-performance design for diverse data applications, including real-time streaming, batch processing, data warehousing, data science and machine learning. This target-state architecture supports loading all the data types that might be interesting to an organization — structured, semi-structured and unstructured — and provides a single processing layer, using consistent APIs across programming languages, to curate data while applying rigorous data management techniques.

The move toward a single, consistent approach to data pipelining and refinement saves organizations time, money and duplication of effort. Data arrives in a landing zone and is then moved through a series of curation and refinement steps resulting in highly consumable and trusted data for downstream use cases. The architecture makes possible the efficient creation of “data assets” for the organization by taking a stepwise approach to improving data.

**Lakehouse key features**

To effectively migrate organizations to the lakehouse architecture, here’s a list of key features that must be available for stakeholders to run business-critical production workloads:

**Reliable data pipelines:** The lakehouse architecture simplifies the ETL development and management with declarative pipeline development, automatic data testing and deep visibility for monitoring and recovery.
 **Transaction support:** In an enterprise lakehouse, many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.  **Schema enforcement and governance:** The lakehouse should have a way to support schema enforcement and evolution, supporting DW schema paradigms such as star/snowflake schemas. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms.  **Fine-grained governance for data and AI:** The first fine-grained, centralized security model for data lakes across clouds — based on the ANSI SQL open standards. The lakehouse enables organizations to unify data and AI assets by centrally sharing, auditing, securing and managing structured and unstructured data like tables, files, models and dashboards in concert with existing data, storage and catalogs.  **Storage is decoupled from compute:** In practice this means storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.  **Openness:** The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly. Databricks released Delta Lake to the open source community in 2019. Delta Lake provides all the data lifecycle management functions that are needed to make cloud-based object stores reliable and performant. 
This design allows clients to update multiple objects at once, replace a subset of the objects with another, etc., in a serializable manner that still achieves high parallel read/write performance from the objects — while offering advanced capabilities like time travel (e.g., query point-in-time snapshots or rollback of erroneous updates), automatic data layout optimization, upserts, caching and audit logs. -----  **Support for diverse data types ranging from unstructured to structured data:** The lakehouse can be used to store, refine, analyze and access data types needed for many new data applications, including images, video, audio, semi-structured data and text.  **Support for diverse workloads:** This includes data science, machine learning, SQL and analytics. Multiple tools might be needed to support all these workloads, but they all rely on the same data repository.  **End-to-end streaming:** Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.  **BI support:** Lakehouses enable the use of BI tools directly on the source data. This reduces staleness, improves recency, reduces latency and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse.  **Multicloud:** The Databricks Lakehouse Platform offers you a consistent management, security and governance experience across all clouds. You don’t need to invest in reinventing processes for every cloud platform that you’re using to support your data and AI efforts. Instead, your data teams can simply focus on putting all your data to work to discover new insights and create business value. 
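To make the ideas of atomic multi-row updates, upserts and time travel more concrete, here is a minimal in-memory sketch. It is a toy model under stated assumptions, not Delta Lake's API or implementation: the `VersionedTable` class and its methods are hypothetical names, and each commit simply stores a new snapshot so that earlier versions remain readable.

```python
class VersionedTable:
    """Toy table that keeps every committed snapshot (hypothetical, in-memory only)."""

    def __init__(self, key):
        self.key = key
        self._versions = [{}]  # version 0 is the empty table

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def upsert(self, rows):
        """Insert or update all rows as one commit; nothing is saved if any row is invalid."""
        snapshot = dict(self._versions[-1])  # work on a copy of the latest snapshot
        for row in rows:
            if self.key not in row:
                raise KeyError(f"row is missing key column {self.key!r}")
            snapshot[row[self.key]] = dict(row)
        self._versions.append(snapshot)  # the commit becomes visible only here
        return self.latest_version

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one (time travel)."""
        chosen = self.latest_version if version is None else version
        if not 0 <= chosen <= self.latest_version:
            raise IndexError(f"unknown version {chosen}")
        rows = self._versions[chosen].values()
        return [dict(row) for row in sorted(rows, key=lambda r: r[self.key])]

    def rollback(self, version):
        """Undo erroneous updates by committing an earlier snapshot as the newest version."""
        if not 0 <= version <= self.latest_version:
            raise IndexError(f"unknown version {version}")
        self._versions.append(dict(self._versions[version]))
        return self.latest_version


orders = VersionedTable(key="id")
v1 = orders.upsert([{"id": 1, "amount": 10}, {"id": 2, "amount": 20}])
v2 = orders.upsert([{"id": 2, "amount": 25}, {"id": 3, "amount": 30}])
print(orders.read(version=v1))  # the table as it looked after the first commit
print(orders.read())            # the current table
```

A real storage layer records changes in a transaction log over files in an object store rather than copying whole snapshots; the copy-per-commit approach here is chosen only to keep the example short.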
**Figure 9:** Delta Lake is the open data storage layer that delivers reliability, security and performance on your data lake — for both streaming and batch operations

Layers of the Lakehouse Platform shown in the figure:

- Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML
- Unity Catalog: Fine-grained governance for data and AI
- Delta Lake: Data reliability and performance
- Cloud Data Lake: All structured and unstructured data

These are the key attributes of lakehouses. Enterprise-grade systems require additional features. Tools for security and access control are basic requirements. Data governance capabilities, including auditing, retention and lineage, have become essential, particularly in light of recent privacy regulations. Tools that enable data discovery such as data catalogs and data usage metrics are also needed. With a lakehouse, such enterprise features only need to be implemented, tested and administered for a single system.

Databricks is the only cloud-native vendor to be recognized as a Leader in both [2021 Magic Quadrant reports:](https://databricks.com/p/ebook/databricks-named-leader-by-gartner) **Cloud Database Management Systems** and **Data Science and Machine Learning Platforms**

**Databricks — innovation driving performance**

Advanced analytics and machine learning on unstructured and large-scale data are two of the most strategic priorities for enterprises today — and the growth of unstructured data is going to increase exponentially — so it makes sense for CIOs and CDOs to think about positioning their data lake as the center of their data infrastructure. The main challenge is whether or not it can perform reliably and fast enough to meet the SLAs of the various workloads — especially SQL-based analytics.

Databricks has focused its engineering efforts on incorporating a wide range of industry-leading software and hardware improvements in order to implement the first lakehouse solution. Our approach capitalizes on the computing advances of the Apache Spark framework and the latest networking, storage and CPU technologies to provide the performance customers need to simplify their architecture. These innovations combine to provide a single architecture that can store and process all the data sets within an organization — supporting the range of analytics outlined above.

**BI and SQL workloads**

Perhaps the most significant challenge for the lakehouse architecture is the ability to support SQL queries for star/snowflake schemas in support of BI workloads. Part of the reason EDWs have remained a major part of the data ecosystem is because they provide low-latency, high-concurrency query support. In order to compete with the EDW, optimizations must be found within the lakehouse architecture that provide satisfactory query performance for the majority of BI workloads. Fortunately, advances in query plan, query execution, statistical analysis of files in the object store, and hardware and software improvements make it possible to deliver on this promise.

**A word about the data mesh architecture**

In 2019, another architectural concept, called the data mesh, was introduced. This architecture addresses what some designers identify as weaknesses of a centralized data lake. Namely, that you fill the data lake using a series of extract, transform, load (ETL) processes — which unnecessarily adds complexity. The data mesh approach avoids centralizing data in one location and encourages the source systems to create “data products” or “data assets” that are served up directly to consumers for data and AI workloads. The designers advocate for a federated approach to data and AI — while using enterprise policies to govern how source systems make data assets available. There are several challenges with this approach.
First, the data mesh assumes that each source system can dynamically scale to meet the demands of the consumers — particularly challenging when data assets become “hot spots” within the ecosystem. Second, centralized policies oftentimes leave the implementation details to the individual teams. This has the potential of inconsistent implementations, which may lead to performance degradations and differing cost profiles. Finally, the data mesh approach assumes that each source system team has the necessary skills, or can acquire them, to build robust data products. The lakehouse architecture is not at odds with the data mesh philosophy — as ingesting higher-quality data from the source systems reduces the curation steps needed inside the data lake itself. ----- #### 5. Improve data governance and compliance Data governance is perhaps the most challenging aspect of data transformation initiatives. Every stakeholder recognizes the importance of making data readily available, of high quality and relevant to help drive business value. Likewise, organizations understand the risks of failing to get it right — the potential for undetected data breaches, negative impact on the brand and the potential for significant fines in regulated environments. However, organizations shouldn’t perceive data governance or a defensive data strategy as a blocker or deterrent to business value. In fact, many organizations have leveraged their strong stance on data governance as a competitive differentiator to earn and maintain customer trust, ensure sound data and privacy practices, and protect their data assets **Why data governance fails** While most people agree that data governance is a set of principles, practices and tooling that helps manage the complete lifecycle of your data, what is often not discussed is what constitutes a pragmatic approach — one that balances realistic policies with automation and scalability. 
Too often the policies developed around data governance define very strict data management principles, such as the development of an enterprise-wide ontological model that all data must adhere to. Organizations can spend months, if not years, trying to define the perfect set of policies. The engineering effort to automate the enforcement of the new policies is not prioritized, or takes too long, due to the complexity of the requirements. Meanwhile, data continues to flow through the organization without a consistent approach to governance, and too much of the effort is done manually and is fraught with human error.

What are the basic building blocks of a sound data governance approach?

-----

**A pragmatic approach to data governance**

At a high level, organizations should enable the following data management capabilities:

**Identify all sources of data**

- Identify all data-producing and data-storing applications
- Identify the systems of record (SOR) for each data set
- Label data sets as internal or external (third party)
- Identify where sensitive data is stored (GDPR/CCPA scope)
- Limit which operational data stores (ODSs) can re-store SOR data

**Catalog data sets**

- Register all data sets in a centralized data catalog
- Create a lightweight, self-service data registration process
- Limit manual entry as much as possible
- Record the schema, if any, for the data set
- Use an inference engine or tool to extract the data set schema
- Add business and technical metadata to make it meaningful
- Use machine learning to classify data sets
- Use crowdsourcing to validate the machine-based results

**Track data lineage**

- Track data set flow and what systems act on data
- Create an enumerated list of action values for specific operations
- Emit lineage events via streaming layer and aggregate in data lake
- Lineage event schema:
- Optional: Add a source code repository URL for action traceability

**Perform data quality checks**

- Create a rules library that is centrally managed and versioned
- Update the rules library periodically with new rules
- Use a combination of checks: null/not null, regex, valid values
- Perform schema enforcement checks against data set registration

**Scan for sensitive data**

- Establish a tokenization strategy for sensitive data (GDPR/CCPA)
- Tokenize all sensitive data stored in the data lake and avoid cleartext
- Use fixed-length tokens to preserve analytic value
- Determine the approach for token lookup/resolution when needed
- Ensure that any central token stores are secure with rotating keys
- Identify which data elements from GDPR/CCPA to include in scans
- Efficiently scan for sensitive data in cleartext using the rules library

**Establish approved data flow patterns**

- Determine pathways for data flow (source → target)
- Limit the ways to get SOR data (APIs, streaming, data lake, etc.)
- Determine read/write patterns for the data lake
- Strictly enforce data flow pathways to/from data lake
- Detect violations and anomalies using lineage event analysis
- Identify offending systems and shut down or grant exception
- Record data flow exceptions and set a remediation deadline

**Centralize data access controls**

- Establish a common governance model for all data and AI assets
- Centrally define access policies for all data and AI assets
- Enable fine-grained access controls at row and column levels
- Centrally enforce access policies across all workloads: BI, analytics, ML

By minimizing the number of copies of your data and moving to a single data processing layer where all your data governance controls can run together, you improve your chances of staying in compliance and detecting a data breach.
**Make data discovery easy**

- Establish a data discovery model
- Use manual or automatic data classification
- Provide a visual interface for data discovery across your data estate
- Simplify data discovery with rich keyword- or business glossary-based search

**Centralize data access auditing**

- Establish a framework or best practices for access auditing
- Capture audit logs for all CRUD operations performed on data
- Make auditing reports easily accessible to data stewards/admins for ensuring compliance

This is not intended to be an exhaustive list of features and requirements but rather a framework to evaluate your data governance approach. There will be violations at runtime, so it will be important to have procedures in place for how to handle these violations. In some cases, you may want to be very strict and shut down the data flow of the offending system. In other cases, you may want to quarantine the data until the offending system is fixed. Finally, some SLAs may require the data to flow regardless of a violation. In these cases, the receiving systems must have their own methodology for dealing with bad data.

-----

**Hidden cost of data governance**

There are numerous examples of high-profile data breaches and failure to comply with consumer data protection legislation. You don’t have to look very far to see reports of substantial fines levied against organizations that were not able to fully protect the data within their data ecosystem. As organizations produce and collect more and more data, it’s important to remember that while storage is cheap, failing to enforce proper data governance is very, very expensive.

In order to catalog, lineage trace, quality check, and scan your data effectively, you will need a lot of compute power when you consider the massive amounts of data that exist in your organization.
Each time you copy a piece of data to load it into another tool or platform, you need to determine what data governance techniques exist there and how you ensure that you truly know where all your data resides. Imagine the scenario where data flows through your environment and is loaded into multiple platforms using various ETL processes. How do you handle the situation when you discover that sensitive data is in cleartext? Without a consistent set of data governance tools, you may not be able to remediate the problem before it’s flagged for violation. Having a smaller attack surface and fewer ingress/egress routes helps guard your data and protect your organization’s brand and balance sheet. The bottom line is that the more complex your data ecosystem architecture is, the more difficult and costly it is to get data governance right. ----- #### 6. Democratize access to quality data Effective data and AI solutions rely more on the amount of quality data available than on the sophistication or complexity of the model or algorithm. Google published a paper titled “The Unreasonable Effectiveness of Data” demonstrating this point. The takeaway is that organizations should focus their efforts on making sure data scientists have access to the widest selection of relevant and high-quality data to perform their jobs — which is to create new opportunities for revenue growth, cost reduction and risk reduction. **The 80/20 data science dilemma** Most existing data environments have their data stored primarily in different operational data stores within a given business unit (BU) — creating several challenges: Most business units deploy use cases that are based only on their own data — without taking advantage of cross-BU opportunities The schemas are generally not well understood outside of BU or department — with only the database designers and power users being able to make efficient use of the data. This is referred to as the “tribal knowledge” phenomenon. 
The approval process and different system-level security models make it difficult and time-consuming for data scientists to gain the proper access to the data they need.

In order to perform analysis, users are forced to log in to multiple systems to collect their data. This is most often done using single-node data science and generates unnecessary copies of data stored on local disk drives, various network shares or user-controlled cloud storage. In some cases, the data is copied to “user spaces” within production platform environments. This has the strong potential of degrading the overall performance for true production workloads.

To make matters worse, these copies of data are generally much smaller than the full-size data sets that would be needed in order to get the best model performance for your ML and AI workloads. Small data sets reduce the effectiveness of exploration, experimentation, model development and model training, resulting in inaccurate models when deployed into production and used with full-size data sets.

As a result, data science teams are spending 80% of their time wrangling data sets and only 20% of their time performing analytic work, and that work may need to be redone once they have access to the full-size data sets. This is a serious problem for organizations that want to remain competitive and generate game-changing results.

Another factor contributing to reduced productivity is the way in which end users are typically granted access to data. Security policies usually require both coarse-grained and fine-grained data protections. In other words, access is granted at the data set level (coarse-grained) but limited to specific rows and columns within the data set (fine-grained).

**Rationalize data access roles**

The most common approach to providing coarse-grained and fine-grained access is to use what’s known as role-based access control (RBAC).
Individual users log on to system-level accounts or via a single sign-on (SSO) authentication and access control solution. Users can access data by being added to one or more Lightweight Directory Access Protocol (LDAP) groups. There are different strategies for identifying and creating these groups — but typically, they are done on a system-by-system basis, with a 1:1 mapping for each coarse- and fine-grained access control combination. This approach to data access usually produces a proliferation of user groups. It is not unusual to see several thousand discrete security groups for large organizations — despite having a much smaller number of defined job functions. This approach creates one of the biggest security challenges in large organizations. When personnel leave the company, it is fairly straightforward to remove them from the various security groups. However, when personnel move around within the organization, their old security group assignments often remain intact and new ones are assigned based on their new job function. This leads to personnel continuing to have access to data that they no longer have a “need to know.” The Databricks Lakehouse Platform brings together all the data and AI personas into one environment and makes it easy to collaborate, share code and insights, and operate against the same view of data. ----- **Data classification** Having all your data sets stored in a single, well-managed data lake gives you the ability to use partition strategies to segment your data based on “need to know.” Some organizations create a partition based on which business unit owns the data and which one owns the data classification. For example, in a financial services company, credit card customers’ data could be stored separately from that of debit card customers, and access to GDPR/CCPA-related fields could be handled using classification labels. 
The simplest approach to data classification is to use three labels:  **Public data:** Data that can be freely disclosed to the public. This would include your annual report, press releases, etc.  **Internal data:** Data that has low security requirements but should not be shared with the public or competitors. This would include strategy briefings and market or customer segmentation research.  **Restricted data:** Highly sensitive data regarding customers or internal business operations. Disclosure could negatively affect operations and put the organization at financial or legal risk. Restricted data requires the highest level of security protection. Some organizations introduce additional labels, but care should be taken to make sure that everyone clearly understands how to apply them. The data classification requirements should be clearly documented and mapped to any legal or regulatory requirements. For example, CCPA is so sweeping that it includes 11 categories of personal information — and defines “personal information” as “information that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household.” ----- Just examining one CCPA category, _Customer Records Information_ , we see that the following information is to be protected: name, signature, social security number, physical characteristics or description, address, telephone number, passport number, driver’s license or state identification card number, insurance policy number, education, employment, employment history, bank account number, credit or debit card number, other financial information, medical information, and health insurance information. There are generally three different approaches in industry to performing data classification: **1. Content-based:** Scans or inspects and interprets files to find sensitive information. 
This is generally done using regular expressions and lookup tables to map values to actual entities stored inside the organization (e.g., customer SSN). **2. Context-based:** Evaluates the source of the data (e.g., application, location or creator) to determine the sensitivity of the data. **3. User-based:** Relies on a manual, end-user selection of each data set or element and requires expert domain knowledge to ensure accuracy. Taking all this into account, an organization could implement a streamlined set of roles for RBAC that uses the convention where “domain” might be the business unit within an organization, “entity” is the noun that the role is valid for, “data set” or “data asset” is the ID, and “classification” is one of the three values (public, internal, restricted). There is a “deny all default” policy that does not allow access to any data unless there is a corresponding role assignment. Wild cards can be used to grant access to eliminate the need to enumerate every combination. ----- For example, gives a user or a system access to all the data fields that describe a credit card transaction for a customer, including the 16-digit credit card number. Whereas would allow the user or system access only to nonsensitive data regarding the transaction. This gives organizations the chance to rationalize their security groups by using a domain naming convention to provide coarse-grained and fine-grained access without the need for creating tons of LDAP groups. It also dramatically eases the administration of granting access to data for a given user. **Everyone working from the same view of data** The modern data stack, when combined with a simplified security group approach and a robust data governance methodology, gives organizations an opportunity to rethink how data is accessed — and greatly improves time to market for their analytic use cases. All analytic workloads can now operate from a single, shared view of your data. 
Combining this with a sensitive data tokenization strategy can make it straightforward to empower data scientists to do their job and shift the 80/20 ratio in their favor. It’s now easier to work with full-size data sets that both obfuscate NPI/PII information and preserve analytic value.

Now, data discovery is easier because data sets have been registered in the catalog with full descriptions and business metadata, with some organizations going as far as showing realistic sample data for a particular data set. If a user does not have access to the underlying data files, having data in one physical location eases the burden of granting access, and then it’s easier to deploy access-control policies and collect/analyze audit logs to monitor data usage and to look for bad actors.

Adopting the Databricks Lakehouse Platform allows you to add data sets into a well-managed data lake using low-cost object stores, and makes it easy to partition data based on domain, entity, data set and classification levels to provide fine-grained (row-level and column-level) security.

-----

**Data security, validation and curation — in one place**

The modern data architecture using Databricks Lakehouse makes it easy to take a consistent approach to protecting, validating and improving your organization’s data. Data governance policies can be enforced using the built-in features of schema validation, expectations and pipelines, the three main steps to data curation.

Databricks enables moving data through well-defined states: Raw → Refined → Curated or, as we refer to it at Databricks, Bronze → Silver → Gold.

The raw data is known as “Bronze-level” data and serves as the landing zone for all your important analytic data. Bronze data functions as the starting point for a series of curation steps that filter, clean and augment the data for use by downstream systems. The first major refinement results in data being stored in “Silver-level” tables within the data lake.
These tables carry all the benefits of the Delta Lake product, such as ACID transactions and time travel. The final step in the process is to produce business-level aggregates, or “Gold-level” tables, that combine data sets from across the organization. It’s a set of data used to improve customer service across the full line of products, perform GDPR/CCPA reporting or look for opportunities to cross-sell to increase customer retention.

For the first time, organizations can truly optimize data curation and ETL, eliminating unnecessary copies of data and the duplication of effort that often happens in ETL jobs with legacy data ecosystems. This “solve once, access many times” approach speeds time to market, improves the user experience and helps retain talent.

**Extend the impact of your data with secure data sharing**

Data sharing is crucial to drive business value in today’s digital economy. More and more organizations are now looking to securely share trusted data with their partners/suppliers, internal lines of business or customers to drive collaboration, improve internal efficiency and generate new revenue streams with data monetization. Additionally, organizations are interested in leveraging external data to drive new product innovations and services. Business executives must establish and promote a data sharing culture in their organizations to build competitive advantage.

-----

#### 7. Dramatically increase productivity of your workforce

Now that you have deployed a modern data stack and have landed all your analytical data in a well-managed data lake with a rationalized approach to access control, the next question is, “What tools should I provide to the user community so they can be most effective at using the new data ecosystem?”

**Design thinking: working backward from the user experience**

Design thinking is a human-centered approach to innovation, focused on understanding customer needs, rapid prototyping and generating creative ideas, that will transform the way you develop products, services, processes and organizations. Design thinking was introduced as a technique to not only improve but also bring joy to the way people work. The essence of design thinking is to determine what motivates people to do their job, where their current pain points are and what could be improved to make their jobs enjoyable.

**Moving beyond best of breed**

If you look across a large enterprise, you will find no shortage of database design, ETL, data cleansing, model training and model deployment tools. Many organizations take a “best of breed” approach in providing tooling for their end users. This typically occurs because leaders genuinely want to empower business units, departments and teams to select the tool that best suits their specific needs, which is known as federated tool selection. Data science tooling, in particular, tends not to be procured at the “enterprise” level at first, given the high cost of rolling it out to the entire user population.

-----

When tool selection becomes localized, there are a few things to consider:

- Tools are generally thought of as discrete components within an ecosystem and, therefore, interchangeable with criteria that are established within a specific tool category. The tool with the best overall score gets selected.
- The selection criteria for a tool usually contain a subjective list of “must-have” features based on personal preference or adoption within a department, or because a given tool is better suited to support a current business process
- Discrete tools tend to leapfrog one another and add features based on market demand rather quickly
- Evaluations that are performed over many months likely become outdated by the time the tool has moved into production
- The “enterprise” requirements are often limited to ensuring that the tool fits into the overall architecture and security environment but nothing more
- It’s rare that the tools are evaluated in terms of simplifying the overall architecture, reducing the number of tools in play or streamlining the user experience
- The vendor backing the tool is evaluated in terms of risk, but not enough focus is spent on the partnership model, the ability to influence the roadmap and professional services support

For these reasons and more, it’s worth considering an architecture and procurement strategy that centers on selecting a data platform that enables seamless integration with point solutions rather than a suite of discrete tools that require integration work and may no longer be category leaders over the long haul.

-----

Keep in mind that data platforms work well because the vendor took an opinionated point of view of how data processing, validation and curation should work. It’s the integration between the discrete functions of the platform that saves time, conserves effort and improves the user experience. Many companies try to take on the integration of different technology stacks, which increases risk, cost and complexity. The consequences of not doing the integration properly can be serious in terms of security, compliance, efficiency, cost, etc.

So, find a vendor that you can develop a true partnership with, one that is more likely to take feedback and incorporate your requirements into their platform product roadmap. This will require some give-and-take from both parties, sometimes calling for an organization to adjust their processes to better fit how the platform works. There are many instances where a given business process could be simplified or recast to work with the platform, as is. Sometimes it will require the vendor to add features that support your processes. The vendor will always be market driven and will want to build features in such a way that they apply to the broadest set of customers.

The final point to consider is that it takes a substantial amount of time to become an expert user of a given tool. Users must make a significant investment to learn how the tool works and the most efficient way of performing their job. The more discrete tools in an environment, the more challenging this becomes. Minimizing the number of tools and their different interfaces, styles of interaction and approach to security and collaboration helps improve the user experience and decreases time to market.

Databricks is a leading data and AI company, partly due to the innovations in the [open source software](https://databricks.com/product/open-source) that runs our platform, and as a result of listening to the needs of thousands of customers and having our engineers work side by side with customer teams to deliver real business value using data and AI.

-----

**Unified platform, unified personas**

Deploying a unified data platform, like the Databricks Lakehouse Platform, which implements a modern data stack, will provide an integrated suite of tools for the full range of personas in your organization, including business analysts, SQL developers, data engineers and data scientists.
You will immediately increase productivity and reduce risk because you’ll be better able to share the key aspects of data pipelining — including ingestion, partitioning, curation, SQL analytics, reporting, and model development and deployment. All the work streams function off a single view of the data, and the handoffs between subsystems are well managed. Data processing happens in one auditable environment, and the number of copies of data is kept to an absolute minimum — with each user benefiting from the data assets created by others. Redundant work is eliminated. The 80/20 dilemma for data scientists shifts to a healthier ratio, and they now are able to spend more time working with rather than collecting the data. It’s difficult to decide what algorithm will work best — shifting the 80/20 ratio allows the data scientist to try out multiple algorithms to solve a problem. Another challenge is that enterprise data changes rapidly. New fields are added or existing fields are typed differently — for example, changing a string to an integer. This has a cascading effect, and the downstream consumers must be able to adjust by monitoring the execution and detecting the changes. The data scientist, in turn, must update and test new models on the new data. Your data platform should make the detection and remediation easier, not harder. For the data engineers, their primary focus is extracting data from source systems and moving it into the new data ecosystem. The data pipeline function can be simplified with a unified data platform because the programming model and APIs are consistent across programming languages (e.g., Scala, Python). This results in improved operations and maintenance (O&M). The runtime environment is easier to troubleshoot and debug since the compute layer is consistent, and the logging and auditing associated with the data processing and data management is centralized and of more value. 
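The schema changes described earlier in this section (new fields being added, or a string field becoming an integer) can be detected with a simple comparison of the expected schema against the one observed in incoming data. The sketch below is only an illustration under stated assumptions: schemas are represented as plain dictionaries of column name to type name, and `diff_schemas` is a hypothetical helper, not part of any platform API.

```python
from typing import Dict, List


def diff_schemas(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Describe how an observed schema differs from the expected one.

    Both arguments map column names to type names (an assumed representation).
    """
    changes: List[str] = []

    # Columns that appear in the new data but were not registered.
    for name in sorted(observed.keys() - expected.keys()):
        changes.append(f"added: {name} ({observed[name]})")

    # Columns that were registered but no longer arrive.
    for name in sorted(expected.keys() - observed.keys()):
        changes.append(f"removed: {name}")

    # Columns present in both schemas whose type has changed.
    for name in sorted(expected.keys() & observed.keys()):
        if expected[name] != observed[name]:
            changes.append(f"retyped: {name} ({expected[name]} -> {observed[name]})")

    return changes
```

A downstream consumer could run a check like this on each batch and alert, quarantine or retrain when the returned list is not empty.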
----- **Maximize the productivity of your workforce** Once you have a data platform that brings together your full range of personas, you should focus on the next step for increasing productivity — namely, self-service environments. In large organizations, there needs to be a strategy for how solutions are promoted up through the runtime environments for development, testing and production. These environments need to be nearly identical to one another — using the same version of software while limiting the number, size and horsepower of the compute nodes. To the extent possible, development and test should be performed with realistic test/ synthetic data. One strategy to support this is to tap into the flow of production data and siphon off a small percentage that is then changed in randomized fashion — obfuscating the real data but keeping the same general shape and range of values. The **DEV** environment should be accessible to everyone without any organizational red tape. The DEV environments should be small and controlled with policies that spin them up and tear them down efficiently. Every aspect of the DEV infrastructure should be treated as ephemeral. Nothing should exist in the environment that cannot be destroyed and easily rebuilt. The **TEST** environment should mimic the PROD environment as much as possible, including the monitoring tools — within obvious cost/budget constraints. The use of the TEST environment can be requested by the developers, but the process is governed using a workflow/sign-off approval approach — signed off by management. Moving to **PROD** is the final step, and there usually is a “separation of duties” that is required so that developers cannot randomly promote software to run in production. Again, this process should be strictly governed using a workflow/sign-off approval approach — signed off by management as well. 
Many organizations fully automate the steps, except the sign-offs, and support the notion of continuous deployments.

[Figure: DEV, TEST and PROD environments]

-----

#### 8. Make informed build vs. buy decisions

A key piece of the strategy will involve the decision around which components of the data ecosystem are built by the in-house engineering team and which components are purchased through a vendor relationship. There is increased emphasis within engineering teams on taking a “builder” approach. In other words, the engineering teams prefer to develop their own solutions in-house rather than rely on vendor products.

**Competitive advantage**

This “roll your own” approach has some advantages, including being able to establish the overall product vision, prioritize features and directly allocate the resources to build the software. However, it is important to keep in mind which aspects of your development effort give you the most competitive advantage.

Spend some time working with the data transformation steering committee and other stakeholders to debate the pros and cons of building out various pieces of the data ecosystem. The primary factor should come down to whether or not a given solution offers true competitive advantage for the organization. Does building this piece of software make it harder for your competitors to compete with you? If the answer is no, then it is better to focus your engineering and data science resources on deriving insights from your data.

**Beware: becoming your own software vendor**

As many engineering leaders know, building your own software is an exciting challenge. However, it does come with added responsibility: managing the overall project timeline and costs, and being responsible for the design, implementation, testing, documentation, training, and ongoing maintenance and updates. You basically are becoming your own software vendor for every component of the ecosystem that you build yourself.
When you consider the cost of a standard-sized team, it is not uncommon to spend several million dollars per year building out individual component parts of the new data system. This doesn’t include the cost to operate and maintain the software once it is in production.

To offset the anticipated development costs, engineering teams will oftentimes make the argument that they are starting with open source software and extending it to meet the “unique requirements” of your organization. It’s worth pressure testing this approach and making sure that a) the requirements truly are unique and b) the development offers the competitive advantage that you need.

Even software built on top of open source still requires significant investment in integration and testing. The integration work is particularly challenging because of the large number of open source libraries that are required in the data science space. The question becomes, “Is this really the area that you want your engineering teams focused on?” Or would it be better to “outsource” this component to a third party?

**How long will it take? Can the organization afford to wait?**

Even if you decide the software component provides a competitive advantage and is something worth building in-house, the next question that you should ask is, “How long will it take?” There is definitely a time-to-market consideration, and the build vs. buy decision needs to also account for the impact to the business due to the anticipated delivery schedule. Keep in mind that software development projects usually take longer and cost more money than initially planned.

The organization should understand the impact to the overall performance and capabilities of the data ecosystem for any features tied to the in-house development effort. Your business partners likely do not care how the data ecosystem is implemented as long as it works, meets their needs, is performant, is reliable and is delivered on time.
Carefully weigh the trade-offs among competitive advantage, cost, features and schedule. Databricks is built on top of popular open source software that it created. Engineering teams can improve the underpinnings of the Databricks platform by submitting code via pull request and becoming committers to the projects. The benefit to organizations is that their engineers contribute to the feature set of the data platform while Databricks remains responsible for all integration and performance testing plus all the runtime support, including failover and disaster recovery. ----- **Don’t forget about the data** Perhaps the single most important feature of a modern data stack is its ability to help make data sets and “data assets” consumable to the end users or systems. Data insights, model training and model execution cannot happen in a reliable manner unless the data they depend on can be trusted and is of good quality. In large organizations, revenue opportunities and the ability to reduce risk often depend on merging data sets from multiple lines of business or departments. Focusing your data engineering and data science efforts on curating data and creating robust and reliable pipelines likely provides the best chance at creating true competitive advantage. The amount of work required to properly catalog, schema enforce, quality check, partition, secure and serve up data for analysis should not be underestimated. The value of this work is equally important to the business. The ability to curate data to enable game-changing insights should be the focus of the work led by the CDO and CIO. This has much more to do with the data than it does with the ability to have your engineers innovate on components that don’t bring true competitive advantage. ----- #### 9. Allocate, monitor and optimize costs Beginning in 1987, Southwest Airlines famously standardized on flying a single airplane type — the Boeing 737 class of aircraft. 
This decision allowed the airline to save on both operations and maintenance — requiring only one type of simulator to train pilots, streamlining their spare parts supply chain and maintaining a more manageable parts inventory. Their pilots and maintenance crews were effectively interchangeable in case anyone ever called in sick or missed a connection. The key takeaway is that in order to reduce costs and increase efficiency, Southwest created their own version of a unified platform — getting all their flight- related personas to collaborate and operate from the same point of view. Lessons learned on the platform could be easily shared and reused by other members of the team. The more the team used the unified platform, the more they collaborated and their level of expertise increased. **Reduce complexity, reduce costs** The architectures of enterprise data warehouses (EDWs) and data lakes were either more limited or more complex — resulting in increased time to market and increased costs. This was mainly due to the requirement to perform ETL to explore data in the EDW or the need to split data using multiple pipelines for the data lake. The data lakehouse architecture simplifies the cost allocation because all the processing, serving and analytics are performed in a single compute layer. Organizations can rightsize the data environments and control costs using policies. The centralized and consistent approach to security, auditing and monitoring makes it easier to spot inefficiencies and bottlenecks in the data ecosystem. Performance improvements can be gained quickly as more platform expertise is developed within the workforce. The Databricks platform optimizes costs for your data and AI workloads by intelligently provisioning infrastructure only as you need it. Customers can establish policies that govern the size of clusters based on DEV, TEST, PROD environments or anticipated workloads. 
----- **Centralized funding model** As previously mentioned, data transformation initiatives require substantial funding. Centralizing the budget under the CDO provides consistency and visibility into how funds are allocated and spent — increasing the likelihood of a positive ROI. Funding at the beginning of the initiative will be significantly higher than the funding in the out-years. It’s not uncommon to see 3- to 5-year project plans for larger organizations. Funding for years 1 and 2 is often reduced in years 3 and 4 and further reduced in year 5 — until it reaches a steady state that is more sustainable. The budget takes into account the cost of the data engineering function, commercial software licenses and building out the center of excellence to accelerate the data science capabilities of the organization. Again, the CDO must partner closely with the CIO and the enterprise architect to make sure that the resources are focused on the overall implementation plan and to make sound build vs. buy decisions. It’s common to see the full budget controlled by the CDO, with a significant portion allocated to resources in the CIO’s organization to perform the data engineering tasks. The data science community reports into the CDO and is matrixed into the lines of business in order to better understand the business drivers and the data sets. Finally, investing in data governance cannot wait until the company has suffered from a major regulatory challenge, a data breach or some other serious defense-related problem. CDOs should spend the necessary time to educate leaders throughout the organization on the value of data governance. Databricks monitors and records usage and allows organizations to easily track costs on a data and AI workload basis. This provides the ability to implement an enterprise-wide chargeback model and put in place appropriate spending limits.
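As a rough illustration of the usage-based chargeback idea, the sketch below splits a shared platform cost across groups in proportion to their measured usage. The function name `allocate_chargeback`, the group names and the single usage figure per group are all hypothetical; a real model would likely weigh compute and data volumes separately.

```python
def allocate_chargeback(shared_cost: float, usage_by_group: dict[str, float]) -> dict[str, float]:
    """Split `shared_cost` across groups pro rata to their recorded usage.

    `usage_by_group` maps a group name to any consistent usage measure
    (for example compute hours); the unit is an assumption of this sketch.
    """
    total_usage = sum(usage_by_group.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to allocate costs")
    # Each group pays the fraction of the cost that matches its share of usage.
    return {
        group: shared_cost * usage / total_usage
        for group, usage in usage_by_group.items()
    }


# Hypothetical example: a 1,000 unit cost above base-level funding,
# split across three groups with different usage levels.
charges = allocate_chargeback(1000.0, {"marketing": 30, "finance": 10, "risk": 60})
```

Here the heaviest user (`risk`) carries the largest share, which is the behavior a pro rata model is meant to produce.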
----- **Chargeback models** To establish the centralized budget to fund the data transformation initiative, some organizations impose a “tax” on each part of the organization — based on size as well as profit and loss. This base-level funding should be used to build the data engineering and data science teams needed to deploy the building blocks of the new data ecosystem. However, as different teams, departments and business units begin using the new data ecosystem, the infrastructure costs, both compute and storage, will begin to grow. The costs will not be evenly distributed, due to different levels of usage from the various parts of the organization. The groups with the heavier usage should obviously cover their pro rata share of the costs. This requires the ability to monitor and track usage — not only based on compute but also on the amount of data generated and consumed. This so-called chargeback model is an effective and fair way to cover the cost deltas over and above the base-level funding. Plus, not all the departments or lines of business will require the same level of compute power or fault tolerance. The architecture should support the ability to separate out the runtime portions of the data ecosystem and isolate the workloads based on the specific SLAs for the use cases in each environment. Some workloads cannot fail and their SLAs will require full redundancy, thus increasing the number of nodes in the cluster or even requiring multiple clusters operating in different cloud regions. In contrast, less critical workloads that can fail and be restarted can run on less costly infrastructure. This makes it easier to better manage the ecosystem by avoiding a one-size-fits-all approach and allocating costs to where the performance is needed most. ----- #### 10. Move to production and scale adoption Now that you’ve completed the hard work outlined in the first nine steps, it is time to put the new data ecosystem to use. 
In order to get truly game-changing results, organizations must be really disciplined at managing and using data to enable use cases that drive business value. They must also establish a clear set of metrics to measure adoption and track the net promoter score (NPS) so that the user experience continues to improve over time. **If you build it, they will come** Keep in mind that your business partners are likely the ones to do the heavy lifting when it comes to data set registration. Without a robust set of relevant, quality data to use, the data ecosystem will be useless. A high level of automation for the registration process is important because it’s not uncommon to see thousands of data sets in large organizations. The business and technical metadata plus the data quality rules will help guarantee that the data lake is filled with consumable data. The lineage solution should provide a visualization that shows the data movement and verifies that the approved data flow paths are being followed. Some key metrics to keep an eye on are: Percentage of source systems contributing data to the ecosystem Percentage of real-time streaming relative to API and batch transfers Percentage of registered data sets with full business and technical metadata Volume of data written to the data lake Percentage of raw data that enters a data curation pipeline Volume of data consumed from the data lake Number of tables defined and populated with curated data Number of models trained with data from the data lake Lineage reports and anomaly detection incidents Number of users running Python, SQL, Scala and R workloads In 2018, Databricks released MLflow — an open source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment and a central model registry. MLflow is included in the Databricks Lakehouse Platform and accelerates the adoption of machine learning and AI in organizations. 
----- **Communication plan** Communication is critical throughout the data transformation initiative — however, it is particularly important once you move into production. Time is precious and you want to avoid rework, if at all possible. Organizations often overlook the emotional and cultural toll that a long transformation process takes on the workforce. The seam between the legacy environment and the new data ecosystem is an expensive and exhausting place to be — because your business partners are busy supporting two data worlds. Most users just want to know when the new environment will be ready. They don’t want to work with partially completed features, especially while performing double duty. Establish a solid communication plan and set expectations for when features will come online. Make sure there is detailed documentation, training and a support/help desk to field users’ questions. **DevOps — software development + IT operations** Mature organizations develop a series of processes and standards for how software and data are developed, managed and delivered. The term “DevOps” comes from the software engineering world and refers to developing and operating large-scale software systems. DevOps defines how an organization, its developers, operations staff and other stakeholders establish the goal of delivering quality software reliably and repeatedly. In short, DevOps is a culture that consists of two practices: continuous integration (CI) and continuous delivery (CD). The CI portion of the process is the practice of frequently integrating newly written or changed code with the existing code repository. As software is written, it is continuously saved back to the source code repository, merged with other changes, built, integrated and tested — and this should occur frequently enough that the window between commit and build is narrow enough that no errors can occur without developers noticing them and correcting them immediately. 
This is particularly important for large, distributed teams to ensure that the software is always in a working state — despite the frequent changes from various developers. Only software that passes the CI steps is deployed — resulting in shortened development cycles, increased deployment velocity and the creation of dependable releases. ----- **DataOps — data processing + IT operations** DataOps is a relatively new focus area for the data engineering and data science communities. Its goal is to use the well-established processes from DevOps to consistently and reliably improve the quality of data used to power data and AI use cases. DataOps automates and streamlines the lifecycle management tasks needed for large volumes of data — basically, ensuring that the volume, velocity, variety and veracity of the data are taken into account as data flows through the environment. DataOps aims to reduce the end-to-end cycle time of data analytics — from idea, to exploration, to visualizations and to the creation of new data sets, data assets and models that create value. For DataOps to be effective, it must encourage collaboration, innovation and reuse among the stakeholders, and the data tooling should be designed to support the workflow and make all aspects of data curation and ETL more efficient. **MLOps — machine learning + IT operations** Not surprisingly, the term “MLOps” takes the DevOps approach and applies it to the machine learning and deep learning space — automating or streamlining the core workflow for data scientists. MLOps differs from DevOps and DataOps because the approach to deploying effective machine learning models is far more iterative and requires much more experimentation — data scientists try different features, parameters and models in a tight iteration cycle. 
In all these iterations, they must manage the code base, understand the data used to perform the training and create reproducible results. The logging aspect of the ML development lifecycle is critical. MLOps aims to manage deployment of machine learning and deep learning models in large-scale production environments while also focusing on business and regulatory requirements. The ideal MLOps environment would include data science tools where models are constructed and analytical engines where computations are performed. ----- The overall workflow for deploying production ML models is shown in Figure 10. Unlike most software applications that execute a series of discrete operations, ML platforms are not deterministic and are highly dependent on the statistical profile of the data they use. ML platforms can suffer performance degradation of the system due to changing data profiles. Therefore, the model has to be refreshed even if it currently “works” — leading to more iterations of the ML workflow. The ML platform should natively support this style of iterative data science. **Ethics in AI** As more organizations deploy data and AI solutions, there is growing concern around a number of issues related to ethics — in particular, how do you ensure the data and algorithms used to make decisions are fair and ethical, and that the outcomes have the appropriate impact on the target audience? Organizations must ensure that the “black box” algorithms that produce results have the transparency, interpretability and explainability to satisfy legal and regulatory safeguards. The vast majority of AI work still involves software development by human beings and the use of curated data sets. There is the obvious potential for bias and the application of AI in domains that are ethically questionable. 
CDOs are faced with the added challenge of needing to be able to defend the use of AI, explain how it works and describe the impact of its existence on the target audience — whether internal workers or customers. **Figure 10:** Workflow for deploying production ML models (stages: data extraction, data analysis, data preparation, model training, model evaluation, model serving and execution, model monitoring) ----- **Data and AI Maturity Model** When data and AI become part of the fabric of the company and the stakeholders in the organization adopt a data asset and AI mindset, the company moves further along a well-defined maturity curve, as shown in Figure 11. **Top-Line Categories and Ranking Criteria** (ordered from low maturity/value to high maturity/value)

1. **Explore:** Organization is beginning to explore big data and AI, and understand the possibilities and potential of a few starter projects and experiments
2. **Experiment:** Organization builds the basic capabilities and foundations to begin exploring a more expansive data and AI strategy, but it lacks vision, long-term objectives or leadership buy-in
3. **Formalize:** Data and AI are budding into drivers of value for BUs aligned to specific projects and initiatives as the core tenets of data and AI are integrated into corporate strategy
4. **Optimize:** Data and AI are core drivers of value across the organization, structured and central to corporate strategy, with a scalable architecture that meets business needs and buy-in from across the organization
5. **Transform:** Data and AI are at the heart of the corporate strategy and are invaluable differentiators and drivers of competitive advantage

**Figure 11:** The Data and AI Maturity Model

Databricks partners with its customers to enable them to do an internal self-assessment. 
The output of the self-assessment allows organizations to: Understand the current state of their journey to data and AI maturity Identify key gaps in realizing (more) value from data and AI Plot a path to increase maturity with specific actions Identify Databricks resources who can help support their journey ----- **CHAPTER 3:** ## Conclusion After a decade in which most enterprises took a hybrid approach to their data architecture — and struggled with the complexity, cost and compromise that come with supporting both data warehouses and data lakes — the lakehouse paradigm represents a breakthrough. Choosing the right modern data stack will be critical to future-proofing your investment and enabling data and AI at scale. The simple, open and multicloud architecture of the Databricks Lakehouse Platform delivers the simplicity and scalability you need to unleash the power of your data teams to collaborate like never before — in real time, with all their data, for every use case. For more information, please visit [Databricks](https://databricks.com/solutions/roles/data-leaders) or [contact us](https://databricks.com/company/contact) . **A B O U T T H E A U T H O R** Chris D’Agostino is the Global Field CTO at Databricks, having joined the company in January 2020. His role is to provide thought leadership and serve as a trusted advisor to our top customers, globally. Prior to Databricks, Chris ran a 1,000-person data engineering function for a top 10 U.S. bank. In that role, he led a team that was responsible for building out a modern data architecture that emphasized the key attributes of the lakehouse architecture. Chris has also held leadership roles at a number of technology companies. ----- ##### About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. 
Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . **[Sign up for a free trial](https://databricks.com/try-databricks)** -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/transform-scale-your-organization-with-data-ai-v16-052522.pdf,2024-09-19T16:57:23Z
"### eBook # A New Approach to Data Sharing #### Open data sharing and collaboration for data, analytics, and AI ### Second Edition ----- ## Contents

- Introduction — Data Sharing in Today’s Digital Economy
- **Chapter 1: What Is Data Sharing and Why Is It Important?**
  - Common data sharing use cases
  - Data monetization
  - Data sharing with partners or suppliers (B2B)
  - Internal lines of business (LOBs) sharing
  - Key benefits of data sharing
- **Chapter 2: Conventional Methods of Data Sharing and Their Challenges**
  - Legacy and homegrown solutions
  - Proprietary vendor solutions
  - Cloud object storage
- **Chapter 3: Delta Sharing — An Open Standard for Secure Sharing of Data Assets**
  - What is Delta Sharing?
  - Key benefits of Delta Sharing
  - Maximizing value of data with Delta Sharing
  - Data monetization with Delta Sharing
  - B2B sharing with Delta Sharing
  - Internal data sharing with Delta Sharing
- **Chapter 4: How Delta Sharing Works**
- **Chapter 5: Introducing Databricks Marketplace**
  - What is Databricks Marketplace?
  - Key benefits of Databricks Marketplace
  - Enable collaboration and accelerate innovation
  - Powered by a fast, growing ecosystem
  - Use cases for an open marketplace
  - New upcoming feature: AI model sharing
- **Chapter 6: Share securely with Databricks Clean Rooms**
  - What is a data clean room?
  - Common data clean room use cases
  - Shortcomings of existing data clean rooms
  - Key benefits of Databricks Clean Rooms
- **Resources: Getting started with Data Sharing and Collaboration**
- **About the Authors**

----- ## Introduction  Data Sharing in Today’s Digital Economy Today’s economy revolves around data. Every day, more and more organizations must exchange data with their customers, suppliers and partners. Security is critical. And yet, efficiency and immediate accessibility are equally important. Where data sharing may have been considered optional, it’s now required. 
More organizations are investing in streamlining internal and external data sharing across the value chain. But they still face major roadblocks — from human inhibition to legacy solutions to vendor lock-in. To be truly data-driven, organizations need a better way to share data. [Gartner predicts that by 2024](https://www.gartner.com/en/documents/3999501), organizations that promote data sharing will outperform their peers on most business value metrics. In addition, Gartner recently found that Chief Data Officers who have successfully executed data sharing initiatives are 1.7x more effective in showing business value and return on investment from their data analytics strategy. To compete in the digital economy, organizations need an open — and secure — approach to data sharing. This eBook takes a deep dive into the modern era of data sharing and collaboration, from common use cases and key benefits to conventional approaches and the challenges of those methods. You’ll get an overview of our open approach to data sharing and find out how Databricks allows you to share your data across platforms, to share all your data and AI, and to share all your data securely with unified governance in a privacy-safe way. ----- ## Chapter 1  What Is Data Sharing and Why Is It Important? Data sharing is the ability to make the same data available to one or many stakeholders — both external and internal. Nowadays, the ever-growing amount of data has become a strategic asset for any company. Data sharing — within your organization or externally — is an enabling technology for data commercialization and enhanced analysis. Sharing data as well as consuming data from external sources allows companies to collaborate with partners, establish new partnerships and generate new revenue streams with data monetization. Data sharing can deliver benefits to business groups across the enterprise. 
For those business groups, data sharing can enable access to data needed to make critical decisions. This includes but is not limited to roles such as the data analyst, data scientist and data engineer. ----- #### Common data sharing use cases #### Data  monetization Companies across industries are commercializing data. Large multinational organizations have formed exclusively to monetize data, while other organizations are looking for ways to monetize their data and generate additional revenue streams. Examples of these companies can range from an agency with an identity graph to a telecommunication company with proprietary 5G data or to retailers that have a unique ability to combine online and offline data. Data vendors are growing in importance as companies realize they need external data for better decision-making. #### Data sharing with partners  or suppliers (B2B) Many companies now strive to share data with partners and suppliers as similarly as they share it across their own organizations. For example, retailers and their suppliers continue to work more closely together as they seek to keep their products moving in an era of ever-changing consumer tastes. Retailers can keep suppliers posted by sharing sales data by SKU in real time, while suppliers can share real-time inventory data with retailers so they know what to expect. Scientific research organizations can make their data available to pharmaceutical companies engaged in drug discovery. Public safety agencies can provide real-time public data feeds of environmental data, such as climate change statistics or updates on potential volcanic eruptions. #### Internal lines of business  (LOBs) sharing Within any company, different departments, lines of business and subsidiaries seek to share data so everyone can make decisions based on a complete view of the current business reality. For example, finance and HR departments need to share data as they analyze the true costs of each employee. 
Marketing and sales teams need a common view of data to determine the effectiveness of recent marketing campaigns. And different subsidiaries of the same company need a unified view of the health of the business. Removing data silos — which are often established for the important purpose of preventing unauthorized access to data — is critical for digital transformation initiatives and maximizing the business value of data. ----- #### Key benefits of data sharing As you can see from the use cases described above, there are many benefits of data sharing, including: **Greater collaboration with existing partners.** In today’s hyper- connected digital economy, no single organization can advance its business objectives without partnerships. Data sharing helps solidify existing partnerships and can help organizations establish new ones.  **Ability to generate new revenue streams.** With data sharing, organizations can generate new revenue streams by offering data products or data services to their end consumers. **Ease of producing new products, services or business models.** Product teams can leverage both first-party data and third-party data to refine their products and services and expand their product/ service catalog. **Greater efficiency of internal operations.** Teams across the organization can meet their business goals far more quickly when they don’t have to spend time figuring out how to free data from silos. When teams have access to live data, there’s no lag time between the need for data and the connection with the appropriate data source. ----- ## Chapter 2  Conventional Methods of Data Sharing and Their Challenges Sharing data across different platforms, companies and clouds is no easy task. In the past, organizations have hesitated to share data more freely because of the perceived lack of secure technology, competitive concerns and the cost of implementing data sharing solutions. 
Even for companies that have the budget to implement data sharing technology, many of the current approaches can’t keep up with today’s requirements for open-format, multi-cloud, high-performance solutions. Most data sharing solutions are tied to a single vendor, which creates friction for data providers and data consumers who use non-compatible platforms. Over the past 30 years, data sharing solutions have come in three forms: legacy and homegrown solutions, cloud object storage and closed source commercial solutions. Each of these approaches comes with its pros and cons. ----- #### Legacy and homegrown solutions Many companies have built homegrown data sharing solutions based on legacy technologies such as email, (S)FTP or APIs. **Figure 1:** Legacy data sharing solutions (diagram: provider tables are batch-exported via ETL to an FTP/SSH/API server; the consumer retrieves them over FTP/SSH/API and loads them via ETL into a database, where an analyst runs analysis) **Pros**  **Vendor agnostic.** FTP, email and APIs are all well-documented protocols. Data consumers can leverage a suite of clients to access data provided to them.  **Flexibility.** Many homegrown solutions are built on open source technologies and will work both on-prem and on clouds. ----- **Cons**  **Data movement.** It takes significant effort to extract data from cloud storage, transform it and host it on an FTP server for different recipients. Additionally, this approach results in creating copies of data sets. Data copying causes duplication and prevents organizations from instantly accessing live data.  **Complexity of sharing data.** Homegrown solutions are typically built on complex architectures due to replication and provisioning. This can add considerable time to data sharing activities and result in out-of-date data for end consumers.  **Operational overhead for data recipients.** Data recipients have to extract, transform and load (ETL) the shared data for their end use cases, which further delays the time to insights. 
For any new data updates from the providers, the consumers have to rerun ETL pipelines again and again.  **Security and governance.** As modern data requirements become more stringent, homegrown and legacy technologies have become more difficult to secure and govern.  **Scalability.** Such solutions are costly to manage and maintain and don’t scale to accommodate large data sets. ----- #### Proprietary vendor solutions Commercial data sharing solutions are a popular option among companies that don’t want to devote the time and resources to building an in-house solution yet also want more control than what cloud object storage can offer. **Figure 2:** Proprietary vendor solutions (diagram: on each vendor platform, data providers and data consumers exchange shared data sets in that vendor’s proprietary data format; sharing is limited to recipients on the same platform, with no cross-platform sharing) **Pros**  **Simplicity.** Commercial solutions allow users to share data easily with anyone else who uses the same platform. ----- **Cons**  **Vendor lock-in.** Commercial solutions don’t interoperate well with other platforms. While data sharing is easy among fellow customers, it’s usually impossible with those who use competing solutions. This reduces the reach of data, resulting in vendor lock-in. Furthermore, platform differences between data providers and recipients introduce data sharing complexities.  **Data movement.** Data must be loaded onto the platform, requiring additional ETL and data copies.  **Scalability.** Commercial data sharing comes with scaling limits from the vendors.  **Cost.** All the above challenges create additional cost for sharing data with potential consumers, as data providers have to replicate data for different recipients on different cloud platforms. 
----- #### Cloud object storage Object storage is considered a good fit for the cloud because it is elastic and can more easily scale into multiple petabytes to support unlimited data growth. The big three cloud providers all offer object storage services (AWS S3, Azure Blob, Google Cloud Storage) that are cheap, scalable and extremely reliable. An interesting feature of cloud object storage is the ability to generate signed URLs, which grant time-limited permission to download objects. Anyone who receives the presigned URL can then access the specified objects, making this a convenient way to share data. **Pros**  **Sharing data in place.** Object storage can be shared in place, allowing consumers to access the latest available data.  **Scalability.** Cloud object storage benefits from availability and durability guarantees that typically cannot be achieved on-premises. Data consumers retrieve data directly from the cloud providers, saving bandwidth for the providers. **Cons**  **Limited to a single cloud provider.** Recipients have to be on the same cloud to access the objects.  **Cumbersome security and governance.** Assigning permissions and managing access is complex. Custom application logic is needed to generate signed URLs.  **Complexity.** Personas managing data sharing (DBAs, analysts) find it difficult to understand Identity and Access Management (IAM) policies and how data is mapped to underlying files. For companies with large volumes of data, sharing via cloud storage is time-consuming, cumbersome and nearly impossible to scale.  **Operational overhead for data recipients.** The data recipients have to run extract, transform and load (ETL) pipelines on the raw files before consuming them for their end use cases. The lack of a comprehensive solution makes it challenging for data providers and consumers to easily share data. Cumbersome and incomplete data sharing processes also constrain the development of business opportunities from shared data. 
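To make the signed URL mechanism described in this section more concrete, here is a simplified sketch of the general idea: the URL carries an expiry time plus a signature that only the holder of a secret key can produce. This is not the actual signing scheme used by AWS S3, Azure Blob or Google Cloud Storage; the function names, query parameter names and object path are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse


def sign_url(object_path: str, secret: bytes, expires_at: int) -> str:
    """Append an expiry timestamp and an HMAC-SHA256 signature to `object_path`."""
    message = f"{object_path}\n{expires_at}".encode()
    signature = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{object_path}?{urlencode({'expires': expires_at, 'signature': signature})}"


def verify_url(url: str, secret: bytes, now: int) -> bool:
    """Return True only if the URL is unexpired and its signature matches."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires_at = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False  # missing or malformed parameters
    if now > expires_at:
        return False  # the time-limited permission has lapsed
    # Recompute the signature over the same fields the signer used.
    message = f"{parsed.path}\n{expires_at}".encode()
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected)
```

The sketch also hints at the drawback listed under the cons: someone has to write, run and secure this kind of custom logic, and anyone who obtains the URL before it expires can use it.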
----- ## Chapter 3  Delta Sharing — An Open Standard for Secure Sharing of Data Assets We believe the future of data sharing should be characterized by open technology. Data sharing shouldn’t be tied to a proprietary technology that introduces unnecessary limitations and financial burdens to the process. It should be readily available to anyone who wants to share data at scale. This philosophy inspired us to develop and release a new protocol for sharing data: Delta Sharing. #### What is Delta Sharing? Delta Sharing provides an open solution to securely share live data from your lakehouse to any computing platform. Recipients don’t have to be on the Databricks platform or on the same cloud or a cloud at all. Data providers can share live data without replicating it or moving it to another system. Recipients benefit from always having access to the latest version of data and can quickly query shared data using tools of their choice for BI, analytics and machine learning, reducing time-to-value. Data providers can centrally manage, govern, audit and track usage of the shared data on one platform. Delta Sharing is natively integrated with [Unity Catalog](https://databricks.com/product/unity-catalog), enabling organizations to centrally manage and audit shared data across organizations and confidently share data assets while meeting security and compliance needs. With Delta Sharing, organizations can easily share existing large-scale data sets based on the open source formats Apache Parquet and Delta Lake without moving data. Teams gain the flexibility to query, visualize, transform, ingest or enrich shared data with their tools of choice. 
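For a sense of what this looks like on the recipient side, here is a minimal sketch using the open source `delta-sharing` Python connector. It assumes the data provider has sent a profile file (called `config.share` here) and that a share, schema and table with the placeholder names in the comment exist; all of those names are illustrative, and the helper functions are not part of the connector itself.

```python
def shared_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address the connector expects."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Read a shared table into a pandas DataFrame.

    Requires the connector to be installed (`pip install delta-sharing`).
    """
    # Imported lazily so the URL helper above works without the package.
    import delta_sharing

    return delta_sharing.load_as_pandas(
        shared_table_url(profile_path, share, schema, table)
    )


# Hypothetical usage; replace the names with the ones your provider gives you:
# df = load_shared_table("config.share", "sales_share", "retail", "daily_orders")
# print(df.head())
```

The recipient needs only the profile file and a Python environment, which is consistent with the claim that recipients do not have to be on the Databricks platform.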

**Figure 3:** Delta Sharing. The diagram shows a data provider's Delta Lake table behind a Delta Sharing server that enforces access permissions, connected through the Delta Sharing protocol to a data recipient on any cloud or on-premises, for any use case (analytics, BI, data science) and with any tool. Callouts: no replication, easy to manage, secure.

Databricks designed Delta Sharing with five goals in mind:

- Provide an open cross-platform sharing solution
- Share live data without copying it to another system
- Support a wide range of clients such as Power BI, Tableau, Apache Spark™, pandas and Java, and provide flexibility to consume data using the tools of choice for BI, machine learning and AI use cases
- Provide strong security, auditing and governance
- Scale to massive structured data sets and also allow sharing of unstructured data and future data derivatives such as ML models, dashboards and notebooks, in addition to tabular data

#### Key benefits of Delta Sharing

By eliminating the obstacles and shortcomings associated with typical data sharing approaches, Delta Sharing delivers several key benefits, including:

**Open cross-platform sharing.** Delta Sharing establishes a new open standard for secure data sharing and supports open source Delta and Apache Parquet formats. Data recipients don’t have to be on the Databricks platform or on the same cloud, as Delta Sharing works across clouds and even from cloud to on-premises setups. To give customers even greater flexibility, Databricks has also released open source connectors for pandas, Apache Spark, Elixir and Python, and is working with partners on many more.

**Securely share live data without replication.** Most enterprise data today is stored in cloud data lakes. Any of these existing data sets on the provider’s data lake can easily be shared without any data replication or physical movement of data. Data providers can update their data sets reliably in real time and provide a fresh and consistent view of their data to recipients.

**Centralized governance.** With Databricks Delta Sharing, data providers can grant, track, audit and even revoke access to shared data sets from a single point of enforcement to meet compliance and other regulatory requirements. Databricks Delta Sharing users get:

- Implementation of Delta Sharing as part of Unity Catalog, the governance offering for Databricks Lakehouse
- Simple, more secure setup and management of shares
- The ability to create and manage recipients and data shares
- Audit logging captured automatically as part of Unity Catalog
- Direct integration with the rest of the Databricks ecosystem
- No separate compute for providing and managing shares

**Share data products, including AI models, dashboards and notebooks, with greater flexibility.** Data providers can choose between sharing an entire table or sharing only a version or specific partitions of a table. However, sharing just tabular data is not enough to meet today’s consumer demands. Delta Sharing also supports sharing of non-tabular data and data derivatives such as data streams, AI models, SQL views and arbitrary files, enabling increased collaboration and innovation. Data providers can build, package and distribute data products including data sets, AI models and notebooks, allowing data recipients to get insights faster. Furthermore, this approach promotes and empowers the exchange of knowledge (not just data) between different organizations. With Delta Sharing we are able to achieve a truly open marketplace and truly open ecosystem. In contrast, commercial products are mostly limited to sharing raw tabular data and cannot be used to share these higher-valued data derivatives.

**Share data at a lower cost.** Delta Sharing lowers the cost of managing and consuming shares for both data providers and recipients. Providers can share data from their cloud object store without replicating, thereby reducing the cost of storage. In contrast, existing data sharing platforms require data providers to first move their data into their platform or store data in proprietary formats in their managed storage, which often costs more and results in data duplication. With Delta Sharing, data providers don’t need to set up separate computing environments to share data. Consumers can access shared data directly using their tools of choice without setting up specific consumption ecosystems, thereby reducing costs.

**Reduced time-to-value.** Delta Sharing eliminates the need to set up a new ingestion process to consume data. Data recipients can directly access the fresh data and query it using tools of their choice. Recipients can also enrich data with data sets from popular data providers. The Delta Sharing ecosystem of open source and commercial partners is growing every day.

#### Maximizing value of data with Delta Sharing

Delta Sharing is already transforming data sharing activities for companies in a wide range of industries. Given the sheer variety of data available and the technologies that are emerging, it is hard to anticipate all the possible use cases Delta Sharing can address. The Delta Sharing approach is to share any data anytime with anyone easily and securely. In this section we will explore the building blocks of such an approach and the use cases emerging from these.

“Delta Sharing helped us streamline our data delivery process for large data sets.
This enables our clients to bring their own compute environment to read fresh curated data with little-to-no integration work, and enables us to continue expanding our catalog of unique, high-quality data products.”

**William Dague**, Head of Alternative Data, Nasdaq

“We recognize that openness of data will play a key role in achieving Shell’s Carbon Net Zero ambitions. Delta Sharing provides Shell with a standard, controlled and secure protocol for sharing vast amounts of data easily with our partners to work toward these goals without requiring our partners be on the same data sharing platform.”

**Bryce Bartmann**, Chief Digital Technology Advisor, Shell

“Leveraging the powerful capabilities of Delta Sharing from Databricks enables Pumpjack Dataworks to have a faster onboarding experience, removing the need for exporting, importing and remodeling of data, which brings immediate value to our clients. Faster results yield greater commercial opportunity for our clients and their partners.”

**Corey Zwart**, Head of Engineering, Pumpjack Dataworks

“Data accessibility is a massive consideration for us. We believe that Delta Sharing will simplify data pipelines by enabling us to query fresh data from the place where it lives, and we are not locked into any platform or data format.”

**Rayne Gaisford**, Global Head of Data Strategy, Jefferies

“As a data company, giving our customers access to our data sets is critical. The Databricks Lakehouse Platform with Delta Sharing really streamlines that process, allowing us to securely reach a much broader user base regardless of cloud or platform.”

**Felix Cheung**, VP of Engineering, SafeGraph

#### Data monetization with Delta Sharing

Delta Sharing enables companies to monetize their data product simply and with necessary governance.

**Figure 4:** Data monetization with Delta Sharing. The diagram shows a data vendor entitling various data products from Unity Catalog and cloud storage to data consumers, including a non-Databricks customer on any cloud or on-premises, with read-only (R/O) access through Delta Sharing, supported by fulfillment, billing and an audit log.

With Delta Sharing, a data provider can seamlessly share large data sets and overcome the scalability issues associated with SFTP servers. Data providers can easily expand their data product lines since Delta Sharing doesn’t require you to build a dedicated service for each of your data products like API services would. The company simply grants and manages access to the data recipients instead of replicating the data, thereby reducing complexity and latency.

Any data that exits your ELT/ETL pipelines becomes a candidate for a data product. Any data that exists on your platform can be securely shared with your consumers. This opens up a wider addressable market: your products appeal to a broader range of consumers, from those who say “we need access to your raw data only” to those who say “we want only small subsets of your Gold layer data.”

To mitigate cost concerns, Delta Sharing maintains an audit log that tracks any permitted access to the data. Data providers can use this information to determine the costs associated with any of the data products and evaluate if such products are commercially viable and sensible.

#### B2B sharing with Delta Sharing

**Figure 5:** B2B sharing with Delta Sharing. The diagram shows partners, each with Unity Catalog and cloud storage, plus a non-Databricks customer on any cloud or on-premises, exchanging read-only (R/O) data through Delta Sharing.

Delta Sharing also applies to the bidirectional exchange of data. Companies use Delta Sharing to incorporate partners and suppliers seamlessly into their workflows. Traditionally, this is not an easy task.
An organization typically has no control over how their partners are implementing their own data platforms. The complexity increases when we consider that the partners and suppliers can reside in a public cloud, private cloud or an on-premises deployed data platform. The choices of platform and architecture are not imposed on your partners and suppliers. Due to its open protocol, Delta Sharing addresses this requirement foundationally. Through a wide array of existing connectors (and many more being implemented), your data can land anywhere your partners and suppliers need to consume it.

In addition to the location of data consumer residency, the complexity of data arises as a consideration. The traditional approach to sharing data using APIs is inflexible and imposes additional development cycles on both ends of the exchange in order to implement both the provider pipelines and consumer pipelines. With Delta Sharing, this problem can be abstracted. Data can be shared as soon as it lands in the Delta table and when the shares and grants are defined. There are no implementation costs on the provider side. On the consumer side, data simply needs to be ingested and transformed into an expected schema for the downstream processes.

This means that you can form much more agile data exchange patterns with your partners and suppliers and attain value from your combined data much quicker than ever before.

#### Internal data sharing with Delta Sharing

Internal data sharing is becoming an increasingly important consideration for any modern organization, particularly where data describing the same concepts have been produced in different ways and in different data silos across the organization. In this situation it is important to design systems and platforms that allow governed and intentional federation of data and processes, and at the same time allow easy and seamless integration of said data and processes.
Architectural design patterns such as Data Mesh have emerged to address these specific challenges and considerations. Data Mesh architecture assumes a federated design and dissemination of ownership and responsibility to business units or divisions. This, in fact, has several advantages, chief among them that data is owned by the parts of the organization closest to the source of the data. Data residence is naturally enforced since data sits within the geo-locality where it has been generated. Finally, data volumes and data variety are kept in control due to the localization within a data domain (or data node). On the other hand, the architecture promotes exchange of data between different data domains when that data is needed to deliver outcomes and better insights.

**Figure 6:** Building a Data Mesh with Delta Sharing. The diagram shows business units in different regions, each with Unity Catalog and cloud storage, plus a non-Databricks customer on any cloud or on-premises, exchanging read-only (R/O) data through Delta Sharing.

Unity Catalog enables consolidated data access control across different data domains within an organization using the Lakehouse on Databricks. In addition, Unity Catalog adds a set of simple and easy-to-use declarative APIs to govern and control data exchange patterns between the data domains in the Data Mesh.

To make matters even more complicated, organizations can grow through mergers and acquisitions. In such cases we cannot assume that organizations being acquired have followed the same set of rules and standards to define their platforms and produce their data. Furthermore, we cannot even assume that they have used the same cloud providers, nor can we assume the complexity of their data models.
Delta Sharing can simplify and accelerate the unification and assimilation of newly acquired organizations and their data and processes. Individual organizations can be treated as new data domains in the overarching mesh. Only selected data sources can be exchanged between the different platforms. This enables teams to move freely between the organizations that are merging without losing their data; if anything, they are empowered to drive insights of higher quality by combining the data of both.

With Unity Catalog and Delta Sharing, the Lakehouse architecture seamlessly combines with the Data Mesh architecture to deliver more power than ever before, pushing the boundaries of what’s possible and simplifying activities that were deemed daunting not so long ago.

-----

## Chapter 4  How Delta Sharing Works

Delta Sharing is designed to be simple, scalable, nonproprietary and cost-effective for organizations that are serious about getting more from their data. Delta Sharing is natively integrated with Unity Catalog, which enables customers to add fine-grained governance and security controls, making it easy and safe to share data internally or externally.

Delta Sharing is a simple REST protocol that securely grants temporary access to part of a cloud data set. It leverages modern cloud storage systems (such as AWS S3, Azure ADLS or Google’s GCS) to reliably grant read-only access to large data sets. Here’s how it works for data providers and data recipients.

**Figure 7:** How Delta Sharing works connecting data providers and data recipients. The diagram shows a Delta Sharing client on the data recipient side requesting a table from the Delta Sharing server, which checks access permissions on the data provider's Delta Lake table and returns pre-signed short-lived URLs. These URLs give temporary direct access to files in Parquet format in the object store (AWS S3, GCP, ADLS).

#### Data providers

The data provider shares existing tables or parts thereof (such as specific table versions or partitions) stored on the cloud data lake in Delta Lake format.
The provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages recipient access. To manage shares and recipients, you can use SQL commands, the Unity Catalog CLI or the intuitive user interface.

#### Data recipients

The data recipient only needs one of the many Delta Sharing clients that support the protocol. Databricks has released open source connectors for pandas, Apache Spark, Java and Python, and is working with partners on many more.

#### The data exchange

The Delta Sharing data exchange follows three efficient steps:

1. The recipient’s client authenticates to the sharing server and asks to query a specific table. The client can also provide filters on the data (for example, “country=US”) as a hint to read just a subset of the data.
2. The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in cloud storage systems that make up the table.
3. To allow temporary access to the data, the server generates short-lived presigned URLs that allow the client to read Parquet files directly from the cloud provider so that the read-only access can happen in parallel at massive bandwidth, without streaming through the sharing server.

-----

## Chapter 5  Introducing Databricks Marketplace

Enterprises need open collaboration for data and AI. Data sharing, whether within an organization or externally, allows companies to collaborate with partners, establish new partnerships and generate new revenue streams with data monetization. The demand for generative AI is driving disruption across industries, increasing the urgency for technical teams to build generative AI models and Large Language Models (LLMs) on top of their own data to differentiate their offerings.
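
Looking back at Chapter 4 before moving on, the three-step Delta Sharing data exchange can be illustrated with a small toy model. This is a sketch under stated assumptions, not the actual Delta Sharing server or its REST API: `ToySharingServer`, `DataFile`, `presign`, `url_is_valid`, the signing key and the `storage.example.com` host are all hypothetical, and real cloud providers use their own presigning schemes. The sketch only shows the shape of the flow: authenticate and hint, authorize and log, then hand back short-lived URLs instead of data.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass, field

SIGNING_KEY = b"demo-signing-key"              # hypothetical secret held by the storage layer
STORAGE_PREFIX = "https://storage.example.com/"  # hypothetical object store host


@dataclass
class DataFile:
    path: str        # object path of one Parquet file
    partition: dict  # partition values, e.g. {"country": "US"}


def presign(path: str, expires: int) -> str:
    """Return a URL that is only valid until the 'expires' timestamp."""
    sig = hmac.new(SIGNING_KEY, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{STORAGE_PREFIX}{path}?expires={expires}&sig={sig}"


def url_is_valid(url: str, now: float) -> bool:
    """What the object store would check: signature matches and URL has not expired."""
    base, query = url.split("?", 1)
    params = dict(p.split("=", 1) for p in query.split("&"))
    path = base[len(STORAGE_PREFIX):]
    expected = hmac.new(SIGNING_KEY, f"{path}:{params['expires']}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, params["sig"]) and now < int(params["expires"])


@dataclass
class ToySharingServer:
    tables: dict  # table name -> list of DataFile
    grants: dict  # bearer token -> set of table names the recipient may read
    audit_log: list = field(default_factory=list)

    def query(self, token, table, hints=None, ttl_seconds=300, now=None):
        now = time.time() if now is None else now
        # Step 1 happened on the client: it sent a token, a table name and optional hints.
        # Step 2: verify access, log the request, then pick the matching files.
        allowed = table in self.grants.get(token, set())
        self.audit_log.append({"table": table, "allowed": allowed, "at": now})
        if not allowed:
            raise PermissionError(f"no access to {table}")
        hints = hints or {}
        files = [f for f in self.tables[table]
                 if all(f.partition.get(k) == v for k, v in hints.items())]
        # Step 3: return short-lived signed URLs; the data never passes through the server.
        return [presign(f.path, int(now) + ttl_seconds) for f in files]
```

In this toy model, a recipient holding a granted token that queries a table with the hint `{"country": "US"}` gets back URLs only for the US partition files, and those URLs stop validating once `ttl_seconds` have passed.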
Traditional data marketplaces are restricted and offer only data or simple applications, therefore limiting their value to data consumers. They also don’t offer tools to evaluate the data assets beyond basic descriptions or examples. Finally, data delivery is limited, often requiring ETL or a proprietary delivery mechanism. Enterprises need a better way to share data and AI that is flexible, secure and unlocks business value. An ecosystem makes data sharing and collaboration powerful.

**Today, data marketplaces present many challenges and collaboration can be complex for both data consumers and data providers.**

| Data Consumers | Data Providers |
| --- | --- |
| Focus on data only or simple applications | Limited opportunities to monetize new types of assets |
| Lengthy discovery and evaluation | Difficulty reaching more users |
| Delayed time-to-insights with vendor lock-in | Lack of secure technology and unified governance |

#### Challenges in today's data marketplaces

**Data Consumers**

- **Focus on data only or simple applications:** Accessing only data sets means organizations looking to take advantage of AI/ML need to look elsewhere or start from scratch, causing delays in driving business insights.
- **Lengthy discovery and evaluation:** The tools most marketplaces provide for data consumers to evaluate data are simply descriptions and example SQL statements. Minimal evaluation tools mean it takes more time to figure out if a data product is right for you, which might include more time in back-and-forth messages with a provider or searching for a new provider altogether.
- **Delayed time-to-insights with vendor lock-in:** Delivery through proprietary sharing technologies or FTP means either vendor lock-in or lengthy ETL processes to get the data where you need to work with it.

**Data Providers**

- **Limited opportunities to monetize new types of assets:** A data-only approach means organizations are limited in their ability to monetize anything beyond a data set and will face more friction to create new revenue opportunities with non-compatible platforms.
- **Difficulty reaching more users:** Data providers must choose between forgoing potential business or incurring the expense of replicating data.
- **Lack of secure technology and unified governance:** Without open standards for sharing data securely across platforms and clouds, data providers must use multiple tools to secure access to scattered data, leading to inconsistent governance.

#### What is Databricks Marketplace?

Databricks Marketplace is an open marketplace for all your data, analytics and AI, powered by Delta Sharing. Since Marketplace is powered by Delta Sharing, you can benefit from open source flexibility and no vendor lock-in, enabling you to collaborate across all platforms, clouds and regions. This open approach allows you to put your data to work more quickly in every cloud with your tools of choice.

Marketplace brings together a vast ecosystem of data consumers and data providers to collaborate across a wide array of data sets without platform dependencies, complicated ETL, expensive replication and vendor lock-in.

#### Key Benefits of Databricks Marketplace

Databricks Marketplace provides key benefits for both data consumers and data providers.

| Consumers | Providers |
| --- | --- |
| Discover more than just data | Reach users on any platform |
| Evaluate data products faster | Monetize more than just data |
| Avoid vendor lock-in | Share data securely |

#### Databricks Marketplace drives innovation and expands revenue opportunities

##### Data Consumers

For data consumers, the Databricks Marketplace dramatically expands the opportunity to deliver innovation and advance analytics and AI initiatives.
- **Discover more than just data:** Access more than just data sets, including AI models, notebooks, applications and solutions.
- **Evaluate data products faster:** Pre-built notebooks and sample data help you quickly evaluate and have much greater confidence that a data product is right for your AI or analytics initiatives. Obtain the fastest and simplest time to insight.
- **Avoid vendor lock-in:** Substantially reduce the time to deliver insights and avoid lock-in with open and seamless sharing and collaboration across clouds, regions, or platforms. Directly integrate with your tools of choice and right where you work.

##### Data Providers

For data providers, the Databricks Marketplace makes it possible to reach new users and unlock new revenue opportunities.

- **Reach users on any platform:** Expand your reach across platforms and access a massive ecosystem beyond walled gardens. Streamline delivery of simple data sharing to any cloud or region, without replication.
- **Monetize more than just data:** Monetize the broadest set of data assets, including data sets, notebooks and AI models, to reach more data consumers.
- **Share data securely:** Share all your data sets, notebooks, AI models, dashboards and more securely across clouds, regions and data platforms.

#### Enable collaboration and accelerate innovation

#### Powered by a fast, growing ecosystem

Enterprises need open collaboration for data and AI.
In the past few months, we've continued to increase partners across industries, including Retail, Communications and Media & Entertainment, and Financial Services, with 520+ listings you can explore in our open Marketplace from 80+ providers and counting.

#### Use cases for an open marketplace

Organizations across all industries have many use cases for consuming and sharing third-party data, from the simple (dataset joins) to the more advanced (AI notebooks, applications and dashboards).

- **Advertising and Retail:** Incorporate shopper behavior analysis | Ads uplift/performance | Demand forecasting | “Next best SKU” prediction | Inventory analysis | Live weather data
- **Finance:** Incorporate data from stock exchange to predict economic impact | Market research | Public census and housing data to predict insurance sales
- **Healthcare and Life Sciences:** Genomic target identification | Patient risk scoring | Accelerating drug discovery | Commercial effectiveness | Clinical research

For more on Databricks Marketplace, go to [marketplace.databricks.com](http://marketplace.databricks.com), or refer to the Resources section.

#### New upcoming feature: AI model sharing

Nowadays, it may seem like every organization wants to become an AI organization. However, most organizations are new to AI. Databricks has heard from customers that they want to discover out-of-the-box AI models on Marketplace to help them kickstart their AI innovation journey.

To meet this demand, Databricks will be adding AI model sharing capabilities on Marketplace to provide users access to both OSS and proprietary AI models (both first- and third-party). This will enable data consumers and providers to discover and monetize AI models and integrate AI into their data solutions.

Using this feature, data consumers can evaluate AI models with rich previews, including visualizations and pre-built notebooks with sample data.
With Databricks Marketplace, there are no difficult data delivery mechanisms: you can get the AI models instantly with the click of a button. All of this works out-of-the-box with the AI capabilities of the Databricks Lakehouse Platform for both real-time and batch inference. For real-time inference, you can use model serving endpoints. For batch inference, you can invoke the models as functions directly from DBSQL or notebooks.

With AI model sharing, Databricks customers will have access to best-in-class models from leading providers, as well as OSS models published by Databricks which can be quickly and securely applied on top of their data. Databricks will curate and publish its own open source models across common use cases, such as instruction-following and text summarization, and optimize tuning or deployment of these models.

Using AI models from Databricks Marketplace can help your organization summarize complex information quickly and easily to help accelerate the pace of innovation.

-----

## Chapter 6  Share securely with Databricks Clean Rooms

While the demand for external data to make data-driven innovations is greater than ever, there is growing concern among organizations around data privacy. The need for organizations to share data and collaborate with their partners and customers in a secure, governed and privacy-centric way is driving the concept of “data clean rooms.”

#### What is a data clean room?

A data clean room provides a secure, governed and privacy-safe environment where participants can bring their sensitive data, which might include personally identifiable information (PII), and perform joint analysis on that private data. Participants have full control of the data and can decide which participants can perform what analysis without exposing any sensitive data.

**Figure 8:** Data clean room diagram example for audience overlap analysis in advertising. The diagram shows Collaborator A-owned sensitive data and Collaborator B-owned sensitive data (collaborators such as advertisers, agencies, publishers, MVPDs and retailers) meeting in a secure and privacy-preserving data clean room. Example questions: What is our audience overlap? How did my campaign do in terms of reach and frequency? What is the lift in purchases among those in-segment versus those out-of-segment?

A data clean room is not a new concept. Google introduced the idea in 2017 when it announced Ads Data Hub, which allows advertisers to gain impression-level insights about cross-device media campaigns in a more secure, privacy-safe environment. In the last few years, the demand for clean rooms has accelerated. IDC predicts that by 2024, 65% of G2000 enterprises will form data sharing partnerships with external stakeholders via data clean rooms to increase interdependence while safeguarding data privacy. There are various compelling needs driving this demand:

**Privacy-first world.** Stringent data privacy regulations such as GDPR and CCPA, along with sweeping changes in third-party measurement, have transformed how organizations collect, use and share data. For example, Apple’s [App Tracking Transparency Framework](https://developer.apple.com/app-store/user-privacy-and-data-use/) (ATT) provides users of Apple devices the freedom and flexibility to easily opt out of app tracking. Google also plans to [phase out support for third-party cookies in Chrome](https://blog.google/products/chrome/updated-timeline-privacy-sandbox-milestones/) by late 2024. As these privacy laws and practices evolve, the demand for data clean rooms is likely to rise as the industry moves to new identifiers that are PII based, such as UID 2.0, and organizations try to find new ways to share and join data with customers and partners in a privacy-centric way.

**Collaboration in a fragmented ecosystem.** Today, consumers have more options than ever before when it comes to where, when and how they engage with content. As a result, the digital footprint of consumers is fragmented across different platforms, necessitating that companies collaborate with their partners to create a unified view of their customers’ needs and requirements. To facilitate collaboration across organizations, clean rooms provide a secure and private way to combine their data with other data to unlock new insights or capabilities.

**New ways to monetize data.** Most organizations are looking to monetize their data in one form or another. With today’s privacy laws, companies will try to find any possible advantages to monetize their data without the risk of breaking privacy rules. This creates an opportunity for data vendors or publishers to join data for big data analytics without having direct access to the data.

#### Common data clean room use cases

#### Category management for retail and consumer goods

Clean rooms enable real-time collaboration between retailers and suppliers, ensuring secure information exchange for demand forecasting, inventory planning and supply chain optimization. This improves product availability, reduces costs and streamlines operations for both parties.

#### Real-world evidence (RWE) for healthcare

Clean rooms provide secure access to sensitive healthcare data sets, allowing collaborators to connect and query multiple sources of data without compromising data privacy. This supports RWE use cases such as regulatory decisions, safety, clinical trial design and observational research.

#### Audience overlap exploration for media and entertainment

By creating a clean room environment, media companies can securely share their audience data with advertisers or other media partners.
This allows them to perform in-depth analysis and identify shared audience segments without directly accessing or exposing individual user information.

#### Know Your Customer (KYC) in banking

KYC standards are designed to combat financial fraud, money laundering and terrorism financing. Clean rooms can be used within a given jurisdiction to allow financial services companies to collaborate and run shared analytics to build a holistic view of a transaction for investigations.

#### Personalization with expanded interests for retailers

Retailers want to target consumers based on past purchases, as well as other purchases with different retailers. Clean rooms enable retailers to augment their knowledge of consumers to suggest new products and services that are relevant to the individual but have not yet been purchased.

#### 5G data monetization for telecom

5G data monetization enables telecoms to capitalize on data from 5G networks. Clean rooms provide a secure environment for collaboration with trusted partners, ensuring privacy while maximizing data value for optimized services, personalized experiences and targeted advertising.

#### Shortcomings of existing data clean rooms

Organizations exploring clean room options are finding some glaring shortcomings in the existing solutions that limit the full potential of the “clean rooms” concept.

First, many existing data clean room vendors require data to be on the same cloud, same region, and/or same data platform. Participants then have to move data into proprietary platforms, which results in lock-in and additional data storage costs.

Second, most existing solutions are not scalable to expand collaboration beyond a few collaborators at a time. For example, an advertiser might want to get a detailed view of their ad performance across different platforms, which requires analysis of the aggregated data from multiple data publishers.
With collaboration limited to just a few participants, organizations get partial insights on one clean room platform and end up moving their data to another clean room vendor to aggregate the data, incurring the operational overhead of collating partial insights.

Finally, existing clean room solutions do not provide the flexibility to run arbitrary analysis and are mainly restricted to SQL, a subset of Python, and pre-defined templates. While SQL is absolutely needed for clean rooms, there are times when you require complex computations such as machine learning or integration with APIs where SQL doesn’t satisfy the full depth of the technical requirements.

#### Key benefits of Databricks Clean Rooms

Databricks Clean Rooms allow businesses to easily collaborate with their customers and partners in a secure environment on any cloud in a privacy-safe way. Key benefits of Databricks Clean Rooms include:

**Flexible - your language and workload of choice.** Databricks Clean Rooms empower collaborators to share and join their existing data and run complex workloads in any language (Python, R, SQL, Java and Scala) on the data while maintaining data privacy. Beyond traditional SQL, users can run arbitrary workloads and languages, allowing them to train machine learning models, perform inference and utilize open-source or third-party privacy-enhancing technologies. This flexibility enables data scientists and analysts to achieve more comprehensive and advanced data analysis within the secure Clean Room environment.

**Scalable, multi-party collaboration.** With Databricks Clean Rooms, you can launch a clean room and work with multiple collaborators at a time. This capability enables real-time collaboration, fostering efficient and rapid results. Moreover, Databricks Clean Rooms seamlessly integrate with identity service providers, allowing users to leverage offerings from these providers during collaboration.
The ability to collaborate with multiple parties and leverage identity services enhances the overall data collaboration experience within Databricks Clean Rooms.

**Interoperable - any data source with no replication.** Databricks Clean Rooms excel in interoperability, ensuring smooth collaboration across diverse environments. With Delta Sharing, collaborators can seamlessly work together across different cloud providers, regions and even data platforms without the need for extensive data movement. This eliminates data silos and enables organizations to leverage existing infrastructure and data ecosystems while maintaining the utmost security and compliance.

-----

## Resources  Getting started with Data Sharing and Collaboration

Data sharing plays a key role in business processes across the enterprise, from product development and internal operations to customer experience and compliance. However, most businesses have been slow to move forward because of incompatibility between systems, complexity and security concerns.

Data-driven organizations need an open and secure approach to data sharing. Databricks offers an open approach to data sharing and collaboration with a variety of tools to:

**Share across platforms:** You can share live data sets, as well as AI models, dashboards and notebooks across platforms, clouds and regions. This open approach is powered by Delta Sharing, the world’s first open protocol for secure data sharing, which allows organizations to share data for any use case, any tool and on any cloud.

**Share all your data and AI: Databricks Marketplace** is an open marketplace for all your data, analytics and AI, enabling both data consumers and data providers to deliver innovation and advance analytics and AI initiatives.

**Share securely: Databricks Clean Rooms** allows businesses to easily collaborate with customers and partners on any cloud in a privacy-safe way.
With Delta Sharing, clean room participants can securely share data from their data lakes without any data replication across clouds or regions. Your data stays with you without vendor lock-in, and you can centrally audit and monitor the usage of your data. ----- Get started with these products by exploring the resources below. **Delta Sharing**  [Data Sharing on Databricks](https://www.databricks.com/product/delta-sharing) [Learn about Databricks Unity Catalog](https://www.databricks.com/product/unity-catalog) [Blog post: What’s new with Data Sharing and Collaboration on the](https://www.databricks.com/blog/whats-new-data-sharing-and-collaboration-lakehouse) [Lakehouse](https://www.databricks.com/blog/whats-new-data-sharing-and-collaboration-lakehouse) [Learn about open source Delta Sharing](https://delta.io/sharing/) [Video: What’s new with Data Sharing and Collaboration on](https://youtu.be/imSi6dYBXSg?feature=shared) [the Lakehouse](https://youtu.be/imSi6dYBXSg?feature=shared) **Databricks Marketplace** [Learn about Databricks Marketplace](https://www.databricks.com/product/marketplace) [Explore Databricks Marketplace](https://marketplace.databricks.com/) [Video: Databricks Marketplace - Going Beyond Data and](https://youtu.be/d11QcTaqHE4?feature=shared) [Applications](https://youtu.be/d11QcTaqHE4?feature=shared) [Demo: Databricks Marketplace](https://www.databricks.com/resources/demos/videos/data-sharing/marketplace) [AWS Documentation: What is Databricks Marketplace](https://docs.databricks.com/en/marketplace/index.html) [Azure Documentation: What is Databricks Marketplace](https://learn.microsoft.com/en-us/azure/databricks/marketplace/) [AWS Documentation](https://docs.databricks.com/en/data-sharing/index.html) **Databricks Clean Rooms**  [Learn about Databricks Clean Rooms](https://www.databricks.com/product/clean-room) [Video: What’s new with Data Sharing and Collaboration on](https://youtu.be/imSi6dYBXSg?feature=shared) [the 
Lakehouse](https://youtu.be/imSi6dYBXSg?feature=shared) [eBook: The Definitive Guide to Data Clean Rooms](https://www.databricks.com/resources/ebook/market-smarter-data-clean-rooms) [Webinar: Unlock the Power of Secure Data Collaboration](https://events.databricks.com/202304-AMER-VE-Clean-Room-Panel?utm_source=habu&_gl=1*1r1w5jw*_gcl_au*NTc4ODMxMjE4LjE2ODg5MjQ0Njk.*rs_ga*ODM5OTc3OTgtOTdmYy00ZmZhLTkwMTktZTlhYmFhNzlmZWE2*rs_ga_PQSEQ3RZQC*MTY5Mjg4ODIzNzc4NC45OC4xLjE2OTI4ODgzMDYuNTkuMC4w&_ga=2.161567100.1599267366.1692625473-835843671.1688924469) [with Clean Rooms](https://events.databricks.com/202304-AMER-VE-Clean-Room-Panel?utm_source=habu&_gl=1*1r1w5jw*_gcl_au*NTc4ODMxMjE4LjE2ODg5MjQ0Njk.*rs_ga*ODM5OTc3OTgtOTdmYy00ZmZhLTkwMTktZTlhYmFhNzlmZWE2*rs_ga_PQSEQ3RZQC*MTY5Mjg4ODIzNzc4NC45OC4xLjE2OTI4ODgzMDYuNTkuMC4w&_ga=2.161567100.1599267366.1692625473-835843671.1688924469) [Azure Documentation](https://learn.microsoft.com/en-us/azure/databricks/data-sharing/) ----- ## About the Authors **Vuong Nguyen** is a Solution Architect at Databricks, focusing on making analytics and AI simple for customers by leveraging the power of the Databricks Lakehouse Platform. You can reach Vuong on [LinkedIn](https://www.linkedin.com/in/vuong-nguyen) . **Sachin Thakur** is a Principal Product Marketing Manager on the Databricks Data Engineering and Analytics team. His area of focus is data governance with Unity Catalog, and he is passionate about helping organizations democratize data and AI with the Databricks Lakehouse Platform. You can reach Sachin on [LinkedIn](https://www.linkedin.com/in/sachin10thakur/) . **Milos Colic** is a Senior Solution Architect at Databricks. His passion is to help customers with their data exchange and data monetization needs. Furthermore, he is passionate about geospatial data processing and ESG. You can reach Milos on [LinkedIn](https://www.linkedin.com/in/milos-colic/) . **Jay Bhankharia** is a Senior Director on the Databricks Data Partnerships team. 
His passion is to help customers gain insights from data to use the power of the Databricks Lakehouse Platform for their analytics needs. You can reach Jay on [LinkedIn](https://www.linkedin.com/in/jay-bhankharia-cfa-b9835612/). **Itai Weiss** is a Lead Delta Sharing Specialist at Databricks and has over 20 years of experience in helping organizations of any size build data solutions. He focuses on data monetization and loves to help customers and businesses get more value from the data they have. You can reach Itai on [LinkedIn](https://www.linkedin.com/in/itai-weiss/). **Somasekar Natarajan** (Som) is a Solution Architect at Databricks specializing in enterprise data management. Som has worked with Fortune organizations spanning three continents for close to two decades with one objective: helping customers to harness the power of data. You can reach Som on [LinkedIn](https://www.linkedin.com/in/somasekar-natarajan/). **Giselle Goicochea** is a Senior Product Marketing Manager on the Databricks Data Engineering and Analytics team. Her area of focus is data sharing and collaboration with Delta Sharing and Databricks Marketplace. You can reach Giselle on [LinkedIn](https://www.linkedin.com/in/giselle-goicochea/). **Kelly Albano** is a Product Marketing Manager on the Databricks Data Engineering and Analytics team. Her area of focus is security, compliance and Databricks Clean Rooms. You can reach Kelly on [LinkedIn](https://www.linkedin.com/in/kellyalbano/). ----- ##### About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide (including Comcast, Condé Nast, H&M and over 40% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems.
To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . **[Sign up for a free trial](https://databricks.com/try-databricks)** © Databricks 2023 All rights reserved -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/a-new-approach-to-data-sharing-2nd-edition-databricks.pdf,2024-09-19T16:57:20Z
"##### The Delta Lake Series Complete Collection ----- ### What is Delta Lake? [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) is a unified data management system that brings data reliability and fast analytics to cloud data lakes. Delta Lake runs on top of existing data lakes and is fully compatible with Apache Spark™ APIs. At Databricks, we’ve seen how Delta Lake can bring reliability, performance and lifecycle management to data lakes. With Delta Lake, there will be no more malformed data ingestion, difficulties deleting data for compliance, or issues modifying data for data capture. With Delta Lake, you can accelerate the velocity that high-quality data can get into your data lake and the rate that teams can leverage that data with a secure and scalable cloud service. In this eBook, the Databricks team has compiled all of their insights into a comprehensive format so that you can gain a full understanding of Delta Lake and its capabilities. ----- Contents Processes Petabytes With Data Skipping and Z-Ordering Fundamentals & Performance **Here s what** 4 Using data skipping and Z-Order clustering The Fundamentals of Delta Lake: Why Reliability and 5 Exploring the details 21 Performance Matter **you’ll find inside** 5 Features 22 Processes Petabytes With Data Skipping and Z-Ordering Rollbacks 39 Pinned view of a continuously updating Delta Lake table across multiple downstream jobs Queries for time series analytics made simple Easily Clone Your Delta Lake for Testing, Sharing and ML Reproducibility 41 What are clones? 41 A lakehouse combines the best elements of data lakes and data warehouses 52 Some early examples 55 From BI to AI 55 Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake 56 1. Data lakes 57 2. 
**Chapter 01: Fundamentals and Performance**
- The Fundamentals of Delta Lake: Why Reliability and Performance Matter
- Unpacking the Transaction Log
- How to Use Schema Enforcement and Evolution
- Delta Lake DML Internals
- How Delta Lake Quickly Processes Petabytes With Data Skipping and Z-Ordering

**Chapter 02: Features**
- Why Use MERGE With Delta Lake?
- Simple, Reliable Upserts and Deletes on Delta Lake Tables Using Python APIs
- Time Travel for Large-Scale Data Lakes
- Easily Clone Your Delta Lake for Testing, Sharing and ML Reproducibility
- Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0

**Chapter 03: Lakehouse**
- What Is a Lakehouse?
- Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake
- Understanding Delta Engine

**Chapter 04: Streaming**
- How Delta Lake Solves Common Pain Points in Streaming
- USE CASE #1: Simplifying Streaming Stock Data Analysis Using Delta Lake
- USE CASE #2: How Tilting Point Does Streaming Ingestion Into Delta Lake
- USE CASE #3: Building a Quality of Service Analytics Solution for Streaming Video Services

**Chapter 05: Customer Use Cases**
- USE CASE #1: Healthdirect Australia Provides Personalized and Secure Online Patient Care With Databricks
- USE CASE #2: Comcast Uses Delta Lake and MLflow to Transform the Viewer Experience
- USE CASE #3: Banco Hipotecario Personalizes the Banking Experience With Data and ML
- USE CASE #4: Viacom18 Migrates From Hadoop to Databricks to Deliver More Engaging Experiences

----- **Fundamentals and Performance** Boost data reliability for machine learning and business intelligence with Delta Lake ## CHAPTER 01 ----- **The Fundamentals of Delta Lake: Why Reliability and Performance Matter** When it comes to data reliability, performance (the speed at which your programs run) is of utmost importance. Because of the ACID transactional protections that Delta Lake provides, you’re able to get the reliability and performance you need. With Delta Lake, you can stream and batch concurrently, perform CRUD operations, and save money because you’re now using fewer VMs. It’s easier to maintain your data engineering pipelines by taking advantage of streaming, even for batch jobs. Delta Lake is a storage layer that brings reliability to your data lakes built on HDFS and cloud object storage by providing ACID transactions through optimistic concurrency control between writes and snapshot isolation for consistent reads during writes.
Delta Lake also provides built-in data versioning for easy rollbacks and reproducing reports. In this chapter, we’ll share some of the common challenges with data lakes as well as the Delta Lake features that address them. **Challenges with data lakes** Data lakes are a common element within modern data architectures. They serve as a central ingestion point for the plethora of data that organizations seek to gather and mine. While a good step forward in getting to grips with the range of data, they run into the following common problems: ----- **1. Reading and writing into data lakes is not reliable.** Data engineers often run into the problem of unsafe writes into data lakes that cause readers to see garbage data during writes. They have to build workarounds to ensure readers always see consistent data during writes. **2. The data quality in data lakes is low.** Dumping unstructured data into a data lake is easy, but this comes at the cost of data quality. Without any mechanisms for validating schema and the data, data lakes suffer from poor data quality. As a consequence, analytics projects that strive to mine this data also fail. **3. Poor performance with increasing amounts of data.** As the amount of data that gets dumped into a data lake increases, the number of files and directories also increases. Big data jobs and query engines that process the data spend a significant amount of time handling the metadata operations. This problem is more pronounced in the case of streaming jobs or handling many concurrent batch jobs. **4. Modifying, updating or deleting records in data lakes is hard.** Engineers need to build complicated pipelines to read entire partitions or tables, modify the data and write them back. Such pipelines are inefficient and hard to maintain. Because of these challenges, many big data projects fail to deliver on their vision or sometimes just fail altogether. 
We need a solution that enables data practitioners to make use of their existing data lakes, while ensuring data quality. **Delta Lake’s key functionalities** Delta Lake addresses the above problems to simplify how you build your data lakes. Delta Lake offers the following key functionalities: **• ACID transactions:** Delta Lake provides ACID transactions between multiple writes. Every write is a transaction, and there is a serial order for writes recorded in a transaction log. The transaction log tracks writes at file level and uses [optimistic](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) ----- [concurrency control](https://en.wikipedia.org/wiki/Optimistic_concurrency_control) , which is ideally suited for data lakes since multiple writes trying to modify the same files don’t happen that often. In scenarios where there is a conflict, Delta Lake throws a concurrent modification exception for users to handle them and retry their jobs. Delta Lake also offers the highest level of isolation possible ( [serializable isolation](https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable) ) that allows engineers to continuously keep writing to a directory or table and consumers to keep reading from the same directory or table. Readers will see the latest snapshot that existed at the time the reading started. **• Schema management:** Delta Lake automatically validates that the schema of the DataFrame being written is compatible with the schema of the table. Columns that are present in the table but not in the DataFrame are set to null. If there are extra columns in the DataFrame that are not present in the table, this operation throws an exception. Delta Lake has DDL to add new columns explicitly and the ability to update the schema automatically. **• Scalable metadata handling:** Delta Lake stores the metadata information of a table or directory in the transaction log instead of the metastore. 
This allows Delta Lake to list files in large directories in constant time and be efficient while reading data. **• Data versioning and time travel:** Delta Lake allows users to read a previous snapshot of the table or directory. When files are modified during writes, Delta Lake creates newer versions of the files and preserves the older versions. When users want to read the older versions of the table or directory, they can provide a timestamp or a version number to Apache Spark’s read APIs, and Delta Lake constructs the full snapshot as of that timestamp or version based on the information in the transaction log. This allows users to reproduce experiments and reports and also revert a table to its older versions, if needed. **• Unified batch and streaming sink:** Apart from batch writes, Delta Lake can also be used as an efficient streaming sink with [Apache Spark’s structured streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) . Combined with ACID transactions and scalable metadata handling, the efficient streaming sink enables lots of near real-time analytics use cases without having to maintain a complicated streaming and batch pipeline. **• Record update and deletion:** Delta Lake will support merge, update and delete DML commands. This allows engineers to easily upsert and delete records in data lakes and simplify their change data capture and GDPR use cases. Since Delta Lake tracks and modifies data at file-level granularity, it is much more efficient than reading and overwriting entire partitions or tables. **• Data expectations (coming soon):** Delta Lake will also support a new API to set data expectations on tables or directories. Engineers will be able to specify a boolean condition and tune the severity to handle data expectations. 
When Apache Spark jobs write to the table or directory, Delta Lake will automatically validate the records and when there is a violation, it will handle the records based on the severity provided. ----- **Unpacking the** **Transaction Log** The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel and more. The Delta Lake transaction log is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. Delta Lake is built on top of [Apache Spark](https://databricks.com/spark/about) to allow multiple readers and writers of a given table to work on the table at the same time. To show users correct views of the data at all times, the transaction log serves as a single source of truth: the central repository that tracks all changes that users make to the table. When a user reads a Delta Lake table for the first time or runs a new query on an open table that has been modified since the last time it was read, Spark checks the transaction log to see what new transactions are posted to the table. Then, Spark updates the end user’s table with those new changes. This ensures that a user’s version of a table is always synchronized with the master record as of the most recent query and that users cannot make divergent, conflicting changes to a table. In this chapter, we’ll explore how the Delta Lake transaction log offers an elegant solution to the problem of multiple concurrent reads and writes. ----- **Implementing atomicity to ensure** **operations complete fully** Atomicity is one of the four properties of ACID transactions that guarantees that operations (like an INSERT or UPDATE) performed on your [data lake](https://databricks.com/glossary/data-lake) either complete fully or don’t complete at all. 
Without this property, it’s far too easy for a hardware failure or a software bug to cause data to be only partially written to a table, resulting in messy or corrupted data. The transaction log is the mechanism through which Delta Lake is able to offer the guarantee of atomicity. For all intents and purposes, if it’s not recorded in the transaction log, it never happened. By only recording transactions that execute fully and completely, and using that record as the single source of truth, the Delta Lake transaction log allows users to reason about their data and have peace of mind about its fundamental trustworthiness, at petabyte scale. **Dealing with multiple concurrent reads and writes** But how does Delta Lake deal with multiple concurrent reads and writes? Since Delta Lake is powered by Apache Spark, it’s not only possible for multiple users to modify a table at once — it’s expected. To handle these situations, Delta Lake employs **optimistic** **concurrency control** . Optimistic concurrency control is a method of dealing with concurrent transactions that assumes the changes made to a table by different users can complete without conflicting with one another. It is incredibly fast because when dealing with petabytes of data, there’s a high likelihood that users will be working on different parts of the data altogether, allowing them to complete non-conflicting transactions simultaneously. Of course, even with optimistic concurrency control, sometimes users do try to modify the same parts of the data at the same time. Luckily, Delta Lake has a protocol for that. Delta Lake handles these cases by implementing a rule of mutual exclusion, then it attempts to solve any conflict optimistically. This protocol allows Delta Lake to deliver on the ACID principle of isolation, which ensures that the resulting state of the table after multiple, concurrent writes is the same as if those writes had occurred serially, in isolation from one another. 
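The commit protocol described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Delta Lake’s implementation: the log directory layout, the `try_commit` and `commit_with_retry` helpers, and the `ConflictError` class are hypothetical names introduced for illustration. Mutual exclusion comes from creating each numbered commit file in exclusive mode, so only one writer can claim a given version. A writer that loses the race checks whether the winning commit touched the same files and, if it did not, retries at the next version.

```python
import json
import os
import tempfile


class ConflictError(Exception):
    # Hypothetical stand-in for the concurrent modification exception described above.
    pass


def commit_path(log_dir, version):
    # Zero-padded names keep commit files in version order (toy naming scheme).
    return os.path.join(log_dir, f'{version:020d}.json')


def try_commit(log_dir, version, actions):
    # Mode 'x' raises FileExistsError if the file already exists,
    # so at most one writer can create the commit for a given version.
    try:
        with open(commit_path(log_dir, version), 'x') as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, read_version, actions, max_attempts=5):
    # Paths this writer wants to add or remove.
    touched = {action['path'] for action in actions}
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # Another writer claimed this version first: inspect what it changed.
        with open(commit_path(log_dir, version)) as f:
            winner_paths = {action['path'] for action in json.load(f)}
        if winner_paths & touched:
            raise ConflictError(f'version {version} changed the same files')
        version += 1  # No overlap, so retry optimistically at the next version.
    raise ConflictError('too many concurrent commits, giving up')


log_dir = tempfile.mkdtemp()
# Both writers start from an empty log (version -1) and touch different files.
version_a = commit_with_retry(log_dir, -1, [{'op': 'add', 'path': 'part-a.parquet'}])
version_b = commit_with_retry(log_dir, -1, [{'op': 'add', 'path': 'part-b.parquet'}])
print(version_a, version_b)  # 0 1
```

Both writers succeed because they touch different files; the second simply lands one version later, which is the same result as if the writes had happened serially. A writer that tried to remove part-a.parquet from the same stale starting point would get a `ConflictError` instead and would have to re-read the table before retrying.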
----- As all the transactions made on Delta Lake tables are stored directly to disk, this process satisfies the ACID property of durability, meaning it will persist even in the event of system failure. **Time travel, data lineage and debugging** Every table is the result of the sum total of all the commits recorded in the Delta Lake transaction log, no more and no less. The transaction log provides a step-by-step instruction guide, detailing exactly how to get from the table’s original state to its current state. Therefore, we can recreate the state of a table at any point in time by starting with an original table and processing only the commits made up to that point. This powerful ability is known as “time travel,” or data versioning, and can be a lifesaver in any number of situations. For more information, please refer to [Introducing Delta Time Travel for Large-Scale Data Lakes](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html) and [Getting Data Ready for Data Science With Delta Lake and MLflow](https://www.youtube.com/watch?v=hQaENo78za0&list=PLTPXxbhUt-YVPwG3OWNQ-1bJI_s_YRvqP&index=21&t=112s). As the definitive record of every change ever made to a table, the Delta Lake transaction log offers users a verifiable data lineage that is useful for governance, audit and compliance purposes. It can also be used to trace the origin of an inadvertent change or a bug in a pipeline back to the exact action that caused it. Users can run the [DESCRIBE HISTORY](https://docs.delta.io/latest/delta-utility.html#describe-history) command to see metadata around the changes that were made.
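To make the replay idea concrete, here is a minimal sketch of how a table snapshot can be rebuilt from an ordered list of commits. It is a simplification for illustration only: the in-memory `commits` list, the action dictionaries and the `snapshot_at` helper are assumptions, standing in for the add-file and remove-file actions that the real transaction log records.

```python
def snapshot_at(commits, version):
    # Replay commits 0..version in order and return the set of live data files.
    files = set()
    for actions in commits[:version + 1]:
        for action in actions:
            if action['op'] == 'add':
                files.add(action['path'])
            elif action['op'] == 'remove':
                files.discard(action['path'])
    return files


# Each list entry is one commit; its position in the list is its version number.
commits = [
    [{'op': 'add', 'path': 'part-0.parquet'}],
    [{'op': 'add', 'path': 'part-1.parquet'}],
    # An update rewrites part-0 into part-2: remove the old file, add the new one.
    [{'op': 'remove', 'path': 'part-0.parquet'}, {'op': 'add', 'path': 'part-2.parquet'}],
]

print(sorted(snapshot_at(commits, 1)))  # ['part-0.parquet', 'part-1.parquet']
print(sorted(snapshot_at(commits, 2)))  # ['part-1.parquet', 'part-2.parquet']
```

Because old files are only marked as removed in the log rather than deleted immediately, reading version 1 still resolves to part-0.parquet even after version 2 has replaced it.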
**Want to learn more about Delta Lake’s transaction log?** Read our blog post > Watch our tech talk > ----- **How to Use Schema** **Enforcement and** **Evolution** As business problems and requirements evolve over time, so does the structure of your data. With Delta Lake, incorporating new columns or objects is easy; users have access to simple semantics to control the schema of their tables. At the same time, it is important to call out the importance of schema enforcement to prevent users from accidentally polluting their tables with mistakes or garbage data in addition to schema evolution, which enables them to automatically add new columns of rich data when those columns belong. **Schema enforcement rejects any new columns or other schema changes that** **aren’t compatible with your table.** By setting and upholding these high standards, analysts and engineers can trust that their data has the highest levels of integrity and can reason about it with clarity, allowing them to make better business decisions. On the flip side of the coin, schema evolution complements enforcement by making it easy for intended schema changes to take place automatically. After all, it shouldn’t be hard to add a column. Schema enforcement is the yin to schema evolution’s yang. When used together, these features make it easier than ever to block out the noise and tune in to the signal. **Understanding table schemas** Every DataFrame in Apache Spark contains a schema, a blueprint that defines the shape of the data, such as data types and columns, and metadata. With Delta Lake, the table’s schema is saved in JSON format inside the transaction log. ----- **What is schema enforcement?** Schema enforcement, or schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that don’t match the table’s schema. 
Like the front-desk manager at a busy restaurant who only accepts reservations, it checks to see whether each column of data inserted into the table is on its list of expected columns (in other words, whether each one has a “reservation”), and rejects any writes with columns that aren’t on the list. **How does schema enforcement work?** Delta Lake uses **schema validation on write,** which means that all new writes to a table are checked for compatibility with the target table’s schema at write time. If the schema is not compatible, Delta Lake cancels the transaction altogether (no data is written), and raises an exception to let the user know about the mismatch. To determine whether a write to a table is compatible, Delta Lake uses the following rules. The DataFrame to be written cannot contain: **• Any additional columns that are not present in the target table’s schema.** Conversely, it’s OK if the incoming data doesn’t contain every column in the table — those columns will simply be assigned null values. **• Column data types that differ from the column data types in the target table.** If a target table’s column contains StringType data, but the corresponding column in the DataFrame contains IntegerType data, schema enforcement will raise an exception and prevent the write operation from taking place. **• Column names that differ only by case.** This means that you cannot have columns such as “Foo” and “foo” defined in the same table. While Spark can be used in case sensitive or insensitive (default) mode, Delta Lake is case-preserving but insensitive when storing the schema. [Parquet](https://databricks.com/glossary/what-is-parquet) is case sensitive when storing and returning column information. To avoid potential mistakes, data corruption or loss issues (which we’ve personally experienced at Databricks), we decided to add this restriction. 
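The three rules above can be expressed as a small validation function. This is a hedged sketch, not Delta Lake’s code: schemas are modeled as plain dictionaries of column name to type name, `check_write` and `SchemaMismatchError` are hypothetical names, and matching DataFrame columns to table columns case-insensitively is an assumption based on the case-insensitive storage described above.

```python
class SchemaMismatchError(Exception):
    # Hypothetical exception type used only in this sketch.
    pass


def check_write(table_schema, frame_schema):
    # Compare column names case-insensitively (assumption, see lead-in).
    table_types = {name.lower(): dtype for name, dtype in table_schema.items()}
    seen = set()
    for name, dtype in frame_schema.items():
        key = name.lower()
        if key in seen:
            raise SchemaMismatchError(f'column names differ only by case: {name}')
        seen.add(key)
        if key not in table_types:
            raise SchemaMismatchError(f'extra column not in table: {name}')
        if table_types[key] != dtype:
            raise SchemaMismatchError(
                f'type mismatch for {name}: table has {table_types[key]}, DataFrame has {dtype}'
            )
    # Table columns missing from the DataFrame are allowed and would be written as null.
    return [name for name in table_schema if name.lower() not in seen]


table = {'loan_id': 'long', 'amount': 'double', 'addr_state': 'string'}

# A write that omits a column is accepted; the omitted column is null-filled.
print(check_write(table, {'loan_id': 'long', 'amount': 'double'}))  # ['addr_state']

# A write with an unexpected column is rejected as a whole.
try:
    check_write(table, {'loan_id': 'long', 'closed': 'boolean'})
except SchemaMismatchError as err:
    print(err)  # extra column not in table: closed
```

The function raises on the first violation and writes nothing, mirroring the all-or-nothing behavior described above.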
----- Rather than automatically adding the new columns, Delta Lake enforces the schema, and stops the write from occurring. To help identify which column(s) caused the mismatch, Spark prints out both schemas in the stack trace for comparison. **How is schema enforcement useful?** Because it’s such a stringent check, schema enforcement is an excellent tool to use as a gatekeeper for a clean, fully transformed data set that is ready for production or consumption. It’s typically enforced on tables that directly feed: - Machine learning algorithms - BI dashboards - Data analytics and visualization tools - Any production system requiring highly structured, strongly typed, semantic schemas In order to prepare their data for this final hurdle, many users employ a simple multihop architecture that progressively adds structure to their tables. To learn more, take a look at [Productionizing Machine Learning With Delta Lake.](https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html) **What is schema evolution?** Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. Most commonly, it’s used when performing an append or overwrite operation, to automatically adapt the schema to include one or more new columns. **How does schema evolution work?** Following up on the example from the previous section, developers can easily use schema evolution to add the new columns that were previously rejected due to a schema mismatch. Schema evolution is activated by adding .option(‘mergeSchema’, ‘true’) to your .write or .writeStream Spark command, as shown in the following example. 

    # Add the mergeSchema option
    loans.write.format("delta") \
        .option("mergeSchema", "true") \
        .mode("append") \
        .save(DELTALAKE_SILVER_PATH)

By including the mergeSchema option in your query, any columns that are present in the DataFrame but not in the target table are automatically added to the end of the schema as part of a write transaction. Nested fields can also be added, and these fields will get added to the end of their respective struct columns as well.

Data engineers and scientists can use this option to add new columns (perhaps a newly tracked metric, or a column of this month’s sales figures) to their existing ML production tables without breaking existing models that rely on the old columns.

The following types of schema changes are eligible for schema evolution during table appends or overwrites:

- Adding new columns (this is the most common scenario)
- Changing of data types from NullType → any other type, or upcasts from ByteType → ShortType → IntegerType

Other changes, not eligible for schema evolution, require that the schema and data are overwritten by adding .option("overwriteSchema", "true"). Those changes include:

- Dropping a column
- Changing an existing column’s data type (in place)
- Renaming column names that differ only by case (e.g., “Foo” and “foo”)

-----

Finally, with the release of Spark 3.0, explicit DDL (using ALTER TABLE) is fully supported, allowing users to perform the following actions on table schemas:

- Adding columns
- Changing column comments
- Setting table properties that define the behavior of the table, such as setting the retention duration of the transaction log

**How is schema evolution useful?**

Schema evolution can be used anytime you _intend_ to change the schema of your table (as opposed to where you accidentally added columns to your DataFrame that shouldn’t be there).
It’s the easiest way to migrate your schema because it automatically adds the correct column names and data types, without having to declare them explicitly.

**Summary**

Schema enforcement rejects any new columns or other schema changes that aren’t compatible with your table. By setting and upholding these high standards, analysts and engineers can trust that their data has the highest levels of integrity and can reason about it with clarity, allowing them to make better business decisions.

On the flip side of the coin, schema evolution complements enforcement by making it easy for intended schema changes to take place automatically. After all, it shouldn’t be hard to add a column.

Schema enforcement is the yin to schema evolution’s yang. When used together, these features make it easier than ever to block out the noise and tune in to the signal.

-----

**Delta Lake DML Internals**

Delta Lake supports data manipulation language (DML) commands including UPDATE, DELETE and MERGE. These commands simplify change data capture (CDC), audit and governance, and GDPR/CCPA workflows, among others. In this chapter, we will demonstrate how to use each of these DML commands, describe what Delta Lake is doing behind the scenes, and offer some performance tuning tips for each one.

**Delta Lake DML: UPDATE**

You can use the UPDATE operation to selectively update any rows that match a filtering condition, also known as a predicate. The code below demonstrates how to use a predicate as part of an UPDATE statement. Note that Delta Lake offers APIs for Python, Scala and SQL, but for the purposes of this eBook, we’ll include only the SQL code.

    -- Update events
    UPDATE events SET eventType = 'click' WHERE buttonPress = 1

-----

**UPDATE: Under the hood**

Delta Lake performs an UPDATE on a table in two steps:

1. Find and select the files containing data that match the predicate and, therefore, need to be updated. Delta Lake uses [data skipping](https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping) whenever possible to speed up this process.
2. Read each matching file into memory, update the relevant rows, and write out the result into a new data file.

Once Delta Lake has executed the UPDATE successfully, it adds a commit in the transaction log indicating that the new data file will be used in place of the old one from now on. The old data file is not deleted, though. Instead, it’s simply “tombstoned”: recorded as a data file that applied to an older version of the table, but not the current version. Delta Lake is able to use it to provide data versioning and time travel.

**UPDATE + Delta Lake time travel = Easy debugging**

Keeping the old data files turns out to be very useful for debugging because you can use Delta Lake “time travel” to go back and query previous versions of a table at any time. In the event that you update your table incorrectly and want to figure out what happened, you can easily compare two versions of a table to one another to see what has changed.

    SELECT * FROM events VERSION AS OF 11
    EXCEPT ALL
    SELECT * FROM events VERSION AS OF 12

**UPDATE: Performance tuning tips**

The main way to improve the performance of the UPDATE command on Delta Lake is to add more predicates to narrow down the search space. The more specific the search, the fewer files Delta Lake needs to scan and/or modify.

**Delta Lake DML: DELETE**

You can use the DELETE command to selectively delete rows based upon a predicate (filtering condition).

    DELETE FROM events WHERE date < '2017-01-01'

-----

In the event that you want to revert an accidental DELETE operation, you can use time travel to roll back your table to the way it was.

**DELETE: Under the hood**

DELETE works just like UPDATE under the hood.
Delta Lake makes two scans of the data: The first scan is to identify any data files that contain rows matching the predicate condition. The second scan reads the matching data files into memory, at which point Delta Lake deletes the rows in question before writing out the newly clean data to disk.

After Delta Lake completes a DELETE operation successfully, the old data files are not deleted entirely; they’re still retained on disk, but recorded as “tombstoned” (no longer part of the active table) in the Delta Lake transaction log. Remember, those old files aren’t deleted immediately because you might still need them to time travel back to an earlier version of the table. If you want to delete files older than a certain time period, you can use the VACUUM command.

**DELETE + VACUUM: Cleaning up old data files**

Running the VACUUM command permanently deletes all data files that are:

1. No longer part of the active table, and
2. Older than the retention threshold, which is seven days by default

Delta Lake does not automatically VACUUM old files; you must run the command yourself, as shown below. If you want to specify a retention period that is different from the default of seven days, you can provide it as a parameter (in hours).

    from delta.tables import *

    # deltaTable is assumed to be an existing DeltaTable handle
    # Vacuum files older than 30 days (720 hours)
    deltaTable.vacuum(720)

-----

**DELETE: Performance tuning tips**

Just like with the UPDATE command, the main way to improve the performance of a DELETE operation on Delta Lake is to add more predicates to narrow down the search space. The Databricks managed version of Delta Lake also features other performance enhancements like improved [data skipping](https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping), the use of bloom filters, and [Z-Order Optimize](https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering) (multi-dimensional clustering).
[Read more about Z-Order Optimize](https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering) [on Databricks.](https://docs.databricks.com/delta/optimizations/file-mgmt.html#z-ordering-multi-dimensional-clustering) **Delta Lake DML: MERGE** The Delta Lake MERGE command allows you to perform upserts, which are a mix of an UPDATE and an INSERT. To understand upserts, imagine that you have an existing table (aka a target table), and a source table that contains a mix of new records and updates to existing records. **Here’s how an upsert works:** - When a record from the source table matches a preexisting record in the target table, Delta Lake updates the record. - When there is no such match, Delta Lake inserts the new record. The Delta Lake MERGE command greatly simplifies workflows that can be complex and cumbersome with other traditional data formats like Parquet. Common scenarios where merges/upserts come in handy include change data capture, GDPR/CCPA compliance, sessionization, and deduplication of records. **For more information about upserts, read:** [Efficient Upserts Into Data Lakes With Databricks Delta](https://databricks.com/blog/2019/03/19/efficient-upserts-into-data-lakes-databricks-delta.html) [Simple, Reliable Upserts and Deletes on Delta Lake Tables Using Python APIs](https://databricks.com/blog/2019/10/03/simple-reliable-upserts-and-deletes-on-delta-lake-tables-using-python-apis.html) [Schema Evolution in Merge Operations and Operational Metrics in Delta Lake](https://databricks.com/blog/2020/05/19/schema-evolution-in-merge-operations-and-operational-metrics-in-delta-lake.html) ----- **MERGE: Under the hood** Delta Lake completes a MERGE in two steps: 1. Perform an inner join between the target table and source table to select all files that have matches. 2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data. 
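
The two steps above can be sketched in plain Python. This is only a toy model under stated assumptions: each "file" is a list of row dicts, the names `merge_upsert`, `target_files` and `key` are hypothetical, and it covers only the update-or-insert case (no DELETE clause or extra conditions). Real Delta Lake performs these steps as distributed Spark joins.

```python
def merge_upsert(target_files, source_rows, key):
    """Toy upsert. target_files maps file name -> list of row dicts."""
    source_by_key = {row[key]: row for row in source_rows}

    # Step 1 (inner join): find target files holding at least one matching key.
    touched = [
        name for name, rows in target_files.items()
        if any(row[key] in source_by_key for row in rows)
    ]

    # Step 2 (outer join): rewrite only the touched files, updating matched
    # rows and adding source rows that matched nothing in the target.
    matched_keys = set()
    rewritten = []
    for name in touched:
        for row in target_files[name]:
            if row[key] in source_by_key:
                matched_keys.add(row[key])
                rewritten.append(source_by_key[row[key]])  # WHEN MATCHED: update
            else:
                rewritten.append(row)  # unmatched row in a touched file is copied
    for k, row in source_by_key.items():
        if k not in matched_keys:
            rewritten.append(row)  # WHEN NOT MATCHED: insert

    # Untouched files are kept as they are; touched files are replaced.
    result = {n: rows for n, rows in target_files.items() if n not in touched}
    if rewritten:
        result["part-new"] = rewritten
    return result, touched


target = {
    "part-0": [{"userId": 1, "address": "A"}, {"userId": 2, "address": "B"}],
    "part-1": [{"userId": 3, "address": "C"}],
}
updates = [{"userId": 2, "address": "B2"}, {"userId": 9, "address": "Z"}]

new_files, rewritten_files = merge_upsert(target, updates, "userId")
print(rewritten_files)  # only part-0 contains a matching key
```

Note that the unmatched row for `userId` 1 is copied along with the rewritten file, which is why very large files make rewrites more expensive.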
The main way that this differs from an UPDATE or a DELETE under the hood is that Delta Lake uses joins to complete a MERGE. This fact allows us to utilize some unique strategies when seeking to improve performance. **MERGE: Performance tuning tips** To improve performance of the MERGE command, you need to determine which of the two joins that make up the merge is limiting your speed. If the inner join is the bottleneck (i.e., finding the files that Delta Lake needs to rewrite takes too long), try the following strategies: - Add more predicates to narrow down the search space. - Adjust shuffle partitions. - Adjust broadcast join thresholds. - Compact the small files in the table if there are lots of them, but don’t compact them into files that are too large, since Delta Lake has to copy the entire file to rewrite it. **On Databricks’ managed Delta Lake, use Z-Order optimize to exploit the** **locality of updates.** On the other hand, if the outer join is the bottleneck (i.e., rewriting the actual files themselves takes too long), try the strategies below. - **Adjust shuffle partitions:** Reduce files by enabling automatic repartitioning before writes (with Optimized Writes in Databricks Delta Lake). - **Adjust broadcast thresholds:** If you’re doing a full outer join, Spark cannot do a broadcast join, but if you’re doing a right outer join, Spark can do one, and you can adjust the broadcast thresholds as needed. - **Cache the source table / DataFrame:** Caching the source table can speed up the second scan, but be sure not to cache the target table, as this can lead to cache coherency issues. Delta Lake supports DML commands including UPDATE, DELETE and MERGE INTO, which greatly simplify the workflow for many common big data operations. In this chapter, we demonstrated how to use these commands in Delta Lake, shared information about how each one works under the hood, and offered some performance tuning tips. 
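
To tie the chapter together, here is a minimal, self-contained Python model of the copy-on-write behavior described for UPDATE and DELETE: matching files are rewritten, old files are tombstoned rather than removed, earlier versions stay queryable, and a vacuum step drops tombstoned files past a retention threshold. The class `ToyDeltaTable` and all of its method names are hypothetical; this is a teaching sketch, not how Delta Lake is implemented.

```python
import itertools


class ToyDeltaTable:
    """Toy copy-on-write table. Each version is the set of active file names."""

    def __init__(self, files):
        self._ids = itertools.count()
        self.storage = {}        # file name -> list of row dicts (on "disk")
        self.tombstoned_at = {}  # file name -> hour at which it was tombstoned
        self.versions = [set()]  # transaction log: active files per version
        for rows in files:
            self.versions[0].add(self._write(rows))

    def _write(self, rows):
        name = f"part-{next(self._ids)}"
        self.storage[name] = rows
        return name

    def read(self, version=-1):
        """Time travel: return the rows of any version still on disk."""
        return [row for name in sorted(self.versions[version])
                for row in self.storage[name]]

    def update(self, predicate, change, now):
        active = set(self.versions[-1])
        for name in sorted(active):
            rows = self.storage[name]
            # Step 1: skip files with no row matching the predicate.
            if not any(predicate(r) for r in rows):
                continue
            # Step 2: write a new file containing the updated rows.
            new_rows = [{**r, **change} if predicate(r) else r for r in rows]
            active.remove(name)
            active.add(self._write(new_rows))
            self.tombstoned_at[name] = now  # old file is kept, but tombstoned
        self.versions.append(active)        # commit a new table version

    def vacuum(self, now, retention_hours=168):
        """Delete tombstoned files older than the retention threshold."""
        for name, ts in list(self.tombstoned_at.items()):
            if now - ts > retention_hours:
                del self.storage[name]
                del self.tombstoned_at[name]


table = ToyDeltaTable([
    [{"id": 1, "buttonPress": 1, "eventType": "view"}],
    [{"id": 2, "buttonPress": 0, "eventType": "view"}],
])
table.update(lambda r: r["buttonPress"] == 1, {"eventType": "click"}, now=0)

print(table.read(version=0))  # the old version is still readable
print(table.read())           # the current version shows the update
table.vacuum(now=200)         # 200 hours later the tombstoned file is removed
print(sorted(table.storage))
```

After the vacuum call, version 0 can no longer be read in this model because one of its files is gone, which is the trade-off between cleanup and time travel that the retention threshold controls.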
**Want a deeper dive into DML internals, including snippets of code?** [Read our blog post >](https://databricks.com/blog/2020/09/29/diving-into-delta-lake-dml-internals-update-delete-merge.html)

-----

**How Delta Lake Quickly Processes Petabytes With Data Skipping and Z-Ordering**

Delta Lake is capable of sifting through petabytes of data within seconds. Much of this speed is owed to two features: (1) data skipping and (2) Z-Ordering.

Combining these features helps the [Databricks Runtime](https://databricks.com/product/databricks-runtime) to dramatically reduce the amount of data that needs to be scanned to answer selective queries against large Delta tables, which typically translates into substantial runtime improvements and cost savings.

Using Delta Lake’s built-in data skipping and ZORDER clustering features, large cloud data lakes can be queried in a matter of seconds by skipping files not relevant to the query. For example, 93.2% of the records in a 504 TB data set were skipped for a typical query in a real-world cybersecurity analysis use case, reducing query times by up to two orders of magnitude. In other words, Delta Lake can speed up your queries by as much as 100x.

**Want to see data skipping and Z-Ordering in action?**

Apple’s Dominique Brezinski and Databricks’ Michael Armbrust demonstrated how to use Delta Lake as a unified solution for data engineering and data science in the context of cybersecurity monitoring and threat response. Watch their keynote speech, [Threat Detection and Response at Scale.](https://databricks.com/session/keynote-from-apple)

-----

AND / OR / NOT are also supported as well as “literal op column” predicates.

Even though data skipping kicks in when the above conditions are met, it may not always be effective.
But, if there are a few columns that you frequently filter by and you want to make sure those queries are fast, then you can explicitly optimize your data layout with respect to skipping effectiveness by running the following command:

    OPTIMIZE <table> [WHERE <partition_filter>] ZORDER BY (<column>[, …])

**Exploring the details**

Apart from partition pruning, another common technique that’s used in the data warehousing world, but which Spark currently lacks, is I/O pruning based on [small materialized aggregates](https://dl.acm.org/doi/10.5555/645924.671173). In short, the idea is to keep track of simple statistics, such as minimum and maximum values, at a granularity that is correlated with I/O granularity, and to leverage those statistics at query planning time in order to avoid unnecessary I/O.

This is exactly what Delta Lake’s [data skipping](https://docs.databricks.com/delta/optimizations/file-mgmt.html#data-skipping) feature is about. As new data is inserted into a Delta Lake table, file-level min/max statistics are collected for all columns (including nested ones) of supported types. Then, when there’s a lookup query against the table, Delta Lake first consults these statistics in order to determine which files can safely be skipped.

**Want to learn more about data skipping and Z-Ordering, including how to apply it within a cybersecurity analysis?** [Read our blog post >](https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html)

**Using data skipping and Z-Order clustering**

Data skipping and Z-Ordering are used to improve the performance of needle-in-the-haystack queries against huge data sets.
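
To make the file-level min/max idea from “Exploring the details” concrete, here is a small pure-Python sketch. It assumes each file carries per-column (min, max) statistics and handles only `=`, `<` and `>` filters of the form “column op literal”; the names `files_to_scan`, `may_contain` and `FILE_STATS` are hypothetical and do not correspond to Delta Lake internals.

```python
# Per-file (min, max) statistics for one column, as might be collected at write time.
FILE_STATS = {
    "part-0": {"event_time": (100, 199)},
    "part-1": {"event_time": (200, 299)},
    "part-2": {"event_time": (300, 399)},
}


def may_contain(stats, op, literal):
    """Return False only when the file provably holds no matching value."""
    lo, hi = stats
    if op == "=":
        return lo <= literal <= hi
    if op == "<":
        return lo < literal
    if op == ">":
        return hi > literal
    return True  # unknown operator: never skip, to stay correct


def files_to_scan(file_stats, column, op, literal):
    """Keep files whose min/max range could satisfy `column op literal`."""
    return [
        name for name, cols in file_stats.items()
        # Files without statistics for the column must always be scanned.
        if column not in cols or may_contain(cols[column], op, literal)
    ]


print(files_to_scan(FILE_STATS, "event_time", "=", 250))  # ['part-1']
print(files_to_scan(FILE_STATS, "event_time", ">", 250))  # ['part-1', 'part-2']
```

Skipping is only as good as the ranges are narrow. When values for a frequently filtered column are clustered into the same files, each file's min/max range shrinks and more files fall outside a given filter.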
Data skipping is an automatic feature of Delta Lake, kicking in whenever your SQL queries or data set operations include filters of the form “column op literal,” where:

- column is an attribute of some Delta Lake table, be it top-level or nested, whose data type is string / numeric / date / timestamp
- op is a binary comparison operator, StartsWith / LIKE ‘pattern%’, or IN
- literal is an explicit (list of) value(s) of the same data type as the column

-----

## CHAPTER 02

**Features**

Use Delta Lake’s robust features to reliably manage your data

-----

**Why Use MERGE With Delta Lake?**

[Delta Lake](https://databricks.com/product/delta-lake-on-databricks), the next-generation engine built on top of Apache Spark, supports the MERGE command, which allows you to efficiently upsert and delete records in your data lakes.

MERGE dramatically simplifies how a number of common data pipelines can be built: all the complicated multi-hop processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for various use cases ranging from change data capture to GDPR. You no longer need to write complicated logic to overwrite tables and overcome a lack of snapshot isolation.

With changing data, another critical capability is the ability to roll back in case of bad writes. Delta Lake also offers [rollback capabilities with the Time Travel feature](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html), so that if you do a bad merge, you can easily roll back to an earlier version.

In this chapter, we’ll discuss common use cases where existing data might need to be updated or deleted. We’ll also explore the challenges inherent to upserts and explain how MERGE can address them.
----- **When are upserts necessary?** There are a number of common use cases where existing data in a data lake needs to be updated or deleted: -  **General Data Protection Regulation (GDPR) compliance:** With the introduction of the right to be forgotten (also known as data erasure) in GDPR, organizations must remove a user’s information upon request. This data erasure includes deleting user information in the data lake as well. - **Change data capture from traditional databases:** In a service-oriented architecture, typically web and mobile applications are served by microservices built on traditional SQL/NoSQL databases that are optimized for low latency. One of the biggest challenges organizations face is joining data across these various siloed data systems, and hence data engineers build pipelines to consolidate all data sources into a central data lake to facilitate analytics. These pipelines often have to periodically read changes made on a traditional SQL/NoSQL table and apply them to corresponding tables in the data lake. Such changes can take various forms: Tables with slowly changing dimensions, change data capture of all inserted/updated/deleted rows, etc. -  **Sessionization:** Grouping multiple events into a single session is a common use case in many areas ranging from product analytics to targeted advertising to predictive maintenance. Building continuous applications to track sessions and recording the results that write into data lakes is difficult because data lakes have always been optimized for appending data. - **De-duplication:** A common data pipeline use case is to collect system logs into a Delta Lake table by appending data to the table. However, often the sources can generate duplicate records and downstream de-duplication steps are needed to take care of them. 
-----

**Why upserts into data lakes have traditionally been challenging**

Since data lakes are fundamentally based on files, they have always been optimized for appending data rather than for changing existing data. Hence, building the above use cases has always been challenging. Users typically read the entire table (or a subset of partitions) and then overwrite them. Therefore, every organization tries to reinvent the wheel for its requirements by handwriting complicated queries in SQL, Spark, etc. This approach is:

- **Inefficient:** Reading and rewriting entire partitions (or entire tables) to update a few records causes pipelines to be slow and costly. Hand-tuning the table layout and query optimization is tedious and requires deep domain knowledge.
- **Possibly incorrect:** Handwritten code modifying data is very prone to logical and human errors. For example, multiple pipelines concurrently modifying the same table without any transactional support can lead to unpredictable data inconsistencies and, in the worst case, data loss. Often, even a single handwritten pipeline can easily cause data corruption due to errors in encoding the business logic.
- **Hard to maintain:** Fundamentally, such handwritten code is hard to understand, keep track of and maintain. In the long term, this alone can significantly increase the organizational and infrastructural costs.

**Introducing MERGE in Delta Lake**

With Delta Lake, you can easily address the use cases above without any of the aforementioned problems using the following MERGE command:

    MERGE INTO <target_table>
    USING <source_table>
    ON <merge_condition>
    [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
    [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]

    where

    <matched_action> =
      DELETE |
      UPDATE SET * |
      UPDATE SET column1 = value1 [, column2 = value2 ...]

    <not_matched_action> =
      INSERT * |
      INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])

Let’s understand how to use MERGE with a simple example.
Suppose you have a [slowly changing dimension](https://en.wikipedia.org/wiki/Slowly_changing_dimension) table that maintains user information like addresses. Furthermore, you have a table of new addresses for both existing and new users. To merge all the new addresses to the main user table, you can run the following:

    MERGE INTO users
    USING updates
    ON users.userId = updates.userId
    WHEN MATCHED THEN
      UPDATE SET address = updates.address
    WHEN NOT MATCHED THEN
      INSERT (userId, address) VALUES (updates.userId, updates.address)

This will perform exactly what the syntax says: for existing users (i.e., MATCHED clause), it will update the address column, and for new users (i.e., NOT MATCHED clause) it will insert all the columns. For large tables with TBs of data, this Delta Lake MERGE operation can be orders of magnitude faster than overwriting entire partitions or tables since Delta Lake reads only relevant files and updates them. Specifically, Delta Lake's MERGE has the following advantages:

- **Fine-grained:** The operation rewrites data at the granularity of files and not partitions. This eliminates all the complications of rewriting partitions, updating the Hive metastore with MSCK and so on.
- **Efficient:** Delta Lake’s data skipping makes the MERGE efficient at finding files to rewrite, thus eliminating the need to hand-optimize your pipeline. Furthermore, Delta Lake with all its I/O and processing optimizations makes all the reading and writing of data by MERGE significantly faster than similar operations in Apache Spark.
- **Transactional:** Delta Lake uses optimistic concurrency control to ensure that concurrent writers update the data correctly with ACID transactions, and concurrent readers always see a consistent snapshot of the data.

Here is a visual explanation of how MERGE compares with handwritten pipelines.

-----

**Simplifying use cases with MERGE**

**Deleting data due to GDPR**

Complying with the “right to be forgotten” clause of GDPR for data in data lakes cannot get any easier. You can set up a simple scheduled job with example code like that below to delete all the users who have opted out of your service.

    MERGE INTO users
    USING opted_out_users
    ON opted_out_users.userId = users.userId
    WHEN MATCHED THEN DELETE

**Applying change data from databases**

You can easily apply all data changes (updates, deletes, inserts) generated from an external database into a Delta Lake table with the MERGE syntax as follows:

    MERGE INTO users
    USING (
      SELECT userId, latest.address AS address, latest.deleted AS deleted
      FROM (
        SELECT userId, MAX(struct(time, address, deleted)) AS latest
        FROM changes
        GROUP BY userId
      )
    ) latestChange
    ON latestChange.userId = users.userId
    WHEN MATCHED AND latestChange.deleted = TRUE THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET address = latestChange.address
    WHEN NOT MATCHED AND latestChange.deleted = FALSE THEN
      INSERT (userId, address) VALUES (userId, address)

-----

**Updating session information from streaming pipelines**

If you have streaming event data flowing in and you want to sessionize it and incrementally update and store sessions in a Delta Lake table, you can accomplish this using foreachBatch in Structured Streaming and MERGE.
For example, suppose you have a Structured Streaming DataFrame that computes updated session information for each user. You can start a streaming query that applies all the session updates to a Delta Lake table as follows (Scala).

    import org.apache.spark.sql.DataFrame

    streamingSessionUpdatesDF.writeStream
      .foreachBatch { (microBatchOutputDF: DataFrame, batchId: Long) =>
        // Expose the micro-batch as a temporary view so MERGE can reference it
        microBatchOutputDF.createOrReplaceTempView("updates")
        microBatchOutputDF.sparkSession.sql(s"""
          MERGE INTO sessions
          USING updates
          ON sessions.sessionId = updates.sessionId
          WHEN MATCHED THEN UPDATE SET *
          WHEN NOT MATCHED THEN INSERT *
        """)
      }.start()

For a complete working example of foreachBatch and MERGE, see this notebook ([Azure](https://docs.azuredatabricks.net/_static/notebooks/merge-in-streaming.html) | [AWS](https://docs.databricks.com/_static/notebooks/merge-in-streaming.html)).

**Additional resources**

- [Tech Talk | Addressing GDPR and CCPA Scenarios With Delta Lake and Apache Spark](https://www.youtube.com/watch?v=tCPslvUjG1w)
- [Tech Talk | Using Delta as a Change Data Capture Source](https://www.youtube.com/watch?v=7y0AAQ6qX5w)
- [Simplifying Change Data Capture With Databricks Delta](https://databricks.com/blog/2018/10/29/simplifying-change-data-capture-with-databricks-delta.html)
- [Building Sessionization Pipeline at Scale With Databricks Delta](https://databricks.com/session/building-sessionization-pipeline-at-scale-with-databricks-delta)
- [Tech Chat | Slowly Changing Dimensions (SCD) Type 2](https://www.youtube.com/watch?v=HZWwZG07hzQ)

-----

**Simple, Reliable Upserts and Deletes on Delta Lake Tables Using Python APIs**

In this chapter, we will demonstrate how to use Python and the new Python APIs in Delta Lake within the context of an on-time flight performance scenario. We will show how to upsert and delete data, query old versions of data with time travel, and vacuum older versions for cleanup.

**How to start using Delta Lake**

The Delta Lake package is installable through PySpark by using the --packages option.
In our example, we will also demonstrate the ability to VACUUM files and execute Delta Lake SQL commands within Apache Spark. As this is a short demonstration, we will also enable the following configurations:

spark.databricks.delta.retentionDurationCheck.enabled=false to allow us to vacuum files younger than the default retention duration of seven days. Note, this is only required for the SQL command VACUUM.

spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension to enable Delta Lake SQL commands within Apache Spark; this is not required for Python or Scala API calls.

    # Using Spark Packages
    ./bin/pyspark --packages io.delta:delta-core_2.11:0.4.0 \
      --conf "spark.databricks.delta.retentionDurationCheck.enabled=false" \
      --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension"

-----

**Loading and saving our Delta Lake data**

This scenario will be using the On-Time Flight Performance or Departure Delays data set generated from the RITA BTS Flight Departure Statistics; some examples of this data in action include the [2014 Flight Departure Performance via d3.js Crossfilter](https://dennyglee.com/2014/06/06/2014-flight-departure-performance-via-d3-js-crossfilter/) and OnTime Flight Performance with GraphFrames for Apache Spark™. Within PySpark, start by reading the data set.

    # Location variables
    tripdelaysFilePath = "/root/data/departuredelays.csv"
    pathToEventsTable = "/root/deltalake/departureDelays.delta"

    # Read flight delay data
    departureDelays = spark.read \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv(tripdelaysFilePath)

Next, let’s save our departureDelays data set to a Delta Lake table. By saving this table to Delta Lake storage, we will be able to take advantage of its features including ACID transactions, unified batch and streaming and time travel.

    # Save flight delay data into Delta Lake format
    departureDelays \
        .write \
        .format("delta") \
        .mode("overwrite") \
        .save("departureDelays.delta")

Note, this approach is similar to how you would normally save Parquet data; instead of specifying format("parquet"), you will now specify format("delta"). If you were to take a look at the underlying file system, you will notice four files created for the departureDelays Delta Lake table.

    /departureDelays.delta$ ls -l
    .
    ..
    _delta_log
    part-00000-df6f69ea-e6aa-424b-bc0e-f3674c4f1906-c000.snappy.parquet
    part-00001-711bcce3-fe9e-466e-a22c-8256f8b54930-c000.snappy.parquet
    part-00002-778ba97d-89b8-4942-a495-5f6238830b68-c000.snappy.parquet
    part-00003-1a791c4a-6f11-49a8-8837-8093a3220581-c000.snappy.parquet

-----

Now, let’s reload the data, but this time our DataFrame will be backed by Delta Lake.

    # Load flight delay data in Delta Lake format
    delays_delta = spark \
        .read \
        .format("delta") \
        .load("departureDelays.delta")

    # Create temporary view
    delays_delta.createOrReplaceTempView("delays_delta")

    # How many flights are between Seattle and San Francisco
    spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").show()

Finally, let’s determine the number of flights originating from Seattle to San Francisco; in this data set, there are 1698 flights.

**In-place conversion to Delta Lake**

If you have existing Parquet tables, you have the ability to convert them to Delta Lake format in place, thus not needing to rewrite your table. To convert the table, you can run the following commands.

    from delta.tables import *

    # Convert non partitioned parquet table at path '/path/to/table'
    deltaTable = DeltaTable.convertToDelta(spark, "parquet.`/path/to/table`")

    # Convert partitioned parquet table at path '/path/to/table' and partitioned by integer column named 'part'
    partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`/path/to/table`", "part int")

**Delete our flight data**

To delete data from a traditional data lake table, you will need to:

1. Select all of the data from your table not including the rows you want to delete
2. Create a new table based on the previous query
3. Delete the original table
4. Rename the new table to the original table name for downstream dependencies

Instead of performing all of these steps, with Delta Lake, we can simplify this process by running a DELETE statement. To show this, let’s delete all of the flights that had arrived early or on-time (i.e., delay < 0).

    from delta.tables import *
    from pyspark.sql.functions import *

    # Access the Delta Lake table
    deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

    # Delete all on-time and early flights
    deltaTable.delete("delay < 0")

    # How many flights are between Seattle and San Francisco
    spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").show()

After we delete (more on this below) all of the on-time and early flights, as you can see from the preceding query there are 837 late flights originating from Seattle to San Francisco. If you review the file system, you will notice there are more files even though you deleted data.

    /departureDelays.delta$ ls -l
    _delta_log
    part-00000-a2a19ba4-17e9-4931-9bbf-3c9d4997780b-c000.snappy.parquet
    part-00000-df6f69ea-e6aa-424b-bc0e-f3674c4f1906-c000.snappy.parquet
    part-00001-711bcce3-fe9e-466e-a22c-8256f8b54930-c000.snappy.parquet
    part-00001-a0423a18-62eb-46b3-a82f-ca9aac1f1e93-c000.snappy.parquet
    part-00002-778ba97d-89b8-4942-a495-5f6238830b68-c000.snappy.parquet
    part-00002-bfaa0a2a-0a31-4abf-aa63-162402f802cc-c000.snappy.parquet
    part-00003-1a791c4a-6f11-49a8-8837-8093a3220581-c000.snappy.parquet
    part-00003-b0247e1d-f5ce-4b45-91cd-16413c784a66-c000.snappy.parquet

-----

In traditional data lakes, deletes are performed by rewriting the entire table excluding the values to be deleted. With Delta Lake, deletes are instead performed by selectively writing new versions of the files containing the data to be deleted, while the previous files are only marked as deleted. This is because Delta Lake uses multiversion concurrency control (MVCC) to do atomic operations on the table: For example, while one user is deleting data, another user may be querying the previous version of the table. This multiversion model also enables us to travel back in time (i.e., [time travel](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html)) and query previous versions as we will see later.

**Update our flight data**

To update data from your traditional Data Lake table, you will need to:

1. Select all of the data from your table not including the rows you want to modify
2. Modify the rows that need to be updated/changed
3. Merge these two tables to create a new table
4. Delete the original table
5. Rename the new table to the original table name for downstream dependencies

Instead of performing all of these steps, with Delta Lake, we can simplify this process by running an UPDATE statement. To show this, let’s update all of the flights originating from Detroit to Seattle.

    # Update all flights originating from Detroit to now be originating from Seattle
    deltaTable.update("origin = 'DTW'", { "origin": "'SEA'" } )

    # How many flights are between Seattle and San Francisco
    spark.sql("select count(1) from delays_delta where origin = 'SEA' and destination = 'SFO'").show()

With the Detroit flights now tagged as Seattle flights, we now have 986 flights originating from Seattle to San Francisco. If you were to list the file system for your departureDelays folder (i.e., $../departureDelays/ls -l), you will notice there are now 11 files (instead of the 8 right after deleting the files and the four files after creating the table).

**Merge our flight data**

A common scenario when working with a data lake is to continuously append data to your table. This often results in duplicate data (rows you do not want to be inserted into your table again), new rows that need to be inserted, and some rows that need to be updated. With Delta Lake, all of this can be achieved by using the merge operation (similar to the SQL MERGE statement).

Let’s start with a sample data set that you will want to be updated, inserted or de-duplicated with the following query.

    # What flights between SEA and SFO for these date periods
    spark.sql("select * from delays_delta where origin = 'SEA' and destination = 'SFO' and date like '1010%' limit 10").show()

The output of this query looks like the following table. Note, the color-coding has been added to clearly identify which rows are de-duplicated (blue), updated (yellow) and inserted (green).

-----

Next, let’s generate our own merge_table that contains data we will insert, update or de-duplicate with the following code snippet.

    items = [(1010710, 31, 590, 'SEA', 'SFO'), (1010521, 10, 590, 'SEA', 'SFO'), (1010822, 31, 590, 'SEA', 'SFO')]

With Delta Lake, this can be easily achieved via a merge statement as noted in the following code snippet.

    # Merge merge_table with flights
    deltaTable.alias("flights") \
        .merge(merge_table.
alias ( “updates”),”flights.date = updates.date” ) \ .whenMatchedUpdate(set = { “delay” : “updates.delay” } ) \ .whenNotMatchedInsertAll() \ .execute() # What flights between SEA and SFO for these date periods spark.sql( “select * from delays_delta where origin = ‘SEA’ and destination = ‘SFO’ and date like ‘1010%’ limit 10” ).show() cols = [ ‘date’ , ‘delay’ , ‘distance’ , ‘origin’ , ‘destination’ ] merge_table = spark.createDataFrame(items, cols) merge_table.toPandas() In the preceding table ( merge_table ), there are three rows with a unique date value: 1. 1010521: This row needs to _update_ the _flights_ table with a new delay value (yellow) 2. 1010710: This row is a _duplicate_ (blue) 3. 1010832: This is a new row to be _inserted_ (green) All three actions of de-duplication, update and insert were efficiently completed with one statement. **View table history** As previously noted, after each of our transactions (delete, update), there were more files created within the file system. This is because for each transaction, there are different versions of the Delta Lake table. ----- This can be seen by using the DeltaTable.history() method as noted below Note: You can also perform the same task with SQL: spark.sql(“DESCRIBE HISTORY ‘” + pathToEventsTable + “’”).show() As you can see, there are three rows representing the different versions of the table (below is an abridged version to help make it easier to read) for each of the operations (create table, delete and update): **Travel back in time with table history** With Time Travel, you can review the Delta Lake table as of the version or timestamp. To view historical data, specify the version or timestamp option; in the following code snippet, we will specify the version option. 
```python
# Load DataFrames for each version
dfv0 = spark.read.format("delta").option("versionAsOf", 0).load("departureDelays.delta")
dfv1 = spark.read.format("delta").option("versionAsOf", 1).load("departureDelays.delta")
dfv2 = spark.read.format("delta").option("versionAsOf", 2).load("departureDelays.delta")

# Calculate the SEA to SFO flight counts for each version of history
cnt0 = dfv0.where("origin = 'SEA'").where("destination = 'SFO'").count()
cnt1 = dfv1.where("origin = 'SEA'").where("destination = 'SFO'").count()
cnt2 = dfv2.where("origin = 'SEA'").where("destination = 'SFO'").count()

# Print out the value
print("SEA -> SFO Counts: Create Table: %s, Delete: %s, Update: %s" % (cnt0, cnt1, cnt2))

## Output
SEA -> SFO Counts: Create Table: 1698, Delete: 837, Update: 986
```

Whether for governance, risk management and compliance (GRC) or rolling back errors, the Delta Lake table contains both the metadata (e.g., recording the fact that a delete had occurred with these operators) and data (e.g., the actual rows deleted). But how do we remove the data files either for compliance or size reasons?

**Clean up old table versions with vacuum**

By default, the [Delta Lake vacuum](https://docs.delta.io/0.7.0/delta-utility.html#vacuum) method deletes all of the rows (and files) that are older than seven days. If you were to view the file system, you’ll notice the 11 files for your table.

```
/departureDelays.delta$ ls -l
_delta_log
part-00000-5e52736b-0e63-48f3-8d56-50f7cfa0494d-c000.snappy.parquet
part-00000-69eb53d5-34b4-408f-a7e4-86e000428c37-c000.snappy.parquet
part-00000-f8edaf04-712e-4ac4-8b42-368d0bbdb95b-c000.snappy.parquet
part-00001-20893eed-9d4f-4c1f-b619-3e6ea1fdd05f-c000.snappy.parquet
part-00001-9b68b9f6-bad3-434f-9498-f92dc4f503e3-c000.snappy.parquet
part-00001-d4823d2e-8f9d-42e3-918d-4060969e5844-c000.snappy.parquet
part-00002-24da7f4e-7e8d-40d1-b664-95bf93ffeadb-c000.snappy.parquet
part-00002-3027786c-20a9-4b19-868d-dc7586c275d4-c000.snappy.parquet
part-00002-f2609f27-3478-4bf9-aeb7-2c78a05e6ec1-c000.snappy.parquet
part-00003-850436a6-c4dd-4535-a1c0-5dc0f01d3d55-c000.snappy.parquet
part-00003-b9292122-99a7-4223-aaa9-8646c281f199-c000.snappy.parquet
```

To delete all of the files so that you only keep the current snapshot of data, you will specify a small value for the vacuum method (instead of the default retention of 7 days).

```python
# Remove all files older than 0 hours old.
deltaTable.vacuum(0)
```

Note, you can perform the same task via SQL syntax:

```python
# Remove all files older than 0 hours old
spark.sql("VACUUM '" + pathToEventsTable + "' RETAIN 0 HOURS")
```

Once the vacuum has completed, when you review the file system you will notice fewer files as the historical data has been removed.

```
/departureDelays.delta$ ls -l
_delta_log
part-00000-f8edaf04-712e-4ac4-8b42-368d0bbdb95b-c000.snappy.parquet
part-00001-9b68b9f6-bad3-434f-9498-f92dc4f503e3-c000.snappy.parquet
part-00002-24da7f4e-7e8d-40d1-b664-95bf93ffeadb-c000.snappy.parquet
part-00003-b9292122-99a7-4223-aaa9-8646c281f199-c000.snappy.parquet
```

Note, the ability to time travel back to a version older than the retention period is lost after running vacuum.
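To make the file-count behavior above concrete, here is a minimal pure-Python sketch of a copy-on-write table. It is a toy model and not Delta Lake's implementation: the `ToyTable` class, its method names and the sample delay values are all invented for illustration, and its `vacuum` assumes a zero-hour retention. It shows why a delete adds files, why old versions stay readable until a vacuum, and why time travel to them fails afterward.

```python
class ToyTable:
    """Hypothetical copy-on-write table; data files are never modified in place."""

    def __init__(self):
        self.storage = {}    # file name -> rows (stands in for the table directory)
        self.versions = []   # versions[i] = file names that make up version i
        self._next_id = 0

    def _write(self, rows):
        name = f"part-{self._next_id:05d}"
        self._next_id += 1
        self.storage[name] = list(rows)
        return name

    def create(self, partitions):
        self.versions.append([self._write(rows) for rows in partitions])

    def delete(self, predicate):
        live = []
        for name in self.versions[-1]:
            rows = self.storage[name]
            kept = [r for r in rows if not predicate(r)]
            if len(kept) == len(rows):
                live.append(name)                # untouched file is reused
            elif kept:
                live.append(self._write(kept))   # rewritten copy; old file stays on disk
        self.versions.append(live)

    def read(self, version=-1):
        return [r for name in self.versions[version] for r in self.storage[name]]

    def vacuum(self):
        """Zero-retention vacuum: drop files the latest version no longer references."""
        current = set(self.versions[-1])
        for name in list(self.storage):
            if name not in current:
                del self.storage[name]


table = ToyTable()
table.create([[-5, 10], [0, -1], [3, -2], [7, -8]])  # version 0: four files of delays
table.delete(lambda delay: delay < 0)                # version 1: four rewritten files

files_before = len(table.storage)   # old and new files coexist
old_rows = table.read(version=0)    # "time travel" to version 0 still works

table.vacuum()
files_after = len(table.storage)

try:
    table.read(version=0)
    time_travel_ok = True
except KeyError:                    # version 0's files were physically removed
    time_travel_ok = False

print(files_before, files_after, table.read(), time_travel_ok)
```

In this toy run every file contains at least one row matching the predicate, so all four files are rewritten and the directory briefly holds eight, mirroring the four-to-eight jump described for the delete above.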
-----

**Time Travel for Large-Scale Data Lakes**

Time travel capabilities are available in [Delta Lake](https://databricks.com/product/delta-lake-on-databricks). [Delta Lake](https://delta.io/) is an [open-source storage layer](https://github.com/delta-io/delta) that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

With this feature, Delta Lake automatically versions the big data that you store in your data lake, and you can access any historical version of that data. This temporal data management simplifies your data pipeline by making it easy to audit, roll back data in case of accidental bad writes or deletes, and reproduce experiments and reports. Your organization can finally standardize on a clean, centralized, versioned big data repository in your own cloud storage for your analytics.

**Common challenges with changing data**

- **Audit data changes:** Auditing data changes is critical both in terms of data compliance as well as simple debugging to understand how data has changed over time. Organizations moving from traditional data systems to big data technologies and the cloud struggle in such scenarios.
- **Reproduce experiments and reports:** During model training, data scientists run various experiments with different parameters on a given set of data. When scientists revisit their experiments after a period of time to reproduce the models, typically the source data has been modified by upstream pipelines. A lot of times, they are caught unaware by such upstream data changes and hence struggle to reproduce their experiments. Some scientists and organizations engineer best practices by creating multiple copies of the data, leading to increased storage costs. The same is true for analysts generating reports.
- **Rollbacks:** Data pipelines can sometimes write bad data for downstream consumers. This can happen because of issues ranging from infrastructure instabilities to messy data to bugs in the pipeline. For pipelines that do simple appends to directories or a table, rollbacks can easily be addressed by date-based partitioning. With updates and deletes, this can become very complicated, and data engineers typically have to engineer a complex pipeline to deal with such scenarios.

**Working with Time Travel**

Delta Lake’s time travel capabilities simplify building data pipelines for the above use cases. Time Travel in Delta Lake improves developer productivity tremendously. It helps:

- Data scientists manage their experiments better
- Data engineers simplify their pipelines and roll back bad writes
- Data analysts do easy reporting

Organizations can finally standardize on a clean, centralized, versioned big data repository in their own cloud storage for analytics. We are thrilled to see what you will be able to accomplish with this feature.

As you write into a Delta Lake table or directory, every operation is automatically versioned. You can access the different versions of the data two different ways:

**1. Using a timestamp**

You can provide the timestamp or date string as an option to DataFrame reader:

**Scala syntax**

```scala
val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2019-01-01")
  .load("/path/to/my/table")
```

**Python syntax**

```python
df = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2019-01-01") \
  .load("/path/to/my/table")
```

**SQL syntax**

```sql
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"
```

If the reader code is in a library that you don’t have access to, and if you are passing input parameters to the library to read data, you can still travel back in time for a table by passing the timestamp in yyyyMMddHHmmssSSS format to the path:

```scala
// Function in a library that you don't have access to
def loadData(inputPath : String) : DataFrame = {
  spark.read
    .format("delta")
    .load(inputPath)
}

val inputPath = "/path/to/my/table@20190101000000000"
val df = loadData(inputPath)
```

```python
# Function in a library that you don't have access to
def loadData(inputPath):
  return spark.read \
    .format("delta") \
    .load(inputPath)

inputPath = "/path/to/my/table@20190101000000000"
df = loadData(inputPath)
```

**2. Using a version number**

In Delta Lake, every write has a version number, and you can use the version number to travel back in time as well.

**Scala syntax**

```scala
val df = spark.read
  .format("delta")
  .option("versionAsOf", "5238")
  .load("/path/to/my/table")

val df = spark.read
  .format("delta")
  .load("/path/to/my/table@v5238")
```

**Python syntax**

```python
df = spark.read \
  .format("delta") \
  .option("versionAsOf", "5238") \
  .load("/path/to/my/table")

df = spark.read \
  .format("delta") \
  .load("/path/to/my/table@v5238")
```

**SQL syntax**

```sql
SELECT count(*) FROM my_table VERSION AS OF 5238
```

**Audit data changes**

You can look at the history of table changes using the DESCRIBE HISTORY command or through the UI.

**Reproduce experiments and reports**

Time travel also plays an important role in machine learning and data science. Reproducibility of models and experiments is a key consideration for data scientists because they often create hundreds of models before they put one into production, and in that time-consuming process would like to go back to earlier models. However, because data management is often separate from data science tools, this is really hard to accomplish.

Databricks solves this reproducibility problem by integrating Delta Lake’s Time Travel capabilities with [MLflow](https://mlflow.org/), an open-source platform for the machine learning lifecycle. For reproducible machine learning training, you can simply log a timestamped URL to the path as an MLflow parameter to track which version of the data was used for each training job. This enables you to go back to earlier settings and data sets to reproduce earlier models. You neither need to coordinate with upstream teams on the data nor worry about cloning data for different experiments. This is the power of unified analytics, whereby data science is closely married with data engineering.

**Rollbacks**

Time travel also makes it easy to do rollbacks in case of bad writes.
For example, if your GDPR pipeline job had a bug that accidentally deleted user information, you can easily fix the pipeline:

```sql
INSERT INTO my_table
  SELECT * FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
  WHERE userId = 111
```

You can also fix incorrect updates as follows:

```sql
MERGE INTO my_table target
  USING my_table TIMESTAMP AS OF date_sub(current_date(), 1) source
  ON source.userId = target.userId
  WHEN MATCHED THEN UPDATE SET *
```

If you simply want to roll back to a previous version of your table, you can do so with either of the following commands:

```sql
RESTORE TABLE my_table VERSION AS OF [version_number]
RESTORE TABLE my_table TIMESTAMP AS OF [timestamp]
```

**Pinned view of a continuously updating Delta Lake table across multiple downstream jobs**

With AS OF queries, you can now pin the snapshot of a continuously updating Delta Lake table for multiple downstream jobs. Consider a situation where a Delta Lake table is being continuously updated, say every 15 seconds, and there is a downstream job that periodically reads from this Delta Lake table and updates different destinations. In such scenarios, typically you want a consistent view of the source Delta Lake table so that all destination tables reflect the same state. You can now easily handle such scenarios as follows:

```python
version = spark.sql("SELECT max(version) FROM (DESCRIBE HISTORY my_table)").collect()

# Will use the latest version of the table for all operations below
data = spark.table("my_table@v%s" % version[0][0])

data.where("event_type = 'e1'").write.jdbc("table1")
data.where("event_type = 'e2'").write.jdbc("table2")
...
data.where("event_type = 'e10'").write.jdbc("table10")
```

**Queries for time series analytics made simple**

Time travel also simplifies time series analytics. For example, if you want to find out how many new customers you added over the last week, your query could be a very simple one like this:

```sql
SELECT count(distinct userId) - (
  SELECT count(distinct userId)
  FROM my_table TIMESTAMP AS OF date_sub(current_date(), 7))
FROM my_table
```

**Additional resources**

- [Tech Talk | Diving Into Delta Lake: Unpacking the Transaction Log](https://databricks.com/discover/diving-into-delta-lake-talks/unpacking-transaction-log)
- [Tech Talk | Getting Data Ready for Data Science With Delta Lake and MLflow](https://databricks.com/discover/getting-started-with-delta-lake-tech-talks/getting-data-ready-data-science-delta-lake-mlflow)
- [Data + AI Summit Europe 2020 | Data Time Travel by Delta Time Machine](https://databricks.com/session_eu20/data-time-travel-by-delta-time-machine-2)
- [Spark + AI Summit NA 2020 | Machine Learning Data Lineage With MLflow and Delta Lake](https://databricks.com/session_na20/machine-learning-data-lineage-with-mlflow-and-delta-lake)
- [Productionizing Machine Learning With Delta Lake](https://databricks.com/blog/2019/08/14/productionizing-machine-learning-with-delta-lake.html)

-----

**Easily Clone Your Delta Lake for Testing, Sharing and ML Reproducibility**

Delta Lake has a feature called **Table Cloning**, which makes it easy to test, share and recreate tables for ML reproducibility. Creating copies of tables in a data lake or data warehouse has several practical uses. However, given the volume of data in tables in a data lake and the rate of its growth, making physical copies of tables is an expensive operation. Delta Lake now makes the process simpler and cost-effective with the help of table clones.

**What are clones?**

Clones are replicas of a source table at a given point in time. They have the same metadata as the source table: same schema, constraints, column descriptions, statistics and partitioning.
However, they behave as a separate table with a separate lineage or history. Any changes made to clones only affect the clone and not the source. Any changes that happen to the source during or after the cloning process also do not get reflected in the clone due to Snapshot Isolation. In Delta Lake we have two types of clones: shallow or deep.

**Shallow clones**

A _shallow_ (also known as a Zero-Copy) clone only duplicates the metadata of the table being cloned; the data files of the table itself are not copied. This type of cloning does not create another physical copy of the data, resulting in minimal storage costs. Shallow clones are inexpensive and can be extremely fast to create.

These clones are not self-contained and depend on the source from which they were cloned as the source of data. If the files in the source that the clone depends on are removed, for example with VACUUM, a shallow clone may become unusable. Therefore, shallow clones are typically used for short-lived use cases such as testing and experimentation.

**Deep clones**

Shallow clones are great for short-lived use cases, but some scenarios require a separate and independent copy of the table’s data. A deep clone makes a full copy of the metadata and the data files of the table being cloned. In that sense, it is similar in functionality to copying with a CTAS command (CREATE TABLE.. AS… SELECT…). But it is simpler to specify since it makes a faithful copy of the original table at the specified version, and you don’t need to re-specify partitioning, constraints and other information as you have to do with CTAS. In addition, it is much faster and more robust, and can work in an incremental manner against failures.

With deep clones, we copy additional metadata, such as your streaming application transactions and COPY INTO transactions, so you can continue your ETL applications exactly where they left off on a deep clone.
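The difference between the two clone types can be sketched in a few lines of plain Python. This is only a toy model under stated assumptions: a dictionary named `storage` stands in for cloud storage, a "table" is reduced to the list of file paths its metadata references, and the helper names (`shallow_clone`, `deep_clone`, `is_readable`) are invented for illustration rather than taken from any Delta Lake API.

```python
# Toy model: path -> rows for every physical data file.
storage = {"prod/part-0": [1, 2], "prod/part-1": [3, 4]}
source = {"files": ["prod/part-0", "prod/part-1"]}


def shallow_clone(table):
    """Copy metadata only; the clone keeps pointing at the source's files."""
    return {"files": list(table["files"])}


def deep_clone(table, storage, prefix):
    """Copy metadata and physically copy every data file under a new prefix."""
    files = []
    for i, path in enumerate(table["files"]):
        new_path = f"{prefix}/part-{i}"
        storage[new_path] = list(storage[path])
        files.append(new_path)
    return {"files": files}


def is_readable(table, storage):
    """A table is readable only if every file it references still exists."""
    return all(path in storage for path in table["files"])


shallow = shallow_clone(source)
deep = deep_clone(source, storage, "archive")

# The source rewrites part-0 (for example after an update) ...
storage["prod/part-2"] = [1, 20]
source["files"] = ["prod/part-2", "prod/part-1"]
# ... and a later vacuum on the source physically removes the old file.
del storage["prod/part-0"]

print(is_readable(source, storage))   # True
print(is_readable(shallow, storage))  # False: it still references prod/part-0
print(is_readable(deep, storage))     # True: it owns its own copies
```

The shallow clone costs nothing to create because no rows are copied, which is also exactly why it breaks once the source's old file disappears.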
**Where do clones help?**

Sometimes I wish I had a clone to help with my chores or magic tricks. However, we’re not talking about human clones here. There are many scenarios where you need a copy of your data sets: for exploring, sharing or testing ML models or analytical queries. Below are some examples of customer use cases.

**Testing and experimentation with a production table**

When users need to test a new version of their data pipeline they often have to rely on sample test data sets that are not representative of all the data in their production environment. Data teams may also want to experiment with various indexing techniques to improve the performance of queries against massive tables. These experiments and tests cannot be carried out in a production environment without risking production data processes and affecting users.

It can take many hours, or even days, to spin up copies of your production tables for a test or a development environment. Add to that the extra storage costs for your development environment to hold all the duplicated data, and there is a large overhead in setting up a test environment reflective of the production data. With a shallow clone, this is trivial:

```sql
-- SQL
CREATE TABLE delta.`/some/test/location` SHALLOW CLONE prod.events
```

```python
# Python
DeltaTable.forName(spark, "prod.events").clone("/some/test/location", isShallow=True)
```

```scala
// Scala
DeltaTable.forName(spark, "prod.events").clone("/some/test/location", isShallow=true)
```

After creating a shallow clone of your table in a matter of seconds, you can start running a copy of your pipeline to test out your new code, or try optimizing your table in different dimensions to see how you can improve your query performance, and much much more. These changes will only affect your shallow clone, not your original table.

**Staging major changes to a production table**

Sometimes, you may need to perform some major changes to your production table. These changes may consist of many steps, and you don’t want other users to see the changes that you’re making until you’re done with all of your work. A shallow clone can help you out here:

```sql
-- SQL
CREATE TABLE temp.staged_changes SHALLOW CLONE prod.events;
DELETE FROM temp.staged_changes WHERE event_id is null;
UPDATE temp.staged_changes SET change_date = current_date() WHERE change_date is null;
...
-- Perform your verifications
```

Once you’re happy with the results, you have two options. If no other change has been made to your source table, you can replace your source table with the clone. If changes have been made to your source table, you can merge the changes into your source table.

```sql
-- If no changes have been made to the source
REPLACE TABLE prod.events CLONE temp.staged_changes;

-- If the source table has changed
MERGE INTO prod.events USING temp.staged_changes
ON events.event_id <=> staged_changes.event_id
WHEN MATCHED THEN UPDATE SET *;

-- Drop the staged table
DROP TABLE temp.staged_changes;
```

**Machine learning result reproducibility**

Coming up with an effective ML model is an iterative process. Throughout this process of tweaking the different parts of the model, data scientists need to assess the accuracy of the model against a fixed data set. This is hard to do in a system where the data is constantly being loaded or updated. A snapshot of the data used to train and test the model is required. This snapshot allows the results of the ML model to be reproducible for testing or model governance purposes.
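One lightweight way to record which snapshot a training run used is the `@` path suffix from the time travel chapter (`@v5238` for a version, `@` plus a yyyyMMddHHmmssSSS timestamp otherwise). The sketch below only builds those strings; the helper names and the `run_params` dictionary are hypothetical, and actually logging the value to an experiment tracker is left out.

```python
from datetime import datetime


def path_as_of_version(path, version):
    """Append the '@v<version>' suffix to a table path."""
    return f"{path}@v{version}"


def path_as_of_timestamp(path, ts):
    """Append the '@yyyyMMddHHmmssSSS' suffix (millisecond precision) to a table path."""
    millis = ts.microsecond // 1000
    return f"{path}@{ts.strftime('%Y%m%d%H%M%S')}{millis:03d}"


by_version = path_as_of_version("/path/to/my/table", 5238)
by_timestamp = path_as_of_timestamp("/path/to/my/table", datetime(2019, 1, 1))

# Hypothetical run record; the pinned path could be stored with the model's parameters.
run_params = {"training_data": by_version}

print(by_version)
print(by_timestamp)
```

Storing the pinned path next to the model's other parameters means the same string can later be passed to a reader to reload the data the model was trained on, as long as that version has not been vacuumed.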
We recommend leveraging [Time Travel](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html) to run multiple experiments across a snapshot; an example of this in action can be seen in [Machine Learning Data Lineage With MLflow and Delta Lake](https://databricks.com/session_na20/machine-learning-data-lineage-with-mlflow-and-delta-lake).

Once you’re happy with the results and would like to archive the data for later retrieval, for example, next Black Friday, you can use deep clones to simplify the archiving process. MLflow integrates really well with Delta Lake, and the autologging feature (mlflow.spark.autolog()) will tell you which version of the table was used to run a set of experiments.

```python
# Run your ML workloads using Python and then
DeltaTable.forName(spark, "feature_store").cloneAtVersion(128, "feature_store_bf2020")
```

**Data migration**

A massive table may need to be moved to a new, dedicated bucket or storage system for performance or governance reasons. The original table will not receive new updates going forward and will be deactivated and removed at a future point in time. Deep clones make the copying of massive tables more robust and scalable.

```sql
-- SQL
CREATE TABLE delta.`zz://my-new-bucket/events` CLONE prod.events;
ALTER TABLE prod.events SET LOCATION 'zz://my-new-bucket/events';
```

With deep clones, since we copy your streaming application transactions and COPY INTO transactions, you can continue your ETL applications from exactly where they left off after this migration!

**Data sharing**

In an organization, it is often the case that users from different departments are looking for data sets that they can use to enrich their analysis or models. You may want to share your data with other users across the organization. But rather than setting up elaborate pipelines to move the data to yet another store, it is often easier and more economical to create a copy of the relevant data set for users to explore and test the data to see if it is a fit for their needs without affecting your own production systems. Here deep clones again come to the rescue.

```sql
-- The following code can be scheduled to run at your convenience
CREATE OR REPLACE TABLE data_science.events CLONE prod.events;
```

**Data archiving**

For regulatory or archiving purposes, all data in a table needs to be preserved for a certain number of years, while the active table retains data for a few months. If you want your data to be updated as soon as possible, but you have a requirement to keep data for several years, storing this data in a single table and performing time travel may become prohibitively expensive. In this case, archiving your data in a daily, weekly or monthly manner is a better solution. The incremental cloning capability of deep clones will really help you here.

```sql
-- The following code can be scheduled to run at your convenience
CREATE OR REPLACE TABLE archive.events CLONE prod.events;
```

Note that this table will have an independent history compared to the source table; therefore, time travel queries on the source table and the clone may return different results based on your frequency of archiving.

**Looks awesome! Any gotchas?**

Just to reiterate some of the gotchas mentioned above as a single list, here’s what you should be wary of:

- Clones are executed on a snapshot of your data. Any changes that are made to the source table after the cloning process starts will not be reflected in the clone.
- Shallow clones are not self-contained tables like deep clones. If the data is deleted in the source table (for example through VACUUM), your shallow clone may not be usable.
- Clones have a separate, independent history from the source table. Time travel queries on your source table and clone may not return the same result.
- Shallow clones do not copy stream transactions or COPY INTO metadata. Use deep clones to migrate your tables and continue your ETL processes from where they left off.

**How can I use it?**

Shallow and deep clones support new advances in how data teams test and manage their modern cloud data lakes and warehouses. Table clones can help your team implement production-level testing of their pipelines, fine-tune their indexing for optimal query performance and create table copies for sharing, all with minimal overhead and expense. If this is a need in your organization, we hope you will take table cloning for a spin and give us your feedback. We look forward to hearing about new use cases and extensions you would like to see in the future.

**Additional resource**

[Simplifying Disaster Recovery With Delta Lake](https://databricks.com/session_na20/simplifying-disaster-recovery-with-delta-lake)

-----

**Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0**

The release of [Delta Lake 0.7.0](https://github.com/delta-io/delta/releases/tag/v0.7.0) coincided with the release of [Apache Spark 3.0](https://github.com/delta-io/delta/releases/tag/v0.7.0), thus enabling a new set of features that were simplified using Delta Lake from SQL. Here are some of the key features.

**Support for SQL DDL commands to define tables in the [Hive metastore](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore)**

You can now define Delta tables in the [Hive](https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore) metastore and use the table name in all SQL operations when creating (or replacing) tables.
**Create or replace tables**

    -- Create table in the metastore
    CREATE TABLE events (
      date DATE,
      eventId STRING,
      eventType STRING,
      data STRING)
    USING DELTA
    PARTITIONED BY (date)
    LOCATION '/delta/events'

    -- If a table with the same name already exists, the table is replaced
    -- with the new configuration, else it is created
    CREATE OR REPLACE TABLE events (
      date DATE,
      eventId STRING,
      eventType STRING,
      data STRING)
    USING DELTA
    PARTITIONED BY (date)
    LOCATION '/delta/events'

**Explicitly alter the table schema**

    -- Alter table and schema
    ALTER TABLE table_name ADD COLUMNS (
      col_name data_type
        [COMMENT col_comment]
        [FIRST|AFTER colA_name],
      ...)

You can also use the Scala/Java/Python APIs:
- DataFrame.saveAsTable(tableName) and DataFrameWriterV2 APIs ([#307](https://github.com/delta-io/delta/issues/307)).
- DeltaTable.forName(tableName) API to create instances of io.delta.tables.DeltaTable, which is useful for executing Update/Delete/Merge operations in Scala/Java/Python.

**Support for SQL Insert, Delete, Update and Merge**

One of the most frequent questions through our [Delta Lake Tech Talks](https://databricks.com/discover/diving-into-delta-lake-talks) was when DML operations such as delete, update and merge would be available in Spark SQL. Wait no more: these operations are now available in SQL! Below are examples of how you can write delete, update and merge (insert, update, delete and de-duplication) operations using Spark SQL.

    -- Using append mode, you can atomically add new data to an existing Delta table
    INSERT INTO events SELECT * FROM newEvents

    -- To atomically replace all of the data in a table, you can use overwrite mode
    INSERT OVERWRITE events SELECT * FROM newEvents

    -- Delete events
    DELETE FROM events WHERE date < '2017-01-01'

    -- Update events
    UPDATE events SET eventType = 'click' WHERE eventType = 'clck'

    -- Upsert data to a target Delta table using merge
    MERGE INTO events
    USING updates
      ON events.eventId = updates.eventId
    WHEN MATCHED THEN UPDATE
      SET events.data = updates.data
    WHEN NOT MATCHED THEN INSERT (date, eventId, data)
      VALUES (date, eventId, data)

It is worth noting that the merge operation in Delta Lake supports more advanced syntax than standard ANSI SQL syntax. For example, merge supports:
- Delete actions: Delete a target when matched with a source row. For example, "... WHEN MATCHED THEN DELETE ..."
- Multiple matched actions with clause conditions: Greater flexibility when target and source rows match. For example:

        ...
        WHEN MATCHED AND events.shouldDelete THEN DELETE
        WHEN MATCHED THEN UPDATE SET events.data = updates.data

- Star syntax: Shorthand for setting target column values from the similarly named source columns. For example:

        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
        -- equivalent to updating/inserting with events.date = updates.date,
        -- events.eventId = updates.eventId, events.data = updates.data

**Automatic and incremental Presto/Athena manifest generation**

As noted in [Query Delta Lake Tables From Presto and Athena, Improved Operations Concurrency, and Merge Performance](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html), Delta Lake supports other processing engines to read Delta Lake by using manifest files; the manifest files contain the list of the most current version of files as of manifest generation. As described in the preceding chapter, you will need to:
- Generate a Delta Lake manifest file
- Configure Presto or Athena to read the generated manifests
- Manually re-generate (update) the manifest file

New for Delta Lake 0.7.0 is the capability to update the manifest file automatically with the following command:

    ALTER TABLE delta.`pathToDeltaTable`
    SET TBLPROPERTIES(
      delta.compatibility.symlinkFormatManifest.enabled=true
    )

**Configuring your table through table properties**

With the ability to set table properties on your table by using ALTER TABLE SET TBLPROPERTIES, you can enable, disable or configure many features of Delta Lake, such as automated manifest generation. For example, with [table properties](https://www.youtube.com/watch?v=o54YMz8zvCY), you can block deletes and updates in a Delta table using delta.appendOnly=true.

You can also easily control the history of your Delta Lake table retention by the following [properties](https://databricks.com/blog/2020/11/11/analytics-on-the-data-lake-with-tableau-and-the-lakehouse-architecture.html):
- delta.logRetentionDuration: Controls how long the history for a table (i.e., transaction log history) is kept.
By default, 30 days of history is kept, but you may want to alter this value based on your requirements (e.g., GDPR historical context).
- delta.deletedFileRetentionDuration: Controls how long ago a file must have been deleted before being a candidate for VACUUM. By default, data files older than seven days are deleted.

As of Delta Lake 0.7.0, you can use ALTER TABLE SET TBLPROPERTIES to configure these properties.

    ALTER TABLE delta.`pathToDeltaTable`
    SET TBLPROPERTIES(
      delta.logRetentionDuration = "interval <interval>",
      delta.deletedFileRetentionDuration = "interval <interval>"
    )

**Support for adding user-defined metadata in Delta Lake table commits**

You can specify user-defined strings as metadata in commits made by Delta Lake table operations, either using the DataFrameWriter option userMetadata or the SparkSession configuration spark.databricks.delta.commitInfo.userMetadata.

In the following example, we are deleting a user (1xsdf1) from our data lake per user request. To ensure we associate the user's request with the deletion, we have also added the DELETE request ID into the userMetadata.

    SET spark.databricks.delta.commitInfo.userMetadata={
      "GDPR":"DELETE Request 1x891jb23"
    };

    DELETE FROM user_table WHERE user_id = '1xsdf1'

When reviewing the [history](https://databricks.com/session_eu20/radical-speed-for-your-sql-queries-with-delta-engine) operations of the user table (user_table), you can easily identify the associated deletion request within the transaction log.

**Other highlights**

Other highlights for the Delta Lake 0.7.0 release include:
- Support for Azure Data Lake Storage Gen2: Spark 3.0 has support for Hadoop 3.2 libraries, which enables support for Azure Data Lake Storage Gen2.
- Improved support for streaming one-time triggers: With Spark 3.0, we now ensure that a [one-time trigger](https://databricks.com/session_eu20/mlflow-delta-lake-and-lakehouse-use-cases-meetup) (Trigger.Once) processes all outstanding data in a Delta Lake table in a single micro-batch even if rate limits are set with the DataStreamReader option maxFilesPerTrigger.

There were a lot of great questions during the AMA concerning Structured Streaming and using Trigger.Once. For more information, some good resources explaining this concept include:
- [Running Streaming Jobs Once a Day for 10x Cost Savings](https://databricks.com/session_eu20/common-strategies-for-improving-performance-on-your-delta-lakehouse)
- [Beyond Lambda: Introducing Delta Architecture](https://databricks.com/session_eu20/achieving-lakehouse-models-with-spark-3-0): Specifically the cost vs. latency trade-off discussed here.

**Additional resources**
- [Tech Talk | Delta Lake 0.7.0 + Spark 3.0 AMA](https://www.youtube.com/watch?v=xzKqjCB8SWU)
- [Tech Talks | Apache Spark 3.0 + Delta Lake](https://www.youtube.com/watch?v=x6RqJYqLoPI&list=PLTPXxbhUt-YWnAgh3RE8DOb46qZF57byx)
- [Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0](https://databricks.com/blog/2020/08/27/enabling-spark-sql-ddl-and-dml-in-delta-lake-on-apache-spark-3-0.html)

-----

**Lakehouse**

Combining the best elements of data lakes and data warehouses

## CHAPTER 03

-----

**What Is a Lakehouse?**

Over the past few years at Databricks, we've seen a new data management architecture that emerged independently across many customers and use cases: the **lakehouse.** In this chapter, we'll describe this new architecture and its advantages over previous approaches.

Data warehouses have a long history in decision support and business intelligence applications. Since its inception in the late 1980s, data warehouse technology continued to evolve and MPP architectures led to systems that were able to handle larger data sizes.
But while warehouses were great for structured data, a lot of modern enterprises have to deal with unstructured data, semi-structured data, and data with high variety, velocity and volume. Data warehouses are not suited for many of these use cases, and they are certainly not the most cost-efficient.

As companies began to collect large amounts of data from many different sources, architects began envisioning a single system to house data for many different analytic products and workloads. About a decade ago, companies began building [data lakes](https://databricks.com/glossary/data-lake): repositories for raw data in a variety of formats. While suitable for storing data, data lakes lack some critical features: They do not support transactions, they do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, and batch and streaming jobs. For these reasons, many of the promises of data lakes have not materialized and, in many cases, this has led to a loss of many of the benefits of data warehouses.

The need for a flexible, high-performance system hasn't abated. Companies require systems for diverse data applications including SQL analytics, real-time monitoring, data science and machine learning. Most of the recent advances in AI have been in better models to process unstructured data (text, images, video, audio), but these are precisely the types of data that a data warehouse is not optimized for. A common approach is to use multiple systems: a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph and image databases. Having a multitude of systems introduces complexity and, more importantly, introduces delay as data professionals invariably need to move or copy data between different systems.

**A lakehouse combines the best elements of data lakes and data warehouses**

A lakehouse is a new data architecture that combines the best elements of data lakes and data warehouses. Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

A lakehouse has the following key features:
- **Transaction support:** In an enterprise lakehouse, many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
- **Schema enforcement and governance:** The lakehouse should have a way to support schema enforcement and evolution, supporting DW schema paradigms such as star/snowflake-schemas. The system should be able to [reason about data integrity](https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html), and it should have robust governance and auditing mechanisms.
- **BI support:** Lakehouses enable using BI tools directly on the source data. This reduces staleness and improves recency, reduces latency and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse.
- **Storage is decoupled from compute:** In practice, this means storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.
- **Openness:** The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly.
- **Support for diverse data types ranging from unstructured to structured data:** The lakehouse can be used to store, refine, analyze and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text.
- **Support for diverse workloads:** Including data science, machine learning and SQL analytics. Multiple tools might be needed to support all these workloads, but they all rely on the same data repository.
- **End-to-end streaming:** Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.

These are the key attributes of lakehouses. Enterprise-grade systems require additional features. Tools for security and access control are basic requirements. Data governance capabilities including auditing, retention and lineage have become essential, particularly in light of recent privacy regulations. Tools that enable data discovery such as data catalogs and data usage metrics are also needed. With a lakehouse, such enterprise features only need to be implemented, tested and administered for a single system.

-----

**Read the research**

**Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores**

**Abstract**

Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: Metadata operations, such as listing objects, are expensive, and consistency guarantees are limited.
In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular data sets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale data sets and billions of objects.

Authors: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hövell, Adrian Ionescu, Alicja Łuszczak, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia

Read the full research paper on the [inner workings of the lakehouse](https://databricks.com/research/delta-lake-high-performance-acid-table-storage-overcloud-object-stores).

-----

**Some early examples**

The [Databricks Unified Data Platform](https://databricks.com/product/data-lakehouse) has the architectural features of a lakehouse. Microsoft's [Azure Synapse Analytics](https://azure.microsoft.com/en-us/blog/simply-unmatched-truly-limitless-announcing-azure-synapse-analytics/) service, which [integrates with Azure Databricks](https://databricks.com/blog/2019/11/04/new-microsoft-azure-data-warehouse-service-and-azure-databricks-combine-analytics-bi-and-data-science.html), enables a similar lakehouse pattern.
Other managed services such as [BigQuery](https://cloud.google.com/bigquery/) and [Redshift Spectrum](https://docs.aws.amazon.com/redshift/latest/dg/c-using-spectrum.html) have some of the lakehouse features listed above, but they are examples that focus primarily on BI and other SQL applications. Companies that want to build and implement their own systems have access to open source file formats (Delta Lake, [Apache Iceberg](https://iceberg.apache.org), [Apache Hudi](https://hudi.apache.org)) that are suitable for building a lakehouse.

Merging data lakes and data warehouses into a single system means that data teams can move faster as they are able to use data without needing to access multiple systems. The level of SQL support and integration with BI tools among these early lakehouses is generally sufficient for most enterprise data warehouses. Materialized views and stored procedures are available, but users may need to employ other mechanisms that aren't equivalent to those found in traditional data warehouses. The latter is particularly important for "[lift and shift scenarios](https://whatis.techtarget.com/definition/lift-and-shift)," which require systems that achieve semantics that are almost identical to those of older, commercial data warehouses.

What about support for other types of data applications? Users of a lakehouse have access to a variety of standard tools ([Apache Spark](https://databricks.com/glossary/apache-spark-as-a-service), Python, R, machine learning libraries) for non-BI workloads like data science and machine learning. Data exploration and refinement are standard for many analytic and data science applications. Delta Lake is designed to let users incrementally improve the quality of data in their lakehouse until it is ready for consumption.

A note about technical building blocks. While distributed file systems can be used for the storage layer, object stores are more commonly used in lakehouses.
Object stores provide low-cost, highly available storage that excels at massively parallel reads, an essential requirement for modern data warehouses.

**From BI to AI**

The lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an age when machine learning is poised to disrupt every industry. In the past, most of the data that went into a company's products or decision-making was structured data from operational systems, whereas today, many products incorporate AI in the form of computer vision and speech models, text mining and others. Why use a lakehouse instead of a data lake for AI? A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.

Current lakehouses reduce cost, but their performance can still lag specialized systems (such as data warehouses) that have years of investments and real-world deployments behind them. Users may favor certain tools (BI tools, IDEs, notebooks) over others, so lakehouses will also need to improve their UX and their connectors to popular tools so they can appeal to a variety of personas. These and other issues will be addressed as the technology continues to mature and develop. Over time, lakehouses will close these gaps while retaining the core properties of being simpler, more cost-efficient and more capable of serving diverse data applications.

-----

**Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake**

Databricks wrote a [blog article](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) that outlined how more and more enterprises are adopting the lakehouse pattern. The blog created a massive amount of interest from technology enthusiasts. While lots of people praised it as the next-generation data architecture, some people thought the lakehouse is the same thing as the data lake.
Recently, several of our engineers and founders wrote a research paper that describes some of the core technological challenges and solutions that set the lakehouse architecture apart from the data lake, and it was accepted and published at the International Conference on Very Large Databases (VLDB) 2020. You can read the paper, ["Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores," here](https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf).

Henry Ford is often credited with having said, "If I had asked people what they wanted, they would have said faster horses." The crux of this statement is that people often envision a better solution to a problem as an evolution of what they already know rather than rethinking the approach to the problem altogether. In the world of data storage, this pattern has been playing out for years. Vendors continue to try to reinvent the old horses of data warehouses and data lakes rather than seek a new solution.

More than a decade ago, the cloud opened a new frontier for data storage. Cloud object stores like Amazon S3 have become some of the largest and most cost-effective storage systems in the world, which makes them an attractive platform to store data warehouses and data lakes. However, their nature as key-value stores makes it difficult to achieve the ACID transactions that many organizations require. Also, performance is hampered by expensive metadata operations (e.g., listing objects) and limited consistency guarantees.

Based on the characteristics of cloud object stores, three approaches have emerged.

**1. Data lakes**

The first is directories of files (i.e., data lakes) that store the table as a collection of objects, typically in columnar format such as Apache Parquet.
It's an attractive approach because the table is just a group of objects that can be accessed from a wide variety of tools without a lot of additional data stores or systems. However, both performance and consistency problems are common. Hidden data corruption is common due to failed transactions, eventual consistency leads to inconsistent queries, latency is high, and basic management capabilities like table versioning and audit logs are unavailable.

**2. Custom storage engines**

The second approach is custom storage engines, such as proprietary systems built for the cloud like the Snowflake data warehouse. These systems can bypass the consistency challenges of data lakes by managing the metadata in a separate, strongly consistent service that's able to provide a single source of truth. However, all I/O operations need to connect to this metadata service, which can increase cloud resource costs and reduce performance and availability. Additionally, it takes a lot of engineering work to implement connectors to existing computing engines like Apache Spark, TensorFlow and PyTorch, which can be challenging for data teams that use a variety of computing engines on their data. Engineering challenges can be exacerbated by unstructured data because these systems are generally optimized for traditional structured data types. Finally, and most egregiously, the proprietary metadata service locks customers into a specific service provider, leaving customers to contend with consistently high prices and expensive, time-consuming migrations if they decide to adopt a new approach later.

**3. Lakehouse**

With Delta Lake, an open source ACID table storage layer atop cloud object stores, we sought to build a car instead of a faster horse with not just a better data store, but a fundamental change in how data is stored and used via the lakehouse. A lakehouse is a new architecture that combines the best elements of data lakes and data warehouses.
Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. They are what you would get if you had to redesign storage engines in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

Delta Lake maintains information about which objects are part of a Delta table in an ACID manner, using a write-ahead log, compacted into Parquet, that is also stored in the cloud object store. This design allows clients to update multiple objects at once, replace a subset of the objects with another, etc., in a serializable manner that still achieves high parallel read/write performance from the objects. The log also provides significantly faster metadata operations for large tabular data sets. Additionally, Delta Lake offers advanced capabilities like time travel (i.e., the ability to query point-in-time snapshots or roll back erroneous updates), automatic data layout optimization, upserts, caching, and audit logs. Together, these features improve both the manageability and performance of working with data in cloud object stores, ultimately opening the door to the lakehouse architecture that combines the key features of data warehouses and data lakes to create a better, simpler data architecture.

Today, Delta Lake is used across thousands of Databricks customers, processing exabytes of structured and unstructured data each day, as well as many organizations in the open source community. These use cases span a variety of data sources and applications. The data types stored include Change Data Capture (CDC) logs from enterprise OLTP systems, application logs, time-series data, graphs, aggregate tables for reporting, and image or feature data for machine learning.
The applications include SQL workloads (most commonly), business intelligence, streaming, data science, machine learning and graph analytics. Overall, Delta Lake has proven itself to be a good fit for most data lake applications that would have used structured storage formats like Parquet or ORC, and many traditional data warehousing workloads.

Across these use cases, we found that customers often use Delta Lake to significantly simplify their data architecture by running more workloads directly against cloud object stores, and increasingly, by creating a lakehouse with both data lake and transactional features to replace some or all of the functionality provided by message queues (e.g., Apache Kafka), data lakes or cloud data warehouses (e.g., Snowflake, Amazon Redshift).

**[In the research paper](https://databricks.com/research/delta-lake-high-performance-acid-table-storage-overcloud-object-stores), the authors explain:**
- The characteristics and challenges of object stores
- The Delta Lake storage format and access protocols
- The current features, benefits and limitations of Delta Lake
- Both the core and specialized use cases commonly employed today
- Performance experiments, including TPC-DS performance

Through the paper, you'll gain a better understanding of Delta Lake and how it enables a wide range of DBMS-like performance and management features for data held in low-cost cloud storage, as well as how the Delta Lake storage format and access protocols make it simple to operate, highly available, and able to deliver high-bandwidth access to the object store.

-----

**Understanding Delta Engine**

The Delta Engine ties together a 100% Apache Spark-compatible vectorized query engine, built to take advantage of modern CPU architecture, with optimizations to Spark 3.0's query optimizer and caching capabilities that were launched as part of Databricks Runtime 7.0.
Together, these features significantly accelerate query performance on data lakes, especially those enabled by [Delta Lake](https://databricks.com/product/delta-lake-on-databricks), to make it easier for customers to adopt and scale a [lakehouse architecture](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html).

**Scaling execution performance**

One of the big hardware trends over the last several years is that CPU clock speeds have plateaued. The reasons are outside the scope of this chapter, but the takeaway is that we have to find new ways to process data faster beyond raw compute power. One of the most impactful methods has been to improve the amount of data that can be processed in parallel. However, data processing engines need to be specifically architected to take advantage of this parallelism.

In addition, data teams are being given less and less time to properly model data as the pace of business increases. Poorer modeling in the interest of better business agility drives poorer query performance. Naturally, this is not a desired state, and organizations want to find ways to maximize both agility and performance.

**Announcing Delta Engine for high-performance query execution**

Delta Engine accelerates the performance of Delta Lake for SQL and DataFrame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and the cloud object storage, and a native vectorized execution engine that's written in C++.

The improved query optimizer extends the functionality already in Spark 3.0 (cost-based optimizer, adaptive query execution, and dynamic runtime filters) with more advanced statistics to deliver up to 18x increased performance in star schema workloads.

Delta Engine's caching layer automatically chooses which input data to cache for the user, transcoding it along the way into a more CPU-efficient format to better leverage the increased storage speeds of NVMe SSDs.
This delivers up to 5x faster scan performance for virtually all workloads.

However, the biggest innovation in Delta Engine to tackle the challenges facing data teams today is the native execution engine, which we call Photon. (We know. It's an engine within the engine…). This completely rewritten execution engine for Databricks has been built to maximize the performance from the new changes in modern cloud hardware. It brings performance improvements to all workload types while remaining fully compatible with open Spark APIs.

**Getting started with Delta Engine**

By linking these three components together, we think it will be easier for customers to understand how improvements in multiple places within the Databricks code aggregate into significantly faster performance for analytics workloads on data lakes.

We're excited about the value that Delta Engine delivers to our customers. While the time and cost savings are already valuable, its role in the lakehouse pattern supports new advances in how data teams design their data architectures for increased unification and simplicity.

For more information on the Delta Engine, watch this keynote address from [Spark + AI Summit 2020: Delta Engine: High-Performance Query Engine for Delta Lake](https://www.youtube.com/watch?v=o54YMz8zvCY).

-----

**Streaming**

Using Delta Lake to express computation on streaming data

## CHAPTER 04

-----

**How Delta Lake Solves Common Pain Points in Streaming**

The pain points of a traditional streaming and data warehousing solution can be broken into two groups: data lake and data warehouse pains.

**Data lake pain points**

While data lakes allow you to flexibly store an immense amount of data in a file system, there are many pain points including (but not limited to):
- Consolidation of streaming data from many disparate systems is difficult.
- Updating data in a data lake is nearly impossible, and much of the streaming data needs to be updated as changes are made.
This is especially important in scenarios involving financial reconciliation and subsequent adjustments.
- Query speeds for a data lake are typically very slow.
- Optimizing storage and file sizes is very difficult and often requires complicated logic.

**Data warehouse pain points**

The power of a data warehouse is that you have a persistent performant store of your data. But the pain points for building modern continuous applications include (but are not limited to):
- Constrained to SQL queries (i.e., no machine learning or advanced analytics).
- Accessing streaming data and stored data together is very difficult, if at all possible.
- Data warehouses do not scale very well.
- Tying compute and storage together makes using a warehouse very expensive.

**How Delta Lake on Databricks solves these issues**

[Delta Lake](https://docs.databricks.com/delta/index.html) is a unified data management system that brings data reliability and performance optimizations to cloud data lakes. More succinctly, Delta Lake combines the advantages of data lakes and data warehouses with Apache Spark™ to allow you to do incredible things.
- Delta Lake, along with Structured Streaming, makes it possible to analyze streaming and historical data together at high speeds.
- When Delta Lake tables are used as sources and destinations of streaming big data, it is easy to consolidate disparate data sources.
- Upserts are supported on Delta Lake tables.
- Delta Lake is ACID compliant, making it easy to create a compliant data solution.
- Easily include machine learning scoring and advanced analytics into ETL and queries.
- Decouples compute and storage for a completely scalable solution.

In the following use cases, we'll share what this looks like in practice.

-----

**Simplifying Streaming Stock Data Analysis Using Delta Lake**

Real-time analysis of stock data is a complicated endeavor.
After all, there are many challenges in maintaining a streaming system and ensuring transactional consistency of legacy and streaming data concurrently. Thankfully, [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) helps solve many of the pain points of building a streaming system to analyze stock data in real time. In this section, we'll share how to simplify the streaming of stock data analysis using Delta Lake.

In the following diagram, you can see a high-level architecture that simplifies this problem. We start by ingesting two different sets of data into two Delta Lake tables. The two data sets are stock prices and fundamentals. After ingesting the data into their respective tables, we then join the data in an ETL process and write the data out into a third Delta Lake table for downstream analysis.

Delta Lake helps solve these problems by combining the scalability, streaming and access to the advanced analytics of Apache Spark with the performance and ACID compliance of a data warehouse.

**Implement your streaming stock analysis solution with Delta Lake**

Delta Lake and Apache Spark do most of the work for our solution; you can try out the full [notebook](https://pages.databricks.com/rs/094-YMS-629/images/streaming-stock-data-analysis-setup.html) and follow along with the code samples below.

As noted in the preceding diagram, we have two data sets to process: one for fundamentals and one for price data. To create our two Delta Lake tables, we specify the .format('delta') against our Databricks File System ([DBFS](https://docs.databricks.com/data/databricks-file-system.html)) locations.

    # Create Fundamental Data (Databricks Delta table)
    dfBaseFund = spark \
      .read \
      .format('delta') \
      .load('/delta/stocksFundamentals')

    # Create Price Data (Databricks Delta table)
    dfBasePrice = spark \
      .read \
      .format('delta') \
      .load('/delta/stocksDailyPrices')
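To make the join step in this architecture concrete without a cluster, here is a minimal pure-Python sketch of combining one day's price rows with that day's fundamentals on `ticker`. Plain lists of dictionaries stand in for the two Delta Lake tables, the function name `join_price_with_fundamentals` and the sample rows are hypothetical, and the column names (`ticker`, `price_date`, `close`, `eps_basic_net`, `updated`) follow those used elsewhere in this section. It illustrates inner-join semantics only and is not the notebook's Spark implementation.

```python
from typing import Dict, List

Row = Dict[str, object]


def join_price_with_fundamentals(
    prices: List[Row], fundamentals: List[Row], the_date: str
) -> List[Row]:
    """Inner-join one day's price rows with that day's fundamentals on ticker."""
    # Index the day's fundamentals by ticker, leaving out the 'updated' column.
    fund_by_ticker = {
        f["ticker"]: {k: v for k, v in f.items() if k != "updated"}
        for f in fundamentals
        if f["price_date"] == the_date
    }
    joined = []
    for p in prices:
        if p["price_date"] != the_date or p["ticker"] not in fund_by_ticker:
            continue  # inner join: keep only tickers present on both sides
        # Drop price_date and updated from the price side;
        # the fundamentals row supplies price_date.
        price_cols = {
            k: v for k, v in p.items() if k not in ("price_date", "updated")
        }
        joined.append({**price_cols, **fund_by_ticker[p["ticker"]]})
    return joined


# Hypothetical sample rows standing in for the two Delta Lake tables.
prices = [
    {"ticker": "AAPL", "price_date": "2020-01-02", "close": 300.0, "updated": "t1"},
    {"ticker": "MSFT", "price_date": "2020-01-02", "close": 160.0, "updated": "t1"},
    {"ticker": "AAPL", "price_date": "2020-01-03", "close": 297.0, "updated": "t2"},
]
fundamentals = [
    {"ticker": "AAPL", "price_date": "2020-01-02", "eps_basic_net": 12.0, "updated": "t0"},
]

result = join_price_with_fundamentals(prices, fundamentals, "2020-01-02")
print(result)
```

Only the AAPL row for the requested date survives: MSFT has no fundamentals row, and the second AAPL price belongs to a different date.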
----- While we’re updating the stockFundamentals and stocksDailyPrices , we will consolidate this data through a series of ETL jobs into a consolidated view ( stocksDailyPricesWFund ). With the following code snippet, we can determine the start and end date of available data and then combine the price and fundamentals data for that date range into DBFS. # Determine start and end date of available data row = dfBasePrice.agg( func.max(dfBasePrice.price_date) .alias ( “maxDate” ), func.min(dfBasePrice.price_date) .alias ( “minDate” ) ).collect()[ 0 ] startDate = row[ “minDate” ] endDate = row[ “maxDate” ] # Define our date range function # Save data to DBFS dfPriceWFund .write .format( ‘delta’ ) .mode( ‘append’ ) .save( ‘/delta/stocksDailyPricesWFund’ ) # Loop through dates to complete fundamentals + price ETL process for single_date in daterange( startDate, (endDate + datetime.timedelta(days= 1 )) ): print ‘Starting ’ + single_date.strftime( ‘%Y-%m-%d’ ) start = datetime.datetime.now() combinePriceAndFund(single_date) end = datetime.datetime.now() print ( end - start) def daterange(start_date, end_date): Now we have a stream of consolidated fundamentals and price data that is being pushed into [DBFS](https://docs.databricks.com/data/databricks-file-system.html) in the /delta/stocksDailyPricesWFund location. We can build a Delta Lake table by specifying .format(“delta”) against that DBFS location. for n in range( int ((end_date - start_date).days)): yield start_date + datetime.timedelta(n) # Define combinePriceAndFund information by date and def combinePriceAndFund(theDate): dfFund = dfBaseFund. where (dfBaseFund.price_date == theDate) dfPrice = dfBasePrice. 
where ( dfBasePrice.price_date == theDate dfPriceWithFundamentals = spark .readStream .format( “delta” ) .load( “/delta/stocksDailyPricesWFund” ) ).drop( ‘price_date’ ) # Drop the updated column dfPriceWFund = dfPrice.join(dfFund, [ ‘ticker’ ]).drop( ‘updated’ ) // Create temporary view of the data dfPriceWithFundamentals.createOrReplaceTempView( “priceWithFundamentals” ) ----- Now that we have created our initial Delta Lake table, let’s create a view that will allow us to calculate the price/earnings ratio in real time (because of the underlying streaming data updating our Delta Lake table). %sql CREATE OR REPLACE TEMPORARY VIEW viewPE AS select ticker, price_date, first(close) as price, (close/eps_basic_net) as pe from priceWithFundamentals where eps_basic_net > 0 group by ticker, price_date, pe **Analyze streaming stock data in real time** With our view in place, we can quickly analyze our data using Spark SQL. %sql select - from viewPE where ticker == “AAPL” order by price_date ----- As the underlying source of this consolidated data set is a Delta Lake table, this view isn’t just showing the batch data but also any new streams of data that are coming in as per the following streaming dashboard. Underneath the covers, Structured Streaming isn’t just writing the data to Delta Lake tables but also keeping the state of the distinct number of keys (in this case ticker symbols) that need to be tracked. Because you are using Spark SQL, you can execute aggregate queries at scale and in real time. %sql SELECT ticker, AVG(close) as Average_Close FROM priceWithFundamentals GROUP BY ticker ORDER BY Average_Close In closing, we demonstrated how to simplify streaming stock data analysis using [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) . By combining Spark Structured Streaming and Delta Lake, we can use the Databricks integrated workspace to create a performant, scalable solution that has the advantages of both data lakes and data warehouses. 
The [Databricks Unified Data Platform](https://databricks.com/product/data-lakehouse) removes the data engineering complexities commonly associated with streaming and transactional consistency, enabling data engineering and data science teams to focus on understanding the trends in their stock data.

-----

**How Tilting Point Does Streaming Ingestion Into Delta Lake**

Tilting Point is a new-generation games partner that provides top development studios with expert resources, services and operational support to optimize high-quality live games for success. Through its user acquisition fund and its world-class technology platform, Tilting Point funds and runs performance marketing management and live games operations to help developers achieve profitable scale.

By using Delta Lake, Tilting Point is able to make quality data readily available for analytics to improve the business. Diego Link, VP of Engineering at Tilting Point, provided insights for this use case.

The team at Tilting Point was running daily and hourly batch jobs for reporting on game analytics. They wanted to make their reporting near real-time, getting insights within 5–10 minutes. They also wanted to base their in-game LiveOps decisions on real-time player behavior: feeding real-time data to a bundles-and-offer system, providing up-to-the-minute alerting on LiveOps changes that might have unforeseen detrimental effects, and even alerting on service interruptions in game operations. The goal was to ensure that the game experience was as robust as possible for their players. Additionally, they had to store encrypted Personally Identifiable Information (PII) data separately in order to maintain GDPR compliance.

-----

**How data flows and associated challenges**

Tilting Point has a proprietary software development kit that developers integrate with to send data from game servers to an ingest server hosted in AWS.
This service removes all PII data and then sends the raw data to an Amazon Firehose endpoint. Firehose then dumps the data in JSON format continuously to S3.

To clean up the raw data and make it available quickly for analytics, the team considered pushing the continuous data from Firehose to a message bus (e.g., Kafka, Kinesis) and then using [Apache Spark’s Structured Streaming](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) to continuously process data and write to Delta Lake tables. While that architecture sounds ideal for low latency requirements of processing data in seconds, Tilting Point didn’t have such low latency needs for their ingestion pipeline. They wanted to make the data available for analytics in a few minutes, not seconds. Hence they decided to simplify their architecture by eliminating the message bus and instead using S3 as a continuous source for their Structured Streaming job. But the key challenge in using S3 as a continuous source is identifying files that changed recently.

Listing all files every few minutes has two major issues:
- **Higher latency:** Listing all files in a directory with a large number of files has high overhead and increases processing time.
- **Higher cost:** Listing lots of files every few minutes can quickly add to the S3 cost.

**Leveraging Structured Streaming with blob store as source and Delta Lake tables as sink**

To continuously stream data from cloud blob storage like S3, Tilting Point uses [Databricks’ S3-SQS source](https://docs.databricks.com/spark/latest/structured-streaming/sqs.html#optimized-s3-file-source-with-sqs). The S3-SQS source provides an easy way to incrementally stream data from S3 without the need to write any state management code on what files were recently processed.
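To make the cost difference between the two discovery strategies concrete, here is a minimal pure-Python sketch. It is not the S3-SQS source itself: the function names, the file keys and the in-memory `deque` standing in for an SQS queue are all hypothetical, chosen only to show that a full listing touches every file while a notification queue touches only the new arrivals.

```python
from collections import deque


def discover_by_listing(all_keys, processed):
    """Full-listing approach: scan every key and keep the unseen ones."""
    scanned = 0
    new_keys = []
    for key in all_keys:  # work grows with the total number of files
        scanned += 1
        if key not in processed:
            new_keys.append(key)
    processed.update(new_keys)
    return new_keys, scanned


def discover_by_queue(queue):
    """Notification approach: drain only the messages for new files."""
    new_keys = []
    while queue:  # work grows with the number of new files only
        new_keys.append(queue.popleft())
    return new_keys, len(new_keys)


# Hypothetical bucket contents: 1,000 already-processed files, 3 new ones.
old_keys = [f"events/part-{i:05d}.json" for i in range(1000)]
arrivals = [f"events/part-{i:05d}.json" for i in range(1000, 1003)]

processed = set(old_keys)
listing_new, listing_scanned = discover_by_listing(old_keys + arrivals, processed)

notifications = deque(arrivals)  # stand-in for the queue fed by S3 events
queue_new, queue_scanned = discover_by_queue(notifications)

print(listing_new == queue_new, listing_scanned, queue_scanned)
```

Both strategies find the same three files, but the listing approach inspects all 1,003 keys to do so, which is the latency and cost overhead described above.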
-----

This is how Tilting Point’s ingestion pipeline looks:
- [Configure Amazon S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html) to send new file arrival information to SQS via SNS.
- Tilting Point uses the S3-SQS source to read the new data arriving in S3. The S3-SQS source reads the new file names that arrived in S3 from SQS and uses that information to read the actual file contents in S3. Example code:

      spark.readStream \
          .format("s3-sqs") \
          .option("fileFormat", "json") \
          .option("queueUrl", ...) \
          .schema(...) \
          .load()

- Tilting Point’s Structured Streaming job then cleans up and transforms the data. Based on the game data, the streaming job uses the foreachBatch API of Spark streaming and writes to 30 different Delta Lake tables.
- The streaming job produces lots of small files, which hurts the performance of downstream consumers. So, an optimize job runs daily to compact the small files in each table into appropriately sized files, so that consumers get good performance when reading from the Delta Lake tables. Tilting Point also runs a weekly optimize job for a second round of compaction.

Figure: Architecture showing continuous data ingest into Delta Lake tables

-----

The above Delta Lake ingestion architecture helps in the following ways:
- **Incremental loading:** The S3-SQS source incrementally loads the new files in S3. This helps quickly process the new files without too much overhead in listing files.
- **No explicit file state management:** There is no explicit file state management needed to look for recent files.
- **Lower operational burden:** Since we use S3 as a checkpoint between Firehose and Structured Streaming jobs, the operational burden to stop streams and reprocess data is relatively low.
- **Reliable ingestion:** Delta Lake uses [optimistic concurrency control](https://docs.databricks.com/delta/optimizations/isolation-level.html) to offer ACID transactional guarantees. This helps with reliable data ingestion.
- **File compaction:** One of the major problems with streaming ingestion is tables ending up with a large number of small files that can affect read performance. Before Delta Lake, we had to set up a different table to write the compacted data. With Delta Lake, thanks to ACID transactions, we can compact the files and rewrite the data back to the same table safely.
- **Snapshot isolation:** Delta Lake’s snapshot isolation allows us to expose the ingestion tables to downstream consumers while data is being appended by a streaming job and modified during compaction.
- **Rollbacks:** In case of bad writes, [Delta Lake’s Time Travel](https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html) helps us roll back to a previous version of the table.

In this section, we walked through Tilting Point’s use cases and how they use Databricks’ S3-SQS source to stream data into Delta Lake tables efficiently, without too much operational overhead, to make good quality data readily available for analytics.

-----

**Building a Quality of Service Analytics Solution for Streaming Video Services**

As traditional pay TV [continues to stagnate](https://nscreenmedia.com/us-tv-market-svod-exceed-pay-tv-2020/), content owners have embraced direct-to-consumer (D2C) subscription and ad-supported streaming for monetizing their libraries of content.
For companies whose entire business model revolved around producing great content, which they then licensed to distributors, the shift to now owning the entire glass-to-glass experience has required new capabilities, such as building media supply chains for content delivery to consumers, supporting apps for a myriad of devices and operating systems, and performing customer relationship functions like billing and customer service.

With most services renewing on a monthly basis, subscription service operators need to prove value to their subscribers at all times. General quality of streaming video issues (encompassing buffering, latency, pixelation, jitter, packet loss and the blank screen) have significant business impacts, whether it’s increased [subscriber churn](https://www.streamingmedia.com/Articles/ReadArticle.aspx?ArticleID=112209) or [decreased video engagement](https://www.tvtechnology.com/opinions/why-buffering-remains-every-video-providers-worst-nightmare).

When you start streaming, you realize there are so many places where breaks can happen and the viewer experience can suffer. There may be an issue at the source in the servers on-premises or in the cloud; in transit at either the CDN level or ISP level or the viewer’s home network; or at the playout level with player/client issues. What breaks at n × 10^4 concurrent streamers is different from what breaks at n × 10^5 or n × 10^6. There is no pre-release testing that can quite replicate real-world users and their ability to push even the most redundant systems to their breaking point as they channel surf, click in and out of the app, sign on from different devices simultaneously and so on. And because of the nature of TV, things will go wrong during the most important, high-profile events drawing the largest audiences.
If you start [receiving complaints on social media](https://downdetector.com/), how can you tell if they are unique to that one user or rather a regional or national issue? If national, is it across all devices or only certain types (e.g., possibly the OEM updated the OS on an older device type, which ended up causing compatibility issues with the client)?

Identifying, remediating and preventing viewer quality of experience issues becomes a big data problem when you consider the number of users, the number of actions they are taking and the number of handoffs in the experience (servers to CDN to ISP to home network to client). Quality of Service (QoS) helps make sense of these streams of data so you can understand what is going wrong, where and why. Eventually you can get into predictive analytics around what could go wrong and how to remediate it before anything breaks.

**Databricks Quality of Service solution overview**

The aim of this solution is to provide the core for any streaming video platform that wants to improve their QoS system. It is based on the [AWS Streaming Media Analytics Solution](https://github.com/awslabs/aws-streaming-media-analytics) provided by AWS Labs, which we extended by adding Databricks as a Unified Data Analytics Platform for both the real-time insights and the advanced analytics capabilities.

[By using Databricks](https://databricks.com/customers), streaming platforms can get faster insights by always leveraging the most complete and recent data sets powered by robust and reliable data pipelines. This decreases time to market for new features by accelerating data science using a collaborative environment. It provides support for managing the end-to-end machine learning lifecycle and reduces operational costs across all cycles of software development by having a unified platform for both data engineering and data science.
-----

**Video QoS solution architecture**

With complexities like low-latency monitoring alerts and highly scalable infrastructure required for peak video traffic hours, the straightforward architectural choice was the Delta Architecture. Standard big data architectures like the Lambda and Kappa Architectures both have disadvantages around the operational effort required to maintain multiple types of pipelines (streaming and batch), and they lack support for a unified data engineering and data science approach.

The Delta Architecture is the next-generation paradigm that enables all the data personas in your organization to be more productive:
- Data engineers can develop data pipelines in a cost-efficient manner continuously without having to choose between batch and streaming
- Data analysts can get near real-time insights and faster answers to their BI queries
- Data scientists can develop better machine learning models using more reliable data sets with support for time travel that facilitates reproducible experiments and reports

Figure: Delta Architecture using the “multi-hop” approach for data pipelines

-----

Writing data pipelines using the Delta Architecture follows the best practices of having a multi-layer “multi-hop” approach where we progressively add structure to data: “Bronze” tables or Ingestion tables are usually raw data sets in the native format (JSON, CSV or txt), “Silver” tables represent cleaned/transformed data sets ready for reporting or data science, and “Gold” tables are the final presentation layer.

For the pure streaming use cases, the option of materializing the DataFrames in intermediate Delta Lake tables is basically just a trade-off between latency/SLAs and cost (an example being real-time monitoring alerts vs. updates of the recommender system based on new content).
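The multi-hop idea can be illustrated without Spark. The sketch below is a deliberately simplified, pure-Python analogy: plain lists stand in for Delta Lake tables, and the event fields (`type`, `app_type`, `ts`), the sample payloads and the function names are hypothetical. It only shows how each hop adds structure: raw strings (Bronze), validated records (Silver), and a presentation-ready aggregate (Gold).

```python
import json
from collections import defaultdict

# Bronze: raw events exactly as received (hypothetical sample payloads).
bronze = [
    '{"type": "play", "app_type": "ios", "ts": 1}',
    '{"type": "error", "app_type": "ios", "ts": 2}',
    '{"type": "play", "app_type": "web", "ts": 3}',
    'not valid json',
]

REQUIRED_FIELDS = {"type", "app_type", "ts"}


def to_silver(raw_rows):
    """Parse and validate raw rows, dropping anything malformed."""
    silver_rows = []
    for raw in raw_rows:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a real pipeline might quarantine bad rows instead
        if isinstance(event, dict) and REQUIRED_FIELDS <= event.keys():
            silver_rows.append(event)
    return silver_rows


def to_gold(silver_rows):
    """Aggregate cleaned events into an error percentage per app type."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in silver_rows:
        totals[event["app_type"]] += 1
        if event["type"] == "error":
            errors[event["app_type"]] += 1
    return {app: 100.0 * errors[app] / totals[app] for app in totals}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In the actual architecture each hop would be a Delta Lake table written by a streaming or batch job; whether an intermediate hop is materialized at all is the latency versus cost trade-off described above.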
Figure: A streaming architecture can still be achieved while materializing DataFrames in Delta Lake tables

The number of “hops” in this approach is directly impacted by the number of consumers downstream, complexity of the aggregations (e.g., Structured Streaming enforces certain limitations around chaining multiple aggregations) and the maximization of operational efficiency.

The QoS solution architecture is focused around best practices for data processing and is not a full video-on-demand (VoD) solution. Some standard components, like the “front door” service Amazon API Gateway, are omitted from the high-level architecture in order to keep the focus on data and analytics.

-----

Figure: High-level architecture for the QoS platform

**Making your data ready for analytics**

Both sources of data included in the QoS solution (application events and CDN logs) use the JSON format, which is great for data exchange because it allows you to represent complex nested structures, but is not scalable and is difficult to maintain as a storage format for your data lake / analytics system.

In order to make the data directly queryable across the entire organization, the Bronze to Silver pipeline (the “make your data available to everyone” pipeline) should transform any raw formats into Delta Lake and include all the quality checks or data masking required by any regulatory agencies.

-----

Figure: Raw format of the app events

**Video applications events**

Based on the architecture, the video application events are pushed directly to Kinesis Streams and then just ingested to a Delta Lake append-only table without any changes to the schema.

Using this pattern allows a high number of consumers downstream to process the data in a streaming paradigm without having to scale the throughput of the Kinesis stream.
As a side effect of using a Delta Lake table as a sink (which supports [optimize](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html)!), we don’t have to worry about the way the size of the processing window will impact the number of files in your target table, known as the “small files” issue in the big data world.

Both the timestamp and the type of message are extracted from the JSON event in order to partition the data and allow consumers to choose the type of events they want to process. Again, combining a single Kinesis stream for the events with a Delta Lake “Events” table reduces the operational complexity while making it easier to scale during peak hours.

Figure: All the details are extracted from JSON for the Silver table

-----

**CDN logs**

The CDN logs are delivered to S3, so the easiest way to process them is the Databricks Auto Loader, which incrementally and efficiently processes new data files as they arrive in S3 without any additional setup.

    auto_loader_df = spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "json") \
        .option("cloudFiles.region", region) \
        .load(input_location)

    anonymized_df = auto_loader_df \
        .select('*', ip_anonymizer('requestip').alias('ip')) \
        .drop('requestip') \
        .withColumn("origin", map_ip_to_location(col('ip')))

    anonymized_df.writeStream \
        .option('checkpointLocation', checkpoint_location) \
        .format('delta') \
        .table(silver_database + '.cdn_logs')

As the logs contain IPs, which are considered personal data under the GDPR regulations, the “make your data available to everyone” pipeline has to include an anonymization step. Different techniques can be used, but we decided to just strip the last octet from IPv4 and the last 80 bits from IPv6. On top of that, the data set is also enriched with information about the origin country and the ISP, which will be used later in the Network Operation Centers for localization.
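The source does not show how `ip_anonymizer` is implemented, so the following is only a sketch of one way the truncation rule could be written, using Python’s standard `ipaddress` module. The function name `anonymize_ip` is hypothetical, and the sketch assumes that “stripping” means zeroing the trailing bits (the last 8 of an IPv4 address, the last 80 of an IPv6 address) rather than removing them from the string.

```python
import ipaddress


def anonymize_ip(raw_ip: str) -> str:
    """Zero the last octet of an IPv4 address or the last 80 bits of IPv6."""
    address = ipaddress.ip_address(raw_ip)
    if address.version == 4:
        total_bits, dropped_bits = 32, 8
    else:
        total_bits, dropped_bits = 128, 80
    # Build a mask with ones in the kept prefix and zeros in the dropped tail.
    mask = ((1 << total_bits) - 1) ^ ((1 << dropped_bits) - 1)
    masked = int(address) & mask
    if address.version == 4:
        return str(ipaddress.IPv4Address(masked))
    return str(ipaddress.IPv6Address(masked))


# Addresses from the documentation-only ranges, used purely as examples.
print(anonymize_ip("203.0.113.42"))
print(anonymize_ip("2001:db8:85a3:8d3:1319:8a2e:370:7348"))
```

To use something like this in the pipeline above, it would need to be wrapped as a Spark UDF; that wiring is omitted here because it depends on the cluster setup.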
-----

**Creating the Dashboard / Virtual Network Operation Centers**

Streaming companies need to monitor network performance and the user experience as close to real time as possible, tracking down to the individual level with the ability to abstract at the segment level, easily defining new segments such as those defined by geos, devices, networks and/or current and historical viewing behavior.

For streaming companies that has meant adopting the concept of Network Operation Centers (NOC) from telco networks for monitoring the health of the streaming experience for their users at a macro level, flagging and responding to any issues early on. At their most basic, NOCs should have dashboards that compare the current experience for users against a performance baseline so that the product teams can quickly and easily identify and attend to any service anomalies.

In the QoS solution we have incorporated a [Databricks dashboard](https://docs.databricks.com/notebooks/dashboards.html). BI tools can also be effortlessly connected in order to build more complex visualizations, but based on customer feedback, built-in dashboards are, most of the time, the fastest way to present the insights to business users.

The aggregated tables for the NOC will basically be the Gold layer of our Delta Architecture: a combination of CDN logs and the application events.

Figure: Example of Network Operations Center dashboard

-----

The dashboard is just a way to visually package the results of SQL queries or Python / R transformations. Each notebook supports multiple dashboards, so in case of multiple end users with different requirements we don’t have to duplicate the code. As a bonus, the refresh can also be scheduled as a Databricks job.
Figure: Visualization of the results of a SQL query

Loading time for videos (time to first frame) allows better understanding of the performance for individual locations of your CDN (in this case the AWS CloudFront Edge nodes), which has a direct impact on your strategy for improving this KPI, either by spreading the user traffic over multiple CDNs or maybe just by implementing a dynamic origin selection in the case of AWS CloudFront using Lambda@Edge.

-----

Failure to understand the reasons for high levels of buffering, and the poor video quality experience that it brings, has a significant impact on subscriber churn rate. On top of that, advertisers are not willing to spend money on ads that reduce viewer engagement because they add extra buffering, so the profits of the advertising business are usually impacted too. In this context, collecting as much information as possible from the application side is crucial to allow the analysis to be done not only at the video level but also by browser or even by type / version of application.

On the content side, events for the application can provide useful information about user behavior and overall quality of experience. How many people that paused a video have actually finished watching that episode / video? What caused the stoppage: the quality of the content or delivery issues? Of course, further analyses can be done by linking all the sources together (user behavior, performance of CDNs / ISPs) to not only create a user profile but also to forecast churn.

-----

**Creating (near) real-time alerts**

When dealing with the velocity, volume and variety of data generated in video streaming from millions of concurrent users, dashboard complexity can make it harder for human operators in the NOC to focus on the most important data at the moment and zero in on root cause issues.
With this solution, you can easily set up automated alerts when performance crosses certain thresholds that can help the human operators of the network as well as set off automatic remediation protocols via a Lambda function. For example:
- If a CDN is having latency much higher than baseline (e.g., if it’s more than 10% latency vs. baseline average), initiate automatic CDN traffic shifts.
- If more than [some threshold, e.g., 5%] of clients report playback errors, alert the product team that there is likely a client issue for a specific device.
- If viewers on a certain ISP are having higher-than-average buffering and pixelation issues, alert frontline customer representatives on responses and ways to decrease issues (e.g., set stream quality lower).

From a technical perspective, generating real-time alerts requires a streaming engine capable of processing data in real time and a publish-subscribe service to push notifications.

Figure: Integrating microservices using Amazon SNS and Amazon SQS

The QoS solution implements the [AWS best practices for integrating microservices](https://docs.aws.amazon.com/whitepapers/latest/microservices-on-aws/introduction.html) by using Amazon SNS and its integrations with Amazon Lambda (see below for the updates of web applications) or Amazon SQS for other consumers. The [custom foreach writer](https://docs.databricks.com/spark/latest/structured-streaming/foreach.html) option makes it really straightforward to write a pipeline that sends email notifications based on a rule-based engine (e.g., validating the percentage of errors for each individual type of app over a period of time).

    import boto3

    def send_error_notification(row):
        sns_client = boto3.client('sns', region)
        error_message = 'Number of errors for the App has exceeded the threshold {}'.format(row['percentage'])
        response = sns_client.publish(
            TopicArn=...,   # value elided in the source
            Message=error_message,
            Subject=...,    # value elided in the source
            MessageStructure='string')

    # Structured Streaming Job
    getKinesisStream("player_events") \
        .selectExpr("type", "app_type") \
        .groupBy("app_type") \
        .apply(calculate_error_percentage) \
        .where("percentage > {}".format(threshold)) \
        .writeStream \
        .foreach(send_error_notification) \
        .start()

Figure: Sending email notifications using AWS SNS

-----

On top of the basic email use case, the Demo Player includes three widgets updated in real time using AWS AppSync: the number of active users, the most popular videos and the number of users concurrently watching a video.

Figure: Updating the application with the results of real-time aggregations

The QoS solution applies a similar approach (Structured Streaming and Amazon SNS) to update all the values, allowing for extra consumers to be plugged in using AWS SQS. This is a common pattern when huge volumes of events have to be enhanced and analyzed: pre-aggregate data once and allow each service (consumer) to make its own decision downstream.

**Next steps: machine learning**

Manually making sense of the historical data is important but is also very slow. If we want to be able to make automated decisions in the future, we have to integrate machine learning algorithms.

As a Unified Data Platform, Databricks empowers data scientists to build better data science products using features like Runtime for Machine Learning with built-in support for [Hyperopt](https://docs.databricks.com/applications/machine-learning/automl-hyperparam-tuning/index.html#hyperopt-overview) / [Horovod](https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/horovod-runner.html) / [AutoML](https://databricks.com/product/automl-on-databricks) or the integration with MLflow, the end-to-end machine learning lifecycle management tool.

-----

We have already explored a few important use cases across our customer base while focusing on the possible extensions to the QoS solution.
**Point-of-failure prediction and remediation**

As D2C streamers reach more users, the costs of even momentary loss of service increase. ML can help operators move from reporting to prevention by forecasting where issues could come up and remediating before anything goes wrong (e.g., a spike in concurrent viewers leads to switching CDNs to one with more capacity automatically).

**Customer churn**

Critical to growing subscription services is keeping the subscribers you have. By understanding the quality of service at the individual level, you can add QoS as a variable in churn and customer lifetime value models. Additionally, you can create customer cohorts for those who have had video quality issues in order to test proactive messaging and save offers.

**Getting started with the Databricks streaming video QoS solution**

Providing consistent quality in the streaming video experience is table stakes at this point to keep fickle audiences with ample entertainment options on your platform. With this solution we have sought to create a quick start for most streaming video platform environments to embed this QoS real-time streaming analytics solution in a way that:
1. Scales to any audience size
2. Quickly flags quality performance issues at key parts of the distribution workflow
3. Is flexible and modular enough to easily customize for your audience and your needs, such as creating new automated alerts or enabling data scientists to test and roll out predictive analytics and machine learning

To get started, download the notebooks for the [Databricks streaming video QoS solution](https://databricks.com/notebooks/QoS/index.html#00.config.html). For more guidance on how to unify batch and streaming data into a single system, view the [Delta Architecture webinar](https://pages.databricks.com/201908-WB-Delta-Architecture-A-Step-Beyond-Lambda-Architecture_Reg.html).
-----

## CHAPTER 05

**Customer Use Cases**

See how customers are using Delta Lake to rapidly innovate

-----

**Healthdirect Australia Provides Personalized and Secure Online Patient Care With Databricks**

As the shepherds of the National Health Services Directory (NHSD), Healthdirect is focused on leveraging terabytes of data covering time-driven, activity-based healthcare transactions to improve health care services and support. With governance requirements, siloed teams and a legacy system that was difficult to scale, they moved to Databricks. This boosted data processing for downstream machine learning while improving data security to meet HIPAA requirements.

**Spotlight on Healthdirect**

- **Industry:** Healthcare and life sciences
- **6x** improvement in data processing
- **20M** records ingested in minutes

**Data quality and governance issues, silos, and the inability to scale**

Due to regulatory pressures, Healthdirect Australia set forth to improve overall data quality and ensure a level of governance on top of that, but they ran into challenges when it came to data storage and access. On top of that, data silos were blocking the team from efficiently preparing data for downstream analytics. These disjointed data sources impacted the consistency of data reads, as data was oftentimes out-of-sync between the various systems in their stack. The low-quality data also led to higher error rates and processing inefficiencies. This fragmented architecture created significant operational overhead and limited their ability to have a comprehensive view of the patient.

Further, they needed to ingest over 1 billion data points due to a changing landscape of customer demand such as bookings, appointments, pricing, eHealth transaction activity, etc., estimated at over 1TB of data.

“We had a lot of data challenges. We just couldn’t process efficiently enough. We were starting to get batch overruns.
We were starting to see that a 24-hour window isn’t the most optimum time in which we want to be able to deliver healthcare data and services,” explained Peter James, Chief Architect at Healthdirect Australia.

Ultimately, Healthdirect realized they needed to modernize their end-to-end process and tech stack to properly support the business.

**Modernizing analytics with Databricks and Delta Lake**

Databricks provides Healthdirect Australia with a Unified Data Platform that simplifies data engineering and accelerates data science innovation. The notebook environment enables them to make content changes in a controlled fashion rather than having to run bespoke jobs each time.

“Databricks has provided a big uplift for our teams and our data operations,” said James. “The analysts were working directly with the data operations teams. They are able to achieve the same pieces of work together within the same time frames that used to take twice as long. They’re working together, and we’re seeing just a massive acceleration in the speed at which we can deliver service.”

-----

With Delta Lake, they’ve created logical data zones: Landing, Raw, Staging and Gold. Within these zones, they store their data “as is,” in their structured or unstructured state, in Delta Lake tables. From there, they use a metadata-driven schema and hold the data within a nested structure within that table. This allows them to handle data consistently from every source and simplifies the mapping of data to the various applications pulling the data.

Meanwhile, through Structured Streaming, they were able to convert all of their ETL batch jobs into streaming ETL jobs that could serve multiple applications consistently. Overall, the advent of Spark Structured Streaming, Delta Lake and the Databricks Unified Data Platform provides significant architectural improvements that have boosted performance, reduced operational overheads and increased process efficiencies.
**Faster data pipelines result in better patient-driven healthcare** As a result of the performance gains delivered by Databricks and the improved data reliability through Delta Lake, Healthdirect Australia improved the accuracy of their fuzzy name match algorithm from less than 80% with manual verification to 95% with no manual intervention. The processing improvements with Delta Lake and Structured Streaming allowed them to process more than 30,000 automated updates per month. Prior to Databricks, they had to use unreliable batch jobs that were highly manual to process the same number of updates over a span of 6 months, a 6x improvement in data processing. They were also able to increase their data load rate to 1 million records per minute, loading their entire 20 million record data set in 20 minutes. Before the adoption of Databricks, processing the same 1 million transactions took more than 24 hours, blocking analysts from making swift decisions to drive results. Last, data security, which was critical to meet compliance requirements, was greatly improved. Databricks provides standard security accreditations like HIPAA, and Healthdirect was able to use Databricks to meet Australia’s security requirements. This yielded significant cost reductions and gave them continuous data assurance by monitoring changes to access privileges like changes in roles, metadata-level security changes, data leakage, etc. “Databricks delivered the time to market as well as the analytics and operational uplift that we needed in order to be able to meet the new demands of the healthcare sector,” said James. With the help of Databricks, they have proven the value of data and analytics and how it can impact their business vision. 
With transparent access to data that boasts well-documented lineage and quality, participation across various business and analyst groups has increased — empowering teams to collaborate and more easily and quickly extract value from their data with the goal of improving healthcare for everyone. ----- **Comcast** Uses Delta Lake and MLflow to Transform the Viewer Experience **Spotlight on Comcast** **Industry:** Media and entertainment 10x Reduction in overall compute costs to process data 90% Reduction in required DevOps resources to manage infrastructure Reduced Deployment times from weeks to minutes As a global technology and media company connecting millions of customers to personalized experiences, Comcast struggled with massive data, fragile data pipelines and poor data science collaboration. With Databricks — leveraging Delta Lake and MLflow — they can build performant data pipelines for petabytes of data and easily manage the lifecycle of hundreds of models to create a highly innovative, unique and award-winning viewer experience using voice recognition and machine learning. ----- **Infrastructure unable to support data and ML needs** Instantly answering a customer’s voice request for a particular program while turning billions of individual interactions into actionable insights, strained Comcast’s IT infrastructure and data analytics and data science teams. To make matters more complicated, Comcast needed to deploy models to a disjointed and disparate range of environments: cloud, on-premises and even directly to devices in some instances. - **Massive data:** Billions of events generated by the entertainment system and 20+ million voice remotes, resulting in petabytes of data that need to be sessionized for analysis. - **Fragile pipelines:** Complicated data pipelines that frequently failed and were hard to recover. Small files were difficult to manage, slowing data ingestion for downstream machine learning. 
- **Poor collaboration:** Globally dispersed data scientists working in different scripting languages struggled to share and reuse code. - **Manual management of ML models:** Developing, training and deploying hundreds of models was highly manual, slow and hard to replicate, making it difficult to scale. - **Friction between dev and deployment:** Dev teams wanted to use the latest tools and models while ops wanted to deploy on proven infrastructure. ----- **Automated infrastructure, faster data pipelines with Delta Lake** Comcast realized they needed to modernize their entire approach to analytics, from data ingest to the deployment of machine learning models to the delivery of new features that delight their customers. Today, the Databricks Unified Data Platform enables Comcast to build rich data sets and optimize machine learning at scale, streamline workflows across teams, foster collaboration, reduce infrastructure complexity, and deliver superior customer experiences. - **Simplified infrastructure management:** Reduced operational costs through automated cluster management and cost management features such as autoscaling and spot instances. - **Performant data pipelines:** Delta Lake is used for the ingest, data enrichment and initial processing of the raw telemetry from video and voice applications and devices. - **Reliably manage small files:** Delta Lake enabled them to optimize files for rapid and reliable ingestion at scale. - **Collaborative workspaces:** Interactive notebooks improve cross-team collaboration and data science creativity, allowing Comcast to greatly accelerate model prototyping for faster iteration. - **Simplified ML lifecycle:** Managed MLflow simplifies the machine learning lifecycle and model serving via the Kubeflow environment, allowing them to track and manage hundreds of models with ease. 
- **Reliable ETL at scale:** Delta Lake provides efficient analytics pipelines at scale that can reliably join historic and streaming data for richer insights. ----- **Delivering personalized experiences with ML** In the intensely competitive entertainment industry, there is no time to press the Pause button. Armed with a unified approach to analytics, Comcast can now fast-forward into the future of AI-powered entertainment, keeping viewers engaged and delighted with competition-beating customer experiences. - **Emmy-winning viewer experience:** Databricks helps enable Comcast to create a highly innovative and award-winning viewer experience with intelligent voice commands that boost engagement. - **Reduced compute costs by 10x:** Delta Lake has enabled Comcast to optimize data ingestion, replacing 640 machines with 64 while improving performance. Teams can spend more time on analytics and less time on infrastructure management. - **Less DevOps:** Reduced the number of DevOps full-time employees required for onboarding 200 users from 5 to 0.5. - **Higher data science productivity:** Fostered collaboration between global data scientists by enabling different programming languages through a single interactive workspace. Also, Delta Lake has enabled the data team to use data at any point within the data pipeline, allowing them to act more quickly in building and training new models. - **Faster model deployment:** Reduced deployment times from weeks to minutes as operations teams deployed models on disparate platforms. ----- **Banco Hipotecario** Personalizes the Banking Experience With Data and ML Banco Hipotecario, a leading Argentinian commercial bank, is on a mission to leverage machine learning to deliver new insights and services that will delight customers and create upsell opportunities. With a legacy analytics and data warehousing system that was rigid and complex to scale, they turned to Databricks to unify data science, engineering and analytics. 
As a result of this partnership, they were able to significantly increase customer acquisition and cross-sells while lowering the cost for acquisition, greatly impacting overall customer retention and profitability. **Spotlight on Banco Hipotecario** **Industry:** Financial services 35% Reduction in cost of acquisition **Technical use cases:** Ingest and ETL, machine learning and SQL Analytics ----- **Legacy analytics tools are slow, rigid and impossible to scale** Banco Hipotecario set forth to increase customer acquisition by reducing risk and improving the customer experience. With data analytics and machine learning anchoring their strategy, they hoped to influence a range of use cases from fraud detection and risk analysis to serving product recommendations to drive upsell and cross-sell opportunities and forecast sales. Banco Hipotecario faced a number of the challenges that often come along with outdated technology and processes: disorganized or inaccurate data; poor cross-team collaboration; the inability to innovate and scale; resource-intensive workflows; the list goes on. “In order to execute on our data analytics strategy, new technologies were needed in order to improve data engineering and boost data science productivity,” said Daniel Sanchez, Enterprise Data Architect at Banco Hipotecario. “The first steps we took were to move to a cloud-based data lake, which led us to Azure Databricks and Delta Lake.” ----- **A unified platform powers the data lake and easy collaboration** Banco Hipotecario turned to Databricks to modernize their data warehouse environment, improve cross-team collaboration, and drive data science innovation. Fully managed in Microsoft Azure, they were able to easily and reliably ingest massive volumes of data, spinning up their whole infrastructure in 90 days. With Databricks’ automated cluster management capabilities, they are able to scale clusters on demand to support large workloads. 
Delta Lake has been especially useful in bringing reliability and performance to Banco Hipotecario’s data lake environment. With Delta Lake, they are now able to build reliable and performant ETL pipelines like never before. Meanwhile, performing SQL Analytics on Databricks has helped them do data exploration, cleansing and generate data sets in order to create models, enabling the team to deploy their first model within the first three months, and the second model generated was rolled out in just two weeks. At the same time, data scientists were finally able to collaborate, thanks to interactive notebooks; this meant faster builds, training and deployment. And MLflow streamlined the ML lifecycle and removed the overreliance on data engineering. “Databricks gives our data scientists the means to easily create our own experiments and deploy them to production in weeks, rather than months,” said Miguel Villalba, Head of Data Engineering and Data Science. ----- **An efficient team maximizes customer** **acquisition and retention** Since moving to Databricks, the data team at Banco Hipotecario could not be happier, as Databricks has unified them across functions in an integrated fashion. The results of data unification and markedly improved collaboration and autonomy cannot be overstated. Since deploying Databricks, Banco Hipotecario has increased their cross-sell into new products by a whopping 90%, while machine learning has reduced the cost of customer acquisition by 35%. ----- **Viacom18** Migrates From Hadoop to Databricks to Deliver More Engaging Experiences Viacom18 Media Pvt. Ltd. is one of India’s fastest-growing entertainment networks with 40x growth over the past decade. They offer multi-platform, multigenerational and multicultural brand experiences to 600+ million monthly viewers. 
In order to deliver more engaging experiences for their millions of viewers, Viacom18 migrated from their Hadoop environment due to its inability to process data at scale efficiently. With Databricks, they have streamlined their infrastructure management, increased data pipeline speeds and increased productivity among their data teams. Today, Viacom18 is able to deliver more relevant viewing experiences to their subscribers, while identifying opportunities to optimize the business and drive greater ROI. **Spotlight on Viacom18** **Industry:** Media and entertainment 26% Increase in operational efficiency lowers overall costs ----- **Growth in subscribers and terabytes of viewing data** **push Hadoop to its limits** Viacom18, a joint venture between Network18 and ViacomCBS, is focused on providing its audiences with highly personalized viewing experiences. The core of this strategy requires implementing an enterprise data architecture that enables the building of powerful customer analytics on daily viewer data. But with millions of consumers across India, the sheer amount of data was tough to wrangle: They were tasked with ingesting and processing over 45,000 hours of daily content on VOOT (Viacom18’s on-demand video subscription platform), which easily generated 700GB to 1TB of data per day. “Content is at the heart of what we do,” explained Parijat Dey, Viacom18’s Assistant Vice President of Digital Transformation and Technology. “We deliver personalized content recommendations across our audiences around the world based on individual viewing history and preferences in order to increase viewership and customer loyalty.” Viacom18’s data lake, which was leveraging on-premises Hadoop for operations, wasn’t able to optimally process 90 days of rolling data within their management’s defined SLAs, limiting their ability to deliver on their analytics needs, which impacted not only the customer experience but also overall costs. 
To meet this challenge head-on, Viacom18 needed a modern data warehouse with the ability to analyze data trends for a longer period of time instead of daily snapshots. They also needed a platform that simplified infrastructure by allowing their team to easily provision clusters with features like auto-scaling to help reduce compute costs. ----- **Rapid data processing for analytics** **and ML with Databricks** To enable the processing power and data science capabilities they required, Viacom18 partnered with Celebal Technologies, a premier Salesforce, data analytics and big data consulting organization based in India. The team at Celebal leveraged Azure Databricks to provide Viacom18 with a unified data platform that modernizes its data warehousing capabilities and accelerates data processing at scale. The ability to cache data within Delta Lake resulted in the much-needed acceleration of queries, while cluster management with auto-scaling and the decoupling of storage and compute simplified Viacom18’s infrastructure management and optimized operational costs. “Delta Lake has created a streamlined approach to the management of data pipelines,” explained Dey. “This has led to a decrease in operational costs while speeding up time-to-insight for downstream analytics and data science.” The notebooks feature was an unexpected bonus for Viacom18, as a common workspace gave data teams a way to collaborate and increase productivity on everything from model training to ad hoc analysis, dashboarding and reporting via PowerBI. ----- **Leveraging viewer data to power personalized** **viewing experiences** Celebal Technologies and Databricks have enabled Viacom18 to deliver innovative customer solutions and insights with increased cross-team collaboration and productivity. With Databricks, Viacom18’s data team is now able to seamlessly navigate their data while better serving their customers. 
“With Databricks, Viacom18’s engineers can now slice and dice large volumes of data and deliver customer behavioral and engagement insights to the analysts and data scientists,” said Dey. In addition to performance gains, the faster query times have also lowered the overall cost of ownership, even with daily increases in data volumes. “Azure Databricks has greatly streamlined processes and improved productivity by an estimated 26%,” concluded Dey. Overall, Dey says the migration from Hadoop to Databricks has delivered significant business value: reducing the cost of failure, accelerating processing speeds at scale, and simplifying ad hoc analysis for easier data exploration and innovations that deliver highly engaging customer experiences. ----- # What’s next? Now that you understand Delta Lake, it may be time to take a look at some additional resources. **Do a deep dive into Delta Lake >** - [Getting Started With Delta Lake Tech Talk Series](https://databricks.com/discover/getting-started-with-delta-lake-tech-talks) - [Diving Into Delta Lake Tech Talk Series](https://databricks.com/discover/diving-into-delta-lake-talks) - [Visit the site](https://databricks.com/product/delta-lake-on-databricks) for additional resources **[Try Databricks for free >](https://databricks.com/try-databricks)** **[Learn more >](https://pages.databricks.com/delta-lake-open-source-reliability-for-data-lakes-reg.html)** ----- SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/030521-2-The-Delta-Lake-Series-Complete-Collection.pdf 2024-09-19T16:57:19Z **eBook** ## The Data Team’s Guide to the Databricks Lakehouse Platform ----- #### Contents

- **Chapter 1:** The data lakehouse
- **Chapter 2:** The Databricks Lakehouse Platform
- **Chapter 3:** Data reliability and performance
- **Chapter 4:** Unified governance and sharing for data, analytics and AI
- **Chapter 5:** Security
- **Chapter 6:** Instant compute and serverless
- **Chapter 7:** Data warehousing
- **Chapter 8:** Data engineering
- **Chapter 9:** Data streaming
- **Chapter 10:** Data science and machine learning
- **Chapter 11:** Databricks Technology Partners and the modern data stack
- **Chapter 12:** Get started with the Databricks Lakehouse Platform

----- **Introduction** #### The Data Team’s Guide to the Databricks Lakehouse Platform _The Data Team’s Guide to the Databricks Lakehouse Platform_ is designed for data practitioners and leaders who are embarking on their journey into the data lakehouse architecture. In this eBook, you will learn the full capabilities of the data lakehouse architecture and how the Databricks Lakehouse Platform helps organizations of all sizes (from enterprises to startups in every industry) with all their data, analytics, AI and machine learning use cases on one platform. You will see how the platform combines the best elements of data warehouses and data lakes to increase the reliability, performance and scalability of your data platform. Discover how the lakehouse simplifies complex workloads in data engineering, data warehousing, data streaming, data science and machine learning, and bolsters collaboration for your data teams, allowing them to maintain new levels of governance, flexibility and agility in an open and multicloud environment. ----- ### Chapter 01: The data lakehouse ----- #### The evolution of data architectures Data has moved front and center within every organization as data-driven insights have fueled innovation, competitive advantage and better customer experiences. However, as companies place mandates on becoming more data-driven, their data teams are left in a sprint to deliver the right data for business insights and innovation. With the widespread adoption of cloud, data teams often invest in large-scale complex data systems that have capabilities for streaming, business intelligence, analytics and machine learning to support the overall business objectives. To support these objectives, data teams have deployed cloud data warehouses and data lakes. 
Traditional data systems: The data warehouse and data lake With the advent of big data, companies began collecting large amounts of data from many different sources, such as weblogs, sensor data and images. Data warehouses — which have a long history as the foundation for decision support and business intelligence applications — cannot handle large volumes of data. While data warehouses are great for structured data and historical analysis, they weren’t designed for unstructured data, semi-structured data, and data with high variety, velocity and volume, making them unsuitable for many types of data. This led to the introduction of data lakes, providing a single repository of raw data in a variety of formats. While suitable for storing big data, data lakes do not support transactions, nor do they enforce data quality, and their lack of consistency/isolation makes it almost impossible to read, write or process data. For these reasons, many of the promises of data lakes never materialized and, in many cases, reduced the benefits of data warehouses. As companies discovered new use cases for data exploration, predictive modeling and prescriptive analytics, the need for a single, flexible, high-performance system only grew. Data teams require systems for diverse data applications including SQL analytics, real-time analytics, data science and machine learning. ----- To solve for new use cases and new users, a common approach is to use multiple systems — a data lake, several data warehouses and other specialized systems such as streaming, time-series, graph and image databases. But having multiple systems introduces complexity and delay, as data teams invariably need to move or copy data between different systems, effectively losing oversight and governance over data usage. You have now duplicated data in two different systems and the changes you make in one system are unlikely to find their way to the other. 
So, you are going to have data drift almost immediately, not to mention paying to store the same data multiple times. Then, because governance is happening at two distinct levels across these platforms, you are not able to control things consistently. **Challenges with data, analytics and AI** In a recent [Accenture](https://www.accenture.com/_acnmedia/pdf-108/accenture-closing-data-value-gap-fixed.pdf) study, only 32% of companies reported tangible and measurable value from data. The challenge is that most companies continue to implement two different platforms: data warehouses for BI and data lakes for AI. These platforms are incompatible with each other, but data from both systems is generally needed to deliver game-changing outcomes, which makes success with AI extremely difficult. Today, most of the data is landing in the data lake, and a lot of it is unstructured. In fact, according to [IDC](https://www.idc.com/getdoc.jsp?containerId=US47998321) , about 80% of the data in any organization will be unstructured by 2025. But, this data is where much of the value from AI resides. Subsets of the data are then copied to the data warehouse into structured tables, and back again in some cases. You also must secure and govern the data in both warehouses and offer fine-grained governance, while lakes tend to be coarser grained at the file level. Then, you stand up different stacks of tools on these platforms to do either BI or AI. ----- Finally, the tool stacks on top of these platforms are fundamentally different, which makes it difficult to get any kind of collaboration going between the teams that support them. This is why AI efforts fail. There is a tremendous amount of complexity and rework being introduced into the system. Time and resources are being wasted trying to get the right data to the right people, and everything is happening too slowly to get in front of the competition. 
[Figure: Realizing this requires two disparate, incompatible data platforms. Labels: Business intelligence; SQL analytics; Data science and ML; Data streaming; Incomplete support for use cases; Incomplete support for SQL analytics; Incompatible security and governance models; Copy subsets of data; Disjointed and duplicative data silos. Data warehouse: structured tables, with governance and security through table ACLs. Data lake: unstructured files (logs, text, images, video), with governance and security through files and blobs.] ----- **Moving forward with a lakehouse architecture** To satisfy the need to support AI and BI directly on vast amounts of data stored in data lakes (on low-cost cloud storage), a new data management architecture emerged independently across many organizations and use cases: the data lakehouse. The data lakehouse can store _all_ and _any_ type of data once in a data lake and make that data accessible directly for AI and BI. The lakehouse paradigm has specific capabilities to efficiently allow both AI and BI on all the enterprise’s data at a massive scale. Namely, it has the SQL and performance capabilities such as indexing, caching and MPP processing to make BI work fast on data lakes. It also has direct file access and direct native support for Python, data science and AI frameworks without the need for a separate data warehouse. In short, a lakehouse is a data architecture that combines the best elements of data warehouses and data lakes. Lakehouses are enabled by a new system design, which implements similar data structures and data management features found in a data warehouse directly on the low-cost storage used for data lakes. 
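One way to picture how warehouse-style transaction guarantees can sit directly on plain file storage is an append-only commit log. The sketch below is a simplified illustration in plain Python on a local directory, not Delta Lake's actual implementation; the `TinyLog` class, its file naming and its JSON layout are assumptions made for the example. Each commit is a numbered file created with exclusive-create semantics, so two writers can never both claim the same version, and readers only see data files listed in committed versions.

```python
import json
import os
import tempfile


class TinyLog:
    """Toy append-only commit log over a directory (illustrative only)."""

    def __init__(self, root: str) -> None:
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version: int) -> str:
        return os.path.join(self.log_dir, f"{version:08d}.json")

    def latest_version(self) -> int:
        """Return the highest committed version, or -1 if the log is empty."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, expected_version: int, added_files: list[str]) -> bool:
        """Try to commit on top of expected_version; False means a conflict."""
        try:
            # Mode "x" fails if the file already exists, so only one writer
            # can create a given version number.
            with open(self._path(expected_version + 1), "x") as f:
                json.dump({"add": added_files}, f)
            return True
        except FileExistsError:
            return False

    def snapshot(self) -> list[str]:
        """Replay every commit in order to list the visible data files."""
        files: list[str] = []
        for version in range(self.latest_version() + 1):
            with open(self._path(version)) as f:
                files.extend(json.load(f)["add"])
        return files


root = tempfile.mkdtemp()
log = TinyLog(root)
base = log.latest_version()                    # -1: nothing committed yet
first = log.commit(base, ["part-0.parquet"])   # writer A claims version 0
second = log.commit(base, ["part-1.parquet"])  # writer B conflicts on version 0
retry = log.commit(log.latest_version(), ["part-1.parquet"])  # B retries on top
visible = log.snapshot()
```

A real system would also need to handle partially written commits, deletes and schema metadata; the point here is only the exclusive-create step that gives concurrent writers a consistent order.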
----- ##### Data lakehouse One platform to unify all your data, analytics and AI workloads ###### Lakehouse Platform All machine learning, SQL, BI, and streaming use cases One security and governance approach for all data assets on all clouds ----- **Key features for a lakehouse** Recent innovations with the data lakehouse architecture can help simplify your data and AI workloads, ease collaboration for data teams, and maintain the kind of flexibility and openness that allows your organization to stay agile as you scale. Here are key features to consider when evaluating data lakehouse architectures: Transaction support: In an enterprise lakehouse, many data pipelines will often be reading and writing data concurrently. Support for ACID (Atomicity, Consistency, Isolation and Durability) transactions ensures consistency as multiple parties concurrently read or write data. Schema enforcement and governance: The lakehouse should have a way to support schema enforcement and evolution, supporting data warehouse schema paradigms such as star/snowflake. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms. Data governance: Capabilities including auditing, retention and lineage have become essential, particularly considering recent privacy regulations. Tools that allow data discovery have become popular, such as data catalogs and data usage metrics. BI support: Lakehouses allow the use of BI tools directly on the source data. This reduces staleness and latency, improves recency and lowers cost by not having to operationalize two copies of the data in both a data lake and a warehouse. Storage decoupled from compute: In practice, this means storage and compute use separate clusters, thus these systems can scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property. 
Openness: The storage formats, such as Apache Parquet, are open and standardized, so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly. Support for diverse data types (unstructured and structured): The lakehouse can be used to store, refine, analyze and access data types needed for many new data applications, including images, video, audio, semi-structured data and text. Support for diverse workloads: Use the same data repository for a range of workloads including data science, machine learning and SQL analytics. Multiple tools might be needed to support all these workloads. End-to-end streaming: Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications. **Learn more** **•** [Lakehouse: A New Generation of Open Platforms](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) [That Unify Data Warehousing and Advanced Analytics](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) **•** [Building the Data Lakehouse by Bill Inmon, Father of the](https://databricks.com/p/ebook/building-the-data-lakehouse) [Data Warehouse](https://databricks.com/p/ebook/building-the-data-lakehouse) **•** [What Is a Data Lakehouse?](https://databricks.com/glossary/data-lakehouse#:~:text=A%20data%20lakehouse%20is%20a,(ML)%20on%20all%20data.) ----- **CHAPTER** # 02 ### The Databricks Lakehouse Platform ----- #### Lakehouse: A new generation of open platforms ###### This is the lakehouse paradigm Databricks is the inventor and pioneer of the data lakehouse architecture. 
The data lakehouse architecture was coined in the research paper [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), introduced by Databricks’ founders, UC Berkeley and Stanford University at the 11th Conference on Innovative Data Systems Research (CIDR) in 2021. At Databricks, we are continuously innovating on the lakehouse architecture to help customers deliver on their data, analytics and AI aspirations. The ideal data, analytics and AI platform needs to operate differently. Rather than copying and transforming data in multiple systems, you need one platform that accommodates all data types. [Figure: the lakehouse paradigm. Labels: Persona-based use cases (Business intelligence; SQL analytics; Data science and ML; Data streaming); All ML, SQL, BI and streaming use cases; One security and governance approach for all data assets on all clouds; A reliable data platform to efficiently handle all data types; Unity Catalog: fine-grained governance for data and AI; Delta Lake: data reliability and performance; Files and blobs and table ACLs.] Ideally, the platform must be open, so that you are not locked into any walled gardens. You would also have one security and governance model. It would not only manage all data types, but it would also be cloud-agnostic to govern data wherever it is stored. Last, it would support all major data, analytics and AI workloads, so that your teams can easily collaborate and get access to all the data they need to innovate. ----- #### What is the Databricks Lakehouse Platform? The Databricks Lakehouse Platform unifies your data warehousing and AI use cases on a single platform. 
It combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes. This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science and machine learning. It’s built on open source and open standards to maximize flexibility. Its common approach to data management, security and governance helps you operate more efficiently and innovate faster.

_[Figure: Lakehouse Platform workloads: data warehousing, data engineering, data streaming, data science and ML.]_

-----

#### Benefits of the Databricks Lakehouse Platform

**Simple** The unified approach simplifies your data architecture by eliminating the data silos that traditionally separate analytics, BI, data science and machine learning. With a lakehouse, you can eliminate the complexity and expense that make it hard to achieve the full potential of your analytics and AI initiatives.

**Open** Delta Lake forms the open foundation of the lakehouse by providing reliability and performance directly on data in the data lake. You’re able to avoid proprietary walled gardens, easily share data and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network.

**Multicloud** The Databricks Lakehouse Platform offers you a consistent management, security and governance experience across all clouds. You do not need to invest in reinventing processes for every cloud platform that you are using to support your data and AI efforts. Instead, your data teams can simply focus on putting all your data to work to discover new insights.

-----

#### The Databricks Lakehouse Platform architecture

**Data reliability and performance for lakehouse**

[Delta Lake](https://databricks.com/product/delta-lake-on-databricks) is an open format storage layer built for the lakehouse that integrates with all major analytics tools and works with the widest variety of formats to store and process data.

[Photon](https://databricks.com/product/photon) is the next-generation query engine built for the lakehouse that leverages a state-of-the-art vectorized engine for fast querying and provides the best performance for all workloads in the lakehouse. In Chapter 3, we explore the details of data reliability and performance for the lakehouse.

**Instant compute and serverless**

Serverless compute is a fully managed service where Databricks provisions and manages the compute layer on behalf of the customer in the Databricks cloud account instead of the customer account. As of the current release, serverless compute is supported for use with Databricks SQL. In Chapter 6, we explore the details of instant compute and serverless for lakehouse.

**Unified governance and security for lakehouse**

The Databricks Lakehouse Platform provides unified governance with enterprise scale, security and compliance. The [Databricks Unity Catalog](https://databricks.com/product/unity-catalog) (UC) provides governance for your data and AI assets in the lakehouse (files, tables, dashboards and machine learning models), giving you much better control, management and security across clouds.

[Delta Sharing](https://databricks.com/product/delta-sharing) is an open protocol that allows companies to securely share data across the organization in real time, independent of the platform on which the data resides. In Chapter 4, we go into the details of unified governance for lakehouse and, in Chapter 5, we dive into the details of security for lakehouse.

-----

#### The Databricks Lakehouse Platform workloads

The Databricks Lakehouse Platform architecture supports different workloads such as data warehousing, data engineering, data streaming, data science and machine learning on one simple, open and multicloud data platform.

**Data warehousing**

Data warehousing is one of the most business-critical workloads for data teams, and the best data warehouse is a lakehouse. The Databricks Lakehouse Platform lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice, with no lock-in. Reduce resource management overhead with serverless compute, and easily ingest, transform and query all your data in-place to deliver real-time business insights faster. Built on open standards and APIs, the Databricks Lakehouse Platform provides the reliability, quality and performance that data lakes natively lack, plus integrations with the ecosystem for maximum flexibility. In Chapter 7, we go into the details of data warehousing on the lakehouse.

**Data engineering**

Data engineering on the lakehouse allows data teams to unify batch and streaming operations on a simplified architecture, streamline data pipeline development and testing, build reliable data, analytics and AI workflows on any cloud platform, and meet regulatory requirements to maintain governance. The lakehouse provides an end-to-end data engineering and ETL platform that automates the complexity of building and maintaining pipelines and running ETL workloads so data engineers and analysts can focus on quality and reliability to drive valuable insights. In Chapter 8, we go into the details of data engineering on the lakehouse.

**Data streaming**

[Data streaming](https://www.databricks.com/product/data-streaming) is one of the fastest growing workloads within the Databricks Lakehouse Platform and is the future of all data processing. Real-time processing provides the freshest possible data to an organization’s analytics and machine learning models, enabling it to make better, faster decisions and more accurate predictions, offer improved customer experiences and more. The Databricks Lakehouse Platform dramatically simplifies data streaming to deliver real-time analytics, machine learning and applications on one platform. In Chapter 9, we go into the details of data streaming on the lakehouse.

**Data science and machine learning**

Data science and machine learning (DSML) on the lakehouse is a powerful workload that sets the lakehouse apart from many other data offerings. DSML on the lakehouse provides a data-native and collaborative solution for the full ML lifecycle. It can maximize data and ML team productivity, streamline collaboration, empower ML teams to prepare, process and manage data in a self-service manner, and standardize the ML lifecycle from experimentation to production. In Chapter 10, we go into the details of DSML on the lakehouse.

-----

**Databricks Lakehouse Platform and your modern data stack**

The Databricks Lakehouse Platform is open and provides the flexibility to continue using existing infrastructure, to easily share data and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network with [Partner Connect](https://databricks.com/partnerconnect). In Chapter 11, we go into the details of our technology partners and the modern data stack.

#### Global adoption of the Databricks Lakehouse Platform

Today, Databricks has over 7,000 [customers](https://databricks.com/customers), from Fortune 500 to unicorns across industries doing transformational work. Organizations around the globe are driving change and delivering a new generation of data, analytics and AI applications.
We believe that the unfulfilled promise of data and AI can finally be fulfilled with one platform for data analytics, data science and machine learning with the Databricks Lakehouse Platform.

**Learn more**

- [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse)
- [Databricks Lakehouse Platform Demo Hub](https://databricks.com/discover/demos)
- [Databricks Lakehouse Platform Customer Stories](https://databricks.com/customers)
- [Databricks Lakehouse Platform Documentation](https://databricks.com/documentation)
- [Databricks Lakehouse Platform Training and Certification](https://databricks.com/learn/training/home)
- [Databricks Lakehouse Platform Resources](https://databricks.com/resources)

-----

**CHAPTER**

# 03

### Data reliability and performance

To bring openness, reliability and lifecycle management to data lakes, the Databricks Lakehouse Platform is built on the foundation of Delta Lake. Delta Lake solves challenges around unstructured/structured data ingestion, the application of data quality, difficulties with deleting data for compliance or issues with modifying data for change data capture.

Although data lakes are great solutions for holding large quantities of raw data, they lack important attributes for data reliability and quality and often don’t offer good performance when compared to data warehouses.

-----

#### Problems with today’s data lakes

When it comes to data reliability and quality, examples of these missing attributes include:

- **Lack of ACID transactions:** Makes it impossible to mix updates, appends and reads
- **Lack of schema enforcement:** Creates inconsistent and low-quality data, because nothing rejects writes that don’t match a table’s schema
- **Lack of integration with data catalog:** Results in dark data and no single source of truth

Even just the absence of these three attributes can cause a lot of extra work for data engineers as they strive to ensure consistent high-quality data in the pipelines they create.
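
To make the idea of schema enforcement concrete, here is a minimal sketch in plain Python. It is not how Delta Lake implements the check; the `TABLE_SCHEMA` dictionary and the `validate_batch` and `append` helpers are hypothetical names chosen for illustration. The point it shows is that a batch whose rows do not match the declared schema is rejected as a whole, so no mismatched rows reach the table.

```python
from typing import Any

# Hypothetical table schema: column name -> expected Python type.
TABLE_SCHEMA: dict[str, type] = {"order_id": int, "customer": str, "amount": float}


class SchemaMismatchError(ValueError):
    """Raised when a row does not match the declared table schema."""


def validate_batch(rows: list[dict[str, Any]], schema: dict[str, type]) -> None:
    """Reject the whole batch if any row has missing, extra or mistyped columns."""
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            raise SchemaMismatchError(
                f"row {i}: columns {sorted(row)} != {sorted(schema)}"
            )
        for column, expected in schema.items():
            if not isinstance(row[column], expected):
                raise SchemaMismatchError(
                    f"row {i}: column {column!r} expected {expected.__name__}, "
                    f"got {type(row[column]).__name__}"
                )


def append(table: list[dict[str, Any]], rows: list[dict[str, Any]]) -> None:
    """Append rows only after the entire batch passes validation."""
    validate_batch(rows, TABLE_SCHEMA)
    table.extend(rows)
```

Validating before extending the table means a bad row anywhere in the batch leaves the table unchanged, which is the behavior a data lake without schema enforcement cannot offer.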
As for performance, data lakes use object storage, so data is mostly kept in immutable files, leading to the following problems:

- **Ineffective partitioning:** In many cases, data engineers resort to “poor man’s” indexing practices in the form of partitioning that leads to hundreds of dev hours spent tuning file sizes to improve read/write performance. Often, partitioning proves to be ineffective over time if the wrong field was selected for partitioning or due to high cardinality columns.
- **Too many small files:** With no support for transactions, appending new data takes the form of adding more and more files, leading to “small file problems,” a known root cause of query performance degradation.

These challenges are solved with two key technologies that are at the foundation of the lakehouse: Delta Lake and Photon.

**What is Delta Lake?**

Delta Lake is a file-based, open source storage format that provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of existing data lakes and is compatible with Apache Spark™ and other processing engines.

Delta Lake uses Delta Tables, which are based on Apache Parquet, a commonly used format for structured data already utilized by many organizations. Therefore, switching existing Parquet tables to Delta Tables is easy and quick. Delta Tables can also be used with semi-structured and unstructured data, providing versioning, reliability, metadata management, and time travel capabilities that make these types of data easily managed as well.

-----

**Delta Lake features**

**ACID guarantees** Delta Lake ensures that all data changes written to storage are committed for durability and made visible to readers atomically. In other words, no more partial or corrupted files.
**Scalable data and metadata handling** Since Delta Lake is built on data lakes, all reads and writes using Spark or other distributed processing engines are inherently scalable to petabyte-scale. However, unlike most other storage formats and query engines, Delta Lake leverages Spark to scale out all the metadata processing, thus efficiently handling metadata of billions of files for petabyte-scale tables. **Audit history and time travel** The Delta Lake transaction log records details about every change made to data, providing a full audit trail of the changes. These data snapshots allow developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. **Schema enforcement and schema evolution** Delta Lake automatically prevents the insertion of data with an incorrect schema, i.e., not matching the table schema. And when needed, it allows the table schema to be explicitly and safely evolved to accommodate ever-changing data. **Support for deletes, updates and merges** Most distributed processing frameworks do not support atomic data modification operations on data lakes. Delta Lake supports merge, update and delete operations to enable complex use cases including but not limited to change data capture (CDC), slowly changing dimension (SCD) operations and streaming upserts. **Streaming and batch unification** A Delta Lake table can work both in batch and as a streaming source and sink. The ability to work across a wide variety of latencies, ranging from streaming data ingestion to batch historic backfill, to interactive queries all work out of the box. ----- **The Delta Lake transaction log** A key to understanding how Delta Lake provides all these capabilities is the transaction log. The Delta Lake transaction log is the common thread that runs through many of Delta Lake’s most notable features, including ACID transactions, scalable metadata handling, time travel and more. 
The Delta Lake transaction log is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. Delta Lake is built on top of Spark to allow multiple readers and writers of a given table to work on a table at the same time. To always show users correct views of the data, the transaction log serves as a single source of truth: the central repository that tracks all changes that users make to the table. When a user reads a Delta Lake table for the first time or runs a new query on an open table that has been modified since the last time it was read, Spark checks the transaction log to see what new transactions are posted to the table. Then, Spark updates the table with those recent changes. This ensures that a user’s version of a table is always synchronized with the master record as of the most recent query, and that users cannot make divergent, conflicting changes to a table. **Flexibility and broad industry support** Delta Lake is an open source project, with an engaged community of contributors building and growing the Delta Lake ecosystem atop a set of open APIs and is part of the Linux Foundation. With the growing adoption of Delta Lake as an open storage standard in different environments and use cases, comes a broad set of integration with industry-leading tools, technologies and formats. Organizations leveraging Delta Lake on the Databricks Lakehouse Platform gain flexibility in how they ingest, store and query data. They are not limited in storing data in a single cloud provider and can implement a true multicloud approach to data storage. Connectors to tools, such as Fivetran, allow you to leverage Databricks’ ecosystem of partner solutions, so organizations have full control of building the right ingestion pipelines for their use cases. Finally, consuming data via queries for exploration or business intelligence (BI) is also flexible and open. 
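
The transaction log described above can be pictured with a toy model. The sketch below is plain Python and deliberately simplified: the `ToyTransactionLog` class and the file names are hypothetical, and the real Delta Lake log stores richer actions and metadata. It only illustrates the core idea that an ordered list of commits, each adding or removing data files, is enough to reconstruct the table as of any version.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTransactionLog:
    """Append-only list of commits; each commit is a JSON string of actions."""

    commits: list[str] = field(default_factory=list)

    def commit(self, actions: list[dict]) -> int:
        """Record one transaction as a single log entry and return its version."""
        self.commits.append(json.dumps(actions))
        return len(self.commits) - 1

    def snapshot(self, version: Optional[int] = None) -> set[str]:
        """Replay commits 0..version to find the data files visible at that version."""
        if version is None:
            version = len(self.commits) - 1
        files: set[str] = set()
        for raw in self.commits[: version + 1]:
            for action in json.loads(raw):
                if "add" in action:
                    files.add(action["add"])
                elif "remove" in action:
                    files.discard(action["remove"])
        return files


log = ToyTransactionLog()
v0 = log.commit([{"add": "part-000.parquet"}, {"add": "part-001.parquet"}])
# A rewrite (for example, an update) removes an old file and adds its
# replacement in the same commit, so readers never see a half-applied change.
v1 = log.commit([{"remove": "part-001.parquet"}, {"add": "part-002.parquet"}])

print(sorted(log.snapshot()))    # files in the latest version
print(sorted(log.snapshot(v0)))  # "time travel" back to version 0
```

Because readers derive the table state only from committed log entries, a change becomes visible all at once, and replaying a prefix of the log gives earlier versions for audits or rollbacks.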

-----

_[Figure: Delta Lake integrates with all major analytics tools, eliminating unnecessary data movement and duplication.]_

-----

In addition to a wide ecosystem of tools and technologies, Delta Lake supports a broad set of data formats for structured, semi-structured and unstructured data. These formats include image binary data that can be stored in Delta Tables, graph data format, geospatial data types and key-value stores.

**Learn more**

- [Delta Lake on the Databricks Lakehouse](https://databricks.com/product/delta-lake-on-databricks)
- [Documentation](https://docs.databricks.com/delta/index.html)
- [Delta Lake Open Source Project](https://docs.databricks.com/delta/index.html)
- [eBooks: The Delta Lake Series](https://databricks.com/p/ebook/the-definitive-guide-to-delta-lake-series)

**What is Photon?**

As many organizations standardize on the lakehouse paradigm, this new architecture poses challenges with the underlying query execution engine for accessing and processing structured and unstructured data. The execution engine needs to provide the performance of a data warehouse and the scalability of data lakes.

Photon is the next-generation query engine on the Databricks Lakehouse Platform that provides dramatic infrastructure cost savings and speedups for all use cases, from data ingestion and ETL to streaming, data science and interactive queries, directly on your data lake. Photon is compatible with Spark APIs and implements a more general execution framework that allows efficient processing of data with support of the Spark API. This means getting started is as easy as turning it on: no code change and no lock-in.

With Photon, typical customers are seeing up to 80% TCO savings over traditional Databricks Runtime (Spark) and up to 85% reduction in VM compute hours.

_[Figure labels: Spark instructions, Photon instructions, Photon engine, Delta/Parquet, Photon writer to Delta/Parquet.]_

-----

**Why process queries with Photon?**
Query performance on Databricks has steadily increased over the years, powered by Spark and thousands of optimizations packaged as part of the Databricks Runtime (DBR). Photon provides an additional 2x speedup per the TPC-DS 1TB benchmark compared to the latest DBR versions.

_[Figure: Relative speedup to DBR 2.1 by DBR version (higher is better). X-axis: release date and DBR version; benchmark: TPC-DS 1TB on 10 x i3xl.]_

**Customers have observed significant speedups using Photon on workloads such as:**

- **SQL-based jobs:** Accelerate large-scale production jobs on SQL and Spark DataFrames
- **IoT use cases:** Faster time-series analysis using Photon compared to Spark and traditional Databricks Runtime
- **Data privacy and compliance:** Query petabyte-scale data sets to identify and delete records without duplicating data with Delta Lake, production jobs and Photon
- **Loading data into Delta and Parquet:** Vectorized I/O speeds up data loads for Delta and Parquet tables, lowering overall runtime and costs of data engineering jobs

-----

_[Figure: 100TB TPC-DS price/performance (lower is better). Systems compared: Databricks SQL spot, Databricks SQL on-demand, cloud data warehouse 1, cloud data warehouse 2, cloud data warehouse 3.]_

**Best price/performance for analytics in the cloud**

Written from the ground up in C++, Photon takes advantage of modern hardware for faster queries, providing up to 12x better price/performance compared to other cloud data warehouses, all natively on your data lake.

-----

**Works with your existing code and avoids vendor lock-in**

Photon is designed to be compatible with the Apache Spark DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. All you do is turn it on. Photon will seamlessly coordinate work and resources and transparently accelerate portions of your SQL and Spark queries. No tuning or user intervention required.
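
As an illustration of what "turning it on" might look like when a cluster is defined programmatically, the sketch below builds a cluster definition payload in Python. It assumes the `runtime_engine` field of the Databricks Clusters API; the `photon_cluster_spec` helper is hypothetical, and the cluster name, runtime version string and node type are placeholder values. Check the current API documentation before relying on any of them.

```python
import json


def photon_cluster_spec(
    name: str, spark_version: str, node_type_id: str, num_workers: int
) -> dict:
    """Build a cluster definition payload that requests the Photon engine."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Assumption: "PHOTON" selects Photon, "STANDARD" the regular runtime.
        "runtime_engine": "PHOTON",
    }


spec = photon_cluster_spec(
    name="photon-demo",                # placeholder cluster name
    spark_version="13.3.x-scala2.12",  # placeholder runtime version
    node_type_id="i3.xlarge",          # placeholder node type
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```

Nothing in the queries themselves changes; only the cluster definition differs, which is consistent with the "no code change" claim above.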

_[Figure: Photon in the Databricks Lakehouse Platform, lifecycle of a Photon query. Labels: client submits SQL; parsing; Catalyst analysis/planning/optimization; scheduling; execute task; Spark driver (JVM); Spark executors (mixed JVM/native).]_

-----

**Optimizing for all data use cases and workloads**

Photon is the first purpose-built lakehouse engine designed to accelerate all data and analytics workloads: data ingestion, ETL, streaming, data science, and interactive queries. While we started Photon primarily focused on SQL to provide customers with world-class data warehousing performance on their data lakes, we’ve significantly increased the scope of ingestion sources, formats, APIs and methods supported by Photon since then. As a result, customers have seen dramatic infrastructure cost savings and speedups on Photon across all their modern Spark (e.g., Spark SQL and DataFrame) workloads.

_[Figure: Accelerating all workloads on the lakehouse. Labels: query optimizer, native execution engine, caching.]_

**Learn more**

- [Announcing Photon Public Preview: The Next-Generation Query Engine on the Databricks Lakehouse Platform](https://www.databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databricks-lakehouse-platform.html)
- [Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)

-----

**CHAPTER**

# 04

### Unified governance and sharing for data, analytics and AI

Today, more and more organizations recognize the importance of making high-quality data readily available to data teams to drive actionable insights and business value.
At the same time, organizations also understand the risks of data breaches which negatively impact brand value and inevitably lead to erosion of customer trust. Governance is one of the most critical components of a lakehouse data platform architecture; it helps ensure that data assets are securely managed throughout the enterprise. However, many companies are using different incompatible governance models leading to complex and expensive solutions. ----- #### Key challenges with data and AI governance **Diversity of data and AI assets** The increased use of data and the added complexity of the data landscape have left organizations with a difficult time managing and governing all types of their data-related assets. No longer is data stored in files or tables. Data assets today take many forms, including dashboards, machine learning models and unstructured data like video and images that legacy data governance solutions simply are not built to govern and manage. **Rising multicloud adoption** More and more organizations now leverage a multicloud strategy to optimize costs, avoid vendor lock-in, and meet compliance and privacy regulations. With nonstandard, cloud-specific governance models, data governance across clouds is complex and requires familiarity with cloud-specific security and governance concepts, such as identity and access management (IAM). **Disjointed tools for data governance on the lakehouse** Today, data teams must deal with a myriad of fragmented tools and services for their data governance requirements, such as data discovery, cataloging, auditing, sharing, access controls, etc. This inevitably leads to operational inefficiencies and poor performance due to multiple integration points and network latency between the services. **Two disparate and incompatible data platforms** Organizations today use two different platforms for their data analytics and AI efforts — data warehouses for BI and data lakes for AI. 
This results in data replication across two platforms, presenting a major governance challenge. With no unified view of the data landscape, it is difficult to see where data is stored, who has access to what data, and consistently define and enforce data access policies across the two platforms with different governance models. ----- #### One security and governance approach Lakehouse systems provide a uniform way to manage access control, data quality and compliance across all of an organization’s data using standard interfaces similar to those in data warehouses by adding a management interface on top of data lake storage. Modern lakehouse systems support fine-grained (row, column and view level) access control via SQL, query auditing, attribute-based access control, data versioning and data quality constraints and monitoring. These features are generally provided using standard interfaces familiar to database administrators (for example, SQL GRANT commands) to allow existing personnel to manage all the data in an organization in a uniform way. Centralizing all the data in a lakehouse system with a single management interface also reduces the administrative burden and potential for error that comes with managing multiple separate systems. #### What is Unity Catalog? Unity Catalog is a unified governance solution for all data, analytics and AI assets including files, tables, dashboards and machine learning models in your lakehouse on any cloud. Unity Catalog simplifies governance by empowering data teams with a common governance model based on ANSI-SQL to define and enforce fine-grained access controls. With attribute-based access controls, data administrators can enable fine-grained access controls on rows and columns using tags (attributes). Built-in data search and discovery allows data teams to quickly find and reference relevant data for any use case. 
Unity Catalog offers automated data lineage for all workloads in SQL, R, Scala and Python, to build a better understanding of the data and its flow in the lakehouse. Unity Catalog also allows data sharing across or within organizations and seamless integrations with your existing data governance tools. With Unity Catalog, data teams can simplify governance for all data and AI assets with one consistent model to discover, access and share data, giving you much better native performance, management and security across clouds. ----- **Key benefits** The common metadata layer for cross-workspace metadata is at the account level and eases collaboration by allowing different workspaces to access Unity Catalog metadata through a common interface and break down data silos. Further, the data permissions in Unity Catalog are applied to account-level identities, rather than identities that are local to a workspace, allowing a consistent view of users and groups across all workspaces. Catalog, secure and audit access to all data assets on any cloud Unity Catalog provides centralized metadata, enabling data teams to create a single source of truth for all data assets ranging from files, tables, dashboards to machine learning models in one place. ----- Unity Catalog offers a unified data access layer that provides a simple and streamlined way to define and connect to your data through managed tables, external tables, or files, while managing their access controls. Unity Catalog centralizes access controls for files, tables and views. It allows fine-grained access controls for restricting access to certain rows and columns to the users and groups who are authorized to query them. With Attribute-Based Access Controls (ABAC), you can control access to multiple data items at once based on user and data attributes, further simplifying governance at scale. 
For example, you will be able to tag multiple columns as personally identifiable information (PII) and manage access to all columns tagged as PII in a single rule. Today, organizations are dealing with an increased burden of regulatory compliance, and data access auditing is a critical component to ensure your organization is set up for success while meeting compliance requirements. Unity Catalog also provides centralized fine-grained auditing by capturing an audit log of operations such as create, read, update and delete (CRUD) that have been performed against the data. This allows a fine-grained audit trail showing who accessed a given data set and helps you meet your compliance and business requirements. ----- Built-in data search and discovery Data discovery is a critical component to break down data silos and democratize data across your organization to make data-driven decisions. Unity Catalog provides a rich user interface for data search and discovery, enabling data teams to quickly search relevant data assets across the data landscape and reference them for all use cases — BI, analytics and machine learning — accelerating time-to-value and boosting productivity. ----- Automated data lineage for all workloads Data lineage describes the transformations and refinements of data from source to insight. Lineage includes capturing all the relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, which other data sets leverage it, and many other events and attributes. Unity Catalog offers automated data lineage down to table and column level, enabling data teams to get an end-to-end view of where data is coming from, what transformations were performed on the data and how data is consumed by end applications such as notebooks, workflows, dashboards, machine learning models, etc. 
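
To illustrate the kind of question lineage metadata can answer, here is a minimal sketch that models lineage as a directed graph in plain Python. It does not use any Unity Catalog API; the data set names in `EDGES` and the `build_graph` and `reachable` helpers are hypothetical. Walking the graph forward lists everything derived from a data set, and walking it backward lists everything it was built from.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source data set, derived data set or consumer).
EDGES = [
    ("raw.orders", "silver.orders"),
    ("raw.customers", "silver.customers"),
    ("silver.orders", "gold.revenue"),
    ("silver.customers", "gold.revenue"),
    ("gold.revenue", "dashboard.sales"),
]


def build_graph(edges, reverse=False):
    """Adjacency map of the lineage graph; reverse=True points from child to parent."""
    graph = defaultdict(set)
    for src, dst in edges:
        if reverse:
            graph[dst].add(src)
        else:
            graph[src].add(dst)
    return graph


def reachable(graph, start):
    """Breadth-first walk returning every node reachable from start in an acyclic graph."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Everything that consumes silver.orders, directly or indirectly.
downstream = reachable(build_graph(EDGES), "silver.orders")
# Everything gold.revenue was derived from.
upstream = reachable(build_graph(EDGES, reverse=True), "gold.revenue")
print(sorted(downstream))
print(sorted(upstream))
```

A catalog that records these edges automatically spares teams from maintaining such a graph by hand.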
First, with automated data lineage for all workloads (SQL, R, Python and Scala), data teams can quickly identify and perform root cause analysis of any errors in the data pipelines or end applications. Second, data teams can perform impact analysis to see dependencies of any data changes on downstream consumers and notify them about the potential impact. Finally, data lineage also empowers data teams with increased understanding of their data and reduces tribal knowledge. Unity Catalog can also capture lineage associated with non-data entities, such as notebooks, workflows and dashboards. Lineage can be retrieved via REST APIs to support integrations with other catalogs.

_[Figure: Data lineage with Unity Catalog]_

**Integrated with your existing tools**

Unity Catalog helps you to future-proof your data and AI governance with the flexibility to leverage your existing data catalogs and governance solutions: Collibra, Alation, Immuta, Privacera, Microsoft Purview and AWS Lake Formation.

**Resources**

- [Learn more about Unity Catalog](https://databricks.com/product/unity-catalog)
- [AWS Documentation](https://docs.databricks.com/data-governance/unity-catalog/index.html)
- [Azure Documentation](https://docs.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/)

-----

#### Open data sharing and collaboration

Data sharing has become important in the digital economy as enterprises wish to exchange data easily and securely with their customers, partners, suppliers and internal lines of business to better collaborate and unlock value from that data. But to date, the lack of a standards-based data sharing protocol has resulted in data sharing solutions tied to a single vendor or commercial product, introducing vendor lock-in risks. What the industry deserves is an open approach to data sharing.

**Why data sharing is hard**

Data sharing has evolved from an optional feature of a few data platforms to a business necessity and success factor for organizations.
Every day, our solution architects encounter classic scenarios such as a retailer looking to publish sales data to its suppliers in real time, or a supplier that wants to share real-time inventory.

As a reminder, data sharing recently triggered the most impressive scientific development that humankind has ever seen. On January 5, 2020, the first sample of the genome of the coronavirus was uploaded to the internet. It wasn’t a lung biopsy from a patient in Wuhan, but a shared digital genomic data set that triggered the development of the first batch of COVID vaccines worldwide. Since then, coronavirus experts have daily exchanged public data sets, looking for better treatments, tests and tracking mutations as they are passed down through a lineage, a branch of the coronavirus family tree. The above graphic shows such a [publicly shared mutation data set](https://www.ncbi.nlm.nih.gov/genbank/).

-----

Sharing data, as well as consuming data from external sources, allows you to collaborate with partners, establish new partnerships, enable research and generate new revenue streams with data monetization.

Despite those promising examples, existing data sharing technologies come with several limitations:

- Traditional data sharing technologies, such as Secure File Transfer Protocol (SFTP), do not scale well and only serve files offloaded to a server
- Cloud object stores operate on an object level and are cloud-specific
- Commercial data sharing offerings baked into vendor products often share tables instead of files, but scaling them is expensive and they are not open and, therefore, do not permit data sharing with a different platform

The following table compares proprietary vendor solutions with SFTP, cloud object stores and Delta Sharing.
| Criterion | Proprietary vendor solutions | SFTP | Cloud object store | Delta Sharing |
|---|---|---|---|---|
| Secure | | | | |
| Cheap | | | | |
| Vendor agnostic | | | | |
| Multicloud | | | | |
| Open source | | | | |
| Table/DataFrame abstraction | | | | |
| Live data | | | | |
| Predicate pushdown | | | | |
| Object store bandwidth | | | | |
| Zero compute cost | | | | |
| Scalability | | | | |

-----

**Open source data sharing and Databricks**

To address the limitations of existing data sharing solutions, Databricks developed [Delta Sharing](https://github.com/delta-io/delta-sharing), with various contributions from the OSS community, and donated it to the Linux Foundation.

An open source-based solution, such as Delta Sharing, eliminates the lock-in of commercial solutions and brings a number of additional benefits such as community-developed integrations with popular, open source data processing frameworks. In addition, open protocols allow the easy integration of commercial clients, such as BI tools.

**What is Databricks Delta Sharing?**

Databricks Delta Sharing provides an open solution to securely share live data from your lakehouse to any computing platform. Recipients don’t have to be on the Databricks platform or on the same cloud or a cloud at all. Data providers can share live data, without replicating or moving it to another system. Recipients benefit from always having access to the latest version of data and can quickly query shared data using tools of their choice for BI, analytics and machine learning, reducing time-to-value. Data providers can centrally manage, govern, audit and track usage of the shared data on one platform.

Unity Catalog natively supports [Delta Sharing](https://databricks.com/product/delta-sharing), the world’s first open protocol for data sharing, enabling organizations to share live, large-scale data without replication and make data easily and quickly accessible from tools of your choice, with enterprise-grade security.
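To make the recipient side more concrete, here is a minimal Python sketch of how a recipient might address a shared table. It assumes the open source `delta-sharing` Python connector, whose profile file is a small JSON document and whose table address has the form `<profile-file>#<share>.<schema>.<table>`. The endpoint, token, share, schema and table names are hypothetical placeholders, and the final load step needs the connector installed and a reachable sharing server, so it is kept in a separate function that is not called here.

```python
import json
import tempfile
from pathlib import Path


def write_profile(path: Path, endpoint: str, bearer_token: str) -> Path:
    """Write a Delta Sharing profile file, the credential a provider hands to a recipient."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    path.write_text(json.dumps(profile, indent=2))
    return path


def table_url(profile_path: Path, share: str, schema: str, table: str) -> str:
    """Build the '<profile-file>#<share>.<schema>.<table>' address of a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url: str):
    """Read a shared table into a pandas DataFrame.

    Requires `pip install delta-sharing` and a live sharing server.
    """
    import delta_sharing  # imported lazily so the helpers above run without it

    return delta_sharing.load_as_pandas(url)


# Hypothetical endpoint, token and table names, for illustration only.
profile_path = write_profile(
    Path(tempfile.gettempdir()) / "example.share",
    endpoint="https://sharing.example.com/delta-sharing/",
    bearer_token="<token-issued-by-provider>",
)
url = table_url(profile_path, "retail_share", "sales", "daily_orders")
print(url)
```

With a real profile file from a provider, `load_shared_table(url)` would return the current version of the table without the recipient setting up an ingestion pipeline.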
**Key benefits**

**Open cross-platform sharing**

Easily share existing data in Delta Lake and Apache Parquet formats between different vendors. Consumers don’t have to be on the Databricks platform, same cloud or a cloud at all. Native integrations with Power BI, Tableau, Spark, pandas and Java allow recipients to consume shared data directly from the tools of their choice. Delta Sharing eliminates the need to set up a new ingestion process to consume data. Data recipients can directly access the fresh data and query it using tools of their choice. Recipients can also enrich data with data sets from popular data providers.

**Sharing live data without copying it**

Share live ready-to-query data, without replicating or moving it to another system. Most enterprise data today is stored in cloud data lakes. Any of the existing data sets on the provider’s data lake can easily be shared across clouds, regions or data platforms without any data replication or physical movement of data. Data providers can update their data sets reliably in real time and provide a fresh and consistent view of their data to recipients.

**Centralized administration and governance**

You can centrally govern, track and audit access to the shared data from a single point of enforcement to meet compliance requirements. Detailed user-access audit logs are kept to know who is accessing the data and monitor usage of the shared data down to table, partition and version level.

-----

**An open Marketplace for data solutions**

The demand for third-party data to make data-driven innovations is greater than ever, and data marketplaces act as a bridge between data providers and data consumers to help facilitate the discovery and distribution of data sets.

Databricks Marketplace provides an open marketplace for exchanging data products such as data sets, notebooks, dashboards and machine learning models.
To accelerate insights, data consumers can discover, evaluate and access more data products from third-party vendors than ever before.

Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. Databricks Marketplace is powered by Delta Sharing, allowing consumers to access data products without having to be on the Databricks platform. This open approach allows data providers to broaden their addressable market without forcing consumers into vendor lock-in.

_Databricks Marketplace_

**Privacy-safe data cleanrooms**

Powered by open source Delta Sharing, the Databricks Lakehouse Platform provides a flexible data cleanroom solution allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language (Python, R, SQL, Java and Scala) on the data while maintaining data privacy. Additionally, data cleanroom participants don’t have to do cost-intensive data replication across clouds or regions with other participants, which simplifies data operations and reduces cost.

_Data cleanrooms with Databricks Lakehouse Platform_

-----

**How it works**

Delta Sharing is designed to be simple, scalable, non-proprietary and cost-effective for organizations that are serious about getting more from their data. Delta Sharing is natively integrated with Unity Catalog, which allows customers to add fine-grained governance and security controls, making it easy and safe to share data internally or externally.

Delta Sharing is a simple REST protocol that securely shares access to part of a cloud data set. It leverages modern cloud storage systems, such as AWS S3, Azure ADLS or Google’s GCS, to reliably transfer large data sets. Here’s how it works for data providers and data recipients.
_Figure: Delta Sharing between a data provider and data recipients (data science, on-premises and many more)._

The data provider shares existing tables or parts thereof (such as specific table versions or partitions) stored on the cloud data lake in Delta Lake format. The provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. To manage shares and recipients, you can use SQL commands, the Unity Catalog CLI or the intuitive user interface.

The data recipient only needs one of the many Delta Sharing clients that support the protocol. Databricks has released open source connectors for pandas, Apache Spark, Java and Python, and is working with partners on many more.

-----

The Delta Sharing data exchange follows three efficient steps:

1. The recipient’s client authenticates to the sharing server and asks to query a specific table. The client can also provide filters on the data (for example, “country=US”) as a hint to read just a subset of the data.
2. The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in cloud storage systems that make up the table.
3. To transfer the data, the server generates short-lived presigned URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel at massive bandwidth, without streaming through the sharing server.
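The three steps above can be sketched as a small in-memory simulation. This is an illustration of the flow only, not the Delta Sharing server implementation: the `SharingServer` and `DataFile` classes, the storage host name and the URL signature are all hypothetical, and the sketch applies the filter hint exactly, whereas a real server treats it as a hint and may return more files than strictly match.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFile:
    path: str        # object key of one Parquet file in cloud storage
    partition: dict  # partition values for the file, e.g. {"country": "US"}


@dataclass
class SharingServer:
    grants: dict     # bearer token -> set of table names that token may read
    tables: dict     # table name -> list of DataFile objects
    url_ttl_seconds: int = 900
    audit_log: list = field(default_factory=list)

    def query(self, token: str, table: str, hints: Optional[dict] = None) -> List[str]:
        # Step 2: verify that the client may read the table, and log the request.
        allowed = table in self.grants.get(token, set())
        self.audit_log.append((table, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"no access to table {table!r}")

        # Still step 2: use the hints to pick the subset of files to send back.
        hints = hints or {}
        selected = [
            f for f in self.tables[table]
            if all(f.partition.get(k) == v for k, v in hints.items())
        ]

        # Step 3: return short-lived URLs so the client reads the files
        # directly from cloud storage instead of through this server.
        expires = int(time.time()) + self.url_ttl_seconds
        return [
            f"https://storage.example.com/{f.path}?expires={expires}&sig=PLACEHOLDER"
            for f in selected
        ]


server = SharingServer(
    grants={"recipient-token": {"sales.orders"}},
    tables={
        "sales.orders": [
            DataFile("orders/country=US/part-0.parquet", {"country": "US"}),
            DataFile("orders/country=DE/part-0.parquet", {"country": "DE"}),
        ]
    },
)

# Step 1: the client presents its token and asks for a table, with a filter hint.
urls = server.query("recipient-token", "sales.orders", hints={"country": "US"})
print(urls)
```

Because the server only returns URLs, the bulk data transfer never passes through it, which is what lets the transfer run in parallel against the object store.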
**Learn more**

[Try Delta Sharing](https://databricks.com/product/delta-sharing)
[Delta Sharing Demo](https://youtu.be/wRT1Vpbyy88)
[Introducing Delta Sharing: An Open Protocol for Secure Data Sharing](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html)
[Introducing Data Cleanrooms for the Lakehouse](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html)
[Introducing Databricks Marketplace](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html)
[Delta Sharing ODSC Webinar](https://www.youtube.com/watch?v=YrNHtaWlkM8)

-----

**CHAPTER**

# 05

### Security

Organizations that operate in multicloud environments need a unified, reliable and consistent approach to secure data. We’ve learned from our customers that a simple and unified approach to data security for the lakehouse is one of the most critical requirements for modern data solutions. Databricks is trusted by the world’s largest organizations to provide a powerful lakehouse platform with high security and scalability. In fact, thousands of customers trust Databricks with their most sensitive data to analyze and build data products using machine learning (ML).

With significant investment in building a highly secure and scalable platform, Databricks delivers end-to-end platform security for data and users.

-----

#### Platform architecture reduces risk

The Databricks Lakehouse architecture is split into two separate planes to simplify your permissions, avoid data duplication and reduce risk. The control plane is the management plane where Databricks runs the workspace application and manages notebooks, configuration and clusters. Unless you choose to use [serverless compute](https://docs.databricks.com/serverless-compute/index.html), the data plane runs inside your cloud service provider account, processing your data without taking it out of your account.
You can embed Databricks in your data exfiltration protection architecture using features like customer-managed VPCs/VNets and admin console options that disable export.

While certain data, such as your notebooks, configurations, logs, and user information, is present within the control plane, that information is encrypted at rest, and communication to and from the control plane is encrypted in transit.

_Figure: Platform architecture. Labels: users; interactive users; control plane (web application; configurations; notebooks, repos, DBSQL; cluster manager); cluster; data; DBFS root._

You also have choices for where certain data lives: You can host your own store of metadata about your data tables (Hive metastore), or store query results in your cloud service provider account and decide whether to use the [Databricks Secrets API](https://docs.databricks.com/dev-tools/api/latest/secrets.html).

-----

#### Step-by-step example

_Figure: Step-by-step example. The architecture diagram above, annotated with the numbered steps described below._

-----

Suppose you have a data engineer that signs in to Databricks and writes a notebook that transforms raw data in Kafka to a normalized data set sent to storage such as Amazon S3 or Azure Data Lake Storage. Six steps make that happen:

1. The data engineer seamlessly authenticates, via your single sign-on if desired, to the Databricks web UI in the control plane, hosted in the Databricks account.
2. As the data engineer writes code, their web browser sends it to the control plane. JDBC/ODBC requests also follow the same path, authenticating with a token.
3. When ready, the control plane uses Cloud Service Provider APIs to create a Databricks cluster, made of new instances in the data plane, in your CSP account.
Administrators can apply cluster policies to enforce security profiles.

4. Once the instances launch, the cluster manager sends the data engineer’s code to the cluster.
5. The cluster pulls from Kafka in your account, transforms the data in your account and writes it to storage in your account.
6. The cluster reports status and any outputs back to the cluster manager.

The data engineer does not need to worry about many of the details: they simply write the code and Databricks runs it.

#### Network and server security

Here is how Databricks interacts with your cloud service provider account to manage network and server security.

**Networking**

Regardless of where you choose to host the data plane, Databricks networking is straightforward. If you host it yourself, Databricks by default will still configure networking for you, but you can also control data plane networking with your own managed VPC or VNet.

The serverless data plane network infrastructure is managed by Databricks in a Databricks cloud service provider account and shared among customers, with additional network boundaries between workspaces and between clusters.

Databricks does not rewrite or change your data structure in your storage, nor does it change or modify any of your security and governance policies. Local firewalls complement security groups and subnet firewall policies to block unexpected inbound connections.

Customers at the enterprise tier can also use the IP access list feature on the control plane to limit which IP addresses can connect to the web UI or REST API, for example, to allow only VPN or office IPs.

-----

**Servers**

In the data plane, Databricks clusters automatically run the latest hardened system image. Users cannot choose older (less secure) images or code. For AWS and Azure deployments, images are typically updated every two-to-four weeks. GCP is responsible for its system image.
Databricks runs scans for every release, including:

**•** System image scanning for vulnerabilities
**•** Container OS and library scanning
**•** Static and dynamic code scanning

Databricks code is peer reviewed by developers who have security training. Significant design documents go through comprehensive security reviews. Scans run fully authenticated, with all checks enabled, and issues are tracked against the timeline shown in this table.

| Severity | Remediation time |
|---|---|
| Critical | < 14 days |
| High | < 30 days |
| Medium | < 60 days |
| Low | When appropriate |

Note that Databricks clusters are typically short-lived (often terminated after a job completes) and do not persist data after they terminate. Clusters typically share the same permission level (excluding high concurrency or Databricks SQL clusters, where more robust security controls are in place). Your code is launched in an unprivileged container to maintain system stability. This security design provides protection against persistent attackers and privilege escalation.

**Databricks access**

Databricks access to your environment is limited to cloud service provider APIs for our automation and support access. Automated access allows the Databricks control plane to configure resources in your environment using the cloud service provider APIs. The specific APIs vary based on the cloud. For instance, an AWS cross-account IAM role, or Azure-owned automation or GKE automation do not grant access to your data sets (see the next section).

Databricks has a custom-built system that allows staff to fix issues or handle support requests, for example, when you open a support request and check the box authorizing access to your workspace. Access requires either a support ticket or engineering ticket tied expressly to your workspace and is limited to a subset of employees and for limited time periods.
Additionally, if you have configured audit log delivery, the audit logs show the initial access event and the staff’s actions.

-----

**Identity and access**

Databricks supports robust ACLs and SCIM. AWS customers can configure SAML 2.0 and block non-SSO logins. Azure Databricks and Databricks on GCP automatically integrate with Azure Active Directory or GCP identity.

Databricks supports a variety of ways to enable users to access their data.

**Examples include:**

**•** The Table ACLs feature uses traditional SQL-based statements to manage access to data and enable fine-grained view-based access
**•** IAM instance profiles enable AWS clusters to assume an IAM role, so users of that cluster automatically access allowed resources without explicit credentials
**•** External storage can be mounted or accessed using a securely stored access key
**•** The Secrets API separates credentials from code when accessing external resources

**Data security**

Databricks provides encryption, isolation and auditing.
**Databricks encryption capabilities are in place both at rest and in motion**

| For data-at-rest encryption | For data-in-motion encryption |
|---|---|
| Control plane is encrypted | Control plane <-> data plane is encrypted |
| Data plane supports local encryption | Offers optional intra-cluster encryption |
| Customers can use encrypted storage buckets | Customer code can be written to avoid unencrypted services (e.g., FTP) |
| Customers at some tiers can configure customer-managed keys for managed services | |

**Customers can isolate users at multiple levels:**

**•** **Workspace level:** Each team or department can use a separate workspace
**•** **Cluster level:** Cluster ACLs can restrict the users who can attach notebooks to a given cluster
**•** **High concurrency clusters:** Process isolation, JVM whitelisting and limited languages (SQL, Python) allow for the safe coexistence of users of different privilege levels, and are used with Table ACLs
**•** **Single-user cluster:** Users can create a private dedicated cluster

Activities of Databricks users are logged and can be delivered automatically to a cloud storage bucket. Customers can also monitor provisioning activities by monitoring cloud audit logs.

-----

**Compliance**

**Databricks supports the following compliance standards on our multi-tenant platform:**

**•** **SOC 2 Type II**
**•** **ISO 27001**
**•** **ISO 27017**
**•** **ISO 27018**

Certain clouds support Databricks deployment options for FedRAMP High, HITRUST, HIPAA and PCI. Databricks Inc. and the Databricks platform are also GDPR and CCPA ready.

**Learn more**

To learn more about Databricks security, visit the [Security and Trust Center](https://databricks.com/trust)

-----

**CHAPTER**

# 06

### Instant compute and serverless

-----

#### Benefits of Databricks Serverless SQL

Serverless SQL is much easier to administer with Databricks taking on the responsibility of deploying, configuring and managing your cluster VMs.
Databricks can transfer compute capacity to user queries typically in about 15 seconds, so you no longer need to wait for clusters to start up or scale out to run your queries.

Serverless SQL also has built-in connectors to your favorite tools such as Tableau, Power BI, Qlik, etc. These connectors use optimized JDBC/ODBC drivers for easy authentication support and high performance.

And finally, you save on cost because you do not need to overprovision or pay for idle capacity.

#### What is serverless compute?

Serverless compute is a fully managed service where Databricks provisions and manages the compute layer on behalf of the customer in the Databricks cloud account instead of the customer account. As of the current release, serverless compute is supported for use with Databricks SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by 20%-40% on average. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time data sets of the lakehouse with a simple and performant solution.

-----

**Inside Serverless SQL**

_Figure: Databricks Serverless SQL. Labels: managed servers; Serverless SQL compute; secure; instant compute._

At the core of Serverless SQL is a compute platform that operates a pool of servers located in a Databricks account, running Kubernetes containers that can be assigned to a user within seconds.

When many users are running reports or queries at the same time, the compute platform adds more servers to the cluster (again, within seconds) to handle the concurrent load. Databricks manages the entire configuration of the server and automatically performs the patching and upgrades as needed.
Each server is running a secure configuration and all processing is secured by three layers of isolation: the Kubernetes container hosting the runtime; the virtual machine (VM) hosting the container; and the virtual network for the workspace. Each layer is isolated to one workspace with no sharing or cross-network traffic allowed. The containers use hardened configurations, VMs are shut down and not reused, and network traffic is restricted to nodes in the same cluster.

-----

#### Performance of Serverless SQL

We ran a set of internal tests to compare Databricks Serverless SQL to the current Databricks SQL and several traditional cloud data warehouses. We found Serverless SQL to be the most cost-efficient and performant environment to run SQL workloads when considering cluster startup time, query execution time and overall cost.

_Figure: Databricks Serverless SQL is the highest performing and most cost-effective solution. Cloud SQL solutions compared (Serverless SQL, CDW1, CDW2, CDW3, CDW4) by query execution time (slower to faster), startup time (slower, ~5min, to faster, ~2-3sec) and cost estimate (high, medium, low)._

**Learn more**

The feature is currently in Public Preview. Sign up to [request access to Serverless SQL](https://databricks.com/p/ebook/serverless-sql-preview-sign-up). To learn more about Serverless SQL, visit our [documentation page.](https://docs.databricks.com/serverless-compute/index.html)

-----

**CHAPTER**

# 07

### Data warehousing

Data warehouses are not keeping up with today’s world. The explosion of languages other than SQL and unstructured data, machine learning, IoT and streaming analytics are forcing organizations to adopt a bifurcated architecture of disjointed systems: Data warehouses for BI and data lakes for ML. While SQL is ubiquitous and known by millions of professionals, it has never been treated as a first-class citizen on data lakes, until the lakehouse.
-----

#### What is data warehousing?

The Databricks Lakehouse Platform provides a simplified multicloud and serverless architecture for your data warehousing workloads. Data warehousing on the lakehouse allows SQL analytics and BI at scale with a common governance model. Now you can ingest, transform and query all your data in-place, using your SQL and BI tools of choice, to deliver real-time business insights at the best price/performance. Built on open standards and APIs, the lakehouse provides the reliability, quality and performance that data lakes natively lack, and integrations with the ecosystem for maximum flexibility, with no lock-in.

With data warehousing on the lakehouse, organizations can unify all analytics and simplify their architecture to enable their business with real-time business insights at the best price/performance.

#### Key benefits

**Best price/performance**

Lower costs, get the best price/performance and eliminate resource management overhead

On-premises data warehouses have reached their limits: they physically cannot scale to handle the growing volumes of data, and don’t provide the elasticity customers need to respond to ever-changing business needs. Cloud data warehouses are a great alternative to on-premises data warehouses, providing greater scale and elasticity, but costs for proprietary cloud data warehouses typically grow exponentially as data volumes grow.

The Databricks Lakehouse Platform provides instant, elastic SQL serverless compute, decoupled from storage on cheap cloud object stores, and thousands of performance optimizations that can lower overall infrastructure costs by [an average of 40%](https://databricks.com/blog/2021/08/30/announcing-databricks-serverless-sql.html).
Databricks automatically determines instance types and configuration for the best price/performance ([up to 12x better than traditional cloud data warehouses](https://databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)) and scales for high concurrency use cases.

-----

**Built-in governance**

One source of truth and one unified governance layer across all data teams

Underpinned by Delta Lake, the Databricks Lakehouse Platform simplifies your architecture by allowing you to establish one single copy of all your data for in-place analytics and ETL/ELT on your existing data lakes, with no more data movements and copies in disjointed systems. Then, seamless integration with Databricks Unity Catalog lets you easily discover, secure and manage all your data with fine-grained governance, data lineage, and standard SQL.

**Rich ecosystem**

Ingest, transform and query all your data in-place with your favorite tools

Very few tools exist to conduct BI on data lakes. Generally, doing so has required data analysts to submit Spark jobs or use a developer interface. While these tools are common for data scientists, they require knowledge of languages and interfaces that are not traditionally part of a data analyst’s tool set. As a result, the learning curve for an analyst to make use of a data lake is too high when well-established tools and methods already exist for data warehouses.

The Databricks Lakehouse Platform works with your preferred tools like dbt, Fivetran, Power BI or Tableau, allowing analysts and analytical engineers to easily ingest, transform and query the most recent and complete data, without having to move it into a separate data warehouse.
Additionally, it empowers every analyst across your organization to quickly and collaboratively find and share new insights with a built-in SQL editor, visualizations and dashboards.

**Break down silos**

Accelerate time from raw to actionable data and go effortlessly from BI to ML

It is challenging for data engineering teams to enable analysts at the speed that the business requires. Data warehouses need data to be ingested and processed ahead of time before analysts can access and query it using BI tools. Because traditional data warehouses lack real-time processing and do not scale well for large ETL jobs, they create new data movements and bottlenecks for the data engineering team, and make it slow for analysts to access the latest data. And for advanced analytics (ML) applications, organizations will need to manage an entirely different system than their SQL-only data warehouse, slowing down collaboration and innovation.

The Databricks Lakehouse Platform provides the most complete end-to-end data warehousing solution for all your modern analytics needs, and more. Now you can empower data teams and business users to access the latest data faster for downstream real-time analytics and go effortlessly from BI to ML. Speed up the time from raw to actionable data at any scale, in batch and streaming. And go from descriptive to advanced analytics effortlessly to uncover new insights.

-----

_Figure: Data warehousing on Databricks. Labels: data sources (on-premises, OLTP, OLAP, Hadoop, third-party data, IoT devices, SaaS applications, social, DWH); data ingest (continuous ingest, batch ingest, Databricks Partner Connect); open storage layer (Bronze raw, Silver staging, Gold DW/marts); data processing (ETL); Unity Catalog; truly decoupled, serverless compute layer; data consumers._

**Learn more**

[Try Databricks SQL for free](https://dbricks.co/dbsql)
[Databricks SQL Demo](https://databricks.com/discover/demos/databricks-sql)
[Databricks SQL Data Warehousing Admin Demo](https://youtu.be/jlEdoVpWwNc)
[On-demand Webinar: Learn Databricks SQL From the Experts](https://databricks.com/p/webinar/learn-databricks-sql-from-the-experts)
[eBook: Inner Workings of the Lakehouse for Analytics and BI](https://databricks.com/p/ebook/data-lakehouse-is-your-next-data-warehouse)

-----

**CHAPTER**

# 08

### Data engineering

Organizations realize the value data plays as a strategic asset for growing revenues, improving the customer experience, operating efficiently or improving a product or service. Data is really the driver of all these initiatives. Nowadays, data is often streamed and ingested from hundreds of different data sources, sometimes acquired from a data exchange, cleaned in various ways with different orchestrated steps, versioned and shared for analytics and AI. And increasingly, data is being monetized.
Data teams rely on getting the right data at the right time for analytics, data science and machine learning, but are often faced with challenges meeting the needs of their initiatives for data engineering.

-----

#### Why data engineering is hard

One of the biggest challenges is accessing and managing the increasingly complex data that lives across the organization. Most of the complexity arises with the explosion of data volumes and data types, with organizations amassing an estimated [80% of data that is unstructured and semi-structured.](https://www.forbes.com/sites/forbestechcouncil/2019/01/29/the-80-blind-spot-are-you-ignoring-unstructured-organizational-data/?sh=681651dc211c)

With this volume, managing data pipelines to transform and process data is slow and difficult, and increasingly expensive. And to top off the complexity, most businesses are putting an increased emphasis on multicloud environments, which can be even more difficult to maintain.

[Zhamak Dehghani](https://databricks.com/speaker/zhamak-dehghani), a principal technology consultant at Thoughtworks, wrote that data itself has become a product, and the challenging goal of the data engineer is to build and run the machinery that creates this high-fidelity data product all the way from ingestion to monetization.

Despite current technological advances, data engineering remains difficult for several reasons:

**Complex data ingestion methods**

Data ingestion means retrieving batch and streaming data from various sources and in various formats. Ingesting data is hard and complex since you either need to use an always-running streaming platform like Apache Kafka or you need to be able to keep track of which files haven’t been ingested yet. Data engineers are required to spend a lot of time hand-coding repetitive and error-prone data ingestion tasks.

**Data engineering principles**

These days, large operations teams are often just a memory of the past.
Modern data engineering principles are based on agile software development methodologies. They apply the well-known “you build it, you run it” paradigm, use isolated development and production environments, CI/CD, and version control transformations that are pushed to production after validation. Tooling needs to support these principles.

-----

**Third-party tools**

Data engineers are often required to run additional third-party tools for orchestration to automate tasks such as ELT/ETL or customer code in notebooks. Running third-party tools increases the operational overhead and decreases the reliability of the system.

**Performance tuning**

Finally, with all pipelines and workflows written, data engineers need to constantly focus on performance, tuning pipelines and architectures to meet SLAs. Tuning such architectures requires in-depth knowledge of the underlying architecture and constantly observing throughput parameters.

Most organizations are dealing with a complex landscape of data warehouses and data lakes these days. Each of those platforms has its own limitations, workloads, development languages and governance model.

With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting, transforming, processing, scheduling and delivering data. The lakehouse platform automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake so data engineers can focus on quality and reliability to drive valuable insights.

Data engineering in the lakehouse allows data teams to unify batch and streaming operations on a simplified architecture, streamline data pipeline development and testing, build reliable data, analytics and AI workflows on any cloud platform, and meet regulatory requirements to maintain world-class governance.
The lakehouse provides an end-to-end data engineering and ETL platform that automates the complexity of building and maintaining pipelines and running ETL workloads so data engineers and analysts can focus on quality and reliability to drive valuable insights.

#### Databricks makes modern data engineering simple

There is no industry-wide definition of modern data engineering. This should come close:

_A **unified data platform** with **managed data ingestion**, schema detection, enforcement, and evolution, paired with **declarative, auto-scaling data flow** integrated with a lakehouse **native orchestrator** that supports all kinds of workflows._

-----

#### Benefits of data engineering on the lakehouse

By simplifying and modernizing with the lakehouse architecture, data engineers gain an enterprise-grade and enterprise-ready approach to building data pipelines. The following are eight key differentiating capabilities that a data engineering solution team can enable with the Databricks Lakehouse Platform:

**•** **Easy data ingestion:** With the ability to ingest petabytes of data, data engineers can enable fast, reliable, scalable and automatic data ingestion for analytics, data science or machine learning.

**•** **Data pipeline observability:** Monitor overall data pipeline estate status from a dataflow graph dashboard and visually track end-to-end pipeline health for performance, quality, status and latency.

**•** **Simplified operations:** Ensure reliable and predictable delivery of data for analytics and machine learning use cases by enabling easy and automatic data pipeline deployments into production or roll back pipelines and minimize downtime.

**•** **Scheduling and orchestration:** Simple, clear and reliable orchestration of data processing tasks for data and machine learning pipelines with the ability to run multiple non-interactive tasks as a directed acyclic graph (DAG) on a Databricks compute cluster.
**•** **Automated ETL pipelines:** Data engineers can reduce development time and effort and focus on implementing business logic and data quality checks within the data pipeline using SQL or Python.

**•** **Data quality checks:** Improve data reliability throughout the data lakehouse so data teams can confidently trust the information for downstream initiatives with the ability to define data quality and automatically address errors.

**•** **Batch and streaming:** Allow data engineers to set tunable data latency with cost controls without having to know complex stream processing and implement recovery logic.

**•** **Automatic recovery:** Handle transient errors and use automatic recovery for most common error conditions that can occur during the operation of a pipeline with fast, scalable fault-tolerance.

-----

**Data engineering is all about data quality**

The goal of modern data engineering is to distill data with a quality that is fit for downstream analytics and AI. Within the Lakehouse, data quality is achieved on three different levels.

1. On a **technical level**, data quality is guaranteed by enforcing and evolving schemas for data storage and ingestion.

   [Figure labels: Kinesis; CSV, JSON, TXT...; Data Lake]

2. On an **architectural level**, data quality is often achieved by implementing the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a [lakehouse](https://databricks.com/glossary/data-lakehouse) with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture, e.g., from Bronze to Silver to Gold layer tables.

3. The **Databricks Unity Catalog** comes with robust data quality management with built-in quality controls, testing, monitoring and enforcement to ensure accurate and useful data is available for downstream BI, analytics and machine learning workloads.
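To make the medallion pattern more concrete, here is a minimal sketch in plain Python rather than Spark or Delta Lake. The sample records, the field names and the helper functions `to_silver` and `to_gold` are assumptions made purely for illustration; in a real lakehouse each layer would be a Delta table.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a Bronze table (kept as-is).
bronze = [
    {"order_id": "1", "country": "us", "amount": "10.50"},
    {"order_id": "2", "country": "DE", "amount": "not-a-number"},
    {"order_id": "3", "country": "de", "amount": "4.00"},
    {"order_id": "1", "country": "us", "amount": "10.50"},  # duplicate record
]


def to_silver(rows):
    """Filter, clean and de-duplicate Bronze rows."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail basic validation
        if row["order_id"] in seen:
            continue  # drop duplicates of an already accepted order
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Aggregate Silver rows into a business-level summary per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(silver)
print(gold)
```

Each layer only reads from the one before it, which is what lets quality improve step by step: Bronze keeps everything, Silver keeps only validated and de-duplicated rows, and Gold holds aggregates ready for reporting.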
[Figure: medallion architecture. Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented) and Gold (business-level aggregates) tables, with quality increasing from Bronze to Gold, feeding streaming analytics, BI and reporting, and data science and ML.]

-----

#### Data ingestion

With the Databricks Lakehouse Platform, data engineers can build robust hyper-scale ingestion pipelines in streaming and batch mode. They can incrementally process new files as they land on cloud storage, with no need to manage state information, in scheduled or continuous jobs. Data engineers can efficiently track new files (with the ability to scale to billions of files) without having to list them in a directory. Databricks automatically infers the schema from the source data and evolves it as the data loads into the Delta Lake lakehouse. Efforts continue with enhancing and supporting Auto Loader, our powerful data ingestion tool for the Lakehouse.

**What is Auto Loader?**

Have you ever imagined that ingesting data could become as easy as dropping a file into a folder? Welcome to Databricks Auto Loader.

[Auto Loader](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html) is an optimized data ingestion tool that incrementally and efficiently processes new data files as they arrive in the cloud storage built into the Databricks Lakehouse. Auto Loader can detect and enforce the schema of your data and, therefore, guarantee data quality. New files or files that have been changed since the last time new data was processed are identified automatically and ingested. Noncompliant data sets are quarantined into rescue data columns. You can use the [trigger once] option with Auto Loader to turn it into a job that turns itself off.

**Ingestion for data analysts: COPY INTO**

Ingestion also got much easier for data analysts and analytics engineers working with Databricks SQL.
[COPY INTO](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html) is a simple SQL command that follows the lake-first approach and loads data from a folder location into a Delta Lake table. COPY INTO can be scheduled and called by a job repeatedly. When run, only new files from the source location will be processed.

#### Data transformation

Turning SQL queries into production ETL pipelines typically involves a lot of tedious, complicated operational work. Although the medallion architecture is an established and reliable pattern for improving data quality, the implementation of this pattern is challenging for many data engineering teams. While hand-coding the medallion architecture was hard for data engineers, creating data pipelines was outright impossible for data analysts who could not code with Spark Structured Streaming in Scala or Python.

Even at a small scale, most data engineering time is spent on tooling and managing infrastructure rather than transformation. Auto-scaling, observability and governance are difficult to implement and, as a result, often left out of the solution entirely.

-----

#### What is Delta Live Tables?

Delta Live Tables (DLT) is the first ETL framework that uses a simple **declarative approach** to building reliable data pipelines. DLT automatically scales your infrastructure so data analysts and engineers can spend less time on tooling and focus on getting value from data. Engineers are able to **treat their data as code** and apply modern software engineering best practices like testing, error-handling, monitoring and documentation to deploy reliable pipelines at scale. DLT fully supports both Python and SQL and is tailored to work with both streaming and batch workloads.
With DLT you write a Delta Live Table in a SQL notebook, create a pipeline under Workflows and simply click [Start].

[Figure: three steps. Write create live table, create a pipeline, click Start.]

-----

DLT reduces the implementation time by accelerating development and automating complex operational tasks. Since DLT can use plain SQL, it also enables data analysts to create production pipelines and turns them into the often discussed “analytics engineer.” At runtime, DLT speeds up pipeline execution with Photon.

Software engineering principles are applied for data engineering to foster the idea of treating your data as code. Your data is the sole source of truth for what is going on inside your business. Beyond just the transformations, there are many things that should be included in the code that defines your data.

[Figure labels: Dependency management; Full refresh; Incremental computation*; Checkpointing and retries. *Coming soon]

Declaratively express entire data flows in SQL or Python. Natively enable modern software engineering best practices like separate development and production environments, the ability to easily test before deploying, deploy and manage environments using parameterization, unit testing and documentation.

DLT also automatically scales compute, providing the option to set the minimum and maximum number of instances and let DLT size up the cluster according to cluster utilization. In addition, tasks like orchestration, error handling and recovery, and performance optimization are all handled automatically.

-----

Expectations in the code help prevent bad data from flowing into tables, track data quality over time, and provide tools to troubleshoot bad data with granular pipeline observability. This enables a high-fidelity lineage diagram of your pipeline to track dependencies and aggregate data quality metrics across all your pipelines.
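The idea behind expectations can be illustrated without DLT itself. The sketch below is plain Python, not the DLT API: the `apply_expectations` helper, the expectation names, the `"drop"` and `"warn"` actions and the sample rows are all hypothetical. It only shows how named rules can both keep bad rows out of a table and produce per-rule quality counts.

```python
def apply_expectations(rows, expectations):
    """Return (kept_rows, metrics).

    `expectations` maps a rule name to (predicate, action). A failing row is
    always counted; it is removed only when the action is "drop".
    """
    metrics = {name: 0 for name in expectations}
    kept = []
    for row in rows:
        drop_row = False
        for name, (predicate, action) in expectations.items():
            if not predicate(row):
                metrics[name] += 1  # track data quality per rule
                if action == "drop":
                    drop_row = True
        if not drop_row:
            kept.append(row)
    return kept, metrics


# Hypothetical rules: one removes bad rows, one only records violations.
expectations = {
    "valid_amount": (lambda r: r["amount"] is not None and r["amount"] >= 0, "drop"),
    "has_country": (lambda r: bool(r.get("country")), "warn"),
}

rows = [
    {"id": 1, "amount": 12.0, "country": "US"},
    {"id": 2, "amount": -5.0, "country": "DE"},  # fails valid_amount, dropped
    {"id": 3, "amount": 7.5, "country": ""},     # fails has_country, kept
]

kept, metrics = apply_expectations(rows, expectations)
print([r["id"] for r in kept])
print(metrics)
```

Keeping the failure counts next to the rules is what makes quality trackable over time: the same definitions that filter rows also feed the metrics.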
Unlike other products that force you to deal with streaming and batch workloads separately, DLT supports any type of data workload with a single API so data engineers and analysts alike can build cloud-scale data pipelines faster without the need for advanced data engineering skills.

#### Data orchestration

The lakehouse makes it much easier for businesses to undertake ambitious data and machine learning (ML) initiatives. However, orchestrating and managing end-to-end production workflows remains a bottleneck for most organizations, which rely on external tools or cloud-specific solutions that are not part of their lakehouse platform. Tools that decouple task orchestration from the underlying data processing platform reduce the overall reliability of their production workloads, limit observability, and increase complexity for end users.

#### What is Databricks Workflows?

[Databricks Workflows](https://databricks.com/product/workflows) is the first fully managed and integrated lakehouse [orchestration](https://databricks.com/glossary/orchestration) service that allows data teams to build reliable workflows on any cloud.

Workflows lets you orchestrate data flow pipelines (written in DLT or dbt), as well as machine learning pipelines, or any other tasks such as notebooks or Python wheels. Since Databricks Workflows is fully managed, it eliminates operational overhead for data engineers, enabling them to focus on their workflows, not on managing infrastructure. It provides an easy point-and-click authoring experience for all your data teams, not just those with specialized skills. Deep integration with the underlying lakehouse platform ensures you will create and run reliable production workloads on any cloud while providing deep and centralized monitoring with simplicity for end users.

Sharing job clusters over multiple tasks reduces the time a job takes, reduces costs by eliminating overhead and increases cluster utilization with parallel tasks.
-----

Databricks Workflows’ deep integration with the lakehouse can best be seen with its monitoring and observability features. The matrix view in the following graphic shows a history of runs for a job. Failed tasks are marked in red. A failed job can be repaired and rerun with the click of a button. Rerunning a failed task detects and triggers the execution of all dependent tasks.

You can create workflows with the UI, but also through the Databricks Workflows API, or with external orchestrators such as Apache Airflow. Even if you are using an external orchestrator, Databricks Workflows’ monitoring acts as a single pane of glass that includes externally triggered workflows.

-----

#### Orchestrate anything

Remember that DLT is one of many task types for Databricks Workflows. This is where the managed data flow pipelines with DLT tie together with the easy point-and-click authoring experience of Databricks Workflows.

In the following example, you can see an end-to-end workflow built with customers in a workshop: Data is streamed from Twitter according to search terms, then ingested with Auto Loader using automatic schema detection and enforcement. In the next step, the data is cleaned and transformed with Delta Live Tables pipelines written in SQL, and finally run through a pre-trained BERT language model from Hugging Face for sentiment analysis of the tweets. Different task types for ingest, cleanse/transform and ML are combined in a single workflow.

Using Workflows, these tasks can be scheduled to provide a daily overview of social media coverage and customer sentiment for a business. After streaming tweets with filtering for keywords such as “data engineering,” “lakehouse” and “Delta Lake,” we curated a list of those tweets that were classified as positive with the highest probability score.
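The workflow above is a small directed acyclic graph of tasks, and the repair behavior described earlier (rerunning a failed task together with everything that depends on it) follows from that structure. The sketch below shows the idea with Python's standard `graphlib` module (Python 3.9+). It is not the Databricks Workflows API; the task names and the `tasks_to_repair` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job definition: each task maps to the set of tasks it depends on.
job = {
    "ingest_tweets": set(),
    "clean_transform": {"ingest_tweets"},
    "sentiment_analysis": {"clean_transform"},
    "daily_report": {"sentiment_analysis"},
    "archive_raw": {"ingest_tweets"},
}


def run_order(dag):
    """Return one valid execution order in which dependencies run first."""
    return list(TopologicalSorter(dag).static_order())


def tasks_to_repair(dag, failed):
    """Return the failed task plus all downstream tasks, in execution order."""
    order = run_order(dag)
    affected = {failed}
    # Walking in execution order lets failures propagate transitively.
    for task in order:
        if dag[task] & affected:
            affected.add(task)
    return [task for task in order if task in affected]


order = run_order(job)
repair = tasks_to_repair(job, "clean_transform")
print(order)
print(repair)
```

In this example a failure in `clean_transform` leaves `ingest_tweets` and `archive_raw` untouched, so a repair run only needs to redo the failed step and the two tasks downstream of it.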
**Learn more**

[Data Engineering on the Lakehouse](https://databricks.com/solutions/data-pipelines)

[Delta Live Tables](https://databricks.com/product/delta-live-tables)

[Databricks Workflows](https://www.databricks.com/product/workflows)

[Big Book of Data Engineering](https://databricks.com/p/ebook/the-big-book-of-data-engineering?itm_data=datapipelines-promo-bigbookofde)

-----

**CHAPTER**

# 09

### Data streaming

There are two types of data processing: batch processing and streaming processing.

Batch processing refers to the discontinuous, periodic processing of data that has been stored for a period of time. For example, an organization may need to run weekly reports on a set of predictable transaction data. There is no need for this data to be streaming; it can be processed on a weekly basis.

Streaming processing, on the other hand, refers to unbounded processing of data as it arrives.

-----

**Data Streaming Challenges**

However, getting value from streaming data can be a tricky practice. While most data today can be considered streaming data, organizations are overwhelmed by the need to access, process and analyze the volume, speed and variety of this data moving through their platforms. To keep pace with innovation, they must quickly make sense of data streams decisively, consistently and in real time.

Three common technical challenges organizations experience with implementing real-time data streaming include:

**•** **Specialized APIs and language skills:** Data practitioners encounter barriers to adopting streaming skillsets because there are new languages, APIs and tools to learn.

**•** **Operational complexity:** To implement data streaming at scale, data teams need to integrate and manage streaming-specific tools with their other cloud services.
They also have to manually build complex operational tooling to help these systems recover from failure, restart workloads without reprocessing data, optimize performance, scale the underlying infrastructure, and so on.

**•** **Incompatible governance models:** Different governance and security models across real-time and historical data platforms make it difficult to provide the right access to the right users, see the end-to-end data lineage, and/or meet compliance requirements.

In a wide variety of cases, an organization might find it useful to leverage streaming data. Here are some common examples:

**•** **Retail:** Real-time inventory updates help support business activities, such as inventory and pricing optimization and optimization of the supply chain, logistics and just-in-time delivery.

**•** **Smart energy:** Smart meter monitoring in real time allows for smart electricity pricing models and connection with renewable energy sources to optimize power generation and distribution.

**•** **Preventative maintenance:** By reducing unplanned outages and unnecessary site and maintenance visits, real-time streaming analytics can lower operational and equipment costs.

**•** **Industrial automation:** Manufacturers can use streaming and predictive analytics to improve production processes and product quality, including setting up automated alerts.

**•** **Healthcare:** To optimize care recommendations, real-time data allows for the integration of various smart sensors to monitor patient condition, medication levels and even recovery speed.

**•** **Financial institutions:** Firms can conduct real-time analysis of transactions to detect fraudulent transactions and send alerts. They can use fraud analytics to identify patterns and feed data into machine learning algorithms.
Regardless of specific use cases, the central tenet of streaming data is that it gives organizations the opportunity to leverage the freshest possible insights for better decision-making and more optimized customer experiences.

-----

**Data streaming architecture**

Before addressing these challenges head-on, it may help to take a step back and discuss the ingredients of a streaming data pipeline. Then, we will explain how the Databricks Lakehouse Platform operates within this context to address the aforementioned challenges.

Every application of streaming data requires a pipeline that brings the data from its origin point (whether sensors, IoT devices or database transactions) to its final destination.

In building this pipeline, streaming architectures typically employ two layers. First, streaming capture systems **capture** and temporarily store streaming data for processing. Sometimes these systems are also called messaging systems or messaging buses. These systems are optimized for small payloads and high frequency inputs/outputs. Second, streaming **processing** systems continuously process data from streaming capture systems and other storage systems.

[Figure: the capturing and processing layers]

It may help to think of a simplified streaming pipeline according to the following seven phases:

1. Data is continuously generated at origin points
2. The generated data is captured from those origin points by a capture system like Apache Kafka (with limited retention)
3. **The captured data is extracted and incrementally ingested to a processing platform like Databricks; data is ingested exactly once and stored permanently, even if this step is rerun**
4. **The ingested data is converted into a workable format**
5. **The formatted data is cleansed, transformed and joined in a number of pipeline steps**
6. **The transformed data is processed downstream through analysis or ML modeling**
7.
The resulting analysis or model is used for some sort of practical application, which may be anything from basic reporting to an event-driven software application

You will notice four of the steps in this list are in boldface. This is because the lakehouse architecture is specifically designed to optimize this part of the pipeline. Uniquely, the Databricks Lakehouse Platform can ingest, transform, analyze and model on streaming data _alongside_ batch-processed data. It can accommodate both structured _and_ unstructured data. It is here that the value of unifying the best pieces of data lakes and data warehouses really shines for complex enterprise use cases.

-----

**Data Streaming on the Lakehouse**

Now let’s zoom in a bit and see how the Databricks Lakehouse Platform addresses each part of the pipeline mentioned above.

**Streaming data ingestion and transformation** begins with continuously and incrementally collecting raw data from streaming sources through a feature called Auto Loader. Once the data is ingested, it can be transformed from raw, messy data into clean, fresh, reliable data appropriate for downstream analytics, ML or applications. [Delta Live Tables (DLT)](https://www.databricks.com/product/delta-live-tables) makes it easy to build and manage these data pipelines while automatically taking care of infrastructure management and scaling, data quality, error testing and other administrative tasks. DLT is a high-level abstraction built on Spark Structured Streaming, a scalable and fault-tolerant stream processing engine.

**[Real-time analytics](https://www.databricks.com/product/databricks-sql)** refers to the downstream analytical application of streaming data. With fresher data streaming into SQL analytics or BI reporting, more actionable insights can be achieved, resulting in better business outcomes.

**[Real-time ML](https://www.databricks.com/product/machine-learning)** involves deploying ML models in a streaming mode.
This deployment is supported with Structured Streaming for continuous inference from a live data stream. Like real-time analytics, real-time ML is a downstream impact of streaming data, but for different business use cases (i.e., AI instead of BI). Real-time modeling has many benefits, including more accurate predictions about the future.

**Real-time applications** process data directly from streaming pipelines and trigger programmatic actions, such as displaying a relevant ad, updating the price on a pricing page, stopping a fraudulent transaction, etc. There typically is no human-in-the-loop for such applications.

[Figure label: Data in cloud storage and message stores]

-----

**Databricks Lakehouse Platform differentiators**

Understanding what the lakehouse architecture provides is one thing, but it is useful to understand how Databricks uniquely approaches the common challenges mentioned earlier around working with streaming data.

**Databricks empowers unified data teams.** Data engineers, data scientists and analysts can easily build streaming data workloads with the languages and tools they already know and the APIs they already use.

**Databricks simplifies development and operations.** Organizations can focus on getting value from data by reducing complexity and automating many of the production aspects associated with building and maintaining real-time data workloads.

See why customers love streaming on the Databricks Lakehouse Platform with these resources.
**Learn more**

[Data Streaming Webpage](https://www.databricks.com/product/data-streaming)

[Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark](https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html)

[Structured Streaming Documentation](https://docs.databricks.com/spark/latest/structured-streaming/index.html)

[Streaming: Getting Started With Apache Spark on Databricks](https://databricks.com/spark/getting-started-with-apache-spark/streaming)

**Databricks is one platform for streaming and batch data.** Organizations can eliminate data silos, centralize security and governance models, and provide complete support for all their real-time use cases under one roof: the roof of the lakehouse.

Finally, and perhaps most important, Delta Lake, the core of the [Databricks Lakehouse Platform](https://www.databricks.com/product/data-lakehouse), was built for streaming from the ground up. Delta Lake is deeply integrated with Spark Structured Streaming and overcomes many of the limitations typically associated with streaming systems and files.

In summary, the Databricks Lakehouse Platform dramatically simplifies data streaming to deliver real-time analytics, machine learning and applications on one platform. And, that platform is built on a foundation with streaming at its core. This means organizations of all sizes can use their data in motion and make more informed decisions faster than ever.

-----

**CHAPTER**

# 10

### Data science and machine learning

While most companies are aware of the potential benefits of applying machine learning and AI, realizing these potentials can often be quite challenging for those brave enough to take the leap.
Some of the largest hurdles come from siloed/disparate data systems, complex experimentation environments, and getting models served in a production setting.

Fortunately, the Databricks Lakehouse Platform provides a helping hand and lets you use data to derive innovative insights, build powerful predictive models, and enable data scientists, ML engineers, and developers of all kinds to create within the space of machine learning and AI.

-----

#### Databricks Machine Learning

-----

#### Exploratory data analysis

With all the data in one place, data is easily explored and visualized from within the notebook-style experience that provides support for various languages (R, SQL, Python and Scala) as well as built-in visualizations and dashboards. Confidently and securely share code with co-authoring, commenting, automatic versioning, Git integrations and role-based access controls. The platform provides laptop-like simplicity at production-ready scale.

-----

#### Model creation and management

From data ingestion to model training and tuning, all the way through to production model serving and versioning, the Lakehouse brings the tools needed to simplify those tasks.

Get right into experimenting with the Databricks ML runtimes, optimized and preconfigured to include most popular libraries like scikit-learn, XGBoost and more. Massively scale thanks to built-in support for distributed training and hardware acceleration with GPUs.

From within the runtimes, you can track model training sessions, package and reuse models easily with [MLflow](https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html), an open source machine learning platform created by Databricks and included as a managed service within the Lakehouse. It provides a centralized location from which to manage models and package code in an easily reusable way.

Training these models often involves the use of features housed in a centralized feature store.
Fortunately, Databricks has a built-in feature store that allows you to create new features, explore and re-use existing features, select features for training and scoring machine learning models, and publish features to low-latency online stores for real-time inference.

If you are looking to get a head start, [AutoML](https://databricks.com/blog/2022/04/18/supercharge-your-machine-learning-projects-with-databricks-automl-now-generally-available.html) allows for low to no-code experimentation by pointing to your data set and automatically training models and tuning hyperparameters to save both novice and advanced users precious time in the machine learning process.

AutoML will also report back metrics related to the model training results as well as the code needed to repeat the training already custom-tailored to your data set. This glass box approach ensures that you are never trapped or suffer from vendor lock-in.

In that regard, the Lakehouse supports the industry’s widest range of data tools, development environments, and a thriving ISV ecosystem so you can make your workspace your own and put out your best work.

##### Compute platform

**Any ML workload optimized and accelerated**

**Databricks Machine Learning Runtime**

- Optimized and preconfigured ML frameworks
- Turnkey distribution ML
- Built-in AutoML
- GPU support out of the box

[Figure labels: Built-in **ML frameworks** and **model explainability**; built-in support for **AutoML** and **hyperparameter tuning**; built-in support for **distributed training**; built-in support for **hardware accelerators**]

-----

#### Deploy your models to production

Exploring and creating your machine learning models typically represents only part of the task. Once the models exist and perform well, they must become part of a pipeline that keeps models updated, monitored and available for use by others.
Databricks can help here by providing a world-class experience for model versioning, monitoring and serving within the same platform that you can use to generate the models themselves. This means you can make all your ML pipelines in the same place, monitor them for drift, retrain them with new data, and promote and serve them easily and at scale.

Throughout the ML lifecycle, rest assured knowing that lineage and governance are being tracked the entire way. This means regulatory compliance and security woes are significantly reduced, potentially avoiding costly issues down the road.

**Model lifecycle management**

[Figure: model lifecycle stages. Logged model, Staging, Production, Archived.]

**Webhooks** allow registering of callbacks on events like stage transitions to integrate with CI/CD automation.

**Tags** allow storing deployment-specific metadata with model versions, e.g., whether the deployment was successful.

**Comments** allow communication and collaboration between teammates when reviewing model versions.

-----

**Learn more**

[Databricks Machine Learning](https://databricks.com/product/machine-learning)

[Databricks Data Science](https://databricks.com/product/data-science)

[Databricks ML Runtime Documentation](https://docs.databricks.com/runtime/mlruntime.html)

-----

**CHAPTER**

# 11

### Databricks Technology Partners and the modern data stack

Databricks Technology Partners integrate their solutions with Databricks to provide complementary capabilities for ETL, data ingestion, business intelligence, machine learning and governance. These integrations allow customers to leverage the Databricks Lakehouse Platform’s reliability and scalability to innovate faster while deriving valuable data insights. Use preferred analytical tools with optimized connectors for fast performance, low latency and high user concurrency to your data lake.

-----

With [Partner Connect](https://databricks.com/partnerconnect), you can bring together all your data, analytics and AI tools on one open platform.
Databricks provides a fast and easy way to connect your existing tools to your lakehouse using validated integrations and helps you discover and try new solutions.

[Figure: Databricks thrives within your modern data stack. Partner categories: BI and dashboards, machine learning, data science, data governance, data pipelines, data ingestion, consulting and SI partners. Platform layers: data warehousing, data engineering, data streaming, data science and ML; Unity Catalog; Delta Lake; cloud data lake.]

**Learn more**

[Become a Partner](https://databricks.com/p/register-your-interest-for-databricks-partner-program)

[Partner Connect demos](https://databricks.com/partnerconnect#partner-demos)

[Partner Connect](https://databricks.com/partnerconnect)

[Databricks Partner Connect Guide](https://docs.databricks.com/integrations/partner-connect/index.html)

-----

**CHAPTER**

# 12

### Get started with the Databricks Lakehouse Platform

-----

#### Databricks Trial

Get a collaborative environment for data teams to build solutions together with interactive notebooks to use Apache Spark™, SQL, Python, Scala, Delta Lake, MLflow, TensorFlow, Keras, scikit-learn and more.

**•** Available as a 14-day full trial in your own cloud or as a lightweight trial hosted by Databricks

**[Try Databricks for free](https://databricks.com/try-databricks?itm_data=NavBar-TryDatabricks-Trial)**

**[Databricks documentation](https://databricks.com/documentation)**

Get detailed documentation to get started with the Databricks Lakehouse Platform on your cloud of choice: Databricks on AWS, Azure Databricks and [Databricks on Google Cloud](https://docs.gcp.databricks.com/).
**[Databricks Demo Hub](https://databricks.com/discover/demos)**

Get a firsthand look at Databricks from the practitioner’s perspective with these simple on-demand videos. Each demo is paired with related materials, including notebooks, videos and eBooks, so that you can try it out for yourself on Databricks.

**[Databricks Academy](https://databricks.com/learn/training/home)**

Whether you are new to the data lake or building on an existing skill set, you can find a curriculum tailored to your role or interest. With training and certification through Databricks Academy, you will learn to master the Databricks Lakehouse Platform for all your big data analytics projects.

**[Databricks Community](https://community.databricks.com/)**

Get answers, network with peers and solve the world’s toughest problems, together.

**[Databricks Labs](https://databricks.com/learn/labs)**

Databricks Labs are projects created by the field to help customers get their use cases into production faster.

**[Databricks customers](https://databricks.com/customers)**

Discover how innovative companies across every industry are leveraging the Databricks Lakehouse Platform.

-----

#### About Databricks

Databricks is the data and AI company. More than 7,000 organizations worldwide (including Comcast, Condé Nast, H&M and over 40% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc).

© Databricks 2022. All rights reserved.
Apache, Apache Spark, Spark and the Spark

-----

##### Guide

## 6 Strategies for Building Personalized Customer Experiences

-----

### Contents

**Introduction**

**1.** **Building a Foundation for Personalization** Leveraging ML-Based Customer Entity Resolution

**2.** **Estimating Customer Lifetime Value** Building Brand Loyalty With Data

**3.** **Mitigating Customer Churn** Balancing Acquisition and Retention

**4.** **Streamlining Customer Analysis and Targeting** Creating Efficiency and Accuracy With Behavioral Data

**5.** **Assessing Consumer Interest Data** Fine-Tuning ML Recommendations

**6.** **Delivering Personalized Customer Journeys** Crafting a Real-Time Recommendation Engine

**Conclusion** Building a Direct Path to Winning the Minds and Wallets of Your Customers

-----

### Introduction

In today’s experience-driven world, the most beloved brands are the ones that know their customers. Customers are loyal to brands that recognize their needs and preferences, and that tailor user journeys and engagements accordingly.

A study from McKinsey shows [76% of consumers](https://www.mckinsey.com/business-functions/growth-marketing-and-sales/our-insights/the-value-of-getting-personalization-right-or-wrong-is-multiplying) are more likely to consider buying from a brand that personalizes the shopping and user experience to the wants and needs of the customer. And as organizations pursue omnichannel excellence, these same high expectations of online experiences also extend to brick-and-mortar locations, revealing for many merchants that personalized engagement is fundamental to attracting customers and expanding share of wallet.

But achieving a 360-degree view of your customers to serve personalized experiences requires integrating various types of data (including demographic, behavioral and transactional data) to develop robust profiles. This guide focuses on six actionable strategic pillars for businesses to leverage automation, real-time data, AI-driven analysis and well-tuned ML models to architect and deliver customized customer experiences at every touch point.

**76% of consumers are more likely to purchase due to personalization**

-----

### Building a Foundation for Personalization

Get a 360-degree view of the customer by leveraging ML-based entity resolution

To create truly personalized interactions, you need actionable insights about your customers. Start by establishing a common customer profile and accurately linking together customer records across disparate data sets.
Get a 360-degree view of your target customer by bringing together:

- Sales and traffic-driven first-party data
- Product ratings and surveys
- Customer surveys and support center calls
- Third-party data purchased from data aggregators and online trackers
- Zero-party data provided by customers themselves

[Figure: Customer 360, surrounded by its data sources: Location, Demographics, Orders, Network/Usage, Social, Apps/Clickstream, Service Call/Records, Billing and Devices.]

**CASE STUDY**

**Personalizing experiences with data and ML**

Grab is the largest online-to-offline platform in Southeast Asia and has generated over 6 billion transactions for transport, food and grocery delivery, and digital payments. Grab uses Databricks to create sophisticated customer segmentation and recommendation engines that can now ingest and optimize thousands of user-generated signals and data sources simultaneously, enhancing data integrity and security, and reducing weeks of work to only hours.

[Get the full story](https://www.databricks.com/customers/grab)

“The C360 platform empowered teams to create consumer features at scale, which in turn allows for these features to be extended to other markets and used by other teams. This helps to reduce the engineering overhead and costs exponentially.”

**NIKHIL DWARAKANATH**
Head of Analytics, Grab

-----

Given the different data sources and data types, automated matching can still be incredibly challenging due to inconsistent formats, misinterpretation of data, and entry errors across various systems. Even when records are inconsistent, all that data may be perfectly valid. But to accurately connect the millions of customer identities most retailers manage, businesses must lean on automation.
In a machine learning (ML) approach to entity resolution, text attributes like name, address and phone number are translated into numerical representations that can be used to quantify the degree of similarity between any two attribute values. But your ability to train such a model depends on your access to accurately labeled training data. It’s a time-consuming exercise, but if done right, the model learns to reflect the judgments of the human reviewers. Many organizations rely on libraries encapsulating this knowledge to build their applications and workflows. One such library is [Zingg](https://www.zingg.ai/) , an open source library bringing together ML-based approaches to intelligent candidate pair generation and pair-scoring. Oriented toward the construction of custom workflows, Zingg presents these capabilities within the context of commonly employed steps such as training data label assignment, model training, data set deduplication, and (cross-data set) record matching. Built as a native Apache Spark TM application, Zingg scales well to apply these techniques to enterprise-sized data sets. Organizations can then use Zingg in combination with platforms such as Databricks Lakehouse to provide the back end to human-in-the-middle workflow applications that automate the bulk of the entity resolution work and present data experts with a more manageable set of edge case pairs to interpret. As an active-learning solution, models can be retrained to take advantage of this additional human input to improve future predictions and further reduce the number of cases requiring expert review. Finally, these technologies can be assembled to enable their own enterprise-scaled customer entity resolution workflow applications. 
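To make the idea of turning text attributes into numerical similarity scores more concrete, here is a minimal sketch using only the Python standard library. It is not Zingg's API and not the method used in the accelerator: character trigram Jaccard similarity is simply one common stand-in for a learned similarity measure, and the function names, field names and sample records are all hypothetical.

```python
import re


def normalize(value):
    """Lowercase a text attribute and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def char_ngrams(value, n=3):
    """Return the set of character n-grams of the normalized value."""
    text = normalize(value)
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity of character n-grams, from 0.0 to 1.0."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # an empty attribute gives no evidence of a match
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def pair_features(record_a, record_b, fields=("name", "address", "phone")):
    """Turn a candidate record pair into one similarity score per field."""
    return [similarity(record_a[f], record_b[f]) for f in fields]


# Hypothetical records that a human reviewer would likely label as a match.
crm_record = {"name": "Jon Smith", "address": "12 Main St.", "phone": "(555) 010-2000"}
web_record = {"name": "John Smith", "address": "12 Main Street", "phone": "555.010.2000"}

# One numeric feature per attribute; a classifier trained on labeled pairs
# would consume vectors like this to score match confidence.
features = pair_features(crm_record, web_record)
print(features)
```

In a real workflow, vectors like `features` would be paired with reviewer labels (match or no match) to train the scoring model, which is where the accurately labeled training data mentioned above comes in.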
**Need help building your foundation for a 360-degree view of your customers?**

Get pre-built code, sample data and step-by-step instructions in a Databricks notebook in the **Customer Entity Resolution Solution Accelerator.**

- Translating text attributes (like name, address, phone number) into quantifiable numerical representations
- Training ML models to determine if these numerical labels form a match
- Scoring the confidence of each match

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/customer-entity-resolution)**

-----

### Estimating Customer Lifetime Value

Building brand loyalty to drive share of wallet with data

Once you’ve set up a 360-degree view of the customer, the next challenge is how to spend money to profitably grow the brand. The goal is to spend marketing dollars on activities that attract loyal customers and avoid spending on unprofitable customers or activities that damage the brand. Keep in mind that making decisions solely based on ROI isn’t the answer. This one-track approach could ultimately weaken your brand equity and make you more dependent on lowering your price through promotions as a way to generate sales.

**Identifying and engaging brand loyalists**

Today’s customer has overwhelmingly abundant options in products and services to choose from. That’s why personalizing customer experiences is so important, as it increases revenue, marketing efficiency and customer retention.

Not every customer carries the same potential for profitability. Different customers derive different value from your products and services, which directly translates into differences in the overall amount of value a business can expect in return. Mutually beneficial relationships carefully align customer acquisition cost (CAC) and retention rates with the total revenue or customer lifetime value (CLV).

**CASE STUDY**

**Predicting and increasing customer lifetime value with ML**

Kolibri Games, creators of Idle Miner Tycoon and Idle Factory Tycoon, attracts over 10 million monthly active users. With Databricks, they achieved a 30% increase in player LTV, improved data team productivity by 3x, and reduced ML model-to-production time by 40x.

[Get the full story](https://databricks.com/customers/kolibri-games)

Within your existing customer base are people ranging from brand loyalists to brand transients. Brand loyalists are highly engaged with your brand, are willing to share their experience with others, and are the most likely to purchase again. Brand transients have no loyalty to your brand and shop based on price. Your focus should be on growing the group of brand loyalists while minimizing interactions with brand transients.

**Calculating customers’ lifetime intent**

To assess the remaining lifetime in a customer relationship, businesses must carefully examine the transactional signals and other indicators from previous customer engagements and transactions.

For example, if a frequent customer slows down their buying habits, or simply doesn’t make a purchase for an extended period of time, it may signal the upcoming end of the relationship. However, in the case of another customer who engages infrequently, the same extended absence may not signal anything notable. The infrequent buyer may continue to purchase even after a long pause in activity.

-----

[Figure: Purchase timelines for Customer A, Customer B and Customer C, from past to future. Different customers with the same number of transactions, but signaling different lifetime intent.]

[Figure: The probability of re-engagement (P_alive) relative to a customer’s history of purchases.]

Every customer relationship with a business has a lifespan. Understanding where a customer is in that lifespan at a given time provides critical insight to inform marketing and sales tactics.
By proactively discovering shifts in the relationship, you can adapt how to respond to each customer at the optimal time. For example, a certain signal might prompt a change in how to deliver products and services, which could help maximize revenue.

Transactional signals can be used to estimate the probability that a customer is active and likely to return in the future. Using an approach popularized as the Buy ’til You Die (BTYD) model, analysts can compare a customer’s frequency and recency of engagement to similar patterns across their user population to accurately predict individual CLV.

The mathematics behind these predictive CLV models is complex, but the logic behind them is accessible through a popular Python library named Lifetimes, which takes simple summary metrics as input and derives customer-specific lifetime estimates.

**CASE STUDY**

**How personalized experiences keep customers coming back for more**

Publicis Groupe empowers brands to transform retail experiences with digital technologies, but data challenges and team silos stood in the way of delivering the personalization that their customers required. See how they use Databricks to create a single customer view that allows them to drive customer loyalty and retention. As a result, they’ve seen a 45%–50% increase in customer campaign revenue.

[Get the full story](https://databricks.com/customers/publicis-groupe)

-----

**Delivering customer lifetime estimates to the business**

Spark natively distributes this work across a multi-server environment, enabling consistent, accurate and efficient analysis. Spark’s flexibility allows models to adapt in real time as new information is ingested, eliminating the bottlenecks that come with manual data mapping and profile building.

With per-customer metrics calculated, the Lifetimes library can be used to train multiple BTYD models, such as Pareto/NBD and BG/NBD.
Training models to predict engagements over time using proprietary data can take several months and thousands of training runs. [Hyperopt](http://hyperopt.github.io/hyperopt/), a hyperparameter optimization library, helps businesses tap into the infrastructure behind their Spark environments and distribute those training runs across the cluster.

Using the Lifetimes library to calculate customer-specific probabilities at speed and scale can be challenging, from processing large volumes of transaction data to deriving data curves and value distribution patterns and, eventually, to integration with business initiatives. But with the proper approach, you can resolve all of these challenges.

These models depend on three key per-customer metrics:

**FREQUENCY** The number of times within a given time period in which a repeat transaction is observed

**AGE** The length of time between the occurrence of an initial transaction and the end of a given time period

**RECENCY** The “age” of a customer (how long they’ve engaged with a brand) at the time of their latest repeat transaction

-----

**Solution deployment**

Once properly trained, these models can determine the probability that a customer will re-engage, as well as the number of engagements a business can expect from that customer over time. But the real challenge is putting these predictive capabilities into the hands of those who determine customer engagement.

[Figure: Matrices illustrating the probability a customer is alive (left) and the number of future purchases in a 30-day window given a customer’s frequency and recency metrics (right).]

Businesses need a way to develop and deploy solutions in a highly scalable environment with a limited upfront cost. Databricks Solution Accelerators leverage real-world sample data sets and pre-built code to show how raw data can be transformed into real solutions, including step-by-step instructions ready to go in a Databricks notebook.
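The three per-customer metrics defined earlier in this section (frequency, age and recency) can be derived from a raw transaction log. The sketch below does so with the Python standard library only. It is an illustration under stated assumptions, not the accelerator's code: it assumes daily granularity (several purchases on one day count as a single transaction), and the function name, customer IDs and dates are hypothetical. The Lifetimes library provides its own utilities for this step, which the sketch does not use.

```python
from collections import defaultdict
from datetime import date


def summarize_transactions(transactions, period_end):
    """Return {customer_id: (frequency, recency_days, age_days)}.

    transactions: iterable of (customer_id, purchase_date) pairs.
    period_end:   last day of the observation period.
    """
    purchase_days = defaultdict(set)
    for customer_id, purchase_date in transactions:
        if purchase_date <= period_end:
            purchase_days[customer_id].add(purchase_date)

    summary = {}
    for customer_id, days in purchase_days.items():
        first, last = min(days), max(days)
        frequency = len(days) - 1          # repeat transactions only
        recency = (last - first).days      # customer "age" at latest repeat purchase
        age = (period_end - first).days    # first purchase to end of period
        summary[customer_id] = (frequency, recency, age)
    return summary


# Hypothetical transaction log.
transactions = [
    ("A", date(2023, 1, 1)),
    ("A", date(2023, 1, 1)),   # same-day purchase, counted once
    ("A", date(2023, 2, 1)),
    ("A", date(2023, 6, 1)),
    ("B", date(2023, 3, 15)),  # single purchase, so no repeat transactions
]

summary = summarize_transactions(transactions, period_end=date(2023, 6, 30))
for customer_id, metrics in sorted(summary.items()):
    print(customer_id, metrics)
```

A customer with no repeat transactions ends up with a frequency and recency of zero, which is the case BTYD-style models treat as having the least evidence about lifetime intent.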
**Need help determining your customers’ lifetime value?**

Use the **Customer Lifetime Value Accelerator** to

- Ingest sample retail data
- Use pre-built code to develop visualizations and explore past purchase behavior
- Apply machine learning to predict the likelihood and nature of future purchases

**[GET THE ACCELERATOR](https://databricks.com/solutions/accelerators/customer-lifetime-value)**

-----

### Mitigating Customer Churn

Balancing acquisition and retention with personalized experiences

There are no guarantees of success. Because customers have a bevy of options at their disposal, churn is a reality that companies face and work to overcome every day. One [recent analysis](https://info.recurly.com/annual-subscription-billling-metrics-report?submissionGuid=3c21cde7-5f58-4d86-9218-332d697e7b3e) of consumer-oriented subscription services estimated a segment-average monthly churn rate of 7.2%. When narrowed to brands focused on consumer goods, that rate jumped to 10.0%. This figure translates to an average lifetime of 10 months for a subscription box service, leaving businesses of this kind with little time to recover acquisition costs and bring subscribers to net profitability.

**CASE STUDY**

##### Riot Games

**Creating an optimal in-game experience for League of Legends**

Riot Games is one of the top PC game developers in the world, with over 100 million monthly active users, 500 billion data points, and over 26 petabytes of data and counting. They turned to Databricks to build a more efficient and scalable way to leverage data and improve the overall gaming experience, ensuring customer engagement and reducing churn.

[Get the full story](https://www.databricks.com/customers/riot-games)

Organizations must take an honest look at the cost of acquisition relative to a customer’s lifetime value (LTV) earned.
These figures need to be brought into a healthy balance and treated as a “chronic condition” [to be managed.](https://retailtouchpoints.com/features/trend-watch/can-subscription-retail-solve-its-customer-retention-problem)

**Understanding attrition predictability through subscriptions: Examining retention-based acquisition variables**

Public data for subscription services is extremely hard to come by. KKBox, a Taiwan-based music streaming service, recently released over two years of anonymized [subscription data](https://www.kaggle.com/c/kkbox-churn-prediction-challenge) to examine customer churn. Through analyzing the data, we uncover customer dynamics familiar to any subscription provider.

Most subscribers join the KKBox service through a 30-day trial offer. Customers then appear to enlist in one-year subscriptions, which provide the service with a steady flow of revenue. Subscribers typically churn at the end of the 30-day trial and at regular one-year intervals.

The Survival Rate reflects the proportion of the initial (Day 1) subscriber population that is retained over time, first at the roll-to-pay milestone, and then at the renewal milestone.

-----

[Figure: By Initial Payment Method. Customer attrition by subscription day on the KKBox streaming service for customers registering via different payment methods.]

[Figure: By Initial Payment Plan Days. Customer attrition by subscription day on the KKBox streaming service for customers selecting different initial payment methods and terms/days.]

This pattern of high initial drop-off, followed by a period of slower but continuing drop-off cycles, makes intuitive sense. Where it gets interesting is when the data changes. The patterns of customer churn become vastly different as time passes and new or changing elements are introduced (e.g., payment methods and options, membership tiers, etc.).
[Figure: By Registration Channel. Customer attrition by subscription day on the KKBox streaming service for customers registering via different channels.]

-----

These patterns seem to indicate that KKBox _could_ potentially differentiate between customers based on their lifetime potential, using only the information available at subscriber acquisition. In the same way, non-subscription businesses could use similar data techniques to get an accurate illustration of the total lifetime value of a particular customer, even before collecting historical data.

This information can help businesses target certain shoppers with effective discounts or promotions as early as trial registration. Nevertheless, it’s always important to consider more than individual data points.

[Figure: The baseline risk of customer attrition over a subscription lifespan.]

[Figure: The channel and payment method multipliers combine to explain a customer’s risk of attrition at various points in time. The higher the value, the higher the proportional risk of churn in the associated period.]

-----

**Applying churn analytics to your data**

This analysis is useful in two ways: **1)** to quantify the risk of customer churn and **2)** to paint a quantitative picture of the specific factors that explain that risk, giving analysts a clearer understanding of what to focus on, what to ignore and what to investigate further. The main challenge is organizing the input data.

The data required to examine customer attrition may be scattered across multiple systems, making an integrated analysis difficult. [Data lakes](https://databricks.com/discover/data-lakes/introduction) support the creation of transparent, sustainable data processing pipelines that are flexible, scalable and highly cost-efficient. Remember that **churn is a chronic condition to be managed**, and attrition data should be periodically revisited to maintain alignment between acquisition and retention efforts.
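The arithmetic behind this section can be sketched in a few lines of Python. The first two functions assume a constant monthly churn rate, which is a simplification (the KKBox curves show churn concentrated at trial end and renewal dates); under that assumption a 10.0% monthly rate gives the 10-month average lifetime cited earlier. The third function assumes that channel and payment method multipliers scale a baseline risk multiplicatively, as in a proportional hazards model. All function names and the sample baseline and multiplier values are hypothetical.

```python
def expected_lifetime_months(monthly_churn_rate):
    """Average subscription lifetime, assuming a constant monthly churn rate."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return 1.0 / monthly_churn_rate


def survival_curve(monthly_churn_rate, months):
    """Share of the initial subscriber population retained after each month."""
    return [(1.0 - monthly_churn_rate) ** m for m in range(months + 1)]


def scaled_risk(baseline_risk, multipliers):
    """Scale a baseline attrition risk by per-factor multipliers.

    Assumes the factors combine multiplicatively (proportional hazards).
    """
    risk = baseline_risk
    for multiplier in multipliers:
        risk *= multiplier
    return risk


# Churn rates quoted in this section.
print(expected_lifetime_months(0.100))  # consumer goods subscriptions
print(expected_lifetime_months(0.072))  # segment average

# Retained share of subscribers over the first three months at 10% churn.
print(survival_curve(0.10, 3))

# Hypothetical values: a 2% baseline risk for the period, a channel
# multiplier of 1.5 and a payment method multiplier of 0.8.
print(scaled_risk(0.02, [1.5, 0.8]))
```

A multiplier above 1 raises the proportional risk of churn for that period and a multiplier below 1 lowers it, which is how the figure captions above should be read.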
**Need help predicting customer churn?**

Use the **Subscriber Churn Prediction Accelerator** to analyze behavioral data, identify subscribers with an increased risk of cancellation, and predict attrition. Machine learning lets you quantify a user’s likelihood to churn, identifying factors that explain the risk.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/survivorship-and-churn)**

-----

### Streamlining Customer Analysis and Targeting

Creating efficient and highly targeted customer experiences with behavioral data

Effective targeting comes down to one fundamental element: the cost of delivering a good or service relative to what a consumer is willing to pay. In the earliest applications of segmentation, manufacturers recognized that specialized product lines targeting specific consumer groups could help brands stand out against competitors.

**CASE STUDY**

**Finding that special something every time**

Pandora is a jewelry company with global reach. They built their master consumer view (MCV) dashboard on the Databricks Lakehouse Platform, giving them the insights necessary to deliver highly targeted messaging and personalization, resulting in 80% growth in email marketing success, a 50% increase in click-to-open rate across 65 million emails, and 255M DKK (Danish Krone) in quarterly revenue.

[Get the full story](https://www.databricks.com/customers/pandora)

This mode of thinking extends beyond product development and into every customer-oriented business function, requiring specific means of ideation, production and delivery.

The work put into segmentation doesn’t need to be a gamble. Scrutinizing customers and testing responsiveness is an ongoing process. Organizations must analyze and adapt to shifting markets, changing consumer demand and evolving business objectives.
**CASE STUDY**

**Powering insight-driven dashboards to increase customer acquisition**

Bagelcode is a global game company with more than 50 million global users. By using the Databricks Lakehouse Platform, they are now able to support more diversified indicators, such as a user’s level of frequency and the amount of time they use a specific function for each game, enabling more well-informed responses. In addition, the company is mitigating customer churn by better predicting gamer behavior and providing personalized experiences at scale.

[Get the full story](https://www.databricks.com/customers/bagelcode)

“Thanks to Databricks Lakehouse, we can support real-time business decision-making based on data analysis results that are automatically updated on an hourly and daily basis, even as data volumes have increased by nearly 1,000 times.”

**JOOHYUN KIM**
Vice President, Data and AI, Bagelcode

-----

A brand’s goal with segmentation should be to define a shared perspective on customers, allowing the organization to engage users consistently and cohesively. But any adjustments to customer engagement require careful consideration of [organizational change concerns](https://www.researchgate.net/publication/45348436_Bridging_the_segmentation_theorypractice_divide).

**CASE STUDY**

**Responding to global demand shifts with ease**

Reckitt produces some of the world’s most recognizable and trusted consumer brands in hygiene, health and nutrition. With Databricks Lakehouse on Azure, they’re able to meet the needs of billions of consumers worldwide by surfacing real-time, highly accurate, deep customer insights, leading to a better understanding of trends and demand, allowing them to provide best-in-class experiences in every market.
[Get the full story](https://www.databricks.com/customers/reckitt)

**A segmentation walk-through: Grocery chain promotions**

A promotions management team for a large grocery chain is responsible for running a number of promotional campaigns, each of which is intended to drive greater overall sales. Today, these marketing campaigns include leaflets and coupons mailed to individual households, manufacturer coupon matching, in-store discounts and the stocking of various private-label alternatives to popular national brands.

Recognizing uneven response rates between households, the team is eager to determine if customers might be segmented based on their responsiveness to these promotions. They anticipate that such segmentation may allow the promotions management team to better target individual households, driving overall higher response rates for each promotional dollar spent.

Using historical data from point-of-sale systems along with campaign information from their promotions management systems, the team derives a number of features that capture the behavior of various households with regard to promotions. The team organizes the data for analysis by applying standard data preparation techniques. Then, using a variety of clustering algorithms, such as k-means and hierarchical clustering, the team settles on two potentially useful cluster designs.

-----

[Figure: Overlapping segment designs separating households based on their responsiveness to various promotional offerings.]

[Figure: Profiling of clusters to identify differences in behavior across clusters.]

**Assessing results**

Comparing households by demographic factors not used in developing the clusters themselves, some interesting patterns separating cluster members by age and other factors are identified.
While this information may be useful not only in predicting cluster membership but also in designing more effective campaigns targeted to specific groups of households, the team recognizes the need to collect additional demographic data before putting too much emphasis on these results.

With profiling, marketers can discern that customer households in the highlighted example fall into two groups: those who are responsive to coupons and mailed leaflets, and those who are not. Further divisions show differing degrees of responsiveness to other promotional offers.

-----

[Figure: Age-based differences in cluster composition of behavior-based customer segments.]

**Need help segmenting your customers for more targeted marketing?**

Use the **Customer Segmentation Accelerator** and drive better purchasing predictions based on behaviors. Through sales data, campaigns and promotions systems, you can build useful customer clusters to effectively target various households with different promos and offers.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/customer-segmentation)**

The results of the analysis now drive a dialog between the data scientists and the promotions management team. Based on initial findings, a revised analysis will be performed focused on what appear to be the most critical features differentiating households as a means to simplify the cluster design and evaluate overall cluster stability. Subsequent analyses will also examine the revenue generated by various households to understand how changes in promotional engagement may impact customer spending.

Using this information, the team believes they will have the ability to make a case for change to upper management. Should a change in promotions targeting be approved, the team plans to monitor household spending, promotions spend and campaign responsiveness rates using much of the same data used in this analysis.
This will allow the team to assess the impact of these efforts and identify when the segmentation design needs to be revisited.

-----

### Assessing Consumer Interest Data to Inform Engagement Strategies

Fine-tuning ML recommendations to boost conversions

Personalization is a [journey](https://www.bcg.com/publications/2021/the-fast-track-to-digital-marketing-maturity). To operationalize personalized experiences, it’s important to identify high-value audiences who have the highest likelihood of specific actions. Here’s where **propensity scoring** comes in.

Specifically, this process allows companies to estimate customers’ potential receptiveness to an offer or to content related to a subset of products, and determine which messaging to apply. Calculating propensity scores requires assessment of past interactions and data points (e.g., frequency of purchases, percentage of spend associated with a particular product category, days since last purchase and other historical data).

Databricks provides critical capabilities for propensity scoring (like the Feature Store, AutoML and MLflow) to help businesses answer three key considerations and develop a robust process:

**1.** How to maintain the significant number of features used to train propensity models

**2.** How to rapidly train models aligned with new campaigns

**3.** How to rapidly re-deploy models, retrained as customer patterns drift, into the scoring pipeline

**Boosting model training efficiency**

With the [Databricks Feature Store](https://docs.databricks.com/applications/machine-learning/feature-store/index.html), data scientists can easily reuse features created by others. The feature store is a centralized repository that enables the persistence, discovery and sharing of features across various model training exercises. As features are created, lineage and other metadata are captured.
Standard security models ensure that only permitted users and processes may employ these features, enforcing the organization’s data access policies on data science processes. **Extracting the complexities of ML** [Databricks AutoML](https://docs.databricks.com/applications/machine-learning/automl.html) allows you to quickly generate models by leveraging industry best practices. As a glass box solution, AutoML first generates a collection of notebooks representing various aligned model variations. In addition to iteratively training models, AutoML allows you to access the notebooks associated with each model, creating an editable starting point for further exploration. **Streamlining the overall ML lifecycle** [MLflow](https://docs.databricks.com/applications/mlflow/index.html) is an open source machine learning model repository, managed within the Databricks Lakehouse. This repository enables tracking and analysis of the various model iterations generated by both AutoML and custom training cycles alike. When used in combination with the Databricks Feature Store, models persisted with MLflow can retain knowledge of the features used during training. As models are retrieved, this same information allows the model to retrieve relevant features from the Feature Store, greatly simplifying the scoring workflow and enabling rapid deployment. ----- **How to build a propensity scoring workflow with Databricks** Using these features in combination, many organizations implement propensity scoring as part of a three-part workflow: **1.** Data engineers work with data scientists to define features relevant to the propensity scoring exercise and persist these to the Feature Store. Daily or even real-time feature engineering processes are then defined to calculate up-to-date feature values as new data inputs arrive. 
**2.** As part of the inference workflow, customer identifiers are presented to previously trained models in order to generate propensity scores based on the latest features available. Feature Store information captured with the model allows data engineers to retrieve these features and easily generate the desired scores, which can then be used for analysis within Databricks Lakehouse or published to downstream marketing systems.

**3.** In the model-training workflow, data scientists periodically retrain the propensity score models to capture shifts in customer behaviors. As these models are persisted to MLflow, change management processes are used to evaluate and elevate those models that meet organizational criteria to production status. In the next iteration of the inference workflow, the latest production version of each model is retrieved to generate customer scores.

[Figure: A three-part propensity scoring workflow. Recoverable labels: Feature Engineering ETL (Profiles, Sales, Promotions, Customer), Feature Store, Model Training and Deployment, Score Generation and Publication ETL, Downstream Applications.]

**Need help assessing interest from your target audience?** Use the **Propensity Scoring Accelerator** to estimate customers’ potential receptiveness to an offer or to content related to a subset of products. Using these scores, marketers can determine which of the many messages at their disposal should be presented to a specific customer.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/propensity-scoring)**

-----

### Delivering Personalized Customer Journeys

Strategies for crafting a real-time recommendation engine

As the economy continues to weather unpredictable disruptions, shortages and demand, delivering personalized customer experiences at speed and scale will require adaptability on the ground and within a company’s operational tech stack.
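Returning briefly to the propensity scoring workflow described earlier: its first step is to define features such as frequency of purchases, share of spend in a product category and days since last purchase. The sketch below shows one way such features might be computed from raw transactions in plain Python. The function name, the tuple layout and the sample data are illustrative assumptions, not part of the Databricks Feature Store API or the Propensity Scoring Accelerator.

```python
from collections import defaultdict
from datetime import date


def propensity_features(transactions, category, as_of):
    """Compute simple per-customer features for one product category.

    transactions: iterable of (customer_id, purchase_date, category, amount).
    This layout is a hypothetical example, not a Databricks schema.
    """
    purchase_count = defaultdict(int)
    total_spend = defaultdict(float)
    category_spend = defaultdict(float)
    last_purchase = {}

    for customer_id, day, cat, amount in transactions:
        purchase_count[customer_id] += 1
        total_spend[customer_id] += amount
        if cat == category:
            category_spend[customer_id] += amount
        if customer_id not in last_purchase or day > last_purchase[customer_id]:
            last_purchase[customer_id] = day

    return {
        customer_id: {
            # Frequency of purchases
            "purchase_count": purchase_count[customer_id],
            # Percentage of spend associated with the chosen category
            "category_spend_share": (
                category_spend[customer_id] / total_spend[customer_id]
                if total_spend[customer_id]
                else 0.0
            ),
            # Days since last purchase, relative to the scoring date
            "days_since_last_purchase": (as_of - last_purchase[customer_id]).days,
        }
        for customer_id in purchase_count
    }


# Made-up sample data for illustration only
transactions = [
    ("c1", date(2024, 1, 5), "coffee", 30.0),
    ("c1", date(2024, 1, 20), "tea", 10.0),
    ("c2", date(2024, 1, 10), "tea", 25.0),
]
features = propensity_features(transactions, "coffee", as_of=date(2024, 1, 31))
```

In a real pipeline these values would be recalculated as new data arrives and persisted to a feature store, so that training and inference read the same definitions.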
With the Databricks Lakehouse, Al-Futtaim has transformed their data strategy and operations, allowing them to create a “golden customer record” that improves all decision-making from forecasting demand to powering their global loyalty program. [Get the full story](https://www.databricks.com/customers/al-futtaim) **C A S E S T U DY** “Databricks Lakehouse allows every division in our organization — from automotive to retail — to gain a unified view of our customer across businesses. With these insights, we can optimize everything from forecasting and supply chain, to powering our loyalty program through personalized marketing campaigns, cross-sell strategies and offers.” **D M I T R I Y D O V G A N** Head of Data Science, Al-Futtaim Group As COVID-19 forced a [shift](https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/a-global-view-of-how-consumer-behavior-is-changing-amid-covid-19) in consumer focus toward value, availability, quality, safety and community, brands most attuned to changing needs and sentiments saw customers [switch](https://martechseries.com/sales-marketing/customer-experience-management/braze-survey-one-in-four-consumers-tried-new-brand-during-covid-19/) from [rivals](https://www.retailtouchpoints.com/resources/personalization-gains-new-relevance-as-covid-19-challenges-brand-loyalties) to their brand. While some segments gained business and many lost, organizations that had already begun the journey toward improved customer experience saw better outcomes, closely mirroring patterns [observed](https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/Marketing%20and%20Sales/Our%20Insights/Adapting%20customer%20experience%20in%20the%20time%20of%20coronavirus/Adapting-customer-experience-in-the-time-of-coronavirus.ashx) in the 2007–2008 recession. 
**Creating a unified view across 200+ brands** As a driving force for economic growth in the Middle East, Al-Futtaim impacts the lives of millions of people across the region through the distribution and operations of global brands like Toyota, IKEA, Ace Hardware and Marks & Spencer. Al-Futtaim’s focus is to harness their data to improve all areas of the business, from streamlining the supply chain to optimizing marketing strategies. But with the brands capturing such a wide variety of data, Al-Futtaim’s legacy systems struggled to provide a single view into the customer due to data silos and the inability to scale efficiently to meet analytical needs. ----- The personalization of customer experiences will remain a key focus for B2C and [B2B organizations](https://hbr.org/2017/07/how-b2b-sellers-are-offering-personalization-at-scale) . Increasingly, market analysts are recognizing customer experience as a [disruptive force](https://sloanreview.mit.edu/article/the-experience-disrupters/) enabling upstart organizations to upend long-established players. **Focus on the customer journey** Personalization starts with a careful exploration of the [customer journey](https://hbr.org/2015/11/competing-on-customer-journeys) . The [digitization of each stage](https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/the-drumbeat-of-digital-how-winning-teams-play) provides the customer with flexibility in terms of how they will engage and provides the organization with the ability to [assess](https://www.bcg.com/en-us/publications/2020/three-personalization-imperatives-during-covid-crisis) [the health of their model](https://www.bcg.com/en-us/publications/2020/three-personalization-imperatives-during-covid-crisis) . **C A S E S T U DY** **Personalizing the beauty product shopping experience** Flaconi wanted to leverage data and AI to become the No. 1 online beauty product destination in Europe. 
However, they struggled with massive volumes of streaming data and with infrastructure complexity that was resource-intensive and costly to scale. See how they used Databricks to accelerate time-to-market by 200x, reduce staff costs by 40% and increase net order income. Get the full story

Figure: CX leaders outperform laggards, even in a down market, in this visualization of the Forrester Customer Experience Performance Index [as provided](https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/Marketing%20and%20Sales/Our%20Insights/Adapting%20customer%20experience%20in%20the%20time%20of%20coronavirus/Adapting-customer-experience-in-the-time-of-coronavirus.ashx) by McKinsey & Company.

¹ Comparison of total returns to shareholders for publicly traded companies ranking in the top 10 or bottom 10 of Forrester’s Customer Experience Performance Index in 2007-09. Source: Forrester Customer Experience Performance Index (2007-09); press search

-----

Careful consideration of how customers interact with various assets, and how these interactions may be interpreted as expressions of preference, can unlock a wide range of data that enables personalization.

The engines we use to serve content based on customer preferences are known as recommenders. With some recommenders, a heavy focus on the shared preferences of similar customers helps define what recommendations will actually make an impact. With others, it can be more useful to focus on the properties of the content itself (e.g., product descriptions).

The complexity of these engines requires that they be deployed thoughtfully, using limited pilots and customer response assessments. And in those assessments, it’s important to keep in mind that there is no expectation of perfection, only incremental improvement over the prior solution.

**Need help generating personalized recommendations?** Use the **Recommendation Engines Accelerator** to estimate customers’ potential receptiveness to an offer or to content related to a subset of products. Using these scores, marketers can determine which of the many messages at their disposal should be presented to a specific customer.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/propensity-scoring)**

**CASE STUDY**

**Connecting shoppers to savings with data-driven personalization**

Flipp is an online marketplace that aggregates weekly shopping circulars, so consumers get deals and discounts without clipping coupons. Siloed customer data sources once made getting insights difficult. Now with Databricks, Flipp’s data teams can access and democratize data, helping them do their jobs more effectively while bringing better deals to users, more meaningful insights to partners, and a 10% jump in foot traffic to brick-and-mortar retailers. Get the full story

-----

### Building a Direct Path to Winning the Minds and Wallets of Your Customers

Providing deep, effective personalized experiences to customers depends on a brand’s ability to intelligently leverage consumer and market data from a wide variety of sources to fuel faster, smarter decisions, without sacrificing accuracy for speed. The Databricks Lakehouse Platform is purpose-built for exactly that, offering a scalable data architecture that unifies all your data, analytics and AI to deliver unforgettable customer experiences.

Created on open source and open standards, Databricks offers a robust and cost-effective platform for brands to collaborate with partners, clients, manufacturers and distributors to unleash more innovation and efficiencies at every touch point. Businesses can rapidly ingest available data in real time, at scale, and create accessible, data-driven insights that enable actionable strategies across the value chain.

Databricks is a multicloud platform, designed for quick enterprise development.
Teams using the Lakehouse can more effectively reveal the 360-degree view into their company’s operational health and the evolving needs of their customers — all while empowering teams to easily unify data efforts, perform fine-grained analyses and streamline cross-functional data operations using a single, sophisticated solution. ###### Learn more about Databricks Lakehouse for industries  like Retail & Consumer Goods, Media & Entertainment  and more at databricks.com/solutions ----- ### About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) , [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc/) . 
**[START YOUR FREE TRIAL](https://www.databricks.com/try-databricks?utm_medium=paid+search&utm_source=google&utm_campaign=14272820537&utm_adgroup=126939742998&utm_content=trial&utm_offer=try-databricks&utm_ad=563736421186&utm_term=databricks%20free%20trial&gclid=Cj0KCQjwpeaYBhDXARIsAEzItbHzQGCu2K58-lnVCepMI5MYP6jTXkgfvqmzwAMqrlVwVOniebOE43UaAk3OEALw_wcB)**

##### Contact us for a personalized demo

databricks.com/contact

-----

SUCCESS | /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Databricks-Customer-360-ebook-Final.pdf | 2024-09-19T16:57:19Z

#### eBook

# Big Book of Retail & Consumer Goods Use Cases

##### Driving real-time decisions with the Lakehouse

-----

### Contents

- **CHAPTER 1:** Introduction
- **CHAPTER 2:** Modern Data Platform for Real-Time Retail
  - Common challenges
  - The Lakehouse for Retail
- **CHAPTER 3:** Use Case: Real-Time Supply Chain Data
  - Case Study: Gousto
  - Case Study: ButcherBox
- **CHAPTER 4:** Use Case: Truck Monitoring
  - Case Study: Embark
- **CHAPTER 5:** Use Case: Inventory Allocation
  - Case Study: H&M
  - Case Study: Edmunds
- **CHAPTER 6:** Use Case: Point of Sale and Clickstream
- **CHAPTER 7:** Use Case: On-Shelf Availability
  - Case Study: Reckitt
- **CHAPTER 8:** Use Case: Customer and Vehicle Identification
- **CHAPTER 9:** Use Case: Recommendation Engines
  - Case Study: Wehkamp
  - Case Study: Columbia
  - Case Study: Pandora
- **CHAPTER 10:** Use Case: Perpetual Inventory
- **CHAPTER 11:** Use Case: Automated Replenishments
- **CHAPTER 12:** Use Case: Fresh Food Forecasting
  - Case Study: ButcherBox
  - Case Study: Sam’s Club
- **CHAPTER 13:** Use Case: Propensity-to-Buy
- **CHAPTER 14:** Use Case: Next Best Action
- **CHAPTER 15:** Customers That Innovate With Databricks Lakehouse for Retail
- **CHAPTER 16:** Conclusion

-----

**CHAPTER 1:**

### Introduction

Retailers are increasingly being challenged to make time-sensitive decisions in their operations. Consolidating e-commerce orders. Optimizing distribution to ensure item availability. Routing delivery vehicles. These decisions happen thousands of times daily and have a significant financial impact. Retailers need real-time data to support these decisions, but legacy systems are limited to data that’s hours or days old.

**When seconds matter, only the Lakehouse delivers better decisions**

Retail is a 24/7 business where customers expect accurate information and immediate relevant feedback. The integration of physical and e-commerce customer experiences into an omnichannel journey has been happening for the past 20 years, but the pandemic provided a jolt to consumer trends that dramatically shifted purchasing patterns.

In reaction to these industry changes, retailers have responded with significant, rapid investments, including stronger personalization, order fulfillment, and delivery and loyalty systems. While these new targeted capabilities have addressed the immediate need (and created expectations of making decisions in real time), most retailers still rely on legacy data systems, which impedes their ability to scale these innovations. Unfortunately, most legacy systems are only able to process information in hours or days.

The delays caused by waiting for data are leading to significant risks and costs for the industry.

**Grocers** need to consolidate order picking to achieve profitability in e-commerce, but this requires up-to-the-minute order data. Not having this information causes them to spend more resources on having people pick orders separately, at a higher operating cost.

**Apparel retailers** must be able to present the correct available inventory on their website. This requires that in-store sales be immediately reflected in their online systems.
Inaccurate information can lead to lost sales, or worse, the customer becoming unsatisfied and moving to different retailers. ----- **Convenience fuel retailers** must collaborate with distribution centers, direct-to-store delivery distributors and other partners. Having delayed data can lead to out-of-stocks, costing stores thousands of dollars per week. The margin of error in retail has always been razor thin, but with a pandemic and inflationary pressures, it’s at zero. Reducing the error rate requires better predictions and real-time data. **Use Case Guide** In this use case guide, we show how the Databricks Lakehouse for Retail is helping leading organizations take **all of their data in a single lakehouse architecture, streamline their data engineering and management,** **make it ready for SQL and ML/AI** , and **do so very fast within their own cloud infrastructure environment** **based on open source and open standards** . These capabilities are all delivered at world-record-setting performance, while achieving a market-leading total cost of ownership. Databricks Lakehouse for Retail has become the industry standard for enabling retailers to drive decisions in real time. This use case guide also highlights common use cases across the industry, and offers additional resources in the form of Solution Accelerators and reference architectures to help as you embark on your own journey to drive better customer experiences with data and AI. ----- **CHAPTER 2:** ### Modern Data Platform  for Real-Time Retail Retailers continue to adapt to rapidly shifting dynamics across the omnichannel. In navigating these changes, retailers are increasingly focused on improving the real-time availability of data and insights, and performing advanced analytics delivered within tight business service windows. 
**Common challenges** In response to the surge in e-commerce and volatility in their supply chains, retailers are investing millions in modernizing distribution centers, partnering with delivery companies, and investing in customer engagement systems. Warehouse automation is expected to become a $41B market according to Bloomberg. Increasingly, distribution centers are being automated with robotics to power dynamic routing and delivery. Shoppers that became accustomed to having fast, same-day, and sometimes even overnight delivery options during the pandemic now expect them as the norm. Retailers understand that the shipping and delivery experience is now one of many touchpoints that merchants can use to develop customer brand loyalty. ## $41B Market | Retail Warehouse Automation Yet while retailers modernize different areas of their operations, they’re constrained by a single point of weakness, as they are reliant on legacy data platforms to bring together all of this data. Powering real-time decisions in modern retail requires real-time ingestion of data, transformation, governance of information, and powering business intelligence and predictive analytics all within the time required by retail operations. ----- **Ingesting large volumes of transactional data in real time.** The biggest blocker to crucial insights is the ability to ingest data from transaction systems in real time. Transaction logs from point-of-sale systems, clickstreams, mobile applications, advertising and promotions, as well as inventory, logistics and other systems, are constantly streaming data. Big data sets need to be ingested, cleansed and aggregated and integrated with each other before they can be used. The problem? Retailers have used legacy data warehouses that are built around batch processing. And worse, increasing the frequency of how often data is processed leads to a “hockey stick” in costs. 
As a result of these limitations, merchants resort to ingesting data nightly to deal with the large volumes of data and integration with other data sets. The result? Accurate data to drive decisions can be delayed by days. **Performing fine-grained analysis at scale within tight time windows.** Retailers have accepted a trade-off when performing analysis. Predictions can be detailed and accurate, or they can be fast. Running forecasts or price models at a day, store and SKU level can improve accuracy by 10% or more, but doing so requires tens of millions of model calculations that need to be performed in narrow service windows. This is well beyond the capability of legacy data platforms. As a result, companies have been forced to accept the trade-off and live with less accurate predictions. **Powering real-time decisions on the front line.** Data is only useful if it drives decisions, but serving real-time data to thousands of employees is a daunting task. While data warehouses are capable of serving reports to large groups of users, they’re still limited to stale data. Most retailers limit the frequency of reports to daily or weekly updates and depend on the staff to use their best judgment for decisions that are more frequent. **Delivering a hyper-personalized omnichannel experience.** The storefront of the 21st century is focused on delivering personalized experiences throughout the omnichannel. Retailers have access to a trove of customer data, and yet off-the-shelf tools for personalization and customer segmentation struggle to deal with high volumes, and the analytics have high rates of inaccuracy. Retailers need to deliver personalized experiences at scale to win in retail. ----- ###### The Lakehouse for Retail Databricks Lakehouse for Retail solves these core challenges. 
The Lakehouse unlocks the ability to unify all types of data, from images to structured data, in real time, provide enterprise-class management and governance, and then immediately turn that data into actionable insights with real-time reporting and predictive analytics. It does this with record-setting speed and industry-leading total cost of ownership (TCO) in a platform-as-a-service (PaaS) that allows customers to solve these pressing problems.

[Figure: The Lakehouse for Retail. Panel titles: All of your sources; Any structure or frequency; Reliable, real-time processing; Capabilities for any persona; Data sharing & collaboration. Sources shown: competitive activity, e-commerce, mobile applications, video & images, point of sale, distribution & logistics, customer & loyalty, delivery & partners (structured, semi-structured and unstructured; batch and real-time). Center: Data Lakehouse with Data Management and Governance ("Process, manage and query all of your data") on any cloud. Capabilities: ad hoc data science, production machine learning, BI reporting & dashboarding, real-time applications. Audiences: internal teams, customers, partners.]

-----

**Reference Architecture**

At the core of the Databricks Lakehouse for Retail is technology that enables retailers to avoid the trade-offs between speed and accuracy. Technology such as Delta Lake enables the Lakehouse, a new paradigm that combines the best elements of data warehouses and data lakes, to directly address these factors by enabling you to unify all of your data (structured and unstructured, batch and real-time) in one centrally managed and governed location.

Once in the Lakehouse, e-commerce systems, reporting users, analysts, data scientists and data engineers can all leverage this information to serve models for applications and power real-time reporting, advanced analytics, large-scale forecasting models and more.
[Figure: Reference architecture spanning edge, hybrid and cloud. Recoverable labels: replication; automatic DBs; raw data (Bronze table), clean data (Silver table), refined data (Gold table); machine learning operations (tracking, registry); REST model serving; application; real-time; batch; business applications; Power BI.]

-----

###### How it works

The Lakehouse for Retail was built from the ground up to solve the needs of modern retail. It blends simplicity, flexibility and lower cost of ownership with best-in-industry performance. The result is differentiated capabilities that help retailers win.

| Capability | Legacy data warehouse | Legacy data lakes (Hadoop) | Lakehouse |
|---|---|---|---|
| Robust data management | **Limited.** EDWs support the management of structured data. | **No.** Data lakes lack enterprise-class data management tools. | **Yes.** Delta and Unity Catalog offer native data management and governance of all data types. |
| Real-time reporting | **No.** EDWs offer quick access to reports on old data. | **No.** Data lakes were not designed for reporting, let alone real-time reporting. | **Yes.** Data views can be materialized, enabling front-line employees with real-time data. |
| Time-sensitive machine learning | **No.** EDWs must extract data and send it to a third party for machine learning. | **No.** Data lakes are able to support large analytics, but lack the ability to meet business SLAs. | **Yes.** The Lakehouse can scale to process the most demanding predictions within business SLAs. |
| Data in real time | **No.** Data warehouses are batch oriented, restricting data updates to hours or days. | **No.** Data lakes are batch oriented. | **Yes.** Support for real-time streaming data. |
| Use all of your data | **No.** Data warehouses have very limited support for unstructured data. | **Yes.** Data lakes offer support for all types of data. | **Yes.** Supports all types of data in a centrally managed platform. |

-----

**Data in real time.** Retail operates in real time and so should your data.
The Lakehouse offers support for streaming data from clickstream, mobile applications, IoT sensors and even real-time e-commerce and point-of-sale data. And Delta Lake enables this world-record-leading performance while maintaining support for ACID transactions. **Use all of your data.** Retailers are increasingly capturing data from mobile devices, video, images and a growing variety of other data sources. This data is extremely powerful in helping to improve our understanding of consumer behavior and operations. The Lakehouse for Retail enables companies to take full advantage of all types of data in a cost-efficient way, in a single unified lakehouse architecture. **Robust data management and governance** that companies need to protect sensitive data, but was lacking from earlier big data systems. The Lakehouse offers transactional integrity with ACID compliance, detailed data security, schema enforcement, time travel, data lineage and more. Moving to a modern data architecture does not require sacrificing enterprise maturity. **High-performance predictive analytics.** Machine learning models, such as demand forecasting or recommendation engines, can be run in hours without compromising accuracy. The Lakehouse can scale to support tens of millions of predictions in tight windows, unlocking critical and time- sensitive analytics such as allocating inventory, optimizing load tenders and logistics, calculating item availability and out-of-stocks, and delivering highly personalized predictions. **Value with Databricks** By using Databricks to build and support your lakehouse, you can empower your business with even more speed, agility and cost savings. The flexibility of the Databricks Lakehouse Platform means that you can start with the use case that will have the most impact on your business. As you implement the pattern, you will find that you’re able to tackle use cases quicker and more easily than before. 
To get you started, this guidebook contains the use cases we most commonly see across the Retail and Consumer Goods industry.

-----

**CHAPTER 3**

### Use Case: Real-Time Supply Chain Data

**Overview**

As companies see a surge in demand from e-commerce and delivery services, and seek increasing efficiencies with plant or distribution centers, real-time data is becoming a key part of the technical roadmap. Real-time supply chain data allows customers to deal with problems as they happen and before items are sent downstream or shipped to consumers, which is the first step in enabling a supply chain control tower.

**RELEVANT FOR** Retail, Consumer Goods, Manufacturers, Distributors, Logistics, Restaurants

**Challenges**

- **Batch data**: existing data warehouses bring data in batch, creating a lag between when something is happening and when a customer can act on it
- **Complex analysis in real time**: if ingesting data in real time wasn’t a big enough challenge, companies have the added pressure to take immediate action on it
- **Complex maintenance**: ETL tools to bring data in batch are often complex and costly to maintain

-----

**Value with the Databricks Lakehouse**

Databricks has enabled real-time streaming of supply chain data across a variety of customers for specific plant operations or as part of a supply chain control tower.

- **Near real-time ingestion and visibility of data**: one customer experienced a 48,000% improvement in speed to data, with greater reliability
- **Cost-neutral**: because Delta’s efficient engine requires smaller instances, many customers report that they were able to move from batch to real-time at neutral costs
- **Simplified architecture and maintenance**: leveraging Delta for ingestion streamlines the pattern for real-time data ingestion. Customers frequently report that the amount of code required to support streaming ingestion is 50% less than previous solutions.
- **Immediate enablement of additional use cases**: customers can now prevent problems as they’re happening, predict and prevent issues, and even gain days on major changes such as production schedules between shifts

**Solution overview**

Databricks allows for both streaming and batch data sets to be ingested and made available to enable real-time supply chain use cases. Delta Lake simplifies the change data capture process while providing ACID transactions and scalable metadata handling, and unifying streaming and batch data processing. And Delta Lake supports versioning and enables rollbacks, full historical audit trails, and reproducible machine learning experiments.

**Typical use case data sources include:** Supply planning, procurement, manufacturing execution, warehousing, order fulfillment, shop floor/historian data, IoT sensor, transportation management

-----

**CASE STUDY**

With Databricks, Gousto was able to implement real-time visibility in their supply chain. Gousto moved from daily batch updates to near real-time streaming data, utilizing Auto Loader and Delta Lake. The platform provided by Databricks has allowed Gousto to respond to increased demand during the coronavirus outbreak by providing real-time insight into performance on the factory picking lines.

**CASE STUDY**

As a young e-commerce company, ButcherBox needed to act nimbly to make the most of the data from its hundreds of thousands of subscribers. With Databricks Lakehouse, the company could pull 18 billion rows of data in under three minutes. Now, ButcherBox has a near real-time understanding of its customers, and can also act proactively to address any logistical and delivery issues.

HOW TO GET STARTED

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.
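To make the change data capture, versioning and rollback ideas from this chapter's solution overview more concrete, here is a toy in-memory sketch in plain Python. It is not Delta Lake's API and makes no claim about how Delta Lake is implemented. The class `VersionedTable`, its methods and the sample SKUs are hypothetical names chosen only for illustration.

```python
import copy


class VersionedTable:
    """Toy keyed table in which every committed batch creates a new version."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._versions) - 1

    def snapshot(self, version=None):
        """Return a copy of the table at a given version (latest by default)."""
        v = self.version if version is None else version
        return copy.deepcopy(self._versions[v])

    def apply_changes(self, changes):
        """Apply a batch of (op, key, row) change events as one new version.

        The batch is applied to a private copy, so an invalid event
        leaves all committed versions untouched (all-or-nothing).
        """
        table = self.snapshot()
        for op, key, row in changes:
            if op == "upsert":
                table[key] = row
            elif op == "delete":
                table.pop(key, None)
            else:
                raise ValueError(f"unknown operation: {op}")
        self._versions.append(table)
        return self.version

    def rollback(self, version):
        """Restore an earlier state by committing it as the newest version,
        which keeps the full history available as an audit trail."""
        self._versions.append(self.snapshot(version))
        return self.version


inventory = VersionedTable()
v1 = inventory.apply_changes([
    ("upsert", "sku-1", {"on_hand": 40}),
    ("upsert", "sku-2", {"on_hand": 5}),
])
v2 = inventory.apply_changes([
    ("upsert", "sku-1", {"on_hand": 35}),
    ("delete", "sku-2", None),
])
v3 = inventory.rollback(v1)  # history still holds versions 0 through 3
```

Keeping every version, rather than overwriting in place, is what makes both the rollback and the historical reads (`inventory.snapshot(2)`) possible in this sketch.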
-----

**CHAPTER 4**

### Use Case: Truck Monitoring

With many industries still feeling the effects of supply chain issues, being able to increase the efficiency of trucks on the road can make all the difference in getting goods into the hands of customers in a timely manner. Real-time data is making it easier for companies to get immediate insights into truck manufacturing delays, maintenance issues, supply chain issues, delivery schedules and driver safety.

**RELEVANT FOR** Retail, Consumer Goods, Distributors, Logistics

**Challenges**

- Siloed data makes it difficult to get a comprehensive understanding of fleet performance
- A lack of real-time insights can delay responses to manufacturing or supply chain issues
- Not having effective automation and AI increases the risk of human error, which can result in vehicular accidents or shipment delays

-----

**Value with the Databricks Lakehouse**

Databricks empowers companies to get real-time insights into their fleet performance, from manufacturing to delivery.

- **Near real-time insights**: the greater speed to data means a quicker response to issues and the ability to monitor driver safety more immediately
- **Ability to scale**: although consumer demands are constantly evolving, Databricks can handle fleet expansion without sacrificing data quality and speed
- **Optimizing with AI/ML**: implementing AI and ML models can lead to more effective route monitoring, proactive maintenance and reduced risk of accidents

**Solution overview**

Databricks enables better truck monitoring, quickly ingesting data on everything from vehicle manufacturing to route optimization. This results in a more complete and real-time view of a company’s fleet, and these analytics provide companies with the tools they need to scale and improve their operations.
**Typical use case data sources include:** Supply planning, transportation management, manufacturing, predictive maintenance

**CASE STUDY**

With 94% of vehicular accidents attributed to human error, Embark used the Databricks Lakehouse Platform to unlock thousands of hours of recorded data from its trucks and then collaboratively analyze that data via dashboards. This has resulted in more efficient ML model training as Embark speeds toward fully autonomous trucks.

HOW TO GET STARTED

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.

-----

**CHAPTER 5**

### Use Case: Inventory Allocation

**Overview**

Replenishment planning is the process of determining what needs to go where. It is used by replenishment planning, distributors and consumer goods companies performing vendor-managed replenishment (VMR) or vendor-managed inventory (VMI) to make daily decisions on which product needs to be sent to which store and on what day.

Replenishment is challenging for companies because it deals with rapidly changing data and the need to make complex decisions on that data in narrow service windows. Retailers need to stream in real-time sales data to signal how much of a product has been sold in order. Inaccurate sales data leads to an insufficient number of products being sent to stores. This results in lost sales and low customer satisfaction.

Inventory allocation is a process that might be performed multiple times a day during peak seasons, or daily during slower seasons. Companies need the ability to scale to perform tens of millions of predictions multiple times a day, on demand and dynamically, during peak season without paying a premium for this capability throughout the year.

**RELEVANT FOR** Retail, Consumer Goods, Distributors, Logistics, Restaurants

-----

**Challenges**

Customers must complete tens of millions of inventory allocation predictions within tight time windows.
This information is used to determine which products get put on trucks and go to specific stores. Traditional inventory allocation rules cause trade-offs in accuracy in order to calculate all possibilities in the service windows Legacy tools have rudimentary capabilities and have limited ability to consider flavors, sizes and other attributes that may be more or less popular by store **Value with Databricks** Customers are able to complete inventory allocation models within SLAs with no trade-off for accuracy.  **Speed —** on average, customers moving to Databricks for demand forecasting report a double-digit improvement in forecast accuracy  **Ability to scale** and perform fine-grained (day, store, item) level allocations  **Provide more robust allocations** by incorporating causal factors that may increase demand, or include information on flavors or apparel sizes for specific stores **Solution overview** The objective of inventory allocation is to quickly determine when to distribute items and where — from warehouses and distribution centers to stores. Inventory allocation begins by looking at the consumption rate of products, the available inventory and the shipping schedules, and then using this information to create an optimized manifest of what items should be carried on which trucks, at what point, and at what time. This becomes the plan for route accounting systems that arrange deliveries. Inventory allocation also deals with trade-offs related to scarcity of items. If an item has not been available in a store for a long time, that store may receive heightened priority for the item in the allocation. 
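
The allocation logic described above (consumption rate, available inventory, and heightened priority for stores that have gone without an item) can be illustrated with a small sketch. This is only a toy greedy heuristic under stated assumptions, not the method Databricks or any customer uses: the `StoreDemand` fields, the `cover_days` target and the rule "longest out of stock is served first" are all hypothetical choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class StoreDemand:
    """Hypothetical per-store inputs for a single item."""
    store: str
    daily_rate: float        # consumption rate, units sold per day
    on_hand: int             # units currently in the store
    days_out_of_stock: int   # consecutive days the item has been unavailable


def allocate(available: int, stores: list[StoreDemand], cover_days: int = 3) -> dict[str, int]:
    """Greedy allocation: stores that have been out of stock longest are served first."""
    # Rank by scarcity first, then by store name so ties are deterministic.
    ranked = sorted(stores, key=lambda s: (-s.days_out_of_stock, s.store))
    plan: dict[str, int] = {}
    for s in ranked:
        # Units needed to cover `cover_days` of demand, net of stock on hand.
        need = max(0, round(s.daily_rate * cover_days) - s.on_hand)
        shipped = min(need, available)
        plan[s.store] = shipped
        available -= shipped
    return plan


stores = [
    StoreDemand("A", daily_rate=10, on_hand=5, days_out_of_stock=0),
    StoreDemand("B", daily_rate=4, on_hand=0, days_out_of_stock=6),
    StoreDemand("C", daily_rate=8, on_hand=0, days_out_of_stock=2),
]
print(allocate(40, stores))  # {'B': 12, 'C': 24, 'A': 4}
```

With only 40 units to distribute, store B (six days without the item) is filled first and store A absorbs the shortfall. A production system would replace the greedy rule with an optimizer and run it per item, per store, per day.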
-----

**Typical use case data sources include:** point of sale, digital sales, replenishment data, modeled safety stock, promotions data, weather

**HOW TO GET STARTED**

**View our webinar covering demand forecasting with Starbucks and then read our blog about demand forecasting.**

**[Demand forecasting with causal factors.](https://www.databricks.com/blog/2020/03/26/new-methods-for-improving-supply-chain-demand-forecasting.html)** Our most popular notebook at Databricks. This blog walks you through the business and technical challenges of performing demand forecasting and explains how we approached solving it.

**[On-demand webinar for demand forecasting.](https://www.databricks.com/blog/2020/02/21/on-demand-webinar-granular-demand-forecasting-at-scale.html)** Video and Q&A from our webinar with Starbucks.

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.

**CASE STUDY**

H&M turned to the Databricks Lakehouse Platform to simplify its infrastructure management, enable performant data pipelines at scale, and simplify the machine learning lifecycle. The result was a more data-driven organization that could better forecast operations to streamline costs and boost revenue.

**CASE STUDY**

Edmunds is on a mission to make car shopping an easy experience for all. With the Databricks Lakehouse Platform, they are able to simplify access to their disparate data sources and build ML models that make predictions off data streams. With real-time insights, they can ensure that the inventory of vehicle listings on their website is accurate and up to date, improving overall customer satisfaction.
-----

**CHAPTER 6**

### Use Case: Point of Sale and Clickstream

**Overview**

Disruptions in the supply chain (from reduced product supply and diminished warehouse capacity), coupled with rapidly shifting consumer expectations for seamless omnichannel experiences, are driving retailers to rethink how they use data to manage their operations. Historically, point-of-sale (POS) systems recorded all in-store transactions, but that data was kept in a system physically located in the store, which delayed actionable insights. And now with consumers increasingly shopping online, it’s crucial to not only collect and analyze that clickstream data quickly, but also unify it with POS data to get a complete and real-time snapshot of each customer’s shopping behavior.

Near real-time availability of information means that retailers can continuously update their estimates of item availability. The business no longer manages operations based on inventory states as they were a day prior; instead, it acts on inventory states as they are now.

**R E L E V A N T F O R** Retail E-commerce

**Challenges**

- Retailers with legacy POS systems in their brick-and-mortar stores are working with siloed and incomplete sales data
- Both POS and clickstream data need to be unified and ingested in real time

-----

**HOW TO GET STARTED**

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.

**Value with Databricks**

Databricks brings POS and clickstream data together for a unified data source that leads to real-time insights and a clearer understanding of customer behavior.

- **Single source of truth**: a centralized, cloud-based POS system means it can be merged with clickstream data
- **Near real-time insights**: the greater speed to data means businesses get the latest insights into customer purchasing behaviors and trends
 **Scalability** — companies can scale with Databricks to handle data from countless transactions ----- **CHAPTER 7** ### Use Case: On-Shelf Availability **Overview** Ensuring the availability of a product on shelf is the single largest problem in retail. Retailers globally are missing out on nearly $1 trillion in sales because they don’t have on hand what customers want to buy in their stores. Shoppers encounter out-of-stock scenarios as often as one in three shopping trips. All told, worldwide, shoppers experience $984 billion worth of out-of-stocks, $144.9 billion in North America alone, according to industry research firm IHL. In the past, if a customer faced an out-of-stock, they would most likely select a substitute item. The cost of going to another store prevented switching. Today, e-commerce loyalty members, such as those who belong to Walmart+ and Amazon Prime, are 52% more likely than other consumers to purchase out-of-stock items online. It is believed that a quarter of Amazon’s retail revenue comes from customers who first tried to buy a product in-store. In all, an estimated $36 billion is lost to brick-and-mortar competition, and another $34.8 billion is lost to Amazon or another e-retailer, according to IHL. On-shelf availability takes on a different meaning in pure e-commerce applications. An item can be considered in stock when it is actually in a current customer’s basket. If another customer places the same item in their basket, there is the possibility that the first customer will purchase the last available item before the second customer. This problem is exacerbated by retailers who use stores to keep inventory. In these situations, customers may order an item that is picked for delivery at a much later time. The window between ordering and picking creates the probability of out-of-stocks. 
On-shelf availability predicts the depletion of inventory by item, factors in safety stock levels and replenishment points, and generates a signal that suggests an item may be out of stock. This information is used to generate alerts to retail staff, distributors, brokers and consumer goods companies. Every day, tens of thousands of people around the world do work that is generated by these algorithms. The sheer volume of data used to calculate on-shelf availability prevents most companies from analyzing all of their products. Companies have between midnight and 4 AM to collect all of the needed information and run these models, which is beyond the capability of legacy data systems. Instead, companies choose the priority categories or products to analyze, which means a significant percentage of their unavailable products will not be proactively addressed. ----- One of the biggest challenges with on-shelf availability is determining when an item is actually out of stock. While some retailers are investing in computer vision and robots, and others employ the use of people to manually survey item availability, most retailers default to a signal of determining when an item has not been scanned in an acceptable time. **R E L E V A N T F O R** Retail Consumer Goods E-commerce Direct to Consumer **Challenges** The biggest challenge to generating on-shelf availability alerts is time. Companies may receive their final sales data from the preceding day shortly after midnight. They have less than 4 hours from that point to ingest large volumes of t-log data and calculate probabilities of item availability. Most firms are encumbered by a data warehouse process that only releases data after it has been ingested and aggregates have been calculated, a process that can require multiple hours per night. For this reason, most firms make sacrifices in their analysis. 
They may alternate categories they analyze by different days, prioritize only high-impact SKUs, or run analysis at higher-level and less-accurate aggregate levels.

Among the challenges:

- Processing large volumes of highly detailed data and running millions of models in a narrow time window
- Companies are spending hundreds of thousands of dollars annually to generate these daily alerts for a few categories
- Dealing with false positives and negatives in predictions
- Distributing information quickly and efficiently to internal systems and external partners

-----

**Value with Databricks**

Databricks enables customers to generate on-shelf availability (OSA) predictions at scale with no compromises.

- Delta removes the data processing bottleneck. Delta enables retailers to stream in real time or to batch process large volumes of highly detailed and frequently changing point-of-sale transaction data.
- Easily scale to process all OSA predictions within tight service windows using Apache Spark™
- Manage features and localize models with additional causal data to improve accuracy with MLflow
- Easily deploy information via streams, through API for mobile applications or partners, or to Delta for reporting
- Enable retailers to monetize their data by directly licensing OSA alerts

**Solution overview**

Databricks enables companies to perform on-shelf availability analysis without making compromises to the breadth or quality of predictions.

It begins with Delta Lake, a nearly perfect platform for ingesting and managing t-log data. One of the biggest challenges in t-log data is the number of changes to a transaction that can occur within a day. Delta Lake simplifies this with transaction awareness using a transaction log, and creates additional metadata for easier retrieval. Data is made available in a fraction of the time needed in data warehouse-based systems.
This is why the largest retailers in the world are using Delta Lake for processing t-log data.

Once data is available, users need to generate predictions about item availability on the shelf. With its extremely performant engine and the ability to distribute computation across countless nodes, Spark provides the perfect platform for calculating out-of-stocks. Customers no longer need to run in aggregate or against a subset of data.

-----

And lastly, data is only useful if it drives better outcomes. Databricks can write the resulting data into Delta Lake for further reporting, or to any downstream application via APIs, feeds or other integrations. Users can feed their predictive alerts to downstream retail operations systems or even to external partners within the tightest service windows, and in enough time to drive actions on that day.

**Typical use case data sources include:** point-of-sale data, replenishment data, safety stock calculations, manual inventory data (optional), robotic or computer vision inventory data (optional)

**HOW TO GET STARTED**

[Solution Accelerator: On-Shelf Availability](https://www.databricks.com/solutions/accelerators/on-shelf-availability)

In this solution, we show how the Databricks Lakehouse Platform enables real-time insights to rapidly respond to demand, drive more sales by ensuring stock is available on shelf, and scale out your forecasting models to accommodate any size operation.

**CASE STUDY**

Reckitt distributes its products to millions of consumers in over 60 countries, which was causing the organization to struggle with the complexity of forecasting demand, especially with large volumes of different types of data across many disjointed pipelines. Thanks to the Databricks Lakehouse Platform, Reckitt now uses predictive analytics, product placement and business forecasting to better support neighborhood grocery stores.
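
The default signal mentioned earlier in this chapter, an item that has not been scanned in an acceptable time, can be sketched in a few lines. This is a minimal illustration, not the Solution Accelerator's implementation: it assumes sales of an item arrive as a Poisson process at a known hourly rate, so the chance of seeing no scans over a gap of `h` hours is `exp(-rate * h)`. The function names, the sample rates and the 5% alert threshold are hypothetical.

```python
import math


def zero_sale_probability(rate_per_hour: float, hours_since_last_scan: float) -> float:
    """Probability of observing no sales over the gap if the item were on the shelf.

    Assumes Poisson arrivals: P(0 sales) = exp(-rate * hours).
    """
    return math.exp(-rate_per_hour * hours_since_last_scan)


def osa_alerts(items, threshold: float = 0.05):
    """Return (sku, probability) for items whose scan gap is too unlikely to be chance.

    `items` is an iterable of (sku, expected sales per hour, hours since last scan).
    """
    alerts = []
    for sku, rate, gap in items:
        p = zero_sale_probability(rate, gap)
        if p < threshold:
            # A gap this long is unlikely for an in-stock item, so flag it.
            alerts.append((sku, round(p, 4)))
    return alerts


items = [
    ("milk", 2.0, 3.0),      # fast mover: a 3-hour gap is suspicious
    ("saffron", 0.05, 3.0),  # slow mover: a 3-hour gap is normal
]
print(osa_alerts(items))  # [('milk', 0.0025)]
```

The same gap produces an alert for a fast-moving item and none for a slow-moving one, which is why the calculation has to run per item and per store rather than on aggregates.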
-----

**CHAPTER 8**

### Use Case: Customer and Vehicle Identification

**Overview**

COVID-19 led to increased consumer demand for curbside pickup, drive-through and touchless payment options. Retailers that were able to implement these new services have been able to differentiate overall customer experiences and mitigate catastrophic hits on revenue levels.

For retailers to create a seamless contactless experience for customers, they need real-time data to know when a customer has arrived and where they’re located, as well as provide updates throughout the pickup journey. And through the use of computer vision, they can capture that data by employing optical recognition on images to read vehicle license plates. Retailers can also use information captured from license plates to make recommendations on buying patterns. Looking ahead, facial recognition also has the potential to provide retailers with valuable information to better serve their customers in real time.

**R E L E V A N T F O R** Retail Consumer Goods Drive-Through Food Retailers

**Challenges**

- Ineffective data processing can lead to suboptimal order preparation timing
- Without real-time data, it can be difficult to provide customers with live updates on their order status

-----

**Value with Databricks**

Databricks makes it possible to not only identify customers and vehicles in real time but also provide real-time communications throughout the entire shopping and curbside or drive-through experience.
- **Near real-time insights**: the greater speed to data means retailers can get the right order preparation timing
- **Recommendations**: being able to quickly access and refer to data from previous visits will ensure each subsequent visit is as seamless as, or more seamless than, the last
- **Optimizing with AI/ML**: implementing AI and ML models can lead to more effective geofencing, vehicle identification and order prediction

-----

**CHAPTER 9**

### Use Case: Recommendation Engines

**Overview**

Customers that feel understood by a retailer are more likely to spend more per purchase, purchase more frequently with that retailer, and deliver higher profitability per customer. The way that retailers achieve this is by recommending products and services that align with customer needs.

Providing an experience that makes customers feel understood helps retailers stand out from the crowd of mass merchants and build loyalty. This was true before COVID, but shifting consumer preferences make this more critical than ever for retail organizations. With research showing that acquiring a customer costs as much as five times more than retaining an existing one, organizations looking to succeed in the new normal must continue to build deeper connections with existing customers in order to retain a solid consumer base. There is no shortage of options and incentives for today’s consumers to rethink long-established patterns of spending.

Recommendation engines are used to create personalized experiences for users across retail channels. These recommendations are generated based on the data collected from purchases, items interacted with, users’ behavior across physical and digital channels, and other data such as from customer service interactions and reviews.
Leveraging a Customer 360 architecture that collects all user clickstream and behavioral data, marketers are able to create recommendations that are integrated with other business objectives such as highlighting items that are on promotion or product availability. Creating recommendations is not a monolithic activity. Recommendation engines are used to personalize the customer experience in every possible area of consumer engagement, from proactive notifications and offers, to landing page optimization, suggested products, automated shipment recommendations, cross-sell and upsell, and even suggestions for complementary items after the purchase. ----- **R E L E V A N T F O R** Retail E-commerce Direct to Consumer Media Telecom Financial Services (any B2B or B2C company) **Challenges** Recommendation engines are very difficult to do well. Many companies use off-the-shelf recommenders, but traditional off-the-shelf systems suffer from high rates of inaccuracy. In our analysis, we found general recommenders with 29% variance, meaning that of every 10 recommendations delivered, 3 would be irrelevant. **Massive volumes of highly detailed and frequently changing data.** Recommendation accuracy is improved by having recent data, and yet most systems struggle to handle the large volumes of information involved. **Creating a 360 view of the customer.** Identity and being able to stitch together all customer touchpoints in one place are critical to enabling this use case. More data, including transaction and clickstream data, is critical for driving accuracy and precision in messaging. **Processing speed.** Retailers need to be able to frequently refresh models based on constantly changing dynamics, and deliver real-time recommendations via APIs. **Automation.** This is an “always-on” use case where automation is essential for scalability and responsiveness based on frequent model updates. ----- Many firms choose to use recommender systems from Amazon or Google. 
Using these systems trains the general recommendation engine in a way that helps competitors improve the accuracy of their own recommendations. **Value with Databricks** Recommendations are one of the most critical capabilities that a retailer maintains. This is a capability that retailers must own, and Databricks provides a solid platform for enabling this. Using Databricks as the foundation for their Customer 360 architecture to deliver omnichannel personalization, sample value metrics from a media agency include: **200% ROI for 70% of retailers** engaging in advanced personalization **10% improvement** in conversions **35% improvement** in purchase frequency **37% improvement** in customer lifetime value **Solution overview** Recommendations are only as good as the data that powers them. Delta Lake provides the best platform for capturing and managing huge volumes of highly atomic and frequently changing data. It allows organizations to combine various sources of data in a timely and efficient manner, from transactions, demographics and preference information across products, to clickstream, digital journey and marketing analytics data to bring a 360 view of customer interactions to enable omnichannel personalization. By identifying changes in user behavior or engagement, retailers are able to detect early signals that indicate a propensity to buy or a change in preferences, and recommend products and services that will keep consumers engaged. 
----- **Typical use case data sources include:** Customer 360 data, CRM, loyalty data, transaction data, clickstream data, mobile data: **Engagement data** — transaction log data, clickstream data, promotion interaction **Identity** — loyalty data, person ID, device ID, email, IP address, name, gender, income, presence of children, location **User lifecycle** — subscription status, payment history, cost of acquisition, lifetime value, propensity to churn **CASE STUDY** For Wehkamp to provide the best shopping experience for their customers, they turned to Databricks for help with their data analytics and machine learning needs, resulting in a highly engaging web shop personalized to each of their customers. **CASE STUDY** Columbia’s legacy ETL was unable to support batch and real-time use cases at scale. After migrating to Databricks, the company is now able to more efficiently and reliably work with its data, resulting in smarter business decisions. **CASE STUDY** Pandora wanted to drive stronger online engagement with their customers, so they used the Databricks Lakehouse Platform to create more personalized experiences and boost both click-to-open rates and quarterly revenue. HOW TO GET STARTED Databricks has created [four](https://www.databricks.com/solutions/accelerators/recommendation-engines) [Recommendation Engine accelerators,](https://www.databricks.com/solutions/accelerators/recommendation-engines) with content-based and collaborative filter methods, and both item- and user-based analysis. These accelerators have been further refined to be highly performant to enable frequent retraining of models. To begin working on recommendation engines, contact your Databricks account team. ----- **CHAPTER 10** ### Use Case: Perpetual Inventory **Overview** With the rapid adoption of digital channels for retail, staying on top of your inventory is crucial to meeting customer demand. 
As a result, the periodic inventory system is now outdated. Instead, a perpetual inventory model allows businesses to perform immediate and real-time tracking of sales and inventory levels. This has the added benefit of reducing labor costs and human error, ensuring that you always have an accurate overview of your inventory and can better forecast demand to avoid costly stockouts.

The key to building a perpetual inventory system is real-time data. By capturing real-time transaction records related to sold inventory, retailers can make smarter inventory decisions that streamline operations and lower overall costs.

**R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics Supply Chain Inventory Management

**Challenges**

- Companies need to scale to handle ever-increasing inventory and the data associated with the products
- Data needs to be ingested and then processed in real time (or near real-time) to provide a truly accurate view of inventory

-----

**HOW TO GET STARTED**

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.

**Value with Databricks**

Databricks enables real-time inventory updates, giving businesses the insights they need to properly manage inventory and to forecast more accurately.

- **Near real-time insights**: the greater speed to data means inventory is automatically updated with the latest sales data
- **Detailed records**: with all inventory updates and movements being tracked as they happen, companies know they’re getting the most accurate information at any point
- **Optimizing with AI/ML**: using AI and ML can help with forecasting demand and reducing inventory management costs

-----

**CHAPTER 11**

### Use Case: Automated Replenishments

**Overview**

Customers favor convenience more than ever when it comes to their goods, and automated replenishments help meet that need.
Whether it’s through a connected device or smartphone app, real-time data plays a key role in ensuring consumers get a refill automatically delivered at the right time. On the manufacturing side, this real-time data can also help with vendor-managed replenishment (VMR), reducing the time needed to forecast, order and receive thousands of items.

**R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics Direct to Customer

**Challenges**

- Being able to ingest large amounts of data quickly is crucial to actually fulfilling the replenishment orders
- With VMR, there may be a disconnect between the vendor and customer, resulting in a forecast for replenishment even when the customer can’t fulfill that order

-----

**HOW TO GET STARTED**

Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data.

**Value with Databricks**

Databricks enables real-time inventory updates, giving businesses the insights they need to properly manage inventory and to forecast more accurately.

- **Near real-time insights**: the greater speed to data means businesses can stay on top of customer needs
- **Scalability**: companies can scale with Databricks to handle thousands of SKUs, each with its own unique properties and expiry dates
- **Optimizing with AI/ML**: using AI and ML can lead to better forecasting and predictions

-----

**CHAPTER 12**

### Use Case: Fresh Food Forecasting

**Overview**

Fresh food typically accounts for up to 40% of revenue for grocers, and plays an important role in driving store traffic. But fresh food is also incredibly complex to manage: prices can be volatile, there is a wide range of suppliers to work with and the products expire, which creates significant amounts of waste.
In order to avoid losing significant revenue, businesses need to properly forecast when food is nearing its sell-by date, the current levels of customer demand (also taking into account seasonality), and the proper timing for replenishing food stock. Being able to tap into real-time data is key to staying on top of the ever-changing needs around fresh food.

**R E L E V A N T F O R** Retail E-commerce Distributors Logistics Restaurants

**Challenges**

- Because of the perishable nature of fresh food, customers need to be able to ingest data quickly enough to conduct daily forecasting and daily replenishment
- Customers are running aggregate-level forecasts, which are less accurate than fine-grained forecasting
- Customers are forced to compromise on what they can analyze

-----

**HOW TO GET STARTED**

Contact your Databricks account team to get started with inventory allocation. Databricks does not have a Solution Accelerator.

View our webinar covering demand forecasting with Starbucks and then read our blog about demand forecasting.

[Fine-grained time series forecasting at scale.](https://www.databricks.com/blog/2021/04/06/fine-grained-time-series-forecasting-at-scale-with-facebook-prophet-and-apache-spark-updated-for-spark-3.html) This blog details the importance of time series forecasting, walks through building a simple model to show the use of Facebook Prophet, and then shows off the combination of Facebook Prophet and Apache Spark to scale to hundreds of models.

[On-demand webinar for demand forecasting.](https://www.databricks.com/blog/2020/02/21/on-demand-webinar-granular-demand-forecasting-at-scale.html) Video and Q&A from our webinar with Starbucks.

**Value with Databricks**

Customers average double-digit improvement in forecast accuracy, leading to a reduction in lost sales and in spoiled products, as well as lower inventory and handling costs.
- **Improved accuracy**: on average, customers moving to Databricks for demand forecasting report a double-digit improvement in forecast accuracy
- **Ability to scale and perform fine-grained (day, store, item) level forecasts**: rapidly scale to tens of millions of model iterations in narrow service windows. Companies need accurate demand forecasts in a few hours.
- **Eliminate compromises on what to analyze**: customers do not need to select winners or losers among the products they forecast. They can predict demand for all products as frequently as required.

**Solution overview:**

Databricks is well suited to handling forecasting for fresh food at scale. Forecasting begins with the Databricks Solution Accelerator. It enables companies to rapidly build fine-grained forecasting of items, forecasting that can be efficiently scaled to tens of millions of predictions in tight service windows.

**Typical use case data sources include:** historic point-of-sale data, shipment data, promotions, pricing, expiration dates and weather.

**CASE STUDY**

ButcherBox faced the complex challenges of securing inventory with enough lead time, meeting highly variable customer order preferences and unpredictable customer sign-ups, and managing delivery logistics. With Databricks, the company was able to create a predictive solution to adapt quickly and integrate tightly with the rest of its data estate.

**CASE STUDY**

Sam’s Club needed to build out an enterprise-scale data platform to handle the billions of transactions and trillions of events going through the company. Find out how Databricks became a key component in the shift from on-premises Hadoop clusters to a cloud-based platform.

-----

**CHAPTER 13**

### Use Case: Propensity-to-Buy

**Overview**

Customers often have repeatable purchase patterns that may not be noticed upon initial observation.
While we know that commuting office workers are likely to purchase coffee at a coffee shop on weekday mornings, do we understand why they visit on Thursday afternoons? And more importantly, how do we predict these buying moments when customers are not in our stores?

The purpose of a propensity-to-buy model is to predict when a customer is predisposed to make a purchase and subsequently act on that information by engaging customers.

Traditional propensity-to-buy models leveraged internal sales and loyalty data to identify patterns of consumption. These models are useful, but are limited in understanding the full behavior of customers. More advanced propensity-to-buy models are now incorporating alternative data sets to identify trips to competing retailers, competitive scan data from receipts, and causal data that helps to explain when and why customers make purchases.

Propensity-to-buy models create a signal that is sent to downstream systems such as those for promotion management, email and mobile alerts, recommendations and others.

**R E L E V A N T F O R** Retail E-commerce Direct to Consumer

-----

**Challenges**

- Customers do not want to be inundated with messages from retailers. Companies need to limit their outreach to customers to avoid angering them.
- Companies need to traverse and process vast amounts of customer data and generate probabilities of purchase frequently
- Companies need to look at external data that helps build a propensity-to-buy model that captures the full share of the customer wallet. They need to quickly test and incorporate additional data that improves the accuracy of their models.
**Value with Databricks** **** Databricks allows companies to efficiently traverse huge volumes of customer data over time, and efficiently synthesize this into data for analysis **** Companies need to traverse and process vast sums of customer data and generate probabilities of purchase frequency **** Companies need to look at external data that helps build a propensity-to-buy model that captures the full share of the customer wallet. They need to quickly test and incorporate additional data that improves the accuracy of their models. **Solution overview:** Propensity-to-buy analytics determine the signals that indicate the probability a customer is in a buying moment. Historic propensity models relied on sales data to identify buying patterns, but newer approaches are incorporating behavioral data. Proximity to a coffee shop might push a consumer over the threshold of a buying moment. Traditional, batch-oriented operations are insufficient to solve this problem. If you wait until that night, or even later in the day you have lost the opportunity to act ----- **HOW TO GET STARTED** To begin working on propensity-to- buy, leverage our [Propensity Scoring](https://www.databricks.com/solutions/accelerators/propensity-scoring) [Solution Accelerator](https://www.databricks.com/solutions/accelerators/propensity-scoring) With the propensity to buy, speed becomes a critical force in determining key inflection points. Databricks enables marketers to ingest data in real time and update probabilities. Lightweight queries can be automated to refresh models, and the resulting data can be fed automatically to downstream promotions, web or mobile systems, where the consumer can be engaged. As this data is streamed into Delta Lake, data teams can quickly capture the data for broader analysis. Calculating a propensity to buy requires traversing interactions that are episodic in nature, and span broad periods of time. 
Delta Lake helps simplify this with scalable metadata handling, ACID transactions and data skipping. Delta Lake even manages schema evolution to provide users with flexibility as their needs evolve.

**Typical use case data sources include:** point-of-sale data with tokens, loyalty data, e-commerce sales data, mobile application data, competitive scan or receipt data (optional), place of interest data (optional)

-----

**CHAPTER 14**

### Use Case: Next Best Action

**Overview**

The e-commerce boom over the last couple of years has given consumers ample choice for digital shopping options. If your business isn’t engaging customers at every point in their purchasing journey, you risk losing them to a competitor. By applying AI/ML to automatically determine, in real time, the next best action for customers, you can greatly increase your conversion rates.

**RELEVANT FOR**

- Retail
- Consumer Goods
- Direct to Consumer
- E-commerce

**Challenges**

- Siloed data makes it difficult to create an accurate and comprehensive profile of each customer, resulting in suboptimal recommendations for the next best action
- Companies need to ingest large amounts of data in real time and then take action on it immediately
- Many businesses still struggle with training their ML models to properly determine the next best action (and self-optimize based on the results)

-----

**HOW TO GET STARTED**

To begin working on propensity-to-buy, leverage our [Propensity Scoring Solution Accelerator](https://www.databricks.com/solutions/accelerators/propensity-scoring).

**Value with Databricks:**

- Databricks provides all the tools needed to **process large volumes of data and find the next best action** at any given point in the customer journey
- **Near real-time insights:** the greater speed to data means businesses can react immediately to customer actions
- **Single source of truth:** break down data silos by unifying all of a company’s customer data (including basic information, transactional data, online behavior/purchase history, and more) to get a complete customer profile
- **Optimizing with AI/ML:** use AI to create self-optimizing ML models that are trained to find the best next step for customers

-----

**CHAPTER 15**

### Customers That Innovate With Databricks Lakehouse for Retail

Some of the top retail and consumer packaged goods companies in the world turn to Databricks Lakehouse for Retail to deliver real-time experiences to their customers.

Today, data is at the core of every innovation in the retail and consumer packaged goods industry. Databricks Lakehouse for Retail enables companies across every sector of retail and consumer goods to harness the power of real-time data and analytics to solve strategic challenges and deliver more engaging experiences to customers.

Get started with a free trial of Lakehouse for Retail and start building better data applications today.

**[Start your free trial](https://databricks.com/try-databricks)**

Contact us for a personalized demo at: [databricks.com/contact](http://databricks.com/contact)

-----

###### About Databricks

Databricks is the data and AI company. More than 7,000 organizations worldwide (including Comcast, Condé Nast, H&M and over 40% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/).
**[Sign up for a free trial](https://databricks.com/try-databricks)**

-----

**eBook**

# Accelerate Digital Transformation in Insurance With Data, Analytics and AI

### Real-world use cases with Databricks Lakehouse

-----

## Contents

- Introduction
- Three Trends Driving Transformation in Insurance
- The Need for Modern Data Infrastructure
- Common Challenges Insurers Face Using Legacy Technology
- Why Lakehouse for Insurance
- Key Use Cases for Insurance:
  - Claims Automation and Transformation
  - Dynamic Pricing and Underwriting
  - Anomaly Detection and Fraudulent Claims
  - Customer 360 and Hyper-Personalization
- Global Regulatory Impact in Insurance
- Industry Solutions: Get Started With Accelerators, Brickbuilders and Enablers
- Get Started With Industry Solutions
- Conclusion

-----

## Introduction

With the rapid advancement of technology, rising consumer expectations, and strong competition between insurtechs and incumbents resulting from the dissolution of industry boundaries, it is clear that insurers must continue to accelerate their data transformation journey. Today, new insights are derived as quickly as data can move in the insurance industry. This speed has increased as insurers collect vast amounts of customer data from new sources, such as IoT sensors, smartwatches that provide insight into consumers’ health data, and online behavior that includes clickstream data, spending habits, and frequented websites. As a result, the data strategy has become even more complex.
Consumers want stronger reassurance for what they value most: financial security and greater peace of mind. Insurers have always prided themselves on delivering such protection and security. However, customer needs have changed, and insurers that move most swiftly to satisfy them will be in the best position to navigate challenging times. The bottom line is that insurers must adapt to these changes and meet the evolving needs of their customers to remain competitive. Data-driven insurers will seek opportunities to improve the customer experience, develop more sophisticated pricing models, and increase their operational resilience. More than ever, the total cost of ownership (TCO) of digital investments and enterprise data strategy has become a top priority for boards and senior executives in the insurance industry. So, what does this mean from a data and analytics perspective? It all comes down to having one reliable source of truth for data, which is derived from batch and streaming data, structured and unstructured data, from multiple clouds and jurisdictions. In a regulated and risk-averse industry where data sharing was once seen as optional, it has now become fundamental. To compete in the digital economy, insurers need an open and secure approach to data sharing. Databricks Lakehouse for Insurance plays a critical role in helping insurance providers accelerate innovation and transform their businesses, resulting in significant operational efficiencies and improved customer experiences at a fraction of the cost of data warehouses. This eBook provides an in-depth exploration of key challenges and common use cases in the insurance industry. Most importantly, you will gain insight into how Databricks Lakehouse can unlock the true value of your data through practical Solution Accelerators and a wide range of partners available to assist you on your journey. 
**“The future of insurance will become increasingly data-driven, and analytics enabled.”**

[EY’s “Five principles for the future of protection”](https://www.ey.com/en_us/insurance/five-principles-for-the-future-of-protection)

-----

The Lakehouse reference architecture below illustrates a sample framework upon which insurers can build. Moving from left to right in the diagram, the first layer represents various data sources such as on-premises systems, web and mobile applications, IoT sensors, enterprise data warehouses, and third-party APIs. Data is then ingested through automated data pipelines, and processed within the Lakehouse platform across three layers (Bronze, Silver and Gold). These layers are responsible for data preparation, including ML model registry, centralized governance, workflow orchestration, and job scheduling. They ensure a compliant and secure infrastructure that sits atop the cloud layer (or multiple clouds), eliminating the need for data duplication. Finally, the transformed data is delivered as actionable insights and supports use cases such as automated reporting, business analytics, customer 360, and claims analytics. These use cases not only mitigate risk but also drive revenue.
*[Figure: Lakehouse reference architecture. Data Sources (On-Premises Servers, Web and Mobile Applications, Internet-of-Things (IoT) Devices, Enterprise Data Warehouses, Third-Party APIs and Services); Ingestion (Automated Data Pipelines, Batch or Streaming); Lakehouse for Financial Services (Bronze, Silver and Gold Layers; Raw Entity Data, Curated Feature Sets, Aggregated Business Views; ML Model Registry, Centralized Data Governance, Workflow Orchestration, Job Scheduling); Serving (Automated Reporting, Business Analytics and Interactive Dashboards, Collaborative Data Source, Productionized Referenced Data and Models).]*

-----

## Three Trends Driving Transformation in Insurance

Over the next decade, technology-enabled insurance companies will bear little resemblance to today’s organizations. The following three trends are driving this transformation in the insurance industry:

**The rapid emergence of large language models and generative AI**

In recent years, there has been a significant breakthrough in the field of artificial intelligence with the emergence of large language models (LLMs) and generative AI. These models, such as GPT-4 and its predecessors, Databricks Dolly and others, are built using deep learning techniques and massive amounts of training data, enabling them to generate human-like text and perform a wide range of natural language processing tasks.

LLMs and generative AI can help insurance companies automate repetitive tasks such as underwriting, claims processing, and customer service, improving efficiency and reducing costs. They can also help insurers to better understand customer needs and preferences, leading to more personalized products and services.
However, as with any disruptive technology, the adoption of LLMs and generative AI will require careful consideration of ethical and regulatory issues, such as data privacy and algorithmic bias.

**Transformed ecosystems and open insurance**

[According to EY](https://assets.ey.com/content/dam/ey-sites/ey-com/en_gl/topics/insurance/ey-2022-global-insurance-outlook-report.pdf), leading companies leverage insurtechs in their ecosystems to achieve high margins in commoditized products. Open insurance, which involves sharing and managing insurance-related data through APIs, is more than an item on the regulatory agenda. It can give consumers access to better products and accurate pricing, as well as enable them to execute transactions more easily. In its [annual Chief Data Officer Survey](https://www.gartner.com/smarterwithgartner/data-sharing-is-a-business-necessity-to-accelerate-digital-business), Gartner found that organizations that promote external data sharing have three times the measurable economic benefit across a variety of performance metrics compared to their peers.

**Revised target operating model with a focus on talent**

Demographic shifts and perennial cost pressures make it critical for insurers to attract and retain talent. Consequently, it’s important for insurers to equip their workforces with the right tools and technologies to help them identify business processes that can be optimized to differentiate themselves from their competitors, with an emphasis on moments that matter in the customer journey, according to EY. Recent research from Deloitte highlights the advantages of upskilling and building a future-ready workforce. One of the benefits of AI adoption in the workforce is that it enables organizations to automate a wide range of business processes, boosting speed and efficiency. But what’s even more important is that it enables employees to focus on higher-value work, according to Deloitte.
-----

## The Need for Modern Data Infrastructure

**Insurers turning to cloud and data analytics**

The insurance industry has undergone significant changes over the years, and one of the areas that has evolved the most is data management. With the growing need for advanced analytics and digital transformation, many insurance companies are turning to cloud technology and modern data infrastructures to enhance their data management strategies.

The benefits of adopting cloud technology are numerous, particularly the ability to efficiently store and quickly access vast amounts of data, which is crucial in a heavily regulated and data-driven industry like insurance. Additionally, the flexibility of the cloud enables insurers to scale costs, adapt to changing work environments, and meet evolving customer and business requirements.

Furthermore, insurance providers can leverage the cloud to analyze customer data at scale, gaining insights into behaviors that drive hyper-personalization, dynamic pricing and underwriting, and form the foundation for claims automation. By implementing advanced analytics, insurers can innovate more easily, scale their businesses, and bring new products to market more quickly.

To remain competitive, insurance companies must increase their investment in cloud technology and data analytics, as this will accelerate insightful decision-making across various functions such as claims management, underwriting, policy administration, and customer satisfaction.

Overall, the adoption of cloud technology and data analytics is imperative for insurance providers to enhance operational efficiency, improve business processes, and stay relevant in today’s fast-paced business landscape.

-----

**Let’s take a closer look at a few examples:**

**Auto insurers** need to integrate new data sources, such as weather and traffic, to build solutions capable of real-time processing.
This enables them to alert emergency services promptly and gain a better understanding of drivers’ driving patterns. It also enables the development of sophisticated machine learning-based risk assessment, underwriting and claims models.

**Commercial insurance**, including property, general liability, cyber insurance and business income insurance, utilizes ML-based automation of actuarial models. This automation facilitates underwriting, claims forecasting and dynamic pricing for their customers. Another notable trend in recent years is the use of IoT-based alerting for sensitive or valuable commodities. For example, in the case of vaccines, IoT sensors can monitor the temperature in real time and send alerts to the appropriate team or person if the temperature moves outside acceptable thresholds. This is crucial as vaccines must be stored within specific temperature ranges.

In **life insurance**, complex ML models can be employed to create a profile of the customer’s lifestyle and, importantly, detect any changes to it. This deeper understanding and 360-degree view of the customer enable more customized underwriting and pricing based on the policyholder’s current health, lifestyle and eating habits.
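The IoT-based alerting mentioned above for temperature-sensitive goods such as vaccines can be sketched in a few lines of plain Python. This is only an illustration: the `Reading` type, the `find_alerts` helper and the default range of 2.0 to 8.0 °C are assumptions made for the example, and a production system would read from a stream rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    timestamp: str        # ISO-8601 string, kept simple for the sketch
    temperature_c: float


def find_alerts(readings: Iterable[Reading],
                low_c: float = 2.0,
                high_c: float = 8.0) -> List[str]:
    """Return one alert message per reading outside the assumed acceptable range."""
    alerts = []
    for r in readings:
        if r.temperature_c < low_c or r.temperature_c > high_c:
            alerts.append(
                f"ALERT {r.sensor_id} at {r.timestamp}: "
                f"{r.temperature_c:.1f} C outside [{low_c}, {high_c}]"
            )
    return alerts


sample = [
    Reading("fridge-1", "2024-01-01T08:00:00Z", 4.5),
    Reading("fridge-1", "2024-01-01T08:05:00Z", 9.3),  # too warm
    Reading("fridge-2", "2024-01-01T08:05:00Z", 1.2),  # too cold
]
sample_alerts = find_alerts(sample)
```

In practice the alert messages would be routed to the appropriate team or person; here they are simply collected in a list so the thresholding logic is easy to inspect.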
High-priority business use cases

|Type of Data Source|Typical Vendors|Claims Automation and Transformation|Dynamic Pricing and Underwriting|Anomaly Detection and Fraudulent Claims|Customer 360 and Hyper-Personalization|
|---|---|---|---|---|---|
|Policy data|Guidewire, Duck Creek, Majesco, FINEOS, EIS, Unqork|||||
|Claims data|Guidewire, Duck Creek, Majesco, FINEOS, EIS, Unqork, TransUnion|||||
|Real-time ingestions|Cambridge Mobile Telematics, Zendrive, Custom|||||
|Alternative / Supplemental data|Experian, Equifax, Verisk, IBM Weather|||||
|Marketing data|Salesforce, HubSpot, Google Analytics|||||

**Figure 1.** Innovating with data and analytics: use cases made possible and key data sources from popular insurance vendors

-----

## Common Challenges Insurers Face Using Legacy Technology

Modernization is not an easy process for insurers, and while transforming IT ecosystems is necessary to improve business outcomes, ensuring business continuity is absolutely critical. However, the volume of data they collect, along with changes in user behavior and legacy systems that can’t handle this amount of data, are forcing insurance providers to accelerate their modernization journeys.

Insurance providers face several challenges when using legacy technology, including:

**Legacy on-premises systems:** Legacy on-premises systems are not only expensive to maintain, but they also store large amounts of data in silos across the business. This makes it difficult to access the data, hindering data analytics efforts and limiting executives’ ability to make informed business decisions.

**Ingesting large volumes of transactional data in real time:** The inability to ingest data from transaction systems in real time is a major obstacle to obtaining critical insights. Transaction logs from operations such as policy administration, enrollment and claims constantly stream data.
However, many insurance companies still rely on legacy data warehouses built around batch processing, which is not suitable for ingesting and integrating large data sets. As a result, insurers often opt to ingest data nightly, leading to delays in receiving accurate data for decision-making.

**Performing fine-grained analysis at scale within tight time frames:** Legacy technology forces insurers to make a trade-off when analyzing data for user intent. They can choose between detailed and accurate predictions or fast predictions. Running detailed forecasts can improve accuracy, but it requires performing millions of model calculations within narrow service windows, which exceeds the capability of legacy data platforms. Consequently, insurance companies have to accept less accurate predictions.

**Powering real-time decisions on the front line:** Serving real-time data to thousands of workers is a complex task. While data warehouses can serve reports to large groups of users, they are limited to providing stale data. As a result, most insurers only provide daily or weekly updates to reports and rely on employees’ judgment for more frequent decisions.

**Delivering a hyper-personalized omnichannel experience:** Today’s insurers aim to deliver personalized experiences across every channel, both digital and offline. While insurance providers have access to vast amounts of customer data, off-the-shelf tools for personalization and customer segmentation struggle to handle such high volumes, leading to inaccurate analytics. To succeed in the insurance industry, companies must deliver personalized experiences at scale.

-----

Databricks Lakehouse for Insurance addresses the key challenges faced across the insurance value chain. The lakehouse enables the integration of various data types, including images and structured data, in real time.
It offers robust management and governance capabilities, and rapidly transforms data into actionable insights through real-time reporting and predictive analytics. This platform-as-a-service solution delivers exceptional speed and industry-leading total cost of ownership, providing insurers with faster insights to enhance the customer experience and gain a competitive edge.

*[Figure 2 diagram. Value chain stages: Product Development & Feature Selection; Sales & Lead Management; Application Review & Submission; Underwriting and Pricing; Rating Offer & Endorsements; Policy Issue, Service & Administration; Claims. Stage annotations: Hyper-personalization/life events; Coverage/review policy features/riders documents (submission); UW rules, guidelines & technical pricing; Evaluate rate options, pricing and endorsements; Omnichannel; Fraud, frequency, severity and reserves.]*

**We continuously develop solution accelerators and enablers to accelerate the time to market.**

- Dynamic segmentation
- Personas
- Hyper-personalization
- Intelligent automation
- Product architecture and manufacturing
- Configurable products
- Competitor rates
- Reflexive questionnaire
- LLM assistance for document summarization
- NLP for unstructured data
- Evaluation of risk within appetite
- Validation of UW requirements
- Straight-through processing optimization
- Risk assessment via actuarial pricing
- Triaging of risk to underwriter SME for policy/exposure changes
- Predict loss cost (frequency and severity)
- Computer vision on images to identify loss
- Auto-adjudication and triaging of claims to claim adjuster
- Tailor communication by segment (e.g., email, text, mail, or omnichannel)
- Identify Fraud, Waste and Abuse, route to ICU

**Figure 2.** Evaluating data maturity across the insurance value chain and lines of business (LOBs)

-----

## Why Lakehouse for Insurance

Databricks Lakehouse for Insurance combines simplicity, flexibility and reusability, enabling insurers to meet the demands of the market with speed and agility. It offers best-in-industry performance and serves as a modern data architecture that provides differentiated capabilities for insurers to thrive in a competitive industry.

- Insurance companies can store any type of data using Databricks Lakehouse for Insurance, leveraging the low-cost object storage supported by cloud providers. This helps break down data silos that hinder efforts to aggregate data for advanced analytics, such as claim triaging and fraud identification, regulatory reporting, or compute-intensive risk workloads. Another critical feature is the time-travel capabilities of the lakehouse architecture, allowing insurers to access any historical version of their data.
- Supporting streaming use cases, such as monitoring transaction data, is easier with the lakehouse. It utilizes Apache Spark™ as the data processing engine and Delta Lake as the storage layer. Spark enables seamless switching between batch and streaming workloads with just a single line of code. Delta Lake’s native support for ACID transactions ensures reliable and high-performing streaming workloads.
- For both machine learning and non-machine learning insurance models, a comprehensive governance framework is provided. Data, code, libraries and models are linked and independently version controlled using technologies like Delta Lake and MLflow. Delta Lake ensures stability by allowing insurance companies to declare their expectations for data quality upfront. MLflow enables training models in any language and deploying them anywhere, minimizing the need for complex handoffs between data science practices, independent validation units and operational teams.
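The idea of declaring data quality expectations upfront can be illustrated without any platform-specific API. The sketch below is plain Python and deliberately does not use the Delta Lake or Delta Live Tables interfaces; the rule names, the record fields (`claim_id`, `amount`) and the `apply_expectations` helper are hypothetical. It shows named rules being declared once and then applied to every incoming record, with failures kept alongside the names of the rules they violated.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Expectation = Callable[[Record], bool]

# Expectations are declared once, by name, before any data is processed.
EXPECTATIONS: Dict[str, Expectation] = {
    "claim_id_present": lambda r: bool(r.get("claim_id")),
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def apply_expectations(
    records: List[Record], expectations: Dict[str, Expectation]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into valid rows and rejected rows with the failed rule names."""
    valid: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in records:
        failed = [name for name, check in expectations.items() if not check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            valid.append(record)
    return valid, rejected


claims = [
    {"claim_id": "C1", "amount": 100.0},
    {"claim_id": "", "amount": 50},
    {"claim_id": "C3", "amount": -5},
]
valid_claims, rejected_claims = apply_expectations(claims, EXPECTATIONS)
```

Keeping the rejected rows together with the failed rule names, rather than silently dropping them, makes it possible to report on data quality over time; whether to drop, quarantine or fail on a violation is a design choice left open here.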
-----

**Level-up value with Databricks Lakehouse for insurance**

Building your data lakehouse with the Databricks Lakehouse Platform empowers your organization with the speed, agility and flexibility needed to address critical insurance use cases that have a significant impact on your customers and your business. Additionally, it helps lower the total cost of ownership (TCO).

The Databricks platform enables the implementation of your data, analytics and AI strategy at scale on a unified and modern cloud data architecture. The key benefits include:

**1. Cost and complexity reduction**

The Databricks Lakehouse provides an open, simple and unified cloud data management architecture that streamlines operational inefficiencies, reduces IT infrastructure costs, and enhances productivity across teams.

**2. Enhanced risk management and control**

By unlocking the value of enterprise data, the platform helps reduce corporate governance and security risks. It facilitates data-driven decision-making through governed discovery, access and data sharing.

**3. Accelerated innovation**

The platform enables the acceleration of digital transformation, modernization and cloud migration initiatives, fostering new growth opportunities and driving innovation for improved customer and workforce experiences.

To help you get started, this guidebook includes the most commonly observed use cases across the insurance industry.

-----

**Reference Architecture for Smart Claims**

1. The Lakehouse ingests various types of data, either in bulk or incrementally through change data capture (CDC). These include structured and unstructured data sets such as images, text and video; IoT sensor data; operational data like claims and policies; and on-prem or third-party data such as credit bureau, weather and driving records. Partner Connect offers a range of ingest tools from different vendors that you can directly use from the Databricks portal.
2. Delta Live Tables (DLT) is the preferred ETL path to transform the data based on business requirements. All the data resides in cloud storage, where Delta refines it into Bronze, Silver and Gold zones of a medallion pipeline blueprint. Databricks Workflows provide orchestration of the various dependent tasks, with advanced capabilities like retry, repair and job status notifications.
3. Databricks SQL, with Photon and serverless options, caters to BI consumption use cases to refresh a dashboard monitoring key metrics and KPIs, with query history and alerts on critical events.
4. Databricks ML Runtime, MLflow, along with Feature Store, AutoML, and real-time Model Serving enable ML use cases to provide predictive insights.
5. Delta Sharing provides a secure and governed way of sharing data internally and externally without copying it, using Unity Catalog.

-----

**Secure data sharing with Delta Lake**

At the heart of Databricks Lakehouse for Insurance is a technology that allows insurers to overcome the trade-offs between speed and accuracy. Technologies like Delta Lake enable the lakehouse, which combines the strengths of data warehouses and data lakes, to directly address these challenges. With Delta Lake, insurance providers can unify all their data (structured and unstructured, batch and real-time) in one centrally managed and governed location.

Once the data is in the lakehouse, various stakeholders such as e-commerce systems, reporting users, analysts, data scientists and data engineers can leverage this information. They can use it to develop models for applications, power real-time reporting, perform advanced analytics, and create large-scale forecasting models, among other use cases.
*[Figure: Lakehouse Platform. One copy of data with centralized governance, supporting business intelligence, streaming, data science / ML, data warehouse and orchestration.]*

-----

**KEY USE CASE**

## Claims automation and transformation

**Overview**

Insurers are entering a new era of claims transformation, supported by evolving technological advancements and increasing data availability. Leveraging the Databricks Lakehouse, organizations can deal with the massive amount of structured and unstructured data coming in from different sources, in different formats, and time frames. Every touchpoint in the claims journey, beginning even before an incident occurs, can be supported by a combination of technology and human intervention that seamlessly expedites the process.

**Business problem**

Missing data, or data that is “not in good order” and needs to be corrected before processing, leads to claims leakage and inefficient processes in triaging claims to the right resource.

**Solution/value with Databricks**

Enable triaging of claims and resources by leveraging big data processing and integrated ML and AI capabilities, including MLflow model lifecycle management.

**Business outcomes and benefits**

- Decrease in annual claims payout
- Increase in claim fraud detection/prevention
- Improve efficiencies by 15%

**“Applying AI as broadly, as aggressively and as enthusiastically as possible. No part of our business should be untouched by it.”**

Masashi Namatame, Group Chief Digital Officer, Managing Executive Officer, Tokio Marine

**CUSTOMER CASE STUDY**

**Tokio Marine: Striving to become AI-driven**

Insurers of all types now routinely use AI models to drive underwriting, streamline claims processing and accelerate claims adjudication, protect against insurance fraud, and improve risk forecasting, for example.
Tokio Marine (Japan’s oldest insurance company, which has done business since 1879) has been applying advanced uses of AI, particularly in its auto insurance business, says Masashi Namatame, Group Chief Digital Officer and Managing Executive Officer at Tokio Marine: “To assess collision damages, the company uses an AI-based computer vision solution to analyze photos from accident scenes.” Comparing these with what he describes as “thousands or even millions” of photos of past analogous incidents, the model produces liability assessments of the parties involved and projects anticipated repair costs. AI has also provided the company with tangible benefits in online sales, especially in personalized product recommendations and contract writing, according to Namatame.

Read the case study in the [MIT CIO vision 2025 report](https://www.databricks.com/resources/whitepaper/mit-cio-vision-2025).

-----

**KEY USE CASE**

## Dynamic pricing and underwriting

**Overview**

In modernized insurance platforms, there is a growing trend toward personalized approaches, where insurance carriers utilize metrics from trip summaries to inform pricing strategies for individuals based on their behavior. This involves leveraging unstructured and streaming data, including IoT telematics driver data, weather information, geolocation, traffic patterns and crash history. The Lakehouse platform is well suited for these new use cases as it offers native support for streaming, making it easy for insurance carriers to incrementally ingest data.

**Business problem**

Actuaries are spending valuable time on low-value activities, which hampers agility and advanced analytical capabilities in pricing and underwriting, hindering improvements in risk and pricing modeling.
**Solution/value with Databricks**

**•** Unified cloud-native platform
**•** Scalability for ingesting IoT data from millions of trips, expanding the customer base
**•** Reduced total cost of ownership compared to legacy Hadoop systems
**•** Usage-based pricing, leading to lower premiums for customers and reduced risk for insurance carriers, thereby lowering loss ratios
**•** Enables the creation of a digitally enabled, end-to-end underwriting experience

**Business outcomes and benefits**

**•** Improve competitive position
**•** Decrease combined ratio
**•** 15% improvement in efficiencies

**CUSTOMER CASE STUDY**

**American financial services mutual organization**

This organization aimed to leverage the vast amounts of structured and unstructured data it collected to enhance its underwriting and decision-making processes, enabling greater efficiency and effectiveness. However, the company’s legacy infrastructure struggled to scale with the increasing data volume and processing demands, limiting its ability to analyze the data and derive actionable insights. With Databricks, the insurer centralized everything on one unified Lakehouse platform, supporting all operational and analytical use cases. This allowed them to analyze broader sets of data for superior underwriting performance and create a digitally empowered, end-to-end underwriting experience.

-----

**KEY USE CASE**

## Anomaly detection and fraudulent claims

**Overview**

Fraud continues to grow at a rapid rate, posing a threat to the revenue and growth of companies. For example, American consumers reported losing more than $5.8 billion to fraud in 2021, a 70% increase from $3.4 billion in 2020, according to the Federal Trade Commission. The insurance industry is undergoing transformational change to support new channels and services, offering transactional features and facilitating payments through digital channels to remain competitive. However, the speed and convenience of these capabilities benefit both consumers and fraudsters. Building a fraud framework requires more than just highly accurate machine learning models. It often involves a complex decision science process that combines a rules engine with a robust and scalable machine learning platform.

**Business problem**

Insurers need the ability to identify fraudulent activity and respond to new suspicious trends in near real-time.

**Solution/value with Databricks**

Modernized approaches in insurance require full digital transformation, including the adoption of usage-based pricing to reduce premiums. Insurance providers now consume data from the largest mobile telematics providers (e.g., CMT) to obtain granular sensor and trip summaries for users of online insurance applications. This data is crucial not only for pricing but also for underwriting scenarios to mitigate risks for carriers.

**CUSTOMER CASE STUDY**

**One of the largest U.S. insurance companies and a leading small business insurer**

The increasing availability of data and market competition challenge insurance providers to offer better pricing to their customers. This U.S.-based insurer, with hundreds of millions of insurance records to analyze for downstream ML, realized that their legacy batch analysis process was slow and inaccurate, providing limited insight for predicting the frequency and severity of claims. With Databricks, they were able to scale up the use of deep learning models, resulting in more accurate pricing predictions and increased revenue from claims. By leveraging Databricks Lakehouse, they harmonized data, analytics and AI at scale, enabling accurate pricing predictions and supporting various use cases from vehicle telematics to actuarial modeling.
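The fraud overview in this use case describes a decision process that combines a rules engine with a machine learning model. A minimal Python sketch of that idea follows. Everything in it is an assumption made for illustration: the `Claim` fields, the three rules, the thresholds and the `model_score` heuristic are hypothetical and do not come from the ebook or from any Databricks API. In practice the score would come from a trained model rather than a hand-written function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record; the fields are illustrative only."""
    claim_id: str
    amount: float
    days_since_policy_start: int
    prior_claims: int


# Each rule is a (name, predicate) pair evaluated by the rules engine.
# The rules and their cut-offs are made up for this sketch.
RULES: List[Tuple[str, Callable[[Claim], bool]]] = [
    ("large_amount", lambda c: c.amount > 50_000),
    ("new_policy", lambda c: c.days_since_policy_start < 30),
    ("frequent_claimant", lambda c: c.prior_claims >= 3),
]


def model_score(claim: Claim) -> float:
    """Stand-in for a trained model's fraud probability in [0, 1].

    A real system would call a trained model here; this hand-written
    heuristic only keeps the sketch self-contained.
    """
    score = 0.05
    if claim.amount > 20_000:
        score += 0.45
    if claim.days_since_policy_start < 60:
        score += 0.30
    score += 0.10 * min(claim.prior_claims, 2)
    return min(score, 1.0)


def triage(claim: Claim) -> Dict[str, object]:
    """Combine rule hits and the model score into one routing decision."""
    fired = [name for name, predicate in RULES if predicate(claim)]
    score = model_score(claim)
    if len(fired) >= 2 or score >= 0.75:
        decision = "investigate"
    elif fired or score >= 0.40:
        decision = "review"
    else:
        decision = "approve"
    return {
        "claim_id": claim.claim_id,
        "rules_fired": fired,
        "score": score,
        "decision": decision,
    }


if __name__ == "__main__":
    for example in [
        Claim("A", 1_000, 400, 0),
        Claim("B", 60_000, 10, 0),
        Claim("C", 25_000, 400, 0),
    ]:
        print(triage(example))
```

Keeping the rules separate from the score means a newly observed suspicious pattern can be added as a rule straight away, while the model is retrained on its own schedule.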
**$1 of fraud costs companies 3.36x in chargeback, replacement and operational costs**

[Lexis Nexis](https://risk.lexisnexis.com/insights-resources/research/2020-true-cost-of-fraud-retail)

-----

**KEY USE CASE**

## Customer 360 and hyper-personalization

**Overview**

Winning the hearts and minds of your customers starts with personalizing the user experience. The ability to offer complementary products to meet the needs of your customers lets you build deeper relationships with them and engender their loyalty. In addition, a better understanding of the finer details within accounts allows you to offer them more personalized products. To do this, you need 360-degree customer views, which requires you to locate and consolidate all your customers’ contact data from every digital tool that you use and house it in one central location. With Databricks Lakehouse, insurers can “hyper-personalize,” increase cross-sell/upsell opportunities, enhance customer 360 and bring new products to market faster.

**Business problem**

The inability to reconcile customer records across different lines of business limits real-time customer insights necessary for upselling and cross-selling. Siloed data makes it challenging to create accurate and comprehensive customer profiles, resulting in suboptimal recommendations for the next best action.

**Solution/value with Databricks**

Databricks provides the tools needed to process large volumes of data and determine the next best action at any point in the customer journey.
**•** Eliminates data silos by unifying all customer data, including basic information, transactional data, online behavior/purchase history, etc., to create complete customer profiles
**•** Integrated data security ensures that security measures are incorporated at every layer of the Databricks Lakehouse Platform
**•** Delta improves data quality, providing a single source of truth for real-time streams and ensuring reliable and high-quality data for data teams
**•** Integrated ML and AI capabilities utilize AI to create self-optimizing ML models that determine the next best step for each customer
**•** MLflow model lifecycle management helps manage the entire machine learning lifecycle reliably, securely and at scale

**Business outcomes and benefits**

**•** Use AI, ML, automation and real-time data to gain deeper customer insights and understand their needs
**•** Improve competitive positioning
**•** Enhance the customer experience

**CUSTOMER CASE STUDY**

**160-year-old U.S. insurance company**

This insurance provider underwent a significant digital transformation to provide a more personalized financial services experience to its 10,000 advisors and millions of customers across various touchpoints. Recognizing the importance of becoming data-driven, the company leveraged Databricks in its client 360 platform to aggregate transactional and behavioral data, along with core attributes, providing business users with next-best-action recommendations for seamless customer engagement.

-----

## Global Regulatory Impact in Insurance

**Navigating global regulations with technical implementation**

Digital innovation continues to reshape the insurance sector. The pace and scale of technological change are likely to increase due to factors such as artificial intelligence (AI), cloud computing, and the entry of new players like insurtechs, e-tailers, and manufacturers from outside the insurance industry.
To succeed and thrive in today’s economic environment, insurers should prioritize upgrading their infrastructure and technology, rather than solely focusing on transforming operations. For example, migrating from on-premises systems to the cloud can bring significant benefits, according to global consultancy [Deloitte](https://www2.deloitte.com/content/dam/insights/articles/us175368_cfs_fsi-outlook-insurance/DI_US175368_CFS_FSI-Outlook-Insurance.pdf).

As insurers upgrade their compliance processes to meet new global regulations, such as IFRS 17 and LDTI, the impact of regulatory updates becomes more complex for organizations operating across multiple jurisdictions. Instead of merely responding to regulatory and industry requirements, insurance companies should make data-focused investments that help them anticipate and meet the expectations of distributors and policyholders.

**IFRS 17**

IFRS 17 is an International Financial Reporting Standard (IFRS) for insurance contracts. IFRS 17 aims to standardize insurance accounting by providing consistent principles for all facets of accounting for insurance contracts. IFRS 17 removes existing inconsistencies so analysts, investors and others can more easily compare companies, contracts and industries.

**LDTI for long-duration contracts**

The Financial Accounting Standards Board’s long-duration targeted improvements (LDTI) introduced changes to the U.S. GAAP accounting model to simplify and improve the financial reporting of long-duration contracts, including providing financial statement users with more timely and relevant information about those contracts.

It is crucial for insurers to redirect their focus toward developing advanced data management and utilization capabilities that offer better insights and improved performance.
These investments serve not only as a foundation for regulatory compliance but also as a starting point for more comprehensive and proactive transformation initiatives.

-----

**INDUSTRY SOLUTIONS**

## Get Started With Accelerators, Brickbuilders and Enablers

Insurance Solution Accelerators and enablers are pre-built collateral to help customers rapidly develop and deploy technical capabilities to accelerate value.

**Adoption challenges**

Numerous challenges hinder organizations from developing and implementing the necessary technical solutions to enhance their operational effectiveness, increase revenue, and stay competitive. These challenges include:

**•** Lack of technical skills (data scientists/data engineers): Companies often struggle to find employees proficient in Python or Scala, or individuals who possess extensive experience in data science.
**•** Business problems require in-depth data science and industry knowledge: Businesses seek solutions tailored to address specific problems, rather than generic technical features.
**•** Companies seek actionable insights: Organizations prefer readily applicable patterns that can be quickly implemented, rather than custom data science solutions that come with potential costs and risks of implementation failure.

**What are accelerators/enablers?**

**Solution Accelerators**

Save hours on discovery, design, development and testing with Databricks Solution Accelerators. Our purpose-built guides, including fully functional notebooks and best practices, expedite results for your most common and high-impact use cases. With these accelerators, you can go from idea to proof of concept (PoC) in as little as two weeks.

**Brickbuilders**

Brickbuilder Solutions are data and AI solutions designed by leading consulting companies to address industry-specific business requirements.
Because these solutions are built on the Databricks Lakehouse Platform and backed by the industry experience of these consultancies, businesses can have confidence that they are tailored to their specific use cases. Brickbuilder Solutions can be implemented at any stage of the customer journey.

**Solution Enablers**

Solution enablers consist of targeted collections of notebooks and materials, such as webinars and blog posts, designed to support larger solutions. They aim to solve pain points or address specific layers of business capabilities, such as resolving data ingestion challenges.

-----

## Get Started With Industry Solutions

**Claims transformation: automation and fraud prevention**

Insurers are entering a new era of claims transformation, supported by evolving technological advancements and growing data availability. The end-to-end claims process, from extracting relevant information from documentation submitted when filing a claim to triaging and routing claims and the underwriting process, is ripe for digital transformation. By leveraging the Databricks Lakehouse, organizations can handle millions of data points coming in different formats and time frames, from various sources, at an unprecedented volume. Every touchpoint in the claims journey, starting even before an incident occurs, will be supported by a combination of technology and human intervention that seamlessly expedites the process. Personalizing the claims experience by anticipating needs, providing real-time status alerts, and reducing friction in the process increases customer loyalty and retention.

**Customer/Partner Successes**

**Accelerate underwriting through collaboration and efficient ML**

A leading P&C insurer took full advantage of the MongoDB and Databricks integration, leveraging both platforms to foster collaboration between their data and developer teams. The integration provides a more natural development experience for Spark users and exposes all of Spark’s libraries.
This allows MongoDB data to be materialized as DataFrames and data sets for analysis using machine learning, graph, streaming and SQL APIs. The insurer also benefits from automatic schema inference. With this integration, the insurer was able to train and observe their ML models (MongoDB Atlas Charts) more efficiently and incorporate them into business applications. As a result, crucial underwriting processes that previously took days are now executed in seconds. In addition to the time and cost savings, the company can provide a more immediate response to customers within its digital experience platform.

**“Claims processing is the process whereby an insurer receives, verifies and processes a claim report submitted by a policyholder. It accounts for [70% of a property insurer’s expenses](https://www2.deloitte.com/us/en/insights/industry/financial-services/insurance-claims-transformation.html) and is a critical component of customer satisfaction with their carrier.”**

**Deloitte,** [“Preserving the human touch in insurance claims transformations”](https://www2.deloitte.com/us/en/insights/industry/financial-services/insurance-claims-transformation.html)

**Learn more:**

**•** **[FRAUD DETECTION](https://notebooks.databricks.com/notebooks/FSI/geospatial_analysis/index.html#geospatial_analysis_1-0.html)**
**•** **[CLAIMS AUTOMATION ENABLER](https://www.databricks.com/blog/2023/02/01/design-patterns-batch-processing-financial-services.html)**
**•** **[CAR CLAIMS IMAGE CLASSIFICATION](https://github.com/databricks-industry-solutions/car-classification)**
**•** **[SMART CLAIMS: CLAIMS AUTOMATION](https://www.databricks.com/blog/2023/04/03/claims-automation-databricks-lakehouse.html)**

**Watch video:**

**•** [Laying the Foundation for Claims Automation](https://www.youtube.com/watch?v=LkckhRjezxs)

-----

**Risk
management:** **dynamic pricing and underwriting** Modernized approaches at insurance carriers require a full digital transformation, and one aspect of this transformation involves dynamic pricing and underwriting to reduce premiums. Insurance providers are now consuming data from the largest mobile telematics providers to obtain the most granular sensor and trip summaries for users of online insurance applications. Not only is this data critical for pricing, but it is also critical for underwriting scenarios to de-risk carriers. Dynamic pricing and underwriting automate routine tasks and provide teams with alternative data sources to empower actuarial and underwriting professionals to become “exponential.” This allows teams to focus on key aspects of risk selection and analysis that drive competitive advantage and market differentiation. By leveraging personalized data points, insurers can deliver near real-time underwriting decisions for life insurance applicants, reducing policy abandonment and costs. **Customer/Partner Successes** **Automated extraction of medical risk factors for life insurance underwriting** **(John Snow Labs)** Life insurance underwriting considers an applicant’s medical risk factors in addition to mortality risk characteristics. These risk factors are often found in free-text documents. New insurance-specific natural language processing (NLP) models can automatically extract relevant medical history and risk factors from such documents. Forward-thinking companies are embracing accelerated underwriting, which utilizes new data along with algorithmic tools and modeling techniques to quickly assess and group applicants without requiring bodily fluids, physician’s notes, and so on. This joint Solution Accelerator from Databricks and John Snow Labs simplifies the implementation of this approach, creating a faster, more consistent, and scalable underwriting experience. 
**“Risk is highly influenced by behavior, and 80% of morbidity in healthcare risk is driven by factors such as smoking, drinking alcohol, physical activity and diet. In the case of driving, 60% of fatal accidents are a result of behavior alone. If insurers can change customer behaviors and help them make better choices, then the risk curve shifts.”**

**Accenture Insurance Blog,** “Discovery – a holistic, ongoing innovation story”

**Learn more:**

**•** **[RISK MANAGEMENT](https://www.databricks.com/solutions/accelerators/market-risk)**
**•** **[ACTUARIAL WORKBENCH](https://github.com/koernigo/databricksActuarialWorkbench)**
**•** **[UNDERWRITING AUTOMATION](https://www.mongodb.com/blog/post/building-digital-data-pipelines-transforming-underwriting-usage-based-insurance-mongodb)**
**•** **[LIFE INSURANCE UNDERWRITING WITH NATURAL LANGUAGE PROCESSING](https://www.nlpsummit.org/automated-extraction-of-medical-risk-factors-for-life-insurance-underwriting/)**

**Watch video:**

**•** [Automated Extraction of Medical Risk Factors for Life Insurance Underwriting](https://www.youtube.com/watch?v=UI4-7JkC2eE&list=PL5zieHHAlvAqSu9qmBWIFL_TRVYooTS_d)

-----

**Product distribution: segmentation and personalization**

The most forward-thinking and data-driven insurers are focused on achieving personalization at scale.
They are exploring new partnerships and business models to create integrated, value-added experiences that prioritize the overall health and financial wellness of their customers, rather than just their insurance needs. These insurers are investing in new data sources, analytics platforms, and artificial intelligence (AI)-powered decision engines that enable them to connect producers with like-minded customers or engage customers with enticing offers and actionable steps based on their previous choices. The outcome is more efficient and effective service from producers, trusted and convenient interactions for consumers, and increased customer engagement and growth for insurers in an increasingly digital-oriented world.

**Customer/Partner Successes**

**[Persona 360: Financial Customer Data Platform (DataSentics)](https://www.databricks.com/company/partners/consulting-and-si/partner-solutions/datasentics-persona360)**

[Persona 360](https://datasentics.com/product-persona360-for-data-scientists), developed by DataSentics, an Atos company, is specifically designed for retail banks and insurance companies. It enables them to complete, unify and comprehensively capture customer profiles using a smart data model. Built on the Databricks Lakehouse Platform and available on multiple clouds, Persona 360 enhances basic profile information with insights derived from digital behavior and unstructured data, such as call center recordings. By utilizing Persona 360, you can leverage pre-built banking and insurance customer 360° data models and access more than 1,500 attributes to gain a deeper understanding of customer segments.
With Persona 360, you can:

**•** Access pre-built insurance-specific customer 360° data models and AI segmentation, consisting of 1,695+ attributes and segments
**•** Seamlessly connect the workflows of data scientists (via Databricks) and marketing specialists (via Persona 360), making it easy for data experts to incorporate their findings and enabling nontechnical users to comprehend and activate the data
**•** Leverage tools that can increase engagement by 37% and conversion rates by 45% through personalized campaigns

**Demand for hyper-personalized and real-time risk protection requires broad adoption of artificial intelligence (AI), machine learning and digital platforms.**

**EY,** [“Nine customer types defining the next wave of insurance”](https://www.ey.com/en_us/insurance/nine-customer-types-defining-the-next-wave-of-insurance)

**Learn more:**

**•** **[NEXT BEST OFFER](https://www.databricks.com/solutions/accelerators/recommendation-engines)**
**•** **[CUSTOMER LIFETIME VALUE (CLTV)](https://www.databricks.com/solutions/accelerators/customer-lifetime-value)**
**•** **[CUSTOMER SEGMENTATION](https://www.databricks.com/solutions/accelerators/customer-segmentation)**
**•** **[REPUTATION MANAGEMENT](https://www.databricks.com/solutions/accelerators/reputation-risk)**
**•** **[CHURN PREDICTION](https://www.databricks.com/solutions/accelerators/retention-management)**

**Watch video:**

**•** [The Impact of Analytics and AI on the Future of Insurance](https://www.youtube.com/watch?v=7qZ14bGip5g&t=3s)

-----

**Summary and applicability of Solution Accelerators based on insurance provider type / Solution Accelerator matrix by insurance provider type**

|Product distribution|Consumer Lines (Auto/Home/Personal Lines)|Commercial Lines|Life Insurance|Reinsurance|
|---|---|---|---|---|
|**Personalization:** Given the volume of data required, the complexity of operating AI from experiments (POCs) to enterprise scale data pipelines, combined with strict data and privacy regulations on the use of customer data on cloud infrastructure, the Lakehouse has quickly emerged as the strategic platform to accelerate digital transformation.|||||
|**Next best offer:** Customers have different needs at each stage of the buyer journey. Choose the right recommender model for your scenario to find the next best action at any given point in the customer journey.|||||
|**Customer lifetime value:** Analyzing customer lifetime value is critical to improving marketing decision-making, campaign ROI and customer retention. Learn how to identify your most valuable customers with Databricks’ Customer Lifetime Value Solution Accelerator.|||||
|**Churn prediction:** Earning loyalty and getting the largest number of customers to stick around is something that is in your best interest as well as your customers’ best interest. Develop an understanding of how a customer lifetime should progress and examine where in that lifetime journey customers are likely to churn so you can effectively manage retention and reduce your churn rate.|||||
|**Customer segmentation:** Personalization is touted as the gold standard of customer engagement. Using sales data, campaigns and promotions systems, this solution helps you create advanced customer segments to drive better purchasing predictions based on behaviors.|||||
|**Reputation management:** Harness the Databricks Lakehouse Platform to build a risk engine that can analyze customer feedback securely and in real time to power an early assessment of reputation risks.|||||

-----

|Anomaly detection and fraudulent claims|Consumer Lines (Auto/Home/Personal Lines)|Commercial Lines|Life Insurance|Reinsurance|
|---|---|---|---|---|
|**Anomaly detection:** Anomaly detection is the technique of identifying rare events or observations which can raise suspicions by being statistically different from the rest of the observations.|||||
|**Fraudulent claims:** A large-scale fraud prevention system is usually a complex ecosystem made of various controls (all with critical SLAs), a mix of traditional rules and AI and a patchwork of technologies between proprietary on-premises systems and open source cloud technologies.|||||

|Risk management|Consumer Lines (Auto/Home/Personal Lines)|Commercial Lines|Life Insurance|Reinsurance|
|---|---|---|---|---|
|**Risk management:** Adopt a more agile approach to risk management, including actuarial and underwriting intelligence by unifying data and AI in the Lakehouse.|||||
|**Underwriting automation:** Machine learning provides a decision support system for underwriting processes to help you improve your underwriting outcomes.|||||
|**Actuarial workbench:** You can use the Databricks Lakehouse Platform to automate actuarial models and leverage Machine Learning (ML) for underwriting, claims forecasting, etc.|||||

-----

|Claims transformation|Consumer Lines (Auto/Home/Personal Lines)|Commercial Lines|Life Insurance|Reinsurance|
|---|---|---|---|---|
|**Anomaly detection and claims fraud:** Preempt fraud with rule-based patterns and select ML algorithms for reliable fraud detection. Use anomaly detection and fraud prediction to respond to bad actors rapidly.|||||
|**Car claims image classification:** By applying transfer learning on pre-trained neural networks, Databricks helps insurance companies kickstart their AI/computer vision journeys toward claim assessment and damage estimation.|||||
|**Claims automation:** Insurers are entering a new era of claims transformation, supported by evolving technological advancement and growing data availability. You can simplify and scale your claims lifecycle with data and AI.|||||
|**Medical claims:** Using advanced natural language processing, you can extract text from medical records and enable automation.|||||
|**Guidewire claims center data integration:** Data ingestion enabler for distributed ledger technology that has predefined schemas and mapping to/from Guidewire data format.|||||

-----

## Conclusion

Today, data and AI are at the center of every innovation in the insurance industry. Databricks Lakehouse for Insurance empowers insurance providers to leverage the potential of data and analytics to address strategic challenges, make informed decisions, mitigate risks, enhance customer experiences, and accelerate innovation.

**Customers that innovate with Databricks Lakehouse for Insurance**

Some of the top property and casualty, life and health insurance companies and reinsurers in the world turn to Databricks Lakehouse to harness the power of data and analytics to solve strategic challenges and make smarter decisions that minimize risk, deliver superior customer experiences and fast-track innovation.

-----

## About Databricks

Databricks is the data and AI company. More than 9,000 organizations worldwide, including Comcast, Condé Nast and over 50% of the Fortune 500, rely on the Databricks Lakehouse Platform to unify their data, analytics and AI.
Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), LinkedIn and [Facebook](https://www.facebook.com/databricksinc).

#### Begin your journey with a free trial of Databricks Lakehouse for Insurance and start developing advanced data and AI applications today

**[START YOUR FREE TRIAL](https://databricks.com/try-databricks?itm_data=Homepage-HeroCTA-Trial)**

###### Contact us for a personalized demo at: dbricks.com/contact

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks_ebook_insurance_v10.pdf 2024-09-19T16:57:21Z

TECHNICAL GUIDE

# Solving Common Data Challenges

#### Startups and Digital Native Businesses

-----

### Table of Contents

**01** CHALLENGE: Creating a unified data architecture for data quality, governance and efficiency

**02** CHALLENGE: Building a data architecture to support scale and performance

**03** CHALLENGE: Building effective machine learning operations

**04** SUMMARY: The Databricks Lakehouse Platform addresses these challenges

-----

**INTRODUCTION**

This guide shares how the lakehouse architecture can increase productivity and cost-efficiently support all your data, analytics and AI workloads, and flexibly scale with the pace of growth for your company. Read the entire guide or dive straight into a specific challenge.

With the advent of cloud infrastructure, a new generation of startups has rapidly built and scaled their businesses. The use of cloud infrastructure, once seen as innovative, has now become table stakes. The differentiator for the fastest-moving startups and digital natives now comes from the effective use of data at scale, primarily analytics and AI.
Digital natives (fast-moving, lean, technically savvy, born-in-the-cloud organizations) are beginning to focus on new data-driven use cases such as real-time machine learning and personalized customer experiences. To pursue these new data-intensive use cases and initiatives, organizations must look beyond the technologies that delivered them to this point in time. Over time, these technologies, such as transactional databases, streaming/batch pipelines and first-generation analytics engines, have led to brittle systems that are not cost-efficient and require time-consuming administration and engineering toil. In addition to growing maintenance needs, data is often stored in disparate locations and formats, with little or no governance, making real-time use cases, analytics and AI difficult or impossible.

This guide examines some of the biggest data challenges and solutions for startups and for scaling digital native businesses that have reached the point where an end-to-end modern data platform is a smart investment. Some key considerations include:

**Consolidating on a unified data platform**

As mentioned above, siloed data storage and management add administrative and financial cost. You can benefit significantly when you unify your data in one location with a flexible architecture that scales with your needs and delivers performance for future success. For this, you will want an open platform that supports all your data including batch and streaming workloads, data analytics and machine learning. With data unification, you create a more efficient, integrated approach to ingesting, cleaning and organizing your data. You also need automation to make data analysis easier for the nontechnical users in the company. But broader data access also means more focus on security, privacy, compliance and access control, which can create overhead for a growing company.
**Scaling up capacity and increasing performance and usability of the data solutions**

Data teams at growing digital native organizations find it time intensive and costly to handle the growing volume and velocity of their data being ingested from multiple sources, across multiple clouds. You now need a unified and simplified platform that can instantly scale up capacity and deliver more computing power on demand to free up your data teams to produce outputs more quickly. This lowers the total cost for the overall infrastructure by eliminating redundant licensing, infrastructure and administration costs.

**Building effective machine learning operations**

For data teams beginning their machine learning journeys, training models can become increasingly complex to manage. Many teams with disparate coding needs for the entire model lifecycle suffer inefficiencies from transferring data and code across many separate services. To build and manage effective ML operations, consider an end-to-end MLOps environment that brings all data together in one place and incorporates managed services for experiment tracking, model training, feature development and feature and model serving.

-----

# 01

CHALLENGE:

## Create a unified data architecture for data quality, governance and efficiency

-----

As cloud-born companies grow, data volumes rapidly increase, leading to new challenges and use cases. Among the challenges:

Application stacks optimized for transaction use cases aren’t able to handle the volume, velocity and variety of data that modern data teams require. For example, this leads to query performance issues as data volume grows.

Data silos develop as each team within an organization chooses different ETL/ELT and storage solutions for their needs.
As the organization grows and changes, these pipelines and storage solutions become brittle, hard to maintain and nearly impossible to integrate. These data silos lead to discoverability, integration and access issues, which prevent teams from leveraging the full value of the organization’s available data.

Data governance is hard. Disparate ETL/ELT and storage solutions lead to governance, compliance, auditability and access control challenges, which expose organizations to tremendous risk.

The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing and maintaining data solutions at scale. It integrates with cloud storage and the security in your cloud account, and manages and deploys cloud infrastructure on your behalf. Your data practitioners no longer need separate storage systems for their data. And you don’t have to rely on your cloud provider for security. The lakehouse has its own robust security built into the platform.

For all the reasons above, the most consistent advice from successful data practitioners is to create a “single source of truth” by unifying all data on a single platform. With the Databricks Lakehouse Platform, you can unify all your data on one platform, reducing data infrastructure and compute costs. You don’t need excess data copies and you can retire expensive legacy infrastructure.

-----

CUSTOMER STORY: GRAMMARLY

### Helping 30 million people and 50,000 teams communicate more effectively

While its business is based on analytics, [Grammarly](http://www.grammarly.com) for many years relied on a homegrown analytics platform to drive its AI writing assistant to help users improve multiple aspects of written communications. As teams developed their own requirements, data silos inevitably emerged as different business areas implemented analytics tools individually.
“Every team decided to solve their analytics needs in the best way they saw fit,” said Chris Locklin, Engineering Manager, Data Platforms, at Grammarly. “That created challenges in consistency and knowing which data set was correct.” To better scale and improve data storage and query capabilities, Grammarly brought all its analytical data into the Databricks Lakehouse Platform and created a central hub for all data producers and consumers across the company. Grammarly had several goals with the lakehouse, including better access control, security, ingestion flexibility, reducing costs and fueling collaboration. “Access control in a distributed file system is difficult, and it only gets more complicated as you ingest more data sources,” said Locklin. To manage access control, enable end-to-end observability and monitor data quality, Grammarly relies on the data lineage capabilities within Unity Catalog. “Data lineage allows us to effectively monitor usage of our data and ensure it upholds the standards we set as a data platform team,” said Locklin. “Lineage is the last crucial piece for access control.” Data analysts within Grammarly now have a consolidated interface for analytics, which leads to a single source of truth and confidence in the accuracy and availability of all data managed by the data platform team. Having a consistent data source across the company also resulted in greater speed and efficiency and reduced costs. Data practitioners experienced 110% faster querying at 10% of the cost to ingest compared to a data warehouse. Grammarly can now make its 5 billion daily events available for analytics in under 15 minutes rather than 4 hours. Migrating off its rigid legacy infrastructure gave Grammarly the flexibility to do more and the confidence that the platform will evolve with its needs. 
Grammarly is now able to sustain a flexible, scalable and highly secure analytics platform that helps 30 million people and 50,000 teams worldwide write more effectively every day.

[Read the full story here.](https://www.databricks.com/customers/grammarly)

-----

###### How to unify the data infrastructure with Databricks

The [Databricks Lakehouse Platform](https://docs.databricks.com/lakehouse/index.html) architecture is composed of two primary parts:

- The infrastructure to deploy, configure and manage the platform and services
- The customer-owned infrastructure managed in collaboration by Databricks and the customer

You can build a Databricks workspace by configuring secure integrations between the Databricks platform and your cloud account, and then Databricks deploys temporary Apache Spark™/Photon clusters using cloud resources in your account to process and store data in object storage and other integrated services you control.

Here are three steps to get started with the Databricks Lakehouse Platform:

**Understand the architecture**

The lakehouse provides a unified architecture, meaning that all data is stored in the same accessible place. The diagram shows how data comes in from sources like a customer relationship management (CRM) system, an enterprise resource planning (ERP) system, websites or unstructured customer emails.

**Optimize the storage layer**

All data is stored in cloud storage, and Databricks provides tooling to assist with ingestion, such as Auto Loader. We recommend [open-source](https://delta.io/) [Delta Lake](https://docs.databricks.com/delta/index.html) as the storage format of choice. Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform. Having all your data in the same optimized, open storage keeps all your use cases in the same place, thus enabling collaboration and removing software tool overhead.
The lakehouse handles all varieties of data (structured, semi-structured, unstructured), as well as all velocities of data (streaming, batch or somewhere in the middle).

[Sign up for a free trial](https://www.databricks.com/try-databricks#account) account with the instructions on the [get started page.](https://docs.databricks.com/getting-started/index.html)

-----

The Databricks Lakehouse organizes data stored with Delta Lake in cloud object storage with familiar concepts like database, tables and views. Delta Lake extends Parquet data files with a file-based transaction log for [ACID transactions](https://docs.databricks.com/lakehouse/acid.html) and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations to provide incremental processing at scale. This model combines many of the benefits of a data warehouse with the scalability and flexibility of a data lake.

To learn more about the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform, see [Getting started with Delta Lake](https://docs.databricks.com/getting-started/delta.html).

The first step in unifying your data architecture is setting up how data is to be accessed and used across the organization. We’ll discuss this as a series of steps:

**1** Set up governance with Unity Catalog

**2** Grant secure access to the data

**3** Capture audit logs

**4** View data lineage

**5** Set up data sharing

###### “Delta Lake provides us with a single source of truth for all of our data,” said Stone. “Now our data engineers are able to build reliable data pipelines that thread the needle on key topics, such as inventory management, allowing us to identify in near real-time what our trends are so we can figure out how to effectively move inventory.”

– Jake Stone, Senior Manager, Business Analytics at ButcherBox

[Learn more](https://www.databricks.com/blog/2022/02/07/how-butcherbox-uses-data-insights-to-provide-quality-food-tailored-to-each-customers-unique-taste.html)

-----

**Configure unified governance**

Databricks recommends using catalogs to provide an easily searchable inventory of data, notebooks, dashboards and models. Often this means that catalogs can correspond to software development environment scope, team or business unit. [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/get-started.html) manages how data is secured, accessed and shared. Unity Catalog offers a single place to administer data access policies that apply across all workspaces and personas, and it automatically captures user-level audit logs that record access to your data.

Data stewards can securely grant access to a broad set of users to discover and analyze data at scale. These users can use a variety of languages and tools, including SQL and Python, to create derivative data sets, models and dashboards that can be shared across teams.

To set up Unity Catalog for your organization, you do the following:

**1** Configure an S3 bucket and IAM role that Unity Catalog can use to store and access data in your AWS account.

**2** Create a metastore for each region in which your organization operates, and attach workspaces to the metastore. Each workspace will have the same view of the data you manage in Unity Catalog.

**3** If you have a new account, add users, groups and service principals to your Databricks account.

**4** Next, create and grant access to catalogs, schemas and tables.
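As a rough illustration of step 4, the sketch below builds the SQL statements that would create a catalog and schema and grant a team access to them. The catalog, schema and group names (`main`, `sales`, `sales-team`) and the helper names are hypothetical, and the privilege names follow current Unity Catalog SQL as we understand it; check them against the setup instructions before use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TeamSchema:
    """Hypothetical description of one team's schema inside a catalog."""
    catalog: str
    schema: str
    group: str


def quote(identifier: str) -> str:
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + identifier.replace("`", "``") + "`"


def setup_statements(team: TeamSchema) -> list[str]:
    """Build the SQL needed to create the objects and grant access to the group."""
    cat, sch, grp = quote(team.catalog), quote(team.schema), quote(team.group)
    return [
        f"CREATE CATALOG IF NOT EXISTS {cat}",
        f"CREATE SCHEMA IF NOT EXISTS {cat}.{sch}",
        f"GRANT USE CATALOG ON CATALOG {cat} TO {grp}",
        f"GRANT USE SCHEMA, CREATE TABLE ON SCHEMA {cat}.{sch} TO {grp}",
    ]


statements = setup_statements(TeamSchema("main", "sales", "sales-team"))
for stmt in statements:
    # In a Databricks notebook each statement could be passed to spark.sql(stmt).
    print(stmt)
```

Generating the statements as plain strings keeps the sketch runnable anywhere. In a workspace attached to a Unity Catalog metastore you would run them as a metastore admin or object owner.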
For complete setup instructions, see [Get started using Unity Catalog.](https://docs.databricks.com/data-governance/unity-catalog/get-started.html#:~:text=To%20enable%20your%20Databricks%20account%20to%20use%20Unity,Transfer%20your%20metastore%20admin%20role%20to%20a%20group.)

-----

###### How Unity Catalog works

You will notice that the hierarchy of primary data objects in Unity Catalog flows from metastore to table:

**Metastore** is the top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data.

**Catalog** is the first layer of the object hierarchy, used to organize your data assets.

**Schemas**, also known as databases, are the second layer of the object hierarchy and contain tables and views.

**Table** is the lowest level in the object hierarchy, and tables can be external (stored in external locations in your cloud storage of choice) or managed (stored in a storage container in your cloud storage that you create expressly for Databricks). You can also create read-only **Views** from tables.

[Diagram: Unity Catalog object hierarchy, from Metastore to Catalog to Schemas, which contain Views, Managed Tables and External Tables]

The diagram below represents the file system hierarchy of a single storage bucket:

-----

Unity Catalog uses the identities in the Databricks account to resolve users, service principals, and groups and to enforce permissions. To configure identities in the account, follow the instructions in [Manage users, service principals, and groups](https://docs.databricks.com/administration-guide/users-groups/index.html). Refer to those users, service principals, and groups when you create [access-control policies](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html) in Unity Catalog.
Unity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Data Explorer or a REST API command. The assignment of users, service principals, and groups to workspaces is called identity federation. All workspaces attached to a Unity Catalog metastore are enabled for identity federation. Securable objects in Unity Catalog are hierarchical, meaning that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For more on granting privileges, see the [Inheritance model](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html#inheritance) . A common scenario is to set up a schema per team where only that team has USE SCHEMA and CREATE on the schema. This means that any tables produced by team members can only be shared within the team. Data Explorer uses the privileges configured by Unity Catalog administrators to ensure that users are only able to see catalogs, databases, tables and views that they have permission to query. [Databricks Data Explorer](https://docs.databricks.com/data/index.html) is the main user interface for many Unity Catalog features. Use Data Explorer to view schema details, preview sample data, and see table details and properties. Administrators can view and change owners. Admins and data object owners can grant and revoke permissions through this interface. **Set up secure access** In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. 
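To make the downward inheritance concrete, here is a toy model of how a privilege granted on a schema reaches the tables inside it. This is only an illustration of the idea, not how Unity Catalog is implemented: the `grants` structure, the group names and the `has_privilege` helper are all invented for this sketch, and it ignores details such as the USE CATALOG and USE SCHEMA privileges that real queries also require.

```python
from __future__ import annotations


def ancestors(full_name: str) -> list[str]:
    """Return the securable and its parents, most specific first.

    "main.sales.orders" -> ["main.sales.orders", "main.sales", "main"]
    """
    parts = full_name.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


def has_privilege(grants: dict, principal: str, privilege: str, full_name: str) -> bool:
    """True if the privilege was granted on the object or on any parent."""
    return any(
        privilege in grants.get(name, {}).get(principal, set())
        for name in ancestors(full_name)
    )


# Hypothetical grants: securable name -> {principal: set of privileges}.
grants = {
    "main.sales": {"sales-team": {"SELECT"}},      # schema-level grant
    "main.hr.salaries": {"hr-team": {"SELECT"}},   # table-level grant
}

# Inherited from the schema, so it also covers tables created later.
print(has_privilege(grants, "sales-team", "SELECT", "main.sales.orders"))  # True
# Granted on one table only, so a sibling table is not covered.
print(has_privilege(grants, "hr-team", "SELECT", "main.hr.reviews"))       # False
# Secure by default: no grant anywhere in the hierarchy means no access.
print(has_privilege(grants, "sales-team", "SELECT", "main.hr.salaries"))   # False
```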
Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (schemas), tables and views. Privileges and metastores are shared across workspaces, allowing administrators to set secure permissions once against groups synced from identity providers and know that end users only have access to the proper data in any Databricks workspace they enter.

-----

CUSTOMER STORY: BUTCHERBOX

### How Butcherbox Uses Data Insights to Provide Quality Food Tailored to Each Customer’s Unique Taste

As a young e-commerce company, [ButcherBox](https://www.butcherbox.com/) has to be nimble as its customers’ needs change, which means it is constantly considering behavioral patterns, distribution center efficiency, a growing list of marketing and communication channels, and order processing systems.

The meat and seafood subscription company collects data on hundreds of thousands of subscribers. It deployed the Databricks Lakehouse Platform to gain visibility across its diverse range of data systems and enable its analytics team to securely view and export data in the formats needed.

With so much data feeding in from different sources, from email systems to its website, the data team at ButcherBox quickly discovered that data silos were a significant problem because they blocked complete visibility into critical insights needed to make strategic and marketing decisions.

“We knew we needed to migrate from our legacy data warehouse environment to a data analytics platform that would unify our data and make it easily accessible for quick analysis to improve supply chain operations, forecast demand and, most importantly, keep up with our growing customer base,” explained Jake Stone, Senior Manager, Business Analytics, at ButcherBox.

The platform allows analysts to share builds and iterate on a project without getting into the code. Querying a table of 18 billion rows would have been problematic with a traditional platform. With Databricks, ButcherBox can do it in three minutes.

“Delta Lake provides us with a single source of truth for all of our data,” said Stone. “Now our data engineers are able to build reliable data pipelines that thread the needle on key topics such as inventory management, allowing us to identify in near real-time what our trends are so we can figure out how to effectively move inventory.”

[Read the full story here.](https://www.databricks.com/blog/2022/02/07/how-butcherbox-uses-data-insights-to-provide-quality-food-tailored-to-each-customers-unique-taste.html)

-----

**Set up secure data sharing**

Databricks uses an open protocol called [Delta Sharing](https://docs.databricks.com/data-sharing/index.html) to share data with other entities regardless of their computing platforms. Delta Sharing is integrated with Unity Catalog. Your data must be registered with Unity Catalog to manage, govern, audit and track usage of the shared data on the Lakehouse Platform. The primary concepts of Delta Sharing are shares (read-only collections of tables and table partitions to be shared) and recipients (objects that associate an organization with a credential or secure sharing identifier).

As a data provider, you generate a token and share it securely with the recipient. They use the token to authenticate and get read access to the tables you’ve included in the shares you’ve given them access to. Recipients access the shared data in read-only format. Whenever the data provider updates data tables in their own Databricks account, the updates appear in near real-time in the recipient’s system.

**Capture audit logs**

Unity Catalog captures an audit log of actions performed against the metastore. To access audit logs for Unity Catalog events, you must enable and configure audit logs for your account.
Audit logs for each workspace and account-level activities are delivered to your account. See how to [configure audit logs](https://docs.databricks.com/data-governance/unity-catalog/audit.html) and create a dashboard to analyze audit log data.

**View data lineage**

You can use Unity Catalog to capture runtime data lineage across queries in any language executed on a Databricks cluster or SQL warehouse. Lineage can be visualized in Data Explorer in near real-time and retrieved with the Databricks REST API. Lineage is aggregated across all workspaces attached to Unity Catalog and captured down to the column level, and includes notebooks, workflows and dashboards related to the query. To understand the requirements and how to capture lineage data, see [Capture and view data lineage with Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html).

Data providers can use Databricks audit logging to monitor the creation and modification of shares and recipients, and can monitor recipient activity on shares. Data recipients who use shared data in a Databricks account can use Databricks audit logging to understand who is accessing which data.
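Once audit logs are delivered, a first analysis is often just counting who did what. The sketch below does that for Unity Catalog events over a handful of JSON records. The records are made up for illustration, and the field names used (`serviceName`, `actionName`, `userIdentity.email`) are our assumption about the audit log layout, so verify them against logs from your own account.

```python
import json
from collections import Counter

# Hypothetical sample records, one JSON document per line.
SAMPLE_LOG_LINES = [
    '{"serviceName": "unityCatalog", "actionName": "getTable", "userIdentity": {"email": "ana@example.com"}}',
    '{"serviceName": "unityCatalog", "actionName": "getTable", "userIdentity": {"email": "ana@example.com"}}',
    '{"serviceName": "unityCatalog", "actionName": "createTable", "userIdentity": {"email": "ben@example.com"}}',
    '{"serviceName": "accounts", "actionName": "login", "userIdentity": {"email": "ana@example.com"}}',
]


def unity_catalog_actions(lines):
    """Count (user, action) pairs for events emitted by the unityCatalog service."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("serviceName") != "unityCatalog":
            continue  # skip events from other services
        user = event.get("userIdentity", {}).get("email", "unknown")
        counts[(user, event.get("actionName", "unknown"))] += 1
    return counts


summary = unity_catalog_actions(SAMPLE_LOG_LINES)
for (user, action), n in sorted(summary.items()):
    print(user, action, n)
```

The same grouping could feed the dashboard mentioned above. At production volume you would run it as a query over the delivered log files rather than in plain Python.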
-----

###### Resources:

- [Databricks documentation](https://docs.databricks.com/?_ga=2.8076210.1659353804.1668454132-1193545868.1666711643)
- [Getting Started With Delta Lake](https://docs.databricks.com/delta/index.html)
- [Webinar: Deep Dive Into Lakehouse With Delta Lake](https://www.databricks.com/p/webinar/deep-dive-into-lakehouse-with-delta-lake-complimentary-training)
- [Big Book of Data Engineering Use Cases](https://www.databricks.com/explore/de-data-warehousing/big-book-of-data-engineering#page=1)
- [10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse](https://www.databricks.com/blog/2021/11/11/10-powerful-features-to-simplify-semi-structured-data-management-in-the-databricks-lakehouse.html)

###### Key Takeaways

- With the Databricks Lakehouse Platform, you can unify and simplify all your data on one platform to better scale and improve data storage and query capabilities
- The lakehouse helps reduce data infrastructure and compute costs. You don’t need excess data copies and can retire expensive legacy infrastructure.
- Leverage Delta Lake as the open format storage layer to deliver reliability, security and performance on your data lake, for both streaming and batch operations, replacing data silos with a single home for structured, semi-structured and unstructured data
- With Unity Catalog you can centralize governance for all data and AI assets including files, tables, machine learning models and dashboards in your lakehouse on any cloud
- The Databricks Lakehouse Platform is open source with multicloud flexibility so that you can use your data however and wherever you want, with no vendor lock-in

-----

# 02

## CHALLENGE: Build your data architecture to support scale and performance

As modern digital native companies mature, data volumes grow and new use cases develop. This inevitably leads to the increasing complexity of data architecture as new storage and access patterns emerge. Data growth can come suddenly and unexpectedly; when it does, the existing architecture needs to sustain performance, all the while being cost-effective. The relational databases and traditional data warehouses that met the needs of the businesses once upon a time are now creating limitations for new real-time use cases and large-scale data analytics pipelines.

Here are some common challenges around managing data and performance at scale:

**Volume and velocity**: Exponentially increasing data sources, and the speed at which they capture and create data.

**Latency requirements**: The demands of downstream applications and users have evolved (people want data and the results from the data faster).

**Governance**: Cataloging, auditing, securing and reporting on data is burdensome at scale when using old systems not built with data access controls and compliance in mind.

**Multicloud** is really hard.

**Data storage**: Data stored in the wrong format is slow to access and query, and is expensive at scale.
**Data format**: Supporting structured, semi-structured and unstructured data formats is now a requirement. Most data storage solutions are designed to handle only one type of data, requiring multiple products to be stitched together.

-----

###### Lakehouse solves scale and performance challenges

The solution for growing digital companies is a unified and simplified platform that can instantly scale up capacity to deliver more computing power on demand, freeing up teams to go after the much-needed data and produce outputs more quickly.

With a lakehouse, they can replace their data silos with a single home for their structured, semi-structured and unstructured data. Users and applications throughout the enterprise environment can connect to the same single copy of the data to drive diverse workloads.

The lakehouse architecture is cost-efficient for scaling, lowering the total cost of ownership for the overall infrastructure by consolidating all data estate and use cases onto a single platform and eliminating redundant licensing, infrastructure and administration costs. Unlike other warehouse options that can only scale horizontally, the Databricks Lakehouse can scale horizontally and vertically based on workload demands.

With the Databricks Lakehouse, you can optimize the compute costs on a platform that is [2.7x faster and 12x better in price performance than Snowflake](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html), according to research by the Barcelona Supercomputing Center. And your data teams are more productive by focusing on more strategic initiatives versus managing multiple data solutions.
CUSTOMER STORY: RIVIAN

### Driving into the future of electric transportation

With more than 11,000 electric adventure vehicles (EAVs) on the road generating multiple terabytes of IoT data per day, [Rivian](https://rivian.com/) is using data insights and machine learning to improve vehicle health and performance. However, with legacy cloud tooling, it struggled to scale pipelines cost-effectively and spent significant resources on maintenance. Before Rivian even shipped its first EAV, it was already up against data visibility and tooling limitations that decreased output, prevented collaboration and increased operational costs.

Rivian chose to modernize its data infrastructure on the Databricks Lakehouse Platform, giving it the ability to unify all its data into a common view for downstream analytics and machine learning. Now, unique data teams have a range of accessible tools to deliver actionable insights for different use cases, from predictive maintenance to smarter product development.

“Today we have various teams, both technical and business, using Databricks Lakehouse to explore our data, build performant data pipelines, and extract actionable business and product insights via visual dashboards,” said Wassym Bensaid, Vice President of Software Development at Rivian.

For instance, Rivian’s ADAS (advanced driver-assistance systems) Team can now easily prepare telemetric accelerometer data to understand all EAV motions. This core recording data includes information about pitch, roll, speed, suspension and airbag activity to help Rivian understand vehicle performance, driving patterns and connected car system predictability. Based on these key performance metrics, Rivian can improve the accuracy of smart features and the control that drivers have over them.

By leveraging the Databricks Lakehouse Platform, Rivian has seen a 30%–50% increase in runtime performance, which has led to faster insights and model performance.
[Read the full story here.](https://www.databricks.com/customers/rivian)

-----

###### How to ensure scalability and performance with Databricks

The [Databricks Lakehouse Platform](https://docs.databricks.com/lakehouse/index.html) is built for ensuring scalability and performance for your data architecture based on the following features and capabilities:

- A simplified and cost-efficient architecture that increases productivity
- A platform that ensures reliable, high performing ETL workloads, for streaming and batch data, while Databricks automatically manages your infrastructure
- The ability to ingest, transform and query all your data in one place, and scale on demand with serverless compute
- Real-time data access for all data, analytics and AI use cases

-----

The following section will provide a short series of steps for understanding the key components of the Databricks Lakehouse Platform.

**Step 1**

**Get a trial Databricks account**

Start your 14-day free trial with Databricks on AWS in a few easy steps. [Get started with a free trial and setup](https://docs.databricks.com/getting-started/index.html). During the 14-day free trial, all Databricks usage is free, but Databricks uses compute and S3 storage resources in your cloud provider account.

**Step 2**

**Understand the common Delta Lake operations**

The Databricks Lakehouse Platform simplifies the entire data lifecycle, from data ingestion to monitoring and governance, and it starts with [Delta Lake](https://www.databricks.com/product/delta-lake-on-databricks), a fully open-source storage system based on the Delta format providing reliability through ACID transactions and scalable metadata handling. Large quantities of raw files in blob storage can be converted to Delta to organize and store the data cheaply. This allows for flexibility of data movement while being performant and less expensive.

[Get acquainted with the Delta Lake storage format](https://docs.databricks.com/delta/tutorial.html) and learn how to create, manage and query tables. With support for ACID transactions and schema enforcement, Delta Lake provides the reliability that traditional data lakes lack. This enables you to scale reliable data insights throughout the organization and run analytics and other data projects directly on your data lake, [for up to 50x faster time-to-insight.](https://www.databricks.com/customers/wejo)

Delta Lake transactions use log files stored alongside data files to provide ACID guarantees at a table level. Because the data and log files backing Delta Lake tables live together in cloud object storage, reading and writing data can occur simultaneously without risk of many queries resulting in performance degradation or deadlock for business-critical workloads. This means that users and applications throughout the enterprise environment can connect to the same single copy of the data to drive diverse workloads, with all viewers guaranteed to receive the most current version of the data at the time their query executes. With performance features like indexing, Delta Lake customers have seen [ETL workloads execute up to 48x faster.](https://www.databricks.com/customers/columbia)

-----

All data in Delta Lake is stored in open Apache Parquet format, allowing data to be read by any compatible reader. APIs are open and compatible with Apache Spark, so you have access to a vast open-source ecosystem to avoid data lock-in from proprietary formats and conversions, which have embedded and added costs.
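To give a feel for how log files stored alongside data files can provide table-level transactions, here is a deliberately simplified toy version of a file-based transaction log. It borrows the general shape of the Delta Lake log (a `_delta_log` directory of numbered JSON commits containing `add` and `remove` actions) but leaves out almost everything else (checkpoints, schema, statistics), and the function names are our own. Treat it as a sketch of the idea, not as the Delta Lake implementation.

```python
import json
import tempfile
from pathlib import Path


def commit(table_dir: Path, version: int, actions: list) -> None:
    """Write one commit file; fail if that version was already committed."""
    log_dir = table_dir / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    commit_file = log_dir / f"{version:020d}.json"
    # Mode "x" raises FileExistsError if the file exists, so two writers
    # can never both claim the same version number.
    with open(commit_file, "x") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(table_dir: Path) -> set:
    """Replay the log in version order to find the current data files."""
    files = set()
    for commit_file in sorted((table_dir / "_delta_log").glob("*.json")):
        for line in commit_file.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "events"
    commit(table, 0, [{"add": {"path": "part-0000.parquet"}}])
    commit(table, 1, [{"add": {"path": "part-0001.parquet"}}])
    # Compaction: swap two small files for one in a single commit, so a
    # reader sees either the old pair or the new file, never a mix.
    commit(table, 2, [
        {"remove": {"path": "part-0000.parquet"}},
        {"remove": {"path": "part-0001.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ])
    current = live_files(table)
    try:
        commit(table, 2, [{"add": {"path": "conflict.parquet"}}])
        conflict_detected = False
    except FileExistsError:
        conflict_detected = True  # a second writer lost the race for version 2

print(sorted(current))     # ['part-0002.parquet']
print(conflict_detected)   # True
```

Readers only ever look at committed log entries, which is why reads and writes can proceed at the same time without locking the data files themselves.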
###### “By leveraging Databricks and Delta Lake, we have already been able to democratize data at scale while lowering the cost of running production workloads by 60%, saving us millions of dollars.”

– Steve Pulec, Chief Technology Officer, YipitData

[Learn more](https://www.databricks.com/customers/yipitdata)

-----

**Step 3**

**Ingest data efficiently at scale**

With a [Lakehouse Platform](https://www.databricks.com/product/data-lakehouse), data teams can ingest data from hundreds of data sources for analytics, AI and streaming applications into one place.

Databricks recommends [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) for incremental data ingestion. To ingest any file that can land in a data lake, Auto Loader incrementally and automatically processes new data files as they arrive in cloud storage in scheduled or continuous jobs. Auto Loader scales to support near real-time ingestion of millions of files per hour.

For pushing data into Delta Lake, the SQL command [COPY INTO](https://docs.databricks.com/ingestion/copy-into/index.html) allows you to perform batch file ingestion. COPY INTO is best used when the input directory contains thousands of files or fewer, and the user prefers SQL. COPY INTO can be used over JDBC to push data into Delta Lake at your convenience.

**Step 4**

**Leverage production-ready tools to automate ETL pipelines**

Once the raw data is ingested, Databricks provides a suite of production-ready tools that allow data professionals to quickly develop and deploy extract, transform and load (ETL) pipelines. Databricks SQL allows analysts to run SQL queries against the same tables used in production ETL workloads, allowing for real-time business intelligence at scale.
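As a sketch of the incremental ingestion described in Step 3, the function below wires Auto Loader (the `cloudFiles` streaming source) to a Delta table. It assumes a Databricks environment where a `spark` session is available and passes it in as a parameter; the function name, the JSON file format and the idea of sharing one state path for schema tracking and checkpoints are illustrative choices, not requirements.

```python
def start_auto_loader(spark, source_path, target_table, state_path):
    """Incrementally load new JSON files from source_path into a Delta table.

    spark is assumed to be an active SparkSession on Databricks; the three
    path and table arguments are placeholders supplied by the caller.
    """
    stream = (
        spark.readStream.format("cloudFiles")             # Auto Loader source
        .option("cloudFiles.format", "json")              # format of the incoming files
        .option("cloudFiles.schemaLocation", state_path)  # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", state_path)  # remembers which files were already processed
        .trigger(availableNow=True)                # process everything available, then stop
        .toTable(target_table)
    )
```

In a notebook this might be called as `start_auto_loader(spark, "/landing/events", "main.raw.events", "/checkpoints/events")`, where all three values are hypothetical. Scheduling that call as a job gives batch-style incremental loads, and removing the trigger option would keep the stream running continuously.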
With your trial account, [it’s time to develop and deploy](https://docs.databricks.com/getting-started/etl-quick-start.html) [your first extract, transform and load (ETL) pipelines](https://docs.databricks.com/getting-started/etl-quick-start.html) for data orchestration and learn how easy it is to create a cluster, create a Databricks notebook, configure [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) for ingestion into [Delta Lake](https://docs.databricks.com/delta/index.html) , process and interact with the data, and schedule a job. Databricks supports workloads in SQL, Python, Scala and R, allowing users with diverse skill sets and technical backgrounds to leverage their knowledge to derive analytic insights. You can use all languages supported by Databricks to define production jobs, and notebooks can leverage a combination of languages. This means that you can promote queries written by SQL analysts for last-mile ETL into production data engineering code with almost no effort. Queries and workloads defined by personas across the organization leverage the same data sets, so there’s no need to reconcile field names or make sure dashboards are up to date before sharing code and results with other teams. ----- With [Delta Live Tables](https://www.databricks.com/product/delta-live-tables) (DLT), data professionals have a framework that uses a simple declarative approach to build ETL and ML pipelines on batch or streaming data while automating operational complexities such as infrastructure management, task orchestration, error handling and recovery, retries, and performance optimization. 
Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with:

- [Autoscaling compute infrastructure](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-concepts.html#auto-scaling) for cost savings
- Data quality checks with [expectations](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html)
- Automatic [schema evolution](https://docs.databricks.com/ingestion/auto-loader/schema.html) handling
- Monitoring via metrics in the [event log](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-event-log.html)

With DLT, engineers can also treat their data as code and apply software engineering best practices like testing, monitoring and documentation to deploy reliable pipelines at scale. You can easily define end-to-end data pipelines in SQL or Python, automatically maintain all data dependencies across the pipeline, and reuse ETL pipelines with environment-independent data management.

`CUSTOMER STORY: ABNORMAL SECURITY`

### Stopping sophisticated ransomware in its tracks

The increase in email phishing and ransomware attacks requires the type of protection that can scale and evolve to meet the challenges of modern cyberattacks. [Abnormal Security](https://abnormalsecurity.com/), a cloud-native email security provider, knew that scalability would become a major focus to stay ahead of attack strategies with frequent product updates. Abnormal also required a data analytics infrastructure robust enough to meet the scale requirements for its data pipelines and constantly refined ML models.

“We were spending too much time managing our Spark infrastructure,” said Carlos Gasperi, Software Engineer at Abnormal Security.
“What we needed to be doing with that time was building the pipelines that would make the product better.”

The company implemented the Databricks Lakehouse Platform, which simplified its data architecture and maximized the performance of data pipelines and analytics. Data practitioners are now able to ingest data directly from S3 and query it in near real-time with the help of Delta Lake, an open-format storage layer that delivers reliability, security and performance on the data lake for both streaming and batch operations. With Databricks SQL, data scientists are then able to create visualizations using rich dashboards to drive product decisions and improve detection efficacy. Databricks also provided the collaborative environment that Abnormal’s data teams needed to increase their productivity and work in the same space without constantly competing for compute resources.

With Databricks, Abnormal has seen a 20% reduction in successful email attacks, a 40% reduction in infrastructure costs and a 30% increase in productivity.

[Read the full story here.](https://www.databricks.com/customers/abnormal)

-----

Delta Live Tables Enhanced Autoscaling is designed to handle streaming workloads that trigger intermittently and are unpredictable. It optimizes cluster utilization by only scaling up to the necessary number of nodes while maintaining end-to-end SLAs, and gracefully shuts down nodes when utilization is low to avoid unnecessary idle node capacity.

Delta Live Tables helps prevent bad data from flowing into tables through validation, integrity checks and predefined error policies. In addition, you can monitor data quality trends over time to get insight into how your data is evolving and where changes may be necessary.
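The data quality checks described above can be sketched with DLT expectations, which pair a name with a SQL boolean constraint. This is a minimal sketch under stated assumptions: the table names (`raw_orders`, `clean_orders`), the column names and the `define_pipeline` wrapper are hypothetical, and the `dlt` module is only importable inside a Delta Live Tables pipeline, which is why the import is deferred.

```python
# Expectation name -> SQL boolean constraint. Column names are hypothetical.
EXPECTATIONS = {
    "valid_id": "id IS NOT NULL",
    "non_negative_amount": "amount >= 0",
}


def define_pipeline():
    """Declare a DLT table; call this from within a Delta Live Tables pipeline."""
    import dlt  # available only inside a Delta Live Tables pipeline

    @dlt.table(comment="Orders with invalid rows dropped")
    @dlt.expect_all_or_drop(EXPECTATIONS)  # rows failing any constraint are dropped
    def clean_orders():
        # Hypothetical upstream table defined elsewhere in the same pipeline.
        return dlt.read_stream("raw_orders")

    return clean_orders
```

Dropping rows is only one possible error policy; the choice here is illustrative, and counts of passing and failing records surface in the pipeline event log for monitoring over time.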
-----

**Step 5** **Use Databricks SQL for serverless compute**

[Databricks SQL (DB SQL)](https://www.databricks.com/product/databricks-sql) is a serverless data warehouse on the Lakehouse Platform for running your SQL and BI applications at scale with up to 12x better price/performance.

It’s imperative for younger, growing companies to reduce resource contention, and one way to accomplish that is with serverless compute. Running serverless removes the need to manage, configure or scale cloud infrastructure on the lakehouse, freeing up your data team for what they do best.

See for yourself in this tutorial on [how to run and visualize a query in Databricks SQL](https://docs.databricks.com/sql/get-started/user-quickstart.html) and create dashboards on data stored in your data lake.

The Databricks SQL REST API supports services to manage queries and dashboards, query history and SQL warehouses.

Databricks SQL warehouses provide instant, elastic SQL compute that is decoupled from storage, and will automatically scale to provide unlimited concurrency without disruption for high-concurrency use cases. DB SQL has data governance and security built in. Handle high concurrency with fully managed load balancing and scaling of compute resources.

-----

**Faster queries with Photon**

[Photon](https://www.databricks.com/product/photon) is a new vectorized query engine designed to deliver dramatic infrastructure cost savings and accelerate all data and analytics workloads: data ingestion, ETL, streaming, interactive queries, data science and machine learning.

Photon is used by default in Databricks SQL. To enable Photon acceleration, select the **Use Photon Acceleration** checkbox when you create the cluster.
If you [create the cluster](https://docs.databricks.com/clusters/configure.html#photon-image) using [the clusters API](https://docs.databricks.com/dev-tools/api/latest/clusters.html), set `runtime_engine` to `PHOTON`.

Photon supports a number of instance types on the driver and worker nodes. Photon instance types consume DBUs at a different rate than the same instance type running the non-Photon runtime. For more information about Photon instances and DBU consumption, see the [Databricks pricing page.](https://www.databricks.com/product/pricing/product-pricing/instance-types)

Photon will seamlessly coordinate work and resources and transparently accelerate portions of your SQL and Spark queries. No tuning or user intervention is required. Photon is compatible with Apache Spark APIs, so getting started is as easy as turning it on: no code change and no lock-in.

Written entirely in C++, Photon provides an additional [2x speedup over Apache Spark](https://www.databricks.com/product/photon) per the TPC-DS 1TB benchmark, and customers have observed 3x–8x speedups on average. With Photon, typical customers are seeing up to [80% TCO savings](https://www.databricks.com/blog/2022/08/03/announcing-photon-engine-general-availability-on-the-databricks-lakehouse-platform.html#:~:text=Up%20to%2080%25%20TCO%20cost%20savings%20%2830%25%20on,Photon%203-8x%20faster%20queries%20on%20interactive%20SQL%20workloads) over traditional Databricks Runtime (Apache Spark) and up to 85% reduction in VM compute hours.
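For the clusters API route mentioned above, a request body might look like the following sketch. Only the `runtime_engine` setting comes from the text; the helper name `photon_cluster_spec`, the cluster name and the placeholder runtime version and instance type are hypothetical and must be replaced with values valid in your workspace.

```python
import json


def photon_cluster_spec(name: str, spark_version: str, node_type_id: str, num_workers: int) -> dict:
    """Request body for a clusters API create call, with Photon enabled."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,  # a Databricks Runtime version string
        "node_type_id": node_type_id,    # must be a Photon-supported instance type
        "num_workers": num_workers,
        "runtime_engine": "PHOTON",      # the setting described in the text
    }


# Placeholder values; substitute a real runtime version and instance type.
spec = photon_cluster_spec("photon-demo", "<runtime-version>", "<instance-type>", 2)
print(json.dumps(spec, indent=2))
```

The printed JSON is what would be sent as the body of the create request; authentication and the HTTP call itself are left out of this sketch.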
Learn how to connect BI tools to Databricks SQL compute resources with the following user guides:

- [Queries](https://docs.databricks.com/sql/user/queries/index.html)
- [Visualizations](https://docs.databricks.com/sql/user/visualizations/index.html)
- [Favorites and tags](https://docs.databricks.com/sql/user/favorites-tags.html)
- [Workspace browser](https://docs.databricks.com/sql/user/workspace-browser/index.html)
- [Dashboards](https://docs.databricks.com/sql/user/dashboards/index.html)
- [Alerts](https://docs.databricks.com/sql/user/alerts/index.html)

-----

**Step 6** **Orchestrate workflows**

Databricks provides a comprehensive suite of tools and integrations to support your data processing workflows. Databricks [Workflows](https://www.databricks.com/product/workflows) removes operational overhead by offering a fully managed orchestration service for all your teams, so you can focus on your workflows, not on managing your infrastructure. Orchestrate diverse workloads for the full lifecycle including Delta Live Tables, [Jobs](https://docs.databricks.com/workflows/index.html) for SQL, [Spark](https://www.databricks.com/product/spark), notebooks, dbt, ML models and more.

Here’s a tutorial on how to [create your first workflow with a Databricks job](https://docs.databricks.com/workflows/jobs/jobs-quickstart.html). You will learn how to create notebooks, create and run a job, view the run details, and run jobs with different parameters.

-----

**Step 7** **Run an end-to-end analytics pipeline**

This is where you can see how everything works together to run efficiently at scale. First take the quickstart: [Running end-to-end lakehouse analytics pipelines](https://docs.databricks.com/getting-started/lakehouse-e2e.html), where you will write to and read data from an external location managed by Unity Catalog and configure Auto Loader to ingest data to Unity Catalog.
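To make the orchestration in Step 6 more concrete, here is a hedged sketch of a two-task job definition in the shape used by the Databricks Jobs API, where the second notebook runs only after the first succeeds. The helper name `two_task_job`, the task keys, notebook paths, cluster ID and the `run_date` parameter are all hypothetical.

```python
def two_task_job(name: str, ingest_notebook: str, transform_notebook: str, cluster_id: str) -> dict:
    """Job definition with an ingest task followed by a dependent transform task."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "ingest",
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": ingest_notebook},
            },
            {
                "task_key": "transform",
                "depends_on": [{"task_key": "ingest"}],  # run after ingest succeeds
                "existing_cluster_id": cluster_id,
                "notebook_task": {
                    "notebook_path": transform_notebook,
                    # Parameters passed to the notebook; the name is illustrative.
                    "base_parameters": {"run_date": "2023-01-01"},
                },
            },
        ],
    }


job = two_task_job("daily-etl", "/Repos/demo/ingest", "/Repos/demo/transform", "<cluster-id>")
print([task["task_key"] for task in job["tasks"]])
```

Changing `base_parameters` between runs corresponds to running the job with different parameters, as the tutorial above walks through in the UI.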
###### Resources:

- [Databricks Lakehouse free trial](https://www.databricks.com/try-databricks?itm_data=DataLakehouse-HeroCTA-Trial#account)
- [The Lakehouse for companies born in the cloud](https://www.databricks.com/solutions/audience/digital-native)
- [How DuPont achieved 11x latency reduction and 4x cost reduction with Photon](https://www.databricks.com/blog/2022/10/04/how-dupont-achieved-11x-latency-reduction-and-4x-cost-reduction-photon.html)
- [Apache Spark on Databricks](https://docs.databricks.com/spark/index.html)
- [Discover Lakehouse solutions](https://www.databricks.com/solutions)
- [Databricks documentation](https://docs.databricks.com/)

###### “Databricks Workflows allows our analysts to easily create, run, monitor and repair data pipelines without managing any infrastructure. This enables them to have full autonomy in designing and improving ETL processes that produce must-have insights for our clients. We are excited to move our Airflow pipelines over to Databricks Workflows.”

Anup Segu, Senior Software Engineer, YipitData

[Learn more.](https://www.databricks.com/customers/yipitdata)

-----

# 03

`CHALLENGE:`

## Building effective machine-learning operations

-----

Growing startups and digital native companies face several challenges when they start building, maintaining and scaling machine learning operations (MLOps) for their data science teams. MLOps is different from DevOps. DevOps practices and tooling alone are insufficient because ML applications rely on an assortment of artifacts (e.g., models, data, code) that can each require different methods of experiment tracking, model training, feature development, governance, and feature and model serving.
For data teams beginning their machine learning journeys, training models can be labor-intensive and not cost-effective because the data has to be converted into features and trained on a separate machine learning platform.

- Data teams often perform development in disjointed, siloed stacks spanning DataOps, ModelOps and DevOps.
- Development and training environment disconnect. Moving code and data between personal development environments and machine learning platforms for model training at scale is error prone and cumbersome. The “it worked on my machine” problem.
- Gathering high-quality data. Data that is siloed across the organization is hard to discover, collect, clean and use. This leads to stale data and delays in development of models. See **Create a unified data architecture.**

-----

###### Siloed stacks spanning DataOps, ModelOps and DevOps

When data engineers help ingest, refine and prep data, they do so on their own stack. This data has to be converted into features and then trained on a separate machine learning platform. This cross-platform handoff often results in data staleness, difficulty in maintaining versions, and eventually, poorly performing models. Even after you have trained your model, you have to deal with yet another tech stack for model deployment. It’s challenging to serve features in real time and difficult to trace problems in production back to the data.

The downstream business impact is massive: longer and more expensive projects, and lower model accuracy in production, leading to declining business metrics.

If you are looking at launching or scaling your MLOps, you should probably focus on an incremental strategy. At Databricks, we see firsthand how customers develop their MLOps approaches across a huge variety of teams and businesses.
[Check out this Data + AI Summit session](https://www.youtube.com/watch?v=JApPzAnbfPI) to learn more about building robust MLOps practices.

###### Databricks solution:

Databricks Machine Learning is an integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and model serving. The capabilities of Databricks map directly to the steps of model development and deployment. With Databricks Machine Learning, you can:

- Train models either manually or with AutoML
- Track training parameters and models using experiments with MLflow tracking
- Create feature tables and access them for model training and inference
- Share, manage and serve models using MLflow Model Registry
- Deploy models for Serverless Real-time Inference

-----

###### Use MLOps on the Databricks Lakehouse Platform

To gain efficiencies and reduce costs, many smaller digital companies are employing machine learning operations. MLOps is a set of processes and automation for managing models, data and code, and unique library dependencies to improve performance, stability and long-term efficiency in ML systems. To describe it simply, MLOps = ModelOps + DataOps + DevOps.

The aim of MLOps is to improve the long-term performance, stability and success rate of ML systems while maximizing the efficiency of the teams who build them. Not only does MLOps improve organizational efficiency, it also allows the models to iterate faster and react to real-life changes in the data. This ability separates companies that can grow to meet their customers’ challenges in a reactive manner from those that will spend significant time on data updates and processes and miss the opportunity to do something with their models.

The absence of MLOps is typically marked by an overabundance of manual processes, which are slower and more prone to error, affecting the quality of models, data and code.
Eventually they form a bottleneck, capping the ability of a data team to take on new projects.

The process is complex. In larger organizations, several specialists and stakeholders can be involved in one ML project. But data practitioners at smaller digital natives and high-growth startups may be forced to wear several hats.

-----

And once an ML project goes into production, MLOps continues, since the models, data and code change over time due to regulatory and business requirements. But the ML system must be resilient and flexible. Addressing these challenges with a defined MLOps strategy can dramatically reduce the iteration cycle of delivering models to production.

-----

###### Steps in machine learning model development and deployment:

**Step 1** **Data preparation**

Manually preparing and labeling data is a thankless, time-consuming job. With Databricks, teams can label data with human effort, machine learning models in Databricks, or a combination of both. Teams can also employ a [model-assisted labeling](https://labelbox.com/product/automation) workflow that allows humans to easily inspect and correct a model’s predicted labels. This process can drastically reduce the amount of unstructured data you need to achieve strong model performance.

The [Databricks Runtime for Machine Learning](https://docs.databricks.com/runtime/mlruntime.html) is a ready-to-go environment with many external libraries, including TensorFlow, PyTorch, Horovod, scikit-learn and XGBoost. It provides extensions to improve performance, including GPU acceleration in XGBoost, distributed deep learning using HorovodRunner, and model checkpointing.

To use Databricks Runtime ML, select the ML version of the runtime when you [create your cluster](https://docs.databricks.com/clusters/index.html). To access data in Unity Catalog for machine learning workflows, you must use a [single user cluster](https://docs.databricks.com/data-governance/unity-catalog/compute.html).
User isolation clusters are not compatible with Databricks Runtime for Machine Learning. Machine learning applications often need to use shared storage for data loading and model checkpointing. You can load tabular data from [tables](https://docs.databricks.com/lakehouse/data-objects.html#table) or files. A table is a collection of structured data stored as a directory on cloud object storage. For [data preprocessing](https://docs.databricks.com/machine-learning/preprocess-data/index.html) , you can use [Databricks Feature Store](https://docs.databricks.com/machine-learning/feature-store/index.html) to create new features, explore and reuse existing features, track lineage and feature creation code, and publish features to low-latency online stores for real-time inference. The Feature Store is a centralized repository that enables data scientists to find and share features. It ensures that the same code used to compute the feature values is used for model training and inference. The Feature Store library is available only on Databricks Runtime for Machine Learning and is accessible through Databricks notebooks and workflows. 
###### Resources:

- [The Comprehensive Guide to Feature Stores](https://www.databricks.com/resources/ebook/the-comprehensive-guide-to-feature-stores)
- [Load data for machine learning and deep learning](https://docs.databricks.com/machine-learning/load-data/index.html)
- [Preprocess data for machine learning and deep learning](https://docs.databricks.com/machine-learning/preprocess-data/index.html)

-----

`CUSTOMER STORY: ZIPLINE`

### Data-driven drones deliver lifesaving medical aid around the world

Automated logistics and delivery system provider [Zipline](https://www.flyzipline.com/) is redefining logistics by using cutting-edge drone technology and a global autonomous logistics network to save lives by giving remote communities access to emergency and preparatory medical aid and resources, regardless of where they are in the world. Doing so requires the ability to ingest and analyze huge chunks of time series data in real time. This data is produced every time a drone takes flight and includes performance data, in-flight battery management, regional weather patterns, geographic obstacles, landing errors and a litany of other information that must be processed.

Every Zipline flight generates a gigabyte of data with potential life-or-death consequences, but accessing and federating the data for both internal and external decision-making was challenging. With Databricks as the common platform, Zipline’s data team can access all the information they need to accurately measure success, find the metrics that relate to customer experiences or logistics, and improve on them exponentially as more data is ingested and machine learning models are refined.

“About 30% of the deliveries we do are lifesaving emergency deliveries, where the product being delivered does not exist at the hospital. We have to be fast, and we have to be able to rely on all the different kinds of data to predict failures before they occur so that we can guarantee a really, really high service level to the people who are literally depending on us with their lives,” said Zipline CEO Keller Rinaudo.

“Databricks gives us confidence in our operations, and enables us to continuously improve our technology, expand our impact, and provide lifesaving aid where and when it’s needed, every single day.”

[Read full story here.](https://www.databricks.com/customers/zipline)

-----

**Step 2** **Model training**

For training machine learning and deep learning models, you can use [AutoML](https://docs.databricks.com/machine-learning/automl/index.html), which automatically prepares a data set for model training, performs a set of trials using open-source libraries such as scikit-learn and XGBoost, and creates a Python notebook with the source code for each trial run so you can review, reproduce and modify the code.

In Databricks, [notebooks](https://docs.databricks.com/notebooks/index.html) are the primary tool for creating data science and machine learning workflows and collaborating with colleagues. Databricks notebooks provide real-time coauthoring in multiple languages, automatic versioning and built-in data visualizations.
###### Resources: - [Model training examples](https://docs.databricks.com/machine-learning/train-model/index.html) - [Training models with Feature Store](https://docs.databricks.com/machine-learning/feature-store/train-models-with-feature-store.html) - [Best practices for deep learning on Databricks](https://docs.databricks.com/machine-learning/feature-store/train-models-with-feature-store.html) - [Machine learning quickstart notebook](https://docs.databricks.com/machine-learning/train-model/ml-quickstart.html) ----- ###### Resources: - [MLflow quickstart (Python)](https://docs.databricks.com/_extras/notebooks/source/mlflow/mlflow-quick-start-python.html) - [Track machine learning training runs](https://docs.databricks.com/mlflow/tracking.html) - [Automatically log training runs to MLflow](https://docs.databricks.com/mlflow/quick-start-python.html#automatically-log-training-runs-to-mlflow) - [Track ML Model training data with Delta Lake](https://docs.databricks.com/mlflow/tracking-ex-delta.html) - [Log, load, register, and deploy MLflow models](https://docs.databricks.com/mlflow/models.html) **Step 3** **Track model development** The model development process is iterative, and can be challenging. You can use [MLflow tracking](https://mlflow.org/docs/latest/tracking.html) to help you keep track of the model development process, including parameter settings or combinations you have tried and how they affected the model’s performance. MLflow tracking uses experiments and runs to log and track your model development. A run is a single execution of model code. An experiment is a collection of related runs. Within an experiment, you can compare and filter runs to understand how your model performs and how its performance depends on the parameter settings, input data, etc. MLflow can automatically log training code written in many ML frameworks. This is the easiest way to get started using MLflow tracking. 
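The experiment and run concepts above can be sketched as one MLflow run per parameter combination. This is an illustrative sketch, not the guide's own code: `param_grid`, `track_runs`, the metric name `val_score` and the `train_fn` callable are hypothetical, and `mlflow` is assumed to be installed (as it is on Databricks Runtime ML), so the import is deferred to the function that needs it.

```python
from itertools import product


def param_grid(**choices):
    """Expand keyword lists into a list of parameter-combination dicts."""
    keys = list(choices)
    return [dict(zip(keys, values)) for values in product(*(choices[k] for k in keys))]


def track_runs(experiment_name, grid, train_fn):
    """Log one MLflow run per parameter combination.

    `train_fn(params)` is a hypothetical callable that trains a model
    and returns a single validation metric.
    """
    import mlflow  # assumed installed, as on Databricks Runtime ML

    mlflow.set_experiment(experiment_name)
    for params in grid:
        with mlflow.start_run():
            mlflow.log_params(params)
            mlflow.log_metric("val_score", train_fn(params))


grid = param_grid(max_depth=[3, 5], learning_rate=[0.1, 0.3])
print(len(grid))  # 4
```

Each run then appears under the experiment, where runs can be compared and filtered by parameter. For supported frameworks, calling `mlflow.autolog()` before training is the usual way to avoid the explicit logging calls.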
With MLflow’s autologging capabilities, a single line of code automatically logs the resulting model.

A hosted version of MLflow Model Registry can help [manage the full lifecycle](https://docs.databricks.com/machine-learning/manage-model-lifecycle/index.html) of MLflow models. You can apply webhooks to automatically trigger actions based on registry events. For example, you can trigger CI builds when a new model version is created or notify your team members through Slack each time a model transition to production is requested. This promotes a traceable version control work process. You can also leverage this feature for A/B testing, funneling web traffic to different versions of deployed models for more precise population studies.

**Step 4** **Deploy machine learning models**

You can use MLflow to deploy models for batch or streaming inference or to set up a REST endpoint to serve the model. Simplify your model deployment by registering models to [the MLflow Model Registry](https://docs.databricks.com/mlflow/model-registry.html). After you have registered your model, you can [automatically generate a notebook](https://docs.databricks.com/machine-learning/manage-model-lifecycle/index.html#generate-inference-nb) for batch inference or configure the model for online serving with Serverless Real-Time Inference or [Classic MLflow Model Serving on Databricks](https://docs.databricks.com/archive/classic-model-serving/model-serving.html).

For model inference for deep learning applications, Databricks recommends the following workflow. When debugging and tuning model inference on Databricks, using GPUs (graphics processing units) can efficiently optimize the running speed of model inference. As GPUs and other accelerators become faster, it is important that the data input pipeline keep up with demand.
The data input pipeline reads the data into Spark DataFrames, transforms it and loads it as the input for model inference.

-----

`CUSTOMER STORY: ITERABLE`

### Optimizing touch points across the entire customer journey

“With Databricks Lakehouse, we can efficiently deploy powerful ML and AI solutions to help our customers meet rising consumer demands for more personalized experiences that drive revenue and results.”

Sinéad Cheung, Principal Product Manager, [Iterable](https://iterable.com/)

Captivating an audience and understanding customer journeys are essential to creating deeper brand-customer connections that drive growth, loyalty and revenue. From helping medical practitioners build trust with new patients to ensuring that food delivery users feel connected to their culinary community, Iterable helps more than 1,000 brands optimize and humanize their marketing in today’s competitive landscape.

This need to build personalized and automated customer experiences for its clients drove the company to find a fully managed platform that would simplify infrastructure management, make collaboration possible, and give it the ability to scale for analytics and AI.

With Databricks Lakehouse, Iterable can harness diverse, complex data sets (including conversion events, unique user labels, engagement patterns and business insights) and facilitate rapid prototyping of machine learning models that deliver top-notch and personalized user experiences for higher-converting marketing campaigns.

[Read the full story here.](https://www.databricks.com/customers/iterable)

-----

###### ML Stages

ML workflows include the following key assets: code, models and data. These assets need to be developed (dev), tested (staging) and deployed (production). Each stage needs to operate within an execution environment. So the execution environments, code, models and data are divided into dev, staging and production.
ML project code is often stored in a version control repository (such as Git), with most organizations using branches corresponding to the lifecycle phases of development, staging or production. Since model lifecycles do not correspond one-to-one with code lifecycles, it makes sense for model management to have its own service. MLflow and its Model Registry support managing model artifacts directly via UI and APIs. The loose coupling of model artifacts and code provides flexibility to update production models without code changes, streamlining the deployment process in many cases.

Databricks recommends creating separate environments for the different stages of ML code and model development with clearly defined transitions between stages. The recommended MLOps workflow is broken into these three stages:

[Development](https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html#development-stage): The focus of the development stage is experimentation. Data scientists develop features and models and run experiments to optimize model performance. The output of the development process is ML pipeline code that can include feature computation, model training, inference and monitoring.

-----

[Staging](https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html#staging-stage): This stage focuses on testing the ML pipeline code for production readiness, including code for model training as well as feature engineering pipelines and inference code. The output of the staging process is a release branch that triggers the CI/CD system to start the production stage.

-----

[Production](https://docs.databricks.com/machine-learning/mlops/mlops-workflow.html#production-stage): ML engineers own the production environment where ML pipelines are deployed. These pipelines compute fresh feature values, train and test new model versions, publish predictions to downstream tables or applications, and monitor the entire process to avoid performance degradation and instability.
Data scientists have visibility to test results, logs, model artifacts and production pipeline status to allow them to identify and diagnose problems in production.

The Databricks Machine Learning home page provides quick access to all the machine learning resources. To access this page, move your mouse or pointer over the left sidebar in the Databricks workspace. From the persona switcher at the top of the sidebar, select Machine Learning.

From the shortcuts menu, you can create a [notebook](https://docs.databricks.com/notebooks/index.html), [start AutoML](https://docs.databricks.com/machine-learning/automl/index.html) or open a [tutorial notebook](https://docs.databricks.com/machine-learning/tutorial/ml-quickstart.html). The center of the screen includes any recently viewed items, and the sidebar provides quick access to the [Experiments page](https://docs.databricks.com/mlflow/tracking.html#mlflow-experiments), [Databricks Feature Store](https://docs.databricks.com/machine-learning/feature-store/index.html) and [Model Registry.](https://docs.databricks.com/mlflow/model-registry.html)

New users can get started with a series of [tutorials](https://docs.databricks.com/machine-learning/tutorial/index.html) that illustrate how to use Databricks throughout the ML lifecycle or access the [in-product quickstart](https://docs.databricks.com/machine-learning/tutorial/ml-quickstart.html) for a model-training tutorial notebook that steps through loading data, training and tuning a model, comparing and analyzing model performance and using the model for inference.

Also be sure to download the [Big Book of MLOps](https://www.databricks.com/p/thank-you/the-big-book-of-mlops) to learn how your organization can build a robust MLOps practice incrementally.

###### Resources:

- [MLOps Virtual Event: Standardizing MLOps at Scale](https://www.databricks.com/p/webinar/mlops-virtual-event)
- [Virtual Event: Automating the ML Lifecycle With Databricks Machine Learning](https://www.databricks.com/p/webinar/automating-the-ml-lifecycle-with-databricks-machine-learning?itm_data=product-resources-automatingMLlifecycle)
- [MLOps Virtual Event “Operationalizing Machine Learning at Scale”](https://www.databricks.com/p/webinar/operationalizing-machine-learning-at-scale)
- [The Big Book of MLOps](https://www.databricks.com/p/ebook/the-big-book-of-mlops)
- [Machine learning on Databricks](https://www.databricks.com/product/machine-learning)
- [Watch the demos](https://www.databricks.com/discover/demos)

-----

# 04

`SUMMARY:`

## The Databricks Lakehouse Platform addresses these challenges

-----

### Summary

We’ve organized the common data challenges for startups and growing digital native businesses into three main buckets: building a **unified data architecture**; building one that supports **scalability and performance**; and building effective **machine learning operations**, all with an eye on cost efficiency and increased productivity.

The Lakehouse Platform provides an efficient and scalable architecture that solves these challenges and will support your data, analytics and AI workloads now and as you scale.

With [Databricks](https://www.databricks.com/) you can unify all your data with cost-efficient architecture for highly performant digital native applications and analytic workloads, designed to scale as you grow. Use your data however and wherever you want with open-source flexibility, and leverage open formats, APIs and your tools of choice. Ensure reliable, high-performing data workloads while Databricks automatically manages your infrastructure as you scale.
Leverage serverless Databricks SQL to increase productivity and scale on demand with up to 12x better price/performance. Easily access data for ML models and accelerate the full ML lifecycle from experimentation to production. Discover more about the lakehouse for companies born in the cloud.

-----

### Get started with Databricks Trial

Get a collaborative environment for data teams to build solutions together with interactive notebooks to use Apache Spark™, SQL, Python, Scala, Delta Lake, MLflow, TensorFlow, Keras, scikit-learn and more. Available as a 14-day full trial in your own cloud or as a lightweight trial hosted by Databricks.

**[TRY DATABRICKS FOR FREE](https://www.databricks.com/try-databricks?itm_data=H#account)**

### About Databricks

Databricks is the lakehouse company. More than 7,000 organizations worldwide, including Comcast, Condé Nast and over 50% of the Fortune 500, rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc).

© Databricks 2023. All rights reserved.
Apache, Apache Spark, Spark and the Spark

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/technical_guide_solving_common-data-challenges-for-startups-and-digital-native-businesses.pdf 2024-09-19T16:57:22Z

**EBOOK**

# Four Forces Driving Intelligent Manufacturing

### A data-driven business built on Lakehouse for Manufacturing

-----

## Contents

- Introduction
- The four driving forces of change
- Digital transformation is not a destination, it’s a journey
- Manufacturing – use case maturity matrix
- The foundations for data-driven manufacturing
- DRIVING FORCE NO. 1: The shift from manufacturing to Intelligent Manufacturing
- DRIVING FORCE NO. 2: Transparency, visibility, data: optimizing the supply chain
- DRIVING FORCE NO. 3: Future opportunities for manufacturing business models
- DRIVING FORCE NO. 4: The focus on sustainability
- Leveraging the Databricks Lakehouse for Manufacturing
- The building blocks of Lakehouse for Manufacturing
- Manufacturers’ end goals

2 Four Forces Driving Intelligent Manufacturing

-----

## Introduction

Manufacturing has always been an evolutionary business, grounded upon infrastructure, business processes, and manufacturing operations built over decades in a continuum of successes, insights and learnings. The methods and processes used to approach the development, release and optimization of products and capital spend are the foundation of the industry’s evolution.
But today it’s data- and AI-driven businesses that are being rewarded, because they can optimize processes and products in ways not previously possible, forecast and sense supply chain demand, and, crucially, introduce new forms of revenue based upon service rather than product.

The driver for this evolution will be the emergence of what we refer to as “Intelligent Manufacturing,” which has been enabled by the rise of computational power at the Edge and in the Cloud, as well as by new levels of connectivity speed enabled by 5G and fiber optic, combined with increased use of advanced analytics and machine learning (ML).

Yet, even with all the technological advances enabling these new data-driven businesses, challenges exist. McKinsey’s recent research with the World Economic Forum estimates the value creation potential of manufacturers and suppliers that implement Industry 4.0 in their operations at USD$3.7 trillion by 2025. Truly a huge number. But the challenge that most companies still struggle with is the move from piloting point solutions to delivering sustainable impact at scale.
[Only 30% of companies are capturing value from Industry 4.0 solutions in manufacturing today.](https://www.mckinsey.com/~/media/mckinsey/industries/advanced%20electronics/our%20insights/capturing%20value%20at%20scale%20in%20discrete%20manufacturing%20with%20industry%204%200/industry-4-0-capturing-value-at-scale-in-discrete-manufacturing-vf.pdf)

##### 80% of manufacturers [see smart manufacturing as key to their future success](https://roboticsandautomationnews.com/2021/03/10/new-study-reveals-80-percent-of-manufacturers-see-smart-manufacturing-as-key-to-future-success/41322/)

##### 57% of manufacturing leaders feel their organization [lacks skilled workers to support their smart manufacturing plans](https://www.gartner.com/en/newsroom/press-releases/2021-05-11-gartner-survey-shows-57-percent-of-manufacturing-leaders-feel-their-organization-lacks-skilled-workers-to-support-smart-manufacturing-digitization-plans)

##### [A lack of supply chain integration could stall smart factory initiatives for](https://www2.deloitte.com/content/dam/Deloitte/us/Documents/energy-resources/us-2021-manufacturing-industry-outlook.pdf) **[3 in 5](https://www2.deloitte.com/content/dam/Deloitte/us/Documents/energy-resources/us-2021-manufacturing-industry-outlook.pdf)** manufacturers by 2025

-----

## The four driving forces of change

Over the last two years, demand imbalances and supply chain swings have added a sense of urgency for manufacturers to digitally transform. But in truth, the main challenges facing the industry have existed, and will continue to exist, outside these recent exceptional circumstances. Manufacturers will always strive for greater levels of visibility across their supply chain, and always seek to optimize and streamline operations to improve margins. In the continuing quest for improved efficiency, productivity, adaptability and resilience, manufacturers are commonly tackling these major challenges:

###### Skills and production gaps

The rise of the digital economy is demanding a new set of skills. For today’s Intelligent Manufacturing organizations, there’s a fundamental need for computer and programming skills for automation, along with critical-thinking abilities. Also important is the ability to use collaboration systems and new advanced assistance tools, such as automation, virtual reality (VR) and augmented reality (AR). The deficit of workers with these skills is of critical concern to manufacturers. In addition, industry dynamics are pushing companies to increase and refine partner/supplier relationships, optimize internal operations and build robust supply chains that do not rely upon safety stock to weather supply chain swings. Historical focus on operational use cases is now extending to building agile supply chains.

###### Supply chain volatility

If the events of the last few years proved anything, it’s that supply chains need to be robust and resilient. Historically, supply chain volatility was smoothed by holding “safety stock,” which added costs without financial value.
Then the pendulum swung to “just in time delivery,” where efficient use of working capital disregarded demand risks. Recent experiences have highlighted that demand sensing is needed in addition to safety stock for high-risk parts or raw materials. The ability to monitor, predict and respond to external factors – including natural disasters, shipping and warehouse constraints, and geopolitical disruption – is vital to reduce risk and promote agility. Many of these external data sources consist of unstructured data (news, social posts, videos and images), so being able to manage both the structured and unstructured data needed to measure and analyze this volatility is key.

###### Need for new and additional sources of revenue

Manufacturers’ growth historically has been limited by the rate of new product introductions or expansion into new geographies. The emergence of “equipment-as-a-service” is changing that dynamic. It’s pivoting the business from product-centric growth to one leveraging added services, which are not tied to the product development and introduction cycle and can be highly differentiated depending on the market segment and types of products. Real-time data plays an outsize role, as businesses now work in unison with use cases such as predictive maintenance, stock replenishment and worker safety.

###### An increased focus on sustainability

Manufacturers have always focused on efficiency, but they’re increasingly seeing the value chain as circular. It’s no longer enough to consider an organization’s own carbon footprint – it needs to also include indirect emissions and other environmental impacts from the activities it doesn’t own or control. This requires a 360-degree view of sustainability, which includes both internal and external factors in measuring compliance with ESG programs.
**This eBook will look closer at these four key challenges and their associated use cases, as well as some of the most effective technologies and solutions that can be implemented to respond to them.**

-----

## Digital transformation is not a destination, it’s a journey

Digitalization is reshaping many areas of manufacturing and logistics, product design, production and quality of goods as well as sustainability and energy output. This transition from manual operations to automated solutions is enhancing and optimizing operational efficiency and decision-making, while also making supply chains more frictionless and reliable, as well as enabling organizations to become more responsive and adaptable to market and customer needs.

This disruption has been driven by a rush of new technologies including artificial intelligence, machine learning, advanced analytics, digital twins, Internet of Things (IoT), and automation. These, in turn, have been enabled by the greater network capabilities of 5G. Industry 4.0 is well underway. Intelligent Manufacturing isn’t the future, it’s what competitive organizations have established today.

## The data and AI maturity curve

### From descriptive to prescriptive

[Figure: the data and AI maturity curve. Stages in order of increasing analytics maturity: Raw Data, Cleaned Data, Reports, Ad Hoc Queries, Data Exploration, Predictive Modeling, Prescriptive Analytics. Questions along the curve: **What** happened? **Why** did it happen? **What** will happen? **How** can we make it happen?]

-----

## Manufacturing – use case maturity matrix

1. EDW offload
2. Product 360
3. Voice of customer insights
4. Testing & simulation optimization
5. Supplier 360
6. Spend analytics
7. Sourcing event optimization
8. Process & quality monitoring
9. Process 360
10. Equipment predictive maintenance
11. Quality & yield optimization
12. Supply chain 360
13. Demand analytics
14. Inventory visibility & tracking
15. Inventory optimization
16. Logistics route optimization
17. Customer 360
18. Marketing & sales personalization
19. Recommendation engine
20. Asset/Vehicle 360
21. Connected asset & value-added services
22. Quality event detection & traceability
23. Asset predictive maintenance

[Figure: use case maturity matrix positioning the numbered use cases above. Business areas: Design, Purchasing, Manufacturing, Supply Chain, Marketing & Sales, Service. Maturity stages: Awareness, Exploration, Optimization, Transformation. Peer competitive scale: standard among peer group, common among peer group, strategic among peer group, new innovations.]

That is not to say that the digital transformation journey is simple. Replacing legacy systems, breaking down data and organizational silos, bridging the gap between operational technology (OT) and information technology (IT), reskilling workforces, and much more require a clear and determined digitalization strategy, as well as new levels of IT and data maturity.

Much of the aforementioned transformation requires a foundation of effective data management and architecture to be in place. Without this ability to control the vast amounts of structured data (highly organized and easily decipherable) and unstructured data (qualitative, no predefined data model), manufacturers cannot generate actionable insights from their data, derive value from machine learning, monitor and analyze supply chains, or coordinate decisions across the business.

-----

## The foundations for data-driven manufacturing

###### Cloud-native platforms

Improve data management, enhance data analytics and expand the use of enterprise data, including streaming structured and unstructured data

###### Technology-enabled collaboration

Democratize analytics and ML capabilities – ensure the right users have access to the right data driving business value

###### The ability to scale machine learning use cases

A central place to store and discover ML models and enabling greater collaboration between ML, data and business users

###### Open standards and open data architectures

Leverage open source standards and open data formats to accelerate innovation and enable the integration of best-of-breed, third-party tools and services

##### 95% agree that [digital transformation in manufacturing is essential to their company’s future success](https://www.fictiv.com/ebooks/2021-state-of-manufacturing?utm_source=forbes&utm_medium=column&utm_campaign=som21&utm_content=report)

##### [Global spending on digital transformation is forecast to reach](https://www.idc.com/getdoc.jsp?containerId=prUS48372321) USD$2.8 trillion by 2025

##### 85% have accelerated [their digital transformation strategies since 2020](https://www.mckinsey.com/featured-insights/future-of-work/what-800-executives-envision-for-the-postpandemic-workforce)

-----

### Driving Force No. 1

## The shift from manufacturing to Intelligent Manufacturing

A Deloitte study calculates that, if left unaddressed, the manufacturing skills gap will leave 2.1 million jobs unfilled by 2030, costing the U.S. economy up to $1 trillion.

The immediate response would be to point the finger at the pandemic. Indeed, the same study found that approximately 1.4 million positions were lost at the start of the pandemic, and only 63% of those have since been recouped.

Yet the reasons for the lack of manufacturing talent today are manifold, and COVID-19 has only contributed to an existing problem. For instance, many highly experienced baby boomers are retiring from the workforce, leaving fewer people with the in-depth knowledge of custom equipment and machines. Meanwhile, younger generations have a poor perception of what manufacturing jobs are like and are reluctant to step into the industry. This means there is a problem not only with retaining skills, but also with attracting them.

And, of course, there is a growing gap between the current capabilities of industrial workers and the skill sets needed for today’s data-driven, sensor-filled, 5G-enabled Intelligent Manufacturing. With the drive to optimize operations, stabilize supply chains and reinvent business models through equipment-as-a-service, the skill sets have radically changed from even a decade ago.
Intelligent Manufacturing’s use cases are placing a high demand on robotics programmers and technicians, cybersecurity experts, digital twin architects, supply network analysts, and people who can leverage AI and machine learning algorithms because deployment of these common use cases is producing multiples of returns for those embracing Intelligent Manufacturing.

-----

##### 44% report difficulty [hiring manufacturing talent with the required digital expertise](https://www.fictiv.com/ebooks/2021-state-of-manufacturing?utm_source=forbes&utm_medium=column&utm_campaign=som21&utm_content=report)

##### 83% of manufacturing workers are interested [in learning new digital skills](https://www.mendix.com/press/welcome-news-to-jumpstart-the-post-pandemic-economy-mendix-survey-shows-78-of-u-s-manufacturing-workers-want-to-help-with-digital-transformation/)

##### 56% of Gen Z say [that the pandemic has changed their perception of manufacturing. 77% now view it as more important](https://skillsgapp.com/how-the-pandemic-shifted-gen-zs-perception-of-manufacturing-careers/)

### Proof through customer success

##### Watch our case study

### Those manufacturers with a strategy for upskilling existing talent, while also changing the perception of the incoming workforce, need to take advantage of the following use cases:

###### Digital twins

Ingesting information from sensors and other data sources, these virtual replicas of physical assets create models to which a layer of visualization can be applied. This enables users to predict failures, assess performance and reveal opportunities for optimization. Digital twins unlock the ability for manufacturers to monitor and manage production remotely, as well as explore “what-if” scenarios.

###### Process and quality optimization

Process and quality optimization generally covers the optimization of equipment, operating procedures, and control loops. It requires access to accurate, up-to-date data about conditions, collected through IoT devices to monitor every aspect. The introduction of deep learning architectures is enabling manufacturing machinery to identify visual clues that are indicative of quality issues in manufactured goods, while digital twins can be used to spot inefficiencies without the need to pause production.

###### Throughput optimization

Increasing throughput is critical for meeting delivery schedules, and manufacturers are always looking for ways to identify and eliminate bottlenecks, reduce inventory and increase the utilization of assets. Throughput optimization makes use of data-driven algorithms to identify, rank and resolve labor, equipment or inventory bottlenecks.

###### Equipment predictive maintenance

Rather than wait for a piece of equipment to fail or stick to a fixed schedule, predictive maintenance adopts a predictive approach to equipment maintenance.
By monitoring real-time data collected from hundreds of IoT sensors, machine learning techniques can detect anomalies in operations and possible defects in equipment and processes. Predictive maintenance correlates data across many more dimensions than traditional inspection techniques, to anticipate failures and prevent costly breakdowns.

###### Quality and yield optimization (with computer vision)

Quality assurance focuses on the use of data analytics, AI and machine learning to identify and prevent defects during the manufacturing process. [This type of edge AI is an approach that can increase productivity by 50% and detection rates by up to 90%.](https://www.qualitymag.com/articles/96231-how-edge-ai-can-improve-the-visual-inspection-process) Making use of image recognition and machine learning, computer vision can automate visual inspections, detecting faults and imperfections faster and more cost effectively than manual approaches.

-----

### Driving Force No. 2

## Transparency, visibility, data: optimizing the supply chain

Over the last few years, organizations have experienced the biggest disruption to their supply chains since the 1940s. In the short term, this meant having to adapt to global lockdowns and restrictions, material shortages and compromised workforces. Longer term, there will be economic downturns and new consumer and customer demands and habits to contend with.

Resilience and end-to-end visibility are key, with manufacturers given a harsh reminder of how important it is to be able to forecast and respond to disruption. Such resiliency requires a combination of technologies and solutions.
For example, decision support tools with predictive capabilities can monitor the supply chain and analyze what-if scenarios. Demand sensing and forecasting, in combination with enterprise critical systems (ERP), need to combine data from a wide variety of sources.

Working together, combining millions of data points from across organizations’ operations along with other external sources, these technologies can be used to optimize supply chains, reduce costs and improve customer service and loyalty. However, achieving this – embracing the latest in AI, machine learning and predictive analytics – means being able to manage and maintain a flow of accurate, relevant data and to be able to translate this data into actionable insights.

-----

Successful supply chain optimization depends on up-to-the-minute, end-to-end visibility that can be applied across all stages of the supply chain, from design to planning to execution. This will incorporate a range of solutions that can include:

###### Demand, inventory, logistics

###### Purchasing

**Spend analytics:** Most obviously, transparency and insight into where cash is spent is vital for identifying opportunities to reduce external spending across supply markets, suppliers and locations. However, spend analytics are also hugely important to supply chain agility and resilience. This requires a single source of data truth for finance and procurement departments. For example, integrating purchase order, invoice, accounts payable and general-ledger account data creates a level of transparency, visibility and consistency that can inform supplier discussions and support strategies to manage cash better during times of disruption.
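The spend analytics integration described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the record layouts, the field names (`po_id`, `supplier`, `amount`, `paid`) and the helper `build_spend_view` are hypothetical, and the `paid` flag merely stands in for accounts payable status. It is not a schema from any ERP system or Databricks product, only one way a per-supplier view could be assembled from purchase orders and invoices.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative assumptions,
# not a schema from any specific ERP system or product.
purchase_orders = [
    {"po_id": "PO-1", "supplier": "Acme Metals", "amount": 1000.0},
    {"po_id": "PO-2", "supplier": "Acme Metals", "amount": 500.0},
    {"po_id": "PO-3", "supplier": "Bolt Supply", "amount": 750.0},
]
invoices = [
    {"invoice_id": "INV-1", "po_id": "PO-1", "amount": 1000.0, "paid": True},
    {"invoice_id": "INV-2", "po_id": "PO-2", "amount": 650.0, "paid": False},
    {"invoice_id": "INV-3", "po_id": "PO-3", "amount": 750.0, "paid": False},
]


def build_spend_view(purchase_orders, invoices):
    """Join invoices to purchase orders and summarize spend per supplier."""
    po_by_id = {po["po_id"]: po for po in purchase_orders}
    summary = defaultdict(
        lambda: {"ordered": 0.0, "invoiced": 0.0, "unpaid": 0.0, "over_po": []}
    )
    # Total committed spend per supplier, taken from purchase orders.
    for po in purchase_orders:
        summary[po["supplier"]]["ordered"] += po["amount"]
    # Attach each invoice to its purchase order's supplier.
    for inv in invoices:
        po = po_by_id[inv["po_id"]]
        row = summary[po["supplier"]]
        row["invoiced"] += inv["amount"]
        if not inv["paid"]:
            row["unpaid"] += inv["amount"]
        # Flag invoices billed above the purchase order amount.
        if inv["amount"] > po["amount"]:
            row["over_po"].append(inv["invoice_id"])
    return dict(summary)


spend_view = build_spend_view(purchase_orders, invoices)
for supplier, row in sorted(spend_view.items()):
    print(supplier, row)
```

A consolidated view like this is the kind of input that could support a supplier discussion, for example by showing which suppliers have invoices above the agreed purchase order amount or a large unpaid balance. In practice the same join would more likely run over tables in a shared data platform than over in-memory lists.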
###### Cross supply chain collaboration

**Supply chain 360:** With real-time insights and aggregated supply chain data in a single business intelligence dashboard, manufacturers are empowered with greater levels of visibility, transparency and insights for more informed decision-making. This dashboard can be used to identify risks and take corrective steps, assess suppliers, control costs and more.

**Demand analytics:** By collecting and analyzing millions – if not billions – of data points about market and customer behavior and product performance, manufacturers can use this understanding to improve operations and support strategic decisions that affect the demand of products and services. [Around 80% say that using this form of data analysis has improved decision-making, while 26% say having this level of know-how to predict, shape and meet demands has increased their profits.](https://paperzz.com/doc/8615467/the-demand-analytics-premium---strategy)

**Inventory visibility and tracking:** Inventory visibility is the ability to view and track inventory in real time, with insights into SKU stock levels and which warehouse or fulfillment center it is stored at. With complete oversight of inventory across multiple channels, this helps improve supply chain efficiency, demand forecasting and order accuracy, while ultimately enhancing the customer experience.

**Inventory optimization:** The practice of having the right amount of available inventory to meet demand, both in the present and the future, enables manufacturers to address demand expectations, and reduce the costs of common inventory issues. Inventory optimization incorporates data for demand forecasting, inventory strategy and stock replenishment.
With the addition of AI reinforcement learning models, this can help improve demand prediction, recommend stock levels, and automatically order raw materials to fulfill orders, while also detecting and responding to shifts in demand.

**Logistics route optimization:** Using AI, route optimization can help manufacturers go beyond normal route planning and include parameters to further drive logistics efficiency. What-if scenarios present route options that help cut transportation costs, boost productivity and execute on-time deliveries.

**Supply chain network design:** Building and modeling the supply chain enables manufacturers to understand the costs and time to bring goods and services to market. Supply chain network design helps to evaluate delivery at the lowest possible cost, optimal sources and inventory deployment, as well as define distribution strategies.

-----

##### [Successfully implementing AI-enabled supply chain management has enabled early adopters to](https://www.mckinsey.com/industries/metals-and-mining/our-insights/succeeding-in-the-ai-supply-chain-revolution) improve logistics costs by 15%, inventory levels by 35%, and service levels by 65%

Only 6% of companies believe [they’ve achieved full supply chain visibility](https://www.supplychaindive.com/news/supply-chain-visibility-failure-survey-geodis/517751/)

##### 57% believe that supply chain management [gives them a competitive edge that enables them to further develop their business](https://financesonline.com/supply-chain-statistics/)

### Supply chain optimization case study

##### Watch our case study

-----

### Driving Force No. 3

## Future opportunities for manufacturing business models

When looking at the rapid evolution and growth of e-commerce, manufacturers have some catching up to do, particularly when it comes to embracing new and customer-centric business models. For example, when shifting from a product to a service mindset, the product lifecycle becomes more holistic and the client relationship is maintained beyond the point of purchase.

These new opportunities are forming part of a longer-term industry shift from the sale of goods (CapEx) to recurring revenue streams, such as through Equipment-as-a-Service (EaaS) models. While this approach is not new to many (Rolls-Royce’s “Power-by-the-Hour” engine subscription model has been around since 1962), customer demand, advances in industrial IoT technology, and a continuing decline in sales and margins have seen EaaS emerge as an imperative for manufacturers.

Opening up some of these new revenue streams, of course, demands operational flexibility, but more importantly, digital maturity. This means cloud technologies that give employees new levels of access to data and the ability to work anywhere and adapt rapidly to new needs; the introduction of a microservices architecture to allow the agile development and deployment of new IT services; and the democratization of data, so the entire organization and its ecosystem of partners and suppliers have access to information about market demand, operations, production, logistics and transportation.
-----

This level of visibility and collaboration is beneficial not only in lowering maintenance costs, capital expenditure and human capital management, but also in empowering all stakeholders to make smarter and more informed decisions.

##### By 2023, 20% of industrial equipment manufacturers will [support EaaS with remote Industrial IoT capabilities](https://www.gartner.com/en/newsroom/press-releases/2021-07-28-gartner-identifies-top5-manufacturing-trends-2021)

##### In 2025, the global EaaS market is estimated [to grow to $131B compared to $22B in 2019](https://iot-analytics.com/entering-the-decade-of-equipment-as-a-service/)

##### In the U.S., 34% said [pay-per-use models represent a big or a very big competitive advantage, while 29% consider it a slight advantage](https://relayr.io/pr-forsa-survey/)

### Equipment as a service case study

##### Read our case study

###### Connected assets

The digital connectivity of high-value physical assets is helping to drive a more efficient use of assets and cost savings. Connected assets can provide continuous, real-time data on their operating conditions, even if they are on the other side of the world. Connected assets can also be used as the foundation of as-a-service business models to track the usage of rented machines, and for automakers to use with connected vehicles and electrification strategies.

###### Quality event detection and traceability

Manufacturers are increasingly seeking end-to-end supply chain traceability: the ability to identify and trace the history, distribution, location and application of products, parts and materials.
With event-based traceability, typically using blockchain ledgers, manufacturers can record events along the supply chain. This can help aid legal compliance, support quality assurance and brand trust, and provide full supply chain visibility for better risk management. ###### Demand-driven manufacturing **Equipment-as-a-Service:** Startup organizations without the in-house infrastructure can use a third-party to realize their concepts, while manufacturers with the production capabilities can ensure minimal downtime for their assets. This involves greater risk for the manufacturer, but also the potential for higher and annuitized revenues. 14 Four Forces Driving Intelligent Manufacturing ----- ### Driving Force No. 4 ## The focus on sustainability ##### It’s an inescapable truth that Earth’s resources are finite, and we need to change our present, linear business models for something that minimizes our use of resources and eliminates waste. Manufacturers need to take a more sustainable approach, where they can limit their negative environmental impacts, while also conserving energy and natural resources. When looking at the entire manufacturing value chain, there are many areas where more sustainable practices can deliver measurable change. Products can be designed in a way that reduces waste and increases their longevity; materials can be selected and sourced in a more ethical way; operational efficiency and green energy can improve production; and the introduction of sustainable practices for transportation and shipping can help reduce carbon footprints. 
[These are part of the move](https://www.strategyand.pwc.com/de/en/industries/industrials/importance-of-the-circular-economy-for-manufacturing.html) [toward more circular business models](https://www.strategyand.pwc.com/de/en/industries/industrials/importance-of-the-circular-economy-for-manufacturing.html) [and establishing what PwC has called the](https://www.strategyand.pwc.com/de/en/industries/industrials/importance-of-the-circular-economy-for-manufacturing.html) [four Rs of the circular economy: Reduce,](https://www.strategyand.pwc.com/de/en/industries/industrials/importance-of-the-circular-economy-for-manufacturing.html) [Refurbish/Reuse, Recycle and Recover.](https://www.strategyand.pwc.com/de/en/industries/industrials/importance-of-the-circular-economy-for-manufacturing.html) There are a number of business operating models that employ the four Rs and support the circular economy. Sharing platforms and aaS models help optimize manufacturing capacity and enable businesses to rent rather than buy the machinery and equipment they need. Product use extension helps extend the lifecycle of products through repair and refurbishment, while resource recovery means recovering raw materials from end-of-life products. Achieving this means establishing a redesigned supply chain that leverages many use cases, technologies and solutions we covered earlier. It will require greater levels of collaboration between suppliers and vendors. It will require optimizing production lines and transportation. It will require greater levels of customer engagement to extend product lifecycles and close the loop of the supply chain. But most of all, it will require data, to provide visibility and intelligence across the network, and to be able to make the decisions to improve efficiency in the present, as well as longer-term decisions based on a broad view of sustainability impacts. 
15 Four Forces Driving Intelligent Manufacturing ----- ### Sustainability Solution Accelerator ##### Read now [The manufacturing industry alone](https://blogs.3ds.com/delmia/leverage-the-power-of-digitalization-for-more-sustainable-manufacturing/) [is responsible for](https://blogs.3ds.com/delmia/leverage-the-power-of-digitalization-for-more-sustainable-manufacturing/) **[54% of the](https://blogs.3ds.com/delmia/leverage-the-power-of-digitalization-for-more-sustainable-manufacturing/)** ##### world’s energy consumption [and](https://blogs.3ds.com/delmia/leverage-the-power-of-digitalization-for-more-sustainable-manufacturing/) **[20% of carbon emissions](https://blogs.3ds.com/delmia/leverage-the-power-of-digitalization-for-more-sustainable-manufacturing/)** ##### 80% of the world’s leading companies [are now incorporating sustainability](https://assets.kpmg/content/dam/kpmg/xx/pdf/2020/11/the-time-has-come.pdf) [into their operations and goals](https://assets.kpmg/content/dam/kpmg/xx/pdf/2020/11/the-time-has-come.pdf) ##### 78% of industrial, manufacturing and metals organizations now report on sustainability — up from 68% in 2017 16 Four Forces Driving Intelligent Manufacturing ----- ## Leveraging the Databricks Lakehouse for Manufacturing Our open, simple and collaborative Lakehouse for Manufacturing enables automotive, electronics, industrial, and transportation & logistics organizations to unlock more value and transform how they use data and AI. All your sources Any structure or frequency Reliable, real-time processing Analytics capabilities for any use case or persona Competitor News & Social Consumer Devices Video & Images IoT & Shop Floor Enterprise Resource Planning Sales Transaction & Syndicated Inventory & Logistics Unstructured batch Ad Hoc Data Science Low-cost, rapid experimentation with new data and models. Production Machine Learning High volume, fine-grained analysis at scale served in the tightest of service windows. 
BI Reporting and Dashboarding Power real-time dashboarding directly, or feed data to a data warehouse for high-concurrency reporting. Real-Time Applications Lakehouse enables a real-time data-driven business with the ability to ingest structured, semi-structured and unstructured data from ERP, SCM, IoT, social or other sources in your value chain so that predictive AI and ML insights can be realized. This enables them to operate their business in real time, deliver more accurate analytics that leverage all their data, and drive collaboration and innovation across their value chain. Most important for capital intensive manufacturing business, it enables them to move quickly from proof-of-concept (PoC) ideation to ROI quickly. Semi-structured real-time Unstructured batch Semi-structured real-time Structured real-time Structured batch Structured real-time Data Lakehouse Process, manage, and query all your data. Any cloud Provide real-time data to downstream applications or power applications via APIs. 17 Four Forces Driving Intelligent Manufacturing ----- ## The building blocks of Lakehouse for Manufacturing ###### Real Time Make data-informed decisions ###### Solution Accelerators Accelerate the possibilities of capabilities ###### Partner Solutions Accelerate the creation of insights ###### Speed Delivering fast ROI **Real-time data to make informed** **decisions:** The Lakehouse Platform streamlines data ingestion and management in a way that makes it easy to automate and secure data with fast, real-time performance. This means you can consolidate and enhance data from across the organization and turn it into accessible, actionable insights. **Solution Accelerators for new** **capabilities:** Through our Solution Accelerators, manufacturers can easily access and deploy common and high-impact use cases. For manufacturers restricted by time and resources, these accelerators provide the tools and pre-built code to deliver PoCs in less than two weeks. 
**Pre-built applications to deliver solutions faster:** We make it easy for you to discover data, analytics and AI tools, using pre-built integrations to connect with partner solutions, integrating them (and existing solutions) into the Lakehouse Platform to rapidly expand capabilities in a few clicks. **The speed to deliver fast ROI:** Faster data ingestion and access to insights, combined with easier, quicker deployments, mean accelerated digital transformation and higher ROI. ----- ## Manufacturers’ end goals ##### Intelligent Manufacturing leaders combine familiar manufacturing techniques with newer, value-producing and differentiating data-led use cases. This means making use of IIoT, cloud computing, data analytics, machine learning and more to create an end-to-end digital ecosystem across the entire value chain and build scalable architectures that take data from edge to AI. It means embracing automation and robotics, optimizing how organizations use assets and augmenting the capabilities of workforces, and introducing new levels of connectivity to accelerate performance. Not to mention opening the door to new platform and as-a-service business models with the potential to generate new revenue streams. Also key to the data-driven transformation of manufacturing is visibility: a 360-degree, end-to-end view of the supply chain. Not only is this critical for the efficiency, optimization and profitability of operations, it is needed to be able to take new strides in sustainability. Of course, better data management is not only about unlocking insight, empowering AI, and enabling decision-making. It’s also about governance: acknowledging format issues, adhering to compliance, protecting IP, ensuring data security. All this needs to be taken into consideration when bringing onboard an ISV to establish a modern, unified architecture for data and AI. 
19 Four Forces Driving Intelligent Manufacturing ----- ## About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark,™ Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc) . Get started with a free trial of Databricks and start building data applications today ##### Start your free trial To learn more, visit us at: **[Databricks for Manufacturing](https://databricks.com/solutions/industries/manufacturing-industry-solutions)** -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/four-forces-driving-intelligent-manufacturing-v7.pdf2024-09-19T16:57:21Z## Driving Innovation and Transformation in the Federal Government With Data + AI Empowering the federal government to efficiently deliver on mission objectives and better serve citizens ----- ### Contents State of the union: Data and AI in the federal government **03** Recognizing the opportunity for data and AI **04** Challenges to innovation **07** The Databricks Lakehouse Platform: Modernizing the federal government to achieve mission objectives **09** Customer story: U.S. Citizenship and Immigration Services **13** Conclusion **15** ----- ### State of the union: Data and AI in the federal government For the private sector, the growth, maturation and application of data analytics and artificial intelligence (AI) have driven innovation. 
This has resulted in solutions that have helped to improve efficiencies in everything from optimizing supply chains to accelerating drug development to creating personalized customer experiences and much more. Unfortunately, the federal government and many of its agencies are just beginning to take advantage of the benefits that data, analytics and AI can deliver. This inability to innovate is largely due to aging technology investments, resulting in a sprawl of legacy systems siloed by agencies and departments. Additionally, the government is one of the largest employers in the world, which introduces significant complexity, operational inefficiencies and a lack of transparency that limit the ability of its agencies to leverage the data at their disposal for even basic analytics – let alone advanced data analytic techniques, such as machine learning. ----- ### Recognizing the opportunity for data and AI The opportunity for the federal government to leverage data analytics and AI cannot be overstated. With access to some of the largest current and historical data sets available to the United States — and with vast personnel resources and some of the best private sector use cases and applications of AI available in the world — the federal government has the ability to transform the efficiency and effectiveness of many of its agencies. In fact, the federal government plans to spend $4.3 billion in artificial intelligence research and development across agencies in fiscal year 2023, according to a recent report from Bloomberg Government. These priorities are validated by a recent Gartner study of government CIOs across all levels (including state and local), confirming that the top game-changing technologies are AI, data analytics and the cloud. 
And as an indication of the potential impact, a recent study by Deloitte shows the government can save upward of $3 billion annually on the low end to more than $41 billion annually on the high end from data-driven automation and AI. Sources: [• Gartner Survey Finds Government CIOs to Focus Technology Investments on Data Analytics and Cybersecurity in 2019](https://www.gartner.com/en/newsroom/press-releases/2019-01-23-gartner-survey-finds-government-cios-to-focus-technol) [• Administration Projects Agencies Will Spend $1 Billion on Artificial Intelligence Next Year](https://www.nextgov.com/emerging-tech/2019/09/administration-projects-agencies-will-spend-1-billion-artificial-intelligence-next-year/159781/) Investment in AI to automate repetitive tasks can improve efficiencies across government agencies, which could save **96.7 million** federal hours annually, with a potential savings of **$3.3 billion.** **WILLIAM EGGERS, PETER VIECHNICKI AND DAVID SCHATSKY** [Deloitte Insights](https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/artificial-intelligence-government.html) ----- **An increased focus on cloud, analytics and AI = operational efficiency** **$1B** Data and AI Research and Development Initiative **TOP PRIORITIES** Government CIOs’ top game-changing technologies: 1. AI/ML 2. Data Analytics 3. Cloud **$41B+** Estimated government savings from data-driven automation Fortunately, the President’s Management Agenda (PMA) has recognized the need to modernize their existing infrastructure, federate data for easier access and build more advanced data analytics capabilities by establishing mandates for modernization, data openness and the progression of AI innovations. This will put agencies in a better position to leverage the scale of the cloud and democratize secure access to data in order to enable downstream business intelligence and AI use cases. The end result will be transformative innovation that can not only improve the operational efficiencies of each agency, but also support the delivery of actionable insights in real time for more informed decision-making. This benefits citizens in the form of better services, stronger national security and smarter resource management. 
**U.S. Government** **IT Modernization Act** Allows agencies to invest in modern technology solutions to improve service to the public, secure sensitive systems and data, and save taxpayer dollars. **Federal Data Strategy** A 10-year vision for how the federal government will accelerate the use of data to achieve its mission, serve the public and steward resources, while protecting security, privacy and confidentiality. **AI Executive Order** Makes AI a top research and development priority for federal agencies, provides a shared ethics framework for developing and using AI, and expands job rotation programs to increase the number of AI experts at agencies. ----- Top data and AI use cases in the government Across the federal government, data and AI is providing the insights and predictive capabilities to thwart cyberattacks and national threats, provide better social services more efficiently, and improve the delivery and quality of healthcare services. **HEALTHCARE** Improve the delivery and quality of healthcare services for citizens with powerful analytics and a 360° view of patients. - Patient 360 - Insurance management - Population health - Genomics - Supply chain optimization - Drug discovery and delivery 
**HOMELAND SECURITY** Detect and prevent criminal activities and national threats with real-time analytics and data-driven decision-making. - Customs and border protection - Counter-terrorism - Immigration and citizenship - Federal emergency aid management **ENERGY** Improve energy management with data insights that ensure energy resiliency and sustainability. - Security of energy infrastructure - Energy exploration - Smarter energy management - Electrical grid reliability **COMMERCE** Proactively detect anomalies with machine learning to mitigate risk and prevent fraudulent activity. - Tax fraud and collection - Grants management - Process and operations management - Customer 360 **INTELLIGENCE COMMUNITY** Leverage real-time insights to make informed decisions that can impact the safety of our citizens and the world. - Threat detection - Intelligence surveillance and reconnaissance - Neutralize cyberattacks - Social media analytics **DEFENSE** Apply the power of predictive analytics to geospatial, IoT and surveillance data to improve operations and protect the nation. - Logistics - Surveillance and reconnaissance - Predictive maintenance - Law enforcement and readiness ----- ### Challenges to innovation The opportunity to drive innovation throughout the federal government is massive and has implications for every U.S. citizen. But there are several critical barriers preventing agencies from making the progress needed to realize the value of their data and delivering those innovations. Ten of the existing legacy systems most in need of modernization cost about **$337 million a year** to operate and maintain. 
**THE GOVERNMENT ACCOUNTABILITY OFFICE,** **INFORMATION TECHNOLOGY REPORT TO CONGRESS, JUNE 2019** The complexities and impact of legacy data warehouses and marts Multiple federal agencies are burdened with a legacy IT infrastructure that is being left behind by the technological advancements seen in the private sector. This infrastructure is traditionally built with on-premises data warehouses and data marts that are highly complex to maintain, costly to scale as compute is coupled with storage, limited from a data science perspective, and they lack support for the growing volumes of unstructured data. This inhibits data-driven innovation and blocks the use of AI, leaving agencies to search for data science tools to fill the gaps. Infrastructure also becomes harder and more expensive to maintain as it ages. Over time, these environments become more complex due to their need for specialized patches and updates that keep these systems available while doing nothing to solve the issues of poor interoperability, ever-decreasing processing speeds, and an inability to scale – all of which are critically necessary to support today’s more data-intensive use cases. For example, systems at the departments of Education, Health and Human Services, Treasury, and Social Security are over 40 years old.¹ This is causing pain in a variety of areas. often requires significant customization and, even then, there is still a chance that the final integration won’t be successful. These systems also keep personnel from spending their energy and resources on emerging technologies such as AI. And data reliability is a big concern. Replication of data occurs across data marts as various teams try to access and explore it, creating data management and governance challenges. Without a single source of truth, teams struggle with data inconsistencies, which can result in inaccurate analysis and model performance that is only compounded over time. 
Thankfully, there are initiatives in place, such as the Data Center and Cloud Optimization Initiative Program Management Office (DCCOI PMO), which are investing in modernizing IT infrastructure for federal agencies.² Maintaining these systems requires a massive investment of both time and money compared to modern cloud-based systems. For the technical teams that are tasked with trying to integrate any of these legacy systems with third-party tooling or services, this [¹ Agencies Need to Develop Modernization Plans for Critical Legacy Systems](https://www.gao.gov/assets/gao-19-471.pdf) [² IT Modernization](https://www.gsa.gov/technology/government-it-initiatives/data-center-optimization-initiative-dcoi) ----- Data is critical … and complicated Data is both the greatest asset and one of the greatest challenges that federal agencies must learn to manage. While the volume and usefulness of data collected by federal agencies are not in question, much of it is locked in legacy source systems, comes in diverse structured Data silos hamper any data-driven advancements In any data-driven organization, the need to have trusted, timely and efficient access to data is critical. For the data teams responsible for driving the digital transformation of federal agencies, the challenges they face are myriad. and unstructured formats, and is subject to a variety of governance models. We have already seen how existing, legacy infrastructure, as well as the integration of Not only is this data siloed and very difficult to integrate, but the data volumes collected by federal agencies are massive. At Health and Human Services, for example, or the Department of Veterans Affairs, healthcare data sets will be sized by population and include electronic health records, clinical data, imaging and more. For the Department of Defense fragmented data sources, will strain data engineering teams trying to deliver high-quality data at scale. 
Their challenge includes developing the right data pipelines that will take the massive volumes of raw data coming from fragmented sources into one centralized location with clean, secure and compliant data for agency decision-makers. and the Department of Homeland Security, data includes everything from mapping, satellite Data scientists and analysts alike must have the right toolset to collaboratively investigate, extract and report meaningful insights from this data. Unfortunately, data silos extend to organizational silos, which make collaboration inside an agency as well as between agencies very difficult. With different groups of data teams leveraging their own coding and analytical tools, communicating insights and working across teams — let alone across agencies — is almost impossible. This lack of collaboration can drastically limit the capabilities of any data analytics or AI initiatives — from the deployment of shared business intelligence (BI) reports and dashboards for data investigation and decision- making to the training of machine learning models to automate processes and make predictions. Compounding these challenges is an overall lack of data science expertise and skills within federal agencies. As a result, even with access to their data, without intuitive tooling it’s very difficult to deliver advanced analytic use cases with ML and AI. Organizational silos also impact the effectiveness of data analysts, who are responsible for analyzing and reporting insights from the data to better inform subject-matter experts or policy — and decision-makers. Without a data platform that eliminates these silos and enables visualization of and reporting on shared data, data analysts will be limited in how they are able to drive the organizational and policy agendas of their respective agencies. imagery and intelligence data to payroll and human resources data. 
The Social Security Administration and Internal Revenue Service manage personal data for every single citizen in the United States. Combining these various forms of data from disparate legacy systems that are not integrated — and doing it across different government agencies and departments — can be slow and error prone, hindering downstream analytics and actionable insights. The teams that are responsible for this are faced with not only integrating these data sources, but also managing the entire ETL workflow in order to enable the application of basic analytics, let alone machine learning and AI. ----- **THE DATABRICKS LAKEHOUSE PLATFORM:** ### Modernizing the federal government to achieve mission objectives Databricks provides federal agencies with a Lakehouse Platform that combines the best of data warehouses and data lakes — to store and manage all your data for all your analytics workloads. Databricks federates all data and democratizes access for downstream use cases, empowering federal agencies to unlock the full potential of their data to deliver on their mission objectives and better serve citizens. Federal agencies that are powering impactful innovations with Databricks Lakehouse Lakehouse offers a single solution for all major data workloads, whether structured or unstructured, and supports use cases from streaming analytics to BI, data science and AI. 
Using predictive analytics for better passenger safety and experience Enabling operational efficiencies through process automation to streamline the path to citizenship All your government data Reliable, Analytics capabilities real-time processing for every use case AD HOC DATA SCIENCE Health Surveillance Social Security Demographics Crime Audio/Visual Geospatial Structured batch Unstructured stream Structured batch Structured batch Unstructured batch Unstructured stream Unstructured stream PRODUCTION MACHINE LEARNING **DATA LAKEHOUSE** Process, manage and query all your data BI REPORTING AND SCORECARDING Leveraging advanced analytics to improve outcomes for patients through Medicare and Medicaid services The Databricks Lakehouse Platform has three unique characteristics that address head-on the biggest challenges that federal agencies are facing: It offers simplicity with regard to data management, in that the Databricks Lakehouse is architected to support all of an agency’s data workloads on one It is built on open standards so that any existing investments in tooling or resources can remain effective And it’s collaborative, enabling agency data engineers, analysts and data scientists to work together much more easily common platform ----- Managing federal data with a unified approach Databricks enables aggregation and processing of massive collections of diverse and sensitive agency data that currently exists in silos, both structured and unstructured. As we’ve seen, for many agencies this would be incredibly difficult with the infrastructure challenges they are experiencing. The Databricks Lakehouse leverages Delta Lake to unify By providing a unified data foundation for business intelligence, data science and machine learning, federal agencies can add reliability, performance and quality to existing data lakes while simplifying data engineering and infrastructure management with automation to simplify the development and management of data pipelines. 
the very large and diverse amounts of data that government agencies are working with. Delta Lake is an open format, centralized data storage layer that delivers reliability, security and performance — for both streaming and batch operations. The Lakehouse Platform combines the best elements of data lakes and data warehouses — delivering the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes ----- Break down the institutional silos limiting collaboration Foster collaboration at every step with the latest machine learning tools that allow everyone to work and build value together — from data scientists to researchers to business decision-makers. Close the glaring skills gap within these government organizations by providing tooling that simplifies the ML lifecycle and empowers the data teams that do not have the data science expertise to still be productive with their data through integrating BI tools and SQL analytics capabilities. Empower data scientists with an intuitive and interactive workspace where they can easily collaborate on data, share models and code, and manage the entire machine learning lifecycle in one place. Databricks notebooks natively support Python, R, SQL and Scala so practitioners can work together with the languages and libraries of their choice. Deliver on mission objectives with powerful analytics across agencies The Databricks Lakehouse Platform includes a business intelligence capability — Databricks SQL. Databricks SQL allows data analysts and users to query and run reports against all of an agency’s unified data. Databricks SQL integrates with BI tools, like Tableau and Microsoft Power BI, and complements any existing BI tools with a SQL-native interface, allowing data analysts and data scientists to query data directly within Databricks. 
Additionally, with Databricks SQL, the data team can turn insights from real-world data into powerful visualizations designed for machine learning. Visualizations can then be turned into interactive dashboards to share insights with peers across agencies, policymakers, Easily create visualizations and share dashboards via integrations with BI tools, like Tableau and Microsoft Power BI regulators and decision-makers. ----- Ensure data security and compliance at scale Databricks is fully aware of the sensitivity of the data that many of our federal agencies are responsible for. From national security and defense data to individual health and financial information to national infrastructure and energy data — all of it is critical. Data is protected at every level of the platform through deep integration with fine-grained, cloud-provider access control mechanisms. The Databricks Lakehouse is a massively secure and scalable multicloud platform running millions of machines every day. It is independently audited and compliant with FedRAMP security assessment protocols on the Azure cloud and can provide a HIPAA-compliant deployment on both AWS and Azure clouds. The platform’s administration capabilities include tools to manage user access, control spend, audit usage, and analyze activity across every workspace, all while seamlessly enforcing user and data governance, at any scale. With complete AWS accreditation, Databricks runs across all major networks including GovCloud, SC2S, C2S and commercial; all networks, including public, NIPR, SIPR and JWICS; and ATOs, including FISMA, IL5, IL6, ICD 503 INT-A and INT-B. ----- **CUSTOMER STORY: U.S. CITIZENSHIP AND IMMIGRATION SERVICES** ### Streamlining the path to citizenship with data ##### 24x faster query performance ##### 10 minutes to process tables with 120 million rows ##### 40 million applications processed The U.S. 
Citizenship and Immigration Services (USCIS) gains actionable insights from dashboards via Tableau to better understand how to streamline operations and more quickly process immigration and employment applications as well as petitions. Today, their data analyst team has over 6,000 Tableau dashboards running, all powered by Databricks. The U.S. Citizenship and Immigration Services is the government agency that oversees lawful immigration to the United States. Over the last decade, the volume of immigration- and citizenship-related applications has skyrocketed across naturalizations, green cards, employment authorizations and other categories. With millions of applications and petitions flooding the USCIS, processing delays were reaching crisis levels, with overall case processing times increasing 91% since FY2014. ----- Processing delays fueled by on-premises, legacy architecture Core to these issues was an on-premises, legacy architecture that was complex, slow and costly to scale. By migrating to AWS and Databricks, USCIS adopted a unified approach to data analytics with more big data processing power and the federation of data across dozens of disparate sources. This has unlocked operational efficiencies and new opportunities for their entire data organization to drive business intelligence and fuel ML innovations designed to streamline application and petition processes. A new era of data-driven innovation improves operations USCIS now has the ability to understand their data more quickly, which has unlocked new opportunities for innovation. With Databricks, they are able to run queries in 19 minutes, something that used to take an entire day: a 24x performance gain. This means they are spending far less time troubleshooting and more time creating value. 
Removing complexities with a fully managed cloud platform Since migrating to the cloud and integrating Databricks into their data analytics workflows, USCIS has been able to make smarter decisions that help streamline processes and leverage ML to reduce application processing times. These newfound efficiencies and capabilities have allowed them to scale their data footprint from about 30 data sources to 75 without issue. Databricks provided USCIS with significant impact where it mattered most — faster processing speeds that enabled data analysts to deliver timely reports to decision-makers — and that freed up data scientists to build ML models to help improve operations. Leveraging the efficiencies of the cloud and Delta Lake, they were able to easily provision a 26-node cluster within minutes and ingest tables with 120 million rows into S3 in under 10 minutes. Prior to Databricks, performing the same processes would have taken somewhere between two and three hours. “We discovered Databricks, and the light bulb really clicked for us on what we needed to do moving forward to stay relevant.” **SHAWN BENJAMIN** **CHIEF OF DATA AND BUSINESS INTELLIGENCE, USCIS** ----- ### Conclusion Enabling federal agencies to take advantage of data analytics and AI will help them execute their missions both effectively and efficiently. The Databricks Lakehouse Platform will unify data, analytics and AI workloads, making agencies data-driven and giving policymakers access to deeper, more meaningful insights for decision-making. It will also eliminate data silos and increase communication and collaboration across agencies to ensure the best results for all citizens. ----- ### About Databricks Databricks is the data and AI company. More than 5,000 organizations worldwide — including Comcast, Condé Nast, H&M, and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI.
Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. Get started with a free trial of Databricks and start building data applications today **START YOUR FREE TRIAL** To learn more, visit us at: **dbricks.co/federal** -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Data-AI-in-Fed-Gov-Ebook.pdf2024-09-19T16:57:19Z**eBook** # Cybersecurity in Financial Services ### Protecting financial institutions with advanced analytics and AI ----- ## Contents **•** The State of the Industry **•** A New Commitment to Cybersecurity **•** The Biggest Challenge With Security Analytics **•** Journey of SecOps: Destination Lakehouse **•** Rethinking Cybersecurity in Financial Services With Databricks Lakehouse **•** Lakehouse in Financial Services **•** Lakehouse and SIEM: The Pattern for Cloud-Scale Security Operations **•** Common Use Cases **•** Getting Started With Databricks for Cybersecurity
----- **INTRODUCTION** ## The State of the Industry Cloud, cost and complexity of customer data and cybersecurity are top of mind for every financial services security leader today. As financial services institutions (FSIs) continue to accelerate their digital transformation, cybercriminals, fraudsters and state-sponsored actors continue with more sophisticated threats. The impact of these attacks ranges from the exposure of highly sensitive data to the disruption of services and the exploitation of backdoors for future attacks — all resulting in both financial and non-financial costs. Responding quickly to potential threats requires security tools capable of analyzing billions of threat signals in real-time. Recently, it seems like every week reveals a new data breach or ransomware assault, and the cost is skyrocketing: more than $4 million per incident, up 10 percent from 2020, and about $401 million for a substantial [breach at a large corporation](https://www.ibm.com/security/data-breach). **Cybersecurity is no longer just a back-office cost and now poses critical business risks, such as:** **•** Operational disruption **•** Material customer loss **•** Increase in insurance premiums **•** Lawsuits or fines **•** Systemic destabilization **•** Credit downgrade **•** Reputational damage Source: Navigating Cyber 2022, FS-ISAC, Annual Cyber Threat Review and Predictions ----- ## A New Commitment to Cybersecurity It comes as no surprise that in recent years FSIs have seen an amplified commitment to cybersecurity.
As business leaders look to new solutions, large portions of IT budgets are now devoted to leveraging data and AI to thwart cyberattacks. Furthermore, regulators are taking notice of the increased risk of cybersecurity threats. Growing geopolitical tensions have also prompted federal agencies such as the Cybersecurity and Infrastructure Security Agency and the Federal Bureau of Investigation [to warn](https://www.wsj.com/livecoverage/russia-ukraine-latest-news-2022-04-05/card/banks-haven-t-seen-rise-in-cyberattacks-from-russia-yet-p3F5ebzAhTauVjsNx46E) that “tough sanctions imposed on Russia could prompt a spate of cyberattacks against critical infrastructure such as banks.” Additionally, the Securities and Exchange Commission released its [2022 Exam Priorities](https://www.sec.gov/news/press-release/2022-57), which include information security, and specifically “how firms are safeguarding their customers’ records and assets from cyber threats, including oversight of third-party providers, identification of red flags related to identity theft, response to incidents, including to ransomware attacks and management of operational risk in light of ‘a dispersed workforce.’” However, as is often the case, implementing new cybersecurity strategies and processes is easier said than done. **Cybersecurity needs a transformation: breaches, cost and complexity are growing** **•** 100% of organizations surveyed have had breaches. The average breach costs $4M. **•** 85% will increase their cyber budget next FY. The cybersecurity industry will grow to $366B by ’28. **•** 67% of organizations were breached at least three times. A mega breach costs $401M.
**Cost, Complexity, Cloud** - Hundreds of tools with expanding footprints - Data locked in vendor proprietary tools - Humans compensating for analytical and integration deficiencies In this eBook, we’ll take a closer look at the challenges associated with replacing the infrastructure of a legacy data analytics system, and how financial institutions are solving them with Databricks. ----- ## The Biggest Challenge With Security Analytics For many FSIs, on-premises security incident and event management (SIEM) technologies have been the go-to solution for threat detection, analysis and investigations. However, these legacy technologies were built for a world where big data was measured in gigabytes, not today’s terabytes or petabytes. This means that not only are legacy SIEMs unable to scale to today’s data volumes, but they are also unable to serve the modern, distributed enterprise. By now, the advantages of moving to the cloud are no secret to anyone. For FSIs, scalability, simplicity, efficiency and cost are absolutely essential components of success. Many within FinServ are looking to cloud computing to make this possible, adding detection and response in the cloud to the security team’s responsibility. Because legacy SIEMs predate the emergence of cloud, artificial intelligence and machine learning (AI/ML) in the mainstream, they’re unable to address the complex data and AI-driven analytics needed for threat detection, threat hunting, in-stream threat intelligence enrichment, analytical automation and analyst collaboration. In other words, legacy SIEMs are no longer suitable for the modern enterprise or the current threat landscape. **Counting the Financial Cost of Legacy SIEMs** The financial cost of the continued use of legacy SIEMs continues to rise because most SIEM providers charge their customers based on the volume of data ingested. 
While some legacy technologies are available in the cloud, they’re either not designed to be cloud-native applications or confined to a single cloud service provider. As a result, security teams have to employ multiple tools for detection, investigation and response — or pay exorbitant egress charges for data transiting from one cloud provider to another. This causes operational slowdowns, errors driven by complexity, and inconsistent implementation of security policies. A lack of support for multiple clouds also means an increase in maintenance overhead. Security staff members are often stressed because analysts have to learn different tools for different cloud platforms. For some, it also creates an implicit cloud vendor lock-in, meaning that security teams are unable to support missions because their tools are not portable across multiple cloud providers. Collectively, these drawbacks to legacy SIEMs result in a much weaker security posture for FSIs. ----- ## Journey of SecOps: Destination Lakehouse How did security analytics get to this point? In the early days, there was a need to aggregate alerts from antiviruses and intrusion detection systems. SIEMs were born, built on data warehouses, relational databases or NoSQL database management systems. But as incident investigation needs evolved, those data warehouses weren’t able to handle the volume and variety of data, which led to the development of data lakes. Data lakes were cost-effective and scalable but didn’t have strong data governance and data hygiene, earning them the moniker of “data swamps.” Simply integrating the two tech stacks is really complicated because of varying governance models, data silos and inconsistent use case support. Fast-forward to today, security teams now need AI/ML at scale in a multicloud world. Why choose one or the other? 
The lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all your threat data, analytics and AI in the cloud. The governance and transactional capabilities of the data warehouse, the scale and flexibility of a data lake, AI/ML from the ground up and multicloud native deployments in one platform – this is a modern architecture called the lakehouse (data lake and data warehouse). **Current Challenges** **•** **Cloud storage:** no support for analytics or investigations **•** **SIEMs:** no attack chaining; poor for high-cardinality search **•** **UBA tools:** no historical search, black box, proprietary storage **•** No SIEM/log solution is multicloud native **Introducing the Data Lakehouse** **•** Curated alerts **•** Cloud-scale search **•** ML/AI **•** Multicloud ----- ## Rethinking Cybersecurity in Financial Services With Databricks Lakehouse Databricks introduced the first data lakehouse platform to the industry, and today over 7,000 customers use it worldwide. With Databricks Lakehouse, FSIs that are ready to modernize their data infrastructure and analytics capabilities for better protection against cyber threats now have one cost-effective solution that addresses the needs of all their teams. The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses, delivering the low-cost, flexible object stores offered by data lakes and the data management and performance typically found in data warehouses. This unified platform simplifies existing architecture by eliminating the data silos that traditionally separate analytics, data science and ML. It’s built on open source, open data and open standards to maximize flexibility, and its inherent collaborative capabilities accelerate the ability to work across teams and innovate faster. Moreover, because it’s multicloud, it works the same way no matter which cloud provider is used.
[Figure labels: ETL and Enrichment; Proof Point, Firewall, Antivirus] ----- ## Lakehouse in Financial Services By unifying data with analytics and AI, Lakehouse allows FSIs to easily access all their data for downstream advanced analytics capabilities to support complex security use cases. Lakehouse facilitates collaboration between threat intelligence teams and cyber operations, enables security operations teams to detect advanced threats, and reduces human resource burnout through analytical automation and collaboration. Importantly, Lakehouse also accelerates investigations from days to minutes. Along with a more modern architecture, the Lakehouse Platform includes Delta Lake, which unifies all security data in a transactional data lake to feed advanced analytics. The analytics and collaboration are done in notebooks, and security teams can use multiple languages — SQL, Python, R and Scala — in the same notebook. This makes it easy for security practitioners to explore data and develop advanced analytics and reporting using their favorite methods. Additionally, a separation of compute from storage means performance at scale without impacting overall storage costs. ----- **CASE STUDY** **When It Comes to Security, Data Is the Best Defense*** **Protecting HSBC’s 40 million customers begins with collecting and processing data from billions of signals to make previously impossible threat detection possible** The old way of thinking about security — stronger locks, higher walls — is outdated and ineffective. “When defending an organization, too often we just focus heavily on tools, technology, and reactive scenarios,” said T.J. Campana, managing director of global defense and chief technology officer at HSBC, the multinational bank.
“But the security business is a data business. And the data always has a story to tell us.” The quality of security, he added, is proportional to the information that can be distilled from petabytes of data that endlessly flows through company networks. That means “empowering people to get the right insights, in the right way to quickly prevent, detect, and respond to threats, wherever and whenever they occur,” said George Webster, executive director of global cybersecurity science and analytics at HSBC. If a big organization is made up of tens of millions of parts that must click together seamlessly, security keeps those seals tight. Data gathering, analytical tools, and human intellect work together as one. This involves fusing the data science and security operation departments, creating an enhanced relationship that results in better defenses, insight into the security posture of the organization, and the ability to respond at the pace of the adversary. But working across years of data at petabyte scale is not an easy task, especially when a long time is measured in minutes and the adversary is constantly working against you. To put this in perspective, the security teams at HSBC intake 10 times the amount of data contained in all of the books in the U.S. Library of Congress every day, and must process months, if not years, of data at a time. That is where innovative design, smart people, and leveraging the right technology come into play. “We have to break the paradigm of the tool being the end goal of defense and instead view the tools as an enabler of our people,” said Webster. “It is always about the people,” added Campana. HSBC turned away from the common security paradigm by leveraging the big data processing techniques from Azure Databricks. In many ways, their open source Delta Lake is the key enabler, with Spark being the engine. Delta Lake allows these teams to structure, optimize, and unlock data at scale, while Spark allows multiple complex programs to seamlessly crunch through the data. This enables HSBC’s security teams to constantly evolve their defenses, create new capabilities at pace, and perform investigations that were previously impossible.
When a new threat emerges, the bank doesn’t have the luxury to wait for the security market to identify, respond, and mitigate. Instead, the bank turns to its people and creates what is needed at breathtaking speed. ----- **C A S E S T U D Y : C O N T I N U E D** It’s an essential function for HSBC, which needs to continually think about how to keep more than 40 million customers in 64 countries and territories safe. Taken together, it’s an all-brains-on-deck moment with data and people guiding the ship. It’s also a tall task for a company as massive and multifaceted as HSBC. Headquartered in the UK, it is one of the largest global banks (total assets: a whopping $2.968 trillion), with operations across Africa, Europe, Asia, and the Americas. It’s also the largest bank in Hong Kong and even prints some of the local currency, which bears the HSBC name. The bank’s cybersecurity approach involves fusing the data science and security operation departments, creating an enhanced relationship that results in more efficient threat discovery, rapid development of operational use cases and AI models. This enables the continuous creation of capabilities that stop adversaries before they even start. “We have to get out of the mindset that security is a walled garden,” said Webster. “We must create truly collaborative environments for our people to enable the business to operate,” said Campana. Staffing this symbiotic power center will be someone Campana optimistically calls “the analyst of the future,” a description that’s both mindset and skillset: threat hunter and data scientist. In addition, when another organization is hit by cybercrime, HSBC analyzes it to understand how it may have responded and then improves its defenses accordingly. That’s in contrast to the industry norm; a Ponemon survey revealed that 47 percent of organizations have not assessed the readiness of their incident response teams. 
That means the first time they test their plans will be at the worst possible time — in the middle of a cyber attack. The proactive approach is a far cry from the old reactive conveyor belt model of security when alert tickets were received from tooling and processed in a slow and linear way. Today, cross-disciplinary security teams don’t just react; they continually search for the signals in the noise — tiny aberrations that indicate something’s not right – and send up red flags in real-time. “We’re scanning hundreds of billions of signals per day. I cannot wait. We need situational awareness right now,” said Campana. That increased speed is critical for threat assessment. Information theft may be the most expensive and fastest-rising consequence of cybercrime, but data is not the only target. Core systems are being hacked in a dangerous trend to disrupt and destroy. Regulators are also increasingly asking banks for controls in place to detect and preempt financial crimes. That’s where big data tooling like Delta Lake and Spark shine, and where it will continually be called on to address the security needs of new initiatives. “Digital security is about organically adjusting to risks,” said Webster. “It’s a journey of continual discovery with one central goal: to protect customers. They want things easy and they want them quick. It’s our job to make sure that it’s secure.” *This story previously appeared in [WIRED Brand Lab for Databricks](https://www.wired.com/sponsored/story/when-it-comes-to-security-data-is-the-best-defense/) . ----- **Advantages of a Lakehouse** **A cost-efficient upgrade** Databricks customers only pay for the data they analyze, not for what they collect. This means that security teams can collect any amount of data without worrying about ingest-based pricing, and only pay for the data that’s actually used for analysis — for example, an incident investigation or a data call for an audit. 
This pricing model enables security teams to collect data that was previously out of reach, such as netflow data, endpoint detection and response data, and application and services data. Further, Databricks is a fully managed service, meaning that security teams don’t have to pre-commit to hardware capital expenditures. With no hardware to manage and no big data implementations to maintain, security teams can significantly reduce their management and maintenance costs. **Multicloud** Databricks is cloud-native on AWS, Microsoft Azure and Google Cloud. This creates freedom for the security teams to use whatever cloud provider they like. Additionally, teams can acquire and maintain operational consistency across all providers when they have multiple cloud footprints. This enables consistent policy implementation, reduced complexity for staff and increased efficiency. Additionally, Databricks enables faster detection, investigation and response across the enterprise because analytics can be reused across the major cloud providers through a unified platform that centralizes data for easy sharing and fosters collaboration across teams. **Enterprise security and** **360° risk management** The Lakehouse Platform is easy to set up, manage, scale and, most importantly, secure. This is because Lakehouse easily integrates with existing security and management tools, enabling users to extend their policies for peace of mind and greater control. With multicloud management, security admins and data teams get a consistent experience across all major cloud providers. This saves valuable time and the resources required to upskill talent on proprietary services for data, analytics and AI. Security, risk and compliance leaders are also able to give team members a range of security permissions that come with thorough audit trails. 
This allows teams to quickly spin up and wind down collaborative workspaces for any project and to manage use cases from end to end — from enabling user access and controlling spend to auditing usage and analyzing activity across every workspace to enforce user and data governance. ----- ## Lakehouse and SIEM: The Pattern for Cloud-Scale Security Operations According to George Webster, head of cybersecurity sciences and analytics at HSBC, Lakehouse and SIEM is the pattern for security operations. What does it look like? It leverages the strengths of the two components: Lakehouse for multicloud native storage and analytics, SIEM for security operations workflows. For Databricks customers like HSBC, there are two general patterns for this integration that are both underpinned by what Webster calls the cybersecurity data lake with Lakehouse. In the first pattern, Lakehouse stores all the data for the maximum retention period. A subset of the data is then sent to the SIEM and stored for a fraction of the time. This pattern has the advantage of allowing analysts to query near-term data using the SIEM while having the ability to do historical analysis and more sophisticated analytics in Databricks. It also lets them manage any licensing or storage costs for the SIEM deployment. The second pattern is to send the highest-volume data sources to Databricks — for example, cloud-native logs, endpoint threat detection and response logs, DNS data and network events. Low-volume data sources such as alerts, e-mail logs and vulnerability scan data go to the SIEM. This pattern enables Tier 1 analysts to quickly handle high-priority alerts in the SIEM. Threat-hunt teams and investigators can leverage the advanced analytical capabilities of Databricks. This pattern has a cost-benefit of offloading processing, ingestion and storage from the SIEM. 
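The two integration patterns above can be sketched as simple routing rules. This is a minimal illustration only, not Databricks or Splunk code: the `LogSource` type, the function names, the sample volumes and the 500 GB/day threshold are all hypothetical and chosen just to make the logic concrete.

```python
from dataclasses import dataclass

# Hypothetical cutoff: sources at or above this daily volume count as "high volume".
HIGH_VOLUME_GB_PER_DAY = 500.0


@dataclass(frozen=True)
class LogSource:
    """A security data source and its (illustrative) daily volume."""
    name: str
    gb_per_day: float


def route_pattern_one(source, siem_subset):
    """Pattern 1: every source lands in the lakehouse for full retention;
    a chosen subset is additionally forwarded to the SIEM."""
    destinations = ["lakehouse"]
    if source.name in siem_subset:
        destinations.append("siem")
    return destinations


def route_pattern_two(source, threshold=HIGH_VOLUME_GB_PER_DAY):
    """Pattern 2: high-volume sources go to the lakehouse,
    low-volume sources go to the SIEM."""
    return ["lakehouse"] if source.gb_per_day >= threshold else ["siem"]


if __name__ == "__main__":
    # Sample volumes are made up for the example.
    sources = [
        LogSource("dns", 2000.0),
        LogSource("endpoint_detection", 1500.0),
        LogSource("alerts", 5.0),
        LogSource("vulnerability_scans", 10.0),
    ]
    for src in sources:
        print(src.name, route_pattern_one(src, {"alerts"}), route_pattern_two(src))
```

In practice the split would depend on SIEM licensing, retention requirements and analyst workflows rather than a single volume threshold.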
----- **Databricks and Splunk: A Case Study in Cost-Savings** Databricks integrates with your preferred SIEM, like Splunk, and the Splunk-certified Databricks add-on can be used to meet SOC needs without changing the user interface. This example features a global financial institution’s security operation, where the organization grew throughput from 25TB per day with only 180 days lookback, to 100TB per day with 395 days lookback using the Databricks SIEM augmentation. The total cost of ownership savings, including infrastructure and license costs, amounted to tens of millions of dollars (more than $80 million per year) in cloud costs. ##### FinServ Security Operations: Databricks + Splunk Drastically Lowered Costs [Chart: throughput of 25 TB per day with Splunk only vs. 100 TB per day with Splunk + Databricks; lookback period of 180 days with Splunk only vs. 395 days with Splunk + Databricks.] TCO savings with Splunk and Databricks vs. Splunk-only solution: $81M ----- ## Common Use Cases As FSIs focus on modernizing their data analytics and warehousing capabilities, the Databricks Lakehouse Platform brings a new level of empowerment to FSIs, allowing them to unlock the full potential of their data to deliver on their objectives and better serve their customers.
**Common use cases include:** **•** **Threat hunting:** Empower security teams to proactively detect and discover advanced threats using months or years of data **•** **Incident investigation:** Gain complete visibility across network, endpoint, cloud and application data to respond to incidents **•** **Phishing threat detection:** Uncover social engineering attacks that are often used to steal user data, including log-in credentials and credit card numbers **•** **Supply chain monitoring:** Leverage ML to identify suspicious behavior within your software supply chain **•** **Ransomware detection:** Scope the impact and spread of ransomware attacks to inform complete mitigation and remediation **•** **Credentials-abuse detection:** Identify and investigate anomalous credential usage across your infrastructure **•** **Insider-threats detection:** Find and respond to malicious threats from people within an organization who have inside information about security practices, data and computer systems **•** **Network traffic analysis:** Examine real-time network availability and activity to identify anomalies, vulnerabilities and malware **•** **Analytics automation:** Automatically contextualize and enrich multiple streaming and batch analytics to accelerate analyst workflows and decision-making **•** **Augmenting anti-money laundering practices** **(AML):** Using structured and unstructured data to maintain a list of politically exposed individuals, often referred to as PEP, to augment a bank’s AML processes. This includes pulling data from an organization externally (keeping the PEP list up-to-date including out-of-country officials and diplomats) as well as internally (including critical personnel, network admins, etc.) who need extra scrutiny. ----- ## Getting Started With Databricks for Cybersecurity Getting up and running on Databricks to address your cybersecurity needs is easy with our Solution Accelerators. 
Databricks Solution Accelerators are highly optimized, fully functional analytics solutions that provide customers with a fast start to solving their data problems. **•** [Cybersecurity analytics and AI at scale with Splunk and Databricks](https://databricks.com/solutions/accelerators/cybersecurity-analytics-and-ai) : Rapidly detect threats, investigate the impact and reduce risks with the Databricks add-on for Splunk **•** [Threat detection at scale with DNS analytics](https://databricks.com/blog/2020/10/05/detecting-criminals-and-nation-states-through-dns-analytics.html) : Recognize cybercriminals using DNS, threat intelligence feeds and ML Databricks Solution Accelerators are free. Join the hundreds of Databricks customers using Solution Accelerators to drive better outcomes in their businesses. If you’d like to learn more about how we are helping financial services institutions securely leverage data and AI, please visit us at [dbricks.co/fiserv](https://databricks.com/solutions/industries/financial-services) or reach out to us at [cybersecurity@databricks.com](mailto:cybersecurity%40databricks.com?subject=) . ----- ## About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, Acosta and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark,™ Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc) . 
#### Get started with a free trial of Databricks and start building data applications today **[START YOUR FREE TRIAL](https://databricks.com/try-databricks?itm_data=Homepage-HeroCTA-Trial)** ###### To learn more, visit us at: dbricks.com/fiserv -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks-eBook-finServ-cyber.pdf2024-09-19T16:57:20Z**EBOOK** ## Why the Data Lakehouse Is Your Next Data Warehouse ----- ### Contents **•** Preface **•** Introduction **•** Our Approach: The Databricks Lakehouse Platform **•** Introducing Databricks SQL: The Best Data Warehouse Is a Lakehouse (Why Databricks SQL?; Common use cases) **•** The Inner Workings of the Lakehouse (Part 1: Storage layer; Part 2: Compute layer; Part 3: Consumption layer) **•** Conclusion **•** Customer Stories
----- ### Preface Historically, data teams have had to resort to a bifurcated architecture to run traditional BI and analytics workloads, copying subsets of the data already stored in their data lake to a legacy data warehouse. Unfortunately, this led to the lock-in, high costs and complex governance inherent in proprietary architectures. Our customers have asked us to simplify their data architecture. We decided to accelerate our investments to do just that. We introduced [Databricks SQL](https://databricks.com/product/databricks-sql) to simplify and provide data warehousing capabilities and first-class support for SQL on the [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse), for all your existing tools. We use the term “lakehouse” to reflect our customers’ desire to combine the best of data warehouses and data lakes.
With the lakehouse, you can now establish one source of truth for all data and enable all workloads from AI to BI on one platform. And we want to provide you with ease of use and state-of-the-art performance at the lowest cost. **Reynold Xin** Original Creator of Apache Spark™, Co-founder and Chief Architect, Databricks This eBook covers how we went back to the drawing board to build Databricks SQL, the last mile of enabling data warehousing capabilities for your existing data lakes, as part of the Databricks Lakehouse Platform. ----- ### Introduction Most organizations operate their business with a complex data architecture that combines data warehouses and data lakes. For one thing, data lakes are great for machine learning (ML). They support open formats and a large ecosystem. But data lakes have poor support for business intelligence (BI) and suffer complex data quality problems. Data warehouses, on the other hand, are great for BI applications. But they have limited support for ML workloads, can’t handle natural language data, large-scale structured data, or raw, video, audio or image files, and are proprietary systems with only a SQL interface. As a result, data is moved around the organization through data pipelines and systems that create a multitude of data silos. A large amount of time is spent maintaining these pipelines and systems rather than creating new value from data, and downstream consumers struggle to get a single source of truth of the data due to the inherent siloing of data that takes place. The situation becomes very expensive, and decision-making speed and quality are negatively affected. Unifying these systems can be transformational in how we think about data. ##### The need for simplification It is time for a new data architecture that can meet both today’s and tomorrow’s needs. Without any compromise.
Advanced analytics and ML are among the most strategic priorities for data-driven organizations today, and the amount of unstructured data is growing exponentially. So it makes sense to position the data lake as the center of the data infrastructure. However, for this to be achievable, the data lake needs to adopt the strengths of data warehouses. The answer is the [lakehouse](https://databricks.com/blog/2021/05/19/evolution-to-the-data-lakehouse.html), an open data architecture enabled by a new open and standardized system design: one that implements data structure and data management features similar to those in a data warehouse, directly on the low-cost storage used for data lakes. **[DOWNLOAD NOW](https://databricks.com/p/ebook/building-the-data-lakehouse?utm_medium=paid+search&utm_source=google&utm_campaign=14925739153&utm_adgroup=133613202892&utm_content=ebook&utm_offer=building-the-data-lakehouse&utm_ad=552195081555&utm_term=data%20lakehouse%20databricks&gclid=Cj0KCQiAzMGNBhCyARIsANpUkzPYW8MmlNjO9tOWa_35rFFe7Jti32z5Debcr_nG5QU_1-GEuznzUy8aAm-PEALw_wcB)** ##### Building the Data Lakehouse [Bill Inmon, Father of the Data Warehouse](https://databricks.com/p/ebook/building-the-data-lakehouse?utm_medium=paid+search&utm_source=google&utm_campaign=14925739153&utm_adgroup=133613202892&utm_content=ebook&utm_offer=building-the-data-lakehouse&utm_ad=552195081555&utm_term=data%20lakehouse%20databricks&gclid=Cj0KCQiAzMGNBhCyARIsANpUkzPYW8MmlNjO9tOWa_35rFFe7Jti32z5Debcr_nG5QU_1-GEuznzUy8aAm-PEALw_wcB) ----- ### Our Approach: The Databricks Lakehouse Platform Our customers have asked us for simplification. This is why we’ve embarked on this journey to deliver one simple, open and collaborative platform for all your data, AI and BI workloads on your existing data lakes.
The [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse) greatly simplifies data architectures by combining the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes. It’s built on open source and open standards to maximize flexibility, and lets you store all your data — structured, semi-structured and unstructured — in your existing data lake while still getting the data quality, performance, security and governance you’d expect from a data warehouse. Data only needs to exist once to support all of your data, AI and BI workloads on one common platform — establishing one source of truth. Finally, the Lakehouse Platform provides tailored and collaborative experiences so data engineers, data scientists and analysts can work together on one common platform across the entire data lifecycle — from ingestion to consumption and the serving of data products — and innovate faster. Let’s look at how, with the right data structures and data management capabilities in place, we can now deliver data warehouse and analytics capabilities on your lakehouse. That’s where Databricks SQL (DB SQL) comes in. **[DISCOVER LAKEHOUSE](https://databricks.com/discoverlakehouse)** ----- ### Introducing Databricks SQL: The Best Data Warehouse Is a Lakehouse Databricks SQL is a serverless data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice — no vendor lock-in. Reduce resource management overhead with serverless compute, and easily ingest, transform and query all your data in place to deliver real-time business insights faster. In fact, DB SQL now holds the new world record in 100TB TPC-DS, the gold standard performance benchmark for data warehousing. 
Built on open standards and APIs, the lakehouse provides an open, simplified and multicloud architecture that brings the best of data warehousing and data lakes together, and integrations with a rich ecosystem for maximum flexibility. ##### Why Databricks SQL? Best Price/Performance Lower costs, get world-class performance, and eliminate the need to manage, configure or scale cloud infrastructure with serverless. Built-In Governance Establish one single copy for all your data using open standards, and one unified governance layer across all data teams using standard SQL. Rich Ecosystem Use SQL and any tool like Fivetran, dbt, Power BI or Tableau along with Databricks to ingest, transform and query all your data in place. Break Down Silos Empower every analyst to access the latest data faster for downstream real-time analytics, and go effortlessly from BI to ML. **[WATCH A DEMO](https://databricks.com/discover/demos/databricks-sql)** ----- ### Common use cases Thousands of customers like [Atlassian](https://www.google.com/search?q=atlassian+databricks+keynote&oq=atlassian+databricks+keynote&aqs=chrome..69i57j69i60j69i65l3j69i60j69i64l2.6409j0j1&sourceid=chrome&ie=UTF-8#:~:text=12%3A26,May%2026%2C%202021) , [SEGA](https://youtu.be/SzeXHcwPDSE) and [Punchh](https://databricks.com/customers-4/punchh) are using Databricks SQL to enable self-served analytics for hundreds of analysts across their organizations, and to build custom data applications to better serve their customers. Below are some examples of use cases for Databricks SQL. 
**Query data lake data with your BI tools of choice** Enable business analysts to directly query data lake data using their favorite BI tool and avoid data silos. Reengineered and optimized connectors ensure fast performance, low latency and high user concurrency to your data lake. Now analysts can use the best tool for the job on one single source of truth for your data. **Collaboratively explore the freshest data** Empower every analyst and SQL professional in your organization to quickly find and share new insights by providing them with a collaborative and self-served analytics experience. Confidently manage data permissions with fine-grained governance, share and reuse queries, and quickly analyze and share results using interactive visualizations and dashboards. **Build rich and custom data applications** Build more effective and tailored data applications for your own organization or your customers. Benefit from the ease of connectivity, management and better price/performance of DB SQL to simplify development of data-enhanced applications at scale, all served from your data lake. **At Atlassian, we have proven that there is no longer a need for two separate data things. Technology has advanced far enough for us to consider one single unified lakehouse architecture.** **Rohan Dhupelia** Data Platform Senior Manager, Atlassian ----- ### The Inner Workings of the Lakehouse In the next chapter, we’ll unpack the three foundational layers of the Databricks Lakehouse Platform and how we went back to the drawing board to build this experience. Specifically, we’ll dive into how we built Databricks SQL to deliver analytics and data warehousing workloads on your lakehouse.
Those layers are: **1.** The storage layer, or how we store and govern data **2.** The compute layer, or how we process queries **3.** The consumption layer, or the tools you can use to interface with the system ###### PART 1: STORAGE LAYER To bring together the best of data lakes and data warehouses, we needed to support the openness and flexibility of data lakes, as well as the quality, performance and governance you’d expect from a data warehouse. **Storage layer attributes: data lake vs. data warehouse vs. data lakehouse** |Data Lake|Data Warehouse|Data Lakehouse| |---|---|---| |Open format|Closed, proprietary format|Open format| |Low quality, “data swamp”|High-quality, reliable data|High-quality, reliable data| |File-level access control|Fine-grained governance (table row/column level)|Fine-grained governance (table row/column level)| |All data types|Structured only|All data types| |Requires manually specifying how to lay out data|Automatically lays out data to query efficiently|Automatically lays out data to query efficiently| ----- ##### Transactional guarantees for your data lake The open source format [Delta Lake](https://delta.io/), based on Parquet, solves historical data lake challenges around data quality and reliability. It is the foundation for the lakehouse, and Databricks SQL stores and processes data using Delta Lake. For example, it provides ACID transactions to ensure that every operation either fully succeeds or fully aborts for later retries, without requiring new data pipelines to be created. It unifies batch and streaming pipelines so you can easily merge existing and new data at the speed required for your business. With Time Travel, Delta Lake automatically records all past transactions, so it’s easy to access and use previous versions of your data for compliance needs or for ML applications. Advanced indexing, caching and auto-tuning allow optimization of Delta tables for the best query performance.
Delta Lake also acts as the foundation for fine-grained, role-based access controls on the lakehouse. As a result, Delta Lake allows you to treat tables in Databricks SQL just like you treat tables in a database: updates, inserts and merges can take place with high performance at the row level. This is particularly useful if you are inserting new data rapidly (e.g., in IoT or e-commerce use cases), or if you are redacting data (e.g., for compliance laws such as GDPR). Furthermore, Delta Lake provides you with one open and standard format, not only for SQL but also for Python, Scala and other languages, so you can run all analytical and ML use cases on the same data. [Figure: Delta Lake provides the key. An open format storage layer built for lake-first architecture: ACID transactions, Time Travel, highly available; advanced indexing, caching, auto-tuning; fine-grained, role-based access controls; streaming & batch, analytics & ML; Python, SQL, R, Scala. Delta Lake brings data quality, performance and governance to the lakehouse.] **[DOWNLOAD NOW](https://databricks.com/p/ebook/delta-lake-the-definitive-guide-by-oreilly)** ##### Delta Lake: The Definitive Guide [by O’Reilly](https://databricks.com/p/ebook/delta-lake-the-definitive-guide-by-oreilly) ----- ##### A framework for building a curated data lake With the ability to ingest petabytes of data with auto-evolving schemas, Delta Lake helps turn raw data into actionable data by incrementally and efficiently processing data as it arrives from files or streaming sources like Kafka, Kinesis, Event Hubs, DBMS and NoSQL. It can also automatically and efficiently track data as it arrives with no manual intervention, as well as infer schema, detect column changes for structured and unstructured data formats, and prevent data loss by rescuing data columns that don’t meet data quality specifications.
And now with [Partner Connect](https://www.databricks.com/partnerconnect), it’s never been easier to bring in critical business data from various sources. As you refine the data, you can add more structure to it. Databricks recommends the Bronze, Silver and Gold pattern. It lets you easily merge and transform new and existing data, in batch or streaming, while benefiting from the low-cost, flexible object storage offered by data lakes. Bronze is the initial landing zone for the pipeline. We recommend copying data that’s as close to its raw form as possible to easily replay the whole pipeline from the beginning, if needed. Silver is where the raw data gets cleansed (think data quality checks), transformed and potentially enriched with external data sets. Gold is the production-grade data that your entire company can rely on for business intelligence, descriptive statistics, and data science/machine learning. By the time you get to Gold, the tables are high-value business-level metrics that have all the schema enforcement and constraints applied. This way, you can retain the flexibility of the data lake at the Bronze and Silver levels, and then use the Gold level for high-quality business data. [Figure: the Bronze, Silver and Gold pattern. Sources: Auto Loader, Structured Streaming, batch COPY INTO and partners. Bronze: raw ingestion and history. Silver: filtered, cleaned and augmented. Gold: business-level aggregates.] **[LEARN MORE](https://youtu.be/n9cRw6AkNDQ)** ----- ##### An aside on batch and streaming data pipelines The best way to set up and run data pipelines in the Bronze/Silver/Gold pattern recommended above is in Delta Live Tables (DLT). DLT makes it easy to build and manage reliable batch and streaming data pipelines that deliver high-quality data. It helps data engineering teams simplify ETL development and management with declarative pipeline development, automatic data testing, and deep visibility for monitoring and recovery.
The fact that you can run all your batch and streaming pipelines together in one simple, declarative framework makes data engineering easy on the Databricks Lakehouse Platform. We regularly talk to customers who have been able to reduce pipeline development time from weeks — or months — to mere minutes with Delta Live Tables. And by the way, even data analysts can easily interrogate DLT pipelines for the queries they need to run, without knowing any sort of specialized programming language or niche skills. One of the top benefits of DLT, and Delta Lake in general, is that it is built with streaming pipelines in mind. Today, the world operates in real time, and businesses are increasingly expected to analyze and respond to their data in real time. With streaming data pipelines built on DLT, analysts can easily access, query and analyze data with greater accuracy and actionability than with conventional batch processing. Delta Live Tables makes real-time analytics a reality for our customers. ----- ##### Fine-grained governance on the lakehouse Delta Lake is the foundation for open and secure [data sharing](https://databricks.com/blog/2021/05/26/introducing-delta-sharing-an-open-protocol-for-secure-data-sharing.html) and governance on the lakehouse. It underpins the [Databricks Unity Catalog](https://databricks.com/product/unity-catalog) (in preview), which provides fine-grained governance across clouds, data and ML assets. Among the benefits of the Unity Catalog, it allows you to: **• Discover, audit and govern data assets in one place:** A user-friendly interface, automated data lineage across tables, columns, notebooks, workflows and dashboards, role-based security policies, table or column-level tags, and central auditing capabilities make it easy for data stewards to discover, manage and secure data access to meet compliance and privacy needs directly on the lakehouse. 
**• Grant and manage permissions using SQL:** Unity Catalog brings fine-grained centralized governance to data assets across clouds through the open standard SQL DCL. This means database administrators can easily grant permission to arbitrary, user-specific views, or set permissions on all columns tagged together, using familiar SQL. **• Centrally manage and audit shared data across organizations:** Every organization needs to share data with customers, partners and suppliers to better collaborate and to unlock value from their data. Unity Catalog builds on open source [Delta Sharing](http://delta.io/sharing) to centrally manage and govern shared assets within and across organizations. The Unity Catalog makes it easy for data stewards to discover, manage and secure data access to meet compliance and privacy needs on the lakehouse. **[LEARN MORE](https://databricks.com/blog/2021/05/26/introducing-databricks-unity-catalog-fine-grained-governance-for-data-and-ai-on-the-lakehouse.html)** ----- ###### PART 2: COMPUTE LAYER The next layer to look at is the compute layer, or how we process queries. Apache Spark™ has been the de facto standard for data lake compute. It’s great for processing terabytes and petabytes of data cheaply, but historically Spark SQL used a nonstandard syntax and could be difficult to configure. Data warehouses, on the other hand, tend to support short running queries really well, especially when you have a lot of users issuing queries concurrently. They tend to be easier to set up, but they don’t necessarily scale, or they become too costly.
**Compute layer attributes: data lake vs. data warehouse vs. data lakehouse** |Data Lake|Data Warehouse|Data Lakehouse| |---|---|---| |High performance for large jobs (TBs to PBs)|High concurrency|High performance for large jobs (TBs to PBs)| |Economical|Scaling is exponentially more expensive|Economical| |High operational complexity|Ease of use|Ease of use| A popular belief is that large workloads require a drastically different system than low latency, high concurrency workloads. For example, there’s the classic trade-off in computer systems between latency and throughput. But after spending a lot of time analyzing these systems, we found that it was possible to simultaneously improve large query performance, concurrency and latency. Although the classic trade-offs definitely existed, they only became explicit once a system was optimized close to its theoretical optimum. It turned out that the vast majority of software, including all data warehouse systems and Databricks, was far from optimal. ----- ##### Simplified administration and instant, elastic SQL compute, decoupled from storage To achieve world-class performance for analytics on the lakehouse, we chose to completely rebuild the compute layer. But performance isn’t everything. We also want it to be simple to administer and cheaper to use. Databricks SQL leverages serverless SQL warehouses that let you get started in seconds, and it’s powered by a new native MPP vectorized engine: Photon. Databricks SQL warehouses are optimized and elastic SQL compute resources. Just pick the cluster size and Databricks automatically determines the best instance types and VM configuration for the best price/performance. This means you don’t have to worry about estimating peak demand or paying too much by overprovisioning. You just need to click a few buttons to operate. To further streamline the experience, simply use [Databricks SQL Serverless](https://databricks.com/blog/2021/08/30/announcing-databricks-serverless-sql.html).
With the serverless capability, queries start rapidly with zero infrastructure management or configuration overhead. This lowers your total cost, as you pay only for what you consume without idle time or overprovisioned resources. Since CPU clock speeds have plateaued, we also wanted to find new ways to process data faster, beyond raw compute power. One of the most impactful methods has been to improve the amount of data that can be processed in parallel. However, data processing engines need to be specifically architected to take advantage of this parallelism. So, from the ground up, we built [Photon](https://databricks.com/product/photon) , a new C++ based vectorized query processing engine that dramatically improves query performance while remaining fully compatible with open Spark APIs. Databricks SQL warehouses are powered by Photon, which seamlessly coordinates work and resources and transparently accelerates portions of your SQL queries directly on your data lake. No need to move the data to a data warehouse. **[READ NOW](https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf)** ##### Photon: A Fast Query Engine for Lakehouse Systems [SIGMOD 2022 Best Industry Paper Award](https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf) ----- **Did you know?** Databricks SQL warehouses scale automatically throughout the day to better suit your business needs. Administration is simplified by identifying how many clusters can scale out with min and max, and Databricks SQL will auto-scale as needed. This ensures that you have ample compute to serve your needs, without overprovisioning. Administrators appreciate the ability to have better control over consumption costs, while users appreciate that their queries process as fast and efficiently as possible. For most BI and analytics use cases, using medium-size warehouses with scaling is a great balance of price/performance that fits most business needs. 
In the next section, we will discuss examples of Databricks SQL performance results on large-scale analytic workloads as well as highly concurrent workloads. ----- ##### Large query performance: the fastest data warehouse The industry standard benchmark used by data warehouses is TPC-DS. It includes 100 queries that range from very simple to very sophisticated to simulate decision support workloads. This benchmark was created by a committee formed by data warehousing vendors. The chart shows price/performance results running the 100TB version of TPC-DS, since for large workloads the numbers that ultimately matter pertain to the performance cost. As you can see, Databricks SQL outperforms all cloud data warehouses we have measured. **[LEARN MORE](https://dbricks.co/benchmark)** [Figure: 100TB TPC-DS price/performance benchmark (lower is better). Systems: Databricks SQL Spot, Databricks SQL On-Demand, Cloud Data Warehouse 1, Cloud Data Warehouse 2, Cloud Data Warehouse 3. Bar values shown: $1,791, $952, $242, $146 and $358, on an axis from $0 to $2,000.] **Did you know?** Databricks SQL has set a [new world record in 100TB TPC-DS](http://tpc.org/5013), the gold standard performance benchmark for data warehousing. Databricks SQL outperformed the previous record by 2.2x. And this result has been formally audited and reviewed by the TPC council. ----- ##### Highly concurrent analytics workloads Beyond large queries, it is also common for highly concurrent analytics workloads to execute over small data sets. To optimize concurrency, we used the same TPC-DS benchmark, but on a much smaller scale (10GB) and with 32 concurrent streams. We analyzed the results to identify and remove bottlenecks, and built hundreds of optimizations to improve concurrency. Databricks SQL now outperforms some of the best cloud data warehouses for both large queries and small queries with lots of users.
Real-world workloads, however, are not just about either large or small queries. Databricks SQL also provides intelligent workload management with a dual queuing system and highly parallel reads. [Figure: 10GB TPC-DS queries/hr at 32 concurrent streams (higher is better), Cloud DW X vs. SQL warehouse X-L size, for July 2020, Jan 2021 and Oct 2022. Values shown: 16,523, 12,248, 4,672 and 11,690, with a ~3X annotation.] ----- ##### Intelligent workload management with smart queuing system Real-world workloads typically include a mix of small and large queries. Therefore the smart queuing and load balancing capabilities of Databricks SQL need to account for that too. Databricks SQL uses a smart dual queuing system (in preview) that prioritizes small queries over large, as analysts typically care more about the latency of short queries than large ones. ##### Highly parallel reads with improved I/O performance It is common for some tables in a lakehouse to be composed of many files, for example in streaming scenarios such as IoT ingest when data arrives continuously. In legacy systems, the execution engine can spend far more time listing these files than actually executing the query. Our customers told us they do not want to sacrifice performance for data freshness. With async and highly parallel I/O, when executing a query, Databricks SQL now automatically reads the next blocks of data from cloud storage while the current block is being processed. This considerably increases overall query performance on small files (by 12x for 1MB files) and “cold data” (data that is not cached) use cases as well. **[LEARN MORE](https://databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html)** ----- ###### PART 3: CONSUMPTION LAYER The third layer of the Databricks Lakehouse Platform would similarly have to bridge the best of both data lakes and data warehouses.
In the lakehouse, you would have to be able to work seamlessly with your tools of choice, whether you are a business analyst, data scientist, or ML or data engineer. The lakehouse must treat Python, Scala, R and SQL programming languages and ecosystems as first-class citizens to truly unify data engineering, ML and BI workloads in one place. **Consumption layer attributes: data lake vs. data warehouse vs. data lakehouse** |Data Lake|Data Warehouse|Data Lakehouse| |---|---|---| |Notebooks (great for data scientists)|Lack of support for data science/ML|Notebooks (great for data scientists)| |Openness with rich ecosystem (Python, R, Scala)|Limited to SQL only|Openness with rich ecosystem (Python, R, Scala)| |BI/SQL not 1st-class citizen|BI/SQL 1st-class citizen|BI/SQL 1st-class citizen| ----- ##### A platform for your tools of choice At Databricks we believe strongly in open platforms and meeting our customers where they are. We work very closely with a large number of software vendors to make sure you can easily use your tools of choice on Databricks, like [Tableau](https://databricks.com/blog/2021/05/07/improved-tableau-databricks-connector-with-azure-ad-authentication-support.html), [Power BI](https://databricks.com/blog/2021/02/26/announcing-general-availability-ga-of-the-power-bi-connector-for-databricks.html) or [dbt](https://databricks.com/blog/2021/12/06/deploying-dbt-on-databricks-just-got-even-simpler.html). With [Partner Connect](https://www.databricks.com/partnerconnect), it’s easier than ever to connect with your favorite tools, easier to get data in, easier to authenticate using single sign-on, and of course, with all the concurrency and performance improvements, we make sure that the direct and live query experience is great.
[Figure: supported partner tools, plus any other Apache Spark-compatible client.] **Now more than ever, organizations need a data strategy that enables speed and agility to be adaptable. As organizations are rapidly moving their data to the cloud, we’re seeing growing interest in doing analytics on the data lake. The introduction of Databricks SQL delivers an entirely new experience for customers to tap into insights from massive volumes of data with the performance, reliability and scale they need. We’re proud to partner with Databricks to bring that opportunity to life.** **Francois Ajenstat** Chief Product Officer, Tableau ----- ##### Faster BI results retrieval with Cloud Fetch Once query results are computed, cloud data warehouses often collect and stream back results to BI clients on a single thread. This can create a bottleneck and greatly slow down the experience if you are fetching anything more than a few megabytes of results in size. To provide analysts with the best experience from their favorite BI tools, we also needed to speed up how the system delivers results to BI tools like Power BI or Tableau once computed. That’s why we’ve reimagined this approach with a new architecture called [Cloud Fetch](https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html). For large results, Databricks SQL now writes results in parallel across all of the compute nodes to cloud storage, and then sends the list of files using pre-signed URLs back to the client. The client can then download all the data from cloud storage in parallel. This approach provides up to 10x performance improvement in real-world scenarios. [Figure: Cloud Fetch architecture (SQL endpoint, cluster, cloud storage, parallel data transfers) alongside a customer benchmark of a Tableau extract. Cloud Fetch enables faster, higher-bandwidth connectivity to and from your BI tools.]
**[LEARN MORE](https://databricks.com/blog/2021/08/11/how-we-achieved-high-bandwidth-connectivity-with-bi-tools.html)** ----- ##### A first-class SQL development experience In addition to supporting your favorite tools, we are also focused on providing a native first-class SQL development experience. We’ve talked to hundreds of analysts using various SQL editors like SQL Workbench every day, and worked with them to provide the dream set of capabilities for SQL development. For example, Databricks SQL now supports [standard ANSI SQL](https://databricks.com/blog/2021/11/16/evolution-of-the-sql-language-at-databricks-ansi-standard-by-default-and-easier-migrations-from-data-warehouses.html) , so you don’t need to learn a special SQL dialect. Query tabs allow you to work on multiple queries at once, autosave gives you peace of mind so you never have to worry about losing your drafts, integrated history lets you easily look at what you have run in the past, and intelligent auto-complete understands subqueries and aliases for a delightful experience. The built-in SQL query editor allows you to quickly explore available databases, query and visualize results. ----- Finally, with Databricks SQL, analysts can easily make sense of query results through a wide variety of rich visualizations and quickly build dashboards with an intuitive drag-and-drop interface. To keep everyone current, dashboards can be shared and configured to automatically refresh, as well as to alert the team to meaningful changes in the data. Easily combine visualizations to build rich dashboards that can be shared with stakeholders. ----- ### Conclusion Databricks SQL leverages open source standard [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) to turn raw data into actionable data, combining the flexibility and openness of data lakes with the reliability and performance of data warehouses. 
The Unity Catalog provides fine-grained governance on the lakehouse across all clouds using one friendly interface and standard SQL. Databricks SQL also holds the [new world record in 100TB TPC-DS](https://dbricks.co/benchmark) , the gold standard performance benchmark for data warehousing. It is powered by Photon, the new vectorized query engine for the lakehouse, and by SQL warehouses for instant, elastic compute decoupled from storage. Finally, Databricks SQL offers a native first-class SQL development experience, with a built-in SQL editor, rich visualizations and dashboards, and integrates seamlessly with your favorite BI- and SQL-based tools for maximum productivity. Databricks SQL under the hood. ----- ### Atlassian Atlassian is a leading provider of collaboration, development and issue-tracking software for teams. With over 150,000 global customers (including 85 of the Fortune 100), Atlassian is advancing the power of collaboration with products including Jira, Confluence, Bitbucket, Trello and more. USE CASE Atlassian uses the Databricks Lakehouse Platform to democratize data across the enterprise and drive down operational costs. Atlassian currently has a number of use cases focused on putting the customer experience at the forefront. **Customer support and service experience** With the majority of their customers being server-based (using products like Jira and Confluence), Atlassian set out to move those customers into the cloud to leverage deeper insights that enrich the customer support experience. **Marketing personalization** The same insights could also be used to deliver personalized marketing emails to drive engagement with new features and products. **Anti-abuse and fraud detection** They can predict license abuse and fraudulent behavior through anomaly detection and predictive analytics. ----- SOLUTION AND BENEFITS Atlassian is using the Databricks Lakehouse Platform to enable data democratization at scale, both internally and externally. 
They have moved from a data warehousing paradigm to standardization on Databricks, enabling the company to become more data driven across the organization. Over 3,000 internal users in areas ranging from HR and marketing to finance and R&D (more than half the organization) are accessing insights from the platform on a monthly basis via open technologies like Databricks SQL. Atlassian is also using the platform to drive more personalized support and service experiences to their customers.

- Delta Lake underpins a single lakehouse for PBs of data accessed by 3,000+ users across HR, marketing, finance, sales, support and R&D
- BI workloads powered by Databricks SQL enable dashboard reporting for more users
- MLflow streamlines MLOps for faster delivery
- Data platform unification eases governance, and self-managed clusters enable autonomy

With cloud-scale architecture, improved productivity through cross-team collaboration, and the ability to access all of their customer data for analytics and ML, the impact on Atlassian is projected to be immense. Already the company has:

- Reduced the cost of IT operations (specifically compute costs) by 60% through moving 50,000+ Spark jobs from EMR to Databricks with minimal effort and low-code change
- Decreased delivery time by 30% with shorter dev cycles
- Reduced data team dependencies by 70% with more self-service enabled throughout the organization

**[LEARN MORE](https://www.youtube.com/watch?v=Xo1U617T-mU)**

“At Atlassian, we need to ensure teams can collaborate well across functions to achieve constantly evolving goals. A simplified lakehouse architecture would empower us to ingest high volumes of user data and run the analytics necessary to better predict customer needs and improve the experience of our customers. A single, easy-to-use cloud analytics platform allows us to rapidly improve and build new collaboration tools based on actionable insights.”
**Rohan Dhupelia**, Data Platform Senior Manager, Atlassian

-----

### ABN AMRO

As an established bank, ABN AMRO wanted to modernize their business but were hamstrung by legacy infrastructure and data warehouses that complicated access to data across various sources and created inefficient data processes and workflows. Today, Azure Databricks empowers ABN AMRO to democratize data and AI for a team of 500+ engineers, scientists and analysts who work collaboratively on improving business operations and introducing new go-to-market capabilities across the company.

USE CASE

ABN AMRO uses the Databricks Lakehouse Platform to deliver financial services transformation on a global scale, providing automation and insight across operations.

**Personalized finance**
ABN AMRO leverages real-time data and customer insights to provide products and services tailored to customers’ needs. For example, they use machine learning to power targeted messaging within their automated marketing campaigns to help drive engagement and conversion.

**Risk management**
Using data-driven decision-making, they are focused on mitigating risk for both the company and their customers. For example, they generate reports and dashboards that internal decision makers and leaders use to better understand risk and keep it from impacting ABN AMRO’s business.

**Fraud detection**
With the goal of preventing malicious activity, they’re using predictive analytics to identify fraud before it impacts their customers. Among the activities they’re trying to address are money laundering and fake credit card applications.
-----

SOLUTION AND BENEFITS

Today, Azure Databricks empowers ABN AMRO to democratize data and AI for a team of 500+ engineers, scientists and analysts who work collaboratively on improving business operations and introducing new go-to-market capabilities across the company.

- Delta Lake enables fast and reliable data pipelines to feed accurate and complete data for downstream analytics
- Integration with Power BI enables easy SQL analytics and feeds insights to 500+ business users through reports and dashboards
- MLflow speeds deployment of new models that improve the customer experience, with new use cases delivered in under two months

“Databricks has changed the way we do business. It has put us in a better position to succeed in our data and AI transformation as a company by enabling data professionals with advanced data capabilities in a controlled and scalable way.”
**Stefan Groot**, Head of Analytics Engineering, ABN AMRO

#### 10x faster time to market, with use cases deployed in two months

#### 100+ use cases to be delivered over the coming year

#### 500+ empowered business and IT users

**[LEARN MORE](https://databricks.com/customers/abn-amro)**

-----

### SEGA Europe

“Improving the player experience is at the heart of everything we do, and we very much see Databricks as a key partner, supporting us to drive forward the next generation of community gaming.”
**Felix Baker**, Data Services Manager, SEGA Europe

SEGA® Europe, the worldwide leader in interactive entertainment, is using the Databricks Lakehouse Platform to personalize the player experience and build its own machine learning algorithm to help target and tailor games for over 30 million of its customers.

As housebound gamers looked to pass the time during the first lockdowns of 2020, some SEGA Europe titles, including Football Manager™, saw over double the number of sales during the first lockdown compared to the year before.
Furthermore, a number of SEGA titles experienced a more than 50% increase in players over the course of the COVID-19 pandemic. With more anonymized data being collected through an analytics pipeline than ever before, the team needed a dedicated computing resource to handle the sheer volume of data, extract meaningful insights from it and enable the data science team to improve general workflow.

**[LEARN MORE](https://www.youtube.com/watch?v=SzeXHcwPDSE)**

-----

### About Databricks

Databricks is the lakehouse company. More than 7,000 organizations worldwide (including Comcast, Condé Nast and over 50% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/).

**[START YOUR FREE TRIAL](https://databricks.com/try-databricks)**

Contact us for a personalized demo: **databricks.com/contact**

**[DISCOVER LAKEHOUSE](https://databricks.com/discoverlakehouse)**

-----

# Big Book of Data and AI Use Cases for the Public Sector

### Best practices, customer stories and solution templates for government agencies interested in building on the Lakehouse

-----

## Contents

- The State of Data and AI in the Government
- The Need for a Modern Data Architecture
- Introducing the Lakehouse for Public Sector
- Use case: Cybersecurity
- Use case: Predictive Maintenance
- Use case: Fraud Detection
- Use case: Money Laundering
- Use case: Entity Analytics
- Use case: Geospatial Analytics
- Use case: Public Health Management
- Conclusion

-----

## The State of Data and AI in the Government

Over the last decade, data and AI have redefined every industry on the planet. Retailers have improved the shopping experience with personalized recommendations, financial institutions have strengthened risk management through the use of advanced analytics, and the healthcare industry is tapping into the power of machine learning to predict and prevent chronic disease. The public sector is no exception. In 2018, the U.S.
Federal Government embarked on one of its most ambitious efforts since putting a man on the moon: embedding data into all aspects of decision-making. By enacting the Evidence-Based Policymaking Act of 2018, Congress set in motion requirements for agencies to modernize their data and analytics capabilities, including the appointment of agency-level chief data officers. A year later came the Federal Data Strategy, which provided further guidance for how agencies should manage and use data by 2030.

With all of this guidance, agencies are starting to make meaningful improvements to their data strategy, but when it comes to innovating with data, agencies still lag behind the private sector. This raises the question: what’s standing in the way? The hurdles aren’t due to a lack of effort on the part of agency leaders. In fact, they can largely be attributed to a patchwork of legacy technologies that have been amassed over the last 30 to 40 years. While these hurdles stand in the way, a number of innovative agencies are making significant progress as they embrace new data and AI capabilities.

-----

Federal spending on artificial intelligence rose to [nearly $1 billion](https://www.federaltimes.com/thought-leadership/2021/09/28/why-the-government-market-for-artificial-intelligence-technology-is-expanding/) in 2020, up 50% from 2018. There’s a good reason for this level of spend: Deloitte recently published a report, “AI-augmented Government,” that estimates the federal government could free up as many as 1.2 billion hours of work and save up to $41.1 billion annually through the use of AI-driven automation.

Early adopters of advanced analytics are starting to see the fruits of their labor. For example, [USCIS modernized their analytics stack](https://databricks.com/customers/uscis) on Databricks to accelerate insights on applicants by 24x, automate the processing of millions of applications, and reduce appointment no-show rates with predictive analytics.
The [Orange County Courts](https://www.govloop.com/how-a-california-county-court-elevated-data-driven-decision-making-for-the-state/) also recently shared how they are automating legacy paper-based workflows with machine learning.

In this eBook, we explore the hurdles of legacy technologies and how a modern data lakehouse can help agencies unlock innovative data and analytics use cases at all levels of government. Over the following seven example use cases, covering everything from cyber threat detection to improving public health, we demonstrate how the Databricks Lakehouse for Public Sector is critical to improving citizen services and delivering on mission objectives. This guide also includes resources in the form of Solution Accelerators, reference architectures and real-world customer stories to help as you embark on your own journey to drive a safer and more prosperous nation through the use of data and AI.

**An increased focus on cloud, analytics and AI = operational efficiency**

- **$1B**: U.S. Government Data and AI Research and Development Initiative
- **TOP PRIORITIES**: Government CIOs’ top game-changing technologies: 1. AI/ML 2. Data Analytics 3. Cloud
- **$41B+**: Estimated government savings from data-driven automation

-----

## The Need for a Modern Data Architecture

Government agencies are now turning to the cloud and modern data technologies to federate and make sense of their massive volumes of data. Building on that foundation, agencies are starting to adopt advanced analytics and AI to automate costly, outdated and resource-intensive operations as well as improve decision-making with predictive insights that can better keep pace with the dynamic needs of citizens and global communities. That being said, there are a number of barriers standing in their way.
##### Common challenges

Many government agencies are burdened with a legacy IT infrastructure that is built with on-premises data warehouses that are complex to maintain, are costly to scale as compute is coupled with storage, and lack support for unstructured data and advanced analytics. This severely inhibits data-driven innovation. Maintaining these systems requires a massive investment of both time and money compared to modern cloud-based systems and creates a number of avoidable challenges:

**Lack of reliability**
Siloed systems result in data replication as teams spin up new data marts to support their one-off use cases. Without a single source of truth, teams struggle with data inconsistencies, which can result in inaccurate analysis and model performance that is only compounded over time.

**Lack of agility**
Disjointed analytics tools and legacy infrastructure hinder the ability of teams to conduct real-time analytics. Most data processing in the government is done in weekly or daily batches, but decision-making needs to happen in real time. Critical events like cyber attacks and health pandemics can’t wait a week.

**Lack of citizen insights**
When data is siloed, teams get an incomplete view of the citizen, resulting in missed opportunities to improve the delivery of services that impact the quality of life for their constituents.

**Lack of productivity**
Data scientists and data analysts alike must have the right tool set to collaboratively investigate, extract and report meaningful insights from their data. Unfortunately, data silos lead to organizational silos, which make collaboration inside an agency as well as between agencies very difficult. With different groups of data teams leveraging their own coding and analytical tools, communicating insights and working across teams (let alone across agencies) is almost impossible. This lack of collaboration can drastically limit the capabilities of any data analytics or AI initiative.
-----

## Introducing the Lakehouse for Public Sector

The reason that the Databricks Lakehouse is able to deliver the simplicity, flexibility and speed that a government agency requires is that it fundamentally reimagines the modern data architecture. Databricks provides federal, state and local agencies with a cloud-native Lakehouse Platform that combines the best of data warehouses and data lakes to store and manage all your data for all your analytics workloads. With this modern architecture, agencies can federate all their data and democratize access for downstream use cases, empowering their teams to deliver on their mission objectives by unlocking the full potential of their data.

**Delivering real-time data insight in support of the mission**

- Fraud, Waste & Abuse
- Cybersecurity
- Medicaid Dashboards & Reporting
- Process Improvement
- Predictive Maintenance
- SCM & Demand Forecasting
- Smart Military/Sensor Data
- Military Health
- COVID Response/Decision Support
- Smart Cities/Connected Vehicles
- Citizen Engagement
- Data-Driven Decision-Making

-----

**Federate all of your agency’s data**
Any type of data can be stored because, like a data lake, the Databricks Lakehouse is built using the low-cost object storage supported by cloud providers. Leveraging this capability helps break down the data silos that hinder efforts to aggregate data for advanced analytics (e.g., predictive maintenance) or compute-intensive workloads like detecting cyber threats across billions of signals. Probably even more important is the ability of the lakehouse architecture to travel back in time, ensuring full audit compliance and high governance standards for analytics and AI.

**Power real-time decision-making**
Streaming use cases such as IoT analytics or disease spread tracking are simpler to support because the lakehouse uses Apache Spark™ as the data processing engine and Delta Lake as a storage layer.
With Spark, you can toggle between batch and streaming workloads with just a line of code. With Delta Lake, native support for ACID transactions means that you can deploy streaming workloads without the overhead of common reliability and performance issues. These capabilities make real-time analytics possible.

**Unlock collaborative analytics for all personas**
The Databricks Lakehouse for Public Sector is your one-stop shop for all your analytics and AI. The platform includes a business intelligence capability, Databricks SQL, that empowers data analysts to query and run reports against all of an agency’s unified data. Databricks SQL integrates with BI tools like Tableau and Microsoft Power BI and complements any existing BI tools with a SQL-native interface, allowing data analysts and data scientists to query data directly within Databricks and build powerful dashboards.

-----

**Deliver on your mission with predictive insights**
In the same environment, data scientists can build, share and collaborate on machine learning models for advanced use cases like fraud detection or geospatial analytics. Additionally, MLflow, an open source toolkit for managing the ML lifecycle, is built into the Lakehouse so data scientists can manage everything in one place. Databricks natively supports Python, R, SQL and Scala so practitioners can work together with the languages and libraries of their choice, reducing the need for separate tools. With these capabilities, data teams can turn insights from real-world data into powerful visualizations designed for machine learning. Visualizations can then be turned into interactive dashboards to share insights with peers across agencies, policymakers, regulators and decision-makers.
##### Customers That Innovate With Databricks Lakehouse for Public Sector

Some of the top government agencies in the world turn to the Databricks Lakehouse for Public Sector to bring analytics and AI-driven automation and innovation to the communities they serve.

-----

###### USE CASE:

## Cybersecurity

##### Overview

Cyberattacks from bad actors and nation states are a huge and growing threat to government agencies. Recent large-scale attacks like the ones on SolarWinds, log4j, Colonial Pipeline and HAFNIUM highlight the sophistication and increasing frequency of broad-reaching cyberattacks. Data breaches cost the federal government more than $4 million per incident in 2021 and threaten national security. Staying ahead of the next threat requires continuous monitoring of security data from an agency’s entire attack surface before, during and after an incident.

##### Challenges

**Scaling existing SIEM solutions**
Agencies looking to expand existing SIEM tools for today’s petabytes of data can expect increased licensing, storage, compute and integration resources resulting in tens of millions of dollars in additional costs per year.

**Rules-based systems**
Many legacy SIEM tools lack the critical analytics capabilities (such as advanced analytics, graph processing and machine learning) needed to detect unknown threat patterns or deliver on a broader set of security use cases like behavioral analytics.

**Limited window of data**
Given the high cost of storage, most agencies retain only a few weeks of threat data. This can be a real problem in scenarios where a perpetrator gains access to a network but waits months before doing anything malicious. Without a long historical record, security teams can’t analyze cyberattacks over long-term horizons or conduct deep forensic reviews.

##### Solution overview

For government agencies that are ready to modernize their security data infrastructure and analyze data at petabyte-scale more cost-effectively, Databricks provides an open lakehouse platform that augments existing SIEMs to help democratize access to data for downstream analytics and AI. Built on Apache Spark and Delta Lake, Databricks is optimized to process large volumes of streaming and historic data for real-time threat analysis and incident response. Security teams can query threat data going years into the past in just minutes and build ML models to detect new threat patterns and reduce false positives. Additionally, Databricks created a Splunk-certified add-on to augment Splunk for Enterprise Security (ES) for cost-efficient log and retention expansion.

-----

##### How to get started

[Solution Accelerator: Detect Criminal Threats Using DNS Analytics](https://databricks.com/blog/2020/10/05/detecting-criminals-and-nation-states-through-dns-analytics.html)
Detecting criminals and nation states through DNS analytics. In order to address common cybersecurity challenges such as deployment complexity, tech limitation and cost, security teams need a real-time data analytics platform that can handle cloud scale, analyze data wherever it is, natively support streaming and batch analytics, and have collaborative content development capabilities.

##### Customer story

**[WATCH THE VIDEO](https://www.youtube.com/watch?v=5BRGqxq4iQw)**

**Fighting Cyber Threats in Real Time**
Since partnering with Databricks, HSBC has reduced costs, accelerated threat detection and response, and improved their security posture. Not only can they process all of their required data, but they’ve also increased online query retention from just days to months at petabyte scale. HSBC is now able to execute 2-3x more threat hunts per analyst.
[Solution Accelerator: Databricks Add-On for Splunk](https://databricks.com/blog/2021/07/23/augment-your-siem-for-cybersecurity-at-cloud-scale.html)
Designed for cloud-scale security operations, the add-on provides Splunk analysts with access to all data stored in the Lakehouse. Bidirectional pipelines between Splunk and Databricks allow agency analysts to integrate directly into Splunk visualizations and security workflows.

-----

##### Reference architecture

-----

###### USE CASE:

## Predictive Maintenance

##### Overview

Predictive maintenance is oftentimes associated with the manufacturing sector, but in reality it extends far beyond the factory floor. Consider this for a moment: the U.S. Government operates a fleet of over [640,000 vehicles](https://www.government-fleet.com/301786/federal-vs-state-local-fleets) including public buses, postal delivery trucks, drones, helicopters and jet fighters. Many of these vehicles (like multimillion-dollar aircraft) contain sensors that generate massive amounts of data on the use and conditions of various components. And it’s not just vehicles. Modern public utilities stream data through connected IoT devices. All of this data can be analyzed to identify the root cause of a failure and predict future maintenance, helping to avoid costly repairs and keep critical assets from being out of service.

##### Challenges

**Managing IoT data at scale**
With billions of sensors generating information, most data systems are unable to handle the sheer volume of data. Before agencies can even start analyzing their data, legacy data warehouse–based tools require preprocessing of data, making real-time analysis impossible.

**Integrating unstructured data**
Equipment data doesn’t just come in the form of IoT data. Agencies can gather rich unstructured signals like audio, visual (e.g., video inspections) and text (e.g., maintenance logs). Most legacy data architectures are unable to integrate structured and unstructured data sources.

**Operationalizing machine learning**
Most agencies lack the advanced analytics tools needed to build models that can predict potential equipment failures. Those that do typically have their data scientists working in a siloed set of tools, resulting in unnecessary data replication and inefficient workflows.

##### Solution overview

The Databricks Lakehouse is tailor-made for building IoT applications at scale. With Databricks, agencies can easily manage large streaming volumes of small files, with ACID transaction guarantees and fewer job failures compared to traditional data warehouse architectures. Additionally, the Lakehouse is cloud native and built on Apache Spark, so scaling for petabytes of data is not an issue. With the Lakehouse, agencies can bring together all of their structured and unstructured data with a unified set of tooling for data engineering, model building and production rollout. With these capabilities, operations teams can quickly detect and act on pending equipment failures before they affect performance.

-----

##### How to get started

**Solution Accelerator: Predictive Maintenance**
Learn how to ingest real-time IoT data from field devices, perform complex time series processing in Delta Lake and leverage machine learning to build predictive maintenance models.
- [Part 1: Use case overview](https://databricks.com/blog/2020/08/03/modern-industrial-iot-analytics-on-azure-part-1.html)
- [Part 2: Ingest real-time IoT data and perform time series processing](https://databricks.com/blog/2020/08/11/modern-industrial-iot-analytics-on-azure-part-2.html)
- [Part 3: Using ML to predict maintenance](https://databricks.com/blog/2020/08/20/modern-industrial-iot-analytics-on-azure-part-3.html)

[Watch the Demo: Predictive Maintenance on Azure Databricks](https://vimeo.com/580864758/5a5bc42bb9)

##### Customer story

**[LEARN MORE](https://www.tallan.com/blog/client-stories/dc-water/)**

**Protecting the Water Supply for 700,000 Residents**
Utilizing machine learning for predictive analytics to help stop water main breaks before they occur, potentially saving hundreds of thousands of dollars in repairs while reducing service interruption.

-----

##### Reference architecture

[Figure: reference architecture. Sources: Weather Sensor Readings (semi-structured), Wind Turbine Telematics (semi-structured) and Maintenance Logs (unstructured), ingested via real-time streaming into the Databricks Lakehouse Platform. Stages: Bronze Layer, Silver Layer, Gold Layer. Steps and outputs: Append Raw Data, Merge Data, Join Streams and Analyze Data, Enriched Readings, Granular Readings, Aggregated Hourly Readings, Build Predictive Maintenance Model, Real-Time Dashboards for Optimizing Performance.]

-----

###### USE CASE:

## Fraud Detection

##### Overview

According to [McKinsey & Company](https://www.mckinsey.com/~/media/McKinsey/Industries/Public%20Sector/Our%20Insights/Cracking%20down%20on%20government%20fraud%20with%20data%20analytics/Cracking-down-on-government-fraud-with-data-analytics-vF.pdf), more than half of the federal government’s monetary losses to fraud, waste and abuse go undetected and total tens of billions of dollars.
Financial fraud comes in many forms, from individuals taking advantage of relief programs to complex networks of criminal organizations working together to falsify medical claims and rebate forms. Investigative teams hoping to stay ahead of fraudsters need advanced analytics techniques so they can detect anomalous behavior buried in a sea of data.

##### Challenges

**Lack of machine learning**
A rules-based approach is not enough. Bad actors are getting more and more sophisticated in how they take advantage of government programs, necessitating an AI-driven approach.

**Unreliable data**
Getting high-quality, clean data and maintaining a rich feature store is critical for identifying ever-evolving fraud patterns while maintaining a strict record of previous data points.

##### Solution overview

The Databricks Lakehouse enables teams to develop complex ML models with high governance standards and bridge the gap between data science and technology to address the challenge of analyzing large volumes of data at scale: 40 billion financial transactions a year are made in the United States alone. Additionally, Databricks makes it possible to combine modern AI techniques with the legacy rules-based methods that underpin current approaches to fraud detection, all within a common and efficient Spark-based orchestration engine.

##### How to get started

[Solution Accelerator: Fraud Detection](https://databricks.com/blog/2021/01/19/combining-rules-based-and-ai-models-to-combat-financial-fraud.html)
Due to an ever-changing landscape, building a financial fraud detection framework often goes beyond just creating a highly accurate machine learning model. Oftentimes it involves a complex decision science setup that combines a rules engine with a need for a robust and scalable machine learning platform. In this example, we show how to build a holistic fraud detection solution on Databricks using data from a financial institution.
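To make the idea of combining a rules engine with a machine learning model more concrete, here is a minimal Python sketch. It is not the Solution Accelerator's code: the `Transaction` fields, the two rules, the thresholds and the `model_score` function (a hand-written stand-in for a trained model) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Transaction:
    amount: float          # transaction amount in dollars (hypothetical field)
    country: str           # two-letter country code (hypothetical field)
    account_age_days: int  # days since the account was opened (hypothetical field)


# Hypothetical rule set: each rule returns True when a transaction looks suspicious.
RULES: List[Callable[[Transaction], bool]] = [
    lambda t: t.amount > 10_000,
    lambda t: t.account_age_days < 7 and t.amount > 1_000,
]


def model_score(t: Transaction) -> float:
    """Stand-in for a trained ML model; returns a fraud score in [0, 1]."""
    score = 0.1
    if t.country not in {"US", "CA"}:
        score += 0.4
    if t.amount > 5_000:
        score += 0.3
    return min(score, 1.0)


def decide(t: Transaction, threshold: float = 0.6) -> str:
    """Combine rule hits and the model score into a single decision."""
    rule_hits = sum(rule(t) for rule in RULES)
    score = model_score(t)
    if rule_hits and score >= threshold:
        return "block"    # rules and model agree
    if rule_hits or score >= threshold:
        return "review"   # only one signal fired: route to an investigator
    return "approve"
```

In this sketch, a transaction is blocked only when both signals agree and is sent for manual review when just one fires. A real deployment would replace `model_score` with a trained model and tune the rules and threshold against labeled data.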
**Analytics at scale**
Training complex ML models with hundreds of features on gigabytes of structured, semi-structured and unstructured data can be impossible without a highly scalable and distributed infrastructure.

-----

##### Customer story

**[WATCH THE VIDEO](https://www.youtube.com/watch?v=Ca1MMNpBSHM)**

**Identifying Financial Fraud at Scale**
Processes hundreds of billions of market events per day on the Databricks Lakehouse and uses the power of machine learning to identify illicit activity in near real-time.

##### Reference architecture

-----

###### USE CASE:

## Money Laundering

##### Overview

Approximately [$300 billion](https://home.treasury.gov/system/files/136/2018NMLRA_12-18.pdf) is laundered through the United States each year, and with criminal organizations, both at home and abroad, implementing increasingly sophisticated methods for laundering funds, it’s getting harder to stop. While the federal government continues to apply pressure on the financial sector through heightened regulation, more is needed to combat laundering. Modern AI techniques such as graph analytics and computer vision can be used to process different types of structured (e.g., financial transactions) and unstructured (e.g., real estate images) data and identify illicit behavior. This allows investigative teams to automate labor-intensive activities like confirming a residential address or reviewing transaction histories, and instead dig into priority threats.

##### Challenges

**Complex data science**
Modern anti-money laundering (AML) practices require multiple ML capabilities such as entity resolution, computer vision and graph analytics on entity metadata, which is typically not supported by any one data platform.

**Time-consuming false positives**
Any reported suspicious activity must be investigated manually to ensure accuracy.
Many legacy solutions generate a high number of false positives or fail to identify unknown patterns, resulting in wasted effort by investigators.

**Model transparency** Although AI can be used to address many money laundering use cases, the lack of transparency in the development of ML models offers little explainability, inhibiting broader adoption.

##### Solution overview

AML solutions face the operational burden of processing billions of transactions a day. The Databricks Lakehouse Platform combines the low storage cost benefits of cloud data lakes with the robust transaction capabilities of data warehouses, making it the ideal foundation for building AML analytics at massive scale. At the core of Databricks is Delta Lake, which can store and combine both unstructured and structured data to build entity relationships; moreover, Databricks Delta Engine provides efficient access using the new Photon compute to speed up BI queries on tables spanning billions of transactions. On top of these capabilities, ML is a first-class citizen in the Lakehouse, which means analysts and data scientists do not waste time subsampling or moving data to share dashboards and stay one step ahead of bad actors.

-----

##### How to get started

[Solution Accelerator: Modern Anti-Money Laundering Techniques](https://databricks.com/blog/2021/07/16/aml-solutions-at-scale-using-databricks-lakehouse-platform.html)

Current anti-money laundering practices bear little resemblance to those of the last decade. In today’s digital world, financial institutions are processing billions of transactions daily, increasing the surface area of money laundering. With this accelerator, we demonstrate how to build a scalable AML solution on the Lakehouse Platform leveraging a series of next-gen machine learning techniques including NLP, computer vision, entity resolution and graph analytics. This approach helps teams better adapt to the reality of modern laundering practices.

##### Reference architecture

-----

###### USE CASE:

## Entity Analytics

##### Overview

**No machine learning capabilities** Entity resolution typically relies on basic rules-based logic to compare records (e.g., matching on name and address), but with messy, large volumes of data, advanced analytics is needed to improve accuracy and accelerate efforts.

##### Solution overview

The Databricks Lakehouse is an ideal platform for building entity analytics at scale. With support for a wide range of data formats and a rich and extensible set of data transformation and ML capabilities, Databricks enables agencies to bring together all of their data in a central location and move beyond simple rules-based methods for entity resolution. Data teams can easily explore different machine learning techniques like natural language processing, classification and graph analytics to automate entity matching. And one-click provisioning and deprovisioning of cloud resources makes it easy for teams to cost-effectively allocate the necessary compute resources for any size job so they can uncover findings faster.

Entity analytics aims to connect disparate data sources to build a full view of a person or an organization. This has many applications in the public sector, such as fraud detection, national security and population health. For example, Medicare fraud teams need to understand which prescriptions are filled, claims filed and facilities visited across geographies to uncover suspicious behavior. Before teams can even look for suspicious behavior, they must first determine which records are associated. In the United States, nearly 50,000 people share the name John Smith (and there are thousands of others with similar names). Imagine trying to identify the right John Smith for this type of analysis. That’s no easy task.
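To make the gap between rules-based matching and similarity-based matching concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the method used in any Databricks accelerator: the record fields, the sample records, the helper names and the 0.85 threshold are all hypothetical, and `difflib.SequenceMatcher` stands in for the ML techniques the text mentions.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def rule_match(a: dict, b: dict) -> bool:
    """Rules-based match: name and address must be exactly equal."""
    return a["name"] == b["name"] and a["address"] == b["address"]


def similarity(a: dict, b: dict) -> float:
    """Mean string similarity (0.0 to 1.0) of normalized name and address."""
    name_score = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])
    ).ratio()
    addr_score = SequenceMatcher(
        None, normalize(a["address"]), normalize(b["address"])
    ).ratio()
    return (name_score + addr_score) / 2


def fuzzy_match(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Similarity-based match; the threshold is an arbitrary, hypothetical choice."""
    return similarity(a, b) >= threshold


# Hypothetical sample records: rec_a and rec_b describe the same person
# with messy formatting, rec_c is a different person.
rec_a = {"name": "John Smith", "address": "12 Main St., Springfield"}
rec_b = {"name": "john  smith", "address": "12 Main St Springfield"}
rec_c = {"name": "Joan Smythe", "address": "98 Oak Avenue, Shelbyville"}

print(rule_match(rec_a, rec_b))   # exact rule misses the messy duplicate
print(fuzzy_match(rec_a, rec_b))  # similarity-based match links the pair
print(fuzzy_match(rec_a, rec_c))  # a different person stays unlinked
```

Pairwise comparison like this grows quadratically with the number of records, which is one reason population-scale entity resolution is typically run on distributed compute.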
##### Challenges **Disjointed data** Managing complex and brittle ETL pipelines in order to cleanse and join data across siloed systems and data stores. **Compute intensive** Identifying related entities across population-level data sets requires massive compute power that far outstrips legacy on-prem data architectures. ----- ##### How to get started [Virtual Workshop: Entity Analytics](https://drive.google.com/file/d/1wGGT9Fn5EZF5Rgrabuttt1xdua5csrBa/view?usp=sharing) Learn from Databricks experts on how entity analytics is being deployed in the public sector and watch a demo that shows how to use ML to link payments and treatments across millions of records in a public CMS data set. [Solution Accelerator:](https://drive.google.com/file/d/1a5xdaRSNQjQvgztOZg0tCiCajjVpvVPA/view?usp=sharing) [Machine Learning-Based Item Matching](https://drive.google.com/file/d/1a5xdaRSNQjQvgztOZg0tCiCajjVpvVPA/view?usp=sharing) While focused on retail, this accelerator has applications for any organization working on entity matching, especially as it relates to items that might be stored across locations. In this notebook, we demonstrate how to use machine learning and the Databricks Lakehouse Platform to resolve differences between product definitions and descriptions, and determine which items are likely pairs and which are distinct across disparate data sets. ##### Customer story **[WATCH THE VIDEO](https://databricks.com/session_na21/entity-resolution-using-patient-records-at-cmmi)** In this talk, NewWave shares the specifics on CMS’s entity resolution use case, the ML necessary for this data and the unique uses of Databricks in providing this capability. 
##### Sample workflow ----- ###### USE CASE: ## Geospatial Analytics ##### Overview **Broad range of analytics capabilities** Enterprises require a diverse set of data applications — including SQL-based analytics, real-time monitoring, data science and machine learning — to support geospatial workloads given the diverse nature of the data and use cases. ##### Solution overview With Delta Lake at the core, the Databricks Lakehouse is ideal for geospatial workloads, as it provides a single source of truth for all types of structured, unstructured, streaming and batch data, enabling seamless spatio-temporal unification and cross-querying with tabular and raster-based data. Built on Apache Spark, the Lakehouse easily scales for data sets consisting of billions of rows of data with distributed processing in the cloud. To expand on the core capabilities of the Lakehouse, Databricks has introduced the Mosaic library, an extension to the Apache Spark framework, built for fast and easy processing of large geospatial data sets. Popular frameworks such as Apache Sedona or GeoMesa can still be used alongside Mosaic, and because Mosaic sits on top of Lakehouse architecture, it unlocks AI/ML and advanced analytics capabilities to support all types of geospatial use cases. Every day billions of handheld and IoT devices, along with thousands of airborne and satellite remote sensing platforms, generate hundreds of exabytes of location-aware data. This boom of geospatial big data combined with advancements in machine learning is enabling government agencies to develop new capabilities. The potential use cases for geospatial analytics and AI touch every part of the government, including disaster recovery (e.g., flood/earthquake mapping), defense and intel (e.g., detecting threats using drone footage), infrastructure (e.g., public transportation planning), civilian safety (e.g., crime prediction), public health (e.g., disease spread tracking), and much more. 
Every agency at the state and federal level needs to consider how they can tap into geospatial data.

##### Challenges

**Massive volumes of geospatial data** With the proliferation of low-cost sensor arrays, GPS technologies and high-resolution imaging, organizations are collecting tens of TBs of geospatial data daily, outpacing their ability to store and process this data at scale.

**Compute-intensive spatial workloads** Geospatial data is complex in structure, with various formats not well suited for legacy data warehouses, as well as being compute intensive, with geospatial-specific transformations and queries requiring hours and hours of compute.

-----

##### How to get started

[Solution Accelerator: Mosaic for Geospatial Analytics](https://databricks.com/blog/2022/05/02/high-scale-geospatial-processing-with-mosaic.html)

Build a Lakehouse to support all of your geospatial analytics and AI use cases with the Mosaic library. Mosaic provides a number of capabilities including easy conversion between common spatial data encodings, constructors to easily generate new geometries from Spark native data types, many of the OGC SQL standard ST_ functions implemented as Spark Expressions for transforming, aggregating and joining spatial data sets, and optimizations for performing point-in-polygon joins using an approach we codeveloped with Ordnance Survey, all provided with the flexibility of a Scala, SQL or Python API.

[Virtual Workshop: Geospatial Analytics and AI at Scale](https://databricks.com/p/webinar/workshop-geospatial-analytics-and-ai-at-scale)

Learn how to build powerful geospatial insights and visualizations with a Lakehouse for all your geospatial data processing, analytics and AI.
##### Customer story

**[WATCH THE VIDEO](https://databricks.com/session_na20/automating-federal-aviation-administrations-faa-system-wide-information-management-swim-data-ingestion-and-analysis)**

**Analyzing Flight Data to Improve Aviation** To help airlines better serve their millions of passengers, USDOT built a modern analytics architecture on Databricks that incorporates data such as weather, flight, aeronautical and surveillance information. With this new platform, they reduced compute costs by 90% and can now power use cases such as predicting air cargo traffic patterns, flight delays and the financial impact of flight cancellations.

##### Customer story

**[WATCH THE VIDEO](https://www.youtube.com/watch?v=LP198QMdDbY&t=1070s)**

**Customer Story: Flood Prediction With Machine Learning** In an effort to improve the safety of civil projects, Stantec built a machine learning model on Databricks leveraging large volumes of weather and geological data (oftentimes consisting of trillions of data points) to predict the impact of flash floods on various regions and adjust civil planning accordingly.

-----

##### Reference architecture

*Figure: Mosaic components. Labels: Kepler Magics and Geometry Display Functions for Map Display; ESRI Java API for Geometry Operations; JTS Java API for Geometry Operations; Built-In Indexing System Support.*

-----

###### USE CASE:

## Public Health Management

##### Overview

In their lifetime, every human is expected to generate a million gigabytes of health data spanning electronic health records, medical images, claims, wearable data, genomics and more. This data is critical to understanding the health of the individual, but when aggregated and analyzed across large populations, government agencies can glean important insights like disease trends, the impact of various treatment guidelines and the effectiveness of resources.
By adding in [Social Determinants of Health (SDOH)](https://databricks.com/blog/2022/04/18/increasing-healthcare-equity-with-data.html) data (such as geographical location, income level, education and housing), agencies can better identify underserved communities and the critical factors that contribute to positive health outcomes.

##### Challenges

**Rapidly growing health data** Healthcare data is growing exponentially. Unfortunately, legacy on-premises data architectures are complex to manage and too costly to scale for population-scale analytics.

**Fragmented patient data** It is widely accepted that over 80% of medical data is unstructured, yet most organizations still focus their attention on data warehouses designed to only support structured data and SQL-based analytics.

**Complexities of ML in healthcare** The legacy analytics platforms that underpin healthcare lack the robust data science capabilities needed for predictive health use cases like disease risk scoring. There’s also the challenge of managing reproducibility, which is critical when building ML models that can impact patient outcomes.

##### Solution overview

The Databricks Lakehouse enables public health agencies to bring together all their research and patient data in a HIPAA-certified environment and marry it with powerful analytics and AI capabilities to deliver real-time and predictive insights at population scale. The Lakehouse eliminates the need for legacy data architectures, which have historically inhibited innovation in patient care by creating data silos and making advanced analytics difficult. Databricks-led open source projects, like [Glow for genomics](https://databricks.com/blog/2021/11/17/databricks-open-source-genomics-toolkit-outperforms-leading-tools.html) and [Smolder for EHR data](https://databricks.com/blog/2021/01/28/burning-through-electronic-health-records-in-real-time-with-smolder.html), make it easy to ingest and prepare healthcare-specific data modalities for downstream analytics.

-----

##### How to get started

[Solution Accelerator: NLP for Healthcare](https://databricks.com/blog/2022/05/02/high-scale-geospatial-processing-with-mosaic.html)

Our joint solutions with John Snow Labs bring together the power of Spark NLP for Healthcare with the collaborative analytics and AI capabilities of Databricks. Informatics teams can ingest raw unstructured medical text files into Databricks, extract meaningful insights using natural language processing techniques, and make the data available for downstream analytics. We have specific NLP solutions for [extracting oncology insights](https://databricks.com/solutions/accelerators/nlp-oncology) from lab reports, automating the deidentification of PHI and [identifying adverse drug events](https://databricks.com/blog/2022/01/17/improving-drug-safety-with-adverse-event-detection-using-nlp.html).

[Solution Accelerator: Disease Risk Prediction](https://databricks.com/blog/2020/10/20/detecting-at-risk-patients-with-real-world-data.html)

One of the most powerful tools for identifying patients at risk for a chronic condition is the analysis of real world data (RWD). This Solution Accelerator notebook provides a template for building a machine learning model that assesses the risk of a patient for a given condition within a given window of time based on a patient’s encounter history and demographics information.
[Demo: Real-Time](https://www.youtube.com/watch?v=_ltDF2obiSc) [COVID-19 Contact Tracing](https://www.youtube.com/watch?v=_ltDF2obiSc) Databricks COVID-19 surveillance solution takes a data-driven approach to adaptive response, applying predictive analytics to COVID-19 data sets to help drive more effective shelter-in-place policies. ##### Customer story **[WATCH THE VIDEO](https://databricks.com/session_na21/from-vaccine-management-to-icu-planning-how-crisp-unlocked-the-power-of-data-during-a-pandemic)** **From Vaccine Management to ICU Planning** During the pandemic, the Chesapeake Regional Information System for our Patients implemented a modern data architecture on Databricks to address critical reporting needs. This allowed them to analyze 400 billion data points for innovative use cases like real-time disease spread tracking, vaccine distribution and prioritizing vulnerable populations. ----- ## Conclusion Today, data is at the core of how government agencies operate and AI is at the forefront of driving innovation into the future. The Databricks Lakehouse for Public Sector enables government agencies at the federal, state and local level to harness the full power of data and analytics to solve strategic challenges and make smarter decisions that improve the safety and quality of life of all citizens. Get started with a free trial of Databricks Lakehouse and start building better data applications today. **[START YOUR FREE TRIAL](https://databricks.com/try-databricks)** ###### Contact us for a personalized demo databricks.com/contact Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. 
Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/).

-----

SUCCESS
/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-and-ai-use-cases-for-the-public-sector.pdf
2024-09-19T16:57:20Z

###### EBOOK

# Lakehouse for Manufacturing

###### Build a connected customer experience, optimize operations and unify your data ecosystem

-----

## Contents

- Introduction
- Manufacturing Transformation Trends
- Manufacturing Data Challenges
- Databricks Lakehouse for Manufacturing
- Building Innovative Solutions on the Lakehouse
  - **SOLUTION:** Part-Level Demand Forecasting
  - **SOLUTION:** Overall Equipment Effectiveness & KPI Monitoring
  - **SOLUTION:** Digital Twins
  - **SOLUTION:** Computer Vision
- An Ecosystem on the Lakehouse for Manufacturing
  - **SOLUTION:** Avanade Intelligent Manufacturing
  - **SOLUTION:** DataSentics Quality Inspector
  - **SOLUTION:** Tredence Predictive Supply Risk Management
- Leading Manufacturing Companies That Choose Us

-----

## Introduction

Market conditions in manufacturing are more challenging than ever. Operating margins and growth are impacted by the rising cost of labor, materials, energy and transportation, all peaking at the same time. Disruptive events in the supply chain are increasing in frequency and intensity, leading to significant revenue losses and damaged brand reputation.

Effective acquisition and retention of next-generation talent is a considerable issue for manufacturers. There are more jobs in the industry than there are people to do them, further compounding the problem of slower than expected industrial productivity growth over the last 15 years.

The industry is also one of the largest consumers of energy, and faces a direct challenge of transforming operations to be more sustainable as governments are prioritizing net-zero policies that require a step change in energy efficiency and transition to low-carbon energy sources.

The manufacturing industry generates massive amounts of new data every day, estimated to be two to four times more in size than in industries such as communications, media, retail and financial services. This explosion of data has opened the door for the global manufacturing ecosystem to boost productivity, quality, sustainability and growth beyond what was previously thought possible.
Unfortunately, legacy data warehouse-based architectures weren’t built for the massive volumes and type of data coming in through today’s factories, products, processes and workers, let alone to support the advanced AI/ML use cases required to meet the customer expectations of shorter lead times, reliable delivery and smarter products.

For that, companies need to adopt a modern data architecture that provides the speed, scale and collaboration needed by broad teams of data engineers, data scientists, and analysts. Manufacturers need a comprehensive data platform that can not only handle massive volumes of data, but effectively and seamlessly operationalize the value from data, analytics and AI. This is achieved by:

- Removing data silos by placing all data, regardless of type or frequency, in a single, open architecture (including unstructured data from sensors, telemetry, natural language logs, videos and images), helping you to gain end-to-end visibility into your business
- Ensuring your data is “always on” so that the freshest and highest quality data is available to all for the full spectrum of enterprise analytics and AI/ML use cases, allowing you to drive IT-OT convergence
- Having a comprehensive open architecture so IT and data teams can move with agility to bring AI and ML to where it’s needed, when it’s needed, including in connectivity-constrained environments
- Maintaining fine-grained governance and access control on your data assets, protecting sensitive intellectual property and customer data

The Databricks Lakehouse for Manufacturing does just this. It’s a comprehensive approach that empowers teams in the industry to collaborate and innovate around data, analytics and AI. It eliminates the technical limitations of legacy technologies and gives data teams the ability to drive deeper, end-to-end insight into supply chains, automate processes to reduce costs and grow productivity, and achieve sustainable transformation for a more prosperous future.
Welcome to the Lakehouse for Manufacturing. ----- ## Manufacturing Transformation Trends The future of manufacturing is smart, sustainable and service oriented. Today’s forward-thinking leaders are preparing the foundation they need to support that future by leveraging fast and connected data from all corners of the enterprise. There are four key trends driving transformation in manufacturing: **Boosting industrial productivity through automation** A spike in labor costs, as well as the cost of energy and materials, puts significant pressure on operating margins. At the same time, industrial productivity has plateaued — it is at the same level today as it was in the late 2000s. In the face of these macro challenges and economic uncertainty, there has never been a more burning need to reduce costs and improve productivity through greater visibility and automation. The industry has made strides in collecting data from machines and performing predictive analytics on sensor readings, with 47% of manufacturers citing the use of predictive maintenance to reduce operational costs with considerable upside ahead. However, there is an entirely different class of unstructured data in the form of images, videos and LiDAR that is opening the door to game-changing automation in quality inspection, flow optimization and production scheduling. Historically, these critical processes have depended on manual and visual inspection of products and operations, which is resource intensive and less accurate than ML-driven computer vision techniques. This untapped data and capability is allowing manufacturers to deliver higher product quality and deliver on production demands using fewer resources. 
Andrew Ng, a machine learning pioneer, rightly describes the massive opportunity for these technologies in his quote: “It is incumbent on every CEO in any manufacturing or industrial automation company to figure out how to make deep learning technology work for your business.” **CUSTOMER STORY SPOTLIGHT:** ##### Corning #### $2 million in cost avoidance through manufacturing upset event reduction **Driving Better Efficiency in Manufacturing Process With ML** Corning has been one of the world’s leading innovators in materials science for nearly 200 years. Delivering high-quality products is a key objective across the company’s manufacturing facilities around the world, and it’s always on a mission to explore how ML can help deliver on that goal. Databricks has been central to the company’s digital transformation, as it provides a simplified and unified platform where teams can centralize all data and ML work. Now, they can train models, register them in MLflow, generate all additional artifacts — like exported formats — and track them in the same place as the base model. [LEARN MORE](https://www.databricks.com/blog/2023/01/05/how-corning-built-end-end-ml-databricks-lakehouse-platform.html) ----- **Gaining end-to-end operations and** **supply chain visibility** Modern customer expectations are forcing manufacturers to focus on more customer-centric KPIs: quality, on-time commitments and speed of delivery. That’s not to say that asset and labor efficiency are less important — however, with customer expectations of shorter lead times and more reliable delivery, the success measures in manufacturing are shifting to a mantra of “measure what your customer values.” High-performing manufacturers that embed this deep into their operational playbook also perform best on productivity and ROIC growth results, as evidenced in a recent study by the World Economic Forum and the International Centre of Industrial Transformation. The problem? 
In a post-pandemic world, operations and supply chains are persistently constrained, with increasing disruptions, spiraling costs and unpredictable performance. The business impact is considerable: studies have shown that a 30-day disruption can reduce EBITDA by 5% and impact annual revenue by as much as 20%. Manufacturing companies need to be able to deliver on customer expectations, commitments and service levels, all while lowering costs and increasing productivity. Manufacturers need an enterprise data platform that can provide real-time visibility into order flows, production processes, supplier performance, inventory and logistics execution, breaking down departmental silos to maximize customer responsiveness, improve manufacturing agility and boost performance.

**Transforming your business model through tech-fueled services**

Servitization, defined as the process of building revenue streams from services, has been trending for some time. The adaptation of the business model has been considerably profitable: on average, services account for ~30% of industrial manufacturing companies but contribute 60%+ of profit. In aftersale services, a clear customer preference for business outcome-based offerings has emerged in almost every corner of the manufacturing industry. The use of data, analytics and AI is foundational to delivering more personalized customer outcomes, proactive field service delivery and differentiated mission-critical applications to their customers. With greater autonomy, connectivity and sensorization, manufacturers operate in a paradigm where their products generate more and more data every second, opening up numerous new addressable opportunities for value creation.
The business of manufacturing is no longer linear, and manufacturers will need to reimagine their businesses to go beyond merely providing the primary unit of production (the next SKU, machine, vehicle or airplane) and leverage this data to operate a platform business with higher growth, stickier revenue streams and greater resilience to demand shocks.

-----

**CUSTOMER STORY SPOTLIGHT:**

##### Rolls-Royce

**Aerospace Goes Green With Data and AI**

While most people think of luxury cars when they hear “Rolls-Royce,” the Civil Aerospace branch is its own company, having separated from the car manufacturing arm in 1971. The now wildly successful manufacturer of commercial airplane engines is a leader in its industry for innovation. Today, Rolls-Royce obtains information directly from the airlines’ engines and funnels it into the Databricks platform. This gives the company insights into how the engines are performing and ways to improve maintenance schedules, translating to less downtime, delays, and rerouting, all of which reduce carbon footprint.

_“We employed Databricks to optimize inventory planning using data and analytics, positioning parts where they need to be, based on the insight we gain from our connected engines in real time and usage patterns we see in our service network. This has helped us minimize risks to engine availability, reduce lead times for spare parts and drive more efficiency in stock turns; all of this enables us to deliver TotalCare, the aviation industry’s leading Power-by-the-Hour (PBH) maintenance program.”_

**STUART HUGHES**
Chief Information and Digital Officer, Rolls-Royce Civil Aerospace
[LEARN MORE](https://www.wired.com/sponsored/story/how-tech-is-helping-to-save-the-world/)

-----

**Driving a more sustainable approach to manufacturing**

Global efforts on reducing greenhouse gas (GHG) emissions are accelerating, with over 70 countries representing more than 75% of global emissions having signed agreements to reach net-zero emissions by 2050. Manufacturing-centric sectors are critical to achieving net-zero sustainability commitments around the world, as they represent over 50% of global energy consumption and contribute to ~25% of global emissions. Those at the forefront of data, analytics and AI are setting science-based targets and are driving favorable sustainability outcomes today by deriving better insights from their operations, supply chains and the outcomes that their products generate for their end customers.

**CUSTOMER STORY SPOTLIGHT:**

##### Shell

**Delivering Innovative Energy Solutions for a Cleaner World**

Shell has been at the forefront of creating a cleaner tomorrow by investing in digital technologies to tackle climate change and become a net-zero emissions energy business. Across the business, they are turning to data and AI to improve operational efficiencies, drive customer engagement, and tap into new innovations like renewable energy. Hampered by large volumes of data, Shell chose Databricks to be one of the foundational components of its Shell.ai platform. Today, Databricks empowers hundreds of Shell’s engineers, scientists and analysts to innovate together as part of their ambition to deliver cleaner energy solutions more rapidly and efficiently.

[LEARN MORE](https://www.google.com/url?q=https://www.databricks.com/customers/shell&sa=D&source=editors&ust=1679097620349908&usg=AOvVaw00lb46oTfGRpOREXOI1Ue3)

_“Shell has been undergoing a digital transformation as part of our ambition to deliver more and cleaner energy solutions. As part of this, we have been investing heavily in our data lake architecture. Our ambition has been to enable our data teams to rapidly query our massive data sets in the simplest possible way. The ability to execute rapid queries on petabyte scale data sets using standard BI tools is a game changer for us. Our co-innovation approach with Databricks has allowed us to influence the product road map, and we are excited to see this come to market.”_

**DANIEL JEAVONS**
General Manager – Advanced Analytics CoE, Shell

- **Millions of dollars** saved in potential engine repair costs
- **250** data team members supporting 160+ high-value use cases
- **9x** faster: 5 minutes to validate a label, reduced from 45 minutes

-----

## Manufacturing Data Challenges

**Massive unstructured/OT data volumes** The industry is seeing immense growth in data volumes: much of this massive growth is due to semi-structured and unstructured data from connected workers, buildings, vehicles and factories. This growth in multi-modal data from IoT sensors, process historians, product telemetry, images, cameras and perception systems has outpaced legacy data warehouse-centric technologies. On-prem and cloud data warehouse tech-based architectures are too complex and too costly for the large and heterogeneous data sets prevalent in the industry.

**Driving IT-OT convergence** The success and pace of data modernization efforts in manufacturing are so often muted by critical data being stuck in multiple closed systems and proprietary formats, making it difficult and cost-prohibitive to extract the full potential of IT and OT data sets. In addition, data quality issues such as outdated or inaccurate data can often lead to a disjointed and incomplete view of customers, operations and assets.
For years, companies have lacked a common foundation for complex and heterogeneous manufacturing data (from IoT-generated data streams to financial metrics stored in ERP applications), and it has impacted their ability to provide the freshest, highest-quality and most complete data for analytics.

**Bringing AI/ML to where it’s needed** To realize the promise of AI/ML in manufacturing, machine learning models need to be brought as close to the decision as possible, often at the edge in facilities and locations with limited or intermittent connectivity to the internet or cloud. This requires deployment flexibility to on-premises or edge devices, with an experience comparable to that in the cloud.

**Inability to innovate at scale** CDOs want to be able to quickly and efficiently reproduce successes at global scale. Technical and business users want to simply and quickly know what data sets are available to solve the business issue at hand. Analysts want flexibility to use the tools they are most familiar with in order to stay responsive to business needs. Fragmented approaches to architecture and tooling make scaling business impact very difficult, which results in talent churn, slower development and duplicative efforts, all leading to higher costs.

-----

## Databricks Lakehouse for Manufacturing

**Optimize the supply chain, production processes and fulfillment logistics with real-time analytics and AI.**

The Databricks Lakehouse for Manufacturing is the only enterprise data platform that helps manufacturing organizations optimize their supply chains, boost product innovation, increase operational efficiencies, predict fulfillment needs and reduce overall costs.

**Deliver personalized outcomes and frictionless experiences**

- **Millions of assets streaming IoT data**
- **5%–10% reduction in unplanned downtime and cost**
- **Accurate prices across 1,000s of locations and millions of dealers**
- **200%+ increase in offer conversion rates**

With Databricks Lakehouse for Manufacturing, manufacturers can gain a single view of their customers that combines data from each stage of the customer journey. With a 360-degree view in place, manufacturers can drive more differentiated sales strategies and precise service outcomes in the field, delivering higher revenue growth, profitability and CSAT scores. With the Databricks Lakehouse, you can analyze product telemetry data, customer insights and service networks to deliver highest uptime, quality of service and economic value through the product lifecycle.

-----

**Gain real-time insight for agile manufacturing and logistics**

- **30%–50% improvement in forecast accuracy**
- **90% lower cost for new manufacturing line**
- **4%–8% reduction in logistics costs**
- **10% improvement in carbon footprint**

The Databricks Lakehouse lets you build a resilient and predictive supply chain by eliminating the trade-off between accuracy or depth of analysis and time. With scalable, fine-grained forecasts to predict or sense demand, or perform supply chain planning and optimization, Databricks improves accuracy of decisions, leading to higher revenue growth and lower costs. The lakehouse provides an “always on” architecture that makes IT-OT convergence a reality, by continuously putting all data to work regardless of the frequency at which it arrives (periodic, event-driven or real-time streaming) and creates valuable data products that can empower decision makers. This creates real-time insight into performance with data from connected factory equipment, order flows and production processes to drive the most effective resource scheduling.
**Empower the manufacturing workforce of the future** **25% improvement in data team productivity** **50x faster time to insight** **50% reduction in workplace injuries** With Databricks, manufacturers can increase the impact and decrease the time-to-value of their data assets, ultimately making data and AI central to every part of their operation. And by empowering data teams across engineering, analytics and AI to work together, Databricks frees up employees to self-serve and focus on realizing maximum business value — improving product quality, reducing downtime and exceeding customer expectations. **Execute product innovation at the speed of data** **90% decrease in time to market of new innovations** **20x faster data processing of vehicle and road data** It is critical that manufacturers are offering the most desirable value propositions so end consumers don’t look elsewhere. By tapping into product performance and attribute data along with market trends and operations information, manufacturers can make strategic decisions. With Databricks, manufacturers can decrease time to market with new products to increase sales by analyzing customer behavior and insights (structured, unstructured and semi-structured), product telemetry (streaming, RFID, computer vision) and digital twins, and leveraging that data to drive product decisions. ----- ## Building Innovative Solutions on the Lakehouse The flexibility of the Databricks Lakehouse Platform means that you can start with the use case that will have the most impact on your business. Through our experience working with some of the largest and most cutting-edge manufacturers in the world, we’ve developed Solution Accelerators based on the most common needs of manufacturers to help you get started. These purpose-built guides — fully functional notebooks and best practices — speed up results across your most common and high-impact use cases. Go from idea to proof of concept (PoC) in as little as two weeks. 
Check out the full list of Solution Accelerators [here](https://www.databricks.com/solutions/accelerators). **S O L U T I O N** **Part-Level Demand Forecasting** Demand forecasting is a critical business process for manufacturing and supply chains. McKinsey estimates that over the next 10 years, supply chain disruptions can cost close to half (~45%) of a year’s worth of profits for companies. Having accurate and up-to-date forecasts is vital to plan the scaling of manufacturing operations, ensure sufficient inventory and guarantee customer fulfillment. In recent years, manufacturers have been investing heavily in quantitative-based forecasting that is driven by historical data and powered using either statistical or machine learning techniques. Benefits include: **•** Better sales planning and revenue forecasting **•** Optimized safety stock to maximize turn-rates and service-delivery performance **•** Improved production planning by tracing back production outputs to raw material levels **A disruption lasting just 30 days or less could equal losses of 3%–5% of EBITDA.** ----- Databricks Lakehouse can enable large-scale forecasting solutions to help manufacturers navigate the most common data challenges when trying to forecast demand.
**C O M M O N U S E C A S E S :** **•** Scalable, accurate forecasts across large numbers of store-item combinations experiencing intermittent demand **•** Automated model selection to ensure the best model is selected for each store-item combination **•** Metrics to identify the optimal frequency with which to generate new predictions **•** Manage material shortages and predict overplanning **Try our [Parts-Level Solution Accelerator](https://www.databricks.com/solutions/accelerators/demand-forecasting) to facilitate fine-grained demand forecasts and planning.** ----- **S O L U T I O N** **Overall Equipment Effectiveness & KPI Monitoring** The need to monitor and measure manufacturing equipment performance is critical for operational teams within manufacturing. Today, Overall Equipment Effectiveness (OEE) is considered the standard for measuring manufacturing equipment productivity. According to Engineering USA, an OEE value of 85% or above is considered world-leading. However, many manufacturers typically achieve between 40% and 60%. Reasons for underachievement often include: **•** Delayed inputs due to manual processes that are prone to human error **•** Bottlenecks created by data silos, impeding the flow of fresh data to stakeholders **•** A lack of collaboration capabilities, keeping stakeholders from working on the same information at the same time **Poor OEE value can be a result of poor parts quality, slow production performance and production availability issues.** Databricks Lakehouse can help manufacturers maneuver through the challenges of ingesting and converging operational technology (OT) data with traditional data from IT systems to build forecasting solutions.
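As a rough illustration of the OEE figures above: OEE is conventionally computed as the product of availability, performance and quality, the three factors named in the callout. The sketch below assumes that standard definition. The function name `compute_oee`, its parameters and the shift numbers are hypothetical and are not taken from the Solution Accelerator.

```python
def compute_oee(planned_minutes, downtime_minutes,
                ideal_cycle_seconds, total_units, good_units):
    """Return (availability, performance, quality, oee) as fractions."""
    if planned_minutes <= 0:
        raise ValueError("planned_minutes must be positive")
    run_minutes = planned_minutes - downtime_minutes
    if run_minutes <= 0 or total_units <= 0:
        # Nothing ran or nothing was produced: report zero across the board.
        return 0.0, 0.0, 0.0, 0.0
    # Availability: share of planned time the equipment actually ran.
    availability = run_minutes / planned_minutes
    # Performance: ideal production time divided by actual run time.
    performance = (ideal_cycle_seconds * total_units) / (run_minutes * 60)
    # Quality: share of produced units that were good.
    quality = good_units / total_units
    return availability, performance, quality, availability * performance * quality


# Hypothetical shift: 480 planned minutes, 60 minutes of downtime,
# 30 s ideal cycle time, 700 units produced, 665 of them good.
a, p, q, oee = compute_oee(480, 60, 30, 700, 665)
print(f"availability={a:.3f} performance={p:.3f} quality={q:.3f} oee={oee:.3f}")
```

With these made-up numbers the result is roughly 69%, above the 40% to 60% range many manufacturers report and below the 85% benchmark.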
**C O M M O N U S E C A S E S** Incrementally ingest and process sensor data from IoT devices in a variety of formats Compute and surface KPIs and metrics to drive valuable insights Optimize plant operations with data-driven decisions **Try our** **[Solution Accelerator for OEE and KPI Monitoring](https://www.databricks.com/solutions/accelerators/overall-equipment-effectiveness)** **for** **performant and scalable end-to-end monitoring.** ----- Market dynamics and volatility are requiring manufacturers to bring products to market more quickly, optimize production processes and build agile supply chains at scale at a lower price. To do so, many manufacturers have turned to building digital twins, which are virtual representations of objects, products, pieces of equipment, people, processes or even complete manufacturing ecosystems. Digital twins provide insights — derived from sensors (often IoT or IIoT) that are embedded in the original equipment — that have the potential to transform the manufacturing industry by driving greater efficiency, reducing costs and improving quality. **S O L U T I O N** **Digital Twins** **Digital twin technologies can improve product** **quality by** **up to 25%.** Databricks Lakehouse can bring digital twins to life through fault-tolerant processing of streaming workloads generated by IoT sensor data and complex event processing (important for modeling physical processes). 
**C O M M O N U S E C A S E S** Process real-world data in real time Compute insights at scale and deliver to multiple downstream applications Optimize plant operations with data-driven decisions **Try our** **[Solution Accelerator for Digital Twins](https://www.databricks.com/solutions/accelerators/digital-twins)** **to accelerate** **time to market of new innovations.** ----- **S O L U T I O N** **Computer Vision** The rise in computer vision has been fueled by the rapid developments in neural network technologies, which use AI to better understand and interpret images with near-perfect precision. In manufacturing, computer vision can transform operations by, for example, identifying product defects to improve quality control, detecting safety hazards on the production floor, and tracking and managing inventory levels. **As per the American Society for Quality, cost of poor quality for** **companies can be as high as** **20% of revenue.** Databricks Lakehouse can easily ingest complex, unstructured image and video data at massive scale. Through the most popular computer vision libraries, data teams can scale AI models that leverage computer vision to recognize patterns, detect objects and make predictions with 99% accuracy. 
**C O M M O N U S E C A S E S** Quickly identify defects and ensure that products and processes meet quality standards Automate positioning and guidance to ensure that parts and products are properly aligned and assembled Predict maintenance issues to reduce downtime and maintenance costs, improve parts reliability, and increase safety for workers **Try our** **[Solution Accelerator for Computer Vision](https://www.databricks.com/blog/2021/12/17/enabling-computer-vision-applications-with-the-data-lakehouse.html)** **to improve** **efficiency, reduce costs and enhance overall safety.** ----- ## An Ecosystem on the Lakehouse for Manufacturing We’ve partnered with leading consulting firms and independent software vendors to deliver innovative, manufacturing-specific solutions. Databricks Brickbuilder Solutions help you cut costs and increase value from your data. Backed by decades of industry expertise — and built for the Databricks Lakehouse Platform — Brickbuilder Solutions are tailored to your exact needs. We also work with technology partners like Alteryx, AtScale, Fivetran, Microsoft Power BI, Qlik, Sigma, Simplement, Tableau and ThoughtSpot to accelerate the availability and value of data. This allows businesses to unify data from complex source systems and operationalize it for analytics, AI and ML on the Databricks Lakehouse Platform. ----- **S O L U T I O N** **Avanade Intelligent Manufacturing** Every year, businesses lose millions of dollars due to equipment failure, unscheduled downtime and lack of control in maintenance scheduling. Along with lost dollars, businesses will experience lower employee morale when stations are in and out of service. Avanade’s Intelligent Manufacturing solution supports connected production facilities and assets, workers, products and consumers to create value through enhanced insights and improved outcomes. Manufacturers can harness data to drive interoperability and enhanced insights at scale using analytics and AI. 
Outcomes include improvements across production (e.g., uptime, quantity and yield), better experiences for workers, and greater insight into what customers want. **Try our joint solution, [Intelligent Manufacturing](https://www.databricks.com/company/partners/consulting-and-si/partner-solutions/avanade-intelligent-manufacturing), to drive value and operationalize team coordination and productivity.** **S O L U T I O N** **DataSentics Quality Inspector** Quality control is a crucial aspect of any production process, but traditional methods can be time-consuming and prone to human error. Quality Inspector by DataSentics, an Atos company, offers a solution that is both efficient and reliable. With out-of-the-box models for visual quality inspection, which are tailored to meet specific business requirements, organizations will experience stable, scalable quality control that’s easy to improve over time. Quality Inspector is an end-to-end solution that can be seamlessly integrated into an existing setup, delivering high performance and reliability. **Try our joint solution, [Quality Inspector](https://www.databricks.com/company/partners/consulting-and-si/partner-solutions?itm_data=menu-item-brickbuildersoverview), to automate production quality control with an increase in accuracy and quicker time to value.** ----- [Figure: Tredence PSRM screens: Predict Supply Risk, Real-Time Shipment Visibility, Delay Alerts] **S O L U T I O N** **Tredence Predictive Supply Risk Management** Customers today are faced with multiple supply risks, including lack of in-transit visibility and disruptions caused by weather and local events, among others. Tredence’s Predictive Supply Risk Management solution, built on the Databricks Lakehouse Platform, helps businesses meet supply risk challenges by providing a scalable, cloud-based solution that can be tailored to the specific needs of each organization.
The platform’s flexibility and scalability allow businesses to keep pace with changing regulations and customer demands, while their comprehensive suite of tools helps identify and mitigate risks across the enterprise. **Try our joint solution,** **[Predictive Supply Risk Management](https://www.databricks.com/company/partners/consulting-and-si/partner-solutions?itm_data=menu-item-brickbuildersoverview)** **, to** **predict order delays, identify root causes and quantify supply** **chain impact.** Visit our [site](https://www.databricks.com/company/partners/consulting-and-si/partner-solutions?itm_data=menu-item-brickbuildersoverview) to learn more about our Databricks Partner Solutions. ----- ## Leading Manufacturing Companies That Choose Us ----- Databricks is the lakehouse company. More than 9,000 organizations worldwide — including Comcast, Condé Nast and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark,™ Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . 
###### Get started with a free trial of Databricks and start building data applications today **[START YOUR FREE TRIAL](https://www.databricks.com/try-databricks?utm_medium=paid+search&utm_source=google&utm_campaign=14272820537&utm_adgroup=126939742998&utm_content=trial&utm_offer=try-databricks&utm_ad=634147899783&utm_term=try%20databricks&gclid=CjwKCAiAr4GgBhBFEiwAgwORrTnkJaDf9SpIDy2RxOV28a2G2HtUDvJnLXiVWBsqcAWa_XmSvabkVRoCiwgQAvD_BwE#account)** To learn more, visit us at: **[Manufacturing Industry Solutions](https://www.databricks.com/solutions/industries/manufacturing-industry-solutions)** ----- SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Lakehouse-for-Manufacturing.pdf 2024-09-19T16:57:19Z **2020 EDITION | UPDATED** # Standardizing the Machine Learning Lifecycle ### From experimentation to production with MLflow [mlflow.org](https://mlflow.org) ----- **M A C H I N E L E A R N I N G L I F E C Y C L E** #### Contents **•** Chapter 1: Machine Learning Lifecycle Challenges **•** Chapter 2: Applying Good Engineering Principles to Machine Learning **•** Chapter 3: Introducing MLflow **•** Chapter 4: A Closer Look at MLflow Model Registry **•** Chapter 5: Making Organizations Successful with ML **•** Chapter 6: Introducing the Unified Data Analytics Platform **•** Chapter 7: Standardizing the Machine Learning Lifecycle on Databricks **•** Chapter 8: Getting Started **•** Chapter 9: Comparison Matrix #### Preface ##### Technology changes quickly. Data science and machine learning (ML) are moving even faster. In the short time since we first published this eBook, businesses across industries have rapidly matured their machine learning operations (MLOps), implementing ML applications and moving their first models into production. This has turned ML models into corporate assets that need to be managed across the lifecycle.
That’s why MLflow, an open-source platform developed by Databricks, has emerged as a leader in automating the end-to-end ML lifecycle. With 1.8 million downloads a month, and growing support in the developer community, this open-source platform is simplifying the complex process of standardizing and productionizing MLOps. This updated eBook explores the advantages of MLflow and introduces you to the newest component: MLflow Model Registry. You’ll also discover how MLflow fits into the Databricks Unified Data Analytics Platform for data engineering, science and analytics. ----- #### CHAPTER 1: Machine Learning Lifecycle Challenges Building machine learning models is hard. Putting them into production is harder. Enabling others (data scientists, engineers or even yourself) to reproduce your pipeline and results is equally challenging. How many times have you or your peers had to discard previous work because it was either not documented properly or too difficult to replicate? Getting models up to speed in the first place is significant enough that it can be easy to overlook long-term management. What does this involve in practice? In essence, we have to compare the results of different versions of ML models along with corresponding artifacts (code, dependencies, visualizations, intermediate data and more) to track what’s running where, and to redeploy and roll back updated models as needed. Each of these requires its own specific tools, and it’s these changes that make the ML lifecycle so challenging compared with traditional software development lifecycle (SDLC) management.
This represents a serious shift and creates challenges compared with a more traditional software development lifecycle for the following reasons: **•** The diversity and number of ML tools involved, coupled with a lack of standardization across ML libraries and frameworks **•** The continuous nature of ML development, accompanied by a lack of tracking and management tools for machine learning models and experiments **•** The complexity of productionizing ML models due to the lack of integration among data pipelines, ML environments and production services Let’s look at each of these areas in turn. ----- ###### The diversity and number of ML tools involved While the traditional software development process leads to the rationalization and governance of tools and platforms used for developing and managing applications, the ML lifecycle relies on data scientists’ ability to use multiple tools, whether for preparing data and training models, or deploying them for production use. Data scientists will seek the latest algorithms from the most up-to-date ML libraries and frameworks available to compare results and improve performance. However, due to the variety of available tools and the lack of detailed tracking, teams often have trouble getting the same code to work again in the same way. Reproducing the ML workflow is a critical challenge, whether a data scientist needs to pass training code to an engineer for use in production or go back to past work to debug a problem. [Figure: example tools for the Prep Data, Build Model and Deploy Model stages, including Azure ML] ----- ###### The continuous nature of ML development Technology never stands still. New data, algorithms, libraries and frameworks impact model performance continuously and, thus, need to be tested. Therefore, machine learning development requires a continuous approach, along with tracking capabilities to compare and reproduce results.
The performance of ML models depends not only on the algorithms used, but also on the quality of the data sets and the parameter values for the models. Whether practitioners work alone or on teams, it’s still very difficult to track which parameters, code and data went into each experiment to produce a model, due to the intricate nature of the ML lifecycle itself. [Figure: the Prep Data, Build Model and Deploy Model cycle] ----- ###### The complexity of productionizing ML models In software development, the architecture is set early on, based on the target application. Once the infrastructure and architecture have been chosen, they won’t be updated or changed due to the sheer amount of work involved in rebuilding applications from scratch. Modern developments, such as the move to microservices, are making this easier, but for the most part, SDLC focuses on maintaining and improving what already exists. One of today’s key challenges is to effectively transition models from experimentation to staging and production without needing to rewrite the code for production use. This is time-consuming and risky as it can introduce new bugs. There are many solutions available to productionize a model quickly, but practitioners need the ability to choose and deploy models across any platform, and scale resources as needed to manage model inference effectively on big data, in batch or real time. With machine learning, the first goal is to build a model. And keep in mind: a model’s performance in terms of accuracy and sensitivity is independent of the deployment mode. However, models can be heavily dependent on latency, and the chosen architecture requires significant scalability based on the business application.
End-to-end ML pipeline designs can be great for batch analytics and looking at streaming data, but they can involve different approaches for real-time scoring when an application is based on a microservice architecture working via REST APIs, etc. ----- **M A C H I N E L E A R N I N G L I F E C Y C L E** CHAPTER 2: **** **Applying Good Engineering** #### Principles to Machine Learning Many data science and machine learning projects fail due to preventable issues that have been resolved in software engineering for more than a decade. However, those solutions need to be adapted due to key differences between developing code and training ML models. -  **Expertise, code and data** — With the addition of data, data science and ML, code not only needs to deal with data dependencies but also handle the inherent nondeterministic characteristics of statistical modeling. ML models are not guaranteed to behave the same way when trained twice, unlike traditional code, which can be easily unit tested. -  **Model artifacts** — In addition to application code, ML products and features also depend on models that are the result of a training process. Those model artifacts can often be large — on the order of gigabytes — and often need to be served differently from code itself. -  **Collaboration** — In large organizations, models that are deployed in an application are usually not trained by the same people responsible for the deployment. Handoffs between experimentation, testing and production deployments are similar but not identical to approval processes in software engineering. ----- **M A C H I N E L E A R N I N G L I F E C Y C L E** ###### The need for standardization Some of the world’s largest tech companies have already begun solving these problems internally with their own machine learning platforms and lifecycle management tools. 
These internal platforms have been extremely successful and are designed to accelerate the ML lifecycle by standardizing the process of data preparation, model training, and deployment via APIs built for data scientists. The platforms not only help standardize the ML lifecycle but also play a major role in retaining knowledge and best practices, and maximizing data science team productivity and collaboration, thereby leading to greater ROI. Internally driven strategies still have limitations. First, they are limited to a few algorithms or frameworks. Adoption of new tools or libraries can lead to significant bottlenecks. Of course, data scientists always want to try the latest and the best algorithms, libraries and frameworks, such as the most recent versions of PyTorch, TensorFlow and so on. Unfortunately, production teams cannot easily incorporate these into the custom ML platform without significant rework. The second limitation is that each platform is tied to a specific company’s infrastructure. This can limit sharing of efforts among data scientists. As each framework is so specific, options for deployment can be limited. The question then is: Can similar benefits to these systems be provided in an open manner? This evaluation must be based on the widest possible mix of tools, languages, libraries and infrastructures. Without this approach, it will be very difficult for data scientists to evolve their ML models and keep pace with industry developments. Moreover, by making it available as open source, the wider industry will be able to join in and contribute to ML’s wider adoption. This also makes it easier to move between various tools and libraries over time. ----- #### CHAPTER 3: Introducing MLflow **MATEI ZAHARIA**, Co-founder and Chief Technologist at Databricks At Databricks, we believe that there should be a better way to manage the ML lifecycle.
So in June 2018, we unveiled [MLflow](https://mlflow.org/) , an open-source machine learning platform for managing the complete ML lifecycle. ###### “MLflow is designed to be a cross-cloud, modular, API-first framework, to work well with  all popular ML frameworks and libraries. It is open and extensible by design, and platform  agnostic for maximum flexibility.” With MLflow, data scientists can now package code as reproducible runs, execute and compare hundreds of parallel experiments, and leverage any hardware or software platform for training, hyperparameter tuning and more. Also, organizations can deploy and manage models in production on a variety of clouds and serving platforms. ###### “ With MLflow, data science teams can systematically package and reuse models  across frameworks, track and share experiments locally or in the cloud, and deploy  models virtually anywhere,” says Zaharia. “The flurry of interest and contributions we’ve  seen from the data science community validates the need for an open-source framework to  streamline the machine learning lifecycle.” ----- **M A C H I N E L E A R N I N G L I F E C Y C L E** ###### Key benefits **EXPERIMENT TRACKING** As mentioned previously, getting ML models to perform takes significant trial and error, and continuous configuration, building, tuning, testing, etc. Therefore, it is imperative to allow data science teams to track all that goes into a specific run, along with the results. With MLflow, data scientists can quickly record runs and keep track of model parameters, results, code and data from each experiment, all in one place. ----- **M A C H I N E L E A R N I N G L I F E C Y C L E** ###### Key benefits **FLEXIBLE DEPLOYMENT** There is virtually no limit to what machine learning can do for your business. 
However, there are different ways to architect ML applications for production, and various tools can be used for deploying models, which often lead to code rewrites prior to deploying ML models into production. With MLflow, your data scientists can quickly download or deploy any saved models to various platforms, locally or in the cloud, from experimentation to production. **REPRODUCIBLE PROJECTS** The ability to reproduce a project, entirely or just parts of it, is key to data science productivity, knowledge sharing and, hence, accelerating innovation. With MLflow, data scientists can build and package composable projects, capture dependencies and code history for reproducible results, and quickly share projects with their peers. ----- ###### Key benefits **MODEL MANAGEMENT** Use one central place to share ML models, collaborate on moving them from experimentation to online testing and production, integrate with approval and governance workflows, and monitor ML deployments and their performance. This is powered by the latest MLflow component, MLflow Model Registry. [Figure: Model Format. Simple model flavors usable by many tools. Labels: ML Libraries, Flavor 1, Flavor 2, Model Deployment and Monitoring, In-Line Code, Containers, Batch and Stream Scoring, Cloud Inference Services] ----- ###### Use case examples Let’s examine three use cases to explore how users can leverage some of the MLflow components. **EXPERIMENT TRACKING** A European energy company is using MLflow to track and update hundreds of energy-grid models.
This company’s goal is to build a time-series model for every major energy producer (e.g., power plant) and consumer (e.g., factory), monitor these models using standard metrics, and combine the predictions to drive business processes, such as pricing. Because a single team is responsible for hundreds of models, possibly using different ML libraries, it’s important to have a standard development and tracking process. The team has standardized on Jupyter notebooks for development, MLflow Tracking for metrics, and Databricks Jobs for inference. **REPRODUCIBLE PROJECTS** An online marketplace is using MLflow to package deep learning jobs using Keras and run them in the cloud. Each data scientist develops models locally on a laptop using a small data set, checks them into a Git repository with an MLproject file, and submits remote runs of the project to GPU instances in the cloud for large-scale training or hyperparameter search. Using MLflow Projects makes it easy to create the same software environment in the cloud and share project code among data scientists. **MODEL PACKAGING** An e-commerce site’s data science team is using MLflow Model Registry to package recommendation models for use by application engineers. This presents a technical challenge because the recommendation application includes both a standard, off-the-shelf recommendation model and custom business logic for pre- and post-processing. For example, the application might include custom code to ensure the recommended items are diverse. This business logic needs to change in sync with the model, and the data science team wants to control both the business logic and the model, without having to submit a patch to the web application each time the logic has to change. Moreover, the team wants to A/B test distinct models with distinct versions of the processing logic. 
The solution was to package both the recommendation model and the custom logic using the python_function flavor in an MLflow Model, which can then be deployed and tested as a single unit. ----- ###### Open and extensible by design Since we [unveiled](https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html) and open sourced MLflow in June 2018 at the Spark + AI Summit in San Francisco, community engagement and contributions have led to an impressive array of new features and integrations: **SUPPORT FOR MULTIPLE PROGRAMMING LANGUAGES** To give developers a choice, MLflow supports R, Python, Java and Scala, along with a REST server interface that can be used from any language. **INTEGRATION WITH POPULAR ML LIBRARIES AND FRAMEWORKS** MLflow has built-in integrations with the most popular machine learning libraries (such as scikit-learn, TensorFlow, Keras, PyTorch, H2O, and Apache Spark™ MLlib) to help teams build, test and deploy machine learning applications. **CROSS-CLOUD SUPPORT** Organizations can use MLflow to quickly deploy machine learning models to multiple cloud services, including Databricks, Azure Machine Learning and Amazon SageMaker, depending on their needs. MLflow leverages AWS S3, Google Cloud Storage and Azure Data Lake Storage, allowing teams to easily track and share artifacts from their code.
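Returning to the model-packaging use case above, here is a minimal, library-free sketch of the idea: a single object bundles a recommendation model with custom post-processing so the two can be versioned and tested together. It only mimics the shape of a python_function-style model (one `predict` entry point) and does not call MLflow. `PopularityModel`, `DiverseRecommender` and the sample catalog are hypothetical names invented for illustration.

```python
class PopularityModel:
    """Stand-in for an off-the-shelf recommender: ranks items by score."""

    def __init__(self, scores):
        self.scores = scores  # item -> popularity score

    def recommend(self, k):
        ranked = sorted(self.scores, key=self.scores.get, reverse=True)
        return ranked[:k]


class DiverseRecommender:
    """Bundles a model with business logic behind one predict() call."""

    def __init__(self, model, categories, max_per_category=1):
        self.model = model
        self.categories = categories  # item -> category
        self.max_per_category = max_per_category

    def predict(self, k):
        # Over-fetch candidates, then enforce diversity as post-processing.
        candidates = self.model.recommend(len(self.categories))
        picked, seen = [], {}
        for item in candidates:
            category = self.categories[item]
            if seen.get(category, 0) < self.max_per_category:
                picked.append(item)
                seen[category] = seen.get(category, 0) + 1
            if len(picked) == k:
                break
        return picked


# Hypothetical catalog used only for this example.
scores = {"tv": 9, "soundbar": 8, "novel": 7, "cookbook": 6, "kettle": 5}
categories = {"tv": "electronics", "soundbar": "electronics",
              "novel": "books", "cookbook": "books", "kettle": "kitchen"}
packaged = DiverseRecommender(PopularityModel(scores), categories)
recommendations = packaged.predict(3)
print(recommendations)
```

Because the diversity rule lives inside the packaged object, changing it means shipping a new version of the bundle rather than patching the calling application. In an actual MLflow project, a wrapper like this would typically be written as a custom `mlflow.pyfunc.PythonModel` subclass so it can be logged and deployed as one MLflow Model.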
-----

###### Rapid community adoption

- 2.5M monthly downloads
- 200+ code contributors
- 100+ contributing organizations

Organizations using and contributing to MLflow. Source: [mlflow.org](https://mlflow.org)

-----

#### CHAPTER 4: A Closer Look at MLflow Model Registry

MLflow originally introduced the ability to [track metrics, parameters and artifacts](https://www.mlflow.org/docs/latest/tracking.html#) as part of experiments, [package models and reproducible ML projects](https://www.mlflow.org/docs/latest/projects.html), and [deploy models to batch or to real-time serving platforms](https://www.mlflow.org/docs/latest/models.html). The latest MLflow component, MLflow Model Registry, builds on MLflow’s original capabilities to provide organizations with one central place to share ML models, collaborate on moving them from experimentation to testing and production, and implement approval and governance workflows.

[Figure: The Model Registry gives MLflow users new tools for sharing, reviewing and managing ML models throughout their lifecycle. Diagram labels: Tracking Server, Parameters, Metrics, Artifacts, Metadata, Models, Model Registry, Staging, Production, Archived, Data Scientists, Data Engineers, Downstream, Automated Jobs, REST Serving, Reviewers + CI/CD Tools.]

The MLflow Model Registry complements the MLflow offering and is designed to help organizations implement good engineering principles with machine learning initiatives, such as collaboration, governance, reproducibility and knowledge management. The next few pages highlight some of the key features of this new component.

-----

###### One hub for managing ML models collaboratively

Building and deploying ML models is a team sport.
Not only are the responsibilities along the machine learning model lifecycle often split across multiple people (e.g., data scientists train models whereas production engineers deploy them), but also at each lifecycle stage, teams can benefit from collaboration and sharing (e.g., a fraud model built in one part of the organization could be reused in others).

MLflow facilitates sharing of expertise and knowledge across teams by making ML models more discoverable and providing collaborative features to jointly improve on common ML tasks. Simply register an MLflow model from your experiments to get started. The MLflow Model Registry will then let you track multiple versions of the model and mark each one with a lifecycle stage: development, staging, production or archived.

[Figure: Sample machine learning models displayed via the MLflow Model Registry dashboard]

###### Flexible CI/CD pipelines to manage stage transitions

MLflow lets you manage your models’ lifecycles either manually or through automated tools. Analogous to the approval process in software engineering, users can manually request to move a model to a new lifecycle stage (e.g., from staging to production), and review or comment on other users’ transition requests.

Alternatively, you can use the Model Registry’s API to plug in continuous integration and deployment (CI/CD) tools, such as Jenkins, to automatically test and transition your models. Each model also links to the experiment run that built it in MLflow Tracking, to let you easily review models.

[Figure: The machine learning model page view in MLflow, showing how users can request and review changes to a model’s stage]

-----

###### Visibility and governance for the full ML lifecycle

In large enterprises, the number of ML models that are in development, staging and production at any given point in time may be in the hundreds or thousands.
Having full visibility into which models exist, what stages they are in and who has collaborated on and changed the deployment stages of a model allows organizations to better manage their ML efforts.

MLflow provides full visibility and enables governance by keeping track of each model’s history and managing who can approve changes to the model’s stages.

[Figure: Identify versions, stages and authors of each model]

-----

#### CHAPTER 5: Making Organizations Successful with ML

Standardizing the ML lifecycle with MLflow is a great step to ensure that data scientists can share and track experiments, compare results, reproduce runs and productionize faster.

In addition to increasing data science team productivity and collaboration and applying good engineering practices to machine learning, organizations also need to do the following:

- **Reliably ingest, ETL and catalog big data**
- **Work with state-of-the-art ML frameworks and tools**
- **Easily scale compute from single to multi-node**

Databricks excels at all the above. Learn more at [databricks.com](https://databricks.com)

-----

#### CHAPTER 6: Introducing the Unified Data Analytics Platform

Databricks accelerates innovation by unifying data science, engineering and business. Through a fully managed, cloud-based service built by the original creators of Apache Spark, Delta Lake and MLflow, the Databricks Unified Data Analytics Platform lowers the barrier for enterprises to innovate with AI and accelerates their innovation.

[Figure: Databricks Unified Data Analytics Platform. Users: Data Engineers, Data Scientists, ML Engineers, Data Analysts. Layers: BI Integrations (access all your data); Data Science Workspace (collaboration across the lifecycle); Unified Data Service (high-quality data with great performance); Enterprise Cloud Service (a simple, scalable and secure managed service); Raw Data Lake.]

-----

###### Data engineering

Speed up the preparation of high-quality data, essential for best-in-class ML applications, at scale

###### Data science

Collaboratively explore large data sets, build models iteratively and deploy across multiple platforms

-----

###### Providing managed MLflow on Databricks

MLflow is natively integrated with the Databricks Unified Data Analytics Platform so that ML practitioners and engineers can benefit from out-of-the-box tracking, packaging, deployment and management capabilities for ML models with enterprise reliability, security and scale.

By using MLflow as part of Databricks, data scientists can:

- **WORKSPACES**: Benefit from a streamlined experiment tracking experience with Databricks Workspace and collaborative Notebooks
- **BIG DATA SNAPSHOTS**: Track large-scale data that fed the models, along with all the other model parameters, then reproduce training runs reliably
- **JOBS**: Easily initiate jobs remotely, from an on-premises environment or from Databricks notebooks
- **SECURITY**: Take advantage of one common security model for the entire machine learning lifecycle

Read our [blog](https://databricks.com/blog/2019/03/06/managed-mlflow-on-databricks-now-in-public-preview.html) to learn more about these integrations.

-----

###### Getting data ready for ML with Delta Lake

Delta Lake is a storage layer that brings reliability to data lakes.
Delta Lake provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

By using Delta Lake, data engineers and data scientists can keep track of data used for model training.

[Figure: Your existing Delta Lake. Diagram labels: Files, Streaming, Batch, Ingestion, Tables, Data Catalog, Feature Store, ML Runtime, 3rd Party Data Marketplace. Listed benefits: schema-enforced high-quality data; optimized performance; full data lineage / governance; reproducibility through time travel.]

-----

###### Ready-to-use ML environments

Databricks Runtime for Machine Learning provides data scientists and ML practitioners with on-demand access to ready-to-use machine learning clusters that are preconfigured with the latest and most popular machine learning frameworks, including TensorFlow, Keras, PyTorch, scikit-learn, XGBoost and Horovod.

By using the Databricks Runtime for ML, data scientists can get to results faster with one-click access to ML clusters, optimized performance on popular ML algorithms, and simplified distributed deep learning on Horovod and GPUs. It also supports Conda for further customization.

[Figure: Databricks Runtime for ML. Panels: packages and optimizes most common ML frameworks; customized environments using Conda (requirements.txt, conda.yaml); built-in optimization for distributed deep learning (distribute and scale any single-machine ML code to thousands of machines); built-in AutoML and experiment tracking (AutoML and tracking / visualizations with MLflow).]

-----

#### CHAPTER 7: Standardizing the Machine Learning Lifecycle on Databricks

[Figure: Prep Data, Build Model, Deploy Model (Azure ML)]

-----

#### CHAPTER 8: Getting Started

Take the next step toward standardizing your ML lifecycle: test drive MLflow and the Databricks Unified Data Analytics Platform.

- **[START YOUR FREE TRIAL](https://databricks.com/try)**
- **[REQUEST A PERSONALIZED DEMO](https://databricks.com/contact)**
- **[LEARN MORE](https://databricks.com/mlflow)**
- **[JOIN THE COMMUNITY](https://mlflow.org)**

-----

#### CHAPTER 8: Comparison Matrix

Columns compared: **OPEN SOURCE MLFLOW** (self-hosted) and **MANAGED MLFLOW ON DATABRICKS** (fully managed, with remote execution).

- **EXPERIMENT TRACKING**: MLflow Tracking API, MLflow Tracking Server, Notebook Integration, Workspace Integration
- **REPRODUCIBLE PROJECTS**: MLflow Projects, GitHub and Conda Integration, Scalable Cloud/Clusters for Project Runs
- **MODEL MANAGEMENT**: MLflow Model Registry, Model Versioning, Stage Transitions and Comments, CI/CD Workflow Integration, Model Stage
- **FLEXIBLE DEPLOYMENT**: MLflow Models, Built-In Batch Inference, Built-In Streaming Analytics
- **SECURITY AND MANAGEMENT**: High Availability, Automated Updates, Role-Based Access Control

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/LP_2-primary-asset_standardizing-the-ml-lifecycle-ebook-databricks-0626120-v8.pdf 2024-09-19T16:57:20Z

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks_improper_payments_eBook_v4_image.pdf 2024-09-19T16:57:20Z

### Technical Migration Guide

# Strategies to Evolve Your Data Warehouse to the Databricks Lakehouse

-----

## Contents

- Lakehouse Architecture
- The Databricks Lakehouse Platform
- Business Value
  - Single source of truth
  - Data team
  - Future-proof
- Migration to Lakehouse
  - Overview
  - Migration strategy
  - Migration planning
  - ELT approach
  - Agile modernization
  - Security and data governance
  - Team involvement
  - Conclusion

-----

## Lakehouse Architecture

Data warehouses were designed to provide a central data repository with analytic compute capabilities to help business leaders get analytical insights, support decision-making and business intelligence (BI). Legacy on-premises data warehouse architectures are difficult to scale and make it difficult for data teams to keep up with the exponential growth of data. Oftentimes data teams publish and use a subset of well-defined data for development and testing. This slows down both innovation and time to insight.

Cloud data warehouses (CDW) were an attempt to tackle the on-premises data warehouse challenges. CDWs removed the administrative burden of tasks such as setup, upgrades and backups. CDWs also improved scalability and introduced cloud’s pay-as-you-go model to reduce cost. CDWs leverage a proprietary data format to achieve cloud-scale and performance; however, this also leaves customers locked into these formats, with difficult paths to support use cases outside the data warehouse itself (e.g., machine learning). Customers often find themselves with a bifurcated architecture, which ultimately leads to a more costly and complex data platform over time.

But enterprise data teams don’t need a better data warehouse. They need an innovative, simple solution that provides reliable performance and elastic scale, and that enables self-service so analytics can access all data at a reasonable cost. The answer is the lakehouse.

The lakehouse pattern represents a paradigm shift from traditional on-premises data warehouse systems that are expensive and complex to manage. It uses an open data management architecture that combines the flexibility, cost-efficiency and scale of data lakes with the data management and ACID semantics of data warehouses. A lakehouse pattern enables data transformation, cleansing and validation to support both business intelligence and machine learning (ML) users on all data. Lakehouse is cloud-centric and unifies a complete up-to-date data set for teams, allowing collaboration across an organization.

-----

## The Databricks Lakehouse Platform

The Databricks Lakehouse Platform is **simple**; it unifies your data, governance, analytics and AI on one platform. It’s **open**: the open source format Delta Lake unifies your data ecosystem with open standards and data formats. Databricks is **multicloud**, delivering one **consistent experience across all clouds** so you don’t need to reinvent the wheel for every cloud platform that you’re using to support your data and AI efforts.

Databricks SQL stores and processes data using Delta Lake to simplify and enhance data warehousing capabilities. Analysts can use their favorite language, SQL, popular transformation tools such as dbt, and preferred BI tools like Power BI and Tableau to analyze data. The built-in query editor reduces contextual switching and improves productivity. Administrators enjoy simplified workload management via serverless compute and auto-scaling to meet high-concurrency workload needs. All this at a fraction of the cost of traditional data warehouses.

[Figure: Lakehouse Platform. Workloads: Data Warehousing, Data Engineering, Data Streaming, Data Science and ML. Unity Catalog: fine-grained governance for data and AI. Delta Lake: data reliability and performance. Cloud Data Lake: all structured and unstructured data. Simple, Open, Multicloud.]

-----

## Business Value

#### Single source of truth

Databricks Delta Lake leverages cloud-based blob storage to provide an infinitely scalable storage layer where you can store all your data, including raw and historical data, alongside structured data tables in the data warehouse. The lakehouse pattern avoids data silos and shares the same elastic scale and governance across all use cases: BI, data engineering, streaming and AI/ML.
This means that data engineering teams don’t have to move data to a proprietary data warehouse for business analysts or create a separate data store to support data science. Instead, data teams can access the open format Delta tables directly and combine data sets in the lakehouse, as needed. Data scientists can also work collaboratively on common data with access to versioned history to facilitate repeatable experiments. A single source of truth facilitates moving from descriptive to predictive analytics.

-----

#### Data team

With central data governance and fine-grained access control capabilities to secure the lakehouse, you can enable self-service SQL analytics for everyone on the Databricks Lakehouse Platform. This allows each team to be more agile and innovate faster.

**Data Analysts**: Using the Databricks SQL editor or their tools of choice (dbt, Power BI, Tableau), SQL analysts can leverage familiar toolsets.

**Data Engineers**: Utilizing Delta Lake as a unified storage layer, data engineering teams can eliminate duplicate data and ETL jobs that move data across various systems. Databricks supports both batch and streaming workloads to reduce bottlenecks and serve the most up-to-date data to downstream users and applications.

The Databricks Lakehouse Platform provides a reliable ETL and data management framework to simplify ETL pipelines. Data teams can build end-to-end data transformations in a single pipeline instead of many small ETL tasks. Databricks supports data quality enforcement to ensure reliability with auto-scalable infrastructure. Your teams can onboard new data sources quickly to power new use cases with fresh data. This not only allows your team to efficiently and reliably deliver high-quality data in a timely manner, it also reduces ETL workload cost significantly.

**Administrators**: The pay-as-you-go, decentralized compute resource allows each team to run their workload in isolated environments without worrying about contention. Serverless SQL endpoint frees your team from infrastructure management challenges.

#### Future-proof

Unlike CDWs that lock customers in, Databricks offers an open platform with open standards, open protocols and open data formats. It supports a full range of popular languages (SQL, Python, R, Scala) and popular BI tools. You can leverage the performant and low-cost distributed compute layer for data processing, or use a variety of tools and engines to efficiently access the data via Databricks APIs. Databricks also allows data consumption with a rich partner ecosystem. Teams can handle all existing BI and AI use cases with the flexibility to support future use cases as they emerge.

-----

## Migration to Lakehouse

#### Overview

A lakehouse is the ideal data architecture for data-driven organizations. It combines the best qualities of data warehouses and data lakes to provide a single solution for all major data workloads and supports use cases from streaming analytics to BI, data science and AI.

The Databricks Lakehouse Platform leverages low-cost, durable cloud storage and only consumes (charges for) compute resources when workloads are running. This pay-as-you-go model means compute resources are automatically shut down if no processing is needed. Data teams can use small clusters that can power individual workloads they plan to migrate. They can make the choice to leverage serverless SQL endpoints and completely free data teams from infrastructure capacity planning and cluster maintenance. The auto-scaling, elastic nature of Databricks clusters leads to significant savings on infrastructure cost and maintenance. Organizations typically achieve 50% TCO savings compared to other cloud data warehouses.

**CUSTOMER STORY**

##### Building the Lakehouse at Atlassian

[Watch now](https://www.youtube.com/watch?v=Xo1U617T-mU)

Data warehouse migration is never an easy task.
Databricks aims to mitigate the things that can go wrong in these demanding migration projects. The Databricks Lakehouse Platform provides many out-of-the-box features to mitigate migration risks.

**CUSTOMER STORY**

##### Driving Freight Transportation Into the Future

[Read more](https://databricks.com/customers/jbhunt)

-----

#### Migration strategy

Migration is a huge effort and very expensive. Yet, almost every enterprise has to migrate to new platforms every 3–5 years because the old platform cannot support new use cases, catch up with data growth or meet scaling needs. To get better ROI on migration, implement a migration strategy that can reduce future re-platform needs and extend to your future data and AI strategy.

Use the opportunity of a data migration to standardize your data in open Delta format to allow existing and future tools to access it directly without moving or converting it.

Merge your siloed data warehouses into the unified storage layer in the Databricks Lakehouse Platform, without worrying about storage capacity. The unified storage layer allows your team to deploy a unified data governance on top to secure all data access consistently. Simplify your data governance story with Databricks Unity Catalog.

Move toward a single, consistent approach to data pipelining and refinement. Merge batch and streaming into a single end-to-end pipeline to get fresher data and provide more real-time decisions. Take a metadata-driven approach to align the dataflow with business processes and have data validation and quality check built-in. Through a series of curation and refinement steps, the output results in highly consumable and trusted data for downstream use cases. The lakehouse architecture makes it possible for the organization to create “data assets” by taking a stepwise approach to improving data and serving all essential use cases.

Encourage your BI/analyst team to leverage Databricks serverless endpoints for self-serve and agility.
Each team can evaluate their top priority workloads and migrate them in parallel to speed up migration.

Take advantage of Databricks’ rich partner ecosystem. Your favorite partners are likely already integrated via Partner Connect and can be set up with a few clicks. There are also many ISV and SI consulting partners who can help your migration journey.

-----

#### Migration planning

Migrating a data warehouse to the cloud can be time consuming and challenging for your data teams. It’s important to agree on the data architecture, migration strategy and process/frameworks to be used before undertaking a data migration. Databricks provides Migration Assessment and Architecture Review sessions to develop a joint migration roadmap. This process is designed to help organizations to successfully migrate to a lakehouse architecture. Based on information collected and business objectives, the Databricks team will work with customers to propose a target architecture and provide a tailored migration roadmap.

These assessments help get a full picture of current data systems and the future vision. They clarify what you are migrating and do proper use case discovery. This includes identifying workloads and data source dependencies.

Sample migration assessment checklist:

- Identify upstream data sources and workload dependencies
- Identify active/inactive data sets and database objects
- Identify downstream application dependencies and data freshness requirements
- Define a cost-tracking mechanism, such as tag rules for chargeback and cost attribution
- Define security requirements and data governance
- Clarify access management needs and document the permissions needed per user/group
- Outline current tooling (ingestion, ETL and BI) and what’s needed

-----

It’s important to identify key stakeholders and keep them engaged during the migration to make sure they are aligned with the overall objectives. The workload assessment result will be reviewed with key stakeholders.
Through the review process, data teams can get a better understanding of which workloads can most benefit from modernization.

Databricks often works with partners to provide a workload assessment and help customers understand their migration complexity and properly plan a budget. Databricks also partners with third-party vendors that provide migration tools to securely automate major migration tasks. Databricks Partner Connect makes it easy to connect with this ecosystem of tools to help with the migration, including:

- Code conversion tooling that can automatically translate 70%–95% of the SQL code in your current system to Databricks optimized code with Delta and other best practices
- Converters that automate multiple GUI-based ETL/ELT platform conversion to reduce migration time and cost
- Data migration tools that can migrate data from on-premises storage to cloud storage 2x–3x faster than what was previously possible

-----

#### We can use automated conversion for most workload types

| | EDWs | Databricks |
|---|---|---|
| Data Migration | DB locked formats on disks | Open cloud storage: ADLS, S3, GCP Storage |
| Metastore Migration | Databases, tables, views | Databricks tables, views |
| SQL Migration | Ad-hoc SQL queries | Spark SQL, Databricks Notebooks |
| | T-SQL, PL/SQL, BTEQ | Spark SQL, a little bit of Python or Scala |
| | Reports from PBI, Tableau, etc. | Runs on Databricks JDBC/ODBC |
| Security | GRANTs, roles | Databricks permissions: table ACLs |
| | External tables: file permissions | Credential pass-throughs to files |
| ETL Tools | Data Stage, PowerCenter, Ab Initio, etc. | Big data ETL tools, Databricks Notebooks |
| Orchestration | ETL schedulers | Airflow DAGs, ADF, Databricks Job and any other enterprise schedulers |

-----

#### ELT approach

The separation of storage and compute makes ELT on lakehouse a better choice than traditional ETL. You can ingest all raw data to Delta Lake, leverage low-cost storage and create a Medallion data implementation from raw/Bronze to curated/Gold depending on what’s needed to support use cases.
During ingestion, basic data validation can occur, but establishing a Bronze data layer is the foundation of a single-pane-of-glass for the business. Teams can leverage compute resources as needed without a fixed compute infrastructure. Establishing a Silver layer further enriches data by exploring and applying transformations. ELT allows data teams to break pipelines into smaller “migrations,” starting with a simple workload, then improving the pipeline design iteratively.

[Figure: Improve data quality. Diagram labels: data sources (CSV, TXT, JSON); Data Lake; Bronze tables (raw integration); Silver tables (filtered, cleaned, augmented); Gold tables (business-level aggregates); Streaming Analytics; Reporting.]

-----

We highly recommend leveraging [Delta Live Tables (DLT)](https://databricks.com/product/delta-live-tables), a new cloud-native managed service in the Databricks Lakehouse Platform that provides a reliable ETL framework to modernize your data pipeline at scale. Instead of migrating multiple ETL tasks one by one in a traditional data warehouse, you can focus on source and expected output, and create your entire dataflow graph declaratively.

Delta Live Tables offers:

- A metadata-driven approach: you just specify what data should be in each table or view rather than the details of how processing should be done
- An end-to-end data pipeline with data quality and freshness checks, end-to-end monitoring/visibility, error recovery, and lineage, which reduces the strain on data engineering teams and improves time-to-value in building data pipelines
- Automatic management of all the dependencies within the pipeline. This ensures all tables are populated correctly, whether continuously or on a regular schedule. For example, updating one table will automatically trigger all downstream table updates to keep data up-to-date.

All pipelines are built code-first, which makes editing, debugging and testing of data pipelines simpler and easier.
DLT can also automatically recover from common error conditions, reducing operational overhead.

-----

#### Agile modernization

Agile development allows teams to move quickly knowing migrated pipelines can be revisited at a later cycle and evolving data models are supported within the architecture. Allowing business impact to drive priorities via an agile approach helps mitigate migration risks. Prioritizing and selecting use cases where modernization brings business benefits quickly is a good starting point. Focus on the 20% of workloads that consume 80% of budget. By breaking workflows down into components and managing data stories, teams can adjust priorities over time. Changes can be made in collaboration with the user community to fit the business definition of value.

Migrating to a lakehouse architecture leverages separation of storage and compute to remove resource contention between ETL and BI workloads. As a result, the migration process can be more agile, allowing you to evolve your design iteratively without big-bang effort:

- Reduce time during the initial phase on full capacity plan and scoping
- Flexible cloud infrastructure and unlimited, autoscaling storage
- Workload management is much simpler: you can isolate each workload with a dedicated compute resource, without worrying about managing workload contention
- Auto-scale and tear down the compute resources after the job is done to achieve cost efficiency

All of this allows you to take a more iterative and business-focused approach for migration instead of a full planning, execution, test/validation approach. Here are more approaches that help facilitate this phased implementation:

- Leverage [Databricks Auto Loader](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html). Auto Loader helps to ingest new data into pipelines quicker to get data in near real-time.
- Delta Live Tables (DLT) improves data quality during data transformation and automatically scales to address data volume change. DLT can also support schema evolution and quarantine bad data or data that needs to be reprocessed at a later stage.
- Use dedicated clusters to isolate workloads, lower the total cost of ownership and improve overall performance. By using multiple clusters, we can shut down resources when not in use and move away from managing fixed resources in a single large cluster.

-----

Leverage Databricks’ deep bench of expertise to build reusable assets along the migration:

- Create a migration factory for iterative migration process
- Determine and implement a security and governance framework
- Establish a to-be environment and move use cases/workloads in logical units
- Prove business value and scale over time
- Add new functionality continuously so important business requirements are not left on hold during migration

Take this iterative and templated approach. Migration speed will accelerate. Customers can finish migration 15%–20% faster and reduce the amount of tech debt created during the migration.

[Figure: phased migration. Diagram labels: Build Foundations; Full Lifecycle Lighthouse Workloads; Parallelize the iterations; “Make it work”, “Make it work right”, “Make it work fast”; each iteration covers Migration, Functionality, and Optimization and Delta. Callouts: Leverage Databricks’ deep bench of expertise to build out some **templates for the most effective Databricks implementation.** Take an **iterative, bite-sized approach** to migration, reduce tech debt and rework, and bring forward the value of the solution earlier.]

-----

To maximize the value of your lakehouse, you should consider retiring some legacy architecture design patterns. Leverage the migration process to simplify data warehousing tasks.
Regardless of how you complete your migration, you could utilize lakehouse strengths to improve architectural patterns:

- Merge your siloed data warehouses on your unified lakehouse platform and unify data access and data governance via Unity Catalog. The lakehouse architecture provides a unified storage layer for all your data where there is no physical boundary between data. There is no need to keep data copies for each system using the data set. Clean up and remove jobs that are created to keep data in sync across various data systems. Keep a single copy of raw data in your lakehouse as a single source of truth.
- The Databricks Lakehouse Platform allows you to merge batch and streaming into a single system to build a simple continuous data flow model to process data as it arrives. Process data in near real-time and enable data-driven decisions with the most recent updates.
- Simplify your workload isolation and management by running jobs in dedicated clusters. Separating storage and compute allows you to easily isolate each task with isolated compute resources. There is no need to squeeze them into a single large data appliance and spend lots of time managing and coordinating resources. Leverage the elasticity of the Databricks compute layer to automatically handle workload concurrency changes at peak time instead of paying for over-provisioned resources for most of the time. This greatly simplifies the workload management effort the traditional data warehouses require.
- Simplify disaster recovery. Storage and compute separation allows easy disaster recovery. The cloud storage provides very good data redundancy and supports automated replication to another region. Customers can spin up compute resources quickly in another region and maintain service availability in case of an outage.

-----

#### Security and data governance

Security is paramount in any data-driven organization.
Data security should enforce the business needs for both internal and external data, so the lakehouse should be set up to meet your organization’s security requirements. Databricks provides built-in security to protect your data during and after migration.

- Encrypt data at rest and in transit, using a cloud-managed key or your own
- Set up a custom network policy and use IP ranges to control access
- Leverage Private Link so that network traffic does not traverse the public internet
- Enable SSO, integrate with Active Directory and other IdPs
- Control data access to database objects using RBAC
- Enable audit logs to monitor user activities

The challenge with the traditional data warehouse and data lake architecture is that data is stored in multiple stores, so your data team also needs to manage data access and data governance twice. The lakehouse pattern uses unified storage, which simplifies governance. The Databricks Lakehouse Platform provides a unified governance layer across all your data teams. Migrating to Databricks Unity Catalog provides data discovery, data lineage, role-based security policies, table or row/column-level access control, and central auditing capabilities that make it easy for data stewards to confidently manage and secure data access to meet compliance and privacy needs, directly on the lakehouse.

-----

[Figure: Centralized Governance, with components labeled Audit Log, Account Level User Management, Credentials, ACL Store, Access Control, Metastore, Lineage Explorer and Data Explorer.]

-----

#### Team involvement

Plan to educate and train your team iteratively throughout the migration process. As new workloads are migrated, new teams will gain exposure to the lakehouse pattern. Plan to ramp up new team members as the migration process progresses, developing a data Center of Excellence within the organization. Databricks provides a cost-effective platform for ad hoc work.
A sandbox environment can be leveraged for teams to get exposure to Databricks technology and get hands-on experience. Databricks also provides [learning path](https://databricks.com/learn/training/home) training for customers. Encourage teams to get hands-on experience relevant to their immediate tasks, gain exposure to new things and try new ideas.

#### Conclusion

Data warehouse migration touches many business areas and impacts many teams, but the Databricks Lakehouse Platform simplifies this transition, reduces risks and accelerates your ROI. The Databricks Business Value Consulting team can work with you to quantify the impact of your use cases to both data and business teams. And the Databricks team of solution architects, professional services, and partners are ready to help. Reach out to your Databricks account team or send a message to [sales@databricks.com](mailto:sales%40databricks.com?subject=) to get started.

#### Additional resources

- [Migrate to Databricks](https://databricks.com/solutions/migration)
- [Modernize Your Data Warehouse](https://databricks.com/p/webinar/apj-modernize-your-data-warehouse)

-----

##### About Databricks

Databricks is the data and AI company. More than 7,000 organizations worldwide (including Comcast, Condé Nast, H&M and over 40% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/).
**[Sign up for a free trial](https://databricks.com/try-databricks)**

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/guide-evolve-your-data-warehouse-to-the-lakehouse-v3.pdf 2024-09-19T16:57:21Z

**The Delta Lake Series: Lakehouse**

Combining the best elements of data lakes and data warehouses

-----

#### What’s inside?

###### Here’s what you’ll find inside

The Delta Lake Series of eBooks is published by Databricks to help leaders and practitioners understand the full capabilities of Delta Lake as well as the landscape it resides in. This eBook, **The Delta Lake Series — Lakehouse**, focuses on lakehouse.

- **Introduction:** What is Delta Lake?
- **Chapter 01:** What Is a Lakehouse?
- **Chapter 02:** Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake
- **Chapter 03:** Understanding Delta Engine

#### What’s next?

After reading this eBook, you’ll not only understand what Delta Lake offers, but you’ll also understand how its features result in substantial performance improvements.

-----

#### What is Delta Lake?

[Delta Lake](https://databricks.com/product/delta-lake-on-databricks) is a unified data management system that brings data reliability and fast analytics to cloud data lakes. Delta Lake runs on top of existing data lakes and is fully compatible with Apache Spark™ APIs.

At Databricks, we’ve seen how Delta Lake can bring reliability, performance and lifecycle management to data lakes. Our customers have found that Delta Lake solves for challenges around malformed data ingestion, difficulties deleting data for compliance, or issues modifying data for change data capture.

With Delta Lake, you can accelerate the velocity at which high-quality data gets into your data lake and the rate at which teams can leverage that data with a secure and scalable cloud service.
-----

### CHAPTER 01

**What Is a Lakehouse?**

Over the past few years at Databricks, we’ve seen a new data management architecture that emerged independently across many customers and use cases: the **lakehouse.** In this chapter, we’ll describe this new architecture and its advantages over previous approaches.

Data warehouses have a long history in decision support and business intelligence applications. Since its inception in the late 1980s, data warehouse technology has continued to evolve, and MPP architectures led to systems that were able to handle larger data sizes. But while warehouses were great for structured data, a lot of modern enterprises have to deal with unstructured data, semi-structured data, and data with high variety, velocity and volume. Data warehouses are not suited for many of these use cases, and they are certainly not the most cost-efficient.

As companies began to collect large amounts of data from many different sources, architects began envisioning a single system to house data for many different analytic products and workloads. About a decade ago, companies began building [data lakes](https://databricks.com/glossary/data-lake), repositories for raw data in a variety of formats. While suitable for storing data, data lakes lack some critical features: they do not support transactions, they do not enforce data quality, and their lack of consistency/isolation makes it almost impossible to mix appends and reads, and batch and streaming jobs. For these reasons, many of the promises of data lakes have not materialized and, in many cases, this has led to a loss of many of the benefits of data warehouses.

**A lakehouse combines the best elements of data lakes and data warehouses**

A lakehouse is a new data architecture that combines the best elements of data lakes and data warehouses.

The need for a flexible, high-performance system hasn’t abated.
Companies require systems for diverse data applications including SQL analytics, real-time monitoring, data science and machine learning. Most of the recent advances in AI have been in better models to process unstructured data (text, images, video, audio), but these are precisely the types of data that a data warehouse is not optimized for. A common approach is to use multiple systems: a data lake, several data warehouses, and other specialized systems such as streaming, time-series, graph and image databases. Having a multitude of systems introduces complexity and, more importantly, introduces delay as data professionals invariably need to move or copy data between different systems.

Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. They are what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

A lakehouse has the following key features:

- **Transaction support:** In an enterprise lakehouse, many data pipelines will often be reading and writing data concurrently. Support for ACID transactions ensures consistency as multiple parties concurrently read or write data, typically using SQL.
- **Schema enforcement and governance:** The lakehouse should have a way to support schema enforcement and evolution, supporting DW schema paradigms such as star/snowflake-schemas. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms.
- **BI support:** Lakehouses enable using BI tools directly on the source data. This reduces staleness and improves recency, reduces latency and lowers the cost of having to operationalize two copies of the data in both a data lake and a warehouse.
- **Storage is decoupled from compute:** In practice, this means storage and compute use separate clusters, so these systems are able to scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property.
- **Openness:** The storage formats they use are open and standardized, such as Parquet, and they provide an API so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly.
- **Support for diverse data types ranging from unstructured to structured data:** The lakehouse can be used to store, refine, analyze and access data types needed for many new data applications, including images, video, audio, semi-structured data, and text.
- **Support for diverse workloads:** Including data science, machine learning and SQL analytics. Multiple tools might be needed to support all these workloads, but they all rely on the same data repository.
- **End-to-end streaming:** Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications.

These are the key attributes of lakehouses. Enterprise-grade systems require additional features. Tools for security and access control are basic requirements. Data governance capabilities including auditing, retention and lineage have become essential, particularly in light of recent privacy regulations. Tools that enable data discovery such as data catalogs and data usage metrics are also needed. With a lakehouse, such enterprise features only need to be implemented, tested and administered for a single system.

-----

**Read the research**

**Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores**

**Abstract**

Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes.
Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: Metadata operations, such as listing objects, are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular data sets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift, and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale data sets and billions of objects.

Authors: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hövell, Adrian Ionescu, Alicja Łuszczak, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, Matei Zaharia

Read the full research paper on the [inner workings of the lakehouse.](https://databricks.com/research/delta-lake-high-performance-acid-table-storage-overcloud-object-stores)

-----

**Some early examples**

The [Databricks Unified Data Platform](https://databricks.com/product/data-lakehouse) has the architectural features of a lakehouse. Microsoft’s Azure Synapse Analytics service, which [integrates with Azure Databricks](https://databricks.com/blog/2019/11/04/new-microsoft-azure-data-warehouse-service-and-azure-databricks-combine-analytics-bi-and-data-science.html), enables a similar lakehouse pattern.
Other managed services such as BigQuery and Redshift Spectrum have some of the lakehouse features listed above, but they are examples that focus primarily on BI and other SQL applications. Companies that want to build and implement their own systems have access to open source file formats (Delta Lake, Apache Iceberg, Apache Hudi) that are suitable for building a lakehouse.

-----

**A note about technical building blocks.** While distributed file systems can be used for the storage layer, object stores are more commonly used in lakehouses. Object stores provide low-cost, highly available storage that excels at massively parallel reads, an essential requirement for modern data warehouses.

**From BI to AI**

The lakehouse is a new data management architecture that radically simplifies enterprise data infrastructure and accelerates innovation in an age when machine learning is poised to disrupt every industry. In the past, most of the data that went into a company’s products or decision-making was structured data from operational systems, whereas today, many products incorporate AI in the form of computer vision and speech models, text mining and others. Why use a lakehouse instead of a data lake for AI? A lakehouse gives you data versioning, governance, security and ACID properties that are needed even for unstructured data.

-----

Merging data lakes and data warehouses into a single system means that data teams can move faster as they are able to use data without needing to access multiple systems. The level of SQL support and integration with BI tools among these early lakehouses is generally sufficient for most enterprise data warehouses. Materialized views and stored procedures are available, but users may need to employ other mechanisms that aren’t equivalent to those found in traditional data warehouses.
The latter is particularly important for “lift and shift scenarios,” which require systems that achieve semantics that are almost identical to those of older, commercial data warehouses.

What about support for other types of data applications? Users of a lakehouse have access to a variety of standard tools ([Apache Spark](https://databricks.com/glossary/apache-spark-as-a-service), Python, R, machine learning libraries) for non-BI workloads like data science and machine learning. Data exploration and refinement are standard for many analytic and data science applications. Delta Lake is designed to let users incrementally improve the quality of data in their lakehouse until it is ready for consumption.

Current lakehouses reduce cost, but their performance can still lag specialized systems (such as data warehouses) that have years of investments and real-world deployments behind them. Users may favor certain tools (BI tools, IDEs, notebooks) over others so lakehouses will also need to improve their UX and their connectors to popular tools so they can appeal to a variety of personas. These and other issues will be addressed as the technology continues to mature and develop. Over time, lakehouses will close these gaps while retaining the core properties of being simpler, more cost-efficient and more capable of serving diverse data applications.

-----

### CHAPTER 02

**Diving Deep Into the Inner Workings of the Lakehouse and Delta Lake**

Databricks wrote a [blog article](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html) that outlined how more and more enterprises are adopting the lakehouse pattern. The blog created a massive amount of interest from technology enthusiasts. While lots of people praised it as the next-generation data architecture, some people thought the lakehouse is the same thing as the data lake.
Recently, several of our engineers and founders wrote a research paper that describes some of the core technological challenges and solutions that set the lakehouse architecture apart from the data lake, and it was accepted and published at the International Conference on Very Large Databases (VLDB) 2020. You can read the paper, “[Delta Lake: High-Performance ACID Table Storage Over Cloud Object Stores](https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf),” here.

Henry Ford is often credited with having said, “If I had asked people what they wanted, they would have said faster horses.” The crux of this statement is that people often envision a better solution to a problem as an evolution of what they already know rather than rethinking the approach to the problem altogether. In the world of data storage, this pattern has been playing out for years. Vendors continue to try to reinvent the old horses of data warehouses and data lakes rather than seek a new solution.

-----

More than a decade ago, the cloud opened a new frontier for data storage. Cloud object stores like Amazon S3 have become some of the largest and most cost-effective storage systems in the world, which makes them an attractive platform to store data warehouses and data lakes. However, their nature as key-value stores makes it difficult to achieve the ACID transactions that many organizations require. Also, performance is hampered by expensive metadata operations (e.g., listing objects) and limited consistency guarantees.

Based on the characteristics of cloud object stores, three approaches have emerged.

**1. Data lakes**

The first is directories of files (i.e., data lakes) that store the table as a collection of objects, typically in columnar format such as Apache Parquet.
It’s an attractive approach because the table is just a group of objects that can be accessed from a wide variety of tools without a lot of additional data stores or systems. However, both performance and consistency problems are common. Hidden data corruption is common due to failed transactions, eventual consistency leads to inconsistent queries, latency is high, and basic management capabilities like table versioning and audit logs are unavailable.

**2. Custom storage engines**

The second approach is custom storage engines, such as proprietary systems built for the cloud like the Snowflake data warehouse. These systems can bypass the consistency challenges of data lakes by managing the metadata in a separate, strongly consistent service that’s able to provide a single source of truth. However, all I/O operations need to connect to this metadata service, which can increase cloud resource costs and reduce performance and availability. Additionally, it takes a lot of engineering work to implement connectors to existing computing engines like Apache Spark, TensorFlow and PyTorch, which can be challenging for data teams that use a variety of computing engines on their data. Engineering challenges can be exacerbated by unstructured data because these systems are generally optimized for traditional structured data types. Finally, and most egregiously, the proprietary metadata service locks customers into a specific service provider, leaving customers to contend with consistently high prices and expensive, time-consuming migrations if they decide to adopt a new approach later.

**3. Lakehouse**

With Delta Lake, an open source ACID table storage layer atop cloud object stores, we sought to build a car instead of a faster horse with not just a better data store, but a fundamental change in how data is stored and used via the lakehouse. A lakehouse is a new architecture that combines the best elements of data lakes and data warehouses.
Lakehouses are enabled by a new system design: implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. They are what you would get if you had to redesign storage engines in the modern world, now that cheap and highly reliable storage (in the form of object stores) is available.

Delta Lake maintains information about which objects are part of a Delta table in an ACID manner, using a write-ahead log, compacted into Parquet, that is also stored in the cloud object store. This design allows clients to update multiple objects at once, replace a subset of the objects with another, etc., in a serializable manner that still achieves high parallel read/write performance from the objects. The log also provides significantly faster metadata operations for large tabular data sets. Additionally, Delta Lake offers advanced capabilities like time travel (i.e., the ability to query point-in-time snapshots or roll back erroneous updates), automatic data layout optimization, upserts, caching, and audit logs. Together, these features improve both the manageability and performance of working with data in cloud object stores, ultimately opening the door to the lakehouse architecture that combines the key features of data warehouses and data lakes to create a better, simpler data architecture.

-----

Today, Delta Lake is used across thousands of Databricks customers, processing exabytes of structured and unstructured data each day, as well as many organizations in the open source community. These use cases span a variety of data sources and applications. The data types stored include Change Data Capture (CDC) logs from enterprise OLTP systems, application logs, time-series data, graphs, aggregate tables for reporting, and image or feature data for machine learning.
The applications include SQL workloads (most commonly), business intelligence, streaming, data science, machine learning and graph analytics. Overall, Delta Lake has proven itself to be a good fit for most data lake applications that would have used structured storage formats like Parquet or ORC, and many traditional data warehousing workloads.

Across these use cases, we found that customers often use Delta Lake to significantly simplify their data architecture by running more workloads directly against cloud object stores, and increasingly, by creating a lakehouse with both data lake and transactional features to replace some or all of the functionality provided by message queues (e.g., Apache Kafka), data lakes or cloud data warehouses (e.g., Snowflake, Amazon Redshift).

**[In the research paper,](https://databricks.com/research/delta-lake-high-performance-acid-table-storage-overcloud-object-stores) the authors explain:**

- The characteristics and challenges of object stores
- The Delta Lake storage format and access protocols
- The current features, benefits and limitations of Delta Lake
- Both the core and specialized use cases commonly employed today
- Performance experiments, including TPC-DS performance

Through the paper, you’ll gain a better understanding of Delta Lake and how it enables a wide range of DBMS-like performance and management features for data held in low-cost cloud storage, as well as how the Delta Lake storage format and access protocols make it simple to operate, highly available, and able to deliver high-bandwidth access to the object store.

-----

### CHAPTER 03

**Understanding Delta Engine**

The Delta Engine ties together a 100% Apache Spark-compatible vectorized query engine to take advantage of modern CPU architecture with optimizations to Spark 3.0’s query optimizer and caching capabilities that were launched as part of Databricks Runtime 7.0.
Together, these features significantly accelerate query performance on data lakes, especially those enabled by [Delta Lake](https://databricks.com/product/delta-lake-on-databricks), to make it easier for customers to adopt and scale a [lakehouse architecture](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html).

**Scaling execution performance**

One of the big hardware trends over the last several years is that CPU clock speeds have plateaued. The reasons are outside the scope of this chapter, but the takeaway is that we have to find new ways to process data faster beyond raw compute power. One of the most impactful methods has been to improve the amount of data that can be processed in parallel. However, data processing engines need to be specifically architected to take advantage of this parallelism.

In addition, data teams are being given less and less time to properly model data as the pace of business increases. Poorer modeling in the interest of better business agility drives poorer query performance. Naturally, this is not a desired state, and organizations want to find ways to maximize both agility and performance.

-----

**Announcing Delta Engine for high-performance query execution**

Delta Engine accelerates the performance of Delta Lake for SQL and DataFrame workloads through three components: an improved query optimizer, a caching layer that sits between the execution layer and the cloud object storage, and a native vectorized execution engine that’s written in C++.

The improved query optimizer extends the functionality already in Spark 3.0 (cost-based optimizer, adaptive query execution, and dynamic runtime filters) with more advanced statistics to deliver up to 18x increased performance in star schema workloads.

Delta Engine’s caching layer automatically chooses which input data to cache for the user, transcoding it along the way in a more CPU-efficient format to better leverage the increased storage speeds of NVMe SSDs.
This delivers up to 5x faster scan performance for virtually all workloads.

However, the biggest innovation in Delta Engine to tackle the challenges facing data teams today is the native execution engine, which we call Photon. (We know. It’s an engine within the engine…). This completely rewritten execution engine for Databricks has been built to maximize the performance from the new changes in modern cloud hardware. It brings performance improvements to all workload types while remaining fully compatible with open Spark APIs.

**Getting started with Delta Engine**

By linking these three components together, we think it will be easier for customers to understand how improvements in multiple places within the Databricks code aggregate into significantly faster performance for analytics workloads on data lakes.

We’re excited about the value that Delta Engine delivers to our customers. While the time and cost savings are already valuable, its role in the lakehouse pattern supports new advances in how data teams design their data architectures for increased unification and simplicity.

For more information on the Delta Engine, watch this keynote address from [Spark + AI Summit 2020: Delta Engine: High-Performance Query Engine for Delta Lake](https://www.youtube.com/watch?v=o54YMz8zvCY).

-----

## What’s next?

Now that you understand Delta Lake and how its features can improve performance, it may be time to take a look at some additional resources.
**Data + AI Summit Europe 2020 >**

- [Photon Technical Deep Dive: How to Think Vectorized](https://databricks.com/session_eu20/photon-technical-deep-dive-how-to-think-vectorized)
- [MLflow, Delta Lake and Lakehouse Use Cases Meetup and AMA](https://databricks.com/session_eu20/mlflow-delta-lake-and-lakehouse-use-cases-meetup)
- [Common Strategies for Improving Performance on Your Delta Lakehouse](https://databricks.com/session_eu20/common-strategies-for-improving-performance-on-your-delta-lakehouse)
- [Achieving Lakehouse Models With Spark 3.0](https://databricks.com/session_eu20/achieving-lakehouse-models-with-spark-3-0)
- [Radical Speed for Your SQL Queries With Delta Engine](https://databricks.com/session_eu20/radical-speed-for-your-sql-queries-with-delta-engine)

**Explore subsequent eBooks in the collection >**

- The Delta Lake Series — Fundamentals and Performance
- The Delta Lake Series — Features
- The Delta Lake Series — Streaming
- The Delta Lake Series — Customer Use Cases

**Do a deep dive into Delta Lake >**

- [Analytics on the Data Lake With Tableau and the Lakehouse Architecture](https://databricks.com/blog/2020/11/11/analytics-on-the-data-lake-with-tableau-and-the-lakehouse-architecture.html)
- [Visit the site for additional resources](https://databricks.com/product/delta-lake-on-databricks)

**Vodcasts and podcasts >**

- [Welcome to Lakehouse.
Data Brew | Episode 2](https://www.youtube.com/watch?v=HVqxI7sFbKc)
- [Data Brew by Databricks | Season 1: Lakehouses](https://databricks.com/discover/data-brew)
- [Data Alone Is Not Enough: The Evolution of Data Architectures](https://a16z.com/2020/10/22/data-alone-is-not-enough-the-evolution-of-data-architectures/)

**[Try Databricks for free >](https://databricks.com/product/delta-lake-on-databricks)** **[Learn more >](https://databricks.com/product/delta-lake-on-databricks)**

-----

SUCCESS /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/The-Delta-Lake-Series-Lakehouse-012921.pdf 2024-09-19T16:57:19Z

**EBOOK**

# All Roads Lead to the Lakehouse

#### A deep dive into data ingestion with the lakehouse

-----

## Contents

- Introduction
- Life of a Data Engineer
- Ingesting From Cloud Object Stores
  - COPY INTO
  - Auto Loader
- Ingesting Data From External Applications
  - Partner Connect

-----

### Introduction

Organizations today are inundated with data siloed across various on-premises application systems, databases, data warehouses and SaaS applications. This fragmentation makes it difficult to support new use cases for analytics or machine learning, so many IT teams are now centralizing all of their data with a lakehouse architecture built on top of Delta Lake, an open format storage layer.

The first thing data engineers need to do to support the lakehouse architecture is to efficiently move data from various systems into their lakehouse. Ingesting data is a critical first step in the data engineering and management lifecycle.

-----

### Life of a Data Engineer

The primary focus of data engineers is to provide timely and reliable data to downstream data teams at an organization. Requests for data can come from a variety of teams, and for a variety of data types. For example:

- Marketing team requests for Facebook and Google ad data in order to analyze spend and better allocate their budget for ads
- Security team looking to get access to a table with low latency security data from Kafka, in order to run rules to detect intrusions into the network
- Sales operations requesting customer data from Salesforce to enrich existing tables
- Finance team hoping to find a way to automatically ingest critical data from Google Sheets or transaction data from AWS Kinesis

In each of these common scenarios, data engineers must create usable and easily queryable tables from semi-structured and unstructured data.
Beyond writing queries to retrieve and transform all this data, the data engineering team must also be concerned with performance, because running these queries on an ongoing basis can be a big load on the system.

Data engineers face the challenge of constant requests and ongoing business requirements, as well as an ever-changing ecosystem. As business requirements change, so do the requirements around schemas, necessitating custom code to handle the changes. With all of these challenges, the work of a data engineer is extremely critical, and increasingly complex, with many steps involved before getting data to a state where it can actually be queried by the business stakeholders. So how do data engineers get the data that each of these teams needs at the frequency, with the freshness, and in the format required?

###### What is Delta Lake?

Before thinking about ingestion into Delta Lake, it’s important to understand why ingesting into Delta Lake is the right solution in the first place. [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) is an open format data management layer that brings data warehouse capabilities to your open data lake. Across industries, enterprises have enabled true collaboration among their data teams with a reliable single source of truth enabled by Delta Lake. By delivering quality, reliability, security and performance on your data lake — for both streaming and batch operations — Delta Lake eliminates data silos and makes analytics accessible across the enterprise. With Delta Lake, customers can build a cost-efficient, highly scalable lakehouse that eliminates data silos and provides self-serving analytics to end users.

-----

### Ingesting From Cloud Object Stores

There are a number of common ways in which data engineers ingest data into Delta Lake. First and foremost is ingesting files from cloud object stores such as Azure Data Lake Storage, AWS S3 or Google Cloud Storage.
Typically, customers are looking to migrate existing tables or perform incremental ingestion into Delta Lake, and to do so, they can leverage tools like [CONVERT TO DELTA](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-convert-to-delta.html), [COPY INTO](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html), and [Auto Loader](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html). We will focus on Auto Loader and COPY INTO here.

**Auto Loader**

Auto Loader is an optimized data ingestion tool that incrementally and efficiently processes new data files as they arrive in cloud storage with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe to file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory. Auto Loader has interfaces through Python and Scala, and can be used with SQL through Delta Live Tables.

**COPY INTO**

COPY INTO is a SQL command that allows you to perform batch file ingestion into Delta Lake. COPY INTO is a command that ingests files with exactly-once semantics, best used when the input directory contains thousands of files or fewer, and the user prefers SQL. COPY INTO can be used over JDBC to push data into Delta Lake at your convenience.

-----

##### COPY INTO

COPY INTO is a powerful yet simple SQL command that allows you to perform batch file ingestion into Delta Lake and perform many of the use cases outlined in this section. COPY INTO can be run once, in an ad hoc manner, and can be scheduled through Databricks jobs.
While COPY INTO does not support low latencies, you can trigger a COPY INTO based on events by using cloud functions such as AWS Lambda or through orchestrators like Apache Airflow. COPY INTO supports incremental appends and simple transformations.

COPY INTO is a great command to use when your source directory contains a small number of files (i.e., thousands of files or fewer). To ingest a larger number of files, we recommend Auto Loader, which we will cover later in this eBook.

**Common Use Cases for COPY INTO**

**Ingesting data to a new Delta table**

A common ad hoc ingestion use case using COPY INTO is to ingest data into a new Delta table. To copy data into a new Delta table, users can use the CREATE TABLE command first, followed by COPY INTO.

Step 1:

`CREATE TABLE my_table (id INT, name STRING, age INT);`

Step 2¹:

```
COPY INTO my_table
FROM 's3://my_bucket/my_path' WITH (
  CREDENTIAL (
    AWS_ACCESS_KEY = '*****',
    AWS_SECRET_KEY = '*****',
    AWS_SESSION_TOKEN = '*****'
  )
  ENCRYPTION (
    TYPE = 'AWS_SSE_C',
    MASTER_KEY = '*****'
  )
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
```

The code block above covers the AWS temporary in-line credential format. When you use in-line credentials in Azure and AWS, the following parameters are required for each type of credential and encryption:

|Credential Name|Required Parameters|
|---|---|
|AWS temporary credentials|AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_SESSION_TOKEN|
|Azure SAS token|AZURE_SAS_TOKEN|

|Encryption Name|Required Parameters|
|---|---|
|AWS server-side encryption with customer-provided encryption key|TYPE = 'AWS_SSE_C', MASTER_KEY|
|Azure client-provided encryption key|TYPE = 'AZURE_CSE', MASTER_KEY|

**Appending data to your Delta table**

To append data to a Delta table, users can leverage the COPY INTO command. COPY INTO is a powerful SQL command that is idempotent and incremental.
When using COPY INTO, users point to a location of files, and once those files are ingested, Delta Lake will keep track of the state of files that have been ingested. Unlike commands like INSERT INTO, users get idempotency with COPY INTO, which means users are prevented from ingesting the same data twice to the same table.

¹ If you only have temporary access to a cloud object store, you can use temporary in-line credentials to ingest data from the cloud object store. When you are an admin or have ANY FILE access, and the instance profile has been set for the cloud object store, you do not need to specify credentials in-line for COPY INTO.

-----

```
COPY INTO table_identifier
FROM [ file_location | ( SELECT expression_list FROM file_location ) ]
FILEFORMAT = JSON | CSV | TEXT | PARQUET | AVRO | ORC | BINARYFILE
[ FILES = ( file_name [, ...] ) | PATTERN = 'regex_pattern' ]
[ FORMAT_OPTIONS ( 'data_source_reader_option' = 'value' [, ...] ) ]
[ COPY_OPTIONS ( 'OPTION' = 'VALUE' [, ...] ) ]
```

One of the main benefits of COPY INTO is that users don’t have to worry about providing a schema, because the schema is automatically inferred from your data files. Here is a very simple example of how you would ingest data from CSV files that have headers, where you leave the tool to infer the schema and the proper data types. It’s as simple as that.
```
COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
```

**Using COPY INTO without an existing table** ²

In the most common case, in order to use COPY INTO, a table definition is required. However, if you would like to get started quickly and don’t have an existing table or require a specific schema, you can create your table with a dummy schema. Then, once you run COPY INTO, you can overwrite the table and overwrite the schema. COPY INTO will actually infer the data types, and then change your Delta table to have the required schema.

```
CREATE TABLE my_delta_table (dummy string);

COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'header' = 'true',
  'inferSchema' = 'true',
  'mergeSchema' = 'true'
)
COPY_OPTIONS ('overwrite' = 'true', 'overwriteSchema' = 'true')
```

**Ingesting a CSV file without headers**

If you are looking to ingest a CSV file that doesn’t have headers, columns will be named _c0, _c1 and so on, using the index of the column. You can use the double colon syntax to cast each column to the data type that you want and then alias these columns to whatever you want to call them.

```
COPY INTO my_delta_table
FROM (
  SELECT
    _c0::int as key,
    _c1::double value,
    _c2::timestamp event_time
  FROM 's3://my-bucket/path/to/csv_files'
)
FILEFORMAT = CSV
```

² This use case will not work in Databricks SQL workspace, as it currently only works on clusters without table ACLs.

-----

**Evolving schema over time for CSV files** ³

When ingesting CSV files that have a different number of columns than your existing table, you can use the option `'mergeSchema' = 'true'`. This option needs to be provided both as FORMAT_OPTIONS and COPY_OPTIONS. FORMAT_OPTIONS applies to the source data. Once “mergeSchema” is provided as a format option, Databricks will look at multiple CSV files and infer the schema across those files.
COPY_OPTIONS applies to your Delta table when you’re running the COPY INTO command. When “mergeSchema” is provided as a copy option, you’re instructing Delta Lake that it is safe to evolve the schema. Schema evolution only allows the addition of new columns. Data type changes for existing columns are not supported.

```
COPY INTO my_delta_table
FROM (
  SELECT
    _C0::int as key,
    _C1::double value,
    _C2::timestamp event_time,
    ...
  FROM 's3://my-bucket/path/to/csv_files'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('mergeSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true')
```

**Fixing bad data**

If you find that there is a mistake in the source data file and some of the data you ingested is bad, you can use RESTORE on your Delta table and set it to the timestamp or version of the Delta table that you want to roll back to (e.g., to restore to yesterday’s data). Then you can rerun your COPY INTO command.

Alternatively, if running a RESTORE is not possible, COPY INTO supports reloading files by the use of the “force” copy option. You can manually remove the old data from your Delta Lake table by running a DELETE operation and then using COPY INTO with `'force' = 'true'`. You can use the PATTERN keyword to provide a file name pattern, or you can specify the file names with the FILES keyword to reload a subset of files in conjunction with “force”.

```
RESTORE my_delta_table TO TIMESTAMP AS OF date_sub(current_date(), 1);

COPY INTO my_delta_table
FROM 's3://my-bucket/path/to/csv_files'
FILEFORMAT = CSV
PATTERN = '2021-09-08*.csv'
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('force' = 'true')
```

³ Limitation: schema evolution with “mergeSchema” in COPY_OPTIONS does not work in Databricks SQL workspace or clusters enabled with table ACLs.
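The idempotency and “force” behavior described in this COPY INTO section can be illustrated with a small toy model. This is only a sketch in plain Python, not how Delta Lake is implemented: the class `ToyCopyIntoTable`, its `copy_into` method and the file names are all hypothetical, and the "table" is just a list of rows. It shows why rerunning the same load is a no-op, and why old rows should be removed (or the table restored) before reloading with force.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCopyIntoTable:
    """Hypothetical stand-in for a Delta table plus the ingested-file state."""

    rows: list = field(default_factory=list)
    ingested_files: set = field(default_factory=set)

    def copy_into(self, files: dict, force: bool = False) -> int:
        """Load rows from `files` (file name -> list of rows).

        Files already recorded as ingested are skipped unless force=True.
        Returns the number of files actually loaded.
        """
        loaded = 0
        for name in sorted(files):
            if name in self.ingested_files and not force:
                continue  # already ingested: skipping makes the load idempotent
            self.rows.extend(files[name])
            self.ingested_files.add(name)
            loaded += 1
        return loaded


source = {"2021-09-08-a.csv": [1, 2], "2021-09-08-b.csv": [3]}
table = ToyCopyIntoTable()

first = table.copy_into(source)               # loads both files
second = table.copy_into(source)              # loads nothing new
forced = table.copy_into(source, force=True)  # reloads both files

print(first, second, forced, len(table.rows))
```

In this toy model the forced reload appends the same rows again, which mirrors the advice above to delete or restore the bad data before using force.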
-----

##### Auto Loader

While COPY INTO can solve a lot of the key use cases our customers face, its scalability limitations mean there are many scenarios where we recommend Auto Loader for data ingestion. Auto Loader is a data source on Databricks that incrementally and efficiently processes new data files as they arrive in cloud storage with minimal DevOps effort. Auto Loader is available in Python and Scala, and also in SQL in [Delta Live Tables](https://databricks.com/product/delta-live-tables).

Auto Loader is an incremental streaming source that provides exactly-once ingestion guarantees. It keeps track of which files have been ingested using a durable key-value store. It can discover new files very efficiently and is extremely scalable. Auto Loader has been battle tested. We have seen customers running Auto Loader on millions of files an hour, and petabytes of data per day.

To use Auto Loader, you simply specify ‘readStream’ and the format “cloudFiles”, indicating that you will use Auto Loader to load files from the cloud object stores. Next, you specify the format of the file — for example, JSON — as an option to Auto Loader, and you specify where to load it from.

```
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("/path/to/table")
)
```

Under the hood, when data lands in your cloud storage, Auto Loader discovers files either through directory listing or file notifications. Given permissions to the underlying storage bucket or container, Auto Loader can list the directory that you want to load data from in an efficient and scalable manner and load data immediately. Alternatively, Auto Loader can also automatically set up file notifications on your storage account, which allows it to efficiently discover newly arriving files. In file notification mode, when a file lands, the cloud storage system sends a notification to a queuing system. For example, in AWS, S3 will send a notification to AWS SQS. On Azure, a notification is sent to Azure queue storage. On Google, it’ll be sent to Pub/Sub. Auto Loader can then fetch these event notifications from queues, deduplicate these notifications using its key-value store and then process the underlying files. If there are any failures, Auto Loader will replay what hasn’t been processed, giving you exactly-once semantics.

Directory listing mode is very easy to get started with. If your files are uploaded to your cloud storage system in a lexicographical order, Auto Loader will optimize the discovery of files by starting directory listing from the latest uploaded files, saving you both time and money. If files cannot be uploaded in a lexicographical order and you need Auto Loader to scale to high volumes, Databricks recommends using the file notification mode. Cloud services such as AWS Kinesis Firehose, AWS DMS and Azure Data Factory can be configured to upload files in a lexical order, typically by providing the upload time of records in the file path, such as /base/path/yyyy/MM/dd/HH/file.format.

**Common Use Cases for Auto Loader**

**New to Auto Loader**

As a new user to the Databricks Lakehouse, you’ll want to ingest data from cloud object stores into Delta Lake as part of your data pipeline for incremental loading. Here is a simple example using Python to demonstrate the ease and flexibility of Auto Loader with a few defined options. You can run the code in a notebook.

```
stream = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(raw_data_location)
```

-----

In order to write to a Delta table from the stream, follow the example below:

```
stream.writeStream \
    .option("mergeSchema", "true") \
    .option("checkpointLocation", checkpoint_location) \
    .start(target_delta_table_location)
```

**Migrating to Auto Loader**

As a Spark user, you may be using an existing Spark Structured Streaming job to process data.
To migrate to Auto Loader, all a user needs to do is take the existing streaming code and change two lines: set the stream format to ‘cloudFiles’ and specify the file format within an option.

Existing streaming code:

```
df = (
    spark.readStream
    .format("json")
    .options(**format_options)
    .schema(schema)
    .load("/path/to/table")
)
```

With Auto Loader:

```
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .options(**format_options)
    .schema(schema)
    .load("/path/to/table")
)
```

Once it’s converted, users will see instant benefits like scalability and cost reduction. Auto Loader can scale to trillions of files, unlike the open-source file streaming source. One of the ways that Auto Loader does this is with asynchronous backfills. Instead of needing to discover files first, then plan, Auto Loader discovers and processes files concurrently, making it much more efficient and leading to cost reductions in compute resources.

**Migrating a livestreaming pipeline**

Migrating a livestreaming pipeline can be challenging, but with Auto Loader, as with COPY INTO, you can specify a timestamp when the source files are updated or created and Auto Loader will ingest all modified data after that point.

```
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("modifiedAfter", "2021-09-09 00:00:00")
    .options(**format_options)
    .schema(schema)
    .load("/path/to/table")
)
```

**Schema inference and evolution**

Auto Loader provides schema inference and management capabilities. With a schema location specified, Auto Loader can store the changes to the inferred schema over time. For file formats like JSON and CSV, where the schemas can get fuzzy, schema inference on Auto Loader can automatically infer data types or treat everything as a string.

When data does not match your schema (e.g., an unknown column or format), Auto Loader has a data rescue capability that will “rescue” all data in a separate column, stored as a JSON string, to investigate later. See [rescued data column](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-schema.html#rescued-data-column) for more details.

Auto Loader supports three schema evolution modes: add new columns as they are discovered, fail if an unexpected column is seen, or rescue new columns.

-----

**Fixing a file that was processed with Auto Loader**

To fix a file that was already processed, Auto Loader supports an option called ‘allowOverwrites’. With this option, Auto Loader can re-ingest and reprocess a file with a new timestamp. If you want to enable this option in an existing Auto Loader stream, you need to stop and restart the Auto Loader stream with the enabled option.

```
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(schema)
    .option("cloudFiles.allowOverwrites", "true")
    .options(**format_options)
    .load("/path/to/table")
)
```

**Discover missing data**

While event notification is a very scalable method to collect all data, it relies on cloud services, which are distributed systems and are not always reliable. With Auto Loader, you can additionally specify a backfill interval, where Auto Loader will perform asynchronous backfills at whatever interval you set up. This can be enabled with a once trigger, processing time trigger and available now trigger. The following example shows how to use backfill interval and trigger availableNow together:

```
query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(schema)
    .option("cloudFiles.backfillInterval", "1 week")
    .options(**format_options)
    .load("/path/to/table")
    .writeStream
    .trigger(availableNow=True)  # Python form of Trigger.AvailableNow()
    .option("checkpointLocation", checkpointDir)
    .start()
)
```

The trigger tells Auto Loader how frequently to process incoming data. A processing time trigger will have Auto Loader run continuously and schedule micro-batches at the trigger interval which you have set. The “Once” and “AvailableNow” triggers instruct Auto Loader to process all new data that has been added until the start of your application. Once the data is processed, Auto Loader will automatically shut down. Trigger Once will have Auto Loader process all the new data in a single micro-batch, which requires it to first discover all the new files. With Trigger AvailableNow, Auto Loader can discover and process files concurrently and perform rate limiting, which makes it a preferable alternative to Trigger Once.

-----

**Using Auto Loader in SQL with Delta Live Tables**

Delta Live Tables is a cloud-native ETL service on Databricks that provides a reliable framework to develop, test, monitor, manage and operationalize data pipelines at scale to drive insights for data science, machine learning and analytics. Auto Loader is available in Delta Live Tables.

```
CREATE INCREMENTAL LIVE TABLE autoloader_test
AS SELECT
  *,
  id + id2 AS new_id
FROM CLOUD_FILES(
  "some/cloud/path", -- the path to the data
  "json"             -- the file format
);
```

**Live Tables understands and coordinates data flow between your queries**

-----

### Ingesting Data From External Applications

While Auto Loader and COPY INTO are powerful tools, not all data is available as files in cloud object stores. In order to enable a lakehouse, it is critical to incorporate all of your data and break down the silos between sources and downstream teams. To do this, customers need to discover and connect a broad set of data, BI and AI tools, and systems to the data within their lakehouse.

##### Partner Connect

Historically, stitching multiple enterprise tools and data sources together has been a burden on the end user, making it very complicated and expensive to execute at any scale.
Partner Connect solves this challenge by making it easy for you to integrate data, analytics and AI tools directly within your Databricks Lakehouse. It also allows you to discover new, pre-validated solutions from Databricks partners that support your expanding analytics needs.

To ingest into the lakehouse, select the partner tile in Partner Connect via the left navigation bar in Databricks. Partner Connect will automatically configure resources such as clusters, tokens and connection files for you to connect with your data ingestion tools of choice. You can finish signing up for a trial account on the partner’s website or directly log in if you already used Partner Connect to create a trial account. Once you log in, you will see that Databricks is already configured as a destination in the partner portal and ready to be used.

-----

**Common Use Case for Partner Connect**

**Ingest Salesforce data via Fivetran into Delta Lake**

Clicking on the Fivetran tile in Partner Connect starts an automated workflow between the two products. Databricks automatically provisions a SQL endpoint and associated credentials for Fivetran to interact with, and passes the user’s identity and the SQL endpoint configuration to Fivetran automatically via a secure API. Within Fivetran, a Databricks destination is automatically created. This destination is configured to ingest into Delta via the SQL endpoint that was auto-configured by Partner Connect.

The customer now selects their choice of data source in Fivetran from hundreds of pre-built connectors — for example, Salesforce. The user authenticates to the Salesforce source, chooses the Salesforce objects they want to ingest into Delta Lake on Databricks (in this case the Account & Contact objects) and starts the initial sync.

-----

This automation saves users the dozens of manual steps and the copying/pasting of configuration that setting up the connection manually would require.
It also protects the user from making any unintentional configuration errors and spending time debugging those errors. The Salesforce tables are now available to query, join and analyze in Databricks SQL. Watch the [demo](https://databricks.com/partnerconnect#partner-demos) for more details or check out the [Partner Connect guide](https://docs.databricks.com/integrations/partner-connect/index.html?_gl=1*1mz2ts6*_gcl_aw*R0NMLjE2MzY2NzU1NDcuQ2p3S0NBaUFtN09NQmhBUUVpd0FydkdpM0ZHS3ptZTR5Z2YzR3E4ajVrYTNaUExOUEFnaTZIMnNRU05EMC1RYzl0dGxXQjl6ajRuNU14b0N0OGdRQXZEX0J3RQ..&_ga=2.83627156.328510291.1641248936-1825366797.1612985070) to learn more. ----- ### About Databricks Databricks is the data and AI company. More than 5,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , LinkedIn and Facebook . 
-----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/EB-Ingesting-Data-FINAL.pdf2024-09-19T16:57:19Z

# 2023 State of Data + AI

Powered by the Databricks Lakehouse

-----

**We’re in the golden age of data and AI**

-----

### Intro

In the 6 months since ChatGPT launched, the world has woken up to the vast potential of AI. The unparalleled pace of AI discoveries, model improvements and new products on the market puts data and AI strategy at the top of conversations across every organization around the world. We believe that AI will usher in the next generation of product and software innovation, and we’re already seeing this play out in the market. The next generation of winning companies and executives will be those who understand and leverage AI.

In this report, we examine patterns and trends in data and AI adoption across more than 9,000 global Databricks customers. By unifying business intelligence (BI) and AI applications across companies’ entire data estates, the Databricks Lakehouse provides a unique vantage point into the state of data and AI, including which products and technologies are the fastest growing, the types of data science and machine learning (DS/ML) applications being developed and more.

-----

**Here are the major stories we uncovered:**

Companies are adopting machine learning and large language models (LLMs) at a rapid pace. Natural language processing (NLP) is dominating use cases, with an accelerated focus on LLMs.
Organizations are investing in data integration products as they prioritize more DS/ML initiatives. 50% of our fastest-growing products represent the data integration category.

Organizations are increasingly using the Lakehouse for data warehousing, as evidenced by the high growth of data integration tools dbt and Fivetran, and the accelerated adoption of Databricks SQL.

We hope that by sharing these trends, data leaders will be able to benchmark their organizations and gain insights that help inform their strategies for an era defined by data and AI.

-----

### Summary of Key Findings

##### DATA SCIENCE AND MACHINE LEARNING: NLP AND LLMS ARE IN HIGH DEMAND

**•** The number of companies using SaaS LLM APIs (used to access services like ChatGPT) has grown 1310% between the end of November 2022 and the beginning of May 2023

**•** NLP accounts for 49% of daily Python data science library usage, making it the most popular application

**•** Organizations are putting substantially more models into production (411% YoY growth) while also increasing their ML experimentation (54% YoY growth)

**•** Organizations are getting more efficient with ML; for every three experimental models, roughly one is put into production, compared to five experimental models a year prior

-----

##### FASTEST-GROWING DATA AND AI PRODUCTS

**•** BI is the top data and AI market, but growth trends in other markets show that companies are increasingly looking at more advanced data use cases

**•** The fastest-growing data and AI product is dbt, which grew 206% YoY by number of customers

**•** Data integration is the fastest-growing data and AI market on the Databricks Lakehouse with 117% YoY growth

##### ADOPTION AND MIGRATION TRENDS

**•** 61% of customers migrating to the Lakehouse are coming from on-prem and cloud data warehouses

**•** The volume of data in Delta Lake has grown 304% YoY

**•** The Lakehouse is increasingly being used for data warehousing, including serverless data warehousing with Databricks SQL, which grew 144% YoY

-----

### Methodology: How did Databricks create this report?

The _2023 State of Data + AI_ is built from fully-aggregated, anonymized data collected from our customers based on how they are using the Databricks Lakehouse and its broad ecosystem of integrated tools. This report focuses on machine learning adoption, data architecture (integrations and migrations) and use cases. The customers in this report represent every major industry and range in size from startups to many of the world’s largest enterprises.

Unless otherwise noted, this report presents and analyzes data from February 1, 2022, to January 31, 2023, and usage is measured by number of customers. When possible, we provide YoY comparisons to showcase growth trends over time.

-----

### Data Science and Machine Learning

##### NATURAL LANGUAGE PROCESSING AND LARGE LANGUAGE MODELS ARE IN HIGH DEMAND

Across all industries, companies leverage data science and machine learning (DS/ML) to accelerate growth, improve predictability and enhance customer experiences. Recent advancements in large language models (LLMs) are propelling companies to rethink AI within their own data strategies.

Given the rapidly evolving DS/ML landscape, we wanted to understand several aspects of the market:

- Which types of DS/ML applications are companies investing in? In particular, given the recent buzz, what does the data around LLMs look like?
- Are companies making headway on operationalizing their machine learning models (MLOps)?

-----

[Chart: Specialized Python DS/ML libraries from February 2022 to January 2023. Categories shown: Time Series, Speech Recognition, Simulations & Optimizations, Recommender Systems, Natural Language Processing, Industry Data Modeling, Graph, Geospatial, Computer Vision, Anomaly Detection & Segmentation.]

Note: This chart reflects the unique number of notebooks using ML libraries per day in each of the categories.
It includes libraries used for the particular problem-solving use cases mentioned. It does not include libraries used in tooling for data preparations and modeling.

-----

### Natural language processing dominates machine learning use cases

To understand how organizations are applying AI and ML within the Lakehouse, we aggregated the usage of specialized Python libraries, which include NLTK, Transformers and FuzzyWuzzy, into popular data science use cases.¹ We look at data from these libraries because Python is on the cutting edge of new developments in ML, advanced analytics and AI, and has consistently ranked as one of the [most popular programming languages](https://www.tiobe.com/tiobe-index/) in recent years.

Our most popular use case is natural language processing (NLP), a rapidly growing field that enables businesses to gain value from unstructured textual data. This opens the door for users to accomplish tasks that were previously too abstract for code, such as summarizing content or extracting sentiment from customer reviews. In our data set, 49% of libraries used are associated with NLP. LLMs also fall within this bucket. Given the innovations launched in recent months, we expect to see NLP take off even more in coming years as it is applied to use cases like chatbots, research assistance, fraud detection, content generation and more.

**In our data set, 49% of specialized Python libraries used are associated with NLP**

Our second most popular DS/ML application is simulations and optimization, which accounts for 30% of all use cases. This signals organizations are using data to model prototypes and solve problems cost-effectively.

Many of the DS/ML use cases are predominantly leveraged by specific industries. While they take up a smaller share of the total, they are mission-critical for many organizations.
For example, time series includes forecasting, a use case that is especially popular in industries such as Retail and CPG, which rely heavily on the ability to forecast the need for every item in every store.

¹ This data does not include general-purpose ML libraries, including scikit-learn or TensorFlow.

-----

##### USE OF LARGE LANGUAGE MODELS (LLMS)

We have rolled these libraries up into groupings based on the type of functionality they provide.

[Chart: Use of large language models (LLMs), February 2022 to May 2023. Series: Transformer-Related Libraries, SaaS LLM APIs, LLM Tools. Annotations: a 2022 launch marker (label garbled in extraction) and “March 24, 2023 Dolly Launch”.]

Note: There are several popular types of Python libraries that are commonly used for LLMs. These libraries provide pretrained models and tools for building, training and deploying LLMs.

-----

### Large language models are the “it” tool

LLMs are currently one of the hottest and most-watched areas in the field of NLP.
LLMs have been instrumental in enabling machines to understand, interpret and generate human language in a way that was previously impossible, powering everything from machine translation to content creation to virtual assistants and chatbots. Transformer-related libraries were growing in popularity even before ChatGPT thrust LLMs into the public consciousness.

Within the last 6 months, our data shows two accelerating trends. Organizations are building their own LLMs, which models like [Dolly](https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm) show can be quite accessible and inexpensive. They are also using proprietary models like ChatGPT.

Transformer-related libraries, such as Hugging Face, which are used to train LLMs, have the highest adoption within the Lakehouse. The second most popular type is SaaS LLMs, which are used to access models like those from OpenAI. This category has grown exponentially in parallel with the [launch of ChatGPT](https://openai.com/blog/chatgpt): the number of Lakehouse customers using SaaS LLMs has grown an impressive 1310% between the end of November 2022 and the beginning of May 2023. (In contrast, transformer-related libraries grew 82% in this same period.)

Organizations can leverage LLMs either by using SaaS LLM APIs to call services like ChatGPT from OpenAI or by operating their own LLMs in-house.

Thinking of building your own modern LLM application? This approach could entail the use of specialized transformer-related Python libraries to train the model, as well as LLM tools like LangChain to develop prompt interfaces or integrations to other systems.

``` LLM DEFINITIONS ```

**◊** **Transformer-related libraries:** Python libraries used to train LLMs (example: Hugging Face)
**◊** **SaaS LLM APIs:** Libraries used to access LLMs as a service (example: OpenAI)
**◊** **LLM tools:** Toolchains for working with and building proprietary LLMs (example: LangChain)
-----

``` Machine learning experimentation and production take off across industries ```

The increasing demand for ML solutions and the growing availability of technologies have led to a significant increase in experimentation and production, two distinct parts of the ML model lifecycle. We look at the _logging_ and _registering_ of models in MLflow, an open source platform developed by Databricks, to understand how ML is trending and being adopted within organizations.

``` LOGGED MODELS AND ML EXPERIMENTATION ```

During the experimentation phase of ML, data scientists develop models designed to solve given tasks. After training the models, they test them to evaluate their accuracy, precision, recall (the percentage of correctly predicted positive instances out of all actual positive instances), and more. These metrics are logged (recorded) in order to analyze the various models’ performance and identify which approach works best for the given task. We have chosen logged models as a proxy to measure ML experimentation because the MLflow Tracking Server is designed to facilitate experiment tracking and reproducibility.

MLflow Model Registry launched in May 2021. Overall, the number of logged models has grown 54% since February 2022, while the number of registered models has grown 411% over the same period. This growth in volume suggests organizations are understanding the value of investing in and allocating more people power to ML.

``` REGISTERED MODELS AND ML PRODUCTION ```

Production models have undergone the experimentation phase and are then deployed in real-world applications. They are typically used to make predictions or decisions based on new data. Registering a model is the process of recording and storing metadata about a trained model in a centralized location that allows users to easily access and reuse existing models. Registering models prior to production enables organizations to ensure consistency and reliability in model deployment and scale.
We have chosen registered models to represent ML production because the MLflow Model Registry is designed to manage models that have left the experimentation phase through the rest of their lifecycle.

-----

[...] before committing an ML model to production. We wanted to understand, “How many models do data scientists experiment with before moving to production?” Our data shows the ratio of logged to registered models is 2.9 : 1 as of January 2023. This means that for roughly every three experimental models, one model will get registered as a candidate for production. This ratio has improved significantly from just a year prior, when we [...] was registered.

Recent advances in ML, such as improved open source libraries like MLflow and Hugging Face, have radically simplified building and putting models into production. The result is that 34% of logged models are now candidates for production, an improvement from over 20% just a year ago.

[Chart: Ratio of logged vs. registered models. Y-axis: number of models. X-axis: monthly, February through January 2023. Callout: 2.9 : 1, ratio of logged to registered models in January 2023.]

-----

``` The Modern Data and AI Stack ```

Over the last several years, the trend toward building open, unified data architectures has played out in our own data. We see that data leaders are opting to preserve choice, leverage the best products and deliver innovation across their organizations by democratizing access to data for more people.
-----

``` FASTEST-GROWING DATA AND AI PRODUCTS ```

|Product|Year-over-year growth by number of customers|
|---|---|
|dbt|206%|
|Fivetran|181%|
|Informatica|174%|
|Qlik Data Integration|152%|
|Esri|145%|
|Looker|141%|
|Hugging Face|110%|
|Lytics|101%|
|Great Expectations|100%|
|Kepler.gl|95%|

-----

``` DBT IS THE FASTEST-GROWING DATA AND AI PRODUCT OF 2023 ```

As companies move quickly to develop more advanced use cases with their data, they are investing in newer products that produce trusted data sets for reporting, ML modeling and operational workflows. Hence, we see the rapid rise of data integration products. dbt, a data transformation tool, and Fivetran, which automates data pipelines, are our two fastest-growing data and AI products. This suggests a new era of the data integration market with challenger tools making headway as companies shift to prioritize DS/ML initiatives. With Great Expectations from Superconductive in the ninth spot, a full 50% of our fastest-growing products represent the data integration category.

-----

[Chart: Growth of data and AI markets. Series: Business Intelligence, Data Governance & Security, Data Science & Machine Learning, Data Integration. Y-axis: number of customers. X-axis: monthly, February 2022 to January 2023.]

Note: In this chart, we count the number of customers deploying one or more data and AI products in each category.
These four categories do not encompass all products. Databricks products such as Unity Catalog are not included in this data.

-----

``` Data and AI markets: business intelligence is standard, organizations invest in their machine learning foundation ```

To understand how organizations are prioritizing their data initiatives, we aggregated all data and AI products on the Databricks Lakehouse and categorized them into four core markets: BI, data governance and security, DS/ML, and data integration. Our data set confirms that BI tools are more widely adopted across organizations relative to more nascent categories, and they continue to grow, with a 66% YoY increase in adoption. This aligns with the broader trend of more organizations performing data warehousing on a Lakehouse, covered in the next section, Views from the Lakehouse. While BI is often where organizations start their data journey, companies are increasingly looking at more advanced data and AI use cases.

``` DEMAND FOR DATA INTEGRATION PRODUCTS IS GROWING FAST ```

We see the fastest growth in the data integration market. These tools enable a company to integrate vast amounts of upstream and downstream data in one consolidated view. Data integration products ensure that all BI and DS/ML initiatives are built on a solid foundation. While it’s easier for smaller markets to experience faster growth, at 117% YoY increased adoption, the data integration market is growing substantially faster than BI. This trend dovetails with the rapid growth of ML adoption we see across the Lakehouse, covered in the DS/ML section of the report.

``` Data integration is the fastest-growing market, with 117% YoY growth ```

-----

``` Views from the Lakehouse MIGRATION AND DATA FORMAT TRENDS ```

Data migration is a major undertaking: it can be risky and expensive, and it can delay companies’ timelines. It’s not a task to jump into lightly.
As organizations run into the limitations, scalability challenges and the cost burden of legacy data platforms, they are increasingly likely to migrate to a new type of architecture.

-----

``` Migration trends: the best data warehouse is a Lakehouse ```

The Lakehouse Platform is an attractive alternative to traditional data warehouses because it supports advanced use cases and DS/ML, allowing organizations to boost their overall data strategy. As evidenced by the most popular data and AI products, with BI and data integration tools at the top, organizations are increasingly using the data lakehouse for data warehousing.

To better understand which legacy platforms organizations are moving away from, we look at the migrations of new customers to Databricks. An interesting takeaway is that roughly half of the companies moving to the Lakehouse are coming from data warehouses. This includes the 22% that are moving from cloud data warehouses. It also demonstrates a growing focus on running data warehousing workloads on a Lakehouse and unifying data platforms to reduce cost.

[Chart: Source of new customer migrations to Databricks, with segments of 12%, 39%, 27% and 22%.]

-----

``` Rising tides: the volume of data in Delta Lake has grown 304% YoY ```

As the [volume of data explodes](https://www.researchgate.net/profile/Adanma-Eberendu/publication/309393428_Unstructured_Data_an_overview_of_the_data_of_Big_Data/links/5bc89b5c458515f7d9c65beb/Unstructured-Data-an-overview-of-the-data-of-Big-Data.pdf), an increasingly large proportion is in the form of semi-structured and unstructured data. Previously, organizations had to manage multiple different platforms for their structured, unstructured and semi-structured data, which caused unnecessary complexity and high costs. The Lakehouse solves this problem by providing a unified platform for all data types and formats.

Delta Lake is the foundation of the Databricks Lakehouse.
The Delta Lake format encompasses structured, unstructured and semi-structured data. Use has surged over the past 2 years. When compared to the steady, flat or declining growth in other storage formats (e.g., text, JSON and CSV), our data shows that a growing number of organizations are turning to Delta Lake to manage their data. In June 2022, Delta Lake surpassed Parquet as the most popular data lake source, reaching 304% YoY growth.

[Chart: Volume of data managed, by storage format. Series: Delta, Text, CSV, Avro, Parquet, ORC, JSON. Y-axis: volume of data. X-axis: yearly ticks each January, starting January 2019.]

-----

``` [...], with emphasis on serverless ```

Over the past 2 years, companies have vastly increased their usage of data warehousing on the Lakehouse Platform. This is especially demonstrated by use of Databricks SQL, the serverless data warehouse on the Lakehouse, which shows 144% YoY growth. This suggests that organizations are increasingly ditching traditional data warehouses and are able to perform all their BI and analytics on a Lakehouse.
[Chart: Data warehousing on Lakehouse with Databricks SQL. Y-axis: number of customers. X-axis: January 2021 to January 2023.]

Note: There is a spike in October 2021 as a result of the ungated preview launch of Databricks SQL, followed by General Availability in December 2021. Data consistently dips in the last week of December due to seasonality.

-----

CONCLUSION

``` Generation AI ```

We’re excited that companies are progressing into more advanced ML and AI use cases, and the modern data and AI stack is evolving to keep up. Along with the rapid growth of data integration tools (including our fastest growing, dbt), we’re seeing the rapid rise of NLP and LLM usage in our own data set, and there’s no doubt that the next few years will see an explosion in these technologies. It’s never been more clear: the companies that harness the power of DS/ML will lead the next generation of data.

-----

``` About Databricks ```

Databricks is the data and AI company. More than 9,000 organizations worldwide, including Comcast, Condé Nast and over 50% of the Fortune 500, rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems.
To learn more, follow Databricks on Twitter, LinkedIn and Instagram.

[DISCOVER LAKEHOUSE](https://www.databricks.com/product/data-lakehouse)

© Databricks 2023. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.

-----

**eBook**

# Making Your Digital Twin Come to Life

##### With the Lakehouse for Manufacturing and Tredence

-----

### Contents

**•** Introduction
**•** Digital Twins Bring Broad Benefits to Manufacturing
**•** What Are Digital Twins?
**•** Digital Twin Architectures
**•** How to Build a Digital Twin
**•** Why Is Manufacturing Struggling With Data and AI?
**•** Why Databricks for Digital Twins?
**•** Why Tredence for Digital Twins?
**•** Using Digital Twins to Drive Insights

-----

### Introduction

The concept of digital twins is not new. In fact, it is [reported](https://en.wikipedia.org/wiki/Digital_twin#:~:text=One%20of%20the%20earliest%20examples,Heathrow%20Airport's%20Terminal%201) that the first application was over 25 years ago, during the early phases of foundation and cofferdam construction for the London Heathrow Express facilities, to monitor and predict foundation borehole grouting. In the years since this first application, edge computing, AI, data connectivity, 5G connectivity and improvements in the Internet of Things (IoT) have made digital twins cost-effective, and they are now imperative for today’s data-driven businesses.

Today’s manufacturing industries are expected to streamline and optimize all the processes in their value chain, from product development and design, through operations and supply chain optimization, to obtaining feedback to reflect and respond to rapidly growing customer demands. The digital twins category is broad and is addressing a multitude of challenges within manufacturing, logistics and transportation.
[In a case study published in MIT Technology Review](https://wp.technologyreview.com/wp-content/uploads/2022/01/Digital-twins-improve-real-life-manufacturing_010522.pdf), “profit margins increased and manufacturing time was reduced when digital-twin technology was implemented. Automobile manufacturing profit margins increased by 41% to 54% per model. The estimated average automobile manufacturing time was reduced to approximately 10 hours.”

**•** **[Digital twins accelerate potential revenue increase up to 10%](https://www.mckinsey.com/business-functions/operations/our-insights/digital-twins-the-art-of-the-possible-in-product-development-and-beyond)**
**•** **[Time to market accelerated by 50%](https://www.mckinsey.com/business-functions/operations/our-insights/digital-twins-the-art-of-the-possible-in-product-development-and-beyond)**
**•** **[Product quality improvement up to 25%](https://www.mckinsey.com/business-functions/operations/our-insights/digital-twins-the-art-of-the-possible-in-product-development-and-beyond)**

-----

**Digital twin market growth rate accelerates**

Digital twins are now so ingrained in manufacturing that the [global industry market](https://www.marketsandmarkets.com/Market-Reports/digital-twin-market-225269522.html) is forecasted to reach $48 billion in 2026. This figure is up from $3.1 billion in 2020 at a CAGR of 58%, riding on the wave of Industry 4.0.

**But challenges remain**

The most common challenges faced by the manufacturing industry that digital twins are addressing include:

**•** Product designs are more complex, resulting in higher cost and increasingly longer development times
**•** The supply chain is opaque
**•** Production lines are not optimized: performance varies, defects go unknown and the projection of operating cost is obscure
**•** Poor quality management: overreliance on theory, managed by individual departments
**•** Reactive maintenance costs are too high, resulting in excessive downtime or process disruptions
**•** Incongruous collaborations between departments
**•** Invisibility of customer demand for gathering real-time feedback

The growth rate for digital twins is staggering, with adoption commonly reported to be growing at a 25-40% CAGR.
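As a quick sanity check on the market figures quoted above ($3.1 billion in 2020, $48 billion forecast for 2026, a CAGR of 58%), the compound annual growth rate can be recomputed from the two endpoints. This is a minimal sketch in Python; the helper names `cagr` and `project` are hypothetical and chosen only for illustration, and the six-year span assumes the forecast runs from 2020 to 2026 inclusive of six compounding periods.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.58 means 58%)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Compound start_value forward at a fixed annual rate for the given years."""
    return start_value * (1 + rate) ** years


# Figures quoted in the text, in billions of US dollars.
market_2020 = 3.1
market_2026 = 48.0

# Assumption: six compounding periods between 2020 and 2026.
implied = cagr(market_2020, market_2026, 2026 - 2020)
print(f"Implied CAGR 2020-2026: {implied:.1%}")

# Forward check: compounding the 2020 figure at the quoted 58% rate.
print(f"2026 value at 58% CAGR: ${project(market_2020, 0.58, 6):.1f}B")
```

Under these assumptions the implied rate comes out at roughly 58%, which agrees with the quoted figure. The same helper makes it easy to see how much lower a 25-40% CAGR is when compounded over the same six years.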
-----

### Digital Twins Bring Broad Benefits to Manufacturing

Industry 4.0 and subsequent intelligent supply chain efforts have made significant strides in improving operations and building agile supply chains, efforts that would have come at significant costs without digital twin technology.

**Let’s look at the benefits that digital twins deliver to the manufacturing sector:**

**•** Product design and development costs less and takes less time, because iterative simulations under multiple constraints deliver the best or most optimized design. All commercial aircraft are designed using digital twins.
**•** Digital twins show how long inventory will last, when to replenish and how to minimize supply chain disruptions. The oil and gas industry, for example, uses supply chain–oriented digital twins to reduce supply chain bottlenecks in storage and midstream delivery, schedule tanker off-loads and model demand with externalities.
**•** Continuous quality checks on produced items, with ML/AI-generated feedback, preemptively assure improved product quality. Final paint inspection in the automotive industry, for example, is performed with computer vision built on top of digital twin technology.
**•** Digital twins give manufacturers real-time feedback to strike the sweet spot between replacing a part before the process degrades or breaks down and using components to their fullest. Digital twins are the backbone of building an asset performance management suite.
**•** Digital twins create the opportunity to have multiple departments in sync by providing necessary instructions modularly to attain a required throughput. Digital twins are the backbone of kaizen events that optimize manufacturing process flow.
**•** Customer feedback loops can be modeled from inputs such as point-of-sale customer behavior, buying preferences or product performance, and then integrated into the product development process, forming a closed loop that improves product design.

-----

The top four use cases are heavily focused on operational processes and are typically the first to be deployed in manufacturing by a majority of companies. Those that have a lower adoption rate are more complex in deployment, but typically offer higher and longer-lasting value.

**[Digital Twin Use Case Deployment](https://blogs.3ds.com/exalead/2019/07/03/digital-twin-use-cases-in-manufacturing-part-5-12/)**

|Use case|Deployment|
|---|---|
|Improve product quality|34%|
|Reduce manufacturing costs|30%|
|Reduce unplanned downtime|28%|
|Increase throughput|25%|
|Ensure safe manufacturing|24%|
|Test new design ideas|16%|
|Develop product enhancements|14%|
|Digital transformation of enterprise|13%|
|Speed new product introduction|13%|
|Reduce planned downtime|11%|
|Meet new regulatory challenges|10%|
|Training for new manufacturing processes|8%|
|Design changes to production line|8%|
|Provide service to end-user customers|5%|
|Update products in the field|1%|

Can you imagine the cost to change an oil refinery’s crude distillation unit process conditions to improve the output of diesel one week and gasoline the next to address changes in demand and ensure maximum economic value? Can you imagine how to replicate even a simple supply chain to model risk?

-----

### What Are Digital Twins?

Knowing the business challenges and benefits digital twins deliver, let’s turn to the basics and explore what digital twins are and how a modern data stack is necessary to build effective and timely digital twins.
The classic definition of a digital twin is: “[A virtual model designed to accurately reflect a physical object](https://www.ibm.com/topics/what-is-a-digital-twin).” For a discrete or continuous manufacturing process, a digital twin gathers system and process state data with the help of various IoT sensors [operational technology data (OT)] and enterprise data [informational technology (IT)] to form a virtual model, which is then used to run simulations, study performance issues and generate possible insights.

**Types of Digital Twins**

-----

### Digital Twin Architectures

Classic digital twins have been physics-based models of specific systems. More recently, **data-driven digital twins, which work on real-time system data, are gaining prominence**. These twins provide the opportunity not just to monitor and simulate system performance under specific conditions, but also to embed AI-based predictive and prescriptive solutions into the industrial environment.

Digital twins undergo a series of changes during their lifecycle to become completely autonomous.
**Data-Driven Operational Digital Twins: Maturity Journey**

[Diagram: maturity journey of data-driven operational digital twins. Labels: AI, IoT, Edge/Cloud, Digital Twins, ERP.]

**•** **Monitor & Alert:** Real-time operations monitoring and alerting
**•** **Predict & Diagnose:** Predictive maintenance, process improvements and root causing
**•** **Simulate & Optimize:** Identify next best action and integrate with actuation systems

**[Digital twins have reduced automotive product design lifecycle from 6-8 years to 18-24 months](https://www.technologyreview.com/2022/01/05/1042981/digital-twins-improve-real-life-manufacturing/)**

**[Digital warehouse design lets companies test and learn using a digital twin, which can improve efficiency by 20% to 25%](https://www.mckinsey.com/business-functions/operations/our-insights/improving-warehouse-operations-digitally)**

-----

### How to Build a Digital Twin

A data architecture capability is needed to capture and collect the ever-expanding volume and variety of data streaming in real time from example protocols, such as ABB Total Flow, Allen Bradley, Emerson, Fanuc, GE, Hitachi and Mitsubishi.

Data collection, data analytics, application enablement and data integration orchestrate the time-series data stream and transfer to the cloud. Azure IoT Hub is used to securely ingest data from edge to cloud.

Cloud infrastructure and analytics capabilities are offered within the flexibility of the cloud. Azure Digital Twins is used to model and visualize process workflows. Databricks MLflow and Delta Lake scale to deliver real-time predictive analytics.

-----

**Digital Twins: Technical Architecture**

-----

**Building a digital twin doesn’t have to be a daunting task. Below are some simple steps:**

**System and use case discovery and blueprinting**

**•** Identify priority plant processes and systems to model, with focused use cases (e.g., asset maintenance, energy management, process monitoring/optimization, etc.)
**•** Develop a validated process outline, blueprint and key performance indicators
**•** Develop a set of process variables, control variables and manipulated variables
**•** Design control loop
**•** Validate and document process and asset FMEA for all assets and sub-systems

**Technology infrastructure requirements**

**•** Technical edge infrastructure onsite: to sense, collect and transmit real-time information
**•** Clean, reliable data availability in the cloud
**•** Data processing and analytics platform: to design, develop and implement solutions
**•** Stream processing and deployment of models for predictions and soft sensing

**Visualization delivered**

**•** Information communication: visual representation of digital twin along with remote controlling functions (e.g., Power BI dashboards, time series insights, web app-based digital twin portals)
**•** Closed-loop feedback: to send the insights and actions back to form a closed loop. Azure Event Grid and Event Hub, with connection from IoT Hub to Azure IoT Edge devices and control systems, are used
**•** Edge platform to orchestrate the data, insights and actions between the cloud and site IT systems
**•** Cloud to edge integration: to enable seamless monitoring, alerting and integration with plant OT/IT systems

-----

### Why Is Manufacturing Struggling With Data and AI?
| Challenge | Root Cause | Goal |
|---|---|---|
| **Siloed data across the value chain** | Siloed data from systems designed for on-premises 30 years ago | Aggregate high volumes and velocities of structured and unstructured data to power predictive analytics (e.g., images, IoT, ERP/SCM) |
| **Unable to scale enterprise data sets** | Legacy architectures such as data historians that can’t handle semi-structured or unstructured data | Data architectures that scale for TBs/PBs of enterprise IT and OT data |
| **Lack real-time insights** | Batch-oriented data transfer | Address manufacturing issues or track granular supply chain issues in the real world |
| **Can’t meet intellectual property requirements** | Systems that do not establish data lineage | Data lineage established across organizational silos and disjointed workflows |

### Data architecture is the root cause of this struggle. ----- ### Why Databricks for Digital Twins? Lakehouse for Manufacturing’s simple, open and collaborative data platform consolidates and enhances data from across the organization and turns it into accessible, actionable insights. Scalable machine learning powers digital twins with predictive insights across the value chain from product development to optimizing operations to building agile supply chains to robust customer insights. “Databricks open Lakehouse Platform has shown time and again that it is the foundational enabling technology to power digital twins for manufacturing. But the real power is the Databricks partnership with Tredence that speeds implementation for tailored use cases that deliver superior ROI in less time.” **Dr. 
Bala Amavasai**, Manufacturing CTO, Databricks **Supports Real-Time Decisions** Lakehouse for Manufacturing leverages any enterprise data source, from business-critical ERP data to edge sensor data, in one integrated platform, making it easy to automate and secure data with fast, real-time performance. **Faster and More Accurate Analysis** The true benefits of digital twins are not the business intelligence dashboards, but machine learning insights generated from incorporating real-time data. Scalable and shareable notebook-based machine learning accelerates ROI. **Open Data Sharing and Collaboration** Drive stronger customer insights and greater service with partners by leveraging open and secure data collaboration between departments or across your supply chain, delivering faster ROI. ----- ### Why Tredence for Digital Twins? Over the last few years, Tredence’s unique Manufacturing and Supply Chain practice has coupled functional expertise with cutting-edge AI-driven solutions to create measurable business impact for their customers. Now, Tredence’s partnership with Databricks is all set to unlock the power of real-time analytics and actions, to further strengthen their “last mile impact” vision. Tredence is excited to co-innovate with Databricks to deliver the solutions required for enterprises to create digital twins from the ground up and implement them swiftly to maximize their ROI. Our partnership enables clients to get the most out of Tredence’s data science capabilities to build decision intelligence around manufacturing processes and Databricks’ Lakehouse Platform to realize the full promise of digital twins.” **Naresh Agarwal**, Head of Industrials, Tredence **Global Reach** Tredence offers a global team with the subject matter expertise that delivers practitioner- and user-oriented solutions to identify and solve for challenges in digital transformation design and implementation. 
**Purpose-Built Solutions** Adopt contextual edge-to-cloud, purpose-built AIoT solutions that unify your ecosystems with connected insights and enhance productivity, while enabling efficient cost structures. **Focused Dedication** A dedicated centre of excellence (CoE) for AIoT and smart manufacturing solutions — serving the entire manufacturing value chain from product development to manufacturing and downstream operations. ----- ### Using Digital Twins to Drive Insights **Use Case: Predictive Maintenance** - Rolls-Royce sought to use real-time engine data to reduce unplanned maintenance and downtime - Legacy systems were unable to scale data ingestion of engine sensor data in real time for ML **Why Databricks?** - The Lakehouse Platform on Azure unifies in-flight data streams with external environmental conditions data to predict engine performance issues - Delta Lake underpins ETL pipelines that feed ML workloads across use cases - MLflow speeds deployment of new models and reduces incidents of grounded planes **Impact** Rolls-Royce uses Databricks to drive insights around predictive maintenance, improving airframe reliability and reducing carbon emissions. #### 22 million tons of carbon emissions saved #### 5% reduction in unplanned airplane groundings #### Millions of pounds in inventory cost savings from a 50% improvement in maintenance efficiency ----- ### About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, Acosta and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. 
To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc) . ###### Get started with a free trial of Databricks and start building data applications today **[START YOUR FREE TRIAL](https://databricks.com/try-databricks?itm_data=NavBar-TryDatabricks-Trial)** To learn more, visit us at: **[databricks.com/manufacturing](https://databricks.com/solutions/industries/manufacturing-industry-solutions)** -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/making-your-digital-twin-come-to-life.pdf2024-09-19T16:57:22Z### EBOOK # A Compact Guide to Large Language Models ----- SECTION 1 ## Introduction ##### Definition of large language models (LLMs) Large language models are AI systems that are designed to process and analyze vast amounts of natural language data and then use that information to generate responses to user prompts. These systems are trained on massive data sets using advanced machine learning algorithms to learn the patterns and structures of human language, and are capable of generating natural language responses to a wide range of written inputs. Large language models are becoming increasingly important in a variety of applications such as natural language processing, machine translation, code and text generation, and more. While this guide will focus on language models, it’s important to understand that they are only one aspect under a larger generative AI umbrella. Other noteworthy generative AI implementations include projects such as art generation from text, audio and video generation, and certainly more to come in the near future. ----- ##### Extremely brief historical background and development of LLMs ###### 1950s–1990s Initial attempts are made to map hard rules around languages and follow logical steps to accomplish tasks like translating a sentence from one language to another. 
While this works sometimes, strictly defined rules only work for concrete, well-defined tasks that the system has knowledge about. ###### 1990s Language models begin evolving into statistical models and language patterns start being analyzed, but larger-scale projects are limited by computing power. ###### 2000s Advancements in machine learning increase the complexity of language models, and the wide adoption of the internet sees an enormous increase in available training data. ###### 2012 Advancements in deep learning architectures and larger data sets lay the groundwork for later models such as GPT (Generative Pre-trained Transformer), which OpenAI first releases in 2018. ###### 2018 Google introduces BERT (Bidirectional Encoder Representations from Transformers), which is a big leap in architecture and paves the way for future large language models. ###### 2020 OpenAI releases GPT-3, which becomes the largest model at 175B parameters and sets a new performance benchmark for language-related tasks. ###### 2022 ChatGPT is launched, which turns GPT-3 and similar models into a service that is widely accessible to users through a web interface and kicks off a huge increase in public awareness of LLMs and generative AI. ###### 2023 Open source LLMs begin showing increasingly impressive results with releases such as Dolly 2.0, LLaMA, Alpaca and Vicuna. GPT-4 is also released, setting a new benchmark for both parameter size and performance. ----- SECTION 2 ## Understanding Large Language Models ##### What are language models and how do they work? Large language models are advanced artificial intelligence systems that take some input and generate humanlike text as a response. They work by first analyzing vast amounts of data and creating an internal structure that models the natural language data sets that they’re trained on. Once this internal structure has been developed, the models can then take input in the form of natural language and approximate a good response. 
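To make the "learn a structure from text, then use it to respond" idea concrete, here is a deliberately tiny sketch. It is not how an LLM is built (real models use neural networks with billions of parameters); it is a toy bigram counter, and the corpus, the function names `train_bigram_model` and `generate`, and the seed are all assumptions made up for this illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each word, which words follow it in the training text."""
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            model[current_word][next_word] += 1
    return model


def generate(model, prompt, max_words=5, seed=0):
    """Extend the prompt one word at a time, sampling from observed followers."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        followers = model.get(words[-1])
        if not followers:
            break  # the model never saw anything after this word
        candidates, counts = zip(*followers.items())
        words.append(rng.choices(candidates, weights=counts, k=1)[0])
    return " ".join(words)


# Hypothetical three-sentence "training set", invented for this sketch.
corpus = [
    "delta lake stores data in parquet files",
    "delta lake brings reliability to data lakes",
    "spark reads data from delta lake",
]
model = train_bigram_model(corpus)
print(generate(model, "delta lake"))
```

The "internal structure" here is just a table of word-pair counts, and the "response" is sampled from it. Scaling the same two phases (learn from text, then continue a prompt) to far richer models and far more data is what the paragraph above describes.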
##### If they’ve been around for so many years, why are they just now making headlines? A few recent advancements have really brought the spotlight to generative AI and large language models: **ADVANCEMENTS IN TECHNIQUES** Over the past few years, there have been significant advancements in the techniques used to train these models, resulting in big leaps in performance. Notably, one of the largest jumps in performance has come from integrating human feedback directly into the training process. **INCREASED ACCESSIBILITY** The release of ChatGPT opened the door for anyone with internet access to interact with one of the most advanced LLMs through a simple web interface. This brought the impressive advancements of LLMs into the spotlight, since previously these more powerful LLMs were only available to researchers with large amounts of resources and those with very deep technical knowledge. **GROWING COMPUTATIONAL POWER** The availability of more powerful computing resources, such as graphics processing units (GPUs), and better data processing techniques allowed researchers to train much larger models, improving the performance of these language models. **IMPROVED TRAINING DATA** As we get better at collecting and analyzing large amounts of data, the model performance has improved dramatically. In fact, Databricks showed that you can get amazing results training a relatively small model with a high-quality data set with [Dolly 2.0](https://huggingface.co/databricks/dolly-v2-12b) (and we released the data set as well with the databricks-dolly-15k [data set](http://databricks/databricks-dolly-15k)). ----- ##### So what are organizations using large language models for? 
Here are just a few examples of common use cases for large language models: **CHATBOTS AND VIRTUAL ASSISTANTS** In one of the most common implementations, LLMs can be used by organizations to provide help with things like customer support, troubleshooting, or even having open-ended conversations with user-provided prompts. **CODE GENERATION AND DEBUGGING** LLMs can be trained on large amounts of code examples and give useful code snippets as a response to a request written in natural language. With the proper techniques, LLMs can also be built in a way to reference other relevant data that they may not have been trained on, such as a company’s documentation, to help provide more accurate responses. **SENTIMENT ANALYSIS** Sentiment is often hard to quantify, but LLMs can help take a piece of text and gauge emotion and opinions. This can help organizations gather the data and feedback needed to improve customer satisfaction. **LANGUAGE TRANSLATION** Globalize all your content without hours of painstaking work by simply feeding your web pages through the proper LLMs and translating them to different languages. As more LLMs are trained in other languages, quality and availability will continue to improve. **SUMMARIZATION AND PARAPHRASING** Entire customer calls or meetings could be efficiently summarized so that others can more easily digest the content. LLMs can take large amounts of text and boil it down to just the most important bytes. **CONTENT GENERATION** Start with a detailed prompt and have an LLM develop an outline for you. Then continue on with those prompts and LLMs can generate a good first draft for you to build off. Use them to brainstorm ideas, and ask the LLM questions to help you draw inspiration from. **_Note:_** Most LLMs are _not_ trained to be fact machines. They know how to use language, but they might not know who won the big sporting event last year. 
It’s always important to fact check and understand the responses before using them as a reference. **TEXT CLASSIFICATION AND CLUSTERING** The ability to categorize and sort large volumes of data enables the identification of common themes and trends, supporting informed decision-making and more targeted strategies. ----- SECTION 3 ## Applying Large Language Models There are a few paths that one can take when looking to apply large language models for their given use case. Generally speaking, you can break them down into two categories, but there’s some crossover between them. We’ll briefly cover the pros and cons of each and what scenarios fit best for each. ##### Proprietary services As the first widely available LLM-powered service, OpenAI’s ChatGPT was the explosive charge that brought LLMs into the mainstream. ChatGPT provides a nice user interface (or API) where users can feed prompts to one of many models (GPT-3.5, GPT-4, and more) and typically get a fast response. These are among the highest-performing models, trained on enormous data sets, and are capable of extremely complex tasks both from a technical standpoint, such as code generation, as well as from a creative perspective like writing poetry in a specific style. The downside of these services is the absolutely enormous amount of compute required not only to train them (OpenAI has said GPT-4 cost them over $100 million to develop) but also to serve the responses. For this reason, these extremely large models will likely always be under the control of organizations, and require you to send your data to their servers in order to interact with their language models. This raises privacy and security concerns, and also subjects users to “black box” models, whose training and guardrails they have no control over. Also, due to the compute required, these services are not free beyond a very limited use, so cost becomes a factor in applying these at scale. 
In summary: Proprietary services are great to use if you have very complex tasks, are okay with sharing your data with a third party, and are prepared to incur costs if operating at any significant scale. ##### Open source models The other avenue for language models is to go to the open source community, where there has been similarly explosive growth over the past few years. Communities like [Hugging Face](https://huggingface.co/) gather hundreds of thousands of models from contributors that can help solve tons of specific use cases such as text generation, summarization and classification. The open source community has been quickly catching up to the performance of the proprietary models, but ultimately still hasn’t matched the performance of something like GPT-4. ----- It does currently take a little bit more work to grab an open source model and start using it, but progress is moving very quickly to make them more accessible to users. On Databricks, for example, we’ve made [improvements to open source frameworks](https://www.databricks.com/blog/2023/04/18/introducing-mlflow-23-enhanced-native-llm-support-and-new-features.html) like MLflow to make it very easy for someone with a bit of Python experience to pull any Hugging Face transformer model and use it as a Python object. Oftentimes, you can find an open source model that solves your specific problem that is **orders of magnitude** smaller than ChatGPT, allowing you to bring the model into your environment and host it yourself. This means that you can keep the data in your control for privacy and governance concerns as well as manage your costs. ##### Conclusion and general guidelines Ultimately, every organization is going to have unique challenges to overcome, and there isn’t a one-size-fits-all approach when it comes to LLMs. 
As the world becomes more data driven, everything, including LLMs, will be reliant on having a strong foundation of data. LLMs are incredible tools, but they have to be used and implemented on top of this strong data foundation. Databricks brings both that strong data foundation and the integrated tools to let you use and fine-tune LLMs in your domain. Another huge upside to using open source models is the ability to fine-tune them to your own data. Since you’re not dealing with a black box of a proprietary service, there are techniques that let you take open source models and train them to your specific data, greatly improving their performance on your specific domain. We believe the future of language models is going to move in this direction, as more and more organizations will want full control and understanding of their LLMs. ----- SECTION 4 ## So What Do I Do Next If I Want to Start Using LLMs? That depends where you are on your journey! Fortunately, we have a few paths for you. If you want to go a little deeper into LLMs but aren’t quite ready to do it yourself, you can watch one of Databricks’ most talented developers and speakers go over these concepts in more detail during the on-demand talk “[How to Build Your Own Large Language Model Like Dolly](https://www.databricks.com/resources/webinar/build-your-own-large-language-model-dolly).” If you’re ready to dive a little deeper and expand your education and understanding of LLM foundations, we’d recommend checking out our [course on LLMs](https://www.edx.org/course/large-language-models-application-through-production). You’ll learn how to develop production-ready LLM applications and dive into the theory behind foundation models. 
If your hands are already shaking with excitement and you already have some working knowledge of Python and Databricks, we’ll provide some great examples with sample code that can get you up and running with LLMs right away! - Getting started with NLP using Hugging Face transformers pipelines - Fine-Tuning Large Language Models with Hugging Face and DeepSpeed - Introducing AI Functions: Integrating Large Language Models with Databricks SQL ----- ## About Databricks Databricks is the data and AI company. More than 9,000 organizations worldwide — including Comcast, Condé Nast and over 50% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/). **[START YOUR FREE TRIAL](https://databricks.com/try-databricks)** #### Contact us for a personalized demo: databricks.com/contact -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/compact-guide-to-large-language-models.pdf2024-09-19T16:57:20Z# Building Reliable Data Lakes at Scale With Delta Lake ----- ## Contents - Data Engineering Drivers - Data Pipeline Key Goals - Apache Spark™: The First Unified Analytics Engine - Data Reliability Challenges With Data Lakes - Delta Lake: A New Storage Layer - Delta Lake: Key Features - Getting Started With Delta Lake ----- ## Drivers #### Data Engineering Drivers Data engineering professionals need to respond to several different drivers. 
Chief among the drivers they face are: **Rise of Advanced Analytics** — Advanced analytics, including methods based on machine learning techniques, have evolved to such a degree that organizations seek to derive far more value from their corporate assets. **Widespread Adoption** — Once the province of leading edge, high-tech companies, these advanced approaches are being adopted across a multitude of industries from retail to hospitality to healthcare and across private as well as public sector organizations. This is further driving the need for strong data engineering practices. **Regulation** — With the growth of data generation and data collection, there is increased interest in how the data is protected and managed. Regulatory regimes such as GDPR (General Data Protection Regulation) from the EU and other jurisdictions mandate very specific ways in which data must be managed. ----- **Technology Innovation** — The move to cloud-based analytics architectures that is now well underway is being propelled further by innovations such as analytics-focused chipsets, pipeline automation and the unification of data and machine learning. All these offer data professionals new approaches for their data initiatives. **Financial Scrutiny** — With a growth in investment, analytics initiatives are also subject to increasing scrutiny. There is also a greater understanding of data as a valuable asset. Deriving value from data must be done in a manner that is financially responsible, adds real value to the enterprise and meets ROI hurdles. **Role Evolution** — Reflecting the importance of managing the data and maximizing value extraction, the Chief Data Officer (CDO) role is becoming more prominent and newer roles such as Data Curator are emerging. They must balance the needs of governance, security and democratization. 
----- ## Key Goals #### Data Pipeline Key Goals Making quality data available in a reliable manner is a major determinant of success for data analytics initiatives, be they regular dashboards or reports, or advanced analytics projects drawing on state-of-the-art machine learning techniques. Data engineers tasked with this responsibility need to take account of a broad set of dependencies and requirements as they design and build their data pipelines. Three primary goals that data engineers typically seek to address as they work to enable the analytics professionals in their organizations are: **Deliver quality data in less time** — When it comes to data, quality and timeliness are key. Data with gaps or errors (which can arise for many reasons) is “unreliable,” can lead to wrong conclusions, and is of limited value to downstream users. Equally, many applications require up-to-date information (who wants to use last night’s closing stock price or weather forecast?) and are of limited value without it. **Enable faster queries** — Wanting fast responses to queries is natural enough in today’s “New York minute,” online world. Achieving this is particularly demanding when the queries are based on very large data sets. **Simplify data engineering at scale** — It is one thing to have high reliability and performance in a limited, development or test environment. What matters more is the ability to have robust, production data pipelines at scale without requiring high operational overhead. ----- ## Apache Spark™ #### Apache Spark™: The First Unified Analytics Engine Originally developed at UC Berkeley in 2009, Apache Spark can be considered the first unified analytics engine. Uniquely bringing data and AI technologies together, Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning and graph processing. 
These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows. [Figure: data sources (customer data, emails/web pages, click streams, video/speech, sensor data (IoT)); Big Data Processing (ETL + SQL + Streaming); Machine Learning (MLlib + SparkR)] Since its release, Apache Spark has seen rapid adoption by enterprises across a wide range of industries. Internet powerhouses such as Netflix, Yahoo and eBay have deployed Spark at massive scale, collectively processing multiple petabytes of data on clusters of over 8,000 nodes, making it the de facto choice for new analytics initiatives. It has quickly become the largest open source community in big data, with over 1,000 contributors from 250+ organizations. ##### While Spark has had a significant impact in taking data analytics to the next level, practitioners continue to face data reliability and performance challenges with their data lakes. ----- ## Data Reliability Challenges With Data Lakes **Failed Writes** — If a production job that is writing data experiences failures, which are inevitable in large distributed environments, it can result in data corruption through partial or multiple writes. What is needed is a mechanism that is able to ensure that either a write takes place completely or not at all (and not multiple times, adding spurious data). Failed jobs can impose a considerable burden to recover to a clean state. **Schema Mismatch** — When ingesting content from multiple sources, typical of large, modern big data environments, it can be difficult to ensure that the same data is encoded in the same way, i.e., that the schema matches. A similar challenge arises when the formats for data elements are changed without informing the data engineering team. Both can result in low quality, inconsistent data that requires cleaning up to improve its usability. The ability to observe and enforce schema would serve to mitigate this. 
**Lack of Consistency** — In a complex big data environment, one may be interested in considering a mix of both batch and streaming data. Trying to read data while it is being appended to poses a challenge since on the one hand there is a desire to keep ingesting new data while on the other hand anyone reading the data prefers a consistent view. This is especially an issue when there are multiple readers and writers at work. It is undesirable and impractical, of course, to stop read access while writes complete or stop write access while reads are in progress. ----- ## Delta Lake: A New Storage Layer [Delta Lake](https://delta.io/) is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Raw data is ingested from various batch and streaming input sources. Simple, reliable data pipelines help create a curated data lake containing tables of differing degrees of refinement based on business needs. The data in these tables is then made available via the standard Spark APIs or special connectors for various use cases such as machine learning, SQL analytics or feeding to a data warehouse. [Figure: streaming and batch sources; Ingestion Tables (Bronze); Refined Tables (Silver); Feature/Agg Data Store (Gold); Analytics and Machine Learning; Your Existing Data Lake] ----- ## Delta Lake: Key Features **ACID Transactions —** Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. 
It provides serializability, the strongest isolation level. **Scalable Metadata Handling —** In big data, even the metadata itself can be “big data.” Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease. **Time Travel (data versioning) —** Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. For further details, please see this [documentation](https://docs.delta.io/latest/delta-batch.html#-deltatimetravel). **Schema Enforcement —** Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. ----- **Open Format —** All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. **Unified Batch and Streaming Source and Sink** — A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box. **Schema Evolution —** Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL. **100% Compatible With Apache Spark API —** Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine. ----- ## Getting Started With Delta Lake **Getting started with Delta Lake is easy. 
Specifically, to create a Delta table, simply specify Delta instead of using Parquet.** #### Instead of parquet ...
```
# `dataframe` is an existing Spark DataFrame; write it out as Parquet files
dataframe.write.format("parquet").save("/data")
```
#### … simply say delta
```
# Same write, but stored as a Delta table
dataframe.write.format("delta").save("/data")
```
##### Learn more about Delta Lake: [Delta Lake Blogs](https://delta.io/blog) Delta Lake Tutorials [Delta Lake Integrations](https://delta.io/integrations/) **For more information, please refer to the [documentation](https://docs.delta.io/latest/index.html).** -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/building-reliable-data-lakes-at-scale-with-delta-lake.pdf2024-09-19T16:57:20Z#### eBook # The CDP Build vs Buy Guide: ### How to Compose Your CDP with the Databricks Lakehouse and ActionIQ ----- ## The Need for a Customer Data Platform Organizations need to deliver personalized experiences to their customers to stay ahead of the curve — that means they need a customer data platform (CDP). Through a CDP, data from every touch point, along with third-party information, is brought together to provide a unified view of the customer. This enables your marketing team to analyze, identify and activate customers with targeted content. The key question for all IT teams at these organizations is whether to build or to buy. A CDP that sounds like music to the ears of business leaders may be perceived as noise by enterprise IT leaders. The business side of the house needs immediate enablement, and an out-of-the-box system dedicated to the specialized needs of marketers seems like the fastest path to a solution. But for IT, the CDP is yet another system, bringing stack baggage and redundancies to existing marketing and analytics systems. The cost of adding another system to the landscape and the redundancy of sensitive customer data create a governance challenge that has immediate consequences. 
| Critical IT Needs | Critical Business Needs |
|---|---|
| Keep control of data access and governance; ability to architect a customer data stack with decisions on where data is stored and where queries are executed | Get customer data access via a no-code interface to generate insights; build customer experiences and activate data within business applications |

----- The question of whether to build or buy seems to leave legitimate needs and concerns on one side or the other unaddressed, which is why so many organizations that have built a CDP have expressed dissatisfaction regardless of which side of the fence they came down upon. **At both ActionIQ and Databricks, we believe the best path forward is to acknowledge both sides of the debate and provide organizations a third choice of both building and buying.** The ActionIQ customer data platform built on the Databricks Lakehouse provides the business with a no-code, easy-to-use interface along with the flexibility and centralized governance IT desires. By shifting the conversation from building or buying to building _and_ buying, we’ve opened the door to finding the right balance of approaches for our customer organizations, helping organizations find greater success in their personalization journey. **“We made an attempt to internally build a CDP platform and while we** **could do basic SQL,** **[audience segmentation](https://www.actioniq.com/solutions/audience-segmentation/)** **and activation across multiple** **channels, by no means were we able to orchestrate an** **[omnichannel journey](https://www.actioniq.com/blog/omnichannel-customer-journey/)** **or offer a campaign interface to our product marketers that could empower** **them to create and manage those journeys. 
It was going to take at least two** **years for us to build all of that functionality in house.”** – Sravan Gupta, Senior Manager of GTM Systems, Atlassian ----- ## Combining the Build and Buy Approaches Bringing together the best of build and buy involves the deployment of the CDP alongside or within the lakehouse platform. There are three approaches to this:

[Figure: three CDP deployment approaches. Bundled: (1) Bundled. Composable: (2) Hybrid and (3) Lakehouse-Only. Panel labels: compute, storage (local and views), query virtualization, metadata, data copy, lakehouse compute, lakehouse storage.]

-----

| Deployment Type | Description |
|---|---|
| **Bundled** | The CDP and the lakehouse are managed as two separate systems. Connectors in either system (as well as third-party tools) allow data to be exchanged, typically as part of an ad hoc or batch process. This approach allows the organization to leverage the functionality of both systems, but data is duplicated, making governance an ongoing concern. |
| **Composable – Hybrid** | The CDP and the lakehouse are managed as two separate systems, but deeper integrations between the two allow the organization to decide within which system a specific dataset should reside. Real-time integrations between the systems allow CDP users to select information assets in the lakehouse and generate queries spanning data on either side of the platform divide. This approach minimizes the need for data duplication, which simplifies data governance, even though it must be implemented within two separate systems. |
| **Composable – Lakehouse-Only** | All CDP information assets reside within the lakehouse. User interfaces built on other technologies interact directly with the lakehouse for access to data. This approach minimizes redundancy and allows organizations to implement a centralized data governance strategy for all consumers of customer-relevant data. |
----- ## Deployment Architectures The choice of which of these deployment architectures is best depends on the functional requirements of a specific organization. Each has its benefits, and in the case of parallel and federated deployments, organizations can easily transition between deployment architectures over time. The following table captures many of the typical benefits associated with the different deployment architectures.

Typical User: **IT**

| Component | Description | Bundled CDP Deployment | Composable CDP: Hybrid | Composable CDP: Lakehouse-Only |
|---|---|---|---|---|
| Digital Touchpoints | Collect and integrate data from digital channels (website, app, etc.) | Included with CDP via a tag | Works with any digital touchpoint collection system | Works with any digital touchpoint collection system |
| Data Modeling | Unify and model data to make it usable by other applications | Sometimes included with CDP | Either within the CDP or in Lakehouse via real-time integration | Unified environment with minimal data replication and centralized governance in Lakehouse |
| Identity Resolution | Deduplicate records to build a private ID graph with a single view of the customer | Primarily with CDP or other tools (MDM, Lakehouse) | CDP, MDM, or Lakehouse | Built with Lakehouse and additional tools |
| Data Governance | Control data access and permitted actions on the data | Included with CDP | Both CDP and Lakehouse | Managed centrally from Lakehouse |

-----

Typical User: **Business**

| Component | Description | Bundled CDP Deployment | Composable CDP: Hybrid | Composable CDP: Lakehouse-Only |
|---|---|---|---|---|
| Predictive Scoring | Create and execute models predicting user behaviors such as purchase or churn | Included with CDP with supplemental scoring from Lakehouse | CDP, or automatically present with Lakehouse | Automatically present with Lakehouse |
| Marketing Audience Segments | Use a self-service UI to build rule-based or model-based audiences | Included with CDP | Included with CDP | Included with CDP |
| Customer Journey Orchestration | Define and optimize the customer journey and interactions with the brand across every channel and every phase of the customer lifecycle | Sometimes included with CDP | CDP, marketing automation, or additional tools | CDP, marketing automation, or additional tools |
| Data Activations | Integrate seamlessly with delivery systems for both inbound and outbound customer experiences | Included with CDP | Included with CDP | CDP, or additional tools |
| Analytics | Understand audience and customer journey performance | Sometimes included with CDP | Sometimes included with CDP or built with Lakehouse and additional tools | Built with Lakehouse and additional tools |

----- ## About Databricks Databricks is the data and AI company. More than 9,000 organizations worldwide (including Comcast, Condé Nast, H&M, and over 50% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. ## About ActionIQ AIQ brings order to CX chaos. Our Customer Experience Hub empowers everyone to be a CX champion by giving business teams the freedom to explore and act on customer data while helping technical teams regain control of where data lives and how it’s used. **[Get in touch](https://www.actioniq.com/get-started/)** with our experts to learn more. -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/build-vs-buy-guide-databricks-action-iq.pdf2024-09-19T16:57:20Z----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----SUCCESS/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Databricks_eBook_FinServ_Personalization-FINAL-092622_image.pdf2024-09-19T16:57:19Z",,,
"**eBook** ## The Data Team’s Guide to the Databricks Lakehouse Platform ----- #### Contents **C H A P TE R 1** **C H A P TE R 2** **C H A P TE R 3** **C H A P TE R 4** **C H A P TE R 5** **C H A P TE R 6** **C H A P TE R 7** **C H A P TE R 8** **C H A P TE R 9** **C H A P TE R 10** **C H A P TE R 11** **C H A P TE R 12** **The data lakehouse** ...................................................................................................................................................................................... **4** **The Databricks Lakehouse Platform** .......................................................................................................................... **11** **Data reliability and performance** ................................................................................................................................... **18** **Unified governance and sharing for data, analytics and AI** ....................................... **28** **Security** .............................................................................................................................................................................................................................. **41** **Instant compute and serverless** ................................................................................................................................... **48** **Data warehousing** ......................................................................................................................................................................................... **52** **Data engineering** ............................................................................................................................................................................................. 
**56** **Data streaming** ........................................ **68** **Data science and machine learning** ........................................ **73** **Databricks Technology Partners and the modern data stack** ........................................ **79** **Get started with the Databricks Lakehouse Platform** ........................................ **81** ----- **INTRODUCTION** #### The Data Team’s Guide to the Databricks Lakehouse Platform _The Data Team’s Guide to the Databricks Lakehouse Platform_ is designed for data practitioners and leaders who are embarking on their journey into the data lakehouse architecture. In this eBook, you will learn the full capabilities of the data lakehouse architecture and how the Databricks Lakehouse Platform helps organizations of all sizes (from enterprises to startups in every industry) with all their data, analytics, AI and machine learning use cases on one platform. You will see how the platform combines the best elements of data warehouses and data lakes to increase the reliability, performance and scalability of your data platform. Discover how the lakehouse simplifies complex workloads in data engineering, data warehousing, data streaming, data science and machine learning, and bolsters collaboration for your data teams, allowing them to maintain new levels of governance, flexibility and agility in an open and multicloud environment. ----- **CHAPTER** # 01 ### The data lakehouse ----- #### The evolution of data architectures Data has moved front and center within every organization as data-driven insights have fueled innovation, competitive advantage and better customer experiences. 
However, as companies place mandates on becoming more data-driven, their data teams are left in a sprint to deliver the right data for business insights and innovation. With the widespread adoption of cloud, data teams often invest in large-scale complex data systems that have capabilities for streaming, business intelligence, analytics and machine learning to support the overall business objectives. To support these objectives, data teams have deployed cloud data warehouses and data lakes. Traditional data systems: The data warehouse and data lake With the advent of big data, companies began collecting large amounts of data from many different sources, such as weblogs, sensor data and images. Data warehouses — which have a long history as the foundation for decision support and business intelligence applications — cannot handle large volumes of data. While data warehouses are great for structured data and historical analysis, they weren’t designed for unstructured data, semi-structured data, and data with high variety, velocity and volume, making them unsuitable for many types of data. This led to the introduction of data lakes, providing a single repository of raw data in a variety of formats. While suitable for storing big data, data lakes do not support transactions, nor do they enforce data quality, and their lack of consistency/isolation makes it almost impossible to read, write or process data. For these reasons, many of the promises of data lakes never materialized and, in many cases, reduced the benefits of data warehouses. As companies discovered new use cases for data exploration, predictive modeling and prescriptive analytics, the need for a single, flexible, high-performance system only grew. Data teams require systems for diverse data applications including SQL analytics, real-time analytics, data science and machine learning. 
----- To solve for new use cases and new users, a common approach is to use multiple systems: a data lake, several data warehouses and other specialized systems such as streaming, time-series, graph and image databases. But having multiple systems introduces complexity and delay, as data teams invariably need to move or copy data between different systems, effectively losing oversight and governance over data usage. You have now duplicated data in two different systems and the changes you make in one system are unlikely to find their way to the other. So, you are going to have data drift almost immediately, not to mention paying to store the same data multiple times. Then, because governance is happening at two distinct levels across these platforms, you are not able to control things consistently. **Challenges with data, analytics and AI** In a recent [Accenture](https://www.accenture.com/_acnmedia/pdf-108/accenture-closing-data-value-gap-fixed.pdf) study, only 32% of companies reported tangible and measurable value from data. The challenge is that most companies continue to implement two different platforms: data warehouses for BI and data lakes for AI. These platforms are incompatible with each other, but data from both systems is generally needed to deliver game-changing outcomes, which makes success with AI extremely difficult. Today, most of the data is landing in the data lake, and a lot of it is unstructured. In fact, according to [IDC](https://www.idc.com/getdoc.jsp?containerId=US47998321), about 80% of the data in any organization will be unstructured by 2025. But this data is where much of the value from AI resides. Subsets of the data are then copied to the data warehouse into structured tables, and back again in some cases. You also must secure and govern the data in both systems: warehouses offer fine-grained governance, while lakes tend to be coarser grained, at the file level. 
Then, you stand up different stacks of tools on these platforms to do either BI or AI. ----- Finally, the tool stacks on top of these platforms are fundamentally different, which makes it difficult to get any kind of collaboration going between the teams that support them. This is why AI efforts fail. There is a tremendous amount of complexity and rework being introduced into the system. Time and resources are being wasted trying to get the right data to the right people, and everything is happening too slowly to get in front of the competition.

[Figure: Realizing this requires two disparate, incompatible data platforms. A data warehouse (structured tables; governance and security through table ACLs) and a data lake (unstructured files: logs, text, images, video; governance and security through files and blobs) form disjointed and duplicative data silos with incompatible security and governance models, with subsets of data copied between them. Use cases shown: business intelligence, SQL analytics, data science and ML, and data streaming, with incomplete support for use cases.]

----- **Moving forward with a lakehouse architecture** To satisfy the need to support AI and BI directly on vast amounts of data stored in data lakes (on low-cost cloud storage), a new data management architecture emerged independently across many organizations and use cases: the data lakehouse. The data lakehouse can store _all_ and _any_ type of data once in a data lake and make that data accessible directly for AI and BI. The lakehouse paradigm has specific capabilities to efficiently allow both AI and BI on all the enterprise’s data at a massive scale. Namely, it has the SQL and performance capabilities such as indexing, caching and MPP processing to make BI work fast on data lakes. 
It also has direct file access and direct native support for Python, data science and AI frameworks without the need for a separate data warehouse. In short, a lakehouse is a data architecture that combines the best elements of data warehouses and data lakes. Lakehouses are enabled by a new system design, which implements similar data structures and data management features found in a data warehouse directly on the low-cost storage used for data lakes. ----- ##### Data lakehouse One platform to unify all your data, analytics and AI workloads ###### Lakehouse Platform All machine learning, SQL, BI, and streaming use cases One security and governance approach for all data assets on all clouds ----- **Key features for a lakehouse** Recent innovations with the data lakehouse architecture can help simplify your data and AI workloads, ease collaboration for data teams, and maintain the kind of flexibility and openness that allows your organization to stay agile as you scale. Here are key features to consider when evaluating data lakehouse architectures: Transaction support: In an enterprise lakehouse, many data pipelines will often be reading and writing data concurrently. Support for ACID (Atomicity, Consistency, Isolation and Durability) transactions ensures consistency as multiple parties concurrently read or write data. Schema enforcement and governance: The lakehouse should have a way to support schema enforcement and evolution, supporting data warehouse schema paradigms such as star/snowflake. The system should be able to reason about data integrity, and it should have robust governance and auditing mechanisms. Data governance: Capabilities including auditing, retention and lineage have become essential, particularly considering recent privacy regulations. Tools that allow data discovery have become popular, such as data catalogs and data usage metrics. BI support: Lakehouses allow the use of BI tools directly on the source data. 
This reduces staleness and latency, improves recency and lowers cost by not having to operationalize two copies of the data in both a data lake and a warehouse. Storage decoupled from compute: In practice, this means storage and compute use separate clusters, thus these systems can scale to many more concurrent users and larger data sizes. Some modern data warehouses also have this property. Openness: The storage formats, such as Apache Parquet, are open and standardized, so a variety of tools and engines, including machine learning and Python/R libraries, can efficiently access the data directly. Support for diverse data types (unstructured and structured): The lakehouse can be used to store, refine, analyze and access data types needed for many new data applications, including images, video, audio, semi-structured data and text. Support for diverse workloads: Use the same data repository for a range of workloads including data science, machine learning and SQL analytics. Multiple tools might be needed to support all these workloads. End-to-end streaming: Real-time reports are the norm in many enterprises. Support for streaming eliminates the need for separate systems dedicated to serving real-time data applications. **Learn more** **•** [Lakehouse: A New Generation of Open Platforms](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) [That Unify Data Warehousing and Advanced Analytics](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf) **•** [Building the Data Lakehouse by Bill Inmon, Father of the](https://databricks.com/p/ebook/building-the-data-lakehouse) [Data Warehouse](https://databricks.com/p/ebook/building-the-data-lakehouse) **•** [What Is a Data Lakehouse?](https://databricks.com/glossary/data-lakehouse#:~:text=A%20data%20lakehouse%20is%20a,(ML)%20on%20all%20data.) 
----- **CHAPTER** # 02 ### The Databricks Lakehouse Platform ----- #### Lakehouse: A new generation of open platforms ###### This is the lakehouse paradigm Databricks is the inventor and pioneer of the data lakehouse architecture. The data lakehouse architecture was coined in the research paper, [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](http://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf), introduced by Databricks’ founders, UC Berkeley and Stanford University at the 11th Conference on Innovative Data Systems Research (CIDR) in 2021. At Databricks, we are continuously innovating on the lakehouse architecture to help customers deliver on their data, analytics and AI aspirations. The ideal data, analytics and AI platform needs to operate differently. Rather than copying and transforming data in multiple systems, you need one platform that accommodates all data types.

[Figure: the lakehouse paradigm. Persona-based use cases (business intelligence, SQL analytics, data science and ML, data streaming): all ML, SQL, BI and streaming use cases. Unity Catalog (fine-grained governance for data and AI; files and blobs and table ACLs): one security and governance approach for all data assets on all clouds. Delta Lake (data reliability and performance): a reliable data platform to efficiently handle all data types.]

Ideally, the platform must be open, so that you are not locked into any walled gardens. You would also have one security and governance model. It would not only manage all data types, but it would also be cloud-agnostic to govern data wherever it is stored. Last, it would support all major data, analytics and AI workloads, so that your teams can easily collaborate and get access to all the data they need to innovate. ----- #### What is the Databricks Lakehouse Platform? 
The Databricks Lakehouse Platform unifies your data warehousing and AI use cases on a single platform. It combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance and performance of data warehouses with the openness, flexibility and machine learning support of data lakes. This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science and machine learning. It’s built on open source and open standards to maximize flexibility. And its common approach to data management, security and governance helps you operate more efficiently and innovate faster.

[Figure: Lakehouse Platform workloads: data warehousing, data engineering, data streaming, data science and ML.]

----- #### Benefits of the Databricks Lakehouse Platform **Simple** The unified approach simplifies your data architecture by eliminating the data silos that traditionally separate analytics, BI, data science and machine learning. With a lakehouse, you can eliminate the complexity and expense that make it hard to achieve the full potential of your analytics and AI initiatives. **Open** Delta Lake forms the open foundation of the lakehouse by providing reliability and performance directly on data in the data lake. You’re able to avoid proprietary walled gardens, easily share data and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network. **Multicloud** The Databricks Lakehouse Platform offers you a consistent management, security and governance experience across all clouds. You do not need to invest in reinventing processes for every cloud platform that you are using to support your data and AI efforts. Instead, your data teams can simply focus on putting all your data to work to discover new insights. 
----- #### The Databricks Lakehouse Platform architecture **Data reliability and performance for lakehouse** [Delta Lake](https://databricks.com/product/delta-lake-on-databricks) is an open format storage layer built for the lakehouse that integrates with all major analytics tools and works with the widest variety of formats to store and process data. [Photon](https://databricks.com/product/photon) is the next-generation query engine built for the lakehouse that leverages a state-of-the-art vectorized engine for fast querying and provides the best performance for all workloads in the lakehouse. In Chapter 3, we explore the details of data reliability and performance for the lakehouse. **Unified governance and security for lakehouse** The Databricks Lakehouse Platform provides unified governance with enterprise scale, security and compliance. The [Databricks Unity Catalog](https://databricks.com/product/unity-catalog) (UC) provides governance for your data and AI assets in the lakehouse (files, tables, dashboards, and machine learning models), giving you much better control, management and security across clouds. [Delta Sharing](https://databricks.com/product/delta-sharing) is an open protocol that allows companies to securely share data across the organization in real time, independent of the platform on which the data resides. In Chapter 4, we go into the details of unified governance for lakehouse and, in Chapter 5, we dive into the details of security for lakehouse. **Instant compute and serverless** Serverless compute is a fully managed service where Databricks provisions and manages the compute layer on behalf of the customer in the Databricks cloud account instead of the customer account. As of the current release, serverless compute is supported for use with Databricks SQL. In Chapter 6, we explore the details of instant compute and serverless for lakehouse. 
----- #### The Databricks Lakehouse Platform workloads The Databricks Lakehouse Platform architecture supports different workloads such as data warehousing, data engineering, data streaming, data science and machine learning on one simple, open and multicloud data platform. **Data warehousing** Data warehousing is one of the most business-critical workloads for data teams, and the best data warehouse is a lakehouse. The Databricks Lakehouse Platform lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice, with no lock-in. Reduce resource management overhead with serverless compute, and easily ingest, transform and query all your data in-place to deliver real-time business insights faster. Built on open standards and APIs, the Databricks Lakehouse Platform provides the reliability, quality and performance that data lakes natively lack, plus integrations with the ecosystem for maximum flexibility. In Chapter 7, we go into the details of data warehousing on the lakehouse. **Data engineering** Data engineering on the lakehouse allows data teams to unify batch and streaming operations on a simplified architecture, streamline data pipeline development and testing, build reliable data, analytics and AI workflows on any cloud platform, and meet regulatory requirements to maintain governance. The lakehouse provides an end-to-end data engineering and ETL platform that automates the complexity of building and maintaining pipelines and running ETL workloads so data engineers and analysts can focus on quality and reliability to drive valuable insights. In Chapter 8, we go into the details of data engineering on the lakehouse. **Data streaming** [Data streaming](https://www.databricks.com/product/data-streaming) is one of the fastest growing workloads within the Databricks Lakehouse Platform and is the future of all data processing. 
Real-time processing provides the freshest possible data to an organization’s analytics and machine learning models, enabling them to make better, faster decisions and more accurate predictions, offer improved customer experiences and more. The Databricks Lakehouse Platform dramatically simplifies data streaming to deliver real-time analytics, machine learning and applications on one platform. In Chapter 9, we go into the details of data streaming on the lakehouse. **Data science and machine learning** Data science and machine learning (DSML) on the lakehouse is a powerful workload that distinguishes the lakehouse from many other data offerings. DSML on the lakehouse provides a data-native and collaborative solution for the full ML lifecycle. It can maximize data and ML team productivity, streamline collaboration, empower ML teams to prepare, process and manage data in a self-service manner, and standardize the ML lifecycle from experimentation to production. In Chapter 10, we go into the details of DSML on the lakehouse. ----- **Databricks Lakehouse Platform and your modern data stack** The Databricks Lakehouse Platform is open and provides the flexibility to continue using existing infrastructure, to easily share data and build your modern data stack with unrestricted access to the ecosystem of open source data projects and the broad Databricks partner network with [Partner Connect](https://databricks.com/partnerconnect). In Chapter 11, we go into the details of our technology partners and the modern data stack. #### Global adoption of the Databricks Lakehouse Platform Today, Databricks has over 7,000 [customers](https://databricks.com/customers), from Fortune 500 to unicorns across industries doing transformational work. Organizations around the globe are driving change and delivering a new generation of data, analytics and AI applications. 
We believe that the unfulfilled promise of data and AI can finally be fulfilled with one platform for data analytics, data science and machine learning with the Databricks Lakehouse Platform. **Learn more** [Databricks Lakehouse Platform](https://databricks.com/product/data-lakehouse) [Databricks Lakehouse Platform Demo Hub](https://databricks.com/discover/demos) [Databricks Lakehouse Platform Customer Stories](https://databricks.com/customers) [Databricks Lakehouse Platform Documentation](https://databricks.com/documentation) [Databricks Lakehouse Platform Training and Certification](https://databricks.com/learn/training/home) [Databricks Lakehouse Platform Resources](https://databricks.com/resources) ----- **CHAPTER** # 03 ### Data reliability and performance To bring openness, reliability and lifecycle management to data lakes, the Databricks Lakehouse Platform is built on the foundation of Delta Lake. Delta Lake solves challenges around unstructured/structured data ingestion, the application of data quality, difficulties with deleting data for compliance or issues with modifying data for change data capture. Although data lakes are great solutions for holding large quantities of raw data, they lack important attributes for data reliability and quality and often don’t offer good performance when compared to data warehouses. ----- #### Problems with today’s data lakes When it comes to data reliability and quality, examples of these missing attributes include: **•** **Lack of ACID transactions:** Makes it impossible to mix updates, appends and reads **•** **Lack of schema enforcement:** Creates inconsistent and low-quality data. Schema enforcement would, for example, reject writes that don’t match a table’s schema. **•** **Lack of integration with data catalog:** Results in dark data and no single source of truth Even just the absence of these three attributes can cause a lot of extra work for data engineers as they strive to ensure consistent high-quality data in the pipelines they create. 
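
To make the schema enforcement bullet above concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not Delta Lake's API or implementation: `ToyTable`, `SchemaMismatchError` and the schema format (column name mapped to a Python type) are all hypothetical names introduced for this example. It shows the behavior the bullet describes, where a write that does not match the table's schema is rejected as a whole.

```python
# Toy illustration only: not Delta Lake's API or implementation.
class SchemaMismatchError(ValueError):
    '''Raised when a row does not match the table schema.'''


class ToyTable:
    def __init__(self, schema):
        # Hypothetical schema format: column name -> expected Python type.
        self.schema = dict(schema)
        self.rows = []

    def _validate(self, row):
        if set(row) != set(self.schema):
            raise SchemaMismatchError(
                f'columns {sorted(row)} do not match schema {sorted(self.schema)}'
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise SchemaMismatchError(
                    f'column {column!r} expects {expected.__name__}, '
                    f'got {type(row[column]).__name__}'
                )

    def append(self, new_rows):
        new_rows = list(new_rows)
        # Validate the whole batch first so one bad row rejects the entire write.
        for row in new_rows:
            self._validate(row)
        self.rows.extend(new_rows)


table = ToyTable({'id': int, 'name': str})
table.append([{'id': 1, 'name': 'a'}])

rejected = False
try:
    # The second row has a string where an int is expected.
    table.append([{'id': 2, 'name': 'b'}, {'id': 'three', 'name': 'c'}])
except SchemaMismatchError as err:
    rejected = True
    print('write rejected:', err)

print(len(table.rows))
```

Validating the full batch before extending `rows` is a deliberate choice in this sketch: the valid first row of the rejected batch is not kept, so the table never holds a partial write.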
These challenges are solved with two key technologies that are at the foundation of the lakehouse: Delta Lake and Photon. **What is Delta Lake?** Delta Lake is a file-based, open source storage format that provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of existing data lakes and is compatible with Apache Spark™ and other processing engines. Delta Lake uses Delta Tables which are based on Apache Parquet, a commonly used format for structured data already utilized by many organizations. Therefore, switching existing Parquet tables to Delta Tables is easy and quick. Delta Tables can also be used with semi-structured and unstructured data, providing versioning, reliability, metadata management, and time travel capabilities that make these types of data easily managed as well. As for performance, data lakes use object storage, so data is mostly kept in immutable files leading to the following problems: **•** **Ineffective partitioning:** In many cases, data engineers resort to “poor man’s” indexing practices in the form of partitioning that leads to hundreds of dev hours spent tuning file sizes to improve read/write performance. Often, partitioning proves to be ineffective over time if the wrong field was selected for partitioning or due to high cardinality columns. **•** **Too many small files:** With no support for transactions, appending new data takes the form of adding more and more files, leading to “small file problems,” a known root cause of query performance degradation. ----- **Delta Lake features** **ACID guarantees** Delta Lake ensures that all data changes written to storage are committed for durability and made visible to readers atomically. In other words, no more partial or corrupted files. 
**Scalable data and metadata handling** Since Delta Lake is built on data lakes, all reads and writes using Spark or other distributed processing engines are inherently scalable to petabyte-scale. However, unlike most other storage formats and query engines, Delta Lake leverages Spark to scale out all the metadata processing, thus efficiently handling metadata of billions of files for petabyte-scale tables. **Audit history and time travel** The Delta Lake transaction log records details about every change made to data, providing a full audit trail of the changes. These data snapshots allow developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. **Schema enforcement and schema evolution** Delta Lake automatically prevents the insertion of data with an incorrect schema, i.e., not matching the table schema. And when needed, it allows the table schema to be explicitly and safely evolved to accommodate ever-changing data. **Support for deletes, updates and merges** Most distributed processing frameworks do not support atomic data modification operations on data lakes. Delta Lake supports merge, update and delete operations to enable complex use cases including but not limited to change data capture (CDC), slowly changing dimension (SCD) operations and streaming upserts. **Streaming and batch unification** A Delta Lake table can work both in batch and as a streaming source and sink. The ability to work across a wide variety of latencies, ranging from streaming data ingestion to batch historic backfill, to interactive queries all work out of the box. ----- **The Delta Lake transaction log** A key to understanding how Delta Lake provides all these capabilities is the transaction log. The Delta Lake transaction log is the common thread that runs through many of Delta Lake’s most notable features, including ACID transactions, scalable metadata handling, time travel and more. 
The Delta Lake transaction log is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. Delta Lake is built on top of Spark to allow multiple readers and writers of a given table to work on a table at the same time. To always show users correct views of the data, the transaction log serves as a single source of truth: the central repository that tracks all changes that users make to the table.

When a user reads a Delta Lake table for the first time or runs a new query on an open table that has been modified since the last time it was read, Spark checks the transaction log to see what new transactions have been posted to the table. Then, Spark updates the table with those recent changes. This ensures that a user’s version of a table is always synchronized with the master record as of the most recent query, and that users cannot make divergent, conflicting changes to a table.

**Flexibility and broad industry support**

Delta Lake is an open source project that is part of the Linux Foundation, with an engaged community of contributors building and growing the Delta Lake ecosystem atop a set of open APIs. With the growing adoption of Delta Lake as an open storage standard in different environments and use cases comes a broad set of integrations with industry-leading tools, technologies and formats.

Organizations leveraging Delta Lake on the Databricks Lakehouse Platform gain flexibility in how they ingest, store and query data. They are not limited to storing data in a single cloud provider and can implement a true multicloud approach to data storage. Connectors to tools, such as Fivetran, allow you to leverage Databricks’ ecosystem of partner solutions, so organizations have full control of building the right ingestion pipelines for their use cases. Finally, consuming data via queries for exploration or business intelligence (BI) is also flexible and open.
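
The transaction log described earlier in this section can be pictured with a toy model in plain Python. This is a minimal sketch, not Delta Lake's implementation: the `ToyLog` class, its add/remove action format and the file names are hypothetical. It only illustrates the idea that a table version is the result of replaying an ordered list of commits, which is what makes snapshots and time travel possible.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLog:
    """Hypothetical in-memory stand-in for an ordered transaction log."""

    commits: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Append one commit as a single entry and return its version number."""
        self.commits.append({"add": set(add), "remove": set(remove)})
        return len(self.commits) - 1

    def snapshot(self, version=None):
        """Replay commits 0..version to find the data files live at that version."""
        if version is None:
            version = len(self.commits) - 1
        files = set()
        for entry in self.commits[: version + 1]:
            files -= entry["remove"]
            files |= entry["add"]
        return files


log = ToyLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"])
# A rewrite: one commit that swaps an old file for a new one.
v2 = log.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

latest = log.snapshot()              # current view of the table
as_of_v1 = log.snapshot(version=v1)  # "time travel" to an earlier version
print(sorted(latest))
print(sorted(as_of_v1))
```

Because readers only ever see files referenced by a completed log entry, a half-written file that was never committed is simply invisible in this model.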
-----

**Delta Lake integrates with all major analytics tools**
Eliminates unnecessary data movement and duplication

-----

In addition to a wide ecosystem of tools and technologies, Delta Lake supports a broad set of data formats for structured, semi-structured and unstructured data. These formats include image binary data that can be stored in Delta Tables, graph data format, geospatial data types and key-value stores.

**Learn more**

[Delta Lake on the Databricks Lakehouse](https://databricks.com/product/delta-lake-on-databricks)
[Documentation](https://docs.databricks.com/delta/index.html)
[Delta Lake Open Source Project](https://docs.databricks.com/delta/index.html)
[eBooks: The Delta Lake Series](https://databricks.com/p/ebook/the-definitive-guide-to-delta-lake-series)

**What is Photon?**

As many organizations standardize on the lakehouse paradigm, this new architecture poses challenges with the underlying query execution engine for accessing and processing structured and unstructured data. The execution engine needs to provide the performance of a data warehouse and the scalability of data lakes.

Photon is the next-generation query engine on the Databricks Lakehouse Platform that provides dramatic infrastructure cost savings and speedups for all use cases, from data ingestion, ETL and streaming to data science and interactive queries, directly on your data lake. Photon is compatible with Spark APIs and implements a more general execution framework that allows efficient processing of data with support of the Spark API. This means getting started is as easy as turning it on: no code change and no lock-in. With Photon, typical customers are seeing up to 80% TCO savings over traditional Databricks Runtime (Spark) and up to 85% reduction in VM compute hours.

_Figure labels: Spark instructions, Photon instructions, Photon engine, Delta/Parquet, Photon writer to Delta/Parquet_

-----

Why process queries with Photon?
Query performance on Databricks has steadily increased over the years, powered by Spark and thousands of optimizations packaged as part of the Databricks Runtime (DBR). Photon provides an additional 2x speedup per the TPC-DS 1TB benchmark compared to the latest DBR versions.

_Figure: Relative speedup to DBR 2.1 by DBR version (higher is better); x-axis: release date and DBR version (TPC-DS 1TB, 10 x i3xl)_

**Customers have observed significant speedups using Photon on workloads such as:**

**•** **SQL-based jobs:** Accelerate large-scale production jobs on SQL and Spark DataFrames

**•** **IoT use cases:** Faster time-series analysis using Photon compared to Spark and traditional Databricks Runtime

**•** **Data privacy and compliance:** Query petabyte-scale data sets to identify and delete records without duplicating data with Delta Lake, production jobs and Photon

**•** **Loading data into Delta and Parquet:** Vectorized I/O speeds up data loads for Delta and Parquet tables, lowering overall runtime and costs of data engineering jobs

-----

Best price/performance for analytics in the cloud

Written from the ground up in C++, Photon takes advantage of modern hardware for faster queries, providing up to 12x better price/performance compared to other cloud data warehouses, all natively on your data lake.

_Figure: 100TB TPC-DS price/performance (lower is better); systems compared: Databricks SQL spot, Databricks SQL on-demand, cloud data warehouse 1, cloud data warehouse 2, cloud data warehouse 3_

-----

Works with your existing code and avoids vendor lock-in

Photon is designed to be compatible with the Apache Spark DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. All you do is turn it on. Photon will seamlessly coordinate work and resources and transparently accelerate portions of your SQL and Spark queries. No tuning or user intervention required.
**Photon in the Databricks Lakehouse Platform**

_Lifecycle of a Photon query_ (figure labels: client submits SQL; parsing; Catalyst analysis, planning and optimization; scheduling; execute task; Spark driver on the JVM; Spark executors, mixed JVM/native)

-----

Optimizing for all data use cases and workloads

Photon is the first purpose-built lakehouse engine designed to accelerate all data and analytics workloads: data ingestion, ETL, streaming, data science, and interactive queries. While we started Photon primarily focused on SQL to provide customers with world-class data warehousing performance on their data lakes, we’ve significantly increased the scope of ingestion sources, formats, APIs and methods supported by Photon since then. As a result, customers have seen dramatic infrastructure cost savings and speedups on Photon across all their modern Spark (e.g., Spark SQL and DataFrame) workloads.

_Accelerating all workloads on the lakehouse_ (figure labels: query optimizer, native execution engine, caching)

**Learn more**

[Announcing Photon Public Preview: The Next-Generation Query Engine on the Databricks Lakehouse Platform](https://www.databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databricks-lakehouse-platform.html)
[Databricks Sets Official Data Warehousing Performance Record](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html)

-----

**CHAPTER**
# 04
### Unified governance and sharing for data, analytics and AI

Today, more and more organizations recognize the importance of making high-quality data readily available to data teams to drive actionable insights and business value.
At the same time, organizations also understand the risks of data breaches which negatively impact brand value and inevitably lead to erosion of customer trust. Governance is one of the most critical components of a lakehouse data platform architecture; it helps ensure that data assets are securely managed throughout the enterprise. However, many companies are using different incompatible governance models leading to complex and expensive solutions. ----- #### Key challenges with data and AI governance **Diversity of data and AI assets** The increased use of data and the added complexity of the data landscape have left organizations with a difficult time managing and governing all types of their data-related assets. No longer is data stored in files or tables. Data assets today take many forms, including dashboards, machine learning models and unstructured data like video and images that legacy data governance solutions simply are not built to govern and manage. **Rising multicloud adoption** More and more organizations now leverage a multicloud strategy to optimize costs, avoid vendor lock-in, and meet compliance and privacy regulations. With nonstandard, cloud-specific governance models, data governance across clouds is complex and requires familiarity with cloud-specific security and governance concepts, such as identity and access management (IAM). **Disjointed tools for data governance on the lakehouse** Today, data teams must deal with a myriad of fragmented tools and services for their data governance requirements, such as data discovery, cataloging, auditing, sharing, access controls, etc. This inevitably leads to operational inefficiencies and poor performance due to multiple integration points and network latency between the services. **Two disparate and incompatible data platforms** Organizations today use two different platforms for their data analytics and AI efforts — data warehouses for BI and data lakes for AI. 
This results in data replication across two platforms, presenting a major governance challenge. With no unified view of the data landscape, it is difficult to see where data is stored, who has access to what data, and consistently define and enforce data access policies across the two platforms with different governance models. ----- #### One security and governance approach Lakehouse systems provide a uniform way to manage access control, data quality and compliance across all of an organization’s data using standard interfaces similar to those in data warehouses by adding a management interface on top of data lake storage. Modern lakehouse systems support fine-grained (row, column and view level) access control via SQL, query auditing, attribute-based access control, data versioning and data quality constraints and monitoring. These features are generally provided using standard interfaces familiar to database administrators (for example, SQL GRANT commands) to allow existing personnel to manage all the data in an organization in a uniform way. Centralizing all the data in a lakehouse system with a single management interface also reduces the administrative burden and potential for error that comes with managing multiple separate systems. #### What is Unity Catalog? Unity Catalog is a unified governance solution for all data, analytics and AI assets including files, tables, dashboards and machine learning models in your lakehouse on any cloud. Unity Catalog simplifies governance by empowering data teams with a common governance model based on ANSI-SQL to define and enforce fine-grained access controls. With attribute-based access controls, data administrators can enable fine-grained access controls on rows and columns using tags (attributes). Built-in data search and discovery allows data teams to quickly find and reference relevant data for any use case. 
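
Tag-based access rules of the kind described above can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions, not Unity Catalog's API or policy syntax: the column names, the `pii` tag, the group names and both helper functions are hypothetical. The point it illustrates is that one rule attached to a tag covers every column carrying that tag.

```python
# Hypothetical column tags (attributes) for one table.
COLUMN_TAGS = {
    "customers.email": {"pii"},
    "customers.phone": {"pii"},
    "customers.country": set(),
}

# One rule per tag: tag -> groups allowed to read columns with that tag.
TAG_RULES = {"pii": {"privacy_officers"}}


def can_read(user_groups, column):
    """A column is readable if the user satisfies the rule for each of its tags.

    A tag with no rule is denied by default in this toy model.
    """
    return all(
        user_groups & TAG_RULES.get(tag, set())
        for tag in COLUMN_TAGS.get(column, set())
    )


def visible_columns(user_groups):
    """Return the sorted list of columns this set of groups may read."""
    return sorted(col for col in COLUMN_TAGS if can_read(user_groups, col))


analyst_view = visible_columns({"analysts"})
officer_view = visible_columns({"privacy_officers"})
print(analyst_view)
print(officer_view)
```

Tagging a new column as `pii` in this model changes who can read it without touching the rule itself, which is the scaling benefit of attribute-based rules.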
Unity Catalog offers automated data lineage for all workloads in SQL, R, Scala and Python, to build a better understanding of the data and its flow in the lakehouse. Unity Catalog also allows data sharing across or within organizations and seamless integrations with your existing data governance tools. With Unity Catalog, data teams can simplify governance for all data and AI assets with one consistent model to discover, access and share data, giving you much better native performance, management and security across clouds. ----- **Key benefits** The common metadata layer for cross-workspace metadata is at the account level and eases collaboration by allowing different workspaces to access Unity Catalog metadata through a common interface and break down data silos. Further, the data permissions in Unity Catalog are applied to account-level identities, rather than identities that are local to a workspace, allowing a consistent view of users and groups across all workspaces. Catalog, secure and audit access to all data assets on any cloud Unity Catalog provides centralized metadata, enabling data teams to create a single source of truth for all data assets ranging from files, tables, dashboards to machine learning models in one place. ----- Unity Catalog offers a unified data access layer that provides a simple and streamlined way to define and connect to your data through managed tables, external tables, or files, while managing their access controls. Unity Catalog centralizes access controls for files, tables and views. It allows fine-grained access controls for restricting access to certain rows and columns to the users and groups who are authorized to query them. With Attribute-Based Access Controls (ABAC), you can control access to multiple data items at once based on user and data attributes, further simplifying governance at scale. 
For example, you will be able to tag multiple columns as personally identifiable information (PII) and manage access to all columns tagged as PII in a single rule. Today, organizations are dealing with an increased burden of regulatory compliance, and data access auditing is a critical component to ensure your organization is set up for success while meeting compliance requirements. Unity Catalog also provides centralized fine-grained auditing by capturing an audit log of operations such as create, read, update and delete (CRUD) that have been performed against the data. This allows a fine-grained audit trail showing who accessed a given data set and helps you meet your compliance and business requirements. ----- Built-in data search and discovery Data discovery is a critical component to break down data silos and democratize data across your organization to make data-driven decisions. Unity Catalog provides a rich user interface for data search and discovery, enabling data teams to quickly search relevant data assets across the data landscape and reference them for all use cases — BI, analytics and machine learning — accelerating time-to-value and boosting productivity. ----- Automated data lineage for all workloads Data lineage describes the transformations and refinements of data from source to insight. Lineage includes capturing all the relevant metadata and events associated with the data in its lifecycle, including the source of the data set, what other data sets were used to create it, who created it and when, what transformations were performed, which other data sets leverage it, and many other events and attributes. Unity Catalog offers automated data lineage down to table and column level, enabling data teams to get an end-to-end view of where data is coming from, what transformations were performed on the data and how data is consumed by end applications such as notebooks, workflows, dashboards, machine learning models, etc. 
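
Lineage of the kind described above can be thought of as a directed graph from source data sets to the data sets and applications derived from them. The sketch below is a toy model in plain Python, not the Unity Catalog API: the asset names, the `EDGES` list and both helper functions are hypothetical. Walking the graph backward yields everything a data set was built from; walking it forward yields everything that depends on it.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (source asset, derived asset).
EDGES = [
    ("raw.orders", "silver.orders"),
    ("raw.customers", "silver.customers"),
    ("silver.orders", "gold.revenue"),
    ("silver.customers", "gold.revenue"),
    ("gold.revenue", "dashboard.sales"),
]


def build_index(edges):
    """Index the edges in both directions for forward and backward walks."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, neighbors):
    """Breadth-first walk returning every asset reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


downstream, upstream = build_index(EDGES)
sources_of_revenue = reachable("gold.revenue", upstream)    # where did it come from?
impacted_by_orders = reachable("raw.orders", downstream)    # what depends on it?
print(sorted(sources_of_revenue))
print(sorted(impacted_by_orders))
```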
With automated data lineage for all workloads (SQL, R, Python and Scala), data teams gain three benefits. First, they can quickly identify and perform root cause analysis of any errors in the data pipelines or end applications. Second, data teams can perform impact analysis to see dependencies of any data changes on downstream consumers and notify them about the potential impact. Finally, data lineage also empowers data teams with increased understanding of their data and reduces tribal knowledge.

Unity Catalog can also capture lineage associated with non-data entities, such as notebooks, workflows and dashboards. Lineage can be retrieved via REST APIs to support integrations with other catalogs.

_Data lineage with Unity Catalog_

Integrated with your existing tools

Unity Catalog helps you to future-proof your data and AI governance with the flexibility to leverage your existing data catalogs and governance solutions: Collibra, Alation, Immuta, Privacera, Microsoft Purview and AWS Lake Formation.

**Resources**

[Learn more about Unity Catalog](https://databricks.com/product/unity-catalog)
[AWS Documentation](https://docs.databricks.com/data-governance/unity-catalog/index.html)
[Azure Documentation](https://docs.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/)

-----

#### Open data sharing and collaboration

Data sharing has become important in the digital economy as enterprises wish to exchange data easily and securely with their customers, partners, suppliers and internal lines of business to better collaborate and unlock value from that data. But to date, the lack of a standards-based data sharing protocol has resulted in data sharing solutions tied to a single vendor or commercial product, introducing vendor lock-in risks. What the industry deserves is an open approach to data sharing.

**Why data sharing is hard**

Data sharing has evolved from an optional feature of a few data platforms to a business necessity and success factor for organizations.
Every day, our solution architects encounter classic scenarios such as a retailer looking to publish sales data to its suppliers in real time or a supplier that wants to share real-time inventory.

As a reminder, data sharing recently triggered the most impressive scientific development that humankind has ever seen. On January 5, 2020, the first sample of the genome of the coronavirus was uploaded to the internet. It wasn’t a lung biopsy from a patient in Wuhan, but a shared digital genomic data set that triggered the development of the first batch of COVID vaccines worldwide. Since then, coronavirus experts have daily exchanged public data sets, looking for better treatments, tests and tracking mutations as they are passed down through a lineage, a branch of the coronavirus family tree. The above graphic shows such a [publicly shared mutation data set](https://www.ncbi.nlm.nih.gov/genbank/).

-----

Sharing data, as well as consuming data from external sources, allows you to collaborate with partners, establish new partnerships, enable research and generate new revenue streams with data monetization.

Despite those promising examples, existing data sharing technologies come with several limitations:

**•** Traditional data sharing technologies, such as Secure File Transfer Protocol (SFTP), do not scale well and only serve files offloaded to a server

**•** Cloud object stores operate on an object level and are cloud-specific

**•** Commercial data sharing offerings baked into vendor products often share tables instead of files, but scaling them is expensive and they are not open and, therefore, do not permit data sharing with a different platform

The following table compares proprietary vendor solutions with SFTP, cloud object stores and Delta Sharing.
|Feature|Proprietary vendor solutions|SFTP|Cloud object store|Delta Sharing|
|---|---|---|---|---|
|Secure|||||
|Cheap|||||
|Vendor agnostic|||||
|Multicloud|||||
|Open source|||||
|Table/DataFrame abstraction|||||
|Live data|||||
|Predicate pushdown|||||
|Object store bandwidth|||||
|Zero compute cost|||||
|Scalability|||||

-----

**Open source data sharing and Databricks**

To address the limitations of existing data sharing solutions, Databricks developed [Delta Sharing](https://github.com/delta-io/delta-sharing), with various contributions from the OSS community, and donated it to the Linux Foundation.

An open source–based solution, such as Delta Sharing, eliminates the lock-in of commercial solutions and brings a number of additional benefits such as community-developed integrations with popular, open source data processing frameworks. In addition, open protocols allow the easy integration of commercial clients, such as BI tools.

**What is Databricks Delta Sharing?**

Databricks Delta Sharing provides an open solution to securely share live data from your lakehouse to any computing platform. Recipients don’t have to be on the Databricks platform or on the same cloud or a cloud at all. Data providers can share live data, without replicating or moving it to another system. Recipients benefit from always having access to the latest version of data and can quickly query shared data using tools of their choice for BI, analytics and machine learning, reducing time-to-value. Data providers can centrally manage, govern, audit and track usage of the shared data on one platform.

Unity Catalog natively supports [Delta Sharing](https://databricks.com/product/delta-sharing), the world’s first open protocol for data sharing, enabling organizations to share live, large-scale data without replication and make data easily and quickly accessible from tools of your choice, with enterprise-grade security.
**Key benefits** Open cross-platform sharing Easily share existing data in Delta Lake and Apache Parquet formats between different vendors. Consumers don’t have to be on the Databricks platform, same cloud or a cloud at all. Native integration with Power BI, Tableau, Spark, pandas and Java allow recipients to consume shared data directly from the tools of their choice. Delta Sharing eliminates the need to set up a new ingestion process to consume data. Data recipients can directly access the fresh data and query it using tools of their choice. Recipients can also enrich data with data sets from popular data providers. Sharing live data without copying it Share live ready-to-query data, without replicating or moving it to another system. Most enterprise data today is stored in cloud data lakes. Any of the existing data sets on the provider’s data lake can easily be shared across clouds, regions or data platforms without any data replication or physical movement of data. Data providers can update their data sets reliably in real time and provide a fresh and consistent view of their data to recipients. Centralized administration and governance You can centrally govern, track and audit access to the shared data from a single point of enforcement to meet compliance requirements. Detailed user-access audit logs are kept to know who is accessing the data and monitor usage of the shared data down to table, partition and version level. ----- An open Marketplace for data solutions The demand for third-party data to make data-driven innovations is greater than ever, and data marketplaces act as a bridge between data providers and data consumers to help facilitate the discovery and distribution of data sets. Databricks Marketplace provides an open marketplace for exchanging data products such as data sets, notebooks, dashboards and machine learning models. 
To accelerate insights, data consumers can discover, evaluate and access more data products from third-party vendors than ever before. Providers can now commercialize new offerings and shorten sales cycles by providing value-added services on top of their data. Databricks Marketplace is powered by Delta Sharing, allowing consumers to access data products without having to be on the Databricks platform. This open approach allows data providers to broaden their addressable market without forcing consumers into vendor lock-in. _Databricks Marketplace_ Privacy-safe data cleanrooms Powered by open source Delta Sharing, the Databricks Lakehouse Platform provides a flexible data cleanroom solution allowing businesses to easily collaborate with their customers and partners on any cloud in a privacy-safe way. Participants in the data cleanrooms can share and join their existing data, and run complex workloads in any language — Python, R, SQL, Java and Scala — on the data while maintaining data privacy. Additionally, data cleanroom participants don’t have to do cost-intensive data replication across clouds or regions with other participants, which simplifies data operations and reduces cost. _Data cleanrooms with Databricks Lakehouse Platform_ ----- **How it works** Delta Sharing is designed to be simple, scalable, non-proprietary and cost-effective for organizations that are serious about getting more from their data. Delta Sharing is natively integrated with Unity Catalog, which allows customers to add fine-grained governance and security controls, making it easy and safe to share data internally or externally. Delta Sharing is a simple REST protocol that securely shares access to part of a cloud data set. It leverages modern cloud storage systems — such as AWS S3, Azure ADLS or Google’s GCS — to reliably transfer large data sets. Here’s how it works for data providers and data recipients. 
_Figure labels: data provider; data recipient (data science, on-premises and many more)_

The data provider shares existing tables or parts thereof (such as specific table versions or partitions) stored on the cloud data lake in Delta Lake format. The provider decides what data they want to share and runs a sharing server in front of it that implements the Delta Sharing protocol and manages access for recipients. To manage shares and recipients, you can use SQL commands, the Unity Catalog CLI or the user interface.

The data recipient only needs one of the many Delta Sharing clients that support the protocol. Databricks has released open source connectors for pandas, Apache Spark, Java and Python, and is working with partners on many more.

-----

The Delta Sharing data exchange follows three efficient steps:

1. The recipient’s client authenticates to the sharing server and asks to query a specific table. The client can also provide filters on the data (for example, “country=US”) as a hint to read just a subset of the data.

2. The server verifies whether the client is allowed to access the data, logs the request, and then determines which data to send back. This will be a subset of the data objects in cloud storage systems that make up the table.

3. To transfer the data, the server generates short-lived presigned URLs that allow the client to read these Parquet files directly from the cloud provider, so that the transfer can happen in parallel at massive bandwidth, without streaming through the sharing server.
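
The three steps above can be sketched as a toy exchange in plain Python. This is an illustration under stated assumptions, not the Delta Sharing protocol or any client library: the token table, the file catalog, the URL format and the `query_table` function are all hypothetical, and the "presigned URL" here is only a string carrying an expiry timestamp. It shows the shape of the flow: authenticate, authorize and filter, then hand back short-lived links to the underlying files.

```python
import time

# Hypothetical server state: bearer token -> tables that recipient may read.
GRANTS = {"token-abc": {"sales.orders"}}

# Hypothetical file listing per table, with each file's partition value.
FILES = {
    "sales.orders": [
        {"path": "orders/country=US/part-0.parquet", "country": "US"},
        {"path": "orders/country=DE/part-1.parquet", "country": "DE"},
    ]
}

ACCESS_LOG = []


def query_table(token, table, filters=None, ttl_seconds=300, now=None):
    """Return short-lived URLs for the files of `table` that match `filters`."""
    now = time.time() if now is None else now
    # Steps 1 and 2: authenticate the client, check access, log the request.
    if table not in GRANTS.get(token, set()):
        raise PermissionError(f"no access to {table}")
    ACCESS_LOG.append((token, table, filters))
    # Step 2: use the filter hint to select a subset of the table's files.
    filters = filters or {}
    selected = [
        f for f in FILES[table]
        if all(f.get(key) == value for key, value in filters.items())
    ]
    # Step 3: return expiring links; the client then reads the files directly.
    expires = int(now + ttl_seconds)
    return [
        f"https://storage.example/{f['path']}?expires={expires}" for f in selected
    ]


urls = query_table("token-abc", "sales.orders", {"country": "US"}, now=1_000)
print(urls)
```

Note that the server in this model never touches the file contents; it only decides which files to expose and for how long, which is why the transfer does not have to stream through it.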
**Learn more** [Try Delta Sharing](https://databricks.com/product/delta-sharing) [Delta Sharing Demo](https://youtu.be/wRT1Vpbyy88) [Introducing Delta Sharing: An Open Protocol for Secure Data Sharing](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html) [Introducing Data Cleanrooms for the Lakehouse](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html) [Introducing Databricks Marketplace](https://www.databricks.com/blog/2022/06/28/introducing-data-cleanrooms-for-the-lakehouse.html) [Delta Sharing ODSC Webinar](https://www.youtube.com/watch?v=YrNHtaWlkM8) ----- **CHAPTER** # 05 ### Security Organizations that operate in multicloud environments need a unified, reliable and consistent approach to secure data. We’ve learned from our customers that a simple and unified approach to data security for the lakehouse is one of the most critical requirements for modern data solutions. Databricks is trusted by the world’s largest organizations to provide a powerful lakehouse platform with high security and scalability. In fact, thousands of customers trust Databricks with their most sensitive data to analyze and build data products using machine learning (ML). With significant investment in building a highly secure and scalable platform, Databricks delivers end-to-end platform security for data and users. ----- #### Platform architecture reduces risk The Databricks Lakehouse architecture is split into two separate planes to simplify your permissions, avoid data duplication and reduce risk. The control plane is the management plane where Databricks runs the workspace application and manages notebooks, configuration and clusters. Unless you choose to use [serverless compute](https://docs.databricks.com/serverless-compute/index.html) , the data plane runs inside your cloud service provider account, processing your data without taking it out of your account. 
You can embed Databricks in your data exfiltration protection architecture using features like customer-managed VPCs/VNets and admin console options that disable export. While certain data, such as your notebooks, configurations, logs, and user information, is present within the control plane, that information is encrypted at rest, and communication to and from the control plane is encrypted in transit.

*[Figure: platform architecture. Labels: Users, Interactive users, Control plane (Web application, Configurations, Notebooks, repos, DBSQL, Cluster manager), Cluster, Data, DBFS root]*

You also have choices for where certain data lives: You can host your own store of metadata about your data tables (Hive metastore), store query results in your cloud service provider account, and decide whether to use the [Databricks Secrets API](https://docs.databricks.com/dev-tools/api/latest/secrets.html).

-----

#### Step-by-step example

*[Figure: the platform architecture diagram annotated with step numbers]*

-----

Suppose you have a data engineer who signs in to Databricks and writes a notebook that transforms raw data in Kafka to a normalized data set sent to storage such as Amazon S3 or Azure Data Lake Storage. Six steps make that happen:

1. The data engineer seamlessly authenticates, via your single sign-on if desired, to the Databricks web UI in the control plane, hosted in the Databricks account.
2. As the data engineer writes code, their web browser sends it to the control plane. JDBC/ODBC requests also follow the same path, authenticating with a token.
3. When ready, the control plane uses Cloud Service Provider APIs to create a Databricks cluster, made of new instances in the data plane, in your CSP account.
Administrators can apply cluster policies to enforce security profiles. 4. Once the instances launch, the cluster manager sends the data engineer’s code to the cluster. 5. The cluster pulls from Kafka in your account, transforms the data in your account and writes it to a storage in your account. 6. The cluster reports status and any outputs back to the cluster manager. The data engineer does not need to worry about many of the details — simply write the code and Databricks runs it. #### Network and server security Here is how Databricks interacts with your cloud service provider account to manage network and server security **Networking** Regardless of where you choose to host the data plane, Databricks networking is straightforward. If you host it yourself, Databricks by default will still configure networking for you, but you can also control data plane networking with your own managed VPC or VNet. The serverless data plane network infrastructure is managed by Databricks in a Databricks cloud service provider account and shared among customers, with additional network boundaries between workspaces and between clusters. Databricks does not rewrite or change your data structure in your storage, nor does it change or modify any of your security and governance policies. Local firewalls complement security groups and subnet firewall policies to block unexpected inbound connections. Customers at the enterprise tier can also use the IP access list feature on the control plane to limit which IP addresses can connect to the web UI or REST API — for example, to allow only VPN or office IPs. ----- **Servers** In the data plane, Databricks clusters automatically run the latest hardened system image. Users cannot choose older (less secure) images or code. For AWS and Azure deployments, images are typically updated every two-to-four weeks. GCP is responsible for its system image. 
Databricks runs scans for every release, including:

**•** System image scanning for vulnerabilities
**•** Container OS and library scanning
**•** Static and dynamic code scanning

|Severity|Remediation time|
|---|---|
|Critical|< 14 days|
|High|< 30 days|
|Medium|< 60 days|
|Low|When appropriate|

Databricks code is peer reviewed by developers who have security training. Significant design documents go through comprehensive security reviews. Scans run fully authenticated, with all checks enabled, and issues are tracked against the timeline shown in this table.

Note that Databricks clusters are typically short-lived (often terminated after a job completes) and do not persist data after they terminate. Clusters typically share the same permission level (excluding high concurrency or Databricks SQL clusters, where more robust security controls are in place). Your code is launched in an unprivileged container to maintain system stability. This security design provides protection against persistent attackers and privilege escalation.

**Databricks access**

Databricks access to your environment is limited to cloud service provider APIs for our automation and support access. Automated access allows the Databricks control plane to configure resources in your environment using the cloud service provider APIs. The specific APIs vary based on the cloud. For instance, an AWS cross-account IAM role, or Azure-owned automation or GKE automation do not grant access to your data sets (see the next section).

Databricks has a custom-built system that allows staff to fix issues or handle support requests, for example when you open a support request and check the box authorizing access to your workspace. Access requires either a support ticket or engineering ticket tied expressly to your workspace and is limited to a subset of employees and for limited time periods.
Additionally, if you have configured audit log delivery, the audit logs show the initial access event and the staff’s actions. ----- **Identity and access** Databricks supports robust ACLs and SCIM. AWS customers can configure SAML 2.0 and block non-SSO logins. Azure Databricks and Databricks on GCP automatically integrate with Azure Active Directory or GCP identity. Databricks supports a variety of ways to enable users to access their data. **Examples include:** **•** The Table ACLs feature uses traditional SQL-based statements to manage access to data and enable fine-grained view-based access **•** IAM instance profiles enable AWS clusters to assume an IAM role, so users of that cluster automatically access allowed resources without explicit credentials **•** External storage can be mounted or accessed using a securely stored access key **•** The Secrets API separates credentials from code when accessing external resources **Data security** Databricks provides encryption, isolation and auditing. 
**Databricks encryption capabilities are in place both at rest and in motion**

|For data-at-rest encryption|For data-in-motion encryption|
|---|---|
|Control plane is encrypted|Control plane <-> data plane is encrypted|
|Data plane supports local encryption|Offers optional intra-cluster encryption|
|Customers can use encrypted storage buckets|Customer code can be written to avoid unencrypted services (e.g., FTP)|
|Customers at some tiers can configure customer-managed keys for managed services||

**Customers can isolate users at multiple levels:**

**•** **Workspace level:** Each team or department can use a separate workspace
**•** **Cluster level:** Cluster ACLs can restrict the users who can attach notebooks to a given cluster
**•** **High concurrency clusters:** Process isolation, JVM whitelisting and limited languages (SQL, Python) allow for the safe coexistence of users of different privilege levels, and are used with Table ACLs
**•** **Single-user cluster:** Users can create a private dedicated cluster

Activities of Databricks users are logged and can be delivered automatically to a cloud storage bucket. Customers can also monitor provisioning activities by monitoring cloud audit logs.

-----

**Compliance**

**Databricks supports the following compliance standards on our multi-tenant platform:**

**•** **SOC 2 Type II**
**•** **ISO 27001**
**•** **ISO 27017**
**•** **ISO 27018**

Certain clouds support Databricks deployment options for FedRAMP High, HITRUST, HIPAA and PCI. Databricks Inc. and the Databricks platform are also GDPR and CCPA ready.

**Learn more**

To learn more about Databricks security, visit the [Security and Trust Center](https://databricks.com/trust)

-----

**CHAPTER**

# 06

### Instant compute and serverless

-----

#### Benefits of Databricks Serverless SQL

Serverless SQL is much easier to administer with Databricks taking on the responsibility of deploying, configuring and managing your cluster VMs.
Databricks can transfer compute capacity to user queries typically in about 15 seconds — so you no longer need to wait for clusters to start up or scale out to run your queries. Serverless SQL also has built-in connectors to your favorite tools such as Tableau, Power BI, Qlik, etc. These connectors use optimized JDBC/ODBC drivers for easy authentication support and high performance. And finally, you save on cost because you do not need to overprovision or pay for the idle capacity. #### What is serverless compute? Serverless compute is a fully managed service where Databricks provisions and manages the compute layer on behalf of the customer in the Databricks cloud account instead of the customer account. As of the current release, serverless compute is supported for use with Databricks SQL. This new capability for Databricks SQL provides instant compute to users for their BI and SQL workloads, with minimal management required and capacity optimizations that can lower overall cost by 20%-40% on average. This makes it even easier for organizations to expand adoption of the lakehouse for business analysts who are looking to access the rich, real-time data sets of the lakehouse with a simple and performant solution. ----- **Inside Serverless SQL** **Databricks Serverless SQL** **Managed servers** **Serverless SQL** **compute** **Secure** **Instant compute** At the core of Serverless SQL is a compute platform that operates a pool of servers located in a Databricks’ account, running Kubernetes containers that can be assigned to a user within seconds. When many users are running reports or queries at the same time, the compute platform adds more servers to the cluster (again, within seconds) to handle the concurrent load. Databricks manages the entire configuration of the server and automatically performs the patching and upgrades as needed. 
Each server is running a secure configuration and all processing is secured by three layers of isolation: the Kubernetes container hosting the runtime, the virtual machine (VM) hosting the container, and the virtual network for the workspace. Each layer is isolated to one workspace with no sharing or cross-network traffic allowed. The containers use hardened configurations, VMs are shut down and not reused, and network traffic is restricted to nodes in the same cluster.

-----

#### Performance of Serverless SQL

We ran a set of internal tests to compare Databricks Serverless SQL to the current Databricks SQL and several traditional cloud data warehouses. We found Serverless SQL to be the most cost-efficient and performant environment to run SQL workloads when considering cluster startup time, query execution time and overall cost.

*[Figure: "Databricks Serverless SQL is the highest performing and most cost-effective solution." Cloud SQL solutions compared (Serverless SQL, CDW1, CDW2, CDW3, CDW4) by query execution time (slower to faster), startup time (slower, ~5min, to faster, ~2-3sec) and cost estimate (high, medium, low)]*

**Learn more**

The feature is currently in Public Preview. Sign up to [request access to Serverless SQL](https://databricks.com/p/ebook/serverless-sql-preview-sign-up). To learn more about Serverless SQL, visit our [documentation page](https://docs.databricks.com/serverless-compute/index.html).

-----

**CHAPTER**

# 07

### Data warehousing

Data warehouses are not keeping up with today’s world. The explosion of languages other than SQL and unstructured data, machine learning, IoT and streaming analytics are forcing organizations to adopt a bifurcated architecture of disjointed systems: Data warehouses for BI and data lakes for ML. While SQL is ubiquitous and known by millions of professionals, it has never been treated as a first-class citizen on data lakes, until the lakehouse.
----- #### What is data warehousing The Databricks Lakehouse Platform provides a simplified multicloud and serverless architecture for your data warehousing workloads. Data warehousing on the lakehouse allows SQL analytics and BI at scale with a common governance model. Now you can ingest, transform and query all your data in-place — using your SQL and BI tools of choice — to deliver real-time business insights at the best price/performance. Built on open standards and APIs, the lakehouse provides the reliability, quality and performance that data lakes natively lack, and integrations with the ecosystem for maximum flexibility — no lock-in. With data warehousing on the lakehouse, organizations can unify all analytics and simplify their architecture to enable their business with real-time business insights at the best price/performance. #### Key benefits **Best price/performance** Lower costs, get the best price/performance and eliminate resource management overhead On-premises data warehouses have reached their limits — they physically cannot scale to handle the growing volumes of data, and don’t provide the elasticity customers need to respond to ever-changing business needs. Cloud data warehouses are a great alternative to on-premises data warehouses, providing greater scale and elasticity, but cloud costs for proprietary cloud data warehouses typically yield to an exponential cost increase following the growth of data volume. The Databricks Lakehouse Platform provides instant, elastic SQL serverless compute — decoupled from storage on cheap cloud object stores — and thousands of performance optimizations that can lower overall infrastructure costs by [an average of 40%](https://databricks.com/blog/2021/08/30/announcing-databricks-serverless-sql.html) . 
Databricks automatically determines instance types and configuration for the best price/performance — [up to 12x better](https://databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html) [than traditional cloud data warehouses](https://databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html) — and scale for high concurrency use cases. ----- **Built-in governance** One source of truth and one unified governance layer across all data teams Underpinned by Delta Lake, the Databricks Lakehouse Platform simplifies your architecture by allowing you to establish one single copy of all your data for in-place analytics and ETL/ELT on your existing data lakes — no more data movements and copies in disjointed systems. Then, seamless integration with Databricks Unity Catalog lets you easily discover, secure and manage all your data with fine-grained governance, data lineage, and standard SQL. **Rich ecosystem** Ingest, transform and query all your data in-place with your favorite tools Very few tools exist to conduct BI on data lakes. Generally, doing so has required data analysts to submit Spark jobs or use a developer interface. While these tools are common for data scientists, they require knowledge of languages and interfaces that are not traditionally part of a data analyst’s tool set. As a result, the learning curve for an analyst to make use of a data lake is too high when well-established tools and methods already exist for data warehouses. The Databricks Lakehouse Platform works with your preferred tools like dbt, Fivetran, Power BI or Tableau, allowing analysts and analytical engineers to easily ingest, transform and query the most recent and complete data, without having to move it into a separate data warehouse. 
Additionally, it empowers every analyst across your organization to quickly and collaboratively find and share new insights with a built-in SQL editor, visualizations and dashboards.

**Break down silos**

Accelerate time from raw to actionable data and go effortlessly from BI to ML

It is challenging for data engineering teams to enable analysts at the speed that the business requires. Data warehouses need data to be ingested and processed ahead of time before analysts can access and query it using BI tools. Because traditional data warehouses lack real-time processing and do not scale well for large ETL jobs, they create new data movements and bottlenecks for the data engineering team, and make it slow for analysts to access the latest data. And for advanced analytics (ML) applications, organizations will need to manage an entirely different system than their SQL-only data warehouse, slowing down collaboration and innovation.

The Databricks Lakehouse Platform provides the most complete end-to-end data warehousing solution for all your modern analytics needs, and more. Now you can empower data teams and business users to access the latest data faster for downstream real-time analytics and go effortlessly from BI to ML. Speed up the time from raw to actionable data at any scale, in batch and streaming. And go from descriptive to advanced analytics effortlessly to uncover new insights.

-----

*[Figure: Data warehousing on Databricks. Labels: Truly decoupled, serverless compute layer; Data consumers; Data processing (ETL from Bronze raw to Silver staging to Gold DW/marts); Unity Catalog; Open storage layer; Data ingest (Databricks Partner Connect, continuous ingest, batch ingest); Data sources (on-premises OLTP, OLAP, Hadoop, third-party data, IoT devices, SaaS applications, social, DWH)]*

**Learn more**

[Try Databricks SQL for free](https://dbricks.co/dbsql)

[Databricks SQL Demo](https://databricks.com/discover/demos/databricks-sql)

[Databricks SQL Data Warehousing Admin Demo](https://youtu.be/jlEdoVpWwNc)

[On-demand Webinar: Learn Databricks SQL From the Experts](https://databricks.com/p/webinar/learn-databricks-sql-from-the-experts)

[eBook: Inner Workings of the Lakehouse for Analytics and BI](https://databricks.com/p/ebook/data-lakehouse-is-your-next-data-warehouse)

-----

**CHAPTER**

# 08

### Data engineering

Organizations realize the value data plays as a strategic asset for growing revenues, improving the customer experience, operating efficiently or improving a product or service. Data is really the driver of all these initiatives. Nowadays, data is often streamed and ingested from hundreds of different data sources, sometimes acquired from a data exchange, cleaned in various ways with different orchestrated steps, versioned and shared for analytics and AI. And increasingly, data is being monetized.
Data teams rely on getting the right data at the right time for analytics, data science and machine learning, but often are faced with challenges meeting the needs of their initiatives for data engineering. ----- #### Why data engineering is hard One of the biggest challenges is accessing and managing the increasingly complex data that lives across the organization. Most of the complexity arises with the explosion of data volumes and data types, with organizations amassing an estimated [80% of data that is unstructured and semi-structured.](https://www.forbes.com/sites/forbestechcouncil/2019/01/29/the-80-blind-spot-are-you-ignoring-unstructured-organizational-data/?sh=681651dc211c) With this volume, managing data pipelines to transform and process data is slow and difficult, and increasingly expensive. And to top off the complexity, most businesses are putting an increased emphasis on multicloud environments which can be even more difficult to maintain. [Zhamak Dehghani](https://databricks.com/speaker/zhamak-dehghani) , a principal technology consultant at Thoughtworks, wrote that data itself has become a product, and the challenging goal of the data engineer is to build and run the machinery that creates this high-fidelity data product all the way from ingestion to monetization. Despite current technological advances data engineering remains difficult for several reasons: **Complex data ingestion methods** Data ingestion means retrieving batch and streaming data from various sources and in various formats. Ingesting data is hard and complex since you either need to use an always-running streaming platform like Apache Kafka or you need to be able to keep track of which files haven’t been ingested yet. Data engineers are required to spend a lot of time hand-coding repetitive and error-prone data ingestion tasks. **Data engineering principles** These days, large operations teams are often just a memory of the past. 
Modern data engineering principles are based on agile software development methodologies. They apply the well-known “you build it, you run it” paradigm, use isolated development and production environments, CI/CD, and version control transformations that are pushed to production after validation. Tooling needs to support these principles. ----- **Third-party tools** Data engineers are often required to run additional third-party tools for orchestration to automate tasks such as ELT/ETL or customer code in notebooks. Running third-party tools increases the operational overhead and decreases the reliability of the system. **Performance tuning** Finally, with all pipelines and workflows written, data engineers need to constantly focus on performance, tuning pipelines and architectures to meet SLAs. Tuning such architectures requires in-depth knowledge of the underlying architecture and constantly observing throughput parameters. Most organizations are dealing with a complex landscape of data warehouses and data lakes these days. Each of those platforms has its own limitations, workloads, development languages and governance model. With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting, transforming, processing, scheduling and delivering data. The lakehouse platform automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake so data engineers can focus on quality and reliability to drive valuable insights. Data engineering in the lakehouse allows data teams to unify batch and streaming operations on a simplified architecture, streamline data pipeline development and testing, build reliable data, analytics and AI workflows on any cloud platform, and meet regulatory requirements to maintain world-class governance. 
The lakehouse provides an end-to-end data engineering and ETL platform that automates the complexity of building and maintaining pipelines and running ETL workloads so data engineers and analysts can focus on quality and reliability to drive valuable insights. #### Databricks makes modern data engineering simple There is no industry-wide definition of modern data engineering. This should come close: _A_ **_unified data platform_** _with_ **_managed data ingestion_** _, schema detection,_ _enforcement, and evolution, paired with_ **_declarative, auto-scaling data_** **_flow_** _integrated with a lakehouse_ **_native orchestrator_** _that supports all_ _kinds of workflows._ ----- ----- #### Benefits of data engineering on the lakehouse By simplifying and modernizing with the lakehouse architecture, data engineers gain an enterprise-grade and enterprise-ready approach to building data pipelines. The following are eight key differentiating capabilities that a data engineering solution team can enable with the Databricks Lakehouse Platform: **•** **Easy data ingestion:** With the ability to ingest petabytes of data, data engineers can enable fast, reliable, scalable and automatic data ingestion for analytics, data science or machine learning. **•** **Data pipeline observability:** Monitor overall data pipeline estate status from a dataflow graph dashboard and visually track end-to-end pipeline health for performance, quality, status and latency. **•** **Simplified operations:** Ensure reliable and predictable delivery of data for analytics and machine learning use cases by enabling easy and automatic data pipeline deployments into production or roll back pipelines and minimize downtime. **•** **Scheduling and orchestration:** Simple, clear and reliable orchestration of data processing tasks for data and machine learning pipelines with the ability to run multiple non-interactive tasks as a directed acyclic graph (DAG) on a Databricks compute cluster. 
**•** **Automated ETL pipelines:** Data engineers can reduce development time and effort and focus on implementing business logic and data quality checks within the data pipeline using SQL or Python.
**•** **Data quality checks:** Improve data reliability throughout the data lakehouse so data teams can confidently trust the information for downstream initiatives with the ability to define data quality and automatically address errors.
**•** **Batch and streaming:** Allow data engineers to set tunable data latency with cost controls without having to know complex stream processing and implement recovery logic.
**•** **Automatic recovery:** Handle transient errors and use automatic recovery for most common error conditions that can occur during the operation of a pipeline with fast, scalable fault-tolerance.

-----

**Data engineering is all about data quality**

The goal of modern data engineering is to distill data with a quality that is fit for downstream analytics and AI. Within the Lakehouse, data quality is achieved on three different levels.

1. On a **technical level**, data quality is guaranteed by enforcing and evolving schemas for data storage and ingestion.

   *[Figure labels: Kinesis; CSV, JSON, TXT...; Data Lake]*

2. On an **architectural level**, data quality is often achieved by implementing the medallion architecture. A medallion architecture is a data design pattern used to logically organize data in a [lakehouse](https://databricks.com/glossary/data-lakehouse) with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture, e.g., from Bronze to Silver to Gold layer tables.
3. The **Databricks Unity Catalog** comes with robust data quality management with built-in quality controls, testing, monitoring and enforcement to ensure accurate and useful data is available for downstream BI, analytics and machine learning workloads.
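To make the first two levels concrete, here is a minimal sketch of a Bronze to Silver to Gold flow in plain Python. It only illustrates the pattern and is not how Delta Lake or Databricks implements it: the sample records, the `EXPECTED_SCHEMA` mapping and the helper functions are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical schema that the Silver layer enforces.
EXPECTED_SCHEMA = {"order_id": int, "country": str, "amount": float}

# Bronze: raw records exactly as they were ingested, including bad ones.
bronze = [
    {"order_id": 1, "country": "us", "amount": 20.0},
    {"order_id": 2, "country": "DE", "amount": 5.5},
    {"order_id": 3, "country": "US", "amount": "n/a"},  # wrong type
    {"order_id": 4, "country": "US"},                   # missing column
    {"order_id": 5, "country": "US", "amount": 10.0},
]


def conforms(record):
    """Schema enforcement: exact column set and expected types."""
    return set(record) == set(EXPECTED_SCHEMA) and all(
        isinstance(record[column], expected)
        for column, expected in EXPECTED_SCHEMA.items()
    )


def to_silver(rows):
    """Silver: keep conforming rows and normalize the country code."""
    return [{**row, "country": row["country"].upper()} for row in rows if conforms(row)]


def to_gold(rows):
    """Gold: business-level aggregate, here total amount per country."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Each layer reads only from the one before it, so the raw history stays available in Bronze while quality improves step by step toward Gold.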
*[Figure: medallion architecture with Bronze (raw ingestion and history), Silver (filtered, cleaned, augmented) and Gold (business-level aggregates) layers of increasing quality. Other labels: Streaming analytics, BI and reporting, Data science and ML]*

-----

#### Data ingestion

With the Databricks Lakehouse Platform, data engineers can build robust hyper-scale ingestion pipelines in streaming and batch mode. They can incrementally process new files as they land on cloud storage, with no need to manage state information, in scheduled or continuous jobs.

Data engineers can efficiently track new files (with the ability to scale to billions of files) without having to list them in a directory. Databricks automatically infers the schema from the source data and evolves it as the data loads into the Delta Lake lakehouse. Efforts continue with enhancing and supporting Auto Loader, our powerful data ingestion tool for the Lakehouse.

**What is Auto Loader?**

Have you ever imagined that ingesting data could become as easy as dropping a file into a folder? Welcome to Databricks Auto Loader.

[Auto Loader](https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html) is an optimized data ingestion tool that incrementally and efficiently processes new data files as they arrive in the cloud storage built into the Databricks Lakehouse. Auto Loader can detect and enforce the schema of your data and, therefore, guarantee data quality. New files or files that have been changed since the last time new data was processed are identified automatically and ingested. Noncompliant data sets are quarantined into rescue data columns. You can use the [trigger once] option with Auto Loader to turn it into a job that turns itself off.

**Ingestion for data analysts: COPY INTO**

Ingestion also got much easier for data analysts and analytics engineers working with Databricks SQL.
[COPY INTO](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html) is a simple SQL command that follows the lake-first approach and loads data from a folder location into a Delta Lake table. COPY INTO can be scheduled and called by a job repeatedly. When run, only new files from the source location will be processed. #### Data transformation Turning SQL queries into production ETL pipelines typically involves a lot of tedious, complicated operational work. Even at a small scale, the majority of a data practitioner’s time is spent on tooling and managing infrastructure. Although the medallion architecture is an established and reliable pattern for improving data quality, the implementation of this pattern is challenging for many data engineering teams. While hand-coding the medallion architecture was hard for data engineers, creating data pipelines was outright impossible for data analysts not being able to code with Spark Structured Streaming in Scala or Python. Even at a small scale, most data engineering time is spent on tooling and managing infrastructure rather than transformation. Auto-scaling, observability and governance are difficult to implement and, as a result, often left out of the solution entirely. ----- #### What is Delta Live Tables? Delta Live Tables (DLT) is the first ETL framework that uses a simple **declarative approach** to building reliable data pipelines. DLT automatically auto-scales your infrastructure so data analysts and engineers can spend less time on tooling and focus on getting value from data. Engineers are able to **treat their data as code** and apply modern software engineering best practices like testing, error-handling, monitoring and documentation to deploy reliable pipelines at scale. DLT fully supports both Python and SQL and is tailored to work with both streaming and batch workloads. 
With DLT you write a Delta Live Table in a SQL notebook, create a pipeline under Workflows and simply click [Start].

*[Figure: three steps: write `create live table`, create a pipeline, click Start]*

-----

DLT reduces the implementation time by accelerating development and automating complex operational tasks. Since DLT can use plain SQL, it also enables data analysts to create production pipelines and turns them into the often discussed “analytics engineer.” At runtime, DLT speeds up pipeline execution with Photon.

Software engineering principles are applied for data engineering to foster the idea of treating your data as code. Your data is the sole source of truth for what is going on inside your business. Beyond just the transformations, there are many things that should be included in the code that defines your data.

*[Figure labels: Dependency management, Full refresh, Incremental computation (coming soon), Checkpointing and retries]*

Declaratively express entire data flows in SQL or Python. Natively enable modern software engineering best practices like separate development and production environments, the ability to easily test before deploying, deploy and manage environments using parameterization, unit testing and documentation.

DLT also automatically scales compute, providing the option to set the minimum and maximum number of instances and let DLT size up the cluster according to cluster utilization. In addition, tasks like orchestration, error handling and recovery, and performance optimization are all handled automatically.

-----

Expectations in the code help prevent bad data from flowing into tables, track data quality over time, and provide tools to troubleshoot bad data with granular pipeline observability. This enables a high-fidelity lineage diagram of your pipeline to track dependencies and aggregate data quality metrics across all your pipelines.
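The idea behind expectations can be illustrated with a small plain-Python sketch. It deliberately does not use the DLT API: the `Expectation` class, the `apply_expectations` helper and the three violation actions ("warn", "drop", "fail") are hypothetical names chosen for this example, assuming that a rule can either record a violation, drop the row or stop the run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Expectation:
    """A named data quality rule (hypothetical helper, not the DLT API)."""
    name: str
    predicate: Callable[[dict], bool]
    on_violation: str = "warn"  # "warn", "drop" or "fail"


def apply_expectations(rows, expectations):
    """Return the rows that pass, plus a violation count per expectation."""
    metrics = {expectation.name: 0 for expectation in expectations}
    kept = []
    for row in rows:
        keep = True
        for expectation in expectations:
            if expectation.predicate(row):
                continue
            metrics[expectation.name] += 1
            if expectation.on_violation == "fail":
                raise ValueError(f"expectation {expectation.name!r} failed for {row}")
            if expectation.on_violation == "drop":
                keep = False
        if keep:
            kept.append(row)
    return kept, metrics


rows = [
    {"id": 1, "amount": 10.0},
    {"id": None, "amount": 5.0},
    {"id": 3, "amount": -2.0},
]
expectations = [
    Expectation("valid_id", lambda row: row["id"] is not None, "drop"),
    Expectation("non_negative_amount", lambda row: row["amount"] >= 0, "warn"),
]
clean, metrics = apply_expectations(rows, expectations)
```

Keeping the violation counts next to the cleaned rows is what makes it possible to track data quality over time rather than only filtering bad records.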
Unlike other products that force you to deal with streaming and batch workloads separately, DLT supports any type of data workload with a single API so data engineers and analysts alike can build cloud-scale data pipelines faster without the need for advanced data engineering skills. #### Data orchestration The lakehouse makes it much easier for businesses to undertake ambitious data and machine learning (ML) initiatives. However, orchestrating and managing end-to-end production workflows remains a bottleneck for most organizations, relying on external tools or cloud-specific solutions that are not part of their lakehouse platform. Tools that decouple task orchestration from the underlying data processing platform reduce the overall reliability of their production workloads, limit observability, and increase complexity for end users. #### What is Databricks Workflows? [Databricks Workflows](https://databricks.com/product/workflows) is the first fully managed and integrated lakehouse [orchestration](https://databricks.com/glossary/orchestration) service that allows data teams to build reliable workflows on any cloud. Workflows lets you orchestrate data flow pipelines (written in DLT or dbt), as well as machine learning pipelines, or any other tasks such as notebooks or Python wheels. Since Databricks Workflows is fully managed, it eliminates operational overhead for data engineers, enabling them to focus on your workflows not on managing your infrastructure. It provides an easy point-and-click authoring experience for all your data teams, not just those with specialized skills. Deep integration with the underlying lakehouse platform ensures you will create and run reliable production workloads on any cloud while providing deep and centralized monitoring with simplicity for end users. Sharing job clusters over multiple tasks reduces the time a job takes, reduces costs by eliminating overhead and increases cluster utilization with parallel tasks. 
----- Databricks Workflows’ deep integration with the lakehouse can best be seen with its monitoring and observability features. The matrix view in the following graphic shows a history of runs for a job. Failed tasks are marked in red. A failed job can be repaired and rerun with the click of a button. Rerunning a failed task detects and triggers the execution of all dependent tasks. You can create workflows with the UI, but also through the Databricks Workflows API, or with external orchestrators such as Apache Airflow. Even if you are using an external orchestrator, Databricks Workflows’ monitoring acts as a single pane of glass that includes externally triggered workflows. ----- #### Orchestrate anything Remember that DLT is one of many task types for Databricks Workflows. This is where the managed data flow pipelines with DLT tie together with the easy point-and-click authoring experience of Databricks Workflows. In the following example, you can see an end-to-end workflow built with customers in a workshop: Data is streamed from Twitter according to search terms, then ingested with Auto Loader using automatic schema detection and enforcement. In the next step, the data is cleaned and transformed with Delta Live table pipelines written in SQL, and finally run through a pre-trained BERT language model from Hugging Face for sentiment analysis of the tweets. Different task types for ingest, cleanse/transform and ML are combined in a single workflow. Using Workflows, these tasks can be scheduled to provide a daily overview of social media coverage and customer sentiment for a business. After streaming tweets with filtering for keywords such as “data engineering,” “lakehouse” and “Delta Lake,” we curated a list of those tweets that were classified as positive with the highest probability score. 
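
The dependency handling described above (run tasks in order, and after a failure rerun the failed task together with everything that depends on it) can be sketched with the Python standard library. This is only a conceptual illustration, not the Databricks Workflows API: the task names and both helper functions are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the tasks it depends on.
dependencies = {
    'ingest_tweets': set(),
    'clean_and_transform': {'ingest_tweets'},
    'score_sentiment': {'clean_and_transform'},
    'daily_report': {'score_sentiment'},
}


def run_order(deps):
    # Order tasks so that every task runs after its dependencies.
    return list(TopologicalSorter(deps).static_order())


def tasks_to_rerun(deps, failed_task):
    # Collect the failed task plus everything downstream of it,
    # then return those tasks in a valid execution order.
    rerun = {failed_task}
    changed = True
    while changed:
        changed = False
        for task, upstream in deps.items():
            if task not in rerun and upstream & rerun:
                rerun.add(task)
                changed = True
    return [task for task in run_order(deps) if task in rerun]


print(run_order(dependencies))
print(tasks_to_rerun(dependencies, 'clean_and_transform'))
```

In this sketch a failure in the transform step leaves the ingest step untouched, which mirrors the idea of repairing a run without repeating work that already succeeded.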
**Learn more**
- [Data Engineering on the Lakehouse](https://databricks.com/solutions/data-pipelines)
- [Delta Live Tables](https://databricks.com/product/delta-live-tables)
- [Databricks Workflows](https://www.databricks.com/product/workflows)
- [Big Book of Data Engineering](https://databricks.com/p/ebook/the-big-book-of-data-engineering?itm_data=datapipelines-promo-bigbookofde)

-----

**CHAPTER 09**
### Data streaming

There are two types of data processing: batch processing and streaming processing. Batch processing refers to the discontinuous, periodic processing of data that has been stored for a period of time. For example, an organization may need to run weekly reports on a set of predictable transaction data. There is no need for this data to be streaming; it can be processed on a weekly basis. Streaming processing, on the other hand, refers to unbounded processing of data as it arrives.

-----

**Data Streaming Challenges**

However, getting value from streaming data can be a tricky practice. While most data today can be considered streaming data, organizations are overwhelmed by the need to access, process and analyze the volume, speed and variety of this data moving through their platforms. To keep pace with innovation, they must make sense of data streams quickly, decisively, consistently and in real time. Three common technical challenges organizations experience with implementing real-time data streaming include:

**•** **Specialized APIs and language skills:** Data practitioners encounter barriers to adopting streaming skillsets because there are new languages, APIs and tools to learn.
**•** **Operational complexity:** To implement data streaming at scale, data teams need to integrate and manage streaming-specific tools with their other cloud services.
They also have to manually build complex operational tooling to help these systems recover from failure, restart workloads without reprocessing data, optimize performance, scale the underlying infrastructure, and so on. **•** **Incompatible governance models:** Different governance and security models across real-time and historical data platforms makes it difficult to provide the right access to the right users, see the end-to-end data lineage, and/or meet compliance requirements. In a wide variety of cases, an organization might find it useful to leverage streaming data. Here are some common examples: **•** **Retail:** Real-time inventory updates help support business activities, such as inventory and pricing optimization and optimization of the supply chain, logistics and just-in-time delivery. **•** **Smart energy:** Smart meter monitoring in real time allows for smart electricity pricing models and connection with renewable energy sources to optimize power generation and distribution. **•** **Preventative maintenance:** By reducing unplanned outages and unnecessary site and maintenance visits, real-time streaming analytics can lower operational and equipment costs. **•** **Industrial automation:** Manufacturers can use streaming and predictive analytics to improve production processes and product quality, including setting up automated alerts. **•** **Healthcare:** To optimize care recommendations, real-time data allows for the integration of various smart sensors to monitor patient condition, medication levels and even recovery speed. **•** **Financial institutions:** Firms can conduct real-time analysis of transactions to detect fraudulent transactions and send alerts. They can use fraud analytics to identify patterns and feed data into machine learning algorithms. 
Regardless of specific use cases, the central tenet of streaming data is that it gives organizations the opportunity to leverage the freshest possible insights for better decision-making and more optimized customer experiences. ----- **Data streaming architecture** Before addressing these challenges head-on, it may help to take a step back and discuss the ingredients of a streaming data pipeline. Then, we will explain how the Databricks Lakehouse Platform operates within this context to address the aforementioned challenges. Every application of streaming data requires a pipeline that brings the data from its origin point — whether sensors, IoT devices or database transactions — to its final destination. In building this pipeline, streaming architectures typically employ two layers. First, streaming capture systems **capture** and temporarily store streaming data for processing. Sometimes these systems are also called messaging systems or messaging buses. These systems are optimized for small payloads and high frequency inputs/outputs. Second, streaming **processing** systems continuously process data from streaming capture systems and other storage systems. **Capturing** **Processing** It may help to think of a simplified streaming pipeline according to the following seven phases: 1. Data is continuously generated at origin points 2. The generated data is captured from those origin points by a capture system like Apache Kafka (with limited retention) **3. The captured data is extracted and incrementally ingested to** **a processing platform like Databricks; data is ingested exactly** **once and stored permanently, even if this step is rerun** **4. The ingested data is converted into a workable format** **5. The formatted data is cleansed, transformed and joined in** **a number of pipeline steps** **6. The transformed data is processed downstream through** **analysis or ML modeling** 7. 
The resulting analysis or model is used for some sort of practical application, which may be anything from basic reporting to an event-driven software application You will notice four of the steps in this list are in boldface. This is because the lakehouse architecture is specifically designed to optimize this part of the pipeline. Uniquely, the Databricks Lakehouse Platform can ingest, transform, analyze and model on streaming data _alongside_ batch-processed data. It can accommodate both structured _and_ unstructured data. It is here that the value of unifying the best pieces of data lakes and data warehouses really shines for complex enterprise use cases. ----- **Data Streaming on the Lakehouse** Now let’s zoom in a bit and see how the Databricks Lakehouse Platform addresses each part of the pipeline mentioned above. **Streaming data ingestion and transformation** begins with continuously and incrementally collecting raw data from streaming sources through a feature called Auto Loader. Once the data is ingested, it can be transformed from raw, messy data into clean, fresh, reliable data appropriate for downstream analytics, ML or applications. [Delta Live Tables (DLT)](https://www.databricks.com/product/delta-live-tables) makes it easy to build and manage these data pipelines while automatically taking care of infrastructure management and scaling, data quality, error testing and other administrative tasks. DLT is a high-level abstraction built on Spark Structured Streaming, a scalable and fault-tolerant stream processing engine. **[Real-time analytics](https://www.databricks.com/product/databricks-sql)** refers to the downstream analytical application of streaming data. With fresher data streaming into SQL analytics or BI reporting, more actionable insights can be achieved, resulting in better business outcomes. **[Real-time ML](https://www.databricks.com/product/machine-learning)** involves deploying ML models in a streaming mode. 
This deployment is supported with structured streaming for continuous inference from a live data stream. Like real-time analytics, real-time ML is a downstream impact of streaming data, but for different business use cases (i.e., AI instead of BI). Real-time modeling has many benefits, including more accurate predictions about the future. **Real-time applications** process data directly from streaming pipelines and trigger programmatic actions, such as displaying a relevant ad, updating the price on a pricing page, stopping a fraudulent transaction, etc. There typically is no human-in-the-loop for such applications. Data in cloud storage and message stores ----- **Databricks Lakehouse Platform differentiators** Understanding what the lakehouse architecture provides is one thing, but it is useful to understand how Databricks uniquely approaches the common challenges mentioned earlier around working with streaming data. **Databricks empowers unified data teams.** Data engineers, data scientists and analysts can easily build streaming data workloads with the languages and tools they already know and the APIs they already use. **Databricks simplifies development and operations.** Organizations can focus on getting value from data by reducing complexity and automating much of the production aspects associated with building and maintaining real-time data workloads. See why customers love streaming on the Databricks Lakehouse Platform with these resources. 
**Learn more**
- [Data Streaming Webpage](https://www.databricks.com/product/data-streaming)
- [Project Lightspeed: Faster and Simpler Stream Processing With Apache Spark](https://www.databricks.com/blog/2022/06/28/project-lightspeed-faster-and-simpler-stream-processing-with-apache-spark.html)
- [Structured Streaming Documentation](https://docs.databricks.com/spark/latest/structured-streaming/index.html)
- [Streaming: Getting Started With Apache Spark on Databricks](https://databricks.com/spark/getting-started-with-apache-spark/streaming)

**Databricks is one platform for streaming and batch data.** Organizations can eliminate data silos, centralize security and governance models, and provide complete support for all their real-time use cases under one roof: the roof of the lakehouse.

Finally, and perhaps most important, Delta Lake, the core of the [Databricks Lakehouse Platform](https://www.databricks.com/product/data-lakehouse), was built for streaming from the ground up. Delta Lake is deeply integrated with Spark Structured Streaming and overcomes many of the limitations typically associated with streaming systems and files.

In summary, the Databricks Lakehouse Platform dramatically simplifies data streaming to deliver real-time analytics, machine learning and applications on one platform. And that platform is built on a foundation with streaming at its core. This means organizations of all sizes can use their data in motion and make more informed decisions faster than ever.

-----

**CHAPTER 10**
### Data science and machine learning

While most companies are aware of the potential benefits of applying machine learning and AI, realizing these potentials can often be quite challenging for those brave enough to take the leap.
Some of the largest hurdles come from siloed/disparate data systems, complex experimentation environments, and getting models served in a production setting. Fortunately, the Databricks Lakehouse Platform provides a helping hand and lets you use data to derive innovative insights, build powerful predictive models, and enable data scientists, ML engineers, and developers of all kinds to create within the space of machine learning and AI. ----- #### Databricks Machine Learning ----- #### Exploratory data analysis With all the data in one place, data is easily explored and visualized from within the notebook-style experience that provides support for various languages (R, SQL, Python and Scala) as well as built-in visualizations and dashboards. Confidently and securely share code with co-authoring, commenting, automatic versioning, Git integrations and role-based access controls. The platform provides laptop-like simplicity at production-ready scale. ----- #### Model creation and management From data ingestion to model training and tuning, all the way through to production model serving and versioning, the Lakehouse brings the tools needed to simplify those tasks. Get right into experimenting with the Databricks ML runtimes, optimized and preconfigured to include most popular libraries like scikit-learn, XGBoost and more. Massively scale thanks to built-in support for distributed training and hardware acceleration with GPUs. From within the runtimes, you can track model training sessions, package and reuse models easily with [MLflow](https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html) , an open source machine learning platform created by Databricks and included as a managed service within the Lakehouse. It provides a centralized location from which to manage models and package code in an easily reusable way. Training these models often involves the use of features housed in a centralized feature store. 
Fortunately, Databricks has a built-in feature store that allows you to create new features, explore and re-use existing features, select features for training and scoring machine learning models, and publish features to low-latency online stores for real-time inference. If you are looking to get a head start, [AutoML](https://databricks.com/blog/2022/04/18/supercharge-your-machine-learning-projects-with-databricks-automl-now-generally-available.html) allows for low to no-code experimentation by pointing to your data set and automatically training models and tuning hyperparameters to save both novice and advanced users precious time in the machine learning process. AutoML will also report back metrics related to the model training results as well as the code needed to repeat the training already custom-tailored to your data set. This glass box approach ensures that you are never trapped or suffer from vendor lock-in. In that regard, the Lakehouse supports the industry’s widest range of data tools, development environments, and a thriving ISV ecosystem so you can make your workspace your own and put out your best work. ##### Compute platform **Any ML workload optimized and accelerated** **Databricks Machine Learning Runtime** - Optimized and preconfigured ML frameworks - Turnkey distribution ML - Built-in AutoML - GPU support out of the box Built-in **ML frameworks** and **model explainability** Built-in support for **AutoML** and **hyperparameter tuning** Built-in support for **distributed training** Built-in support for **hardware accelerators** ----- #### Deploy your models to production Exploring and creating your machine learning models typically represents only part of the task. Once the models exist and perform well, they must become part of a pipeline that keeps models updated, monitored and available for use by others. 
Databricks can help here by providing a world-class experience for model versioning, monitoring and serving within the same platform that you can use to generate the models themselves. This means you can make all your ML pipelines in the same place, monitor them for drift, retrain them with new data, and promote and serve them easily and at scale. Throughout the ML lifecycle, rest assured knowing that lineage and governance are being tracked the entire way. This means regulatory compliance and security woes are significantly reduced, potentially avoiding costly issues down the road.

**Model lifecycle management** (stages shown in the figure: Logged model, Staging, Production, Archived)
- **Webhooks** allow registering of callbacks on events like stage transitions to integrate with CI/CD automation.
- **Tags** allow storing deployment-specific metadata with model versions, e.g., whether the deployment was successful.
- **Comments** allow communication and collaboration between teammates when reviewing model versions.

-----

**Learn more**
- [Databricks Machine Learning](https://databricks.com/product/machine-learning)
- [Databricks Data Science](https://databricks.com/product/data-science)
- [Databricks ML Runtime Documentation](https://docs.databricks.com/runtime/mlruntime.html)

-----

**CHAPTER 11**
### Databricks Technology Partners and the modern data stack

Databricks Technology Partners integrate their solutions with Databricks to provide complementary capabilities for ETL, data ingestion, business intelligence, machine learning and governance. These integrations allow customers to leverage the Databricks Lakehouse Platform’s reliability and scalability to innovate faster while deriving valuable data insights. Use preferred analytical tools with optimized connectors for fast performance, low latency and high user concurrency to your data lake.

-----

With [Partner Connect](https://databricks.com/partnerconnect), you can bring together all your data, analytics and AI tools on one open platform.
Databricks provides a fast and easy way to connect your existing tools to your lakehouse using validated integrations and helps you discover and try new solutions.

**Databricks thrives within your modern data stack**

[Figure: partner categories (BI and dashboards, machine learning, data science, data governance, data pipelines, data ingestion, consulting and SI partners) surrounding the Databricks Lakehouse Platform (data warehousing, data engineering, data streaming, data science and ML), built on Unity Catalog, Delta Lake and the cloud data lake]

**Learn more**
- [Become a Partner](https://databricks.com/p/register-your-interest-for-databricks-partner-program)
- [Partner Connect demos](https://databricks.com/partnerconnect#partner-demos)
- [Partner Connect](https://databricks.com/partnerconnect)
- [Databricks Partner Connect Guide](https://docs.databricks.com/integrations/partner-connect/index.html)

-----

**CHAPTER 12**
### Get started with the Databricks Lakehouse Platform

-----

#### Databricks Trial

Get a collaborative environment for data teams to build solutions together with interactive notebooks to use Apache Spark™, SQL, Python, Scala, Delta Lake, MLflow, TensorFlow, Keras, scikit-learn and more.

**•** Available as a 14-day full trial in your own cloud or as a lightweight trial hosted by Databricks

**[Try Databricks for free](https://databricks.com/try-databricks?itm_data=NavBar-TryDatabricks-Trial)**

**[Databricks documentation](https://databricks.com/documentation)** Get detailed documentation to get started with the Databricks Lakehouse Platform on your cloud of choice: Databricks on AWS, Azure Databricks and [Databricks on Google Cloud](https://docs.gcp.databricks.com/).
**[Databricks Demo Hub](https://databricks.com/discover/demos)** Get a firsthand look at Databricks from the practitioner’s perspective with these simple on-demand videos. Each demo is paired with related materials (including notebooks, videos and eBooks) so that you can try it out for yourself on Databricks.

**[Databricks Academy](https://databricks.com/learn/training/home)** Whether you are new to the data lake or building on an existing skill set, you can find a curriculum tailored to your role or interest. With training and certification through Databricks Academy, you will learn to master the Databricks Lakehouse Platform for all your big data analytics projects.

**[Databricks Community](https://community.databricks.com/)** Get answers, network with peers and solve the world’s toughest problems, together.

**[Databricks Labs](https://databricks.com/learn/labs)** Databricks Labs are projects created by the field to help customers get their use cases into production faster.

**[Databricks customers](https://databricks.com/customers)** Discover how innovative companies across every industry are leveraging the Databricks Lakehouse Platform.

-----

#### About Databricks

Databricks is the data and AI company. More than 7,000 organizations worldwide, including Comcast, Condé Nast, H&M and over 40% of the Fortune 500, rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks), [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc).

© Databricks 2022. All rights reserved.
Apache, Apache Spark, Spark and the Spark -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/The-Data-Teams-Guide-to-the-DB-Lakehouse-Platform.pdf,2024-09-19T16:57:20Z
"##### Guide ## 6 Strategies for Building Personalized Customer Experiences ----- ### Contents **Introduction** ................................................................................................................................................................................................................. **3** **1.** **Building a Foundation for Personalization** Leveraging ML-Based Customer Entity Resolution ............................................................................................................................... **4** **2.** **Estimating Customer Lifetime Value** Building Brand Loyalty With Data ................................................................................................................................................................. **6** **3.** **Mitigating Customer Churn** Balancing Acquisition and Retention .......................................................................................................................................................... **10** **4.** **Streamlining Customer Analysis and Targeting** Creating Efficiency and Accuracy With Behavioral Data .................................................................................................................. **14** **5.** **Assessing Consumer Interest Data** Fine-Tuning ML Recommendations ............................................................................................................................................................ **18** **6.** **Delivering Personalized Customer Journeys** Crafting a Real-Time Recommendation Engine .................................................................................................................................... **14** **Conclusion** Building a Direct Path to Winning the Minds and Wallets of Your Customers ............................................................................. 
-----

### Introduction

In today’s experience-driven world, the most beloved brands are the ones that know their customers. Customers are loyal to brands that recognize their needs and preferences and tailor user journeys and engagements accordingly.

A study from McKinsey shows [76% of consumers](https://www.mckinsey.com/business-functions/growth-marketing-and-sales/our-insights/the-value-of-getting-personalization-right-or-wrong-is-multiplying) are more likely to consider buying from a brand that personalizes the shopping and user experience to the wants and needs of the customer. And as organizations pursue omnichannel excellence, these same high expectations of online experiences also extend to brick-and-mortar locations, revealing for many merchants that personalized engagement is fundamental to attracting customers and expanding share of wallet.

But achieving a 360-degree view of your customers to serve personalized experiences requires integrating various types of data (demographic, behavioral and transactional) to develop robust profiles. This guide focuses on six actionable strategic pillars for businesses to leverage automation, real-time data, AI-driven analysis and well-tuned ML models to architect and deliver customized customer experiences at every touch point.

**76% of consumers are more likely to purchase due to personalization**

-----

### Building a Foundation for Personalization

Get a 360-degree view of the customer by leveraging ML-based entity resolution

To create truly personalized interactions, you need actionable insights about your customers. Start by establishing a common customer profile and accurately linking together customer records across disparate data sets.
Get a 360-degree view of your target customer by bringing together:
- Sales and traffic-driven first-party data
- Product ratings and surveys
- Customer surveys and support center calls
- Third-party data purchased from data aggregators and online trackers
- Zero-party data provided by customers themselves

[Figure: Customer 360, combining location, demographics, orders, network/usage, social, apps/clickstream, service call/records, billing and devices data]

**CASE STUDY**
**Personalizing experiences with data and ML**
Grab is the largest online-to-offline platform in Southeast Asia and has generated over 6 billion transactions for transport, food and grocery delivery, and digital payments. Grab uses Databricks to create sophisticated customer segmentation and recommendation engines that can now ingest and optimize thousands of user-generated signals and data sources simultaneously, enhancing data integrity and security, and reducing weeks of work to only hours. [Get the full story](https://www.databricks.com/customers/grab)

“The C360 platform empowered teams to create consumer features at scale, which in turn allows for these features to be extended to other markets and used by other teams. This helps to reduce the engineering overhead and costs exponentially.”
**NIKHIL DWARAKANATH**, Head of Analytics, Grab

-----

Given the different data sources and data types, automated matching can still be incredibly challenging due to inconsistent formats, misinterpretation of data, and entry errors across various systems. And even if inconsistent, all that data may be perfectly valid, but to accurately connect the millions of customer identities most retailers manage, businesses must lean on automation.
In a machine learning (ML) approach to entity resolution, text attributes like name, address and phone number are translated into numerical representations that can be used to quantify the degree of similarity between any two attribute values. But your ability to train such a model depends on your access to accurately labeled training data. It’s a time-consuming exercise, but if done right, the model learns to reflect the judgments of the human reviewers. Many organizations rely on libraries encapsulating this knowledge to build their applications and workflows. One such library is [Zingg](https://www.zingg.ai/) , an open source library bringing together ML-based approaches to intelligent candidate pair generation and pair-scoring. Oriented toward the construction of custom workflows, Zingg presents these capabilities within the context of commonly employed steps such as training data label assignment, model training, data set deduplication, and (cross-data set) record matching. Built as a native Apache Spark TM application, Zingg scales well to apply these techniques to enterprise-sized data sets. Organizations can then use Zingg in combination with platforms such as Databricks Lakehouse to provide the back end to human-in-the-middle workflow applications that automate the bulk of the entity resolution work and present data experts with a more manageable set of edge case pairs to interpret. As an active-learning solution, models can be retrained to take advantage of this additional human input to improve future predictions and further reduce the number of cases requiring expert review. Finally, these technologies can be assembled to enable their own enterprise-scaled customer entity resolution workflow applications. 
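
To make the idea of turning text attributes into comparable numbers more concrete, here is a minimal sketch in plain Python. It is not how Zingg works internally; it swaps in a simple, well-known technique (character trigram sets compared with Jaccard similarity) and produces one score per attribute for a pair of records. The function names and the sample records are hypothetical.

```python
def char_ngrams(value, n=3):
    # Normalize case and whitespace, pad the ends, and split a
    # text attribute into its set of character n-grams.
    text = ' '.join(value.lower().split())
    padded = ' ' * (n - 1) + text + ' ' * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    # Share of trigrams the two values have in common, from 0.0 to 1.0.
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def pair_features(record_a, record_b, attributes):
    # One similarity score per attribute: the kind of numeric input a
    # pair-scoring model could be trained on using labeled pairs.
    return [jaccard(record_a[attr], record_b[attr]) for attr in attributes]


# Made-up records for illustration only.
a = {'name': 'Jonathan Smith', 'address': '12 Main Street', 'phone': '555-0101'}
b = {'name': 'Jonathon Smith', 'address': '12 Main St.', 'phone': '555-0101'}
c = {'name': 'Maria Garcia', 'address': '98 Oak Avenue', 'phone': '555-0199'}

attributes = ['name', 'address', 'phone']
print(pair_features(a, b, attributes))  # likely the same person
print(pair_features(a, c, attributes))  # likely different people
```

A trained model would take vectors like these, together with human match and no-match labels, and learn how much weight each attribute deserves.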
**Need help building your foundation for a 360-degree view of your customers?**
Get pre-built code, sample data and step-by-step instructions in a Databricks notebook in the **Customer Entity Resolution Solution Accelerator.**
**•** Translating text attributes (like name, address, phone number) into quantifiable numerical representations
**•** Training ML models to determine if these numerical labels form a match
**•** Scoring the confidence of each match

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/customer-entity-resolution)**

-----

### Estimating Customer Lifetime Value

Building brand loyalty to drive share of wallet with data

Once you’ve set up a 360-degree view of the customer, the next challenge is how to spend money to profitably grow the brand. The goal is to spend marketing dollars on activities that attract loyal customers and avoid spending on unprofitable customers or activities that damage the brand. Keep in mind that making decisions solely based on ROI isn’t the answer. This one-track approach could ultimately weaken your brand equity and make you more dependent on lowering your price through promotions as a way to generate sales.

**Identifying and engaging brand loyalists**

Today’s customer has overwhelmingly abundant options in products and services to choose from. That’s why personalizing customer experiences is so important, as it increases revenue, marketing efficiency and customer retention.

Not every customer carries the same potential for profitability. Different customers derive different value from your products and services, which directly translates into differences in the overall amount of value a business can expect in return. Mutually beneficial relationships carefully align customer acquisition cost (CAC) and retention rates with the total revenue or customer lifetime value (CLV).

Within your existing customer base are people ranging from brand loyalists to brand transients. Brand loyalists are highly engaged with your brand, are willing to share their experience with others, and are the most likely to purchase again. Brand transients have no loyalty to your brand and shop based on price. Your focus should be on growing the group of brand loyalists while minimizing interactions with brand transients.

**CASE STUDY**
**Predicting and increasing customer lifetime value with ML**
Kolibri Games, creators of Idle Miner Tycoon and Idle Factory Tycoon, attracts over 10 million monthly active users. With Databricks, they achieved a 30% increase in player LTV, improved data team productivity by 3x, and reduced ML model-to-production time by 40x. [Get the full story](https://databricks.com/customers/kolibri-games)

**Calculating customers’ lifetime intent**

To assess the remaining lifetime in a customer relationship, businesses must carefully examine the transactional signals and other indicators from previous customer engagements and transactions. For example, if a frequent customer slows down their buying habits, or simply doesn’t make a purchase for an extended period of time, it may signal the upcoming end of the relationship. However, in the case of another customer who engages infrequently, the same extended absence may not signal anything notable. The infrequent buyer may continue to purchase even after a long pause in activity.

-----

[Figure: purchase timelines for Customers A, B and C, from past to future. Different customers with the same number of transactions, but signaling different lifetime intent.]

[Figure: The probability of re-engagement (P_alive) relative to a customer’s history of purchases.]

Every customer relationship with a business has a lifespan. Understanding where a customer is in that lifespan at a given time provides critical insight to inform marketing and sales tactics.
By proactively discovering shifts in the relationship, you can adapt how to respond to each customer at the optimal time. For example, a certain signal might prompt a change in how to deliver products and services, which could help maximize revenue. Transactional signals can be used to estimate the probability that a customer is active and likely to return in the future. Popularized as the Buy ’til You Die (BTYD) model, analysts can compare a customer’s frequency and recency of engagement to similar patterns across their user population to accurately predict individual CLV. The mathematics behind these predictive CLV models is complex, but the logic behind these critical models is accessible through a popular Python library named Lifetimes, which allows the input of simple summary metrics in order to derive customer-specific lifetime estimates. **C A S E S T U DY** **How personalized experiences keep customers coming** **back for more** Publicis Groupe empowers brands to transform retail experiences with digital technologies, but data challenges and team silos stood in the way of delivering the personalization that their customers required. See how they use Databricks to create a single customer view that allows them to drive customer loyalty and retention. As a result, they’ve seen a 45%–50% increase in customer campaign revenue. [Get the full story](https://databricks.com/customers/publicis-groupe) ----- **Delivering customer lifetime estimates to the business** Spark natively distributes this work across a multi-server environment, enabling consistent, accurate and efficient analysis. Spark’s flexibility allows models to adapt in real time as new information is ingested, eliminating the bottlenecks that come with manual data mapping and profile building. With per customer metrics calculated, the Lifetimes library can be used to train multiple BTYD models, such as Pareto/NBD and BG/NBD. 
Training models to predict engagements over time using proprietary data can take several months and thousands of training runs. [Hyperopt](http://hyperopt.github.io/hyperopt/), a hyperparameter optimization library, helps businesses tap into the infrastructure behind their Spark environments and distribute those training runs across it. Using the Lifetimes library to calculate customer-specific probabilities at speed and scale presents several challenges, from processing large volumes of transaction data to deriving data curves and value distribution patterns and, eventually, to integration with business initiatives. But with the proper approach, you can resolve all of them.

These models depend on three key per-customer metrics:

**•** **FREQUENCY:** The number of times within a given time period in which a repeat transaction is observed
**•** **AGE:** The length of time between the occurrence of an initial transaction and the end of a given time period
**•** **RECENCY:** The “age” of a customer (how long they’ve engaged with a brand) at the time of their latest repeat transaction

-----

**Solution deployment**

Once properly trained, these models can determine the probability that a customer will re-engage, as well as the number of engagements a business can expect from that customer over time. But the real challenge is putting these predictive capabilities into the hands of those who determine customer engagement.

Figure: Matrices illustrating the probability a customer is alive (left) and the number of future purchases in a 30-day window given a customer’s frequency and recency metrics (right).

Businesses need a way to develop and deploy solutions in a highly scalable environment with a limited upfront cost. Databricks Solution Accelerators leverage real-world sample data sets and pre-built code to show how raw data can be transformed into real solutions, including step-by-step instructions ready to go in a Databricks notebook.
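The three per-customer metrics defined earlier in this chapter (frequency, age and recency) can be derived from a plain transaction log. The following is a minimal sketch under stated assumptions: it uses only the Python standard library rather than the Lifetimes library, it measures time in days, it counts at most one transaction per customer per day, and the function name `summarize_customers` and the sample transactions are hypothetical.

```python
from collections import defaultdict
from datetime import date


def summarize_customers(transactions, period_end):
    # transactions: iterable of (customer_id, purchase_date) pairs.
    # Returns {customer_id: (frequency, recency_days, age_days)}.
    days_by_customer = defaultdict(set)
    for customer_id, purchase_date in transactions:
        if purchase_date <= period_end:
            days_by_customer[customer_id].add(purchase_date)

    summary = {}
    for customer_id, days in days_by_customer.items():
        first, last = min(days), max(days)
        frequency = len(days) - 1            # repeat transaction days only
        recency = (last - first).days        # customer age at latest repeat purchase
        age = (period_end - first).days      # first purchase to end of period
        summary[customer_id] = (frequency, recency, age)
    return summary


# Made-up sample data for illustration.
transactions = [
    ('A', date(2023, 1, 1)),
    ('A', date(2023, 1, 15)),
    ('A', date(2023, 3, 1)),
    ('B', date(2023, 2, 10)),
]
summary = summarize_customers(transactions, date(2023, 3, 31))
print(summary)  # {'A': (2, 59, 89), 'B': (0, 0, 49)}
```

Customer B has a single purchase, so both frequency and recency are zero; a BTYD model has only that customer's age to work with until a repeat transaction appears.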
**Need help determining your customers’** **lifetime value?** Use the **Customer Lifetime Value Accelerator** to **•** Ingest sample retail data **•** Use pre-built code to develop visualizations and explore past purchase behavior **•** Apply machine learning to predict the likelihood and nature of future purchases **[GET THE ACCELERATOR](https://databricks.com/solutions/accelerators/customer-lifetime-value)** ----- ### Mitigating Customer Churn Balancing acquisition and retention with personalized experiences There are no guarantees of success. With a bevy of options at their disposal, customer churn is a reality that companies face and are focused on overcoming every day. One [recent analysis](https://info.recurly.com/annual-subscription-billling-metrics-report?submissionGuid=3c21cde7-5f58-4d86-9218-332d697e7b3e) of consumer-oriented subscription services estimated a segment average 7.2% monthly rate of churn. When narrowed to brands focused on consumer goods, that rate jumped to 10.0%. This figure translates to a lifetime of 10 months for the average subscription box service, leaving businesses of this kind with little time to recover acquisition costs and bring subscribers to net profitability. **C A S E S T U DY** ##### Riot Games **Creating an optimal in-game experience for League of Legends** Riot Games is one of the top PC game developers in the world, with over 100 million monthly active users, 500 billion data points, and over 26 petabytes of data and counting. They turned to Databricks to build a more efficient and scalable way to leverage data and improve the overall gaming experience — ensuring customer engagement and reducing churn. [Get the full story](https://www.databricks.com/customers/riot-games) Organizations must take an honest look at the cost of acquisition relative to a customer’s lifetime value (LTV) earned. 
These figures need to be brought into a healthy balance and treated as a “chronic condition” [to be managed.](https://retailtouchpoints.com/features/trend-watch/can-subscription-retail-solve-its-customer-retention-problem) **Understanding attrition predictability through subscriptions:** **Examining retention-based acquisition variables** Public data for subscription services is extremely hard to come by. KKBox, a Taiwan-based music streaming service, recently released over two years of anonymized [subscription data](https://www.kaggle.com/c/kkbox-churn-prediction-challenge) to examine customer churn. Through analyzing the data, we uncover customer dynamics familiar to any subscription provider. Most subscribers join the KKBox service through a 30-day trial offer. Customers then appear to enlist in one-year subscriptions, which provide the service with a steady flow of revenue. Subscribers typically churn at the end of the 30-day trial and at regular one-year intervals. The Survival Rate reflects the proportion of the initial (Day 1) subscriber population that is retained over time, first at the roll-to-pay milestone, and then at the renewal milestone. ----- By Initial Payment Method timeline Customer attrition by subscription day on the KKBox streaming service for customers registering via different payment methods. By Initial Payment Plan Days timeline Customer attrition by subscription day on the KKBox streaming service for customers selecting different initial payment methods and terms/days. This pattern of high initial drop-off, followed by a period of slower but continuing drop-off cycles makes intuitive sense. Where it gets interesting is when the data changes. The patterns of customer churn become vastly different as time passes and new or changing elements are introduced (e.g., payment methods and options, membership tiers, etc.). 
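A survival rate of the kind described above can be estimated in several ways. The sketch below uses the Kaplan-Meier estimator, which is our assumption: the text does not say which estimator produced the KKBox curves. It relies only on the Python standard library, the function name `kaplan_meier` is hypothetical, and the durations and churn flags are made-up values, not KKBox data. Subscribers who are still active at the end of observation are flagged as not churned and only count toward the population at risk.

```python
def kaplan_meier(durations, churned):
    # durations: days each subscriber was observed.
    # churned: True if the subscriber cancelled at that duration,
    #          False if still active when observation ended.
    # Returns a list of (day, survival_rate) at each day with a cancellation.
    event_days = sorted({d for d, c in zip(durations, churned) if c})
    survival = 1.0
    curve = []
    for day in event_days:
        at_risk = sum(1 for d in durations if d >= day)
        events = sum(1 for d, c in zip(durations, churned) if c and d == day)
        survival *= 1 - events / at_risk
        curve.append((day, survival))
    return curve


# Illustrative sample: drop-off after a 30-day trial and around one-year renewals.
durations = [30, 30, 30, 395, 395, 400, 200, 760]
churned = [True, True, True, True, True, False, False, True]
curve = kaplan_meier(durations, churned)
for day, rate in curve:
    print(day, rate)
```

Running the same function separately for each payment method or registration channel would give one curve per group, which is the kind of comparison the attrition charts in this chapter present.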
By Registration Channel timeline Customer attrition by subscription day on the KKBox streaming service for customers registering via different channels. ----- These patterns seem to indicate that KKBox _could_ potentially differentiate between customers based on their lifetime potential, using only the information available at subscriber acquisition. In the same way, non-subscription businesses could use similar data techniques to get an accurate illustration of the total lifetime value of a particular customer, even before collecting historical data. This information can help businesses target certain shoppers with effective discounts or promotions as early as trial registration. Nevertheless, it’s always important to consider more than individual data points. The baseline risk of customer attrition over a subscription lifespan. The channel and payment method multipliers combine to explain a customer’s risk of attrition at various points in time. The higher the value, the higher the proportional risk of churn in the associated period. ----- **Applying churn analytics to your data** This analysis is useful in two ways: **1)** to quantify the risk of customer churn and **2)** to paint a quantitative picture of the specific factors that explain that risk, giving analysts a clearer understanding of what to focus on, what to ignore and what to investigate further. The main challenge is organizing the input data. The data required to examine customer attrition may be scattered across multiple systems, making an integrated analysis difficult. [Data lakes](https://databricks.com/discover/data-lakes/introduction) support the creation of transparent, sustainable data processing pipelines that are flexible, scalable and highly cost-efficient. Remember that **churn is a chronic** **condition to be managed** , and attrition data should be periodically revisited to maintain alignment between acquisition and retention efforts. 
**Need help predicting customer churn?** Use the **Subscriber Churn Prediction Accelerator** to analyze behavioral data, identify subscribers with an increased risk of cancellation, and predict attrition. Machine learning lets you quantify a user’s likelihood to churn, identifying factors that explain the risk. **[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/survivorship-and-churn)** ----- ### Streamlining Customer Analysis and Targeting Creating efficient and highly targeted customer experiences with behavioral data Effective targeting comes down to one fundamental element: the cost of delivering a good or service relative to what a consumer is willing to pay. In the earliest applications of segmentation, manufacturers recognized that specialized product lines targeting specific consumer groups could help brands stand out against competitors. **C A S E S T U DY** **Finding that special something every time** Pandora is a jewelry company with global reach. They built their master consumer view (MCV) dashboard on the Databricks Lakehouse Platform, giving them the insights necessary to deliver highly targeted messaging and personalization — resulting in 80% growth in email marketing success, a 50% increase in click-to-open rate across 65 million emails, and 255M DKK (Danish Krone) in quarterly revenue. [Get the full story](https://www.databricks.com/customers/pandora) This mode of thinking extends beyond product development and into every customer-oriented business function, requiring specific means of ideation, production and delivery. The work put into segmentation doesn’t need to be a gamble. Scrutinizing customers and testing responsiveness is an ongoing process. Organizations must analyze and adapt to shifting markets, changing consumer demand and evolving business objectives. 
**C A S E S T U DY** **Powering insight-driven dashboards to increase customer** **acquisition** Bagelcode is a global game company with more than 50 million global users. By using the Databricks Lakehouse Platform, they are now able to support more diversified indicators, such as a user’s level of frequency and the amount of time they use a specific function for each game, enabling more well-informed responses. In addition, the company is mitigating customer churn by better predicting gamer behavior and providing personalized experiences at scale. [Get the full story](https://www.databricks.com/customers/bagelcode) “Thanks to Databricks Lakehouse, we can support real-time business decision-making based on data analysis results that are automatically updated on an hourly and daily basis, even as data volumes have increased by nearly 1,000 times.” **J O O H Y U N K I M** Vice President, Data and AI, Bagelcode ----- A brand’s goal with segmentation should be to define a shared customer perspective on customers, allowing the organization to engage users consistently and cohesively. But any adjustments to customer engagement require careful consideration of [organizational change concerns](https://www.researchgate.net/publication/45348436_Bridging_the_segmentation_theorypractice_divide) . **C A S E S T U DY** **Responding to global demand shifts with ease** Reckitt produces some of the world’s most recognizable and trusted consumer brands in hygiene, health and nutrition. With Databricks Lakehouse on Azure, they’re able to meet the needs of billions of consumers worldwide by surfacing real-time, highly accurate, deep customer insights, leading to a better understanding of trends and demand, allowing them to provide best-in-class experiences in every market. 
[Get the full story](https://www.databricks.com/customers/reckitt) **A segmentation walk-through: Grocery chain promotions** A promotions management team for a large grocery chain is responsible for running a number of promotional campaigns, each of which is intended to drive greater overall sales. Today, these marketing campaigns include leaflets and coupons mailed to individual households, manufacturer coupon matching, in-store discounts and the stocking of various private-label alternatives to popular national brands. Recognizing uneven response rates between households, the team is eager to determine if customers might be segmented based on their responsiveness to these promotions. They anticipate that such segmentation may allow the promotions management team to better target individual households, driving overall higher response rates for each promotional dollar spent. Using historical data from point-of-sale systems along with campaign information from their promotions management systems, the team derives a number of features that capture the behavior of various households with regard to promotions. Applying standard data preparation techniques, the data is organized for analysis and using a variety of clustering algorithms, such as k-means and hierarchical clustering, the team settles on two potentially useful cluster designs. ----- Overlapping segment designs separating households based on their responsiveness to various promotional offerings. Profiling of clusters to identify differences in behavior across clusters. **Assessing results** Comparing households by demographic factors not used in developing the clusters themselves, some interesting patterns separating cluster members by age and other factors are identified. 
While this information may be useful in not only predicting cluster membership and designing more effective campaigns targeted to specific groups of households, the team recognizes the need to collect additional demographic data before putting too much emphasis on these results. With profiling, marketers can discern those customer households in the highlighted example fall into two groups: those who are responsive to coupons and mailed leaflets, and those who are not. Further divisions show differing degrees of responsiveness to other promotional offers. ----- **Need help segmenting your customers for** **more targeted marketing?** Use the **Customer Segmentation Accelerator** and drive better purchasing predictions based on behaviors. Through sales data, campaigns and promotions systems, you can build useful customer clusters to effectively target various households with different promos and offers. Age-based differences in cluster composition of behavior-based customer segments. **[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/customer-segmentation)** The results of the analysis now drive a dialog between the data scientists and the promotions management team. Based on initial findings, a revised analysis will be performed focused on what appear to be the most critical features differentiating households as a means to simplify the cluster design and evaluate overall cluster stability. Subsequent analyses will also examine the revenue generated by various households to understand how changes in promotional engagement may impact customer spending. Using this information, the team believes they will have the ability to make a case for change to upper management. Should a change in promotions targeting be approved, the team makes plans to monitor household spending, promotions spend and campaign responsiveness rates using much of the same data used in this analysis. 
This will allow the team to assess the impact of these efforts and identify when the segmentation design needs to be revisited. ----- #### Assessing Consumer Interest Data to Inform Engagement Strategies Fine-tuning ML recommendations to boost conversions Personalization is a [journey](https://www.bcg.com/publications/2021/the-fast-track-to-digital-marketing-maturity) . To operationalize personalized experiences, it’s important to identify high-value audiences who have the highest likelihood of specific actions. Here’s where **propensity scoring** comes in. Specifically, this process allows companies to estimate customers’ potential receptiveness to an offer or to content related to a subset of products, and determine which messaging to apply. Calculating propensity scores requires assessment of past interactions and data points (e.g., frequency of purchases, percentage of spend associated with a particular product category, days since last purchase and other historical data). Databricks provides critical capabilities for propensity scoring (like the Feature Store, AutoML and MLflow) to help businesses answer three key considerations and develop a robust process: **1.** How to maintain the significant number of features used to train propensity models **2.** How to rapidly train models aligned with new campaigns **3.** How to rapidly re-deploy models, retrained as customer patterns drift, into the scoring pipeline **Boosting model training efficiency** With the [Databricks Feature Store](https://docs.databricks.com/applications/machine-learning/feature-store/index.html) , data scientists can easily reuse features created by others. The feature store is a centralized repository that enables the persistence, discovery and sharing of features across various model training exercises. As features are captured, lineage and other metadata are captured. 
Standard security models ensure that only permitted users and processes may employ these features, enforcing the organization’s data access policies on data science processes. **Extracting the complexities of ML** [Databricks AutoML](https://docs.databricks.com/applications/machine-learning/automl.html) allows you to quickly generate models by leveraging industry best practices. As a glass box solution, AutoML first generates a collection of notebooks representing various aligned model variations. In addition to iteratively training models, AutoML allows you to access the notebooks associated with each model, creating an editable starting point for further exploration. **Streamlining the overall ML lifecycle** [MLflow](https://docs.databricks.com/applications/mlflow/index.html) is an open source machine learning model repository, managed within the Databricks Lakehouse. This repository enables tracking and analysis of the various model iterations generated by both AutoML and custom training cycles alike. When used in combination with the Databricks Feature Store, models persisted with MLflow can retain knowledge of the features used during training. As models are retrieved, this same information allows the model to retrieve relevant features from the Feature Store, greatly simplifying the scoring workflow and enabling rapid deployment. ----- **How to build a propensity scoring workflow with Databricks** Using these features in combination, many organizations implement propensity scoring as part of a three-part workflow: **1.** Data engineers work with data scientists to define features relevant to the propensity scoring exercise and persist these to the Feature Store. Daily or even real-time feature engineering processes are then defined to calculate up-to-date feature values as new data inputs arrive. 
**2.** As part of the inference workflow, customer identifiers are presented to previously trained models in order to generate propensity scores based on the latest features available. Feature Store information captured with the model allows data engineers to retrieve these features and easily generate the desired scores, which can then be used for analysis within Databricks Lakehouse or published to downstream marketing systems.

**3.** In the model-training workflow, data scientists periodically retrain the propensity score models to capture shifts in customer behaviors. As these models are persisted to MLflow, change management processes are used to evaluate and elevate those models that meet organizational criteria to production status. In the next iteration of the inference workflow, the latest production version of each model is retrieved to generate customer scores.

Figure: A three-part propensity scoring workflow. Diagram labels: Customer, Sales, Promotions, Profiles, Feature Engineering ETL, Feature Store, Model Training and Deployment, Score Generation and Publication ETL, Downstream Applications.

**Need help assessing interest from your target audience?**

Use the **Propensity Scoring Accelerator** to estimate customers’ potential receptiveness to an offer or to content related to a subset of products. Using these scores, marketers can determine which of the many messages at their disposal should be presented to a specific customer.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/propensity-scoring)**

-----

### Delivering Personalized Customer Journeys

Strategies for crafting a real-time recommendation engine

As the economy continues to weather unpredictable disruptions, shortages and demand, delivering personalized customer experiences at speed and scale will require adaptability on the ground and within a company’s operational tech stack.
With the Databricks Lakehouse, Al-Futtaim has transformed their data strategy and operations, allowing them to create a “golden customer record” that improves all decision-making from forecasting demand to powering their global loyalty program. [Get the full story](https://www.databricks.com/customers/al-futtaim) **C A S E S T U DY** “Databricks Lakehouse allows every division in our organization — from automotive to retail — to gain a unified view of our customer across businesses. With these insights, we can optimize everything from forecasting and supply chain, to powering our loyalty program through personalized marketing campaigns, cross-sell strategies and offers.” **D M I T R I Y D O V G A N** Head of Data Science, Al-Futtaim Group As COVID-19 forced a [shift](https://www.mckinsey.com/business-functions/marketing-and-sales/our-insights/a-global-view-of-how-consumer-behavior-is-changing-amid-covid-19) in consumer focus toward value, availability, quality, safety and community, brands most attuned to changing needs and sentiments saw customers [switch](https://martechseries.com/sales-marketing/customer-experience-management/braze-survey-one-in-four-consumers-tried-new-brand-during-covid-19/) from [rivals](https://www.retailtouchpoints.com/resources/personalization-gains-new-relevance-as-covid-19-challenges-brand-loyalties) to their brand. While some segments gained business and many lost, organizations that had already begun the journey toward improved customer experience saw better outcomes, closely mirroring patterns [observed](https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/Marketing%20and%20Sales/Our%20Insights/Adapting%20customer%20experience%20in%20the%20time%20of%20coronavirus/Adapting-customer-experience-in-the-time-of-coronavirus.ashx) in the 2007–2008 recession. 
**Creating a unified view across 200+ brands** As a driving force for economic growth in the Middle East, Al-Futtaim impacts the lives of millions of people across the region through the distribution and operations of global brands like Toyota, IKEA, Ace Hardware and Marks & Spencer. Al-Futtaim’s focus is to harness their data to improve all areas of the business, from streamlining the supply chain to optimizing marketing strategies. But with the brands capturing such a wide variety of data, Al-Futtaim’s legacy systems struggled to provide a single view into the customer due to data silos and the inability to scale efficiently to meet analytical needs. ----- The personalization of customer experiences will remain a key focus for B2C and [B2B organizations](https://hbr.org/2017/07/how-b2b-sellers-are-offering-personalization-at-scale) . Increasingly, market analysts are recognizing customer experience as a [disruptive force](https://sloanreview.mit.edu/article/the-experience-disrupters/) enabling upstart organizations to upend long-established players. **Focus on the customer journey** Personalization starts with a careful exploration of the [customer journey](https://hbr.org/2015/11/competing-on-customer-journeys) . The [digitization of each stage](https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/the-drumbeat-of-digital-how-winning-teams-play) provides the customer with flexibility in terms of how they will engage and provides the organization with the ability to [assess](https://www.bcg.com/en-us/publications/2020/three-personalization-imperatives-during-covid-crisis) [the health of their model](https://www.bcg.com/en-us/publications/2020/three-personalization-imperatives-during-covid-crisis) . **C A S E S T U DY** **Personalizing the beauty product shopping experience** Flaconi wanted to leverage data and AI to become the No. 1 online beauty product destination in Europe. 
However, they struggled with massive volumes of streaming data and with infrastructure complexity that was resource-intensive and costly to scale. See how they used Databricks to speed up time-to-market by 200x, reduce staff costs by 40% and increase net order income. Get the full story

Figure: CX leaders outperform laggards, even in a down market, in this visualization of the Forrester Customer Experience Performance Index [as provided](https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/Marketing%20and%20Sales/Our%20Insights/Adapting%20customer%20experience%20in%20the%20time%20of%20coronavirus/Adapting-customer-experience-in-the-time-of-coronavirus.ashx) by McKinsey & Company.

¹ Comparison of total returns to shareholders for publicly traded companies ranking in the top 10 or bottom 10 of Forrester’s Customer Experience Performance Index in 2007-09. Source: Forrester Customer Experience Performance Index (2007-09); press search

-----

Careful consideration of how customers interact with various assets, and of how these interactions may be interpreted as expressions of preference, can unlock a wide range of data that enables personalization.

The engines we use to serve content based on customer preferences are known as recommenders. With some recommenders, a heavy focus on the shared preferences of similar customers helps define what recommendations will actually make an impact. With others, it can be more useful to focus on the properties of the content itself (e.g., product descriptions).

The complexity of these engines requires that they be deployed thoughtfully, using limited pilots and customer response assessments. And in those assessments, it’s important to keep in mind that there is no expectation of perfection, only incremental improvement over the prior solution.

**CASE STUDY**

**Connecting shoppers to savings with data-driven personalization**

Flipp is an online marketplace that aggregates weekly shopping circulars, so consumers get deals and discounts without clipping coupons. Siloed customer data sources once made getting insights difficult. Now with Databricks, Flipp’s data teams can access and democratize data, helping them do their jobs more effectively while bringing better deals to users, more meaningful insights to partners, and a 10% jump in foot traffic to brick-and-mortar retailers. Get the full story

**Need help generating personalized recommendations?**

Use the **Recommendation Engines Accelerator** to estimate customers’ potential receptiveness to an offer or to content related to a subset of products. Using these scores, marketers can determine which of the many messages at their disposal should be presented to a specific customer.

**[GET THE ACCELERATOR](https://www.databricks.com/solutions/accelerators/propensity-scoring)**

-----

### Building a Direct Path to Winning the Minds and Wallets of Your Customers

Providing deep, effective personalized experiences to customers depends on a brand’s ability to intelligently leverage consumer and market data from a wide variety of sources to fuel faster, smarter decisions without sacrificing accuracy for speed. The Databricks Lakehouse Platform is purpose-built for exactly that, offering a scalable data architecture that unifies all your data, analytics and AI to deliver unforgettable customer experiences.

Created on open source and open standards, Databricks offers a robust and cost-effective platform for brands to collaborate with partners, clients, manufacturers and distributors to unleash more innovation and efficiencies at every touch point. Businesses can rapidly ingest available data in real time, at scale, and create accessible, data-driven insights that enable actionable strategies across the value chain.

Databricks is a multicloud platform, designed for quick enterprise development.
Teams using the Lakehouse can more effectively reveal the 360-degree view into their company’s operational health and the evolving needs of their customers, all while empowering teams to easily unify data efforts, perform fine-grained analyses and streamline cross-functional data operations using a single, sophisticated solution.

###### Learn more about Databricks Lakehouse for industries like Retail & Consumer Goods, Media & Entertainment and more at databricks.com/solutions

-----

### About Databricks

Databricks is the data and AI company. More than 7,000 organizations worldwide (including Comcast, Condé Nast, H&M and over 50% of the Fortune 500) rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor), [LinkedIn](https://www.linkedin.com/company/databricks) and [Facebook](https://www.facebook.com/databricksinc/).

**[START YOUR FREE TRIAL](https://www.databricks.com/try-databricks?utm_medium=paid+search&utm_source=google&utm_campaign=14272820537&utm_adgroup=126939742998&utm_content=trial&utm_offer=try-databricks&utm_ad=563736421186&utm_term=databricks%20free%20trial&gclid=Cj0KCQjwpeaYBhDXARIsAEzItbHzQGCu2K58-lnVCepMI5MYP6jTXkgfvqmzwAMqrlVwVOniebOE43UaAk3OEALw_wcB)**

##### Contact us for a personalized demo databricks.com/contact

-----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Databricks-Customer-360-ebook-Final.pdf,2024-09-19T16:57:19Z
"#### eBook

# Big Book of Retail & Consumer Goods Use Cases

##### Driving real-time decisions with the Lakehouse

-----

### Contents

**CHAPTER 1:** Introduction
**CHAPTER 2:** Modern Data Platform for Real-Time Retail
**•** Common challenges
**•** The Lakehouse for Retail
**CHAPTER 3:** Use Case: Real-Time Supply Chain Data
**•** Case Study: Gousto
**•** Case Study: ButcherBox
**CHAPTER 4:** Use Case: Truck Monitoring
**•** Case Study: Embark
**CHAPTER 5:** Use Case: Inventory Allocation
**•** Case Study: H&M
**•** Case Study: Edmunds
**CHAPTER 6:** Use Case: Point of Sale and Clickstream
**CHAPTER 7:** Use Case: On-Shelf Availability
**•** Case Study: Reckitt
**CHAPTER 8:** Use Case: Customer and Vehicle Identification
**CHAPTER 9:** Use Case: Recommendation Engines
**•** Case Study: Wehkamp
**•** Case Study: Columbia
**•** Case Study: Pandora
**CHAPTER 10:** Use Case: Perpetual Inventory
**CHAPTER 11:** Use Case: Automated Replenishments
**CHAPTER 12:** Use Case: Fresh Food Forecasting
**•** Case Study: ButcherBox
**•** Case Study: Sam’s Club
**CHAPTER 13:** Use Case: Propensity-to-Buy
**CHAPTER 14:** Use Case: Next Best Action
**CHAPTER 15:** Customers That Innovate With Databricks Lakehouse for Retail
**CHAPTER 16:** Conclusion

-----

**CHAPTER 1:**

### Introduction

Retailers are increasingly being challenged to make time-sensitive decisions in their operations. Consolidating e-commerce orders. Optimizing distribution to ensure item availability. Routing delivery vehicles. These decisions happen thousands of times daily and have a significant financial impact. Retailers need real-time data to support these decisions, but legacy systems are limited to data that’s hours or days old.
**When seconds matter, only the Lakehouse delivers better decisions** Retail is a 24/7 business where customers expect accurate information and immediate relevant feedback. The integration of physical and e-commerce customer experiences into an omnichannel journey has been happening for the past 20 years, but the pandemic provided a jolt to consumer trends that dramatically shifted purchasing patterns. In reaction to these industry changes, retailers have responded with significant, rapid investments — including stronger personalization, order fulfillment, and delivery and loyalty systems. While these new targeted capabilities have addressed the immediate need — and created expectations of making decisions in real time — most retailers still rely on legacy data systems, which impedes their ability to scale these innovations. Unfortunately, most legacy systems are only able to process information in hours or days. The delays caused by waiting for data are leading to significant risks and costs for the industry. **Grocers** need to consolidate order picking to achieve profitability in e-commerce, but this requires up-to- the-minute order data. Not having this information causes them to spend more resources on having people pick orders separately, at a higher operating cost. **Apparel retailers** must be able to present the correct available inventory on their website. This requires that in-store sales be immediately reflected in their online systems. Inaccurate information can lead to lost sales, or worse, the customer becoming unsatisfied and moving to different retailers. ----- **Convenience fuel retailers** must collaborate with distribution centers, direct-to-store delivery distributors and other partners. Having delayed data can lead to out-of-stocks, costing stores thousands of dollars per week. The margin of error in retail has always been razor thin, but with a pandemic and inflationary pressures, it’s at zero. 
Reducing the error rate requires better predictions and real-time data. **Use Case Guide** In this use case guide, we show how the Databricks Lakehouse for Retail is helping leading organizations take **all of their data in a single lakehouse architecture, streamline their data engineering and management,** **make it ready for SQL and ML/AI** , and **do so very fast within their own cloud infrastructure environment** **based on open source and open standards** . These capabilities are all delivered at world-record-setting performance, while achieving a market-leading total cost of ownership. Databricks Lakehouse for Retail has become the industry standard for enabling retailers to drive decisions in real time. This use case guide also highlights common use cases across the industry, and offers additional resources in the form of Solution Accelerators and reference architectures to help as you embark on your own journey to drive better customer experiences with data and AI. ----- **CHAPTER 2:** ### Modern Data Platform  for Real-Time Retail Retailers continue to adapt to rapidly shifting dynamics across the omnichannel. In navigating these changes, retailers are increasingly focused on improving the real-time availability of data and insights, and performing advanced analytics delivered within tight business service windows. **Common challenges** In response to the surge in e-commerce and volatility in their supply chains, retailers are investing millions in modernizing distribution centers, partnering with delivery companies, and investing in customer engagement systems. Warehouse automation is expected to become a $41B market according to Bloomberg. Increasingly, distribution centers are being automated with robotics to power dynamic routing and delivery. Shoppers that became accustomed to having fast, same-day, and sometimes even overnight delivery options during the pandemic now expect them as the norm. 
Retailers understand that the shipping and delivery experience is now one of many touchpoints that merchants can use to develop customer brand loyalty. ## $41B Market | Retail Warehouse Automation Yet while retailers modernize different areas of their operations, they’re constrained by a single point of weakness, as they are reliant on legacy data platforms to bring together all of this data. Powering real-time decisions in modern retail requires real-time ingestion of data, transformation, governance of information, and powering business intelligence and predictive analytics all within the time required by retail operations. ----- **Ingesting large volumes of transactional data in real time.** The biggest blocker to crucial insights is the ability to ingest data from transaction systems in real time. Transaction logs from point-of-sale systems, clickstreams, mobile applications, advertising and promotions, as well as inventory, logistics and other systems, are constantly streaming data. Big data sets need to be ingested, cleansed and aggregated and integrated with each other before they can be used. The problem? Retailers have used legacy data warehouses that are built around batch processing. And worse, increasing the frequency of how often data is processed leads to a “hockey stick” in costs. As a result of these limitations, merchants resort to ingesting data nightly to deal with the large volumes of data and integration with other data sets. The result? Accurate data to drive decisions can be delayed by days. **Performing fine-grained analysis at scale within tight time windows.** Retailers have accepted a trade-off when performing analysis. Predictions can be detailed and accurate, or they can be fast. Running forecasts or price models at a day, store and SKU level can improve accuracy by 10% or more, but doing so requires tens of millions of model calculations that need to be performed in narrow service windows. 
This is well beyond the capability of legacy data platforms. As a result, companies have been forced to accept the trade-off and live with less accurate predictions. **Powering real-time decisions on the front line.** Data is only useful if it drives decisions, but serving real-time data to thousands of employees is a daunting task. While data warehouses are capable of serving reports to large groups of users, they’re still limited to stale data. Most retailers limit the frequency of reports to daily or weekly updates and depend on the staff to use their best judgment for decisions that are more frequent. **Delivering a hyper-personalized omnichannel experience.** The storefront of the 21st century is focused on delivering personalized experiences throughout the omnichannel. Retailers have access to a trove of customer data, and yet off-the-shelf tools for personalization and customer segmentation struggle to deal with high volumes, and the analytics have high rates of inaccuracy. Retailers need to deliver personalized experiences at scale to win in retail. ----- ###### The Lakehouse for Retail Databricks Lakehouse for Retail solves these core challenges. The Lakehouse unlocks the ability to unify all types of data — from images to structured data — in real time, provide enterprise-class management and governance, and then immediately turn that data into actionable insights with real-time reporting and predictive analytics. It does this with record-setting speed and industry-leading total cost of ownership (TCO) in a platform-as-a-service (PaaS) that allows customers to solve these pressing problems. 
[Figure: The Lakehouse for Retail. Column headings: Any structure or frequency; Reliable, real-time processing; Capabilities for any persona; Data sharing & collaboration. Sources ("All of your sources"): competitive activity, e-commerce, mobile applications, video & images, point of sale, distribution & logistics, customer & loyalty, delivery & partners, each labeled as structured, semi-structured or unstructured and as batch or real-time. Center: Data Lakehouse, with data management and governance (process, manage and query all of your data). Capabilities: ad hoc data science, production machine learning, BI reporting & dashboarding, real-time applications. Audiences: internal teams, customers, partners. Runs on any cloud.] ----- **Reference Architecture** At the core of the Databricks Lakehouse for Retail is technology that enables retailers to avoid the trade-offs between speed and accuracy. Technology such as Delta Lake enables the Lakehouse — a new paradigm that combines the best elements of data warehouses and data lakes — to directly address these factors by enabling you to unify all of your data — structured and unstructured, batch and real-time — in one centrally managed and governed location. Once in the Lakehouse, e-commerce systems, reporting users, analysts, data scientists and data engineers can all leverage this information to serve models for applications and power real-time reporting, advanced analytics, large-scale forecasting models and more. [Figure: Reference architecture spanning edge, hybrid and cloud. Labels: replication; automatic DBs; real-time; batch; raw data (Bronze table); clean data (Silver table); refined data (Gold table); machine learning operations (tracking, registry); REST model serving; application; business applications; Power BI.] ----- ###### How it works The Lakehouse for Retail was built from the ground up to solve the needs of modern retail. 
It blends simplicity, flexibility and lower cost of ownership with best-in-industry performance. The result is differentiated capabilities that help retailers win.

| | Robust data management | Time-sensitive machine learning | Data in real time | Use all of your data | Real-time reporting |
|---|---|---|---|---|---|
| **Legacy data warehouse** | **Limited.** EDWs support the management of structured data. | **No.** EDWs must extract data and send it to a third party for machine learning. | **No.** Data warehouses are batch oriented, restricting data updates to hours or days. | **No.** Data warehouses have very limited support for unstructured data. | **No.** EDWs offer quick access to reports on old data. |
| **Legacy data lakes (Hadoop)** | **No.** Data lakes lack enterprise-class data management tools. | **No.** Data lakes are able to support large analytics, but lack the ability to meet business SLAs. | **No.** Data lakes are batch oriented. | **Yes.** Data lakes offer support for all types of data. | **No.** Data lakes were not designed for reporting, let alone real-time reporting. |
| **Lakehouse** | **Yes.** Delta and Unity Catalog offer native data management and governance of all data types. | **Yes.** The Lakehouse can scale to process the most demanding predictions within business SLAs. | **Yes.** Support for real-time streaming data. | **Yes.** Supports all types of data in a centrally managed platform. | **Yes.** Data views can be materialized, enabling front-line employees with real-time data. |

----- **Data in real time.** Retail operates in real time and so should your data. The Lakehouse offers support for streaming data from clickstream, mobile applications, IoT sensors and even real-time e-commerce and point-of-sale data. And Delta Lake enables this world-record-leading performance while maintaining support for ACID transactions. **Use all of your data.** Retailers are increasingly capturing data from mobile devices, video, images and a growing variety of other data sources. 
This data is extremely powerful in helping to improve our understanding of consumer behavior and operations. The Lakehouse for Retail enables companies to take full advantage of all types of data in a cost-efficient way, in a single unified lakehouse architecture. **Robust data management and governance.** Companies need robust data management and governance to protect sensitive data, but this was lacking from earlier big data systems. The Lakehouse offers transactional integrity with ACID compliance, detailed data security, schema enforcement, time travel, data lineage and more. Moving to a modern data architecture does not require sacrificing enterprise maturity. **High-performance predictive analytics.** Machine learning models, such as demand forecasting or recommendation engines, can be run in hours without compromising accuracy. The Lakehouse can scale to support tens of millions of predictions in tight windows, unlocking critical and time-sensitive analytics such as allocating inventory, optimizing load tenders and logistics, calculating item availability and out-of-stocks, and delivering highly personalized predictions. **Value with Databricks** By using Databricks to build and support your lakehouse, you can empower your business with even more speed, agility and cost savings. The flexibility of the Databricks Lakehouse Platform means that you can start with the use case that will have the most impact on your business. As you implement the pattern, you will find that you’re able to tackle use cases quicker and more easily than before. To get you started, this guidebook contains the use cases we most commonly see across the Retail and Consumer Goods industry. ----- **CHAPTER 3** ### Use Case: Real-Time Supply Chain Data **Overview** As companies see a surge in demand from e-commerce and delivery services, and seek increasing efficiencies with plant or distribution centers, real-time data is becoming a key part of the technical roadmap. 
Real-time supply chain data allows customers to deal with problems as they happen and before items are sent downstream or shipped to consumers, which is the first step in enabling a supply chain control tower. **R E L E V A N T F O R** Retail Consumer Goods Manufacturers Distributors Logistics Restaurants **Challenges** **Batch data** — existing data warehouses bring data in batch, creating a lag between when something is happening and when a customer can act on it **Complex analysis in real time** — if ingesting data in real time wasn’t a big enough challenge, companies have the added pressure to take immediate action on it **Complex maintenance** — ETL tools to bring data in batch are often complex and costly to maintain ----- **Value with the Databricks Lakehouse** Databricks has enabled real-time streaming of supply chain data across a variety of customers for specific plant operations or as part of a supply chain control tower. **Near real-time ingestion and visibility of data** — one customer experienced a 48,000% improvement in speed to data, with greater reliability **Cost-neutral** — because Delta’s efficient engine requires smaller instances, many customers report that they were able to move from batch to real-time at neutral costs **Simplified architecture and maintenance** — leveraging Delta for ingestion streamlines the pattern for real-time data ingestions. Customers frequently report that the amount of code required to support streaming ingestion is 50% less than previous solutions. **Immediate enablement of additional use cases** — customers can now prevent problems as they’re happening, predict and prevent issues, and even gain days on major changes such as production schedules between shifts **Solution overview** Databricks allows for both streaming and batch data sets to be ingested and made available to enable real-time supply chain use cases. 
Delta Lake simplifies the change data capture process while providing ACID transactions and scalable metadata handling, and unifying streaming and batch data processing. And Delta Lake supports versioning and enables rollbacks, full historical audit trails, and reproducible machine learning experiments. **Typical use case data sources include:** Supply planning, procurement, manufacturing execution, warehousing, order fulfillment, shop floor/historian data, IoT sensor, transportation management ----- **CASE STUDY** With Databricks, Gousto was able to implement real-time visibility in their supply chain. Gousto moved from daily batch updates to near real-time streaming data, utilizing Auto Loader and Delta Lake. The platform provided by Databricks has allowed Gousto to respond to increased demand during the coronavirus outbreak by providing real-time insight into performance on the factory picking lines. **CASE STUDY** As a young e-commerce company, ButcherBox needed to act nimbly to make the most of the data from its hundreds of thousands of subscribers. With Databricks Lakehouse, the company could pull 18 billion rows of data in under three minutes. Now, ButcherBox has a near real-time understanding of its customers, and can also act proactively to address any logistical and delivery issues. HOW TO GET STARTED Contact your Databricks account team to have them perform a free proof-of- concept with your real-time data. ----- **CHAPTER 4** ### Use Case: Truck Monitoring With many industries still feeling the effects of supply chain issues, being able to increase the efficiency of trucks on the road can make all the difference in getting goods into the hands of customers in a timely manner. Real-time data is making it easier for companies to get immediate insights into truck manufacturing delays, maintenance issues, supply chain issues, delivery schedules and driver safety. 
**R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics **Challenges**
- Siloed data makes it difficult to get a comprehensive understanding of fleet performance
- A lack of real-time insights can delay responses to manufacturing or supply chain issues
- Not having effective automation and AI increases the risk of human error, which can result in vehicular accidents or shipment delays

----- **Value with the Databricks Lakehouse** Databricks empowers companies to get real-time insights into their fleet performance, from manufacturing to delivery. **Near real-time insights** — the greater speed to data means a quicker response to issues and the ability to monitor driver safety more immediately **Ability to scale** — although consumer demands are constantly evolving, Databricks can handle fleet expansion without sacrificing data quality and speed **Optimizing with AI/ML** — implementing AI and ML models can lead to more effective route monitoring, proactive maintenance and reduced risk of accidents **Solution overview** Databricks enables better truck monitoring, quickly ingesting data on everything from vehicle manufacturing to route optimization. This results in a more complete and real-time view of a company’s fleet, and these analytics provide companies with the tools they need to scale and improve their operations. **Typical use case data sources include:** Supply planning, transportation management, manufacturing, predictive maintenance **CASE STUDY** With 94% of vehicular accidents attributed to human error, Embark used the Databricks Lakehouse Platform to unlock thousands of hours of recorded data from its trucks and then collaboratively analyze that data via dashboards. This has resulted in more efficient ML model training as Embark speeds toward fully autonomous trucks. HOW TO GET STARTED Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data. 
----- **CHAPTER 5** ### Use Case: Inventory Allocation **Overview** Replenishment planning is the process of determining what needs to go where. It is used by replenishment planners, distributors and consumer goods companies performing vendor-managed replenishment (VMR) or vendor-managed inventory (VMI) to make daily decisions on which product needs to be sent to which store and on what day. Replenishment is challenging for companies because it deals with rapidly changing data and the need to make complex decisions on that data in narrow service windows. Retailers need to stream in real-time sales data to signal how much of a product has been sold in order. Inaccurate sales data leads to an insufficient number of products being sent to stores. This results in lost sales and low customer satisfaction. Inventory allocation is a process that might be performed multiple times a day during peak seasons, or daily during slower seasons. Companies need the ability to scale to perform tens of millions of predictions multiple times a day — on demand and dynamically — during peak season without paying a premium for this capability throughout the year. **R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics Restaurants ----- **Challenges**
- Customers must complete tens of millions of inventory allocation predictions within tight time windows. This information is used to determine which products get put on trucks and go to specific stores.
- Traditional inventory allocation rules cause trade-offs in accuracy in order to calculate all possibilities in the service windows
- Legacy tools have rudimentary capabilities and have limited ability to consider flavors, sizes and other attributes that may be more or less popular by store

**Value with Databricks** Customers are able to complete inventory allocation models within SLAs with no trade-off for accuracy. 
 **Speed —** on average, customers moving to Databricks for demand forecasting report a double-digit improvement in forecast accuracy  **Ability to scale** and perform fine-grained (day, store, item) level allocations  **Provide more robust allocations** by incorporating causal factors that may increase demand, or include information on flavors or apparel sizes for specific stores **Solution overview** The objective of inventory allocation is to quickly determine when to distribute items and where — from warehouses and distribution centers to stores. Inventory allocation begins by looking at the consumption rate of products, the available inventory and the shipping schedules, and then using this information to create an optimized manifest of what items should be carried on which trucks, at what point, and at what time. This becomes the plan for route accounting systems that arrange deliveries. Inventory allocation also deals with trade-offs related to scarcity of items. If an item has not been available in a store for a long time, that store may receive heightened priority for the item in the allocation. ----- HOW TO GET STARTED **Typical use case data sources include:** point of sale, digital sales, replenishment data, modeled safety stock, promotions data, weather **View our webinar covering demand forecasting with Starbucks and then read our blog about** **demand forecasting.** **[Demand forecasting with causal factors.](https://www.databricks.com/blog/2020/03/26/new-methods-for-improving-supply-chain-demand-forecasting.html)** Our most popular notebook at Databricks. This blog walks you through the business and technical challenges of performing demand forecasting and explains how we approached solving it. **[On-demand webinar for demand forecasting.](https://www.databricks.com/blog/2020/02/21/on-demand-webinar-granular-demand-forecasting-at-scale.html)** Video and Q&A from our webinar with Starbucks. 
Contact your Databricks account team to have them perform a free proof-of- concept with your real-time data. **CASE STUDY** H&M turned to the Databricks Lakehouse Platform to simplify its infrastructure management, enable performant data pipelines at scale, and simplify the machine learning lifecycle. The result was a more data- driven organization that could better forecast operations to streamline costs and boost revenue. **CASE STUDY** Edmunds is on a mission to make car shopping an easy experience for all. With the Databricks Lakehouse Platform, they are able to simplify access to their disparate data sources and build ML models that make predictions off data streams. With real-time insights, they can ensure that the inventory of vehicle listings on their website is accurate and up to date, improving overall customer satisfaction. ----- **CHAPTER 6** ### Use Case: Point of Sale  and Clickstream **Overview** Disruptions in the supply chain — from reduced product supply and diminished warehouse capacity — coupled with rapidly shifting consumer expectations for seamless omnichannel experiences are driving retailers to rethink how they use data to manage their operations. Historically, point-of-sale (POS) systems recorded all in-store transactions, but were traditionally kept in a system that was physically in the store. This would result in a delay in actionable insights. And now with consumers increasingly shopping online, it’s crucial to not only collect and analyze that clickstream data quickly, but also unify it with POS data to get a complete and real-time snapshot of each customer’s shopping behavior. Near real-time availability of information means that retailers can continuously update their estimates of item availability. No longer is the business managing operations based on their knowledge of inventory states as they were a day prior, but instead is taking actions based on their knowledge of inventory states as they are now. 
**R E L E V A N T F O R** Retail E-commerce **Challenges**
- Retailers with legacy POS systems in their brick-and-mortar stores are working with siloed and incomplete sales data
- Both POS and clickstream data need to be unified and ingested in real time

----- **Value with Databricks** Databricks brings POS and clickstream data together for a unified data source that leads to real-time insights and a clearer understanding of customer behavior. **Single source of truth** — a centralized, cloud-based POS system means it can be merged with clickstream data **Near real-time insights** — the greater speed to data means businesses get the latest insights into customer purchasing behaviors and trends **Scalability** — companies can scale with Databricks to handle data from countless transactions HOW TO GET STARTED Contact your Databricks account team to have them perform a free proof-of-concept with your real-time data. ----- **CHAPTER 7** ### Use Case: On-Shelf Availability **Overview** Ensuring the availability of a product on shelf is the single largest problem in retail. Retailers globally are missing out on nearly $1 trillion in sales because they don’t have on hand what customers want to buy in their stores. Shoppers encounter out-of-stock scenarios as often as one in three shopping trips. All told, worldwide, shoppers experience $984 billion worth of out-of-stocks, $144.9 billion in North America alone, according to industry research firm IHL. In the past, if a customer faced an out-of-stock, they would most likely select a substitute item. The cost of going to another store prevented switching. Today, e-commerce loyalty members, such as those who belong to Walmart+ and Amazon Prime, are 52% more likely than other consumers to purchase out-of-stock items online. It is believed that a quarter of Amazon’s retail revenue comes from customers who first tried to buy a product in-store. 
In all, an estimated $36 billion is lost to brick-and-mortar competition, and another $34.8 billion is lost to Amazon or another e-retailer, according to IHL. On-shelf availability takes on a different meaning in pure e-commerce applications. An item can be considered in stock when it is actually in a current customer’s basket. If another customer places the same item in their basket, there is the possibility that the first customer will purchase the last available item before the second customer. This problem is exacerbated by retailers who use stores to keep inventory. In these situations, customers may order an item that is picked for delivery at a much later time. The window between ordering and picking creates the probability of out-of-stocks. On-shelf availability predicts the depletion of inventory by item, factors in safety stock levels and replenishment points, and generates a signal that suggests an item may be out of stock. This information is used to generate alerts to retail staff, distributors, brokers and consumer goods companies. Every day, tens of thousands of people around the world do work that is generated by these algorithms. The sheer volume of data used to calculate on-shelf availability prevents most companies from analyzing all of their products. Companies have between midnight and 4 AM to collect all of the needed information and run these models, which is beyond the capability of legacy data systems. Instead, companies choose the priority categories or products to analyze, which means a significant percentage of their unavailable products will not be proactively addressed. ----- One of the biggest challenges with on-shelf availability is determining when an item is actually out of stock. While some retailers are investing in computer vision and robots, and others employ the use of people to manually survey item availability, most retailers default to a signal of determining when an item has not been scanned in an acceptable time. 
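The default signal described above, flagging an item when it has not been scanned within an acceptable time, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name, the item names, the timestamps and the per-item acceptable gaps are all hypothetical, and a production system would derive the gaps from sales history and also factor in safety stock and replenishment points.

```python
from datetime import datetime, timedelta

# Hypothetical helper: flags items whose most recent point-of-sale scan is
# older than the gap considered acceptable for that item.
def flag_possible_out_of_stocks(last_scanned, acceptable_gap, now):
    '''Return a sorted list of items not scanned within their acceptable gap.'''
    flagged = []
    for item, scanned_at in last_scanned.items():
        # Fast sellers get short gaps, slow sellers get long ones.
        if now - scanned_at > acceptable_gap[item]:
            flagged.append(item)
    return sorted(flagged)

# Illustrative data only (assumed values, not taken from any real retailer).
now = datetime(2024, 1, 1, 12, 0)
last_scanned = {
    'milk': datetime(2024, 1, 1, 11, 45),
    'bread': datetime(2024, 1, 1, 6, 0),
    'batteries': datetime(2023, 12, 31, 12, 0),
}
acceptable_gap = {
    'milk': timedelta(hours=1),
    'bread': timedelta(hours=2),
    'batteries': timedelta(days=3),
}

alerts = flag_possible_out_of_stocks(last_scanned, acceptable_gap, now)
print(alerts)
```

In this sketch only the bread is flagged, because it normally sells every couple of hours but has not been scanned for six. The same gap would be unremarkable for a slow seller such as batteries, which is why the threshold is set per item.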
**R E L E V A N T F O R** Retail Consumer Goods E-commerce Direct to Consumer **Challenges** The biggest challenge to generating on-shelf availability alerts is time. Companies may receive their final sales data from the preceding day shortly after midnight. They have less than 4 hours from that point to ingest large volumes of t-log data and calculate probabilities of item availability. Most firms are encumbered by a data warehouse process that only releases data after it has been ingested and aggregates have been calculated, a process that can require multiple hours per night. For this reason, most firms make sacrifices in their analysis. They may alternate categories they analyze by different days, prioritize only high-impact SKUs, or run analysis at higher-level and less-accurate aggregate levels. Among the challenges:
- Processing large volumes of highly detailed data and running millions of models in a narrow time window
- Companies are spending hundreds of thousands of dollars annually to generate these daily alerts for a few categories
- Dealing with false positives and negatives in predictions
- Distributing information quickly and efficiently to internal systems and external partners

----- **Value with Databricks** Databricks enables customers to generate on-shelf availability (OSA) predictions at scale with no compromises.
- Delta removes the data processing bottleneck. Delta enables retailers to stream in real time or to batch process large volumes of highly detailed and frequently changing point-of-sale transaction data.
**** Easily scale to process all OSA predictions within tight service windows using Apache Spark TM **** Manage features and localize models with additional causal data to improve accuracy with MLflow **** Easily deploy information via streams, through API for mobile applications or partners, or to Delta for reporting **** Enable retailers to monetize their data by directly licensing OSA alerts **Solution overview** Databricks enables companies to perform on-shelf availability analysis without making compromises to the breadth or quality of predictions. It begins with Delta Lake — a nearly perfect platform for ingesting and managing t-log data. One of the biggest challenges in t-log data is the frequent number of changes to a transaction that can occur within a data. Delta Lake simplifies this with transaction awareness using a transaction log, and creates additional metadata for easier retrieval. Data is made available in a fraction of the time needed in data warehouse- based systems. This is why the largest retailers in the world are using Delta Lake for processing t-log data. Once data is available, users need to generate predictions about item availability on the shelf. With its extremely performant engine and the ability to distribute computation across countless nodes, Spark provides the perfect platform for calculating out-of-stocks. Customers no longer need to run in aggregate or against a subset of data. ----- **HOW TO GET STARTED** [Solution Accelerator:](https://www.databricks.com/solutions/accelerators/on-shelf-availability) [On-Shelf Availability](https://www.databricks.com/solutions/accelerators/on-shelf-availability) In this solution, we show how the Databricks Lakehouse Platform enables real-time insights to rapidly respond And lastly, data is only useful if it drives better outcomes. Databricks can write the resulting data into Delta Lake for further reporting, or to any downstream application via APIs, feeds or other integrations. 
Users can feed their predictive alerts to downstream retail operations systems or even to external partners within the tightest service windows, and in enough time to drive actions on that day. **Typical use case data sources include:** point-of-sale data, replenishment data, safety stock calculations, manual inventory data (optional), robotic or computer vision inventory data (optional) **CASE STUDY** Reckitt distributes its products to millions of consumers in over 60 countries, which was causing the organization to struggle with the complexity of forecast demand, especially with large volumes of different types of data across many disjointed pipelines. Thanks to the Databricks Lakehouse Platform, Reckitt now uses predictive analytics, product placement and business forecasting to better support neighborhood grocery stores. to demand, drive more sales by ensuring stock is available on shelf, and scale out your forecasting models to accommodate any size operation. ----- **CHAPTER 8** ### Use Case: Customer and Vehicle Identification **Overview** COVID-19 led to increased consumer demand for curbside pickup, drive-through and touchless payment options. Retailers that were able to implement these new services have been able to differentiate overall customer experiences and mitigate catastrophic hits on revenue levels. For retailers to create a seamless contactless experience for customers, they need real-time data to know when a customer has arrived and where they’re located, as well as provide updates throughout the pickup journey. And through the use of computer vision, they can capture that data by employing optical recognition on images to read vehicle license plates. Retailers can also use information captured from license plates to make recommendations on buying patterns. Looking ahead, facial recognition also has the potential to provide retailers with valuable information to better serve their customers in real time. 
**R E L E V A N T F O R** Retail Consumer Goods Drive-Through Food Retailers **Challenges** Ineffective data processing can lead to suboptimal order preparation timing Without real-time data, it can be difficult to provide customers with live updates on their order status ----- **Value with Databricks** Databricks makes it possible to not only identify customers and vehicles in real time but also provide real- time communications throughout the entire shopping and curbside or drive-through experience.  **Near real-time insights** — the greater speed to data means retailers can get the right order preparation timing  **Recommendations** — being able to quickly access and refer to data from previous visits will ensure each subsequent visit is equally as or more seamless than the last  **Optimizing with AI/ML** — implementing AI and ML models can lead to more effective geofencing, vehicle identification and order prediction **CASE STUDY** **CASE STUDY** ----- **CHAPTER 9** ### Use Case: Recommendation Engines **Overview** Customers that feel understood by a retailer are more likely to spend more per purchase, purchase more frequently with that retailer, and deliver higher profitability per customer. The way that retailers achieve this is by recommending products and services that align with customer needs. Providing an experience that makes customers feel understood helps retailers stand out from the crowd of mass merchants and build loyalty. This was true before COVID, but shifting consumer preferences make this more critical than ever for retail organizations. With research showing the cost of customer acquisition is as much as five times as retaining existing ones, organizations looking to succeed in the new normal must continue to build deeper connections with existing customers in order to retain a solid consumer base. There is no shortage of options and incentives for today’s consumers to rethink long-established patterns of spending. 
Recommendation engines are used to create personalized experiences for users across retail channels. These recommendations are generated based on the data collected from purchases, items interacted with, users’ behavior across physical and digital channels, and other data such as from customer service interactions and reviews. Leveraging a Customer 360 architecture that collects all user clickstream and behavioral data, marketers are able to create recommendations that are integrated with other business objectives such as highlighting items that are on promotion or product availability. Creating recommendations is not a monolithic activity. Recommendation engines are used to personalize the customer experience in every possible area of consumer engagement, from proactive notifications and offers, to landing page optimization, suggested products, automated shipment recommendations, cross-sell and upsell, and even suggestions for complementary items after the purchase. ----- **R E L E V A N T F O R** Retail E-commerce Direct to Consumer Media Telecom Financial Services (any B2B or B2C company) **Challenges** Recommendation engines are very difficult to do well. Many companies use off-the-shelf recommenders, but traditional off-the-shelf systems suffer from high rates of inaccuracy. In our analysis, we found general recommenders with 29% variance, meaning that of every 10 recommendations delivered, 3 would be irrelevant. **Massive volumes of highly detailed and frequently changing data.** Recommendation accuracy is improved by having recent data, and yet most systems struggle to handle the large volumes of information involved. **Creating a 360 view of the customer.** Identity and being able to stitch together all customer touchpoints in one place are critical to enabling this use case. More data, including transaction and clickstream data, is critical for driving accuracy and precision in messaging. 
**Processing speed.** Retailers need to be able to frequently refresh models based on constantly changing dynamics, and deliver real-time recommendations via APIs. **Automation.** This is an “always-on” use case where automation is essential for scalability and responsiveness based on frequent model updates. ----- Many firms choose to use recommender systems from Amazon or Google. Using these systems trains the general recommendation engine in a way that helps competitors improve the accuracy of their own recommendations. **Value with Databricks** Recommendations are one of the most critical capabilities that a retailer maintains. This is a capability that retailers must own, and Databricks provides a solid platform for enabling this. Using Databricks as the foundation for their Customer 360 architecture to deliver omnichannel personalization, sample value metrics from a media agency include: **200% ROI for 70% of retailers** engaging in advanced personalization **10% improvement** in conversions **35% improvement** in purchase frequency **37% improvement** in customer lifetime value **Solution overview** Recommendations are only as good as the data that powers them. Delta Lake provides the best platform for capturing and managing huge volumes of highly atomic and frequently changing data. It allows organizations to combine various sources of data in a timely and efficient manner, from transactions, demographics and preference information across products, to clickstream, digital journey and marketing analytics data to bring a 360 view of customer interactions to enable omnichannel personalization. By identifying changes in user behavior or engagement, retailers are able to detect early signals that indicate a propensity to buy or a change in preferences, and recommend products and services that will keep consumers engaged. 
----- **Typical use case data sources include:** Customer 360 data, CRM, loyalty data, transaction data, clickstream data, mobile data: **Engagement data** — transaction log data, clickstream data, promotion interaction **Identity** — loyalty data, person ID, device ID, email, IP address, name, gender, income, presence of children, location **User lifecycle** — subscription status, payment history, cost of acquisition, lifetime value, propensity to churn **CASE STUDY** For Wehkamp to provide the best shopping experience for their customers, they turned to Databricks for help with their data analytics and machine learning needs, resulting in a highly engaging web shop personalized to each of their customers. **CASE STUDY** Columbia’s legacy ETL was unable to support batch and real-time use cases at scale. After migrating to Databricks, the company is now able to more efficiently and reliably work with its data, resulting in smarter business decisions. **CASE STUDY** Pandora wanted to drive stronger online engagement with their customers, so they used the Databricks Lakehouse Platform to create more personalized experiences and boost both click-to-open rates and quarterly revenue. HOW TO GET STARTED Databricks has created [four](https://www.databricks.com/solutions/accelerators/recommendation-engines) [Recommendation Engine accelerators,](https://www.databricks.com/solutions/accelerators/recommendation-engines) with content-based and collaborative filter methods, and both item- and user-based analysis. These accelerators have been further refined to be highly performant to enable frequent retraining of models. To begin working on recommendation engines, contact your Databricks account team. ----- **CHAPTER 10** ### Use Case: Perpetual Inventory **Overview** With the rapid adoption of digital channels for retail, staying on top of your inventory is crucial to meeting customer demand. 
As a result, the periodic inventory system is now outdated — instead, using a perpetual inventory model allows businesses to perform immediate and real-time tracking of sales and inventory levels. This has the added benefit of reducing labor costs and human error, ensuring that you always have an accurate overview of your inventory and can better forecast demand to avoid costly stockouts. The key to building a perpetual inventory system is real-time data. By capturing real-time transaction records related to sold inventory, retailers can make smarter inventory decisions that streamline operations and lower overall costs. **R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics Supply Chain Inventory Management **Challenges** **** Companies need to scale to handle ever-increasing inventory and the data associated with the products **** Data needs to be ingested and then processed in real time (or near real-time) to provide a truly accurate view of inventory ----- HOW TO GET STARTED Contact your Databricks account team to have them perform a free proof-of- concept with your real-time data. **Value with Databricks** Databricks enables real-time inventory updates, giving businesses the insights they need to properly manage inventory and to forecast more accurately. **Near real-time insights** — the greater speed to data means inventory is automatically updated with the latest sales data **Detailed records** — with all inventory updates and movements being tracked as they happen, companies know they’re getting the most accurate information at any point **Optimizing with AI/ML** — using AI and ML can help with forecasting demand and reducing inventory management costs ----- **CHAPTER 11** ### Use Case: Automated  Replenishments **Overview** Customers favor convenience more than ever when it comes to their goods, and automated replenishments help meet that need. 
Whether it’s through a connected device or smartphone app, real-time data plays a key role in ensuring consumers get a refill automatically delivered at the right time. On the manufacturing side, this real-time data can also help with vendor-managed replenishment (VMR), reducing the time needed to forecast, order and receive thousands of items. **R E L E V A N T F O R** Retail Consumer Goods Distributors Logistics Direct to Customer **Challenges** **** Being able to ingest large amounts of data quickly is crucial to actually fulfilling the replenishment orders With VMR, there may be a disconnect between the vendor and customer, resulting in a forecast for replenishment even when the customer can’t fulfill that order ----- HOW TO GET STARTED Contact your Databricks account team to have them perform a free proof-of- concept with your real-time data. **Value with Databricks** Databricks enables real-time inventory updates, giving businesses the insights they need to properly manage inventory and to forecast more accurately. **Near real-time insights** — the greater speed to data means businesses can stay on top of customer needs **Scalability** — companies can scale with Databricks to handle thousands of SKUs, each with its own unique properties and expiry dates **Optimizing with AI/ML** — using AI and ML can lead to better forecasting and predictions ----- **CHAPTER 12** ### Use Case: Fresh Food Forecasting **Overview** Fresh food typically accounts for up to 40% of revenue for grocers, and plays an important role in driving store traffic. But fresh food is also incredibly complex to manage — prices can be volatile, there is a wide range of suppliers to work with and the products expire, which creates significant amounts of waste. 
In order to avoid losing significant revenue, businesses need to properly forecast when food is nearing its sell-by date, the current levels of customer demand (also taking into account seasonality), and the proper timing for replenishing food stock. Being able to tap into real-time data is key to staying on top of the ever- changing needs around fresh food. **R E L E V A N T F O R** Retail E-commerce Distributors Logistics Restaurants **Challenges** **** Because of the perishable nature of fresh food, customers need to be able to ingest data quickly enough to conduct daily forecasting and daily replenishment **** Customers are running aggregate-level forecasts, which are less accurate than fine-grained forecasting **** Customers are forced to compromise on what they can analyze ----- HOW TO GET STARTED Contact your Databricks account team to get started with inventory allocation. Databricks does not have a Solution Accelerator. View our webinar covering demand forecasting with Starbucks and then read our blog about demand forecasting. [Fine-grained time series forecasting at scale.](https://www.databricks.com/blog/2021/04/06/fine-grained-time-series-forecasting-at-scale-with-facebook-prophet-and-apache-spark-updated-for-spark-3.html) This blog details the importance of time series forecasting, walks through building a simple model to show the use of Facebook Prophet, and then shows off the combination of Facebook Prophet and Adobe Spark to scale to hundreds of models. [On-demand webinar for demand forecasting.](https://www.databricks.com/blog/2020/02/21/on-demand-webinar-granular-demand-forecasting-at-scale.html) Video and Q&A from our webinar with Starbucks **Value with Databricks** Customers average double-digit improvement in forecast accuracy, leading to a reduction in lost sales and in spoiled products, as well as lower inventory and handling costs. 
**Improved accuracy** — on average, customers moving to Databricks for demand forecasting report a double-digit improvement in forecast accuracy **�Ability to scale and perform fine-grained (day, store, item) level forecasts** — rapidly scale to tens of millions of model iterations in narrow service windows. Companies need accurate demand forecasts in a few hours. **Eliminate compromises on what to analyze** — customers do not need to select winners or losers among the products they forecast. They can predict demand for all products as frequently as required. **Solution overview:** Databricks is well suited to handling forecasting for fresh food at scale. Forecasting begins with the Databricks Solution Accelerator. It enables companies to rapidly build fine-grained forecasting of items — forecasting that can be efficiently scaled to tens of millions of predictions in tight service windows. **Typical use case data sources include:** historic point-of-sale data, shipment data, promotions, pricing, expiration dates and weather. **CASE STUDY** ButcherBox faced the complex challenges of securing inventory with enough lead time, meeting highly variable customer order preferences and unpredictable customer sign-ups, and managing delivery logistics. With Databricks, the company was able to create a predictive solution to adapt quickly and integrate tightly with the rest of its data estate. on demand forecasting. **CASE STUDY** Sam’s Club needed to build out an enterprise-scale data platform to handle the billions of transactions and trillions of events going through the company. Find out how Databricks became a key component in the shift from on premises Hadoop clusters to a cloud based platform ----- **CHAPTER 13** ### Use Case: Propensity-to-Buy **Overview** Customers often have repeatable purchase patterns that may not be noticed upon initial observation. 
While we know that commuting office workers are likely to purchase coffee at a coffee shop on weekday mornings, do we understand why they visit on Thursday afternoons? And more importantly, how do we predict these buying moments when customers are not in our stores? The purpose of a propensity-to-buy model is to predict when a customer is predisposed to make a purchase and subsequently act on that information by engaging customers. Traditional propensity-to-buy models leveraged internal sales and loyalty data to identify patterns of consumption. These models are useful, but are limited in understanding the full behavior of customers. More advanced propensity-to-buy models are now incorporating alternative data sets to identify trips to competing retailers, competitive scan data from receipts, and causal data that helps to explain when and why customers make purchases. Propensity-to-buy models create a signal that is sent to downstream systems such as those for promotion management, email and mobile alerts, recommendations and others. **R E L E V A N T F O R** Retail E-commerce Direct to Consumer ----- **Challenges** **** Customers do not want to be inundated with messages from retailers. Companies need to limit their outreach to customers to avoid angering them. Companies need to traverse and process vast sums of customer data and generate probabilities of purchase frequently Companies need to look at external data that helps build a propensity-to-buy model that captures the full share of the customer wallet. They need to quickly test and incorporate additional data that improves the accuracy of their models. 
**Value with Databricks** **** Databricks allows companies to efficiently traverse huge volumes of customer data over time, and efficiently synthesize this into data for analysis **** Companies need to traverse and process vast sums of customer data and generate probabilities of purchase frequency **** Companies need to look at external data that helps build a propensity-to-buy model that captures the full share of the customer wallet. They need to quickly test and incorporate additional data that improves the accuracy of their models. **Solution overview:** Propensity-to-buy analytics determine the signals that indicate the probability a customer is in a buying moment. Historic propensity models relied on sales data to identify buying patterns, but newer approaches are incorporating behavioral data. Proximity to a coffee shop might push a consumer over the threshold of a buying moment. Traditional, batch-oriented operations are insufficient to solve this problem. If you wait until that night, or even later in the day you have lost the opportunity to act ----- **HOW TO GET STARTED** To begin working on propensity-to- buy, leverage our [Propensity Scoring](https://www.databricks.com/solutions/accelerators/propensity-scoring) [Solution Accelerator](https://www.databricks.com/solutions/accelerators/propensity-scoring) With the propensity to buy, speed becomes a critical force in determining key inflection points. Databricks enables marketers to ingest data in real time and update probabilities. Lightweight queries can be automated to refresh models, and the resulting data can be fed automatically to downstream promotions, web or mobile systems, where the consumer can be engaged. As this data is streamed into Delta Lake, data teams can quickly capture the data for broader analysis. Calculating a propensity to buy requires traversing interactions that are episodic in nature, and span broad periods of time. 
Delta Lake helps simplify this with scalable metadata handling, ACID transactions and data skipping. Delta Lake even manages schema evolution to provide users with flexibility as their needs evolve. **Typical use case data sources include:** point-of-sale data with tokens, loyalty data, e-commerce sales data, mobile application data, competitive scan or receipt data (optional), place of interest data (optional) ----- **CHAPTER 14** ### Use Case: Next Best Action **Overview** The e-commerce boom over the last couple of years has given consumers ample choice for digital shopping options. If your business isn’t engaging customers at every point in their purchasing journey, you risk losing them to a competitor. By applying AI/ML to automatically determine — in real time — the next best action for customers, you can greatly increase your conversion rates. **R E L E V A N T F O R** Retail Consumer Goods Direct to Consumer E-commerce **Challenges** Siloed data makes it difficult to create an accurate and comprehensive profile of each customer, resulting in suboptimal recommendations for the next best action Companies need to ingest large amounts of data in real time and then take action on it immediately Many businesses still struggle with training their ML models to properly determine the next best action (and self-optimize based on the results) ----- **HOW TO GET STARTED** To begin working on propensity-to- buy, leverage our [Propensity Scoring](https://www.databricks.com/solutions/accelerators/propensity-scoring) [Solution Accelerator](https://www.databricks.com/solutions/accelerators/propensity-scoring) **Value with Databricks:** Databricks provides all the tools needed to **process large volumes of data and find the next best** **action** at any given point in the customer journey **Near real-time insights** — the greater speed to data means businesses can react immediately to customer actions **Single source of truth** — break down data silos by unifying all of a 
company’s customer data (including basic information, transactional data, online behavior/purchase history, and more) to get a complete customer profile **Optimizing with AI/ML** — use AI to create self-optimizing ML models that are trained to find the best next step for customers ----- **CHAPTER 15** ### Customers That Innovate With Databricks Lakehouse for Retail Some of the top retail and consumer packaged goods companies in the world turn to Databricks Lakehouse for Retail to deliver real-time experiences to their customers. Today, data is at the core of every innovation in the retail and consumer packaged goods industry. Databricks Lakehouse for Retail enables companies across every sector of retail and consumer goods to harness the power of real-time data and analytics to solve strategic challenges and deliver more engaging experiences to customers. Get started with a free trial of Lakehouse for Retail and start building better data applications today. **[Start your free trial](https://databricks.com/try-databricks)** Contact us for a personalized demo at: [databricks.com/contact](http://databricks.com/contact ) ----- ###### About Databricks Databricks is the data and AI company. More than 7,000 organizations worldwide — including Comcast, Condé Nast, H&M and over 40% of the Fortune 500 — rely on the Databricks Lakehouse Platform to unify their data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe. Founded by the original creators of Apache Spark™, Delta Lake and MLflow, Databricks is on a mission to help data teams solve the world’s toughest problems. To learn more, follow Databricks on [Twitter](https://twitter.com/databricks) , [LinkedIn](https://www.linkedin.com/company/databricks/) and [Facebook](https://www.facebook.com/databricksinc/) . 
**[Sign up for a free trial](https://databricks.com/try-databricks)** -----",SUCCESS,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/lakehouse_for_retail-082922.pdf,2024-09-19T16:57:21Z


#### ✅✏️ Run the synthetic evaluation data generation

You can optionally provide guidelines to steer the synthetic data generation. By default, guidelines are not applied; to apply them, uncomment `guidelines=guidelines` in the `generate_evals_df(...)` call. See our [documentation](https://docs.databricks.com/en/generative-ai/agent-evaluation/synthesize-evaluation-set.html) for more details.
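If you prefer to keep the guideline sections as structured values rather than editing one long string, the sketch below shows one possible way to assemble them. `build_guidelines` and `example_guidelines` are hypothetical names used for illustration only; they are not part of `databricks.agents`. The section titles follow the suggested formatting used in the next cell, and the result is still just a free-form string.

```python
def build_guidelines(task_description, personas, example_questions, additional_guidelines):
    """Assemble a free-form guidelines string from structured parts.

    Hypothetical helper for illustration; not part of any Databricks library.
    """
    def bullets(items):
        # Render a list of strings as markdown bullet points.
        return "\n".join(f"- {item}" for item in items)

    sections = [
        ("Task Description", task_description.strip()),
        ("User personas", bullets(personas)),
        ("Example questions", bullets(example_questions)),
        ("Additional Guidelines", bullets(additional_guidelines)),
    ]
    # Skip empty sections so the prompt does not contain blank headings.
    return "\n\n".join(f"# {title}\n{body}" for title, body in sections if body)


example_guidelines = build_guidelines(
    task_description="The Agent is a RAG chatbot that answers questions about using Spark on Databricks.",
    personas=["A developer who is new to the Databricks platform"],
    example_questions=["Which cluster settings will give me the best performance when using Spark?"],
    additional_guidelines=["Questions should be succinct, and human-like"],
)
print(example_guidelines)
```

A string built this way can be passed wherever a hand-written guidelines string would be used. Because it prompts an LLM, expect to iterate on the wording either way.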

In [0]:
from databricks.agents.evals import generate_evals_df

# NOTE: The guidelines you provide are a free-form string. The markdown below is the suggested formatting for the guidelines, but you are free
# to add your own sections. Because these guidelines prompt the LLM that generates the synthetic data, you may need to iterate on them before
# you get the results you want.
guidelines = """
# Task Description
The Agent is a RAG chatbot that answers questions about using Spark on Databricks. The Agent has access to a corpus of Databricks documents, and its task is to answer the user's questions by retrieving the relevant docs from the corpus and synthesizing a helpful, accurate response. The corpus covers a lot of info, but the Agent is specifically designed to interact with Databricks users who have questions about Spark. So questions outside of this scope are considered irrelevant.

# User personas
- A developer who is new to the Databricks platform
- An experienced, highly technical Data Scientist or Data Engineer

# Example questions
- what API lets me parallelize operations over rows of a delta table?
- Which cluster settings will give me the best performance when using Spark?

# Additional Guidelines
- Questions should be succinct, and human-like
"""

synthesized_evals_df = generate_evals_df(
    docs=source_documents,
    # The total number of evaluations to generate across all of the docs.
    num_evals=10,
    # An optional set of guidelines that steers the synthetic generation. This is a free-form string that is used to prompt the generation.
    # guidelines=guidelines
)

# Write the synthetic evaluation data to the evaluation set table
spark.createDataFrame(synthesized_evals_df).write.format("delta").mode("append").saveAsTable(agent_storage_config.evaluation_set_uc_table)

# Display the synthetic evaluation data
eval_set_df = spark.table(agent_storage_config.evaluation_set_uc_table)
display(eval_set_df.toPandas())

Generating evaluations:   0%|          | 0/10 evals generated [Elapsed: 00:00, Remaining: ?]

request_id,request,expected_retrieved_context,expected_facts,source_type,source_id
de1daac1a320379ce055bdc8b8342a2d7ca8d1ea08483081801f8219f41dc69d,"List(List(List(What percentage of consumers, according to a McKinsey study, are more likely to consider buying from a brand that personalizes the shopping and user experience?, user)))","List(List(“In today’s experience-driven world, the most beloved brands are the ones that know their customers. Customers are loyal to brands that recognize their needs and preferences — and tailor user journeys and engagements accordingly. A study from McKinsey shows 76% of consumers are more likely to consider buying from a brand that personalizes the shopping and user experience to the wants and needs of the customer. And as organizations pursue omnichannel excellence, these same high expectations of online experiences also extend to brick-and-mortar locations — revealing for many merchants that personalized engagement is fundamental to attracting customers and expanding share of wallet. But achieving a 360-degree view of your customers to serve personalized experiences requires integrating various types of data — including demographics, behavioral and transactional — to develop robust profiles. This guide focuses on six actionable strategic pillars for businesses to leverage automation, real-time data, AI-driven analysis and well-tuned ML models to architect and deliver customized customer experiences at every touch point.”, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Databricks-Customer-360-ebook-Final.pdf))",List(76% of consumers are more likely to consider buying from a brand that personalizes the shopping and user experience.),SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/Databricks-Customer-360-ebook-Final.pdf
4b452a4426892dea5c35302c50dc70d62c0b2993f478af59a42b59d7c258bfa0,"List(List(List(What are two key challenges mentioned for predictive maintenance in government agencies?, user)))","List(List(##### Overview **Integrating unstructured data** Equipment data doesn’t just come in the form of IoT data. Agencies can gather rich unstructured signals like audio, visual (e.g., video inspections) and text (e.g., maintenance logs). Most legacy data architectures are unable to integrate structured and unstructured data sources. **Operationalizing machine learning** Most agencies lack the advanced analytics tools needed to build models that can predict potential equipment failures. Those that do typically have their data scientists working in a siloed set of tools, resulting in unnecessary data replication and inefficient workflows., /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-and-ai-use-cases-for-the-public-sector.pdf))","List(Difficulty integrating structured and unstructured data sources due to legacy data architectures., Inefficient workflows caused by a lack of advanced analytics tools and siloed environments for data scientists.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-and-ai-use-cases-for-the-public-sector.pdf
6d1c05783fb5945cc9b121919eabdc2194c9c64809821e3c30b7f758a4d12a40,"List(List(List(What percentage of specialized Python libraries in the data set are associated with natural language processing (NLP), and what are some of the tasks enabled by NLP?, user)))","List(List(``` Our most popular use case is natural language processing (NLP), a rapidly growing field that enables businesses to gain value from unstructured textual data. This opens the door for users to accomplish tasks that were previously too abstract for code, such as summarizing content or extracting sentiment from customer reviews. In our data set, 49% of libraries used are associated with NLP. LLMs also fall within this bucket. Given the innovations launched in recent months, we expect to see NLP take off even more in coming years as it is applied to use cases like chatbots, research assistance, fraud detection, content generation and more. ```, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks-2023-state-of-data-report-06072023-v2_0.pdf))","List(49% of specialized Python libraries in the data set are associated with NLP., Examples of tasks enabled by NLP include summarizing content, extracting sentiment from customer reviews, chatbots, research assistance, fraud detection, and content generation.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks-2023-state-of-data-report-06072023-v2_0.pdf
8fc168f55c01c3d4059869879a9e54e8601faef19e46f011ac239c44dbe72f40,"List(List(List(Why is real-time data crucial for retail operations, and what problems do legacy systems cause?, user)))","List(List(“Retailers need real-time data to support these decisions, but legacy systems are limited to data that’s hours or days old. When seconds matter, only the Lakehouse delivers better decisions [...] most retailers still rely on legacy data systems, which impedes their ability to scale these innovations. Unfortunately, most legacy systems are only able to process information in hours or days. The delays caused by waiting for data are leading to significant risks and costs for the industry.”, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/lakehouse_for_retail-082922.pdf))","List(Real-time data enables immediate decision-making., Real-time data enables better decision-making in critical moments., Legacy systems process outdated data., Legacy systems cause delays., Legacy systems lead to risks for the retail industry., Legacy systems lead to costs for the retail industry.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/lakehouse_for_retail-082922.pdf
66725804819c75f5e3005072cb81414f01272d64b1b0a8ea89a58392599b1ff7,"List(List(List(What are the key features and advantages of the lakehouse pattern?, user)))","List(List(“The lakehouse pattern represents a paradigm shift from traditional on-premises data warehouse systems that are expensive and complex to manage. It uses an open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID semantics of data warehouses. A lakehouse pattern enables data transformation, cleansing, and validation to support both business intelligence and machine learning (ML) users on all data. Lakehouse is cloud-centric and unifies a complete up-to-date data set for teams, allowing collaboration across an organization.”, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/guide-evolve-your-data-warehouse-to-the-lakehouse-v3.pdf))","List(The lakehouse pattern has an open data management architecture., It combines data lakes and data warehouses, offering flexibility and scale along with data management and ACID semantics., It supports data transformation, cleansing, and validation., The lakehouse pattern is cloud-centric., It enhances support for both business intelligence and machine learning., It is cost-efficient., It offers an up-to-date unified data set., It improves collaboration across the organization.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/guide-evolve-your-data-warehouse-to-the-lakehouse-v3.pdf
1373db51df7476c934e04796eaceed4d4475d7b7a70efcb3405b121c71e96923,"List(List(List(What is game telemetry, and what primary metrics are tracked in game telemetry according to the text?, user)))","List(List(Game telemetry refers to the data collected about player behavior and interactions within a video game. The primary data source is the game engine. And the goal of game telemetry is to gather information that can help game developers understand player behavior and improve the overall game experience. Some of the primary metrics that are typically tracked in game telemetry include: - **Player engagement:** Track the amount of time players spend playing the game, and their level of engagement with different parts of the game. - **Game progress:** Monitor player progress through different levels and milestones in the game. - **In-game purchases:** Track the number and value of in-game purchases made by players. - **Player demographics:** Collect demographic information about players, such as age, gender, location, and device type. - **Session length:** Monitor the length of each player session, and how often players return to the game. - **Retention:** Track the percentage of players who return to the game after their first session. - **User Acquisition:** Track the number of new players acquired through different marketing channels., /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks_ultimate_gaming_data_guide_2023.pdf))","List(Game telemetry is data collected about player behavior and interactions within a video game., The data is primarily sourced from the game engine., Primary metrics tracked in game telemetry include:  - player engagement  - game progress  - in-game purchases  - player demographics  - session length  - retention  - user acquisition)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/databricks_ultimate_gaming_data_guide_2023.pdf
3b231daee5434db054e2ee8b4aee9b4edba19aa8886c0d491daa1b36b743142f,"List(List(List(What are some of the common problems faced by data lakes according to the document?, user)))","List(List(**Challenges with data lakes** Data lakes are a common element within modern data architectures. They serve as a central ingestion point for the plethora of data that organizations seek to gather and mine. While a good step forward in getting to grips with the range of data, they run into the following common problems: **1. Reading and writing into data lakes is not reliable.** Data engineers often run into the problem of unsafe writes into data lakes that cause readers to see garbage data during writes. They have to build workarounds to ensure readers always see consistent data during writes. **2. The data quality in data lakes is low.** Dumping unstructured data into a data lake is easy, but this comes at the cost of data quality. Without any mechanisms for validating schema and the data, data lakes suffer from poor data quality. As a consequence, analytics projects that strive to mine this data also fail. **3. Poor performance with increasing amounts of data.** As the amount of data that gets dumped into a data lake increases, the number of files and directories also increases. Big data jobs and query engines that process the data spend a significant amount of time handling the metadata operations. This problem is more pronounced in the case of streaming jobs or handling many concurrent batch jobs. **4. Modifying, updating or deleting records in data lakes is hard.** Engineers need to build complicated pipelines to read entire partitions or tables, modify the data and write them back. 
Such pipelines are inefficient and hard to maintain., /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/030521-2-The-Delta-Lake-Series-Complete-Collection.pdf))","List(Unreliable reading and writing operations, Low data quality due to the lack of validation mechanisms, Poor performance with increasing data volume, Difficulty in modifying, updating, or deleting records)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/030521-2-The-Delta-Lake-Series-Complete-Collection.pdf
9673989eb3b8242fc0a48d6338f31191260dd7cf6c7eacb26f2ed1512af803a2,"List(List(List(What new opportunities can data sharing create for organizations looking to generate additional revenue?, user)))","List(List(**Key benefits of data sharing** As you can see from the use cases described above, there are many benefits of data sharing, including: **Greater collaboration with existing partners.** In today’s hyper-connected digital economy, no single organization can advance its business objectives without partnerships. Data sharing helps solidify existing partnerships and can help organizations establish new ones. **Ability to generate new revenue streams.** With data sharing, organizations can generate new revenue streams by offering data products or data services to their end consumers. **Ease of producing new products, services or business models.** Product teams can leverage both first-party data and third-party data to refine their products and services and expand their product/service catalog. **Greater efficiency of internal operations.** Teams across the organization can meet their business goals far more quickly when they don’t have to spend time figuring out how to free data from silos. When teams have access to live data, there’s no lag time between the need for data and the connection with the appropriate data source., /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/a-new-approach-to-data-sharing-2nd-edition-databricks.pdf))","List(Data sharing can enable organizations to offer data products., Data sharing can enable organizations to offer data services.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/a-new-approach-to-data-sharing-2nd-edition-databricks.pdf
21866cbed9a5ba0daafc9367a06f6679f7e6290dd05b59cfd45d36fdbc8fbe73,"List(List(List(Why is it recommended to use larger clusters for workloads in Databricks, and how does this affect cost efficiency?, user)))","List(List(**EBOOK** ## The Big Book of Data Engineering 2nd Edition A collection of technical blogs, including code samples and notebooks ##### With all-new content ----- #### Contents **SECTION 1** **Introduction to Data Engineering on Databricks** **SECTION 2** **Guidance and Best Practices** **2.1** Top 5 Databricks Performance Tips **2.2** How to Profile PySpark **2.3** Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka **2.4** Streaming in Production: Collected Best Practices **2.5** Streaming in Production: Collected Best Practices, Part 2 **2.6** Building Geospatial Data Products **2.7** Data Lineage With Unity Catalog **2.8** Easy Ingestion to Lakehouse With COPY INTO **2.9** Simplifying Change Data Capture With Databricks Delta Live Tables **2.10** Best Practices for Cross-Government Data Sharing **SECTION 3** **Ready-to-Use Notebooks and Data Sets** **SECTION 4** **Case Studies** **4.1** Akamai **4.2** Grammarly **4.3** Honeywell **4.4** Wood Mackenzie **4.5** Rivian **4.6** AT&T
----- **SECTION** # 01 ### Introduction to Data Engineering on Databricks ----- Organizations realize the value data plays as a strategic asset for various business-related initiatives, such as growing revenues, improving the customer experience, operating efficiently or improving a product or service. However, accessing and managing data for these initiatives has become increasingly complex. Most of the complexity has arisen with the explosion of data volumes and data types, with organizations amassing an estimated [80% of data in unstructured and semi-structured format](https://www.forbes.com/sites/forbestechcouncil/2019/01/29/the-80-blind-spot-are-you-ignoring-unstructured-organizational-data/?sh=681651dc211c). As the collection of data continues to increase, 73% of the data goes unused for analytics or decision-making. In order to try and decrease this percentage and make more data usable, data engineering teams are responsible for building data pipelines to efficiently and reliably deliver data.
But the process of building these complex data pipelines comes with a number of difficulties: **•** In order to get data into a data lake, data engineers are required to spend immense time hand-coding repetitive data ingestion tasks **•** Since data platforms continuously change, data engineers spend time building and maintaining, and then rebuilding, complex scalable infrastructure **•** As data pipelines become more complex, data engineers are required to find reliable tools to orchestrate these pipelines **•** With the increasing importance of real-time data, low latency data pipelines are required, which are even more difficult to build and maintain **•** Finally, with all pipelines written, data engineers need to constantly focus on performance, tuning pipelines and architectures to meet SLAs **How can Databricks help?** With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting, transforming, processing, scheduling and delivering data. The Lakehouse Platform automates the complexity of building and maintaining pipelines and running ETL workloads directly on a data lake so data engineers can focus on quality and reliability to drive valuable insights. [Figure 1: The Databricks Lakehouse Platform unifies your data, analytics and AI on one common platform for all your data use cases. Diagram layers: one platform to support multiple personas (BI & Data Warehousing, Data Engineering, Data Streaming, Data Science & ML); Unity Catalog (fine-grained governance for data and AI); Delta Lake (data reliability and performance); Cloud Data Lake (all raw data: logs, texts, audio, video, images).] ©2023 Databricks Inc. — All rights reserved ----- **Key differentiators for successful data engineering with Databricks** By simplifying on a lakehouse architecture, data engineers need an enterprise-grade and enterprise-ready approach to building data pipelines. 
To be successful, a data engineering solution team must embrace these eight key differentiating capabilities: **Data ingestion at scale** With the ability to ingest petabytes of data with auto-evolving schemas, data engineers can deliver fast, reliable, scalable and automatic data for analytics, data science or machine learning. This includes: **•** Incrementally and efficiently processing data as it arrives from files or streaming sources like Kafka, DBMS and NoSQL **•** Automatically inferring schema and detecting column changes for structured and unstructured data formats **•** Automatically and efficiently tracking data as it arrives with no manual intervention **•** Preventing data loss by rescuing data columns **Declarative ETL pipelines** Data engineers can reduce development time and effort and instead focus on implementing business logic and data quality checks within the data pipeline using SQL or Python. This can be achieved by: **•** Using intent-driven declarative development to simplify “how” and define “what” to solve **•** Automatically creating high-quality lineage and managing table dependencies across the data pipeline **•** Automatically checking for missing dependencies or syntax errors, and managing data pipeline recovery **Real-time data processing** Allow data engineers to tune data latency with cost controls without the need to know complex stream processing or implement recovery logic. 
**•** Avoid handling batch and real-time streaming data sources separately **•** Execute data pipeline workloads on automatically provisioned elastic Apache Spark™-based compute clusters for scale and performance **•** Remove the need to manage infrastructure and focus on the business logic for downstream use cases ----- **Unified orchestration of data workflows** Simple, clear and reliable orchestration of data processing tasks for data, analytics and machine learning pipelines with the ability to run multiple non-interactive tasks as a directed acyclic graph (DAG) on a Databricks compute cluster. Orchestrate tasks of any kind (SQL, Python, JARs, Notebooks) in a DAG using Databricks Workflows, an orchestration tool included in the lakehouse with no need to maintain or pay for an external orchestration service. **•** Easily create and manage multiple tasks with dependencies via UI, API or from your IDE **•** Have full observability to all workflow runs and get alerted when tasks fail for fast troubleshooting and efficient repair and rerun **•** Leverage high reliability of 99.95% uptime **•** Use performance optimization clusters that parallelize jobs and minimize data movement with cluster reuse **Data quality validation and monitoring** Improve data reliability throughout the data lakehouse so data teams can confidently trust the information for downstream initiatives by: **•** Defining data quality and integrity controls within the pipeline with defined data expectations **•** Addressing data quality errors with predefined policies (fail, drop, alert, quarantine) **•** Leveraging the data quality metrics that are captured, tracked and reported for the entire data pipeline
[Figure 2: A unified set of tools for real-time data processing. The diagram shows data sources (data warehouses, on-premises systems, SaaS applications, machine and application logs, application events, mobile and IoT data, cloud storage, message buses) flowing into the Lakehouse Platform (Workflows for end-to-end orchestration; real-time analytics with Databricks SQL; real-time machine learning with Databricks ML; streaming ETL with Delta Live Tables; real-time applications with Spark Structured Streaming; Photon for lightning-fast data processing; Unity Catalog for data governance and sharing; Delta Lake for open and reliable data storage) and feeding real-time BI, AI and operational apps (examples shown: predictive maintenance, personalized offers, patient diagnostics, alerts, fraud detection, dynamic pricing).]
----- **Fault tolerant and automatic recovery** Handle transient errors and recover from most common error conditions occurring during the operation of a pipeline with fast, scalable automatic recovery that includes: **•** Fault tolerant mechanisms to consistently recover the state of data **•** The ability to automatically track progress from the source with checkpointing **•** The ability to automatically recover and restore the data pipeline state **Data pipeline observability** Monitor overall data pipeline status from a dataflow graph dashboard and visually track end-to-end pipeline health for performance, quality and latency. Data pipeline observability capabilities include: **•** A high-quality, high-fidelity lineage diagram that provides visibility into how data flows for impact analysis **•** Granular logging with performance and status of the data pipeline at a row level **•** Continuous monitoring of data pipeline jobs to ensure continued operation **Automatic deployments and operations** Ensure reliable and predictable delivery of data for analytics and machine learning use cases by enabling easy and automatic data pipeline deployments and rollbacks to minimize downtime. 
Benefits include: **•** Complete, parameterized and automated deployment for the continuous delivery of data **•** End-to-end orchestration, testing and monitoring of data pipeline deployment across all major cloud providers **Migrations** Accelerating and de-risking the migration journey to the lakehouse, whether from legacy on-prem systems or disparate cloud services. The migration process starts with a detailed discovery and assessment to get insights on legacy platform workloads and estimate migration as well as Databricks platform consumption costs. Get help with the target architecture and how the current technology stack maps to Databricks, followed by a phased implementation based on priorities and business needs. Throughout this journey companies can leverage: **•** Automation tools from Databricks and its ISV partners **•** Global and/or regional SIs who have created Brickbuilder migration solutions **•** Databricks Professional Services and training This is the recommended approach for a successful migration, whereby customers have seen a 25-50% reduction in costs and 2-3x faster time to value for their use cases. ----- **Unified governance** With Unity Catalog, data engineering and governance teams benefit from an enterprisewide data catalog with a single interface to manage permissions, centralize auditing, automatically track data lineage down to the column level, and share data across platforms, clouds and regions. 
Benefits: **•** Discover all your data in one place, no matter where it lives, and centrally manage fine-grained access permissions using an ANSI SQL-based interface **•** Leverage automated column-level data lineage to perform impact analysis of any data changes across the pipeline and conduct root cause analysis of any errors in the data pipelines **•** Centrally audit data entitlements and access **•** Share data across clouds, regions and data platforms, while maintaining a single copy of your data in your cloud storage ©2023 Databricks Inc. — All rights reserved Figure 3 The Databricks Lakehouse Platform integrates with a large collection of technologies **A rich ecosystem of data solutions** The Databricks Lakehouse Platform is built on open source technologies and uses open standards so leading data solutions can be leveraged with anything you build on the lakehouse. A large collection of technology partners make it easy and simple to integrate the technologies you rely on when migrating to Databricks and to know you are not locked into a closed data technology stack. ----- **Conclusion** As organizations strive to become data-driven, data engineering is a focal point for success. To deliver reliable, trustworthy data, data engineers shouldn’t need to spend time manually developing and maintaining an end-to-end ETL lifecycle. Data engineering teams need an efficient, scalable way to simplify ETL development, improve data reliability and manage operations. As described, the eight key differentiating capabilities simplify the management of the ETL lifecycle by automating and maintaining all data dependencies, leveraging built-in quality controls with monitoring and by providing deep visibility into pipeline operations with automatic recovery. 
Data engineering teams can now focus on easily and rapidly building reliable end-to-end production-ready data pipelines using only SQL or Python for batch and streaming that deliver high-value data for analytics, data science or machine learning. **Follow proven best practices** In the next section, we describe best practices for data engineering end-to end use cases drawn from real-world examples. From data ingestion and real-time processing to analytics and machine learning, you’ll learn how to translate raw data into actionable data. As you explore the rest of this guide, you can find data sets and code samples in the various **[Databricks Solution Accelerators](https://www.databricks.com/solutions/accelerators)** , so you can get your hands dirty as you explore all aspects of the data lifecycle on the Databricks Lakehouse Platform. **Start experimenting with these** **free Databricks** **notebooks** **.** ----- **SECTION** # 02 ### Guidance and Best Practices **2.1** Top 5 Databricks Performance Tips **2.2** How to Profile PySpark **2.3** Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka **2.4** Streaming in Production: Collected Best Practices **2.5** Streaming in Production: Collected Best Practices, Part 2 **2.6** Building Geospatial Data Products **2.7** Data Lineage With Unity Catalog **2.8** Easy Ingestion to Lakehouse With COPY INTO **2.9** Simplifying Change Data Capture With Databricks Delta Live Tables **2.10** Best Practices for Cross-Government Data Sharing ----- SECTION 2.1 **Top 5 Databricks Performance Tips** by **B R YA N S M I T H** and **R O B S A K E R** March 10, 2022 As solutions architects, we work closely with customers every day to help them get the best performance out of their jobs on Databricks — and we often end up giving the same advice. It’s not uncommon to have a conversation with a customer and get double, triple, or even more performance with just a few tweaks. So what’s the secret? 
How are we doing this? Here are the top 5 things we see that can make a huge impact on the performance customers get from Databricks. Here’s a TLDR: **•** **Use larger clusters.** It may sound obvious, but this is the number one problem we see. It’s actually not any more expensive to use a large cluster for a workload than it is to use a smaller one. It’s just faster. If there’s anything you should take away from this article, it’s this. Read section 1. Really. **•** **Use** **[Photon](https://databricks.com/blog/2021/06/17/announcing-photon-public-preview-the-next-generation-query-engine-on-the-databricks-lakehouse-platform.html?itm_data=product-cta-announcingPhotonBlog)** , Databricks’ new, super-fast execution engine. Read section 2 to learn more. You won’t regret it. **•** **Clean out your configurations** . Configurations carried from one Apache Spark™ version to the next can cause massive problems. Clean up! Read section 3 to learn more. **•** **Use** **[Delta Caching](https://docs.databricks.com/delta/optimizations/delta-cache.html)** . There’s a good chance you’re not using caching correctly, if at all. See Section 4 to learn more. **•** **Be aware of lazy evaluation** . If this doesn’t mean anything to you and you’re writing Spark code, jump to section 5. **•** **Bonus tip! Table design is super important** . We’ll go into this in a future blog, but for now, check out the [guide on Delta Lake best practices](https://docs.databricks.com/delta/best-practices.html) . **1. Give your clusters horsepower!** This is the number one mistake customers make. Many customers create tiny clusters of two workers with four cores each, and it takes forever to do anything. The concern is always the same: they don’t want to spend too much money on larger clusters. Here’s the thing: **it’s actually not any more expensive to use a** **large cluster for a workload than it is to use a smaller one. 
It’s just faster.** ----- The key is that you’re renting the cluster for the length of the workload. So, if you spin up that two worker cluster and it takes an hour, you’re paying for those workers for the full hour. However, if you spin up a four worker cluster and it takes only half an hour, the cost is actually the same! And that trend continues as long as there’s enough work for the cluster to do. Here’s a hypothetical scenario illustrating the point: **Number of Workers** **Cost Per Hour** **Length of Workload (hours)** **Cost of Workload** 1 $1 2 $2 2 $2 1 $2 4 $4 0.5 $2 8 $8 0.25 $2 Notice that the total cost of the workload stays the same while the real-world time it takes for the job to run drops significantly. So, bump up your Databricks cluster specs and speed up your workloads without spending any more money. It can’t really get any simpler than that. **2. Use Photon** Our colleagues in engineering have rewritten the Spark execution engine in C++ and dubbed it Photon. The results are impressive! Beyond the obvious improvements due to running the engine in native code, they’ve also made use of CPU-level performance features and better memory management. On top of this, they’ve rewritten the Parquet writer in C++. So this makes writing to Parquet and Delta (based on Parquet) super fast as well! But let’s also be clear about what Photon is speeding up. It improves computation speed for any built-in functions or operations, as well as writes to Parquet or Delta. So joins? Yep! Aggregations? Sure! ETL? Absolutely! That UDF (user-defined function) you wrote? Sorry, but it won’t help there. The job that’s spending most of its time reading from an ancient on-prem database? Won’t help there either, unfortunately. ----- The good news is that it helps where it can. So even if part of your job can’t be sped up, it will speed up the other parts. 
Also, most jobs are written with the native operations and spend a lot of time writing to Delta, and Photon helps a lot there. So give it a try. You may be amazed by the results! **3. Clean out old configurations** You know those Spark configurations you’ve been carrying along from version to version and no one knows what they do anymore? They may not be harmless. We’ve seen jobs go from running for hours down to minutes simply by cleaning out old configurations. There may have been a quirk in a particular version of Spark, a performance tweak that has not aged well, or something pulled off some blog somewhere that never really made sense. At the very least, it’s worth revisiting your Spark configurations if you’re in this situation. Often the default configurations are the best, and they’re only getting better. Your configurations may be holding you back. **4. The Delta Cache is your friend** This may seem obvious, but you’d be surprised how many people are not using the [Delta Cache](https://docs.databricks.com/delta/optimizations/delta-cache.html) , which loads data off of cloud storage (S3, ADLS) and keeps it on the workers’ SSDs for faster access. If you’re using Databricks SQL Endpoints you’re in luck. Those have caching on by default. In fact, we recommend using [CACHE SELECT * FROM table](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-cache.html) to preload your “hot” tables when you’re starting an endpoint. This will ensure blazing fast speeds for any queries on those tables. If you’re using regular clusters, be sure to use the i3 series on Amazon Web Services (AWS), L series or E series on Azure Databricks, or n2 in GCP. These will all have fast SSDs and caching enabled by default. Of course, your mileage may vary. If you’re doing BI, which involves reading the same tables over and over again, caching gives an amazing boost. 
However, if you’re simply reading a table once and writing out the results as in some ETL jobs, you may not get much benefit. You know your jobs better than anyone. Go forth and conquer.

-----

**5. Be aware of lazy evaluation**

If you’re a data analyst or data scientist only using SQL or doing BI you can skip this section. However, if you’re in data engineering and writing pipelines or doing processing using Databricks/Spark, read on.

When you’re writing Spark code like select, groupBy, filter, etc., you’re really building an execution plan. You’ll notice the code returns almost immediately when you run these functions. That’s because it’s not actually doing any computation. So even if you have petabytes of data, it will return in less than a second.

However, once you go to write your results out you’ll notice it takes longer. This is due to lazy evaluation. It’s not until you try to display or write results that your execution plan is actually run.

    # Build an execution plan.
    # This returns in less than a second but does no work
    df2 = (df
      .join(...)
      .select(...)
      .filter(...)
    )

    # Now run the execution plan to get results
    df2.display()

However, there is a catch here. Every time you try to display or write out results, it runs the execution plan again. Let’s look at the same block of code but extend it and do a few more operations.

    # Build an execution plan.
    # This returns in less than a second but does no work
    df2 = (df
      .join(...)
      .select(...)
      .filter(...)
    )

    # Now run the execution plan to get results
    df2.display()

    # Unfortunately this will run the plan again, including filtering, joining, etc
    df2.display()

    # So will this…
    df2.count()

The developer of this code may very well be thinking that they’re just printing out results three times, but what they’re really doing is kicking off the same processing three times. Oops. That’s a lot of extra work. This is a very common mistake we run into.
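The recomputation problem can be illustrated without a cluster. The sketch below is a plain-Python analogy, not Spark code. The hypothetical LazyPlan class records transformations without running them, and counts how often an action forces the whole plan to execute.

```python
class LazyPlan:
    '''Toy stand-in for a lazily evaluated DataFrame (not a Spark API).'''

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)
        self.runs = 0  # how many times the full plan has executed

    def filter(self, predicate):
        # Transformations only extend the plan; no rows are touched yet.
        return LazyPlan(self.rows, self.steps + (predicate,))

    def _execute(self):
        self.runs += 1
        out = list(self.rows)
        for predicate in self.steps:
            out = [r for r in out if predicate(r)]
        return out

    def collect(self):
        # Action: runs every recorded step from scratch.
        return self._execute()

    def count(self):
        # Action: also runs every recorded step from scratch.
        return len(self._execute())


plan = LazyPlan(range(10)).filter(lambda r: r % 2 == 0)
plan.collect()
plan.collect()
plan.count()
runs_without_saving = plan.runs  # three actions, three full executions

saved = plan.collect()           # materialize once and keep the result
reuse_count = len(saved)         # reusing the saved rows triggers no new run
```

In this toy model, each action re-executes the plan, while working from the saved list does not. That is the behavior the section warns about.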
So why is there lazy evaluation, and what do we do about it? In short, processing with lazy evaluation is way faster than without it. Databricks/Spark looks at the full execution plan and finds opportunities for optimization that can reduce processing time by orders of magnitude.

So that’s great, but how do we avoid the extra computation? The answer is pretty straightforward: save computed results you will reuse.

Let’s look at the same block of code again, but this time let’s avoid the recomputation:

    # Build an execution plan.
    # This returns in less than a second but does no work
    df2 = (df
      .join(...)
      .select(...)
      .filter(...)
    )

    # save it
    df2.write.save(path)

    # load it back in
    df3 = spark.read.load(path)

    # now use it
    df3.display()

    # this is not doing any extra computation anymore. No joins, filtering, etc. It’s already done and saved.
    df3.display()

    # nor is this
    df3.count()

This works especially well when [Delta Caching](https://docs.databricks.com/delta/optimizations/delta-cache.html) is turned on. In short, you benefit greatly from lazy evaluation, but it’s something a lot of customers trip over. So be aware of its existence and save results you reuse in order to avoid unnecessary computation.

**Start experimenting with these free Databricks notebooks.**

-----

SECTION 2.2

**How to Profile PySpark**

by **Xinrong Meng, Takuya Ueshin, Hyukjin Kwon** and **Allan Folting**

October 6, 2022

In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore, PySpark UDFs offer more flexibility since they enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state “what to do”; PySpark, as a sandbox, encapsulates “how to do it.” That makes PySpark easier to use, but it can be difficult to identify performance bottlenecks and apply custom optimizations.
To address the difficulty mentioned above, PySpark supports various profiling tools, which are all based on [cProfile](https://docs.python.org/3/library/profile.html#module-cProfile), one of the standard Python [profiler implementations](https://docs.python.org/3/library/profile.html). PySpark Profilers provide information such as the number of function calls, total time spent in the given function, and filename, as well as line number to help navigation. That information is essential to exposing tight loops in your PySpark programs, and allowing you to make performance improvement decisions.

**Driver profiling**

PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process; thus, we can profile it as a normal Python program using cProfile as illustrated below:

    import cProfile

    with cProfile.Profile() as pr:
        ...  # Your code

    pr.print_stats()

**Workers profiling**

Executors are distributed on worker nodes in the cluster, which introduces complexity because we need to aggregate profiles. Furthermore, a Python worker process is spawned per executor for PySpark UDF execution, which makes the profiling more intricate.

The UDF profiler, which was introduced in Spark 3.3, overcomes all those obstacles and becomes a major tool to profile workers for PySpark applications. We’ll illustrate how to use the UDF profiler with a simple Pandas UDF example.

Firstly, a PySpark DataFrame with 8,000 rows is generated, as shown below.

```
from pyspark.sql.functions import col, rand

sdf = spark.range(0, 8 * 1000).withColumn(
    'id', (col('id') % 8).cast('integer')  # 1000 rows x 8 groups (if group by 'id')
).withColumn('v', rand())
```

Later, we will group by the id column, which results in 8 groups with 1,000 rows per group.
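The grouping arithmetic can be sanity-checked locally without Spark. The following sketch only mirrors the id % 8 expression in plain Python and does not create a DataFrame.

```python
from collections import Counter

# Mirror spark.range(0, 8 * 1000) followed by id % 8.
group_sizes = Counter(i % 8 for i in range(8 * 1000))

# Number of distinct groups, and the set of group sizes that occur.
print(len(group_sizes), set(group_sizes.values()))
```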
The Pandas UDF plus_one is then created and applied as shown below:

```
import pandas as pd

def plus_one(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.apply(lambda x: x + 1, axis=1)

res = sdf.groupby('id').applyInPandas(plus_one, schema=sdf.schema)
res.collect()
```

Executing the example above and running sc.show_profiles() prints the following profile. The profile below can also be dumped to disk by sc.dump_profiles(path).

The UDF id in the profile (271, highlighted in the profile) matches that in the Spark plan for res. The Spark plan can be shown by calling res.explain().

Note that plus_one takes a pandas DataFrame and returns another pandas DataFrame. For each group, all columns are passed together as a pandas DataFrame to the plus_one UDF, and the returned pandas DataFrames are combined into a PySpark DataFrame.

The first line in the profile’s body indicates the total number of calls that were monitored. The column heading includes:

- ncalls, for the number of calls
- tottime, for the total time spent in the given function (excluding time spent in calls to sub-functions)
- percall, the quotient of tottime divided by ncalls
- cumtime, the cumulative time spent in this and all subfunctions (from invocation till exit)
- percall, the quotient of cumtime divided by primitive calls
- filename:lineno(function), which provides the respective information for each function

Digging into the column details: plus_one is triggered once per group, 8 times in total; _arith_method of pandas Series is called once per row, 8,000 times in total. pandas.DataFrame.apply applies the function lambda x: x + 1 row by row, thus suffering from high invocation overhead.

We can reduce such overhead by substituting the pandas.DataFrame.apply with pdf + 1, which is vectorized in pandas.
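The effect of per-row invocation on ncalls can be reproduced with cProfile alone. The sketch below is a standard-library analogy rather than the UDF profiler itself. The names add_one, per_row, batched and calls_to are hypothetical helpers, and the list comprehension in batched merely stands in for a vectorized pandas operation.

```python
import cProfile
import pstats


def add_one(x):
    return x + 1


def per_row(values):
    # One Python-level function call per element, like apply(..., axis=1).
    return [add_one(v) for v in values]


def batched(values):
    # A single call handles the whole batch; add_one is never invoked.
    return [v + 1 for v in values]


def calls_to(func, target_name, values):
    '''Profile func(values) and return how often target_name was called.'''
    profiler = cProfile.Profile()
    profiler.enable()
    func(values)
    profiler.disable()
    stats = pstats.Stats(profiler).stats
    # Keys are (filename, lineno, function name); nc is the total call count.
    return sum(
        nc
        for (_, _, name), (cc, nc, tt, ct, callers) in stats.items()
        if name == target_name
    )


data = list(range(1000))
per_row_calls = calls_to(per_row, 'add_one', data)
batched_calls = calls_to(batched, 'add_one', data)
```

The per-row version records one add_one call per element, while the batched version records none. The UDF profile shows the same pattern for _arith_method.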
The optimized Pandas UDF looks as follows:

```
import pandas as pd

def plus_one_optimized(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf + 1

res = sdf.groupby('id').applyInPandas(plus_one_optimized, schema=sdf.schema)
res.collect()
```

The updated profile is as shown below. We can summarize the optimizations as follows:

- Arithmetic operation from 8,000 calls to 8 calls
- Total function calls from 2,898,160 calls to 2,384 calls
- Total execution time from 2.300 seconds to 0.004 seconds

The short example above demonstrates how the UDF profiler helps us deeply understand the execution, identify the performance bottleneck and enhance the overall performance of the user-defined function.

The UDF profiler was implemented based on the executor-side profiler, which is designed for PySpark RDD API. The executor-side profiler is available in all active Databricks Runtime versions.

Both the UDF profiler and the executor-side profiler run on Python workers. They are controlled by the spark.python.profile Spark configuration, which is false by default. We can enable that Spark configuration on a Databricks Runtime cluster by setting spark.python.profile to true in the cluster’s Spark config.

**Conclusion**

PySpark profilers are implemented based on cProfile; thus, the profile reporting relies on the [Stats](https://docs.python.org/3/library/profile.html#the-stats-class) class. [Spark Accumulators](https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators) also play an important role when collecting profile reports from Python workers.

Powerful profilers are provided by PySpark in order to identify hot loops and suggest potential improvements. They are easy to use and critical to enhance the performance of PySpark programs. The UDF profiler, which is available starting from Databricks Runtime 11.0 (Spark 3.3), overcomes all the technical challenges and brings insights to user-defined functions.
In addition, there is an ongoing effort in the Apache Spark™ open source community to introduce memory profiling on executors; see [SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281) for more information.

**Start experimenting with these free Databricks notebooks.**

-----

SECTION 2.3

**Low-Latency Streaming Data Pipelines With Delta Live Tables and Apache Kafka**

by **Frank Munz**

August 9, 2022

[Delta Live Tables (DLT)](https://databricks.com/product/delta-live-tables) is the first ETL framework that uses a simple declarative approach for creating reliable data pipelines and fully manages the underlying infrastructure at scale for batch and [streaming data](https://www.databricks.com/product/data-streaming).

Many use cases require actionable insights derived from near real-time data. Delta Live Tables enables low-latency streaming data pipelines to support such use cases by directly ingesting data from event buses like [Apache Kafka](https://kafka.apache.org/), [AWS Kinesis](https://aws.amazon.com/kinesis/), [Confluent Cloud](https://www.confluent.io/confluent-cloud), [Amazon MSK](https://www.youtube.com/watch?v=HtU9pb18g5Q), or [Azure Event Hubs](https://docs.microsoft.com/en-us/azure/event-hubs/).

This article will walk through using DLT with Apache Kafka while providing the required Python code to ingest streams. The recommended system architecture will be explained, and related DLT settings worth considering will be explored along the way.

**Streaming platforms**

Event buses or message buses decouple message producers from consumers. A popular streaming use case is the collection of click-through data from users navigating a website where every user interaction is stored as an event in Apache Kafka. The event stream from Kafka is then used for real-time streaming data analytics.
Multiple message consumers can read the same data from Kafka and use the data to learn about audience interests, conversion rates, and bounce reasons. The real-time, streaming event data from the user interactions often also needs to be correlated with actual purchases stored in a billing database. **Apache Kafka** [Apache Kafka](https://kafka.apache.org/) is a popular open source event bus. Kafka uses the concept of a topic, an append-only distributed log of events where messages are buffered for a certain amount of time. Although messages in Kafka are not deleted once they are consumed, they are also not stored indefinitely. The message retention for Kafka can be configured per topic and defaults to 7 days. Expired messages will be deleted eventually. This article is centered around Apache Kafka; however, the concepts discussed also apply to many other event busses or messaging systems. ----- **Streaming data pipelines** In a data flow pipeline, Delta Live Tables and their dependencies can be declared with a standard SQL Create Table As Select (CTAS) statement and the DLT keyword “live.” When developing DLT with Python, the @dlt.table decorator is used to, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf))","List(Larger clusters execute workloads faster in Databricks., The faster execution reduces the total time required for workload completion., The overall cost efficiency is balanced due to reduced workload completion time despite higher hourly costs.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf
088c4943384eaa6a228c3d68ff70fbef6bcbe9c50176180e73244de1d7f3be1a,"List(List(List(What are some common challenges around managing data and performance at scale for modern digital native companies as they mature?, user)))","List(List(``` TECHNICAL GUIDE ``` # Solving Common Data Challenges #### Startups and Digital Native Businesses ----- ### Table of Contents # 01 ``` CHALLENGE:   ###### Creating a unified data architecture for data quality, governance and efficiency # 03 CHALLENGE:   ###### Building effective machine learning operations ``` # 02 ``` CHALLENGE:   ###### Building a data architecture to support scale and performance # 04 SUMMARY: ###### The Databricks Lakehouse Platform addresses these challenges ``` ----- **I N T R O D U C T I O N** This guide shares how the lakehouse architecture can increase productivity and cost-efficiently support all your data, analytics and AI workloads, and flexibly scale with the pace of growth for your company. Read the entire guide or dive straight into a specific challenge. With the advent of cloud infrastructure, a new generation of startups has rapidly built and scaled their businesses. The use of cloud infrastructure, once seen as innovative, has now become table stakes. The differentiator for the fastest-moving startups and digital natives now comes from the effective use of data at scale, primarily analytics and AI. Digital natives — defined as fast-moving, lean, and technically savvy, born-in-the-cloud organizations — are beginning to focus on new data-driven use cases such as real-time machine learning and personalized customer experiences. To pursue these new data-intensive use cases and initiatives, organizations must look beyond the technologies that delivered them to this point in time. 
Over time, these technologies, such as transactional databases, streaming/batch pipelines and first-generation analytics engines, have led to brittle systems that are not cost-efficient and require time-consuming administration and engineering toil. In addition to growing maintenance needs, data is often stored in disparate locations and formats, with little or no governance, making real-time use cases, analytics and AI difficult or impossible.

This guide examines some of the biggest data challenges and solutions for startups and for scaling digital native businesses that have reached the point where an end-to-end modern data platform is a smart investment. Some key considerations include:

**Consolidating on a unified data platform**

As mentioned above, siloed data storage and management add administrative and financial cost. You can benefit significantly when you unify your data in one location with a flexible architecture that scales with your needs and delivers performance for future success. For this, you will want an open platform that supports all your data including batch and streaming workloads, data analytics and machine learning. With data unification, you create a more efficient, integrated approach to ingesting, cleaning and organizing your data. You also need automation to make data analysis easier for the nontechnical users in the company. But broader data access also means more focus on security, privacy, compliance and access control, which can create overhead for a growing.

**Scaling up capacity and increasing performance and usability of the data solutions**

Data teams at growing digital native organizations find it time intensive and costly to handle the growing volume and velocity of their data being ingested from multiple sources, across multiple clouds. You now need a unified and simplified platform that can instantly scale up capacity and deliver more computing power on demand to free up your data teams to produce outputs more quickly.
This lowers the total cost for the overall infrastructure by eliminating redundant licensing, infrastructure and administration costs. **Building effective machine learning operations** For data teams beginning their machine learning journeys, the challenge of training data models can increase in management complexity. Many teams with disparate coding needs for the entire model lifecycle suffer inefficiencies from transferring data and code across many separate services. To build and manage effective ML operations, consider an end-to-end MLOps environment that brings all data together in one place and incorporates managed services for experiment tracking, model training, feature development and feature and model serving. ----- # 01 ``` CHALLENGE:  ## Create a unified data architecture for data quality, governance and efficiency ``` ----- ``` CHALLENGE 01 ### Create a unified data architecture for data quality, governance and efficiency ``` As cloud-born companies grow, data volumes rapidly increase, leading to new challenges and use cases. Among the challenges: Application stacks optimized for transaction use cases aren’t able to handle the volume, velocity and variety of data that modern data teams require. For example, this leads to query performance issues as data volume grows. Data silos develop as each team within an organization chooses different ETL/ELT and storage solutions for their needs. As the organization grows and changes, these pipelines and storage solutions become brittle, hard to maintain and nearly impossible to integrate. These data silos lead to discoverability, integration and access issues, which prevent teams from leveraging the full value of the organization’s available data. Data governance is hard. Disparate ETL/ELT and storage solutions lead to governance, compliance, auditability and access control challenges, which expose organizations to tremendous risk. 
The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing and maintaining data solutions at scale. It integrates with cloud storage and the security in your cloud account, manages and deploys cloud infrastructure on your behalf. Your data practitioners no longer need separate storage systems for their data. And you don’t have to rely on your cloud provider for security. The lakehouse has its own robust security built into the platform. For all the reasons above, the most consistent advice from successful data practitioners is to create a “single source of truth” by unifying all data on a single platform. With the Databricks Lakehouse Platform, you can unify all your data on one platform, reducing data infrastructure costs and compute. You don’t need excess data copies and you can retire expensive legacy infrastructure. ```  01 ``` ----- ``` CUSTOMER STORY: GRAMMARLY ### Helping 30 million people and 50,000 teams communicate more effectively ``` While its business is based on analytics, [Grammarly](http://www.grammarly.com) for many years relied on a homegrown analytics platform to drive its AI writing assistant to help users improve multiple aspects of written communications. As teams developed their own requirements, data silos inevitably emerged as different business areas implemented analytics tools individually. “Every team decided to solve their analytics needs in the best way they saw fit,” said Chris Locklin, Engineering Manager, Data Platforms, at Grammarly. “That created challenges in consistency and knowing which data set was correct.” To better scale and improve data storage and query capabilities, Grammarly brought all its analytical data into the Databricks Lakehouse Platform and created a central hub for all data producers and consumers across the company. Grammarly had several goals with the lakehouse, including better access control, security, ingestion flexibility, reducing costs and fueling collaboration. 
“Access control in a distributed file system is difficult, and it only gets more complicated as you ingest more data sources,” said Locklin. To manage access control, enable end-to-end observability and monitor data quality, Grammarly relies on the data lineage capabilities within Unity Catalog. “Data lineage allows us to effectively monitor usage of our data and ensure it upholds the standards we set as a data platform team,” said Locklin. “Lineage is the last crucial piece for access control.” Data analysts within Grammarly now have a consolidated interface for analytics, which leads to a single source of truth and confidence in the accuracy and availability of all data managed by the data platform team. Having a consistent data source across the company also resulted in greater speed and efficiency and reduced costs. Data practitioners experienced 110% faster querying at 10% of the cost to ingest compared to a data warehouse. Grammarly can now make its 5 billion daily events available for analytics in under 15 minutes rather than 4 hours. Migrating off its rigid legacy infrastructure gave Grammarly the flexibility to do more and the confidence that the platform will evolve with its needs. Grammarly is now able to sustain a flexible, scalable and highly secure analytics platform that helps 30 million people and 50,000 teams worldwide write more effectively every day. 
[Read the full story here.](https://www.databricks.com/customers/grammarly)

-----

###### How to unify the data infrastructure with Databricks

The [Databricks Lakehouse Platform](https://docs.databricks.com/lakehouse/index.html) architecture is composed of two primary parts:

- The infrastructure to deploy, configure and manage the platform and services
- The customer-owned infrastructure managed in collaboration by Databricks and the customer

You can build a Databricks workspace by configuring secure integrations between the Databricks platform and your cloud account, and then Databricks deploys temporary Apache Spark™/Photon clusters using cloud resources in your account to process and store data in object storage and other integrated services you control.

Here are three steps to get started with the Databricks Lakehouse Platform:

**Understand the architecture**

The lakehouse provides a unified architecture, meaning that all data is stored in the same accessible place. The diagram shows how data comes in from sources like a customer relationship management (CRM) system, an enterprise resource planning (ERP) system, websites or unstructured customer emails. The lakehouse handles all varieties of data (structured, semi-structured, unstructured), as well as all velocities of data (streaming, batch or somewhere in the middle).

**Optimize the storage layer**

All data is stored in cloud storage while Databricks provides tooling to assist with ingestion, such as Auto Loader, and we recommend [open-source](https://delta.io/) [Delta Lake](https://docs.databricks.com/delta/index.html) as the storage format of choice. Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform. Having all your data in the same optimized, open storage keeps all your use cases in the same place, thus enabling collaboration and removing software tool overhead.
[Sign up for a free trial](https://www.databricks.com/try-databricks#account) account with the instructions on the [get started page.](https://docs.databricks.com/getting-started/index.html) ----- The Databricks Lakehouse organizes data stored with Delta Lake in cloud object storage with familiar concepts like database, tables and views. Delta Lake extends Parquet data files with a file-based transaction log for [ACID transactions](https://docs.databricks.com/lakehouse/acid.html) and scalable metadata handling. Delta Lake is fully compatible with Apache Spark APIs, and was developed for tight integration with Structured Streaming, allowing you to easily use a single copy of data for both batch and streaming operations to provide incremental processing at scale.This model combines many of the benefits of a data warehouse with the scalability and flexibility of a data lake. To learn more about the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform, see [Getting started](https://docs.databricks.com/getting-started/delta.html) [with Delta Lake](https://docs.databricks.com/getting-started/delta.html) [.](https://docs.databricks.com/getting-started/delta.html) The first step in unifying your data architecture is setting up how data is to be accessed and used across the organization. We’ll discuss this as a series of steps: **1** Set up governance with Unity Catalog **2** Grant secure access to the data ###### “Delta Lake provides us with a single source of truth for all of our data,” said Stone. 
“Now our data engineers are able to build reliable data pipelines that thread the needle on key topics, such as inventory management, allowing us to identify in near real-time what our trends are so we can figure out how to effectively move inventory.”  – Jake Stone, Senior Manager, Business Analytics at ButcherBox [Learn more](https://www.databricks.com/blog/2022/02/07/how-butcherbox-uses-data-insights-to-provide-quality-food-tailored-to-each-customers-unique-taste.html) **3** Capture audit logs **4** View data lineage **5** Set up data sharing ----- **Configure unified governance** Databricks recommends using catalogs to provide an easily searchable inventory of data, notebooks, dashboards and models. Often this means that catalogs can correspond to software development environment scope, team or business unit. [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/get-started.html) manages how data is secured, accessed and shared. Unity Catalog offers a single place to administer data access policies that apply across all workspace and personas and automatically captures user-level audit logs that record access to your data. Data stewards can securely grant access to a broad set of users to discover and analyze data at scale. These users can use a variety of languages and tools, including SQL and Python, to create derivative data sets, models and dashboards that can be shared across teams. To set up Unity Catalog for your organization, you do the following: **1** Configure an S3 bucket and IAM role that Unity Catalog can use to store and access data in your AWS account. **2** Create a metastore for each region in which your organization operates, and attach workspaces to the metastore. Each workspace will have the same view of the data you manage in Unity Catalog. **3** If you have a new account, add users, groups and service principals to your Databricks account. **4** Next, create and grant access to catalogs, schemas and tables. 
For complete setup instructions, see [Get started using Unity Catalog.](https://docs.databricks.com/data-governance/unity-catalog/get-started.html#:~:text=To%20enable%20your%20Databricks%20account%20to%20use%20Unity,Transfer%20your%20metastore%20admin%20role%20to%20a%20group.) ----- ###### How Unity Catalog works You will notice that the hierarchy of primary data objects in Unity Catalog flows from metastore to table: **Metastore** is the top-level container for metadata. Each metastore exposes a three-level namespace (catalog.schema.table) that organizes your data. **Metastore** **Catalog** **Schemas** **Views** **Managed** **Tables** **Catalog** is the first layer of the object hierarchy, used to organize your data assets. **Schemas** , also known as databases, are the second layer of the object hierarchy and contain tables and views. **Table** is the lowest level in the object hierarchy, and tables can be external (stored in external locations in your cloud storage of choice) or managed (stored in a storage container in your cloud storage that you create expressly for Databricks). You can also create readonly **Views** from tables. **External** **tables** The diagram below represents the file system hierarchy of a single storage bucket: ----- Unity Catalog uses the identities in the Databricks account to resolve users, service principals, and groups and to enforce permissions. To configure identities in the account, follow the instructions in [Manage users,](https://docs.databricks.com/administration-guide/users-groups/index.html) [service principals, and groups](https://docs.databricks.com/administration-guide/users-groups/index.html) . Refer to those users, service principals, and groups when you create [access-control policies](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/index.html) in Unity Catalog. 
Unity Catalog users, service principals, and groups must also be added to workspaces to access Unity Catalog data in a notebook, a Databricks SQL query, Data Explorer or a REST API command. The assignment of users, service principals, and groups to workspaces is called identity federation. All workspaces attached to a Unity Catalog metastore are enabled for identity federation. Securable objects in Unity Catalog are hierarchical, meaning that granting a privilege on a catalog or schema automatically grants the privilege to all current and future objects within the catalog or schema. For more on granting privileges, see the [Inheritance model](https://docs.databricks.com/data-governance/unity-catalog/manage-privileges/privileges.html#inheritance) . A common scenario is to set up a schema per team where only that team has USE SCHEMA and CREATE on the schema. This means that any tables produced by team members can only be shared within the team. Data Explorer uses the privileges configured by Unity Catalog administrators to ensure that users are only able to see catalogs, databases, tables and views that they have permission to query. [Databricks Data Explorer](https://docs.databricks.com/data/index.html) is the main user interface for many Unity Catalog features. Use Data Explorer to view schema details, preview sample data, and see table details and properties. Administrators can view and change owners. Admins and data object owners can grant and revoke permissions through this interface. **Set up secure access** In Unity Catalog, data is secure by default. Initially, users have no access to data in a metastore. Access can be granted by either a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Securable objects in Unity Catalog are hierarchical and privileges are inherited downward. 
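The downward inheritance of privileges can be pictured with a small toy model. The sketch below is not the Unity Catalog API. The grants dictionary, the has_privilege helper, and the catalog, schema, table and group names are all hypothetical, and it models only the rule that a grant on a catalog or schema covers the objects beneath it.

```python
# Toy model: privileges granted on a parent apply to everything beneath it.
# Securables are written as catalog, catalog.schema or catalog.schema.table.
grants = {
    'main': {('analysts', 'USE CATALOG')},
    'main.sales': {('analysts', 'USE SCHEMA'), ('analysts', 'SELECT')},
}


def has_privilege(principal, privilege, securable):
    '''Check the securable and each of its ancestors for a matching grant.'''
    parts = securable.split('.')
    for depth in range(len(parts), 0, -1):
        ancestor = '.'.join(parts[:depth])
        if (principal, privilege) in grants.get(ancestor, set()):
            return True
    return False


# SELECT was granted on the schema main.sales, so its tables inherit it.
can_read_orders = has_privilege('analysts', 'SELECT', 'main.sales.orders')
# Nothing was granted on main.hr, so access stays closed by default.
can_read_hr = has_privilege('analysts', 'SELECT', 'main.hr.salaries')
```

The lookup walks from the table up to the catalog. A grant at any higher level is therefore enough, and an object with no grant anywhere in its ancestry remains inaccessible. This matches the secure-by-default behavior described above.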
Unity Catalog’s security model is based on standard ANSI SQL and allows administrators to grant permissions in their existing data lake using familiar syntax, at the level of catalogs, databases (schema), tables and views. Privileges and metastores are shared across workspaces, allowing administrators to set secure permissions once against groups synced from identity providers and know that end users only have access to the proper data in any Databricks workspace they enter.

-----

``` CUSTOMER STORY: BUTCHERBOX ### How Butcherbox Uses Data Insights to Provide Quality Food Tailored to Each Customer’s Unique Taste ```

As a young e-commerce company, [ButcherBox](https://www.butcherbox.com/) has to be nimble as its customers’ needs change, which means it is constantly considering behavioral patterns, distribution center efficiency, a growing list of marketing and communication channels, and order processing systems.

The meat and seafood subscription company collects data on hundreds of thousands of subscribers. It deployed the Databricks Lakehouse Platform to gain visibility across its diverse range of data systems and enable its analytics team to securely view and export data in the formats needed.

With so much data feeding in from different sources, from email systems to its website, the data team at ButcherBox quickly discovered that data silos were a significant problem because they blocked complete visibility into critical insights needed to make strategic and marketing decisions.

“We knew we needed to migrate from our legacy data warehouse environment to a data analytics platform that would unify our data and make it easily accessible for quick analysis to improve supply chain operations, forecast demand and, most importantly, keep up with our growing customer base,” explained Jake Stone, Senior Manager, Business Analytics, at ButcherBox.

The platform allows analysts to share builds and iterate on a project without getting into the code. Querying a table of 18 billion rows would have been problematic with a traditional platform. With Databricks, ButcherBox can do it in three minutes.

“Delta Lake provides us with a single source of truth for all of our data,” said Stone. “Now our data engineers are able to build reliable data pipelines that thread the needle on key topics such as inventory management, allowing us to identify in near real-time what our trends are so we can figure out how to effectively move inventory.”

[Read the full story here.](https://www.databricks.com/blog/2022/02/07/how-butcherbox-uses-data-insights-to-provide-quality-food-tailored-to-each-customers-unique-taste.html)

-----

**Set up secure data sharing**

Databricks uses an open protocol called [Delta Sharing](https://docs.databricks.com/data-sharing/index.html) to share data with other entities regardless of their computing platforms. Delta Sharing is integrated with Unity Catalog. Your data must be registered with Unity Catalog to manage, govern, audit and track usage of the shared data on the Lakehouse Platform. The primary concepts of Delta Sharing are shares (read-only collections of tables and table partitions to be shared) and recipients (objects that associate an organization with a credential or secure sharing identifier).

As a data provider, you generate a token and share it securely with the recipient. They use the token to authenticate and get read access to the tables you’ve included in the shares you’ve given them access to. Recipients access the shared data in read-only format. Whenever the data provider updates data tables in their own Databricks account, the updates appear in near real-time in the recipient’s system.

**Capture audit logs**

Unity Catalog captures an audit log of actions performed against the metastore. To access audit logs for Unity Catalog events, you must enable and configure audit logs for your account.
Audit logs for each workspace and account-level activities are delivered to your account. See how to [configure audit logs](https://docs.databricks.com/data-governance/unity-catalog/audit.html) and create a dashboard to analyze audit log data. **View data lineage** You can use Unity Catalog to capture runtime data lineage across queries in any language executed on a Databricks cluster or SQL warehouse. Lineage can be visualized in Data Explorer in near real-time and retrieved with the Databricks REST API. Lineage is aggregated across all workspaces attached to Unity Catalog and captured down to the column level, and includes notebooks, workflows and dashboards related to the query. To understand the requirements and how to capture lineage data, see [Capture and view data lineage with Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html). Data providers can use Databricks audit logging to monitor the creation and modification of shares, and recipients can monitor recipient activity on shares. Data recipients who use shared data in a Databricks account can use Databricks audit logging to understand who is accessing which data. 
----- ###### Resources: - [Databricks documentation](https://docs.databricks.com/?_ga=2.8076210.1659353804.1668454132-1193545868.1666711643) - [Getting Started With Delta Lake](https://docs.databricks.com/delta/index.html) - [Webinar: Deep Dive Into Lakehouse With Delta Lake](https://www.databricks.com/p/webinar/deep-dive-into-lakehouse-with-delta-lake-complimentary-training) - [Big Book of Data Engineering Use Cases](https://www.databricks.com/explore/de-data-warehousing/big-book-of-data-engineering#page=1) - [10 Powerful Features to Simplify Semi-structured Data Management in the Databricks Lakehouse](https://www.databricks.com/blog/2021/11/11/10-powerful-features-to-simplify-semi-structured-data-management-in-the-databricks-lakehouse.html) ###### Key Takeaways - With the Databricks Lakehouse Platform, you can unify and simplify all your data on one platform to better scale and improve data storage and query capabilities - The lakehouse helps reduce data infrastructure and compute costs. You don’t need excess data copies and can retire expensive legacy infrastructure. 
- Leverage Delta Lake as the open format storage layer to deliver reliability, security and performance on your data lake, for both streaming and batch operations, replacing data silos with a single home for structured, semi-structured and unstructured data - With Unity Catalog you can centralize governance for all data and AI assets including files, tables, machine learning models and dashboards in your lakehouse on any cloud - The Databricks Lakehouse Platform is open source with multicloud flexibility so that you can use your data however and wherever you want, with no vendor lock-in ----- # 02 ``` CHALLENGE:  ## Build your data architecture to support scale and performance ``` ----- As modern digital native companies mature, data volumes grow and new use cases develop. This inevitably leads to the increasing complexity of data architecture as new storage and access patterns emerge. Data growth can come suddenly and unexpectedly; when it does, the existing architecture needs to sustain performance while remaining cost-effective. The relational databases and traditional data warehouses that once met the needs of the business now create limitations for new real-time use cases and large-scale data analytics pipelines. Here are some common challenges around managing data and performance at scale: **Volume and velocity**: Exponentially increasing data sources, and the speed at which they capture and create data. **Latency requirements**: The demands of downstream applications and users have evolved (people want data and the results from the data faster). **Governance**: Cataloging, auditing, securing and reporting on data is burdensome at scale when using old systems not built with data access controls and compliance in mind. **Multicloud** is really hard. **Data storage**: Data stored in the wrong format is slow to access and query, and expensive at scale. 
**Data format**: Supporting structured, semi-structured and unstructured data formats is now a requirement. Most data storage solutions are designed to handle only one type of data, requiring multiple products to be stitched together. ----- ###### Lakehouse solves scale and performance challenges The solution for growing digital companies is a unified and simplified platform that can instantly scale up capacity to deliver more computing power on demand, freeing up teams to go after the much-needed data and produce outputs more quickly. With a lakehouse, they can replace their data silos with a single home for their structured, semi-structured and unstructured data. Users and applications throughout the enterprise environment can connect to the same single copy of the data to drive diverse workloads. The lakehouse architecture is cost-efficient for scaling, lowering the total cost of ownership for the overall infrastructure by consolidating the entire data estate and all use cases onto a single platform and eliminating redundant licensing, infrastructure and administration costs. Unlike other warehouse options that can only scale horizontally, the Databricks Lakehouse can scale horizontally and vertically based on workload demands. With the Databricks Lakehouse, you can optimize the compute costs on a platform that is [2.7x faster and 12x more performant than Snowflake](https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html), according to research by the Barcelona Supercomputing Center. Your data teams also become more productive, focusing on strategic initiatives rather than managing multiple data solutions. 
``` CUSTOMER STORY: RIVIAN ### Driving into the future of electric transportation ``` With more than 11,000 electric adventure vehicles (EAVs) on the road generating multiple terabytes of IoT data per day, [Rivian](https://rivian.com/) is using data insights and machine learning to improve vehicle health and performance. However, with legacy cloud tooling, it struggled to scale pipelines cost-effectively and spent significant resources on maintenance. Before Rivian even shipped its first EAV, it was already up against data visibility and tooling limitations that decreased output, prevented collaboration and increased operational costs. Rivian chose to modernize its data infrastructure on the Databricks Lakehouse Platform, giving it the ability to unify all its data into a common view for downstream analytics and machine learning. Now, unique data teams have a range of accessible tools to deliver actionable insights for different use cases, from predictive maintenance to smarter product development. “Today we have various teams, both technical and business, using Databricks Lakehouse to explore our data, build performant data pipelines, and extract actionable business and product insights via visual dashboards,” said Wassym Bensaid, Vice President of Software Development at Rivian. For instance, Rivian’s ADAS (advanced driver-assistance systems) Team can now easily prepare telemetric accelerometer data to understand all EAV motions. This core recording data includes information about pitch, roll, speed, suspension and airbag activity to help Rivian understand vehicle performance, driving patterns and connected car system predictability. Based on these key performance metrics, Rivian can improve the accuracy of smart features and the control that drivers have over them. By leveraging the Databricks Lakehouse Platform, Rivian has seen a 30%–50% increase in runtime performance, which has led to faster insights and model performance. 
[Read the full story here.](https://www.databricks.com/customers/rivian) ----- ###### How to ensure scalability and performance with Databricks The [Databricks Lakehouse Platform](https://docs.databricks.com/lakehouse/index.html) is built to ensure scalability and performance for your data architecture based on the following features and capabilities: - A simplified and cost-efficient architecture that increases productivity - A platform that ensures reliable, high-performing ETL workloads (for streaming and batch data) while Databricks automatically manages your infrastructure - The ability to ingest, transform and query all your data in one place, and scale on demand with serverless compute - Real-time data access for all data, analytics and AI use cases ----- The following section provides a short series of steps for understanding the key components of the Databricks Lakehouse Platform. **Step 1** **Get a trial Databricks account** Start your 14-day free trial with Databricks on AWS in a few easy steps. [Get started with a free trial and setup](https://docs.databricks.com/getting-started/index.html). During the 14-day free trial, all Databricks usage is free, but Databricks uses compute and S3 storage resources in your cloud provider account. **Step 2** **Understand the common Delta Lake operations** The Databricks Lakehouse Platform simplifies the entire data lifecycle, from data ingestion to monitoring and governance, and it starts with [Delta Lake](https://www.databricks.com/product/delta-lake-on-databricks), a fully open-source storage system based on the Delta format providing reliability through ACID transactions and scalable metadata handling. Large quantities of raw files in blob storage can be converted to Delta to organize and store the data cheaply. This allows for flexibility of data movement while being performant and less expensive. 
[Get acquainted with the Delta Lake storage format](https://docs.databricks.com/delta/tutorial.html) and learn how to create, manage and query tables. With support for ACID transactions and schema enforcement, Delta Lake provides the reliability that traditional data lakes lack. This enables you to scale reliable data insights throughout the organization and run analytics and other data projects directly on your data lake, [for up to 50x faster time-to-insight.](https://www.databricks.com/customers/wejo) Delta Lake transactions use log files stored alongside data files to provide ACID guarantees at a table level. Because the data and log files backing Delta Lake tables live together in cloud object storage, reading and writing data can occur simultaneously without risk of many queries resulting in performance degradation or deadlock for business-critical workloads. This means that users and applications throughout the enterprise environment can connect to the same single copy of the data to drive diverse workloads, with all viewers guaranteed to receive the most current version of the data at the time their query executes. With performance features like indexing, Delta Lake customers have seen [ETL workloads execute up to 48x faster.](https://www.databricks.com/customers/columbia) All data in Delta Lake is stored in open Apache Parquet format, allowing data to be read by any compatible reader. APIs are open and compatible with Apache Spark, so you have access to a vast open-source ecosystem to avoid data lock-in from proprietary formats and conversions, which have embedded and added costs. 
###### “By leveraging Databricks and Delta Lake, we have already been able to democratize data at scale while lowering the cost of running production workloads by 60%, saving us millions of dollars.” (Steve Pulec, Chief Technology Officer, YipitData) [Learn more](https://www.databricks.com/customers/yipitdata) ----- **Step 3** **Ingest data efficiently at scale** With a [Lakehouse Platform](https://www.databricks.com/product/data-lakehouse), data teams can ingest data from hundreds of data sources for analytics, AI and streaming applications into one place. Databricks recommends [Auto Loader](https://docs.databricks.com/ingestion/auto-loader/index.html) for incremental data ingestion. To ingest any file that can land in a data lake, Auto Loader incrementally and automatically processes new data files as they arrive in cloud storage in scheduled or continuous jobs. Auto Loader scales to support near real-time ingestion of millions of files per hour. For pushing data into Delta Lake, the SQL command [COPY INTO](https://docs.databricks.com/ingestion/copy-into/index.html) allows you to perform batch file ingestion. COPY INTO is best used when the input directory contains thousands of files or fewer, and the user prefers SQL. COPY INTO can be used over JDBC to push data into Delta Lake at your convenience. **Step 4** **Leverage production-ready tools to automate ETL pipelines** Once the raw data is ingested, Databricks provides a suite of production-ready tools that allow data professionals to quickly develop and deploy extract, transform and load (ETL) pipelines. Databricks SQL allows analysts to run SQL queries against the same tables used in production ETL workloads, allowing for real-time business intelligence at scale. 
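Returning to the two ingestion options from Step 3, the sketch below shows roughly what each might look like from a Python notebook. It is an illustration under stated assumptions rather than the guide's own code: the table name, storage paths and helper names are hypothetical, the list of file formats is assumed from the `COPY INTO` documentation, and the Auto Loader helper is only defined here, since calling it needs a live Spark session on Databricks.

```python
# Assumed set of formats accepted by COPY INTO; check the docs for your runtime.
COPY_INTO_FORMATS = {"CSV", "JSON", "AVRO", "ORC", "PARQUET", "TEXT", "BINARYFILE"}


def copy_into_statement(target_table: str, source_path: str, file_format: str = "JSON") -> str:
    """Build a COPY INTO statement for batch file ingestion into a Delta table."""
    fmt = file_format.upper()
    if fmt not in COPY_INTO_FORMATS:
        raise ValueError(f"Unsupported file format: {file_format}")
    return (
        f"COPY INTO {target_table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {fmt}"
    )


def read_with_auto_loader(spark, source_path: str, schema_location: str, file_format: str = "json"):
    """Return a streaming DataFrame that picks up new files incrementally.

    `spark` is the SparkSession a Databricks notebook provides; this function
    is defined but not called in this sketch.
    """
    return (
        spark.readStream.format("cloudFiles")  # Auto Loader source
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", schema_location)
        .load(source_path)
    )


# Hypothetical table and bucket names, for illustration only.
stmt = copy_into_statement("main.sales.orders", "s3://example-bucket/orders/", "json")
print(stmt)
```

The split follows the guidance in the text: `COPY INTO` for SQL-driven batch loads over a modest number of files, and Auto Loader for continuous or scheduled ingestion at larger scale.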
With your trial account, [it’s time to develop and deploy](https://docs.databricks.com/getting-started/etl-quick-start.html) [your first extract, transform a, /Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/technical_guide_solving_common-data-challenges-for-startups-and-digital-native-businesses.pdf))","List(Increasing volume and velocity of data as companies mature., Need for faster data access and reduced latency., Challenges in data governance, including cataloging, auditing, and securing data., Complexities of using multiple cloud environments., Data storage issues such as slow access, poor query performance, and high costs., Requirement to support structured, semi-structured, and unstructured data formats.)",SYNTHETIC_FROM_DOC,/Volumes/casaman_ssa/demos/volume_databricks_documentation/databricks-pdf/technical_guide_solving_common-data-challenges-for-startups-and-digital-native-businesses.pdf
