# DATABRICKS COURSE - DATABRICKS ACADEMY
---

## DATABRICKS - DATA INGESTION WITH LAKEFLOW CONNECT

![image.png](attachment:image.png)

> In this lecture, we'll explore how Databricks streamlines data engineering by enabling seamless data ingestion and management using Lakeflow Connect and storage formats such as Delta Lake, Parquet, and Iceberg.

## WHAT IS LAKEFLOW CONNECT?

> Lakeflow Connect streamlines data ingestion with simple, efficient connectors that bring data from files, cloud storage, databases, enterprise applications, and streaming sources directly into the Databricks Lakehouse, all within a unified, managed platform.

![image.png](attachment:image.png)

![image.png](attachment:image.png)

> WITH LAKEFLOW CONNECT, YOU CAN BUILD EFFICIENT INGESTION PIPELINES ENTIRELY WITHIN DATABRICKS.

It is simple to set up and maintain, and it provides unified orchestration, observability, and governance, all within the Databricks Data Intelligence Platform.

> Lakeflow Connect provides built-in connectors for the Databricks Data Intelligence Platform to streamline data ingestion.

- Key benefits include:
    - A managed and efficient solution that reduces costs and accelerates time to value.
    - Self-service interfaces that enable practitioners across the organization to easily ingest data from enterprise applications.
    - Unified observability and governance to ensure secure, reliable, and well-monitored pipelines and tables.

## SO WHAT EXACTLY IS LAKEFLOW CONNECT?
> Lakeflow Connect provides simple, efficient connectors to ingest data into the Databricks Lakehouse from a wide range of sources, including enterprise applications, databases, cloud storage, local files, message buses, and more. 

It supports three main types of ingestion:

- **Manual File Uploads:** Users can upload local files directly to Databricks, either into a volume or as a table, which makes it easy to bring local data into the platform quickly.

- **Standard Connectors:** These connectors support data ingestion from various sources such as cloud object storage, Kafka, and more. They support multiple ingestion modes, including batch, incremental batch, and streaming. We'll explore these ingestion methods in more detail shortly.

- **Managed Connectors:** Purpose-built for ingesting data from enterprise applications, including SaaS platforms and databases. They leverage efficient incremental read/write patterns to provide scalable, cost-effective, and high-performance data ingestion into the lakehouse.

### INGESTION METHODS OVERVIEW

When ingesting data into Databricks using Lakeflow Connect Standard Connectors, you can choose from several ingestion methods.

### BATCH INGESTION
Let's start with batch ingestion. Batch ingestion loads data as batches of rows into Databricks, often based on a schedule.

> Traditional batch ingestion processes all records each time it runs. Common techniques for performing batch ingestion include:

- The SQL statement `CREATE TABLE AS SELECT` (CTAS)
- The Python method `spark.read.load()`
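As a rough sketch of the two techniques above, the first helper builds a CTAS statement and the second does the equivalent with `spark.read`. The function names, paths, and table names are hypothetical. `read_files` is used here as one way to read raw files from Databricks SQL, `CREATE OR REPLACE` is chosen so the statement can be re-run, and a `spark` session is assumed to be provided by the notebook.

```python
def build_ctas_statement(target_table: str, source_path: str, file_format: str = "csv") -> str:
    """Return a CTAS statement that re-reads every file under source_path on each run."""
    return (
        f"CREATE OR REPLACE TABLE {target_table} AS "
        f"SELECT * FROM read_files('{source_path}', format => '{file_format}')"
    )


def batch_ingest(spark, source_path: str, target_table: str, file_format: str = "csv"):
    """Python equivalent: load the full source and overwrite the target table."""
    df = (
        spark.read.format(file_format)
        .option("header", "true")  # assumption: CSV files with a header row
        .load(source_path)
    )
    # Overwrite mode mirrors traditional batch behavior: all records are reprocessed.
    df.write.mode("overwrite").saveAsTable(target_table)
    return df
```

In a notebook this might be called as `spark.sql(build_ctas_statement("main.demo.sales", "/Volumes/main/demo/raw/sales"))`, where the catalog, schema, and volume names are placeholders.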

### INCREMENTAL INGESTION

> While traditional batch ingestion processes all records every time it runs, incremental batch ingestion automatically detects new records in the data source and skips records that have already been ingested. This means only new data is ingested.

Incremental batch ingestion is faster and more resource efficient because it processes only new records instead of reprocessing the entire data source.
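The difference can be illustrated with a toy simulation in plain Python. This is not how Databricks tracks ingestion state internally; the `ingest_full` and `ingest_incremental` helpers and the file names are hypothetical and only model the idea of remembering what has already been loaded.

```python
def ingest_full(source_files: dict) -> list:
    """Traditional batch: read every record from every file on each run."""
    return [row for name in sorted(source_files) for row in source_files[name]]


def ingest_incremental(source_files: dict, already_ingested: set) -> list:
    """Incremental batch: read only files not seen before, then remember them."""
    new_rows = []
    for name in sorted(source_files):
        if name in already_ingested:
            continue  # skip files loaded by an earlier run
        new_rows.extend(source_files[name])
        already_ingested.add(name)
    return new_rows


source = {"day1.csv": ["a", "b"], "day2.csv": ["c"]}
seen = set()
first_run = ingest_incremental(source, seen)   # loads both existing files
source["day3.csv"] = ["d", "e"]                # a new file arrives
second_run = ingest_incremental(source, seen)  # loads only day3.csv
full_run = ingest_full(source)                 # re-reads all five rows
```

The second incremental run touches two rows, while the full run touches five; that gap grows as the source accumulates data.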

Common techniques for performing incremental batch ingestion include:
- The SQL statement `COPY INTO`
- The Python method `spark.readStream` (Auto Loader with a timed trigger)
- Declarative Pipelines: `CREATE OR REFRESH STREAMING TABLE`
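The first two techniques might look roughly like the sketch below. The helper names, paths, and option values are hypothetical. The `COPY INTO` statement assumes the target table already exists, and the Auto Loader function uses the `availableNow` trigger (process everything new, then stop) as one way to get batch-style runs from a scheduled job; a `processingTime` trigger could be used instead.

```python
def build_copy_into_statement(target_table: str, source_path: str, file_format: str = "CSV") -> str:
    """Return a COPY INTO statement; files already loaded are skipped on re-runs."""
    return (
        f"COPY INTO {target_table} "
        f"FROM '{source_path}' "
        f"FILEFORMAT = {file_format} "
        "FORMAT_OPTIONS ('header' = 'true') "
        "COPY_OPTIONS ('mergeSchema' = 'true')"
    )


def auto_loader_incremental(spark, source_path, target_table, checkpoint_path, file_format="csv"):
    """Auto Loader run that ingests files not yet processed, then stops."""
    stream = (
        spark.readStream.format("cloudFiles")  # "cloudFiles" selects Auto Loader
        .option("cloudFiles.format", file_format)
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is stored
        .load(source_path)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # records which files were ingested
        .trigger(availableNow=True)  # process the current backlog, then stop
        .toTable(target_table)
    )
```

Because progress is stored in the checkpoint, running `auto_loader_incremental` again with the same `checkpoint_path` should pick up only files that arrived since the previous run.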

### STREAMING INGESTION

> With streaming ingestion, data is continuously loaded as it is generated, allowing you to query it in near real-time. 

This method is ideal for loading streaming data from sources such as Apache Kafka, Amazon Kinesis, Google Pub/Sub, and Apache Pulsar.

Streaming ingestion processes data as it arrives, enabling low-latency analysis and immediate action. In contrast, micro-batch ingestion collects data over short, frequent intervals (seconds or minutes) and processes it in small batches. This strikes a balance between latency and system efficiency.

Common techniques for performing streaming ingestion include:
- `spark.readStream` (Auto Loader with a continuous trigger)
- Declarative Pipelines (continuous trigger mode)
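For a message-bus source, a streaming read could be sketched as follows. This example reads from Apache Kafka rather than Auto Loader and uses a short `processingTime` micro-batch trigger instead of a continuous trigger, so it illustrates the micro-batch side of the trade-off described above. The function names, broker address, topic, and interval are hypothetical.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source; the values passed in are placeholders."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",  # only read messages produced after the stream starts
    }


def stream_from_kafka(spark, bootstrap_servers, topic, target_table, checkpoint_path,
                      interval="10 seconds"):
    """Continuously append Kafka messages to a table in small micro-batches."""
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers key and value as binary, so cast them to strings before storing.
    decoded = events.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    return (
        decoded.writeStream
        .option("checkpointLocation", checkpoint_path)  # tracks consumed offsets
        .trigger(processingTime=interval)  # start a new micro-batch at this interval
        .toTable(target_table)
    )
```

Shortening `interval` lowers latency at the cost of more frequent, smaller batches; lengthening it does the opposite.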