#Creating a catalog and volume to build the data lake / Delta Lake and the Delta lakehouse

In [0]:
%sql
CREATE CATALOG IF NOT EXISTS data_optimization;
CREATE SCHEMA IF NOT EXISTS data_optimization.data_db;
CREATE VOLUME IF NOT EXISTS data_optimization.data_db.volume1;

In [0]:
%sql
USE data_optimization.data_db;
--Run this once so that later queries do not have to repeat the catalog and schema

#Create Delta lake from raw csv files and create delta table

In [0]:
%python
df1=spark.read.csv("/Volumes/data_optimization/data_db/volume1/druginfo.csv",header=True,inferSchema=True,quote='"')
##Create Datalake
##Parquet format
df1.write.mode("overwrite").format("parquet").save("/Volumes/data_optimization/data_db/volume1/Deltalake/druginfo.parquet")
##Delta format
df1.write.mode("overwrite").format("delta").save("/Volumes/data_optimization/data_db/volume1/Deltalake/druginfo.delta")

##Create the Delta lakehouse table from df1
df1.write.mode("overwrite").saveAsTable("data_optimization.data_db.drug_tbl")

###under the hood data is stored in S3
explain select * from data_optimization.data_db.drug_tbl

![image_1770300046372.png](./image_1770300046372.png "image_1770300046372.png")

##Alternate way to create lake house from file directly
'''
CREATE TABLE IF NOT EXISTS drug_tbl (
  uniqueid INT,
  drugname STRING,
  condition STRING,
  rating INT,
  date DATE,
  usefulcount INT
)
USING DELTA;'''


##Difference between delta and parquet file formats
![Delta vs parquet_1770617852897.png](./Delta vs parquet_1770617852897.png "Delta vs parquet_1770617852897.png")

#Why Delta over other file formats in Databricks:

Delta supports:
- ACID transactions
- Schema enforcement
- Schema evolution
- UPDATE / DELETE
- MERGE (UPSERT)
- Time travel
- Data versioning
- Streaming support
- Concurrent writes
- Optimize / compaction

#Schema merge/evolution while writing

Schema evolution can be enabled on a Delta write by adding the option below.<br>
df.write.option("mergeSchema","True").saveAsTable("lakehousecat1.deltadb.drugstbl",mode='overwrite')

| Feature               | Merge Schema on READ                    | Merge Schema on WRITE   |
| --------------------- | --------------------------------------- | ----------------------- |
| Purpose               | Handle schema differences at query time | Evolve table schema     |
| Modifies table schema | ❌ No                                    | ✅ Yes                   |
| Persists new columns  | ❌ No                                    | ✅ Yes                   |
| Affects other readers | ❌ No                                    | ✅ Yes                   |
| Typical use           | Backward compatibility                  | Schema evolution        |
| Risk                  | Performance overhead                    | Accidental schema drift |
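
To make the "merge schema on write" column concrete, here is a minimal pure-Python sketch of the idea. It is a toy model and not Delta Lake's implementation: the function name `merge_schema` and the dict-based schema representation are hypothetical, and the sketch assumes that any type change counts as a conflict.

```python
# Toy model of schema evolution on write (illustrative only, not Delta Lake code).
# A schema is represented as a dict of column name -> type name (insertion ordered).

def merge_schema(table_schema, incoming_schema):
    """Return the evolved schema: existing columns keep their position,
    new columns from the incoming data are appended at the end."""
    merged = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column in merged and merged[column] != dtype:
            # Assumption of this sketch: any type change is a conflict.
            raise TypeError(
                f"type conflict on column {column!r}: {merged[column]} vs {dtype}"
            )
        merged.setdefault(column, dtype)
    return merged


table_schema = {"uniqueid": "int", "drugname": "string"}
incoming_schema = {"uniqueid": "int", "drugname": "string", "rating": "int"}

evolved = merge_schema(table_schema, incoming_schema)
print(list(evolved))  # ['uniqueid', 'drugname', 'rating']
```

The new `rating` column is persisted in the evolved schema, which is why every later reader sees it and why accidental schema drift is the main risk.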


#DDL on DeltaLakeHouse

In [0]:
%sql
--DDL is supported (more examples follow later)
--CREATE TABLE
create or replace table data_optimization.default.sampletable1(id int,name string) 
using delta;
insert into data_optimization.default.sampletable1 values(1,'kavi');
--The data is stored internally as Delta files, but those files cannot be browsed directly on Databricks serverless

--ALTER TABLE
alter table data_optimization.default.sampletable1 set tblproperties (delta.enableChangeDataFeed = true);

--DESCRIBE HISTORY BEFORE DELETION
describe history data_optimization.default.sampletable1;

--DROP TABLE
--drop table data_optimization.default.sampletable1;


--Important note: DDL statements do not create part files; they only modify the Unity Catalog metadata. They do, however, create new versions (rows) in the DESCRIBE HISTORY output.

#DML on deltalakehouse

In [0]:
%sql
DROP TABLE IF EXISTS data_optimization.default.sampletable_dml;
create or replace table data_optimization.default.sampletable_dml(id int,name string) 
using delta;



##Insert

- Each insert transaction creates a new part file
- Two separate insert transactions therefore create two new part files, one per transaction
- An insert never drops or alters any existing part file
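
The three points above can be sketched with a toy append-only log in plain Python. This is only an illustration under simplifying assumptions: `ToyTable` and the `part-0000N.parquet` names are made up, and a real write may produce more than one file per transaction.

```python
# Toy transaction log: every INSERT commits one "add" action for a new part file.
# ToyTable and the part file names are hypothetical, used only for this sketch.

class ToyTable:
    def __init__(self):
        self.log = []     # ordered list of committed actions
        self.files = {}   # file name -> rows stored in that file

    def insert(self, rows):
        name = f"part-{len(self.files):05d}.parquet"
        self.files[name] = list(rows)   # new file; existing files are untouched
        self.log.append({"add": name})
        return name


table = ToyTable()
first = table.insert([(1, "kavi"), (2, "Shank"), (3, "Vedha")])
before = list(table.files[first])

table.insert([(4, "kavi"), (5, "Shank")])
table.insert([(6, "Vedha"), (7, "Shank")])

print(len(table.files))              # 3
print(table.files[first] == before)  # True
```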


In [0]:
%sql
insert into data_optimization.default.sampletable_dml values(1,'kavi'),(2,'Shank'),(3,'Vedha');
--adds a new partfile


In [0]:
insert into data_optimization.default.sampletable_dml values(4,'kavi'),(5,'Shank');
insert into data_optimization.default.sampletable_dml values(6,'Vedha'),(7,'Shank');


##Update

Important considerations during an update:
- An update never physically alters or removes existing data files. It records the change as new entries in the Delta log.

- If the part file holds only one row and that row is updated, the history shows numRemovedFiles=1 and numAddedFiles=1. The logical reference to the old file is removed from the snapshot, so a SELECT ignores that file.

- If the part file holds multiple rows and one of them is updated, the history shows numRemovedFiles=0 and numAddedFiles=1. The logical reference to the file stays in the snapshot, and a deletion vector records which rows were deleted.

In both cases only logical metadata is added to the Delta log to mark the file or rows as deleted; the physical file remains in storage.

- After an update, an OPTIMIZE operation may also run automatically and appear in the table history.


**Traditional Delta behavior (Copy-on-Write):**

- DELETE / UPDATE → rewrite whole files

- Expensive for small changes

**Deletion Vectors:**

- DELETE / UPDATE → no file rewrite

- Faster DML

- Fewer small files

**What happens on UPDATE in Delta Lake?**

An UPDATE = DELETE + INSERT

Step-by-step:

**Old row**
- Marked as deleted using a Deletion Vector
- No Parquet rewrite (if DV is enabled)

**New row**
- Written into a new Parquet file

So you get:
- 🧾 Deletion Vector file (for old rows)
- 📄 New Parquet file (for updated rows)
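
The "UPDATE = DELETE + INSERT" flow can be mimicked in a few lines of plain Python. This is a toy model under stated assumptions, not Delta Lake code: the `files` and `deletion_vectors` dicts and the helper names `update_name` and `read_all` are hypothetical.

```python
# Toy model of UPDATE with deletion vectors (illustrative, not Delta Lake code).
files = {"part-0.parquet": [(1, "kavi"), (2, "Shank"), (3, "Vedha")]}
deletion_vectors = {}   # file name -> set of deleted row positions


def update_name(row_id, new_name):
    """Mark the old row as deleted and write the new row to a new file."""
    new_rows = []
    for name, rows in list(files.items()):
        for pos, (rid, _) in enumerate(rows):
            if rid == row_id and pos not in deletion_vectors.get(name, set()):
                deletion_vectors.setdefault(name, set()).add(pos)  # old row
                new_rows.append((rid, new_name))                   # new row
    if new_rows:
        files[f"part-{len(files)}.parquet"] = new_rows


def read_all():
    """Return live rows: everything not masked by a deletion vector."""
    return sorted(
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if pos not in deletion_vectors.get(name, set())
    )


update_name(1, "Shankar")
print(files["part-0.parquet"])  # the original file still holds (1, 'kavi')
print(deletion_vectors)         # {'part-0.parquet': {0}}
print(read_all())               # [(1, 'Shankar'), (2, 'Shank'), (3, 'Vedha')]
```

The original file is never rewritten; the old value only disappears from query results because the reader skips position 0.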

In [0]:
%sql
update data_optimization.default.sampletable_dml  set name='Shankar' where id=1;

describe history data_optimization.default.sampletable_dml;

In [0]:
insert into data_optimization.default.sampletable_dml values(8,'ant');
Update data_optimization.default.sampletable_dml set name='Ants' where id=8;


##Delete   
A Deletion Vector is a metadata-based way to mark rows as deleted without rewriting Parquet files.

👉 Instead of copy-on-write (rewrite files), Delta:

- Keeps the Parquet file as-is
- Stores a bitmap of deleted row positions
- Skips those rows at read time

Why Deletion Vectors exist

**Traditional Delta behavior (Copy-on-Write):**

- DELETE / UPDATE → rewrite whole files
- Expensive for small changes

**Deletion Vectors:**

- DELETE / UPDATE → no file rewrite
- Faster DML
- Fewer small files
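
As a rough illustration of the "bitmap of deleted row positions" idea, the sketch below uses a plain Python integer as the bitmap. The helper names `mark_deleted` and `live_rows` are hypothetical, and real deletion vectors use a compressed on-disk format rather than a single integer.

```python
# A deletion vector sketched as an integer bitmap: bit i set means row i is deleted.
rows = [(6, "Vedha"), (7, "Shank"), (8, "ant")]   # contents of one immutable file


def mark_deleted(bitmap, position):
    """Return a new bitmap with the given row position flagged as deleted."""
    return bitmap | (1 << position)


def live_rows(rows, bitmap):
    """Skip flagged positions at read time; the row list itself is unchanged."""
    return [row for pos, row in enumerate(rows) if not (bitmap >> pos) & 1]


dv = 0
dv = mark_deleted(dv, 2)   # DELETE ... WHERE id = 8
dv = mark_deleted(dv, 1)   # DELETE ... WHERE id = 7

print(bin(dv))              # 0b110
print(live_rows(rows, dv))  # [(6, 'Vedha')]
print(len(rows))            # 3
```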

In [0]:
DELETE FROM data_optimization.default.sampletable_dml where id=8;

DESCRIBE HISTORY data_optimization.default.sampletable_dml;

In [0]:
DELETE FROM data_optimization.default.sampletable_dml where id=7;

In [0]:
DESCRIBE detail data_optimization.default.sampletable_dml;

##What happens while reading the table after update or delete

**Step-by-step: READ after UPDATE / DELETE**<br>
When you query the table:
SELECT * FROM table_name;<br>
1️⃣ Snapshot resolution (from _delta_log)

Delta:
- Reads the latest transaction JSON/Parquet logs
- Builds the current snapshot

Determines:
- Which Parquet files are active
- Which rows inside those files are logically deleted

2️⃣ File selection (data skipping)

Delta decides which files to scan using:
- File-level statistics (min/max)
- Partition pruning
- Predicate pushdown (inherited from Parquet)

👉 Files fully deleted are skipped entirely.

3️⃣ Apply Deletion Vectors (DV filtering)

For files that have deletion vectors:
- Delta loads the corresponding DV file
- Uses a bitmap / index to filter out deleted row IDs
- Only valid rows are passed to Spark

**📌 Important:**
- The Parquet file itself is not rewritten
- Deleted rows are skipped at read time

4️⃣ Merge old + new data

During the scan, the following are unioned transparently:
- Old Parquet files (with DV applied)
- New Parquet files (from UPDATE/DELETE)

In simple terms:

1. Delta log determines which Parquet part files are active and should be read, skipping files that are fully removed.
2. For files with deletion vectors, Delta applies the deletion vectors to filter out deleted rows.
3. The remaining rows are returned to the query.
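
The three steps in simple terms can be sketched as a small log replay in plain Python. This is a simplified model and not how the Delta protocol encodes its actions: the `add` / `remove` / `dv` action shapes and the function names `resolve_snapshot` and `scan` are assumptions made for the example.

```python
# Toy read path: replay the log to find active files, then apply deletion vectors.
files = {
    "part-0.parquet": [(1, "kavi")],
    "part-1.parquet": [(2, "Shank"), (3, "Vedha")],
    "part-2.parquet": [(1, "Shankar")],
}
log = [
    {"add": "part-0.parquet"},
    {"add": "part-1.parquet"},
    {"remove": "part-0.parquet"},       # UPDATE on a single-row file ...
    {"add": "part-2.parquet"},          # ... writes the new row elsewhere
    {"dv": ("part-1.parquet", {0})},    # DELETE of row 0 inside part-1
]


def resolve_snapshot(log):
    """Step 1: decide which files are active and which rows are masked."""
    active, dvs = [], {}
    for action in log:
        if "add" in action:
            active.append(action["add"])
        elif "remove" in action:
            active.remove(action["remove"])
            dvs.pop(action["remove"], None)
        elif "dv" in action:
            name, positions = action["dv"]
            dvs[name] = dvs.get(name, set()) | positions
    return active, dvs


def scan(files, active, dvs):
    """Steps 2 and 3: read only active files and skip deleted positions."""
    result = []
    for name in active:
        deleted = dvs.get(name, set())
        result.extend(r for pos, r in enumerate(files[name]) if pos not in deleted)
    return result


active, dvs = resolve_snapshot(log)
print(active)                    # ['part-1.parquet', 'part-2.parquet']
print(scan(files, active, dvs))  # [(3, 'Vedha'), (1, 'Shankar')]
```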

##Merge- SCD1

In [0]:
DROP TABLE IF EXISTS data_optimization.default.sampletable_dml_src;
create or replace table data_optimization.default.sampletable_dml_src(id int,name string) 
using delta;
insert into data_optimization.default.sampletable_dml_src values(8,'bhoom'),(2,'pen');
--Assume the source sends data incrementally every day



In [0]:
SELECT * FROM data_optimization.default.sampletable_dml;

In [0]:
MERGE INTO data_optimization.default.sampletable_dml t1
USING data_optimization.default.sampletable_dml_src t2
ON t1.id=t2.id
WHEN MATCHED THEN
UPDATE SET t1.name=t2.name
WHEN NOT MATCHED THEN
INSERT (t1.id,t1.name) VALUES (t2.id,t2.name);

In [0]:
SELECT * FROM data_optimization.default.sampletable_dml;

##Merge - SCD2

In [0]:
DROP TABLE IF EXISTS data_optimization.default.sampletable_dml_scd2;
create or replace table data_optimization.default.sampletable_dml_scd2(
  id int,
  name string,
  start_date date default current_date(),
  end_date date default '9999-12-31',
  is_active char(1) default 'Y')
using delta
tblproperties('delta.feature.allowColumnDefaults' = 'supported');


In [0]:
INSERT INTO data_optimization.default.sampletable_dml_scd2
SELECT
  id,
  name,
  current_date()              AS start_date,
  DATE '9999-12-31'            AS end_date,
  'Y'                          AS is_active
FROM data_optimization.default.sampletable_dml;


In [0]:
DROP TABLE IF EXISTS data_optimization.default.sampletable_dml_scd2_src;
create or replace table data_optimization.default.sampletable_dml_scd2_src(id int,name string) 
using delta;
insert into data_optimization.default.sampletable_dml_scd2_src values(8,'bhoom'),(3,'pen'),(4,'sindu');

In [0]:
SELECT * FROM data_optimization.default.sampletable_dml_scd2_src

###Method1

In [0]:
--Step 1: expire the active row when the incoming name differs
MERGE INTO data_optimization.default.sampletable_dml_scd2 t1
USING data_optimization.default.sampletable_dml_scd2_src t2
ON t1.id = t2.id AND t1.is_active = 'Y'
WHEN MATCHED AND t1.name <> t2.name THEN
  UPDATE SET
    t1.end_date = current_date(),
    t1.is_active = 'N';

--Step 2: insert a new active row for every source id that has no active row
--(brand-new ids and the ids expired in step 1)
MERGE INTO data_optimization.default.sampletable_dml_scd2 t1
USING data_optimization.default.sampletable_dml_scd2_src t2
ON t1.id = t2.id AND t1.is_active = 'Y' AND t1.end_date = '9999-12-31'
WHEN NOT MATCHED THEN
  INSERT(
    id,
    name,
    start_date,
    end_date,
    is_active
  )
  VALUES (
    t2.id,
    t2.name,
    current_date(),
    DATE '9999-12-31',
    'Y'
  );



###Method 2

In [0]:
UPDATE data_optimization.default.sampletable_dml_scd2
SET
  end_date = current_date(),
  is_active = 'N'
WHERE id IN (
  SELECT t1.id
  FROM data_optimization.default.sampletable_dml_scd2 t1
  LEFT SEMI JOIN data_optimization.default.sampletable_dml_scd2_src t2
    ON t1.id = t2.id
   AND t1.name <> t2.name
  WHERE t1.is_active = 'Y'  --compare only the active row with the source
)
AND is_active = 'Y';

INSERT INTO data_optimization.default.sampletable_dml_scd2
SELECT
  t2.id,
  t2.name,
  current_date()            AS start_date,
  DATE '9999-12-31'          AS end_date,
  'Y'                        AS is_active
FROM data_optimization.default.sampletable_dml_scd2_src t2
LEFT ANTI JOIN data_optimization.default.sampletable_dml_scd2 t1
  ON t1.id = t2.id
 AND t1.is_active = 'Y';
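
To check the expected outcome of the expire-then-insert logic without a cluster, the same SCD2 rules can be traced in plain Python. This is a hedged sketch: the function `apply_scd2`, the dict-based rows, and the fixed dates are hypothetical stand-ins, and only the source rows mirror the values used in this notebook.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)


def apply_scd2(target, source, today):
    """Expire changed active rows, then add a new active row for changed or new ids.
    target: list of dict rows (mutated in place); source: list of (id, name)."""
    active = {row["id"]: row for row in target if row["is_active"] == "Y"}
    for sid, sname in source:
        current = active.get(sid)
        if current is not None and current["name"] == sname:
            continue                      # unchanged: nothing to do
        if current is not None:           # changed: close the active row
            current["end_date"] = today
            current["is_active"] = "N"
        target.append({"id": sid, "name": sname, "start_date": today,
                       "end_date": HIGH_DATE, "is_active": "Y"})
    return target


loaded = date(2024, 1, 1)   # assumed initial load date
today = date(2024, 1, 2)    # assumed run date
target = [
    {"id": 3, "name": "Vedha", "start_date": loaded, "end_date": HIGH_DATE, "is_active": "Y"},
    {"id": 8, "name": "bhoom", "start_date": loaded, "end_date": HIGH_DATE, "is_active": "Y"},
]
source = [(8, "bhoom"), (3, "pen"), (4, "sindu")]

apply_scd2(target, source, today)
for row in target:
    print(row["id"], row["name"], row["is_active"])
```

With these inputs, id 8 is left alone, id 3 keeps its old row as history and gains a new active row, and id 4 is inserted as a new active row.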



#DML on Deltalake files

In [0]:
%python
df2=spark.read.format('csv').load('/Volumes/data_optimization/data_db/volume1/sample_data1.txt',header=True,inferSchema=True)
df2.write.format("delta").save('/Volumes/data_optimization/data_db/volume1/Deltalake/sample_data',mode='overwrite')
spark.read.format("delta").load('/Volumes/data_optimization/data_db/volume1/Deltalake/sample_data').display()

In [0]:
%python
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

deltafile = DeltaTable.forPath(spark,"/Volumes/data_optimization/data_db/volume1/Deltalake/sample_data")
deltafile.update('id=5',{'name':lit('penny')})

Before update:
![image_1770315713283.png](./image_1770315713283.png "image_1770315713283.png")
Delta_log:
![image_1770315642909.png](./image_1770315642909.png "image_1770315642909.png")
After update:

Created a deletion vector file for the old record and a new part file to hold the new record
![image_1770316073417.png](./image_1770316073417.png "image_1770316073417.png")

![image_1770316042250.png](./image_1770316042250.png "image_1770316042250.png")

#TCL 
Databricks does not support explicit TCL commands; instead, Delta Lake provides implicit per-statement transactions with automatic commit and rollback, and recovery is achieved using time travel and restore operations.

| TCL Concept       | How Databricks Achieves It |
| ----------------- | -------------------------- |
| BEGIN TRANSACTION | Implicit per-statement     |
| COMMIT            | Automatic log commit       |
| ROLLBACK          | `RESTORE TABLE table_name TO VERSION AS OF version_number` |
| SAVEPOINT         | Table VERSION              |
| ABORT TRANSACTION | Statement failure          |
| CONSISTENT READ   | Snapshot isolation         |
| TRANSACTION LOG   | `_delta_log`               |
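
The mapping of SAVEPOINT to a table version and ROLLBACK to a restore can be pictured with a small versioned store in plain Python. This is a conceptual sketch only: the class `VersionedTable` and its methods are hypothetical and do not correspond to any Databricks API.

```python
# Toy versioned table: every commit creates a version, restore is itself a commit.

class VersionedTable:
    def __init__(self):
        self.versions = [[]]   # version 0: empty table

    def commit(self, rows):
        """Each statement commits atomically and produces the next version."""
        self.versions.append(list(rows))
        return len(self.versions) - 1

    def read(self, version=None):
        """Read the latest snapshot, or time travel to an older version."""
        return list(self.versions[-1 if version is None else version])

    def restore(self, version):
        """Roll back by committing a copy of an older version; history is kept."""
        return self.commit(self.versions[version])


t = VersionedTable()
v1 = t.commit([(1, "kavi")])
v2 = t.commit([(1, "kavi"), (2, "Shank")])
v3 = t.commit([])                 # accidental delete of all rows
v4 = t.restore(v2)                # undo it by restoring version 2

print(v4)            # 4
print(t.read())      # [(1, 'kavi'), (2, 'Shank')]
print(t.read(v3))    # []
```

Because the restore is recorded as a new version rather than erasing version 3, the mistaken state remains available for inspection through time travel.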
