#Versioning/History/TimeTravel


1. **History**
Every commit is recorded as one row in the output of the **`DESCRIBE HISTORY`** command, i.e. one row per transaction (insert/update/delete). DDL changes and OPTIMIZE runs are recorded as rows too.
![image_1770570329414.png](./image_1770570329414.png "image_1770570329414.png")
2. **Version**
- **`VERSION AS OF`** reads the table as of the given snapshot version.
- Version 0 is usually the CREATE TABLE commit, so querying it returns no rows.

3. **Time Travel**
- **`TIMESTAMP AS OF`** reads the table as it existed at the given timestamp; any commits made after that timestamp are ignored.
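
How a timestamp maps to a snapshot can be sketched in a few lines of Python. This is a simplified model, not Delta's implementation: the `history` list and the `resolve_version` helper are hypothetical, and the rule shown (pick the latest commit at or before the requested timestamp) is just the behaviour described in the bullets above.

```python
from datetime import datetime, timezone

# Hypothetical commit history: (version, commit timestamp), like DESCRIBE HISTORY rows.
history = [
    (0, datetime(2026, 2, 5, 14, 50, 0, tzinfo=timezone.utc)),   # CREATE TABLE
    (1, datetime(2026, 2, 5, 14, 51, 10, tzinfo=timezone.utc)),  # INSERT
    (2, datetime(2026, 2, 5, 14, 52, 30, tzinfo=timezone.utc)),  # UPDATE
]

def resolve_version(history, as_of):
    """Return the latest version committed at or before `as_of`."""
    candidates = [version for version, ts in history if ts <= as_of]
    if not candidates:
        raise ValueError("timestamp is earlier than the first commit")
    return max(candidates)

as_of = datetime(2026, 2, 5, 14, 51, 25, tzinfo=timezone.utc)
print(resolve_version(history, as_of))  # 1
```

The commit at 14:52:30 is ignored because it happened after the requested timestamp.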

In [0]:
DESCRIBE HISTORY data_optimization.default.sampletable1;

select * from data_optimization.default.sampletable1 timestamp as of '2026-02-05T14:51:25.000+00:00';
select * from data_optimization.default.sampletable1 version as of 2;

In [0]:
describe data_optimization.data_db.drug_tbl

In [0]:
drop table if exists data_optimization.data_db.drug_tbl2;
create table data_optimization.data_db.drug_tbl2(
uniqueid int,
drugname string,
condition  string,
rating int,
date date,
usefulcount int)
using delta
partitioned by(rating);



#Optimize
#### 1️⃣ Before OPTIMIZE: baseline Delta table

##### Directory structure

```
delta_lab/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── 00000000000000000002.json
├── part-00000-aaa.parquet
├── part-00001-bbb.parquet
├── part-00002-ccc.parquet
```

Assumption:

* Small files
* Some deletes/updates already happened
* Deletion Vectors are enabled

#### 2️⃣ What the Parquet files contain (conceptually)
##### `part-00000-aaa.parquet`
```
id | name
---------
1  | A
2  | B   (deleted later)
3  | C
```

##### `part-00001-bbb.parquet`
```
id | name
---------
4  | D
5  | E
```

##### Deletion Vector (DV)
Stored separately (simplified):

```
DV for part-00000 → row index {1}
```

Meaning:

* Row `(2, B)` is deleted
* File still physically contains it

---

#### 3️⃣ Delta log BEFORE OPTIMIZE (important)

##### Example `00000000000000000002.json`

```json
{
  "add": {
    "path": "part-00000-aaa.parquet",
    "size": 2048,
    "deletionVector": {
      "storageType": "u",
      "pathOrInlineDv": "dv-0001",
      "cardinality": 1
    }
  }
}
```

➡️ File is **active**
➡️ DV masks deleted rows

---

#### 4️⃣ OPTIMIZE is triggered

```sql
OPTIMIZE delta_lab;
```

Now the magic happens.

---

#### 5️⃣ What OPTIMIZE actually DOES internally

##### OPTIMIZE reads:

* All **active Parquet files**
* Their **deletion vectors**
* Applies the **latest snapshot**

##### OPTIMIZE writes:

* **New Parquet files**
* Containing **only live rows**
* With **no deletion vectors**
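
The read and write steps above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not how the engine is implemented: the `optimize` helper and the dictionary layout are hypothetical, and `part-00002-ccc.parquet` is left out because its contents are not shown in this walkthrough.

```python
# Active files and their rows (contents taken from the example tables above).
active_files = {
    "part-00000-aaa.parquet": [(1, "A"), (2, "B"), (3, "C")],
    "part-00001-bbb.parquet": [(4, "D"), (5, "E")],
}
# Deletion vectors keyed by file: zero-based positions of deleted rows.
deletion_vectors = {"part-00000-aaa.parquet": {1}}

def optimize(files, dvs, new_path):
    """Compact all active files into one new file that holds only live rows."""
    live_rows = []
    for path, rows in files.items():
        dv = dvs.get(path, set())
        live_rows.extend(row for pos, row in enumerate(rows) if pos not in dv)
    # Actions for the commit: one add for the new file, one remove per old file.
    actions = [{"add": {"path": new_path, "dataChange": False}}]
    actions += [{"remove": {"path": path}} for path in files]
    return live_rows, actions

rows, actions = optimize(active_files, deletion_vectors, "part-00000-zzz.parquet")
print(rows)          # [(1, 'A'), (3, 'C'), (4, 'D'), (5, 'E')]
print(len(actions))  # 3
```

Row `(2, B)` is dropped while the new file is written, so the new file needs no deletion vector.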

---

#### 6️⃣ Directory structure AFTER OPTIMIZE

```
delta_lab/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json   ← OPTIMIZE commit
│   └── 00000000000000000003.checkpoint.parquet
├── part-00000-aaa.parquet          ← old (logically removed)
├── part-00001-bbb.parquet          ← old (logically removed)
├── part-00002-ccc.parquet          ← old (logically removed)
├── part-00000-zzz.parquet          ← NEW optimized file
```

⚠️ Old files still exist physically
⚠️ They are no longer referenced

---

#### 7️⃣ Contents of the NEW optimized Parquet file

##### `part-00000-zzz.parquet`

```
id | name
---------
1  | A
3  | C
4  | D
5  | E
```

✅ Deleted row `(2, B)` is **gone**
✅ No DV needed anymore

---

#### 8️⃣ Delta log entry CREATED by OPTIMIZE

##### `00000000000000000003.json`

```json
{
  "commitInfo": {
    "operation": "OPTIMIZE",
    "operationMetrics": {
      "numRemovedFiles": "3",
      "numAddedFiles": "1",
      "numDeletedRows": "1"
    }
  }
}
{
  "add": {
    "path": "part-00000-zzz.parquet",
    "size": 8192,
    "dataChange": false
  }
}
{
  "remove": {
    "path": "part-00000-aaa.parquet",
    "deletionTimestamp": 1700000000000
  }
}
{
  "remove": {
    "path": "part-00001-bbb.parquet",
    "deletionTimestamp": 1700000000000
  }
}
{
  "remove": {
    "path": "part-00002-ccc.parquet",
    "deletionTimestamp": 1700000000000
  }
}
```

---

#### 9️⃣ Key observations (THIS IS THE GOLD)

##### 🔹 OPTIMIZE is snapshot-based

* It does **not care** how many updates/deletes happened
* It only materializes **final valid rows**

##### 🔹 OPTIMIZE removes DV indirectly

* DV is **not copied**
* Clean Parquet files are written

##### 🔹 History is still intact

* Old files are only **logically removed**
* Time travel still works
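
Why time travel still works after OPTIMIZE can be seen by replaying the log. The sketch below is a simplified Python model: the `commits` dictionary and `snapshot` helper are hypothetical, and the contents of versions 0 to 2 are assumed for illustration (only the version 3 actions come from the OPTIMIZE commit shown earlier).

```python
# Simplified commits: version -> list of (action, path).
commits = {
    0: [],  # CREATE TABLE (assumed)
    1: [("add", "part-00000-aaa.parquet"),
        ("add", "part-00001-bbb.parquet"),
        ("add", "part-00002-ccc.parquet")],  # assumed initial load
    2: [("add", "part-00000-aaa.parquet")],  # same file re-added with a DV
    3: [("add", "part-00000-zzz.parquet"),   # OPTIMIZE commit
        ("remove", "part-00000-aaa.parquet"),
        ("remove", "part-00001-bbb.parquet"),
        ("remove", "part-00002-ccc.parquet")],
}

def snapshot(commits, version):
    """Replay commits 0..version and return the set of active file paths."""
    active = set()
    for v in sorted(commits):
        if v > version:
            break
        for action, path in commits[v]:
            if action == "add":
                active.add(path)
            else:
                active.discard(path)
    return active

print(sorted(snapshot(commits, 3)))  # ['part-00000-zzz.parquet']
print(sorted(snapshot(commits, 2)))  # the three original part files
```

Version 2 still resolves to the old files, which is why they must stay on disk until VACUUM removes them.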

---

#### 🔟 What VACUUM does AFTER this

```sql
VACUUM delta_lab;
```

Then, once the removed files are older than the retention period (7 days by default), these are physically deleted:

```
❌ part-00000-aaa.parquet
❌ part-00001-bbb.parquet
❌ part-00002-ccc.parquet
```

Only this remains:

```
delta_lab/
├── _delta_log/
├── part-00000-zzz.parquet
```


#### 🔑 Ultra-clear mental model

```
DELETE / UPDATE → DV or rewrite
OPTIMIZE        → snapshot materialization
VACUUM          → physical cleanup
```

#### Interview-level one-liner

> **OPTIMIZE rewrites Delta data files based on the latest snapshot, compacting files and eliminating deleted rows and deletion vectors, while preserving history through logical remove entries in the transaction log.**



#Vacuum:
#### 1️⃣ State BEFORE VACUUM (post-OPTIMIZE)
##### Directory structure

```
delta_lab/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json   ← OPTIMIZE commit
│   └── 00000000000000000003.checkpoint.parquet
├── part-00000-aaa.parquet          ← old, logically removed
├── part-00001-bbb.parquet          ← old, logically removed
├── part-00002-ccc.parquet          ← old, logically removed
├── part-00000-zzz.parquet          ← ACTIVE
```
Important:
* Old files have `remove` entries in Delta log
* Files still exist physically
* Time travel still works


#### 2️⃣ VACUUM is triggered

```sql
VACUUM delta_lab;
```
Default:
```text
RETAIN 168 HOURS (7 days)
```

#### 3️⃣ What VACUUM actually READS

VACUUM reads **only metadata** (the transaction log plus a listing of the table directory), not the contents of the data files:

##### Reads:

* Latest Delta snapshot
* `add` actions → active files
* `remove` actions → deletion timestamps
* Retention policy

##### Builds two sets:

```
ACTIVE FILES    = { part-00000-zzz.parquet }
REMOVED FILES   = {
  part-00000-aaa.parquet,
  part-00001-bbb.parquet,
  part-00002-ccc.parquet
}
```

#### 4️⃣ Retention check (critical)

For each removed file:

```
current_time - deletionTimestamp >= retention
```

If TRUE → eligible for deletion
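
The retention check can be written out as a small Python function. This is a minimal sketch of the formula above, assuming `deletionTimestamp` is in epoch milliseconds as in the example log entry; the helper name `eligible_for_deletion` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def eligible_for_deletion(deletion_ts_ms, now, retention=timedelta(hours=168)):
    """True when the file was logically removed at least `retention` ago."""
    removed_at = datetime.fromtimestamp(deletion_ts_ms / 1000, tz=timezone.utc)
    return now - removed_at >= retention

deletion_ts_ms = 1700000000000  # deletionTimestamp from the OPTIMIZE commit
removed_at = datetime.fromtimestamp(deletion_ts_ms / 1000, tz=timezone.utc)

print(eligible_for_deletion(deletion_ts_ms, removed_at + timedelta(days=1)))  # False
print(eligible_for_deletion(deletion_ts_ms, removed_at + timedelta(days=8)))  # True
```

With the default 168 hours, a file removed one day ago is kept and a file removed eight days ago is deleted.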

#### 5️⃣ What VACUUM DELETES (physically)

##### Storage layer AFTER VACUUM

```
delta_lab/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.json
│   ├── 00000000000000000003.json
│   └── 00000000000000000003.checkpoint.parquet
├── part-00000-zzz.parquet          ← ACTIVE (kept)
```

❌ Old Parquet files are **gone**

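#### 6️⃣ What VACUUM does NOT touch (very important)

##### Delta logs:

* ❌ No existing JSON commit files are changed
* ❌ No new checkpoint
* ❌ No change to the table's data snapshot

##### Metadata:

* ❌ No add/remove entries
* ❌ No history rewrite

VACUUM never rewrites existing `_delta_log` entries. Depending on the runtime and configuration, it may append `VACUUM START` / `VACUUM END` audit commits that show up in `DESCRIBE HISTORY`, but these contain no file changes.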

#### 7️⃣ Why current queries still work

Because:
* Current snapshot references only:

  ```
  part-00000-zzz.parquet
  ```
* Spark never looks for deleted files

#### 8️⃣ What BREAKS after VACUUM

##### Time travel beyond retention:

```sql
SELECT * FROM delta_lab VERSION AS OF 2;
```

❌ Fails with:

```
FileNotFoundException
```

Because:

* Delta log references files
* Files no longer exist physically

---

#### 9️⃣ VACUUM with deletion vectors (DV case)

If DVs existed earlier:

* Old DV files referenced only by removed files become unreferenced
* VACUUM deletes those DV files too

But:

* DVs for active files (if any) are preserved

---

#### 🔟 Full lifecycle diagram

```
INSERT / UPDATE / DELETE
        ↓
Delta log add/remove
        ↓
OPTIMIZE
        ↓
New files + remove old files (logical)
        ↓
VACUUM
        ↓
Physical deletion of old files
```

---

#### 🔑 Ultra-important mental model

> **OPTIMIZE changes the logical snapshot.
> VACUUM changes only physical storage.**

They NEVER overlap responsibilities.

---

#### Interview-ready one-liner

> **VACUUM reads Delta transaction logs to identify obsolete files and physically deletes them from storage after the retention period, without modifying table metadata or history.**


In [0]:
VACUUM delta_lab;
--default retention is 168 hrs(7 days)
VACUUM delta_lab RETAIN 2 HOURS;
ALTER TABLE table_name SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '24 hours');
--If you attempt to run VACUUM with a retention period lower than 168 hours (7 days), Databricks will throw an error to prevent accidental data loss. To override this check:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM table_name RETAIN 1 HOURS;


--DRY RUN 
VACUUM table_name RETAIN 24 HOURS DRY RUN;



#ACID 
ACID Transactions
**Delta Lake supports ACID transactions under the hood via a transaction log.**
| ACID        | In Databricks         |
| ----------- | --------------------- |
| Atomicity   | Each statement is its own transaction and is all-or-nothing. A transaction is an indivisible unit: either all of its operations succeed or none take effect, so there are no partial updates. If one part fails, the entire transaction is rolled back.  |
| Consistency | Schema enforcement + constraints. A transaction must move the table from one valid state to another, maintaining all predefined rules, constraints, and integrity checks.  |
| Isolation   | There is no TCL (COMMIT/ROLLBACK); versions, time travel, and RESTORE are used instead. Concurrent transactions do not interfere with each other: each one behaves as if it were the only one operating on the data and never sees another transaction's intermediate or partial changes, which prevents dirty reads and inconsistent data. |
| Durability  | Every committed transaction is written to durable storage and recorded in the transaction log. Once a transaction is committed, its changes are permanent and survive any subsequent system failures.  |

##Atomicity and consistency

In [0]:
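--Atomicity:
--In a traditional RDBMS, everything from START TRANSACTION to COMMIT is one transaction. In Databricks, each statement is its own transaction and is auto-committed.
--eg:
INSERT INTO table_name VALUES (1,'a'),(2,'b');
--This single statement is one transaction and is auto-committed.


--Consistency:
--Define NOT NULL, defaults, and CHECK constraints for a table
--eg: 
CREATE TABLE abc (
id int not null,   --not null constraint
name string default 'unknown',   --default value; needs the column-defaults table feature
age int
)
using delta
tblproperties ('delta.feature.allowColumnDefaults' = 'supported');
--Delta CHECK constraints are added with ALTER TABLE, not inline in CREATE TABLE
alter table abc add constraint age_over_18 check (age > 18);   --age validation constraint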

##Isolation:
Classic problems that isolation prevents<br>
**Dirty read (reading garbage data):**<br>
Txn A: UPDATE orders SET amount = 1000 WHERE id = 1;
       (not committed yet)
<br>
Txn B: SELECT amount FROM orders WHERE id = 1;
       → sees 1000 ❌<br>
If Txn A fails → Txn B has read garbage.<br>
**With Delta:**
- Txn B reads the last committed snapshot
- Uncommitted files are invisible<br>

**Non-repeatable Read (data changes mid-query):**<br>
**Without isolation (bad)**<br>
Txn A: SELECT salary FROM emp WHERE id = 10; → 50000<br>
Txn B: UPDATE emp SET salary = 60000 WHERE id = 10; COMMIT;<br>
Txn A: SELECT salary FROM emp WHERE id = 10; → 60000 ❌<br>
Same query, different result.<br>
**With Delta isolation (good)**<br>
Txn A: SELECT salary FROM emp WHERE id = 10; → 50000<br>
Txn B: UPDATE emp SET salary = 60000 WHERE id = 10; COMMIT;<br>
Txn A: SELECT salary FROM emp WHERE id = 10; → 50000 ✅<br>
Why?<br>
Txn A keeps reading the same snapshot. Changes by Txn B are visible only after Txn A ends.

**Phantom Read (rows appear/disappear)**<br>
**Without isolation (bad)**<br>
Txn A: SELECT COUNT(*) FROM orders WHERE region = 'US'; → 10<br>
Txn B: INSERT INTO orders VALUES (..., 'US'); COMMIT;<br>
Txn A: SELECT COUNT(*) FROM orders WHERE region = 'US'; → 11 ❌<br>
**With Delta isolation (good)**<br>
Txn A: SELECT COUNT(*) FROM orders WHERE region = 'US'; → 10<br>
Txn B: INSERT INTO orders VALUES (..., 'US'); COMMIT;<br>
Txn A: SELECT COUNT(*) FROM orders WHERE region = 'US'; → 10 ✅<br>
**Delta snapshot isolation ensures:**<br>
No new rows “appear” mid-transaction<br>

**Lost Update (very important)**<br>
**Without isolation (bad)**<br>
Initial balance = 100<br>
Txn A: balance = balance - 30  → writes 70<br>
Txn B: balance = balance - 50  → writes 50<br>
Final result = 50 ❌<br>
One update is lost.<br>

**With Delta isolation (good)**<br>
Txn A commits first<br>
Txn B tries to commit → CONFLICT ❌<br>

**Delta detects:**
- Both transactions touched the same rows/files
- The second writer must retry
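
The conflict-and-retry behaviour can be sketched as a toy optimistic-concurrency model in Python. This is a deliberate simplification and an assumption on my part: the `Table` class is hypothetical, and it rejects any commit based on a stale version, whereas Delta checks whether the two writers actually touched overlapping files.

```python
class ConflictError(Exception):
    pass

class Table:
    def __init__(self, balance):
        self.version = 0
        self.balance = balance

    def begin(self):
        # A writer remembers the version (snapshot) it read from.
        return {"read_version": self.version, "balance": self.balance}

    def commit(self, txn, new_balance):
        # Reject the commit if another writer committed since the read.
        if txn["read_version"] != self.version:
            raise ConflictError("table changed since read; retry")
        self.balance = new_balance
        self.version += 1

table = Table(balance=100)
txn_a = table.begin()
txn_b = table.begin()

table.commit(txn_a, txn_a["balance"] - 30)      # A commits first, balance = 70
try:
    table.commit(txn_b, txn_b["balance"] - 50)  # B is based on a stale snapshot
except ConflictError:
    txn_b = table.begin()                       # retry on the latest snapshot
    table.commit(txn_b, txn_b["balance"] - 50)

print(table.balance)  # 20
```

Both updates are applied (100 - 30 - 50 = 20) instead of one silently overwriting the other.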

![image_1770642173766.png](./image_1770642173766.png "image_1770642173766.png")

##Durability

**Delta uses two durable layers:**<br>

- Immutable data files (Parquet)
- Transaction log (_delta_log)

Both are written to reliable storage (DBFS / cloud object storage).

#zorder
Z-ORDER is a data layout optimization technique in Delta Lake that:
- Rewrites data files
- Physically clusters related column values together
- Reduces the number of files and row groups scanned during queries

**Note: Z-ORDER is applied only when you run it manually with the `OPTIMIZE ... ZORDER BY` command.**

Imagine a Delta table with files like this:<br>
part-0001 → ids: 1–1M (random ids)<br>
part-0002 → ids: 1–1M (random ids)<br>
part-0003 → ids: 1–1M (random ids)<br>

What Spark does without Z-ORDER:<br>
SELECT * FROM orders WHERE customer_id = 101;
- ❌ Spark scans many files
- ❌ Poor data skipping
- ❌ High IO

**OPTIMIZE table_name ZORDER BY (col1, col2, ...);**

What Z-ORDER does (high level)
- Takes selected columns
- Applies Z-curve (Morton ordering)
- Rewrites Parquet files so similar values live together

part-0001 → customer_id: 1–1000<br>
part-0002 → customer_id: 1001–2000<br>
part-0003 → customer_id: 2001–3000<br>
Now Spark:
- Reads fewer files
- Uses min/max stats efficiently
- Skips unrelated data
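
Data skipping with min/max statistics can be illustrated with the two file layouts above. This is a minimal Python sketch: the `files_to_scan` helper and the statistics dictionaries are hypothetical stand-ins for the per-file stats kept in the Delta log.

```python
# Hypothetical per-file statistics: (min customer_id, max customer_id).
unclustered = {
    "part-0001": (1, 1_000_000),
    "part-0002": (1, 1_000_000),
    "part-0003": (1, 1_000_000),
}
zordered = {
    "part-0001": (1, 1000),
    "part-0002": (1001, 2000),
    "part-0003": (2001, 3000),
}

def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in stats.items() if lo <= value <= hi]

print(files_to_scan(unclustered, 101))  # ['part-0001', 'part-0002', 'part-0003']
print(files_to_scan(zordered, 101))     # ['part-0001']
```

When every file spans the whole id range, the statistics cannot rule anything out; after clustering, two of the three files are skipped for `customer_id = 101`.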

What happens during `OPTIMIZE ... ZORDER BY`:
- OPTIMIZE starts
- It takes the active Parquet files, ignores inactive files, and applies deletion vectors
- Z-values are computed for the given columns
- It writes larger files (≈ 1 GB by default) with physically clustered rows (the old small files still exist for now and are deleted later by VACUUM)
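
The idea behind a z-value can be shown with the textbook Morton code, which interleaves the bits of two integers. This is a sketch of the general technique only; how Delta actually derives z-values for arbitrary column types is not covered here, and the `interleave_bits` helper is hypothetical.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton (Z-order) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z

# Sort a small 4x4 grid of (x, y) points by their z-value.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Points that are close in both dimensions end up next to each other in the sort order, which is what lets rows with similar values in both columns land in the same file.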

##Can Z-ORDER handle data skew?
Z-ORDER is not a primary solution for data skew, but it can reduce the impact of skew in some read scenarios.

Let's break that down, starting with what “data skew” really means (important).

There are two very different skews people mix up:

**1. Query / shuffle skew:**
- One key → huge amount of data
- One task runs forever
- Others finish fast

**2. Storage / file-level skew:**
- Some files contain most of the relevant rows
- Others are rarely read

Z-ORDER only helps with #2, not #1.

Can Z-ORDER fix shuffle skew? ❌ No<br>

Example:<br>
SELECT customer_id, COUNT(*)
FROM orders
GROUP BY customer_id;

If:
customer_id = 1 → 60% of rows


Even after Z-ORDER:
- All those rows still hash to one reducer
- One task still does most of the work
- 👉 Skew remains

Where Z-ORDER does help

**✅ Filter skew (read-side skew)**<br>

Query:
SELECT *
FROM orders
WHERE customer_id = 1;

**Without Z-ORDER:**
- Data spread across many files
- Many tasks launched
- Lots of IO

**With Z-ORDER:**
- Rows for customer_id = 1 clustered
- Fewer files read
- Fewer tasks launched
- 👉 Less wasted work

**✅ Join-side scan skew (partial help)**<br>
SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id
WHERE c.region = 'EU';


Z-ORDER on customer_id:
- Orders for EU customers are in fewer files
- Spark scans less data before shuffle
- 👉 Shuffle skew still exists, but input size is smaller

What Z-ORDER does NOT do

- ❌ Does not split hot keys
- ❌ Does not rebalance partitions
- ❌ Does not change hash distribution
- ❌ Does not fix long-running tasks

Correct tools for data skew (this is key)
| Problem                   | Correct solution  |
| ------------------------- | ----------------- |
| Hot keys in joins         | Salting           |
| Large vs small table join | Broadcast join    |
| Skewed aggregations       | AQE skew handling |
| Write skew                | Repartition       |
| Small files               | OPTIMIZE          |
| Read locality             | Z-ORDER           |

Z-ORDER vs real skew solutions
| Tool        | Fixes skew? | How                      |
| ----------- | ----------- | ------------------------ |
| Z-ORDER     | ⚠️ Partial  | Reduces files scanned    |
| Repartition | ✅          | Redistributes data       |
| Salting     | ✅          | Breaks hot keys          |
| AQE         | ✅          | Splits skewed partitions |
| Broadcast   | ✅          | Removes shuffle          |
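
To see why salting breaks hot keys while Z-ORDER cannot, here is a toy Python model of shuffle partitioning. Everything in it is an assumption for illustration: the `partition_of` function is a stand-in for hash partitioning, and the row counts, `NUM_PARTITIONS`, and `SALT_BUCKETS` are made up. In a real join the other side must also be expanded with the same salt values.

```python
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4  # hypothetical: how many sub-keys each key is split into

# 60 rows for the hot key 1, plus 10 rows each for keys 2..5.
keys = [1] * 60 + [k for k in range(2, 6) for _ in range(10)]

def partition_of(key, salt=0):
    # Toy stand-in for hash partitioning of the shuffle key (key, salt).
    return (key + salt) % NUM_PARTITIONS

# Without salting: every row of the hot key lands in the same partition.
plain = Counter(partition_of(k) for k in keys)

# With salting: a rotating salt turns one hot key into SALT_BUCKETS sub-keys.
salted = Counter(partition_of(k, i % SALT_BUCKETS) for i, k in enumerate(keys))

print(max(plain.values()), max(salted.values()))  # 70 25
```

The largest partition shrinks from 70 rows to 25 because the hot key is spread over four shuffle keys; clustering files on disk does not change this distribution at all.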


Databricks themselves position Z-ORDER as:
**a read-optimization technique, not a skew-handling mechanism**

**Z-ORDER does not fix data skew, but it can reduce its impact by limiting how much skewed data is scanned during reads.**

#Liquid Clustering
Liquid Clustering is a dynamic, self-managing clustering mechanism for Delta tables that:
- Continuously reorganizes data
- Adapts automatically as data changes
- Eliminates the need for manual OPTIMIZE ZORDER

Think of it as: “Z-ORDER that keeps fixing itself over time.”

**Z-ORDER problems:**
- Static (needs manual runs)
- Expensive full rewrites
- Degrades as new data arrives
- Needs careful column selection

Liquid Clustering fixes all of that.

**How Liquid Clustering works (internals)**
- **You define clustering columns (once)**<br>
CREATE TABLE orders (<br>
  customer_id BIGINT,<br>
  cust_name STRING,<br>
  order_date DATE<br>
)<br>
USING DELTA<br>
CLUSTER BY (customer_id, order_date);<br>
No partitions required.

- **Data is written normally**
  - Streaming or batch
  - Inserts, updates, deletes
  - Deletion vectors supported
  - No immediate reordering

- **Databricks monitors clustering quality**
  - Internally tracks:
    - File overlap
    - Range dispersion
    - Query access patterns

- **Incremental re-clustering happens**
  - During: `OPTIMIZE orders;`
  - Only badly clustered files are rewritten
  - Not the entire table
  - Small, incremental rewrites

- **Query-time benefits**
  - Strong data skipping
  - Fewer files scanned
  - Stable performance over time

#Liquid Clustering vs Z-ORDER
| Aspect                  | Z-ORDER        | Liquid Clustering |
| ----------------------- | -------------- | ----------------- |
| Configuration           | Manual per run | Defined once      |
| Rewrite scope           | Full optimize  | Incremental       |
| Handles frequent writes | ❌ Poorly      | ✅ Excellent      |
| Streaming-friendly      | ❌             | ✅                |
| Maintenance cost        | High           | Low               |
| Column count            | 1–4            | 1–4               |
| Adapts over time        | ❌             | ✅                |


#Partition By

Note: This works exactly like Hive partitioning: one folder is created per partition value, and the part files are written inside it.

CREATE OR REPLACE TABLE customer_txn_part1 ( <br>
    txn_id INT, <br>
    customer_id INT,<br>
    txn_amount DOUBLE,<br>
    transaction_date DATE<br>
) <br>
using delta<br>
partitioned by (transaction_date);<br>
insert into customer_txn_part1 select * from customer_txn;<br>

- show partitions customer_txn_part1
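
The folder-per-value layout can be mimicked in Python. This is a sketch of the Hive-style naming convention (`column=value`) only; the sample rows and the `partition_dir` helper are hypothetical, and some table configurations may lay files out differently.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (txn_id, customer_id, txn_amount, transaction_date).
rows = [
    (1, 101, 25.0, date(2024, 1, 1)),
    (2, 102, 40.0, date(2024, 1, 1)),
    (3, 101, 15.5, date(2024, 1, 2)),
]

def partition_dir(txn_date):
    """Hive-style directory name: <column>=<value>."""
    return f"transaction_date={txn_date.isoformat()}"

layout = defaultdict(list)
for row in rows:
    # The partition column is encoded in the folder name, not in the file.
    layout[partition_dir(row[3])].append(row[:3])

for folder, contents in sorted(layout.items()):
    print(folder, len(contents))
# transaction_date=2024-01-01 2
# transaction_date=2024-01-02 1
```

A filter on `transaction_date` then only needs to open the matching folder.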

#CTAS VS DEEP CLONE VS SHALLOW CLONE

| Aspect                         | **CTAS**<br>(Create Table As Select)   | **Deep Clone**                      | **Shallow Clone**                  |
| ------------------------------ | -------------------------------------- | ----------------------------------- | ---------------------------------- |
| Purpose                        | Create a new table from a query result | Full physical copy of a Delta table | Logical copy referencing same data |
| Copies data files              | ✅ Yes (new Parquet files)             | ✅ Yes (full data copy)             | ❌ No                              |
| Copies metadata                | ❌ Partial (schema only from SELECT)   | ✅ Yes                              | ✅ Yes                             |
| Copies table history           | ❌ No                                  | ❌ No                               | ❌ No                              |
| Copies constraints             | ❌ No                                  | ✅ Yes                              | ✅ Yes                             |
| Copies table properties        | ❌ No                                  | ✅ Yes                              | ✅ Yes                             |
| Copies Z-ORDER / clustering    | ❌ No                                  | ✅ Yes                              | ✅ Yes                             |
| Storage usage                  | High                                   | Very high                           | Very low                           |
| Performance after creation     | Depends on SELECT                      | Same as source                      | Same as source                     |
| Data independence              | Fully independent                      | Fully independent                   | ❌ Not independent                 |
| Underlying files shared        | ❌ No                                  | ❌ No                               | ✅ Yes                             |
| Time to create                 | Medium–Slow                            | Slow                                | Very fast                          |
| Incremental sync possible      | ❌ No                                  | ✅ Yes (re-run the clone to copy only changes) | ❌ No                   |
| Supports Unity Catalog         | ✅ Yes                                 | ✅ Yes                              | ✅ Yes                             |
| Supports Time Travel           | ❌ Fresh table only                    | ✅ From clone creation              | ✅ From clone creation             |
| Affected if source VACUUM runs | ❌ No                                  | ❌ No                               | ✅ Yes                             |
| Best for                       | Transformations, aggregations          | Backup, migration                   | Dev/Test, experiments              |


##CTAS
- Query runs
- New Parquet files are written
- New Delta log starts at version 0
- No relationship to source table<br>
What is NOT copied:
- ❌ History
- ❌ Constraints
- ❌ Table properties
- ❌ Z-ORDER / clustering<br>
**Syntax:**<br>
CREATE TABLE sales_ctas<br>
AS<br>
SELECT * FROM sales WHERE region = 'EU';


##Deep Clone:
What actually happens<br>
- All active data files are copied
- Metadata is copied
- New Delta log is created
- Source and target are fully independent

What is NOT copied<br>
- ❌ Transaction history
- ❌ Old versions

Storage impact<br>
- 💸 Doubles storage immediately

Syntax:
CREATE TABLE sales_deep_clone
CLONE sales;


##Shallow clone:
What actually happens
- No data files copied
- Delta log references source files
- Reads are redirected to original files

Source table files<br>
        ▲<br>
        │<br>
Shallow clone metadata


Very important behavior:

Source table UPDATE/DELETE:
- Does NOT affect the clone (new files are written; the clone keeps referencing the old ones)

VACUUM on source:
- ❌ Can break the clone if retention is mismanaged

Storage impact
- 💰 Almost free

Syntax:
CREATE TABLE sales_shallow_clone
SHALLOW CLONE sales;
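
Why VACUUM on the source can break a shallow clone is easy to see in a toy Python model. This is a sketch of the behaviour described above under simplifying assumptions: file sets stand in for Delta logs, the retention period is ignored, and the `readable` helper and file names are hypothetical.

```python
# Files physically present in the source table's storage location.
storage = {"part-00000-aaa.parquet", "part-00001-bbb.parquet"}

# The shallow clone's log only references the source files; nothing is copied.
source_active = set(storage)
shallow_clone_active = set(storage)

# The source is updated: aaa is rewritten as ddd and logically removed.
storage.add("part-00002-ddd.parquet")
source_active = {"part-00001-bbb.parquet", "part-00002-ddd.parquet"}

def readable(active_files, storage):
    """A snapshot is readable only if every referenced file still exists."""
    return active_files <= storage

print(readable(shallow_clone_active, storage))  # True: old file still on disk

# VACUUM on the source deletes files its own log no longer references.
storage.intersection_update(source_active)

print(readable(source_active, storage))         # True
print(readable(shallow_clone_active, storage))  # False: the clone lost aaa
```

The update alone leaves the clone intact; it is the physical deletion that removes a file the clone still depends on.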

In [0]:
drop table if exists base_table;
CREATE TABLE  if not exists base_table(
  id int,
  name string,
  dept string
)
using delta
PARTITIONED BY (dept);
insert  into  base_table values(1,'a','maths'),(2,'b','science'),(3,'c','maths');
update base_table set dept='physics' where id=1;

In [0]:

drop table if exists base_table_ctas;
CREATE TABLE IF NOT EXISTS base_table_ctas as select * from  base_table;

In [0]:
DESCRIBE HISTORY base_table_ctas;
--no partition/cluster columns were preserved
--no history/version was preserved
--CTAS just queried the table, wrote the results into new files, and started the metadata from version 0

In [0]:
CREATE TABLE IF NOT EXISTS 
base_table_deep_clone clone base_table;
DESCRIBE HISTORY base_table_deep_clone;
--partition/cluster columns were preserved
--Deep Clone preserves data state, not data history.
--History/version reset to 0
--Deep clone copies all active files of the current base table snapshot and leaves out the deletion vectors and inactive files

In [0]:
drop table if exists base_table_shallow_clone;
CREATE TABLE base_table_shallow_clone SHALLOW CLONE base_table;
DESCRIBE HISTORY base_table_shallow_clone;
--partition/cluster columns were preserved
select * from base_table_shallow_clone version as of 1;