# Requirements

This notebook uses the [Evcxr Jupyter Kernel](https://github.com/evcxr/evcxr/tree/main/evcxr_jupyter) to run Rust code in a Jupyter notebook. Follow the instructions in the [Evcxr Jupyter Kernel](https://github.com/evcxr/evcxr/tree/main/evcxr_jupyter) repository to set up the environment, or use the Dockerfile included here (this may be your only option on macOS with Apple silicon because of linking errors):
```sh
docker build -t delta-rs .
docker run -it --rm  -p 8888:8888 --name delta-rs -v $PWD/notebooks/delta-rs:/usr/src/delta-rs delta-rs
```

The `Evcxr Jupyter Kernel` seems to have some limitations when it comes to running async code (some calls work, some don't), so this notebook uses `tokio::runtime::Runtime::block_on` to handle most of the async calls. This should not be necessary if you run the same code directly (not through Jupyter).
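
To make the idea of `block_on` concrete, here is a minimal, std-only sketch of a function that drives a future to completion on the current thread. The names `simple_block_on` and `ThreadWaker` are made up for this illustration. This is not how tokio implements `Runtime::block_on`, and it cannot drive the I/O futures returned by `deltalake`, which need the tokio runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Wakes the blocked thread when the future signals it can make progress.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Poll `fut` on the current thread until it is ready (illustration only).
fn simple_block_on<F: Future>(fut: F) -> F::Output {
    // Pin the future on the stack so it can be polled.
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            // Sleep until the waker unparks this thread, then poll again.
            Poll::Pending => thread::park(),
        }
    }
}
```

Calling `simple_block_on(async { 40 + 2 })` from synchronous code returns the value produced by the async block, which is the same sync-to-async bridge the cells below rely on.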


# Introduction to delta-rs

This notebook introduces you to the key features of Delta Lake via the delta-rs library.

`delta-rs` allows you to work with Delta Lake without a Spark runtime.

Once you work through this notebook, you'll have a better understanding of the features that make Delta Lake powerful. It's a relatively quick guide and should be eye-opening! Let's dive in!

We'll start by installing dependencies and importing the required types.

In [3]:
:dep deltalake = { version = "0.16.2", features = ["arrow", "datafusion"]}
:dep tokio = { version = "1", features = ["full"] }
:dep serde = { version = "1", features = ["derive"] }
:dep serde_json = "1"
:dep chrono = {version = "0.4.31" }

In [4]:
use std::sync::Arc;

use deltalake::arrow::record_batch::RecordBatch;
use deltalake::arrow::datatypes::{Field, DataType, Schema as ArrowSchema};
use deltalake::arrow::array::{Int32Array, StringArray};
use deltalake::arrow::util::pretty::print_batches;

use deltalake::DeltaOps;
use deltalake::DeltaTable;
use deltalake::operations::collect_sendable_stream;
use deltalake::writer::{DeltaWriter, RecordBatchWriter, JsonWriter};
use deltalake::schema::{Schema, SchemaField, SchemaDataType};
use deltalake::datafusion::logical_expr::{col, lit};

## Create a Delta Lake

Let's create a new Delta Lake table with some data.

In [5]:
let TABLE_URI = "delta-table".to_string();
let rt = tokio::runtime::Runtime::new().unwrap();

In [10]:
let schema = Arc::new(ArrowSchema::new(
    vec![
        Field::new("num", DataType::Int32, false),
        Field::new("letter", DataType::Utf8, true),
    ]
));

let batch = RecordBatch::try_new(
    schema, 
    vec![
        Arc::new(Int32Array::from(vec![1, 2, 3])),
        Arc::new(StringArray::from(vec!["a", "b", "c"])),
    ]
)?;
rt.block_on(async {
    let ops = DeltaOps::try_from_uri(&TABLE_URI).await.unwrap();
    // Unwrap so that a failed write is reported instead of being silently dropped
    ops.write(vec![batch]).await.unwrap();
});

You can inspect the contents of the `delta-table` folder to begin understanding how Delta Lake works. Here's what the folder will contain:
```
delta-table/
    _delta_log/
        00000000000000000000.json
    part-00001-b0d48b69-28f8-4301-9876-e6b7e3af9db7-c000.snappy.parquet
```

`delta-table` contains a `_delta_log` folder, which is often referred to as the "transaction log". The transaction log tracks the files that have been added to and removed from the Delta Lake, along with other metadata.

The Parquet file contains the actual data that was written to the Delta Lake.

You don't need to have a detailed understanding of how the transaction log works. A high level conceptual grasp is all you need to understand how Delta Lake provides you with useful data management features.
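
As a small illustration of how the log is laid out, the sketch below builds and parses commit file names like the `00000000000000000000.json` entry shown above (the version number zero-padded to 20 digits). The helper names are hypothetical and are not part of `deltalake`. Real logs can also contain other files, such as checkpoints, which this sketch simply ignores.

```rust
/// File name of the commit for `version`: the version zero-padded to 20 digits.
fn commit_file_name(version: u64) -> String {
    format!("{:020}.json", version)
}

/// Parse a commit file name back into its version; `None` for any other file.
fn parse_commit_version(file_name: &str) -> Option<u64> {
    let stem = file_name.strip_suffix(".json")?;
    if stem.len() == 20 && stem.bytes().all(|b| b.is_ascii_digit()) {
        stem.parse::<u64>().ok()
    } else {
        None
    }
}

/// The latest table version is the highest-numbered commit file in the log.
fn latest_version(file_names: &[&str]) -> Option<u64> {
    file_names
        .iter()
        .filter_map(|name| parse_commit_version(name))
        .max()
}
```

With only `00000000000000000000.json` in the log, `latest_version` yields `Some(0)`, matching the "version 0" reported after the first write.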

## Read a Delta Lake

Let's read and print out the table contents. We are using `datafusion` for this and need to import some supporting functions.

In [11]:
fn show_table_data(rt: &tokio::runtime::Runtime, table_name: &str) {
    let (table, data) = rt.block_on(async {
        let ops = DeltaOps::try_from_uri(table_name).await.unwrap();
        let (table, stream) = ops.load().await.unwrap();
        let data: Vec<RecordBatch> = collect_sendable_stream(stream).await.unwrap();
        (table, data)
    });
    println!("----------------\nTable version: {}", table.version());
    deltalake::arrow::util::pretty::print_batches(&data);
}
show_table_data(&rt, &TABLE_URI);

----------------
Table version: 0
+-----+--------+
| num | letter |
+-----+--------+
| 1   | a      |
| 2   | b      |
| 3   | c      |
+-----+--------+


After the first data insert, the Delta Lake is at "version 0". Let's add some more data to the Delta Lake and see how the version gets updated after another write transaction is performed.

## Insert more data into Delta Lake

Add more data to this table. Let's use a `JsonWriter` this time. We ignore the schema information declared earlier and derive it from the table (though we still use that knowledge to create the actual data).

In [12]:
fn write_json(rt: &tokio::runtime::Runtime, table_name: &str, json_data: &str) {
    rt.block_on(async {
        // Open the table
        let mut table = deltalake::open_table(table_name).await.unwrap();
    
        // Create a writer
        let mut writer = JsonWriter::for_table(&table).unwrap();

        // Write and commit the changes
        writer.write(
            json_data
                .lines()
                .map(|line| serde_json::from_str(line).unwrap())
                .collect(),
        ).await.unwrap();
        writer.flush_and_commit(&mut table).await.unwrap();
    });
}

let json_data = "{\"num\" : 77, \"letter\": \"x\"}\n\
                 {\"num\" : 88, \"letter\": \"y\"}\n\
                 {\"num\" : 99, \"letter\": \"z\"}";
write_json(&rt, &TABLE_URI, json_data);

show_table_data(&rt, &TABLE_URI);

----------------
Table version: 1
+-----+--------+
| num | letter |
+-----+--------+
| 1   | a      |
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
| 99  | z      |
+-----+--------+


Notice how the table version has changed from 0 to 1.

## Time travel to previous version of data
Let's travel back in time and inspect the content of the Delta Lake at "version 0".

In [13]:
fn show_table_version_data(rt: &tokio::runtime::Runtime, table_name: &str, version: i64) -> DeltaTable {
    rt.block_on(async {
        let mut table = deltalake::open_table(table_name).await.unwrap();
        table.load_version(version).await.unwrap();
        let (_table, stream) = DeltaOps(table).load().await.unwrap();
        let data: Vec<RecordBatch> = collect_sendable_stream(stream).await.unwrap();
        println!("----------------\nTable version: {}", &_table.version());
        deltalake::arrow::util::pretty::print_batches(&data);
        _table
    })
}
show_table_version_data(&rt, &TABLE_URI, 0);

----------------
Table version: 0
+-----+--------+
| num | letter |
+-----+--------+
| 1   | a      |
| 2   | b      |
| 3   | c      |
+-----+--------+


Wow! That's cool!

We performed two write transactions and were able to travel back in time and view the contents of the Delta Lake before the second write transaction was performed. This is an incredibly powerful and useful feature.

Delta Lake gives you time travel for free!
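
Conceptually, time travel works because a version is just the result of replaying the transaction log up to that commit. Here is a minimal sketch under simplifying assumptions: the `Action` enum, the file names, and the third "rewrite" commit are all hypothetical, and real log entries carry much more metadata.

```rust
use std::collections::BTreeSet;

/// Simplified log action: a data file is added to or removed from the table.
enum Action {
    Add(&'static str),
    Remove(&'static str),
}

/// Replay commits 0..=version and return the data files that make up that version.
fn files_at_version(commits: &[Vec<Action>], version: usize) -> BTreeSet<&'static str> {
    let mut files = BTreeSet::new();
    for commit in commits.iter().take(version + 1) {
        for action in commit {
            match action {
                Action::Add(path) => {
                    files.insert(*path);
                }
                Action::Remove(path) => {
                    files.remove(path);
                }
            }
        }
    }
    files
}

/// Hypothetical log: two appends followed by a rewrite into a single new file.
fn example_log() -> Vec<Vec<Action>> {
    vec![
        vec![Action::Add("a.parquet")],
        vec![Action::Add("b.parquet")],
        vec![
            Action::Remove("a.parquet"),
            Action::Remove("b.parquet"),
            Action::Add("c.parquet"),
        ],
    ]
}
```

Replaying only the first commit gives the files of version 0, even though later commits exist. The old files are still on disk, so they can still be read.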

## Schema enforcement

Schema enforcement is enabled by default. Here we try to append a record that does not contain the required `num` field, which results in an error.

**Note:** when no schema is specified explicitly, the write may still succeed, with nulls filled in for missing column values, provided those columns are nullable in the current table.

In [14]:
let bad_data = "{\"name\": [\"bob\", \"denise\"], \"age\": [64, 43]}";
write_json(&rt, &TABLE_URI, bad_data);

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Generic("Failed to convert into Arrow schema: Json error: Encountered unmasked nulls in non-nullable StructArray child: Field { name: \"num\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }")', src/lib.rs:30:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
   1: core::panicking::panic_fmt
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/result.rs:1651:5
   3: tokio::runtime::park::CachedParkThread::block_on
   4: tokio::runtime::context::runtime::enter_runtime
   5: tokio::runtime::runtime::Runtime::block_on
   6: run_user_code_12
   7: evcxr::runtime::Runtime::run_loop
   8: evcxr::runtime::runtime_hook
   9: evcxr_

**Note:** because we are not explicitly specifying a schema, a write operation using `JsonWriter` may still succeed if the data is compatible with the existing schema. For example, the following write still works, but the additional fields are ignored and the `letter` value of the new record is `null`.

In [15]:
let bad_data = "{\"num\": 4, \"name\": \"bob\", \"age\": 64}";
write_json(&rt, &TABLE_URI, bad_data);
show_table_data(&rt, &TABLE_URI);

----------------
Table version: 2
+-----+--------+
| num | letter |
+-----+--------+
| 1   | a      |
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
| 99  | z      |
| 4   |        |
+-----+--------+
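
The two outcomes above (a rejected record, and an accepted record with a null `letter`) can be mimicked with a small sketch. It is only a model of the observed behavior, not `JsonWriter`'s implementation. The `Column` type and the `conform` function are hypothetical, and values are kept as strings for simplicity.

```rust
use std::collections::HashMap;

/// A column in a simplified table schema.
struct Column {
    name: &'static str,
    nullable: bool,
}

/// Project a record onto the schema: extra fields are ignored, missing nullable
/// columns become `None`, and a missing non-nullable column is an error.
fn conform(
    schema: &[Column],
    record: &HashMap<&str, &str>,
) -> Result<Vec<Option<String>>, String> {
    schema
        .iter()
        .map(|col| match record.get(col.name) {
            Some(value) => Ok(Some(value.to_string())),
            None if col.nullable => Ok(None),
            None => Err(format!(
                "missing value for non-nullable column `{}`",
                col.name
            )),
        })
        .collect()
}

/// Schema of the notebook's table: `num` is required, `letter` is nullable.
fn table_schema() -> Vec<Column> {
    vec![
        Column { name: "num", nullable: false },
        Column { name: "letter", nullable: true },
    ]
}
```

A record with `num`, `name` and `age` conforms with `letter` set to `None`, while a record without `num` is rejected.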


## Delete rows

This section demonstrates how you can delete rows of data from the Delta Lake.

In [16]:
rt.block_on(async {
    let mut table = deltalake::open_table(&TABLE_URI).await.unwrap();
    let (table, _) = DeltaOps(table)
        .delete()
        .with_predicate(col("num").lt(lit(2)).or(col("num").gt(lit(88))).or(col("letter").is_null()))
        .await
        .unwrap();
}); 
show_table_data(&rt, &TABLE_URI);

----------------
Table version: 3
+-----+--------+
| num | letter |
+-----+--------+
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
+-----+--------+
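
To double-check which rows the predicate selects, here is the same condition (`num < 2 OR num > 88 OR letter IS NULL`) written as a plain Rust filter over the rows of version 2. The `Row` type and helper functions are hypothetical stand-ins for the table. No `deltalake` or `datafusion` APIs are involved.

```rust
/// A row of the example table; `letter` is nullable.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    num: i32,
    letter: Option<&'static str>,
}

/// Same condition as the delete predicate: num < 2 OR num > 88 OR letter IS NULL.
fn matches_delete_predicate(row: &Row) -> bool {
    row.num < 2 || row.num > 88 || row.letter.is_none()
}

/// Keep only the rows that do NOT match the predicate.
fn delete_matching(rows: Vec<Row>) -> Vec<Row> {
    rows.into_iter()
        .filter(|row| !matches_delete_predicate(row))
        .collect()
}

/// The table contents at version 2, as printed earlier in the notebook.
fn version_2_rows() -> Vec<Row> {
    vec![
        Row { num: 1, letter: Some("a") },
        Row { num: 2, letter: Some("b") },
        Row { num: 3, letter: Some("c") },
        Row { num: 77, letter: Some("x") },
        Row { num: 88, letter: Some("y") },
        Row { num: 99, letter: Some("z") },
        Row { num: 4, letter: None },
    ]
}
```

`delete_matching(version_2_rows())` leaves the rows with `num` 2, 3, 77 and 88, which matches the table printed for version 3.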


## Vacuum old data files

Delta Lake doesn't delete stale files from disk by default. The delete transaction we just performed rewrote the remaining rows into a new file, so all the data for the latest version of the Delta Lake is in that one file. When we read the latest version of the Delta Lake, it'll read just the new file. Let's take a look.

In [17]:
fn show_table_files(table: &DeltaTable) {
    let files: Vec<_> = table.get_files();
    let mapped: Vec<&str> = files.iter().map(|p| p.filename().unwrap()).collect();
    println!("----------------\nFiles:\n{}", mapped.join("\n"));
}
fn show_files(rt: &tokio::runtime::Runtime, table_name: &str) {
    rt.block_on(async {
        let table = deltalake::open_table(table_name).await.unwrap();
        show_table_files(&table);
    });
}
show_files(&rt, &TABLE_URI);
show_table_data(&rt, &TABLE_URI);

----------------
Files:
part-00001-a573ca35-c283-4a17-997b-dcbc84e0bda9-c000.snappy.parquet
----------------
Table version: 3
+-----+--------+
| num | letter |
+-----+--------+
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
+-----+--------+


We have several Parquet files on disk, but only a subset of them are used by the current version of the Delta Lake. Let's take a look at all the Parquet files currently in the Delta Lake.

In [18]:
fn show_files_in_table_dir(table_uri: &str) {
    let to_list = format!("./{}", table_uri);
    let paths = std::fs::read_dir(to_list).unwrap();
    println!("All files");
    for path in paths {
        let p = path.unwrap().path();
        if p.is_file() {
            println!("{:?}", p.file_name().unwrap());
        }
    }
}
show_files_in_table_dir(&TABLE_URI);

All files
"part-00001-a573ca35-c283-4a17-997b-dcbc84e0bda9-c000.snappy.parquet"
"part-00000-d6cbb34c-f3d1-432c-952d-5fa1eb83b91a-c000.snappy.parquet"
"part-00001-94474cb5-27f8-43db-8570-7ed4030f6e64-c000.snappy.parquet"
"part-00000-a4dcafbf-a520-43d2-a0c8-6ea38bec26ff-c000.snappy.parquet"


The "stale" Parquet files are what allow for time travel. Let's time travel back to "version 1" of the Delta Lake.

In [19]:
show_table_files(&show_table_version_data(&rt, &TABLE_URI, 1));

----------------
Table version: 1
+-----+--------+
| num | letter |
+-----+--------+
| 1   | a      |
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
| 99  | z      |
+-----+--------+
----------------
Files:
part-00001-94474cb5-27f8-43db-8570-7ed4030f6e64-c000.snappy.parquet
part-00000-a4dcafbf-a520-43d2-a0c8-6ea38bec26ff-c000.snappy.parquet


When we time travel back to version 1, we're reading entirely different files than when we read the latest version of the Delta Lake.

The legacy files are what allow you to time travel.

If you don't need to time travel, you can delete the legacy files with the `vacuum()` operation.

Let's start with running vacuum in "dry run" mode.

In [20]:
fn run_vacuum(rt: &tokio::runtime::Runtime, table_name: &str, dry_run: bool) {
    rt.block_on(async {
        let mut table = deltalake::open_table(table_name).await.unwrap();
        let (_table, metrics) = DeltaOps(table)
            .vacuum()
            .with_retention_period(chrono::Duration::zero())
            .with_enforce_retention_duration(false)
            .with_dry_run(dry_run)
            .await
            .unwrap();
        println!("{:?}", metrics);
    })
}   
run_vacuum(&rt, &TABLE_URI, true);

VacuumMetrics { dry_run: true, files_deleted: ["part-00000-d6cbb34c-f3d1-432c-952d-5fa1eb83b91a-c000.snappy.parquet", "part-00001-94474cb5-27f8-43db-8570-7ed4030f6e64-c000.snappy.parquet", "part-00000-a4dcafbf-a520-43d2-a0c8-6ea38bec26ff-c000.snappy.parquet"] }


The files aren't actually deleted when the code is executed in dry run mode.

In [21]:
show_files_in_table_dir(&TABLE_URI);

All files
"part-00001-a573ca35-c283-4a17-997b-dcbc84e0bda9-c000.snappy.parquet"
"part-00000-d6cbb34c-f3d1-432c-952d-5fa1eb83b91a-c000.snappy.parquet"
"part-00001-94474cb5-27f8-43db-8570-7ed4030f6e64-c000.snappy.parquet"
"part-00000-a4dcafbf-a520-43d2-a0c8-6ea38bec26ff-c000.snappy.parquet"


Explicitly set `dry_run` to `false` to actually delete the files.

In [22]:
run_vacuum(&rt, &TABLE_URI, false);
show_files_in_table_dir(&TABLE_URI);
show_table_data(&rt, &TABLE_URI);

VacuumMetrics { dry_run: false, files_deleted: ["part-00000-d6cbb34c-f3d1-432c-952d-5fa1eb83b91a-c000.snappy.parquet", "part-00001-94474cb5-27f8-43db-8570-7ed4030f6e64-c000.snappy.parquet", "part-00000-a4dcafbf-a520-43d2-a0c8-6ea38bec26ff-c000.snappy.parquet"] }
All files
"part-00001-a573ca35-c283-4a17-997b-dcbc84e0bda9-c000.snappy.parquet"
----------------
Table version: 3
+-----+--------+
| num | letter |
+-----+--------+
| 2   | b      |
| 3   | c      |
| 77  | x      |
| 88  | y      |
+-----+--------+
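
Both vacuum runs above can be summarized in a small model: the files to delete are those on disk that the current version no longer references, and dry-run mode only reports them. This is a sketch over an in-memory file listing with hypothetical function names. The real operation also honors a retention period, which the notebook sets to zero.

```rust
use std::collections::BTreeSet;

/// Files on disk that the current table version no longer references.
fn stale_files(on_disk: &BTreeSet<String>, active: &BTreeSet<String>) -> Vec<String> {
    on_disk.difference(active).cloned().collect()
}

/// Simplified vacuum over an in-memory listing. Returns the files that were
/// deleted, or that would be deleted when `dry_run` is true.
fn vacuum(
    on_disk: &mut BTreeSet<String>,
    active: &BTreeSet<String>,
    dry_run: bool,
) -> Vec<String> {
    let stale = stale_files(on_disk, active);
    if !dry_run {
        for file in &stale {
            on_disk.remove(file);
        }
    }
    stale
}
```

In dry-run mode the listing is left untouched; with `dry_run` set to `false`, only the active files remain afterwards.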


## Cleanup

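Let's delete the Delta Lake now that we're done with this demo. **Be careful**: the last line removes the folder recursively, so make sure the printed path is the table folder and nothing else. If you are unsure, comment out the last line, run the cell to check the path, and restore the line once you are sure it deletes the right thing.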

In [23]:
let to_delete = format!("./{}", &TABLE_URI);
println!("This will delete the following folder: {}", to_delete);
std::fs::remove_dir_all(to_delete)?;

This will delete the following folder: ./delta-table
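
If you want an extra safety net before deleting, one option is to refuse to remove a directory that does not look like a Delta table. The sketch below uses a simple heuristic (the directory must contain a `_delta_log` folder). The function names are hypothetical, and this check is not something `deltalake` provides.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Heuristic: a directory looks like a Delta table if it contains `_delta_log`.
fn looks_like_delta_table(dir: &Path) -> bool {
    dir.join("_delta_log").is_dir()
}

/// Recursively delete `dir`, but only if it looks like a Delta table.
fn remove_table_dir(dir: &Path) -> io::Result<()> {
    if !looks_like_delta_table(dir) {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            format!("{} does not contain a _delta_log folder", dir.display()),
        ));
    }
    fs::remove_dir_all(dir)
}
```

In the notebook, `remove_table_dir(Path::new(&to_delete))?;` could stand in for the direct `remove_dir_all` call.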
