<img src="./images/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/>  

## lakeFS ❤️ Apache Iceberg

# Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFODNN7EXAMPLE'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

# Setup

**(You shouldn't need to change anything in this section; just run it.)**

In [3]:
repo_name = "lakefs-iceberg-nyc"

### Create lakeFSClient

In [4]:
import lakefs_client
from lakefs_client.models import *
from lakefs_client.client import LakeFSClient

# lakeFS credentials and endpoint
configuration = lakefs_client.Configuration()
configuration.username = lakefsAccessKey
configuration.password = lakefsSecretKey
configuration.host = lakefsEndPoint

lakefs = LakeFSClient(configuration)

#### Verify lakeFS credentials by getting lakeFS version

In [5]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.config.get_lake_fs_version()
except Exception as e:
    # Report the underlying cause instead of swallowing it with a bare except
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v.version}")

Verifying lakeFS credentials…
…✅lakeFS credentials verified

ℹ️lakeFS version 0.104.0


### Define lakeFS Repository

In [6]:
import os  # needed for os._exit below

from lakefs_client.exceptions import NotFoundException

try:
    # Reuse the repository if it already exists
    repo = lakefs.repositories.get_repository(repo_name)
    print(f"Found existing repo {repo.id} using storage namespace {repo.storage_namespace}")
except NotFoundException:
    print(f"Repository {repo_name} does not exist, so going to try and create it now.")
    try:
        repo = lakefs.repositories.create_repository(
            repository_creation=RepositoryCreation(name=repo_name,
                                                   storage_namespace=f"{storageNamespace}/{repo_name}"))
        print(f"Created new repo {repo.id} using storage namespace {repo.storage_namespace}")
    except lakefs_client.ApiException as e:
        print(f"Error creating repo {repo_name}. Error is {e}")
        os._exit(1)  # terminate the kernel process with a non-zero status
except lakefs_client.ApiException as e:
    print(f"Error getting repo {repo_name}: {e}")
    os._exit(1)  # terminate the kernel process with a non-zero status

Repository lakefs-iceberg-nyc does not exist, so going to try and create it now.
Created new repo lakefs-iceberg-nyc using storage namespace s3://example/lakefs-iceberg-nyc
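
The cell above builds the repository's storage namespace by joining `storageNamespace` and `repo_name` with a slash. As a minimal sketch, assuming a hypothetical helper named `build_storage_namespace` (it is not part of `lakefs_client`), the same join can be made tolerant of a trailing slash in the configured value:

```python
def build_storage_namespace(base: str, repo: str) -> str:
    """Join a base storage URI and a repository name with exactly one slash.

    Hypothetical helper for illustration; not part of lakefs_client.
    """
    if "://" not in base:
        raise ValueError(f"expected a URI such as 's3://bucket', got {base!r}")
    # Drop any trailing slashes so the result never contains '//' after the bucket
    return f"{base.rstrip('/')}/{repo}"


print(build_storage_namespace("s3://example", "lakefs-iceberg-nyc"))
print(build_storage_namespace("s3://example/", "lakefs-iceberg-nyc"))
```

Both calls print `s3://example/lakefs-iceberg-nyc`, which matches the namespace reported in the output above.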


### Set up Spark

In [7]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Iceberg / Jupyter") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,io.lakefs:lakefs-iceberg:0.0.1") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", lakefsEndPoint) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.access.key", lakefsAccessKey) \
        .config("spark.hadoop.fs.s3a.secret.key", lakefsSecretKey) \
        .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog") \
        .config("spark.sql.catalog.lakefs.warehouse", f"lakefs://{repo_name}") \
        .config("spark.sql.catalog.lakefs.uri", lakefsEndPoint) \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark
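
The Spark session above mixes three groups of settings: the packages to load, the S3A settings that point Hadoop at the lakeFS endpoint, and the Iceberg catalog named `lakefs`. As a rough sketch, the same values can be collected in a plain dictionary so they are easy to inspect before a session is created. The function name `lakefs_iceberg_conf` and its `catalog` parameter are assumptions made for illustration; the keys and values are copied from the cell above.

```python
def lakefs_iceberg_conf(endpoint, access_key, secret_key, repo, catalog="lakefs"):
    """Return the Spark settings from the cell above as a dict (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg runtime and lakeFS catalog packages
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.0,"
                               "io.lakefs:lakefs-iceberg:0.0.1",
        # S3A settings: route s3:// paths through the lakeFS endpoint
        "spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Iceberg catalog backed by lakeFS
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "io.lakefs.iceberg.LakeFSCatalog",
        f"{prefix}.warehouse": f"lakefs://{repo}",
        f"{prefix}.uri": endpoint,
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    }


conf = lakefs_iceberg_conf("http://lakefs:8000", "<access-key>", "<secret-key>", "lakefs-iceberg-nyc")
for key in sorted(conf):
    print(f"{key} = {conf[key]}")

# Applying the dict to a builder would look like this (requires pyspark):
#   builder = SparkSession.builder.appName("Iceberg / Jupyter")
#   for key, value in conf.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
```

Keeping the settings in one place makes it simpler to compare them against a different environment, for example when the endpoint or repository name changes.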

---

---

# Main demo starts here 🚦 👇🏻

# Load some Data

For this demo, we will use the [New York City Film Permits dataset](https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p) available as part of the NYC Open Data initiative. We're using a locally saved copy of a 1,000-record sample, but feel free to download the entire dataset to use in this notebook!

We'll save the sample dataset into an Iceberg table called `permits`, using lakeFS for the catalog.
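
The write that follows targets the identifier `lakefs.main.nyc.permits`. Assuming the lakeFS Iceberg catalog convention in which the first part is the Spark catalog name, the second is a lakeFS reference such as a branch, and the remainder is the Iceberg namespace and table, the identifier can be broken down as in the sketch below. `LakeFSTableId` and `parse_table_id` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class LakeFSTableId(NamedTuple):
    catalog: str    # Spark catalog name, e.g. "lakefs"
    ref: str        # lakeFS reference (assumed to be a branch here), e.g. "main"
    namespace: str  # Iceberg namespace, e.g. "nyc"
    table: str      # table name, e.g. "permits"


def parse_table_id(identifier: str) -> LakeFSTableId:
    """Split 'catalog.ref.namespace.table' into its parts (illustrative only)."""
    parts = identifier.split(".")
    if len(parts) < 4:
        raise ValueError(f"expected catalog.ref.namespace.table, got {identifier!r}")
    # Anything between the ref and the table name is treated as the namespace
    catalog, ref, *namespace, table = parts
    return LakeFSTableId(catalog, ref, ".".join(namespace), table)


print(parse_table_id("lakefs.main.nyc.permits"))
```

Under that assumption, writing the same table on another branch would only change the second part of the identifier.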

In [8]:
df = spark.read.option("inferSchema","true").option("multiline","true").json("/data/nyc_film_permits.json")

In [9]:
df.write.saveAsTable("lakefs.main.nyc.permits")

Py4JJavaError: An error occurred while calling o59.saveAsTable.
: org.apache.spark.SparkException: Writing job aborted
	at org.apache.spark.sql.errors.QueryExecutionErrors$.writingJobAbortedError(QueryExecutionErrors.scala:767)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:409)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:353)
	at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.writeWithV2(WriteToDataSourceV2Exec.scala:108)
	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.$anonfun$writeToTable$1(WriteToDataSourceV2Exec.scala:503)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable(WriteToDataSourceV2Exec.scala:491)
	at org.apache.spark.sql.execution.datasources.v2.TableWriteExecHelper.writeToTable$(WriteToDataSourceV2Exec.scala:486)
	at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.writeToTable(WriteToDataSourceV2Exec.scala:108)
	at org.apache.spark.sql.execution.datasources.v2.AtomicCreateTableAsSelectExec.run(WriteToDataSourceV2Exec.scala:131)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:636)
	at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:566)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1) (e2e3d051b8aa executor driver): java.io.UncheckedIOException: Failed to close current writer
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:124)
	at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
	at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
	at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.close(SparkWrite.java:716)
	at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.commit(SparkWrite.java:698)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:453)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.IOException: can not write FileMetaData(version:1, schema:[SchemaElement(name:table, num_children:14), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:borough, converted_type:UTF8, field_id:1, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:category, converted_type:UTF8, field_id:2, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:communityboard_s, converted_type:UTF8, field_id:3, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:country, converted_type:UTF8, field_id:4, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:enddatetime, converted_type:UTF8, field_id:5, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:enteredon, converted_type:UTF8, field_id:6, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventagency, converted_type:UTF8, field_id:7, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventid, converted_type:UTF8, field_id:8, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventtype, converted_type:UTF8, field_id:9, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:parkingheld, converted_type:UTF8, field_id:10, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:policeprecinct_s, converted_type:UTF8, field_id:11, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:startdatetime, converted_type:UTF8, field_id:12, logicalType:<LogicalType STRING:StringType()>), 
SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:subcategoryname, converted_type:UTF8, field_id:13, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:zipcode_s, converted_type:UTF8, field_id:14, logicalType:<LogicalType STRING:StringType()>)], num_rows:1000, row_groups:[RowGroup(columns:[ColumnChunk(file_offset:97, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[borough], codec:GZIP, num_values:1000, total_uncompressed_size:490, total_compressed_size:483, data_page_offset:97, dictionary_page_offset:4, statistics:Statistics(null_count:0, max_value:53 74 61 74 65 6E 20 49 73 6C 61 6E 64, min_value:42 72 6F 6E 78), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47378, offset_index_length:12, column_index_offset:46611, column_index_length:33), ColumnChunk(file_offset:628, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[category], codec:GZIP, num_values:1000, total_uncompressed_size:628, total_compressed_size:474, data_page_offset:628, dictionary_page_offset:487, statistics:Statistics(null_count:0, max_value:57 45 42, min_value:43 6F 6D 6D 65 72 63 69 61 6C), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47390, offset_index_length:12, column_index_offset:46644, column_index_length:28), ColumnChunk(file_offset:1301, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[communityboard_s], codec:GZIP, num_values:1000, total_uncompressed_size:1766, total_compressed_size:1183, data_page_offset:1301, dictionary_page_offset:961, 
statistics:Statistics(null_count:0, max_value:39, min_value:30 2C 20 32 2C 20 33), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47402, offset_index_length:12, column_index_offset:46672, column_index_length:23), ColumnChunk(file_offset:2211, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[country], codec:GZIP, num_values:1000, total_uncompressed_size:81, total_compressed_size:119, data_page_offset:2211, dictionary_page_offset:2144, statistics:Statistics(max:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, min:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, null_count:0, max_value:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, min_value:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47414, offset_index_length:11, column_index_offset:46695, column_index_length:63), ColumnChunk(file_offset:3815, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[enddatetime], codec:GZIP, num_values:1000, total_uncompressed_size:14603, total_compressed_size:2736, data_page_offset:3815, dictionary_page_offset:2263, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 32 2D 32 30 54 31 38 3A 30 30 3A 30 30 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 34 54 32 32 3A 30 30 3A 30 30 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47425, offset_index_length:12, 
column_index_offset:46758, column_index_length:61), ColumnChunk(file_offset:4999, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[BIT_PACKED, RLE, PLAIN], path_in_schema:[enteredon], codec:GZIP, num_values:1000, total_uncompressed_size:27034, total_compressed_size:5023, data_page_offset:4999, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 31 2D 31 38 54 31 34 3A 33 34 3A 30 36 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 32 54 31 33 3A 33 34 3A 31 37 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)]), offset_index_offset:47437, offset_index_length:12, column_index_offset:46819, column_index_length:61), ColumnChunk(file_offset:10112, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[eventagency], codec:GZIP, num_values:1000, total_uncompressed_size:104, total_compressed_size:142, data_page_offset:10112, dictionary_page_offset:10022, statistics:Statistics(max:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, min:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, null_count:0, max_value:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, min_value:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47449, offset_index_length:12, column_index_offset:46880, column_index_length:107), ColumnChunk(file_offset:10164, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[BIT_PACKED, RLE, PLAIN], 
path_in_schema:[eventid], codec:GZIP, num_values:1000, total_uncompressed_size:10034, total_compressed_size:2348, data_page_offset:10164, statistics:Statistics(null_count:0, max_value:36 39 31 38 37 35, min_value:36 37 38 39 30 39), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)]), offset_index_offset:47461, offset_index_length:13, column_index_offset:46987, column_index_length:27), ColumnChunk(file_offset:12630, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[eventtype], codec:GZIP, num_values:1000, total_uncompressed_size:375, total_compressed_size:343, data_page_offset:12630, dictionary_page_offset:12512, statistics:Statistics(null_count:0, max_value:54 68 65 61 74 65 72 20 4C 6F 61 64 20 69 6E 20 61 6E 64 20 4C 6F 61 64 20 4F 75 74 73, min_value:44 43 41 53 20 50 72 65 70 2F 53 68 6F 6F 74 2F 57 72 61 70 20 50 65 72 6D 69 74), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47474, offset_index_length:13, column_index_offset:47014, column_index_length:71), ColumnChunk(file_offset:38250, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[parkingheld], codec:GZIP, num_values:1000, total_uncompressed_size:182027, total_compressed_size:26704, data_page_offset:38250, dictionary_page_offset:12855, statistics:Statistics(null_count:0, max_value:57 59 54 48 45 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 4E 4F 52 54 48 20 20 20 31 35 20 53 54 52 45 45 54 20 61 6E 64 20 4E 4F 52 54 48 20 20 20 31 34 20 53 54 52 45 45 54, min_value:31 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 45 41 53 54 20 20 20 31 35 20 53 54 52 45 45 54 20 61 6E 64 20 45 41 53 54 20 20 20 31 37 20 53 54 52 45 45 54 2C 20 20 31 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 45 41 53 54 20 20 20 31 38 20 53 
54 52 45 45 54 20 61 6E 64 20 45 41 53 54 20 20 20 32 30 20 53 54 52 45 45 54 2C 20 20 31 20 41 56 45 4E 55 45 20 62 65 74 77 65...), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47487, offset_index_length:13, column_index_offset:47085, column_index_length:139), ColumnChunk(file_offset:40221, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[policeprecinct_s], codec:GZIP, num_values:1000, total_uncompressed_size:2767, total_compressed_size:1487, data_page_offset:40221, dictionary_page_offset:39559, statistics:Statistics(null_count:0, max_value:39 34, min_value:30 2C 20 31 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47500, offset_index_length:13, column_index_offset:47224, column_index_length:22), ColumnChunk(file_offset:42324, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[startdatetime], codec:GZIP, num_values:1000, total_uncompressed_size:11984, total_compressed_size:2462, data_page_offset:42324, dictionary_page_offset:41046, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 31 2D 32 30 54 31 33 3A 30 30 3A 30 30 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 33 54 30 30 3A 30 30 3A 30 30 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47513, offset_index_length:13, column_index_offset:47246, column_index_length:61), ColumnChunk(file_offset:43745, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[subcategoryname], 
codec:GZIP, num_values:1000, total_uncompressed_size:964, total_compressed_size:745, data_page_offset:43745, dictionary_page_offset:43508, statistics:Statistics(null_count:0, max_value:56 61 72 69 65 74 79, min_value:43 61 62 6C 65 2D 65 70 69 73 6F 64 69 63), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47526, offset_index_length:13, column_index_offset:47307, column_index_length:36), ColumnChunk(file_offset:45476, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[zipcode_s], codec:GZIP, num_values:1000, total_uncompressed_size:5902, total_compressed_size:2358, data_page_offset:45476, dictionary_page_offset:44253, statistics:Statistics(null_count:0, max_value:31 31 36 39 33 2C 20 31 31 36 39 34, min_value:30 2C 20 31 30 30 31 31), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47539, offset_index_length:13, column_index_offset:47343, column_index_length:35)], total_byte_size:258759, num_rows:1000, file_offset:4, total_compressed_size:46607, ordinal:0)], key_value_metadata:[KeyValue(key:iceberg.schema, 
value:{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"borough","required":false,"type":"string"},{"id":2,"name":"category","required":false,"type":"string"},{"id":3,"name":"communityboard_s","required":false,"type":"string"},{"id":4,"name":"country","required":false,"type":"string"},{"id":5,"name":"enddatetime","required":false,"type":"string"},{"id":6,"name":"enteredon","required":false,"type":"string"},{"id":7,"name":"eventagency","required":false,"type":"string"},{"id":8,"name":"eventid","required":false,"type":"string"},{"id":9,"name":"eventtype","required":false,"type":"string"},{"id":10,"name":"parkingheld","required":false,"type":"string"},{"id":11,"name":"policeprecinct_s","required":false,"type":"string"},{"id":12,"name":"startdatetime","required":false,"type":"string"},{"id":13,"name":"subcategoryname","required":false,"type":"string"},{"id":14,"name":"zipcode_s","required":false,"type":"string"}]})], created_by:parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba), column_orders:[<ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>])
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.write(Util.java:376)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.writeFileMetaData(Util.java:143)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.writeFileMetaData(Util.java:138)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:1338)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1203)
	at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:255)
	at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
	... 16 more
Caused by: org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TTransportException: java.io.IOException: Filesystem WriteOperationHelper {bucket=lakefs-iceberg-nyc} closed
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:199)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:263)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:245)
	at org.apache.iceberg.shaded.org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:71)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1390)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1240)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData.write(FileMetaData.java:1118)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.write(Util.java:373)
	... 23 more
Caused by: java.io.IOException: Filesystem WriteOperationHelper {bucket=lakefs-iceberg-nyc} closed
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.checkOpen(S3ABlockOutputStream.java:243)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:294)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
	at java.base/java.io.DataOutputStream.write(DataOutputStream.java:112)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:197)
	... 32 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:377)
	... 47 more
Caused by: java.io.UncheckedIOException: Failed to close current writer
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:124)
	at org.apache.iceberg.io.RollingFileWriter.close(RollingFileWriter.java:147)
	at org.apache.iceberg.io.RollingDataWriter.close(RollingDataWriter.java:32)
	at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.close(SparkWrite.java:716)
	at org.apache.iceberg.spark.source.SparkWrite$UnpartitionedDataWriter.commit(SparkWrite.java:698)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:453)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
	at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more
Caused by: java.io.IOException: can not write FileMetaData(version:1, schema:[SchemaElement(name:table, num_children:14), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:borough, converted_type:UTF8, field_id:1, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:category, converted_type:UTF8, field_id:2, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:communityboard_s, converted_type:UTF8, field_id:3, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:country, converted_type:UTF8, field_id:4, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:enddatetime, converted_type:UTF8, field_id:5, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:enteredon, converted_type:UTF8, field_id:6, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventagency, converted_type:UTF8, field_id:7, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventid, converted_type:UTF8, field_id:8, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:eventtype, converted_type:UTF8, field_id:9, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:parkingheld, converted_type:UTF8, field_id:10, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:policeprecinct_s, converted_type:UTF8, field_id:11, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:startdatetime, converted_type:UTF8, field_id:12, logicalType:<LogicalType STRING:StringType()>), 
SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:subcategoryname, converted_type:UTF8, field_id:13, logicalType:<LogicalType STRING:StringType()>), SchemaElement(type:BYTE_ARRAY, repetition_type:OPTIONAL, name:zipcode_s, converted_type:UTF8, field_id:14, logicalType:<LogicalType STRING:StringType()>)], num_rows:1000, row_groups:[RowGroup(columns:[ColumnChunk(file_offset:97, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[borough], codec:GZIP, num_values:1000, total_uncompressed_size:490, total_compressed_size:483, data_page_offset:97, dictionary_page_offset:4, statistics:Statistics(null_count:0, max_value:53 74 61 74 65 6E 20 49 73 6C 61 6E 64, min_value:42 72 6F 6E 78), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47378, offset_index_length:12, column_index_offset:46611, column_index_length:33), ColumnChunk(file_offset:628, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[category], codec:GZIP, num_values:1000, total_uncompressed_size:628, total_compressed_size:474, data_page_offset:628, dictionary_page_offset:487, statistics:Statistics(null_count:0, max_value:57 45 42, min_value:43 6F 6D 6D 65 72 63 69 61 6C), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47390, offset_index_length:12, column_index_offset:46644, column_index_length:28), ColumnChunk(file_offset:1301, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[communityboard_s], codec:GZIP, num_values:1000, total_uncompressed_size:1766, total_compressed_size:1183, data_page_offset:1301, dictionary_page_offset:961, 
statistics:Statistics(null_count:0, max_value:39, min_value:30 2C 20 32 2C 20 33), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47402, offset_index_length:12, column_index_offset:46672, column_index_length:23), ColumnChunk(file_offset:2211, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[country], codec:GZIP, num_values:1000, total_uncompressed_size:81, total_compressed_size:119, data_page_offset:2211, dictionary_page_offset:2144, statistics:Statistics(max:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, min:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, null_count:0, max_value:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61, min_value:55 6E 69 74 65 64 20 53 74 61 74 65 73 20 6F 66 20 41 6D 65 72 69 63 61), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47414, offset_index_length:11, column_index_offset:46695, column_index_length:63), ColumnChunk(file_offset:3815, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[enddatetime], codec:GZIP, num_values:1000, total_uncompressed_size:14603, total_compressed_size:2736, data_page_offset:3815, dictionary_page_offset:2263, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 32 2D 32 30 54 31 38 3A 30 30 3A 30 30 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 34 54 32 32 3A 30 30 3A 30 30 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47425, offset_index_length:12, 
column_index_offset:46758, column_index_length:61), ColumnChunk(file_offset:4999, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[BIT_PACKED, RLE, PLAIN], path_in_schema:[enteredon], codec:GZIP, num_values:1000, total_uncompressed_size:27034, total_compressed_size:5023, data_page_offset:4999, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 31 2D 31 38 54 31 34 3A 33 34 3A 30 36 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 32 54 31 33 3A 33 34 3A 31 37 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)]), offset_index_offset:47437, offset_index_length:12, column_index_offset:46819, column_index_length:61), ColumnChunk(file_offset:10112, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[eventagency], codec:GZIP, num_values:1000, total_uncompressed_size:104, total_compressed_size:142, data_page_offset:10112, dictionary_page_offset:10022, statistics:Statistics(max:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, min:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, null_count:0, max_value:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67, min_value:4D 61 79 6F 72 27 73 20 4F 66 66 69 63 65 20 6F 66 20 46 69 6C 6D 2C 20 54 68 65 61 74 72 65 20 26 20 42 72 6F 61 64 63 61 73 74 69 6E 67), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47449, offset_index_length:12, column_index_offset:46880, column_index_length:107), ColumnChunk(file_offset:10164, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[BIT_PACKED, RLE, PLAIN], 
path_in_schema:[eventid], codec:GZIP, num_values:1000, total_uncompressed_size:10034, total_compressed_size:2348, data_page_offset:10164, statistics:Statistics(null_count:0, max_value:36 39 31 38 37 35, min_value:36 37 38 39 30 39), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN, count:1)]), offset_index_offset:47461, offset_index_length:13, column_index_offset:46987, column_index_length:27), ColumnChunk(file_offset:12630, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[eventtype], codec:GZIP, num_values:1000, total_uncompressed_size:375, total_compressed_size:343, data_page_offset:12630, dictionary_page_offset:12512, statistics:Statistics(null_count:0, max_value:54 68 65 61 74 65 72 20 4C 6F 61 64 20 69 6E 20 61 6E 64 20 4C 6F 61 64 20 4F 75 74 73, min_value:44 43 41 53 20 50 72 65 70 2F 53 68 6F 6F 74 2F 57 72 61 70 20 50 65 72 6D 69 74), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47474, offset_index_length:13, column_index_offset:47014, column_index_length:71), ColumnChunk(file_offset:38250, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[parkingheld], codec:GZIP, num_values:1000, total_uncompressed_size:182027, total_compressed_size:26704, data_page_offset:38250, dictionary_page_offset:12855, statistics:Statistics(null_count:0, max_value:57 59 54 48 45 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 4E 4F 52 54 48 20 20 20 31 35 20 53 54 52 45 45 54 20 61 6E 64 20 4E 4F 52 54 48 20 20 20 31 34 20 53 54 52 45 45 54, min_value:31 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 45 41 53 54 20 20 20 31 35 20 53 54 52 45 45 54 20 61 6E 64 20 45 41 53 54 20 20 20 31 37 20 53 54 52 45 45 54 2C 20 20 31 20 41 56 45 4E 55 45 20 62 65 74 77 65 65 6E 20 45 41 53 54 20 20 20 31 38 20 53 
54 52 45 45 54 20 61 6E 64 20 45 41 53 54 20 20 20 32 30 20 53 54 52 45 45 54 2C 20 20 31 20 41 56 45 4E 55 45 20 62 65 74 77 65...), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47487, offset_index_length:13, column_index_offset:47085, column_index_length:139), ColumnChunk(file_offset:40221, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[policeprecinct_s], codec:GZIP, num_values:1000, total_uncompressed_size:2767, total_compressed_size:1487, data_page_offset:40221, dictionary_page_offset:39559, statistics:Statistics(null_count:0, max_value:39 34, min_value:30 2C 20 31 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47500, offset_index_length:13, column_index_offset:47224, column_index_length:22), ColumnChunk(file_offset:42324, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[startdatetime], codec:GZIP, num_values:1000, total_uncompressed_size:11984, total_compressed_size:2462, data_page_offset:42324, dictionary_page_offset:41046, statistics:Statistics(null_count:0, max_value:32 30 32 33 2D 30 31 2D 32 30 54 31 33 3A 30 30 3A 30 30 2E 30 30 30, min_value:32 30 32 32 2D 31 31 2D 30 33 54 30 30 3A 30 30 3A 30 30 2E 30 30 30), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47513, offset_index_length:13, column_index_offset:47246, column_index_length:61), ColumnChunk(file_offset:43745, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[subcategoryname], 
codec:GZIP, num_values:1000, total_uncompressed_size:964, total_compressed_size:745, data_page_offset:43745, dictionary_page_offset:43508, statistics:Statistics(null_count:0, max_value:56 61 72 69 65 74 79, min_value:43 61 62 6C 65 2D 65 70 69 73 6F 64 69 63), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47526, offset_index_length:13, column_index_offset:47307, column_index_length:36), ColumnChunk(file_offset:45476, meta_data:ColumnMetaData(type:BYTE_ARRAY, encodings:[PLAIN_DICTIONARY, BIT_PACKED, RLE], path_in_schema:[zipcode_s], codec:GZIP, num_values:1000, total_uncompressed_size:5902, total_compressed_size:2358, data_page_offset:45476, dictionary_page_offset:44253, statistics:Statistics(null_count:0, max_value:31 31 36 39 33 2C 20 31 31 36 39 34, min_value:30 2C 20 31 30 30 31 31), encoding_stats:[PageEncodingStats(page_type:DICTIONARY_PAGE, encoding:PLAIN_DICTIONARY, count:1), PageEncodingStats(page_type:DATA_PAGE, encoding:PLAIN_DICTIONARY, count:1)]), offset_index_offset:47539, offset_index_length:13, column_index_offset:47343, column_index_length:35)], total_byte_size:258759, num_rows:1000, file_offset:4, total_compressed_size:46607, ordinal:0)], key_value_metadata:[KeyValue(key:iceberg.schema, 
value:{"type":"struct","schema-id":0,"fields":[{"id":1,"name":"borough","required":false,"type":"string"},{"id":2,"name":"category","required":false,"type":"string"},{"id":3,"name":"communityboard_s","required":false,"type":"string"},{"id":4,"name":"country","required":false,"type":"string"},{"id":5,"name":"enddatetime","required":false,"type":"string"},{"id":6,"name":"enteredon","required":false,"type":"string"},{"id":7,"name":"eventagency","required":false,"type":"string"},{"id":8,"name":"eventid","required":false,"type":"string"},{"id":9,"name":"eventtype","required":false,"type":"string"},{"id":10,"name":"parkingheld","required":false,"type":"string"},{"id":11,"name":"policeprecinct_s","required":false,"type":"string"},{"id":12,"name":"startdatetime","required":false,"type":"string"},{"id":13,"name":"subcategoryname","required":false,"type":"string"},{"id":14,"name":"zipcode_s","required":false,"type":"string"}]})], created_by:parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba), column_orders:[<ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>, <ColumnOrder TYPE_ORDER:TypeDefinedOrder()>])
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.write(Util.java:376)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.writeFileMetaData(Util.java:143)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.writeFileMetaData(Util.java:138)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:1338)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1203)
	at org.apache.iceberg.parquet.ParquetWriter.close(ParquetWriter.java:255)
	at org.apache.iceberg.io.DataWriter.close(DataWriter.java:82)
	at org.apache.iceberg.io.RollingFileWriter.closeCurrentWriter(RollingFileWriter.java:122)
	... 16 more
Caused by: org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TTransportException: java.io.IOException: Filesystem WriteOperationHelper {bucket=lakefs-iceberg-nyc} closed
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:199)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:482)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:489)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:263)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:245)
	at org.apache.iceberg.shaded.org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:71)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1390)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData$FileMetaDataStandardScheme.write(FileMetaData.java:1240)
	at org.apache.iceberg.shaded.org.apache.parquet.format.FileMetaData.write(FileMetaData.java:1118)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.write(Util.java:373)
	... 23 more
Caused by: java.io.IOException: Filesystem WriteOperationHelper {bucket=lakefs-iceberg-nyc} closed
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.checkOpen(S3ABlockOutputStream.java:243)
	at org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:294)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:62)
	at java.base/java.io.DataOutputStream.write(DataOutputStream.java:112)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:50)
	at org.apache.iceberg.shaded.org.apache.parquet.shaded.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:197)
	... 32 more


<strong style="color:red;">If the above step fails, try re-running it. See https://github.com/treeverse/lakefs-iceberg/issues/23 for more details.</strong>
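
Re-running a failed step by hand can also be automated. The following is a minimal sketch, not part of the lakeFS or Iceberg APIs: `retry` is a hypothetical helper that calls a function again when it raises, and `flaky_write` only simulates a step that fails twice before succeeding. In the notebook you would pass a function that wraps the failing write instead.

```python
import time


def retry(fn, attempts=3, delay_seconds=0.0):
    """Call fn() until it succeeds or `attempts` calls have failed."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as error:  # deliberately broad: this is a demo helper
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed, so surface the most recent error.
    raise last_error


# Stand-in for the write step (an assumption for illustration):
# it fails twice, then succeeds.
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("Filesystem closed")
    return "written"


result = retry(flaky_write, attempts=3)
print(result)
```

Catching every `Exception` keeps the sketch short; for real use it would be safer to retry only on the specific error you expect to be transient.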

In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"#### 👉🏻 Optionally, go and view the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo.id}/objects?ref=main&path=nyc%2Fpermits%2F)")
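
The `path=nyc%2Fpermits%2F` part of the link above is the URL-encoded form of the prefix `nyc/permits/`. As a small sketch using only the standard library, the same link can be built for any prefix. `object_browser_url` is a hypothetical helper name, and the URL layout is simply copied from the cell above.

```python
from urllib.parse import quote


def object_browser_url(base_url, repo_id, ref, prefix):
    """Build a link to the lakeFS object browser for a ref and path prefix."""
    # safe="" makes quote() encode "/" as %2F too.
    encoded_prefix = quote(prefix, safe="")
    return f"{base_url}/repositories/{repo_id}/objects?ref={ref}&path={encoded_prefix}"


url = object_browser_url(
    "http://localhost:8000", "lakefs-iceberg-nyc", "main", "nyc/permits/"
)
print(url)
```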

Taking a quick peek at the data, you can see that there are a number of permits for different boroughs in New York.

In [None]:
%%sql

SELECT borough, count(*) AS permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough

### Commit the new table and its data

In [None]:
lakefs.commits.commit(repo.id, "main", CommitCreation(
    message="Initial data load",
    metadata={'author': 'rmoff',
              'data source': 'https://data.cityofnewyork.us/City-Government/Film-Permits/tg4x-b46p'}
) )

# Create a new branch

_This is copy-on-write; we're not duplicating the data_

In [None]:
lakefs.branches.create_branch(repo.id, 
                              BranchCreation(name="dev",
                                             source="main"))
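
To picture what copy-on-write means here, consider the toy model below. It is only an illustration and makes no claim about how lakeFS is implemented: `ToyRepo`, its methods, the object IDs, and the file path are all made up. A branch is a set of pointers from paths to stored objects, so creating a branch copies the pointers and never the data. A new object is stored only when a branch is written to.

```python
class ToyRepo:
    """Toy model: branches map paths to object IDs; objects are stored once."""

    def __init__(self):
        self.objects = {}              # object ID -> content
        self.branches = {"main": {}}   # branch name -> {path: object ID}
        self._next_id = 0

    def write(self, branch, path, content):
        # Store a new object and repoint only this branch at it.
        object_id = f"obj-{self._next_id}"
        self._next_id += 1
        self.objects[object_id] = content
        self.branches[branch][path] = object_id

    def create_branch(self, name, source):
        # Copy only the path -> ID pointers, never the object contents.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.objects[self.branches[branch][path]]


toy = ToyRepo()
toy.write("main", "nyc/permits/data.parquet", "original rows")
toy.create_branch("dev", source="main")
objects_after_branch = len(toy.objects)  # branching stored nothing new

toy.write("dev", "nyc/permits/data.parquet", "changed rows")
objects_after_write = len(toy.objects)   # the write added one object

print(objects_after_branch, objects_after_write)
print(toy.read("main", "nyc/permits/data.parquet"))
print(toy.read("dev", "nyc/permits/data.parquet"))
```

Both branches read the same object until `dev` is written to, after which `main` still sees the original content.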

### Confirm that we can see the data on the `dev` branch

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

# Making [and reverting] changes on the `dev` branch

Let's go big! Let's see what happens when we delete the entire contents of the table with a careless `DELETE` that omits the all-important `WHERE` predicate.

In [None]:
%sql DELETE FROM lakefs.dev.nyc.permits

How's that data looking now?

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

But `main` is safe and unsullied 😌

In [None]:
%%sql

SELECT count(*)
FROM lakefs.main.nyc.permits;

## Reverting changes to the `dev` branch

In [None]:
lakefs.branches.reset_branch(repo.id, 
                             "dev",
                             ResetCreation(type="common_prefix", 
                                           path="nyc/permits/"))

_This resets only the uncommitted changes to this table's files. To reset the whole branch, use:_

```python
lakefs.branches.reset_branch(repo.id, 
                             "dev",
                             ResetCreation(type="reset"))
```
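
The difference between the two reset types can be pictured with a toy model of a branch's uncommitted changes. This is a sketch under stated assumptions, not the lakeFS implementation: `reset_staged` is a hypothetical function, the staged paths are made up, and only the type names `common_prefix` and `reset` come from the calls above.

```python
def reset_staged(staged, reset_type, path=None):
    """Return the staged changes that remain after a reset (toy model)."""
    if reset_type == "reset":
        # Whole-branch reset: every uncommitted change is dropped.
        return {}
    if reset_type == "common_prefix":
        # Prefix reset: drop only the changes under the given prefix.
        return {p: change for p, change in staged.items() if not p.startswith(path)}
    raise ValueError(f"Unsupported reset type: {reset_type}")


# Made-up uncommitted changes on a branch (illustrative paths only).
staged_changes = {
    "nyc/permits/data/file-1.parquet": "removed",
    "nyc/permits/metadata/v2.metadata.json": "added",
    "nyc/other_table/data/file-9.parquet": "added",
}

after_prefix_reset = reset_staged(staged_changes, "common_prefix", path="nyc/permits/")
after_full_reset = reset_staged(staged_changes, "reset")

print(after_prefix_reset)
print(after_full_reset)
```

With the prefix reset, the change to the other table survives; with the full reset, nothing does.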

## Our data's back!

In [None]:
%%sql

SELECT count(*)
FROM lakefs.dev.nyc.permits;

# Making changes to the `dev` branch as a collection

## Delete all rows for permits in `Manhattan` from the table

In [None]:
%sql DELETE FROM lakefs.dev.nyc.permits WHERE borough='Manhattan'

## Build an aggregate of the data to show how many permits we issued by category

In [None]:
%%sql

CREATE OR REPLACE TABLE lakefs.dev.nyc.agg_permit_category AS
SELECT category, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY category;

In [None]:
%sql SELECT * FROM lakefs.dev.nyc.agg_permit_category LIMIT 5;

# Compare `main` and `dev`

## `dev`

In [None]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.dev.nyc.permits
GROUP BY borough

## `main`

In [None]:
%%sql

SELECT borough, count(*) permit_cnt
FROM lakefs.main.nyc.permits
GROUP BY borough
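
Rather than comparing the two result sets by eye, the per-borough counts can be diffed in Python. The sketch below assumes each query result has already been turned into a plain dictionary. `diff_counts` is a hypothetical helper, and the numbers are placeholders, not values from the permits dataset.

```python
def diff_counts(main_counts, dev_counts):
    """Return {key: (main, dev)} for every key whose count differs."""
    keys = sorted(set(main_counts) | set(dev_counts))
    return {
        key: (main_counts.get(key, 0), dev_counts.get(key, 0))
        for key in keys
        if main_counts.get(key, 0) != dev_counts.get(key, 0)
    }


# Placeholder counts for illustration only.
main_counts = {"Brooklyn": 4, "Manhattan": 5, "Queens": 1}
dev_counts = {"Brooklyn": 4, "Queens": 1}

changes = diff_counts(main_counts, dev_counts)
print(changes)
```

A borough missing from one branch is treated as a count of zero, so a deleted borough shows up as a difference instead of being skipped.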

# Commit the changes to the `dev` branch

In [None]:
lakefs.commits.commit(repo.id, "dev", 
                      CommitCreation(
                          message="Remove data for Manhattan from permits dataset, build category aggregate",
                          metadata={"etl job name": "etl_job_42",
                                    "author": "rmoff"}
                      ))

# Merge the branch back into `main`

In [None]:
lakefs.refs.merge_into_branch(repository=repo.id, 
                              source_ref="dev", 
                              destination_branch="main")

---


In [None]:
from IPython.display import Markdown as md

if lakefsEndPoint=='http://lakefs:8000':
    lakeFSWebUI='http://localhost:8000'
else:
    lakeFSWebUI=lakefsEndPoint

md(f"### 👉🏻 View the objects in [lakeFS web UI]({lakeFSWebUI}/repositories/{repo.id}/objects)")