![iceberg-logo](https://www.apache.org/logos/res/iceberg/iceberg.png)

### [Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!](https://tabular.io/blog/docker-spark-and-iceberg/)

In [30]:
from pyiceberg import __version__

__version__

'0.7.1'

# Write support

This notebook demonstrates writing to Iceberg tables with PyIceberg. First, connect to the [catalog](https://iceberg.apache.org/concepts/catalog/#iceberg-catalogs), which keeps track of the tables.

In [31]:
from pyiceberg.catalog import load_catalog

catalog = load_catalog('default')

# Create an Iceberg table

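Next, create the Iceberg table from an explicit Iceberg schema, with `id` as the identifier field. Any existing table with the same name is dropped first so the notebook can be rerun.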

In [45]:
table_name = "default.commits"

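from pyiceberg.exceptions import NoSuchTableError

try:
    # Drop the table if it already exists so the notebook can be rerun
    catalog.drop_table(table_name)
except NoSuchTableError:
    # Nothing to drop on the first run
    pass

from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Every column is required; field 1 (id) identifies a row
schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType(), required=True),
    NestedField(3, "state", StringType(), required=True),
    NestedField(4, "additions", LongType(), required=True),
    NestedField(5, "deletes", LongType(), required=True),
    identifier_field_ids=[1],
)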

table = catalog.create_table(table_name, schema=schema)

table

commits(
  1: id: required long,
  2: name: required string,
  3: state: required string,
  4: additions: required long,
  5: deletes: required long
),
partition by: [],
sort order: [],
snapshot: null

# Loading data using Arrow

Create an example PyArrow table that mimics data from the GitHub API.

In [46]:
import pyarrow as pa

from pyiceberg.io.pyarrow import schema_to_pyarrow

pa_schema = schema_to_pyarrow(schema)

df = pa.Table.from_pylist(
    [
        {"id": 123, "name": "Fix bug", "state": "Open", "additions": 22, "deletes": 10},
        {"id": 234, "name": "Add VariantType", "state": "Open", "additions": 29123, "deletes": 302},
        {"id": 345, "name": "Add commit retries", "state": "Open", "additions": 22, "deletes": 10},
    ],
    schema=pa_schema
)

df

pyarrow.Table
id: int64 not null
name: large_string not null
state: large_string not null
additions: int64 not null
deletes: int64 not null
----
id: [[123,234,345]]
name: [["Fix bug","Add VariantType","Add commit retries"]]
state: [["Open","Open","Open"]]
additions: [[22,29123,22]]
deletes: [[10,302,10]]

# Write the data

Let's append the data to the table:

In [47]:
table.append(df)

assert len(table.scan().to_arrow()) == len(df)

table.scan().to_pandas()

| | id | name | state | additions | deletes |
|---|---|---|---|---|---|
| 0 | 123 | Fix bug | Open | 22 | 10 |
| 1 | 234 | Add VariantType | Open | 29123 | 302 |
| 2 | 345 | Add commit retries | Open | 22 | 10 |


In [35]:
table.inspect.snapshots().to_pandas()

| | committed_at | snapshot_id | parent_id | operation | manifest_list | summary |
|---|---|---|---|---|---|---|
| 0 | 2025-04-07 13:43:24.679 | 2018389781075547497 | | append | s3://warehouse/default/commits/metadata/snap-2... | [(added-files-size, 2504), (added-data-files, ... |


# Add more data

In [48]:
table.append(pa.Table.from_pylist(
    [
        {"id": 456, "name": "Add NanosecondTimestamps", "state": "Merged", "additions": 2392, "deletes": 8},
        {"id": 567, "name": "Add documentation around filters", "state": "Open", "additions": 7543, "deletes": 3},
    ],
    schema=pa_schema
))

table.scan().to_pandas()

| | id | name | state | additions | deletes |
|---|---|---|---|---|---|
| 0 | 456 | Add NanosecondTimestamps | Merged | 2392 | 8 |
| 1 | 567 | Add documentation around filters | Open | 7543 | 3 |
| 2 | 123 | Fix bug | Open | 22 | 10 |
| 3 | 234 | Add VariantType | Open | 29123 | 302 |
| 4 | 345 | Add commit retries | Open | 22 | 10 |


In [49]:
table.inspect.snapshots().to_pandas()

| | committed_at | snapshot_id | parent_id | operation | manifest_list | summary |
|---|---|---|---|---|---|---|
| 0 | 2025-04-07 14:25:20.592 | 4215114979050777060 | | append | s3://warehouse/default/commits/metadata/snap-4... | [(added-files-size, 2472), (added-data-files, ... |
| 1 | 2025-04-07 14:25:32.876 | 2176183429045520610 | 4.215115e+18 | append | s3://warehouse/default/commits/metadata/snap-2... | [(added-files-size, 2600), (added-data-files, ... |
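
Note that `parent_id` prints as `4.215115e+18` rather than as the integer `4215114979050777060`. The first snapshot has no parent, and when an Arrow `int64` column that contains nulls is converted with `to_pandas()`, pandas stores it as `float64` with `NaN`. A 64-bit float cannot represent every integer of this size, so the displayed value should not be reused as a snapshot ID. The sketch below uses only the standard library and the ID from the output above to show the loss; the variable names are illustrative.

```python
# Snapshot ID of the first append, copied from the output above.
parent_id = 4215114979050777060

# What a float64 column holds after the int -> float conversion.
as_float = float(parent_id)
round_tripped = int(as_float)

# Between 2**61 and 2**62 a float64 can only hold multiples of 512,
# so the round trip lands on a neighbouring value, not the original ID.
print(round_tripped == parent_id)             # False
print(abs(round_tripped - parent_id) <= 256)  # True
```

To keep exact IDs, one option is to read the column from the Arrow table before converting, for example `table.inspect.snapshots()["parent_id"].to_pylist()`.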


# Upsert new data

In [58]:
table.upsert(pa.Table.from_pylist(
    [
        # Nothing changes: No-op
        {"id": 456, "name": "Add NanosecondTimestamps", "state": "Merged", "additions": 2392, "deletes": 8},

        # Updated: state, additions, and deletes differ from the stored row
        {"id": 567, "name": "Add documentation around filters", "state": "Merged", "additions": 9238, "deletes": 22},
    ],
    schema=pa_schema
))

table.scan().to_pandas()

AttributeError: 'Table' object has no attribute 'upsert'
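
The `AttributeError` above follows from the PyIceberg version printed at the top of the notebook: 0.7.1 has no `Table.upsert`. The method appears to have first shipped in PyIceberg 0.9.0; treat that version number as an assumption and check the release notes for your installation. The sketch below is not PyIceberg code. It uses plain Python, with hypothetical helpers `supports_upsert` and `plan_upsert`, to show a version guard and what an upsert keyed on the identifier field `id` would do with the two rows from the cell above.

```python
# Assumption: Table.upsert is available from PyIceberg 0.9.0 onward.
ASSUMED_MIN_UPSERT_VERSION = (0, 9, 0)


def supports_upsert(version: str) -> bool:
    """Compare a plain 'major.minor.patch' string with the assumed minimum.

    Pre-release suffixes such as 'rc1' are not handled in this sketch.
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= ASSUMED_MIN_UPSERT_VERSION


def plan_upsert(existing, incoming, key="id"):
    """Classify incoming rows against existing rows by the identifier field."""
    by_key = {row[key]: row for row in existing}
    plan = {"unchanged": [], "updated": [], "inserted": []}
    for row in incoming:
        current = by_key.get(row[key])
        if current is None:
            plan["inserted"].append(row[key])   # no row with this key yet
        elif current == row:
            plan["unchanged"].append(row[key])  # identical row: no-op
        else:
            plan["updated"].append(row[key])    # same key, different values
    return plan


# Rows already in the table for ids 456 and 567 (from the second append).
existing = [
    {"id": 456, "name": "Add NanosecondTimestamps", "state": "Merged", "additions": 2392, "deletes": 8},
    {"id": 567, "name": "Add documentation around filters", "state": "Open", "additions": 7543, "deletes": 3},
]
# Rows passed to upsert in the cell above.
incoming = [
    {"id": 456, "name": "Add NanosecondTimestamps", "state": "Merged", "additions": 2392, "deletes": 8},
    {"id": 567, "name": "Add documentation around filters", "state": "Merged", "additions": 9238, "deletes": 22},
]

print(supports_upsert("0.7.1"))  # False
print(plan_upsert(existing, incoming))
# {'unchanged': [456], 'updated': [567], 'inserted': []}
```

After upgrading PyIceberg, the upsert cell should run unchanged. On an installation that may be older, guarding the call with `hasattr(table, "upsert")` avoids the traceback.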