Unify schemas recipe (#75)
amol- committed Sep 30, 2021
1 parent a85984f commit 13584172870b921275e6c722884a75c31641d495
Showing 2 changed files with 85 additions and 2 deletions.
@@ -130,7 +130,7 @@ from a variety of inputs, including plain python objects
and will benefit from zero copy behaviour when possible.

Creating Record Batches
=======================

Most I/O operations in Arrow happen when shipping batches of data
to their destination. :class:`pyarrow.RecordBatch` is the way
@@ -108,4 +108,87 @@ as far as they are compatible
pyarrow.Table
col1: int32
col2: string
col3: double

Merging multiple schemas
========================

When you have several separate groups of data that you want to combine,
you may need to unify their schemas into a single superset schema
that applies to all of the data sources.

.. testcode::

    import pyarrow as pa

    first_schema = pa.schema([
        ("country", pa.string()),
        ("population", pa.int32())
    ])

    second_schema = pa.schema([
        ("country_code", pa.string()),
        ("language", pa.string())
    ])

:func:`pyarrow.unify_schemas` can be used to combine multiple schemas into
a single one:

.. testcode::

    union_schema = pa.unify_schemas([first_schema, second_schema])

    print(union_schema)

.. testoutput::

    country: string
    population: int32
    country_code: string
    language: string

If the schemas have overlapping columns, they can still be combined
as long as the colliding columns have the same type (here ``country_code``):

.. testcode::

    third_schema = pa.schema([
        ("country_code", pa.string()),
        ("lat", pa.float32()),
        ("long", pa.float32()),
    ])

    union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])

    print(union_schema)

.. testoutput::

    country: string
    population: int32
    country_code: string
    language: string
    lat: float
    long: float

If instead a field has diverging types across the schemas being combined,
trying to merge them will fail. For example, if ``country_code``
were numeric instead of a string, we would be unable to unify the schemas
because ``second_schema`` already declares it as ``pa.string()``:

.. testcode::

    third_schema = pa.schema([
        ("country_code", pa.int32()),
        ("lat", pa.float32()),
        ("long", pa.float32()),
    ])

    try:
        union_schema = pa.unify_schemas([first_schema, second_schema, third_schema])
    except pa.ArrowInvalid as e:
        print(e)

.. testoutput::

    Unable to merge: Field country_code has incompatible types: string vs int32
