---
title: Awkward Array Primer
date: 2023-10-16
authors:
  - name: Angus Hollands
    affiliations:
      - Princeton University
---

[Awkward Array](https://awkward-array.org/doc/main/) is 
> a library for nested, variable-sized data, including arbitrary-length lists, records, mixed types, and missing data, using NumPy-like idioms.

Creating an Awkward Array is as simple as:

In [1]:
import awkward as ak

array = ak.Array([
    ["this", "is"],
    ["a", "ragged", "list", "of", "strings"]
])
array

Of course, Python data structures can be used to store ragged data, e.g.

In [2]:
[
    ["this", "is"],
    ["a", "ragged", "list", "of", "strings"]
]

[['this', 'is'], ['a', 'ragged', 'list', 'of', 'strings']]

But the real power of Awkward Array is the _performance_ and _expressiveness_ provided by our compiled kernels and high-level API:

In [72]:
array

In [73]:
ak.num(array, axis=1)

In [74]:
%psearch -e builtin ak.*? function

ak.__dir__
ak.all
ak.almost_equal
ak.any
ak.argcartesian
ak.argcombinations
ak.argmax
ak.argmin
ak.argsort
ak.backend
ak.broadcast_arrays
ak.broadcast_fields
ak.cartesian
ak.categories
ak.combinations
ak.concatenate
ak.copy
ak.corr
ak.count
ak.count_nonzero
ak.covar
ak.drop_none
ak.enforce_type
ak.fields
ak.fill_none
ak.firsts
ak.flatten
ak.from_arrow
ak.from_arrow_schema
ak.from_avro_file
ak.from_buffers
ak.from_categorical
ak.from_cupy
ak.from_dlpack
ak.from_feather
ak.from_iter
ak.from_jax
ak.from_json
ak.from_numpy
ak.from_parquet
ak.from_rdataframe
ak.from_regular
ak.full_like
ak.is_categorical
ak.is_none
ak.is_tuple
ak.is_valid
ak.isclose
ak.linear_fit
ak.local_index
ak.mask
ak.max
ak.mean
ak.merge_option_of_records
ak.merge_union_of_records
ak.metadata_from_parquet
ak.min
ak.mixin_class
ak.mixin_class_method
ak.moment
ak.nan_to_none
ak.nan_to_num
ak.num
ak.ones_like
ak.pad_none
ak.parameters
ak.prod
ak.ptp
ak.ravel
ak.run_lengths
ak.singletons
ak.softmax
ak.sort
ak.std
ak.string

## Supported Types

Awkward supports many different data types:

__Numerics__

In [5]:
ak.Array([1, 2, 3])

__Optionals__

In [6]:
ak.Array([1, None, 2])

__Tagged Unions__

In [7]:
ak.Array([1, "hello world", 2])

__Records__ (composed structures)

In [8]:
ak.Array([
    {'x': 1, 'y': 2},
    {'x': 3, 'y': 4}
])

__(Byte) strings__

In [9]:
ak.Array(["hello", "world!"])

Every array has a type that describes its structure:

In [15]:
array.show()

[1,
 2,
 3,
 4,
 5,
 6,
 7,
 8,
 9]


In [16]:
array.type.show()

9 * int64


In [17]:
array.type

ArrayType(NumpyType('int64'), 9, None)

## Reduction

Arrays can be reduced along an axis:

In [18]:
array = ak.Array([
    [1, 2, 4],
    [],
    [8],
    [16]
])

In [19]:
ak.sum(array, axis=1)

Awkward Array supports _ragged_ reduction:

:::{figure} img/example-reduction-sum-only.svg
:align: left

Ragged reduction at `axis=0`
:::


In [60]:
array = ak.Array([
    [1,    2,   4],
    [            ],
    [None, 8,    ],
    [16,         ]
])

In [75]:
ak.sum(array, axis=0)
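
To make the semantics of ragged reduction concrete, here is a pure-Python sketch of what summing along each axis means for the array above. This is not how Awkward implements reductions (those run in compiled kernels); the helper names `ragged_sum_axis0` and `ragged_sum_axis1` are hypothetical and exist only for illustration.

```python
from itertools import zip_longest

ragged = [
    [1, 2, 4],
    [],
    [None, 8],
    [16],
]

def ragged_sum_axis0(lists):
    """Sum the i-th items of every sublist, skipping absent and None values."""
    totals = []
    # zip_longest walks the "columns"; sublists that are too short yield None.
    for column in zip_longest(*lists, fillvalue=None):
        totals.append(sum(item for item in column if item is not None))
    return totals

def ragged_sum_axis1(lists):
    """Sum within each sublist; an empty sublist sums to 0."""
    return [sum(item for item in sublist if item is not None) for sublist in lists]

print(ragged_sum_axis0(ragged))  # [17, 10, 4]
print(ragged_sum_axis1(ragged))  # [7, 0, 8, 16]
```

At `axis=0` the sublists are aligned on the left, so the first result combines `1` and `16`, the second combines `2` and `8`, and the third contains only `4`.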

## Slicing

Arrays can be sliced

In [62]:
array

In [63]:
array[:1]

In [64]:
array[:, :1]

Indexing a single item from every sublist fails if any sublist is empty. Awkward provides helpers, such as `ak.firsts`, for slicing single items from maybe-empty sublists:

In [65]:
array

In [66]:
array[:, 0]

IndexError: cannot slice ListArray (of length 4) with array(0): index out of range while attempting to get index 0 (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-24/awkward-cpp/src/cpu-kernels/awkward_NumpyArray_getitem_next_at.cpp#L21)

This error occurred while attempting to slice

    <Array [[1, 2, 4], [], [None, 8], [16]] type='4 * var * ?int64'>

with

    (:, 0)

In [68]:
ak.firsts(array, axis=1)

## Columnar Representation

Arrays are represented in memory in a columnar format:

In [10]:
array = ak.Array([
    [1, 2, 3],
    [4, 5, 6, 7],
    [8, 9]
]) 
array.layout

<ListOffsetArray len='3'>
    <offsets><Index dtype='int64' len='4'>
        [0 3 7 9]
    </Index></offsets>
    <content><NumpyArray dtype='int64' len='9'>[1 2 3 4 5 6 7 8 9]</NumpyArray></content>
</ListOffsetArray>

In [11]:
array = ak.Array([
    [
        [1, 2], [3]
    ],
    [
        [4, 5, 6], [7]
    ],
    [
        [8, 9]
    ]
])
array.layout

<ListOffsetArray len='3'>
    <offsets><Index dtype='int64' len='4'>
        [0 2 4 5]
    </Index></offsets>
    <content><ListOffsetArray len='5'>
        <offsets><Index dtype='int64' len='6'>[0 2 3 6 7 9]</Index></offsets>
        <content><NumpyArray dtype='int64' len='9'>[1 2 3 4 5 6 7 8 9]</NumpyArray></content>
    </ListOffsetArray></content>
</ListOffsetArray>
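
To see how the offsets and content in these layouts fit together, here is a pure-Python sketch that rebuilds the nested lists from the buffers printed above. The `unpack` helper is hypothetical and is not part of Awkward's API; it only illustrates the idea that sublist `i` spans `content[offsets[i]:offsets[i + 1]]`.

```python
def unpack(offsets, content):
    """Rebuild Python lists from an offsets buffer and a content sequence."""
    return [
        list(content[start:stop])
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]

content = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# One level of nesting: offsets [0 3 7 9] over the flat content.
single = unpack([0, 3, 7, 9], content)
print(single)  # [[1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Two levels: inner offsets split the content, outer offsets group the result.
inner = unpack([0, 2, 3, 6, 7, 9], content)
nested = unpack([0, 2, 4, 5], inner)
print(nested)  # [[[1, 2], [3]], [[4, 5, 6], [7]], [[8, 9]]]
```

Both layouts share the same flat content; only the offsets differ.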

Flat arrays can easily be composed into new, nested arrays:

In [12]:
array = ak.Array([1, 2, 3, 4, 5, 6, 7, 8, 9])
array

In [13]:
ak.unflatten(
    array,
    counts=[2, 1, 3, 1, 2]
)

Because of this representation, it's easy to compose arrays into records

In [14]:
ak.zip({
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8]
})

It's very easy to move between the structured and in-memory representations:

In [27]:
array

We provide a function `ak.to_buffers` that can decompose an Array into its data and structure

In [28]:
form, length, container = ak.to_buffers(array)

The `form` is a high-level representation of an Array's structure

In [29]:
form

ListOffsetForm('i64', IndexedOptionForm('i64', NumpyForm('int64', form_key='node2'), form_key='node1'), form_key='node0')

Forms are just JSON!

In [30]:
print(form.to_json())

{"class": "ListOffsetArray", "offsets": "i64", "content": {"class": "IndexedOptionArray", "index": "i64", "content": {"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {}, "form_key": "node2"}, "parameters": {}, "form_key": "node1"}, "parameters": {}, "form_key": "node0"}


In [31]:
length

4

In [32]:
container

{'node0-offsets': array([0, 3, 3, 5, 6]),
 'node1-index': array([ 0,  1,  2, -1,  3,  4]),
 'node2-data': array([ 1,  2,  4,  8, 16])}

It is trivial to reassemble these components into an Array

In [33]:
ak.from_buffers(form, length, container)

## (Byte)Strings

In Awkward Array, bytestrings are views over an array of bytes

In [45]:
ak_bytestrings = ak.Array([
    b"I am", b"a list", b"of strings!"
])
ak_bytestrings

In [69]:
ak_bytestrings.layout

<ListOffsetArray len='3'>
    <parameter name='__array__'>'bytestring'</parameter>
    <offsets><Index dtype='int64' len='4'>
        [ 0  4 10 21]
    </Index></offsets>
    <content><NumpyArray dtype='uint8' len='21'>
        <parameter name='__array__'>'byte'</parameter>
        [ 73  32  97 109  97  32 108 105 115 116 111 102  32 115 116 114 105
         110 103 115  33]
    </NumpyArray></content>
</ListOffsetArray>

Let's drop our string abstraction to see the raw bytes:

In [47]:
ak.without_parameters(ak_bytestrings)

Now we contrast this with NumPy, which only has fixed-length bytestrings:

In [46]:
import numpy as np
bytestrings = np.array([b"I am", b"a list", b"of strings!"])
bytestrings

array([b'I am', b'a list', b'of strings!'], dtype='|S11')

In [44]:
bytestrings.view(np.uint8).reshape(-1, 11)

array([[ 73,  32,  97, 109,   0,   0,   0,   0,   0,   0,   0],
       [ 97,  32, 108, 105, 115, 116,   0,   0,   0,   0,   0],
       [111, 102,  32, 115, 116, 114, 105, 110, 103, 115,  33]],
      dtype=uint8)

Awkward also supports strings, using UTF-8 variable-length encoding:

In [48]:
ak_strings = ak.Array(["I am", "a list", "of strings!"])
ak_strings

As before, let's drop our string abstraction to see the raw code units:

In [50]:
ak.without_parameters(ak_strings)

NumPy's strings are encoded using (fixed-length) UTF-32, meaning that they're typically larger

In [51]:
strings = np.array(["I am", "a list", "of strings!"])
strings

array(['I am', 'a list', 'of strings!'], dtype='<U11')

In [39]:
strings.view(np.uint8).reshape(-1, 11, 4)

array([[[ 73,   0,   0,   0],
        [ 32,   0,   0,   0],
        [ 97,   0,   0,   0],
        [109,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0]],

       [[ 97,   0,   0,   0],
        [ 32,   0,   0,   0],
        [108,   0,   0,   0],
        [105,   0,   0,   0],
        [115,   0,   0,   0],
        [116,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0],
        [  0,   0,   0,   0]],

       [[111,   0,   0,   0],
        [102,   0,   0,   0],
        [ 32,   0,   0,   0],
        [115,   0,   0,   0],
        [116,   0,   0,   0],
        [114,   0,   0,   0],
        [105,   0,   0,   0],
        [110,   0,   0,   0],
        [103,   0,   0,   0],
        [115,   0,   0,   0],
        [ 33,   0,   0,   0]]], dtype=uint8)
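
The size difference can be estimated with a few lines of pure Python. This sketch counts only the character data; it ignores the offsets buffer that a variable-length layout also needs, so it is an illustration of the encoding cost rather than a full memory comparison.

```python
strings = ["I am", "a list", "of strings!"]

# Variable-length UTF-8: every string takes only the bytes it needs.
utf8_bytes = sum(len(s.encode("utf-8")) for s in strings)

# Fixed-length UTF-32: every string is padded to the longest one,
# and every code point takes 4 bytes.
longest = max(len(s) for s in strings)
utf32_bytes = len(strings) * longest * 4

print(utf8_bytes)   # 21
print(utf32_bytes)  # 132
```

The gap grows with the spread in string lengths, because fixed-length storage pads every string to the longest one.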

Through PyArrow, we provide a suite of string operations

In [40]:
array = ak.str.split_whitespace(
    ["A \"Hello, World!\" program is generally a computer program that ignores any input"]
)
array

In [41]:
ak.str.upper(
    array
)

Awkward extends PyArrow's string operations into more complex structures

In [56]:
import pyarrow, pyarrow.compute

list_of_strings = pyarrow.array([["hello", "world"], ["hello!"]])
pyarrow.compute.ascii_upper(list_of_strings)

ArrowNotImplementedError: Function 'ascii_upper' has no kernel matching input types (list<item: string>)

PyArrow appears to be focused on interop (Arrow, Parquet, Feather), whereas Awkward is concerned with interactive analysis