# Part 3: Arbitrary data structures

So far, all the arrays we've dealt with have been rectangular (in $n$ dimensions; "rectilinear").

<center>
<img src="../img/8-layer_cube.jpg" width="50%">
</center>

What if we had data like this?

```json
[
  [[1.84, 0.324]],
  [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
  [[0.459, -1.517, 1.545], [0.33, 0.292]],
  [[-0.376, -1.46, -0.206], [0.65, 1.278]],
  [[], [], [1.617]],
  []
]
[
  [[-0.106, 0.611]],
  [[0.118, -1.788, 0.794, 0.658], [-0.105]]
]
[
  [[-0.384], [0.697, -0.856]],
  [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
]
[
  [[0.205, -0.355], [-0.265], [1.042]],
  [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
  [[1.741, -0.199, 0.827]]
]
```

What if we had data like this?

```json
[
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 5.27453, "y": 1.03276},
    {"x": -3.51280, "y": 1.74849}]},
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 8.21630, "y": 4.07844},
    {"x": -0.79157, "y": 3.49478}, {"x": 16.38932, "y": 5.29399},
    {"x": 10.38641, "y": 0.10832}, {"x": -2.07070, "y": 14.07140},
    {"x": 9.57021, "y": -0.94823}, {"x": 1.97332, "y": 3.62380},
    {"x": 5.66760, "y": 11.38001}, {"x": 0.25497, "y": 3.39276},
    {"x": 3.86585, "y": 6.22051}, {"x": -0.67393, "y": 2.20572}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": 3.59528, "y": 7.37191},
    {"x": 0.59192, "y": 2.91503}, {"x": 4.02932, "y": -1.13601},
    {"x": -1.01593, "y": 1.95894}, {"x": 1.03666, "y": 0.05251}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": -8.78510, "y": -0.00497},
    {"x": -15.22688, "y": 3.90244}, {"x": 5.74593, "y": 4.12718}]},
  {"fill": "none", "stroke": "#000000", "points": [{"x": 4.40625, "y": -6.953125},
    {"x": 4.34375, "y": -7.09375}, {"x": 4.3125, "y": -7.140625},
    {"x": 4.140625, "y": -7.140625}]},
  {"fill": "none", "stroke": "#808080", "points": [{"x": 0.46875, "y": -0.09375},
    {"x": 0.46875, "y": -0.078125}, {"x": 0.46875, "y": 0.53125}]}
]
```

What if we had data like this?

```json
[
  {"movie": "Evil Dead", "year": 1981, "actors":
    ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]
  },
  {"movie": "Darkman", "year": 1990, "actors":
    ["Liam Neeson", "Frances McDormand", "Larry Drake", "Bruce Campbell"]
  },
  {"movie": "Army of Darkness", "year": 1992, "actors":
    ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert", "Bridget Fonda",
     "Ted Raimi", "Patricia Tallman"]
  },
  {"movie": "A Simple Plan", "year": 1998, "actors":
    ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"]
  },
  {"movie": "Spider-Man 2", "year": 2004, "actors":
    ["Tobey Maguire", "Kirsten Dunst", "Alfred Molina", "James Franco",
     "Rosemary Harris", "J.K. Simmons", "Stan Lee", "Bruce Campbell"]
  },
  {"movie": "Drag Me to Hell", "year": 2009, "actors":
    ["Alison Lohman", "Justin Long", "Lorna Raver", "Dileep Rao", "David Paymer"]
  }
]
```

It might be possible to turn these datasets into tabular form using surrogate keys and database normalization, but

 * the tabular form could be inconvenient or less efficient, depending on what we want to do, and
 * the data were very likely _given_ to us in a ragged/untidy form, so the conversion is itself a data-cleaning step. You can't ignore the data-cleaning step!

<br>

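Dealing with these datasets as JSON or Python objects is inefficient for the same reason that Python lists of numbers are inefficient: every value is a separate, dynamically typed object that the interpreter has to handle one at a time.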

<br>

We want arbitrary data structures with an array-oriented interface and array-like performance...

<center>
<img src="../img/awkward-motivation-venn-diagram.svg" width="40%">
</center>
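
One common way to get both is a columnar layout: keep all of the numbers in one flat buffer and describe the list boundaries with a second array of offsets. Arrow and Awkward Array both build on this idea. The sketch below only illustrates it with plain NumPy; the names `content`, `offsets`, and `sum_each_list` are made up for this example and are not part of either library.

```python
import numpy as np

# The ragged list [[1.0, 2.0, 3.0], [], [4.0, 5.0]] stored as two flat arrays.
content = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # all values, end to end
offsets = np.array([0, 3, 3, 5])               # list i is content[offsets[i]:offsets[i + 1]]

def sum_each_list(content, offsets):
    """Sum every list without a Python loop over the lists (hypothetical helper)."""
    # running[k] is the sum of the first k values, so each list's sum is a difference.
    running = np.concatenate([[0.0], np.cumsum(content)])
    return running[offsets[1:]] - running[offsets[:-1]]

lengths = np.diff(offsets)                     # number of items in each list
sums = sum_each_list(content, offsets)

print(lengths.tolist())  # [3, 0, 2]
print(sums.tolist())     # [6.0, 0.0, 9.0]
```

The empty list costs nothing extra: it is just two equal neighbouring offsets.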

## Libraries for irregular arrays

<br>

<table>
<tr style="background: white;"><td width="35%"><img src="../img/logo-arrow.svg" width="100%"></td><td style="padding-left: 50px;">In-memory format and an ecosystem of tools, an "exploded database" (database functionality provided as interchangeable pieces). Strong focus on delivering data, zero-copy, between processes.</td></tr>
<tr style="background: white; height: 30px;"><td></td><td></td></tr>
<tr style="background: white;"><td width="35%"><img src="../img/logo-awkward.svg" width="100%"></td><td style="padding-left: 50px;">Library for array-oriented programming like NumPy, but for arbitrary data structures. Losslessly convertible to/from Arrow (zero-copy) and Parquet.</td></tr>
<tr style="background: white; height: 30px;"><td></td><td></td></tr>
<tr style="background: white;"><td width="35%"><img src="../img/logo-parquet.svg" width="100%"></td><td style="padding-left: 50px;">Disk format for storing large datasets and (selectively) retrieving them.</td></tr>
</table>

<img src="../img/logo-arrow.svg" width="30%">

<br>

In [None]:
import pyarrow as pa

<br>

In [None]:
arrow_array = pa.array([
    [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
    [],
    [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
])

<br>

In [None]:
arrow_array.type

<br>

In [None]:
arrow_array

<img src="../img/logo-awkward.svg" width="30%">

<br>

In [None]:
import awkward as ak

<br>

In [None]:
awkward_array = ak.from_arrow(arrow_array)
awkward_array

<img src="../img/logo-parquet.svg" width="30%">

<br>

In [None]:
ak.to_parquet(awkward_array, "/tmp/file.parquet")

<br>

In [None]:
ak.from_parquet("/tmp/file.parquet")

## Awkward Array

In [None]:
ragged = ak.Array([
    [
      [[1.84, 0.324]],
      [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
      [[0.459, -1.517, 1.545], [0.33, 0.292]],
      [[-0.376, -1.46, -0.206], [0.65, 1.278]],
      [[], [], [1.617]],
      []
    ],
    [
      [[-0.106, 0.611]],
      [[0.118, -1.788, 0.794, 0.658], [-0.105]]
    ],
    [
      [[-0.384], [0.697, -0.856]],
      [[0.778, 0.023, -1.455, -2.289], [-0.67], [1.153, -1.669, 0.305, 1.517, -0.292]]
    ],
    [
      [[0.205, -0.355], [-0.265], [1.042]],
      [[-0.004], [-1.167, -0.054, 0.726, 0.213]],
      [[1.741, -0.199, 0.827]]
    ]
])

**Multidimensional indexing**

In [None]:
ragged[3, 1, -1, 2]

<br>

**Basic slicing**

In [None]:
ragged[3, 1:, -1, 1:3]

<br>

**Advanced slicing**

In [None]:
ragged[[False, False, True, True], [0, -1, 0, -1], 0, -1]

**Awkward slicing**

In [None]:
ragged > 0

<br>

In [None]:
ragged[ragged > 0]
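
For intuition, here is a pure-Python sketch of what this kind of slicing computes: the mask has the same nested shape as the data, numbers whose mask entry is `False` are dropped, and every level of list nesting is kept. The function names `greater_than_zero` and `select_nested` are hypothetical, and this is not how Awkward Array does it internally (it operates on whole arrays rather than recursing in Python).

```python
def greater_than_zero(data):
    """Build a nested mask of booleans with the same shape as `data`."""
    if isinstance(data, list):
        return [greater_than_zero(item) for item in data]
    return data > 0

def select_nested(data, mask):
    """Keep the numbers whose mask entry is True, preserving the list nesting."""
    result = []
    for item, keep in zip(data, mask):
        if isinstance(item, list):
            result.append(select_nested(item, keep))  # recurse; the list itself stays
        elif keep:
            result.append(item)                       # a number that passed the cut
    return result

# A few entries taken from the first list of `ragged`.
small = [
    [[1.84, 0.324]],
    [[-1.609, -0.713, 0.005], [0.953, -0.993, 0.011, 0.718]],
    [[], [], [1.617]],
    [],
]

selected = select_nested(small, greater_than_zero(small))
print(selected)
# [[[1.84, 0.324]], [[0.005], [0.953, 0.011, 0.718]], [[], [], [1.617]], []]
```

Note that lists can become shorter or empty, but none of them disappear.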

**Reductions**

In [None]:
ak.sum(ragged)

<br>

In [None]:
ak.sum(ragged, axis=-1)

<br>

In [None]:
ak.sum(ragged, axis=0)

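How are reductions even defined for ragged arrays? For `axis=0`, the lists are aligned at their left edge and the $i$-th items of all lists are combined, so the result is as long as the longest list. Missing values (`None`) are skipped.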

<center>
<img src="../img/example-reduction-sum.svg" width="50%">
</center>

In [None]:
small_ragged = ak.Array([[   1, 2, 4],
                         [          ],
                         [None, 8,  ],
                         [  16      ]])

<br>

In [None]:
ak.sum(small_ragged, axis=0)
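
The `axis=0` case can be spelled out in pure Python as a sketch. It assumes, as in the picture above, that lists are lined up at their left edge and that missing values contribute nothing to the sum. The helper names are hypothetical, and plain Python lists stand in for the Awkward Array.

```python
from itertools import zip_longest

small_ragged_lists = [[1, 2, 4], [], [None, 8], [16]]

def sum_axis0(lists):
    """Sum the i-th items of all lists, skipping missing ones (hypothetical helper)."""
    totals = []
    # zip_longest lines the lists up at their left edge and pads short ones with None.
    for column in zip_longest(*lists, fillvalue=None):
        totals.append(sum(x for x in column if x is not None))
    return totals

def sum_last_axis(lists):
    """Sum within each list, again skipping missing values (hypothetical helper)."""
    return [sum(x for x in lst if x is not None) for lst in lists]

print(sum_axis0(small_ragged_lists))      # [17, 10, 4]
print(sum_last_axis(small_ragged_lists))  # [7, 0, 8, 16]
```

The first result has one entry per "column" (as many as the longest list), while the second has one entry per list.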

In [None]:
structured = ak.Array([
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 5.27453, "y": 1.03276},
    {"x": -3.51280, "y": 1.74849}]},
  {"fill": "#b1b1b1", "stroke": "none", "points": [{"x": 8.21630, "y": 4.07844},
    {"x": -0.79157, "y": 3.49478}, {"x": 16.38932, "y": 5.29399},
    {"x": 10.38641, "y": 0.10832}, {"x": -2.07070, "y": 14.07140},
    {"x": 9.57021, "y": -0.94823}, {"x": 1.97332, "y": 3.62380},
    {"x": 5.66760, "y": 11.38001}, {"x": 0.25497, "y": 3.39276},
    {"x": 3.86585, "y": 6.22051}, {"x": -0.67393, "y": 2.20572}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": 3.59528, "y": 7.37191},
    {"x": 0.59192, "y": 2.91503}, {"x": 4.02932, "y": -1.13601},
    {"x": -1.01593, "y": 1.95894}, {"x": 1.03666, "y": 0.05251}]},
  {"fill": "#d0d0ff", "stroke": "none", "points": [{"x": -8.78510, "y": -0.00497},
    {"x": -15.22688, "y": 3.90244}, {"x": 5.74593, "y": 4.12718}]},
  {"fill": "none", "stroke": "#000000", "points": [{"x": 4.40625, "y": -6.953125},
    {"x": 4.34375, "y": -7.09375}, {"x": 4.3125, "y": -7.140625},
    {"x": 4.140625, "y": -7.140625}]},
  {"fill": "none", "stroke": "#808080", "points": [{"x": 0.46875, "y": -0.09375},
    {"x": 0.46875, "y": -0.078125}, {"x": 0.46875, "y": 0.53125}]}
])

<br>

In [None]:
import numpy as np

**Elementwise formulas**

In [None]:
np.sqrt(structured["points", "x"]**2 + structured["points", "y"]**2)

<br>

In [None]:
np.sqrt(structured.points.x**2 + structured.points.y**2)
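
To see what the array expression replaces, the same formula can be written with explicit Python loops. This sketch uses a tiny made-up dataset (`shapes` is invented for illustration and is not the `structured` array above) so that the distances come out as round numbers.

```python
import math

# Made-up records with the same "points" -> "x"/"y" structure as above.
shapes = [
    {"points": [{"x": 3.0, "y": 4.0}, {"x": 6.0, "y": 8.0}]},
    {"points": []},
    {"points": [{"x": 5.0, "y": 12.0}]},
]

# One distance per point, grouped by shape: the nesting of the input is preserved.
distances = [
    [math.sqrt(p["x"]**2 + p["y"]**2) for p in shape["points"]]
    for shape in shapes
]

print(distances)  # [[5.0, 10.0], [], [13.0]]
```

The array-oriented version expresses the same computation without naming the loop variables, and the result has the same list lengths as `points`.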

**Quizlet:** Given the following dataset,

In [None]:
data = ak.Array([
    {"movie": "Evil Dead", "year": 1981, "actors":
        ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]
    },
    {"movie": "Darkman", "year": 1990, "actors":
        ["Liam Neeson", "Frances McDormand", "Larry Drake", "Bruce Campbell"]
    },
    {"movie": "Army of Darkness", "year": 1992, "actors":
        ["Bruce Campbell", "Embeth Davidtz", "Marcus Gilbert", "Bridget Fonda",
         "Ted Raimi", "Patricia Tallman"]
    },
    {"movie": "A Simple Plan", "year": 1998, "actors":
        ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"]
    },
    {"movie": "Spider-Man 2", "year": 2004, "actors":
        ["Tobey Maguire", "Kirsten Dunst", "Alfred Molina", "James Franco",
         "Rosemary Harris", "J.K. Simmons", "Stan Lee", "Bruce Campbell"]
    },
    {"movie": "Drag Me to Hell", "year": 2009, "actors":
        ["Alison Lohman", "Justin Long", "Lorna Raver", "Dileep Rao", "David Paymer"]
    }
])

select movies that do _not_ contain `"Bruce Campbell"`. See [ak.all](https://awkward-array.org/doc/main/reference/generated/ak.all.html), [ak.any](https://awkward-array.org/doc/main/reference/generated/ak.any.html), [np.invert](https://numpy.org/doc/stable/reference/generated/numpy.invert.html), and [ak.num](https://awkward-array.org/doc/main/reference/generated/ak.num.html).

In [None]:
%%html
<!-- This will only work on the day of the live tutorial. -->
<div style="overflow: hidden;"><iframe src="https://app.sli.do/event/rbr8JR3hY4WEZ9CpWm94Xg/embed/polls/d92f941a-23fc-494d-a18b-8163205dc779" width="100%" height="280" scrolling="no" style="border: none;"></iframe></div>

**Answer:**

In [None]:
is_bruce_campbell = data.actors == "Bruce Campbell"
is_bruce_campbell

In [None]:
all_not_bruce_campbell = ak.all(~is_bruce_campbell, axis=1)
all_not_bruce_campbell

In [None]:
data[all_not_bruce_campbell]
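
The same selection can be phrased two ways: "all actors are not Bruce Campbell" or "not any actor is Bruce Campbell". In Awkward Array the second form would be written with `~ak.any(is_bruce_campbell, axis=1)`. The pure-Python sketch below checks that the two agree, using three of the movies above as plain dicts (the variable names are only for illustration).

```python
movies = [
    {"movie": "Evil Dead", "actors":
        ["Bruce Campbell", "Ellen Sandweiss", "Richard DeManincor", "Betsy Baker"]},
    {"movie": "A Simple Plan", "actors":
        ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"]},
    {"movie": "Drag Me to Hell", "actors":
        ["Alison Lohman", "Justin Long", "Lorna Raver", "Dileep Rao", "David Paymer"]},
]

# One boolean per actor, grouped by movie (like `is_bruce_campbell`).
is_bruce = [[actor == "Bruce Campbell" for actor in m["actors"]] for m in movies]

# Two equivalent reductions over each movie's list of booleans.
all_not_bruce = [all(not flag for flag in flags) for flags in is_bruce]
not_any_bruce = [not any(flags) for flags in is_bruce]

# Use the per-movie booleans to select movies.
selected = [m["movie"] for m, keep in zip(movies, all_not_bruce) if keep]

print(all_not_bruce)  # [False, True, True]
print(selected)       # ['A Simple Plan', 'Drag Me to Hell']
```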

## Combinatorics

Some operations are more meaningful on irregular arrays than on rectilinear ones:

<table style="width: 60%">
<tr style="background: white; padding-top: 0px;"><td width="50%"><img src="../img/cartoon-cartesian.svg" width="100%"></td><td width="50%"><img src="../img/cartoon-combinations.svg" width="100%"></td></tr>
</table>

[ak.cartesian](https://awkward-array.org/doc/main/reference/generated/ak.cartesian.html) takes a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of lists from $n$ different arrays, making an array of lists of $n$-tuples.

[ak.combinations](https://awkward-array.org/doc/main/reference/generated/ak.combinations.html) takes every way of choosing $n$ items [without replacement](http://prob140.org/sp18/textbook/notebooks-md/5_04_Sampling_Without_Replacement.html) from each list of a single array, making an array of lists of $n$-tuples.

<center>
<img src="../img/cartoon-cartesian.svg" width="30%">
</center>

In [None]:
numbers = ak.Array([[1, 2, 3], [], [4]])
letters = ak.Array([["a", "b"], ["c"], ["d", "e"]])

<br>

In [None]:
ak.cartesian([numbers, letters])
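
As a rough mental model, this is a Cartesian product taken separately for each pair of corresponding lists. The sketch below reproduces that with `itertools.product` on plain Python lists; it is meant to show the pairing, not how Awkward Array computes it.

```python
from itertools import product

numbers = [[1, 2, 3], [], [4]]
letters = [["a", "b"], ["c"], ["d", "e"]]

# Pair up the i-th list of numbers with the i-th list of letters,
# then take every (number, letter) combination within that pair of lists.
pairs = [list(product(ns, ls)) for ns, ls in zip(numbers, letters)]

print(pairs)
# [[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b'), (3, 'a'), (3, 'b')], [], [(4, 'd'), (4, 'e')]]
```

The middle entry is empty because one of its inputs is empty: there is nothing to pair `"c"` with.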

<center>
<img src="../img/cartoon-combinations.svg" width="30%">
</center>

In [None]:
values = ak.Array([[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]])

<br>

In [None]:
ak.combinations(values, 2)
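
The per-list behaviour can be sketched with `itertools.combinations` on plain Python lists: each list of length $k$ yields $k(k-1)/2$ unordered pairs, and no item is paired with itself. This is only a model of the result, under the assumption that pairs come out in the same index order.

```python
from itertools import combinations
from math import comb

values = [[1.1, 2.2, 3.3, 4.4], [], [5.5, 6.6]]

# All unordered pairs within each list, never mixing items from different lists.
pairs = [list(combinations(vs, 2)) for vs in values]

print(pairs)
# [[(1.1, 2.2), (1.1, 3.3), (1.1, 4.4), (2.2, 3.3), (2.2, 4.4), (3.3, 4.4)], [], [(5.5, 6.6)]]

# The number of pairs per list is "k choose 2".
print([comb(len(vs), 2) for vs in values])  # [6, 0, 1]
```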

**Example:** Which actors in the same movie have names that are the same length?

In [None]:
actor_pairs = ak.combinations(data.actors, 2, axis=1)
actor_pairs

In [None]:
actor_pairs["0"]

In [None]:
ak.num(actor_pairs["0"], axis=2)

In [None]:
actor_pairs[(ak.num(actor_pairs["0"], axis=2) == ak.num(actor_pairs["1"], axis=2))]
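
For comparison, here is the same question answered with explicit loops over two of the movies, stored as a plain dict (a made-up container for this sketch). One caveat: `len` counts characters, whereas Awkward Array stores strings as UTF-8 bytes, so the two approaches may differ for names with non-ASCII characters.

```python
from itertools import combinations

movies = {
    "Darkman":
        ["Liam Neeson", "Frances McDormand", "Larry Drake", "Bruce Campbell"],
    "A Simple Plan":
        ["Bill Paxton", "Billy Bob Thornton", "Bridget Fonda", "Brent Briscoe"],
}

# For each movie, keep only the pairs of actors whose names have equal length.
same_length = {
    title: [(a, b) for a, b in combinations(actors, 2) if len(a) == len(b)]
    for title, actors in movies.items()
}

print(same_length["Darkman"])        # [('Liam Neeson', 'Larry Drake')]
print(same_length["A Simple Plan"])  # [('Bridget Fonda', 'Brent Briscoe')]
```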

**Go to the [Part 3 project](project.ipynb) now!**