## Apache Arrow

> Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
>
> A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more.

```{r echo=FALSE, out.width='40%', fig.align="center"}
knitr::include_graphics("imgs/arrow_cols.png")
```

---

## Language support

.pull-left-narrow[
Core implementations in:

* C
* C++
* C#
* Go
* Java
* JavaScript
* Julia
* Rust
* MATLAB
* Python
* R
* Ruby
]

In [1]:
import pyarrow as pa
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 

The basic building blocks of Arrow are `array` and `table` objects. Arrays are collections of data of a uniform type.

In [2]:
num  = pa.array([1, 2, 3, 2], type=pa.int8())
num

<pyarrow.lib.Int8Array object at 0x28327d000>
[
  1,
  2,
  3,
  2
]

In [3]:
year = pa.array([2019,2020,2021,2022])
year

<pyarrow.lib.Int64Array object at 0x28327d300>
[
  2019,
  2020,
  2021,
  2022
]

In [4]:
name = pa.array(
  ["Alice", "Bob", "Carol", "Dave"],
  type=pa.string()
)
name

<pyarrow.lib.StringArray object at 0x28327d360>
[
  "Alice",
  "Bob",
  "Carol",
  "Dave"
]

## Tables

A table is created by combining multiple arrays together to form the columns while also attaching names for each column.

In [5]:
t = pa.table(
  [num, year, name],
  names = ["num", "year", "name"]
)
t

pyarrow.Table
num: int8
year: int64
name: string
----
num: [[1,2,3,2]]
year: [[2019,2020,2021,2022]]
name: [["Alice","Bob","Carol","Dave"]]

## Array indexing

Elements of an array can be selected using `[]` with an integer index or a slice. The former returns a typed scalar, the latter an array.

In [7]:
name[0]

<pyarrow.StringScalar: 'Alice'>

In [8]:
name[0:3]

<pyarrow.lib.StringArray object at 0x105acaaa0>
[
  "Alice",
  "Bob",
  "Carol"
]

In [9]:
name[:]

<pyarrow.lib.StringArray object at 0x28327d420>
[
  "Alice",
  "Bob",
  "Carol",
  "Dave"
]

In [10]:
name[-1]

<pyarrow.StringScalar: 'Dave'>

In [11]:
name[::-1]

<pyarrow.lib.StringArray object at 0x105acb0a0>
[
  "Dave",
  "Carol",
  "Bob",
  "Alice"
]

In [12]:
name[4]

IndexError: index out of bounds

In [13]:
name[0] = "Patty"

TypeError: 'pyarrow.lib.StringArray' object does not support item assignment

## Data Types

The following types are language agnostic for the purpose of portability; however, some differ slightly from what is available in NumPy and Pandas (or R):

* **Fixed-length primitive types**: numbers, booleans, date and times, fixed size binary, decimals, and other values that fit into a given number of bytes
  * Examples: `bool_()`, `uint64()`, `timestamp()`, `date64()`, and many more

* **Variable-length primitive types**: binary, string

* **Nested types**: list, map, struct, and union

* **Dictionary type**: An encoded categorical type


See the [data types API documentation](https://arrow.apache.org/docs/python/api/datatypes.html#api-types) for the full list of types.

---

## Schemas

A schema is a data structure that contains information on the names and types of the columns of a table (or record batch):

In [14]:
t.schema

num: int8
year: int64
name: string

In [15]:
pa.schema([
  ('num', num.type),
  ('year', year.type),
  ('name', name.type)
])

num: int8
year: int64
name: string

## Schema metadata

Schemas can also store additional metadata (e.g. codebook-like textual descriptions) in the form of a string:string dictionary:

In [16]:
new_schema = t.schema.with_metadata({
  'num': "Favorite number",
  'year': "Year expected to graduate",
  'name': "First name"
})
new_schema

num: int8
year: int64
name: string
-- schema metadata --
num: 'Favorite number'
year: 'Year expected to graduate'
name: 'First name'

In [17]:
t.cast(new_schema).schema

num: int8
year: int64
name: string
-- schema metadata --
num: 'Favorite number'
year: 'Year expected to graduate'
name: 'First name'

## Missing values / None / NaNs

In [18]:
pa.array([1,2,None,3])

<pyarrow.lib.Int64Array object at 0x28327ce80>
[
  1,
  2,
  null,
  3
]

In [19]:
pa.array(["alice","bob",None,"dave"])

<pyarrow.lib.StringArray object at 0x296d1ab60>
[
  "alice",
  "bob",
  null,
  "dave"
]

In [20]:
pa.array([1,2,np.nan,3])

<pyarrow.lib.DoubleArray object at 0x296d1ab00>
[
  1,
  2,
  nan,
  3
]

In [21]:
pa.array([1.,2.,None,3.])

<pyarrow.lib.DoubleArray object at 0x296d1ac80>
[
  1,
  2,
  null,
  3
]

In [22]:
pa.array([1,2,None,3])[2]

<pyarrow.Int64Scalar: None>

In [24]:
pa.array(["alice","bob",None,"dave"])[2]

<pyarrow.StringScalar: None>

In [25]:
pa.array([1,2,np.nan,3])[2]

<pyarrow.DoubleScalar: nan>

In [26]:
pa.array([[1,2], [3,4], None, [5,None]])

<pyarrow.lib.ListArray object at 0x296d1ae00>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ],
  null,
  [
    5,
    null
  ]
]

In [27]:
pa.array([
  {'x': 1, 'y': True, 'z': "Alice"},
  {'x': 2,            'z': "Bob"  },
  {'x': 3, 'y': False             }
])

<pyarrow.lib.StructArray object at 0x105acb580>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    2,
    3
  ]
-- child 1 type: bool
  [
    true,
    null,
    false
  ]
-- child 2 type: string
  [
    "Alice",
    "Bob",
    null
  ]

## Dictionary array

A dictionary array is the equivalent of a factor in R or `pd.Categorical` in Pandas:

In [29]:
levels = pa.array(['sun', 'rain', 'clouds', 'snow'])
values = pa.array([0,0,2,1,3,None])
dict_array = pa.DictionaryArray.from_arrays(values, levels)
dict_array

<pyarrow.lib.DictionaryArray object at 0x2968be6c0>

-- dictionary:
  [
    "sun",
    "rain",
    "clouds",
    "snow"
  ]
-- indices:
  [
    0,
    0,
    2,
    1,
    3,
    null
  ]

In [30]:
dict_array.type

DictionaryType(dictionary<values=string, indices=int64, ordered=0>)

In [31]:
dict_array.dictionary_decode()

<pyarrow.lib.StringArray object at 0x296d1b0a0>
[
  "sun",
  "sun",
  "clouds",
  "rain",
  "snow",
  null
]

In [32]:
levels.dictionary_encode()

<pyarrow.lib.DictionaryArray object at 0x296de6ab0>

-- dictionary:
  [
    "sun",
    "rain",
    "clouds",
    "snow"
  ]
-- indices:
  [
    0,
    1,
    2,
    3
  ]

## Record Batches

Between a table and an array, Arrow has the concept of a Record Batch, which represents a chunk of a larger table. A Record Batch is a named collection of equal-length arrays.


In [33]:
batch = pa.RecordBatch.from_arrays(
  [num, year, name],
  ["num", "year", "name"]
)
batch

pyarrow.RecordBatch
num: int8
year: int64
name: string

In [34]:
batch.num_columns

3

In [35]:
batch.num_rows

4

In [36]:
batch.nbytes

69

In [37]:
batch.schema

num: int8
year: int64
name: string

## Batch indexing

`[]` can be used with a Record Batch to select columns (by name or index) or rows (by slice). Additionally, the `slice()` method can be used to select rows.


In [38]:
batch[0]

<pyarrow.lib.Int8Array object at 0x296d1b4c0>
[
  1,
  2,
  3,
  2
]

In [39]:
batch["name"]

<pyarrow.lib.StringArray object at 0x296d19c60>
[
  "Alice",
  "Bob",
  "Carol",
  "Dave"
]

In [40]:
batch[0:2].to_pandas()

Unnamed: 0,num,year,name
0,1,2019,Alice
1,2,2020,Bob


In [41]:
batch.slice(0,2).to_pandas()

Unnamed: 0,num,year,name
0,1,2019,Alice
1,2,2020,Bob



## Tables vs Record Batches

`Table` objects are not part of the Arrow specification; rather, they are a convenience tool provided to help with the wrangling of multiple Record Batches.


In [42]:
table = pa.Table.from_batches([batch] * 3)
table

pyarrow.Table
num: int8
year: int64
name: string
----
num: [[1,2,3,2],[1,2,3,2],[1,2,3,2]]
year: [[2019,2020,2021,2022],[2019,2020,2021,2022],[2019,2020,2021,2022]]
name: [["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"],["Alice","Bob","Carol","Dave"]]

In [43]:
table.num_columns

3

In [44]:
table.num_rows

12

In [45]:
table.to_pandas()

Unnamed: 0,num,year,name
0,1,2019,Alice
1,2,2020,Bob
2,3,2021,Carol
3,2,2022,Dave
4,1,2019,Alice
5,2,2020,Bob
6,3,2021,Carol
7,2,2022,Dave
8,1,2019,Alice
9,2,2020,Bob


## Chunked Array

The columns of `table` are therefore composed of the columns of each of the batches; these are stored as ChunkedArrays instead of Arrays to reflect this.

In [47]:
table["name"]

<pyarrow.lib.ChunkedArray object at 0x296cd21b0>
[
  [
    "Alice",
    "Bob",
    "Carol",
    "Dave"
  ],
  [
    "Alice",
    "Bob",
    "Carol",
    "Dave"
  ],
  [
    "Alice",
    "Bob",
    "Carol",
    "Dave"
  ]
]

In [48]:
table[1]

<pyarrow.lib.ChunkedArray object at 0x2969d01d0>
[
  [
    2019,
    2020,
    2021,
    2022
  ],
  [
    2019,
    2020,
    2021,
    2022
  ],
  [
    2019,
    2020,
    2021,
    2022
  ]
]

## Arrow + NumPy

Conversion between NumPy arrays and Arrow arrays is straightforward.

In [49]:
np.linspace(0,1,11)

array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])

In [50]:
pa.array( np.linspace(0,1,6) )

<pyarrow.lib.DoubleArray object at 0x296f90160>
[
  0,
  0.2,
  0.4,
  0.6000000000000001,
  0.8,
  1
]

In [51]:
pa.array(range(10)).to_numpy()

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

## NumPy & data copies

In [52]:
pa.array(["hello", "world"]).to_numpy()

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

In [53]:
pa.array(["hello", "world"]).to_numpy(zero_copy_only=False)

array(['hello', 'world'], dtype=object)

In [54]:
pa.array([1,2,None,4]).to_numpy()

ArrowInvalid: Needed to copy 1 chunks with 1 nulls, but zero_copy_only was True

In [55]:
pa.array([1,2,None,4]).to_numpy(zero_copy_only=False)

array([ 1.,  2., nan,  4.])

In [56]:
pa.array([[1,2], [3,4], [5,6]]).to_numpy()

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

In [57]:
pa.array([[1,2], [3,4], [5,6]]).to_numpy(zero_copy_only=False)

array([array([1, 2]), array([3, 4]), array([5, 6])], dtype=object)

## Arrow + Pandas

We've already seen some basic conversion of Arrow table objects to Pandas. The conversions here are a bit more complex than with NumPy, due in large part to how Pandas handles missing data.

| Source (Pandas)       | Destination (Arrow)   |
|-----------------------|-----------------------|
| `bool`                | `BOOL`                |
| `(u)int{8,16,32,64}`  | `(U)INT{8,16,32,64}`  |
| `float32`             | `FLOAT`               |
| `float64`             | `DOUBLE`              |
| `str / unicode`       | `STRING`              |
| `pd.Categorical`      | `DICTIONARY`          |
| `pd.Timestamp`        | `TIMESTAMP(unit=ns)`  |
| `datetime.date`       | `DATE`                |
| `datetime.time`       | `TIME64`              |

| Source (Arrow)                   | Destination (Pandas)                           |
|----------------------------------|------------------------------------------------|
| `BOOL`                           | `bool`                                         |
| `BOOL` with nulls                | `object` (with values `True`, `False`, `None`) |
| `(U)INT{8,16,32,64}`             | `(u)int{8,16,32,64}`                           |
| `(U)INT{8,16,32,64}` with nulls  | `float64`                                      |
| `FLOAT`                          | `float32`                                      |
| `DOUBLE`                         | `float64`                                      |
| `STRING`                         | `str`                                          |
| `DICTIONARY`                     | `pd.Categorical`                               |
| `TIMESTAMP(unit=*)`              | `pd.Timestamp` (`np.datetime64[ns]`)           |
| `DATE`                           | `object` (with `datetime.date` objects)        |
| `TIME64`                         | `object` (with `datetime.time` objects)        |




From [Type differences](https://arrow.apache.org/docs/python/pandas.html#type-differences) documentation

---

## Series & data copies

Due to these discrepancies, converting an Arrow array to a Pandas Series is much more likely to require a type change, in which case the data must be copied. Like `to_numpy()`, the `to_pandas()` method also accepts the `zero_copy_only` argument; however, it defaults to `False`.

In [58]:
pa.array([1,2,3,4]).to_pandas()

0    1
1    2
2    3
3    4
dtype: int64

In [59]:
pa.array(["hello", "world"]).to_pandas()

0    hello
1    world
dtype: object

In [60]:
pa.array(["hello", "world"]).dictionary_encode().to_pandas()

0    hello
1    world
dtype: category
Categories (2, object): ['hello', 'world']

In [61]:
pa.array([1,2,3,4]).to_pandas(zero_copy_only=True)

0    1
1    2
2    3
3    4
dtype: int64

In [62]:
pa.array(["hello", "world"]).to_pandas(zero_copy_only=True)

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

In [63]:
pa.array(["hello", "world"]).dictionary_encode().to_pandas(zero_copy_only=True)

ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

Note that Arrow arrays are converted to Series, while tables and batches are converted to DataFrames.


---

## Zero Copy Series Conversions

> Zero copy conversions from `Array` or `ChunkedArray` to NumPy arrays or pandas Series are possible in certain narrow cases:
>
> * The Arrow data is stored in an integer (signed or unsigned `int8` through `int64`) or floating point type (`float16` through `float64`). This includes many numeric types as well as timestamps.
>
> * The Arrow data has no null values (since these are represented using bitmaps which are not supported by pandas).
> 
> * For `ChunkedArray`, the data consists of a single chunk, i.e. `arr.num_chunks == 1`. Multiple chunks will always require a copy because of pandas’s contiguousness requirement.
>
> In these scenarios, `to_pandas` or `to_numpy` will be zero copy. In all other scenarios, a copy will be required.

---
[Source](https://arrow.apache.org/docs/python/pandas.html#zero-copy-series-conversions)



In [64]:
table.to_pandas()

Unnamed: 0,num,year,name
0,1,2019,Alice
1,2,2020,Bob
2,3,2021,Carol
3,2,2022,Dave
4,1,2019,Alice
5,2,2020,Bob
6,3,2021,Carol
7,2,2022,Dave
8,1,2019,Alice
9,2,2020,Bob


In [65]:
table.to_pandas(zero_copy_only=True)

ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block

In [66]:
table.drop(['name']).to_pandas(zero_copy_only=True)

ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block

In [67]:
pa.table([num,year], names=["num","year"]).to_pandas(zero_copy_only=True)

ArrowInvalid: Cannot do zero copy conversion into multi-column DataFrame block

---

## Pandas -> Arrow

To convert a Pandas DataFrame to an Arrow Table, use the `pa.Table.from_pandas()` method; a schema can likewise be inferred from a DataFrame with `pa.Schema.from_pandas()`.

In [68]:
df = pd.DataFrame({
  'x': np.random.normal(size=5),
  'y': ["A","A","B","C","C"],
  'z': [1,2,3,4,5]
})
pa.Table.from_pandas(df)

pyarrow.Table
x: double
y: string
z: int64
----
x: [[0.4516499898270056,0.7984189283226107,-1.212991453688134,-0.8089083287534667,-0.44380921310449345]]
y: [["A","A","B","C","C"]]
z: [[1,2,3,4,5]]

In [69]:
pa.Schema.from_pandas(df)

x: double
y: string
z: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 563

## An aside on tabular file formats

---

## Comma Separated Values

This and other delimited text file formats are the most common and are generally considered the most portable; however, they have a number of significant drawbacks:

* no explicit schema or other metadata

* column types must be inferred from the data

* numerical values stored as text (efficiency and precision issues)

* limited compression options
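
The type inference problem is easy to reproduce. In this sketch (with made-up data) a column of zero-padded string identifiers is written to CSV and read back; because the file carries no schema, the reader guesses that the column is numeric.

```python
import io
import pandas as pd

original = pd.DataFrame({"id": ["007", "042"], "score": [0.1, 0.2]})

buf = io.StringIO()
original.to_csv(buf, index=False)
buf.seek(0)

# The reader has to guess the column types from the text alone.
roundtrip = pd.read_csv(buf)

print(original["id"].tolist(), roundtrip["id"].tolist())
```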

---

## (Apache) Parquet

> ... provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.


Core features:
> The values in each column are physically stored in contiguous memory locations and this columnar storage provides the following benefits:
>
> * Column-wise compression is efficient and saves storage space
> * Compression techniques specific to a type can be applied as the column values tend to be of the same type
> * Queries that fetch specific column values need not read the entire row data thus improving performance
> * Different encoding techniques can be applied to different columns


---

## Feather

> ... is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R.

Core features:
* Direct columnar serialization of Arrow tables

* Supports all Arrow data types and compression

* Language agnostic

* Metadata makes it possible to read only the necessary columns for an operation

---
class: middle

## Example - File Format Performance


Based on [Apache Arrow: Read DataFrame With Zero Memory](https://towardsdatascience.com/apache-arrow-read-dataframe-with-zero-memory-69634092b1a)

---

## Building a large dataset

In [70]:
np.random.seed(1234)
df = (
    pd.read_csv("../data/penguins.csv")
      .sample(1000000, replace=True)
      .reset_index(drop=True)
)
num_cols = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm",    "body_mass_g"]
df[num_cols] = df[num_cols] + np.random.normal(size=(df.shape[0],len(num_cols)))
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Chinstrap,Dream,48.519017,17.136138,200.642256,3800.527739,male,2008
1,Gentoo,Biscoe,49.819091,15.317831,223.507723,5550.240852,male,2008
2,Chinstrap,Dream,45.174882,17.558060,188.966751,3449.851303,female,2007
3,Adelie,Biscoe,43.043647,18.630037,201.280925,4048.906267,male,2008
4,Gentoo,Biscoe,44.926619,13.633214,209.714220,4400.958077,female,2008
...,...,...,...,...,...,...,...,...
999995,Chinstrap,Dream,48.692352,18.184983,200.579977,3799.702510,male,2009
999996,Adelie,Biscoe,43.046500,20.182105,197.283516,4776.263332,male,2009
999997,Adelie,Torgersen,43.192871,17.278273,196.408173,4251.476411,male,2008
999998,Gentoo,Biscoe,53.208097,16.119915,230.537089,5498.259372,male,2009


## Create output files

In [71]:
import os
os.makedirs("../data/scratch/", exist_ok=True)
df.to_csv("../data/scratch/penguins-large.csv")
df.to_parquet("../data/scratch/penguins-large.parquet")
import pyarrow.feather
pyarrow.feather.write_feather(
    pa.Table.from_pandas(df), 
    "../data/scratch/penguins-large.feather"
)
pyarrow.feather.write_feather(
    pa.Table.from_pandas(df.dropna()), 
    "../data/scratch/penguins-large_nona.feather"
)

## File Sizes

In [73]:
def file_size(f):
    x = os.path.getsize(f)
    print(f, "\t\t", round(x / (1024 * 1024),2), "MB")

file_size( "../data/scratch/penguins-large.csv" )

../data/scratch/penguins-large.csv 		 100.91 MB


In [74]:
file_size( "../data/scratch/penguins-large.parquet" )

../data/scratch/penguins-large.parquet 		 32.44 MB


In [75]:
file_size( "../data/scratch/penguins-large.feather" )

../data/scratch/penguins-large.feather 		 48.93 MB


In [76]:
file_size( "../data/scratch/penguins-large_nona.feather" )

../data/scratch/penguins-large_nona.feather 		 50.93 MB


## Read Performance

In [77]:
%timeit pd.read_csv("../data/scratch/penguins-large.csv")

544 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [78]:
%timeit pd.read_parquet("../data/scratch/penguins-large.parquet")

67.8 ms ± 1.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [83]:
%timeit pyarrow.csv.read_csv("../data/scratch/penguins-large.csv")

31.2 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [84]:
import pyarrow.csv
import pyarrow.parquet

In [86]:
%timeit pyarrow.parquet.read_table("../data/scratch/penguins-large.parquet")

18.9 ms ± 33.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [87]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large.feather")

9.86 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [88]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large_nona.feather")

9.72 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [89]:
%timeit pyarrow.csv.read_csv("../data/scratch/penguins-large.csv").to_pandas()

75.9 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [91]:
%timeit pyarrow.parquet.read_table("../data/scratch/penguins-large.parquet").to_pandas()

63.9 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [92]:
%timeit pyarrow.feather.read_feather("../data/scratch/penguins-large.feather")

55.1 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [93]:
%timeit pyarrow.feather.read_feather("../data/scratch/penguins-large_nona.feather")

53.2 ms ± 416 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [94]:
%timeit pd.read_csv("../data/scratch/penguins-large.csv")["flipper_length_mm"].mean()

524 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [95]:
%timeit pd.read_parquet("../data/scratch/penguins-large.parquet",  columns=["flipper_length_mm"]).mean()

9.83 ms ± 50.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [96]:
%timeit pyarrow.parquet.read_table("../data/scratch/penguins-large.parquet", columns=["flipper_length_mm"]).to_pandas().mean()

7.44 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [97]:
%timeit pyarrow.parquet.read_table("../data/scratch/penguins-large.parquet")["flipper_length_mm"].to_pandas().mean()

22.9 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [98]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large.feather", columns=["flipper_length_mm"]).to_pandas().mean()


6.8 ms ± 211 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [99]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large.feather")["flipper_length_mm"].to_pandas().mean()

20.2 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [100]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large_nona.feather", columns=["flipper_length_mm"]).to_pandas().mean()

4.41 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [101]:
%timeit pyarrow.feather.read_table("../data/scratch/penguins-large_nona.feather")["flipper_length_mm"].to_pandas().mean()

12.8 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

