# Apache Arrow

From https://arrow.apache.org/docs/r/:
## What can the `arrow` package do?

-   Read and write **Parquet files** (`read_parquet()`,
    `write_parquet()`), an efficient and widely used columnar format
-   Read and write **Feather files** (`read_feather()`,
    `write_feather()`), a format optimized for speed and
    interoperability
-   Analyze, process, and write **multi-file, larger-than-memory
    datasets** (`open_dataset()`, `write_dataset()`)
-   Read **large CSV and JSON files** with excellent **speed and
    efficiency** (`read_csv_arrow()`, `read_json_arrow()`)
-   Manipulate and analyze Arrow data with **`dplyr` verbs**
-   Read and write files in **Amazon S3** buckets with no additional
    function calls
-   Exercise **fine control over column types** for seamless
    interoperability with databases and data warehouse systems
-   Use **compression codecs** including Snappy, gzip, Brotli,
    Zstandard, LZ4, LZO, and bzip2 for reading and writing data
-   Enable **zero-copy data sharing** between **R and Python**
-   Connect to **Arrow Flight** RPC servers to send and receive large
    datasets over networks
-   Access and manipulate Arrow objects through **low-level bindings**
    to the C++ library
-   Provide a **toolkit for building connectors** to other applications
    and services that use Arrow



In [46]:
import pyarrow as pa
import pandas as pd

Let's start by converting a dataset to Arrow.

In [79]:
import seaborn as sns
planets = sns.load_dataset("planets")
planets.head()

|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009 |


In [80]:
planets["method"].value_counts()

Radial Velocity                  553
Transit                          397
Imaging                           38
Microlensing                      23
Eclipse Timing Variations          9
Pulsar Timing                      5
Transit Timing Variations          4
Orbital Brightness Modulation      3
Astrometry                         2
Pulsation Timing Variations        1
Name: method, dtype: int64

In [81]:
planets["year"].astype(str)

0       2006
1       2008
2       2011
3       2007
4       2009
        ... 
1030    2006
1031    2007
1032    2007
1033    2008
1034    2008
Name: year, Length: 1035, dtype: object

In [82]:
planets["method"] = planets["method"].astype("category")
planets["year"] = pd.to_datetime(planets["year"].astype(str), format="%Y")
planets.head()

|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006-01-01 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008-01-01 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011-01-01 |
| 3 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007-01-01 |
| 4 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009-01-01 |


In [83]:
planets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   method          1035 non-null   category      
 1   number          1035 non-null   int64         
 2   orbital_period  992 non-null    float64       
 3   mass            513 non-null    float64       
 4   distance        808 non-null    float64       
 5   year            1035 non-null   datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(3), int64(1)
memory usage: 41.9 KB


**We can convert pandas -> Arrow**

In [84]:
table = pa.Table.from_pandas(planets)

In [85]:
table

pyarrow.Table
method: dictionary<values=string, indices=int8, ordered=0>
number: int64
orbital_period: double
mass: double
distance: double
year: timestamp[ns]

In [86]:
schema = pa.Schema.from_pandas(planets)

**And back to pandas, without losing the structure (there are limitations, for example with multi-indexes)**

In [87]:
table.to_pandas().head()

|   | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.3 | 7.1 | 77.4 | 2006-01-01 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008-01-01 |
| 2 | Radial Velocity | 1 | 763.0 | 2.6 | 19.84 | 2011-01-01 |
| 3 | Radial Velocity | 1 | 326.03 | 19.4 | 110.62 | 2007-01-01 |
| 4 | Radial Velocity | 1 | 516.22 | 10.5 | 119.47 | 2009-01-01 |


In [88]:
table.to_pandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035 entries, 0 to 1034
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   method          1035 non-null   category      
 1   number          1035 non-null   int64         
 2   orbital_period  992 non-null    float64       
 3   mass            513 non-null    float64       
 4   distance        808 non-null    float64       
 5   year            1035 non-null   datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(3), int64(1)
memory usage: 41.9 KB


## Pandas native support for reading and writing Arrow formats
You can save pandas dataframes either as Feather files or as Parquet files.

### Feather
From the [pandas guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#feather):
> Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy.

In [89]:
planets.to_feather("data/planets.feather", compression="uncompressed")

`compression="uncompressed"` disables compression of the data. Feather files are highly optimized. However, the default installation of arrow in R from CRAN does not enable compression, so writing the file uncompressed keeps it readable from R.

### Parquet

From the [pandas guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#parquet):
> Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.

In [90]:
planets.to_parquet("data/planets.parquet", engine="pyarrow")