# Data Encoding, Decoding and Flow

All three formats (Parquet, ORC, and Arrow) deal with structured data.

---

# 🟦 **1. Apache Parquet**

### 📌 *A columnar **storage file format**.*

Parquet is used to **store** data on disk (or cloud storage like S3) in a way that is:

* Columnar (column by column)
* Highly compressed (Snappy, ZSTD, etc.)
* Optimized for analytics (fast filtering & aggregations)
* Splittable (for parallel reading in Spark, Hive, DuckDB, etc.)

### 📌 Where you see Parquet

✔ Data lakes (S3, GCS, HDFS)
✔ Spark / Databricks
✔ Snowflake, BigQuery, Redshift Spectrum
✔ Pandas & Polars

### 📌 Example usage

Save a dataset to a Parquet file:

```python
df.to_parquet("users.parquet")
```

---

# 🟩 **2. ORC (Optimized Row Columnar)**

### 📌 *Another columnar **storage file format**, similar to Parquet.*

Parquet and ORC are competitors.
Both are used for big-data analytics.

ORC is:

* Very compact (often smaller than Parquet)
* Highly optimized for **Hadoop & Hive**
* Fast for certain workloads

### 📌 Where ORC is used

✔ Hive
✔ Presto / Trino
✔ Big Hadoop installations
✔ Some Spark setups (less common now)

### 📌 Example usage (Spark)

```python
df.write.orc("example.orc")
```

---

# 🟥 **3. Apache Arrow**

### 📌 *Not a storage file format: it is an **in-memory data format**.*

This is the most important distinction:

⚠️ **Arrow is NOT a storage file format like Parquet or ORC.**
It is not designed for long-term, compressed storage on disk (it does define an IPC/Feather file format, which is essentially the in-memory layout written to disk).

Arrow is a **way to lay out data in RAM** so that analytics tools can share memory *without copying data*.

Arrow stores data in **columnar, contiguous memory buffers**, which makes scans and vectorized operations fast.

### 📌 Why Arrow exists

Before Arrow:

* Pandas → copies data to NumPy
* Spark → copies data when converting to Pandas
* R, Python, Java, C++ all use different memory layouts
* Serialization overhead everywhere

Arrow solves this by providing *one universal memory layout*.

### 📌 Benefits

✔ Zero-copy data interchange (Python ↔ Spark ↔ R ↔ C++ …)
✔ Faster analytics
✔ Used by many modern tools: Pandas 2.0 (optional PyArrow backend), Polars, DuckDB

### 📌 Where you see Arrow

✔ Pandas 2.0 (optional PyArrow-backed data types)
✔ Spark uses Arrow for Pandas UDFs
✔ Polars (Arrow-based memory layout)
✔ PyArrow library
✔ DuckDB integration

### Example: Arrow Table

```python
import pyarrow as pa

table = pa.table({"a": [1,2,3], "b": ["x","y","z"]})
```

---

# 🧠 So how do Parquet, ORC, and Arrow relate?

Arrow is **complementary** to Parquet and ORC, not a competitor: the file formats hold data at rest, and Arrow holds it in memory.

### ✔ Parquet → columnar file on disk

### ✔ ORC → columnar file on disk

### ✔ Arrow → columnar format in memory

Think of it like this:

| Layer                    | Technology                    |
| ------------------------ | ----------------------------- |
| **In-memory processing** | ➜ Apache Arrow                |
| **On-disk storage**      | ➜ Parquet, ORC                |
| **Higher-level tools**   | Pandas, Spark, Polars, DuckDB |

---

# 🧊 Simple Analogy (super easy to visualize)

### **Arrow = Serving food on plates (ready to eat)**

Everything laid out nicely, ready for quick access.

### **Parquet / ORC = Food containers stored in the fridge**

Compressed for storage, not for immediate serving.

When you load a Parquet file, many tools convert it into Arrow memory format automatically.

---

# 📝 One-sentence definitions

### **Parquet:**

*A compressed, columnar file format optimized for analytical queries.*

### **ORC:**

*A Hadoop-optimized columnar file format similar to Parquet, often with better compression.*

### **Apache Arrow:**

*An in-memory, columnar data format designed for ultra-fast analytics and zero-copy data sharing across systems.*


# Avro, Thrift, Protobuf vs. Parquet, ORC, Arrow

---

# ✅ **1. Avro, Thrift, Protobuf → *Serialization formats***

These are **encode/decode systems** that convert **in-memory objects → binary bytes → back to objects**.

They are used for:

* RPC frameworks (remote calls)
* Message passing
* Kafka messages
* Network communication
* Storing data in binary form (but not optimized for analytics)

### ✔ They define how to **serialize** data

They are *not* optimized for analytical querying.

### ✔ They are schema-based

* Protobuf = .proto
* Thrift = .thrift
* Avro = .avsc (JSON schema)

### ✔ Output is usually **binary**

Not human-readable.
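The object → bytes → object round trip can be sketched with Python's standard `struct` module. This is only a stand-in to show the idea of a fixed schema and a compact binary payload; it is not the wire format of Protobuf, Thrift, or Avro, and the record layout (`user_id`, `age`, `name`) is hypothetical.

```python
import struct

# Hypothetical fixed schema: user_id (uint32), age (uint8),
# then a uint16 length prefix followed by the UTF-8 bytes of the name.
HEADER = struct.Struct("<IBH")


def encode_user(user_id: int, age: int, name: str) -> bytes:
    """Serialize one record into a compact binary payload."""
    name_bytes = name.encode("utf-8")
    return HEADER.pack(user_id, age, len(name_bytes)) + name_bytes


def decode_user(payload: bytes) -> tuple[int, int, str]:
    """Rebuild the record; the reader must know the same schema."""
    user_id, age, name_len = HEADER.unpack_from(payload, 0)
    start = HEADER.size
    name = payload[start:start + name_len].decode("utf-8")
    return user_id, age, name


payload = encode_user(42, 30, "Ada")
print(payload)               # raw bytes, not human-readable
print(decode_user(payload))
```

Both sides have to agree on the schema, which is the role that `.proto`, `.thrift`, and `.avsc` files play in the real systems.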

---

# 🟥 **2. Parquet, ORC, Arrow → *Columnar storage formats***

These are **data storage / representation formats**, not RPC serialization formats.

### ✔ Used for analytics

* Data lakes (S3/GCS/Azure/HDFS)
* Spark
* BigQuery / Snowflake
* DuckDB / Polars

### ✔ Columnar = optimized for:

* Filtering
* Aggregation
* Scanning billions of rows
* Compression
* Vectorized execution
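A plain-Python sketch of why a columnar layout helps with aggregation: the same three made-up records are held once row by row and once column by column. No library is involved; the field names are hypothetical.

```python
# Row-oriented layout: one object per record.
rows = [
    {"user_id": 1, "country": "DE", "salary": 100.0},
    {"user_id": 2, "country": "CN", "salary": 150.0},
    {"user_id": 3, "country": "US", "salary": 200.0},
]

# Column-oriented layout: one list per field.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "CN", "US"],
    "salary": [100.0, 150.0, 200.0],
}

# Row layout: every record is visited even though only "salary" is needed.
total_from_rows = sum(row["salary"] for row in rows)

# Column layout: the aggregation reads one list and ignores the others.
total_from_columns = sum(columns["salary"])

print(total_from_rows, total_from_columns)
```

Values of the same type sitting next to each other is also what makes columnar data compress well.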

### ✔ NOT used for RPC

You do not send Parquet or ORC through network RPC systems.

---

# 🟩 **So the correct categorization:**

## 🟡 **Serialization / Transport / RPC Formats**

Used for messaging, RPC, Kafka, microservices:

| Format       | Purpose                      |
| ------------ | ---------------------------- |
| **Protobuf** | gRPC, internal communication |
| **Thrift**   | services + serialization     |
| **Avro**     | Kafka + schema evolution     |

➡ **Focus:** encode/decode speed, schema evolution.

---

## 🔵 **Analytical Storage Formats (File formats for data lakes)**

| Format      | Purpose                         |
| ----------- | ------------------------------- |
| **Parquet** | analytical columnar storage     |
| **ORC**     | Hive-optimized columnar storage |
| **Arrow**   | *in-memory* columnar format     |

➡ **Focus:** compression, scan speed, analytics.

---

# 🧠 **Key Distinction (Super Simple)**

### **Protobuf / Avro / Thrift**

* For communication
* Small messages
* Network RPC
* Schema-based serialization
* Optimized for encode/decode speed
* Used inside programs

### **Parquet / ORC**

* For storage
* Large datasets
* Analytics on GB → TB data
* Columnar
* Compressed
* Optimized for reading only needed columns

### **Arrow**

* In-memory representation
* Ultra-fast processing
* Zero-copy between languages (Python ↔ C++ ↔ R ↔ Rust ↔ Java)
* Often used *after* you load Parquet

---

# 🧊 Real-life analogy (easy)

### Serialization formats (Protobuf/Avro/Thrift)

➡ "Shipping boxes"
Designed to move data efficiently from point A → B.

### Parquet/ORC

➡ "Warehouse storage shelves"
Designed to store huge amounts of data efficiently and retrieve only what you need.

### Arrow

➡ "Items placed on working tables"
Optimized for immediate processing, no storage.

---



## Apache Parquet, ORC and Arrow

We can easily read (decode) and write (encode) data from and to Parquet, ORC and Arrow files interchangeably. The `pyarrow` library allows us to read a Parquet or ORC file into a `pyarrow.Table` object, which is a columnar data structure that can be converted to a Pandas DataFrame. We can also write a `pyarrow.Table` to a Parquet or ORC file.

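Parquet has the following physical types:

- boolean: 1 bit boolean
- int32: 32 bit signed ints
- int64: 64 bit signed ints
- int96: 96 bit signed ints (deprecated; kept mainly for legacy timestamps)
- float: IEEE 32-bit floating point values
- double: IEEE 64-bit floating point values
- byte_array: arbitrarily long byte arrays
- fixed_len_byte_array: fixed length byte arrays

Logical types are annotations layered on top of these physical types, for example:

- string: UTF-8 encoded strings (stored as byte_array)
- enum: enumeration of strings (stored as byte_array)
- temporal: date, time and timestamp types (stored as int32 or int64)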

ORC has the following types:

- boolean: 1 bit boolean
- tinyint: 8 bit signed ints
- smallint: 16 bit signed ints
- int: 32 bit signed ints
- bigint: 64 bit signed ints
- float: IEEE 32-bit floating point values
- double: IEEE 64-bit floating point values
- string: UTF-8 encoded strings
- char: fixed-length strings
- varchar: variable-length strings with a maximum length
- binary: byte arrays
- timestamp: a date and time
- date: a calendar date
- decimal: arbitrary precision decimals
- list: an ordered collection of objects
- map: a collection of key-value pairs
- struct: an ordered collection of named fields
- union: a value that can be one of several types

![overview-diagram](../assets/diagram-2.png)

### Reading (Decoding) and Writing (Encoding) a Parquet File

Let's look at how to decode and encode a Parquet file with mock customers data.


# **How it works:**

1. **Parquet / ORC**

* Are **on-disk, columnar storage formats**.
* Highly compressed, optimized for analytics queries and storage efficiency.
* Ideal for "data at rest" in a data lake or warehouse.

2. **When you want to analyze the data**

* Libraries like **PyArrow, Pandas 2.0, Spark, DuckDB, Polars** **read the Parquet/ORC file into memory**.
* PyArrow, and Pandas when it uses the PyArrow engine, decode it into **Apache Arrow's in-memory columnar format**; other engines may use their own internal layout.

3. **In-memory (Arrow Table)**

* Arrow holds data in **contiguous, columnar buffers**.
* Extremely fast for analytics and vectorized operations.
* Multiple languages/tools can share Arrow memory without copying.

4. **Display / work in Pandas**

* Arrow Table → Pandas DataFrame (or Polars DataFrame).
* Now you can use Python for analysis, visualization, ML, etc.

---

# **Data Flow Visualization**

```
Storage on disk:    Parquet / ORC files
            ↓   (read)
In-memory format:   Apache Arrow Table
            ↓   (convert to Python objects)
User interface:     Pandas DataFrame / Polars DataFrame
```

---

# **Key points**

* **Parquet/ORC** → storage
* **Arrow** → in-memory processing
* **Pandas** → user-facing analytics library
* Pandas does **not operate on Parquet/ORC files in place**; with the PyArrow engine they are first decoded into Arrow buffers and then converted to a DataFrame.
* This design allows **fast analytics on very large datasets** without decompressing everything unnecessarily.


In [None]:
import pyarrow as pa
import pyarrow.parquet as pq

`pyarrow` is the Python library for Apache Arrow. It provides:

* In-memory columnar data structures (`pa.Table`, `pa.Array`, `pa.RecordBatch`)
* An efficient memory representation for analytics and zero-copy sharing
* Conversion between Pandas DataFrames and Arrow Tables
* Various data types: string, int, float, timestamp, nested types, etc.

`pyarrow.parquet` is the module for working with Parquet files. It provides:

* Reading Parquet into an Arrow Table (`pq.read_table`)
* Reading only the metadata (`pq.read_metadata`)
* Writing an Arrow Table to a Parquet file (`pq.write_table`)
* Reading only specific columns or row groups for efficiency

How they work together

| Module                 | Role                                                           |
| ---------------------- | -------------------------------------------------------------- |
| `pyarrow` (pa)         | Provides in-memory columnar structures (Arrow Tables)          |
| `pyarrow.parquet` (pq) | Reads/writes **disk-based Parquet files** to/from Arrow Tables |


Typical Workflow:
1. Parquet file on disk → pq.read_table() → Arrow Table (pa.Table)
2. Arrow Table → to_pandas() → Pandas DataFrame for analysis
3. After analysis → pq.write_table() → Parquet file to store results

In [None]:
table = pq.read_table('../data/userdata1.parquet')

The cell above reads a Parquet file into memory as an Arrow Table.

In [None]:
table

In [None]:
table.schema

In [None]:
metadata = pq.read_metadata('../data/userdata1.parquet')

metadata

`read_metadata()` reads only the metadata of a Parquet file, without loading the full dataset into memory.

Why use `read_metadata()`:

* Fast: avoids loading the full data
* Inspect schema and statistics before reading
* Plan queries and filtering efficiently

In [None]:
metadata.schema

In [None]:
metadata.row_group(0).column(10)

The cell above does the following:

`metadata.row_group(0)`

* Selects the first row group (0-indexed)
* Returns a `RowGroupMetaData` object

`.column(10)`

* Selects the 11th column (0-indexed) in that row group
* Returns a `ColumnChunkMetaData` object
* `ColumnChunkMetaData` describes that column chunk within the row group (for example its type, compression and statistics)

Select the first 3 rows of the table:

In [None]:
table.take([0,1,2])

Convert a Table to a DataFrame:

In [None]:
df = table.to_pandas()

In [None]:
df

You can convert the DataFrame back to a Table (note we're using the method from `pa` which is pyarrow):

In [None]:
new_table = pa.Table.from_pandas(df)

new_table

You can write the table back to a Parquet file:

In [None]:
pq.write_table(new_table, "../data/userdata2.parquet")

> 1. How many males and females are there?
>
> 2. What is the average salary for customers from China?
>
> 3. Create a new column `full_name` which combines `first_name` and `last_name` with a space in between in the dataframe. Then convert it back to a new Table and write it to a Parquet file.

### Reading (Decoding) and Writing (Encoding) an ORC File

Let's look at how to decode and encode an ORC file with mock data.

In [None]:
import pyarrow as pa
from pyarrow import orc

In [None]:
table2 = orc.read_table('../data/userdata1.1.orc')

In [None]:
table2

In [None]:
df2 = table2.to_pandas()

df2

You can write the table back to an ORC file:

In [None]:
orc.write_table(table2, "../data/file2.orc")