# Data Encoding, Decoding and Flow


# ✅ **What is Apache Avro?**

**Apache Avro** is a data serialization system created under the Apache Hadoop project.
It's designed for:

* **Big Data systems (Hadoop, Kafka, streaming pipelines)**
* **Dynamic schemas**
* **Schema evolution**
* **Fast, compact binary data**

Avro is *similar* in purpose to Protocol Buffers and Thrift,
but the **design philosophy is very different**.

---

# 1️⃣ **Key Features of Avro**

## **✔ Schema is JSON**

Avro schemas are written in plain **JSON**, such as:

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "user_name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "long"]},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
```

Protobuf and Thrift, by contrast, define schemas in their own `.proto` / `.thrift` languages.

---

## **✔ Self-Describing Data (Optional)**

Avro allows embedding the **schema inside the data file** (e.g., in Avro container files).
This is one reason it is so widely used with Hadoop and Kafka:

* Producer writes the message
* Consumer doesn't need the exact same class definitions
* Schema is stored externally or inside the file

---

## **✔ Dynamic Typing**

You **do not need generated source code** if you don't want it.
You can read/write Avro purely using:

* Dictionaries in Python
* HashMaps in Java
* GenericRecord types in JVM languages

Protobuf and Thrift, by contrast, are normally used through generated classes.
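As a rough illustration of working without generated classes, the sketch below treats a record as a plain Python dict and checks it against the `Person` schema shown earlier. The `conforms` helper is hypothetical, written only for this example, and covers just the few types that schema uses; real Avro libraries ship their own validation.

```python
import json

# The Person schema from the section above, loaded as ordinary JSON.
PERSON_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "user_name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "long"]},
    {"name": "interests", "type": {"type": "array", "items": "string"}}
  ]
}
""")


def _is_integer(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def conforms(value, schema):
    """Hypothetical helper: does a plain Python value fit an Avro schema?"""
    if isinstance(schema, list):  # union: any one branch may match
        return any(conforms(value, branch) for branch in schema)
    if isinstance(schema, dict):
        kind = schema["type"]
        if kind == "record":
            return isinstance(value, dict) and all(
                field["name"] in value
                and conforms(value[field["name"]], field["type"])
                for field in schema["fields"]
            )
        if kind == "array":
            return isinstance(value, list) and all(
                conforms(item, schema["items"]) for item in value
            )
        return conforms(value, kind)  # e.g. {"type": "string"}
    if schema == "null":
        return value is None
    if schema == "string":
        return isinstance(value, str)
    if schema == "long":
        return _is_integer(value)
    raise ValueError(f"type not covered by this sketch: {schema!r}")


# A record is just a dict; no Person class was generated anywhere.
person = {"user_name": "Martin", "favorite_number": None, "interests": ["hacking"]}
print(conforms(person, PERSON_SCHEMA))
```

The point is only that the schema is data: it can be loaded at runtime and applied to ordinary dicts, which is what makes the dynamic style possible.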




---

## **✔ Great for Schema Evolution**

Avro has one of the strongest schema evolution systems:

* Add fields
* Remove fields
* Change defaults
* Make fields nullable

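This works because Avro matches fields between the writer's and reader's schemas **by field name**, not by numeric tag. The names live in the schema; the encoded data itself contains only values.

Protobuf and Thrift write numeric tags into the data → no writer schema is needed to decode, but tag numbers must never be changed or reused.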

---

## **✔ Designed for Big Data**

Avro is widely used as a serialization format in:

* **Kafka**
* **Hadoop / HDFS**
* **Spark**
* **Hive**
* **Flink**

Its container format supports:

* Compression
* Splittable files
* Embedded schema
* Ready for distributed processing

This is why Avro is so common in data engineering workloads.
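A small sketch of the "embedded schema" point: per the Avro specification, an object container file begins with the four magic bytes `Obj` followed by `0x01`, and the header that follows carries the schema as metadata. The helper name `looks_like_avro_container` is made up for this example; it inspects only the magic bytes and does not parse the header.

```python
AVRO_MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def looks_like_avro_container(data: bytes) -> bool:
    """Hypothetical helper: check only the leading magic bytes."""
    return data[:4] == AVRO_MAGIC


print(looks_like_avro_container(b"Obj\x01...rest of header..."))
print(looks_like_avro_container(b'{"type": "record"}'))
```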

---

# 2️⃣ Avro vs Protobuf vs Thrift

Here's the clearest way to compare:

### **Serialization Purpose**

| Feature      | Avro                            | Protobuf                          | Thrift                       |
| ------------ | ------------------------------- | --------------------------------- | ---------------------------- |
| Main Purpose | Data serialization (big data)   | Serialization + modern RPC (gRPC) | Serialization + built-in RPC |
| Best For     | Kafka / Hadoop / Data pipelines | Microservices                     | Legacy / multi-protocol RPC  |

---

### **Schema Format**

| Feature              | Avro                  | Protobuf   | Thrift      |
| -------------------- | --------------------- | ---------- | ----------- |
| Schema Language      | JSON (or Avro IDL)    | .proto DSL | .thrift DSL |
| Self-describing data | Yes (container files) | No         | No          |
| Generated classes    | Optional              | Expected   | Expected    |

---

### **Serialization Style**

| Feature              | Avro                                     | Protobuf       | Thrift         |
| -------------------- | ---------------------------------------- | -------------- | -------------- |
| Encoding             | Compact binary (schema kept as JSON)     | Compact binary | Compact binary |
| Field identification | Names, resolved through the schemas      | Numeric tags   | Numeric tags   |
| Speed                | Fast                                     | Often fastest  | Fast           |

Protobuf is often reported as the fastest of the three, though results depend on the implementation and workload; Avro is usually close and offers richer schema flexibility.

---

### **RPC Support**

| Feature           | Avro                        | Protobuf         | Thrift              |
| ----------------- | --------------------------- | ---------------- | ------------------- |
| Built-in RPC      | Yes (Avro RPC, rarely used) | No (use gRPC)    | Yes (commonly used) |
| Popular RPC usage | Very rare                   | Extremely common | Moderate            |

Most people use Avro **without** its RPC system.

---

# 3️⃣ What Does a Typical Avro Workflow Look Like?

## Client/Server flow in Avro is more like:

1. You define a **JSON schema**
2. You serialize data (Python dict → Avro binary)
3. You send that data via:

   * Kafka
   * Hadoop
   * REST
   * Anything you want
4. The consumer reads the binary using:

   * The writer schema (embedded or referenced)
   * Its own reader schema

No stubs, no service generators unless you use Avro RPC.

This differs from:

* **gRPC (Protobuf)**: auto-generated client & server
* **Thrift RPC**: auto-generated client & server
* **Avro**: mostly just raw data serialization

---

# 4️⃣ **In Your Case (with gRPC / Protobuf / Thrift)**

Your earlier code (Protobuf → gRPC server → client stub) looks like:

```
client → stub → gRPC → protobuf → server → return protobuf
```

If you used Thrift RPC:

```
client → thrift client → thrift RPC → server → return thrift object
```

But with Avro:

There is NO default RPC flow like that.

Typical Avro flow:

```
producer → serialize to Avro → send to Kafka → consumer
```

or

```
application → write Avro file → Hadoop processes file
```

or

```
HTTP API → send Avro bytes → backend parses Avro
```

So:

### ✔ Conceptually similar to Protobuf and Thrift

(because all do schema-based serialization)

### ❌ But operationally very different

(because Avro is built for distributed data storage, not RPC calls)

---

# 5️⃣ Summary (One-Shot Explanation)

**Apache Avro** is a serialization system designed for big data pipelines.
It stores schemas in JSON, can embed schemas inside the data, supports dynamic typing, and is a common choice for Kafka/Hadoop. Unlike Protobuf and Thrift, Avro is not primarily used for RPC, and you don't need generated code: it can serialize plain maps/dicts. It excels at schema evolution and distributed processing.

---

## Apache Avro

Avro has the following types:

- null: no value
- boolean: a true/false value
- int: 32-bit signed integer
- long: 64-bit signed integer
- float: single precision (32-bit) IEEE 754 floating-point number
- double: double precision (64-bit) IEEE 754 floating-point number
- bytes: sequence of 8-bit unsigned bytes
- string: Unicode character sequence
- record: ordered collection of named fields
- enum: enumeration of named symbols
- array: ordered collection of values
- map: collection of key-value pairs with string keys
- union: a value that matches exactly one of several listed schemas, e.g. `["null", "long"]`
- fixed: sequence of a fixed number of bytes

It has two schema languages: one (`Avro IDL`) intended for human editing, and one (based on JSON) that is more easily machine-readable.

### Encoding

We can describe the previous example record in Avro IDL using the following schema (IDL files conventionally use the `.avdl` extension; `.avsc` is used for the JSON form):

```avro
record Person {
  string userName;
  union { null, long } favoriteNumber = null;
  array<string> interests;
}
```

The equivalent JSON representation of that schema is as follows:

```json
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "userName", "type": "string" },
    { "name": "favoriteNumber", "type": ["null", "long"], "default": null },
    { "name": "interests", "type": { "type": "array", "items": "string" } }
  ]
}
```

The data encoded with this schema looks like this:
![avro](../assets/avro.png)

First and foremost, it's important to note that the schema lacks tag numbers. When we encode our sample record using this schema, the resulting Avro binary encoding is impressively compact, spanning just _32 bytes_, the most space-efficient among all the encodings we've observed.

Examining the byte sequence, one can readily discern the _absence of field identifiers or datatype markers_. The encoding solely comprises concatenated values. For instance, a string is represented by a length prefix followed by UTF-8 bytes, but there are no explicit indicators within the encoded data to specify that it is, indeed, a string. In fact, it could be interpreted as an integer or any other data type altogether. Similarly, an integer is encoded using a variable-length encoding.
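To make the byte layout concrete, here is a hand-rolled sketch of the Avro binary rules for this one schema: longs are zigzag-mapped and written 7 bits per byte, strings are a length followed by UTF-8 bytes, a union is a branch index followed by the value, and an array is an item count, the items, and a closing zero. The helper names and the sample values are assumptions made for illustration (the section's own sample record appears only in the figure); with these values the output comes to 32 bytes, the size quoted above.

```python
def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-map the sign, then emit 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag: 0, -1, 1, -2 become 0, 1, 2, 3
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Length (as a long) followed by the UTF-8 bytes; no type marker."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_person(person: dict) -> bytes:
    """Concatenate the field values in schema order; no names, no tags."""
    out = bytearray()
    out += encode_string(person["userName"])
    # union {null, long}: branch index first, then the value of that branch
    favorite = person["favoriteNumber"]
    if favorite is None:
        out += encode_long(0)
    else:
        out += encode_long(1) + encode_long(favorite)
    # array: item count, the items, then a zero count that ends the array
    interests = person["interests"]
    if interests:
        out += encode_long(len(interests))
        for item in interests:
            out += encode_string(item)
    out += encode_long(0)
    return bytes(out)


encoded = encode_person(
    {"userName": "Martin", "favoriteNumber": 1337,
     "interests": ["daydreaming", "hacking"]}
)
print(len(encoded), encoded.hex())
```

Nothing in `encoded` says which bytes are a string and which are a number; that knowledge lives entirely in the schema.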

To correctly parse the binary data, you must traverse the fields in the order they appear in the schema and _refer to the schema_ itself to ascertain the datatype of each field. Consequently, the binary data can only be decoded correctly if the reader knows the exact schema that the writer used. Decoding the bytes directly against any other schema would produce incorrectly decoded data.
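The schema-driven traversal can be sketched as a small recursive reader. This is a simplified illustration, not a library API: `read_long` and `read_value` are hypothetical names, only the types used by `Person` are handled, and the input bytes are an assumed sample encoding of a record with the values "Martin", 1337 and two interests.

```python
import io

PERSON_SCHEMA = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "userName", "type": "string"},
        {"name": "favoriteNumber", "type": ["null", "long"]},
        {"name": "interests", "type": {"type": "array", "items": "string"}},
    ],
}


def read_long(buf) -> int:
    """Read a variable-length zigzag-encoded long."""
    shift, z = 0, 0
    while True:
        byte = buf.read(1)[0]
        z |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo the zigzag mapping


def read_value(buf, schema):
    """Decode one value; the schema alone decides how bytes are read."""
    if isinstance(schema, list):  # union: branch index, then the value
        return read_value(buf, schema[read_long(buf)])
    if isinstance(schema, dict):
        if schema["type"] == "record":  # fields strictly in schema order
            return {f["name"]: read_value(buf, f["type"])
                    for f in schema["fields"]}
        if schema["type"] == "array":
            items, count = [], read_long(buf)
            while count != 0:
                if count < 0:  # negative count: a block byte size follows
                    count = -count
                    read_long(buf)
                for _ in range(count):
                    items.append(read_value(buf, schema["items"]))
                count = read_long(buf)
            return items
        return read_value(buf, schema["type"])
    if schema == "null":
        return None
    if schema == "long":
        return read_long(buf)
    if schema == "string":
        return buf.read(read_long(buf)).decode("utf-8")
    raise ValueError(f"type not covered by this sketch: {schema!r}")


data = b"\x0cMartin\x02\xf2\x14\x04\x16daydreaming\x0ehacking\x00"
decoded = read_value(io.BytesIO(data), PERSON_SCHEMA)
print(decoded)

# Same bytes, wrong schema: the string's length prefix is read as a number.
misread = read_value(io.BytesIO(data), "long")
print(misread)
```

The second call does not fail; it quietly returns a number, which is why the reader has to know the writer's schema.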

With Avro, data encoding and decoding are based on two schemas: the `writer's schema` used during data encoding and the `reader's schema` employed during data decoding. These schemas do not necessarily have to be identical but should be compatible. When decoding data, the Avro library compares the writer's and reader's schemas, resolving any discrepancies between them.

The Avro specification ensures that fields in different orders between the writer's and reader's schemas pose no issues during resolution since schema matching occurs based on field names. If the reader's schema lacks a field present in the writer's schema, it is simply ignored. Conversely, if the reader's schema expects a field that the writer's schema does not contain, the missing field is filled in with a default value declared in the reader's schema. This allows for flexible schema evolution while maintaining data compatibility.
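The resolution rules in this paragraph can be sketched on already-decoded records. This is a deliberate simplification: real Avro libraries resolve the two schemas while decoding and also handle aliases and type promotion, which are left out here. `resolve_record` and the `email` field are hypothetical, introduced only for the example.

```python
def resolve_record(writer_record: dict, reader_schema: dict) -> dict:
    """Hypothetical helper: view a writer's record through a reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_record:
            # Matched by name, so field order in either schema is irrelevant.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # The writer never wrote this field: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Writer fields the reader does not declare are simply dropped.
    return resolved


written = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Reader schema: different order, no favoriteNumber, one new field.
reader_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "interests", "type": {"type": "array", "items": "string"}},
        {"name": "userName", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

resolved = resolve_record(written, reader_schema)
print(resolved)
```

A new field without a default would raise here, which mirrors why adding a field is only a compatible change when the field has a default value.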

### Reading (Decoding) a File

Instead of demonstrating RPC, let's look at how to decode data from a file in a real-world dataset. We have genomic variation data for 1000 samples from the [OpenCGA](http://docs.opencb.org/display/opencga/Welcome+to+OpenCGA) project.

```python
import fastavro
import copy
import json
from pprint import pprint
```

* `copy`: Python's built-in module for copying objects
* `pprint`: pretty-print module; makes nested dicts and JSON objects easier to read

```python
with open('../data/1k.variants.avro', 'rb') as f:
    reader = fastavro.reader(f)
    # Decode every record in the file into a Python dict
    genomic_var_1k = [sample for sample in reader]
    # Copy header information so it stays available after the file is closed
    metadata = copy.deepcopy(reader.metadata)
    writer_schema = copy.deepcopy(reader.writer_schema)
    schema_from_file = json.loads(metadata['avro.schema'])
```

✅ 1. `with open('../data/1k.variants.avro', 'rb') as f:`

* Opens the Avro file in binary mode
* Avro container files are binary, not plain JSON

✅ 2. `reader = fastavro.reader(f)`

* Creates the Avro file reader
* Does the following automatically:
    * reads the Avro file header
    * extracts the embedded writer schema
    * extracts the metadata
    * prepares to decode the binary blocks

✅ 3. `genomic_var_1k = [sample for sample in reader]`

* Iterates through every record in the Avro file
* Each item returned is a Python dictionary that matches the schema (no class generation needed)

✅ 4. `metadata = copy.deepcopy(reader.metadata)`

* Avro files store metadata in the header

✅ 5. `writer_schema = copy.deepcopy(reader.writer_schema)`

* Retrieves the schema that was used when the file was written

✅ 6. `schema_from_file = json.loads(metadata['avro.schema'])`

* Takes the raw JSON string stored in the metadata and converts it into a Python dict
* This should describe the same schema as `writer_schema`, although the parsed form may differ in small details

🔥 Complete Summary Table (Easy View)

| Variable           | Meaning                              | Why Needed                          |
| ------------------ | ------------------------------------ | ----------------------------------- |
| `genomic_var_1k`   | All decoded records in the Avro file | Data itself                         |
| `metadata`         | Header metadata                      | Contains raw schema & codec         |
| `writer_schema`    | Parsed schema used to write file     | Needed for re-writing or validating |
| `schema_from_file` | Schema reconstructed from metadata   | Usually for debugging or display    |


```python
len(genomic_var_1k)
```

```python
pprint(writer_schema)
```

```python
pprint(schema_from_file)
```

```python
pprint(genomic_var_1k[0])
```

```python
for f in schema_from_file["fields"]:
    print(f["name"])
```