# File Formats

## DELIMITED

* Comma
* Tab
* Colon
* etc.
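
The delimiters listed above differ only in the separator character. As a minimal sketch using Python's standard `csv` module (the sample rows and the `parse_delimited` helper are made up for illustration), the same two records can be parsed from comma-, tab-, and colon-delimited text:

```python
import csv
import io

# Hypothetical sample data: the same two rows in three delimited flavors.
samples = {
    ",": "trip_id,duration\n913460,765\n913459,1036\n",
    "\t": "trip_id\tduration\n913460\t765\n913459\t1036\n",
    ":": "trip_id:duration\n913460:765\n913459:1036\n",
}

def parse_delimited(text: str, delimiter: str) -> list:
    """Parse delimited text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

# Parse every flavor with its own delimiter.
parsed = {delim: parse_delimited(text, delim) for delim, text in samples.items()}

for delim, rows in parsed.items():
    print(repr(delim), rows)
```

All three inputs yield identical rows; note that every value comes back as a string. In pandas, the equivalent knob is the `sep` parameter of `pd.read_csv`.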

In [None]:
import pandas as pd

# Specify the path to your CSV file
csv_file_path = "trimmed/trips_trimmed.csv"

# Load the CSV data into a DataFrame
df = pd.read_csv(csv_file_path)

# Display the DataFrame
df

## XML 

**eXtensible Markup Language**
* Store and transport data but doesn't present it
* Public standard developed by W3C
* Human and Machine-readable

```xml
<message>
    <text>Hello, world!</text>
</message>
```

```xml
<?xml version = "1.0"?>
<contact-info>
   <name>Ahmed Sami</name>
   <company>henak</company>
   <phone>(011) 123-4567</phone>
</contact-info>
```

![](./images/syntaxrules.png)

### XML Declaration

`<?xml version = "1.0" encoding = "UTF-8"?>`

### Syntax Rules for XML Declaration

* The XML declaration is case-sensitive and must begin with `<?xml`, where `xml` is written in lower case.
* If the document contains an XML declaration, it must be the first statement of the XML document.
* The encoding reported by the HTTP protocol can override the value of `encoding` given in the XML declaration.
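
Two of these rules can be checked with the standard library's `xml.etree.ElementTree`. This is only a sketch: the sample documents are invented, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET

# The declared encoding tells the parser how to decode the bytes that follow.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><name>café</name>'.encode("iso-8859-1")
name = ET.fromstring(latin1_doc).text
print(name)

# A declaration that is not the very first thing in the document is rejected.
# Here a single leading space precedes it.
misplaced = b' <?xml version="1.0"?><name>x</name>'
try:
    ET.fromstring(misplaced)
    declaration_error = None
except ET.ParseError as exc:
    declaration_error = str(exc)
print(declaration_error)
```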


**Tags and Elements**  

An XML file is structured by several XML-elements, also called XML-nodes or XML-tags. The names of XML-elements are enclosed in angle brackets < > as shown below:  
`<element>`  
Syntax Rules for Tags and Elements  
Element Syntax: each XML-element must be closed, either with a matching pair of start and end tags as shown below:  
`<element>....</element>`  
or, for an empty element, with a single self-closing tag:  
`<element/>`

**Nesting of Elements**  
An XML-element can contain multiple XML-elements as its children, but the child elements must not overlap; i.e., an end tag must have the same name as the most recent unmatched start tag.

**Root Element**  
An XML document can have only one root element. For example, `<x>...</x><y>...</y>` is not a correct XML document, because both the x and y elements occur at the top level without a root element.

**Case Sensitivity**  
The names of XML-elements are case-sensitive. That means the start and end tags of an element must be written in exactly the same case.
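
The nesting, root-element, and case-sensitivity rules can be tried out with a small well-formedness check. This is a sketch built on the standard library parser; the snippets and the `is_well_formed` helper are hypothetical examples, one per rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippets, one per rule described above.
candidates = {
    "well-formed": "<root><x>1</x><y/></root>",
    "overlapping children": "<root><x><y></x></y></root>",
    "two top-level elements": "<x>1</x><y>2</y>",
    "case mismatch": "<Name>Ahmed</name>",
}

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

results = {label: is_well_formed(text) for label, text in candidates.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

Only the first snippet parses; each of the other three violates exactly one rule.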

**XML Attributes**  

An attribute specifies a single property for the element, using a name/value pair. An XML-element can have one or more attributes. For example:  
`<a href = "http://www.sayedkabaka.com/">United States of Talbiya!</a>`  
Here `href` is the attribute name and `http://www.sayedkabaka.com/` is the attribute value.
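
As a short sketch of how a parser exposes these pieces, the standard library's `ElementTree` returns attributes as a dictionary on the element (the variable name `link` is just for illustration):

```python
import xml.etree.ElementTree as ET

# Parse the element from the example above.
link = ET.fromstring('<a href = "http://www.sayedkabaka.com/">United States of Talbiya!</a>')

print(link.tag)          # element name
print(link.attrib)       # all attributes as a dict of name -> value
print(link.get("href"))  # a single attribute value by name
print(link.text)         # text content between the start and end tags
```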

**XML declaration**  
```xml
<?xml
   version = "version_number"
   encoding = "encoding_declaration"
   standalone = "standalone_status"
?>
```

**XML Comments**   
`<!--Students grades are uploaded by months-->`

### Parse XML using Python

In [None]:
import pandas as pd

xml_file_path = "trimmed/trips_trimmed.xml"

# Function to read XML and parse into DataFrame
def read_xml_to_df(xml_file: str) -> pd.DataFrame:
    # Parse the XML file into a DataFrame
    df = pd.read_xml(xml_file)
    return df

# Read the XML file into a DataFrame
df = read_xml_to_df(xml_file_path)

# Display the DataFrame
df

## JSON

* JSON stands for JavaScript Object Notation.
* The format was specified by Douglas Crockford.
* It was designed for human-readable data interchange.
* It is derived from the object literal syntax of the JavaScript scripting language.
* The filename extension is .json.
* JSON Internet Media type is application/json.
* The Uniform Type Identifier is public.json.


```json
{
     "id": "01",
     "language": "Java",
     "edition": "third",
     "author": "Herbert Schildt"
}
```

**JSON VS XML**   
```json
{
   "company": "Volkswagen",
   "name": "Vento",
   "price": 800000
}
```

```xml
<car>
   <company>Volkswagen</company>
   <name>Vento</name>
   <price>800000</price>
</car>
```
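
One practical difference shows up when both forms are parsed. In the sketch below (standard library only, with the car record inlined as strings), JSON keeps `price` as a number, while plain XML element content arrives as text and must be converted by the reader:

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"company": "Volkswagen", "name": "Vento", "price": 800000}'
xml_text = "<car><company>Volkswagen</company><name>Vento</name><price>800000</price></car>"

# JSON maps directly onto dict / str / int.
car_from_json = json.loads(json_text)

# XML children are turned into a dict by hand; every value is a string.
car_from_xml = {child.tag: child.text for child in ET.fromstring(xml_text)}

print(car_from_json)
print(car_from_xml)
```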

**Data types**  
1. **String**: Text enclosed in double quotes.
2. **Number**: Any numerical value, including integers and floating-point numbers.
3. **Object**: A collection of key-value pairs, with keys being strings.
4. **Array**: An ordered list of values.
5. **Boolean**: The literals `true` or `false` (lower case, unquoted).
6. **Null**: A special type that represents a null value.

```json
{
  "string": "Hello, World!",
  "number_integer": 123,
  "number_float": 456.78,
  "object": {
    "nested_string": "Nested Hello",
    "nested_number": 42
  },
  "array": [1, "two", 3.0, {"key": "value"}, true],
  "boolean_true": true,
  "boolean_false": false,
  "null_value": null
}
```
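
To see how these six types map onto Python objects, the following sketch loads a shortened version of the document above with the standard `json` module and records the Python type of each top-level value:

```python
import json

document = """
{
  "string": "Hello, World!",
  "number_integer": 123,
  "number_float": 456.78,
  "object": {"nested_number": 42},
  "array": [1, "two", 3.0],
  "boolean_true": true,
  "null_value": null
}
"""

parsed = json.loads(document)

# Map each key to the name of the Python type json.loads produced.
type_names = {key: type(value).__name__ for key, value in parsed.items()}
for key, name in type_names.items():
    print(f"{key}: {name}")
```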

**JSON Array**  
```json
[
    {
        "Trip ID":913460,
        "Duration":765,
        "Start Date":"8\/31\/2015 23:26",
        "Start Station":"Harry Bridges Plaza (Ferry Building)",
        "Start Terminal":50,
        "End Date":"8\/31\/2015 23:39",
        "End Station":"San Francisco Caltrain (Townsend at 4th)",
        "End Terminal":70,
        "Bike #":288,
        "Subscriber Type":"Subscriber",
        "Zip Code":"2139"
    },
    {
        "Trip ID":913459,
        "Duration":1036,
        "Start Date":"8\/31\/2015 23:11",
        "Start Station":"San Antonio Shopping Center",
        "Start Terminal":31,
        "End Date":"8\/31\/2015 23:28",
        "End Station":"Mountain View City Hall",
        "End Terminal":27,
        "Bike #":35,
        "Subscriber Type":"Subscriber",
        "Zip Code":"95032"
    },
    {
        "Trip ID":913455,
        "Duration":307,
        "Start Date":"8\/31\/2015 23:13",
        "Start Station":"Post at Kearny",
        "Start Terminal":47,
        "End Date":"8\/31\/2015 23:18",
        "End Station":"2nd at South Park",
        "End Terminal":64,
        "Bike #":468,
        "Subscriber Type":"Subscriber",
        "Zip Code":"94107"
    }
]
```

In [None]:
import pandas as pd

# Specify the path to your JSON file
json_file_path = "trimmed/trips_trimmed.json"

# Load the JSON data into a DataFrame
df = pd.read_json(json_file_path)

# Display the DataFrame
df

## YAML

* YAML is one of the most popular data serialization languages, and it is used mostly for writing configuration files.
* The YAML recursive acronym stands for YAML Ain’t Markup Language

**Basic YAML syntax**  

* **Maps/Dictionaries** (YAML calls it mapping)
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. YAML places no further restrictions on the nodes.
* **Arrays/Lists** (YAML calls them sequences)
The content of a sequence node is an ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself.
* **Literals** (Strings, numbers, boolean, etc.)
The content of a scalar node is an opaque datum that can be presented as a series of zero or more Unicode characters.


```yaml
---
# A sample yaml file
company: spacelift
domain:
 - devops
 - devsecops
tutorial:
  - yaml:
      name: "YAML Ain't Markup Language"
      type: awesome
      born: 2001
  - json:
      name: JavaScript Object Notation
      type: great
      born: 2001
  - xml:
      name: Extensible Markup Language
      type: good
      born: 1996
author: omkarbirade
published: true
```

**JSON**  
```json
{
  "company": "spacelift",
  "domain": ["devops", "devsecops"],
  "tutorial": [
    {
      "yaml": {
        "name": "YAML Ain't Markup Language",
        "type": "awesome",
        "born": 2001
      }
    },
    {
      "json": {
        "name": "JavaScript Object Notation",
        "type": "great",
        "born": 2001
      }
    },
    {
      "xml": {
        "name": "Extensible Markup Language",
        "type": "good",
        "born": 1996
      }
    }
  ],
  "author": "omkarbirade",
  "published": true
}
```

**XML**  
```xml
<root>
    <company>spacelift</company>
    <domain>devops</domain>
    <domain>devsecops</domain>
    <tutorials>
        <yaml>
            <name>YAML Ain't Markup Language</name>
            <type>awesome</type>
            <born>2001</born>
        </yaml>
        <json>
            <name>JavaScript Object Notation</name>
            <type>great</type>
            <born>2001</born>
        </json>
        <xml>
            <name>Extensible Markup Language</name>
            <type>good</type>
            <born>1996</born>
        </xml>
    </tutorials>
    <author>omkarbirade</author>
    <published>true</published>
</root>
```

**Indentation**  
```yaml
tutorial:  #nesting level 1
  - yaml:  #nesting level 2 (2 spaces used for indentation)
      name: "YAML Ain't Markup Language" #string [literal] #nesting level 3 (4 spaces used for indentation)
      type: awesome #string [literal]
      born: 2001 #number [literal]
```

**Mappings**  
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique  

```yaml
name: "Ahmed"
country: "Egypt"
sex: "Male"
```

**Arrays (sequences)**  
The content of a sequence node is an ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself  
```yaml
languages:
  - English
  - Arabic
  - Hindi
```

**Literals — Strings**  
String literals in YAML do not need to be quoted. Quoting only matters when the value contains characters that YAML would otherwise interpret as syntax.  
```yaml
message1: YAML: JSON # breaks, as ": " inside an unquoted value is read as a mapping separator
message2: "YAML: JSON" # works as the string is quoted
```

**String blocks**  

Single line:  
```yaml
message: >+
 This block line
 Will be interpreted as a single
 line with a newline character at the 
 end
```

Multiline:  
```yaml
message: |
 this is
 a real multiline
 message
```

**Comments**  

```yaml
---
# Comments inside a YAML file can be added followed by the '#' character
company: spacelift
```

**Documents**  
The above YAML snippet is called a document. A single YAML file can contain more than one document. Each document is interpreted independently, so the same key may appear in several documents even though duplicate keys are not allowed within a single document.  
The beginning of a document is denoted by three hyphens `---`.  
Three dots `...` end a document without starting a new one.

```yaml
---
# document 1
codename: YAML
name: YAML ain't markup language
release: 2001

---
# document 2
uses:
 - configuration language
 - data persistence
 - internet messaging
 - cross-language data sharing

---
# document 3
company: spacelift
domain:
 - devops
 - devsecops
tutorial:
   - name: yaml
   - type: awesome
   - rank: 1
   - born: 2001
author: omkarbirade
published: true
...
```

**Schemas and Tags**  
**Schemas** can be thought of as the way a parser resolves or understands nodes (values) present in a YAML file. There are primarily three default schemas in YAML:

* **FailSafe** Schema understands only maps, sequences, and strings and is guaranteed to work with any YAML file.
* **JSON schema** understands all types supported within JSON, including boolean, null, int, and float, as well as those in the FailSafe schema. 
* **Core schema** is an extension of the JSON schema, making it more human-readable supporting the same types but in multiple forms.  
For example, `null | Null | NULL` will all be resolved to the same type null, and `true | True | TRUE` will all be resolved to the same boolean value.

**Tags** can be thought of as types in YAML.

```yaml
---
# A sample yaml file
company: !!str spacelift
domain:
 - !!str devops
 - !!str devsecops
tutorial:
   - name: !!str yaml
   - type: !!str awesome
   - rank: !!int 1
   - born: !!int 2001
author: !!str omkarbirade
published: !!bool true
```

In [None]:
import pandas as pd
import yaml

def flatten_dict_item(prefix, item):
    """Recursively flattens a dictionary item, adding a prefix."""
    if isinstance(item, dict):
        for k, v in item.items():
            yield from flatten_dict_item(f"{prefix}.{k}" if prefix else k, v)
    elif isinstance(item, list):
        for i, v in enumerate(item):
            yield from flatten_dict_item(f"{prefix}[{i}]", v)
    else:
        yield (prefix, item)

def flatten_yaml(data):
    """Flattens the YAML data into a list of dictionaries."""
    return [dict(flatten_dict_item("", data))]

def load_yaml_to_dataframe(yaml_file_path):
    """Loads YAML and converts it into a Pandas DataFrame."""
    with open(yaml_file_path, 'r') as file:
        yaml_data = yaml.safe_load(file)
    
    # Flatten the YAML data
    flattened_data = flatten_yaml(yaml_data)
    
    # Convert to DataFrame
    df = pd.DataFrame(flattened_data)
    return df

# Specify the path to your YAML file
yaml_file_path = 'trimmed/trips_trimmed.yaml'  # Replace with the actual path to your YAML file

# Load the data into a DataFrame

df = load_yaml_to_dataframe(yaml_file_path)
df
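
To see what the flattening step produces without needing a YAML file, the sketch below repeats `flatten_dict_item` from the cell above and applies it to two hand-written structures (both hypothetical). Note the second case: when the top level is a list of records, everything still ends up in a single row with keys such as `[0].Trip ID`, so a file shaped like that may be better passed to `pd.DataFrame` directly.

```python
def flatten_dict_item(prefix, item):
    """Recursively flattens a dictionary item, adding a prefix."""
    if isinstance(item, dict):
        for k, v in item.items():
            yield from flatten_dict_item(f"{prefix}.{k}" if prefix else k, v)
    elif isinstance(item, list):
        for i, v in enumerate(item):
            yield from flatten_dict_item(f"{prefix}[{i}]", v)
    else:
        yield (prefix, item)

# Case 1: a nested mapping, as in the sample YAML document.
nested = {
    "company": "spacelift",
    "domain": ["devops", "devsecops"],
    "tutorial": [{"yaml": {"born": 2001}}],
}
flat = dict(flatten_dict_item("", nested))
print(flat)

# Case 2: a top-level list of records collapses into one wide row.
records = [{"Trip ID": 1}, {"Trip ID": 2}]
flat_records = dict(flatten_dict_item("", records))
print(flat_records)
```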


## What is the Problem?

There's no problem :)  
But there are more features and more optimizations for big data analytics

* **Storage and Compression**
* **Splittability**
* **Performance Optimization**
* **Schema Evolution and Data Types**

## Apache Avro

![](./images/avro.png)

* Avro is a language-neutral data serialization system.
* It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
* Avro creates binary structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
* Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
* Avro schemas are defined in JSON, which facilitates implementation in languages that already have JSON libraries.
* Avro creates a self-describing file named Avro Data File, in which it stores data along with its schema in the metadata section.
* Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.


* **Schema-based Serialization**: Avro uses a schema to define the structure of the data, ensuring that both the data and its schema are stored together. This approach facilitates data validation and compatibility checks during data exchange.
* **Compact and Efficient**: Avro's binary encoding is compact compared to text-based formats like JSON or XML. This leads to reduced storage requirements and faster data transmission, which is particularly beneficial for large datasets.
* **Schema Evolution**: Avro supports schema evolution by allowing schemas to change over time without requiring data to be rewritten. It handles schema evolution gracefully, which is crucial for maintaining backward and forward compatibility.
* **Dynamic Typing**: With Avro, data is dynamically typed. This flexibility makes it easier to handle complex data structures and ensures that applications in different programming languages can read and write Avro data seamlessly.
* **No Overhead for Repeated Fields**: Unlike some formats that require repeated field names, Avro's serialization avoids such redundancy, further optimizing data storage and processing.
* **Interoperability**: Avro is language-agnostic and supports interfaces for multiple programming languages, enhancing interoperability in distributed systems where applications in various languages need to share data.
* **RPC Support**: Avro includes built-in support for remote procedure calls (RPC), which facilitates building networked services that can communicate efficiently.
* **Efficient Data Compression**: Avro files can be compressed, providing additional storage savings and faster input/output operations while maintaining quick access to data.

These features make Avro particularly well-suited for applications that require efficient data serialization and deserialization, compatibility across different systems, and support for evolving data schemas.
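
The "compact" and "no repeated field names" points can be illustrated without Avro itself. The sketch below is not Avro's actual encoding (Avro uses variable-length integers, among other things); it only assumes a fixed two-integer layout, packed with the standard `struct` module, to show what is saved when the field names live once in a schema instead of in every record. The records are made up.

```python
import json
import struct

# Hypothetical records with two integer fields.
records = [{"trip_id": 913460 + i, "duration": 300 + i} for i in range(1000)]

# Text encoding: field names are repeated in every record.
json_bytes = json.dumps(records).encode("utf-8")

# Schema-based binary encoding: names are stored once, rows hold values only.
schema = ("trip_id", "duration")
row_format = struct.Struct("<ii")  # two little-endian 32-bit integers per row
binary_bytes = b"".join(
    row_format.pack(*(record[field] for field in schema)) for record in records
)

# Reading back requires the schema to give the values their names again.
decoded = [dict(zip(schema, values)) for values in row_format.iter_unpack(binary_bytes)]

print("JSON bytes:  ", len(json_bytes))
print("Binary bytes:", len(binary_bytes))
```

The binary form here is 8 bytes per record, and it is unreadable without the schema, which is why a self-describing Avro Data File stores the schema alongside the data.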

In [None]:
!cat ./trimmed/trips_trimmed.avro

In [None]:
import fastavro
import pandas as pd

def avro_to_dataframe(avro_file_path):
    # Open and read the Avro file
    with open(avro_file_path, 'rb') as f:
        # Use fastavro to read the records
        reader = fastavro.reader(f)
        records = [record for record in reader]
    
    # Convert the list of records to a pandas DataFrame
    df = pd.DataFrame(records)
    return df

# Example usage
avro_file_path = './trimmed/trips_trimmed.avro'  # Replace with your Avro file path
df = avro_to_dataframe(avro_file_path)

# Display the DataFrame
df


| Feature                   | Avro                                    | CSV                                   | JSON                                  |
|---------------------------|-----------------------------------------|---------------------------------------|---------------------------------------|
| **Data Encoding**         | Binary                                  | Text                                  | Text                                  |
| **Schema Support**        | Yes; strong schema support with evolution | No schema                             | Optional schema (self-describing)     |
| **Data Compression**      | Built-in compression support            | No built-in compression               | Can be compressed externally         |
| **Readability**           | Not human-readable (binary format)      | Human-readable                        | Human-readable                        |
| **File Size**             | Compact (due to binary encoding)        | Generally larger for complex data    | Larger than binary; depends on data   |
| **Interoperability**      | High, with language agnostic libraries  | High, universal format                | High, universal format                |
| **Serialization/Deserialization Speed** | Fast (optimized for performance) | Fast for simple data, slower for complex parsing | Slower due to parsing overhead       |
| **Schema Evolution**      | Strong support for backward/forward compatibility | N/A                                  | Basic if using schema tools           |
| **Nested Data Support**   | Full support for complex data types     | Limited; complex structures require creativity | Full support through dictionaries and lists |
| **Data Validation**       | Schema ensures validation               | No built-in validation; needs custom code | No built-in validation; needs custom code |
| **Use Case**              | Ideal for large-scale data processing and storage, with diverse and changing schemas | Suitable for simple, flat data structures | Commonly used for APIs and web applications, flexible structure |
| **Tooling and Libraries** | Strong support across multiple platforms (Java, Python, etc.) | Widely supported in various tools and libraries | Widely supported, especially in web development |
| **Support for Remote Procedure Calls (RPC)** | Yes, built-in RPC support           | No                                    | No                                    |

In [None]:
# Example showing schema evolution
import fastavro
from io import BytesIO

# Initial schema (the writer's schema)
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

# Evolved schema (the reader's schema with an additional field)
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None}  # New field with default value
    ]
}

# Data that conforms to the initial schema
users = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
]

# Write the data using the writer's schema
bytes_writer = BytesIO()
fastavro.writer(bytes_writer, writer_schema, users)
bytes_writer.seek(0)

# Read the data using the evolved reader's schema
bytes_reader = BytesIO(bytes_writer.getvalue())
reader = fastavro.reader(bytes_reader, reader_schema)

# Print the records with the evolved schema
for user in reader:
    print(user)

In [None]:
# Same example showing schema evolution failure with JSON format
import json

# Initial data (writer's schema)
users_json = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
]

# Serialize to JSON
json_data = json.dumps(users_json)
print("Serialized JSON:", json_data)


#### Attempting to Read with an Evolved Schema

# Evolved schema expectation
# Here, we expect also an "email" field

try:
    # Deserialize the JSON data
    deserialized_data = json.loads(json_data)
    
    # Attempt to access the new field, which doesn't exist in the data
    for user in deserialized_data:
        print(f"Name: {user['name']}, Age: {user['age']}, Email: {user['email']}")
except KeyError as e:
    print(f"Error: Missing expected field - {e}")
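
With plain JSON the reader has to emulate what Avro's schema resolution does automatically. A minimal sketch, assuming the reader keeps its own table of defaults (the `reader_defaults` name is hypothetical and not part of any library):

```python
import json

# Data written under the old "schema" (no email field).
json_data = json.dumps([{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}])

# Hypothetical reader-side defaults for fields added later.
reader_defaults = {"email": None}

# Fill in defaults first, then let the stored values override them.
upgraded = [{**reader_defaults, **user} for user in json.loads(json_data)]

for user in upgraded:
    print(f"Name: {user['name']}, Age: {user['age']}, Email: {user['email']}")
```

This works, but the defaulting logic lives in application code and has to be repeated by every consumer. With Avro, it is declared once in the reader's schema.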

## Apache ORC

![](./images/orc.png)

### What Are Columnar Data Stores?

* **Columnar Storage**: 
  * ORC stores data in a columnar format, allowing efficient data compression and retrieval. Each column is stored separately, which optimizes analytics and read operations.
* **Efficient Compression**:
  * Supports various compression codecs such as ZLIB, Snappy, and LZO. This reduces storage requirements and speeds up data reading by reducing I/O operations.
* **Predicate Pushdown**:
  * Implemented to allow query engines to push filters down to the storage layer, reducing the amount of data that needs to be read and transferred.
* **Lightweight Metadata**:
  * Contains stripe-level statistics that allow quick navigation and data skipping, enhancing performance by avoiding unnecessary reads.
* **Stripe Structure**:
  * Each ORC file is divided into large stripes (typically 64 MB), which contain a collection of row groups. This segmentation supports parallel processing of data.
* **Indexing and Statistics**:
  * Includes built-in indices (min/max values, bloom filters) for columns to enable fast search and retrieval. Additional statistics per column aid in query optimization.
* **Splittable Files**:
  * Designed to be splittable at stripe boundaries, enabling distributed processing frameworks like Apache Hive and Apache Spark to efficiently process file parts in parallel.
* **Data Types Support**:
  * Supports a wide range of data types, including primitive types (int, float, string) and complex types (arrays, maps, structs), making it versatile for various data models.
* **Schema Evolution**:
  * Supports schema evolution, allowing changes to the dataset's schema without the need to rewrite the entire file, facilitating flexible data application development.
* **ACID Compliance**:
  * Enhanced support for ACID transactions when integrated with Hive, enabling robust data management capabilities such as inserts, updates, and deletes.
* **Efficient Reading and Writing**:
  * Optimized reader and writer code paths to ensure high throughput and low-latency access to the data, supporting efficient serialization and deserialization.
* **Time Zone Information**:
  * The format includes mechanisms to preserve accurate data in terms of timestamps and time zones, ensuring consistency across different systems.
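
The columnar layout, stripe statistics, and predicate pushdown ideas above can be sketched in a few lines of plain Python. This is a toy model rather than the ORC file format: the rows are invented, and `STRIPE_ROWS` is set to an unrealistically small value so the skipping is visible.

```python
# Hypothetical row-oriented data.
rows = [
    {"trip_id": 913455 + i, "duration": duration}
    for i, duration in enumerate([307, 1036, 765, 120, 95, 2400])
]

# Columnar layout: one list per column instead of one dict per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Split the 'duration' column into stripes and keep min/max statistics for each.
STRIPE_ROWS = 2  # toy value; real stripes hold many thousands of rows
stripes = []
for start in range(0, len(rows), STRIPE_ROWS):
    chunk = columns["duration"][start:start + STRIPE_ROWS]
    stripes.append({"min": min(chunk), "max": max(chunk), "values": chunk})

def scan_duration_greater_than(threshold):
    """Return matching values and how many stripes actually had to be read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe["max"] <= threshold:
            continue  # statistics prove no value in this stripe can match
        stripes_read += 1
        matches.extend(value for value in stripe["values"] if value > threshold)
    return matches, stripes_read

matches, stripes_read = scan_duration_greater_than(1000)
print(matches, f"(read {stripes_read} of {len(stripes)} stripes)")
```

The filter `duration > 1000` touches only the stripes whose maximum exceeds the threshold, and it never looks at the `trip_id` column at all.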


![OrcFileLayout.png](attachment:8c3498c0-a3a8-4993-abc1-1eb1218377ff.png)

In [None]:
import pyarrow.orc as orc
import pyarrow as pa

# Function to load an ORC file into a Pandas DataFrame
def load_orc_to_dataframe(file_path):
    # Open the ORC file with pyarrow
    with pa.memory_map(file_path, 'r') as source:
        orc_file = orc.ORCFile(source)
        
        # Read the ORC file into a pyarrow Table
        table = orc_file.read()

    # Convert the pyarrow Table to a Pandas DataFrame
    dataframe = table.to_pandas()
    
    return dataframe

# Specify the path to your ORC file
orc_file_path = 'trimmed/trips_trimmed.orc'

# Load the ORC file into a DataFrame
df = load_orc_to_dataframe(orc_file_path)

# Display the DataFrame
df

In [None]:
import pyarrow.orc as orc
import pyarrow as pa
import fastavro
import pandas as pd
import time

def load_orc_to_dataframe(file_path):
    with pa.memory_map(file_path, 'r') as source:
        orc_file = orc.ORCFile(source)
        table = orc_file.read()
    return table.to_pandas()

def load_avro_to_dataframe(file_path):
    with open(file_path, 'rb') as f:
        reader = fastavro.reader(f)
        records = [record for record in reader]
    return pd.DataFrame(records)

# Assuming 'trips.orc' and 'trips.avro' contain the same data
orc_file_path = 'trips.orc'
avro_file_path = 'trips.avro'

# Measure the time taken to load and process the ORC file
start_time = time.time()
orc_df = load_orc_to_dataframe(orc_file_path)
orc_duration = time.time() - start_time
print(f"ORC Read Time: {orc_duration:.4f} seconds")

# Measure the time taken to load and process the Avro file
start_time = time.time()
avro_df = load_avro_to_dataframe(avro_file_path)
avro_duration = time.time() - start_time
print(f"Avro Read Time: {avro_duration:.4f} seconds")

# Simple comparison of dataframes (this step is merely illustrative)
print("DataFrames are equal:", orc_df.equals(avro_df))

# Display the first few rows of both dataframes
print("ORC DataFrame head:")
print(orc_df.head())

print("Avro DataFrame head:")
print(avro_df.head())

## Apache Parquet

![parquet.png](./images/parquet.png)

Apache Parquet is a columnar storage file format designed for efficient data processing and storage, particularly within the Hadoop ecosystem. It is widely used for big data processing due to its efficient storage mechanisms. Its main characteristics are:

* **Columnar Storage**:
  * Parquet organizes data in a columnar format, which allows for efficient compression and encoding, particularly benefiting read-heavy operations typical in analytics.
* **Efficient Compression**:
  * Supports various compression techniques like Snappy, GZIP, LZO, and Brotli. Columnar compression enhances storage efficiency and can significantly speed up read operations by reducing I/O.
* **Predicate Pushdown Optimization**:
  * Allows filtering operations to be pushed down to the storage layer, reducing I/O by selecting only relevant parts of the data during query execution.
* **Schema Evolution**:
  * Supports backward-compatible schema evolution. You can add new columns or remove old ones without rewriting the entire file, though changing column types might require more effort.
* **Data Encoding**:
  * Implements efficient encoding schemes like Run Length Encoding (RLE), Dictionary Encoding, Delta Encoding, and Bit Packing to minimize storage space and improve read times.
* **Splittable Files**:
  * Parquet files are splittable, which means large files can be read by parallel processes, optimizing performance for distributed data processing frameworks like Apache Spark.
* **Data Types Support**:
  * Offers strong support for complex nested data types, including structs, maps, and lists, accommodating sophisticated data models.
* **Rich Metadata**:
  * Embeds metadata at multiple levels (file, row group, column chunk, and page level) including min/max statistics which enable efficient data access and robust schema management.
* **Compression Codecs Specification**:
  * Allows specifying different compression codecs for different columns, optimizing storage and processing based on the column characteristics.
* **Support for Various Data Models**:
  * Parquet's architecture integrates well with different computation frameworks and data models, including Avro, Thrift, Protocol Buffers, and others, enhancing its versatility.
* **Efficient Read and Write Operations**:
  * Enables fast sequential reads and writes due to its sophisticated index and metadata, allowing for quick access to necessary data while skipping irrelevant parts.
* **Interoperability**:
  * Widely compatible with multiple data processing engines like Apache Hadoop, Apache Spark, and Apache Hive, and supported in several languages such as Java, C++, Python, etc.
* **Data Corruption Checks**:
  * Contains checksums for data pages which help in ensuring data integrity and detecting corruption during data read operations.
* **Time Zone Handling**:
  * Supports storing timestamps with time zone information, allowing consistent temporal data analysis across different environments.
* **Support for Alternative Encoding**:
  * Parquet supports multiple encoding strategies within the same file, providing flexibility based on the data characteristics and workload requirements.
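
Two of the encodings named above, run-length encoding and dictionary encoding, are easy to sketch in plain Python. The functions below illustrate the ideas only and do not reproduce Parquet's on-disk layout; the sample column is made up.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

# A low-cardinality column, as is typical for categorical data.
subscriber_type = ["Subscriber"] * 4 + ["Customer"] * 2 + ["Subscriber"] * 3

rle = run_length_encode(subscriber_type)
dictionary, indices = dictionary_encode(subscriber_type)

print(rle)
print(dictionary, indices)
```

Both encodings pay off because a column holds values of one type that often repeat, which is one reason columnar formats compress better than row-oriented ones.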

![ParquetFileLayout.gif](attachment:578d1f39-649f-4c18-926c-df87c5cbfd59.gif)

In [None]:
import pyarrow.parquet as pq
import pandas as pd

# Function to load a Parquet file into a Pandas DataFrame
def load_parquet_to_dataframe(file_path):
    # Read the Parquet file into a PyArrow Table
    table = pq.read_table(file_path)

    # Convert the PyArrow Table to a Pandas DataFrame
    dataframe = table.to_pandas()

    return dataframe

# Specify the path to your Parquet file
parquet_file_path = 'trimmed/trips_trimmed.parquet'

# Load the Parquet file into a DataFrame
df = load_parquet_to_dataframe(parquet_file_path)

# Display the DataFrame
df

**Key advantages of Parquet over ORC**  
1. **Broad Ecosystem Support**: Parquet is widely supported by several big data tools and platforms, enhancing its versatility for integration and interoperability.
2. **Non-Hadoop Compatibility**: Parquet is frequently chosen in environments that go beyond traditional Hadoop systems, making it a favorite for more diverse technical landscapes.
3. **Encoding and Compression**: Parquet’s efficient encoding and compression techniques can lead to performance advantages, particularly when dealing with analytical workloads.
4. **Schema Evolution**: Parquet provides flexible schema evolution capabilities, allowing more seamless data management and integration over time.
5. **Support for Nested Data**: Parquet excels in handling nested data structures, making it suitable for more complex data formats like JSON.
6. **Optimizations for Columnar Storage**: Parquet is optimized for columnar storage systems, aligning naturally with many SQL-based analytics engines.

| Feature/Property     | JSON                          | Avro                               | ORC                                | Parquet                             |
|----------------------|-------------------------------|------------------------------------|------------------------------------|-------------------------------------|
| **Data Model**       | Text-based, human-readable    | Row-based, schema-oriented         | Columnar                           | Columnar                            |
| **Schema**           | Self-describing               | Required, with strong schema       | Strong schema with partial evolution | Strong schema with evolution        |
| **Schema Evolution** | Challenging                   | Robust support; backward/forward compatibility | Limited (e.g., adding columns)     | Good support for schema evolution   |
| **Compression**      | Not inherent, external tools  | Optional, efficient with codecs    | Built-in, suited for columnar data | Built-in, suited for columnar data  |
| **Use Case**         | Configuration; data interchange | Messaging; schema evolution       | Large-scale analytics              | Large-scale analytics               |
| **Read/Write Performance** | Slow for large datasets | Fast for Avro-specific systems     | Optimized for reads, decent writes | Optimized for reads, decent writes  |
| **File Size**        | Typically larger              | Compact due to binary encoding     | Highly compressed due to columnar design | Highly compressed due to columnar design |
| **Support for Nested/Complex Data** | Limited (via arrays/objects) | Good (supports nested data)       | Excellent (natively supports nested structures) | Excellent (supports nested structures) |
| **Tool Integration** | Widely supported, universal   | Supported by many Hadoop tools     | Supported by Hadoop ecosystem      | Supported by Hadoop ecosystem       |
| **Interoperability** | High                          | Good, with Avro-specific tools     | Primarily within Hadoop ecosystem  | Broad, across big data and analytics tools |
| **Readability**      | High, human-readable          | Low, binary format                 | Low, binary format                 | Low, binary format                  |
| **Data Validation**  | Minimal                       | Strong, due to schema requirement  | Strong, due to schema requirement  | Strong, due to schema requirement   |
| **Data Integrity**   | Limited                       | Good support for validations       | Strong checksums and validations   | Strong checksums and validations    |
| **Transactional Support** | None                    | None natively; often the message format for Kafka | ACID transactions through Hive transactional tables | None natively; added by table formats such as Delta Lake or Apache Iceberg |

| Feature/Aspect       | Avro                                | ORC                                 | Parquet                              |
|----------------------|-------------------------------------|-------------------------------------|--------------------------------------|
| **Primary Use Case** | Data interchange, streaming, serialization | Big data analytics, read-optimized tasks | Big data analytics, multi-tool compatibility |
| **When to Use**      | - When you need efficient serialization/deserialization<br>- Schema evolution is crucial<br>- Data is frequently added or modified<br>- Integration with streaming platforms (e.g., Kafka) | - When optimizing for read-heavy operations<br>- Large-scale analytical queries<br>- Need strong compression for storage efficiency<br>- Working within Hadoop ecosystem | - When using mixed data processing frameworks<br>- Want high efficiency in distributed environments (e.g., Spark, Hive)<br>- Need robust compression and encoding<br>- When analytics require partial data reads |
| **When Not to Use**  | - For purely analytical purposes (less efficient in querying)<br>- When compression efficiency is paramount<br>- Purely read-heavy workloads | - For high-frequency writes due to lesser write efficiency<br>- When schema evolution and flexibility are needed<br>- Small datasets where overhead isn’t justified | - When schema evolution flexibility is critical<br>- Real-time processing requiring rapid writes<br>- Small-scale, dynamic, constantly-changing data tasks |
| **Supported Compression** | Yes (Snappy, Deflate)            | Advanced codecs (ZLIB, Snappy, LZO) | Multiple codecs (Snappy, GZIP, Brotli) |
| **Schema Evolution Support** | Extensive (add/remove fields with defaults, promote compatible types) | Limited (primarily adding columns)  | Good (adding columns works well; renames and type changes are limited) |
| **Integration and Compatibility** | Wide compatibility, including Hadoop and Kafka | Deep integration with Hadoop tools  | Broad compatibility across big data tools |
| **Read/Write Performance** | Fast serialization/deserialization | Optimized for reads, efficient storage | Optimized for reads, supported by various analytical tools |
| **Complex Data Handling** | Supports complex/nested data types | Excellent support for complex/nested structures | Excellent support for complex/nested structures |
| **File Size Efficiency** | Compact due to binary format, but less than ORC/Parquet | Highly efficient due to columnar compression | Highly efficient due to columnar compression |
| **Stream Processing** | Well-suited due to row-based design | Less suitable due to columnar nature | Less suitable due to columnar nature |
| **Key Advantages**   | - Schema evolution<br>- Fast read/write for row-based operations<br>- Ideal for streaming<br>- Good data integrity | - High read performance<br>- Efficient space utilization<br>- Supports complex data models<br>- Robust metadata | - High compatibility<br>- Efficient for various query engines<br>- Powerful metadata support<br>- Balanced read/write performance |
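
The row-based versus columnar distinction in the table can be sketched in plain Python. This is a toy model under stated assumptions, not how Avro, ORC, or Parquet are implemented: the records and the helpers `to_columns` and `run_length_encode` are hypothetical, and run-length encoding stands in for the several encodings that ORC and Parquet apply.

```python
from itertools import groupby

# Hypothetical records in a row-oriented layout: each record is stored whole.
rows = [
    {"trip_id": 1, "station": "A", "member": True},
    {"trip_id": 2, "station": "A", "member": True},
    {"trip_id": 3, "station": "A", "member": False},
    {"trip_id": 4, "station": "B", "member": False},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Columnar layout: all values of one field sit next to each other.
columns = to_columns(rows)

# A query on "station" touches one list and skips the others,
# and repeated neighbouring values encode compactly.
station_runs = run_length_encode(columns["station"])
print(station_runs)  # [('A', 3), ('B', 1)]
```

In the row layout, appending one record is a single write, which suits streaming. In the columnar layout, similar values are adjacent, which is why ORC and Parquet tend to compress better and can serve partial reads.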

In [None]:
import os

import pyarrow.orc as orc
import pyarrow.parquet as pq

# Function to display schema details
def display_schema_details(table, format_name):
    print(f"\n{format_name} Schema:")
    print(table.schema)
    print(f"\n{format_name} Column Names:")
    print(table.column_names)
    print(f"\n{format_name} Number of Columns:")
    print(table.num_columns)
    print(f"\n{format_name} Number of Rows:")
    print(table.num_rows)

# Read ORC file
orc_file = 'trips.orc'
orc_table = orc.read_table(orc_file)

# Read Parquet file
parquet_file = 'trips.parquet'
parquet_table = pq.read_table(parquet_file)

# Display schema and some characteristics
display_schema_details(orc_table, "ORC")
display_schema_details(parquet_table, "Parquet")

# Display file size from the filesystem to understand the compression impact.
orc_file_size = os.path.getsize(orc_file)
parquet_file_size = os.path.getsize(parquet_file)

print(f"\nORC File Size: {orc_file_size} bytes")
print(f"Parquet File Size: {parquet_file_size} bytes")


In [None]:
# Write a CSV file as a partitioned Parquet dataset

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import os

# Define the path to your CSV file
csv_file_path = 'trips.csv'

# Load the CSV file into a Pandas DataFrame
df = pd.read_csv(csv_file_path)

# Optionally, inspect the DataFrame
print(df.head())
print(df.dtypes)

# Define the directory where you want to save the partitioned Parquet files
output_dir = 'output_parquet'
os.makedirs(output_dir, exist_ok=True)

# Convert the Pandas DataFrame to a PyArrow Table
table = pa.Table.from_pandas(df)

# Specify the column(s) by which to partition the Parquet files.
# Each distinct value becomes its own directory, so a high-cardinality
# column produces many small files.
partition_columns = ['Bike #']  # Replace with your actual column name(s)

# Write the table to Parquet files, partitioned by the specified column
pq.write_to_dataset(
    table=table,
    root_path=output_dir,
    partition_cols=partition_columns
)

print("Data successfully written as partitioned Parquet files.")

**Use `parquet-tools` to inspect Parquet files**

In [None]:
!pip install parquet-tools --break-system-packages

In [None]:
!parquet-tools show --head 10 ./output_parquet/*/*.parquet

In [None]:
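# File names inside each partition are generated at write time; replace this one with a name from your own output.
!parquet-tools inspect "./output_parquet/Bike #=10/0c9527a1036342239dd557156d1a0281-0.parquet"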