# File Formats

## DELIMITED

* Comma
* Tab
* Colon
* etc.
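
The delimiter is the only thing that changes between these variants. A minimal sketch using Python's standard `csv` module; the two-column samples are made up for illustration and `parse_delimited` is a hypothetical helper name, not part of any library:

```python
import csv
import io


def parse_delimited(text: str, delimiter: str) -> list[dict[str, str]]:
    """Parse delimited text whose first line is a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# The same record written with three different delimiters (made-up sample)
comma_text = "Trip ID,Duration\n913460,765\n"
tab_text = "Trip ID\tDuration\n913460\t765\n"
colon_text = "Trip ID:Duration\n913460:765\n"

rows_by_format = {
    "comma": parse_delimited(comma_text, ","),
    "tab": parse_delimited(tab_text, "\t"),
    "colon": parse_delimited(colon_text, ":"),
}
for name, rows in rows_by_format.items():
    print(name, rows)
```

All three calls should yield the same list of dictionaries. Note that every value comes back as a string, because a delimited file carries no type information.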

In [10]:
import pandas as pd

# Specify the path to your CSV file
csv_file_path = "trimmed/trips_trimmed.csv"

# Load the CSV data into a DataFrame
df = pd.read_csv(csv_file_path)

# Display the DataFrame
df

Unnamed: 0,Trip ID,Duration,Start Date,Start Station,Start Terminal,End Date,End Station,End Terminal,Bike #,Subscriber Type,Zip Code
0,913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139
1,913459,1036,8/31/2015 23:11,San Antonio Shopping Center,31,8/31/2015 23:28,Mountain View City Hall,27,35,Subscriber,95032
2,913455,307,8/31/2015 23:13,Post at Kearny,47,8/31/2015 23:18,2nd at South Park,64,468,Subscriber,94107


## XML 

**eXtensible Markup Language**
* Stores and transports data but does not present it
* Public standard developed by the W3C
* Human- and machine-readable

```xml
<message>
    <text>Hello, world!</text>
</message>
```

```xml
<?xml version = "1.0"?>
<contact-info>
   <name>Ahmed Sami</name>
   <company>henak</company>
   <phone>(011) 123-4567</phone>
</contact-info>
```

![](./images/syntaxrules.png)

### XML Declaration

`<?xml version = "1.0" encoding = "UTF-8"?>`

### Syntax Rules for XML Declaration

* The XML declaration is case sensitive and must begin with `<?xml`, where `xml` is written in lower case.
* If a document contains an XML declaration, it must be the first statement in the document.
* The transport protocol, such as HTTP, can override the encoding value that you put in the XML declaration.
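
As a rough illustration of these rules, the sketch below asks Python's standard `xml.etree.ElementTree` module to emit a declaration when serializing. The element names are borrowed from the contact-info example; the exact quoting style of the generated declaration is up to the library.

```python
import xml.etree.ElementTree as ET

# Build a tiny element tree (names borrowed from the contact-info example)
contact = ET.Element("contact-info")
ET.SubElement(contact, "name").text = "Ahmed Sami"

# Ask ElementTree to write the declaration as the first statement of the output
xml_bytes = ET.tostring(contact, encoding="UTF-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))

# Parsing the bytes back uses the encoding named in the declaration
parsed = ET.fromstring(xml_bytes)
print(parsed.find("name").text)
```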


**Tags and Elements**  

An XML file is structured as a set of XML elements, also loosely called XML nodes or XML tags. Element names are enclosed in angle brackets `< >`, as shown below:  
`<element>`  
Syntax rules for tags and elements:  
Element syntax: each XML element must be closed, either with a start tag and a matching end tag, as shown below:  
`<element>....</element>`  
or, for an empty element, with a single self-closing tag:  
`<element/>`

**Nesting of Elements**  
An XML element can contain multiple XML elements as its children, but the child elements must not overlap; that is, an end tag must have the same name as the most recent unmatched start tag.

**Root Element**  
An XML document can have only one root element. For example, `<x>...</x><y>...</y>` is not a well-formed XML document, because both the `x` and `y` elements occur at the top level without a root element.

**Case Sensitivity**  
The names of XML elements are case-sensitive. That means the names in the start tag and the end tag must match in case exactly.
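
The nesting, root-element, and case-sensitivity rules can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`; `is_well_formed` is a hypothetical helper and the sample strings are made up:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "single root": "<root><x/><y/></root>",
    "two top-level elements": "<x/><y/>",
    "overlapping children": "<a><b><c></b></c></a>",
    "mismatched case": "<Name>Ahmed</name>",
}
results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the first sample is accepted; the other three each break one of the rules above.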

**XML Attributes**  

An attribute specifies a single property for the element, using a name/value pair. An XML element can have one or more attributes. For example:  
`<a href = "http://www.sayedkabaka.com/">United States of Talbiya!</a>`  
Here `href` is the attribute name and `http://www.sayedkabaka.com/` is the attribute value.
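
A short sketch of how a parser exposes that attribute, again with the standard `xml.etree.ElementTree` module:

```python
import xml.etree.ElementTree as ET

# Parse the element from the example; spaces around "=" are allowed in XML
link = ET.fromstring(
    '<a href = "http://www.sayedkabaka.com/">United States of Talbiya!</a>'
)

print(link.tag)          # element name
print(link.attrib)       # all attributes as a dict
print(link.get("href"))  # a single attribute value
print(link.text)         # text content between the tags
```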

**XML declaration**  
```xml
<?xml
   version = "version_number"
   encoding = "encoding_declaration"
   standalone = "standalone_status"
?>
```

**XML Comments**   
`<!--Students grades are uploaded by months-->`

### Parse XML using Python

In [5]:
import pandas as pd

xml_file_path = "trimmed/trips_trimmed.xml"

# Function to read XML and parse into DataFrame
def read_xml_to_df(xml_file: str) -> pd.DataFrame:
    # Parse the XML file into a DataFrame
    df = pd.read_xml(xml_file)
    return df

# Read the XML file into a DataFrame
df = read_xml_to_df(xml_file_path)

# Display the DataFrame
df

Unnamed: 0,Trip_ID,Duration,Start_Date,Start_Station,Start_Terminal,End_Date,End_Station,End_Terminal,key,Subscriber_Type,Zip_Code
0,913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139
1,913459,1036,8/31/2015 23:11,San Antonio Shopping Center,31,8/31/2015 23:28,Mountain View City Hall,27,35,Subscriber,95032
2,913455,307,8/31/2015 23:13,Post at Kearny,47,8/31/2015 23:18,2nd at South Park,64,468,Subscriber,94107


## JSON

* JSON stands for JavaScript Object Notation.
* The format was specified by Douglas Crockford.
* It was designed for human-readable data interchange.
* It is derived from the JavaScript scripting language.
* The filename extension is .json.
* JSON Internet Media type is application/json.
* The Uniform Type Identifier is public.json.


```json
{
     "id": "01",
     "language": "Java",
     "edition": "third",
     "author": "Herbert Schildt"
}
```

**JSON VS XML**   
```json
{
   "company": "Volkswagen",
   "name": "Vento",
   "price": 800000
}
```

```xml
<car>
   <company>Volkswagen</company>
   <name>Vento</name>
   <price>800000</price>
</car>
```
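
To make the comparison concrete, the sketch below serializes the same record both ways with Python's standard library. It is only an illustration; the flat one-element-per-key XML mapping is an assumption, since XML has no single canonical mapping from a dictionary.

```python
import json
import xml.etree.ElementTree as ET

car = {"company": "Volkswagen", "name": "Vento", "price": 800000}

# JSON: one call, types are preserved
json_text = json.dumps(car)

# XML: build one child element per key (an assumed, simple mapping)
car_element = ET.Element("car")
for key, value in car.items():
    ET.SubElement(car_element, key).text = str(value)
xml_text = ET.tostring(car_element, encoding="unicode")

print(json_text)
print(xml_text)

# JSON keeps the number type; XML text content comes back as a string
price_from_json = json.loads(json_text)["price"]
price_from_xml = ET.fromstring(xml_text).find("price").text
print(type(price_from_json).__name__, type(price_from_xml).__name__)
```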

**Data types**  
1. **String**: Text enclosed in double quotes.
2. **Number**: Any numerical value, including integers and floating-point numbers.
3. **Object**: A collection of key-value pairs, with keys being strings.
4. **Array**: An ordered list of values.
5. **Boolean**: Either `true` or `false` (lower case).
6. **Null**: A special type that represents a null value.

```json
{
  "string": "Hello, World!",
  "number_integer": 123,
  "number_float": 456.78,
  "object": {
    "nested_string": "Nested Hello",
    "nested_number": 42
  },
  "array": [1, "two", 3.0, {"key": "value"}, true],
  "boolean_true": true,
  "boolean_false": false,
  "null_value": null
}
```
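
When such a document is parsed, each JSON type maps to a native type in the host language. A minimal sketch with Python's standard `json` module, using a shortened copy of the example:

```python
import json

document = """
{
  "string": "Hello, World!",
  "number_integer": 123,
  "number_float": 456.78,
  "object": {"nested_string": "Nested Hello"},
  "array": [1, "two", 3.0],
  "boolean_true": true,
  "null_value": null
}
"""

data = json.loads(document)

# Record the Python type that each JSON value was converted to
python_types = {key: type(value).__name__ for key, value in data.items()}
for key, type_name in python_types.items():
    print(f"{key}: {type_name}")
```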

**JSON Array**  
```json
[
    {
        "Trip ID":913460,
        "Duration":765,
        "Start Date":"8\/31\/2015 23:26",
        "Start Station":"Harry Bridges Plaza (Ferry Building)",
        "Start Terminal":50,
        "End Date":"8\/31\/2015 23:39",
        "End Station":"San Francisco Caltrain (Townsend at 4th)",
        "End Terminal":70,
        "Bike #":288,
        "Subscriber Type":"Subscriber",
        "Zip Code":"2139"
    },
    {
        "Trip ID":913459,
        "Duration":1036,
        "Start Date":"8\/31\/2015 23:11",
        "Start Station":"San Antonio Shopping Center",
        "Start Terminal":31,
        "End Date":"8\/31\/2015 23:28",
        "End Station":"Mountain View City Hall",
        "End Terminal":27,
        "Bike #":35,
        "Subscriber Type":"Subscriber",
        "Zip Code":"95032"
    },
    {
        "Trip ID":913455,
        "Duration":307,
        "Start Date":"8\/31\/2015 23:13",
        "Start Station":"Post at Kearny",
        "Start Terminal":47,
        "End Date":"8\/31\/2015 23:18",
        "End Station":"2nd at South Park",
        "End Terminal":64,
        "Bike #":468,
        "Subscriber Type":"Subscriber",
        "Zip Code":"94107"
    }
]
```
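
The `\/` sequences in the dates are an optional JSON escape for the forward slash; a parser turns them back into plain `/`. A small sketch with the standard `json` module, using a shortened two-record copy of the array:

```python
import json

# Raw string so the backslashes reach the JSON parser unchanged
raw = r"""
[
    {"Trip ID": 913460, "Start Date": "8\/31\/2015 23:26", "Zip Code": "2139"},
    {"Trip ID": 913459, "Start Date": "8\/31\/2015 23:11", "Zip Code": "95032"}
]
"""

trips = json.loads(raw)

# A JSON array of objects becomes a Python list of dicts
for trip in trips:
    print(trip["Trip ID"], trip["Start Date"], trip["Zip Code"])
```

The zip codes stay strings because they are quoted in the source, while the trip IDs are parsed as integers.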

In [9]:
import pandas as pd

# Specify the path to your JSON file
json_file_path = "trimmed/trips_trimmed.json"

# Load the JSON data into a DataFrame
df = pd.read_json(json_file_path)

# Display the DataFrame
df

Unnamed: 0,Trip ID,Duration,Start Date,Start Station,Start Terminal,End Date,End Station,End Terminal,Bike #,Subscriber Type,Zip Code
0,913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139
1,913459,1036,8/31/2015 23:11,San Antonio Shopping Center,31,8/31/2015 23:28,Mountain View City Hall,27,35,Subscriber,95032
2,913455,307,8/31/2015 23:13,Post at Kearny,47,8/31/2015 23:18,2nd at South Park,64,468,Subscriber,94107


## YAML

* YAML is one of the most popular data serialization languages, and it is used mostly for writing configuration files.
* The recursive acronym YAML stands for "YAML Ain’t Markup Language".

**Basic YAML syntax**  

* **Maps/Dictionaries** (YAML calls them mappings):  
  The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. YAML places no further restrictions on the nodes.
* **Arrays/Lists** (YAML calls them sequences):  
  The content of a sequence node is an ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself.
* **Literals** (strings, numbers, booleans, etc.; YAML calls them scalars):  
  The content of a scalar node is an opaque datum that can be presented as a series of zero or more Unicode characters.


```yaml
---
# A sample yaml file
company: spacelift
domain:
 - devops
 - devsecops
tutorial:
  - yaml:
      name: "YAML Ain't Markup Language"
      type: awesome
      born: 2001
  - json:
      name: JavaScript Object Notation
      type: great
      born: 2001
  - xml:
      name: Extensible Markup Language
      type: good
      born: 1996
author: omkarbirade
published: true
```

**JSON**  
```json
{
  "company": "spacelift",
  "domain": ["devops", "devsecops"],
  "tutorial": [
    {
      "yaml": {
        "name": "YAML Ain't Markup Language",
        "type": "awesome",
        "born": 2001
      }
    },
    {
      "json": {
        "name": "JavaScript Object Notation",
        "type": "great",
        "born": 2001
      }
    },
    {
      "xml": {
        "name": "Extensible Markup Language",
        "type": "good",
        "born": 1996
      }
    }
  ],
  "author": "omkarbirade",
  "published": true
}
```

**XML**  
```xml
<root>
    <company>spacelift</company>
    <domain>devops</domain>
    <domain>devsecops</domain>
    <tutorials>
        <yaml>
            <name>YAML Ain't Markup Language</name>
            <type>awesome</type>
            <born>2001</born>
        </yaml>
        <json>
            <name>JavaScript Object Notation</name>
            <type>great</type>
            <born>2001</born>
        </json>
        <xml>
            <name>Extensible Markup Language</name>
            <type>good</type>
            <born>1996</born>
        </xml>
    </tutorials>
    <author>omkarbirade</author>
    <published>true</published>
</root>
```

**Indentation**  
```yaml
tutorial:  #nesting level 1
  - yaml:  #nesting level 2 (2 spaces used for indentation)
      name: "YAML Ain't Markup Language" #string [literal] #nesting level 3 (4 spaces used for indentation)
      type: awesome #string [literal]
      born: 2001 #number [literal]
```

**Mappings**  
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique.

```yaml
name: "Ahmed"
country: "Egypt"
sex: "Male"
```

**Arrays (sequences)**  
The content of a sequence node is an ordered series of zero or more nodes. In particular, a sequence may contain the same node more than once. It could even contain itself.  
```yaml
languages:
  - English
  - Arabic
  - Hindi
```

**Literals — Strings**  
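String literals in YAML do not need to be quoted. Quoting matters only when the value contains characters that YAML would otherwise read as syntax, such as a colon followed by a space, a space followed by `#`, or a leading `&` or `*`. An `&` in the middle of a plain string is harmless.  
```yaml
message1: Note: YAML & JSON # breaks: the unquoted ": " is read as another key/value separator
message2: "Note: YAML & JSON" # works because the string is quoted
```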

**String blocks**  

Single line:  
```yaml
message: >+
 This block line
 Will be interpreted as a single
 line with a newline character at the 
 end
```

Multiline:  
```yaml
message: |
 this is
 a real multiline
 message
```

**Comments**  

```yaml
---
# Comments in a YAML file start with the '#' character
company: spacelift
```

**Documents**  
The YAML snippet above is called a document. A single YAML file can contain more than one document. Each document is interpreted as if it were a separate YAML file, so different documents can reuse the same keys, whereas duplicate keys are not allowed within a single document.  
The beginning of a document is denoted by three hyphens (`---`).  
Three dots (`...`) end a YAML document without starting a new one.

```yaml
---
# document 1
codename: YAML
name: YAML ain't markup language
release: 2001

---
# document 2
uses:
 - configuration language
 - data persistence
 - internet messaging
 - cross-language data sharing

---
# document 3
company: spacelift
domain:
 - devops
 - devsecops
tutorial:
   - name: yaml
   - type: awesome
   - rank: 1
   - born: 2001
author: omkarbirade
published: true
...
```

**Schemas and Tags**  
**Schemas** can be thought of as the way a parser resolves or understands nodes (values) present in a YAML file. There are primarily three default schemas in YAML:

* **Failsafe schema** understands only maps, sequences, and strings and is guaranteed to work with any YAML file.
* **JSON schema** understands all types supported within JSON, including boolean, null, int, and float, as well as those in the Failsafe schema.
* **Core schema** is an extension of the JSON schema that is more human-readable: it supports the same types but accepts them in multiple forms. For example, `null`, `Null`, and `NULL` all resolve to the same null type, and `true`, `True`, and `TRUE` all resolve to the same boolean value.

**Tags** can be thought of as types in YAML.

```yaml
---
# A sample yaml file
company: !!str spacelift
domain:
 - !!str devops
 - !!str devsecops
tutorial:
   - name: !!str yaml
   - type: !!str awesome
   - rank: !!int 1
   - born: !!int 2001
author: !!str omkarbirade
published: !!bool true
```

In [18]:
import pandas as pd
import yaml

def flatten_dict_item(prefix, item):
    """Recursively flattens a dictionary item, adding a prefix."""
    if isinstance(item, dict):
        for k, v in item.items():
            yield from flatten_dict_item(f"{prefix}.{k}" if prefix else k, v)
    elif isinstance(item, list):
        for i, v in enumerate(item):
            yield from flatten_dict_item(f"{prefix}[{i}]", v)
    else:
        yield (prefix, item)

def flatten_yaml(data):
    """Flattens the YAML data into a list of dictionaries."""
    return [dict(flatten_dict_item("", data))]

def load_yaml_to_dataframe(yaml_file_path):
    """Loads YAML and converts it into a Pandas DataFrame."""
    with open(yaml_file_path, 'r') as file:
        yaml_data = yaml.safe_load(file)
    
    # Flatten the YAML data
    flattened_data = flatten_yaml(yaml_data)
    
    # Convert to DataFrame
    df = pd.DataFrame(flattened_data)
    return df

# Specify the path to your YAML file
yaml_file_path = 'trimmed/trips_trimmed.yaml'  # Replace with the actual path to your YAML file

# Load the data into a DataFrame

df = load_yaml_to_dataframe(yaml_file_path)
df


Unnamed: 0,[0].Bike #,[0].Duration,[0].End Date,[0].End Station,[0].End Terminal,[0].Start Date,[0].Start Station,[0].Start Terminal,[0].Subscriber Type,[0].Trip ID,...,[2].Duration,[2].End Date,[2].End Station,[2].End Terminal,[2].Start Date,[2].Start Station,[2].Start Terminal,[2].Subscriber Type,[2].Trip ID,[2].Zip Code
0,288,765,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,Subscriber,913460,...,307,8/31/2015 23:18,2nd at South Park,64,8/31/2015 23:13,Post at Kearny,47,Subscriber,913455,94107
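
The output above is a single row with `[0].`, `[1].`, and `[2].` column prefixes because the top-level list of trips is flattened into one dictionary. One possible variant, sketched below in plain Python, flattens each top-level item separately so that every trip becomes its own row. The names `flatten` and `flatten_records` are hypothetical, and the `trips` list is a made-up stand-in for what `yaml.safe_load` might return.

```python
def flatten(prefix, item):
    """Yield (dotted_key, value) pairs for a nested dict/list structure."""
    if isinstance(item, dict):
        for key, value in item.items():
            yield from flatten(f"{prefix}.{key}" if prefix else key, value)
    elif isinstance(item, list):
        for index, value in enumerate(item):
            yield from flatten(f"{prefix}[{index}]", value)
    else:
        yield prefix, item


def flatten_records(data):
    """Return one flat dict per top-level list item (or a single dict otherwise)."""
    records = data if isinstance(data, list) else [data]
    return [dict(flatten("", record)) for record in records]


# Stand-in for parsed YAML: a list of trips (made-up subset of the columns)
trips = [
    {"Trip ID": 913460, "Duration": 765},
    {"Trip ID": 913459, "Duration": 1036},
]
rows = flatten_records(trips)
print(rows)

# Nested structures still get dotted/indexed keys
nested = flatten_records({"tutorial": [{"yaml": {"born": 2001}}]})
print(nested)
```

Passing `rows` to `pd.DataFrame` should then give one row per trip with the original column names.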


## What is the Problem?

There's no problem :)  
But big data analytics calls for more features and more optimization:

* **Storage and Compression**
* **Splittability**
* **Performance Optimization**
* **Schema Evolution and Data Types**
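
To give a feel for the storage point, the sketch below writes the same three records once as JSON text and once as fixed-width binary integers. The binary layout is an assumption made up for this comparison and is not any real file format; the values are borrowed from the sample trips.

```python
import json
import struct

# (trip_id, duration) pairs borrowed from the sample output
records = [(913460, 765), (913459, 1036), (913455, 307)]

# Text encoding: field names are repeated in every record
as_json = json.dumps(
    [{"Trip ID": trip_id, "Duration": duration} for trip_id, duration in records]
).encode("utf-8")

# Binary encoding: two little-endian 32-bit integers per record, no field names
as_binary = b"".join(
    struct.pack("<ii", trip_id, duration) for trip_id, duration in records
)

print(len(as_json), "bytes as JSON")
print(len(as_binary), "bytes as fixed-width binary")

# The binary form is only readable if the reader knows the layout (a schema)
decoded = [struct.unpack_from("<ii", as_binary, offset) for offset in range(0, len(as_binary), 8)]
print(decoded)
```

The binary version is smaller, but it is unreadable without knowing the layout, which is the gap that schema-carrying formats aim to close.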

## Apache Avro

![](./images/avro.png)

* Avro is a language-neutral data serialization system.
* It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
* Avro creates a binary structured format that is both compressible and splittable, so it can be used efficiently as the input to Hadoop MapReduce jobs.
* Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
* Avro schemas are defined in JSON, which facilitates implementation in languages that already have JSON libraries.
* Avro creates a self-describing file, called an Avro Data File, in which it stores data along with its schema in the metadata section.
* Avro is also used in Remote Procedure Calls (RPCs). During an RPC, the client and server exchange schemas in the connection handshake.


* **Schema-based Serialization**: Avro uses a schema to define the structure of the data, ensuring that both the data and its schema are stored together. This approach facilitates data validation and compatibility checks during data exchange.
* **Compact and Efficient**: Avro's binary encoding is compact compared to text-based formats like JSON or XML. This leads to reduced storage requirements and faster data transmission, which is particularly beneficial for large datasets.
* **Schema Evolution**: Avro supports schema evolution by allowing schemas to change over time without requiring data to be rewritten. It handles schema evolution gracefully, which is crucial for maintaining backward and forward compatibility.
* **Dynamic Typing**: Avro does not require code generation. Because data is always accompanied by its schema, it can be processed generically, and applications in different programming languages can read and write Avro data without generated classes.
* **No Overhead for Repeated Fields**: Because the schema is stored once alongside the data, individual records do not repeat field names, unlike text formats such as JSON. This further reduces storage and processing cost.
* **Interoperability**: Avro is language-agnostic and supports interfaces for multiple programming languages, enhancing interoperability in distributed systems where applications in various languages need to share data.
* **RPC Support**: Avro includes built-in support for remote procedure calls (RPC), which facilitates building networked services that can communicate efficiently.
* **Efficient Data Compression**: Avro files can be compressed, providing additional storage savings and faster input/output operations while maintaining quick access to data.

These features make Avro particularly well-suited for applications that require efficient data serialization and deserialization, compatibility across different systems, and support for evolving data schemas.
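
Part of the compactness comes from how Avro's binary encoding writes `int` and `long` values: the sign is folded in with zig-zag coding, and the result is written in as few 7-bit groups as possible. Strings are written as a length (encoded as a long) followed by the UTF-8 bytes. The sketch below is a hand-written illustration of that scheme for values in the 64-bit range, not fastavro's implementation; the function names are hypothetical.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def encode_long(n: int) -> bytes:
    """Encode an int/long as zig-zag followed by variable-length 7-bit groups."""
    value = zigzag(n)
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_string(text: str) -> bytes:
    """Encode a string as its UTF-8 byte length (as a long) followed by the bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(1), encode_long(-1), encode_long(64))
print(encode_string("Harry Bridges Plaza (Ferry Building)"))
```

The 36-byte station name gets the one-byte length prefix `H`, which matches the stray `H` in front of that name in the raw `cat` output of `trips_trimmed.avro` in this notebook.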

In [22]:
!cat ./trimmed/trips_trimmed.avro

Objavro.codenullavro.schema�{"type": "record", "name": "Trip", "fields": [{"name": "Trip ID", "type": "int"}, {"name": "Duration", "type": "int"}, {"name": "Start Date", "type": "string"}, {"name": "Start Station", "type": "string"}, {"name": "Start Terminal", "type": "int"}, {"name": "End Date", "type": "string"}, {"name": "End Station", "type": "string"}, {"name": "End Terminal", "type": "int"}, {"name": "Bike #", "type": "int"}, {"name": "Subscriber Type", "type": "string"}, {"name": "Zip Code", "type": "string"}]} >f<&
�	"+�<G�o���o�8/31/2015 23:26HHarry Bridges Plaza (Ferry Building)d8/31/2015 23:39PSan Francisco Caltrain (Townsend at 4th)��Subscribe2139��o�8/31/2015 23:116San Antonio Shopping Center>8/31/2015 23:28.Mountain View City Hall6FSubscriber
95032��o�8/31/2015 23:13Post at Kearny^8/31/2015 23:18"2nd at South Park��Subscriber
94107>f<&
�	"+�<G�o

In [23]:
import fastavro
import pandas as pd

def avro_to_dataframe(avro_file_path):
    # Open and read the Avro file
    with open(avro_file_path, 'rb') as f:
        # Use fastavro to read the records
        reader = fastavro.reader(f)
        records = list(reader)
    
    # Convert the list of records to a pandas DataFrame
    df = pd.DataFrame(records)
    return df

# Example usage
avro_file_path = './trimmed/trips_trimmed.avro'  # Replace with your Avro file path
df = avro_to_dataframe(avro_file_path)

# Display the DataFrame
df

Unnamed: 0,Trip ID,Duration,Start Date,Start Station,Start Terminal,End Date,End Station,End Terminal,Bike #,Subscriber Type,Zip Code
0,913460,765,8/31/2015 23:26,Harry Bridges Plaza (Ferry Building),50,8/31/2015 23:39,San Francisco Caltrain (Townsend at 4th),70,288,Subscriber,2139
1,913459,1036,8/31/2015 23:11,San Antonio Shopping Center,31,8/31/2015 23:28,Mountain View City Hall,27,35,Subscriber,95032
2,913455,307,8/31/2015 23:13,Post at Kearny,47,8/31/2015 23:18,2nd at South Park,64,468,Subscriber,94107



| Feature                   | Avro                                    | CSV                                   | JSON                                  |
|---------------------------|-----------------------------------------|---------------------------------------|---------------------------------------|
| **Data Encoding**         | Binary                                  | Text                                  | Text                                  |
| **Schema Support**        | Yes; strong schema support with evolution | No schema                             | Optional schema (self-describing)     |
| **Data Compression**      | Built-in compression support            | No built-in compression               | Can be compressed externally         |
| **Readability**           | Not human-readable (binary format)      | Human-readable                        | Human-readable                        |
| **File Size**             | Compact (due to binary encoding)        | Generally larger for complex data    | Larger than binary; depends on data   |
| **Interoperability**      | High, with language agnostic libraries  | High, universal format                | High, universal format                |
| **Serialization/Deserialization Speed** | Fast (optimized for performance) | Fast for simple data, slower for complex parsing | Slower due to parsing overhead       |
| **Schema Evolution**      | Strong support for backward/forward compatibility | N/A                                  | Basic if using schema tools           |
| **Nested Data Support**   | Full support for complex data types     | Limited; complex structures require creativity | Full support through dictionaries and lists |
| **Data Validation**       | Schema ensures validation               | No built-in validation; needs custom code | No built-in validation; needs custom code |
| **Use Case**              | Ideal for large-scale data processing and storage, with diverse and changing schemas | Suitable for simple, flat data structures | Commonly used for APIs and web applications, flexible structure |
| **Tooling and Libraries** | Strong support across multiple platforms (Java, Python, etc.) | Widely supported in various tools and libraries | Widely supported, especially in web development |
| **Support for Remote Procedure Calls (RPC)** | Yes, built-in RPC support           | No                                    | No                                    |

In [25]:
# Example showing schema evolution
import fastavro
from io import BytesIO

# Initial schema (the writer's schema)
writer_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

# Evolved schema (the reader's schema with an additional field)
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None}  # New field with default value
    ]
}

# Data that conforms to the initial schema
users = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
]

# Write the data using the writer's schema
bytes_writer = BytesIO()
fastavro.writer(bytes_writer, writer_schema, users)
bytes_writer.seek(0)

# Read the data using the evolved reader's schema
bytes_reader = BytesIO(bytes_writer.getvalue())
reader = fastavro.reader(bytes_reader, reader_schema)

# Print the records with the evolved schema
for user in reader:
    print(user)

{'name': 'Alice', 'age': 30, 'email': None}
{'name': 'Bob', 'age': 25, 'email': None}
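
The `email: None` values come from the reader schema's default. A simplified sketch of that resolution rule in plain Python follows; it covers only top-level record fields, ignores type promotion and aliases, and is not how fastavro is implemented. `resolve_record` is a hypothetical name.

```python
def resolve_record(record: dict, reader_schema: dict) -> dict:
    """Project a written record onto the reader schema, filling in defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]      # field present in the written data
        elif "default" in field:
            resolved[name] = field["default"]  # new field: use the reader's default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

written = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]
resolved = [resolve_record(record, reader_schema) for record in written]
for user in resolved:
    print(user)
```

Without the `default` on `email`, the same read would fail, which is why newly added fields need defaults to stay backward compatible.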


In [None]:
# Same example showing schema evolution failure with JSON format
import json

# Initial data (writer's schema)
users_json = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
]

# Serialize to JSON
json_data = json.dumps(users_json)
print("Serialized JSON:", json_data)


# Attempting to read with an evolved schema:
# here we also expect an "email" field

try:
    # Deserialize the JSON data
    deserialized_data = json.loads(json_data)
    
    # Attempt to access the new field, which doesn't exist in the data
    for user in deserialized_data:
        print(f"Name: {user['name']}, Age: {user['age']}, Email: {user['email']}")
except KeyError as e:
    print(f"Error: Missing expected field - {e}")
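
With plain JSON, the reading code has to supply the missing default itself. One possible workaround, sketched with the standard `json` module (`read_user` is a hypothetical helper):

```python
import json

json_data = json.dumps([{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}])


def read_user(raw_user: dict) -> dict:
    """Apply the defaults the evolved structure expects; JSON will not do it for us."""
    return {
        "name": raw_user["name"],
        "age": raw_user["age"],
        "email": raw_user.get("email"),  # None when the writer never sent it
    }


users = [read_user(user) for user in json.loads(json_data)]
for user in users:
    print(user)
```

This works, but the default lives in application code rather than in a schema, so every consumer has to repeat it.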

## Apache ORC

### What are Columnar Data Stores?