# The History of the Data Lake

There are a million implementations of the distributed file share, but the idea really took off with the publication of a Google paper entitled ["The Google File System"](https://research.google/pubs/the-google-file-system/) - a description of how Google had implemented a fault-tolerant, distributed file system with data redundancy, running on cheap consumer hardware while still serving a Google-scale number of clients. This paper, together with another Google paper on [MapReduce](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/) that laid out a programming model for working effectively with distributed data, served as the foundation for Hadoop, born in 2006 out of Yahoo.

## In the Beginning, There was Hadoop

![Hadoop logo](images/logos/hadoop_logo.png)

Hadoop was actually an ecosystem, built around the ideas presented in the aforementioned Google papers. 

MapReduce, while key to the computational part of Hadoop, turned out to be fairly tricky to write, so a number of Apache projects sprang up around it. With the benefit of hindsight, the most significant of these would be Apache Hive and Apache Spark.

Hive provided a database-like SQL abstraction on top of MapReduce, while Apache Spark moved operations to memory, building a Directed Acyclic Graph (DAG) of operations to be performed on in-memory RDDs (Resilient Distributed Datasets).
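To get a feel for the MapReduce programming model mentioned above, here is a minimal single-process sketch of the classic word count. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the input strings are made up for illustration, and a real framework would run the map and reduce steps on many machines.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input splits; in Hadoop these would be blocks of files on HDFS.
splits = ["the lake is deep", "the data lake is wide"]

pairs = [pair for split in splits for pair in map_phase(split)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(word_counts)
```

Even for something this small you have to think in terms of keys, emitted pairs and grouping, which hints at why SQL-style abstractions like Hive were so welcome.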

## The Advent of the Data Lake

![Data Lake](images/logos/datalake.png)

The term Data Lake was coined in 2011 by the CTO of Pentaho to contrast it with the concept of a Data Mart. To him, the Data Mart was a targeted set of tables around curated data, whereas the promise of Hadoop's ecosystem was to store the raw data directly, avoiding the up-front work of deciding what was important, and to work with heterogeneous data. This resonated with the massive growth in data, where the reigning philosophy was to store it, just in case.

## The Rise of AWS

![AWS Savior](images/aws_savior.png)

As many companies soon found out, actually running Hadoop was a pain: maintaining HDFS alongside all the various distributed server technologies needed to query the data was the domain of highly skilled (and expensive) engineers. AWS launched its Simple Storage Service (S3) in 2006, allowing companies to offload their Hadoop implementations onto S3, which became the standard for object storage. There was much rejoicing.

## The Importance of File Formats

Each iteration of distributed file shares has given us better ways of managing files in a multi-client, fault-tolerant manner. Once we can store petabytes of data in files, the file formats themselves become a key factor in maximizing query performance.

Let's walk through the most common file formats used in modern Data Engineering.

# The Three Wise Row-Oriented File Formats

## The CSV

![CSV Logo](images/logos/csv_logo.png)

The CSV is the workhorse of Data Engineering, predating Personal Computers by over a decade. Everyone understands CSV and pretty much every system can generate CSVs.

Plain text, human-readable, and even Jupyter can read it - what's not to love?

In [None]:
import polars as pl
import warnings
warnings.simplefilter('ignore')

In [None]:
df = pl.read_csv('data/10.csv')
df

While easy to read, a CSV is just text, so each column's datatype has to be inferred. Here that was done by a `CSVSniffer`, which by default samples the first 100 rows and guesses at the correct datatype.

If that sounds error-prone, that's because it is!
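Here is a small sketch of why sampling is risky. It is not how Polars implements inference; `infer_column_type` and `infer_schema` are hypothetical helpers built on the standard library, and the data is made up. The point is only that a guess based on the first few rows can be contradicted by a later row.

```python
import csv
import io


def infer_column_type(values):
    """Guess a type from sample values: int if all parse as int, then float, else str."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def infer_schema(csv_text, sample_rows):
    """Infer one type per column by looking only at the first `sample_rows` data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = [row for _, row in zip(range(sample_rows), reader)]
    columns = zip(*sample)
    return {name: infer_column_type(col) for name, col in zip(header, columns)}


# Hypothetical data: the third row holds a value that is not an integer.
csv_text = "id,zip_code\n1,90210\n2,10001\n3,N/A\n"

small_sample = infer_schema(csv_text, sample_rows=2)
full_scan = infer_schema(csv_text, sample_rows=100)
print(small_sample["zip_code"].__name__)  # int
print(full_scan["zip_code"].__name__)     # str
```

A reader that trusted the two-row sample would either fail on row three or have to fall back to a wider type halfway through the file.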

Of course the worst offender is that there is no standard for CSV files (well, technically there's [RFC 4180](https://datatracker.ietf.org/doc/html/rfc4180) but no-one seems to care) - if you look at any CSV parsing library or function, they are forced to handle any number of potential formats. `polars.read_csv` has 33 arguments, `pandas.read_csv` has 49. This makes portability of CSV difficult, as there are a lot of edge cases to handle across systems.

The other important aspect of CSVs to understand is that they are *row-oriented*. That was briefly mentioned previously, so let's dive into what that means:

Given this data:

![Example Data](images/columnar_vs_row.png)

A CSV file would look like this to the parser:
`Seller,Product,Sales ($)\nJames,Shoes,20.00\nKirk,Shoes,27.50\nPicard,Socks,5.00`

If I want to sum up all the sales, the scanner needs to read through each character one-by-one to identify the `,` separator which signifies a column and `\n` which signifies a row.

![CSV Parser](images/csv_reader.png)

Then it would throw out two-thirds of the data it read into memory, and finally convert the `Sales ($)` strings into floats and do the sum.
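The scan described above can be sketched in a few lines of plain Python. This is a deliberately naive toy (the `sum_sales` name is made up, and it ignores quoting and escaping, which real CSV readers must handle), but it shows that every character of every column gets touched just to sum one of them.

```python
raw = "Seller,Product,Sales ($)\nJames,Shoes,20.00\nKirk,Shoes,27.50\nPicard,Socks,5.00"


def sum_sales(text: str) -> float:
    """Sum the third column of a simple CSV by scanning it character by character."""
    total = 0.0
    fields, current = [], []
    is_header = True
    for ch in text + "\n":  # the extra newline flushes the final row
        if ch == ",":
            # End of a column: keep the field we just read.
            fields.append("".join(current))
            current = []
        elif ch == "\n":
            # End of a row: keep the last field, then use only the sales value.
            fields.append("".join(current))
            current = []
            if not is_header:
                # Seller and Product were parsed too, only to be discarded here.
                total += float(fields[2])
            is_header = False
            fields = []
        else:
            current.append(ch)
    return total


total_sales = sum_sales(raw)
print(total_sales)  # 52.5
```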

It remains a fact of Data Engineering life that you'll have to deal with CSVs, and luckily a lot of engineering effort has gone into building very performant CSV readers that can automatically handle lots of CSV oddities. That doesn't mean we should settle for CSVs - there's a rich suite of superior alternatives!

## The JSON file

![JSON logo](images/logos/json_logo.jpg)

A step up from CSV, JSON was designed in 2001 and formalized in 2013 as an [ECMA standard](https://ecma-international.org/publications-and-standards/standards/ecma-404/), making it much more portable. JSON has simple datatypes, and each record can be processed independently, since the metadata is present in every record. That comes at the cost of verbosity though, as each key is repeated for every record, and the format is still row-based.

```json
[
    {"Seller": "James", "Product": "Shoes", "Sales": 20.00}, 
    {"Seller": "Kirk", "Product": "Shoes", "Sales": 27.50}, 
    {"Seller": "Picard", "Product": "Socks", "Sales": 5.00}
]
```
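To make the verbosity point concrete, here is a quick comparison using only the standard library. The records are the same example data as above; the exact byte counts depend on serializer settings, so treat the printed sizes as indicative only.

```python
import csv
import io
import json

records = [
    {"Seller": "James", "Product": "Shoes", "Sales": 20.00},
    {"Seller": "Kirk", "Product": "Shoes", "Sales": 27.50},
    {"Seller": "Picard", "Product": "Socks", "Sales": 5.00},
]

# Serialize the same records as JSON...
json_text = json.dumps(records)

# ...and as CSV, where the column names appear once in the header.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Seller", "Product", "Sales"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Every JSON record carries its own keys; the CSV states them once.
print(json_text.count('"Seller"'), csv_text.count("Seller"))
print(len(json_text), len(csv_text))
```

The gap grows with the number of records, since the key overhead is paid again on every one of them.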

In [None]:
# This generates the json file for demo purposes
# df.filter(pl.col("recommendationid").is_not_null()).write_json("data/10.json")

In [None]:
!jq '.[0]' data/10.json

In [None]:
df = pl.read_json('data/10.json')
df.head()

## Apache Avro

![Apache Avro Logo](images/logos/avro_logo.png)

Compared to its siblings, Avro is a youngster, having joined the Apache Hadoop project in 2009. It is used mainly as a data interchange format, much like JSON, but it is a binary format with a schema defined in JSON. Avro is row-oriented, and is a common format in message brokers like Kafka.

Avro brings a richer type system, at the cost of human readability. An Avro schema is defined as JSON and would look something like this:
```json
{"namespace": "acme.avro",
 "type": "record",
 "name": "Sales",
 "fields": [
     {"name": "Seller", "type": ["string", "null"]},
     {"name": "Product",  "type": "string"},
     {"name": "Sales", "type": "float"}
 ]
}
```

The data is then encoded into the Avro binary format based on the schema, and the consumer would use the schema to decode the incoming binary data.
While used in the Hadoop ecosystem to transmit data back and forth between nodes, Avro is not commonly seen as the format used to store the actual data in a Data Lake, serving more as an excellent way to store metadata.
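To see why the consumer needs the schema, here is a hand-rolled sketch of how one `Sales` record from the schema above would be laid out in Avro's binary encoding: integers as zigzag varints, strings as a length followed by UTF-8 bytes, floats as 4 little-endian bytes, and unions as a branch index followed by the value. The helper names (`encode_sale`, `decode_sale` and friends) are hypothetical, and this covers only the record body, not the Avro container file with its header and sync markers. In real code you would use the `avro` library as in the cells below.

```python
import struct


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-encode, then emit 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf: bytes, pos: int):
    """Inverse of encode_long; returns (value, next position)."""
    shift = result = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def decode_string(buf: bytes, pos: int):
    length, pos = decode_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def encode_sale(record: dict) -> bytes:
    """Write fields in schema order; no field names are stored."""
    seller = record["Seller"]
    # Union ["string", "null"]: branch 0 is string, branch 1 is null.
    out = encode_long(1) if seller is None else encode_long(0) + encode_string(seller)
    out += encode_string(record["Product"])
    out += struct.pack("<f", record["Sales"])
    return out


def decode_sale(buf: bytes) -> dict:
    """Read fields back in the same order the schema dictates."""
    branch, pos = decode_long(buf, 0)
    seller = None
    if branch == 0:
        seller, pos = decode_string(buf, pos)
    product, pos = decode_string(buf, pos)
    (sales,) = struct.unpack_from("<f", buf, pos)
    return {"Seller": seller, "Product": product, "Sales": sales}


record = {"Seller": "James", "Product": "Shoes", "Sales": 20.0}
encoded = encode_sale(record)
print(len(encoded))  # 17
print(decode_sale(encoded))
```

The bytes contain the values but none of the keys, which is where the size savings over JSON come from, and also why the bytes are meaningless without the schema.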

In [None]:
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import json

avro_schema = {"namespace": "reviews.avro",
 "type": "record",
 "name": "Review",
 "fields": [
     {"name": "recommendationid", "type": ["int", "null"]},
     {"name": "language",  "type": "string"},
     {"name": "timestamp_created", "type": "int", "logicalType": "time-millis"},
     {"name": "timestamp_updated", "type": "int", "logicalType": "time-millis"},
     {"name": 'voted_up', "type": "int"},
     {"name": 'votes_up', "type": "long"},
     {"name": 'votes_funny', "type": "long"},
     {"name": 'weighted_vote_score', "type": "float"},
     {"name": 'comment_count', "type": "long"},
     {"name": 'steam_purchase', "type": "int"},
     {"name": 'received_for_free',"type": "int"},
     {"name": 'written_during_early_access', "type": ["int", "null"]},
     {"name": 'hidden_in_steam_china', "type": "long"},
     {"name": 'steam_china_location', "type": ["string", "null"]},
     {"name": 'author_steamid', "type": "long"},
     {"name": 'author_num_games_owned', "type": "int"},
     {"name": 'author_num_reviews', "type": "int"},
     {"name": 'author_playtime_forever', "type": "int"},
     {"name": 'author_playtime_last_two_weeks', "type": "int"},
     {"name": 'author_playtime_at_review', "type": ["int", "null"]},
     {"name": 'author_last_played', "type": "int", "logicalType": "time-millis"}
 ]
}

reviews_schema = avro.schema.parse(json.dumps(avro_schema))

In [None]:
with open("data/10.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), reviews_schema)
    for record in df.filter(pl.col("recommendationid").is_not_null()).to_dicts():
        writer.append(record)
    writer.close()

In [None]:
!ls -lh ./data | awk '{print $5, $9}'

In [None]:
pl.read_avro('data/10.avro')

# The Angels of Column-Oriented File Formats
## The First Herald

![Apache ORC Logo](images/logos/apache_orc_logo.png)

Initially released in 2013, ORC was developed by Hortonworks, a now-defunct provider of Hadoop-as-a-platform, and Facebook, which has been heavily invested in the Hadoop ecosystem to handle its analytical needs. It was the successor to the RCFile format that was previously used in Hive.

ORC is our first example of a columnar data format - a typed binary format that is stored in columns, allowing for easy access to a given column of data.

![Column-oriented storage](images/column_storage.png)

Now we can leverage metadata to skip reading large parts of the file that we don't need, and the binary nature means we should get small files.
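As a toy illustration of the two layouts (plain Python lists, not ORC's actual on-disk structure), here is the same example data arranged row-wise and column-wise. The variable names are made up for the sketch.

```python
rows = [
    ("James", "Shoes", 20.00),
    ("Kirk", "Shoes", 27.50),
    ("Picard", "Socks", 5.00),
]

# Row-oriented: the values of one record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: the values of one column sit next to each other.
names = ("Seller", "Product", "Sales")
column_layout = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Summing sales touches one contiguous list and never looks at the other columns.
total = sum(column_layout["Sales"])
print(row_layout)
print(column_layout["Sales"], total)
```

Because each column holds values of a single type, a columnar layout also tends to compress better than interleaved rows.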

ORC is closely linked to the Hive ecosystem, and is commonly seen in organizations that invested heavily in Hive, such as Facebook.

In [None]:
from pyarrow import orc
# Known issue with all-null columns
orc.write_table(df.select(pl.all().exclude('steam_china_location')).to_arrow(), "data/10.orc")

In [None]:
!ls -lh ./data | awk '{print $5, $9}'

In [None]:
pl.from_arrow(orc.read_table('data/10.orc', columns=['language', 'votes_up']))

## The Savior

![Apache Parquet Logo](images/logos/Apache_Parquet_logo.png)

While only a month younger than its spiritual twin, Parquet has become the de facto standard of the data lake. Parquet was created by Twitter and Cloudera in 2013 to handle their Hadoop needs, and was based on another Google paper describing the [Dremel](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf) query system.

Generally considered more language-agnostic, Parquet has become the default choice outside of Hive implementations and is generally the go-to format when working in Python.

While Parquet and ORC are generally considered column-oriented data formats, this is not strictly true - they are hybrid formats, combining the strengths of row and column orientation through stripes (the ORC term) or row groups (the Parquet term). A row group contains a chunk of rows, stored column by column, along with metadata describing statistics of that data. This is a key detail in their implementation, as it allows query engines to skip parts of the file that aren't relevant to the query by reading the metadata for information such as number of rows, columns, min, max etc., depending on the writing engine.

The name of the game is to skip files - the most expensive part of any query is opening a file for reading. At scale, anything that lets us skip reading files will be key to performance.
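The skipping idea can be sketched without any Parquet machinery at all. Below, `RowGroup`, `write_row_groups` and `scan_greater_than` are hypothetical names for a toy model: each group stores its values plus min/max statistics, and the scan consults the statistics before touching the values. Real engines apply the same logic to row groups inside a file and, with the right metadata, to whole files.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    values: list
    min_value: float
    max_value: float


def write_row_groups(values, group_size):
    """Split values into fixed-size groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(chunk, min(chunk), max(chunk)))
    return groups


def scan_greater_than(groups, threshold):
    """Return matching values and how many groups actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group.max_value <= threshold:
            continue  # the statistics prove nothing in this group can match
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


# Made-up sales figures, written in groups of three.
sales = [5.0, 7.5, 9.0, 20.0, 27.5, 31.0, 3.0, 4.5, 8.0]
groups = write_row_groups(sales, group_size=3)
matches, groups_read = scan_greater_than(groups, threshold=15.0)
print(matches, f"read {groups_read} of {len(groups)} row groups")
```

Note that the pruning only pays off when similar values end up in the same group, which is why sorting or clustering data before writing matters so much in practice.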

![Parquet Architecture](images/parquet_format.jpeg)

In [None]:
df.write_parquet("data/10.parquet")

In [None]:
pl.read_parquet("data/10.parquet")

In [None]:
!ls -lh ./data | awk '{print $5, $9}'

## The Holy Spirit

![Apache Arrow Logo](images/logos/apache_arrow_logo.svg)

Apache Arrow is an in-memory data format specification. While this means that it's not a file format, it's a key player in the data landscape, as it specifies a shared memory format for tools to adopt. This means that tools can perform zero-copy conversions between representations, as long as they can understand the Arrow specification. 

In the Python ecosystem, many tools have moved towards adopting Arrow as the native memory format. This includes Pandas, Polars, DuckDB and a whole host of other libraries. The ecosystem also contains tools such as Arrow Flight, an RPC protocol for exchanging Arrow data between client and server, Arrow Flight SQL as a server specification for SQL, as well as Arrow Database Connectivity (ADBC), which aims to provide a client-side abstraction on top.

In short, Arrow is the lingua franca of exchanging data, and many of the examples in this notebook are driven by the `pyarrow` library, the official Python library built as bindings to the Arrow C++ implementation.
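If "zero-copy" sounds abstract, the standard library offers a rough analogy. This is not Arrow itself, just Python's buffer protocol: a `memoryview` lets a second consumer look at the same bytes a producer owns, whereas a copy gets its own memory. Arrow applies the same idea across tools by agreeing on how the shared buffers are laid out.

```python
from array import array

# A "producer" owns a buffer of 64-bit floats.
sales = array("d", [20.0, 27.5, 5.0])

# A "consumer" gets a view onto the same memory; no bytes are copied.
view = memoryview(sales)

# A copy, for contrast, has its own separate memory.
copied = array("d", sales)

# Changing the producer's data is visible through the view, but not in the copy.
sales[0] = 99.0
print(view[0], copied[0])  # 99.0 20.0
```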

In [None]:
import pyarrow.parquet as pq

In [None]:
df = pq.read_table('data/10.parquet')

In [None]:
df