# The History of the Datalake

There are a million implementations of the distributed file share, but the idea really took off with the publication of a Google paper entitled ["The Google File System"](https://research.google/pubs/the-google-file-system/), a description of how Google had implemented a fault-tolerant distributed file system with data redundancy, running on cheap commodity hardware while still serving a large number of clients. This paper, combined with another Google paper on [MapReduce](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/) that laid out a programming model for working effectively with distributed data, was the foundation for Hadoop, which was born at Yahoo in 2006.

## The dawn of Hadoop

![Hadoop logo](images/logos/hadoop_logo.png)

Hadoop was an ecosystem built around the ideas presented in the Google papers. MapReduce, while key to the computational part of Hadoop, turned out to be fairly tricky to write, so a number of Apache projects sprang up around it. Today, the most significant of these are Apache Hive and Apache Spark. Hive provided a database-like SQL abstraction on top of MapReduce, while Spark changed the programming model, moving towards building a Directed Acyclic Graph (DAG) of operations to be performed on RDDs (Resilient Distributed Datasets).

## The advent of the datalake

![Data Lake](images/logos/datalake.png)

The concept of a Data Lake was coined in 2011 by the CTO of Pentaho to contrast with the concept of a Data Mart. The Data Mart was a targeted set of tables around curated data. The promise of Hadoop's ecosystem was to be able to store the raw data directly, avoiding the up-front work of deciding what was important, as well as being able to work with heterogeneous data.

## The rise of AWS

Many soon found out that running Hadoop was a pain: maintaining HDFS alongside all the distributed server technologies needed to query the data was the domain of highly skilled (and expensive) engineers. AWS launched its Simple Storage Service (S3) in 2006, allowing companies to offload their Hadoop storage onto S3, which went on to become the standard for object storage.

## The importance of file formats

Each iteration of the distributed file share has given us better ways of managing files in a multi-client, fault-tolerant manner. Once we can store petabytes of data in files, the file formats themselves become key to maximizing the performance of the client.

Let's walk through the most common file formats used in modern Data Engineering:

# The Row-Oriented Formats

## The CSV

![CSV Logo](images/logos/csv_logo.png)

The CSV is the workhorse of Data Engineering - everyone understands CSV and pretty much every system can generate CSVs. 

Plain text, human readable, even Jupyter can read CSV, what's not to love?

In [1]:
import polars as pl

In [2]:
df = pl.read_csv('data/10.csv')
df

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
,"""1351380029""",1,0,0,0,0,1.0,0,0,0,,76561198008269840,"""0.0""",2,95521,0,37952,1547225350,,
149330962,"""russian""",1698868683,1698868683,1,0,0,0.0,0,1,0,0,1,,76561199093871104,62,26,29,10,29,1698427415
149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194
127959835,"""schinese""",1670214704,1698915106,1,0,0,0.0,0,1,0,0,1,,76561199209656688,47,11,1179,0,1179,1694817767


While easy to read, a CSV is just text, so each column's datatype has been inferred through a `CSVSniffer`, which by default samples the first 100 rows and guesses at the correct datatype.

If that sounds error-prone, that's because it is!
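To see why sampling is fragile, here is a minimal sketch using only the standard library. The `infer_type` helper and the sample data are hypothetical, and this is not how Polars implements its inference; it only mimics the idea of guessing a type from the first few rows.

```python
import csv
import io

# Hypothetical data: the id column looks numeric until the last row.
raw = "id,score\n1,0.5\n2,0.7\n3,0.9\nN/A,1.0\n"


def infer_type(values):
    """Guess the narrowest type that fits every sampled value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


rows = list(csv.DictReader(io.StringIO(raw)))

# Infer from only the first 3 rows, the way a sampling reader would.
sample = [row["id"] for row in rows[:3]]
guessed = infer_type(sample)
print(guessed.__name__)

# Applying the guess to the full column fails on the unsampled row.
try:
    ids = [guessed(row["id"]) for row in rows]
except ValueError as err:
    print(f"inference was wrong: {err}")
```

A real reader has to pick a strategy at this point: raise an error, fall back to strings, or silently turn the value into a null.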

Of course the worst offender is that there is no standard for CSV files - if you look at any CSV parsing library or function, they are forced to handle any number of potential formats. `polars.read_csv` has 33 arguments, `pandas.read_csv` has 49. This makes CSV hard to port between systems, as there are a lot of edge cases to handle. CSV is also row-oriented, which is a poor fit for analytical queries.

Given this data:

![Example Data](images/columnar_vs_row.png)

A CSV file would look like this to the parser:
`Seller,Product,Sales ($)\nJames,Shoes,20.00\nKirk,Shoes,27.50\nPicard,Socks,5.00`

If I want to sum up all the sales, the scanner needs to read through each character one-by-one to identify the `,` separator which signifies a column and `\n` which signifies a row.

![CSV Parser](images/csv_reader.png)

Then it would throw out 2/3rds of the data it read into memory and finally convert the `Sales ($)` string into floats and do the sum.
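The cost of that scan can be made concrete with a small sketch. It uses the standard library `csv` module rather than a production-grade reader, and simply counts how many fields have to be parsed to sum a single column of the example table.

```python
import csv
import io

# The example table, as the parser sees it.
raw = "Seller,Product,Sales ($)\nJames,Shoes,20.00\nKirk,Shoes,27.50\nPicard,Socks,5.00"

fields_read = 0
total = 0.0
reader = csv.reader(io.StringIO(raw))
header = next(reader)
sales_idx = header.index("Sales ($)")

for row in reader:
    # Every field in the row is parsed, even though only one is needed.
    fields_read += len(row)
    # The value arrives as text and must be converted before summing.
    total += float(row[sales_idx])

print(total)        # 52.5
print(fields_read)  # 9 fields parsed, only 3 of them used
```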

It remains a fact of Data Engineering that you'll have to deal with CSVs, and luckily a lot of engineering effort has gone into building very performant CSV readers that can automatically handle lots of CSV oddities.

## The JSON file

![JSON logo](images/logos/json_logo.jpg)

A step up from CSV, JSON has a formal [ECMA standard](https://ecma-international.org/publications-and-standards/standards/ecma-404/), making it much more portable. It comes at the cost of verbosity though, as each key is repeated for every record, and the format is still row-based.

```json
[
    {"Seller": "James", "Product": "Shoes", "Sales": 20.00}, 
    {"Seller": "Kirk", "Product": "Shoes", "Sales": 27.50}, 
    {"Seller": "Picard", "Product": "Socks", "Sales": 5.00}
]
```
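The verbosity is easy to measure. The sketch below serializes the three example records both ways with the standard library and counts how often a key appears. It is the same effect that makes `data/10.json` several times larger than `data/10.csv` in the directory listings later in this notebook.

```python
import csv
import io
import json

records = [
    {"Seller": "James", "Product": "Shoes", "Sales": 20.0},
    {"Seller": "Kirk", "Product": "Shoes", "Sales": 27.5},
    {"Seller": "Picard", "Product": "Socks", "Sales": 5.0},
]

as_json = json.dumps(records)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Seller", "Product", "Sales"])
writer.writeheader()
writer.writerows(records)
as_csv = buffer.getvalue()

# Keys appear once in the CSV header but once per record in JSON.
print(as_json.count('"Seller"'), as_csv.count("Seller"))
print(len(as_json), len(as_csv))
```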

In [26]:
# df.filter(pl.col("recommendationid").is_not_null()).write_json("data/10.json")

In [7]:
!jq '.[0]' data/10.json

{
  "recommendationid": 147937429,
  "language": "english",
  "timestamp_created": 1696875102,
  "timestamp_updated": 1717510986,
  "voted_up": 1,
  "votes_up": 3,
  "votes_funny": 0,
  "weighted_vote_score": 0.5267999172210693,
  "comment_count": 0,
  "steam_purchase": 1,
  "received_for_free": 0,
  "written_during_early_access": 0,
  "hidden_in_steam_china": 1,
  "steam_china_location":

In [8]:
df = pl.read_json('data/10.json')
df.head()

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,null,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196


## Apache Avro

![Apache Avro Logo](images/logos/avro_logo.png)

Avro joined the Apache Hadoop project in 2009 and is used mainly as a data interchange format. Like JSON it is row-oriented, but it is a binary format with a schema defined in JSON, and it is commonly used with message brokers like Kafka.

An Avro schema is defined as JSON and would look something like this:
```json
{"namespace": "acme.avro",
 "type": "record",
 "name": "Sales",
 "fields": [
     {"name": "Seller", "type": ["string", "null"]},
     {"name": "Product",  "type": "string"},
     {"name": "Sales", "type": "float"}
 ]
}
```

The data is then encoded into the Avro binary format based on the schema, and the consumer would use the schema to decode the incoming binary data.
While used in the Hadoop ecosystem to transmit data back and forth between nodes, Avro is not commonly seen as the format used to store data in a Data Lake.
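To give a feel for what "encoded based on the schema" means, here is a hand-rolled sketch of Avro's binary encoding for one record of the `Sales` schema above. The function names are hypothetical and this is not the `avro` library's API. It covers only the bare record body (zig-zag varints for lengths and union branches, UTF-8 strings, a 4-byte little-endian float), not the container file with its header and sync markers.

```python
import struct


def encode_long(n: int) -> bytes:
    """Zig-zag encode a signed 64-bit integer, then emit it as a base-128 varint."""
    z = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_sale(seller, product, sales) -> bytes:
    """Encode one Sales record: field values in schema order, no field names."""
    if seller is None:
        seller_bytes = encode_long(1)  # union branch 1 is "null", which has no body
    else:
        seller_bytes = encode_long(0) + encode_string(seller)  # branch 0 is "string"
    return seller_bytes + encode_string(product) + struct.pack("<f", sales)


payload = encode_sale("Kirk", "Shoes", 27.5)
print(len(payload), payload)
```

Note that no field names or type tags end up in the payload, which is why the consumer cannot decode it without the schema.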

In [9]:
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import json

# The timestamp columns hold epoch seconds and are stored as plain ints.
# Avro's `time-millis` logical type is a time of day, so it does not apply here,
# and a `logicalType` key placed on the field itself has no effect on the encoding.
avro_schema = {"namespace": "reviews.avro",
 "type": "record",
 "name": "Review",
 "fields": [
     {"name": "recommendationid", "type": ["int", "null"]},
     {"name": "language", "type": "string"},
     {"name": "timestamp_created", "type": "int"},
     {"name": "timestamp_updated", "type": "int"},
     {"name": "voted_up", "type": "int"},
     {"name": "votes_up", "type": "long"},
     {"name": "votes_funny", "type": "long"},
     {"name": "weighted_vote_score", "type": "float"},
     {"name": "comment_count", "type": "long"},
     {"name": "steam_purchase", "type": "int"},
     {"name": "received_for_free", "type": "int"},
     {"name": "written_during_early_access", "type": ["int", "null"]},
     {"name": "hidden_in_steam_china", "type": "long"},
     {"name": "steam_china_location", "type": ["string", "null"]},
     {"name": "author_steamid", "type": "long"},
     {"name": "author_num_games_owned", "type": "int"},
     {"name": "author_num_reviews", "type": "int"},
     {"name": "author_playtime_forever", "type": "int"},
     {"name": "author_playtime_last_two_weeks", "type": "int"},
     {"name": "author_playtime_at_review", "type": ["int", "null"]},
     {"name": "author_last_played", "type": "int"}
 ]
}

reviews_schema = avro.schema.parse(json.dumps(avro_schema))

In [10]:
with open("data/reviews.avro", "wb") as f:
    writer = DataFileWriter(f, DatumWriter(), reviews_schema)
    for record in df.filter(pl.col("recommendationid").is_not_null()).to_dicts():
        writer.append(record)
    writer.close()

In [11]:
!ls -hs ./data

total 1.3G
 26M 10.csv	 243M 578080.csv	   15M reviews.avro
129M 10.json	 820M 730.csv		   11M reviews.orc
 34M 289070.csv   14M appid_mapping.json  6.1M reviews.parquet


In [12]:
pl.read_avro('data/reviews.avro')

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i32,str,i32,i32,i32,i64,i64,f32,i64,i32,i32,i32,i64,str,i64,i32,i32,i32,i32,i32,i32
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
8227337,"""polish""",1387815443,1387815443,1,0,0,0.0,0,0,0,0,0,,76561198054766880,114,2,68669,0,47901,1524517419
149330962,"""russian""",1698868683,1698868683,1,0,0,0.0,0,1,0,0,1,,76561199093871104,62,26,29,10,29,1698427415
149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194
127959835,"""schinese""",1670214704,1698915106,1,0,0,0.0,0,1,0,0,1,,76561199209656688,47,11,1179,0,1179,1694817767


# The Column-oriented Formats

## Apache ORC

![Apache ORC Logo](images/logos/apache_orc_logo.png)

Initially released in 2013, ORC was developed by Hortonworks, a now-defunct provider of Hadoop-as-a-platform, and Facebook, which was heavily invested in the Hadoop ecosystem to handle its analytical needs. It was the successor to the RCFile format previously used in Hive.

ORC is our first example of a columnar data format - a typed binary format that is stored in columns, allowing for easy access to a given column of data.

![Column-oriented storage](images/column_storage.png)

Now we can leverage metadata to skip reading large parts of the file that we don't need, and the binary nature means we should get small files.

ORC is closely linked to the Hive ecosystem, and is commonly seen in organizations that invested heavily in Hive, such as Facebook.
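The difference between the two layouts can be sketched in plain Python. This is only an illustration of the idea using lists, not ORC's actual on-disk encoding.

```python
rows = [
    ("James", "Shoes", 20.0),
    ("Kirk", "Shoes", 27.5),
    ("Picard", "Socks", 5.0),
]

# Row-oriented: values of one record sit next to each other.
row_layout = [value for row in rows for value in row]

# Column-oriented: values of one column sit next to each other.
names = ("Seller", "Product", "Sales")
column_layout = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# Summing sales touches a single contiguous list and nothing else.
total = sum(column_layout["Sales"])
print(row_layout)
print(column_layout["Sales"], total)
```

Because each column holds values of a single type, a columnar file can also compress and encode each column on its own terms.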

In [13]:
from pyarrow import orc
# Known issue with all-null columns
orc.write_table(df.select(pl.all().exclude('steam_china_location')).to_arrow(), "data/reviews.orc")

In [14]:
!ls -hs data/

total 1.3G
 26M 10.csv	 243M 578080.csv	   15M reviews.avro
129M 10.json	 820M 730.csv		   12M reviews.orc
 34M 289070.csv   14M appid_mapping.json  6.1M reviews.parquet


In [15]:
pl.from_arrow(orc.read_table('data/reviews.orc', columns=['language', 'votes_up']))

language,votes_up
str,i64
"""english""",3
"""russian""",0
"""russian""",0
"""turkish""",0
"""brazilian""",1
…,…
"""polish""",0
"""russian""",0
"""english""",0
"""schinese""",0


## Apache Parquet

![Apache Parquet Logo](images/logos/Apache_Parquet_logo.png)

The other column-oriented file format, Parquet was created by Twitter and Cloudera in 2013, announced only a month after ORC.

Generally considered more language-agnostic, Parquet has become the default choice outside of Hive implementations and is generally the go-to format when working in Python.

While Parquet and ORC are generally considered column-oriented data formats, this is actually not quite true - they are hybrid formats, combining the strengths of row and column orientation through stripes (the ORC term) or row groups (the Parquet term). A row group holds a horizontal slice of the rows, and within that slice the values of each column are stored together.

![Parquet Architecture](images/parquet_format.jpeg)
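A rough sketch of the hybrid layout, in plain Python: rows are first cut into groups, and each group is then stored column by column. The `to_row_groups` helper, the group size and the extra fourth row are made up for illustration. Real Parquet writers size row groups by bytes or row counts and add encoding and compression on top.

```python
def to_row_groups(rows, names, group_size):
    """Split rows into fixed-size groups, storing each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        groups.append({name: [row[i] for row in chunk] for i, name in enumerate(names)})
    return groups


rows = [
    ("James", "Shoes", 20.0),
    ("Kirk", "Shoes", 27.5),
    ("Picard", "Socks", 5.0),
    ("Sisko", "Hats", 12.0),  # hypothetical extra row to fill the second group
]
groups = to_row_groups(rows, ("Seller", "Product", "Sales"), group_size=2)
for group in groups:
    print(group)
```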

In [16]:
df.write_parquet("data/reviews.parquet")

In [18]:
pl.read_parquet("data/reviews.parquet")

recommendationid,language,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,steam_purchase,received_for_free,written_during_early_access,hidden_in_steam_china,steam_china_location,author_steamid,author_num_games_owned,author_num_reviews,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played
i64,str,i64,i64,i64,i64,i64,f64,i64,i64,i64,i64,i64,null,i64,i64,i64,i64,i64,i64,i64
147937429,"""english""",1696875102,1717510986,1,3,0,0.5268,0,1,0,0,1,,76561199550893216,35,23,59161,4738,58753,1717541057
166664841,"""russian""",1717510100,1717510100,1,0,0,0.0,0,1,0,0,1,,76561199161536896,24,11,436,71,385,1717512997
166664763,"""russian""",1717510009,1717510009,1,0,0,0.0,0,0,0,0,1,,76561198046827632,0,7,23750,7,23743,1717510490
166663001,"""turkish""",1717508182,1717508182,0,0,0,0.0,0,0,0,0,1,,76561199374468448,32,4,361,19,356,1717508513
166658743,"""brazilian""",1717503385,1717503385,1,1,0,0.52381,0,1,0,0,1,,76561198018922960,9,1,1497,0,1497,1478272196
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
8227337,"""polish""",1387815443,1387815443,1,0,0,0.0,0,0,0,0,0,,76561198054766880,114,2,68669,0,47901,1524517419
149330962,"""russian""",1698868683,1698868683,1,0,0,0.0,0,1,0,0,1,,76561199093871104,62,26,29,10,29,1698427415
149284037,"""english""",1698800321,1698800321,1,0,0,0.0,0,0,1,0,1,,76561199052025216,46,3,3367,2694,3016,1698892194
127959835,"""schinese""",1670214704,1698915106,1,0,0,0.0,0,1,0,0,1,,76561199209656688,47,11,1179,0,1179,1694817767


In [19]:
!ls -hs data/

total 1.3G
 26M 10.csv	 243M 578080.csv	   15M reviews.avro
129M 10.json	 820M 730.csv		   12M reviews.orc
 34M 289070.csv   14M appid_mapping.json  6.1M reviews.parquet



A key detail in both ORC and Parquet is that metadata is stored alongside the data. This allows query engines to skip parts of the file that aren't relevant to the query, as the metadata holds information such as row counts and per-column minimum and maximum values, depending on the writing engine.

The name of the game is to skip files - the most expensive part of any query is opening a file for reading. At scale, anything that lets us skip reading files will be key to performance. 
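The skipping logic itself is simple. The sketch below uses made-up row groups with min/max statistics for a `votes_up` column and a hypothetical `scan_greater_than` helper; a real engine reads the statistics from the file footer and applies the same idea per row group or per file.

```python
# Hypothetical row groups for a votes_up column, each with footer-style statistics.
row_groups = [
    {"min": 0, "max": 3, "values": [3, 0, 0, 1]},
    {"min": 0, "max": 0, "values": [0, 0, 0, 0]},
    {"min": 5, "max": 40, "values": [5, 12, 40, 7]},
]


def scan_greater_than(groups, threshold):
    """Return matching values and how many groups actually had to be read."""
    matches, groups_read = [], 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value can match, so skip the data
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read


matches, groups_read = scan_greater_than(row_groups, threshold=4)
print(matches, groups_read)
```

Only one of the three groups is read here. The more the data is sorted or clustered on the filtered column, the tighter the min/max ranges become and the more can be skipped.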

In [None]:
# This takes a while!
import polars as pl

# Connection settings for the MinIO endpoint
opts = {"aws_access_key_id": "minio", "aws_secret_access_key": "minio1234", "endpoint_url": "http://minio:9000"}
(
    pl.scan_csv("s3://datalake/extract/reviews/*.csv", storage_options=opts)
    .filter(pl.col("recommendationid").is_not_null())
    .sink_parquet("data/all_reviews.parquet")
)

## Apache Arrow

![Apache Arrow Logo](images/logos/apache_arrow_logo.svg)

Apache Arrow is an in-memory data format specification. While this means that it's not a file format, it's a key player in the data landscape, as it specifies a shared memory format for tools to adopt. This means that tools can perform zero-copy conversions between representations, as long as they can understand the Arrow specification. 
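As a loose analogy for what zero-copy means, the standard-library sketch below lets a second consumer view a buffer without duplicating it. This is not Arrow itself, only an illustration of sharing memory through an agreed layout (here, 64-bit floats) versus copying it.

```python
from array import array

# One producer owns a buffer of 64-bit floats.
sales = array("d", [20.0, 27.5, 5.0])

# A consumer that understands the layout can view the same memory without copying.
view = memoryview(sales)

# A copy, by contrast, duplicates the bytes.
copied = array("d", sales)

# A change made by the producer is visible through the view, but not in the copy.
sales[0] = 99.0
print(view[0], copied[0])
```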

In the Python ecosystem, many tools have moved towards adopting Arrow as their native memory format, including Pandas, Polars, DuckDB and a whole host of other libraries. The ecosystem also contains Arrow Flight, an RPC protocol for exchanging Arrow data between client and server, Arrow Flight SQL, a protocol for talking to SQL databases over Flight, and Arrow Database Connectivity (ADBC), which aims to provide a client-side abstraction on top.

In short, Arrow is the lingua franca of data exchange, and many of the examples in this notebook are driven by the `pyarrow` library, the Python implementation built as bindings on top of the Arrow C++ libraries.