## **Overview of Content Addressable aRchives (CAR / .car)**

**_References_**:
- [CAR v1 Spec](https://ipld.io/specs/transport/car/carv1/)
- [CAR v2 Spec](https://ipld.io/specs/transport/car/carv2/)

The CAR format (Content Addressable aRchives) can be used to store content-addressable objects in the form of IPLD block data as a sequence of bytes, typically in a file with a .car filename extension.
> NOTE: The name Certified ARchive has also previously been used to refer to the CAR format.
    
The CAR format is intended as a serialized representation of any IPLD DAG (graph) as the concatenation of its blocks, plus a header that describes the graphs in the file (via root CIDs). The requirement for the blocks in a CAR to form coherent DAGs is not strict, so the CAR format may also be used to store arbitrary IPLD blocks.

In addition to the binary block data, storage overhead for the CAR format consists of:
- A header block encoded as [DAG-CBOR](https://github.com/ipld/specs/blob/a3c982518232b79123af2a2cf5e8642162c62524/block-layer/codecs/dag-cbor.md) containing the format version and an array of root CIDs
- A CID for each block preceding its binary data
- A compressed integer prefixing each block (including the header block) indicating the total length of that block, including the length of the encoded CID
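The per-block overhead listed above can be estimated with a few lines of arithmetic. This is a minimal sketch: the helper names `varint_size` and `block_overhead` are hypothetical, and the 36-byte default CID length is an assumption that holds for a CIDv1 with a sha2-256 multihash (1 byte version, 1 byte codec, 2 bytes multihash prefix, 32 bytes digest).

```python
def varint_size(n: int) -> int:
    """Number of bytes an unsigned LEB128 varint needs to encode n."""
    if n < 0:
        raise ValueError("varints encode non-negative integers only")
    # Each varint byte carries 7 payload bits; zero still needs one byte.
    return max(1, (n.bit_length() + 6) // 7)


def block_overhead(block_len: int, cid_len: int = 36) -> int:
    """Bytes a CAR adds around one block: the length prefix plus the CID.

    cid_len defaults to 36, an assumption valid for CIDv1 + sha2-256.
    """
    # The length prefix counts the CID and the block data together.
    return varint_size(cid_len + block_len) + cid_len


for size in (98, 918, 1_048_576):
    print(f"{size}-byte block -> {block_overhead(size)} bytes of overhead")
```

For typical block sizes the overhead is dominated by the CID; the length prefix only grows by one byte each time the entry length gains another 7 bits.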

This diagram shows how IPLD blocks, their root CID, and a header combine to form a CAR.


<center><img src="https://ipld.io/specs/transport/car/content-addressable-archives.png" alt="CAR Format" width="50%"></center>

#### **Format Description**
The CAR format comprises a sequence of length-prefixed IPLD block data, where the first block in the CAR is the Header encoded as DAG-CBOR, and the remaining blocks form the Data component of the CAR and are each additionally prefixed with their CIDs. The length prefix of each block in a CAR is encoded as a "varint", an unsigned [LEB128](https://en.wikipedia.org/wiki/LEB128) integer. This integer specifies the number of remaining bytes for that block entry, excluding the bytes used to encode the integer but including the CID for non-header blocks.

<div style="text-align: center">
<pre>|--------- Header --------| |---------------------------------- Data -----------------------------------|</pre>

<pre>[ varint | DAG-CBOR block ] [ varint | CID | block ] [ varint | CID | block ] [ varint | CID | block ] …</pre>
</div>
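The layout above can be exercised with a small pure-Python sketch. It is not a full CAR decoder: the header is built by hand and left undecoded, only raw-codec CIDv1 identifiers with sha2-256 are generated, and all function names (`encode_varint`, `decode_varint`, `cid_length`, `iter_sections`, `split_block`, `raw_cid`, `section`) are hypothetical helpers introduced for illustration.

```python
import hashlib
from typing import Iterator, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if n < 0:
        raise ValueError("varints encode non-negative integers only")
    out = bytearray()
    while True:
        low7, n = n & 0x7F, n >> 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint at buf[pos]; return (value, position after it)."""
    value = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def cid_length(buf: bytes, pos: int = 0) -> int:
    """Length in bytes of the binary CID that starts at buf[pos]."""
    if buf[pos:pos + 2] == b"\x12\x20":
        return 34  # CIDv0 is a bare sha2-256 multihash
    start = pos
    _version, pos = decode_varint(buf, pos)
    _codec, pos = decode_varint(buf, pos)
    _hash_code, pos = decode_varint(buf, pos)
    digest_len, pos = decode_varint(buf, pos)
    return pos - start + digest_len


def iter_sections(car: bytes) -> Iterator[bytes]:
    """Yield each length-prefixed section of a CARv1 (header first)."""
    pos = 0
    while pos < len(car):
        length, pos = decode_varint(car, pos)
        if pos + length > len(car):
            raise ValueError("truncated section")
        yield car[pos:pos + length]
        pos += length


def split_block(section_bytes: bytes) -> Tuple[bytes, bytes]:
    """Split a data section into (CID bytes, block bytes)."""
    n = cid_length(section_bytes)
    return section_bytes[:n], section_bytes[n:]


def raw_cid(data: bytes) -> bytes:
    """CIDv1, raw codec (0x55), sha2-256 multihash (0x12, 32 bytes)."""
    return b"\x01\x55\x12\x20" + hashlib.sha256(data).digest()


def section(payload: bytes) -> bytes:
    """Prefix a payload with its varint length."""
    return encode_varint(len(payload)) + payload


# Build a tiny synthetic CAR: a header with one root, then two raw blocks.
blocks = [b"hello", b"world"]
root = raw_cid(blocks[0])
# Hand-encoded DAG-CBOR map {"roots": [CID], "version": 1}; a CID is
# CBOR tag 42 wrapping a byte string with a leading 0x00 byte.
header = (b"\xa2" + b"\x65roots" + b"\x81" + b"\xd8\x2a\x58\x25\x00" + root
          + b"\x67version" + b"\x01")
car = section(header) + b"".join(section(raw_cid(b) + b) for b in blocks)

sections = list(iter_sections(car))
header_bytes, decoded = sections[0], [split_block(s) for s in sections[1:]]
for cid, data in decoded:
    print(cid.hex()[:16] + "...", data)
```

Decoding the header itself would need a DAG-CBOR library, which is deliberately left out here to keep the sketch dependency-free.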

### **Updates from CARv1 to CARv2**
CARv2 is a minimal upgrade to the CARv1 format with the primary aim of adding an optional index within the format for fast random access to blocks.

CARv2 makes use of CARv1 by wrapping a properly formed CARv1 with a prefix containing a pragma and header, and a suffix containing the optional index data. Once the offset and length of the CARv1 bytes are determined using CARv2 parsing rules, an existing CARv1 decoder could be used to read the roots and CID:Bytes pairs, though this is not necessarily ideal. Likewise, a CARv1 encoder could be used to encode this data for wrapping by a CARv2 encoder, as the payload is the same format.

#### **Format Description**

1. An 11-byte pragma that identifies the data as a CARv2 format.
2. A header describing some characteristics of the CARv2 as well as the locations of the data payload and index payload within the CARv2.
3. A standard CARv1 data payload, including standard CARv1 header and roots and sequence of CID:Bytes pairs.
4. An optional index payload, which may be one of a number of supported index formats, allowing for fast lookups of blocks within the data payload.

The CARv2 format can be illustrated as follows:

<center><img src="https://ipld.io/specs/transport/car/carv2/carv2-sections.png" alt="CARv2 Format"></center>


<div style="text-align: center">
<pre>| 11-byte fixed pragma | 40-byte header | optional padding | CARv1 data payload | optional padding | optional index payload |</pre>
</div>
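The wrapping described above can be sketched with the standard `struct` module. The field layout used here follows the CARv2 specification linked in the references (a 16-byte characteristics bitfield followed by three little-endian 64-bit integers for data offset, data size, and index offset); the function names `wrap_carv1` and `unwrap_carv2` are hypothetical, and the sketch writes no padding and no index.

```python
import struct

# The fixed 11-byte pragma: varint 10, then DAG-CBOR for {"version": 2}.
CARV2_PRAGMA = bytes.fromhex("0aa16776657273696f6e02")
# characteristics (16 bytes), data offset, data size, index offset (LE uint64).
HEADER_FORMAT = "<16sQQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 40 bytes


def wrap_carv1(carv1: bytes) -> bytes:
    """Wrap CARv1 bytes in a CARv2 envelope with no padding and no index."""
    # Offsets are measured from the first byte of the pragma.
    data_offset = len(CARV2_PRAGMA) + HEADER_SIZE
    # An index offset of 0 signals that no index is present.
    header = struct.pack(HEADER_FORMAT, bytes(16), data_offset, len(carv1), 0)
    return CARV2_PRAGMA + header + carv1


def unwrap_carv2(carv2: bytes) -> bytes:
    """Return the inner CARv1 payload located via the CARv2 header."""
    if carv2[:len(CARV2_PRAGMA)] != CARV2_PRAGMA:
        raise ValueError("not a CARv2: pragma mismatch")
    _characteristics, data_offset, data_size, _index_offset = struct.unpack_from(
        HEADER_FORMAT, carv2, len(CARV2_PRAGMA)
    )
    payload = carv2[data_offset:data_offset + data_size]
    if len(payload) != data_size:
        raise ValueError("data payload is truncated")
    return payload


inner = b"\x01\x02\x03"  # placeholder bytes standing in for a real CARv1
outer = wrap_carv1(inner)
print(len(outer), unwrap_carv2(outer) == inner)
```

Because the payload is located purely by offset and size, the bytes returned by `unwrap_carv2` can be handed to any CARv1 reader unchanged.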

### Why the need to create CARs?
Storing content on the Filecoin network is not like the typical storage systems that consumers use (such as Dropbox, AWS, or OneDrive), which store objects. Content in Filecoin is stored as flat files, each known as a [Filecoin Piece](https://spec.filecoin.io/#section-systems.filecoin_files.piece). The "Piece" in Filecoin Piece represents a whole or part of a file that is converted into an IPLD directed acyclic graph (DAG), identified by a hash called a CID or Payload CID. To turn the "Piece" into a Filecoin Piece, the IPLD DAG is serialized into a "Content Addressable aRchive" (.car), which is a raw-bytes format.

### Going from Files to CARs







In [70]:
from pathlib import Path
import sys
import subprocess

# Create a variable to store the product name and resolve the path to the root folder, two directories up.
product_name = "l4b"
source_folder = Path(f"../../data/gedi/{product_name}").resolve()

# Create a variable to store the output directory path combined with the product name
output_folder = Path(f"../../data/car_files/{product_name}").resolve()

# Collect (parent directory, filename) pairs for every file in the source folder, searching child folders recursively
product_metadata = []
for file_path in source_folder.glob("**/*"):
    if file_path.is_file():
        product_metadata.append((file_path.parent, file_path.name))

# Print the help text for the ez-prep command
result = subprocess.run("singularity ez-prep help", shell=True, capture_output=True)
print(result.stdout.decode())


NAME:
   singularity ez-prep - Prepare a dataset from a local path

USAGE:
   singularity ez-prep [command options] <path>

CATEGORY:
   Utility

DESCRIPTION:
   This commands can be used to prepare a dataset from a local path with minimum configurable parameters.
   For more advanced usage, please use the subcommands under `storage` and `data-prep`.
   You can also use this command for benchmarking with in-memory database and inline preparation, i.e.
     mkdir dataset
     truncate -s 1024G dataset/1T.bin
     singularity ez-prep --output-dir '' --database-file '' -j $(($(nproc) / 4 + 1)) ./dataset

OPTIONS:
   --max-size value, -M value       Maximum size of the CAR files to be created (default: "31.5GiB")
   --output-dir value, -o value     Output directory for CAR files. To use inline preparation, use an empty string (default: "./cars")
   --concurrency value, -j value    Concurrency for packing (default: 1)
   --database-file value, -f value  The database file to store the metada

In [102]:
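# Cap each generated CAR file at this size
max_size = "500mb"
# Pass the arguments as a list so that paths containing spaces need no shell quoting
command = ["singularity", "ez-prep", "--output-dir", str(output_folder), "--max-size", max_size, str(source_folder)]
result = subprocess.run(command, capture_output=True)
print(result.stdout.decode())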


PieceCID                                                          PieceSize  RootCID                                                      FileSize   StoragePath                                                           
baga6ea4seaqh4djtotm6pl5reqt4gtqa3kv7cgmbbw5veejtmq5lykpdfisfqla  536870912  bafkreicjxhag23doy3yssap7fox4moyemkwdysxpoeen4msnb4pu3msktq  422015414  baga6ea4seaqh4djtotm6pl5reqt4gtqa3kv7cgmbbw5veejtmq5lykpdfisfqla.car  
baga6ea4seaqmkvyi2tu53itcnpxf64xulykmpcpr3hnlsk4axioko22dmytvqoq  536870912  bafkreicx3u5geawjsezjxnji73eb2ibddygi4epaecmgfk2egkg723haei  497707749  baga6ea4seaqmkvyi2tu53itcnpxf64xulykmpcpr3hnlsk4axioko22dmytvqoq.car  
baga6ea4seaqke4s4lsenayy3h6qre4u7smhsvshkf4ai3zkqxljmo3xj7udwcgy  536870912  bafkreic4fbgzv73phric4c3eqjhizmh6wx4gdn7q6uapbkckac24gaba2a  441207453  baga6ea4seaqke4s4lsenayy3h6qre4u7smhsvshkf4ai3zkqxljmo3xj7udwcgy.car  
baga6ea4seaqhmcsc6btm7myw54jvrhjmru3aiv2ztidre4qdnrnmsq2foq3y6ma  536870912  bafkreiex3ecpl4myzmlx46k2mfadsjsyi6doak3ggk
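The PieceSize column can be sanity-checked against the FileSize column. Filecoin pieces are padded so that every 127 payload bytes occupy 128 bytes, and the padded size is a power of two. The sketch below computes the smallest padded size that fits a payload; `min_padded_piece_size` is a hypothetical helper, and it assumes FileSize is the size of the CAR payload. The tool may also pick the piece size from its own configuration, so agreement with the table is not guaranteed in general.

```python
def min_padded_piece_size(payload_bytes: int) -> int:
    """Smallest power-of-two padded piece size that can hold the payload.

    Assumes the 127-to-128 byte expansion used for Filecoin pieces and a
    minimum padded piece size of 128 bytes.
    """
    if payload_bytes <= 0:
        raise ValueError("payload must be at least one byte")
    size = 128
    # A padded piece of `size` bytes holds size * 127 / 128 payload bytes.
    while size * 127 // 128 < payload_bytes:
        size *= 2
    return size


# FileSize values copied from the first three rows of the table above.
for file_size in (422015414, 497707749, 441207453):
    print(file_size, "->", min_padded_piece_size(file_size))
```

For the three complete rows shown, the result matches the reported PieceSize of 536870912 bytes (512 MiB).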

In [104]:
# Gather the CAR files written by ez-prep
car_files = list(output_folder.glob("*.car"))

# Inspect the first CAR file and print its summary
command = f"car inspect {car_files[0]} --full"
result = subprocess.run(command, shell=True, capture_output=True)
decoded_stm = result.stdout.decode("utf-8")
print(decoded_stm)


Version: 1
Roots: bafybeif57b5qfwy7ucx4pmycol3bkpu3g3zg723rtn5iju5apkjivo7hvq
Root blocks present in data: Yes
Block count: 4
Min / average / max block length (bytes): 98 / 351 / 918
Min / average / max CID length (bytes): 36 / 36 / 36
Block count per codec:
	dag-pb: 4
CID count per multihash:
	sha2-256: 4
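The summary printed by `car inspect` is plain `key: value` text with tab-indented sub-entries, so it can be turned into a dictionary for later use. This is a rough sketch that assumes the layout shown in the output above; the text layout is not a documented stable interface, and `parse_inspect` is a hypothetical helper.

```python
def parse_inspect(text: str) -> dict:
    """Parse `car inspect`-style text into a (possibly nested) dict."""
    summary = {}
    current = None  # name of the group that indented lines belong to
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.strip().partition(":")
        value = value.strip()
        if line[0].isspace() and current is not None:
            summary[current][key] = value  # indented sub-entry
        elif value:
            summary[key] = value
            current = None
        else:
            summary[key] = {}  # a heading such as "Block count per codec:"
            current = key
    return summary


# Sample text copied from the output above.
sample = (
    "Version: 1\n"
    "Roots: bafybeif57b5qfwy7ucx4pmycol3bkpu3g3zg723rtn5iju5apkjivo7hvq\n"
    "Root blocks present in data: Yes\n"
    "Block count: 4\n"
    "Min / average / max block length (bytes): 98 / 351 / 918\n"
    "Min / average / max CID length (bytes): 36 / 36 / 36\n"
    "Block count per codec:\n"
    "\tdag-pb: 4\n"
    "CID count per multihash:\n"
    "\tsha2-256: 4\n"
)
summary = parse_inspect(sample)
print(summary["Roots"])
print(summary["Block count per codec"])
```

All values are kept as strings; converting counts to integers is left to the caller.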



In [101]:
import re

# Use a regex to extract the root CID from the `car inspect` output
hash_value = re.search(r"Roots: (.+?)\n", decoded_stm).group(1)

# List the contents of the first CAR file in verbose mode
command = f"car ls -v {car_files[0]}"
result = subprocess.run(command, shell=True, capture_output=True)
decoded_stm = result.stdout.decode("utf-8")
print(decoded_stm)


dag-pb: bafybeif57b5qfwy7ucx4pmycol3bkpu3g3zg723rtn5iju5apkjivo7hvq
	3 links. 2 bytes
		comp[5.7 MB] bafybeiepmyhpbpn43o7wldfkuurufd7dcidbyob3435wwqll3djfnakeei
		data[2.5 GB] bafybeifurw2z3xmkrflcqy2xfpr34ys7dneodveyeniov7voiss5azkdhu
		guide[295 B] bafybeidgkjpwkmajsdijlkearn5c4ogtxloqogags5cfvpn2riwsjxsvde
	Unixfs Directory
dag-pb: bafybeiepmyhpbpn43o7wldfkuurufd7dcidbyob3435wwqll3djfnakeei
	3 links. 2 bytes
		GEDI_L4B_ATBD_V2.0.pdf[3.0 MB] bafybeifcodio3sdrmcpykzrqjdluyiyeds3gmfsocjuqr4oghkrula4may
		GEDI_L4B_Gridded_Biomass_V2_1.pdf[2.0 MB] bafybeihohmjyjixpqwszocv6hxnaomegxm627ratchdnqmbyxaptcxtgke
		gedi_l4b_excluded_granules_v21.json[759 kB] bafkreihzklpkkhzsmgmupljxtnjbdz2gkcynan24mxinrt5vncb67oo2li
	Unixfs Directory
dag-pb: bafybeifurw2z3xmkrflcqy2xfpr34ys7dneodveyeniov7voiss5azkdhu
	10 links. 2 bytes
		GEDI04_B_MW019MW223_02_002_02_R01000M_MI.tif[14 MB] bafybeiakure7paspbvxqfj4b64svy3vtxqjc72kww6doykphaq63w23ctu
		GEDI04_B_MW019MW223_02_002_02_R01000M_MU.tif[503 MB] bafybeie