# Analysis of different storage options for Helios packet data #
## JSON vs CSV vs Parquet vs Feather ##
JSON is the most common format for storing data in Helios, but it is not an efficient format for large amounts of data. This notebook compares the read time, write time, and file size of CSV, JSON, Parquet, and Feather for Helios packet data.

### JSON ###
JSON is a text-based format that is easy to read and write. It is the main format that the Telemetry website uses to receive incoming packet data from the vehicle. However, JSON is not efficient for storing large amounts of data: every record repeats its field names, so JSON files are larger than the other formats, and reading and writing them is slow.

### CSV ###
CSV is a text-based format that is easy to read and write, and it is a common format for storing tabular data. Because the column names appear only once in the header, CSV files are smaller than JSON files, but they are still much larger than the binary formats. Reading and writing CSV files is slow, though faster than JSON.
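
As a rough illustration of why CSV tends to be smaller than JSON, the sketch below encodes the same records both ways using only the standard library. The field names (timestamp, speed, battery_voltage) are hypothetical and are not taken from the real packet schema.

```python
import csv
import io
import json

# Hypothetical packet fields, used only for illustration.
records = [
    {"timestamp": 1700000000 + i, "speed": 42.5, "battery_voltage": 110.2}
    for i in range(100)
]

# JSON Lines: one object per line, so every line repeats all the keys.
json_text = "\n".join(json.dumps(record) for record in records)

# CSV: the column names are written once, in the header row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

json_bytes = len(json_text.encode("utf-8"))
csv_bytes = len(csv_text.encode("utf-8"))
print(f"JSON Lines: {json_bytes} bytes, CSV: {csv_bytes} bytes")
```

The gap grows with the number and length of field names, since JSON pays for them in every record.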

### Parquet ###
Parquet is a columnar storage format that is optimized for reading and writing large amounts of data. Parquet files are smaller than JSON and CSV files, and reading and writing them is much faster. Parquet files are also self-describing: they contain metadata that describes the structure of the data. Although Parquet is less common than JSON or CSV, it is supported by many machine learning and data analysis libraries.

### Feather ###
Feather is a binary columnar storage format that is optimized for reading and writing large amounts of data. Feather files are smaller than JSON and CSV files, and reading and writing them is much faster. Like Parquet files, Feather files are self-describing.

## Performance Benchmark Comparisons ##
We will compare the performance of reading and writing JSON, CSV, Parquet, and Feather files using the Elysia packet data. We will measure the time it takes to read and write each format and compare the file sizes of each format.

This benchmark uses Python's time module (time.time()) to measure how long it takes to read and write each format. Each operation is timed once, so the results include noise from effects such as disk caching. While this is not the most accurate way to measure performance, it provides a quick and easy estimate for each format.
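
If more stable numbers are needed, one option is a small helper built on time.perf_counter() that repeats each operation and keeps the fastest run. This is a minimal sketch, not the method used for the results in this notebook; the helper name time_call and the stand-in workload are hypothetical.

```python
import time


def time_call(label, func, repeats=3):
    """Run func() `repeats` times and report the fastest wall-clock time."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best:.3f} s (best of {repeats})")
    return result, best


# Stand-in workload; in this notebook the callable could instead be
# something like lambda: pd.read_csv('output.csv').
_, elapsed = time_call("sum of squares", lambda: sum(i * i for i in range(100_000)))
```

Taking the best of several runs reduces the influence of one-off delays, at the cost of running each multi-second read or write several times.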

In [5]:

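# Standard library
import os
import time

# Third-party: pandas for DataFrames, pyarrow for Parquet I/O
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Source dataset: the original JSON packet export
packetTrainingDataPath = "./training_data/Elysia.Packets.json"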

In [2]:
# Reading JSON
start_time = time.time()
df = pd.read_json(packetTrainingDataPath)
print("JSON read time:", time.time() - start_time)

# Reading CSV
start_time = time.time()
df = pd.read_csv('output.csv')
print("CSV read time:", time.time() - start_time)

# Reading Parquet
start_time = time.time()
df = pq.read_table('output.parquet').to_pandas()
print("Parquet read time:", time.time() - start_time)

# Reading feather
start_time = time.time()
df = pd.read_feather('output.feather')
print("Feather read time:", time.time() - start_time)


JSON read time: 53.84395909309387


CSV read time: 35.935688972473145
Parquet read time: 3.2210075855255127
Feather read time: 1.4769089221954346


In [4]:

# Writing JSON
start_time = time.time()
df.to_json('output.json', orient='records', lines=True)
print("JSON write time:", time.time() - start_time)

# Writing CSV
start_time = time.time()
df.to_csv('output.csv', index=False)
print("CSV write time:", time.time() - start_time)

# Writing Parquet
start_time = time.time()
table = pa.Table.from_pandas(df)
pq.write_table(table, 'output.parquet')
print("Parquet write time:", time.time() - start_time)

# Writing feather
start_time = time.time()
df.to_feather('output.feather')
print("Feather write time:", time.time() - start_time)


JSON write time: 38.231643199920654
CSV write time: 23.684067010879517
Parquet write time: 2.1993298530578613
Feather write time: 1.1889092922210693


In [7]:
# Size of original JSON in megabytes
print("Original JSON size (Megabytes):", os.path.getsize(packetTrainingDataPath) / 1024 / 1024)

# Size of JSON
print("JSON size (Megabytes):", os.path.getsize('output.json')/ 1024 / 1024)

# Size of CSV
print("CSV size (Megabytes):", os.path.getsize('output.csv')/ 1024 / 1024)

# Size of Parquet
print("Parquet size (Megabytes):", os.path.getsize('output.parquet')/ 1024 / 1024)

# Size of feather
print("Feather size (Megabytes):", os.path.getsize('output.feather')/ 1024 / 1024)




Original JSON size (Megabytes): 2871.9577856063843
JSON size (Megabytes): 2860.488380432129
CSV size (Megabytes): 1283.9465894699097
Parquet size (Megabytes): 149.94902420043945
Feather size (Megabytes): 145.51709175109863


## Conclusion ##
Based on the benchmark comparison, Feather is the best storage option for Helios packet data, with Parquet as a viable alternative. On this dataset, Feather took about 1.5 s to read and 1.2 s to write, and Parquet about 3.2 s and 2.2 s, compared with 35.9 s and 23.7 s for CSV and 53.8 s and 38.2 s for JSON. Both binary formats also reduced the file from about 2,860 MB as JSON (1,284 MB as CSV) to roughly 150 MB.
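
To put the differences in relative terms, the short calculation below derives speedup and size ratios against JSON. The values are rounded copies of the read times and file sizes printed in the benchmark cells above; nothing here is newly measured.

```python
# Rounded copies of the measurements printed by the benchmark cells.
read_s = {"JSON": 53.844, "CSV": 35.936, "Parquet": 3.221, "Feather": 1.477}
size_mb = {"JSON": 2860.49, "CSV": 1283.95, "Parquet": 149.95, "Feather": 145.52}

for fmt in read_s:
    speedup = read_s["JSON"] / read_s[fmt]   # how many times faster than JSON
    shrink = size_mb["JSON"] / size_mb[fmt]  # how many times smaller than JSON
    print(f"{fmt:8s} read {speedup:5.1f}x faster, {shrink:5.1f}x smaller than JSON")
```

These ratios apply to this one dataset and a single timed run per format, so they are best read as an order-of-magnitude guide.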

## References ##
1. https://arrow.apache.org/docs/python/feather.html
2. https://arrow.apache.org/docs/python/parquet.html
3. https://ursalabs.org/blog/2020-feather-v2/