# 💾 Save to Parquet

The Parquet format is an efficient file format for saving Synoptic's data.
If I need to reuse the same data many times (e.g., when researching a case study), I don't want to keep asking Synoptic for it; I want to get the data once and save it to local disk. Also, Synoptic restricts how much data you can retrieve in a single API request, so if you need a long time series you will need to make multiple API calls. Saving the DataFrame to a Parquet file saves disk space and gives fast loading times.

**_What are the benefits of saving Synoptic's data to Parquet instead of the raw JSON?_**

To demonstrate the benefits of Parquet, let's collect a timeseries of 5 days of data for all the stations within 10 miles of WBB.

1. Write the raw JSON to a JSON file.
1. Write the Polars DataFrame to a Parquet file.

- How large is the JSON file versus the Parquet file? _In this example, Parquet is about 19x smaller than JSON, because it is efficiently compressed._
- How long does it take to load a JSON file versus a Parquet file? _Parquet is faster to load into memory: it's already organized in a clean table, you can read only selected columns or rows if you want, and it's easy to read multiple files at once._


In [6]:
from datetime import timedelta

import synoptic
import polars as pl

In [2]:
s = synoptic.TimeSeries(radius="wbb,10", recent=timedelta(days=5))
print(f"Number of rows: {len(s.df()):,}")

🚚💨 Speedy delivery from Synoptic timeseries service.
📦 Received data from 91 stations.
Number of rows: 1,425,573


In [3]:
import json
from pathlib import Path

filepath = Path("sample_timeseries.json")
parquet = filepath.with_suffix(".parquet")

# Write raw data to JSON
with open(filepath, "w") as f:
    json.dump(s.json, f, indent=4)


# Write DataFrame to Parquet
s.df().write_parquet(parquet)

print(f"JSON file size: {filepath.stat().st_size / 1000 / 1000:>5.2f} MB")
print(f"  Parquet size: {parquet.stat().st_size / 1000 / 1000:>5.2f} MB")

JSON file size: 43.81 MB
  Parquet size:  2.29 MB


Wow, for 1.4 million observations Parquet is more than 19x smaller than the raw JSON. That's impressive.
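
For reference, the ratio comes straight from the two file sizes printed above. A quick check, with the numbers copied by hand from that output:

```python
# File sizes and row count copied from the output above
json_mb = 43.81
parquet_mb = 2.29
n_rows = 1_425_573

ratio = json_mb / parquet_mb
print(f"Parquet is {ratio:.1f}x smaller than JSON")

# Average bytes on disk per DataFrame row (1 MB = 1,000,000 bytes here)
json_bytes_per_row = json_mb * 1e6 / n_rows
parquet_bytes_per_row = parquet_mb * 1e6 / n_rows
print(f"   JSON: {json_bytes_per_row:.1f} bytes per row")
print(f"Parquet: {parquet_bytes_per_row:.1f} bytes per row")
```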

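Reading Parquet is also fast, plus it's already in a DataFrame and we don't need to parse the JSON again. In the timings below, compare the wall times (303 ms for Parquet versus 551 ms for JSON). Parquet's total CPU time is higher than its wall time, which suggests the read was spread across multiple threads.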

In [4]:
%%time
# Read the JSON file
with open(filepath, "r") as json_file:
    data = json.load(json_file)

CPU times: user 357 ms, sys: 200 ms, total: 557 ms
Wall time: 551 ms


In [7]:
%%time
_ = pl.read_parquet(parquet)

CPU times: user 543 ms, sys: 531 ms, total: 1.07 s
Wall time: 303 ms
