# Writing ISD data to partitioned parquet

You'll need to download one or more years of ISD data.
The easiest way is to download a yearly archive, e.g. https://www.ncei.noaa.gov/data/global-hourly/archive/isd/isd_2021_c20210804T131425.tar.gz, and extract it.
Each archive can be close to 5GB, so plan the download accordingly.
Once you've downloaded an archive, extract it into a subdirectory named for its year, e.g. using `mkdir 2021 && tar xzf isd_2021_c20210804T131425.tar.gz -C 2021`.
An extracted year can take up around 50GB, so make sure you have enough disk space.
We're using https://github.com/gadomski/pyisd to read the files into pandas data frames.
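
If you'd rather do the extraction step from Python than from the shell, here is a minimal sketch using only the standard library. The helper names `year_from_archive_name` and `extract_archive` are hypothetical, and the sketch assumes archive names follow the `isd_<year>_...tar.gz` pattern seen in the URL above.

```python
import re
import tarfile
from pathlib import Path


def year_from_archive_name(archive_name: str) -> int:
    """Pull the four-digit year out of a name like isd_2021_c20210804T131425.tar.gz.

    Assumption: archive names start with "isd_<year>_", as in the example URL.
    """
    match = re.match(r"isd_(\d{4})_", Path(archive_name).name)
    if match is None:
        raise ValueError(f"Could not find a year in {archive_name}")
    return int(match.group(1))


def extract_archive(archive_path: str, root: str = "isd") -> Path:
    """Extract one yearly archive into root/<year> and return that directory."""
    destination = Path(root) / str(year_from_archive_name(archive_path))
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return destination
```

Calling `extract_archive("isd_2021_c20210804T131425.tar.gz")` should leave the files under `isd/2021`, which matches the layout the loop below expects.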

In [1]:
import os
import isd.io
import pandas
import pyarrow
from pyarrow import parquet
from tqdm.notebook import tqdm

# Set this to the directory containing your uncompressed ISD data, organized into directories by year.
directory = "isd"
data_frames = []
for file_name in os.listdir(directory):
    path = os.path.join(directory, file_name)
    if not os.path.isdir(path):
        continue
    try:
        year = int(file_name)
    except ValueError:
        print(f"Could not convert {file_name} to an int, skipping directory")
        continue
    # Read every file in this year's directory into its own data frame.
    for data_file_name in tqdm(os.listdir(path)):
        data_frame = isd.io.read_to_dataframe_lite(os.path.join(path, data_file_name))
        data_frames.append(data_frame)


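# Combine the per-file data frames into a single data frame.
data_frame = pandas.concat(data_frames, ignore_index=True)
table = pyarrow.Table.from_pandas(data_frame)
# Write the table as a parquet dataset with one directory per year/month partition.
parquet.write_to_dataset(table, root_path="isd/parquet", partition_cols=["year", "month"])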
    

  0%|          | 0/13226 [00:00<?, ?it/s]

ValueError: Invalid ISD line (too short): Integrated Surface Database Station History, August 2021
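
The `ValueError` above is raised on a file whose first line reads "Integrated Surface Database Station History, August 2021", which suggests the extracted directory contains at least one file that is not an observation file. One possible way around it, sketched below, is to skip any file the reader rejects and keep a list of what was skipped. The helper `read_directory` is hypothetical, and treating every `ValueError` as "not a data file" is an assumption that could also hide genuinely corrupt data files.

```python
import os
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def read_directory(path: str, read_file: Callable[[str], T]) -> Tuple[List[T], List[str]]:
    """Apply read_file to every regular file in path, skipping files it rejects.

    Returns the successfully read results and the paths that were skipped.
    """
    results: List[T] = []
    skipped: List[str] = []
    for name in sorted(os.listdir(path)):
        file_path = os.path.join(path, name)
        if not os.path.isfile(file_path):
            continue
        try:
            results.append(read_file(file_path))
        except ValueError as error:
            # Assumption: a ValueError means this is not an ISD data file.
            print(f"Skipping {file_path}: {error}")
            skipped.append(file_path)
    return results, skipped
```

In the loop above, this could be used as `year_frames, skipped = read_directory(path, isd.io.read_to_dataframe_lite)` followed by `data_frames.extend(year_frames)`. Checking `skipped` afterwards shows whether anything other than the station history file was left out.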
