<img width="50" src="https://carbonplan-assets.s3.amazonaws.com/monogram/dark-small.png" style="margin-left:0px;margin-top:20px"/>

# FIA to Parquet

_by Joe Hamman (CarbonPlan), June 30, 2020_

This notebook converts Forest Inventory and Analysis (FIA) CSV files to Parquet
format and stages them in a Google Cloud Storage bucket.

**Inputs:**

- The `ENTIRE` directory of FIA CSV files, read from the local working directory

**Outputs:**

- One Parquet dataset per CSV: `gs://carbonplan-data/raw/fia/<name>.parquet`

**Notes:**

- No reprojection or processing of the data is done in this notebook.
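
The mapping from an input CSV to its output path can be sketched as follows. This is a minimal illustration, not code from the conversion cells: the helper name `parquet_blob` and the example file name `PLOT.csv` are assumptions made for the example, while the bucket prefix is taken from the output path listed above.

```python
import pathlib

# Bucket prefix from the "Outputs" entry above (without the gs:// scheme).
BUCKET_PREFIX = "carbonplan-data/raw/fia"


def parquet_blob(csv_path, prefix=BUCKET_PREFIX):
    """Return the bucket path of the Parquet dataset built from `csv_path`.

    Hypothetical helper: the file stem (name without extension) becomes
    the dataset name.
    """
    return f"{prefix}/{pathlib.Path(csv_path).stem}.parquet"


# `PLOT.csv` is an illustrative file name, not a guaranteed input.
print(parquet_blob("ENTIRE/PLOT.csv"))  # carbonplan-data/raw/fia/PLOT.parquet
```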


In [None]:
import pathlib

import gcsfs
import pandas as pd

# run `gcloud auth login` on the command line, or try switching token to `browser`
fs = gcsfs.GCSFileSystem(
    project="carbonplan",
    token="/Users/jhamman/.config/gcloud/legacy_credentials/joe@carbonplan.org/adc.json",
)

In [None]:
workdir = pathlib.Path("/Users/jhamman/workdir/carbonplan_data_downloads/fia/")

In [None]:
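# match only files with a .csv extension; sort into a list so the order is
# deterministic and the collection can be iterated more than once
csvs = sorted((workdir / "ENTIRE").glob("*.csv"))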

In [None]:
import numpy as np


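def force_float32(fname):
    """Read a CSV file and downcast its float64 columns to float32."""
    # memory-map files larger than ~100 MB to reduce I/O overhead while parsing
    memmap = fname.stat().st_size > 1e8

    df = pd.read_csv(fname, engine="c", low_memory=False, memory_map=memmap)
    for c in df:
        # float32 halves the storage of float columns at the cost of precision
        if df[c].dtype == np.float64:
            df[c] = df[c].astype(np.float32)

    return df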
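
The effect of the downcast can be checked on a small in-memory table. The sketch below is an illustration only: the column names and values are made up and are not FIA data. It also shows the trade-off, since float32 keeps only about seven significant decimal digits.

```python
import numpy as np
import pandas as pd

# Made-up stand-in table: one integer, one float64 and one string column.
demo = pd.DataFrame(
    {
        "plot_id": [1, 2, 3],
        "lat": [45.5, 46.25, 47.0],
        "name": ["a", "b", "c"],
    }
)

# Same rule as the function above: only float64 columns are converted.
for c in demo:
    if demo[c].dtype == np.float64:
        demo[c] = demo[c].astype(np.float32)

print(demo["lat"].dtype)  # float32
print(demo["plot_id"].dtype.kind)  # i (integer column left untouched)

# Large values lose their trailing digits when stored as float32.
print(float(np.float32(123456789.0)))  # 123456792.0
```

Columns that hold large identifiers as floats would be altered by this conversion, so it is worth checking which columns are affected before relying on the output.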

In [None]:
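failed = []
for fname in csvs:
    blob = f"carbonplan-data/raw/fia/{fname.stem}.parquet"
    print(fname)

    df = force_float32(fname)

    try:
        # `open_with` and `row_group_offsets` are fastparquet options,
        # so the engine is named explicitly
        df.to_parquet(
            blob,
            engine="fastparquet",
            compression="gzip",
            open_with=fs.open,
            row_group_offsets=1000,
        )
        # consider using a dask dataframe here to write the data in chunks.
        print("  --> ", blob)
    except Exception as err:
        # report the failure and continue so one bad file does not stop the run
        print("  failed: ", err)
        failed.append(fname)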
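
The loop above loads each CSV into memory in full. One way to bound memory use, offered here as an alternative to the dask idea in the comment and not as the approach this notebook takes, is to read the file in fixed-size chunks with the `chunksize` option of `pd.read_csv`. In this minimal sketch, the helper name `iter_float32_chunks`, the chunk size, and the `DEMO.csv` file are all assumptions made for illustration.

```python
import pathlib
import tempfile

import numpy as np
import pandas as pd


def iter_float32_chunks(fname, chunksize=100_000):
    """Yield chunks of at most `chunksize` rows with float64 cast to float32.

    Hypothetical helper, not part of the conversion loop above.
    """
    for chunk in pd.read_csv(fname, chunksize=chunksize):
        float_cols = chunk.select_dtypes(include="float64").columns
        yield chunk.astype({c: np.float32 for c in float_cols})


# Tiny stand-in CSV so the sketch runs without the FIA download.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = pathlib.Path(tmp) / "DEMO.csv"
    pd.DataFrame(
        {"CN": range(10), "LAT": np.linspace(40.0, 49.0, 10)}
    ).to_csv(demo_csv, index=False)

    chunks = list(iter_float32_chunks(demo_csv, chunksize=4))

print([len(c) for c in chunks])  # [4, 4, 2]
print(chunks[0]["LAT"].dtype)  # float32
```

Because pandas infers column types separately for each chunk, a column can come back with different dtypes in different chunks (for example when one chunk has missing values). Passing an explicit `dtype` mapping to `pd.read_csv` avoids that, at the cost of listing the columns up front.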