<img width="50" src="https://carbonplan-assets.s3.amazonaws.com/monogram/dark-small.png" style="margin-left:0px;margin-top:20px"/>

# FLUXNET to Parquet

_by Joe Hamman (CarbonPlan), August 7, 2020_

This notebook converts FLUXNET CSV files to Parquet format and stages them in a
Google Cloud Storage bucket.

**Inputs:**

- `fluxnet` directory of zip archives (each containing CSV files), located under the local working directory

**Outputs:**

- One Parquet dataset per CSV: `gs://carbonplan-data/raw/fluxnet/<name>.parquet`

**Notes:**

- No reprojection or processing of the data is done in this notebook.


In [None]:
import pathlib

import dask.dataframe as dd
import fsspec
import gcsfs
import pandas as pd
from fsspec.implementations.zip import ZipFileSystem
from tqdm import tqdm

# run `gcloud auth login` on the command line, or try switching token to `browser`
fs = gcsfs.GCSFileSystem(
    project="carbonplan",
    token="/Users/jhamman/.config/gcloud/legacy_credentials/joe@carbonplan.org/adc.json",
)

In [None]:
workdir = pathlib.Path("/Users/jhamman/workdir/carbonplan_data_downloads/")

In [None]:
storage_options = {"token": fs.session.credentials, "project": "carbonplan"}

In [None]:
zips = (workdir / "fluxnet").glob("*zip")


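def make_fname(stem):
    """Build a short output name from a FLUXNET CSV file stem.

    Assuming FLUXNET2015-style stems, this keeps the site ID (field 1) and the
    product (field 3), plus the time resolution (field 4) for non-AUX files.
    """
    p = stem.lower().split("_")
    if "AUX" in stem:
        name = "_".join([p[1], *p[3:4]])
    else:
        name = "_".join([p[1], *p[3:5]])
    return name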


for zip_path in tqdm(zips):
    print(zip_path)

    zipfs = ZipFileSystem(zip_path, mode="r")
    csvs = zipfs.glob("*csv")

    for csv in csvs:
        # zip member names always use "/" separators; a pure path also works on Windows
        fname = pathlib.PurePosixPath(csv)
        name = make_fname(fname.stem)
        blob = f"gcs://carbonplan-data/raw/fluxnet/{name}.parquet"

        with zipfs.open(csv, mode="rb") as f:
            df = pd.read_csv(f)
        ddf = dd.from_pandas(df, chunksize=1000).repartition(
            partition_size="50MB"
        )
        ddf.to_parquet(blob, storage_options=storage_options)

        print("--> ", blob)
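
Before uploading, it can help to preview which Parquet path each CSV would be written to. The sketch below is a dry run that uses only the standard library: it repeats `make_fname`, lists the CSV members of an archive, and checks whether two members would map to the same target. The `plan_targets` helper and the archive member names are hypothetical, chosen to resemble FLUXNET2015 file names, and the `endswith("csv")` filter is only an approximation of the notebook's `zipfs.glob("*csv")`.

```python
import io
import pathlib
import zipfile
from collections import Counter


def make_fname(stem):
    # Same logic as the notebook's helper, repeated so this block runs on its own.
    p = stem.lower().split("_")
    if "AUX" in stem:
        name = "_".join([p[1], *p[3:4]])
    else:
        name = "_".join([p[1], *p[3:5]])
    return name


def plan_targets(zip_source, prefix="gcs://carbonplan-data/raw/fluxnet"):
    """Map each CSV member of one archive to the Parquet path it would get.

    Hypothetical helper for a dry run; nothing is read or uploaded.
    """
    with zipfile.ZipFile(zip_source) as zf:
        members = [m for m in zf.namelist() if m.endswith("csv")]
    return {
        m: f"{prefix}/{make_fname(pathlib.PurePosixPath(m).stem)}.parquet"
        for m in members
    }


# Hypothetical archive contents, built in memory purely for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("FLX_AR-SLu_FLUXNET2015_FULLSET_HH_2009-2011_1-3.csv", "TIMESTAMP_START\n")
    zf.writestr("FLX_AR-SLu_FLUXNET2015_AUXMETEO_2009-2011_1-3.csv", "TIMESTAMP\n")
buf.seek(0)

targets = plan_targets(buf)
for member, blob in targets.items():
    print(member, "-->", blob)

# Any target produced more than once would be overwritten by a later CSV.
collisions = [t for t, n in Counter(targets.values()).items() if n > 1]
print("collisions:", collisions)
```

To run the same check against real data, pass a path to one of the downloaded archives to `plan_targets` instead of the in-memory buffer.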