<img width="50" src="https://carbonplan-assets.s3.amazonaws.com/monogram/dark-small.png" style="margin-left:0px;margin-top:20px"/>

# Global Carbon Project to Parquet

_by Joe Hamman (CarbonPlan), August 17, 2020_

This notebook converts raw Excel files from the Global Carbon Project to Parquet
format and stages them in a Google Cloud Storage bucket.

**Inputs:**

- `gcp` directory containing the downloaded Excel workbooks

**Outputs:**

- One Parquet dataset per Excel sheet:
  `gs://carbonplan-data/raw/gcp/<name>.parquet`

**Notes:**

- No reprojection or other processing of the data is done in this notebook,
  apart from skipping each sheet's header rows and dropping unnamed columns.

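As a rough illustration of the output naming, most targets in this notebook are the sheet name in snake case under the `raw/gcp` prefix. The helper below is hypothetical (`sheet_to_target` and `BUCKET_PREFIX` are not part of the notebook, which writes each target by hand), and the pattern is not universal: the "Emissions Transfers" sheet is written to `transfer_emissions.parquet`.

```python
import re

# Prefix taken from the Outputs section above.
BUCKET_PREFIX = "gs://carbonplan-data/raw/gcp"


def sheet_to_target(sheet_name, prefix=BUCKET_PREFIX):
    """Hypothetical helper: derive a Parquet target path from a sheet name."""
    # lowercase, then collapse every run of non-alphanumeric characters to "_"
    name = re.sub(r"[^0-9a-z]+", "_", sheet_name.lower()).strip("_")
    return f"{prefix}/{name}.parquet"


print(sheet_to_target("Land-Use Change Emissions"))
# gs://carbonplan-data/raw/gcp/land_use_change_emissions.parquet
```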

In [None]:
import dask.dataframe as dd
import gcsfs
import pandas as pd

In [None]:
# run `gcloud auth login` on the command line, or try switching token to `browser`
fs = gcsfs.GCSFileSystem(
    project="carbonplan",
    token="/Users/jhamman/.config/gcloud/legacy_credentials/joe@carbonplan.org/adc.json",
)

storage_options = {"token": fs.session.credentials, "project": "carbonplan"}


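def process(fname, target, **open_kwargs):
    """Read one Excel sheet and write it to `target` as a Parquet dataset."""
    df = pd.read_excel(fname, **open_kwargs)
    # drop the blank spreadsheet columns that pandas names "Unnamed: N"
    df = df.loc[:, ~df.columns.str.contains("^Unnamed")]
    # wrap in a single-partition dask dataframe so it can be written straight to GCS
    df = dd.from_pandas(df, npartitions=1)
    df.to_parquet(target, engine="fastparquet", storage_options=storage_options)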

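The column filter inside `process` can be tried in isolation. The sketch below uses a small made-up frame (the column names and values are illustrative, not taken from the GCP workbooks) to show how columns with blank headers, which pandas labels `Unnamed: N`, are dropped.

```python
import pandas as pd

# Toy frame standing in for a sheet with two blank header cells.
df = pd.DataFrame(
    {
        "Year": [1959, 1960],
        "Unnamed: 1": [None, None],
        "Fossil": [2.4, 2.6],
        "Unnamed: 3": [None, None],
    }
)

# Boolean mask: True for columns whose name does NOT start with "Unnamed"
keep = ~df.columns.str.contains("^Unnamed")
cleaned = df.loc[:, keep]

print(list(cleaned.columns))
# ['Year', 'Fossil']
```

Note that the `.str` accessor assumes every column label is a string, which appears to hold for the sheets read here.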
## National Carbon Emissions


In [None]:
fname = "/Users/jhamman/workdir/carbonplan_data_downloads/gcp/National_Carbon_Emissions_2019v1.0.xlsx"

# Territorial Emissions
target = "gs://carbonplan-data/raw/gcp/territorial_emissions.parquet"
open_kwargs = dict(sheet_name="Territorial Emissions", skiprows=16, index_col=0)
process(fname, target, **open_kwargs)

# Consumption Emissions
target = "gs://carbonplan-data/raw/gcp/consumption_emissions.parquet"
open_kwargs = dict(sheet_name="Consumption Emissions", skiprows=8, index_col=0)
process(fname, target, **open_kwargs)

# Emissions Transfers
target = "gs://carbonplan-data/raw/gcp/transfer_emissions.parquet"
open_kwargs = dict(sheet_name="Emissions Transfers", skiprows=8, index_col=0)
process(fname, target, **open_kwargs)

## Global Carbon Budget


In [None]:
fname = "/Users/jhamman/workdir/carbonplan_data_downloads/gcp/raw_gcb_Global_Carbon_Budget_2019v1.0.xlsx"

# Global Carbon Budget
target = "gs://carbonplan-data/raw/gcp/global_carbon_budget.parquet"
open_kwargs = dict(sheet_name="Global Carbon Budget", skiprows=18, index_col=0)
process(fname, target, **open_kwargs)

# Fossil Emissions by Fuel Type
target = "gs://carbonplan-data/raw/gcp/fossil_emissions_by_fuel_type.parquet"
open_kwargs = dict(
    sheet_name="Fossil Emissions by Fuel Type", skiprows=12, index_col=0
)
process(fname, target, **open_kwargs)

# Land-Use Change Emissions
target = "gs://carbonplan-data/raw/gcp/land_use_change_emissions.parquet"
open_kwargs = dict(
    sheet_name="Land-Use Change Emissions", skiprows=25, index_col=0
)
process(fname, target, **open_kwargs)

# Ocean Sink
target = "gs://carbonplan-data/raw/gcp/ocean_sink.parquet"
open_kwargs = dict(sheet_name="Ocean Sink", skiprows=22, index_col=0)
process(fname, target, **open_kwargs)

# Terrestrial Sink
target = "gs://carbonplan-data/raw/gcp/terrestrial_sink.parquet"
open_kwargs = dict(sheet_name="Terrestrial Sink", skiprows=23, index_col=0)
process(fname, target, **open_kwargs)

# Historical Budget
target = "gs://carbonplan-data/raw/gcp/historical_budget.parquet"
open_kwargs = dict(sheet_name="Historical Budget", skiprows=14, index_col=0)
process(fname, target, **open_kwargs)
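
The repeated target / `open_kwargs` / `process` pattern above could also be driven from a small table, which keeps each sheet name next to its output name and makes mismatches easier to spot. This is only a sketch: `GCB_SHEETS` and `build_jobs` are hypothetical names, and the sheet names, `skiprows` values, and target names are copied from the cell above.

```python
# (sheet name, header rows to skip, output dataset name)
GCB_SHEETS = [
    ("Global Carbon Budget", 18, "global_carbon_budget"),
    ("Fossil Emissions by Fuel Type", 12, "fossil_emissions_by_fuel_type"),
    ("Land-Use Change Emissions", 25, "land_use_change_emissions"),
    ("Ocean Sink", 22, "ocean_sink"),
    ("Terrestrial Sink", 23, "terrestrial_sink"),
    ("Historical Budget", 14, "historical_budget"),
]


def build_jobs(sheets, prefix="gs://carbonplan-data/raw/gcp"):
    """Hypothetical helper: turn sheet specs into (target, open_kwargs) pairs."""
    jobs = []
    for sheet_name, skiprows, name in sheets:
        target = f"{prefix}/{name}.parquet"
        open_kwargs = dict(sheet_name=sheet_name, skiprows=skiprows, index_col=0)
        jobs.append((target, open_kwargs))
    return jobs


jobs = build_jobs(GCB_SHEETS)

# In the notebook, each job would then be run with the existing function:
#     for target, open_kwargs in jobs:
#         process(fname, target, **open_kwargs)
```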