# Using the VCE Resource Adequacy Renewable Energy (RARE) Power Dataset Parquet Files

The RARE dataset, `vcerare` in PUDL, was produced by Vibrant Clean Energy and is licensed to the public under the Creative Commons Attribution 4.0 International license (CC-BY-4.0). The data consist of hourly, county-level renewable generation profiles for solar PV, onshore wind, and offshore wind in the continental United States. The profiles were compiled from outputs of the NOAA HRRR weather model. Visit our [VCE Data Source](https://catalystcoop-pudl.readthedocs.io/en/nightly/data_sources/vcerare.html) page to learn more.

## There are two primary access methods for RARE data

* **Raw CSV archives (Zenodo)**
* **Processed PUDL Parquet files (S3 bucket)**
  

If you saw CSV and said: *"that's for me!"* -- we're here to say: *"give Parquet a chance!"* This notebook is intended to help you navigate and customize this enormous dataset in a way that may actually make your life easier than trying to wrangle it in Excel.

### Raw CSV archives (Zenodo)

The raw data are separated into CSVs by year (2019 - 2023) and generation source (solar PV, onshore wind, offshore wind). Each file contains an index column for the hour of the year (1 - 8760) and one value column per county holding the estimated capacity factor. The files are optimized for distribution in Excel, so they carry as little additional information as possible.

See the [Zenodo archive README](https://zenodo.org/records/13937523) for more detail on the raw data schema and contents. 

### Processed PUDL Parquet files (S3 bucket)

The processed data are published as Apache Parquet files--a file format designed for efficient data storage and retrieval. These processed files combine all years and generation sources into one table. They also restructure the data from a wide format to a tall format by moving the county from a column header into a column value. The processed data also include additional informational columns:

* Latitude
* Longitude
* County FIPS code
* Report year
* Datetime UTC
* Separate state and county (or Lake) name fields (the raw data is formatted together as county_state)

Take a look at our [data dictionary](https://catalystcoop-pudl.readthedocs.io/en/nightly/data_dictionaries/pudl_db.html#out-vcerare-hourly-available-capacity-factor) for more information on the processed table schema.
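
To make the wide-to-tall restructuring concrete, here is a minimal sketch on a made-up miniature of one raw CSV. The column names (`hour_of_year`, `travis_texas`, `rio_grande_colorado`) and the values are hypothetical stand-ins, and this is not the actual PUDL transformation code; it only illustrates the idea of moving county headers into rows and splitting the combined `county_state` label.

```python
import pandas as pd

# Hypothetical miniature of one raw file: one row per hour, one column per county.
wide = pd.DataFrame(
    {
        "hour_of_year": [1, 2, 3],
        "travis_texas": [0.10, 0.20, 0.30],
        "rio_grande_colorado": [0.40, 0.50, 0.60],
    }
)

# Move the county headers into a single column (wide -> tall).
tall = wide.melt(
    id_vars="hour_of_year",
    var_name="county_state",
    value_name="capacity_factor",
)

# Split on the last underscore so multi-word county names stay intact.
tall[["county", "state"]] = tall["county_state"].str.rsplit("_", n=1, expand=True)
tall = tall.drop(columns="county_state")

print(tall)
```

In the tall layout every row is one county-hour, which is what makes filtering by state or county a one-line operation.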

## Which one should you use?

The processed PUDL Parquet files are the way to go if you need to alter the layout of the data to feed it into your model, are interested in a particular subset of the data, want to connect it to geospatial data, or want to view or analyze all the data at once. For context, Excel's row limit is 1,048,576 rows, while the processed PUDL table contains 136,437,000 rows. Below, we'll show you how to wrangle this dataset without blowing up your computer.

## Accessing RARE Parquet files

In [None]:
import io
import matplotlib.pyplot as plt
import os
import pandas as pd
import random
import sys
import pudl
pudl_paths = pudl.workspace.setup.PudlPaths()

In [None]:
rare_df = pd.read_parquet(
    pudl_paths.pudl_output / "parquet/out_vcerare__hourly_available_capacity_factor.parquet"
)
rare_df.sample(10)

### Filter the data

You can filter the table with boolean masks to keep only the rows you need. Here are some examples:

```
# Filter by single year
filtered_df = rare_df[rare_df["report_year"]==2023]

# Filter by multiple years
filtered_df = rare_df[rare_df["report_year"].isin([2019, 2020])]

# Filter by state
filtered_df = rare_df[rare_df["state"]=="TX"]
```

In [None]:
filtered_df = rare_df[rare_df["state"]=="RI"]
filtered_df.head(10)

### Visualize the data

In [None]:
# Pick a random county FIPS code from the distinct values in the table
random_county = random.choice(rare_df["county_id_fips"].unique())

In [None]:
# Editable fields: 
year = 2023
county_id = random_county
gen_type = "onshore_wind" # choose from: solar_pv, onshore_wind, offshore_wind

In [None]:
# Non-editable variables
plot_df = rare_df[
    (rare_df["report_year"] == year)
    & (rare_df["county_id_fips"] == county_id)
]
# plot_df is already limited to one county, so the first row carries its name and state
county_name = plot_df["county_or_lake_name"].iloc[0]
state_name = plot_df["state"].iloc[0]
cf_column = f"capacity_factor_{gen_type}"
# Compute the maximum outside the f-string; nested double quotes fail before Python 3.12
max_cf = plot_df[cf_column].max()

# Make the chart
plt.hist(plot_df[cf_column], bins=100, range=(0, 1))
plt.title(
    f"county: {county_name}; state: {state_name}; gen: {gen_type}; max: {max_cf:.2f}"
)
plt.grid()
plt.show()

### Download the data

#### Memory check - will it Excel? 

If you want to know whether your table is small enough to export as a CSV and open in Excel, you can use this memory estimator. As a rule of thumb, an estimated size above 500 MB is too big. Dropping columns or narrowing the filter scope will reduce the file size.

In [None]:
# test_df = the name of the table you want to test
test_df = plot_df

# Calculate the memory usage in bytes, including the index
mem_usage_bytes = test_df.memory_usage(deep=True).sum()
# Convert bytes to megabytes
mem_usage_mb = mem_usage_bytes / (1024 * 1024)
csv_size_estimate_mb = mem_usage_mb * 2  # Rough multiplier for CSV size
print(f"Estimated CSV size: {csv_size_estimate_mb:.2f} MB")
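
File size is only part of the story: Excel also caps a worksheet at 1,048,576 rows and 16,384 columns. As a complementary sketch, the hypothetical helper `fits_in_excel` below (not part of PUDL or pandas) checks a table against those limits, counting one extra row for the header.

```python
import pandas as pd

EXCEL_MAX_ROWS = 1_048_576  # worksheet row limit
EXCEL_MAX_COLS = 16_384     # worksheet column limit


def fits_in_excel(df: pd.DataFrame) -> bool:
    """Return True if df, plus one header row, fits on a single Excel worksheet."""
    return len(df) + 1 <= EXCEL_MAX_ROWS and len(df.columns) <= EXCEL_MAX_COLS


# One county for one year is 8,760 hourly rows, far below the limit.
one_county_year = pd.DataFrame({"capacity_factor_solar_pv": [0.0] * 8760})
print(fits_in_excel(one_county_year))
```

A table can pass this check and still be unwieldy in Excel, so it is best used alongside the size estimate rather than instead of it.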

#### Download!
Specify your desired download location by filling in `download_path`. Running this cell will then write the data to that location under the name `rare_power_data.csv`.

In [None]:
from pathlib import Path

# Add the directory you want to download the data to
# (an empty string writes to the current working directory)
download_path = ""
plot_df.to_csv(Path(download_path) / "rare_power_data.csv")
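
If a filtered table is still too large for a single CSV, one option is to write one file per year. The following is a minimal sketch under stated assumptions: the tiny `demo` table stands in for your filtered data, the output goes to a temporary directory, and the `rare_power_data_<year>.csv` naming scheme is just an illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for a filtered table spanning more than one year.
demo = pd.DataFrame(
    {
        "report_year": [2022, 2022, 2023],
        "county_id_fips": ["44001", "44003", "44001"],
        "capacity_factor_solar_pv": [0.10, 0.20, 0.30],
    }
)

out_dir = Path(tempfile.mkdtemp())  # replace with your own download directory
written = []
for year, year_df in demo.groupby("report_year"):
    # One CSV per report year keeps each file smaller.
    out_path = out_dir / f"rare_power_data_{year}.csv"
    year_df.to_csv(out_path, index=False)
    written.append(out_path)

print([p.name for p in written])
```

The same loop works with any grouping column, for example `state`, if yearly files are still too large.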