# Using the VCE Resource Adequacy Renewable Energy (RARE) Power Dataset Parquet Files

## There are two primary access methods for RARE data

* **Raw CSV archives (Zenodo)**
* **Processed PUDL Parquet files (S3 bucket)**
  

If you saw CSV and said: *"that's for me!"* -- we're here to say: *"give Parquet a chance!"* This notebook is intended to help you navigate and customize this enormous dataset in a way that may actually make your life easier than trying to wrangle it in Excel.

### Raw CSV archives (Zenodo)

The raw data are separated into CSVs by year (2019 - 2023) and generation source (solar PV, onshore wind, offshore wind). Each file contains an index column for the hour of year (1 - 8760) and one value column per county containing the estimated capacity factor for that county. The files have been optimized for distribution in Excel, so they contain as little additional information as possible.

See the [Zenodo archive README](https://zenodo.org/records/13937523) for more detail on the raw data schema and contents. 

### Processed PUDL Parquet files (S3 bucket)

The processed data are published as Apache Parquet files, a file format designed for efficient data storage and retrieval. These processed files combine all years and generation sources into one table. They also restructure the data from a wide format to a tall format by moving county from a column header into a column value. Additional informational columns are added to the processed data, including:

* Latitude
* Longitude
* County FIPS code
* Report year
* Datetime UTC
* Separate state and county (or Lake) name fields (the raw data is formatted together as county_state)

Take a look at our [data dictionary](https://catalystcoop-pudl.readthedocs.io/en/nightly/data_dictionaries/pudl_db.html#out-vcerare-hourly-available-capacity-factor) for more information on the processed table schema.
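
To make the wide-to-tall restructuring described above concrete, here is a minimal sketch using `pandas.DataFrame.melt` on a toy table. The column names (`hour_of_year`, `bristol_rhode_island`, `kent_rhode_island`) and all values are made up for illustration, and this is not the actual PUDL processing code.

```python
import pandas as pd

# Toy wide-format table in the style of one raw CSV: one row per hour,
# one column per place. Column names and values are hypothetical.
wide = pd.DataFrame(
    {
        "hour_of_year": [1, 2, 3],
        "bristol_rhode_island": [0.10, 0.20, 0.30],
        "kent_rhode_island": [0.40, 0.50, 0.60],
    }
)

# Move the place names out of the column headers and into a column value,
# so each row holds one hour for one place.
tall = wide.melt(
    id_vars="hour_of_year",
    var_name="county_state",
    value_name="capacity_factor_solar_pv",
)

print(tall)
```

The tall table has one row per hour per place (six rows here), which is the shape that makes filtering by county or state straightforward.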

## Which one should you use?

If you want or need to alter the layout of the data in order to feed it into your model, are interested in a particular subset of the data, want to connect it to geospatial data, or want to view / run analysis on all the data at once, the processed PUDL Parquet files are the way to go! For context, the row limit in Excel is a little more than 1 million rows. The processed PUDL table contains 136,437,000 rows. Below, we'll show you how to wrangle this dataset without blowing up your computer.
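
As a rough sanity check on those numbers, the sketch below works through the arithmetic. It assumes every place contributes 8,760 hourly rows for each of the five years (2019 - 2023), which is an inference from the raw file layout described above rather than something stated in the data dictionary.

```python
HOURS_PER_YEAR = 8760          # hour-of-year index in the raw files (1 - 8760)
N_YEARS = 5                    # 2019 through 2023
PROCESSED_ROWS = 136_437_000   # row count of the processed PUDL table
EXCEL_MAX_ROWS = 1_048_576     # Excel's per-worksheet row limit

# Rows contributed by a single county or lake across all years (assumption:
# every place has a complete 8,760-hour record in every year)
rows_per_place = HOURS_PER_YEAR * N_YEARS

# Number of places implied by the total row count under that assumption
implied_places = PROCESSED_ROWS / rows_per_place

# Number of full Excel worksheets needed to hold the table (ceiling division)
sheets_needed = -(-PROCESSED_ROWS // EXCEL_MAX_ROWS)

print(f"Rows per place: {rows_per_place:,}")
print(f"Implied number of places: {implied_places:,.0f}")
print(f"Excel worksheets needed: {sheets_needed}")
```

Under these assumptions a single place already accounts for 43,800 rows, so only about two dozen places would fit in one worksheet.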

## Accessing RARE Parquet files

In [None]:
import altair as alt
import io
import matplotlib.pyplot as plt
import os
import pandas as pd
import random
import sys
import pudl
pudl_paths = pudl.workspace.setup.PudlPaths()

In [None]:
# Load the full RARE dataset from Parquet into pandas
rare_df = pd.read_parquet(
    pudl_paths.pudl_output / "parquet/out_vcerare__hourly_available_capacity_factor.parquet",
    engine='pyarrow',
    dtype_backend='pyarrow',
)
# Display ten random hourly records, to give an overview of what the data looks like
rare_df.sample(10)

### Filter the data

You can use this format to filter the data you'd like to access. Here are some examples: 

```
# Filter by single year
filtered_df = rare_df[rare_df["report_year"]==2023]

# Filter by multiple years
filtered_df = rare_df[rare_df["report_year"].isin([2019, 2020])]

# Filter by state
filtered_df = rare_df[rare_df["state"]=="TX"]
```

In [None]:
# Example filter to pull data from Rhode Island only
filtered_df = rare_df[rare_df["state"]=="RI"]
# Show the first 10 values as a sample view
filtered_df.head(10)
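
Filters can also be combined. The sketch below uses a toy dataframe with made-up values to show the pattern: wrap each condition in parentheses and join them with `&` (and) or `|` (or).

```python
import pandas as pd

# Toy table with made-up values, using column names from the RARE table
toy = pd.DataFrame(
    {
        "state": ["RI", "RI", "TX", "MA"],
        "report_year": [2022, 2023, 2023, 2023],
        "capacity_factor_solar_pv": [0.1, 0.2, 0.3, 0.4],
    }
)

# Keep rows from 2023 that are in either Rhode Island or Massachusetts.
# Each condition needs its own parentheses because & binds tighter than ==.
mask = (toy["report_year"] == 2023) & (toy["state"].isin(["RI", "MA"]))
combined_df = toy[mask]

print(combined_df)
```

To apply this to the real data, replace `toy` with `rare_df`.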

### Create a simple histogram

In [None]:
# Pick a random county_id_fips from the distinct, non-null FIPS codes in the data
random_county_id_fips = random.choice(
    rare_df["county_id_fips"].drop_duplicates().dropna().tolist()
)
random_county_id_fips

In [None]:
# Select a random county by default, or substitute in your own 5-digit county FIPS ID
hist_year = 2023 # Update to your desired year
hist_county_id_fips = random_county_id_fips # If you want, add your custom 5 digit code here instead
hist_gen_type = "onshore_wind" # Choose from: solar_pv, onshore_wind, offshore_wind

In [None]:
# This is where we make the plot
plot_df = rare_df[
    (rare_df["report_year"]==hist_year)
    & (rare_df["county_id_fips"]==hist_county_id_fips)
]
# plot_df is already limited to one county, so take its name and state directly
county_name = plot_df["county_or_lake_name"].unique()[0]
state_name = plot_df["state"].unique()[0]

# Make the chart
cap_fac_col = f"capacity_factor_{hist_gen_type}"
plt.hist(plot_df[cap_fac_col], bins=100, range=(0, 1))
plt.title(
    f"county: {county_name}; state: {state_name}; gen: {hist_gen_type}; "
    f"max: {plot_df[cap_fac_col].max():.2f}"
)
plt.grid()
plt.show()

### Download the data

#### Memory check - will it Excel? 

If you want to know whether your table is small enough to export as a CSV and open in Excel, you can use this memory estimator. If the estimated size exceeds 500 MB, it's too big! Dropping columns or narrowing the filter will reduce the file size.

In [None]:
# test_df = the name of the table you want to test
test_df = plot_df # You can edit this to the name of the df you want to test the memory of

# Calculate the memory usage in bytes, including the index
mem_usage_bytes = test_df.memory_usage(deep=True).sum()
# Convert bytes to megabytes
mem_usage_mb = mem_usage_bytes / (1024 * 1024)
csv_size_estimate_mb = mem_usage_mb * 2  # Rough multiplier for CSV size
print(f"Estimated CSV size: {csv_size_estimate_mb:.2f} MB")
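
The estimate above relies on a rough multiplier. For tables that comfortably fit in memory, an alternative is to measure the CSV text directly, as in the sketch below. The helper name `csv_size_mb` and the toy values are hypothetical, and because the whole CSV is built in memory this approach only suits small to moderate tables.

```python
import pandas as pd

# Toy table with made-up values; substitute your own filtered dataframe
toy = pd.DataFrame(
    {
        "state": ["RI"] * 1000,
        "county_id_fips": ["44001"] * 1000,
        "capacity_factor_solar_pv": [0.25] * 1000,
        "capacity_factor_onshore_wind": [0.35] * 1000,
    }
)


def csv_size_mb(df: pd.DataFrame) -> float:
    """Return the size in MB of df when written as UTF-8 encoded CSV text."""
    csv_text = df.to_csv()  # with no path given, to_csv returns the CSV as a string
    return len(csv_text.encode("utf-8")) / (1024 * 1024)


# Compare the full table against a version with two columns dropped
full_mb = csv_size_mb(toy)
narrow_mb = csv_size_mb(toy[["county_id_fips", "capacity_factor_solar_pv"]])

print(f"All columns: {full_mb:.4f} MB")
print(f"Two columns: {narrow_mb:.4f} MB")
```

Comparing the two numbers shows how much space dropping columns saves before you commit to a download.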

#### Download!
Specify your desired download location by filling in `download_path`. Running this cell will then write the data to that location under the name `rare_power_data.csv`.

In [None]:
# Add the df you want to download
download_df = plot_df # Change this to whatever df you want to download
# Add the file path you want to download the data to
# Leave this blank to save the data in the same folder as this notebook.
download_path = "" 
# e.g. download_path = "/home/user/Desktop/folder/data_folder/"
# os.path.join handles a missing trailing slash and an empty download_path
download_df.to_csv(os.path.join(download_path, "rare_power_data.csv"))

## Explore the data

In [None]:
state = "RI" # Input your desired state
annual_year = 2023 # Input your desired year

In [None]:
cap_fac_cols = ['solar_pv', 'onshore_wind', 'offshore_wind']

# Remove capacity_factor prefix from columns
rare_df_copy = rare_df.copy()
rare_df_copy.columns = [
    col.replace('capacity_factor_', '') if col.startswith('capacity_factor_') else col
    for col in rare_df_copy.columns
]

# Filter based on year and state
rare_year_state_subset_raw = rare_df_copy[
    (rare_df_copy["report_year"] == annual_year) & (rare_df_copy["state"] == state)
]

# Calculate both mean and variance, then stack and reshape
agg_funcs = {col: ['mean', 'var'] for col in cap_fac_cols}
stats_df = rare_year_state_subset_raw.groupby(
    ["report_year", "state", "county_or_lake_name", "county_id_fips"]
).agg(agg_funcs)

# Stack and reshape to combine mean and variance
rare_year_state_subset_final = stats_df.stack(level=0).reset_index()
rare_year_state_subset_final.columns = ["report_year", "state", "county_or_lake_name", "county_id_fips", "gen_type", "average_capacity_factor", "average_variance"]

In [None]:
# Define the data source for the chart
source = rare_year_state_subset_final

# Set the color scheme for the chart
color = alt.Color('gen_type:N').scale(
    domain=['solar_pv', 'onshore_wind', 'offshore_wind'],
    range=['#e7ba52', '#69b373', '#aec7e8']
)

# We create two selections:
# - a brush that is active on the top panel
# - a multi-click that is active on the bottom panel
brush = alt.selection_interval(encodings=['x'])
click = alt.selection_point(encodings=['color'])

# Top panel is a scatter plot of annual average capacity factor vs. annual variance
points = alt.Chart().mark_point().encode(
    alt.X('average_variance:Q', title='Annual Variance'),
    alt.Y('average_capacity_factor:Q', title='Annual Average Capacity Factor')
        .scale(domain=[0, 1]),
    color=alt.condition(brush, color, alt.value('lightgray')),
    tooltip=[
        alt.Tooltip('county_or_lake_name:N', title='County Name'),
        alt.Tooltip('county_id_fips:N', title='FIPS ID'),
        alt.Tooltip('average_capacity_factor:Q', title='Average Annual Capacity Factor', format='.2f'),  # Rounded to 2 decimals
        alt.Tooltip('average_variance:Q', title='Average Annual Variance', format='.2f'),  # Rounded to 2 decimals
    ]
).properties(
    width=550,
    height=300
).add_params(
    brush
).transform_filter(
    click
)

# Bottom panel is a bar chart of average capacity factor per generation type
bars = alt.Chart().mark_bar().encode(
    x=alt.X('average(average_capacity_factor):Q', title='Average Capacity Factor (Excluding Zero Values)'),
    y=alt.Y('gen_type:N', title='Generation Type'),
    color=alt.condition(click, color, alt.value('lightgray')),
    tooltip=[
        alt.Tooltip('gen_type:N', title='Generation Type'),
        alt.Tooltip('average(average_capacity_factor):Q', title='Average Capacity Factor', format='.2f')
    ]
).transform_filter(
    (alt.datum.average_capacity_factor > 0) & brush
).properties(
    width=550,
).add_params(
    click
)

alt.vconcat(
    points,
    bars,
    data=source,
    title=f"Average Capacity Factor vs. Variance for Counties in {state} in {annual_year}"
).configure_title(
    fontSize=24,
    anchor="middle",
    dy=-10,
).configure_axis(
    titleFontSize=16,  # Adjust the axis title font size
    titleFontWeight='bold',  # Set axis title to bold
    titleColor='black',  # Set axis title color
    labelFontSize=12,  # Adjust the axis label font size
    labelColor='gray',
    titlePadding=10,
)

In [None]:
county_id_fips = "25027" #Input your desired county_id_fips
monthly_year = 2023 # Input your desired year

In [None]:
rare_county_year_subset = rare_df_copy[
    (rare_df_copy["county_id_fips"] == county_id_fips)
    & (rare_df_copy["report_year"] == monthly_year)
][cap_fac_cols + ["datetime_utc"]].set_index("datetime_utc")

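# Resample to month-end frequency ('ME' replaces the deprecated 'M' alias in pandas 2.2+),
# then reshape so each row holds the mean and variance for one month and generation type
rare_county_subset_monthly = (
    rare_county_year_subset.resample('ME')
    .agg(['mean', 'var'])
    .stack(level=-2, future_stack=True)
    .reset_index()
    .rename(columns={"level_1": "gen_type"})
)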

# Format the 'year_month' column
rare_county_subset_monthly['year_month'] = rare_county_subset_monthly['datetime_utc'].dt.strftime('%Y-%B')

In [None]:
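source = rare_county_subset_monthly

# Use the same color for each generation type as in the scatter plot above
custom_colors = {
    'solar_pv': '#e7ba52',
    'onshore_wind': '#69b373',
    'offshore_wind': '#aec7e8',
}

# Base chart with 'year_month' on the x-axis and custom colors for 'gen_type'
base = alt.Chart(source).encode(
    # sort=None keeps the months in data (chronological) order instead of alphabetical
    x=alt.X('year_month:N', sort=None, title=f"Month of Year {monthly_year}"),
    color=alt.Color('gen_type:N', scale=alt.Scale(domain=list(custom_colors.keys()), range=list(custom_colors.values())))
)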

# Fold the 'mean' and 'var' columns into a key-value format
folded = base.transform_fold(
    ['mean', 'var'],  # Columns to fold
    as_=['Statistic', 'Value']  # New column names after folding
)

# Create the line chart for both mean and variance
lines = folded.mark_line().encode(
    y='Value:Q',  # Use the folded 'Value' for y-axis
    strokeDash='Statistic:N'  # Differentiate 'mean' and 'var' by dash style
)

# Set the chart title and size, then style the title and axes
chart = lines.properties(
    title=f"Monthly Average Capacity Factor and Variance by Generation Type in county FIPS ID {county_id_fips} in {monthly_year}",
    width=600,
    height=400
).configure_title(
    fontSize=24,
    anchor="middle",
    dy=-10,
).configure_axis(
    titleFontSize=16,  # Adjust the axis title font size
    titleFontWeight='bold',  # Set axis title to bold
    titleColor='black',  # Set axis title color
    labelFontSize=12,  # Adjust the axis label font size
    labelColor='gray',
    titlePadding=10,
)

chart.display()
