<a href="https://colab.research.google.com/github/ellenwang995/final_project/blob/main/USElectricityMarkets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **Variation in US Retail Electricity Price Growth**
##### *Python for Public Policy - Final Project*
##### *Ellen Wang*
##### *UNI: elw2164*



---


---

## **Introduction:**
Amid the AI boom of the past few years, there has been growing concern that increased demand for electricity to power large data centers will threaten energy accessibility and affordability. However, concern about rising electricity prices is not new. A Berkeley Lab report published in early 2025 discusses how residential retail prices have been increasing faster than inflation since 2021, meaning that real retail electricity prices for residential consumers have been rising.

However, what is interesting is that the increase in retail electricity prices at the national level cannot be attributed to increased demand, as retail electricity sales stayed relatively constant from 2019-2023 (Energy Markets and Policy Lab - Berkeley Lab 2025). The main source of increased cost from 2019-2024 was growth in capital expenditures on distribution infrastructure, along with distribution operations and maintenance expenditures.

Characteristics of distribution infrastructure and maintenance needs are likely to be unique to different grids in the US. Thus, for this project, I am interested in comparing residential retail electricity prices in two of the largest electricity grids in the US—CAISO (California) and ERCOT (Texas)—and the factors that could impact these price differentials.





---


## **Methodology:**

For this research, I will use the Pandas, Plotly, NLTK, Matplotlib, and PyPDF2 libraries in Python to assess differences in residential retail electricity prices and the potential reasons for them. The paper is structured into three sections: comparing prices and demand in the retail electricity market, examining prices and supply composition in the wholesale market, and investigating changes in electricity market regulation.

#### *Retail Electricity Market*

First, to understand the difference between residential retail electricity prices in ERCOT and CAISO, I will use Pandas to explore, clean, and organize data sourced from the U.S. Energy Information Administration (U.S. Energy Information Administration 2025). This data shows residential electricity prices and total sales for each state, for every month from 2010 to July 2025. After cleaning the data, I will use Plotly to graph prices and sales in Texas and California to compare the markets in the two states.

#### *Wholesale Electricity Market*

Next, I will use data sourced from the Energy Markets and Policy Lab branch of Berkeley Lab that shows wholesale electricity prices and load composition (renewable energy, split into wind and utility solar, and non-renewable energy) for seven different grids in the US. Prices and load composition are shown hourly by season for each grid (Energy Markets and Policy Lab - Berkeley Lab 2024). Again, I will use Pandas to explore and clean the data. For this dataset specifically, I will need to mutate the dataframe to add two variables: the average seasonal wholesale price for each year and the proportion of renewable energy load for each year (as a weighted average). I will then use Plotly to graph and analyze average seasonal wholesale prices and the change in the proportion of renewable energy load on the CAISO and ERCOT grids.
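As a rough illustration of the weighted-average step, the sketch below uses made-up loads for two seasons. The field names (`wind`, `solar`, `total`) and all numbers are hypothetical and are not taken from the Berkeley Lab file. It shows why weighting by total load can differ from simply averaging the seasonal shares.

```python
# Hypothetical seasonal loads in MWh; values are illustrative only.
seasons = [
    {"season": "summer", "wind": 100.0, "solar": 200.0, "total": 1000.0},
    {"season": "winter", "wind": 150.0, "solar": 50.0, "total": 500.0},
]

def renewable_share_weighted(rows):
    """Load-weighted share: total renewable load divided by total load."""
    renewable = sum(r["wind"] + r["solar"] for r in rows)
    total = sum(r["total"] for r in rows)
    return renewable / total

def renewable_share_simple(rows):
    """Unweighted mean of the per-season shares, shown for comparison."""
    shares = [(r["wind"] + r["solar"]) / r["total"] for r in rows]
    return sum(shares) / len(shares)

weighted = renewable_share_weighted(seasons)
simple = renewable_share_simple(seasons)
print(f"weighted: {weighted:.3f}, simple: {simple:.3f}")
```

The two numbers differ whenever seasons carry different total loads, which is why the weighted version is the one used for the yearly proportion.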

#### *Market Regulation*

Finally, I will use NLTK, PyPDF2, and Matplotlib to assess three regulation documents regarding changes in retail electricity price formulation in CAISO from 2017-2026, specifically for Pacific Gas & Electric, the largest utility company in California (California Public Utilities Commission 2017-2023). The regulation documents are sourced from the California Public Utilities Commission website as PDFs, which will be converted into text using PyPDF2. I will then use NLTK tools such as tokenization, lemmatization, and stopword removal to normalize the texts. These normalized texts will then be run through the frequency distribution tool in NLTK, allowing me to assess the change in rationale for increasing retail electricity prices from 2017-2026.
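The normalization and counting steps can be sketched without downloading any NLTK data. The block below is a standard-library stand-in, not the NLTK pipeline itself: a regular expression replaces the tokenizer, a tiny hand-written stopword set replaces NLTK's list, and lemmatization is omitted. The sample text and the `STOPWORDS` set are invented for illustration and do not come from the regulation documents.

```python
import re
from collections import Counter

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "is", "are"}

def normalize(text):
    """Lowercase, split into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def top_terms(text, n=3):
    """Return the n most frequent normalized tokens with their counts."""
    return Counter(normalize(text)).most_common(n)

# Invented sample text, used only to exercise the functions.
sample = (
    "The utility requests a rate increase to fund wildfire mitigation. "
    "Wildfire mitigation costs are the largest driver of the rate increase."
)
print(top_terms(sample))
```

NLTK's `FreqDist` is a subclass of `collections.Counter`, so `most_common` behaves the same way once the tokens come from the real pipeline.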






---

## **Results:**

### Retail Electricity Market




#### Importing, Exploring, and Cleaning the Data

First, I imported the retail market data sourced from the EIA (U.S. Energy Information Administration 2025), which can be found in my GitHub repository. The name of the file is "MonthlyPrice_State.xlsx".

I then checked which variables and what type of variables are included in the dataset, converting "Date" into a date and time variable. Finally, I checked for duplicates.

In [1]:
#importing Pandas library and EIA data for residential retail electricity prices and sales
import pandas as pd

price_df = pd.read_excel("/content/MonthlyPrice_State.xlsx")
price_df.head()

Unnamed: 0,Year,Month,Date,State,Price,Sales,Revenue
0,2025,7,Jul 2025,Alaska,27.3,148350.34,40502.62
1,2025,7,Jul 2025,Alabama,15.88,3708754.8,589018.84
2,2025,7,Jul 2025,Arkansas,13.23,2100595.4,277912.37
3,2025,7,Jul 2025,Arizona,15.38,5247371.7,806999.6
4,2025,7,Jul 2025,California,32.58,8266152.6,2693302.6


In [3]:
#explore the data
price_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Year     9537 non-null   int64  
 1   Month    9537 non-null   int64  
 2   Date     9537 non-null   object 
 3   State    9537 non-null   object 
 4   Price    9537 non-null   float64
 5   Sales    9537 non-null   float64
 6   Revenue  9537 non-null   float64
dtypes: float64(3), int64(2), object(2)
memory usage: 521.7+ KB


In [8]:
#change the variable "Date" to a date and time variable type
price_df['Date_dt'] = pd.to_datetime(price_df['Date'], format = '%b %Y')
price_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 8 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   Year     9537 non-null   int64         
 1   Month    9537 non-null   int64         
 2   Date     9537 non-null   object        
 3   State    9537 non-null   object        
 4   Price    9537 non-null   float64       
 5   Sales    9537 non-null   float64       
 6   Revenue  9537 non-null   float64       
 7   Date_dt  9537 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(3), int64(2), object(2)
memory usage: 596.2+ KB


I used the [Python documentation](https://docs.python.org/3/library/datetime.html#format-codes) to look up how to write the `format` argument so that it matches the "Date" variable in my dataset. I found that `%b` represents an abbreviated month name and `%Y` a four-digit year.
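The same format codes work in the standard library's `datetime.strptime`, which makes it easy to check a format string against a sample value before applying it to the whole column. This is a small sketch using two "Date" values that appear in the dataset; because no day is given, the parsed date falls on the first of the month.

```python
from datetime import datetime

# "%b" matches an abbreviated month name and "%Y" a four-digit year.
fmt = "%b %Y"

parsed = [datetime.strptime(s, fmt) for s in ["Jul 2025", "Dec 2016"]]
for d in parsed:
    print(d.date())
```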

In [6]:
#checking for duplicates (an empty result means there are no duplicated rows)
price_df[price_df.duplicated(keep=False)].sort_values(by=['Date_dt', 'State'])

Unnamed: 0,Year,Month,Date,State,Price,Sales,Revenue,Date_dt


#### Mutating the Existing Data Frame

In addition to looking at changes in electricity prices and price volatility over the full time span, it is also important to assess volatility within a single year. This is why I also create separate data frames that isolate electricity prices and sales for the years 2016 and 2024.

In [10]:
#creating another new data frame price_2024 with all the entries from price_df that have the year 2024
price_2024 = price_df[price_df['Date_dt'].dt.year == 2024]
price_2024

Unnamed: 0,Year,Month,Date,State,Price,Sales,Revenue,Date_dt
357,2024,12,Dec 2024,Alaska,22.38,211868.21,47425.22,2024-12-01
358,2024,12,Dec 2024,Alabama,14.91,2788677.20,415668.94,2024-12-01
359,2024,12,Dec 2024,Arkansas,11.74,1499647.50,176081.29,2024-12-01
360,2024,12,Dec 2024,Arizona,15.20,2353304.30,357730.26,2024-12-01
361,2024,12,Dec 2024,California,30.55,7074452.60,2161521.30,2024-12-01
...,...,...,...,...,...,...,...,...
964,2024,1,Jan 2024,Vermont,21.14,230484.45,48712.78,2024-01-01
965,2024,1,Jan 2024,Washington,11.07,4694306.70,519574.15,2024-01-01
966,2024,1,Jan 2024,Wisconsin,16.54,2189903.50,362248.76,2024-01-01
967,2024,1,Jan 2024,West Virginia,13.65,1301973.60,177732.18,2024-01-01


In [9]:
#creating a new data frame price_2016 with all the entries from price_df that have the year 2016
price_2016 = price_df[price_df['Date_dt'].dt.year == 2016]
price_2016

Unnamed: 0,Year,Month,Date,State,Price,Sales,Revenue,Date_dt
5253,2016,12,Dec 2016,Alaska,20.17,224355.91,45243.16,2016-12-01
5254,2016,12,Dec 2016,Alabama,11.96,2579366.50,308518.17,2016-12-01
5255,2016,12,Dec 2016,Arkansas,9.48,1413122.00,133996.72,2016-12-01
5256,2016,12,Dec 2016,Arizona,11.23,2252766.80,252889.54,2016-12-01
5257,2016,12,Dec 2016,California,18.16,7308559.00,1327060.50,2016-12-01
...,...,...,...,...,...,...,...,...
5860,2016,1,Jan 2016,Vermont,16.65,211324.69,35191.09,2016-01-01
5861,2016,1,Jan 2016,Washington,9.24,3969633.50,366719.54,2016-01-01
5862,2016,1,Jan 2016,Wisconsin,13.45,2174100.20,292517.81,2016-01-01
5863,2016,1,Jan 2016,West Virginia,10.69,1430145.10,152855.05,2016-01-01
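With the two yearly frames in hand, nominal price growth between them can be computed directly. The sketch below is only an illustration: it hard-codes the December prices visible in the two outputs above rather than reading the data frames, and `pct_growth` is a hypothetical helper name. Texas does not appear in the truncated outputs, so it is omitted here. The figures are nominal and not adjusted for inflation.

```python
# December residential prices (cents/kWh) copied from the outputs above.
dec_2016 = {"Alaska": 20.17, "Alabama": 11.96, "Arkansas": 9.48,
            "Arizona": 11.23, "California": 18.16}
dec_2024 = {"Alaska": 22.38, "Alabama": 14.91, "Arkansas": 11.74,
            "Arizona": 15.20, "California": 30.55}

def pct_growth(old, new):
    """Percent change from old to new."""
    return (new - old) / old * 100

# Growth per state, printed from largest to smallest.
growth = {s: pct_growth(dec_2016[s], dec_2024[s]) for s in dec_2016}
for s, g in sorted(growth.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{s}: {g:.1f}%")
```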


#### Data Visualizations

Here, I create line graphs of electricity prices and sales across the whole time frame of 2010-2024, as well as individual graphs for the years 2016 and 2024, using the Plotly library. I wrote the graphs as functions so that the state could be changed if I wanted to compare or add other states, and so that other data frames (e.g., different years) could be used.

In [11]:
# Import Plotly Express
import plotly.express as px

In [20]:
#line graph for electricity prices
def price_fig(df, state, title):

    #I used AI to help me create the following lines of code allowing for variation in state selection
    if state == 'all':
        filtered_df = df
    else:
        filtered_df = df[df['State'].isin(state)]

    pricefig = px.line(filtered_df, x='Date_dt', y='Price', color='State',
                       title=title,
                       labels={'Date_dt': 'Date', 'Price': 'Price (cents/kWh)'})

    return pricefig

state = ["California", "Texas"]

#Figure 1 shows the change in electricity prices from 2010-2024 for California and Texas.
pricefig1 = price_fig(price_df, state, 'Change in Retail Electricity Price (2010-2024)')
pricefig1.show()

#Figure 2 shows the change in electricity prices during 2024 for California and Texas.
pricefig2 = price_fig(price_2024, state, 'Change in Retail Electricity Price (2024)')
pricefig2.show()

#Figure 3 shows the change in electricity prices during 2016 for California and Texas.
pricefig3 = price_fig(price_2016, state, 'Change in Retail Electricity Price (2016)')
pricefig3.show()

AI input for creating the function so the graphs could change by selecting different states:

I have a code that produces a line graph using a function allowing for variation in dataframe, state, and title input. How can I create a statement that allows for the change in state or have multiple states?

    def price_fig(df, state, title):
        pricefig = px.line(df, x='Date_dt', y='Price', color='State',
                           title=title,
                           labels={'Date_dt': 'Date', 'Price': 'Price (cents/kWh)'})
        return pricefig

AI Output:

AI gave me the command

    if state == 'all':
        filtered_df = df
    else:
        filtered_df = df[df['State'].isin(state)]

and changed the dataframe input in the px.line function to the created variable filtered_df
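One limitation of this pattern is that `isin` expects a list-like argument, so passing a single state name as a plain string (for example `'Texas'`) raises an error instead of selecting that state. A hedged sketch of one way to handle this is shown below; `resolve_states` is a hypothetical helper and is not part of the notebook or of pandas.

```python
def resolve_states(state, available):
    """Return the list of states to plot.

    state: 'all', a single state name, or an iterable of state names.
    available: every state name present in the data.
    """
    if state == "all":
        return list(available)
    if isinstance(state, str):
        # Wrap a single name so it can be passed to a list-based filter.
        return [state]
    return list(state)

available = ["California", "New York", "Texas"]
print(resolve_states("all", available))
print(resolve_states("Texas", available))
print(resolve_states(["California", "Texas"], available))
```

The returned list could then be passed to `df['State'].isin(...)` inside the plotting functions, so that `'all'`, one state, and several states all go through the same filter.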

In [25]:
#line graphs for retail electricity sales
def sales_fig(df, state, title):

    if state == 'all':
        filtered_df = df
    else:
        filtered_df = df[df['State'].isin(state)]


    salesfig = px.line(filtered_df, x='Date_dt', y='Sales', color='State',
                       title=title,
                       labels={'Date_dt': 'Date', 'Sales': 'Sales (MWh)'})

    return salesfig

state = ["Texas", "California"]


salesfig1 = sales_fig(price_df, state, 'Change in Electricity Sales (2010-2024)')
salesfig1.show()

salesfig2 = sales_fig(price_2024, state, 'Change in Electricity Sales (2024)')
salesfig2.show()

salesfig3 = sales_fig(price_2016, state, 'Change in Electricity Sales (2016)')
salesfig3.show()



---

## **Conclusion & Discussion:**