In [9]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import os

# U.S. Electricity Generation Technologies and Fuel Consumption Analysis (2015-2023)
## Data Wrangle - Pt. 1

**Collaborators:**

* Amy Zhang (332)
* ChatGPT (0101)
* Perplexity AI (20250328, 20250330)
* Gemini AI (20250328, 20250330, 20250331)

# Table of Contents

## Project Steps

* **Step 1: Data Acquisition and Initial Processing (power_plant_df)**
    * a) Data Completeness Assessment (% Missing Values)
    * b) Initial Column Removal
    * c) Data Type Mapping and Conversion (Preparation for Summary Statistics)
    * d) CSV Export with Data Type Preservation (Backup)
* **Step 2: Data Profiling and Cleaning**
  * **Disclaimer: Due to limited domain knowledge in power plant operations and energy sector data, the following analysis proceeded with a broad scope, examining all categorical and numerical columns. Detailed interpretations and potential implications of the findings are documented within the 'ProcessERD_CF6.1.png' file, which includes visual representations of data relationships and discovered patterns.**
    * a) CSV Re-import and Data Type Verification
    * b) Categorical Column Analysis (Summary Statistics & Exploration)
        * i) 13,188 Unique Plant IDs
        * ii) 63 Plants with Nuclear Unit IDs
        * iii) Plant Names with Multiple Plant IDs (32 Names, 541 Rows): Identified Need for Plant ID and State Granularity
        * iv) Duplicate Record Check: Confirmed Granularity (Year, Plant ID, State, Prime Mover, Fuel Type, Plant Name)
        * v) Removal of Six Duplicate Records
        * vi) [ERROR - Amended in Second Notebook] Removal of Rows with Missing Plant Name, Operator Name, Operator ID, and State (9 Rows)
        * vii) Operator ID/Name Consistency Check: Flagged Inconsistent Rows
        * viii) Imputation of Missing NERC Region Values Using Operator IDs
        * ix) Handling Operator ID/NERC Region Outliers ("State-Fuel Level Increment" Entries)
        * x-xiii) Column Additions (Full Prime Mover and Fuel Type Names); Column Removals (MER Fuel Type Code, Respondent Frequency, Balancing Authority)
        * xiv) **[MAJOR ERROR - Amended in Second Notebook]** Removal of Physical Unit Label Column
        * xv) Numerical Column Descriptive Statistics and Subsequent Analysis Planning
* **Step 3: Future Steps and Iterations**
    * 1. Data Re-processing (Original power_plant_df): Retain Physical Unit Labels
    * 2. Creation of Year-State-Prime Mover-Fuel Type Pivot Table

# Step 1. Merge all the Power Plant Electricity Generation and Fuel Consumption datasets (2015-2023)

In [15]:
# List of file paths
file_paths = [
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2023/EIA923_Schedules_2_3_4_5_M_12_2023_Final.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2022/EIA923_Schedules_2_3_4_5_M_12_2022_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2021/EIA923_Schedules_2_3_4_5_M_12_2021_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2020/EIA923_Schedules_2_3_4_5_M_12_2020_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2019/EIA923_Schedules_2_3_4_5_M_12_2019_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2018/EIA923_Schedules_2_3_4_5_M_12_2018_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2017/EIA923_Schedules_2_3_4_5_M_12_2017_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2016/EIA923_Schedules_2_3_4_5_M_12_2016_Final_Revision.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/PROJECTdata/f923_2015/EIA923_Schedules_2_3_4_5_M_12_2015_Final_Revision.xlsx'
]

# Read the first sheet (sheet_name=0) from each file, skipping the first 5 rows
dfs_powerplants = [pd.read_excel(fp, sheet_name=0, skiprows=5) for fp in file_paths]

# Display column names from each dataframe
for i, df in enumerate(dfs_powerplants):
    print(f"Columns from {file_paths[i].split('/')[-1]}:")
    print(df.columns.tolist())
    print("\n" + "-"*80 + "\n")


Columns from EIA923_Schedules_2_3_4_5_M_12_2023_Final.xlsx:
['Plant Id', 'Combined Heat And\nPower Plant', 'Nuclear Unit Id', 'Plant Name', 'Operator Name', 'Operator Id', 'Plant State', 'Census Region', 'NERC Region', 'Reserved', 'NAICS Code', 'EIA Sector Number', 'Sector Name', 'Reported\nPrime Mover', 'Reported\nFuel Type Code', 'MER\nFuel Type Code', 'Balancing\nAuthority Code', 'Respondent\nFrequency', 'Physical\nUnit Label', 'Quantity\nJanuary', 'Quantity\nFebruary', 'Quantity\nMarch', 'Quantity\nApril', 'Quantity\nMay', 'Quantity\nJune', 'Quantity\nJuly', 'Quantity\nAugust', 'Quantity\nSeptember', 'Quantity\nOctober', 'Quantity\nNovember', 'Quantity\nDecember', 'Elec_Quantity\nJanuary', 'Elec_Quantity\nFebruary', 'Elec_Quantity\nMarch', 'Elec_Quantity\nApril', 'Elec_Quantity\nMay', 'Elec_Quantity\nJune', 'Elec_Quantity\nJuly', 'Elec_Quantity\nAugust', 'Elec_Quantity\nSeptember', 'Elec_Quantity\nOctober', 'Elec_Quantity\nNovember', 'Elec_Quantity\nDecember', 'MMBtuPer_Unit\nJanua
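
Before concatenating, it can help to confirm that every yearly file exposes the same column names, since `pd.concat` fills any column that is not shared with NaN rather than raising an error. The following is a minimal sketch using small stand-in DataFrames, not the real EIA-923 files; the helper name `column_differences` and the demo frames are hypothetical.

```python
import pandas as pd

def column_differences(frames, labels):
    """Return, per label, the columns missing from or extra to the first frame."""
    reference = list(frames[0].columns)
    report = {}
    for label, frame in zip(labels, frames):
        missing = [c for c in reference if c not in frame.columns]
        extra = [c for c in frame.columns if c not in reference]
        if missing or extra:
            report[label] = {"missing": missing, "extra": extra}
    return report

# Stand-in frames; in the notebook these would be dfs_powerplants and the file names.
demo_frames = [
    pd.DataFrame(columns=["Plant Id", "Plant Name", "YEAR"]),
    pd.DataFrame(columns=["Plant Id", "Plant Name", "YEAR"]),
    pd.DataFrame(columns=["Plant Id", "Plant Name", "Year"]),  # deliberately mismatched
]
demo_labels = ["2023", "2022", "2021"]

column_report = column_differences(demo_frames, demo_labels)
print(column_report)
```

An empty report would suggest the frames can be stacked without introducing columns that are populated for only some years.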

## 1a) Assess data completeness -- see the % missing for each attribute.

In [37]:
# Concatenate all dataframes vertically
merged_df = pd.concat(dfs_powerplants, ignore_index=True)

# Sort the dataframe by Year and Plant Id
merged_df = merged_df.sort_values(['YEAR', 'Plant Id'])

# Reset the index
merged_df = merged_df.reset_index(drop=True)

# Count of missing values per column
missing_count = merged_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(merged_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())

                                     Missing Count  Missing Percentage
Plant Id                                         0            0.000000
Combined Heat And\nPower Plant                   0            0.000000
Nuclear Unit Id                                  0            0.000000
Plant Name                                       9            0.006837
Operator Name                                    9            0.006837
Operator Id                                      0            0.000000
Plant State                                      9            0.006837
Census Region                                    0            0.000000
NERC Region                                   5188            3.941111
Reserved                                    131638          100.000000
NAICS Code                                       0            0.000000
EIA Sector Number                                0            0.000000
Sector Name                                      0            0.000000
Report
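
The missing-value summary is computed again later for a subset of plants, so it may be convenient to wrap it in a small function. This is a sketch on made-up data; `summarize_missing` is a hypothetical helper name, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return missing-value counts and percentages per column, most-missing first."""
    summary = pd.DataFrame({
        "Missing Count": df.isna().sum(),
        "Missing Percentage": df.isna().mean() * 100,
    })
    return summary.sort_values("Missing Count", ascending=False)

# Made-up rows that mimic a fully empty column and a partly empty one
demo_df = pd.DataFrame({
    "Plant Id": [1, 2, 3, 4],
    "NERC Region": ["WECC", None, "SERC", None],
    "Reserved": [np.nan, np.nan, np.nan, np.nan],
})

demo_summary = summarize_missing(demo_df)
print(demo_summary.to_string())
```

Sorting by count puts fully empty columns such as `Reserved` at the top, which makes candidates for removal easy to spot.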

## 1b) Dropping columns 'Reserved', 'Reserved.1', and 'Reserved.2' because they hold zero information. 

In [50]:
# Columns to remove
columns_to_remove = ['Reserved', 'Reserved.1', 'Reserved.2']

# Remove the columns if they exist
merged_df = merged_df.drop(columns=[col for col in columns_to_remove if col in merged_df.columns])

print(merged_df.dtypes.to_string())

Plant Id                                 int64
Combined Heat And\nPower Plant          object
Nuclear Unit Id                         object
Plant Name                              object
Operator Name                           object
Operator Id                             object
Plant State                             object
Census Region                           object
NERC Region                             object
NAICS Code                               int64
EIA Sector Number                        int64
Sector Name                             object
Reported\nPrime Mover                   object
Reported\nFuel Type Code                object
MER\nFuel Type Code                     object
Balancing\nAuthority Code               object
Respondent\nFrequency                   object
Physical\nUnit Label                    object
Quantity\nJanuary                       object
Quantity\nFebruary                      object
Quantity\nMarch                         object
Quantity\nApr
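
As an alternative to listing the empty columns by hand, they could be detected programmatically. The sketch below runs on a made-up frame and assumes that "holds zero information" means every value in the column is missing; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with two fully empty columns
demo_df = pd.DataFrame({
    "Plant Id": [1, 2],
    "Reserved": [np.nan, np.nan],
    "Reserved.1": [np.nan, np.nan],
    "NERC Region": ["WECC", None],
})

# Columns in which every value is missing
empty_cols = demo_df.columns[demo_df.isna().all()].tolist()

# errors="ignore" skips names that are absent, replacing the explicit membership check
trimmed_df = demo_df.drop(columns=empty_cols + ["Reserved.2"], errors="ignore")

print(empty_cols)
print(trimmed_df.columns.tolist())
```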

## 1c) Data Type Mapping and Conversion -- preparing for summary statistics

In [54]:
# Columns to convert to numeric (float64)
numeric_cols = [
    'Quantity\nJanuary', 'Quantity\nFebruary', 'Quantity\nMarch', 'Quantity\nApril',
    'Quantity\nMay', 'Quantity\nJune', 'Quantity\nJuly', 'Quantity\nAugust',
    'Quantity\nSeptember', 'Quantity\nOctober', 'Quantity\nNovember', 'Quantity\nDecember',
    'Elec_Quantity\nJanuary', 'Elec_Quantity\nFebruary', 'Elec_Quantity\nMarch', 'Elec_Quantity\nApril',
    'Elec_Quantity\nMay', 'Elec_Quantity\nJune', 'Elec_Quantity\nJuly', 'Elec_Quantity\nAugust',
    'Elec_Quantity\nSeptember', 'Elec_Quantity\nOctober', 'Elec_Quantity\nNovember', 'Elec_Quantity\nDecember',
    'MMBtuPer_Unit\nJanuary', 'MMBtuPer_Unit\nFebruary', 'MMBtuPer_Unit\nMarch', 'MMBtuPer_Unit\nApril',
    'MMBtuPer_Unit\nMay', 'MMBtuPer_Unit\nJune', 'MMBtuPer_Unit\nJuly', 'MMBtuPer_Unit\nAugust',
    'MMBtuPer_Unit\nSeptember', 'MMBtuPer_Unit\nOctober', 'MMBtuPer_Unit\nNovember', 'MMBtuPer_Unit\nDecember',
    'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch', 'Tot_MMBtu\nApril',
    'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune', 'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust',
    'Tot_MMBtu\nSeptember', 'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
    'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch', 'Elec_MMBtu\nApril',
    'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune', 'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust',
    'Elec_MMBtu\nSeptember', 'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
    'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
    'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', 'Netgen\nAugust',
    'Netgen\nSeptember', 'Netgen\nOctober', 'Netgen\nNovember', 'Netgen\nDecember'
]

# Convert the specified columns to numeric (float64)
for col in numeric_cols:
    if col in merged_df.columns:  # check that the column exists before converting
        merged_df[col] = pd.to_numeric(merged_df[col], errors='coerce')  # non-numeric values become NaN

# Convert YEAR to a nullable integer (Int64)
merged_df['YEAR'] = pd.to_numeric(merged_df['YEAR'], errors='coerce').astype('Int64')

# Convert 'Combined Heat And\nPower Plant' to boolean
if 'Combined Heat And\nPower Plant' in merged_df.columns:
    merged_df['Combined Heat And\nPower Plant'] = merged_df['Combined Heat And\nPower Plant'].map({'Y': True, 'N': False})

# Sort the dataframe by Year and Plant Id
merged_df = merged_df.sort_values(['YEAR', 'Plant Id'])

# Reset the index
merged_df = merged_df.reset_index(drop=True)

# Display the dtypes
print(merged_df.dtypes.to_string())

Plant Id                                 int64
Combined Heat And\nPower Plant            bool
Nuclear Unit Id                         object
Plant Name                              object
Operator Name                           object
Operator Id                             object
Plant State                             object
Census Region                           object
NERC Region                             object
NAICS Code                               int64
EIA Sector Number                        int64
Sector Name                             object
Reported\nPrime Mover                   object
Reported\nFuel Type Code                object
MER\nFuel Type Code                     object
Balancing\nAuthority Code               object
Respondent\nFrequency                   object
Physical\nUnit Label                    object
Quantity\nJanuary                      float64
Quantity\nFebruary                     float64
Quantity\nMarch                        float64
Quantity\nApr
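
Because `errors='coerce'` silently replaces anything non-numeric with NaN, it may be worth counting how many values were actually coerced in each column. The sketch below uses made-up values; the assumption that placeholders such as `"."` appear in the monthly columns is illustrative and was not verified against the real files.

```python
import pandas as pd

# Made-up monthly columns containing text placeholders
demo_df = pd.DataFrame({
    "Netgen\nJanuary": ["120.5", ".", "0", None],
    "Netgen\nFebruary": ["98", "77.1", "n/a", "12"],
})

coerced_counts = {}
for col in demo_df.columns:
    converted = pd.to_numeric(demo_df[col], errors="coerce")
    # Values that were present before conversion but are NaN afterwards
    coerced_counts[col] = int((converted.isna() & demo_df[col].notna()).sum())
    demo_df[col] = converted

print(coerced_counts)
```

A non-zero count points to text entries that were turned into NaN, as opposed to values that were already missing in the source.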

## 1d) CSV Export with Data Type Preservation

In [56]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/original_merge'
output_file = 'merged_EIA923_Schedules_2_3_4_5_2015_2023.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
merged_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'merged_EIA923_Schedules_2_3_4_5_2015_2023.csv' has been created successfully.
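
A CSV file stores no dtype information, so types are inferred again on re-import. One way to keep them stable is to save the dtype names next to the CSV and pass them back to `pd.read_csv`. The sketch below is an assumption about how this could be done, not the notebook's method; it uses an in-memory buffer and made-up rows, and in practice `dtype_json` would be written to a sidecar file beside the CSV.

```python
import io
import json
import pandas as pd

# Made-up rows covering the dtypes seen in the merged frame
demo_df = pd.DataFrame({
    "Plant Id": [46, 566],
    "Operator Id": ["18642", "99999"],
    "YEAR": pd.array([2015, None], dtype="Int64"),
    "Combined Heat And\nPower Plant": [False, True],
    "Netgen\nJanuary": [120.5, None],
})

# Record the dtype name of every column
dtype_map = {col: str(dtype) for col, dtype in demo_df.dtypes.items()}
dtype_json = json.dumps(dtype_map)

# Write the CSV to a buffer (stand-in for a file on disk)
csv_buffer = io.StringIO()
demo_df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)

# Re-import with the recorded dtypes instead of relying on inference
restored_df = pd.read_csv(csv_buffer, dtype=json.loads(dtype_json))
print(restored_df.dtypes.to_string())
```

Without the explicit `dtype` argument, a column like `Operator Id` would be re-read as integers here, and `YEAR` would come back as float because of the missing value.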


# Step 2. Data Profile + Clean 

## 2a) Re-import CSV; double-check dtypes. 

In [59]:
# Re-import the merged CSV file
power_plant_df = pd.read_csv('/Users/amyzhang/Desktop/A6_Dashboard/01 Data/original_merge/merged_EIA923_Schedules_2_3_4_5_2015_2023.csv')
# Print the data types using to_string()
print(power_plant_df.dtypes.to_string())


Plant Id                                 int64
Combined Heat And\nPower Plant            bool
Nuclear Unit Id                         object
Plant Name                              object
Operator Name                           object
Operator Id                             object
Plant State                             object
Census Region                           object
NERC Region                             object
NAICS Code                               int64
EIA Sector Number                        int64
Sector Name                             object
Reported\nPrime Mover                   object
Reported\nFuel Type Code                object
MER\nFuel Type Code                     object
Balancing\nAuthority Code               object
Respondent\nFrequency                   object
Physical\nUnit Label                    object
Quantity\nJanuary                      float64
Quantity\nFebruary                     float64
Quantity\nMarch                        float64
Quantity\nApr

## 2b) Categorical Columns - Summary Statistics

#### i) Unique Plant IDs: 13,188

In [63]:
num_unique_plant_ids = power_plant_df['Plant Id'].nunique()
print(f"Number of unique Plant IDs: {num_unique_plant_ids}")

Number of unique Plant IDs: 13188


In [61]:
# Identify categorical columns
categorical_cols = power_plant_df.select_dtypes(include=['object', 'category']).columns

# Display summary statistics for categorical columns
if not categorical_cols.empty:
    print("Summary Statistics for Categorical Columns:")
    for col in categorical_cols:
        print(f"\nColumn: {col}")
        print(power_plant_df[col].describe())
else:
    print("There are no categorical columns in the dataframe.")

Summary Statistics for Categorical Columns:

Column: Nuclear Unit Id
count     131638
unique         5
top            .
freq      130766
Name: Nuclear Unit Id, dtype: object

Column: Plant Name
count                         131629
unique                         13743
top       State-Fuel Level Increment
freq                            1729
Name: Plant Name, dtype: object

Column: Operator Name
count                         131629
unique                          7119
top       State-Fuel Level Increment
freq                            1729
Name: Operator Name, dtype: object

Column: Operator Id
count     131638
unique     12375
top        99999
freq        1110
Name: Operator Id, dtype: int64

Column: Plant State
count     131629
unique        51
top           CA
freq       16052
Name: Plant State, dtype: object

Column: Census Region
count     131638
unique        10
top          SAT
freq       22624
Name: Census Region, dtype: object

Column: NERC Region
count     126450
unique       

#### ii) Number of Power Plants (Plant IDs) with Nuclear Unit: 63

In [67]:
# Filter out rows where Nuclear Unit Id is '.'
filtered_df = power_plant_df[power_plant_df['Nuclear Unit Id'] != '.']

# Count the number of unique Plant IDs in the filtered DataFrame
num_unique_plant_ids = filtered_df['Plant Id'].nunique()

# Print the result
print(f"Number of unique Plant IDs (excluding '.' in Nuclear Unit Id): {num_unique_plant_ids}")

Number of unique Plant IDs (excluding '.' in Nuclear Unit Id): 63


#### iii) Exploring Plant Names with more than 1 Plant ID: 32 Plant Names; 541 rows

In [73]:
# Count distinct Plant Ids per Plant Name (groupby excludes rows with a missing Plant Name)
plant_name_plant_id_counts = power_plant_df.groupby('Plant Name')['Plant Id'].nunique()
plant_name_plant_id_counts = plant_name_plant_id_counts.sort_values(ascending=False)

# Filter for Plant Names with more than 1 Plant Id
multiple_plant_ids = plant_name_plant_id_counts[plant_name_plant_id_counts > 1]

# Print the Plant Names and their counts
print("Plant Names with More Than 1 Plant Id:")
print(multiple_plant_ids)

# Print the number of such Plant Names
num_multiple_plant_ids = len(multiple_plant_ids)
print(f"\nNumber of Plant Names with More Than 1 Plant Id: {num_multiple_plant_ids}")

Plant Names with More Than 1 Plant Id:
Plant Name
Richland                            3
Tait Electric Generating Station    2
Bear Creek Solar                    2
Odessa                              2
Dover                               2
Fredonia                            2
High Plains                         2
Consumer Operations LLC             2
Halifax                             2
Calvert City                        2
South Plant                         2
Kelford                             2
Bedford Solar                       2
Cedar Creek                         2
Franklin Solar                      2
Desert Star Hybrid                  2
Newington                           2
Seminole                            2
Wilkinson DeFore                    2
Beaver Dam                          2
Quincy Solar                        2
Bear Creek                          2
Drop 5                              2
River Bend Solar, LLC               2
Bliss                               2


In [99]:
# Plant Names with more than 1 Plant Id
plant_names_to_examine = [
    'Richland', 'Tait Electric Generating Station', 'Bear Creek Solar', 'Odessa', 'Dover',
    'Fredonia', 'High Plains', 'Consumer Operations LLC', 'Halifax', 'Calvert City',
    'South Plant', 'Kelford', 'Bedford Solar', 'Cedar Creek', 'Franklin Solar',
    'Desert Star Hybrid', 'Newington', 'Seminole', 'Wilkinson DeFore', 'Beaver Dam',
    'Quincy Solar', 'Bear Creek', 'Drop 5', 'River Bend Solar, LLC', 'Bliss',
    'Cascade Dam', 'Wilson Solar', 'Harmony Solar', 'Smithfield Packaged Meats Corp.',
    'Pima Community College', 'Jefferson Solar', 'Clinton'
]

# Create an empty list to store the rows
rows_to_include = []

# Iterate through the Plant Names and collect the rows
for plant_name in plant_names_to_examine:
    rows = power_plant_df[power_plant_df['Plant Name'] == plant_name]
    rows_to_include.append(rows)

# Concatenate the collected rows into a single DataFrame
selected_plants_df = pd.concat(rows_to_include, ignore_index=True)

# Now, selected_plants_df contains all the rows to examine.

# Preliminary check: shape of the selected subset
print(selected_plants_df.shape)

# Count of missing values per column
missing_count_selected = selected_plants_df.isnull().sum()

# Percentage of missing values per column
missing_percentage_selected = (missing_count_selected / len(selected_plants_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary_selected = pd.DataFrame({
    'Missing Count': missing_count_selected,
    'Missing Percentage': missing_percentage_selected
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary_selected.to_string())

(541, 97)
                                     Missing Count  Missing Percentage
Plant Id                                         0            0.000000
Combined Heat And\nPower Plant                   0            0.000000
Nuclear Unit Id                                  0            0.000000
Plant Name                                       0            0.000000
Operator Name                                    0            0.000000
Operator Id                                      0            0.000000
Plant State                                      0            0.000000
Census Region                                    0            0.000000
NERC Region                                     33            6.099815
NAICS Code                                       0            0.000000
EIA Sector Number                                0            0.000000
Sector Name                                      0            0.000000
Reported\nPrime Mover                            0            0.000
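
The list of names above was typed out by hand. A possible alternative is to derive it from the per-name counts and filter with `isin`, which also keeps the original row labels. This is a sketch on made-up rows; the plant IDs other than those printed in the notebook are invented for illustration.

```python
import pandas as pd

# Made-up rows: two names map to more than one Plant Id, one does not
demo_df = pd.DataFrame({
    "Plant Name": ["Richland", "Richland", "Richland", "Odessa", "Odessa", "Browns Ferry"],
    "Plant Id": [2880, 2880, 60513, 3001, 3002, 46],
})

ids_per_name = demo_df.groupby("Plant Name")["Plant Id"].nunique()
names_with_multiple_ids = ids_per_name[ids_per_name > 1].index.tolist()

# isin keeps the original index labels, so rows can be traced back to demo_df
demo_selected = demo_df[demo_df["Plant Name"].isin(names_with_multiple_ids)]

print(names_with_multiple_ids)
print(demo_selected.shape)
```

Keeping the original labels matters whenever rows found in the subset are later looked up in, or dropped from, the full DataFrame.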

In [101]:
selected_plants_df['Plant Id'].value_counts(dropna=False)

Plant Id
2914     32
8002     27
10360    27
8837     22
2847     20
         ..
65009     2
65374     1
61104     1
64294     1
62965     1
Name: count, Length: 65, dtype: int64

In [103]:
plant_name_to_query = 'Richland' 
richland_rows = selected_plants_df[selected_plants_df['Plant Name'] == plant_name_to_query]
print(richland_rows)

    Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id Plant Name  \
0       2880                           False               .   Richland   
1       2880                           False               .   Richland   
2       2880                           False               .   Richland   
3       2880                           False               .   Richland   
4       2880                           False               .   Richland   
5       2880                           False               .   Richland   
6       2880                           False               .   Richland   
7       2880                           False               .   Richland   
8       2880                           False               .   Richland   
9       2880                           False               .   Richland   
10     60513                           False               .   Richland   
11      2880                           False               .   Richland   
12      2880             

### iv) Checking dataset granularity: examining duplicates

In [105]:
# Create a combined key to check for duplicates
check_duplicates_df = selected_plants_df[['YEAR', 'Plant Name', 'Plant Id', 'Plant State', 'Reported\nPrime Mover']].copy()

# Find duplicates based on the combined key
duplicates = check_duplicates_df.duplicated(keep=False)

# Print the duplicate rows
print("Duplicate Rows (Year-Plant Name-Plant Id-State-Reported Prime Mover):")
print(check_duplicates_df[duplicates])

# Print the number of duplicate rows
print(f"\nNumber of Duplicate Rows: {len(check_duplicates_df[duplicates])}")

# Check for the number of unique plant IDs per plant name after the new duplicate check.
unique_plant_ids_per_plant = check_duplicates_df.groupby('Plant Name')['Plant Id'].nunique()
print("\nUnique Plant IDs per Plant Name after duplicate check:")
print(unique_plant_ids_per_plant)

# Check for plant names with more than one unique plant id after the new duplicate check.
print("\nPlant Names with More than one unique Plant Id:")
print(unique_plant_ids_per_plant[unique_plant_ids_per_plant>1])

Duplicate Rows (Year-Plant Name-Plant Id-State-Reported Prime Mover):
       YEAR Plant Name  Plant Id Plant State Reported\nPrime Mover
0    2015.0   Richland      2880          OH                    GT
1    2015.0   Richland      2880          OH                    GT
2    2016.0   Richland      2880          OH                    GT
3    2016.0   Richland      2880          OH                    GT
4    2017.0   Richland      2880          OH                    GT
..      ...        ...       ...         ...                   ...
532  2017.0    Clinton      1818          MI                    IC
533  2018.0    Clinton      1818          MI                    IC
534  2018.0    Clinton      1818          MI                    IC
535  2019.0    Clinton      1818          MI                    IC
536  2019.0    Clinton      1818          MI                    IC

[213 rows x 5 columns]

Number of Duplicate Rows: 213

Unique Plant IDs per Plant Name after duplicate check:
Plant Name
Bear

In [107]:
plant_name_to_query = 'South Plant'

# Filter for the specific Plant Name
south_plant_df = selected_plants_df[selected_plants_df['Plant Name'] == plant_name_to_query]

# Create the combined key for checking granularity
granularity_columns = ['YEAR', 'Plant Id', 'Plant Name', 'Plant State', 'Reported\nPrime Mover']
south_plant_granularity_df = south_plant_df[granularity_columns].copy()

# Print the rows for the South Plant with the specified granularity columns
print(f"Rows for {plant_name_to_query} with Year-Plant Id-Plant Name-State-Reported Prime Mover granularity:")
print(south_plant_granularity_df)
# Check the number of duplicates, using a mask computed on this subset
south_plant_duplicates = south_plant_granularity_df.duplicated(keep=False)
print(f"\nNumber of duplicates: {south_plant_duplicates.sum()}")

Rows for South Plant with Year-Plant Id-Plant Name-State-Reported Prime Mover granularity:
       YEAR  Plant Id   Plant Name Plant State Reported\nPrime Mover
227  2015.0      7758  South Plant          IA                    IC
228  2016.0      7758  South Plant          IA                    IC
229  2017.0      7758  South Plant          IA                    IC
230  2018.0      7758  South Plant          IA                    IC
231  2019.0      7758  South Plant          IA                    IC
232  2020.0      7758  South Plant          IA                    IC
233  2021.0      7758  South Plant          IA                    IC
234  2022.0       492  South Plant          CO                    ST
235  2022.0       492  South Plant          CO                    ST
236  2022.0       492  South Plant          CO                    ST
237  2022.0      7758  South Plant          IA                    IC
238  2023.0       492  South Plant          CO                    GT
239  2023.0 


In [109]:
plant_name_to_query = 'South Plant'

# Filter for the specific Plant Name
south_plant_df = selected_plants_df[selected_plants_df['Plant Name'] == plant_name_to_query]

# Create the combined key for checking granularity
granularity_columns = ['YEAR', 'Plant Id', 'Plant Name', 'Plant State', 'Reported\nPrime Mover', 'Reported\nFuel Type Code']
south_plant_granularity_df = south_plant_df[granularity_columns].copy()

# Print the rows for the South Plant with the specified granularity columns
print(f"Rows for {plant_name_to_query} with Year-Plant Id-Plant Name-State-Reported Prime Mover-Reported Fuel Type granularity:")
print(south_plant_granularity_df)

# Check for duplicates at this granularity
duplicates = south_plant_granularity_df.duplicated(keep=False)
print(f"\nDuplicate Rows:")
print(south_plant_granularity_df[duplicates])

#Check the number of duplicates.
print(f"\nNumber of duplicates: {len(south_plant_granularity_df[duplicates])}")

Rows for South Plant with Year-Plant Id-Plant Name-State-Reported Prime Mover-Reported Fuel Type granularity:
       YEAR  Plant Id   Plant Name Plant State Reported\nPrime Mover  \
227  2015.0      7758  South Plant          IA                    IC   
228  2016.0      7758  South Plant          IA                    IC   
229  2017.0      7758  South Plant          IA                    IC   
230  2018.0      7758  South Plant          IA                    IC   
231  2019.0      7758  South Plant          IA                    IC   
232  2020.0      7758  South Plant          IA                    IC   
233  2021.0      7758  South Plant          IA                    IC   
234  2022.0       492  South Plant          CO                    ST   
235  2022.0       492  South Plant          CO                    ST   
236  2022.0       492  South Plant          CO                    ST   
237  2022.0      7758  South Plant          IA                    IC   
238  2023.0       492  Sou

#### (iv cont'd) Granularity of dataset: Year - Plant Id - Plant State - Prime Mover - Fuel Type - Plant Name; result: only six duplicate rows

In [133]:
# Granularity 4: Year - Plant Id - Plant State - Reported\nPrime Mover - Reported\nFuel Type Code - Plant Name
granularity4_columns = ['YEAR', 'Plant Id', 'Plant State', 'Reported\nPrime Mover', 'Reported\nFuel Type Code', 'Plant Name']
granularity4_df = selected_plants_df[granularity4_columns].copy()
duplicates4 = granularity4_df.duplicated(keep=False)

# Get the indices of the duplicate rows
duplicate_indices = granularity4_df[duplicates4].index

# Pull up the full information from selected_plants_df for these indices
full_duplicate_rows = selected_plants_df.loc[duplicate_indices]

# Print the full information
print("Full Information for Duplicate Rows:")
print(full_duplicate_rows)

Full Information for Duplicate Rows:
     Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id    Plant Name  \
221      8837                           False               .  Calvert City   
222      8837                           False               .  Calvert City   
223      8837                           False               .  Calvert City   
224      8837                           False               .  Calvert City   
225      8837                           False               .  Calvert City   
226      8837                           False               .  Calvert City   

                  Operator Name Operator Id Plant State Census Region  \
221  Tennessee Valley Authority       18642          KY           ESC   
222  Tennessee Valley Authority       18642          KY           ESC   
223  Tennessee Valley Authority       18642          KY           ESC   
224  Tennessee Valley Authority       18642          KY           ESC   
225  Tennessee Valley Authority       18642 
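
The search for the right granularity above adds one column at a time and re-counts duplicates. That loop could be condensed into a helper that scores several candidate keys at once. The sketch below uses made-up rows and fuel codes; `count_key_duplicates` and the candidate key labels are hypothetical names.

```python
import pandas as pd

def count_key_duplicates(df, key_columns):
    """Number of rows that share their key_columns values with at least one other row."""
    return int(df.duplicated(subset=key_columns, keep=False).sum())

# Made-up rows for a single plant across two years
demo_df = pd.DataFrame({
    "YEAR": [2022, 2022, 2022, 2023, 2023],
    "Plant Id": [492, 492, 492, 492, 492],
    "Reported\nPrime Mover": ["ST", "ST", "ST", "GT", "GT"],
    "Reported\nFuel Type Code": ["NG", "DFO", "DFO", "NG", "DFO"],
})

candidate_keys = {
    "year+plant": ["YEAR", "Plant Id"],
    "year+plant+mover": ["YEAR", "Plant Id", "Reported\nPrime Mover"],
    "year+plant+mover+fuel": [
        "YEAR", "Plant Id", "Reported\nPrime Mover", "Reported\nFuel Type Code",
    ],
}

duplicate_counts = {
    name: count_key_duplicates(demo_df, cols) for name, cols in candidate_keys.items()
}
print(duplicate_counts)
```

A candidate key that drives the count to zero, or to a handful of true duplicates, is a plausible description of the dataset's grain.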

### v) Removal of Duplicate Rows in `selected_plants_df`

We identified 6 duplicate rows in `selected_plants_df` while examining Plant Names associated with more than one Plant Id. All six belong to Calvert City (Plant Id 8837) and, on inspection, carry no distinguishing information, so we removed them from `selected_plants_df`.

This removal serves two purposes:

1.  **Data Cleaning:** Redundant rows could be double-counted in later aggregations, so eliminating them protects the integrity of subsequent analyses.
2.  **Granularity Refinement:** With the duplicates gone, we can test whether a given combination of columns uniquely identifies each record across the entire dataset (`power_plant_df`).

#### Removal of Duplicate Rows in `power_plant_df`

After removing the duplicate rows from `selected_plants_df`, we also remove the corresponding rows from the original `power_plant_df`, so that the two DataFrames stay consistent.

In [138]:
# Get the indices of the duplicate rows in selected_plants_df
duplicate_indices = granularity4_df[duplicates4].index

# Remove the duplicate rows from selected_plants_df
selected_plants_df = selected_plants_df.drop(duplicate_indices)

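# Remove the corresponding rows from power_plant_df
# NOTE: duplicate_indices are labels from selected_plants_df, which was rebuilt with
# ignore_index=True; they identify the same rows in power_plant_df only if the two indexes align.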
power_plant_df = power_plant_df.drop(duplicate_indices)

# Verification
granularity4_df_cleaned = selected_plants_df[granularity4_columns].copy()
duplicates4_cleaned = granularity4_df_cleaned.duplicated(keep=False)

print("Duplicate rows removed. New number of duplicates in selected_plants_df:")
print(len(granularity4_df_cleaned[duplicates4_cleaned]))

# Verification of power_plant_df removal
granularity_power_plant = power_plant_df[granularity4_columns].copy()
duplicates_power_plant = granularity_power_plant.duplicated(keep=False)
print("Duplicate rows removed. New number of duplicates in power_plant_df:")
print(len(granularity_power_plant[duplicates_power_plant]))

Duplicate rows removed. New number of duplicates in selected_plants_df:
0
Duplicate rows removed. New number of duplicates in power_plant_df:
1461


In [140]:
# Granularity explore for the entire dataset, power_plant_df
granularity5_columns = ['YEAR', 'Plant Id', 'Plant State', 'Reported\nPrime Mover', 'Reported\nFuel Type Code', 'Plant Name']
granularity5_df = power_plant_df[granularity5_columns].copy()
duplicates5 = granularity5_df.duplicated(keep=False)

# Get the rows that are duplicates
duplicate_rows_df_5 = power_plant_df[duplicates5]

# Print the duplicate rows dataframe
print(duplicate_rows_df_5)

# duplicate_rows_df_5 can now be queried, for example:
# print(duplicate_rows_df_5[duplicate_rows_df_5['Plant Id'] == 12345])

        Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id  \
37            46                           False               2   
38            46                           False               1   
39            46                           False               3   
569          566                           False               2   
570          566                           False               3   
...          ...                             ...             ...   
131626     99999                           False               .   
131629     99999                            True               .   
131631     99999                           False               .   
131633     99999                            True               .   
131637     99999                           False               .   

                        Plant Name               Operator Name Operator Id  \
37                    Browns Ferry  Tennessee Valley Authority       18642   
38                    Brown

In [143]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv'
output_file = 'duplicates_granularity_search.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
duplicate_rows_df_5.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'duplicates_granularity_search.csv' has been created successfully.


### vi) [ERROR - Amended in Second Notebook] Completeness Check: Removal of 9 Rows with Empty Plant Name, Operator Name, Operator Id, and Plant State
(In the second notebook these rows are not removed: the missing years are easily imputed, and the rows still hold potentially useful Prime Mover and Fuel Type information for aggregation.)

In [151]:
blank_plant_names = power_plant_df[power_plant_df['Plant Name'].isna()]  # or .isnull()

count_blank_plant_names = len(blank_plant_names)

print(f"Number of Plant Names that are blank: {count_blank_plant_names}")

# Print the rows with blank Plant Names
print("\nRows with Blank Plant Names:")
print(blank_plant_names)

# Count the number of rows with blank Plant Names
num_blank_rows = len(blank_plant_names)
print(f"\nNumber of Rows with Blank Plant Names: {num_blank_rows}")

Number of Plant Names that are blank: 9

Rows with Blank Plant Names:
        Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id Plant Name  \
42580       8825                           False               .        NaN   
56456       8825                           False               .        NaN   
70849       8825                           False               .        NaN   
85879       8825                           False               .        NaN   
101635      8825                           False               .        NaN   
118015      8825                           False               .        NaN   
130952      8825                           False               .        NaN   
130953      8825                           False               .        NaN   
130954      8825                           False               .        NaN   

       Operator Name Operator Id Plant State Census Region NERC Region  \
42580            NaN           .         NaN           MAT       

In [153]:
# Show the rows where Operator Id holds the '.' placeholder

print(power_plant_df[power_plant_df['Operator Id'] == '.'])

        Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id Plant Name  \
42580       8825                           False               .        NaN   
56456       8825                           False               .        NaN   
70849       8825                           False               .        NaN   
85879       8825                           False               .        NaN   
101635      8825                           False               .        NaN   
118015      8825                           False               .        NaN   
130952      8825                           False               .        NaN   
130953      8825                           False               .        NaN   
130954      8825                           False               .        NaN   

       Operator Name Operator Id Plant State Census Region NERC Region  \
42580            NaN           .         NaN           MAT         NaN   
56456            NaN           .         NaN           MAT   

#### [ERROR cont'd] Removal of Rows with Blank Plant Names

We identified 9 rows in the `power_plant_df` DataFrame where the 'Plant Name' column was blank (`NaN`). On inspection, these rows also had missing values in the 'Operator Name' and 'Plant State' columns and the placeholder '.' in the 'Operator Id' column. Because the rows lack essential identifying information, we removed them from `power_plant_df`.

This removal serves to:

1.  **Clean the Data:** Removing rows with missing or invalid identifying information prevents errors when records are grouped or joined on those identifiers.
2.  **Improve Data Quality:** Dropping these incomplete rows leaves a dataset in which every record has a plant name and an operator name.
3.  **Ensure Consistency:** Because the same 9 rows carry all of these gaps, dropping the rows with `NaN` in 'Plant Name' also removes the `NaN` values in 'Operator Name' and the '.' entries in 'Operator Id'.
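
The claim that the gaps always occur together can be checked directly before anything is dropped. The following is a minimal sketch on invented toy data, assuming the '.' placeholder should be treated as missing; `toy_df`, `id_columns`, and the "Missing Identifiers Flag" column are hypothetical names, not part of the notebook. It also shows flagging as an alternative to deletion, in line with the second notebook's decision to keep these rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for power_plant_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "Plant Id": [8825, 8825, 46],
    "Plant Name": [np.nan, np.nan, "Browns Ferry"],
    "Operator Name": [np.nan, np.nan, "Tennessee Valley Authority"],
    "Operator Id": [".", ".", "18642"],
    "Plant State": [np.nan, np.nan, "AL"],
})

id_columns = ["Plant Name", "Operator Name", "Operator Id", "Plant State"]

# Treat the '.' placeholder as missing so all four columns can be tested alike.
identifiers = toy_df[id_columns]
normalized = identifiers.mask(identifiers == ".")

missing_all = normalized.isna().all(axis=1)  # every identifier is missing
missing_any = normalized.isna().any(axis=1)  # at least one identifier is missing

# If the two masks are identical, the gaps never occur separately.
gaps_co_occur = bool((missing_all == missing_any).all())

# Flag the rows instead of dropping them, so they remain available for aggregation.
toy_df["Missing Identifiers Flag"] = missing_all.astype(int)

print(gaps_co_occur)
print(toy_df[["Plant Id", "Missing Identifiers Flag"]])
```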

In [156]:
blank_plant_names = power_plant_df[power_plant_df['Plant Name'].isna()]

# Get the indices of the rows with blank Plant Names
blank_indices = blank_plant_names.index

# Remove the rows from power_plant_df
power_plant_df = power_plant_df.drop(blank_indices)

# Verification
blank_plant_names_after_removal = power_plant_df[power_plant_df['Plant Name'].isna()]
print(f"Number of rows with blank Plant Names after removal: {len(blank_plant_names_after_removal)}")

# Verification of Operator Name and Operator Id removal
blank_operator_name = power_plant_df[power_plant_df['Operator Name'].isna()]
blank_operator_id = power_plant_df[power_plant_df['Operator Id'] == '.']

print(f"Number of rows with blank operator names after removal: {len(blank_operator_name)}")
print(f"Number of rows with '.' operator IDs after removal: {len(blank_operator_id)}")

Number of rows with blank Plant Names after removal: 0
Number of rows with blank operator names after removal: 0
Number of rows with '.' operator IDs after removal: 0


### vii) Operator Id and Operator Name Consistency Check: Flag Column for Operator Names with Multiple Operator Ids

#### Flagging Operator Names with Multiple Operator Ids

To facilitate analysis and ensure data integrity, especially when aggregating data at the operator level, we have created a new column in the `power_plant_df` DataFrame: **"Conflicting Operator Id Flag"**.

This column serves to flag observations where the 'Operator Name' is associated with more than one unique 'Operator Id'. This situation indicates potential inconsistencies in the data, as it suggests that the same operator may be represented by multiple distinct IDs.

**Purpose of the Flag:**

* **Identification of Potential Issues:** The flag allows for quick identification of rows that require further investigation.
* **Aggregation Accuracy:** When aggregating data at the operator level, the flag can be used to determine how to handle these potentially conflicting records.
* **Analysis and Reporting:** The flag provides a convenient way to filter and analyze these potentially problematic entries.
* **Data Cleaning and Correction:** The flag allows for easy identification of records that may require data cleaning or correction.

By adding this flag, we can more easily manage and address these inconsistencies, improving the reliability and accuracy of our analysis.
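
The flag described above can also be derived directly from the data rather than from a hand-typed list of names. This is a minimal sketch on toy data, not the notebook's own cell; `toy_df` is a hypothetical stand-in for `power_plant_df`, and the Operator Id values are invented.

```python
import pandas as pd

# Toy stand-in for power_plant_df; the Operator Id values are invented.
toy_df = pd.DataFrame({
    "Operator Name": ["Tesla Inc.", "Tesla Inc.", "Zion Energy LLC", "Zion Energy LLC"],
    "Operator Id": ["57313", "60000", "12345", "12345"],
})

# For every row, count the distinct Operator Ids that share its Operator Name.
ids_per_name = toy_df.groupby("Operator Name")["Operator Id"].transform("nunique")

# 1 where the name maps to more than one Operator Id, otherwise 0.
toy_df["Conflicting Operator Id Flag"] = (ids_per_name > 1).astype(int)

print(toy_df)
```

Because `transform` returns a result aligned with the original rows, the flag stays in sync with the data if rows are added or removed later. Note that `nunique` ignores missing values by default, so rows whose Operator Id is `NaN` do not count toward a conflict.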

In [160]:
# Group by Operator Id and check unique Operator Names
operator_id_name_counts = power_plant_df.groupby('Operator Id')['Operator Name'].nunique()

# Find Operator Ids with more than one unique Operator Name
conflicting_operator_ids = operator_id_name_counts[operator_id_name_counts > 1]

# Get the rows for the conflicting Operator Ids
conflicting_rows_list = []
for operator_id in conflicting_operator_ids.index:
    conflicting_rows = power_plant_df[power_plant_df['Operator Id'] == operator_id]
    conflicting_rows_list.append(conflicting_rows)

# Concatenate the DataFrames into a single DataFrame
conflicting_rows_df = pd.concat(conflicting_rows_list, ignore_index=True)

# Print the number of conflicting rows
print(f"Number of Conflicting Rows: {len(conflicting_rows_df)}")

# Now conflicting_rows_df contains all the rows with conflicting Operator Ids.
# You can query this DataFrame as needed.

# Example:
# print(conflicting_rows_df[['Operator Name', 'Operator Id', 'Plant Name', 'Plant State']])

Number of Conflicting Rows: 10648


In [164]:
conflicting_rows_df[['Plant Id', 'Plant Name', 'Plant State', 'Operator Name', 'Operator Id', 'YEAR']].head(20)

Unnamed: 0,Plant Id,Plant Name,Plant State,Operator Name,Operator Id,YEAR
0,62,Annex Creek,AK,Alaska Electric Light&Power Co,213,2015.0
1,63,Gold Creek,AK,Alaska Electric Light&Power Co,213,2015.0
2,63,Gold Creek,AK,Alaska Electric Light&Power Co,213,2015.0
3,64,Lemon Creek,AK,Alaska Electric Light&Power Co,213,2015.0
4,64,Lemon Creek,AK,Alaska Electric Light&Power Co,213,2015.0
5,65,Salmon Creek 1,AK,Alaska Electric Light&Power Co,213,2015.0
6,78,Snettisham,AK,Alaska Electric Light&Power Co,213,2015.0
7,7250,Auke Bay,AK,Alaska Electric Light&Power Co,213,2015.0
8,7250,Auke Bay,AK,Alaska Electric Light&Power Co,213,2015.0
9,57085,Lake Dorothy Hydroelectric Project,AK,Alaska Electric Light&Power Co,213,2015.0


In [170]:
print(conflicting_rows_df[conflicting_rows_df['Operator Id'] == 213])

    Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id      Plant Name  \
0         62                           False               .     Annex Creek   
1         63                           False               .      Gold Creek   
2         63                           False               .      Gold Creek   
3         64                           False               .     Lemon Creek   
4         64                           False               .     Lemon Creek   
..       ...                             ...             ...             ...   
62        63                           False               .      Gold Creek   
63        64                           False               .     Lemon Creek   
64        64                           False               .     Lemon Creek   
65        65                           False               .  Salmon Creek 1   
66        78                           False               .      Snettisham   

                        Operator Name O

In [174]:
# To explore further in Excel
conflicting_rows_df['Operator Id'].value_counts(dropna=False)

Operator Id
14328    548
61944    434
15399    328
57280    256
6455     201
        ... 
57408      2
61801      2
57402      2
56996      2
61701      2
Name: count, Length: 504, dtype: int64

In [176]:
# Group by Operator Name and check unique Operator Ids
operator_name_id_counts = power_plant_df.groupby('Operator Name')['Operator Id'].nunique()

# Find Operator Names with more than one unique Operator Id
conflicting_operator_names = operator_name_id_counts[operator_name_id_counts > 1]

# Print the Operator Names and their counts
print("Operator Names with Multiple Operator Ids:")
print(conflicting_operator_names)

Operator Names with Multiple Operator Ids:
Operator Name
0HAM WHAM8 Solar, LLC              2
1001 Ebenezer Church Solar, LLC    2
1008 Matthews Solar, LLC           2
1009 Yadkin Solar, LLC             2
1034 Catherine Lake Solar, LLC     2
                                  ..
Zion Energy LLC                    2
Zotos International                2
Zumbro Garden LLC                  2
Zumbro Solar LLC                   2
esVolta LP                         2
Name: Operator Id, Length: 5738, dtype: int64


In [184]:
print(power_plant_df['Operator Id'].dtypes)

object


In [186]:
power_plant_df['Operator Id'] = power_plant_df['Operator Id'].str.strip()
# Then re-run the check for conflicting Operator Names
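
# Caution when using `.str` methods on an `object` column

The 'Operator Id' column has dtype `object`, which can hold a mix of integers and strings. In that case the `.str` accessor returns `NaN` for every element that is not a string. Whether this column is actually mixed has not been verified here; the sketch below only demonstrates the pandas behavior on a hypothetical series (`mixed_ids`) and one possible way to normalize such a column.

```python
import pandas as pd

# Hypothetical mixed-type id column: one integer, two strings, one '.' placeholder.
mixed_ids = pd.Series([213, "213", " 213 ", "."], dtype="object")

# .str methods return NaN for elements that are not strings,
# so the integer 213 is silently turned into a missing value.
stripped_only = mixed_ids.str.strip()

# Casting to str first keeps the integer entry as the string '213'.
normalized = mixed_ids.astype(str).str.strip()

print(stripped_only.isna().sum())  # number of ids lost to NaN
print(normalized.nunique())        # distinct ids after normalization
```

If the real column is mixed, a sharp drop in conflicting names after `.str.strip()` may partly reflect ids becoming `NaN` rather than whitespace being cleaned, since `nunique` skips missing values. Comparing `power_plant_df['Operator Id'].isna().sum()` before and after the strip would confirm or rule this out.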

In [188]:
# Group by Operator Name and check unique Operator Ids
operator_name_id_counts_2 = power_plant_df.groupby('Operator Name')['Operator Id'].nunique()

# Find Operator Names with more than one unique Operator Id
conflicting_operator_names_2 = operator_name_id_counts_2[operator_name_id_counts_2 > 1]

# Print the Operator Names and their counts
print("Operator Names with Multiple Operator Ids:")
print(conflicting_operator_names_2)

Operator Names with Multiple Operator Ids:
Operator Name
Boise White Paper LLC                   2
Buckeye Power, Inc                      2
Cascade Solar LLC                       2
Cleveland Cliffs                        2
Domtar Industries Inc                   2
Domtar Paper Company LLC                2
Dow Chemical Co                         2
Evergy Missouri West                    2
ExxonMobil Oil Corp                     2
Formosa Plastics Corp                   2
International Paper Co                  2
Merck & Co Inc                          2
North American Energy Services          2
Phillips 66                             2
Phillips 66 Company                     2
State Farm Mutual Auto Ins Co           2
Tate & Lyle Ingredients Americas Inc    2
Tesla Inc.                              2
UGI Energy Services, LLC                2
Name: Operator Id, dtype: int64


In [190]:
power_plant_df['Operator Id'] = power_plant_df['Operator Id'].str.lower()
# Then re-run the check for conflicting Operator Names

In [192]:
# Group by Operator Name and check unique Operator Ids
operator_name_id_counts_3 = power_plant_df.groupby('Operator Name')['Operator Id'].nunique()

# Find Operator Names with more than one unique Operator Id
conflicting_operator_names_3 = operator_name_id_counts_3[operator_name_id_counts_3 > 1]

# Print the Operator Names and their counts
print("Operator Names with Multiple Operator Ids:")
print(conflicting_operator_names_3)

Operator Names with Multiple Operator Ids:
Operator Name
Boise White Paper LLC                   2
Buckeye Power, Inc                      2
Cascade Solar LLC                       2
Cleveland Cliffs                        2
Domtar Industries Inc                   2
Domtar Paper Company LLC                2
Dow Chemical Co                         2
Evergy Missouri West                    2
ExxonMobil Oil Corp                     2
Formosa Plastics Corp                   2
International Paper Co                  2
Merck & Co Inc                          2
North American Energy Services          2
Phillips 66                             2
Phillips 66 Company                     2
State Farm Mutual Auto Ins Co           2
Tate & Lyle Ingredients Americas Inc    2
Tesla Inc.                              2
UGI Energy Services, LLC                2
Name: Operator Id, dtype: int64


In [194]:
# Group by Operator Name and check unique Operator Ids
operator_name_id_counts = power_plant_df.groupby('Operator Name')['Operator Id'].nunique()

# Find Operator Names with more than one unique Operator Id
conflicting_operator_names = operator_name_id_counts[operator_name_id_counts > 1]

# Create a DataFrame from the conflicting Operator Names Series
conflicting_names_df = pd.DataFrame(conflicting_operator_names)

# Rename the column for clarity
conflicting_names_df.rename(columns={'Operator Id': 'Unique Operator Id Count'}, inplace=True)

# Print the DataFrame
print(conflicting_names_df)

# You can now query conflicting_names_df
# Example:
# print(conflicting_names_df.loc['Your Operator Name Here'])

                                      Unique Operator Id Count
Operator Name                                                 
Boise White Paper LLC                                        2
Buckeye Power, Inc                                           2
Cascade Solar LLC                                            2
Cleveland Cliffs                                             2
Domtar Industries Inc                                        2
Domtar Paper Company LLC                                     2
Dow Chemical Co                                              2
Evergy Missouri West                                         2
ExxonMobil Oil Corp                                          2
Formosa Plastics Corp                                        2
International Paper Co                                       2
Merck & Co Inc                                               2
North American Energy Services                               2
Phillips 66                                            

In [196]:
conflicting_names_df.shape

(19, 1)

In [202]:
dot_operator_ids = power_plant_df[power_plant_df['Operator Id'] == '.']

value_counts = dot_operator_ids.value_counts()

print(value_counts)

Series([], Name: count, dtype: int64)


In [204]:
# Operator Names with multiple Operator Ids (from the previous result)
conflicting_operator_names = [
    'Boise White Paper LLC', 'Buckeye Power, Inc', 'Cascade Solar LLC',
    'Cleveland Cliffs', 'Domtar Industries Inc', 'Domtar Paper Company LLC',
    'Dow Chemical Co', 'Evergy Missouri West', 'ExxonMobil Oil Corp',
    'Formosa Plastics Corp', 'International Paper Co', 'Merck & Co Inc',
    'North American Energy Services', 'Phillips 66', 'Phillips 66 Company',
    'State Farm Mutual Auto Ins Co', 'Tate & Lyle Ingredients Americas Inc',
    'Tesla Inc.', 'UGI Energy Services, LLC'
]

# Create a flag column
power_plant_df['Conflicting Operator Id Flag'] = power_plant_df['Operator Name'].apply(
    lambda x: 1 if x in conflicting_operator_names else 0
)

# Verify the flag
conflicting_flag_check = power_plant_df[power_plant_df['Conflicting Operator Id Flag'] == 1]

print(conflicting_flag_check[['Operator Name', 'Operator Id', 'Conflicting Operator Id Flag']])
print(f"\nNumber of flagged rows: {len(conflicting_flag_check)}")

                Operator Name Operator Id  Conflicting Operator Id Flag
5060           Merck & Co Inc         NaN                             1
5061           Merck & Co Inc         NaN                             1
5298    Boise White Paper LLC         NaN                             1
5299    Boise White Paper LLC         NaN                             1
5300    Boise White Paper LLC         NaN                             1
...                       ...         ...                           ...
129356             Tesla Inc.       57313                             1
129921      Cascade Solar LLC       65189                             1
130184             Tesla Inc.       57313                             1
130185             Tesla Inc.       57313                             1
130186             Tesla Inc.       57313                             1

[1670 rows x 3 columns]

Number of flagged rows: 1670


### viii) Imputation of Missing NERC Region Values Using Operator Id

We observed missing values (NaNs) in the 'NERC Region' column of the `power_plant_df` DataFrame. To address these missing values, we employed an imputation strategy that leverages the 'Operator Id' column.

**Rationale:**

We found that a single state can fall into more than one NERC Region, depending on the NERC mapping used, which makes state-based imputation unreliable. An 'Operator Id', by contrast, is usually associated with a single NERC Region.

**Methodology:**

We created a mapping between 'Operator Id' and 'NERC Region' by grouping the data by 'Operator Id' and extracting the first non-null 'NERC Region' value for each operator. This mapping was then used to impute the missing 'NERC Region' values.

**Benefits:**

* **Improved Data Accuracy:** Imputing from 'Operator Id' is expected to reflect NERC Region associations more accurately than state-based imputation would.
* **Reduced Missing Values:** The imputation lowers the number of missing 'NERC Region' values from 5,179 to 3,147.
* **Enhanced Data Consistency:** Because the fill value comes from the operator's own records, imputed NERC Regions agree with the region already recorded for that operator.
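
The methodology can be sketched in vectorized form as follows. This is an illustration on invented toy data, not the notebook's own cell; `toy_df`, `operator_to_region`, and `ambiguous_operators` are hypothetical names. It first checks the assumption that each operator maps to a single region, then fills the gaps with `map` and `fillna`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for power_plant_df; the id and region pairings are invented.
toy_df = pd.DataFrame({
    "Operator Id": ["18642", "18642", "213", "213", "99999"],
    "NERC Region": ["SERC", np.nan, "ASCC", np.nan, np.nan],
})

# Rows where the region is already known.
known = toy_df.dropna(subset=["NERC Region"])

# Sanity check: operators linked to more than one region would make
# "first non-null value" an arbitrary choice.
regions_per_operator = known.groupby("Operator Id")["NERC Region"].nunique()
ambiguous_operators = regions_per_operator[regions_per_operator > 1].index.tolist()

# First known region per operator, used as the lookup table.
operator_to_region = known.groupby("Operator Id")["NERC Region"].first()

# Fill only the missing regions; operators absent from the lookup stay NaN.
toy_df["NERC Region"] = toy_df["NERC Region"].fillna(
    toy_df["Operator Id"].map(operator_to_region)
)

print(ambiguous_operators)
print(toy_df)
```

Operators that never report a region, such as the placeholder id in this toy frame, cannot be imputed this way and remain missing.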

In [209]:
power_plant_df['NERC Region'].value_counts(dropna=False)

NERC Region
WECC    30184
SERC    26316
RFC     21253
NPCC    18401
MRO     18002
TRE      5851
NaN      5179
SPP      2915
FRCC     2048
ASCC     1012
HICC      462
Name: count, dtype: int64

In [211]:
# Create a dictionary to map Operator Id to NERC Region
operator_id_nerc_map = power_plant_df.dropna(subset=['NERC Region']).groupby('Operator Id')['NERC Region'].first().to_dict()

# Impute NaN values in 'NERC Region' based on Operator Id
power_plant_df['NERC Region'] = power_plant_df.apply(
    lambda row: operator_id_nerc_map.get(row['Operator Id']) if pd.isna(row['NERC Region']) else row['NERC Region'],
    axis=1
)

# Check the remaining NaN values
remaining_nan_nerc = power_plant_df['NERC Region'].isna().sum()
print(f"Remaining NaN values in NERC Region: {remaining_nan_nerc}")

# Verify the changes
nerc_region_counts_post_imputation = power_plant_df['NERC Region'].value_counts(dropna=False)
print("\nNERC Region counts post-imputation:")
print(nerc_region_counts_post_imputation)

Remaining NaN values in NERC Region: 3147

NERC Region counts post-imputation:
NERC Region
WECC    30368
SERC    26560
RFC     21528
NPCC    18613
MRO     18080
TRE      5867
None     3147
SPP      2919
FRCC     2139
ASCC     1644
HICC      758
Name: count, dtype: int64


In [226]:
# Find the rows where 'NERC Region' is still NaN
mysterious_nan_df = power_plant_df[power_plant_df['NERC Region'].isna()]

# Print the DataFrame
print("Rows with Remaining NaN in NERC Region:")
print(mysterious_nan_df)

# If you want to know how many rows there are
num_mysterious_nan = len(mysterious_nan_df)
print(f"\nNumber of rows: {num_mysterious_nan}")

# You can now query mysterious_nan_df to investigate further
# Example:
# print(mysterious_nan_df['Operator Id'].value_counts())

Rows with Remaining NaN in NERC Region:
        Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id  \
69            66                           False               .   
70            66                           False               .   
10744      58277                           False               .   
10838      58380                           False               .   
10839      58380                           False               .   
...          ...                             ...             ...   
131633     99999                            True               .   
131634     99999                            True               .   
131635     99999                           False               .   
131636     99999                           False               .   
131637     99999                           False               .   

                           Plant Name                          Operator Name  \
69                            Skagway          Alaska Power and

In [216]:
mysterious_nan_df['Plant State'].value_counts(dropna=False)

Plant State
AK    375
HI    289
CA    250
NY    225
TX    192
MA    122
FL    121
PA     93
NJ     92
NC     86
IL     69
MD     68
CT     65
WI     61
LA     58
MN     57
MI     54
CO     53
OH     52
VA     52
IN     48
ME     47
GA     44
SC     44
MO     42
RI     40
TN     40
IA     37
AR     35
AZ     31
WA     30
AL     30
KY     29
KS     25
MS     24
NM     21
NE     20
NV     19
OR     17
OK     17
UT     14
SD     11
VT      8
ID      8
WV      7
ND      6
WY      6
NH      6
DC      5
DE      2
Name: count, dtype: int64

In [218]:
mysterious_nan_df['Census Region'].value_counts(dropna=False)

Census Region
PACN    664
SAT     429
MAT     410
WSC     302
PACC    297
NEW     288
ENC     284
WNC     198
MTN     152
ESC     123
Name: count, dtype: int64

In [228]:
nerc_map = {
    'WECC': 'Western Electricity Coordinating Council',
    'SERC': 'SERC Reliability Corporation',
    'RFC': 'ReliabilityFirst Corporation',
    'NPCC': 'Northeast Power Coordinating Council',
    'MRO': 'Midwest Reliability Organization',
    'TRE': 'Texas Reliability Entity',
    'SPP': 'Southwest Power Pool',
    'FRCC': 'Florida Reliability Coordinating Council',
    'ASCC': 'Alaska Systems Coordinating Council',
    'HICC': 'Hawaii Island Coordinating Council'
}

power_plant_df['NERC Region Full Name'] = power_plant_df['NERC Region'].map(nerc_map)

# Missing NERC Region values have no entry in nerc_map, so they map to NaN.
print(power_plant_df[['NERC Region', 'NERC Region Full Name']].head())

  NERC Region         NERC Region Full Name
0        SERC  SERC Reliability Corporation
1        SERC  SERC Reliability Corporation
2        SERC  SERC Reliability Corporation
3        SERC  SERC Reliability Corporation
4        SERC  SERC Reliability Corporation


In [232]:
power_plant_df['NERC Region Full Name'].value_counts(dropna=False)

NERC Region Full Name
Western Electricity Coordinating Council    30368
SERC Reliability Corporation                26560
ReliabilityFirst Corporation                21528
Northeast Power Coordinating Council        18613
Midwest Reliability Organization            18080
Texas Reliability Entity                     5867
NaN                                          3147
Southwest Power Pool                         2919
Florida Reliability Coordinating Council     2139
Alaska Systems Coordinating Council          1644
Hawaii Island Coordinating Council            758
Name: count, dtype: int64

In [234]:
power_plant_df['NERC Region'].value_counts(dropna=False)

NERC Region
WECC    30368
SERC    26560
RFC     21528
NPCC    18613
MRO     18080
TRE      5867
None     3147
SPP      2919
FRCC     2139
ASCC     1644
HICC      758
Name: count, dtype: int64

In [237]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv'
output_file = 'power_plant_df_inmediares.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
power_plant_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'power_plant_df_inmediares.csv' has been created successfully.


In [241]:
# Count unique Operator Names
unique_operator_names = mysterious_nan_df['Operator Name'].nunique()
print(f"Number of unique Operator Names: {unique_operator_names}")

# Count unique Operator Ids
unique_operator_ids = mysterious_nan_df['Operator Id'].nunique()
print(f"Number of unique Operator Ids: {unique_operator_ids}")

# Get the value counts of the Operator Names
print("\nOperator Name Value Counts:")
print(mysterious_nan_df['Operator Name'].value_counts())

# Get the value counts of the Operator IDs
print("\nOperator Id Value Counts:")
print(mysterious_nan_df['Operator Id'].value_counts())

Number of unique Operator Names: 445
Number of unique Operator Ids: 165

Operator Name Value Counts:
Operator Name
State-Fuel Level Increment       1729
Alaska Village Elec Coop, Inc      90
Walmart Stores Texas, LLC          49
RRI Energy Services, LLC           43
Alaska Power and Telephone Co      40
                                 ... 
Webster Solar, LLC                  1
RT405 Westerlo Solar 2, LLC         1
Woodland Avenue Solar 1, LLC        1
Griffin Road Solar 2, LLC           1
Cypress Creek Renewables            1
Name: count, Length: 445, dtype: int64

Operator Id Value Counts:
Operator Id
99999    619
39004     43
65869      9
11824      4
58873      4
        ... 
66159      1
66157      1
63804      1
59898      1
66295      1
Name: count, Length: 165, dtype: int64


In [243]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv'
output_file = 'mysterious_nan_df.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
mysterious_nan_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'mysterious_nan_df.csv' has been created successfully.


### ix) Handling "State-Fuel Level Increment" Entries

Within the `mysterious_nan_df` DataFrame (NERC Regions still NaN post imputation by Operator Id), the entries associated with the "State-Fuel Level Increment" Operator Name represent a unique category that warrants special consideration. These entries do not correspond to traditional power plants but rather to a protocol for recording fuel consumption adjustments at the state level. While the exact nature of these adjustments requires further investigation, they are believed to be related to state-level reporting and corrections.

**Rationale for Retention:**

Despite their non-plant-specific nature, these entries contain valuable information relevant to the study's objectives. They maintain 100% completeness in the year-to-date metrics, as well as crucial data regarding prime movers and reported fuel types. Therefore, excluding them would result in the loss of potentially significant insights into fuel consumption trends and technology utilization.

**Handling Non-NERC Relationship:**

Given that these entries represent state-level adjustments rather than plant-level operations, they do not align with traditional NERC (North American Electric Reliability Corporation) region assignments. To address this, we have taken the following steps:

* **Operator ID:** We have consistently set the 'Operator Id' for these entries to `99999`, aligning with their NAICS code, which is a placeholder for unclassified industries.
* **NERC Region Full Name:** We have assigned the value "N/A" to the 'NERC Region Full Name' column for these entries, explicitly indicating their non-applicability to NERC region categorization.
* **Flagging:** We have created a flag column, "State Fuel Adjustment Flag", to easily filter and identify these rows for future analysis.

By retaining these entries and explicitly flagging their non-NERC relationship, we preserve valuable data while ensuring clarity and accuracy in our analysis.
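
The three steps above could look roughly like this. It is a minimal sketch on toy data rather than the notebook's actual cell; `toy_df` is a hypothetical stand-in for `power_plant_df`, and storing the id as the string `"99999"` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for power_plant_df; values are invented for illustration.
toy_df = pd.DataFrame({
    "Operator Name": ["State-Fuel Level Increment", "Tennessee Valley Authority"],
    "Operator Id": [np.nan, "18642"],
    "NERC Region Full Name": [np.nan, "SERC Reliability Corporation"],
})

# Boolean mask selecting the state-level adjustment rows.
is_increment = toy_df["Operator Name"] == "State-Fuel Level Increment"

# Step 1: set a consistent placeholder Operator Id (stored as a string here).
toy_df.loc[is_increment, "Operator Id"] = "99999"

# Step 2: mark the NERC region as not applicable.
toy_df.loc[is_increment, "NERC Region Full Name"] = "N/A"

# Step 3: flag the rows so they are easy to filter later.
toy_df["State Fuel Adjustment Flag"] = is_increment.astype(int)

print(toy_df)
```

One caveat when the frame is later written to CSV and re-imported: `pd.read_csv` treats the string `N/A` as a missing value by default, so the label would come back as `NaN` unless `keep_default_na=False` (or a different label) is used.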

In [249]:
# Filter for 'State-Fuel Level Increment' rows
state_fuel_increment_df = mysterious_nan_df[mysterious_nan_df['Operator Name'] == 'State-Fuel Level Increment']

# Calculate missing values per column
missing_values = state_fuel_increment_df.isnull().sum()

# Calculate the percentage of missing values per column
percentage_missing = (missing_values / len(state_fuel_increment_df)) * 100

# Create a DataFrame to display the results
missing_data_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage Missing': percentage_missing
})

# Print the DataFrame
print(missing_data_df.to_string())

                                     Missing Values  Percentage Missing
Plant Id                                          0            0.000000
Combined Heat And\nPower Plant                    0            0.000000
Nuclear Unit Id                                   0            0.000000
Plant Name                                        0            0.000000
Operator Name                                     0            0.000000
Operator Id                                    1110           64.198959
Plant State                                       0            0.000000
Census Region                                     0            0.000000
NERC Region                                    1729          100.000000
NAICS Code                                        0            0.000000
EIA Sector Number                                 0            0.000000
Sector Name                                       0            0.000000
Reported\nPrime Mover                             0            0

### Removing the State-Fuel Level Increment entries would bias our study of electricity generation technologies (prime mover + fuel type).

In [255]:
# Calculate value counts for Reported\nPrime Mover and Reported\nFuel Type Code in power_plant_df
original_prime_mover_counts = power_plant_df['Reported\nPrime Mover'].value_counts()
original_fuel_type_counts = power_plant_df['Reported\nFuel Type Code'].value_counts()

# Calculate value counts for Reported\nPrime Mover and Reported\nFuel Type Code in state_fuel_increment_df
increment_prime_mover_counts = state_fuel_increment_df['Reported\nPrime Mover'].value_counts()
increment_fuel_type_counts = state_fuel_increment_df['Reported\nFuel Type Code'].value_counts()

# Calculate the impact (how many of each value would be removed)
removed_prime_mover = increment_prime_mover_counts
removed_fuel_type = increment_fuel_type_counts

# Calculate the percentage impact on power_plant_df
prime_mover_impact = (removed_prime_mover / original_prime_mover_counts) * 100
fuel_type_impact = (removed_fuel_type / original_fuel_type_counts) * 100

# Create DataFrames to display the results
prime_mover_impact_df = pd.DataFrame({
    'Removed Count': removed_prime_mover,
    'Percentage Impact': prime_mover_impact
}).fillna(0)

fuel_type_impact_df = pd.DataFrame({
    'Removed Count': removed_fuel_type,
    'Percentage Impact': fuel_type_impact
}).fillna(0)

# Print the results
print("Impact on Reported\nPrime Mover:")
print(prime_mover_impact_df)

print("\nImpact on Reported\nFuel Type Code:")
print(fuel_type_impact_df)

Impact on Reported
Prime Mover:
                       Removed Count  Percentage Impact
Reported\nPrime Mover                                  
BA                              95.0           5.325112
BT                               0.0           0.000000
CA                             147.0           2.101802
CE                               0.0           0.000000
CP                               0.0           0.000000
CS                               4.0           1.398601
CT                             141.0           1.910310
FC                              21.0           2.046784
FW                               1.0           2.941176
GT                             207.0           1.356843
HY                              26.0           0.202303
IC                             220.0           1.306724
OT                               4.0           2.094241
PS                               2.0           0.550964
PV                             148.0           0.481363
ST              

In [257]:
# Filter for 'State-Fuel Level Increment' rows
state_fuel_increment_mask = power_plant_df['Operator Name'] == 'State-Fuel Level Increment'

# Set Operator Id to 99999
power_plant_df.loc[state_fuel_increment_mask, 'Operator Id'] = 99999

# Set NERC Region Full Name to 'N/A'
power_plant_df.loc[state_fuel_increment_mask, 'NERC Region Full Name'] = 'N/A'

# Verify the changes
print(power_plant_df[state_fuel_increment_mask][['Operator Name', 'Operator Id', 'NERC Region Full Name']].head())

                    Operator Name Operator Id NERC Region Full Name
51760  State-Fuel Level Increment       99999                   N/A
51761  State-Fuel Level Increment       99999                   N/A
51762  State-Fuel Level Increment       99999                   N/A
51763  State-Fuel Level Increment       99999                   N/A
51764  State-Fuel Level Increment       99999                   N/A


In [264]:
power_plant_df['NERC Region Full Name'].value_counts(dropna=False)

NERC Region Full Name
Western Electricity Coordinating Council    30368
SERC Reliability Corporation                26560
ReliabilityFirst Corporation                21528
Northeast Power Coordinating Council        18613
Midwest Reliability Organization            18080
Texas Reliability Entity                     5867
Southwest Power Pool                         2919
Florida Reliability Coordinating Council     2139
N/A                                          1729
Alaska Systems Coordinating Council          1644
NaN                                          1418
Hawaii Island Coordinating Council            758
Name: count, dtype: int64

## Handling Remaining NaN Values in NERC Region

After imputing NERC region values from Operator ID where possible and marking the State-Fuel Level Increment rows as "N/A", 1,418 rows still have no NERC region. Given the complexity of NERC region assignments and the fact that a single state can span more than one region, we have chosen to leave these remaining values as NaN for the time being.

**Rationale:**

* **Preservation of Data Integrity:** Imputing potentially inaccurate NERC region values could introduce significant bias into our analysis.
* **Focus on Core Predictors:** Our primary focus is on analyzing the relationship between prime movers, fuel types, and fuel efficiency. NERC regions, while potentially relevant, are not direct predictors of fuel efficiency.
* **Feature Importance Assessment:** We plan to assess the importance of NERC region as a feature during the machine learning phase of our project. This will allow us to determine whether NERC region provides significant predictive power. If it does, we may revisit our imputation strategy and explore more nuanced approaches, such as creating a column that captures multiple NERC region assignments per state.

By deferring a more complex imputation strategy until the feature importance assessment, we prioritize data integrity and avoid introducing potentially misleading information into our initial analysis.
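
To illustrate why state alone is not a safe basis for imputation, the sketch below counts the distinct NERC regions observed per state and builds the kind of multi-assignment column mentioned above. The rows in `demo_df` are hypothetical toy values, not figures taken from `power_plant_df`.

```python
import pandas as pd

# Hypothetical toy rows; real states and regions would come from power_plant_df
demo_df = pd.DataFrame({
    'Plant State': ['TX', 'TX', 'TX', 'FL', 'FL', 'HI'],
    'NERC Region': ['TRE', 'SPP', 'TRE', 'FRCC', 'SERC', None],
})

# Number of distinct, non-missing regions observed in each state
region_counts = demo_df.groupby('Plant State')['NERC Region'].nunique()

# States with more than one region cannot be imputed from state alone
ambiguous_states = region_counts[region_counts > 1]

# One string per state listing every region seen there
regions_per_state = demo_df.groupby('Plant State')['NERC Region'].agg(
    lambda s: ', '.join(sorted(s.dropna().unique()))
)

print(ambiguous_states)
print(regions_per_state)
```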

In [267]:
# Keep remaining missing values in 'NERC Region' as missing (this fillna does not assign a region)
power_plant_df['NERC Region'] = power_plant_df['NERC Region'].fillna(pd.NA)

# Keep remaining missing values in 'NERC Region Full Name' as missing
power_plant_df['NERC Region Full Name'] = power_plant_df['NERC Region Full Name'].fillna(pd.NA)

# Verify the changes
print(power_plant_df[['NERC Region', 'NERC Region Full Name']].isna().sum())
print(power_plant_df[power_plant_df['NERC Region Full Name'] == 'N/A'].head())

NERC Region              3147
NERC Region Full Name    1418
dtype: int64
       Plant Id  Combined Heat And\nPower Plant Nuclear Unit Id  \
51760     99999                           False               .   
51761     99999                           False               .   
51762     99999                           False               .   
51763     99999                           False               .   
51764     99999                           False               .   

                       Plant Name               Operator Name Operator Id  \
51760  State-Fuel Level Increment  State-Fuel Level Increment       99999   
51761  State-Fuel Level Increment  State-Fuel Level Increment       99999   
51762  State-Fuel Level Increment  State-Fuel Level Increment       99999   
51763  State-Fuel Level Increment  State-Fuel Level Increment       99999   
51764  State-Fuel Level Increment  State-Fuel Level Increment       99999   

      Plant State Census Region NERC Region  NAICS Code  ... 

## x) New Columns: Prime Mover Full Name; Fuel Type Full Name

In [269]:
# Mapping dictionary for Prime Mover codes to full names
prime_mover_map = {
    'BA': 'Energy Storage, Battery',
    'BT': 'Turbines Used in a Binary Cycle (Geothermal)',
    'CA': 'Combined-Cycle -- Steam Part',
    'CE': 'Energy Storage, Compressed Air',
    'CP': 'Energy Storage, Concentrated Solar Power',
    'CS': 'Combined-Cycle Single-Shaft',
    'CT': 'Combined-Cycle Combustion Turbine Part',
    'ES': 'Energy Storage, Other',
    'FC': 'Fuel Cell',
    'FW': 'Energy Storage, Flywheel',
    'GT': 'Combustion (Gas) Turbine',
    'HA': 'Hydrokinetic, Axial Flow Turbine',
    'HB': 'Hydrokinetic, Wave Buoy',
    'HK': 'Hydrokinetic, Other',
    'HY': 'Hydraulic Turbine',
    'IC': 'Internal Combustion Engine',
    'PS': 'Energy Storage, Reversible Hydraulic Turbine',
    'OT': 'Other',
    'ST': 'Steam Turbine',
    'PV': 'Photovoltaic',
    'WT': 'Wind Turbine, Onshore',
    'WS': 'Wind Turbine, Offshore'
}

# Create the Prime Mover Full Name column
power_plant_df['Prime Mover Full Name'] = power_plant_df['Reported\nPrime Mover'].map(prime_mover_map)

# Verify the changes
print(power_plant_df[['Reported\nPrime Mover', 'Prime Mover Full Name']].head(10))

  Reported\nPrime Mover                   Prime Mover Full Name
0                    HY                       Hydraulic Turbine
1                    CA            Combined-Cycle -- Steam Part
2                    CT  Combined-Cycle Combustion Turbine Part
3                    ST                           Steam Turbine
4                    ST                           Steam Turbine
5                    HY                       Hydraulic Turbine
6                    ST                           Steam Turbine
7                    ST                           Steam Turbine
8                    ST                           Steam Turbine
9                    ST                           Steam Turbine


In [271]:
# Mapping dictionary for Fuel Type codes to full names
fuel_type_map = {
    'AB': 'Agricultural By-Products',
    'ANT': 'Anthracite Coal',
    'BFG': 'Blast Furnace Gas',
    'BIT': 'Bituminous Coal',
    'BLQ': 'Black Liquor',
    'DFO': 'Distillate Fuel Oil',
    'GEO': 'Geothermal',
    'H2': 'Hydrogen',
    'JF': 'Jet Fuel',
    'KER': 'Kerosene',
    'LFG': 'Landfill Gas',
    'LIG': 'Lignite Coal',
    'MSB': 'Biogenic Municipal Solid Waste',
    'MSN': 'Non-biogenic Municipal Solid Waste',
    'MWH': 'Electricity used for energy storage',
    'NG': 'Natural Gas',
    'NUC': 'Nuclear',
    'OBG': 'Other Biomass Gas',
    'OBL': 'Other Biomass Liquids',
    'OBS': 'Other Biomass Solids',
    'OG': 'Other Gas',
    'OTH': 'Other Fuel',
    'PC': 'Petroleum Coke',
    'PGG': 'Gaseous Propane',
    'PUR': 'Purchased Steam',
    'RC': 'Refined Coal',
    'RFO': 'Residual Fuel Oil',
    'SCC': 'Coal-based Synfuel',
    'SGC': 'Coal-Derived Synthesis Gas',
    'SGP': 'Synthesis Gas from Petroleum Coke',
    'SLW': 'Sludge Waste',
    'SUB': 'Subbituminous Coal',
    'SUN': 'Solar',
    'TDF': 'Tire-derived Fuels',
    'WAT': 'Water',
    'WC': 'Waste/Other Coal',
    'WDL': 'Wood Waste Liquids',
    'WDS': 'Wood/Wood Waste Solids',
    'WH': 'Waste Heat',
    'WND': 'Wind',
    'WOW': 'Waste/Other Oil'
}

# Create the Fuel Type Full Name column
power_plant_df['Fuel Type Full Name'] = power_plant_df['Reported\nFuel Type Code'].map(fuel_type_map)

# Verify the changes
print(power_plant_df[['Reported\nFuel Type Code', 'Fuel Type Full Name']].head(10))

  Reported\nFuel Type Code     Fuel Type Full Name
0                      WAT                   Water
1                       NG             Natural Gas
2                       NG             Natural Gas
3                      BIT         Bituminous Coal
4                       NG             Natural Gas
5                      WAT                   Water
6                      BIT         Bituminous Coal
7                       NG             Natural Gas
8                      WDS  Wood/Wood Waste Solids
9                      BIT         Bituminous Coal
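
Because `Series.map` returns NaN for any key missing from the dictionary, `head(10)` alone cannot confirm that every code was translated. The following is a small sketch of a coverage check. The Series `codes`, the dictionary `demo_map`, and the code `'ZZ'` are hypothetical stand-ins for the real column and mapping.

```python
import pandas as pd

# Hypothetical codes; 'ZZ' stands in for a code missing from the dictionary
codes = pd.Series(['HY', 'ST', 'ZZ', 'HY', None], name='Reported\nPrime Mover')
demo_map = {'HY': 'Hydraulic Turbine', 'ST': 'Steam Turbine'}

full_names = codes.map(demo_map)

# Codes that are present in the data but absent from the mapping
unmapped_codes = sorted(codes[codes.notna() & full_names.isna()].unique())
print(unmapped_codes)
```

An empty list would suggest the dictionary covers every code that appears in the column. Any code listed would need to be added to the dictionary or checked against the EIA code list.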


In [273]:
power_plant_df.shape

(131623, 101)

## xi) Removal of 'MER\nFuel Type Code' Column

The 'MER\nFuel Type Code' column was removed from the dataset due to its high percentage of missing values and redundancy. 

**Rationale:**

* **High Missing Value Rate:** Approximately 75% of the entries in the 'MER\nFuel Type Code' column were missing. This significant incompleteness renders the column unreliable for analysis.
* **Redundancy:** The 'MER\nFuel Type Code' represents an alternative categorization schema for fuel types, developed for internal use by the EIA, and ultimately duplicates information already provided by the 'Reported\nFuel Type Code' column.
* **Data Integrity:** Removing this incomplete and redundant column simplifies the dataset and improves overall data integrity, reducing the potential for misleading interpretations.

By removing this column, we focus on the more complete and directly relevant 'Reported\nFuel Type Code' column, which aligns with the EIA's public reporting standards.
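
As a hedged illustration of the completeness check behind this kind of decision, the sketch below computes the percentage of missing values per column and lists the columns above a cutoff. The toy data and the `threshold_pct` value are assumptions for demonstration, not values used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; 3 of 4 values are missing in the candidate column
demo_df = pd.DataFrame({
    'Reported\nFuel Type Code': ['NG', 'BIT', 'WAT', 'NG'],
    'MER\nFuel Type Code': [None, None, 'NG', None],
})

# Share of missing values per column, as a percentage
missing_pct = demo_df.isna().mean() * 100

# Illustrative cutoff (an assumption, not a rule applied in this notebook)
threshold_pct = 50
columns_over_threshold = missing_pct[missing_pct > threshold_pct].index.tolist()

print(missing_pct.round(1))
print(columns_over_threshold)
```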

In [276]:
# Remove the MER\nFuel Type Code column
power_plant_df = power_plant_df.drop('MER\nFuel Type Code', axis=1)

# Verify the removal
print("MER\nFuel Type Code" in power_plant_df.columns) # Should print False

False


## xii) Removal of 'Balancing\nAuthority Code' Column

The 'Balancing\nAuthority Code' column was removed from the dataset due to its significant percentage of missing values and its indirect relevance to the project's focus.

**Rationale:**

* **Significant Missing Value Rate:** Approximately 32% of the entries in the 'Balancing\nAuthority Code' column were missing, rendering the column incomplete and potentially unreliable.
* **Indirect Relevance:** Balancing Authorities (BAs) are primarily responsible for monitoring and managing the distribution of electricity. While important for grid stability, their role is indirectly related to the fuel consumption and generation of individual power plants.
* **Project Focus:** This project aims to focus on understanding the technologies involved in fuel consumption and power generation at the power plant level. The 'Balancing\nAuthority Code' column, which pertains to grid distribution, falls outside the scope of this core objective.
* **Data Simplification:** Removing this incomplete and indirectly relevant column simplifies the dataset and improves the focus of the analysis on the primary variables of interest.

By removing this column, we prioritize the analysis of variables that directly relate to fuel consumption, generation, and the technologies involved in these processes.

In [278]:
# Remove the Balancing\nAuthority Code column
power_plant_df = power_plant_df.drop('Balancing\nAuthority Code', axis=1)

# Verify the removal
print("Balancing\nAuthority Code" in power_plant_df.columns) # Should print False

False


## xiii) Removal of 'Respondent\nFrequency' Column

The 'Respondent\nFrequency' column was removed from the dataset due to its complex interpretation and indirect relevance to the project's core objectives.

**Rationale:**

* **Complex Interpretation:** The column reflects EIA data collection methods and reporting levels, including annual (A), monthly (M), and ambiguous (AM) categories.
* **Ambiguous "AM" Category:** The "AM" category, representing a significant number of unique Plant IDs, is difficult to interpret and could lead to misinterpretations.
* **Indirect Relevance:** The column primarily pertains to data collection methodologies, which are indirectly related to the fuel consumption and generation characteristics of power plants.
* **Simplification:** Removing the column simplifies the dataset and avoids unnecessary complexity, focusing the analysis on more directly relevant variables.

Analysis of unique Plant IDs revealed that the column does not directly correspond to plant size, further supporting the decision to remove it.

In [287]:
power_plant_df['Respondent\nFrequency'].value_counts(dropna=False)

Respondent\nFrequency
A      47979
NaN    39855
M      30661
AM     13128
Name: count, dtype: int64

In [293]:
# Calculate unique Plant IDs for each Respondent Frequency category
# (groupby drops the NaN category by default, so it does not appear in the output)
unique_plant_ids = power_plant_df.groupby('Respondent\nFrequency')['Plant Id'].nunique(dropna=False)

# Print the results
print("Unique Plant IDs per Respondent Frequency:")
print(unique_plant_ids)

Unique Plant IDs per Respondent Frequency:
Respondent\nFrequency
A     9439
AM    7448
M     3045
Name: Plant Id, dtype: int64
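
The output above has no row for rows whose `Respondent\nFrequency` is missing, because `groupby` excludes NaN keys unless told otherwise. A minimal sketch of including them, using hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy rows
demo_df = pd.DataFrame({
    'Respondent\nFrequency': ['A', 'A', 'M', None, None, 'AM'],
    'Plant Id': [1, 2, 3, 4, 4, 5],
})

# dropna=False on groupby keeps rows whose group key is missing
unique_with_nan = (
    demo_df.groupby('Respondent\nFrequency', dropna=False)['Plant Id'].nunique()
)
print(unique_with_nan)
```

Note that `dropna` on `groupby` controls whether the missing key forms its own group, while `dropna` on `nunique` only controls whether missing values inside the counted column are tallied.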


In [296]:
# Remove the Respondent\nFrequency column
power_plant_df = power_plant_df.drop('Respondent\nFrequency', axis=1)

# Verify the removal
print("Respondent\nFrequency" in power_plant_df.columns) # Should print False

False


## *xiv) [MAJOR ERROR; amended in 2nd notebook] Removal of 'Physical\nUnit Label' and Related Columns*

*The 'Physical\nUnit Label' column and its associated quantity and MMBtu per unit columns were removed from the dataset due to a high percentage of missing values and the availability of equivalent MMBtu-based metrics.*

***Rationale:***

* *__High Missing Value Rate:__ Approximately 42% of the rows were missing information in the 'Physical\nUnit Label' column, rendering it incomplete and potentially unreliable.*
* *__Redundancy:__ The 'Physical\nUnit Label' column provides information about the physical units used to report fuel consumption (e.g., mcf, short tons, barrels). However, the dataset already includes MMBtu-based metrics, which provide a standardized measure of fuel energy content.*
* *__MMBtu-Based Metrics:__ The MMBtu (million British thermal units) is a common unit for measuring fuel energy content, and the dataset includes columns for MMBtu-based fuel consumption and heat content.*
* *__Data Simplification:__ Removing these columns simplifies the dataset and focuses the analysis on the standardized MMBtu-based metrics, which are more appropriate for comparing fuel consumption across different fuel types and plants.*

*By removing these columns, we prioritize the use of consistent and complete MMBtu-based metrics, which are more relevant to the project's focus on fuel consumption and energy generation.*


# Re-evaluating Physical Unit Labels and Imputation Strategy (cont'd)

Upon further analysis, we have discovered that the `Physical\nUnit Label` column, previously considered less critical, plays a vital role in understanding the nature of fuel consumption and electricity generation within our dataset. Specifically, the presence of 'Megawatthours' as a distinct label indicates the presence of energy storage technologies, which require a different interpretation of the `Quantity` and `MMBtu` columns compared to traditional fuel-consuming plants.

Furthermore, we have determined that there is a consistent mapping between the combination of `Reported\nPrime Mover` and `Reported\nFuel Type Code` and the `Physical\nUnit Label`. This consistency allows us to develop an imputation strategy to fill in the missing `Physical\nUnit Label` values. By imputing these labels, we can accurately distinguish between energy storage and traditional fuel plants, and properly interpret the associated fuel consumption and generation metrics.

**This re-evaluation is carried out in the second notebook.**
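
The imputation itself lives in the second notebook and is not reproduced here. As a rough sketch of the idea only, the example below fills a missing label from other rows that share the same prime mover and fuel type, and only when that pair maps to exactly one label. The toy rows, the label strings, and the helper column `Imputed Label` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; label strings are illustrative only
demo_df = pd.DataFrame({
    'Reported\nPrime Mover': ['ST', 'ST', 'BA', 'BA', 'GT'],
    'Reported\nFuel Type Code': ['BIT', 'BIT', 'MWH', 'MWH', 'NG'],
    'Physical\nUnit Label': ['short tons', None, 'megawatthours', None, None],
})

key_cols = ['Reported\nPrime Mover', 'Reported\nFuel Type Code']
label_col = 'Physical\nUnit Label'

# For each pair, count distinct labels and keep the first one observed
known = demo_df.dropna(subset=[label_col])
lookup = known.groupby(key_cols)[label_col].agg(['nunique', 'first']).reset_index()

# Only trust pairs that map to exactly one label
lookup = lookup[lookup['nunique'] == 1].rename(columns={'first': 'Imputed Label'})

# Left merge keeps every original row in its original order
merged = demo_df.merge(lookup[key_cols + ['Imputed Label']], on=key_cols, how='left')

# Fill gaps from the lookup; pairs never observed with a label stay missing
demo_df[label_col] = merged[label_col].fillna(merged['Imputed Label']).to_numpy()
print(demo_df)
```

In this toy data the `GT` / `NG` row stays missing because no labeled row shares that pair.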

In [299]:
# Remove the Physical\nUnit Label and related columns
columns_to_remove = [
    'Physical\nUnit Label',
    'Quantity\nJanuary', 'Quantity\nFebruary', 'Quantity\nMarch', 'Quantity\nApril',
    'Quantity\nMay', 'Quantity\nJune', 'Quantity\nJuly', 'Quantity\nAugust',
    'Quantity\nSeptember', 'Quantity\nOctober', 'Quantity\nNovember', 'Quantity\nDecember',
    'Elec_Quantity\nJanuary', 'Elec_Quantity\nFebruary', 'Elec_Quantity\nMarch',
    'Elec_Quantity\nApril', 'Elec_Quantity\nMay', 'Elec_Quantity\nJune',
    'Elec_Quantity\nJuly', 'Elec_Quantity\nAugust', 'Elec_Quantity\nSeptember',
    'Elec_Quantity\nOctober', 'Elec_Quantity\nNovember', 'Elec_Quantity\nDecember',
    'MMBtuPer_Unit\nJanuary', 'MMBtuPer_Unit\nFebruary', 'MMBtuPer_Unit\nMarch',
    'MMBtuPer_Unit\nApril', 'MMBtuPer_Unit\nMay', 'MMBtuPer_Unit\nJune',
    'MMBtuPer_Unit\nJuly', 'MMBtuPer_Unit\nAugust', 'MMBtuPer_Unit\nSeptember',
    'MMBtuPer_Unit\nOctober', 'MMBtuPer_Unit\nNovember', 'MMBtuPer_Unit\nDecember',
    'Total Fuel Consumption\nQuantity', 'Electric Fuel Consumption\nQuantity'
]

power_plant_df = power_plant_df.drop(columns_to_remove, axis=1)

# Verify the removal
print(all(col not in power_plant_df.columns for col in columns_to_remove)) # Should print True

True


In [301]:
power_plant_df.columns

Index(['Plant Id', 'Combined Heat And\nPower Plant', 'Nuclear Unit Id',
       'Plant Name', 'Operator Name', 'Operator Id', 'Plant State',
       'Census Region', 'NERC Region', 'NAICS Code', 'EIA Sector Number',
       'Sector Name', 'Reported\nPrime Mover', 'Reported\nFuel Type Code',
       'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch',
       'Tot_MMBtu\nApril', 'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune',
       'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust', 'Tot_MMBtu\nSeptember',
       'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
       'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch',
       'Elec_MMBtu\nApril', 'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune',
       'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust', 'Elec_MMBtu\nSeptember',
       'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
       'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
       'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', '

In [303]:
power_plant_df.describe()

Unnamed: 0,Plant Id,NAICS Code,EIA Sector Number,Tot_MMBtu\nJanuary,Tot_MMBtu\nFebruary,Tot_MMBtu\nMarch,Tot_MMBtu\nApril,Tot_MMBtu\nMay,Tot_MMBtu\nJune,Tot_MMBtu\nJuly,...,Netgen\nAugust,Netgen\nSeptember,Netgen\nOctober,Netgen\nNovember,Netgen\nDecember,Total Fuel Consumption\nMMBtu,Elec Fuel Consumption\nMMBtu,Net Generation\n(Megawatthours),YEAR,Conflicting Operator Id Flag
count,131623.0,131623.0,131623.0,124927.0,125083.0,125418.0,125684.0,125902.0,126186.0,126405.0,...,126643.0,126912.0,127224.0,127565.0,129109.0,131623.0,131623.0,131623.0,130915.0,131623.0
mean,38816.078026,12701.249516,2.355857,248336.8,218698.2,219760.7,202446.6,222760.4,248585.1,280318.9,...,28767.38,24806.56,22510.28,21969.03,23954.0,2700467.0,2514722.0,281381.1,2019.288103,0.012688
std,26852.4505,64346.061059,1.774151,1086637.0,950113.9,948789.0,874965.6,957759.9,1041874.0,1150725.0,...,112147.5,98632.96,92322.55,91732.13,100484.2,11344090.0,11243990.0,1119152.0,2.568077,0.111923
min,1.0,22.0,1.0,0.0,0.0,0.0,0.0,-435.0,0.0,0.0,...,-162724.0,-111998.0,-78615.0,-79300.0,-122650.0,0.0,0.0,-1121756.0,2015.0,0.0
25%,6430.0,22.0,1.0,47.0,37.0,29.0,19.0,35.0,59.0,69.0,...,8.622,5.991,4.05825,3.715,5.0,2105.5,1670.0,250.0,2017.0,0.0
50%,55037.0,22.0,2.0,3008.0,3053.0,3620.5,3652.5,4071.0,4417.0,4720.0,...,631.0,537.5925,463.7145,394.424,366.149,50721.0,44098.0,6707.0,2019.0,0.0
75%,59100.0,22.0,2.0,47975.0,45555.0,49560.5,48707.25,52429.5,57252.25,62833.0,...,6464.124,5599.247,4956.942,4658.0,4924.401,650653.5,473309.5,68687.64,2022.0,0.0
max,99999.0,562213.0,7.0,21272240.0,18958170.0,18504360.0,19941960.0,24401140.0,23438760.0,20711700.0,...,2105042.0,1909415.0,1914883.0,1893736.0,2132168.0,211809400.0,211809400.0,21166050.0,2023.0,1.0


## xv) Import EIA documentation on NAICS Code Full Name; create new column

In [311]:
# Path to your Excel file
excel_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv/2022-NAICS-Codes-listed-numerically-2-Digit-through-6-Digit.xlsx'

# Import the Excel sheet
naics_excel_df = pd.read_excel(excel_file_path)

# Display the first few rows to inspect the data
print(naics_excel_df.head())

# Check the data types of the columns
print("\nColumn Data Types:")
print(naics_excel_df.dtypes)

# Check the column names
print("\nColumn Names:")
print(naics_excel_df.columns)

   Seq. No. 2022 NAICS US   Code                          2022 NAICS US Title  \
0         1                   11  Agriculture, Forestry, Fishing and HuntingT   
1         2                  111                             Crop ProductionT   
2         3                 1111                   Oilseed and Grain FarmingT   
3         4                11111                             Soybean FarmingT   
4         5               111110                              Soybean Farming   

                                         Description  
0  The Sector as a Whole\n\nThe Agriculture, Fore...  
1  Industries in the Crop Production subsector gr...  
2  This industry group comprises establishments p...  
3               See industry description for 111110.  
4  This industry comprises establishments primari...  

Column Data Types:
Seq. No.                 int64
2022 NAICS US   Code    object
2022 NAICS US Title     object
Description             object
dtype: object

Column Names:
Index(['Se

In [313]:
# Path to your Excel file
excel_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv/2022-NAICS-Codes-listed-numerically-2-Digit-through-6-Digit.xlsx'

# Import the Excel sheet
naics_excel_df = pd.read_excel(excel_file_path)

# Rename the relevant columns
naics_excel_df = naics_excel_df.rename(columns={
    '2022 NAICS US   Code': 'NAICS Code',
    '2022 NAICS US Title': 'NAICS Full Name'
})

# Drop the other columns
naics_excel_df = naics_excel_df[['NAICS Code','NAICS Full Name']]

# Drop any existing NAICS Full Name column so this cell can be re-run without creating a suffixed duplicate
power_plant_df = power_plant_df.drop(columns='NAICS Full Name', errors='ignore')

# Left merge with power_plant_df to add NAICS Full Name
power_plant_df = pd.merge(power_plant_df, naics_excel_df, on='NAICS Code', how='left')

# Verify the changes
print(power_plant_df[['NAICS Code', 'NAICS Full Name']].head())

  NAICS Code NAICS Full Name
0         22      UtilitiesT
1         22      UtilitiesT
2         22      UtilitiesT
3         22      UtilitiesT
4         22      UtilitiesT


In [315]:
power_plant_df['NAICS Full Name'].value_counts(dropna=False)

NAICS Full Name
UtilitiesT                                                               108332
NaN                                                                        4207
Educational ServicesT                                                      2892
Paper ManufacturingT                                                       1932
Paperboard MillsT                                                          1659
                                                                          ...  
Natural Gas DistributionT                                                     6
Truck TransportationT                                                         5
Alumina and Aluminum Production and ProcessingT                               4
Nursing and Residential Care FacilitiesT                                      4
Pesticide, Fertilizer, and Other Agricultural Chemical ManufacturingT         2
Name: count, Length: 80, dtype: int64
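
The 4,207 NaN rows above are rows whose NAICS Code found no match in the Excel lookup. One way to see which codes are responsible is the `indicator` option of `merge`, sketched here on hypothetical toy frames (`plants`, `naics_lookup`); the code `99999` stands in for any code absent from the lookup.

```python
import pandas as pd

# Hypothetical toy frames; 99999 stands in for a code absent from the lookup
plants = pd.DataFrame({'NAICS Code': [22, 22, 611, 99999]})
naics_lookup = pd.DataFrame({
    'NAICS Code': [22, 611],
    'NAICS Full Name': ['Utilities', 'Educational Services'],
})

# indicator=True adds a '_merge' column recording where each row matched
checked = plants.merge(naics_lookup, on='NAICS Code', how='left', indicator=True)

# Codes with no match in the lookup, and how many rows each one affects
unmatched_codes = checked.loc[checked['_merge'] == 'left_only', 'NAICS Code'].value_counts()
print(unmatched_codes)
```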

In [317]:
# Get a list of unique NAICS Full Name values
unique_naics_full_names = power_plant_df['NAICS Full Name'].unique()

# Print the list
print(unique_naics_full_names)

['UtilitiesT' 'Petroleum RefineriesT' 'Steam and Air-Conditioning Supply'
 'HospitalsT'
 'Computing Infrastructure Providers, Data Processing, Web Hosting, and Related ServicesT'
 'Amusement, Gambling, and Recreation IndustriesT'
 'Chemical ManufacturingT' 'Paper ManufacturingT' 'Food ManufacturingT'
 'Miscellaneous ManufacturingT' 'Sewage Treatment Facilities'
 'Transportation Equipment ManufacturingT'
 'Solid Waste Combustors and Incinerators' 'Textile Product MillsT'
 'Real Estate and Rental and LeasingT' 'Educational ServicesT'
 'Wood Product ManufacturingT' 'Industrial Gas ManufacturingT'
 'Petroleum and Coal Products ManufacturingT'
 'Administrative and Support ServicesT' nan
 'Nitrogenous Fertilizer Manufacturing' 'Paperboard MillsT'
 'AccommodationT' 'Plastics Material and Resin Manufacturing'
 'Iron and Steel Mills and Ferroalloy ManufacturingT'
 'Water Supply and Irrigation Systems' 'Rail TransportationT'
 'Solid Waste Landfill' 'Crop ProductionT' 'Correctional Institutions'


In [319]:
# Clean the NAICS Full Name column
power_plant_df['NAICS Full Name'] = power_plant_df['NAICS Full Name'].str.replace('T$', '', regex=True)

# Verify the cleaning
print(power_plant_df['NAICS Full Name'].unique())

['Utilities' 'Petroleum Refineries' 'Steam and Air-Conditioning Supply'
 'Hospitals'
 'Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services'
 'Amusement, Gambling, and Recreation Industries' 'Chemical Manufacturing'
 'Paper Manufacturing' 'Food Manufacturing' 'Miscellaneous Manufacturing'
 'Sewage Treatment Facilities' 'Transportation Equipment Manufacturing'
 'Solid Waste Combustors and Incinerators' 'Textile Product Mills'
 'Real Estate and Rental and Leasing' 'Educational Services'
 'Wood Product Manufacturing' 'Industrial Gas Manufacturing'
 'Petroleum and Coal Products Manufacturing'
 'Administrative and Support Services' nan
 'Nitrogenous Fertilizer Manufacturing' 'Paperboard Mills' 'Accommodation'
 'Plastics Material and Resin Manufacturing'
 'Iron and Steel Mills and Ferroalloy Manufacturing'
 'Water Supply and Irrigation Systems' 'Rail Transportation'
 'Solid Waste Landfill' 'Crop Production' 'Correctional Institutions'
 'Plastics and Rubber P

In [321]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv'
output_file = 'power_plant_df_inmediares.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
power_plant_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'power_plant_df_inmediares.csv' has been created successfully.


In [323]:
power_plant_df.columns

Index(['Plant Id', 'Combined Heat And\nPower Plant', 'Nuclear Unit Id',
       'Plant Name', 'Operator Name', 'Operator Id', 'Plant State',
       'Census Region', 'NERC Region', 'NAICS Code', 'EIA Sector Number',
       'Sector Name', 'Reported\nPrime Mover', 'Reported\nFuel Type Code',
       'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch',
       'Tot_MMBtu\nApril', 'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune',
       'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust', 'Tot_MMBtu\nSeptember',
       'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
       'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch',
       'Elec_MMBtu\nApril', 'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune',
       'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust', 'Elec_MMBtu\nSeptember',
       'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
       'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
       'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', '

## xvi) Numerical Columns - Descriptive Statistics & Epiphany as to Next Steps

In [325]:
metrics_columns = [
    'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch',
    'Tot_MMBtu\nApril', 'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune',
    'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust', 'Tot_MMBtu\nSeptember',
    'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
    'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch',
    'Elec_MMBtu\nApril', 'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune',
    'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust', 'Elec_MMBtu\nSeptember',
    'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
    'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
    'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', 'Netgen\nAugust',
    'Netgen\nSeptember', 'Netgen\nOctober', 'Netgen\nNovember',
    'Netgen\nDecember', 'Total Fuel Consumption\nMMBtu',
    'Elec Fuel Consumption\nMMBtu', 'Net Generation\n(Megawatthours)'
]

# Calculate descriptive statistics
descriptive_stats = power_plant_df[metrics_columns].describe()

# Calculate missing percentages
missing_percentages = power_plant_df[metrics_columns].isna().mean() * 100

# Print the results
print("Descriptive Statistics:")
print(descriptive_stats)

print("\nMissing Percentages:")
print(missing_percentages)

Descriptive Statistics:
       Tot_MMBtu\nJanuary  Tot_MMBtu\nFebruary  Tot_MMBtu\nMarch  \
count        1.249270e+05         1.250830e+05      1.254180e+05   
mean         2.483368e+05         2.186982e+05      2.197607e+05   
std          1.086637e+06         9.501139e+05      9.487890e+05   
min          0.000000e+00         0.000000e+00      0.000000e+00   
25%          4.700000e+01         3.700000e+01      2.900000e+01   
50%          3.008000e+03         3.053000e+03      3.620500e+03   
75%          4.797500e+04         4.555500e+04      4.956050e+04   
max          2.127224e+07         1.895817e+07      1.850436e+07   

       Tot_MMBtu\nApril  Tot_MMBtu\nMay  Tot_MMBtu\nJune  Tot_MMBtu\nJuly  \
count      1.256840e+05    1.259020e+05     1.261860e+05     1.264050e+05   
mean       2.024466e+05    2.227604e+05     2.485851e+05     2.803189e+05   
std        8.749656e+05    9.577599e+05     1.041874e+06     1.150725e+06   
min        0.000000e+00   -4.350000e+02     0.000000e+0

# Step 3) Next Steps

## 1. Aggregating Fuel and Electricity Metrics

Plant IDs provide the most granular detail, but the project's goal of predicting the efficiency of electricity generation technologies calls for aggregating to a higher level.

The most consistently complete categorical columns are NAICS Code, EIA Sector Number and Sector Name, Reported Prime Mover, Reported Fuel Type Code, Plant State, Census Region, and Operator Name. We generated a pivot table grouped by these columns, with Operator Name as the finest level of detail. The month-by-month columns may still contain blanks because reporting frequencies are inconsistent. To help normalize for this during aggregation, the pivot table includes a count of the unique plants run by each operator.

The aggregation will happen in Excel. Before that, we need to re-include the Physical Unit Label column, which was removed during cleaning.
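
The aggregation itself is done in Excel, but a pandas version can be useful for cross-checking the pivot table. The following is a minimal sketch on made-up data, not the notebook's actual pivot: `toy_df`, `group_cols`, `operator_pivot`, and the output column names `plant_count`, `net_generation_mwh`, and `net_generation_per_plant` are assumptions introduced for illustration. It groups by a subset of the categorical columns, counts unique plants per group, and divides the summed metric by that count as one possible normalization.

```python
import pandas as pd

# Toy stand-in for power_plant_df; all values are invented for illustration.
toy_df = pd.DataFrame({
    "YEAR": [2020, 2020, 2020, 2020],
    "Operator Name": ["Op A", "Op A", "Op A", "Op B"],
    "Plant State": ["CA", "CA", "CA", "TX"],
    "Reported\nPrime Mover": ["ST", "ST", "GT", "GT"],
    "Reported\nFuel Type Code": ["NG", "NG", "NG", "NG"],
    "Plant Id": [1, 2, 1, 3],
    "Net Generation\n(Megawatthours)": [100.0, 150.0, 50.0, 80.0],
})

# Grouping columns (a subset of the complete categorical columns).
group_cols = [
    "YEAR", "Operator Name", "Plant State",
    "Reported\nPrime Mover", "Reported\nFuel Type Code",
]

# dropna=False keeps groups whose key contains a missing value,
# so rows are not silently lost during aggregation.
operator_pivot = (
    toy_df.groupby(group_cols, dropna=False)
    .agg(
        plant_count=("Plant Id", "nunique"),
        net_generation_mwh=("Net Generation\n(Megawatthours)", "sum"),
    )
    .reset_index()
)

# One possible normalization: summed generation per unique plant.
operator_pivot["net_generation_per_plant"] = (
    operator_pivot["net_generation_mwh"] / operator_pivot["plant_count"]
)

print(operator_pivot)
```

Counting unique plants with `nunique` rather than counting rows matters here because a single plant can contribute several rows to the same group.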

In [377]:
# Path to your Excel file
excel_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/01 Data/exploratory_csv/power_plant_df_inmediares.xlsx'

# Import the specified sheet
pivot_table_df = pd.read_excel(excel_file_path, sheet_name='power_plant_df_PIVOT1')

# Print the column names and data types
print(pivot_table_df.dtypes.to_string())

YEAR                                       object
Reported\nFuel Type Code                   object
Fuel Type Full Name                        object
Reported\nPrime Mover                      object
Prime Mover Full Name                      object
Plant State                                object
Census Region                              object
NERC Region                                object
Operator Id                                object
Operator Name                              object
EIA Sector Number                         float64
Sector Name                                object
NAICS Code                                float64
NAICS Full Name                            object
Count of Plant Id                           int64
Sum of Total Fuel Consumption\nMMBtu        int64
Sum of Elec Fuel Consumption\nMMBtu         int64
Sum of Net Generation\n(Megawatthours)    float64
Sum of Tot_MMBtu\nJanuary                 float64
Sum of Tot_MMBtu\nFebruary                float64


In [348]:
# Group by Prime Mover, Fuel Type, and check unique Physical Unit Labels
grouped_data = original_merge_df.groupby(['Reported\nPrime Mover', 'Reported\nFuel Type Code'])['Physical\nUnit Label'].nunique()

# Filter for groups with more than one unique Physical Unit Label
multiple_units = grouped_data[grouped_data > 1]

# Print the results
print("Prime Mover and Fuel Type combinations with multiple Physical Unit Labels:")
print(multiple_units)

Prime Mover and Fuel Type combinations with multiple Physical Unit Labels:
Series([], Name: Physical\nUnit Label, dtype: int64)


## 2. Re-evaluating Physical Unit Labels and Imputation Strategy

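The check above returned an empty Series: no Prime Mover and Fuel Type combination has more than one Physical Unit Label. Missing labels can therefore be imputed from a row's Prime Mover and Fuel Type, so we will complete the data wrangling through the following steps: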

1. Begin with the original merged power_plant_df and repeat every cleaning step except the removal of the Physical Unit Label column.

2. Export to Excel. In Excel, aggregate to the Prime Mover - Fuel Type - State level. Import this pivot table back into Jupyter.

3. Profile and clean the re-imported data.

This revised approach should give a more accurate and complete picture of the dataset, particularly for identifying and analyzing energy storage technologies.
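
The imputation of Physical Unit Labels could be sketched as follows. This is a minimal illustration on made-up data, assuming (as the uniqueness check suggests) that each Prime Mover and Fuel Type combination has at most one label. The names `toy_df`, `lookup`, `merged`, and `label_from_lookup` are hypothetical, and the label values are placeholders rather than values taken from the dataset.

```python
import pandas as pd

key_cols = ["Reported\nPrime Mover", "Reported\nFuel Type Code"]
label_col = "Physical\nUnit Label"

# Toy stand-in for the merged data; values are invented for illustration.
toy_df = pd.DataFrame({
    "Reported\nPrime Mover": ["ST", "ST", "GT", "GT", "WT"],
    "Reported\nFuel Type Code": ["BIT", "BIT", "NG", "NG", "WND"],
    "Physical\nUnit Label": ["short tons", None, "mcf", None, None],
})

# Build a lookup with one known label per combination. This is only valid
# because each combination has at most one distinct label.
lookup = (
    toy_df.dropna(subset=[label_col])
    .drop_duplicates(subset=key_cols)[key_cols + [label_col]]
    .rename(columns={label_col: "label_from_lookup"})
)

# A left merge keeps every original row in its original order.
merged = toy_df.merge(lookup, on=key_cols, how="left")

# Fill only the missing labels; combinations with no known label stay missing.
merged[label_col] = merged[label_col].fillna(merged["label_from_lookup"])
merged = merged.drop(columns="label_from_lookup")

print(merged)
```

Combinations that never report a label (the `WT`/`WND` row in this toy data) remain missing after the fill, so they can be reviewed separately instead of receiving a guessed value.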