# 😶‍🌫️ THE Final Data Wrangle attempt : CareerFoundry 6.1

#### Collaborators: 
- Amy Zhang (instance 332)
- ChatGPT (instance 0101)
- Perplexity (Instance ID: Perplexity-20250319)
- Perplexity (Instance ID: Perplexity-20250320)
- Perplexity (Instance ID: Perplexity-20250322)
- Perplexity (Instance ID: Perplexity-20250323)
- Perplexity (Instance ID: Perplexity-20250325)


# Data Import and Curation Summary (2015-2023)

This study utilizes three main datasets, derived from previous mega-merges and further curated:

## 1. Power Plant Data (power_plant_df → pp_cleaned_df)
- **Source**: Schedule 2-5 for Power Plants (Electricity Generation and Fuel Consumption)
- **Original size**: 131,638 rows, 54 columns
- **Curated file**: `filtered_power_plant_df.csv`
  - **Description**: Abridged version including only observations with Plant IDs also present in water_df
  - **Size**: 23,450 rows, 54 columns
  - **Unique Power Plants**: 918

## 2. Water Usage Data (water_df)
- **Source**: Schedule 8 Environmental Information for Power Plants (Cooling System Information)
- **Curated file**: `water_df_StateCleaned_2015_2023.csv` (water_df_simplified)
  - **Description**: Water metrics imputed; missing states cross-referenced with power_plant_df
  - **Unique Power Plants**: 918

## 3. Thermoelectric Information (merged_df → filtered_df_2)
- **Description**: Combines power plant and water usage data, including water-usage metadata (Water Type and Water Source)
- **Curated file**: `CoolingBoilerDetail_PlantCodeMatch.csv`
  - **Description**: Abridged version including only observations with Plant Codes that match Plant IDs in power_plant_df
  - **Unique Power Plants**: 918

*Note: All curated datasets contain information on the same 918 unique power plants, ensuring consistency across analyses.*
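The consistency claim in the note can be checked directly once the three curated files are loaded. The following is a minimal sketch on toy frames, not the notebook's own check: `plant_id_sets_match` is a hypothetical helper, and the ID column names (`Plant Id`, `Plant ID`, `Plant Code`) are assumed from the descriptions above.

```python
import pandas as pd

def plant_id_sets_match(frames):
    """frames maps a label to (DataFrame, id_column).

    Returns (all_match, counts): whether every frame holds the same set
    of plant IDs, and the number of unique IDs per frame.
    """
    id_sets = {
        name: set(df[col].dropna().unique())
        for name, (df, col) in frames.items()
    }
    counts = {name: len(ids) for name, ids in id_sets.items()}
    first = next(iter(id_sets.values()))
    all_match = all(ids == first for ids in id_sets.values())
    return all_match, counts

# Toy stand-ins for the three curated datasets (hypothetical values)
pp_toy = pd.DataFrame({"Plant Id": [3, 3, 7, 8]})
water_toy = pd.DataFrame({"Plant ID": [7, 8, 3]})
detail_toy = pd.DataFrame({"Plant Code": [3, 7, 8, 8]})

all_match, counts = plant_id_sets_match({
    "power": (pp_toy, "Plant Id"),
    "water": (water_toy, "Plant ID"),
    "detail": (detail_toy, "Plant Code"),
})
print(all_match, counts)
```

On the real data, each count would be expected to equal 918 if the note holds.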


In [448]:
# Import files. 

import numpy as np
import pandas as pd
import os

# Set the correct file paths
water_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/water_STATE_2015_2023.csv'
power_plant_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/PowerPlants_Merged_2015_2023.csv'

# Function to import CSV and display dtypes
def import_csv(file_path, df_name):
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        print(f"\nDtypes for {df_name}:")
        print(df.dtypes.to_string())
        return df
    else:
        print(f"File not found: {file_path}")
        return None

# Import the CSV files
water_df = import_csv(water_file_path, 'water_df')
power_plant_df = import_csv(power_plant_file_path, 'power_plant_df')


Dtypes for water_df:
Year                                         int64
Month                                        int64
Plant ID                                     int64
Cooling System\nID                          object
Type of Cooling System                      object
Cooling System\nStatus                      object
Hours in Service                            object
Chlorine \n(thousand lbs)                   object
Diversion Rate \n(gallons per minute)       object
Withdrawal Rate \n(gallons per minute)      object
Discharge Rate \n(gallons per minute)       object
Consumption Rate \n(gallons per minute)     object
Method for Flow Rates                       object
Intake Average\nTemperature (ºF)            object
Intake Maximum\nTemperature (ºF)            object
Discharge Average\nTemperature (ºF)         object
Discharge Maximum\nTemperature (ºF)         object
Method for Temperatures                     object
Diversion Volume \n(million gallons)        object
Withdrawa

  df = pd.read_csv(file_path)



Dtypes for power_plant_df:
Plant Id                                 int64
Combined Heat And\nPower Plant          object
Nuclear Unit Id                         object
Plant Name                              object
Operator Name                           object
Operator Id                             object
Plant State                             object
Census Region                           object
NERC Region                             object
Reserved                               float64
NAICS Code                               int64
EIA Sector Number                        int64
Sector Name                             object
Reported\nPrime Mover                   object
Reported\nFuel Type Code                object
MER\nFuel Type Code                     object
Balancing\nAuthority Code               object
Respondent\nFrequency                   object
Physical\nUnit Label                    object
Quantity\nJanuary                       object
Quantity\nFebruary              

### Step 1: Re-import and process power_plant_df (with data dictionary)

To generate summary statistics for the power plant dataset, we need to follow these steps:

1. **Re-import the power_plant_df with a data dictionary**
2. **Export the processed data**
3. **Re-import the exported data for analysis**

This process ensures that the dataset is properly typed and structured before any statistics are computed. `power_plant_df` will most likely serve as the base dataset onto which the other datasets are merged, or against which they are compared, because it holds the most complete information on electricity generation (quantity and purpose) and fuel consumption.
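A `dtype` key only takes effect when it matches a CSV header exactly, including the embedded newlines that many headers in this file carry. The sketch below shows one way to catch mismatched keys before the full import. It runs on a toy CSV; `unmatched_dtype_keys` is a hypothetical helper, not part of pandas.

```python
import io
import pandas as pd

def unmatched_dtype_keys(csv_source, dtype_dict):
    """Return the dtype keys that do not appear among the CSV headers."""
    header = pd.read_csv(csv_source, nrows=0)  # read the header row only
    return sorted(set(dtype_dict) - set(header.columns))

# Toy CSV whose second header contains a newline, as in the real file
toy_csv = 'Plant Id,"Reported\nPrime Mover",YEAR\n1,ST,2015\n'

# The second key uses a space instead of the newline, so it cannot match
toy_dtypes = {
    "Plant Id": "int64",
    "Reported Prime Mover": "str",
    "YEAR": "str",
}

missing_keys = unmatched_dtype_keys(io.StringIO(toy_csv), toy_dtypes)
print(missing_keys)
```

For the real import, the same helper could be pointed at the file path instead of a `StringIO` object.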


In [450]:
import pandas as pd
import numpy as np

# Set display options to show all columns without truncation
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# dtype dictionary: read identifier and code columns as strings.
# Keys must match the CSV headers exactly, including embedded newlines.
dtype_dict = {
    'YEAR': 'str',  # Read as string first; converted to numeric below
    'Plant Id': 'int64',
    'Nuclear Unit Id': 'str',
    'Combined Heat And\nPower Plant': 'str',
    'Plant Name': 'str',
    'Operator Name': 'str',
    'Operator Id': 'str',
    'Plant State': 'str',
    'Census Region': 'str',
    'NERC Region': 'str',
    'EIA Sector Number': 'str',
    'Sector Name': 'str',
    'NAICS Code': 'str',
    'Reported\nPrime Mover': 'str',
    'Reported\nFuel Type Code': 'str',
    'MER\nFuel Type Code': 'str',
    'Balancing\nAuthority Code': 'str',
    'Physical\nUnit Label': 'str'
}

# Load the CSV with specified dtypes
power_plant_df = pd.read_csv(
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/PowerPlants_Merged_2015_2023.csv',
    dtype=dtype_dict,
    low_memory=False
)

# Correctly convert YEAR to numeric, coercing errors
power_plant_df['YEAR'] = pd.to_numeric(power_plant_df['YEAR'], errors='coerce')

# Convert '.' to NaN in 'Operator Id'
power_plant_df['Operator Id'] = power_plant_df['Operator Id'].replace('.', np.nan)

# Convert 'Operator Id' to numeric
power_plant_df['Operator Id'] = pd.to_numeric(power_plant_df['Operator Id'], errors='coerce')

# Monthly numeric columns, plus the annual total columns
monthly_cols = [
    'Quantity\nJanuary', 'Quantity\nFebruary',
       'Quantity\nMarch', 'Quantity\nApril', 'Quantity\nMay', 'Quantity\nJune',
       'Quantity\nJuly', 'Quantity\nAugust', 'Quantity\nSeptember',
       'Quantity\nOctober', 'Quantity\nNovember', 'Quantity\nDecember',
       'Elec_Quantity\nJanuary', 'Elec_Quantity\nFebruary',
       'Elec_Quantity\nMarch', 'Elec_Quantity\nApril', 'Elec_Quantity\nMay',
       'Elec_Quantity\nJune', 'Elec_Quantity\nJuly', 'Elec_Quantity\nAugust',
       'Elec_Quantity\nSeptember', 'Elec_Quantity\nOctober',
       'Elec_Quantity\nNovember', 'Elec_Quantity\nDecember',
       'MMBtuPer_Unit\nJanuary', 'MMBtuPer_Unit\nFebruary',
       'MMBtuPer_Unit\nMarch', 'MMBtuPer_Unit\nApril', 'MMBtuPer_Unit\nMay',
       'MMBtuPer_Unit\nJune', 'MMBtuPer_Unit\nJuly', 'MMBtuPer_Unit\nAugust',
       'MMBtuPer_Unit\nSeptember', 'MMBtuPer_Unit\nOctober',
       'MMBtuPer_Unit\nNovember', 'MMBtuPer_Unit\nDecember',
       'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch',
       'Tot_MMBtu\nApril', 'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune',
       'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust', 'Tot_MMBtu\nSeptember',
       'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
       'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch',
       'Elec_MMBtu\nApril', 'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune',
       'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust', 'Elec_MMBtu\nSeptember',
       'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
       'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
       'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', 'Netgen\nAugust',
       'Netgen\nSeptember', 'Netgen\nOctober', 'Netgen\nNovember',
       'Netgen\nDecember', 'Total Fuel Consumption\nQuantity',
       'Electric Fuel Consumption\nQuantity', 'Total Fuel Consumption\nMMBtu',
       'Elec Fuel Consumption\nMMBtu', 'Net Generation\n(Megawatthours)'
]

# Strip thousands separators, then convert to numeric (unparseable values become NaN)
for col in monthly_cols:
    power_plant_df[col] = power_plant_df[col].astype(str).str.replace(',', '', regex=False)
    power_plant_df[col] = pd.to_numeric(power_plant_df[col], errors='coerce')

# 'Total Fuel Consumption\nMMBtu' and 'Elec Fuel Consumption\nMMBtu' are already in
# monthly_cols, so the loop above has converted them to float64.

# Convert 'Combined Heat And Power Plant' to Boolean
power_plant_df['Combined Heat And\nPower Plant'] = power_plant_df['Combined Heat And\nPower Plant'].map({'Y': True, 'N': False})

# Print data types
with pd.option_context('display.max_rows', None):
    print(power_plant_df.dtypes.to_string())



Plant Id                                 int64
Combined Heat And\nPower Plant            bool
Nuclear Unit Id                         object
Plant Name                              object
Operator Name                           object
Operator Id                            float64
Plant State                             object
Census Region                           object
NERC Region                             object
Reserved                               float64
NAICS Code                              object
EIA Sector Number                       object
Sector Name                             object
Reported\nPrime Mover                   object
Reported\nFuel Type Code                object
MER\nFuel Type Code                     object
Balancing\nAuthority Code               object
Respondent\nFrequency                   object
Physical\nUnit Label                    object
Quantity\nJanuary                      float64
Quantity\nFebruary                     float64
Quantity\nMar

### Step 2: Assess missing data in power_plant_df

Looking at the amount of missing or incomplete data in power_plant_df gives a quick sense of which columns can be dropped.

In [452]:
# Count of missing values per column
missing_count = power_plant_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(power_plant_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())



                                     Missing Count  Missing Percentage
Plant Id                                         0            0.000000
Combined Heat And\nPower Plant                   0            0.000000
Nuclear Unit Id                                  0            0.000000
Plant Name                                       9            0.006837
Operator Name                                    9            0.006837
Operator Id                                      9            0.006837
Plant State                                      9            0.006837
Census Region                                    0            0.000000
NERC Region                                   5188            3.941111
Reserved                                    131638          100.000000
NAICS Code                                       0            0.000000
EIA Sector Number                                0            0.000000
Sector Name                                      0            0.000000
Report

In [454]:
# Remove specified columns
columns_to_remove = [
    'Reserved',
    'Reserved.1',
    'Reserved.2',
    'MER\nFuel Type Code', 'Nuclear Unit Id',
    'Balancing\nAuthority Code', 'Physical\nUnit Label',
    'Respondent\nFrequency', 'Quantity\nJanuary', 'Quantity\nFebruary',
       'Quantity\nMarch', 'Quantity\nApril', 'Quantity\nMay', 'Quantity\nJune',
       'Quantity\nJuly', 'Quantity\nAugust', 'Quantity\nSeptember',
       'Quantity\nOctober', 'Quantity\nNovember', 'Quantity\nDecember',
       'Elec_Quantity\nJanuary', 'Elec_Quantity\nFebruary',
       'Elec_Quantity\nMarch', 'Elec_Quantity\nApril', 'Elec_Quantity\nMay',
       'Elec_Quantity\nJune', 'Elec_Quantity\nJuly', 'Elec_Quantity\nAugust',
       'Elec_Quantity\nSeptember', 'Elec_Quantity\nOctober',
       'Elec_Quantity\nNovember', 'Elec_Quantity\nDecember',
       'MMBtuPer_Unit\nJanuary', 'MMBtuPer_Unit\nFebruary',
       'MMBtuPer_Unit\nMarch', 'MMBtuPer_Unit\nApril', 'MMBtuPer_Unit\nMay',
       'MMBtuPer_Unit\nJune', 'MMBtuPer_Unit\nJuly', 'MMBtuPer_Unit\nAugust',
       'MMBtuPer_Unit\nSeptember', 'MMBtuPer_Unit\nOctober',
       'MMBtuPer_Unit\nNovember', 'MMBtuPer_Unit\nDecember','Total Fuel Consumption\nQuantity',
    'Electric Fuel Consumption\nQuantity'
]

power_plant_df = power_plant_df.drop(columns=columns_to_remove)


### Step 3: power_plant_df Data Cleaning Summary

#### Handling Missing Data

1. **Numerical Values (Monthly Energy Data)**
   - Imputed missing values using the median for each specific month, year, and 'Reported\nFuel Type Code'
   - For any remaining missing values, used the median for that month and year
   
2. **Categorical Variables**
   - Replaced empty strings and whitespace-only strings with NaN
   - Affected columns: Plant Name, Operator Name, Operator Id, Plant State, NERC Region, AER\nFuel Type Code

3. **Removing Invalid Years**
   - Removed 711 rows where the YEAR value was NaN
   - This step ensures all remaining data has valid year information

#### Remaining Missing Data
After cleaning:
- Most categorical variables have minimal missing data (<0.005%)
- NERC Region: 3.42% missing
- AER\nFuel Type Code: 25.63% missing

These steps standardized our approach to missing data, improving data quality for subsequent analysis.
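The categorical cleanup and the year filter summarized above can be illustrated on a toy frame. This is a sketch with made-up values, not the notebook's data; only the two pandas calls mirror the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy frame: one blank string, one whitespace-only string, one missing year
toy = pd.DataFrame({
    "YEAR": [2015.0, np.nan, 2016.0],
    "Plant State": ["TX", "  ", ""],
})

# Empty and whitespace-only strings become NaN
toy["Plant State"] = toy["Plant State"].replace(r"^\s*$", np.nan, regex=True)

# Rows without a valid year are dropped
toy_valid = toy.dropna(subset=["YEAR"])

print(toy["Plant State"].isna().sum())  # blanks now counted as missing
print(len(toy_valid))                   # rows that keep a valid year
```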


In [456]:
# Count of missing values per column
missing_count = power_plant_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(power_plant_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())


                                 Missing Count  Missing Percentage
Plant Id                                     0            0.000000
Combined Heat And\nPower Plant               0            0.000000
Plant Name                                   9            0.006837
Operator Name                                9            0.006837
Operator Id                                  9            0.006837
Plant State                                  9            0.006837
Census Region                                0            0.000000
NERC Region                               5188            3.941111
NAICS Code                                   0            0.000000
EIA Sector Number                            0            0.000000
Sector Name                                  0            0.000000
Reported\nPrime Mover                        0            0.000000
Reported\nFuel Type Code                     0            0.000000
Tot_MMBtu\nJanuary                        6696            5.08

### Step 3a: Imputing Missing Metrics

To handle missing values in our dataset, we employ a two-step imputation process:

1. **Primary Imputation**: We first impute missing values for energy and generation metrics using the median within each combination of year and Reported Fuel Type Code. This approach is chosen because:
   - It preserves the temporal trends within each year.
   - It accounts for the specific characteristics of different fuel types.
   - The median is used instead of the mean to minimize the impact of outliers.

2. **Secondary Imputation**: If any missing values remain after the primary imputation, we fill them with the median of the respective year across all fuel types. This ensures that we have a complete dataset while still maintaining some temporal context.

The imputation process is applied to all monthly columns (January through December) for both energy and generation metrics. This method allows us to maintain the integrity of our data while providing reasonable estimates for missing values based on similar plants and time periods.
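The two-step process can be traced on a small example. The sketch below uses a toy frame with a hypothetical `fuel` column standing in for `Reported\nFuel Type Code`. It fills with `fillna` against a group-aligned median rather than a lambda, which is a stylistic choice for this illustration and not necessarily what the notebook cell does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015, 2015, 2016, 2016],
    "fuel": ["NG", "NG", "NG", "SUN", "NG", "NG"],
    "Netgen": [10.0, np.nan, 30.0, np.nan, 5.0, np.nan],
})

# Step 1: median within each (YEAR, fuel) group
group_median = toy.groupby(["YEAR", "fuel"])["Netgen"].transform("median")
step1 = toy["Netgen"].fillna(group_median)

# The lone 2015 SUN row has no group median, so it is still missing here
print(step1.tolist())

# Step 2: fall back to the median of the year across all fuel types
year_median = step1.groupby(toy["YEAR"]).transform("median")
step2 = step1.fillna(year_median)

print(step2.tolist())
```

Because `fillna` only touches missing entries, observed values are never overwritten, whatever the group medians turn out to be.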

In [458]:
import pandas as pd
import numpy as np

# 1. Impute missing values for energy and generation metrics
MONTHS = ('January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December')
monthly_cols = [col for col in power_plant_df.columns if col.endswith(MONTHS)]

for col in monthly_cols:
    power_plant_df[col] = power_plant_df.groupby(['YEAR', 'Reported\nFuel Type Code'])[col].transform(lambda x: x.fillna(x.median()))

# Check for any remaining missing values after this imputation
still_missing = power_plant_df[monthly_cols].isnull().sum()
print("Columns with remaining missing values after imputation:")
print(still_missing[still_missing > 0])

# 2. Fallback: fill any remaining gaps with the median for the same year across all fuel types
for col in monthly_cols:
    power_plant_df[col] = power_plant_df[col].fillna(power_plant_df.groupby('YEAR')[col].transform('median'))

# Verify final missing count
final_missing = power_plant_df[monthly_cols].isnull().sum()
print("\nFinal missing value count:")
print(final_missing[final_missing > 0])


Columns with remaining missing values after imputation:
Tot_MMBtu\nJanuary       718
Tot_MMBtu\nFebruary      719
Tot_MMBtu\nMarch         717
Tot_MMBtu\nApril         717
Tot_MMBtu\nMay           717
Tot_MMBtu\nJune          717
Tot_MMBtu\nJuly          717
Tot_MMBtu\nAugust        717
Tot_MMBtu\nSeptember     717
Tot_MMBtu\nOctober       714
Tot_MMBtu\nNovember      718
Tot_MMBtu\nDecember      717
Elec_MMBtu\nJanuary      718
Elec_MMBtu\nFebruary     719
Elec_MMBtu\nMarch        717
Elec_MMBtu\nApril        717
Elec_MMBtu\nMay          717
Elec_MMBtu\nJune         717
Elec_MMBtu\nJuly         717
Elec_MMBtu\nAugust       717
Elec_MMBtu\nSeptember    717
Elec_MMBtu\nOctober      714
Elec_MMBtu\nNovember     718
Elec_MMBtu\nDecember     717
Netgen\nJanuary          718
Netgen\nFebruary         719
Netgen\nMarch            717
Netgen\nApril            717
Netgen\nMay              717
Netgen\nJune             717
Netgen\nJuly             717
Netgen\nAugust           717
Netgen\nSeptembe

After imputation, we check whether any missing values remain. The check below shows that every row still missing a value also has a NaN YEAR, so no year-based median can be assigned to it. Those rows are removed in the following cell, which leaves the monthly columns complete.


In [460]:
missing_rows = power_plant_df[power_plant_df['Tot_MMBtu\nJanuary'].isnull()]
print(missing_rows[['YEAR', 'Reported\nFuel Type Code', 'Plant Id']].head())
print(f"Unique years in missing data: {missing_rows['YEAR'].unique()}")
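fuel_col = 'Reported\nFuel Type Code'  # a backslash inside an f-string expression is a SyntaxError before Python 3.12
print(f"Unique fuel types in missing data: {missing_rows[fuel_col].unique()}")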


        YEAR Reported\nFuel Type Code  Plant Id
130927   NaN                      BIT      8809
130928   NaN                      BIT      8812
130929   NaN                      DFO      8812
130930   NaN                      BIT      8812
130931   NaN                      DFO      8812
Unique years in missing data: [nan]
Unique fuel types in missing data: ['BIT' 'DFO' 'PC' 'SUB' 'RFO' 'NG' 'KER' 'WC' 'WDS' 'SUN' 'MWH' 'OBG'
 'WAT' 'LFG' 'OG' 'BLQ' 'GEO' 'OTH' 'RC' 'SLW' 'WND' 'JF' 'OBL' 'PUR'
 'TDF' 'AB' 'ANT' 'BFG' 'LIG' 'OBS' 'SC' 'WO']


In [462]:
# Remove rows with NaN years (711 rows)
# Count rows before removal
rows_before = len(power_plant_df)
print(f"Number of rows before removal: {rows_before}")

# Remove rows with NaN years
power_plant_df = power_plant_df.dropna(subset=['YEAR'])

# Count rows after removal
rows_after = len(power_plant_df)
print(f"Number of rows after removal: {rows_after}")

# Report how many rows were actually removed (expected: 711)
print(f"Number of rows removed: {rows_before - rows_after}")

# Check for any remaining missing values in the monthly columns
monthly_cols = [col for col in power_plant_df.columns if col.endswith(('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'))]

final_missing = power_plant_df[monthly_cols].isnull().sum()
print("\nFinal missing value count in monthly columns:")
print(final_missing[final_missing > 0])


Number of rows before removal: 131638
Number of rows after removal: 130927
Number of rows removed: 711

Final missing value count in monthly columns:
Series([], dtype: int64)


In [464]:
# List of categorical columns
categorical_cols = [
    'Plant Name', 'Operator Name', 'Operator Id', 'Plant State',
    'Census Region', 'NERC Region', 'EIA Sector Number', 'Sector Name',
    'Reported\nPrime Mover', 'Reported\nFuel Type Code', 'AER\nFuel Type Code'
]

# Replace empty strings and whitespace-only strings with NaN
for col in categorical_cols:
    power_plant_df[col] = power_plant_df[col].replace(r'^\s*$', np.nan, regex=True)

# Check for any remaining missing values
missing_after = power_plant_df[categorical_cols].isnull().sum()
missing_percentage_after = 100 * power_plant_df[categorical_cols].isnull().sum() / len(power_plant_df)
missing_table_after = pd.concat([missing_after, missing_percentage_after], axis=1, keys=['Missing Count', 'Missing Percentage'])

print("Missing values in categorical columns after cleaning:")
print(missing_table_after)


Missing values in categorical columns after cleaning:
                          Missing Count  Missing Percentage
Plant Name                            6            0.004583
Operator Name                         6            0.004583
Operator Id                           6            0.004583
Plant State                           6            0.004583
Census Region                         0            0.000000
NERC Region                        4477            3.419463
EIA Sector Number                     0            0.000000
Sector Name                           0            0.000000
Reported\nPrime Mover                 0            0.000000
Reported\nFuel Type Code              0            0.000000
AER\nFuel Type Code               33558           25.631077


In [466]:
# Count of missing values per column
missing_count = power_plant_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(power_plant_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())

                                 Missing Count  Missing Percentage
Plant Id                                     0            0.000000
Combined Heat And\nPower Plant               0            0.000000
Plant Name                                   6            0.004583
Operator Name                                6            0.004583
Operator Id                                  6            0.004583
Plant State                                  6            0.004583
Census Region                                0            0.000000
NERC Region                               4477            3.419463
NAICS Code                                   0            0.000000
EIA Sector Number                            0            0.000000
Sector Name                                  0            0.000000
Reported\nPrime Mover                        0            0.000000
Reported\nFuel Type Code                     0            0.000000
Tot_MMBtu\nJanuary                           0            0.00

In [468]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN'
output_file = 'power_plant_df_dtyped.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
power_plant_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")


Saving confirmed: 'power_plant_df_dtyped.csv' has been created successfully.


In [472]:
# Re-import the exported CSV (no dtype mapping is passed, so pandas infers the types)
pp_cleaned_df = pd.read_csv('/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/power_plant_df_dtyped.csv')
# Display dtypes
print(pp_cleaned_df.dtypes.to_string())


Plant Id                             int64
Combined Heat And\nPower Plant        bool
Plant Name                          object
Operator Name                       object
Operator Id                        float64
Plant State                         object
Census Region                       object
NERC Region                         object
NAICS Code                           int64
EIA Sector Number                    int64
Sector Name                         object
Reported\nPrime Mover               object
Reported\nFuel Type Code            object
Tot_MMBtu\nJanuary                 float64
Tot_MMBtu\nFebruary                float64
Tot_MMBtu\nMarch                   float64
Tot_MMBtu\nApril                   float64
Tot_MMBtu\nMay                     float64
Tot_MMBtu\nJune                    float64
Tot_MMBtu\nJuly                    float64
Tot_MMBtu\nAugust                  float64
Tot_MMBtu\nSeptember               float64
Tot_MMBtu\nOctober                 float64
Tot_MMBtu\n

  pp_cleaned_df= pd.read_csv('/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/power_plant_df_dtyped.csv')


### Step 4: Re-import water_df (Cooling System Information) with the correct data dictionary

In [474]:
water_df.columns

Index(['Year', 'Month', 'Plant ID', 'Cooling System\nID',
       'Type of Cooling System', 'Cooling System\nStatus', 'Hours in Service',
       'Chlorine \n(thousand lbs)', 'Diversion Rate \n(gallons per minute)',
       'Withdrawal Rate \n(gallons per minute)',
       'Discharge Rate \n(gallons per minute)',
       'Consumption Rate \n(gallons per minute)', 'Method for Flow Rates',
       'Intake Average\nTemperature (ºF)', 'Intake Maximum\nTemperature (ºF)',
       'Discharge Average\nTemperature (ºF)',
       'Discharge Maximum\nTemperature (ºF)', 'Method for Temperatures',
       'Diversion Volume \n(million gallons)',
       'Withdrawal Volume \n(million gallons)',
       'Discharge Volume \n(million gallons)',
       'Consumption Volume \n(million gallons)', 'State'],
      dtype='object')

In [476]:
# Define the data types for each column
data_types = {
    'Year': int,
    'Month': int,
    'Plant ID': int,
    'Cooling System\nID': str,
    'Type of Cooling System': str,
    'Cooling System\nStatus': str,
    'State': str
}

# List of columns to be converted to float
float_columns = [
    'Hours in Service',
    'Chlorine \n(thousand lbs)',
    'Diversion Rate \n(gallons per minute)',
    'Withdrawal Rate \n(gallons per minute)',
    'Discharge Rate \n(gallons per minute)',
    'Consumption Rate \n(gallons per minute)',
    'Intake Average\nTemperature (ºF)',
    'Intake Maximum\nTemperature (ºF)',
    'Discharge Average\nTemperature (ºF)',
    'Discharge Maximum\nTemperature (ºF)',
    'Diversion Volume \n(million gallons)',
    'Withdrawal Volume \n(million gallons)',
    'Discharge Volume \n(million gallons)',
    'Consumption Volume \n(million gallons)'
]

# Convert a raw CSV string to float; anything float() cannot parse
# (blanks, '.', values with thousands separators such as '1,234') becomes NaN
def safe_float(val):
    try:
        return float(val)
    except ValueError:
        return np.nan

# Read the CSV file with the specified data types and converters
water_df = pd.read_csv('/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/water_STATE_2015_2023.csv', 
                       dtype=data_types, 
                       converters={col: safe_float for col in float_columns})

# Remove specified columns
columns_to_remove = ['Method for Temperatures', 'Method for Flow Rates']
water_df = water_df.drop(columns=columns_to_remove)

# Check the data types after import
print("Data types after import:")
print(water_df.dtypes)

# Print the shape of the dataframe
print("\nShape of the dataframe:")
print(water_df.shape)

# Print summary statistics to verify data import
print("\nSummary statistics:")
print(water_df.describe())


Data types after import:
Year                                         int64
Month                                        int64
Plant ID                                     int64
Cooling System\nID                          object
Type of Cooling System                      object
Cooling System\nStatus                      object
Hours in Service                           float64
Chlorine \n(thousand lbs)                  float64
Diversion Rate \n(gallons per minute)      float64
Withdrawal Rate \n(gallons per minute)     float64
Discharge Rate \n(gallons per minute)      float64
Consumption Rate \n(gallons per minute)    float64
Intake Average\nTemperature (ºF)           float64
Intake Maximum\nTemperature (ºF)           float64
Discharge Average\nTemperature (ºF)        float64
Discharge Maximum\nTemperature (ºF)        float64
Diversion Volume \n(million gallons)       float64
Withdrawal Volume \n(million gallons)      float64
Discharge Volume \n(million gallons)       float64
Consum

## Summary of Key Metrics

### Hours in Service
- **Range**: 0 to 744 hours (31 days)
- **Mean**: 485.5 hours (~20 days/month)
- **Note**: Reasonable for monthly data

### Chlorine (thousand lbs)
- **Range**: 0 to 1,919
- **Mean**: 6.53
- **Note**: High max value needs investigation

### Water Rates (gallons per minute)
- **Observation**: Very wide ranges for all metrics (Diversion, Withdrawal, Discharge, Consumption)
- **Concern**: Extremely high max values (e.g., Discharge Rate max: 16,690,880 gpm)
- **Action**: Verify accuracy of extreme values

### Temperatures (°F)
1. **Intake Average**
   - Range: 0 to 105°F (reasonable)
2. **Intake Maximum**
   - Max: 8,805°F (clear error, needs correction)
3. **Discharge Average**
   - Max: 1,163°F (likely error)
4. **Discharge Maximum**
   - Range: 0 to 193°F (upper end high but possibly valid)

### Water Volumes (million gallons)
- **Observation**: Wide ranges for all categories
- **Issues**:
  - Withdrawal Volume min: -2.303 (impossible)
  - Consumption Volume min: -1,045 (impossible)
- **Action**: Investigate and correct negative values

## Recommendations

1. Investigate and correct extreme values, especially in temperature and water metrics
2. Check for unit consistency across all plants
3. Set reasonable upper and lower bounds for each metric based on domain knowledge
4. Address impossible negative values in Withdrawal and Consumption volumes
5. Carefully examine temperature values above water's boiling point (212°F/100°C)

These steps will enhance data quality and reliability for further analysis.
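Recommendations 3 to 5 amount to comparing each metric against a plausible range. The sketch below flags out-of-range values on a toy frame. `flag_out_of_range` is a hypothetical helper, and the bounds (32 to 212 °F for temperatures, non-negative volumes) are illustrative assumptions, not validated domain limits.

```python
import numpy as np
import pandas as pd

TEMP_COL = "Intake Maximum\nTemperature (ºF)"
VOL_COL = "Withdrawal Volume \n(million gallons)"

# Assumed bounds: (lower, upper); None means no limit on that side
bounds = {
    TEMP_COL: (32.0, 212.0),
    VOL_COL: (0.0, None),
}

def flag_out_of_range(df, bounds):
    """Return a boolean frame marking values outside the given bounds.

    Missing values are never flagged, since comparisons with NaN are False.
    """
    flags = pd.DataFrame(False, index=df.index, columns=list(bounds))
    for col, (lower, upper) in bounds.items():
        if lower is not None:
            flags[col] |= df[col] < lower
        if upper is not None:
            flags[col] |= df[col] > upper
    return flags

# Toy rows echoing the extremes noted in the summary above
toy = pd.DataFrame({
    TEMP_COL: [70.0, 8805.0, np.nan],
    VOL_COL: [1.5, -2.303, 0.0],
})

flags = flag_out_of_range(toy, bounds)
print(flags.sum())
```

Flagging rather than overwriting keeps the original values available for review before any correction is chosen.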


In [442]:
# Count of missing values per column
missing_count = water_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(water_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())

                                         Missing Count  Missing Percentage
Year                                                 0            0.000000
Month                                                0            0.000000
Plant ID                                             0            0.000000
Cooling System\nID                                  48            0.033973
Type of Cooling System                             138            0.097674
Cooling System\nStatus                               3            0.002123
Hours in Service                                  1324            0.937100
Chlorine \n(thousand lbs)                        14245           10.082315
Diversion Rate \n(gallons per minute)            74837           52.968072
Withdrawal Rate \n(gallons per minute)           22028           15.590960
Discharge Rate \n(gallons per minute)            32969           23.334772
Consumption Rate \n(gallons per minute)          31641           22.394842
Intake Average\nTemperatu
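
The same missing-value summary is recomputed in several cells of this notebook. A small helper could remove the repetition; the sketch below assumes a hypothetical function name `summarize_missing` and uses made-up demo data.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Return per-column missing counts and percentages, most-missing first."""
    missing_count = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing Count": missing_count,
        "Missing Percentage": missing_count / len(df) * 100,
    })
    return summary.sort_values("Missing Count", ascending=False)

# Hypothetical demo data
demo = pd.DataFrame({
    "Plant ID": [3, 7, 8, 10],
    "Hours in Service": [744.0, np.nan, 720.0, np.nan],
})
print(summarize_missing(demo).to_string())
```

Returning a dataframe (instead of printing inside the function) also avoids reusing the name `missing_percentage` for different objects across cells.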

### Step 5) Before the deeper clean of water_df, check which of its plants also appear in power_plant_df, so both datasets can be filtered down to the relevant rows.

In [478]:
# Get unique Plant Ids from both dataframes
unique_water_plant_ids = set(water_df['Plant ID'].unique())
unique_power_plant_ids = set(pp_cleaned_df['Plant Id'].unique())

# Find the intersection of the two sets (matching Plant Ids)
matching_plant_ids = unique_water_plant_ids.intersection(unique_power_plant_ids)

# Count the number of matching Plant Ids
num_matching_ids = len(matching_plant_ids)

# Print results
print(f"Number of unique 'Plant Id' in water_df: {len(unique_water_plant_ids)}")
print(f"Number of unique 'Plant Id' in power_plant_df: {len(unique_power_plant_ids)}")
print(f"Number of matching 'Plant Id' between water_df and power_plant_df: {num_matching_ids}")


Number of unique 'Plant Id' in water_df: 918
Number of unique 'Plant Id' in power_plant_df: 13187
Number of matching 'Plant Id' between water_df and power_plant_df: 918


### Step 5a) All 918 plants in water_df also appear in power_plant_df, so power_plant_df can be filtered down to those plants. We export the filtered version as a CSV file.

In [480]:
# Get the unique Plant Ids from water_df
water_plant_ids = set(water_df['Plant ID'].unique())

# Filter power_plant_df to only include rows where Plant Id is in water_plant_ids
filtered_power_plant_df = pp_cleaned_df[pp_cleaned_df['Plant Id'].isin(water_plant_ids)]

# Verify the number of unique Plant Ids in the filtered dataframe
print(f"Number of unique Plant Ids in filtered_power_plant_df: {filtered_power_plant_df['Plant Id'].nunique()}")

# Optional: Check if all water_df Plant Ids are present in the filtered dataframe
all_present = set(filtered_power_plant_df['Plant Id'].unique()) == water_plant_ids
print(f"All water_df Plant Ids present in filtered_power_plant_df: {all_present}")

# Export the filtered dataframe to a CSV file
filtered_power_plant_df.to_csv('filtered_power_plant_df.csv', index=False)
print("Filtered power plant data exported to 'filtered_power_plant_df.csv'")


Number of unique Plant Ids in filtered_power_plant_df: 918
All water_df Plant Ids present in filtered_power_plant_df: True
Filtered power plant data exported to 'filtered_power_plant_df.csv'


In [488]:
import pandas as pd

# Calculate the missing percentage for each dataframe.
# The function is named missing_pct so that it is not overwritten by the
# `missing_percentage` Series that other cells assign.
def missing_pct(df):
    return (df.isnull().sum() / len(df)) * 100

missing_filtered = missing_pct(filtered_power_plant_df)
missing_cleaned = missing_pct(pp_cleaned_df)

# Combine the results into a single dataframe
comparison_df = pd.concat([missing_filtered, missing_cleaned], axis=1, keys=['Abridged', 'Full'])

# Sort by missing percentage in the abridged dataframe, breaking ties with the full one
comparison_df = comparison_df.sort_values(by=['Abridged', 'Full'], ascending=False)

# Display the results
print(comparison_df)

# Optionally, you can export this to a CSV file
#comparison_df.to_csv('missing_percentage_comparison.csv')


                                  Abridged       Full
AER\nFuel Type Code              20.945110  25.631077
NERC Region                       0.204717   3.419463
Plant Name                        0.000000   0.004583
Operator Name                     0.000000   0.004583
Operator Id                       0.000000   0.004583
Plant State                       0.000000   0.004583
Plant Id                          0.000000   0.000000
Combined Heat And\nPower Plant    0.000000   0.000000
Census Region                     0.000000   0.000000
NAICS Code                        0.000000   0.000000
EIA Sector Number                 0.000000   0.000000
Sector Name                       0.000000   0.000000
Reported\nPrime Mover             0.000000   0.000000
Reported\nFuel Type Code          0.000000   0.000000
Tot_MMBtu\nJanuary                0.000000   0.000000
Tot_MMBtu\nFebruary               0.000000   0.000000
Tot_MMBtu\nMarch                  0.000000   0.000000
Tot_MMBtu\nApril            

### Step 6) Back to water_df! First, can the missing 'State' values be filled in by cross-referencing filtered_power_plant_df?

In [498]:
water_df.columns

Index(['Year', 'Month', 'Plant ID', 'Cooling System\nID',
       'Type of Cooling System', 'Cooling System\nStatus', 'Hours in Service',
       'Chlorine \n(thousand lbs)', 'Diversion Rate \n(gallons per minute)',
       'Withdrawal Rate \n(gallons per minute)',
       'Discharge Rate \n(gallons per minute)',
       'Consumption Rate \n(gallons per minute)',
       'Intake Average\nTemperature (ºF)', 'Intake Maximum\nTemperature (ºF)',
       'Discharge Average\nTemperature (ºF)',
       'Discharge Maximum\nTemperature (ºF)',
       'Diversion Volume \n(million gallons)',
       'Withdrawal Volume \n(million gallons)',
       'Discharge Volume \n(million gallons)',
       'Consumption Volume \n(million gallons)', 'State'],
      dtype='object')

In [500]:
filtered_power_plant_df.columns

Index(['Plant Id', 'Combined Heat And\nPower Plant', 'Plant Name',
       'Operator Name', 'Operator Id', 'Plant State', 'Census Region',
       'NERC Region', 'NAICS Code', 'EIA Sector Number', 'Sector Name',
       'Reported\nPrime Mover', 'Reported\nFuel Type Code',
       'Tot_MMBtu\nJanuary', 'Tot_MMBtu\nFebruary', 'Tot_MMBtu\nMarch',
       'Tot_MMBtu\nApril', 'Tot_MMBtu\nMay', 'Tot_MMBtu\nJune',
       'Tot_MMBtu\nJuly', 'Tot_MMBtu\nAugust', 'Tot_MMBtu\nSeptember',
       'Tot_MMBtu\nOctober', 'Tot_MMBtu\nNovember', 'Tot_MMBtu\nDecember',
       'Elec_MMBtu\nJanuary', 'Elec_MMBtu\nFebruary', 'Elec_MMBtu\nMarch',
       'Elec_MMBtu\nApril', 'Elec_MMBtu\nMay', 'Elec_MMBtu\nJune',
       'Elec_MMBtu\nJuly', 'Elec_MMBtu\nAugust', 'Elec_MMBtu\nSeptember',
       'Elec_MMBtu\nOctober', 'Elec_MMBtu\nNovember', 'Elec_MMBtu\nDecember',
       'Netgen\nJanuary', 'Netgen\nFebruary', 'Netgen\nMarch', 'Netgen\nApril',
       'Netgen\nMay', 'Netgen\nJune', 'Netgen\nJuly', 'Netgen\nAugust',
  

In [506]:
# Get unique Plant IDs with missing State in water_df
missing_state_plant_ids = water_df.loc[water_df['State'].isna(), 'Plant ID'].unique()

# Check for matches in filtered_power_plant_df
state_matches = filtered_power_plant_df[filtered_power_plant_df['Plant Id'].isin(missing_state_plant_ids)][['Plant Id', 'Plant State']]

# Check results
print(f"Found {len(state_matches)} potential matches in filtered_power_plant_df")
print("Sample matches:")
print(state_matches.drop_duplicates().head(10))

# If matches are found, proceed with filling in the missing states
if not state_matches.empty:
    # Create a state mapping dictionary
    state_mapping = state_matches.dropna().drop_duplicates().set_index('Plant Id')['Plant State'].to_dict()
    
    # Fill missing states using the mapping (vectorized; existing states are left untouched)
    water_df['State'] = water_df['State'].fillna(water_df['Plant ID'].map(state_mapping))
    
    # Check remaining missing states
    remaining_missing = water_df['State'].isna().sum()
    print(f"\nRemaining missing states after update: {remaining_missing}")
else:
    print("\nNo matching Plant IDs from NaN 'State' found in filtered_power_plant_df")

# Print the number of unique states now present in water_df
print(f"\nNumber of unique states in water_df after update: {water_df['State'].nunique()}")


Found 1922 potential matches in filtered_power_plant_df
Sample matches:
     Plant Id Plant State
6           7          AL
9           8          AL
47         50          AL
49         51          LA
86         87          NM
138       127          TX
203       207          FL
254       271          CA
285       302          CA
308       330          CA

Remaining missing states after update: 0

Number of unique states in water_df after update: 84
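
The 84 unique values reported above are more than the 50 states, the District of Columbia, and the US territories combined, so the State column probably mixes label formats or contains non-state entries. The notebook does not show the values, so this is a guess. A quick check along the following lines could confirm it; the sample labels are hypothetical, and the real check would run on `water_df['State']`.

```python
import pandas as pd

# Hypothetical labels; replace with water_df['State'] in the notebook.
states = pd.Series(["AL", "al", "Alabama", "TX", " TX", "CA"])

# Normalise whitespace and case before counting.
normalised = states.str.strip().str.upper()

# Anything that is not a two-letter code needs a manual look or a lookup table.
not_two_letter = normalised[~normalised.str.fullmatch(r"[A-Z]{2}")]

print(states.nunique(), "raw labels,", normalised.nunique(), "after normalisation")
print(not_two_letter.unique())
```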


In [514]:
# Count of missing values per column
missing_count = water_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(water_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())

                                         Missing Count  Missing Percentage
Year                                                 0            0.000000
Month                                                0            0.000000
Plant ID                                             0            0.000000
Cooling System\nID                                  48            0.033973
Type of Cooling System                             138            0.097674
Cooling System\nStatus                               3            0.002123
Hours in Service                                  1324            0.937100
Chlorine \n(thousand lbs)                        14245           10.082315
Diversion Rate \n(gallons per minute)            74837           52.968072
Withdrawal Rate \n(gallons per minute)           22028           15.590960
Discharge Rate \n(gallons per minute)            32969           23.334772
Consumption Rate \n(gallons per minute)          31641           22.394842
Intake Average\nTemperatu

In [520]:
# List of columns to keep
columns_to_keep = [
    'Year', 'Month', 'Plant ID', 'Cooling System\nID', 'Type of Cooling System',
    'Cooling System\nStatus', 'Hours in Service', 'Chlorine \n(thousand lbs)',
    'Withdrawal Volume \n(million gallons)', 'Consumption Volume \n(million gallons)',
    'State'
]

# Create a new dataframe with only the selected columns
water_df_simplified = water_df[columns_to_keep]

# Verify the remaining columns
print("Columns in the simplified dataframe:")
print(water_df_simplified.columns.tolist())

# Check the missing data in the simplified dataframe
missing_data = water_df_simplified.isnull().sum()
missing_percentage = 100 * water_df_simplified.isnull().sum() / len(water_df_simplified)

missing_table = pd.concat([missing_data, missing_percentage], axis=1, keys=['Missing Count', 'Missing Percentage'])
print("\nMissing data in the simplified dataframe:")
print(missing_table)


Columns in the simplified dataframe:
['Year', 'Month', 'Plant ID', 'Cooling System\nID', 'Type of Cooling System', 'Cooling System\nStatus', 'Hours in Service', 'Chlorine \n(thousand lbs)', 'Withdrawal Volume \n(million gallons)', 'Consumption Volume \n(million gallons)', 'State']

Missing data in the simplified dataframe:
                                        Missing Count  Missing Percentage
Year                                                0            0.000000
Month                                               0            0.000000
Plant ID                                            0            0.000000
Cooling System\nID                                 48            0.033973
Type of Cooling System                            138            0.097674
Cooling System\nStatus                              3            0.002123
Hours in Service                                 1324            0.937100
Chlorine \n(thousand lbs)                       14245           10.082315
Withdrawa

## Step 7) Data Imputation Strategy for Water Metrics

### Imputation Approach

We employed a hierarchical imputation strategy for key water metrics to minimize data loss while maintaining data integrity. The approach was as follows:

#### Chlorine (thousand lbs):
- Imputed using the median value for the same plant, cooling system type, and month. In the code below, Chlorine runs through the same loop as the volume metrics, so the fallback steps listed next apply to it as well.

#### Withdrawal Volume (million gallons) and Consumption Volume (million gallons):
Given their crucial nature and higher missing percentages, a more sophisticated approach was used:

1. First attempt: Impute using the median for the same plant, cooling system type, and month (pooled across years).
2. Second attempt: If still missing, use the median for the same plant and month, regardless of cooling system type.
3. Third attempt: If still missing, use the overall median for that cooling system type and month across all plants.

### Handling Remaining Missing Values

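After imputation, 138 values remained missing in each of the three metrics. This count matches the 138 rows with no recorded Type of Cooling System, which suggests the gaps come from the missing grouping key (pandas' groupby skips rows whose key is missing by default) rather than from a lack of reference data. For these cases:

- Decision: Keep the missing values as is.
- Rationale: Without a cooling system type, these rows cannot be matched to a comparable group in the type-based steps.
- Implication: These missing values will be handled during analysis using methods that can accommodate missing data.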

### Note on Data Integrity

This imputation strategy aims to balance data completeness with accuracy. By using a hierarchical approach, we prioritize plant-specific and time-specific data where available, falling back to more general imputation only when necessary. The decision to retain some missing values ensures that we don't introduce potentially misleading data where no reliable basis for imputation exists.
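
The fallback sequence described above can be sketched so that existing values are never overwritten and rows with a missing grouping key are simply left untouched. This is a minimal illustration, not the notebook's actual cell: the helper names `fill_from_group_median` and `impute_hierarchically` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def fill_from_group_median(df, column, keys):
    """Fill only the missing entries of `column` with the median of their group."""
    # Rows whose key is NaN receive NaN here, so fillna leaves them unchanged.
    medians = df.groupby(keys)[column].transform("median")
    return df[column].fillna(medians)

def impute_hierarchically(df, column, key_levels):
    """Apply group-median fills from the most specific grouping to the most general."""
    out = df.copy()
    for keys in key_levels:
        out[column] = fill_from_group_median(out, column, keys)
    return out

# Hypothetical toy data using the notebook's column names.
col = "Withdrawal Volume \n(million gallons)"
toy = pd.DataFrame({
    "Plant ID": [1, 1, 1, 2, 2, 3],
    "Type of Cooling System": ["ON", "ON", "RC", "ON", np.nan, "RC"],
    "Month": [1, 1, 1, 1, 1, 1],
    col: [10.0, np.nan, np.nan, 30.0, 5.0, np.nan],
})

levels = [
    ["Plant ID", "Type of Cooling System", "Month"],
    ["Plant ID", "Month"],
    ["Type of Cooling System", "Month"],
]
result = impute_hierarchically(toy, col, levels)
print(result[col].tolist())  # [10.0, 10.0, 10.0, 30.0, 5.0, 10.0]
```

Using `fillna` on the original column, rather than assigning the transform result directly, means the row with a missing cooling system type keeps its recorded value of 5.0.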


In [524]:
# Create an EXPLICIT COPY of the filtered dataframe
water_df_simplified = water_df[columns_to_keep].copy()

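# Helper functions using .loc assignment.
# They are defined here for reference; the loop below does not call them.
def impute_mode_by_plant(df, column):
    # Fill gaps with the most frequent value recorded for the same plant
    df.loc[:, column] = df.groupby('Plant ID')[column].transform(
        lambda x: x.fillna(x.mode().iloc[0]) if not x.mode().empty else x
    )
    return df

def impute_median_by_plant_month_year(df, column):
    # Despite the name, this groups by plant and month only (all years pooled)
    df.loc[:, column] = df.groupby(['Plant ID', 'Month'])[column].transform(
        lambda x: x.fillna(x.median())
    )
    return df


# Three-step median imputation, from the most specific grouping to the most general.
# Caveat: groupby skips rows whose grouping key is NaN (dropna=True by default), so in
# recent pandas versions rows without a 'Type of Cooling System' come back as NaN from
# the first and final steps.
for col in ['Chlorine \n(thousand lbs)', 
            'Withdrawal Volume \n(million gallons)', 
            'Consumption Volume \n(million gallons)']:
    # First imputation: same plant, cooling system type, and month
    water_df_simplified.loc[:, col] = water_df_simplified.groupby(
        ['Plant ID', 'Type of Cooling System', 'Month']
    )[col].transform(lambda x: x.fillna(x.median()))
    
    # Second imputation: same plant and month
    water_df_simplified.loc[:, col] = water_df_simplified.groupby(
        ['Plant ID', 'Month']
    )[col].transform(lambda x: x.fillna(x.median()))
    
    # Final imputation: same cooling system type and month, across all plants
    water_df_simplified.loc[:, col] = water_df_simplified.groupby(
        ['Type of Cooling System', 'Month']
    )[col].transform(lambda x: x.fillna(x.median()))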

# Verify remaining missing values
print(water_df_simplified.isnull().sum())


Year                                         0
Month                                        0
Plant ID                                     0
Cooling System\nID                          48
Type of Cooling System                     138
Cooling System\nStatus                       3
Hours in Service                          1324
Chlorine \n(thousand lbs)                  138
Withdrawal Volume \n(million gallons)      138
Consumption Volume \n(million gallons)     138
State                                        0
dtype: int64


In [526]:
missing_data = water_df_simplified[water_df_simplified['Withdrawal Volume \n(million gallons)'].isnull()]
print(missing_data['Plant ID'].nunique())
print(missing_data['Plant ID'].value_counts())


14
Plant ID
991      54
56846    27
57331    24
10672    10
55098     7
350       4
54761     2
55518     2
55328     2
1599      2
1769      1
8226      1
6705      1
59913     1
Name: count, dtype: int64


In [532]:
print(missing_data['Month'].value_counts())
print(missing_data['Year'].value_counts())


Month
11    15
7     14
12    14
5     13
9     13
10    13
1     12
6     12
8     12
2      8
3      7
4      5
Name: count, dtype: int64
Year
2016    60
2017    39
2018    32
2023     7
Name: count, dtype: int64


In [535]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN'
output_file = 'water_df_StateCleaned_2015_2023.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
water_df_simplified.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")


Saving confirmed: 'water_df_StateCleaned_2015_2023.csv' has been created successfully.


### Step 8) Final: Cooling Boiler Generator Detail (2015-2023)

In [539]:
import pandas as pd
import os

# List of file paths
file_paths = [
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/Cooling_Boiler_Generator_Data_Detail_2023.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/Cooling_Boiler_Generator_Data_Detail_2022.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/Cooling_Boiler_Generator_Data_Detail_2021.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2020.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2019.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2018.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2017.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2016.xlsx',
    '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cooling_detail_2015.xlsx'
]

# Function to read Excel file
def read_excel(file_path):
    return pd.read_excel(file_path, header=2)  # header starts at row 3

# Read and merge all files
dfs = [read_excel(file) for file in file_paths]
merged_df = pd.concat(dfs, ignore_index=True)

# Print info about the merged dataframe
print(f"Shape of merged dataframe: {merged_df.shape}")
print("\nColumns in merged dataframe:")
print(merged_df.columns.tolist())
print("\nSample of merged dataframe:")
print(merged_df.head())

# Check for missing values
print("\nMissing values in each column:")
print(merged_df.isnull().sum())

# Save the merged dataframe to a CSV file
output_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/merged_cooling_data.csv'
merged_df.to_csv(output_path, index=False)
print(f"\nMerged data saved to: {output_path}")


  warn("""Cannot parse header or footer so it will be ignored""")


Shape of merged dataframe: (751524, 70)

Columns in merged dataframe:
['\n \n \n \n \n \nUtility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month', 'Generator ID', 'Boiler ID', 'Cooling ID', 'Generator Primary Technology', 'Summer Capacity of Steam Turbines (MW)', 'Gross Generation from Steam Turbines (MWh)', 'Net Generation from Steam Turbines (MWh)', 'Summer Capacity Associated with Single Shaft Combined Cycle Units (MW)', 'Gross Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Net Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Summer Capacity Associated with Combined Cycle Gas Turbines (MW)', 'Gross Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Net Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Fuel Consumption from All Fuel Types (MMBTU)', 'Fuel Consumption from Steam Turbines (MMBTU)', 'Fuel Consumption from Single Shaft Combined Cycle Units (MMBTU)', 'Fuel Consumption from Combined Cycle Gas Tu

In [540]:
# Count of missing values per column
missing_count = merged_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(merged_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())


                                                                          Missing Count  Missing Percentage
\n \n \n \n \n \nUtility ID                                                           0            0.000000
State                                                                                24            0.003194
Plant Code                                                                            0            0.000000
Plant Name                                                                           24            0.003194
Year                                                                                  0            0.000000
Month                                                                                 0            0.000000
Generator ID                                                                        360            0.047903
Boiler ID                                                                           144            0.019161
Cooling ID                  

In [543]:
# Define the threshold for removing columns (remove columns with more than 25% missing data)
missing_threshold = 25.0

# Calculate the percentage of missing values for each column
missing_percentage = merged_df.isnull().sum() / len(merged_df) * 100

# Filter out columns with missing percentage above the threshold
columns_to_keep = [col for col in merged_df.columns if missing_percentage[col] <= missing_threshold]

# Create a new dataframe with only the selected columns
cleaned_df = merged_df[columns_to_keep]

# Print information about the cleaned dataframe
print(f"Shape of cleaned dataframe: {cleaned_df.shape}")
print("\nColumns retained:")
print(cleaned_df.columns.tolist())

# Save the cleaned dataframe to a new CSV file
output_path_cleaned = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/Environmental/cleaned_cooling_data.csv'
cleaned_df.to_csv(output_path_cleaned, index=False)
print(f"\nCleaned data saved to: {output_path_cleaned}")


Shape of cleaned dataframe: (751524, 65)

Columns retained:
['\n \n \n \n \n \nUtility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month', 'Generator ID', 'Boiler ID', 'Cooling ID', 'Generator Primary Technology', 'Summer Capacity of Steam Turbines (MW)', 'Gross Generation from Steam Turbines (MWh)', 'Net Generation from Steam Turbines (MWh)', 'Summer Capacity Associated with Single Shaft Combined Cycle Units (MW)', 'Gross Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Net Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Summer Capacity Associated with Combined Cycle Gas Turbines (MW)', 'Gross Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Net Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Fuel Consumption from All Fuel Types (MMBTU)', 'Fuel Consumption from Steam Turbines (MMBTU)', 'Fuel Consumption from Single Shaft Combined Cycle Units (MMBTU)', 'Fuel Consumption from Combined Cycle Gas Turbines (MM

In [544]:
# Count of missing values per column
missing_count = cleaned_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(cleaned_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())


                                                                          Missing Count  Missing Percentage
\n \n \n \n \n \nUtility ID                                                           0            0.000000
State                                                                                24            0.003194
Plant Code                                                                            0            0.000000
Plant Name                                                                           24            0.003194
Year                                                                                  0            0.000000
Month                                                                                 0            0.000000
Generator ID                                                                        360            0.047903
Boiler ID                                                                           144            0.019161
Cooling ID                  

In [547]:
# Fix column names
cleaned_df.rename(columns={'\n \n \n \n \n \nUtility ID': 'Utility ID'}, inplace=True)

# List of columns to remove
columns_to_remove = [
    '860 Cooling Type 1', '923 Cooling Type', 'Generator Status',
    'Generator Inservice Month', 'Generator Inservice Year',
    'Generator Retirement Month', 'Generator Retirement Year',
    'Boiler Status', 'Boiler Inservice Month', 'Boiler Inservice Year',
    'Boiler Retirement Month', 'Boiler Retirement Year'
]

# Remove specified columns
cleaned_df_2 = cleaned_df.drop(columns=columns_to_remove, errors='ignore')

# Print information about the cleaned dataframe
print(f"Shape of cleaned dataframe: {cleaned_df_2.shape}")
print("\nColumns in cleaned dataframe:")
print(cleaned_df_2.columns.tolist())

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_df.rename(columns={'\n \n \n \n \n \nUtility ID': 'Utility ID'}, inplace=True)


Shape of cleaned dataframe: (751524, 53)

Columns in cleaned dataframe:
['Utility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month', 'Generator ID', 'Boiler ID', 'Cooling ID', 'Generator Primary Technology', 'Summer Capacity of Steam Turbines (MW)', 'Gross Generation from Steam Turbines (MWh)', 'Net Generation from Steam Turbines (MWh)', 'Summer Capacity Associated with Single Shaft Combined Cycle Units (MW)', 'Gross Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Net Generation Associated with Single Shaft Combined Cycle Units (MWh)', 'Summer Capacity Associated with Combined Cycle Gas Turbines (MW)', 'Gross Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Net Generation Associated with Combined Cycle Gas Turbines (MWh)', 'Fuel Consumption from All Fuel Types (MMBTU)', 'Fuel Consumption from Steam Turbines (MMBTU)', 'Fuel Consumption from Single Shaft Combined Cycle Units (MMBTU)', 'Fuel Consumption from Combined Cycle Gas Turbines (MMBTU)'
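
The SettingWithCopyWarning in the output above appears because `cleaned_df` was created by selecting columns from `merged_df` and was then renamed in place. One way to avoid it is to take an explicit copy and reassign the result of `rename`. The sketch below uses a hypothetical stand-in dataframe; stripping whitespace from every column name is an assumption that happens to fix the padded 'Utility ID' header.

```python
import pandas as pd

# Hypothetical stand-in for merged_df with the padded header from the Excel files.
merged_demo = pd.DataFrame({
    "\n \n \n \n \n \nUtility ID": [1, 2],
    "State": ["AL", "AL"],
    "Mostly Empty": [None, None],
})

# .copy() makes the selection an independent dataframe rather than a slice.
cleaned_demo = merged_demo[["\n \n \n \n \n \nUtility ID", "State"]].copy()

# Reassigning the result of rename avoids in-place modification altogether.
cleaned_demo = cleaned_demo.rename(columns=lambda name: name.strip())

print(cleaned_demo.columns.tolist())  # ['Utility ID', 'State']
```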

In [549]:
# List of columns to remove
columns_to_remove = [
    'Summer Capacity of Steam Turbines (MW)',
    'Gross Generation from Steam Turbines (MWh)',
    'Net Generation from Steam Turbines (MWh)',
    'Summer Capacity Associated with Single Shaft Combined Cycle Units (MW)',
    'Gross Generation Associated with Single Shaft Combined Cycle Units (MWh)',
    'Net Generation Associated with Single Shaft Combined Cycle Units (MWh)',
    'Summer Capacity Associated with Combined Cycle Gas Turbines (MW)',
    'Gross Generation Associated with Combined Cycle Gas Turbines (MWh)',
    'Net Generation Associated with Combined Cycle Gas Turbines (MWh)',
    'Fuel Consumption from Steam Turbines (MMBTU)',
    'Fuel Consumption from Single Shaft Combined Cycle Units (MMBTU)',
    'Fuel Consumption from Combined Cycle Gas Turbines (MMBTU)'
]

# Remove specified columns
cleaned_df_3 = cleaned_df_2.drop(columns=columns_to_remove, errors='ignore')

# Print information about the new dataframe
print(f"Shape of cleaned_df_3: {cleaned_df_3.shape}")
print("\nColumns in cleaned_df_3:")
print(cleaned_df_3.columns.tolist())

# Count of missing values per column
missing_count = cleaned_df_3.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(cleaned_df_3)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())



Shape of cleaned_df_3: (751524, 41)

Columns in cleaned_df_3:
['Utility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month', 'Generator ID', 'Boiler ID', 'Cooling ID', 'Generator Primary Technology', 'Fuel Consumption from All Fuel Types (MMBTU)', 'Coal Consumption (MMBTU)', 'Natural Gas Consumption (MMBTU)', 'Petroleum Consumption (MMBTU)', 'Biomass Consumption (MMBTU)', 'Other Gas Consumption (MMBTU)', 'Other Fuel Consumption (MMBTU)', 'Water Withdrawal Volume (Million Gallons)', 'Water Consumption Volume (Million Gallons)', 'Water Withdrawal Intensity Rate (Gallons / MWh)', 'Water Consumption Intensity Rate (Gallons / MWh)', 'Water Withdrawal Rate per Fuel Consumption (Gallons / MMBTU)', 'Water Consumption Rate per Fuel Consumption (Gallons / MMBTU)', 'Cooling Unit Hours in Service', 'Average Distance of Water Intake Below Water Surface (Feet)', 'Cooling System Type', 'Water Type', 'Water Source', 'Water Source Name', 'Cooling Status', 'Cooling Inservice Month', 'Cooling Inser

In [551]:
result = cleaned_df_3.groupby('Plant Code').agg({
    'Water Type': 'nunique',
    'Water Source': 'nunique',
    'Water Source Name': 'nunique'
})

print(result)

            Water Type  Water Source  Water Source Name
Plant Code                                             
3                    2             2                  1
7                    1             1                  1
8                    1             1                  1
10                   1             1                  1
26                   1             1                  1
...                ...           ...                ...
62949                0             0                  0
63931                0             0                  0
64020                1             1                  1
65284                1             1                  1
65285                1             1                  1

[1010 rows x 3 columns]


In [557]:
plant_code_3_rows = cleaned_df_3[cleaned_df_3['Plant Code'] == 3]
print(plant_code_3_rows.shape)

columns_to_show = ['Plant Code', 'Water Type', 'Water Source', 'Water Source Name']
print(plant_code_3_rows[columns_to_show])


(1908, 41)
        Plant Code Water Type Water Source Water Source Name
0                3      Fresh      Surface      Mobile River
1                3      Fresh      Surface      Mobile River
2                3      Fresh      Surface      Mobile River
3                3      Fresh      Surface      Mobile River
4                3      Fresh      Surface      Mobile River
...            ...        ...          ...               ...
668515           3  Reclaimed    Discharge      Mobile River
668516           3  Reclaimed    Discharge      Mobile River
668517           3  Reclaimed    Discharge      Mobile River
668518           3  Reclaimed    Discharge      Mobile River
668519           3  Reclaimed    Discharge      Mobile River

[1908 rows x 4 columns]


In [569]:
# Get the unique Plant Id values from filtered_power_plant_df
plant_ids = filtered_power_plant_df['Plant Id'].unique()

# Filter cleaned_df_3 to only include rows where Plant Code is in plant_ids
filtered_cleaned_df_3 = cleaned_df_3[cleaned_df_3['Plant Code'].isin(plant_ids)]

print(f"Number of rows in filtered_cleaned_df_3: {len(filtered_cleaned_df_3)}")

print(filtered_cleaned_df_3['Plant Code'].nunique())


Number of rows in filtered_cleaned_df_3: 726600
918


In [571]:
# List of columns to remove
columns_to_remove = [
    'Number Operable Generators',
    'Number Operable Boilers',
    'Number Operable Cooling Systems',
    'Relationship Type'
]

# Remove specified columns
filtered_cleaned_df_4 = filtered_cleaned_df_3.drop(columns=columns_to_remove, errors='ignore')

# Print information about the new dataframe
print(f"Shape of filtered_cleaned_df_4: {filtered_cleaned_df_4.shape}")
print("\nColumns in filtered_cleaned_df_4:")
print(filtered_cleaned_df_4.columns.tolist())

Shape of filtered_cleaned_df_4: (726600, 37)

Columns in filtered_cleaned_df_4:
['Utility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month', 'Generator ID', 'Boiler ID', 'Cooling ID', 'Generator Primary Technology', 'Fuel Consumption from All Fuel Types (MMBTU)', 'Coal Consumption (MMBTU)', 'Natural Gas Consumption (MMBTU)', 'Petroleum Consumption (MMBTU)', 'Biomass Consumption (MMBTU)', 'Other Gas Consumption (MMBTU)', 'Other Fuel Consumption (MMBTU)', 'Water Withdrawal Volume (Million Gallons)', 'Water Consumption Volume (Million Gallons)', 'Water Withdrawal Intensity Rate (Gallons / MWh)', 'Water Consumption Intensity Rate (Gallons / MWh)', 'Water Withdrawal Rate per Fuel Consumption (Gallons / MMBTU)', 'Water Consumption Rate per Fuel Consumption (Gallons / MMBTU)', 'Cooling Unit Hours in Service', 'Average Distance of Water Intake Below Water Surface (Feet)', 'Cooling System Type', 'Water Type', 'Water Source', 'Water Source Name', 'Cooling Status', 'Cooling Inservice Mont

In [573]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN'
output_file = 'CoolingBoilerDetail_PlantCodeMatch.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
filtered_cleaned_df_4.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'CoolingBoilerDetail_PlantCodeMatch.csv' has been created successfully.


In [582]:
# Set the correct file path
file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/CoolingBoilerDetail_PlantCodeMatch.csv'

# Check if the file exists
if os.path.exists(file_path):
    # Re-import the CSV (no dtypes are specified here, so pandas infers them)
    cooling_boiler_df = pd.read_csv(file_path)
    # Display dtypes
    print(cooling_boiler_df.dtypes.to_string())
else:
    print(f"File not found: {file_path}")



Utility ID                                                         int64
State                                                             object
Plant Code                                                         int64
Plant Name                                                        object
Year                                                               int64
Month                                                              int64
Generator ID                                                      object
Boiler ID                                                         object
Cooling ID                                                        object
Generator Primary Technology                                      object
Fuel Consumption from All Fuel Types (MMBTU)                      object
Coal Consumption (MMBTU)                                          object
Natural Gas Consumption (MMBTU)                                   object
Petroleum Consumption (MMBTU)                      

In [584]:
# Count of missing values per column
missing_count = cooling_boiler_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(cooling_boiler_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())


                                                               Missing Count  Missing Percentage
Utility ID                                                                 0            0.000000
State                                                                      0            0.000000
Plant Code                                                                 0            0.000000
Plant Name                                                                 0            0.000000
Year                                                                       0            0.000000
Month                                                                      0            0.000000
Generator ID                                                             264            0.036334
Boiler ID                                                                 72            0.009909
Cooling ID                                                                 0            0.000000
Generator Primary Technology  

In [586]:
cooling_boiler_df.head()

Unnamed: 0,Utility ID,State,Plant Code,Plant Name,Year,Month,Generator ID,Boiler ID,Cooling ID,Generator Primary Technology,Fuel Consumption from All Fuel Types (MMBTU),Coal Consumption (MMBTU),Natural Gas Consumption (MMBTU),Petroleum Consumption (MMBTU),Biomass Consumption (MMBTU),Other Gas Consumption (MMBTU),Other Fuel Consumption (MMBTU),Water Withdrawal Volume (Million Gallons),Water Consumption Volume (Million Gallons),Water Withdrawal Intensity Rate (Gallons / MWh),Water Consumption Intensity Rate (Gallons / MWh),Water Withdrawal Rate per Fuel Consumption (Gallons / MMBTU),Water Consumption Rate per Fuel Consumption (Gallons / MMBTU),Cooling Unit Hours in Service,Average Distance of Water Intake Below Water Surface (Feet),Cooling System Type,Water Type,Water Source,Water Source Name,Cooling Status,Cooling Inservice Month,Cooling Inservice Year,Combined Heat and Power Generator?,Generator Primary Energy Source Code,Generator Prime Mover Code,Sector,Steam Plant Type
0,195,AL,3,Barry,2023,1,3,3,1-3,Conventional Steam Coal,,,,,,,,96.72,0,,,,,744,10,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
1,195,AL,3,Barry,2023,2,3,3,1-3,Conventional Steam Coal,,,,,,,,96.72,0,,,,,672,10,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
2,195,AL,3,Barry,2023,3,3,3,1-3,Conventional Steam Coal,,,,,,,,563.45,0,,,,,744,10,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
3,195,AL,3,Barry,2023,4,3,3,1-3,Conventional Steam Coal,,,,,,,,474.9,0,,,,,720,10,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
4,195,AL,3,Barry,2023,5,3,3,1-3,Conventional Steam Coal,,,,,,,,2083.262,0,,,,,744,10,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0


In [590]:
cooling_boiler_df.columns

Index(['Utility ID', 'State', 'Plant Code', 'Plant Name', 'Year', 'Month',
       'Generator ID', 'Boiler ID', 'Cooling ID',
       'Generator Primary Technology',
       'Fuel Consumption from All Fuel Types (MMBTU)',
       'Coal Consumption (MMBTU)', 'Natural Gas Consumption (MMBTU)',
       'Petroleum Consumption (MMBTU)', 'Biomass Consumption (MMBTU)',
       'Other Gas Consumption (MMBTU)', 'Other Fuel Consumption (MMBTU)',
       'Water Withdrawal Volume (Million Gallons)',
       'Water Consumption Volume (Million Gallons)',
       'Water Withdrawal Intensity Rate (Gallons / MWh)',
       'Water Consumption Intensity Rate (Gallons / MWh)',
       'Water Withdrawal Rate per Fuel Consumption (Gallons / MMBTU)',
       'Water Consumption Rate per Fuel Consumption (Gallons / MMBTU)',
       'Cooling Unit Hours in Service',
       'Average Distance of Water Intake Below Water Surface (Feet)',
       'Cooling System Type', 'Water Type', 'Water Source',
       'Water Source Name', 'Co

In [594]:
cooling_boiler_df['Generator Primary Energy Source Code'].value_counts(dropna=False)

Generator Primary Energy Source Code
NG     502188
BIT     88428
SUB     54288
BLQ     20172
NUC     14520
RFO     10956
RC      10140
BFG      8748
OG       3564
WDS      2508
LIG      2316
DFO      1752
SUN      1620
AB       1296
PC        864
MSW       864
SGC       792
WC        432
SGP       300
NaN       264
WO        216
LFG       168
WH         84
TDF        72
OBS        48
Name: count, dtype: int64

In [596]:
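# Drop per-fuel consumption breakdowns, derived intensity/rate metrics, and cooling-unit detail columns
columns_to_remove = [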
    'Coal Consumption (MMBTU)', 'Natural Gas Consumption (MMBTU)',
    'Petroleum Consumption (MMBTU)', 'Biomass Consumption (MMBTU)',
    'Other Gas Consumption (MMBTU)', 'Other Fuel Consumption (MMBTU)',
    'Water Withdrawal Intensity Rate (Gallons / MWh)',
    'Water Consumption Intensity Rate (Gallons / MWh)',
    'Water Withdrawal Rate per Fuel Consumption (Gallons / MMBTU)',
    'Water Consumption Rate per Fuel Consumption (Gallons / MMBTU)',
    'Cooling Unit Hours in Service',
    'Average Distance of Water Intake Below Water Surface (Feet)'
]

cooling_boiler_df = cooling_boiler_df.drop(columns=columns_to_remove)


In [600]:
cooling_boiler_df.head()

Unnamed: 0,Utility ID,State,Plant Code,Plant Name,Year,Month,Generator ID,Boiler ID,Cooling ID,Generator Primary Technology,Fuel Consumption from All Fuel Types (MMBTU),Water Withdrawal Volume (Million Gallons),Water Consumption Volume (Million Gallons),Cooling System Type,Water Type,Water Source,Water Source Name,Cooling Status,Cooling Inservice Month,Cooling Inservice Year,Combined Heat and Power Generator?,Generator Primary Energy Source Code,Generator Prime Mover Code,Sector,Steam Plant Type
0,195,AL,3,Barry,2023,1,3,3,1-3,Conventional Steam Coal,,96.72,0,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
1,195,AL,3,Barry,2023,2,3,3,1-3,Conventional Steam Coal,,96.72,0,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
2,195,AL,3,Barry,2023,3,3,3,1-3,Conventional Steam Coal,,563.45,0,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
3,195,AL,3,Barry,2023,4,3,3,1-3,Conventional Steam Coal,,474.9,0,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0
4,195,AL,3,Barry,2023,5,3,3,1-3,Conventional Steam Coal,,2083.262,0,Open,Fresh,Surface,Mobile River,OP,2,1954,N,BIT,ST,Electric Utility,1.0


In [602]:
# Count of missing values per column
missing_count = cooling_boiler_df.isnull().sum()

# Percentage of missing values per column
missing_percentage = (missing_count / len(cooling_boiler_df)) * 100

# Combine into a single DataFrame for better readability
missing_summary = pd.DataFrame({
    'Missing Count': missing_count,
    'Missing Percentage': missing_percentage
})

# Display the summary
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(missing_summary.to_string())


                                              Missing Count  Missing Percentage
Utility ID                                                0            0.000000
State                                                     0            0.000000
Plant Code                                                0            0.000000
Plant Name                                                0            0.000000
Year                                                      0            0.000000
Month                                                     0            0.000000
Generator ID                                            264            0.036334
Boiler ID                                                72            0.009909
Cooling ID                                                0            0.000000
Generator Primary Technology                            264            0.036334
Fuel Consumption from All Fuel Types (MMBTU)          22693            3.123176
Water Withdrawal Volume (Million Gallons

## Handling Missing Data in Key Metrics

Because we currently lack the domain knowledge to impute these values reliably, we forgo imputation for the following metrics:

- Fuel Consumption from All Fuel Types (MMBTU)
- Water Withdrawal Volume (Million Gallons)
- Water Consumption Volume (Million Gallons)

Instead, we consider the following approaches:

1. **Analyze complete cases**: 
   - Focus analysis on the rows where these values are not missing.

2. **Separate analysis**: 
   - Conduct analyses that don't require these specific fields.
   - Perform a separate analysis for the subset of data where these fields are available.

3. **Flagging**: 
   - Create a flag column for each of these metrics to indicate whether the value was originally missing.
   - This allows us to include all data in our analysis while maintaining transparency about missing values.

None of these strategies alters the recorded values, so the dataset stays faithful to the source while remaining usable for analysis.
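
The flagging and complete-case approaches listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: `toy_df` is a made-up stand-in for `cooling_boiler_df`, and the helper `add_missing_flags` and the `" (Missing)"` suffix are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cooling_boiler_df; the values are invented.
toy_df = pd.DataFrame({
    "Plant Code": [3, 3, 7, 8],
    "Fuel Consumption from All Fuel Types (MMBTU)": [np.nan, 120.5, 98.0, 50.0],
    "Water Withdrawal Volume (Million Gallons)": [96.72, np.nan, 12.0, 5.5],
    "Water Consumption Volume (Million Gallons)": [0.0, 1.2, 3.4, 0.4],
})

# The three metrics that are left un-imputed
key_metrics = [
    "Fuel Consumption from All Fuel Types (MMBTU)",
    "Water Withdrawal Volume (Million Gallons)",
    "Water Consumption Volume (Million Gallons)",
]


def add_missing_flags(df, columns, suffix=" (Missing)"):
    """Return a copy of df with one boolean flag column per metric (hypothetical helper)."""
    flagged = df.copy()
    for col in columns:
        # True where the original value is missing; the metric itself is left untouched
        flagged[col + suffix] = flagged[col].isna()
    return flagged


# Flagging: keep every row, but record which values were missing
flagged_df = add_missing_flags(toy_df, key_metrics)

# Complete cases: keep only rows where all three metrics are present
complete_cases_df = flagged_df.dropna(subset=key_metrics)

print(flagged_df[[col + " (Missing)" for col in key_metrics]].sum())
print(f"Complete cases: {len(complete_cases_df)} of {len(flagged_df)} rows")
```

Because the flags are stored in separate columns, the same frame can serve both the full-data analyses and the complete-case subset without modifying any recorded metric.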


In [605]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN'
output_file = 'CoolingBoilerDetail_PlantCodeMatch.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
cooling_boiler_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'CoolingBoilerDetail_PlantCodeMatch.csv' has been created successfully.


In [611]:
cooling_boiler_df.dtypes

Utility ID                                        int64
State                                            object
Plant Code                                        int64
Plant Name                                       object
Year                                              int64
Month                                             int64
Generator ID                                     object
Boiler ID                                        object
Cooling ID                                       object
Generator Primary Technology                     object
Fuel Consumption from All Fuel Types (MMBTU)     object
Water Withdrawal Volume (Million Gallons)        object
Water Consumption Volume (Million Gallons)       object
Cooling System Type                              object
Water Type                                       object
Water Source                                     object
Water Source Name                                object
Cooling Status                                  

In [617]:
import pandas as pd

# Define the data types for specific columns
dtype_dict = {
    'Utility ID': 'int64',
    'State': 'object',
    'Plant Code': 'int64',
    'Plant Name': 'object',
    'Year': 'int64',
    'Month': 'int64',
    'Generator ID': 'object',
    'Boiler ID': 'object',
    'Cooling ID': 'object',
    'Generator Primary Technology': 'object',
    'Fuel Consumption from All Fuel Types (MMBTU)': 'float64',
    'Water Withdrawal Volume (Million Gallons)': 'float64',
    'Water Consumption Volume (Million Gallons)': 'float64',
    'Cooling System Type': 'object',
    'Water Type': 'object',
    'Water Source': 'object',
    'Water Source Name': 'object',
    'Cooling Status': 'object',
    'Cooling Inservice Month': 'object',
    'Cooling Inservice Year': 'object',
    'Combined Heat and Power Generator?': 'object',
    'Generator Primary Energy Source Code': 'object',
    'Generator Prime Mover Code': 'object',
    'Sector': 'object',
    'Steam Plant Type': 'float64'
}

# Import the CSV with specified data types
file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/CoolingBoilerDetail_PlantCodeMatch.csv'
cooling_boiler_df = pd.read_csv(file_path, dtype=dtype_dict, na_values=[' '])



# Verify the data types
print(cooling_boiler_df.dtypes)


Utility ID                                        int64
State                                            object
Plant Code                                        int64
Plant Name                                       object
Year                                              int64
Month                                             int64
Generator ID                                     object
Boiler ID                                        object
Cooling ID                                       object
Generator Primary Technology                     object
Fuel Consumption from All Fuel Types (MMBTU)    float64
Water Withdrawal Volume (Million Gallons)       float64
Water Consumption Volume (Million Gallons)      float64
Cooling System Type                              object
Water Type                                       object
Water Source                                     object
Water Source Name                                object
Cooling Status                                  

In [619]:
# Export data to CSV
path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN'
output_file = 'CoolingBoilerDetail_PlantCodeMatch.csv'
full_path = os.path.join(path, output_file)

# Save the dataframe
cooling_boiler_df.to_csv(full_path, index=False)

# Check if the file was created successfully
if os.path.exists(full_path):
    print(f"Saving confirmed: '{output_file}' has been created successfully.")
else:
    print("Error: File was not saved.")

Saving confirmed: 'CoolingBoilerDetail_PlantCodeMatch.csv' has been created successfully.


# Final Notes

In [636]:
# Set the correct file paths
water_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/water_df_StateCleaned_2015_2023.csv'
power_plant_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/filtered_power_plant_df.csv'
cooling_boiler_file_path = '/Users/amyzhang/Desktop/A6_Dashboard/Datasets/CF_6.1_Vers2_DataWrangle/Vers2_CLEAN/CoolingBoilerDetail_PlantCodeMatch.csv'

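# Function to import CSV and display dtypes
def import_csv(file_path, df_name):
    """Read the CSV at file_path, print its dtypes under df_name, and return the DataFrame (None if the file is missing)."""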
    if os.path.exists(file_path):
        df = pd.read_csv(file_path)
        print(f"\nDtypes for {df_name}:")
        print(df.dtypes.to_string())
        return df
    else:
        print(f"File not found: {file_path}")
        return None

# Import the CSV files
water_df = import_csv(water_file_path, 'water_df')
power_plant_df = import_csv(power_plant_file_path, 'power_plant_df')
cooling_boiler_df = import_csv(cooling_boiler_file_path, 'cooling_boiler_df')


Dtypes for water_df:
Year                                        int64
Month                                       int64
Plant ID                                    int64
Cooling System\nID                         object
Type of Cooling System                     object
Cooling System\nStatus                     object
Hours in Service                          float64
Chlorine \n(thousand lbs)                 float64
Withdrawal Volume \n(million gallons)     float64
Consumption Volume \n(million gallons)    float64
State                                      object

Dtypes for power_plant_df:
Plant Id                             int64
Combined Heat And\nPower Plant        bool
Plant Name                          object
Operator Name                       object
Operator Id                        float64
Plant State                         object
Census Region                       object
NERC Region                         object
NAICS Code                           int64
EIA Sector Nu