Notebook 2: Core Data Cleaning & Selection
* Goal: Clean and filter the raw data from Notebook 1 into a pristine, usable format.

Block 1: Select Relevant Columns

The raw data has 29 columns. We keep only the 11 that are used across the 8-notebook analysis, which makes the DataFrame smaller and easier to manage.

In [12]:
# Define the columns we want to keep
columns_to_keep = [
    'Country',
    'Reactor name',
    'Status',
    'Type',
    'Thermal Capacity, MWt',
    'Year',
    'Electricity Supplied, GW.h',
    'Grid Connection Date',
    'Construction Start Date',
    'Commercial Operation Date',
    'First Critically Date'
]

# Filter the DataFrame to keep only these columns
df_clean = df_raw[columns_to_keep].copy()

print(f"Selected {len(columns_to_keep)} relevant columns.")
print(df_clean.head())

Selected 11 relevant columns.
     Country Reactor name       Status  Type  Thermal Capacity, MWt    Year  \
0  ARGENTINA     ATUCHA-1  Operational  PHWR                   1179  1974.0   
1  ARGENTINA     ATUCHA-1  Operational  PHWR                   1179  1975.0   
2  ARGENTINA     ATUCHA-1  Operational  PHWR                   1179  1976.0   
3  ARGENTINA     ATUCHA-1  Operational  PHWR                   1179  1977.0   
4  ARGENTINA     ATUCHA-1  Operational  PHWR                   1179  1978.0   

  Electricity Supplied, GW.h Grid Connection Date Construction Start Date  \
0                      947.5         19 Mar, 1974            31 May, 1968   
1                     2357.8         19 Mar, 1974            31 May, 1968   
2                     2408.6         19 Mar, 1974            31 May, 1968   
3                       1537         19 Mar, 1974            31 May, 1968   
4                    2711.81         19 Mar, 1974            31 May, 1968   

  Commercial Operation Date Firs
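
Selecting with a list of names raises a KeyError if any name is misspelled or absent from the raw export. The sketch below shows one way to check the list before selecting. It uses a small made-up frame; `df_raw_demo` and `wanted` are illustrative names, not part of this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_raw, using a few of the real column names
df_raw_demo = pd.DataFrame({
    "Country": ["ARGENTINA"],
    "Reactor name": ["ATUCHA-1"],
    "Year": [1974.0],
    "Unused column": ["x"],
})

# 'Status' is deliberately absent from the demo frame
wanted = ["Country", "Reactor name", "Year", "Status"]

# Split the requested names into those present and those missing
missing = [c for c in wanted if c not in df_raw_demo.columns]
present = [c for c in wanted if c in df_raw_demo.columns]

# Select only the columns that exist, and copy to avoid chained-assignment warnings
df_demo = df_raw_demo[present].copy()

print("Missing columns:", missing)
print("Kept columns:", list(df_demo.columns))
```

Whether to continue with the columns that exist or stop with an error is a design choice. For a fixed dataset like this one, failing early is probably the safer option.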

Block 2: Clean Key Numeric Columns

This is a critical step. Year is currently stored as a float (1974.0), so we convert it to an integer. Electricity Supplied, GW.h is currently text (object), so we convert it to a number (float). We also fill any gaps in Thermal Capacity, MWt.

In [13]:
# 1. Clean 'Year' column
# Fill any missing years with 0 (a placeholder, not a real year), then convert to a nullable Integer
df_clean['Year'] = df_clean['Year'].fillna(0).astype('Int64')

# 2. Clean 'Electricity Supplied, GW.h' column
# Some entries are not valid numbers; errors='coerce' turns them into NaN instead of raising an error.
df_clean['Electricity Supplied, GW.h'] = pd.to_numeric(df_clean['Electricity Supplied, GW.h'], errors='coerce')
# Now, fill any resulting NaNs with 0 (this assumes unreported generation means zero generation)
df_clean['Electricity Supplied, GW.h'] = df_clean['Electricity Supplied, GW.h'].fillna(0)

# 3. Clean 'Thermal Capacity, MWt'
# This is already a number, but we'll fill NaNs with 0 just in case.
df_clean['Thermal Capacity, MWt'] = df_clean['Thermal Capacity, MWt'].fillna(0)

print("Cleaned 'Year', 'Electricity Supplied', and 'Thermal Capacity' columns.")

Cleaned 'Year', 'Electricity Supplied', and 'Thermal Capacity' columns.
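
Because `errors='coerce'` silently replaces bad values, it can be worth counting how many were coerced before filling them with 0. This is a minimal sketch on a made-up Series; the values and variable names are illustrative, not taken from the dataset.

```python
import pandas as pd

# Made-up sample: three valid numbers, one junk string, one missing value
s = pd.Series(["947.5", "2357.8", "N/A", None, "1537"])

# Count values that were already missing before conversion
n_missing_before = int(s.isna().sum())

# Unparseable strings become NaN instead of raising an error
numeric = pd.to_numeric(s, errors="coerce")

# Any increase in the missing count is due to coercion
n_coerced = int(numeric.isna().sum()) - n_missing_before

# Same fill step as in the notebook
filled = numeric.fillna(0)

print(f"Values coerced to NaN: {n_coerced}")
print(filled.tolist())
```

If the coerced count is large, the zeros added by `fillna(0)` could noticeably pull down totals and averages in later notebooks.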


Block 3: Clean Date Columns

To perform any time-based analysis (like calculating fleet age), we must convert all date columns from text to proper datetime objects.

In [14]:
# Define all date columns to convert
date_cols = ['Grid Connection Date', 'Construction Start Date',
             'Commercial Operation Date', 'First Critically Date']

for col in date_cols:
    if col in df_clean.columns:
        # format='%d %b, %Y' matches text like "31 May, 1968"; values that do not match become NaT
        df_clean[col] = pd.to_datetime(df_clean[col], format='%d %b, %Y', errors='coerce')

print("Converted all date columns to datetime objects.")

Converted all date columns to datetime objects.
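
Once the columns are real datetimes, date arithmetic works directly. The sketch below uses a made-up two-row frame to parse with the same format string and compute the days between construction start and grid connection. `demo` and `build_days` are hypothetical names, and it assumes the default English month abbreviations.

```python
import pandas as pd

# Made-up sample: one complete row and one row with an unparseable date
demo = pd.DataFrame({
    "Construction Start Date": ["31 May, 1968", "not available"],
    "Grid Connection Date": ["19 Mar, 1974", "19 Mar, 1974"],
})

# Same conversion as in the notebook; unparseable text becomes NaT
for col in demo.columns:
    demo[col] = pd.to_datetime(demo[col], format="%d %b, %Y", errors="coerce")

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
# Rows with a NaT on either side yield NaN.
demo["build_days"] = (
    demo["Grid Connection Date"] - demo["Construction Start Date"]
).dt.days

print(demo)
```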


Block 4: Final Inspection and Save

Finally, let's inspect the cleaned DataFrame with .info() to confirm that every column has the expected data type. Then we'll save it to a CSV file, which is the file the rest of the project will use.

In [15]:
print("\n--- Final Cleaned DataFrame Info ---")
df_clean.info()

print("\n--- Final Cleaned DataFrame Head ---")
print(df_clean.head())

# Save the clean data to a new CSV file
output_file = 'pris_clean.csv'
df_clean.to_csv(output_file, index=False)

print(f"\nNotebook 2 complete. Clean data saved to: {output_file}")


--- Final Cleaned DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20830 entries, 0 to 20829
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Country                     20830 non-null  object        
 1   Reactor name                20830 non-null  object        
 2   Status                      20830 non-null  object        
 3   Type                        20830 non-null  object        
 4   Thermal Capacity, MWt       20830 non-null  int64         
 5   Year                        20830 non-null  Int64         
 6   Electricity Supplied, GW.h  20830 non-null  float64       
 7   Grid Connection Date        20737 non-null  datetime64[ns]
 8   Construction Start Date     20830 non-null  datetime64[ns]
 9   Commercial Operation Date   20655 non-null  datetime64[ns]
 10  First Critically Date       20738 non-null  datetime64[ns]
dtypes: Int64(1), dat
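
A CSV file stores no dtype information, so the datetime and Int64 columns have to be restored when the file is loaded again. Here is a minimal sketch of that round trip. It uses a made-up two-row frame and an in-memory buffer in place of `pris_clean.csv`; how later notebooks actually load the file is not shown here, so treat this as one possible approach.

```python
import io
import pandas as pd

# Made-up frame with the same kinds of dtypes as the cleaned data
df = pd.DataFrame({
    "Year": pd.array([1974, 1975], dtype="Int64"),
    "Grid Connection Date": pd.to_datetime(["1974-03-19", None]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# A plain read_csv does not bring the datetime dtype back
buffer.seek(0)
plain = pd.read_csv(buffer)

# Requesting the dtypes explicitly restores them
buffer.seek(0)
reloaded = pd.read_csv(
    buffer,
    parse_dates=["Grid Connection Date"],
    dtype={"Year": "Int64"},
)

print(plain.dtypes)
print(reloaded.dtypes)
```

A binary format such as Parquet keeps dtypes without this extra step, but it requires an additional dependency.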