In [1]:
import pandas as pd
import numpy as np

We use pandas for handling tabular data and numpy for numerical operations. Both are standard tools in data science, so the steps that follow are easy for others to rerun and read.


In [2]:
df_flood_raw = pd.read_csv("C:/Users/SHANIA/Downloads/Metro-Manila-Flood-Insights/Data/AEGISDataset.csv")

We load the raw dataset into memory. This is the starting point for all analysis. The raw file on disk is never overwritten, so we always have a reference baseline to go back to.
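The absolute path in the cell above only resolves on one machine. A minimal sketch of a more portable loader follows, assuming the CSV sits in a project-relative `Data` folder. The helper name `load_flood_csv` and the `EXPECTED_COLUMNS` list are hypothetical. The column names are the five this notebook reports.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Column names as reported in this notebook (two of them appear abbreviated).
EXPECTED_COLUMNS = ["lat", "lon", "flood_heig", "elevation", "precipitat"]


def load_flood_csv(csv_path):
    """Hypothetical helper: read the CSV and fail early if a column is missing."""
    df = pd.read_csv(Path(csv_path))
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Demo on a throwaway file so the sketch runs anywhere. In the project the
# argument would be something like Path("Data") / "AEGISDataset.csv" (assumed layout).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text(
        "lat,lon,flood_heig,elevation,precipitat\n"
        "14.64,121.05,0,54.5,9.0\n"
    )
    demo_df = load_flood_csv(demo_path)

print(demo_df.shape)
```

Failing early on a missing column makes a wrong or outdated file obvious at load time rather than several cells later.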

In [3]:
# Missing values
print("Missing values per column:\n", df_flood_raw.isnull().sum())

# Duplicates
duplicates_count = df_flood_raw.duplicated().sum()
print(f"Duplicate rows: {duplicates_count}")

# Drop duplicates if any
if duplicates_count > 0:
    df_flood_raw = df_flood_raw.drop_duplicates().reset_index(drop=True)


Missing values per column:
 lat           0
lon           0
flood_heig    0
elevation     0
precipitat    0
dtype: int64
Duplicate rows: 2


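Missing values check: Confirms there are no gaps in the critical variables (lat, lon, flood_heig for flood height, elevation, and precipitat for precipitation). The output above reports zero missing values in every column.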

Duplicate check: Prevents double-counting of locations, which could distort analysis.

Dropping duplicates: Keeps the dataset clean, so repeated records do not get extra weight in later modeling.
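The three checks can be tried on a small made-up table. This is only a sketch: the values in `toy` are invented for illustration and are not taken from the AEGIS dataset.

```python
import numpy as np
import pandas as pd

# Invented rows: row 1 repeats row 0, and the last row has a missing longitude.
toy = pd.DataFrame(
    {
        "lat": [14.60, 14.60, 14.70, 14.80],
        "lon": [121.00, 121.00, 121.10, np.nan],
        "flood_heig": [1, 1, 0, 2],
    }
)

# Count of missing values in each column.
missing_per_column = toy.isnull().sum()

# duplicated() flags only the second and later copies of a row by default,
# so the sum is the number of rows that drop_duplicates() will remove.
duplicate_count = int(toy.duplicated().sum())

# Drop the repeats and renumber the index from 0.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(missing_per_column.to_dict())
print(duplicate_count)
print(len(deduplicated))
```

Note that `duplicated()` compares entire rows. Two records at the same location with different readings would both be kept.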

In [4]:
df_flood_processed = df_flood_raw.copy()

# Elevation bins
df_flood_processed['elevation_bin'] = pd.cut(
    df_flood_processed['elevation'],
    bins=[-1,5,15,30,50,100],
    labels=['Very Low','Low','Moderate','High','Very High']
)

# Precipitation bins
df_flood_processed['precip_bin'] = pd.cut(
    df_flood_processed['precipitat'],
    bins=[-1,5,10,15,20,25],
    labels=['Very Low','Low','Moderate','High','Very High']
)

# Simple Flood Risk Index
df_flood_processed['flood_risk_index'] = (
    df_flood_processed['flood_heig'] +
    (1 - df_flood_processed['elevation']/df_flood_processed['elevation'].max()) +
    df_flood_processed['precipitat']/df_flood_processed['precipitat'].max()
)


Elevation bins: Grouping elevation into categories makes patterns easier to interpret (e.g., lowland vs upland).

Precipitation bins: Same logic; rainfall ranges are easier to compare when grouped.
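Two details of `pd.cut` are worth keeping in mind with these bin edges. By default each interval is closed on the right, and any value outside the outermost edges becomes NaN instead of raising an error. The sketch below uses the elevation edges from the cell above on invented values.

```python
import pandas as pd

# Invented elevations, including edge values and one above the top edge.
elevation = pd.Series([0.0, 5.0, 5.1, 54.6, 100.0, 120.0])
bins = [-1, 5, 15, 30, 50, 100]
labels = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Intervals are (-1, 5], (5, 15], (15, 30], (30, 50], (50, 100].
binned = pd.cut(elevation, bins=bins, labels=labels)

# Values outside (-1, 100] get no label and show up as NaN.
out_of_range = int(binned.isna().sum())

print(binned.tolist())
print(out_of_range)
```

Counting NaN values in the bin columns after binning is a quick way to confirm that the chosen edges cover the full range of the data.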
 
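Flood Risk Index: A simple combined score that adds three terms: the raw flood height, an inverted elevation term (one minus elevation divided by the column maximum, so lower ground scores higher), and precipitation divided by its column maximum. Because flood height is not scaled, it can outweigh the other two terms whenever it exceeds 1. This is not a “true” model output but a BI indicator to summarize risk. It helps visualize hotspots before applying machine learning.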

⚠️ Note: This index is heuristic (rule-based). It should be documented as an exploratory measure, not a definitive risk score. Later, we can refine it with proper modeling.
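To make the arithmetic concrete, the sketch below computes the index on three invented rows. It also shows one possible refinement, in which flood height is divided by its maximum as well. That scaled variant is an assumption for illustration, not the method used in this notebook.

```python
import pandas as pd

# Invented values chosen so the arithmetic is easy to follow by hand.
toy = pd.DataFrame(
    {
        "flood_heig": [0, 2, 4],
        "elevation": [80.0, 40.0, 10.0],
        "precipitat": [5.0, 10.0, 20.0],
    }
)

# Lower elevation gives a larger term; the highest point contributes 0.
elevation_term = 1 - toy["elevation"] / toy["elevation"].max()
# The wettest point contributes 1.
precip_term = toy["precipitat"] / toy["precipitat"].max()

# Index as defined in this notebook: flood height enters unscaled.
index_as_defined = toy["flood_heig"] + elevation_term + precip_term

# Assumed variant (not the notebook's method): scale flood height too, so each
# term lies between 0 and 1. This needs a non-zero maximum flood height.
flood_term = toy["flood_heig"] / toy["flood_heig"].max()
index_scaled = flood_term + elevation_term + precip_term

print(index_as_defined.tolist())
print(index_scaled.tolist())
```

In the unscaled form, the last row's score of 5.875 comes mostly from its flood height of 4. In the scaled variant, the three terms contribute on comparable scales.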

In [5]:
processed_path = "C:/Users/SHANIA/Downloads/Metro-Manila-Flood-Insights/Data/AEGISDataset_processed.csv"
df_flood_processed.to_csv(processed_path, index=False)
print(f"Preprocessed dataset saved as '{processed_path}'")


Preprocessed dataset saved as 'C:/Users/SHANIA/Downloads/Metro-Manila-Flood-Insights/Data/AEGISDataset_processed.csv'
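One caveat when the saved file is read back: CSV stores only text, so the two bin columns lose their categorical dtype and their ordering. The sketch below shows the round trip on invented values, using an in-memory buffer instead of a file, and restores the ordering with `pd.CategoricalDtype`.

```python
import io

import pandas as pd

labels = ["Very Low", "Low", "Moderate", "High", "Very High"]

# Invented precipitation values, binned with the same edges as the notebook.
frame = pd.DataFrame({"precipitat": [3.0, 12.0, 22.0]})
frame["precip_bin"] = pd.cut(
    frame["precipitat"], bins=[-1, 5, 10, 15, 20, 25], labels=labels
)

# Round-trip through CSV text.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the column holds plain strings, not categories.
dtype_after_reload = reloaded["precip_bin"].dtype

# Restore the ordered categories explicitly.
bin_dtype = pd.CategoricalDtype(categories=labels, ordered=True)
reloaded["precip_bin"] = reloaded["precip_bin"].astype(bin_dtype)

print(dtype_after_reload)
print(reloaded["precip_bin"].cat.ordered)
```

Restoring the ordered dtype matters for sorting and for plots, where 'Very Low' should come before 'Low' instead of being ordered alphabetically.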


In [6]:
df_flood_processed.head()

         lat         lon  flood_heig  elevation  precipitat elevation_bin precip_bin  flood_risk_index
0  14.640394  121.055708           0  54.553295         9.0     Very High        Low          0.834528
1  14.698299  121.002132           0  21.856272        10.0      Moderate        Low          1.238192
2  14.698858  121.100261           0  69.322807        16.0     Very High       High          1.007032
3  14.571310  120.983334           0  10.987241         8.0           Low        Low          1.261310
4  14.762232  121.075735           0  87.889847        18.0     Very High       High          0.900089


In [7]:
df_flood_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3508 entries, 0 to 3507
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype   
---  ------            --------------  -----   
 0   lat               3508 non-null   float64 
 1   lon               3508 non-null   float64 
 2   flood_heig        3508 non-null   int64   
 3   elevation         3508 non-null   float64 
 4   precipitat        3508 non-null   float64 
 5   elevation_bin     3508 non-null   category
 6   precip_bin        3508 non-null   category
 7   flood_risk_index  3508 non-null   float64 
dtypes: category(2), float64(5), int64(1)
memory usage: 171.8 KB
