# Feature Engineering for PM2.5 Prediction

**Purpose of this notebook:**
- Load raw observational data
- Engineer physical, photographic, meteorological, and temporal features
- Label daytime and nighttime observations
- Export a clean dataset for exploratory analysis and machine learning

## 1. Import Required Libraries
Only NumPy and pandas are required, which keeps the environment small and easy to reproduce.

In [None]:
import numpy as np
import pandas as pd

## 2. Load Raw Dataset
The input file contains:
- Ground PM2.5 measurements
- Sky and ground RGB values
- Surface meteorology
- Upper-air meteorology (500 hPa, 850 hPa)

In [None]:
df = pd.read_csv("Model Input/ML_DL_input.csv")
df.shape

## 3. Datetime Construction and Temporal Encoding

Time information is encoded using sine and cosine transformations to preserve cyclical properties of hour-of-day and month-of-year.
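
To illustrate why the sine/cosine pair is used instead of the raw hour, the minimal sketch below compares the gap between 23:00 and 00:00 in both representations. The helper name `encode_hour` is hypothetical and exists only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw hours: 23:00 and 00:00 appear 23 units apart
raw_gap = abs(23 - 0)

# Encoded hours: the gap across midnight equals the gap between
# any other pair of consecutive hours
gap_23_to_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
gap_11_to_12 = np.linalg.norm(encode_hour(11) - encode_hour(12))

print(raw_gap, round(gap_23_to_0, 4), round(gap_11_to_12, 4))
```

The same reasoning applies to the month encoding, where December and January become neighbours.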

In [None]:
# Combine date and hour into a single datetime variable
df['Datetime'] = pd.to_datetime(
    df['Date'].astype(str) + ' ' + df['Hour'].astype(str),
    errors='coerce'
)

# Extract hour and month
df['hr'] = df['Datetime'].dt.hour
df['mnth'] = df['Datetime'].dt.month

# Cyclical time encoding
df['hr_sin']   = np.sin(df.hr * 2 * np.pi / 24)
df['hr_cos']   = np.cos(df.hr * 2 * np.pi / 24)
df['mnth_sin'] = np.sin((df.mnth - 1) * 2 * np.pi / 12)
df['mnth_cos'] = np.cos((df.mnth - 1) * 2 * np.pi / 12)
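
Because `errors='coerce'` turns unparseable rows into `NaT` without raising, it may be worth counting them before moving on. The sketch below uses a small hypothetical frame; the `"HH:MM"` format of `Hour` is an assumption, since the actual format in the input file is not shown here.

```python
import pandas as pd

# Hypothetical sample rows; the last one has an unparseable date
sample = pd.DataFrame({
    "Date": ["2021-01-15", "2021-07-01", "not a date"],
    "Hour": ["08:00", "23:00", "12:00"],  # assumed format
})

sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Hour"], errors="coerce"
)

# Rows that failed to parse become NaT, so their hour and month are NaN too
n_failed = int(sample["Datetime"].isna().sum())
print(f"{n_failed} of {len(sample)} rows failed to parse")
```

A non-zero count means the cyclical features for those rows will be missing as well.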

## 4. Atmospheric Physical Feature Construction

Physical quantities are derived from upper-air meteorological variables:
- Air density
- Kinetic Energy (KE)
- Gravitational Potential Energy (GE)

These variables describe atmospheric stability and transport capacity.

In [None]:
# Constants
R = 8314   # Universal gas constant (J kmol⁻¹ K⁻¹)
M = 29     # Mean molecular weight of air (kg kmol⁻¹)

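# Air density (kg m⁻³) from the ideal gas law, rho = P * M / (R * T)
# Pressure is in Pa; assumes T_500 and T_850 are in kelvin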
df['Ro_500'] = 50000 * M / (R * df['T_500'])
df['Ro_850'] = 85000 * M / (R * df['T_850'])

# Kinetic energy (J m⁻³)
df['KE_500'] = 0.5 * df['Ro_500'] * (df['U_500']**2 + df['V_500']**2)
df['KE_850'] = 0.5 * df['Ro_850'] * (df['U_850']**2 + df['V_850']**2)

# Gravitational potential energy (J m⁻³)
# Assumes GP_500 and GP_850 are geopotential in m² s⁻² (not geopotential height in m)
df['GE_500'] = df['Ro_500'] * df['GP_500']
df['GE_850'] = df['Ro_850'] * df['GP_850']
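
As a quick sanity check on the formulas above, the sketch below evaluates the density and kinetic energy expressions for a single illustrative point. The temperature (250 K) and wind (10 m s⁻¹) are made-up values, not taken from the dataset, and the function names are hypothetical.

```python
R = 8314.0  # Universal gas constant (J kmol^-1 K^-1)
M = 29.0    # Mean molecular weight of air (kg kmol^-1)

def air_density(pressure_pa, temperature_k):
    """Ideal gas law: rho = P * M / (R * T), in kg m^-3."""
    return pressure_pa * M / (R * temperature_k)

def kinetic_energy_density(rho, u, v):
    """Kinetic energy per unit volume, in J m^-3."""
    return 0.5 * rho * (u**2 + v**2)

# Illustrative values at 500 hPa: T = 250 K, U = 10 m/s, V = 0 m/s
rho_500 = air_density(50000.0, 250.0)
ke_500 = kinetic_energy_density(rho_500, 10.0, 0.0)

print(round(rho_500, 3), round(ke_500, 2))
```

A density near 0.7 kg m⁻³ at 500 hPa is physically plausible; a value far from this would suggest the temperature column is not in kelvin.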

## 5. Photographic Feature Engineering

Photographic features are derived from sky and ground RGB channels:
- Channel ratios (R/G and R/B for sky and ground; B/R for sky only)
- Total RGB intensity

These variables capture optical effects of aerosols.

In [None]:
# Sky photographic features
df['R_G_Sky'] = df['R_S_M'] / df['G_S_M']
df['R_B_Sky'] = df['R_S_M'] / df['B_S_M']
df['B_R_Sky'] = df['B_S_M'] / df['R_S_M']
df['RGB_Sky'] = df['R_S_M'] + df['G_S_M'] + df['B_S_M']

# Ground photographic features
df['R_G_Ground'] = df['R_G_M'] / df['G_G_M']
df['R_B_Ground'] = df['R_G_M'] / df['B_G_M']
df['RGB_Ground'] = df['R_G_M'] + df['G_G_M'] + df['B_G_M']
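
Channel values of zero (for example, in very dark images) would make these ratios infinite or undefined. One possible guard, sketched below on hypothetical values, returns `NaN` wherever the denominator is zero. The helper `safe_ratio` is not part of the notebook and is shown only as an option.

```python
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN where the denominator is zero."""
    # where() replaces zero denominators with NaN before dividing
    return numerator / denominator.where(denominator != 0)

# Hypothetical channel means, including zero-valued rows
toy = pd.DataFrame({
    "R_S_M": [120.0, 0.0, 30.0],
    "G_S_M": [100.0, 0.0, 0.0],
})
toy["R_G_Sky"] = safe_ratio(toy["R_S_M"], toy["G_S_M"])
print(toy)
```

Keeping these rows as `NaN` rather than `inf` lets downstream steps handle them with ordinary missing-value logic.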

## 6. Daytime and Nighttime Classification

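Daytime and nighttime samples are distinguished by sky brightness: an observation is labeled daytime when the mean red-channel value of the sky (`R_S_M`) exceeds 100. This value is used here as a proxy for downward shortwave radiation intensity.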

In [None]:
# Daytime flag: True when the mean sky red channel exceeds 100
df['is_day'] = df['R_S_M'] > 100
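
One behaviour of this comparison worth knowing is that missing values compare as `False`, so rows without a sky reading would be labeled nighttime. The sketch below shows this on hypothetical values, along with one possible way to keep such rows as missing instead.

```python
import numpy as np
import pandas as pd

# Hypothetical sky red-channel means, including a missing reading
red_sky = pd.Series([0.0, 50.0, 150.0, np.nan])

# Plain comparison: NaN > 100 evaluates to False
is_day = red_sky > 100
print(is_day.tolist())

# Alternative: nullable boolean that stays missing where the input is missing
is_day_nullable = is_day.astype("boolean").mask(red_sky.isna())
print(is_day_nullable.tolist())
```

Whether to keep the simple flag or the nullable variant depends on how later notebooks treat missing labels.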

## 7. Feature Group Definitions

The following feature groups are defined and reused consistently in subsequent notebooks.

In [None]:
PHOTO_FEATURES = [
    'R_G_Sky', 'R_B_Sky', 'B_R_Sky', 'RGB_Sky',
    'R_G_Ground', 'R_B_Ground', 'RGB_Ground'
]

MET_GROUND = ['T2M', 'BLH', 'SP', 'TP', 'U_10', 'V_10']

MET_UPPER = ['T_500', 'U_500', 'V_500', 'T_850', 'U_850', 'V_850']

PHYSICAL = ['KE_500', 'KE_850', 'GE_500', 'GE_850']

TIME_FEATURES = ['hr_sin', 'hr_cos', 'mnth_sin', 'mnth_cos']
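
Since these lists are reused elsewhere, a small check that every listed name exists as a column can catch typos early. The sketch below is one way to do that; `missing_columns` is a hypothetical helper, and the toy frame deliberately omits two features.

```python
import pandas as pd

def missing_columns(frame, feature_groups):
    """Return listed feature names that are absent from the frame."""
    wanted = [name for group in feature_groups for name in group]
    return [name for name in wanted if name not in frame.columns]

# Toy frame containing only two of the four time features
toy = pd.DataFrame({"hr_sin": [0.0], "hr_cos": [1.0]})
TIME_FEATURES = ['hr_sin', 'hr_cos', 'mnth_sin', 'mnth_cos']

print(missing_columns(toy, [TIME_FEATURES]))
```

In the notebook itself, the same helper could be called with `df` and all five feature groups; an empty list would indicate that every feature is present.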

## 8. Export Feature Dataset

The final dataset is saved and used as input for the exploratory analysis and machine learning notebooks.

In [None]:
df.to_csv("Model Input/ML_DL_processed.csv", index=False)
df.shape
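
CSV files do not store column types, so the `Datetime` column comes back as text unless it is parsed on load. The sketch below demonstrates this with an in-memory buffer and a hypothetical one-row frame rather than the real file.

```python
import io
import pandas as pd

# Hypothetical one-row frame with a datetime column
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(["2021-01-15 08:00"]),
    "hr": [8],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates leaves Datetime as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates restores the datetime type
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["Datetime"])

print(plain["Datetime"].dtype, parsed["Datetime"].dtype)
```

Downstream notebooks that rely on `.dt` accessors would therefore need `parse_dates=["Datetime"]` or an explicit `pd.to_datetime` call after loading.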