# Q4: Feature Engineering

**Phase 5:** Feature Engineering & Aggregation  
**Points:** 9

**Focus:** Create derived features, perform time-based aggregations, calculate rolling windows.

**Lecture Reference:** Lecture 11, Notebook 2 ([`11/demo/02_wrangling_feature_engineering.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/02_wrangling_feature_engineering.ipynb)), Phase 5. Also see Lecture 09 (rolling windows).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load wrangled data from Q3
df = pd.read_csv('output/q3_wrangled_data.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q3_wrangled_data.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with datetime index")

Loaded 182,516 records with datetime index


---

## Objective

Create derived features, perform time-based aggregations, and calculate rolling windows for time series analysis.

**Time Series Note:** Rolling windows are essential for time series data because they capture temporal dependencies (e.g., a 7-hour rolling mean captures short-term patterns). See **Lecture 09** for time series rolling window operations. For hourly data, common window sizes range from 7 to 24 hours; a 24-hour window spans one full daily cycle. Use the pandas `rolling()` method with the `window` parameter to specify the number of periods. An integer `window` counts rows, so it equals a span in hours only when the rows are evenly spaced one hour apart.
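
As a minimal sketch of the note above, the snippet below compares an integer window with a time-offset window. The frame `demo` and its `humidity` column are hypothetical stand-ins, not the assignment data, and the ten-hour gap is simulated purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series; "humidity" is an illustrative column name.
idx = pd.date_range("2022-01-01", periods=48, freq="h")
demo = pd.DataFrame({"humidity": np.arange(48, dtype=float)}, index=idx)

# Integer window: mean of the current row and the 6 rows before it.
by_rows = demo["humidity"].rolling(window=7, min_periods=1).mean()

# Offset window: mean of every row in the trailing 7 hours.
# This form needs a sorted datetime index.
by_time = demo["humidity"].rolling("7h").mean()

# Remove ten hours to simulate a gap in the sensor record.
gappy = demo.drop(demo.index[10:20])
gap_by_rows = gappy["humidity"].rolling(window=7, min_periods=1).mean()
gap_by_time = gappy["humidity"].rolling("7h").mean()
```

On the evenly spaced series the two approaches agree. After the gap, the integer window still averages seven rows even though they span far more than seven hours, while the offset window only uses rows that fall inside the trailing seven hours.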

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q4_features.csv`
**Format:** CSV file
**Content:** Dataset with all derived features added
**Requirements:**
- All original columns from Q3
- All new derived features added as columns
- **No index column** (save with `index=False`)

### 2. `output/q4_rolling_features.csv`
**Format:** CSV file
**Content:** Dataset with rolling window features
**Required Columns:**
- Original datetime column
- At least one rolling window calculation column (e.g., `water_temp_rolling_7h`, `air_temp_rolling_24h`)

**Requirements:**
- Must include at least one rolling window calculation
- Rolling window names should be descriptive (e.g., `temp_rolling_7h` for 7-hour rolling mean)
- **No index column** (save with `index=False`)

**Example columns:**
```csv
Measurement Timestamp,wind_speed_rolling_7h,humidity_rolling_24h,pressure_rolling_7h
2022-01-01 00:00:00,6.8,65.2,1013.5
2022-01-01 01:00:00,6.9,65.3,1013.6
...
```

**Note:** The example shows rolling windows of predictor variables (wind speed, humidity, pressure), not the target variable. If you're predicting Air Temperature, do NOT create rolling windows of Air Temperature - this causes data leakage.
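
One possible way to keep the target out of the rolling features is to build them from an explicit list of predictor columns. This is only a sketch: `TARGET` is an assumption about what you are predicting, and the small `demo` frame is hypothetical data that borrows a few of the dataset's column names.

```python
import numpy as np
import pandas as pd

TARGET = "Air Temperature"  # assumption: the column you plan to predict

# Hypothetical frame that reuses a few of the dataset's column names.
idx = pd.date_range("2022-01-01", periods=24, freq="h")
demo = pd.DataFrame(
    {
        "Air Temperature": np.linspace(0, 5, 24),
        "Humidity": np.linspace(60, 70, 24),
        "Barometric Pressure": np.linspace(990, 995, 24),
    },
    index=idx,
)

# Only numeric predictor columns are eligible for rolling features.
predictors = [c for c in demo.select_dtypes("number").columns if c != TARGET]

rolling = pd.DataFrame(index=demo.index)
for col in predictors:
    name = col.lower().replace(" ", "_") + "_rolling_7h"
    rolling[name] = demo[col].rolling(window=7, min_periods=1).mean()

# Guard: no rolling column may be derived from the target.
target_slug = TARGET.lower().replace(" ", "_")
leaked = [c for c in rolling.columns if target_slug in c]
```

If `leaked` is non-empty, a target-derived feature slipped in and should be dropped before modeling.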

### 3. `output/q4_feature_list.txt`
**Format:** Plain text file
**Content:** List of new features created (one per line)
**Requirements:**
- One feature name per line
- No extra text, just feature names
- Include all derived features, rolling features, and categorical features created

**Example format:**
```
temp_difference
temp_ratio
wind_speed_squared
comfort_index
water_temp_rolling_7h
air_temp_rolling_24h
wind_speed_rolling_7h
temp_category
wind_category
```

---

## Requirements Checklist

- [ ] Derived features created (differences, ratios, interactions, etc.)
- [ ] Time-based aggregations performed (by hour, day, month, etc.) - optional but recommended
- [ ] At least one rolling window calculation (rolling mean, rolling median, etc.)
- [ ] Categorical features created (if applicable)
- [ ] Feature list documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Create derived features** - Differences, ratios, interactions between variables (watch for division by zero)
2. **Calculate rolling windows** - Use `.rolling()` on predictor variables to capture temporal patterns

   ⚠️ **Data Leakage Warning:** Do not create ANY features that use your target variable - this includes rolling windows, differences, ratios, or interactions involving the target. For example, if predicting Air Temperature, do not create `air_temp * humidity` or `air_temp - wet_bulb`. Only derive features from other predictor variables.

3. **Create categorical features** - Bin continuous variables if useful (optional)
4. **Check for infinity values** - Ratios can produce infinity; replace with NaN and handle appropriately
5. **Document and save** - Remember to `reset_index()` before saving CSVs
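
Steps 1, 4, and 5 can be sketched as follows. The frame and the `wind_humidity_ratio` feature are hypothetical and chosen only to trigger a division by zero; the final `to_csv` call is commented out so the sketch has no side effects.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a zero in the denominator column.
idx = pd.date_range("2022-01-01", periods=4, freq="h", name="Measurement Timestamp")
demo = pd.DataFrame(
    {"Wind Speed": [2.0, 0.0, 4.0, 3.0], "Humidity": [50.0, 60.0, 0.0, 75.0]},
    index=idx,
)

# Ratio feature: in pandas, dividing a nonzero float by zero yields inf.
demo["wind_humidity_ratio"] = demo["Wind Speed"] / demo["Humidity"]

# Replace +/-inf with NaN so later statistics and models are not distorted.
demo = demo.replace([np.inf, -np.inf], np.nan)

# Move the datetime index back into a column before saving without an index.
out = demo.reset_index()
# out.to_csv("output/q4_features.csv", index=False)
```

How to handle the resulting NaN values (drop, fill, or leave them) depends on the downstream model and is left as a decision for your analysis.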

---

## Decision Points

- **Derived features:** What relationships might be useful? Temperature differences? Ratios? Interactions between variables?
- **Rolling windows:** What window size makes sense? 7 hours? 24 hours? Consider the temporal scale of your data. For hourly data, a 7-hour window captures short-term changes and a 24-hour window covers a full daily cycle.
- **Time-based aggregations:** Aggregate by hour? Day? Week? What temporal granularity is useful for your analysis?
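
For the time-based aggregation decision, two common options are resampling to a coarser frequency and grouping by a calendar component. The sketch below uses a hypothetical three-day hourly `Humidity` series, not the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical series: the same 24-hour pattern repeated for three days.
idx = pd.date_range("2022-01-01", periods=72, freq="h")
demo = pd.DataFrame({"Humidity": np.tile(np.arange(24, dtype=float), 3)}, index=idx)

# Daily aggregation: one row per calendar day.
daily = demo["Humidity"].resample("D").agg(["mean", "min", "max"])

# Hour-of-day profile: average across all days for each clock hour.
hourly_profile = demo.groupby(demo.index.hour)["Humidity"].mean()
```

`resample` reduces the number of rows, which suits summaries and plots, while the hour-of-day profile describes the average daily cycle.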

---

## Checkpoint

After Q4, you should have:
- [ ] Derived features created
- [ ] At least one rolling window calculation
- [ ] Feature list documented
- [ ] All 3 artifacts saved: `q4_features.csv`, `q4_rolling_features.csv`, `q4_feature_list.txt`

---

**Next:** Continue to `q5_pattern_analysis.md` for Pattern Analysis.


## Start of Feature Engineering

In [2]:
# Make a copy to prevent modifying original data
df_featured = df.copy()

# WEATHER DERIVED FEATURES
# -------------------------------------------
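# Wind Features:
# Decompose wind speed and direction into vector components so that
# directions near 0 and 360 degrees are treated as close (circular variable)
# -------------------------------------------
wind_dir_rad = np.deg2rad(df_featured['Wind Direction'])

# Wind Direction is treated as compass degrees (0 = North, 90 = East),
# consistent with categorize_wind_direction below:
# u = east-west component (sin); v = north-south component (cos)
df_featured['wind_u'] = df_featured['Wind Speed'] * np.sin(wind_dir_rad)
df_featured['wind_v'] = df_featured['Wind Speed'] * np.cos(wind_dir_rad)

# Change in wind direction from the previous row, wrapped to [-180, 180)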
df_featured['wind_dir_delta'] = (
    (df_featured['Wind Direction'] - df_featured['Wind Direction'].shift(1) + 180) % 360 - 180
)

# Categorize wind direction into North, East, South, or West
def categorize_wind_direction(degree):
    """
    Categorize a wind direction in degrees into a compass quadrant.

    - North: above 315, or at most 45
    - East: above 45, up to 135
    - South: above 135, up to 225
    - West: above 225, up to 315

    Missing values stay missing instead of falling through to 'North'.
    """
    if pd.isna(degree):
        return np.nan
    if 45 < degree <= 135:
        return 'East'
    elif 135 < degree <= 225:
        return 'South'
    elif 225 < degree <= 315:
        return 'West'
    else:
        return 'North'

df_featured['wind_category'] = df_featured['Wind Direction'].apply(categorize_wind_direction)

# -------------------------------------------
# Pressure Trend Features:
# Also useful to predict air temperature
# -------------------------------------------
df_featured['pressure_delta'] = df_featured['Barometric Pressure'] - df_featured['Barometric Pressure'].shift(1)

# Categorical trend: rising / falling / steady
df_featured['pressure_trend'] = pd.cut(
    df_featured['pressure_delta'],
    bins=[-np.inf, -0.5, 0.5, np.inf],
    labels=['falling', 'steady', 'rising']
)

In [3]:
# -------------------------------------------
# ROLLING FEATURES
# -------------------------------------------
# For hourly data, common window sizes are 7-24 hours
# Create 7- and 24-period rolling means for each major numeric predictor
# Note: an integer window counts rows, so it equals hours only where rows are hourly
# -------------------------------------------

# Wet Bulb Temperature
df_featured['wet_temp_rolling_7h'] = df_featured['Wet Bulb Temperature'].rolling(window=7, min_periods=1).mean()
df_featured['wet_temp_rolling_24h'] = df_featured['Wet Bulb Temperature'].rolling(window=24, min_periods=1).mean()
# Humidity
df_featured['humidity_rolling_7h'] = df_featured['Humidity'].rolling(window=7, min_periods=1).mean()
df_featured['humidity_rolling_24h'] = df_featured['Humidity'].rolling(window=24, min_periods=1).mean()
# Rain Intensity
df_featured['rain_intensity_rolling_7h'] = df_featured['Rain Intensity'].rolling(window=7, min_periods=1).mean()
df_featured['rain_intensity_rolling_24h'] = df_featured['Rain Intensity'].rolling(window=24, min_periods=1).mean()
# Pressure
df_featured['pressure_rolling_7h'] = df_featured['Barometric Pressure'].rolling(window=7, min_periods=1).mean()
df_featured['pressure_rolling_24h'] = df_featured['Barometric Pressure'].rolling(window=24, min_periods=1).mean()

rolling_df = pd.DataFrame({
    "wet_temp_rolling_7h": df_featured['wet_temp_rolling_7h'],
    "wet_temp_rolling_24h": df_featured['wet_temp_rolling_24h'],
    "humidity_rolling_7h": df_featured['humidity_rolling_7h'],
    "humidity_rolling_24h": df_featured['humidity_rolling_24h'],
    "rain_intensity_rolling_7h": df_featured['rain_intensity_rolling_7h'],
    "rain_intensity_rolling_24h": df_featured['rain_intensity_rolling_24h'],
    "pressure_rolling_7h": df_featured['pressure_rolling_7h'],
    "pressure_rolling_24h": df_featured['pressure_rolling_24h']
})

display(rolling_df)

| Measurement Timestamp | wet_temp_rolling_7h | wet_temp_rolling_24h | humidity_rolling_7h | humidity_rolling_24h | rain_intensity_rolling_7h | rain_intensity_rolling_24h | pressure_rolling_7h | pressure_rolling_24h |
|---|---|---|---|---|---|---|---|---|
| 2015-04-25 09:00:00 | 5.900000 | 5.900000 | 86.000000 | 86.000000 | 7.20 | 7.20 | 986.100000 | 986.100000 |
| 2015-04-30 05:00:00 | 5.100000 | 5.100000 | 81.000000 | 81.000000 | 3.60 | 3.60 | 988.000000 | 988.000000 |
| 2015-05-22 15:00:00 | 5.733333 | 5.733333 | 72.333333 | 72.333333 | 2.40 | 2.40 | 988.633333 | 988.633333 |
| 2015-05-22 16:00:00 | 6.050000 | 6.050000 | 69.000000 | 69.000000 | 1.80 | 1.80 | 988.950000 | 988.950000 |
| 2015-05-22 17:00:00 | 6.100000 | 6.100000 | 67.400000 | 67.400000 | 1.44 | 1.44 | 989.140000 | 989.140000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2025-12-02 10:00:00 | -5.571429 | -4.350000 | 77.714286 | 79.958333 | 0.00 | 0.00 | 994.071429 | 993.495833 |
| 2025-12-02 11:00:00 | -5.671429 | -4.483333 | 76.285714 | 79.416667 | 0.00 | 0.00 | 994.157143 | 993.495833 |
| 2025-12-02 11:00:00 | -5.728571 | -4.616667 | 75.571429 | 79.083333 | 0.00 | 0.00 | 994.514286 | 993.491667 |
| 2025-12-02 12:00:00 | -5.828571 | -4.758333 | 74.285714 | 78.625000 | 0.00 | 0.00 | 994.200000 | 993.412500 |


In [4]:
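# Make sure the output directory exists
os.makedirs('output', exist_ok=True)

# Save the feature-engineered data; reset_index() turns the timestamp into a
# regular column so the file has no index column, as the artifact spec requires
df_featured.reset_index().to_csv('output/q4_features.csv', index=False)

# Save the rolling features
rolling_df.reset_index().to_csv('output/q4_rolling_features.csv', index=False)

# Save the names of the new features, in the order they were created
new_cols = [col for col in df_featured.columns if col not in df.columns]
with open('output/q4_feature_list.txt', 'w') as f:
    for col in new_cols:
        f.write(f"{col}\n")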