# Q4: Feature Engineering

**Phase 5:** Feature Engineering & Aggregation  
**Points:** 9

**Focus:** Create derived features, perform time-based aggregations, calculate rolling windows.

**Lecture Reference:** Lecture 11, Notebook 2 ([`11/demo/02_wrangling_feature_engineering.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/02_wrangling_feature_engineering.ipynb)), Phase 5. Also see Lecture 09 (rolling windows).

---

## Setup

In [39]:
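# Import libraries
import os

import numpy as np
import pandas as pd

# Make sure the artifact directory exists before anything is saved to it
os.makedirs('output', exist_ok=True)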

# Load wrangled data from Q3
df = pd.read_csv('output/q3_wrangled_data.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q3_wrangled_data.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with datetime index")

Loaded 196,313 records with datetime index


---

## Objective

Create derived features, perform time-based aggregations, and calculate rolling windows for time series analysis.

**Time Series Note:** Rolling windows are essential for time series data because they capture temporal dependencies: a 7-hour rolling mean smooths short-term fluctuations, while a 24-hour window spans one full daily cycle. For hourly data, window sizes between 7 and 24 hours are common. See **Lecture 09** for rolling window operations. Use the pandas `rolling()` method; its `window` parameter takes either a number of periods (e.g., `7`) or, on a datetime index, a time offset such as `'7h'`.
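To make the two ways of specifying `window` concrete, here is a minimal sketch on a synthetic hourly series. The `demo` frame and its values are made up for illustration and are not the assignment data.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (hypothetical data, not the assignment dataset)
idx = pd.date_range("2022-01-01", periods=48, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Humidity": np.linspace(50.0, 97.0, 48)}, index=idx)

# Count-based window: the current row plus the 6 rows before it.
# The first 6 rows are NaN because min_periods defaults to the window size.
demo["humidity_rolling_7"] = demo["Humidity"].rolling(window=7).mean()

# Time-based window: every row in the 7 hours ending at the current timestamp.
# Needs a sorted DatetimeIndex; min_periods defaults to 1, so no leading NaNs.
demo["humidity_rolling_7h"] = demo["Humidity"].rolling("7h").mean()

print(demo.head(8))
```

On a gap-free hourly series the two versions agree once seven rows are available. They diverge when timestamps are missing or repeated, which is why the time-based form is often the safer choice for sensor data.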

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q4_features.csv`
**Format:** CSV file
**Content:** Dataset with all derived features added
**Requirements:**
- All original columns from Q3
- All new derived features added as columns
- **No index column** (save with `index=False`)

### 2. `output/q4_rolling_features.csv`
**Format:** CSV file
**Content:** Dataset with rolling window features
**Required Columns:**
- Original datetime column
- At least one rolling window calculation column (e.g., `wind_speed_rolling_7h`, `humidity_rolling_24h`)

**Requirements:**
- Must include at least one rolling window calculation
- Rolling window names should be descriptive (e.g., `humidity_rolling_7h` for a 7-hour rolling mean of humidity)
- **No index column** (save with `index=False`)

**Example columns:**
```csv
Measurement Timestamp,wind_speed_rolling_7h,humidity_rolling_24h,pressure_rolling_7h
2022-01-01 00:00:00,6.8,65.2,1013.5
2022-01-01 01:00:00,6.9,65.3,1013.6
...
```

**Note:** The example shows rolling windows of predictor variables (wind speed, humidity, pressure), not the target variable. If you're predicting Air Temperature, do NOT create rolling windows of Air Temperature - this causes data leakage.
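One way to guard against this kind of leakage is to route all rolling features through a helper that refuses the target column. The sketch below assumes the target is `Air Temperature`; the helper name `add_rolling_means` and the demo data are hypothetical.

```python
import pandas as pd

TARGET = "Air Temperature"  # assumption: the column you plan to predict

def add_rolling_means(frame, columns, window="7h", target=TARGET):
    """Add rolling-mean columns for predictors, refusing to touch the target."""
    if target in columns:
        raise ValueError(f"'{target}' is the target; rolling it would leak information")
    out = frame.copy()
    for col in columns:
        # e.g. 'Humidity' with window '3h' becomes 'humidity_rolling_3h'
        name = f"{col.lower().replace(' ', '_')}_rolling_{window}"
        out[name] = out[col].rolling(window).mean()
    return out

# Tiny synthetic frame (made-up values) with a sorted hourly DatetimeIndex
idx = pd.date_range("2022-01-01", periods=5, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame(
    {
        "Air Temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Humidity": [60.0, 62.0, 64.0, 66.0, 68.0],
    },
    index=idx,
)

result = add_rolling_means(demo, ["Humidity"], window="3h")
print(result.columns.tolist())
```

Passing `["Air Temperature"]` to the helper raises a `ValueError` instead of silently creating a leaky feature.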

### 3. `output/q4_feature_list.txt`
**Format:** Plain text file
**Content:** List of new features created (one per line)
**Requirements:**
- One feature name per line
- No extra text, just feature names
- Include all derived features, rolling features, and categorical features created

**Example format:**
```
temp_difference
temp_ratio
wind_speed_squared
comfort_index
water_temp_rolling_7h
air_temp_rolling_24h
wind_speed_rolling_7h
temp_category
wind_category
```

---

## Requirements Checklist

- [ ] Derived features created (differences, ratios, interactions, etc.)
- [ ] Time-based aggregations performed (by hour, day, month, etc.) - optional but recommended
- [ ] At least one rolling window calculation (rolling mean, rolling median, etc.)
- [ ] Categorical features created (if applicable)
- [ ] Feature list documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Create derived features** - Differences, ratios, interactions between variables (watch for division by zero)
2. **Calculate rolling windows** - Use `.rolling()` on predictor variables to capture temporal patterns

   ⚠️ **Data Leakage Warning:** Do not create ANY features that use your target variable - this includes rolling windows, differences, ratios, or interactions involving the target. For example, if predicting Air Temperature, do not create `air_temp * humidity` or `air_temp - wet_bulb`. Only derive features from other predictor variables.

3. **Create categorical features** - Bin continuous variables if useful (optional)
4. **Check for infinity values** - Ratios with a zero denominator produce `inf`; replace these with NaN, then drop or impute them the same way you handle other missing values
5. **Document and save** - Remember to `reset_index()` before saving CSVs
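The steps above can be sketched end to end on a small synthetic frame. Column names mirror the assignment data, but the values, the bin edges, and the temporary output path are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

idx = pd.date_range(
    "2022-01-01", periods=4, freq=pd.Timedelta(hours=1), name="Measurement Timestamp"
)
demo = pd.DataFrame(
    {
        "Wind Speed": [2.0, 0.0, 4.0, 5.0],
        "Maximum Wind Speed": [3.0, 1.0, 6.0, 5.0],
        "Humidity": [25.0, 45.0, 80.0, 60.0],
    },
    index=idx,
)

# Step 1: ratio feature; dividing by a zero wind speed yields inf
demo["gust_factor"] = demo["Maximum Wind Speed"] / demo["Wind Speed"]

# Step 4: turn +/-inf into NaN so downstream statistics are not distorted
demo["gust_factor"] = demo["gust_factor"].replace([np.inf, -np.inf], np.nan)

# Step 3: bin a continuous predictor (bin edges here are illustrative)
demo["humidity_level"] = pd.cut(
    demo["Humidity"], bins=[-np.inf, 30, 60, np.inf], labels=["Low", "Medium", "High"]
)

# Step 5: move the datetime index back into a column, then save without an index
out_path = os.path.join(tempfile.mkdtemp(), "features_demo.csv")
demo.reset_index().to_csv(out_path, index=False)
saved = pd.read_csv(out_path)
print(saved.columns.tolist())
```

Calling `reset_index()` before `to_csv(..., index=False)` is what keeps the timestamp in the file as an ordinary column; saving with `index=False` alone would drop it.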

---

## Decision Points

- **Derived features:** What relationships might be useful? Temperature differences? Ratios? Interactions between variables?
- **Rolling windows:** What window size makes sense? 7 hours? 24 hours? Consider the temporal scale of your data. For hourly data, a 7-hour window tracks short-term changes, while a 24-hour window spans one full daily cycle.
- **Time-based aggregations:** Aggregate by hour? Day? Week? What temporal granularity is useful for your analysis?
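For the time-based aggregation decision, two common options are resampling to a coarser frequency and grouping by a calendar component. A minimal sketch on synthetic hourly data (the values are made up):

```python
import numpy as np
import pandas as pd

# Three days of synthetic hourly readings (hypothetical data)
idx = pd.date_range("2022-01-01", periods=72, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"Humidity": np.arange(72, dtype=float)}, index=idx)

# Daily aggregation: one row per calendar day
daily = demo["Humidity"].resample("D").agg(["mean", "min", "max"])

# Hour-of-day profile: average across all days for each hour 0-23
hourly_profile = demo.groupby(demo.index.hour)["Humidity"].mean()

print(daily)
print(hourly_profile.head())
```

Resampling changes the granularity of the series, whereas grouping by `index.hour` collapses all days into a single 24-row profile. Which one is useful depends on whether you care about trends over time or about the typical shape of a day.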

---

## Checkpoint

After Q4, you should have:
- [ ] Derived features created
- [ ] At least one rolling window calculation
- [ ] Feature list documented
- [ ] All 3 artifacts saved: `q4_features.csv`, `q4_rolling_features.csv`, `q4_feature_list.txt`

---

**Next:** Continue to `q5_pattern_analysis.md` for Pattern Analysis.


In [40]:
# output/q4_features.csv

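# Remember how many columns came from Q3 so the new features can be listed later
before_column_number = df.shape[1]
# No-op if the index was already parsed as datetime on load
df.index = pd.to_datetime(df.index)

# Map a month number (1-12) to its meteorological season
def get_season(month):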
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'
    
df['season'] = df.index.month.map(get_season)

print(df['season'])

df['Wind Speed Squared'] = df['Wind Speed'] ** 2

print(df['Wind Speed Squared'])

print(df['Rain Intensity'])

# Binary rain flag: 1 when any rain intensity was recorded, else 0 (missing values count as 0)
df["Is Raining"] = (df['Rain Intensity'] > 0).astype(int)

print(df["Is Raining"])

print(df["Wind Speed"])

print(df["Maximum Wind Speed"])

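# Calm hours (Wind Speed == 0) would divide by zero and produce inf; store NaN instead
df['Gust Factor'] = (df['Maximum Wind Speed'] / df['Wind Speed']).replace([np.inf, -np.inf], np.nan)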

print(df['Gust Factor'])

print(df['Wind Direction'])

# Split wind into x/y components so directions near 0 and 360 degrees end up close together
df['Wind Direction X'] = df['Wind Speed'] * np.cos(np.radians(df['Wind Direction']))

print(df['Wind Direction X'])

df['Wind Direction Y'] =  df['Wind Speed'] * np.sin(np.radians(df['Wind Direction']))

print(df['Wind Direction Y'])

df["Rain Intensity x Humidity"] = df["Rain Intensity"] * df["Humidity"]

print(df["Rain Intensity x Humidity"])

print(df['Humidity'])

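def humidity_level(humidity):
    # Keep missing readings missing; otherwise NaN would fall through to 'High'
    if pd.isna(humidity):
        return np.nan
    if humidity < 30:
        return 'Low'
    elif humidity <= 60:
        return 'Medium'
    else:
        return 'High'

df['Humidity Level'] = df['Humidity'].map(humidity_level)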

# Cyclical encoding of the hour so that 23:00 and 00:00 are neighbours
df['Sin_Hour'] = np.sin(2 * np.pi * df.index.hour / 24)

print(df['Sin_Hour'])
df['Cos_Hour'] = np.cos(2 * np.pi * df.index.hour / 24)

print(df['Cos_Hour'])

df.reset_index().to_csv('output/q4_features.csv', index=False)

Measurement Timestamp
2015-04-25 09:00:00    Spring
2015-04-30 05:00:00    Spring
2015-05-22 15:00:00    Spring
2015-05-22 16:00:00    Spring
2015-05-22 17:00:00    Spring
                        ...  
2025-12-03 08:00:00    Winter
2025-12-03 09:00:00    Winter
2025-12-03 09:00:00    Winter
2025-12-03 10:00:00    Winter
2025-12-03 10:00:00    Winter
Name: season, Length: 196313, dtype: object
Measurement Timestamp
2015-04-25 09:00:00    26.01
2015-04-30 05:00:00    51.84
2015-05-22 15:00:00     3.61
2015-05-22 16:00:00    16.00
2015-05-22 17:00:00     2.25
                       ...  
2025-12-03 08:00:00    13.69
2025-12-03 09:00:00    10.89
2025-12-03 09:00:00     4.84
2025-12-03 10:00:00    10.89
2025-12-03 10:00:00     5.29
Name: Wind Speed Squared, Length: 196313, dtype: float64
Measurement Timestamp
2015-04-25 09:00:00    5.133688
2015-04-30 05:00:00    0.000000
2015-05-22 15:00:00    0.000000
2015-05-22 16:00:00    0.000000
2015-05-22 17:00:00    0.000000
                        

In [41]:
# output/q4_rolling_features.csv

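# Time-based windows ('7h', '24h') require a sorted DatetimeIndex. Each window covers
# every row whose timestamp falls in the trailing 7 or 24 hours, so rows that share
# a timestamp all contribute to the same window.
df['barometric_pressure_rolling_mean_7h'] = df['Barometric Pressure'].rolling('7h').mean()
df['humidity_rolling_mean_7h'] = df['Humidity'].rolling('7h').mean()
df['solar_radiation_rolling_mean_7h'] = df['Solar Radiation'].rolling('7h').mean()
df['total_rain_rolling_mean_24h'] = df['Total Rain'].rolling('24h').mean()

rolling_cols = [
    'barometric_pressure_rolling_mean_7h',
    'humidity_rolling_mean_7h',
    'solar_radiation_rolling_mean_7h',
    'total_rain_rolling_mean_24h',
]

# reset_index() turns the timestamp back into a column before saving without an index
df.reset_index()[['Measurement Timestamp'] + rolling_cols].to_csv('output/q4_rolling_features.csv', index=False)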

In [42]:
# output/q4_feature_list.txt

# Every column added since the Q3 load: derived, categorical, and rolling features
new_col_names = df.columns[before_column_number:]

with open('output/q4_feature_list.txt', 'w') as f:
    for col in new_col_names:
        f.write(f"{col}\n")