# Q4: Feature Engineering

**Phase 5:** Feature Engineering & Aggregation  
**Points: 9 points**

**Focus:** Create derived features, perform time-based aggregations, calculate rolling windows.

**Lecture Reference:** Lecture 11, Notebook 2 ([`11/demo/02_wrangling_feature_engineering.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/02_wrangling_feature_engineering.ipynb)), Phase 5. Also see Lecture 09 (rolling windows).

---

## Setup

In [70]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load wrangled data from Q3
df = pd.read_csv('output/q3_wrangled_data.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
#Measurement Timestamp
# Or if you saved without index:
#df = pd.read_csv('output/q3_wrangled_data.csv')
#df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
#df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with datetime index")
display(df.head())
df.shape

Loaded 120,394 records with datetime index


Unnamed: 0_level_0,Station Name,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,Measurement Timestamp Label,Measurement ID
Measurement Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
2015-04-25 09:00:00,63rd Street Weather Station,7.0,5.9,86,0.0,0.0,5.2,liquid,119,5.1,7.1,986.1,38.0,354.0,12.0,04/25/2015 9:00 AM,63rdStreetWeatherStation201504250900
2015-04-30 05:00:00,63rd Street Weather Station,6.1,4.3,76,0.0,0.0,2.5,none,11,7.2,13.0,989.9,4.0,354.0,11.9,04/30/2015 5:00 AM,63rdStreetWeatherStation201504300500
2015-05-22 15:00:00,Oak Street Weather Station,17.7,7.0,55,0.0,0.0,1.4,none,63,1.9,2.8,994.7,689.0,329.0,12.0,05/22/2015 3:00 PM,OakStreetWeatherStation201505221500
2015-05-22 17:00:00,Oak Street Weather Station,17.7,6.3,56,0.0,0.0,1.4,none,124,1.5,2.3,994.7,180.0,329.0,12.1,05/22/2015 5:00 PM,OakStreetWeatherStation201505221700
2015-05-22 18:00:00,Oak Street Weather Station,17.7,6.5,54,0.0,0.0,1.4,none,156,1.9,3.4,994.7,127.0,329.0,12.1,05/22/2015 6:00 PM,OakStreetWeatherStation201505221800


(120394, 17)

---

## Objective

Create derived features, perform time-based aggregations, and calculate rolling windows for time series analysis.

**Time Series Note:** Rolling windows are essential for time series data. They capture temporal dependencies (e.g., 7-hour rolling mean captures short-term patterns). See **Lecture 09** for time series rolling window operations. For hourly data, common window sizes are 7-24 hours (capturing daily patterns). Use pandas `rolling()` method with `window` parameter to specify the number of periods.

---

## Required Artifacts

You must create exactly these 3 files in the `output/` directory:

### 1. `output/q4_features.csv`
**Format:** CSV file
**Content:** Dataset with all derived features added
**Requirements:**
- All original columns from Q3
- All new derived features added as columns
- **No index column** (save with `index=False`)

### 2. `output/q4_rolling_features.csv`
**Format:** CSV file
**Content:** Dataset with rolling window features
**Required Columns:**
- Original datetime column
- At least one rolling window calculation column (e.g., `water_temp_rolling_7h`, `air_temp_rolling_24h`)

**Requirements:**
- Must include at least one rolling window calculation
- Rolling window names should be descriptive (e.g., `temp_rolling_7h` for 7-hour rolling mean)
- **No index column** (save with `index=False`)

**Example columns:**
```csv
Measurement Timestamp,wind_speed_rolling_7h,humidity_rolling_24h,pressure_rolling_7h
2022-01-01 00:00:00,6.8,65.2,1013.5
2022-01-01 01:00:00,6.9,65.3,1013.6
...
```

**Note:** The example shows rolling windows of predictor variables (wind speed, humidity, pressure), not the target variable. If you're predicting Air Temperature, do NOT create rolling windows of Air Temperature - this causes data leakage.

### 3. `output/q4_feature_list.txt`
**Format:** Plain text file
**Content:** List of new features created (one per line)
**Requirements:**
- One feature name per line
- No extra text, just feature names
- Include all derived features, rolling features, and categorical features created

**Example format:**
```
temp_difference
temp_ratio
wind_speed_squared
comfort_index
water_temp_rolling_7h
air_temp_rolling_24h
wind_speed_rolling_7h
temp_category
wind_category
```

---

## Requirements Checklist

- [ ] Derived features created (differences, ratios, interactions, etc.)
- [ ] Time-based aggregations performed (by hour, day, month, etc.) - optional but recommended
- [ ] At least one rolling window calculation (rolling mean, rolling median, etc.)
- [ ] Categorical features created (if applicable)
- [ ] Feature list documented
- [ ] All 3 required artifacts saved with exact filenames

---

## Your Approach

1. **Create derived features** - Differences, ratios, interactions between variables (watch for division by zero)
2. **Calculate rolling windows** - Use `.rolling()` on predictor variables to capture temporal patterns

   ⚠️ **Data Leakage Warning:** Do not create ANY features that use your target variable - this includes rolling windows, differences, ratios, or interactions involving the target. For example, if predicting Air Temperature, do not create `air_temp * humidity` or `air_temp - wet_bulb`. Only derive features from other predictor variables.

3. **Create categorical features** - Bin continuous variables if useful (optional)
4. **Check for infinity values** - Ratios can produce infinity; replace with NaN and handle appropriately
5. **Document and save** - Remember to `reset_index()` before saving CSVs

---

## Decision Points

- **Derived features:** What relationships might be useful? Temperature differences? Ratios? Interactions between variables?
- **Rolling windows:** What window size makes sense? 7 hours? 24 hours? Consider the temporal scale of your data. For hourly data, 7-24 hours captures daily patterns.
- **Time-based aggregations:** Aggregate by hour? Day? Week? What temporal granularity is useful for your analysis?

---

## Checkpoint

After Q4, you should have:
- [ ] Derived features created
- [ ] At least one rolling window calculation
- [ ] Feature list documented
- [ ] All 3 artifacts saved: `q4_features.csv`, `q4_rolling_features.csv`, `q4_feature_list.txt`

---

**Next:** Continue to `q5_pattern_analysis.md` for Pattern Analysis.


In [None]:
 #Feature engineering constants
SECONDS_PER_MINUTE = 60
# Create voltage_drop_per_min
#df['minutes'] = (df['Measurement Timestamp'].diff().dt.total_seconds()/60)
#this was trouble beacuse of the index 
# going to reference the index (to allow the calulations) - this gave me problems 
# to detach the index and retain the Measurement Timestamp as a column ( to use df.reset_index)
df = df.reset_index()
# now the measurement timestamp is object again 
# need to convert into datetime
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
#df['Measurement Timestamp']
# Create voltage_drop_per_min
df['minutes'] = (df['Measurement Timestamp'].diff().dt.total_seconds()/60)
df['voltage_perminute'] =df['Battery Life']/df['minutes']
print(df['voltage_perminute'].head())
# create : dry index from relation between Solar Radiation and Total Rain 
df['dry_index']= df['Solar Radiation']/(df['Total Rain']+1 )
print(df['dry_index'].head())
df.shape

##save as csv : `output/q4_features.csv`, index = False 
df.to_csv ('output/q4_features.csv', index = False)

#check 
df = pd.read_csv('output/q3_wrangled_data.csv', parse_dates=['Measurement Timestamp'])



0         NaN
1    0.001710
2    0.000372
3    0.100833
4    0.201667
Name: voltage_perminute, dtype: float64
0      6.129032
1      1.142857
2    287.083333
3     75.000000
4     52.916667
Name: dry_index, dtype: float64


In [72]:

#based on the examples I would create: 
#'7d_air_temp_rate_perday
#7d_air_temp_rate 
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
weekly_df = df[['Measurement Timestamp']+ numeric_cols].resample ('7D',on='Measurement Timestamp').mean()
weekly_df ['7d_air_temp_rate'] = weekly_df['Air Temperature'].diff()
weekly_df['7d_air_temp_rate_perday'] = weekly_df['7d_air_temp_rate'] / 7
display(weekly_df.head())
print(df.index)
weekly_df.shape

Unnamed: 0_level_0,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Wind Direction,Wind Speed,Maximum Wind Speed,Barometric Pressure,Solar Radiation,Heading,Battery Life,7d_air_temp_rate,7d_air_temp_rate_perday
Measurement Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
2015-04-25,6.55,5.1,81.0,0.0,0.0,3.85,65.0,6.15,10.05,988.0,21.0,354.0,11.95,,
2015-05-02,,,,,,,,,,,,,,,
2015-05-09,,,,,,,,,,,,,,,
2015-05-16,17.7,6.8125,53.875,0.0,0.0,1.4,142.75,1.4375,2.95,994.7,135.0,329.0,12.05,,
2015-05-23,19.834188,17.064957,64.948718,0.0,0.0,6.606838,159.641026,1.448718,3.194872,993.805128,163.213675,329.0,12.089744,2.134188,0.304884


RangeIndex(start=0, stop=120394, step=1)


(554, 15)

In [73]:
### 2. `output/q4_rolling_features.csv`
#Dataset with rolling window features
#**Required Columns:**
# Original datetime column
#At least one rolling window calculation column (e.g., `water_temp_rolling_7h`, `air_temp_rolling_24h`)

#**Requirements:**
# Must include at least one rolling window calculation
# Rolling window names should be descriptive (e.g., `temp_rolling_7h` for 7-hour rolling mean)
# **No index column** (save with `index=False`)
#e.g
#Measurement Timestamp,wind_speed_rolling_7h,humidity_rolling_24h,pressure_rolling_7h
#2022-01-01 00:00:00,6.8,65.2,1013.5
#2022-01-01 01:00:00,6.9,65.3,1013.6

#Rolling examples 
#humidity_rolling_7-day 
#solar_radiation_7-day
#wind_speed_rolling_7-day
#total_rain_rollin_30-day and 7 day 

# Apply rolling windows 
print("Rolling Window Operations\n")

#1-Humidity_rolling_7-day###########

# Centered rolling window (looks both forward and backward)
df['humidity_rolling_24h_centered'] = df['Humidity'].rolling(window=24, center=True).mean()
df['humidity_rolling_7d_centered'] = df['Humidity'].rolling(window=168, center=True).mean()
# Expanding window (from start to current)
#Humidity_expanding.rolling_7day['humidity_expanding_mean'] =Humidity_expanding.rolling_7-day['Humidity'].expanding().mean()
df['humidity_expanding_mean'] =df['Humidity'].expanding().mean()

print(df[['humidity_rolling_24h_centered', 'humidity_expanding_mean','humidity_rolling_7d_centered']].describe())

#2-solar_radiation_7-day#############


# Centered rolling window (looks both forward and backward)
df['solar_radiation_rolling_7d_centered'] = df['Solar Radiation'].rolling(window=168, center=True).mean()

# Expanding window (from start to current)
df['solar_radiation_expanding_mean'] =df['Solar Radiation'].expanding().mean()


print(df[['solar_radiation_rolling_7d_centered', 'solar_radiation_expanding_mean']].describe())
#3-total_rain_rollin_7-day and 30-day ###################


df['total_rain_rolling_7d_centered'] = df['Total Rain'].rolling(window=168, center=True).mean()

# Expanding window (from start to current)
df['total_rain_expanding_mean'] =df['Total Rain'].expanding().mean()

print(df[['total_rain_rolling_7d_centered', 'total_rain_expanding_mean']].describe())


#4-wind_speed_rolling_7-day#########################

df['wind_speed_rolling_7d_centered'] = df['Total Rain'].rolling(window=168, center=True).mean()

# Expanding window (from start to current)
df['wind_speed_expanding_mean'] =df['Total Rain'].expanding().mean()

print(df[['wind_speed_rolling_7d_centered', 'wind_speed_expanding_mean']].describe())

display(df.head()) # doesnt has the index as shown so I will save it like this 
#save with index= False 
#check 
df= df.reset_index() 
df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
#output/q4_rolling_features.csv
df.to_csv('output/q4_rolling_features.csv', index = False)


Rolling Window Operations

       humidity_rolling_24h_centered  humidity_expanding_mean  \
count                  120371.000000            120394.000000   
mean                       69.825774                71.088983   
std                        13.643224                 1.067970   
min                        19.916667                49.115385   
25%                        60.375000                70.572247   
50%                        70.291667                71.054377   
75%                        80.208333                71.419578   
max                        99.416667                86.000000   

       humidity_rolling_7d_centered  
count                 120227.000000  
mean                      69.825727  
std                        9.331879  
min                       38.380952  
25%                       63.458333  
50%                       69.982143  
75%                       76.235119  
max                       96.934524  
       solar_radiation_rolling_7d_centered  s

Unnamed: 0,Measurement Timestamp,Station Name,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,...,Measurement ID,humidity_rolling_24h_centered,humidity_rolling_7d_centered,humidity_expanding_mean,solar_radiation_rolling_7d_centered,solar_radiation_expanding_mean,total_rain_rolling_7d_centered,total_rain_expanding_mean,wind_speed_rolling_7d_centered,wind_speed_expanding_mean
0,2015-04-25 09:00:00,63rd Street Weather Station,7.0,5.9,86,0.0,0.0,5.2,liquid,119,...,63rdStreetWeatherStation201504250900,,,86.0,,38.0,,5.2,,5.2
1,2015-04-30 05:00:00,63rd Street Weather Station,6.1,4.3,76,0.0,0.0,2.5,none,11,...,63rdStreetWeatherStation201504300500,,,81.0,,21.0,,3.85,,3.85
2,2015-05-22 15:00:00,Oak Street Weather Station,17.7,7.0,55,0.0,0.0,1.4,none,63,...,OakStreetWeatherStation201505221500,,,72.333333,,243.666667,,3.033333,,3.033333
3,2015-05-22 17:00:00,Oak Street Weather Station,17.7,6.3,56,0.0,0.0,1.4,none,124,...,OakStreetWeatherStation201505221700,,,68.25,,227.75,,2.625,,2.625
4,2015-05-22 18:00:00,Oak Street Weather Station,17.7,6.5,54,0.0,0.0,1.4,none,156,...,OakStreetWeatherStation201505221800,,,65.4,,207.6,,2.38,,2.38


In [74]:
df = pd.read_csv('output/q4_rolling_features.csv')
df = df.set_index('Measurement Timestamp')
display(df.head())

Unnamed: 0_level_0,index,Station Name,Air Temperature,Wet Bulb Temperature,Humidity,Rain Intensity,Interval Rain,Total Rain,Precipitation Type,Wind Direction,...,Measurement ID,humidity_rolling_24h_centered,humidity_rolling_7d_centered,humidity_expanding_mean,solar_radiation_rolling_7d_centered,solar_radiation_expanding_mean,total_rain_rolling_7d_centered,total_rain_expanding_mean,wind_speed_rolling_7d_centered,wind_speed_expanding_mean
Measurement Timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-04-25 09:00:00,0,63rd Street Weather Station,7.0,5.9,86,0.0,0.0,5.2,liquid,119,...,63rdStreetWeatherStation201504250900,,,86.0,,38.0,,5.2,,5.2
2015-04-30 05:00:00,1,63rd Street Weather Station,6.1,4.3,76,0.0,0.0,2.5,none,11,...,63rdStreetWeatherStation201504300500,,,81.0,,21.0,,3.85,,3.85
2015-05-22 15:00:00,2,Oak Street Weather Station,17.7,7.0,55,0.0,0.0,1.4,none,63,...,OakStreetWeatherStation201505221500,,,72.333333,,243.666667,,3.033333,,3.033333
2015-05-22 17:00:00,3,Oak Street Weather Station,17.7,6.3,56,0.0,0.0,1.4,none,124,...,OakStreetWeatherStation201505221700,,,68.25,,227.75,,2.625,,2.625
2015-05-22 18:00:00,4,Oak Street Weather Station,17.7,6.5,54,0.0,0.0,1.4,none,156,...,OakStreetWeatherStation201505221800,,,65.4,,207.6,,2.38,,2.38
