# Q6: Modeling Preparation

**Phase 7:** Modeling Preparation  
**Points: 3 points**

**Focus:** Perform temporal train/test split, select features, handle categorical variables.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 7. This notebook demonstrates temporal train/test splitting (see "Your Approach" section below for the key code pattern).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load feature-engineered data from Q4
df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q4_features.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")
display(df.head())

Loaded 182,516 records with features


| Measurement Timestamp | Station Name | Air Temperature | Wet Bulb Temperature | Humidity | Rain Intensity | Interval Rain | Total Rain | Precipitation Type | Wind Direction | Wind Speed | ... | pressure_delta | pressure_trend | wet_temp_rolling_7h | wet_temp_rolling_24h | humidity_rolling_7h | humidity_rolling_24h | rain_intensity_rolling_7h | rain_intensity_rolling_24h | pressure_rolling_7h | pressure_rolling_24h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2015-04-25 09:00:00 | 63rd Street Weather Station | 7.0 | 5.9 | 86 | 7.2 | 5.0 | 5.2 | 60.0 | 119 | 5.1 | ... | 0.0 | steady | 5.9 | 5.9 | 86.0 | 86.0 | 7.2 | 7.2 | 986.1 | 986.1 |
| 2015-04-30 05:00:00 | 63rd Street Weather Station | 6.1 | 4.3 | 76 | 0.0 | 0.0 | 2.5 | 0.0 | 11 | 7.2 | ... | 3.8 | rising | 5.1 | 5.1 | 81.0 | 81.0 | 3.6 | 3.6 | 988.0 | 988.0 |
| 2015-05-22 15:00:00 | Oak Street Weather Station | 6.1 | 7.0 | 55 | 0.0 | 0.0 | 1.4 | 0.0 | 63 | 1.9 | ... | 0.0 | steady | 5.733333 | 5.733333 | 72.333333 | 72.333333 | 2.4 | 2.4 | 988.633333 | 988.633333 |
| 2015-05-22 16:00:00 | Foster Weather Station | 9.17 | 7.0 | 59 | 0.0 | 0.0 | 1.4 | 0.0 | 4 | 4.0 | ... | 0.0 | steady | 6.05 | 6.05 | 69.0 | 69.0 | 1.8 | 1.8 | 988.95 | 988.95 |
| 2015-05-22 17:00:00 | Foster Weather Station | 9.28 | 6.3 | 61 | 0.0 | 0.0 | 1.4 | 0.0 | 40 | 1.2 | ... | 0.0 | steady | 6.1 | 6.1 | 67.4 | 67.4 | 1.44 | 1.44 | 989.14 | 989.14 |


---

## Objective

Prepare data for modeling by performing temporal train/test split, selecting features, and handling categorical variables.

**CRITICAL - Temporal Split:** For time series data, you **MUST** use temporal splitting (earlier data for training, later data for testing). **DO NOT** use a random split. Why? Time series data has temporal dependencies: a random split lets the model train on observations recorded after the ones it is tested on, which leaks future information into training.

---

## Required Artifacts

You must create exactly these 5 files in the `output/` directory:

### 1. `output/q6_X_train.csv`
**Format:** CSV file
**Content:** Training features (X)
**Requirements:**
- All feature columns (no target variable)
- Only training data (earlier time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 2. `output/q6_X_test.csv`
**Format:** CSV file
**Content:** Test features (X)
**Requirements:**
- All feature columns (same as X_train)
- Only test data (later time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 3. `output/q6_y_train.csv`
**Format:** CSV file
**Content:** Training target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only training data (corresponding to X_train)
- **No index column** (save with `index=False`)

**Example:**
```csv
Water Temperature
15.2
15.3
15.1
...
```

### 4. `output/q6_y_test.csv`
**Format:** CSV file
**Content:** Test target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only test data (corresponding to X_test)
- **No index column** (save with `index=False`)

### 5. `output/q6_train_test_info.txt`
**Format:** Plain text file
**Content:** Train/test split information
**Required information:**
- Split method: Temporal (80/20 or similar)
- Training set size: [number] samples
- Test set size: [number] samples
- Training date range: [start] to [end]
- Test date range: [start] to [end]
- Number of features: [number]
- Target variable: [name]

**Example format:**
```
TRAIN/TEST SPLIT INFORMATION
==========================

Split Method: Temporal (80/20 split by time)

Training Set Size: 40000 samples
Test Set Size: 10000 samples

Training Date Range: 2022-01-01 00:00:00 to 2026-09-15 07:00:00
Test Date Range: 2026-09-15 08:00:00 to 2027-09-15 07:00:00

Number of Features: 22
Target Variable: Water Temperature
```
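The file above can be generated directly from the split objects. The following is a minimal sketch, not a required implementation: the helper name `write_split_info` and the toy timestamps are hypothetical, and the demo writes to a temporary directory so it can run anywhere.

```python
import os
import tempfile
import pandas as pd

def write_split_info(path, train_ts, test_ts, n_features, target, train_ratio=0.80):
    """Write the train/test split summary in the example format shown above."""
    pct_train = round(train_ratio * 100)
    lines = [
        "TRAIN/TEST SPLIT INFORMATION",
        "==========================",
        "",
        f"Split Method: Temporal ({pct_train}/{100 - pct_train} split by time)",
        "",
        f"Training Set Size: {len(train_ts)} samples",
        f"Test Set Size: {len(test_ts)} samples",
        "",
        f"Training Date Range: {train_ts.min()} to {train_ts.max()}",
        f"Test Date Range: {test_ts.min()} to {test_ts.max()}",
        "",
        f"Number of Features: {n_features}",
        f"Target Variable: {target}",
    ]
    # Create the parent directory if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

# Toy hourly timestamps standing in for the real split (hypothetical data)
ts = pd.Series(pd.date_range("2022-01-01", periods=10, freq=pd.Timedelta(hours=1)))
demo_path = os.path.join(tempfile.mkdtemp(), "q6_train_test_info.txt")
info_lines = write_split_info(demo_path, ts.iloc[:8], ts.iloc[8:],
                              n_features=3, target="Water Temperature")
```

In the assignment you would pass `output/q6_train_test_info.txt` as the path, along with the timestamp columns of your own train and test sets.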

---

## Requirements Checklist

- [ ] Target variable selected
- [ ] Temporal train/test split performed (train on earlier data, test on later data - **NOT random split**)
- [ ] Features selected and prepared
- [ ] Categorical variables handled (encoding if needed)
- [ ] No data leakage (future data not in training set)
- [ ] All 5 required artifacts saved with exact filenames

---

## Your Approach

1. **Select target variable** - Choose a meaningful numeric variable to predict
2. **Select features** - Exclude the target, non-numeric columns you do not plan to encode, and any features derived from the target (to avoid data leakage)
3. **Handle categorical variables** - One-hot encode if needed
4. **Perform temporal train/test split** - Sort by datetime, then split by index position (earlier data for training, later for testing)
5. **Save artifacts** - Save X_train, X_test, y_train, y_test as separate CSVs
6. **Document split** - Record split sizes, date ranges, and feature count
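
The split step in the list above can be sketched as follows. This is an illustration under assumptions, not the lecture's exact code: the helper name `temporal_split` and the toy data are hypothetical, and the column names simply mirror the weather dataset.

```python
import pandas as pd

def temporal_split(df, time_col, target, train_ratio=0.80):
    """Sort by time, then cut by position so training rows precede test rows."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    split_idx = int(len(ordered) * train_ratio)
    train, test = ordered.iloc[:split_idx], ordered.iloc[split_idx:]
    # Features are everything except the timestamp and the target
    feature_cols = [c for c in ordered.columns if c not in (time_col, target)]
    return train[feature_cols], test[feature_cols], train[target], test[target]

# Toy, deliberately shuffled data (hypothetical values)
demo = pd.DataFrame({
    "Measurement Timestamp": pd.date_range("2024-01-01", periods=10,
                                           freq=pd.Timedelta(hours=1)),
    "Humidity": range(50, 60),
    "Air Temperature": [float(v) for v in range(10)],
}).sample(frac=1, random_state=0)

X_train, X_test, y_train, y_test = temporal_split(
    demo, "Measurement Timestamp", "Air Temperature"
)
```

One caveat with a purely positional cut: when several stations report at the same timestamp, rows sharing that timestamp can land on both sides of the boundary. Cutting at a timestamp value instead (train strictly before, test at or after) avoids this.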

---

## Feature Selection Guidelines

When selecting features for modeling, think critically about each feature:

**Red Flags to Watch For:**
- **Circular logic**: Does this feature use the target variable to predict the target?
  - Example: Rolling mean of target, lag of target (if not handled carefully)
  - Example: If predicting `Air Temperature`, using `air_temp_rolling_7h` is circular - you're predicting temperature from smoothed temperature
- **Data leakage**: Does this feature contain information that wouldn't be available at prediction time?
  - Example: Future values, aggregated statistics that include the current value
- **Near-duplicates**: Is this feature nearly identical to the target?
  - Check correlations - if correlation > 0.95, investigate whether it's legitimate
  - Example: A feature with 99%+ correlation with the target is likely problematic

**Good Practices:**
- Use external predictors (other weather variables, temporal features)
- Create rolling windows of **predictors**, not the target
  - Good: `wind_speed_rolling_7h`, `humidity_rolling_24h`
  - Bad: `air_temp_rolling_7h` when predicting Air Temperature
- Use derived features that combine multiple predictors
- Think: "Would I have this information when making a real prediction?"

**Remember:** The goal is to predict the target from **other** information, not from the target itself.
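
The correlation check described above could look roughly like this. It is a sketch on synthetic data: the helper name `flag_suspect_features` is hypothetical, and the 0.95 threshold is the one suggested in the guidelines, not a universal rule.

```python
import numpy as np
import pandas as pd

def flag_suspect_features(df, target, threshold=0.95):
    """Return numeric features whose absolute correlation with the target exceeds threshold."""
    corr = df.select_dtypes(include=[np.number]).corr()[target].drop(target)
    return corr[corr.abs() > threshold].sort_values(ascending=False)

# Synthetic data: a slow temperature cycle plus small noise (hypothetical values)
rng = np.random.default_rng(0)
t = np.arange(480)
temp = 15 + 10 * np.sin(2 * np.pi * t / 240) + rng.normal(0, 0.2, len(t))

demo = pd.DataFrame({
    "Air Temperature": temp,
    # Rolling mean of the target itself: the circular feature the guidelines warn about
    "air_temp_rolling_7h": pd.Series(temp).rolling(7, min_periods=1).mean(),
    # Independent predictor, unrelated to the target in this toy example
    "Humidity": rng.uniform(30, 90, len(t)),
})

suspects = flag_suspect_features(demo, "Air Temperature", threshold=0.95)
print(suspects)
```

A flagged feature is a prompt to investigate, not an automatic exclusion: a genuinely independent measurement can correlate strongly with the target, while a feature computed from the target should be dropped regardless of its correlation.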

---

## Decision Points

- **Target variable:** What do you want to predict? Temperature? Water conditions? Choose something meaningful and measurable.
- **Temporal split:** **CRITICAL** - Use temporal split (earlier data for training, later data for testing), NOT random split. Why? Time series data has temporal dependencies. Typical split: 80/20 or 70/30.
- **Feature selection:** Which features are most relevant? Consider correlations, domain knowledge, and feature importance from previous analysis.
- **Categorical encoding:** If you have categorical variables, encode them (one-hot encoding, label encoding, etc.) before modeling.
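
For the categorical encoding decision, one way to keep train and test columns identical is sketched below. The helper name `one_hot_train_test` and the toy rows are hypothetical; the approach uses `pd.get_dummies` and `DataFrame.reindex`, with the training set defining the reference columns.

```python
import pandas as pd

def one_hot_train_test(X_train, X_test, categorical_cols):
    """One-hot encode train, then force test onto exactly the same columns."""
    # The training set defines the columns; drop_first removes one reference category
    train_enc = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
    # Encode all test categories, then align: missing columns become 0, extras are dropped
    test_enc = pd.get_dummies(X_test, columns=categorical_cols)
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

# Toy rows (hypothetical values); the test set lacks one station on purpose
train_demo = pd.DataFrame({
    "Humidity": [80, 75, 60],
    "Station Name": ["Foster Weather Station", "Oak Street Weather Station",
                     "63rd Street Weather Station"],
})
test_demo = pd.DataFrame({
    "Humidity": [70, 65],
    "Station Name": ["Oak Street Weather Station", "Foster Weather Station"],
})

train_enc, test_enc = one_hot_train_test(train_demo, test_demo, ["Station Name"])
```

Applying `drop_first=True` to each set separately can drop a different category in each when the test set is missing a station, which silently mislabels rows. Encoding the test set in full and aligning it to the training columns avoids that.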

---

## Checkpoint

After Q6, you should have:
- [ ] Temporal train/test split completed (earlier → train, later → test)
- [ ] Features prepared (no target, no datetime index)
- [ ] Categorical variables encoded
- [ ] No data leakage verified
- [ ] All 5 artifacts saved: `q6_X_train.csv`, `q6_X_test.csv`, `q6_y_train.csv`, `q6_y_test.csv`, `q6_train_test_info.txt`
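
Several checkpoint items can be verified with a few assertions before moving on. This is a minimal sketch assuming you still have the split objects in memory; the helper name `check_split` and the toy data are hypothetical.

```python
import pandas as pd

def check_split(X_train, X_test, y_train, y_test, train_ts, test_ts, target):
    """Raise AssertionError if the split violates a checkpoint item; return True otherwise."""
    assert train_ts.max() < test_ts.min(), "training data overlaps the test period"
    assert list(X_train.columns) == list(X_test.columns), "feature columns differ"
    assert target not in X_train.columns, "target leaked into the features"
    assert len(X_train) == len(y_train), "X_train and y_train row counts differ"
    assert len(X_test) == len(y_test), "X_test and y_test row counts differ"
    return True

# Toy split standing in for the real one (hypothetical values)
ts = pd.Series(pd.date_range("2024-01-01", periods=5, freq=pd.Timedelta(hours=1)))
X = pd.DataFrame({"Humidity": [80, 78, 75, 70, 66]})
y = pd.Series([5.0, 5.5, 6.1, 7.0, 7.4], name="Air Temperature")

split_ok = check_split(X.iloc[:4], X.iloc[4:], y.iloc[:4], y.iloc[4:],
                       ts.iloc[:4], ts.iloc[4:], "Air Temperature")
```

These checks cover ordering, column alignment, and row counts only. They cannot detect features derived from the target, which still need the manual review described in the feature selection guidelines.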

---

**Next:** Continue to `q7_modeling.md` for Modeling.


## Start of Model Preparation

In [2]:
# For data with temporal structure, we must split by time (not randomly)
# Train on earlier data, test on later data

# Sort by datetime to ensure temporal order
df_model = df.reset_index().sort_values('Measurement Timestamp').copy()

# Train/test split configuration
TRAIN_RATIO = 0.80  # 80% for training, 20% for testing

# Define temporal split point
# IMPORTANT: For time series, we split by time (not randomly) to prevent data leakage
split_date = df_model['Measurement Timestamp'].quantile(TRAIN_RATIO)

# Create train/test split
train = df_model[df_model['Measurement Timestamp'] < split_date].copy()
test = df_model[df_model['Measurement Timestamp'] >= split_date].copy()

print("Temporal Train/Test Split")
display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Measurements': [f"{len(train):,}", f"{len(test):,}"],
    'Date Range': [
        f"{train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}",
        f"{test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}"
    ]
}))
print(f"Split date: {split_date}")

Temporal Train/Test Split


|  | Dataset | Measurements | Date Range |
|---|---|---|---|
| 0 | Train | 146012 | 2015-04-25 09:00:00 to 2023-05-25 18:00:00 |
| 1 | Test | 36504 | 2023-05-25 19:00:00 to 2025-12-02 12:00:00 |


Split date: 2023-05-25 19:00:00


In [3]:
# Quick exploration: Which features are correlated?
# This helps us identify redundant features

# First, define our target and potential features
target = 'Air Temperature'

# List all numeric features we might use
numeric_features = ['Wet Bulb Temperature', 'Humidity', 'Rain Intensity', 'Interval Rain',
                    'Total Rain', 'Barometric Pressure','Solar Radiation', 'hour', 'day_of_week', 
                    'month', 'year', 'is_weekend', 'wind_u', 'wind_v', 'wind_dir_delta',
                    'pressure_diff_1h', 'wet_temp_rolling_7h',
                    'wet_temp_rolling_24h', 'humidity_rolling_7h', 'humidity_rolling_24h',
                    'rain_intensity_rolling_7h', 'rain_intensity_rolling_24h',
                    'pressure_rolling_7h', 'pressure_rolling_24h']

# numeric_features = df.select_dtypes(include=[np.number]).columns.tolist()

# Check which features are actually available in the data
available_numeric = [f for f in numeric_features if f in df_model.columns]

# Calculate correlation with target
if available_numeric:
    correlation_with_target = df_model[available_numeric + [target]].corr()[target].sort_values(ascending=False)
    print("Features correlated with Air Temperature:")
    print(correlation_with_target)
    print()

Features correlated with Air Temperature:
Air Temperature               1.000000
Wet Bulb Temperature          0.979152
wet_temp_rolling_7h           0.976280
wet_temp_rolling_24h          0.962520
Total Rain                    0.452193
Solar Radiation               0.290747
month                         0.267694
hour                          0.080389
wind_v                        0.077987
humidity_rolling_24h          0.063825
rain_intensity_rolling_24h    0.027031
humidity_rolling_7h           0.025918
Interval Rain                 0.024553
wind_dir_delta                0.015286
rain_intensity_rolling_7h     0.014529
Humidity                      0.014424
Rain Intensity                0.008858
is_weekend                   -0.014025
day_of_week                  -0.022270
year                         -0.030034
wind_u                       -0.133806
pressure_rolling_24h         -0.224401
pressure_rolling_7h          -0.248553
Barometric Pressure          -0.252696
Name: Air Temperature,

In [4]:
# Define target variable
target = 'Air Temperature'

print(df.columns.values)
# Select features for modeling
# Include temporal, geographic, and measurement characteristics
feature_cols = [
    # Temporal features
    'hour', 'day_of_week', 'month',
    # Weather characteristics
    'Wet Bulb Temperature', 'Barometric Pressure', 'Total Rain', 'Solar Radiation', 'Humidity',
    # Derived features
    'wind_v', 'wind_u',
    # Rolling features: for each category, select window with stronger correlation
    'humidity_rolling_24h', 'rain_intensity_rolling_24h', 'pressure_rolling_7h',
    # Categorical (will need encoding)
    'Station Name'
]

# Check feature availability
available_features = [f for f in feature_cols if f in df_model.columns]
missing_features = [f for f in feature_cols if f not in df_model.columns]

print("📋 Feature Availability")
print(f"Available features: `{available_features}`")
if missing_features:
    display(f"⚠️ **Missing features** (will skip): `{missing_features}`")

# Select available features
X_train = train[available_features].copy()
X_test = test[available_features].copy()
y_train = train[target].copy()
y_test = test[target].copy()

display(pd.DataFrame({
    'Dataset': ['X_train', 'X_test'],
    'Shape': [
        f"{X_train.shape[0]:,} × {X_train.shape[1]}",
        f"{X_test.shape[0]:,} × {X_test.shape[1]}"
    ]
}))

['Station Name' 'Air Temperature' 'Wet Bulb Temperature' 'Humidity'
 'Rain Intensity' 'Interval Rain' 'Total Rain' 'Precipitation Type'
 'Wind Direction' 'Wind Speed' 'Maximum Wind Speed' 'Barometric Pressure'
 'Solar Radiation' 'Heading' 'Battery Life' 'Measurement Timestamp Label'
 'Measurement ID' 'exclude' 'exclude_reason' 'hour' 'day_of_week' 'month'
 'year' 'day_name' 'is_weekend' 'wind_u' 'wind_v' 'wind_dir_delta'
 'wind_category' 'pressure_delta' 'pressure_trend' 'wet_temp_rolling_7h'
 'wet_temp_rolling_24h' 'humidity_rolling_7h' 'humidity_rolling_24h'
 'rain_intensity_rolling_7h' 'rain_intensity_rolling_24h'
 'pressure_rolling_7h' 'pressure_rolling_24h']
📋 Feature Availability
Available features: `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'Barometric Pressure', 'Total Rain', 'Solar Radiation', 'Humidity', 'wind_v', 'wind_u', 'humidity_rolling_24h', 'rain_intensity_rolling_24h', 'pressure_rolling_7h', 'Station Name']`


|  | Dataset | Shape |
|---|---|---|
| 0 | X_train | 146,012 × 14 |
| 1 | X_test | 36,504 × 14 |


In [5]:
# Identify categorical variables
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

print("🏷️ Feature Types")
print(f"Categorical features: `{categorical_cols}`")
print(f"Numeric features: `{numeric_cols}`")

# For simplicity, we'll use pandas get_dummies for one-hot encoding
# In practice, you might use sklearn's OneHotEncoder

# Drop the first category on the training set only; training defines the reference columns
X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, prefix=categorical_cols, drop_first=True)
# Encode every test category, then align to the training columns below.
# (Dropping the first category separately on test could drop a different one.)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, prefix=categorical_cols)

# Ensure test set has same columns as training set:
# add missing columns (filled with 0) and drop columns not present in training
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

print("### ✅ After One-Hot Encoding")
display(pd.DataFrame({
    'Dataset': ['Training features', 'Test features'],
    'Shape': [
        f"{X_train_encoded.shape[0]:,} × {X_train_encoded.shape[1]}",
        f"{X_test_encoded.shape[0]:,} × {X_test_encoded.shape[1]}"
    ]
}))
display(f"Feature names: `{list(X_train_encoded.columns)[:10]}...` ({len(X_train_encoded.columns)} total)")

🏷️ Feature Types
Categorical features: `['Station Name']`
Numeric features: `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'Barometric Pressure', 'Total Rain', 'Solar Radiation', 'Humidity', 'wind_v', 'wind_u', 'humidity_rolling_24h', 'rain_intensity_rolling_24h', 'pressure_rolling_7h']`
### ✅ After One-Hot Encoding


|  | Dataset | Shape |
|---|---|---|
| 0 | Training features | 146,012 × 15 |
| 1 | Test features | 36,504 × 15 |


"Feature names: `['hour', 'day_of_week', 'month', 'Wet Bulb Temperature', 'Barometric Pressure', 'Total Rain', 'Solar Radiation', 'Humidity', 'wind_v', 'wind_u']...` (15 total)"

In [6]:
# Save prepared datasets for modeling
os.makedirs('output', exist_ok=True)  # make sure the output directory exists
X_train_encoded.to_csv('output/q6_X_train.csv', index=False)
X_test_encoded.to_csv('output/q6_X_test.csv', index=False)
y_train.to_csv('output/q6_y_train.csv', index=False)
y_test.to_csv('output/q6_y_test.csv', index=False)

print("💾 Prepared Datasets Saved")
# Report the shapes of the encoded frames, since those are what was written to disk
display(pd.DataFrame({
    'File': ['X_train', 'X_test', 'y_train', 'y_test'],
    'Shape': [
        f"{X_train_encoded.shape[0]:,} × {X_train_encoded.shape[1]}",
        f"{X_test_encoded.shape[0]:,} × {X_test_encoded.shape[1]}",
        f"{len(y_train):,}",
        f"{len(y_test):,}"
    ]
}))
print("✅ Ready for next phase: Modeling & Results!")

💾 Prepared Datasets Saved


|  | File | Shape |
|---|---|---|
| 0 | X_train | 146,012 × 15 |
| 1 | X_test | 36,504 × 15 |
| 2 | y_train | 146012 |
| 3 | y_test | 36504 |


✅ Ready for next phase: Modeling & Results!
