# Q6: Modeling Preparation

**Phase 7:** Modeling Preparation  
**Points:** 3

**Focus:** Perform a temporal train/test split, select features, and handle categorical variables.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 7. This notebook demonstrates temporal train/test splitting (see the "Your Approach" section below for the key steps).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load feature-engineered data from Q4
df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q4_features.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")

Loaded 196,479 records with features


---

## Objective

Prepare data for modeling by performing a temporal train/test split, selecting features, and handling categorical variables.

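**CRITICAL - Temporal Split:** For time series data, you **MUST** use temporal splitting (earlier data for training, later data for testing). **DO NOT** use a random split. Time series observations depend on what came before them, so a random split places observations from after the test period into the training set. The model then uses the future to predict the past, which is data leakage and makes test scores look better than they would be in real use.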
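
To make the leakage concrete, here is a minimal sketch on a toy hourly series. The column name `value`, the seed, and the 80/20 ratio are illustrative assumptions, not part of the assignment data. It checks whether any training row is later than the earliest test row.

```python
import numpy as np
import pandas as pd

# Toy hourly series; `value` is a made-up column for illustration only.
idx = pd.date_range("2022-01-01", periods=100, freq="h")
toy = pd.DataFrame({"value": np.arange(100)}, index=idx)

# Random 80/20 split: shuffle row positions, then cut.
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(toy))
random_train = toy.iloc[shuffled[:80]]
random_test = toy.iloc[shuffled[80:]]

# Chronological 80/20 split: first 80 rows train, last 20 rows test.
temporal_train = toy.iloc[:80]
temporal_test = toy.iloc[80:]

# A split leaks if any training row is later than the earliest test row.
random_leaks = random_train.index.max() > random_test.index.min()
temporal_leaks = temporal_train.index.max() > temporal_test.index.min()

print("Random split leaks future data:  ", random_leaks)
print("Temporal split leaks future data:", temporal_leaks)
```

The same comparison of `index.max()` on the training set against `index.min()` on the test set works as a quick leakage check on the real data.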

---

## Required Artifacts

You must create exactly these 5 files in the `output/` directory:

### 1. `output/q6_X_train.csv`
**Format:** CSV file
**Content:** Training features (X)
**Requirements:**
- All feature columns (no target variable)
- Only training data (earlier time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 2. `output/q6_X_test.csv`
**Format:** CSV file
**Content:** Test features (X)
**Requirements:**
- All feature columns (same as X_train)
- Only test data (later time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 3. `output/q6_y_train.csv`
**Format:** CSV file
**Content:** Training target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only training data (corresponding to X_train)
- **No index column** (save with `index=False`)

**Example:**
```csv
Water Temperature
15.2
15.3
15.1
...
```

### 4. `output/q6_y_test.csv`
**Format:** CSV file
**Content:** Test target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only test data (corresponding to X_test)
- **No index column** (save with `index=False`)

### 5. `output/q6_train_test_info.txt`
**Format:** Plain text file
**Content:** Train/test split information
**Required information:**
- Split method: Temporal (80/20 or similar)
- Training set size: [number] samples
- Test set size: [number] samples
- Training date range: [start] to [end]
- Test date range: [start] to [end]
- Number of features: [number]
- Target variable: [name]

**Example format:**
```
TRAIN/TEST SPLIT INFORMATION
==========================

Split Method: Temporal (80/20 split by time)

Training Set Size: 40000 samples
Test Set Size: 10000 samples

Training Date Range: 2022-01-01 00:00:00 to 2026-09-15 07:00:00
Test Date Range: 2026-09-15 08:00:00 to 2027-09-15 07:00:00

Number of Features: 22
Target Variable: Water Temperature
```

---

## Requirements Checklist

- [ ] Target variable selected
- [ ] Temporal train/test split performed (train on earlier data, test on later data - **NOT random split**)
- [ ] Features selected and prepared
- [ ] Categorical variables handled (encoding if needed)
- [ ] No data leakage (future data not in training set)
- [ ] All 5 required artifacts saved with exact filenames

---

## Your Approach

1. **Select target variable** - Choose a meaningful numeric variable to predict
2. **Select features** - Exclude the target, non-numeric columns you do not plan to encode (IDs, text labels), and any features derived from the target (to avoid data leakage)
3. **Handle categorical variables** - One-hot encode if needed
4. **Perform temporal train/test split** - Sort by datetime, then split by index position (earlier data for training, later for testing)
5. **Save artifacts** - Save X_train, X_test, y_train, y_test as separate CSVs
6. **Document split** - Record split sizes, date ranges, and feature count
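
Steps 1, 2, and 4 can be sketched as a small helper. This is one possible shape under stated assumptions, not the lecture's reference code: the function name `temporal_split`, the `train_frac` parameter, and the toy frame are hypothetical, and the input is assumed to have a `DatetimeIndex`.

```python
import pandas as pd

def temporal_split(df, target, train_frac=0.8):
    """Split a datetime-indexed frame chronologically into X/y train/test parts."""
    # Stable sort so rows sharing a timestamp keep their original relative order.
    ordered = df.sort_index(kind="mergesort")
    X = ordered.drop(columns=[target])
    y = ordered[target]
    # Earlier rows go to training, later rows to testing.
    cut = int(len(ordered) * train_frac)
    return X.iloc[:cut], X.iloc[cut:], y.iloc[:cut], y.iloc[cut:]

# Toy example with deliberately shuffled timestamps.
idx = pd.to_datetime(
    ["2022-01-03", "2022-01-01", "2022-01-05", "2022-01-02", "2022-01-04"]
)
toy = pd.DataFrame(
    {"Humidity": [63, 61, 65, 62, 64], "Air Temperature": [3.0, 1.0, 5.0, 2.0, 4.0]},
    index=idx,
)

X_train, X_test, y_train, y_test = temporal_split(toy, "Air Temperature")
print("Train ends: ", X_train.index.max())
print("Test starts:", X_test.index.min())
```

When several stations report at the same timestamp, the cut can fall between rows that share a timestamp, so the training end and test start may be equal rather than strictly ordered.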

---

## Feature Selection Guidelines

When selecting features for modeling, think critically about each feature:

**Red Flags to Watch For:**
- **Circular logic**: Does this feature use the target variable to predict the target?
  - Example: A rolling mean of the target, or a lag of the target that would not actually be known at prediction time
  - Example: If predicting `Air Temperature`, using `air_temp_rolling_7h` is circular: you're predicting temperature from smoothed temperature
- **Data leakage**: Does this feature contain information that wouldn't be available at prediction time?
  - Example: Future values, aggregated statistics that include the current value
- **Near-duplicates**: Is this feature nearly identical to the target?
  - Check correlations - if correlation > 0.95, investigate whether it's legitimate
  - Example: A feature with 99%+ correlation with the target is likely problematic
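
The correlation check can be automated with a short screen. The sketch below is an illustration on synthetic data: the helper name `flag_suspicious_features` and both feature columns are made up, and the 0.95 cutoff is the rule of thumb from the list above. A flagged feature is a prompt to investigate, not proof of leakage.

```python
import numpy as np
import pandas as pd

def flag_suspicious_features(X, y, threshold=0.95):
    """Return numeric features whose absolute Pearson correlation with y exceeds threshold."""
    corr = X.select_dtypes(include="number").corrwith(y).abs()
    return corr[corr > threshold].sort_values(ascending=False)

# Synthetic data: one feature is a linear transform of the target, one is noise.
rng = np.random.default_rng(42)
target = pd.Series(rng.normal(15, 5, size=200), name="Air Temperature")
X = pd.DataFrame({
    "leaky_copy": target * 1.8 + 32,          # same information as the target
    "unrelated_noise": rng.normal(size=200),  # independent of the target
})

flagged = flag_suspicious_features(X, target)
print(flagged)
```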

**Good Practices:**
- Use external predictors (other weather variables, temporal features)
- Create rolling windows of **predictors**, not the target
  - Good: `wind_speed_rolling_7h`, `humidity_rolling_24h`
  - Bad: `air_temp_rolling_7h` when predicting Air Temperature
- Use derived features that combine multiple predictors
- Think: "Would I have this information when making a real prediction?"
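
A minimal sketch of a rolling window built on a predictor rather than the target. The toy values and the column name `wind_speed_rolling_3h` are illustrative. It assumes a single station and a sorted `DatetimeIndex`; data with several stations would probably need a group-by on the station first.

```python
import pandas as pd

idx = pd.date_range("2022-01-01", periods=6, freq="h")
toy = pd.DataFrame({"Wind Speed": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]}, index=idx)

# Time-based window: each row averages the readings in the 3-hour window
# ending at (and including) that row's timestamp.
toy["wind_speed_rolling_3h"] = toy["Wind Speed"].rolling("3h").mean()
print(toy)
```

Because the window only looks backward from each timestamp, the feature uses nothing that would be unavailable when the prediction is made.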

**Remember:** The goal is to predict the target from **other** information, not from the target itself.

---

## Decision Points

- **Target variable:** What do you want to predict? Temperature? Water conditions? Choose something meaningful and measurable.
- **Temporal split:** **CRITICAL** - Use a temporal split (earlier data for training, later data for testing), NOT a random split. A random split trains on observations from after the test period, which leaks future information. Typical split: 80/20 or 70/30.
- **Feature selection:** Which features are most relevant? Consider correlations, domain knowledge, and feature importance from previous analysis.
- **Categorical encoding:** If you have categorical variables, encode them (one-hot encoding, label encoding, etc.) before modeling.
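
One way to handle the encoding decision is to one-hot encode only the non-numeric columns with few distinct values and drop the rest, since identifier-like columns would otherwise create one dummy column per row. The helper name `encode_low_cardinality`, the `max_levels` cutoff, and the toy frame below are assumptions for illustration.

```python
import pandas as pd

def encode_low_cardinality(X, max_levels=10):
    """One-hot encode non-numeric columns with few levels; drop high-cardinality ones."""
    # Anything that is not numeric, boolean, or datetime is treated as categorical.
    cat_cols = X.select_dtypes(exclude=["number", "bool", "datetime"]).columns
    keep = [c for c in cat_cols if X[c].nunique() <= max_levels]
    drop = [c for c in cat_cols if c not in keep]
    encoded = pd.get_dummies(X.drop(columns=drop), columns=keep, drop_first=True)
    return encoded, drop

toy = pd.DataFrame({
    "Station Name": ["A", "B", "C", "A"],
    "Measurement ID": ["A1", "B1", "C1", "A2"],  # unique per row, not a useful feature
    "Humidity": [60, 62, 61, 59],
})

encoded, dropped = encode_low_cardinality(toy, max_levels=3)
print(list(encoded.columns))
print("Dropped:", dropped)
```

Encoding before the train/test split, as here, keeps the dummy columns identical in both sets.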

---

## Checkpoint

After Q6, you should have:
- [ ] Temporal train/test split completed (earlier → train, later → test)
- [ ] Features prepared (no target, no datetime index)
- [ ] Categorical variables encoded
- [ ] No data leakage verified
- [ ] All 5 artifacts saved: `q6_X_train.csv`, `q6_X_test.csv`, `q6_y_train.csv`, `q6_y_test.csv`, `q6_train_test_info.txt`
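
The checkpoint can also be verified programmatically. The sketch below is a hypothetical helper (`check_q6_artifacts` is not part of the assignment or its autograder). It reads the saved files back and reports mismatches, and it is demonstrated on toy files in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

REQUIRED = [
    "q6_X_train.csv", "q6_X_test.csv",
    "q6_y_train.csv", "q6_y_test.csv",
    "q6_train_test_info.txt",
]

def check_q6_artifacts(output_dir, target):
    """Return a list of problems found in the saved Q6 artifacts (empty = all good)."""
    out = Path(output_dir)
    problems = [f"missing file: {n}" for n in REQUIRED if not (out / n).exists()]
    if problems:
        return problems
    X_train = pd.read_csv(out / "q6_X_train.csv")
    X_test = pd.read_csv(out / "q6_X_test.csv")
    y_train = pd.read_csv(out / "q6_y_train.csv")
    y_test = pd.read_csv(out / "q6_y_test.csv")
    if list(X_train.columns) != list(X_test.columns):
        problems.append("X_train and X_test have different columns")
    if target in X_train.columns:
        problems.append("target column is present in the features")
    if list(y_train.columns) != [target] or list(y_test.columns) != [target]:
        problems.append("y files should contain exactly one column named after the target")
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        problems.append("X and y row counts do not match")
    return problems

# Demonstration on toy files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    X = pd.DataFrame({"Humidity": [60, 61, 62, 63, 64]})
    y = pd.DataFrame({"Air Temperature": [1.0, 2.0, 3.0, 4.0, 5.0]})
    X.iloc[:4].to_csv(tmp / "q6_X_train.csv", index=False)
    X.iloc[4:].to_csv(tmp / "q6_X_test.csv", index=False)
    y.iloc[:4].to_csv(tmp / "q6_y_train.csv", index=False)
    y.iloc[4:].to_csv(tmp / "q6_y_test.csv", index=False)
    (tmp / "q6_train_test_info.txt").write_text("Split Method: Temporal\n")
    problems = check_q6_artifacts(tmp, "Air Temperature")

print(problems or "All checks passed")
```

In the notebook itself, the equivalent call would be `check_q6_artifacts("output", target_var)` after the save cells have run.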

---

**Next:** Continue to `q7_modeling.md` for Modeling.


In [21]:
# =========================
# Target Variable
# =========================
target_var = "Air Temperature"


In [22]:
# =========================
# Feature Selection
# =========================

# Drop target from feature set
X = df.drop(columns=[target_var])

# Remove any accidental datetime columns
X = X.select_dtypes(exclude=["datetime64[ns]"])

# Separate target
y = df[target_var]

print("✅ Features and target separated")
print("Number of features:", X.shape[1])


✅ Features and target separated
Number of features: 21


In [23]:
# Identify categorical columns
categorical_cols = X.select_dtypes(include=["object"]).columns
categorical_cols



Index(['Station Name', 'Measurement Timestamp Label', 'Measurement ID'], dtype='object')

In [24]:
# ✅ SAFE categorical columns only (low cardinality)
safe_categorical_cols = ["Station Name"]  # add others ONLY if small categories

# ✅ One-hot encode ONLY these columns
X = pd.get_dummies(X, columns=safe_categorical_cols, drop_first=True)

print("✅ Safe categorical encoding complete")
print("New feature count:", X.shape[1])


✅ Safe categorical encoding complete
New feature count: 22


In [25]:
# ❌ Drop any unique identifiers that should never be modeled
bad_id_cols = ["Measurement ID", "Measurement Timestamp Label"]

X = X.drop(columns=[c for c in bad_id_cols if c in X.columns])

print("✅ High-cardinality ID columns removed")


✅ High-cardinality ID columns removed


In [26]:
# View all final feature names
feature_names = X.columns.tolist()

print(f"Total number of features: {len(feature_names)}")
feature_names[:30]  # preview first 30


Total number of features: 20


['Wet Bulb Temperature',
 'Humidity',
 'Rain Intensity',
 'Interval Rain',
 'Total Rain',
 'Precipitation Type',
 'Wind Direction',
 'Wind Speed',
 'Maximum Wind Speed',
 'Barometric Pressure',
 'Solar Radiation',
 'Heading',
 'Battery Life',
 'wind_speed_squared',
 'is_raining',
 'pressure_change',
 'solar_radiation_normalized',
 'wind_humidity_interaction',
 'Station Name_Foster Weather Station',
 'Station Name_Oak Street Weather Station']

In [27]:
features_to_drop = [
    "Wet Bulb Temperature",   # true leakage
    "Heading",                # physically meaningless
    "Battery Life",           # sensor health variable
    "wind_speed_squared",     # redundant derived feature
    "Solar Radiation",        # keeping normalized version instead
    "pressure_change",
]

X = X.drop(columns=features_to_drop, errors="ignore")

print("✅ Dropped features based on leakage & redundancy control:")
for f in features_to_drop:
    print(" -", f)

print("✅ Final feature count:", X.shape[1])


✅ Dropped features based on leakage & redundancy control:
 - Wet Bulb Temperature
 - Heading
 - Battery Life
 - wind_speed_squared
 - Solar Radiation
 - pressure_change
✅ Final feature count: 14


In [29]:
# =========================
# Temporal Train/Test Split (70/30)
# =========================

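# Sort chronologically before splitting. X and y share the same index, so a
# stable sort reorders both identically and keeps their rows aligned.
X = X.sort_index(kind="mergesort")
y = y.sort_index(kind="mergesort")

split_index = int(len(X) * 0.7)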

X_train = X.iloc[:split_index]
X_test = X.iloc[split_index:]

y_train = y.iloc[:split_index]
y_test = y.iloc[split_index:]

print("✅ Temporal train/test split completed")
print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])


✅ Temporal train/test split completed
Train size: 137535
Test size: 58944


In [30]:
# =========================
# Save Train/Test CSV Files
# =========================

X_train.to_csv("output/q6_X_train.csv", index=False)
X_test.to_csv("output/q6_X_test.csv", index=False)

y_train.to_csv("output/q6_y_train.csv", index=False)
y_test.to_csv("output/q6_y_test.csv", index=False)

print("✅ All Q6 train/test CSV files saved")


✅ All Q6 train/test CSV files saved


In [31]:
# =========================
# Save Train/Test Info Text File
# =========================

# Read the date ranges from the split itself so they always match the saved rows
train_start = X_train.index.min()
train_end = X_train.index.max()
test_start = X_test.index.min()
test_end = X_test.index.max()

train_test_info = [
    "TRAIN/TEST SPLIT INFORMATION",
    "==========================",
    "",
    "Split Method: Temporal (70/30 split by time)",
    "",
    f"Training Set Size: {len(X_train)} samples",
    f"Test Set Size: {len(X_test)} samples",
    "",
    f"Training Date Range: {train_start} to {train_end}",
    f"Test Date Range: {test_start} to {test_end}",
    "",
    f"Number of Features: {X_train.shape[1]}",
    f"Target Variable: {target_var}"
]

with open("output/q6_train_test_info.txt", "w") as f:
    f.write("\n".join(train_test_info))

print("✅ output/q6_train_test_info.txt saved")


✅ output/q6_train_test_info.txt saved
