# Q6: Modeling Preparation

**Phase 7:** Modeling Preparation  
**Points:** 3

**Focus:** Perform temporal train/test split, select features, handle categorical variables.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 7. This notebook demonstrates temporal train/test splitting (see "Your Approach" section below for the key code pattern).

---

## Setup

In [198]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load feature-engineered data from Q4
df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q4_features.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")

Loaded 196,313 records with features


---

## Objective

Prepare data for modeling by performing temporal train/test split, selecting features, and handling categorical variables.

**CRITICAL - Temporal Split:** For time series data, you **MUST** use temporal splitting (earlier data for training, later data for testing). **DO NOT** use a random split. Why? Time series observations depend on their neighbors in time. A random split scatters future observations into the training set, so the model is tested on periods it has effectively already seen, which is data leakage.
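As a minimal sketch of the temporal split described above, the following uses a synthetic hourly series in place of the real sensor data. The frame `demo`, the helper `temporal_split`, and the column names are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series standing in for the real sensor data (assumption).
rng = np.random.default_rng(0)
timestamps = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(1000), unit="h")
demo = pd.DataFrame(
    {"feature": rng.normal(size=1000), "target": rng.normal(size=1000)},
    index=timestamps,
)


def temporal_split(frame, train_ratio=0.8):
    """Sort by the datetime index, then cut at a row position."""
    ordered = frame.sort_index()
    cut = int(len(ordered) * train_ratio)
    return ordered.iloc[:cut], ordered.iloc[cut:]


train_demo, test_demo = temporal_split(demo)

# The latest training timestamp should precede the earliest test timestamp.
print(train_demo.index.max(), "<", test_demo.index.min())
print(len(train_demo), len(test_demo))
```

If several rows share one timestamp (for example, multiple stations reporting in the same hour), a position-based cut can place the same timestamp on both sides of the boundary. Splitting on a date threshold instead is one way to avoid that.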

---

## Required Artifacts

You must create exactly these 5 files in the `output/` directory:

### 1. `output/q6_X_train.csv`
**Format:** CSV file
**Content:** Training features (X)
**Requirements:**
- All feature columns (no target variable)
- Only training data (earlier time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 2. `output/q6_X_test.csv`
**Format:** CSV file
**Content:** Test features (X)
**Requirements:**
- All feature columns (same as X_train)
- Only test data (later time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 3. `output/q6_y_train.csv`
**Format:** CSV file
**Content:** Training target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only training data (corresponding to X_train)
- **No index column** (save with `index=False`)

**Example:**
```csv
Water Temperature
15.2
15.3
15.1
...
```

### 4. `output/q6_y_test.csv`
**Format:** CSV file
**Content:** Test target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only test data (corresponding to X_test)
- **No index column** (save with `index=False`)
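The `index=False` requirement shared by the four CSV artifacts above can be checked with a quick round trip. This is a sketch on toy frames written to a temporary directory. `X_demo`, `y_demo`, and their values are placeholders, not the assignment's real data.

```python
import os
import tempfile

import pandas as pd

# Toy frames with a datetime index, mimicking the shape of the real data.
stamps = pd.to_datetime(["2020-01-01 09:00", "2020-01-01 10:00"])
X_demo = pd.DataFrame({"Humidity": [61.0, 63.5], "hour": [9, 10]}, index=stamps)
y_demo = pd.DataFrame({"Air Temperature": [7.0, 7.4]}, index=stamps)

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "q6_X_train.csv")
    y_path = os.path.join(tmp, "q6_y_train.csv")

    # index=False keeps the datetime index out of the file.
    X_demo.to_csv(x_path, index=False)
    y_demo.to_csv(y_path, index=False)

    X_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)

# Only feature columns should come back, and X and y should stay row-aligned.
print(list(X_back.columns))
print(list(y_back.columns))
print(len(X_back) == len(y_back))
```

A column named `Unnamed: 0` or `Measurement Timestamp` in the reloaded frame would indicate that the index was written to the file.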

### 5. `output/q6_train_test_info.txt`
**Format:** Plain text file
**Content:** Train/test split information
**Required information:**
- Split method: Temporal (80/20 or similar)
- Training set size: [number] samples
- Test set size: [number] samples
- Training date range: [start] to [end]
- Test date range: [start] to [end]
- Number of features: [number]
- Target variable: [name]

**Example format:**
```
TRAIN/TEST SPLIT INFORMATION
==========================

Split Method: Temporal (80/20 split by time)

Training Set Size: 40000 samples
Test Set Size: 10000 samples

Training Date Range: 2022-01-01 00:00:00 to 2026-09-15 07:00:00
Test Date Range: 2026-09-15 08:00:00 to 2027-09-15 07:00:00

Number of Features: 22
Target Variable: Water Temperature
```
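One possible way to produce a report in the format above is to build the text from the train and test datetime indexes, so each date range is read from its own split. The helper `format_split_info` and the toy index below are hypothetical.

```python
import pandas as pd


def format_split_info(train_index, test_index, n_features, target_name,
                      method="Temporal (80/20 split by time)"):
    """Build the report text from the train/test datetime indexes."""
    lines = [
        "TRAIN/TEST SPLIT INFORMATION",
        "==========================",
        "",
        f"Split Method: {method}",
        "",
        f"Training Set Size: {len(train_index)} samples",
        f"Test Set Size: {len(test_index)} samples",
        "",
        f"Training Date Range: {train_index.min()} to {train_index.max()}",
        f"Test Date Range: {test_index.min()} to {test_index.max()}",
        "",
        f"Number of Features: {n_features}",
        f"Target Variable: {target_name}",
    ]
    return "\n".join(lines) + "\n"


# Toy hourly index for illustration only: 8 training rows, 2 test rows.
idx = pd.Timestamp("2020-01-01") + pd.to_timedelta(list(range(10)), unit="h")
report = format_split_info(idx[:8], idx[8:], n_features=3,
                           target_name="Water Temperature")
print(report)
```

Writing the result is then a single `f.write(report)` inside `open('output/q6_train_test_info.txt', 'w')`.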

---

## Requirements Checklist

- [ ] Target variable selected
- [ ] Temporal train/test split performed (train on earlier data, test on later data - **NOT random split**)
- [ ] Features selected and prepared
- [ ] Categorical variables handled (encoding if needed)
- [ ] No data leakage (future data not in training set)
- [ ] All 5 required artifacts saved with exact filenames

---

## Your Approach

1. **Select target variable** - Choose a meaningful numeric variable to predict
2. **Select features** - Exclude target, non-numeric columns, and any features derived from the target (to avoid data leakage)
3. **Handle categorical variables** - One-hot encode if needed
4. **Perform temporal train/test split** - Sort by datetime, then split by index position (earlier data for training, later for testing)
5. **Save artifacts** - Save X_train, X_test, y_train, y_test as separate CSVs
6. **Document split** - Record split sizes, date ranges, and feature count

---

## Feature Selection Guidelines

When selecting features for modeling, think critically about each feature:

**Red Flags to Watch For:**
- **Circular logic**: Does this feature use the target variable to predict the target?
  - Example: Rolling mean of target, lag of target (if not handled carefully)
  - Example: If predicting `Air Temperature`, using `air_temp_rolling_7h` is circular - you're predicting temperature from smoothed temperature
- **Data leakage**: Does this feature contain information that wouldn't be available at prediction time?
  - Example: Future values, aggregated statistics that include the current value
- **Near-duplicates**: Is this feature nearly identical to the target?
  - Check correlations - if correlation > 0.95, investigate whether it's legitimate
  - Example: A feature with 99%+ correlation with the target is likely problematic

**Good Practices:**
- Use external predictors (other weather variables, temporal features)
- Create rolling windows of **predictors**, not the target
  - Good: `wind_speed_rolling_7h`, `humidity_rolling_24h`
  - Bad: `air_temp_rolling_7h` when predicting Air Temperature
- Use derived features that combine multiple predictors
- Think: "Would I have this information when making a real prediction?"

**Remember:** The goal is to predict the target from **other** information, not from the target itself.
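The correlation check from the red flags above could be sketched as follows. The data is synthetic, and the helper `screen_features` and the columns `air_temp_smoothed` and `neg_driver` are invented for illustration. Ranking by absolute value keeps strongly negative correlations in view, which a signed descending sort would push to the bottom.

```python
import numpy as np
import pandas as pd

# Synthetic data: one near-duplicate of the target, one negatively
# correlated predictor, and one unrelated predictor (all assumptions).
rng = np.random.default_rng(1)
n = 500
temp = rng.normal(15, 5, n)
toy = pd.DataFrame({
    "Air Temperature": temp,
    "air_temp_smoothed": temp + rng.normal(0, 0.1, n),
    "Humidity": rng.normal(60, 10, n),
    "neg_driver": -temp + rng.normal(0, 5, n),
})


def screen_features(frame, target, threshold=0.95):
    """Rank numeric features by |correlation| with the target and flag
    any feature whose absolute correlation exceeds the threshold."""
    numeric = frame.select_dtypes(include=[np.number])
    corr = numeric.corr()[target].drop(target)
    ranked = corr.abs().sort_values(ascending=False)
    suspicious = ranked[ranked > threshold].index.tolist()
    return ranked, suspicious


ranked, suspicious = screen_features(toy, "Air Temperature")
print(ranked)
print("Investigate before using:", suspicious)
```

A flagged feature is not automatically invalid. The flag is a prompt to ask whether the feature is derived from the target or would be unavailable at prediction time.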

---

## Decision Points

- **Target variable:** What do you want to predict? Temperature? Water conditions? Choose something meaningful and measurable.
- **Temporal split:** **CRITICAL** - Use temporal split (earlier data for training, later data for testing), NOT random split. Why? Time series data has temporal dependencies. Typical split: 80/20 or 70/30.
- **Feature selection:** Which features are most relevant? Consider correlations, domain knowledge, and feature importance from previous analysis.
- **Categorical encoding:** If you have categorical variables, encode them (one-hot encoding, label encoding, etc.) before modeling.
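For the categorical encoding decision, here is a hedged sketch of one-hot encoding with `pd.get_dummies` when train and test are encoded separately. The category values are invented for illustration. Aligning the test columns to the training columns handles categories that appear in only one split.

```python
import pandas as pd

# Toy frames; the category labels here are invented for illustration.
train_cat = pd.DataFrame({
    "Precipitation Type": ["rain", "none", "snow"],
    "Humidity": [80, 55, 70],
})
test_cat = pd.DataFrame({
    "Precipitation Type": ["none", "hail"],  # "hail" never appears in train
    "Humidity": [50, 65],
})

train_enc = pd.get_dummies(train_cat, columns=["Precipitation Type"], dtype=int)
test_enc = pd.get_dummies(test_cat, columns=["Precipitation Type"], dtype=int)

# Force the test frame onto the training columns: categories unseen in
# training are dropped, and categories missing from test are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Encoding the full frame before splitting also yields matching columns. The separate-encoding version is shown because it mirrors what happens when new data arrives after the model is trained.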

---

## Checkpoint

After Q6, you should have:
- [ ] Temporal train/test split completed (earlier → train, later → test)
- [ ] Features prepared (no target, no datetime index)
- [ ] Categorical variables encoded
- [ ] No data leakage verified
- [ ] All 5 artifacts saved: `q6_X_train.csv`, `q6_X_test.csv`, `q6_y_train.csv`, `q6_y_test.csv`, `q6_train_test_info.txt`

---

**Next:** Continue to `q7_modeling.md` for Modeling.


In [199]:
# target variable: ['Air Temperature']

In [200]:
print(df.columns)
print(df.dtypes)

Index(['Station Name', 'Air Temperature', 'Wet Bulb Temperature', 'Humidity',
       'Rain Intensity', 'Interval Rain', 'Total Rain', 'Precipitation Type',
       'Wind Direction', 'Wind Speed', 'Maximum Wind Speed',
       'Barometric Pressure', 'Solar Radiation', 'Heading', 'Battery Life',
       'Measurement Timestamp Label', 'Measurement ID', 'hour', 'day_of_week',
       'month', 'year', 'day_name', 'is_weekend', 'Wind Speed Squared',
       'Is Raining', 'Is Summer', 'Is Winter', 'Is Spring', 'Is Fall',
       'Maximum Wind Speed - Wind Speed', 'Wind Direction X',
       'Wind Direction Y', 'Rain Intensity x Humidity', 'Sin_Hour', 'Cos_Hour',
       'barometric_pressure_rolling_mean_7h', 'humidity_rolling_mean_7h',
       'solar_radiation_rolling_mean_7h', 'total_rain_rolling_mean_24h'],
      dtype='object')
Station Name                            object
Air Temperature                        float64
Wet Bulb Temperature                   float64
Humidity                        

In [201]:
# One Hot Encoding for Categorical Variables

#df = pd.get_dummies(df, drop_first=True)
# df.head()

# Excluding the Station Name column because I expect the station to have little effect on Air Temperature

# exclude Measurement Timestamp Label and Measurement ID because they are similar to Measurement Timestamp


correlation_matrix_columns = df.select_dtypes(include=[np.number])

excluded = ['Station Name', 'Measurement ID', 'Measurement Timestamp Label']


correlation_matrix = correlation_matrix_columns.corr()['Air Temperature'].sort_values(ascending=False)

print(correlation_matrix)



Air Temperature                        1.000000
Wet Bulb Temperature                   0.827626
Is Summer                              0.622920
total_rain_rolling_mean_24h            0.469528
Total Rain                             0.371333
solar_radiation_rolling_mean_7h        0.340513
month                                  0.271475
Solar Radiation                        0.241084
Is Fall                                0.086654
Wind Direction Y                       0.084037
hour                                   0.062972
humidity_rolling_mean_7h               0.038008
Heading                                0.012986
Interval Rain                          0.012157
Humidity                               0.008020
Maximum Wind Speed - Wind Speed       -0.000475
year                                  -0.009219
Rain Intensity x Humidity             -0.010848
Rain Intensity                        -0.011194
is_weekend                            -0.013362
day_of_week                           -0

In [202]:
# Exclude Wet Bulb Temperature because it is too similar a measurement to Air Temperature

In [203]:
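# Select the top 20 features by correlation with Air Temperature, excluding Wet Bulb Temperature.
# head(22) covers the target itself, Wet Bulb Temperature, and the 20 features that are kept.
# Note: the ranking uses signed correlation, so strongly negative correlations are not selected.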
features_for_pred = correlation_matrix.head(22).index.tolist()

features_for_pred.remove('Wet Bulb Temperature')

print(features_for_pred)
print(len(features_for_pred))

# Combine both exclusion sources; dict.fromkeys drops duplicates while preserving order
excluded_features = list(dict.fromkeys([col for col in df.columns if col not in features_for_pred] + excluded))
num_excluded = len(excluded_features)

print(f"Excluded features: {excluded_features}")
print(f"Number of excluded features: {num_excluded}")

df_modeling = df[features_for_pred]

print(df_modeling)

['Air Temperature', 'Is Summer', 'total_rain_rolling_mean_24h', 'Total Rain', 'solar_radiation_rolling_mean_7h', 'month', 'Solar Radiation', 'Is Fall', 'Wind Direction Y', 'hour', 'humidity_rolling_mean_7h', 'Heading', 'Interval Rain', 'Humidity', 'Maximum Wind Speed - Wind Speed', 'year', 'Rain Intensity x Humidity', 'Rain Intensity', 'is_weekend', 'day_of_week', 'Cos_Hour']
21
Excluded features: ['Station Name', 'Wet Bulb Temperature', 'Precipitation Type', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Barometric Pressure', 'Battery Life', 'Measurement Timestamp Label', 'Measurement ID', 'day_name', 'Wind Speed Squared', 'Is Raining', 'Is Winter', 'Is Spring', 'Wind Direction X', 'Sin_Hour', 'barometric_pressure_rolling_mean_7h']
Number of excluded features: 18
                       Air Temperature  Is Summer  \
Measurement Timestamp                               
2015-04-25 09:00:00               7.00        

In [204]:
df_official = df_modeling.reset_index().sort_values('Measurement Timestamp').copy()

train_ratio = 0.8

split_date = df_official['Measurement Timestamp'].quantile(train_ratio)

train = df_official[df_official['Measurement Timestamp'] < split_date].copy()
test = df_official[df_official['Measurement Timestamp'] >= split_date].copy()

display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Num_Rows': [f"{len(train):,}", f"{len(test):,}"],
    'Date Range': [
        f"{train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}",
        f"{test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}"
    ]
}))

print(features_for_pred)

features_for_pred.remove('Air Temperature')

print(features_for_pred)


# print(features)

target = ['Air Temperature']

print(target)


| | Dataset | Num_Rows | Date Range |
|---|---|---|---|
| 0 | Train | 157049 | 2015-04-25 09:00:00 to 2023-07-05 22:00:00 |
| 1 | Test | 39264 | 2023-07-05 23:00:00 to 2025-12-03 10:00:00 |


['Air Temperature', 'Is Summer', 'total_rain_rolling_mean_24h', 'Total Rain', 'solar_radiation_rolling_mean_7h', 'month', 'Solar Radiation', 'Is Fall', 'Wind Direction Y', 'hour', 'humidity_rolling_mean_7h', 'Heading', 'Interval Rain', 'Humidity', 'Maximum Wind Speed - Wind Speed', 'year', 'Rain Intensity x Humidity', 'Rain Intensity', 'is_weekend', 'day_of_week', 'Cos_Hour']
['Is Summer', 'total_rain_rolling_mean_24h', 'Total Rain', 'solar_radiation_rolling_mean_7h', 'month', 'Solar Radiation', 'Is Fall', 'Wind Direction Y', 'hour', 'humidity_rolling_mean_7h', 'Heading', 'Interval Rain', 'Humidity', 'Maximum Wind Speed - Wind Speed', 'year', 'Rain Intensity x Humidity', 'Rain Intensity', 'is_weekend', 'day_of_week', 'Cos_Hour']
['Air Temperature']


In [205]:
X_train = train[features_for_pred].copy()
X_test = test[features_for_pred].copy()

y_train = train[target].copy()
y_test = test[target].copy()

display(pd.DataFrame({
    'Dataset': ['X_train', 'X_test'],
    'Shape': [
        f"{X_train.shape[0]:,} × {X_train.shape[1]}",
        f"{X_test.shape[0]:,} × {X_test.shape[1]}"
    ]
}))

| | Dataset | Shape |
|---|---|---|
| 0 | X_train | 157,049 × 20 |
| 1 | X_test | 39,264 × 20 |


In [206]:
# Save CSVs (index=False keeps the datetime index out of the files)

X_train.to_csv('output/q6_X_train.csv', index = False)

# print(X_train.isna().sum())

X_test.to_csv('output/q6_X_test.csv', index = False)

y_train.to_csv('output/q6_y_train.csv', index = False)

y_test.to_csv('output/q6_y_test.csv', index = False)

In [207]:
# Generate report

with open('output/q6_train_test_info.txt', 'w') as f:
    f.write('TRAIN/TEST SPLIT INFORMATION\n')
    f.write('Split Method: Temporal (80/20 split by time)\n')
    f.write(f'Training Set Size: {len(X_train)}\n')
    f.write(f'Test Set Size: {len(X_test)}\n')
    f.write(f"Training Date Range: {train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}\n")
    # The test range must end at the test set's own maximum timestamp, not the training set's
    f.write(f"Test Date Range: {test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}\n")
    f.write(f"Number of Features: {len(X_train.columns)}\n")
    f.write(f"Target Variable: {target[0]}\n")
