# Q6: Modeling Preparation

**Phase 7:** Modeling Preparation  
**Points: 3 points**

**Focus:** Perform temporal train/test split, select features, handle categorical variables.

**Lecture Reference:** Lecture 11, Notebook 3 ([`11/demo/03_pattern_analysis_modeling_prep.ipynb`](https://github.com/christopherseaman/datasci_217/blob/main/11/demo/03_pattern_analysis_modeling_prep.ipynb)), Phase 7. This notebook demonstrates temporal train/test splitting (see "Your Approach" section below for the key code pattern).

---

## Setup

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

# Load feature-engineered data from Q4
df = pd.read_csv('output/q4_features.csv', parse_dates=['Measurement Timestamp'], index_col='Measurement Timestamp')
# Or if you saved without index:
# df = pd.read_csv('output/q4_features.csv')
# df['Measurement Timestamp'] = pd.to_datetime(df['Measurement Timestamp'])
# df = df.set_index('Measurement Timestamp')
print(f"Loaded {len(df):,} records with features")


Loaded 120,394 records with features


In [2]:
print(df.head())
df.dtypes

# The int64 columns caused trouble, so convert them to float64
int64_columns = ['Humidity', 'Wind Direction']
df[int64_columns] = df[int64_columns].astype('float64')

# Round voltage_perminute to 3 decimal places
df['voltage_perminute'] = df['voltage_perminute'].round(3)
print(df[['voltage_perminute']].head())

                                      Station Name  Air Temperature  \
Measurement Timestamp                                                 
2015-04-25 09:00:00    63rd Street Weather Station              7.0   
2015-04-30 05:00:00    63rd Street Weather Station              6.1   
2015-05-22 15:00:00     Oak Street Weather Station             17.7   
2015-05-22 17:00:00     Oak Street Weather Station             17.7   
2015-05-22 18:00:00     Oak Street Weather Station             17.7   

                       Wet Bulb Temperature  Humidity  Rain Intensity  \
Measurement Timestamp                                                   
2015-04-25 09:00:00                     5.9        86             0.0   
2015-04-30 05:00:00                     4.3        76             0.0   
2015-05-22 15:00:00                     7.0        55             0.0   
2015-05-22 17:00:00                     6.3        56             0.0   
2015-05-22 18:00:00                     6.5        54           

---

## Objective

Prepare data for modeling by performing temporal train/test split, selecting features, and handling categorical variables.

**CRITICAL - Temporal Split:** For time series data, you **MUST** use temporal splitting (earlier data for training, later data for testing). **DO NOT** use random split. Why? Time series data has temporal dependencies - using future data to predict the past would be data leakage.

---

## Required Artifacts

You must create exactly these 5 files in the `output/` directory:

### 1. `output/q6_X_train.csv`
**Format:** CSV file
**Content:** Training features (X)
**Requirements:**
- All feature columns (no target variable)
- Only training data (earlier time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 2. `output/q6_X_test.csv`
**Format:** CSV file
**Content:** Test features (X)
**Requirements:**
- All feature columns (same as X_train)
- Only test data (later time periods)
- **No index column** (save with `index=False`)
- **No datetime column** (unless it's a feature, not the index)

### 3. `output/q6_y_train.csv`
**Format:** CSV file
**Content:** Training target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only training data (corresponding to X_train)
- **No index column** (save with `index=False`)

**Example:**
```csv
Water Temperature
15.2
15.3
15.1
...
```
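A minimal sketch of how a single-column file in this format can be produced, assuming the target is held in a pandas Series whose `name` is the target column. The values and the in-memory buffer are illustrative only; in the assignment the Series would be written to `output/q6_y_train.csv`.

```python
import io

import pandas as pd

# Illustrative target values; in the assignment this would be y_train
y_train = pd.Series([15.2, 15.3, 15.1], name="Water Temperature")

# index=False drops the index; the Series name becomes the header row
buffer = io.StringIO()
y_train.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

If the Series has no name, the header will not match the target variable, so it is worth checking `y_train.name` before saving.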

### 4. `output/q6_y_test.csv`
**Format:** CSV file
**Content:** Test target variable (y)
**Requirements:**
- Single column with target variable name as header
- Only test data (corresponding to X_test)
- **No index column** (save with `index=False`)

### 5. `output/q6_train_test_info.txt`
**Format:** Plain text file
**Content:** Train/test split information
**Required information:**
- Split method: Temporal (80/20 or similar)
- Training set size: [number] samples
- Test set size: [number] samples
- Training date range: [start] to [end]
- Test date range: [start] to [end]
- Number of features: [number]
- Target variable: [name]

**Example format:**
```
TRAIN/TEST SPLIT INFORMATION
==========================

Split Method: Temporal (80/20 split by time)

Training Set Size: 40000 samples
Test Set Size: 10000 samples

Training Date Range: 2022-01-01 00:00:00 to 2026-09-15 07:00:00
Test Date Range: 2026-09-15 08:00:00 to 2027-09-15 07:00:00

Number of Features: 22
Target Variable: Water Temperature
```
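One possible way to build this text is sketched below. The helper name `format_split_info` and its parameters are hypothetical (they do not come from the lecture notebook), and the timestamps are synthetic.

```python
import pandas as pd


def format_split_info(train_times, test_times, n_features, target, train_ratio=0.8):
    """Hypothetical helper: build the text for q6_train_test_info.txt."""
    pct_train = round(train_ratio * 100)
    lines = [
        "TRAIN/TEST SPLIT INFORMATION",
        "==========================",
        "",
        f"Split Method: Temporal ({pct_train}/{100 - pct_train} split by time)",
        "",
        f"Training Set Size: {len(train_times)} samples",
        f"Test Set Size: {len(test_times)} samples",
        "",
        f"Training Date Range: {train_times.min()} to {train_times.max()}",
        f"Test Date Range: {test_times.min()} to {test_times.max()}",
        "",
        f"Number of Features: {n_features}",
        f"Target Variable: {target}",
    ]
    return "\n".join(lines) + "\n"


# Synthetic hourly timestamps, split 80/20 by position (illustrative only)
timestamps = pd.date_range("2022-01-01", periods=10, freq=pd.Timedelta(hours=1))
info = format_split_info(timestamps[:8], timestamps[8:], n_features=22,
                         target="Water Temperature")
print(info)
```

The resulting string can then be written out with, for example, `pathlib.Path('output/q6_train_test_info.txt').write_text(info)`.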

---

## Requirements Checklist

- [ ] Target variable selected
- [ ] Temporal train/test split performed (train on earlier data, test on later data - **NOT random split**)
- [ ] Features selected and prepared
- [ ] Categorical variables handled (encoding if needed)
- [ ] No data leakage (future data not in training set)
- [ ] All 5 required artifacts saved with exact filenames

---

## Your Approach

1. **Select target variable** - Choose a meaningful numeric variable to predict
2. **Select features** - Exclude target, non-numeric columns, and any features derived from the target (to avoid data leakage)
3. **Handle categorical variables** - One-hot encode if needed
4. **Perform temporal train/test split** - Sort by datetime, then split by index position (earlier data for training, later for testing)
5. **Save artifacts** - Save X_train, X_test, y_train, y_test as separate CSVs
6. **Document split** - Record split sizes, date ranges, and feature count
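
The steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the lecture's exact code; the column names and the 80/20 ratio are assumptions chosen to mirror this assignment.

```python
import numpy as np
import pandas as pd

# Synthetic data: 10 hourly readings in shuffled order (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Measurement Timestamp": pd.date_range("2022-01-01", periods=10,
                                           freq=pd.Timedelta(hours=1)),
    "Wind Speed": rng.normal(5, 1, 10),
    "Air Temperature": rng.normal(15, 3, 10),
}).sample(frac=1, random_state=0)

target = "Air Temperature"

# Sort chronologically so that row position reflects time order
df = df.sort_values("Measurement Timestamp").reset_index(drop=True)

# Split by position: first 80% for training, last 20% for testing
split_idx = int(len(df) * 0.8)
train, test = df.iloc[:split_idx], df.iloc[split_idx:]

# Separate features from the target and drop the datetime column
X_train = train.drop(columns=[target, "Measurement Timestamp"])
X_test = test.drop(columns=[target, "Measurement Timestamp"])
y_train, y_test = train[target], test[target]

# Sanity check: every training timestamp precedes every test timestamp
assert train["Measurement Timestamp"].max() < test["Measurement Timestamp"].min()
print(len(train), len(test))
```

When several rows share the same timestamp (for example, multiple stations reporting at the same hour), a purely positional split can place rows from one timestamp on both sides of the boundary. Splitting on a timestamp cutoff instead avoids that.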

---

## Feature Selection Guidelines

When selecting features for modeling, think critically about each feature:

**Red Flags to Watch For:**
- **Circular logic**: Does this feature use the target variable to predict the target?
  - Example: Rolling mean of target, lag of target (if not handled carefully)
  - Example: If predicting `Air Temperature`, using `air_temp_rolling_7h` is circular - you're predicting temperature from smoothed temperature
- **Data leakage**: Does this feature contain information that wouldn't be available at prediction time?
  - Example: Future values, aggregated statistics that include the current value
- **Near-duplicates**: Is this feature nearly identical to the target?
  - Check correlations - if correlation > 0.95, investigate whether it's legitimate
  - Example: A feature with 99%+ correlation with the target is likely problematic

**Good Practices:**
- Use external predictors (other weather variables, temporal features)
- Create rolling windows of **predictors**, not the target
  - Good: `wind_speed_rolling_7h`, `humidity_rolling_24h`
  - Bad: `air_temp_rolling_7h` when predicting Air Temperature
- Use derived features that combine multiple predictors
- Think: "Would I have this information when making a real prediction?"

**Remember:** The goal is to predict the target from **other** information, not from the target itself.
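
A quick way to screen for the near-duplicate red flag is to compare each numeric feature's correlation with the target against the 0.95 threshold mentioned above. The sketch below uses synthetic data and a hypothetical helper, `flag_suspicious_features`; it only catches linear near-duplicates, so it complements rather than replaces thinking about how each feature was derived.

```python
import numpy as np
import pandas as pd


def flag_suspicious_features(df, target, threshold=0.95):
    """Hypothetical helper: numeric features whose |correlation| with the target exceeds threshold."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr[corr.abs() > threshold]


# Synthetic hourly data: a slow temperature cycle plus small noise (illustrative only)
rng = np.random.default_rng(42)
hours = np.arange(200)
df = pd.DataFrame({
    "Air Temperature": 15 + 10 * np.sin(2 * np.pi * hours / 200) + rng.normal(0, 0.5, 200),
    "Wind Speed": rng.normal(5, 2, 200),
})

# Bad: rolling mean of the target itself (circular)
df["air_temp_rolling_7h"] = df["Air Temperature"].rolling(7, min_periods=1).mean()
# Good: rolling mean of a predictor
df["wind_speed_rolling_7h"] = df["Wind Speed"].rolling(7, min_periods=1).mean()

flagged = flag_suspicious_features(df, target="Air Temperature")
print(flagged)
```

Here only the rolling mean of the target should be flagged; the rolling mean of `Wind Speed` carries no direct information about the target.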

---

## Decision Points

- **Target variable:** What do you want to predict? Temperature? Water conditions? Choose something meaningful and measurable.
- **Temporal split:** **CRITICAL** - Use temporal split (earlier data for training, later data for testing), NOT random split. Why? Time series data has temporal dependencies. Typical split: 80/20 or 70/30.
- **Feature selection:** Which features are most relevant? Consider correlations, domain knowledge, and feature importance from previous analysis.
- **Categorical encoding:** If you have categorical variables, encode them (one-hot encoding, label encoding, etc.) before modeling.
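
One detail worth illustrating for categorical encoding: if train and test are one-hot encoded separately, they can end up with different columns when a category is absent from one split. A minimal sketch of one way to keep them aligned is shown below, fixing the category list from the training data before encoding. The data and category values are made up for illustration.

```python
import pandas as pd

# Illustrative splits; the test set is missing categories seen in training
X_train = pd.DataFrame({"Wind Speed": [3.1, 4.0, 2.2],
                        "Precipitation Type": ["rain", "none", "snow"]})
X_test = pd.DataFrame({"Wind Speed": [5.0, 1.5],
                       "Precipitation Type": ["rain", "rain"]})

# Learn the category list from the training data only
categories = sorted(X_train["Precipitation Type"].dropna().unique())
for frame in (X_train, X_test):
    frame["Precipitation Type"] = pd.Categorical(frame["Precipitation Type"],
                                                 categories=categories)

# With a fixed category list, both encodings produce the same columns,
# and drop_first removes the same reference category in each
X_train_enc = pd.get_dummies(X_train, columns=["Precipitation Type"], drop_first=True)
X_test_enc = pd.get_dummies(X_test, columns=["Precipitation Type"], drop_first=True)

print(list(X_train_enc.columns))
print(list(X_test_enc.columns))
```

With this approach, a category that appears only in the test set becomes missing and is encoded as all zeros, which is usually acceptable for a baseline model.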

---

## Checkpoint

After Q6, you should have:
- [ ] Temporal train/test split completed (earlier → train, later → test)
- [ ] Features prepared (no target, no datetime index)
- [ ] Categorical variables encoded
- [ ] No data leakage verified
- [ ] All 5 artifacts saved: `q6_X_train.csv`, `q6_X_test.csv`, `q6_y_train.csv`, `q6_y_test.csv`, `q6_train_test_info.txt`
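
Some of these checks can be automated once the files exist. The sketch below uses a hypothetical helper, `check_artifacts`, and writes small illustrative files to a temporary directory so that it runs on its own; in the assignment it would be pointed at `output/` instead.

```python
import tempfile
from pathlib import Path

import pandas as pd


def check_artifacts(output_dir, target):
    """Hypothetical helper: basic consistency checks on the four CSV artifacts."""
    out = Path(output_dir)
    X_train = pd.read_csv(out / "q6_X_train.csv")
    X_test = pd.read_csv(out / "q6_X_test.csv")
    y_train = pd.read_csv(out / "q6_y_train.csv")
    y_test = pd.read_csv(out / "q6_y_test.csv")

    problems = []
    if list(X_train.columns) != list(X_test.columns):
        problems.append("X_train and X_test have different columns")
    if target in X_train.columns:
        problems.append("target column found in features")
    if len(X_train) != len(y_train) or len(X_test) != len(y_test):
        problems.append("X and y row counts differ")
    if list(y_train.columns) != [target] or list(y_test.columns) != [target]:
        problems.append("y files must have a single column named after the target")
    if any(col.startswith("Unnamed") for col in X_train.columns):
        problems.append("an index column was saved in X_train")
    return problems


# Illustrative files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    pd.DataFrame({"Wind Speed": [1.0, 2.0, 3.0]}).to_csv(tmp / "q6_X_train.csv", index=False)
    pd.DataFrame({"Wind Speed": [4.0]}).to_csv(tmp / "q6_X_test.csv", index=False)
    pd.DataFrame({"Air Temperature": [10.0, 11.0, 12.0]}).to_csv(tmp / "q6_y_train.csv", index=False)
    pd.DataFrame({"Air Temperature": [13.0]}).to_csv(tmp / "q6_y_test.csv", index=False)
    problems = check_artifacts(tmp, target="Air Temperature")

print(problems or "All checks passed")
```

Because the datetime column is dropped from the saved features, temporal ordering cannot be verified from the CSVs alone; the date ranges recorded in `q6_train_test_info.txt` serve that purpose.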

---

**Next:** Continue to `q7_modeling.md` for Modeling.


In [3]:
# Temporal train/test split (a random split via train_test_split is not used)
# from sklearn.model_selection import train_test_split



# Checked: Measurement Timestamp is the index; no other index column is visible

df_model = df.reset_index().sort_values('Measurement Timestamp').copy()
# Check the available columns
print(df_model.columns.tolist())

# Measurement Timestamp caused trouble during modeling because it was stored as object.
# Extract the hour before training to preserve time-series information.
df_model['hour'] = df_model['Measurement Timestamp'].dt.hour

# Use an 80/20 split
TRAIN_RATIO = 0.80
# Define the temporal split point
split_date = df_model['Measurement Timestamp'].quantile(TRAIN_RATIO)
# Create the train/test split
train = df_model[df_model['Measurement Timestamp'] < split_date].copy()
test = df_model[df_model['Measurement Timestamp'] >= split_date].copy()



from IPython.display import display, Markdown
#summarize the split 
display(Markdown("Temporal Train/Test Split"))
display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Rows': [f"{len(train):,}", f"{len(test):,}"],
    'Date Range': [
        f"{train['Measurement Timestamp'].min()} to {train['Measurement Timestamp'].max()}",
        f"{test['Measurement Timestamp'].min()} to {test['Measurement Timestamp'].max()}"
    ]
}))
display(Markdown(f"Split date:{split_date}"))


# Save the training set: `output/q6_X_train.csv` (save with `index=False`)
### train.to_csv('output/q6_X_train.csv', index=False)
# Save the test set: `output/q6_X_test.csv` (save with `index=False`)
## test.to_csv('output/q6_X_test.csv', index=False)

['Measurement Timestamp', 'Station Name', 'Air Temperature', 'Wet Bulb Temperature', 'Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain', 'Precipitation Type', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Barometric Pressure', 'Solar Radiation', 'Heading', 'Battery Life', 'Measurement Timestamp Label', 'Measurement ID', 'minutes', 'voltage_perminute', 'dry_index']


Temporal Train/Test Split

|   | Dataset | Rows | Date Range |
|---|---------|------|------------|
| 0 | Train | 96315 | 2015-04-25 09:00:00 to 2022-12-08 06:00:00 |
| 1 | Test | 24079 | 2022-12-08 07:00:00 to 2025-12-04 14:00:00 |


Split date:2022-12-08 06:24:00

In [4]:
train['voltage_perminute'] = train['voltage_perminute'].round(3)
print(train[['voltage_perminute']].head())


test['voltage_perminute'] = test['voltage_perminute'].round(3)
print(test[['voltage_perminute']].head())

   voltage_perminute
0                NaN
1              0.002
2              0.000
3              0.101
4              0.202
       voltage_perminute
96315              0.198
96316              0.198
96317              0.198
96318              0.198
96319              0.195


In [30]:
train[['Measurement Timestamp', 'hour']].head(10)
print(train[['Measurement Timestamp', 'hour']].head())
print(test[['Measurement Timestamp', 'hour']].head())

  Measurement Timestamp  hour
0   2015-04-25 09:00:00     9
1   2015-04-30 05:00:00     5
2   2015-05-22 15:00:00    15
3   2015-05-22 17:00:00    17
4   2015-05-22 18:00:00    18
      Measurement Timestamp  hour
96315   2022-12-08 07:00:00     7
96316   2022-12-08 08:00:00     8
96317   2022-12-08 09:00:00     9
96318   2022-12-08 10:00:00    10
96319   2022-12-08 11:00:00    11


In [5]:
### 3. `output/q6_y_train.csv`

# Target: Air Temperature prediction

# Check for data leakage
# First, define our target and potential features
target = 'Air Temperature'

# List all numeric features we might use
# dry_index is not included because it is calculated from air temperature and would create data leakage
numeric_features = ['hour','Measurement Timestamp','Wet Bulb Temperature', 'Humidity ', 'Rain Intensity', 'Interval Rain',
                   'Total Rain ', 'Wind Direction', 'Wind Speed','Maximum Wind Speed','Heading',
                   'Barometric Pressure', 'Solar Radiation', 'Battery Life','voltage_perminute']

# Check which features are actually available in the data
# (names must match df_model.columns exactly; any that do not are skipped)
available_numeric = [f for f in numeric_features if f in df_model.columns]

# Calculate correlation with target
if available_numeric:
    correlation_with_target = df_model[available_numeric + [target]].corr()[target].sort_values(ascending=False)
    print(f"Features correlated with {target}:")
    print(correlation_with_target)
    print()

# Result: Wet Bulb Temperature has a correlation of 0.980755 with the target. This is very high and likely redundant, so it is excluded from the feature list in the next cell.

Features correlated with Air Temperature:
Air Temperature          1.000000
Wet Bulb Temperature     0.980755
Solar Radiation          0.269317
Battery Life             0.136296
hour                     0.063265
Measurement Timestamp    0.022085
Heading                  0.020346
voltage_perminute        0.011220
Wind Direction          -0.167145
Maximum Wind Speed      -0.214320
Wind Speed              -0.230656
Barometric Pressure     -0.252409
Rain Intensity                NaN
Interval Rain                 NaN
Name: Air Temperature, dtype: float64



In [6]:
# First, define our target
target = 'Air Temperature'
# Wet Bulb Temperature is removed because its correlation with the target is > 0.95
# voltage_perminute is dropped because rounding it to 3 digits did not work as expected
# Select features:
feature_cols = [
    # Numeric weather features
    'Humidity', 'Rain Intensity', 'Interval Rain',
    'Total Rain', 'Wind Direction', 'Wind Speed','Maximum Wind Speed','Heading',
    'Barometric Pressure', 'Solar Radiation', 'Battery Life',
    # Categorical
    'Precipitation Type'
]

# Check feature availability
available_features = [f for f in feature_cols if f in df_model.columns]
missing_features = [f for f in feature_cols if f not in df_model.columns]


#display them
display(Markdown("Feature Availability"))
display(Markdown(f"Available features: `{available_features}`"))
if missing_features:
    display(Markdown(f"Missing features (will skip): `{missing_features}`"))


# Select available features
X_train = train[available_features].copy()
X_test = test[available_features].copy()
y_train = train[target].copy()
y_test = test[target].copy()

display(pd.DataFrame({
    'Dataset': ['X_train', 'X_test'],
    'Shape': [
        f"{X_train.shape[0]:,} × {X_train.shape[1]}",
        f"{X_test.shape[0]:,} × {X_test.shape[1]}"
    ]
}))

print(X_train.columns)


Feature Availability

Available features: `['Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Heading', 'Barometric Pressure', 'Solar Radiation', 'Battery Life', 'Precipitation Type']`

|   | Dataset | Shape |
|---|---------|-------|
| 0 | X_train | 96,315 × 12 |
| 1 | X_test | 24,079 × 12 |


Index(['Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain',
       'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Heading',
       'Barometric Pressure', 'Solar Radiation', 'Battery Life',
       'Precipitation Type'],
      dtype='object')


In [7]:
# Identify categorical variables
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()

display(Markdown("Feature Types"))
display(Markdown(f"Categorical features:`{categorical_cols}`"))
display(Markdown(f"Numeric features:`{numeric_cols}`"))

# For simplicity, we'll use pandas get_dummies for one-hot encoding
# In practice, you might use sklearn's OneHotEncoder

X_train_encoded = pd.get_dummies(X_train, columns=categorical_cols, prefix=categorical_cols, drop_first=True)
# Encode the test set without drop_first, then align it to the training columns so both
# sets have identical features (categories missing from the test set are filled with 0)
X_test_encoded = pd.get_dummies(X_test, columns=categorical_cols, prefix=categorical_cols)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

Feature Types

Categorical features:`['Precipitation Type']`

Numeric features:`['Humidity', 'Rain Intensity', 'Interval Rain', 'Total Rain', 'Wind Direction', 'Wind Speed', 'Maximum Wind Speed', 'Heading', 'Barometric Pressure', 'Solar Radiation', 'Battery Life']`

In [34]:
# Checked the missing values earlier: only one was found, so fill and save
# Check for missing values
missing_in_train = X_train_encoded.isnull().sum()[X_train_encoded.isnull().sum() > 0]
if len(missing_in_train) == 0:
    display(Markdown("No missing values"))
else:
    missing_df = pd.DataFrame({'Column': missing_in_train.index, 'Missing Count': missing_in_train.values})
    display(missing_df)

# The only missing value was in voltage_perminute, which is no longer a feature.
# As a safeguard, fill any missing numeric values with the training median
# (the training median is also applied to the test set to avoid leakage).
for col in numeric_cols:
    if col in X_train_encoded.columns:
        median_val = X_train_encoded[col].median()
        X_train_encoded[col] = X_train_encoded[col].fillna(median_val)
        X_test_encoded[col] = X_test_encoded[col].fillna(median_val)

# Check: display the remaining missing values
display(pd.DataFrame({
    'Dataset': ['Train', 'Test'],
    'Missing Values': [
        X_train_encoded.isnull().sum().sum(),
        X_test_encoded.isnull().sum().sum()
    ]
}))

# Save datasets for modeling (create the output directory if it does not exist)
os.makedirs('output', exist_ok=True)
X_train_encoded.to_csv('output/q6_X_train.csv', index=False)
X_test_encoded.to_csv('output/q6_X_test.csv', index=False)
y_train.to_csv('output/q6_y_train.csv', index=False)
y_test.to_csv('output/q6_y_test.csv', index=False)

# Check that the 4 CSV files were created


No missing values

|   | Dataset | Missing Values |
|---|---------|----------------|
| 0 | Train | 0 |
| 1 | Test | 0 |
