Kinect XY to Z Prediction Experiment Pipeline
Introduction
We aim to predict the Kinect sensor's Z-coordinate (depth) for each of 13 body joints using only the X and Y coordinates as input. The Kinect provides 3D positions (X, Y, Z) for each joint across a sequence of frames. By leveraging temporal information, we can improve prediction accuracy: instead of predicting frame-by-frame independently, we use a sliding window of consecutive frames as input, so the model can learn from motion context in recent frames. The task is a multi-output regression: for each window of frames, we predict 13 Z-values (one per joint) corresponding to the last frame in the window. We will experiment with various neural network architectures (fully-connected dense networks, 1D convolutional networks, LSTM recurrent networks, and hybrid combinations) to determine what works best for this sequence modeling problem. Key considerations in our experimental pipeline include:
Using a sliding window per sequence without crossing sequence boundaries (each CSV file is one sequence) to generate training samples.
Implementing 10-fold cross-validation for robust evaluation: training on 9 folds and testing on 1 fold, rotating through all folds.
Conducting a comprehensive grid search over hyperparameters: window size, learning rate, architecture type, number of layers, and units per layer.
Employing GPU acceleration (e.g. an NVIDIA RTX 3060 12GB) to speed up training. We'll use TensorFlow/Keras which automatically utilizes available GPUs for heavy tensor computations.
Using early stopping to halt training when validation loss stops improving (to prevent overfitting and save time) and model checkpointing to save the best model weights for each fold.
Logging the results (fold metrics, hyperparameters, training time) incrementally to a CSV file for analysis.
By the end of this pipeline, we will have a CSV record of the performance of each configuration, which can be analyzed to find the optimal model.
Approach Outline
Data Loading – Load all Kinect CSV files. Each file contains one sequence of frames with 13 joints' coordinates. We combine these into a list of sequences. We drop any unnecessary columns (like frame index) and separate features vs. target.
Data Preprocessing – For each frame, use X and Y coordinates as input features and the Z coordinates as target outputs. Normalize feature values for better training convergence.
Sliding Window Creation – Define a procedure to generate sliding window samples from each sequence. For a given window size W, each sample consists of W consecutive frames' XY inputs and the Z outputs of the last frame in that window.
Model Definition – Implement a function to build a TensorFlow model given a specific architecture type and hyperparameters. We will support:
Dense (fully-connected): Flatten the window input and use a stack of Dense layers.
Conv1D: Use 1D convolution layers across the time dimension of the window to capture motion patterns, then flatten.
LSTM (recurrent): Use LSTM layers to capture temporal dependencies in the sequence.
Hybrid (Conv + Dense): Use some Conv1D layers followed by Dense layers (after flattening) – a combination of convolution for local feature extraction and dense layers for mixing features.
CNN+LSTM: Use Conv1D layer(s) to extract low-level temporal features, then an LSTM layer on top to capture longer-term dependencies.
Dense and Conv1D hidden layers will use ReLU activations with He initialization (suitable for ReLU); LSTM layers keep their Keras default activations. The output layer will be linear (no activation) to predict raw Z values. We use the Adam optimizer (an adaptive variant of stochastic gradient descent) with Mean Squared Error loss for regression and will track Mean Absolute Error (MAE) as an additional metric.
Cross-Validation Training – For each combination of hyperparameters, perform 10-fold cross-validation:
Split the sequences into 10 folds. In each fold, use 9/10 of sequences for training (and validation) and 1/10 for testing.
Within the training folds, hold out a portion of the training data as a validation set for early-stopping monitoring.
EarlyStopping: Monitor validation loss and stop training if it doesn't improve for a set patience (we'll use patience=5).
ModelCheckpoint: Save the model weights that perform best on the validation set during training.
After training each fold, evaluate on that fold's test data and record the MSE/MAE.
Clear the model and GPU memory before the next fold to avoid memory buildup.
Hyperparameter Grid Search – Loop over all combinations of the specified hyperparameter ranges (window sizes, learning rates, architecture types, number of layers, units per layer, etc.). For each configuration, run the 10-fold CV training as above, then log the results (per-fold metrics, average performance, training time).
Logging and Analysis – Incrementally append the results of each experiment to a CSV file. After the grid search, this file will contain the performance of each model variant. We can then analyze this CSV to find which hyperparameters gave the best results (e.g. lowest average MSE).
With this plan, let's proceed step by step.
1. Import Libraries and Configure GPU
First, we import the necessary libraries. We use pandas for data loading/manipulation, NumPy for numerical computations, scikit-learn for data splitting (KFold) and feature scaling, and TensorFlow/Keras for building and training the neural networks. We also enable memory growth on the GPU so that TensorFlow allocates VRAM as it is needed instead of reserving nearly all of it up front (useful when many models are trained sequentially in one process).

In [3]:
# Import required libraries
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

# Configure TensorFlow to use GPU efficiently (if available)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
        print("GPU found. Enabled memory growth on", gpus[0].name)
    except Exception as e:
        print("Error enabling GPU memory growth:", e)
else:
    print("No GPU found. Using CPU for training.")

# Report the environment whether or not a GPU was found
print("TensorFlow version:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices('GPU'))


No GPU found. Using CPU for training.
TensorFlow version: 2.19.0
GPUs: []


Explanation: We import all necessary modules. The set_memory_growth option allows the program to only allocate GPU memory as needed, which is useful when training many models sequentially. If no GPU is present, TensorFlow will automatically fall back to CPU.
2. Data Loading and Preprocessing
Each Kinect CSV file contains one sequence of recorded joint coordinates. We assume all such CSV files are stored in a directory (e.g., "data/kinect_sequences"). We will load each file into a DataFrame and then combine them into a list of sequences. Each sequence DataFrame has columns for each joint's x, y, z coordinates (39 columns for 13 joints, possibly plus a frame-index column). We drop the frame number/index column since it is not a feature for learning. Then we separate the input features (all *_x and *_y columns) and the target outputs (all *_z columns).

In [None]:
# Directory containing Kinect CSV files
data_dir = "data/kinect_sequences"  # replace with your actual path
csv_files = sorted([f for f in os.listdir(data_dir) if f.endswith(".csv")])

# Lists to hold sequences of features (X) and targets (Z)
sequences_X = []
sequences_Z = []

for file in csv_files:
    file_path = os.path.join(data_dir, file)
    try:
        # Load the CSV into a pandas DataFrame
        df = pd.read_csv(file_path)
        print(f"Successfully loaded {file}")
    except Exception as e:
        print(f"Error loading {file}: {e}")
        continue
    
    # Strip any whitespace from column names first, so the name checks below match
    df.columns = df.columns.str.strip()
    
    # Drop frame number column if present (errors='ignore' skips missing names)
    df = df.drop(columns=['FrameNo', 'frame'], errors='ignore')
    
    # Separate feature and target columns
    feature_cols = [col for col in df.columns if col.endswith('_x') or col.endswith('_y')]
    target_cols = [col for col in df.columns if col.endswith('_z')]
    
    X_values = df[feature_cols].to_numpy()
    Z_values = df[target_cols].to_numpy()
    
    sequences_X.append(X_values)
    sequences_Z.append(Z_values)

print(f"Loaded {len(sequences_X)} sequences. Example sequence shape: {sequences_X[0].shape}")


In the above code, we iterate through each CSV file:
Read it into a DataFrame.
Drop the frame number column (FrameNo or frame) if it exists.
Identify feature columns (those ending in _x or _y) and target columns (ending in _z).
Convert those to NumPy arrays and append to our lists. For example, sequences_X[i] will be a NumPy array of shape (num_frames_i, 26) since there are 26 features (13 joints * 2 coordinates), and sequences_Z[i] will be (num_frames_i, 13) with 13 target values per frame.
After loading, sequences_X is a list of length N (number of sequences), where each element is an array of shape (frames_in_sequence, 26). Similarly, sequences_Z contains arrays of shape (frames_in_sequence, 13).
Feature Scaling: It is often beneficial to normalize or standardize input features for neural network training so that all features are on a similar scale. We will use standardization (zero mean, unit variance) for the X and Y inputs.
Important: To avoid data leakage, scaling will be fit separately for each training fold (using only that fold's training data) and then applied to the validation/test data. We will handle this inside the cross-validation loop later.
Targets (Z) could also be scaled, but here we'll predict the actual Z values directly. Since all coordinates are in similar units, we can leave Z unscaled for interpretability.
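The train-only scaling rule can be illustrated with a small NumPy sketch. It uses synthetic arrays in place of sequences_X and a hand-picked split (both are assumptions for illustration), and it computes the same per-feature mean and population standard deviation that StandardScaler would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for sequences_X: 5 sequences of different lengths, 26 features each.
sequences = [rng.normal(loc=3.0, scale=2.0, size=(n, 26)) for n in (40, 55, 30, 62, 48)]

# Hypothetical sequence-level split: four training sequences, one test sequence.
train_idx, test_idx = [0, 1, 2, 3], [4]

# Fit the statistics on training frames only.
train_frames = np.vstack([sequences[i] for i in train_idx])
mean = train_frames.mean(axis=0)
std = train_frames.std(axis=0)   # population std (ddof=0)
std[std == 0] = 1.0              # guard against constant features

def standardize(seq):
    """Apply the training-set statistics to any sequence."""
    return (seq - mean) / std

train_scaled = [standardize(sequences[i]) for i in train_idx]
test_scaled = [standardize(sequences[i]) for i in test_idx]

# Pooled training frames have zero mean and unit variance per feature;
# the test sequence is only approximately standardized, which is expected.
pooled = np.vstack(train_scaled)
print(np.allclose(pooled.mean(axis=0), 0.0), np.allclose(pooled.std(axis=0), 1.0))
```

The test sequence never influences mean or std, which is the property that prevents leakage.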
3. Sliding Window Data Generation
We need to transform each sequence of frames into training samples using a sliding window approach. For a chosen window size W, we generate all possible contiguous windows of length W from a sequence. Each such window will produce one training example:
Input (features): The X and Y coordinates for all 13 joints over the W frames (shape W×26).
Output (target): The Z coordinates for all 13 joints from the last frame of that window (shape 13, one value per joint).
We do not allow windows to wrap around or cross between sequences: windows are generated independently within each sequence. If a sequence has T frames and the window size is W, it yields T - W + 1 samples when T >= W, and no samples when the sequence is shorter than the window. Let's implement a utility function to create sliding window samples for a given sequence.

In [None]:
def create_windows_from_sequence(X_seq, Z_seq, window_size):
    """
    Given a single sequence of features X_seq (shape: num_frames x 26) and 
    targets Z_seq (shape: num_frames x 13), generate all sliding window samples of length window_size.
    Returns:
      X_windows: array of shape (num_samples, window_size, 26)
      Y_windows: array of shape (num_samples, 13) corresponding to Z of last frame in each window
    """
    X_windows = []
    Y_windows = []
    num_frames = X_seq.shape[0]
    if num_frames < window_size:
        # Not enough frames for even one window
        return np.array(X_windows), np.array(Y_windows)
    for start in range(0, num_frames - window_size + 1):
        end = start + window_size
        # Stack frames [start, ..., end-1] as one window
        X_w = X_seq[start:end]              # shape (window_size, 26)
        Y_w = Z_seq[end-1]                 # shape (13,) - Z of last frame
        X_windows.append(X_w)
        Y_windows.append(Y_w)
    # Convert to numpy arrays
    X_windows = np.array(X_windows)
    Y_windows = np.array(Y_windows)
    return X_windows, Y_windows

# Quick test on the first sequence (using a small window for demonstration)
test_w = 5
X_test_win, Y_test_win = create_windows_from_sequence(sequences_X[0], sequences_Z[0], window_size=test_w)
print(f"Created {X_test_win.shape[0]} window samples from one sequence (window={test_w}).")
print("Sample window input shape:", X_test_win[0].shape, "/ Sample window output shape:", Y_test_win[0].shape)


Explanation: The function create_windows_from_sequence iterates through the sequence with a window of length window_size. For each starting index, it slices out a window of the feature array X_seq and takes the corresponding last frame's target from Z_seq. It returns arrays of all windowed inputs and outputs. We included a quick test print to ensure it works (the output should show the number of samples generated and the shapes).
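For long sequences, the Python loop can be replaced by a vectorized version. The sketch below is one possible alternative, not part of the pipeline above: the name create_windows_vectorized is hypothetical, and it assumes NumPy 1.20 or later, where numpy.lib.stride_tricks.sliding_window_view is available.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def create_windows_vectorized(X_seq, Z_seq, window_size):
    """Vectorized equivalent of create_windows_from_sequence (hypothetical helper)."""
    num_frames, num_features = X_seq.shape
    if num_frames < window_size:
        # Sequence too short: return empty arrays with consistent trailing shapes
        return (np.empty((0, window_size, num_features)),
                np.empty((0, Z_seq.shape[1])))
    # sliding_window_view appends the window axis last:
    # shape is (num_samples, num_features, window_size)
    views = sliding_window_view(X_seq, window_size, axis=0)
    # Reorder to (num_samples, window_size, num_features) and copy into a normal array
    X_windows = np.ascontiguousarray(views.transpose(0, 2, 1))
    # The target of each window is the Z row of its last frame
    Y_windows = Z_seq[window_size - 1:]
    return X_windows, Y_windows

# Small synthetic check: 10 frames, window of 4 -> 10 - 4 + 1 = 7 samples
X_demo = np.arange(10 * 26, dtype=float).reshape(10, 26)
Z_demo = np.arange(10 * 13, dtype=float).reshape(10, 13)
X_w, Y_w = create_windows_vectorized(X_demo, Z_demo, 4)
print(X_w.shape, Y_w.shape)
```

Window 2 covers frames 2 to 5, so X_w[2] equals X_demo[2:6] and Y_w[2] equals Z_demo[5].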
4. Model Architecture Definition
Now we define a function to build a Keras model given a specific architecture type and hyperparameters. This function will allow us to dynamically create models for each combination in our grid search. The parameters will include:
arch: The architecture type (one of 'dense', 'conv1d', 'lstm', 'hybrid', 'cnn+lstm').
num_layers: Number of hidden layers to use (excluding the output layer). For different architectures this means:
Dense: number of Dense hidden layers.
Conv1D: number of Conv1D layers.
LSTM: number of LSTM layers (they will be stacked if >1).
Hybrid (Conv + Dense): we will use num_layers // 2 Conv layers and num_layers - (num_layers // 2) Dense layers (approximately half conv, half dense).
CNN+LSTM: we will use (num_layers - 1) Conv1D layers followed by 1 LSTM layer (at least one conv layer before the LSTM).
units: Number of units/filters in each hidden layer. (For simplicity, we use the same size for all layers of a given model.)
learning_rate: Learning rate for the optimizer.
All models will have an input shape of (window_size, 26) and output size of 13. We will use ReLU activation for Dense and Conv1D hidden layers and linear activation for the output layer (since this is a regression problem); LSTM layers keep their Keras default activations. Dense/Conv weights use He initialization, set explicitly with kernel_initializer='he_uniform' (the Keras default would be Glorot uniform). The models will be compiled with the Adam optimizer and Mean Squared Error loss, and we'll track Mean Absolute Error (MAE) as a metric for easier interpretation of error.


In [None]:
def create_model(window_size, arch, num_layers, units, learning_rate):
    """Constructs and compiles a Keras model given the architecture and hyperparameters."""
    model = models.Sequential()
    # Define input shape (window_size timesteps, 26 features per timestep)
    model.add(layers.Input(shape=(window_size, 26)))
    
    if arch == 'dense':
        # Flatten time dimension and use Dense layers
        model.add(layers.Flatten())  # shape becomes (window_size*26,)
        for _ in range(num_layers):
            model.add(layers.Dense(units, activation='relu', kernel_initializer='he_uniform'))
        # Output layer
        model.add(layers.Dense(13, activation='linear'))
    
    elif arch == 'conv1d':
        # Conv1D layers across time dimension
        for _ in range(num_layers):
            model.add(layers.Conv1D(filters=units, kernel_size=3, padding='same',
                                     activation='relu', kernel_initializer='he_uniform'))
        model.add(layers.Flatten())
        model.add(layers.Dense(13, activation='linear'))
    
    elif arch == 'lstm':
        # Stacked LSTM layers
        for i in range(num_layers):
            # If not the last LSTM layer, return sequences to feed next LSTM
            return_seq = (i < num_layers - 1)
            model.add(layers.LSTM(units, return_sequences=return_seq))
        model.add(layers.Dense(13, activation='linear'))
    
    elif arch == 'hybrid':
        # Combination of Conv1D and Dense layers
        conv_count = num_layers // 2
        dense_count = num_layers - conv_count
        # Conv layers first
        for _ in range(conv_count):
            model.add(layers.Conv1D(filters=units, kernel_size=3, padding='same',
                                     activation='relu', kernel_initializer='he_uniform'))
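        # Always flatten before the Dense stack (also when conv_count == 0);
        # otherwise Dense would act per timestep and the output would not be (13,)
        model.add(layers.Flatten())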
        # Dense layers after conv
        for _ in range(dense_count):
            model.add(layers.Dense(units, activation='relu', kernel_initializer='he_uniform'))
        model.add(layers.Dense(13, activation='linear'))
    
    elif arch == 'cnn+lstm':
        # Conv layers followed by a single LSTM layer
        conv_layers = max(1, num_layers - 1)  # ensure at least 1 conv
        for _ in range(conv_layers):
            model.add(layers.Conv1D(filters=units, kernel_size=3, padding='same',
                                     activation='relu', kernel_initializer='he_uniform'))
        # Follow with one LSTM layer
        model.add(layers.LSTM(units, return_sequences=False))
        model.add(layers.Dense(13, activation='linear'))
    
    else:
        raise ValueError(f"Unknown architecture type: {arch}")
    
    # Compile the model with Adam optimizer and MSE loss
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='mse', metrics=['mae'])
    return model

# Test the model creation function for one example configuration
test_model = create_model(window_size=5, arch='conv1d', num_layers=3, units=64, learning_rate=0.001)
test_model.summary()  # summary() prints the table itself and returns None


In the create_model function:
For Dense architecture, we flatten the input (concatenating all frames in the window) and then add the specified number of Dense layers.
For Conv1D, we add the specified number of Conv1D layers. We use padding='same' so that each conv layer preserves the time dimension length, making it easier to stack conv layers. After conv layers, we flatten and then have a Dense output.
For LSTM, we add num_layers LSTM layers. Each LSTM (except the last) is set to return_sequences=True so that the next LSTM receives a sequence input. The final LSTM returns the last output (a feature vector), which we feed into the Dense output.
For Hybrid, we split the layers between Conv and Dense. We take roughly half of the layers as Conv1D (rounded down) and the rest as Dense. After the Conv layers, we flatten and then add the Dense layers. (If num_layers is odd, e.g. 5, this will produce 2 conv and 3 dense layers, etc.)
For CNN+LSTM, we use all but one layer as Conv1D, and then a single LSTM at the end. For example, if num_layers=3, we'll have 2 Conv1D layers followed by 1 LSTM layer. If num_layers=1, our implementation still uses one Conv layer (since conv_layers = max(1, num_layers-1) is 1) followed by an LSTM, so num_layers=1 and num_layers=2 build the same model with 1 conv and 1 LSTM layer.
All models end with a Dense layer of 13 units (linear activation) for the output, matching the 13 target joint Z-coordinates. We printed a summary of a test model to verify the architecture (in practice this will show the layer types and output shapes for the given configuration).
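To make the meaning of num_layers concrete for each architecture, the rules above can be restated as a small pure-Python helper. The function layer_breakdown is a hypothetical name introduced only for this illustration; it mirrors the counting logic in create_model without building any model.

```python
def layer_breakdown(arch, num_layers):
    """Return (conv_layers, lstm_layers, dense_hidden_layers) for a configuration.

    Mirrors the counting rules used in create_model; the 13-unit output
    layer is not included in the counts.
    """
    if arch == 'dense':
        return (0, 0, num_layers)
    if arch == 'conv1d':
        return (num_layers, 0, 0)
    if arch == 'lstm':
        return (0, num_layers, 0)
    if arch == 'hybrid':
        conv = num_layers // 2          # half of the layers, rounded down
        return (conv, 0, num_layers - conv)
    if arch == 'cnn+lstm':
        return (max(1, num_layers - 1), 1, 0)  # at least one conv, then one LSTM
    raise ValueError(f"Unknown architecture type: {arch}")

for n in (2, 5, 12):
    print(n, layer_breakdown('hybrid', n), layer_breakdown('cnn+lstm', n))
# 2 (1, 0, 1) (1, 1, 0)
# 5 (2, 0, 3) (4, 1, 0)
# 12 (6, 0, 6) (11, 1, 0)
```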
5. Training with Cross-Validation and Hyperparameter Search
With data loading, preprocessing, and model definition in place, we now set up the grid search over hyperparameters and perform 10-fold cross-validation for each combination. Hyperparameters to grid-search:
Window sizes: [3, 5, 7, 9, 11, 13, 15, 17, 20]
Learning rates: [0.5, 0.01, 0.005, 0.001, 0.0005, 0.0001]
Architectures: ['dense', 'conv1d', 'hybrid', 'lstm', 'cnn+lstm']
Number of layers: [2, 3, 4, 5, 6, 8, 10, 12] (recall: for conv, dense, etc., this is the hidden layer count as defined earlier)
Units per layer: [64, 128, 256, 512]
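Before launching the search it helps to count how many runs this grid implies. The sketch below enumerates the grid with itertools.product; the seconds_per_fit value is a made-up placeholder for a rough time estimate, not a measurement.

```python
from itertools import product

window_sizes = [3, 5, 7, 9, 11, 13, 15, 17, 20]
learning_rates = [0.5, 0.01, 0.005, 0.001, 0.0005, 0.0001]
architectures = ['dense', 'conv1d', 'hybrid', 'lstm', 'cnn+lstm']
num_layers_list = [2, 3, 4, 5, 6, 8, 10, 12]
units_list = [64, 128, 256, 512]
n_folds = 10

# Every combination of the five hyperparameters
grid = list(product(architectures, window_sizes, num_layers_list,
                    units_list, learning_rates))
total_fits = len(grid) * n_folds

print(len(grid))   # 8640 configurations
print(total_fits)  # 86400 model fits

# Rough wall-clock estimate under an ASSUMED average time per fit
seconds_per_fit = 30  # placeholder value, replace with a measured average
print(total_fits * seconds_per_fit / 3600, "hours")  # 720.0 hours
```

Timing a handful of configurations first gives a realistic value for seconds_per_fit.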
We will use 10-fold cross-validation for each combination. We create a KFold splitter to generate train/test sequence splits. For each fold, we will:
Determine which sequences are in the training set and which are in the test set for that fold.
Fit a StandardScaler on the training set's feature data (all frames from the training sequences) and transform the features of both training and test data. This ensures the model is trained on standardized inputs, and the test inputs are scaled consistently.
Generate sliding window samples for all training sequences (these become our training data for the model) and for the test sequences (for evaluation). Important: We will generate the windows after scaling the features so that each sample is properly normalized.
Split a portion of the training windows off for validation (to monitor early stopping). We can use, for example, 10% of the training windows as a validation set.
Create the model for the current hyperparameter combo using create_model.
Train the model on the training windows, with EarlyStopping and ModelCheckpoint callbacks:
EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True) will stop training if the validation loss hasn't improved in 5 epochs and revert to the best weights.
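ModelCheckpoint(monitor='val_loss', save_best_only=True, save_weights_only=True), pointed at a checkpoint file, will save the model's weights at the epoch where validation loss was best. Note that Keras 3 (the default in TensorFlow 2.16 and later) requires the checkpoint file name to end in .weights.h5 when save_weights_only=True.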
After training, evaluate the model on the test fold's data to get the performance (MSE and MAE).
Clear the model from memory (and clear the session) to prepare for the next fold.
Repeat for all 10 folds and compute the average performance.
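The patience rule used for early stopping can be checked by hand with a small simulation. This is a simplified sketch of the stopping logic only (it ignores options such as min_delta), and the validation losses are made-up numbers chosen for illustration.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (best_epoch, last_epoch_run), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Made-up validation losses: the best value within reach is at epoch 3
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.74, 0.73, 0.69, 0.6]
print(simulate_early_stopping(val_losses, patience=5))  # (3, 8)
```

In this example training stops after epoch 8 and the epoch-3 weights are kept, even though epochs 9 and 10 would have been better. A larger patience trades training time for a lower chance of stopping too early.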
After all folds for a given hyperparameter combination, record the results (fold-wise metrics and averages) along with the hyperparams and training time. Then proceed to the next combination.
This is a computationally intensive process: the full grid contains 9 × 6 × 5 × 8 × 4 = 8,640 configurations, which means 86,400 model fits at 10 folds each. GPU acceleration, the relatively small size of each model, and early stopping (which cuts off training when further epochs yield no improvement) all help, but a complete run may still take a very long time. Let's implement the nested loops for the grid search:

In [None]:
# Define hyperparameter grid
window_sizes = [3, 5, 7, 9, 11, 13, 15, 17, 20]
learning_rates = [0.5, 0.01, 0.005, 0.001, 0.0005, 0.0001]
architectures = ['dense', 'conv1d', 'hybrid', 'lstm', 'cnn+lstm']
num_layers_list = [2, 3, 4, 5, 6, 8, 10, 12]
units_list = [64, 128, 256, 512]
epochs_per_fold = 50

# Prepare K-fold splitter (we will shuffle the sequence indices for randomness)
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# CSV file to log results
results_file = "experiment_results.csv"
# Write CSV header
results_header = (["architecture", "window_size", "num_layers", "units", "learning_rate"] +
                  [f"fold{i+1}_mse" for i in range(10)] +
                  ["avg_mse", "avg_mae", "training_time_sec"])
with open(results_file, 'w') as f:
    f.write(",".join(results_header) + "\n")

# Begin grid search
import time
experiment_count = 0
total_experiments = (len(window_sizes) * len(learning_rates) * 
                     len(architectures) * len(num_layers_list) * len(units_list))
print(f"Total experiments to run: {total_experiments}")
for arch in architectures:
    for window_size in window_sizes:
        for num_layers in num_layers_list:
            for units in units_list:
                for lr in learning_rates:
                    experiment_count += 1
                    config_description = (f"arch={arch}, window={window_size}, layers={num_layers}, "
                                           f"units={units}, lr={lr}")
                    print(f"\n=== Experiment {experiment_count}/{total_experiments}: {config_description} ===")
                    start_time = time.time()
                    
                    fold_mse_scores = []
                    fold_mae_scores = []
                    
                    # Perform 10-fold cross-validation for this config
                    fold_index = 1
                    for train_idx, test_idx in kf.split(sequences_X):
                        # Prepare training and testing data for this fold
                        # Combine training sequences' frames to fit scaler
                        train_frames = []
                        for seq_idx in train_idx:
                            train_frames.append(sequences_X[seq_idx])
                        train_frames = np.vstack(train_frames)
                        # Fit scaler on all training frames (for X features)
                        scaler = StandardScaler().fit(train_frames)
                        
                        # Generate windowed data for training
                        X_train_all = []
                        Y_train_all = []
                        for seq_idx in train_idx:
                            # Scale the entire sequence's features
                            X_seq = scaler.transform(sequences_X[seq_idx])
                            Z_seq = sequences_Z[seq_idx]  # target can remain unscaled
                            X_wins, Y_wins = create_windows_from_sequence(X_seq, Z_seq, window_size)
                            if X_wins.size == 0:
                                continue  # sequence too short for this window size
                            X_train_all.append(X_wins)
                            Y_train_all.append(Y_wins)
                        if len(X_train_all) == 0:
                            # If no training data (should not happen unless window_size > all seq lengths)
                            continue
                        # Concatenate all training windows from all sequences
                        X_train_all = np.vstack(X_train_all)
                        Y_train_all = np.vstack(Y_train_all)
                        
                        # Generate windowed data for testing (evaluation fold)
                        X_test_all = []
                        Y_test_all = []
                        for seq_idx in test_idx:
                            X_seq = scaler.transform(sequences_X[seq_idx])
                            Z_seq = sequences_Z[seq_idx]
                            X_wins, Y_wins = create_windows_from_sequence(X_seq, Z_seq, window_size)
                            if X_wins.size == 0:
                                continue  # If test sequence too short, it contributes no samples
                            X_test_all.append(X_wins)
                            Y_test_all.append(Y_wins)
                        if len(X_test_all) == 0:
                            # If no test data for this fold (all test sequences too short), skip fold
                            print(f"Fold {fold_index}: no test data (sequence too short for window={window_size}). Skipping.")
                            continue
                        X_test_all = np.vstack(X_test_all)
                        Y_test_all = np.vstack(Y_test_all)
                        
                        # Split off a validation set from training data for early stopping
                        X_train, X_val, Y_train, Y_val = train_test_split(
                            X_train_all, Y_train_all, test_size=0.1, random_state=42)
                        
                        # Build model for this configuration
                        model = create_model(window_size, arch, num_layers, units, lr)
                        
                        # Callbacks for early stopping and checkpointing.
                        # Keras 3 requires a '.weights.h5' suffix when save_weights_only=True.
                        checkpoint_path = 'best_model.weights.h5'
                        if os.path.exists(checkpoint_path):
                            os.remove(checkpoint_path)  # never reuse weights from an earlier fold
                        callbacks = [
                            EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
                            ModelCheckpoint(filepath=checkpoint_path, monitor='val_loss',
                                            save_best_only=True, save_weights_only=True)
                        ]
                        
                        # Train the model
                        model.fit(X_train, Y_train, epochs=epochs_per_fold, batch_size=32,
                                  validation_data=(X_val, Y_val), callbacks=callbacks, verbose=0)
                        
                        # Reload the best checkpoint in case EarlyStopping did not restore it.
                        # The file is absent only if no epoch produced a usable val_loss (e.g. NaN).
                        if os.path.exists(checkpoint_path):
                            model.load_weights(checkpoint_path)
                        
                        # Evaluate on the test set
                        loss, mae = model.evaluate(X_test_all, Y_test_all, verbose=0)
                        fold_mse_scores.append(loss)
                        fold_mae_scores.append(mae)
                        print(f"Fold {fold_index} MSE: {loss:.6f}, MAE: {mae:.6f}")
                        
                        # Clean up model to free memory before next fold
                        tf.keras.backend.clear_session()
                        del model
                        fold_index += 1
                    
                    # Compute average metrics across folds
                    avg_mse = float(np.mean(fold_mse_scores)) if fold_mse_scores else float('nan')
                    avg_mae = float(np.mean(fold_mae_scores)) if fold_mae_scores else float('nan')
                    elapsed = time.time() - start_time
                    print(f"Avg MSE: {avg_mse:.6f}, Avg MAE: {avg_mae:.6f}, Training time: {elapsed:.2f} sec")
                    
                    # Log results to CSV
                    result_data = [arch, window_size, num_layers, units, lr]
                    # Add each fold's MSE
                    for i in range(10):
                        result_data.append(fold_mse_scores[i] if i < len(fold_mse_scores) else "")
                    result_data += [avg_mse, avg_mae, elapsed]
                    result_line = ",".join(map(str, result_data))
                    with open(results_file, 'a') as f:
                        f.write(result_line + "\n")


Explanation of the training loop: We iterate over every combination of the hyperparameters. For each combination (arch, window_size, num_layers, units, lr):
We initialize lists to collect the MSE and MAE for each of the 10 folds.
We use KFold(n_splits=10, shuffle=True) to get train/test splits at the sequence level (not individual frames). train_idx and test_idx are indices of sequences for each fold.
For each fold:
We gather all training sequences' feature frames into one array train_frames and fit a StandardScaler on it. This computes the mean and std of each feature (each joint's X or Y across all frames in the training sequences).
We then scale each training sequence's features and generate window samples from it using create_windows_from_sequence. We collect all training windows in X_train_all and Y_train_all.
We do the same for test sequences to get X_test_all and Y_test_all for evaluation.
We perform a train/validation split on the training windows (10% used for validation) using train_test_split. This validation set is used for monitoring early stopping.
We create a new model for this fold with the current hyperparams by calling create_model(window_size, arch, ...).
We set up the callbacks:
EarlyStopping: monitors validation loss (val_loss which is MSE) with patience 5. It also has restore_best_weights=True so that after stopping, the model weights revert to the best seen.
ModelCheckpoint: saves the best model weights to a checkpoint file during training. We use save_best_only=True so the file is updated only when a better val_loss is found.
We train the model with model.fit for up to 50 epochs (as specified by epochs_per_fold), using a batch size of 32. We pass in the validation data for monitoring. We set verbose=0 to suppress per-epoch output (to keep logs cleaner).
After training, we load the best weights from the checkpoint file (just to be absolutely sure we have the best model – in practice, because we used restore_best_weights, the model is already at best state).
We then evaluate the model on the test set (model.evaluate) to get the MSE (loss) and MAE for this fold. These are appended to our fold_mse_scores and fold_mae_scores lists.
We print the fold results for monitoring progress.
We clear the Keras session and delete the model to free GPU memory before starting the next fold (this is critical in a large loop to avoid out-of-memory errors).
After all 10 folds, we compute the average MSE and MAE across the folds.
We also measure the total training time for this configuration (from before the first fold to after the last fold) using time.time().
We log a summary line for this experiment to experiment_results.csv. The logged data includes the hyperparameters, each fold's MSE, the average MSE, average MAE, and the training time in seconds.
We open the results file in append mode for each experiment so that if the process is interrupted, we still have results up to the last completed configuration. The CSV header was written once at the top of the file (with columns for each fold's MSE, etc.).
During execution, we print progress messages including the current experiment number and hyperparameters, fold metrics, and the average results. This will help track the progress in the console since the entire grid search will take a long time.
Note: There are conditions to skip a fold if no data is available (for example, if a window size is larger than all sequences, some folds may have no training or no testing samples). In such cases we skip logging that fold's result. In a well-prepared dataset (with sequences longer than the largest window), this should not occur.
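The fold-skipping condition described in the note above can be illustrated with a small dependency-free sketch. It assumes that a sequence of n frames yields max(0, n - window_size + 1) windows (one per valid end frame), which is how a sliding window that never crosses sequence boundaries would typically behave. The helper names sequence_folds, window_count and usable_folds and the sequence lengths are hypothetical, and the plain-Python splitter is only a stand-in for scikit-learn's KFold, not a reproduction of it.

```python
import random


def sequence_folds(n_sequences, n_splits=10, seed=42):
    """Yield (train_idx, test_idx) lists of sequence indices, one pair per fold."""
    indices = list(range(n_sequences))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        test_idx = indices[k::n_splits]  # every n_splits-th shuffled sequence
        test_set = set(test_idx)
        train_idx = [i for i in indices if i not in test_set]
        yield train_idx, test_idx


def window_count(seq_len, window_size):
    """Assumed window count: one window per valid end frame, zero if too short."""
    return max(0, seq_len - window_size + 1)


def usable_folds(seq_lengths, window_size, n_splits=10, seed=42):
    """Return (fold_number, n_train_windows, n_test_windows) for folds with data."""
    usable = []
    splits = sequence_folds(len(seq_lengths), n_splits, seed)
    for fold, (train_idx, test_idx) in enumerate(splits, start=1):
        n_train = sum(window_count(seq_lengths[i], window_size) for i in train_idx)
        n_test = sum(window_count(seq_lengths[i], window_size) for i in test_idx)
        if n_train == 0 or n_test == 0:
            continue  # nothing to train on or nothing to evaluate: skip this fold
        usable.append((fold, n_train, n_test))
    return usable


# Hypothetical sequence lengths in frames; two of them are shorter than the window.
lengths = [120, 95, 40, 200, 15, 60, 80, 33, 150, 70, 25, 110]
for fold, n_train, n_test in usable_folds(lengths, window_size=30):
    print(f"fold {fold}: {n_train} training windows, {n_test} test windows")
```

Running a check like this before the grid search shows which window sizes would leave folds empty, so those combinations can be dropped up front instead of producing blank fold columns in the CSV.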
6. Running the Experiments and Logging Results
After setting up the loop above, running the code starts the grid search. The grid contains 9 window sizes × 6 learning rates × 5 architectures × 8 layer counts × 4 unit sizes = 8,640 configurations, and each one is trained 10 times for cross-validation, for 86,400 training runs in total. On a single GPU this could take many hours or days to complete, so in practice you may want to reduce the search space or distribute the work. The code as written logs results incrementally, so partial results are available while the search is still running. For example, the CSV file might start to look like this (with made-up numbers for illustration):
architecture,window_size,num_layers,units,learning_rate,fold1_mse,fold2_mse,...,fold10_mse,avg_mse,avg_mae,training_time_sec
dense,3,2,64,0.5,0.0123,0.0108,...,0.0115,0.0115,0.085,45.2
dense,3,2,64,0.01,0.0056,0.0061,...,0.0059,0.0059,0.065,48.7
...
Each line corresponds to one configuration. This allows us to later analyze which combination gave the lowest average MSE or MAE. (In a real scenario, one might also log the standard deviation of the fold errors, or report RMSE, which is easier to interpret because it is in the same units as Z, but here we stick to MSE/MAE as specified.)
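The standard deviation and RMSE mentioned above do not need to be logged during training, because they can be derived afterwards from the per-fold MSE columns. The following is a minimal sketch of that post-processing, assuming the column layout shown in the sample; the rows are made-up numbers shortened to three folds for readability, and the summarize helper is hypothetical.

```python
import csv
import io
import math
import statistics

# Made-up rows in the same column layout as experiment_results.csv,
# shortened to three folds (the real file has fold1_mse .. fold10_mse).
SAMPLE = """architecture,window_size,num_layers,units,learning_rate,fold1_mse,fold2_mse,fold3_mse,avg_mse,avg_mae,training_time_sec
dense,3,2,64,0.5,0.0123,0.0108,0.0115,0.0115,0.085,45.2
dense,3,2,64,0.01,0.0056,0.0061,0.0059,0.0059,0.065,48.7
"""


def summarize(row):
    """Derive mean, sample standard deviation and mean RMSE from the fold columns."""
    # Skipped folds are written as empty strings, so they are filtered out here.
    fold_mses = [
        float(value)
        for key, value in row.items()
        if key.startswith("fold") and key.endswith("_mse") and value != ""
    ]
    return {
        "mean_mse": statistics.mean(fold_mses),
        "std_mse": statistics.stdev(fold_mses) if len(fold_mses) > 1 else 0.0,
        "mean_rmse": statistics.mean([math.sqrt(m) for m in fold_mses]),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
summaries = [summarize(row) for row in rows]
for row, summary in zip(rows, summaries):
    print(
        f"lr={row['learning_rate']}: "
        f"MSE {summary['mean_mse']:.4f} +/- {summary['std_mse']:.4f}, "
        f"RMSE {summary['mean_rmse']:.4f}"
    )
```

To run this against the real log, replace io.StringIO(SAMPLE) with an open file handle for experiment_results.csv.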
7. Conclusion and Next Steps
We have built a GPU-accelerated deep learning experimentation pipeline that systematically evaluates different models for predicting joint depth (Z) from 2D coordinates. By using a sliding window of frames as input, the models utilize temporal context to make more accurate predictions. We incorporated fully connected networks, convolutional networks, recurrent LSTMs, and hybrid combinations, with extensive hyperparameter tuning for each. The use of 10-fold cross-validation provides a robust estimate of performance for each configuration, and techniques like early stopping and checkpointing ensure that training is efficient and avoids overfitting. All results are logged in experiment_results.csv. The next step is to analyze the results:
Identify the model configuration with the lowest average MSE (or MAE) on the cross-validation folds.
Consider the trade-offs (model complexity vs. performance). For example, a complex model (many layers/units) might yield slightly better accuracy but take longer to train or risk overfitting, whereas a simpler model might be nearly as good and more efficient.
Once the best hyperparameter combination is identified, we could retrain a final model with that configuration on all available training data and confirm its performance on a separate held-out test set if one is available (or on a fresh train/test split otherwise).
We could also examine the errors for different joints or frames to see if certain joints' depths are harder to predict from 2D (which might indicate limits of the approach or the need for more sophisticated temporal modeling).
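The first two analysis steps, finding the lowest average MSE and weighing it against model cost, can be combined into one simple selection rule. The sketch below is one possible approach under stated assumptions: it uses training_time_sec as a rough proxy for model complexity, and the pick_config helper, the relative tolerance parameter, and the candidate rows are all hypothetical (the numbers are made up).

```python
def pick_config(rows, tolerance=0.05):
    """Return the fastest config whose avg_mse is within `tolerance` of the best.

    `tolerance` is relative: 0.05 accepts configs up to 5% worse than the
    lowest avg_mse. With tolerance=0 this reduces to picking the best avg_mse
    (ties broken by training time).
    """
    best_mse = min(row["avg_mse"] for row in rows)
    near_best = [row for row in rows if row["avg_mse"] <= best_mse * (1 + tolerance)]
    return min(near_best, key=lambda row: row["training_time_sec"])


# Made-up results, already parsed to floats (e.g. from experiment_results.csv).
candidates = [
    {"architecture": "lstm", "num_layers": 4, "units": 256,
     "avg_mse": 0.0050, "training_time_sec": 310.0},
    {"architecture": "dense", "num_layers": 2, "units": 64,
     "avg_mse": 0.0052, "training_time_sec": 48.7},
    {"architecture": "dense", "num_layers": 1, "units": 32,
     "avg_mse": 0.0090, "training_time_sec": 30.1},
]

chosen = pick_config(candidates, tolerance=0.05)
print(chosen["architecture"], chosen["num_layers"], chosen["units"])
```

A tolerance informed by the fold-to-fold spread of the MSE would be a more principled choice than a fixed percentage, since differences smaller than that spread are unlikely to be meaningful.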
Overall, this pipeline provides a flexible framework for experimentation. By logging every experiment together with its hyperparameters, it enables a thorough grid search and supports reproducibility of results for this Kinect XY-to-Z prediction task.