Kinect XY to Z Coordinate Prediction with TensorFlow
Introduction
In this notebook, we build a deep learning model that predicts the Z coordinates of Kinect body joints from their X and Y coordinates. The Kinect sensor provides 3D joint positions (X, Y, Z) for 13 joints across a sequence of frames. Our goal is to train a model that takes the X and Y values as input and estimates the corresponding Z value for each joint. To make use of temporal information, the model input is a sliding window of consecutive frames (create_model defaults to a window of 5; the run recorded below uses 10), so each prediction is made with a short stretch of motion as context. We experiment with three neural network architectures (fully connected Dense layers, 1D convolutional layers, and a hybrid of both) to see which works best for this task.
Plan:
Load and inspect the dataset: Ensure it has 13 joints with X, Y, Z coordinates per frame.
Prepare the data: Separate input features (X and Y for all joints) and target outputs (Z for all joints). Create sequence data using a sliding window of frames, and optionally normalize the features for better training.
Define a modular model architecture: Build a function to create a TensorFlow model that can be configured for three modes: Dense-only, Conv1D-only, or a hybrid (Conv1D + Dense). We will choose appropriate activation functions (ReLU for hidden layers, linear for output) and weight initializations (He initialization for ReLU layers) for each.
Compile the model: Use a stochastic gradient descent optimizer (e.g. Adam, an adaptive SGD variant) with a regression loss (Mean Squared Error) and track Mean Absolute Error (MAE) as an additional metric.
Train and evaluate with 10-fold cross-validation: Split the data into 10 folds, train the model on 9 folds and validate on the 1 remaining fold, rotating through all folds. This provides a robust evaluation of model performance on all data.
Use callbacks for early stopping and checkpointing: Integrate an EarlyStopping callback to halt training when validation loss stops improving (preventing overfitting and saving time) and a ModelCheckpoint callback to save the best model weights to disk. We will explain how to adjust or disable these features as needed.
Analyze results: For each fold/architecture, we'll record the performance (MSE and MAE) and compute the average across folds. This will help compare the architectures and ensure the model generalizes well.
Let's get started by importing necessary libraries and loading the dataset.
Import Libraries and Load Data
First, we import the required libraries. We will use pandas for data loading/manipulation, NumPy for numerical operations, scikit-learn for data splitting and normalization, and TensorFlow/Keras for building the neural network.


In [3]:
# Import required libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

Now, we load the Kinect joint data. We assume the data is stored in one or multiple CSV files where each row corresponds to a frame and contains the 3D coordinates for 13 joints (x, y, z for each joint). If the data is split across multiple files (e.g., different recordings or subjects), we will load and combine them. Each file is expected to have columns like head_x, head_y, head_z, left_shoulder_x, ... right_foot_z (13 joints × 3 coordinates = 39 columns). There may also be an index or frame number column which we will drop.

In [5]:
import os

data_dir = 'data/kinect_good_preprocessed'

# List all CSV files that end with kinect.csv (sorted so the load order is deterministic)
csv_files = sorted(f for f in os.listdir(data_dir) if f.endswith('kinect.csv'))

# Load each CSV file, collecting the per-file DataFrames in a list
frames = []
for file in csv_files:
    file_path = os.path.join(data_dir, file)
    try:
        df = pd.read_csv(file_path)
        # Optional: add a source column to track which file each row came from
        # df['source_file'] = file
        frames.append(df)
        print(f"Successfully loaded {file}")
    except Exception as e:
        print(f"Error loading {file}: {e}")

# Concatenate once at the end (cheaper than growing a DataFrame inside the loop)
data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Drop the FrameNo column if it exists
if 'FrameNo' in data.columns:
    data = data.drop('FrameNo', axis=1)

# Check the dataframe after dropping FrameNo
print(f"Dataframe shape after dropping FrameNo: {data.shape}")

data.head()

Successfully loaded A100_kinect.csv
Successfully loaded A101_kinect.csv
Successfully loaded A102_kinect.csv
Successfully loaded A103_kinect.csv
Successfully loaded A104_kinect.csv
Successfully loaded A105_kinect.csv
Successfully loaded A106_kinect.csv
Successfully loaded A108_kinect.csv
Successfully loaded A109_kinect.csv
Successfully loaded A10_kinect.csv
Successfully loaded A110_kinect.csv
Successfully loaded A111_kinect.csv
Successfully loaded A112_kinect.csv
Successfully loaded A113_kinect.csv
Successfully loaded A114_kinect.csv
Successfully loaded A115_kinect.csv
Successfully loaded A116_kinect.csv
Successfully loaded A117_kinect.csv
Successfully loaded A118_kinect.csv
Successfully loaded A119_kinect.csv
Successfully loaded A11_kinect.csv
Successfully loaded A120_kinect.csv
Successfully loaded A121_kinect.csv
Successfully loaded A122_kinect.csv
Successfully loaded A123_kinect.csv
Successfully loaded A124_kinect.csv
Successfully loaded A125_kinect.csv
Successfully loaded A126_kinec

Unnamed: 0,head_x,head_y,head_z,left_shoulder_x,left_shoulder_y,left_shoulder_z,left_elbow_x,left_elbow_y,left_elbow_z,right_shoulder_x,...,left_knee_z,right_knee_x,right_knee_y,right_knee_z,left_foot_x,left_foot_y,left_foot_z,right_foot_x,right_foot_y,right_foot_z
0,-0.01648,0.7796,0.070646,-0.16221,0.5803,0.047752,-0.23056,0.84519,-0.033909,0.17278,...,-0.046769,0.13624,-0.51948,0.003528,-0.13975,-0.86193,-0.025068,0.15162,-0.91701,-0.005627
1,-0.01699,0.77937,0.069989,-0.16242,0.58031,0.047168,-0.2305,0.84547,-0.034061,0.17334,...,-0.045948,0.13663,-0.51921,0.003291,-0.13982,-0.86208,-0.02498,0.15001,-0.91352,-0.001491
2,-0.017345,0.77928,0.069703,-0.16258,0.58039,0.046964,-0.23044,0.84564,-0.034175,0.1731,...,-0.039191,0.13633,-0.51957,0.003223,-0.13937,-0.86233,-0.024777,0.14981,-0.91385,-0.001252
3,-0.017602,0.77935,0.06952,-0.16276,0.58035,0.046524,-0.23037,0.84598,-0.034318,0.17349,...,-0.044046,0.13653,-0.51892,0.002961,-0.13907,-0.86225,-0.024431,0.15711,-0.92048,-0.01492
4,-0.017559,0.77938,0.06953,-0.16279,0.5803,0.046288,-0.22906,0.8507,-0.036518,0.17365,...,-0.047814,0.13662,-0.5194,0.002998,-0.13917,-0.86364,-0.025274,0.15212,-0.91871,-0.004539


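Explanation: The code above reads every CSV file whose name ends in kinect.csv and concatenates them into one pandas DataFrame called data. If there's a FrameNo column (which just indexes frames), we drop it since we don't need it for learning. We then print the shape and display the first five rows with data.head() to verify that the data loaded correctly. The expected shape is (num_frames, 39), since each frame has 39 features (13 joints × 3 coordinates). If a leftover index column such as the Unnamed: 0 shown in the preview is present, the shape has one extra column; this is harmless because the next section selects columns by their _x, _y, and _z suffixes.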
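As a quick sanity check on the layout described above, the column names can be validated before any modelling. The sketch below is an illustration rather than part of the original notebook: check_joint_columns is a hypothetical helper, and it assumes every coordinate column is named <joint>_x, <joint>_y, or <joint>_z.

```python
def check_joint_columns(columns, n_joints=13):
    """Return the sorted joint names if every joint has x, y and z columns.

    Hypothetical helper: assumes coordinate columns are named
    '<joint>_x', '<joint>_y', '<joint>_z'. Other columns are ignored.
    """
    axes_by_joint = {}
    for col in columns:
        joint, sep, axis = col.rpartition('_')
        if sep and axis in ('x', 'y', 'z'):
            axes_by_joint.setdefault(joint, set()).add(axis)

    incomplete = sorted(j for j, axes in axes_by_joint.items() if axes != {'x', 'y', 'z'})
    if incomplete:
        raise ValueError(f"Joints missing a coordinate: {incomplete}")
    if len(axes_by_joint) != n_joints:
        raise ValueError(f"Expected {n_joints} joints, found {len(axes_by_joint)}")
    return sorted(axes_by_joint)


# Small illustrative example with two joints and one stray index column
example_columns = ['Unnamed: 0', 'head_x', 'head_y', 'head_z',
                   'left_foot_x', 'left_foot_y', 'left_foot_z']
joints = check_joint_columns(example_columns, n_joints=2)
print(joints)
```

On the real data the call would be check_joint_columns(data.columns), which raises a ValueError if a joint is missing a coordinate or the joint count is not 13.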
Data Preprocessing
Separating Features (X, Y) and Targets (Z)
We need to use the X and Y coordinates as input features and the Z coordinates as the target output. Let's split the DataFrame into X (containing all _x and _y columns) and y (containing all _z columns). We will also convert them to NumPy arrays for TensorFlow.

In [6]:
# Separate input features (XY coordinates) and target (Z coordinates)
# Identify columns by suffix
x_columns = [col for col in data.columns if col.endswith('_x')]
y_columns = [col for col in data.columns if col.endswith('_y')]
z_columns = [col for col in data.columns if col.endswith('_z')]

# Sort columns to maintain consistent order (optional, for safety)
x_columns = sorted(x_columns)
y_columns = sorted(y_columns)
z_columns = sorted(z_columns)

# Prepare feature matrix X and target matrix y
X = data[x_columns + y_columns].values  # shape: (num_frames, 26) -> 13 joints * 2 (x,y)
y = data[z_columns].values             # shape: (num_frames, 13) -> 13 joints * 1 (z)

print("Features shape (X):", X.shape)
print("Targets shape (y):", y.shape)

Features shape (X): (24005, 26)
Targets shape (y): (24005, 13)


Explanation: We filter column names by their suffix (_x, _y, _z) to gather the respective coordinate sets. There should be 13 columns for each suffix (13 X's, 13 Y's, 13 Z's). We then form X by concatenating the X and Y columns, resulting in 26 features per frame, and y from the Z columns, resulting in 13 target values per frame. We print the shapes to verify; for example, you might see Features shape: (24005, 26) and Targets shape: (24005, 13) if there are 24,005 frames in total.
Creating Sequence Data with a Sliding Window
Instead of predicting Z from a single frame's X and Y, we will use a window of consecutive frames as input. This means each training sample will consist of window_size frames of X/Y data, and the target will be the Z values of the last frame in that sequence. The explanations in this notebook use window_size = 5 for illustration (it is also the default in create_model), while the cell below sets window_size = 10, which is the value behind the recorded outputs; either can be adjusted. For example, if window_size=5, then frames 1–5 (their X and Y values) will be used to predict frame 5's Z values; frames 2–6 will predict frame 6's Z, and so on. This provides the model with context from the preceding 4 frames when predicting the 5th frame's depth.

In [16]:
# Create sequences of frames for input using a sliding window
window_size = 10  # you can change this to use more or fewer frames in each input sequence

X_seq = []
y_seq = []
num_frames = X.shape[0]
for i in range(num_frames - window_size + 1):
    # Take window_size consecutive frames for input
    X_seq.append(X[i : i+window_size])
    # Take the Z targets of the last frame in the sequence as the output
    y_seq.append(y[i+window_size - 1])
X_seq = np.array(X_seq)
y_seq = np.array(y_seq)

print("Sequence input shape (X_seq):", X_seq.shape)
print("Sequence target shape (y_seq):", y_seq.shape)

Sequence input shape (X_seq): (23996, 10, 26)
Sequence target shape (y_seq): (23996, 13)


Explanation: We iterate through the data to build sequences of length window_size. X_seq will have shape (N_sequences, window_size, 26) where 26 is the number of input features per frame (X and Y for 13 joints). y_seq will have shape (N_sequences, 13) corresponding to the Z values for the last frame of each sequence. If our original data had N frames, the number of sequences N_sequences will be N - window_size + 1, because a window starting in the last window_size - 1 frames would run past the end of the data. For instance, with 24,005 frames and window_size=5 there are 24,001 sequences; with window_size=10, as in the cell above, there are 23,996, which matches the printed shape. Note that the windows are built over the concatenated DataFrame, so a small number of them straddle the boundary between two recordings.
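The same windows can be built without a Python loop. The following is a sketch, assuming NumPy 1.20 or newer (for sliding_window_view); make_windows is a hypothetical helper name and the toy arrays only stand in for the real X and y.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(X, y, window_size):
    """Vectorised equivalent of the sliding-window loop.

    X: (num_frames, n_features), y: (num_frames, n_outputs).
    Returns X_seq of shape (num_frames - window_size + 1, window_size, n_features)
    and y_seq holding the targets of the last frame of each window.
    """
    # sliding_window_view appends the window axis last: (n_windows, n_features, window_size)
    windows = sliding_window_view(X, window_shape=window_size, axis=0)
    X_seq = windows.transpose(0, 2, 1).copy()  # -> (n_windows, window_size, n_features)
    y_seq = y[window_size - 1:]
    return X_seq, y_seq


# Toy data standing in for the real arrays (values are arbitrary)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 26))
y_toy = rng.normal(size=(20, 13))
X_seq_toy, y_seq_toy = make_windows(X_toy, y_toy, window_size=5)
print(X_seq_toy.shape, y_seq_toy.shape)  # 20 - 5 + 1 = 16 windows
```

The explicit copy makes X_seq an ordinary writable array; without it the result is a read-only view that shares memory with X.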
Feature Scaling (Normalization)
Neural networks often train faster and more reliably when input features are on similar scales. Here, X and Y coordinates might already be in a similar range (e.g., meters in a Kinect coordinate system), but it's still beneficial to normalize. We will use StandardScaler to standardize features: each feature (coordinate) will be scaled to have mean 0 and standard deviation 1 based on the training data. We will also scale the target Z values similarly. Important: When doing cross-validation, we must be careful to fit the scaler on the training portion of each fold and apply it to that fold’s validation data, to avoid leaking information. We'll handle this inside the cross-validation loop. (If you prefer not to scale the outputs because they are already in a known range, you can skip scaling y. But scaling X is highly recommended.)
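To make the fit-on-training-only rule concrete, here is a NumPy-only sketch of what standardization does to 3D sequence data. The notebook itself uses StandardScaler; fit_standardizer and standardize are hypothetical names, and the random arrays are placeholders for a training and a validation fold.

```python
import numpy as np


def fit_standardizer(train_seq):
    """Per-feature mean and std over all samples and time steps of the training fold."""
    flat = train_seq.reshape(-1, train_seq.shape[-1])
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)          # population std (ddof=0), as StandardScaler uses
    std[std == 0.0] = 1.0           # avoid dividing by zero for constant features
    return mean, std


def standardize(seq, mean, std):
    """Apply training-fold statistics; broadcasting handles the 3D shape."""
    return (seq - mean) / std


rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=2.0, scale=3.0, size=(100, 10, 26))
X_val_toy = rng.normal(loc=2.0, scale=3.0, size=(20, 10, 26))

mean, std = fit_standardizer(X_train_toy)          # statistics come from training data only
X_train_scaled = standardize(X_train_toy, mean, std)
X_val_scaled = standardize(X_val_toy, mean, std)   # validation reuses the training statistics
```

The validation fold ends up only approximately zero-mean and unit-variance, which is expected: its own statistics are never consulted.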
Building a Modular Neural Network Model
Next, we'll define a function to create our TensorFlow model. This function will allow us to switch between different architectures:
Dense-only: A fully-connected network that treats each sequence as a flattened vector (no explicit temporal processing, aside from what the window provides inherently).
Conv1D-only: A model that uses 1D convolution layers over the time dimension to automatically learn temporal features from the sequence.
Hybrid (Conv + Dense): A combination where we first apply a convolutional layer to extract temporal features, then flatten and pass through dense layers for further processing.
Each architecture will share the same input-output structure: input shape corresponds to our sequence of frames, and output is 13 numbers (predicted Z for each joint). We will use ReLU activation for hidden layers (since it's a good default for many problems) and linear activation for the output (because this is a regression task and we want to predict continuous values without bounding them). We will also use He initialization (he_normal) for layers with ReLU, which is known to help train deep networks with ReLU activations by initializing weights in a suitable range. Let's define the model-building function:

In [17]:
def create_model(model_type='dense', window_size=5, n_features=26, n_outputs=13):
    """
    Create a TensorFlow Keras model for predicting joint Z coordinates from XY coordinates.
    model_type: 'dense', 'conv', or 'hybrid' to specify the architecture.
    window_size: number of frames in the input sequence.
    n_features: number of features per frame (should be 26 for 13 joints' X and Y).
    n_outputs: number of outputs (13 joints' Z values).
    """
    if model_type == 'dense':
        # Dense-only model: flatten the sequence and use fully connected layers
        inputs = tf.keras.Input(shape=(window_size * n_features,))  # input is a flat vector of length window_size*26
        x = layers.Dense(128, activation='relu', kernel_initializer='he_normal')(inputs)
        x = layers.Dense(64, activation='relu', kernel_initializer='he_normal')(x)
        outputs = layers.Dense(n_outputs, activation='linear')(x)  # linear output for regression
        model = models.Model(inputs, outputs)
    
    elif model_type == 'conv':
        # Conv1D-only model: use convolutional layers over time dimension, then flatten to output
        inputs = tf.keras.Input(shape=(window_size, n_features))   # input is a (window_size, n_features) sequence
        x = layers.Conv1D(filters=64, kernel_size=3, activation='relu', 
                           kernel_initializer='he_normal', padding='same')(inputs)
        x = layers.Conv1D(filters=64, kernel_size=3, activation='relu', 
                           kernel_initializer='he_normal', padding='same')(x)
        x = layers.Flatten()(x)
        outputs = layers.Dense(n_outputs, activation='linear')(x)
        model = models.Model(inputs, outputs)
    
    elif model_type == 'hybrid':
        # Hybrid model: one Conv1D layer followed by Dense layers
        inputs = tf.keras.Input(shape=(window_size, n_features))
        x = layers.Conv1D(filters=64, kernel_size=3, activation='relu', 
                           kernel_initializer='he_normal', padding='same')(inputs)
        x = layers.Flatten()(x)
        x = layers.Dense(64, activation='relu', kernel_initializer='he_normal')(x)
        outputs = layers.Dense(n_outputs, activation='linear')(x)
        model = models.Model(inputs, outputs)
    
    else:
        raise ValueError("Unsupported model_type. Choose from 'dense', 'conv', 'hybrid'.")
    
    # Compile the model with optimizer, loss, and metrics for regression
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), 
                  loss='mse',             # Mean Squared Error for regression loss
                  metrics=['mae'])        # Mean Absolute Error as an additional metric
    return model

# Example: Build each type of model and display its architecture
for arch in ['dense', 'conv', 'hybrid']:
    model = create_model(model_type=arch, window_size=window_size, n_features=X_seq.shape[2], n_outputs=y_seq.shape[1])
    print(f"{arch.capitalize()} model summary:")
    model.summary()
    print("\n")


Dense model summary:




Conv model summary:




Hybrid model summary:






Explanation: The create_model function constructs a Keras model based on the specified model_type.
For 'dense', the model expects its input already flattened into a vector of length window_size * n_features (130 inputs for a window of 5 frames with 26 features each, 260 for the window of 10 used here); the flattening itself happens later, in the cross-validation loop. We then add two Dense layers (128 and 64 units) with ReLU activations, and a final Dense layer with 13 units (one per joint's Z) and linear activation. Dropout can be added if needed, but here we keep it simple.
For 'conv', the input shape is (window_size, n_features) (e.g., 5×26, or 10×26 for the window used here). We apply two 1D convolutional layers with a kernel size of 3 (looking at 3-frame patterns) and 'same' padding to preserve the sequence length. These conv layers slide a 3-frame kernel along the sequence to extract motion features. The output of the conv layers is then flattened and fed directly to the output layer. (We could also add a Dense layer after flattening if desired, but here the conv-only model is kept deliberately direct.)
For 'hybrid', we use one Conv1D layer (to start learning temporal features), then flatten and pass through a Dense layer (64 units) before the final output. This combines the strengths of convolution (local pattern learning in time) and dense layers (global feature combination).
In all cases, we compile the model with the Adam optimizer (a popular variant of stochastic gradient descent that adapts the learning rate for each parameter). We use Mean Squared Error (MSE) as the loss function since this is a regression problem, and include Mean Absolute Error (MAE) as a metric for easier interpretation of errors. (MSE gives more weight to large errors, while MAE is more linear, so seeing both is useful.) After defining the function, we create one instance of each model type and print a summary of its architecture. The .summary() output shows the layers and shapes, which can help verify that each model is constructed as expected (you'll see the layer types Dense or Conv1D and the output shape of (None, 13) for the final layer, confirming 13 outputs).
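One way to cross-check what .summary() reports is to count trainable parameters by hand. The sketch below mirrors the layer sizes in create_model; count_parameters is a hypothetical helper, and it relies on the standard formulas for Dense and Conv1D layers with biases.

```python
def count_parameters(model_type, window_size, n_features=26, n_outputs=13):
    """Hand-computed trainable parameter count for the three architectures.

    Dense layer:  inputs * units + units
    Conv1D layer: kernel_size * in_channels * filters + filters
    """
    def dense(n_in, n_out):
        return n_in * n_out + n_out

    def conv1d(in_channels, filters, kernel_size=3):
        return kernel_size * in_channels * filters + filters

    if model_type == 'dense':
        return (dense(window_size * n_features, 128) + dense(128, 64)
                + dense(64, n_outputs))
    if model_type == 'conv':
        # 'same' padding keeps window_size time steps, so Flatten yields window_size * 64 values
        return (conv1d(n_features, 64) + conv1d(64, 64)
                + dense(window_size * 64, n_outputs))
    if model_type == 'hybrid':
        return (conv1d(n_features, 64) + dense(window_size * 64, 64)
                + dense(64, n_outputs))
    raise ValueError("Unsupported model_type. Choose from 'dense', 'conv', 'hybrid'.")


for arch in ['dense', 'conv', 'hybrid']:
    print(arch, count_parameters(arch, window_size=10))
```

With window_size=10 this gives 42,509 parameters for 'dense', 25,741 for 'conv', and 46,925 for 'hybrid'. The hybrid model is the largest because its 64-unit Dense layer sits directly on the flattened convolution output.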
10-Fold Cross-Validation Training and Evaluation
Now we will train and evaluate the model using 10-fold cross-validation. Cross-validation means we will split the dataset into 10 parts (folds) of roughly equal size. For each of the 10 iterations, we take one fold as the validation set and the remaining 9 folds as the training set. We train the model on the training portion and evaluate on the validation fold. This way, every data point gets to be in a validation set exactly once, and we can assess how well the model generalizes across all data. Using cross-validation is helpful especially when the dataset is not extremely large, as it maximizes the use of available data for training while still providing a robust evaluation on unseen data for each fold. It also helps to detect if the model is overfitting or if performance is consistent. Before we start, let's set up the model type we want to evaluate and prepare the KFold splitter:

In [18]:
# Choose which architecture to evaluate: 'dense', 'conv', or 'hybrid'
model_type = 'conv'  # << change this to try a different architecture >>

# Set up 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=42)


We set shuffle=True so that the windows are shuffled before being split into folds, and we fix random_state so the fold splits are reproducible. One caveat: consecutive windows overlap heavily (with window_size = 10, neighbouring windows share 9 of their 10 frames), so a shuffled split places near-duplicate windows in both the training and the validation fold. The cross-validation scores below are therefore likely to be optimistic compared with performance on a recording the model has never seen; splitting by recording is the stricter alternative.
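Because consecutive windows share window_size - 1 frames, a stricter evaluation keeps all windows from one recording on the same side of each split. The sketch below uses scikit-learn's GroupKFold for this. It is an alternative to the notebook's KFold, not what the notebook runs, and it assumes a recording id is available for every window (for example from the commented-out source_file column). The group layout here is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: 6 recordings with 50 windows each, built separately per recording
n_recordings, windows_per_recording = 6, 50
groups = np.repeat(np.arange(n_recordings), windows_per_recording)  # recording id per window
X_windows = np.zeros((groups.size, 10, 26))                         # placeholder inputs

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X_windows, groups=groups))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    train_recordings = set(groups[train_idx].tolist())
    val_recordings = set(groups[val_idx].tolist())
    # No recording contributes windows to both sides of the split
    assert train_recordings.isdisjoint(val_recordings)
    print(f"Fold {fold}: validation recordings {sorted(val_recordings)}")
```

The number of folds cannot exceed the number of recordings, so this scheme trades some folds for a cleaner separation between training and validation data.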
Early Stopping and Model Checkpointing
We'll use two Keras callbacks during training:
EarlyStopping: This callback monitors the validation loss and stops training early if it hasn't improved in a specified number of epochs (patience). This prevents the model from overfitting the training data once it stops getting better on validation data. We will use restore_best_weights=True so that after stopping, the model weights revert to the best encountered during training (the epoch with lowest validation loss). We can adjust patience based on how many epochs of no-improvement we want to tolerate. For example, patience=5 will stop training if the validation loss doesn't improve for 5 consecutive epochs. If we wanted to disable early stopping, we could simply not include this callback, or set a very high patience (effectively letting all epochs run).
ModelCheckpoint: This callback saves the model to disk whenever an improvement in the monitored metric (validation loss here) is observed. We set save_best_only=True to save only the best model (lowest val loss) for each fold. This is useful if we want to keep the model file for later use or analysis. We'll save each best model to a file named "best_model.keras". If you want to save models for each fold separately, you can include the fold number in the filename. If you prefer not to save models to disk, you can omit this callback.
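The patience rule is easy to trace by hand. The following is a simplified pure-Python model of it, not Keras's implementation: simulate_early_stopping is a hypothetical helper and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (stopped_epoch, best_epoch), both 0-based.

    Simplified model of the patience rule: training stops once the loss has
    failed to improve on its best value for `patience` consecutive epochs.
    stopped_epoch is None if the run reaches the end without stopping.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses for illustration
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.39, 0.30]
stopped_epoch, best_epoch = simulate_early_stopping(losses, patience=5)
print(stopped_epoch, best_epoch)  # 7 2
```

In this example training halts at epoch index 7 and the weights from epoch index 2 would be restored. The lower loss at the final position is never reached, which shows the trade-off: a small patience saves time but can stop before a late improvement.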
Now, let's perform the cross-validation loop. For each fold, we will:
Split X_seq and y_seq into training and validation sets according to the current fold indices.
Scale the features using StandardScaler: fit on the training data (flattening the sequence data into 2D shape of (samples*window_size, features) for scaling each feature across all time steps) and transform both training and validation data. We'll also scale the target y values (fit on y_train, transform y_val). This scaling is done inside each fold to avoid any data leakage.
Build a fresh model for the chosen model_type (we need a new model for each fold to ensure independent training).
Train the model on the training fold with early stopping and checkpoint callbacks, and validate on the validation fold. We set a relatively high epochs (e.g. 100) but expect early stopping to halt before that if the model stops improving. We also use a batch size of 32 (this can be tuned; 32 is a common starting point).
Record the validation performance (MSE and MAE) for this fold.
After the loop, compute the average MSE and MAE across all folds as the overall evaluation of the model.

In [None]:
fold = 1
val_scores = []  # to collect validation MSE and MAE for each fold

for train_idx, val_idx in kf.split(X_seq):
    print(f"Training fold {fold} of {kf.get_n_splits()}...")
    
    # Split into training and validation for this fold
    X_train_fold, X_val_fold = X_seq[train_idx], X_seq[val_idx]
    y_train_fold, y_val_fold = y_seq[train_idx], y_seq[val_idx]
    
    # Scale features: fit on training fold, transform both train and val
    X_scaler = StandardScaler()
    y_scaler = StandardScaler()
    # Flatten the 3D (samples, window_size, features) into 2D (samples*window_size, features) for scaling
    X_train_flat = X_train_fold.reshape(-1, X_train_fold.shape[2])
    X_val_flat = X_val_fold.reshape(-1, X_val_fold.shape[2])
    X_train_flat_scaled = X_scaler.fit_transform(X_train_flat)
    X_val_flat_scaled = X_scaler.transform(X_val_flat)
    # Reshape back to the original sequence shape for model input
    X_train_scaled = X_train_flat_scaled.reshape(X_train_fold.shape[0], X_train_fold.shape[1], X_train_fold.shape[2])
    X_val_scaled = X_val_flat_scaled.reshape(X_val_fold.shape[0], X_val_fold.shape[1], X_val_fold.shape[2])
    # Scale targets (Z values)
    y_train_scaled = y_scaler.fit_transform(y_train_fold)
    y_val_scaled = y_scaler.transform(y_val_fold)
    
    # If using dense model, we need to flatten the sequence dimension for the model input
    if model_type == 'dense':
        X_train_input = X_train_scaled.reshape(X_train_scaled.shape[0], -1)  # flatten to (samples, window_size*n_features)
        X_val_input   = X_val_scaled.reshape(X_val_scaled.shape[0], -1)
    else:
        X_train_input = X_train_scaled  # (samples, window_size, n_features)
        X_val_input   = X_val_scaled
    
    # Create a new model for this fold
    model = create_model(model_type=model_type, window_size=window_size, 
                         n_features=X_seq.shape[2], n_outputs=y_seq.shape[1])
    
    # Define callbacks
    early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True, verbose=0)
    
    # Train the model on this fold's training data, with validation
    history = model.fit(X_train_input, y_train_scaled, 
                        epochs=100, batch_size=32,
                        validation_data=(X_val_input, y_val_scaled),
                        callbacks=[early_stop, checkpoint],
                        verbose=0)
    
    # Evaluate on the validation set
    val_loss, val_mae = model.evaluate(X_val_input, y_val_scaled, verbose=0)
    val_scores.append((val_loss, val_mae))
    print(f" Fold {fold} - Validation MSE: {val_loss:.4f}, MAE: {val_mae:.4f}")
    fold += 1

# Calculate average performance across all folds
val_losses = [score[0] for score in val_scores]
val_maes = [score[1] for score in val_scores]
print(window_size, "Frame Window")
print("\nAverage 10-Fold Validation MSE:", np.mean(val_losses))
print("Average 10-Fold Validation MAE:", np.mean(val_maes))

Training fold 1 of 10...
 Fold 1 - Validation MSE: 0.0166, MAE: 0.0904
Training fold 2 of 10...
 Fold 2 - Validation MSE: 0.0207, MAE: 0.1075
Training fold 3 of 10...
 Fold 3 - Validation MSE: 0.0210, MAE: 0.1084
Training fold 4 of 10...
 Fold 4 - Validation MSE: 0.0146, MAE: 0.0898
Training fold 5 of 10...
 Fold 5 - Validation MSE: 0.0164, MAE: 0.0944
Training fold 6 of 10...
 Fold 6 - Validation MSE: 0.0166, MAE: 0.0953
Training fold 7 of 10...
 Fold 7 - Validation MSE: 0.0133, MAE: 0.0846
Training fold 8 of 10...
 Fold 8 - Validation MSE: 0.0144, MAE: 0.0890
Training fold 9 of 10...
 Fold 9 - Validation MSE: 0.0146, MAE: 0.0895
Training fold 10 of 10...
 Fold 10 - Validation MSE: 0.0225, MAE: 0.1072

Average 10-Fold Validation MSE: 0.017070140782743694
Average 10-Fold Validation MAE: 0.09559105336666107


Explanation: For each fold, we perform the steps discussed. We reshape and scale the data within the fold to ensure the scaler is fit only on training data. We conditionally reshape the inputs for the dense model (since it expects a flat input). We then compile a fresh model using our create_model function and train it.
The EarlyStopping is set with patience=5, meaning if the validation loss doesn't improve for 5 epochs in a row, training will stop. We set restore_best_weights=True so the model will revert to the best weights achieved during this training when we evaluate it.
The ModelCheckpoint will save the best model of the current fold to best_model.keras. After training each fold, you could load this file to examine or use the best model for that fold. Note that here we use the same filename for each fold, so it will be overwritten each time. If you want to save all fold models, you can specify filepath=f"best_model_fold{fold}.keras" in the callback.
We run model.evaluate on the validation set to get the final MSE and MAE for that fold and store them. Finally, we compute the average MSE and MAE across all 10 folds to get an overall sense of performance. This average is a more stable estimate of how the model might perform on unseen data than a single train/test split. Note: We set verbose=0 in model.fit to suppress the per-epoch output for cleanliness (you can set verbose=1 to see the training progress for each fold). We do print a summary line per fold and the average at the end.
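Because y is standardized inside each fold, the MSE and MAE returned by model.evaluate are in standardized units rather than the sensor's native units. The sketch below shows one way to undo the scaling and report a per-joint MAE. mae_in_original_units is a hypothetical helper and the numbers are toy values. In the notebook, the per-joint statistics would come from y_scaler.mean_ and y_scaler.scale_, and the predictions from model.predict.

```python
import numpy as np


def mae_in_original_units(y_true, y_pred_scaled, y_mean, y_std):
    """Per-joint MAE after undoing standardization.

    y_true:        (n_samples, n_joints) targets in original units
    y_pred_scaled: (n_samples, n_joints) model outputs in standardized units
    y_mean, y_std: per-joint statistics from the training fold
    """
    y_pred = y_pred_scaled * y_std + y_mean     # what inverse_transform computes
    return np.abs(y_true - y_pred).mean(axis=0)


# Toy numbers: two joints whose Z values have very different spreads
y_mean = np.array([0.0, 1.0])
y_std = np.array([0.1, 0.5])
y_true = np.array([[0.05, 1.2], [-0.10, 0.8]])
y_pred_scaled = np.array([[0.0, 0.0], [0.0, 0.0]])   # a model that always predicts the mean

per_joint_mae = mae_in_original_units(y_true, y_pred_scaled, y_mean, y_std)
print(per_joint_mae)
```

Reporting the error per joint also shows whether a few joints dominate the average, which a single scaled number hides.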
Results and Next Steps
After running the cross-validation above, you will have the average validation performance of the chosen architecture (model_type). For example, you might see output like:
Fold 1 - Validation MSE: 0.0123, MAE: 0.0805  
...  
Fold 10 - Validation MSE: 0.0131, MAE: 0.0832  

Average 10-Fold Validation MSE: 0.0127  
Average 10-Fold Validation MAE: 0.0810
This indicates the model's typical prediction error. Because the targets are standardized inside each fold, these numbers are in units of standard deviations of Z, not in the sensor's native units: an MAE of about 0.081 means the prediction is off by roughly 0.08 standard deviations on average. To report the error in the original units (possibly meters), apply y_scaler.inverse_transform to the predictions before computing the metric.
You can now compare different architectures by setting model_type to 'dense' or 'hybrid' (the run above used 'conv') and re-running the cross-validation cell. Because the code is modular, trying a new architecture only requires changing that variable; create_model takes care of building the appropriate model. By comparing the average MAE (or MSE) across folds for each architecture, you can determine which model type performs best for this task.
Early Stopping and Model Checkpoint Recap: If training stops well before the maximum number of epochs, the validation loss has plateaued, which is the intended behaviour. If it seems to stop too early (the model may be underfitting), consider increasing patience. If training always runs the full number of epochs and validation loss is still improving, increase epochs or adjust the learning rate; if training simply takes too long, reduce patience. The model checkpoint keeps the best weights on disk; since we used restore_best_weights=True, the model you evaluate already holds the best weights when early stopping triggers, so loading from the checkpoint isn't strictly necessary unless you want to reuse the model later. To disable either feature, simply remove it from the callbacks list.
Extensibility: Feel free to experiment further:
Adjust the window_size for the input sequence (e.g., try 3 or 7) and see if more or less temporal context helps.
Modify the network architecture: you can add more Dense layers, change the number of filters in Conv1D, or even try other sequence models like LSTM/GRU for comparison.
Tune hyperparameters such as learning rate, batch size, or training epochs to see their effect on performance.
If you have a separate test set, after choosing the best model type and training on all data (or all training data), evaluate the model on that test set to get a final unbiased performance metric.
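When comparing architectures or settings, the spread across folds is as informative as the mean. The sketch below summarizes the per-fold scores printed for the 'conv' run recorded above, using only the standard library. summarize is a hypothetical helper name.

```python
import statistics

# Per-fold validation scores copied from the 'conv' run recorded above (window_size = 10)
fold_mse = [0.0166, 0.0207, 0.0210, 0.0146, 0.0164, 0.0166, 0.0133, 0.0144, 0.0146, 0.0225]
fold_mae = [0.0904, 0.1075, 0.1084, 0.0898, 0.0944, 0.0953, 0.0846, 0.0890, 0.0895, 0.1072]


def summarize(scores):
    """Mean and sample standard deviation across folds."""
    return statistics.mean(scores), statistics.stdev(scores)


for name, scores in [('MSE', fold_mse), ('MAE', fold_mae)]:
    mean, std = summarize(scores)
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

If two architectures differ in mean MAE by less than this fold-to-fold spread, the difference is probably not a reliable basis for choosing between them.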
This notebook provides a clear, modular framework for tackling the Kinect XY→Z prediction problem, making it easier for beginners to understand and adjust various components of the deep learning pipeline. Happy experimenting!