# Assignment 3: Motion AutoEncoder

In this assignment, you will implement and train an autoencoder for motion generation using the CMU-Mocap dataset. You will develop a simplified version of a motion manifold system capable of learning from motion capture data, synthesizing new movements, and enabling basic motion editing techniques.

Please read the following two papers as they are your reference for your implementation:
* Learning Motion Manifolds with Convolutional Autoencoders, https://www.ipab.inf.ed.ac.uk/cgvu/motioncnn.pdf
* A Deep Learning Framework for Character Motion Synthesis and Editing, https://www.ipab.inf.ed.ac.uk/cgvu/motionsynthesis.pdf

To help you understand motion data representation, here are some valuable resources:

* **CMU Mocap dataset**: https://mocap.cs.cmu.edu/
  * The raw data format may be challenging to interpret directly. For easier use, download the BVH format (which our dataloader utilizes): https://github.com/una-dinosauria/cmu-mocap
  * Blender (https://www.blender.org/) can be used to visualize the downloaded BVH files. This will provide a clearer understanding of the human skeleton structure used in the dataset and allow you to visualize the raw motion sequences.
* **Additional reference** (optional): For a deeper exploration of human models, you can examine more motion data in FBX format at: https://www.mixamo.com/#/?page=1&type=Motion%2CMotionPack
  * Note that Mixamo uses a different skeletal structure than our dataset, but it includes human mesh deformation with motion, which BVH data doesn't provide.


This assignment includes substantial starter code in separate Python files to help you get started. Your tasks are to:

* Complete all sections marked with `TODO` comments. Feel free to add code or restructure functions to improve your implementation's flexibility during training.
* Present your results by embedding videos and images in this notebook. You should also answer all questions and complete the write-up sections in this notebook.
* Include all necessary video files, images, and code files in your submission for proper evaluation. **Important**: Do NOT submit model checkpoints as these will unnecessarily increase your submission size.

**Please reserve enough time for this assignment given the potential amount of time for training.**

In [None]:
# For you to add video to your submission.
from IPython.display import Video
Video('./vid/sample_0.mp4')

## Part 1: Familiarize Yourself with the Data (10 points)

Unlike previous assignments where standard dataloaders were readily available in existing libraries, this assignment requires implementing a custom dataloader for the CMU-Mocap dataset.

We've provided most of the dataloader implementation in `dataloader.py` and included a pre-processed version of the necessary data in `cmu-mocap/cache` (https://utexas.box.com/v/cmu-mocap-cache). Please download this data and place the `cmu-mocap` folder in your current working directory. Alternatively, you can download the BVH files from the links provided earlier and generate your own pre-processed data.

For this part, you need to implement data normalization in the dataloader. There are two sets of TODOs in the file:

1. First, compute appropriate statistics for normalizing your data based on the entire dataset
2. Second, implement the normalization procedure in the `__getitem__` function, where the dataset retrieves one sample (the `DataLoader` then collates samples into batches)

Upon successful completion of this section, you should be able to visualize four sample motion sequences by running `python dataloader.py`.
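The two steps above can be sketched as follows. This is a minimal illustration using per-feature z-scoring, which is only one of several reasonable choices. The function names and the `[N, T, C]` array layout are assumptions made for this example and do not come from `dataloader.py`.

```python
import numpy as np

def compute_stats(windows, eps=1e-8):
    """Per-feature mean and std over all windows and frames.

    windows: array of shape [N, T, C] (N windows, T frames, C features).
    """
    flat = windows.reshape(-1, windows.shape[-1])  # [N*T, C]
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)
    # Features that never change would give std ~ 0; use 1 to avoid dividing by zero.
    std = np.where(std < eps, 1.0, std)
    return mean, std

def normalize(window, mean, std):
    """Z-score a window (or a stack of windows) feature by feature."""
    return (window - mean) / std

def denormalize(window, mean, std):
    """Invert normalize(), e.g. before visualizing a reconstruction."""
    return window * std + mean

# Toy data standing in for motion windows (hypothetical sizes).
rng = np.random.default_rng(0)
windows = rng.normal(loc=3.0, scale=2.0, size=(8, 160, 6))
mean, std = compute_stats(windows)
normed = normalize(windows, mean, std)
```

Computing the statistics once over the whole dataset, rather than per clip, keeps every sample in the same coordinate system, which matters when you later decode and denormalize generated motion.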

Important notes:
* There are multiple approaches to normalizing motion data. You are free to choose any method that produces reasonable training results in the subsequent sections.
* Include visualizations of your **normalized motion** sequences in your submission for this part, following the example format below.

In [None]:
from IPython.display import Video, display
import glob

for p in sorted(glob.glob("/content/Generative-Visual-Computing-3/motion-ae-output/part1_norm/*.mp4")):
    display(Video(p))


## Part 2: Motion Manifold Learning (45 points)

In this section, you will implement the neural network architecture and training methodology described in the reference paper to learn a motion manifold.

### Convolutional Autoencoder Architecture (15 points)

1. Implement the convolutional autoencoder in `MotionAutoencoder`, following the architecture detailed in one of the reference papers.
2. Develop both the encoding and decoding operations that will allow the network to compress motion data into a lower-dimensional manifold representation and then reconstruct it.
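To make the encode/decode structure concrete, here is a small sketch of a one-layer temporal convolutional autoencoder in PyTorch. The class name `TinyMotionAE` and all sizes (63 input channels, 256 hidden units, kernel width 25) are placeholders chosen for illustration. They are not taken from the starter code, and the real `MotionAutoencoder` may have a different interface.

```python
import torch
import torch.nn as nn

class TinyMotionAE(nn.Module):
    """Illustrative temporal conv autoencoder (hypothetical, not the starter-code class)."""

    def __init__(self, in_channels=63, hidden_channels=256, kernel_size=25):
        super().__init__()
        pad = kernel_size // 2  # "same" padding for an odd kernel width
        # Encoder: convolve over time, apply a nonlinearity, halve the temporal resolution.
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size, padding=pad),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),
        )
        # Decoder: restore the temporal resolution, then map back to input channels.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv1d(hidden_channels, in_channels, kernel_size, padding=pad),
        )

    def forward(self, x):
        h = self.encoder(x)        # [B, hidden_channels, T/2]
        return self.decoder(h), h  # reconstruction [B, in_channels, T] and hidden units

model = TinyMotionAE()
x = torch.randn(4, 63, 160)        # [batch, channels, frames]
x_hat, h = model(x)
print(x_hat.shape, h.shape)
```

Treating time as the convolution axis and joint features as channels lets the same filters apply at every frame, so the hidden units form a temporally local code of the motion.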

### Training Procedure (30 points)

1. Implement the training procedure in `MotionManifoldTrainer` for your autoencoder, carefully considering the appropriate loss function(s) to use for motion data.
2. Generate and include training curves using the visualization template provided in the starter code.

Important considerations:
1. The reference paper was published several years ago when neural network implementations were often manually coded with custom operations. You'll need to adapt these concepts to modern PyTorch conventions and standard operations. Some training parameters may require adjustment to work effectively with contemporary deep learning frameworks.
2. Your submission for this part should include clear visualizations of your training curves showing loss change over time.
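One possible loss, sketched below in NumPy for clarity, combines a mean-squared reconstruction term with an L1 sparsity penalty on the hidden units. The function name and the exact combination are assumptions for illustration. In your trainer you would express the same idea with PyTorch tensors so that gradients can flow.

```python
import numpy as np

def autoencoder_loss(x, x_hat, hidden, sparsity_weight=1e-3):
    """Reconstruction MSE plus an L1 penalty on the hidden activations.

    x, x_hat: arrays of identical shape, e.g. [B, C, T]
    hidden:   encoder output, e.g. [B, H, T']
    Returns (total, reconstruction term, sparsity term).
    """
    recon = np.mean((x_hat - x) ** 2)
    sparsity = np.mean(np.abs(hidden))
    return recon + sparsity_weight * sparsity, recon, sparsity

# Toy check with random arrays (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 63, 160))
x_hat = x + 0.1 * rng.normal(size=x.shape)
hidden = rng.normal(size=(4, 256, 80))
total, recon, sparsity = autoencoder_loss(x, x_hat, hidden)
print(f"total={total:.4f} recon={recon:.4f} sparsity={sparsity:.4f}")
```

Logging the two terms separately, not only their sum, makes the training curves easier to interpret when you tune the sparsity weight.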

In [None]:
from IPython.display import Image, display
import os

plot_path = os.path.join("/content/Generative-Visual-Computing-3/motion-ae-output/", "plots", "training_curves.png")
print("Plot exists:", os.path.exists(plot_path), plot_path)
display(Image(plot_path))

## Part 3: Motion Synthesis (30 points)

### Motion Interpolation (15 points)

Implement the function `MotionManifoldSynthesizer.interpolate_motions` that creates transitions between different motions using the learned manifold. This function should accept two motion sequences sampled from the dataset and generate an interpolated motion that blends naturally between them. You can visualize your results using the provided `visualize_interpolation` function.

Your submission for this section should include at least two video examples demonstrating motion interpolation between different movement types.
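A common way to blend on the manifold is to interpolate the hidden codes of the two motions and decode the result. The sketch below shows only the interpolation step on toy arrays. The function names and the `[H, T']` latent layout are assumptions, and the encoder and decoder calls are left out because they depend on your model.

```python
import numpy as np

def interpolate_latents(z1, z2, t):
    """Linear blend of two latent codes; t=0 returns z1 and t=1 returns z2."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 - t) * z1 + t * z2

def transition_latents(z1, z2):
    """Blend weight ramps from 0 to 1 along the time axis (last axis),
    so the decoded motion starts like motion 1 and ends like motion 2."""
    w = np.linspace(0.0, 1.0, z1.shape[-1])  # one weight per latent frame
    return (1.0 - w) * z1 + w * z2

# Toy stand-ins for encoder outputs of shape [H, T'].
z1 = np.zeros((256, 80))
z2 = np.ones((256, 80))
z_mid = interpolate_latents(z1, z2, 0.5)  # constant 50/50 blend
z_ramp = transition_latents(z1, z2)       # gradual transition over time
```

A constant `t` gives an "in-between" motion, while the time-varying ramp produces a transition from one clip to the other.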

In [None]:

import os
from IPython.display import Video, display

# Interpolation results to embed in the notebook
paths = ["/content/Generative-Visual-Computing-3/motion-ae-output/interp_1.mp4",
         "/content/Generative-Visual-Computing-3/motion-ae-output/interp_2.mp4"]

# Display each video (embed=True is important in Colab)
for p in paths:
    if os.path.exists(p) and os.path.getsize(p) > 0:
        display(Video(p, embed=True))
    else:
        print(f"Skipping {p} (missing or empty).")

### Fixing Corrupt Motion Data (15 points)

Implement the function `MotionManifoldSynthesizer.fix_corrupted_motion` that projects corrupted motion data onto the learned manifold and reconstructs corrected, natural-looking movements. This function should demonstrate the manifold's ability to act as a prior distribution over valid human motion. Use the provided `visualize_motion_comparison` function to create side-by-side comparisons of the corrupted input and your reconstructed output.

Your submission for this section should include at least two video examples showing motion correction from different types of corruptions provided in the starter code.
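The idea of projecting onto a manifold can be illustrated with a linear stand-in. In the sketch below, the "manifold" is a random low-dimensional subspace and `project` plays the role of encode-then-decode. This is an assumption made so the example runs without a trained model. Your autoencoder is nonlinear, so its behavior will differ in detail.

```python
import numpy as np

def corrupt_zero(x, prob, rng):
    """Set each entry of x to zero independently with probability `prob`."""
    keep = rng.random(x.shape) >= prob
    return x * keep

rng = np.random.default_rng(0)
T, C, k = 160, 63, 8                          # frames, features, subspace dimension

# Orthonormal basis Q [C, k] defines the toy "manifold" of valid motions.
Q, _ = np.linalg.qr(rng.normal(size=(C, k)))

def project(x):
    """Linear stand-in for decode(encode(x)): orthogonal projection onto span(Q)."""
    return (x @ Q) @ Q.T

x_clean = rng.normal(size=(T, k)) @ Q.T       # a sample that lies on the manifold
x_corrupt = corrupt_zero(x_clean, prob=0.5, rng=rng)
x_fixed = project(x_corrupt)

err_corrupt = np.linalg.norm(x_corrupt - x_clean)
err_fixed = np.linalg.norm(x_fixed - x_clean)
print(f"error before: {err_corrupt:.3f}, after projection: {err_fixed:.3f}")
```

Projection removes the part of the corruption that points off the manifold, but whatever lies within the manifold remains, which is why heavy corruption is not fully repaired.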

In [None]:
# Your videos of fixing corrupt motion data.

## Part 4: Analysis Questions (15 points)

Answer the questions with your own analysis. The questions are open-ended. We are looking for your own observations from the experiments you did. The autoencoder is a relatively simple method, so many things here won't be perfect.

1. Explain your chosen normalization approach for the motion data. Why did you select this method, and how does it specifically address the challenges of human motion data? What other normalization techniques did you consider, and why did you not choose them?

[Answer]:
I first removed the overall movement (the person’s forward drift), then normalized each joint by its own mean and standard deviation across the whole dataset. In other words, I made everything local and then z-scored per joint. I picked this because different joints move over different ranges, and I wanted the model to focus on pose change, not on where the person is in the room. I did not use per-clip normalization, because it makes clips inconsistent with each other, or min–max scaling, because it is sensitive to outliers. This simple method was stable and worked well.

2. After training your autoencoder, explore and describe the structure of your learned manifold. You can use t-SNE or PCA to visualize the hidden unit space (include one image in your answer).  Are different motion types clustered in particular regions? Can you identify meaningful directions in the latent space that correspond to specific motion attributes (speed, posture, etc.)?

[Answer]:
When I plotted the encoder features with PCA/t-SNE, similar motions appeared near each other. Walking, jogging, and running formed a continuum, a "speed line" from slow to fast. Another direction seemed to represent posture, from upright to leaned forward. Arm and upper-body actions appeared slightly off to the side. While the clusters weren’t perfect, distinct families of motion emerged with speed and posture being clear factors.
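A 2D PCA view of the hidden units can be produced along the following lines. The random `hidden` array is only a stand-in for real encoder outputs, and averaging over time to get one vector per clip is one possible pooling choice, assumed here for illustration.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project row vectors onto their top principal components via SVD.

    features: [N, D]. Returns (coords [N, n_components], explained-variance ratios).
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]

rng = np.random.default_rng(0)
hidden = rng.normal(size=(100, 256, 80))  # stand-in for encoder outputs [N, H, T']
features = hidden.mean(axis=2)            # pool over time: one vector per clip [N, H]
coords, explained = pca_project(features)
print(coords.shape, explained)
```

Scatter-plotting `coords` colored by motion label (for example with matplotlib) gives the kind of figure described in the answer above.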

3. Critically analyze the quality of your interpolated motions. Where does the interpolation succeed or fail? What patterns do you notice about transitions between dissimilar motions versus similar ones?

[Answer]:
When blending between similar clips, the transition is smooth: the feet and hips stayed mostly clean. Blending very different clips, in contrast, introduced issues such as foot sliding and misaligned steps. Blending therefore works best when the two clips are already similar and in sync; the farther apart they are, the more noticeable the glitches become.
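Foot sliding can also be quantified, not only judged by eye. The sketch below measures the mean horizontal displacement per frame of the foot joints while they are labeled as in contact. The function name, the `[T, J, 3]` layout, the joint indices, and the choice of Y as the up axis are all assumptions for this example.

```python
import numpy as np

def foot_slide(positions, contacts, foot_joints, up_axis=1):
    """Mean horizontal displacement per frame of foot joints during contact.

    positions:   [T, J, 3] joint positions
    contacts:    [T, F] boolean contact labels, one column per entry of foot_joints
    foot_joints: list of F joint indices
    """
    feet = positions[:, foot_joints, :]                     # [T, F, 3]
    horiz = np.delete(feet, up_axis, axis=2)                # drop vertical axis -> [T, F, 2]
    step = np.linalg.norm(horiz[1:] - horiz[:-1], axis=2)   # [T-1, F]
    in_contact = contacts[1:] & contacts[:-1]               # contact on both frames of a step
    if not in_contact.any():
        return 0.0
    return float(step[in_contact].mean())

# Toy motion: joint 2 is a "foot" that drifts 0.1 units per frame while in contact.
T = 5
positions = np.zeros((T, 3, 3))
positions[:, 2, 0] = 0.1 * np.arange(T)
contacts = np.ones((T, 1), dtype=bool)
slide = foot_slide(positions, contacts, foot_joints=[2])
```

Comparing this value for interpolations between similar clips and between dissimilar clips gives a number to back up the visual impression.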

4. For the corrupted motion reconstruction task, analyze which types of corruption your system handles well versus poorly. What does this tell you about the properties of your learned manifold?

[Answer]:
Missing chunks or knocked-out key joints (feet or hips) were harder: the model filled the gap with something average, which lost some style details. This indicates that the manifold acts as a strong prior toward smooth, typical motion, which is great for light noise but insufficient for large gaps or strict foot-contact rules. Adding contact awareness or stronger motion constraints would help with the tough cases.

## Extra Credit: Advanced Motion Synthesis and Editing (20 points)

In this optional extra credit section, you'll implement more sophisticated motion synthesis techniques and potentially extend your model architecture to achieve these advanced tasks.

For this task, you'll develop a method to complete partially specified motion sequences. Given a motion with missing frames, your system should intelligently fill in the gaps while maintaining natural movement characteristics and continuity.

You need to:
- Develop a method to mask out and fill missing segments in motion sequences
- Ensure smooth transitions between existing and synthesized motion
- Leverage your trained motion manifold to generate plausible completions

Your submission should:
- Provide at least two examples of filling gaps in the middle of motion sequences
- Provide at least two examples of extending incomplete motions by synthesizing the ending frames
- For each visualization, you need to show the input and output side-by-side
- Include a short analysis of your approach and the quality of your results
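One simple way to get smooth transitions is to keep the known frames, take the model's reconstruction inside the gap, and cross-fade near the boundaries. The sketch below shows only this blending step. The function names, the `fade` width, and the assumption that missing frames are zero-filled are all choices made for this example, and `reconstructed` stands in for the output of your autoencoder.

```python
import numpy as np

def blend_weights(known_mask, fade=8):
    """Per-frame weight for the observed motion: 0 in gaps, 1 far from gaps,
    with a ramp over roughly `fade` frames next to each gap."""
    w = known_mask.astype(float)
    half = fade // 2
    padded = np.pad(w, half, mode="edge")  # avoid artificial ramps at the clip ends
    smooth = np.convolve(padded, np.ones(fade) / fade, mode="same")[half:half + len(w)]
    return np.minimum(smooth, w)           # missing frames always get weight 0

def blend_completion(observed, reconstructed, known_mask, fade=8):
    """Combine observed [T, C] and reconstructed [T, C] motion frame by frame."""
    w = blend_weights(known_mask, fade)[:, None]
    return w * observed + (1.0 - w) * reconstructed

T = 40
known = np.ones(T, dtype=bool)
known[15:25] = False                       # frames 15..24 are missing
observed = np.ones((T, 3))
observed[~known] = 0.0                     # missing frames are zero-filled (assumption)
reconstructed = np.full((T, 3), 0.5)       # stand-in for the decoder output
filled = blend_completion(observed, reconstructed, known)
```

To extend a motion past its end, the same code applies with the mask set to `False` for the trailing frames.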

### Motion Editing (10 points)

For this task, you'll implement the style transfer technique described in "A Deep Learning Framework for Character Motion Synthesis and Editing." This will allow you to transfer the style characteristics of one motion to another while preserving the content of the target motion.

You need to develop a method for adding constraints to the trained autoencoder.

Your submission should:
- Provide at least three examples of motion editing results between different motion types. Each visualization should include the original content motion, the reference motion, and your result
- Write a short analysis of your results, discussing successes, limitations, and potential improvements
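A rough sketch of a style-transfer objective in hidden-unit space, assuming style is captured by the Gram matrix of the hidden units and content by the hidden units themselves. Check the exact formulation and weighting against the paper. The function names and weights below are placeholders, and in practice you would optimize the hidden units with PyTorch gradients and not with NumPy.

```python
import numpy as np

def gram_matrix(hidden):
    """hidden: [H, T'] hidden units over time -> [H, H] Gram matrix averaged over frames.
    It records which hidden units are active together, not when they are active."""
    return hidden @ hidden.T / hidden.shape[1]

def style_content_loss(h, h_content, h_style, style_weight=1.0, content_weight=1.0):
    """Penalize style mismatch (Gram matrices) and content mismatch (hidden units)."""
    style = np.mean((gram_matrix(h) - gram_matrix(h_style)) ** 2)
    content = np.mean((h - h_content) ** 2)
    return style_weight * style + content_weight * content

# Toy hidden codes (hypothetical sizes).
rng = np.random.default_rng(0)
h_content = rng.normal(size=(16, 40))
h_style = rng.normal(size=(16, 40))
loss_at_content = style_content_loss(h_content, h_content, h_style)
print(f"loss when starting from the content code: {loss_at_content:.4f}")
```

Because the Gram matrix sums over time, it is unchanged when frames are shuffled, which is why it can describe style independently of the timing of the content motion.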


In [None]:
# === Environment & Repro ===
import os, sys, json, random, platform, torch, numpy as np
print("Python:", sys.version.split()[0], "| PyTorch:", torch.__version__, "| CUDA:", torch.cuda.is_available())
print("Platform:", platform.platform())
seed = 42
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)

# Paths (edit DATA_DIR to your BVH root)
DATA_DIR = "./cmu-mocap"      # <-- change me
OUT_DIR  = "./output/ae"
os.makedirs(OUT_DIR, exist_ok=True)
device = "cuda" if torch.cuda.is_available() else "cpu"
device


In [None]:
from dataloader import CMUMotionDataset

ds = CMUMotionDataset(
    data_dir=DATA_DIR,
    frame_rate=30,
    window_size=160,
    overlap=0.5,
    include_velocity=True,
    include_foot_contact=True
)

print("Windows:", len(ds))
print("Files:", len(ds.motion_data))
print("Joints:", len(ds.get_joint_names()))
print("Mean pose shape:", ds.get_mean_pose().shape, "| Std shape:", ds.get_std().shape)

s = ds[0]
for k, v in s.items():
    try:
        print(f"{k:28s}", tuple(v.shape))
    except AttributeError:  # non-tensor entries (e.g. strings or ints) have no .shape
        print(f"{k:28s}", type(v))


In [None]:
from AE import MotionManifoldTrainer

trainer = MotionManifoldTrainer(
    data_dir=DATA_DIR,
    output_dir=OUT_DIR,
    batch_size=16,
    epochs=5,               # start with 5/5 to verify; raise to 25/25 for final
    fine_tune_epochs=5,
    learning_rate=1e-3,
    fine_tune_lr=5e-4,
    sparsity_weight=1e-3,
    window_size=160,
    val_split=0.1,
    device=device
)
stats = trainer.train()
stats


In [None]:
from IPython.display import Image, display
import os, json

plot_path = os.path.join(OUT_DIR, "plots", "training_curves.png")
display(Image(filename=plot_path))

with open(os.path.join(OUT_DIR, "training_stats.json"), "r") as f:
    loaded_stats = json.load(f)
loaded_stats.keys(), {k:(len(v["train_loss"]), len(v["val_loss"])) for k,v in loaded_stats.items()}


In [None]:
import torch
from torch.utils.data import DataLoader
import numpy as np

# Use the validation split that was created inside trainer
val_loader = DataLoader(trainer.val_dataset, batch_size=8, shuffle=False)

def batch_to_input(batch, model_in_ch):
    X = batch["positions_normalized_flat"].to(device)        # [B,T,Cpos]
    parts = [X]
    if ("trans_vel_xz" in batch) and ("rot_vel_y" in batch):
        tv = batch["trans_vel_xz"].to(device)                 # [B,T,2]
        ry = batch["rot_vel_y"].to(device).unsqueeze(-1)      # [B,T,1]
        parts += [tv, ry]
    X_btC = torch.cat(parts, dim=-1)                          # [B,T,C]
    X_bCt = X_btC.permute(0, 2, 1).contiguous()               # [B,C,T]
    # pad/truncate to match model input channels
    if X_bCt.size(1) < model_in_ch:
        X_bCt = torch.cat([X_bCt, X_bCt.new_zeros(X_bCt.size(0), model_in_ch - X_bCt.size(1), X_bCt.size(2))], dim=1)
    elif X_bCt.size(1) > model_in_ch:
        X_bCt = X_bCt[:, :model_in_ch, :]
    return X_bCt

model = trainer.model.eval()
# Assumes the first encoder parameter is a conv weight of shape [out_ch, in_ch, kernel]
in_ch = next(model.encoder.parameters()).shape[1]

mse_list, mae_list, vel_mse_list = [], [], []
with torch.no_grad():
    for batch in val_loader:
        x = batch_to_input(batch, in_ch)                        # [B,C,T]
        x_hat, _ = model(x, corrupt_input=False)                # recon in normalized channel space

        # only compare the position channels (first J*3)
        J3 = batch["positions_normalized_flat"].shape[-1]       # Cpos
        x_pos  = x[:, :J3, :]                                   # [B, Cpos, T]
        xh_pos = x_hat[:, :J3, :]                                # [B, Cpos, T]

        mse = torch.mean((xh_pos - x_pos)**2).item()
        mae = torch.mean(torch.abs(xh_pos - x_pos)).item()
        # velocity mse in time
        vel_x  = x_pos [:, :, 1:] - x_pos [:, :, :-1]
        vel_xh = xh_pos[:, :, 1:] - xh_pos[:, :, :-1]
        vel_mse = torch.mean((vel_xh - vel_x)**2).item()

        mse_list.append(mse); mae_list.append(mae); vel_mse_list.append(vel_mse)

print({
    "MSE (norm space)": float(np.mean(mse_list)),
    "MAE (norm space)": float(np.mean(mae_list)),
    "Velocity MSE": float(np.mean(vel_mse_list))
})


In [None]:
from AE import MotionManifoldSynthesizer
model_path = os.path.join(OUT_DIR, "models", "motion_autoencoder.pt")
synth = MotionManifoldSynthesizer(model_path=model_path, dataset=trainer.dataset, device=device)


In [None]:
from visualization import visualize_interpolation
ds = trainer.dataset
m1 = ds[0]
m2 = ds[min(1, len(ds)-1)]

P_interp = synth.interpolate_motions(m1, m2, t=0.5)  # [1,T,J,3]
interp_mp4 = os.path.join(OUT_DIR, "interp.mp4")
visualize_interpolation(
    m1["positions"].unsqueeze(0).cpu(),
    m2["positions"].unsqueeze(0).cpu(),
    P_interp.cpu(),
    ds.joint_parents,
    interp_mp4
)
interp_mp4


In [None]:
from visualization import visualize_motion_comparison
sample = ds[0]
corr, fixed = synth.fix_corrupted_motion(
    sample,
    corruption_type="zero",
    corruption_params={"prob": 0.5}
)
fix_mp4 = os.path.join(OUT_DIR, "fix_compare.mp4")
visualize_motion_comparison(corr.cpu(), fixed.cpu(), ds.joint_parents, fix_mp4)
fix_mp4


In [None]:
from visualization import visualize_motion_to_video
for i in range(min(3, len(ds))):
    out = os.path.join(OUT_DIR, f"sample_{i}.mp4")
    visualize_motion_to_video(ds[i]["positions"], ds.joint_parents, out)
out
