---
## **<p style="text-align: center; text-decoration: underline;">DATA CHALLENGE</p>**
# **<p style="text-align: center;">HUMAN MOTION DESCRIPTION (HMD): Motion-To-Text</p>**
---

> IMT Nord Europe *2025*.

---

![examples](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fimg.clipart-library.com%2F2%2Fclip-motions%2Fclip-motions-6.png&f=1&nofb=1&ipt=0747ffa645bb5f7798e8a2d44499b28f1156ce0e83b1b300fabfed4c6ab1fdf2&ipo=images)

### ■ **Overview**
In this data challenge, you will work at the intersection of natural language processing (NLP) and human motion modeling, using the HumanML3D dataset. The dataset pairs 3D human motion sequences with rich textual descriptions, which lets models learn mappings between language and motion in both directions (text-to-motion and motion-to-text). This challenge focuses on the motion-to-text direction.

#### **I. Main Task: Motion-To-Text Generation**
- **Motion-to-Text:** Develop a model to describe human motions in natural language given a sequence of 3D poses.

#### **II. Dataset Overview:**
- HumanML3D includes 14,616 motion samples across diverse actions (walking, dancing, sports) and 44,970 text annotations.
- Data includes skeletal joint positions, rotations, and fine-grained textual descriptions.

<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fproduction-media.paperswithcode.com%2Fdatasets%2F446194c5-ce59-43eb-b4cb-570a7a4d0cd9.png&f=1&nofb=1&ipt=b2edbe3251cab88e26a7f9d4e765c811b2cc890dc2ace7f7456baeca076b115b&ipo=images" alt="description" style="width:800px; height:600px;" />

The provided dataset contains the following components:

- 1. `motions` Folder: Contains `.npy` files, each representing a sequence of body poses. Each file has a shape of `(T, N, d)`, where:
  - `T`: Number of frames in the sequence (varies across sequences).
  - `N`: Number of joints in the body (22 in this case).
  - `d`: Dimension of each joint (3D coordinates: `x`, `y`, `z`).

- 2. `texts` Folder: Contains one text file per motion (the loading code in this notebook reads them as `{id}.txt`), each providing **3 textual descriptions** of the corresponding motion sequence, one per line. Each description is followed by `#` and the same words annotated with part-of-speech (POS) tags. Example: `a person jump hop to the right#a/DET person/NOUN jump/NOUN hop/NOUN to/ADP the/DET right/NOUN#`

- 3. File Lists
    - **`all.txt`**: List of all motion files in the dataset.
    - **`train.txt`**: List of motion files for training.
    - **`val.txt`**: List of motion files for validation.
    - **`test.txt`**: List of motion files for testing.
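
As an illustration of the annotation format described above, the following sketch splits one annotated line into its caption and its (word, POS tag) pairs. The function name `parse_annotation` is hypothetical, and the sketch assumes every line follows the `caption#tagged words#` layout of the example.

```python
def parse_annotation(line):
    """Split 'caption#word/TAG word/TAG ...#' into (caption, [(word, tag), ...])."""
    fields = line.strip().split("#")
    caption = fields[0].strip()
    tagged = fields[1].split() if len(fields) > 1 else []
    # rsplit keeps words that themselves contain '/' intact
    pos_tags = [tuple(token.rsplit("/", 1)) for token in tagged if "/" in token]
    return caption, pos_tags


example = "a person jump hop to the right#a/DET person/NOUN jump/NOUN hop/NOUN to/ADP the/DET right/NOUN#"
caption, pos_tags = parse_annotation(example)
print(caption)       # a person jump hop to the right
print(pos_tags[:2])  # [('a', 'DET'), ('person', 'NOUN')]
```

Only the caption is needed to train a captioning model; the POS tags are optional side information.
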


#### **III. Evaluation Metrics**

BLEU (Bilingual Evaluation Understudy): The BLEU score measures how closely a generated text matches one or more reference texts, based on n-gram precision.
> Note: Higher BLEU scores (closer to 1, or 100\%) mean the generated description shares more n-grams with the ground-truth annotations. BLEU measures lexical overlap, not semantic accuracy: a correct description phrased differently from the references can still score low.
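
To make the n-gram precision idea concrete, here is a simplified sentence-level BLEU written from scratch. It is a sketch, not the official scoring script: the function name `simple_bleu`, the whitespace tokenization, the restriction to unigrams and bigrams, and the absence of smoothing are all assumptions made for brevity.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=2):
    """Simplified sentence-level BLEU with uniform weights up to `max_n`-grams."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Clip each candidate n-gram count by its maximum count in any reference
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference whose length is closest to the candidate
    ref_len = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


references = ["a person walks forward slowly", "someone walks slowly ahead"]
print(round(simple_bleu("a person walks forward slowly", references), 3))  # 1.0
print(round(simple_bleu("a person runs forward", references), 3))          # 0.5
```

In the second call, three of four unigrams and one of three bigrams match, so the score is the geometric mean of 0.75 and 1/3.
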

Submit your solution as a CSV file. For each ID in the motion test set, you must predict the corresponding description. The file must contain a header row and follow this format:

| id      | text                                                                 |
|---------|---------------------------------------------------------------------|
| 004822  | A person walks slowly forward, swinging their arms naturally        |
| 014457  | Someone performs a golf swing with proper form                      |
| 009613  | An individual jogs backwards diagonally across the room             |
| 008463  | A man bends down to pick up an object while walking                 |
| 012365  | A dancer spins clockwise while raising both arms                    |
| 007933  | Two people engage in a slow-motion martial arts demonstration       |
| 003430  | A child skips happily across a playground                           |
| 014522  | An athlete performs a perfect cartwheel sequence                    |
| 005698  | A woman gracefully practices yoga sun salutations                   |
| 001664  | A parkour expert vaults over a low wall                             |

You can generate your submission files using pandas as follows:

    >>> import pandas as pd
    >>> submission = pd.DataFrame({
    ...     'id': ['004822', '014457', ...],
    ...     'text': [
    ...         "a person walking slowly",
    ...         "someone swinging a golf club",
    ...         ...
    ...     ]
    ... })
    >>> submission.to_csv('./submission.csv', index=False)
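
One detail worth checking before submitting: the IDs are zero-padded strings (`004822`), and they lose their leading zeros if they are ever converted to integers. The sketch below writes a submission with Python's standard `csv` module and reads it back to confirm that the IDs survive unchanged. The helper name `write_submission`, the in-memory buffer, and the second example sentence are illustrative choices, not part of the challenge kit.

```python
import csv
import io


def write_submission(predictions, handle):
    """Write {id: text} pairs as a CSV with an 'id,text' header."""
    writer = csv.writer(handle)
    writer.writerow(["id", "text"])
    for motion_id, text in predictions.items():
        writer.writerow([str(motion_id), text])


predictions = {
    "004822": "a person walking slowly",
    "014457": "someone swinging a golf club, then stepping back",
}

buffer = io.StringIO()  # stands in for open("submission.csv", "w", newline="")
write_submission(predictions, buffer)

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["id"])  # 004822
```

When reading such a file back with pandas, passing `dtype={'id': str}` to `pd.read_csv` keeps the IDs as strings.
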

# **Auto Encoder Motion2Text by Antoine Mariot**

In [None]:
pip install -r requirements.txt

In [None]:
import os
import random
from os.path import join as pjoin

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, get_scheduler

### Loading the indices

In [None]:
# Data directories
motion_data_dir = './motions'
text_data_dir = './texts'

In [None]:
# Load the sample IDs listed in a text file (one ID per line)
def load_indices(file_path):
    with open(file_path, 'r') as f:
        indices = [line.strip() for line in f if line.strip()]
    return indices

# Load the IDs for the train, test and validation splits
train_indices = load_indices("./train.txt")
test_indices = load_indices("./test.txt")
val_indices = load_indices("./val.txt")

### Loading the motion sequences

In [None]:
def load_motion_sequences(sequence_dir, indices):
    sequences = []
    for idx in indices:
        file_path = f"{sequence_dir}/{idx}.npy"
        if os.path.exists(file_path):  # Make sure the file exists
            sequences.append(np.load(file_path))
        else:
            print(f"Missing file: {file_path}")
    return sequences

In [None]:
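def normalize_sequences(sequences, mean=None, std=None, eps=1e-8):
    # By default the statistics are computed from `sequences` themselves;
    # pass `mean` and `std` to reuse statistics computed on another split.
    if mean is None or std is None:
        all_data = np.concatenate(sequences, axis=0)  # Stack the frames of all sequences
        mean = all_data.mean(axis=(0, 1))  # Per-coordinate mean over frames and joints
        std = all_data.std(axis=(0, 1))    # Per-coordinate standard deviation

    # Apply the normalization (eps guards against division by zero)
    normalized_sequences = [(seq - mean) / (std + eps) for seq in sequences]
    return normalized_sequences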
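
A common alternative to normalizing each split with its own statistics is to compute the mean and standard deviation on the training split only and reuse them for validation and test, so that all splits live in the same coordinate system. The sketch below illustrates this on synthetic arrays. The helper names `compute_stats` and `apply_stats` and the random data are assumptions for the example, not part of the notebook.

```python
import numpy as np


def compute_stats(sequences):
    """Per-coordinate mean and std over all frames and joints."""
    all_frames = np.concatenate(sequences, axis=0)  # (sum of T, N, d)
    return all_frames.mean(axis=(0, 1)), all_frames.std(axis=(0, 1))


def apply_stats(sequences, mean, std, eps=1e-8):
    return [(seq - mean) / (std + eps) for seq in sequences]


rng = np.random.default_rng(0)
# Synthetic stand-ins for the real motion files: shape (T, 22, 3) with varying T
train = [rng.normal(loc=2.0, scale=3.0, size=(t, 22, 3)) for t in (40, 60)]
val = [rng.normal(loc=2.0, scale=3.0, size=(50, 22, 3))]

mean, std = compute_stats(train)
train_norm = apply_stats(train, mean, std)
val_norm = apply_stats(val, mean, std)  # reuses the training statistics

print(mean.shape, std.shape)  # (3,) (3,)
```
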

In [None]:
def convert_to_tensor(sequences, max_length=150):
    padded_sequences = []
    for seq in sequences:
        if len(seq) > max_length:  # Truncate sequences that are too long
            seq = seq[:max_length]
        else:  # Zero-pad sequences that are too short
            padding = np.zeros((max_length - len(seq), seq.shape[1], seq.shape[2]))
            seq = np.vstack([seq, padding])
        padded_sequences.append(seq)

    # Stack into a single array first: building a tensor from a list of arrays is slow
    return torch.tensor(np.stack(padded_sequences), dtype=torch.float32)
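
After normalization, a zero-padded frame is indistinguishable from a frame in which every joint sits at the dataset mean, so the model cannot tell padding from real data. One possible remedy, sketched below with NumPy only, is to return a boolean mask alongside the padded array. The function name `pad_with_mask` is hypothetical, and the notebook's encoder does not currently consume such a mask.

```python
import numpy as np


def pad_with_mask(seq, max_length=150):
    """Return (padded, mask): padded has shape (max_length, N, d), mask is True on real frames."""
    length = min(len(seq), max_length)
    padded = np.zeros((max_length,) + seq.shape[1:], dtype=np.float32)
    padded[:length] = seq[:length]
    mask = np.arange(max_length) < length
    return padded, mask


short = np.ones((100, 22, 3), dtype=np.float32)
long = np.ones((200, 22, 3), dtype=np.float32)

padded_short, mask_short = pad_with_mask(short)
print(padded_short.shape, int(mask_short.sum()))  # (150, 22, 3) 100
padded_long, mask_long = pad_with_mask(long)
print(padded_long.shape, int(mask_long.sum()))    # (150, 22, 3) 150
```
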


In [None]:
# train
train_sequences = load_motion_sequences(motion_data_dir, train_indices)
train_sequences = [seq for seq in train_sequences if seq.ndim == 3 and seq.shape[0] > 1]
train_sequences = normalize_sequences(train_sequences)
train_tensor = convert_to_tensor(train_sequences)

In [None]:
# test
test_sequences = load_motion_sequences(motion_data_dir, test_indices)
test_sequences = [seq for seq in test_sequences if seq.ndim == 3 and seq.shape[0] > 1]
test_sequences = normalize_sequences(test_sequences)
test_tensor = convert_to_tensor(test_sequences)

In [None]:
# val
val_sequences = load_motion_sequences(motion_data_dir, val_indices)
val_sequences = [seq for seq in val_sequences if seq.ndim == 3 and seq.shape[0] > 1]
val_sequences = normalize_sequences(val_sequences)
val_tensor = convert_to_tensor(val_sequences)
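
Because the cells above drop missing or malformed sequences after loading, the resulting tensors can end up shorter than the ID lists they came from, and pairing them by position later would then attach motions to the wrong IDs. A possible safeguard is to filter IDs and sequences together, as in this sketch. The function name `filter_valid` and the synthetic IDs and shapes are made up for the example.

```python
import numpy as np


def filter_valid(sequences_by_id):
    """Keep sequences that are 3-D with more than one frame, returning IDs and arrays in matching order."""
    kept_ids, kept_sequences = [], []
    for motion_id, seq in sequences_by_id.items():
        if seq.ndim == 3 and seq.shape[0] > 1:
            kept_ids.append(motion_id)
            kept_sequences.append(seq)
    return kept_ids, kept_sequences


# Synthetic stand-ins: the IDs and shapes below are made up for the example
loaded = {
    "000001": np.zeros((30, 22, 3)),
    "000002": np.zeros((1, 22, 3)),   # single frame: dropped
    "000003": np.zeros((22, 3)),      # wrong rank: dropped
    "000004": np.zeros((45, 22, 3)),
}
ids, sequences = filter_valid(loaded)
print(ids)  # ['000001', '000004']
```
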

In [None]:
# Sanity check
print("Val tensor:", val_tensor.shape)
print("Train tensor:", train_tensor.shape)
print("Test tensor:", test_tensor.shape)

In [None]:
# Save the tensors for later reuse
# Path prefix for the saved files (empty string = current directory)
save_path = ""
torch.save(val_tensor, save_path + "val_tensor.pt")
torch.save(train_tensor, save_path + "train_tensor.pt")
torch.save(test_tensor, save_path + "test_tensor.pt")

print("Tensors saved successfully!")

#### Reloading previously saved tensors if needed

In [None]:
# Load the tensors
train_tensor = torch.load("./train_tensor.pt")
val_tensor = torch.load("./val_tensor.pt")
test_tensor = torch.load("./test_tensor.pt")

print("Tensors loaded!")
print(f"Train Tensor: {train_tensor.shape}")
print(f"Val Tensor: {val_tensor.shape}")
print(f"Test Tensor: {test_tensor.shape}")

### Loading the text descriptions

In [None]:
def load_text_descriptions(text_dir, indices, max_sentences=3):
    text_descriptions = {}  # Dictionary {id: [description1, description2, description3]}

    for idx in indices:
        file_path = os.path.join(text_dir, f"{idx}.txt")
        if os.path.exists(file_path):
            with open(file_path, 'r', encoding="utf-8") as f:
                # Keep only the caption, i.e. the text before the first '#'
                lines = [line.split('#')[0].strip() for line in f if line.strip()]

                # Keep at most `max_sentences` descriptions
                text_descriptions[idx] = lines[:max_sentences]

        else:
            print(f"Warning: {idx}.txt not found")

    return text_descriptions

In [None]:
class MotionTextDataset(Dataset):
    def __init__(self, motion_dict, text_descriptions_dict, tokenizer, max_length=50, multiple_sentences=False):
        """
        Dataset pairing each motion sequence with its textual descriptions.
        """
        self.motions = motion_dict
        self.text_descriptions = text_descriptions_dict
        self.tokenizer = tokenizer
        self.indices = list(text_descriptions_dict.keys())  # List of valid IDs
        self.max_length = max_length
        self.multiple_sentences = multiple_sentences  # Whether to use all descriptions of a motion

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        index = self.indices[idx]

        # Fetch the motion sequence
        motion = self.motions[index]  # [Frames, Joints, Coordinates]

        # Fetch the descriptions associated with this ID
        text_list = self.text_descriptions[index]

        if self.multiple_sentences:
            # Concatenate the descriptions into one string, separated by a space
            text = " ".join(text_list)
        else:
            # Use only the first description
            text = text_list[0]

        # Tokenize the text
        tokenized_text = self.tokenizer(
            text,
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=self.max_length
        )

        return {
            "motion": torch.as_tensor(motion, dtype=torch.float32),  # Accepts arrays and existing tensors alike
            "input_ids": tokenized_text["input_ids"].squeeze(0),
            "attention_mask": tokenized_text["attention_mask"].squeeze(0)
        }

### Loading the descriptions, setting up the tokenizer, building the MotionText dataset, and checking the dataset and tokens

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # GPT-2 has no padding token by default, so add an explicit one

In [None]:
# Load the text descriptions for the train and validation splits
train_descriptions = load_text_descriptions(text_data_dir, train_indices, max_sentences=3)
val_descriptions = load_text_descriptions(text_data_dir, val_indices, max_sentences=3)

In [None]:
# Inspect a random sample
max_sentences = 3  # Same value as passed to load_text_descriptions above
sample_ids = random.sample(list(train_descriptions.keys()), 5)  # Pick 5 IDs at random

print("\n **Checking the loaded descriptions**")
for idx in sample_ids:
    print(f"\n ID {idx}:")
    for i, sentence in enumerate(train_descriptions[idx]):
        print(f"   ➝ Sentence {i+1}: {sentence}")

# Average number of descriptions per sequence
num_sentences = [len(desc) for desc in train_descriptions.values()]
print(f"\n Average number of sentences per ID: {sum(num_sentences) / len(num_sentences):.2f}")
print(f" Distribution: {set(num_sentences)} sentences per ID (expected: {max_sentences})")

In [None]:
# Pair each motion with its text ID (keeping all of its descriptions)
# Note: tensors and IDs are paired by position, so this assumes both are in the same order
indices_list = list(train_descriptions.keys())[:len(train_tensor)]  # Keep as many IDs as there are motion tensors
indices_list_val = list(val_descriptions.keys())[:len(val_tensor)]

# Build a dictionary {text_id: motion_tensor[i]} for TRAIN
train_tensor_filtered = {idx: train_tensor[i] for i, idx in enumerate(indices_list)}
train_descriptions_filtered = {idx: train_descriptions[idx] for idx in indices_list}  # Keep all descriptions

# Build a dictionary {text_id: motion_tensor[i]} for VALIDATION
val_tensor_filtered = {idx: val_tensor[i] for i, idx in enumerate(indices_list_val)}
val_descriptions_filtered = {idx: val_descriptions[idx] for idx in indices_list_val}  # Keep all descriptions

print(f"Index mapping built for TRAIN: {len(train_tensor_filtered)} sequences")
print(f"Index mapping built for VALIDATION: {len(val_tensor_filtered)} sequences")

In [None]:
# Build the datasets from the filtered data, using all descriptions of each motion
train_dataset = MotionTextDataset(train_tensor_filtered, train_descriptions_filtered, tokenizer, multiple_sentences=True)
val_dataset = MotionTextDataset(val_tensor_filtered, val_descriptions_filtered, tokenizer, multiple_sentences=True)

print(f"Training dataset: {len(train_dataset)} examples")
print(f"Validation dataset: {len(val_dataset)} examples")

batch_size = 64  # Adjust to the available GPU memory

# Build the DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, num_workers=4)

print("DataLoaders created successfully!")

In [None]:
def verify_text_descriptions(text_descriptions):
    """
    Check that every ID has exactly 3 descriptions.
    """
    count_per_id = {idx: len(texts) for idx, texts in text_descriptions.items()}

    # Count how many IDs have exactly 3 descriptions
    total_ids = len(count_per_id)
    ids_with_3_sentences = sum(1 for count in count_per_id.values() if count == 3)

    print("Checking the text descriptions:")
    print(f"  - Total number of IDs: {total_ids}")
    print(f"  - IDs with exactly 3 sentences: {ids_with_3_sentences} ({(ids_with_3_sentences / total_ids) * 100:.2f}%)")

    # Show a few examples
    print("\n Example descriptions:")
    for i, (idx, texts) in enumerate(text_descriptions.items()):
        print(f"  - ID {idx}: {texts}")
        if i == 4:  # Show only the first 5 examples
            break

In [None]:
# Run the check on the train and validation descriptions
verify_text_descriptions(train_descriptions)
verify_text_descriptions(val_descriptions)

In [None]:
# Inspect one batch
for batch in train_loader:
    print("Shape of the motion sequences:", batch["motion"].shape)  # [Batch, Frames, Joints, Coordinates]
    print("Shape of the text tokens:", batch["input_ids"].shape)  # [Batch, Max_Length]
    print("Shape of the attention masks:", batch["attention_mask"].shape)  # [Batch, Max_Length]
    break

In [None]:
def verify_token_lengths(text_descriptions, tokenizer, max_length=50):
    """
    Count the tokens produced for each sentence and report how many exceed max_length.
    """
    token_counts = []  # Token count of each sentence

    print("Checking the number of tokens per sentence...")

    for idx, sentences in text_descriptions.items():
        for sentence in sentences:  # Check every sentence associated with an ID
            tokens = tokenizer(sentence, truncation=False, padding=False)["input_ids"]
            token_counts.append(len(tokens))

    # Global statistics
    print(f"Total number of tokenized sentences: {len(token_counts)}")
    print(f"Average length of the tokenized sentences: {sum(token_counts) / len(token_counts):.2f}")
    print(f"Number of sentences longer than {max_length} tokens: {sum(1 for t in token_counts if t > max_length)}")

    # Show a few examples
    print("\n Example tokenized sentences:")
    for i, sentence in enumerate(list(text_descriptions.values())[0][:3]):  # The 3 sentences of the first ID
        tokens = tokenizer(sentence, truncation=False, padding=False)["input_ids"]
        print(f"  - Sentence {i+1} ({len(tokens)} tokens): {sentence}")
        print(f"  - Tokens: {tokens}\n")

In [None]:
# Run the check on the TRAIN and VALIDATION descriptions
verify_token_lengths(train_descriptions, tokenizer, max_length=50)
verify_token_lengths(val_descriptions, tokenizer, max_length=50)

In [None]:
def verify_attention_masks(train_loader, tokenizer):
    """
    Print the padding tokens and the attention mask of one batch, to check that padding is masked out.
    """
    for batch in train_loader:
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]

        # Count the padding tokens present in input_ids
        pad_token_id = tokenizer.pad_token_id
        num_pad_tokens = (input_ids == pad_token_id).sum().item()

        print("Checking the attention mask:")
        print(f"  - Total number of tokens in the batch: {input_ids.numel()}")
        print(f"  - Number of padding tokens found: {num_pad_tokens}")
        print(f"  - Example input IDs: {input_ids[0].tolist()}")
        print(f"  - Example attention mask: {attention_mask[0].tolist()}")
        break  # Only the first batch is displayed

In [None]:
# Run the check on the TRAIN LOADER
verify_attention_masks(train_loader, tokenizer)

## **Model architecture**

In [None]:
class MotionEncoderTransformer(nn.Module):
    def __init__(self, joint_dim, coord_dim, hidden_dim=256, embedding_dim=512, hidden_size=768, num_layers=4, num_heads=8, ff_dim=None, dropout=0.1):
        super(MotionEncoderTransformer, self).__init__()

        self.noise_factor = 0.05
        ff_dim = ff_dim or (hidden_dim * 4)  # Default: 4x hidden_dim if not specified

        # 1D convolutions to capture local temporal patterns
        self.conv1 = nn.Conv1d(in_channels=joint_dim * coord_dim, out_channels=hidden_dim, kernel_size=4, stride=2, padding=1)
        self.conv2 = nn.Conv1d(in_channels=hidden_dim, out_channels=hidden_dim, kernel_size=4, stride=2, padding=1)

        # Normalization and dropout
        self.layer_norm1 = nn.LayerNorm(hidden_dim)
        self.layer_norm2 = nn.LayerNorm(hidden_dim)
        self.dropout_conv = nn.Dropout(dropout)

        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim,
            nhead=num_heads,
            dim_feedforward=ff_dim,
            dropout=dropout,
            batch_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, embedding_dim)

        # Projection to the GPT-2 hidden size
        self.projection = nn.Linear(embedding_dim, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size)

        # Final dropout
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        B, T, J, C = x.shape
        x = x.view(B, T, J * C).permute(0, 2, 1)  # [Batch, Features, Time]

        # Convolution + activation + normalization
        x = torch.relu(self.conv1(x))
        x = self.layer_norm1(x.permute(0, 2, 1)).permute(0, 2, 1)
        x = self.dropout_conv(x)

        x = torch.relu(self.conv2(x))
        x = self.layer_norm2(x.permute(0, 2, 1)).permute(0, 2, 1)
        x = self.dropout_conv(x)

        # Transformer encoder
        x = x.permute(0, 2, 1)  # [Batch, Time, Features]
        x = self.transformer_encoder(x)  # [Batch, Time, Hidden_dim]
        x = x[:, -1, :]  # Take the last time step of the sequence (as one would with an LSTM)

        # Fully connected layer + activation
        x = self.fc(x)
        x = F.leaky_relu(x, negative_slope=0.02)
        x = self.dropout(x)

        # Projection to the GPT-2 hidden size
        x = self.projection(x)
        x = self.layer_norm(x)

        # Noise injection as regularization, applied during training only
        if self.training:
            noise = torch.randn_like(x) * self.noise_factor
            x = x + noise

        # Final L2 normalization, rescaled to norm 10
        x = F.normalize(x, p=2, dim=-1) * 10

        return x
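
To see how much the two strided convolutions shorten the sequence before the Transformer encoder, the output length can be computed with the formula given in the `torch.nn.Conv1d` documentation. The helper name `conv1d_out_length` is hypothetical. The kernel size, stride and padding are the ones used in the encoder above, and 150 is the padding length used earlier in the notebook.

```python
def conv1d_out_length(length, kernel_size, stride, padding, dilation=1):
    """Output length of a 1-D convolution (formula from the torch.nn.Conv1d documentation)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


frames = 150  # max_length used when padding the motion sequences
after_conv1 = conv1d_out_length(frames, kernel_size=4, stride=2, padding=1)
after_conv2 = conv1d_out_length(after_conv1, kernel_size=4, stride=2, padding=1)
print(after_conv1, after_conv2)  # 75 37
```

The Transformer encoder therefore attends over 37 time steps rather than 150, which reduces the cost of self-attention.
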

In [None]:
# Load the GPT-2 configuration with cross-attention enabled
# (the cross-attention weights are not in the pretrained checkpoint and are newly initialized)
config = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
text_decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=config)  # Loaded on the CPU first, then moved to the GPU

text_decoder = text_decoder.to("cuda")
text_decoder.resize_token_embeddings(len(tokenizer))  # Resize the embeddings to account for the added [PAD] token

In [None]:
class TextDecoderTransformer(nn.Module):
    def __init__(self, latent_dim=768, gpt2_model="gpt2"):
        super(TextDecoderTransformer, self).__init__()

        # Cross-attention must be enabled for GPT-2 to accept `encoder_hidden_states`
        config = GPT2Config.from_pretrained(gpt2_model, add_cross_attention=True)
        self.gpt2 = GPT2LMHeadModel.from_pretrained(gpt2_model, config=config)

        # Projection of the latent vector to the GPT-2 embedding size
        self.latent_to_embedding = nn.Linear(latent_dim, self.gpt2.config.n_embd)
        self.layer_norm = nn.LayerNorm(self.gpt2.config.n_embd)

        # Attention layer applied to the latent vector before GPT-2
        self.cross_attention = nn.MultiheadAttention(embed_dim=self.gpt2.config.n_embd, num_heads=8, batch_first=True)

    def forward(self, latent_vector, input_ids, attention_mask):
        # Project and normalize the latent vector
        transformed_vector = self.latent_to_embedding(latent_vector)
        transformed_vector = self.layer_norm(transformed_vector).unsqueeze(1)  # [Batch, 1, Embedding_dim]

        # Attention over the latent vector (a single token attending to itself) before GPT-2
        hidden_state, _ = self.cross_attention(transformed_vector, transformed_vector, transformed_vector)

        # GPT-2 attends to the motion token through its own cross-attention layers
        outputs = self.gpt2(input_ids=input_ids, attention_mask=attention_mask, encoder_hidden_states=hidden_state)
        return outputs.logits  # [Batch, Seq_Length, Vocab_Size]

In [None]:
if torch.cuda.is_available():
    print("GPU available:", torch.cuda.get_device_name(0))
else:
    print("No GPU available. Computation will run on the CPU.")

In [None]:
# Check whether a GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device in use:", device)

### **Training**

In [None]:
def mean_pooling(hidden_states, mask):
    """Mean of the hidden states, taking the mask into account."""
    mask_expanded = mask.unsqueeze(-1).expand(hidden_states.size()).float()
    pooled = torch.sum(hidden_states * mask_expanded, dim=1) / (mask_expanded.sum(dim=1) + 1e-9)
    return pooled.squeeze(1) if pooled.dim() == 3 else pooled

def train_autoencoder(motion_encoder, text_decoder, train_loader, tokenizer, device="cuda", num_epochs=10, lr=1e-4, gradient_accumulation_steps=1):

    # Move the models to the device and switch them to training mode
    motion_encoder.to(device).train()
    text_decoder.to(device).train()

    # Optimizer (AdamW with weight decay for regularization)
    optimizer = torch.optim.AdamW(
        list(motion_encoder.parameters()) + list(text_decoder.parameters()),
        lr=lr, weight_decay=0.02  # Slightly above the AdamW default of 0.01
    )

    # Dynamic scheduler (ReduceLROnPlateau): halves the LR when the loss plateaus
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2
    )

    # Loss function (padding positions are ignored)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)

    # Embedding normalization for stability (not in the optimizer, so it keeps its default weights)
    embedding_norm = torch.nn.LayerNorm(768).to(device)

    # Freeze GPT-2 during the first 4 epochs
    for param in text_decoder.parameters():
        param.requires_grad = False

    beta = 0.98  # Moving-average coefficient used to smooth the displayed loss
    last_loss = float("inf")  # Loss of the previous batch, used to trigger scheduled sampling

    for epoch in range(num_epochs):
        total_loss = 0
        running_loss = 0

        # `tqdm` progress bar
        progress_bar = tqdm(enumerate(train_loader), total=len(train_loader), desc=f" Epoch {epoch+1}/{num_epochs}")

        for batch_idx, batch in progress_bar:
            try:
                # Move the batch to the device
                motions = batch["motion"].to(device, non_blocking=True)
                input_ids = batch["input_ids"].to(device, non_blocking=True)
                attention_mask = batch["attention_mask"].to(device, non_blocking=True)

                # Teacher forcing: the decoder reads tokens [0, L-1) and predicts tokens [1, L)
                decoder_inputs = input_ids[:, :-1].clone()
                targets = input_ids[:, 1:]

                # Encode the motion with the Transformer encoder
                motion_embeddings = motion_encoder(motions)  # [Batch, Hidden]
                if motion_embeddings.dim() == 3:  # Not reached with the current encoder, which already returns [Batch, Hidden]
                    motion_embeddings = mean_pooling(motion_embeddings, attention_mask)  # [Batch, Hidden]

                motion_embeddings = embedding_norm(motion_embeddings)  # [Batch, Hidden]
                motion_embeddings = motion_embeddings.unsqueeze(1)  # [Batch, 1, Hidden] for GPT-2

                # Scheduled sampling, triggered once the loss is low enough
                if epoch > 2 and last_loss < 3.8:
                    with torch.no_grad():
                        sampled_ids = torch.argmax(text_decoder(
                            input_ids=decoder_inputs,
                            attention_mask=attention_mask[:, :-1],
                            encoder_hidden_states=motion_embeddings
                        ).logits, dim=-1)

                    # The prediction at position t is the token for position t + 1, so about 10%
                    # of the inputs from position 1 onward are replaced by the previous prediction.
                    # Only the decoder inputs are modified; the targets stay untouched.
                    mask = torch.rand_like(decoder_inputs[:, 1:].float()) < 0.10
                    decoder_inputs[:, 1:][mask] = sampled_ids[:, :-1][mask]

                # Forward pass through GPT-2 (decoder). The loss is computed below, so no
                # `labels` are passed (the model would shift them a second time).
                outputs = text_decoder(
                    input_ids=decoder_inputs,
                    attention_mask=attention_mask[:, :-1],
                    encoder_hidden_states=motion_embeddings
                )

                logits = outputs.logits  # [Batch, Seq_Length, Vocab_Size]

                # Loss against the next tokens
                loss = criterion(logits.permute(0, 2, 1), targets)
                last_loss = loss.item()

                # Backpropagation with gradient accumulation (loss scaled by the number of accumulated steps)
                (loss / gradient_accumulation_steps).backward()
                if (batch_idx + 1) % gradient_accumulation_steps == 0:
                    torch.nn.utils.clip_grad_norm_(motion_encoder.parameters(), max_norm=5.0)
                    torch.nn.utils.clip_grad_norm_(text_decoder.parameters(), max_norm=5.0)
                    optimizer.step()
                    optimizer.zero_grad()

                total_loss += loss.item()
                running_loss = beta * running_loss + (1 - beta) * loss.item()
                smoothed_loss = running_loss / (1 - beta ** (batch_idx + 1))  # Bias-corrected moving average

                # Update the progress bar
                progress_bar.set_postfix(loss=f"{smoothed_loss:.4f}")

            except RuntimeError as e:
                print(f"Error at batch {batch_idx + 1}: {e}")
                continue

        # Scheduler update: the LR is reduced only if the average loss stops decreasing
        scheduler.step(total_loss / max(1, len(train_loader)))

        # Unfreeze GPT-2 once 4 epochs have been completed
        if epoch + 1 == 4:
            for param in text_decoder.parameters():
                param.requires_grad = True
            for param_group in optimizer.param_groups:
                param_group["lr"] *= 0.5  # Lower the LR before fine-tuning GPT-2

        # Automatic checkpoint
        torch.save(motion_encoder.state_dict(), f"motion_encoder_epoch_{epoch+1}.pth")
        torch.save(text_decoder.state_dict(), f"text_decoder_epoch_{epoch+1}.pth")
        print(f"Model saved at epoch {epoch+1}!")

        # Average loss for the epoch
        avg_loss = total_loss / max(1, len(train_loader))
        print(f"Epoch [{epoch+1}/{num_epochs}] - Average loss: {avg_loss:.4f}")

    print("Training finished successfully!")
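
The loss shown in the progress bar is an exponential moving average divided by `1 - beta ** step`. Without that division, the average starts at 0 and stays biased toward 0 for the first few dozen steps. The standalone sketch below reproduces the same two update lines outside the training loop. The function name `smoothed_losses` and the constant test sequence are illustrative.

```python
def smoothed_losses(losses, beta=0.98):
    """Bias-corrected exponential moving average of a sequence of loss values."""
    running = 0.0
    smoothed = []
    for step, loss in enumerate(losses, start=1):
        running = beta * running + (1 - beta) * loss
        smoothed.append(running / (1 - beta ** step))  # undo the bias toward the initial 0
    return smoothed


constant = smoothed_losses([4.0] * 5)
print([round(value, 6) for value in constant])  # [4.0, 4.0, 4.0, 4.0, 4.0]
```

With a constant loss of 4.0, the uncorrected average after the first step would be only 0.02 × 4.0 = 0.08; the correction recovers 4.0 from the first step on.
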


#### Encoder initialization and training launch

In [None]:
motion_encoder = MotionEncoderTransformer(
    joint_dim=22,    # Number of joints
    coord_dim=3,     # x, y, z coordinates
    hidden_dim=512,  # Increased for more representational capacity
    num_layers=6,    # 6 Transformer encoder layers
    num_heads=8,     # Multi-head attention (8 heads)
    ff_dim=1024,     # Inner feed-forward dimension
    dropout=0.1      # Regularization
).to("cuda")  # Move to the GPU

In [None]:
# Simulate a batch of motions (batch_size=4, frames=150, joints=22, coords=3)
dummy_motion = torch.randn(4, 150, 22, 3).to("cuda")

# Run it through the MotionEncoder
with torch.no_grad():
    output_embedding = motion_encoder(dummy_motion)

print(f"MotionEncoder Output: {output_embedding.shape}")

#### Resetting the encoder weights for a full training run

In [None]:
def reset_model_weights(model):
    for module in model.modules():
        if hasattr(module, 'reset_parameters'):
            module.reset_parameters()

# Reset only the MotionEncoder
reset_model_weights(motion_encoder)

print("MotionEncoder reset, GPT-2 kept!")

In [None]:
for batch in train_loader:
    print(batch.keys())  # Should contain "motion", "input_ids", "attention_mask"
    break

assert next(motion_encoder.parameters()).is_cuda, "MotionEncoder is not on the GPU!"
assert next(text_decoder.parameters()).is_cuda, "TextDecoder is not on the GPU!"
assert tokenizer.pad_token_id is not None, "pad_token_id is not defined in the tokenizer!"

In [None]:
torch.backends.cudnn.benchmark = True
train_autoencoder(motion_encoder, text_decoder, train_loader, tokenizer, device="cuda", num_epochs=10)  # Close to full convergence within 10 epochs

### **Saving the final model**

In [None]:
# Save the motion_encoder
motion_encoder_path = "/kaggle/working/motion_encoder"
torch.save(motion_encoder.state_dict(), motion_encoder_path)
print(f"Motion Encoder saved to: {motion_encoder_path}")

# Save the text_decoder
text_decoder_path = "/kaggle/working/text_decoder"
torch.save(text_decoder.state_dict(), text_decoder_path)
print(f"Text Decoder saved to: {text_decoder_path}")

# Save the tokenizer (Hugging Face)
tokenizer_path = "/kaggle/working/tokenizer"
tokenizer.save_pretrained(tokenizer_path)
print(f"GPT-2 tokenizer saved to: {tokenizer_path}")


### **Generating predictions**

In [None]:
def generate_descriptions(motion_encoder, text_decoder, tokenizer, tensors, indices, device="cuda", max_new_tokens=17, mode="validation"):
    # Evaluation mode
    motion_encoder.eval()
    text_decoder.eval()

    generated_descriptions = {}

    # Conjunctions and prepositions that signal an incomplete sentence
    invalid_endings = ["and", "then", "as", "while", "with", "before", "after", "but", "so"]

    with torch.no_grad():  # No gradient computation at inference time
        for i, (motion, idx) in enumerate(zip(tensors, indices)):
            motion = motion.unsqueeze(0).to(device)  # Add a batch dimension

            # 1. Encode the motion
            motion_embedding = motion_encoder(motion)  # [1, 768]
            # Apply the same normalization as in training, where an untrained LayerNorm
            # (equivalent to this parameter-free call) is applied to the embedding
            motion_embedding = F.layer_norm(motion_embedding, motion_embedding.shape[-1:])
            motion_embedding = motion_embedding.unsqueeze(1)  # GPT-2 expects [Batch, 1, Hidden]

            print(f"motion_embedding {i}: mean={motion_embedding.mean().item()}, std={motion_embedding.std().item()}")

            # 2. Define a minimal prompt
            start_prompt = "A person"
            input_ids = tokenizer(start_prompt, return_tensors="pt")["input_ids"].to(device)

            # 3. Generate the text (beam search combined with sampling)
            output = text_decoder.generate(
                input_ids,
                max_new_tokens=max_new_tokens,  # Enough tokens to avoid cutting the sentence too early
                num_beams=5,  # Beam search for more coherent output
                encoder_hidden_states=motion_embedding,  # Worth checking that the installed transformers version forwards this to the model
                attention_mask=input_ids.new_ones(input_ids.shape),
                pad_token_id=tokenizer.eos_token_id,
                do_sample=True,
                top_p=0.85,  # Nucleus sampling limits excessive diversity
                temperature=1.8,  # Note: values above 1 increase randomness; use a value below 1 for less
                repetition_penalty=2.0,  # Penalizes repeated tokens
                no_repeat_ngram_size=4  # Prevents any 4-gram from being repeated
            )

            # 4. Decode the tokens
            generated_text = tokenizer.decode(output[0], skip_special_tokens=True).strip()

            # Make sure the sentence starts with the prompt
            if not generated_text.lower().startswith("a person"):
                generated_text = "A person " + generated_text  # Forced prefix

            # Keep only the first sentence
            sentences = generated_text.split(".")
            generated_text = sentences[0].strip() + "."

            # Make sure the sentence ends properly
            last_word = generated_text.split()[-1].rstrip(".").lower()  # Strip the final period before comparing
            if last_word in invalid_endings:  # If the sentence ends on a dangling word, append a closing sentence
                generated_text += " They complete the movement."

            # Store the text under the motion ID
            generated_descriptions[idx] = generated_text
            print(f" {mode.capitalize()} {i+1}/{len(tensors)} - Generated: {generated_text}")

    return generated_descriptions
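
The post-processing step can also be written as a small pure function, which makes it easy to test without running the model. The sketch below is a variation rather than the notebook's exact behavior: instead of appending a closing sentence, it trims dangling connectives from the end of the sentence. The function name `clean_description` and the example inputs are hypothetical.

```python
INVALID_ENDINGS = {"and", "then", "as", "while", "with", "before", "after", "but", "so"}


def clean_description(text, prefix="A person"):
    """Keep the first sentence, enforce the prefix, and trim dangling connectives."""
    text = text.strip()
    if not text.lower().startswith(prefix.lower()):
        text = f"{prefix} {text}"
    words = text.split(".")[0].split()
    # Drop trailing words that cannot end a sentence
    while len(words) > 2 and words[-1].lower().strip(",;") in INVALID_ENDINGS:
        words.pop()
    return " ".join(words).rstrip(",;") + "."


print(clean_description("A person walks forward and then"))  # A person walks forward.
print(clean_description("jumps twice. A person sits"))        # A person jumps twice.
```

Trimming avoids adding words that are unlikely to appear in the reference captions, which matters for an overlap-based metric such as BLEU.
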


### Generating predictions for the test set

In [None]:
# Generate the descriptions for the test set
# Note: `test_tensor` and `test_indices` are paired by position, so they must have the same length and order

test_descriptions = generate_descriptions(
    motion_encoder, text_decoder, tokenizer, test_tensor, test_indices, device="cuda", mode="test"
)

In [None]:
# Save the raw predictions to a CSV file
backup_df = pd.DataFrame({
    'id': list(test_descriptions.keys()),
    'text': list(test_descriptions.values())
})

# Write the CSV file
backup_csv_path = "/kaggle/working/test_descriptions_backup9.csv"
backup_df.to_csv(backup_csv_path, index=False)