# AthenAI Workout Generation Model Training

Fine-tune google/flan-t5-base to generate structured workouts for AthenAI, using user context, workout goals, and template types. Data is mapped to AthenAI backend DTOs. Datasets: [onurSakar/GYM-Exercise](https://huggingface.co/datasets/onurSakar/GYM-Exercise), [Kaggle Gym Exercise Data](https://www.kaggle.com/datasets/niharika41298/gym-exercise-data), [Free Exercise DB](https://github.com/yuhonas/free-exercise-db).

In [None]:
# Install required libraries
!pip install transformers datasets huggingface_hub kaggle pandas --quiet

## Load Secrets
- Upload your Kaggle API key (`kaggle.json`, for dataset download) as a file. The Hugging Face token (for model push) is read from a Colab secret named `HF_TOKEN`.
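
Before moving the key file into place, it can help to confirm that the upload is a usable Kaggle key. The sketch below is one possible check; the helper name `check_kaggle_credentials` is hypothetical, and it relies only on the fact that Kaggle's key file is a JSON object with `username` and `key` entries.

```python
import json
from pathlib import Path

def check_kaggle_credentials(path):
    """Return True if the file at `path` looks like a Kaggle API key file."""
    p = Path(path)
    if not p.is_file():
        return False
    try:
        data = json.loads(p.read_text())
    except json.JSONDecodeError:
        return False
    # A Kaggle key file is a JSON object with non-empty "username" and "key" entries.
    return isinstance(data, dict) and all(data.get(k) for k in ("username", "key"))

# Hypothetical usage: run right after the upload cell.
print("kaggle.json looks valid:", check_kaggle_credentials("kaggle.json"))
```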

In [None]:
from google.colab import files
uploaded = files.upload()  # Upload 'kaggle.json' (the Hugging Face token comes from the HF_TOKEN Colab secret)

In [None]:
# Setup Hugging Face and Kaggle credentials
import os
from google.colab import userdata
from huggingface_hub import login
login(userdata.get('HF_TOKEN'))
!mkdir -p ~/.kaggle; mv kaggle.json ~/.kaggle/; chmod 600 ~/.kaggle/kaggle.json

## Download Datasets
- Hugging Face: onurSakar/GYM-Exercise
- Kaggle: niharika41298/gym-exercise-data
- GitHub: yuhonas/free-exercise-db

## 1. Load & Preprocess Datasets
- Download and load all external datasets (Hugging Face, Kaggle, GitHub).
- Map all data to AthenAI DTO format.

In [None]:
from datasets import load_dataset
dataset_hf = load_dataset('onurSakar/GYM-Exercise')
!kaggle datasets download -d niharika41298/gym-exercise-data --unzip
!git clone https://github.com/yuhonas/free-exercise-db.git

## Explore & Normalize Data
- Map external formats to AthenAI DTOs (see backend structure).
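
The sources represent the same kind of field in different shapes: a CSV cell may be a single string or a missing value, while a JSON record may hold a list. A minimal sketch of a normalizer that turns any of these into a clean list of strings is shown here; `as_list` is a hypothetical helper, not part of the AthenAI backend.

```python
import math

def as_list(value):
    """Normalize a raw field (None, NaN, string, or list) into a list of non-empty strings."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []  # pandas represents missing CSV cells as NaN
    if isinstance(value, (list, tuple)):
        # Flatten recursively so None/NaN entries inside a list are dropped too.
        return [item for v in value for item in as_list(v)]
    text = str(value).strip()
    return [text] if text else []

print(as_list("Barbell"))                       # ['Barbell']
print(as_list(float("nan")))                    # []
print(as_list(["chest", None, " triceps "]))    # ['chest', 'triceps']
```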

In [None]:
import pandas as pd
# Load Kaggle CSV and inspect columns
df_kaggle = pd.read_csv('megaGymDataset.csv')
print('Columns:', df_kaggle.columns.tolist())
df_kaggle.sample(5)

In [None]:
# Show sample from GitHub exercise dataset (free-exercise-db)
import json
with open('free-exercise-db/dist/exercises.json') as f:
    exercises_github = json.load(f)
print('Number of exercises:', len(exercises_github))
print('Sample exercise:')
print(json.dumps(exercises_github[0], indent=2))

## Define App-Specific DTOs
- Use AthenAI backend DTOs for exercises, equipment, muscular groups, etc.

In [None]:
# Example DTO structure (Python dict, based on Go DTOs)
athenai_exercise_dto = {
    'id': None,
    'name': '',
    'description': '',
    'equipment': [],
    'muscular_groups': [],
    'deleted_at': None
}

## Data Preprocessing
- Convert all datasets to AthenAI DTO format.
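
After conversion, a quick conformance check can catch records that drifted from the DTO shape. The sketch below assumes the six fields of the example DTO in this notebook; the accepted types in `EXPECTED_FIELDS` are an assumption and should be aligned with the actual Go DTOs.

```python
# Field names follow the example DTO; the allowed types are assumptions.
EXPECTED_FIELDS = {
    'id': (int, str, type(None)),
    'name': (str,),
    'description': (str,),
    'equipment': (list,),
    'muscular_groups': (list,),
    'deleted_at': (str, type(None)),
}

def dto_problems(dto):
    """Return a list of human-readable problems; an empty list means the DTO conforms."""
    problems = []
    for field, allowed in EXPECTED_FIELDS.items():
        if field not in dto:
            problems.append(f"missing field: {field}")
        elif not isinstance(dto[field], allowed):
            problems.append(f"{field}: unexpected type {type(dto[field]).__name__}")
    for extra in sorted(set(dto) - set(EXPECTED_FIELDS)):
        problems.append(f"unexpected field: {extra}")
    return problems

sample = {'id': 1, 'name': 'Push-up', 'description': '', 'equipment': [],
          'muscular_groups': ['chest'], 'deleted_at': None}
print(dto_problems(sample))                 # []
print(dto_problems({'name': 'Push-up'}))    # lists the five missing fields
```

Running `dto_problems` over every mapped record before building training examples gives an early signal if one source maps differently from the others.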

In [None]:
# Map Kaggle and GitHub exercise data to AthenAI DTO format and combine
def map_kaggle_to_dto(row):
    return {
        # Cast to a plain Python int so the DTO stays JSON-serializable
        'id': int(row['Unnamed: 0']) if not pd.isna(row['Unnamed: 0']) else None,
        'name': row['Title'] if not pd.isna(row['Title']) else '',
        'description': row['Desc'] if not pd.isna(row['Desc']) else '',
        'equipment': [row['Equipment']] if not pd.isna(row['Equipment']) else [],
        'muscular_groups': [row['BodyPart']] if not pd.isna(row['BodyPart']) else [],
        'deleted_at': None
    }
dto_kaggle = df_kaggle.apply(map_kaggle_to_dto, axis=1).tolist()

def map_github_to_dto(exercise):
    return {
        'id': exercise.get('id', None),
        'name': exercise.get('name', ''),
        'description': ' '.join(exercise.get('instructions', [])),
        'equipment': [exercise['equipment']] if exercise.get('equipment') else [],
        'muscular_groups': exercise.get('primaryMuscles', []) + exercise.get('secondaryMuscles', []),
        'deleted_at': None
    }
dto_github = [map_github_to_dto(ex) for ex in exercises_github]

# Hugging Face fitness dataset is loaded for context only, not mapped to DTOs

# Combine Kaggle and GitHub sources
all_exercises_dto = dto_kaggle + dto_github
print(f'Total mapped exercises: {len(all_exercises_dto)}')
print('Sample mapped exercise:', all_exercises_dto[0])

## Prepare Training Data
- Input: user context, goal, template type
- Output: workout DTO (list of exercises, sets, reps, etc.)
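
The workout templates in this notebook follow a few implicit invariants: `block_order` counts up from 1, `exercise_count` equals the number of exercises in the block, and the template duration is the sum of its block durations. A minimal sketch of a checker for those invariants follows; `template_problems` is a hypothetical helper.

```python
def template_problems(template):
    """Return a list of consistency problems found in a workout template dict."""
    problems = []
    blocks = template.get('blocks', [])
    for position, block in enumerate(blocks, start=1):
        label = block.get('block_name', f'block {position}')
        if block.get('block_order') != position:
            problems.append(f"{label}: block_order is {block.get('block_order')}, expected {position}")
        actual = len(block.get('exercises', []))
        if block.get('exercise_count') != actual:
            problems.append(f"{label}: exercise_count is {block.get('exercise_count')}, but {actual} exercises listed")
    total = sum(b.get('estimated_duration_minutes', 0) for b in blocks)
    if total != template.get('estimated_duration_minutes'):
        problems.append(f"duration is {template.get('estimated_duration_minutes')}, blocks sum to {total}")
    return problems

demo = {
    'estimated_duration_minutes': 15,
    'blocks': [
        {'block_name': 'Warmup', 'block_order': 1, 'exercise_count': 1,
         'estimated_duration_minutes': 5, 'exercises': [{'name': 'Arm Circles'}]},
        {'block_name': 'Main Block', 'block_order': 2, 'exercise_count': 2,
         'estimated_duration_minutes': 10, 'exercises': [{'name': 'Squat'}]},
    ],
}
print(template_problems(demo))  # reports the exercise_count mismatch in 'Main Block'
```

This matters for the curated templates, which slice `all_exercises_dto` by position: if the combined list is shorter than expected, a block silently receives fewer exercises than its `exercise_count` claims.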

## 2. Training Setup
- Format training examples for model input/output.
- Set up model, tokenizer, and training arguments.
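
Full-JSON targets that embed complete exercise descriptions can easily exceed the 512-token label limit used during tokenization, in which case the label is truncated mid-structure. One option is to start with a simpler target that lists only block types and exercise names. The format below is a hypothetical choice for illustration, not the AthenAI wire format.

```python
def compact_target(workout):
    """Serialize a workout template as 'block_type: name, name | block_type: ...'."""
    parts = []
    for block in sorted(workout['blocks'], key=lambda b: b['block_order']):
        names = ', '.join(ex['name'] for ex in block['exercises'])
        parts.append(f"{block['block_type']}: {names}")
    return ' | '.join(parts)

demo_workout = {
    'blocks': [
        {'block_type': 'main', 'block_order': 2,
         'exercises': [{'name': 'Goblet Squat'}]},
        {'block_type': 'warmup', 'block_order': 1,
         'exercises': [{'name': 'Arm Circles'}, {'name': 'Leg Swings'}]},
        {'block_type': 'cooldown', 'block_order': 3,
         'exercises': [{'name': 'Child Pose'}]},
    ]
}
print(compact_target(demo_workout))
# warmup: Arm Circles, Leg Swings | main: Goblet Squat | cooldown: Child Pose
```

A target like this could be expanded back into the full DTO after generation by looking the names up in the exercise list.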

In [None]:
# Build multiple diverse training examples for the model
user_contexts = [
    {
        'description': 'Recovering from knee surgery, wants to regain strength and mobility.',
        'training_phase': 'rehabilitation',
        'motivation': 'return to sports',
        'special_situation': 'leg limitation'
    },
    {
        'description': 'Busy professional aiming for general fitness and stress relief.',
        'training_phase': 'maintenance',
        'motivation': 'reduce stress',
        'special_situation': ''
    },
    {
        'description': 'Young athlete preparing for a competition, needs endurance.',
        'training_phase': 'pre-competition',
        'motivation': 'win race',
        'special_situation': ''
    },
    {
        'description': 'Middle-aged person with back operation, wants to lose weight safely.',
        'training_phase': 'weight loss',
        'motivation': 'improve health',
        'special_situation': 'back operation'
    }
]

workout_templates = [
    {
        'name': 'Lower Body Rehab',
        'description': 'Safe lower body exercises for rehabilitation.',
        'difficulty_level': 'beginner',
        'estimated_duration_minutes': 45,
        'target_audience': 'rehabilitation',
        'blocks': [
            {
                'block_name': 'Warmup',
                'block_type': 'warmup',
                'block_order': 1,
                'exercise_count': 2,
                'estimated_duration_minutes': 10,
                'instructions': 'Gentle mobility and activation.',
                'exercises': all_exercises_dto[:2]
            },
            {
                'block_name': 'Main Block',
                'block_type': 'main',
                'block_order': 2,
                'exercise_count': 3,
                'estimated_duration_minutes': 25,
                'instructions': 'Controlled strength movements.',
                'exercises': all_exercises_dto[2:5]
            },
            {
                'block_name': 'Cool Down',
                'block_type': 'cooldown',
                'block_order': 3,
                'exercise_count': 1,
                'estimated_duration_minutes': 10,
                'instructions': 'Stretch and relax.',
                'exercises': all_exercises_dto[5:6]
            }
        ]
    },
    {
        'name': 'Quick Office Fitness',
        'description': 'Short, equipment-free routine for busy professionals.',
        'difficulty_level': 'beginner',
        'estimated_duration_minutes': 20,
        'target_audience': 'general_fitness',
        'blocks': [
            {
                'block_name': 'Warmup',
                'block_type': 'warmup',
                'block_order': 1,
                'exercise_count': 1,
                'estimated_duration_minutes': 5,
                'instructions': 'Light stretching.',
                'exercises': all_exercises_dto[6:7]
            },
            {
                'block_name': 'Main Block',
                'block_type': 'main',
                'block_order': 2,
                'exercise_count': 2,
                'estimated_duration_minutes': 10,
                'instructions': 'Bodyweight exercises.',
                'exercises': all_exercises_dto[7:9]
            },
            {
                'block_name': 'Cool Down',
                'block_type': 'cooldown',
                'block_order': 3,
                'exercise_count': 1,
                'estimated_duration_minutes': 5,
                'instructions': 'Breathing and relaxation.',
                'exercises': all_exercises_dto[9:10]
            }
        ]
    },
    {
        'name': 'Endurance Builder',
        'description': 'Cardio-focused template for athletes.',
        'difficulty_level': 'intermediate',
        'estimated_duration_minutes': 60,
        'target_audience': 'endurance',
        'blocks': [
            {
                'block_name': 'Warmup',
                'block_type': 'warmup',
                'block_order': 1,
                'exercise_count': 2,
                'estimated_duration_minutes': 10,
                'instructions': 'Dynamic stretching.',
                'exercises': all_exercises_dto[10:12]
            },
            {
                'block_name': 'Cardio Block',
                'block_type': 'cardio',
                'block_order': 2,
                'exercise_count': 4,
                'estimated_duration_minutes': 40,
                'instructions': 'High intensity intervals.',
                'exercises': all_exercises_dto[12:16]
            },
            {
                'block_name': 'Cool Down',
                'block_type': 'cooldown',
                'block_order': 3,
                'exercise_count': 1,
                'estimated_duration_minutes': 10,
                'instructions': 'Static stretching.',
                'exercises': all_exercises_dto[16:17]
            }
        ]
    },
    {
        'name': 'Safe Weight Loss',
        'description': 'Low-impact template for weight loss and back safety.',
        'difficulty_level': 'beginner',
        'estimated_duration_minutes': 30,
        'target_audience': 'weight_loss',
        'blocks': [
            {
                'block_name': 'Warmup',
                'block_type': 'warmup',
                'block_order': 1,
                'exercise_count': 1,
                'estimated_duration_minutes': 5,
                'instructions': 'Gentle stretching.',
                'exercises': all_exercises_dto[17:18]
            },
            {
                'block_name': 'Main Block',
                'block_type': 'main',
                'block_order': 2,
                'exercise_count': 2,
                'estimated_duration_minutes': 20,
                'instructions': 'Low-impact movements.',
                'exercises': all_exercises_dto[18:20]
            },
            {
                'block_name': 'Cool Down',
                'block_type': 'cooldown',
                'block_order': 3,
                'exercise_count': 1,
                'estimated_duration_minutes': 5,
                'instructions': 'Relaxation and breathing.',
                'exercises': all_exercises_dto[20:21]
            }
        ]
    }
]

# Pair each user context with a matching workout template for training
train_examples = [
    {'input': ctx, 'output': tpl} for ctx, tpl in zip(user_contexts, workout_templates)
]
print('Sample training examples for model:')
import pprint
pprint.pprint(train_examples)

In [None]:
# Synthetic Training Example Generator: Consistent Block Structure
import random
def get_exercises_by_type(exercises, block_type):
    # Example mapping: you may need to refine this based on your DTOs
    type_map = {
        'warmup': ['Warmup', 'Mobility', 'Stretch', 'Activation', 'Light'],
        'main': ['Strength', 'Power', 'Compound', 'Bodyweight', 'Resistance'],
        'cardio': ['Cardio', 'Aerobic', 'Interval', 'Endurance'],
        'cooldown': ['Cool Down', 'Stretch', 'Relaxation', 'Breathing']
    }
    keywords = [kw.lower() for kw in type_map.get(block_type, [])]
    # Keep exercises whose name or description mentions any keyword (case-insensitive)
    return [
        ex for ex in exercises
        if any(kw in ex.get('description', '').lower() or kw in ex.get('name', '').lower() for kw in keywords)
    ]

def generate_random_workout_template(exercises):
    blocks = []
    # Randomize block types and counts for more variation
    block_defs = [
        {'block_type': 'warmup', 'count': random.randint(1,3), 'duration': random.randint(5,15), 'instructions': random.choice(['Gentle mobility and activation.','Light stretching.','Dynamic warmup.'])},
        {'block_type': random.choice(['main','cardio']), 'count': random.randint(2,5), 'duration': random.randint(15,40), 'instructions': random.choice(['Strength movements.','High intensity intervals.','Bodyweight exercises.','Cardio focus.'])},
        {'block_type': 'cooldown', 'count': random.randint(1,2), 'duration': random.randint(5,15), 'instructions': random.choice(['Stretch and relax.','Breathing and relaxation.','Static stretching.'])}
    ]
    for i, block_def in enumerate(block_defs):
        block_exs = get_exercises_by_type(exercises, block_def['block_type'])
        selected = random.sample(block_exs, min(block_def['count'], len(block_exs))) if block_exs else []
        blocks.append({
            'block_name': block_def['block_type'].capitalize(),
            'block_type': block_def['block_type'],
            'block_order': i+1,
            'exercise_count': len(selected),
            'estimated_duration_minutes': block_def['duration'],
            'instructions': block_def['instructions'],
            'exercises': selected
        })
    return {
        'name': f"Randomized Workout {random.randint(1000,9999)}",
        'description': random.choice(['Auto-generated workout for training.','Personalized workout plan.','Custom fitness routine.']),
        'difficulty_level': random.choice(['beginner','intermediate','advanced']),
        'estimated_duration_minutes': sum(b['estimated_duration_minutes'] for b in blocks),
        'target_audience': random.choice(['rehabilitation','general_fitness','endurance','weight_loss','athlete','office_worker']),
        'blocks': blocks
    }

# Generate synthetic training examples (step 1)
synthetic_train_examples = []
for _ in range(1000):
    user_ctx = {
        'description': random.choice([
            'Recovering from injury, needs gentle start.',
            'Wants to build muscle.',
            'Looking for weight loss.',
            'Training for a marathon.',
            'Needs stress relief.',
            'Preparing for competition.',
            'Improving general fitness.',
            'Returning to sports after break.',
            'Busy professional with limited time.',
            'Middle-aged person with back operation.'
        ]),
        'training_phase': random.choice(['rehabilitation','maintenance','pre-competition','weight loss','general','strength','cardio']),
        'motivation': random.choice(['return to sports','reduce stress','win race','improve health','improve fitness','lose weight','gain muscle','increase endurance']),
        'special_situation': random.choice(['leg limitation','back operation','none','injury','time constraint',''])
    }
    workout = generate_random_workout_template(all_exercises_dto)
    synthetic_train_examples.append({'input': user_ctx, 'output': workout})
print(f"Generated {len(synthetic_train_examples)} synthetic training examples.")
# Step 2: Shuffle and mix with curated examples before training
# (Do this in the training cell)

In [None]:
# This cell intentionally left blank. Hugging Face mapped exercises are not added to training examples. Only curated and synthetic examples are used for training.

## Model Loading & Training
- Load flan-t5-base, fine-tune with Hugging Face Trainer.
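
To get a feel for the training schedule, the arithmetic behind the settings used in the training cell (batch size 4, gradient accumulation 2, 5 epochs, 10% validation split over 4 curated plus 1000 synthetic examples) can be sketched as follows. The numbers are approximate: the exact split rounding and step count are decided by `datasets` and `Trainer`, and `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, test_size, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate split sizes and optimizer steps for a fine-tuning run."""
    num_eval = math.ceil(num_examples * test_size)
    num_train = num_examples - num_eval
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_train / effective_batch)
    return {
        'num_train': num_train,
        'num_eval': num_eval,
        'effective_batch': effective_batch,
        'steps_per_epoch': steps_per_epoch,
        'total_steps': steps_per_epoch * epochs,
    }

schedule = training_schedule(num_examples=1004, test_size=0.1,
                             per_device_batch=4, grad_accum=2, epochs=5)
print(schedule)
# {'num_train': 903, 'num_eval': 101, 'effective_batch': 8, 'steps_per_epoch': 113, 'total_steps': 565}
```

With `logging_steps=10`, a run of this size would log roughly 56 times.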

In [None]:
# ===============================
# 1. Imports
# ===============================
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Trainer, TrainingArguments, DataCollatorForSeq2Seq
from datasets import Dataset
import torch, json, random

# Use GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ===============================
# 2. Load base model + tokenizer
# ===============================
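# This resumes from the checkpoint already pushed to the Hub.
# For a fresh run, start from the base model instead: "google/flan-t5-base".
model_name = "a-albiol/AthenAI"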
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

# ===============================
# 3. Combine + shuffle dataset
# ===============================
all_train_examples = train_examples + synthetic_train_examples
random.shuffle(all_train_examples)

def format_example(example):
    input_text = f"Description: {example['input']['description']} | Phase: {example['input']['training_phase']} | Motivation: {example['input']['motivation']} | Special: {example['input']['special_situation']}"
    # ⬇ You might want to start with a simpler target than full JSON first
    output_text = json.dumps(example['output'], ensure_ascii=False)
    return {'input_text': input_text, 'output_text': output_text}

dataset = Dataset.from_list([format_example(e) for e in all_train_examples])

# ===============================
# 4. Train/Validation split
# ===============================
dataset = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = dataset['train']
eval_dataset = dataset['test']

# ===============================
# 5. Preprocessing (Tokenization)
# ===============================
def preprocess_function(examples):
    model_inputs = tokenizer(
        examples['input_text'],
        max_length=256,
        truncation=True,
    )
    # Tokenize targets via text_target; as_target_tokenizer() is deprecated in recent transformers releases
    labels = tokenizer(
        text_target=examples['output_text'],
        max_length=512,
        truncation=True,
    )
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_eval  = eval_dataset.map(preprocess_function, batched=True)

# ===============================
# 6. Data Collator
# ===============================
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100  # ignore padding tokens in loss
)

# ===============================
# 7. Training Arguments
# ===============================
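training_args = TrainingArguments(
    output_dir='./athenai-finetune',
    eval_strategy="epoch",  # evaluate after each epoch (named evaluation_strategy before transformers 4.41)
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=5,
    # Mixed precision needs a GPU. If the loss turns NaN (a known T5 issue in fp16),
    # switch to bf16=True on supported hardware or train in full precision.
    fp16=torch.cuda.is_available(),
    logging_dir="./logs",
    logging_steps=10,
    report_to="none"
)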

# ===============================
# 8. Trainer
# ===============================
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# ===============================
# 9. Train
# ===============================
trainer.train()


In [None]:
# Debug: Print a few tokenized training examples to inspect input/output and tokenization
print('--- Raw training example ---')
for i in range(3):
    print('Input:', all_train_examples[i]['input'])
    print('Output:', all_train_examples[i]['output'])
    print()

print('--- Tokenized training example ---')
for i in range(3):
    example = format_example(all_train_examples[i])
    tokenized = preprocess_function(example)
    print('Input text:', example['input_text'])
    print('Output text:', example['output_text'])
    print('Input token ids:', tokenized['input_ids'])
    print('Label token ids:', tokenized['labels'])
    print()

In [None]:
# Inspect a batch from the tokenized training dataset to check for padding, label, and attention mask issues
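from torch.utils.data import DataLoader

batch_size = 4
# Drop the raw text columns and let the seq2seq collator pad each batch;
# the default collate function cannot stack variable-length sequences.
loader = DataLoader(
    tokenized_train.remove_columns(['input_text', 'output_text']),
    batch_size=batch_size,
    collate_fn=data_collator,
)
batch = next(iter(loader))

print('--- Batch keys ---')
print(batch.keys())

for i in range(len(batch['input_ids'])):
    print(f'\n--- Example {i+1} ---')
    print('Input IDs:', batch['input_ids'][i])
    print('Labels:', batch['labels'][i])
    if 'attention_mask' in batch:
        print('Attention mask:', batch['attention_mask'][i])
    # Decode input and label for inspection.
    # The collator pads labels with -100, which must be mapped back to the pad id before decoding.
    label_ids = batch['labels'][i].masked_fill(batch['labels'][i] == -100, tokenizer.pad_token_id)
    print('Decoded input:', tokenizer.decode(batch['input_ids'][i], skip_special_tokens=True))
    print('Decoded label:', tokenizer.decode(label_ids, skip_special_tokens=True))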

## Evaluation & Sample Generation
- Generate sample workouts from user context using the fine-tuned model.
- Evaluate model outputs for accuracy and structure.
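
A structural check on the generated text is a cheap first evaluation step: the output may not be valid JSON at all, for example if generation stops at the length limit or the tokenizer cannot represent some characters of the target. The sketch below assumes the top-level keys of the workout templates defined in this notebook; `check_generated_workout` is a hypothetical helper.

```python
import json

# Top-level keys taken from the workout templates used as training targets.
REQUIRED_TEMPLATE_KEYS = {
    'name', 'description', 'difficulty_level',
    'estimated_duration_minutes', 'target_audience', 'blocks',
}

def check_generated_workout(text):
    """Return (ok, detail) describing whether `text` parses into a plausible workout."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"not valid JSON: {err.msg}"
    if not isinstance(parsed, dict):
        return False, "top-level value is not an object"
    missing = sorted(REQUIRED_TEMPLATE_KEYS - parsed.keys())
    if missing:
        return False, "missing keys: " + ", ".join(missing)
    if not isinstance(parsed['blocks'], list) or not parsed['blocks']:
        return False, "blocks is empty or not a list"
    return True, "ok"

print(check_generated_workout('{"name": "Quick Office Fitness", "blocks": ['))
print(check_generated_workout('{"name": "x"}'))
```

Counting how many generations pass this check gives a simple structure score to track alongside the evaluation loss.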

In [None]:
# Evaluate model: generate workouts for real user contexts and compare to expected templates
import pprint
def generate_workout(user_context):
    input_text = f"Description: {user_context['description']} | Phase: {user_context['training_phase']} | Motivation: {user_context['motivation']} | Special: {user_context['special_situation']}"
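    # Truncate to the same 256-token input limit used during training
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, max_length=256).to(device)
    outputs = model.generate(**inputs, max_length=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Sanity check on the curated examples. These were part of the training/validation data,
# so matching outputs indicate memorization, not generalization.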
for i, example in enumerate(train_examples):
    print(f"\n=== Example {i+1} ===")
    print("User Context:")
    pprint.pprint(example['input'])
    print("Expected Workout Template:")
    pprint.pprint(example['output'])
    print("Model Generated Workout:")
    generated = generate_workout(example['input'])
    print(generated)

## Save & Push Model to Hugging Face
- Save model and push to a-albiol/AthenAI.

In [None]:
from huggingface_hub import login
from google.colab import userdata

login(userdata.get('HF_TOKEN'))  # Token is read from the HF_TOKEN Colab secret

# Upload model and tokenizer to Hub with name 'AthenAI'
model.push_to_hub("AthenAI")
tokenizer.push_to_hub("AthenAI")

print("Model and tokenizer successfully uploaded to Hugging Face Hub as 'AthenAI'.")