# 🚀 Train Your First Custom Wake Word with Nanowakeword!

Welcome to the official tutorial for **Nanowakeword**! 

In this notebook, we will guide you through the entire process of training a high-performance, custom wake word model from scratch. You don't need any pre-existing data: we will download the noise and RIR datasets we need, and Nanowakeword will generate the speech samples, augment them, and train the model.

**Our goal:** Go from zero to a ready-to-use wake word model in just a few simple steps. Let's get started!

**Installation**

In [None]:
# @title Step 1: Install Nanowakeword
# We install Nanowakeword with the [train] extra to get all the training dependencies.

# Alternative: install a pinned release from PyPI instead of the GitHub version below.
# ! pip install --no-cache-dir "nanowakeword[train]==1.3.4"
! pip install "nanowakeword[train] @ git+https://github.com/arcosoph/nanowakeword.git"
! pip install piper-tts

print("Installation complete!")
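
Before moving on, it can help to confirm that the package is actually visible to the current kernel. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and not part of Nanowakeword.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "nanowakeword" is the distribution name used in the pip command above.
version = installed_version("nanowakeword")
if version is None:
    print("nanowakeword is not installed in this environment; re-run the install cell.")
else:
    print(f"nanowakeword {version} is installed.")
```

If the package is reported as missing right after a successful install, restarting the runtime and re-running the check is usually the first thing to try.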

## Step 2: Prepare the Dataset

A great model starts with great data. For this tutorial, we will:
1.  **Download** open-source noise and Room Impulse Response (RIR) datasets.
2.  **Organize** them under a single `nanowakeword_data` folder, alongside the folders that will hold the positive and negative speech samples.

In [None]:
# @title Step 2: Download & Prepare the SonicWeave-v1 Dataset

import os
from pathlib import Path
import subprocess
import shutil

# --- Configuration ---
DATASET_REPO_URL = "https://huggingface.co/datasets/arcosoph/SonicWeave-v1"
DATA_DIR = Path("./nanowakeword_data")

# --- Define Final Paths ---
noise_dir = DATA_DIR / "Noise"
rir_dir = DATA_DIR / "Rir"
positive_dir = DATA_DIR / "positive_wakeword"
negative_dir = DATA_DIR / "negative_speech"

# --- Main Logic ---
print("Checking for the SonicWeave-v1 dataset...")

# Download only if the Noise/Rir folders are missing or the Noise folder is empty.
if not noise_dir.exists() or not rir_dir.exists() or not any(noise_dir.iterdir()):
    
    # Clone the repository to a temporary location
    temp_clone_dir = DATA_DIR / "temp_repo"
    
    print(f"Downloading the Starter Dataset from {DATASET_REPO_URL}...")
    
    # --depth 1 fetches only the latest commit, which is much faster than a full clone
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", DATASET_REPO_URL, str(temp_clone_dir)],
            check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL  # Hide git's progress output
        )
    except subprocess.CalledProcessError as e:
        print("Error: Failed to clone the dataset repository. Please check the URL.")
        print(f"Git command failed with error: {e}")
    else:
        print("Organizing dataset files...")

        # Move the Noise and Rir folders from the cloned repository into DATA_DIR
        try:
            for folder_name in ["Noise", "Rir"]:
                src = temp_clone_dir / folder_name
                dst = DATA_DIR / folder_name
                if not src.exists():
                    # Report a missing folder instead of skipping it silently
                    raise FileNotFoundError(folder_name)
                # Remove an empty destination so the move does not nest the folder inside it
                if dst.is_dir() and not any(dst.iterdir()):
                    dst.rmdir()
                shutil.move(str(src), str(dst))

            print("\nSonicWeave-v1 Dataset is ready!")
        except FileNotFoundError as e:
            print(f"Error: '{e}' folder not found inside the cloned repository.")
        except Exception as e:
            print(f"Error organizing files: {e}")
        finally:
            # Delete temp_repo; ignore errors if some files are locked
            shutil.rmtree(temp_clone_dir, ignore_errors=True)
else:
    print("SonicWeave-v1 Dataset already found.")
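
A quick count of the downloaded files makes it easy to spot an incomplete clone. This is a minimal sketch: the helper `count_audio_files` and the list of extensions are assumptions, since this tutorial does not state which audio formats SonicWeave-v1 ships.

```python
from pathlib import Path

# Assumed audio extensions; adjust to match the files you actually see in the folders.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def count_audio_files(folder):
    """Count audio files under `folder` (recursively); return 0 if the folder is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1 for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


# Same folder layout as the download cell above.
data_dir = Path("./nanowakeword_data")
for name in ["Noise", "Rir"]:
    print(f"{name}: {count_audio_files(data_dir / name)} audio files")
```

A count of zero for either folder suggests the clone or the move step did not complete, and the download cell should be re-run.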

## Step 3: Configure and Train the Model

Now for the fun part! We will create a `config.yaml` file and then run the Nanowakeword training command.

In [None]:
# @title Step 3.1: Create the "Beast Mode" Configuration File
import yaml

# Define the configuration dictionary
config_dict = {
    # --- 1. Project Settings ---
    "model_name": "user_nww_rnn",
    "output_dir": "./trained_models",

    
    # --- 2. Data Generation Counts ---
    "generate_positive_samples": 2044,
    "generate_negative_samples": 4044, # Should be higher than positive

    # TTS Batch Settings (Optimized for Colab/GPU)
    "tts_batch_size": 256,          # Fast for positive
    "tts_batch_size_negative": 64,  # Safe for negative (avoids OOM)

    # --- 3. Hard Negatives (False Positive Killers) ---
    # NOTE: Add phrases here that sound similar to your wake word, so the model
    # learns to reject them.
    # Example: ["hey arcosoph", "hello soph", "archive"]
    "custom_negative_phrases": [],

    # How many times to repeat each custom phrase?
    # "custom_negative_per_phrase": 20, # only relevant if you use custom_negative_phrases
    # "adversarial_text_generation": True, # Auto-fill gaps if custom count is low (default: True)

    # --- 5. Path Configuration ---
    # Using the directory variables defined earlier (recommended) 
    # or hardcoded strings if you prefer.
    "positive_data_path": str(positive_dir),
    "negative_data_path": str(negative_dir),
    "background_paths": [str(noise_dir)],
    "rir_paths": [str(rir_dir)],

    # --- 6. Augmentation Settings (Per-Example Mode) ---
    "augmentation_rounds": 10,
    "augmentation_batch_size": 128,
    "feature_gen_cpu_ratio": 0.8,

    "augmentation_settings": {
        "BackgroundNoise": 0.8,  # 80% of files will have noise
        "RIR": 0.6,              # 60% of files will have reverb
        "PitchShift": 0.5,
        "Gain": 1.0,
        "ColoredNoise": 0.4,
        "BandStopFilter": 0.2
    },

    # --- 7. Model Architecture ---
    "model_type": "rnn",     # RNN worked well for wake words in our tests; you can also use others like dnn, transformer, ...
    "layer_size": 256,
    "n_blocks": 4,
    "dropout_prob": 0.5,
    "embedding_dim": 64,

    # --- 8. Training Hyperparameters ---
    "steps": 25000,          # 25k steps for robust training
    "batch_size": 128,
    
    "optimizer_type": "adamw",
    "lr_scheduler_type": "onecycle",
    "learning_rate_max": 0.001,
    "weight_decay": 0.01,
    
    # Loss Weights
    "loss_weight_triplet": 0.4,
    "loss_weight_class": 1.0,

    # --- 9. Smart Batch Composition ---
    "batch_composition": {
        "batch_size": 128,
        "source_distribution": {
            "positive": 30,         # 30% Target phrase
            "negative_speech": 40,  # 40% Human speech (Critical for false positives)
            "pure_noise": 30        # 30% Background noise
        }
    },

    # --- 10. Checkpointing & Controls ---
    "checkpointing": {
        "enabled": True,
        "interval_steps": 1000,
        "limit": 5
    },
    
    # --- 11. Pipeline Flags ---
    "overwrite": True,
    "debug_mode": True,
    "generate_clips": True,  # Set to False when re-running training to skip clip generation and save time
    "transform_clips": True,
    "train_model": True

    # You can provide other parameters if you want...
}

# Write the config to a YAML file
config_path = "./config.yaml"
with open(config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, sort_keys=False)
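
Since a typo in the configuration only surfaces once training starts, a small sanity check before launching can save time. The sketch below is not part of Nanowakeword: `check_config` is a hypothetical helper, and both rules it applies (the batch percentages total 100, and negatives outnumber positives) are assumptions drawn from the comments in the cell above.

```python
def check_config(cfg):
    """Return a list of human-readable warnings for a config dict (empty if none)."""
    warnings = []

    # Assumption: the source_distribution values are percentages that should total 100.
    distribution = cfg.get("batch_composition", {}).get("source_distribution", {})
    total = sum(distribution.values())
    if distribution and total != 100:
        warnings.append(f"source_distribution sums to {total}, expected 100")

    # The config comment says negative samples should outnumber positive ones.
    positives = cfg.get("generate_positive_samples", 0)
    negatives = cfg.get("generate_negative_samples", 0)
    if negatives <= positives:
        warnings.append(
            f"generate_negative_samples ({negatives}) should exceed "
            f"generate_positive_samples ({positives})"
        )
    return warnings


# A trimmed copy of the values used in this tutorial.
example_config = {
    "generate_positive_samples": 2044,
    "generate_negative_samples": 4044,
    "batch_composition": {
        "source_distribution": {"positive": 30, "negative_speech": 40, "pure_noise": 30}
    },
}

for message in check_config(example_config) or ["Config looks consistent."]:
    print(message)
```

In the notebook, calling `check_config(config_dict)` runs the same checks against the full configuration.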


**Run Training!**

In [None]:
# @title Step 3.2: Run the Magic Command! 🚀
# This command will do everything: augment data, extract features, and train the model.
# It might take some time depending on the hardware (especially on a CPU).

from nanowakeword.trainer import train 

args_list = [
    '--config_path', config_path,
]

print("Starting NanoWakeWord training...")

try:
    train(args_list)
    print("\n\nCONGRATULATIONS! (✿◕‿◕✿)")
    print("Your custom wake word model has been successfully trained!")

except Exception as e:
    print(f"\nAn error occurred during training: {e}")

## What's Next?

You have successfully trained your own custom wake word model!

You can now download the `.onnx` file from the `trained_models` directory (check the file browser on the left) and use it in your own applications.
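
To find the exported model without clicking through the file browser, the output folder can be searched for `.onnx` files. This is a minimal sketch, assuming the model is written somewhere under `output_dir`; the helper `find_onnx_models` is hypothetical, and the exact subfolder layout is not specified in this tutorial.

```python
from pathlib import Path


def find_onnx_models(output_dir):
    """Return all .onnx files under `output_dir`, sorted by path (empty list if none)."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*.onnx") if p.is_file())


# "./trained_models" matches the output_dir value from the config above.
models = find_onnx_models("./trained_models")
if models:
    for path in models:
        size_kb = path.stat().st_size / 1024
        print(f"{path} ({size_kb:.1f} KB)")
else:
    print("No .onnx files found yet; check that training finished.")
```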

For more advanced topics, such as using your own datasets or fine-tuning the configuration, please check out our full documentation on **[GitHub](https://github.com/arcosoph/nanowakeword)**.

---
## Step 4: Save Your Model to Google Drive

The final step is to save your trained model and performance graph to a safe and accessible place. Instead of a slow direct download through the browser, we will copy the files straight to your Google Drive, which is usually much faster.

Run the cells below to:
1.  Connect your Google Drive account.
2.  Copy all the trained files into a new folder named `nanowakeword_models` in your Drive.

In [None]:
# @title Step 4.1: Connect to Google Drive
# This will ask for your permission to access your Google Drive.

from google.colab import drive
import os

try:
    drive.mount('/content/drive')
    print("\nGoogle Drive connected successfully!")
except Exception as e:
    print(f"An error occurred while connecting to Google Drive: {e}")

In [None]:
# @title Step 4.2: Copy Final Model and Artifacts to Google Drive 📂

import os
import shutil

# --- Configuration ---
# Get model_name and output_dir from the config_dict defined earlier
model_name = config_dict.get("model_name", "my_model")
output_dir = config_dict.get("output_dir", "./trained_models")

# --- Source and Destination Paths ---
# The source project directory containing all generated files
source_project_dir = os.path.join(output_dir, model_name)

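# The destination folder in your Google Drive (Drive is mounted at /content/drive in Step 4.1).
# An absolute path keeps this working even if the notebook's working directory has changed.
drive_destination_dir = f"/content/drive/MyDrive/nanowakeword_models/{model_name}"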

# --- Start Copy Process ---
print("Starting the process to copy trained files to Google Drive...")

# Check if the source directory exists
if not os.path.exists(source_project_dir):
    print(f"\n❌ ERROR: Source directory not found at '{source_project_dir}'")
    print("This indicates that the training process did not create the expected output folder.")
    print("Please ensure the training step completed successfully before running this cell.")
else:
    # If an old folder exists in Drive, remove it to ensure a clean copy
    if os.path.exists(drive_destination_dir):
        print(f"🔄 Found an existing folder in Drive. Removing it for a fresh copy: '{drive_destination_dir}'")
        shutil.rmtree(drive_destination_dir)

    # --- Copy the entire project folder ---
    # Copying the whole folder is simpler and more reliable than copying individual files,
    # and it preserves the directory structure.
    try:
        shutil.copytree(source_project_dir, drive_destination_dir)
        
        print("\n" + "="*50)
        print("✅ SUCCESS! All files have been saved to your Google Drive.")
        print("="*50)
        print("\nYour complete project, including the model and performance graphs, can be found in:")
        print(f"➡️ '{drive_destination_dir}'")
        
        # Optional: List the contents of the new folder in Drive for verification
        print("\nContents of the saved folder:")
        for root, dirs, files in os.walk(drive_destination_dir):
            level = root.replace(drive_destination_dir, '').count(os.sep)
            indent = ' ' * 4 * (level)
            print(f"{indent}{os.path.basename(root)}/")
            sub_indent = ' ' * 4 * (level + 1)
            for f in files:
                print(f"{sub_indent}{f}")

    except Exception as e:
        print("\n❌ ERROR: An unexpected error occurred during the copy process.")
        print(f"Details: {e}")
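
To confirm that nothing was lost during the copy, the file lists of the source and destination folders can be compared. The sketch below uses only the standard library; `relative_files` and `missing_after_copy` are hypothetical helpers, and comparing by relative path and size is a lightweight check, not a byte-for-byte verification.

```python
import os


def relative_files(root):
    """Map each file's path relative to `root` to its size in bytes."""
    result = {}
    for current, _dirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(current, name)
            result[os.path.relpath(full_path, root)] = os.path.getsize(full_path)
    return result


def missing_after_copy(source_dir, destination_dir):
    """Return relative paths that are absent from the destination or differ in size."""
    source = relative_files(source_dir)
    destination = relative_files(destination_dir)
    return sorted(
        path for path, size in source.items()
        if destination.get(path) != size
    )
```

In the notebook, `missing_after_copy(source_project_dir, drive_destination_dir)` should return an empty list after a successful copy; any paths it does return are worth copying again.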