# 🚀 Train Your First Custom Wake Word with Nanowakeword!

Welcome to the official tutorial for **Nanowakeword**! 

In this notebook, we will guide you through the entire process of training a custom wake word model from scratch. You don't need any pre-existing data: we will download the noise and RIR datasets, synthesize the wake word clips, and let Nanowakeword's intelligent engine do the heavy lifting.

**Our goal:** Go from zero to a ready-to-use wake word model in just a few simple steps. Let's get started!

**Installation**

In [None]:
# @title Step 1: Install Nanowakeword
# We install the full [train] package to get all the necessary dependencies.

! pip install --no-cache-dir "nanowakeword[train]==1.2.0"
! pip install piper-tts

print("Installation complete!")
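If you want to confirm what the install cell actually put in the environment, a minimal sketch like the one below can help. It only uses the standard library's `importlib.metadata`; the helper name `installed_version` is our own and not part of Nanowakeword.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the two packages installed by the cell above.
for name in ("nanowakeword", "piper-tts"):
    found = installed_version(name)
    print(f"{name}: {found if found else 'not installed'}")
```

If `nanowakeword` is reported as not installed, re-run the install cell before continuing.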

## Step 2: Prepare the Dataset

A great model starts with great data. For this tutorial, we will:
1.  **Download** open-source noise and Room Impulse Response (RIR) datasets.
2.  **Generate** our own custom wake word samples using a built-in TTS engine.
3.  **Organize** all of our project files in a clean, well-structured folder hierarchy.

In [None]:
# @title Step 2.1: Download & Prepare the Nanowakeword Starter Dataset

import os
from pathlib import Path
import subprocess
import shutil

# --- Configuration ---
DATASET_REPO_URL = "https://huggingface.co/datasets/arcosoph/SonicWeave-v1"
DATA_DIR = Path("./nanowakeword_data")

# --- Define Final Paths ---
noise_dir = DATA_DIR / "Noise"
rir_dir = DATA_DIR / "Rir"
positive_dir = DATA_DIR / "positive_wakeword"
negative_dir = DATA_DIR / "negative_speech"

# --- Main Logic ---
print("Checking for the starter dataset...")

# Download only if the dataset folders are missing or the noise folder is empty.
if not noise_dir.exists() or not rir_dir.exists() or not any(noise_dir.iterdir()):
    
    # Clone the repository to a temporary location (clearing any leftover from a failed run)
    temp_clone_dir = DATA_DIR / "temp_repo"
    shutil.rmtree(temp_clone_dir, ignore_errors=True)
    
    print(f"Downloading the Starter Dataset from {DATASET_REPO_URL}...")
    
    # --depth 1 only downloads the latest commit, which is much faster      
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", DATASET_REPO_URL, str(temp_clone_dir)],
            check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL # Hide unnecessary log messages
        )
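    except (subprocess.CalledProcessError, FileNotFoundError) as e:
        # FileNotFoundError is raised when the git executable itself is missing
        print("Error: Failed to clone the dataset repository. Check the URL and that git is installed.")
        print(f"Git command failed with error: {e}")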
    else:
        print("Organizing dataset files...")

        # Move the Noise and Rir folders from the cloned repository into DATA_DIR
        try:
            for folder_name in ["Noise", "Rir"]:
                src = temp_clone_dir / folder_name
                dst = DATA_DIR / folder_name
                if dst.exists() and any(dst.iterdir()):
                    continue  # Keep a folder that already contains data
                if not src.exists():
                    raise FileNotFoundError(folder_name)
                if dst.exists():
                    dst.rmdir()  # Remove an empty leftover so the move does not nest folders
                shutil.move(str(src), str(dst))

            print("\nStarter Dataset is ready!")
        except FileNotFoundError as e:
            print(f"Error: '{e}' folder not found inside the cloned repository.")
        except Exception as e:
            print(f"Error organizing files: {e}")
        finally:
            # Delete temp_repo; ignore errors if some files are locked
            shutil.rmtree(temp_clone_dir, ignore_errors=True)
else:
    print("Nanowakeword Starter Dataset already found.")
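Before moving on, it can be worth checking that the two folders really contain audio. The sketch below counts audio files recursively using only `pathlib`. The helper name `count_audio_files` and the list of file extensions are assumptions made for illustration; the starter dataset may use other formats.

```python
from pathlib import Path

# Assumed audio extensions; adjust if the dataset uses other formats.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3"}


def count_audio_files(folder):
    """Count audio files under a folder (recursively); return 0 if the folder is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return 0
    return sum(
        1
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES
    )


# Same folder names as in the download cell above.
for name in ("Noise", "Rir"):
    path = Path("./nanowakeword_data") / name
    print(f"{path}: {count_audio_files(path)} audio file(s)")
```

A count of zero for either folder suggests the download or the move step did not complete.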

**Generate Wake Word Samples**

In [None]:
# @title Step 2.2: Generate Custom Wake Word & Adversarial Negative Audio

# We call the generation functions directly here instead of relying only on the `--generate_clips` flag.

from nanowakeword.generate_samples import generate_samples
from nanowakeword.data import generate_adversarial_texts 

#@markdown Define your custom wake word and the number of samples you want to generate.
WAKE_WORD = "Hey Computer"    #@param {type:"string" }
NUM_POSITIVE_SAMPLES = 1000   #@param {type:"integer"}
NUM_NEGATIVE_SAMPLES = 4000   #@param {type:"integer"}
#    ✍️(◔◡◔) ༼ つ ◕_◕ ༽つ


# 1. Create the positive samples directly from the wake word text
generate_samples(
    text=WAKE_WORD,
    output_dir=str(positive_dir),
    max_samples=NUM_POSITIVE_SAMPLES,
)

print(f"\nGenerating {NUM_NEGATIVE_SAMPLES} intelligent adversarial negative samples...")

# 2. Nanowakeword automatically generates strong negative phrases based on the wake word,
#    for example "Hey Commuter", "Play Computer", "Hey Peter", or "Okay Jupiter".
#    It can produce thousands of such variations.
adversarial_texts = generate_adversarial_texts(
    input_text=WAKE_WORD,
    N=NUM_NEGATIVE_SAMPLES,
)

# 3. Synthesize audio from the automatically generated phrases
generate_samples(
    text=adversarial_texts,
    output_dir=str(negative_dir),
    max_samples=NUM_NEGATIVE_SAMPLES,
)

print("\nAll synthetic audio has been generated successfully!")
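A quick sanity check of the generated clips can catch problems before training. The sketch below is our own helper, not a Nanowakeword function. It assumes the clips are PCM `.wav` files written directly into the two output folders, and reads them with the standard library's `wave` module, which raises `wave.Error` on non-PCM files.

```python
import wave
from pathlib import Path


def summarize_wav_folder(folder):
    """Return (clip_count, sample_rates, total_seconds) for the .wav files in a folder."""
    folder = Path(folder)
    count, rates, total_seconds = 0, set(), 0.0
    if not folder.is_dir():
        return count, rates, total_seconds
    for wav_path in sorted(folder.glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            rate = wav_file.getframerate()
            frames = wav_file.getnframes()
        count += 1
        rates.add(rate)
        total_seconds += frames / rate
    return count, rates, total_seconds


# Folder names match the paths defined in Step 2.1.
for label in ("positive_wakeword", "negative_speech"):
    clips, rates, seconds = summarize_wav_folder(Path("./nanowakeword_data") / label)
    print(f"{label}: {clips} clip(s), sample rates {sorted(rates)}, {seconds:.1f} s total")
```

More than one sample rate in the same folder, or a clip count far below the requested number, is worth investigating before you start training.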

## Step 3: Configure and Train the Model

Now for the fun part! We will create a `config.yaml` file and then run the Nanowakeword training command.

We will use the magical `--auto-config` flag to let the Intelligent Engine analyze our newly prepared data and build the best possible model.

**Configuration and Training**

In [None]:
# @title Step 3.1: Create the Configuration File
import yaml

config_dict = {
    # Data Paths (pointing to our newly created folders)
    "wakeword_data_path": str(positive_dir),
    "background_data_path": str(negative_dir),
    "background_paths": [str(noise_dir)],
    "rir_paths": [str(rir_dir)],
    # Model Output
    "model_name": "hey_computer_v1",
    "output_dir": "./trained_models",
    # Model Type
    "model_type": "dnn" # DNN offers faster training, while LSTM and other architectures provide greater robustness.
}

# Write the config to a YAML file
config_path = "./config.yaml"
with open(config_path, 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False)

print(f"✅ Configuration file saved to {config_path}")
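Training fails late if one of the configured folders is missing, so it can be useful to check the paths first. The sketch below is a hypothetical helper (`find_missing_paths` is our own name) that takes a dictionary shaped like `config_dict` above and reports every path entry that is not an existing directory. It does not validate anything else about the configuration.

```python
from pathlib import Path

# Keys taken from the configuration cell above.
PATH_KEYS = ("wakeword_data_path", "background_data_path", "background_paths", "rir_paths")


def find_missing_paths(config, keys=PATH_KEYS):
    """Return a list of (key, path) pairs whose path is absent or not a directory."""
    missing = []
    for key in keys:
        value = config.get(key)
        if value is None:
            missing.append((key, None))  # The key itself is not set
            continue
        paths = value if isinstance(value, (list, tuple)) else [value]
        for path in paths:
            if not Path(path).is_dir():
                missing.append((key, str(path)))
    return missing


# Example with the same folder layout used in this tutorial.
example_config = {
    "wakeword_data_path": "nanowakeword_data/positive_wakeword",
    "background_data_path": "nanowakeword_data/negative_speech",
    "background_paths": ["nanowakeword_data/Noise"],
    "rir_paths": ["nanowakeword_data/Rir"],
}
for key, path in find_missing_paths(example_config):
    print(f"Missing: {key} -> {path}")
```

In the notebook you could pass `config_dict` itself instead of `example_config`. An empty result means every configured folder exists.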

**Run Training!**

In [None]:
# @title Step 3.2: Run the Magic Command! 🚀
# This command will do everything: augment data, extract features, and train the model.
# It might take some time depending on the hardware (especially on a CPU).

# The clips were already generated in Step 2.2; the `--generate_clips` flag is passed again here to adjust the amount of data.

from nanowakeword.trainer import train 

args_list = [
    '--training_config', config_path,
    '--auto-config',
    '--generate_clips',
    '--augment_clips',
    '--train_model',
    '--overwrite',
]

print("Starting NanoWakeWord training...")

try:
    train(args_list)
    print("\n\nCONGRATULATIONS! (✿◕‿◕✿)")
    print("Your custom wake word model has been successfully trained!")

except Exception as e:
    print(f"\nAn error occurred during training: {e}")

## What's Next?

You have successfully trained your own custom wake word model!

You can now download the `.onnx` or `.tflite` file from the `trained_models` directory (check the file browser on the left) and use it in your own applications.

**Note:** If you face any issues while converting to TFLite, don't worry. The ONNX model is yours to keep, and you can convert it to TFLite manually at any time.

For more advanced topics, such as using your own datasets or fine-tuning the configuration, please check out our full documentation on **[GitHub](https://github.com/arcosoph/nanowakeword)**.
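To see which model files training produced, and what inputs the ONNX model expects, a sketch along these lines may help. Both helper names are our own. `describe_onnx` assumes the third-party `onnxruntime` package is installed and only reports what the model file declares; it makes no assumptions about how Nanowakeword preprocesses audio.

```python
from pathlib import Path


def find_model_files(output_dir, suffixes=(".onnx", ".tflite")):
    """Return a sorted list of model files under output_dir (empty if it is missing)."""
    output_dir = Path(output_dir)
    if not output_dir.is_dir():
        return []
    return sorted(
        path
        for path in output_dir.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


def describe_onnx(model_path):
    """Return the declared input and output tensors of an ONNX model."""
    import onnxruntime as ort  # Third-party; imported lazily so the listing works without it

    session = ort.InferenceSession(str(model_path), providers=["CPUExecutionProvider"])
    return {
        "inputs": [(node.name, node.shape, node.type) for node in session.get_inputs()],
        "outputs": [(node.name, node.shape, node.type) for node in session.get_outputs()],
    }


# "./trained_models" is the output_dir set in Step 3.1.
for model_file in find_model_files("./trained_models"):
    print(f"{model_file} ({model_file.stat().st_size / 1024:.1f} KiB)")
    if model_file.suffix.lower() == ".onnx":
        print(describe_onnx(model_file))
```

The reported input shape tells you what feature tensor your application has to feed the model; see the project documentation for the matching preprocessing.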

---
## Step 4: Save Your Model to Google Drive

The final step is to save your trained model and performance graphs to a safe and accessible place. Instead of a slow direct download, we will copy the files straight to your Google Drive, which is typically much faster.

Run the cells below to:
1.  Connect your Google Drive account.
2.  Copy all the trained files into a new folder named `nanowakeword_models` in your Drive.

In [None]:
# @title Step 4.1: Connect to Google Drive
# This will ask for your permission to access your Google Drive.

from google.colab import drive

try:
    drive.mount('/content/drive')
    print("\nGoogle Drive connected successfully!")
except Exception as e:
    print(f"An error occurred while connecting to Google Drive: {e}")

In [None]:
# @title Step 4.2: Copy Trained Files to Your Drive 📂

import os
import shutil
import glob

# Reuse the configuration created in Step 3.1
model_name = config_dict.get("model_name", "my_model")
output_dir = config_dict.get("output_dir", "./trained_models")

# Creating a destination folder in Google Drive
drive_folder_path = f"/content/drive/MyDrive/nanowakeword_models/{model_name}"

# If there is an old folder, delete it and start over.
if os.path.exists(drive_folder_path):
    print(f"Removing existing folder in Drive: '{drive_folder_path}'")
    shutil.rmtree(drive_folder_path)

os.makedirs(drive_folder_path, exist_ok=True)
print(f"Created a new folder in your Google Drive: '{drive_folder_path}'")

files_copied_count = 0
files_found = False

# --- 1. Copy the model files (.onnx, .tflite) ---
model_files = glob.glob(os.path.join(output_dir, f"{model_name}*.*"))
for file_path in model_files:
    if file_path.endswith(('.onnx', '.tflite')):
        try:
            shutil.copy(file_path, drive_folder_path)
            print(f"  - Copied model: {os.path.basename(file_path)}")
            files_copied_count += 1
            files_found = True
        except Exception as e:
            print(f"  - Failed to copy {os.path.basename(file_path)}: {e}")

# --- 2. Find and copy the graph folder ---
# As of nanowakeword v1.2.0, the graph folder is inside the output_dir, outside the model folder.

graphs_source_path_option1 = os.path.join(output_dir, "graphs")

graphs_source_path_option2 = os.path.join(output_dir, model_name, "graphs")

graphs_source_path = None
if os.path.exists(graphs_source_path_option1):
    graphs_source_path = graphs_source_path_option1
elif os.path.exists(graphs_source_path_option2):
    graphs_source_path = graphs_source_path_option2

if graphs_source_path:
    graphs_dest_folder = os.path.join(drive_folder_path, "graphs")
    try:
        shutil.copytree(graphs_source_path, graphs_dest_folder)
        print(f"  - Copied performance graphs folder from '{graphs_source_path}'.")
        files_copied_count += 1
        files_found = True
    except Exception as e:
        print(f"  - Failed to copy graphs folder: {e}")
else:
    print("  - Performance graphs folder not found.")

if files_found:
    print(f"\n✅ Success! {files_copied_count} item(s) have been saved to your Google Drive.")
    print(f"Please check the '{model_name}' folder inside 'nanowakeword_models' in your Google Drive.")
else:
    print("Error: Could not find any trained model files or graphs to copy.")
    print("Please make sure the training step (3.2) completed successfully.")
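If you want extra assurance that the copies in Drive match the originals, one option is to compare checksums. The sketch below is a hypothetical helper (`verify_copies` is our own name) that hashes each model file in the source folder with SHA-256 and compares it with the file of the same name in the destination.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copies(source_dir, dest_dir, suffixes=(".onnx", ".tflite")):
    """Map each model file name in source_dir to 'ok', 'missing', or 'mismatch'."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    results = {}
    if not source_dir.is_dir():
        return results
    for src in sorted(source_dir.iterdir()):
        if not src.is_file() or src.suffix.lower() not in suffixes:
            continue
        dst = dest_dir / src.name
        if not dst.is_file():
            results[src.name] = "missing"
        elif sha256_of(src) == sha256_of(dst):
            results[src.name] = "ok"
        else:
            results[src.name] = "mismatch"
    return results


# Paths follow the values used in Steps 3.1 and 4.2.
report = verify_copies(
    "./trained_models",
    "/content/drive/MyDrive/nanowakeword_models/hey_computer_v1",
)
for file_name, status in report.items():
    print(f"{file_name}: {status}")
```

Any entry other than `ok` means the copy cell should be re-run for that file.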