# Vosk

Vosk is a speech recognition toolkit. The official website is [here](https://alphacephei.com/vosk/).

We chose it for the following advantages:

1. Works offline, even on lightweight devices such as our Raspberry Pi.
2. Portable per-language models are only about 50 MB each, and much bigger server models are also available.
3. Supports speaker identification in addition to plain speech recognition.

In [None]:
# set project root
from pathlib import Path

PROJECT_ROOT = Path("/home/thaonan/Projects/Dormitory-Assistant")

%cd {PROJECT_ROOT}

# Use the following commands to install Vosk and its dependencies

# Install required system packages
!sudo apt-get update
!sudo apt-get install -y portaudio19-dev
!sudo apt-get install -y ffmpeg  # For audio conversion

%pip install vosk
%pip install --force-reinstall sounddevice
%pip install pydub
%pip install scipy
%pip install numpy
%pip install matplotlib
%pip install soundfile


import time
import sounddevice as sd
from scipy.io.wavfile import write
import numpy as np

## Command Line Usage Examples
You can transcribe a file with the `vosk-transcriber` command line tool.

### 1. Video2Txt / Audio2Txt

```sh
vosk-transcriber -i test.mp4 -o test.txt
vosk-transcriber -i test.mp4 -t srt -o test.srt
vosk-transcriber -l fr -i test.m4a -t srt -o test.srt
vosk-transcriber --list-languages
```


In [None]:
# Transcribe an mp4 file to a text file (no -l flag, so the tool's default language is used)
!vosk-transcriber -i resource/test.mp4 -o resource/test.txt


In [None]:
# List all supported languages
!vosk-transcriber --list-languages

### Speaker Identification Implementation

Now let's try its speaker identification features.

#### 1. Importing Required Modules:

- os, sys → Handle file paths and system operations.
- wave → Read WAV audio files.
- json → Parse recognition results.
- numpy → Perform numerical operations (for speaker verification).
- vosk → The Vosk speech recognition engine.


In [None]:
import os
import sys
import wave
import json
import numpy as np
from vosk import Model, KaldiRecognizer, SpkModel

#### 2. Download and Check Speaker Model

Next, download a speaker model. The download page is [here](https://alphacephei.com/vosk/models), where the available speaker models are listed.
The speaker model extracts **speaker embeddings** (also called "x-vectors") from the audio input, which are then used to identify the speaker.



In [None]:
SPK_MODEL_PATH = "model-spk"

# Download and extract the SPEAKER model (not the small-en-us model)
!wget https://alphacephei.com/vosk/models/vosk-model-spk-0.4.zip
!unzip vosk-model-spk-0.4.zip -d {SPK_MODEL_PATH}
!ls {SPK_MODEL_PATH}/


if not os.path.exists(SPK_MODEL_PATH):
    # f-string so the actual path is shown instead of the literal "{SPK_MODEL_PATH}"
    print("Please download the speaker model from "
          f"https://alphacephei.com/vosk/models and unpack it as {SPK_MODEL_PATH} "
          "in the current folder.")
    sys.exit(1)

##### What Are Speaker Embeddings?
An x-vector (or speaker embedding) is a compact numerical representation (a vector of numbers, e.g., [0.12, -0.45, ...]) that captures the unique characteristics of a speaker's voice, such as pitch, tone, and vocal tract shape. You can think of it as a "voice fingerprint".

1. **What It Is**:  
   - A **fixed-length vector** (e.g., 512 numbers) extracted from audio using a neural network.  
   - Represents **voice identity**, not the spoken words.  
   - Example: Two recordings of *different sentences* from the **same speaker** will have **similar x-vectors**.  

2. **How It Works**:  
   - A model (like Vosk's speaker model) analyzes the audio and outputs the x-vector.  
   - Computed from **spectral features** (e.g., MFCCs) of the voice.  

3. **Purpose**:  
   - **Speaker Verification**: Confirm if two audios are from the same person (e.g., voice authentication).  
   - **Speaker Diarization**: Label "who spoke when" in a conversation.  
   - **Compare Speakers**: Measure similarity using **cosine distance** (small distance = similar voices).  

4. **X-Vector vs. Raw Audio**:  
   | Feature          | Raw Audio (WAV)               | X-Vector                          |  
   |------------------|-------------------------------|-----------------------------------|  
   | **Format**       | Time-series sound samples     | Compact numerical vector (e.g., 512-dim) |  
   | **Content**      | Words + noise + speaker traits | Only speaker traits               |  
   | **Usage**        | Playback, ASR                 | Speaker recognition/comparison    |  

---


##### **Why Use X-Vectors?**  
- **Efficiency**: A 5s audio → 512 numbers (easy to store/compare).  
- **Privacy**: No raw audio is stored, just the voice fingerprint.  
- **Accuracy**: Beats older methods (like i-vectors) for speaker recognition.  
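
The cosine distance mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration, not Vosk's own code: the function name `cosine_distance` and the 4-dimensional vectors are made up for the example, and real x-vectors are much longer.

```python
import numpy as np

def cosine_distance(x, y):
    """Return 1 - cosine similarity between two vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Toy vectors standing in for real x-vectors (values are invented).
enrolled = [0.9, 0.1, -0.3, 0.5]
same_speaker = [0.8, 0.2, -0.25, 0.55]
other_speaker = [-0.4, 0.9, 0.6, -0.1]

print(cosine_distance(enrolled, same_speaker))   # close to 0: similar voices
print(cosine_distance(enrolled, other_speaker))  # above 1: dissimilar voices
```

A distance near 0 means the two vectors point in almost the same direction. Where to place the accept/reject threshold depends on the model and recording conditions, so it is best tuned on your own recordings.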



#### 3. Opening and Validating the Audio File

1. Opens a WAV file for reading in binary mode ("rb")
2. Validates 3 critical properties of the audio file:
    - Must be mono (single channel)
    - Must use 16-bit PCM encoding (2 bytes per sample)
    - Must be uncompressed (compression type `NONE`)

In [None]:
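# Path to the WAV file to analyze (fill in your own file before running)
wav_file = ""

wf = wave.open(wav_file, "rb")
# Reject anything that is not mono, 16-bit, uncompressed PCM
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be in WAV format: mono, 16-bit PCM.")
    sys.exit(1)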
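
To try the three checks without a real recording, the sketch below writes two silent WAV files with the standard `wave` module and validates them. The helper names `is_mono_pcm16` and `write_silence`, the file names, and the 16 kHz rate are assumptions made for this example.

```python
import os
import tempfile
import wave

def is_mono_pcm16(path):
    """Return True if the WAV file is mono, 16-bit, uncompressed PCM."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels() == 1
                and wf.getsampwidth() == 2
                and wf.getcomptype() == "NONE")

def write_silence(path, channels, seconds=1, rate=16000):
    """Write a silent 16-bit WAV file with the given channel count."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)      # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * channels * rate * seconds)

tmp_dir = tempfile.mkdtemp()
mono_path = os.path.join(tmp_dir, "mono.wav")
stereo_path = os.path.join(tmp_dir, "stereo.wav")
write_silence(mono_path, channels=1)
write_silence(stereo_path, channels=2)

print(is_mono_pcm16(mono_path))    # True
print(is_mono_pcm16(stereo_path))  # False
```

Returning a boolean instead of calling `sys.exit(1)` keeps the notebook kernel alive when a file fails the check.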

#### 4. Setting Up Speech and Speaker Recognition Models

- Loads the speech recognition model (en-us for English).
- Loads the speaker model for identifying speaker embeddings (x-vectors).
- Creates a recognizer (rec) that transcribes speech and detects speakers.

In [None]:
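# Large vocabulary free form recognition
model = Model(lang="en-us")
spk_model = SpkModel(SPK_MODEL_PATH)
# Alternative: pass the speaker model directly to the constructor
#rec = KaldiRecognizer(model, wf.getframerate(), spk_model)
rec = KaldiRecognizer(model, wf.getframerate())
# Attach the speaker model so results include speaker embeddings
rec.SetSpkModel(spk_model)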
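
Once the recognizer is set up, the audio still has to be fed through it. The sketch below shows one way to do that, assuming the result JSON carries the x-vector under a `spk` key, as in Vosk's published speaker example. The function name `collect_speaker_vectors` and the `chunk_frames` parameter are hypothetical names chosen for this example.

```python
import json

def collect_speaker_vectors(rec, wf, chunk_frames=4000):
    """Feed audio to a recognizer and return a list of (text, x-vector) pairs.

    `rec` is expected to behave like vosk.KaldiRecognizer with a speaker
    model attached; `wf` is an open wave.Wave_read object.
    """
    results = []

    def keep(raw_json):
        res = json.loads(raw_json)
        if "spk" in res:  # skip results that carry no speaker vector
            results.append((res.get("text", ""), res["spk"]))

    while True:
        data = wf.readframes(chunk_frames)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):  # True once an utterance is complete
            keep(rec.Result())
    keep(rec.FinalResult())           # flush the remaining audio
    return results
```

With the `rec` and `wf` objects from the cells above, a call would look like `pairs = collect_speaker_vectors(rec, wf)`. Each x-vector can then be compared against a stored signature.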

#### 5. Get Speaker Signature

[Use this notebook to get your own voice signature!](./Tools-Get_Voice_Signature.ipynb)

1. Record an audio clip of your voice.
2. Convert it to WAV (mono, 16-bit PCM).
3. Use the WAV file to generate the `spk_sig` variable as a NumPy array.
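
For the conversion step, one option is to call the `ffmpeg` binary installed earlier. The sketch below only illustrates that idea and is not the linked notebook's code: the helper names, the file names `my_voice.m4a` and `my_voice.wav`, and the 16 kHz sample rate are all assumptions.

```python
import shutil
import subprocess

def ffmpeg_wav_command(src, dst, rate=16000):
    """Build an ffmpeg command that converts `src` to mono 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",       # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",           # one channel (mono)
        "-ar", str(rate),     # sample rate in Hz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]

def convert_to_wav(src, dst, rate=16000):
    """Run the conversion; raises if ffmpeg is missing or the command fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(ffmpeg_wav_command(src, dst, rate), check=True)

print(" ".join(ffmpeg_wav_command("my_voice.m4a", "my_voice.wav")))
# ffmpeg -y -i my_voice.m4a -ac 1 -ar 16000 -c:a pcm_s16le my_voice.wav
```

The resulting file satisfies the mono, 16-bit, uncompressed requirements checked in step 3.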