# Speech-to-Text Engine
This notebook is a simple usage example of **Mozilla's DeepSpeech** speech-to-text (STT) library.

Reference: [YouTube - Speech to Text using Python - Fast and Accurate](https://www.youtube.com/watch?v=iWha--55Lz0)

Follow the instructions below to run this notebook.

#### Create Virtual Environment
```bash
python -m venv deepspeech-env
# On Windows
.\deepspeech-env\Scripts\activate
# On Linux or macOS
source ./deepspeech-env/bin/activate
```

#### Install Dependencies
```shell
python -m pip install --upgrade pip
pip install numpy pandas notebook deepspeech
```

#### Download the required files
```bash
# Get model files
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

# Get sample audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
# Quote the URL so the shell does not treat "?" as a glob character
wget -O speech.wav "https://github.com/EN10/DeepSpeech/blob/master/woman1_wb.wav?raw=true"

# Unpack audio files
tar -zxvf audio-0.9.3.tar.gz
```
---
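
Before running the cells below, it can help to confirm that the downloads above actually landed in the working directory. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_files` and the `REQUIRED_FILES` list are illustrative, and the file names are copied from the download commands above.

```python
from pathlib import Path

# File names taken from the download commands above.
REQUIRED_FILES = [
    "deepspeech-0.9.3-models.pbmm",
    "deepspeech-0.9.3-models.scorer",
    "speech.wav",
]


def find_missing_files(base_dir: Path, names=REQUIRED_FILES) -> list:
    """Return the names that are missing or empty under base_dir (hypothetical helper)."""
    missing = []
    for name in names:
        path = Path(base_dir) / name
        # An interrupted download can leave a zero-byte file behind
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = find_missing_files(Path.cwd())
if missing:
    print("Missing or empty files:", ", ".join(missing))
else:
    print("All required files are present.")
```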

### Import the requirements

In [None]:
import os
import wave

import numpy as np

from pathlib import Path

from deepspeech import Model
from IPython.display import Audio

### Define Constants

In [None]:
# Base directory
BASE_DIR = Path(os.getcwd()).resolve()

# Model and scorer files
MODEL_FILE = str(BASE_DIR / "deepspeech-0.9.3-models.pbmm")
SCORER_FILE = str(BASE_DIR / "deepspeech-0.9.3-models.scorer")

# Model tuning values
BEAM_WIDTH = 500
LM_ALPHA = 0.93
LM_BETA = 1.18

### Initialize and Tune the Model

In [None]:
# Initialize
model = Model(MODEL_FILE)
model.enableExternalScorer(SCORER_FILE)

# Tune
model.setScorerAlphaBeta(LM_ALPHA, LM_BETA)
model.setBeamWidth(BEAM_WIDTH)

### Function `read_wav_file`
Reads a `.wav` audio file and returns its raw frame buffer together with its sample rate.

In [None]:
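def read_wav_file(file: Path) -> tuple:
    # Convert to str so both Path and str inputs work with wave.open
    with wave.open(str(file), "rb") as wav_file:
        rate = wav_file.getframerate()  # samples per second
        frames = wav_file.getnframes()  # total number of audio frames
        buffer = wav_file.readframes(frames)  # raw PCM bytes
        print(rate)
        print(frames)

    return buffer, rate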
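
`read_wav_file` returns whatever the file contains, so a recording in an unexpected format is passed on unchecked. The sketch below inspects the WAV header first. It assumes the model expects 16 kHz, mono, 16-bit PCM audio; the constants and the helper name `check_wav_format` are illustrative and should be adjusted if your model differs.

```python
import wave

# Assumed input format: 16 kHz, mono, 16-bit PCM. Adjust for other models.
EXPECTED_RATE = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample (16-bit)


def check_wav_format(file) -> list:
    """Return a list of format problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(str(file), "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        width = wav_file.getsampwidth()

    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected {EXPECTED_CHANNELS}")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"sample width is {width} bytes, expected {EXPECTED_SAMPLE_WIDTH}")
    return problems
```

Calling `check_wav_format("speech.wav")` before transcription gives a quick hint when a poor transcript might be caused by the audio format rather than the model.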

### Function `transcribe`
Converts speech to text. Takes the path to a `.wav` audio file as input and returns the transcript.

In [None]:
def transcribe(audio_file: Path) -> str:
    buffer, rate = read_wav_file(audio_file)
    # Interpret the raw bytes as 16-bit signed integer samples before passing them to the model
    data16 = np.frombuffer(buffer, dtype=np.int16)

    return model.stt(data16)
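
The key step in `transcribe` is reinterpreting the raw frame bytes as 16-bit samples. The standalone sketch below illustrates that conversion on a few hand-made samples, without loading a model. The sample values and the `duration_seconds` helper are illustrative only, and the explicit little-endian dtype `"<i2"` is used here because WAV PCM data is little-endian.

```python
import numpy as np

# Four 16-bit little-endian samples packed as raw bytes, the same layout
# that readframes() returns for 16-bit mono PCM audio.
raw = np.array([0, 1000, -1000, 32767], dtype="<i2").tobytes()

# Each sample occupies two bytes, so 8 bytes become 4 samples
samples = np.frombuffer(raw, dtype="<i2")
print(len(raw), len(samples))
print(samples.tolist())


def duration_seconds(num_samples: int, rate: int) -> float:
    """Length of a mono clip in seconds (hypothetical helper)."""
    return num_samples / rate


# One second of audio at 16 kHz is 16000 samples
print(duration_seconds(16000, 16000))
```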

### Listen to the Audio files

In [None]:
Audio(BASE_DIR / "audio" / "4507-16021-0012.wav")

In [None]:
Audio(BASE_DIR / "audio" / "8455-210777-0068.wav")

In [None]:
Audio(BASE_DIR / "audio" / "2830-3980-0043.wav")

In [None]:
Audio(BASE_DIR / "speech.wav")

### Convert Speech to Text using the `transcribe` function

In [None]:
transcribe(str(BASE_DIR / "audio" / "4507-16021-0012.wav"))

In [None]:
transcribe(str(BASE_DIR / "audio" / "2830-3980-0043.wav"))

In [None]:
transcribe(str(BASE_DIR / "audio" / "8455-210777-0068.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech-002.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech-003.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech-004.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech-005.wav"))

In [None]:
transcribe(str(BASE_DIR / "speech-006.wav"))
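
The repeated cells above could also be collapsed into one loop. The following is a sketch under stated assumptions: `transcribe_all` is a hypothetical helper, not part of the notebook or of DeepSpeech, and it takes the transcription function as a parameter so it can be tried without a loaded model.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_all(
    directory: Path,
    transcribe_fn: Callable[[str], str],
    pattern: str = "*.wav",
) -> Dict[str, str]:
    """Run transcribe_fn on every matching file and map file name to transcript."""
    results = {}
    # Sort so the output order is stable across runs
    for wav_path in sorted(Path(directory).glob(pattern)):
        results[wav_path.name] = transcribe_fn(str(wav_path))
    return results
```

In this notebook, `transcribe_all(BASE_DIR / "audio", transcribe)` would cover the sample files, and `transcribe_all(BASE_DIR, transcribe, "speech*.wav")` would cover the `speech` recordings that exist in the working directory.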