# Train your first 🐸 STT model 💫

👋 Hello and welcome to Coqui (🐸) STT 

The goal of this notebook is to show you a **typical workflow** for **training** and **testing** an STT model with 🐸.

Let's train a very small model on a very small amount of data so we can iterate quickly.

In this notebook, we will:

1. Download data and format it for 🐸 STT.
2. Configure the training and testing runs.
3. Train a new model.
4. Test the model and display its performance.

So, let's jump right in!

*PS - If you just want a working, off-the-shelf model, check out the [🐸 Model Zoo](https://www.coqui.ai/models)*

In [1]:
! pip install coqui_stt_training

Collecting coqui_stt_training
  Downloading coqui_stt_training-1.4.0-py3-none-any.whl (94 kB)
Collecting bs4
  Using cached bs4-0.0.1.tar.gz (1.1 kB)
  Preparing metadata (setup.py) ... done
Collecting coqpit
  Using cached coqpit-0.0.17-py3-none-any.whl (13 kB)
Collecting protobuf<=3.20.1
  Downloading protobuf-3.20.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
Collecting tensorflow==1.15.4
  Downloading tensorflow-1.15.4-cp37-cp37m-manylinux2010_x86_64.whl (110.5 MB)

Collecting urllib3>=1.21.1
  Using cached urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
Collecting pyparsing>=2.0.3
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting Pillow>=4.1.1
  Downloading Pillow-9.4.0-cp37-cp37m-manylinux_2_28_x86_64.whl (3.4 MB)
Collecting pyjwt<2.5.0,>=2.4.0
  Downloading PyJWT-2.4.0-py3-none-any.whl (18 kB)
Collecting furl>=2.0.0
  Downloading furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting pathlib2>=2.3.0
  Downloading pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting charset-normalizer<4,>=2
  Downloading charset_normalizer-3.0.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17

## ✅ Download & format sample data for English

**First things first**: we need some data.

We're training a Speech-to-Text model, so we need some _speech_ and we need some _text_. Specifically, we want _transcribed speech_. Let's download an English audio file and its transcript and then format them for 🐸 STT.

🐸 STT expects to find information about your data in a CSV file, where each line contains:

1. the **path** to an audio file
2. the **size** of that audio file
3. the **transcript** of that audio file.

Formatting the audio and transcript isn't too difficult in this case. We define `download_sample_data()` which does all the work. If you have a custom dataset, you will want to write a custom data importer.
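If you do end up writing your own importer, its core job is to produce that three-column CSV. Here is a minimal sketch, assuming you already have a list of `(wav_path, transcript)` pairs. The helper name `write_stt_csv` is made up for this example and is not part of 🐸 STT; the header names match the `train.csv` printed later in this notebook.

```python
import csv
import os


def write_stt_csv(samples, csv_path):
    """Write (wav_path, transcript) pairs to a three-column CSV.

    `samples` is an iterable of (path to a WAV file, transcript string).
    This is a hypothetical helper for illustration, not a 🐸 STT function.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in samples:
            # The size column is the file size in bytes, read from disk.
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])


# Example usage (paths are placeholders):
# write_stt_csv([("clips/a.wav", "one two"), ("clips/b.wav", "six")], "train.csv")
```

A real importer usually also converts audio to the expected sample rate and normalizes transcripts, which this sketch leaves out.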

**Second things second**: we want an alphabet. The output layer of a typical* 🐸 STT model represents letters in the alphabet. Let's download an English alphabet from Coqui and use that.

*_If you are working with languages with large character sets (e.g. Chinese), you can set `bytes_output_mode=True` instead of supplying an `alphabet.txt` file. In this case, the output layer of the STT model will correspond to individual UTF-8 bytes instead of individual characters._
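Because the output layer is tied to the alphabet, it can help to check that every character in your transcripts actually appears in the alphabet file before you train. The sketch below is one way to do that, assuming the file format described in the comments of `en_alphabet.txt` (one label per line, `#` starts a comment, `\#` escapes a literal `#`). The function names `load_alphabet` and `unknown_characters` are invented for this example.

```python
def load_alphabet(path):
    """Return the set of labels found in an alphabet file (one label per line)."""
    labels = set()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if line.startswith("\\#"):
                labels.add("#")  # escaped '#' used as a label
            elif line.startswith("#"):
                continue  # comment line
            elif line:
                labels.add(line)  # includes the single-space label
    return labels


def unknown_characters(transcript, labels):
    """Return the sorted characters of `transcript` that are not in `labels`."""
    return sorted(set(transcript) - labels)


# Example usage (path is a placeholder):
# labels = load_alphabet("../utils/en_alphabet.txt")
# print(unknown_characters("three zero seven", labels))
```

An empty result means the transcript only uses characters the model can output.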

### 👀 Take a look at the data

In [1]:
# Print the training CSV; the context manager closes the file afterwards
with open("train.csv", "r") as csv_file:
    print(csv_file.read())

wav_filename,wav_filesize,transcript
../data/TRAIN/MAN/GT/3O758A.wav,68028,three zero seven five eight
../data/TRAIN/MAN/GT/6A.wav,36078,six
../data/TRAIN/MAN/GT/12ZA.wav,46728,one two zero
../data/TRAIN/MAN/GT/153Z4A.wav,66388,one five three zero four
../data/TRAIN/MAN/GT/17914A.wav,61064,one seven nine one four
../data/TRAIN/MAN/GT/1A.wav,30344,one
../data/TRAIN/MAN/GT/1O21928A.wav,81134,one zero two one nine two eight
../data/TRAIN/MAN/GT/1ZA.wav,48366,one zero
../data/TRAIN/MAN/GT/2129883A.wav,90556,two one two nine eight eight three
../data/TRAIN/MAN/GT/2598A.wav,49596,two five nine eight
../data/TRAIN/MAN/GT/26A.wav,38946,two six
../data/TRAIN/MAN/GT/27O39A.wav,71714,two seven zero three nine
../data/TRAIN/MAN/GT/2A.wav,31164,two
../data/TRAIN/MAN/GT/2B.wav,29934,two
../data/TRAIN/MAN/GT/2O14A.wav,50004,two zero one four
../data/TRAIN/MAN/GT/316A.wav,54920,three one six
../data/TRAIN/MAN/GT/334OA.wav,54920,three three four zero
../data/TRAIN/MAN/GT/38116A.wav,77038,three eight on

In [3]:
# Print the alphabet file; the context manager closes the file afterwards
with open("../utils/en_alphabet.txt", "r") as alphabet_file:
    print(alphabet_file.read())

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
'
# The last (non-comment) line needs to end with a newline.



## ✅ Configure & set hyperparameters

Coqui STT comes with a long list of hyperparameters you can tweak. We've set default values, but you will often want to set your own. You can use `initialize_globals_from_args()` to do this. 

You must **always** configure the paths to your data, and you must **always** configure your alphabet. Additionally, here we show how you can specify the size of the hidden layers (`n_hidden`), set the number of epochs to train for (`epochs`), and initialize a new model from scratch (`load_train="init"`).

In [4]:
from coqui_stt_training.util.config import initialize_globals_from_args

initialize_globals_from_args(
    alphabet_config_path="../utils/en_alphabet.txt",
    checkpoint_dir="../models/ckpt_dir",
    train_files=["train.csv"],
    test_files=["test.csv"],
    load_train="init",
    n_hidden=50,
    epochs=20,
)



2023-03-03 15:44:05.907646: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-03-03 15:44:05.910910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1999965000 Hz
2023-03-03 15:44:05.911442: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3e53970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-03-03 15:44:05.911470: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version


### 👀 View all Config settings

In [5]:
from coqui_stt_training.util.config import Config

# Take a peek at the entire Config
print(Config.to_json())

{
    "train_files": [
        "train.csv"
    ],
    "dev_files": [],
    "test_files": [
        "test.csv"
    ],
    "metrics_files": [],
    "auto_input_dataset": "",
    "vocab_file": "",
    "read_buffer": 1048576,
    "feature_cache": "",
    "cache_for_epochs": 0,
    "shuffle_batches": false,
    "shuffle_start": 1,
    "shuffle_buffer": 1000,
    "feature_win_len": 32,
    "feature_win_step": 20,
    "audio_sample_rate": 16000,
    "normalize_sample_rate": true,
    "augment": null,
    "epochs": 20,
    "dropout_rate": 0.05,
    "dropout_rate2": 0.05,
    "dropout_rate3": 0.05,
    "dropout_rate4": 0.0,
    "dropout_rate5": 0.0,
    "dropout_rate6": 0.05,
    "relu_clip": 20.0,
    "beta1": 0.9,
    "beta2": 0.999,
    "epsilon": 1e-08,
    "learning_rate": 0.001,
    "train_batch_size": 1,
    "dev_batch_size": 1,
    "test_batch_size": 1,
    "export_batch_size": 1,
    "skip_batch_test": false,
    "inter_op_parallelism_threads": 0,
    "intra_op_parallelism_threads": 0,

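Since the config dump is plain JSON, you can parse it to pull out just the settings you care about. The sketch below pastes in a small excerpt of the values printed above so that it runs on its own; in the notebook you could parse `Config.to_json()` directly. It also converts the feature window settings to sample counts, on the assumption (not stated in the output) that `feature_win_len` and `feature_win_step` are given in milliseconds.

```python
import json

# Excerpt of the JSON printed above, pasted in so the sketch is self-contained.
config_json = """
{
    "feature_win_len": 32,
    "feature_win_step": 20,
    "audio_sample_rate": 16000,
    "epochs": 20,
    "learning_rate": 0.001,
    "train_batch_size": 1
}
"""

config = json.loads(config_json)

# Assumption: window length and step are expressed in milliseconds.
samples_per_ms = config["audio_sample_rate"] / 1000
win_samples = int(config["feature_win_len"] * samples_per_ms)
step_samples = int(config["feature_win_step"] * samples_per_ms)

print("epochs:", config["epochs"])
print("window length in samples:", win_samples)
print("window step in samples:", step_samples)
```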
## ✅ Train a new model

Let's kick off a training run 🚀🚀🚀 (using the configuration you set above).

This notebook should work on either a GPU or a CPU. However, if you're running it on a machine with _multiple_ GPUs, we want to use only one of them, because the sample dataset is too small to split across multiple GPUs.

In [6]:
import os
from coqui_stt_training.train import train

# Use at most one GPU: only device 0 is visible to TensorFlow
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

train()

I Performing dummy training to check for memory problems.
I If the following process crashes, you likely have batch sizes that are too big for your available system memory (or GPU memory).
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 3 | Loss: 495.395294     
--------------------------------------------------------------------------------
I FINISHED optimization in 0:00:00.893648
I Dummy run finished without problems, now starting real training process.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:05:59 | Steps: 5958 | Loss: 7.916502    

I FINISHED optimization in 0:05:59.556968

Process ForkPoolWorker-16:
Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

## ✅ Test the model

We made it! 🙌

Let's kick off the testing run, which displays performance metrics.
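Word error rate (WER) is one common way to score speech recognition output: the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the code 🐸 STT uses internally, and the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Example usage:
# word_error_rate("one two three", "one three")  # one deleted word out of three
```

A WER of 0.0 means every word matched; values above 1.0 are possible when the hypothesis contains many extra words.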

We're committing the cardinal sin of ML 😈 (testing on our training data), so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇

You can see from the test output that our tiny model has overfit to the data and essentially memorized it.

When you start training your own models, make sure your testing data doesn't include your training data 😅

In [None]:
from coqui_stt_training.evaluate import test

test()
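When you move on to your own data, one simple way to keep test data out of the training set is to split a single CSV into two disjoint files before training. The sketch below shows a random split; the function name `split_csv`, the 20% test fraction, and the fixed seed are all choices made for this example, not 🐸 STT defaults.

```python
import csv
import random


def split_csv(src_csv, train_csv, test_csv, test_fraction=0.2, seed=0):
    """Randomly split the rows of `src_csv` into disjoint train and test CSVs.

    Returns (number of training rows, number of test rows).
    """
    with open(src_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    parts = {test_csv: rows[:n_test], train_csv: rows[n_test:]}

    for path, part in parts.items():
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(part)
    return len(parts[train_csv]), len(parts[test_csv])


# Example usage (file names are placeholders):
# split_csv("all.csv", "train.csv", "test.csv")
```

Note that a row-level split like this does not keep speakers separate; if the same speaker appears in both files, test results may still look better than they would on unseen voices.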