
pisets

This project provides a Python library and service for automatic speech recognition and transcription in Russian and English.

You can generate subtitles in the SubRip format for any audio or video in a format supported by the FFmpeg software.

The "pisets" is Russian word (in Cyrillic, "писец") for denoting a person who writes down the text, including dictation (the corresponding English term is "scribe"). Thus, if you need to make a text transcript of an audio recording of a meeting or seminar, then the artificial "Pisets" will help you.

Installation

This project uses deep learning, so a key dependency is a deep learning framework. I prefer PyTorch, and you need to install a CPU- or GPU-based build of PyTorch ver. 2.0 or later. You can see a more detailed description of the dependencies in requirements.txt.

Other important dependencies are:

  • KenLM: statistical N-gram language model inference code;
  • FFmpeg: software for handling video, audio, and other multimedia files.

These dependencies are not only "pythonic". Firstly, you have to build the KenLM C++ library from sources according to this recommendation: https://github.com/kpu/kenlm#compiling (it is easy for any Linux user, but it can be a problem for Windows users, because KenLM is not fully cross-platform). Secondly, you have to install FFmpeg on your system as described in these instructions: https://ffmpeg.org/download.html.

Also, for installation you need Python 3.9 or later. I recommend using a new Python virtual environment, which can be created with Anaconda or venv. To install this project into the selected virtual environment, activate this environment and run the following commands in the terminal:

git clone https://github.com/bond005/pisets.git
cd pisets
python -m pip install -r requirements.txt

To check that the installation works and the environment is set up correctly, you can run the unit tests:

python -m unittest

Usage

Command prompt

Using the Pisets is very simple. You have to run the following command in your command prompt:

python speech_to_srt.py \
    -i /path/to/your/sound/or/video.m4a \
    -o /path/to/resulted/transcription.srt \
    -lang ru \
    -r \
    -f 50

The 1st argument -i specifies the name of the source audio or video in any format supported by FFmpeg.

The 2nd argument -o specifies the name of the resulting SubRip file into which the recognized transcription will be written.

The other arguments are not required; if you do not specify them, their default values will be used. Still, I think their description matters for any user. -lang specifies the language to use. You can select Russian (ru, rus, russian) or English (en, eng, english). The default language is Russian.

-r enables smarter rescoring of speech hypotheses with a large language model such as T5. This option is available for Russian only, but it is important for good quality of the generated transcription. Thus, I highly recommend using the -r option if you want to transcribe a Russian speech signal.

-f sets the maximum duration of a sound frame (in seconds). The Pisets is designed so that a very long audio signal is divided into smaller sound frames, these frames are recognized independently, and the recognition results are glued together into a single transcription. This procedure is necessary because of the architecture of the acoustic neural network, and this argument determines the maximum duration of such a frame. The default value is 50 seconds, and I don't recommend changing it.
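
The snippet below is a minimal sketch of this framing idea, not the Pisets' actual implementation: the waveform is naively cut into chunks of at most 50 seconds, each chunk is recognized independently, and the texts are concatenated. The sample rate and the recognize_frame callable are assumptions for illustration only.

import numpy as np

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio

def split_into_frames(waveform: np.ndarray, max_frame_sec: float = 50.0) -> list:
    """Cut a 1-D waveform into consecutive chunks of at most max_frame_sec seconds."""
    frame_len = int(max_frame_sec * SAMPLE_RATE)
    return [waveform[start:start + frame_len]
            for start in range(0, len(waveform), frame_len)]

def transcribe_long_audio(waveform: np.ndarray, recognize_frame) -> str:
    """recognize_frame is any callable that maps a short waveform to its text."""
    texts = [recognize_frame(frame) for frame in split_into_frames(waveform)]
    return ' '.join(text for text in texts if text).strip()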

If your computer has a CUDA-compatible GPU, and your PyTorch has been correctly installed for this GPU, then the Pisets will transcribe your speech very quickly. The real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25 (it depends on the concrete GPU type); for example, at xRT = 0.2 a one-hour recording is transcribed in about 12 minutes. But if you use the CPU only, then the Pisets will calculate your speech transcription significantly more slowly (xRT is approximately 1.0 - 1.5).

Docker and REST-API

Installation of the Pisets can be difficult, especially for Windows users (in Linux it is trivial). Accordingly, in order to simplify the installation process and hide all the difficulties from the user, I suggest using a Docker container that can be deployed and run on any operating system. In this case, audio is submitted for recognition and the transcription results are received by means of the REST API.

You can build the Docker image yourself:

docker build -t bond005/pisets:0.1 .

But the easiest way is to pull the pre-built image from Docker Hub:

docker pull bond005/pisets:0.1

After building (or pulling) the image, you have to run the Docker container:

docker run -p 127.0.0.1:8040:8040 bond005/pisets:0.1

Hurray! The Docker container is ready for use, and the Pisets will transcribe your speech. You can use the Python client for the Pisets service implemented in the script client_ru_demo.py:

python client_ru_demo.py \
    -i /path/to/your/sound/or/video.m4a \
    -o /path/to/resulted/transcription.srt

An even easier option is to use a special virtual machine with the Pisets in Yandex Cloud. This is an example curl command for transcribing your speech with the Pisets on a Unix-like OS:

echo -e $(curl -X POST 178.154.244.147:8040/transcribe -F "audio=@/path/to/your/sound/or/video.m4a" | awk '{ print substr( $0, 2, length($0)-2 ) }') > /path/to/resulted/transcription.srt
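
If you prefer Python to curl, the same request can be sent with the requests library. This is only a sketch (the official client is client_ru_demo.py); it assumes, as the curl example above shows, that the service accepts a multipart field named "audio" at POST /transcribe and returns the SubRip text as a JSON-encoded string.

import requests

def transcribe_via_rest(audio_path: str, srt_path: str,
                        url: str = 'http://178.154.244.147:8040/transcribe') -> None:
    with open(audio_path, 'rb') as audio_file:
        response = requests.post(url, files={'audio': audio_file})
    response.raise_for_status()
    # The response body is a JSON string containing the SubRip text with escaped newlines.
    with open(srt_path, 'w', encoding='utf-8') as srt_file:
        srt_file.write(response.json())

transcribe_via_rest('/path/to/your/sound/or/video.m4a', '/path/to/resulted/transcription.srt')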

Important notes

  1. The Pisets in the abovementioned Docker container currently supports only Russian. If you want to transcribe English speech, then you have to use the command-line tool speech_to_srt.py.

  2. This Docker container, unlike the command-line tool, does not support the GPU.

Models and algorithms

The Pisets transcribes a speech signal in four steps:

  1. The acoustic deep neural network, based on a fine-tuned Wav2Vec2, performs the primary recognition of the speech signal and calculates the probabilities of the recognized letters. The result of this first step is a probability matrix.
  2. The statistical N-gram language model translates the probability matrix into recognized text using a CTC beam search decoder.
  3. The language deep neural network, based on a fine-tuned T5, corrects possible errors and generates the final recognized text in a "pure" form (without punctuation, lowercase only, and so on).
  4. The last component of the "Pisets" places punctuation marks and capital letters.

The first and second steps for English speech are implemented with Patrick von Platen's Wav2Vec2-Base-960h + 4-gram, and Russian speech transcription is based on my Wav2Vec2-Large-Ru-Golos-With-LM.
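
For the Russian model, steps 1 and 2 can be illustrated with the Hugging Face transformers API. The snippet below is a simplified sketch, not the Pisets' own code, and the Hub ID bond005/wav2vec2-large-ru-golos-with-lm is an assumption based on the model name above.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

MODEL_ID = 'bond005/wav2vec2-large-ru-golos-with-lm'  # assumed Hugging Face Hub ID

processor = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)  # feature extractor + tokenizer + KenLM decoder
acoustic_model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

def recognize(waveform, sample_rate: int = 16_000) -> str:
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors='pt')
    with torch.no_grad():
        logits = acoustic_model(**inputs).logits  # step 1: matrix of letter probabilities (as logits)
    # Step 2: CTC beam search decoding with the N-gram language model.
    return processor.batch_decode(logits.numpy()).text[0]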

The third step is not supported for English speech; for Russian speech it is based on my ruT5-ASR.
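
Step 3 can be sketched in the same way with a seq2seq model; again, the Hub ID bond005/ruT5-ASR and the generation settings are assumptions for illustration, not the Pisets' exact code.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

T5_ID = 'bond005/ruT5-ASR'  # assumed Hugging Face Hub ID

t5_tokenizer = AutoTokenizer.from_pretrained(T5_ID)
t5_model = AutoModelForSeq2SeqLM.from_pretrained(T5_ID)

def correct(recognized_text: str) -> str:
    """Rewrite the raw recognition result, fixing probable recognition errors."""
    batch = t5_tokenizer([recognized_text], return_tensors='pt')
    generated = t5_model.generate(**batch, max_length=512)
    return t5_tokenizer.decode(generated[0], skip_special_tokens=True)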

The fourth step is implemented on the basis of the multilingual text enhancement model created by Silero.
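
Step 4 can be reproduced with Silero's text enhancement model. The call below follows the usage documented in the snakers4/silero-models repository and is an assumption about how such a model is used, not an excerpt from the Pisets' code.

import torch

# The hub entry point returns the model itself plus helper objects, including
# the apply_te function that restores punctuation and capitalization.
model, example_texts, languages, punct, apply_te = torch.hub.load(
    repo_or_dir='snakers4/silero-models', model='silero_te')

enhanced_text = apply_te('привет как дела что нового', lan='ru')
print(enhanced_text)  # expected output like: "Привет, как дела? Что нового?"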

My tests show a strong superiority of the recognition system based on the given scheme over Whisper Medium, and a significant superiority over Whisper Large, when transcribing Russian speech. The methodology and test results are openly available.

Also, you can see an independent evaluation of my Wav2Vec2-Large-Ru-Golos-With-LM model (without the T5-based rescorer) on various Russian speech corpora, in comparison with other open Russian speech recognition models: https://alphacephei.com/nsh/2023/01/22/russian-models.html (in Russian).

Contact

Ivan Bondarenko - @Bond_005 - bond005@yandex.ru

License

Distributed under the Apache 2.0 License. See LICENSE for more information.
