<a href="https://colab.research.google.com/github/abalvet/ASR4linguists/blob/main/Transcribe_an_audio_file_with_Whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


This notebook is based on https://www.css.cnrs.fr/whisper-for-transcribing-social-science-interviews/


_Colab is easy to use and easy to learn. You do not need to master Python: just click the "play" button next to each code cell and, where needed, adjust a few parameters._

Author: Yacine Chitour

For further information: https://www.css.cnrs.fr/whisper-for-transcribing-social-science-interviews/

### Enable a GPU to speed up calculations


We are going to use a GPU (*Graphics Processing Unit*) to speed up the transcription of the audio files.

* To do this, open the "Edit" menu in Colab, then click "Notebook settings".
* Then select "GPU" in the "Hardware accelerator" drop-down list.

The following two lines of code let you check that the setting has taken effect. They display the type of GPU in use.

In [1]:
from torch import cuda
cuda.get_device_name(0)


'Tesla T4'
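
If no GPU is attached to the runtime, the call above raises an error instead of printing a device name. The following is a minimal sketch of a more forgiving check. The helper name `describe_gpu` is hypothetical and not part of the original notebook. It relies only on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
def describe_gpu():
    """Return the GPU name, or a short message when no GPU can be used."""
    try:
        import torch  # preinstalled on Colab; may be missing elsewhere
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No GPU detected: check the hardware accelerator setting"
    # Index 0 is the first (and on Colab, normally the only) GPU
    return torch.cuda.get_device_name(0)


print(describe_gpu())
```

On a runtime like the one used here, this should print the same device name as the cell above.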

### Load Whisper

We first run the following line to install the Whisper library from the OpenAI GitHub repository:

In [None]:
!pip install git+https://github.com/openai/whisper.git

### Load FFmpeg

We then install FFmpeg, the free software that Whisper relies on to read audio files:

In [None]:
!sudo apt update && sudo apt install -y ffmpeg  # -y answers the confirmation prompt automatically
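
To confirm that the installation worked before starting a long transcription, one option is to look for the `ffmpeg` executable on the `PATH`. This is a small sketch using only the Python standard library. The function name `ffmpeg_available` is an assumption made for illustration.

```python
import shutil


def ffmpeg_available():
    """Return True when an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


print("FFmpeg found" if ffmpeg_available() else "FFmpeg is missing")
```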

### Transcribe your audio (**⚠ warning: GDPR**)

After uploading your file in the "Files" tab on the left, you can start the transcription. The file can be in one of several formats (m4a, mp3, mp4, mpeg, mpga, wav, webm):


* Whisper comes in several model sizes: the larger the model, the slower the transcription. The lightest ones (`tiny` and `base`) are clearly not good enough for French, but they do the job in English.
* Most of the time, the `medium` transcription model is more than sufficient, even with background noise.
* If you find the beginning of the transcription disappointing (it is displayed progressively in the output window), you can try a larger, and therefore more accurate, model, such as `large` or `large-v2`. Transcription then takes about twice as long as with the `medium` model.

If the GPU is enabled, the `large-v2` model is a good option, fast and efficient.

The model is chosen with the `--model` parameter.


In [None]:
!whisper "/content/enrgistrement chez Apple Store.wav" --model large-v2 # Remember to change the name and format of the file
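
File names that contain spaces, like the one above, must be quoted on the command line. As a sketch, the same call can be assembled in Python as a list of arguments, which avoids quoting mistakes. The helper `build_whisper_command` and the file name `/content/my interview.wav` are hypothetical. The flags `--model`, `--task`, `--output_dir` and `--language` are options of the `whisper` command-line tool.

```python
import shlex


def build_whisper_command(audio_path, model="medium", language=None,
                          task="transcribe", output_dir="."):
    """Return the whisper command-line call as a list of arguments."""
    command = ["whisper", audio_path, "--model", model,
               "--task", task, "--output_dir", output_dir]
    if language is not None:
        # Without --language, whisper detects the language from the audio
        command += ["--language", language]
    return command


command = build_whisper_command("/content/my interview.wav",
                                model="large-v2", language="fr")
# shlex.join shows the command as it would be typed in a shell
print(shlex.join(command))
```

In a notebook, the list can then be passed to `subprocess.run(command, check=True)` instead of typing the `!whisper` line by hand.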

### Getting the output text


Just download the `.txt` file with the same name as the audio file. It appears automatically in the "Files" tab on the left of the notebook.
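
The name of the transcript can also be derived in code. The sketch below assumes that Whisper writes its output into the output directory (the current directory by default) under the audio file's base name. The helper `transcript_path` and the example file name are hypothetical.

```python
from pathlib import Path


def transcript_path(audio_path, output_dir="."):
    """Return the expected location of the .txt transcript for an audio file."""
    # Keep the base name of the audio file and swap its extension for .txt
    return Path(output_dir) / (Path(audio_path).stem + ".txt")


print(transcript_path("/content/audio.mp3", output_dir="/content"))
```

On Colab, the resulting path can be handed to `google.colab.files.download` to save the transcript to your computer.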

### Additional feature: translate audio into English

Whisper is a transcription tool, but it can also translate speech [from a large number of languages](https://github.com/openai/whisper) (Spanish, Arabic, Ukrainian, Swedish, Hindi, _etc._) into English. To do so, add the parameter `--task translate`. The `--language` parameter declares the language spoken in the audio; it does not select the language of the output text. For example, to turn an audio file in French into a text in English:

In [None]:
!whisper "audio.mp3" --model medium --language fr --task translate

In [None]:
!pip install faster-whisper

In [None]:
!pip install -U whisper-ctranslate2

In [None]:
!whisper-ctranslate2 /content/enregistrement-chez-Apple-Store.wav --model large-v2 --language French --task transcribe --initial_prompt "ben c'ét- c'était pas mal  quoi  euh  eeeeet  bah  bon  donc  en fait  beh  mmmh  pff  je sais pas  je dis ça je dis rien"  --word_timestamps True -o repertoire_destination

In [None]:
!whisper-ctranslate2 /content/enregistrement-chez-Apple-Store.wav --model large-v3 --language French --task transcribe --initial_prompt "ben c'ét- c'était pas mal  quoi  euh  eeeeet  bah  bon  donc  en fait  beh  mmmh  pff  je sais pas"  --word_timestamps True -o medium

In [12]:
!whisper-ctranslate2 /content/enregistrement-chez-Apple-Store.wav --model large-v3 --language French --task transcribe --initial_prompt "ben euh  eeeeet  bah  bon  donc  en fait  beh  mmmh  pff"  --word_timestamps True -o large-v3b

Detected language 'French' with probability 1.000000
[00:06.420 --> 00:07.620]  à
[00:07.620 --> 00:16.940]  on s'est fait
[00:20.180 --> 00:21.380]  et
[00:21.380 --> 00:21.960]  et
[00:21.960 --> 00:22.300]  et
[00:22.300 --> 00:22.400]  et
[00:22.400 --> 00:22.460]  et
[00:22.460 --> 00:22.560]  et
[00:22.560 --> 00:24.360]  et
[00:31.360 --> 00:32.760]  d
[00:34.960 --> 00:36.360]  d
[00:36.360 --> 00:37.420]  ce
[00:37.420 --> 00:38.000]  or
[00:38.000 --> 00:38.020]  s
[00:44.260 --> 00:45.660]  hu
[00:52.900 --> 00:54.300]  tu
[00:54.300 --> 00:54.320]  tu
[00:54.320 --> 00:54.340]  tu
[01:21.720 --> 01:23.120]  Merci.
[01:30.980 --> 01:32.720]  Je ne sais pas.
[02:08.860 --> 02:09.220]  Bonsoir.
[02:09.440 --> 02:09.800]  Bonsoir.
[02:09.880 --> 02:10.140]  Bienvenue.
[02:11.240 --> 02:16.400]  En fait, je dois prendre un cadeau pour ma copine en fin de Valentin et je ne sais pas
[02:16.400 --> 02:17.040]  quoi prendre en tête.
[02:17.740 --> 02:20.860]  Un Iphone ou un Ipod ou
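
Output such as the above can be turned into structured data for further analysis, for instance to spot very short repeated segments. The following is a minimal sketch that parses lines of the form `[start --> end]  text`. The names `parse_segments` and `to_seconds` are hypothetical, and the sample lines are copied from the output shown above.

```python
import re

# Matches lines such as "[00:06.420 --> 00:07.620]  text"
LINE_RE = re.compile(r"^\[(?P<start>[\d:.]+) --> (?P<end>[\d:.]+)\]\s*(?P<text>.*)$")


def to_seconds(stamp):
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' into a number of seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_segments(output):
    """Return a list of (start, end, text) tuples, skipping other lines."""
    segments = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            segments.append((to_seconds(match["start"]),
                             to_seconds(match["end"]),
                             match["text"]))
    return segments


sample = """Detected language 'French' with probability 1.000000
[00:06.420 --> 00:07.620]  à
[02:08.860 --> 02:09.220]  Bonsoir."""

for start, end, text in parse_segments(sample):
    print(f"{start:8.2f} {end:8.2f} {text}")
```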

In [14]:
!whisper "/content/enregistrement-chez-Apple-Store.wav" --model large-v3 -o whisper-large-v3

Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: French
[00:00.000 --> 00:02.000]  ...
[00:02.000 --> 00:04.000]  ...
[00:04.000 --> 00:06.000]  ...
[00:06.000 --> 00:08.000]  ...
[00:08.000 --> 00:10.000]  ...
[00:10.000 --> 00:12.000]  ...
[00:12.000 --> 00:14.000]  ...
[00:14.000 --> 00:16.000]  ...
[00:16.000 --> 00:18.000]  ...
[00:18.000 --> 00:20.000]  ...
[00:20.000 --> 00:22.000]  ...
[00:22.000 --> 00:24.000]  ...
[00:24.000 --> 00:26.000]  ...
[00:26.000 --> 00:28.000]  ...
[00:28.000 --> 00:30.000]  ...
[00:30.000 --> 00:32.000]  ...
[00:32.000 --> 00:34.000]  ...
[00:34.000 --> 00:36.000]  ...
[00:36.000 --> 00:38.000]  ...
[00:38.000 --> 00:40.000]  ...
[00:40.000 --> 00:42.000]  ...
[00:42.000 --> 00:44.000]  ...
[00:44.000 --> 00:46.000]  ...
[00:46.000 --> 00:48.000]  ...
[00:48.000 --> 00:50.000]  ...
[00:50.000 --> 00:52.000]  ...
[00:52.000 --> 00:54.000]  ...
[00:54.000 --> 00:56.000]  ...
[00:56.000 -