# Performing Audio Transcription

This guide will explain how to transcribe audio files using SpeechLine.

First, load your transcription model by passing its Hugging Face model checkpoint into [`Wav2Vec2Transcriber`](../../reference/transcribers/wav2vec2).

In [2]:
from speechline.transcribers import Wav2Vec2Transcriber

transcriber = Wav2Vec2Transcriber("bookbot/wav2vec2-ljspeech-gruut")

Next, convert your input audio file (here, `sample.wav`) into a `Dataset`, as follows:

In [3]:
from datasets import Dataset, Audio

dataset = Dataset.from_dict({"audio": ["sample.wav"]})
dataset = dataset.cast_column("audio", Audio(sampling_rate=transcriber.sampling_rate))
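
If you have several recordings rather than a single `sample.wav`, the same dictionary can hold more than one path. The following is a minimal sketch using only the standard library; the helper name `collect_audio_paths` and the `recordings` directory are hypothetical and not part of SpeechLine.

```python
from pathlib import Path


def collect_audio_paths(directory, pattern="*.wav"):
    """Build a dict with an "audio" list, in the shape used with Dataset.from_dict above."""
    # Sort the paths so the dataset order is stable across runs.
    paths = sorted(str(path) for path in Path(directory).glob(pattern))
    return {"audio": paths}


# Assumed usage, with a hypothetical "recordings" directory:
# dataset = Dataset.from_dict(collect_audio_paths("recordings"))
```

The resulting dataset can then be cast to the `Audio` feature in the same way as the single-file example.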

Once preprocessing is finished, pass the dataset to the transcriber. Here, `output_offsets=True` requests timestamps alongside the transcribed text.

In [4]:
phoneme_offsets = transcriber.predict(dataset, output_offsets=True)

The output format of the transcription model is shown below. It is a list with one entry per input audio file; each entry is a list of dictionaries containing the transcribed phoneme (`text`) and its `start_time` and `end_time` stamps, in seconds.

In [5]:
phoneme_offsets

[[{'end_time': 0.02, 'start_time': 0.0, 'text': 'ɪ'},
  {'end_time': 0.3, 'start_time': 0.26, 'text': 't'},
  {'end_time': 0.36, 'start_time': 0.34, 'text': 'ɪ'},
  {'end_time': 0.44, 'start_time': 0.42, 'text': 'z'},
  {'end_time': 0.62, 'start_time': 0.5, 'text': 'noʊt'},
  {'end_time': 0.78, 'start_time': 0.76, 'text': 'ʌ'},
  {'end_time': 0.94, 'start_time': 0.92, 'text': 'p'}]]
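
To make this structure easier to scan, the offsets can be flattened into rows with a computed duration. The sketch below copies the sample output above into a literal so that it runs on its own; the helper name `summarize` is hypothetical and not part of SpeechLine.

```python
# Sample offsets copied from the output above (one inner list per audio file).
phoneme_offsets = [[
    {"end_time": 0.02, "start_time": 0.0, "text": "ɪ"},
    {"end_time": 0.3, "start_time": 0.26, "text": "t"},
    {"end_time": 0.36, "start_time": 0.34, "text": "ɪ"},
    {"end_time": 0.44, "start_time": 0.42, "text": "z"},
    {"end_time": 0.62, "start_time": 0.5, "text": "noʊt"},
    {"end_time": 0.78, "start_time": 0.76, "text": "ʌ"},
    {"end_time": 0.94, "start_time": 0.92, "text": "p"},
]]


def summarize(offsets):
    """Return (text, start, end, duration) tuples for one audio file's offsets."""
    return [
        (
            segment["text"],
            segment["start_time"],
            segment["end_time"],
            # Round to milliseconds to hide floating-point noise.
            round(segment["end_time"] - segment["start_time"], 3),
        )
        for segment in offsets
    ]


# Print one tab-separated row per phoneme of the first audio file.
for text, start, end, duration in summarize(phoneme_offsets[0]):
    print(f"{text}\t{start:.2f}\t{end:.2f}\t{duration:.2f}")
```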

You can manually check the model output by playing a segment (using the start and end timestamps) of your input audio file. 

First, load your audio file.

In [18]:
from pydub import AudioSegment

audio = AudioSegment.from_file("sample.wav")
audio

You can use the following function to play the segment of your audio at a given offset:

In [19]:
def play_segment(offsets, index: int):
    # Offsets are in seconds; pydub slices audio in milliseconds.
    start = offsets[index]["start_time"]
    end = offsets[index]["end_time"]
    print(offsets[index]["text"])
    return audio[start * 1000 : end * 1000]
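
Many of the segments above last only about 20 ms, which can be hard to hear in isolation. One option is to widen each slice by a little padding on both sides. The following is a minimal sketch of the bounds arithmetic only; the helper name `segment_bounds_ms` and the `padding_ms` parameter are assumptions for illustration, not part of SpeechLine or pydub.

```python
def segment_bounds_ms(offset, audio_length_ms, padding_ms=0):
    """Convert one offset (in seconds) to padded, clamped millisecond bounds."""
    start_ms = int(round(offset["start_time"] * 1000)) - padding_ms
    end_ms = int(round(offset["end_time"] * 1000)) + padding_ms
    # Clamp so the padded slice never extends outside the audio.
    return max(0, start_ms), min(audio_length_ms, end_ms)


# Assumed usage with the pydub `audio` loaded earlier
# (len() of an AudioSegment is its duration in milliseconds):
# start_ms, end_ms = segment_bounds_ms(phoneme_offsets[0][1], len(audio), padding_ms=50)
# audio[start_ms:end_ms]
```

Integer bounds also make the millisecond conversion explicit instead of relying on float slice indices.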

Here are some examples of the phoneme segments:

In [20]:
play_segment(phoneme_offsets[0], 0)

ɪ


In [21]:
play_segment(phoneme_offsets[0], 1)

t


In [22]:
play_segment(phoneme_offsets[0], 2)

ɪ


In [23]:
play_segment(phoneme_offsets[0], 3)

z


In [24]:
play_segment(phoneme_offsets[0], 4)

noʊt


In [25]:
play_segment(phoneme_offsets[0], 5)

ʌ


In [26]:
play_segment(phoneme_offsets[0], 6)

p
