# Transcription

In [1]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import os


device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

print(sample)

  from .autonotebook import tqdm as notebook_tqdm
2024-06-29 12:17:22.543359: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-29 12:17:22.574547: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'path': '0d38672e0bbdbdc460af55b8bb84a15b2730db2819f2af64f9c777d4d586f2de', 'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00024414, 0.00048828,
       0.0005188 ]), 'sampling_rate': 16000}


In [2]:
import librosa as lr

DATA_PATH = "/home/dvdblk/dev/swisshacks2024/data/jb-swisshacks2024"

jb_sample = DATA_PATH + "/audio_data/XEA040Q8N9.wav"
audio, sr = lr.load(jb_sample, sr=16_000)
print(audio.shape, sr)

(213160,) 16000


In [3]:
result = pipe(audio, generate_kwargs={"language": "english"})
print(result["text"])

You have passed language=english, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=english.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 Hello, I am Noya Chimurman. Social Security Number 7667-587-79988 account number ZR8097 I want to invest in high-tech security systems for my property.


In [4]:
# prints out sample rates per file
for f in os.listdir(DATA_PATH + "/audio_data"):
    if f.endswith(".wav"):
        print(f, lr.get_samplerate(DATA_PATH + "/audio_data/" + f))

XEA040Q8N9.wav 44100
35UVJCB74Q.wav 48000
5L4PYZ36G0.wav 24000
BJ53WB0WQB.wav 48000
4UYWOVMYD9.wav 44100
VWEQSW9GQY.wav 48000
OFML4KZCAE.wav 44100
32LZR8ZQYK.wav 44100
ZGZHPG1TS8.wav 48000
BWGSZNP0LF.wav 44100
Z2UAISMWW6.wav 48000
BO81DNUC98.wav 44100
LF0QTFG29B.wav 44100
XZUKKHWKB8.wav 48000
JBZJ2LV64J.wav 48000
G11VNYE1HH.wav 48000
IAGY5YUR9E.wav 44100
43QRG7SY14.wav 48000
TBUUEZXSLP.wav 48000
1KMDDIB0WA.wav 44100
S0G2JOY8VT.wav 24000
76X65A5PP1.wav 16000
HBDLHYEA1L.wav 48000
C4C384534S.wav 44100
DRAD23MMMA.wav 44100
6G186D1H5K.wav 48000
NPVTQJD50W.wav 48000
LL7V1S0QG3.wav 48000
DK89JDSW3X.wav 48000
L54SD4SCXE.wav 48000
GAJOY10X4A.wav 44100
AHQQ4HWF9Z.wav 48000
KICNKIP98W.wav 44100
XK7B8CDK0T.wav 48000
IEMF4VU2VH.wav 48000
5TKRD0UET0.wav 48000
5B0N0KF0OZ.wav 44100
DX7RS4N7IC.wav 44100
SYQYA98A3O.wav 48000
FPMY3OD663.wav 44100
H6XNGJ7SCM.wav 44100
0P2ZQ4BASS.wav 44100
72K6TSQ829.wav 44100
HCRK8WNTRJ.wav 44100
EL568TREPD.wav 44100
TQEMWEEP28.wav 48000
DTU78DE6VZ.wav 24000
N44HQ30PQE.wa

In [5]:
transcriptions = {}

# do it for all audio samples in the DATA_PATH
for f in os.listdir(DATA_PATH + "/audio_data"):
    if f.endswith(".wav"):
        file_path = DATA_PATH + "/audio_data/" + f
        audio, sr = lr.load(file_path, sr=None)
        result = pipe(audio, generate_kwargs={"language": "english"})
        print(result["text"])

        # remove .wav and use as key
        key = f[:-4]
        transcriptions[key] = result["text"]


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


 Hello, I'm a little girl from the U.S. I'm from the U.S. I'm from the U.S. I'm from the U.S. I'm from the U.S. I'm from the U.S. I'm from the U.S. Bye bye.
 I am a human being. I am a human being. I am a human being. You are a material heart between the earth and the soul. And you are sure and unquiet. Just like the people of the world.
 Hi, I am So-Yoon Kim. Without and holding an apprenticeship in biotechnology, I need to transfer some funds to an anonymous account. Can you ensure this is done without any documentation?


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


 Hello, this is a video of a new martial art game I'm calling out of the U.S. Thank you.
 Good morning, my name is Elena Koi, I'm a celebrity chef, Social Security Number 756.3654.2847266. to 2 at 4 7 26 6. Please set up a new voice call to Endone Alice. I need this done quickly.
 Hello guys, welcome to the wellness and happiness real life stream. I hope this is gonna be simple and easy You are not, forborn, pervasive. To ascertain if I am real or just too natural to suggest a spirit to be.


Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


 The woman who you had to stay with a picture in a world where two swords and two swords had been shined under my head. I'm the world to my beast, the frogman. I'm a little fool, but I'm a little.
 Hi, I'm Sophia Omega. My name is Petronio Kuliski. I'm a social security member. 756-1021-0. Gate 403-33. I'm a fat 33. I can't stop. I'm a fat 33. I'm a fat 33.


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


 Hi, this is Finn Wolf. Welcome to my channel. 9, and the 6, 6, 6, 7, 4, 6, 4, 7, 6, 4, 4, 1, 4, 8, 8, 8, act, act. A magic to do is for a little soul. To give a soul a call to a great lesson on a fruit soul. Thank you.


In [None]:
len(transcriptions)

400

In [None]:
import csv

# Step 2: Open a new CSV file in write mode
with open(DATA_PATH + "/transcriptions_correct_sr.csv", "w", newline='') as file:
    # Step 3: Create a CSV writer object
    writer = csv.writer(file)

    # Step 4: Write the header row (optional)
    writer.writerow(['Audio File', 'Transcription'])

    # Step 5: Iterate through the dictionary and write rows
    for key, value in transcriptions.items():
        writer.writerow([key, value])