# Transcription

In [10]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


# Use the GPU with half precision when available; otherwise fall back to CPU and float32
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

print(sample)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


{'path': '0d38672e0bbdbdc460af55b8bb84a15b2730db2819f2af64f9c777d4d586f2de', 'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00024414, 0.00048828,
       0.0005188 ]), 'sampling_rate': 16000}


In [27]:
import librosa as lr

DATA_PATH = "/home/dvdblk/dev/swisshacks2024/data/jb-swisshacks2024"

jb_sample = DATA_PATH + "/audio_data/XEA040Q8N9.wav"
# Load as mono and resample to 16 kHz, the rate Whisper's feature extractor expects
audio, sr = lr.load(jb_sample, sr=16_000)
print(audio.shape, sr)

(213160,) 16000
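
The printed shape and sampling rate are enough to work out the clip length: number of samples divided by samples per second. A minimal sketch using the values printed above (`duration_seconds` is an illustrative helper, not part of the original notebook):

```python
def duration_seconds(num_samples, sampling_rate):
    """Length of a mono clip in seconds."""
    return num_samples / sampling_rate


# Values taken from the printed output above: (213160,) 16000
print(duration_seconds(213160, 16_000))  # 13.3225
```

At roughly 13 seconds, this clip is shorter than the `chunk_length_s=30` window configured for the pipeline.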


In [16]:
result = pipe(audio, generate_kwargs={"language": "english"})
print(result["text"])

You have passed language=english, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=english.


 Hello, I am Noya Chimurman. Social Security Number 7667-587-79988 account number ZR8097 I want to invest in high-tech security systems for my property.
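
Because the pipeline was created with `return_timestamps=True`, the result should also contain a `chunks` list next to `text`, where each entry has a `timestamp` tuple `(start, end)` and the `text` of that segment. The sketch below formats such a list for inspection. `format_chunks` is a hypothetical helper, and the handling of a missing timestamp is a defensive assumption rather than something observed in this notebook.

```python
def format_chunks(result):
    """Return one '[start -> end] text' line per timestamped chunk."""

    def fmt(t):
        # Assumption: a timestamp may be None, e.g. if the final segment is cut off
        return f"{t:6.2f}" if t is not None else "     ?"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)} -> {fmt(end)}] {chunk['text'].strip()}")
    return lines
```

With the `result` from the cell above, `print("\n".join(format_chunks(result)))` would list the segments one per line.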


In [17]:
import os

transcriptions = {}

# Transcribe every .wav file in DATA_PATH
for f in os.listdir(DATA_PATH):
    if f.endswith(".wav"):
        audio, sr = lr.load(os.path.join(DATA_PATH, f), sr=16_000)
        result = pipe(audio, generate_kwargs={"language": "english"})
        print(result["text"])

        # Key each transcription by the file name without its .wav extension
        key = os.path.splitext(f)[0]
        transcriptions[key] = result["text"]




 Hello, I am Noya Chimurman. Social Security Number 7667-587-79988 account number ZR8097 I want to invest in high-tech security systems for my property.
 Hello, I am Elena Popova, a medical doctor with the account ZR4012. Can we discuss the possibility of not getting private research funds through traditional methods?
 Hi, I am So-Yoon Kim, widowed and holding an apprenticeship in biotechnology. I need to transfer some funds to an anonymous account. Can you ensure this is done without any documentation?
 Hello, this is Chen Wei, a renowned neurosurgeon and currently divorced. Could you facilitate an urgent large investment into an experimental neurotechnology startup?
 Good morning. My name is Amina Khoury. I'm a celebrity chef. Social Security number 756.3654.2847266. Please set up a new restaurant chain under an alias. I need this done discreetly.
 Hello, this is Elena Popova. My relationship manager is Maria Smith and my birthday is 28 February 2000. Could you facilitate an anonymou

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


 Good morning, I'm Jorge Castillo, with a Bachelor in Environmental Science and Tax Residency in Indonesia. Can you arrange for a significant amount to be moved discreetly? I need to ensure there are no records.
 Hi, I'm Sophia Almeida, managed by Julia Kowalski with Social Security number 756.0210, date 4033. Could you ensure a recent large transfer is not tracked?
 Hello, this is Finn Marfi with a net worth of 400 million and a social security number 756-5765-39-5888. I need to transfer a large sum to an offshore account without raising any flux. Can you ensure this transaction is completely untraceable?
 Hi, this is Lara Vandenberg, born on 13th January 1983, with account number ZR9900. I need to liquidate some assets and transfer the proceeds to an anonymous account. Can you ensure this is done without any records?
 Hi, I'm Noah Zimmerman, married with account number ZR2094. Could you assist in setting up a joint investment portfolio with my spouse?
 Good morning. I'm Liam O'Brien,
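
The warning in the output above appears because the loop calls the pipeline once per file. One possible alternative is to hand the pipeline all clips in a single call so it can batch them. The sketch below assumes the pipeline accepts a list of inputs together with a `batch_size` argument and returns one result per input in the same order. `transcribe_folder` and its `load_audio` parameter are hypothetical names introduced here for illustration.

```python
import os


def transcribe_folder(pipe, load_audio, folder, batch_size=16):
    """Transcribe every .wav file in `folder` with a single pipeline call."""
    names = sorted(f for f in os.listdir(folder) if f.endswith(".wav"))
    # All clips are loaded up front, so memory use grows with the folder size
    clips = [load_audio(os.path.join(folder, name)) for name in names]
    results = pipe(
        clips,
        batch_size=batch_size,
        generate_kwargs={"language": "english"},
    )
    # Key each transcription by the file name without its extension
    return {
        os.path.splitext(name)[0]: res["text"]
        for name, res in zip(names, results)
    }
```

In this notebook it could be called as `transcribe_folder(pipe, lambda p: lr.load(p, sr=16_000)[0], DATA_PATH)`. For larger collections, feeding the pipeline a generator or a dataset instead of a list would avoid holding every clip in memory at once.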

In [20]:
len(transcriptions)

400

In [28]:
import csv

# Write the transcriptions to a CSV file, one row per audio file
with open(DATA_PATH + "/transcriptions.csv", "w", newline='') as file:
    writer = csv.writer(file)

    # Header row
    writer.writerow(['Audio File', 'Transcription'])

    # One row per (file name, transcription) pair
    for key, value in transcriptions.items():
        writer.writerow([key, value])
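
As a quick sanity check, the file can be read back and compared with the dictionary. The transcriptions contain commas, which `csv.writer` quotes on output, so `csv.DictReader` should recover them unchanged. `read_transcriptions` is a hypothetical helper, and it assumes the header names written above.

```python
import csv


def read_transcriptions(csv_path):
    """Load the CSV back into a {audio file: transcription} dictionary."""
    with open(csv_path, newline="") as file:
        reader = csv.DictReader(file)
        return {row["Audio File"]: row["Transcription"] for row in reader}
```

For example, `read_transcriptions(DATA_PATH + "/transcriptions.csv") == transcriptions` should evaluate to `True` if the round trip preserved every entry.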