Testing the different Whisper implementations:

1. Whisper - pip/conda package
2. Whisper - Huggingface

# Whisper - Huggingface

After trying both implementations, I have decided not to use the Huggingface implementation, as the `pipeline` interface I tried doesn't expose a `verbose=True` option or an obvious way to input prompts into the Whisper model.

In [2]:
# Load model directly
# import torch
# from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq, pipeline

# pipe = pipeline(
#   "automatic-speech-recognition",
#   model="openai/whisper-tiny",
#   chunk_length_s=60,
#   device=torch.device("mps"),
# )


# Load the pipeline for speech-to-text, specifying the model and its processor
# speech_recognition = pipeline(
# 	"automatic-speech-recognition",
# 	model="jlvdoorn/whisper-large-v3-atco2-asr-atcosim",
# 	device=torch.device('mps'))


# speech_recognition = pipeline(
# 	"automatic-speech-recognition",
# 	model="openai/whisper-tiny",
# 	return_timestamps=True,
# 	device=torch.device('mps'))
#
# # Example usage (assuming you have an audio file)
# results = speech_recognition("atc_train.mp3")['text']
# print(results)



# Whisper - pip/conda package

By using just the pip/conda package, I can use the `verbose=True` option and also input prompts into the whisper model.

That being said, running the Whisper pip package on Apple silicon is a real pain. In my testing, I ran into the following error:
``` shell
Could not run 'aten::empty.memory_format' with arguments from the 'SparseMPS' backend. 
```
This is a known issue with the MPS backend on Apple silicon. 
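One way to sidestep the error is to keep the model off MPS unless it is explicitly requested. The sketch below is only an assumption about how that choice could be wrapped: `pick_device` is a hypothetical helper, not part of Whisper, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device(allow_mps: bool = False) -> str:
    """Return a device string for loading the model (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to choose from.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # MPS is opt-in here because of the SparseMPS error described above.
    if allow_mps and torch.backends.mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Selected device: {device}")
# Assumed usage: model = whisper.load_model("large-v3", device=device)
```

Running on the CPU is slower, but it avoids the MPS backend altogether.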

There have also been online discussions on replicating the Whisper pip functionality on Huggingface, i.e. replicating the `snippets` functionality:
``` python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, Audio

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", low_cpu_mem_usage=True)
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(16_000))

sample = next(iter(dataset))
inputs = processor(sample["audio"]["array"], padding=True, truncation=False, return_attention_mask=True, return_tensors="pt")

outputs = model.generate(**inputs, return_segments=True)

print(outputs)
```

But this is not a priority for me at the moment. I will be using the pip/conda package for now.

In [3]:
import torch.backends.mps
!pip install openai-whisper



In [4]:
# Load the model (the pip package does not need a separate processor)
import whisper
import torch
model = whisper.load_model("large-v3")


# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
# model.to(device)
# print(f"Model loaded on {device}")

## ATC-Specific Prompts/Phrases

To be honest, running the Whisper model itself is not really difficult. The most time-consuming part is the data prep: building the prompt phrases that essentially zero-shot tune the model toward ATC vocabulary.


In [5]:
general = ['Air Traffic Control communications']
nato = [
	'Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo', 'Foxtrot', 'Golf',
	'Hotel', 'India', 'Juliett', 'Kilo', 'Lima', 'Mike', 'November',
	'Oscar', 'Papa', 'Quebec', 'Romeo', 'Sierra', 'Tango', 'Uniform',
	'Victor', 'Whiskey', 'Xray', 'Yankee', 'Zulu'
]
atc_common_words = [
    "acknowledge", "affirmative", "altitude", "approach", "apron", "arrival",
    "bandbox", "base", "bearing", "cleared", "climb", "contact", "control",
    "crosswind", "cruise", "descend", "departure", "direct", "disregard",
    "downwind", "estimate", "final", "flight", "frequency", "go around",
    "heading", "hold", "identified", "immediate", "information", "instruct",
    "intentions", "land", "level", "maintain", "mayday", "message", "missed",
    "navigation", "negative", "obstruction", "option", "orbit", "pan-pan",
    "pattern", "position", "proceed", "radar", "readback", "received",
    "report", "request", "required", "runway", "squawk", "standby", "takeoff",
    "taxi", "threshold", "traffic", "transit", "turn", "vector", "visual",
    "waypoint", "weather", "wilco", "wind", "with you", "speed",
    "heavy", "light", "medium", "emergency", "fuel", "identifier",
    "limit", "monitor", "notice", "operation", "permission", "relief",
    "route", "signal", "stand", "system", "terminal", "test", "track",
    "understand", "verify", "vertical", "warning", "zone", "no", "yes", "unable",
    "clearance", "conflict", "coordination", "cumulonimbus", "deviation", "enroute",
    "fix", "glideslope", "handoff", "holding", "IFR", "jetstream", "knots",
    "localizer", "METAR", "NOTAM", "overfly", "pilot", "QNH", "radial",
    "sector", "SID", "STAR", "tailwind", "transition", "turbulence", "uncontrolled",
    "VFR", "wake turbulence", "X-wind", "yaw", "Zulu time", "airspace",
    "briefing", "checkpoint", "departure", "elevation", "FL (flight level)",
    "ground control", "hazard", "ILS", "jetway", "kilo", "logbook", "missed approach",
    "nautical mile", "offset", "profile", "quadrant", "RVR (Runway Visual Range)",
    "static", "touchdown", "upwind", "variable", "wingtip", "Yankee", "zoom climb",
    "altitude restriction", "airspeed", "backtrack", "Cleared for the Option", "deadhead", 
    "ETOPS (Extended Operations)", "final approach fix", "gate", "holding pattern", 
    "instrument approach", "jumpseat", "minimums", "NOTAM (Notice to Airmen)", "pushback", 
    "RNAV (Area Navigation)", "slot time", "taxiway", "TCAS (Traffic Collision Avoidance System)",
    "visual approach", "wind shear", "zero fuel weight", "accelerate-stop distance available",
    "barometric pressure", "clearance delivery", "departure frequency", "ETA (Estimated Time of Arrival)",
    "flight deck", "ground proximity warning system", "handoff", "IFR clearance", "jet route",
    "knots", "landing clearance", "Mach number", "NDB (Non-Directional Beacon)", "obstacle clearance",
    "PAPI (Precision Approach Path Indicator)", "QFE (Field Elevation Pressure)", "radar contact",
]


combined_phrases = [
    'ATC', 'Pilot', 'Call sign', 'Altitude', 'Heading', 'Speed', 'Climb to', 'Descend to',
    'Maintain', 'Approach', 'Tower', 'Ground', 'Runway', 'Taxi', 'Takeoff', 'Landing',
    'Flight level', 'Squawk', 'Radar contact', 'Traffic', 'Hold short', 'Cleared for',
    'Go around', 'Read back', 'Roger', 'Wilco', 'Affirmative', 'Negative', 'Standby',
    'Mayday', 'Pan-pan', 'Flight plan', 'Visibility', 'Weather', 'Wind', 'Gusts',
    'Turbulence', 'Icing conditions', 'Deicing', 'Instrument Landing System (ILS)',
    'Visual Flight Rules (VFR)', 'Instrument Flight Rules (IFR)', 'No-fly zone',
    'Restricted airspace', 'Flight path', 'Direct route', 'Vector', 'Frequency change',
    'Handoff', 'Final approach', 'Initial climb to', 'Contact approach', 'Squawk ident',
    'Flight information region (FIR)', 'Control zone', 'Terminal control area (TMA)',
    'Standard instrument departure (SID)', 'Standard terminal arrival route (STAR)',
    'Missed approach', 'Holding pattern', 'Minimum safe altitude', 'Transponder',
    'Traffic alert and collision avoidance system (TCAS)', 'Reduce speed to', 'Increase speed to',
    'Flight conditions', 'Clear of conflict', 'Resume own navigation', 'Request altitude change',
    'Request route change', 'Flight visibility', 'Ceiling', 'Severe weather', 'Convective SIGMET',
    'AIRMET', 'NOTAM', 'QNH', 'QFE', 'Transition altitude', 'Transition level',
    'No significant change (NOSIG)', 'Temporary flight restriction (TFR)', 'Special use airspace',
    'Military operation area (MOA)', 'Instrument approach procedure (IAP)', 'Visual approach',
    'Non-directional beacon (NDB)', 'VHF omnidirectional range (VOR)',
    'Automatic terminal information service (ATIS)', 'Pushback', 'Engine start clearance',
    'Line up and wait', 'Unicom', 'Cross runway', 'Backtrack', 'Departure frequency',
    'Arrival frequency', 'Go-ahead', 'Hold position', 'Check gear down',
    'Clearance delivery', 'Touch and go', 'Circuit pattern', 'Altitude restriction', 'Climb via SID',
    'Descend via STAR', 'Speed restriction', 'Flight following', 'Radar service terminated', 'Squawk VFR',
    'Change to advisory frequency', 'Report passing altitude', 'Report position', 'Estimated time of arrival (ETA)',
    'Actual time of departure (ATD)', 'Block altitude', 'Cruise climb', 'Direct to', 'Execute missed approach',
    'Flight deck', 'Ground proximity warning system (GPWS)', 'In-flight refueling', 'Joining instructions',
    'Lost communications', 'Minimum en route altitude (MEA)', 'Next waypoint', 'Obstacle clearance height (OCH)',
    'Procedure turn', 'Radar vectoring', 'Radio failure', 'Short final', 'Standard rate turn',
    'Terminal radar service area (TRSA)', 'Undershoot', 'Visual meteorological conditions (VMC)',
    'Wide-body aircraft', 'Yaw damper', 'Zulu time conversion', 'Area navigation (RNAV)',
    'Required navigation performance (RNP)', 'Barometric pressure', 'Control tower handover', 'Datalink communication',
    'Emergency locator transmitter (ELT)', 'Flight data recorder (FDR)', 'Ground control intercept (GCI)',
    'Hydraulic failure', 'Instrument meteorological conditions (IMC)', 'Jet route', 'Knock-it-off (emergency cease operations)',
    'Low visibility operations (LVO)', 'Missed approach point (MAP)', 'Navigation aids (NAVAIDS)',
    'Oxygen mask deployment', 'Precision approach radar (PAR)', 'Quick reaction alert (QRA)',
    'Runway incursion', 'Search and rescue (SAR)', 'Tail strike', 'Upwind leg', 'Vertical speed',
    'Wake turbulence category', 'X-ray cockpit security', 'Yield to incoming aircraft', 'Zero visibility takeoff'
]
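The lists above overlap in places (for example, `handoff` and `knots` each appear twice), so it can help to deduplicate them before joining. As far as I can tell from the openai-whisper decoding code, only the tail end of the prompt (roughly the last 223 tokens) is kept, so a very long prompt is silently cut from the front. The sketch below is an assumption about how to deal with that: `dedupe_phrases` and `build_prompt` are hypothetical helpers, and the word budget is only a crude stand-in for a real token count.

```python
from typing import Iterable, List


def dedupe_phrases(phrases: Iterable[str]) -> List[str]:
    """Drop case-insensitive duplicates while keeping first-seen order."""
    seen = set()
    unique = []
    for phrase in phrases:
        cleaned = phrase.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def build_prompt(phrases: Iterable[str], max_words: int = 150) -> str:
    """Join unique phrases, stopping before the word budget is exceeded."""
    kept = []
    used = 0
    for phrase in dedupe_phrases(phrases):
        n_words = len(phrase.split())
        if used + n_words > max_words:
            break
        kept.append(phrase)
        used += n_words
    return ", ".join(kept)


# Small sample standing in for the full lists in this notebook.
sample = [
    "Air Traffic Control communications", "Alpha", "handoff",
    "Handoff", "knots", "knots", "Hold short",
]
prompt = build_prompt(sample, max_words=8)
print(prompt)
```

Because the budget cuts from the end of the list, the phrases that matter most should come first.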


## Transcribing the ATC audio

In [6]:
# collate all the phrases

collated_list = general + nato + atc_common_words + combined_phrases

collated_list_string = ' '.join(collated_list)

# verbose=True prints each transcribed segment with its timestamps to the cell output.
# transcribe() reads the prompt from `initial_prompt`; a bare `prompt=` keyword is overwritten internally.
result = model.transcribe("./content/atc_train.mp3", verbose=True, language="en", initial_prompt=collated_list_string)


# print(result["text"])




[00:00.000 --> 00:01.500]  What taxiway the letter?
[00:01.500 --> 00:04.000]  Oh, negative sir, we're on 22R holding short of Fox, sir.
[00:04.000 --> 00:06.000]  What taxiway do you enter the ramp?
[00:06.000 --> 00:10.000]  Okay, sir, we just exit the runway and we're holding short of Fox, short on 22R.
[00:10.000 --> 00:13.500]  You're not listening to what I'm asking you. What taxiway do you enter the ramp?
[00:13.500 --> 00:15.000]  Not on the ramp yet, sir.
[00:15.000 --> 00:18.000]  What taxiway do you enter the ramp? Tell me. What letter?
[00:18.000 --> 00:22.000]  Okay, we can enter at Kilo for 52R.
[00:22.000 --> 00:25.000]  That's what I need to get out of you. We talked like six times.
[00:25.000 --> 00:27.000]  Straight ahead and hold short of hotel, sir.
[00:27.000 --> 00:30.000]  Straight ahead and hold short of hotel, sir.
[00:30.000 --> 00:33.000]  Follow the Asiana and next time I would like you to be polite with me.
[00:33.000 --> 00:37.000]  Okay, but if I gotta ta

In [9]:
result

{'text': " What taxiway the letter? Oh, negative sir, we're on 22R holding short of Fox, sir. What taxiway do you enter the ramp? Okay, sir, we just exit the runway and we're holding short of Fox, short on 22R. You're not listening to what I'm asking you. What taxiway do you enter the ramp? Not on the ramp yet, sir. What taxiway do you enter the ramp? Tell me. What letter? Okay, we can enter at Kilo for 52R. That's what I need to get out of you. We talked like six times. Straight ahead and hold short of hotel, sir. Straight ahead and hold short of hotel, sir. Follow the Asiana and next time I would like you to be polite with me. Okay, but if I gotta talk to you six times, I got other people I gotta talk to. And you don't understand what I'm saying. What I'm saying, polite with me, alright? You want polite with me? I'll make a report. Go ahead. 29, we're trying to clear a Wabbit on the runway. Roger, we're holding short of top of the 29. Crazy Wabbit. You know, I've seen stranger things

In [10]:
result['text']

" What taxiway the letter? Oh, negative sir, we're on 22R holding short of Fox, sir. What taxiway do you enter the ramp? Okay, sir, we just exit the runway and we're holding short of Fox, short on 22R. You're not listening to what I'm asking you. What taxiway do you enter the ramp? Not on the ramp yet, sir. What taxiway do you enter the ramp? Tell me. What letter? Okay, we can enter at Kilo for 52R. That's what I need to get out of you. We talked like six times. Straight ahead and hold short of hotel, sir. Straight ahead and hold short of hotel, sir. Follow the Asiana and next time I would like you to be polite with me. Okay, but if I gotta talk to you six times, I got other people I gotta talk to. And you don't understand what I'm saying. What I'm saying, polite with me, alright? You want polite with me? I'll make a report. Go ahead. 29, we're trying to clear a Wabbit on the runway. Roger, we're holding short of top of the 29. Crazy Wabbit. You know, I've seen stranger things, so it m
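
Besides `text`, the dictionary returned by `model.transcribe` carries a `segments` list whose entries include `start`, `end`, and `text` keys. A minimal sketch of turning those segments back into timestamped lines follows. The `sample_result` dictionary is a hand-written stand-in with two segments copied from the output above, and `format_timestamp` and `format_segments` are hypothetical helpers.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm, matching the verbose output style."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(result: dict) -> list:
    """Build one '[start --> end] text' line per segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} --> {end}] {seg['text'].strip()}")
    return lines


# Stand-in for the real transcribe() result (assumed structure).
sample_result = {
    "text": " What taxiway the letter? Not on the ramp yet, sir.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " What taxiway the letter?"},
        {"start": 13.5, "end": 15.0, "text": " Not on the ramp yet, sir."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

This makes it possible to save or post-process the timestamps without relying on the printed cell output.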