# Assignment 1: Decoding States

---

## Task 5) Dual-Tone Multi-Frequency

[Dual-tone multi-frequency (DTMF)](https://en.wikipedia.org/wiki/Dual-tone_multi-frequency_signaling) signaling is an old way of transmitting dial pad keystrokes over the phone.
Each key/symbol is assigned a frequency pair: `[(1,697,1209), (2,697,1336), (3,697,1477), (A,697,1633), (4,770,1209), (5,770,1336), (6,770,1477), (B,770,1633), (7,852,1209), (8,852,1336), (9,852,1477), (C,852,1633), (*,941,1209), (0,941,1336), (#,941,1477), (D,941,1633)]`.
You can generate DTMF sequences online, e.g. at <https://www.audiocheck.net/audiocheck_dtmf.php>.
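
As an offline alternative to the online generator, a test signal can also be synthesized directly. The sketch below is a minimal illustration and not part of the assignment: the helper `synth_dtmf`, the 8 kHz sample rate, and the tone and pause durations are assumptions, and only three keys from the table are included.

```python
import numpy as np

# Subset of the key table above: key -> (low, high) frequency in Hz.
DTMF_FREQS = {'1': (697, 1209), '5': (770, 1336), '#': (941, 1477)}

def synth_dtmf(keys, sr=8000, tone_s=0.1, pause_s=0.05):
    """Concatenate one two-tone burst per key, each followed by silence."""
    t = np.arange(int(sr * tone_s)) / sr
    pause = np.zeros(int(sr * pause_s))
    chunks = []
    for key in keys:
        low, high = DTMF_FREQS[key]
        # Each key is the sum of its low-group and high-group sine.
        tone = 0.5 * np.sin(2 * np.pi * low * t) + 0.5 * np.sin(2 * np.pi * high * t)
        chunks += [tone, pause]
    return np.concatenate(chunks)

y = synth_dtmf("15#")
```

The resulting array `y` can be passed to a decoder together with the sample rate that was used for synthesis.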

### Features

For feature computation, use librosa to compute the power spectrum (`librosa.stft` for the spectrogram, `librosa.amplitude_to_db` to convert magnitudes to decibels), and extract the approximate band energy for each relevant frequency.

> Note: It's best to identify silence by the overall spectral energy and to normalize the band energies to sum up to one.
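
The note above can be sketched as follows. This is one possible reading, not a reference solution: the function name `band_features`, the ±1 bin neighbourhood, and the -40 dB silence threshold (relative to the loudest frame) are assumptions. The input `S` is a power spectrogram, which with librosa could be obtained as `np.abs(librosa.stft(y, n_fft=n_fft)) ** 2`.

```python
import numpy as np

# The eight DTMF frequencies: four low-group, then four high-group (Hz).
DTMF_FREQS = [697, 770, 852, 941, 1209, 1336, 1477, 1633]

def band_features(S, sr, n_fft, silence_db=-40.0):
    """S: power spectrogram of shape (1 + n_fft // 2, n_frames).

    Returns (features, is_silence). features has shape (n_frames, 8) and
    each row sums to one; is_silence flags frames with low total energy.
    """
    bins = np.round(np.array(DTMF_FREQS) * n_fft / sr).astype(int)
    # Sum each target bin with its two neighbours to tolerate leakage.
    bands = np.stack([S[b - 1:b + 2].sum(axis=0) for b in bins], axis=1)
    features = bands / np.maximum(bands.sum(axis=1, keepdims=True), 1e-12)
    # Overall spectral energy per frame, in dB relative to the loudest frame.
    total = S.sum(axis=0)
    total_db = 10.0 * np.log10(np.maximum(total, 1e-12) / max(total.max(), 1e-12))
    return features, total_db < silence_db
```

Normalizing makes the features independent of the recording level, which is why silence has to be detected separately from the unnormalized energy.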

### Decoding

To decode DTMF sequences, we can again use dynamic programming, this time applied to states rather than edits.
For DTMF sequences, consider a small, fully connected graph that has 17 states: 0-9, A-D, \*, \# and _silence_.
As for the DP matrix: the rows denote the states and the columns represent time; we decode left to right (i.e. time-synchronously), and at each time step we have to find the best step forward.
The main difference to edit distances or DTW is that you may now also "go up" in the table, i.e. change state freely.
For distance/similarity, use a template vector for each state that has `.5` in the two bins that need to be active.
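
A time-synchronous decoder over these states could look like the sketch below. Several choices here are assumptions rather than part of the task: the names `make_templates` and `viterbi_decode`, the squared Euclidean distance as local cost, the all-zero template for silence (which presumes that silent frames have been set to zero beforehand), and a constant `switch_cost` for changing state.

```python
import numpy as np

KEYS = "123A456B789C*0#D"  # row-major over (low, high) frequency pairs

def make_templates():
    """One row per state: 16 keys plus a final all-zero silence template."""
    T = np.zeros((17, 8))
    for k in range(16):
        T[k, k // 4] = 0.5      # low-group bin (697, 770, 852, 941 Hz)
        T[k, 4 + k % 4] = 0.5   # high-group bin (1209, 1336, 1477, 1633 Hz)
    return T

def viterbi_decode(features, templates, switch_cost=0.5):
    """features: (n_frames, 8). Returns the lowest-cost state index per frame."""
    # Local cost: squared distance between every frame and every template.
    local = ((features[:, None, :] - templates[None, :, :]) ** 2).sum(axis=2)
    n_frames, n_states = local.shape
    cost = np.zeros((n_frames, n_states))
    back = np.zeros((n_frames, n_states), dtype=int)
    cost[0] = local[0]
    # Staying in a state is free; moving to any other state costs switch_cost.
    trans = switch_cost * (1.0 - np.eye(n_states))
    for t in range(1, n_frames):
        total = cost[t - 1][:, None] + trans   # total[i, j]: reach j from i
        back[t] = total.argmin(axis=0)
        cost[t] = total.min(axis=0) + local[t]
    # Backtrack from the cheapest final state.
    path = [int(cost[-1].argmin())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

The state indices can be mapped to symbols with `(KEYS + "-")[s]`, where `-` stands for silence. The switch cost is what distinguishes this from a frame-wise argmax: a single deviating frame is cheaper to absorb than to follow with two state changes.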

When decoding a sequence, the idea is that we remain in one state as long as the key is pressed; after that, either the key is released (the spectral energy is close to 0, so we are in a pause) or another key is pressed, which is a "direct" transition.
Thus, after backtracking, collapse the sequence by digit and remove silence, e.g. `44443-3-AAA` becomes `433A`.
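
The collapsing step can be sketched with `itertools.groupby`. The helper name `collapse` and the use of `-` as the silence symbol are assumptions taken from the example above.

```python
from itertools import groupby

def collapse(states, silence="-"):
    """Merge runs of identical symbols, then drop the silence symbol."""
    return "".join(key for key, _ in groupby(states) if key != silence)

print(collapse("44443-3-AAA"))  # prints 433A
```

Merging runs before removing silence matters: the two `3` runs are separated by a pause and therefore stay two distinct key presses.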

---

### Preparation

In [23]:
import librosa
import numpy as np
from typing import List, Tuple
import matplotlib.pyplot as plt

In [24]:
### Notice: librosa defaults to a 22,050 Hz sample rate; adjust if needed!

DTMF_TONES = [
    ('1', 697, 1209), 
    ('2', 697, 1336), 
    ('3', 697, 1477), 
    ('A', 697, 1633),
    ('4', 770, 1209),
    ('5', 770, 1336),
    ('6', 770, 1477),
    ('B', 770, 1633),
    ('7', 852, 1209),
    ('8', 852, 1336),
    ('9', 852, 1477),
    ('C', 852, 1633),
    ('*', 941, 1209),
    ('0', 941, 1336),
    ('#', 941, 1477),
    ('D', 941, 1633)
]

### Implement the Decoding

In [152]:
### TODO:
### 1. familiarize yourself with librosa stft to compute the power spectrum
### 2. extract the critical bands from the power spectrum (i.e. how much energy is in the DTMF-related freq bins?)
### 3. define template vectors representing the states (see DTMF_TONES)
### 4. for a new recording, extract critical bands and do DP to get the state sequence
### 5. backtrack & collapse

### Notice: you will need a couple of helper functions...

def decode(y, sr):
    """
    Decode a DTMF signal frame by frame: pick the strongest tone pair per frame,
    treat low-energy frames as silence, and merge runs of the same key.
    
    Arguments:
    y -- Input signal.
    sr -- Sample rate. 
    
    Returns a list of decoded DTMF keys (silence excluded).
    """
    n_fft = 1024
    D = librosa.stft(y, n_fft=n_fft)
    # Power spectrogram, shape (1 + n_fft // 2, n_frames)
    S = np.abs(D) ** 2

    # Width of one STFT bin in Hz
    freq_resolution = sr / n_fft

    # Map DTMF frequencies to STFT bins
    dtmf_bins = [(round(f1 / freq_resolution), round(f2 / freq_resolution)) for _, f1, f2 in DTMF_TONES]
    
    similarity_scores = np.zeros((S.shape[1], len(DTMF_TONES)))
    
    for i, (low_bin, high_bin) in enumerate(dtmf_bins):
        energy = S[low_bin, :] + S[high_bin, :]
        similarity_scores[:, i] = energy
    
    # Frames whose best score falls below 1.5x the median score count as silence
    threshold = np.median(similarity_scores) * 1.5

    decoded_indices = np.argmax(similarity_scores, axis=1)
    decoded_indices[similarity_scores.max(axis=1) < threshold] = -1  # Mark low-energy frames as -1

    decoded_keys = []
    previous_key = None
    for index in decoded_indices:
        if index == -1:
            previous_key = None
        else:
            current_key = DTMF_TONES[index][0]
            if current_key != previous_key:
                decoded_keys.append(current_key)
                previous_key = current_key

    return decoded_keys
    
    ### END YOUR CODE

### Test the Decoding

In [157]:
SR = 14000
TEST_FILE = "data/112163_112196_11#9632_##9696.wav"

y, sr = librosa.load(TEST_FILE, sr=SR)
print(decode(y=y, sr=sr))

['1', '1', '2', '1', '6', '3', '1', '1', '2', '1', '9', '6', '1', '1', '#', '9', '6', '3', '2', '#', '#', '9', '6', '9', '6']
