## How audio fingerprinting works

Audio fingerprinting is a method of assigning a unique identifier (a "fingerprint") to an audio signal. The fingerprint can later be used to identify or match signals.

In [None]:
# Import relevant packages

from bokeh.io import output_notebook
import warnings
import AudioFP as afp

warnings.filterwarnings('ignore')
output_notebook()

### Create AudioFP object

We start by creating an object of the AudioFP class.

In [None]:
# Create AudioFP object for a song

song1 = afp.AudioFP(process='m')

### Read audio signal

The next step is to read the signal of an audio file. Note that only `.mp3` files can be properly read with this code.

In [None]:
# Read audio signal from a file

channels, framerate = afp.AudioFP.read_audiofile(song1, True, 'vanilla_ice_ice_ice_baby')

In [None]:
len(channels)

### Create a spectrogram

Once we have the raw audio signal, we can generate a spectrogram. A [spectrogram](https://en.wikipedia.org/wiki/Spectrogram) is a visual representation of the frequency content of the signal as a function of time. The spectrogram of any audio signal can be considered unique; however, it is too large to be useful as a fingerprint.

In [None]:
# Generate spectrogram 

f, t, sgram = afp.AudioFP.generate_spectrogram(song1, True, channels, framerate)
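
To make the idea concrete, here is a minimal sketch of a spectrogram computed with NumPy alone, using a short-time Fourier transform. It is not the code inside `afp.AudioFP.generate_spectrogram`; the function name `simple_spectrogram`, the window length, the hop size, and the synthetic test tone are all assumptions chosen for illustration.

```python
import numpy as np

def simple_spectrogram(signal, framerate, nperseg=1024, hop=512):
    """Return (freqs, times, magnitudes) from a short-time Fourier transform."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(nperseg)
    n_frames = 1 + (len(signal) - nperseg) // hop
    # Slice the signal into overlapping, windowed frames (one frame per row)
    frames = np.stack([signal[i * hop:i * hop + nperseg] * window
                       for i in range(n_frames)])
    # Magnitude of the real FFT of each frame, transposed to (frequency, time)
    magnitudes = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / framerate)
    # Time stamp of each frame, taken at the frame centre
    times = (np.arange(n_frames) * hop + nperseg / 2) / framerate
    return freqs, times, magnitudes

# Hypothetical test signal: one second of a 440 Hz tone sampled at 8 kHz
demo_rate = 8000
demo_signal = np.sin(2 * np.pi * 440 * np.arange(demo_rate) / demo_rate)
demo_f, demo_t, demo_sgram = simple_spectrogram(demo_signal, demo_rate)
```

Each column of `demo_sgram` is the frequency content of one short slice of the signal, which is why the array grows quickly with the length of the recording.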

### Condense spectrogram data

The spectrogram of an audio signal can be considered its unique signature, so to decide whether two signals are the same, one could compare their spectrograms. However, a spectrogram is a large array of amplitudes indexed by frequency and time, and it requires a considerable amount of memory. Physically storing and computationally comparing spectrograms for millions of songs (Shazam has a database of several million songs) would be an intractable problem. The next step is therefore to condense the information in the spectrogram. The way [Shazam does this](https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf) is by generating what they call a "constellation map", which is built by finding local peaks in the spectrogram.

In [None]:
# Find local peaks in the spectrogram

fp, tp, peaks = afp.AudioFP.find_peaks(song1, True, f, t, sgram)
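
As an illustration of what "finding local peaks" can mean, the sketch below marks every spectrogram cell that is strictly larger than all of its neighbours and above an amplitude threshold. This is one simple way to do it, not necessarily how `afp.AudioFP.find_peaks` works; the function name `local_peaks`, its parameters, and the toy array are assumptions.

```python
import numpy as np

def local_peaks(sgram, neighborhood=1, min_amplitude=0.0):
    """Return (freq_idx, time_idx) of cells that are the strict maximum of
    their (2*neighborhood+1)^2 window and exceed min_amplitude."""
    sgram = np.asarray(sgram, dtype=float)
    n_f, n_t = sgram.shape
    # Pad with -inf so that cells on the border can still be peaks
    padded = np.pad(sgram, neighborhood, mode="constant",
                    constant_values=-np.inf)
    is_peak = sgram > min_amplitude
    for df in range(-neighborhood, neighborhood + 1):
        for dt in range(-neighborhood, neighborhood + 1):
            if df == 0 and dt == 0:
                continue
            # View of the spectrogram shifted by (df, dt)
            shifted = padded[neighborhood + df:neighborhood + df + n_f,
                             neighborhood + dt:neighborhood + dt + n_t]
            is_peak &= sgram > shifted
    return np.nonzero(is_peak)

# Toy 4x4 "spectrogram" (rows = frequency bins, columns = time frames)
toy = np.array([[0, 1, 0, 0],
                [1, 5, 1, 0],
                [0, 1, 0, 2],
                [0, 0, 2, 9]])
peak_f, peak_t = local_peaks(toy, neighborhood=1, min_amplitude=1.0)
```

Only the peak coordinates are kept, so the dense array is reduced to a sparse list of (frequency, time) points.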

### Generate a hashed fingerprint

With the above constellation map, we have now condensed the data from the spectrogram of the audio signal. The next step is to take this condensed data and generate a fingerprint. Shazam uses a technique in which the frequency of a local peak is paired with the frequency of another local peak in its vicinity, and the time difference between the two peaks is calculated. So for each local peak (the anchor), we have a collection of nearby peaks (the targets) and their time deltas. This preserves locally unique features of the spectrogram and is the information that is passed to a hash function to generate a fingerprint. A [hash function](https://en.wikipedia.org/wiki/Hash_function) takes data of variable size and produces output of a fixed size (called a hash). A hash function is also deterministic: it always produces the same hash for the same input. The output of the hash function is the audio fingerprint, and it allows us to compare signals that might be of different lengths.

In [None]:
# Use hashing function and generate fingerprint

fp = afp.AudioFP.generate_fingerprint(song1, True, fp, tp, peaks)
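
The anchor/target pairing described above can be sketched with the standard library alone. The example below hashes each (anchor frequency, target frequency, time delta) triple with SHA-1 from `hashlib`. The function name `hash_peak_pairs`, the `fan_out` and `max_dt` parameters, the truncation to 20 hex characters, and the sample peaks are illustrative assumptions; the notebook does not show how `generate_fingerprint` builds its hash.

```python
import hashlib

def hash_peak_pairs(peaks, fan_out=3, max_dt=10):
    """peaks: list of (time, freq) tuples with integer values.
    Returns a list of (hash_hex, anchor_time), one per anchor/target pair."""
    peaks = sorted(peaks)  # order by time, then by frequency
    hashes = []
    for i, (t_anchor, f_anchor) in enumerate(peaks):
        # Targets: the next few peaks no more than max_dt after the anchor
        targets = [p for p in peaks[i + 1:] if p[0] - t_anchor <= max_dt]
        for t_target, f_target in targets[:fan_out]:
            dt = t_target - t_anchor
            # Only frequencies and the time *difference* enter the hash,
            # so the hash does not depend on where the pair occurs in the song
            key = f"{f_anchor}|{f_target}|{dt}".encode("utf-8")
            hashes.append((hashlib.sha1(key).hexdigest()[:20], t_anchor))
    return hashes

# Hypothetical peaks as (time frame, frequency in Hz)
demo_peaks = [(0, 100), (2, 250), (3, 120), (30, 400)]
demo_hashes = hash_peak_pairs(demo_peaks, fan_out=2, max_dt=10)
```

Because only the time difference is hashed, shifting every peak by the same amount leaves the hash values unchanged; only the stored anchor times move.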

### Something to note

There is one big difference between what we have done so far and how Shazam does its audio fingerprinting, especially in how fingerprints are stored and searched. When generating the fingerprint, Shazam also stores the time point of each anchor frequency. Thus, instead of one hashed fingerprint per audio signal, Shazam has a database entry for each signal that consists of the time point of each anchor frequency and the associated hash value. This has a key advantage when it comes to comparing signals: knowing the time offset of each hash allows Shazam to use a much smaller subset of the entire audio signal for comparison with the original. However, storing and retrieving this many hashes efficiently requires a database, which is outside the scope of this exercise. The steps we followed also allow us to compare two signals where one is a smaller subset of the other; however, the accuracy would be lower.
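
To illustrate why storing a time point with every hash helps, here is a rough sketch of offset-based matching. A plain dictionary stands in for the database. Every hash shared by the clip and a stored song votes for the difference between the two time points, and a genuine match produces many votes for the same offset. The function names, the placeholder hash strings, and the toy data are assumptions, and this is a simplification of the approach described in the Shazam paper rather than a reproduction of it.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {song_id: [(hash_value, anchor_time), ...]}
    Returns {hash_value: [(song_id, anchor_time), ...]}."""
    index = defaultdict(list)
    for song_id, entries in songs.items():
        for hash_value, anchor_time in entries:
            index[hash_value].append((song_id, anchor_time))
    return index

def best_match(index, clip_entries):
    """Vote on (song_id, offset) pairs. Returns ((song_id, offset), votes),
    or None when no hash of the clip is found in the index."""
    votes = Counter()
    for hash_value, clip_time in clip_entries:
        for song_id, song_time in index.get(hash_value, []):
            votes[(song_id, song_time - clip_time)] += 1
    return votes.most_common(1)[0] if votes else None

# Toy data: short strings stand in for real hash values
toy_songs = {
    "song_a": [("h1", 0), ("h2", 4), ("h3", 9), ("h4", 15)],
    "song_b": [("h2", 1), ("h5", 3), ("h6", 8)],
}
toy_clip = [("h2", 0), ("h3", 5), ("h4", 11)]  # excerpt starting 4 frames in
toy_result = best_match(build_index(toy_songs), toy_clip)
```

In this toy case all three clip hashes agree on `song_a` at an offset of 4 frames, while `song_b` collects a single stray vote.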