In [1]:
%matplotlib inline

import numpy as np, matplotlib.pyplot as plt, matplotlib.ticker as ticker, IPython.display as ipd
import wave, struct

There are multiple ways to read an entire wav file into Python ([librosa.load](https://librosa.github.io/librosa/generated/librosa.core.load.html), [scipy.io.wavfile](https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html), etc.).

Sometimes, when working with raw recordings, you will be presented with very large files. A single file can easily run to several gigabytes. Such files cannot be easily loaded into memory (unless you have some unbelievable amount of RAM at your disposal!). We have to take a different approach.

One way to deal with this problem would be to use the [`sox` utility on Linux](https://en.wikipedia.org/wiki/SoX). It is a sound editing program with a command line interface! You can install it using `apt-get` or whatever the package manager is for your platform of choice.

A useful command is as follows: `sox input_fn output_fn trim start_at_sec chunk_length_in_sec : newfile : restart`

It will read in `input_fn`, split it into chunks of length `chunk_length_in_sec` and write them out under names derived from `output_fn`. The generated output files will be numbered sequentially. For example, `sox example.wav out.wav trim 0 30 : newfile : restart` would split a 2.5 minute `example.wav` into 5 chunks of 30 seconds each. We could then read in each chunk individually.
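If you prefer to drive `sox` from Python, the command above can be assembled as an argument list. This is only a sketch: the helper names `sox_split_command` and `expected_chunks` are made up for illustration, and actually running the command assumes `sox` is installed and on your `PATH`.

```python
import math

def sox_split_command(input_fn, output_fn, chunk_length_in_sec, start_at_sec=0):
    """Build the argument list for the sox split command shown above (hypothetical helper)."""
    return ["sox", input_fn, output_fn, "trim", str(start_at_sec),
            str(chunk_length_in_sec), ":", "newfile", ":", "restart"]

def expected_chunks(duration_in_sec, chunk_length_in_sec):
    """Number of chunks when splitting from the very start of the file."""
    return math.ceil(duration_in_sec / chunk_length_in_sec)

cmd = sox_split_command("example.wav", "out.wav", 30)
n_chunks = expected_chunks(150, 30)  # a 2.5 minute file in 30 second chunks

# To actually run it (requires sox to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids having to quote file names that contain spaces.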

But what if we do not want to perform arbitrary splitting like that? For instance, what if we would like to zero in on the parts that interest us and split on some other criteria (based on some markers in the file, silence of some length, etc.)?

Thankfully, Python gives us a relatively convenient, albeit not very intuitive, way of working with wav files! Let's take a look.

## Opening a wav file using wave

As an example to work on, we will use a small audio file. It is an excerpt of a recording of a whale song made available by the [Alaska Whale Foundation](https://www.alaskawhalefoundation.org/).

The file is very short, but this doesn't matter. All we want to do is to familiarize ourselves with the functionality - we will be able to apply the same approach to larger files.

First of all, we need to open the `wav` file. We will use [`wave`](https://docs.python.org/3/library/wave.html) which is a module that is part of the standard library.

In [2]:
f = wave.open('data/whale_singing_excerpt.wav', 'r')

So far so good. What we get in return is not the data itself, but a file object (akin to what `open` would give us) exposing an API geared towards working with the data.

In [3]:
f

<wave.Wave_read at 0x7fef638bba50>

In [4]:
f.getparams()

_wave_params(nchannels=1, sampwidth=2, framerate=16000, nframes=84212, comptype='NONE', compname='not compressed')

The audio in our file was sampled at a rate of 16 kHz. The file contains 84212 air pressure samples, recorded on a single channel, with each sample stored in 2 bytes. This corresponds to `84212 / 16000 = 5.26325` seconds of audio.

We can move around inside the file. If we would like to jump 3 seconds into the recording, all we need to do is position our cursor at frame `3 sec * sample_rate`, which in our case is frame number `48000`.

In [5]:
f.setpos(48000)

We can query the file to tell us where the pointer is at any given time. We can also easily rewind back to the start.

In [6]:
f.tell(), f.rewind(), f.tell()

(48000, None, 0)
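The arithmetic behind the duration and the seek position can be wrapped up as below. This is a minimal sketch: `seconds_to_frame` is a hypothetical helper, and the in-memory file of silence is only a stand-in with the same parameters as the whale recording so that the snippet runs on its own.

```python
import io
import wave

# Stand-in for the recording: 84212 frames of silence, mono, 16 bit, 16 kHz
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(bytes(2 * 84212))  # 2 bytes per frame
buf.seek(0)

def seconds_to_frame(seconds, framerate):
    """Convert a time offset in seconds to a frame index (hypothetical helper)."""
    return int(round(seconds * framerate))

with wave.open(buf, "rb") as f:
    duration = f.getnframes() / f.getframerate()  # length in seconds
    target = seconds_to_frame(3, f.getframerate())
    f.setpos(target)                              # jump 3 seconds in
    position = f.tell()
```

With a real file you would pass its path to `wave.open` instead of `buf`.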

So we have a way of moving inside the file. But how do we read the data?

In [7]:
data = f.readframes(1)

Here we read in one frame of data, one sample of data across a single channel.

In [8]:
data

b'G\x00'

But the data comes back as raw bytes! We now need to tell the computer how to go from this representation to what we need, that is floating point numbers or integers.

If you would like to read more about the `wav` format, you can find a lot of good information [here](https://web.archive.org/web/20140221054954/http://home.roadrunner.com/~jgglatt/tech/wave.htm). None of this is necessary though and I will walk you through the essential parts as we continue.

Turns out wav files can come in various bit depths, in multiples of 8. The most commonly supported bit depth is 16 bit, but when working with exotic sensors one might encounter other values.

The tricky part now is telling Python what bit depth to expect. That information is contained in the metadata (format information) that `wave` can read, and it lives under the sample width attribute.

In [9]:
f.getsampwidth()

2

2 x 8 bits is 16 bits or two bytes! We will be using [`struct`](https://docs.python.org/3/library/struct.html) (which is also a part of the standard library) to process the data. We tell struct how to decode the binary data that `f.readframes(1)` gave us by passing to it a [format string](https://docs.python.org/3/library/struct.html#format-characters).

In [10]:
struct.unpack('<1h', data)

(71,)

Easy peasy! But what is the meaning of the `'<1h'`?

Well, turns out the numbers in the wav file can be stored in binary in multiple formats.

The first character, `<`, is a given. This is specified by the wav format and it means little endian encoding. 
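To see why the byte order matters, here is a small sketch that decodes the same two bytes both ways. The big endian result is shown only for contrast; it is not what a wav file contains.

```python
import struct

data = b"G\x00"  # the frame we read earlier: bytes 0x47, 0x00

little = struct.unpack("<1h", data)[0]  # low byte first: 0x0047
big = struct.unpack(">1h", data)[0]     # high byte first: 0x4700

# int.from_bytes from the standard library decodes a single value the same way
same = int.from_bytes(data, byteorder="little", signed=True)
```

Here `little` and `same` are both 71, while `big` comes out as 18176, so getting the byte order wrong silently produces very different amplitudes.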

We then need to tell `struct` how many numbers to expect in the bytes we hand it. We are decoding one frame at a time, and a frame holds one number per channel. Here we have just one channel, so the number will be 1. If we had more channels, say 4, we would put 4 there instead.

In [11]:
f.getnchannels()

1

But what about the int format itself? Here is where the situation gets a bit dicey. As far as I know, there is nothing preventing a developer from encoding the ints however they please before sticking them inside the wav file. Ah, the beauty of poorly specified formats!

In practice though, the most common formats you can expect are 16 bit **signed** ints, followed by 32 bit **signed** ints and 8 bit **unsigned** ints. The corresponding format string characters are `'h'`, `'i'` and `'B'`.

In our case, we go for the 'h' as we expect 16 bit signed ints.

In [12]:
struct.unpack('<1h', data)

(71,)
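Putting the three pieces together (byte order, count, int type), a format string can be built from the file's metadata. This is a sketch under the assumption that the file uses one of the three common encodings listed above; `FORMAT_CHARS` and `frame_format` are illustrative names, not part of `wave` or `struct`.

```python
import struct

# Assumed mapping from sample width in bytes to struct format character.
# It covers only the common cases named above; other widths are rejected.
FORMAT_CHARS = {1: "B", 2: "h", 4: "i"}

def frame_format(nchannels, sampwidth, nframes=1):
    """Build a struct format string for `nframes` frames (hypothetical helper)."""
    if sampwidth not in FORMAT_CHARS:
        raise ValueError(f"unsupported sample width: {sampwidth} bytes")
    return f"<{nchannels * nframes}{FORMAT_CHARS[sampwidth]}"

fmt = frame_format(nchannels=1, sampwidth=2)   # our mono, 16 bit file
value = struct.unpack(fmt, b"G\x00")

# One frame of a 2 channel, 16 bit file holds two numbers
stereo = struct.unpack(frame_format(2, 2), b"G\x00\xff\xff")
```

With a file opened through `wave`, the arguments would come from `f.getnchannels()` and `f.getsampwidth()`.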

We read in one sample. We can use what we know now to read in some portion of the `wav` file, say the one second of audio that starts 3 seconds into the recording.

In [13]:
with wave.open('data/whale_singing_excerpt.wav', 'r') as f:
    # the context manager makes sure we don't forget to close the file!
    sr = f.getframerate()
    f.setpos(sr * 3)
    frames = []
    # read one frame at a time until we reach the 4 second mark
    while f.tell() < sr * 4:
        frames += struct.unpack('<1h', f.readframes(1))

In [14]:
frames[:10]

[3107, 2900, 2246, 2152, 2265, 1649, 2207, 1276, 286, -715]
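Reading one frame per call is easy to follow, but `readframes` can also return a whole second in a single call, which should be considerably faster on large files. The sketch below assumes mono, 16 bit audio and uses a synthetic in-memory file as a stand-in for the recording so that it runs on its own.

```python
import io
import struct
import wave

# Stand-in file: 5 seconds of a repeating ramp, mono, 16 bit, 16 kHz
rate = 16000
samples = [i % 1000 for i in range(rate * 5)]
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(rate)
    w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
buf.seek(0)

with wave.open(buf, "rb") as f:
    sr = f.getframerate()
    f.setpos(sr * 3)
    raw = f.readframes(sr)                    # up to one second of frames at once
    n_values = len(raw) // f.getsampwidth()   # may be fewer near the end of the file
    frames = list(struct.unpack(f"<{n_values}h", raw))
```

Computing `n_values` from the bytes actually returned keeps the format string correct even when fewer frames are left than were requested.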

Seems to have worked! Let's now see if we can listen to the recording!

In [15]:
ipd.Audio(frames, rate=sr)

And this is it! We now have a way of navigating inside a large wav file and reading in only parts of it, either sequentially or by zeroing in on the parts we care about.
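For the sequential case, a generator keeps only one chunk in memory at a time. This is a minimal sketch assuming 16 bit samples; `iter_chunks` is a hypothetical helper, and the in-memory file of silence only stands in for a large recording.

```python
import io
import struct
import wave

def iter_chunks(source, chunk_seconds):
    """Yield tuples of 16 bit samples, `chunk_seconds` of audio at a time."""
    with wave.open(source, "rb") as f:
        if f.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16 bit samples")
        frames_per_chunk = int(chunk_seconds * f.getframerate())
        while True:
            raw = f.readframes(frames_per_chunk)
            if not raw:               # readframes returns b'' at the end of the file
                break
            n_values = len(raw) // 2  # two bytes per 16 bit value
            yield struct.unpack(f"<{n_values}h", raw)

# Stand-in file: 2.5 seconds of silence, mono, 16 bit, 8 kHz
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(bytes(2 * 20000))
buf.seek(0)

chunk_lengths = [len(chunk) for chunk in iter_chunks(buf, 1)]
```

The last chunk is shorter than the others because the file does not divide evenly into one second pieces. For multichannel files each yielded tuple would hold the channels interleaved.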

We read in the data into a list of integers, but if we need to we can easily go from this representation to a numpy array.

In [16]:
audio = np.array(frames, ndmin=2)

In [17]:
ipd.Audio(audio, rate=sr)
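If a numpy array is what you want in the end, the bytes returned by `readframes` can also be decoded directly with `np.frombuffer`, skipping `struct` and the intermediate list. This sketch assumes little endian 16 bit signed samples and uses a tiny in-memory file as a stand-in for the recording.

```python
import io
import struct
import wave

import numpy as np

# Stand-in file holding four 16 bit samples on a single channel
samples = [3107, 2900, 2246, -715]
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(struct.pack("<4h", *samples))
buf.seek(0)

with wave.open(buf, "rb") as f:
    raw = f.readframes(f.getnframes())
    audio = np.frombuffer(raw, dtype="<i2")        # little endian 16 bit signed ints
    audio = audio.reshape(-1, f.getnchannels()).T  # shape: (channels, samples)
```

The final reshape gives the same `(1, n)` layout as `np.array(frames, ndmin=2)` for a mono file.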

And we can see it all works! Sometimes third party libraries will expect the amplitudes to be expressed over the range of `[-1, 1]`. For 16 bit signed ints, all we have to do is divide the values in our array by the largest value that can be expressed in this format, that is 32,767.

In [18]:
audio = audio / 32767

In [19]:
audio[:10]

array([[ 0.09482101,  0.08850368,  0.06854457, ..., -0.12353893,
        -0.10028382, -0.04205451]])

In [20]:
ipd.Audio(audio, rate=sr)
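The same scaling can be written so that the divisor follows from the sample width instead of being hard coded. This is only a sketch for signed integer samples; `to_float` is a hypothetical helper, and 8 bit unsigned data would need different handling.

```python
import numpy as np

def to_float(audio_int, sampwidth=2):
    """Scale signed integer samples to roughly [-1, 1] (hypothetical helper)."""
    max_value = 2 ** (8 * sampwidth - 1) - 1  # 32767 for 16 bit samples
    return np.asarray(audio_int, dtype=np.float64) / max_value

scaled = to_float([3107, 2900, -32767, 32767])
```

Note that the most negative 16 bit value, -32,768, lands a hair below -1 with this divisor, which is rarely a problem in practice.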