# Intro
The fifth post in a series about my Masters project with the Physics Department at Durham University.

# Meeting:

## During the meeting we discussed:
- Stuart has a stereo microphone to play around with. It can connect to a device and you could test it with a ping pong ball to see differences in phase, amplitude etc in the two channels.
- We zoomed in on Audacity to see the waveforms of the two channels, looking specifically at the phase difference, but also noticing a large difference in amplitude.
- The biology data is sampled at only 16 kHz. The rule (the Nyquist criterion) is that the sampling rate needs to be at least double the maximum frequency present in a sound to record it accurately. This means sounds, like birdsongs, which are likely to contain frequencies above 8 kHz will not have been captured faithfully.
- The idea behind stereo data is about this extra information a model could discern such as directionality. This information could be used both to improve bird classification, but also to discern, in the case of multiple birds of the same species singing, that there are indeed multiple birds.
- Instead of training a model with (specific) stereo data for our purposes, we could take a different approach and instead use an already good classification model to classify birds from certain segments of soundscapes. We would then look into these segments and try to analyse them with concepts such as directionality and so on.
- I expressed my concerns about lacking the data to train a model from scratch. The concern is that the specifics of a recording setup, i.e. the type of microphone(s), the distance between them, the position of the device relative to other objects such as trees and the ground, etc., matter too much to extract sensitive information such as directionality if we used data from different setups.
- Stuart suggested that if a model were trained on different setups, these setup specifics might not matter, as the model would generalise to work regardless. I replied that this would require not just a lot of data, but a lot of data where we know the setup specifics, in order to understand what the model is learning from.
- I have an underlying concern that there might be some specific we would overlook, since recording setups appear simple in theory but not in practice.
- Instead of aiming for this project to be useful immediately, Stuart said we could make it more of a first step for further research to build on. For example, "If we set up stereo microphones in this way and do this analysis with this model, we find X. In the future, more people can copy this setup and improve our results to possibly find Y".
- Directionality, however, might be okay because instead of absolute directionality we could find relative directionality, which should not depend on the distance between the microphones.
- I said that I found it hard to find papers about using stereo data for classification, but Robert has found some. Specifically, he said he would send me a literature review about it, which should be helpful.
- We then moved on to talk about the possibility of using stable diffusion to generate soundscapes.
- Robert found the example prompt I gave very funny because that combination of birds and environment doesn't make sense. This is entertaining, but also important, because generating realistic prompt combinations might be really important so that models can train properly on them. Generating purposely fake prompt combinations might somehow also help. In either case, some domain knowledge from Robert or the biology team would be needed.
- Another reason why generating synthetic data is a good idea is because some models might struggle with the lack of training data for rare/exotic species. This approach might help to aid that significantly. 
- A spectrogram shows frequency against time, with amplitude (power) shown as shading. A 3D stable diffusion model might be better because instead of using shades to show amplitude, it has another dimension for it.
- In theory, we agree that it is a good idea, and regardless of whether it is too ambitious, masters projects are allowed to be exploratory.
- It's very ambitious, but it would be nice to publish a paper even if this doesn't work out because the idea and exploration might be very helpful.
- One really interesting observation Stuart made was that a mask of a subject in a normal image would be really similar to a mask of a birdsong in a synthetic spectrogram. In fact, isn't the latter then just a birdsong classification model?? I need to thoroughly read the recent DiffEdit paper, which generates masks automatically for images. Their mask generation model might be able to be turned into a soundscape birdsong classification model!
- However, by the Christmas holiday (6 weeks away) I need to have some real work done rather than just learning. Masters projects are assessed not necessarily on learning, but on the exploration of ideas, which needs to be written up.
- By Christmas, I would like to at least have a stable diffusion model that can generate individual bird song (not soundscapes). 
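
The sampling-rate point from the meeting can be checked numerically. Below is a minimal sketch, assuming NumPy and two made-up test tones (6 kHz and 10 kHz, my own choice, not from the biology data). It shows that at 16 kHz a tone above 8 kHz does not appear at its true frequency. Real recorders usually filter such content out before sampling rather than letting it alias, so this only illustrates why the limit exists.

```python
import numpy as np

def nyquist_frequency(sample_rate_hz: float) -> float:
    """Highest frequency that can be represented at this sampling rate."""
    return sample_rate_hz / 2.0

def dominant_frequency(signal: np.ndarray, sample_rate_hz: float) -> float:
    """Return the frequency (Hz) of the strongest bin in the signal's spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate_hz)
    return float(freqs[np.argmax(spectrum)])

sample_rate = 16_000  # Hz, as in the biology recordings
t = np.arange(sample_rate) / sample_rate  # one second of sample times

# Hypothetical test tones: one below the 8 kHz limit, one above it.
ok_tone = np.sin(2 * np.pi * 6_000 * t)
aliased_tone = np.sin(2 * np.pi * 10_000 * t)

print("Nyquist limit:", nyquist_frequency(sample_rate))
print("6 kHz tone measured at:", dominant_frequency(ok_tone, sample_rate))
# The 10 kHz tone folds back to 16 kHz - 10 kHz, i.e. it is mistaken for ~6 kHz.
print("10 kHz tone measured at:", dominant_frequency(aliased_tone, sample_rate))
```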

# Problem idea update

Following through on last week, it's not a good project direction to focus on trying to replicate and beat Google's paper directly. That's simply too ambitious given the amount of work it would require to first just understand the paper and second to improve on it. Instead, a novel approach to it, and other competitors, is a better idea.

Something that I've been thinking about is an advanced data augmentation technique. <br>
In BirdCLEF 2022, the training data is given as audio files of individual bird songs, but the test data are soundscapes. It's not normal for training data and test data to be two different types. Normally both are the same so that you can create a model effectively to solve your desired problem. What would be ideal is having soundscapes as both the training and test data. <br>
The reason why this is not the case, I suppose, is the lack of a large number of labeled soundscapes to train on. Labeling enough soundscapes simply takes too much human time and resources. Not only does listening to each file take several minutes, but it requires an expert to label the birdsongs. It might be too much even for an expert to know the songs of 100+ bird species and accurately assess them in noisy environments.

Instead of recording soundscapes and manually labeling them, what if it were possible to create synthetic soundscapes out of already labeled data? These synthetic soundscapes could already then be labeled if we transfer the labels over. In practice, this could follow a similar practice I described in last week's post. 

However, a newer idea sprung to mind. Instead of using a more manual approach to soundscape creation, how about using an AI generation approach?

# Stable Diffusion Soundscape generation

The basic idea goes as follows: <br>
1. Stable diffusion generates images from a given prompt. 
2. A spectrogram is a 2D (or 3D) visualisation of a sound, so it is an image.
3. Stable diffusion can generate a spectrogram of a birdsong given the right prompt.
4. If it's possible to generate bird spectrograms, then it might be possible to generate entire soundscapes too, given the right prompt. 

We could then use these soundscapes to train a model on. 

The benefit of a novel data augmentation approach is that it doesn't necessarily compete with already successful models such as Google's. Instead it can complement them, making them even better.

Stable diffusion models are trained on different images, say teddy bears, and in a different instance, Mars, and then put them together in a novel way that it has never seen before, or in a way that might not even exist as a photo or in reality. In this way, it could learn about birdsongs and environmental sounds and generate unique training data suited to improving other models.

## Stable Diffusion mechanics

Stable diffusion heavily relies on two things: the data it has been trained on, and the prompt it is given to generate a given image.

### Training 
Currently, as far as I can see, the stable diffusion models available online have been trained on 'normal' images in order to generate 'normal' artwork. It's hard to define what normal is, but what it isn't is spectrograms of birdsongs and environments. My approach would likely require training a stable diffusion model from scratch on spectrograms, either of relevant sounds or of other audio, because intuitively, fine-tuning a pretrained model wouldn't work well enough. Regardless, it's worth trying just to see the results.

To train a model yourself takes two things: enough data and enough computation. There should be enough data online given the size of xeno-canto. Computation is the harder demon. Being at a university fortunately might resolve this for me. There are computing clusters available for research use, so in theory it should be possible to request time for one. Alternatively, it is possible to rent GPU time from companies, but this would likely be very expensive, so would require research funding. 

However, there might be a way to decrease computational needs: <br>

To lower the computation needed to train a diffusion model, we can use latent (compressed) representations of images. In fact, stable diffusion is a latent diffusion model rather than a general diffusion model specifically because of this. The autoencoder (VAE) in stable diffusion is what does this image compression, and it makes a significant difference: it reduces the memory needed for a 512x512x3 image by a factor of 48, speeding up training and inference (image creation) significantly.

Stable Diffusion is based on latent diffusion, which was proposed in the paper High-Resolution Image Synthesis with Latent Diffusion Models at https://arxiv.org/abs/2112.10752.
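
The factor of 48 can be reproduced with a little arithmetic. This is a minimal sketch, assuming the latent for a 512x512x3 image has shape 64x64x4 (8x downsampling in each spatial direction and 4 latent channels, which is not stated above and is my assumption about the standard Stable Diffusion autoencoder).

```python
def element_count(shape):
    """Number of values stored in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

image_shape = (512, 512, 3)   # height, width, RGB channels
latent_shape = (64, 64, 4)    # assumed latent shape: 512 / 8 = 64, with 4 channels

compression_factor = element_count(image_shape) / element_count(latent_shape)
print(compression_factor)  # 786432 / 16384 = 48.0
```

If a spectrogram-specific autoencoder could compress further without losing the detail that separates species, the same calculation would show how much extra saving it buys.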

What if there is some specific autoencoder approach for spectrogram generation that could decrease computational needs significantly?

### Prompts (and labels)

The entire point of this soundscape approach is to create labeled soundscapes because there are not enough available. If stable diffusion could create soundscapes, but not labeled ones, the whole approach falls apart. Fortunately, the way stable diffusion creates images, using prompts, might also conveniently be the answer to this issue.

An example I found on https://lexica.art/ at https://lexica.art/prompt/ea5b8646-6e6e-4a0e-b618-bae8c796f8cc of a generated image is as follows:

Prompt: "Scifi art by greg rutkowski, a man wearing futuristic riot control gear, claustrophobic and futuristic environment, detailed and intricate environment, high technology, highly detailed portrait, digital painting, artstation, concept art, smooth, sharp foccus ilustration, artstation hq"

Image:

![image.png](attachment:image.png)

Not all prompts are like this. In fact, working out what prompts to use is an entire process in itself, since it's surprisingly unintuitive and hard to write good prompts.

Regardless, looking at our prompt, it also has words that can be related to labels. <br>
"Scifi art", "man", "futuristic riot control gear", "claustrophobic and futuristic environment", "high technology", "highly detailed portrait", etc. Some words aren't as useful, like "detailed and intricate environment" as it doesn't give much information. Some words like "by greg rutkowski" are telling of the artist this was based on.

If we had a soundscape generation model, we could use a prompt like the following, which I'm making up to illustrate:

Prompt: "Forest environment, a Barn owl singing lightly at the start, lightly noisy environment, near a river, a Black grouse singing throughout, highly detailed, soundscape"

This prompt would include information about the labels we want, namely "a Barn owl singing lightly at the start", "near a river" and "a Black grouse singing throughout".

This does mean, however, that a model trained on these soundscapes needs to know how to use these labels properly, which is different from using the labels in BirdCLEF 2022, for instance. There is also the natural concern of whether the labels from the prompts are correct (enough) for the generated soundscapes to be useful training data for real soundscapes.
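
One way to keep prompts and labels in step is to build the prompt from the labels rather than parse labels back out of free text. Below is a minimal sketch of that idea. The `SoundEvent` class and both helper functions are hypothetical names of my own, and no soundscape generation model is involved. It only shows that the species labels are known by construction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SoundEvent:
    """Hypothetical description of one bird in a soundscape."""
    species: str
    timing: str  # free text, e.g. "throughout"

def build_prompt(environment, events, extras=()):
    """Join structured pieces into a comma-separated prompt string."""
    parts = [f"{environment} environment"]
    parts += [f"a {event.species} singing {event.timing}" for event in events]
    parts += list(extras)
    parts.append("soundscape")
    return ", ".join(parts)

def species_labels(events):
    """The labels a classifier would be trained on: the unique species names."""
    return sorted({event.species for event in events})

events = [
    SoundEvent("Barn owl", "lightly at the start"),
    SoundEvent("Black grouse", "throughout"),
]
prompt = build_prompt("Forest", events, extras=["near a river"])
print(prompt)
print(species_labels(events))
```

Whether the generated audio actually contains what the prompt asks for is a separate question that this does not answer.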

### An ambitious addition

A spectrogram is a 2D (or 3D) visualisation of a sound, so it is an image. <br>
In the example given I used a 2D visualisation, but what about doing all of this in 3D? <br>

Recently, Google released a paper about 3D image generation! That is, creating 3D images from a written prompt. Two Minute Papers has a brilliant video on it at https://www.youtube.com/watch?v=L3G0dx1Q0R8. They call it "DreamFusion".

An unofficial open source DreamFusion implementation using stable diffusion is already available at https://github.com/ashawkey/stable-dreamfusion. I really do wonder if using a 3D spectrogram to generate soundscapes would be better than a 2D one.

### Sound to Spectrogram and Spectrogram to Sound conversion.

Whether or not it's possible to turn a spectrogram back into audio might not actually matter too much. <br>
What I mean is, to classify the birds in a given test soundscape, we simply convert it into a spectrogram and then use an image model trained only on spectrograms to classify it. I think there are CNN-based papers that do this.
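
The sound-to-spectrogram direction is the easy one. Here is a minimal sketch using only NumPy: a short-time Fourier transform with a Hann window, converted to decibels. The window size, hop size and the 3 kHz test tone are assumptions I picked for illustration, and libraries such as librosa or torchaudio would normally do this job.

```python
import numpy as np

def spectrogram_db(signal, sample_rate_hz, window_size=512, hop_size=256):
    """Return (freqs, times, power_db) where power_db has shape (n_bins, n_frames)."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop_size
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_size : i * hop_size + window_size] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    power_db = 10 * np.log10(power + 1e-10)  # small constant avoids log(0)
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate_hz)
    times = (np.arange(n_frames) * hop_size + window_size / 2) / sample_rate_hz
    return freqs, times, power_db.T

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 3_000 * t)  # hypothetical stand-in for a bird call

freqs, times, spec = spectrogram_db(tone, sample_rate)
print(spec.shape)                       # (frequency bins, time frames)
print(freqs[np.argmax(spec[:, 0])])     # strongest frequency in the first frame
```

The resulting 2D array is what an image model would see. Going back from this array to audio is harder because the phase has been thrown away.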

There is also the idea that we can modify the test soundscape/spectrogram to be more similar to the training ones on purpose. Maybe we can use another model to do this.

# Alternative approaches to soundscape generation

## By audio diffusion: 

### Harmonai

Stability.ai released stable diffusion. They also work in other areas, including AI in biology and AI in audio.

Last month, through Harmonai, they released dance diffusion, an open source generative audio tool. Dance diffusion allows you to create music. It's a digital music production tool. More info can be found at a wandb.ai blog post https://wandb.ai/wandb_gen/audio/reports/Harmonai-s-Dance-Diffusion-Open-Source-AI-Audio-Generation-Tool-For-Music-Producers--VmlldzoyNjkwOTM1, at Harmonai's website https://www.harmonai.org/, and their GitHub https://github.com/Harmonai-org/sample-generator. There's also a guide to using it at https://drive.google.com/file/d/1nEFEpK27v0nytNXmmYQb06X_RI6kKPve/view.

I haven't yet thoroughly investigated the use of dance diffusion, but:

The latter guide details an interesting model checkpoint (a pretrained model to fine-tune). It is honk-140k, trained on recordings of the Canada Goose from xeno-canto. This implies that it's possible to generate birdsong with it, once trained.

My concern is whether it's possible to create labeled soundscapes from individual labeled audio files. As for the labels, dance diffusion doesn't seem to use prompts like stable diffusion; it seems to just create new sounds based on its training data. This isn't particularly useful because there is already enough birdsong available online.

Regardless, understanding how dance diffusion generates sound, and specifically how it handles audio data, might yield some new ideas about how to use stable diffusion to do the same. Perhaps I would find a way to do labeled soundscape creation with it once I know how it works.

### The Generative Landscape

There is a course online at https://johnowhitaker.github.io/tglcourse/.
It covers all types of diffusion generation, including image and audio.

Lesson 15 at https://johnowhitaker.github.io/tglcourse/dm4.html is about Diffusion for Audio on Class conditioned birdcalls. The course is not complete yet, but should be soon.

The course is created by Jonathan Whitaker, who is also contributing to fast.ai part 2, so it should be of great quality.

## By other generation methods:

This would be doing the process by combining previous audio data together to create soundscapes. <br>
Unlike diffusion methods, this isn't as new as an idea, and has likely been tried and tested before. Because diffusion is so new, I'm more attracted to it as a novel idea.

However there are some things I could learn from these approaches.

The Earth Species Project (https://www.earthspecies.org/) released a paper about BioCPPNet at https://www.nature.com/articles/s41598-021-02790-2. This is about solving the cocktail (party) problem: telling apart the sounds from a group of animals of the same species. For example, telling which individual is calling in a group of macaque monkeys.

They have a video explaining BioCPPNet at https://www.youtube.com/watch?v=TGWFr-6JCDk. In particular at the 1 minute mark, they state "We implement a supervised training scheme: we construct a synthetic mixture dataset by additively overlapping signals".

This is synthetic mixture dataset creation, which could be similar to soundscape creation, so could be very useful to understand.
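
To make "additively overlapping signals" concrete, here is a minimal sketch of how a labeled synthetic mixture could be built from labeled clips. It is my own illustration, not BioCPPNet's code. The function name, the label strings and the toy arrays are all hypothetical.

```python
import numpy as np

def mix_additively(clips, offsets, length):
    """Sum labeled clips into one track and record where each one sits.

    clips:   dict mapping a label to a 1D array of samples
    offsets: dict mapping the same labels to a start sample index
    length:  number of samples in the output mixture
    """
    mixture = np.zeros(length, dtype=np.float64)
    annotations = []
    for label, clip in clips.items():
        start = offsets[label]
        if start >= length:
            continue  # clip would start after the mixture ends
        end = min(start + len(clip), length)
        mixture[start:end] += clip[: end - start]  # overlap by simple addition
        annotations.append((label, start, end))
    return mixture, annotations

# Toy "recordings" so the arithmetic is easy to follow.
clips = {"barn_owl": np.ones(4), "black_grouse": 2 * np.ones(3)}
offsets = {"barn_owl": 0, "black_grouse": 2}

mixture, annotations = mix_additively(clips, offsets, length=6)
print(mixture)      # samples 2 and 3 contain both clips added together
print(annotations)  # the labels carry over with their start and end positions
```

Because the inputs are already labeled, the labels transfer to the mixture for free, which is exactly what hand-labeling soundscapes cannot offer.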

# Stereo data as a novel approach

Last week with Stuart and Robert, we discussed using stereo data as our novel approach to the cocktail problem.
Whether it is indeed novel and untried will take some research online to find out.

## Motivation

The paper (likely outdated, 2017) Multi-band Approach to Deep Learning-Based Artificial Stereo Extension, attempts to use machine learning to turn mono audio into stereo audio.

It motivates that:
"It is well known that stereophonic sound provides a more pleasant and natural experience than monaural (monophonic) sound on account of the presence of spatial information containing both ambience and/or the distinguished relative positions of objects and events".

The idea behind using stereo data for birdsong classification is that this extra presence of spatial information, the information about relative positions of objects and events, would help.
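
One simple piece of spatial information is the time delay between the two channels. Below is a minimal sketch, assuming NumPy and a made-up noise burst that reaches the right channel 5 samples after the left. It estimates the delay by cross-correlation. This is a standard textbook technique that I am using for illustration, not something taken from the paper above.

```python
import numpy as np

def estimate_delay_samples(left, right):
    """Lag (in samples) at which the right channel best matches the left.

    A positive result means the sound reached the right channel later.
    """
    correlation = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    return int(lags[np.argmax(correlation)])

rng = np.random.default_rng(0)
sample_rate = 16_000
true_delay = 5  # hypothetical delay in samples

left = rng.normal(size=1000)
# Build the right channel as a copy of the left, shifted later by true_delay.
right = np.concatenate([np.zeros(true_delay), left[:-true_delay]])

delay = estimate_delay_samples(left, right)
print(delay, "samples =", delay / sample_rate, "seconds")
```

The sign of the delay alone already gives a relative direction (which microphone the sound reached first), without needing to know the exact microphone geometry.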

## Stereo Data Availability

BirdCLEF 2022, and other datasets, use data from xeno-canto, so it is worth looking through it. Ideally, there would be a search tag to find stereo data.

On xeno-canto records, various details are listed under Technical details. Here is an example from https://xeno-canto.org/757580.

## Xeno-canto Technical Details
- File type: mp3
- Length: 24.7 (s)
- Sampling rate: 44100 (Hz)
- Bitrate of mp3: 258189 (bps)
- Channels: 2 (stereo)
- Device: not specified
- Microphone: not specified
- Automatic recording: yes
 
It does tell us the number of channels, and so whether the recording is stereo or mono. This example is missing the device and microphone, but I found another recorded with an iPhone using an Echo Meter Touch 2 Pro. There are also useful properties on xeno-canto like the recording quality and environment, as well as the type of bird sound, such as whether it is a flight call or a dawn song.

Xeno-canto doesn't just have individual bird recordings, but also soundscapes that I suspect will be mostly unlabeled.

But how about searching for stereo data in bulk?

https://xeno-canto.org/help/search states how to do an advanced search. Entries are tagged, and you search through tags with tag:searchterm. Available tags include the country, the geographic coordinates, whether there are other species in the background, the recording quality. A limitation is that I cannot see how to search for specifically mono or stereo data. There is a tag for the device, the microphone, and the sampling rate, but not the number of channels. I can instead search using the remarks tag (the comments from the uploader) and the mic tag for stereo microphones.

Looking at the API at https://xeno-canto.org/explore/api, it has dvc: recording device used, mic: microphone used, smp: sample rate, but not explicitly the number of channels. Perhaps if I contact xeno-canto they will know of a way to look for just stereo data.
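
As a starting point for bulk searching, here is a minimal sketch that only builds a search URL from `tag:value` pairs. It does not contact the server. The endpoint path is my assumption of the version 2 recordings endpoint, and the `cnt` country tag and example values are assumptions too, so all of them should be checked against the API page before use.

```python
from urllib.parse import urlencode

# Assumption: the version 2 recordings endpoint; verify against the API page.
API_URL = "https://xeno-canto.org/api/2/recordings"

def build_query(tags):
    """Turn {"mic": "stereo"} into the 'tag:value' search syntax."""
    terms = []
    for tag, value in tags.items():
        # Values containing spaces need quoting in the search syntax.
        value = f'"{value}"' if " " in value else value
        terms.append(f"{tag}:{value}")
    return " ".join(terms)

def build_url(tags, page=1):
    """URL-encode the query so it can be passed as a request parameter."""
    return f"{API_URL}?{urlencode({'query': build_query(tags), 'page': page})}"

# Hypothetical search: recordings whose microphone field mentions "stereo".
tags = {"mic": "stereo", "cnt": "United Kingdom"}
print(build_query(tags))
print(build_url(tags))
```

Since the number of channels is not a search field, any results would still need their files checked individually to confirm they really are stereo.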

Another website Robert linked to me, freesound.org, explicitly has a stereo tag! https://freesound.org/browse/tags/stereo/. However, it only has 5093 recordings, and if I add 'bird' as a tag, only 241. There might however be other datasets for stereo data. 

Since there are different ways to record stereo data, I will assume the type of microphone used really matters for processing delicate information like the relative positions of objects and events. This means that data might be very limited in supply. Either I can train a model on just general stereo audio and fine-tune to see if it works for a specific microphone, or train on mono audio and fine-tune, or train on some combination and fine-tune.

## Papers

The first paper in this section was about turning mono data into stereo data. There is also another paper at https://www.researchgate.net/publication/352807819_Identification_of_Fake_Stereo_Audio_Using_SVM_and_CNN, which tries to identify stereo audio data created from mono audio data. In theory you could use the two to improve a model's ability to generate stereo data from mono data.

Classification of Bird Species using Audio processing and Deep Neural Network, at https://ieeexplore.ieee.org/abstract/document/9917735:

It discusses audio feature extractors like the spectrogram and Inverse Short Time Fourier Transform, as well as different ML approaches as related work.

They (frustratingly) don't state their dataset's name, only that it's on Kaggle and that it's 23.5 GB. I searched through Kaggle and couldn't easily find it based on that.

Regardless, in their dataset, 11472 audio files were mono and 9903 were stereo, which is not a big difference. There might be enough stereo data available, but not enough from a specific microphone if that is what is needed.

BirdCLEF 2022's dataset doesn't even include whether the files are stereo or mono in its metadata.csv file.

## Conclusion

Using stereo data might be a good approach to the cocktail problem, but I'm unsure whether it would be novel because stereo data has been available for some time. Surprisingly, searching Google and Google Scholar for "Stereo vs Mono data for audio classification" turns up very few papers.

Because of its delicacy, stereo data might need to be collected from a specific microphone setup to extract its delicate information. Xeno-canto doesn't seem to provide an easy way to search for stereo data, but does allow microphone searches. There is a concern that limiting this project to a specific stereo microphone will limit its usefulness for others.

## Biosciences recordings

From the biology team, I have received some data of audio recordings.

The data is 3.54 GB in total. There are 8 files, all .wav. Each file is sampled at 16,000 Hz with two channels (stereo). <br>
I used Audacity to have a look at some of the files.
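
The same details can be read programmatically from a .wav header. Here is a minimal sketch using Python's standard `wave` module. To stay self-contained it first writes a one-second silent stereo file to a temporary folder as a stand-in for a real recording. The file name and the `describe_wav` helper are my own.

```python
import os
import tempfile
import wave

def describe_wav(path):
    """Read channel count, sample rate and duration from a PCM .wav header."""
    with wave.open(path, "rb") as wav_file:
        n_channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        n_frames = wav_file.getnframes()
    return {
        "channels": n_channels,
        "is_stereo": n_channels == 2,
        "sample_rate_hz": sample_rate,
        "duration_s": n_frames / sample_rate,
    }

# Stand-in file: one second of 16-bit silence, 2 channels, 16 kHz.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_stereo.wav")
with wave.open(demo_path, "wb") as demo:
    demo.setnchannels(2)
    demo.setsampwidth(2)       # bytes per sample
    demo.setframerate(16_000)
    demo.writeframes(b"\x00" * (16_000 * 2 * 2))

print(describe_wav(demo_path))
```

Running `describe_wav` over a folder of recordings would be a quick way to confirm which files are stereo when the metadata does not say.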

The recordings are from UK woodlands and from a couple of scrub (the wildlife habitat) sites. They are examples of the kind of typical data the biology team has collected. They've collected this data daily during Jan-Jun for about 3-4 years from about 20 different locations.

What stands out from this dataset to me is the amount of time it has been collected over. If there was a good model to classify birds, it would be really interesting to see how the number of birds changes over the years in these varied locations. This could yield insights into changes in biodiversity due to climate change, for example. An alternative project direction would be, rather than trying to solve the cocktail problem, to understand and use someone else's solution to do analysis on its results over time.

The audio files are not labeled. However, I recall the biology team had a PhD student who hand-labeled some data for them. This would be useful to look at, but due to how much time and effort it takes to hand-label soundscapes, there is most likely not enough data to train a model from scratch.

Steve commented on the dataset: <br>
"Attached are a selection of woodland audio files to have an initial play with, all from 2017, one from each site in mid-May. They are  2hrs 15mins long and start approx. 45mins before sunrise. So, the first 30 mins is often quiet, and things get gradually noisier and more complex thereafter. We have these data for 3-4 years, daily from about Jan-Jun for about 20 sites. Sites here include Abernethy Forest RSPB , Durham Uni Woodland, Minsmere RSPB, The Lodge RSPB, RSPB Wood of Cree, RSPB Ynis Hir, all from around mid-May.  Also included, for slight contrast are a couple of scrub habitat sites (Green Farm nr Durham, Pinnock Hill near Durham)"

# Unknown yet Approaches

I watched a video by Andrew Ng, the co-founder and former head of Google Brain and the former chief scientist at Baidu. In it, he gives useful guidance on how to do machine learning research.

He states how to find new ideas for approaches to a problem:

1. You should learn how to replicate other papers' results. You learn a lot by doing so.
2. You should read many papers (20-50).

Andrew says this is an incredibly reliable process for getting new ideas.

The video can be found here https://www.youtube.com/watch?v=hkagmGAu74Y.

# Work done 

## Papers about BirdCLEF competitions

An immensely useful find this week was a set of papers describing the different approaches the BirdCLEF competition teams used.

The most recent: Overview of BirdCLEF 2022: Endangered bird species recognition in soundscape recordings, at http://ceur-ws.org/Vol-3180/paper-154.pdf.

For my soundscape generation approach, perhaps the most useful paragraph is found searching for 'data augmentation':

"Sampathkumar & Kowerko [15]: Data augmentation is an important processing step in
bird sound recognition because of the domain shift between training and test recordings.
In their work, this team focused on evaluating the best augmentation scheme for this task.
Most transformations focus on adding different patterns of noise to the source recording, thus
emulating noisy soundscape recordings. While the authors find that all augmentations methods
improve the baseline experiment, Gaussian noise, loudness normalization and tanh distortion
appear to be most impactful."

Most approaches added noise to the training data to emulate the noise in the test soundscapes. They did not create entirely synthetic soundscapes like I proposed. Gaussian noise, loudness normalization and tanh distortion appear to be the most useful augmentations to apply.
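
To get a feel for these three augmentations, here is a minimal NumPy sketch of each. These are simplified stand-ins of my own, not the team's implementation. In particular, the loudness normalization here is a plain RMS rescale rather than a perceptual loudness measure, and the SNR, target level and gain values are arbitrary.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def normalize_rms(signal, target_rms=0.1):
    """Rescale so the root-mean-square level matches target_rms."""
    rms = np.sqrt(np.mean(signal ** 2))
    return signal if rms == 0 else signal * (target_rms / rms)

def tanh_distortion(signal, gain=5.0):
    """Soft-clip the waveform; a higher gain gives a harsher distortion."""
    return np.tanh(gain * signal) / np.tanh(gain)

rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000
clean = 0.5 * np.sin(2 * np.pi * 440 * t)  # hypothetical stand-in for a recording

noisy = add_gaussian_noise(clean, snr_db=10.0, rng=rng)
levelled = normalize_rms(clean)
distorted = tanh_distortion(clean)
print(np.sqrt(np.mean(levelled ** 2)))  # RMS after normalization
print(np.max(np.abs(distorted)))        # peak stays within [-1, 1]
```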

And in general, the conclusion is useful:

"Despite being set up as a few-shot learning task, few teams decided to employ techniques other
than CNNs. Pre-trained neural networks for image recognition still dominated the task, and
participants tried to cope with the lack of training data through intensive data augmentation
and transfer learning. Surprisingly, there was only a weak correlation between the number of
training samples and overall per-species performance. This indicates that other factors - such as
repertoire size and call patterns - might outweigh training data quantity. Automatic detection
of endangered and rare species remains challenging. Still, this year’s competition demonstrated
that passive acoustic monitoring combined with machine learning could already be a powerful
monitoring tool for some endangered species. BirdCLEF continues to engage a large number of
data scientists from around the world to develop new and effective acoustic analysis solutions
that aid avian conservation."

There's a lack of training data because BirdCLEF doesn't provide all of the data available on xeno-canto, which isn't an issue for my project. Somehow, more training samples per species didn't correlate strongly with better species identification, because of other factors. Most teams used CNNs pretrained for image recognition, which suggests spectrogram-style inputs were still the norm.

The competition overview does well to motivate my soundscape creation approach:

"In recent years, research in the domain of bioacoustics shifted towards deep neural networks
for sound event recognition [7, 8]. In past editions, we have seen many attempts to utilize
convolutional neural network (CNN) classifiers to identify bird calls based on visual representations of these sounds (i.e., spectrograms) [9, 10, 11]. Despite their success for bird sound
recognition in focal recordings, the classification performance of CNNs on continuous and
omnidirectional soundscape recordings remained low. Passive acoustic monitoring can be a
valuable sampling tool for habitat assessments and observations of environmental niches, which
often are threatened. However, manual processing of large collections of soundscape data is not
desirable, and automated attempts can help to advance this process [12]. Yet, the lack of suitable
validation and test data prevented the development of reliable techniques to solve this task.

Bridging the acoustic gap between high-quality training recordings and complex soundscapes
with varying ambient noise levels is one of the most challenging tasks in the domain of audio
event recognition. This is especially true when the amount of training data is insufficient, as is
the case for many rare and endangered bird species around the globe. Despite the vast amounts
of data collected on Xeno-canto and other online sound libraries, audio data for endangered
birds is still sparse. However, those endangered species are most relevant for conservation,
rendering acoustic monitoring of endangered birds particularly difficult."

I should read reference [12] of the paper to get a better idea of the soundscape availability problem.

Searching for 'diffusion' within the paper yields no results, which further suggests that using diffusion to generate soundscapes is a novel approach.

## Finished fast.ai lesson 10

This lesson was really useful in consolidating my understanding of how stable diffusion works. It also starts the hard but rewarding journey of programming entire Python modules/frameworks from scratch, a skill likely to be vital further on in this project. The fast.ai lessons for this advanced course are very demanding. They include a 2 to 2.5 hour lecture and plenty of homework, with the course creator Jeremy stating that he expects each lesson to take around 10 hours.

My blog for this lesson can be found at https://exiomius.quarto.pub/blog/posts/2022-09-27-Lesson10Blog.md.html.

## Updated and Fixed blog

I was having trouble with my old fastpages based blog not displaying posts with maths and images correctly. To remedy this, I created a blog using Quarto instead. This post is on the new blog, and I transferred all the old posts here too.

## Investigated Durham Uni Supercomputers 

The computer science department has a list of machines at https://www.durham.ac.uk/departments/academic/computer-science/about-us/facilities/. I met a student who used one for some machine learning research. As I'm a student at both the physics and computer science department, I could possibly use the physics department's supercomputers too, if they are suited towards ML.

For my purposes, Bede is GPU-based and the description states it is ideally suited to ML. Bede is shared between multiple universities. At Durham, the main contact is Dmitry Nikolaenko at durham-bede-support@n8cir.org.uk.

Learning how to use Bede, as it's Linux-based, and how/where to store the training data etc. is going to be a task in itself.

## Found yet more datasets

https://github.com/AgaMiko/bird-recognition-review has useful resources for birdsong classification and yet more datasets.

This blog post, https://towardsdatascience.com/sound-based-bird-classification-965d0ecacb2b, contains an approach as well as a nice introduction to the problem.

## Found another thesis

Bird Species Classification And Acoustic Features Selection Based on Distributed Neural Network with Two Stage Windowing of Short-Term Features, at https://arxiv.org/ftp/arxiv/papers/2201/2201.00124.pdf, by Nahian Ibn Hasan, like last week's thesis, gives another good introduction to the problem and approach.