# Intro
This is the fourth post in a series about my Master's project with the Physics Department at Durham University.

# Meeting:
## During the meeting we discussed:
- Competing directly with Google or other big research teams isn't a good direction for a one-year project. A novel direction would be better.
- One such approach would be to use stereo data instead of mono data to solve the cocktail problem.
- Most audio data online is mono, for example on xeno-canto, but we could contact research groups such as the biology team and ask for stereo data.
- One reason stereo data is rarer is that it requires a special microphone to collect.
- Different stereo microphones have different properties, such as the spacing between the two microphones, and this needs to be accounted for when gathering and analysing data.
- We could take stereo data and calculate the difference between the first and second channels to make a third channel, then try classifying with one channel and with all three to see whether it helps.
- Something interesting that is only possible with stereo data is finding the direction of the singing birds. This might be impossible, though, because labeled data for that is difficult to get.
- The differences between mono and stereo data. Besides having two channels, a stereo recording has differences in time delay and attenuation between its channels. I need to look into this more.
- Investigating whether it's possible to convert between mono and stereo data. Converting mono to stereo exactly might be impossible because information is missing from the mono data: intuitively, you can find direction from stereo data but not from mono. This is an information problem.
- Stuart might order a stereo microphone to play around with; Robert is asking whether it is possible to borrow one.
- Another novel approach would be to use stable diffusion to generate spectrograms. The idea is that if there is a lack of stereo data, we could synthesise our own. Another model could even be added to correct the synthesised audio so that it is more like real audio.
- A motivation for this is the prevalence of image classification approaches in birdsong classification. Much research uses CNNs, for example.
- About the Physics content of the project: there needs to be some Physics for the sake of the external marker and external questions at the viva.
- Physics content can be added by investigating how to do the image-to-audio conversion (or vice versa in the case of stable diffusion), because of the transformations and information problem involved; through the linear algebra and maths involved; and even through the Physics way of thinking, i.e. testing hypotheses and different approaches.
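
To make the difference-channel and time-delay ideas from the list above concrete, here is a minimal NumPy sketch. It is not something we settled on in the meeting: the function names `add_difference_channel` and `estimate_delay` are my own placeholders, and I am assuming a stereo recording has already been loaded as an array of shape `(2, n_samples)`. The delay estimate uses the peak of the cross-correlation, which is one common choice rather than the only one.

```python
import numpy as np


def add_difference_channel(stereo):
    """Stack left, right and (left - right) into a 3-channel array.

    Assumes `stereo` has shape (2, n_samples).
    """
    left, right = np.asarray(stereo, dtype=np.float64)
    return np.stack([left, right, left - right])


def estimate_delay(left, right):
    """Estimate by how many samples `right` lags behind `left`.

    Uses the position of the cross-correlation peak; a positive
    result means the sound arrived at the left channel first.
    """
    corr = np.correlate(right, left, mode="full")
    # Index 0 of the "full" output corresponds to a lag of -(len(left) - 1).
    return int(np.argmax(corr)) - (len(left) - 1)
```

Dividing the estimated delay by the sample rate gives a time difference in seconds, which, together with the microphone spacing, is in principle what any direction-finding would build on.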

# Problem definition update
From now on, I will refer to the problem of identifying birds in a noisy environment as 'the cocktail problem': at a cocktail party, where many people are speaking and the environment is noisy, it is hard to tell who is speaking. <br>

I define a soundscape as an audio file that contains various birdsongs in a noisy environment. We are trying to classify birds from a soundscape.

# Work to do:
I've been thinking about all the things I need to do and learn for this project. Here is an overview of them. It's honestly a lot of work, and it's hard to tell how much time each will take until I make more progress. It might take the whole of the first term to get a handle on this.

## Machine Learning Skills
Using frameworks like fast.ai and transformers isn't as simple as just using their predefined functions and models to do everything. Learning how to find the best hyperparameters, and good validation sets, among many other things, takes a combination of theory and practice to gain intuition. Jeremy from fast.ai said there is no substitute for practice, and provides a lot of guidance on how to do so. 

## Machine Learning Theory
As well as using frameworks and models, you have to spend time learning the theory behind how they work. For instance, how the components in a CNN work the way they do. 

## Framework Skills
Once I understand the theory and have novel ideas for approaching the cocktail problem, I need to know how to write the code to implement them. <br>
This involves learning how to edit frameworks and create your own, covered in fast.ai part 2. <br>

There's also nbdev (https://nbdev.fast.ai/) to learn, for creating frameworks and their documentation.

## Data handling/preprocessing/Physics

Learning how to store data, access it, transform it into the right size and format, edit it, add noise to it, interpret it (bird domain information) etc. 
<br>
There could be much work to be done on transforming the audio data: Fourier and Gabor transforms, etc. I found a YouTube playlist of guides on this at https://www.youtube.com/watch?v=RMfeYitdO-c. The fourth initial project reference, "New aspects in birdsong recognition utilizing the gabor transform", focuses on the Gabor transform and likely involves much Physics too.
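
As a note to myself on what these transforms look like in code: the Gabor transform is a short-time Fourier transform with a Gaussian window, so a rough NumPy sketch is short. The function name `gabor_spectrogram` and the default window size, hop, and `sigma` are my own assumptions, not values from the referenced paper, and the signal is assumed to be at least one window long.

```python
import numpy as np


def gabor_spectrogram(signal, window_size=512, hop=128, sigma=0.15):
    """Magnitude spectrogram from a Gaussian-windowed short-time Fourier transform.

    `sigma` is the Gaussian's standard deviation as a fraction of the window size.
    Returns an array of shape (window_size // 2 + 1, n_frames).
    """
    signal = np.asarray(signal, dtype=np.float64)
    n = np.arange(window_size)
    centre = (window_size - 1) / 2
    window = np.exp(-0.5 * ((n - centre) / (sigma * window_size)) ** 2)

    # Slice the signal into overlapping frames and apply the window to each.
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])

    # One real FFT per frame; transpose so rows are frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

Row `k` of the result corresponds to the frequency `k * sample_rate / window_size`, so the image is the kind of spectrogram an image classifier would take as input.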

## Custom metrics, creation, and evaluation for models

The biology department have their own interests and goals for what they want from a model. I would need to talk with them in detail about their priorities, e.g. their preferences in confusion matrix metrics and in bird species. They might also want a model that works with data spanning a few years, to spot trends.

## Machine learning explainability and communication

Learning how to implement and create methods and visualisations to communicate why the models are predicting as they do. This is especially important for marking in the final report.

## Machine learning maths

To read and implement the latest machine learning papers, some mathematical knowledge is needed. I am contemplating doing yet another free fast.ai course, Computational Linear Algebra, explained here https://www.fast.ai/posts/2017-07-17-num-lin-alg.html, to help with the maths side of things. <br>

Alternatively or in addition, the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville provides a mathematical grounding, and Jeremy recommended reading the first 6 chapters of it to help with understanding and implementing the maths in papers.

# Work done 

## Practiced Transformers
A list of transformer tasks is at https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb which is quite useful. <br>

### In particular for audio classification it details the process: <br>
1. Instantiate a feature extractor and a model from the checkpoint name.
2. Process the audio signal to be classified with a feature extractor.
3. Pass the input through the model and take the argmax to retrieve the most likely class.
4. Convert the class id to a class name with id2label to return an interpretable result.

I went through the HuggingFace transformers documentation and worked through some of the notebooks to understand them. They covered the 4-step process detailed above:
- https://www.kaggle.com/adnanjinnah/audio-classification-hf-1/
- https://www.kaggle.com/adnanjinnah/audio-classification-hf-2/
- https://www.kaggle.com/adnanjinnah/audio-classification-hf-3/
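
For reference, the 4-step process above can be sketched roughly as follows. This is my own condensed version rather than code copied from the notebooks: `classify_audio` and `label_from_logits` are hypothetical helper names, the checkpoint is left as a parameter because any specific name would be a guess, and the PyTorch backend is an assumption.

```python
def label_from_logits(logits, id2label):
    """Steps 3 and 4: take the argmax and map the class id to a class name."""
    class_id = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[class_id]


def classify_audio(waveform, sampling_rate, checkpoint):
    """Classify one waveform (1-D float array) with a pretrained checkpoint."""
    # Imported here so the helper above works without transformers installed.
    import torch
    from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

    # Step 1: instantiate a feature extractor and a model from the checkpoint name.
    feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
    model = AutoModelForAudioClassification.from_pretrained(checkpoint)

    # Step 2: process the audio signal with the feature extractor.
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Step 3: pass the input through the model (no gradients needed for inference).
    with torch.no_grad():
        logits = model(**inputs).logits

    # Step 4: convert the most likely class id to a readable label.
    return label_from_logits(logits[0].tolist(), model.config.id2label)
```

The waveform has to be at the sampling rate the checkpoint was trained with (available as `feature_extractor.sampling_rate`), so audio at another rate would need resampling first.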

## Practiced attempting BirdCLEF 2022
It's well worth practicing on a competition whose goal is exactly the same as my own.
After trying fast.ai's audio module last week and concluding that it is outdated (the GitHub repo hasn't been updated in roughly 6 months), I decided to use HuggingFace instead. This is mainly because Jeremy recommended it as an up-to-date framework, but also because it is used in fast.ai part 2.

With that in mind, I attempted it at https://www.kaggle.com/adnanjinnah/birdclef-first-attempt/.
This attempt was rife with problems. Although it was my first time using HuggingFace audio, the number of problems I ran into was still too much. I did not manage to get any model to work. I spent the entire time just trying to get the data loaded properly.

### To summarise:
- HuggingFace's load_dataset has several different methods to load audio. They all require the data to be formatted in a particular way. I tried all of them with no success.
- Kaggle's competition dataset is set to read-only for some reason. This means I cannot directly edit the files to get them into the right format.
- I tried simply downloading the dataset and reuploading it to Kaggle, but A. this is inefficient and B. it won't work for the unseen test data.
- I tried copying the dataset from the read-only input folder to the editable output folder, but this is also inefficient, and even so:
- I couldn't load the copied data using load_dataset's audiofolder option. I'm not sure why. I have it formatted in exactly the way the documentation shows. The issue may be that I need to upload the dataset to HuggingFace's website first, but this has the same issues as the first attempt.
- A way to get around having to copy the data, which is also inefficient but would at least work with the unseen test data, is to give load_dataset the URLs of the audio files. This didn't work either, because some of the URLs don't respond at the instant load_dataset tries to access them. I couldn't find a way to tell load_dataset to skip these URLs or retry them later.
- I tried a different method of load_dataset, but this one seems to require the main .csv file to contain the audio files in array format. Because the .csv file contains paths to the audio files instead of their content, I tried using another module, librosa, to create a column in the .csv file containing the audio. This didn't work because of an out-of-memory error, and it is also very inefficient.
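
For my own reference, here is a sketch of how I understand the layout that audiofolder expects: a `metadata.csv` in the data folder with a `file_name` column holding relative paths, plus any extra columns such as a label. The helper `write_audiofolder_metadata` is hypothetical, it assumes one subfolder per species, and it needs a writable folder (so the copied data, not Kaggle's read-only input).

```python
import csv
import os

AUDIO_EXTENSIONS = (".ogg", ".wav", ".mp3", ".flac")


def write_audiofolder_metadata(data_dir):
    """Write data_dir/metadata.csv with one row per audio file.

    Assumes the layout data_dir/<species>/<recording>, so the parent
    folder name is used as the label. Returns the metadata path.
    """
    rows = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.lower().endswith(AUDIO_EXTENSIONS):
                rel_path = os.path.relpath(os.path.join(root, name), data_dir)
                rows.append({
                    "file_name": rel_path.replace(os.sep, "/"),
                    "label": os.path.basename(root),
                })
    rows.sort(key=lambda row: row["file_name"])  # deterministic order

    metadata_path = os.path.join(data_dir, "metadata.csv")
    with open(metadata_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name", "label"])
        writer.writeheader()
        writer.writerows(rows)
    return metadata_path
```

With the metadata in place, the documented call is `load_dataset("audiofolder", data_dir=...)`. I have not confirmed that this resolves the loading failure described above.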

I extensively tried every method I could find in the documentation, with little success; the entire process took around 10 hours. I looked for tutorials to help, with no luck. For now, I've given up on trying to get it to work by myself. I need to find some resource, online or in person, to help. In hindsight, I probably should have done this earlier.

### On the bright side, at least I learnt a few things from the struggle:
- First, how transformers requires a dataset to be formatted in a specific way, and that HuggingFace has a website dedicated to storing datasets in an already formatted way.
- Experience in reading through documentation and troubleshooting.
- The fact that sometimes URLs don't work, and that last week's code had a solution for this, but I couldn't integrate it with HuggingFace's load_dataset.
- That different loading methods require either paths to the audio files or the audio itself in the .csv file.
- That audio is stored either as a file, such as .ogg, or as an array.
- How librosa is a module that can convert audio files into said arrays.
- That memory errors will occur from trying to do too much at once. I could get my last method to work if I figured out a way to split up the data, but regardless this approach is inefficient considering we already have the files so it's better to find a different method.
- How to use os to copy files and folders over, or search and retrieve their file paths.
- The fact that, for some datasets like BirdCLEF, there is a metadata.csv file with a column for the paths of the audio files.
- That for advanced dataset formatting, for HuggingFace, you can create a .py script to do things exactly as you want.
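
On the memory point above, one way to split up the data is to load files a batch at a time instead of all at once. This is only a sketch of that idea: `iter_batches` is a hypothetical helper, and `load_fn` stands for whatever turns a path into an array, for example `lambda p: librosa.load(p, sr=None)[0]`.

```python
def iter_batches(paths, load_fn, batch_size=32):
    """Yield lists of loaded items, holding at most `batch_size` in memory at once."""
    for start in range(0, len(paths), batch_size):
        # Only this slice of files is loaded; the previous batch can be freed.
        yield [load_fn(path) for path in paths[start:start + batch_size]]
```

Each batch can then be processed and discarded before the next one is loaded, rather than writing every array into one huge .csv column.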

## Finished fast.ai lesson 9:

This lesson was the first of fast.ai part 2 and a very well taught one.
In it, Jeremy described conceptually how stable diffusion, a crazy new image generation model, works.
Due to its difficulty, the lesson took me a full day to complete, but it was well worth it. The ideas and skills I'm being introduced to will prove really helpful for the project going forwards.
Next week, the lesson will focus on programming stable diffusion from scratch and, building on that, how to program your own custom Python machine learning libraries. This is vital because it would allow me not just to copy other people's code to solve the cocktail problem, but to implement my own ideas and test things, perhaps even at a research level.

My post for lesson 9 can be found at: https://exiomius.quarto.pub/blog/posts/2022-10-11-L9Blog.md.html

## Finished CLA lesson 1:

Computational Linear Algebra is a fast.ai course that teaches linear algebra centered around practical applications and algorithms. <br>
More info and my lesson 1 blog post can be found here: https://exiomius.quarto.pub/blog/posts/2022-10-17-CLA1.md.html

## Useful Datasets Found

- BirdCLEF 2022 uses data from xeno-canto, implying that last week's approach of downloading recordings from there is a good idea.
- I found ESC-50, a dataset of labeled environmental audio recordings at https://dagshub.com/kinkusuma/esc50-dataset, also at https://huggingface.co/datasets/ashraq/esc50. These include sounds like rain, sea waves, animals.
- I found that the Machine Listening Lab at Queen Mary University of London run a birdsong competition and have many datasets that I could possibly use at http://machine-listening.eecs.qmul.ac.uk/bird-audio-detection-challenge/.

## Useful Research Tools
- Scholarcy summarises research articles.
- https://inciteful.xyz/ is good for finding papers.
- I was told that Prostudy is useful for keeping resources stored for a dissertation. 

## A Similar Thesis
My friend's friend wrote a thesis similar in aim to mine last year. <br>

Title: Using mel-frequency cepstral coefficients and principal components analysis to classify bird vocalisations based on citizen science recordings. <br>
Student Name: Alex Dyfrig Swainston. <br>

I messaged Alex and got a copy, and he said he's happy to help if I have any questions.

## New Ideas:
Here are a few new ideas I had about tackling the cocktail problem.

A big issue is the lack of properly labeled data for soundscapes.
The biology department painstakingly hand-labeled some soundscapes, but it is a difficult and time-consuming task that even great ecologists struggle with. What if there was a way to create our own soundscapes that are already labeled?
For instance, we have plenty of data from xeno-canto of individual bird songs with varying amounts of noise. What if I also found some audio files of forest environments, and I created a model to combine xeno-canto bird songs with these to imitate a real soundscape? This way, I could create an endless amount of soundscapes to train on, and the birds within them would be labeled!

### To create a soundscape:
- I could download bird song(s),
- Cut out various parts of them, e.g. if it's 3 minutes long, I cut out random intervals of 20-30 seconds to imitate the bird moving or other sounds overpowering their song,
- Randomly vary how loud the bird songs are,
- Add in environmental sounds like a forest soundscape (but being careful there are no birds present!),
- Use a noise function (like the one used in stable diffusion) to randomly add noise. Alternatively, find a way to make a model that can generate the kind of real noise recorded by microphones, and use that.

I could put multiple birdsongs in the same artificial soundscape, and even make them overlap, but I need to be careful that the combination is realistic for a single environment. I mean I shouldn't put two birds together that geographically would never meet, or two birds that never sing at the same time of day, or in general use environmental sounds that don't match the birds present.
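
The recipe above could be prototyped with something like the following NumPy sketch. Everything here is a placeholder of mine: the function name `make_soundscape`, the default ranges, and plain Gaussian white noise standing in for the noise function. It also assumes all audio is mono, already loaded as arrays at the same sample rate, and it does nothing to check that the chosen birds realistically share an environment.

```python
import numpy as np


def make_soundscape(background, songs, sample_rate, rng,
                    clip_seconds=(20, 30), gain_range=(0.2, 1.0), noise_std=0.01):
    """Mix random excerpts of bird songs into a bird-free background.

    `songs` maps a species label to a 1-D array. Returns the mixture and a
    list of (label, start_second, end_second) annotations.
    """
    mix = np.array(background, dtype=np.float64)  # copy, so the input is untouched
    annotations = []
    for label, song in songs.items():
        song = np.asarray(song, dtype=np.float64)
        # Random excerpt length, capped by the song and background lengths.
        clip_len = int(rng.uniform(*clip_seconds) * sample_rate)
        clip_len = min(clip_len, len(song), len(mix))
        src = int(rng.integers(0, len(song) - clip_len + 1))  # where to cut from
        dst = int(rng.integers(0, len(mix) - clip_len + 1))   # where to paste
        gain = rng.uniform(*gain_range)                       # random loudness
        mix[dst:dst + clip_len] += gain * song[src:src + clip_len]
        annotations.append((label, dst / sample_rate, (dst + clip_len) / sample_rate))
    # Stand-in for microphone noise: Gaussian white noise over the whole clip.
    mix += rng.normal(0.0, noise_std, size=len(mix))
    return mix, annotations
```

Because the excerpts are pasted at independent random positions, songs can overlap, and the returned annotations are exactly the labels that are so painful to produce by hand.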

Another idea is to somehow add geographical data to the dataset, perhaps with a satellite image as an additional input.