In [1]:
import IPython.display as ipd

> This report details our process for analyzing bird audio, with some snippets of code. You can find the full project repo [on github](https://github.com/adithyabsk/pracds_final).

# Introduction

In this project, we aimed to accurately classify bird sounds as songs or calls. We took three approaches to this classification task, each with its own model: one based on recording metadata, one on the audio data itself, and one on spectrogram images of the recordings.

![pipeline overview](/assets/pipeline.svg)

## Motivation
One motivation for this project is practical and concrete: to make it faster and easier for scientists to collect data on bird populations, verify community-sourced labels, and so on.

The other motivation is more open-ended: one of curiosity and self-education to understand the “hidden” insights in bird sounds. Bird calls reveal regional dialects, a sense of humor, information about predators in the area, indicators of ecosystem health - and inevitably also the threat to their ecosystems posed by human activity. By exploring bird call audio data, we hope to build towards a better understanding of the impacts of the "anthrophony" (sounds produced by humans) and become better listeners.

## Songs vs Calls
Bird sounds have a variety of different dimensions, but one of the first levels of categorizing bird sounds is classifying them as a song or a call, as each has distinct functions and reveals different aspects of the birds’ ecology ([1](https://www.audubon.org/news/a-beginners-guide-common-bird-sounds-and-what-they-mean), [2](https://www.youtube.com/watch?v=4_1zIwEENt8)).



### Songs
Songs tend to be longer and more melodic, and are used for marking territory and attracting mates. Birds' song repertoire and song rate can indicate their health and the quality of their habitat, including pollutant levels and plant diversity ([3](https://en.wikipedia.org/wiki/Bird_vocalization#Function), [4](https://www.jstor.org/stable/20062442), [5](https://www.fs.usda.gov/treesearch/pubs/46856)).


In [2]:
# song sparrow - song
ipd.Audio("/work/pracds_final/notebooks/assets/574080.mp3")

### Calls
Calls are shorter than songs, and perform a wider range of functions like signalling food, maintaining social cohesion and contact, coordinating flight, resolving conflicts, and sounding alarms (distress, mobbing, hawk alarms) ([6](https://doi.org/10.1196/annals.1298.034)). Bird alarm calls can be understood and passed along across species, and have been found to encode information about the size and threat of a potential predator, so birds can respond accordingly - i.e. more intense mobbing for a higher threat ([7](https://www.nationalgeographic.com/animals/article/nuthatches-chickadees-communication-danger), [8](https://doi.org/10.1126/science.1108841)). Alarm calls can also give scientists an estimate of the number of predators in an area. 



In [3]:
# song sparrow - call
ipd.Audio("/work/pracds_final/notebooks/assets/585148.mp3")

### Useful Terms ([9](#References))
**Call** : Bird vocalization

**Song** : Courtship or territorial vocalization, typically from male songbirds in characteristic phrases

**Note** : Smallest unit of vocalization, an uninterrupted trace on the spectrogram

**Syllable** : Combination of notes that are separated by short intervals; syllables are separated by longer intervals

**Rhythmic structure** : The way notes, syllables, or calls repeat (count and rate of repetition)

**Fundamental frequency** : The call pitch, visible on a spectrogram usually as the lowest band in a stack of integer-multiple bands (harmonics)

**Harmonics** : Frequency bands at integer multiples of the fundamental frequency.

**Frequency modulation** : Changes in fundamental frequency contour during a call

**Tonal call** : Call containing the fundamental frequency and its related harmonics.

**Noisy call** : Call in which the fundamental frequency and harmonics are indistinguishable or absent. It looks like uniform noise on the spectrogram.

**Dominant frequency** : The frequency where the maximum energy of a call is concentrated.

# Related Work on Bird Sound Data
## Gender identification using acoustic analysis in birds without external sexual dimorphism ([9](https://doi.org/10.1186/s40657-015-0033-y))
![spectrograms](https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs40657-015-0033-y/MediaObjects/40657_2015_33_Fig1_HTML.gif?as=webp)
- Relevant vocal features for differentiating sex were:
    - Average fundamental frequency
    - Maximum fundamental frequency
    - Biphonation (2 independent fundamental frequencies)
    - Duration of notes
    - Number of syllables in call
    - Amplitude modulation (wideband spectra)
    - Intervals between syllables
- In some species, gender could be differentiated just by inspecting spectrograms; Discriminant Function Analysis was used when acoustic variables overlapped more across sexes.

## Regional dialects have been discovered among many bird species and the Yellowhammer is a great example ([10](http://www.yellowhammers.net/about), [11](https://doi.org/10.1093/beheco/arz114))

Birds have been observed to have regional dialects, as a result of ecological drivers, human disturbances, and cultural evolution.

A citizen science project took place in the Czech Republic starting in 2011 to study dialects of Yellowhammer birds (a species easily recognized by sight and sound). Its dialects differ in the frequency and length of their final syllables.
![yellowhammer dialects](https://bou.org.uk/wp-content/uploads/2013/05/Pipek-fig2-530x296.png)

# DVC
[Data Version Control (DVC)](https://dvc.org/) is a tool for data science projects that works like Git for data and models. We built our pipeline first in Jupyter notebooks and then in DVC, which made it easy to change parameters and run the full pipeline from one place.

# Collecting Data
For our analysis, we used audio files and metadata from xeno-canto.org. Xeno-canto (XC) is a website for collecting and sharing audio recordings of birds. Recordings and identifications on XC are sourced from the community (anyone can join), as are recording quality ratings and flags for wrong IDs.

![xenocanto](assets/xenocanto.png)

XC has a [very simple API](https://www.xeno-canto.org/explore/api) that allows us to make RESTful queries, and specify a number of [filter parameters](https://www.xeno-canto.org/help/search) including country, species, recording quality, and duration.

We used the XC API to get metadata and IDs for all recordings in the United States, loaded the JSON payload into a dataframe, and saved it as a CSV file.
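The paging logic behind such a query can be sketched as follows. This is a minimal illustration, not the project's actual download script: it assumes version 2 of the XC API (the `/api/2/recordings` endpoint with `query` and `page` parameters, returning `numPages` and `recordings` fields), and the helper names `build_url`, `fetch_json`, and `fetch_all_recordings` are hypothetical. Newer API versions may differ.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for version 2 of the xeno-canto API.
API_URL = "https://www.xeno-canto.org/api/2/recordings"


def build_url(query, page=1):
    """Return the request URL for one page of search results."""
    return f"{API_URL}?{urlencode({'query': query, 'page': page})}"


def fetch_json(url):
    """Download and decode one JSON page (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_all_recordings(query, fetch=fetch_json):
    """Collect the `recordings` list from every page of a query.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    recordings, page, num_pages = [], 1, 1
    while page <= num_pages:
        payload = fetch(build_url(query, page))
        num_pages = int(payload["numPages"])
        recordings.extend(payload["recordings"])
        page += 1
    return recordings
```

Calling `fetch_all_recordings('cnt:"United States"')` would return a list of metadata dictionaries, which can be passed to `pandas.DataFrame(...)` and written out with `to_csv()`.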

# Filtering & Labeling
Through our DVC pipeline, we further filtered the data to the top 220 unique species, recordings under 20 seconds, recordings of quality A or B, and recordings with spectrograms available on XC, which left a dataframe of 5800 recordings. We created labels (1 for call, 0 for song) by parsing the `type` column of the dataframe.
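One way the labeling step could look is sketched below. The exact parsing rules used in the pipeline are not described here, so this is an assumption: the hypothetical helper `label_from_type` looks for the words "call" and "song" in the free-text `type` field and treats entries that mention both or neither as unlabeled.

```python
def label_from_type(type_text):
    """Map an XC `type` string to 1 (call), 0 (song), or None (unclear)."""
    text = type_text.lower()
    has_call = "call" in text
    has_song = "song" in text
    if has_call == has_song:
        # Both words present, or neither: no unambiguous label.
        return None
    return 1 if has_call else 0


examples = ["alarm call", "male, song", "call, song", "drumming"]
labels = [label_from_type(e) for e in examples]
```

Rows that map to `None` would then be dropped before training, so that every remaining recording has a single, unambiguous label.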

# Exploring & Visualizing Data
With our dataset assembled, we began exploring it visually. A distribution of recordings by genus, with song-call splits, shows that the genus most represented in the dataset is the warblers (*Setophaga*), with many more song than call recordings. We can also see that, as expected, woodpeckers (*Melanerpes*), jays, magpies, and crows (*Cyanocitta*, *Corvus*) have almost no song recordings in the dataset.

![genus distribution](/assets/svc_count_vs_genus.png)

A map of recording density shows the regions most represented in the dataset.

![sample density geo](/assets/svc_sample_density_usa.png)

Given our domain knowledge that songs serve an important function in mating, we expected to see a higher proportion of songs in the spring, which is confirmed by the data! 

![time distribution](/assets/svc_vs_month.png)
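The monthly comparison plotted above boils down to a grouped proportion. The sketch below shows the arithmetic in plain Python, assuming each record carries a `YYYY-MM-DD` date string and the 0 = song, 1 = call labels described earlier; the function name `song_share_by_month` is hypothetical, and dates with a missing or malformed month are skipped.

```python
from collections import defaultdict


def song_share_by_month(records):
    """Return {month: fraction of that month's recordings labelled song (0)}."""
    totals = defaultdict(int)
    songs = defaultdict(int)
    for date, label in records:
        month = int(date.split("-")[1])
        if not 1 <= month <= 12:
            continue  # skip unknown or malformed months
        totals[month] += 1
        if label == 0:
            songs[month] += 1
    return {month: songs[month] / totals[month] for month in sorted(totals)}


records = [
    ("2020-04-12", 0),
    ("2020-04-20", 0),
    ("2020-04-25", 1),
    ("2020-11-02", 1),
    ("2019-00-00", 0),
]
shares = song_share_by_month(records)
```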

# Metadata Classification Model
In one model, we used the tabular metadata from xeno-canto (XC) entries, with some filtering and pre-processing, to train a [Logistic Regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

We used the genus, species, English name, and location (latitude and longitude) from the XC metadata. Using scikit-learn transformers, we imputed missing values, one-hot encoded the categorical columns, and min-max scaled latitude and longitude. We also extracted two additional features from the "type" notes in the XC metadata: the identified gender of the bird (male or female) and its age (juvenile or adult).

![image of metadata df](/assets/metadata_Xdf.png)
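The preprocessing described above can be sketched with a scikit-learn `ColumnTransformer`. This is an illustration under assumptions rather than the project's exact pipeline: the column names (`gen`, `sp`, `en`, `lat`, `lng`) are assumed to follow the XC metadata keys, the imputation strategies and `max_iter` value are placeholders, the gender and age features are omitted for brevity, and the four-row dataframe is toy data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

CATEGORICAL = ["gen", "sp", "en"]  # genus, species, English name (assumed keys)
NUMERIC = ["lat", "lng"]  # latitude, longitude (assumed keys)


def make_metadata_model():
    """Build an impute -> encode/scale -> logistic regression pipeline."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", MinMaxScaler()),
    ])
    preprocess = ColumnTransformer([
        ("cat", categorical, CATEGORICAL),
        ("num", numeric, NUMERIC),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy rows for illustration only; labels use 1 = call, 0 = song.
toy = pd.DataFrame({
    "gen": ["Melospiza", "Melospiza", "Corvus", "Setophaga"],
    "sp": ["melodia", "melodia", "brachyrhynchos", "petechia"],
    "en": ["Song Sparrow", "Song Sparrow", "American Crow", "Yellow Warbler"],
    "lat": [40.4, 41.0, 37.8, 44.9],
    "lng": [-79.9, -80.2, -122.3, -93.2],
})
labels = [0, 1, 1, 0]
model = make_metadata_model().fit(toy, labels)
```

Keeping the transformers and the classifier in one `Pipeline` means the encoders and scalers are fitted on the training split only and then reused unchanged on the test split.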

## Model Performance
The Logistic Regression model achieves a score of 72.4%. As a baseline, we compare this to always guessing the mean of the training labels. This mean is simply the share of the training data labelled 1 (meaning the bird sound is a call and not a song), which is 55%.
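To make the baseline concrete, here is a small sketch of how it can be computed, assuming the reported score is classification accuracy. The helper `baseline_metrics` is hypothetical: it predicts the majority training class for every test sample and, for log loss, uses the training mean as a constant predicted probability of "call".

```python
import math


def baseline_metrics(train_labels, test_labels):
    """Return (share of calls in training, baseline accuracy, baseline log loss)."""
    p_call = sum(train_labels) / len(train_labels)
    majority = 1 if p_call >= 0.5 else 0
    accuracy = sum(1 for y in test_labels if y == majority) / len(test_labels)
    # Clip the probability so log(0) cannot occur for one-class training sets.
    p = min(max(p_call, 1e-15), 1 - 1e-15)
    log_loss = -sum(
        math.log(p if y == 1 else 1.0 - p) for y in test_labels
    ) / len(test_labels)
    return p_call, accuracy, log_loss


toy_labels = [1] * 55 + [0] * 45  # toy split with 55% calls
p_call, accuracy, log_loss = baseline_metrics(toy_labels, toy_labels)
```

For a label set that is 55% calls in both splits, this gives an accuracy of 0.55 and a log loss of roughly 0.69.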

# Audio Classification Model
In another model, we used the bird audio recordings themselves (mp3 and wav files). We converted the audio files into timeseries data, then used tsfresh to extract time- and frequency-domain features, which we used to train a Logistic Regression model.

You can refer to [2.0-mk-audio-model.ipynb](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/2.0-mk-audio-model.ipynb) or the [Python scripts](https://github.com/adithyabsk/pracds_final/blob/main/pracds_final/features/proc_audio.py) in our DVC pipeline for details on this model.

## Building Audio Features

### Audio to Timeseries
Using the [librosa](https://librosa.org/) package, we loaded mp3s and wav files into arrays of timeseries data. The `librosa.load()` function is not very efficient in loading mp3 files, as it falls back to using the audioread and ffmpeg libraries (see [here](https://www.audiolabs-erlangen.de/resources/MIR/FMP/B/B_PythonAudio.html) and [here](https://stackoverflow.com/questions/59854527/librosa-load-takes-too-long-to-loadsample-mp3-files)), so it may be worth looking into alternatives like [pydub](https://github.com/jiaaro/pydub)'s `AudioSegment()` for further work with audio analysis.

### Filtering
Once the audio data was in an array format, we ran it through a high-pass [Butterworth filter](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html) to take out background noise. We tested different order and critical frequency parameters for Butterworth and Firwin filters, then plotted the spectrograms and listened to the audio before and after filtering to determine which parameters and filters best reduced background noise without clipping the bird sound frequencies. You can see [1.0-mk-audio-exploration.ipynb](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/1.0-mk-audio-exploration.ipynb) for details on this process.

![image of filter comparison](/assets/audio_comparing_filters.png)
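A high-pass Butterworth filter of the kind described above can be sketched with SciPy as follows. The cutoff (1000 Hz) and order (5) are placeholder values, not the parameters we settled on, and the zero-phase `sosfiltfilt` call is one reasonable choice among several. The synthetic signal stands in for a real recording: a 100 Hz hum plays the role of background noise and a 4 kHz tone plays the role of the bird.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def highpass(samples, sample_rate, cutoff_hz=1000.0, order=5):
    """Attenuate content below `cutoff_hz` with a zero-phase Butterworth filter."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, samples)


# Synthetic one-second clip: low-frequency hum plus a higher-pitched tone.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
hum = np.sin(2 * np.pi * 100 * t)
tone = 0.5 * np.sin(2 * np.pi * 4000 * t)
filtered = highpass(hum + tone, sample_rate)
```

With a real file, `samples` and `sample_rate` would come from `librosa.load(path, sr=None)`. Comparing spectrograms of `hum + tone` and `filtered` mirrors the before-and-after check described above.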

In [4]:
# code to load audio and filter

### Feature Selection & Extraction
Because the audio data arrays are very large, we found it was best to immediately featurize each audio file after unpacking and filtering it to avoid running out of memory. We used [tsfresh](https://tsfresh.readthedocs.io/en/latest/index.html) to extract audio features. tsfresh takes in dataframes with an id column, time column, and value column.

<img src="/assets/timeseriesdf_singleid.png" alt="time series input df for a single id" width="40%">
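The featurize-as-you-go pattern can be sketched with a generator that only ever holds one recording in memory. This is a plain-Python illustration, not tsfresh itself: `abs_energy` and `root_mean_square` follow the usual definitions of the tsfresh features with the same names, while `featurize_files` and the `load_audio` callable (standing in for loading plus filtering) are hypothetical.

```python
import math


def abs_energy(samples):
    """Sum of squared sample values."""
    return sum(x * x for x in samples)


def root_mean_square(samples):
    """Square root of the mean squared sample value."""
    return math.sqrt(abs_energy(samples) / len(samples))


def featurize_files(paths, load_audio):
    """Yield one small feature row per file, loading one recording at a time."""
    for path in paths:
        samples = load_audio(path)  # hypothetical loader: read + filter
        yield {
            "id": path,
            "abs_energy": abs_energy(samples),
            "root_mean_square": root_mean_square(samples),
        }


# Toy stand-in for audio files on disk.
fake_audio = {"a.mp3": [3.0, 4.0], "b.mp3": [1.0, -1.0]}
rows = list(featurize_files(fake_audio, fake_audio.get))
```

Because each row is only a handful of numbers, the collected rows stay small even when the raw sample arrays are very large.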

tsfresh has built-in feature calculator presets to extract certain features all at once, and also allows you to manually specify which features to extract with a dictionary parameter. In addition, it provides a useful `select_features()` function that checks the significance of each column in the feature matrix X.

Extracting the `ComprehensiveFCParameters()` and `EfficientFCParameters()` presets takes a very long time (an estimated 39 and 13 hours, respectively, for just 5% of the dataset), so we manually specified the following small set of features based on our domain background research and understanding of audio analysis.

In [5]:
manual_fc_params = {
    "abs_energy": None,
    "fft_aggregated": [{"aggtype": "centroid"}, {"aggtype": "kurtosis"}],
    "root_mean_square": None,
    "spkt_welch_density": [{"coeff": 2}, {"coeff": 5}, {"coeff": 8}],
}

![image of feature output df for all ids](/assets/audio_Xdf.png)

## Model Tuning & Performance
To begin tuning our model, we compared different feature selections using a random subset of the training and testing data. We extracted features using our manual selection, tsfresh's `MinimalFCParameters()`, and the features selected from the `MinimalFCParameters()` set on 5% of the dataset, and found the scores were around 0.58, 0.47, and 0.47 respectively. Because our manual selection performed best, and because extracting many features at once and then calculating their significance was very time-consuming, we continued with the manual selection approach.

We tested a few other manually selected features, including different fft aggregation types and zero-crossing rate, but found that they did not improve model performance. With more time, it would be worth exploring more features based on domain knowledge.

We split our full feature matrix X into training and testing sets (matching those used in the other models), then trained a Logistic Regression model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) using both the normalized and unnormalized features.

Our Logistic Regression model performs slightly above the baseline of 55%. By plotting some of our features against each other, we can already see that it is nearly impossible for us to distinguish song and call points in the feature space from a human perspective. We hope with more time spent on tuning feature selection (perhaps exploring a wider range of possible features like fundamental frequency using [tsfel](https://tsfel.readthedocs.io/en/latest/descriptions/feature_list.html)), feature parameters, and possibly feature scope (frame size) we can make our model more accurate.

| Model    | Score | LogLoss | Error |
| -------- | ----- | ------- | ----- |
| LogReg   | 0.61  | 18.75   | 0.39  |
| Baseline | 0.55  | 0.69    | 0.46  |

![image of feat plots](/assets/audio_feature_plot.png)

# Spectrogram Classification Model

We used a computer vision approach to analyze the spectrograms with fast.ai's convnet model. We took a pre-trained xresnet34 architecture, cut at the pooling layer, and trained its last layers on image data loaders of our spectrogram images.


# Results & Future Work
Across our three models, we achieved scores in a range of 60-72%, all above the baseline score of 55%. We believe that with more time to tune and ensemble the models, one could achieve an even more accurate classifier.

The classification of song vs call is the first distinction one can make in bird audio data across species, and on its own can give insights into the number of predators in an ecosystem, the timing of mating season, and other behaviors. It could also be valuable when part of a larger system of models. Once a recording is classified as a call or song, it can be analyzed in a more specific way, potentially with separate models to further classify calls or songs by species, gender, location, etc.

# References
1. "A Beginner’s Guide to Common Bird Sounds and What They Mean." [*Audubon.org.*](https://www.audubon.org/news/a-beginners-guide-common-bird-sounds-and-what-they-mean)
2. "Two Types of Communication Between Birds: Understanding Bird Language Songs And Calls." [*YouTube.*](https://www.youtube.com/watch?v=4_1zIwEENt8)
3. "Bird Vocalization." [*Wikipedia.*](https://en.wikipedia.org/wiki/Bird_vocalization#Function)
4. Gorissen, Leen, et al. “Heavy Metal Pollution Affects Dawn Singing Behaviour in a Small Passerine Bird.” *Oecologia*, vol. 145, no. 3, 2005, pp. 504–509. [JSTOR](https://www.jstor.org/stable/20062442)
5. Ortega, Yvette K.; Benson, Aubree; Greene, Erick. 2014. Invasive plant erodes local song diversity in a migratory passerine. *Ecology.* 95(2): 458-465. [Ecological Society of America](https://www.fs.usda.gov/treesearch/pubs/46856)
6. Marler, P. (2004), Bird Calls: Their Potential for Behavioral Neurobiology. Annals of the New York Academy of Sciences, 1016: 31-44. [https://doi.org/10.1196/annals.1298.034](https://doi.org/10.1196/annals.1298.034)
7. "These birds 'retweet' alarm calls—but are careful about spreading rumors." [*National Geographic.*](https://www.nationalgeographic.com/animals/article/nuthatches-chickadees-communication-danger)
8. Templeton, Christopher N., et al. “Allometry of Alarm Calls: Black-Capped Chickadees Encode Information About Predator Size.” Science, vol. 308, no. 5730, American Association for the Advancement of Science, 2005, pp. 1934–37, [doi:10.1126/science.1108841](https://doi.org/10.1126/science.1108841).
9. Volodin, I.A., Volodina, E.V., Klenova, A.V. et al. Gender identification using acoustic analysis in birds without external sexual dimorphism. Avian Res 6, 20 (2015). [https://doi.org/10.1186/s40657-015-0033-y](https://doi.org/10.1186/s40657-015-0033-y)
10. "About yellowhammers." [*Yellowhammer Dialects.*](http://www.yellowhammers.net/about)
11. Harry R Harding, Timothy A C Gordon, Emma Eastcott, Stephen D Simpson, Andrew N Radford, Causes and consequences of intraspecific variation in animal responses to anthropogenic noise, Behavioral Ecology, Volume 30, Issue 6, November/December 2019, Pages 1501–1511, [https://doi.org/10.1093/beheco/arz114](https://doi.org/10.1093/beheco/arz114)
12. "Open-source Version Control System for Machine Learning Projects." [*DVC.*](https://dvc.org/)
13. [*xeno-canto.*](https://www.xeno-canto.org/explore/api)
14. [*scikit-learn.*](https://scikit-learn.org/stable/index.html)

In [6]:
from nbformat import read

# Default path for this notebook (can be run inside the notebook)
path = "0.0-mk-final-report.ipynb"
with open(path, "r", encoding="utf-8") as f:
    nb = read(f, 4)

# Count the number of words in markdown or heading cells
word_count = sum(
    [
        len(cell["source"].replace("#", "").lstrip().split(" "))
        for cell in nb["cells"]
        if cell.cell_type in ["markdown", "heading"]
    ]
)

# Count number of lines in the notebook and subtract the number of
# lines in this cell
line_count = (
    sum(
        [
            # Filter out lines that are comments
            len(
                list(
                    filter(
                        lambda line: not (line.lstrip().startswith("#")),
                        cell["source"].split("\n"),
                    )
                )
            )
            for cell in nb["cells"]
            if cell.cell_type == "code"
        ]
    )
    - 27
)

print(f"Word Count: {word_count:,}")
print(f"Line Count: {line_count:,}")

Word Count: 2,245
Line Count: 17

