# Introduction

In this project, we aimed to accurately classify bird sounds as songs or calls. We used three approaches, with models based on recording metadata, the audio data itself, and spectrogram images of the recordings.

## Motivation
One motivation for this project is practical and concrete: to make it faster and easier for scientists to collect data on bird populations, verify community-sourced labels, etc.

The other motivation is more open-ended: curiosity and self-education, to discover the “hidden” insights in bird sounds. As novice bird-listeners ourselves, we are amazed by the variability and information encoded in bird calls. Bird calls reveal regional dialects, a sense of humor, information about predators in the area, indicators of ecosystem health, and other dimensions of their ecology, and inevitably also the threat posed by human activity. To our untrained ears, it's difficult to even distinguish calls from different species, let alone genders or pitch-shifts in reaction to noise pollution. Through the process of visualizing and analyzing features of bird call audio, we hope to build towards a better understanding of the impacts of the "anthropony" (sounds produced by humans) and become better listeners.

## Songs vs Calls
Bird sounds have a variety of different dimensions, but one of the first levels of categorizing bird sounds is classifying them as a song or a call, as each has distinct functions and reveals different aspects of the birds’ ecology. For example, the frequency of bird alarm calls can indicate the number of predators in an area [cite]().

### Useful Terms
**Call** : Bird vocalization

**Song** : Courtship or territorial vocalization, typically from male songbirds in characteristic phrases

**Note** : Smallest unit of vocalization, an uninterrupted trace on the spectrogram

**Syllable** : Combination of notes that are separated by short intervals; syllables are separated by longer intervals

**Rhythmic structure** : The way notes, syllables, or calls repeat (count and rate of repetition)

**Fundamental frequency** : The call pitch, visible on a spectrogram usually as the lowest band in a stack of integer-multiple bands (harmonics)

**Harmonics** : Frequency bands at integer multiples of the fundamental frequency.

**Frequency modulation** : Changes in fundamental frequency contour during a call

**Tonal call** : Call containing a distinct fundamental frequency and its related harmonics.

**Noisy call** : Call in which the fundamental frequency and harmonics are indistinguishable or absent. Looks like uniform noise on the spectrogram.

**Dominant frequency** : The frequency where the maximum energy of a call is concentrated.

# Related Work on Bird Sound Data
## Gender identification using acoustic analysis in birds without external sexual dimorphism
> Volodin, I.A., Volodina, E.V., Klenova, A.V. et al. Gender identification using acoustic analysis in birds without external sexual dimorphism. Avian Res 6, 20 (2015). https://doi.org/10.1186/s40657-015-0033-y ([Article Link](https://avianres.biomedcentral.com/articles/10.1186/s40657-015-0033-y))


### Background
- Determining the sex of adult birds using their calls is useful for monomorphic birds (no difference in appearance across sexes), is noninvasive, and may be done at a distance
- It is important to sex birds for wildlife management - breeding and census estimates
- Bird calls likely differ across sexes due to differences in morphology (e.g. size of beak, vocal organ, trachea), the method of producing a call, or in the type of calls/songs

### Contribution
- Researchers used spectrograms and power spectra
![spectrograms](https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs40657-015-0033-y/MediaObjects/40657_2015_33_Fig1_HTML.gif?as=webp)
- They used simple visual inspection (without measuring acoustic variables) for certain species, and Discriminant Function Analysis (DFA) when acoustic variables overlapped more across sexes.
    - DFA accuracy > 70% was considered sufficient, but low, reliability for sexing by call (50% would be equivalent to random chance)
    - DFA accuracy of 91 - 100% was considered perfect reliability
- Relevant vocal features for differentiating sex were:
    - Average fundamental frequency
    - Maximum fundamental frequency
    - Biphonation (2 independent fundamental frequencies)
    - Duration of notes
    - Number of syllables in call
    - Amplitude modulation (wideband spectra)
    - Intervals between syllables

## Further Reading
- [2009 Classifying Bird Calls with Supervised Learning](https://eecs.oregonstate.edu/research/bioacoustics/briggs_icdm09.pdf) - useful background on audio features
- [How Birds Develop Song Dialects](https://sora.unm.edu/sites/default/files/journals/condor/v077n04/p0385-p0406.pdf) - regional dialects in birdsong
- [Do bird calls of the same species differ across countries?](https://www.researchgate.net/post/Does-bird-calls-of-same-species-from-one-country-differ-from-the-calls-from-other-country) - ResearchGate forum thread with some more links

## Regional dialects have been discovered among many bird species and the Yellowhammer is a great example

[Current Project](http://www.yellowhammers.net/about)

Birds have been observed to have regional dialects, as a result of ecological drivers, [human disturbances](https://academic.oup.com/beheco/article/30/6/1501/5526711?login=true), and cultural evolution.

A famous citizen science project took place in the Czech Republic starting in 2011 to study dialects of Yellowhammer birds (a species easily recognized by sight and sound). Its dialects differ in the frequency and length of their final syllables.

It revealed two main dialect groups and the border between them, and raised further questions: How are the dialects maintained? What causes the dialect boundaries between neighboring habitats? How do dialects evolve?
![yellowhammer dialects](https://bou.org.uk/wp-content/uploads/2013/05/Pipek-fig2-530x296.png)

## Other bird audio data science


# Collecting Data
For our analysis, we used audio files and metadata from xeno-canto.org. Xeno-canto (XC) is a website for collecting and sharing audio recordings of birds. 

Recordings and identifications on XC are sourced from the community (anyone can join), as are recording quality ratings and flags for wrong IDs.

XC has a [very simple API](https://www.xeno-canto.org/explore/api) that allows us to make RESTful queries, without even requiring an API key. A request url looks like this: `https://www.xeno-canto.org/api/2/recordings?query=bearded+bellbird+q:A&page=5`.
We can also test our request by going to `https://www.xeno-canto.org/explore?query=[our-query-here]`. You can learn more about query parameters [here](https://www.xeno-canto.org/help/search).
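As a minimal sketch of how such a request URL can be assembled, the snippet below uses only the Python standard library. The helper name `build_xc_url` is hypothetical (it is not part of the project code); the endpoint and example query are the ones quoted above.

```python
from urllib.parse import urlencode

API_ROOT = "https://www.xeno-canto.org/api/2/recordings"


def build_xc_url(query: str, page: int = 1) -> str:
    """Build a request URL for the xeno-canto recordings endpoint.

    Hypothetical helper for illustration only.
    """
    # Spaces are encoded as "+"; ":" is kept literal so that search
    # tags such as "q:A" stay readable in the URL.
    params = urlencode({"query": query, "page": page}, safe=":")
    return f"{API_ROOT}?{params}"


url = build_xc_url("bearded bellbird q:A", page=5)
print(url)
```

The resulting URL can then be fetched with any HTTP client (for example `requests.get(url).json()`), one page at a time.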

Response payloads contain fields that are nearly identical to the query parameters. Some especially important fields were:
- **gen**  : genus
- **sp**   : species
- **type** : a comma-separated list of the sound types of the recording (e.g. call, song, male, female)
- **file** : the url to the audio file
- **sono** : the urls to the sonogram image files
- **loc** : the location of the recording (e.g. Pittsburgh, Allegheny County, Pennsylvania)
- **lat** & **lng** : the latitude and longitude of the recording
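To illustrate how the `type` field can be turned into a label, here is a small sketch. The function name, the sample record, and the rule of discarding recordings tagged as both (or neither) song and call are assumptions made for this example; exact-tag matching would also miss compound tags such as "alarm call".

```python
def parse_sound_type(type_field: str) -> dict:
    """Split a comma-separated `type` string into simple labels.

    Hypothetical helper: the labelling rule below is an assumption.
    """
    tags = {tag.strip().lower() for tag in type_field.split(",") if tag.strip()}
    has_song = "song" in tags
    has_call = "call" in tags
    # Keep a label only when exactly one of the two classes is present.
    if has_song == has_call:
        label = None
    else:
        label = "song" if has_song else "call"
    return {
        "label": label,
        "male": "male" in tags,
        "female": "female" in tags,
    }


# Made-up record using the field names listed above.
record = {"gen": "Procnias", "sp": "averano", "type": "call, male"}
parsed = parse_sound_type(record["type"])
print(parsed)
```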

## DVC

# Filtering

# Exploring & Visualizing Data

# Metadata Classification Model
In one model, we used the tabular metadata from xeno-canto (XC) entries, with some filtering and pre-processing, to train a Logistic Regression model.

We used the genus, species, English name, and location (latitude and longitude) from the XC metadata. Missing values were imputed using scikit-learn transformers; the categorical fields were one-hot encoded, and latitude and longitude were rescaled with min-max scaling. We also extracted two additional features from the "type" notes in the XC metadata: the identified gender (male or female) and the age of the bird (juvenile or adult).

![image of metadata df](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/assets/metadata_Xdf.png)
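The preprocessing described above could be wired together roughly as follows. This is a sketch under stated assumptions, not the project's actual pipeline: the column name `en` for the English name, the imputation strategies, and the four made-up rows are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical = ["gen", "sp", "en"]  # genus, species, English name (assumed names)
numeric = ["lat", "lng"]

preprocess = ColumnTransformer([
    # Categorical fields: fill gaps with a placeholder, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
    # Coordinates: fill gaps with the median, then rescale to [0, 1].
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", MinMaxScaler()),
    ]), numeric),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny made-up frame, only to show that the pipeline fits end to end.
X = pd.DataFrame({
    "gen": ["Procnias", "Turdus", "Turdus", np.nan],
    "sp": ["averano", "migratorius", "merula", "merula"],
    "en": ["Bearded Bellbird", "American Robin",
           "Common Blackbird", "Common Blackbird"],
    "lat": [10.5, 40.4, np.nan, 51.0],
    "lng": [-61.3, -80.0, 14.4, np.nan],
})
y = [1, 0, 0, 1]  # 1 = call, 0 = song

model.fit(X, y)
pred = model.predict(X)
```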

## Model Performance
The metadata model achieved a score of .724.

# Audio Classification Model
For one of our models, we used the bird audio recordings themselves (mp3 and wav files). We converted the audio files into timeseries data, then used tsfresh to extract time- and frequency-domain features, which we used to train a Logistic Regression model.

You can refer to [2.0-mk-audio-model.ipynb](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/2.0-mk-audio-model.ipynb) or the [python scripts](https://github.com/adithyabsk/pracds_final/blob/main/pracds_final/features/proc_audio.py) in our dvc pipeline for details on this model.

## Building Audio Features

### Audio to Timeseries
Using the [librosa](https://librosa.org/) package, we loaded mp3s and wav files into arrays of timeseries data. The `librosa.load()` function is not very efficient in loading mp3 files, as it falls back to using the audioread and ffmpeg libraries (see [here](https://www.audiolabs-erlangen.de/resources/MIR/FMP/B/B_PythonAudio.html) and [here](https://stackoverflow.com/questions/59854527/librosa-load-takes-too-long-to-loadsample-mp3-files)), so it may be worth looking into alternatives like [pydub](https://github.com/jiaaro/pydub)'s `AudioSegment()` for further work with audio analysis.

### Filtering
Once the audio data was in an array format, we ran it through a high-pass [Butterworth filter](https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.butter.html) to take out background noise. We tested different order and critical frequency parameters for Butterworth and Firwin filters, then plotted the spectrograms and listened to the audio before and after filtering to determine which parameters and filters best reduced background noise without clipping the bird sound frequencies. You can see [1.0-mk-audio-exploration.ipynb](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/1.0-mk-audio-exploration.ipynb) for details on this process.

![image of filter comparison](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/assets/audio_comparing_filters.png)
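A high-pass Butterworth filter of the kind described above can be sketched with SciPy as follows. The cutoff (1000 Hz), the order (5), the zero-phase `sosfiltfilt` application, and the synthetic test signal are assumptions chosen for illustration; they are not the parameters selected in the exploration notebook.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def highpass(signal, sr, cutoff_hz=1000.0, order=5):
    """Apply a zero-phase Butterworth high-pass filter (illustrative values)."""
    sos = butter(order, cutoff_hz, btype="highpass", fs=sr, output="sos")
    return sosfiltfilt(sos, signal)


def band_amplitude(x, sr, freq_hz):
    """Amplitude of the FFT bin closest to `freq_hz`."""
    spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)
    return spectrum[int(round(freq_hz * len(x) / sr))]


# Synthetic one-second signal: low-frequency "rumble" plus a higher tone
# standing in for a bird sound.
sr = 22050
t = np.arange(sr) / sr
rumble = np.sin(2 * np.pi * 100 * t)
tone = 0.5 * np.sin(2 * np.pi * 3000 * t)
filtered = highpass(rumble + tone, sr)

low_after = band_amplitude(filtered, sr, 100)
high_after = band_amplitude(filtered, sr, 3000)
print(f"100 Hz amplitude after filtering: {low_after:.4f}")
print(f"3000 Hz amplitude after filtering: {high_after:.4f}")
```

Comparing the amplitude of a known band before and after filtering is a quick numeric complement to inspecting spectrograms and listening to the audio.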

In [None]:
# code to load audio and filter

### Feature Selection & Extraction
From the filtered audio data arrays, we used [tsfresh](https://tsfresh.readthedocs.io/en/latest/index.html) to extract audio features. tsfresh takes in dataframes with an id column, a time column, and a value column. We extracted features from one audio file at a time, so the id column of each dataframe held a single repeated id.

![image of time series input df for a single id]()
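The id / time / value layout can be built from a single audio array along these lines. The helper names, the column names, and the recording id `"XC0001"` are hypothetical; `featurize` shows how the frame would be handed to tsfresh and is defined but not executed here.

```python
import numpy as np
import pandas as pd


def to_long_frame(signal, audio_id):
    """Arrange one audio array in an id / time / value layout."""
    return pd.DataFrame({
        "id": audio_id,                     # same id repeated on every row
        "time": np.arange(len(signal)),     # sample index used for ordering
        "value": np.asarray(signal, dtype=float),
    })


def featurize(frame, fc_params):
    """Extract features for one recording (requires tsfresh to be installed)."""
    from tsfresh import extract_features

    return extract_features(
        frame,
        column_id="id",
        column_sort="time",
        column_value="value",
        default_fc_parameters=fc_params,
        disable_progressbar=True,
        n_jobs=0,  # no multiprocessing for a single small frame
    )


# Synthetic stand-in for a decoded audio array.
frame = to_long_frame(np.sin(np.linspace(0, 20, 500)), audio_id="XC0001")
print(frame.head())
```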

> Note: we found it was best to immediately featurize each audio file after unpacking it into a timeseries array. The timeseries arrays are very large, so trying to store them for all audio files actually crashed our notebook and made us run out of memory! (We estimated it would be a dataframe of about 150 million rows to store all 5800 audio files as arrays.)
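The featurize-as-you-go pattern from the note can be sketched like this. The loader and the two summary features are placeholders invented for the example (a real run would decode audio files and call tsfresh); the point is only that each array is reduced to one small row before the next one is loaded.

```python
import numpy as np
import pandas as pd


def fake_loader(n_files, n_samples=25_000, seed=0):
    """Stand-in for decoding audio files: yields (id, array) one at a time."""
    rng = np.random.default_rng(seed)
    for i in range(n_files):
        yield f"rec_{i}", rng.standard_normal(n_samples)


def summarize(signal):
    """Placeholder featurizer returning a small dict per recording."""
    return {
        "n_samples": len(signal),
        "peak_abs": float(np.max(np.abs(signal))),
    }


rows = {}
for audio_id, signal in fake_loader(n_files=5):
    # Only the current array is held in memory; the previous one can be
    # garbage-collected once its features have been stored.
    rows[audio_id] = summarize(signal)

features = pd.DataFrame.from_dict(rows, orient="index")
print(features)
```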

tsfresh has built-in feature calculator presets to extract certain features all at once, and also allows you to manually specify which features to extract with a dictionary parameter. In addition, it provides a useful `select_features()` function that checks the significance of each column in the feature matrix X with respect to the target labels.

Extracting the `ComprehensiveFCParameters()` and `EfficientFCParameters()` feature sets takes a very long time (an estimated 39 and 13 hours, respectively, for just 5% of the dataset), so we manually specified the following small set of features based on our domain background research and understanding of audio analysis.

In [None]:
manual_fc_params = {
    "abs_energy": None,
    "fft_aggregated": [{"aggtype":"centroid"}, {"aggtype":"kurtosis"}],
    "root_mean_square": None,
    "spkt_welch_density": [{"coeff":2},{"coeff":5},{"coeff":8}]
}
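To give a sense of what these features measure, the sketch below computes rough NumPy/SciPy stand-ins on a synthetic tone. These are illustrative approximations; tsfresh's own implementations may differ in detail (for instance in how frequencies are indexed), and the function names and test signal are assumptions.

```python
import numpy as np
from scipy.signal import welch


def abs_energy(x):
    """Sum of squared sample values."""
    return float(np.dot(x, x))


def root_mean_square(x):
    """Square root of the mean squared sample value."""
    return float(np.sqrt(np.mean(np.square(x))))


def spectral_centroid_hz(x, sr):
    """Magnitude-weighted mean frequency of the FFT spectrum, in Hz."""
    magnitude = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / sr)
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))


def welch_density(x, sr, coeff):
    """Welch power spectral density at one frequency-bin index."""
    _, pxx = welch(x, fs=sr, nperseg=min(len(x), 256))
    return float(pxx[coeff])


# One second of a pure 1000 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

energy = abs_energy(tone)
rms = root_mean_square(tone)
centroid = spectral_centroid_hz(tone, sr)
density = welch_density(tone, sr, coeff=5)
print(energy, rms, centroid, density)
```

For a pure tone the centroid sits at the tone's frequency, which is why a spectral centroid is a plausible summary of where a bird sound's energy is concentrated.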

![image of feature output df for all ids]()

## Model Tuning & Performance
To begin tuning our model, we compared different feature selections using a random subset (5%) of the training and testing data. We extracted features using (1) our manual selection, (2) tsfresh's `MinimalFCParameters()`, and (3) the features selected as significant from the `MinimalFCParameters()` set, and found the scores were around .58, .47, and .47 respectively. Because our manual selection performed best, and because the process of extracting many features at once and then calculating significance was very time consuming, we continued with the manual selection approach.

We tested a few other manually selected features, including different fft aggregation types and zero-crossing rate, but found that they did not improve model performance. With more time, it would be worth exploring more features based on domain knowledge.

We split our full feature matrix X into training and testing sets (matching those used in the other models), then trained a Logistic Regression model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) using both the normalized and unnormalized features.
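The comparison between normalized and unnormalized features can be sketched as below. The synthetic feature matrix, the 75/25 split, and the use of `StandardScaler` as the normalization step are assumptions for illustration; the real features and split come from the project's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: two weakly informative
# columns on very different scales (as energy-like features often are).
n = 400
y = rng.integers(0, 2, size=n)
X = np.column_stack([
    rng.normal(loc=y * 1.0, scale=1.0),
    rng.normal(loc=y * 1e3, scale=1e3),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same classifier with and without a scaling step in front of it.
raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scaled = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
).fit(X_train, y_train)

scores = {
    "unnormalized": raw.score(X_test, y_test),
    "normalized": scaled.score(X_test, y_test),
}
print(scores)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking test-set statistics into the model.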

We can compare our model's performance to a baseline of guessing the mean of the training labels. This mean is also the percentage of the training data labelled 1 (meaning the bird sound is a call and not a song), which is 55%. Our Logistic Regression model performs slightly above the baseline. When we plot some of our features against each other, the song and call points overlap so heavily that it is nearly impossible to distinguish them by eye in the feature space. We hope that with more time spent on tuning feature selection (perhaps exploring a wider range of possible features like fundamental frequency using [tsfel](https://tsfel.readthedocs.io/en/latest/descriptions/feature_list.html)), feature parameters, and possibly feature scope (frame size), we can make our model more accurate.

| Model      | Score      | LogLoss | Error    |
| ---------- | ---------- | ------- | -------- |
| LogReg     | .61        | 18.75   | .39      |
| Baseline   | .55        | .69     | .46      |
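As a quick check on the baseline row, the expected accuracy and log loss of always predicting the training mean can be worked out directly. This sketch assumes the evaluation labels have the same 55% share of calls as the training labels, which may not hold exactly for the actual test split.

```python
import math

p_call = 0.55  # share of training labels equal to 1 (call)

# Always guessing the majority class is right 55% of the time.
baseline_accuracy = max(p_call, 1 - p_call)

# Log loss of predicting probability p_call for every sample, when a
# fraction p_call of the true labels are 1 (assumption stated above).
baseline_log_loss = -(
    p_call * math.log(p_call) + (1 - p_call) * math.log(1 - p_call)
)

print(round(baseline_accuracy, 2), round(baseline_log_loss, 2))  # 0.55 0.69
```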

![image of feat plots](https://github.com/adithyabsk/pracds_final/blob/main/notebooks/assets/audio_feature_plot.png)

# Spectrogram Classification Model

We used a computer vision approach to analyze spectrograms by training a fastai CNN, based on the pre-trained model xresnet.

# Results

- This classification could be the first level in a hierarchy of models or data analyses.
- Once a sound is known to be a call or a song, it can be analyzed in a more specific way.

In [2]:
from nbformat import read

# Default path for this notebook (can be run inside the notebook)
path = "0.0-mk-final-report.ipynb"
with open(path, "r", encoding="utf-8") as f:
    nb = read(f, 4)

# Count the number of words in markdown or heading cells
word_count = sum(
    [
        len(cell["source"].replace("#", "").lstrip().split(" "))
        for cell in nb["cells"]
        if cell.cell_type in ["markdown", "heading"]
    ]
)

# Count number of lines in the notebook and subtract the number of
# lines in this cell
line_count = (
    sum(
        [
            # Drop comment lines (lines starting with "#")
            len(
                list(
                    filter(
                        lambda line: not (line.lstrip().startswith("#")),
                        cell["source"].split("\n"),
                    )
                )
            )
            for cell in nb["cells"]
            if cell.cell_type == "code"
        ]
    )
    - 27
)

print(f"Word Count: {word_count:,}")
print(f"Line Count: {line_count:,}")

Word Count: 1,921
Line Count: 14
