# DS@GT BirdCLEF 2023 Assessment Notebook

[BirdCLEF](https://ceur-ws.org/Vol-3180/paper-154.pdf) is a classification and prediction competition that runs annually as part of the Conference and Labs of the Evaluation Forum (CLEF).
The goal of the competition is to help classify soundscapes containing bird songs. 

This notebook tests familiarity with tools used in the project software stack for [BirdCLEF 2023](https://www.imageclef.org/BirdCLEF2023).
The dataset is derived from the 2022 competition training dataset and features from the [BirdNET project](https://birdnet.cornell.edu/).
You can use any library available to you, and reference external documentation.

Make a copy of the notebook: `File > Save a copy in Drive`. 
When you are done with the notebook, share the results with acmiyaguchi@gatech.edu.

Also see the [DS@GT Kaggle Competition Team proposal for the BirdCLEF 2023 team](https://docs.google.com/document/d/13Nq0RmNe714f2YCTXfG5BjuHRBc_3gBVd4ggzQ5d4MM/edit?usp=sharing).


In [None]:
# install necessary packages, and anything else you might want
!pip install pyspark umap-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

from pyspark.sql import SparkSession
import os
import sys


def get_spark(cores=4, memory="2g"):
    """Get a spark session for a single driver."""
    os.environ["PYSPARK_PYTHON"] = sys.executable
    os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return (
        SparkSession.builder.config("spark.driver.memory", memory)
        .config("spark.driver.cores", cores)
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )


# get the dataset for the assessment
url = "https://storage.googleapis.com/birdclef-2023/data/processed/2022-01-15-assessment.parquet"
df = pd.read_parquet(url)
display(df.head())

# get a spark context and create a dataframe for it
spark = get_spark()
spark_df = spark.createDataFrame(df)
spark_df.printSchema()
spark_df.show(n=1, vertical=True, truncate=80)

## 1. embeddings

An embedding maps vectors in $\mathbb{R}^n$ to $\mathbb{R}^m$ such that distances (e.g., under the $L_1$ or $L_2$ norm) are preserved, at least approximately.
In this question, we examine embeddings generated by passing training birdcall audio through [BirdNET-Analyzer](https://github.com/kahst/BirdNET-Analyzer).
This model accepts audio at a 48 kHz sampling rate in 3-second clips, extracts a spectrogram, and applies a deep convolutional neural network to classify the type of call in the clip. We can extract the second-to-last layer as an embedding (i.e., a vector in $\mathbb{R}^{320}$). See the [paper](https://www.sciencedirect.com/science/article/pii/S1574954121000273) for more details.
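
To make the clip framing concrete, here is a minimal sketch of cutting a recording into 3-second clips at 48 kHz (144,000 samples each). The function name `split_into_clips` and the zero-padding of the final clip are assumptions made for illustration; they are not taken from BirdNET-Analyzer, which may handle the trailing audio differently.

```python
import numpy as np

SAMPLE_RATE = 48_000  # Hz, as stated above
CLIP_SECONDS = 3      # seconds per clip, as stated above


def split_into_clips(signal, sample_rate=SAMPLE_RATE, clip_seconds=CLIP_SECONDS):
    """Split a 1-D signal into fixed-length clips (hypothetical helper).

    The last clip is zero-padded; this is an assumption of the sketch.
    """
    clip_len = sample_rate * clip_seconds
    n_clips = max(1, -(-len(signal) // clip_len))  # ceiling division
    padded = np.zeros(n_clips * clip_len, dtype=signal.dtype)
    padded[: len(signal)] = signal
    return padded.reshape(n_clips, clip_len)


# 7 seconds of synthetic audio; random noise stands in for a recording
rng = np.random.default_rng(0)
audio = rng.standard_normal(7 * SAMPLE_RATE).astype(np.float32)
clips = split_into_clips(audio)
print(clips.shape)  # (3, 144000)
```

Each row of `clips` would then be one input to the model, and each input yields one embedding.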

For this notebook, we keep one embedding per audio track: the one from the clip with the most confident prediction. We select the three most common species in the dataset for analysis: [skylar], [normoc], and [houspa].

[skylar]: https://ebird.org/species/skylar
[normoc]: https://ebird.org/species/normoc
[houspa]: https://ebird.org/species/houspa

### (a) Plot the BirdNET embeddings in a 2D or 3D scatterplot colored by `primary_label`.

Hint: Use PCA or UMAP for dimensionality reduction.
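
As background for the hint, the following is a rough sketch of what PCA does, written with plain NumPy on synthetic data rather than the assessment dataset. The names `X_demo` and `pairwise_l2` are hypothetical, and the random matrix only stands in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for embeddings: 50 points in 320 dimensions
X_demo = rng.standard_normal((50, 320))

# PCA via SVD: center the data, then project onto the top-2 right singular vectors
X_centered = X_demo - X_demo.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T


def pairwise_l2(A):
    """Dense matrix of Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


d_high = pairwise_l2(X_centered)
d_low = pairwise_l2(X_2d)
print(X_2d.shape)  # (50, 2)
# an orthogonal projection can only shrink pairwise distances
print(bool(np.all(d_low <= d_high + 1e-9)))  # True
```

How much of the original geometry survives the projection depends on the data, which is one reason to compare more than one reduction method.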

In [None]:
from umap import UMAP
from sklearn.decomposition import PCA

X = np.stack(df.emb)
primary_label = df.primary_label
birdnet_label = df.birdnet_label

# TODO: build a plot that differentiates between species

### (b) Describe the geometry of 1D embeddings and how you might use it to visualize data.

No code is required for the question; a short answer will suffice.

## 2. k-nn classification

A [K-Nearest Neighbor (K-NN) Classifier](https://scikit-learn.org/stable/modules/neighbors.html#classification) uses a majority vote of neighbors to assign a label to a point.
We are interested in using K-NN classification as a semi-supervised labeling method on unlabeled training data.
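
The majority-vote idea can be sketched in a few lines of NumPy on toy data. This is a brute-force illustration only; `knn_predict`, `X_ref`, and `y_ref` are hypothetical names, Euclidean distance is an arbitrary choice, and ties are broken by whichever label was counted first.

```python
from collections import Counter

import numpy as np


def knn_predict(X_ref, y_ref, X_query, k=3):
    """Label each query point by majority vote of its k nearest reference points (L2)."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_ref - q, axis=1)  # distance to every reference point
        nearest = np.argsort(dists)[:k]            # indices of the k closest points
        votes = Counter(y_ref[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds


# two well-separated toy clusters in 2-D
X_toy = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
y_toy = ["a", "a", "a", "b", "b", "b"]
toy_preds = knn_predict(X_toy, y_toy, np.array([[0.05, 0.05], [4.9, 5.0]]))
print(toy_preds)  # ['a', 'b']
```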

In this question, we will implement and analyze a K-NN classifier implemented in SQL (or PySpark Dataframes).
We test familiarity with SQL concepts, such as `GROUP BY`, `JOIN`, and `WINDOW` functions.


### (a) How would you compute the nearest neighbors of all points in the BirdNET embedding dataset?

Restated: given a set of vectors in $\mathbb{R}^{n}$, how would you compute the $k$ closest neighbors of each vector?

What distance metric would you use? What algorithms or libraries would you use?  What if the dataset does not fit into main memory? No code is required for the question; a short answer will suffice.

### (b) What percentage of primary labels are correctly predicted by BirdNET for each species?

You are given a table named `labels` defined by the following schema:

name | type | description
-|-|-
id | integer | A unique identifier for a given audio clip
primary_label | string | The species label assigned to the clip in the training dataset
birdnet_label | string | The species label assigned by BirdNET Analyzer

Compute how often the BirdNET analyzer matches the assigned training dataset label.

---

For this question, you **must** answer using PySpark SQL (a mostly-compliant ANSI SQL dialect) or the PySpark Dataframe API.
See the [Spark SQL reference](https://spark.apache.org/docs/3.2.0/sql-ref.html) and [API docs](https://spark.apache.org/docs/3.1.2/api/python/reference/pyspark.sql.html) for more information.

Hint: common table expressions (CTEs) are useful for organizing queries, and for reusing subqueries.
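
For readers unfamiliar with CTEs, here is a small sketch of the syntax on an unrelated toy table. It runs on SQLite through Python's standard library rather than Spark, so treat it as an illustration of the `WITH` clause only; the `recordings` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recordings (id INTEGER, species TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO recordings VALUES (?, ?, ?)",
    [(1, "sp_a", 30), (2, "sp_a", 30), (3, "sp_b", 20), (4, "sp_c", 20)],
)

# each CTE names an intermediate result that later parts of the query can reuse
query = """
    WITH per_species AS (
        SELECT species, SUM(seconds) AS total_seconds
        FROM recordings
        GROUP BY species
    ),
    overall AS (
        SELECT SUM(total_seconds) AS all_seconds
        FROM per_species
    )
    SELECT species, 100.0 * total_seconds / all_seconds AS pct
    FROM per_species
    CROSS JOIN overall
    ORDER BY pct DESC, species
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('sp_a', 60.0), ('sp_b', 20.0), ('sp_c', 20.0)]
```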



In [None]:
from pyspark.sql import functions as F, Window

labels = spark_df.select("id", "primary_label", "birdnet_label")
labels.createOrReplaceTempView("labels")
labels.printSchema()

# NOTE: example of using the sql interface to spark
spark.sql(
    """
      SELECT
          primary_label,
          count(*) as n
      FROM labels
      GROUP BY 1
      ORDER BY n DESC
    """
).show()

# NOTE: the same query, but using the dataframe interface
labels.groupBy("primary_label").agg(F.count("*").alias("n")).orderBy(F.desc("n")).show()

In [None]:
spark.sql(
    """
      -- TODO: implement
      select null;
    """
).show()

### (c) Compute labels for each point in the dataset using k-nn classification.

For each id, find the most common label among its neighbors using the mode of the BirdNET labels.
Run the `compute_pct` function to compute the percentage of matching predictions.
How does this answer compare to the answer in (b)?

---

For this question, you **must** answer using PySpark SQL (a mostly-compliant ANSI SQL dialect) or the PySpark Dataframe API.
See the [Spark SQL reference](https://spark.apache.org/docs/3.2.0/sql-ref.html) and [API docs](https://spark.apache.org/docs/3.1.2/api/python/reference/pyspark.sql.html) for more information.

Hint: Use a window function to assign a row number or rank based on counts.
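
As a syntax reminder for the hint, the sketch below ranks rows within groups on an unrelated toy table. It uses SQLite through Python's standard library instead of Spark (window functions need SQLite 3.25 or newer), and the `ratings` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (recordist TEXT, recording_id INTEGER, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("ann", 1, 3.5), ("ann", 2, 4.5), ("bob", 3, 2.0), ("bob", 4, 2.0), ("bob", 5, 1.0)],
)

# ROW_NUMBER restarts at 1 for every recordist; the secondary sort key
# (recording_id) makes the choice deterministic when ratings tie
query = """
    WITH ranked AS (
        SELECT
            recordist,
            recording_id,
            rating,
            ROW_NUMBER() OVER (
                PARTITION BY recordist
                ORDER BY rating DESC, recording_id
            ) AS rn
        FROM ratings
    )
    SELECT recordist, recording_id, rating
    FROM ranked
    WHERE rn = 1
    ORDER BY recordist
"""
best = conn.execute(query).fetchall()
print(best)  # [('ann', 2, 4.5), ('bob', 3, 2.0)]
```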

In [None]:
from pyspark.sql import functions as F, Window

neighbors = spark_df.select(
    "id", F.posexplode("neighbors").alias("pos", "neighbor_id")
).orderBy("id", "pos")
neighbors.createOrReplaceTempView("neighbors")
neighbors.printSchema()

# NOTE: example of using the neighbors table with the labels table
spark.sql(
    """
      SELECT
          neighbors.id,
          pos,
          neighbor_id,
          primary_label
      FROM neighbors
      JOIN labels ON neighbors.neighbor_id = labels.id
      LIMIT 5
  """
).show()

# NOTE: the same join, but using the dataframe interface
neighbors.join(labels.withColumnRenamed("id", "neighbor_id"), on="neighbor_id").select(
    "id", "pos", "neighbor_id", "primary_label"
).limit(5).show()

In [None]:
knn_result = spark.sql(
    """
      -- TODO: implement
      SELECT
        id,
        'N/A' as birdnet_label
      FROM labels
      GROUP BY 1
    """
)
assert knn_result.columns == ["id", "birdnet_label"], "mismatched columns"
assert knn_result.count() == labels.count(), "mismatched row counts"
knn_result.groupBy("birdnet_label").count().orderBy(F.desc("count")).show()

In [None]:
# NOTE: to avoid spoiling the answer to part (b), we compute percentages in python/pandas
def compute_pct(knn_result, labels):
    temp_knn_df = knn_result.toPandas()
    temp_labels_df = labels.select("id", "primary_label").toPandas()

    joined = temp_knn_df.merge(temp_labels_df, on="id")

    counts = {}
    for label in temp_labels_df.primary_label.unique():
        sub = joined[joined.primary_label == label]
        a = sub[sub.birdnet_label == label].shape[0]
        b = sub.shape[0]
        counts[label] = a / b * 100

    spark.createDataFrame(
        [(k, float(v)) for k, v in counts.items()], ["label", "pct"]
    ).orderBy(F.desc("pct")).show()


compute_pct(knn_result, labels)

## 3. transfer learning

[Transfer learning](https://en.wikipedia.org/wiki/Transfer_learning) is a problem that involves transferring knowledge from one problem domain to another.
Recent advances in image generation that build on language-model embeddings, such as [CLIP](https://openai.com/blog/clip/) and [Stable Diffusion](https://en.wikipedia.org/wiki/Stable_Diffusion), can be considered a type of transfer learning.

We are interested in reusing/fine-tuning the embedding layer of the BirdNET Analyzer on a more specific task in the upcoming BirdCLEF 2023 competition.
In this question, we explore the use of embeddings in a simple classification task.
We also test some familiarity with tools like [PyTorch](https://pytorch.org/).

We create a binary classification task, which involves predicting whether an audio clip is `normoc` or not.
We also limit our dataset to clips that are labeled `normoc` or `skylar`.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

subset = df[df.primary_label.isin(["normoc", "skylar"])]
X = np.stack(subset.emb)
y = subset.primary_label.eq("normoc").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y)

### (a) Fit BirdNET embeddings to a logistic regression model

Report the test accuracy of a model using `LogisticRegression`.

In [None]:
# TODO: implement

### (b) Fit BirdNET embeddings to a neural network

Implement `Net` using a neural network with at least two layers. 
Use `train_generator` to fit the model to the embedding and label data.
Plot the loss curve of the training procedure.
Finally, report the accuracy of the model.
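
Note that `train_generator` defaults to `nn.BCELoss()`, which expects the model output to be a probability between 0 and 1, so the last step of `forward` would typically be a sigmoid. The NumPy sketch below shows what that loss computes with its default mean reduction; the helper names and the `eps` clipping are choices made for this illustration and are not how PyTorch implements it.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)  # avoid log(0)
    targets = np.asarray(targets, dtype=float)
    return float(-np.mean(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)))


targets = [1, 0]
uninformed = binary_cross_entropy(sigmoid([0.0, 0.0]), targets)  # both predictions are 0.5
confident = binary_cross_entropy([0.9, 0.1], targets)            # correct and confident
print(round(uninformed, 4))  # 0.6931
print(round(confident, 4))   # 0.1054
```

A loss near $\ln 2 \approx 0.693$ on a balanced binary task therefore suggests the model is doing no better than guessing.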

In [None]:
import torch
import torch.nn as nn
# NOTE: this rebinds `F`, which earlier cells use for pyspark.sql.functions
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self, input_dim=320):
        super().__init__()
        # NOTE: define layers in the initialization function
        # self.layer1 = ...
        # self.layer2 = ...

    def forward(self, x):
        raise NotImplementedError()


def train_generator(
    net: Net,
    X_train: np.ndarray,
    y_train: np.ndarray,
    epochs: int = 100,
    criterion=nn.BCELoss(),
    lr: float = 0.001,
):
    """A python generator that yields the epoch and loss at each step."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for epoch in range(epochs):
        optimizer.zero_grad()
        outputs = net(torch.from_numpy(X_train).float())
        loss = criterion(
            # np.asarray accepts both a pandas Series and a numpy array
            outputs, torch.from_numpy(np.asarray(y_train).reshape(-1, 1)).float()
        )
        loss.backward()
        optimizer.step()
        yield epoch, loss.item()

In [None]:
net = Net(input_dim=X_train.shape[1])
# TODO: train the model
# TODO: plot the loss curve
# TODO: report accuracy score on the test set

### (c) Analyze the input data in the embedding space of the second-to-last layer of the model defined in part (b)

We are interested in analyzing properties of the second-to-last layer of our neural network.
Register a callback on a `torch` layer to capture activations on forward passes of model inference/training.
Report the shape of the extracted activations on `X`.

Perform some analysis on the extracted activations.
Some examples:

- Visualizing the embedding space labeled by `y`
- Reporting the accuracy of a classifier on `y` using activations as a feature

In [None]:
# https://discuss.pytorch.org/t/how-can-l-load-my-best-model-as-a-feature-extractor-evaluator/17254/6
activations = {}


def get_activation(name):
    """Register a hook to extract activations from a layer.

    If we have three layers defined layer1, layer2, and layer3 in our network,
    we would register the activations as follows:

      >>> net.layer2.register_forward_hook(get_activation("layer2"))

    Then, activations will be available on every call to `forward`. State is
    global, so it is overwritten on every call.

    >>> net(X)
    >>> activations["layer2"]
    """

    def hook(model, input, output):
        # .cpu() is a no-op on CPU tensors and keeps this working on a GPU
        activations[name] = output.detach().cpu().numpy()

    return hook

In [None]:
# TODO: register a forward hook on the second-to-last layer

# TODO: run a forward pass and report the shape of the activations
_ = net(torch.from_numpy(X).float())
emb = activations.get("my-layer")
assert emb is not None, "activation is missing"

# TODO: analyze (or visualize) the activations