# Inclusive Speech Technology Lab - Week 2

## Acoustic Features for Inclusive Speech Processing

**The goal of this lab is to reflect on what kind of information is contained in speech and which acoustic features capture task-specific information. In other words, you will investigate which acoustic feature(s) work best for which speech processing task, and link this to speech as a communication signal.** In this lab you will learn about various acoustic features by extracting them from speech files. Throughout this lab, you will be using two datasets: the German emotions dataset [Emo-DB](https://github.com/audeering/emodb), and the [Delft Database of EEG Recordings of Dutch Articulated and Imagined Speech (DAIS)](https://pure.tudelft.nl/ws/portalfiles/portal/157666992/DAIS_The_Delft_Database_of_EEG_Recordings_of_Dutch_Articulated_and_Imagined_Speech.pdf) dataset. You will extract features from each dataset and cluster them using k-means clustering, with the goal of finding clusters that can predict emotion (for the Emo-DB dataset) and vowels (for the DAIS dataset). You will analyze how effective different acoustic features are at achieving this goal. The lab consists of the following parts:

1. Loading the Datasets
2. Acoustic Feature Extraction
3. K-Means Clustering
 
Throughout this lab you will be asked to write code in this notebook, and you will also be asked to **reflect on your results. These reflections are to be written down in a group report.** The report should only focus on this lab. **For more information on the report, refer to the Inclusive Speech Technology Brightspace page, in the `Lab` section.**

**We provide reflection questions in this notebook.** Sections where we ask you to implement a coding exercise will be marked with a <i style='color: #468fea; font-size: 15px;' class="fa fa-code" aria-hidden="true"></i> symbol. Sections where we ask reflection questions will be marked with a <i style='color: #468fea; font-size: 15px;' class="fa fa-file-text" aria-hidden="true"></i> symbol. Please answer all questions. However, keep in mind that **just** answering the questions in this notebook will not result in a good quality report.

**There is no "correct" way of completing this lab!** It is, however, important that you can justify why you make certain choices.

### Let's Get Started

Before getting started with the lab, you need to install and import all of the libraries that you will be using. **Run the code block below to install the necessary libraries for this lab.**

In [None]:
# Install the libraries using pip
%pip install matplotlib
%pip install numpy
%pip install pandas
%pip install seaborn
%pip install librosa
%pip install opensmile
%pip install audb

Now that you have installed the libraries, **run the code block below to import them into the notebook.**

In [None]:
# Import necessary functions

# Imports for plotting and data manipulation
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Imports for audio processing
import librosa
from IPython.display import Audio
import audb
import audiofile
import opensmile
import os

# Imports for clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

## 1. Loading the Datasets

In this lab you will be working with two databases: the German emotions dataset [Emo-DB](https://github.com/audeering/emodb), which contains speech spoken with different emotions, and the [Delft Database of EEG Recordings of Dutch Articulated and Imagined Speech (DAIS)](https://pure.tudelft.nl/ws/portalfiles/portal/157666992/DAIS_The_Delft_Database_of_EEG_Recordings_of_Dutch_Articulated_and_Imagined_Speech.pdf) dataset, which contains speech of vowels. This section will explain how to load each dataset.

The German emotions dataset can be downloaded and loaded using one command: `emodb = audb.load('emodb')`. This command will download the dataset for you. For more information on using `audb.load(...)`, refer to the [documentation](https://audeering.github.io/audb/load.html). To load only a specific subset of the files, you can use the following command:
```
db = audb.load(
    'emodb',
    version='1.4.1',
    format='wav',
    mixdown=True,
    sampling_rate=16000,
    media=r'wav/14.*A.*\.wav',  # raw-string regex that selects which subset of files to use (for interpretation of the filenames see emodb documentation)
    full_path=False,
    verbose=False,
)
```

The DAIS dataset can be downloaded from Brightspace. The file is called `dais.zip` and can be found under `Content` > `Labs` > `Week 2`. Download this file and unzip it. Once you have downloaded the files to your local machine, you can load the files as you would normally with Python. For example, you can get file paths using `os.walk(...)`, and you can load the speech file using `signal, sampling_rate = audiofile.read(file_path)`. **Once you are done with this lab, make sure to delete the DAIS dataset from your local machine.**
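As a minimal sketch of the file-discovery step described above, the helper below walks a directory tree and collects the paths of all `.wav` files. The function name `list_wav_files` and the folder name `dais` are assumptions for illustration, as is the `.wav` extension; adapt them to wherever you unzipped the archive and to what it actually contains.

```python
import os


def list_wav_files(root):
    """Return a sorted list of paths to all .wav files below `root`."""
    wav_paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so that '.WAV' files are found too
            if name.lower().endswith('.wav'):
                wav_paths.append(os.path.join(dirpath, name))
    return sorted(wav_paths)


# 'dais' is a placeholder for the folder you unzipped dais.zip into
wav_files = list_wav_files('dais')
print(f'Found {len(wav_files)} wav files')
```

Each returned path can then be passed to `audiofile.read(file_path)` as described above. If the folder does not exist, the list is simply empty, so check the printed count before continuing.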


## 1. Acoustic Feature Extraction

**For the next part of the lab, you will practice extracting acoustic features using Emo-DB.** You will be using [openSMILE](https://audeering.github.io/opensmile-python/usage.html) to do so. The goal of this section is to become comfortable with extracting acoustic features and understanding them.

### 1.1 Extract & Explore Features of a Single Speech File

**First you will take a look at a single speech file and analyse it using openSMILE.** Below, you will load the subset of the Emo-DB dataset that you will be using, which allows you to work with the speech files in this notebook. We provide the code to do so in the code block below. We also give you code to play a specific speech file from the dataset, so that you can listen to it and get a sense of the speech that the dataset contains. **Run the code block below to load your dataset and play a speech file**.

In [None]:
# Load emodb database
db = audb.load(
    'emodb',
    version='1.4.1',
    format='wav',
    mixdown=True,
    sampling_rate=16000,
    media=r'wav/14.*A.*\.wav',
    full_path=False,
    verbose=False,
)

# Read speech file
file = os.path.join(db.root, db.files[1])
signal, sampling_rate = audiofile.read(
    file,
    duration=10,
    always_2d=True,
)

# Play audio
Audio(data=signal, rate=sampling_rate)

**Now you will extract different acoustic features from the speech file you just loaded.** By extracting acoustic features from speech signals, we quantify and measure certain aspects of the speech signal that help us analyze the signal. To extract acoustic features, you will use `opensmile.Smile`. This object allows you to extract acoustic features according to a `feature_set` and `feature_level`. For this lab, we will be using the acoustic feature set `opensmile.FeatureSet.eGeMAPSv02`, which corresponds to The Extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS). eGeMAPS is further explained in [this paper](https://sail.usc.edu/publications/files/eyben-preprinttaffc-2015.pdf).

To store the acoustic features, we will use Pandas, specifically the [`pd.DataFrame()` object](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Each frame (row) corresponds to a segment of speech. Columns represent different acoustic features, or other important information such as start and end times of the frame (row).

First, we will focus on getting all the eGeMAPS features. We provide the code to extract and display the eGeMAPS features, which you can run below. For guidance on acoustic feature extraction with openSMILE, you can refer to [this tutorial](https://audeering.github.io/opensmile-python/usage.html).

In [None]:
# Define the OpenSmile feature extractor and process signal
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

features = smile.process_signal(
    signal,
    sampling_rate
)

# Display all features
pd.set_option('display.max_columns', None)
features_df = features.copy()
features_df.columns = [''.join(col).strip() for col in features_df.columns.values]
features_df = features_df.reset_index()
features_df.head()

You may notice that eGeMAPS computes a lot of acoustic features. **Throughout this lab you will be focusing on four of these acoustic features: pitch, jitter, loudness, and MFCCs.** We expect you to understand what acoustic information each of these acoustic features captures. Please refer to the [eGeMAPS documentation](https://sail.usc.edu/publications/files/eyben-preprinttaffc-2015.pdf) in case of questions about these acoustic features.

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> For this next code block, you **should filter your data such that only the four acoustic features you want to work with are present.** Note that MFCCs have several orders, and eGeMAPS provides MFCC orders 1-4. You should choose which MFCC order or combination of orders you want to use for your analysis.
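One possible way to approach this filtering is to select columns by name. The sketch below uses a toy dataframe with made-up values. The column names are assumptions modelled on the eGeMAPSv02 low-level descriptor naming, so check `features_df.columns` for the exact names in your own output. The choice of `mfcc1_sma3` is arbitrary and only serves as an example.

```python
import pandas as pd

# Toy stand-in for the flattened openSMILE output (values are made up)
toy_features = pd.DataFrame({
    'start': ['0 days 00:00:00', '0 days 00:00:00.010000'],
    'Loudness_sma3': [0.31, 0.35],
    'spectralFlux_sma3': [0.12, 0.10],
    'mfcc1_sma3': [12.4, 13.1],
    'mfcc2_sma3': [5.2, 4.8],
    'F0semitoneFrom27.5Hz_sma3nz': [24.1, 24.6],
    'jitterLocal_sma3nz': [0.02, 0.03],
})

# Assumed column names for pitch, jitter, loudness, and one MFCC order
wanted = [
    'F0semitoneFrom27.5Hz_sma3nz',
    'jitterLocal_sma3nz',
    'Loudness_sma3',
    'mfcc1_sma3',
]

# Keep only the wanted columns that are actually present
selected = [col for col in wanted if col in toy_features.columns]
filtered = toy_features[selected]
print(filtered.columns.tolist())
```

Checking membership before indexing avoids a `KeyError` when a name is misspelled, but it also silently skips that column, so compare `selected` against `wanted` afterwards.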

In [None]:
# Write your code here

### 1.2 Extract & Explore Features of the Full Dataset

Now that you have a better understanding of the underlying data, we will take a look at the full dataset. Load the complete Emo-DB dataset. Then, **run the feature extraction on the complete Emo-DB dataset.** You should do this using the same method as before: define the openSMILE feature extractor and process the speech signals.

In the previous section, you had multiple frames (rows) for a single speech file. Now, you will also have multiple files to work with, each of which will be split into multiple frames. Thus, **consider carefully how you will organize your data.** Tips: you will likely want to add a new column which keeps track of which file a speech signal corresponds to. You may also want to combine frames from the same file, such that each file has only one frame.
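The second tip, combining the frames of a file into a single row, could be sketched with a pandas `groupby`. The example below uses a toy dataframe with made-up values; the column names `file`, `loudness`, and `pitch` are placeholders, and mean and standard deviation are only one possible choice of summary statistics.

```python
import pandas as pd

# Toy frame-level features for two files (values are made up)
frames = pd.DataFrame({
    'file': ['a.wav', 'a.wav', 'a.wav', 'b.wav', 'b.wav'],
    'loudness': [0.2, 0.4, 0.6, 1.0, 2.0],
    'pitch': [20.0, 22.0, 24.0, 30.0, 34.0],
})

# Collapse all frames of a file into one row of summary statistics
per_file = frames.groupby('file').agg(['mean', 'std'])

# Flatten the two-level column index, e.g. ('loudness', 'mean') -> 'loudness_mean'
per_file.columns = ['_'.join(col) for col in per_file.columns]
per_file = per_file.reset_index()
print(per_file)
```

After this step each file contributes exactly one row, which makes it straightforward to attach one label per file later on.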

Make sure you also filter the acoustic features such that only the necessary acoustic features are present!

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> Load the audio from the Emo-DB dataset, extract the four acoustic features for each file in `emodb.files`, and store the features in a Pandas dataframe.

In [None]:
# Loading the complete Emo-DB dataset
emodb = audb.load('emodb')

# Write your code here

Now you should have a dataframe that contains all of the extracted acoustic features for each speech file. Next, you will merge your acoustic features with the emotion labels for the speech files. This will allow you to have all acoustic features and emotion information for a speech file in one place (or put differently: each acoustic feature now has an emotion label), which will make processing the data easier. First, you need to get the emotion labels corresponding to each speech file. To do so, we provide you code that loads the emotion labels into a dataframe object, stored in the `merged_df` variable.

**Run the code block below to get the emotion labels for the speech files.** Read the table that is output, so you understand the structure of the data.

In [None]:
# File paths for the ground truth data / original CSV files
path = emodb.root  # Root directory of the downloaded database
emotion_file = os.path.join(path, 'db.emotion.csv')
files_file = os.path.join(path, 'db.files.csv')
speaker_file = os.path.join(path, 'db.speaker.csv')

# Load the CSV files
emotion_df = pd.read_csv(emotion_file)
files_df = pd.read_csv(files_file)
speaker_df = pd.read_csv(speaker_file)

# Merge the dataframes
merged_df = pd.merge(emotion_df, files_df, on='file')
merged_df = pd.merge(merged_df, speaker_df, on='speaker')

merged_df.head()

You now have two dataframes: one that contains the features per file, and one that contains the emotion labels per file. **Combine these two dataframes into one dataframe that contains both the acoustic features and the emotion label for each file.**
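A join on a shared key column is one way to combine two such dataframes. The sketch below uses toy data with made-up values, and it assumes that both dataframes have a `file` column holding identical strings for the same file.

```python
import pandas as pd

# Toy per-file features and labels (values are made up)
toy_features = pd.DataFrame({
    'file': ['wav/a.wav', 'wav/b.wav', 'wav/c.wav'],
    'loudness_mean': [0.4, 1.5, 0.9],
})
toy_labels = pd.DataFrame({
    'file': ['wav/a.wav', 'wav/b.wav', 'wav/c.wav'],
    'emotion': ['anger', 'sadness', 'anger'],
})

# An inner join keeps only files that appear in both dataframes;
# validate='one_to_one' raises an error if a file is duplicated on either side
combined = pd.merge(toy_features, toy_labels, on='file', how='inner',
                    validate='one_to_one')
print(combined)
```

If the merged dataframe turns out to have fewer rows than expected, the keys probably do not match exactly, for example because one side holds absolute paths and the other relative ones. Comparing a few `file` values from both dataframes usually reveals this.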

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> Combine the emotion label and acoustic feature frames for each speech file.

In [None]:
# Write your code here

**Take a second to reflect on the feature data you have produced in the table above.**

<i style='color: #468fea; font-size: 20px;' class="fa fa-file-text" aria-hidden="true"></i> Why are there multiple rows in the table? Why do the values of an acoustic feature (column) fluctuate over the different rows? Why do the different columns have different values? Why do you think your features might be useful for analyzing speech? Which MFCC order(s) did you choose to work with, and why?

If you do not know the answers to these questions, please revise the lecture content on acoustic features.

## 2. K-Means Clustering

### 2.1 Clustering Emotions
Now that you have extracted the acoustic features from the emotions database, you will apply k-means clustering to the acoustic features dataframe from the previous section. The aim is to investigate how well the acoustic features group together speech files that have the same emotion label. First, you should **compute the clusters for each of the four acoustic features, and visualize these clusters in a separate plot for each acoustic feature**. Visualizing the clusters will help you confirm that your implementation of k-means clustering is correct. We have imported `KMeans` from [scikit-learn](https://scikit-learn.org/1.5/index.html), which will perform the clustering for you. **Note that you should use seven clusters, since there are seven emotions in the dataset(!)**

There are several things you should consider while clustering. First, since the acoustic features are extracted per frame, you should calculate statistical summaries (e.g., mean, standard deviation) for each acoustic feature across the entire speech file, so that every file is represented by a small, fixed number of values. Second, you should standardise the acoustic features to have zero mean and unit variance, since k-means is sensitive to the scale of the data points.
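The standardise-then-cluster sequence could look roughly like the sketch below. It runs on synthetic values rather than real features, and it uses three clusters only because the toy data has three groups. The settings `n_init=10` and `random_state=0` are assumptions chosen to make the toy result reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy per-file means of one acoustic feature (synthetic, not real data):
# three well-separated groups of 20 values each
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(loc=0.0, scale=0.1, size=20),
    rng.normal(loc=5.0, scale=0.1, size=20),
    rng.normal(loc=10.0, scale=0.1, size=20),
])

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
X = values.reshape(-1, 1)

# Standardise to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Three clusters for this toy data; the lab itself asks for seven
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)
print(np.bincount(cluster_ids))
```

Note that the cluster indices returned by k-means are arbitrary: index 0 has no fixed relation to any particular group or label.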

**You should produce four one-dimensional cluster plots, one for each acoustic feature.** Since your clusters are one-dimensional, your plots should show a single line with several colored dots. Each color represents a cluster of acoustic features that your k-means algorithm found.

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> Cluster the acoustic features you extracted above, and visualize the results.

In [None]:
# Write your code here

If your k-means clustering was correct, your plots should contain seven clusters. You may notice that your plots are hard to interpret as the clusters do not have any information on the emotion labels.

To understand how good each acoustic feature is at capturing a specific emotion, you can create a heatmap. A heatmap shows how many times each emotion appears in each cluster. The more strongly an emotion is concentrated in a single cluster, the better the acoustic feature is at capturing important information about that emotion. Your heat map should plot emotions on the x-axis, and cluster indices on the y-axis. To create this heat map, you can combine your cluster data with the `merged_df` dataframe that we provided in part 1.2 of this lab. To plot the heat map, you can use `sns.heatmap`.

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> Create and plot a heat map that represents the distribution of emotions per cluster. You should use the clusters you calculated in the previous code block.

In [None]:
# Write your code here

Another way of analyzing how emotions relate to your clusters would be to calculate the cluster purity score. The cluster purity tells you the degree to which the clusters contain a single class (emotion), and is given by the following formula: 
$$
\text{purity} = \frac{1}{N}\sum_{m\in M}\max_{d\in D}|m \cap d|
$$
where $M$ is the set of clusters, $D$ is the set of classes (emotions), and $N$ is the number of data points. **Use this formula to calculate the cluster purity for the clusters.**
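To make the formula concrete, here is a small sketch in plain Python that applies it to a toy example. The function name `purity_score` and the toy labels are assumptions for illustration. For each cluster $m$ it finds the size of the largest class overlap $\max_{d}|m \cap d|$, sums these over all clusters, and divides by $N$.

```python
from collections import Counter


def purity_score(cluster_ids, class_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError('cluster_ids and class_labels must have equal length')
    if len(cluster_ids) == 0:
        raise ValueError('at least one data point is required')

    # Count the class labels inside each cluster
    counts_per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_per_cluster.setdefault(cluster, Counter())[label] += 1

    # Sum the size of the largest class in every cluster, then divide by N
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example: cluster 0 holds {anger, anger, fear}, cluster 1 holds {fear, fear}
toy_clusters = [0, 0, 0, 1, 1]
toy_emotions = ['anger', 'anger', 'fear', 'fear', 'fear']
print(purity_score(toy_clusters, toy_emotions))  # (2 + 2) / 5 = 0.8
```

A purity of 1 means every cluster contains a single class. Keep in mind that purity tends to increase with the number of clusters, so it is most meaningful when comparing clusterings that use the same number of clusters.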

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> In the code block below, calculate the purity scores for your clusters.

In [None]:
# Write your code here

**Take a second to reflect on the results of your clustering.**

<i style='color: #468fea; font-size: 20px;' class="fa fa-file-text" aria-hidden="true"></i> Which acoustic features result in a better clustering of the emotions and which in a worse clustering?

### 2.2 Clustering Vowels

Now that you have analyzed how effectively the acoustic features capture emotion in the speech files, you will do the same for vowels. **First, extract the same four acoustic features (loudness, pitch, jitter, and MFCCs) for the DAIS dataset. Next, use k-means clustering to cluster the acoustic features. Finally, calculate and analyze the cluster heat maps and cluster purity for your sets of clusters.** Make sure to answer the reflection questions at the end of this section.

<i style='color: #468fea; font-size: 20px;' class="fa fa-code" aria-hidden="true"></i> Calculate the acoustic features for the DAIS dataset, cluster the results using five clusters, and calculate the cluster heat maps and purity.

In [None]:
# Write your code here

**Take a second to reflect on the results of your clustering.**

<i style='color: #468fea; font-size: 20px;' class="fa fa-file-text" aria-hidden="true"></i> Which acoustic features result in a better clustering of the vowels and which in a worse clustering? Are there differences in which acoustic features work better or worse on the two clustering tasks and why do you think that is?