# Feature Extraction and Exploratory Data Analysis
----
In this interactive notebook, we run the initial feature extraction on our data and perform some exploratory data analysis to see which features may be most helpful for classification. We also do some data cleaning along the way.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from afe import AudioFeatureExtractor, BatchExtractor, FeatureVisualizer

import librosa.display

In [2]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

Let's instantiate a new `AudioFeatureExtractor` object, which we will need to pull relevant features out of our audio samples. We begin our investigation with the Mel-frequency cepstral coefficients (MFCCs). More information on these features can be found on the Wikipedia page for the [Mel-frequency cepstrum](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).

In [3]:
extractor = AudioFeatureExtractor()

Let's also read in the all-important index file for our bird vocalization data. We should explore our targets and drop unneeded columns.

In [4]:
bird_index = pd.read_csv('bird_vocalization_index.csv', index_col=0)

In [5]:
bird_index.head()

Unnamed: 0,country,duration_seconds,english_cname,file_id,file_name,file_url,genus,latitude,license,location,longitude,recordist,recordist_url,sonogram_url,species,type,remarks,full_name
0,United States,3,Abert's Towhee,17804,XC17804.mp3,https://www.xeno-canto.org/17804/download,Melozone,33.3117,http://creativecommons.org/licenses/by-nc-nd/2.5/,"Cibola National Wildlife Refuge, Cibola, Arizo...",-114.68912,Nathan Pieplow,https://www.xeno-canto.org/contributor/EKKJJJRDJY,https://www.xeno-canto.org/sounds/uploaded/EKK...,aberti,'seet' call,XC17804 © Nathan Pieplow // Cibola National Wi...,Abert's Towhee (Melozone aberti)
1,United States,4,Abert's Towhee,177367,XC177367.mp3,https://www.xeno-canto.org/177367/download,Melozone,34.285,http://creativecommons.org/licenses/by-nc-sa/4.0/,"Bill Williams River NWR, Arizona, United States",-114.069,Lauren Harter,https://www.xeno-canto.org/contributor/YQNGFTBRRT,https://www.xeno-canto.org/sounds/uploaded/YQN...,aberti,call,XC177367 © Lauren Harter // Bill Williams Rive...,Abert's Towhee (Melozone aberti dumeticola)
2,United States,4,Abert's Towhee,145505,XC145505.mp3,https://www.xeno-canto.org/145505/download,Melozone,34.285,http://creativecommons.org/licenses/by-nc-sa/3.0/,"Bill Williams River NWR, Arizona, United States",-114.069,Lauren Harter,https://www.xeno-canto.org/contributor/YQNGFTBRRT,https://www.xeno-canto.org/sounds/uploaded/YQN...,aberti,Squeal duet,XC145505 © Lauren Harter // Bill Williams Rive...,Abert's Towhee (Melozone aberti dumeticola)
3,United States,5,Abert's Towhee,228159,XC228159.mp3,https://www.xeno-canto.org/228159/download,Melozone,33.1188,http://creativecommons.org/licenses/by-nc-nd/4.0/,"Salton Sea, CA, United States",-115.7945,Peter Boesman,https://www.xeno-canto.org/contributor/OOECIWCSWV,https://www.xeno-canto.org/sounds/uploaded/OOE...,aberti,interaction duet,"XC228159 © Peter Boesman // Salton Sea, CA, Un...",Abert's Towhee (Melozone aberti)
4,United States,5,Abert's Towhee,51313,XC51313.mp3,https://www.xeno-canto.org/51313/download,Melozone,36.0628,http://creativecommons.org/licenses/by-nc-sa/3.0/,"Sunset Park, Las Vegas, Nevada, United States",-115.1128,Mike Nelson,https://www.xeno-canto.org/contributor/PWDLINYMKL,https://www.xeno-canto.org/sounds/uploaded/PWD...,aberti,call,"XC51313 © Mike Nelson // Sunset Park, Las Vega...",Abert's Towhee (Melozone aberti dumeticola)


In [6]:
bird_index.english_cname.value_counts()

California Thrasher         30
Bell's Sparrow              30
Rufous-crowned Sparrow      30
Mountain Quail              30
Black-headed Grosbeak       30
                            ..
Bell's Vireo                30
Thick-billed Fox Sparrow    30
Hermit Thrush               30
Northern Pygmy Owl          30
Black-tailed Gnatcatcher    30
Name: english_cname, Length: 91, dtype: int64

As we can see, this is a balanced dataset with 30 samples for each of the 91 species, though that is very few samples per category. This will likely prove a challenge that we will need to overcome. We may also want to consider defining our target as the genus of the bird instead, which could increase the number of samples per target label.

In [7]:
bird_index.genus.value_counts()

Vireo             150
Empidonax          90
Polioptila         90
Corvus             60
Catharus           60
                 ... 
Artemisiospiza     30
Aegolius           30
Setophaga          30
Nucifraga          30
Aphelocoma         30
Name: genus, Length: 68, dtype: int64

This gives fewer categories overall (68 genera instead of 91 species), but it introduces imbalance into the dataset: Vireo has 150 samples, while others still have only 30. We will have to explore which target yields the more accurate model. In general, though, a genus-level classifier is less useful than a species-level one, since a species prediction also determines the genus.
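
One way to quantify this trade-off is to compare the ratio of the largest to the smallest class at each label level. The sketch below uses a small made-up table, not the real index; only the column names `english_cname` and `genus` come from the data above, and the `imbalance_ratio` helper is hypothetical.

```python
import pandas as pd

# Toy stand-in for the index: three species, two of which share a genus.
toy = pd.DataFrame({
    "english_cname": ["Bell's Vireo"] * 3 + ["Hutton's Vireo"] * 3 + ["Hermit Thrush"] * 3,
    "genus": ["Vireo"] * 6 + ["Catharus"] * 3,
})

def imbalance_ratio(labels: pd.Series) -> float:
    """Return largest class size divided by smallest class size (1.0 means balanced)."""
    counts = labels.value_counts()
    return float(counts.max() / counts.min())

species_ratio = imbalance_ratio(toy["english_cname"])
genus_ratio = imbalance_ratio(toy["genus"])
print(species_ratio, genus_ratio)
```

Applied to the counts shown above, the same ratio would be 150 / 30 = 5 for genus against 1 for species.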

Anyhow, let's continue on with our data cleaning process.

In [8]:
bird_index_trim = bird_index.drop(columns=[
    'country',
    'file_url',
    'license',
    'recordist',
    'recordist_url',
    'sonogram_url',
    'remarks',
    'latitude',
    'longitude',
    'location',
    'full_name'])

In [9]:
bird_index_trim.head()

Unnamed: 0,duration_seconds,english_cname,file_id,file_name,genus,species,type
0,3,Abert's Towhee,17804,XC17804.mp3,Melozone,aberti,'seet' call
1,4,Abert's Towhee,177367,XC177367.mp3,Melozone,aberti,call
2,4,Abert's Towhee,145505,XC145505.mp3,Melozone,aberti,Squeal duet
3,5,Abert's Towhee,228159,XC228159.mp3,Melozone,aberti,interaction duet
4,5,Abert's Towhee,51313,XC51313.mp3,Melozone,aberti,call
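
Before extraction, it may be worth running a few sanity checks on the trimmed index. The sketch below is an assumption about what such checks could look like, run on a made-up frame that reuses the column names above; the duplicated row and the missing value are planted deliberately.

```python
import pandas as pd

# Made-up rows; the second recording is duplicated and one 'type' is missing.
toy_index = pd.DataFrame({
    "file_id": [17804, 177367, 177367, 51313],
    "file_name": ["XC17804.mp3", "XC177367.mp3", "XC177367.mp3", "XC51313.mp3"],
    "duration_seconds": [3, 4, 4, 5],
    "type": ["'seet' call", "call", "call", None],
})

# Count rows whose file_id repeats an earlier row.
n_duplicates = int(toy_index.duplicated(subset="file_id").sum())

# Count missing values in each column.
missing_per_column = toy_index.isna().sum()

# Check that every file name follows the 'XC<file_id>.mp3' pattern seen above.
expected_names = "XC" + toy_index["file_id"].astype(str) + ".mp3"
name_matches_id = bool((toy_index["file_name"] == expected_names).all())

# Keep the first occurrence of each recording.
cleaned = toy_index.drop_duplicates(subset="file_id").reset_index(drop=True)
print(n_duplicates, name_matches_id, len(cleaned))
```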


## Small Sample Testing
----
Here we test the extraction pipeline on a small sample of two recordings before running it on the full dataset.

In [13]:
testing_sample = bird_index_trim.sample(n=2)

In [14]:
testing_sample.head()

Unnamed: 0,duration_seconds,english_cname,file_id,file_name,genus,species,type
1948,57,Pygmy Nuthatch,229772,XC229772.mp3,Sitta,pygmaea,call
81,31,American Bushtit,283953,XC283953.mp3,Psaltriparus,minimus,call


In [15]:
testing_sample.to_csv('test_index.csv')
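
Because `sample` draws at random, each run of the notebook picks different files. A hedged sketch of a reproducible variant follows: `random_state=42` is an arbitrary choice, the frame is made up, and an in-memory buffer stands in for `test_index.csv`. Reading back with `index_col=0` restores the original row labels, which matches how the index file is loaded earlier in this notebook.

```python
import io
import pandas as pd

# Made-up index with ten recordings.
toy = pd.DataFrame({
    "file_name": [f"XC{i}.mp3" for i in range(10)],
    "duration_seconds": list(range(10, 20)),
})

# A fixed seed makes the draw repeatable across runs.
sample_a = toy.sample(n=2, random_state=42)
sample_b = toy.sample(n=2, random_state=42)

# Round-trip through CSV text; the row labels are written as the first column.
buffer = io.StringIO()
sample_a.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```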

In [16]:
sample_be = BatchExtractor(frame_length=2048, audio_index='test_index.csv')
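
To get a feel for how much data each clip produces, the following sketch counts analysis frames per recording. It assumes a 22050 Hz sample rate and a hop of 512 samples with centered frames, which are librosa's defaults; whether `BatchExtractor` uses the same settings is not confirmed here, and the helper name is hypothetical.

```python
def estimate_n_frames(duration_seconds: float, sr: int = 22050, hop_length: int = 512) -> int:
    """Estimate the number of centered analysis frames for a clip."""
    n_samples = int(duration_seconds * sr)
    # With centered (padded) framing, the frame count depends only on the hop.
    return 1 + n_samples // hop_length

# Durations of the two recordings in the testing sample above.
for seconds in (57, 31):
    print(seconds, estimate_n_frames(seconds))
```

Under these assumptions, `frame_length=2048` sets the width of each analysis window (about 93 ms at 22050 Hz) rather than the number of frames.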

In [17]:
# Run every extraction method except the last entry in extraction_dict.
sample_be.batch_extract_and_merge(list(sample_be.extraction_dict.keys())[:-1], results_folder='feature_extraction_sample/')

mfcc
melspec
zcr
ccqt
cstft
ccens
rms
centroid
bandwidth
contrast
flatness
rolloff
tonnetz


## Full Feature Extraction
----
This will take a while, since it runs the extraction over every recording in the index.

In [18]:
full_be = BatchExtractor(frame_length=2048, audio_folder='raw_data/', audio_index='bird_vocalization_index.csv')
# Use every extraction method except the last entry in extraction_dict.
extraction_methods = list(full_be.extraction_dict.keys())[:-1]

In [20]:
full_be.batch_extract_and_merge(extraction_methods, results_folder='feature_extraction/')

mfcc
melspec
zcr
ccqt
cstft
ccens
rms
centroid
bandwidth
contrast
flatness
rolloff
tonnetz


Next, we save a mel spectrogram plot for each recording in the testing sample.

In [43]:
fv = FeatureVisualizer()

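for file_name in testing_sample.file_name:
    # Strip the '.mp3' extension to get the recording name (e.g. 'XC229772').
    name = file_name[:-4]
    # Save the plot, then close the current figure so open figures don't accumulate.
    fv.plot_melspec(name).savefig(f'plots/melspecs/{name}.png')
    plt.close()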