# Feature Extraction and Exploratory Data Analysis
----
In this interactive notebook, we extract an initial set of features from our data, clean up the index file, and perform some exploratory data analysis to see which features may be the most helpful for classification.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from audio_feature_extraction import AudioFeatureExtractor

In [2]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

Let's instantiate a new AudioFeatureExtractor object, which we will need to pull relevant features out of our audio samples. In particular, we will begin our investigation with the Mel-frequency cepstral coefficients (MFCCs). More information on these features can be found on the Wikipedia page for the [Mel-frequency cepstrum](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum).

In [3]:
afe = AudioFeatureExtractor()

Let's also read in the all-important index file for our bird vocalization data. In particular, we should explore our candidate targets and drop unneeded columns.

In [4]:
bird_index = pd.read_csv('bird_vocalization_index.csv', index_col=0)

In [5]:
bird_index.head()

Unnamed: 0,country,duration_seconds,english_cname,file_id,file_name,file_url,genus,latitude,license,location,longitude,recordist,recordist_url,sonogram_url,species,type,remarks,full_name
0,United States,3,Abert's Towhee,17804,XC17804.mp3,https://www.xeno-canto.org/17804/download,Melozone,33.3117,http://creativecommons.org/licenses/by-nc-nd/2.5/,"Cibola National Wildlife Refuge, Cibola, Arizo...",-114.68912,Nathan Pieplow,https://www.xeno-canto.org/contributor/EKKJJJRDJY,https://www.xeno-canto.org/sounds/uploaded/EKK...,aberti,'seet' call,XC17804 © Nathan Pieplow // Cibola National Wi...,Abert's Towhee (Melozone aberti)
1,United States,4,Abert's Towhee,177367,XC177367.mp3,https://www.xeno-canto.org/177367/download,Melozone,34.285,http://creativecommons.org/licenses/by-nc-sa/4.0/,"Bill Williams River NWR, Arizona, United States",-114.069,Lauren Harter,https://www.xeno-canto.org/contributor/YQNGFTBRRT,https://www.xeno-canto.org/sounds/uploaded/YQN...,aberti,call,XC177367 © Lauren Harter // Bill Williams Rive...,Abert's Towhee (Melozone aberti dumeticola)
2,United States,4,Abert's Towhee,145505,XC145505.mp3,https://www.xeno-canto.org/145505/download,Melozone,34.285,http://creativecommons.org/licenses/by-nc-sa/3.0/,"Bill Williams River NWR, Arizona, United States",-114.069,Lauren Harter,https://www.xeno-canto.org/contributor/YQNGFTBRRT,https://www.xeno-canto.org/sounds/uploaded/YQN...,aberti,Squeal duet,XC145505 © Lauren Harter // Bill Williams Rive...,Abert's Towhee (Melozone aberti dumeticola)
3,United States,5,Abert's Towhee,228159,XC228159.mp3,https://www.xeno-canto.org/228159/download,Melozone,33.1188,http://creativecommons.org/licenses/by-nc-nd/4.0/,"Salton Sea, CA, United States",-115.7945,Peter Boesman,https://www.xeno-canto.org/contributor/OOECIWCSWV,https://www.xeno-canto.org/sounds/uploaded/OOE...,aberti,interaction duet,"XC228159 © Peter Boesman // Salton Sea, CA, Un...",Abert's Towhee (Melozone aberti)
4,United States,5,Abert's Towhee,51313,XC51313.mp3,https://www.xeno-canto.org/51313/download,Melozone,36.0628,http://creativecommons.org/licenses/by-nc-sa/3.0/,"Sunset Park, Las Vegas, Nevada, United States",-115.1128,Mike Nelson,https://www.xeno-canto.org/contributor/PWDLINYMKL,https://www.xeno-canto.org/sounds/uploaded/PWD...,aberti,call,"XC51313 © Mike Nelson // Sunset Park, Las Vega...",Abert's Towhee (Melozone aberti dumeticola)


In [6]:
bird_index.english_cname.value_counts()

Bewick's Wren               30
Northern Saw-whet Owl       30
Hermit Warbler              30
Bell's Vireo                30
Ridgway's Rail              30
                            ..
Pacific-slope Flycatcher    30
Wrentit                     30
White-headed Woodpecker     30
White-crowned Sparrow       30
Clark's Nutcracker          30
Name: english_cname, Length: 91, dtype: int64

As we can see, this is a balanced dataset, though each of the 91 species has only 30 samples. This will likely prove a challenge that we will need to overcome. Out of curiosity, we may want to consider defining our target as the genus of the bird instead, which could increase the number of samples per target label.
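
To put the class sizes in perspective, here is a quick back-of-the-envelope sketch. The class and sample counts come from the output above, the 75/25 split is the one used later in this notebook, and the variable names are purely illustrative.

```python
n_classes = 91           # "Length: 91" in the value_counts() output above
samples_per_class = 30   # every species listed shows 30 recordings

total_samples = n_classes * samples_per_class

# On a balanced dataset, guessing uniformly at random is right 1 time in n_classes.
chance_accuracy = 1 / n_classes

# Average number of training recordings per species, assuming a 75/25 split.
train_per_class = samples_per_class * 0.75

print(total_samples)
print(round(chance_accuracy, 4))
print(train_per_class)
```

Any accuracy we measure later is best read against this chance level rather than against 50%.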

In [7]:
bird_index.genus.value_counts()

Vireo         150
Polioptila     90
Empidonax      90
Corvus         60
Agelaius       60
             ... 
Junco          30
Mimus          30
Spinus         30
Bubo           30
Icteria        30
Name: genus, Length: 68, dtype: int64

This gives us fewer categories (68 instead of 91), but it introduces imbalance into the dataset. We will have to experiment to see which target a model predicts more accurately. In general, though, a model that classifies by genus is less useful than one that predicts the full species, since a species prediction also tells us the genus.
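
One rough way to quantify the imbalance is the ratio between the largest and smallest class. The sketch below uses only the ten genera visible in the truncated output above, so it is an illustration rather than a full summary; `imbalance_ratio` is just an illustrative name.

```python
import pandas as pd

# Only the ten genera visible in the truncated output above; the other 58 are omitted.
genus_counts = pd.Series({
    'Vireo': 150, 'Polioptila': 90, 'Empidonax': 90, 'Corvus': 60, 'Agelaius': 60,
    'Junco': 30, 'Mimus': 30, 'Spinus': 30, 'Bubo': 30, 'Icteria': 30,
})

# Ratio of the most common to the least common class; 1.0 means perfectly balanced.
imbalance_ratio = genus_counts.max() / genus_counts.min()
print(f'largest / smallest genus: {imbalance_ratio:.1f}x')
```

On the real index, the same ratio could be computed from `bird_index.genus.value_counts()` directly.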

Anyhow, let's continue on with our data cleaning process.

In [8]:
bird_index_trim = bird_index.drop(columns=[
    'country',
    'file_url',
    'license',
    'recordist',
    'recordist_url',
    'sonogram_url',
    'remarks',
    'latitude',
    'longitude',
    'location',
    'full_name'])

In [9]:
bird_index_trim.head()

Unnamed: 0,duration_seconds,english_cname,file_id,file_name,genus,species,type
0,3,Abert's Towhee,17804,XC17804.mp3,Melozone,aberti,'seet' call
1,4,Abert's Towhee,177367,XC177367.mp3,Melozone,aberti,call
2,4,Abert's Towhee,145505,XC145505.mp3,Melozone,aberti,Squeal duet
3,5,Abert's Towhee,228159,XC228159.mp3,Melozone,aberti,interaction duet
4,5,Abert's Towhee,51313,XC51313.mp3,Melozone,aberti,call


In [16]:
bird_index_trim.tail()

Unnamed: 0,duration_seconds,english_cname,file_id,file_name,genus,species,type
2725,65,Yellow-breasted Chat,278880,XC278880.mp3,Icteria,virens,call
2726,67,Yellow-breasted Chat,247723,XC247723.mp3,Icteria,virens,song
2727,68,Yellow-breasted Chat,408122,XC408122.mp3,Icteria,virens,call
2728,71,Yellow-breasted Chat,315271,XC315271.mp3,Icteria,virens,song
2729,72,Yellow-breasted Chat,412925,XC412925.mp3,Icteria,virens,"call, song"


In [11]:
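# One column per coefficient: mfcc_0 ... mfcc_{n_mfcc - 1}
mfcc_cols = [f'mfcc_{i}' for i in range(afe.n_mfcc)]
mfcc_df = pd.DataFrame(columns=mfcc_cols)

for file_name in bird_index.file_name:
    path = f'raw_data/{file_name}'
    audio = afe.get_audio(path)
    mfcc = afe.extract_mfcc(audio)
    # Average each coefficient along axis 1; the row is named after the file ID (extension stripped).
    # Note: this Series has a default integer index, so its values land in new integer-named
    # columns rather than under mfcc_cols (corrected in the next cell).
    mean_mfcc = pd.Series(data=np.mean(mfcc, axis=1).T, name=file_name[:-4])
    # Note: DataFrame.append was removed in pandas 2.0.
    mfcc_df = mfcc_df.append(mean_mfcc)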

mfcc_df.head()

Unnamed: 0,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,mfcc_9,...,30,31,32,33,34,35,36,37,38,39
XC17804,,,,,,,,,,,...,-4.339367,1.333487,-1.381387,1.127425,0.019164,1.58422,-1.046119,0.390373,-0.772603,0.007741
XC177367,,,,,,,,,,,...,4.171523,10.4455,4.803803,7.397606,2.491925,7.013959,2.289602,6.180407,3.391241,5.123373
XC145505,,,,,,,,,,,...,-4.910753,2.799666,-3.150884,0.545678,-4.270985,0.121063,-1.778178,0.215153,-2.737019,-0.376697
XC228159,,,,,,,,,,,...,-1.24084,5.67948,-1.4401,2.651594,-1.378592,4.166149,-1.852941,0.604743,-0.974777,1.993201
XC51313,,,,,,,,,,,...,-2.270434,2.383327,-2.32989,1.478776,-2.991751,2.201844,-2.684334,0.971656,-1.603543,1.356117


In [12]:
# The first 40 columns are all NaN; keep the 40 populated ones and restore the mfcc_* names.
mfcc_df_fix = mfcc_df.iloc[:,40:]
mfcc_df_fix.columns = mfcc_cols
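
The all-NaN columns can be avoided by giving each row the column labels up front. Below is a minimal sketch of that idea, assuming each MFCC matrix has shape `(n_mfcc, n_frames)`. The random arrays and the `XC_demo_*` names are hypothetical stand-ins for the output of `afe.extract_mfcc`, whose internals are not shown here.

```python
import numpy as np
import pandas as pd

n_mfcc = 40  # matches the 40 coefficient columns shown above
mfcc_cols = [f'mfcc_{i}' for i in range(n_mfcc)]

rng = np.random.default_rng(0)
# Hypothetical stand-ins for per-recording MFCC matrices of shape (n_mfcc, n_frames).
fake_mfccs = {
    'XC_demo_a': rng.normal(size=(n_mfcc, 120)),
    'XC_demo_b': rng.normal(size=(n_mfcc, 75)),
}

rows = []
for name, mfcc in fake_mfccs.items():
    # Labelling the Series with mfcc_cols makes its values line up with the columns.
    rows.append(pd.Series(mfcc.mean(axis=1), index=mfcc_cols, name=name))

# Building the frame once from a list also avoids growing it row by row.
mfcc_demo = pd.DataFrame(rows)
print(mfcc_demo.shape)
```

Collecting the rows in a list and constructing the DataFrame once also sidesteps `DataFrame.append`, which no longer exists in recent pandas releases.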

In [13]:
mfcc_df_fix.head()

Unnamed: 0,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,mfcc_9,...,mfcc_30,mfcc_31,mfcc_32,mfcc_33,mfcc_34,mfcc_35,mfcc_36,mfcc_37,mfcc_38,mfcc_39
XC17804,-637.002319,92.520363,24.869247,16.338438,9.65211,16.172144,-4.202899,-0.850909,2.363304,5.314737,...,-4.339367,1.333487,-1.381387,1.127425,0.019164,1.58422,-1.046119,0.390373,-0.772603,0.007741
XC177367,-476.148224,38.358307,-20.634218,27.997766,-3.036796,-6.488371,-1.503356,2.496129,-7.325582,4.177428,...,4.171523,10.4455,4.803803,7.397606,2.491925,7.013959,2.289602,6.180407,3.391241,5.123373
XC145505,-470.592529,-2.882417,-31.855843,-4.317948,-27.501474,-11.062483,-17.295322,-12.182595,-23.136421,-9.659082,...,-4.910753,2.799666,-3.150884,0.545678,-4.270985,0.121063,-1.778178,0.215153,-2.737019,-0.376697
XC228159,-409.649872,39.505535,-55.50819,41.864464,1.146368,15.745145,-11.268634,9.626593,-2.680408,1.706944,...,-1.24084,5.67948,-1.4401,2.651594,-1.378592,4.166149,-1.852941,0.604743,-0.974777,1.993201
XC51313,-371.380554,49.912628,-6.731991,14.544933,-21.981588,-2.324481,0.908933,-0.259918,-4.7134,-2.809736,...,-2.270434,2.383327,-2.32989,1.478776,-2.991751,2.201844,-2.684334,0.971656,-1.603543,1.356117


In [14]:
mfcc_df_fix.to_csv('feature_extraction/mfcc.csv')

In [38]:
mfcc_df_fix

Unnamed: 0,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,mfcc_9,...,mfcc_31,mfcc_32,mfcc_33,mfcc_34,mfcc_35,mfcc_36,mfcc_37,mfcc_38,mfcc_39,target_label
XC17804,-637.002319,92.520363,24.869247,16.338438,9.652110,16.172144,-4.202899,-0.850909,2.363304,5.314737,...,1.333487,-1.381387,1.127425,0.019164,1.584220,-1.046119,0.390373,-0.772603,0.007741,
XC177367,-476.148224,38.358307,-20.634218,27.997766,-3.036796,-6.488371,-1.503356,2.496129,-7.325582,4.177428,...,10.445500,4.803803,7.397606,2.491925,7.013959,2.289602,6.180407,3.391241,5.123373,
XC145505,-470.592529,-2.882417,-31.855843,-4.317948,-27.501474,-11.062483,-17.295322,-12.182595,-23.136421,-9.659082,...,2.799666,-3.150884,0.545678,-4.270985,0.121063,-1.778178,0.215153,-2.737019,-0.376697,
XC228159,-409.649872,39.505535,-55.508190,41.864464,1.146368,15.745145,-11.268634,9.626593,-2.680408,1.706944,...,5.679480,-1.440100,2.651594,-1.378592,4.166149,-1.852941,0.604743,-0.974777,1.993201,
XC51313,-371.380554,49.912628,-6.731991,14.544933,-21.981588,-2.324481,0.908933,-0.259918,-4.713400,-2.809736,...,2.383327,-2.329890,1.478776,-2.991751,2.201844,-2.684334,0.971656,-1.603543,1.356117,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XC278880,-420.937195,41.450069,11.201295,30.159315,15.052879,7.443112,0.569483,6.847230,-4.032795,5.055322,...,4.993779,2.813457,4.545143,2.098345,3.940672,2.590675,3.917060,2.527125,3.446178,
XC247723,-484.730225,62.208557,-8.232522,22.582636,-3.817088,6.886751,2.701933,9.559089,-0.533370,6.664311,...,0.811582,0.161992,1.210766,-2.106176,1.584530,-1.239140,-0.508456,-1.674814,1.773405,
XC408122,-459.099884,35.358799,-82.617920,-26.392630,-43.452499,-2.535847,-0.103498,-4.882932,-6.143108,-2.026506,...,0.805393,-1.201482,1.870967,-0.135233,1.834059,-0.258632,1.737307,-1.602005,1.410520,
XC315271,-378.852264,122.311287,-62.072483,-20.169588,-31.252464,-10.470041,-22.994585,-13.320385,-12.598288,-7.095076,...,1.393314,-2.446371,0.063041,-2.660192,0.449015,-2.764065,0.759668,-2.104463,0.713509,


In [41]:
target_labels = bird_index_trim.english_cname

In [40]:
mfcc_df_fix = mfcc_df_fix.drop(columns='target_label')

In [42]:
mfcc_df_fix

Unnamed: 0,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,mfcc_9,...,mfcc_30,mfcc_31,mfcc_32,mfcc_33,mfcc_34,mfcc_35,mfcc_36,mfcc_37,mfcc_38,mfcc_39
XC17804,-637.002319,92.520363,24.869247,16.338438,9.652110,16.172144,-4.202899,-0.850909,2.363304,5.314737,...,-4.339367,1.333487,-1.381387,1.127425,0.019164,1.584220,-1.046119,0.390373,-0.772603,0.007741
XC177367,-476.148224,38.358307,-20.634218,27.997766,-3.036796,-6.488371,-1.503356,2.496129,-7.325582,4.177428,...,4.171523,10.445500,4.803803,7.397606,2.491925,7.013959,2.289602,6.180407,3.391241,5.123373
XC145505,-470.592529,-2.882417,-31.855843,-4.317948,-27.501474,-11.062483,-17.295322,-12.182595,-23.136421,-9.659082,...,-4.910753,2.799666,-3.150884,0.545678,-4.270985,0.121063,-1.778178,0.215153,-2.737019,-0.376697
XC228159,-409.649872,39.505535,-55.508190,41.864464,1.146368,15.745145,-11.268634,9.626593,-2.680408,1.706944,...,-1.240840,5.679480,-1.440100,2.651594,-1.378592,4.166149,-1.852941,0.604743,-0.974777,1.993201
XC51313,-371.380554,49.912628,-6.731991,14.544933,-21.981588,-2.324481,0.908933,-0.259918,-4.713400,-2.809736,...,-2.270434,2.383327,-2.329890,1.478776,-2.991751,2.201844,-2.684334,0.971656,-1.603543,1.356117
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XC278880,-420.937195,41.450069,11.201295,30.159315,15.052879,7.443112,0.569483,6.847230,-4.032795,5.055322,...,1.832722,4.993779,2.813457,4.545143,2.098345,3.940672,2.590675,3.917060,2.527125,3.446178
XC247723,-484.730225,62.208557,-8.232522,22.582636,-3.817088,6.886751,2.701933,9.559089,-0.533370,6.664311,...,-1.958673,0.811582,0.161992,1.210766,-2.106176,1.584530,-1.239140,-0.508456,-1.674814,1.773405
XC408122,-459.099884,35.358799,-82.617920,-26.392630,-43.452499,-2.535847,-0.103498,-4.882932,-6.143108,-2.026506,...,-1.583124,0.805393,-1.201482,1.870967,-0.135233,1.834059,-0.258632,1.737307,-1.602005,1.410520
XC315271,-378.852264,122.311287,-62.072483,-20.169588,-31.252464,-10.470041,-22.994585,-13.320385,-12.598288,-7.095076,...,-2.756106,1.393314,-2.446371,0.063041,-2.660192,0.449015,-2.764065,0.759668,-2.104463,0.713509


In [50]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [44]:
ss = StandardScaler()

In [47]:
mfcc_df_norm = pd.DataFrame(ss.fit_transform(mfcc_df_fix), columns=mfcc_df_fix.columns)

In [48]:
mfcc_df_norm.describe()

Unnamed: 0,mfcc_0,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,mfcc_6,mfcc_7,mfcc_8,mfcc_9,...,mfcc_30,mfcc_31,mfcc_32,mfcc_33,mfcc_34,mfcc_35,mfcc_36,mfcc_37,mfcc_38,mfcc_39
count,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,...,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0,2730.0
mean,-2.827205e-16,-1.732436e-17,9.146123000000001e-17,-1.504825e-17,1.174274e-17,-6.0187910000000004e-18,-1.191965e-16,2.785724e-17,-2.8711260000000003e-17,-1.3013600000000001e-17,...,-4.050484e-17,-2.3912500000000003e-17,-8.316506e-17,7.161549e-17,8.434442000000001e-17,-1.213112e-16,-8.442575000000001e-17,-2.846726e-18,6.612537000000001e-17,5.505873e-17
std,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,...,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183,1.000183
min,-5.304026,-4.576742,-4.055969,-3.927953,-4.722705,-3.742513,-3.446451,-3.581297,-4.174816,-4.071939,...,-6.523359,-5.095152,-3.373265,-4.5699,-3.908238,-5.391141,-5.854865,-5.030977,-5.545018,-5.968292
25%,-0.6717914,-0.568773,-0.5521012,-0.6123356,-0.6753495,-0.6273742,-0.6416874,-0.6192902,-0.625715,-0.5732595,...,-0.6027892,-0.5934565,-0.6620635,-0.5966032,-0.6244224,-0.6061979,-0.6558736,-0.6022021,-0.6497748,-0.595642
50%,0.02820311,0.1175528,0.1046891,-0.02412203,0.04313718,-0.01720562,0.013935,-0.01476739,-0.01281406,0.04153985,...,-0.05655705,-0.003816292,-0.1248205,0.01327996,-0.1050297,0.003138745,-0.1069318,-0.001950243,-0.1374201,0.006653865
75%,0.6711735,0.6584625,0.7084924,0.6114238,0.6659653,0.6478918,0.6590574,0.6138842,0.5846345,0.6270869,...,0.5781046,0.5702201,0.5679978,0.584066,0.5819345,0.5739366,0.5593109,0.5709235,0.5588487,0.5706071
max,3.107173,2.577024,2.984092,5.459884,3.696341,4.946857,3.334387,5.012965,5.309098,3.482304,...,4.489061,7.237955,5.343588,4.798363,4.784788,5.238797,5.421757,5.233231,5.101888,5.000673


In [51]:
le = LabelEncoder()
encoded_labels = le.fit_transform(target_labels)

In [52]:
encoded_labels

array([ 0,  0,  0, ..., 90, 90, 90])

In [53]:
x_train, x_test, y_train, y_test = train_test_split(mfcc_df_norm, encoded_labels, test_size=0.25)
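
Two details of the cells above are worth keeping in mind: the scaler was fit on the full feature table before splitting, so test-set statistics influence the scaling, and the split is neither stratified nor seeded. The following is a minimal sketch of one alternative, not the approach this notebook takes. It runs on synthetic data, and the array sizes and `random_state` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 40))     # synthetic stand-in for the 40 mean MFCCs
y = np.repeat(np.arange(10), 40)   # 10 balanced classes, 40 samples each

# stratify=y keeps class proportions (near-)equal in both halves of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Inside a pipeline, the scaler is fit on the training portion only.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_tr, y_tr)

test_counts = np.bincount(y_te)
print(test_counts)
```

The same pipeline object can be passed to `GridSearchCV`, in which case the scaler is refit within each cross-validation fold.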

In [54]:
dtc = DecisionTreeClassifier()

In [55]:
dtc.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [56]:
dtc.score(x_train, y_train)

0.9995114802149487

In [57]:
dtc.score(x_test, y_test)

0.09224011713030747

In [58]:
from sklearn.model_selection import GridSearchCV

In [59]:
param_grid = {'criterion': ['gini', 'entropy'],
              'max_features': np.arange(2, 40, 2),
              'max_depth': np.arange(1, 20, 1)}
dtc_gs = GridSearchCV(dtc, param_grid, n_jobs=-1, cv=3)

In [60]:
dtc_gs.fit(x_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=None,
                                              splitter='best'),
             iid='warn', n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9

In [61]:
dtc_gs.best_score_

0.07327796775769418

In [62]:
dtc_gs.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=17,
                       max_features=18, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [63]:
from sklearn.svm import SVC

In [73]:
svc = SVC(probability=True)

In [74]:
param_grid = {'C': 10**np.arange(-2.0, 3.0, 1.0),
              'kernel': ['poly', 'rbf', 'linear']}

svc_gs = GridSearchCV(svc, param_grid, cv=3, n_jobs=-1)

In [75]:
svc_gs.fit(x_train, y_train)

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                           probability=True, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=-1,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'kernel': ['poly', 'rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [76]:
svc_gs.best_score_

0.17537860283341475