### In this notebook we construct an svc_rbf model using a train-validation-test split which achieves:
### 67% accuracy (vs. baseline 24%) classifying 'fam_or_subfam' (15 possible values) on unseen test data, and
### 91% accuracy (vs. baseline 67%) classifying 'critter_name' (3 possible values) when trained on 'fam_or_subfam'.
### When trained on 'critter_name' alone, our model also achieves 91% accuracy. Neither model ever guesses 'cicada'.
### Either model would serve as a reasonable classifier for 'cricket' vs 'kaydid', achieving about 91% accuracy.

In [69]:
import librosa, librosa.display
import numpy as np
import matplotlib.pyplot as plt
import os
import scipy
import math
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.metrics import r2_score
from sklearn.decomposition import TruncatedSVD

In [70]:
#Import the training and testing data files
df = pd.read_csv('MLNS_Final_Train.csv')
df_test = pd.read_csv('MLNS_Final_Test.csv')

In [71]:
#Returns a dataframe containing the mfcc avg and var and hs_mfcc avg and var columns, truncated to the indicated
#depths, plus main_freq, range, max_mean, and peak_freq if other_Features is True.
def truncate_mfcc(df, avg_depth=20, var_depth=10, hs_avg_depth=20, hs_var_depth=10, other_Features=True):
    #Collect the column names in order, then select them all at once.
    cols = ['mfcc_'+str(n)+'_avg' for n in range(0,avg_depth)]
    cols += ['mfcc_'+str(n)+'_var' for n in range(0,var_depth)]
    cols += ['hs_mfcc_'+str(n)+'_avg' for n in range(0,hs_avg_depth)]
    cols += ['hs_mfcc_'+str(n)+'_var' for n in range(0,hs_var_depth)]
    if other_Features:
        cols += ['main_freq', 'range', 'max_mean', 'peak_freq']
    return df[cols].copy()

In [72]:
#Computes the (train, validation) accuracy of svc_rbf for the given hyperparameter C, mfcc feature depths, and
#other_Features flag. crit=True targets critter_name; otherwise the target is fam_or_subfam.
def svc_rbf_acc(df, C=5, avg_depth=40, var_depth=40, hs_avg_depth=40, hs_var_depth=40, other_Features=True, crit=False):
    X=truncate_mfcc(df=df, avg_depth=avg_depth,var_depth=var_depth,hs_avg_depth=hs_avg_depth,hs_var_depth=hs_var_depth,other_Features=other_Features)
    if crit:
        Y = df['critter_name']
    else:
        Y = df['fam_or_subfam']
    X_train, X_val, y_train, y_val = train_test_split(X.copy(), Y, shuffle=True, random_state=17, test_size=.2, stratify=Y)
    svc_pipe = Pipeline([('scale', StandardScaler()), ('svc_rbf',SVC(kernel='rbf', C=C))])
    svc_pipe.fit(X_train, y_train)
    pred_train = svc_pipe.predict(X_train)
    pred_val = svc_pipe.predict(X_val)
    score_train = accuracy_score(y_train,pred_train)
    score_val = accuracy_score(y_val,pred_val)
    return score_train, score_val

### The data below compares accuracy on training and validation sets.

In [73]:
# Runs svc_rbf over the indicated depth of all 4 mfcc features, including the 4 other features, and the values of C.
for depth in range(1,40,4):
    for C in [.1,1,5,10,20]:
        A, B =svc_rbf_acc(df,C,depth,depth,depth,depth,True,False)
        print(f"svc_rbf (Train, Validation) accuracy when mfcc depth = {depth} and C = {C}: {A}, {B}") 

svc_rbf (Train, Validation) accuracy when mfcc depth = 1 and C = 0.1: 0.5234437086092715, 0.5391949152542372
svc_rbf (Train, Validation) accuracy when mfcc depth = 1 and C = 1: 0.5827814569536424, 0.5699152542372882
svc_rbf (Train, Validation) accuracy when mfcc depth = 1 and C = 5: 0.6373509933774835, 0.576271186440678
svc_rbf (Train, Validation) accuracy when mfcc depth = 1 and C = 10: 0.663046357615894, 0.5783898305084746
svc_rbf (Train, Validation) accuracy when mfcc depth = 1 and C = 20: 0.6858278145695365, 0.5911016949152542
svc_rbf (Train, Validation) accuracy when mfcc depth = 5 and C = 0.1: 0.5803973509933775, 0.597457627118644
svc_rbf (Train, Validation) accuracy when mfcc depth = 5 and C = 1: 0.702251655629139, 0.6578389830508474
svc_rbf (Train, Validation) accuracy when mfcc depth = 5 and C = 5: 0.7984105960264901, 0.6694915254237288
svc_rbf (Train, Validation) accuracy when mfcc depth = 5 and C = 10: 0.8392052980132451, 0.6726694915254238
svc_rbf (Train, Validation) accura

### It appears that using all of our features gives the greatest accuracy. Next we fine-tune C, observing that overfitting already seems to have set in by C=5, where training accuracy climbs well above validation accuracy.

In [74]:
for C in np.linspace(.1,3.1,31):
    A, B =svc_rbf_acc(df,C,40,40,40,40,True,False)
    print(f"svc_rbf (Train, Validation) accuracy when mfcc depth = {40} and C = {C}: {A}, {B}") 

svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.1: 0.5589403973509933, 0.559322033898305
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.2: 0.6333774834437086, 0.6197033898305084
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.30000000000000004: 0.679205298013245, 0.6440677966101694
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.4: 0.704635761589404, 0.6557203389830508
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.5: 0.7282119205298013, 0.6663135593220338
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.6: 0.7520529801324504, 0.6705508474576272
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.7000000000000001: 0.7711258278145695, 0.6779661016949152
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.8: 0.7896688741721855, 0.6864406779661016
svc_rbf (Train, Validation) accuracy when mfcc depth = 40 and C = 0.9: 0.8068874172185431, 0

### C=1.5 already gives essentially the maximum validation accuracy, and increasing C beyond this only leads to overfitting. So we select C=1.5. Now we train the model and test it.
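
### The selection rule described above (take the smallest C whose validation accuracy is essentially the maximum) can be sketched as follows. This is only an illustration: the data come from make_classification rather than our MFCC features, and the helper pick_smallest_good_C and its tolerance tol=0.005 are assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def pick_smallest_good_C(results, tol=0.005):
    """results: list of (C, train_acc, val_acc) tuples.
    Return the smallest C whose validation accuracy is within tol of the best."""
    best_val = max(val for _, _, val in results)
    return min(C for C, _, val in results if val >= best_val - tol)

# Synthetic stand-in data (assumption): NOT the notebook's MFCC features.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=17)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, shuffle=True, random_state=17, test_size=.2, stratify=y)

# Same sweep over C as in the notebook, recording train and validation accuracy.
results = []
for C in np.linspace(.1, 3.1, 31):
    pipe = Pipeline([('scale', StandardScaler()), ('svc_rbf', SVC(kernel='rbf', C=C))])
    pipe.fit(X_train, y_train)
    results.append((float(C), pipe.score(X_train, y_train), pipe.score(X_val, y_val)))

chosen_C = pick_smallest_good_C(results)
print(f"Chosen C = {chosen_C:.1f}")
```

### Preferring the smallest C among near-ties keeps the gap between training and validation accuracy as small as possible.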

In [75]:
#First train the model on exactly the same training split used above to select C, with C=1.5.
svc_pipe = Pipeline([('scale', StandardScaler()), ('svc_rbf',SVC(kernel='rbf', C=1.5))])
X=truncate_mfcc(df=df, avg_depth=40,var_depth=40,hs_avg_depth=40,hs_var_depth=40,other_Features=True)
Y = df['fam_or_subfam']
X_train, X_val, y_train, y_val = train_test_split(X.copy(), Y, shuffle=True, random_state=17, test_size=.2, stratify=Y)
svc_pipe.fit(X_train, y_train)

#Test the model on unseen test data. The baseline is to always predict the most common family, Gryllinae.
X_test=truncate_mfcc(df=df_test, avg_depth=40,var_depth=40,hs_avg_depth=40,hs_var_depth=40,other_Features=True)
y_test=df_test['fam_or_subfam']
pred = svc_pipe.predict(X_test)
pred_baseline = np.full(len(pred), 'Gryllinae', dtype=object)
print(f'Test accuracy: {accuracy_score(y_test,pred)} vs. Baseline: {accuracy_score(y_test,pred_baseline)}')

Test accuracy: 0.673728813559322 vs. Baseline: 0.24067796610169492
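
### As a sketch of the same majority-class baseline, scikit-learn's DummyClassifier with strategy='most_frequent' learns the most common label from the training labels instead of having 'Gryllinae' written in by hand. The toy label arrays below are assumptions standing in for y_train and y_test.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (assumption): stand-ins for y_train and y_test in the notebook.
y_train_toy = np.array(['Gryllinae'] * 5 + ['Oecanthinae'] * 3 + ['Cicadidae'] * 2)
y_test_toy = np.array(['Gryllinae', 'Oecanthinae', 'Gryllinae', 'Cicadidae'])

# DummyClassifier ignores the features, so a column of zeros is enough.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
pred_baseline_toy = baseline.predict(np.zeros((len(y_test_toy), 1)))

# 2 of the 4 toy test labels are 'Gryllinae', the most frequent training label.
baseline_acc = accuracy_score(y_test_toy, pred_baseline_toy)
print(f'Baseline accuracy: {baseline_acc}')
```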


### Our svc_rbf model with C=1.5 achieves 67% accuracy identifying 'fam_or_subfam' on unseen test data vs the baseline of 24%. Next, we test our model on identifying critter_name: 'cricket', 'kaydid', or 'cicada'.

In [76]:
#Dictionary from 'fam_or_subfam' to the coarser classification 'critter_name'
fam_dict = {'Gryllinae':'cricket', 'Conocephalinae':'kaydid', 'Oecanthinae':'cricket',
            'Phaneropterinae': 'kaydid', 'Trigonidiinae':'cricket', 'Nemobiinae':'cricket', 'Hapithinae':'cricket', 
            'Mogoplistinae':'cricket', 'Tettigoniinae':'kaydid', 'Pseudophyllinae':'kaydid', 'Cicadidae':'cicada',
            'Gryllotalpidae':'cricket', 'Eneopterinae':'cricket', 'Phalangopsidae':'cricket', 'Listroscelidinae':'cricket'}

#Converts a pd Series with fam_or_subfam entries into critter names
def fam_to_crit(series):
    X=series.copy()
    for i in range(0, X.shape[0]):
        X.at[i]=fam_to_crit_string(X[i])
    return X

def fam_to_crit_string(fam_name):
    return fam_dict[fam_name]
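
### As a sketch of the lookup that fam_to_crit performs, pandas Series.map also accepts a dictionary directly. The example uses three entries copied from fam_dict; the name fam_dict_small is hypothetical. One difference to keep in mind: map yields NaN for a name missing from the dictionary, whereas fam_to_crit_string raises a KeyError.

```python
import pandas as pd

# Three entries copied from fam_dict; the name fam_dict_small is hypothetical.
fam_dict_small = {'Gryllinae': 'cricket', 'Conocephalinae': 'kaydid', 'Cicadidae': 'cicada'}

fams = pd.Series(['Gryllinae', 'Cicadidae', 'Conocephalinae', 'Gryllinae'])

# Vectorised dictionary lookup; works for any index, not only 0..n-1.
crits = fams.map(fam_dict_small)
print(crits.tolist())  # ['cricket', 'cicada', 'kaydid', 'cricket']
```

### The loop in fam_to_crit indexes by position label, which is why the cells below wrap y_test and pred in a fresh pd.Series before converting them.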

### Our svc_rbf model with C=1.5 achieves 91% accuracy identifying 'cricket', 'kaydid', or 'cicada' on unseen test data when trained on fam_or_subfam, vs. the baseline 67% (always guessing cricket). Below is the confusion matrix. Unlike knn, this model never predicts cicada!

In [77]:
print(f'Test accuracy: {accuracy_score(fam_to_crit(pd.Series(list(y_test))),fam_to_crit(pd.Series(pred)))} vs. Baseline {accuracy_score(fam_to_crit(pd.Series(list(y_test))),fam_to_crit(pd.Series(pred_baseline)))}')

Test accuracy: 0.9101694915254237 vs. Baseline 0.6686440677966101


In [78]:
conf_mat = confusion_matrix(fam_to_crit(pd.Series(list(y_test))), fam_to_crit(pd.Series(pred)))

In [79]:
pd.DataFrame(conf_mat,
                 columns = ['Predicted cicada', 'Predicted cricket', 'Predicted kaydid'],
                 index = ['Actual cicada', 'Actual cricket', 'Actual kaydid'])

                Predicted cicada  Predicted cricket  Predicted kaydid
Actual cicada                  0                  9                 2
Actual cricket                 0                742                47
Actual kaydid                  0                 48               332
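
### To make the "never predicts cicada" observation concrete, the sketch below recomputes overall accuracy and per-class recall from the confusion matrix above. The counts are copied from that table; the variable names are ours.

```python
import numpy as np

# Confusion matrix copied from the table above (rows = actual, columns = predicted).
labels = ['cicada', 'cricket', 'kaydid']
conf = np.array([[0,   9,   2],
                 [0, 742,  47],
                 [0,  48, 332]])

# Overall accuracy: correct predictions (diagonal) over all recordings.
accuracy = np.trace(conf) / conf.sum()

# Per-class recall: fraction of each actual class that was predicted correctly.
recall = np.diag(conf) / conf.sum(axis=1)

print(f'accuracy = {accuracy:.4f}')
for name, r in zip(labels, recall):
    print(f'recall[{name}] = {r:.4f}')
```

### Cicada recall is 0 because all 11 cicada recordings are assigned to cricket (9) or kaydid (2), so the 91% overall accuracy says nothing about cicadas.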


### Next, we train a model only on critter_name data. First we find an appropriate value of C.

In [80]:
#Note: the printed label says 'knn', but the model evaluated here is svc_rbf.
for C in np.linspace(.1,3.1,31):
    print(f"knn (Train, Validation) accuracy when C = {C}: {svc_rbf_acc(df,C,40,40,40,40,True,True)}") 

knn (Train, Validation) accuracy when C = 0.1: (0.8956291390728477, 0.885593220338983)
knn (Train, Validation) accuracy when C = 0.2: (0.9088741721854304, 0.8951271186440678)
knn (Train, Validation) accuracy when C = 0.30000000000000004: (0.9133774834437086, 0.8972457627118644)
knn (Train, Validation) accuracy when C = 0.4: (0.9178807947019868, 0.8972457627118644)
knn (Train, Validation) accuracy when C = 0.5: (0.9210596026490067, 0.8983050847457628)
knn (Train, Validation) accuracy when C = 0.6: (0.9282119205298013, 0.8972457627118644)
knn (Train, Validation) accuracy when C = 0.7000000000000001: (0.9321854304635762, 0.8983050847457628)
knn (Train, Validation) accuracy when C = 0.8: (0.9356291390728477, 0.9014830508474576)
knn (Train, Validation) accuracy when C = 0.9: (0.9382781456953643, 0.9046610169491526)
knn (Train, Validation) accuracy when C = 1.0: (0.9430463576158941, 0.9046610169491526)
knn (Train, Validation) accuracy when C = 1.1: (0.9464900662251655, 0.9057203389830508)
kn

### We select C=1.1.

In [81]:
#First train the model on exactly the same training split used above to select C, now with critter_name labels and C=1.1.
svc_pipe = Pipeline([('scale', StandardScaler()), ('svc_rbf',SVC(kernel='rbf', C=1.1))])
X=truncate_mfcc(df=df, avg_depth=40,var_depth=40,hs_avg_depth=40,hs_var_depth=40,other_Features=True)
Y = df['critter_name']
X_train, X_val, y_train, y_val = train_test_split(X.copy(), Y, shuffle=True, random_state=17, test_size=.2, stratify=Y)
svc_pipe.fit(X_train, y_train)

#Test the model on unseen test data.
X_test=truncate_mfcc(df=df_test, avg_depth=40,var_depth=40,hs_avg_depth=40,hs_var_depth=40,other_Features=True)
y_test=df_test['critter_name']
pred = svc_pipe.predict(X_test)
pred_baseline = np.full(len(pred), 'cricket', dtype=object)
print(f'Test accuracy: {accuracy_score(y_test,pred)} vs. Baseline: {accuracy_score(y_test,pred_baseline)}')

Test accuracy: 0.9144067796610169 vs. Baseline: 0.6686440677966101


### Our svc_rbf model with C=1.1 achieves 91% accuracy identifying critter_name: 'cricket', 'kaydid', or 'cicada' on unseen test data. Below is the confusion matrix. Interestingly, this model also never guesses cicada!

In [82]:
conf_mat = confusion_matrix(pd.Series(list(y_test)), pd.Series(pred))

In [83]:
pd.DataFrame(conf_mat,
                 columns = ['Predicted cicada', 'Predicted cricket', 'Predicted kaydid'],
                 index = ['Actual cicada', 'Actual cricket', 'Actual kaydid'])

                Predicted cicada  Predicted cricket  Predicted kaydid
Actual cicada                  0                  9                 2
Actual cricket                 0                760                29
Actual kaydid                  0                 61               319
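
### Since neither model ever predicts cicada, it is natural to ask how each does as a pure 'cricket' vs 'kaydid' classifier. The sketch below restricts both confusion matrices to recordings whose actual label is cricket or kaydid. The counts are copied from the two tables in this notebook; the helper cricket_vs_kaydid_accuracy is a hypothetical name introduced here.

```python
import numpy as np

# Both confusion matrices copied from the tables in this notebook
# (row/column order: cicada, cricket, kaydid).
conf_from_fam = np.array([[0, 9, 2], [0, 742, 47], [0, 48, 332]])
conf_from_crit = np.array([[0, 9, 2], [0, 760, 29], [0, 61, 319]])

def cricket_vs_kaydid_accuracy(conf):
    """Accuracy over recordings whose actual label is cricket or kaydid."""
    sub = conf[1:, :]                      # drop the 'Actual cicada' row
    correct = sub[0, 1] + sub[1, 2]        # cricket->cricket plus kaydid->kaydid
    return correct / sub.sum()

acc_fam = cricket_vs_kaydid_accuracy(conf_from_fam)
acc_crit = cricket_vs_kaydid_accuracy(conf_from_crit)
print(f'trained on fam_or_subfam: {acc_fam:.4f}')
print(f'trained on critter_name:  {acc_crit:.4f}')
```

### On this subset both models come out at roughly 92%, in line with the "about 91%" quoted at the top of the notebook once the 11 cicada recordings are excluded.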
