Name: Ashwini Giri. USC ID: 5413882039

# 2. Multi-class Classification Using Support Vector Machines

(a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29# . Choose 70% of the data randomly as the training set.

The dataset is available on the UCI repository. 
Data Set Information:

This dataset was used in several classification tasks related to the challenge of anuran species recognition through their calls. It is a multilabel dataset with three columns of labels. This dataset was created by segmenting 60 audio records belonging to 4 different families, 8 genera, and 10 species. Each audio record corresponds to one specimen (an individual frog); the record ID is also included as an extra column.

Below are all the imports for the question.

In [48]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import hamming_loss
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import preprocessing
from sklearn.svm import LinearSVC
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import math
%matplotlib inline

Loading data into a dataframe 'data'

In [5]:
data=pd.read_csv("Frogs_MFCCs.csv")
data.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species,RecordID
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.108351,-0.077623,-0.009568,0.057684,0.11868,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre,1
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre,1
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre,1
3,1.0,0.224392,0.118985,0.329432,0.372088,0.361005,0.015501,-0.194347,-0.098181,0.270375,...,-0.136009,-0.177037,-0.130498,-0.054766,-0.018691,0.023954,Leptodactylidae,Adenomera,AdenomeraAndre,1
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre,1


Dropping the RecordID column: it is an identifier, not a predictor, so it is not used to train the model.

In [6]:
data.drop(data.columns[[-1]],axis=1,inplace=True)
data.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species
0,1.0,0.152936,-0.105586,0.200722,0.317201,0.260764,0.100945,-0.150063,-0.171128,0.124676,...,-0.024017,-0.108351,-0.077623,-0.009568,0.057684,0.11868,0.014038,Leptodactylidae,Adenomera,AdenomeraAndre
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,0.012022,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,0.083536,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre
3,1.0,0.224392,0.118985,0.329432,0.372088,0.361005,0.015501,-0.194347,-0.098181,0.270375,...,-0.050224,-0.136009,-0.177037,-0.130498,-0.054766,-0.018691,0.023954,Leptodactylidae,Adenomera,AdenomeraAndre
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,0.062837,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre


Extracting a random 70% of the data as the training set and keeping the remaining 30% as the testing set.

In [8]:
length_data = len(data)

In [17]:
seventy = math.ceil(0.7*length_data)

Dividing the data into training data and testing data. A randomly selected 70% of the rows is used for training and the rest is used for testing.

In [20]:
training_dataframe = data.sample(n=seventy)
testing_dataframe = data.loc[~data.index.isin(training_dataframe.index)]
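
The split above draws a different sample on every run. As a minimal sketch, assuming a small made-up frame `toy` in place of the frog data and an arbitrary seed of 0, passing `random_state` to `DataFrame.sample` makes the same 70/30 split repeatable:

```python
import math

import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({"feature": range(10), "label": list("aabbbccccc")})

n_train = math.ceil(0.7 * len(toy))
# random_state (an arbitrary seed chosen here) makes the draw repeatable.
train = toy.sample(n=n_train, random_state=0)
# Every row that was not drawn for training goes to the test set.
test = toy.drop(train.index)

print(len(train), len(test))
```

Re-running the cell with the same seed returns the same rows, which keeps the scores reported further down reproducible.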

In [21]:
training_dataframe.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species
3790,1.0,0.484387,0.311378,0.496459,0.135927,0.018673,-0.10153,-0.069067,0.210057,0.151129,...,0.03937,0.255994,0.060555,-0.034571,-0.077492,0.011143,0.203752,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
6876,1.0,0.762473,0.979206,0.102796,0.018289,0.369815,0.062039,0.032543,0.021825,0.075106,...,0.067579,-0.031503,-0.082713,0.011084,-0.049397,0.03662,-0.067443,Hylidae,Osteocephalus,OsteocephalusOophagus
810,1.0,0.239769,-0.114294,0.274827,0.42423,0.173509,-0.153161,-0.137944,0.125721,0.226308,...,0.006997,-0.125237,-0.08255,0.076587,0.094104,0.009526,-0.051039,Dendrobatidae,Ameerega,Ameeregatrivittata
4604,1.0,0.0448,0.157207,0.588344,0.235148,0.038768,-0.169406,-0.047229,0.193274,0.025277,...,0.270782,0.208196,-0.124745,-0.210277,-0.074102,0.191372,0.196609,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1024,1.0,0.264661,-0.144207,-0.035626,0.070046,0.420539,0.577921,0.132684,-0.344939,-0.330111,...,-0.242179,-0.127762,0.085208,0.045794,-0.009993,0.057729,0.075425,Dendrobatidae,Ameerega,Ameeregatrivittata


In [22]:
testing_dataframe.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,0.012022,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056,Leptodactylidae,Adenomera,AdenomeraAndre
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,0.083536,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162,Leptodactylidae,Adenomera,AdenomeraAndre
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,0.062837,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244,Leptodactylidae,Adenomera,AdenomeraAndre
11,1.0,0.277948,0.091657,0.331656,0.307372,0.257359,0.065702,-0.19186,-0.133537,0.22002,...,-0.01826,-0.119167,-0.1109,-0.112485,-0.053184,0.044291,-0.011456,Leptodactylidae,Adenomera,AdenomeraAndre
15,1.0,0.137623,-0.085808,0.322446,0.344695,0.285642,0.056517,-0.314418,-0.252324,0.288897,...,0.071433,-0.058694,-0.072913,-0.064263,0.022455,0.130752,0.074132,Leptodactylidae,Adenomera,AdenomeraAndre


(b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-class classification is to train a classifier for each label. We first try this approach:


i. Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.

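1. Hamming Loss: The Hamming loss is the fraction of individual labels that are incorrectly predicted, counted over all instances and all labels.

2. Exact Match: In multilabel classification, the accuracy_score function provided by sklearn computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

In short, exact match is the fraction of instances for which every label is predicted correctly. When a single label is evaluated on its own, as in the cells below, exact match reduces to ordinary accuracy and the Hamming loss to 1 - accuracy.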
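
To make the two definitions concrete when all three labels are scored together, here is a minimal pure-Python sketch. The helper name `multilabel_scores` and the four toy rows are assumptions made for illustration; they are not part of the notebook's code or results.

```python
def multilabel_scores(y_true, y_pred):
    """Return (exact_match, hamming_loss) for rows of label tuples.

    Each row holds one value per label, here (Family, Genus, Species).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    exact = 0  # rows in which every label is correct
    wrong = 0  # individual (row, label) cells that are incorrect
    for true_row, pred_row in zip(y_true, y_pred):
        mismatches = sum(t != p for t, p in zip(true_row, pred_row))
        wrong += mismatches
        exact += mismatches == 0
    return exact / n_rows, wrong / (n_rows * n_labels)


# Toy rows (made up for illustration): (Family, Genus, Species)
y_true = [
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),
    ("Dendrobatidae", "Ameerega", "Ameeregatrivittata"),
    ("Hylidae", "Osteocephalus", "OsteocephalusOophagus"),
]
y_pred = [
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),  # all correct
    ("Leptodactylidae", "Adenomera", "AdenomeraHylaedactylus"),  # species wrong
    ("Hylidae", "Osteocephalus", "OsteocephalusOophagus"),       # all three wrong
    ("Hylidae", "Osteocephalus", "OsteocephalusOophagus"),       # all correct
]
exact_match, hamming = multilabel_scores(y_true, y_pred)
print(exact_match, hamming)
```

Two of the four rows are fully correct, so exact match is 2/4, while 4 of the 12 individual labels are wrong, so the Hamming loss is 4/12. Exact match penalizes a row with one wrong label as heavily as a row with three.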

Separating the data into train data, train labels, test data and test labels

In [33]:
training_dataframe_label=training_dataframe[["Family","Genus","Species"]]
training_dataframe_predictors=training_dataframe.drop(["Family","Genus","Species"],axis=1)
testing_dataframe_label=testing_dataframe[["Family","Genus","Species"]]
testing_dataframe_predictors=testing_dataframe.drop(["Family","Genus","Species"],axis=1)

In [34]:
training_dataframe_predictors.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
3790,1.0,0.484387,0.311378,0.496459,0.135927,0.018673,-0.10153,-0.069067,0.210057,0.151129,...,0.415678,0.020157,-0.278,0.03937,0.255994,0.060555,-0.034571,-0.077492,0.011143,0.203752
6876,1.0,0.762473,0.979206,0.102796,0.018289,0.369815,0.062039,0.032543,0.021825,0.075106,...,0.150804,0.037436,-0.11007,0.067579,-0.031503,-0.082713,0.011084,-0.049397,0.03662,-0.067443
810,1.0,0.239769,-0.114294,0.274827,0.42423,0.173509,-0.153161,-0.137944,0.125721,0.226308,...,-0.085351,0.167496,0.204276,0.006997,-0.125237,-0.08255,0.076587,0.094104,0.009526,-0.051039
4604,1.0,0.0448,0.157207,0.588344,0.235148,0.038768,-0.169406,-0.047229,0.193274,0.025277,...,0.165627,-0.356441,-0.203355,0.270782,0.208196,-0.124745,-0.210277,-0.074102,0.191372,0.196609
1024,1.0,0.264661,-0.144207,-0.035626,0.070046,0.420539,0.577921,0.132684,-0.344939,-0.330111,...,0.140017,0.245331,0.030875,-0.242179,-0.127762,0.085208,0.045794,-0.009993,0.057729,0.075425


In [35]:
training_dataframe_label.head()

Unnamed: 0,Family,Genus,Species
3790,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
6876,Hylidae,Osteocephalus,OsteocephalusOophagus
810,Dendrobatidae,Ameerega,Ameeregatrivittata
4604,Leptodactylidae,Adenomera,AdenomeraHylaedactylus
1024,Dendrobatidae,Ameerega,Ameeregatrivittata


In [36]:
testing_dataframe_predictors.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.254341,0.022786,0.16332,0.012022,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.237384,0.050791,0.207338,0.083536,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,-0.298524,0.037439,0.219153,0.062837,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244
11,1.0,0.277948,0.091657,0.331656,0.307372,0.257359,0.065702,-0.19186,-0.133537,0.22002,...,-0.281642,-0.025145,0.11987,-0.01826,-0.119167,-0.1109,-0.112485,-0.053184,0.044291,-0.011456
15,1.0,0.137623,-0.085808,0.322446,0.344695,0.285642,0.056517,-0.314418,-0.252324,0.288897,...,-0.333589,0.041608,0.236627,0.071433,-0.058694,-0.072913,-0.064263,0.022455,0.130752,0.074132


In [37]:
testing_dataframe_label.head()

Unnamed: 0,Family,Genus,Species
1,Leptodactylidae,Adenomera,AdenomeraAndre
2,Leptodactylidae,Adenomera,AdenomeraAndre
4,Leptodactylidae,Adenomera,AdenomeraAndre
11,Leptodactylidae,Adenomera,AdenomeraAndre
15,Leptodactylidae,Adenomera,AdenomeraAndre


ii. Train a SVM for each of the labels, using Gaussian kernels and one versus all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation. You are welcome to try to solve the problem with both normalized and raw attributes and report the results.

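Training a support vector classifier for each of the three labels. SVC uses the Gaussian (RBF) kernel by default. 10-fold cross-validation is used to choose the weight of the penalty and the width of the Gaussian kernel. The weight of the penalty is passed in the C parameter and the kernel width is controlled by the gamma parameter.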
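
The task asks for one-versus-all classifiers, whereas scikit-learn's SVC trains one-vs-one classifiers internally for multi-class targets. As a minimal sketch, assuming synthetic data from `make_classification` in place of the MFCC features and illustrative log-spaced grids (none of these values come from the cells here), an explicit one-vs-rest wrapper can be tuned with 10-fold cross-validation like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the MFCC features (not the frog data).
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)

# Explicit one-vs-rest wrapper: one binary RBF SVM per class.
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf"))

# Log-spaced candidates; the ranges are illustrative, not tuned values.
param_grid = {
    "estimator__C": np.logspace(-1, 3, 5),
    "estimator__gamma": np.logspace(-3, 1, 5),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(ovr_svm, param_grid, cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

A log-spaced grid covers several orders of magnitude with few candidates, which usually makes it easier to see whether the best value sits at the edge of the grid and the range should be widened.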

For class Family

In [84]:
classifier=SVC()
penalty_parameters={'C':[0.8,45,300],'gamma':[0.9,5,28]}
cross_val=GridSearchCV(classifier,penalty_parameters,cv=10)
cross_val.fit(training_dataframe_predictors,training_dataframe_label["Family"])
cross_val.best_params_

{'C': 45, 'gamma': 0.9}

In [96]:
prediction=cross_val.predict(testing_dataframe_predictors)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Family"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Family"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("    Class: Family")
species_daf

      Class: Family


Unnamed: 0,Hamming Loss,Exact Match
0,0.075642,0.924358


For class Species

In [86]:
classifier1=SVC()
penalty_parameters1={'C':[0.4,23,986],'gamma':[0.1,9,76]}
# penalty_parameters1={'C':[0.04,0.005,0.3,20,125],'gamma':[0.2,0.08,0.001,87,455,987]}
cross_val1=GridSearchCV(classifier1,penalty_parameters1,cv=10)
cross_val1.fit(training_dataframe_predictors,training_dataframe_label["Species"])
cross_val1.best_params_

{'C': 986, 'gamma': 0.1}

In [97]:
prediction1=cross_val1.predict(testing_dataframe_predictors)
hamming_loss1=hamming_loss(testing_dataframe_label["Species"],prediction1)
species_daf1 = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
exact_match1=accuracy_score(testing_dataframe_label["Species"], prediction1)
species_daf1.loc[0] = [hamming_loss1,exact_match1]
print("     Class: Species")
species_daf1

      Class: Species


Unnamed: 0,Hamming Loss,Exact Match
0,0.082691,0.917309


For class Genus

In [98]:
classifier2=SVC()
penalty_parameters2={'C':[0.9,98,367],'gamma':[5,90,156]}
# penalty_parameters2={'C':[0.09,0.003,3,90,200,875],'gamma':[7,0.5,0.06,55,127,789]}
cross_val2=GridSearchCV(classifier2,penalty_parameters2,cv=10)
cross_val2.fit(training_dataframe_predictors,training_dataframe_label["Genus"])
cross_val2.best_params_

{'C': 98, 'gamma': 5}

In [99]:
prediction1=cross_val2.predict(testing_dataframe_predictors)
hamming_loss1=hamming_loss(testing_dataframe_label["Genus"],prediction1)
species_daf1 = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
exact_match1=accuracy_score(testing_dataframe_label["Genus"], prediction1)
species_daf1.loc[0] = [hamming_loss1,exact_match1]
print("      Class: Genus")
species_daf1

      Class: Genus


Unnamed: 0,Hamming Loss,Exact Match
0,0.069986,0.930014


iii. Repeat 2(b)ii with L1-penalized SVMs. Remember to normalize the attributes.

Normalizing the attributes using sklearn's preprocessing.normalize, which by default rescales each sample (row) to unit L2 norm.

In [100]:
normalized_training_dataframe = preprocessing.normalize(training_dataframe_predictors)

In [104]:
normalized_testing_dataframe = preprocessing.normalize(testing_dataframe_predictors)
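
`preprocessing.normalize` works row by row, which is different from standardizing each attribute (column). The following NumPy sketch contrasts the two on made-up arrays; the column-wise variant, with statistics taken from the training set only, is shown as an alternative one might consider, not as what the cells above do.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=2.0, scale=3.0, size=(8, 4))  # made-up data
X_test = rng.normal(loc=2.0, scale=3.0, size=(4, 4))

# Row-wise L2 scaling: what preprocessing.normalize does by default.
row_scaled = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)

# Column-wise standardization: the mean and standard deviation come from
# the training set only and are reused, unchanged, on the test set.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
train_std = (X_train - mean) / std
test_std = (X_test - mean) / std

print(np.linalg.norm(row_scaled, axis=1))  # length of each scaled row
print(train_std.std(axis=0))               # spread of each scaled column
```

Row scaling needs no fitted statistics, so applying it separately to the training and testing sets is safe. Column standardization does depend on fitted statistics, which is why the test set reuses the training mean and standard deviation.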

In [101]:
training_dataframe_predictors.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
3790,1.0,0.484387,0.311378,0.496459,0.135927,0.018673,-0.10153,-0.069067,0.210057,0.151129,...,0.415678,0.020157,-0.278,0.03937,0.255994,0.060555,-0.034571,-0.077492,0.011143,0.203752
6876,1.0,0.762473,0.979206,0.102796,0.018289,0.369815,0.062039,0.032543,0.021825,0.075106,...,0.150804,0.037436,-0.11007,0.067579,-0.031503,-0.082713,0.011084,-0.049397,0.03662,-0.067443
810,1.0,0.239769,-0.114294,0.274827,0.42423,0.173509,-0.153161,-0.137944,0.125721,0.226308,...,-0.085351,0.167496,0.204276,0.006997,-0.125237,-0.08255,0.076587,0.094104,0.009526,-0.051039
4604,1.0,0.0448,0.157207,0.588344,0.235148,0.038768,-0.169406,-0.047229,0.193274,0.025277,...,0.165627,-0.356441,-0.203355,0.270782,0.208196,-0.124745,-0.210277,-0.074102,0.191372,0.196609
1024,1.0,0.264661,-0.144207,-0.035626,0.070046,0.420539,0.577921,0.132684,-0.344939,-0.330111,...,0.140017,0.245331,0.030875,-0.242179,-0.127762,0.085208,0.045794,-0.009993,0.057729,0.075425


In [105]:
testing_dataframe_predictors.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
1,1.0,0.171534,-0.098975,0.268425,0.338672,0.268353,0.060835,-0.222475,-0.207693,0.170883,...,-0.254341,0.022786,0.16332,0.012022,-0.090974,-0.05651,-0.035303,0.02014,0.082263,0.029056
2,1.0,0.152317,-0.082973,0.287128,0.276014,0.189867,0.008714,-0.242234,-0.219153,0.232538,...,-0.237384,0.050791,0.207338,0.083536,-0.050691,-0.02359,-0.066722,-0.025083,0.099108,0.077162
4,1.0,0.087817,-0.068345,0.306967,0.330923,0.249144,0.006884,-0.265423,-0.1727,0.266434,...,-0.298524,0.037439,0.219153,0.062837,-0.048885,-0.053074,-0.08855,-0.031346,0.10861,0.079244
11,1.0,0.277948,0.091657,0.331656,0.307372,0.257359,0.065702,-0.19186,-0.133537,0.22002,...,-0.281642,-0.025145,0.11987,-0.01826,-0.119167,-0.1109,-0.112485,-0.053184,0.044291,-0.011456
15,1.0,0.137623,-0.085808,0.322446,0.344695,0.285642,0.056517,-0.314418,-0.252324,0.288897,...,-0.333589,0.041608,0.236627,0.071433,-0.058694,-0.072913,-0.064263,0.022455,0.130752,0.074132


Converting the normalized training data back into a DataFrame.

In [102]:
normalized_training_dataframe=pd.DataFrame(normalized_training_dataframe,columns=training_dataframe_predictors.columns)
normalized_training_dataframe.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
0,0.678855,0.328828,0.21138,0.337023,0.092275,0.012676,-0.068924,-0.046887,0.142598,0.102595,...,0.282185,0.013683,-0.188721,0.026726,0.173782,0.041108,-0.023469,-0.052606,0.007564,0.138318
1,0.6006,0.457941,0.588111,0.06174,0.010984,0.222111,0.037261,0.019545,0.013108,0.045108,...,0.090573,0.022484,-0.066108,0.040588,-0.018921,-0.049678,0.006657,-0.029668,0.021994,-0.040506
2,0.786115,0.188486,-0.089848,0.216046,0.333494,0.136398,-0.120402,-0.10844,0.098831,0.177904,...,-0.067096,0.131671,0.160585,0.005501,-0.098451,-0.064894,0.060207,0.073976,0.007489,-0.040123
3,0.703074,0.031498,0.110528,0.41365,0.165326,0.027257,-0.119105,-0.033206,0.135886,0.017771,...,0.116448,-0.250605,-0.142974,0.19038,0.146377,-0.087705,-0.14784,-0.052099,0.134548,0.138231
4,0.701377,0.185627,-0.101144,-0.024987,0.049128,0.294957,0.40534,0.093062,-0.241933,-0.231532,...,0.098205,0.172069,0.021655,-0.169859,-0.08961,0.059763,0.032119,-0.007009,0.04049,0.052901


Converting the normalized testing data back into a DataFrame.

In [106]:
normalized_testing_dataframe=pd.DataFrame(normalized_testing_dataframe,columns=testing_dataframe_predictors.columns)
normalized_testing_dataframe.head()

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_13,MFCCs_14,MFCCs_15,MFCCs_16,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22
0,0.785985,0.134823,-0.077793,0.210978,0.266191,0.210921,0.047815,-0.174862,-0.163243,0.134311,...,-0.199908,0.01791,0.128367,0.009449,-0.071504,-0.044416,-0.027748,0.01583,0.064657,0.022837
1,0.791909,0.120621,-0.065707,0.227379,0.218578,0.150357,0.006901,-0.191827,-0.173549,0.184149,...,-0.187986,0.040222,0.164193,0.066153,-0.040143,-0.018681,-0.052837,-0.019864,0.078485,0.061106
2,0.757024,0.06648,-0.051739,0.232381,0.250517,0.188608,0.005211,-0.200932,-0.130738,0.201697,...,-0.22599,0.028342,0.165904,0.047569,-0.037007,-0.040178,-0.067035,-0.023729,0.08222,0.05999
3,0.774734,0.215335,0.07101,0.256945,0.238132,0.199385,0.050902,-0.148641,-0.103456,0.170457,...,-0.218198,-0.019481,0.092867,-0.014147,-0.092322,-0.085918,-0.087146,-0.041203,0.034314,-0.008876
4,0.714276,0.098301,-0.06129,0.230315,0.246207,0.204027,0.040369,-0.224581,-0.180229,0.206352,...,-0.238275,0.02972,0.169017,0.051023,-0.041924,-0.05208,-0.045902,0.016039,0.093393,0.052951


Using a linear support vector classifier with an L1 penalty. The weight of the penalty is passed in the 'C' parameter.
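
An L1 penalty tends to drive some coefficients to exactly zero, and the C parameter controls how strongly. The sketch below, on synthetic data from `make_classification` with arbitrarily chosen C values (both are assumptions for illustration), counts the zero coefficients of an L1-penalized `LinearSVC`, which fits one-vs-rest classifiers by default:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

zero_counts = {}
for C in (0.01, 1.0, 100.0):
    # penalty='l1' requires dual=False in LinearSVC.
    model = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000)
    model.fit(X, y)
    # coef_ has one row of weights per class (one-vs-rest).
    zero_counts[C] = int(np.sum(model.coef_ == 0))
    print(f"C={C}: {zero_counts[C]} of {model.coef_.size} coefficients are zero")
```

Smaller C means a stronger penalty, so the number of zeroed coefficients is a quick way to see how much feature selection each candidate value performs.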

For class Family

In [103]:
classifier=LinearSVC(penalty='l1', dual=False)
penalty_parameters={'C':[0.08,0.008,0.8,5,45,345,1000]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
cross_val_penalty.fit(normalized_training_dataframe,training_dataframe_label["Family"])
cross_val_penalty.best_params_

{'C': 345}

In [108]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Family"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Family"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Family")
species_daf

      Class: Family


Unnamed: 0,Hamming Loss,Exact Match
0,0.078654,0.921346


For class Genus

In [109]:
classifier=LinearSVC(penalty='l1', dual=False)
penalty_parameters={'C':[0.02,0.006,0.1,66,8,978]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
cross_val_penalty.fit(normalized_training_dataframe,training_dataframe_label["Genus"])
cross_val_penalty.best_params_

{'C': 66}

In [110]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Genus"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Genus"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Genus")
species_daf

      Class: Genus


Unnamed: 0,Hamming Loss,Exact Match
0,0.063452,0.936548


For class Species

In [112]:
classifier=LinearSVC(penalty='l1',dual=False)
penalty_parameters={'C':[0.09,0.003,3,90,200,875]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
cross_val_penalty.fit(normalized_training_dataframe,training_dataframe_label["Species"])
cv_results_withpenalty=pd.DataFrame(cross_val_penalty.cv_results_)
cross_val_penalty.best_params_

{'C': 90}

In [111]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Species"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Species"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Species")
species_daf

      Class: Species


Unnamed: 0,Hamming Loss,Exact Match
0,0.087345,0.912655


iv. Repeat 2(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classifiers you trained.

Using SMOTE to balance the classes, followed by a linear support vector classifier with an L1 penalty.
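
SMOTE balances classes by creating synthetic minority samples on the line segments between a minority point and one of its nearest minority neighbours. The function below is a simplified NumPy sketch of that interpolation idea; the name `smote_like`, its parameters, and the toy points are assumptions for illustration, and it is not the imbalanced-learn implementation used in the cells that follow.

```python
import numpy as np


def smote_like(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    points and their k nearest minority-class neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)  # starting point of each new sample
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the connecting segment
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy minority class: the four corners of the unit square (made up).
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_new=6)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region already occupied by that class. Only the training data should be resampled; the test set keeps its original class proportions.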

For class Family

In [113]:
classifier=LinearSVC(penalty='l1',dual=False)
penalty_parameters={'C':[0.07,0.006,7,72,321,945]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
smote=SMOTE()
# fit_resample is the current imbalanced-learn name for the older fit_sample.
x_train_smote,y_train_smote = smote.fit_resample(training_dataframe_predictors,training_dataframe_label["Family"])
cross_val_penalty.fit(x_train_smote,y_train_smote)
cross_val_penalty.best_params_

{'C': 321}

In [114]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Family"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Family"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Family")
species_daf

      Class: Family


Unnamed: 0,Hamming Loss,Exact Match
0,0.063248,0.936752


For class Genus

In [118]:
classifier=LinearSVC(penalty='l1',dual=False)
penalty_parameters={'C':[0.34,0.987,3,213]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
smote=SMOTE()
# fit_resample is the current imbalanced-learn name for the older fit_sample.
x_train_smote,y_train_smote = smote.fit_resample(training_dataframe_predictors,training_dataframe_label["Genus"])
cross_val_penalty.fit(x_train_smote,y_train_smote)
cross_val_penalty.best_params_

{'C': 213}

In [115]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Genus"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Genus"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Genus")
species_daf

      Class: Genus


Unnamed: 0,Hamming Loss,Exact Match
0,0.054617,0.945383


For class Species

In [119]:
classifier=LinearSVC(penalty='l1',dual=False)
penalty_parameters={'C':[0.87,0.054,65,121]}
cross_val_penalty=GridSearchCV(classifier,penalty_parameters,cv=10)
smote=SMOTE()
# fit_resample is the current imbalanced-learn name for the older fit_sample.
x_train_smote,y_train_smote = smote.fit_resample(training_dataframe_predictors,training_dataframe_label["Species"])
cross_val_penalty.fit(x_train_smote,y_train_smote)
cross_val_penalty.best_params_

{'C': 65}

In [116]:
prediction=cross_val_penalty.predict(normalized_testing_dataframe)
# New name avoids overwriting the imported hamming_loss function.
hamming_loss_value=hamming_loss(testing_dataframe_label["Species"],prediction)
exact_match=accuracy_score(testing_dataframe_label["Species"], prediction)
species_daf = pd.DataFrame(columns=['Hamming Loss','Exact Match'])
species_daf.loc[0] = [hamming_loss_value,exact_match]
print("     Class: Species")
species_daf

      Class: Species


Unnamed: 0,Hamming Loss,Exact Match
0,0.064982,0.935018


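After balancing the classes with SMOTE, the Hamming loss decreased and the exact match increased for every label relative to the L1-penalized SVMs in 2(b)iii. The Hamming loss went from 0.0787 to 0.0632 for Family, from 0.0635 to 0.0546 for Genus, and from 0.0873 to 0.0650 for Species. Note that the SMOTE models above were fitted on the unnormalized training predictors and evaluated on the normalized testing predictors, so this comparison does not isolate the effect of balancing alone.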