Name: Alexis Jordan

# KNN Classifier

Here, we use a k-nearest neighbors (KNN) classifier to classify our respiratory sounds. We'll first fit a single KNN model, then use grid search to tune its hyperparameters.

In [8]:
import pandas as pd
import numpy as np
import librosa as lb

In [2]:
train = pd.read_csv('train.csv')
val = pd.read_csv('val.csv')

Let's grab our labels from the `disease` column.

In [3]:
y_train = train.disease
y_val = val.disease

In [5]:
y_train

0            COPD
1            COPD
2       Pneumonia
3       Pneumonia
4            COPD
          ...    
5513         COPD
5514         COPD
5515         COPD
5516         COPD
5517         COPD
Name: disease, Length: 5518, dtype: object

### Encoding our data

We'll use scikit-learn's `LabelEncoder` to encode our class names as integers.

In [7]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Fit the encoder on the training labels only, then reuse the same mapping for validation
y_train_enc = le.fit_transform(y_train)
y_val_enc = le.transform(y_val)
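
To make the encoding step concrete, here is a minimal sketch of how a `LabelEncoder` maps class names to integers and back. The toy labels below (including `Healthy`) are illustrative assumptions, not the dataset's actual class list.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the `disease` column (illustrative values only)
train_labels = ["COPD", "Pneumonia", "COPD", "Healthy"]
val_labels = ["Pneumonia", "COPD"]

le = LabelEncoder()
train_enc = le.fit_transform(train_labels)  # learn the name -> integer mapping
val_enc = le.transform(val_labels)          # reuse that mapping on new labels

classes = list(le.classes_)                    # class names in sorted order
decoded = list(le.inverse_transform(val_enc))  # integers back to class names
```

Because the classes are stored in sorted order, `le.inverse_transform` can be used later to turn integer predictions back into disease names.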

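Next, a function to grab our features. We'll be using MFCCs (Mel-frequency cepstral coefficients), which are popular in the papers. Recordings of different lengths produce MFCC matrices with different numbers of frames, so we take the mean of each matrix to give every recording a feature of the same size. Note that `np.mean` without an `axis` argument collapses the whole matrix to a single number per recording.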

In [9]:
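def getFeatures(path):
    """Load an audio file and return its MFCC matrix (n_mfcc x n_frames)."""
    # lb.load resamples to librosa's default sample rate unless sr is given
    soundArr, sample_rate = lb.load(path)
    mfcc = lb.feature.mfcc(y = soundArr, sr = sample_rate)

    return mfcc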
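
As a sketch of the two ways the MFCC matrix can be averaged, the example below uses a synthetic array in place of a real MFCC matrix. The shape (20 coefficients by 130 frames) and the random values are assumptions for illustration; real frame counts depend on the clip length.

```python
import numpy as np

# Synthetic stand-in for an MFCC matrix: 20 coefficients x 130 frames
rng = np.random.default_rng(0)
fake_mfcc = rng.normal(size=(20, 130))

# Mean over everything: one scalar per clip (the approach used in this notebook)
scalar_feature = np.mean(fake_mfcc)

# Mean over the time axis only: one value per coefficient, shape (20,)
vector_feature = np.mean(fake_mfcc, axis=1)
```

The per-coefficient version also has a fixed size regardless of clip length, and it keeps more information per recording. It is shown only as a possible variation; the results reported here come from the scalar version.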

In [10]:
root = 'processed_audio_files/'

mfcc = []

for idx, row in val.iterrows():
    path = root + row['filename']
    a = getFeatures(path)
    mfcc.append(np.mean(a))  # mean over all coefficients and frames: one scalar per clip
    
mfcc_val = np.array(mfcc)

Let's double-check that the number of feature values matches the number of labels.

In [11]:
print(len(mfcc_val))
print(len(y_val_enc))

1380
1380


In [12]:
root = 'processed_audio_files/'
mfcc = []

for idx, row in train.iterrows():
    path = root + row['filename']
    a = getFeatures(path)
    mfcc.append(np.mean(a))  # mean over all coefficients and frames: one scalar per clip
    
mfcc_train = np.array(mfcc)

In [13]:
print(len(mfcc_train))
print(len(y_train_enc))

5518
5518


### Reshaping

scikit-learn estimators expect a 2D array of shape `(n_samples, n_features)`. Each recording has a single feature here, so we reshape the 1D arrays into column vectors.

In [14]:
train_reshape = mfcc_train.reshape(-1, 1)
val_reshape = mfcc_val.reshape(-1, 1)
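
A small sketch of what `reshape(-1, 1)` does, using made-up feature values: the `-1` lets NumPy infer the number of rows, and the `1` fixes a single column.

```python
import numpy as np

# One scalar feature per clip (made-up values)
features = np.array([-12.5, -9.8, -11.1])

# Turn the 1D array into a column: 3 samples, 1 feature each
column = features.reshape(-1, 1)

shape_before = features.shape  # (3,)
shape_after = column.shape     # (3, 1)
```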

### Classifier

In [16]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 20)
knn.fit(train_reshape, y_train_enc)

y_pred = knn.predict(val_reshape)

print(y_pred)

[3 3 3 ... 3 3 3]


In [17]:
from sklearn import metrics

metrics.accuracy_score(y_val_enc, y_pred)

0.8318840579710145
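
The printed predictions above are truncated, but the visible entries are all class 3, so it may be worth comparing the accuracy against a majority-class baseline. The helper below is a hypothetical sketch (the function name and the toy labels are assumptions): it returns the accuracy a model would get by always predicting the most frequent class.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(y_true)        # occurrences of each integer label
    return counts.max() / counts.sum()  # share of the largest class

# Toy encoded labels: class 3 appears in 6 of 8 samples
y_true = np.array([3, 3, 3, 1, 3, 0, 3, 3])
baseline = majority_baseline_accuracy(y_true)
```

In this notebook the same helper could be called as `majority_baseline_accuracy(y_val_enc)`. If that value is close to the accuracy above, the classifier is adding little beyond predicting the dominant class.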

### Grid Search with KNN

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [20]:
ss = StandardScaler()

# Fit the scaler on the training data only and apply the same scaling to validation
train_scaled = ss.fit_transform(train_reshape)
val_scaled = ss.transform(val_reshape)

In [21]:
grid_params = {'n_neighbors': [25, 50, 100, 150, 250],
              'weights': ['uniform', 'distance'],
              'metric': ['euclidean', 'manhattan']}

model = GridSearchCV(knn, grid_params, cv = 5, n_jobs = -1)

model.fit(train_scaled, y_train_enc)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(n_neighbors=20), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan'],
                         'n_neighbors': [25, 50, 100, 150, 250],
                         'weights': ['uniform', 'distance']})

In [22]:
print(f'Model Score: {model.score(val_scaled, y_val_enc)}')
print(f'Best Parameters: {model.best_params_}')

Model Score: 0.8333333333333334
Best Parameters: {'metric': 'euclidean', 'n_neighbors': 100, 'weights': 'uniform'}
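
One possible refinement, sketched here on synthetic data, is to wrap the scaler and the classifier in a `Pipeline` so that scaling is refit inside each cross-validation fold. The data, step names (`scale`, `knn`), and parameter values below are assumptions chosen for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data: two classes centred at 0 and 4
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 60), rng.normal(4, 1, 60)]).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 60)

# The scaler is fit only on the training part of each CV split
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter>
params = {"knn__n_neighbors": [3, 5, 11], "knn__weights": ["uniform", "distance"]}

search = GridSearchCV(pipe, params, cv=5)
search.fit(X, y)

best = search.best_params_
cv_score = search.best_score_  # mean cross-validated accuracy of the best setting
```

With this layout, `search.score(X_val, y_val)` would scale the validation features using statistics learned from the training data, so no separate scaling step is needed.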


#### 83.333% accuracy on the validation set