# Progress Report
#### 1. Data in hand?

I have included some but not all of the features I want to include. There are leads I have yet to finish chasing, so the number of columns will likely increase over time.

#### 2. Is EDA done?

I am continuing to look at individual columns and how they correlate with the target. I look at both the Pearson and the Spearman correlation coefficients to help illuminate what potential usefulness a given feature may have for the final model. This is a constant work in progress.

#### 3. Any modeling? How are predictions performing?

AUC/ROC scores are rarely performing over 0.55. This is troublesome, but I have not given up hope that with the right columns and some grid searching, I can reach higher and higher scores. I have a soft goal of achieving a 0.70 AUC/ROC score.

#### 4. Obstacles? (Processing, acquisition, cleaning, model issues)

Processing power has been the primary limiting factor in my endeavors. I run lambda functions on a subset of the dataframe to get something that can be immediately plugged into a model. However, to look at correlation coefficients, I need a bigger subset or the entire dataframe to ensure there is an appropriate proportion of 1's and 0's in the target. Each column I attempt to add takes 5+ minutes to take shape, which makes my workflow disjointed and cumbersome.

#### 5. Has topic changed? Enough progress to move forward?

The topic has not yet changed, but if scores don't improve I have several ideas on how to pivot. I will cross that bridge if I get to it.

#### 6. Timeline for next week and a half? Have to's versus would like to's.

Utitlize AWS to overcome my processing power limitations. I can add a column, see if it helps, act upon that information, and then try another. I have several dozen potential columns that I truly believe will be helpful. I can later look into other signal transformations that may or may not have any usefulness at all. The stretch goal is to test every signal transformation method available in the library I use (there are around 46, I have looked through 12 so far). 

#### 7. Topics for 1:1

How to read and tweak ROC/AUC scores; how to select features when addition of a seemingly highly correlated column seems to majorly worsen the results. I kept only three features and got the best results so far in this notebook. 

Import statements

In [1]:
from pydub import AudioSegment
from glob import glob
from scipy.io import wavfile
from IPython.core.display import HTML
from IPython.display import Audio
import pandas as pd
import librosa
import numpy as np
import scipy.stats as sps
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

Assembling dataframe

In [2]:
moreau = pd.read_csv('assets/moreau/results.csv')

In [3]:
fado = pd.read_csv('assets/fado/results.csv')

In [4]:
jacques = sorted(glob('assets/au-suivant/*.wav'))
jacques = pd.DataFrame(jacques, columns=['filepath'])
jacques = jacques.join(pd.read_csv('assets/au-suivant/results.csv'))

In [5]:
marling = sorted(glob('assets/marling/*.wav'))
marling = pd.DataFrame(marling, columns=['filepath'])
marling = marling.join(pd.read_csv('assets/marling/results.csv'))

In [41]:
df = moreau.append(fado)
df = df.append(jacques)
df = df.append(marling)
df.reset_index(inplace=True, drop=True)

Columns that have shown to be helpful in classifying audio segments

In [43]:
# Greater than .10 correlation coefficient for both Pearson and Spearman
df['length'] = df['filepath'].apply(lambda x: len(librosa.core.load(x)[0]))
df['stft_min_skewness'] = df['filepath'].apply(lambda x: np.min(sps.describe(librosa.core.stft(librosa.core.load(x)[0])).skewness))
df['ifgram_min_corrcoef_variance'] = df['filepath'].apply(lambda x: np.min(np.corrcoef(sps.describe(librosa.core.ifgram(librosa.core.load(x)[0])).variance)))

In [44]:
# df.dropna(inplace=True)
x = df.drop(['filepath', 'classification'], axis=1)
y = df['classification']
x_test, x_train, y_test, y_train = train_test_split(x, y)
rfc = RandomForestClassifier(n_estimators=100, criterion='entropy')
rfc.fit(x_train, y_train)
pred_probs = rfc.predict_proba(x_test)
predictions = []
for x in range(len(pred_probs)):
    predictions.append(pred_probs[x][1])
print(rfc.score(x_train, y_train))
print(rfc.score(x_test, y_test))
print(roc_auc_score(y_test, predictions, average='samples'))
rfc.feature_importances_

  array = np.array(array, dtype=dtype, order=order, copy=copy)


1.0
0.919482686692
0.586615287826


  array = np.array(array, dtype=dtype, order=order, copy=copy)


array([ 0.34537499,  0.32649686,  0.32812815])