# Machine Learning with Python

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 4.1 Working with time series

Time series data is common in many areas of research, but it poses some special challenges for machine learning.

Our first example dataset is derived from a collection of audio (WAV) files of human heartbeats, recorded on a digital stethoscope.

In [None]:
heart = pd.read_json('heartbeat.json')

In [None]:
heart.head()

Each audio clip has been resampled to 1000Hz and converted to a list of floats - these are held in the `audio` column of the dataframe. The length of each list is therefore the recording length in milliseconds.

Note that the recordings are of different lengths, so it is not convenient to store each timepoint in its own column of the dataframe. Pandas can handle columns containing nested datatypes such as lists, but these are not easily stored in a CSV file - hence we are using JSON format in this instance.
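The difference between the two formats can be seen on a small example. This is only a sketch: the two-row `demo` dataframe is made-up toy data, not part of the heartbeat dataset.

```python
import io
import pandas as pd

# Hypothetical toy data: two "clips" of different lengths
demo = pd.DataFrame({
    'label': ['normal', 'murmur'],
    'audio': [[0.1, -0.2, 0.3], [0.0, 0.5]],
})

# Round trip through JSON: each entry comes back as a Python list
json_text = demo.to_json(orient='records')
from_json = pd.read_json(io.StringIO(json_text), orient='records')
print(type(from_json['audio'].iloc[0]))

# Round trip through CSV: each entry comes back as a string such as "[0.1, -0.2, 0.3]"
csv_text = demo.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))
print(type(from_csv['audio'].iloc[0]))
```

After the CSV round trip, the lists would have to be parsed back from text before any numerical work could be done.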

We can easily plot each waveform:

In [None]:
row = heart.iloc[0]

fig, ax = plt.subplots()
ax.plot( np.arange(0,row['length']), row['audio'] )
ax.set(xlabel='Time (ms)', ylabel='Sound Amplitude')
plt.show()

### Classification

There are three labels on the recordings: a normal heartbeat, or one of two abnormal sounds (murmur and extrasystole).

In [None]:
heart['label'].value_counts()

To attempt classification, we will need to define some features.

We can often make a good start using simple summary statistics.

To begin, we will convert the audio column to use numpy arrays.

In [None]:
heart['audio'] = [ np.array(audio) for audio in heart['audio'] ]

We also shuffle the rows.

In [None]:
heart = heart.sample(frac=1, random_state=1)
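Shuffling matters here because scikit-learn's default cross-validation splitter for classifiers does not shuffle the rows itself, so any ordering in the file would carry through to the folds. The sketch below uses a made-up five-row dataframe to show what `sample(frac=1)` does: every row is kept exactly once, in a new order, and each row keeps its original index label.

```python
import pandas as pd

# Hypothetical toy data, ordered by label
demo = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'c']})

# frac=1 samples 100% of the rows without replacement, i.e. a permutation
shuffled = demo.sample(frac=1, random_state=1)

print(shuffled.index.tolist())   # original index labels, in shuffled order
print(len(shuffled) == len(demo))
```

If a fresh 0..n-1 index is preferred afterwards, `shuffled.reset_index(drop=True)` provides one.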

Here are a few simple summary stats for the audio data.

In [None]:
heart['MEAN'] = heart['audio'].apply(np.mean)
heart['MIN'] = heart['audio'].apply(np.min)
heart['MAX'] = heart['audio'].apply(np.max)
heart['STD'] = heart['audio'].apply(np.std)

In [None]:
heart.head()

Defining the features and target:

In [None]:
X = heart[['MEAN','MIN','MAX','STD']]
y = heart['label'] == 'normal'  # for a two-class problem

In [None]:
# create a scatter matrix from the features, colored by the target y
pd.plotting.scatter_matrix(X, c=y, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8)

plt.show()

Some of the abnormal heartbeats may be separable in these plots, but with these features we can't do a lot better than guessing.

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

# a 5-fold cross-validation, scored using AUC
score = cross_val_score( rfc,X,y,cv=5,scoring='roc_auc' )
print("AUC scores:", score)
print("mean:", score.mean())
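To judge these scores, it helps to know what "guessing" looks like on the AUC scale. The following is a minimal sketch on synthetic data (the features `X_demo` and labels `y_demo` are invented for illustration): a `DummyClassifier` that ignores the features and always predicts the class prior gives every sample the same score, which corresponds to an AUC of 0.5.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))   # synthetic features, unrelated to the labels
y_demo = np.arange(100) % 2 == 0     # synthetic, balanced boolean target

# a baseline that predicts the same class probabilities for every sample
baseline = DummyClassifier(strategy='prior')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring='roc_auc')
print("baseline AUC scores:", baseline_scores)
```

An AUC close to 0.5 therefore indicates little or no discriminative power, while 1.0 would be a perfect ranking of the two classes.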

How can we obtain more meaningful features?

### Features from smoothed data

The *envelope* captures more relevant information by local averaging.
It is a kind of smoothing operation, which uses a rolling window.

We start with the *rectified* waveform (the absolute value):

In [None]:
audio = heart.iloc[0]['audio']
rectified = np.abs(audio)

In [None]:
fig, ax = plt.subplots()
ax.plot( np.arange(0,len(rectified)), rectified )
ax.set(xlabel='Time (ms)', ylabel='Sound Amplitude')
plt.show()

Then we apply a sliding window of a fixed size (here 50 samples, which is 50ms at 1000Hz) and calculate the mean within each window.

In [None]:
from numpy.lib.stride_tricks import sliding_window_view
v = sliding_window_view(rectified,50)
smoothed = v.mean(axis=-1)
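To make the windowing step concrete, here is a small sketch on a five-element toy array with a window of 3 (both chosen only for illustration). Each row of the view is one window, so the smoothed series is shorter than the input by `window - 1` samples. The same rolling mean can also be obtained with `np.convolve`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy input

windows = sliding_window_view(x, 3)        # rows: [1,2,3], [2,3,4], [3,4,5]
rolling_mean = windows.mean(axis=-1)       # one mean per window

# equivalent result using a convolution with a flat kernel
rolling_mean_conv = np.convolve(x, np.ones(3) / 3, mode='valid')

print(windows.shape)        # (3, 3)
print(rolling_mean)         # [2. 3. 4.]
```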

In [None]:
fig, ax = plt.subplots()
ax.plot( np.arange(0,len(smoothed)), smoothed )
ax.set(xlabel='Time (ms)', ylabel='Sound Amplitude')
plt.show()

We can then extract summary statistics for the smoothed envelope - these are likely to be more informative than those for the original noisy data.

In [None]:
def envelope(x) :
  rectified = np.abs(x)
  v = sliding_window_view(rectified,50)
  return v.mean(axis=-1)

In [None]:
heart['env'] = heart['audio'].apply(envelope)

In [None]:
heart['env_MEAN'] = heart['env'].apply(np.mean)
heart['env_MIN'] = heart['env'].apply(np.min)
heart['env_MAX'] = heart['env'].apply(np.max)
heart['env_STD'] = heart['env'].apply(np.std)

In [None]:
heart.head()

In [None]:
X2 = heart[['env_MEAN','env_MIN','env_MAX','env_STD']]

In [None]:
# create a scatter matrix from the features, colored by the target y
pd.plotting.scatter_matrix(X2, c=y, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8)

plt.show()

In [None]:
# a 5-fold cross-validation, scored using AUC
score = cross_val_score( rfc,X2,y,cv=5,scoring='roc_auc' )
print("AUC scores:", score)
print("mean:", score.mean())

### Spectral features

For audio data, as well as many other time series, the frequency spectrum carries a lot of important information.

Many analysis techniques make use of the Fourier transform of a time series. Here we use `np.fft.rfft`, which computes the transform of a real-valued signal.

In [None]:
audio = heart.iloc[0]['audio']
ft = np.fft.rfft(audio)

In [None]:
fig, ax = plt.subplots()
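# plot the magnitude of each complex coefficient against its frequency in Hz
ax.plot( np.fft.rfftfreq(len(audio), d=1/1000), np.abs(ft) )
ax.set(xlabel='Frequency (Hz)', ylabel='Amplitude')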
plt.show()
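To see how the output of `np.fft.rfft` relates to frequency, here is a minimal sketch on a synthetic signal rather than a heartbeat: one second of a pure 50Hz tone sampled at 1000Hz. The tone and its frequency are assumptions chosen so that the answer is known in advance. `np.fft.rfftfreq` gives the frequency in Hz of each output bin, and the magnitude of the complex coefficients peaks at the tone's frequency.

```python
import numpy as np

fs = 1000                                # sampling rate in Hz, as in the text
t = np.arange(fs) / fs                   # one second of sample times
tone = np.sin(2 * np.pi * 50 * t)        # synthetic 50Hz sine wave

spectrum = np.fft.rfft(tone)                   # complex coefficients
freqs = np.fft.rfftfreq(len(tone), d=1 / fs)   # frequency of each bin in Hz
magnitude = np.abs(spectrum)                   # amplitude at each frequency

peak_hz = freqs[np.argmax(magnitude)]
print(len(spectrum), peak_hz)
```

For a real input of length N, `rfft` returns N//2 + 1 coefficients covering frequencies from 0 up to half the sampling rate.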

The centroid of the frequency spectrum is one useful summary - this is the mean frequency, weighted by the amplitude at each frequency:

In [None]:
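# weight each frequency bin by the magnitude of its complex Fourier coefficient
mag = np.abs(ft)
sfx = np.sum( mag * np.arange(0,len(ft)) )
sx = np.sum( mag )
sfx / sx  # centroid, in units of frequency bins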
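As a sanity check on what the centroid measures, here is a sketch on a synthetic signal (not the heartbeat data) made of two equally loud tones at 50Hz and 150Hz. It assumes the common convention of using the magnitude of each Fourier coefficient as the weight, and it expresses frequencies in Hz rather than bin numbers. The centroid should then fall midway between the two tones.

```python
import numpy as np

fs = 1000                                 # sampling rate in Hz
t = np.arange(fs) / fs                    # one second of sample times
two_tones = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 150 * t)

magnitude = np.abs(np.fft.rfft(two_tones))          # weights
freqs = np.fft.rfftfreq(len(two_tones), d=1 / fs)   # bin frequencies in Hz

# weighted mean of frequency
centroid_hz = np.sum(freqs * magnitude) / np.sum(magnitude)
print(centroid_hz)   # approximately 100
```

A signal with more energy at high frequencies has a higher centroid, which is why it is sometimes described as a measure of how "bright" a sound is.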

We can track the centroid over a rolling window (once again using a window of size 50ms):

In [None]:
def centroid(x) :
  v = sliding_window_view(x,50)
  # one FFT per window; use the magnitude of the complex coefficients as weights
  mag = np.abs(np.fft.rfft(v))
  sfx = np.sum( mag * np.arange(0,mag.shape[1]), axis=-1 )
  sx = np.sum( mag, axis=-1 )
  return sfx/sx
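The array shapes involved in the rolling version can be checked on a stand-in signal. In this sketch the 200-sample random `signal` is invented for illustration. `np.fft.rfft` works along the last axis by default, so it transforms each 50-sample window separately, and with a 1000Hz sampling rate each frequency bin of a 50-sample window is 20Hz wide.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
signal = rng.normal(size=200)               # synthetic stand-in for one clip

windows = sliding_window_view(signal, 50)   # one row per window position
spectra = np.fft.rfft(windows)              # FFT of each row (last axis)
print(windows.shape, spectra.shape)         # (151, 50) (151, 26)

# frequency in Hz of each of the 26 bins, assuming 1000Hz sampling
bin_hz = np.fft.rfftfreq(50, d=1 / 1000)
print(bin_hz[:3])
```

A centroid expressed as a bin number can therefore be converted to Hz by multiplying by the bin width.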

In [None]:
ct = centroid(audio)
ct

In [None]:
fig, ax = plt.subplots()
ax.plot( np.arange(0,len(ct)), ct )
ax.set(xlabel='Time (ms)', ylabel='Spectral centroid (frequency bin)')
plt.show()

As before, we can extract summary statistics from this series.

In [None]:
heart['centroid'] = heart['audio'].apply(centroid)

In [None]:
heart['centroid_MEAN'] = heart['centroid'].apply(np.mean)
heart['centroid_MIN'] = heart['centroid'].apply(np.min)
heart['centroid_MAX'] = heart['centroid'].apply(np.max)
heart['centroid_STD'] = heart['centroid'].apply(np.std)

In [None]:
X3 = heart[['centroid_MEAN','centroid_MIN','centroid_MAX','centroid_STD']]

In [None]:
# create a scatter matrix from the features, colored by the target y
pd.plotting.scatter_matrix(X3, c=y, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8)

plt.show()

In [None]:
# a 5-fold cross-validation, scored using AUC
score = cross_val_score( rfc,X3,y,cv=5,scoring='roc_auc' )
print("AUC scores:", score)
print("mean:", score.mean())

By combining all three sets of features, our classifier starts to look more useful:

In [None]:
X4 = pd.concat([X,X2,X3],axis=1)
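Note that `pd.concat` with `axis=1` matches rows by index label, not by position. That is safe here because `X`, `X2` and `X3` are all taken from the same shuffled dataframe and so share an index. The sketch below uses two made-up frames whose rows are in different orders to show the alignment.

```python
import pandas as pd

# Hypothetical toy frames with the same index labels in different orders
a = pd.DataFrame({'MEAN': [0.1, 0.2, 0.3]}, index=[2, 0, 1])
b = pd.DataFrame({'env_MEAN': [1.0, 2.0, 3.0]}, index=[0, 1, 2])

combined = pd.concat([a, b], axis=1)

# the row labelled 2 takes its values from label 2 in both frames
print(combined.loc[2])
```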

In [None]:
X4.head()

In [None]:
# a 5-fold cross-validation, scored using AUC
score = cross_val_score( rfc,X4,y,cv=5,scoring='roc_auc' )
print("AUC scores:", score)
print("mean:", score.mean())

### Exercise

We have already reduced the high-frequency noise using the smoothed envelopes. Try applying the FFT to these series to obtain a further set of features. Does this help with classification?

Have a look at some examples of the two abnormal heartbeat patterns and compare to the normal ones. Can you come up with any additional summary features that might be informative in classification?