## 2.9 Feature engineering


Feature engineering is the process of manipulating, transforming, and selecting raw data into features that can be used in statistical analysis or prediction. Common families of features for time series include:

* statistical features
* temporal features
* spectral features (Fourier and Wavelet transforms)
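
To make these three families concrete, here is a minimal sketch that computes one feature of each kind by hand with NumPy. The synthetic 5 Hz sine wave, the 100 Hz sampling rate, and the variable names are assumptions chosen for illustration; they are not part of the seismic data set.

```python
import numpy as np

fs = 100.0                           # assumed sampling rate in Hz
t = np.arange(1000) / fs             # 10 s of samples
rng = np.random.default_rng(0)
# synthetic signal: 5 Hz sine wave plus weak Gaussian noise
x = np.sin(2 * np.pi * 5.0 * t) + 0.1 * rng.standard_normal(t.size)

# statistical feature: standard deviation of the amplitudes
feat_std = np.std(x)

# temporal feature: number of sign changes between consecutive samples
feat_zero_cross = int(np.sum(np.signbit(x[:-1]) != np.signbit(x[1:])))

# spectral feature: frequency of the largest Fourier amplitude (skipping the DC bin)
spec = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
feat_dom_freq = freqs[np.argmax(spec[1:]) + 1]

print(feat_std, feat_zero_cross, feat_dom_freq)
```

A feature-extraction package automates this kind of calculation over a long list of features.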

This lecture demonstrates how to automatically extract common time series features with ``tsfel``, a popular (but simple) Python package. We will take the example of seismic waveforms recorded in the Pacific Northwest. The Pacific Northwest data set contains labeled seismic waveforms for events of various origins: earthquakes, explosions (mostly quarry blasts), and surface events (usually avalanches and landslides), as well as seismic noise (ambient Earth vibrations in between events).

We will explore how these features vary among the four categories, or classes, of seismic events.

[Level 1]

In [None]:
# Import modules for seismic data and feature extraction

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import scipy.stats as st
import os
import h5py # for reading .h5 files

In [None]:
#make sure to install wget
!pip install wget
import wget

We download two files from the class storage: the waveforms themselves as an HDF5 file and their associated metadata as a CSV file:

In [None]:
wget.download("https://www.dropbox.com/s/f0e1ywupdbuv3l3/miniPNW_metadata.csv?dl=1")
wget.download("https://www.dropbox.com/s/0ffh4r23mitn2dz/miniPNW_waveforms.hdf5?dl=1")
# os.replace("miniPNW_metadata.csv","../../miniPNW_metadata.csv")
# os.replace("miniPNW_waveforms.hdf5","../../miniPNW_waveforms.hdf5")

## Metadata

We first read the metadata and arrange them into a pandas DataFrame.

In [None]:
# read the metadata into a DataFrame and preview the first rows
df = pd.read_csv("miniPNW_metadata.csv")
df.head()


The nature of the event source is stored in the ``source_type`` metadata attribute.

In [None]:
df['source_type'].unique()

Let's assume that we are exploring features to classify the waveforms into the categories of event types. We will use the ``source_type`` attribute as the **labels**.

In [None]:
labels=df['source_type']

How many seismic waveforms are there in each category?

In [None]:
##

Would you say that this is a balanced data set with respect to the four classes of interest?
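
One possible way to answer both questions is with ``value_counts`` from pandas. The sketch below uses a small toy Series in place of ``df['source_type']``; the class names and counts are hypothetical and do not reflect the actual data set.

```python
import pandas as pd

# Toy labels standing in for df['source_type'] (hypothetical values and counts)
toy_labels = pd.Series(
    ["earthquake"] * 6 + ["explosion"] * 2 + ["surface event"] * 1 + ["noise"] * 1,
    name="source_type",
)

counts = toy_labels.value_counts()                    # number of waveforms per class
fractions = toy_labels.value_counts(normalize=True)   # share of each class
imbalance = counts.max() / counts.min()               # largest class relative to smallest

print(counts)
print(fractions.round(2))
print("largest/smallest class ratio:", imbalance)
```

A ratio close to 1 suggests a balanced data set; a large ratio means that some classes dominate.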

A histogram of the labels shows the class counts at a glance. We will then look at the seismic data, taking a waveform from each of the categories.

In [None]:
plt.hist(labels)

Now we read the data. It is stored in an HDF5 file under a finite number of groups. Each group holds an array of datasets that correspond to the waveforms. To link the metadata to the waveform file, the ``trace_name`` key holds the dataset address, which is formatted as follows:
```
bucketX$i,:3,:n
```

where ``X`` is the HDF5 group number, ``i`` is the index of the entry within the group, and ``n`` is the number of time samples. Each entry typically has 3 waveforms, one for each direction of ground motion (N, E, Z). In the following exercise, we will focus on the vertical (Z) waveforms.
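
As an illustration of this address format, the sketch below splits a trace name into its parts. The helper ``parse_trace_name`` and the example string are hypothetical and are not taken from the data set.

```python
def parse_trace_name(trace_name):
    """Split 'bucketX$i,:3,:n' into (bucket, index, n_components, n_samples)."""
    bucket, address = trace_name.split("$")      # group name, then the array address
    index, n_components, n_samples = (int(s) for s in address.split(",:"))
    return bucket, index, n_components, n_samples

example_name = "bucket4$12,:3,:15001"            # hypothetical trace name
print(parse_trace_name(example_name))
```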

In [None]:
f = h5py.File("miniPNW_waveforms.hdf5", "r")

Below is a function that reads one waveform entry from the file.

In [None]:
def read_data(tn,f):
    bucket, narray = tn.split('$')  # split trace_name into the bucket name and the array address
    x, y, z = [int(i) for i in narray.split(',:')]  # entry index, number of components, number of samples
    data = f['/data/%s' % bucket][x, :y, :z]  # 2D array of shape (components, samples)
    return data

The trace name is stored as an attribute in the metadata.

In [None]:
ldata=list(df['trace_name'])

In [None]:
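example=read_data(ldata[400],f)  # read one example entry
print(example.shape)  # (components, samples)
plt.plot(example[1,:])  # plot the component at index 1

We will extract just the Z component of every waveform and stack them into a single 2D array.

In [None]:
nt=example.shape[-1]  # number of time samples per waveform
ndata=len(labels)
Z=np.zeros(shape=(ndata,nt))
for i in range(ndata):  # loop over all waveforms, including the last one
    Z[i,:]=read_data(df.iloc[i]["trace_name"],f)[2,:nt]  # index 2 is the Z component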

Now we have the data and their attributes, in particular the source type as the label.

We are going to extract features automatically with tsfel and explore how they vary among the four classes.

In [None]:
!pip install tsfel
import tsfel

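tsfel needs a configuration dictionary that lists the features to compute. Calling ``tsfel.get_features_by_domain()`` without arguments selects the features from all available domains. The extractor then takes this configuration, the waveform as an array, and the sampling frequency ``fs``.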

In [None]:
cfg = tsfel.get_features_by_domain()     

In [None]:
Z.shape

In [None]:
frames=[]
for i in range(100):  # use range(Z.shape[0]) to process all waveforms
    print(i/Z.shape[0])  # progress as a fraction of the full data set
    # extract all configured features from waveform i, sampled at 100 Hz
    XX=tsfel.time_series_features_extractor(cfg, Z[i,:],fs=100.)
    XX['source_type']=df.iloc[i]['source_type']  # attach the label
    frames.append(XX)
# concatenate once at the end so that row i corresponds to waveform i
X=pd.concat(frames,axis=0,ignore_index=True)


In [None]:
X.head()

## Exploring the feature space

Here we will plot distributions of features among the four classes. We will also explore which features are most correlated with each other in each of the categories.


In [None]:
plt.hist(np.log10(X['0_Wavelet variance_1']));

In [None]:
plt.matshow(X.drop('source_type',axis=1).corr(),cmap='seismic',vmin=-1,vmax=1)
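
The correlation matrix can also be turned into a ranked list of feature pairs. The sketch below does this on a toy table; the feature names ``feat_a``, ``feat_b``, and ``feat_c`` are hypothetical stand-ins for the tsfel columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
# toy feature table (hypothetical columns)
toy = pd.DataFrame({
    "feat_a": a,
    "feat_b": 2.0 * a + 0.01 * rng.normal(size=200),  # nearly a scaled copy of feat_a
    "feat_c": rng.normal(size=200),                   # independent of the others
})

corr = toy.corr()
# keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# sort pairs by absolute correlation, strongest first
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```

Strongly correlated pairs carry largely redundant information, so one feature of each pair may be a candidate for removal.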

Now calculate features for 100 events of each category and find features that differ between categories.

You may 1) calculate all features, 2) plot histograms/distributions of each feature, overlaying the source-specific distributions, 3) report on the features that look different between classes and propose a workflow to classify event types.
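
As a starting point for step 3, here is one possible way to rank features by how much they differ between classes, using the Kruskal-Wallis test from ``scipy.stats``. This is only a sketch of one option, not the prescribed solution; the toy table, its column names, and the helper ``rank_features`` are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(0)
n = 100
# toy feature table with two classes (hypothetical features and labels)
toy = pd.DataFrame({
    "feat_a": np.concatenate([rng.normal(0, 1, n), rng.normal(3, 1, n)]),  # shifted between classes
    "feat_b": rng.normal(0, 1, 2 * n),                                     # same in both classes
    "source_type": ["earthquake"] * n + ["noise"] * n,
})

def rank_features(table, label_col="source_type"):
    """Return the Kruskal-Wallis H statistic of each feature, largest first."""
    groups = table.groupby(label_col)
    scores = {}
    for col in table.columns.drop(label_col):
        samples = [g[col].to_numpy() for _, g in groups]  # one sample per class
        scores[col] = st.kruskal(*samples).statistic
    return pd.Series(scores).sort_values(ascending=False)

ranking = rank_features(toy)
print(ranking)
```

Features at the top of the ranking are the first ones worth plotting as overlaid histograms.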