## Data Preparation

### 0. Description
* MeasurementDataReader queries the source directory
    * in MeasurementDataReader.query(), set query_type to Measurement or Action
    * with query_type=Measurement the query returns temporally consecutive data (containing multiple actions)
    * with query_type=Action the query returns data for one specific action only
* The query yields a data_dict object (defaultdict)
    * This can be filtered, e.g. by tool
* The data_dict holds one entry per measurement campaign (e.g. 01, 02, 03, 04 for electric_screwdriver)
* Each entry of the data_dict holds the observations of all sensors
* Each sensor has its own timesteps and labels. Sensors may have different numbers of observations
* In this example the data_dict contains 4x4 entries: one per measurement campaign (01, 02, 03, 04), each holding data for 4 sensors.
* datatools.to_ts_data takes the data_dict and returns a list Xt, containing one numpy array for each of the 4x4 entries of the data_dict, together with the contextual data Xc and the corresponding labels y.
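
The following sketch illustrates why Xt is a list of arrays rather than a single array: when every sensor has its own number of observations, the entries cannot be stacked. All names, sizes and values here are hypothetical dummy data and do not reproduce the output of datatools.to_ts_data.

```python
import numpy as np

# Hypothetical layout: 4 campaigns x 4 sensors, each with its own length.
rng = np.random.default_rng(0)
campaigns = ["01", "02", "03", "04"]
sensors = ["sensor_a", "sensor_b", "sensor_c", "sensor_d"]  # placeholder names

Xt, y = [], []
for campaign in campaigns:
    for sensor in sensors:
        n_obs = int(rng.integers(50, 200))         # differs per entry
        Xt.append(rng.normal(size=(n_obs, 3)))     # one array per (campaign, sensor)
        y.append(rng.integers(0, 4, size=n_obs))   # one label per observation

print(len(Xt))                        # number of entries
print(sorted({x.shape[0] for x in Xt})[:3])  # a few of the distinct lengths
```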

* Windows of fixed length (w.r.t. time) can be created with fhgutils.Segment()

* Note: Both fhgutils and seglearn provide a Segment class
    * fhgutils.Segment segments with a fixed window size (fixed duration in time)
    * seglearn.transform.Segment segments with a fixed number of elements per window
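
The difference between the two segmentation modes can be sketched with plain NumPy. This is only an illustration under assumed, slightly irregular timestamps; it does not use or reproduce the implementation of fhgutils.Segment or seglearn.transform.Segment.

```python
import numpy as np

# Hypothetical, slightly irregular timestamps in seconds (not from the dataset).
rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(0.008, 0.012, size=500))
t -= t[0]  # start at 0 s

# Fixed number of elements: every window holds exactly `width` samples,
# but the time span it covers varies with the sampling jitter.
width = 20
n_count = len(t) // width
count_windows = [t[i * width:(i + 1) * width] for i in range(n_count)]

# Fixed duration: every window covers `window_length` seconds,
# but the number of samples per window may vary.
window_length = 0.2
edges = np.arange(0.0, t[-1], window_length)
starts = np.searchsorted(t, edges, side="left")
stops = np.searchsorted(t, edges + window_length, side="left")
time_windows = [t[a:b] for a, b in zip(starts, stops)]

print({len(w) for w in count_windows})          # one size only
print(sorted({len(w) for w in time_windows}))   # sizes may differ
```

With sensors that sample at different rates, a fixed duration keeps windows aligned in time across sensors, whereas a fixed element count keeps array shapes identical.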

### 1. Loading
* load data for each tool separately (if more than one is wanted)
* load data by Measurement (not by Action)
    * this gives a time series containing different actions
* result is a DataBunch of multiple measurements

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
from seglearn.base import TS_Data
from seglearn.pipe import Pype
from seglearn.transform import Segment as SegmentFixedNum

from datatools import (Tool, Config, MeasurementSeries, Measurement, DataTypes, Action,
                       MeasurementDataReader, to_ts_data)
from datatools import ACC, GYR, MAG, MIC, POS, VEL
from datatools.fixture import setup_dummy_data
from fhgutils import Segment, contextual_recarray_dtype, filter_ts_data

# root directory of the tool-tracking dataset
source = "./tool-tracking-data/"

mdr = MeasurementDataReader(source=source)

# query whole measurements (temporally consecutive data with multiple actions)
q = mdr.query(query_type=Measurement)

# choose one tool; uncomment another line to switch
mytool = "electric_screwdriver"
#mytool = "pneumatic_screwdriver"
#mytool = "pneumatic_rivet_gun"
#mytool = "test_tool"

# keep only accelerometer and gyroscope data of the selected tool
data_dict = q.filter_by(Tool == mytool, DataTypes == [ACC, GYR]).get()

[INFO] Preparing data from:
  tool-tracking-data\electric_screwdriver\pythagoras-10-20200716
  tool-tracking-data\electric_screwdriver\pythagoras-10-20200716
  tool-tracking-data\electric_screwdriver\pythagoras-10-20200716
  tool-tracking-data\electric_screwdriver\pythagoras-10-20200716
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\ACC-01-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\ACC-02-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\ACC-03-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\ACC-04-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\GYR-01-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\GYR-02-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\GYR-03-102.291.csv) and from the timestamps differ by 0.001Hz
[WARN] The mean sampling rate from the filename (tool-tracking-data\electric_screwdriver\pythagoras-10-20200716\GYR-04-102.291.csv) and from the timestamps differ by 0.001Hz
[INFO] Finished with 4 measurement(s).


### 2. Processing
* select desired measurements from DataBunch
* convert to numpy arrays with datatools.to_ts_data
* create time windows
* filter data
    * remove windows with label -1
    * remove windows with conflicting labels
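
The two filter rules can be sketched with NumPy on per-sample window labels. The notebook imports filter_ts_data from fhgutils, but its signature is not shown here, so this is a stand-alone illustration on hypothetical data and not that function's implementation. It assumes -1 marks unlabeled samples.

```python
import numpy as np

# Hypothetical per-sample labels for 5 windows of 4 samples each.
y_windows = np.array([
    [2, 2, 2, 2],      # consistent         -> keep
    [2, 2, 3, 3],      # conflicting labels -> drop
    [-1, -1, -1, -1],  # unlabeled          -> drop
    [3, 3, 3, 3],      # consistent         -> keep
    [3, -1, 3, 3],     # contains -1        -> drop
])
X_windows = np.arange(5 * 4 * 2).reshape(5, 4, 2)  # dummy features

# A window is unlabeled if any sample carries -1.
has_unlabeled = (y_windows == -1).any(axis=1)
# A window is consistent if every sample equals the first sample's label.
is_consistent = (y_windows == y_windows[:, :1]).all(axis=1)
keep = is_consistent & ~has_unlabeled

X_clean = X_windows[keep]
y_clean = y_windows[keep, 0]  # one label per remaining window

print(y_clean)  # [2 3]
```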

In [2]:
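# merge accelerometer and gyroscope data of each measurement into one frame
measurement_data = []
for v in data_dict.values():
    data = pd.concat([pd.DataFrame(v.acc), pd.DataFrame(v.gyr)], axis=1)
    # drop repeated columns shared by both sensors (such as label)
    data = data.loc[:, ~data.columns.duplicated()]
    measurement_data.append(data)
# stack all measurements, then split into labels y and features X
df = pd.concat(measurement_data).reset_index(drop=True)
y = df.label.to_numpy()
X = df.loc[:, df.columns != "label"].to_numpy()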
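
A small sketch of what the column-wise concat and the duplicate-column filter do, using hypothetical column names and values (the real sensor frames may look different). Note that pd.concat with axis=1 aligns rows on the index, not on a time column, so sensors with different numbers of observations leave NaN rows.

```python
import pandas as pd

# Hypothetical sensor frames; gyr has one observation fewer than acc.
acc = pd.DataFrame({"time": [0.0, 0.1, 0.2],
                    "acc_x": [1.0, 1.1, 1.2],
                    "label": [2, 2, 3]})
gyr = pd.DataFrame({"time": [0.0, 0.1],
                    "gyr_x": [0.5, 0.6],
                    "label": [2, 2]})

data = pd.concat([acc, gyr], axis=1)             # rows aligned by index
data = data.loc[:, ~data.columns.duplicated()]   # keep first "time" and "label"

print(data.columns.tolist())       # ['time', 'acc_x', 'label', 'gyr_x']
print(data["gyr_x"].isna().sum())  # 1
```

If the sensors are not sampled on the same time grid, merging on timestamps (or resampling first) may be more appropriate than a positional concat.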

In [26]:
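obs_per_window = 100  # samples per window (fixed-count segmentation)
window_length = 0.2   # window duration in s (only used by the time-based variant)
overlap = 0.5         # fraction of overlap between consecutive windows

pipe = Pype([('segment', SegmentFixedNum(width=obs_per_window, overlap=overlap))])
# time-based alternative:
#pipe = Pype([('segment', Segment(window_length=window_length, overlap=overlap, enforce_size=True))])

# quick test on the first 1000 samples only
X_trans, y_trans = pipe.fit_transform(X[:1000], y[:1000])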
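
To make the effect of width and overlap concrete, the window count for 1000 samples can be reproduced with NumPy alone. This is an illustration on dummy data, not seglearn's implementation; it assumes the step between window starts is width * (1 - overlap).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X_demo = np.arange(1000 * 3).reshape(1000, 3)  # dummy data: 1000 samples, 3 channels

width = 100
overlap = 0.5
step = int(width * (1 - overlap))  # 50 samples between window starts

# sliding_window_view appends the window axis last; take every `step`-th window
windows = sliding_window_view(X_demo, width, axis=0)[::step]
windows = windows.transpose(0, 2, 1)  # -> (n_windows, width, n_channels)

print(windows.shape)  # (19, 100, 3)
```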

In [33]:
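def window_per_obs(X, obs_per_window=100):
    # one possible completion of the stub: non-overlapping windows of
    # obs_per_window rows each; leftover rows at the end are dropped
    n_windows = X.shape[0] // obs_per_window
    return X[:n_windows * obs_per_window].reshape(n_windows, obs_per_window, X.shape[1])

a = np.array([[1,2],[2,3],[3,4],[4,5]])
# np.newaxis adds a leading axis: shape (4, 2) -> (1, 4, 2)
a[np.newaxis, :, :]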

array([[[1, 2],
        [2, 3],
        [3, 4],
        [4, 5]]])

In [None]:
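# convert data_dict to time series data (Xt), contextual data (Xc) and labels (y)
Xt, Xc, y = to_ts_data(data_dict, contextual_recarray_dtype)

# bundle time series and contextual data for seglearn
X = TS_Data(Xt, Xc)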

In [None]:
data_dict.get("01")

In [None]:
# Segment with fixed window size (w.r.t time)

window_length = 0.2 # unit in s
overlap = 0.5 # fraction of the window (0.5 = 50 %)

pipe = Pype([('segment', Segment(window_length=window_length, overlap=overlap, enforce_size=True, n=len(np.unique(Xc.desc))))])

X_trans, y_trans = pipe.fit_transform(X, y)

In [None]:
# Segment with fixed number of observations per window

obs_per_window = 100
overlap = 0.5

pipe = Pype([('segment', SegmentFixedNum(width=obs_per_window, overlap=overlap))])

X_trans, y_trans = pipe.fit_transform(X,y)

In [None]:
print(len(X_trans))
for i, b in enumerate(X_trans):
    print(i)
    print(b.shape, b.index)
    print(b.context_data.shape)
    print(b.context_data)
    print(b.ts_data.shape)
    print(b.ts_data)
    break