# Setup

Get the Kaggle Dataset used in this demo [here](https://www.kaggle.com/c/career-con-2019/data) if you haven't done so yet.

In [1]:
import os

import pandas as pd
import numpy as np

from IPython.display import display

In [2]:
DATA_PATH = os.path.join("data", "career-con-2019")

In [3]:
X_train = pd.read_csv(os.path.join(DATA_PATH, "X_train.csv"))
X_train.head()

Unnamed: 0,row_id,series_id,measurement_number,orientation_X,orientation_Y,orientation_Z,orientation_W,angular_velocity_X,angular_velocity_Y,angular_velocity_Z,linear_acceleration_X,linear_acceleration_Y,linear_acceleration_Z
0,0_0,0,0,-0.75853,-0.63435,-0.10488,-0.10597,0.10765,0.017561,0.000767,-0.74857,2.103,-9.7532
1,0_1,0,1,-0.75853,-0.63434,-0.1049,-0.106,0.067851,0.029939,0.003386,0.33995,1.5064,-9.4128
2,0_2,0,2,-0.75853,-0.63435,-0.10492,-0.10597,0.007275,0.028934,-0.005978,-0.26429,1.5922,-8.7267
3,0_3,0,3,-0.75852,-0.63436,-0.10495,-0.10597,-0.013053,0.019448,-0.008974,0.42684,1.0993,-10.096
4,0_4,0,4,-0.75852,-0.63435,-0.10495,-0.10596,0.005135,0.007652,0.005245,-0.50969,1.4689,-10.441


# Create Sample Time Series Data to use as Demo

For this notebook, I will use only a small sample of the data to demonstrate labeling with Label Studio. The aim is to show how to label time series and use the labels, in case you have your own time series data.

In [4]:
# create a "sample_data" folder if it does not exist
os.makedirs("sample_data", exist_ok=True)
# define the path to the sample time series CSV file
SAMPLE_TS_PATH = os.path.join("sample_data", "sample_time_series.csv")

In [5]:
# ts: time series
sample_ts = X_train.copy()          # make a copy
sample_ts = sample_ts.iloc[:100]    # take only the first 100 rows for demo

In [6]:
# NOTE: This column is not needed if you do not want to display a "time" column in the Label Studio labeling interface.
# Replace the index with a sample datetime range at a frequency of one second (`freq='s'`);
# you may change this to minutes if you prefer.
# Name the index "time" so that it is easy to reference in the Label Studio labeling interface later.
sample_ts.index = pd.date_range("2021-08-01", periods=len(sample_ts), freq='s', name="time")

In [7]:
# map the original column names to the new names
column_dict = {
    "angular_velocity_X": "velocity",
    "linear_acceleration_X": "acceleration",
}
# rename these columns to "velocity" and "acceleration" for easier reference in Label Studio later
sample_ts.rename(columns=column_dict, inplace=True)

In [8]:
# use only the necessary columns for our example
sample_ts = sample_ts[['series_id', 'measurement_number', 'velocity', 'acceleration']]

NOTE: You MUST pass the `date_format` parameter to `to_csv` when saving the CSV file, and the format string must be EXACTLY the same as the one used in the "Code" panel of the Label Studio labeling interface (shown later). Otherwise, `to_csv` falls back to its default `datetime` formatting, which may not match the format you want. You can verify this yourself by opening the CSV file in Microsoft Excel and comparing the results with and without `date_format`.

In [9]:
sample_ts.to_csv(SAMPLE_TS_PATH, date_format="%Y-%m-%d %H:%M:%S")
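
To make the role of `date_format` concrete, here is a minimal sketch that writes a tiny stand-in DataFrame to an in-memory buffer and parses the timestamp back with the same format string. The `demo` frame, its values, and the `TIME_FORMAT` name are assumptions for illustration only; they are not part of the dataset.

```python
import io

import pandas as pd

# Format string shared by `to_csv` and the Label Studio `timeFormat` attribute
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Tiny stand-in for `sample_ts` (hypothetical values)
demo = pd.DataFrame(
    {"velocity": [0.1, 0.2, 0.3]},
    index=pd.date_range("2021-08-01", periods=3, freq="s", name="time"),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, date_format=TIME_FORMAT)
lines = buffer.getvalue().splitlines()

# In the first data row, the timestamp is the text before the first comma
first_timestamp = lines[1].split(",")[0]
print(first_timestamp)

# Parsing with the same format string recovers the original timestamp
parsed = pd.to_datetime(first_timestamp, format=TIME_FORMAT)
print(parsed == demo.index[0])
```

If parsing raises an error, the format string and the saved data disagree, which is the mismatch the note above warns about.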

In [10]:
# test loading back
sample_ts = pd.read_csv(SAMPLE_TS_PATH, index_col=0)
sample_ts

time,series_id,measurement_number,velocity,acceleration
2021-08-01 00:00:00,0,0,0.107650,-0.74857
2021-08-01 00:00:01,0,1,0.067851,0.33995
2021-08-01 00:00:02,0,2,0.007275,-0.26429
2021-08-01 00:00:03,0,3,-0.013053,0.42684
2021-08-01 00:00:04,0,4,0.005135,-0.50969
...,...,...,...,...
2021-08-01 00:01:35,0,95,-0.032556,0.88128
2021-08-01 00:01:36,0,96,-0.005574,2.13840
2021-08-01 00:01:37,0,97,0.009161,1.52220
2021-08-01 00:01:38,0,98,0.010391,1.76180


We will upload this time series data **TWICE** into Label Studio to simulate that we have two different sequences. It is better to label one sequence at a time in Label Studio to make things easier, rather than labeling all sequences combined at once.

# Labeling time series in Label Studio

- After creating a new project in Label Studio, import the `sample_time_series.csv` file (inside the `sample_data` folder) that we created in the previous section (I assume you already know how to import data).
- Then go to "Settings" > "Labeling Interface" > "Browse Templates" > "Time Series Analysis" and select "Activity Recognition". Next, inspect the template code by clicking the "Code" panel to the right of the "Browse Templates" button. You should see something similar to the template code below (slightly different, because I have adjusted the general template).

![sample label template](images/sample-label-template.png)

- This is what we will tweak to match our time series data.
- The image above only shows the code with syntax highlighting (obtained by copying it into an IDE such as Visual Studio Code). You can also find all the templates, with some explanations, in the `label_studio_templates.html` file in this GitHub repo.
- Copy the entire code below, paste it into the Label Studio "Code" panel to overwrite the original template, and follow my instructions here.

```
<View>
  <TimeSeries name="ts" valueType="url" value="$timeseriesUrl" sep="," timeColumn="time" timeFormat="%Y-%m-%d %H:%M:%S"
    timeDisplayFormat="%Y-%m-%d">
    <Channel column="velocity" units="miles/h" displayFormat=",.1f" strokeColor="#1f77b4" legend="velocity" />
    <Channel column="acceleration" units="miles/h^2" displayFormat=",.1f" strokeColor="#ff7f0e" legend="acceleration" />
  </TimeSeries>

  <Header value="Time Series classification" style="font-weight: normal" />
  <Choices name="pattern" toName="ts">
    <Choice value="Normal" />
    <Choice value="Anomaly" />
  </Choices>

  <TimeSeriesLabels name="label" toName="ts">
    <Label value="Run" />
    <Label value="Walk" />
  </TimeSeriesLabels>
</View>
```

- This code is structured like HTML, so it will be easier to follow if you have used HTML before, or at least understand how HTML is structured.
- There are two main terms you should know: `tags`, and `attributes` or `parameters` (the term used in the official documentation). Refer to the documentation [here](https://labelstud.io/tags/timeseries.html) for more details; I will give a brief explanation of the most important parts.
- For labeling time series in Label Studio, there are several tags that we must pay attention to: the `TimeSeries` tag, the `Choices` tag, and the `TimeSeriesLabels` tag.

`TimeSeries` tag:
- This tag has several attributes, such as `name`, `valueType`, and `value`.
- Only the `timeColumn` attribute should be changed or removed, if necessary. It points to a specific column in your CSV file (in this case the column named `time`) that provides the datetime labels on the x-axis when labeling.
- The `timeFormat` and `timeDisplayFormat` attributes are linked to `timeColumn`. Both follow the `strftime` conventions of the `datetime` library in Python. `timeFormat` specifies the datetime format of the time column in your CSV file, and `timeDisplayFormat` specifies how it is displayed in the Label Studio labeling interface.
- If you remove these three time-related attributes, Label Studio uses incremental integer values 0, 1, 2, ... as the x-axis labels. Do this if you run into problems with using `timeColumn` for your x-axis.
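
Before uploading, it can help to confirm that every value in the time column really matches the `timeFormat` string. The sketch below does this on the Python side with `datetime.strptime`. The helper name `find_unparseable` and the sample values are hypothetical, and the check assumes, as described above, that Label Studio interprets the format codes the same way Python does.

```python
from datetime import datetime

# Same string as the `timeFormat` attribute in the template
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def find_unparseable(values, time_format):
    """Return the values that do not match `time_format`."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, time_format)
        except ValueError:
            bad.append(value)
    return bad


# Hypothetical time column: the last entry uses a different format on purpose
time_column = ["2021-08-01 00:00:00", "2021-08-01 00:00:01", "01/08/2021 00:00"]
bad_values = find_unparseable(time_column, TIME_FORMAT)
print(bad_values)
```

An empty result means every timestamp parses with the format string; any value that is returned would need to be fixed in the CSV file or the format string adjusted.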

`Channel` tag under the `TimeSeries` tag:
- This tag points to a column in the CSV file to display in the labeling interface of Label Studio.
- You can include any column(s) from the CSV file by putting the exact column name in the `column` attribute, e.g. in this case there are two columns named exactly `velocity` and `acceleration`.
- The other attributes are named fairly intuitively and only affect how the data looks while labeling. Refer to the official documentation [here](https://labelstud.io/tags/timeseries.html) if you don't understand them.

`Choice` tags under the `Choices` tag:
- Each `Choice` tag appears as a checkbox that labels the entire time series.
- You can select only one checkbox per time series; this is not multi-label.
- Add more choices by copying a line and changing its `value` attribute.

`Label` tags under the `TimeSeriesLabels` tag:
- Each `Label` tag appears as an individual selectable label for labeling specific regions.
- You can add more labels if necessary by copying a line and changing its `value` attribute.

# Checking JSON output from Label Studio

In [13]:
# checking the original careercon 2019 labels
# this is what we want to achieve with our custom y_train file
y_train = pd.read_csv(os.path.join(DATA_PATH, "y_train.csv"))
y_train

Unnamed: 0,series_id,group_id,surface
0,0,13,fine_concrete
1,1,31,concrete
2,2,20,concrete
3,3,31,concrete
4,4,22,soft_tiles
...,...,...,...
3805,3805,55,tiled
3806,3806,67,wood
3807,3807,48,fine_concrete
3808,3808,54,tiled


In this example, we use only 2 time series sequences. The CareerCon 2019 dataset actually contains 3810 sequences, as shown in the output of the cell directly above.

In [14]:
import json
# SAMPLE_LABEL_JSON = os.path.join("sample_data", "sample_label_output_1.json")
SAMPLE_LABEL_JSON = os.path.join("sample_data", "sample_label_output_2.json")
# use a context manager so the file is closed after reading
with open(SAMPLE_LABEL_JSON) as f:
    label_json = json.load(f)
label_json

[{'id': 529,
  'annotations': [{'id': 552,
    'completed_by': {'id': 1,
     'email': 'roxastan@hotmail.com',
     'first_name': '',
     'last_name': ''},
    'result': [{'value': {'start': 0,
       'end': 45,
       'instant': False,
       'timeserieslabels': ['Run']},
      'id': 'L9Ip4hisF8',
      'from_name': 'label',
      'to_name': 'ts',
      'type': 'timeserieslabels'},
     {'value': {'start': 45,
       'end': None,
       'instant': False,
       'timeserieslabels': ['Walk']},
      'id': 'gSkA7PE5LJ',
      'from_name': 'label',
      'to_name': 'ts',
      'type': 'timeserieslabels'},
     {'value': {'choices': ['Anomaly']},
      'id': 'eEon2MSjLx',
      'from_name': 'pattern',
      'to_name': 'ts',
      'type': 'choices'}],
    'was_cancelled': False,
    'ground_truth': False,
    'created_at': '2021-08-30T09:20:41.312541Z',
    'updated_at': '2021-08-30T09:20:41.313041Z',
    'lead_time': 72.311,
    'prediction': {},
    'result_count': 0,
    'task': 529}],


- BE CAREFUL: Label Studio version 1.2 (or lower) has a bug where a labeled region that extends to the end of the time series is not displayed (try dragging a labeled region all the way to the end to see it).
- The label STILL EXISTS BUT CANNOT BE SEEN on the time series itself; you can still see the labeled region at the right side of the labeling interface.
- As a result of this bug, the exported output file shows the "end" value as "None".
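
To find out which regions are affected before processing an export, you could scan it for regions whose "end" is `None`. This is a minimal sketch: `sample_export` is a hypothetical, trimmed-down structure modeled on the output above, and `find_open_ended_regions` is an illustrative helper, not part of Label Studio.

```python
# Hypothetical export, trimmed to the keys this check needs
sample_export = [
    {
        "id": 529,
        "annotations": [
            {
                "result": [
                    {"value": {"start": 0, "end": 45, "timeserieslabels": ["Run"]},
                     "type": "timeserieslabels"},
                    {"value": {"start": 45, "end": None, "timeserieslabels": ["Walk"]},
                     "type": "timeserieslabels"},
                    {"value": {"choices": ["Anomaly"]}, "type": "choices"},
                ]
            }
        ],
    }
]


def find_open_ended_regions(tasks):
    """Return (task id, start, label) for every region whose "end" is None."""
    open_ended = []
    for task in tasks:
        for annotation in task["annotations"]:
            for result in annotation["result"]:
                value = result["value"]
                # only region results have a "start" key; choices do not
                if "start" in value and value.get("end") is None:
                    open_ended.append(
                        (task["id"], value["start"], value["timeserieslabels"][0])
                    )
    return open_ended


open_regions = find_open_ended_regions(sample_export)
print(open_regions)
```

Each reported region needs its "end" filled in from the length of the corresponding time series before it can be used.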

In [15]:
print(len(label_json))

2


In [16]:
## checking all the nested data
for i, task in enumerate(label_json):
    print(i)
    print(task.keys())
    for annot in task['annotations']:
        print(annot.keys())
        # use a separate name so the outer loop variable is not shadowed
        for result in annot['result']:
            print(result)
    print()
    
    # stop after 5 sequences in case the data is too large
    if i == 4:
        break

0
dict_keys(['id', 'annotations', 'predictions', 'file_upload', 'data', 'meta', 'created_at', 'updated_at', 'project'])
dict_keys(['id', 'completed_by', 'result', 'was_cancelled', 'ground_truth', 'created_at', 'updated_at', 'lead_time', 'prediction', 'result_count', 'task'])
{'value': {'start': 0, 'end': 45, 'instant': False, 'timeserieslabels': ['Run']}, 'id': 'L9Ip4hisF8', 'from_name': 'label', 'to_name': 'ts', 'type': 'timeserieslabels'}
{'value': {'start': 45, 'end': None, 'instant': False, 'timeserieslabels': ['Walk']}, 'id': 'gSkA7PE5LJ', 'from_name': 'label', 'to_name': 'ts', 'type': 'timeserieslabels'}
{'value': {'choices': ['Anomaly']}, 'id': 'eEon2MSjLx', 'from_name': 'pattern', 'to_name': 'ts', 'type': 'choices'}

1
dict_keys(['id', 'annotations', 'predictions', 'file_upload', 'data', 'meta', 'created_at', 'updated_at', 'project'])
dict_keys(['id', 'completed_by', 'result', 'was_cancelled', 'ground_truth', 'created_at', 'updated_at', 'lead_time', 'prediction', 'result_count'

# Extract label pattern of the entire time series

This section extracts one pattern/class label for each entire time series sequence, i.e. `Anomaly` or `Normal` in this case.

In [48]:
label_list = []

for i, ts in enumerate(label_json):
    # print(i)
    ts_labels = ts['annotations'][0]['result']
    for label in ts_labels:
        current_label = label['value']
        if 'choices' in current_label:
            # to get the choice value
            labels = current_label['choices']
            label_list.append({
                'series_id': i,
                'label': labels[0],
            })
# create a DataFrame similar to the original y_train by using
# the list of dictionaries
sample_y_train_1 = pd.DataFrame(label_list) 
display(sample_y_train_1)

Unnamed: 0,series_id,label
0,0,Anomaly
1,1,Anomaly


This kind of anomaly detection or classification task does not require changing the `series_id` of the original `X_train` data, because we did not cut out specific regions.

Therefore, it is very simple and we are done here.

In [49]:
sample_y_train_1.to_csv("sample_data/sample_y_train_1.csv", index=False)

# Extracting specific regions as labels

This is more difficult than categorizing each sequence as a whole. It is up to you whether to extract the regions or simply use their "start" and "end" points as features.
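
As a rough sketch of the second option, the region boundaries can be turned into per-sequence features such as the total labeled length per class. The `regions` table below is hypothetical; it only mimics the columns of the region table built further below, and the choice of feature is an assumption for illustration.

```python
import pandas as pd

# Hypothetical region table (one row per labeled region)
regions = pd.DataFrame({
    "original_series_id": [0, 0, 1, 1],
    "start": [0, 45, 0, 32],
    "end": [45, 100, 24, 98],
    "label": ["Run", "Walk", "Walk", "Run"],
})

# Length of each region in number of measurements
regions["length"] = regions["end"] - regions["start"]

# One row per original sequence, one column per label,
# holding the total labeled length for that label
features = regions.pivot_table(
    index="original_series_id",
    columns="label",
    values="length",
    aggfunc="sum",
    fill_value=0,
)
print(features)
```

Features like these keep the original `series_id` values intact, at the cost of discarding the raw measurements inside each region.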

In this example, I will show how to extract each region as a new sequence and assign it a new `series_id`.

In [39]:
series_id = 0
label_list = []

for i, ts in enumerate(label_json):
    # print(i)
    ts_labels = ts['annotations'][0]['result']
    for label in ts_labels:
        current_label = label['value']

        # we only want the results that have a "start" key,
        # i.e. the labeled regions we want to extract
        if 'start' in current_label:
            current_id = series_id
            start = current_label['start']
            end = current_label['end']
            if end is None:
                # when this happens, it means the region spans
                #  until the end of the time series,
                #  this is a bug from Label Studio v1.2.
                #  Hence, take the number of rows from the CSV file
                end = len(sample_ts)
            labels = current_label['timeserieslabels']
            label_list.append({
                'original_series_id': i,
                'new_series_id': current_id,
                'start': start,
                'end': end,
                'label': labels[0],
            })
            # increment for a new sequence for each region
            series_id += 1
# create a DataFrame similar to the original y_train
sample_y_train_2 = pd.DataFrame(label_list) 
display(sample_y_train_2)

Unnamed: 0,original_series_id,new_series_id,start,end,label
0,0,0,0,45,Run
1,0,1,45,200,Walk
2,1,2,0,24,Walk
3,1,3,32,98,Run


In [40]:
sample_y_train_2.to_csv("sample_data/sample_y_train_2.csv", index=False)

In [42]:
## there might be an easier way to do this, but this is what I came up with

merge_list = []

# iterate over the rows, one per labeled region
for row in sample_y_train_2.itertuples():
    # get all the required data
    original_series_id = row.original_series_id
    new_series_id = row.new_series_id
    start = row.start
    end = row.end
    
    # create the list of measurement numbers and series_id
    measurement_number = np.arange(start, end)
    series_id_list = np.repeat(new_series_id, end - start)
    
    # create new DataFrame
    expanded_y = pd.DataFrame({
        'original_series_id': original_series_id,
        'new_series_id': series_id_list,
        'measurement_number': measurement_number,
    })
    
    # append it to a list to concatenate them altogether later
    merge_list.append(expanded_y)

expanded_y = pd.concat(merge_list, ignore_index=True)
expanded_y

Unnamed: 0,original_series_id,new_series_id,measurement_number
0,0,0,0
1,0,0,1
2,0,0,2
3,0,0,3
4,0,0,4
...,...,...,...
285,1,3,93
286,1,3,94
287,1,3,95
288,1,3,96
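
One possible alternative to the loop above is to expand all regions at once with `Index.repeat` and `np.repeat`. This is a sketch on a small hypothetical `regions` table (not the real labels), and it assumes every region has `end` greater than or equal to `start`.

```python
import numpy as np
import pandas as pd

# Hypothetical region table with the same columns used by the loop above
regions = pd.DataFrame({
    "original_series_id": [0, 1],
    "new_series_id": [0, 1],
    "start": [0, 2],
    "end": [3, 4],
})

# Number of measurements covered by each region
lengths = (regions["end"] - regions["start"]).to_numpy()

# Repeat each region row once per measurement it covers
expanded = regions.loc[
    regions.index.repeat(lengths), ["original_series_id", "new_series_id"]
].reset_index(drop=True)

# Offsets 0..length-1 within each region, added to the repeated start values
starts = np.repeat(regions["start"].to_numpy(), lengths)
region_first_row = np.repeat(lengths.cumsum() - lengths, lengths)
offsets = np.arange(lengths.sum()) - region_first_row
expanded["measurement_number"] = starts + offsets

print(expanded)
```

The result has the same three columns as `expanded_y`, so it could be merged with the time series in the same way.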


In [43]:
# merge them just like the JOIN in SQL syntax
#  the `left_on` is the column(s) of first DataFrame to join to second DataFrame,
#  while the `right_on` is the column(s) of second DataFrame to join to first DataFrame
merged_df = pd.merge(
    sample_ts,
    expanded_y,
    left_on=["series_id", "measurement_number"],
    right_on=["original_series_id", "measurement_number"]
)
merged_df

Unnamed: 0,series_id,measurement_number,velocity,acceleration,original_series_id,new_series_id
0,0,0,0.107650,-0.74857,0,0
1,0,1,0.067851,0.33995,0,0
2,0,2,0.007275,-0.26429,0,0
3,0,3,-0.013053,0.42684,0,0
4,0,4,0.005135,-0.50969,0,0
...,...,...,...,...,...,...
185,1,93,-0.023636,1.11930,1,3
186,1,94,-0.038759,2.85380,1,3
187,1,95,-0.032556,0.88128,1,3
188,1,96,-0.005574,2.13840,1,3


NOTE: By default, `pd.merge` performs an inner join, so rows of `sample_ts` that have no matching labeled region are dropped. Hence, the 200 rows have now become only 190 rows. You can pass `how="left"` to keep all of the rows (the unmatched ones get NaN in the columns from `expanded_y`), but this is not required in this case because we only want the labeled regions.
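
The difference between the default inner join and a left join can be seen on a small example. The `ts` and `labels` frames below are hypothetical stand-ins for `sample_ts` and `expanded_y`; `how` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Hypothetical time series with 4 measurements
ts = pd.DataFrame({
    "series_id": [0, 0, 0, 0],
    "measurement_number": [0, 1, 2, 3],
    "velocity": [0.1, 0.2, 0.3, 0.4],
})

# Hypothetical labels covering only the first 2 measurements
labels = pd.DataFrame({
    "original_series_id": [0, 0],
    "measurement_number": [0, 1],
    "new_series_id": [0, 0],
})

join_keys = dict(
    left_on=["series_id", "measurement_number"],
    right_on=["original_series_id", "measurement_number"],
)

# Default (inner) join: unlabeled measurements are dropped
inner = pd.merge(ts, labels, **join_keys)

# Left join: every measurement is kept; `_merge` tells where each row came from
left = pd.merge(ts, labels, how="left", indicator=True, **join_keys)

print(len(inner), len(left))
print(left["_merge"].value_counts())
```

Rows marked `left_only` are the unlabeled measurements, which is a quick way to count how much of each sequence was left without a label.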

In [44]:
# drop old unwanted columns
merged_df.drop(columns=["series_id", "original_series_id"], inplace=True)

In [45]:
# rename the new column
merged_df.rename(columns={"new_series_id": "series_id"}, inplace=True)

In [46]:
# rearrange the `series_id` column as the first column
merged_df = merged_df[['series_id', 'measurement_number', 'velocity', 'acceleration']]
merged_df

Unnamed: 0,series_id,measurement_number,velocity,acceleration
0,0,0,0.107650,-0.74857
1,0,1,0.067851,0.33995
2,0,2,0.007275,-0.26429
3,0,3,-0.013053,0.42684
4,0,4,0.005135,-0.50969
...,...,...,...,...
185,3,93,-0.023636,1.11930
186,3,94,-0.038759,2.85380
187,3,95,-0.032556,0.88128
188,3,96,-0.005574,2.13840


Finally, we have generated the new `X_train` data, with one `series_id` for each labeled region.

In [36]:
merged_df.to_csv("sample_data/sample_X_train_2.csv", index=False)