In [1]:
import pandas as pd

# Choosing datasets

## About the data

The data comes from the UCR Time Series Classification Archive (2019) [1]. The archive has 128 datasets, each containing a collection of time series already split into training and test sets for classification.

The archive also provides additional information about each dataset, such as the data type (image, sensor, spectro, ...), the sizes of the training and test sets, the length of the time series, and classification error rates obtained with measures such as Euclidean distance and DTW (Dynamic Time Warping).
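To make the two distance measures concrete, here is a minimal pure-Python sketch of DTW with a warping window. It is only an illustration, not the archive's reference implementation: the function name `dtw_distance` and the parameter `window_pct` are hypothetical, and the code assumes the window is expressed as a percentage of the series length, so that a window of 0 reduces to the Euclidean distance and a window of 100 leaves the warping unconstrained.

```python
import math


def dtw_distance(a, b, window_pct=100):
    """DTW distance between two 1-D sequences, restricted to a band.

    window_pct (hypothetical parameter) is the band half-width as a
    percentage of the series length: 0 only allows the diagonal
    (Euclidean distance for equal-length series), 100 is unconstrained.
    """
    n, m = len(a), len(b)
    # Band half-width in samples; it must at least cover the length difference.
    w = max(round(window_pct / 100 * max(n, m)), abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Cheapest way to reach cell (i, j): insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


# Two copies of the same bump, shifted by one time step.
x = [0, 1, 2, 1, 0, 0]
y = [0, 0, 1, 2, 1, 0]
euclidean = dtw_distance(x, y, window_pct=0)
unconstrained = dtw_distance(x, y, window_pct=100)
print(euclidean, unconstrained)
```

With no window the shift is penalized point by point, while unconstrained warping aligns the two bumps. This is why the two measures can give different error rates on the same dataset.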

Let's take a look at the datasets.

In [2]:
datasets_df = pd.read_csv('data/DataSummary.csv')
datasets_df.head(10)

Unnamed: 0,ID,Type,Name,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
0,1,Image,Adiac,390,391,37,176,0.3887,0.3913 (3),0.3964,0.9591,A. Jalba
1,2,Image,ArrowHead,36,175,3,251,0.2,0.2000 (0),0.2971,0.6057,L. Ye & E. Keogh
2,3,Spectro,Beef,30,30,5,470,0.3333,0.3333 (0),0.3667,0.8,K. Kemsley & A. Bagnall
3,4,Image,BeetleFly,20,20,2,512,0.25,0.3000 (7),0.3,0.5,J. Hills & A. Bagnall
4,5,Image,BirdChicken,20,20,2,512,0.45,0.3000 (6),0.25,0.5,J. Hills & A. Bagnall
5,6,Sensor,Car,60,60,4,577,0.2667,0.2333 (1),0.2667,0.6833,J. Gao
6,7,Simulated,CBF,30,900,3,128,0.1478,0.0044 (11),0.0033,0.6644,N. Saito
7,8,Sensor,ChlorineConcentration,467,3840,3,166,0.35,0.3500 (0),0.3516,0.4674,L. Li & C. Faloutsos
8,9,Sensor,CinCECGTorso,40,1380,4,1639,0.1029,0.0696 (1),0.3493,0.7464,physionet.org
9,10,Spectro,Coffee,28,28,2,286,0.0,0.0000 (0),0.0,0.4643,"K, Kemsley & A. Bagnall"


In this project, we aim to evaluate how efficiently Optimum-Path Forest classifies time series, along with the techniques used to speed it up. It is therefore a good idea to choose datasets with distinct ratios of series length to training-set size, since training is the most costly step.

In [3]:
df_names = ['FaceFour', 'EthanolLevel', 'ChlorineConcentration', 'Phoneme', 'ShapesAll', 'TwoPatterns', 'InsectWingbeatSound', 'WordSynonyms', 'FordA']
datasets_df.loc[datasets_df['Name'].isin(df_names)]

Unnamed: 0,ID,Type,Name,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
7,8,Sensor,ChlorineConcentration,467,3840,3,166,0.35,0.3500 (0),0.3516,0.4674,L. Li & C. Faloutsos
24,25,Image,FaceFour,24,88,4,350,0.2159,0.1136 (2),0.1705,0.7045,A. Ratanamahatana & E. Keogh
28,29,Sensor,FordA,3601,1320,2,500,0.3348,0.3091 (1),0.4455,0.4841,A. Bagnall
36,37,Sensor,InsectWingbeatSound,220,1980,11,256,0.4384,0.4152 (1),0.6449,0.9091,Y. Chen & E. Keogh
53,54,Sensor,Phoneme,214,1896,39,1024,0.8908,0.7727 (14),0.7716,0.8871,H. Hamooni & A. Mueen
61,62,Image,ShapesAll,600,600,60,512,0.2483,0.1980 (4),0.2317,0.9833,J. Hills & A. Bagnall
74,75,Simulated,TwoPatterns,1000,4000,4,128,0.0932,0.0015 (4),0.0,0.7412,P. Geurts
81,82,Image,WordSynonyms,267,638,25,270,0.3824,0.2618 (9),0.3511,0.7806,T. Rath & R. Manmatha
97,98,Spectro,EthanolLevel,504,500,4,1751,0.726,0.7180 (1),0.724,0.748,A. Bagnall
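The selection criterion (distinct ratios of series length to training-set size) can be made explicit with a small calculation. The sketch below is only illustrative: it rebuilds three rows by hand from the `Train` and `Length` values shown in the table, and the column name `LengthPerTrain` is a hypothetical helper, not part of the archive's summary file.

```python
import pandas as pd

# Train and Length values copied from the summary table for three datasets.
sample = pd.DataFrame(
    {
        "Name": ["ChlorineConcentration", "EthanolLevel", "FordA"],
        "Train": [467, 504, 3601],
        "Length": [166, 1751, 500],
    }
)

# Hypothetical helper column: series length divided by training-set size.
sample["LengthPerTrain"] = sample["Length"] / sample["Train"]

# Sort from "many short series" to "few long series".
ranked = sample.sort_values("LengthPerTrain").reset_index(drop=True)
print(ranked)
```

Applying the same two lines to `datasets_df` would rank all datasets by this ratio. FordA (many series of moderate length) and EthanolLevel (few, very long series) land at opposite ends.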


After narrowing the candidates down to a few options, we can select those that differ most from one another. In this case, we chose the following datasets:

In [4]:
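# Final selection, listed in the order the rows should appear.
df_names = ['WordSynonyms', 'ChlorineConcentration', 'ShapesAll', 'EthanolLevel', 'FordA']
new_df = datasets_df.loc[datasets_df['Name'].isin(df_names)].set_index('Name')
# Reorder the rows to match df_names, then drop the ID column.
new_df.loc[df_names].reset_index().drop(columns=['ID'])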

Unnamed: 0,Name,Type,Train,Test,Class,Length,ED (w=0),DTW (learned_w),DTW (w=100),Default rate,Data donor/editor
0,WordSynonyms,Image,267,638,25,270,0.3824,0.2618 (9),0.3511,0.7806,T. Rath & R. Manmatha
1,ChlorineConcentration,Sensor,467,3840,3,166,0.35,0.3500 (0),0.3516,0.4674,L. Li & C. Faloutsos
2,ShapesAll,Image,600,600,60,512,0.2483,0.1980 (4),0.2317,0.9833,J. Hills & A. Bagnall
3,EthanolLevel,Spectro,504,500,4,1751,0.726,0.7180 (1),0.724,0.748,A. Bagnall
4,FordA,Sensor,3601,1320,2,500,0.3348,0.3091 (1),0.4455,0.4841,A. Bagnall


# References

[1] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping Chen, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, & Hexagon-ML (2019). The UCR Time Series Classification Archive. URL: https://www.cs.ucr.edu/~eamonn/time_series_data_2018/