<a href="https://colab.research.google.com/github/dylanoco/basketball-classification-dribbling/blob/main/creating_ts_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Running Libraries

In [3]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import glob
from sktime.datasets import write_ndarray_to_tsfile
import os
import math
import random

In [2]:
!pip install --upgrade sktime

Collecting sktime
  Downloading sktime-0.27.0-py3-none-any.whl (21.9 MB)
Collecting scikit-base<0.8.0 (from sktime)
  Downloading scikit_base-0.7.2-py3-none-any.whl (127 kB)
Installing collected packages: scikit-base, sktime
Successfully installed scikit-base-0.7.2 sktime-0.27.0


## The Process of Creating a Dataset

I will be using dummy data that I created by moving my arm around for twelve seconds.
This will help me convert the received CSV file into a compatible .ts file that can be used for training.
Once I have done this for the Accelerometer, the Gyroscope and finally both combined, I will proceed to create the actual training and testing datasets for the models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Converting CSV into a DataFrame for manipulation
df_accel_data = pd.read_csv('/content/drive/MyDrive/Development_Project/dummy_data/Accelerometer.csv')
#df_gyro_data = pd.read_csv('/content/drive/MyDrive/Development_Project/dummy_data/Gyroscope.csv')

In [None]:
#Resampling the Seconds
df_accel_data['seconds_elapsed'] = pd.to_datetime(df_accel_data['seconds_elapsed'], unit='s')

df_accel_data.set_index('seconds_elapsed', inplace=True)

df_resampled = df_accel_data.resample('S').mean()

df_resampled.reset_index(inplace=True)

# datetime64[ns] values are nanoseconds, so divide by 1e9 to get back to seconds
df_resampled['seconds_elapsed'] = df_resampled['seconds_elapsed'].astype('int64') / 1e9

# Print the resampled DataFrame
print(df_resampled)


## Dropping Columns

Now, I must drop the unnecessary columns so that only z, y and x remain.

In [None]:
# 'seconds_elapsed' became the index above, so reset it before dropping both columns
df_accel_data = df_accel_data.reset_index().drop(columns=['time','seconds_elapsed'])

In [None]:
df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])

## Normalisation; MinMax()

I will apply min-max normalisation to this data, scaling each axis to the range [0, 1]. This keeps the axes on a consistent scale and helps prevent the model from being biased towards whichever axis has the largest values during training.
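
As a rough illustration of what the scaling does, the sketch below applies the min-max formula by hand with pandas. The readings are made up, and this is a stand-in for `MinMaxScaler` with its default range rather than the call used in this notebook.

```python
import pandas as pd

# Hypothetical readings standing in for one accelerometer recording
df = pd.DataFrame({
    "z": [1.0, 3.0, 5.0],
    "y": [-2.0, 0.0, 2.0],
    "x": [10.0, 10.5, 12.0],
})

# Per-column min-max scaling: (value - column min) / (column max - column min)
scaled = (df - df.min()) / (df.max() - df.min())

print(scaled)
```

Each column now spans exactly 0 to 1. Note that this hand-written version would divide by zero for a constant column, which is one reason to prefer the library scaler in practice.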

In [None]:
scaler = MinMaxScaler()

# Selecting the axes, scaling them with MinMaxScaler
df_resampled[['z', 'y', 'x']] = scaler.fit_transform(df_resampled[['z', 'y', 'x']])

## Establishing the labels (Y)

Of course, each instance needs a label for the Classification models to be able to identify the movement performed.

In [5]:
label = 'test_1'

## Exporting the data

Finally, we want to export this data in a suitable format. I have decided to use the .ts format from sktime.

A time-series file requires the data in the following format, as specified in the documentation:
"The dataset in a 3d ndarray to be written as a ts file which must be of the structure specified in the documentation examples/loading_data.ipynb. (n_instances, n_columns, n_timepoints)"

As a result, I need to transpose the data so that the rows and columns are flipped: 3 rows and n columns, where n is the number of timepoints and the 'rows' are the former columns z, y and x.

Then, once inserted into the array, the shape will match the required format:

*   (1 instance, 3 rows, n columns)
*   "(n_instances, n_columns, n_timepoints)"
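
The shape bookkeeping can be checked on synthetic data before touching the real recordings. The sketch below assumes a made-up recording with 13 timepoints and only demonstrates the transpose and stacking steps; it does not write a .ts file.

```python
import numpy as np
import pandas as pd

# Hypothetical recording: 13 timepoints, three axes
n_timepoints = 13
df = pd.DataFrame({
    "z": np.linspace(0.0, 1.0, n_timepoints),
    "y": np.linspace(1.0, 0.0, n_timepoints),
    "x": np.full(n_timepoints, 0.5),
})

# The DataFrame is (n_timepoints, n_columns); transpose to (n_columns, n_timepoints)
instance = df.to_numpy().T

# Collect instances in a list, then convert to (n_instances, n_columns, n_timepoints)
X = np.asarray([instance])
print(X.shape)  # (1, 3, 13)
```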


In [None]:
df_resampled = np.transpose(df_resampled)

In [None]:




X = []
y = []
X.append(df_resampled)
y.append(label)
X = np.asarray(X)
y = np.asarray(y)
sy = set(y)

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data', class_label=sy,class_value_list=y)

## Allowing for more than one instance.

Now that we can create one instance of a movement and write it to a dataset, we need to be able to iterate through numerous instances.

This consists of creating a loop that iterates through a folder of CSV files, preprocesses each one and appends it to a list.

Once completed, we convert that list into a NumPy array, which is then packed away as a time-series dataset!
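
A minimal, self-contained sketch of that loop is shown here. It writes two made-up CSV files to a temporary folder (the file names and values are assumptions) and leaves out resampling and scaling so that only the iterate, preprocess and append pattern is visible.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a throwaway folder with two hypothetical recordings of equal length
folder = tempfile.mkdtemp()
for i in range(2):
    pd.DataFrame({
        "time": [0, 1, 2],
        "seconds_elapsed": [0.0, 1.0, 2.0],
        "z": [0.1 * i, 0.2, 0.3],
        "y": [0.4, 0.5, 0.6],
        "x": [0.7, 0.8, 0.9],
    }).to_csv(os.path.join(folder, f"recording_{i}.csv"), index=False)

X = []
for file in sorted(os.listdir(folder)):  # sorted for a reproducible order
    df = pd.read_csv(os.path.join(folder, file))
    df = df.drop(columns=["time", "seconds_elapsed"])
    X.append(df.to_numpy().T)  # one instance: (n_columns, n_timepoints)

X = np.asarray(X)
print(X.shape)  # (2, 3, 3)
```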

In [None]:
import os
import random

In [None]:
X = []
y = []

In [None]:

path = '/content/drive/MyDrive/Development_Project/dummy_data/'
files = os.listdir(path)
for file in files:

  df_accel_data = pd.read_csv(path+file)
  df_accel_data['seconds_elapsed'] = pd.to_datetime(df_accel_data['seconds_elapsed'], unit='s')
  df_accel_data.set_index('seconds_elapsed', inplace=True)
  df_resampled = df_accel_data.resample('S').mean()
  df_resampled.reset_index(inplace=True)
  # datetime64[ns] values are nanoseconds, so divide by 1e9 to get seconds
  df_resampled['seconds_elapsed'] = df_resampled['seconds_elapsed'].astype('int64') / 1e9

  df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])

  scaler = MinMaxScaler()
  # Selecting the axes, scaling them with MinMaxScaler
  df_resampled[['z', 'y', 'x']] = scaler.fit_transform(df_resampled[['z', 'y', 'x']])

  df_resampled = np.transpose(df_resampled)

  X.append(df_resampled)
  y.append(label + str(random.randint(0,5)))

X = np.asarray(X)
y = np.asarray(y)
sy = set(y)

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data', class_label=sy,class_value_list=y)





## End Result

We now have an ndarray with a shape of (4, 3, 13): 4 instances (I had 4 CSV files in one folder), 3 columns (x, y and z) and 13 timepoints (the recording lasted 12 seconds, but the 0th second is included).

We can now record as many instances as we want, place them in a folder of our choice and generate a time-series dataset.

## Part 2

However, we are not only going to record individual datasets (one for the accelerometer and another for the gyroscope); we also need to combine the two.

As a result, we need to:
* iterate through folders
* differentiate between Accelerometer and Gyroscope CSV files
* rename the columns and combine the two dataframes so that they count as one instance
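
The renaming and combining step can be tried in isolation first. In the sketch below the readings are invented and `prefix_axes` is a hypothetical helper introduced only for illustration; it is not part of the notebook's code.

```python
import pandas as pd

# Hypothetical, already-preprocessed readings for one instance
accel = pd.DataFrame({"z": [0.1, 0.2], "y": [0.3, 0.4], "x": [0.5, 0.6]})
gyro = pd.DataFrame({"z": [1.1, 1.2], "y": [1.3, 1.4], "x": [1.5, 1.6]})

def prefix_axes(df, sensor):
    """Rename z/y/x to <sensor>_z/<sensor>_y/<sensor>_x."""
    return df.rename(columns={axis: f"{sensor}_{axis}" for axis in ("z", "y", "x")})

combined = pd.concat(
    [prefix_axes(accel, "accel"), prefix_axes(gyro, "gyro")],
    axis=1,
    join="inner",  # keep only row labels present in both frames
)
print(list(combined.columns))
```

With `join="inner"`, rows that exist in only one of the two frames are dropped, so the combined instance always has six complete channels.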



In [None]:
import glob
dir_path = '/content/drive/MyDrive/Development_Project/dummy_data/'

X = []
y = []

# os.walk yields (dirpath, dirnames, filenames); dirs holds the sub-folders (instances)
for root, dirs, _ in os.walk(dir_path):
  for name in dirs: #Gets the name of the folders
    path = os.path.join(root,name)
    folder_files = os.listdir(path)
    count = 0
    df_list = []
    for file in folder_files: #Grabs the individual names of the files within that folder
      file_name = file.split(".") #Allows us to split the name of the file from the file type extension
      #Standard Preprocessing from earlier
      df_accel_data = pd.read_csv(os.path.join(path,file))
      df_accel_data['seconds_elapsed'] = pd.to_datetime(df_accel_data['seconds_elapsed'], unit='s')
      df_accel_data.set_index('seconds_elapsed', inplace=True)
      df_resampled = df_accel_data.resample('S').mean()
      df_resampled.reset_index(inplace=True)
      # datetime64[ns] values are nanoseconds, so divide by 1e9 to get seconds
      df_resampled['seconds_elapsed'] = df_resampled['seconds_elapsed'].astype('int64') / 1e9
      df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])
      #By using .split(), we can get the name of the file and rename the columns as necessary.
      if(file_name[0] == "Accelerometer"):
        df_resampled.rename(columns={"z":"accel_z", "y": "accel_y", "x": "accel_x"},inplace=True)
        print(df_resampled.columns)
        scaler = MinMaxScaler()
        # Selecting the RENAMED axes
        df_resampled[['accel_z', 'accel_y', 'accel_x']] = scaler.fit_transform(df_resampled[['accel_z', 'accel_y', 'accel_x']])
      else: #Same process for Gyroscope
        df_resampled.rename(columns={"z": "gyro_z", "y": "gyro_y", "x": "gyro_x"},inplace=True)
        print(df_resampled.columns)
        scaler = MinMaxScaler()
        df_resampled[['gyro_z', 'gyro_y', 'gyro_x']] = scaler.fit_transform(df_resampled[['gyro_z', 'gyro_y', 'gyro_x']])

      df_list.append(df_resampled)
      count += 1

    df_total = pd.concat(df_list, axis=1, join='inner')
    df_total = np.transpose(df_total)
    X.append(df_total)
    y.append(label + str(random.randint(0,10)))

X = np.asarray(X)
y = np.asarray(y)
sy = set(y)

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data_combined', class_label=sy,class_value_list=y)

Index(['accel_z', 'accel_y', 'accel_x'], dtype='object')
Index(['gyro_z', 'gyro_y', 'gyro_x'], dtype='object')
Index(['accel_z', 'accel_y', 'accel_x'], dtype='object')
Index(['gyro_z', 'gyro_y', 'gyro_x'], dtype='object')


## 2 Second Instances

My goal here is to collect 2-second instances (roughly the time it takes to perform a dribble move) from one set of data. This should capture the movement more accurately than taking, for example, 10 seconds of data in which I perform the move multiple times.

If we do this, resampling is no longer necessary. Resampling takes the mean of the occurrences within each recorded second, but for 2-second instances we want those 2 seconds to be as precise as possible. As a result, we remove the resampling process.
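
To make the loss of detail concrete, the sketch below averages made-up readings per whole second. It uses `groupby` on the floored timestamp as a simple stand-in for the `resample(...).mean()` call used earlier, so it approximates the idea rather than reproducing that call.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: four samples in second 0, two samples in second 1
df = pd.DataFrame({
    "seconds_elapsed": [0.0, 0.25, 0.5, 0.75, 1.0, 1.5],
    "z": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0],
})

# Bin each reading by the whole second it falls in, then average within the bin
per_second = df.groupby(np.floor(df["seconds_elapsed"]))["z"].mean()
print(per_second.tolist())  # [2.5, 15.0]
```

Six readings collapse into two values, which is why a 2-second window would be left with only a couple of points per axis after resampling.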

## First Attempt

Issue: Because the instances have different lengths, the final conversion of X into an array fails.
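
The problem can be reproduced with plain arrays. The sketch below uses invented instances of unequal length and shows one possible workaround, truncating to the shortest instance; this is an assumption for illustration and not the approach taken in the second attempt, which uses a fixed chunk size.

```python
import numpy as np

# Hypothetical instances with 6 channels each but differing numbers of timepoints
instances = [np.zeros((6, 5)), np.ones((6, 6)), np.full((6, 6), 2.0)]

# More than one distinct length means the list cannot form a regular 3D array
lengths = {inst.shape[1] for inst in instances}
print(lengths)

# One possible workaround: truncate every instance to the shortest length
shortest = min(lengths)
X = np.stack([inst[:, :shortest] for inst in instances])
print(X.shape)  # (3, 6, 5)
```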

In [6]:

dir_path = '/content/drive/MyDrive/Development_Project/dummy_data/'
files = os.listdir(dir_path)

X = []
y = []
df_accel_total = []
df_gyro_total = []

# os.walk yields (dirpath, dirnames, filenames); dirs holds the sub-folders (instances)
for root, dirs, _ in os.walk(dir_path):
  for name in dirs: #Gets the name of the folders
    path = os.path.join(root,name)
    folder_files = os.listdir(path)
    for file in folder_files: #Grabs the individual names of the files within that folder
      file_name = file.split(".") #Allows us to split the name of the file from the file type extension
      counter_split = 2
      index_holder = 0
      print(file_name[0])
      #Standard Preprocessing from earlier
      df_accel_data = pd.read_csv(os.path.join(path,file))
      for index,row in df_accel_data.iterrows():
        if(math.floor(row['seconds_elapsed']) == counter_split ):
          df_temp = df_accel_data.iloc[index_holder:index,:]
          index_holder = index
          counter_split += 2

          df_temp = df_temp.drop(columns=['time','seconds_elapsed'])
          #By using .split(), we can get the name of the file and rename the columns as necessary.
          if(file_name[0] == "Accelerometer"):
            df_temp.rename(columns={"z":"accel_z", "y": "accel_y", "x": "accel_x"},inplace=True)
            print(df_temp)
            scaler = MinMaxScaler()
            # Selecting the RENAMED axes
            df_temp[['accel_z', 'accel_y', 'accel_x']] = scaler.fit_transform(df_temp[['accel_z', 'accel_y', 'accel_x']])
            df_accel_total.append(df_temp)
          else: #Same process for Gyroscope
            df_temp.rename(columns={"z": "gyro_z", "y": "gyro_y", "x": "gyro_x"},inplace=True)
            print(df_temp)
            scaler = MinMaxScaler()
            df_temp[['gyro_z', 'gyro_y', 'gyro_x']] = scaler.fit_transform(df_temp[['gyro_z', 'gyro_y', 'gyro_x']])
            df_gyro_total.append(df_temp)



index = 0
for item in df_accel_total:
  df_list = []
  df_list.append(df_accel_total[index])
  df_list.append(df_gyro_total[index])
  df_total = pd.concat(df_list, axis=1, join='inner')
  df_total = np.transpose(df_total)
  X.append(df_total)
  y.append(label + str(random.randint(0,10)))
  index += 1

X = np.asarray(X)
y = np.asarray(y)
sy = set(y)

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data_combined', class_label=sy,class_value_list=y)

Accelerometer
    accel_z   accel_y   accel_x
0  1.338586 -1.456287  0.995221
1  2.134468 -0.402921  0.511669
2  1.566823  1.207753  0.356628
3 -3.274407 -4.050470 -0.954763
4  9.090763  1.645252  8.305042
      accel_z   accel_y   accel_x
5   11.810346 -2.655034  6.780014
6    3.668150  0.259544  9.171693
7    2.495482 -0.821216  7.364952
8    6.688351 -5.408457 -1.667495
9   18.131771 -2.222675  1.849996
10  -8.877424  0.072896 -1.635194
      accel_z   accel_y   accel_x
11  10.259792 -4.249830 -0.630009
12   9.550881 -1.230097  2.687378
13  14.077687 -0.613041  7.859201
14  -5.744693 -4.368533 -8.355460
15  -8.133892 -5.418841 -5.295728
16  -2.963998 -1.695791  4.488464
     accel_z   accel_y   accel_x
17  0.984179 -0.244292  0.350111
18 -1.756842 -0.115461  0.394145
19  1.895712 -5.185919 -0.561691
20 -2.518353 -2.369159 -1.055237
21  0.219545 -1.029243 -0.612476
     accel_z   accel_y   accel_x
22  0.776889 -1.488166 -0.035596
23  0.496181 -1.937151  0.172148
24  6.032318 -1.25941

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (12, 6) + inhomogeneous part.

In [7]:
X

[                0         1         2    3    4
 accel_z  0.373063  0.437428  0.391522  0.0  1.0
 accel_y  0.455462  0.640401  0.923188  0.0  1.0
 accel_x  0.210586  0.158365  0.141622  0.0  1.0
 gyro_z   0.373063  0.437428  0.391522  0.0  1.0
 gyro_y   0.455462  0.640401  0.923188  0.0  1.0
 gyro_x   0.210586  0.158365  0.141622  0.0  1.0,
                5         6         7         8         9        10
 accel_z  0.765953  0.464493  0.421075  0.576314  1.000000  0.00000
 accel_y  0.485784  1.000000  0.809322  0.000000  0.562064  0.96707
 accel_x  0.779349  1.000000  0.833314  0.000000  0.324516  0.00298
 gyro_z   0.765953  0.464493  0.421075  0.576314  1.000000  0.00000
 gyro_y   0.485784  1.000000  0.809322  0.000000  0.562064  0.96707
 gyro_x   0.779349  1.000000  0.833314  0.000000  0.324516  0.00298,
                11        12   13        14        15        16
 accel_z  0.828112  0.796196  1.0  0.107565  0.000000  0.232757
 accel_y  0.243250  0.871602  1.0  0.218550  0.0000

## Second Attempt

In [62]:
def split_dataframe(df, chunk_size = 10000):
    # Split df into consecutive chunks of chunk_size rows, skipping leftovers shorter than 3 rows
    chunks = list()
    num_chunks = len(df) // chunk_size + 1
    for i in range(num_chunks):
      chunk = df[i*chunk_size:(i+1)*chunk_size]
      if(len(chunk) < 3):
        print(len(chunk))
      else:
        chunks.append(chunk)
    return chunks

def find_chunks(df):
  # Average number of rows recorded per 2-second window
  total_rows = 0 # renamed from 'sum' to avoid shadowing the built-in
  counter = 0
  counter_split = 2
  index_holder = 0
  for index,row in df.iterrows():
    if(round(row['seconds_elapsed']) == counter_split ):
      # Count rows in the frame passed in, not in the global df_accel_data
      total_rows += len(df.iloc[index_holder:index,:])
      index_holder = index
      counter_split += 2
      counter += 1
  chunk_amount = round(total_rows / counter)

  return chunk_amount
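
The effect of cutting a recording into equal windows can be checked on synthetic data. The sketch below is a compact, assumed variant of the same idea with a hard-coded `chunk_size` of 5 and made-up values; it does not call the helpers above.

```python
import numpy as np
import pandas as pd

# Hypothetical recording: 17 rows, 3 channels
df = pd.DataFrame(np.arange(17 * 3).reshape(17, 3), columns=["z", "y", "x"])
chunk_size = 5  # assumed number of rows per window

# Keep only complete windows; the 2 leftover rows are discarded
n_full = len(df) // chunk_size
windows = [
    df.iloc[i * chunk_size:(i + 1) * chunk_size].to_numpy().T
    for i in range(n_full)
]
X = np.asarray(windows)
print(X.shape)  # (3, 3, 5)
```

Because every window has the same number of rows, the list converts cleanly into a 3D array, which is what the first attempt could not do.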

In [75]:
import glob
dir_path = '/content/drive/MyDrive/Development_Project/dummy_data/'

X = []
y = []

# os.walk yields (dirpath, dirnames, filenames); dirs holds the sub-folders (instances)
for root, dirs, _ in os.walk(dir_path):
  for name in dirs: #Gets the name of the folders
    path = os.path.join(root,name)
    folder_files = os.listdir(path)
    count = 0
    df_list = []
    for file in folder_files: #Grabs the individual names of the files within that folder
      file_name = file.split(".") #Allows us to split the name of the file from the file type extension
      #No resampling this time: the raw readings are used as-is
      df_resampled = pd.read_csv(os.path.join(path,file))
      #By using .split(), we can get the name of the file and rename the columns as necessary.
      if(file_name[0] == "Accelerometer"):
        df_resampled.rename(columns={"z":"accel_z", "y": "accel_y", "x": "accel_x"},inplace=True)

        scaler = MinMaxScaler()
        # Selecting the RENAMED axes
        df_resampled[['accel_z', 'accel_y', 'accel_x']] = scaler.fit_transform(df_resampled[['accel_z', 'accel_y', 'accel_x']])
      else:#Same process for Gyroscope
        df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])
        df_resampled.rename(columns={"z": "gyro_z", "y": "gyro_y", "x": "gyro_x"},inplace=True)
        scaler = MinMaxScaler()
        df_resampled[['gyro_z', 'gyro_y', 'gyro_x']] = scaler.fit_transform(df_resampled[['gyro_z', 'gyro_y', 'gyro_x']])

      df_list.append(df_resampled)
      count += 1


    df_total = pd.concat(df_list, axis=1, join='inner')
    chunk = find_chunks(df_total)
    df_chunk = split_dataframe(df_total, chunk_size=chunk)
    for df in df_chunk:
      if(len(df) < chunk):
        pass
      else:
        print(len(df))
        df= df.drop(columns=['time','seconds_elapsed'])
        df = np.transpose(df)
        X.append(df)
        y.append(label + str(random.randint(0,10)))

X = np.asarray(X)
y = np.asarray(y)
sy = set(y)

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data_combined', class_label=sy,class_value_list=y)

5
5
5
5
5
5
5
5
5
5
5
5


In [78]:
X[0]

array([[0.37824193, 0.40770901, 0.38669228, 0.2074485 , 0.66526185],
       [0.56094309, 0.71005857, 0.93806714, 0.19370801, 1.        ],
       [0.53349686, 0.5059081 , 0.49706236, 0.42224181, 0.95055383],
       [0.37824193, 0.40770901, 0.38669228, 0.2074485 , 0.66526185],
       [0.56094309, 0.71005857, 0.93806714, 0.19370801, 1.        ],
       [0.53349686, 0.5059081 , 0.49706236, 0.42224181, 0.95055383]])