<a href="https://colab.research.google.com/github/dylanoco/basketball-classification-dribbling/blob/main/creating_ts_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The Process of Creating a Dataset

I will be using dummy data that I recorded by moving my arm around for twelve seconds.
This data will help me convert the received CSV file into a compatible .ts file that can be used for training.
Once I have done this for the accelerometer, the gyroscope and finally both of them combined, I will create the actual training and testing datasets for the models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Converting CSV into a DataFrame for manipulation
df_accel_data = pd.read_csv('/content/drive/MyDrive/Development_Project/dummy_data/Accelerometer.csv')
#df_gyro_data = pd.read_csv('/content/drive/MyDrive/Development_Project/dummy_data/Gyroscope.csv')

In [None]:
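# Resampling the readings to one-second intervals
# Interpret 'seconds_elapsed' as a timestamp so that pandas can resample on it
df_accel_data['seconds_elapsed'] = pd.to_datetime(df_accel_data['seconds_elapsed'], unit='s')

df_accel_data.set_index('seconds_elapsed', inplace=True)

# Average all readings that fall within the same second
df_resampled = df_accel_data.resample('s').mean()

df_resampled.reset_index(inplace=True)

# Convert the timestamps back to elapsed seconds
df_resampled['seconds_elapsed'] = (df_resampled['seconds_elapsed'] - pd.Timestamp(0)).dt.total_seconds()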

# Print the resampled DataFrame
print(df_resampled)
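
To make the resampling step easier to follow, here is a minimal sketch on a tiny made-up recording. The CSV text and the names `csv_text`, `df_demo` and `demo_resampled` are assumptions for illustration; only the column names come from the cells above. Readings that fall within the same second are averaged into one row.

```python
import io

import pandas as pd

# Hypothetical sensor export using the column names from this notebook
csv_text = """time,seconds_elapsed,z,y,x
1000,0.10,1.0,2.0,3.0
1001,0.60,3.0,4.0,5.0
1002,1.20,5.0,6.0,7.0
1003,2.40,7.0,8.0,9.0
"""
df_demo = pd.read_csv(io.StringIO(csv_text))

# Treat the elapsed seconds as an offset from the Unix epoch
df_demo["seconds_elapsed"] = pd.to_datetime(df_demo["seconds_elapsed"], unit="s")

# One row per second, holding the mean of the readings in that second
demo_resampled = df_demo.set_index("seconds_elapsed").resample("1s").mean().reset_index()

# Turn the timestamps back into elapsed seconds
demo_resampled["seconds_elapsed"] = (
    demo_resampled["seconds_elapsed"] - pd.Timestamp(0)
).dt.total_seconds()

print(demo_resampled[["seconds_elapsed", "z", "y", "x"]])
```

The first two readings (at 0.10 s and 0.60 s) share the first one-second bin, so the four input rows become three output rows.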


## Dropping Columns

Next, I drop the unnecessary columns so that only z, y and x remain.

In [None]:
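# 'seconds_elapsed' became the index during resampling, so restore it as a column before dropping
df_accel_data = df_accel_data.reset_index().drop(columns=['time','seconds_elapsed'])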

In [None]:
df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])

## Normalisation: MinMaxScaler

I will apply min-max normalisation to this data, which rescales each axis to the range 0 to 1. This keeps the axes on a consistent scale and helps to prevent an axis with a larger numeric range from dominating the model during training.

In [None]:
scaler = MinMaxScaler()

# Selecting the axes, scaling them with MinMaxScaler
df_resampled[['z', 'y', 'x']] = scaler.fit_transform(df_resampled[['z', 'y', 'x']])
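
For reference, with its default `feature_range` of (0, 1), `MinMaxScaler` rescales each column independently as (value - min) / (max - min). The sketch below reproduces that formula by hand with pandas on made-up values; `df_demo` and `df_scaled` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical readings for the three axes
df_demo = pd.DataFrame({
    "z": [2.0, 4.0, 6.0],
    "y": [-1.0, 0.0, 1.0],
    "x": [10.0, 10.5, 12.0],
})

# Column-wise min-max scaling: (value - min) / (max - min)
col_min = df_demo.min()
col_max = df_demo.max()
df_scaled = (df_demo - col_min) / (col_max - col_min)

print(df_scaled)
```

Every column now spans exactly 0 to 1. A column holding a single constant value would divide by zero in this hand-written version, which the sketch does not handle.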

## Establishing the labels (Y)

Each instance needs a label so that the classification models can learn to identify the movement performed.

In [None]:
label = 'test_1'

## Exporting the data

Finally, we want to export this data in a suitable format. I have decided to use the .ts format from sktime.

A time-series file requires the data in the format specified in the documentation:
"The dataset in a 3d ndarray to be written as a ts file which must be of the structure specified in the documentation examples/loading_data.ipynb. (n_instances, n_columns, n_timepoints)"

As a result, I need to transpose the data so that the rows and columns are flipped: 3 rows and n columns, where n is the number of timepoints and the 3 rows are the former columns z, y and x.

Then, once inserted into the array, the shape will match the required format:

*   (1 instance, 3 rows, n columns)
*   "(n_instances, n_columns, n_timepoints)"
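
The shape change described above can be checked on a small stand-in frame. This is a minimal sketch with made-up values; `df_demo`, `transposed` and `X_demo` are hypothetical names, and 13 timepoints is chosen only to resemble a twelve-second recording.

```python
import numpy as np
import pandas as pd

# Hypothetical resampled readings: 13 timepoints (rows) by 3 axes (columns)
n_timepoints = 13
df_demo = pd.DataFrame(
    np.arange(n_timepoints * 3, dtype=float).reshape(n_timepoints, 3),
    columns=["z", "y", "x"],
)

# Flip rows and columns: one row per axis, one column per timepoint
transposed = df_demo.to_numpy().T

# Wrapping the single instance in a list adds the leading n_instances axis
X_demo = np.asarray([transposed])

print(df_demo.shape)     # (13, 3)
print(transposed.shape)  # (3, 13)
print(X_demo.shape)      # (1, 3, 13)
```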


In [None]:
!pip install --upgrade sktime

In [None]:
from sktime.datasets import write_ndarray_to_tsfile

In [None]:
df_resampled = np.transpose(df_resampled)

In [None]:
X = []
y = []
X.append(df_resampled)
y.append(label)
X = np.asarray(X)
y = np.asarray(y)
# Unique class labels, sorted so that their order is deterministic
sy = sorted(set(y))

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data', class_label=sy,class_value_list=y)
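
To make the export more concrete, the sketch below shows roughly how one instance might appear as a data line in a .ts file. It assumes the usual layout in which the values of a dimension are comma-separated, dimensions are colon-separated and the class value is the final field. `format_ts_line` is a hypothetical helper written for illustration; it is not part of sktime and does not reproduce the header that `write_ndarray_to_tsfile` also writes.

```python
import numpy as np

def format_ts_line(instance, label):
    """Format one (n_columns, n_timepoints) instance as a .ts-style data line."""
    # Values within a dimension are comma-separated
    dims = [",".join(str(float(v)) for v in row) for row in instance]
    # Dimensions are colon-separated, with the class value as the last field
    return ":".join(dims + [label])

instance = np.array([
    [0.0, 0.5, 1.0],  # z
    [1.0, 0.5, 0.0],  # y
    [0.2, 0.4, 0.6],  # x
])
line = format_ts_line(instance, "test_1")
print(line)
```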

## Allowing for more than one instance

Now that we can create one instance of a movement and write it to a dataset, the process needs to handle numerous instances.

This consists of a loop that iterates through a folder of CSV files, preprocesses each one and appends the result to a list.

Once the loop completes, we convert that list into a NumPy array and pack it away as a time-series dataset.
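
One way to organise the steps just described is to wrap the per-file preprocessing in a function and call it for every CSV file in a folder. This is only a sketch under stated assumptions: `preprocess_csv` and `build_dataset` are hypothetical names, the synthetic recordings are made up, and the min-max scaling is written out by hand so that the example depends only on pandas and NumPy (the notebook itself uses `MinMaxScaler`).

```python
import os
import tempfile

import numpy as np
import pandas as pd

def preprocess_csv(csv_path):
    """Return one instance as an array of shape (3, n_timepoints)."""
    df = pd.read_csv(csv_path)
    df["seconds_elapsed"] = pd.to_datetime(df["seconds_elapsed"], unit="s")
    df = df.set_index("seconds_elapsed").resample("1s").mean()
    axes = df[["z", "y", "x"]]
    # Hand-written min-max scaling; a constant column is not handled here
    axes = (axes - axes.min()) / (axes.max() - axes.min())
    return axes.to_numpy().T

def build_dataset(folder):
    """Preprocess every CSV file in `folder` and stack the results."""
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith(".csv"))
    instances = [preprocess_csv(os.path.join(folder, f)) for f in csv_files]
    return np.asarray(instances)

# Usage on two small synthetic recordings written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for i in range(2):
        pd.DataFrame({
            "time": range(6),
            "seconds_elapsed": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
            "z": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
            "y": [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],
            "x": [1.0, 1.0, 2.0, 2.0, 4.0, 4.0],
        }).to_csv(os.path.join(folder, f"recording_{i}.csv"), index=False)
    X_demo = build_dataset(folder)

print(X_demo.shape)  # (2, 3, 3)
```

Each synthetic recording covers three seconds, so the result holds 2 instances, 3 axes and 3 timepoints.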

In [None]:
import os
import random

In [None]:
X = []
y = []

In [None]:
path = '/content/drive/MyDrive/Development_Project/dummy_data/'
files = os.listdir(path)
for file in files:
  # Same preprocessing as before, applied to each CSV file in the folder
  df_accel_data = pd.read_csv(os.path.join(path, file))
  df_accel_data['seconds_elapsed'] = pd.to_datetime(df_accel_data['seconds_elapsed'], unit='s')
  df_accel_data.set_index('seconds_elapsed', inplace=True)
  df_resampled = df_accel_data.resample('s').mean()
  df_resampled.reset_index(inplace=True)

  df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])

  scaler = MinMaxScaler()
  # Selecting the axes, scaling them with MinMaxScaler
  df_resampled[['z', 'y', 'x']] = scaler.fit_transform(df_resampled[['z', 'y', 'x']])

  # Transpose to (n_columns, n_timepoints)
  df_resampled = np.transpose(df_resampled)

  X.append(df_resampled)
  # Placeholder label: the random suffix stands in for real movement labels
  y.append(label + str(random.randint(0,5)))

X = np.asarray(X)
y = np.asarray(y)
# Unique class labels, sorted so that their order is deterministic
sy = sorted(set(y))

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data', class_label=sy,class_value_list=y)





## End Result

We now have an ndarray with a shape of (4, 3, 13): 4 instances (I had 4 CSV files in one folder), 3 columns (z, y and x) and 13 timepoints (the recording lasted 12 seconds, but the 0th second is included).
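
This shape relies on every recording resampling to the same number of timepoints; a recording that ran slightly longer would produce an extra row and the instances would no longer stack into a regular 3D array. A minimal sketch of a check that could run before export, where `stack_instances` is a hypothetical helper and the zero-filled instances are placeholders:

```python
import numpy as np

def stack_instances(instances):
    """Stack (n_columns, n_timepoints) arrays into one 3D array, checking shapes first."""
    shapes = {np.shape(inst) for inst in instances}
    if len(shapes) != 1:
        raise ValueError(f"Instances have differing shapes: {sorted(shapes)}")
    return np.stack([np.asarray(inst) for inst in instances])

# Four hypothetical instances, each with 3 axes and 13 timepoints
instances = [np.zeros((3, 13)) for _ in range(4)]
X_demo = stack_instances(instances)
print(X_demo.shape)  # (4, 3, 13)
```

Passing instances of different lengths raises a `ValueError` that lists the shapes found, which makes it easier to spot the recording that needs trimming.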

We can now record as many instances as we want, place them in a folder of our choice and generate a time-series dataset.

## Part 2
However, we are not only going to create individual datasets (one for the accelerometer and another for the gyroscope); we also need to combine the two.

As a result, we need to:
* Iterate through folders
* Differentiate between Accelerometer and Gyroscope CSV files
* Rename the columns and combine the two DataFrames so that they count as one instance
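
The renaming and combining steps listed above can be sketched on two small frames. The values are made up and `accel`, `gyro`, `combined` and `instance` are hypothetical names; `add_prefix` is used here as a shorthand for renaming each axis by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical resampled readings from the two sensors for one movement
accel = pd.DataFrame({"z": [0.1, 0.2, 0.3], "y": [0.4, 0.5, 0.6], "x": [0.7, 0.8, 0.9]})
gyro = pd.DataFrame({"z": [1.1, 1.2, 1.3], "y": [1.4, 1.5, 1.6], "x": [1.7, 1.8, 1.9]})

# Prefix each axis with its sensor name so the six columns stay distinguishable
accel = accel.add_prefix("accel_")
gyro = gyro.add_prefix("gyro_")

# Join side by side; 'inner' keeps only the timepoints present in both frames
combined = pd.concat([accel, gyro], axis=1, join="inner")

# Transpose to (n_columns, n_timepoints) for one instance
instance = combined.to_numpy().T

print(list(combined.columns))
print(instance.shape)  # (6, 3)
```

A combined instance therefore has six rows, three per sensor, instead of the three rows of a single-sensor instance.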



In [None]:
dir_path = '/content/drive/MyDrive/Development_Project/dummy_data/'

X = []
y = []

for name in sorted(os.listdir(dir_path)): # Each sub-folder holds the CSV files for one instance
  path = os.path.join(dir_path,name)
  if not os.path.isdir(path): # Skip loose files that sit directly in dir_path
    continue
  df_list = []
  for file in sorted(os.listdir(path)): # Sorted so the column order is the same for every instance
    file_name = file.split(".") # Split the name of the file from the file type extension
    # Standard preprocessing from earlier
    df_sensor_data = pd.read_csv(os.path.join(path,file))
    df_sensor_data['seconds_elapsed'] = pd.to_datetime(df_sensor_data['seconds_elapsed'], unit='s')
    df_sensor_data.set_index('seconds_elapsed', inplace=True)
    df_resampled = df_sensor_data.resample('s').mean()
    df_resampled.reset_index(inplace=True)
    df_resampled = df_resampled.drop(columns=['time','seconds_elapsed'])
    # The file name tells us which sensor the data came from, so rename the columns as necessary
    if(file_name[0] == "Accelerometer"):
      df_resampled.rename(columns={"z":"accel_z", "y": "accel_y", "x": "accel_x"},inplace=True)
      print(df_resampled.columns)
      scaler = MinMaxScaler()
      # Selecting the RENAMED axes
      df_resampled[['accel_z', 'accel_y', 'accel_x']] = scaler.fit_transform(df_resampled[['accel_z', 'accel_y', 'accel_x']])
    else: # Same process for Gyroscope
      df_resampled.rename(columns={"z": "gyro_z", "y": "gyro_y", "x": "gyro_x"},inplace=True)
      print(df_resampled.columns)
      scaler = MinMaxScaler()
      df_resampled[['gyro_z', 'gyro_y', 'gyro_x']] = scaler.fit_transform(df_resampled[['gyro_z', 'gyro_y', 'gyro_x']])

    df_list.append(df_resampled)

  # Place the two sensors side by side, keeping only timepoints present in both
  df_total = pd.concat(df_list, axis=1, join='inner')
  # Transpose to (n_columns, n_timepoints), here 6 columns per instance
  df_total = np.transpose(df_total)
  X.append(df_total)
  # Placeholder label: the random suffix stands in for real movement labels
  y.append(label + str(random.randint(0,10)))

X = np.asarray(X)
y = np.asarray(y)
# Unique class labels, sorted so that their order is deterministic
sy = sorted(set(y))

file_path = '/content/drive/MyDrive/Development_Project/'
write_ndarray_to_tsfile(data = X, path = file_path, problem_name='sample_data_combined', class_label=sy,class_value_list=y)

Index(['accel_z', 'accel_y', 'accel_x'], dtype='object')
Index(['gyro_z', 'gyro_y', 'gyro_x'], dtype='object')
Index(['accel_z', 'accel_y', 'accel_x'], dtype='object')
Index(['gyro_z', 'gyro_y', 'gyro_x'], dtype='object')
