**Using this workspace**

The following code prepares the dataset for training. There are multiple subjects, each with a varying number of sessions. The goal is a single pandas dataframe that holds all subjects in order, with an output label for every input sample. Because the output data is sampled at a lower rate than the input data, the missing labels have to be filled in so that the two match. At the end, a CSV file is written (`final_without_8.csv` in the last cell), which will be used when training our model.

**Specify Google Drive Directory With Training Data CSV Files**

Subject input data is named `subject_00<subject number>_*__x.csv`.

Subject output data is named `subject_00<subject number>_*__y.csv`.

The matching timestamp files end in `__x_time.csv` and `__y_time.csv`.
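
The naming scheme can be checked programmatically. The sketch below assumes file names of the form `subject_001_01__x.csv` (a three-digit subject number followed by a numeric session number); `parse_filename` and `FILENAME_RE` are hypothetical names, not part of the notebook.

```python
import re

# Assumed layout: subject_<3 digits>_<session digits>__<kind>.csv
FILENAME_RE = re.compile(r"^subject_(\d{3})_(\d+)__(x|x_time|y|y_time)\.csv$")

def parse_filename(name):
    """Return (subject, session, kind), or None if the name does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    subject, session, kind = match.groups()
    return int(subject), int(session), kind

print(parse_filename("subject_001_01__x.csv"))
print(parse_filename("subject_007_04__y_time.csv"))
print(parse_filename("final_without_8.csv"))  # not a subject file
```

A check like this can flag stray files in the directory before they are picked up by a glob pattern.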

In [None]:
TrainingData="/content/gdrive/MyDrive/ProjectC2Final/TrainingData"

**Import Google Drive, Authenticate Access**

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


**Import Libraries Needed for Data Preprocessing**

In [None]:
%tensorflow_version 1.x

from keras.utils import to_categorical
from numpy import array
from numpy import argmax
import os
import glob
import pandas as pd
os.chdir(TrainingData)
extension = 'csv'

**Combine all subjects into one pandas dataframe**

Go through each subject and each of their sessions, and append the data to one large dataframe.
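
As a minimal sketch of the pattern (toy data, no files), each session's frame can be tagged with its subject and session, collected in a list, and concatenated once at the end. The column values and the `tag_session` helper are illustrative assumptions.

```python
import pandas as pd

def tag_session(frame, subject, session):
    """Return a copy of `frame` labelled like the notebook's subject_name column."""
    tagged = frame.copy()
    tagged["subject_name"] = "subject_00{}__{}".format(subject, session)
    return tagged

# Toy stand-ins for two sessions of subject 1 and one session of subject 2.
sessions = {
    (1, 1): pd.DataFrame({"xa": [0.1, 0.2]}),
    (1, 2): pd.DataFrame({"xa": [0.3]}),
    (2, 1): pd.DataFrame({"xa": [0.4, 0.5]}),
}

# Visit sessions in (subject, session) order and concatenate once.
frames = [tag_session(df, subj, sess) for (subj, sess), df in sorted(sessions.items())]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

This sketch passes `ignore_index=True`, so the combined frame gets a fresh 0..n-1 index rather than repeating each session's own row numbers.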

In [None]:

x_total = [[]]  # x_total[j][part]: input frame for subject j, session part
y_total = [[]]  # y_total[j][part]: output frame for subject j, session part
temp = pd.DataFrame()

for j in range(1, 8):  # subjects 1 through 7
  # Sort each file list so the four lists line up session by session.
  x_files = sorted(glob.glob('subject_00{}*__x.{}'.format(j, extension)))
  x_time_files = sorted(glob.glob('subject_00{}*__x_time.{}'.format(j, extension)))
  y_files = sorted(glob.glob('subject_00{}*__y.{}'.format(j, extension)))
  y_time_files = sorted(glob.glob('subject_00{}*__y_time.{}'.format(j, extension)))

  # Placeholder at index 0 so that session numbers start at 1.
  x_total.append([pd.DataFrame([0, 0, 0])])
  y_total.append([pd.DataFrame([0, 0, 0])])

  for part, (i, p, k, l) in enumerate(zip(x_files, x_time_files, y_files, y_time_files), start=1):
    name = 'subject_00' + str(j) + '__' + str(part)

    # Inputs: six sensor columns plus the matching timestamps.
    x_df = pd.concat([pd.read_csv(i, header=None),
                      pd.read_csv(p, header=None).rename({0: "time"}, axis=1)], axis=1)
    x_df['subject_name'] = name
    x_df.to_csv('x_total' + str(j) + '__' + str(part) + '.csv', index=False, encoding='utf-8-sig')

    # Outputs: one label column plus the matching timestamps.
    y_df = pd.concat([pd.read_csv(k, header=None),
                      pd.read_csv(l, header=None).rename({0: "time"}, axis=1)], axis=1)
    y_df['subject_name'] = name
    y_df.to_csv('y_total' + str(j) + '__' + str(part) + '.csv', index=False, encoding='utf-8-sig')

    # Drop the first row of each session, then give the columns readable names.
    x_df = x_df.iloc[1:].rename(columns={0: "xa", 1: "ya", 2: "za", 3: "xg", 4: "yg", 5: "zg"})
    y_df = y_df.iloc[1:].rename(columns={0: "Label"})

    # Truncate the input timestamps to two decimals before merging on them.
    x_df['time'] = x_df['time'].apply(lambda t: int(t * 100) / 100)

    x_total[j].append(x_df)
    y_total[j].append(y_df)

    print("subject")
    print(j, part)

    # Left merge keeps every input row; rows without a matching output get NaN in Label.
    abc = x_df.merge(y_df, on=['time', 'subject_name'], how='left')
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    temp = pd.concat([temp, abc])

subject
1 1
subject
1 2
subject
1 3
subject
1 4
subject
1 5
subject
1 6
subject
1 7
subject
1 8
subject
2 1
subject
2 2
subject
2 3
subject
2 4
subject
2 5
subject
3 1
subject
3 2
subject
3 3
subject
4 1
subject
4 2
subject
5 1
subject
5 2
subject
5 3
subject
6 1
subject
6 2
subject
6 3
subject
7 1
subject
7 2
subject
7 3
subject
7 4
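
The input `time` column is truncated to two decimals before the merge, presumably so that input and output timestamps compare equal. The sketch below isolates that step; `truncate_2dp` is a hypothetical name for the notebook's lambda, and the raw sample values are an assumption (40 Hz timestamps at multiples of 0.025 s).

```python
def truncate_2dp(x):
    """Truncate (not round) a non-negative number to two decimals."""
    return int(x * 100) / 100

# Assumed raw timestamps at 40 Hz.
raw_times = [0.025, 0.05, 0.075, 0.1, 0.125]
print([truncate_2dp(t) for t in raw_times])

# Caveat: binary floating point can place a product just below a whole
# number (0.29 * 100 is slightly under 29), so truncation may land one
# hundredth lower than rounding would.
print(truncate_2dp(0.29), round(0.29, 2))
```

If the two timestamp files are not prepared the same way, a mismatch of one hundredth is enough for the left merge to leave a label unmatched, so it is worth checking how many labels survive the merge.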


**Interpolate the undersampled outputs**

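The input data is sampled at 40 Hz, while the output data is sampled at 10 Hz, so there are 4 times as many inputs as outputs. We want every input to have an output, so the missing outputs have to be filled in. The code below does this by forward-filling: each label is carried forward to the input rows that follow it, and any rows that are still empty are set to 0.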
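
A toy version of this step, using made-up timestamps and labels: a left merge leaves `NaN` wherever an input sample has no matching output, and a forward fill (which repeats the last known label rather than interpolating between labels) closes the gaps. Rows before the first label stay empty and are set to 0, mirroring the notebook's `fillna(0)`.

```python
import pandas as pd

# Hypothetical toy data: 8 input samples (40 Hz) and 2 labels (10 Hz).
inputs = pd.DataFrame({
    "time": [0.02, 0.05, 0.07, 0.10, 0.12, 0.15, 0.17, 0.20],
    "xa": [4.1, 4.5, 4.8, 4.5, 4.2, 4.0, 3.9, 3.8],
})
outputs = pd.DataFrame({"time": [0.05, 0.15], "Label": [2.0, 1.0]})

# Left merge keeps every input row; unmatched rows get NaN in Label.
merged = inputs.merge(outputs, on="time", how="left")

# Forward-fill at most 4 consecutive gaps, then default the rest to 0.
filled = merged.ffill(limit=4).fillna(0)
print(filled["Label"].tolist())
```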

In [None]:
temp

Unnamed: 0,xa,ya,za,xg,yg,zg,time,subject_name,Label
0,4.186920,8.344455,2.908057,0.005771,-0.004480,-0.003345,0.02,subject_001__1,
1,4.544637,8.408659,2.890000,0.007967,0.022412,0.001159,0.05,subject_001__1,
2,4.849308,8.411614,2.900692,0.027778,-0.010670,-0.014223,0.07,subject_001__1,
3,4.509190,8.118649,2.847298,0.021577,-0.045498,-0.021111,0.10,subject_001__1,
4,4.226515,8.273807,2.851742,0.012534,0.000445,-0.016830,0.12,subject_001__1,0.0
...,...,...,...,...,...,...,...,...,...
39433,1.266767,8.270302,5.139698,-0.002222,0.005120,-0.000951,985.85,subject_007__4,
39434,1.150554,8.234723,5.204723,-0.002844,-0.008267,-0.003333,985.87,subject_007__4,
39435,1.216095,8.261302,5.236952,-0.002018,0.003734,0.001111,985.90,subject_007__4,
39436,1.314432,8.252274,5.215568,-0.005769,0.007968,-0.000449,985.92,subject_007__4,0.0


In [None]:
# Carry each label forward over up to 4 following rows; rows still empty get 0.
temp = temp.ffill(limit=4).fillna(0)
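
Because the fill runs over the whole concatenated frame, unlabeled rows at the start of one session can inherit the last label of the previous session. If that is unwanted, one option (an assumption, not what the notebook does) is to fill within each `subject_name` group. A minimal sketch on toy data:

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy frame: two sessions, each starting with an unlabeled row.
df = pd.DataFrame({
    "subject_name": ["subject_001__1"] * 3 + ["subject_001__2"] * 3,
    "Label": [nan, 3.0, nan, nan, 1.0, nan],
})

# Whole-frame fill: row 3 (start of session 2) inherits 3.0 from session 1.
global_fill = df["Label"].ffill(limit=4).fillna(0)

# Per-session fill: labels never cross a session boundary.
grouped_fill = df.groupby("subject_name")["Label"].ffill(limit=4).fillna(0)

print(global_fill.tolist())
print(grouped_fill.tolist())
```

Whether the difference matters depends on how many rows precede the first label in each session.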

In [None]:
temp

Unnamed: 0,xa,ya,za,xg,yg,zg,time,subject_name,Label
0,4.186920,8.344455,2.908057,0.005771,-0.004480,-0.003345,0.02,subject_001__1,0.0
1,4.544637,8.408659,2.890000,0.007967,0.022412,0.001159,0.05,subject_001__1,0.0
2,4.849308,8.411614,2.900692,0.027778,-0.010670,-0.014223,0.07,subject_001__1,0.0
3,4.509190,8.118649,2.847298,0.021577,-0.045498,-0.021111,0.10,subject_001__1,0.0
4,4.226515,8.273807,2.851742,0.012534,0.000445,-0.016830,0.12,subject_001__1,0.0
...,...,...,...,...,...,...,...,...,...
39433,1.266767,8.270302,5.139698,-0.002222,0.005120,-0.000951,985.85,subject_007__4,0.0
39434,1.150554,8.234723,5.204723,-0.002844,-0.008267,-0.003333,985.87,subject_007__4,0.0
39435,1.216095,8.261302,5.236952,-0.002018,0.003734,0.001111,985.90,subject_007__4,0.0
39436,1.314432,8.252274,5.215568,-0.005769,0.007968,-0.000449,985.92,subject_007__4,0.0


**Split the interpolated dataframe into inputs and labels**

In [None]:
X = temp[['xa','ya','za','xg','yg','zg','time','subject_name']]
y = temp['Label']

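**SMOTE oversampling (not used, as it decreases the accuracy)**

The resampled `X_with_smote` and `Y_with_smote` are computed here but are not written to the final CSV, which is built from the original `X` and `y`.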

In [None]:
from imblearn.over_sampling import SMOTENC
sm = SMOTENC(categorical_features = [7],random_state=42)
X_with_smote,Y_with_smote = sm.fit_resample(X,y)

**Combine the input and output data into one CSV file**

In [None]:
from google.colab import files

In [None]:
X_final,Y_final = X,y

In [None]:
Y_final = pd.DataFrame(Y_final, columns = ["Label"])

In [None]:
X_final = pd.DataFrame(X_final, columns=['xa','ya','za','xg','yg','zg','time','subject_name'])

In [None]:
X_final['Label'] = Y_final['Label']

In [None]:
X_final['Label'].value_counts()

0.0    967332
3.0    201148
2.0     70356
1.0     54644
Name: Label, dtype: int64
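
The counts above are heavily skewed toward label 0, which is presumably the imbalance that SMOTE was meant to address. The arithmetic can be made explicit with a short sketch that hard-codes the printed counts:

```python
# Counts copied from the value_counts() output above.
counts = {0.0: 967332, 3.0: 201148, 2.0: 70356, 1.0: 54644}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each label's share of all samples as a percentage.
for label, share in shares.items():
    print("label {:.0f}: {:6.2%}".format(label, share))
```

Roughly three quarters of the samples carry label 0, so plain accuracy can look high even for a model that rarely predicts the other labels.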

In [None]:
X_final.to_csv("final_without_8.csv")
files.download("final_without_8.csv")
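
Note that `to_csv` writes the dataframe index as an unnamed first column unless `index=False` is passed; when the file is read back, pandas names that column `Unnamed: 0`. The in-memory round trip below illustrates this with toy data (the string buffer stands in for the CSV file):

```python
import io
import pandas as pd

toy = pd.DataFrame({"xa": [4.18, 4.54], "Label": [0.0, 0.0]})

# Default: the index is written as an extra, unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whichever form is used, the training code that loads the file should expect the same set of columns.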
