# Classifying actual/simulated acceleration data

In this document, we use a one-dimensional convolutional network to classify actual and simulated acceleration data, and compare it to other models (logistic regression and SVM).

### Loading/splitting data

We first load the actual and simulated datasets, print their lengths, and print the first few rows of each.

In [1]:
import json
import pandas as pd
import numpy as np

with open('data/data_good.json', 'rb') as f:
    raw1 = f.readlines()
df1 = pd.read_json(raw1[0])
with open('data/simulation.json', 'rb') as f:
    raw2 = f.readlines()
df2 = pd.read_json(raw2[0])
print('# rows of actual data:', len(df1))
print('# rows of simulated data:', len(df2))
print('')
print('first 5 lines of actual data:')
print(df1.head())
print('')
print('first 5 lines of simulated data:')
print(df2.head())

# rows of actual data: 1600
# rows of simulated data: 1600

first 5 lines of actual data:
      x     y     z
0 -2.32 -0.86  9.13
1 -2.71 -1.37  8.75
2 -3.03 -1.70  8.37
3 -3.14 -1.92  8.14
4 -2.96 -1.99  8.47

first 5 lines of simulated data:
          x         y         z
0 -3.850784 -0.085201 -1.490804
1  4.038864 -1.818646 -2.300385
2 -1.056220 -0.847621  0.738123
3  4.368797 -0.950352  0.329629
4  4.805529 -1.561125  0.697635


We then concatenate the two datasets, cast the concatenated acceleration data to a NumPy array for easier processing, create a label for each row (0 for actual, 1 for simulated), and split the data and labels into training and test sets with a test ratio of 0.3.

In [2]:
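# Stack actual rows on top of simulated rows: shape (len(df1) + len(df2), 3)
X = np.array(pd.concat([df1, df2]))
# Label 0 = actual data, label 1 = simulated data
y1 = np.zeros(len(df1.index))
y2 = np.ones(len(df2.index))
y = np.concatenate([y1, y2])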

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
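
As a quick sanity check on the split sizes, here is a minimal sketch that reproduces the same steps on placeholder data. The random array `X_demo` and the names `X_tr`, `X_te`, `y_tr`, `y_te` are stand-ins, not the notebook's data, and the `stratify` and `random_state` arguments are additions that the cell above does not use; they keep the class balance identical in both subsets and make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1600 "actual" + 1600 "simulated" rows of (x, y, z)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3200, 3))
y_demo = np.concatenate([np.zeros(1600), np.ones(1600)])  # 0 = actual, 1 = simulated

# stratify keeps the 50/50 class ratio in both subsets; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)  # (2240, 3) (960, 3)
```

With 3200 rows and a test ratio of 0.3, this gives 2240 training rows and 960 test rows.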

### One-dimensional convolutional network

Let's first construct the 1d conv net. 

In [3]:
import tensorflow as tf
from tensorflow import keras

cnn = keras.Sequential([
    keras.layers.Conv1D(filters=50, kernel_size=2, input_shape=(None, 3), padding='same'), 
    keras.layers.Dense(2, activation='sigmoid')
])
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In Keras, neural network models are constructed layer by layer. The first layer in our model is a `Conv1D` layer, which takes 3D tensors as input, so we need to reshape our input data. Say that our input data has shape $m \times 3$ ($m$ rows of 3-dimensional acceleration data). The general idea is to run the convolution over every few rows of the original dataset by matching them against filters of size $n \times 3$, where $n$ is the number of rows considered together each time (the `kernel_size`, here 2). To obtain a 3D input, we reshape the data to shape $m \times 1 \times 3$. Note that with this reshape each sample is a sequence of length 1, so each filter only ever sees a single real row (plus the zero padding added by `padding='same'`), and the model effectively classifies rows one at a time.
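
To make the "filters of size $n \times 3$" idea concrete, the following is a rough NumPy sketch of what a one-dimensional convolution computes on a single sequence. It is an illustration only, not the Keras implementation: the helper `conv1d_valid` is hypothetical, it uses no padding (unlike the `padding='same'` setting above), and the weights are all ones so the result is easy to check by hand.

```python
import numpy as np

def conv1d_valid(x, w, b):
    """Slide filters over a sequence without padding.

    x: (steps, channels) input sequence
    w: (n, channels, filters) filter weights, n rows per window
    b: (filters,) bias
    returns: (steps - n + 1, filters)
    """
    n = w.shape[0]
    steps_out = x.shape[0] - n + 1
    out = np.empty((steps_out, w.shape[2]))
    for t in range(steps_out):
        window = x[t:t + n]  # n consecutive rows, shape (n, channels)
        # Multiply the window with every filter and sum over rows and channels
        out[t] = np.tensordot(window, w, axes=([0, 1], [0, 1])) + b
    return out

x = np.arange(12, dtype=float).reshape(4, 3)  # 4 rows of (x, y, z)
w = np.ones((2, 3, 5))                        # n = 2 rows per window, 5 filters
b = np.zeros(5)

out = conv1d_valid(x, w, b)
print(out.shape)  # (3, 5)
print(out[:, 0])  # [15. 33. 51.]
```

Each output row combines $n = 2$ consecutive input rows, which is the behavior the paragraph above describes; it only occurs when a sample contains more than one row.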

Note that the output of our model has shape $m \times 1 \times 2$, where each row gives the probabilities that the sample belongs to each of the 2 categories, so we also need to convert and reshape our labels. Our labels have shape $m \times 1$, so we first convert them to one-hot vectors of shape $m \times 2$ and then reshape them to $m \times 1 \times 2$.

In [4]:
X_train_cnn = np.array(X_train)
X_test_cnn = np.array(X_test)

# Use each label (0 = actual, 1 = simulated) as the index of its one-hot column
idx_train = y_train.astype(int)
idx_test = y_test.astype(int)
y_train_cnn = np.zeros((len(y_train), 2))
y_train_cnn[np.arange(len(y_train)), idx_train] = 1
y_test_cnn = np.zeros((len(y_test), 2))
y_test_cnn[np.arange(len(y_test)), idx_test] = 1

X_train_cnn = X_train_cnn.reshape(X_train_cnn.shape[0], 1, X_train_cnn.shape[1])
X_test_cnn = X_test_cnn.reshape(X_test_cnn.shape[0], 1, X_test_cnn.shape[1])
y_train_cnn = y_train_cnn.reshape(y_train_cnn.shape[0], 1, y_train_cnn.shape[1])
y_test_cnn = y_test_cnn.reshape(y_test_cnn.shape[0], 1, y_test_cnn.shape[1])
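
The label conversion can also be written more compactly. The sketch below uses a small made-up label vector (`labels` is a placeholder, not the notebook's `y_train`) and indexes into an identity matrix, so that column $k$ of each one-hot row is 1 for label $k$.

```python
import numpy as np

labels = np.array([0., 1., 1., 0.])      # placeholder labels: 0 = actual, 1 = simulated

# Row k of the 2x2 identity matrix is the one-hot vector for label k
one_hot = np.eye(2)[labels.astype(int)]  # shape (4, 2)

# Insert a length-1 sequence axis so the labels match the model output (m, 1, 2)
one_hot_3d = one_hot[:, np.newaxis, :]   # shape (4, 1, 2)

print(one_hot_3d.shape)  # (4, 1, 2)
```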

Let's see how the 1d conv net performs on this data. 

In [5]:
cnn.fit(X_train_cnn, y_train_cnn, validation_split=0.3, epochs=10)
loss, acc = cnn.evaluate(X_test_cnn, y_test_cnn)
print('loss:', loss)
print('accuracy:', acc)

Train on 1568 samples, validate on 672 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
loss: 0.03783066018174092
accuracy: 1.0


The training and validation accuracies quickly rise to 1, and the test accuracy is 1 as well.
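
To turn model outputs of shape $m \times 1 \times 2$ into class decisions, one option is to take the column with the larger score. The sketch below uses made-up arrays (`probs` and `targets` are hypothetical, not outputs of the model above) and compares predicted and true columns, so it does not depend on which column stands for which class. This argmax-based accuracy is an assumption of the sketch and may differ from the accuracy value Keras reports, which depends on how the metric is resolved for the chosen loss.

```python
import numpy as np

# Hypothetical model outputs and one-hot targets, both with shape (m, 1, 2)
probs = np.array([[[0.9, 0.2]], [[0.1, 0.8]], [[0.6, 0.3]]])
targets = np.array([[[1., 0.]], [[0., 1.]], [[0., 1.]]])

# Drop the length-1 sequence axis, then pick the highest-scoring column per sample
pred_cols = probs.squeeze(axis=1).argmax(axis=1)
true_cols = targets.squeeze(axis=1).argmax(axis=1)

accuracy = (pred_cols == true_cols).mean()
print(pred_cols, true_cols)
```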

### Comparison with logistic regression/svm

We then fit logistic regression and an SVM on the same training and test sets for comparison. Besides accuracy, the per-class F-score is also printed to check the behavior of these models.

In [6]:
from sklearn.metrics import accuracy_score, f1_score

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, lr.predict(X_test)))
print('f-score for two categories:', f1_score(y_test, lr.predict(X_test), average=None))

from sklearn.svm import SVC
sv = SVC().fit(X_train, y_train)
print('accuracy:', accuracy_score(y_test, sv.predict(X_test)))
print('f-score for two categories:', f1_score(y_test, sv.predict(X_test), average=None))

accuracy: 1.0
f-score for two categories: [1. 1.]
accuracy: 1.0
f-score for two categories: [1. 1.]




Both models also reach an accuracy and per-class F-score of 1, so the data can be easily separated by simple models.
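
For reference, the per-class F-score printed above can be reproduced by hand from true positives, false positives, and false negatives. The sketch below uses small made-up label vectors (not the notebook's predictions), and `f1_per_class` is a hypothetical helper written for illustration.

```python
import numpy as np

def f1_per_class(y_true, y_pred, classes=(0, 1)):
    """Return the F1 score of each class: 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly predicted as c
        fp = np.sum((y_pred == c) & (y_true != c))  # wrongly predicted as c
        fn = np.sum((y_pred != c) & (y_true == c))  # missed samples of class c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return np.array(scores)

y_true_demo = np.array([0, 0, 1, 1, 1])
y_pred_demo = np.array([0, 1, 1, 1, 0])

scores = f1_per_class(y_true_demo, y_pred_demo)
print(scores)  # approximately [0.5, 0.667]
```

A score of 1 for both classes, as in the output above, means there are no false positives or false negatives for either class.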

### Conclusion 

The fact that our data can be separated by simple models doesn't necessarily mean that our 1D conv net is useless. We do, however, need to run it on a different set of data to better measure its performance.