# Motiv Data Scientist/Algorithm Engineer Interview Assignment 

**Disclaimer**: This is loosely built around a [machine learning crash course](https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/programming-exercises) and [TensorFlow tutorial](https://www.tensorflow.org/get_started/get_started_for_beginners).

## Load and modify the data

First things first: import the modules we'll need.

In [None]:
import zipfile
import pandas as pd
from io import StringIO
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split

Let's look at what's inside our zip file. 

In [None]:
zip_folder = zipfile.ZipFile("Algorithm_DSdata.zip", "r")
zip_folder.namelist()

Let's now look at what the first file looks like.

In [None]:
first_file = zip_folder.namelist()[1]
zip_folder.read(first_file)[:100].decode("utf-8")

We're told that the data is as follows:
1. UTC. The data is sampled roughly at 50Hz. Feel free to assume all data was
uniformly at 50Hz, but please be prepared to discuss the ramification of this
assumption.
2. Acceleration in g’s for the x-axis of an accelerometer. Values will be positive or
negative.
3. Acceleration in g’s for the y-axis of an accelerometer. Values will be positive or
negative.
4. Acceleration in g’s for the z-axis of an accelerometer. Values will be positive or
negative.
5. Activity label. This element will either be blank or contain either “walk” or “run”. Any
element with a blank label is considered to be neither walking nor running.
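
The uniform-sampling assumption can be checked before relying on it. Below is a minimal sketch, assuming the `utc` column holds timestamps in seconds (the assignment doesn't state the unit) and using synthetic timestamps rather than the real files. `sampling_summary` is a hypothetical helper, not part of this notebook.

```python
import numpy as np

def sampling_summary(utc_seconds, nominal_hz=50.0):
    """Summarize how far timestamps deviate from a uniform nominal rate."""
    utc_seconds = np.asarray(utc_seconds, dtype=float)
    intervals = np.diff(utc_seconds)
    nominal_dt = 1.0 / nominal_hz
    return {
        # Effective rate implied by the typical interval.
        "median_hz": float(1.0 / np.median(intervals)),
        # Longest pause between two consecutive samples.
        "max_gap_s": float(intervals.max()),
        # Intervals longer than 1.5 nominal periods (likely dropped samples).
        "n_gaps": int(np.sum(intervals > 1.5 * nominal_dt)),
    }

# Synthetic timestamps (assumption): 50 Hz with one dropped sample.
timestamps = np.delete(np.arange(100) * 0.02, 10)
summary = sampling_summary(timestamps)
print(summary)
```

If gaps or jitter turn out to be significant, a window of 50 rows no longer spans exactly one second, which is one ramification of the assumption worth discussing.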

Considering that, let's load an aggregated dataframe (the files aren't too large, so it shouldn't be a problem to keep that in memory). 

In [None]:
column_names = 'utc, x, y, z, label'.split(', ')
df_list = []

for file_name in zip_folder.namelist()[1:]:
    csv = StringIO(zip_folder.read(file_name).decode("utf-8"))
    data = pd.read_csv(csv, header=None, names=column_names, encoding="utf-8")    
    df_list.append(data)

zip_folder.close()
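
The `[1:]` slice above presumably skips a directory entry at the start of the archive listing. A slightly more defensive variant is sketched below on a small in-memory archive: it skips directory entries explicitly and reads each member through a file object. The archive contents and file names here are made up for illustration, and the standard `csv` module stands in for `pd.read_csv`, which also accepts such a file object.

```python
import csv
import io
import zipfile

# Build a small in-memory archive that mimics a folder of CSV files (made-up data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/", "")  # directory entry
    archive.writestr("data/a.csv", "0.00,0.1,0.2,0.9,Walk\n0.02,0.1,0.3,1.0,\n")
    archive.writestr("data/b.csv", "0.00,0.5,0.1,1.2,Run\n")

rows_per_file = {}
with zipfile.ZipFile(buffer, "r") as archive:
    for info in archive.infolist():
        if info.is_dir():
            # Skip folders instead of relying on their position in the listing.
            continue
        with archive.open(info) as raw:
            reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8"))
            rows_per_file[info.filename] = list(reader)

print({name: len(rows) for name, rows in rows_per_file.items()})
```

Using `with` blocks also guarantees the archive is closed even if parsing one of the files fails.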

This is what one of our dataframes looks like. It's not surprising that the label contains a lot of `NaN`s...

In [None]:
df = df_list[0]
df.head()

## Build the model

In this exercise, we'll try to predict whether a record corresponds to `Walk` or `Run`, which will be our label. We'll use `x`, `y` and `z` as our input features.

To train our model, we'll use the TensorFlow [Estimator](https://www.tensorflow.org/get_started/estimator) API.

### Step 1: Define features and configure feature columns

In [None]:
# Define the input features.
feature_columns =  'x, y, z'.split(', ')

In [None]:
# Split each file into labeled and unlabeled rows.
# Note: dropping the unlabeled rows can leave the remaining rows non-contiguous in time.
selected_columns = feature_columns + ['label']
labeled_list = [df.loc[df.label.notna(), selected_columns] for df in df_list]
unlabeled_list = [df.loc[df.label.isna(), selected_columns] for df in df_list]

We will need to convert the label to an integer to train our model, so we might as well do that here!

In [None]:
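label_to_int = {'Walk': 0, 'Run': 1}
# Build new DataFrames instead of replacing in place on slices of the originals,
# which pandas may treat as copies. Labels outside the mapping would become NaN.
labeled_list = [
    labeled.assign(label=labeled.label.map(label_to_int))
    for labeled in labeled_list]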

In [None]:
labeled_list[0].head()

Great! Our label is now an `int`, so we can move on.

**Note**: Some of our files contained only unlabeled data!

In [None]:
[len(df) for df in labeled_list]

**Note**: Only 2 of our *csv* files contain both *Walk* and *Run* labels!

In [None]:
[(i, df.label.unique()) for (i, df) in enumerate(labeled_list)]

Let's define our training, validation and testing datasets.

In [None]:
train_ts = labeled_list[0:3] + labeled_list[5:7] + [labeled_list[8]]

Before choosing a held-out set, let's plot the labels of `labeled_list[3]` to see how they are distributed over time.

In [None]:
import matplotlib.pyplot as plt
plt.plot(labeled_list[3].label)
plt.show()

We had planned to split `labeled_list[3]` into validation and testing data, but it doesn't have enough variability for that, so we'll keep it whole as a single testing dataset.

In [None]:
test_ts = [labeled_list[3]]
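
For reference, this is roughly what a validation/test split of a single file could look like. Because the model consumes windows of consecutive rows, the sketch cuts the series into two contiguous blocks rather than shuffling rows. `split_contiguous` and the synthetic series are assumptions made for illustration; nothing here is run on the real data.

```python
import numpy as np

def split_contiguous(series, validation_fraction=0.5):
    """Split a time-ordered array into two contiguous blocks, without shuffling."""
    cut = int(len(series) * validation_fraction)
    return series[:cut], series[cut:]

# Hypothetical stand-in for one labeled file: 200 rows of (x, y, z, label),
# where the label switches from 0 to 1 exactly once.
rng = np.random.RandomState(0)
series = np.column_stack([rng.randn(200, 3), np.repeat([0, 1], 100)])

validation, test = split_contiguous(series)
print(np.unique(validation[:, 3]), np.unique(test[:, 3]))
```

In this synthetic stand-in each half ends up with a single class, which illustrates how a file whose labels rarely change can make such a split uninformative.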

### Step 2: Define the input function

The input function instructs TensorFlow how to preprocess
the data, as well as how to batch, shuffle, and repeat it during model training.

First, we'll convert each *pandas* DataFrame into a NumPy array. We can then use the TensorFlow [Dataset API](https://www.tensorflow.org/programmers_guide/datasets) to construct a dataset object from it, crop random windows of `NUM_POINTS` consecutive samples, and break the result into batches of `batch_size`.

**NOTE:** `repeat()` is called without a count, so the input data is repeated indefinitely; the `steps` argument of `train` and `evaluate` bounds how much of it is consumed.

Next, if `shuffle` is set to `True`, we'll shuffle the data so that it's passed to the model randomly during training. The `buffer_size` argument specifies
the size of the dataset from which `shuffle` will randomly sample.
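
To make the role of `buffer_size` concrete, here is a pure-Python sketch of a fixed-size shuffle buffer. It mimics the general idea (fill a buffer, emit a random element, refill from the input) and is not TensorFlow's actual implementation; `shuffle_with_buffer` is a hypothetical name.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

unshuffled = list(shuffle_with_buffer(range(10), buffer_size=1))
shuffled = list(shuffle_with_buffer(range(10), buffer_size=4))
print(unshuffled)
print(shuffled)
```

With `buffer_size=1` the order is unchanged, and with a small buffer each element can only move a limited distance from its original position. A buffer at least as large as the dataset is needed for a uniform shuffle.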

Finally, our input function returns the batched dataset, from which the classifier reads its batches.

In [None]:
NUM_FEATURES = 3
# The classifier should output its class predictions every 50 samples (roughly 1Hz).
NUM_POINTS = 50

def timeseries_to_tf_dataset(single_timeseries):
    assert(isinstance(single_timeseries, pd.DataFrame))
    # The columns are acceleration on the x, y and z axis, and the labels. 
    assert(len(single_timeseries.columns) == NUM_FEATURES+1)
    
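    def crop_and_slice(tensor):
        # Crop a random window of NUM_POINTS consecutive rows (all columns kept).
        tensor = tf.random_crop(tensor, [NUM_POINTS, NUM_FEATURES+1])
        # Flatten the x, y, z values of the window into a single feature vector.
        features = tf.reshape(tensor[:, :NUM_FEATURES], [NUM_FEATURES*NUM_POINTS])
        # Use the label of the last row as the target for the whole window.
        targets = tensor[-1, NUM_FEATURES]
        return features, targets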

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensors(np.array(single_timeseries)).repeat()
    return dataset.map(crop_and_slice)

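# Currently not working: `ds_samples` is a Python tuple, which cannot be indexed
# by a tensor. Kept for reference; `input_fn` uses `combine_datasets` instead.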
def sample_from_datasets(datasets, weights):
    def sample(*ds_samples):
        # An integer in [0, len(datasets))
        index = tf.random_uniform((), 0, len(datasets), dtype=tf.int32)
        return ds_samples[index]
    
    return tf.data.Dataset.zip(datasets).map(sample)

# Source: 
# https://stackoverflow.com/questions/49058913/interleaving-multiple-tensorflow-datasets-together
def combine_datasets(datasets):
    def concat(xs):
        xs0 = tf.data.Dataset.from_tensors(xs[0])
        for x in xs[1:]:
            xs0 = xs0.concatenate(tf.data.Dataset.from_tensors(x))
        return xs0
    
    return tf.data.Dataset.zip(datasets).flat_map(lambda *xs: concat(xs))
    
def input_fn(timeseries_list, batch_size, shuffle=False):
    """An input function for training.

    Args:
      timeseries_list: list of pandas DataFrames of features and targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
    Returns:
      Tensorflow dataset
    """
    datasets, weights = zip(*[(timeseries_to_tf_dataset(x), len(x)) for x in timeseries_list])
    # We're currently not using the weights.
    weights = [float(w)/sum(weights) for w in weights]
    dataset = combine_datasets(datasets)

    # Turn the dataset feature into a dictionary.
    def make_features_a_dictionary(feature, label):
        return {'combined_features': feature}, label
    dataset = dataset.map(make_features_a_dictionary)
    
    # Shuffle the data, if specified.
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)

    # Batch the examples, dropping the last batch if it is incomplete.
    dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))

    # Return the dataset.
    return dataset
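
The windowing done by `crop_and_slice` can be sketched in plain NumPy, together with one way the unused `weights` could be applied: picking each source series with probability proportional to its length. This is an illustrative alternative under stated assumptions, not what the pipeline above does (`combine_datasets` interleaves the series in turn and ignores the weights). `crop_window`, `sample_batch` and the synthetic series are hypothetical.

```python
import numpy as np

NUM_FEATURES = 3
NUM_POINTS = 50

def crop_window(series, rng):
    """Crop NUM_POINTS consecutive rows; return flattened features and the last label."""
    start = rng.randint(0, len(series) - NUM_POINTS + 1)  # upper bound is exclusive
    window = series[start:start + NUM_POINTS]
    features = window[:, :NUM_FEATURES].reshape(NUM_FEATURES * NUM_POINTS)
    target = window[-1, NUM_FEATURES]
    return features, target

def sample_batch(series_list, batch_size, rng):
    """Draw windows, choosing each source series proportionally to its length."""
    lengths = np.array([len(s) for s in series_list], dtype=float)
    weights = lengths / lengths.sum()
    picks = rng.choice(len(series_list), size=batch_size, p=weights)
    windows = [crop_window(series_list[i], rng) for i in picks]
    features, targets = zip(*windows)
    return np.stack(features), np.array(targets)

rng = np.random.RandomState(0)
# Synthetic data (assumption): one all-Walk (0) series and one all-Run (1) series.
walk = np.column_stack([rng.randn(300, 3), np.zeros(300)])
run = np.column_stack([rng.randn(100, 3), np.ones(100)])

features, targets = sample_batch([walk, run], batch_size=8, rng=rng)
print(features.shape, targets.shape)
```

Each example is a vector of 150 values (50 samples times 3 axes), which matches the shape of the `combined_features` column expected by the classifier.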

### Step 3: Build the model

In [None]:
# Build 2 hidden layer DNN.
my_feature_columns = [tf.feature_column.numeric_column(key='combined_features', shape=(NUM_FEATURES * NUM_POINTS,))]
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[50, 50],
    n_classes=2)

### Step 4: Train the Model

We already separated the training and testing data above, so we only need to set the training parameters and train the classifier.

In [None]:
# DNN parameters.
batch_size = 100
train_steps = 1000

# Train the Model.
classifier.train(
    input_fn=lambda:input_fn(train_ts, batch_size), steps=train_steps)

### Step 5: Evaluate the Model

In [None]:
classifier.evaluate(
    input_fn=lambda:input_fn(train_ts, batch_size), steps=1000)

In [None]:
classifier.evaluate(
    input_fn=lambda:input_fn(test_ts, batch_size), steps=1000)

## Conclusion

The trained model is nearly perfect on the training set, which contains both *Walk* and *Run* records, but it predicts *Walk* for every example in the test set. This looks like overfitting, yet the standard remedies we tried (reducing model complexity or training time) did not help. It's possible that the training and test sets differ qualitatively.
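
One way to probe that last hypothesis would be to compare simple statistics of the same class across the two sets, for example the acceleration magnitude of *Run* rows. The sketch below uses synthetic arrays in place of the real data, so the numbers it prints say nothing about the actual files; `magnitude_stats` is a hypothetical helper.

```python
import numpy as np

def magnitude_stats(xyz):
    """Mean and standard deviation of the acceleration magnitude, in g."""
    magnitude = np.linalg.norm(xyz, axis=1)
    return float(magnitude.mean()), float(magnitude.std())

rng = np.random.RandomState(0)
# Synthetic stand-ins (assumption) for the Run rows of the train and test sets.
train_run = rng.normal(0.0, 1.0, size=(1000, 3))
test_run = rng.normal(0.0, 0.4, size=(1000, 3))

train_mean, train_std = magnitude_stats(train_run)
test_mean, test_std = magnitude_stats(test_run)
print("train: mean=%.2f std=%.2f" % (train_mean, train_std))
print("test:  mean=%.2f std=%.2f" % (test_mean, test_std))
```

A large gap between the two sets for the same label would suggest that the test recording differs from the training recordings (different wearer, device placement or pace), rather than the model simply memorizing the training set.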