# Motiv Data Scientist/Algorithm Engineer Interview Assignment 

**Disclaimer**: This is loosely built around a [machine learning crash course](https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/programming-exercises) and [TensorFlow tutorial](https://www.tensorflow.org/get_started/get_started_for_beginners).

## Load and modify the data

First things first: import the modules we'll need.

In [None]:
import zipfile
import pandas as pd
from io import StringIO
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split

Let's look at what's inside our zip file. 

In [None]:
zip_folder = zipfile.ZipFile("Algorithm_DSdata.zip", "r")
zip_folder.namelist()

Let's now look at what the first file looks like.

In [None]:
first_file = zip_folder.namelist()[1]
zip_folder.read(first_file)[:100].decode("utf-8")

We're told that each row of the data contains the following fields:
1. UTC timestamp. The data is sampled at roughly 50 Hz. Feel free to assume all data was sampled uniformly at 50 Hz, but please be prepared to discuss the ramifications of this assumption.
2. Acceleration in g’s for the x-axis of an accelerometer. Values will be positive or negative.
3. Acceleration in g’s for the y-axis of an accelerometer. Values will be positive or negative.
4. Acceleration in g’s for the z-axis of an accelerometer. Values will be positive or negative.
5. Activity label. This element will either be blank or contain either “walk” or “run”. Any element with a blank label is considered to be neither walking nor running.
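
One way to probe the uniform-sampling assumption is to look at the gaps between consecutive timestamps. The sketch below is illustrative only: it uses synthetic timestamps rather than the assignment data, and the function name `sampling_summary` and the `tolerance` threshold are assumptions made for this example.

```python
import statistics

def sampling_summary(timestamps, nominal_hz=50.0, tolerance=0.5):
    """Summarise how far timestamps (in seconds) depart from a nominal rate."""
    # Gaps between consecutive samples, in seconds.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    nominal_gap = 1.0 / nominal_hz
    # A gap counts as irregular if it differs from the nominal gap by more than
    # `tolerance` times the nominal gap (an arbitrary threshold for illustration).
    irregular = [g for g in gaps if abs(g - nominal_gap) > tolerance * nominal_gap]
    return {
        "median_gap": statistics.median(gaps),
        "effective_hz": 1.0 / statistics.mean(gaps),
        "irregular_fraction": len(irregular) / len(gaps),
    }

# Synthetic timestamps: a 50 Hz grid with one dropped sample (index 100).
timestamps = [i * 0.02 for i in range(201) if i != 100]
summary = sampling_summary(timestamps)
print(summary)
```

If the irregular fraction is small, treating the signal as uniformly sampled is probably harmless. If it is large, any window of a fixed number of rows would cover a variable amount of time.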

Considering that, let's load an aggregated dataframe (the files aren't too large, so it shouldn't be a problem to keep that in memory). 

In [None]:
column_names = 'utc, x, y, z, label'.split(', ')
df_list = []

for file_name in zip_folder.namelist()[1:]:
    csv = StringIO(zip_folder.read(file_name).decode("utf-8"))
    data = pd.read_csv(csv, header=None, names=column_names, encoding="utf-8")    
    df_list.append(data)

zip_folder.close()

This is what one of our dataframes looks like. It's not surprising that the label contains a lot of `NaN`s...

In [None]:
df = df_list[0]
df.head()

In [None]:
labeled_list, unlabeled_list = [df.loc[~df.label.isna()] for df in df_list], [df.loc[df.label.isna()] for df in df_list]

We will need to convert the label to an integer to train our model, so we might as well do that here!

In [None]:
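# Map each activity name to an integer class id.
# Building new frames avoids modifying a slice of the original data in place.
label_to_int = {'Walk': 0, 'Run': 1}
labeled_list = [df.assign(label=df.label.replace(label_to_int)) for df in labeled_list]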

Whoops! `label` is a `float` because of the `NaN`s in that column. 
Now that we removed those, let's convert `label` to `int`s. Again, this is required by TensorFlow.
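
To illustrate why the missing labels have to go before the conversion, here is a plain-Python sketch (no pandas). It shows that `NaN` has no integer representation, and that the labels map cleanly to integers once blanks are dropped. The helper `encode_labels` is hypothetical and written only for this illustration. The mapping mirrors the one used in this notebook.

```python
import math

LABEL_TO_INT = {"Walk": 0, "Run": 1}  # same mapping as used in this notebook

def encode_labels(raw_labels):
    """Drop missing labels (NaN or blank), then map the remaining ones to integers."""
    encoded = []
    for label in raw_labels:
        is_nan = isinstance(label, float) and math.isnan(label)
        if is_nan or label == "":
            continue  # neither walking nor running: no integer class to assign
        encoded.append(LABEL_TO_INT[label])
    return encoded

# NaN is a float, and converting it to an integer raises ValueError.
try:
    int(float("nan"))
    nan_converts = True
except ValueError:
    nan_converts = False

print(nan_converts)
print(encode_labels(["Walk", float("nan"), "Run", "", "Walk"]))
```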

In [None]:
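# Cast the label column to int now that the NaNs are gone.
labeled_list = [df.assign(label=df.label.astype(int)) for df in labeled_list]
labeled_list[0].head()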

Great! Our label is now an `int`, so we can proceed.

We notice that some of our files contained only unlabeled data.

In [None]:
[len(df) for df in labeled_list]

In [None]:
[len(df) for df in unlabeled_list]

## Build the model

In this exercise, we'll try to predict whether a record corresponds to `Walk` or `Run`, which will be our label. We'll use `x`, `y` and `z` as our input features.

To train our model, we'll use the TensorFlow [Estimator](https://www.tensorflow.org/get_started/estimator) API.

### Step 1: Define features and configure feature columns

In [None]:
# Define the input features.
feature_columns = 'x, y, z'.split(', ')

# Feature columns describe how to use the input.
my_feature_columns = []
for column_name in feature_columns:
    my_feature_columns.append(tf.feature_column.numeric_column(key=column_name))

### Step 2: Define the Input Function

The input function instructs TensorFlow how to preprocess
the data, as well as how to batch, shuffle, and repeat it during model training.

First, we'll convert our *pandas* feature data into a dict of NumPy arrays. We can then use the TensorFlow [Dataset API](https://www.tensorflow.org/programmers_guide/datasets) to construct a dataset object from our data, and then break
our data into batches of `batch_size`, to be repeated for the specified number of epochs (num_epochs). 

**NOTE:** When the default value of `num_epochs=None` is passed to `repeat()`, the input data will be repeated indefinitely.

Next, if `shuffle` is set to `True`, we'll shuffle the data so that it's passed to the model randomly during training. The `buffer_size` argument specifies
the size of the dataset from which `shuffle` will randomly sample.

Finally, our input function constructs an iterator for the dataset and returns the next batch of data to our classifier.
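
As a loose analogy for what batching, repeating and shuffling do, here is a plain-Python generator. It is not the TensorFlow Dataset API: the function `batches` is hypothetical, and it shuffles whole epochs of individual examples rather than sampling from a fixed-size buffer.

```python
import itertools
import random

def batches(examples, batch_size, num_epochs=None, shuffle=False, seed=0):
    """Yield lists of up to `batch_size` examples, epoch after epoch."""
    rng = random.Random(seed)
    # num_epochs=None means "repeat forever".
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        order = list(examples)
        if shuffle:
            rng.shuffle(order)  # a fresh random order for every epoch
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]

# Five examples, batches of two, two epochs, no shuffling.
finite = list(batches(range(5), batch_size=2, num_epochs=2))
print(finite)

# With num_epochs=None the generator never ends, so take only the first four batches.
endless = list(itertools.islice(batches(range(5), batch_size=2), 4))
print(endless)
```

Note that the last batch of each epoch can be smaller than `batch_size` when the number of examples is not a multiple of it.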

In [None]:
def train_input_fn(input_tensor, batch_size, n_points=50, shuffle=False, num_epochs=None):
    """An input function for training.

    Args:
      input_tensor: pandas DataFrame of features and targets (the `label` column)
      batch_size: Size of batches to be passed to the model
      n_points: Number of consecutive rows to keep, starting from a random row
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
    # Select a random contiguous window of n_points rows (all rows if there are fewer).
    start = np.random.randint(0, max(len(input_tensor) - n_points, 0) + 1)
    window = input_tensor.iloc[start:start + n_points]

    # Split the window into a dict of feature arrays and an array of integer labels.
    features = {name: window[name].values for name in window.columns if name != 'label'}
    labels = window['label'].values.astype(int)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))

    # Batch the examples, and repeat them.
    dataset = dataset.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified. This wouldn't make sense here as we're using a time-dependent prediction.
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)

    # Return the next batch of data.
    return dataset.make_one_shot_iterator().get_next()

### Step 3: Build the Model

In [None]:
# Build 2 hidden layer DNN with 10, 10 units respectively.
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 10 nodes each.
    hidden_units=[10, 10],
    # The model must choose between 2 classes.
    n_classes=2)
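
To get a feel for the size of such a network, the sketch below counts the weights and biases of a plain fully connected network. The helper `dense_parameter_count` is hypothetical. The single output unit for the two-class case is an assumption about how the classifier is built, not something confirmed here.

```python
def dense_parameter_count(n_inputs, hidden_units, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    # Each layer has (inputs x outputs) weights plus one bias per output unit.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Three input features (x, y, z), hidden layers of 10 and 10 units.
# ASSUMPTION: one output unit for the two-class case.
print(dense_parameter_count(3, [10, 10], 1))
```

With only a few hundred parameters at most, the model is small compared with the number of rows sampled at 50 Hz.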

### Step 4: Train the Model

First we need to separate training and testing data.

In [None]:
print('Fraction of labeled data: {0:.1%}'.format(float(len(labeled_list[0])) / len(df_list[0])))

In [None]:
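# Hold out the last 20% of the labeled rows for testing. The unlabeled rows have
# no targets, so they cannot be used to evaluate the model.
labeled = labeled_list[0][feature_columns + ['label']]
train, test = train_test_split(labeled, test_size=0.2, shuffle=False)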

In [None]:
train.head()

In [None]:
dict(train.reset_index(drop=True))

In [None]:
#train, test = train_test_split(labeled, test_size=0.2)

In [None]:
train_x, train_y = train[feature_columns], train['label']
test_x, test_y = test[feature_columns], test['label']

In [None]:
# DNN parameters.
batch_size = 100
train_steps = 1000

# Train the Model.
classifier.train(
    input_fn=lambda:train_input_fn(train, batch_size), steps=train_steps)

### Step 5: Evaluate the Model

Let's evaluate the model on the test data, and then on the training data, to see how well it fit the data during training.

In [None]:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features=dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset

In [None]:
# Evaluate the model on the test data.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(test_x, test_y, batch_size))

In [None]:
# Evaluate the model on the training data.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(train_x, train_y, batch_size))
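
As a reminder of what an accuracy figure summarizes for a two-class problem, here is a plain-Python sketch on made-up labels. The helpers `confusion_counts` and `accuracy` are hypothetical and are not part of the Estimator API.

```python
from collections import Counter

def confusion_counts(targets, predictions):
    """Count (target, prediction) pairs for integer class labels."""
    return Counter(zip(targets, predictions))

def accuracy(targets, predictions):
    """Fraction of predictions that match their targets."""
    matches = sum(t == p for t, p in zip(targets, predictions))
    return matches / len(targets)

# Made-up labels for illustration only: 0 = Walk, 1 = Run.
toy_targets = [0, 0, 0, 1, 1, 1, 1, 0]
toy_predictions = [0, 0, 1, 1, 1, 0, 1, 0]

counts = confusion_counts(toy_targets, toy_predictions)
print("accuracy:", accuracy(toy_targets, toy_predictions))
print("walk predicted as run:", counts[(0, 1)])
print("run predicted as walk:", counts[(1, 0)])
```

Looking at the two kinds of mistakes separately can be more informative than accuracy alone when one activity is much more common than the other.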

In [None]:
# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: eval_input_fn(train_x, labels=None, batch_size=batch_size)

# Call predict() on the classifier to make predictions.
predictions = classifier.predict(input_fn=prediction_input_fn)

In [None]:
# Format the predicted class ids as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['class_ids'][0] for item in predictions])
targets = train_y.values.astype(int)

# Compute Mean Squared Error and Root Mean Squared Error.
# For 0/1 class ids, the MSE equals the fraction of misclassified examples.
mean_squared_error = np.mean((predictions - targets) ** 2)
root_mean_squared_error = np.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): {0:0.3f}".format(mean_squared_error))
print("Root Mean Squared Error (on training data): {0:0.3f}".format(root_mean_squared_error))

In [None]:
df.utc = pd.to_datetime(df.utc, unit='s')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
labeled_list[0].shape

## Is this a good model? How would you judge how large this error is?

Mean Squared Error (MSE) can be hard to interpret, so we often look at Root Mean Squared Error (RMSE)
instead.  A nice property of RMSE is that it can be interpreted on the same scale as the original targets.
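
As a small worked example of the two quantities, the sketch below computes MSE and RMSE in plain Python on made-up 0/1 values. The helper `toy_mse` is hypothetical. For 0/1 class labels the MSE coincides with the fraction of misclassified examples, which the last line checks.

```python
import math

def toy_mse(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

# Made-up values for illustration only.
toy_targets = [0, 0, 1, 1]
toy_predictions = [0, 1, 1, 1]

mse = toy_mse(toy_targets, toy_predictions)
rmse = math.sqrt(mse)
# Fraction of examples where the prediction differs from the target.
error_rate = sum(t != p for t, p in zip(toy_targets, toy_predictions)) / len(toy_targets)
print(mse, rmse, error_rate)
```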

Let's compare the RMSE to the difference of the min and max of our targets:

In [None]:
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

Our error spans nearly half the range of the target values. Can we do better?

This is the question that nags at every model developer. Let's develop some basic strategies to reduce model error.

The first thing we can do is take a look at how well our predictions match our targets, in terms of overall summary statistics.

In [None]:
calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

Okay, maybe this information is helpful. How does the mean value compare to the model's RMSE? How about the various quantiles?

We can also visualize the data and the line we've learned.  Recall that linear regression on a single feature can be drawn as a line mapping input *x* to output *y*.

First, we'll get a uniform random sample of the data so we can make a readable scatter plot.

In [None]:
sample = california_housing_dataframe.sample(n=300)

Next, we'll plot the line we've learned, drawing from the model's bias term and feature weight, together with the scatter plot. The line will show up red.

In [None]:
# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias 
y_1 = weight * x_1 + bias

# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display graph.
plt.show()

This initial line looks way off.  See if you can look back at the summary stats and see the same information encoded there.

Together, these initial sanity checks suggest we may be able to find a much better line.

In [None]:
# The unlabeled rows are the ones we want predictions for.
to_predict = unlabeled_list[0].drop(columns='label')

In [None]:
to_predict.head()

In [None]:
expected = ['Walk', 'Run']

In [None]:
predictions = classifier.predict(
    input_fn=lambda:eval_input_fn(to_predict[feature_columns], labels=None, batch_size=batch_size))


In [None]:
predictions


In [None]:
for pred_dict in predictions:
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    # `expected` holds the class names in class-id order: 0 = Walk, 1 = Run.
    print(class_id, 100 * probability, expected[class_id])