# Motiv Data Scientist/Algorithm Engineer Interview Assignment 

**Disclaimer**: This is loosely built around a [machine learning crash course](https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/programming-exercises) and [TensorFlow tutorial](https://www.tensorflow.org/get_started/get_started_for_beginners).

## Load and modify the data

First things first, let's import the modules we'll need.

In [1]:
import zipfile
import pandas as pd
from io import StringIO
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split

Let's look at what's inside our zip file. 

In [2]:
zip_folder = zipfile.ZipFile("Algorithm_DSdata.zip", "r")
zip_folder.namelist()

['HW Data/',
 'HW Data/hw_file_1.csv',
 'HW Data/hw_file_10.csv',
 'HW Data/hw_file_2.csv',
 'HW Data/hw_file_3.csv',
 'HW Data/hw_file_4.csv',
 'HW Data/hw_file_5.csv',
 'HW Data/hw_file_6.csv',
 'HW Data/hw_file_7.csv',
 'HW Data/hw_file_8.csv',
 'HW Data/hw_file_9.csv']
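One detail worth noticing: the listing above is in string order, so `hw_file_10.csv` comes before `hw_file_2.csv`. If the file order ever matters (for example, if the files were consecutive recordings, which is only an assumption here), a natural sort is one option. Below is a minimal sketch; `natural_key` is a hypothetical helper, not part of the assignment.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", name)]


# A few names copied from the archive listing.
names = [
    "HW Data/hw_file_1.csv",
    "HW Data/hw_file_10.csv",
    "HW Data/hw_file_2.csv",
    "HW Data/hw_file_3.csv",
]

ordered = sorted(names, key=natural_key)
print(ordered)
```

With this key, the numeric part of each name is compared as an integer rather than character by character.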

Let's now look at what the first file looks like.

In [3]:
first_file = zip_folder.namelist()[1]
zip_folder.read(first_file)[:100].decode("utf-8")

'1446833694,0.59375,-0.76562,-0.26562,\n1446833694.0183,0.59375,-0.76562,-0.26562,\n1446833694.0367,0.5'

We're told that each row contains the following fields:
1. UTC timestamp. The data is sampled at roughly 50 Hz. Feel free to assume all data was sampled uniformly at 50 Hz, but please be prepared to discuss the ramifications of this assumption.
2. Acceleration in g's for the x-axis of an accelerometer. Values will be positive or negative.
3. Acceleration in g's for the y-axis of an accelerometer. Values will be positive or negative.
4. Acceleration in g's for the z-axis of an accelerometer. Values will be positive or negative.
5. Activity label. This element will either be blank or contain either "walk" or "run". Any element with a blank label is considered to be neither walking nor running.
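To get a feel for the 50 Hz assumption, we can look at the spacing between consecutive timestamps. The sketch below uses only the three timestamps visible in the file preview, so it is an illustration rather than a statement about the whole dataset; `sampling_summary` is a hypothetical helper.

```python
from statistics import median


def sampling_summary(timestamps):
    """Return (median interval in seconds, implied rate in Hz, max relative jitter)."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    typical = median(intervals)
    jitter = max(abs(dt - typical) for dt in intervals) / typical
    return typical, 1.0 / typical, jitter


# First three timestamps from the preview of the first file.
utc = [1446833694.0, 1446833694.0183, 1446833694.0367]

interval, rate_hz, jitter = sampling_summary(utc)
print("median interval: {:.4f} s".format(interval))
print("implied rate:    {:.1f} Hz".format(rate_hz))
print("max jitter:      {:.2%}".format(jitter))
```

In these three samples the spacing is about 18.3 ms, which implies a rate a little above 50 Hz. Running the same check over every file would show how far the uniform-sampling assumption actually holds.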

Considering that, let's load an aggregated dataframe (the files aren't too large, so it shouldn't be a problem to keep that in memory). 

In [4]:
column_names = ['utc', 'x', 'y', 'z', 'label']

# Read every CSV in the archive (skipping the folder entry) and concatenate once.
frames = []
for file_name in zip_folder.namelist()[1:]:
    csv = StringIO(zip_folder.read(file_name).decode("utf-8"))
    frames.append(pd.read_csv(csv, header=None, names=column_names))
df = pd.concat(frames, axis=0)

zip_folder.close()

This is what `df` looks like. It's not surprising that the label contains a lot of `NaN`s...

In [5]:
df.head()

Unnamed: 0,utc,x,y,z,label
0,1446834000.0,0.59375,-0.76562,-0.26562,
1,1446834000.0,0.59375,-0.76562,-0.26562,
2,1446834000.0,0.59375,-0.76562,-0.26562,
3,1446834000.0,0.60938,-0.76562,-0.26562,
4,1446834000.0,0.60938,-0.76562,-0.26562,


In [6]:
df.label.unique()

array([nan, 'Walk', 'Run'], dtype=object)

We will need to convert the label to an integer to train our model, so we might as well do that here!

In [7]:
df['label'] = df['label'].replace({'Walk': 0, 'Run': 1})

In [8]:
labelled, to_predict = df.loc[~df.label.isna()], df.loc[df.label.isna()]

In [9]:
labelled.head()

Unnamed: 0,utc,x,y,z,label
106973,1446836000.0,-0.28125,-0.70312,-0.79688,0.0
106974,1446836000.0,-0.4375,-0.67188,-0.90625,0.0
106975,1446836000.0,-0.35938,-0.60938,-1.0625,0.0
106976,1446836000.0,-0.14062,-0.625,-1.1562,0.0
106977,1446836000.0,0.015625,-0.5625,-1.0312,0.0


Whoops! `label` is a `float` because of the `NaN`s in that column. 
Now that we removed those, let's convert `label` to `int`s. Again, this is required by TensorFlow.

In [10]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
labelled = labelled.assign(label=labelled['label'].astype(int))

In [11]:
labelled.head()

Unnamed: 0,utc,x,y,z,label
106973,1446836000.0,-0.28125,-0.70312,-0.79688,0
106974,1446836000.0,-0.4375,-0.67188,-0.90625,0
106975,1446836000.0,-0.35938,-0.60938,-1.0625,0
106976,1446836000.0,-0.14062,-0.625,-1.1562,0
106977,1446836000.0,0.015625,-0.5625,-1.0312,0


Great! Our label is now an `int`, so we can move on.

## Build the model

In this exercise, we'll try to predict whether a record corresponds to `Walk` or `Run`, which will be our label. We'll use `x`, `y` and `z` as our input features.

To train our model, we'll use the TensorFlow [Estimator](https://www.tensorflow.org/get_started/estimator) API.

### Step 1: Define features and configure feature columns

In [13]:
# Define the input features.
feature_columns = 'x, y, z'.split(', ')

# Feature columns describe how to use the input.
my_feature_columns = []
for column_name in feature_columns:
    my_feature_columns.append(tf.feature_column.numeric_column(key=column_name))

### Step 2: Define the Input Function

The input function instructs TensorFlow how to preprocess
the data, as well as how to batch, shuffle, and repeat it during model training.

First, we'll convert our *pandas* feature data into a dict of NumPy arrays. We can then use the TensorFlow [Dataset API](https://www.tensorflow.org/programmers_guide/datasets) to construct a dataset object from our data, and then break
our data into batches of `batch_size`, to be repeated for the specified number of epochs (num_epochs). 

**NOTE:** When the default value of `num_epochs=None` is passed to `repeat()`, the input data will be repeated indefinitely.

Next, if `shuffle` is set to `True`, we'll shuffle the data so that it's passed to the model randomly during training. The `buffer_size` argument specifies
the size of the dataset from which `shuffle` will randomly sample.

Finally, our input function constructs an iterator for the dataset and returns the next batch of data to our classifier.
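The shuffle, repeat and batch steps described above can be sketched in plain Python. This is only an approximation of what the Dataset API does, written to make the semantics concrete; the names `shuffle_buffer`, `batched` and `input_pipeline` are hypothetical and not TensorFlow functions.

```python
import itertools
import random


def shuffle_buffer(items, buffer_size, rng):
    """Yield items in pseudo-random order using a fixed-size buffer."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Drain whatever is left at the end of the pass.
    rng.shuffle(buffer)
    yield from buffer


def batched(examples, batch_size):
    """Group a stream of examples into lists of at most batch_size items."""
    iterator = iter(examples)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def input_pipeline(data, batch_size, shuffle=True, num_epochs=None,
                   buffer_size=10000, seed=0):
    """Shuffle, then repeat, then batch, in the order the text describes."""
    rng = random.Random(seed)
    # num_epochs=None means the data is repeated indefinitely.
    epochs = itertools.count() if num_epochs is None else range(num_epochs)

    def examples():
        for _ in epochs:
            if shuffle:
                yield from shuffle_buffer(data, buffer_size, rng)
            else:
                yield from data

    return batched(examples(), batch_size)


toy = list(range(10))
one_epoch = list(input_pipeline(toy, batch_size=4, shuffle=False, num_epochs=1))
print(one_epoch)

# With num_epochs=None the stream never ends, so we take only a few batches.
first_batches = list(itertools.islice(input_pipeline(toy, batch_size=4), 5))
print(first_batches)
```

Note that a `buffer_size` smaller than the dataset only shuffles locally: an element can never be emitted before elements that are more than `buffer_size` positions ahead of it have been read.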

In [29]:
def train_input_fn(features, labels, batch_size, shuffle=True, num_epochs=None):
    """An input function for training.

    Args:
      features: pandas DataFrame of features
      labels: pandas Series of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

    # Shuffle the data, if specified.
    if shuffle:
        dataset = dataset.shuffle(buffer_size=10000)

    # Repeat for num_epochs (None repeats indefinitely) and batch the examples.
    dataset = dataset.repeat(count=num_epochs).batch(batch_size)

    # Return the next batch of data (TensorFlow 1.x iterator API).
    features, labels = dataset.make_one_shot_iterator().get_next()
    return features, labels

### Step 3: Build the Model

In [43]:
# Build a DNN with two hidden layers of 40 units each.
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 40 nodes each.
    hidden_units=[40, 40],
    # The model must choose between 2 classes (Walk and Run).
    n_classes=2)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\bsyou\\AppData\\Local\\Temp\\tmp70gpc2ne', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000019441BDFFD0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
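To get a sense of how small this network is, we can count its trainable weights. The sketch below assumes plain fully connected layers with biases and a single output logit for the two-class case; that last point is an assumption about the canned classifier, not something shown in the log above. `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(n_inputs, hidden_units, n_outputs):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


# Three input features (x, y, z), two hidden layers of 40 units,
# and one output logit (assumed for binary classification).
n_params = dense_parameter_count(3, [40, 40], 1)
print(n_params)
```

Under these assumptions the model has under two thousand parameters, which is tiny compared to the number of labelled rows.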


### Step 4: Train the Model

First we need to separate training and testing data.

In [16]:
print('Fraction of labelled data: {0:.1%}'.format(len(labelled) / len(df)))

Fraction of labelled data: 45.7%


In [17]:
train, test = train_test_split(labelled, test_size=0.2)

In [18]:
train_x, train_y = train[feature_columns], train['label']
test_x, test_y = test[feature_columns], test['label']

In [44]:
# DNN parameters.
batch_size = 100
train_steps = 1000

# Train the Model.
classifier.train(
    input_fn=lambda:train_input_fn(train_x, train_y, batch_size), steps=train_steps)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\bsyou\AppData\Local\Temp\tmp70gpc2ne\model.ckpt.
INFO:tensorflow:loss = 77.65143, step = 0
INFO:tensorflow:global_step/sec: 179.026
INFO:tensorflow:loss = 4.809237, step = 100 (0.559 sec)
INFO:tensorflow:global_step/sec: 394.753
INFO:tensorflow:loss = 15.169634, step = 200 (0.253 sec)
INFO:tensorflow:global_step/sec: 399.473
INFO:tensorflow:loss = 3.6578286, step = 300 (0.250 sec)
INFO:tensorflow:global_step/sec: 419.992
INFO:tensorflow:loss = 4.665278, step = 400 (0.238 sec)
INFO:tensorflow:global_step/sec: 421.292
INFO:tensorflow:loss = 15.00949, step = 500 (0.237 sec)
INFO:tensorflow:global_step/sec: 426.169
INFO:tensorflow:loss = 6.0020413, step = 600 (0.235 sec)
INFO:tensorflow:global_step/sec:

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x19441bdf4a8>

### Step 5: Evaluate the Model

Let's evaluate the model on the held-out test data first, and then on the training data, to see how well it fit during training and how well it generalizes.

In [36]:
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation or prediction"""
    features=dict(features)
    if labels is None:
        # No labels, use only features.
        inputs = features
    else:
        inputs = (features, labels)

    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices(inputs)

    # Batch the examples
    assert batch_size is not None, "batch_size must not be None"
    dataset = dataset.batch(batch_size)

    # Return the dataset.
    return dataset

In [45]:
# Evaluate the model on the test data.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(test_x, test_y, batch_size))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-30-05:35:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\bsyou\AppData\Local\Temp\tmp70gpc2ne\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-30-05:35:49
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.9748281, accuracy_baseline = 0.96066004, auc = 0.93429714, auc_precision_recall = 0.6543493, average_loss = 0.08486344, global_step = 1000, label/mean = 0.039339956, loss = 8.4841, precision = 0.88860106, prediction/mean = 0.032180455, recall = 0.4117647
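The accuracy looks high, but the log also reports `label/mean`, the fraction of rows labelled 1 (`Run`), which is very small here. The sketch below uses the numbers copied from the log above to compare the accuracy against the majority-class baseline and to combine precision and recall into an F1 score. The variable names are hypothetical.

```python
# Values copied from the evaluation log on the test set.
accuracy = 0.9748281
label_mean = 0.039339956   # fraction of rows with label 1 (Run)
precision = 0.88860106
recall = 0.4117647

# Accuracy of a model that always predicts the most common class.
majority_baseline = max(label_mean, 1.0 - label_mean)

# How much the trained model improves on that trivial baseline.
lift = accuracy - majority_baseline

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("majority baseline: {:.4f}".format(majority_baseline))
print("lift over baseline: {:.4f}".format(lift))
print("F1 score: {:.4f}".format(f1))
```

The baseline computed this way matches the `accuracy_baseline` value in the log, so the model gains only about 1.4 percentage points over always predicting `Walk`, and the low recall shows that it misses more than half of the `Run` rows.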


In [46]:
# Evaluate the model on the training data.
eval_result = classifier.evaluate(
    input_fn=lambda:eval_input_fn(train_x, train_y, batch_size))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-30-05:35:57
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\bsyou\AppData\Local\Temp\tmp70gpc2ne\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-30-05:36:14
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.97402024, accuracy_baseline = 0.96027476, auc = 0.93045187, auc_precision_recall = 0.640741, average_loss = 0.087585606, global_step = 1000, label/mean = 0.039725233, loss = 8.758229, precision = 0.87549996, prediction/mean = 0.0324, recall = 0.40337634
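A quick way to compare the two evaluation runs is to look at the gap between training and test metrics. The sketch below copies a few values from the two logs above; the dictionaries are hypothetical and are not the `eval_result` objects themselves.

```python
# Metrics copied from the test-set and training-set evaluation logs.
test_metrics = {
    "accuracy": 0.9748281,
    "auc": 0.93429714,
    "precision": 0.88860106,
    "recall": 0.4117647,
}
train_metrics = {
    "accuracy": 0.97402024,
    "auc": 0.93045187,
    "precision": 0.87549996,
    "recall": 0.40337634,
}

# Positive gap: the model does better on training data than on test data.
gaps = {name: train_metrics[name] - test_metrics[name] for name in test_metrics}

for name in sorted(gaps):
    print("{:<10} train - test = {:+.4f}".format(name, gaps[name]))
```

All four gaps are small, and none is positive, so these numbers give no sign that the model memorized the training set.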


In [24]:
# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't 
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: eval_input_fn(test_x, labels=None, batch_size=batch_size)

# Call predict() on the classifier to make predictions.
predictions = classifier.predict(input_fn=prediction_input_fn)

In [28]:
# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
# print "Mean Squared Error (on training data): {0:%0.3f}".format(mean_squared_error)
# print "Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error

NameError: name 'my_feature' is not defined

In [54]:
df.utc = pd.to_datetime(df.utc, unit='s')

In [55]:
df.head()

Unnamed: 0,utc,x,y,z,label
0,2015-11-06 18:14:54.000000,0.59375,-0.76562,-0.26562,
1,2015-11-06 18:14:54.018300,0.59375,-0.76562,-0.26562,
2,2015-11-06 18:14:54.036700,0.59375,-0.76562,-0.26562,
3,2015-11-06 18:14:54.055000,0.60938,-0.76562,-0.26562,
4,2015-11-06 18:14:54.073400,0.60938,-0.76562,-0.26562,


In [56]:
df.shape

(1157700, 5)

In [57]:
labelled.shape

(529356, 5)

## Is this a good model? How would you judge how large this error is?

Mean Squared Error (MSE) can be hard to interpret, so we often look at Root Mean Squared Error (RMSE)
instead.  A nice property of RMSE is that it can be interpreted on the same scale as the original targets.
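As a small illustration of how MSE and RMSE relate, here is a sketch with made-up numbers. The values and the helper name `mse_of` are hypothetical and have nothing to do with the accelerometer data.

```python
import math


def mse_of(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Made-up targets and predictions, in the same (arbitrary) unit.
toy_targets = [100.0, 200.0, 300.0]
toy_predictions = [110.0, 190.0, 330.0]

mse = mse_of(toy_predictions, toy_targets)
rmse = math.sqrt(mse)

print("MSE:  {:.2f} (squared units)".format(mse))
print("RMSE: {:.2f} (same units as the targets)".format(rmse))
```

The individual errors are 10, 10 and 30, and the RMSE lands between them, which is why it is easier to read than the MSE.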

Let's compare the RMSE to the difference of the min and max of our targets:

In [0]:
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)

Our error spans nearly half the range of the target values. Can we do better?

This is the question that nags at every model developer. Let's develop some basic strategies to reduce model error.

The first thing we can do is take a look at how well our predictions match our targets, in terms of overall summary statistics.

In [0]:
calibration_data = pd.DataFrame()
calibration_data["predictions"] = pd.Series(predictions)
calibration_data["targets"] = pd.Series(targets)
calibration_data.describe()

Okay, maybe this information is helpful. How does the mean value compare to the model's RMSE? How about the various quantiles?

We can also visualize the data and the line we've learned.  Recall that linear regression on a single feature can be drawn as a line mapping input *x* to output *y*.

First, we'll get a uniform random sample of the data so we can make a readable scatter plot.

In [0]:
sample = california_housing_dataframe.sample(n=300)

Next, we'll plot the line we've learned, drawing from the model's bias term and feature weight, together with the scatter plot. The line will show up red.

In [0]:
# Get the min and max total_rooms values.
x_0 = sample["total_rooms"].min()
x_1 = sample["total_rooms"].max()

# Retrieve the final weight and bias generated during training.
weight = linear_regressor.get_variable_value('linear/linear_model/total_rooms/weights')[0]
bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')

# Get the predicted median_house_values for the min and max total_rooms values.
y_0 = weight * x_0 + bias 
y_1 = weight * x_1 + bias

# Plot our regression line from (x_0, y_0) to (x_1, y_1).
plt.plot([x_0, x_1], [y_0, y_1], c='r')

# Label the graph axes.
plt.ylabel("median_house_value")
plt.xlabel("total_rooms")

# Plot a scatter plot from our data sample.
plt.scatter(sample["total_rooms"], sample["median_house_value"])

# Display graph.
plt.show()

This initial line looks way off.  See if you can look back at the summary stats and see the same information encoded there.

Together, these initial sanity checks suggest we may be able to find a much better line.

In [48]:
# drop() without inplace returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
to_predict = to_predict.drop(columns='label')


In [49]:
to_predict.head()

Unnamed: 0,utc,x,y,z
0,1446834000.0,0.59375,-0.76562,-0.26562
1,1446834000.0,0.59375,-0.76562,-0.26562
2,1446834000.0,0.59375,-0.76562,-0.26562
3,1446834000.0,0.60938,-0.76562,-0.26562
4,1446834000.0,0.60938,-0.76562,-0.26562


In [56]:
expected = ['Walk', 'Run']

In [51]:
predictions = classifier.predict(
    input_fn=lambda:eval_input_fn(to_predict[feature_columns], labels=None, batch_size=batch_size))


In [52]:
predictions

<generator object Estimator.predict at 0x000001AE02F49678>

In [57]:
for pred_dict, expec in zip(predictions, expected):
    class_id = pred_dict['class_ids'][0]
    probability = pred_dict['probabilities'][class_id]

    print(class_id, 100 * probability, expec)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\bsyou\AppData\Local\Temp\tmpb_r4u6yk\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
0 98.76295328140259 Walk
0 98.76295328140259 Run
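Because `zip` stops at the shorter input, the loop above prints only the first two predictions, and the rows in `to_predict` have no true labels to compare against. A more readable option is to map each predicted class id back to its name. The sketch below uses made-up prediction dicts with the same `class_ids` and `probabilities` keys as above; `LABEL_NAMES`, `describe_prediction` and `fake_predictions` are hypothetical.

```python
# Index matches the 0/1 encoding used when the labels were converted.
LABEL_NAMES = ['Walk', 'Run']


def describe_prediction(pred_dict):
    """Return (label name, probability of that label) for one prediction."""
    class_id = int(pred_dict['class_ids'][0])
    probability = float(pred_dict['probabilities'][class_id])
    return LABEL_NAMES[class_id], probability


# Made-up dicts shaped like the Estimator predictions above.
fake_predictions = [
    {'class_ids': [0], 'probabilities': [0.9876, 0.0124]},
    {'class_ids': [1], 'probabilities': [0.2, 0.8]},
]

for pred in fake_predictions:
    name, probability = describe_prediction(pred)
    print('{}: {:.1%}'.format(name, probability))
```

The same helper could be applied to every item yielded by `classifier.predict(...)`.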
