This challenge consists of two sets of files:

* `X.CSV` for various values of `X`: contains data similar to what you were to produce for the C++ part of this challenge. Specifically, each file contains the average, min, max, and standard deviation of the total acceleration in 10-second-long time windows for a single video. The `start` and `end` columns indicate the bounds of the time window in milliseconds from the beginning of the video.
* `X.jumps.csv` for various values of `X`: these files go with the matching `X.CSV` files and indicate when a kite surfer jumped. That is, if one row of `X.jumps.csv` has `Start == 53280` and `End == 55540` then a kite surfer jumped, leaving the water at 53280 milliseconds from the beginning of the video and landing back on the water at 55540 milliseconds from the beginning of the video.

The goal is to be able to predict time windows which contain kite jumps. This Jupyter Notebook is a (bad) attempt at building a model to do that. Your challenge is to identify the various ways this attempt could be improved. You do not need to actually fix the code (though that is permitted). You can simply add new Markdown cells to the notebook indicating the mistakes you observe and suggesting improvements.

In [62]:
import pandas as pd
import numpy as np
import glob
import os.path
from keras.layers import Input, Dense
from keras.models import Model
from keras import optimizers
import sklearn.model_selection as ms

In [63]:
data_dir='challenge_data'

Read all the forces and jump data into dictionaries whose keys are the names of the forces files and whose values are the Pandas DataFrame instances holding the data.

In [64]:
force_files = glob.glob(os.path.join(data_dir, '*.CSV'))
jump_files = glob.glob(os.path.join(data_dir, '*.jumps.csv'))

forces = {}
for ff in force_files:
    forces[ff] = pd.read_csv(ff)
jumps = {}
for ff, jf in zip(force_files, jump_files):
    # Use the forces filename as the key so we can easily match the jump times with
    # the corresponding forces
    jumps[ff] = pd.read_csv(jf)
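
Pairing the two `glob` results with `zip` relies on both lists coming back in the same order, which `glob` does not promise. A minimal sketch of a name-based alternative follows. `jump_path_for` and `is_force_file` are hypothetical helpers, and the sketch assumes every `X.CSV` has a sibling named `X.jumps.csv`, as described at the top of this notebook.

```python
import os.path

def jump_path_for(force_path):
    """Map 'dir/X.CSV' to 'dir/X.jumps.csv' (hypothetical helper)."""
    root, _ext = os.path.splitext(force_path)
    return root + '.jumps.csv'

def is_force_file(path):
    """True for 'X.CSV' but not for 'X.jumps.csv', whatever the letter case."""
    return not path.lower().endswith('.jumps.csv')

# Example: derive the jumps filename from a forces filename.
print(jump_path_for(os.path.join('challenge_data', 'V1.CSV')))
```

The filter is there because a case-insensitive match of `*.CSV` could also pick up the `.jumps.csv` files; whether that happens depends on the platform.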

In [65]:
forces.keys()

['challenge_data/V3.CSV',
 'challenge_data/V4.CSV',
 'challenge_data/V1.CSV',
 'challenge_data/V2.CSV']

In [66]:
forces['challenge_data/V1.CSV'].head()

Unnamed: 0,avg,max,min,start,stddev,end
0,1.193093,3.077215,0.120284,0,0.568452,10000
1,1.188233,3.077215,0.120284,500,0.572365,10500
2,1.188903,3.077215,0.120284,1000,0.575058,11000
3,1.194125,3.077215,0.120284,1500,0.577237,11500
4,1.221625,3.113024,0.120284,2000,0.595027,12000


In [67]:
jumps.keys()

['challenge_data/V3.CSV',
 'challenge_data/V4.CSV',
 'challenge_data/V1.CSV',
 'challenge_data/V2.CSV']

In [68]:
jumps['challenge_data/V1.CSV'].head()

Unnamed: 0,Start,End
0,295800,298000
1,379600,384300
2,558300,562800
3,1056300,1060700
4,1125400,1129500


Now we want to join the computed data (the summary statistics for each time window) against the jumps data so we have a target for supervised learning. Specifically, if the window for a row contains a jump then our target is `True`. Otherwise it is `False`.

In [69]:
for k in forces.keys():
    cur_f = forces[k]
    cur_f['Target'] = False
    jt = jumps[k]
    for i in range(jt.shape[0]):
        start = jt.Start.iloc[i]
        end = jt.End.iloc[i]
        cur_f.loc[(cur_f.start <= start) & (cur_f.end >= end), 'Target'] = True
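
The labelling rule in the loop above (a window is positive only if it fully contains a jump) can be sketched without an explicit loop using NumPy broadcasting. `label_windows` is a hypothetical name, and the example windows are made up to mimic the 10 s windows that start every 500 ms; the jump is the first row of the V1 jumps table.

```python
import numpy as np

def label_windows(win_start, win_end, jump_start, jump_end):
    """Boolean array: True where a window fully contains at least one jump."""
    ws = np.asarray(win_start)[:, None]   # shape (n_windows, 1)
    we = np.asarray(win_end)[:, None]
    js = np.asarray(jump_start)[None, :]  # shape (1, n_jumps)
    je = np.asarray(jump_end)[None, :]
    contains = (ws <= js) & (we >= je)    # shape (n_windows, n_jumps)
    return contains.any(axis=1)

# Illustrative windows: 10 s long, one starting every 500 ms.
starts = np.arange(285000, 300000, 500)
ends = starts + 10000
labels = label_windows(starts, ends, [295800], [298000])
print(labels.sum(), "of", labels.size, "windows contain the jump")
```

A looser criterion, such as any overlap between window and jump (`(ws < je) & (we > js)`), would label more windows as positive; which one is appropriate depends on what the model is meant to detect.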

In [70]:
jumps['challenge_data/V1.CSV'].head()

Unnamed: 0,Start,End
0,295800,298000
1,379600,384300
2,558300,562800
3,1056300,1060700
4,1125400,1129500


In [71]:
# Make sure we set our target correctly
f1 = forces['challenge_data/V1.CSV']
f1[(f1.start > 40000) & (f1.start < 60000)]

Unnamed: 0,avg,max,min,start,stddev,end,Target
81,1.380238,4.863788,0.304278,40500,0.682105,50500,False
82,1.361738,4.863788,0.274618,41000,0.679024,51000,False
83,1.328504,4.863788,0.274618,41500,0.656603,51500,False
84,1.326717,4.863788,0.274618,42000,0.656298,52000,False
85,1.345357,4.863788,0.274618,42500,0.657399,52500,False
86,1.34732,4.863788,0.274618,43000,0.675417,53000,False
87,1.329462,4.440003,0.274618,43500,0.628885,53500,False
88,1.284885,3.784215,0.274618,44000,0.575518,54000,False
89,1.271766,3.784215,0.274618,44500,0.556836,54500,False
90,1.25297,3.784215,0.274618,45000,0.533204,55000,False


In [72]:
# Now concatenate things into one big data set
all_data = pd.concat(forces.values(), ignore_index=True)

In [73]:
all_data.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
0,0.97924,1.434954,0.654423,0,0.061798,10000,False
1,0.978753,1.434954,0.654423,500,0.060129,10500,False
2,0.976366,1.434954,0.654423,1000,0.059742,11000,False
3,0.976729,1.434954,0.654423,1500,0.053761,11500,False
4,0.975602,1.434954,0.479002,2000,0.060523,12000,False


In [74]:
all_data.shape

(8397, 7)

In [75]:
sum([x.shape[0] for x in forces.values()])

8397

Now split the data into training and test data sets.

**SPENCER**: Calling your test set "valid" is somewhat unorthodox... Changing that.

In [76]:
train, test = ms.train_test_split(all_data, test_size=0.2)
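
Because consecutive windows overlap heavily (each is 10 s long and starts 500 ms after the previous one), a purely random split puts near-duplicate rows on both sides of the split. One possible alternative, sketched here under assumptions, is to hold out whole videos. `split_by_video` is a hypothetical helper, and `video_ids` stands for a per-row video label that does not exist in `all_data` as built above; it would have to be added before the concatenation.

```python
import numpy as np

def split_by_video(video_ids, test_videos):
    """Return (train_idx, test_idx) so that no video appears on both sides."""
    video_ids = np.asarray(video_ids)
    test_mask = np.isin(video_ids, list(test_videos))
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy labels: one entry per row of the combined data set.
video_ids = ['V1', 'V1', 'V2', 'V3', 'V3']
train_idx, test_idx = split_by_video(video_ids, {'V3'})
print(train_idx, test_idx)
```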

In [77]:
train.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
6644,1.443083,5.83841,0.336497,32500,0.716005,42500,False
5310,1.183809,3.241373,0.362419,121000,0.467846,131000,False
1341,1.227089,3.212756,0.572236,670500,0.393725,680500,False
5280,1.107248,2.220643,0.29017,106000,0.334737,116000,False
4127,1.693326,8.219544,0.265194,524500,1.249571,534500,False


In [78]:
test.head()

Unnamed: 0,avg,max,min,start,stddev,end,Target
5912,1.22336,3.633863,0.699267,422000,0.403433,432000,False
427,1.567193,4.543819,0.227626,213500,0.738849,223500,False
5412,1.170742,2.930905,0.183172,172000,0.449847,182000,False
8287,1.131793,4.5241,0.236222,854000,0.4211,864000,False
5384,1.161403,2.603983,0.237976,158000,0.446781,168000,False


In [79]:
frac_true = np.sum(train.values[:,-1])/np.float_(train.values.shape[0])
print("Train ratio:", frac_true)
print("Test ratio:", np.sum(test.values[:,-1])/np.float_(test.values.shape[0]))

('Train ratio:', 0.03930326038409996)
('Test ratio:', 0.033928571428571426)


In [80]:
train.shape

(6717, 7)

In [81]:
test.shape

(1680, 7)

Now let's see if we can train a neural network on this data.

**SPENCER**: Sigmoids can be temperamental... ReLUs generally are better. Also your features are not in the same range of values, which can lead to issues.

Performing some normalization to the input features would be a good idea.
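
A minimal sketch of one such normalization (standardizing each feature to zero mean and unit variance) is given here. The helper names are hypothetical; the important assumption is that the statistics are computed on the training rows only and then reused for the test rows. The sample rows are copied from the `train_d.head()` output in this notebook.

```python
import numpy as np

def fit_standardizer(train_x):
    """Compute per-column mean and standard deviation from training data only."""
    train_x = np.asarray(train_x, dtype=float)
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return mean, std

def standardize(x, mean, std):
    """Apply previously fitted statistics to any data set."""
    return (np.asarray(x, dtype=float) - mean) / std

# Columns: avg, min, max, stddev
train_x = np.array([[1.443083, 0.336497, 5.83841, 0.716005],
                    [1.183809, 0.362419, 3.241373, 0.467846],
                    [1.227089, 0.572236, 3.212756, 0.393725],
                    [1.107248, 0.29017, 2.220643, 0.334737],
                    [1.693326, 0.265194, 8.219544, 1.249571]])
mean, std = fit_standardizer(train_x)
train_scaled = standardize(train_x, mean, std)
```

With the notebook's variables this would be fitted on `train_d.values`, and the same `mean` and `std` would then be applied to `test_d.values`.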

In [82]:
inputs = Input(shape=(4,))
l1 = Dense(10, activation='sigmoid')(inputs)
l2 = Dense(15, activation='sigmoid')(l1)
out = Dense(1, activation='sigmoid')(l2)
model = Model(inputs=inputs, outputs=out)

In [83]:
def to_data_and_target(df):
    """Given either our training or test data set return a pair of (data, targets) where data is just the
    columns that should be input and targets are just the corresponding targets."""
    data = df[['avg', 'min', 'max', 'stddev']]
    targets = df.Target
    return (data, targets)

In [84]:
train_d, train_t = to_data_and_target(train)

In [85]:
train_d.head()

Unnamed: 0,avg,min,max,stddev
6644,1.443083,0.336497,5.83841,0.716005
5310,1.183809,0.362419,3.241373,0.467846
1341,1.227089,0.572236,3.212756,0.393725
5280,1.107248,0.29017,2.220643,0.334737
4127,1.693326,0.265194,8.219544,1.249571


In [86]:
train_t.head()
np.sum(train_t)

264

In [87]:
opt = optimizers.SGD(lr=0.02, momentum=0.2, decay=1e-6)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, validation_split=0.1, epochs=50, verbose=2)

Train on 6045 samples, validate on 672 samples
Epoch 1/50
 - 0s - loss: 0.2948 - acc: 0.8961 - val_loss: 0.1628 - val_acc: 0.9643
Epoch 2/50
 - 0s - loss: 0.1693 - acc: 0.9603 - val_loss: 0.1554 - val_acc: 0.9643
Epoch 3/50
 - 0s - loss: 0.1673 - acc: 0.9603 - val_loss: 0.1546 - val_acc: 0.9643
Epoch 4/50
 - 0s - loss: 0.1672 - acc: 0.9603 - val_loss: 0.1545 - val_acc: 0.9643
Epoch 5/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1544 - val_acc: 0.9643
Epoch 6/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1544 - val_acc: 0.9643
Epoch 7/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1543 - val_acc: 0.9643
Epoch 8/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1542 - val_acc: 0.9643
Epoch 9/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1543 - val_acc: 0.9643
Epoch 10/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1543 - val_acc: 0.9643
Epoch 11/50
 - 0s - loss: 0.1671 - acc: 0.9603 - val_loss: 0.1543 - val_acc: 0.9643
Epoch 12/50
 - 0s - loss: 0.1671 - acc

<keras.callbacks.History at 0x7fe0f7ec43d0>

It looks like things were still improving after 50 epochs so maybe we should try more.

**SPENCER**: Pretty difficult to assess performance of a model if you're only looking at the change in the training set performance -- you should be able to achieve perfect performance on your training set with a sufficiently complex model.

By adding a validation split we can keep an eye on overfitting and see whether we're actually getting any improvement.

Also, since positives make up less than 4% of the data, you're going to get about 96% accuracy by just predicting FALSE every time.

Some class weighting or sampling scheme to account for this could help.
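
To make the imbalance concrete, the sketch below computes the accuracy of an "always FALSE" predictor and a pair of inverse-frequency class weights. The counts (6717 training rows, 264 positives) come from the outputs in this notebook; `balanced_class_weights` is a hypothetical helper, and the weighting formula is one common heuristic rather than the only option.

```python
import numpy as np

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels).astype(int)
    counts = np.bincount(labels, minlength=2)
    n = labels.shape[0]
    return {cls: n / (2.0 * counts[cls]) for cls in (0, 1) if counts[cls] > 0}

# Rebuild a label vector with the same class counts as the training set.
labels = np.zeros(6717, dtype=int)
labels[:264] = 1

weights = balanced_class_weights(labels)
baseline_accuracy = 1.0 - labels.mean()  # accuracy of always predicting FALSE
print(weights, baseline_accuracy)
```

A dictionary like `weights` is the kind of value that the `class_weight` argument of Keras `fit` accepts, so positives would count for more in the loss.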

In [93]:
opt = optimizers.SGD(lr=0.02, momentum=0.2, decay=1e-6)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, epochs=100, verbose=2, validation_split=0.1)

Train on 6045 samples, validate on 672 samples
Epoch 1/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1548 - val_acc: 0.9643
Epoch 2/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 3/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 4/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 5/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 6/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 7/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 8/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 9/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 10/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 11/100
 - 0s - loss: 0.1664 - acc: 0.9603 - val_loss: 0.1549 - val_acc: 0.9643
Epoch 12/100
 - 0s - loss: 

<keras.callbacks.History at 0x7fe0f567e410>

And it looks like even after 100 epochs things were still improving, though slowly. So maybe we should try more and/or increase the learning rate and/or decrease the decay. Let's try something like that.

**SPENCER**: Parameters should be altered one at a time to measure their effect, particularly when the parameters are interrelated.
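
One way to organize such an experiment, sketched under assumptions, is to generate configurations that each differ from a baseline in exactly one setting. `one_at_a_time` is a hypothetical helper; the values are the ones used in the training cells of this notebook.

```python
def one_at_a_time(baseline, alternatives):
    """Yield (name, config) pairs; each config changes one parameter of baseline."""
    for name, values in alternatives.items():
        for value in values:
            if value == baseline[name]:
                continue  # skip the baseline value itself
            config = dict(baseline)
            config[name] = value
            yield name, config

baseline = {'lr': 0.02, 'momentum': 0.2, 'decay': 1e-6, 'epochs': 100}
alternatives = {'lr': [0.04], 'decay': [1e-7], 'epochs': [200]}

configs = list(one_at_a_time(baseline, alternatives))
for changed, config in configs:
    print(changed, config)
```

Each configuration would then be trained from freshly initialized weights, so that the runs are comparable.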

In [89]:
opt = optimizers.SGD(lr=0.04, momentum=0.2, decay=1e-7)
model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x=train_d.values, y=train_t.values, epochs=200, verbose=2, validation_split=0.1)

Train on 6045 samples, validate on 672 samples
Epoch 1/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1547 - val_acc: 0.9643
Epoch 2/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1545 - val_acc: 0.9643
Epoch 3/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1546 - val_acc: 0.9643
Epoch 4/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1545 - val_acc: 0.9643
Epoch 5/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1547 - val_acc: 0.9643
Epoch 6/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1546 - val_acc: 0.9643
Epoch 7/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1545 - val_acc: 0.9643
Epoch 8/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1547 - val_acc: 0.9643
Epoch 9/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1545 - val_acc: 0.9643
Epoch 10/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1544 - val_acc: 0.9643
Epoch 11/200
 - 0s - loss: 0.1670 - acc: 0.9603 - val_loss: 0.1544 - val_acc: 0.9643
Epoch 12/200
 - 0s - loss: 

<keras.callbacks.History at 0x7fe0f70ee310>

That looks pretty good! Let's assess accuracy on our test data set.

In [90]:
test_d, test_t = to_data_and_target(test)
model.evaluate(test_d.values, test_t.values)

# Fraction of actual jumps correctly identified (recall), assuming a 0.5 cut
p = model.predict(test_d.values)
pred = p[:, 0] > 0.5
actual = test_t.values.astype(bool)
np.sum(pred & actual) / float(np.sum(actual))



0.0

Awesome! Better than 95% accuracy on our test set. This is a good model!

**SPENCER**: With a binary classification threshold of 0.5, we actually identify 0 of the jumps correctly. Since the goal here is to learn to identify jumps, we're not doing too great. Monitoring a different metric (like recall) would help give a clearer picture.
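
A minimal sketch of computing precision and recall from predicted scores follows. `precision_recall` is a hypothetical helper and the toy arrays are made up; the second call mimics a model that scores every window below the threshold, which gives zero recall however high its accuracy is.

```python
import numpy as np

def precision_recall(y_true, scores, threshold=0.5):
    """Return (precision, recall) for the positive class at the given threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores).ravel() > threshold
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    precision = tp / float(tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / float(tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

y_true = [0, 0, 1, 1]
mixed = precision_recall(y_true, [0.1, 0.6, 0.7, 0.2])
all_negative = precision_recall(y_true, [0.03, 0.03, 0.03, 0.03])
print(mixed, all_negative)
```

The `ravel()` call is there because `model.predict` returns an array of shape `(n, 1)` for a single-output model.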