# Predicting Remaining Useful Life (advanced)
<p style="margin:30px">
    <img style="display:inline; margin-right:50px" width=50% src="https://www.featuretools.com/wp-content/uploads/2017/12/FeatureLabs-Logo-Tangerine-800.png" alt="Featuretools" />
    <img style="display:inline" width=15% src="https://upload.wikimedia.org/wikipedia/commons/e/e5/NASA_logo.svg" alt="NASA" />
</p>

This notebook walks through a more advanced workflow than [the other notebook](Simple%20Featuretools%20RUL%20Demo.ipynb) for predicting Remaining Useful Life (RUL). If you are new to either this dataset or Featuretools, I recommend reading the other notebook first.

## Highlights
* Demonstrate how novel entityset structures improve predictive accuracy
* Build custom primitives using time-series functions from [tsfresh](https://github.com/blue-yonder/tsfresh)
* Improve Mean Absolute Error by tuning hyperparameters with [BTB](https://github.com/HDI-Project/BTB)

Here are the Mean Absolute Error scores from one run of both notebooks. Because of the randomness in the Random Forest Regressor and in how we choose labels from the train data, scores will vary between runs.

|                                 | Train |  Test |
|---------------------------------|-------|-------|
| Median Baseline                 | 62.55 | 50.55 |
| Simple Featuretools             | 41.18 | 39.56 |
| Advanced: Custom Primitives     | 36.46 | 32.60 |
| Advanced: Hyperparameter Tuning | 27.49 | 13.42 |
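
For readers who want a feel for the first row of the table, here is a minimal sketch of a median baseline, assuming it means "predict the median training RUL for every engine" (the other notebook defines the exact procedure). The function name `median_baseline_mae` and the toy labels are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def median_baseline_mae(y_train, y_eval):
    """MAE obtained by always predicting the median of y_train."""
    guess = np.median(y_train)
    return float(np.mean(np.abs(np.asarray(y_eval, dtype=float) - guess)))

# Toy RUL labels (hypothetical, for illustration only).
toy_train_rul = [10, 50, 60, 120, 200]
toy_test_rul = [30, 60, 90]

baseline_train = median_baseline_mae(toy_train_rul, toy_train_rul)
baseline_test = median_baseline_mae(toy_train_rul, toy_test_rul)
print(baseline_train, baseline_test)  # 52.0 20.0
```

Any model in the table is only useful to the extent that it beats this kind of constant guess.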


# Step 1: Load Data
Here we load in the train data using the same function we used in the previous notebook:

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft
import utils

utils.download_data()
data_path = 'data/train_FD004.txt'
data = utils.load_data(data_path)

data.head()

Using previously downloaded data
Loaded data with:
61249 Recordings
249 Engines
21 Sensor Measurements
3 Operational Settings


index,engine_no,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,index,time
0,1,1,42.0049,0.84,100.0,445.0,549.68,1343.43,1112.93,3.91,...,8074.83,9.3335,0.02,330,2212,100.0,10.62,6.367,0,2000-01-01 00:00:00
1,1,2,20.002,0.7002,100.0,491.19,606.07,1477.61,1237.5,9.35,...,8046.13,9.1913,0.02,361,2324,100.0,24.37,14.6552,1,2000-01-01 00:10:00
2,1,3,42.0038,0.8409,100.0,445.0,548.95,1343.12,1117.05,3.91,...,8066.62,9.4007,0.02,329,2212,100.0,10.48,6.4213,2,2000-01-01 00:20:00
3,1,4,42.0,0.84,100.0,445.0,548.7,1341.24,1118.03,3.91,...,8076.05,9.3369,0.02,328,2212,100.0,10.54,6.4176,3,2000-01-01 00:30:00
4,1,5,25.0063,0.6207,60.0,462.54,536.1,1255.23,1033.59,7.05,...,7865.8,10.8366,0.02,305,1915,84.93,14.03,8.6754,4,2000-01-01 00:40:00


We also make cutoff times by selecting a random cutoff time from the life of each engine:

In [2]:
cutoff_times = utils.make_cutoff_times(data)

cutoff_times.head()

index,engine_no,cutoff_time,RUL
1,1,2000-01-02 04:00:00,152
2,2,2000-01-04 19:00:00,73
3,3,2000-01-07 01:00:00,56
4,4,2000-01-08 21:40:00,62
5,5,2000-01-10 04:40:00,69
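
The body of `utils.make_cutoff_times` is not shown in this notebook, so the following is only a sketch of the idea on a toy frame. It assumes that RUL means "number of recordings left after the cutoff"; the function name `sketch_cutoff_times` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def sketch_cutoff_times(df, seed=0):
    """Pick one random recording per engine as the cutoff and derive RUL.

    RUL is taken here as the number of recordings after the cutoff,
    which is an assumption about how utils.make_cutoff_times defines it.
    """
    rng = np.random.default_rng(seed)
    rows = []
    for engine_no, group in df.groupby('engine_no'):
        group = group.sort_values('time')
        pos = int(rng.integers(0, len(group)))  # random position in the engine's life
        rows.append({'engine_no': engine_no,
                     'cutoff_time': group['time'].iloc[pos],
                     'RUL': len(group) - 1 - pos})
    return pd.DataFrame(rows)

# Two toy engines with four and three recordings, ten minutes apart.
toy = pd.DataFrame({
    'engine_no': [1, 1, 1, 1, 2, 2, 2],
    'time': pd.date_range('2000-01-01', periods=7, freq='10min'),
})
toy_cutoffs = sketch_cutoff_times(toy)
print(toy_cutoffs)
```

Because the cutoff is drawn at random, rerunning with a different seed gives different labels, which is one reason the scores in this notebook change between runs.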


We're going to do something fancy for our entityset. The values of `operational_setting_1` through `operational_setting_3` are continuous, but they create an implicit relationship between engines: if two engines run under similar operational settings, it could indicate that their sensor measurements mean similar things. We cluster those settings using `KMeans` from scikit-learn and make a new entity from the clusters.

In [3]:
from sklearn.cluster import KMeans

nclusters = 25

def make_entityset(data, nclusters, kmeans=None):
    X = data[['operational_setting_1', 'operational_setting_2', 'operational_setting_3']]
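    if kmeans is None:
        # No fitted model was passed in: learn the clusters from this data.
        kmeans = KMeans(n_clusters=nclusters).fit(X)
    # Otherwise reuse the fitted model so new data gets the same cluster ids.
    data['settings_clusters'] = kmeans.predict(X)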
    
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')

    es.normalize_entity(base_entity_id='recordings', 
                        new_entity_id='engines',
                        index='engine_no')
    
    es.normalize_entity(base_entity_id='recordings', 
                        new_entity_id='cycles',
                        index='time_in_cycles')
    
    es.normalize_entity(base_entity_id='recordings', 
                        new_entity_id='settings_clusters',
                        index='settings_clusters')
    
    return es, kmeans

es, kmeans = make_entityset(data, nclusters)
es

Entityset: Dataset
  Entities:
    recordings [Rows: 61249, Columns: 29]
    cycles [Rows: 543, Columns: 2]
    settings_clusters [Rows: 25, Columns: 2]
    engines [Rows: 249, Columns: 2]
  Relationships:
    recordings.engine_no -> engines.engine_no
    recordings.time_in_cycles -> cycles.time_in_cycles
    recordings.settings_clusters -> settings_clusters.settings_clusters
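
`make_entityset` accepts an already fitted `kmeans` so that new data can be labelled with the clusters learned from the train data. `KMeans.predict` assigns each row to its closest learned cluster centre; the plain NumPy sketch below mimics that assignment. The helper `assign_to_centroids` and the three centroids are hypothetical and chosen only for illustration.

```python
import numpy as np

def assign_to_centroids(points, centroids):
    """Return, for each point, the index of the closest centroid (Euclidean)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # distances[i, j] is the distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Hypothetical centroids in (setting_1, setting_2, setting_3) space.
toy_centroids = [[42.0, 0.84, 100.0], [20.0, 0.70, 100.0], [25.0, 0.62, 60.0]]

train_settings = [[42.0049, 0.84, 100.0], [20.002, 0.7002, 100.0], [25.0063, 0.6207, 60.0]]
test_settings = [[41.99, 0.84, 100.0], [25.01, 0.62, 60.0]]

train_labels = assign_to_centroids(train_settings, toy_centroids)
test_labels = assign_to_centroids(test_settings, toy_centroids)
print(train_labels, test_labels)  # [0 1 2] [0 2]
```

Refitting on the test data instead would produce cluster ids that need not line up with the ones the features were trained on.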

# Step 2: DFS and Creating a Model
In addition to changing our `EntitySet` structure, we're also going to use some time series primitives from the package [tsfresh](https://github.com/blue-yonder/tsfresh).

In [4]:
from tsfresh.feature_extraction.feature_calculators import number_peaks, mean_abs_change
from featuretools.primitives import make_agg_primitive
import featuretools.variable_types as vtypes

MeanAbsChange = make_agg_primitive(mean_abs_change,
                                   input_types=[vtypes.Numeric],
                                   return_type=vtypes.Numeric,
                                   name="mean_abs_change")

NumPeaks = make_agg_primitive(lambda x: number_peaks(x, 10),
                              input_types=[vtypes.Numeric],
                              return_type=vtypes.Numeric,
                              name="number_peaks")
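
To make the two tsfresh calculators less of a black box, here is a plain NumPy sketch of what they are documented to compute: the mean absolute difference between consecutive values, and the number of values that are strictly larger than their `n` neighbours on each side. These are illustrative re-implementations with made-up names, not the tsfresh source.

```python
import numpy as np

def mean_abs_change_sketch(x):
    """Mean of the absolute differences between consecutive values."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(np.diff(x))))

def number_peaks_sketch(x, n):
    """Count values strictly larger than their n neighbours on each side."""
    x = np.asarray(x, dtype=float)
    count = 0
    for i in range(n, len(x) - n):
        left, right = x[i - n:i], x[i + 1:i + n + 1]
        if np.all(x[i] > left) and np.all(x[i] > right):
            count += 1
    return count

signal = [3, 0, 0, 4, 0, 0, 13]
toy_change = mean_abs_change_sketch([1, 2, 4, 1])
toy_peaks = [number_peaks_sketch(signal, n) for n in (1, 2, 3)]
print(toy_change)  # 2.0
print(toy_peaks)   # [1, 1, 0]
```

In the toy signal, 4 is a peak of support 1 and 2 but not of support 3, because 13 sits three steps to its right. Under this definition, a support of 10 (as used in `NumPeaks`) can only find peaks in series with at least 21 values.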


# Only the built-in primitives actually passed to dfs are imported.
from featuretools.primitives import Max, Min, Last, Count, Trend
fm, features = ft.dfs(entityset=es, 
                      target_entity='engines',
                      agg_primitives=[Max, Min, Last, Count, Trend, NumPeaks],
                      trans_primitives=[],
                      cutoff_time=cutoff_times,
                      max_depth=3,
                      verbose=True)
fm.to_csv('advanced_fm.csv')



Built 1337 features
Elapsed: 1:04:52 | Remaining: 00:00 | Progress: 100%|██████████| Calculated: 11/11 chunks
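
Several of the features built above stack simple aggregations such as `Last`, `Max` and `Trend`. As a rough intuition for `Trend`, the sketch below treats it as the slope of a least-squares line through the values over time. The helper `trend_slope`, the toy readings, and the choice of minutes as the time unit are assumptions; Featuretools' own handling of time units is not reproduced here.

```python
import numpy as np

def trend_slope(values, times):
    """Slope of the least-squares line through (times, values)."""
    x = np.asarray(times, dtype=float)
    y = np.asarray(values, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)

# Toy sensor readings taken every 10 minutes (hypothetical values).
minutes = [0, 10, 20, 30, 40]
readings = [549.0, 549.5, 550.0, 550.5, 551.0]

toy_trend = trend_slope(readings, minutes)  # change per minute
toy_last = readings[-1]                     # what a "last value" aggregation would keep
toy_max = max(readings)                     # what a "max" aggregation would keep
print(round(toy_trend, 6), toy_last, toy_max)
```

A feature like `MAX(recordings.cycles.LAST(...))` applies the same kind of aggregation twice: once per cycle, then once per engine.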


# Step 3: Feature Selection and Scoring
Here, we'll use [Recursive Feature Elimination](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html). To set ourselves up for later optimization, we're going to write a generic `pipeline` function which takes in a list of hyperparameters and returns scores. The pipeline first holds out part of the data, runs `RFE` on the remaining training rows, and then fits a `RandomForestRegressor` on the selected features and scores it on the held-out rows. We will tune the hyperparameters later.

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.feature_selection import RFE

def pipeline(X, y, hyperparams):
    """Select features with RFE, then fit and score a random forest.

    Hyperparams: [
            0: number of estimators for the random forest in RFE
            1: number of features to select
            2: number of estimators for the random forest in scoring
            3: max features for the random forest in scoring
        ]
    Returns the MAE on the inner held-out split, the MAE on the
    validation split, and the fitted (selector, regressor) pair.
    """
    reg = RandomForestRegressor(n_estimators=int(hyperparams[0]), n_jobs=3)
    # Hold out a validation set, then split the rest into fit and test parts.
    X_train, X_valid, y_train, y_valid = train_test_split(X, y)
    X_train, X_test, y_train, y_test = train_test_split(X_train, y_train)

    selector = RFE(reg, n_features_to_select=int(hyperparams[1]), step=50)
    selector = selector.fit(X_train, y_train)
    # max_features cannot exceed the number of selected features.
    max_feats = min(hyperparams[3], hyperparams[1])
    reg = RandomForestRegressor(n_estimators=int(hyperparams[2]),
                                max_features=int(max_feats))
    reg.fit(selector.transform(X_train), y_train)

    # Despite its name, score_train is measured on the inner held-out split.
    preds = reg.predict(selector.transform(X_test))
    score_train = mean_absolute_error(y_test, preds)

    preds = reg.predict(selector.transform(X_valid))
    score_valid = mean_absolute_error(y_valid, preds)
    return score_train, score_valid, (selector, reg)

fm = pd.read_csv('advanced_fm.csv', index_col='engine_no')
X = fm.copy().fillna(0)
y = X.pop('RUL')

rfe_nest = 50
nfeats = 50
sco_nest = 50
sco_maxfeats = 50

hyperparams = [rfe_nest, nfeats, sco_nest, sco_maxfeats]
score_train, score_valid, (selector, model) = pipeline(X, y, hyperparams)

print('Mean Abs Error: {:.2f}'.format(score_valid))
high_imp_feats = utils.feature_importances(X.iloc[:, selector.support_], model, feats=10)

Mean Abs Error: 35.79
1: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_13)) [0.207]
2: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_8)) [0.098]
3: MAX(recordings.sensor_measurement_13) [0.052]
4: NUMBER_PEAKS(recordings.cycles.TREND(recordings.operational_setting_2, time)) [0.051]
5: NUMBER_PEAKS(recordings.sensor_measurement_7) [0.047]
6: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_11)) [0.043]
7: MAX(recordings.cycles.LAST(recordings.sensor_measurement_13)) [0.036]
8: MAX(recordings.sensor_measurement_2) [0.027]
9: MAX(recordings.cycles.LAST(recordings.sensor_measurement_3)) [0.026]
10: MAX(recordings.cycles.LAST(recordings.sensor_measurement_15)) [0.025]
-----
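
To see how much work one call to `pipeline` involves, the sketch below does the bookkeeping by hand. It assumes one row per engine (249 rows), that all 1337 features reach `RFE`, that `train_test_split` uses its default 25% held-out fraction rounded up, and that `RFE` drops at most `step` features per round until the target count is reached. The helper names are hypothetical.

```python
import math

def split_sizes(n_samples, test_fraction=0.25):
    """(kept, held-out) sizes, assuming the held-out part is rounded up."""
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

def rfe_schedule(n_features, n_to_select, step):
    """Feature counts after each elimination round, dropping at most `step` per round."""
    counts = [n_features]
    while counts[-1] > n_to_select:
        drop = min(step, counts[-1] - n_to_select)
        counts.append(counts[-1] - drop)
    return counts

n_rest, n_valid = split_sizes(249)      # first split: validation rows held out
n_train, n_test = split_sizes(n_rest)   # second split: inner test rows held out
schedule = rfe_schedule(n_features=1337, n_to_select=50, step=50)

print(n_train, n_test, n_valid)          # 139 47 63
print(len(schedule) - 1, schedule[-3:])  # 26 [137, 87, 50]
```

Under these assumptions the forest inside `RFE` is refit on the order of 26 times on only 139 rows, which helps explain both the runtime and the variance between runs.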



Lastly, we can use that selector and regressor to score the test values.

In [7]:
data2 = utils.load_data('data/test_FD004.txt')

es2, _ = make_entityset(data2, nclusters, kmeans=kmeans)
fm2 = ft.calculate_feature_matrix(entityset=es2, features=features, verbose=True, chunk_size='cutoff time')
X = fm2.copy().fillna(0)
y = pd.read_csv('data/RUL_FD004.txt', sep=' ', header=None, names=['RUL'], index_col=False)
preds2 = model.predict(selector.transform(X))
print('Mean Abs Error: {:.2f}'.format(mean_absolute_error(preds2, y)))

Loaded data with:
41214 Recordings
248 Engines
21 Sensor Measurements
3 Operational Settings

Elapsed: 00:00 | Remaining: ? | Progress:   0%|          | Calculated: 0/1 chunks


KeyboardInterrupt: 

# Step 4: Hyperparameter Tuning
Because our pipeline takes a list of hyperparameters and returns a score, we can use a Gaussian Process to tune those hyperparameters. We will use [BTB](https://github.com/HDI-Project/BTB) from the [HDI Project](https://github.com/HDI-Project).

In [9]:
from btb.hyper_parameter import HyperParameter
from btb.tuning.gp import GP
from tqdm import tqdm

def run_btb(X, y, n=30):
    hyperparam_ranges = [
            ('selector_n_estimators', HyperParameter('int', [100, 1000])),
            ('select_n_features', HyperParameter('int', [5, 50])),
            ('model_n_estimators', HyperParameter('int', [100, 500])),
            ('model_max_feats', HyperParameter('int', [5, 20])),
    ]
    tuner = GP(hyperparam_ranges)

    tested_parameters = np.zeros((n, len(hyperparam_ranges)), dtype=object)
    scores = []

    print('[sel_n_est, sel_n_feats, model_n_est, model_max_feats]')
    for i in range(n):
        # Refit the tuner on everything tried so far, then ask for a new candidate.
        tuner.fit(tested_parameters[:i, :], scores)
        hyperparams = tuner.propose()

        bound, valid, _ = pipeline(X, y, hyperparams)
        tested_parameters[i, :] = hyperparams
        # The tuner is assumed to maximize its score, so record the negated MAE.
        scores.append(-bound)
        print('{}: Bound - {:.2f}, Valid - {:.2f}'.format(hyperparams, -bound, -valid))

    return tested_parameters, scores

X = fm.copy().fillna(0)
y = X.pop('RUL')

tested_parameters, scores = run_btb(X, y, n=30)

[sel_n_est, sel_n_feats, model_n_est, model_max_feats]
[278.  34. 458.   9.]: Bound - -34.63, Valid - -42.95
[323.  33. 269.  10.]: Bound - -31.64, Valid - -40.90
[274.  36. 464.  10.]: Bound - -40.42, Valid - -30.63
[869.  27. 405.   8.]: Bound - -37.18, Valid - -43.23
[878.  12. 172.  18.]: Bound - -46.29, Valid - -29.84
[839.  48. 127.  15.]: Bound - -39.26, Valid - -41.87
[478.  30. 463.  18.]: Bound - -33.07, Valid - -45.28
[967.  16. 442.  13.]: Bound - -33.28, Valid - -41.74
[485.   9. 140.  15.]: Bound - -37.43, Valid - -39.19
[339.  13. 341.  12.]: Bound - -35.76, Valid - -36.43
[979.  25. 240.  10.]: Bound - -40.07, Valid - -39.26
[488.  10. 141.   9.]: Bound - -39.08, Valid - -37.95
[881.  46. 144.  19.]: Bound - -43.76, Valid - -37.66
[766.  27. 266.  15.]: Bound - -42.38, Valid - -48.05
[879.   6. 174.  16.]: Bound - -35.57, Valid - -45.82
[983.   5. 398.  11.]: Bound - -43.42, Valid - -44.35
[184.  19. 353.  14.]: Bound - -30.26, Valid - -43.53
[841.  50. 124.  17.]: Boun
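
The propose, evaluate, record loop in `run_btb` can be illustrated without BTB. The sketch below swaps the Gaussian Process tuner for plain random search and swaps `pipeline` for a made-up objective, so it shows only the shape of the loop, not BTB's behaviour. The two ranges are copied from `select_n_features` and `model_max_feats` above; everything else is hypothetical.

```python
import random

def toy_pipeline(hyperparams):
    """Stand-in for pipeline(X, y, hyperparams): returns a made-up MAE."""
    n_feats, max_feats = hyperparams
    return abs(n_feats - 30) + abs(max_feats - 10)

def random_search(n_trials=30, seed=0):
    """Random search: propose, evaluate, record, then keep the best trial."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_trials):
        proposal = (rng.randint(5, 50), rng.randint(5, 20))  # propose
        mae = toy_pipeline(proposal)                          # evaluate
        history.append((proposal, -mae))                      # record negated MAE: higher is better
    best_params, best_score = max(history, key=lambda item: item[1])
    return best_params, -best_score, history

best_params, best_mae, history = random_search()
print(best_params, best_mae)
```

A model-based tuner differs only in the proposal step: instead of sampling blindly, it uses the recorded scores to decide where to look next.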

In [12]:
X = fm.copy().fillna(0)
y = X.pop('RUL')
hyperparams = [906.,  33., 500.,   6.]
score, valid, (selector, model) = pipeline(X, y, hyperparams)

print('Mean Abs Error on Train: {:.2f}'.format(valid))
high_imp_feats = utils.feature_importances(X.iloc[:, selector.support_], model, feats=10)

Mean Abs Error on Train: 38.14
1: NUMBER_PEAKS(recordings.cycles.TREND(recordings.operational_setting_2, time)) [0.080]
2: COUNT(recordings) [0.052]
3: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_13)) [0.050]
4: MAX(recordings.settings_clusters.LAST(recordings.sensor_measurement_11)) [0.046]
5: NUMBER_PEAKS(recordings.sensor_measurement_8) [0.040]
6: LAST(recordings.cycles.MIN(recordings.sensor_measurement_4)) [0.038]
7: MAX(recordings.sensor_measurement_3) [0.035]
8: NUMBER_PEAKS(recordings.cycles.MAX(recordings.sensor_measurement_20)) [0.035]
9: NUMBER_PEAKS(recordings.sensor_measurement_9) [0.035]
10: MAX(recordings.sensor_measurement_13) [0.033]
-----



In [13]:
X = fm2.copy().fillna(0)
y = pd.read_csv('data/RUL_FD004.txt', sep=' ', header=None, names=['RUL'], index_col=False)

preds2 = model.predict(selector.transform(X))
score2 = mean_absolute_error(preds2, y)
print('Mean Abs Error on Test: {:.2f}'.format(score2))


NameError: name 'fm2' is not defined