# Predicting Remaining Useful Life

We show how to use Featuretools on the [PHM08 challenge dataset](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/) from NASA (as provided by LL). This notebook demonstrates a rapid way to build a model that predicts the Remaining Useful Life (RUL) of an engine from an initial dataframe of time-series data.

*If you're running this notebook yourself, please [download](https://ti.arc.nasa.gov/c/13/) the phm_08 challenge dataset into the `data` folder in this repository.*

## Highlights
* Quickly build an end-to-end workflow from time-series data
* Find interesting, automatically generated features

# Step 1: Load Data
Here we load the training data and name the columns according to the `description.txt` file. There is a column identifying each engine, a column for the number of cycles the engine has been running, 3 `operational_settings` and 21 `sensor_measurements`. We'll also add a fictional `time` column, spaced one second apart, so that each recording happens after the one before it.

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft
import utils

operational_settings = ['operational_setting_{}'.format(i + 1) for i in range(3)]
sensor_columns = ['sensor_measurement_{}'.format(i + 1) for i in range(26)]
cols = ['engine_no', 'time_in_cycles'] + operational_settings + sensor_columns

# The file has no header row, so the column names come from `cols`.
data = pd.read_csv('data/RUL_train.txt', sep=' ', header=None, names=cols)

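def clean_data(data, cols):
    # Drop the last five named columns, leaving the 21 sensor measurements.
    data = data.drop(cols[-5:], axis=1)
    # Keep the row number as an explicit `index` column for the EntitySet.
    data['index'] = data.index
    data.index = data['index']
    # Fictional timestamps, one second apart, that preserve the row order.
    data['time'] = pd.date_range('1/1/2000', periods=data.shape[0], freq='s')
    return data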

data = clean_data(data, cols)
data.head()

index,engine_no,time_in_cycles,operational_setting_1,operational_setting_2,operational_setting_3,sensor_measurement_1,sensor_measurement_2,sensor_measurement_3,sensor_measurement_4,sensor_measurement_5,...,sensor_measurement_14,sensor_measurement_15,sensor_measurement_16,sensor_measurement_17,sensor_measurement_18,sensor_measurement_19,sensor_measurement_20,sensor_measurement_21,index,time
0,1,1,42.0049,0.84,100.0,445.0,549.68,1343.43,1112.93,3.91,...,8074.83,9.3335,0.02,330,2212,100.0,10.62,6.367,0,2000-01-01 00:00:00
1,1,2,20.002,0.7002,100.0,491.19,606.07,1477.61,1237.5,9.35,...,8046.13,9.1913,0.02,361,2324,100.0,24.37,14.6552,1,2000-01-01 00:00:01
2,1,3,42.0038,0.8409,100.0,445.0,548.95,1343.12,1117.05,3.91,...,8066.62,9.4007,0.02,329,2212,100.0,10.48,6.4213,2,2000-01-01 00:00:02
3,1,4,42.0,0.84,100.0,445.0,548.7,1341.24,1118.03,3.91,...,8076.05,9.3369,0.02,328,2212,100.0,10.54,6.4176,3,2000-01-01 00:00:03
4,1,5,25.0063,0.6207,60.0,462.54,536.1,1255.23,1033.59,7.05,...,7865.8,10.8366,0.02,305,1915,84.93,14.03,8.6754,4,2000-01-01 00:00:04
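The loading code names 26 sensor columns and then drops the last five, which leaves the 21 measurements described above. One plausible reason, assumed here rather than confirmed by the dataset files, is that each line of the raw file ends with trailing spaces, which `sep=' '` turns into empty columns. A minimal sketch on made-up input shows that pandas behavior:

```python
import io
import pandas as pd

# Made-up sample (not the real data): three values per line,
# each line ending in a trailing space.
raw = "1 2 3 \n4 5 6 \n"

toy = pd.read_csv(io.StringIO(raw), sep=' ', header=None)

# The trailing delimiter yields one extra, entirely empty column.
n_columns = toy.shape[1]
last_is_empty = bool(toy.iloc[:, -1].isna().all())

# Dropping the all-NaN column recovers the three real ones.
cleaned = toy.dropna(axis=1, how='all')
print(n_columns, last_is_empty, cleaned.shape)
```

If this is what happens in the real file, `dropna(axis=1, how='all')` would be an alternative to dropping a fixed number of columns by position.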


In the training data, we have the full lifespan of 249 different engines. If we predicted the RUL of each engine at the last recorded point in its life, the label would always be 0.

To get around that, we sample a point in the life of each engine and use only the data from before that point. We can create features with that restriction easily by using [cutoff_times](https://docs.featuretools.com/automated_feature_engineering/handling_time.html) in Featuretools.

The function `make_cutoff_times` in [utils](utils.py) does that step automatically. Because the points are sampled, you can run the next cell several times and see different results.

In [2]:
cutoff_times = utils.make_cutoff_times(data)
cutoff_times.head()

engine_no,engine_no,cutoff_time,RUL
1,1,2000-01-01 00:05:09,11
2,2,2000-01-01 00:09:24,55
3,3,2000-01-01 00:11:42,224
4,4,2000-01-01 00:18:20,100
5,5,2000-01-01 00:21:04,129
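The source of `utils.make_cutoff_times` is not shown here, so the following is only a sketch of the idea, not the actual helper. The function name `sketch_cutoff_times`, the `seed` parameter, and the toy data are all hypothetical. It assumes that RUL is the number of cycles between the sampled recording and the engine's last recording:

```python
import numpy as np
import pandas as pd

def sketch_cutoff_times(df, seed=0):
    """Hypothetical stand-in: sample one recording per engine and label it."""
    rng = np.random.default_rng(seed)
    rows = []
    for engine_no, group in df.groupby('engine_no'):
        group = group.sort_values('time_in_cycles')
        # Pick a random position in this engine's recorded life.
        pos = int(rng.integers(0, len(group)))
        cycle = group['time_in_cycles'].iloc[pos]
        # Assumed label: cycles left until the engine's last recording.
        rul = int(group['time_in_cycles'].max() - cycle)
        rows.append({'engine_no': engine_no,
                     'cutoff_time': group['time'].iloc[pos],
                     'RUL': rul})
    return pd.DataFrame(rows).set_index('engine_no', drop=False)

# Toy data: engine 1 runs for 3 cycles, engine 2 for 2 cycles.
toy = pd.DataFrame({'engine_no': [1, 1, 1, 2, 2],
                    'time_in_cycles': [1, 2, 3, 1, 2]})
toy['time'] = pd.date_range('1/1/2000', periods=len(toy), freq='s')

toy_cutoffs = sketch_cutoff_times(toy)
print(toy_cutoffs)
```

Fixing the seed makes the sampled cutoffs reproducible, which helps when comparing model scores across runs.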


To apply Deep Feature Synthesis we need to establish an `EntitySet` structure for our data. The key insight in this step is that we're really interested in our data grouped by engine. We can create an `engines` entity by normalizing by the `engine_no` column in the raw data. In the next section, we'll create a feature matrix for the `engines` entity directly rather than for the base dataframe of `recordings`.

In [3]:
def make_entityset(data):
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')

    es.normalize_entity(base_entity_id='recordings', 
                        new_entity_id='engines',
                        index='engine_no')

    es.normalize_entity(base_entity_id='recordings', 
                        new_entity_id='cycles',
                        index='time_in_cycles')
    return es

es = make_entityset(data)
es

Entityset: Dataset
  Entities:
    recordings (shape = [61249, 28])
    engines (shape = [249, 2])
    cycles (shape = [543, 2])
  Relationships:
    recordings.engine_no -> engines.engine_no
    recordings.time_in_cycles -> cycles.time_in_cycles

# Step 2: DFS and Creating a Model
With the work from the last section in hand, we can quickly build features using Deep Feature Synthesis (DFS). The function `ft.dfs` takes an `EntitySet` and stacks primitives like `Max`, `Min` and `Last` exhaustively across entities.

In [4]:
from featuretools.primitives import Max, Min, Last
fm, features = ft.dfs(entityset=es, 
                      target_entity='engines',
                      agg_primitives=[Max, Min, Last],
                      trans_primitives=[],
                      cutoff_time=cutoff_times,
                      max_depth=3,
                      verbose=True)

Built 290 features
Elapsed: 04:10 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 249/249 cutoff times
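To make the role of cutoff times concrete, here is a hand-rolled pandas sketch of what a feature such as `MAX(recordings.sensor_measurement_13)` means for each engine: only recordings up to that engine's cutoff contribute. The data is made up, and the sketch assumes that rows at exactly the cutoff time are included. It illustrates the idea only and is not how `ft.dfs` computes features internally:

```python
import pandas as pd

# Toy recordings: three per engine, one second apart.
recordings = pd.DataFrame({
    'engine_no': [1, 1, 1, 2, 2, 2],
    'sensor_measurement_13': [10.0, 12.0, 50.0, 7.0, 9.0, 8.0],
    'time': pd.date_range('1/1/2000', periods=6, freq='s'),
})

# One cutoff per engine (hypothetical values).
cutoffs = pd.DataFrame({
    'engine_no': [1, 2],
    'cutoff_time': pd.to_datetime(['2000-01-01 00:00:01',
                                   '2000-01-01 00:00:05']),
})

# Keep only the rows each engine is allowed to see, then aggregate.
merged = recordings.merge(cutoffs, on='engine_no')
visible = merged[merged['time'] <= merged['cutoff_time']]
max_feature = visible.groupby('engine_no')['sensor_measurement_13'].max()
print(max_feature)
```

Engine 1's reading of 50.0 falls after its cutoff, so it does not leak into the feature. That is the property that keeps the training labels honest.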


Next we can make predictions. We'll use `train_test_split` from scikit-learn to split our data into a train and test set. After that, we can `fit` a Random Forest Regressor and make predictions. We score those predictions with the mean absolute error, the mean of the absolute values of the errors.

In [5]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X = fm.copy().fillna(0)
y = X.pop('RUL')

reg = RandomForestRegressor()
X_train, X_test, y_train, y_test = train_test_split(X, y)
reg.fit(X_train, y_train)
    
preds = reg.predict(X_test)
# scikit-learn's argument order is (y_true, y_pred).
scores = mean_absolute_error(y_test, preds)

print('Mean Abs Error: {:.2f}'.format(scores))

high_imp_feats = utils.feature_importances(X, reg)

Mean Abs Error: 43.53
1: MAX(recordings.sensor_measurement_13) [0.212]
2: MAX(recordings.cycles.LAST(recordings.sensor_measurement_13)) [0.167]
3: LAST(recordings.time_in_cycles) [0.089]
4: MAX(recordings.cycles.LAST(recordings.sensor_measurement_4)) [0.053]
5: MAX(recordings.cycles.LAST(recordings.sensor_measurement_11)) [0.050]
-----



Let's compare that score to the Mean Abs Error that we would get by predicting the median of `y_train` at every point:

In [6]:
medianpredict = np.full(len(y_test), np.median(y_train))
print('Baseline by median: Mean Abs Error = {:.2f}'.format(
    mean_absolute_error(y_test, medianpredict)))

Baseline by median: Mean Abs Error = 76.83
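The median is a natural baseline for this metric because, among all constant predictions, it minimizes the mean absolute error on the data it was computed from. A small check on made-up numbers illustrates this; `constant_mae` is a hypothetical helper. In the notebook the median comes from `y_train` and is scored on `y_test`, so the guarantee holds only approximately there:

```python
import numpy as np

# Made-up targets with one large outlier.
y = np.array([5.0, 10.0, 20.0, 40.0, 300.0])

def constant_mae(y_true, constant):
    """Mean absolute error of predicting `constant` everywhere."""
    return float(np.mean(np.abs(y_true - constant)))

mae_median = constant_mae(y, np.median(y))  # median is 20
mae_mean = constant_mae(y, np.mean(y))      # mean is 75

# No other constant on a fine grid does better than the median.
grid = np.linspace(y.min(), y.max(), 1001)
best_on_grid = min(constant_mae(y, c) for c in grid)
print(mae_median, mae_mean, best_on_grid)
```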


# Step 3: Using the Model
Once we're done creating features and tuning the model, we can apply the exact same transformations (including DFS) to our test data. In this case the true RUL isn't contained in the test data, so we don't need to worry about cutoff times.

In [8]:

data2 = pd.read_csv('data/RUL_test.txt', sep=' ', header=None, names=cols)
data2 = clean_data(data2, cols)

es2 = make_entityset(data2)
fm2 = ft.calculate_feature_matrix(entityset=es2, features=features, verbose=True)
fm2.head()

Elapsed: 00:06 | Remaining: 00:00 | Progress: 100%|██████████|| Calculated: 1/1 cutoff times


engine_no,MAX(recordings.operational_setting_1),MAX(recordings.operational_setting_2),MAX(recordings.operational_setting_3),MAX(recordings.sensor_measurement_1),MAX(recordings.sensor_measurement_2),MAX(recordings.sensor_measurement_3),MAX(recordings.sensor_measurement_4),MAX(recordings.sensor_measurement_5),MAX(recordings.sensor_measurement_6),MAX(recordings.sensor_measurement_7),...,LAST(recordings.cycles.LAST(recordings.sensor_measurement_12)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_13)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_14)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_15)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_16)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_17)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_18)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_19)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_20)),LAST(recordings.cycles.LAST(recordings.sensor_measurement_21))
1,42.0075,0.842,100.0,518.67,642.66,1592.22,1407.74,14.62,21.57,560.83,...,371.51,2388.13,8135.76,8.6654,0.03,369,2319,100.0,28.83,17.1118
2,42.0074,0.8419,100.0,518.67,643.16,1598.32,1416.3,14.62,21.61,555.04,...,314.97,2388.09,8060.44,9.2049,0.02,366,2324,100.0,24.64,14.6896
3,42.0072,0.842,100.0,518.67,642.85,1590.21,1403.87,14.62,21.59,556.06,...,371.92,2388.1,8130.24,8.672,0.03,368,2319,100.0,28.53,17.1455
4,42.0077,0.842,100.0,518.67,642.63,1589.99,1409.06,14.62,21.59,555.19,...,314.83,2388.1,8064.7,9.2551,0.02,363,2324,100.0,24.41,14.7103
5,42.0077,0.8417,100.0,518.67,642.95,1597.12,1412.33,14.62,21.61,553.67,...,130.8,2387.93,8082.32,9.3197,0.02,330,2212,100.0,10.6,6.44


In [9]:
X = fm2.copy().fillna(0)
y = pd.read_csv('data/RUL_test_truth.txt', sep=' ', header=None, names=['RUL'], index_col=False)
preds2 = reg.predict(X)
print('Mean Abs Error: {:.2f}'.format(mean_absolute_error(y, preds2)))


Mean Abs Error: 49.05
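Reusing a fitted model only works if the new feature matrix has the same columns, in the same order, as the one used for training. A defensive check along these lines can catch mismatches early. The helper `align_to_training` and the toy column names are hypothetical and not part of the notebook's `utils` module:

```python
import pandas as pd

def align_to_training(fm_new, train_columns):
    """Reorder `fm_new` to the training column order, or fail loudly."""
    missing = [c for c in train_columns if c not in fm_new.columns]
    if missing:
        raise ValueError('missing features: {}'.format(missing))
    return fm_new[list(train_columns)].fillna(0)

# Toy feature matrix whose columns arrive in a different order.
train_columns = ['MAX(a)', 'MIN(a)', 'LAST(a)']
fm_new = pd.DataFrame({'LAST(a)': [3.0, None],
                       'MAX(a)': [5.0, 6.0],
                       'MIN(a)': [1.0, 2.0]}, index=[1, 2])

X_new = align_to_training(fm_new, train_columns)
print(X_new)
```

In the notebook, `train_columns` would correspond to the columns of `X_train`.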
