
# Aircraft Engines

We have time series of measurements taken from aircraft engines. Each sample is a vector of readings recorded for one engine at one specific cycle.

The goal is to predict, from these measurements, when an engine is going to fail.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Set the seed for reproducibility
np.random.seed(1234)
PYTHONHASHSEED = 0
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, recall_score, precision_score

from sklearn.utils.multiclass import unique_labels
from sklearn.metrics import classification_report


In [2]:
! wget https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/PM_train.txt
! wget https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/PM_test.txt
! wget https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/PM_truth.txt


In [0]:
# read training data 
train_df = pd.read_csv('PM_train.txt', sep=" ", header=None)
train_df.drop(train_df.columns[[26, 27]], axis=1, inplace=True)
train_df.columns = ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
                     's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
                     's15', 's16', 's17', 's18', 's19', 's20', 's21']
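
The cell above drops columns 26 and 27 without saying why. One plausible reason (an assumption, since the raw file is not shown here) is that each line of the file ends with two trailing spaces, which `sep=" "` turns into two empty columns. The sketch below reproduces that behaviour on a made-up two-line sample; `sample` and its values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample: two rows, four values each, two trailing spaces per line
sample = "1 1 0.5 0.3  \n1 2 0.6 0.2  \n"

raw = pd.read_csv(io.StringIO(sample), sep=" ", header=None)
print(raw.shape)  # four data columns plus two empty (all-NaN) ones

# Dropping all-NaN columns is an alternative to dropping them by position
clean = raw.dropna(axis=1, how="all")
print(clean.shape)
```

If the real files behave this way, `dropna(axis=1, how="all")` would avoid hard-coding the positions 26 and 27.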

In [4]:
# read test data
test_df = pd.read_csv('PM_test.txt', sep=" ", header=None)
test_df.drop(test_df.columns[[26, 27]], axis=1, inplace=True)
test_df.columns = ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3',
                     's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14',
                     's15', 's16', 's17', 's18', 's19', 's20', 's21']

test_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413


In [0]:
# read ground truth data
truth_df = pd.read_csv('PM_truth.txt', sep=" ", header=None)
truth_df.drop(truth_df.columns[[1]], axis=1, inplace=True)

In [6]:
train_df = train_df.sort_values(['id','cycle'])
train_df.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044


As an example, let us check when Engine 1 fails. In the training set each engine is recorded until it fails, so the failure happens at the last cycle for which we have measurements.

In [7]:
fail = train_df.loc[train_df['id'] == 1, 'cycle'].max()
print ("Engine 1 fails at cycle ", fail)

Engine 1 fails at cycle  192


In the test data, the time series are incomplete: you can imagine that we have recordings up to "now", so we do not know at which future cycle the engine will fail.

Note that the ids in the training and test sets do not correspond: Engine 1 in the training set is not Engine 1 in the test set.

In [8]:
test_df [ test_df['id']==1 ]

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413
5,1,6,0.0012,0.0003,100.0,518.67,642.11,1579.12,1395.13,14.62,21.61,554.22,2388.0,9050.96,1.3,47.26,521.92,2388.08,8127.46,8.4238,0.03,392,2388,100.0,38.91,23.3467
6,1,7,-0.0,0.0002,100.0,518.67,642.11,1583.34,1404.84,14.62,21.61,553.89,2388.05,9051.39,1.3,47.31,522.01,2388.06,8134.97,8.3914,0.03,391,2388,100.0,38.85,23.3952
7,1,8,0.0006,-0.0,100.0,518.67,642.54,1580.89,1400.89,14.62,21.61,553.59,2388.05,9052.86,1.3,47.21,522.09,2388.06,8125.93,8.4213,0.03,393,2388,100.0,39.05,23.3224
8,1,9,-0.0036,0.0,100.0,518.67,641.88,1593.29,1412.28,14.62,21.61,554.49,2388.06,9048.55,1.3,47.37,522.03,2388.05,8134.15,8.4353,0.03,391,2388,100.0,39.1,23.4521
9,1,10,-0.0025,-0.0001,100.0,518.67,642.07,1585.25,1398.64,14.62,21.61,554.28,2388.04,9051.95,1.3,47.14,522.0,2388.06,8134.08,8.4093,0.03,391,2388,100.0,38.87,23.382


The last recorded cycle for Engine 1 in the test set is 31 (only the first rows are shown above), but the engine may have kept running after that.

## Feature engineering (pre-processing)

### Training Set

---



Let us add to the training table a column `remaining_duration` indicating, for each cycle, how many cycles remain before the engine fails.

In [9]:
failing_cycle = \
   pd.DataFrame(train_df.groupby('id')['cycle'].max() )

# reset_index is needed to make the following merge 
# operation work
failing_cycle = failing_cycle.reset_index()
failing_cycle.columns = ['id', 'max']
train_df_aug = train_df.merge(failing_cycle, on=['id'], 
                          how='left')
train_df_aug['remaining_duration'] = train_df_aug['max'] - \
    train_df_aug['cycle']
train_df_aug.drop('max', axis=1, inplace=True)
train_df_aug.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,remaining_duration
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187
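
The same column can also be computed without the intermediate merge. This is a minimal sketch on a made-up toy table (the name `toy` and its values are hypothetical); it only assumes the `id` and `cycle` column names used in this notebook.

```python
import pandas as pd

# Toy data: engine 1 runs 3 cycles, engine 2 runs 2 cycles (made-up values)
toy = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                    "cycle": [1, 2, 3, 1, 2]})

# transform("max") broadcasts each engine's last cycle back onto its rows
last_cycle = toy.groupby("id")["cycle"].transform("max")
toy["remaining_duration"] = last_cycle - toy["cycle"]
print(toy)
```

On the last recorded cycle of each engine the remaining duration is 0, which matches the convention that a training engine fails at its last recorded cycle.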


#### Normalize

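We normalize the measurements so that no single feature with a large numeric range dominates the others. Every column is rescaled to [0, 1] with min-max scaling, except `id`, `cycle` and `remaining_duration`. (The code below also excludes `label1`; no such column exists in this table, so that entry has no effect.)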

In [0]:
cols_to_normalize = \
    train_df.columns.difference(['id','cycle','remaining_duration','label1'])
min_max_scaler = preprocessing.MinMaxScaler()
norm_train_df = pd.DataFrame(min_max_scaler.fit_transform(train_df_aug[cols_to_normalize]), 
                             columns=cols_to_normalize, 
                             index=train_df_aug.index)
the_rest_of_train_df = \
    train_df_aug[train_df_aug.columns.difference(cols_to_normalize)]


# Since the_rest_of_train_df and norm_train_df have the same
# index, we can join them and be sure that the values of 
# each row across columns will be correct
join_df = the_rest_of_train_df.join(norm_train_df)

# At this point, the order of columns has changed. We reset
# it as in the original dataframe
train_df_norm = join_df.reindex(columns = train_df_aug.columns)
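
To make the rescaling concrete, here is a minimal sketch on made-up values (the `toy` frame is hypothetical). It checks that `MinMaxScaler` with its default range applies (x - min) / (max - min) to each column, and shows what happens to a column that never changes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy features: 'a' varies, 'b' is constant (made-up values)
toy = pd.DataFrame({"a": [10.0, 20.0, 40.0], "b": [5.0, 5.0, 5.0]})

scaled = MinMaxScaler().fit_transform(toy)

# Manual formula for the varying column: (x - min) / (max - min)
manual_a = (toy["a"] - toy["a"].min()) / (toy["a"].max() - toy["a"].min())

print(scaled[:, 0])  # same values as manual_a
print(scaled[:, 1])  # the constant column is mapped to 0
```

A constant column carries no information after scaling, so sensors that never vary in the training data could be dropped; this notebook keeps them.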


### Test set

The engines in the test set are different from the engines in the training set, even when they have the same id.
The test data is built from the test file together with the ground truth file.

#### Normalize

In [0]:
norm_test_df = pd.DataFrame(min_max_scaler.transform(test_df[cols_to_normalize]), 
                            columns=cols_to_normalize, 
                            index=test_df.index)
test_join_df = \
    test_df[test_df.columns.difference(cols_to_normalize)].join(norm_test_df)

# Reorder columns
test_df_norm = test_join_df.reindex(columns = test_df.columns)
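
Note that the test set is transformed with the scaler fitted on the training set, not refitted. A consequence worth keeping in mind is that scaled test values are not guaranteed to stay in [0, 1]. The sketch below illustrates this with made-up one-feature arrays (`train` and `test` here are hypothetical, not the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-feature data: the test range is wider than the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [20.0], [35.0]])

scaler = MinMaxScaler().fit(train)    # learns min=10 and max=30 from train only
test_scaled = scaler.transform(test)  # reuses the training min and max
print(test_scaled.ravel())            # values below 0 and above 1 are possible
```

Reusing the training statistics keeps information about the test set out of the preprocessing, which is why `transform` is called here instead of `fit_transform`.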


#### Other processing

We create the test dataset by adding `remaining_duration` to each row.
Before doing that, we compute the cycle at which each engine in the test set fails: the last recorded cycle plus the remaining life given in the ground truth file.

In [0]:
last_measurement = \
    pd.DataFrame(test_df.groupby('id')['cycle'].max()).reset_index()
last_measurement.columns = ['id','max']

truth_df.columns = ['remaining_life']
truth_df_aug = truth_df.copy(deep=True)
truth_df_aug['id'] = truth_df.index + 1
truth_df_aug['fail_cycle'] = \
    last_measurement['max']+truth_df['remaining_life']

In [15]:
truth_df_aug.drop('remaining_life', axis=1, inplace=True)
truth_df_aug.head()


Unnamed: 0,id,fail_cycle
0,1,143
1,2,147
2,3,195
3,4,188
4,5,189
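
As a worked check: Engine 1 was last recorded at cycle 31 and its `fail_cycle` is 143, so its remaining life in the ground truth file must be 143 - 31 = 112. The sketch below repeats the computation on made-up toy tables (`toy_test`, `toy_truth` and their values are hypothetical) and joins on `id` explicitly, as an alternative to relying on the two tables sharing the same row order.

```python
import pandas as pd

# Toy test recordings: engine 1 observed for 3 cycles, engine 2 for 2 (made up)
toy_test = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                         "cycle": [1, 2, 3, 1, 2]})
# Toy ground truth: row i holds the remaining life of engine i + 1 (made up)
toy_truth = pd.DataFrame({"remaining_life": [10, 4]})

last_seen = toy_test.groupby("id")["cycle"].max().reset_index(name="max")
toy_truth["id"] = toy_truth.index + 1

# Merging on id makes the alignment explicit instead of relying on row order
fail = last_seen.merge(toy_truth, on="id")
fail["fail_cycle"] = fail["max"] + fail["remaining_life"]
print(fail[["id", "fail_cycle"]])
```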


Now we can assign the `remaining_duration` to each engine at each cycle.

In [16]:
test_df_merged = test_df_norm.merge(truth_df_aug, on=['id'],
                                    how='left')
test_df_merged['remaining_duration'] = \
    test_df_merged['fail_cycle']-test_df_merged['cycle']
test_df_merged.drop('fail_cycle', axis=1,inplace=True)
test_df_merged.head()

Unnamed: 0,id,cycle,setting1,setting2,setting3,s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20,s21,remaining_duration
0,1,1,0.632184,0.75,0.0,0.0,0.545181,0.310661,0.269413,0.0,1.0,0.652174,0.212121,0.127614,0.0,0.208333,0.646055,0.220588,0.13216,0.308965,0.0,0.333333,0.0,0.0,0.55814,0.661834,142
1,1,2,0.344828,0.25,0.0,0.0,0.150602,0.379551,0.222316,0.0,1.0,0.805153,0.166667,0.146684,0.0,0.386905,0.739872,0.264706,0.204768,0.213159,0.0,0.416667,0.0,0.0,0.682171,0.686827,141
2,1,3,0.517241,0.583333,0.0,0.0,0.376506,0.346632,0.322248,0.0,1.0,0.68599,0.227273,0.158081,0.0,0.386905,0.69936,0.220588,0.15564,0.458638,0.0,0.416667,0.0,0.0,0.728682,0.721348,140
3,1,4,0.741379,0.5,0.0,0.0,0.370482,0.285154,0.408001,0.0,1.0,0.679549,0.19697,0.105717,0.0,0.255952,0.573561,0.25,0.17009,0.257022,0.0,0.25,0.0,0.0,0.666667,0.66211,139
4,1,5,0.58046,0.5,0.0,0.0,0.391566,0.352082,0.332039,0.0,1.0,0.694042,0.166667,0.102396,0.0,0.27381,0.73774,0.220588,0.152751,0.300885,0.0,0.166667,0.0,0.0,0.658915,0.716377,138
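
The first row above is consistent with the earlier table: Engine 1 fails at cycle 143, so at cycle 1 it has 143 - 1 = 142 cycles left. A further sanity check one might run is that, at the last recorded cycle of each engine, `remaining_duration` equals the remaining life from the ground truth file. Here is a sketch on a made-up toy table (all values hypothetical).

```python
import pandas as pd

# Toy merged table: engine 1 fails at cycle 13, engine 2 at cycle 6 (made up)
toy = pd.DataFrame({"id": [1, 1, 1, 2, 2],
                    "cycle": [1, 2, 3, 1, 2],
                    "fail_cycle": [13, 13, 13, 6, 6]})
toy["remaining_duration"] = toy["fail_cycle"] - toy["cycle"]

# Remaining duration at the last recorded cycle of each engine
at_last = (toy.sort_values(["id", "cycle"])
              .groupby("id")["remaining_duration"]
              .last())
print(at_last)
```

Unlike in the training set, these values are not 0 at the last row, because the test recordings stop before the failure.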
