<a href="https://colab.research.google.com/github/andreaaraldo/machine-learning-for-networks/blob/master/08.predictive-maintenance/Predictive-maintenance.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

We have a set of aircraft engines measures. A sample is a vector of recordings of a certain engine in some point in time.

The goal is to predict, by observing these measures, the risk of failure of engines, to see which one should be checked or replaced before the others.

To this aim, we use Cox Proportional Hazards model.

The dataset is also used in a notebook on [Deep Learning for Predictive Maintenance (by Azure)](https://github.com/Azure/lstms_for_predictive_maintenance/blob/master/Deep%20Learning%20Basics%20for%20Predictive%20Maintenance.ipynb), where they apply another method (Long-Short Term Memory).

# Need to configure packages

We first need to revert to an older version of scikit-learn, for a compatibility issue with `scikit-survival`, the library we are going to use for our Survival Analysis

In [1]:
!pip uninstall scikit-learn -y
!pip install scikit-learn==0.22
!pip install scikit-survival

Uninstalling scikit-learn-0.22:
  Successfully uninstalled scikit-learn-0.22
Collecting scikit-learn==0.22
  Using cached https://files.pythonhosted.org/packages/2e/d0/860c4f6a7027e00acff373d9f5327f4ae3ed5872234b3cbdd7bcb52e5eff/scikit_learn-0.22-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: scikit-learn
Successfully installed scikit-learn-0.22


In [0]:
import pandas as pd

from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.metrics import concordance_index_censored

# Load the dataset

In [3]:
! wget https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/dataset/transformed/test_set.csv
! wget https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/dataset/transformed/training_set.csv

--2020-05-19 23:34:13--  https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/dataset/transformed/test_set.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4071170 (3.9M) [text/plain]
Saving to: ‘test_set.csv.1’


2020-05-19 23:34:13 (93.9 MB/s) - ‘test_set.csv.1’ saved [4071170/4071170]

--2020-05-19 23:34:15--  https://raw.githubusercontent.com/andreaaraldo/machine-learning-for-networks/master/08.predictive-maintenance/dataset/transformed/training_set.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting res

In [0]:
df_trn = pd.read_csv("training_set.csv")
df_test = pd.read_csv("training_set.csv")

In [7]:
df_trn.head()

Unnamed: 0,setting1,setting2,s2,s3,s4,s6,s7,s8,s9,s11,s12,s13,s14,s15,s17,s20,s21,remaining_duration,failure_observed
0,0.45977,0.166667,0.183735,0.406802,0.309757,1.0,0.726248,0.242424,0.109755,0.369048,0.633262,0.205882,0.199608,0.363986,0.333333,0.713178,0.724662,191,True
1,0.609195,0.25,0.283133,0.453019,0.352633,1.0,0.628019,0.212121,0.100242,0.380952,0.765458,0.279412,0.162813,0.411312,0.333333,0.666667,0.731014,190,True
2,0.252874,0.75,0.343373,0.369523,0.370527,1.0,0.710145,0.272727,0.140043,0.25,0.795309,0.220588,0.171793,0.357445,0.166667,0.627907,0.621375,189,True
3,0.54023,0.5,0.343373,0.256159,0.331195,1.0,0.740741,0.318182,0.124518,0.166667,0.889126,0.294118,0.174889,0.166603,0.333333,0.573643,0.662386,188,True
4,0.390805,0.333333,0.349398,0.257467,0.404625,1.0,0.668277,0.242424,0.14996,0.255952,0.746269,0.235294,0.174734,0.402078,0.416667,0.589147,0.704502,187,True


In [12]:
X_trn = df_trn.drop(columns=["remaining_duration", "failure_observed"])
y_trn = df_trn[["failure_observed", "remaining_duration"]]

X_test = df_test.drop(columns=["remaining_duration", "failure_observed"])
y_test = df_test[["failure_observed", "remaining_duration"]]

y_trn

Unnamed: 0,failure_observed,remaining_duration
0,True,191
1,True,190
2,True,189
3,True,188
4,True,187
...,...,...
20626,True,4
20627,True,3
20628,False,2
20629,True,1


In [9]:
print(f'Number of samples: {len(y_trn)}')
print(f'Number of right censored samples: {len(y_trn.query("failure_observed == False"))}')
print(f'Percentage of right censored samples: {100*len(y_trn.query("failure_observed == False"))/len(y_trn):.1f}%')

Number of samples: 20631
Number of right censored samples: 2155
Percentage of right censored samples: 10.4%


The dataset is already scaled and all the columns are numerical. We do not need to do pre-processing for that.

`CoxPHSurvivalAnalysis` wants to ingest the `y` in a specific format.

In [0]:
y_trn_record = y_trn.to_records(index=False)

Now we can train our model.

In [11]:
model = CoxPHSurvivalAnalysis()
model.fit(X_trn, y_trn_record)

CoxPHSurvivalAnalysis(alpha=0, n_iter=100, ties='breslow', tol=1e-09, verbose=0)

We can now predict the "risk scores", indicating the risk of failure. These risks are centered around a baseline, found during training.

In [20]:
y_pred = model.predict(X_test)
y_pred[0:10]



array([ 0.06011228,  0.10466007, -0.14488193, -0.65472335,  0.20204911,
       -0.47950975, -0.18609595, -0.51670297, -0.50246697, -0.25945973])

The performance of our survival model can be summarized by the concordance index. Intuitively, it is the fraction of pair of samples A and B such that the model predicted that A had less survival probability than B and indeed A died before B (the engine A failed before engine B).

In [14]:
conc_idx = concordance_index_censored(y_test["failure_observed"], 
                        y_test["remaining_duration"], y_pred)

print(f'The c-index of Cox is given by {conc_idx[0]:.3f}')



The c-index of Cox is given by 0.811
