# Predictive Maintenance

## Step 1: Preprocessing

### Prerequisites

For this notebook you need to install:

- `pandas`
- `numpy`
- `scikit-learn` (imported as `sklearn`)

The easiest way to install these libraries is with `pip`, the Python package installer:

    pip install pandas
    pip install numpy
    pip install scikit-learn

This should install everything you need.
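
As a quick sanity check after installation, the sketch below prints the installed version of each library. It is an illustrative check under the assumption that the packages were installed from PyPI, where the distribution is named `scikit-learn` even though it is imported as `sklearn`.

```python
from importlib.metadata import version

# Distribution names as published on PyPI (scikit-learn is imported as `sklearn`)
distributions = ("pandas", "numpy", "scikit-learn")
installed = {name: version(name) for name in distributions}

for name, ver in installed.items():
    print(f"{name}: {ver}")
```

If one of the packages is missing, `version` raises `PackageNotFoundError`, which points directly at the library that still needs to be installed.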

### What does this file do?

This file constitutes the first step of the predictive maintenance process for the NASA turbines dataset.

### When do I need to run it?

This file extracts the raw data from the `..\input\` folder and preprocesses it to create usable data for the rest of the process.

Thus, this script must be rerun each time the raw data changes (new train set, new test set, or new set of real-world failure observations for the test engines).

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

## 1. Data Ingestion

In this step, the data is read from `.txt` files.

This dataset contains records of NASA turbines. The train set holds the engine run-to-failure data. The test set holds engine operating data in which no failure event is recorded. Finally, the truth set contains the true number of remaining cycles for each engine in the test data.

In [2]:
names = ['id', 'cycle', 'setting1', 'setting2', 'setting3', 's1', 's2', 's3','s4', 's5', 's6', 's7', 's8',
         's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']

# read training data
train_data = pd.read_csv('input/TrainSet.txt', sep=" ", header=None)
train_data.drop(train_data.columns[[26, 27]], axis=1, inplace=True)
train_data.columns = names

train_data = train_data.sort_values(['id','cycle'])

# read test data
test_data = pd.read_csv('input/TestSet.txt', sep=" ", header=None)
test_data.drop(test_data.columns[[26, 27]], axis=1, inplace=True)
test_data.columns = names

# read ground truth data
truth_df = pd.read_csv('input/TestSet_RUL.txt', sep=" ", header=None)
truth_df.drop(truth_df.columns[[1]], axis=1, inplace=True)
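
The `drop` calls above presumably remove empty columns that appear because each line of the raw files ends with trailing spaces and is split on a single space. A minimal sketch of an alternative, assuming the files are plain whitespace-delimited text: splitting on runs of whitespace with `sep=r"\s+"` avoids the empty columns in the first place. The two-row sample below is made up for illustration and is not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical sample: four values per line, followed by two trailing spaces
sample = "1 1 0.5 0.25  \n1 2 0.75 0.5  \n"

# Splitting on a single space turns each trailing space into an extra empty column
naive = pd.read_csv(io.StringIO(sample), sep=" ", header=None)

# Splitting on runs of whitespace ignores the trailing spaces
clean = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print("single-space separator:", naive.shape)
print("whitespace separator:  ", clean.shape)
```

With this approach the column names can be assigned directly through the `names` argument of `read_csv`, without a separate `drop` step.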

In [3]:
print("This is the size of the train dataset: {} entries and {} features".format(train_data.shape[0], 
                                                                                 train_data.shape[1]))
print("This is the size of the test dataset: {} entries and {} features".format(test_data.shape[0],
                                                                                test_data.shape[1]))
print("This is the size of the truth dataset: {} entries and {} features".format(truth_df.shape[0],
                                                                                 truth_df.shape[1]))

This is the size of the train dataset: 20631 entries and 26 features
This is the size of the test dataset: 13096 entries and 26 features
This is the size of the truth dataset: 100 entries and 1 features


In [4]:
# count the distinct turbine ids in the train set
n_turb = train_data["id"].nunique()
n_train, n_features = train_data.shape
print("There are {} turbines in each dataset".format(n_turb))

There are 100 turbines in each dataset


## 2. Data Preprocessing

This step adds new features to the train and test sets, which will serve as the labels for the upcoming prediction algorithms.

### 2.1 Train Set

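For the train set, we calculate the Remaining Useful Life (RUL) for each cycle of each turbine, i.e. the number of cycles left before the turbine's last recorded cycle.

Then, we generate labels for a hypothetical classification step that answers the question: is a specific engine going to fail within n cycles? `label1` is 1 when the RUL is at most `w1` (30 cycles) and 0 otherwise; `label2` starts as a copy of `label1` and is set to 2 when the RUL is at most `w0` (15 cycles). These labels aren't used in the following learning algorithms, but could be useful for a future classification step.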

In [5]:
# Data Labeling - generate column RUL
rul = pd.DataFrame(train_data.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
train_data = train_data.merge(rul, on=['id'], how='left')
train_data['RUL'] = train_data['max'] - train_data['cycle']
train_data.drop('max', axis=1, inplace=True)

# generate label columns
w1 = 30
w0 = 15
train_data['label1'] = np.where(train_data['RUL'] <= w1, 1, 0 )
train_data['label2'] = train_data['label1']
train_data.loc[train_data['RUL'] <= w0, 'label2'] = 2
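
To make the labeling concrete, here is a small sketch of the same logic on made-up data. The toy frame, the window sizes, and the use of `groupby(...).transform` and `np.select` are illustrative choices, not the notebook's own code; the windows are shrunk so that all three classes show up in five rows.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): engine 1 runs 3 cycles, engine 2 runs 2 cycles
toy = pd.DataFrame({"id": [1, 1, 1, 2, 2], "cycle": [1, 2, 3, 1, 2]})

# RUL = last recorded cycle of the engine minus the current cycle
toy["RUL"] = toy.groupby("id")["cycle"].transform("max") - toy["cycle"]

# Small windows chosen only so the toy data covers every class
w1, w0 = 1, 0

# label1: 1 if the engine fails within w1 cycles, else 0
toy["label1"] = (toy["RUL"] <= w1).astype(int)

# label2: 2 if within w0 cycles, 1 if within w1 cycles, else 0
toy["label2"] = np.select([toy["RUL"] <= w0, toy["RUL"] <= w1], [2, 1], default=0)

print(toy)
```

Using `transform("max")` keeps the result aligned with the original rows, so no intermediate merge or temporary `max` column is needed.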

As the values of the different features span very different ranges, it is useful to normalize them. Here, we use min-max normalization, which rescales each column to the range 0 to 1.

Only the settings and the `s1` to `s21` columns are normalized (in place), along with the cycle number (in a separate column, `cycle_norm`). The other variables are left untouched.

In [6]:
# MinMax normalization (from 0 to 1)
train_data['cycle_norm'] = train_data['cycle']
cols_normalize = train_data.columns.difference(['id','cycle','RUL','label1','label2'])
min_max_scaler = MinMaxScaler()
norm_train_data = pd.DataFrame(min_max_scaler.fit_transform(train_data[cols_normalize]),
                               columns=cols_normalize, index=train_data.index)
join_data = train_data[train_data.columns.difference(cols_normalize)].join(norm_train_data)
train_data = join_data.reindex(columns = train_data.columns)

print("The size of the train data set is now: {} entries and {} features.".format(train_data.shape[0],
                                                                                  train_data.shape[1]))

train_data.to_csv('input/train.csv', encoding='utf-8',index = None)
print("Train Data saved as input/train.csv")

The size of the train data set is now: 20631 entries and 30 features.
Train Data saved as input/train.csv
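
For reference, the min-max scaling used in the cell above maps each value `x` of a column to `(x - min) / (max - min)`. The sketch below checks this against `MinMaxScaler` on a made-up array; the second column is constant, a case that `MinMaxScaler` maps to 0 rather than dividing by zero.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: a varying column and a constant column
x = np.array([[10.0, 5.0],
              [20.0, 5.0],
              [40.0, 5.0]])

scaled = MinMaxScaler().fit_transform(x)

# Manual min-max formula for the varying column
col = x[:, 0]
manual = (col - col.min()) / (col.max() - col.min())

print(scaled[:, 0], manual)  # both follow the same formula
print(scaled[:, 1])          # constant column is mapped to 0
```

Constant columns carry no information after scaling, so it may be worth checking whether any setting or sensor column is constant in the train set.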


### 2.2 Test Set

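The process is similar to the one used for the train set, and it reuses the scaler fitted on the train data.

However, the RUL is calculated from the values in the truth data set: the truth set gives the number of cycles remaining after the last recorded cycle of each test engine.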
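
As a worked illustration on made-up numbers: the cycle at which a test engine fails is its last observed cycle plus the remaining cycles from the truth set, and the RUL at any row is that failure cycle minus the current cycle. The sketch below assumes, as the notebook does, that row `i` of the truth file belongs to engine id `i + 1`; the lookup by `id` via `map` is an illustrative variation.

```python
import pandas as pd

# Hypothetical test data: engine 1 observed for 2 cycles, engine 2 for 3 cycles
test = pd.DataFrame({"id": [1, 1, 2, 2, 2], "cycle": [1, 2, 1, 2, 3]})

# Hypothetical truth data: remaining cycles after the last observed cycle
truth = pd.DataFrame({"more": [10, 5]})
truth["id"] = truth.index + 1  # assumption: row i belongs to engine id i + 1

# Failure cycle per engine = last observed cycle + remaining cycles
last_cycle = test.groupby("id")["cycle"].max()
failure_cycle = last_cycle + truth.set_index("id")["more"]

# RUL per row = failure cycle of the engine - current cycle
test["RUL"] = test["id"].map(failure_cycle) - test["cycle"]

print(test)
```

Aligning on `id` rather than on row position keeps the result correct even if the rows of one of the frames are reordered.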

In [7]:
# MinMax normalization (from 0 to 1)
test_data['cycle_norm'] = test_data['cycle']
norm_test_data = pd.DataFrame(min_max_scaler.transform(test_data[cols_normalize]),
                              columns=cols_normalize, index=test_data.index)
test_join_data = test_data[test_data.columns.difference(cols_normalize)].join(norm_test_data)
test_data = test_join_data.reindex(columns = test_data.columns)
test_data = test_data.reset_index(drop=True)

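# generate RUL: last observed cycle + remaining cycles from the truth set
rul = pd.DataFrame(test_data.groupby('id')['cycle'].max()).reset_index()
rul.columns = ['id', 'max']
truth_df.columns = ['more']
# assumes row i of the truth file corresponds to engine id i + 1
truth_df['id'] = truth_df.index + 1
# rul and truth_df are both ordered by id, so the sum aligns row by row
truth_df['max'] = rul['max'] + truth_df['more']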
truth_df.drop('more', axis=1, inplace=True)
test_data = test_data.merge(truth_df, on=['id'], how='left')
test_data['RUL'] = test_data['max'] - test_data['cycle']
test_data.drop('max', axis=1, inplace=True)

# generate label columns w0 and w1 for test data
test_data['label1'] = np.where(test_data['RUL'] <= w1, 1, 0 )
test_data['label2'] = test_data['label1']
test_data.loc[test_data['RUL'] <= w0, 'label2'] = 2

print("The size of the test data set is now: {} entries and {} features.".format(test_data.shape[0],
                                                                                 test_data.shape[1]))

test_data.to_csv('input/test.csv', encoding='utf-8',index = None)
print("Test Data saved as input/test.csv")

The size of the test data set is now: 13096 entries and 30 features.
Test Data saved as input/test.csv
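
One consequence of reusing the train-fitted scaler, sketched below on made-up values: `transform` applies the minimum and maximum learned from the train set, so a test value outside the train range is scaled outside 0 to 1. This is expected behaviour, not an error, and it keeps train and test on the same scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[0.0], [10.0]])
test = np.array([[5.0], [12.0]])  # 12.0 is above the train maximum

# Fit on the train set only, then apply the same scaling to the test set
scaler = MinMaxScaler().fit(train)
scaled_test = scaler.transform(test)

print(scaled_test.ravel())
```

Fitting a second scaler on the test set would instead give the two sets different scales, which is why the notebook calls `transform` rather than `fit_transform` here.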
