# Predictive maintenance

## Part 1: Data Preparation

The original data can be [downloaded from this link](https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/#turbofan). Because the train and test datasets are structured differently, we make them uniform before starting the data exploration and model building process. We also convert the data into a more natural format for Vaex.

In [1]:
import vaex

### Read the data

Each record contains readings from 21 sensors. These are their names and meanings:


| Name      | Description                     | Unit    |
|-----------|---------------------------------|---------|    
| T2        | Total temperature at fan inlet  | °R      |    
| T24       | Total temperature at LPC outlet | °R      |    
| T30       | Total temperature at HPC outlet | °R      |    
| T50       | Total temperature at LPT outlet | °R      |    
| P2        | Pressure at fan inlet           | psia    |    
| P15       | Total pressure in bypass-duct   | psia    |    
| P30       | Total pressure at HPC outlet    | psia    |    
| Nf        | Physical fan speed              | rpm     |    
| Nc        | Physical core speed             | rpm     |    
| epr       | Engine pressure ratio (P50/P2)  | --      |    
| Ps30      | Static pressure at HPC outlet   | psia    |    
| phi       | Ratio of fuel flow to Ps30      | pps/psi |    
| NRf       | Corrected fan speed             | rpm     |    
| NRc       | Corrected core speed            | rpm     |    
| BPR       | Bypass Ratio                    | --      |    
| farB      | Burner fuel-air ratio           | --      |    
| htBleed   | Bleed Enthalpy                  | --      |    
| Nf_dmd    | Demanded fan speed              | rpm     |    
| PCNfR_dmd | Demanded corrected fan speed    | rpm     |    
| W31       | HPT coolant bleed               | lbm/s   |    
| W32       | LPT coolant bleed               | lbm/s   |    


In [2]:
column_names = ['unit_number', 'time_in_cycles', 'setting_1', 'setting_2', 'setting_3',
                'T2', 'T24', 'T30', 'T50', 'P2', 'P15', 'P30', 'Nf', 'Nc', 'epr', 'Ps30', 'phi', 
                'NRf', 'NRc', 'BPR', 'farB', 'htBleed', 'Nf_dmd', 'PCNfR_dmd', 'W31', 'W32']


# The training data (raw string so that \s is not treated as a string escape)
train_data = vaex.read_csv("./data/train_FD001.txt", sep=r'\s+', names=column_names)

# The testing data
test_data = vaex.read_csv("./data/test_FD001.txt", sep=r'\s+', names=column_names)

# The "answer" to the test data
y_test = vaex.read_csv('./data/RUL_FD001.txt', names=['remaining_cycles'])
y_test['unit_number'] = vaex.vrange(1, 101)
y_test['unit_number'] = y_test.unit_number.astype('int')
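
To make the file layout concrete, here is a minimal sketch of how one whitespace-separated record maps onto `column_names`. It assumes the raw files hold one record per line with fields separated by runs of whitespace, which is what the `\s+` separator implies. The sample line is reconstructed from the first row of the training preview rather than copied from the raw file, and `parse_record` is a hypothetical helper, not part of Vaex.

```python
import re

column_names = ['unit_number', 'time_in_cycles', 'setting_1', 'setting_2', 'setting_3',
                'T2', 'T24', 'T30', 'T50', 'P2', 'P15', 'P30', 'Nf', 'Nc', 'epr', 'Ps30', 'phi',
                'NRf', 'NRc', 'BPR', 'farB', 'htBleed', 'Nf_dmd', 'PCNfR_dmd', 'W31', 'W32']

# Assumed raw layout: one record per line, fields separated by runs of whitespace.
# The values are reconstructed from the first row of the training preview.
sample_line = ("1 1 -0.0007 -0.0004 100.0 518.67 641.82 1589.7 1400.6 14.62 21.61 "
               "554.36 2388.06 9046.19 1.3 47.47 521.66 2388.02 8138.62 8.4195 0.03 "
               "392 2388 100.0 39.06 23.419  ")


def parse_record(line, names):
    """Hypothetical helper: split one line on whitespace and label the fields."""
    fields = re.split(r'\s+', line.strip())
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, (float(f) for f in fields)))


record = parse_record(sample_line, column_names)
print(len(column_names))  # 2 identifiers + 3 settings + 21 sensors = 26
print(record['T24'], record['W32'])
```

The length check is there because a mismatch between the number of fields and the number of names is the most likely way for this kind of header-less file to be misread.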

### Create proper train and test datasets

- In the training set, the engines are run until failure occurs, so we can calculate the target variable, i.e. the RUL (Remaining Useful Life), from the number of cycles each engine ran before failing.
- In the test set, the engines are run for some time, and our goal is to predict their RULs. The true RULs are provided in a separate file, so we need to join them to the test data to make them available for scoring and for estimating model performance.
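
The two cases above reduce to simple arithmetic, sketched here in plain Python. The function names and the numbers are illustrative assumptions, not values taken from the dataset files or from the Vaex API.

```python
def rul_train(max_cycles, time_in_cycles):
    """Training set: the last recorded cycle is the point of failure."""
    return max_cycles - time_in_cycles


def rul_test(max_cycles, remaining_cycles, time_in_cycles):
    """Test set: `remaining_cycles` are still left after the last recorded cycle."""
    return max_cycles + remaining_cycles - time_in_cycles


# Hypothetical training engine that failed after 192 cycles
print([rul_train(192, t) for t in (1, 2, 192)])      # [191, 190, 0]

# Hypothetical test engine observed for 31 cycles, with 112 cycles left afterwards
print([rul_test(31, 112, t) for t in (1, 2, 31)])    # [142, 141, 112]
```

In the training case the RUL reaches zero at the last recorded cycle, while in the test case it ends at the value supplied in the separate answer file.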

In [3]:
def prepare_data(data, y=None):
    """Add a RUL column to the data; pass y (the RUL answers) for the test set."""
    df = data.copy()  # Copy so as not to modify the underlying dataframe
    # Count how many cycles each unit is run for - groupby and count
    # (equals the last cycle number, assuming cycles are numbered consecutively from 1)
    g = df.groupby('unit_number').agg({'max_cycles': vaex.agg.count('time_in_cycles')})
    # Join to the main data - this adds the "max_cycles" column
    df = df.join(other=g, on='unit_number', how='left')

    # Calculate the RUL:
    if y is None:  # Train data -> the last recorded cycle is the point of failure
        df['RUL'] = df.max_cycles - df.time_in_cycles
        # Drop the column that is not needed anymore
        df = df.drop(columns=['max_cycles'])
    else:  # Test data -> add the answers to calculate the RUL
        # Join the answers
        df = df.join(y, on='unit_number', how='left')
        df['RUL'] = df.max_cycles + df.remaining_cycles - df.time_in_cycles
        # Drop the columns that are not needed anymore
        df = df.drop(columns=['remaining_cycles', 'max_cycles'])
    return df

In [4]:
# Add the RUL to the train and test sets
df_train = prepare_data(train_data)
df_test = prepare_data(test_data, y=y_test)
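
One quick sanity check on the result is that, within each unit, the sum of `time_in_cycles` and `RUL` should be constant, which is another way of saying that the RUL drops by exactly one per cycle. The following is a minimal sketch of that check in plain Python. It works on a list of dictionaries rather than on the Vaex dataframes themselves, and both `rul_is_consistent` and the toy rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def rul_is_consistent(rows):
    """Return True if time_in_cycles + RUL is constant within every unit.

    Each row is a dict with 'unit_number', 'time_in_cycles' and 'RUL' keys.
    """
    rows = sorted(rows, key=itemgetter('unit_number', 'time_in_cycles'))
    for _, unit_rows in groupby(rows, key=itemgetter('unit_number')):
        totals = {r['time_in_cycles'] + r['RUL'] for r in unit_rows}
        if len(totals) != 1:
            return False
    return True


# Toy rows (made up): unit 1 is consistent, unit 2 skips a value in its RUL
good = [{'unit_number': 1, 'time_in_cycles': t, 'RUL': 5 - t} for t in range(1, 6)]
bad = good + [{'unit_number': 2, 'time_in_cycles': 1, 'RUL': 9},
              {'unit_number': 2, 'time_in_cycles': 2, 'RUL': 7}]

print(rul_is_consistent(good))  # True
print(rul_is_consistent(bad))   # False
```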


### Quick preview of the datasets

In [5]:
df_train

#,unit_number,time_in_cycles,setting_1,setting_2,setting_3,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,-0.0007,-0.0004,100.0,518.67,641.82,1589.7,1400.6,14.62,21.61,554.36,2388.06,9046.19,1.3,47.47,521.66,2388.02,8138.62,8.4195,0.03,392,2388,100.0,39.06,23.419,191
1,1,2,0.0019,-0.0003,100.0,518.67,642.15,1591.82,1403.14,14.62,21.61,553.75,2388.04,9044.07,1.3,47.49,522.28,2388.07,8131.49,8.4318,0.03,392,2388,100.0,39.0,23.4236,190
2,1,3,-0.0043,0.0003,100.0,518.67,642.35,1587.99,1404.2,14.62,21.61,554.26,2388.08,9052.94,1.3,47.27,522.42,2388.03,8133.23,8.4178,0.03,390,2388,100.0,38.95,23.3442,189
3,1,4,0.0007,0.0,100.0,518.67,642.35,1582.79,1401.87,14.62,21.61,554.45,2388.11,9049.48,1.3,47.13,522.86,2388.08,8133.83,8.3682,0.03,392,2388,100.0,38.88,23.3739,188
4,1,5,-0.0019,-0.0002,100.0,518.67,642.37,1582.85,1406.22,14.62,21.61,554.0,2388.06,9055.15,1.3,47.28,522.19,2388.04,8133.8,8.4294,0.03,393,2388,100.0,38.9,23.4044,187
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20626,100,196,-0.0004,-0.0003,100.0,518.67,643.49,1597.98,1428.63,14.62,21.61,551.43,2388.19,9065.52,1.3,48.07,519.49,2388.26,8137.6,8.4956,0.03,397,2388,100.0,38.49,22.9735,4
20627,100,197,-0.0016,-0.0005,100.0,518.67,643.54,1604.5,1433.58,14.62,21.61,550.86,2388.23,9065.11,1.3,48.04,519.68,2388.22,8136.5,8.5139,0.03,395,2388,100.0,38.3,23.1594,3
20628,100,198,0.0004,0.0,100.0,518.67,643.42,1602.46,1428.18,14.62,21.61,550.94,2388.24,9065.9,1.3,48.09,520.01,2388.24,8141.05,8.5646,0.03,398,2388,100.0,38.44,22.9333,2
20629,100,199,-0.0011,0.0003,100.0,518.67,643.23,1605.26,1426.53,14.62,21.61,550.68,2388.25,9073.72,1.3,48.39,519.67,2388.23,8139.29,8.5389,0.03,395,2388,100.0,38.29,23.064,1


In [6]:
df_test

#,unit_number,time_in_cycles,setting_1,setting_2,setting_3,T2,T24,T30,T50,P2,P15,P30,Nf,Nc,epr,Ps30,phi,NRf,NRc,BPR,farB,htBleed,Nf_dmd,PCNfR_dmd,W31,W32,RUL
0,1,1,0.0023,0.0003,100.0,518.67,643.02,1585.29,1398.21,14.62,21.61,553.9,2388.04,9050.17,1.3,47.2,521.72,2388.03,8125.55,8.4052,0.03,392,2388,100.0,38.86,23.3735,142
1,1,2,-0.0027,-0.0003,100.0,518.67,641.71,1588.45,1395.42,14.62,21.61,554.85,2388.01,9054.42,1.3,47.5,522.16,2388.06,8139.62,8.3803,0.03,393,2388,100.0,39.02,23.3916,141
2,1,3,0.0003,0.0001,100.0,518.67,642.46,1586.94,1401.34,14.62,21.61,554.11,2388.05,9056.96,1.3,47.5,521.97,2388.03,8130.1,8.4441,0.03,393,2388,100.0,39.08,23.4166,140
3,1,4,0.0042,0.0,100.0,518.67,642.44,1584.12,1406.42,14.62,21.61,554.07,2388.03,9045.29,1.3,47.28,521.38,2388.05,8132.9,8.3917,0.03,391,2388,100.0,39.0,23.3737,139
4,1,5,0.0014,0.0,100.0,518.67,642.51,1587.19,1401.92,14.62,21.61,554.16,2388.01,9044.55,1.3,47.31,522.15,2388.03,8129.54,8.4031,0.03,390,2388,100.0,38.99,23.413,138
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13091,100,194,0.0049,0.0,100.0,518.67,643.24,1599.45,1415.79,14.62,21.61,553.41,2388.02,9142.37,1.3,47.69,520.69,2388.0,8213.28,8.4715,0.03,394,2388,100.0,38.65,23.1974,24
13092,100,195,-0.0011,-0.0001,100.0,518.67,643.22,1595.69,1422.05,14.62,21.61,553.22,2388.05,9140.68,1.3,47.6,521.05,2388.09,8210.85,8.4512,0.03,395,2388,100.0,38.57,23.2771,23
13093,100,196,-0.0006,-0.0003,100.0,518.67,643.44,1593.15,1406.82,14.62,21.61,553.04,2388.11,9146.81,1.3,47.57,521.18,2388.04,8217.24,8.4569,0.03,395,2388,100.0,38.62,23.2051,22
13094,100,197,-0.0038,0.0001,100.0,518.67,643.26,1594.99,1419.36,14.62,21.61,553.37,2388.07,9148.85,1.3,47.61,521.33,2388.08,8220.48,8.4711,0.03,395,2388,100.0,38.66,23.2699,21


### Export the datasets to HDF5

In [7]:
df_train.export_hdf5('./data/data_train.hdf5')
df_test.export_hdf5('./data/data_test.hdf5')

The data is ready, and we can now start the modeling process.