The goal of this project is to use machine learning to detect patterns in the sensor data, so that stakeholders can make better use of the forecasts and act accordingly by maintaining the underlying system.

This notebook is kept in the order in which I worked through it. Weak intermediate results do not affect the final outcomes and predictions.

The next cell imports the Python libraries used in this notebook.

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_squared_error

from statsmodels.tsa.stattools import adfuller


# Data Loading

In [None]:
# access kaggle datasets online
!pip install kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d nphantawee/pump-sensor-data
!unzip pump-sensor-data.zip

In [None]:
# np.genfromtxt('sensor.csv', delimiter=',', dtype=None)
df = pd.read_csv('sensor.csv')

# Exploratory Data Analysis

In [None]:
df.info()

There are some missing values. Some columns will be dropped, as shown in an upcoming cell.
Columns with only a few missing values will be 'backfilled' with the pandas method at a later stage of this notebook.
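
As a minimal sketch of what backfilling does, the toy frame below (made-up values and hypothetical column names, not taken from `sensor.csv`) has gaps that are filled with the next valid observation in the same column. `bfill()` is the same operation as the `backfill()` alias used later in this notebook. A gap at the very end of a column has no later value to copy and stays NaN, which is one reason a `dropna` afterwards can still be useful.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the column names are hypothetical
toy = pd.DataFrame({
    'sensor_a': [1.0, np.nan, 3.0, np.nan],
    'sensor_b': [np.nan, 2.0, np.nan, 4.0],
})

# bfill() copies the next valid value backwards into each gap
filled = toy.bfill()

# Rows that still contain NaN (here: the trailing gap in sensor_a) are dropped
cleaned = filled.dropna()

print(filled)
print(cleaned)
```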

In [None]:
df.isnull().sum().plot(kind='bar', figsize=(12,1));
plt.title('Number of Missing Values in a Column');

In the next step, each sensor record is inspected visually.

In [None]:
# Transform machine status from strings to numbers
conditions = [(df['machine_status'] =='NORMAL'), (df['machine_status'] =='BROKEN'), (df['machine_status'] =='RECOVERING')]
choices = [1, 0, 0.5]
df['Operation'] = np.select(conditions, choices, default=0)

The sensors can be grouped by their behavior and by the absolute values they display. All sensors show stationary behavior, which means that their statistical properties, such as mean and variance, do not change over time. Non-stationary behavior would be values that rise or fall systematically over time.
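
To illustrate the difference, the following sketch compares the mean of the first and the second half of two synthetic series. This is only a crude check on made-up data and not a replacement for a formal test such as the augmented Dickey-Fuller test; the helper name `half_means` is introduced here for illustration.

```python
import numpy as np

def half_means(series):
    """Return the mean of the first and of the second half of a 1-D series."""
    values = np.asarray(series, dtype=float)
    mid = len(values) // 2
    return values[:mid].mean(), values[mid:].mean()

# Synthetic data, not the pump sensors
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 2000)
stationary = 5.0 + noise                          # fluctuates around a fixed level
trending = 5.0 + 0.01 * np.arange(2000) + noise   # level drifts upward over time

first_s, second_s = half_means(stationary)
first_t, second_t = half_means(trending)

print(f'stationary: {first_s:.2f} vs {second_s:.2f}')
print(f'trending:   {first_t:.2f} vs {second_t:.2f}')
```

For the stationary series both halves have nearly the same mean, while the trending series shows a clearly higher mean in the second half.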

In [None]:
# Computationally expensive calculation! Therefore commented out
# for i, s in enumerate(df.drop(['timestamp', 'machine_status', 'Operation'], axis = 1).backfill().T.to_numpy()):
#    result = adfuller(s)
#    print(f'Sensor {i}:', result[1])

Another visualisation of the dataset. The red dashed lines mark the points in time at which the machine failed.

In [None]:
# General overview of all sensors throughout the measurement.
# Red dashed lines mark machine failures.

ymin = 0
# Row positions at which the machine status is BROKEN
broken_idx = df[df['machine_status'] == 'BROKEN'].index
fig, axs = plt.subplots(9, 6, figsize = (14,20))

# 52 sensors (sensor_00 to sensor_51) on a 9 x 6 grid; the last two axes stay empty
for i, ax in enumerate(axs.ravel()[:52]):
    sensor_number = 'sensor_{:02d}'.format(i)
    ymax = df[sensor_number].max()
    ax.plot(df[sensor_number])
    ax.set_title(sensor_number)
    ax.vlines(x = broken_idx, ymin = ymin, ymax = ymax, color='red', linestyle='--')

# Adjust the spacing after all titles have been set
fig.tight_layout()

In [None]:
df.drop(['Unnamed: 0','sensor_00','sensor_15','sensor_50','sensor_51'], axis=1, inplace=True)

In [None]:
df.set_index('timestamp').plot(subplots =True, sharex = True, figsize = (20,50));

In [None]:
# Class counts of the machine status. The data set is highly imbalanced.
df.machine_status.value_counts()

In [None]:
# Status of the machine: 1 = operational, 0.5 = recovering, 0 = broken
df.set_index('timestamp').Operation.plot(figsize=(13,1));
plt.ylabel('Machine Status');

## Machine Learning without a Time Shift

In [None]:
df.set_index('timestamp', inplace=True)
df.index.freq = 'min' # Set the frequency of the time series index to one minute. It is needed for future steps.

In [None]:
df = df.backfill() # fillna('backfill') produced object-typed columns, so backfill() is used instead
df.dropna(inplace = True) # remove any rows that still contain NaNs after backfilling

In [None]:
# train/test split time series
train_df = df.loc[df.index < "2018-06-09 10:40:00"]
test_df = df.loc[df.index >= "2018-06-09 10:40:00"]
X_train = train_df.drop(['machine_status', 'Operation'], axis = 1)
y_train = train_df.Operation
X_test = test_df.drop(['machine_status', 'Operation'], axis = 1)
y_test = test_df.Operation

### Linear Regression

In [None]:
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title('Linear Regression Prediction');

In [None]:
print('RMSE for Linear Regression: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

### Random Forest

In [None]:
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title('Random Forest Prediction');

In [None]:
print('RMSE for Random Forest: ', "%.3f" % mean_squared_error(y_test, y_pred)**(1/2))

## Convert a Time Series to a Supervised Learning Problem: Sliding Window

In summary, the random forest shows a lower root mean square error (RMSE) than the linear regression and predicts the underlying machine status better. However, these examples are not predictions in advance: both algorithms predict the machine failure only within a minute, which is too short to take action or carry out a maintenance procedure. Therefore, a time shift of the features is necessary.
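
For reference, the RMSE is the square root of the mean squared difference between true and predicted values. A minimal NumPy sketch follows; the helper name `rmse` and the toy values are introduced here for illustration, and the result should agree with `mean_squared_error(y_true, y_hat)**(1/2)` as used in the cells above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values on the notebook's 0 / 0.5 / 1 status scale
y_true = [1.0, 1.0, 0.5, 0.0]
y_hat = [1.0, 0.5, 0.5, 0.5]

print(f'{rmse(y_true, y_hat):.3f}')
```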

The features are shifted by X steps so that a failure of the machine can be seen X minutes in advance.

You can play around with the time shift. Positive values push the time trace into the future, while negative values pull it back.
The shift is the amount of time before the machine stops working; 1440 minutes correspond to a single day.
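
The direction of the shift is easy to mix up, so here is a small sketch on a toy minute-by-minute frame. The values, the column names `sensor_lagged` and `Operation_ahead`, and the horizon `lag` are assumptions made for illustration.

```python
import pandas as pd

# Toy minute-by-minute series; values and names are illustrative only
idx = pd.date_range('2018-04-01 00:00', periods=6, freq='min')
toy = pd.DataFrame({'sensor': [10, 11, 12, 13, 14, 15],
                    'Operation': [1, 1, 1, 1, 0, 1]}, index=idx)

lag = 2  # hypothetical horizon in minutes

# Positive shift: each row receives the value recorded `lag` minutes earlier
toy['sensor_lagged'] = toy['sensor'].shift(lag)

# Negative shift: each row receives the value recorded `lag` minutes later,
# i.e. the status that lies `lag` minutes ahead of the current sensor reading
toy['Operation_ahead'] = toy['Operation'].shift(-lag)

print(toy)
```

With a positive shift, the first `lag` rows become NaN; with a negative shift, the last `lag` rows do. Pairing the current sensor columns with a label shifted by a negative amount gives a target that lies in the future relative to the features; whether that matches the intended setup should be checked against the columns actually used for training.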

In [None]:
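# Add copies of every column shifted by 180, 1440, 2880 and 5760 minutes.
# 'timestamp' is already the index at this point, so it does not appear in df.columns.
for i in df.columns:
    for t in [180, 1440, 2880, 5760]:
        df[f'{t}-{i}'] = df[i].shift(t)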

The next plot shows the number of NaNs in each column.

In [None]:
df.isna().sum().plot(kind='bar', figsize = (12, 1));
plt.title('Number of NaNs in a Column');

In [None]:
df.index.freq = 'min' # Set the frequency of the time series index to one minute. It is needed for future steps.
df = df.backfill() # fillna('backfill') produced object-typed columns, so backfill() is used instead
df.dropna(inplace = True) # remove any rows that still contain NaNs after backfilling

As visible in the upcoming plot, NaNs are no longer present.

In [None]:
df.isna().sum().plot(kind='bar', figsize = (12, 1));
plt.title('Number of NaNs in a Column');

In [None]:
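# Assuming 50 original columns (positions 0-49), the shifted copies start at position 50 and
# repeat in blocks of four (180, 1440, 2880, 5760), so a step of 4 selects all columns of one shift.
# Each frame combines these with the unshifted columns at positions 1-47.
df_180 = pd.concat([df[df.columns[50::4]], df[df.columns[1:48:1]]], axis = 1)
df_1440 = pd.concat([df[df.columns[51::4]], df[df.columns[1:48:1]]], axis = 1)
df_2880 = pd.concat([df[df.columns[52::4]], df[df.columns[1:48:1]]], axis = 1)
df_5760 = pd.concat([df[df.columns[53::4]], df[df.columns[1:48:1]]], axis = 1)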
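
The positional slices depend on the exact column order. As a hedged alternative sketch, the shifted columns could also be selected by their name prefix. The helper `select_shift` and the toy frame are assumptions made for illustration and are not part of the notebook.

```python
import pandas as pd

def select_shift(frame, shift):
    """Return the columns of `frame` whose name starts with '<shift>-'."""
    prefix = f'{shift}-'
    cols = [c for c in frame.columns if str(c).startswith(prefix)]
    return frame[cols]

# Toy frame mimicking the '<shift>-<column>' naming produced by the shift loop
toy = pd.DataFrame({
    'sensor_01': [1, 2], 'Operation': [1, 0],
    '180-sensor_01': [0, 1], '1440-sensor_01': [0, 0],
    '180-Operation': [1, 1], '1440-Operation': [1, 1],
})

print(select_shift(toy, 180).columns.tolist())
```

Selecting by name keeps working if columns are added or dropped earlier in the notebook, at the cost of relying on the naming scheme.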

### 180 Minutes in Advance Prediction

In [None]:
number = 180
df_number = df_180
# train/test split time series. The dataset is split roughly 50:50 for training and test
train_df = df_number.loc[df.index < "2018-06-09 10:40:00"]
test_df = df_number.loc[df.index >= "2018-06-09 10:40:00"]

# X_train = train_df.drop(['machine_status', 'Operation'], axis = 1)
X_train = train_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
y_train = train_df[f'{number}-Operation']

# X_test = test_df.drop(['machine_status', 'Operation'], axis = 1)
X_test = test_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
X_test_plot = test_df.drop([f'{number}-machine_status'], axis = 1) # this is needed for the upcoming plot
y_test = test_df[f'{number}-Operation']

In [None]:
# Times at which the machine status is broken
X_test_plot[X_test_plot[f'{number}-Operation'] == 0].index.tolist()

In [None]:
# Linear Regression 
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Linear Regression Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Linear Regression: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

In [None]:
# Random Forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Random Forest Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Random Forest: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

### 1440 Minutes in Advance Prediction

In [None]:
number = 1440
df_number = df_1440
# train/test split time series. The dataset is split roughly 50:50 for training and test
train_df = df_number.loc[df.index < "2018-06-09 10:40:00"]
test_df = df_number.loc[df.index >= "2018-06-09 10:40:00"]

# X_train = train_df.drop(['machine_status', 'Operation'], axis = 1)
X_train = train_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
y_train = train_df[f'{number}-Operation']

# X_test = test_df.drop(['machine_status', 'Operation'], axis = 1)
X_test = test_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
X_test_plot = test_df.drop([f'{number}-machine_status'], axis = 1) # this is needed for the upcoming plot
y_test = test_df[f'{number}-Operation']

In [None]:
# Linear Regression 
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Linear Regression Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Linear Regression: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

In [None]:
# Random Forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Random Forest Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Random Forest: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

### 2880 Minutes in Advance Prediction

In [None]:
number = 2880
df_number = df_2880
# train/test split time series. The dataset is split roughly 50:50 for training and test
train_df = df_number.loc[df.index < "2018-06-09 10:40:00"]
test_df = df_number.loc[df.index >= "2018-06-09 10:40:00"]

# X_train = train_df.drop(['machine_status', 'Operation'], axis = 1)
X_train = train_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
y_train = train_df[f'{number}-Operation']

# X_test = test_df.drop(['machine_status', 'Operation'], axis = 1)
X_test = test_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
X_test_plot = test_df.drop([f'{number}-machine_status'], axis = 1) # this is needed for the upcoming plot
y_test = test_df[f'{number}-Operation']

In [None]:
# Linear Regression 
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Linear Regression Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Linear Regression: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

In [None]:
# Random Forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Random Forest Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Random Forest: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

### 5760 Minutes in Advance Prediction

In [None]:
number = 5760
df_number = df_5760
# train/test split time series. The dataset is split roughly 50:50 for training and test
train_df = df_number.loc[df.index < "2018-06-09 10:40:00"]
test_df = df_number.loc[df.index >= "2018-06-09 10:40:00"]

# X_train = train_df.drop(['machine_status', 'Operation'], axis = 1)
X_train = train_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
y_train = train_df[f'{number}-Operation']

# X_test = test_df.drop(['machine_status', 'Operation'], axis = 1)
X_test = test_df.drop([f'{number}-machine_status', f'{number}-Operation'], axis = 1)
X_test_plot = test_df.drop([f'{number}-machine_status'], axis = 1) # this is needed for the upcoming plot
y_test = test_df[f'{number}-Operation']

In [None]:
# Linear Regression 
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Linear Regression Prediction {number} Minutes in Advance');

In [None]:
print('RMSE for Linear Regression: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))

In [None]:
# Random Forest
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

In [None]:
y_test_plot = y_test.copy()
y_test_plot = pd.DataFrame(y_test_plot)
y_test_plot['y_pred'] = y_pred.tolist()
y_test_plot.plot.line(figsize=(15,2));
plt.title(f'Random Forest Prediction {number} Minutes in Advance');
plt.savefig(f"visualisations/{number}_in_advance.png",dpi=300);

In [None]:
print('RMSE for Random Forest: ', "%.3f" % mean_squared_error(y_pred, y_test)**(1/2))


# Conclusion

This notebook showed that a time series analysis with a supervised ML algorithm such as random forest is possible. It also investigated several time intervals for predicting machine failure in advance. As the results suggest, pump failure can be predicted up to 4 days in advance. However, whether these results are solid proof remains uncertain. The ML algorithm performed equally well for 180 minutes and for 4 days before machine failure, which is surprising.
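
One way to put the RMSE values into perspective is a naive baseline. Because the data set is highly imbalanced, a model that always predicts "operational" (1) can already reach a low RMSE. The sketch below uses made-up class proportions, not the real counts from `sensor.csv`.

```python
import numpy as np

# Hypothetical, highly imbalanced status series:
# 1 = operational, 0.5 = recovering, 0 = broken
y_true = np.array([1.0] * 930 + [0.5] * 69 + [0.0] * 1)

# Naive baseline: always predict "operational"
y_baseline = np.ones_like(y_true)

baseline_rmse = float(np.sqrt(np.mean((y_true - y_baseline) ** 2)))
print(f'Baseline RMSE: {baseline_rmse:.3f}')
```

A trained model is only informative if its RMSE on the same test period is clearly below such a baseline.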