# Predictive Maintenance
Predictive maintenance is a technique that uses data analysis tools and techniques to detect anomalies in your operation and possible defects in equipment and processes so you can fix them before they result in failure.


# Data Description
There are 5 CSV files consisting of:

***Telemetry Time Series Data (PdM_telemetry.csv)***: Hourly averages of voltage, rotation, pressure, and vibration collected from 100 machines for the year 2015.

***Error (PdM_errors.csv)***: These are errors encountered by the machines while in operating condition. Since these errors don't shut down the machines, they are not considered failures. The error dates and times are rounded to the closest hour since the telemetry data is collected at an hourly rate.

***Maintenance (PdM_maint.csv)***: If a component of a machine is replaced, the replacement is captured as a record in this table. Components are replaced in two situations:

1. During a regular scheduled visit, the technician replaces the component (proactive maintenance).
2. A component breaks down and the technician performs unscheduled maintenance to replace it (reactive maintenance). This is considered a failure, and the corresponding data is captured under Failures.

Maintenance data has both 2014 and 2015 records. The timestamps are rounded to the closest hour since the telemetry data is collected at an hourly rate.

***Failures (PdM_failures.csv)***: Each record represents the replacement of a component due to failure. This data is a subset of the Maintenance data. The timestamps are rounded to the closest hour since the telemetry data is collected at an hourly rate.

***Metadata of Machines (PdM_machines.csv)***: Model type and age of the machines.
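
The statement that failures are a subset of maintenance records can be checked with a left merge. The sketch below uses toy frames rather than the real files, and the column names `comp` and `failure` are assumptions about the CSV schemas that the description above does not confirm.

```python
import pandas as pd

# Toy stand-ins for PdM_maint.csv and PdM_failures.csv.
# The column names "comp" and "failure" are assumed, not taken from the text.
maint = pd.DataFrame({
    "datetime": pd.to_datetime(["2015-01-05 06:00", "2015-02-01 06:00", "2015-03-10 06:00"]),
    "machineID": [1, 1, 2],
    "comp": ["comp1", "comp2", "comp1"],
})
failures = pd.DataFrame({
    "datetime": pd.to_datetime(["2015-01-05 06:00", "2015-03-10 06:00"]),
    "machineID": [1, 2],
    "failure": ["comp1", "comp1"],
})

# Left-merge failures onto maintenance; "_merge" records where each row was found.
check = failures.merge(
    maint,
    left_on=["datetime", "machineID", "failure"],
    right_on=["datetime", "machineID", "comp"],
    how="left",
    indicator=True,
)

# A failure with no matching maintenance record would be flagged "left_only".
unmatched = check[check["_merge"] == "left_only"]
print(f"Failures without a maintenance record: {len(unmatched)}")
```

If the subset relationship holds on the real data, `unmatched` should be empty.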


In [None]:
import anai
from anai.preprocessing import Preprocessor
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px

In [None]:
telemetry_df = pd.read_csv('DATA/PdM_telemetry.csv')
errors_df = pd.read_csv('DATA/PdM_errors.csv')
maint_df = pd.read_csv('DATA/PdM_maint.csv')
failures_df = pd.read_csv('DATA/PdM_failures.csv')
machines_df = pd.read_csv('DATA/PdM_machines.csv')

In [None]:
tables = [telemetry_df, maint_df, failures_df, errors_df]
for df in tables:
    df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")
    df.sort_values(["datetime", "machineID"], inplace=True, ignore_index=True)

# Data Insights



## Telemetry Data
> This data consists of hourly averages of voltage, rotation, pressure, and vibration collected from 100 machines for the year 2015.

In [None]:
print(f"Shape of the Telemetry : {telemetry_df.shape}")
print("\n")
telemetry_df.head()

In [None]:
print(f"Number of machines in the telemetry data: {telemetry_df.machineID.nunique()}")

### Missing Values in the Telemetry data 

In [None]:
telemetry_df.datetime.describe()

In [None]:
print('missing Dates : ' , telemetry_df.datetime.isna().sum())

In [None]:
telemetry_df.isna().sum()
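
`isna()` only catches empty cells. It does not reveal hours that are absent altogether or readings recorded twice. A minimal sketch of both checks on a toy frame follows; the frame and its values are illustrative, and only the `datetime` and `machineID` column names come from the notebook.

```python
import pandas as pd

# Toy telemetry: machine 1 has a complete 6-hour series, machine 2 lacks one hour.
hours = pd.date_range("2015-01-01 06:00", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.concat([
    pd.DataFrame({"datetime": hours, "machineID": 1, "volt": 170.0}),
    pd.DataFrame({"datetime": hours.delete(3), "machineID": 2, "volt": 171.0}),
], ignore_index=True)

# Duplicate (machineID, datetime) pairs would double-count a reading.
n_duplicates = toy.duplicated(subset=["machineID", "datetime"]).sum()

# Number of hourly timestamps every machine should have over the full span.
expected = len(pd.date_range(toy["datetime"].min(), toy["datetime"].max(),
                             freq=pd.Timedelta(hours=1)))
missing_per_machine = expected - toy.groupby("machineID")["datetime"].nunique()

print(f"Duplicate rows: {n_duplicates}")
print(missing_per_machine)
```

The same two expressions can be run on `telemetry_df` once its `datetime` column has been parsed.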

In [None]:
telemetry_df.describe()  # TODO: add commentary on these summary statistics

## Error Data

## Maintenance Data

## Machine Data

In [None]:
print(f"Shape of the Machines Data: {machines_df.shape}")
print("\n")
machines_df.head()

## Failure Data

# EXPLORATORY DATA ANALYSIS

## EDA Functions

In [None]:
def create_date_features(source_df, target_df, feature_name):
    '''
    Create new features related to dates
    
    source_df : DataFrame consisting of the timestamp related feature
    target_df : DataFrame where new features will be added
    feature_name : Name of the feature of date type which needs to be decomposed.
    '''
    target_df.loc[:, 'year'] = source_df.loc[:, feature_name].dt.year.astype('uint16')
    target_df.loc[:, 'month'] = source_df.loc[:, feature_name].dt.month.astype('uint8')
    target_df.loc[:, 'quarter'] = source_df.loc[:, feature_name].dt.quarter.astype('uint8')
    target_df.loc[:, 'weekofyear'] = source_df.loc[:, feature_name].dt.isocalendar().week.astype('uint8')
    
    target_df.loc[:, 'hour'] = source_df.loc[:, feature_name].dt.hour.astype('uint8')
    
    target_df.loc[:, 'day'] = source_df.loc[:, feature_name].dt.day.astype('uint8')
    target_df.loc[:, 'dayofweek'] = source_df.loc[:, feature_name].dt.dayofweek.astype('uint8')
    # dayofyear reaches 366, which overflows uint8 (max 255), so use uint16
    target_df.loc[:, 'dayofyear'] = source_df.loc[:, feature_name].dt.dayofyear.astype('uint16')
    target_df.loc[:, 'is_month_start'] = source_df.loc[:, feature_name].dt.is_month_start
    target_df.loc[:, 'is_month_end'] = source_df.loc[:, feature_name].dt.is_month_end
    target_df.loc[:, 'is_quarter_start']= source_df.loc[:, feature_name].dt.is_quarter_start
    target_df.loc[:, 'is_quarter_end'] = source_df.loc[:, feature_name].dt.is_quarter_end
    target_df.loc[:, 'is_year_start'] = source_df.loc[:, feature_name].dt.is_year_start
    target_df.loc[:, 'is_year_end'] = source_df.loc[:, feature_name].dt.is_year_end
    
    # Monthly period (dtype period[M]); cast to str before plotting
    target_df.loc[:, 'month_year'] = source_df.loc[:, feature_name].dt.to_period('M')
    
    return target_df



def plot_histogram(data, x_column, color_column, title, nbins=1000, width=1000, height=600, log_x=False, log_y=False):
    """
    Generates a Plotly histogram.
    """
    fig = px.histogram(
        data,
        x=x_column,
        color=color_column,
        title=title,
        nbins=nbins,
        width=width,
        height=height,
        log_x=log_x,
        log_y=log_y
    )
    
    fig.update_layout(
        xaxis_title=x_column,
        yaxis_title="Count"
    )
    
    return fig

def plot_boxplot(data, x_column, y_column, title, width=1000, height=900, xaxis_title=None, yaxis_title=None):
    """
    Generates a Plotly boxplot.
    """
    fig = px.box(
        data,
        x=x_column,
        y=y_column,
        title=title,
        width=width,
        height=height
    )
    
    # Update layout with custom axis titles if provided
    fig.update_layout(
        xaxis_title=xaxis_title if xaxis_title else x_column,
        yaxis_title=yaxis_title if yaxis_title else y_column
    )
    
    return fig


def plot_scatter(df, feature_x, feature_y, title=None, xlabel=None, ylabel=None, width=800, height=600):
    """
    Create a scatter plot using Plotly.
    """
    fig = px.scatter(
        df,
        x=feature_x,
        y=feature_y,
        title=title,
        width=width,
        height=height
    )
    
    # Update axis labels if provided
    fig.update_layout(
        xaxis_title=xlabel if xlabel else feature_x,
        yaxis_title=ylabel if ylabel else feature_y
    )
    
    return fig



## EDA On Telemetry Data

Vibration of Machine 1 for 2015

In [None]:
df_vib_machine_1 = telemetry_df[
    telemetry_df.machineID == 1][["datetime", "vibration"]]


In [None]:
fig = px.line(x = df_vib_machine_1['datetime'].values, y = df_vib_machine_1['vibration'].values ,title="Vibration of Machine 1",template='plotly_dark')
fig.update_layout(xaxis_title='Time', yaxis_title='Vibration')
fig.show() 

Voltage of Machine 1 for January 2015

In [None]:
plot_df = telemetry_df.loc[(telemetry_df['machineID'] == 1) &
                        (telemetry_df['datetime'] > pd.to_datetime('2015-01-01')) &
                        (telemetry_df['datetime'] < pd.to_datetime('2015-02-01')), ['datetime', 'volt']]


In [None]:
fig = px.line(x=plot_df['datetime'].values, y=plot_df['volt'].values, title='Voltage over time', template='plotly_dark')
fig.update_layout(xaxis_title='Time', yaxis_title='Voltage')
fig.show()


Voltage of Machine 2 for the first three ISO weeks of 2015

In [None]:
df_volt_machine_2 = telemetry_df[
    (telemetry_df.machineID == 2) & (
        telemetry_df.datetime.dt.isocalendar().week.isin(
            [1, 2, 3]))][["datetime", "volt"]]


In [None]:
fig = px.line(x=df_volt_machine_2['datetime'].values, y=df_volt_machine_2['volt'].values, title='Voltage over time', template='plotly_dark')
fig.update_layout(xaxis_title='Time', yaxis_title='Voltage')
fig.show()


Plot the distribution of voltage across the months. Ideally, there should be some amount of seasonality in the data.

In [None]:
telemetry_df = create_date_features(telemetry_df, telemetry_df, "datetime")
telemetry_df.head()

In [None]:
telemetry_df['month_year'] = telemetry_df['month_year'].astype(str)

fig = plot_boxplot(
    telemetry_df,
    x_column="volt",
    y_column="month_year",
    title="Distribution of volt by month_year"
)
fig.show()

The plot shows that the voltage across machines does not vary from month to month.

We can ignore the entry for 2016 since we only have data for one day in 2016.
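
One way to leave the 2016 entry out of the plots is to filter on the year before plotting. The frame below is a toy stand-in with made-up readings; on the real data the same mask would be applied to `telemetry_df`.

```python
import pandas as pd

# Toy readings straddling the year boundary (values are illustrative only).
toy = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2015-12-31 22:00", "2015-12-31 23:00",
        "2016-01-01 00:00", "2016-01-01 06:00",
    ]),
    "volt": [170.1, 168.4, 172.9, 169.5],
})

# Keep only rows whose timestamp falls in 2015.
in_2015 = toy[toy["datetime"].dt.year == 2015]
print(in_2015)
```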

In [None]:


fig = px.box(
    telemetry_df[telemetry_df.machineID == 80], 
    x="volt",  # Horizontal axis
    y="month_year",  # Grouping variable
    title="Distribution of volt by month_year",
    width=1000,  # Adjust width (optional)
    height=900   # Adjust height (optional)
)

fig.update_layout(
    xaxis_title="volt", 
    yaxis_title="month_year"
    )

fig.show()

In [None]:
fig = plot_histogram(
    telemetry_df,
    x_column="volt",
    color_column="month_year",
    title="Distribution of volt",
    nbins=1000
)
fig.show()

Observations:

1. Overall distribution (shape):
	•	The volt values form a bell-shaped curve, which is consistent with an approximately normal distribution: most values cluster around the mean, with fewer occurrences at the extremes.
2. Spread across month_year:
	•	Each month_year is represented by a different color in the stacked histogram.
	•	The distribution is consistent across months; no month deviates noticeably in central tendency or spread.
	•	All months have similar peak counts, with most data points centered on volt values between 160 and 180.

Insights:

1. Consistency over time:
	•	The near-identical distributions across months suggest that the volt readings are stable over time. This could indicate that the monitored system operates consistently, with no drastic changes or anomalies from month to month.
	


In [None]:

for name in ['rotate', 'pressure', 'vibration']:
    fig = plot_histogram(telemetry_df, x_column=name, color_column="month_year", title=f"Distribution of {name}")
    fig.show()

Observations about Telemetry Data
1. This may be synthetically generated data covering 1 Jan 2015 to 1 Jan 2016.
2. Each row represents the state of a machine in a particular hour. Voltage, vibration, pressure, and rotation have been averaged hourly.
3. There are 100 unique machines.
4. There are no duplicates or missing values in the dataset.
5. The four parameters (voltage, vibration, pressure, and rotation) are approximately normally distributed.
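
The normality observation rests on the shape of the histograms. A rough numerical cross-check is to look at skewness and excess kurtosis, both of which are 0 for a normal distribution. The sketch below runs on a synthetic sample, not on the telemetry itself; the mean and spread passed to the generator are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one telemetry column; loc and scale are arbitrary.
rng = np.random.default_rng(0)
sample = pd.Series(rng.normal(loc=170.0, scale=15.0, size=100_000), name="volt")

# Both statistics are 0 for a normal distribution.
skewness = sample.skew()
excess_kurtosis = sample.kurt()
print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

On the real columns, values clearly away from 0 would suggest skew or heavy tails that the histograms hide.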

## EDA on Machine Data 

In [None]:
fig = plot_boxplot(
    machines_df,
    x_column="age",
    y_column="model",
    title="Distribution of age by model",
   
    height = 400
)
fig.show()

The age of the machines ranges from 0 to 20. The median age is ~12.5. There are no outliers, which is another indication that this is synthetic data.
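
The "no outliers" reading of the box plot can be reproduced with the 1.5 × IQR rule commonly used for box-plot whiskers. This is a sketch on a toy list of ages; the real values would come from `machines_df["age"]`, and the exact quartile method may differ from the one the plotting library uses.

```python
import pandas as pd

# Toy ages (illustrative only).
ages = pd.Series([0, 3, 7, 10, 12, 13, 15, 18, 20], name="age")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points beyond the fences would be drawn as outliers in a box plot.
outliers = ages[(ages < lower) | (ages > upper)]
print(f"fences: [{lower}, {upper}], outliers: {len(outliers)}")
```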



In [None]:
# Create a DF with number of errors, maintenance records and failure records across machines

# Create a DF consisting of number of errors across Machines
errors_across_machine = errors_df.groupby("machineID").size()
errors_across_machine = pd.DataFrame(errors_across_machine, columns=["num_errors"]).reset_index()

machines_errors_df = pd.merge(machines_df, errors_across_machine, how='left', on="machineID")

# Create a DF consisting of number of maintenance records across Machines
maint_across_machine = maint_df.groupby("machineID").size()
maint_across_machine = pd.DataFrame(maint_across_machine, columns=["num_maint"]).reset_index()

machines_errors_df = pd.merge(machines_errors_df, maint_across_machine, how='left', on="machineID")

# Create a DF consisting of number of failure records across Machines
failure_across_machine = failures_df.groupby("machineID").size()
failure_across_machine = pd.DataFrame(failure_across_machine, columns=["num_failure"]).reset_index()

machines_errors_df = pd.merge(machines_errors_df, failure_across_machine, how='left', on="machineID")

machines_errors_df.head()

In [None]:
fig = plot_scatter(
    df=machines_errors_df,
    feature_x="age",
    feature_y="num_errors",
    title="Age vs Number of Errors",
    xlabel="Age",
    ylabel="Number of Errors"
)
fig.show()

In [None]:
fig = plot_scatter(
    df=machines_errors_df,
    feature_x="age",
    feature_y="num_maint",
    title="Age vs Number of Maintenance Records",
    xlabel="Age",
    ylabel="Number of Maintenance Records"
)
fig.show()

In [None]:
fig = plot_scatter(
    df=machines_errors_df,
    feature_x="age",
    feature_y="num_failure",
    title="Age vs Number of Failure Records",
    xlabel="Age",
    ylabel="Number of Failure"
)
fig.show()

From the three plots above, only the number of failures appears to be slightly correlated with age.
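
The visual impression can be backed by a correlation coefficient. The sketch below computes the Pearson correlation of each count with age on a toy frame that reuses the column names of `machines_errors_df`; the numbers are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Toy per-machine counts; values are illustrative only.
toy = pd.DataFrame({
    "age": [2, 5, 8, 11, 14, 17, 20],
    "num_errors": [35, 40, 33, 38, 36, 41, 34],
    "num_maint": [30, 33, 31, 29, 34, 32, 30],
    "num_failure": [3, 5, 4, 7, 6, 9, 8],
})

# A left merge leaves NaN for machines with no records; count those as zero.
toy = toy.fillna(0)

# Pearson correlation of every count column with age.
corr_with_age = toy.corr()["age"].drop("age")
print(corr_with_age.round(2))
```

Filling missing counts with 0 matters on the real data: machines that never failed would otherwise be dropped from both the scatter plot and the correlation.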


# Feature Engineering

### Identifying Lag Features from Telemetry Data on a window of 24 hours

In [82]:
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry_df,
                               index='datetime',
                               columns='machineID',
                               values=col).resample('3H', closed='left', label='right').mean().unstack())
telemetry_mean_3h = pd.concat(temp, axis=1)
telemetry_mean_3h.columns = [i + 'mean_3h' for i in fields]
telemetry_mean_3h.reset_index(inplace=True)


temp = []

for col in fields:
    temp.append(pd.pivot_table(telemetry_df,
                               index='datetime',
                               columns='machineID',
                               values=col).resample('3H', closed='left', label='right').std().unstack())
telemetry_sd_3h = pd.concat(temp, axis=1)
telemetry_sd_3h.columns = [i + 'sd_3h' for i in fields]
telemetry_sd_3h.reset_index(inplace=True)

telemetry_mean_3h.head()

| | machineID | datetime | voltmean_3h | rotatemean_3h | pressuremean_3h | vibrationmean_3h |
|---|---|---|---|---|---|---|
| 0 | 1 | 2015-01-01 09:00:00 | 170.028993 | 449.533798 | 94.592122 | 40.893502 |
| 1 | 1 | 2015-01-01 12:00:00 | 164.192565 | 403.949857 | 105.687417 | 34.255891 |
| 2 | 1 | 2015-01-01 15:00:00 | 168.134445 | 435.781707 | 107.793709 | 41.239405 |
| 3 | 1 | 2015-01-01 18:00:00 | 165.514453 | 430.472823 | 101.703289 | 40.373739 |
| 4 | 1 | 2015-01-01 21:00:00 | 168.809347 | 437.11112 | 90.91106 | 41.738542 |


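	•	In the intermediate pivot table, each row is a 3-hour interval and each column is a machineID.
	•	unstack() then reshapes that table to long format, so each row of the output above is one machineID at one 3-hour interval.
	•	The values are the mean of each telemetry field for that machineID during that interval; telemetry_sd_3h holds the matching standard deviations.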
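
The pivot, resample, and unstack steps can be traced on a toy frame small enough to check by hand. The data below is invented; `pd.Timedelta(hours=3)` is used as the resample rule in place of the `'3H'` string purely to avoid depending on a particular pandas alias spelling.

```python
import pandas as pd

# Toy hourly telemetry for two machines over six hours.
hours = pd.date_range("2015-01-01 06:00", periods=6, freq=pd.Timedelta(hours=1))
toy = pd.concat([
    pd.DataFrame({"datetime": hours, "machineID": 1,
                  "volt": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}),
    pd.DataFrame({"datetime": hours, "machineID": 2,
                  "volt": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]}),
], ignore_index=True)

# Wide table: one row per hour, one column per machine.
wide = pd.pivot_table(toy, index="datetime", columns="machineID", values="volt")

# 3-hour bins, closed on the left and labelled with the right edge.
wide_3h = wide.resample(pd.Timedelta(hours=3), closed="left", label="right").mean()

# unstack() gives a Series indexed by (machineID, datetime), i.e. long format.
long_3h = wide_3h.unstack().rename("voltmean_3h").reset_index()
print(long_3h)
```

Each machine contributes two rows, labelled 09:00 and 12:00, holding the mean of the three hourly readings that precede each label.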

In [None]:
temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry_df,
    index='datetime',
    columns='machineID',
    values=col).resample('3H',closed='left',
    label='right',).first().unstack().rolling(window=24, center=False).mean())

telemetry_mean_24h = pd.concat(temp, axis=1)
telemetry_mean_24h.columns = [i + 'mean_24h' for i in fields]
telemetry_mean_24h.reset_index(inplace=True)
telemetry_mean_24h = telemetry_mean_24h.loc[~telemetry_mean_24h['voltmean_24h'].isnull()]

temp = []
fields = ['volt', 'rotate', 'pressure', 'vibration']
for col in fields:
    temp.append(pd.pivot_table(telemetry_df,
    index='datetime',
    columns='machineID',
    values=col).resample('3H',
    closed='left',
    label='right',
    ).first().unstack().rolling(window=24, center=False).std())
    
telemetry_sd_24h = pd.concat(temp, axis=1)
telemetry_sd_24h.columns = [i + 'sd_24h' for i in fields]
telemetry_sd_24h.reset_index(inplace=True)
telemetry_sd_24h = telemetry_sd_24h.loc[~telemetry_sd_24h['voltsd_24h'].isnull()]

telemetry_mean_24h.head(10)

| | machineID | datetime | voltmean_24h | rotatemean_24h | pressuremean_24h | vibrationmean_24h |
|---|---|---|---|---|---|---|
| 23 | 1 | 2015-01-04 06:00:00 | 171.536044 | 456.036706 | 101.652072 | 44.017022 |
| 24 | 1 | 2015-01-04 09:00:00 | 171.069056 | 457.285237 | 101.011726 | 44.148324 |
| 25 | 1 | 2015-01-04 12:00:00 | 170.859615 | 461.116153 | 101.172241 | 44.672216 |
| 26 | 1 | 2015-01-04 15:00:00 | 171.566669 | 457.893518 | 100.708151 | 44.993232 |
| 27 | 1 | 2015-01-04 18:00:00 | 171.536866 | 457.67211 | 99.826551 | 45.16057 |
| 28 | 1 | 2015-01-04 21:00:00 | 172.800672 | 454.497453 | 100.896227 | 45.690929 |
| 29 | 1 | 2015-01-05 00:00:00 | 171.963248 | 452.687991 | 101.312313 | 45.658369 |
| 30 | 1 | 2015-01-05 03:00:00 | 171.206225 | 448.104961 | 101.030466 | 46.457982 |
| 31 | 1 | 2015-01-05 06:00:00 | 171.999801 | 449.729553 | 101.47285 | 46.879346 |
| 32 | 1 | 2015-01-05 09:00:00 | 171.247302 | 451.93097 | 101.368307 | 47.831655 |


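	•	A loop iterates over each field in fields.
	•	For each field, pd.pivot_table() reshapes the telemetry data so that datetime is the index and each machineID is a column.
	•	resample('3H') downsamples the data to 3-hour intervals, and .first() keeps the first hourly value in each interval.
	•	unstack() flattens the table into a single Series indexed by (machineID, datetime), and rolling(window=24).mean() averages over 24 consecutive rows of that Series.
	•	Because the rows are 3 hours apart at this point, a window of 24 rows spans 72 hours rather than 24. The window also runs across the boundary between one machine and the next, which is why only the first 23 rows of the whole table are NaN (the output starts at index 23) rather than the first 23 rows of every machine.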
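
In the cell above, rolling() is applied to the unstacked Series, so its windows count rows rather than hours and can straddle two machines. If a strict 24-hour, per-machine window is what is wanted, one possible alternative is to roll over the hourly rows inside a groupby. This is a sketch under that assumption, on a toy frame with constant readings, and not the method the notebook uses.

```python
import pandas as pd

# Toy hourly telemetry: two machines, 30 hours each, constant volt per machine.
hours = pd.date_range("2015-01-01 06:00", periods=30, freq=pd.Timedelta(hours=1))
toy = pd.concat([
    pd.DataFrame({"datetime": hours, "machineID": 1, "volt": 1.0}),
    pd.DataFrame({"datetime": hours, "machineID": 2, "volt": 3.0}),
], ignore_index=True).sort_values(["machineID", "datetime"], ignore_index=True)

# Roll over 24 hourly rows within each machine, so no window spans two machines.
toy["voltmean_24h"] = (
    toy.groupby("machineID")["volt"]
       .transform(lambda s: s.rolling(window=24, min_periods=24).mean())
)

# Each machine should have exactly 23 leading NaN rows of its own.
nan_per_machine = toy.groupby("machineID")["voltmean_24h"].apply(lambda s: s.isna().sum())
print(nan_per_machine)
```

Because machine 2 holds a constant 3.0, any value other than 3.0 in its rolling mean would reveal leakage from machine 1. The result could then be sampled every 3 hours to line up with the 3-hour features.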

In [84]:
telemetry_feat = pd.concat([telemetry_mean_3h,
                            telemetry_sd_3h.iloc[:, 2:6],
                            telemetry_mean_24h.iloc[:, 2:6],
                            telemetry_sd_24h.iloc[:, 2:6]], axis=1).dropna()
telemetry_feat.describe()

| | machineID | datetime | voltmean_3h | rotatemean_3h | pressuremean_3h | vibrationmean_3h | voltsd_3h | rotatesd_3h | pressuresd_3h | vibrationsd_3h | voltmean_24h | rotatemean_24h | pressuremean_24h | vibrationmean_24h | voltsd_24h | rotatesd_24h | pressuresd_24h | vibrationsd_24h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 291977.0 | 291977 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 | 291977.0 |
| mean | 50.503899 | 2015-07-02 19:50:32.314188800 | 170.777344 | 446.605536 | 100.858665 | 40.384696 | 13.299177 | 44.456698 | 8.885976 | 4.441105 | 170.738342 | 446.622451 | 100.87186 | 40.382356 | 15.055923 | 50.680485 | 10.330242 | 5.103534 |
| min | 1.0 | 2015-01-01 09:00:00 | 125.532506 | 211.811184 | 72.118639 | 26.569635 | 0.025509 | 0.078991 | 0.027417 | 0.015278 | 156.713608 | 310.118604 | 91.162625 | 35.800869 | 6.178154 | 18.363177 | 4.275651 | 2.108104 |
| 25% | 26.0 | 2015-04-02 15:00:00 | 164.449518 | 427.559989 | 96.238713 | 38.147732 | 8.027807 | 26.903727 | 5.370694 | 2.684653 | 168.100594 | 440.859663 | 98.730139 | 39.379127 | 13.409911 | 44.992826 | 8.984156 | 4.488631 |
| 50% | 51.0 | 2015-07-02 21:00:00 | 170.43425 | 448.382424 | 100.234309 | 40.145805 | 12.495649 | 41.794255 | 8.346061 | 4.173937 | 170.285725 | 448.772454 | 100.195972 | 40.107229 | 14.942428 | 50.156537 | 10.00968 | 5.008386 |
| 75% | 76.0 | 2015-10-02 03:00:00 | 176.612207 | 468.448273 | 104.406729 | 42.227512 | 17.688547 | 59.105539 | 11.790367 | 5.899868 | 172.609273 | 456.129192 | 101.780484 | 40.908734 | 16.55676 | 55.657662 | 11.19938 | 5.589524 |
| max | 100.0 | 2016-01-01 06:00:00 | 241.420717 | 586.682904 | 162.309656 | 69.311324 | 58.444332 | 179.903039 | 35.659369 | 18.305595 | 206.333895 | 491.081522 | 138.291979 | 55.266429 | 30.806053 | 117.198342 | 30.665847 | 12.757609 |
| std | 28.863913 | | 9.501061 | 33.130486 | 7.414592 | 3.478391 | 6.966005 | 23.217195 | 4.656154 | 2.320281 | 4.178951 | 15.686284 | 3.983127 | 1.764322 | 2.383652 | 8.368899 | 2.129642 | 0.92164 |


In [85]:
telemetry_feat.head()

| | machineID | datetime | voltmean_3h | rotatemean_3h | pressuremean_3h | vibrationmean_3h | voltsd_3h | rotatesd_3h | pressuresd_3h | vibrationsd_3h | voltmean_24h | rotatemean_24h | pressuremean_24h | vibrationmean_24h | voltsd_24h | rotatesd_24h | pressuresd_24h | vibrationsd_24h |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 1 | 2015-01-04 06:00:00 | 186.092896 | 451.641253 | 107.989359 | 55.308074 | 13.48909 | 62.185045 | 5.118176 | 4.904365 | 171.536044 | 456.036706 | 101.652072 | 44.017022 | 13.716658 | 41.767447 | 11.754808 | 6.878286 |
| 24 | 1 | 2015-01-04 09:00:00 | 166.281848 | 453.787824 | 106.187582 | 51.99008 | 24.276228 | 23.621315 | 11.176731 | 3.394073 | 171.069056 | 457.285237 | 101.011726 | 44.148324 | 13.741098 | 41.038218 | 11.521602 | 6.9295 |
| 25 | 1 | 2015-01-04 12:00:00 | 175.412103 | 445.450581 | 100.887363 | 54.251534 | 34.918687 | 11.001625 | 10.580336 | 2.921501 | 170.859615 | 461.116153 | 101.172241 | 44.672216 | 13.915181 | 33.879652 | 11.667258 | 7.162152 |
| 26 | 1 | 2015-01-04 15:00:00 | 157.347716 | 451.882075 | 101.28938 | 48.602686 | 24.617739 | 28.950883 | 9.966729 | 2.356486 | 171.566669 | 457.893518 | 100.708151 | 44.993232 | 13.583969 | 33.790685 | 11.447426 | 7.244618 |
| 27 | 1 | 2015-01-04 18:00:00 | 176.45055 | 446.033068 | 84.521555 | 47.638836 | 8.0714 | 76.511343 | 2.636879 | 4.108621 | 171.536866 | 457.67211 | 99.826551 | 45.16057 | 13.590129 | 33.787875 | 11.919716 | 7.167877 |
