# TODOs
- [ ] send email to Errol - does the team that installed an asset do all other maintenance related events for the same asset? Useful for determining team productivity and effectiveness.
- [ ] Meaning of previous_repairs and previous_unplanned in assets table.
At first it seemed that it was simply the values of the last row from the events table, but it doesn't seem to match up.
- [ ] Meaning of non-zero previous_repairs and previous_unplanned in first event after installation.
- [ ] Transform datetimes and time periods to numeric data
- [ ] Augment assets data with events statistics: number of replacements, number of repairs, average time between events, ...?


Source:
- [predictive maintenance article - towardsdatascience](https://towardsdatascience.com/how-to-implement-machine-learning-for-predictive-maintenance-4633cdbe4860)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Data exploration

## 1.1. Load data

In [None]:
datafiles = !ls data/*

In [None]:
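from pathlib import Path

datasets = {}
for fn in datafiles:
    # Path.stem drops the directory and the extension; rstrip('.csv') would
    # also strip trailing 'c', 's', 'v' or '.' characters from the name itself.
    dataset_name = Path(fn).stem
    datasets[dataset_name] = pd.read_csv(fn)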

for name, dataset in datasets.items():
    print('\n'*3, name, '\n')
    print(dataset)

In [None]:
(datasets['replacement_data'].shape[0] + datasets['repair_data'].shape[0]) == datasets['planned_data'].shape[0]

The row counts match, which suggests that the planned data contains all the repair and replacement events.
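Matching row counts is a necessary condition, not a proof. A stricter check could compare the tables key by key. The sketch below uses small made-up frames in place of the real tables, and assumes the key columns are `event_id`, `asset_id` and `event_date`, the same ones used for the merge in 1.2.1.

```python
import pandas as pd

# Toy stand-ins for the real tables; all values are made up for illustration.
replacement = pd.DataFrame({'event_id': [1, 2], 'asset_id': ['A', 'B'],
                            'event_date': ['2020-01-01', '2020-02-01']})
repair = pd.DataFrame({'event_id': [3], 'asset_id': ['A'],
                       'event_date': ['2020-03-01']})
planned = pd.DataFrame({'event_id': [1, 2, 3], 'asset_id': ['A', 'B', 'A'],
                        'event_date': ['2020-01-01', '2020-02-01', '2020-03-01'],
                        'planned': [True, False, True]})

keys = ['event_id', 'asset_id', 'event_date']
all_events = pd.concat([replacement, repair], ignore_index=True)

# An outer merge with indicator=True labels each row as 'both',
# 'left_only' or 'right_only', exposing events missing from either side.
check = pd.merge(all_events, planned, how='outer', on=keys, indicator=True)
unmatched = check[check['_merge'] != 'both']
print(len(unmatched))
```

If `unmatched` is empty on the real data, an inner join on these keys loses no events.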

## 1.2. Join all datasets in 2 tables (events, assets)

We seem to have 2 types of data. We'll combine all tables into 2 tables, one for each type:
- time series data for maintenance events ("events");
- and assets' attributes data ("assets").



### 1.2.1. Join events

To combine these tables, we will:
1. Add a column to repair_data and replacement_data to indicate the type of event.
2. Concatenate repair and replacement events, since all their columns are the same.
3. Inner join the events with the planned data. We just need the planned column.

Let's call this new dataset "events".

In [None]:
datasets['replacement_data']['type'] = 'replacement'
datasets['repair_data']['type'] = 'repair'

In [None]:
datasets['all_events_data'] = pd.concat([datasets['replacement_data'], datasets['repair_data']])

In [None]:
events = pd.merge(datasets['all_events_data'], datasets['planned_data'], how='inner', on=['event_id', 'asset_id', 'event_date'])

In [None]:
date_cols = ['event_date', 'installed_date']
for col in date_cols:
    events[col] = pd.to_datetime(events[col])

In [None]:
events.sort_values(by=['asset_id', 'event_date'])

### 1.2.2. Join assets

Let's join the attributes of the assets in a single table:
- asset_attribute_data_general
- asset_attribute_data_usage
- asset_attribute_data_weather
- asset_data

Let's call this new dataset "assets".

In [None]:
assets = pd.merge(datasets['asset_attribute_data_general'], datasets['asset_attribute_data_usage'], on='asset_id')
assets = pd.merge(assets, datasets['asset_attribute_data_weather'], on='asset_id')
assets = pd.merge(assets, datasets['asset_data'], on='asset_id')

In [None]:
date_cols = ['end_date', 'start_date']
for col in date_cols:
    assets[col] = pd.to_datetime(assets[col])

In [None]:
assets['total_useful_life'] = assets['end_date'] - assets['start_date']

In [None]:
assets

## 1.3. Check for gaps in datasets

Now we have only 2 tables to work with. One refers to data about maintenance events, the other about asset attributes.

Let's check if there are gaps in the data:
- Do all assets have maintenance events? If not, why?
- Are there events referring to missing assets? These might need to be discarded depending on the following analysis.

In [None]:
# do all assets have maintenance events?
assets_that_broke = events['asset_id'].unique()
print(f'{len(assets_that_broke)}\t assets that have replacement or repair events')

print(f'{assets.shape[0]}\t assets')

assets_without_events = assets[~assets['asset_id'].isin(assets_that_broke)]
print(f'{assets_without_events.shape[0]}\t assets without events')
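The second question, whether some events refer to assets missing from the assets table, can be checked in the same way. This is a minimal sketch on made-up frames (`assets_toy` and `events_toy` are illustrative names, not part of the dataset):

```python
import pandas as pd

# Toy stand-ins; asset 'C' has an event but no row in the assets table.
assets_toy = pd.DataFrame({'asset_id': ['A', 'B']})
events_toy = pd.DataFrame({'asset_id': ['A', 'A', 'C'],
                           'event_id': [1, 2, 3]})

# Events whose asset_id is absent from the assets table.
orphan_events = events_toy[~events_toy['asset_id'].isin(assets_toy['asset_id'])]
print(f'{len(orphan_events)}\t events referring to missing assets')
```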

In [None]:
assets_without_events

In [None]:
assets.describe()

Of the 200 registered assets, 194 have maintenance events.
The 6 without events show no common pattern in their attributes:
different teams installed them, and they differ in material, location, weather and years of operation.
The only similarity is that all 6 have a total useful life well below the 25th percentile.
However, there are assets that had a shorter useful life and still had maintenance events.

Let's check the statistics for time between events to decide whether to consider that these 6 assets are outliers and exclude them from further exploration.

In [None]:
events['time_since_last_event'] = events['event_date'] - events['installed_date']

In [None]:
events['time_since_last_event'].describe()

In [None]:
q = 0.9
print(events['time_since_last_event'].quantile(q))
# use the actual number of events instead of a hard-coded row count
print(f'# events over quantile {q}: {len(events) * (1 - q):.0f}')

For 10% of events (~203 events), the time elapsed since the previous event was longer than 321 days.
Of the 6 assets that didn't have events, 5 have a useful life below 300 days.
This means that it's plausible that these 6 assets didn't have any maintenance events, given their brief useful life, and we'll reject the hypothesis that it's due to missing data in the events table.
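Note that `time_since_last_event` is computed from `installed_date`. If what we want is the gap between consecutive events of the same asset, one option is a grouped `diff`. This is a sketch on made-up data, and the column name `gap_to_previous_event` is an assumption:

```python
import pandas as pd

# Toy events table; dates are made up for illustration.
events_toy = pd.DataFrame({
    'asset_id': ['A', 'A', 'A', 'B'],
    'event_date': pd.to_datetime(['2020-01-10', '2020-03-10',
                                  '2020-03-25', '2020-02-01']),
})

events_toy = events_toy.sort_values(['asset_id', 'event_date'])
# diff() within each asset gives the gap to that asset's previous event;
# the first event of every asset has no predecessor and stays NaT.
events_toy['gap_to_previous_event'] = (
    events_toy.groupby('asset_id')['event_date'].diff()
)
print(events_toy['gap_to_previous_event'].describe())
```

Comparing the quantiles of both columns on the real data would show whether the two definitions lead to the same conclusion.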


They will not be removed from the analysis when considering only assets' attributes.

## 1.4. Data Strategy

The events and assets tables are two different kinds of data and we can use them to answer lots of questions.
1. Which asset attributes are correlated with number of maintenance events?
2. Which attributes are correlated with total useful life?
3. Are there better performing teams?
4. Should we be avoiding certain materials in specific locations or weather clusters?
5. Can we predict the remaining useful life of an asset within several given time horizons (e.g. 30, 90, 180 days)?

In section 2, we'll use the events data to augment the assets table and gather as many insights as possible about questions 1-4, and others that might arise during analysis.

These insights will be used to guide question 5, in section 3.
There, we'll build and describe a functional pipeline to predict approximate remaining useful life for each asset.
**This pipeline can then be used in production to inform management decisions and guide operation and maintenance teams in the field, reducing costs and downtime and increasing team productivity and customer satisfaction.**


# 2. Data analysis

## 2.1. Transforming features

### 2.1.1. Transforming categorical features

To compute correlations between features, we first have to transform some of them.
Specifically, we'll one-hot encode the categorical features (material, line, weather cluster and install team).

Let's make a copy of the original assets table and apply this transformation.

In [None]:
assets2 = assets.copy()

In [None]:
# each categorical column is listed once ('asset_weather_cluster' was duplicated)
assets2 = pd.get_dummies(assets2, columns=['asset_material', 'asset_line', 'asset_weather_cluster', 'asset_install_team'])

In [None]:
assets2

### 2.1.2. Describing `previous_repairs` and `previous_unplanned`

The `previous_repairs` and `previous_unplanned` columns are present both in the events and assets tables.

In [None]:
events[events['asset_id'] == 'A:xoauw0'].sort_values('event_date')

`events`

After studying the events sequence for several assets, the following was concluded.
- In the events table, the previous_repairs column contains how many repair events occurred since the last replacement event.
- The previous_unplanned column contains how many unplanned repair events occurred since the last replacement event.
- Each time a replacement happens, both counters are reset to 0.
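
This reading can be tested by rebuilding both counters from the event sequence and comparing them with the stored columns. The sketch below uses made-up events and rests on two assumptions: the counters exclude the current event, and `planned` is a boolean column. The derived column names (`life`, `repairs_so_far`, `unplanned_so_far`) are illustrative.

```python
import pandas as pd

# Toy event sequence for a single asset; values are made up.
ev = pd.DataFrame({
    'asset_id': ['A'] * 5,
    'event_date': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01',
                                  '2020-04-01', '2020-05-01']),
    'type': ['repair', 'repair', 'replacement', 'repair', 'repair'],
    'planned': [True, False, True, False, False],
}).sort_values(['asset_id', 'event_date'])

is_repair = (ev['type'] == 'repair').astype(int)
is_replacement = (ev['type'] == 'replacement').astype(int)
unplanned_repair = ((ev['type'] == 'repair') & ~ev['planned']).astype(int)

# Each replacement starts a new "life" of the asset; cumsum numbers the lives.
ev['life'] = is_replacement.groupby(ev['asset_id']).cumsum()
grp = [ev['asset_id'], ev['life']]

# Running total within a life minus the current row, so the count covers
# only events that happened before the current one (assumption).
ev['repairs_so_far'] = is_repair.groupby(grp).cumsum() - is_repair
ev['unplanned_so_far'] = unplanned_repair.groupby(grp).cumsum() - unplanned_repair
print(ev[['type', 'planned', 'repairs_so_far', 'unplanned_so_far']])
```

On the real table, the share of rows where `repairs_so_far` equals `previous_repairs` would tell us how well this interpretation holds.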

In [None]:
assets

### 2.1.3. Transforming dates and time periods

We'll also need to transform features containing dates and time durations.

We'll apply this to the `assets2` copy created in 2.1.1, converting `total_useful_life` from a time duration to a number of days.

In [None]:
assets2['total_useful_life'] = assets2['total_useful_life'].dt.days

In [None]:
assets2[assets2['previous_unplanned'] > 0]

In [None]:
events[events['asset_id']=='A:k1vxf4'].sort_values('event_date')

In [None]:
events[events['asset_id']=='A:h9mqxr'].sort_values('event_date')

In [None]:
assets[assets['previous_unplanned'] != 0]

In [None]:
assets[assets['previous_repairs'] != 0]

In [None]:
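# numeric_only=True skips id and date columns, which recent pandas versions refuse to correlate
assets2.corr(numeric_only=True)['total_useful_life'].sort_values()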
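
Question 1 from section 1.4 needs per-asset event counts, which the assets table doesn't hold yet. One possible way to add them is sketched below on made-up frames. The column names `n_repair`, `n_replacement` and `n_events` are assumptions derived from the `type` values set in 1.2.1.

```python
import pandas as pd

# Toy stand-ins; asset 'C' has no events at all.
assets_toy = pd.DataFrame({'asset_id': ['A', 'B', 'C']})
events_toy = pd.DataFrame({
    'asset_id': ['A', 'A', 'A', 'B'],
    'type': ['repair', 'replacement', 'repair', 'repair'],
})

# One row per asset, one column per event type, values are event counts.
counts = pd.crosstab(events_toy['asset_id'], events_toy['type']).add_prefix('n_')
counts['n_events'] = counts.sum(axis=1)

# A left join keeps assets without events; their counts become 0, not NaN.
augmented = assets_toy.merge(counts.reset_index(), on='asset_id', how='left')
count_cols = ['n_repair', 'n_replacement', 'n_events']
augmented[count_cols] = augmented[count_cols].fillna(0).astype(int)
print(augmented)
```

With these columns in place, the same `corr` call could be sorted by `n_events` instead of `total_useful_life`.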

# 3. Predictive maintenance models

- Augment events dataset with data from assets table.
- Augment events dataset to have columns for indicating whether another event will happen in different time horizons: 30, 90, 180, 360 days.
- We can have different models, one for each time horizon and they'll be binary class models.
- Or, we can have a multi class model, where each time horizon is a different class.
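
Both labelling options can be derived from the time until each asset's next event. This is a rough sketch on made-up events, not a final design. The column names (`days_to_next_event`, `event_within_{h}d`, `horizon_class`) are assumptions.

```python
import pandas as pd

# Toy events table; dates are made up for illustration.
ev = pd.DataFrame({
    'asset_id': ['A', 'A', 'A', 'B'],
    'event_date': pd.to_datetime(['2020-01-01', '2020-01-21',
                                  '2020-06-01', '2020-03-01']),
}).sort_values(['asset_id', 'event_date'])

# Days until the same asset's next event; NaN when no later event is recorded.
next_date = ev.groupby('asset_id')['event_date'].shift(-1)
ev['days_to_next_event'] = (next_date - ev['event_date']).dt.days

# Option 1: one binary target per horizon, for one binary model per horizon.
# Rows without a later event compare as False here; in practice they are
# censored observations and may need separate handling.
horizons = [30, 90, 180, 360]
for h in horizons:
    ev[f'event_within_{h}d'] = ev['days_to_next_event'] <= h

# Option 2: a single multi-class target with one class per horizon bucket.
ev['horizon_class'] = pd.cut(
    ev['days_to_next_event'],
    bins=[-1, 30, 90, 180, 360, float('inf')],
    labels=['<=30', '<=90', '<=180', '<=360', '>360'],
)
print(ev)
```

The binary targets are nested (an event within 30 days is also within 90), while the multi-class buckets are mutually exclusive, which is one point to weigh when choosing between the two approaches.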