
# Exercises

In [None]:
# install if needed
#!pip install pyod

## Overview: Manufacturing Machine Anomaly Detection

### Project Detail

>In this project, we use the [NAB dataset](https://drive.google.com/file/d/12fFZ9k8wsmWBVUhcsVxmKsqHxaVzAzqt/view?usp=sharing), a benchmark for evaluating anomaly detection algorithms across several fields.  
>It contains 58 time series from various kinds of sources.
>* **Real data**
>    * realAWSCloudwatch
>    * realAdExchange
>    * realKnownCause
>    * realTraffic
>    * realTweets
>* **Artificial data**
>    * artificialNoAnomaly
>    * artificialWithAnomaly
>
>Among these datasets, you will analyze **'machine_temperature_system_failure'** from the realKnownCause group, which comes from an actual manufacturing device.

>The dataset itself does not mark which rows are anomalies, so we need to refer to the labels on the [NAB github page](https://github.com/numenta/NAB/blob/master/labels/combined_windows.json).




### Load the dataset
>As noted above, we use 'machine_temperature_system_failure.csv' for our analysis.  
>According to the dataset information, it contains a single feature:
>* Temperature sensor data of an internal component of a large, industrial machine.

In [None]:
import pandas as pd

orig_url = "https://drive.google.com/file/d/12fFZ9k8wsmWBVUhcsVxmKsqHxaVzAzqt/view?usp=sharing"
file_id = orig_url.split('/')[-2]
data_path='https://drive.google.com/uc?export=download&id=' + file_id

df = pd.read_csv(data_path)
df.head(10)

Unnamed: 0,timestamp,value
0,2013-12-02 21:15:00,73.967322
1,2013-12-02 21:20:00,74.935882
2,2013-12-02 21:25:00,76.124162
3,2013-12-02 21:30:00,78.140707
4,2013-12-02 21:35:00,79.329836
5,2013-12-02 21:40:00,78.710418
6,2013-12-02 21:45:00,80.269784
7,2013-12-02 21:50:00,80.272828
8,2013-12-02 21:55:00,80.353425
9,2013-12-02 22:00:00,79.486523


### Pre-processing

#### Anomaly Points
>The anomaly windows are listed [here](https://github.com/numenta/NAB/blob/master/labels/combined_windows.json).

Labels like these are rarely available unless domain experts tell us on which days the machine had an issue or anomaly.

Since we do have them here, we can use them to generate a ground truth label for each row, marking whether it was truly an anomaly or not!

In [None]:
anomaly_points = [
        ["2013-12-10 06:25:00.000000","2013-12-12 05:35:00.000000"],
        ["2013-12-15 17:50:00.000000","2013-12-17 17:00:00.000000"],
        ["2014-01-27 14:20:00.000000","2014-01-29 13:30:00.000000"],
        ["2014-02-07 14:55:00.000000","2014-02-09 14:05:00.000000"]
]

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

#is anomaly? : True => 1, False => 0

# by default nothing is an anomaly
df['anomaly'] = 0

# mark every row that falls inside a labeled anomaly window by setting its anomaly value to 1
for start, end in anomaly_points:
    df.loc[((df['timestamp'] >= start) & (df['timestamp'] <= end)), 'anomaly'] = 1

In [None]:
df['anomaly'].value_counts()

In [None]:
df['anomaly'].value_counts(normalize=True)

Roughly 10% of the rows are labeled as anomalies.

## EDA

Let's plot some graphs now!

In [None]:
# create a separate DataFrame for the visuals, to which we add time elements
visual_df = df.copy()

### Datetime Information

In [None]:
visual_df['year'] = df['timestamp'].dt.year
visual_df['month'] = df['timestamp'].dt.month
visual_df['day'] = df['timestamp'].dt.day_name()
visual_df['hour'] = df['timestamp'].dt.hour

In [None]:
visual_df.head()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt


### Looking at Anomalies based on Year, Month and Weekday

In [None]:
sns.countplot(x='year', hue='anomaly', data=visual_df);

In [None]:
sns.countplot(x='month', hue='anomaly', data=visual_df);

In [None]:
sns.countplot(x='day', hue='anomaly', data=visual_df);

### Visualizing Anomaly Distributions

In [None]:
# fill=True replaces the deprecated shade=True argument in recent seaborn versions
sns.kdeplot(x='value', hue='anomaly', data=visual_df, fill=True);

In [None]:
sns.boxplot(y='value', x='anomaly', data=visual_df);

### Time Series Analysis

>Plot the temperature and its given anomaly points.

In [None]:
plt.figure(figsize=(16, 6))
sns.scatterplot(x='timestamp', y='value', data=visual_df, marker=".");

In [None]:
plt.figure(figsize=(16, 6))
sns.scatterplot(x='timestamp', y='value', data=visual_df, hue='anomaly', marker=".");


## Modeling
>We will build several anomaly detection models and compare them with each other. Let's create our dataset `X` and ground truth labels `y`. Remember to use `y` only to evaluate model performance. In reality, you are not supposed to know `y` beforehand, so don't use it for modeling.

We will train and evaluate the following models

- 3-Sigma
- Boxplot
- Local Outlier Factor (LOF)
- Isolation Forest
- Median Absolute Deviation (MAD)

### Prepare Dataset

In [None]:
X = df[['timestamp', 'value']]
y = df['anomaly']

X.head()

### Model: 1. The 3-sigma Model

Compute mean and standard deviation and use the 3-STD (3-sigma) rule to compute the lower and upper limit.

Remember,

$$LL = \text{mean} - 3\sigma$$

$$UL = \text{mean} + 3\sigma$$

Decide whether each row is an outlier based on the temperature (`value`) column from `X`, then evaluate the result against the ground truth labels in `y` with `classification_report`.
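The 3-sigma rule described above can be sketched as follows. This is a minimal illustration, not the exercise solution: it assumes a small synthetic series (the name `values` and the injected readings are hypothetical) in place of `X['value']`, so that the block runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X['value']: 500 normal readings plus 3 injected extremes.
rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([rng.normal(80, 2, 500), [20.0, 25.0, 140.0]]))

# Limits from the 3-sigma rule.
mean, sigma = values.mean(), values.std()
lower_limit = mean - 3 * sigma
upper_limit = mean + 3 * sigma

# 1 = outlier, 0 = normal, matching the encoding of the 'anomaly' column.
sigma_pred = ((values < lower_limit) | (values > upper_limit)).astype(int)

print(f"LL={lower_limit:.2f}, UL={upper_limit:.2f}, outliers={sigma_pred.sum()}")
```

On the real data, the same steps would apply to `X['value']`, and the resulting predictions could be passed to `classification_report(y, ...)`. Note that the extremes themselves inflate the standard deviation, which widens the limits.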

#### Build Model and Compute Outliers

#### Evaluate Model Performance

In [None]:
from sklearn.metrics import classification_report



#### Visualize Outliers using TimeSeries

Use a similar scatterplot as before

### Model: 2. The Box-Plot Model

Compute Q1 and Q3 and use the IQR rule to compute the lower and upper whiskers.

Remember,

$$LW = Q_1 - 1.5 \times IQR$$

$$UW = Q_3 + 1.5 \times IQR$$

Decide whether each row is an outlier based on the temperature (`value`) column from `X`, then evaluate the result against the ground truth labels in `y` with `classification_report`.
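A minimal sketch of the IQR rule, under the same assumption as before: a hypothetical synthetic series `values` stands in for `X['value']` so the block is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X['value']: 500 normal readings plus 3 injected extremes.
rng = np.random.default_rng(0)
values = pd.Series(np.concatenate([rng.normal(80, 2, 500), [20.0, 25.0, 140.0]]))

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Whiskers from the 1.5 x IQR rule.
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr

# 1 = outlier, 0 = normal.
box_pred = ((values < lower_whisker) | (values > upper_whisker)).astype(int)

print(f"LW={lower_whisker:.2f}, UW={upper_whisker:.2f}, outliers={box_pred.sum()}")
```

Because quartiles are barely affected by a few extreme readings, the whiskers tend to stay tighter than the 3-sigma limits on the same data.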

#### Build Model and Compute Outliers

#### Evaluate Model Performance

#### Visualize Outliers using TimeSeries

Use a similar scatterplot as before

### Model: 3. The Local Outlier Factor Model

You have already used LOF and learnt about it before. Now use it for outlier detection and evaluate its performance and visualize the outliers.

Use a default contamination rate of 0.1 for this model

You can use the `pyod` library for this model
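As a rough sketch of the workflow, the block below fits LOF with a contamination rate of 0.1 and evaluates it. Two assumptions to note: it uses scikit-learn's `LocalOutlierFactor` instead of `pyod`, and it runs on hypothetical synthetic data (`values`, `y_true`) rather than on `X` and `y`.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: 450 normal readings and 50 faulty ones (about 10% anomalies).
rng = np.random.default_rng(0)
normal = rng.normal(80, 2, 450)
faulty = rng.normal(40, 10, 50)
values = np.concatenate([normal, faulty]).reshape(-1, 1)  # 2-D input: (n_samples, 1)
y_true = np.array([0] * 450 + [1] * 50)

# contamination=0.1 sets the share of points that will be flagged as outliers.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
raw_pred = lof.fit_predict(values)  # scikit-learn convention: -1 = outlier, 1 = inlier

# Convert to the notebook's encoding: 1 = outlier, 0 = normal.
lof_pred = (raw_pred == -1).astype(int)

print(classification_report(y_true, lof_pred))
```

With `pyod`, the fitted model exposes 0/1 labels directly, so the conversion step would not be needed.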

#### Build Model and Compute Outliers

#### Evaluate Model Performance

#### Visualize Outliers using TimeSeries

Use a similar scatterplot as before

### Model: 4. The Isolation Forest Model

You have already used IForest and learnt about it before. Now use it for outlier detection and evaluate its performance and visualize the outliers.

Use a default contamination rate of 0.1 for this model

You can use the `pyod` library for this model
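The same pattern works for Isolation Forest. Again, this is only a sketch under stated assumptions: it uses scikit-learn's `IsolationForest` instead of `pyod`, and the data (`values` with five injected extremes) is hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 495 normal readings plus 5 well-separated extremes at the end.
rng = np.random.default_rng(0)
extremes = np.array([5.0, 30.0, 130.0, 155.0, 180.0])
values = np.concatenate([rng.normal(80, 2, 495), extremes]).reshape(-1, 1)

# contamination=0.1 flags roughly 10% of the rows; random_state makes the run repeatable.
iforest = IsolationForest(contamination=0.1, random_state=42)
raw_pred = iforest.fit_predict(values)  # scikit-learn convention: -1 = outlier, 1 = inlier

# Convert to the notebook's encoding: 1 = outlier, 0 = normal.
iforest_pred = (raw_pred == -1).astype(int)

print("flagged rows:", iforest_pred.sum())
print("flags for injected extremes:", iforest_pred[-5:])
```

Since the contamination rate fixes how many rows are flagged, the model also marks some ordinary readings when fewer than 10% of the rows are truly anomalous.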

#### Build Model and Compute Outliers

#### Evaluate Model Performance

#### Visualize Outliers using TimeSeries

Use a similar scatterplot as before

### Model: 5. The Median Absolute Deviation (MAD) Model

Median Absolute Deviation (MAD) is usually used for univariate data. It is a simple, robust statistical measure of variation in a sample. In that sense, it is similar to the standard deviation: both measure statistical dispersion.

For a univariate data set $X_1, X_2, ..., X_n$, the MAD is defined as the median of the absolute deviations (residuals) from the data's median $\tilde{X} = median(X)$:

$$MAD = median(|X_i - \tilde{X}|)$$

To calculate the range of values that will not be considered outliers, we take the median of the data and add/subtract the MAD multiplied by a threshold multiplier $t$:

$$\tilde{X} \pm t \cdot MAD$$
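To make the two formulas concrete, here is a small worked example with NumPy. It applies the definitions exactly as written above to a hypothetical seven-value sample. Library implementations may scale the score differently, so treat this as an illustration of the formula only.

```python
import numpy as np

# Hypothetical sample: six ordinary readings and one extreme value.
values = np.array([78.0, 79.0, 80.0, 80.0, 81.0, 82.0, 120.0])

# Median of the data.
median = np.median(values)                 # 80.0

# MAD: median of the absolute deviations from the median.
abs_dev = np.abs(values - median)          # [2, 1, 0, 0, 1, 2, 40]
mad = np.median(abs_dev)                   # 1.0

# Range of values not considered outliers: median +/- t * MAD.
t = 3.5
lower, upper = median - t * mad, median + t * mad   # 76.5, 83.5

# 1 = outlier, 0 = normal.
mad_pred = ((values < lower) | (values > upper)).astype(int)

print(median, mad, lower, upper)  # 80.0 1.0 76.5 83.5
print(mad_pred)                   # [0 0 0 0 0 0 1]
```

The single extreme reading barely moves the median or the MAD, which is why this rule is less sensitive to the outliers it is trying to detect than the mean and standard deviation are.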


Luckily, PyOD can do all of this for us! You just need to specify the `threshold` value $t$ in the model.

You can use the `pyod` library for this model, with its default threshold of 3.5.




#### Build Model and Compute Outliers

#### Evaluate Model Performance

#### Visualize Outliers using TimeSeries

Use a similar scatterplot as before

### Bonus: Rolling MAD

For time-series data such as this, we usually compute a "rolling MAD" over a window that moves across the data, which gives a series of median values and MAD thresholds instead of a single global pair. Calculate a "rolling MAD" and experiment with the window size.
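One possible way to compute a rolling MAD with pandas is sketched below. The window size of 50, the helper name `mad`, and the synthetic series with a single injected spike are all assumptions for illustration. The window here is trailing, meaning it includes the current point and the 49 before it.

```python
import numpy as np
import pandas as pd

# Hypothetical series: noisy readings around 80 with one injected spike at position 300.
rng = np.random.default_rng(0)
series = pd.Series(80 + rng.normal(0, 1, 400))
series.iloc[300] += 30

window, t = 50, 3.5  # assumed window size and threshold multiplier


def mad(w):
    """Median absolute deviation of one window (w is a NumPy array)."""
    return np.median(np.abs(w - np.median(w)))


# Rolling median and rolling MAD; the first window-1 entries are NaN.
rolling_median = series.rolling(window).median()
rolling_mad = series.rolling(window).apply(mad, raw=True)

# A point is an outlier if it lies outside rolling_median +/- t * rolling_mad.
# Comparisons against NaN are False, so the warm-up rows are never flagged.
rolling_pred = ((series - rolling_median).abs() > t * rolling_mad).astype(int)

print("flag at spike:", rolling_pred.iloc[300])
print("total flagged:", rolling_pred.sum())
```

A smaller window adapts faster to drifts in the signal but gives noisier thresholds, while a larger window behaves more like the global MAD model.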

## Conclusion


Which models worked best? Answer below based on your observations.

Simpler statistical models like boxplot and MAD performed better!

## References
> * **NAB Anomaly Points References**  
> https://github.com/numenta/NAB/blob/master/labels/combined_windows.json  
> * **PyOD documentation**  
> https://pyod.readthedocs.io/en/latest/  