![Clarify Logo](https://global-uploads.webflow.com/5e81e464dad44d3a9a32d1f4/5ed10fc3f1ff8467f4466786_logo.svg)
<img src="https://uploads-ssl.webflow.com/5f031b98adc00651e28ef04b/6058a5f7b4c86c42885a2c2c_orchest-logo-no-padding.svg" alt="Orchest Logo" width="200" align="right"/>
# Welcome to anomaly detection with email notifications in Clarify - using Orchest! 📧


<tr>
  <td> <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/clarify_alerts.png" alt="Drawing" width="600px;"/> </td>

  <td> <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/email_notification.jpg" alt="Drawing" width="300;"/> </td>
</tr>

## Prerequisites 
In this tutorial we will import an Orchest project from GitHub and use it to **get data from Clarify**, **find anomalies**, and **write the anomalies back to Clarify** in a new Signal. [Log in](https://clarifyapp.clarify.io/login?state=db75a5bd-3cad-4734-812c-1249da64b633) or [sign up for free](https://www.clarify.io/signup) to [Clarify](https://www.clarify.io/) to visualize and share insights from your data, and log in or sign up to [Orchest Cloud](https://auth.cloud.orchest.io/u/login) for free to use Orchest pipelines.

## What we will do

1. [Import a Project from GitHub and Inspect the pipeline 🛠 ](#import)
2. [Do anomaly detection 🧐](#anomaly) 
3. [Create a job 👷](#job)
4. [View results in Clarify and receive email notifications 📮 ](#clarify)


---
Other resources:
* [PyClarify SDK](https://github.com/clarify/pyclarify)
* [SDK documentation](https://clarify.github.io/pyclarify/)
* [Orchest documentation](https://docs.orchest.io/en/stable/)
* [API reference](https://docs.clarify.io/reference/http)
* [Forecasting using Clarify and Orchest](https://colab.research.google.com/github/clarify/data-science-tutorials/blob/main/tutorials/Orchest.ipynb)

<a name="import"></a>
# Import a Project from GitHub and Inspect the pipeline 🛠

Once you have logged in to Orchest and created an instance, you can [import a project](https://www.tella.tv/video/cknr7of9c000409jr5gx4efjy/view) from GitHub. Copy and paste the GitHub URL https://github.com/clarify/orchest_projects to import the whole pipeline and get started.

<td> <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/pipeline.png" alt="Orchest Pipeline" style="width: 1000px;"/> </td>

### Steps

In the pipeline you can see all the steps you need to:
* Import data from Clarify. Steps: Load Clarify Data,  Getting new data
* Train an anomaly detection algorithm and test it on new data. Steps: Train Anomaly Detection algorithm, Test Anomaly Detection algorithm
* Trigger an alarm. Steps: Trigger Alarm
* Write anomaly points (if found) back to Clarify. Steps: Write anomaly points to Clarify
* Send an email (if needed). Steps: Send email

In the Final output step you can inspect the logs to see what happened in the last job, for example whether new data points were written to Clarify and whether an email was sent.

<td> <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/final_output.png" alt="Logs from final output step" style="width: 400px;"/> </td>

Each step uses a Python file or a Jupyter notebook to run the code it needs. The only thing missing is to set some parameters in some of the steps.

### Add parameters to the steps

<font color='green'>**Load Clarify Data**</font>

Open the properties tab of the Load Clarify Data step. Under parameters you will see an example of the parameters. 


      {
        "get_all_data": true,
        "item_id": "c1vcqk2005qb5nusljko",
        "from": "2022-03-01T12:00:00Z"
      }

For now, ignore the `get_all_data` parameter; we will come back to it later, in the jobs section. Leave it set to `true`.

In the `item_id` parameter, put the ID of the item you want anomaly notifications for. 

🙋 If you are unsure what an [item](https://docs.clarify.io/users/admin/items/) is and how to [create](https://docs.clarify.io/developers/quickstart/concepts) it, check out [Clarify's documentation](https://docs.clarify.io) 📄

In the `from` parameter, put the starting date of your time series data. The anomaly detection model will be fitted on the selected range (*from* until *to* = now).

<td> <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/time_series.jpg" alt="Timeline" style="width: 1000px;"/> </td>
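
As a rough illustration of how these parameters define the training range, the sketch below parses the example parameter block and derives the (*from*, *to*) window. The helper name `training_window` is hypothetical and not part of the Orchest project; the actual step may read and use its parameters differently.

```python
import json
from datetime import datetime, timezone

# Example parameter block, copied from the step properties shown above.
PARAMS_JSON = """
{
  "get_all_data": true,
  "item_id": "c1vcqk2005qb5nusljko",
  "from": "2022-03-01T12:00:00Z"
}
"""


def training_window(params, now=None):
    """Return the (start, end) datetimes the model would be fitted on.

    Hypothetical helper: `from` is the start, and `to` defaults to now (UTC).
    """
    start = datetime.strptime(params["from"], "%Y-%m-%dT%H:%M:%SZ")
    start = start.replace(tzinfo=timezone.utc)
    end = now if now is not None else datetime.now(timezone.utc)
    if start >= end:
        raise ValueError("`from` must lie in the past")
    return start, end


params = json.loads(PARAMS_JSON)
start, end = training_window(params)
print(f"Fitting on item {params['item_id']} from {start} to {end}")
```

A `from` date in the future raises an error here, since there would be no data to fit on.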


<font color='green'>**Train Anomaly Detection algorithm**</font>

As an anomaly detection algorithm we will use `Isolation Forest`. Using [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) you can set a couple of parameters for this model, like the `outliers_fraction`, which here is set to 0.01. In the [Do anomaly detection 🧐 ](#anomaly) section you can read more about this algorithm and which parameters you can change to tune it for your data. Double-click the Train Anomaly Detection algorithm step to inspect the Jupyter notebook. 


<font color='green'>**Getting new data**</font>

In the `item_id` parameter add the same id as in the _Load Clarify Data_ step.

In this step you can set how far back in time you want to get data. For example, if you want to get all the data from the past **1** hour until now and check it for anomaly points, set the `hours` parameter to 1. 
This parameter must match the recurring job frequency. So if `hours = 1`, the job should also run every hour. If `hours = 2`, the job should run every 2 hours. Here we are using hours, but feel free to change the unit by editing the `getting_new_data` Python file.
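
To see why the `hours` parameter and the job frequency should match, here is a minimal sketch of the look-back window. The function `new_data_window` is a hypothetical name used only for illustration; the `getting_new_data` file in the project may compute its range differently.

```python
from datetime import datetime, timedelta, timezone


def new_data_window(hours, now=None):
    """Return (not_before, before) covering the last `hours` hours."""
    before = now if now is not None else datetime.now(timezone.utc)
    return before - timedelta(hours=hours), before


# Two consecutive runs of an hourly job with hours = 1.
first_run = datetime(2022, 3, 1, 13, 0, tzinfo=timezone.utc)
second_run = first_run + timedelta(hours=1)

w1 = new_data_window(1, now=first_run)
w2 = new_data_window(1, now=second_run)

# The second window starts exactly where the first one ended,
# so no data point is skipped and none is checked twice.
print(w1[1] == w2[0])
```

If the job ran every 2 hours while `hours = 1`, half of the data would never be checked; if it ran every hour with `hours = 2`, every point would be checked twice.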


<font color='green'>**Write anomaly points to Clarify**</font>

The only parameter you need to set is the item name, for example `my_alerts`. If you are already familiar with [PyClarify](https://pypi.org/project/pyclarify/), you know that this parameter is actually the input ID of the signal. Once you publish your newly created signal, the input ID is used by default as the item name. 


<font color='green'>**Send email**</font>

In the empty strings `" "`, add the receiver and the sender email addresses. To send an email you have to provide a password. It makes sense that you cannot just send an email from `alexia@clarify.io` to someone, right? If you have a Gmail account, [here](https://support.google.com/mail/answer/185833?hl=en) is a quick guide on how to create a password for third-party apps. In the next section you will see how to store it in a safe place.
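
For orientation, this is a minimal sketch of how such an email could be built and sent with Python's standard library. It is not the code of the Send email step: the function names, the subject line, and the text-only body are assumptions made for illustration (the real notification also contains a plot), and `smtp.gmail.com` on port 465 applies to Gmail accounts only.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_alert_email(sender, receiver, n_anomalies):
    """Build a plain-text alert message (hypothetical wording)."""
    msg = EmailMessage()
    msg["Subject"] = f"Clarify alert: {n_anomalies} anomaly point(s) found"
    msg["From"] = sender
    msg["To"] = receiver
    msg.set_content(
        f"{n_anomalies} new anomaly point(s) were detected and written to Clarify."
    )
    return msg


def send_alert_email(msg, password, host="smtp.gmail.com", port=465):
    """Send the message over an SSL connection, logging in as the sender."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.login(msg["From"], password)
        server.send_message(msg)


# Building the message needs no network access; sending it does.
message = build_alert_email("sender@example.com", "receiver@example.com", 3)
print(message["Subject"])
```

`send_alert_email` is deliberately not called here, since it needs a real account and its app password.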



### Add parameters to the project
Click [here](https://docs.orchest.io/en/stable/fundamentals/environment_variables.html#project-environment-variables) for a 1 min read on how to set up project environment variables.

> According to Orchest documentation: Environment variables are persisted within Orchest. Make sure only authorized people have access to your instance and sensitive data. See how to set up authentication in the [orchest settings](https://docs.orchest.io/en/stable/fundamentals/configuration.html#orchest-settings).


You need to add two project environment variables:

| Name | Value |
| --- | --- |
| clarify-credentials | {   "credentials": {     "type": "client-credentials",     "clientId": "",     "clientSecret": ""   },   "integration": "",   "apiUrl": "https://api.clarify.io/v1/" } |
| email-password | 123 |

Click [here](https://docs.clarify.io/users/admin/integrations/credentials#create-credentials) for how to get your clarify-credentials 🕵️
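
Inside a step, project environment variables can be read like any other environment variable. The sketch below shows one way to read and sanity-check the two values from the table; `load_project_secrets` is a hypothetical helper, and a plain dictionary stands in for the real environment so the example runs anywhere.

```python
import json
import os


def load_project_secrets(environ=os.environ):
    """Read the Clarify credentials (a JSON string) and the email password."""
    credentials = json.loads(environ["clarify-credentials"])
    missing = {"credentials", "integration", "apiUrl"} - credentials.keys()
    if missing:
        raise KeyError(f"clarify-credentials is missing: {sorted(missing)}")
    return credentials, environ["email-password"]


# Stand-in for the project environment, using the placeholder values above.
example_env = {
    "clarify-credentials": json.dumps({
        "credentials": {
            "type": "client-credentials",
            "clientId": "",
            "clientSecret": "",
        },
        "integration": "",
        "apiUrl": "https://api.clarify.io/v1/",
    }),
    "email-password": "123",
}

credentials, email_password = load_project_secrets(example_env)
print(credentials["apiUrl"])
```

Calling `load_project_secrets()` without arguments reads from the real environment instead.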

<a name="anomaly"></a>
# Do anomaly detection 🧐

This chapter is optional. If you are interested in a machine learning algorithm called Isolation Forest and how to tune this model for your data, keep reading. If you would rather jump to the [Create a job](#job) section, in less than 5 min you will have an anomaly email notification pipeline up and running.

## Isolation Forest: Training stage

The goal here is to train an anomaly detection algorithm in order to find anomalies in our data. As we will see later, once we have trained the model, we can save it and use it on new (streaming) data points. Once new anomalies are found we will get an email notification about them and write them back to Clarify in a new signal.

### Isolation Forest 
To perform anomaly detection, we will use Isolation Forest (or iForest). As mentioned in the [Isolation Forest paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest), anomalies are considered to be those points which are 'few and different'. These data points can be isolated more easily than 'normal' data points. Therefore, anomalies are isolated closer to the root of the tree, which is known as an Isolation Tree (or iTree). 

<a href="https://www.researchgate.net/figure/Isolation-Forest-learned-iForest-construction-for-toy-dataset_fig1_352017898"><img src="https://www.researchgate.net/publication/352017898/figure/fig1/AS:1029757483372550@1622524724599/Isolation-Forest-learned-iForest-construction-for-toy-dataset.png" alt="Isolation Forest: learned iForest construction for toy dataset"/></a>

[Source](https://www.researchgate.net/figure/Isolation-Forest-learned-iForest-construction-for-toy-dataset_fig1_352017898) Scientific Figure on ResearchGate 
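
To make the idea of 'isolating closer to the root' concrete, here is a toy, pure-Python sketch for one-dimensional data. It is not how sklearn implements Isolation Forest (there is no sub-sampling and no tree is stored); it only counts how many random splits are needed until a point stands alone. All names and the synthetic data are made up for this illustration.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the side of the split that contains the point.
    if point < split:
        side = [v for v in data if v < split]
    else:
        side = [v for v in data if v >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=100, seed=0):
    """Average the path length over several random 'trees'."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


data_rng = random.Random(42)
normal_points = [data_rng.gauss(50.0, 2.0) for _ in range(200)]
outlier = 90.0
typical = sorted(normal_points)[len(normal_points) // 2]  # a point in the middle
data = normal_points + [outlier]

print("outlier:", average_path_length(outlier, data))
print("typical:", average_path_length(typical, data))
```

The value far away from the cluster is usually cut off after very few splits, while a value in the middle of the cluster needs many more.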


If we create multiple trees, we can average the path lengths of each data point over all trees and from that obtain the anomaly score values. The top `m` data points with the lowest anomaly score values are considered anomalies.
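
The step from average path length to score can be sketched with the formula from the paper, `s(x, n) = 2^(-E[h(x)] / c(n))`, where `c(n)` normalizes by the average path length of an unsuccessful search in a binary search tree of `n` points. The function names below are chosen for this illustration. Note the sign convention: in the paper, scores close to 1 indicate anomalies, while sklearn reports the opposite of this score (shifted by a threshold), which is why in this notebook the lowest scores are the anomalies.

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Normalization term c(n) from the paper, for sub-sample size n > 2."""
    if n <= 2:
        raise ValueError("this sketch only covers n > 2")
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def paper_score(mean_path_length, n):
    """Anomaly score as defined in the paper: close to 1 means anomaly."""
    return 2.0 ** (-mean_path_length / c(n))


n = 256
for h in (4.0, c(n), 20.0):
    print(f"mean path length {h:5.2f} -> paper score {paper_score(h, n):.3f}")
```

A point whose average path length equals `c(n)` gets a paper score of exactly 0.5; shorter paths push the score towards 1.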

To gain a better understanding of the algorithm's performance, in the following figures we will plot the predicted values (which are either 1 or -1) and the average anomaly scores (negative scores represent outliers and positive scores represent normal points).

OK, let's jump into some code. Here you can run the cells and see your results in plots. If you want to run the code from this notebook and see exactly what the algorithm will do in the Orchest pipeline, put your clarify-credentials file in this workspace in order to get your data from Clarify and write data back. 

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
from pylab import rcParams
import plotly.graph_objects as go

from sklearn.ensemble import IsolationForest
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
import pickle
import json

from pyclarify import ClarifyClient


# Get Data 
# In Orchest this step is done in the Load Clarify data.
client = ClarifyClient('clarify-credentials.json')

# Add the item id and the starting date in the not_before parameter.
response = client.select_items_data(ids = ['<YOUR-ITEM-ID>'], not_before = "2022-03-01T12:00:00Z", before = datetime.today())
df = response.result.data.to_pandas().drop_duplicates()
dates = df.index
values = sum(df.values.tolist(), [])
df = pd.DataFrame({'date': dates, 'x': values})

X = df[["x"]]
X.head()

In sklearn, the number of trees created is 100 by default, and the default sub-sampling size is 256 (or the number of samples, if that is smaller). 
We will change these values to reduce the run time and the swamping and masking effects on our data points. 

**Swamping** means wrongly identifying normal points as anomalies. When normal points are close to anomaly points, separating them is harder, and normal points can easily be categorized as anomalies. 

**Masking** is the phenomenon where many anomalies form a small cluster and therefore end up being categorized as normal points. 


These two phenomena can be reduced with the sub-sampling size parameter (called max_samples in sklearn). Because we don't need to isolate all data points, reducing the sub-sampling size lets us obtain more true anomaly points. 
Note that we use sub-sampling without replacement.

<img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/isolationg.png"/>

[Source Isolation Forest paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest) 
(a) Isolating a normal point Xi needs more splitting procedures than (b) isolating an anomaly point Xo.


The complexity of the training process can be tuned with the number of trees used (called n_estimators in sklearn). As mentioned in the  [Isolation Forest paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest) for n_estimators = 100 the path lengths usually converge well. 

<img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/before.png" width=312 height=312 />  <img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/after.png" width=300 height=300 />

[Source Isolation Forest paper](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest) 
The first plot has 4096 instances, the second plot has 128 instances.


Last but not least, when using Isolation Forest on time series data it usually helps to set the contamination parameter. This parameter defines the threshold on the scores when fitting the model to the data.
The default value is 'auto', which sets the threshold as determined in the original paper. With contamination = 'auto', the algorithm flags far too many points as anomalies, as we can see in the plot below.
In this notebook we will use a value of 0.01 for contamination, which represents the proportion of outliers in the data set. 

<img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/anomalies.png"/>
Anomaly points when setting contamination to 'auto'. Use contamination = 0.01 to get better results.
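
If you want to try the effect of this parameter without touching your own data, the following sketch compares both settings on a synthetic series. The data (995 'normal' values plus 5 injected spikes) is made up for this illustration, and the exact counts will differ on real signals.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(0)
# Synthetic stand-in for a signal: 995 normal values plus 5 injected spikes.
normal_values = rng_demo.normal(loc=20.0, scale=1.0, size=995)
spikes = np.array([35.0, 38.0, 2.0, 40.0, 1.0])
X_demo = np.concatenate([normal_values, spikes]).reshape(-1, 1)


def count_flagged(contamination):
    """Fit a forest with the given contamination and count the -1 predictions."""
    forest = IsolationForest(n_estimators=90, max_samples=200,
                             contamination=contamination, random_state=0)
    return int((forest.fit_predict(X_demo) == -1).sum())


n_auto = count_flagged("auto")
n_fixed = count_flagged(0.01)
print(f"contamination='auto': {n_auto} points flagged")
print(f"contamination=0.01:   {n_fixed} points flagged")
```

With a numeric contamination, the threshold is placed so that roughly that fraction of the training points falls below it, so about 1% of the 1000 points can be flagged here.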





In [None]:
# Set the model parameters and fit the model on the training data
rng = np.random.RandomState(100)
model = IsolationForest(n_estimators=90, max_samples=200, contamination=0.01, random_state=rng)
model.fit(X)
pred = model.predict(X)  # 1 for normal points, -1 for anomalies

In [None]:
# Quick hack to visualize the score regions: fit a 2D model on (value, index).

X_ = np.column_stack((np.array(df["x"]), np.arange(0, len(df["x"]))))
rng = np.random.RandomState(100)
Xmodel = IsolationForest(n_estimators = 90, max_samples=200, contamination=0.01, random_state=rng)
Xmodel.fit(X_)

rcParams['figure.figsize'] = 30, 13
l = len(df["x"])
xx, yy = np.meshgrid(np.linspace(0, l, 50), np.linspace(0, l, 50))
Z = Xmodel.decision_function(np.c_[yy.ravel(), xx.ravel()]) # decision_function returns the anomaly scores
Z = Z.reshape(xx.shape)

plt.title("IsolationForest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)

b1 = plt.scatter(X_[:, 1], X_[:, 0], c="yellow", s=40, edgecolor="k")
plt.axis("tight")
plt.xlim((0, l))
plt.ylim((0, l))
plt.legend([b1],["training observations"],loc="upper left")
plt.show()

In the plot above we can see an example of regions in different colors. 

The darker the color is, the more likely it is that the points in this area are anomaly points. 

These regions were created by evaluating the anomaly scores of the fitted 2D model on a 50 x 50 grid spanning 0 to `l` (the number of data points) on both axes. 

This way we can have the scores for the whole grid.

Note that we will not use the 2D model, since we have only one time series. This plot was created only as an example, as the results of the 1D and 2D models are fairly similar.

In [None]:
scores = model.decision_function(X)
l = len(scores)

fig = go.Figure(data=go.Scatter(x = np.arange(0,l), y = scores))
fig.update_layout(title = "Score values")
fig.show()

We can also plot the score values of the training set X. Negative scores represent outliers, positive scores represent inliers. If we zoom in on the plot, we can easily see which data points are flagged as anomalies. Note that many points have scores very close to y = 0, which means the algorithm is not confident whether these points are anomalies or not.

In [None]:
df['anomaly_score'] = pd.Series(pred)
anomaly = df.loc[df['anomaly_score'] == -1, ['date', 'x']] 

print("anomalies found:")
anomaly

In [None]:
fig = go.Figure(data=go.Scatter(x = np.arange(0,len(pred)), y = pred))
fig.update_layout(title = "predict values")
fig.show()

The `predict` method returns the values 1 or -1. Data points with a negative score value are marked with -1, and data points with a positive score value are marked with 1.

In the above plot we can see that the same data points which have a negative score value have a predicted value of -1.
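
The relation between scores and predictions can also be checked in code. The sketch below uses a small synthetic data set (made up for this illustration) and the sklearn attributes `score_samples` and `offset_` to show where the zero threshold comes from.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(1)
# Synthetic data: 300 normal values plus two obvious outliers.
X_demo = np.concatenate([rng_demo.normal(0.0, 1.0, size=300), [8.0, -9.0]]).reshape(-1, 1)

model_demo = IsolationForest(n_estimators=90, max_samples=200,
                             contamination=0.01, random_state=1).fit(X_demo)

scores_demo = model_demo.decision_function(X_demo)
pred_demo = model_demo.predict(X_demo)

# predict() follows the sign of decision_function(): negative -> -1, otherwise 1.
same = np.array_equal(pred_demo, np.where(scores_demo < 0, -1, 1))
# decision_function() is score_samples() shifted by the fitted threshold offset_.
shifted = np.allclose(scores_demo,
                      model_demo.score_samples(X_demo) - model_demo.offset_)
print(same, shifted)
```

Because the threshold `offset_` is derived from the contamination value during fitting, changing contamination moves the zero line in the score plot.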

In [None]:
fig = go.Figure(data=go.Scatter(x = df['date'], y = df['x'], mode='markers', name = "Normal values", marker=dict(color='blue', size=4)))
fig.add_traces(go.Scatter(x = anomaly['date'], y = anomaly['x'], textposition='top left',
                          textfont=dict(color='#233a77'),
                          mode='markers+text',
                          name = "Anomaly value",
                          marker=dict(color='red', size=6)))

fig.show()

Last but not least, we plot the 'normal' data points in blue and the anomaly data points in red.

In Orchest we save the model so that we can use it on new (streaming) data.

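    file = '../data/model.sav'
    # 'wb' creates the file if it is missing and overwrites it otherwise,
    # so re-running this code does not fail on an existing file.
    with open(file, 'wb') as f:
        print("Save model in model file...")
        pickle.dump(model, f)
    print("Done!")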
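
The counterpart, loading the saved model and applying it to new data points, could look like the sketch below. The training data, the new values, and the temporary file path are stand-ins for this illustration; the Test Anomaly Detection algorithm step in the project may be organized differently.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import IsolationForest

rng_demo = np.random.RandomState(2)
X_train = rng_demo.normal(50.0, 2.0, size=(500, 1))
trained = IsolationForest(n_estimators=90, max_samples=200,
                          contamination=0.01, random_state=2).fit(X_train)

# Temporary stand-in for '../data/model.sav'.
model_path = os.path.join(tempfile.mkdtemp(), "model.sav")
with open(model_path, "wb") as f:
    pickle.dump(trained, f)

# Later, for example in the step that checks new data: load and predict.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

X_new = np.array([[50.5], [49.0], [95.0]])  # the last value is far from the training data
new_pred = loaded.predict(X_new)
print(new_pred)
```

The loaded model gives the same predictions as the model that was saved, so training only has to happen once. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.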

As an end note, some of you might be wondering why we didn't mention anything about normalizing our data. 
This is a good observation. The first question we should ask is whether it is needed. Since we are using an anomaly detection algorithm, normalizing our data might make it harder for the algorithm to find anomaly points. Scaling our data will probably not play a big role, since Isolation Forest works by randomly selecting a value between the min and max values and splitting the data there. What the min and max values are doesn't matter, but the bigger the range is, the easier it probably is to find anomaly points. 

<a name="job"></a>
# Create a job 👷

Before we create a job, we will run the `Load Clarify Data` and `Train Anomaly Detection algorithm` steps so that the Isolation Forest model is saved in the /data folder. We only have to run these two steps once; afterwards, in the Load Clarify Data step we will set the parameter `get_all_data` to `false`. Once we do that, we will also [skip all the notebook cells](https://docs.orchest.io/en/latest/getting_started/how_to.html#skip-notebook-cells) in the `Train Anomaly Detection algorithm` step so that the cells do not run every time the job runs. 

To create a job, watch this [tutorial](https://www.tella.tv/video/cknr9nq1u000609kz9h0advvk) or read the [documentation](https://docs.orchest.io/en/stable/fundamentals/jobs.html). In the `Getting new data` step we said that we have to use the same frequency, remember? If you have an `hours = 1` parameter there, create a job which runs every hour. If you have `hours = 2`, create a job which runs every 2 hours. 
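
Assuming the recurring job is scheduled with a standard five-field cron expression, the `hours` parameter can be translated into a schedule as sketched below. The helper `cron_for_hours` is hypothetical and only illustrates the mapping.

```python
def cron_for_hours(hours):
    """Return a five-field cron expression that fires every `hours` hours, on the hour.

    Only values that divide 24 are accepted, because a step such as */5
    restarts at midnight and would leave an uneven gap at the end of the day.
    """
    if not isinstance(hours, int) or not 1 <= hours <= 23 or 24 % hours != 0:
        raise ValueError("use a whole number of hours that divides 24 (1, 2, 3, 4, 6, 8, 12)")
    return "0 * * * *" if hours == 1 else f"0 */{hours} * * *"


print(cron_for_hours(1))
print(cron_for_hours(2))
```

For `hours = 1` this yields a schedule that runs at minute 0 of every hour, and for `hours = 2` one that runs at minute 0 of every second hour.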

<a name="clarify"></a>
# View results in Clarify and receive email notifications 📮

Now the best part: the first time you ran the pipeline *and* anomaly points were found, a new signal was created in [Clarify](https://clarifyapp.clarify.io/). Once you publish the signal and have it as an item, you can add to a timeline both the item you used in the anomaly detection algorithm and the item created by the pipeline with all the anomaly points. As an example we use the simple name `my_alerts`, but it is good practice to give more descriptive names so that you can find your items more easily. 

🙋 If you are unsure how to create an item from a signal, check out the [Clarify documentation](https://docs.clarify.io/developers/quickstart/publish-signals).  

💡 In Clarify you can set a `GAP DETECTION` parameter in the item metadata. Set it to every second in order to plot the anomaly points as points and not as a line.

Ready to see your results in Clarify? 

<img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/clarify_alerts.png" alt="Drawing" width="1000px;"/> 

Every time new anomaly points are found, you will get an email notification with a plot 🤩 

<img src="https://raw.githubusercontent.com/clarify/data-science-tutorials/main/media/email_notifications/email_notification.jpg" alt="Drawing" width="350px;"/> 


Congratulations on completing this tutorial. 🎉

You’re all set to take a day off and enjoy your weekends with ease 😎
