<h1 style="text-align:center;text-decoration: underline">Stream Analytics Tutorial</h1>
<h1>Overview</h1>
<p>Welcome to the stream analytics tutorial for EpiData Lite's Jupyter Notebook interface. In this tutorial we will perform near real-time stream analytics on sample weather data acquired from a simulated wireless sensor network.</p>
<p><b>Note:</b> This tutorial assumes the EpiData Lite platform was started with the measurement-class="sensor_measurement" setting (the default) in conf/application.conf. If the platform was started with the measurement-class="automated_test" setting, please follow the tutorial in the Automated Test folder instead.</p>

<h2>EpiDataLiteContext and EpiDataLiteStreamingContext</h2>

<h3>1. Context and Modules Import</h3>
<p>As a first step, we will import the <i>EpiDataLiteContext</i> object <i>ec</i> and the <i>EpiDataLiteStreamingContext</i> object <i>esc</i>. EpiDataLiteContext provides methods for querying and offline analytics on batch data, while EpiDataLiteStreamingContext provides methods for near real-time analytics on streaming data.</p>
<p>We will also import the packages and modules required for this tutorial.</p>

In [None]:
# Import packages and modules

from epidata.EpiDataLiteContext import ec
from epidata.EpiDataLiteStreamingContext import esc

from datetime import datetime, timedelta

print(ec)
print(esc)

<h3>2. Context Initialization</h3>
<p>Next, we initialize the EpiDataLiteContext and EpiDataLiteStreamingContext objects. This step opens the network connections required for querying and stream processing of data.</p>

In [None]:
# Initialize EpiDataLiteContext and EpiDataLiteStreamingContext

ec.init()
esc.init()

<h2>Stream Analysis</h2>
<h3>Algorithms</h3>
<p>EpiData supports development and deployment of pre-defined as well as custom algorithms. In this tutorial, we will use pre-defined algorithms for substituting missing (None) values and aggregating measurements.</p> 
<p>Below are a few examples of the pre-defined algorithms available in the EpiData Lite platform:
    <ul>
    <li><i>Identity</i>: Returns the original measurements without any modification. Measurements are selected by the measurement names specified in the operation.</li>
    <li><i>FillMissingValue</i>: Substitutes missing values using a moving average computation.</li>
    <li><i>MeasStatistics</i>: Computes standard statistics on the measurements, including count, min, max, mean, sum, and stddev.</li>
    </ul>
</p>
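<p>To make the idea behind <i>FillMissingValue</i> concrete, here is a minimal sketch of a moving-average substitution in plain Python. It is not EpiData's implementation: the function name <i>fill_missing_moving_average</i> is hypothetical, the <i>window</i> argument is only assumed to play the role of the <i>"s"</i> parameter used later in this tutorial, and the choice to let already-filled values contribute to later averages is an assumption as well.</p>

```python
from statistics import mean


def fill_missing_moving_average(values, window=3):
    """Replace each None with the mean of up to `window` preceding values.

    Hypothetical helper for illustration only; it is not part of EpiData.
    """
    filled = []
    for value in values:
        if value is None:
            # Use only earlier values that are themselves available
            recent = [v for v in filled[-window:] if v is not None]
            if recent:
                value = mean(recent)
            # With no earlier data the value has to stay missing
        filled.append(value)
    return filled


temperatures = [20.0, 21.0, None, 23.0, None, 25.0]
filled = fill_missing_moving_average(temperatures, window=3)
print(filled)
```

<p>A missing value at the very start of a series has no earlier values to average, so this sketch leaves it as None.</p>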
    
<h3>Transformations</h3>
<p>We will define transformations using the pre-defined algorithms described above. EpiData transformations are created using the <i>create_transformation()</i> method, which takes the following inputs:
    <ul>
    <li>Name of the pre-defined algorithm</li>
    <li>List of measurements to apply the algorithm to</li>
    <li>Arguments for the algorithm</li>
    </ul>
</p>

In [None]:
# Define transformations using pre-defined algorithms

op1 = esc.create_transformation("FillMissingValue", ["Temperature", "Wind_Speed", "Relative_Humidity"], {"method":"rolling", "s":3})
op2 = esc.create_transformation("Identity", ["Temperature", "Wind_Speed", "Relative_Humidity"], {})
op3 = esc.create_transformation("MeasStatistics", ["Temperature", "Wind_Speed", "Relative_Humidity"], {"method": "standard"})
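<p>As a rough illustration of what the statistics listed for <i>MeasStatistics</i> look like on a single measurement series, the sketch below computes them with Python's standard library. The helper <i>standard_statistics</i> is hypothetical, and whether EpiData reports the sample or the population standard deviation is not stated in this tutorial; the sketch assumes the sample version.</p>

```python
from statistics import mean, stdev


def standard_statistics(values):
    """Compute count, min, max, mean, sum and stddev, ignoring None entries.

    Hypothetical helper for illustration only; it is not part of EpiData.
    """
    observed = [v for v in values if v is not None]
    return {
        "count": len(observed),
        "min": min(observed),
        "max": max(observed),
        "mean": mean(observed),
        "sum": sum(observed),
        # Sample standard deviation (assumption); needs at least two values
        "stddev": stdev(observed),
    }


wind_speeds = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = standard_statistics(wind_speeds)
print(stats)
```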

<h3>Streams</h3>
<p>Once the transformations have been created, we define streams using EpiDataLiteStreamingContext's <i>create_stream()</i> method. The method takes a source topic, a destination topic, and a transformation object as its inputs. The source topic can be the pre-defined topic <i>'measurements_original'</i> or a custom topic that you define. The destination topic can be one of the pre-defined topics, <i>'measurements_cleansed'</i> or <i>'measurements_summary'</i>, or a custom topic that you define.</p>

In [None]:
# Define stream processing

esc.create_stream("measurements_original", "measurements_substituted", op1)
esc.create_stream("measurements_substituted", "measurements_cleansed", op2)
esc.create_stream("measurements_substituted", "measurements_summary", op3)
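<p>The three streams defined above form a small pipeline in which one topic fans out to two destinations. The following sketch models only that wiring as plain Python data so the topology is easy to inspect; it does not call EpiData, and the helper <i>downstream_topics</i> is a hypothetical name used for illustration.</p>

```python
from collections import defaultdict

# (source topic, destination topic, algorithm) as wired in this tutorial
streams = [
    ("measurements_original", "measurements_substituted", "FillMissingValue"),
    ("measurements_substituted", "measurements_cleansed", "Identity"),
    ("measurements_substituted", "measurements_summary", "MeasStatistics"),
]


def downstream_topics(streams, source):
    """Return every topic reachable from `source`, in breadth-first order."""
    edges = defaultdict(list)
    for src, dst, _algorithm in streams:
        edges[src].append(dst)

    reached, queue = [], [source]
    while queue:
        topic = queue.pop(0)
        for dst in edges[topic]:
            if dst not in reached:
                reached.append(dst)
                queue.append(dst)
    return reached


print(downstream_topics(streams, "measurements_original"))
```

<p>Reading the result from left to right shows that data from <i>measurements_original</i> first passes through the custom topic <i>measurements_substituted</i> before reaching the cleansed and summary topics.</p>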

<p>Next, we start the streams using EpiDataLiteStreamingContext's <i>start_streaming()</i> method. This starts the transformation operations on near real-time data and sends the results to the specified destination topics.</p>

In [None]:
# Start near real-time processing

esc.start_streaming()

<h2>Data Ingestion</h2>

<h3>1. Download Python Script</h3>

<p>We will use the provided Python script <i>sensor_data_ingest_with_outliers.py</i> to simulate weather data and push it to the EpiData Lite platform. Download <i>sensor_data_ingest_with_outliers.py</i> from Jupyter Notebook's tree view as shown below.</p>
<img src="./static/jupyter_tree_view.png">

<h3>2. Ingest Data using Terminal / Command Prompt</h3>
<p>The next step is to run the Python script <i>'sensor_data_ingest_with_outliers.py'</i> using a Python 3 interpreter. This script sends simulated weather data to the EpiData Lite platform through its REST interface. You should see the status of each ingestion step (iteration) in your standard output.</p>
<p>For brevity, we have included only the output of the first iteration in the image below:</p>
<img src="./static/terminal_view_with_output.png">

<h2>Query, Retrieve and Visualize</h2>

<h3>1. Query - Keys</h3>
<p>Data stored in the EpiData platform can be queried by specifying the primary data attributes, start time and stop time. Below are the primary data attributes for the current dataset:
<ul>
<li><i>company, site, station, sensor</i></li>
</ul>
</p>
<p>
We can use EpiDataLiteContext's <i>list_keys()</i> method to obtain the values of the primary data attributes for our simulated weather dataset.</p>

In [None]:
# Query the primary data attributes

keys = ec.list_keys()

print(keys)

<h3>2. Query - Original, Cleansed and Summary Data</h3>
<p>We will query original and processed data using EpiDataLiteContext's <i>query_measurements_original()</i>, <i>query_measurements_cleansed()</i> and <i>query_measurements_summary()</i> methods.</p> 

In [None]:
# Query original measurements

primary_key={"company": "EpiData", "site": "San_Francisco", "station":"WSN-1", "sensor": ["Temperature_Probe","Anemometer","RH_Probe"]}
start_time = datetime.strptime('01/01/2023 00:00:00', '%m/%d/%Y %H:%M:%S')
stop_time = datetime.strptime('01/01/2024 00:00:00', '%m/%d/%Y %H:%M:%S')

df_original = ec.query_measurements_original(primary_key, start_time, stop_time)

print(df_original.tail(10))
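<p>The fixed one-year window above is convenient for a tutorial. When working with a running stream, a window relative to the current time is often more practical. The sketch below builds such a window with <i>timedelta</i>, which this notebook already imports. The helper <i>recent_window</i> is hypothetical, and it assumes that naive local datetimes are acceptable as query bounds, as in the cell above.</p>

```python
from datetime import datetime, timedelta


def recent_window(hours, now=None):
    """Return (start_time, stop_time) covering the last `hours` hours.

    Hypothetical helper for illustration only; it is not part of EpiData.
    """
    stop_time = now if now is not None else datetime.now()
    start_time = stop_time - timedelta(hours=hours)
    return start_time, stop_time


# A fixed "now" keeps the example reproducible
start_time, stop_time = recent_window(24, now=datetime(2023, 6, 15, 12, 0, 0))
print(start_time, stop_time)
```

<p>The resulting <i>start_time</i> and <i>stop_time</i> values could then be passed to the query methods in place of the fixed dates.</p>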

In [None]:
# Query cleansed measurements

primary_key={"company": "EpiData", "site": "San_Francisco", "station":"WSN-1", "sensor": ["Temperature_Probe","Anemometer","RH_Probe"]}
start_time = datetime.strptime('1/1/2023 00:00:00', '%m/%d/%Y %H:%M:%S')
stop_time = datetime.strptime('1/1/2024 00:00:00', '%m/%d/%Y %H:%M:%S')

df_cleansed = ec.query_measurements_cleansed(primary_key, start_time, stop_time)

print(df_cleansed.tail(10))

In [None]:
# Query measurements summary

primary_key={"company": "EpiData", "site": "San_Francisco", "station":"WSN-1", "sensor": ["Temperature_Probe","Anemometer","RH_Probe"]}
start_time = datetime.strptime('1/1/2023 00:00:00', '%m/%d/%Y %H:%M:%S')
stop_time = datetime.strptime('1/1/2024 00:00:00', '%m/%d/%Y %H:%M:%S')

df_summary = ec.query_measurements_summary(primary_key, start_time, stop_time)
print(df_summary.tail(10))
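<p>Before plotting, it can help to check how many values are missing and to view the measurements side by side. The sketch below does this on a small hand-made DataFrame that stands in for a query result. The column names <i>ts</i>, <i>meas_name</i> and <i>meas_value</i> are taken from the plotting cell in this tutorial; the sample values are invented, and a real query result may contain additional columns.</p>

```python
import pandas as pd

# Hand-made stand-in for a query result (values are invented)
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-06-15 12:00:00", "2023-06-15 12:00:00",
        "2023-06-15 12:01:00", "2023-06-15 12:01:00",
    ]),
    "meas_name": ["Temperature", "Wind_Speed", "Temperature", "Wind_Speed"],
    "meas_value": [20.5, 3.2, None, 3.4],
})

# Count missing values per measurement name
missing_per_measurement = df["meas_value"].isna().groupby(df["meas_name"]).sum()

# Reshape to one row per timestamp and one column per measurement
wide = df.pivot(index="ts", columns="meas_name", values="meas_value")

print(missing_per_measurement)
print(wide)
```

<p>Running the same missing-value count on the original and the cleansed results is a quick way to confirm that substitution took place.</p>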

<h3>3. Visualize - Original and Cleansed Data</h3>

<p>Next, we will visualize the original and cleansed data using Python's Bokeh package. In the resulting visualization, we can see the result of <i>'Identity'</i> and <i>'FillMissingValue'</i> algorithms applied to the original data.</p>

In [None]:
# Visualize Stream Analytics Results

import pandas as pd
from bokeh.layouts import column
from bokeh.plotting import figure, output_notebook, show
%matplotlib inline

output_notebook()

df_original_temperatures = df_original.loc[df_original["meas_name"] == "Temperature"]
df_cleansed_temperatures = df_cleansed.loc[df_cleansed["meas_name"] == "Temperature"]

plot_original = figure(min_width=800, height=200, x_axis_label="Timestamp", x_axis_type="datetime", y_axis_label="Temperature")
plot_original.background_fill_color = "#fafafa"
plot_original.line(df_original_temperatures["ts"], df_original_temperatures["meas_value"], color='navy', alpha=0.75)
plot_original.title = "Original Measurements"

plot_cleansed = figure(min_width=800, height=200, x_axis_label="Timestamp", x_axis_type="datetime", y_axis_label="Temperature")
plot_cleansed.background_fill_color = "#fafafa"
plot_cleansed.line(df_cleansed_temperatures["ts"], df_cleansed_temperatures["meas_value"], color='red', alpha=0.75)
plot_cleansed.title = "Cleansed Measurements"

show(column(plot_original, plot_cleansed))

<h2>Stop Stream Analytics</h2>
<p>We can now stop the stream processing using EpiDataLiteStreamingContext's <i>stop_streaming()</i> method.</p>

In [None]:
# Stop current near real-time processing

esc.stop_streaming()

<h2>Context Closing</h2>
<p>Now, we can clear (reset) the EpiDataLiteContext and EpiDataLiteStreamingContext using their respective <i>clear</i> methods.</p>

In [None]:
# Clear EpiDataLiteContext and EpiDataLiteStreamingContext

ec.clear()
esc.clear()

<h2>Next Steps</h2>
<p>Congratulations, you have successfully performed near real-time analytics on simulated weather data using pre-defined algorithms. The next step is to explore the capabilities of EpiData further by creating your own near real-time analytics application!</p>