Skip to content

UCY-LINC-LAB/StreamSight

Repository files navigation

StreamSight

StreamSight is a query-driven framework for streaming analytics in edge computing. It supports users in composing analytic queries that are automatically translated and mapped to streaming operations suitable for running on distributed processing engines deployed in wide areas of coverage. Our reference implementation has been integrated with Apache Spark 2.1.1.

What does StreamSight do?

Data scientists and platform operators can compose complex analytic queries without any knowledge of the programmable model of the underlying distributed processing engine by solely using high-level directives and expressions. An example of such analytic query from the domain of intelligent transportation is the following:

COMPUTE
    ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
    BY city_segment
EVERY 5 SECONDS

The above query computes for a 10min sliding window the average bus delay per city segment with new datapoints considered every 5s, which is particularly useful for a traffic operator in detecting traffic congestion in each city segment. Furthermore, StreamSight gives user the ability to:

  1. Prioritize query execution so results of high importance are always output in time (e.g., high load influx), while low priority insights are scheduled when resources are available
  2. Request query enforcement on a sample of the data stream for indicative but in time responses
  3. Define constraints such as the maximum tolerable upper error bounds for approximate query responses
COMPUTE
    ARITHMETIC_MEAN(bus_delay, 10 MINUTES)
    BY city_segment EVERY 5 SECONDS
    WITH MAX_ERROR 0.005 AND CONFIDENCE 0.95

The above example extends the previous query so that approximate answers are provided to significantly improve response time by executing the query on a sample of the data stream.

Getting Started

Build StreamSight

  1. Make sure you've installed Docker on your machine.
  2. Configure (optionally) and submit your queries:
    • See this section for configuring StreamSight.
    • See this section for composing your queries.
  3. Build the docker image using the following command:
docker build -t streamsight -f Dockerfile .

Local Mode

Run StreamSight in your local machine

docker run -p 4040:4040 -it streamsight

Cluster Mode

If you have an existing Spark Cluster, you only have to specify the address of the master node:

MASTER_NODE=spark://${spark-url}
spark-submit --master $MASTER_NODE ./StreamSight-0.0.1.jar

For convenience, we provide a number of scripts (located in directory /cluster) to help you setup a Spark cluster via docker-compose. The only requirement is to have docker-compose installed on your system.

cd cluster/
# Create 1 Master and 4 Spark workers
./start-infra.sh

# You can scale to 8 workers with the following script:
./scale-workers.sh 8

# To submit StreamSight, just run:
./build-run-cluster.sh

StreamSight Architecture

The following figure depicts a high-level and abstract overview of the StreamSight framework. Users submit ad-hoc queries following the declarative query model and the system compiles these queries into low-level streaming commands.

image

Once the executable artifact is produced, users can submit it to the underlying distributed processing engine. The raw monitoring metrics are fed into the processing engine where they are transformed into analytic insights and streamed to a high-availability queuing system.

image

Insight Declaration Examples

In this section we present a few eaxmples for useful insights from monitoring a cloud infrastructure.

Example 1

A useful insight that many companies need to monitor and take decisions on that is cpu utilization. So the following insight returns the average CPU utilization of a service for 30 seconds every 10 seconds.

cpu_utilization = COPMUTE (
        ARITHMETIC_MEAN( cpu_user, 30 SECONDS )
        + ARITHMETIC_MEAN( cpu_sys, 30 SECONDS )
) EVERY 10 SECONDS ;

Example 2

The free space in ram can be crucial for some applications, thus, the following expression gives us the RAM Average Usage for 10 minutes every 30 seconds.

ram_usage_per_service = COPMUTE
        ARITHMETIC_MEAN( service:ram , 10 MINUTES )
EVERY 30 SECONDS ;

Example 3

Next we present an insight for maximum number of HTTP Requests per Second for 10 minutes which computes every 30 seconds grouped by region. For devops and developers, who works on web-based applications, the peak of traffic in a specific region can be a critical factor.

http_requests_per_seconds_by_region = COPMUTE
        MAX( service:requests_per_seconds , 10 MINUTES ) BY Region
EVERY 30 SECONDS ;

Configuration

Parameter Default Value Description
app insights-test The name of the StreamSight job
insights.file /home/insights.ins The file containing insights
dummy.input true When enabled a producer pushes random measurements to the system
dummy.interval 100 The periodicity of the dummy producer in ms
dummy.dimensions 5 The number of metrics produced (e.g., m1,m2,...)

Reference

When using the framework please use the following reference to cite our work:

"StreamSight: A Query-Driven Framework for Streaming Analytics in Edge Computing", Z. Georgiou, M. Symeonides, D. Trihinas, G. Pallis, M. D. Dikaiakos, 11th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2018), Zurich, Switzerland, Dec 2018.

Licence

The framework is open-sourced under the Apache 2.0 License base. The codebase of the framework is maintained by the authors for academic research and is therefore provided "as is".

About

Query-Driven Framework for Streaming Analytics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published