# High-speed Data Visualization:
# Kafka meets Elasticsearch
<br>

<img src="http://engineering.conversantmedia.com/assets/images/engineering_logo.png" width="500" align="left">

<br clear="all" />
<br>
<hr>

**Presented by** Ryan Jepsen

## Contributors

* Will Duckworth – Sr VP Engineering, Ashburn, VA
* Ryan Jepsen – Software Engineer, Chicago, IL
* Kamal Kang – Sr Software Engineer, Ashburn, VA
* Shaun Litt – Principal Data Warehouse Architect, Chicago, IL
* Peter Wojciechowski – Dir Software Engineering, Westlake Village, CA
* Mike Keane – Dir Software Engineering, Chicago, IL


## Agenda

* Jupyter
* Dataset
* Confluent 4.1
* Apache Spark
* Elasticsearch/Kibana
* Live Demo

<img src="high_speed_data_vis_diagram.jpg" align="center">

## Jupyter

* Open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text

* Quickly test and implement new ideas, document findings, and easily share them with other users

* Supports a variety of language interpreters (kernels)


## About the Dataset

**iPinYou Global RTB Bidding Algorithm Competition Dataset**

http://contest.ipinyou.com/

<br>
The raw iPinYou data comes in several TSV files, each containing one of four log types (bid, impression, click, and conversion).


For the purpose of this presentation/demo, we've merged the data into a single Avro Data File with the following schema:

```json
{
	"type":"record",
	"name":"IPinYou",
	"namespace":"com.conversantmedia.cake.avro",
	"fields":[
		{"name":"bid_id","type":["string","null"]},
		{"name":"timestamp","type":["double","null"]},
		{"name":"log_type","type":["int","null"]},
		{"name":"ipinyou_id","type":["string","null"]},
		{"name":"user_agent","type":["string","null"]},
		{"name":"ip_address","type":["string","null"]},
		{"name":"region","type":["int","null"]},
		{"name":"city","type":["int","null"]},
		{"name":"ad_exchange","type":["int","null"]},
		{"name":"domain","type":["string","null"]},
		{"name":"url","type":["string","null"]},
		{"name":"anonymous_url_id","type":["string","null"]},
		{"name":"ad_slot_id","type":["string","null"]},
		{"name":"ad_slot_width","type":["int","null"]},
		{"name":"ad_slot_height","type":["int","null"]},
		{"name":"ad_slot_visibility","type":["string","null"]},
		{"name":"ad_slot_format","type":["string","null"]},
		{"name":"ad_slot_floor_price","type":["int","null"]},
		{"name":"creative_id","type":["string","null"]},
		{"name":"bidding_price","type":["int","null"]},
		{"name":"paying_price","type":["int","null"]},
		{"name":"landing_page_url","type":["string","null"]},
		{"name":"advertiser_id","type":["int","null"]},
		{"name":"user_tags","type":["string","null"]}
	]
}
```
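Every field in the schema is a union of one primitive type and `null`, and no defaults are declared, so each record must supply all fields but may set any of them to null. The following is a minimal sketch of that rule in plain Python, not an Avro library: the schema is abbreviated to three fields, and the `conforms` helper and the type table are assumptions made for illustration.

```python
import json

# Abbreviated copy of the schema above (3 of the 24 fields), for illustration only.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "IPinYou",
  "namespace": "com.conversantmedia.cake.avro",
  "fields": [
    {"name": "bid_id", "type": ["string", "null"]},
    {"name": "timestamp", "type": ["double", "null"]},
    {"name": "log_type", "type": ["int", "null"]}
  ]
}
"""

# Python types this sketch accepts for each Avro primitive used in the schema.
# Accepting int for "double" is a choice of this sketch.
AVRO_TO_PYTHON = {"string": (str,), "int": (int,), "double": (float, int)}


def conforms(record, schema):
    """Return True if every schema field is present and matches a union branch."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False  # no defaults are declared, so the field must be supplied
        value = record[field["name"]]
        branches = field["type"]
        if value is None:
            if "null" not in branches:
                return False
            continue
        allowed = tuple(
            t for b in branches if b != "null" for t in AVRO_TO_PYTHON[b]
        )
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            return False
    return True


schema = json.loads(SCHEMA_JSON)
```

A real pipeline would rely on an Avro library for serialization; the helper only shows how the nullable unions behave.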

If you want to learn more about the iPinYou dataset, please visit http://contest.ipinyou.com/ipinyou-dataset.pdf

## Confluent 4.1

#### Components involved:

* Kafka

* Schema Registry

* Kafka Connect

* ZooKeeper (as a dependency)


## Kafka

* Open-source, distributed stream-processing platform

* Follows a publish-subscribe architecture

* Data is organized into topics; topics are split into partitions

* Easy to integrate with other systems
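To make the topic/partition split concrete, here is a minimal sketch of key-based partition assignment. The Java client's default partitioner hashes the key with murmur2; this sketch swaps in CRC32 from the standard library as a stand-in, so the resulting partition numbers will not match a real cluster. The function name and the sample key are hypothetical.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 is a stand-in hash; Kafka's Java client uses murmur2 instead.
    return zlib.crc32(key) % num_partitions


# Messages with the same key always land in the same partition,
# which is what preserves per-key ordering.
partition = choose_partition(b"bid-123", 6)
```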


## Schema Registry

* Stores a versioned history of schemas for messages published to Kafka

* Supports schema evolution; compatibility rules are configurable

* Safeguards against publishing data to Kafka with incompatible schemas

* REST API for managing schema versions and compatibility settings
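As a rough illustration of the REST API, the sketch below builds (but does not send) two requests: one registering a schema version and one setting a compatibility level. It assumes a Schema Registry at `http://localhost:8081` and a hypothetical subject named `ipinyou-value`; the helper names are illustrative, not part of any client library.

```python
import json
import urllib.request

REGISTRY_URL = "http://localhost:8081"  # assumption: a local Schema Registry
CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def register_schema_request(subject, schema):
    """Build (without sending) a request that registers `schema` under `subject`."""
    # The registry expects the Avro schema as a JSON string inside a JSON envelope.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return urllib.request.Request(
        f"{REGISTRY_URL}/subjects/{subject}/versions",
        data=body,
        method="POST",
        headers={"Content-Type": CONTENT_TYPE},
    )


def set_compatibility_request(subject, level):
    """Build (without sending) a request that sets a subject's compatibility level."""
    body = json.dumps({"compatibility": level}).encode("utf-8")
    return urllib.request.Request(
        f"{REGISTRY_URL}/config/{subject}",
        data=body,
        method="PUT",
        headers={"Content-Type": CONTENT_TYPE},
    )


# Abbreviated schema and hypothetical subject name, for illustration only.
schema = {
    "type": "record",
    "name": "IPinYou",
    "fields": [{"name": "bid_id", "type": ["string", "null"]}],
}
register = register_schema_request("ipinyou-value", schema)
configure = set_compatibility_request("ipinyou-value", "BACKWARD")
```

Passing either object to `urllib.request.urlopen` would send it to a running registry.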


## Kafka Connect

* Framework included in Apache Kafka for streaming data between Kafka and other systems

* Makes it easy to integrate Kafka into existing data pipelines

* Confluent offers several connectors out of the box, but custom connectors can be created and integrated

* Supports a REST API for managing connectors
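
The connector-management endpoints can be sketched as follows. This builds requests without sending them, assumes a Connect worker at `http://localhost:8083`, and uses a hypothetical connector name; `connector_request` is an illustrative helper, not a library function.

```python
import urllib.request

CONNECT_URL = "http://localhost:8083"  # assumption: a local Connect worker


def connector_request(action, name=None):
    """Build (without sending) a Kafka Connect REST request for `action`."""
    if action != "list" and not name:
        raise ValueError("a connector name is required for this action")
    routes = {
        "list": ("GET", "/connectors"),
        "status": ("GET", f"/connectors/{name}/status"),
        "pause": ("PUT", f"/connectors/{name}/pause"),
        "resume": ("PUT", f"/connectors/{name}/resume"),
        "delete": ("DELETE", f"/connectors/{name}"),
    }
    method, path = routes[action]
    return urllib.request.Request(CONNECT_URL + path, method=method)


# "es-sink" is a hypothetical connector name.
status = connector_request("status", "es-sink")
```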

## Apache Spark

* Open-source, distributed, unified engine for large-scale data processing

* Integrates easily with other systems

* Offers scalable and fault-tolerant stream processing via Structured Streaming
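
A minimal sketch of reading a Kafka topic with Structured Streaming might look like the following. The broker address and topic name are assumptions, and the function expects an existing `SparkSession` with Spark's Kafka source package available; it is not necessarily how the demo is wired up.

```python
# Options understood by Spark's built-in Kafka source.
KAFKA_SOURCE_OPTIONS = {
    "kafka.bootstrap.servers": "localhost:9092",  # assumption: a local broker
    "subscribe": "ipinyou",                       # hypothetical topic name
    "startingOffsets": "earliest",
}


def read_kafka_stream(spark, options=KAFKA_SOURCE_OPTIONS):
    """Return a streaming DataFrame over a Kafka topic.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    reader = spark.readStream.format("kafka")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()
```

The resulting DataFrame exposes each message's `key` and `value` as binary columns alongside `topic`, `partition`, `offset`, and `timestamp`, so Avro payloads still need to be decoded before they can be queried by field.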


## Elasticsearch

* Open-source, distributed, full-text search engine built on top of Apache Lucene

* Offers near real-time search

* RESTful API

* Document-based: entities are stored as structured JSON objects

* A collection of similarly structured documents makes up a searchable **index**

* Easy to scale
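
To illustrate the RESTful, JSON-based interface, the sketch below builds (without sending) a search request that filters one index on an exact field value. The cluster address, the index name `ipinyou`, and the filter value are assumptions chosen for illustration; `search_request` is a hypothetical helper.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local Elasticsearch node


def search_request(index, field, value, size=10):
    """Build (without sending) a term query against `index`."""
    body = {"size": size, "query": {"term": {field: value}}}
    return urllib.request.Request(
        f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Index name and filter value are arbitrary examples.
request = search_request("ipinyou", "log_type", 1)
```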


## Connecting Kafka to Elasticsearch

#### Confluent Elasticsearch Connector
* Writes data from a Kafka topic to an index in Elasticsearch

* Each Kafka message on a topic is treated as an event and converted to a unique document in the associated index

* Messages are consumed and immediately indexed, supporting near real-time analytics/visualizations in Kibana

* Index mappings can be inferred from Schema Registry

* Offers exactly-once delivery through idempotent writes to Elasticsearch
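
The points above can be sketched as a connector definition of the kind submitted to Kafka Connect's `/connectors` endpoint. The connector name, topic, Elasticsearch URL, and type name are assumptions. With `key.ignore` set to `true`, the connector derives each document ID from the record's topic, partition, and offset, which is what makes redelivered messages overwrite rather than duplicate; the `document_id` helper below only mirrors that format for illustration.

```python
import json

connector = {
    "name": "ipinyou-elasticsearch-sink",  # hypothetical connector name
    "config": {
        "connector.class":
            "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "topics": "ipinyou",                       # hypothetical topic; also the index name
        "connection.url": "http://localhost:9200",  # assumption: local Elasticsearch
        "type.name": "kafka-connect",               # assumption: any type name
        "key.ignore": "true",     # build document IDs from topic+partition+offset
        "schema.ignore": "false",  # infer the index mapping from the record schema
    },
}


def document_id(topic, partition, offset):
    """Mirror the topic+partition+offset ID format used when keys are ignored."""
    return f"{topic}+{partition}+{offset}"


payload = json.dumps(connector, indent=2)
```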

## If you want to learn more...

* iPinYou Competition - http://contest.ipinyou.com/

* Apache Kafka - https://kafka.apache.org/

* Apache Spark - https://spark.apache.org/

* Confluent - https://www.confluent.io/

* Elastic - https://www.elastic.co/


## Live Demo!

<br>
<br>

### [JupyterLab](http://localhost:8888/lab)
