7 changes: 1 addition & 6 deletions docs/about/contributing.md
@@ -16,13 +16,8 @@ This list is neither comprehensive nor in any particular order.

| Feature | Components | Description | Status |
|-------------------- | ----------- | ------------------------- | ------------- |
| Incremental updates | BE, WS, UI | Push results back to users during the query lifetime. Micro-batching, windowing and other features need to be implemented | In Progress |
| Bullet on Spark | BE | Implement Bullet on Spark Streaming. Compared with SQL on Spark Streaming which stores data in memory, Bullet will be light-weight | In Progress |
| Security | WS, UI | The obvious enterprise security for locking down access to the data and the instance of Bullet. Considering SSL, Kerberos, LDAP etc. Ideally, without a database | Planning |
| In-Memory PubSub | PubSub | For users who don't want a PubSub like Kafka, we could add REST based in-memory PubSub layer that runs in the WS. The backend will then communicate directly with the WS | Planning |
| LocalForage | UI | Migrating the UI to LocalForage to distance ourselves from the relatively small LocalStorage space | [#9](https://github.com/yahoo/bullet-ui/issues/9) |
| Bullet on X | BE | With the pub/sub feature, Bullet can be implemented on other Stream Processors like Flink, Kafka Streaming, Samza etc | Open |
| Bullet on Beam | BE | Bullet can be implemented on [Apache Beam](https://beam.apache.org) as an alternative to implementing it on various Stream Processors | Open |
| SQL API | BE, WS | WS supports an endpoint that converts a SQL-like query into Bullet queries | Open |
| SQL API | BE, WS | WS supports an endpoint that converts a SQL-like query into Bullet queries | In Progress |
| Packaging | UI, BE, WS | Github releases and building from source are the only two options for the UI. Docker images or the like for quick setup and to mix and match various pluggable components would be really useful | Open |
| Spring Boot Reactor | WS | Migrate the Web Service to use Spring Boot reactor instead of servlet containers | Open |
18 changes: 13 additions & 5 deletions docs/backend/ingestion.md
@@ -8,7 +8,13 @@ Bullet operates on a generic data container that it understands. In order to get

## Bullet Record

The Bullet Record is a serializable data container based on [Avro](http://avro.apache.org). It is typed and has a generic schema. You can refer to the [Avro Schema](https://github.com/yahoo/bullet-record/blob/master/src/main/avro/BulletAvro.avsc) file for details if you wish to see the internals of the data model. The Bullet Record is also lazy and only deserializes itself when you try to read something from it. So, you can pass it around before sending to Bullet with minimal cost. Partial deserialization is being considered if performance is key. This will let you deserialize a much narrower chunk of the Record if you are just looking for a couple of fields.
The Bullet backend processes data stored in a [Bullet Record](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/BulletRecord.java), an abstract Java class that can be implemented and optimized for different backends or use-cases.

There are currently two concrete implementations of BulletRecord:

1. [SimpleBulletRecord](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/SimpleBulletRecord.java) which is based on a simple Java HashMap
2. [AvroBulletRecord](https://github.com/bullet-db/bullet-record/blob/master/src/main/java/com/yahoo/bullet/record/AvroBulletRecord.java) which uses [Avro](http://avro.apache.org) for serialization
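
The ideas above — a map-backed record and one that defers deserialization until a field is read — can be sketched in a few lines. This is purely illustrative Python (the real implementations are the Java classes linked above); `LazyRecord` and its methods are hypothetical names:

```python
import pickle

class LazyRecord:
    """Illustrative map-backed record that defers deserialization
    until a field is first read (mimicking AvroBulletRecord's laziness)."""
    def __init__(self, serialized):
        self._serialized = serialized   # raw bytes, untouched so far
        self._data = None               # populated on first access

    @classmethod
    def from_fields(cls, **fields):
        return cls(pickle.dumps(fields))

    def get(self, field):
        if self._data is None:          # deserialize only once, on demand
            self._data = pickle.loads(self._serialized)
        return self._data.get(field)

record = LazyRecord.from_fields(user="alice", latency_ms=42)
# Cheap to pass around; nothing is deserialized until we read a field:
print(record.get("latency_ms"))  # 42
```

This is why passing records around before they reach Bullet stays cheap: the deserialization cost is only paid when a query actually reads a field.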

## Types

@@ -17,9 +23,11 @@ Data placed into a Bullet Record is strongly typed. We support these types curre
### Primitives

1. Boolean
2. Long
3. Double
4. String
2. Integer
3. Long
4. Float
5. Double
6. String
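
As a conceptual illustration only (the real type system is enforced in the Java record classes, and Python collapses Integer/Long and Float/Double into `int` and `float`), a field value is acceptable if it is one of these primitives:

```python
# Illustrative sketch: checking values against the primitive types the
# Bullet Record supports. Boolean, Integer/Long, Float/Double, String.
SUPPORTED_PRIMITIVES = (bool, int, float, str)

def unsupported_fields(record):
    """Return the names of fields whose values are not supported primitives."""
    return [name for name, value in record.items()
            if not isinstance(value, SUPPORTED_PRIMITIVES)]

record = {"user": "alice", "clicks": 3, "ctr": 0.25, "active": True, "blob": b"\x00"}
print(unsupported_fields(record))  # ['blob']
```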

### Complex

@@ -31,7 +39,7 @@ With these types, it is unlikely you would have data that cannot be represented

## Installing the Record directly

Generally, you depend on the Bullet Core artifact for your Stream Processor when you plug in the piece that gets your data into the Stream processor. The Bullet Core artifact already brings in the Bullet Record container as well. See the usage for the [Storm](storm-setup.md#installation) for an example.
Generally, you depend on the Bullet Core artifact for your Stream Processor when you plug in the piece that gets your data into the Stream Processor. The Bullet Core artifact already brings in the Bullet Record containers as well. See the usage for [Storm](storm-setup.md#installation) for an example.

However, if you need it, the artifacts are available through JCenter if you want to depend on them directly in code. You will need to add the repository. Below is a Maven example:

36 changes: 24 additions & 12 deletions docs/index.md
@@ -20,7 +20,7 @@

* Big-data scale-tested - used in production at Yahoo and tested running 500+ queries simultaneously on up to 2,000,000 rps

# How is this useful
# How is Bullet useful

How Bullet is used is largely determined by the data source it consumes. Depending on what kind of data you put Bullet on, the types of queries you run and your use-cases will change. Because Bullet is a look-forward query system with no persistence, you cannot repeat a query on the same data: the next time you run it, it will operate on the new data that arrives after submission. If this usage pattern fits your needs and you are looking for a light-weight system that can tap into your streaming data, then Bullet is for you!

@@ -40,15 +40,15 @@ This instance of Bullet also powers other use-cases such as letting analysts val

See [Quick Start](quick-start/bullet-on-spark.md) to set up Bullet locally using Spark Streaming. You will generate some synthetic streaming data that you can then query with Bullet.

# Setting up Bullet on your streaming data
# Set up Bullet on your streaming data

To set up Bullet on a real data stream, you need:

1. To setup the Bullet Backend on a stream processing framework. Currently, we support [Bullet on Storm](backend/storm-setup.md):
1. To set up the Bullet Backend on a stream processing framework. Currently, we support [Bullet on Storm](backend/storm-setup.md) and [Bullet on Spark](backend/spark-setup.md).
1. Plug in your source of data. See [Getting your data into Bullet](backend/ingestion.md) for details
2. Consume your data stream
2. The [Web Service](ws/setup.md) set up to convey queries and return results back from the backend
3. To choose a [PubSub implementation](pubsub/architecture.md) that connects the Web Service and the Backend. We currently support [Kafka](pubsub/kafka.md) on any Backend and [Storm DRPC](pubsub/storm-drpc.md) for the Storm Backend.
3. To choose a [PubSub implementation](pubsub/architecture.md) that connects the Web Service and the Backend. We currently support [Kafka](pubsub/kafka.md) and a [REST PubSub](pubsub/rest.md) on any Backend and [Storm DRPC](pubsub/storm-drpc.md) for the Storm Backend.
4. The optional [UI](ui/setup.md) set up to talk to your Web Service. You can skip the UI if all your access is programmatic

!!! note "Schema in the UI"
@@ -59,9 +59,9 @@ To set up Bullet on a real data stream, you need:

# Querying in Bullet

Bullet queries allow you to filter, project and aggregate data. It lets you fetch raw (the individual data records) as well as aggregated data.
Bullet queries allow you to filter, project and aggregate data. You can also specify a window to get incremental results. Bullet lets you fetch both raw data (the individual records) and aggregated data.

* See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start.md)
* See the [UI Usage section](ui/usage.md) for using the UI to build Bullet queries. This is the same UI you will build in the [Quick Start](quick-start/bullet-on-spark.md)

* See the [API section](ws/api.md) for building Bullet API queries

@@ -111,6 +111,16 @@ Currently we support ```GROUP``` aggregations with the following operations:
| MAX | Returns the maximum of the non-null values in the provided field for all the elements in the group |
| AVG | Computes the average of the non-null values in the provided field for all the elements in the group |

## Windows

Windows in a Bullet query allow you to specify how often you'd like Bullet to return results.

For example, you could launch a query for 2 minutes, and have Bullet return a COUNT DISTINCT on a particular field every 3 seconds:

![Time-Based Tumbling Windows](../img/time-based-tumbling.png)

See documentation on [the Web Service API](ws/api.md) for more info.
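
The tumbling-window behavior in the example above can be sketched as follows. This is an illustrative simulation with explicit timestamps, not Bullet's windowing implementation:

```python
# Illustrative sketch: a time-based tumbling window that emits a COUNT DISTINCT
# on a field every `window_secs`, over an in-order stream of (timestamp, record).
def tumbling_count_distinct(events, field, window_secs=3):
    results = []
    window_end, distinct = None, set()
    for ts, record in events:
        if window_end is None:
            window_end = ts + window_secs
        while ts >= window_end:               # close any finished windows
            results.append(len(distinct))
            distinct = set()
            window_end += window_secs
        distinct.add(record[field])
    results.append(len(distinct))             # flush the final partial window
    return results

events = [(0, {"uid": "a"}), (1, {"uid": "b"}), (2, {"uid": "a"}),
          (4, {"uid": "c"}), (7, {"uid": "c"})]
print(tumbling_count_distinct(events, "uid"))  # [2, 1, 1]
```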

# Results

The Bullet Web Service returns your query result as well as associated metadata information in a structured JSON format. The UI can display the results in different formats.
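
For illustration, a result payload has roughly this shape — the field names inside `records` come from your query, and the exact metadata keys (the `query_id` below is hypothetical) depend on your configuration:

```json
{
  "records": [
    {"browser": "firefox", "count": 12},
    {"browser": "chrome", "count": 7}
  ],
  "meta": {
    "query_id": "illustrative-id"
  }
}
```
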
@@ -145,17 +155,19 @@ The Bullet Backend can be split into three main conceptual sub-systems:
2. Data Processor - reads data from an input stream, converts it to a unified data format and matches it against queries
3. Combiner - combines results for different queries, performs final aggregations and returns results

The core of Bullet querying is not tied to the Backend and lives in a core library. This allows you to implement the flow shown above in any stream processor you like. We are currently working on Bullet on [Spark Streaming](https://spark.apache.org/streaming).
The core of Bullet querying is not tied to the Backend and lives in a core library. This allows you to implement the flow shown above in any stream processor you like.

## PubSub
Implementations of [Bullet on Storm](backend/storm-architecture.md) and [Bullet on Spark](backend/spark-architecture.md) are currently supported.

The PubSub is responsible for transmitting queries from the API to the Backend and returning results back from the Backend to the clients. It decouples whatever particular Backend you are using with the API. We currently provide a PubSub implementation using Kafka as the transport layer. You can very easily [implement your own](pubsub/architecture.md#implementing-your-own-pubsub) by defining a few interfaces that we provide.
## PubSub

In the case of Bullet on Storm, there is an [additional simplified option](pubsub/storm-drpc.md) using [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html) as the PubSub. This layer is planned to only support a request-response model for querying in the future.
The PubSub is responsible for transmitting queries from the API to the Backend and returning results from the Backend to the clients. It decouples the particular Backend you are using from the API.
We currently support two different PubSub implementations:

!!! note "DRPC PubSub"
* [Kafka](pubsub/kafka.md)
* [REST](pubsub/rest.md)

This was how Bullet was first implemented in Storm. Storm DRPC provided a really simple way to communicate with Storm that we took advantage of. We provide this as a legacy adapter or for users who use Storm but don't want a PubSub layer.
You can also very easily [implement your own](pubsub/architecture.md#implementing-your-own-pubsub) by defining a few interfaces that we provide.
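
The interfaces involved are Java classes in bullet-core, but the shape of the contract can be sketched in a hedged Python analogue — the names `Publisher`, `Subscriber`, and `InMemoryPubSub` here are illustrative stand-ins, with a shared queue playing the role of the transport layer:

```python
from abc import ABC, abstractmethod
from collections import deque

class Publisher(ABC):
    @abstractmethod
    def send(self, message): ...

class Subscriber(ABC):
    @abstractmethod
    def receive(self): ...

class InMemoryPubSub:
    """Both ends share one queue; a real transport (e.g. Kafka) replaces this."""
    def __init__(self):
        self._queue = deque()

    def get_publisher(self):
        pubsub = self
        class _Pub(Publisher):
            def send(self, message):
                pubsub._queue.append(message)
        return _Pub()

    def get_subscriber(self):
        pubsub = self
        class _Sub(Subscriber):
            def receive(self):
                return pubsub._queue.popleft() if pubsub._queue else None
        return _Sub()

pubsub = InMemoryPubSub()
pubsub.get_publisher().send({"id": "q1", "content": "{}"})
print(pubsub.get_subscriber().receive())  # {'id': 'q1', 'content': '{}'}
```

A real implementation plugs a durable transport behind the same two roles, which is why the Web Service and the Backend stay decoupled.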

## Web Service and UI

7 changes: 4 additions & 3 deletions docs/pubsub/architecture.md
@@ -4,11 +4,11 @@ This section describes how the Publish-Subscribe or [PubSub layer](../index.md#p

## Why a PubSub?

When we initially created Bullet, it was built on [Apache Storm](https://storm.apache.org) and leveraged a feature in it called [Storm DRPC](http://storm.apache.org/releases/1.0.3/Distributed-RPC.html) to deliver queries to and extract results from the Bullet Backend. Storm DRPC is supported by a set of clusters that are physically part of the Storm cluster and is a shared resource for the cluster. While many other stream processors support some form of RPC and we could support multiple versions of the Web Service for those, it quickly became clear that abstracting the transport layer from the Web Service to the Backend was needed. This was particularly highlighted when we wanted to switch Bullet queries from operating in a request-response model (one response at the end of the query) to a streaming model. Streaming responses back to the user for a query through DRPC would be cumbersome and require a lot of logic to handle. A PubSub system was a natural solution to this. Since DRPC was a shared resource per cluster, we also were [tying the Backend's scalability](../backend/storm-performance.md#test-4-improving-the-maximum-number-of-simultaneous-raw-queries) to a resource that we didn't control.
When we initially created Bullet, it was built on [Apache Storm](https://storm.apache.org) and leveraged a feature in it called Storm DRPC to deliver queries to and extract results from the Bullet Backend. Storm DRPC is supported by a set of servers that are physically part of the Storm cluster and is a shared resource for the cluster. While many other stream processors support some form of RPC and we could support multiple versions of the Web Service for those, it quickly became clear that abstracting the transport layer between the Web Service and the Backend was needed. This was particularly highlighted when we wanted to switch Bullet queries from operating in a request-response model (one response at the end of the query) to a streaming model. Streaming responses back to the user for a query through DRPC would be cumbersome and require a lot of logic to handle. A PubSub system was a natural solution to this. Since DRPC was a shared resource per cluster, we were also [tying the Backend's scalability](../backend/storm-performance.md#test-4-improving-the-maximum-number-of-simultaneous-raw-queries) to a resource that we didn't control.

However, we didn't want to pick a particular PubSub like Kafka and restrict a user's choice. So, we added a PubSub layer that was generic and entirely pluggable into both the Backend and the Web Service. We would support a select few like [Kafka](https://github.com/yahoo/bullet-kafka) or [Storm DRPC](https://github.com/yahoo/bullet-storm). See [below](#implementing-your-own-pubsub) for how to create your own.

With the transport mechanism abstracted out, it opens up a lot of possibilities like implementing Bullet on other stream processors ([Apache Spark](https://spark.apache.org) is in the works) and adding streaming, incremental results, sharding and much more.
With the transport mechanism abstracted out, Bullet can be implemented on other stream processors; this enabled the development of [Bullet on Spark](../backend/spark-architecture.md), and other implementations are possible in the future.

## What does it do?

@@ -28,7 +28,8 @@ The PubSub layer does not deal with queries and results and just works on instan
If you want to use an implementation already built, we currently support:

1. [Kafka](kafka.md#setup) for any Backend
2. [Storm DRPC](storm-drpc.md#setup) if you're using Bullet on Storm as your Backend
2. [REST](rest.md#setup) for any Backend
3. [Storm DRPC](storm-drpc.md#setup) if you're using Bullet on Storm as your Backend

## Implementing your own PubSub

3 changes: 3 additions & 0 deletions docs/pubsub/storm-drpc.md
@@ -1,5 +1,8 @@
# Storm DRPC PubSub

!!! note "NOTE: This PubSub only works with old versions of the Storm Backend!"
    Since DRPC is part of Storm and only supports a single request/response model, this PubSub implementation can only be used with the Storm Backend and cannot support windowed queries (bullet-storm 0.8.0 and later).

Bullet on [Storm](https://storm.apache.org/) can use [Storm DRPC](http://storm.apache.org/releases/1.0.0/Distributed-RPC.html) as a PubSub layer. DRPC or Distributed Remote Procedure Call, is built into Storm and consists of a set of servers that are part of the Storm cluster.

## How does it work?
4 changes: 1 addition & 3 deletions docs/quick-start/bullet-on-spark.md
@@ -13,8 +13,6 @@ At the end of this section, you will have:
* You will need to be on a Unix-based system (Mac OS X, Ubuntu ...) with ```curl``` installed
* You will need [JDK 8](http://www.oracle.com/technetwork/java/javase/downloads/index.html) installed

## To Install and Launch Bullet Locally:

### Setup Kafka

For this instance of Bullet, we will use the Kafka PubSub implementation found in [bullet-spark](https://github.com/bullet-db/bullet-spark). We will first download and run Kafka, and set up a couple of Kafka topics.
@@ -180,7 +178,7 @@ Visit [http://localhost:8800](http://localhost:8800) to query your topology with
If you access the UI from a machine other than the one where your UI is actually running, you will need to edit ```config/env-settings.json```. Since the UI is a client-side app, the machine that your browser is running on will fetch the UI and attempt to use these settings to talk to the Web Service. Since they point to localhost by default, your browser will attempt to connect there and fail. An easy fix is to change ```localhost``` in your env-settings.json to point to the host name where you will be hosting the UI. This will be the same as the UI host you use in the browser. You can also do a local port forward on the machine accessing the UI by running:
```ssh -N -L 8800:localhost:8800 -L 9999:localhost:9999 hostname-of-the-quickstart-components 2>&1```

## Congratulations!! Bullet is all set up!
### Congratulations!! Bullet is all set up!

#### Playing around with the instance:

Expand Down