## Demo of Bspump Jupyter

The BSPump Jupyter module is a collection of utilities that help you develop and deploy your pipelines from within Jupyter notebooks.

All of its functions are located in the `bspump.jupyter` module.

In [1]:
import bspump.jupyter as bpj

A common pattern in bspump applications is to use files in `.conf` format for configuration. You can do this in Jupyter with `init_bitswan_jupyter('path_to_config_file')`.

For demo purposes, we use a simple configuration file, `pipelines.conf`, with configuration for two simple pipelines:
 - `basic`: a pipeline with a single source, `TestSource`, that generates events with an increasing counter and passes them through the pipeline to a `NullSink`
 - `kafka2kafka`: a pipeline that showcases Kafka integration by reading from one Kafka topic and writing to another
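
As a rough illustration of what such a configuration file could contain, the sketch below parses an INI-style string with Python's standard `configparser`. The section names follow an assumed `pipeline:<pipeline id>:<component id>` convention, and the `name` values are chosen to match the keys visible in the cell outputs of this notebook. The actual file used for the demo is not shown here, so treat every section and option as hypothetical.

```python
import configparser

# Hypothetical file contents; section names assume the
# "pipeline:<pipeline id>:<component id>" convention.
EXAMPLE_CONF = """
[pipeline:basic:TestSource]
name = test_source_name

[pipeline:basic:TestProcessor]
name = test_processor_name
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONF)

# Each component would read its own section, much like a dictionary.
source_name = config["pipeline:basic:TestSource"]["name"]
processor_name = config["pipeline:basic:TestProcessor"]["name"]
print(source_name, processor_name)
```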

In [2]:
bpj.init_bitswan_jupyter('pipelines.conf')

BitSwan BSPump version devel


## Basic pipeline

To create a new pipeline, use the `new_pipeline('name')` function. It creates a pipeline with the given name and makes it the current pipeline, to which you can then add sources, processors, and sinks.

In [3]:
bpj.new_pipeline('basic')

#### Each pipeline has to start with a source.

A source is a class that inherits from `bspump.Source` and implements the asynchronous `main` method, which is called when the source is started. This method passes events to the pipeline by calling `self.process(event)`.
The `__init__` method has to accept the `app`, `pipeline`, `id` and `config` arguments (`id` and `config` are optional) and call `super().__init__(app, pipeline, id, config)`.

In [4]:
import bspump

class TestSource(bspump.Source):
    def __init__(self, app, pipeline, id=None, config=None):
        super().__init__(app, pipeline, id, config)
        # The configuration file is accessible through the self.Config attribute,
        # which acts just like a Python dictionary
        self.name = self.Config["name"]
        self.counter = 0
    
    async def main(self):
        while self.counter < 1_000_000:
            # Create an event and pass it to the pipeline for further processing
            await self.process({f"{self.name}_counter": self.counter})
            self.counter += 1


With the source class created, we can register it to the pipeline with the `@register_source` decorator. It decorates a function that takes two arguments, `app` and `pipeline`, and returns an instance of the source class.

In [5]:
@bpj.register_source
def test_source(app, pipeline):
    return TestSource(app, pipeline)

The main advantage of the `bspump.jupyter` module is that you can test your pipelines directly and in real time. For that, however, you need some events to test with.

There are two main ways to generate events in the `bspump.jupyter` environment:
 - the `sample_events` function, which takes a list of events and registers them for further processing by the pipeline
 - the `retrieve_sample_events` function, which calls your source in the background and lets you retrieve the events directly from it

`sample_events` is useful when you want to test your processors, sinks, etc. and you already have the events prepared, while `retrieve_sample_events` is useful when you want to test your source and you don't have the events prepared. For this example, we will use `retrieve_sample_events` to test our `TestSource` source.

In [6]:
await bpj.retrieve_sample_events(limit=10) # mind the await keyword, it is necessary for this to work

# uncomment the line below to use the sample_events function
# bpj.sample_events([{"test_event": 1}, {"test_event": 2}])

{'test_source_name_counter': 0}
{'test_source_name_counter': 1}
{'test_source_name_counter': 2}
{'test_source_name_counter': 3}
{'test_source_name_counter': 4}
{'test_source_name_counter': 5}
{'test_source_name_counter': 6}
{'test_source_name_counter': 7}
{'test_source_name_counter': 8}
{'test_source_name_counter': 9}
Collected 10 events


The `retrieve_sample_events` function takes in a `limit` argument, which specifies how many events you want to retrieve from the source. It then starts the source in the background, retrieves the events and stops the source. The events are then registered and printed out.
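
The start, collect, and stop sequence described above can be sketched with plain `asyncio`. This is only an illustration of the idea under stated assumptions: `counting_source` and `collect_events` are hypothetical names, and the actual `retrieve_sample_events` implementation may work differently.

```python
import asyncio


async def counting_source(process):
    """Stand-in for a source's main(): emits counter events forever."""
    counter = 0
    while True:
        await process({"counter": counter})
        counter += 1


async def collect_events(source_main, limit):
    """Run the source in the background and stop it after `limit` events."""
    events = []
    done = asyncio.Event()

    async def process(event):
        if len(events) < limit:
            events.append(event)
        if len(events) >= limit:
            done.set()
        # Yield control so the collector gets a chance to stop the source.
        await asyncio.sleep(0)

    task = asyncio.create_task(source_main(process))
    await done.wait()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return events


events = asyncio.run(collect_events(counting_source, limit=10))
print(f"Collected {len(events)} events")
```

Inside a notebook, where an event loop is already running, the last call would be written as `events = await collect_events(counting_source, limit=10)` instead of using `asyncio.run`.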

We would like to process those events further. To do this, we use processors. A processor is a class that inherits from `bspump.Processor` and implements the `process` method, which takes two arguments, `context` and `event`, and returns the processed event. There is also a functional way to register processors to the pipeline without creating a new class, which will be shown later. Again, the `__init__` method has to accept the `app`, `pipeline`, `id` and `config` arguments (`id` and `config` are optional) and call `super().__init__(app, pipeline, id, config)`. This is a common pattern for all bspump components.

In [7]:
class TestProcessor(bspump.Processor):
    def __init__(self, app, pipeline, id=None, config=None):
        super().__init__(app, pipeline, id, config)
        self.counter = 0
        self.name = self.Config["name"]

    def process(self, context, event):
        self.counter += 1
        event[self.name] = self.counter

        return event


To register this processor to the pipeline, we use the `@register_processor` decorator. It decorates a function that takes two arguments, `app` and `pipeline`, and returns an instance of the processor class. As with the source, this registers the processor to the pipeline. It also runs the previously registered events through the processor and shows the result.

In [8]:
@bpj.register_processor
def test_processor(app, pipeline):
    return TestProcessor(app, pipeline)

{'test_source_name_counter': 0, 'test_processor_name': 1}
{'test_source_name_counter': 1, 'test_processor_name': 2}
{'test_source_name_counter': 2, 'test_processor_name': 3}
{'test_source_name_counter': 3, 'test_processor_name': 4}
{'test_source_name_counter': 4, 'test_processor_name': 5}
{'test_source_name_counter': 5, 'test_processor_name': 6}
{'test_source_name_counter': 6, 'test_processor_name': 7}
{'test_source_name_counter': 7, 'test_processor_name': 8}
{'test_source_name_counter': 8, 'test_processor_name': 9}
{'test_source_name_counter': 9, 'test_processor_name': 10}
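
The output above can be reproduced without bspump by replaying a list of sample events through a stateful object that has the same `process(context, event)` shape. `CountingProcessor` is a hypothetical stand-in, and copying each event before processing is a design choice of this sketch, not something the module is known to do.

```python
import copy


class CountingProcessor:
    """Plain-Python stand-in for TestProcessor, without the bspump base class."""

    def __init__(self, name):
        self.name = name
        self.counter = 0

    def process(self, context, event):
        self.counter += 1
        event[self.name] = self.counter
        return event


sample_events = [{"test_source_name_counter": i} for i in range(10)]
processor = CountingProcessor("test_processor_name")

# Work on copies so the stored samples stay unchanged for later re-runs.
processed = [processor.process(None, copy.deepcopy(e)) for e in sample_events]

for event in processed:
    print(event)
```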


Besides the object-oriented approach, processors can also be written in a functional style. You can use the `@step` or `@async_step` decorator on a function that takes an event as input. The decorated function should return, respectively, a processed event or a coroutine that returns a processed event. This is useful when you want to create a simple processor that doesn't require a class.

In the background, this creates a new class that inherits from `bspump.Processor` and implements a `process` method that calls the decorated function. The name of the processor is the function name converted to camel case.

In [9]:
@bpj.step
def test_step(event):  # the generated processor is named TestStepProcessor
    event["processed"] = True
    return event

{'test_source_name_counter': 0, 'test_processor_name': 1, 'processed': True}
{'test_source_name_counter': 1, 'test_processor_name': 2, 'processed': True}
{'test_source_name_counter': 2, 'test_processor_name': 3, 'processed': True}
{'test_source_name_counter': 3, 'test_processor_name': 4, 'processed': True}
{'test_source_name_counter': 4, 'test_processor_name': 5, 'processed': True}
{'test_source_name_counter': 5, 'test_processor_name': 6, 'processed': True}
{'test_source_name_counter': 6, 'test_processor_name': 7, 'processed': True}
{'test_source_name_counter': 7, 'test_processor_name': 8, 'processed': True}
{'test_source_name_counter': 8, 'test_processor_name': 9, 'processed': True}
{'test_source_name_counter': 9, 'test_processor_name': 10, 'processed': True}
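
The class-generation idea behind the functional approach can be sketched with Python's built-in `type()`. Everything here is hypothetical: the `Processor` stand-in, the `to_class_name` helper, and the choice to return the generated class from the decorator are assumptions made for illustration, not the behavior of the real `@step`.

```python
class Processor:
    """Minimal stand-in for bspump.Processor."""

    def process(self, context, event):
        raise NotImplementedError


def to_class_name(func_name):
    """Convert snake_case to CamelCase and append a 'Processor' suffix."""
    parts = func_name.split("_")
    return "".join(part.capitalize() for part in parts) + "Processor"


def step(func):
    """Build a Processor subclass whose process() delegates to func."""

    def process(self, context, event):
        return func(event)

    return type(to_class_name(func.__name__), (Processor,), {"process": process})


@step
def test_step(event):
    event["processed"] = True
    return event


# In this sketch the decorated name now refers to the generated class.
processor = test_step()
result = processor.process(None, {"n": 1})
print(test_step.__name__, result)
```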


Now we can end our pipeline with a sink. In this case we use `bspump.common.NullSink`, a sink that consumes events and does nothing else with them. More useful sinks, such as `bspump.kafka.KafkaSink` or `bspump.elasticsearch.ElasticSearchSink`, are used in much the same way and will be showcased later.

Again, to register the sink to the pipeline, we use the `@register_sink` decorator. It decorates a function that takes two arguments, `app` and `pipeline`, and returns an instance of the sink class.

In [10]:
import bspump.common

@bpj.register_sink
def null_sink(app, pipeline):
    return bspump.common.NullSink(app, pipeline)

To finish the pipeline, we call the `end_pipeline` function, which ends the current pipeline and adds it to the finished ones.

In [11]:
bpj.end_pipeline()
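
Conceptually, the finished pipeline passes each event from the source through every processor in order and hands the result to the sink. The sketch below shows that flow in plain Python. `run_pipeline` and the two step functions are hypothetical, and a list's `append` is used as the sink so the results can be inspected, whereas a null sink would simply discard them.

```python
def run_pipeline(events, steps, sink):
    """Pass every event through all steps in order, then hand it to the sink."""
    for event in events:
        for step in steps:
            event = step(event)
        sink(event)


def mark_processed(event):
    event["processed"] = True
    return event


def add_double(event):
    event["double"] = event["counter"] * 2
    return event


consumed = []
run_pipeline(
    [{"counter": i} for i in range(3)],  # stands in for the source
    [mark_processed, add_double],        # stands in for the processors
    consumed.append,                     # stands in for the sink
)
print(consumed)
```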