In [1]:
import time
import mdml_client as mdml
print(mdml.__version__)

1.1.102


## MDML Producing Data


In [2]:
example_data = {
    'time': time.time(), 
    'int1': 1,
    'int2': 2,
    'int3': 3
}
schema = mdml.create_schema(example_data, "Example schema", "schema for example notebook")
producer = mdml.kafka_mdml_producer(
    topic = "mdml-example-dict",
    schema = schema,
    kafka_host = '100.26.16.4',
    schema_host = '100.26.16.4'
)
producer.produce(example_data)
producer.flush()

## MDML Consuming Data

In [3]:
consumer = mdml.kafka_mdml_consumer(
    topics = ["mdml-example-dict"],
    group = "abc", # create a unique group id here
    kafka_host = '100.26.16.4',
    schema_host = '100.26.16.4'
)
for msg in consumer.consume():
    print(msg)

Consumer loop will exit after 300.0 seconds without receiving a message or with Ctrl+C
{'topic': 'mdml-example-dict', 'value': {'time': 1631221786.1839342, 'int1': 1, 'int2': 2, 'int3': 3}}
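
Judging by the output above, each consumed message is a plain dictionary with a `topic` key and a `value` key holding the produced data. The sketch below shows one way to post-process such a message in ordinary Python. The `summarize` helper and the keys of its return value are hypothetical and are not part of `mdml_client`.

```python
from datetime import datetime, timezone

def summarize(msg):
    """Flatten one consumed message into a small summary dict (hypothetical helper)."""
    value = msg["value"]
    # 'time' was produced with time.time(), i.e. seconds since the Unix epoch
    when = datetime.fromtimestamp(value["time"], tz=timezone.utc)
    # Add up every integer field other than the timestamp
    total = sum(v for k, v in value.items() if k != "time" and isinstance(v, int))
    return {"topic": msg["topic"], "utc": when.isoformat(), "total": total}

# Message copied from the consumer output above
example_msg = {
    "topic": "mdml-example-dict",
    "value": {"time": 1631221786.1839342, "int1": 1, "int2": 2, "int3": 3},
}
summary = summarize(example_msg)
print(summary)
```

Inside the consumer loop, the same helper could be called as `summarize(msg)` in place of `print(msg)`.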


## Streaming Files via MDML

The MDML supports two approaches to streaming large files: chunking and what we call "coat-checking". In chunking, a large file is split into smaller chunks that are sent directly to the MDML. In coat-checking, the file is uploaded to an S3 bucket while a message describing its location, along with some metadata about the file, is sent to the MDML. A consumer can then download the file from the S3 location given in the message. Only the chunking method is demonstrated here.
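
Although coat-checking is not demonstrated in this notebook, the kind of message it relies on can be sketched in plain Python. The example below is only an illustration under stated assumptions: the `make_coat_check_message` helper, the field names, and the bucket and key values are all hypothetical, the upload itself is omitted, and none of this reflects the message format the MDML actually uses.

```python
import hashlib
import os
import tempfile

def make_coat_check_message(path, bucket, key):
    """Describe an uploaded file so a consumer can fetch it later (hypothetical format)."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB blocks so large files are never fully loaded into memory
        for block in iter(lambda: f.read(1 << 20), b""):
            sha256.update(block)
    return {
        "bucket": bucket,                      # where the file was uploaded (assumed name)
        "key": key,                            # object key within the bucket (assumed name)
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256.hexdigest(),          # lets the consumer verify the download
    }

# Small stand-in file so the sketch runs without any external service
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "large_file.txt")
    with open(path, "wb") as f:
        f.write(b"example contents")
    message = make_coat_check_message(path, "example-bucket", "uploads/large_file.txt")

print(message)
```

The point of the design is that only this small description travels through the streaming layer, while the file itself stays in object storage until a consumer asks for it.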


### Chunking

In [6]:
large_file = "large_file.txt" # ~20MB
producer = mdml.kafka_mdml_producer(
    topic = "mdml-example-file",
    schema = mdml.multipart_schema, # using MDML's pre-defined schema for chunking
    kafka_host = '100.26.16.4',
    schema_host = '100.26.16.4'
)
for i, chunk in enumerate(mdml.chunk_file(large_file, 500000), start=1): # chunk size of 500,000 bytes
    producer.produce(chunk)
    print(i) # running count of chunks produced so far
    if i % 10 == 0:
        print("flush")
        producer.flush() # flush every 10 chunks
print("final flush")
producer.flush()

1
2
3
4
5
6
7
8
9
10
flush
11
12
13
14
15
16
17
18
19
20
flush
21
22
23
24
25
26
27
28
29
30
flush
31
32
33
34
35
36
37
38
39
40
flush
41
42
43
44
45
46
47
48
49
50
flush
51
52
53
54
final flush
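
To make the idea behind chunking concrete, here is a minimal sketch in plain Python that splits a file into numbered parts and reassembles them. It is not the implementation of `mdml.chunk_file` or `consume_chunks`: the helpers `iter_chunks` and `reassemble` and the chunk field names are hypothetical, chosen only to illustrate the technique.

```python
import os
import tempfile

def iter_chunks(path, chunk_size):
    """Yield the file as numbered parts of at most chunk_size bytes (hypothetical format)."""
    # Ceiling division; an empty file still yields a single empty part
    total = max(1, -(-os.path.getsize(path) // chunk_size))
    with open(path, "rb") as f:
        for part in range(1, total + 1):
            yield {
                "filename": os.path.basename(path),
                "part": part,
                "total_parts": total,
                "data": f.read(chunk_size),
            }

def reassemble(chunks, out_path):
    """Write parts back out in part order, regardless of arrival order."""
    ordered = sorted(chunks, key=lambda c: c["part"])
    with open(out_path, "wb") as out:
        for c in ordered:
            out.write(c["data"])

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "large_file.txt")
    with open(src, "wb") as f:
        f.write(b"x" * 1_250_000)              # stand-in for a large file
    chunks = list(iter_chunks(src, 500_000))   # two full parts plus one partial part
    dst = os.path.join(tmp, "copy.txt")
    reassemble(reversed(chunks), dst)          # deliberately pass parts out of order
    with open(src, "rb") as a, open(dst, "rb") as b:
        round_trip_ok = a.read() == b.read()

print(len(chunks), round_trip_ok)
```

Carrying the part number and the total number of parts in every chunk is what allows a receiver to detect when a file is complete and to restore the original order.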


In [7]:
consumer = mdml.kafka_mdml_consumer(
    topics = ["mdml-example-file"],
    group = "abc", # create a unique group id here
    kafka_host = '100.26.16.4',
    schema_host = '100.26.16.4'
)
for msg in consumer.consume_chunks(): # each result is a (timestamp, filepath) tuple; the filepath is where the reassembled file was written
    print(msg)

Consumer loop will exit after 300.0 seconds without receiving a message or with Ctrl+C
(1631221960.0640821, 'large_file.txt')
