# IoT Microdemos


## Timeseries
A common pattern for storing and retrieving time series data is to leverage the document model with the so-called bucketing schema pattern. Instead of storing each measurement in its own document, multiple measurements are stored in one document. This provides the following benefits:
* Reduced storage space (repeated data such as the device id and other metadata is stored once per bucket, and larger documents achieve better compression ratios)
* Reduced index sizes (roughly by a factor of the bucket size), so larger parts of the index fit into memory, which increases performance
* Reduced IO because fewer documents are read (reading time series at scale is usually an IO-bound load)
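
As a rough illustration of the difference between the two layouts, the following sketch groups measurements into bucket documents in plain Python, without a database. The helper `to_buckets` and the sample values are assumptions made for this example only; the field names (`device`, `cnt`, `min_ts`, `max_ts`, `m`) follow the ones used in the rest of this demo.

```python
import datetime


def to_buckets(device, measurements, bucket_size):
    """Group measurements into bucket documents of at most bucket_size entries."""
    buckets = []
    for start in range(0, len(measurements), bucket_size):
        chunk = measurements[start:start + bucket_size]
        buckets.append({
            "device": device,
            "cnt": len(chunk),
            "min_ts": min(m["ts"] for m in chunk),
            "max_ts": max(m["ts"] for m in chunk),
            "m": chunk,
        })
    return buckets


base = datetime.datetime(2020, 4, 20, 13, 0, 0)
measurements = [
    {"ts": base + datetime.timedelta(seconds=i), "temperature": 20 + i % 5}
    for i in range(120)
]

# One document per measurement: the device id is repeated 120 times
flat_docs = [{"device": 4711, **m} for m in measurements]

# Bucketed: the device id is stored once per bucket of 60 measurements
bucket_docs = to_buckets(4711, measurements, bucket_size=60)

print(len(flat_docs), len(bucket_docs))
```

With a bucket size of 60, the 120 measurements end up in 2 documents instead of 120, so an index on the device has 2 entries to maintain instead of 120.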

The following examples walk through the typical patterns:
* [Ingesting Data](#Ingesting-Data)
* [Indexing Data](#Indexing-Strategy)
* [Querying Data](#Querying-the-Data)

## Ingesting Data

The following statement searches for a document of device 4711 whose bucket holds fewer than 3 measurements. In reality, the bucket size will be a higher number, e.g. 60 or 100. The new measurement is pushed to the array called m.

Because of the upsert option, a new document is inserted if no available bucket can be found. Increasing cnt by one during each insert automatically leads to a new document once the existing bucket is full.
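
To make the rollover behaviour concrete, here is a small in-memory emulation of what the upsert does. It is only a sketch: `add_measurement` and `BUCKET_SIZE` are hypothetical names for this example, the timestamps are plain integers, and MongoDB performs the equivalent steps atomically on the server.

```python
BUCKET_SIZE = 3  # same toy size as in this demo


def add_measurement(buckets, device, ts, values):
    """Emulate the upsert: reuse an open bucket or start a new one."""
    # Equivalent of the filter {"device": device, "cnt": {"$lt": BUCKET_SIZE}}
    bucket = next(
        (b for b in buckets if b["device"] == device and b["cnt"] < BUCKET_SIZE),
        None,
    )
    if bucket is None:
        # Equivalent of upsert=True: no open bucket was found, so create one
        bucket = {"device": device, "cnt": 0, "m": [], "min_ts": ts, "max_ts": ts}
        buckets.append(bucket)
    bucket["m"].append({"ts": ts, **values})      # $push
    bucket["max_ts"] = max(bucket["max_ts"], ts)  # $max
    bucket["min_ts"] = min(bucket["min_ts"], ts)  # $min
    bucket["cnt"] += 1                            # $inc


buckets = []
for ts in range(5):
    add_measurement(buckets, 4711, ts, {"temperature": 20 + ts})

print([b["cnt"] for b in buckets])
```

Five measurements with a bucket size of 3 produce one full bucket and one open bucket with 2 entries, which mirrors what the real statement produces in the collection.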

### Initialize the database and drop the collection:

In [None]:
import pymongo
import os
import datetime
import bson
from bson.json_util import loads, dumps, RELAXED_JSON_OPTIONS
import random
from pprint import pprint

CONNECTIONSTRING = "localhost:27017"

# Establish Database Connection
client = pymongo.MongoClient(CONNECTIONSTRING)
db = client.iot
collection = db.iot_raw

# Drop the collection before we start with the demo
collection.drop()

### Insert the first measurements:
The MongoDB Query Language offers rich operators that we leverage here to automatically bucket the data, i.e. we do not store each individual measurement in its own document, but store multiple measurements in an array.

By using upsert, we automatically start a new bucket, i.e. create a new document if no bucket with additional space can be found. Otherwise, we push the new measurement into the bucket.

The following statement finds an open bucket of device 4711, i.e. one where the count of measurements is less than 3. In reality, this will be a higher number, e.g. 60 or 100. The new measurement is pushed to the array called m, and the bucket count is increased by one. For later queries on time ranges, we also store the minimum and maximum timestamp within this bucket.

In [None]:
# The timestamp of the new measurement
# Note: For better readability, we work with datetime objects. 
# For higher precision of timestamps, e.g. nanoseconds, 
# it is recommended to work with decimal values representing seconds and nanoseconds
date = datetime.datetime.now()

# Add the new measurement to the bucket
collection.update_one({
  "device": 4711,
  "cnt": { "$lt": 3 }
},
{
  "$push": { 
    "m": {
      "ts": date,
      "temperature": random.randint(0,100),
      "rpm": random.randint(0,10000),
      "status": "operating"
    }
  },
  "$max": { "max_ts": date },
  "$min": { "min_ts": date },
  "$inc": { "cnt": 1 }
},
upsert=True)

The target document looks like the following:

In [None]:
result = collection.find_one()

pprint(result)

### Add additional measurements

Insert some more data in order to have multiple buckets (again, we use a bucket size of 3 here; in reality this number will be much higher). We insert four more measurements, so there will be two documents with 3 and 2 measurements, respectively.

In [None]:
for i in range(4):
    date = datetime.datetime.now()

    collection.update_one(
        {
            "device": 4711,
            "cnt": { "$lt": 3 }
        },
        {
            "$push": {
                "m": {
                    "ts": date,
                    "temperature": random.randint(0,100),
                    "rpm": random.randint(0,10000),
                    "status": "operating",
                    "new_field": { "subfield1": "s1", "subfield2": random.randint(0,100) }
                }
            },
            "$max": { "max_ts": date },
            "$min": { "min_ts": date },
            "$inc": { "cnt": 1 }
        },
        upsert=True
    )

The result will look like the following:

In [None]:
res = collection.find()

for doc in res:
    pprint(doc)

## Indexing Strategy

A proper indexing strategy is key for efficient querying of data. The first index is mandatory for efficient time series queries on historical data. The second one is needed for efficient retrieval of the current, i.e. open, bucket for each device. If all device types have the same bucket size, it can be created as a partial index, which keeps only the open buckets in the index. The savings can be huge for large implementations. For varying bucket sizes, e.g. per device type, the type could be added to the index.

In [None]:
# Efficient queries per device and timespan
result = collection.create_index([("device",pymongo.ASCENDING),
                         ("min_ts",pymongo.ASCENDING),
                         ("max_ts",pymongo.ASCENDING)])
print("Created Index: " + result)

# Efficient retrieval of open buckets per device
result = collection.create_index([("device",pymongo.ASCENDING),
                         ("cnt",pymongo.ASCENDING)],
                        partialFilterExpression={"cnt": {"$lt":3}})
print("Created Index: " + result)


These indexes will be used during ingestion as well as retrieval. We will have a closer look at them later on.

## Querying the Data

With aggregation pipelines it is easy to query, filter, and format the data. The following query returns two time series (temperature and rpm). The first sort should use the index prefix (device, min_ts) so that it can be executed on the index and not in memory.

In [None]:
result = collection.aggregate([
  { "$match": { "device": 4711 } },
  { "$sort": { "device": 1, "min_ts": 1 } },
  { "$unwind": "$m" },
  { "$sort": { "m.ts": 1 } },
  { "$project": { "_id": 0, "device": 1, "ts": "$m.ts", "temperature": "$m.temperature", "rpm": "$m.rpm" } }
])
   
for doc in result:
    print(doc)

To query for a certain timeframe, the following $match stage can be used (please replace LOWER_BOUND and UPPER_BOUND with appropriate datetime values).

In [None]:
LOWER_BOUND = datetime.datetime(2020, 4, 20, 13, 26, 43, 18000) # Replace with lower bound (copy & paste from results above)
UPPER_BOUND = datetime.datetime(2020, 4, 20, 13, 30, 26, 130000) # Replace with upper bound (copy & paste from results above)

result = collection.aggregate([
  { "$match": { "device": 4711, "min_ts": { "$lte": UPPER_BOUND }, "max_ts": { "$gte": LOWER_BOUND } } },
  { "$sort": { "device": 1, "min_ts": 1 } },
  { "$unwind": "$m" },
  { "$match": { "$and": [ { "m.ts": { "$lte": UPPER_BOUND } }, { "m.ts": { "$gte": LOWER_BOUND } } ] } },
  { "$project": { "_id": 0, "device": 1, "ts": "$m.ts", "temperature": "$m.temperature", "rpm": "$m.rpm" } }
])

for doc in result:
    print(doc)
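
If the timeframe query is needed in several places, the pipeline can be wrapped in a small helper. This is a minimal sketch, not part of the original demo: `build_range_pipeline` is a hypothetical name, and the second `$match` combines both bounds in one condition on `m.ts`, which should behave like the `$and` form above because `m.ts` is a single value after `$unwind`.

```python
import datetime


def build_range_pipeline(device, lower, upper):
    """Build the aggregation pipeline for one device and a closed time range."""
    return [
        # Select only buckets that overlap the requested range (uses the index)
        {"$match": {"device": device,
                    "min_ts": {"$lte": upper},
                    "max_ts": {"$gte": lower}}},
        {"$sort": {"device": 1, "min_ts": 1}},
        {"$unwind": "$m"},
        # Drop measurements at the edges of the first and last bucket
        {"$match": {"m.ts": {"$gte": lower, "$lte": upper}}},
        {"$project": {"_id": 0, "device": 1, "ts": "$m.ts",
                      "temperature": "$m.temperature", "rpm": "$m.rpm"}},
    ]


lower = datetime.datetime(2020, 4, 20, 13, 26, 43)
upper = datetime.datetime(2020, 4, 20, 13, 30, 26)
pipeline = build_range_pipeline(4711, lower, upper)

# Usage against the collection from this demo:
# for doc in collection.aggregate(pipeline):
#     print(doc)
print(len(pipeline))
```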

### How to explain this query pattern

Suppose the timestamps 1 to 23 are spread across 5 buckets, and we want to get the data from timestamps 8 to 17:
```
(1) 1 2 3 4 5
(2) 6 7 8 9 10
(3) 11 12 13 14 15
(4) 16 17 18 19 20
(5) 21 22 23 
```

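We could use a complex condition that handles each case separately (the bucket containing the lower bound, the buckets fully inside the range, and the bucket containing the upper bound), but this will end up in expensive index scans:
```
     min <= 8  and max >= 8   [ bucket (2) ]
 OR: min >= 8  and max <= 17  [ bucket (3) ]
 OR: min <= 17 and max >= 17  [ bucket (4) ]
```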

The following condition leads to the same result, allows efficient index traversal, and selects exactly the buckets of interest:
```
     max >= 8
AND: min <= 17
```
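
The overlap condition can be checked in plain Python against the five example buckets. This is a small sketch for illustration; the dictionary of `(min, max)` pairs and the variable names are assumptions for this example.

```python
# Bucket number -> (min timestamp, max timestamp), as in the example above
buckets = {1: (1, 5), 2: (6, 10), 3: (11, 15), 4: (16, 20), 5: (21, 23)}
lo, hi = 8, 17

# Simple overlap test: the bucket ends after the range starts
# and starts before the range ends
simple = [n for n, (mn, mx) in buckets.items() if mx >= lo and mn <= hi]

# Case-by-case condition: contains the lower bound, lies fully inside,
# or contains the upper bound
complex_ = [
    n for n, (mn, mx) in buckets.items()
    if (mn <= lo <= mx) or (mn >= lo and mx <= hi) or (mn <= hi <= mx)
]

print(simple, complex_)
```

Both conditions select buckets 2, 3 and 4, but the simple form needs only one range predicate per field.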