# Data Processing Steps
## MQTT
Using the `capture.py` script located in the `mqtt` subdirectory, we subscribed to every topic available on the given server and listened for messages for 30 minutes. The script then saved the captured messages to a `.json` file, ready to be processed.
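The processing cells read the capture with `load(f)["messages"]` and access the `topic` and `payload` keys of each entry, so the file presumably has the shape sketched here. Only those three key names come from the processing code; the topics and payloads are invented for illustration.

```python
import json

# Hypothetical capture content: only the "messages", "topic" and "payload"
# keys are taken from the processing code; the values are invented.
sample_capture = {
    "messages": [
        {"topic": "example/topic/a", "payload": "1.0,2.0|a|b|c|d|e|f"},
        {"topic": "example/topic/a", "payload": "1.0,2.0|a|b|c|d|e|f"},
        {"topic": "example/topic/b", "payload": "not a coordinate"},
    ]
}

# Round-trip through JSON text, standing in for the file on disk.
raw = json.dumps(sample_capture, indent=4)
sample_messages = json.loads(raw)["messages"]

print("Number of messages in the sample: %d" % len(sample_messages))
```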

### Step 1: Removing Duplicates
First, we remove duplicates from both the MQTT and CoAP captures, and then merge the results.
To process the MQTT capture, which is located at `mqtt/data.json`, we'll use the following code block:

In [2]:
from json import load


def remove_duplicates(data):
    res = []
    for msg in data:
        if msg not in res:
            res.append(msg)
    return res

# open the capture located at mqtt/data.json
filepath = "mqtt/data.json"
with open(filepath, "r") as f:
    messages = load(f)["messages"]

print("Number of messages captured: %d" % len(messages))
messages = remove_duplicates(messages)
print("Number of messages captured after removing duplicates: %d" % len(messages))

Number of messages captured: 3202
Number of messages captured after removing duplicates: 289
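The `msg not in res` check compares each message against every message kept so far, which is fine for a few thousand entries but grows quadratically. As a possible alternative for larger captures, the sketch below keeps a set of canonical JSON strings. The name `remove_duplicates_fast` is hypothetical, and the approach assumes every message is JSON-serializable.

```python
import json


def remove_duplicates_fast(data):
    """Order-preserving deduplication keyed on a canonical JSON string."""
    seen = set()
    res = []
    for msg in data:
        # sort_keys makes the key independent of dictionary key order
        key = json.dumps(msg, sort_keys=True)
        if key not in seen:
            seen.add(key)
            res.append(msg)
    return res


# Hypothetical messages, invented for illustration
sample = [
    {"topic": "a", "payload": "1"},
    {"payload": "1", "topic": "a"},  # same content, different key order
    {"topic": "b", "payload": "1"},
]
deduped = remove_duplicates_fast(sample)
print("Messages after removing duplicates: %d" % len(deduped))
```

The first occurrence of each message is kept, matching the order produced by the list-based version.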


### Step 2: Filtering Out Bad Coordinates
Now we can select the messages that carry useful coordinates. To do so, we remove every message whose payload is either not a coordinate or contains out-of-bounds coordinates.

In [3]:
from re import search

print(len(messages))
def filter_data(data):
    res = []
    bounds = [0, 10]
    # the payload is expected to start with two decimal values with
    # one decimal digit each, separated by a comma (7 characters),
    # which we parse as coordinates
    coords_regex = r"(\d+\.\d+),(\d+\.\d+)"
    for msg in data:
        # only consider payloads with exactly six "|" separators
        if msg["payload"].count("|") != 6:
            continue
        match = search(coords_regex, msg["payload"])
        if match is None:
            continue
        x, y = (float(v) for v in match.groups())
        # keep the message only if both coordinates are within bounds
        if bounds[0] <= x <= bounds[1] and bounds[0] <= y <= bounds[1]:
            res.append(msg)
    return res

messages = filter_data(messages)
messages = remove_duplicates(messages)
print("Number of messages captured after removing junk values: %d" % len(messages))

289
Number of messages captured after removing junk values: 15
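To make the filtering rule easier to check in isolation, the sketch below applies the same three conditions as `filter_data` (six `|` separators, a coordinate pair, both values within `[0, 10]`) to a single payload. The helper `parse_coords` and the sample payloads are hypothetical and not part of the capture.

```python
from re import search

COORDS_REGEX = r"(\d+\.\d+),(\d+\.\d+)"
LOWER, UPPER = 0, 10  # same bounds as in filter_data


def parse_coords(payload):
    """Return (x, y) as floats for an in-bounds coordinate payload, else None."""
    if payload.count("|") != 6:
        return None
    match = search(COORDS_REGEX, payload)
    if match is None:
        return None
    x, y = float(match.group(1)), float(match.group(2))
    if LOWER <= x <= UPPER and LOWER <= y <= UPPER:
        return (x, y)
    return None


# Hypothetical payloads, invented for illustration
print(parse_coords("3.5,7.2|a|b|c|d|e|f"))   # both coordinates in bounds
print(parse_coords("3.5,42.0|a|b|c|d|e|f"))  # y is out of bounds
print(parse_coords("hello|a|b|c|d|e|f"))     # no coordinate pair
print(parse_coords("3.5,7.2"))               # wrong number of separators
```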


### Step 3: Manual Cleanup
Now that we have only valid coordinates, we can manually select just the ones that actually belong to Dory.

In [5]:
from rich import print_json
for item in messages:
    print_json(data=item)

Printing them shows that some entries are clearly misleading, so we filter those out by topic.

In [7]:
from rich import print_json
from json import dump

blacklist = [
    "/wierd/topic/isnt/it",
    "cant/be/a/valid/entry",
    "hi/nemo/wassup/",
    "marlin/checked/this/entry",
    "nemo/tracks/dory",
    "marlin/and/nemo/are/stalkers",
    "no/nemo/here/discard/this"
]
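# build a new list instead of removing items while iterating,
# which would skip the element following each removal
messages = [item for item in messages if item["topic"] not in blacklist]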

with open("mqtt_data.json", "w") as f:
    for item in messages:
        print_json(data=item)
    dump(messages, f, indent=4)
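Filtering a list against a blacklist is easy to get subtly wrong when items are removed during iteration: each removal shifts the remaining elements left, so the entry right after a removed one is never examined. The sketch below contrasts that with building a new list. The topics are invented for illustration and are not taken from the capture.

```python
# Hypothetical topics, invented for illustration
demo_blacklist = {"bad/one", "bad/two"}
messages_demo = [
    {"topic": "bad/one", "payload": "p1"},
    {"topic": "bad/two", "payload": "p2"},
    {"topic": "good/one", "payload": "p3"},
]

# Removing while iterating: once "bad/one" is removed, the list shifts
# left and the loop never examines "bad/two".
in_place = list(messages_demo)
for item in in_place:
    if item["topic"] in demo_blacklist:
        in_place.remove(item)

# Building a new list visits every element exactly once.
filtered = [m for m in messages_demo if m["topic"] not in demo_blacklist]

print([m["topic"] for m in in_place])
print([m["topic"] for m in filtered])
```

With two adjacent blacklisted entries, the in-place loop leaves the second one behind, while the list comprehension drops both.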