# IoT device traffic to demonstrate office personnel traffic

## Introduction

IoT, or internet of things, refers to the world of devices or objects which [connect to other systems and devices over a network](https://en.wikipedia.org/wiki/Internet_of_things).

Network logging and packet capture [capture or log events which occur in a network](https://www.solarwinds.com/resources/it-glossary/pcap#:~:text=Packet%20capturing%20helps%20to%20analyze,directly%20from%20the%20computer%20network.). A captured record usually includes the source and destination IP addresses, the sizes of the forward and backward packets, the duration of the event, and its start time, among other details. This data helps the user monitor network usage and identify risks or issues. While it is most often used to investigate problems such as malicious traffic or bandwidth shortages, monitoring network traffic has other possible benefits.

Consider an office with multiple conference rooms and an open bullpen. The location of personnel could be inferred from their schedules (no meetings suggests they are at their desk; a meeting in conference room 1 suggests they are in conference room 1), but this does not account for transient movement, such as being called into a meeting that isn't on the calendar or taking a private meeting ad hoc in a free conference room.

As more and more devices become network capable (that is, become part of the internet of things), network data becomes part of managing a system.

Using network traffic to understand where people are allows a system to react accordingly. For example, it can manage blinds intelligently to keep a room from getting too hot or cold when more or fewer people are expected, and it can cover cases where a motion sensor fails because a person is sitting at a laptop and barely moving. It is also useful for understanding network demand, such as whether an area has enough bandwidth.

This notebook explores the movement of network usage between rooms based on a subset of generated data.

<div style="text-align: center;" ><a href="https://www.atoti.io/?utm_source=gallery&utm_content=IoT" target="_blank" rel="noopener noreferrer"><img src="https://data.atoti.io/notebooks/banners/Discover+Atoti+now.jpg" alt="Try Atoti"></a></div>

## Data Import

Packet capture data generally includes the source and destination IP addresses, MAC addresses, the sizes of the forward and backward packets, event durations, and event start times, among other details.

Both IP addresses and MAC addresses can identify a device, but IP addresses are typically dynamically assigned, while a MAC address is a fixed device ID. For the purpose of this notebook, we'll focus on MAC addresses, event start times, and durations to determine where devices are. Fields like packet size and direction are important for understanding the full network picture, but are not necessary for locating devices.

Since this is simulated data of network usage per gateway, the data is split by network gateway device and location. We'll combine it into one large dataframe that includes the room location as a column, then study that dataframe.
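The combination step can be sketched in plain Python before doing it with pandas. This is only an illustration, assuming each room's file is CSV text with the columns EventKey, EventTime, MacAddress and EventDuration. The in-memory strings below are hypothetical stand-ins for the real files, and the conference1 row is made up.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the per-room CSV files.
room_csvs = {
    "bullpen": (
        "EventKey,EventTime,MacAddress,EventDuration\n"
        "20220404080059:001AA0OOZAVP,2022-04-04 08:00:59,00:1A:A0:OO:ZA:VP,544\n"
        "20220404080458:0016DB1060EF,2022-04-04 08:04:58,00:16:DB:10:60:EF,88\n"
    ),
    "conference1": (
        "EventKey,EventTime,MacAddress,EventDuration\n"
        "20220404090000:0016DB1060EF,2022-04-04 09:00:00,00:16:DB:10:60:EF,120\n"
    ),
}

combined = []
for room, text in room_csvs.items():
    for row in csv.DictReader(io.StringIO(text)):
        # Tag every event with the room whose gateway recorded it.
        combined.append({"Room": room, **row})

for row in combined:
    print(row["Room"], row["EventKey"], row["EventDuration"])
```

The room name travels with every event, so a single table can later be sliced by room without going back to the individual files.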

In [1]:
import atoti as tt
import pandas as pd

#### Data ETL

In [2]:
# Load each room's capture file into a dataframe, keyed by room name
dfDict = {}
for room in ["bullpen", "conference1", "conference2", "conference3", "conference4"]:
    df = pd.read_csv(
        f"s3://data.atoti.io/notebooks/iot-load/{room}.csv", index_col="EventKey"
    )
    dfDict[room] = df

# Stack the per-room dataframes; the dict keys become the "Room" index level
NetworkFlows = pd.concat(dfDict, names=["Room", "FlowId"])

In [3]:
NetworkFlows.dtypes

EventTime        object
MacAddress       object
EventDuration     int64
dtype: object

In [4]:
NetworkFlows.EventTime = NetworkFlows.EventTime.astype("datetime64[ns]")

#### Session Creation

Now that the data is combined into a single dataframe, we can ingest it into Atoti. We'll create a session with a few configuration settings, then create our cube.

In [5]:
session = tt.Session(user_content_storage="./content")

In [6]:
Flows = session.read_pandas(NetworkFlows, table_name="NetworkFlows")
Flows.head()

| | Room | FlowId | EventTime | MacAddress | EventDuration |
|---|---|---|---|---|---|
| 0 | bullpen | 20220404080059:001AA0OOZAVP | 2022-04-04 08:00:59 | 00:1A:A0:OO:ZA:VP | 544 |
| 1 | bullpen | 20220404080458:0016DB1060EF | 2022-04-04 08:04:58 | 00:16:DB:10:60:EF | 88 |
| 2 | bullpen | 20220404080532:0016DBQFXS4N | 2022-04-04 08:05:32 | 00:16:DB:QF:XS:4N | 279 |
| 3 | bullpen | 20220404080622:001B63ZNGJ2C | 2022-04-04 08:06:22 | 00:1B:63:ZN:GJ:2C | 28 |
| 4 | bullpen | 20220404080839:001AA0YY0I3B | 2022-04-04 08:08:39 | 00:1A:A0:YY:0I:3B | 208 |


In [7]:
cube = session.create_cube(Flows)
h, l, m = cube.hierarchies, cube.levels, cube.measures

In [8]:
h

With our cube created, we can investigate what the data looks like, generally speaking. For example, we can investigate the traffic across rooms.

#### Cube Verification

In [9]:
session.visualize("Basic Visualization of Data")

## Hierarchy Management

Having investigated the basic shape of our data, there are other ways we'd like to classify or investigate it. For example, it would be useful to look at our traffic in time buckets. We have a datetime column, so we can use create_date_hierarchy to break it down further.

Since our data all takes place in the same year and month, we'll only break this down to the day and hour. We'll also create a separate date hierarchy for just the hour.
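To make the two bucketings concrete, here is a rough plain-Python sketch of what they count. It is not how Atoti builds the hierarchies, only an illustration. The first three timestamps come from the sample rows above; the last two are made up.

```python
from collections import Counter
from datetime import datetime

# Event start times: three from the sample rows, two made up for illustration.
event_times = [
    "2022-04-04 08:00:59",
    "2022-04-04 08:04:58",
    "2022-04-04 08:05:32",
    "2022-04-04 19:30:00",
    "2022-04-05 08:10:00",
]

events_per_day_hour = Counter()  # mirrors a Day > Hour breakdown
events_per_hour = Counter()      # mirrors an hour-only breakdown, all days pooled

for text in event_times:
    ts = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    events_per_day_hour[(ts.day, ts.hour)] += 1
    events_per_hour[ts.hour] += 1

print(events_per_day_hour)
print(events_per_hour)
```

The day-and-hour view keeps each day's 08:00 bucket separate, while the hour-only view pools them, which is useful for spotting a typical daily pattern.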

In [10]:
cube.create_date_hierarchy(
    "DateTime", column=Flows["EventTime"], levels={"Day": "d", "Hour": "HH"}
)
cube.create_date_hierarchy("Hour", column=Flows["EventTime"], levels={"Hour": "HH"})

In [11]:
session.visualize("Number of Events over Time")

From this, we can already see something pretty intuitive: there is more network traffic between the hours of 08:00 and 18:00, which are reasonable working hours for an office.

We also notice that between 18:00 and 08:00 the following day, the network traffic doesn't quite drop to zero. Let's investigate what contributes to this by drilling through on one of those hours.

In [12]:
session.visualize("Drill Through on one of the Lulls")

Looking at this, we see that the MAC addresses of these devices are similar. We can look up the manufacturer of these devices to see what they are, or at least where they come from, using a website like [macvendorlookup](https://www.macvendorlookup.com/).

Looking up one of the devices, '00:17:88:XU:ED:P6', we see that the vendor is Philips Lighting. If the office has network-capable lightbulbs, this type of network traffic makes sense. Many previously mundane objects are now 'smart'.
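Vendor lookups work because the first three octets of a MAC address form the vendor prefix (the OUI). A minimal sketch of such a lookup follows. The helper names and the one-entry table are hypothetical; only the 00:17:88 to Philips Lighting mapping comes from the lookup described above, and a real lookup would query a full vendor database.

```python
def oui_prefix(mac_address: str) -> str:
    """Return the first three octets of a MAC address, upper-cased."""
    return ":".join(mac_address.upper().split(":")[:3])


# Tiny hypothetical stand-in for a vendor database.
KNOWN_VENDORS = {"00:17:88": "Philips Lighting"}


def vendor_of(mac_address: str) -> str:
    """Look up the vendor for a MAC address, or 'unknown' if not listed."""
    return KNOWN_VENDORS.get(oui_prefix(mac_address), "unknown")


print(vendor_of("00:17:88:XU:ED:P6"))
print(vendor_of("00:1A:A0:OO:ZA:VP"))
```

Grouping devices by prefix in this way is a quick check on whether after-hours traffic comes from a single family of devices.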

If we look at the data for one such device, we also notice that it is always found in the same room. This makes sense, as a lightbulb shouldn't travel, unlike, say, a laptop.

In [13]:
session.visualize('Data from device "00:17:88:CR:PU:Y7"')

## Data Investigation

Now that we have the hierarchies we want, we can investigate what our data is saying and create additional measures to gain insight into how people move around our offices. Before we get started, let's first see in more detail where people are in our offices during the day.

In [14]:
session.visualize("Distribution of Devices")

As would be reasonable to expect, the bulk of our traffic seems to be located in the bullpen area of the office during the workday. Some questions to consider:
* Do people stay fixed in the same area throughout the day?
* How many people tend to gather in the conference rooms when the conference rooms are in use?
* Is there a specific time during the day where any particular conference room is favored or disfavored?

To answer these questions, we can create additional measures. Let's start with a measure which returns the distinct number of rooms. This one simple measure will allow us to see two things immediately:
* how many devices cross through multiple rooms (and whether anyone uses all four conference rooms and the bullpen area in a day)
* roughly how many of our devices are stationary devices that are from the office.
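The idea behind a distinct-room count per device can be sketched in plain Python. This is an illustration with made-up events, not the Atoti measure itself: a device seen in one room only is a candidate for a stationary office device, and a device seen in several rooms is moving.

```python
from collections import defaultdict

# (MacAddress, Room) pairs; made-up events for illustration.
events = [
    ("00:16:DB:10:60:EF", "bullpen"),
    ("00:16:DB:10:60:EF", "conference1"),
    ("00:16:DB:10:60:EF", "bullpen"),
    ("00:17:88:XU:ED:P6", "conference2"),
    ("00:17:88:XU:ED:P6", "conference2"),
]

# Collect the set of rooms each device was seen in; repeats collapse.
rooms_by_device = defaultdict(set)
for mac, room in events:
    rooms_by_device[mac].add(room)

locations_found = {mac: len(rooms) for mac, rooms in rooms_by_device.items()}
moving = sorted(mac for mac, n in locations_found.items() if n > 1)
stationary = sorted(mac for mac, n in locations_found.items() if n == 1)

print(locations_found)
print("moving:", moving)
print("stationary:", stationary)
```

A count of one does not prove that a device is a room fixture; an employee who never leaves the bullpen would look the same.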

In [15]:
m["#LocationsFound"] = tt.agg.count_distinct(Flows["Room"])

In [16]:
session.visualize("Devices #Locations Across Each Day")

So, which devices are moving around? For this, we'll look at devices which are found in more than one room. This will naturally exclude devices which are room features, as well as employees who either only take meetings in one conference room or are always in the bullpen area.

In [17]:
session.visualize("Devices Found in Multiple Rooms")

We can also get a sense of the types of devices from the flow traffic. For example, a person with a laptop and a smart watch may only sit in the bullpen area, but those devices will likely log a greater number of events than a lightbulb.

In [18]:
session.visualize("Total Device Network Flow vs #Locations Device Inhabits")

Now that we have a sense that most of our devices are transient, meaning that at some point or another they end up visiting every room, let's see what we can find out about when these rooms are being used. We can already see from the area chart that conference room one seems to experience reduced traffic at some point.

This is a bit difficult, since we have devices like lightbulbs, which means there can be network traffic even when no humans are in the room. Let's start by visualizing the duration of network events for each room in each hour. We'll focus on the hours between 08:00 and 18:00.
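As a rough sketch of what this view aggregates, the snippet below sums event durations per room and hour and keeps only working hours. Treating the window as 08:00 up to, but not including, 18:00 is an assumption. The two bullpen rows come from the sample data; the conference1 rows are made up.

```python
from collections import defaultdict
from datetime import datetime

# (Room, EventTime, EventDuration)
events = [
    ("bullpen", "2022-04-04 08:00:59", 544),
    ("bullpen", "2022-04-04 08:04:58", 88),
    ("conference1", "2022-04-04 09:15:00", 120),  # made up
    ("conference1", "2022-04-04 22:00:00", 5),    # made up, outside working hours
]

duration_by_room_hour = defaultdict(int)
for room, text, duration in events:
    hour = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").hour
    if 8 <= hour < 18:  # assumed working-hours window
        duration_by_room_hour[(room, hour)] += duration

for (room, hour), total in sorted(duration_by_room_hour.items()):
    print(room, f"{hour:02d}:00", total)
```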

In [19]:
session.visualize("Hourly Network Flow, Split by Room")

We see the conference rooms tend to be used throughout the day, but conference room one seems to become unpopular in the afternoon. Let's recreate this same visual, excluding the bullpen data.

In [20]:
session.visualize("Hourly Network Flow, Just Conference Rooms")

It seems like Conference Room 1 becomes unpopular as the day goes on. There could be many reasons for this:
* It gets hot/cold
* The sun/lighting gets worse through the day
* Fewer people need that room and its features 

There could be other reasons as well. From an office management point of view, it becomes clear that either the room's conditions need to be investigated, or the resources allocated to it could be diverted.

Let's also create a similar measure counting the distinct MAC addresses. This can help us see how many devices are in a room during a specific period.
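The difference between counting events and counting devices is worth spelling out. In the made-up sample below, one chatty device logs three events in the same room and hour but still counts as a single device.

```python
from collections import defaultdict
from datetime import datetime

# (Room, EventTime, MacAddress); made-up events for illustration.
events = [
    ("conference1", "2022-04-04 09:01:00", "00:16:DB:10:60:EF"),
    ("conference1", "2022-04-04 09:20:00", "00:16:DB:10:60:EF"),
    ("conference1", "2022-04-04 09:45:00", "00:16:DB:10:60:EF"),
    ("conference1", "2022-04-04 09:30:00", "00:1A:A0:OO:ZA:VP"),
    ("conference1", "2022-04-04 10:05:00", "00:1A:A0:OO:ZA:VP"),
]

# Set of devices seen in each (room, hour) bucket.
devices_by_room_hour = defaultdict(set)
for room, text, mac in events:
    hour = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").hour
    devices_by_room_hour[(room, hour)].add(mac)

devices_found = {key: len(macs) for key, macs in devices_by_room_hour.items()}
print(devices_found)
```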

In [21]:
m["#DevicesFound"] = tt.agg.count_distinct(Flows["MacAddress"])

In [22]:
session.visualize("Devices Found per Room per Hour during the work day")

What else can we see from this data?

For example, how long, on average, does a device stay in a conference room? We'll build this metric up iteratively. Let's start by determining what percentage of its active time (i.e., time spent actively communicating) a device spends in each room. Ideally, we'd exclude the devices which are natively part of that room, but we'll include them for now.

In [23]:
m["TotalDuration"] = tt.agg.sum(
    m["EventDuration.SUM"], scope=tt.SiblingsScope(hierarchy=h["Room"])
)

m["%Time"] = m["EventDuration.SUM"] / m["TotalDuration"]
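The arithmetic behind a percentage like this can be sketched in plain Python. The sketch assumes that the total is the sum of a device's event durations across all rooms, so that each room's share is its duration divided by that total. The events below are made up.

```python
from collections import defaultdict

# (MacAddress, Room, EventDuration); made-up events for illustration.
events = [
    ("00:16:DB:10:60:EF", "bullpen", 300),
    ("00:16:DB:10:60:EF", "conference1", 100),
    ("00:17:88:XU:ED:P6", "conference2", 50),
]

duration = defaultdict(int)  # per (device, room)
total = defaultdict(int)     # per device, across all rooms
for mac, room, event_duration in events:
    duration[(mac, room)] += event_duration
    total[mac] += event_duration

# Share of each device's active time spent in each room.
pct_time = {key: duration[key] / total[key[0]] for key in duration}

for (mac, room), share in pct_time.items():
    print(mac, room, f"{share:.0%}")
```

A device that only ever appears in one room, such as the lightbulb here, trivially gets a share of 100%, which is one reason to exclude room fixtures later.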

In [24]:
session.visualize("% Time in Each Room")

This data could be summarized in a dashboard, providing a view of how users use their office space over time.

In [25]:
session.link(path="#/dashboard/b1a")

Open the notebook in JupyterLab with the Atoti extension enabled to see this link.

We hope you enjoyed this exploration.

<div style="text-align: center;" ><a href="https://www.atoti.io/?utm_source=gallery&utm_content=collateral-monitoring" target="_blank" rel="noopener noreferrer"><img src="https://data.atoti.io/notebooks/banners/Try+Atoti.jpg" alt="Try Atoti"></a></div>