# Introduction

## Introduction / Goal

In this project we analyse our personal network traffic. We first want to understand general properties of our networks, namely the protocols, DNS servers, share of encrypted traffic and EtherTypes that our clients mainly use. Further, we analyse the sources and destinations of our requests and which devices sent the most data. Lastly, we inspect the collected data for suspicious network traffic. To do so, we investigate the traffic of our IoT devices to see how much data they share and which data that is. Moreover, we take a look at suspicious destinations to see where our data is sent.

## Setup of Hardware

Our hardware setup included a Raspberry Pi, which was set up as a Wi-Fi access point and collected the network traffic using `tcpdump`. The Raspberry Pi was connected to our local router, which provided access to the internet. All of us also used another access point and additional wired LAN connected directly to the router. Therefore, the Raspberry Pi could not collect all of the traffic from our homes, but only a representative part of it.

To collect the data we used the following tcpdump command: `tcpdump -i eth0 -G 86400 -w dumps/%Y-%m-%d_%H-%M_dump.pcap -Z root`. We captured all traffic passing through the eth0 interface, i.e. the LAN port of the Raspberry Pi. The `-G 86400` option rotates the dump file every 86400 seconds, so the dumps were saved in daily chunks named after the date and time at which each file was started. The `-Z root` flag was necessary to make sure that the capturing restarted daily.
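The file naming used by the capture command can be reproduced in Python, which is handy when locating the dump for a given day. This is a minimal sketch: the pattern and the rotation interval are taken from the command above, while the function names and the example start time are hypothetical, and it assumes that each file name is expanded at the moment the file is opened.

```python
from datetime import datetime, timedelta

# Pattern and interval copied from the tcpdump command (-w and -G arguments).
DUMP_PATTERN = "dumps/%Y-%m-%d_%H-%M_dump.pcap"
ROTATE_SECONDS = 86400


def dump_filename(started_at):
    """Expand the strftime placeholders for a file opened at `started_at`."""
    return started_at.strftime(DUMP_PATTERN)


def rotation_filenames(first_start, count):
    """List the file names of `count` consecutive rotations (hypothetical helper)."""
    step = timedelta(seconds=ROTATE_SECONDS)
    return [dump_filename(first_start + i * step) for i in range(count)]


# Example with an assumed capture start on 5 January 2021 at 18:30
for name in rotation_filenames(datetime(2021, 1, 5, 18, 30), 3):
    print(name)
```

Because the interval is exactly one day, every file carries the same time of day as the first one and only the date part changes.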

## Metrics on how much data was collected

How many devices , how much data

## Device Mapping

| IP                                    | Sven's Device | Maxi's Device | Fabi's Devices                         |
| ------------------------------------- | ------------- | ------------- | -------------------------------------- |
| 192.168.178.1                         | Router        |               |                                        |
| 192.168.178.21                        |               | Android Phone |                                        |
| 192.168.178.24                        | Google Home   |               |                                        |
| 192.168.178.26                        | Chromecast    |               |                                        |
| 192.168.178.27                        | Google Home   |               |                                        |
| 192.168.178.29                        | Google Home   |               |                                        |
| 192.168.178.42                        | Google Home   |               |                                        |
| 192.168.178.43                        | SmartTV       |               |                                        |
| 192.168.178.44                        | Android Phone |               |                                        |
| 192.168.178.50                        | iPad          |               |                                        |
| 192.168.178.51                        | MacBook       |               |                                        |
| 192.168.178.58                        | Vacuum Robot  |               |                                        |
| 192.168.178.59                        |               | Notebook      |                                        |
| 192.168.178.60                        | MacBook       |               |                                        |
| 192.168.178.62                        | iPhone        |               |                                        |
| 192.168.178.64                        | SmartTV       |               |                                        |
| 192.168.178.81                        | Android Phone |               |                                        |
| 2003:c1:3712:ac00:e9ad:724d:142a:c5c9 |               | Android Phone |                                        |
| 2003:c1:3712:ac00:38ee:9c51:e7ee:fe52 |               | Notebook      |                                        |
| 192.168.0.1                           |               |               | Router                                 |
| 192.168.0.2                           |               |               | Wifi Smart Plug                        |
| 192.168.0.8                           |               |               | Amazon Fire TV                         |
| 192.168.0.9                           |               |               | Sonos Wifi Loudspeaker                 |
| 192.168.0.14                          |               |               | iPad                                   |
| 192.168.0.22                          |               |               | ESP32 Microcontroller                  |
| 192.168.0.88                          |               |               | Raspberry Pi used for network sniffing |
| 192.168.0.121                         |               |               | ESP32 with Feinstaubsensor firmware    |
| fe80::271:47ff:fe8d:2e30              |               |               | Amazon Fire TV                         |
| 2003:c1:3720:f300:44f9:8130:108c:d89  | Android Phone |               |                                        |
| 2003:c1:3720:f300:8550:6331:20ec:7cff | Android Phone |               |                                        |
| 2003:c1:3720:f300:35ff:69ab:67a6:7220 | Android Phone |               |                                        |
| fe80::920c:c8ff:fed8:2441             | Chromecast    |               |                                        |


In [None]:
# mapping for later sections
#devices_labels = ['Router (Sven)', 'Google Home (Sven)', 'Chromecast (Sven)', 'Smart TV (Sven)', 'Android Phone (Sven)', 'iPad (Sven)', 'MacBook (Sven)', 'Vacuum Robot (Sven)', 'iPhone (Sven)',
#                  'Android Phone (Maxi)', 'Notebook (Maxi)',]
# no router traffic is analyzed due to potential issues caused by device <--> router communication
# '192.168.178.1': 'Router (Sven)',
devices_mapping = {
                    '192.168.178.21': 'Android Phone (Maxi)',
                    '192.168.178.24': 'Google Home (Sven)',
                    '192.168.178.26': 'Chromecast (Sven)',
                    '192.168.178.27': 'Google Home (Sven)',
                    '192.168.178.29': 'Google Home (Sven)',
                    '192.168.178.42': 'Google Home (Sven)',
                    '192.168.178.43': 'Smart TV (Sven)',
                    '192.168.178.44': 'Android Phone (Sven)',
                    '192.168.178.50': 'iPad (Sven)',
                    '192.168.178.51': 'MacBook (Sven)',
                    '192.168.178.58': 'Vacuum Robot (Sven)',
                    '192.168.178.59': 'Notebook (Maxi)',
                    '192.168.178.60': 'MacBook (Sven)',
                    '192.168.178.62': 'iPhone (Sven)',
                    '192.168.178.64': 'Smart TV (Sven)',
                    '192.168.178.81': 'Android Phone (Sven)',
                    '2003:c1:3712:ac00:e9ad:724d:142a:c5c9': 'Android Phone (Maxi)',
                    '2003:c1:3712:ac00:38ee:9c51:e7ee:fe52': 'Notebook (Maxi)',
                    '192.168.0.2':    "Wifi Smart Plug (Fabi)",
                    '192.168.0.8':    "Amazon Fire TV (Fabi)",
                    '192.168.0.9':    "Sonos Wifi Loudspeaker (Fabi)",
                    '192.168.0.14':   "iPad (Fabi)",
                    '192.168.0.22':   "ESP32 Microcontroller (Fabi)",
                    '192.168.0.88':   "Raspberry Pi used for network sniffing (Fabi)",
                    '192.168.0.121':  "ESP32 with Feinstaubsensor firmware (Fabi)",
                    'fe80::271:47ff:fe8d:2e30': "Amazon Fire TV (Fabi)",
                    '2003:c1:3720:f300:44f9:8130:108c:d89':  "Android Phone (Sven)",
                    '2003:c1:3720:f300:8550:6331:20ec:7cff':  "Android Phone (Sven)",
                    '2003:c1:3720:f300:35ff:69ab:67a6:7220':  "Android Phone (Sven)",
                    'fe80::920c:c8ff:fed8:2441':  "Chromecast (Sven)",

}

devices_labels = list(set(devices_mapping.values()))

# Setup

## Imports

In [None]:
# Program, run imports
%matplotlib inline
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from datetime import datetime
import numpy as np
import time
import seaborn as sns
import requests
sns.set(style="darkgrid")

# set to False to skip merging, filtering and intensive computation tasks (df.csv is loaded instead)
new_read = True

## Create Pandas Dataframe

In [None]:
if new_read:
    print('Concatenate csv files...')
    df_fabi = pd.read_csv("dumps/allPorts_fabian.csv", encoding = "latin")
    df_maxi = pd.read_csv("dumps/allPorts_maxi_v3.csv", encoding = "latin")
    df_sven = pd.read_csv("dumps/advanced-dumps-sven.csv", encoding = "latin")
    df = pd.concat([df_fabi, df_maxi, df_sven])
    print('Concatenate operation completed')
    print(len(df))

## Filter Local Traffic

In [None]:
if new_read:
    time_filter_start = time.time()
    print('Compute intensive task filter started...')
    # Keep only packets that cross the LAN boundary, i.e. exactly one of source and destination starts with 192.168.
    # DNS packets are always kept. Packets where neither address starts with 192.168 (e.g. IPv6) only pass if they are DNS.
    src_local = df['Source'].str.startswith("192.168", na=False)
    dst_local = df['Destination'].str.startswith("192.168", na=False)
    filtered = df.loc[(~src_local & dst_local) | (src_local & ~dst_local) | df['Protocol'].isin(['DNS'])]

    # Filter local IPv6 traffic
    filtered = filtered.loc[filtered['Protocol'].isin(['DNS']) |
                    ~(
                        (filtered['Source'].str.startswith("fd00:", na=False) | filtered['Source'].str.startswith("fe80:", na=False)) &
                        (filtered['Destination'].str.startswith("fd00:", na=False) | filtered['Destination'].str.startswith("fe80:", na=False))
                    )]

    filtered = filtered.loc[~filtered['Protocol'].isin(['DHCP', 'ARP', 'MDNS', 'LLDP', 'SSDP', 'IGMP', 'IGMPv2', 'IGMPv3', 'ICMP', 'ICMPv4', 'ICMPv6', 'ieee1905', 'LLMNR'])]
    filtered = filtered.loc[~filtered['Destination'].isin(['255.255.255.255'])]


    df = filtered 

    # sort array by timestamp

    df = df.sort_values(by=['Time'])
    # save to csv
    # df.to_csv('df.csv')
    # print (df.loc[df['Destination'].str.contains("255.255.255.255")])
    time_filter_duration = time.time() - time_filter_start
    print('Task filter completed in ', time_filter_duration, ' seconds')
else:
    df = pd.read_csv('df.csv')
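The prefix checks above only recognise `192.168`, `fd00:` and `fe80:` addresses. As an alternative, the standard `ipaddress` module can classify IPv4 and IPv6 addresses in one place. The following is a minimal sketch, not the method used for the results in this write-up; the helper names are hypothetical, and values that are not IP addresses are treated as non-local by assumption.

```python
import ipaddress


def is_local_address(address):
    """Return True for private, link-local or loopback IPv4/IPv6 addresses."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        # Not an IP address (e.g. a MAC address or a missing value)
        return False
    return ip.is_private or ip.is_link_local or ip.is_loopback


def crosses_lan_boundary(source, destination):
    """True if exactly one endpoint is local, i.e. the packet enters or leaves the LAN."""
    return is_local_address(source) != is_local_address(destination)


print(crosses_lan_boundary("192.168.0.22", "188.164.238.26"))  # device to Internet
print(crosses_lan_boundary("192.168.0.2", "192.168.0.1"))      # device to router
```

Such a helper could be applied row by row to `Source` and `Destination`, at the cost of being slower than the vectorised string comparison.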


## Add Column 'Device Name'

In [None]:
if 'Device Name' not in df.columns:
    new_read = True # save df later
    time_device_name_calc_start = time.time()
    print('Compute Device Name column...')
    #df_devices = pd.DataFrame({'Source': devices_mapping.keys(), 'Device Name': devices_mapping.values()})
    # df = df.merge(df_devices, on='Source', how='left')
    # Apply Device Name to incoming and outgoing packets
    # (if both endpoints are known devices, the destination's name wins)
    source_known = df['Source'].isin(devices_mapping.keys())
    destination_known = df['Destination'].isin(devices_mapping.keys())
    df.loc[source_known, 'Device Name'] = df.loc[source_known, 'Source'].map(devices_mapping)
    df.loc[destination_known, 'Device Name'] = df.loc[destination_known, 'Destination'].map(devices_mapping)
    time_device_name_calc_duration = time.time() - time_device_name_calc_start
    print('Compute Device Name column completed in ', time_device_name_calc_duration, ' seconds')
    print(df.head(10))

## Calculate time intensive tasks

In [None]:
# helper functions
def utcEntryToTimestamp(entry):
    #if '.' in entry:
    row_entry = entry.split(".")[0]
    #else:
    #    row_entry = entry.split(",")[0]
    TIME_FORMAT='%Y-%m-%d %H:%M:%S'
    ts = int(datetime.strptime(row_entry, TIME_FORMAT).timestamp())
    return ts

def utcRowToTimestamp(row):
    return utcEntryToTimestamp(row.at['Time'])
utcRowToTimestamp(df.iloc[0])
# print(df.loc[0].at['Time'])


In [None]:
min_15_duration = 60 * 15
# add index time column, if it doesn't exist yet
if 'index-time' not in df.columns:
    time_utc_calc_start = time.time()
    print('Compute index-time column (approx. 5min)...')
    df['index-time'] = df['Time'].apply(utcEntryToTimestamp) # Unix timestamp in seconds
    first_entry_offset = df.iloc[0].at['index-time'] // min_15_duration
    df['index-time'] = (df['index-time'] // min_15_duration) - first_entry_offset
    new_read = True
    time_utc_calc_duration = time.time() - time_utc_calc_start
    print('index-time column task completed in ', time_utc_calc_duration, ' seconds')
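The helper above calls `.timestamp()` on a naive `datetime`, which Python interprets in the local time zone of the machine running the notebook. The sketch below shows the same 15-minute binning with the time zone made explicit. It assumes that the `Time` column holds UTC strings of the form `YYYY-MM-DD HH:MM:SS.ffffff`; the function names and sample values are hypothetical.

```python
from datetime import datetime, timezone

WINDOW_SECONDS = 15 * 60


def utc_entry_to_timestamp(entry):
    """Parse 'YYYY-MM-DD HH:MM:SS[.fraction]' as UTC and return whole epoch seconds."""
    seconds_part = entry.split(".")[0]  # drop fractional seconds
    parsed = datetime.strptime(seconds_part, "%Y-%m-%d %H:%M:%S")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def window_indices(entries):
    """Map time strings to 15-minute window numbers relative to the first entry."""
    stamps = [utc_entry_to_timestamp(e) for e in entries]
    first_window = stamps[0] // WINDOW_SECONDS
    return [s // WINDOW_SECONDS - first_window for s in stamps]


sample = [
    "2021-01-05 10:07:00.100000",
    "2021-01-05 10:14:59.900000",
    "2021-01-05 10:15:00.000000",
    "2021-01-05 11:00:00.000000",
]
print(window_indices(sample))
```

The windows are aligned to multiples of 15 minutes since the epoch, so the first two sample packets share a window even though they are almost eight minutes apart.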

## Save DF to CSV / Load DF from CSV

In [None]:
if new_read:
    df.to_csv('df.csv')

print(df)
print(len(df))

# Analysis

## Protocols
We would like to start our investigation by getting a better insight into the packet types we are sending to the web.
Several protocols exist for sending data through the web nowadays. Based on their purpose, these can be assigned to the Application, Transport and Internet layers.
- Application layer
  - DNS
  - SSH
  - HTTP
  - HTTPS
  - ...
- Transport Layer
  - TCP
  - UDP
  - QUIC
- Internet Layer
  - IP (IPv4, IPv6)
  - ICMP
  - ...   

Some of these protocols, such as HTTP, UDP and TCP, are quite old. Others, such as the QUIC protocol, were only introduced in recent years.



### Protocol Distribution Among Frames

To get a better overview of the protocols our devices use to send data to and receive data from the Internet, we print a ranking and plot the data. \
We see that the majority of our captured frames use **TCP** as the underlying protocol. This is not surprising, since TCP is the most widely used transport protocol and carries most traffic on the web. \
Surprisingly, we also see **QUIC** in our top 5. QUIC, which was standardized in May 2021, is heavily pushed by its creator Google. As our device list shows, Google devices make up a large share of all devices. This might be the reason for its popularity among our captured frames.

In [None]:
# rank by most used protocols
df_ranked_protocols = df.groupby('Protocol').size()
print(df_ranked_protocols.nlargest(15))


# plot pie diagram
fig, ax = plt.subplots(figsize=(10,10))
plt.title('Protocol Distribution by Amount of Packets')
ax.pie(df_ranked_protocols, labels=df_ranked_protocols.keys(), autopct='%1.1f%%')
fig.show()
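The pie chart above draws one slice per protocol, so rarely seen protocols produce many thin slices with overlapping labels. A common remedy is to keep only the most frequent protocols and merge the remainder into a single bucket. The following is a small sketch of that idea using only the standard library; the function name, the `"Other"` label and the sample data are assumptions for illustration.

```python
from collections import Counter


def protocol_shares(protocols, top_n=5):
    """Return (label, percent) pairs for the top_n protocols plus an 'Other' bucket."""
    counts = Counter(protocols)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    shares = [(name, 100.0 * n / total) for name, n in top]
    rest = total - sum(n for _, n in top)
    if rest > 0:
        # Everything outside the top_n is merged into one slice
        shares.append(("Other", 100.0 * rest / total))
    return shares


# Made-up sample: one protocol label per captured frame
sample = ["TCP"] * 6 + ["QUIC"] * 2 + ["DNS", "NTP"]
for label, percent in protocol_shares(sample, top_n=2):
    print(f"{label}: {percent:.1f}%")
```

The resulting labels and percentages can be passed to `ax.pie` in place of the full ranking.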

### Amount of Data Traffic per Protocol
We also had a look at the amount of data traffic that is sent by each protocol. It seems that this heavily correlates with the protocol distribution among our captured frames.

In [None]:
df_data_per_protocol = df.groupby('Protocol')['Length'].sum()
print(df_data_per_protocol.nlargest(15))

fig, ax = plt.subplots(figsize=(10,10))
plt.title('Protocol Distribution by Data Traffic')
ax.pie(df_data_per_protocol, labels=df_data_per_protocol.keys(), autopct='%1.1f%%',)
fig.show()

### Average Data length per Protocol Type
In order to investigate which kind of protocol carries the most data per frame, we compute the mean frame length per protocol.
It seems that SSL and SSH in particular have the greatest amount of data per frame. \
Surprisingly, we also see some unexpected protocols, such as H1, PKIX-CRL and HART_IP, in this ranking. H1 and HART_IP are industrial protocols, and PKIX-CRL carries certificate revocation lists, so we would not expect them to rank this high in home network traffic. We are not sure which programs on Maxi's laptop and phone are responsible for this and will investigate this further in a later section of this write-up.

In [None]:
df_mean_protocol_packet_length = df.groupby('Protocol')['Length'].mean()
print(df_mean_protocol_packet_length.nlargest(15))
#print (df.loc[df['Protocol'].str.contains("HART_IP")])

# plot
fig, ax = plt.subplots(figsize=(10,10))
ax.tick_params(axis='x', which='major', labelsize=12)
ax.tick_params(axis='x', which='minor', labelsize=12)
plt.xlabel('Protocol')
plt.ylabel('Data [Byte]')
plt.title('Mean Data Length per Protocol Type')
ax.bar(df_mean_protocol_packet_length.keys(), df_mean_protocol_packet_length, align='center',)
plt.xticks(rotation=60, ha="right")
fig.show()

In [None]:
# write IP address destinations to file (missing values are skipped)
unique_dests = df['Destination'].dropna().unique()
with open("destinations.txt", "w") as dest_file:
    for dest in unique_dests:
        dest_file.write(dest + "\n")

## Source and Destination Analysis
Our devices contact many servers on the Internet. We would like to know which servers we send the most frames to and receive the most frames from. For this, we group our data by source and by destination and print the largest sources and destinations of data. We also filter out all local sources and destinations in our home networks.

In [None]:
# Data Sources
df_ranked_sources = df.loc[~df['Source'].str.startswith("192.168", na=False) & ~df['EtherType'].isin(['IPv6'])] \
    .groupby('Source').size()

print('Sources of Data')
print(df_ranked_sources.nlargest(15))

In [None]:
# Destinations 
df_ranked_destinations = df.loc[~df['Destination'].str.startswith("192.168", na=False) & ~df['EtherType'].isin(['IPv6'])] \
    .groupby('Destination').size()
print('Destination of Data')
print(df_ranked_destinations.nlargest(15))

### ISPs and Location

The previous printout does not reveal anything about the host and location behind each IP address. Therefore, we decided to use an external API provider to obtain more data. The API returns a JSON response, from which we parse the location and the ISP.

Most of the source and destination servers in our data belong to cloud providers, content delivery networks (Fastly) or video streaming services (Twitch).

However, we also found some anomalies. Among our most contacted servers is 'SWM Services GmbH', a local Munich infrastructure company. We investigate this further in an upcoming section of this write-up. There is also 'Datacamp limited'. A quick Google search suggests that this ISP has been associated with various viruses, so we treat it as suspicious.

In [None]:
# get data about source servers
rows = []
for address, count in df_ranked_sources.nlargest(25).items():  # Series.iteritems() was removed in pandas 2.0
    headers = { 'User-Agent': "keycdn-tools:https://www.example.com" }
    url = "https://tools.keycdn.com/geo.json?host={}".format(address)
    json_response = requests.get(url, headers=headers).json()
    # print(json_response)
    geo = json_response['data']['geo']
    rows.append([geo['ip'], geo['country_code'], geo['longitude'], geo['latitude'], geo['isp']])
    
# as dataframe
df_coord = pd.DataFrame(rows, columns=["ip", "country", "lng", "lat", "isp"])
print(df_coord)

# plot on worldmap
g_world = gpd.GeoDataFrame(df_coord, geometry=gpd.points_from_xy(df_coord.lng, df_coord.lat))

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
g_world.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()
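The lookup loop above indexes directly into the JSON response, so a single failed lookup (for example an error response without a `geo` entry) aborts the whole cell with a `KeyError`. A defensive variant can skip such responses instead. The sketch below only covers the parsing step and uses the field names that appear in the cell above; the helper names and the sample responses are made up for illustration.

```python
def parse_geo(json_response):
    """Extract [ip, country, lng, lat, isp] from a lookup response, or None if fields are missing."""
    try:
        geo = json_response["data"]["geo"]
        return [geo["ip"], geo["country_code"], geo["longitude"], geo["latitude"], geo["isp"]]
    except (KeyError, TypeError):
        return None


def collect_rows(responses):
    """Keep only the responses that could be parsed."""
    parsed = (parse_geo(r) for r in responses)
    return [row for row in parsed if row is not None]


# Hypothetical responses: one complete, one without geo data
ok = {"data": {"geo": {"ip": "203.0.113.7", "country_code": "DE",
                       "longitude": 11.57, "latitude": 48.14, "isp": "Example ISP"}}}
failed = {"status": "error", "description": "lookup failed"}
print(collect_rows([ok, failed]))
```

Passing a `timeout` to `requests.get` would additionally keep one unresponsive lookup from blocking the loop.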

### IP Locations World Wide Destinations (Outgoing Traffic) ###

In [None]:
# get data about destination servers
rows = []
for address, count in df_ranked_destinations.nlargest(25).items():  # Series.iteritems() was removed in pandas 2.0
    headers = { 'User-Agent': "keycdn-tools:https://www.example.com" }
    url = "https://tools.keycdn.com/geo.json?host={}".format(address)
    json_response = requests.get(url, headers=headers).json()
    #print(json_response)
    geo = json_response['data']['geo']
    rows.append([geo['ip'], geo['country_code'], geo['longitude'], geo['latitude'], geo['isp']])
    
# as dataframe
df_coord = pd.DataFrame(rows, columns=["ip", "country", "lng", "lat", "isp"])
print(df_coord)

# plot on worldmap
g_world = gpd.GeoDataFrame(df_coord, geometry=gpd.points_from_xy(df_coord.lng, df_coord.lat))

world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
g_world.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()

## Encryption

To estimate the percentage of encrypted traffic, we base our analysis solely on the protocols used. The reason is that we were not able to capture the packet contents over the whole period because of the storage this would have required, and therefore cannot check whether the content is encrypted or not.

To distinguish between packets with encrypted content and packets with unencrypted content, we only look at the following application layer protocols and divide them into encrypted, non-encrypted and uncategorized:
- Encrypted:
  - DTLSv1.2
  - ISAKMP
  - RTCP
  - SSH
  - SSL
  - TLS
  - WireGuard

- Non-Encrypted:
  - DNS
  - HTTP
  - MP4
  - NTP
  - OCSP
  - SMTP
  
- Uncategorized:
  - CLASSIC-STUN
  - HART_IP
  - HCrt
  - IMAP
  - STUN
  - XMPP

Based on this we get the following encryption distribution:

In [None]:
filter_protocols = ['ESP', 'GQUIC', 'H1', 'LWAPP', 'MPTCP', 'QUIC', 'TCP', 'THRIFT', 'TURN CHANNEL', 'UDP', 'UDPENCAP']
encrypted_application_protocols = ['DTLSv1.2', 'ISAKMP', 'RTCP', 'SSHv2', 'SSH', 'SSL', 'SSLv2', 'SSLv3', 'TLSv1', 'TLSv1.2', 'TLSv1.3', 'WireGuard']
non_encrypted_application_protocols = ['DNS', 'HTTP', 'HTTP/JSON', 'HTTP/JSON/XML', 'HTTP/XML', 'MP4', 'NTP', 'OCSP', 'SMTP']
uncategorized_application_protocols = ['CLASSIC-STUN', 'HART_IP', 'HCrt', 'IMAP', 'STUN', 'XMPP/XML']

only_application_protocols = df.loc[~df['Protocol'].isin(filter_protocols)]
only_encrypted_protocols = only_application_protocols.loc[only_application_protocols['Protocol'].isin(encrypted_application_protocols)]
only_non_encrypted_protocols = only_application_protocols.loc[only_application_protocols['Protocol'].isin(non_encrypted_application_protocols)]
only_uncategorized_protocols = only_application_protocols.loc[only_application_protocols['Protocol'].isin(uncategorized_application_protocols)]

content_length_encrypted = only_encrypted_protocols['Length'].sum()
content_length_unencrypted = only_non_encrypted_protocols['Length'].sum()
content_length_uncategorized = only_uncategorized_protocols['Length'].sum()

labels = 'Encrypted', 'Non-Encrypted', 'Uncategorized'
fig, ax = plt.subplots(figsize=(10,10))
plt.title('Encryption Distribution')
ax.pie([content_length_encrypted, content_length_unencrypted, content_length_uncategorized], explode=(0.1, 0, 0), labels=labels, autopct='%1.1f%%',)
fig.show()

The plot shows that the encrypted packets only take up a small portion of the traffic. However, this number should not be taken as an exact value but as a lower bound, since packets labelled only as UDP or TCP make up most of the traffic and could be encrypted as well. Since we don't have the content of these packets, we cannot investigate their encryption any further, but we suspect that the actual share of encrypted traffic is a lot higher.
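The categorisation relies on exact protocol labels, so every version suffix (`TLSv1.2`, `TLSv1.3`, `SSHv2`, ...) has to be listed by hand. The sketch below shows a prefix-based variant built from the category lists of this section. It is an illustration under assumptions rather than the code used for the plot: prefix matching may over-match labels that do not occur in our data, and the function names and sample packets are hypothetical.

```python
ENCRYPTED_PREFIXES = ("DTLS", "ISAKMP", "RTCP", "SSH", "SSL", "TLS", "WireGuard")
NON_ENCRYPTED_PREFIXES = ("DNS", "HTTP", "MP4", "NTP", "OCSP", "SMTP")
UNCATEGORIZED_PREFIXES = ("CLASSIC-STUN", "HART_IP", "HCrt", "IMAP", "STUN", "XMPP")


def encryption_category(protocol):
    """Assign a protocol label to one of the categories used in this section."""
    if protocol.startswith(ENCRYPTED_PREFIXES):
        return "Encrypted"
    if protocol.startswith(NON_ENCRYPTED_PREFIXES):
        return "Non-Encrypted"
    if protocol.startswith(UNCATEGORIZED_PREFIXES):
        return "Uncategorized"
    # Transport-level labels such as TCP or UDP say nothing about the payload
    return "Not classified"


def bytes_per_category(packets):
    """Sum the frame lengths of (protocol, length) pairs per category."""
    totals = {}
    for protocol, length in packets:
        category = encryption_category(protocol)
        totals[category] = totals.get(category, 0) + length
    return totals


# Made-up sample packets as (protocol, length) pairs
sample = [("TLSv1.3", 1500), ("HTTP/JSON", 300), ("TCP", 66), ("SSHv2", 200), ("XMPP/XML", 90)]
print(bytes_per_category(sample))
```

The "Not classified" bucket makes the lower-bound character of the result visible, since it shows how much traffic the label-based approach cannot judge.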

## IPv6 vs IPv4

In [None]:
ether_distribution = df.groupby('EtherType').size()

fig, ax = plt.subplots()
plt.title('EtherType Distribution')
ax.pie(ether_distribution, labels=ether_distribution.keys(), autopct='%1.1f%%',)
fig.show()

### Which devices are using IPv6?

In [None]:
ipv6_traffic = df.loc[df['EtherType'].isin(['IPv6'])]

print(ipv6_traffic.groupby(['Source']).size())
# print(ipv6_traffic.groupby(['Source', 'Device Name']).size())

# TODO find out device names for rest of IP's

## DNS 

Every device that is connected to the Internet uses DNS to resolve the domain names of the services it wants to reach. These DNS requests are particularly interesting, since they can reveal a lot about a device's web browsing behavior or the various servers it contacts over time.
In order to get this information, we filter our large data dump by the protocol "DNS".

In [None]:
df_dns = df[df['Protocol'] == 'DNS']
print(df_dns.head(1))

### Used DNS Server

Various DNS resolvers exist on the Internet. We would like to find out which of these our devices are using.
Some devices can be configured to use specific DNS servers, while others have their preferred DNS resolver hardcoded in their firmware. In many cases the resolve request is simply forwarded to the router, which takes care of it.

By grouping our DNS Destinations and counting the requests, we see which resolvers were primarily used by our devices.

In [None]:
df_dns_server = df_dns.loc[~df_dns['Info'].str.contains("response")]
print(df_dns_server.groupby(['Destination']).size().nlargest(15))

We see that most of our DNS requests were sent to our routers. This includes the destinations starting with *192.168.* or *fd00*. Other devices use well-known DNS resolvers, such as Google's *8.8.8.8* or *2001:4860:4860::8888*.

However, we also found three unknown DNS resolvers (*195.234.128.139*, *217.68.162.126* and *2a02:2457:10c:101::126*). A printout of these requests reveals that the Feinstaubsensor IoT device and the Sonos loudspeaker use *195.234.128.139* and *217.68.162.126* to resolve their destinations. These two DNS resolvers are provided by the local ISP "PYUR". \
*2a02:2457:10c:101::126* seems to be an IPv6 DNS resolver that is used by our Amazon Fire TV to resolve its requests.

In [None]:
print(df_dns_server.loc[df_dns_server['Destination'].str.contains("195.234.128.139")].groupby('Device Name').size())
print(df_dns_server.loc[df_dns_server['Destination'].str.contains("217.68.162.126")].groupby('Device Name').size())
print(df_dns_server.loc[df_dns_server['Destination'].str.contains("2a02:2457:10c:101::126")].groupby('Device Name').size())

Another great insight is the information about the domain names that are actually being resolved. This can also be used to track a device's browsing behavior. We count the requests by name to get a better overview. \
Many DNS requests are used internally for Fritzbox-specific services (our router). Others, such as "server.chillibits.com", are contacted regularly by static IoT devices, such as the custom-built air quality sensor. \
Especially for personal computers and mobile devices, browsing behavior can be derived from the DNS requests with this analysis. This includes our love for Reddit, as its endpoints can be seen multiple times in this DNS list. 👨🏽‍🚀

In [None]:
# only look at standard query dns requests
df_dns_requests = df_dns[df_dns['Info'].str.contains('Standard query')]
# exclude responses from count
df_dns_requests = df_dns_requests[~df_dns_requests['Info'].str.contains('response')]
# display URLS
df_dns_urls = df_dns_requests['Info'].apply(lambda x: x.split(' ')[-1]).value_counts()
print('ns-287.awsdns-35.com' in df_dns_urls.keys())
print(df_dns_urls.nlargest(30))
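Taking the last whitespace-separated token of the `Info` column works as long as the queried name is the final token. If a query line carries further tokens after the name, the last token is no longer the name. The sketch below extracts the record type and name with a regular expression instead. It assumes Wireshark-style `Info` strings of the form `Standard query 0x<id> <type> <name> ...`; the pattern, the function name and the sample lines are illustrative assumptions.

```python
import re

# "Standard query", transaction id, record type, queried name
QUERY_PATTERN = re.compile(r"^Standard query 0x[0-9a-fA-F]+ (\S+) (\S+)")


def parse_dns_query(info):
    """Return (record_type, name) for a query line, or None for responses and other lines."""
    match = QUERY_PATTERN.match(info)
    if match is None:
        return None
    return match.group(1), match.group(2)


# Assumed sample lines in the style of the Info column
samples = [
    "Standard query 0x1a2b A www.reddit.com",
    "Standard query 0x00aa AAAA server.chillibits.com OPT",
    "Standard query response 0x1a2b A www.reddit.com A 203.0.113.7",
]
for line in samples:
    print(parse_dns_query(line))
```

Responses are rejected by the pattern itself, because the word `response` appears where the transaction id is expected, so no separate filter for responses is needed.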

### Data traffic over time ###

For the following sections, we form 15-minute timeframes and assign data traffic to those timeframes.

In [None]:
def timeMapping(x):
    # apply time difference UTC+1 to labels
    # (local name avoids shadowing the imported time module)
    label_time = datetime.utcfromtimestamp((first_entry_offset+x)*min_15_duration + 3600)
    if label_time.minute == 0 and label_time.hour == 0:
        return label_time.strftime("%H:%M\n%d.%m")
    if label_time.minute == 0 and label_time.hour % 6 == 0:
        return label_time.strftime("%H:%M")
    else:
        return ""

first_entry = df.iloc[0].at['index-time']
last_entry = df.iloc[-1].at['index-time']

print('from ', df.iloc[0].at['Time'], " to ", df.iloc[-1].at['Time'])

x_values_packets = list(range(first_entry, last_entry + 1))
print('Amount of Intervals',len(x_values_packets))

mapping_res = list(map(timeMapping, x_values_packets))

# flag for data traffic computation task
values_calculated = False

count_packets = np.zeros(last_entry - first_entry + 1)
length_packets = np.zeros(last_entry - first_entry + 1)

df_dict = None


In [None]:
# calculate values for data traffic
if not values_calculated:
    time_start = time.time()
    print('Compute packet count and data traffic over time 2/2...')
    df_traffic_grouped = df.groupby(['index-time']).Length.sum()
    df_packet_grouped = df.groupby(['index-time']).count()

    for timeframe, values in df_packet_grouped.iterrows():
        # print(ip, timeframe, values['Length'])
        # calculate index
        count_packets[timeframe] += values['Length']
    # break
    for timeframe, values in df_traffic_grouped.items():  # Series.iteritems() was removed in pandas 2.0
        length_packets[timeframe] += values
    length_packets = length_packets // 1000
    time_end = time.time()
    print(time_end - time_start)

In [None]:
# print(count_packets)
# print(length_packets)
fig, ax = plt.subplots(figsize=(40, 5))
plt.xlabel('Time')
plt.ylabel('Packets')
plt.title('Packets per timeframe (15 min interval)')
plt.xticks(x_values_packets, mapping_res)
ax.bar(x_values_packets, count_packets, color='black')
fig.show()

fig, ax = plt.subplots(figsize=(40,5))
plt.xlabel('Time')
plt.ylabel('Data [kB]')
plt.title('Data traffic per timeframe (15 min interval)')
plt.xticks(x_values_packets, mapping_res)
ax.bar(x_values_packets, length_packets, color='black')
fig.show()

## Device Evaluation ##

Code for setup:

In [None]:
# define data structure for device metadata
df_device_traffic  = pd.DataFrame() # counts incoming and outgoing data
df_device_packets = pd.DataFrame() # counts incoming and outgoing packets
df_device_dns = pd.DataFrame() # counts dns requests per device

# initialize data structure
for name in devices_labels:
    count_packets_device = np.zeros(2 * (last_entry - first_entry + 1)) # track incoming and outgoing traffic
    length_packets_device = np.zeros(2 * (last_entry - first_entry + 1))
    dns_count = np.zeros(df_dns_urls.size)
    df_device_packets[name] = count_packets_device
    df_device_traffic[name] = length_packets_device
    df_device_dns[name] = dns_count

In [None]:
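# Version 2
# Aggregate incoming and outgoing traffic per device.
# Must be True at least once per session: the cells below use the results computed here.
run_device_analysis = True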

if run_device_analysis:
    print('Process incoming and outgoing traffic per device...')
    # Outgoing traffic: Device sends traffic
    df_device_out = df.loc[(df['Source'].isin(devices_mapping.keys()))]
    df_device_out_packets = df_device_out.groupby(['Source', 'index-time']).count()
    df_device_out_data = df_device_out.groupby(['Source', 'index-time']).Length.sum()
    # Incoming traffic: Device receives packet
    df_device_in = df.loc[(df['Destination'].isin(devices_mapping.keys()))]
    df_device_in_packets = df_device_in.groupby(['Destination', 'index-time']).count()
    df_device_in_data = df_device_in.groupby(['Destination', 'index-time']).Length.sum()

    # filter DNS request packets, device must have sent the request
    df_dns_frames = df.loc[df['Source'].isin(devices_mapping.keys()) & (df['Protocol'] == 'DNS') & (df['Info'].str.contains('Standard query'))]
    df_dns_frames = df_dns_frames.loc[~df_dns_frames['Info'].str.contains('response')]
    # apply map function on all frames
    print('Compute dns request apply function...')
    df_dns_frames['dns-request'] = df_dns_frames['Info'].apply(lambda x: x.split(" ")[-1])

    # group by device
    df_dns_device = df_dns_frames.groupby(['Device Name', 'dns-request']).Length.count().sort_values(ascending=False)

# ip address traffic destinations
df_device_ip_dest = df.loc[(df['Source'].isin(devices_mapping.keys()))]
df_device_ip_dest = df_device_ip_dest.groupby(['Device Name', 'Destination']).Length.count().sort_values(ascending=False)

# ip address traffic sources
df_device_ip_source = df.loc[df['Destination'].isin(devices_mapping.keys())]
df_device_ip_source = df_device_ip_source.groupby(['Device Name', 'Source']).Length.count().sort_values(ascending=False)
# df_device_ip_dest.head(50)
# print(df.loc[df['Destination'].str.startswith("192.168")].groupby('Destination').size())

In [None]:
# transfer processed data to data structures
# Process Outgoing traffic
for (ip, timeframe), values in df_device_out_packets.iterrows():
    # print(ip, timeframe, values['Length'])
    # calculate index
    index = 2 * timeframe + 1
    df_device_packets.loc[index, devices_mapping[ip]] += values['Length']
    # break
for (ip, timeframe), values in df_device_out_data.items():  # Series.iteritems() was removed in pandas 2.0
    index = 2 * timeframe + 1
    df_device_traffic.loc[index, devices_mapping[ip]] += values

# Process Incoming traffic
for (ip, timeframe), values in df_device_in_packets.iterrows():
    # print(ip, timeframe, values['Length'])
    # calculate index
    index = 2 * timeframe
    df_device_packets.loc[index, devices_mapping[ip]] += values['Length']
    # break
for (ip, timeframe), values in df_device_in_data.items():  # Series.iteritems() was removed in pandas 2.0
    index = 2 * timeframe
    df_device_traffic.loc[index, devices_mapping[ip]] += values

# Process DNS requests
dns_keys = list(df_dns_urls.keys())
for (label, address), values in df_dns_device.items():  # Series.iteritems() was removed in pandas 2.0
    index = dns_keys.index(address)
    df_device_dns.loc[index, label] += values
# print(df_dns_frames, type(df_dns_frames))

In [None]:
# save to csv
df_dns_urls.to_csv('df_dns_urls.csv')
df_device_traffic.to_csv('df_device_traffic.csv')
df_device_packets.to_csv('df_device_packets.csv')
df_device_dns.to_csv('df_device_dns.csv')

### Device Communication Endpoints ###

#### Outgoing Traffic (Upload) ####
The largest number of outgoing packets to a single destination comes from the ESP32, which sends one request per second to the MVG endpoint ('188.164.238.26').
Maxi's notebook also sent many outgoing packets. Several of its popular destinations ('52.223.201.182', '52.223.201.100' and '185.42.205.193') resolve to hosts such as 'video-edge-c68130.fra02.no-abs.hls.ttvnw.net', which belong to the livestreaming service Twitch. Some Oracle cloud addresses can be traced back to Zoom (e.g. zoomff130-61-166-170mmr.cloud.zoom.us for 130.61.166.170), and we also captured outgoing traffic to a video streaming CDN operated by Limelight Networks ('178.79.232.14').

The notebook traffic also contains university-related traffic, such as '138.246.224.36' for a university IDP project. '131.159.0.186' refers to the hostname 'rbgse35.in.tum.de', which is responsible for BBB traffic. Other university-related addresses are '129.187.255.213' (the sync and share platform of the LRZ) and '141.40.250.3' (another LRZ address).
Other popular destinations are mostly related to Zoom, Spotify, Amazon and Twitch.

Sven's smartphone shows suspicious activity towards '89.187.169.39' and '185.59.220.199', which is analyzed further in the extra topic section.

#### Incoming Traffic (Download) ####
The device with the highest download traffic was Maxi's notebook, which was used for watching Twitch livestreams ('52.223.201.182' and '52.223.201.100'), communicating with Microsoft ('52.113.63.202') and communicating with a Zoom service hosted in the Oracle cloud ('130.61.166.170').
Sven's notebook also used Zoom, but the server was different ('134.224.101.38').

Maxi's smartphone often communicated with Reddit, so the connection to Fastly's CDN ('199.232.189.140') is no real surprise. Fabian's speaker was also often connected to the Fastly CDN; one explanation would be content streaming from Spotify ('199.232.138.248'). Sven's Macbook was often involved in Zoom calls ('134.224.101.38'). We observed some traffic from Sven's Chromecast to an Amazon server ('54.182.252.136'), which was related to his Amazon Prime video subscription.
Somewhat surprisingly, the most frequent source address for Sven's iPhone, '52.113.47.239', belongs to a Microsoft server rather than to Apple.

In [None]:
df_device_ip_dest.head(25) # output outgoing traffic


In [None]:
df_device_ip_source.head(25) # output popular incoming traffic ip addresses

### Traffic by Device ###


Maxi's notebook produced by far the most traffic, due to heavy media consumption in the form of videos and livestreams. The device also has high upload traffic, caused by this project and by sharing preprocessed data with other group members.

Other devices with a lot of download traffic were the smartphones, the Macbook, the Chromecast and the ESP32 device for the MVG departures. Interestingly, although the numbers of web requests and web responses are almost equal, the responses are much larger.
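
The size difference between requests and responses can be made explicit by dividing the byte totals by the packet counts. The following is a minimal sketch with made-up numbers: `traffic` and `packets` are hypothetical stand-ins for `df_device_traffic` and `df_device_packets`, and it assumes that both use the interleaved layout of this notebook (even rows incoming, odd rows outgoing).

```python
import pandas as pd

# Toy stand-ins for df_device_traffic (bytes) and df_device_packets (counts).
# Assumed layout: even rows = incoming, odd rows = outgoing, one column per device.
traffic = pd.DataFrame({"Notebook": [150000, 6000, 300000, 9000]})
packets = pd.DataFrame({"Notebook": [100, 100, 200, 150]})


def mean_packet_size(traffic, packets):
    """Return the average packet size in bytes per device and direction."""
    down = traffic.iloc[::2].sum() / packets.iloc[::2].sum()
    up = traffic.iloc[1::2].sum() / packets.iloc[1::2].sum()
    return pd.DataFrame({"download": down, "upload": up})


sizes = mean_packet_size(traffic, packets)
print(sizes)
```

Devices without any packets in one direction end up with `NaN` in that column, which makes them easy to spot.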

In [None]:
# for each device, plot a bar that shows the data traffic (incoming and outgoing) and amount of packets
# aggregate data
df_device_in_total = (df_device_traffic.iloc[::2].agg(['sum']) // 1000).transpose()  # even rows, incoming traffic
df_device_out_total = (df_device_traffic.iloc[1::2].agg(['sum']) // 1000).transpose()  # odd rows, outgoing traffic

df_device_in_total_num = df_device_in_total['sum'].to_numpy()
df_device_out_total_num = df_device_out_total['sum'].to_numpy()

df_device_in_packets_total = (df_device_packets.iloc[::2].agg(['sum'])).transpose()# even rows, incoming traffic
df_device_out_packets_total = (df_device_packets.iloc[1::2].agg(['sum'])).transpose()# odd rows, outgoing traffic
df_device_in_packets_total_num = df_device_in_packets_total['sum'].to_numpy()
df_device_out_packets_total_num = df_device_out_packets_total['sum'].to_numpy()


fig, ax = plt.subplots(figsize=(15,5))
plt.xlabel('Devices')
plt.ylabel('Data [kB]')
plt.title('Data traffic per Device')
plot_x_devices_traffic = np.arange(len(devices_labels))
plt.xticks(plot_x_devices_traffic, devices_labels, rotation=60, ha="right")
ax.bar(plot_x_devices_traffic, df_device_in_total_num, width=0.4, label="Download", color='green')
ax.bar(plot_x_devices_traffic + 0.4, df_device_out_total_num, width=0.4, label="Upload", color='red')
ax.legend()
fig.show()

# plot packets

fig, ax = plt.subplots(figsize=(15,5))
plt.xlabel('Devices')
plt.ylabel('Packets')
plt.title('Packets per Device')
plot_x_devices_traffic = np.arange(len(devices_labels))
plt.xticks(plot_x_devices_traffic, devices_labels, rotation=60, ha="right")
ax.bar(plot_x_devices_traffic, df_device_in_packets_total_num, width=0.4, label="Download", color='green')
ax.bar(plot_x_devices_traffic + 0.4, df_device_out_packets_total_num, width=0.4, label="Upload", color='red')
ax.legend()
fig.show()

### Device Activity ###

The following plot shows in which timeframes packets from or to a device were captured. A single captured packet in a timeframe is sufficient to mark the device as "active". The plot does not show user activity; it visualizes when the device was connected to the network and whether it exchanged data with the web.

For Fabian's measurements, the capture stopped approximately 12 hours earlier than the others, so the data for the related devices is incomplete. Another issue was the connection behaviour of some devices: they connected to a neighbouring Wifi network, so the Raspberry Pi could not capture their traffic in certain timeframes (this applies e.g. to iPads and some smartphones).

The first group of devices was always online; some of them sent data even during standby phases. Examples are the Fire TV stick, the ESP32 device with the Feinstaubsensor (particulate matter sensor) firmware, the SONOS speaker and the ESP32 with the MVG requests. The ESP32 devices were configured by the user to continuously send requests to their endpoints; the more suspicious devices are the speaker and the TV stick, which issue web requests in short intervals. The wifi smart plug was also regularly online, but its intervals between web requests were larger than 15 minutes.

The second group of devices only sent data while switched on; smartphones, notebooks and the Smart TV belong to this group. In the diagrams below, the intervals with an active connection to the Raspberry Pi can be clearly identified. In some cases (e.g. Maxi's smartphone), the device kept a permanent connection during the night and still communicated with the Internet although the user was sleeping.

One interesting device is the vacuum cleaner, which was often active for around 24 hours per session although it effectively only worked for a few hours. The Google Home device was also connected for longer periods until it disconnected from the network for no apparent reason.
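
The "active" definition above can be illustrated on a toy table. This is only a sketch under the assumption that the packet table is interleaved as in this notebook (row `2*t` incoming, row `2*t + 1` outgoing for timeframe `t`); the device names and counts in `toy_packets` are made up.

```python
import numpy as np
import pandas as pd

# Toy packet table in the assumed interleaved layout:
# row 2*t holds incoming packets, row 2*t + 1 outgoing packets of timeframe t.
toy_packets = pd.DataFrame(
    {"Speaker": [3, 1, 0, 0, 0, 2], "Notebook": [0, 0, 0, 0, 5, 7]}
)

incoming = toy_packets.iloc[::2].to_numpy()
outgoing = toy_packets.iloc[1::2].to_numpy()

# A device counts as active if it has at least one packet in either direction.
active = ((incoming + outgoing) > 0).astype(int).T  # shape: (devices, timeframes)
print(active)
```

Each row of `active` is one device and each column one timeframe, which is the shape needed for one activity subplot per device.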

In [None]:
# plot activity
# activity = at least 1 packet sent/received within a timeframe
packets_1 = df_device_packets.iloc[::2].to_numpy()
packets_2 = df_device_packets.iloc[1::2].to_numpy()

combined = np.minimum(np.add(packets_1, packets_2), 1).transpose()
# print(df_device_packets.iloc[::2])
print(combined.shape)
plot_X_activity = np.arange(combined.shape[1])


# df_device_packets.iloc[1::2]
fig, ax = plt.subplots(nrows=combined.shape[0], ncols=1, figsize=[20, 40])
plt.setp(ax, xticks=x_values_packets, xticklabels=mapping_res,
        yticks=[0, 1],xlabel='Time', ylabel="Activity", )

for index in range(combined.shape[0]):
    # plt.subplot(combined.shape[0], 1, index+1)
    if index % 2 == 0:
        ax[index].plot(x_values_packets, combined[index], label=devices_labels[index], color="red")
    else:
        ax[index].plot(x_values_packets, combined[index], label=devices_labels[index], color="blue")
    ax[index].legend()


fig.show()

### Night Behaviour and Application Behaviour ###
- Analyze night behaviour (10 pm - 6 am) (low activity, find destination IP addresses, country, ISP, protocols)
- Application activity (e.g. Google, Twitch, MVG, etc.)

We wanted to know how devices behaved at night without user interaction. Therefore, we decided to only use data captured between 10 pm and 6 am for the first section.
In the second section, we want to find out more about application behaviour, e.g. when any Google application is active. To do so, we assign a set of IP addresses to each larger application (e.g. by resolving the collected DNS requests) and plot the activity of the applications.
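
A minimal sketch of the application mapping idea, not the final analysis: the Twitch and Zoom addresses are taken from the endpoint analysis above, while the dictionary `app_addresses`, the toy `frames` table and the use of `index-time` as the 15-minute timeframe index are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mapping; the addresses come from the endpoint analysis,
# the grouping into applications is an assumption.
app_addresses = {
    "Twitch": {"52.223.201.182", "52.223.201.100"},
    "Zoom": {"130.61.166.170", "134.224.101.38"},
}
# Invert the mapping so that each address points to its application.
address_to_app = {ip: app for app, ips in app_addresses.items() for ip in ips}

# Toy frames; 'index-time' is assumed to be the 15-minute timeframe index.
frames = pd.DataFrame({
    "Destination": ["52.223.201.182", "130.61.166.170", "52.223.201.100", "192.0.2.1"],
    "index-time": [0, 0, 1, 1],
})
frames["Application"] = frames["Destination"].map(address_to_app)

# Packets per application and timeframe; unmapped addresses (NaN) are dropped.
app_activity = (
    frames.dropna(subset=["Application"])
    .groupby(["Application", "index-time"])
    .size()
    .unstack(fill_value=0)
)
print(app_activity)
```

The resulting table has one row per application and one column per timeframe, so it can be plotted in the same way as the device activity.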

In [None]:
def isNighttime(timeframe):
    # reconstruct the timestamp that belongs to the timeframe index
    timestamp = (first_entry_offset + timeframe) * min_15_duration
    hour = datetime.fromtimestamp(timestamp).hour
    #if hour >= 21 or hour < 8: 
    #    print(datetime.fromtimestamp(timestamp), hour <= 5 or hour >= 22)
    # night is defined as 10 pm to 6 am
    return hour <= 5 or hour >= 22

# use a separate name so that the function is not overwritten when the cell is re-run
nighttime_flags = list(map(isNighttime, x_values_packets))
fig, ax = plt.subplots(figsize=(40,5))
plt.xlabel('Time')
plt.ylabel('Nighttime')
plt.title('Nighttime per timeframe (15 min interval)')
plt.xticks(x_values_packets, mapping_res)
ax.plot(x_values_packets, nighttime_flags, color='black')
fig.show()

In [None]:
# fraud source
print(df.loc[(df['Source'].isin(['89.187.169.15']))])

## Find the Mule

During our network traffic analysis we came across some suspicious outliers. These included
- SWM Services GmbH being among our top contacted servers
- an ISP called "Datacamp Limited", which was the top contact of an Android phone
- several industrial protocols that are used to send large packets of data to the web
- lots of traffic from a Xiaomi "Smart Vacuum Robot"

In this section we will investigate these phenomena more closely and try to understand what is happening in our networks.

### Forgotten Subway Departure Display
We found our first suspicious case during our source and destination analysis. There, we noticed a server belonging to "SWM Services GmbH". Let's see what is going on and why it is among the top contacted servers across all our devices.

First, we need to know which of our devices is contacting this server.

In [None]:
swm = df.loc[df['Destination'].isin(['188.164.238.26'])] 
print(swm.groupby(['Device Name']).size())

The only device contacting SWM is our ESP32 microcontroller. It is used to show live subway departure times on a small display. Hence, it makes sense that it regularly contacts SWM Services GmbH to get the data from their server.

However, the number of requests suggests a misconfiguration. Printing the timestamps of each sent and received frame shows that around 25 packets are sent from and to SWM within 10 seconds. During our 5 day data capture, this summed up to a total of 1057066 packets.

This device has been running 24/7 for several years now. We are quite sure that SWM did not implement a rate limit on their public API endpoint; otherwise, this excessive API spamming would not have been possible.

In [None]:
print(swm.Time.head(25))
print("Total number of packets: {}".format(len(swm)))
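
The packet rate can also be estimated directly from the timestamps. The sketch below uses made-up timestamps in seconds; the format of the real `Time` column is not shown here, so it may need to be converted first. The helper `packets_per_second` is hypothetical. The last line cross-checks the rate against the totals reported above (1057066 packets in 5 days).

```python
import pandas as pd

# Toy timestamps in seconds; the real 'Time' column may use another format.
times = pd.Series([0.0, 0.4, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8, 3.2, 3.6, 4.0])


def packets_per_second(times):
    """Average packet rate over the span covered by the timestamps."""
    span = times.max() - times.min()
    return (len(times) - 1) / span if span > 0 else float("nan")


rate = packets_per_second(times)
print(f"{rate:.2f} packets/s")

# Cross-check with the reported totals: 1057066 packets in 5 days.
capture_rate = 1057066 / (5 * 24 * 3600)
print(f"{capture_rate:.2f} packets/s")
```

The capture-wide average of roughly 2.4 packets per second agrees with the observation of around 25 packets within 10 seconds.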

## myLoc managed IT AG
Anime website

## Suspicious H1 protocol sends large amount of data to cloud

### Roborock Vacuum Robot

In [None]:
vacuum_traffic = df.loc[df['Device Name'].isin(['Vacuum Robot (Sven)'])].groupby('Destination').size()
print(vacuum_traffic)

In [None]:
# get data about destination servers
rows = []
for address, count in vacuum_traffic.items():  # Series.iteritems() was removed in pandas 2.0
    if address.startswith('192.168'):
        continue
    headers = { 'User-Agent': "keycdn-tools:https://www.example.com" }
    url = "https://tools.keycdn.com/geo.json?host={}".format(address)
    json_response = requests.get(url, headers=headers).json()
    geo = json_response['data']['geo']
    rows.append([geo['ip'], geo['country_code'], geo['longitude'], geo['latitude'], geo['isp']])

# as dataframe
df_coord = pd.DataFrame(rows, columns=["ip", "country", "lng", "lat", "isp"])
print(df_coord)

# plot on worldmap
g_world = gpd.GeoDataFrame(df_coord, geometry=gpd.points_from_xy(df_coord.lng, df_coord.lat))

# note: gpd.datasets was removed in GeoPandas 1.0, so this call needs an older version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
base = world.plot(color='white', edgecolor='black')
g_world.plot(ax=base, marker='o', color='red', markersize=5)
plt.show()

#### Mitmproxy analysis


GET https://api-eu.roborock.com/user/homes/712040/rooms

[{"id":4015991,"name":"Bedroom 2"},{"id":768374,"name":"Office"},{"id":768370,"name":"Bathroom"},{"id":768366,"name":"Bedroom"},{"id":768364,"name":"Floor 2"},{"id":768360,"name":"Living Room"},{"id":768356,"name":"Floor"},{"id":768352,"name":"Kitchen"}]

We couldn't identify any security issues with the Roborock API. 

### Datacamp Limited

https://scamalytics.com/ip/isp/datacamp-limited

https://www.dnb.com/business-directory/company-profiles.datacamp_limited.914baba433b38b62196ceb3cc013b06f.html

Two employees generating 50 million USD, which seems odd.

MangaMelon is behind the DNS requests to the Datacamp servers.

Connections to bunny.net, which delivers the mangas.

The founder has 100 million euros:
https://translate.google.com/website?tl=de&client=webapp&u=https://cs.m.wikipedia.org/wiki/Zden%25C4%259Bk_Cendra&sl=cs

All traffic is encrypted with TLS.

TODO: plot how often requests are made

The app was removed from the Play Store.

In [None]:
datacamp_traffic = df.loc[df['Destination'].isin(['89.187.169.15', '89.187.169.39', '185.59.220.199', '185.59.220.198', '185.59.220.194', '185.59.220.193'])].groupby('Device Name').size()
print(datacamp_traffic)
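
How often the Datacamp servers are contacted could be derived along the following lines. This is a sketch on toy data, not a result: the address list is the one used above, while the `frames` table and the use of `index-time` as the 15-minute timeframe index are assumptions.

```python
import pandas as pd

datacamp_ips = ["89.187.169.15", "89.187.169.39", "185.59.220.199",
                "185.59.220.198", "185.59.220.194", "185.59.220.193"]

# Toy frames; 'index-time' is assumed to be the 15-minute timeframe index.
frames = pd.DataFrame({
    "Destination": ["89.187.169.39", "89.187.169.39", "185.59.220.199", "192.0.2.1"],
    "index-time": [0, 2, 2, 2],
})

hits = frames.loc[frames["Destination"].isin(datacamp_ips)]
# Requests per timeframe; reindex so that quiet timeframes show up as 0.
per_timeframe = (
    hits.groupby("index-time").size()
    .reindex(range(frames["index-time"].max() + 1), fill_value=0)
)
print(per_timeframe)
```

The resulting series can be plotted with `per_timeframe.plot(kind="bar")` to see whether the requests are periodic or tied to app usage.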

# Final Thoughts

These are our final thoughts before dying due to Covid-19