# Basic Analysis of Network Traffic Traces

In this laboratory, we will explore the basics of network traffic capture. 

## Learning Objectives

By the end of this lab, you should understand the following:

* How to capture a network traffic trace.
* What the following fields in a trace mean: (1) IP address; (2) MAC address; (3) length; (4) DNS queries and responses.

## Setup

Before we get started, you will need to install a tool to generate packet captures. There are also some example pcaps in the `pcaps` directory of this repository, but it is worth becoming familiar with capturing network traffic yourself.

**Wireshark** The fundamental data that we will use for analysis, in this laboratory and others, is a _network packet trace_, sometimes called a "pcap".  [Wireshark](https://wireshark.org/) is a tool that we can use to capture and analyze network traffic data from the devices on a network. 

### Warmup: Basic Wireshark Analysis

First, use Wireshark to collect a packet trace. Save the trace as a regular pcap (not pcapng) somewhere on your local machine. Note the location where you saved the file, as we will load that file into the notebook later.

Using Wireshark, answer the following questions:
* How many packets are in the trace?
* What is the total volume of traffic in the trace?

These are fairly straightforward questions that Wireshark itself can easily answer. Doing more complicated analysis (and eventually machine learning) requires more sophisticated processing. For that, in this course, we will rely on Python, pandas, and scikit-learn.
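
The two warmup numbers can also be cross-checked outside Wireshark. The sketch below walks a classic pcap file by hand, assuming the classic layout of a 24-byte global header followed by a 16-byte header in front of each packet. `summarize_pcap` is a hypothetical helper written for this illustration, not part of this repository, and it does not handle pcapng, which is one reason to save the trace in the classic format.

```python
import struct

# Magic numbers of the classic pcap format (microsecond and nanosecond variants).
PCAP_MAGICS = {0xA1B2C3D4, 0xA1B23C4D}


def summarize_pcap(data: bytes):
    """Return (packet_count, total_bytes_on_wire) for classic pcap bytes."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number also reveals the byte order used by the file.
    for endian in ("<", ">"):
        (magic,) = struct.unpack(endian + "I", data[:4])
        if magic in PCAP_MAGICS:
            break
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")

    offset = 24  # skip the global header
    count = 0
    total = 0
    while offset + 16 <= len(data):
        # Per-packet header: ts_sec, ts_frac, captured length, original length.
        _, _, incl_len, orig_len = struct.unpack_from(endian + "IIII", data, offset)
        count += 1
        total += orig_len          # bytes on the wire, even if the capture truncated them
        offset += 16 + incl_len    # only the captured bytes are stored in the file
    return count, total


# Build a tiny two-packet capture in memory to exercise the function.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
pkt1 = struct.pack("<IIII", 0, 0, 4, 4) + b"\x00" * 4
pkt2 = struct.pack("<IIII", 1, 0, 6, 60) + b"\x00" * 6  # 60-byte packet, 6 bytes captured
print(summarize_pcap(header + pkt1 + pkt2))  # (2, 64)
```

For a file on disk, the same function could be called as `summarize_pcap(open(path, "rb").read())`, where `path` is wherever you saved your trace.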

## Analyzing Packet Captures in Python

We will now load the packet capture you generated into Python, specifically into a data analysis library called pandas, which will allow us to ask more complex questions.

In [2]:
import pandas as pd
from datetime import datetime, timezone

# Allow us to load modules from the parent directory
import sys
sys.path.append("../lib") 
from parse_pcap import pcap_to_pandas, send_rates

# Insert your own packet capture here.

pcap = pcap_to_pandas('../pcaps/uchicagocs-web-20210329.pcap')

# look at the first n rows of the packet capture
pcap.head(10)

,datetime,dns_query,dns_resp,ip_dst,ip_dst_int,ip_src,ip_src_int,is_dns,length,mac_dst,mac_dst_int,mac_src,mac_src_int,port_dst,port_src,protocol,time,time_normed
0,2021-03-29 09:01:54,,,128.135.164.125,2156373117,192.168.1.43,3232235819,False,78,74:ac:b9:a6:47:c9,128285197879241,3c:15:c2:d9:d3:50,66064161035088,443,51226,TCP,1617026514.504975,0.0
1,2021-03-29 09:01:54,,,192.168.1.43,3232235819,128.135.164.125,2156373117,False,66,3c:15:c2:d9:d3:50,66064161035088,74:ac:b9:a6:47:c9,128285197879241,51226,443,TCP,1617026514.520897,0.015922
2,2021-03-29 09:01:54,,,128.135.164.125,2156373117,192.168.1.43,3232235819,False,54,74:ac:b9:a6:47:c9,128285197879241,3c:15:c2:d9:d3:50,66064161035088,443,51226,TCP,1617026514.520996,0.016021
3,2021-03-29 09:01:54,,,128.135.164.125,2156373117,192.168.1.43,3232235819,False,571,74:ac:b9:a6:47:c9,128285197879241,3c:15:c2:d9:d3:50,66064161035088,443,51226,TCP,1617026514.521269,0.016294
4,2021-03-29 09:01:54,,,192.168.1.43,3232235819,128.135.164.125,2156373117,False,60,3c:15:c2:d9:d3:50,66064161035088,74:ac:b9:a6:47:c9,128285197879241,51226,443,TCP,1617026514.540108,0.035133
5,2021-03-29 09:01:54,,,192.168.1.43,3232235819,128.135.164.125,2156373117,False,1514,3c:15:c2:d9:d3:50,66064161035088,74:ac:b9:a6:47:c9,128285197879241,51226,443,TCP,1617026514.540112,0.035137
6,2021-03-29 09:01:54,,,192.168.1.43,3232235819,128.135.164.125,2156373117,False,1514,3c:15:c2:d9:d3:50,66064161035088,74:ac:b9:a6:47:c9,128285197879241,51226,443,TCP,1617026514.540113,0.035138
7,2021-03-29 09:01:54,,,192.168.1.43,3232235819,128.135.164.125,2156373117,False,1230,3c:15:c2:d9:d3:50,66064161035088,74:ac:b9:a6:47:c9,128285197879241,51226,443,TCP,1617026514.540114,0.035139
8,2021-03-29 09:01:54,,,128.135.164.125,2156373117,192.168.1.43,3232235819,False,54,74:ac:b9:a6:47:c9,128285197879241,3c:15:c2:d9:d3:50,66064161035088,443,51226,TCP,1617026514.540194,0.035219
9,2021-03-29 09:01:54,,,128.135.164.125,2156373117,192.168.1.43,3232235819,False,54,74:ac:b9:a6:47:c9,128285197879241,3c:15:c2:d9:d3:50,66064161035088,443,51226,TCP,1617026514.540194,0.035219
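
The `ip_*_int` and `mac_*_int` columns in the output above appear to be the same addresses read as plain integers (32 bits for an IPv4 address, 48 bits for a MAC address). The sketch below reproduces the values from the first row using only the standard library. The helper names `ip_to_int` and `mac_to_int` are hypothetical, and how `parse_pcap` actually computes these columns is an assumption.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Interpret a dotted-quad IPv4 address as a 32-bit integer."""
    return int(ipaddress.ip_address(ip))


def mac_to_int(mac: str) -> int:
    """Interpret a colon-separated MAC address as a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


# Values taken from the first row of the table above.
print(ip_to_int("192.168.1.43"))        # 3232235819
print(mac_to_int("3c:15:c2:d9:d3:50"))  # 66064161035088
```

Integer forms like these are convenient when a later step (for example, a machine learning model) needs numeric inputs rather than strings.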


### Basic Dataframe Statistics

You can use the `shape` attribute to discover how many rows and columns exist in your dataset and the `columns` attribute to get the column headers.

In [3]:
print('{}\n\n'.format(pcap.shape))
print(pcap.columns)

(227, 18)


Index(['datetime', 'dns_query', 'dns_resp', 'ip_dst', 'ip_dst_int', 'ip_src',
       'ip_src_int', 'is_dns', 'length', 'mac_dst', 'mac_dst_int', 'mac_src',
       'mac_src_int', 'port_dst', 'port_src', 'protocol', 'time',
       'time_normed'],
      dtype='object')
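
As a small self-contained illustration, the sketch below reads the same two properties from a toy dataframe. The frame `toy` and its values are made up for this example and merely stand in for a packet capture.

```python
import pandas as pd

# A toy stand-in for the packet dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "ip_src": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "length": [60, 1514, 54],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple, not a method call
print(n_rows, n_cols)        # 3 2
print(list(toy.columns))     # ['ip_src', 'length']
print(len(toy))              # 3, the number of rows (one per packet)
```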


### Slicing and Sub-Selecting Data

Pandas allows the use of slicing to subselect columns. Let's use that function to cut down our list of columns to some columns on which we want to do further analysis.

In [4]:
pcap = pcap.loc[:,['datetime','ip_src','ip_dst',
                   'length','port_src','port_dst','protocol']]
pcap.head(10)

,datetime,ip_src,ip_dst,length,port_src,port_dst,protocol
0,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,78,51226,443,TCP
1,2021-03-29 09:01:54,128.135.164.125,192.168.1.43,66,443,51226,TCP
2,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
3,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,571,51226,443,TCP
4,2021-03-29 09:01:54,128.135.164.125,192.168.1.43,60,443,51226,TCP
5,2021-03-29 09:01:54,128.135.164.125,192.168.1.43,1514,443,51226,TCP
6,2021-03-29 09:01:54,128.135.164.125,192.168.1.43,1514,443,51226,TCP
7,2021-03-29 09:01:54,128.135.164.125,192.168.1.43,1230,443,51226,TCP
8,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
9,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP


### Conditional Slicing

You can slice a dataframe based on conditionals.  Here we select only the rows whose source IP address corresponds to a certain value.

In [5]:
pcap[pcap['ip_src'] == '192.168.1.43'].head(10)

,datetime,ip_src,ip_dst,length,port_src,port_dst,protocol
0,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,78,51226,443,TCP
2,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
3,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,571,51226,443,TCP
8,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
9,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
10,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
13,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,54,51226,443,TCP
14,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,180,51226,443,TCP
15,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,1514,51226,443,TCP
16,2021-03-29 09:01:54,192.168.1.43,128.135.164.125,217,51226,443,TCP
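
Conditions can also be combined. The sketch below, on a made-up dataframe `toy` with hypothetical addresses, selects packets that match two conditions at once. Each comparison must be wrapped in parentheses because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Toy packets (made-up values) exchanged between a client and a server.
toy = pd.DataFrame({
    "ip_src": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.2"],
    "ip_dst": ["10.0.0.2", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "length": [78, 1514, 54, 1230],
})

# Each comparison yields a boolean Series; combine them with & (and) or | (or).
from_server = (toy["ip_src"] == "10.0.0.2") & (toy["ip_dst"] == "10.0.0.1")
large = toy["length"] > 1000

print(toy[from_server & large])               # rows 1 and 3
print(toy.loc[from_server, "length"].sum())   # 2744
```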


### Further Analysis Questions

You could ask some follow up questions about the web download above:
* How many total bytes were exchanged in this web download (in both directions)?
* How many total bytes went from the web server to the client device (e.g., the web browser), i.e., the "download"?
* How long did the total download take?
* What is the maximum packet size (length)? What is the average packet size?

You may find the pandas documentation and examples helpful (e.g., [groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)).

In [None]:
# Total bytes exchanged.
# HINT: use the pandas "groupby" function

# Total bytes downloaded.
# HINT: extend the first part by first using a conditional slice to only select the download packets.
# (download would be ip_dst equal to the IP address of the client, ip_src equal to the IP address of the server)

# Total time.
# HINT: Time of last row minus time of first row.

# Max/average length.
# HINT: Apply max, mean pandas functions to the 'length' column of the pcap dataframe.
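
To show the shape of these operations without answering the questions for your own trace, here is a minimal sketch on a made-up dataframe. The frame `toy`, its addresses, and its values are hypothetical. It also assumes a numeric `time` column in seconds, like the one in the full dataframe loaded earlier; the reduced dataframe from the slicing step keeps only `datetime`.

```python
import pandas as pd

# Made-up packets: timestamps in seconds, sender address, size in bytes.
toy = pd.DataFrame({
    "time": [100.00, 100.02, 100.05, 100.30],
    "ip_src": ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.1"],
    "length": [78, 1514, 1230, 54],
})

total_bytes = toy["length"].sum()                       # both directions combined
bytes_by_src = toy.groupby("ip_src")["length"].sum()    # one total per sender
duration = toy["time"].iloc[-1] - toy["time"].iloc[0]   # last minus first timestamp
max_len = toy["length"].max()
mean_len = toy["length"].mean()

print(total_bytes)    # 2876
print(bytes_by_src)
print(max_len, mean_len)
```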


---
## Basic Analysis of Traffic Using Pandas

Try some of the examples below using a trace that has multiple destination IP addresses (e.g., all of your web traffic).

### List of Unique Destination IP Addresses

What are the unique destinations that our network is communicating with?  We can use the `unique` function to retrieve those.

In [6]:
unique_dst_ip = pd.DataFrame(pcap['ip_dst'].unique())[0]
print(unique_dst_ip)

0    128.135.164.125
1       192.168.1.43
Name: 0, dtype: object


### Most Popular Destination IP Addresses

We can combine `groupby`, `sum`, and `sort_values` to determine the most popular destination IP addresses, measured here by the total bytes sent to each one.

In [7]:
# Keep only the columns we need; summing the non-numeric 'datetime' column is not meaningful.
pkts_dst = pcap.loc[:,['ip_dst','length']]
pkts_dst.groupby(['ip_dst']).sum().sort_values(by='length',ascending=False)

ip_dst,length
192.168.1.43,105411
128.135.164.125,21009
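
"Most popular" can mean most bytes or most packets, and the two rankings can disagree. The sketch below, on a made-up dataframe `toy`, computes both per destination in one pass using named aggregation. The column names `packets` and `total_bytes` are chosen for this illustration.

```python
import pandas as pd

# Made-up traffic: many small packets to one host, one large packet to another.
toy = pd.DataFrame({
    "ip_dst": ["10.0.0.2", "10.0.0.2", "10.0.0.2", "10.0.0.3"],
    "length": [60, 60, 60, 1514],
})

# agg computes several statistics per group; keyword names become column names.
stats = (
    toy.groupby("ip_dst")["length"]
       .agg(packets="count", total_bytes="sum")
       .sort_values(by="total_bytes", ascending=False)
)
print(stats)
```

Here `10.0.0.3` ranks first by bytes even though `10.0.0.2` received three times as many packets.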


Next, define a reverse lookup function that maps an IP address to its DNS name, so that destinations are easier to recognize.

In [10]:
from dns import resolver
from dns import reversename

# test reverse DNS lookup
addr = reversename.from_address('128.135.164.125')
print(resolver.query(addr, "PTR")[0])

hnd.cs.uchicago.edu.


In [11]:
# test reverse DNS lookup
addr = reversename.from_address('204.80.104.218')
print(resolver.query(addr, "PTR")[0])

zoomnye218mmr.ny.zoom.us.


In [12]:
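def reverse_lookup(ip):
    """Return the PTR (reverse DNS) name for an IP address, or 'N/A' on failure."""
    # Rows without an IP address have nothing to look up.
    if str(ip) == 'None':
        return 'None'
    try:
        addr = reversename.from_address(ip)
        # In dnspython 2.x, resolver.resolve() is the preferred name for query().
        return str(resolver.query(addr, "PTR")[0])
    except Exception:
        # A missing PTR record, a timeout, or a malformed address all map to 'N/A'.
        return 'N/A'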

### Apply a Function to an Entire Dataframe

Use the pandas `apply` function to create a new column with the DNS names associated with each destination. 

Then look at the unique destination names in the trace.

In [13]:
pcap['name_dst'] = pcap['ip_dst'].apply(reverse_lookup)
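
Because `apply` calls the function once per row, a trace with many packets to the same address repeats the same DNS lookup many times. One possible alternative, sketched below, resolves each distinct address once and maps the results back onto the rows. `fake_lookup` is a hypothetical stand-in for `reverse_lookup` so that the example runs without network access, and the addresses and names are made up.

```python
import pandas as pd

calls = []  # records every address the stand-in resolver is asked about


def fake_lookup(ip):
    """Stand-in for reverse_lookup so the example runs without a network."""
    calls.append(ip)
    return {"10.0.0.2": "server.example."}.get(ip, "N/A")


toy = pd.DataFrame({"ip_dst": ["10.0.0.2", "10.0.0.9", "10.0.0.2", "10.0.0.2"]})

# Resolve each distinct address once, then map the results back onto every row.
names = {ip: fake_lookup(ip) for ip in toy["ip_dst"].unique()}
toy["name_dst"] = toy["ip_dst"].map(names)

print(len(calls))                # 2 lookups for 4 rows
print(toy["name_dst"].tolist())
```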

In [14]:
unique_dst_name = pd.DataFrame(pcap['name_dst'].unique())[0]
print(unique_dst_name)

0    hnd.cs.uchicago.edu.
1                     N/A
Name: 0, dtype: object


## Functions

It is often useful to encapsulate functionality in functions so that we can use those functions again.

Write functions to count ("sum") the length field so that we can know how much total traffic in bytes is sent to each destination, either by IP address or by name.

In [15]:
def volume_stats_by_ip(pcap):
    return pcap.loc[:,['ip_dst','length']].groupby('ip_dst').sum().sort_values(by=['length'], ascending=False)


def volume_stats_by_name(pcap):
    return pcap.loc[:,['name_dst','length']].groupby('name_dst').sum().sort_values(by=['length'], ascending=False)

In [16]:
volume_stats_by_ip(pcap)

ip_dst,length
192.168.1.43,105411
128.135.164.125,21009


In [17]:
volume_stats_by_name(pcap)

name_dst,length
N/A,105411
hnd.cs.uchicago.edu.,21009
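
The two functions above differ only in the column they group by, so they could be folded into one function that takes the column name as a parameter. The sketch below shows one way to do that. `volume_stats` is a hypothetical name, and the dataframe `toy` is made up for illustration.

```python
import pandas as pd


def volume_stats(df, key):
    """Total bytes per distinct value of column `key`, largest first."""
    return (
        df.groupby(key)[["length"]]
          .sum()
          .sort_values(by="length", ascending=False)
    )


# Made-up packets with both an address and a name for each destination.
toy = pd.DataFrame({
    "ip_dst": ["10.0.0.2", "10.0.0.3", "10.0.0.2"],
    "name_dst": ["a.example.", "N/A", "a.example."],
    "length": [100, 500, 50],
})

by_ip = volume_stats(toy, "ip_dst")
by_name = volume_stats(toy, "name_dst")
print(by_ip)
print(by_name)
```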


## Going Further

For homework, define some questions you want to ask about the network traffic trace and write some functions to analyze the trace.

Here are some example questions.  You can pick one of these or define one yourself:
* What is the maximum, median, minimum, and mean packet size?
* How many DNS queries (destination port 53) are there in this trace?
* What is the most popular DNS query in the trace?
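
As a starting point, the sketch below shows pandas operations that fit these questions, run on a made-up dataframe. The frame `toy` and its values are hypothetical, and the assumption that `dns_query` holds the queried name as a string (and is empty for non-DNS packets) should be checked against your own trace.

```python
import pandas as pd

# Made-up packets: three DNS queries (port 53) and two other packets.
toy = pd.DataFrame({
    "length": [74, 90, 1514, 74, 60],
    "port_dst": [53, 50000, 443, 53, 53],
    "dns_query": ["example.com.", None, None, "example.com.", "example.org."],
})

# Several summary statistics of the packet size in one call.
size_stats = toy["length"].agg(["max", "median", "min", "mean"])

# Conditional slice for DNS queries, then count how often each name appears.
dns_queries = toy[toy["port_dst"] == 53]
top_query = dns_queries["dns_query"].value_counts().idxmax()

print(size_stats)
print(len(dns_queries))   # 3
print(top_query)          # example.com.
```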