# Packet Analysis Using Data Science

***

## Getting Started

***

The first step is to download and import a few libraries:
+ [**Scapy**](https://scapy.net) - For packet manipulation.
+ [**Pandas**](https://pandas.pydata.org/) - To help us create and manipulate DataFrames.
+ [**Numpy**](http://www.numpy.org/) - To help us perform complex mathematical functions.
+ [**Binascii**](https://docs.python.org/2/library/binascii.html) - To help us convert between binary data and ASCII-encoded representations such as hex.
+ [**Seaborn**](https://seaborn.pydata.org/) - For some awesome visualization.

We also enable the `%matplotlib inline` magic command, so that any visualization is rendered within the notebook itself.

So let's start by installing Scapy into the notebook.

In [123]:
!pip install scapy

Requirement not upgraded as not directly required: scapy in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages


In [124]:
%matplotlib inline

from scapy.all import *
import pandas as pd
import numpy as np
import binascii
import seaborn as sns

sns.set(color_codes=True)

## Sniffing packets in Scapy

***

Scapy is a tool for powerful, interactive packet manipulation. It allows you to forge or decode packets of a wide number of protocols, send them on the wire, capture them, match requests and replies, and more. Packet capture and analysis is usually done in Wireshark; however, it is hard to keep track of multiple suspicious indicators there while also following multiple connections. Manipulating packets directly in Scapy can get a little rigid, so we transform our packet capture into a Pandas DataFrame.



In [125]:
# num_of_packets_to_sniff = 100
# pcap = sniff(count=num_of_packets_to_sniff)

# print(type(pcap))
# print(len(pcap))
# print(pcap)
# pcap[0]

## Read the packet information from a pcap file

***

> Note: The following steps are for readers using Watson Studio only! 

For this notebook we will read the contents of a packet capture I obtained from a [CTF platform](https://cybertalents.com/challenges/forensics/cypher-anxiety). We can add the data object directly to the code by opening the `Find and add data` panel. Select the pcap file we will be using, and select `Insert to Code`.

In [126]:
# Insert your data object here. 

In [127]:
# The code was removed by Watson Studio for sharing.

In [128]:
# The code was removed by Watson Studio for sharing.

In [129]:
# The code was removed by Watson Studio for sharing.

## Into the Packet Layers

***

Let's take a look at the different layers that are present in a network packet. A packet consists of several encapsulated layers, where the payload of one layer contains the header and payload of the next layer. To get a better understanding of what is going on, we will break a single packet down into its layers.
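
To make the nesting concrete, here is a minimal sketch that peels the layers apart by hand using only the standard library. The frame is built in the example itself from illustrative values (checksums are left at zero), and it assumes an Ethernet / IPv4 / TCP packet with no payload. Scapy does this dissection for us; the sketch only shows what "the payload of one layer is the next layer" means at the byte level.

```python
import socket
import struct

# --- Build a small example frame by hand (illustrative values only) ---
tcp_header = struct.pack(
    "!HHIIBBHHH",
    65198,        # source port
    5050,         # destination port
    1290154816,   # sequence number
    0,            # acknowledgement number
    5 << 4,       # data offset: 5 words = 20 bytes
    0x02,         # flags: SYN
    8192,         # window size
    0,            # checksum (left at 0 in this sketch)
    0,            # urgent pointer
)
ip_header = struct.pack(
    "!BBHHHBBH4s4s",
    (4 << 4) | 5,               # version 4, header length 5 words
    0,                          # type of service
    20 + len(tcp_header),       # total length of the IP packet
    18353,                      # identification
    0x4000,                     # flags: DF, fragment offset 0
    128,                        # time to live
    6,                          # protocol number 6 = TCP
    0,                          # checksum (left at 0 in this sketch)
    socket.inet_aton("192.168.1.6"),
    socket.inet_aton("192.168.1.100"),
)
ethernet_header = struct.pack(
    "!6s6sH",
    bytes.fromhex("6466b36fe345"),  # destination MAC
    bytes.fromhex("6036dd1e9a10"),  # source MAC
    0x0800,                         # EtherType: IPv4
)
frame = ethernet_header + ip_header + tcp_header

# --- Peel the layers off again ---
# Ethernet: fixed 14-byte header, the rest is its payload (the IP packet)
dst_mac, src_mac, ether_type = struct.unpack("!6s6sH", frame[:14])
ip_packet = frame[14:]

# IP: the low nibble of the first byte gives the header length in 32-bit words
ip_header_len = (ip_packet[0] & 0x0F) * 4
proto = ip_packet[9]
src_ip = socket.inet_ntoa(ip_packet[12:16])
dst_ip = socket.inet_ntoa(ip_packet[16:20])
segment = ip_packet[ip_header_len:]

# TCP: the high nibble of byte 12 gives the header length in 32-bit words
sport, dport = struct.unpack("!HH", segment[:4])
tcp_header_len = (segment[12] >> 4) * 4
data = segment[tcp_header_len:]

print("EtherType:", hex(ether_type))
print("IP:", src_ip, "->", dst_ip, "proto", proto)
print("TCP:", sport, "->", dport)
print("Application data bytes:", len(data))
```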

In [130]:
# Retrieving a single item from the packet list

ethernet_frame = pcap[0]
ip_packet = ethernet_frame.payload
segment = ip_packet.payload
data = segment.payload

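# Note: show() prints the layer dump itself and returns None, so each print()
# below emits the dump first, then the label followed by "None".
# Pass dump=True (show(dump=True)) to get the dump back as a string instead.
print("ethernet_frame : ", ethernet_frame.show(), "\n\n")
print("ip_packet : ", ip_packet.show(), "\n\n")
print("segment : ", segment.show(), "\n\n")
print("data : ", data.show(), "\n\n")  # If blank, empty object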

###[ Ethernet ]### 
  dst       = 64:66:b3:6f:e3:45
  src       = 60:36:dd:1e:9a:10
  type      = 0x800
###[ IP ]### 
     version   = 4
     ihl       = 5
     tos       = 0x0
     len       = 52
     id        = 18353
     flags     = DF
     frag      = 0
     ttl       = 128
     proto     = tcp
     chksum    = 0x2f58
     src       = 192.168.1.6
     dst       = 192.168.1.100
     \options   \
###[ TCP ]### 
        sport     = 65198
        dport     = mmcc
        seq       = 1290154816
        ack       = 0
        dataofs   = 8
        reserved  = 0
        flags     = S
        window    = 8192
        chksum    = 0x38c6
        urgptr    = 0
        options   = [('MSS', 1460), ('NOP', None), ('WScale', 8), ('NOP', None), ('NOP', None), ('SAckOK', b'')]

ethernet_frame :  None 


###[ IP ]### 
  version   = 4
  ihl       = 5
  tos       = 0x0
  len       = 52
  id        = 18353
  flags     = DF
  frag      = 0
  ttl       = 128
  proto     = tcp
  chksum    = 0x2f58
  src  

## Converting the PCAP frames to DataFrames

***

Next, we will convert the PCAP frames to a Pandas DataFrame. This lets us manipulate the data obtained from the packets with greater ease and efficiency. To do this, we first collect the header field names from the different layers of the packet (IP, TCP, UDP). We then create an empty DataFrame whose columns are the IP and TCP field names plus a few extras (`time` and the payload columns), and populate it row by row with the list of field values for each packet. Packets whose transport layer is not TCP (for example UDP) fill only the columns they share with TCP, such as `sport` and `dport`; the remaining columns are set to `None`.

In [None]:
#Collect the field names from IP/TCP/UDP (These will be the columns in the DataFrame)

ip_fields =[field.name for field in IP().fields_desc]
tcp_fields = [field.name for field in TCP().fields_desc]
udp_fields = [field.name for field in UDP().fields_desc]

# print("ip_fields : ", ip_fields)
# print("tcp_fields : ", tcp_fields)
# print("udp_fields : ", udp_fields)

dataframe_fields = ip_fields + ['time'] + tcp_fields + ['payload', 'payload_raw', 'payload_hex']

# print("Dataframe_fields : ", dataframe_fields)

# Create a blank DataFrame.
df = pd.DataFrame(columns = dataframe_fields)

for packet in pcap[IP]:
    #Field array for each row of DataFrame
    field_values = []
    # Add all field values
    for field in ip_fields:
        if field == 'options':
            # Retrieving the number of options defined in IP headers
            field_values.append(len(packet[IP].fields[field]))
        else:
            field_values.append(packet[IP].fields[field])
#             print(packet[IP].fields[field]) 
            
    field_values.append(packet.time)
#     print("Packet Time : ", packet.time) 
    
    layer_type = type(packet[IP].payload)
#     print("Layer Type : ", layer_type)   
    
    
    for field in tcp_fields:
        try:
            if field == 'options':
                field_values.append(len(packet[layer_type].fields[field]))
            else:
                field_values.append(packet[layer_type].fields[field])
        except (KeyError, IndexError):
            # Field does not exist on this layer (e.g. 'seq' on a UDP datagram)
            field_values.append(None)
            
    # Append payload
    field_values.append(len(packet[layer_type].payload))
    
    field_values.append(packet[layer_type].payload.original)
#     print("packet[layer_type].payload.original : ",  packet[layer_type].payload.original)
    
    field_values.append(binascii.hexlify(packet[layer_type].payload.original))
#     print("binascii.hexlify(packet[layer_type].payload.original) : ", binascii.hexlify(packet[layer_type].payload.original))
    
    # Add row to DF
    df_append = pd.DataFrame([field_values], columns=dataframe_fields)
    df = pd.concat([df, df_append], axis=0)

# Reset the index and discard the old one in a single step
df = df.reset_index(drop=True)
        
        

In [None]:
# The code was removed by Watson Studio for sharing.
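
Calling `pd.concat` inside the loop copies the whole DataFrame on every iteration, which gets slow for large captures. A common alternative is to collect the rows in a plain Python list and build the DataFrame once at the end. The sketch below illustrates the pattern with hand-written dictionaries standing in for parsed packets; the records and the reduced column list are assumptions for illustration, not data from the capture.

```python
import pandas as pd

# Hypothetical stand-ins for parsed packets. In the notebook these values
# would come from packet[IP].fields and the transport layer's fields.
parsed_packets = [
    {"src": "192.168.1.6", "dst": "192.168.1.100", "sport": 65198, "dport": 5050, "payload": 0},
    {"src": "192.168.1.100", "dst": "192.168.1.6", "sport": 5050, "dport": 65198, "payload": 0},
    {"src": "192.168.1.6", "dst": "192.168.1.100", "sport": 65198, "dport": 5050, "payload": 120},
]

columns = ["src", "dst", "sport", "dport", "payload"]

rows = []
for record in parsed_packets:
    # .get() returns None for a missing field, mirroring the None fallback
    # used for fields that a layer does not have.
    rows.append([record.get(name) for name in columns])

# Build the DataFrame once, after all rows have been collected
example_df = pd.DataFrame(rows, columns=columns)

print(example_df.shape)
print(example_df.head())
```

The result already has a clean 0..n-1 index, so no `reset_index` step is needed afterwards.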

## DataFrame Basics 

***

In [None]:
# Retrieve first row from DataFrame
print(df.iloc[0])

In [None]:
# Shape of the dataframe
print("The shape of the dataframe is : ", df.shape)

In [None]:
# Return first 5 rows
df.head()

In [None]:
# Return last 5 rows
df.tail()

In [None]:
df.to_csv("packet.csv")

## PCAP Statistics

***

One of the advantages of converting our PCAP frames to a Pandas DataFrame is that we can now compute statistics on them. One of the most cumbersome aspects of Wireshark is the sheer volume of data it presents, which makes it hard to draw real insights efficiently. With the help of Pandas we can compute some simple statistics, such as the most frequent source address, or the most frequent destination address or port.



In [None]:
# Top Source Address
print("[*] Top Source Address: \n", df['src'].describe(), "\n\n")

# Top Destination Address
print("[*] Top Destination Address: \n ", df['dst'].describe(), "\n\n")

frequent_address = df['src'].describe()['top']
print("Frequent Address : ", frequent_address, "\n\n")

# Who is the top address speaking to
print("[*] Who is Top Address Speaking to?: \n ", df[df['src'] == frequent_address]['dst'].unique(), "\n\n")

# Who is the top address speaking to (dst ports)
print("[*] Who is the top address speaking to (Destination Ports): \n ", df[df['src'] == frequent_address]['dport'].unique(), "\n\n")
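
For a column of strings, `describe()` reports `count`, `unique`, `top` (the most frequent value) and `freq` (how often it occurs). When the full ranking matters rather than only the winner, `value_counts()` is a handy companion. The following sketch uses a small made-up table, not the capture from this notebook, to show both side by side.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "src": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "dst": ["10.0.0.9", "10.0.0.8", "10.0.0.9", "10.0.0.9"],
    "dport": [53, 80, 53, 53],
})

# describe() on a string column yields count / unique / top / freq
summary = toy["src"].describe()
top_src = summary["top"]

# value_counts() ranks every value, most frequent first
src_counts = toy["src"].value_counts()

# Destination ports used by the most frequent source, with packet counts
port_counts = toy[toy["src"] == top_src]["dport"].value_counts()

print("Most frequent source:", top_src)
print(src_counts)
print(port_counts)
```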

## Visualizations

***

Now that we have the statistics, the next step is to visualize the data. Visualization helps you identify the most frequent addresses and ports more quickly, and spot anomalous activity at a glance.


In [None]:
# Group by Source Address and Payload Sum
source_addresses = df.groupby("src")['payload'].sum()
source_addresses.plot(kind='barh',title="Addresses Sending Payloads")

In [None]:
# Group by Destination Address and Payload Sum
destination_addresses = df.groupby("dst")['payload'].sum()
destination_addresses.plot(kind='barh', title="Destination Addresses (Bytes Received)",figsize=(8,5))

In [None]:
# Group by Source Port and Payload Sum
source_payloads = df.groupby("sport")['payload'].sum()
source_payloads.plot(kind='barh',title="Source Ports (Bytes Sent)",figsize=(8,5))

In [None]:
# Group by Destination Port and Payload Sum
destination_payloads = df.groupby("dport")['payload'].sum()
destination_payloads.plot(kind='barh',title="Destination Ports (Bytes Received)",figsize=(8,5))

## Payload Investigation

***

The graphs that we created highlighted the fact that a large amount of data was sent over port 53 (DNS). Exfiltrating data over this port is a common technique for attackers because restricting DNS communication can be troublesome. At this point, we can open Wireshark, or write a few lines of code to make the investigation repeatable. We’ll perform another grouping operation, separate the conversation into its own DataFrame, and view the suspicious conversation:



In [None]:
# Create dataframe with only the conversations from the most frequent address
frequent_address_df = df[df['src']==frequent_address]

# Keep only src, dst and payload, then sum the payload bytes per destination
frequent_address_groupby = frequent_address_df[['src','dst','payload']].groupby("dst")['payload'].sum()

# Plot who the most frequent address is speaking to (by payload bytes)
frequent_address_groupby.plot(kind='barh',title="Most Frequent Address is Speaking To (Bytes)",figsize=(8,5))

# Which address has exchanged the most bytes with the most frequent address?
ip_of_interest = frequent_address_groupby.sort_values(ascending=False).index[0]
print(ip_of_interest, "may be an interesting address")

# Create dataframe with only conversation from most frequent address and suspicious address
ip_of_interest_df = frequent_address_df[frequent_address_df['dst']==ip_of_interest]

# Store each payload in a list
raw_stream = list(ip_of_interest_df['payload_raw'])
    
print(raw_stream)
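
A printed list of raw byte strings is hard to read. One possible next step is to join the payloads into a single stream and look at it both as hex and as text with non-printable bytes masked. The sketch below assumes `raw_stream` is a list of `bytes` objects; the sample bytes here are made up and are not taken from the capture.

```python
import string

# Made-up payload bytes standing in for ip_of_interest_df['payload_raw']
raw_stream = [b"\x00\x01hello", b"", b"\x02wor", b"ld\x00\xff"]

# Join the individual payloads in capture order
stream = b"".join(raw_stream)

# Hex view, comparable to the payload_hex column
hex_view = stream.hex()

# Text view: keep printable ASCII, replace everything else (including
# tabs and newlines) with "." so the output stays on one line
printable = set(string.printable.encode("ascii")) - set(b"\t\n\r\x0b\x0c")
text_view = "".join(chr(b) if b in printable else "." for b in stream)

print(hex_view)
print(text_view)
```

Masking rather than dropping the non-printable bytes keeps character positions aligned with the hex view, which helps when cross-checking offsets.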

***