# <b>DarkVec: Automatic Analysis of Darknet Traffic with Word Embeddings</b>
## <b>Darknet Overview</b> 

___
# <b>Table of Contents</b> <a id="toc_"></a>
* [<b>Darknet Traffic Overview</b>](#darknet)  
  * [Port Ranking](#portranking)  
  * [Darknet IPs Activity Pattern](#darknetpattern)  
  * [Dataset Statistics](#dataset)  
  * [Filter Definition](#darknetfilter)  
  * [Distinct IPs Seen Over 30 Days](#darknetips)  
  * [Last Day of Traffic](#lastday)  
    * [Some Notable GT Activity Patterns](#gtpattern)
  * [Ground Truth/Services Incidence](#gtserv)  
  

This notebook provides an overview of the collected darknet traffic. Following the paper, we report the snippets that generate its statistics and figures. This notebook covers Section 3 of the paper.

___
***Note:*** All the code and data we provide are the ones used in the paper. To speed up the notebook execution, by default we trim the files when reading them. Comments on how to run on the complete files are provided in the notebook. Note that running the notebook with the complete dataset requires *a PC with a significant amount of memory*. 

In [4]:
DEMONSTRATIVE = True

In [11]:
from config import *
from src.preprocess import *
import numpy as np
import pandas as pd
from src.corpus import get_services
from src.callbacks import *
from src.utils import *

In [12]:
import matplotlib.pyplot as plt
%matplotlib inline
import fastplot
from cycler import cycler

cc = (cycler('color',['k', 'r', 'b', 'g', 'y', 'm', 'c'])+
      cycler('linestyle',['-', '--', '-.', ':', '-', '--', '-.']))

___
# <b>Darknet Traffic Overview</b> <a name="darknet"></a>



Load 30 days of raw traffic, from 2021-03-02 to 2021-03-31. This is the full dataset we use in our experiments. 

Here we load an already slightly preprocessed dataset. In [Appendix 3](A03-darknet-checkpoint.ipynb) we provide the scripts for generating this intermediate dataset.

Each row of the dataset is a packet received from the darknet. The dataset columns are:

- `ts`. It is the timestamp of the packet arrival
- `ip`. It is the source IP address sending the packet
- `port`. It is the destination (darknet) port
- `proto`. Used protocol among TCP, UDP, ICMP, GRE, OTH (for others)
- `pp`. `port/proto` pairs used for the language definition
- `class`. Ground truth class of the source IP
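
To make the schema concrete, the following minimal sketch builds a toy table with the same columns and computes the kind of summary counts reported for the real data. All values are made up, and the `port/proto` string format (for example `23/TCP`) is an assumption for illustration rather than the format produced by `load_raw_data`.

```python
import pandas as pd

# Toy packets mimicking the schema above; every value is made up.
toy = pd.DataFrame({
    'ts': pd.to_datetime(['2021-03-02 00:00:01', '2021-03-02 00:00:02',
                          '2021-03-02 00:00:05']),
    'ip': ['10.0.0.1', '10.0.0.2', '10.0.0.1'],
    'port': [23, 445, 23],
    'proto': ['TCP', 'TCP', 'TCP'],
    'class': ['unknown', 'unknown', 'unknown'],
})
# Assumption: `pp` joins port and protocol with a slash, e.g. '23/TCP'.
toy['pp'] = toy['port'].astype(str) + '/' + toy['proto']

n_ips = toy['ip'].nunique()    # distinct source IPs
n_pp = toy['pp'].nunique()     # distinct port/protocol pairs
n_pkts = len(toy)              # one row per received packet
print(f'{n_ips:,} distinct source IPs, {n_pp:,} port/protocol pairs, {n_pkts:,} packets')
```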

## Swap the lines below to read the full dataset - As it is now, it loads roughly 6 days only

In [13]:
if not DEMONSTRATIVE: darknet = load_raw_data('202103')
else: darknet = load_raw_data('2021030*')
    
darknet.head(3)


In [14]:
print(f'Traffic stats: ')
print(f'{darknet.ip.unique().shape[0]:,} distinct source IPs')
print(f"{darknet.pp.unique().shape[0]:,} destination 'port/protocol' pairs.")
print(f'{darknet.shape[0]:,} received packets')
print(f'Dataset shape: {darknet.shape}')


### <b>Port Ranking</b> <a name="portranking"></a>  


Next we characterize the darknet. We focus on the port popularity in terms of received packets. 
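
As a rough illustration of what a port ranking measures, the sketch below ranks ports by received packets on a made-up list of destination ports and computes the cumulative share of traffic covered by the top-k ports. The variable names are hypothetical and the data is not taken from the darknet.

```python
import pandas as pd

# Toy destination ports, one entry per received packet (made-up values).
ports = pd.Series([23, 23, 23, 23, 445, 445, 445, 80, 80, 22], name='port')

# Packets per port, most popular first.
rank = ports.value_counts().sort_values(ascending=False)
# Cumulative share of packets covered by the top-k ports.
ecdf = rank.cumsum() / rank.sum()

# Share of packets hitting the two most popular ports.
top2_share = ecdf.iloc[1]
```

A steep initial rise of this curve means that a handful of ports attracts most of the packets.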

In [None]:
# Get the port frequency from 30 days of traffic.
# `reset_index(name=...)` names the count column regardless of the pandas version
top14 = darknet.value_counts('port').reset_index(name='pkts')
# Compute the ECDF(packets)
pkts = top14.sort_values('pkts', ascending=False)
pkts.pkts = np.cumsum(pkts.pkts)/np.sum(pkts.pkts)
# Zoom-in: Get the top-14 ports within 30 days
top = top14.sort_values('pkts', ascending=False)
top.pkts = np.cumsum(top.pkts)/np.sum(top14.pkts)
top = top.iloc[:14]

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: portRanking(plt, pkts, top),
                      cycler=cc, figsize=(5, 3.5), fontsize=14, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/top20.pdf')
plot.show()

### <b>Darknet IPs Activity Pattern</b> <a name="darknetpattern"></a>  


To provide the big picture, we visualize the activity patterns of some IP addresses.

We extract a time window shorter than 30 days and downsample the received traffic (keeping one packet out of three) to make the patterns more evident. A scatterplot is then generated in which each dot is a packet sent by IP $y$ at instant $x$.
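
The preparation of such a scatterplot can be sketched on toy data as follows. Here `pd.factorize` stands in for a dictionary-based IP-to-integer mapping, and all timestamps and IP labels are made up.

```python
import pandas as pd

# Toy packets (made-up values), deliberately out of time order.
pkts = pd.DataFrame({
    'ts': pd.to_datetime(['2021-03-02 00:00:03', '2021-03-02 00:00:01',
                          '2021-03-02 00:00:02', '2021-03-02 00:00:04',
                          '2021-03-02 00:00:05', '2021-03-02 00:00:06']),
    'ip': ['b', 'a', 'a', 'c', 'b', 'a'],
})
pkts = pkts.sort_values('ts').reset_index(drop=True)
# Map each IP to an integer in order of first appearance (the y coordinate).
codes, uniques = pd.factorize(pkts['ip'])
pkts['tkn'] = codes
# Keep one packet out of three (positions 0, 3, 6, ...).
sampled = pkts.iloc[::3]
```

Plotting `sampled.ts` against `sampled.tkn` then places one dot per retained packet, with one horizontal line of dots per sender.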

In [None]:
# Extract a 9 days window to make the IPs patterns more evident
tday_ = darknet[darknet.ts<='2021-3-11 23:28:56.952226'][['ts', 'ip']]
tday = pd.DataFrame(tday_, columns=['ts', 'ip'])
# Manage timestamps and sort them
tday.index = pd.DatetimeIndex(tday.ts)
tday = tday.sort_index()
tday = tday.drop(columns=['ts'])
# Tokenize IPs. From string to integer number
ydict = {v: k for k,v in enumerate(tday.ip.unique())}
tday['tkn'] = tday.ip.apply(lambda x: ydict[x])
# mod3 downsampling for reducing the image weight
tday = tday.iloc[::3]

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: darknetPatterns(plt, tday),
                      cycler=cc, figsize=(5, 3.5), fontsize=14, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/density.png')
plot.show()

### <b>Dataset Statistics</b> <a name="dataset"></a>  

We complement the description of the dataset with some statistics for both the 30 days of darknet traffic and a single day. In the latter case, the considered day is the $30^{th}$ of the full dataset.

In [None]:
df30 = darknet.copy()
df1 = load_raw_data('20210331')

In [None]:
# Number of IPs
ip1 = df1.ip.unique().shape[0]
ip30 = df30.ip.unique().shape[0]
# Number of packets
pkts1 = df1.shape[0]
pkts30 = df30.shape[0]
# Number of ports
port1 = df1.port.unique().shape[0]
port30 = df30.port.unique().shape[0]
# Top-3 ports
top3_1 = df1.value_counts('pp').index[:3]
top3_30 = df30.value_counts('pp').index[:3]
# Packets of top-3 ports
top3pkts1 = (df1.value_counts('pp')/df1.shape[0]*100).values[:3]
top3pkts30 = (df30.value_counts('pp')/df30.shape[0]*100).values[:3]

Get the IPs targeting the top-3 ports of the considered datasets

In [None]:
df1[df1.pp.isin(top3_1)].groupby('pp').agg({'ip':lambda x: len(set(x))})

In [None]:
df30[df30.pp.isin(top3_30)].groupby('pp').agg({'ip':lambda x: len(set(x))})

In [None]:
# Collect statistics
print('Date ($YYYY-MM-DD$):')
print(f'\tLast Day: 2021-03-31')
print(f'\tFull Dataset: [2021-03-02, 2021-03-31]')

print('Sources:')
print(f'\tLast Day: {ip1}')
print(f'\tFull Dataset: {ip30}')

print('Packets:')
print(f'\tLast Day: {pkts1}')
print(f'\tFull Dataset: {pkts30}')

print('Ports:')
print(f'\tLast Day: {port1}')
print(f'\tFull Dataset: {port30}')

# `\\%` prints a literal `\%` without triggering an invalid-escape warning
print('Top-3 ports (\\% of traffic):')
print(f'\tLast Day: {top3_1[0]} ({round(top3pkts1[0], 2)}\\%), {top3_1[1]} ({round(top3pkts1[1], 2)}\\%), {top3_1[2]} ({round(top3pkts1[2], 2)}\\%)')
print(f'\tFull Dataset: {top3_30[0]} ({round(top3pkts30[0], 2)}\\%), {top3_30[1]} ({round(top3pkts30[1], 2)}\\%), {top3_30[2]} ({round(top3pkts30[2], 2)}\\%)')

### <b>Filter Definition</b> <a name="darknetfilter"></a>  


Given the large number of packets received in 30 days, a filter is needed to improve visualization and reduce noisy traffic (e.g., senders that send only a couple of packets in a month are not interesting for this analysis). 

We design our filter with respect to the monthly packets sent by each IP. We evaluate the distribution of this amount and set the filtering threshold to 10 packets. In this way, we keep the IP addresses sending _at least 10 packets over a month_.
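
The thresholding logic can be sketched on a toy trace as follows. The sender labels and packet counts are made up; only the threshold of 10 packets comes from the text.

```python
import pandas as pd

# Toy trace: 'a' sends 12 packets, 'b' sends 10, 'c' sends 2, 'd' sends 1.
ips = pd.Series(['a'] * 12 + ['b'] * 10 + ['c'] * 2 + ['d'], name='ip')

# Packets sent by each IP.
per_ip = ips.value_counts()
# ECDF over senders: fraction of IPs sending at most k packets.
hist = per_ip.value_counts().sort_index()
ecdf = hist.cumsum() / hist.sum()

THRESHOLD = 10  # value stated in the text
kept = set(per_ip[per_ip >= THRESHOLD].index)
filtered = ips[ips.isin(kept)]
```

In this toy case half of the senders are dropped, yet 22 of the 25 packets survive, which is the kind of trade-off the ECDF helps to assess.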

In [None]:
# Count the packets per IP over a month, then count how many IPs share each
# packet count. `reset_index(name=...)` works regardless of the pandas version
cdf = darknet.value_counts('ip').reset_index(name='pkts').drop(columns=['ip'])\
             .value_counts('pkts')
# Get the ECDF
cdf = cdf.sort_index()
cdf = np.cumsum(cdf)/np.sum(cdf)

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: filterECDF(plt, cdf),
                      cycler=cc, figsize=(5, 3.5), fontsize=14, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/pkts_cdf.pdf')
plot.show()

### <b>Distinct IPs Seen Over 30 Days</b> <a name="darknetips"></a>  


After filtering the dataset, we investigate the impact of the filter on the full 30-day dataset. To this end, we compare the number of distinct IPs seen on each day of the observation period in the filtered and unfiltered datasets.
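
The comparison can be illustrated on a toy three-day trace. The helper `daily_distinct` below is a hypothetical stand-in written for this sketch; it is not the notebook's `get_ip_set_by_day` or `get_ips_ecdf`, whose implementations are not shown here.

```python
import pandas as pd

# Toy trace over three days (made-up values): 'a' is a heavy sender,
# the other IPs are sporadic.
pkts = pd.DataFrame({
    'ts': pd.to_datetime(['2021-03-02'] * 6 + ['2021-03-03'] * 5
                         + ['2021-03-04'] * 3),
    'ip': ['a'] * 5 + ['b'] + ['a'] * 4 + ['c'] + ['a', 'b', 'd'],
})
# Keep only the IPs sending at least 10 packets over the whole trace.
counts = pkts['ip'].value_counts()
heavy = set(counts[counts >= 10].index)
pkts_f = pkts[pkts['ip'].isin(heavy)]

def daily_distinct(df):
    """Number of distinct source IPs seen on each calendar day."""
    return df.groupby(df['ts'].dt.date)['ip'].nunique()

daily_all = daily_distinct(pkts)    # unfiltered
daily_flt = daily_distinct(pkts_f)  # filtered
```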

In [None]:
dnet = darknet.copy()
# Count the number of packets per IP to define the filter
ips = darknet.value_counts('ip')
# Filter: keep IPs sending at least 10 packets
ips_f = set(ips[ips>=10].index)
# Apply the filter
dnet_f = dnet[dnet.ip.isin(ips_f)]

# Get the number of distinct IPs per day in both the filtered
# and unfiltered case
dnet = get_ip_set_by_day(dnet)
dnet_f = get_ip_set_by_day(dnet_f)
# Make the DataFrame for the fastplot callback
cdf = pd.DataFrame(get_ips_ecdf(dnet))[1].values
cdf_f = pd.DataFrame(get_ips_ecdf(dnet_f))[1].values

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: filterCoverage(plt, cdf, cdf_f),
                      cycler=cc, figsize=(5, 3.5), fontsize=14, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/daily_ips_v2.pdf')
plot.show()

___
# <b>Last Day of Traffic</b> <a name="lastday"></a>



We now show characteristics of the last day of darknet traffic used in some experiments. We report in [Appendix 3](A03-darknet-checkpoint.ipynb) the scripts for generating this sample. Then we apply the filter and provide some statistics on the ground truth. Finally we extract some notable activity patterns.

In [None]:
raw_data = load_raw_data('20210331')
daily = filter_data(raw_data, '20210331')

In [None]:
# Load the ground truth
gt = pd.read_csv(GT).drop(columns=['Unnamed: 0'])
ips = daily.ip.unique()
# Get the lookup dataframe to retrieve
# the ground truth class of the last day senders
lookup = pd.DataFrame(ips, columns=['ip'])\
           .merge(gt, on='ip', how='left')\
           .fillna('unknown').replace({ 'criminalip':'unknown', 
                'adscore':'unknown', 'quadmetrics':'unknown', 
                'esrg_stanford':'unknown', 'netscout':'unknown'})
lookup = lookup.rename(columns={'gt':'class'})
print(lookup.shape)
lookup.head(3)

We apply the filter, keeping the IPs that send at least 10 packets over 30 days, and then extract some statistics about the ground truth.
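
A minimal sketch of this kind of per-class summary, on made-up data, is given below. It uses a plain `groupby` aggregation as a stand-in and does not reproduce `get_last_day_stats`, whose implementation is not shown here.

```python
import pandas as pd

# Toy last-day packets and ground truth (made-up values).
daily = pd.DataFrame({
    'ip': ['a', 'a', 'b', 'c', 'c', 'c'],
    'pp': ['23/tcp', '23/tcp', '80/tcp', '22/tcp', '23/tcp', '22/tcp'],
})
gt = pd.DataFrame({'ip': ['a', 'b'], 'class': ['stretchoid', 'stretchoid']})

# The left merge keeps every packet; senders missing from the
# ground truth are labelled 'unknown'.
labelled = daily.merge(gt, on='ip', how='left').fillna({'class': 'unknown'})

# Per-class senders, packets and distinct port/protocol pairs.
stats = labelled.groupby('class').agg(
    senders=('ip', 'nunique'),
    packets=('ip', 'size'),
    ports=('pp', 'nunique'),
).sort_values('senders', ascending=False)
```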

In [None]:
daily = daily.merge(lookup, on='ip', how='left')
# Collect the statistics
stats = [get_last_day_stats(daily, x) for x in daily['class'].unique()]
pd.DataFrame(stats, columns=['Source', 'Senders', 'Packets', 
                             'Ports', 'Top-5 Ports (% Traffic)'])\
  .sort_values('Senders', ascending=False)

### <b>Some Notable GT Activity Patterns</b> <a name="gtpattern"></a> 



By applying the same technique as before, we extract the activity patterns plot for two ground truth classes: Engin-Umich and Stretchoid.


In [None]:
if 'class' not in darknet:
    darknet = darknet.merge(lookup, on='ip', how='left')
# Extract the stretchoid traces from the full darknet ones
# (copy so that adding columns does not touch a view of `darknet`)
stretchoid = darknet[darknet['class'] == 'stretchoid'].copy()
stretchoid.index = pd.DatetimeIndex(stretchoid.ts)
stretchoid = stretchoid.sort_index()
# Tokenize Stretchoid IPs
ydict = {v: k for k,v in enumerate(stretchoid.ip.unique())}
stretchoid['tkn'] = stretchoid.ip.apply(lambda x: ydict[x])
# Build the activity patterns timeseries
stretchoid = stretchoid[['ip', 'tkn']]
stretchoid.head()

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: stretchoidPattern(plt, stretchoid),
                      cycler=cc, figsize=(5, 3.5), fontsize=14, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/stretchoid_pattern.png')
plot.show()

In [None]:
# Extract the engin-umich traces from the full darknet ones
# (copy so that adding columns does not touch a view of `darknet`)
en_um = darknet[darknet['class'] == 'engin-umich'].copy()
en_um.index = pd.DatetimeIndex(en_um.ts)
# Tokenize Engin-Umich IPs
ydict = {v: k for k,v in enumerate(en_um.ip.unique())}
en_um['tkn'] = en_um.ip.apply(lambda x: ydict[x])
# Build the activity patterns timeseries
en_um = en_um[['ip', 'tkn']]
en_um.head()

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: enginumichPattern(plt, en_um),
                      cycler=cc, figsize=(5, 3.5), fontsize=12, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/engin_umich_pattern.png')
plot.show()

### <b>Ground Truth/Services Incidence</b> <a name="gtserv"></a> 



Generate a heatmap indicating the fraction of each GT class's packets sent to ports used by general-purpose services.
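
The normalization behind such a heatmap can be sketched on toy data: count packets per (class, service) pair, then divide each row by its total. The class and service labels below are made up.

```python
import pandas as pd

# Toy packets labelled with a ground-truth class and a service (made-up values).
pkts = pd.DataFrame({
    'class': ['x', 'x', 'x', 'x', 'y', 'y'],
    'serv':  ['telnet', 'telnet', 'telnet', 'http', 'http', 'ssh'],
})
# One unit per packet, so that summing counts packets.
pkts['pkts'] = 1
table = pkts.pivot_table(values='pkts', index='class', columns='serv',
                         aggfunc='sum', fill_value=0)
# Divide each row by its total so that every class sums to 1.
frac = table.div(table.sum(axis=1), axis=0)
```

Each row of `frac` is then the distribution of one class's packets over the services, which makes classes with very different traffic volumes directly comparable.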


In [None]:
raw_data = load_raw_data('20210331')
daily = filter_data(raw_data, '20210331')
# Get the class of services
daily['serv'] = daily.pp.apply(get_services)
# Add a ones column for packets sum
daily['pkts'] = 1
daily = daily.merge(lookup, on='ip', how='left')
daily = daily.replace({'unk_usr':'others', 'unk_sys':'others', 
                       'unk_eph':'others', 'proxy':'others'})
daily.head(3)

In [None]:
pivot = daily.pivot_table(values='pkts', index='class', columns='serv', 
                       aggfunc='sum')\
          .reindex(columns=daily.serv.unique())\
          .fillna(.0)
pivot = pivot.divide(pivot.sum(axis=1), axis='rows').round(2).T

In [None]:
%matplotlib inline
plot = fastplot.plot(None,  None, mode = 'callback', callback = lambda plt: ground_truth_heatmap(plt, pivot),
                      figsize=(10, 8), fontsize=22, xticks_rotate=45, style='latex')
if not DEMONSTRATIVE:
    plot.savefig(f'reports/figures/heatmap_gt_serv.pdf')
plot.show()