# Goal

1. Create a scatter plot presenting network activity.
2. Identify the type of transactions and the volume and type of transfers, based on the packets transferred between hosts. Plot a graph or provide insight into what is actually happening: for example, is the traffic mostly shopping transactions, search transactions, or video streaming?
3. Find the number of hosts in each subnet or group. Identify subnets or domains and group the data accordingly.

# Data

The data presented here was collected in a network section of Universidad Del Cauca, Popayán, Colombia, by performing packet captures at different hours, during the morning and afternoon, over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017. A total of 3,577,296 instances were collected and are currently stored in a CSV (Comma-Separated Values) file.

## Content

This dataset contains 87 features. Each instance holds the information of an IP flow generated by a network device, i.e., source and destination IP addresses, ports, inter-arrival times, and the layer 7 protocol (application) used on that flow as the class, among others. Most of the attributes are numeric, but there are also nominal attributes and a date attribute for the Timestamp.

The flow statistics (IP addresses, ports, inter-arrival times, etc.) were obtained using CICFlowMeter (http://www.unb.ca/cic/research/applications.html - https://github.com/ISCX/CICFlowMeter). The application layer protocol was obtained by performing DPI (Deep Packet Inspection) on the flows with ntopng (https://www.ntop.org/products/traffic-analysis/ntop/ - https://github.com/ntop/ntopng).

For further information and if you find this dataset useful, please read and cite the following papers:

ResearchGate: https://www.researchgate.net/publication/326150046_Personalized_Service_Degradation_Policies_on_OTT_Applications_Based_on_the_Consumption_Behavior_of_Users

ResearchGate: https://www.researchgate.net/publication/335954240_Consumption_Behavior_Analysis_of_Over_The_Top_Services_Incremental_Learning_or_Traditional_Methods

Springer: https://link.springer.com/chapter/10.1007/978-3-319-95168-3_37

IEEE Xplore: https://ieeexplore.ieee.org/document/8845576


# Preparation

In [24]:
%matplotlib inline
import random
import pandas as pd
import numpy as np
import scipy.stats as st
from datetime import date, timedelta


import pdpbox
from pdpbox import pdp
from pdpbox.info_plots import target_plot
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report
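# export_text is exposed by sklearn.tree; the sklearn.tree.export module was deprecated and later removed.
from sklearn.tree import export_text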
from sklearn.tree import plot_tree
from sklearn import preprocessing
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pd.set_option('display.max_columns', None)

In [25]:
dt = pd.read_csv("Dataset-Unicauca-Version2-87Atts.csv",parse_dates=['Timestamp'])

In [6]:
dt.head()

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std,Flow.Bytes.s,Flow.Packets.s,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max,Fwd.IAT.Min,Bwd.IAT.Total,Bwd.IAT.Mean,Bwd.IAT.Std,Bwd.IAT.Max,Bwd.IAT.Min,Fwd.PSH.Flags,Bwd.PSH.Flags,Fwd.URG.Flags,Bwd.URG.Flags,Fwd.Header.Length,Bwd.Header.Length,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count,SYN.Flag.Count,RST.Flag.Count,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size,Avg.Bwd.Segment.Size,Fwd.Header.Length.1,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes,Subflow.Bwd.Packets,Subflow.Bwd.Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,132,110414.0,6,6,6.0,0.0,4380,1187,2007.527273,768.481689,2428355.0,1691.453,598.986842,816.061346,3880.0,1,45523.0,2167.761905,1319.384512,5988.0,698.0,41178.0,762.555556,1230.34822,5133.0,1.0,0,0,0,0,440,1100,483.2722,1208.18048,6,4380,1417.333333,1121.579194,1257940.0,0,0,0,0,1,0,0,0,2,1435.74026,6.0,2007.527273,440,0,0,0,0,0,0,22,132,55,110414,256,490,21,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,12,0.0,6,6,6.0,0.0,0,0,0.0,0.0,12000000.0,2000000.0,1.0,0.0,1.0,1,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,6,6,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,490,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,674,0.0,337,0,224.666667,194.567041,0,0,0.0,0.0,674000000.0,3000000.0,0.5,0.707107,1.0,0,1.0,0.5,0.707107,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,96,0,3000000.0,0.0,0,337,252.75,168.5,28392.25,0,1,0,0,1,0,0,0,0,337.0,224.666667,0.0,96,0,0,0,0,0,0,3,674,0,0,888,-1,1,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,18433.18,72.333333,62.660461,110.0,0,0.0,0.0,0.0,0.0,0.0,107.0,53.5,75.660426,107.0,0.0,0,0,0,0,32,96,4608.295,13824.884793,0,0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,3,0.0,0.0,0.0,32,0,0,0,0,0,0,1,0,3,0,888,490,0,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,1076,0.0,529,6,215.2,286.458898,0,0,0.0,0.0,13782.86,64.04673,19517.0,25758.50235,54313.0,0,78068.0,19517.0,25758.50235,54313.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,100,0,64.04673,0.0,6,529,267.5,286.458898,82058.7,0,1,0,0,1,0,0,0,0,321.0,215.2,0.0,100,0,0,0,0,0,0,5,1076,0,0,253,-1,4,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
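
The `Timestamp` values in the preview above have the form `26/04/201711:11:17`, that is, a day-first date followed directly by the time with no separator. A generic date parser may leave such strings unparsed or misread them, so passing an explicit format is the safer route. A minimal sketch, assuming every row follows this pattern (the `sample` frame is made up for illustration and is not part of the notebook):

```python
import pandas as pd

# Hypothetical sample mimicking the Timestamp strings shown in the preview.
sample = pd.DataFrame({"Timestamp": ["26/04/201711:11:17", "09/05/201715:42:03"]})

# Day-first date glued to the time: %d/%m/%Y followed directly by %H:%M:%S.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%d/%m/%Y%H:%M:%S")

print(sample["Timestamp"].dt.strftime("%Y-%m-%d %H:%M:%S").tolist())
```

Applied to the full frame, the same call would be `dt["Timestamp"] = pd.to_datetime(dt["Timestamp"], format="%d/%m/%Y%H:%M:%S")`.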


In [4]:
dt.describe()

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,...,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,L7Protocol
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,...,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,37999.38,12042.46,6.005508,25442470.0,62.37799,65.34083,46833.23,84457.42,512.3645,9.340408,...,25.69738,298199.0,183640.6,522937.2,167633.6,8524211.0,1370991.0,9743845.0,7252097.0,102.9508
std,22017.13,20449.16,0.3274574,40144300.0,1094.086,1108.092,1816196.0,2124319.0,1039.319,82.99983,...,6.025989,2349390.0,1325838.0,3266508.0,2064219.0,17065680.0,4814474.0,18885570.0,16007540.0,51.29198
min,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,...,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,3697.0,443.0,6.0,628.0,2.0,1.0,12.0,0.0,6.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0
50%,49377.0,3128.0,6.0,584729.5,6.0,5.0,443.0,208.0,206.0,0.0,...,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0
75%,53799.0,3128.0,6.0,45001530.0,15.0,15.0,1769.0,3629.0,613.0,6.0,...,32.0,45.0,0.0,57.0,2.0,7506747.0,0.0,8034389.0,5369712.0,130.0
max,65534.0,65534.0,17.0,120000000.0,453190.0,542196.0,678023600.0,1345796000.0,32832.0,16060.0,...,523.0,114695000.0,72971360.0,114695000.0,114695000.0,120000000.0,77387460.0,120000000.0,120000000.0,222.0
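
The `Protocol` column in the summary above ranges from 0 to 17 with a mean close to 6. These values are presumably IANA-assigned IP protocol numbers, where 6 is TCP, 17 is UDP and 0 is HOPOPT (IPv6 Hop-by-Hop Option). The sketch below attaches readable names to them; `IPProtocolName` is a hypothetical column name, the `sample` frame is made up, and any number outside the small mapping is labelled `OTHER`:

```python
import pandas as pd

# IANA-assigned IP protocol numbers (only the three discussed here).
IP_PROTOCOL_NAMES = {0: "HOPOPT", 6: "TCP", 17: "UDP"}

# Made-up sample; in the notebook this would be dt["Protocol"].
sample = pd.DataFrame({"Protocol": [6, 6, 17, 0, 6]})

# "IPProtocolName" is a hypothetical column, not one of the dataset's 87 features.
sample["IPProtocolName"] = sample["Protocol"].map(IP_PROTOCOL_NAMES).fillna("OTHER")

# Number of flows per IP protocol.
protocol_counts = sample["IPProtocolName"].value_counts()
print(protocol_counts)
```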


In [3]:
dt.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

## Question 1

Create a scatter plot presenting network activity.

In [26]:
def countsummary(dt, variable):
    """Print the highest and lowest number of rows sharing a value of `variable`."""
    if variable == 'Source.IP':
        # Source.IP is the grouping key here, so count Flow.ID entries per source address.
        TEMP = pd.DataFrame(dt.groupby('Source.IP')['Flow.ID'].count())
        TEMP.columns = ['count']
        print("Max frequency of Source.IP is", max(TEMP['count']))
        print("Min frequency of Source.IP is", min(TEMP['count']))
    else:
        # For every other column, count Source.IP entries per distinct value.
        TEMP = pd.DataFrame(dt.groupby(variable)['Source.IP'].count())
        TEMP.columns = ['count']
        print("Max frequency of", variable, " is : ", max(TEMP['count']))
        print("Min frequency of", variable, " is : ", min(TEMP['count']))
    print("------------------------------------------------------------")
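
The same minimum and maximum frequencies can also be read from `Series.value_counts()`, which removes the need for a special case on `Source.IP`. This is a sketch of an alternative, not the function used in this notebook: `frequency_range` is a hypothetical name, the `sample` frame is made up, and the two approaches only agree if the columns counted by `countsummary` contain no missing values.

```python
import pandas as pd

def frequency_range(df, column):
    """Return (min, max) number of rows sharing a value in `column`."""
    counts = df[column].value_counts()
    return int(counts.min()), int(counts.max())

# Made-up sample with two of the dataset's column names.
sample = pd.DataFrame({
    "Source.IP": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Protocol": [6, 6, 17, 6],
})

for col in sample.columns:
    lowest, highest = frequency_range(sample, col)
    print(f"{col}: min frequency {lowest}, max frequency {highest}")
```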

In [27]:
for i in dt.columns:
    countsummary(dt = dt, variable = i)

Max frequency of Flow.ID  is :  393
Min frequency of Flow.ID  is :  1
------------------------------------------------------------
Max frequency of Source.IP is 295431
Min frequency of Source.IP is 1
------------------------------------------------------------
Max frequency of Source.Port  is :  601996
Min frequency of Source.Port  is :  1
------------------------------------------------------------
Max frequency of Destination.IP  is :  323161
Min frequency of Destination.IP  is :  1
------------------------------------------------------------
Max frequency of Destination.Port  is :  1432474
Min frequency of Destination.Port  is :  1
------------------------------------------------------------
Max frequency of Protocol  is :  3572975
Min frequency of Protocol  is :  1637
------------------------------------------------------------
Max frequency of Timestamp  is :  1512
Min frequency of Timestamp  is :  1
------------------------------------------------------------
Max frequency of Flo

Max frequency of ECE.Flag.Count  is :  3574947
Min frequency of ECE.Flag.Count  is :  2349
------------------------------------------------------------
Max frequency of Down.Up.Ratio  is :  1573265
Min frequency of Down.Up.Ratio  is :  1
------------------------------------------------------------
Max frequency of Average.Packet.Size  is :  513840
Min frequency of Average.Packet.Size  is :  1
------------------------------------------------------------
Max frequency of Avg.Fwd.Segment.Size  is :  823072
Min frequency of Avg.Fwd.Segment.Size  is :  1
------------------------------------------------------------
Max frequency of Avg.Bwd.Segment.Size  is :  1180736
Min frequency of Avg.Bwd.Segment.Size  is :  1
------------------------------------------------------------
Max frequency of Fwd.Header.Length.1  is :  413952
Min frequency of Fwd.Header.Length.1  is :  1
------------------------------------------------------------
Max frequency of Fwd.Avg.Bytes.Bulk  is :  3577296
Min frequency
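
For the scatter plot itself, one option is to plot the number of flows observed in each minute against time, with one point per minute. This is only one possible reading of "network activity". The sketch assumes `Timestamp` has already been converted to datetimes; `flows_per_minute` and `plot_activity` are hypothetical helper names, and the `sample` frame is made up.

```python
import pandas as pd

def flows_per_minute(df):
    """Count flows per minute; minutes without any flow are simply absent."""
    minutes = df["Timestamp"].dt.floor("min")
    return minutes.value_counts().sort_index()

def plot_activity(counts):
    """Scatter flows-per-minute against time and return the Axes."""
    # Imported here so the aggregation above can be reused without a plotting backend.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.scatter(counts.index, counts.values, s=8)
    ax.set_xlabel("Time")
    ax.set_ylabel("Flows per minute")
    ax.set_title("Network activity")
    return ax

# Made-up timestamps standing in for dt["Timestamp"].
sample = pd.DataFrame({"Timestamp": pd.to_datetime([
    "2017-04-26 11:11:17",
    "2017-04-26 11:11:45",
    "2017-04-26 11:12:03",
    "2017-04-26 11:14:30",
])})

activity = flows_per_minute(sample)
print(activity)
# In the notebook: plot_activity(flows_per_minute(dt))
```

Because the captures cover six separate days, plotting each day on its own axis may be easier to read than a single plot with long empty gaps.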


## Question 2

Identify the type of transactions and the volume and type of transfers, based on the packets transferred between hosts. Plot a graph or provide insight into what is actually happening: for example, is the traffic mostly shopping transactions, search transactions, or video streaming?
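
Since `ProtocolName` already carries the application detected by DPI for each flow, a first pass at this question is to aggregate the number of flows and the bytes transferred per application. The sketch below runs on a small sample modelled on the preview rows. `traffic_by_application` and `Total.Bytes` are hypothetical names, and treating volume as the sum of forward and backward packet lengths is an assumption.

```python
import pandas as pd

def traffic_by_application(df):
    """Summarise flow count and total bytes per application, largest first."""
    # Assumed definition of volume: forward plus backward payload lengths.
    total_bytes = df["Total.Length.of.Fwd.Packets"] + df["Total.Length.of.Bwd.Packets"]
    return (
        df.assign(**{"Total.Bytes": total_bytes})
          .groupby("ProtocolName")
          .agg(flows=("Flow.ID", "count"), total_bytes=("Total.Bytes", "sum"))
          .sort_values("total_bytes", ascending=False)
    )

# Small sample modelled on the preview rows; not the real dataset.
sample = pd.DataFrame({
    "Flow.ID": ["a", "a", "b", "c"],
    "Total.Length.of.Fwd.Packets": [132, 12, 674, 1076],
    "Total.Length.of.Bwd.Packets": [110414, 0, 0, 0],
    "ProtocolName": ["HTTP_PROXY", "HTTP_PROXY", "HTTP", "HTTP_PROXY"],
})

summary = traffic_by_application(sample)
print(summary)
```

On the full data, `traffic_by_application(dt)` could feed a bar chart of the top applications, which gives a rough answer to whether the traffic is dominated by browsing, streaming, or something else.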