# Goal

1. Create a scatter plot presenting network activity.
2. Identify the type of transactions and the volume and type of transfers, based on the packets transferred between hosts. Plot a graph or give insight into what is actually happening: for example, is the traffic mostly shopping transactions, search transactions, or video streaming?
3. Find the number of hosts in each subnet or group. Identify subnets or domains and group the data accordingly.

# Data

The data presented here was collected in a network section of Universidad Del Cauca, Popayán, Colombia by performing packet captures at different hours, during morning and afternoon, over six days (April 26, 27, 28 and May 9, 11 and 15) of 2017. A total of 3,577,296 instances were collected and are stored in a CSV (comma-separated values) file.

## Content

This dataset contains 87 features. Each instance holds the information of one IP flow generated by a network device: source and destination IP addresses, ports, inter-arrival times, and the layer 7 (application) protocol used on that flow, which serves as the class, among others. Most attributes are numeric, but there are also nominal attributes and one date attribute, the Timestamp.

The flow statistics (IP addresses, ports, inter-arrival times, etc.) were obtained using CICFlowMeter (http://www.unb.ca/cic/research/applications.html - https://github.com/ISCX/CICFlowMeter). The application layer protocol was obtained by running DPI (Deep Packet Inspection) on the flows with ntopng (https://www.ntop.org/products/traffic-analysis/ntop/ - https://github.com/ntop/ntopng).

For further information and if you find this dataset useful, please read and cite the following papers:

Research Gate: https://www.researchgate.net/publication/326150046_Personalized_Service_Degradation_Policies_on_OTT_Applications_Based_on_the_Consumption_Behavior_of_Users

Research Gate: https://www.researchgate.net/publication/335954240_Consumption_Behavior_Analysis_of_Over_The_Top_Services_Incremental_Learning_or_Traditional_Methods

Springer: https://link.springer.com/chapter/10.1007/978-3-319-95168-3_37

IEEE Xplore: https://ieeexplore.ieee.org/document/8845576


# Preparation

In [1]:
%matplotlib inline
import random
import pandas as pd
import numpy as np
import scipy.stats as st
from datetime import date, timedelta


import pdpbox
from pdpbox import pdp
from pdpbox.info_plots import target_plot
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import accuracy_score 
from sklearn.metrics import classification_report
from sklearn.tree import export_text
from sklearn.tree import plot_tree
from sklearn import preprocessing
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

pd.set_option('display.max_columns', None)

In [2]:
dt = pd.read_csv("Dataset-Unicauca-Version2-87Atts.csv",parse_dates=['Timestamp'])
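
The `dt.head()` output shows Timestamp values such as `26/04/201711:11:17`, with no separator between date and time, so `parse_dates` may leave the column as plain strings. A minimal sketch of parsing with an explicit format follows. It uses a small stand-in frame rather than the full CSV; the name `sample` and its second row are made up for illustration.

```python
import pandas as pd

# Stand-in rows copying the Timestamp layout seen in the dataset:
# day/month/year followed directly by the time. The second value is invented.
sample = pd.DataFrame({"Timestamp": ["26/04/201711:11:17", "09/05/201714:30:05"]})

# An explicit format avoids day/month guessing and raises on malformed values.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%d/%m/%Y%H:%M:%S")

print(sample.dtypes)
```

Applied to the real frame, the same call would target `dt["Timestamp"]`.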

In [3]:
dt.shape

(3577296, 87)

In [4]:
dt.head()
'''
Target: Label (categorical)
Identifiers: Flow.ID, Source.IP, Source.Port, Destination.IP, Destination.Port, Protocol, Timestamp
Numerical: Flow.Duration, Total.Fwd.Packets, Total.Backward.Packets, Total.Length.of.Fwd.Packets,
           Total.Length.of.Bwd.Packets, Fwd.Packet.Length.Max, Fwd.Packet.Length.Min, Fwd.Packet.Length.Mean,
           Fwd.Packet.Length.Std, Bwd.Packet.Length.Max, Bwd.Packet.Length.Min, Bwd.Packet.Length.Mean,
           Bwd.Packet.Length.Std, Flow.Bytes.s, Flow.Packets.s, Flow.IAT.Mean, Flow.IAT.Std, Flow.IAT.Max, Flow.IAT.Min,
           Fwd.IAT.Total, Fwd.IAT.Mean, Fwd.IAT.Std, Fwd.IAT.Max, Fwd.IAT.Min, Bwd.IAT.Total, Bwd.IAT.Mean, Bwd.IAT.Std,
           Bwd.IAT.Max, Bwd.IAT.Min, Fwd.Header.Length, Bwd.Header.Length, Min.Packet.Length, Max.Packet.Length,
           Packet.Length.Mean, Packet.Length.Std, Packet.Length.Variance, FIN.Flag.Count, SYN.Flag.Count, RST.Flag.Count,
           PSH.Flag.Count, ACK.Flag.Count, URG.Flag.Count, CWE.Flag.Count, ECE.Flag.Count, Down.Up.Ratio, Average.Packet.Size,
           Avg.Fwd.Segment.Size, Avg.Bwd.Segment.Size, Fwd.Header.Length.1, Init_Win_bytes_forward, Init_Win_bytes_backward,
           act_data_pkt_fwd, min_seg_size_forward, Active.Mean, Active.Std, Active.Max, Active.Min, Idle.Mean, Idle.Std,
           Idle.Max, Idle.Min, Fwd.Packets.s, Bwd.Packets.s, Fwd.Avg.Bytes.Bulk, Fwd.Avg.Packets.Bulk,
           Fwd.Avg.Bulk.Rate, Bwd.Avg.Bytes.Bulk, Bwd.Avg.Packets.Bulk, Bwd.Avg.Bulk.Rate, Subflow.Fwd.Packets,
           Subflow.Fwd.Bytes, Subflow.Bwd.Packets, Subflow.Bwd.Bytes
Categorical: L7Protocol, ProtocolName
Binary: Fwd.PSH.Flags, Bwd.PSH.Flags, Fwd.URG.Flags, Bwd.URG.Flags
'''

'''
Possible Problems:
strong correlation between similar variables
'''


'\nPossible Problems:\nstrong correlation between similar variables\n'

In [17]:
dt.head()

Unnamed: 0,Flow.ID,Source.IP,Source.Port,Destination.IP,Destination.Port,Protocol,Timestamp,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std,Flow.Bytes.s,Flow.Packets.s,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max,Fwd.IAT.Min,Bwd.IAT.Total,Bwd.IAT.Mean,Bwd.IAT.Std,Bwd.IAT.Max,Bwd.IAT.Min,Fwd.PSH.Flags,Bwd.PSH.Flags,Fwd.URG.Flags,Bwd.URG.Flags,Fwd.Header.Length,Bwd.Header.Length,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count,SYN.Flag.Count,RST.Flag.Count,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size,Avg.Bwd.Segment.Size,Fwd.Header.Length.1,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes,Subflow.Bwd.Packets,Subflow.Bwd.Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,Label,L7Protocol,ProtocolName
0,172.19.1.46-10.200.7.7-52422-3128-6,172.19.1.46,52422,10.200.7.7,3128,6,26/04/201711:11:17,45523,22,55,132,110414.0,6,6,6.0,0.0,4380,1187,2007.527273,768.481689,2428355.0,1691.453,598.986842,816.061346,3880.0,1,45523.0,2167.761905,1319.384512,5988.0,698.0,41178.0,762.555556,1230.34822,5133.0,1.0,0,0,0,0,440,1100,483.2722,1208.18048,6,4380,1417.333333,1121.579194,1257940.0,0,0,0,0,1,0,0,0,2,1435.74026,6.0,2007.527273,440,0,0,0,0,0,0,22,132,55,110414,256,490,21,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
1,172.19.1.46-10.200.7.7-52422-3128-6,10.200.7.7,3128,172.19.1.46,52422,6,26/04/201711:11:17,1,2,0,12,0.0,6,6,6.0,0.0,0,0,0.0,0.0,12000000.0,2000000.0,1.0,0.0,1.0,1,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,40,0,2000000.0,0.0,6,6,6.0,0.0,0.0,0,0,0,0,1,1,0,0,0,9.0,6.0,0.0,40,0,0,0,0,0,0,2,12,0,0,490,-1,1,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY
2,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,1,3,0,674,0.0,337,0,224.666667,194.567041,0,0,0.0,0.0,674000000.0,3000000.0,0.5,0.707107,1.0,0,1.0,0.5,0.707107,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,96,0,3000000.0,0.0,0,337,252.75,168.5,28392.25,0,1,0,0,1,0,0,0,0,337.0,224.666667,0.0,96,0,0,0,0,0,0,3,674,0,0,888,-1,1,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
3,10.200.7.217-50.31.185.39-38848-80-6,50.31.185.39,80,10.200.7.217,38848,6,26/04/201711:11:17,217,1,3,0,0.0,0,0,0.0,0.0,0,0,0.0,0.0,0.0,18433.18,72.333333,62.660461,110.0,0,0.0,0.0,0.0,0.0,0.0,107.0,53.5,75.660426,107.0,0.0,0,0,0,0,32,96,4608.295,13824.884793,0,0,0.0,0.0,0.0,0,0,0,0,1,1,0,0,3,0.0,0.0,0.0,32,0,0,0,0,0,0,1,0,3,0,888,490,0,32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,7,HTTP
4,192.168.72.43-10.200.7.7-55961-3128-6,192.168.72.43,55961,10.200.7.7,3128,6,26/04/201711:11:17,78068,5,0,1076,0.0,529,6,215.2,286.458898,0,0,0.0,0.0,13782.86,64.04673,19517.0,25758.50235,54313.0,0,78068.0,19517.0,25758.50235,54313.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0,0,100,0,64.04673,0.0,6,529,267.5,286.458898,82058.7,0,1,0,0,1,0,0,0,0,321.0,215.2,0.0,100,0,0,0,0,0,0,5,1076,0,0,253,-1,4,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,BENIGN,131,HTTP_PROXY


In [5]:
dt.describe()

Unnamed: 0,Source.Port,Destination.Port,Protocol,Flow.Duration,Total.Fwd.Packets,Total.Backward.Packets,Total.Length.of.Fwd.Packets,Total.Length.of.Bwd.Packets,Fwd.Packet.Length.Max,Fwd.Packet.Length.Min,Fwd.Packet.Length.Mean,Fwd.Packet.Length.Std,Bwd.Packet.Length.Max,Bwd.Packet.Length.Min,Bwd.Packet.Length.Mean,Bwd.Packet.Length.Std,Flow.Bytes.s,Flow.Packets.s,Flow.IAT.Mean,Flow.IAT.Std,Flow.IAT.Max,Flow.IAT.Min,Fwd.IAT.Total,Fwd.IAT.Mean,Fwd.IAT.Std,Fwd.IAT.Max,Fwd.IAT.Min,Bwd.IAT.Total,Bwd.IAT.Mean,Bwd.IAT.Std,Bwd.IAT.Max,Bwd.IAT.Min,Fwd.PSH.Flags,Bwd.PSH.Flags,Fwd.URG.Flags,Bwd.URG.Flags,Fwd.Header.Length,Bwd.Header.Length,Fwd.Packets.s,Bwd.Packets.s,Min.Packet.Length,Max.Packet.Length,Packet.Length.Mean,Packet.Length.Std,Packet.Length.Variance,FIN.Flag.Count,SYN.Flag.Count,RST.Flag.Count,PSH.Flag.Count,ACK.Flag.Count,URG.Flag.Count,CWE.Flag.Count,ECE.Flag.Count,Down.Up.Ratio,Average.Packet.Size,Avg.Fwd.Segment.Size,Avg.Bwd.Segment.Size,Fwd.Header.Length.1,Fwd.Avg.Bytes.Bulk,Fwd.Avg.Packets.Bulk,Fwd.Avg.Bulk.Rate,Bwd.Avg.Bytes.Bulk,Bwd.Avg.Packets.Bulk,Bwd.Avg.Bulk.Rate,Subflow.Fwd.Packets,Subflow.Fwd.Bytes,Subflow.Bwd.Packets,Subflow.Bwd.Bytes,Init_Win_bytes_forward,Init_Win_bytes_backward,act_data_pkt_fwd,min_seg_size_forward,Active.Mean,Active.Std,Active.Max,Active.Min,Idle.Mean,Idle.Std,Idle.Max,Idle.Min,L7Protocol
count,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0,3577296.0
mean,37999.38,12042.46,6.005508,25442470.0,62.37799,65.34083,46833.23,84457.42,512.3645,9.340408,114.9212,152.0501,1103.231,11.13491,254.7845,289.8878,4048709.0,88963.38,1422201.0,3365395.0,12850200.0,88702.01,24187960.0,3124467.0,3649620.0,12096240.0,1271532.0,21104510.0,2476877.0,2932460.0,9830803.0,888999.1,0.1720414,0.0,0.0,0.0,1653.339,1743.621,77058.16,11905.22,3.043745,1333.25,198.8191,303.519,279273.6,0.007037159,0.1720414,0.0006655865,0.405821,0.5995705,0.2773847,0.0,0.0006566412,0.9085471,207.563,114.9212,254.7845,1653.339,0.0,0.0,0.0,0.0,0.0,0.0,62.37799,46833.23,65.34083,84457.42,8984.691,2123.489,45.03535,25.69738,298199.0,183640.6,522937.2,167633.6,8524211.0,1370991.0,9743845.0,7252097.0,102.9508
std,22017.13,20449.16,0.3274574,40144300.0,1094.086,1108.092,1816196.0,2124319.0,1039.319,82.99983,246.4707,240.4702,2352.374,105.5422,506.0731,485.3004,75510400.0,402762.0,3550414.0,6260959.0,20765180.0,1605272.0,39625630.0,8358652.0,7390979.0,20491800.0,7279117.0,38626340.0,7578111.0,6666650.0,18835210.0,6231082.0,0.3774165,0.0,0.0,0.0,30088.9,30391.9,368315.3,108020.6,41.45472,2453.395,332.7427,432.6083,725860.8,0.0835921,0.3774165,0.02579038,0.4910503,0.4899855,0.447708,0.0,0.0256166,1.269945,343.227,246.4707,506.0731,30088.9,0.0,0.0,0.0,0.0,0.0,0.0,1094.086,1816196.0,1108.092,2124319.0,14101.26,7704.789,974.8192,6.025989,2349390.0,1325838.0,3266508.0,2064219.0,17065680.0,4814474.0,18885570.0,16007540.0,51.29198
min,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01666667,0.2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008333337,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,3697.0,443.0,6.0,628.0,2.0,1.0,12.0,0.0,6.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,18.82429,1.128096,415.0,8.485281,570.0,0.0,7.0,5.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.0,32.0,0.5417242,0.1009873,0.0,6.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,6.0,0.0,40.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,12.0,1.0,0.0,411.0,18.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,91.0
50%,49377.0,3128.0,6.0,584729.5,6.0,5.0,443.0,208.0,206.0,0.0,46.57143,74.21124,81.0,0.0,30.14286,32.42474,1140.944,33.93752,33202.38,68364.44,281239.5,1.0,389264.5,37006.79,47175.96,207629.0,0.0,181562.5,15587.65,26175.95,95181.5,0.0,0.0,0.0,0.0,0.0,152.0,136.0,15.63422,2.951696,0.0,355.0,62.83333,106.9828,11445.31,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,66.5,46.57143,30.14286,152.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,443.0,5.0,208.0,5840.0,262.0,2.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126.0
75%,53799.0,3128.0,6.0,45001530.0,15.0,15.0,1769.0,3629.0,613.0,6.0,122.5,207.9035,1366.0,0.0,256.75,423.2105,23437.5,4214.963,936657.6,3980748.0,23915460.0,33.0,40011610.0,1549711.0,2932647.0,19269760.0,92.0,14976530.0,334214.2,752634.2,7508778.0,1.0,0.0,0.0,0.0,0.0,392.0,420.0,2164.502,83.44459,6.0,1460.0,250.0,481.8125,232143.2,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,263.7184,122.5,256.75,392.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,1769.0,15.0,3629.0,14600.0,660.0,9.0,32.0,45.0,0.0,57.0,2.0,7506747.0,0.0,8034389.0,5369712.0,130.0
max,65534.0,65534.0,17.0,120000000.0,453190.0,542196.0,678023600.0,1345796000.0,32832.0,16060.0,16060.0,6225.487,37648.0,13032.0,13032.0,8434.804,14396000000.0,6000000.0,120000000.0,84852730.0,120000000.0,120000000.0,120000000.0,120000000.0,84852560.0,120000000.0,120000000.0,120000000.0,119999900.0,84852750.0,119999900.0,119999900.0,1.0,0.0,0.0,0.0,15439500.0,12844400.0,6000000.0,5000000.0,7063.0,37648.0,10708.67,9268.781,85910310.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,293.0,16063.0,16060.0,13032.0,15439500.0,0.0,0.0,0.0,0.0,0.0,0.0,453190.0,678023600.0,542196.0,1345796000.0,65535.0,65535.0,328694.0,523.0,114695000.0,72971360.0,114695000.0,114695000.0,120000000.0,77387460.0,120000000.0,120000000.0,222.0


In [6]:
dt.columns

Index(['Flow.ID', 'Source.IP', 'Source.Port', 'Destination.IP',
       'Destination.Port', 'Protocol', 'Timestamp', 'Flow.Duration',
       'Total.Fwd.Packets', 'Total.Backward.Packets',
       'Total.Length.of.Fwd.Packets', 'Total.Length.of.Bwd.Packets',
       'Fwd.Packet.Length.Max', 'Fwd.Packet.Length.Min',
       'Fwd.Packet.Length.Mean', 'Fwd.Packet.Length.Std',
       'Bwd.Packet.Length.Max', 'Bwd.Packet.Length.Min',
       'Bwd.Packet.Length.Mean', 'Bwd.Packet.Length.Std', 'Flow.Bytes.s',
       'Flow.Packets.s', 'Flow.IAT.Mean', 'Flow.IAT.Std', 'Flow.IAT.Max',
       'Flow.IAT.Min', 'Fwd.IAT.Total', 'Fwd.IAT.Mean', 'Fwd.IAT.Std',
       'Fwd.IAT.Max', 'Fwd.IAT.Min', 'Bwd.IAT.Total', 'Bwd.IAT.Mean',
       'Bwd.IAT.Std', 'Bwd.IAT.Max', 'Bwd.IAT.Min', 'Fwd.PSH.Flags',
       'Bwd.PSH.Flags', 'Fwd.URG.Flags', 'Bwd.URG.Flags', 'Fwd.Header.Length',
       'Bwd.Header.Length', 'Fwd.Packets.s', 'Bwd.Packets.s',
       'Min.Packet.Length', 'Max.Packet.Length', 'Packet.Length.Mean',
  

In [7]:
# True when no column contains a missing value
not dt.isnull().values.any()

True

In [16]:
# Collect columns that hold a single distinct value; they carry no information
remv = []
for col in dt.columns:
    num = dt[col].nunique()
    print(col, num)
    if num == 1:
        remv.append(col)

Flow.ID 1522917
Source.IP 6566
Source.Port 40519
Destination.IP 22824
Destination.Port 34811
Protocol 3
Timestamp 41915
Flow.Duration 2044412
Total.Fwd.Packets 8530
Total.Backward.Packets 8905
Total.Length.of.Fwd.Packets 104568
Total.Length.of.Bwd.Packets 223090
Fwd.Packet.Length.Max 7737
Fwd.Packet.Length.Min 1761
Fwd.Packet.Length.Mean 422885
Fwd.Packet.Length.Std 1220764
Bwd.Packet.Length.Max 12167
Bwd.Packet.Length.Min 1491
Bwd.Packet.Length.Mean 561062
Bwd.Packet.Length.Std 1224601
Flow.Bytes.s 2386542
Flow.Packets.s 2352220
Flow.IAT.Mean 2329076
Flow.IAT.Std 2450818
Flow.IAT.Max 1724972
Flow.IAT.Min 104891
Fwd.IAT.Total 1896396
Fwd.IAT.Mean 2091594
Fwd.IAT.Std 2167212
Fwd.IAT.Max 1642105
Fwd.IAT.Min 175655
Bwd.IAT.Total 1657080
Bwd.IAT.Mean 1986588
Bwd.IAT.Std 2107991
Bwd.IAT.Max 1298206
Bwd.IAT.Min 149050
Fwd.PSH.Flags 2
Bwd.PSH.Flags 1
Fwd.URG.Flags 1
Bwd.URG.Flags 1
Fwd.Header.Length 21914
Bwd.Header.Length 22977
Fwd.Packets.s 2293417
Bwd.Packets.s 2190615
Min.Packet.Length 10
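
The loop above collects single-valued columns in `remv`, but the notebook does not show them being removed. A minimal sketch of dropping them is given here on a toy frame; `toy`, `toy_reduced`, and the values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame: 'Bwd.PSH.Flags' is constant, matching the nunique() listing for the real data.
toy = pd.DataFrame({
    "Fwd.PSH.Flags": [0, 1, 0],
    "Bwd.PSH.Flags": [0, 0, 0],
    "Protocol": [6, 17, 6],
})

# nunique() per column; a count of 1 means the column never varies.
n_unique = toy.nunique()
remv = n_unique[n_unique == 1].index.tolist()

# drop() returns a new frame, so the original stays available for comparison.
toy_reduced = toy.drop(columns=remv)
print(remv, toy_reduced.shape)
```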

## Question 1

Create a scatter plot presenting network activity.

In [8]:
def countsummary(dt, variable):
    """Print how often the most and least common values of `variable` occur."""
    if variable == 'Source.IP':
        # Count rows per source address through a different column (Flow.ID)
        counts = dt.groupby('Source.IP')['Flow.ID'].count()
        print("Max frequency of Source.IP is", counts.max())
        print("Min frequency of Source.IP is", counts.min())
    else:
        # The data has no missing values, so any column can serve as the counted one
        counts = dt.groupby(variable)['Source.IP'].count()
        print("Max frequency of", variable, " is : ", counts.max())
        print("Min frequency of", variable, " is : ", counts.min())
    print("------------------------------------------------------------")

In [9]:
for col in dt.columns:
    countsummary(dt=dt, variable=col)

Max frequency of Flow.ID  is :  393
Min frequency of Flow.ID  is :  1
------------------------------------------------------------
Max frequency of Source.IP is 295431
Min frequency of Source.IP is 1
------------------------------------------------------------
Max frequency of Source.Port  is :  601996
Min frequency of Source.Port  is :  1
------------------------------------------------------------
Max frequency of Destination.IP  is :  323161
Min frequency of Destination.IP  is :  1
------------------------------------------------------------
Max frequency of Destination.Port  is :  1432474
Min frequency of Destination.Port  is :  1
------------------------------------------------------------
Max frequency of Protocol  is :  3572975
Min frequency of Protocol  is :  1637
------------------------------------------------------------
Max frequency of Timestamp  is :  1512
Min frequency of Timestamp  is :  1
------------------------------------------------------------
Max frequency of Flo

Max frequency of ECE.Flag.Count  is :  3574947
Min frequency of ECE.Flag.Count  is :  2349
------------------------------------------------------------
Max frequency of Down.Up.Ratio  is :  1573265
Min frequency of Down.Up.Ratio  is :  1
------------------------------------------------------------
Max frequency of Average.Packet.Size  is :  513840
Min frequency of Average.Packet.Size  is :  1
------------------------------------------------------------
Max frequency of Avg.Fwd.Segment.Size  is :  823072
Min frequency of Avg.Fwd.Segment.Size  is :  1
------------------------------------------------------------
Max frequency of Avg.Bwd.Segment.Size  is :  1180736
Min frequency of Avg.Bwd.Segment.Size  is :  1
------------------------------------------------------------
Max frequency of Fwd.Header.Length.1  is :  413952
Min frequency of Fwd.Header.Length.1  is :  1
------------------------------------------------------------
Max frequency of Fwd.Avg.Bytes.Bulk  is :  3577296
Min frequency
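
Question 1 asks for a scatter plot, which the frequency summary above does not draw. One possible sketch follows, under the assumption that "network activity" can be read as the number of flows and the bytes moved per one-minute bucket. The frame `flows` is a made-up stand-in for a few rows of `dt` with `Timestamp` already parsed; only the column names come from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in rows; timestamps and byte counts are illustrative only.
flows = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2017-04-26 11:11:17", "2017-04-26 11:11:40",
        "2017-04-26 11:12:05", "2017-04-26 11:14:30",
    ]),
    "Total.Length.of.Fwd.Packets": [132, 12, 674, 1076],
    "Total.Length.of.Bwd.Packets": [110414, 0, 0, 0],
})
flows["bytes"] = flows["Total.Length.of.Fwd.Packets"] + flows["Total.Length.of.Bwd.Packets"]

# Aggregate per minute: number of flows and total bytes in each bucket.
activity = (
    flows.set_index("Timestamp")
    .resample("1min")["bytes"]
    .agg(["count", "sum"])
    .rename(columns={"count": "n_flows", "sum": "total_bytes"})
)
activity = activity[activity["n_flows"] > 0]  # skip minutes without traffic

# One point per minute: x = time, y = flows, marker area grows with bytes moved.
fig, ax = plt.subplots(figsize=(8, 4))
ax.scatter(activity.index, activity["n_flows"],
           s=10 + activity["total_bytes"] / 1000, alpha=0.6)
ax.set_xlabel("Time (1-minute buckets)")
ax.set_ylabel("Flows per minute")
ax.set_title("Network activity")
```

Aggregating before plotting keeps the number of points manageable; plotting all 3,577,296 flows individually would mostly produce overlapping markers.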


## Question 2

Identify the type of transactions and the volume and type of transfers, based on the packets transferred between hosts. Plot a graph or give insight into what is actually happening: for example, is the traffic mostly shopping transactions, search transactions, or video streaming?
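
One way to approach this question, assuming the ntopng-derived `ProtocolName` column is an acceptable proxy for the type of transaction, is to count flows and sum transferred bytes per application. The sketch below runs on made-up rows; `EXAMPLE_VIDEO` is a placeholder label, not a value confirmed to exist in the dataset.

```python
import pandas as pd

# Made-up stand-in rows; only the column names come from the dataset.
flows = pd.DataFrame({
    "ProtocolName": ["HTTP_PROXY", "HTTP", "HTTP_PROXY", "EXAMPLE_VIDEO"],
    "Total.Length.of.Fwd.Packets": [132, 674, 1076, 500],
    "Total.Length.of.Bwd.Packets": [110414, 0, 0, 90000],
})

# Bytes moved by a flow = forward payload + backward payload.
flows["bytes"] = flows[["Total.Length.of.Fwd.Packets",
                        "Total.Length.of.Bwd.Packets"]].sum(axis=1)

# Per application: how many flows, how many bytes, and the share of all bytes.
summary = (
    flows.groupby("ProtocolName")["bytes"]
    .agg(n_flows="count", total_bytes="sum")
    .sort_values("total_bytes", ascending=False)
)
summary["share_of_bytes"] = summary["total_bytes"] / summary["total_bytes"].sum()
print(summary)
```

Calling `summary["total_bytes"].plot.barh()` would turn the table into a chart. Comparing `n_flows` with `total_bytes` helps separate chatty, low-volume applications from bulk transfers such as streaming.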

## Question 3

Find the number of hosts in each subnet or group. Identify subnets or domains and group the data accordingly.

Reference: 
https://support.microsoft.com/en-au/help/164015/understanding-tcp-ip-addressing-and-subnetting-basics

An IPv4 address is a 32-bit number that uniquely identifies a host (a computer or other device, such as a printer or router) on a TCP/IP network. For routing to work, an IP address has two parts: the first part is used as the network address and the last part as the host address.
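
To make the network/host split concrete, here is a small sketch using Python's standard `ipaddress` module. The address is taken from the first row of the dataset; the /24 prefix length is an assumption chosen for illustration, since the capture does not record subnet masks.

```python
import ipaddress

# Address from the dataset's first row; the /24 mask is an assumption.
iface = ipaddress.ip_interface("172.19.1.46/24")

# Network part: the address with all host bits cleared.
network = iface.network

# Host part: the bits of the address that the mask does not cover.
host_part = int(iface.ip) & int(iface.network.hostmask)

print(network, host_part, iface.ip.is_private)
```

With a /24 prefix the network part is the first three octets and the host part is the last octet.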


In [35]:
# Treat every address as part of a /24 network: the first three octets
# name the network, the last octet names the host within it.
dt_ip = pd.DataFrame()
dt_ip[["source_network", "source_ip"]] = dt["Source.IP"].str.rsplit(".", n=1, expand=True)
dt_ip[["destination_network", "destination_ip"]] = dt["Destination.IP"].str.rsplit(".", n=1, expand=True)
dt_ip = dt_ip[["source_network", "source_ip", "destination_network", "destination_ip"]]

In [44]:
dt_ip.groupby("source_network")["source_ip"].nunique()

source_network
10.120.1      1
10.130.1      1
10.130.10     6
10.130.12     3
10.130.13     3
             ..
98.139.21     1
98.139.225    3
98.142.102    1
98.158.99     2
99.198.117    1
Name: source_ip, Length: 2999, dtype: int64

In [45]:
dt_ip.groupby("destination_network")["destination_ip"].nunique()

destination_network
10.11.12      1
10.120.1      1
10.130.0      1
10.130.10     5
10.130.11     1
             ..
98.158.146    1
98.158.99     1
99.198.101    1
99.198.108    1
99.198.117    2
Name: destination_ip, Length: 10680, dtype: int64
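
The grouping above fixes the prefix at three octets (/24) and counts source and destination hosts separately. The following is a hedged alternative sketch that takes the prefix length as a parameter and counts each distinct host once per subnet, regardless of direction. `hosts_per_subnet` is a hypothetical helper, not part of the original analysis; the addresses are those visible in the first rows of the dataset.

```python
import ipaddress
from collections import defaultdict

def hosts_per_subnet(addresses, prefix_len=24):
    """Map each network (as a string) to the number of distinct hosts seen in it."""
    hosts = defaultdict(set)
    for addr in addresses:
        # strict=False clears the host bits instead of raising an error.
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        hosts[str(net)].add(addr)
    return {net: len(members) for net, members in hosts.items()}

# Source and destination addresses from the first rows of the dataset.
seen = ["172.19.1.46", "10.200.7.7", "10.200.7.217",
        "50.31.185.39", "192.168.72.43", "10.200.7.7"]

by_24 = hosts_per_subnet(seen)        # same granularity as the three-octet split
by_16 = hosts_per_subnet(seen, 16)    # coarser grouping
print(by_24)
print(by_16)
```

On the full data, passing the de-duplicated addresses, for example `pd.concat([dt["Source.IP"], dt["Destination.IP"]]).unique()`, keeps the Python loop short relative to the 3,577,296 rows.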