# DS/CMPSC 410 MiniProject Deliverable #2

# Fall 2024
### Instructor: Prof. John Yen
### TA: Jin, Peng and Al Lawati, Ali Hussain Mohsin

### Learning Objectives
- Be able to represent ports scanned by scanners as binary features using One Hot Encoding
- Be able to apply k-means clustering to cluster the scanners based on the set of ports they scanned. 
- Be able to identify the set of top k ports for one-hot encoding ports scanned.
- Be able to interpret the results of clustering using cluster centers.
- After successfully clustering the small Darknet dataset, conduct clustering on the large Darknet dataset (running Spark in cluster mode).
- Be able to evaluate the result of k-means clustering (cluster mode) using Silhouette score and Mirai labels.
- Be able to use .persist() and .unpersist() to improve the scalability/efficiency of PySpark code.

### Items for Submission: 
- Completed Jupyter Notebook for local mode (HTML format)
- .py file for successful execution in cluster mode 
- log file (including execution time information) for successful execution in cluster mode
- The csv file (generated in cluster mode) for Mirai Ratio and Cluster Centers for all clusters
- The csv file (generated in cluster mode) for sorted count of scanners that scan the same number of ports
- The first data file (i.e., part-00000) (generated in cluster mode) 

In [1]:
import pyspark
import csv

In [2]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType, IntegerType, DecimalType, BooleanType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql.functions import array_contains
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

In [3]:
ss = SparkSession.builder.master("local").appName("MiniProject 2 k-means Clustering using OHE").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/04 17:40:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
ss.sparkContext.setLogLevel("WARN")

In [5]:
# Explicit schema for the scanner profile CSV
scanner_schema = StructType([
    StructField("_c0", IntegerType(), False),
    StructField("id", IntegerType(), False),
    StructField("numports", IntegerType(), False),
    StructField("lifetime", DecimalType(), False),
    StructField("Bytes", IntegerType(), False),
    StructField("Packets", IntegerType(), False),
    StructField("average_packetsize", IntegerType(), False),
    StructField("MinUniqueDests", IntegerType(), False),
    StructField("MaxUniqueDests", IntegerType(), False),
    StructField("MinUniqueDest24s", IntegerType(), False),
    StructField("MaxUniqueDest24s", IntegerType(), False),
    StructField("average_lifetime", DecimalType(), False),
    StructField("mirai", BooleanType(), True),
    StructField("zmap", BooleanType(), True),
    StructField("masscan", BooleanType(), True),
    StructField("country", StringType(), False),
    StructField("traffic_types_scanned_str", StringType(), False),
    StructField("ports_scanned_str", StringType(), False),
    StructField("host_tags_per_censys", StringType(), False),
    StructField("host_services_per_censys", StringType(), False),
])

In [6]:
Scanners_df = ss.read.csv("sampled_profile.csv", schema= scanner_schema, header= True, inferSchema=False )

In [7]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: decimal(10,0) (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: decimal(10,0) (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



## Q: What groups of scanners are similar in the ports they scan?

### Because the feature `numports` records the total number of ports scanned by each scanner, we can use it to separate one-port scanners from multi-port scanners.

In [8]:
one_port_scanners = Scanners_df.where(col('numports') == 1)

In [9]:
one_port_scanners.show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|ports_scanned_str|host_tags_per_censys|host_services_per_censys|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+
|1645181|1645181|       1|       0|   60|      1|                60|             1|             1|               1|               1|               0|false|false|  false|     BR|                   

24/11/04 17:40:58 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , id, numports, lifetime, Bytes, Packets, average_packetsize, MinUniqueDests, MaxUniqueDests, MinUniqueDest24s, MaxUniqueDest24s, average_lifetime, mirai, zmap, masscan, country, traffic_types_scanned_str, ports_scanned_str, host_tags_per_censys, host_services_per_censys
 Schema: _c0, id, numports, lifetime, Bytes, Packets, average_packetsize, MinUniqueDests, MaxUniqueDests, MinUniqueDest24s, MaxUniqueDest24s, average_lifetime, mirai, zmap, masscan, country, traffic_types_scanned_str, ports_scanned_str, host_tags_per_censys, host_services_per_censys
Expected: _c0 but found: 
CSV file: file:///storage/home/dfw5416/MiniProj2/sampled_profile.csv


In [10]:
multi_port_scanners = Scanners_df.where(col("numports") > 1)

In [11]:
multi_port_scanners_count = multi_port_scanners.count()

In [12]:
print(multi_port_scanners_count)

73663


In [13]:
ScannersCount_byNumPorts = multi_port_scanners.groupby("numports").count()

In [14]:
ScannersCount_byNumPorts.show(3)

+--------+-----+
|numports|count|
+--------+-----+
|    1238|    1|
|      31|   33|
|   31161|    1|
+--------+-----+
only showing top 3 rows



                                                                                

In [15]:
SortedScannersCount_byNumPorts= ScannersCount_byNumPorts.orderBy("count", ascending=False)

In [16]:
output1 = "local/SortedScannersCount_byNumPorts.csv"
SortedScannersCount_byNumPorts.write.option("header", True).csv(output1)

                                                                                

In [17]:
ScannersCount_byNumPorts.where(col("count")==1).show(10)

+--------+-----+
|numports|count|
+--------+-----+
|    1238|    1|
|   31161|    1|
|      85|    1|
|    2247|    1|
|     362|    1|
|   35133|    1|
|    6419|    1|
|     115|    1|
|   36794|    1|
|    4843|    1|
+--------+-----+
only showing top 10 rows



### We noticed that some of the scanners that scan a very large number of ports (we call them Extreme Scanners) are unique in the number of ports they scan.

In [18]:
non_rare_NumPorts = SortedScannersCount_byNumPorts.where(col("count") > 1)

In [19]:
non_rare_NumPorts.show(3)

+--------+-----+
|numports|count|
+--------+-----+
|       2|24114|
|       3|16206|
|       4| 5952|
+--------+-----+
only showing top 3 rows



# A DataFrame can aggregate a column using .agg({ "column name" : "operator name" })
## We can find the maximum of the numports column using "max" as the aggregation operator.

In [20]:
max_non_rare_NumPorts_df = non_rare_NumPorts.agg({"numports" : "max"})
max_non_rare_NumPorts_df.show()

+-------------+
|max(numports)|
+-------------+
|          654|
+-------------+



In [21]:
max_non_rare_NumPorts_rdd = max_non_rare_NumPorts_df.rdd.map(lambda x: x[0])
max_non_rare_NumPorts_rdd.take(2)

                                                                                

[654]

In [22]:
max_non_rare_NumPorts_list = max_non_rare_NumPorts_rdd.collect()
print(max_non_rare_NumPorts_list)

[654]


In [23]:
max_non_rare_NumPorts=max_non_rare_NumPorts_list[0]
print(max_non_rare_NumPorts)

654


## We are going to focus on grouping the scanners that scan at least two ports but do not scan an extremely large number of ports. We will call these scanners Non-extreme Multi-port Scanners.
## We will save the extreme scanners in a CSV file so that we can process them separately.
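
The selection logic described above can be sketched in plain Python, without Spark, on a handful of made-up `numports` values. The toy list and the helper name `split_scanners` are illustrative assumptions, not part of the project data or code. The sketch counts how many scanners share each `numports` value, takes the largest value shared by more than one scanner as the threshold, and treats every scanner above that threshold as extreme.

```python
from collections import Counter

def split_scanners(numports_values):
    """Split scanners into one-port, non-extreme multi-port, and extreme groups.

    Hypothetical helper that mirrors the notebook's logic on a plain list.
    """
    multi = [n for n in numports_values if n > 1]
    counts = Counter(multi)  # numports value -> number of scanners with that value
    # Largest numports value shared by more than one scanner
    # (raises ValueError if no value is shared at all)
    threshold = max(n for n, c in counts.items() if c > 1)
    one_port = [n for n in numports_values if n == 1]
    non_extreme = [n for n in multi if n <= threshold]
    extreme = [n for n in multi if n > threshold]
    return threshold, one_port, non_extreme, extreme

# Toy data: 2 and 6 are shared values; 3 and 5 are rare but below the threshold
toy_numports = [1, 1, 2, 2, 3, 5, 6, 6, 1238, 31161]
threshold, one_port, non_extreme, extreme = split_scanners(toy_numports)
print(threshold, non_extreme, extreme)
```

Note that a rare value below the threshold (3 and 5 in the toy data) stays in the non-extreme group, which matches the `numports <= max_non_rare_NumPorts` filter used in the notebook.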

In [24]:
extreme_scanners = Scanners_df.where(col("numports") > max_non_rare_NumPorts)

In [25]:
path2="local/Extreme_Scanners.csv"
extreme_scanners.write.option("header",True).csv(path2)


In [26]:
non_extreme_multi_port_scanners = Scanners_df.where(col("numports") <= max_non_rare_NumPorts).where(col("numports") > 1)

In [27]:
non_extreme_multi_port_scanners.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: decimal(10,0), Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: decimal(10,0), mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string]

In [28]:
non_extreme_multi_port_scanners.count()

73599

In [29]:
non_extreme_multi_port_scanners.select("ports_scanned_str").show(4)

+--------------------+
|   ports_scanned_str|
+--------------------+
|         17128-17136|
|17128-17130-17132...|
|23-80-81-1023-232...|
|17128-17132-17136...|
+--------------------+
only showing top 4 rows



In [30]:
# (a)
NEMP_Scanners_df=non_extreme_multi_port_scanners.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )
NEMP_Scanners_df.show(2)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+
|2091467|2091467|       2|     200|  752|     12|                62|             1|             1|               1|         

# We will need to use NEMP_Scanners_df multiple times in creating OHE features later, so we persist it.

In [31]:
NEMP_Scanners_df.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: decimal(10,0), Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: decimal(10,0), mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>]

In [32]:
Ports_Scanned_RDD = NEMP_Scanners_df.select("Ports_Array").rdd

In [33]:
Ports_Scanned_RDD.take(5)

24/11/04 17:41:12 WARN BlockManager: Task 25 already completed, not releasing lock for rdd_99_0
                                                                                

[Row(Ports_Array=['17128', '17136']),
 Row(Ports_Array=['17128', '17130', '17132', '17134', '17136', '17138', '17140']),
 Row(Ports_Array=['23', '80', '81', '1023', '2323', '5555', '7574', '8080', '8443', '37215', '49152', '52869']),
 Row(Ports_Array=['17128', '17132', '17136', '17140', '17142', '34230']),
 Row(Ports_Array=['137', '17130'])]

In [34]:
Ports_Scanned_RDD.take(3)

24/11/04 17:41:12 WARN BlockManager: Task 26 already completed, not releasing lock for rdd_99_0


[Row(Ports_Array=['17128', '17136']),
 Row(Ports_Array=['17128', '17130', '17132', '17134', '17136', '17138', '17140']),
 Row(Ports_Array=['23', '80', '81', '1023', '2323', '5555', '7574', '8080', '8443', '37215', '49152', '52869'])]

In [35]:
Ports_list_RDD = Ports_Scanned_RDD.map(lambda row: row.Ports_Array )

In [36]:
Ports_list_RDD.take(3)

24/11/04 17:41:13 WARN BlockManager: Task 27 already completed, not releasing lock for rdd_99_0


[['17128', '17136'],
 ['17128', '17130', '17132', '17134', '17136', '17138', '17140'],
 ['23',
  '80',
  '81',
  '1023',
  '2323',
  '5555',
  '7574',
  '8080',
  '8443',
  '37215',
  '49152',
  '52869']]

In [37]:
flattened_Ports_list_RDD = Ports_list_RDD.flatMap(lambda x: x )

In [38]:
flattened_Ports_list_RDD.take(7)

24/11/04 17:41:13 WARN BlockManager: Task 28 already completed, not releasing lock for rdd_99_0


['17128', '17136', '17128', '17130', '17132', '17134', '17136']

In [39]:
Port_1_RDD = flattened_Ports_list_RDD.map(lambda x: (x, 1))
Port_1_RDD.take(7)

24/11/04 17:41:14 WARN BlockManager: Task 29 already completed, not releasing lock for rdd_99_0


[('17128', 1),
 ('17136', 1),
 ('17128', 1),
 ('17130', 1),
 ('17132', 1),
 ('17134', 1),
 ('17136', 1)]

In [40]:
Port_count_RDD = Port_1_RDD.reduceByKey(lambda x,y: x + y, 5)

In [41]:
Sorted_Count_Port_RDD = Port_count_RDD.map(lambda x: (x[1], x[0])).sortByKey(ascending = False)

                                                                                

In [42]:
Sorted_Count_Port_RDD.take(100)

                                                                                

[(25272, '17132'),
 (25134, '17130'),
 (25117, '17140'),
 (25074, '17128'),
 (25040, '17138'),
 (24928, '17136'),
 (18183, '17134'),
 (18166, '17142'),
 (13376, '80'),
 (13248, '8080'),
 (13126, '23'),
 (5808, '2323'),
 (4850, '81'),
 (4091, '1023'),
 (4063, '5555'),
 (4026, '52869'),
 (3989, '8443'),
 (3933, '49152'),
 (3866, '7574'),
 (3860, '37215'),
 (3483, '54594'),
 (3088, '34218'),
 (3049, '34220'),
 (3036, '33962'),
 (3034, '33968'),
 (3033, '34224'),
 (3024, '34228'),
 (3008, '33960'),
 (2977, '33964'),
 (2957, '34216'),
 (2946, '33970'),
 (2942, '34226'),
 (2932, '33972'),
 (2347, '50401'),
 (1835, '34222'),
 (1792, '34230'),
 (1790, '33966'),
 (1692, '33974'),
 (1092, '445'),
 (1078, '0'),
 (647, '22'),
 (571, '8291'),
 (531, '8728'),
 (421, '1433'),
 (302, '8000'),
 (293, '8081'),
 (282, '5353'),
 (275, '2004'),
 (256, '11211'),
 (245, '6881'),
 (245, '443'),
 (241, '8082'),
 (240, '4000'),
 (238, '5060'),
 (236, '8083'),
 (224, '8088'),
 (213, '6379'),
 (191, '9527'),
 (18

In [43]:
path3="local/sorted_top_ports_counts"
Sorted_Count_Port_RDD.saveAsTextFile(path3)

In [44]:
non_extreme_multi_port_scanners.unpersist()

DataFrame[_c0: int, id: int, numports: int, lifetime: decimal(10,0), Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: decimal(10,0), mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string]

In [45]:
top_ports= 100
Sorted_Ports_RDD= Sorted_Count_Port_RDD.map(lambda x: x[1] )
Top_Ports_list = Sorted_Ports_RDD.take(top_ports)

In [46]:
Top_Ports_list

['17132',
 '17130',
 '17140',
 '17128',
 '17138',
 '17136',
 '17134',
 '17142',
 '80',
 '8080',
 '23',
 '2323',
 '81',
 '1023',
 '5555',
 '52869',
 '8443',
 '49152',
 '7574',
 '37215',
 '54594',
 '34218',
 '34220',
 '33962',
 '33968',
 '34224',
 '34228',
 '33960',
 '33964',
 '34216',
 '33970',
 '34226',
 '33972',
 '50401',
 '34222',
 '34230',
 '33966',
 '33974',
 '445',
 '0',
 '22',
 '8291',
 '8728',
 '1433',
 '8000',
 '8081',
 '5353',
 '2004',
 '11211',
 '6881',
 '443',
 '8082',
 '4000',
 '5060',
 '8083',
 '8088',
 '6379',
 '9527',
 '30301',
 '7001',
 '9200',
 '7002',
 '1027',
 '1900',
 '3389',
 '5900',
 '21',
 '6380',
 '88',
 '35',
 '8181',
 '5000',
 '389',
 '56880',
 '5001',
 '137',
 '8008',
 '7547',
 '49153',
 '4444',
 '139',
 '2222',
 '8001',
 '3544',
 '8888',
 '5984',
 '2480',
 '53',
 '1883',
 '873',
 '631',
 '9000',
 '50070',
 '161',
 '4786',
 '60001',
 '8090',
 '27017',
 '85',
 '12866']
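
The counting steps above (split the port string, flatten, map each port to `(port, 1)`, reduce by key, sort by count) amount to a frequency count followed by a top-k selection. A minimal plain-Python sketch of the same idea follows. The helper name `top_k_ports` and the sample strings are illustrative assumptions, and breaking ties by port string is a choice made here for determinism (the RDD sort does not promise any particular order among equal counts).

```python
from collections import Counter

def top_k_ports(ports_strings, k):
    """Return the k most frequently scanned ports from '-'-separated strings."""
    counts = Counter()
    for s in ports_strings:
        # split + flatten + (port, 1) + reduce by key, all in one step
        counts.update(s.split("-"))
    # Sort by descending count; ties are broken by the port string
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [port for port, _ in ranked[:k]]

# Hypothetical sample in the same format as ports_scanned_str
sample = ["17128-17136", "17128-17130-17132", "23-80-17128", "80-17130"]
print(top_k_ports(sample, 3))  # ['17128', '17130', '80']
```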

#  A.2 One Hot Encoding of Top K Ports
## One-Hot-Encoded Feature/Column Name
Because each one-hot-encoded feature corresponds to one of the top k ports and needs its own column name, we adopt the convention that the column is named "PortXXXX", where "XXXX" is the port number. The name is built by concatenating two strings using ``+``.
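
As a rough plain-Python illustration of this naming convention and of the membership test behind one-hot encoding, the sketch below maps one port string to a dictionary of boolean features. The helper name `one_hot_encode_ports` and the three-port list are hypothetical. Splitting the string first matters: a substring test on the raw string would wrongly report port 80 for a scanner that only scanned 8080.

```python
def one_hot_encode_ports(ports_scanned_str, top_ports_list):
    """Map a '-'-separated port string to {"Port<port>": bool} over the top ports."""
    scanned = set(ports_scanned_str.split("-"))
    return {"Port" + port: (port in scanned) for port in top_ports_list}

top_ports_list = ["17132", "17130", "80"]  # illustrative subset of the top ports
row = one_hot_encode_ports("23-80-81-1023", top_ports_list)
print(row)  # {'Port17132': False, 'Port17130': False, 'Port80': True}
```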

In [47]:
Top_Ports_list[0]

'17132'

In [48]:
FeatureName = "Port"+Top_Ports_list[0]

In [49]:
FeatureName

'Port17132'

## One-Hot-Encoding using withColumn and array_contains

In [50]:
from pyspark.sql.functions import array_contains

In [58]:
NEMP_Scanners2_df = NEMP_Scanners_df.withColumn("Port"+Top_Ports_list[0], array_contains("Ports_Array", Top_Ports_list[0]))

In [59]:
NEMP_Scanners2_df.show(10)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|    _c0|     id|numports|lifetime|Bytes|Packets|average_packetsize|MinUniqueDests|MaxUniqueDests|MinUniqueDest24s|MaxUniqueDest24s|average_lifetime|mirai| zmap|masscan|country|traffic_types_scanned_str|   ports_scanned_str|host_tags_per_censys|host_services_per_censys|         Ports_Array|Port17132|
+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+
|2091467|2091467|       2|     200|  752|     12|                62|             1|           

In [63]:
First_top_port_scanners_count = NEMP_Scanners2_df.where(col("Port17132")== True).count()

In [64]:
print(First_top_port_scanners_count)

25272


In [65]:
Sorted_Count_Port_RDD.take(2)

[(25272, '17132'), (25134, '17130')]

In [66]:
top_ports

100

In [67]:
Top_Ports_list[top_ports - 1]

'12866'

In [69]:
for i in range(0, top_ports):
    # "Port" + Top_Ports_list[i] is the name of each new feature created by one-hot encoding Top_Ports_list
    NEMP_Scanners2_df = NEMP_Scanners2_df.withColumn("Port"+Top_Ports_list[i], array_contains("Ports_Array", Top_Ports_list[i]))

In [70]:
NEMP_Scanners2_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: decimal(10,0) (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: decimal(10,0) (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)
 |-- Ports_Array: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |

# Problem 6 (10 points)
## Complete the code below to use k-means (number of clusters = 200) to cluster non-extreme multi-port scanners using one-hot-encoded top 100 ports.
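
To give a feel for what k-means does with one-hot vectors, here is a minimal plain-Python sketch of Lloyd's iterations on five toy vectors. It is not how Spark implements `KMeans`: the sketch assumes fixed initial centers instead of the k-means|| initialization, stops only when assignments stop changing, and uses hypothetical names throughout.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=20):
    """Plain Lloyd's iterations: assign to the nearest center, then recompute means."""
    centers = [list(c) for c in initial_centers]
    assignments = []
    for _ in range(max_iter):
        # Assignment step: index of the closest center for each point
        new_assignments = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: each center becomes the mean of its members
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centers

# Toy one-hot vectors over three hypothetical ports
points = [(1, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1)]
assignments, centers = kmeans(points, initial_centers=[points[0], points[3]])
print(assignments)  # [0, 0, 0, 1, 1]
print(centers)
```

Because the inputs are 0/1 vectors, each resulting center component lies between 0 and 1.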

## Specify Parameters for k Means Clustering

In [71]:
input_features = [ ]
for i in range(0, top_ports ):
    input_features.append( "Port"+ Top_Ports_list[i] )

In [72]:
print(input_features)

['Port17132', 'Port17130', 'Port17140', 'Port17128', 'Port17138', 'Port17136', 'Port17134', 'Port17142', 'Port80', 'Port8080', 'Port23', 'Port2323', 'Port81', 'Port1023', 'Port5555', 'Port52869', 'Port8443', 'Port49152', 'Port7574', 'Port37215', 'Port54594', 'Port34218', 'Port34220', 'Port33962', 'Port33968', 'Port34224', 'Port34228', 'Port33960', 'Port33964', 'Port34216', 'Port33970', 'Port34226', 'Port33972', 'Port50401', 'Port34222', 'Port34230', 'Port33966', 'Port33974', 'Port445', 'Port0', 'Port22', 'Port8291', 'Port8728', 'Port1433', 'Port8000', 'Port8081', 'Port5353', 'Port2004', 'Port11211', 'Port6881', 'Port443', 'Port8082', 'Port4000', 'Port5060', 'Port8083', 'Port8088', 'Port6379', 'Port9527', 'Port30301', 'Port7001', 'Port9200', 'Port7002', 'Port1027', 'Port1900', 'Port3389', 'Port5900', 'Port21', 'Port6380', 'Port88', 'Port35', 'Port8181', 'Port5000', 'Port389', 'Port56880', 'Port5001', 'Port137', 'Port8008', 'Port7547', 'Port49153', 'Port4444', 'Port139', 'Port2222', 'Por

In [73]:
va = VectorAssembler().setInputCols(input_features).setOutputCol("features")

In [75]:
data= va.transform(NEMP_Scanners2_df)

In [76]:
data.show(1)

24/11/04 18:04:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+-----------------+--------------------+------------------------+--------------+---------+---------+---------+---------+---------+---------+---------+---------+------+--------+------+--------+------+--------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-----+------+--------+--------+--------+--------+--------+--------+--------+---------+--------+-------+--------+--------+--------+--------+--------+--------+--------+---------+--------+--------+--------+--------+--------+--------+--------+------+--------+------+------+--------+--------+-------+---------+--------+-------+--------+--------+-----

In [77]:
data.persist()

DataFrame[_c0: int, id: int, numports: int, lifetime: decimal(10,0), Bytes: int, Packets: int, average_packetsize: int, MinUniqueDests: int, MaxUniqueDests: int, MinUniqueDest24s: int, MaxUniqueDest24s: int, average_lifetime: decimal(10,0), mirai: boolean, zmap: boolean, masscan: boolean, country: string, traffic_types_scanned_str: string, ports_scanned_str: string, host_tags_per_censys: string, host_services_per_censys: string, Ports_Array: array<string>, Port17132: boolean, Port17130: boolean, Port17140: boolean, Port17128: boolean, Port17138: boolean, Port17136: boolean, Port17134: boolean, Port17142: boolean, Port80: boolean, Port8080: boolean, Port23: boolean, Port2323: boolean, Port81: boolean, Port1023: boolean, Port5555: boolean, Port52869: boolean, Port8443: boolean, Port49152: boolean, Port7574: boolean, Port37215: boolean, Port54594: boolean, Port34218: boolean, Port34220: boolean, Port33962: boolean, Port33968: boolean, Port34224: boolean, Port34228: boolean, Port33960: boo

In [78]:
km = KMeans(featuresCol= "features", predictionCol="prediction").setK(200).setSeed(123)
km.explainParams()

'distanceMeasure: the distance measure. Supported options: \'euclidean\' and \'cosine\'. (default: euclidean)\nfeaturesCol: features column name. (default: features, current: features)\ninitMode: The initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (default: k-means||)\ninitSteps: The number of steps for k-means|| initialization mode. Must be > 0. (default: 2)\nk: The number of clusters to create. Must be > 1. (default: 2, current: 200)\nmaxIter: max number of iterations (>= 0). (default: 20)\npredictionCol: prediction column name. (default: prediction, current: prediction)\nseed: random seed. (default: 5270608917868052676, current: 123)\ntol: the convergence tolerance for iterative algorithms (>= 0). (default: 0.0001)\nweightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (undefined)'

In [79]:
kmModel=km.fit(data)

                                                                                

In [80]:
kmModel

KMeansModel: uid=KMeans_cebc1ee7b551, k=200, distanceMeasure=euclidean, numFeatures=100

In [81]:
predictions = kmModel.transform(data)

In [82]:
predictions.show(3)

+-------+-------+--------+--------+-----+-------+------------------+--------------+--------------+----------------+----------------+----------------+-----+-----+-------+-------+-------------------------+--------------------+--------------------+------------------------+--------------------+---------+---------+---------+---------+---------+---------+---------+---------+------+--------+------+--------+------+--------+--------+---------+--------+---------+--------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-----+------+--------+--------+--------+--------+--------+--------+--------+---------+--------+-------+--------+--------+--------+--------+--------+--------+--------+---------+--------+--------+--------+--------+--------+--------+--------+------+--------+------+------+--------+--------+-------+---------+--------+-------+--------+-----

# Find The Size of The First Cluster

In [83]:
Cluster1_df=predictions.where(col("prediction")==0)

In [84]:
Cluster1_df.count()

819

In [85]:
summary = kmModel.summary

In [86]:
summary.clusterSizes

[819,
 2391,
 1465,
 195,
 256,
 429,
 6671,
 263,
 98,
 238,
 681,
 7162,
 958,
 971,
 558,
 131,
 949,
 107,
 157,
 363,
 84,
 1050,
 144,
 449,
 69,
 70,
 59,
 323,
 381,
 85,
 172,
 41,
 178,
 873,
 53,
 69,
 849,
 107,
 867,
 149,
 71,
 496,
 453,
 407,
 1334,
 97,
 364,
 322,
 54,
 238,
 551,
 48,
 304,
 1156,
 896,
 202,
 72,
 87,
 132,
 529,
 149,
 113,
 139,
 367,
 322,
 225,
 145,
 127,
 127,
 73,
 413,
 383,
 386,
 367,
 77,
 195,
 253,
 78,
 654,
 212,
 125,
 111,
 71,
 135,
 71,
 640,
 423,
 118,
 521,
 1084,
 70,
 762,
 193,
 552,
 130,
 330,
 203,
 402,
 86,
 110,
 308,
 406,
 123,
 453,
 60,
 70,
 239,
 206,
 628,
 223,
 329,
 273,
 102,
 263,
 109,
 91,
 424,
 64,
 188,
 264,
 270,
 316,
 113,
 753,
 639,
 111,
 37,
 66,
 28,
 66,
 377,
 180,
 81,
 71,
 143,
 275,
 298,
 693,
 593,
 159,
 105,
 278,
 664,
 95,
 366,
 415,
 296,
 104,
 107,
 50,
 682,
 252,
 112,
 126,
 185,
 131,
 281,
 99,
 212,
 88,
 247,
 160,
 237,
 91,
 194,
 90,
 270,
 41,
 104,
 1,
 320,
 219,
 

In [87]:
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)

                                                                                

In [88]:
print('Silhouette Score of the Clustering Result is ', silhouette)

Silhouette Score of the Clustering Result is  0.506527817721416
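
For intuition about this score, the sketch below computes a mean silhouette by hand on four toy points. It assumes squared Euclidean distance, which the PySpark documentation lists as the default `distanceMeasure` of `ClusteringEvaluator`. The function name and the toy data are illustrative, and the sketch needs at least two clusters. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the silhouette is `(b - a) / max(a, b)`.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def silhouette_score(points, labels):
    """Mean silhouette over all points, using squared Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so it does not affect the sum)
        a = sum(squared_distance(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(squared_distance(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated toy clusters
toy_points = [(0, 0), (0, 1), (5, 0), (5, 1)]
toy_labels = [0, 0, 1, 1]
score = silhouette_score(toy_points, toy_labels)
print(score)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.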


In [89]:
centers = kmModel.clusterCenters()

In [90]:
print(centers)

[array([0.        , 0.77289377, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.61660562, 0.        , 0.        ,
       0.001221  , 0.        , 0.        , 0.        , 0.003663  ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.06227106, 0.003663  , 0.04151404, 0.04639805, 0.04395604,
       0.02808303, 0.04029304, 0.        , 0.003663  , 0.03907204,
       0.002442  , 0.04395604, 0.00976801, 0.        , 0.01465201,
       0.00976801, 0.01465201, 0.01953602, 0.        , 0.        ,
       0.004884  , 0.        , 0.        , 0.01587302, 0.        ,
       0.        , 0.001221  , 0.        , 0.        , 0.        ,
       0.        , 0.001221  , 0.        , 0.        , 0.        ,
       0.004884  , 0.004884  , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.002442  , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.    

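- None of the scanners in the first cluster (cluster 0) scan Top_Ports_list[0], port 17132: the first component of its center is 0.
- About 77.3% of the scanners in the first cluster scan Top_Ports_list[1], port 17130: the second component of its center is 0.7729.
- No scanners in the first cluster scan Top_Ports_list[2], port 17140.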
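
The reason a center can be read as percentages is that each component of a k-means center over 0/1 features is the mean of that feature across the cluster's members, that is, the fraction of members that scan the port. For example, an entry of 0.001221 in a cluster of 819 scanners corresponds to a single scanner (1/819). A minimal sketch with hypothetical toy rows:

```python
def cluster_center(one_hot_rows):
    """Component-wise mean of 0/1 rows: the fraction of members with each feature."""
    n = len(one_hot_rows)
    return [sum(col) / n for col in zip(*one_hot_rows)]

# Four hypothetical scanners over three hypothetical port features
members = [
    (0, 1, 0),
    (0, 1, 0),
    (0, 1, 1),
    (0, 0, 0),
]
center = cluster_center(members)
print(center)  # [0.0, 0.75, 0.25]
```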

In [91]:
import pandas as pd
import numpy as np
import math

In [92]:
# Define columns of the Pandas dataframe
column_list = ['cluster ID', 'size', 'mirai_ratio' ]
for feature in input_features:
    column_list.append(feature)
clusters_summary_df = pd.DataFrame( columns = column_list )
# Loop over every cluster (k = len(centers)), not over the number of top ports
for i in range(0, len(centers)):
    cluster_i = predictions.where(col('prediction')==i)
    cluster_i_size = cluster_i.count()
    cluster_i_mirai_count = cluster_i.where(col('mirai')).count()
    cluster_i_mirai_ratio = cluster_i_mirai_count/cluster_i_size
    if cluster_i_mirai_count > 0:
        print("Cluster ", i, "; Mirai Ratio:", cluster_i_mirai_ratio, "; Cluster Size: ", cluster_i_size)
    cluster_row = [i, cluster_i_size, cluster_i_mirai_ratio]
    for j in range(0, len(input_features)):
        cluster_row.append(centers[i][j])
    clusters_summary_df.loc[i]= cluster_row

Cluster  6 ; Mirai Ratio: 0.000749512816669165 ; Cluster Size:  6671
Cluster  11 ; Mirai Ratio: 0.012566322256352975 ; Cluster Size:  7162
Cluster  36 ; Mirai Ratio: 0.005889281507656066 ; Cluster Size:  849
Cluster  39 ; Mirai Ratio: 0.33557046979865773 ; Cluster Size:  149
Cluster  44 ; Mirai Ratio: 0.8313343328335832 ; Cluster Size:  1334
Cluster  83 ; Mirai Ratio: 0.022222222222222223 ; Cluster Size:  135
Cluster  89 ; Mirai Ratio: 0.03690036900369004 ; Cluster Size:  1084
Cluster  97 ; Mirai Ratio: 0.024875621890547265 ; Cluster Size:  402
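
The Mirai ratio of a cluster is the number of Mirai-labeled scanners in it divided by the cluster size. As a plain-Python sketch of the same quantity, the hypothetical helper below computes size and ratio for every cluster in a single pass over `(cluster_id, is_mirai)` pairs, treating a missing label as not Mirai. The names and toy rows are assumptions made for illustration.

```python
from collections import defaultdict

def mirai_ratio_by_cluster(rows):
    """rows: iterable of (cluster_id, is_mirai) pairs -> {cluster_id: (size, ratio)}."""
    size = defaultdict(int)
    mirai = defaultdict(int)
    for cluster_id, is_mirai in rows:
        size[cluster_id] += 1
        if is_mirai:  # None (a missing label) counts as not Mirai
            mirai[cluster_id] += 1
    return {c: (size[c], mirai[c] / size[c]) for c in sorted(size)}

# Toy predictions: cluster 0 has 2 Mirai scanners out of 3
rows = [(0, False), (0, True), (1, False), (1, False), (0, True), (2, None)]
summary = mirai_ratio_by_cluster(rows)
print(summary)
```

A cluster with a ratio close to 1 is dominated by Mirai-labeled scanners, so its center gives a port profile that can be compared with the other clusters.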


In [93]:
# Create a file name based on the number of top_ports
path4= "local/MiraiRatio_Cluster_centers_"+str(top_ports)+"OHE_k200.csv"
clusters_summary_df.to_csv(path4, header=True)

In [94]:
ss.stop()