# DS/CMPSC 410 MiniProject Deliverable #1 

# Fall 2024
## Instructor: Prof. John Yen
## TAs: Jin, Peng and Al Lawati, Ali Hussain Mohsin

## Learning Objectives
- Be able to identify frequent 1-ports, 2-port sets, and 3-port sets (based on a threshold) that are scanned by scanners in the Darknet dataset.
- Be able to adapt the Apriori algorithm by incorporating suitable threshold and pruning strategies.
- Be able to improve the performance of frequent port set mining by suitable reuse of RDDs, together with appropriate persist and unpersist on the reused RDDs.
- After successful execution in the local mode, modify the code for cluster mode, and find frequent 1-ports, 2-port sets, and 3-port sets using the big Darknet dataset (`Day_2020_profile.csv`).
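
The pruning idea behind Apriori is that a port set can only be frequent if every one of its subsets is frequent, so larger candidates are built only from smaller sets that already passed the threshold. Below is a minimal plain-Python sketch of that idea. The toy `scans` data and the helper names `support` and `apriori_candidates` are made up for illustration and are not part of the assignment code.

```python
from itertools import combinations

def support(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori_candidates(frequent_prev, k):
    """Build size-k candidates whose (k-1)-subsets are all frequent."""
    items = sorted({item for s in frequent_prev for item in s})
    candidates = []
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(combo, k - 1)):
            candidates.append(frozenset(combo))
    return candidates

# Hypothetical toy data: each set holds the ports scanned by one scanner.
scans = [{"23", "80"}, {"23", "80", "8080"}, {"23", "8080"},
         {"80"}, {"23", "80", "8080"}]
threshold = 2  # "frequent" means count > threshold, as in the notebook

ports = {p for s in scans for p in s}
freq_1 = {frozenset([p]) for p in ports
          if support(scans, frozenset([p])) > threshold}
freq_2 = {c for c in apriori_candidates(freq_1, 2)
          if support(scans, c) > threshold}
print(sorted(sorted(s) for s in freq_2))
```

In this toy data the pair 80/8080 appears only twice, so it is not frequent, and the triple 23/80/8080 is never even counted because one of its 2-subsets failed.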

### Data
- The small Darknet dataset `sampled_profile.csv` and the large Darknet dataset `Day_2020_profile.csv` are available for download from Canvas. After downloading, upload them to Roar under the MiniProj1 directory in your work directory.

### Items to submit:
- Completed Jupyter Notebook (using small Darknet dataset `sampled_profile.csv`) in HTML format.
- .py file for mining frequent 1-ports, 2-port sets, and 3-port sets in cluster mode using the big Darknet dataset `Day_2020_profile.csv`.
- one csv file of frequent 1-ports generated in the CLUSTER mode.
- one csv file of frequent 2-port sets generated in the CLUSTER mode.
- one csv file of frequent 3-port sets generated in the CLUSTER mode.
- a screen shot (using the ``ls -l`` terminal command) of the MiniProj1 directory, showing all files and directories.
  
### Due: midnight, November 12, 2024

In [2]:
import pyspark
import csv
import pandas as pd

In [3]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, BooleanType, StringType, DecimalType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans

In [4]:
ss = SparkSession.builder.master("local").appName("Mini Project #1 Frequent Port Sets Mining").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/10/30 23:21:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
ss.sparkContext.setLogLevel("WARN")

In [5]:
# ss.sparkContext.setCheckpointDir("~/scratch")

In [6]:
scanner_schema = StructType([StructField("_c0", IntegerType(), False),
                             StructField("id", IntegerType(), False),
                             StructField("numports", IntegerType(), False),
                             StructField("lifetime", DecimalType(), False),
                             StructField("Bytes", IntegerType(), False),
                             StructField("Packets", IntegerType(), False),
                             StructField("average_packetsize", IntegerType(), False),
                             StructField("MinUniqueDests", IntegerType(), False),
                             StructField("MaxUniqueDests", IntegerType(), False),
                             StructField("MinUniqueDest24s", IntegerType(), False),
                             StructField("MaxUniqueDest24s", IntegerType(), False),
                             StructField("average_lifetime", DecimalType(), False),
                             StructField("mirai", BooleanType(), True),
                             StructField("zmap", BooleanType(), True),
                             StructField("masscan", BooleanType(), True),
                             StructField("country", StringType(), False),
                             StructField("traffic_types_scanned_str", StringType(), False),
                             StructField("ports_scanned_str", StringType(), False),
                             StructField("host_tags_per_censys", StringType(), False),
                             StructField("host_services_per_censys", StringType(), False)
                           ])

In [7]:
# In the cluster mode, change this line to
# Scanners_df = ss.read.csv("/storage/home/???/work/MiniProj1/Day_2020_profile.csv", schema = scanner_schema, header= True, inferSchema=False )
Scanners_df = ss.read.csv("sampled_profile.csv", schema = scanner_schema, \
                          header=True, inferSchema=False)

In [8]:
Scanners_df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- numports: integer (nullable = true)
 |-- lifetime: decimal(10,0) (nullable = true)
 |-- Bytes: integer (nullable = true)
 |-- Packets: integer (nullable = true)
 |-- average_packetsize: integer (nullable = true)
 |-- MinUniqueDests: integer (nullable = true)
 |-- MaxUniqueDests: integer (nullable = true)
 |-- MinUniqueDest24s: integer (nullable = true)
 |-- MaxUniqueDest24s: integer (nullable = true)
 |-- average_lifetime: decimal(10,0) (nullable = true)
 |-- mirai: boolean (nullable = true)
 |-- zmap: boolean (nullable = true)
 |-- masscan: boolean (nullable = true)
 |-- country: string (nullable = true)
 |-- traffic_types_scanned_str: string (nullable = true)
 |-- ports_scanned_str: string (nullable = true)
 |-- host_tags_per_censys: string (nullable = true)
 |-- host_services_per_censys: string (nullable = true)



## Read scanners data, parse the ports_scanned_str into an array

In [9]:
Scanners_df2=Scanners_df.withColumn("Ports_Array", split(col("ports_scanned_str"), "-") )

In [10]:
Ports_Scanned_RDD = Scanners_df2.select("Ports_Array").rdd

In [11]:
scanner_port_list_RDD = Ports_Scanned_RDD.map(lambda x: x["Ports_Array"])
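
For a single row, the `split(col("ports_scanned_str"), "-")` step amounts to an ordinary string split on the hyphen. The plain-Python sketch below shows the effect on a few illustrative strings (the sample values are assumptions modelled on the output shown later, not read from the dataset).

```python
# Plain-Python view of the per-row parsing; the sample strings are illustrative.
rows = ["13716", "17128-17136", "23-80-8080"]
port_lists = [s.split("-") for s in rows]
print(port_lists)

# The ports stay strings, so membership tests must use '80', not 80.
print("80" in port_lists[2], 80 in port_lists[2])
```

This is why the later filters compare against string literals such as `'80'` and `'23'`.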

## Compute the total number of scanners scanning each port

In [12]:
port_list_RDD = scanner_port_list_RDD.flatMap(lambda x: x)

In [13]:
port_1_RDD = port_list_RDD.map(lambda x: (x, 1) )
port_1_RDD.take(5)

                                                                                

[('13716', 1), ('17128', 1), ('17136', 1), ('35134', 1), ('17140', 1)]

In [14]:
port_count_RDD = port_1_RDD.reduceByKey(lambda x,y: x + y, 8)
port_count_RDD.take(5)

                                                                                

[('8443', 4012),
 ('34230', 1921),
 ('5564', 18),
 ('33966', 1919),
 ('33972', 3130)]

In [15]:
sorted_count_port_RDD = port_count_RDD.map(lambda x: (x[1], x[0])).sortByKey(ascending = False)

                                                                                

In [16]:
sorted_count_port_RDD.take(20)

                                                                                

[(32014, '17132'),
 (31865, '17140'),
 (31850, '17128'),
 (31805, '17138'),
 (31630, '17130'),
 (31617, '17136'),
 (29199, '23'),
 (25466, '445'),
 (25216, '54594'),
 (21700, '17142'),
 (21560, '17134'),
 (15010, '80'),
 (13698, '8080'),
 (8778, '0'),
 (6265, '2323'),
 (5552, '5555'),
 (4930, '81'),
 (4103, '1023'),
 (4058, '52869'),
 (4012, '8443')]
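
The `flatMap`, `map`, `reduceByKey`, and `sortByKey` cells above together compute a per-port scanner count sorted in descending order. A rough plain-Python equivalent on toy data is sketched below. It assumes each scanner lists a port at most once, so that counting occurrences equals counting scanners; the `scanner_ports` list and the small threshold are made up for illustration.

```python
from collections import Counter
from itertools import chain

# Toy data: one list of scanned ports per scanner.
scanner_ports = [["13716"], ["17128", "17136"], ["17128"], ["23", "80"], ["23"]]

# flatMap + map((port, 1)) + reduceByKey(add) amounts to counting each port.
port_counts = Counter(chain.from_iterable(scanner_ports))

# Swap to (count, port) and sort descending, like sortByKey(ascending=False).
sorted_count_port = sorted(((c, p) for p, c in port_counts.items()), reverse=True)

threshold = 1
frequent = [(c, p) for c, p in sorted_count_port if c > threshold]  # strict >
print(frequent)
```

Unlike `sortByKey`, Python's tuple sort also orders ties by port string, so the relative order of ports with equal counts may differ from what Spark returns.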

In [21]:
sorted_port_count = "sorted_port_count_local.txt"
sorted_count_port_RDD.saveAsTextFile(sorted_port_count)

                                                                                

## Filter for ports whose count of scanners (scanning the port) exceeds the threshold

In [17]:
threshold = 500
freq_count_port_RDD= sorted_count_port_RDD.filter(lambda x: x[0] > threshold)
total_freq_port_count = freq_count_port_RDD.count()

                                                                                

In [18]:
total_freq_port_count

48

In [23]:
DF1port = ss.createDataFrame(freq_count_port_RDD)
output_path_1_port = "freq_1_port_count_local.csv"
DF1port.write.option("header", True).csv(output_path_1_port)
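
Note that Spark's CSV writer creates a directory at the given path containing one or more `part-*.csv` files, not a single CSV file. When one file is needed for submission, the parts can be concatenated. The sketch below is one hedged way to do that with the standard library; the helper name `merge_part_files` is hypothetical, and it assumes every part file starts with the same header row (as is the case when the `header` option is set to `True`).

```python
import csv
import glob
import os

def merge_part_files(spark_output_dir, merged_csv_path):
    """Concatenate Spark part-*.csv files, keeping the header row only once.

    Returns the number of data rows written.
    """
    part_paths = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    header_written = False
    data_rows = 0
    with open(merged_csv_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for part_path in part_paths:
            with open(part_path, newline="") as part_file:
                reader = csv.reader(part_file)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty part files
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    data_rows += 1
    return data_rows
```

For example, `merge_part_files("freq_1_port_count_local.csv", "freq_1_port_count_merged.csv")` would read the output directory written above (the merged file name is an assumption). Calling `coalesce(1)` on the DataFrame before writing is another common way to end up with a single part file.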

                                                                                

In [29]:
Top_Ports = freq_count_port_RDD.map(lambda x: x[1]).collect()

                                                                                

In [30]:
print(Top_Ports)

['17132', '17140', '17128', '17138', '17130', '17136', '23', '445', '54594', '17142', '17134', '80', '8080', '0', '2323', '5555', '81', '1023', '52869', '8443', '49152', '7574', '37215', '34218', '34220', '33968', '34224', '34228', '33962', '33960', '33964', '34216', '34226', '33970', '33972', '50401', '34222', '34230', '33966', '33974', '3389', '1433', '22', '5353', '21', '8291', '8728', '443']


In [31]:
Top_1_Port_count = len(Top_Ports)

In [32]:
print(Top_1_Port_count)

48


# Finding Frequent Port Sets Being Scanned

## Pruning Strategy

In [33]:
scanner_port_list_RDD.take(5)

[['13716'], ['17128', '17136'], ['35134'], ['17140'], ['54594']]

In [34]:
scanner_port_list_RDD.count()

                                                                                

227062

In [35]:
MPscanner_port_list_RDD = scanner_port_list_RDD.filter(lambda x: len(x) >= 2 )

In [36]:
MPscanner_port_list_RDD.take(5)

[['17128', '17136'],
 ['17128', '17130', '17132', '17134', '17136', '17138', '17140'],
 ['23',
  '80',
  '81',
  '1023',
  '2323',
  '5555',
  '7574',
  '8080',
  '8443',
  '37215',
  '49152',
  '52869'],
 ['17128', '17132', '17136', '17140', '17142', '34230'],
 ['137', '17130']]

In [37]:
multi_port_scanner_count = MPscanner_port_list_RDD.count()
print(multi_port_scanner_count)

73663


                                                                                

In [38]:
scanner_count= scanner_port_list_RDD.count()
print(scanner_count)

227062


                                                                                

## Filter for scanners that scan one or more specific ports, then count the number of scanners that satisfy that criterion.

In [39]:
count_80_23 = MPscanner_port_list_RDD.filter(lambda x: ('80' in x) and ('23' in x)).count()

                                                                                

In [40]:
print(count_80_23)

10427


In [41]:
count2_80_23 = MPscanner_port_list_RDD.filter(lambda x: ('80' in x)).filter(lambda x: ('23' in x)).count()

                                                                                

In [42]:
print(count2_80_23)

10427
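
Both counts agree because filtering with a single `and` predicate selects the same elements as applying the two filters one after the other. A small plain-Python sketch of that equivalence, on made-up toy data:

```python
# Toy data: one list of scanned ports per scanner.
scanners = [["23", "80", "8080"], ["80"], ["23", "445"], ["23", "80"]]

# One predicate combining both conditions with `and` ...
both_a = [s for s in scanners if ("80" in s) and ("23" in s)]

# ... versus two filters applied one after the other.
has_80 = [s for s in scanners if "80" in s]
both_b = [s for s in has_80 if "23" in s]

print(len(both_a), len(both_b))
```

The chained form is the useful one for port set mining: the intermediate result (`has_80` here) can be kept and reused for many different second ports instead of rescanning the full data each time.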


## Display a few elements of the RDD to double check that it does not contain any 1-port scanners.

In [43]:
MPscanner_port_list_RDD.take(5)

[['17128', '17136'],
 ['17128', '17130', '17132', '17134', '17136', '17138', '17140'],
 ['23',
  '80',
  '81',
  '1023',
  '2323',
  '5555',
  '7574',
  '8080',
  '8443',
  '37215',
  '49152',
  '52869'],
 ['17128', '17132', '17136', '17140', '17142', '34230'],
 ['137', '17130']]

# Frequent 1 Port Sets
Earlier, we saved the list of frequent 1-port sets (the ports that have been scanned by more than x scanners, where x is the threshold) in the variable Top_Ports.

In [44]:
print(Top_Ports)

['17132', '17140', '17128', '17138', '17130', '17136', '23', '445', '54594', '17142', '17134', '80', '8080', '0', '2323', '5555', '81', '1023', '52869', '8443', '49152', '7574', '37215', '34218', '34220', '33968', '34224', '34228', '33962', '33960', '33964', '34216', '34226', '33970', '33972', '50401', '34222', '34230', '33966', '33974', '3389', '1433', '22', '5353', '21', '8291', '8728', '443']


# Finding Frequent 2-Port Sets 

In [45]:
Top_1_Port_count = len(Top_Ports)

In [46]:
print(Top_1_Port_count)

48


In [48]:
# Initialize a Pandas DataFrame to store frequent port sets and their counts 
Two_Port_Sets_df = pd.DataFrame( columns = ['Port Sets', 'count'])
# Initialize the index to Two_Port_Sets_df
index2 = 0
threshold = 500
MPscanner_port_list_RDD.persist()
for i in range(0, Top_1_Port_count):
    filtered_scanners_TP_i = MPscanner_port_list_RDD.filter(lambda x: Top_Ports[i] in x)
    filtered_scanners_TP_i.persist()  
    # We do not need to filter for threshold for 1-port sets because all ports in Top_Ports have a
    # frequency higher than the threshold.
    for j in range(i+1, Top_1_Port_count):
        filtered_scanners_TP_i_j = filtered_scanners_TP_i.filter(lambda x: Top_Ports[j] in x)
        port_i_j_count = filtered_scanners_TP_i_j.count()
        if port_i_j_count > threshold:
            Two_Port_Sets_df.loc[index2] = [ Top_Ports[i]+"-"+Top_Ports[j], port_i_j_count] 
            index2 = index2 + 1
            # The print statement is for running in the local mode.  It can be commented out for running in the cluster mode.
            print("Two Ports:", Top_Ports[i], " , ", Top_Ports[j], ", Count: ", port_i_j_count)
    filtered_scanners_TP_i.unpersist()

                                                                                

Two Ports: 17132  ,  17140 , Count:  16317
Two Ports: 17132  ,  17128 , Count:  16279
Two Ports: 17132  ,  17138 , Count:  16299
Two Ports: 17132  ,  17130 , Count:  16336
Two Ports: 17132  ,  17136 , Count:  16148
Two Ports: 17132  ,  17142 , Count:  12722
Two Ports: 17132  ,  17134 , Count:  12761
Two Ports: 17132  ,  34218 , Count:  2658
Two Ports: 17132  ,  34220 , Count:  2666
Two Ports: 17132  ,  33968 , Count:  2608
Two Ports: 17132  ,  34224 , Count:  2619
Two Ports: 17132  ,  34228 , Count:  2624
Two Ports: 17132  ,  33962 , Count:  2591
Two Ports: 17132  ,  33960 , Count:  2628
Two Ports: 17132  ,  33964 , Count:  2567
Two Ports: 17132  ,  34216 , Count:  2552
Two Ports: 17132  ,  34226 , Count:  2555
Two Ports: 17132  ,  33970 , Count:  2540
Two Ports: 17132  ,  33972 , Count:  2564
Two Ports: 17132  ,  34222 , Count:  1630
Two Ports: 17132  ,  34230 , Count:  1599
Two Ports: 17132  ,  33966 , Count:  1594
Two Ports: 17132  ,  33974 , Count:  1484
Two Ports: 17140  ,  17128 
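
One detail of the loop above is worth spelling out: the lambdas look up `Top_Ports[i]` and `Top_Ports[j]` when they run, not when they are defined. That is safe here because `count()` is called inside the same iteration. If the filtered RDDs were instead stored and counted after the loop, every lambda would see the final loop value. The plain-Python sketch below illustrates this late-binding behaviour and the usual default-argument workaround; the `ports` and `scanner` values are toy data.

```python
ports = ["23", "80", "8080"]
scanner = ["23", "8080"]

# Late binding: every lambda reads j when it is called, after the loop ended,
# so all three end up testing ports[2].
late = []
for j in range(len(ports)):
    late.append(lambda x: ports[j] in x)
print([f(scanner) for f in late])

# Binding j as a default argument freezes its value for each iteration.
bound = []
for j in range(len(ports)):
    bound.append(lambda x, j=j: ports[j] in x)
print([f(scanner) for f in bound])
```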

# Create a PySpark DataFrame from the Pandas DataFrame of frequent 2-port sets, then write the PySpark DataFrame to CSV (with a header row)

In [50]:
DF2port = ss.createDataFrame(Two_Port_Sets_df)

In [52]:

output_path_2_port = "2PS_dfw5416_local.csv"
DF2port.write.option("header", True).csv(output_path_2_port)

# Part D Finding Frequent 3-port sets

# Approach 1:
## One way to find frequent 3-port sets is to add another nested loop, inside the two loops above, to iterate through all possible frequent 3-port sets.
```
N = Total number of frequent 1-ports
For top port index i from 0 to N-1 do:
    filtered_MPscanner_Top_port_i = filter MPscanner_port_list_RDD for top port index i
    For top port index j from i+1 to N-1 do:
        filtered_MPscanner_Top_port_i_j = filter filtered_MPscanner_Top_port_i for top port index j
        port_i_j_count = filtered_MPscanner_Top_port_i_j.count()
        If port_i_j_count > threshold:
            Save [ [Top_port[i], Top_port[j]] , port_i_j_count ] in a Pandas dataframe for frequent 2 port set
            For top port index k from j+1 to N-1 do:
                filtered_MPscanner_Top_port_i_j_k = filter filtered_MPscanner_Top_port_i_j for top port index k 
                port_i_j_k_count = filtered_MPscanner_Top_port_i_j_k.count()
                If port_i_j_k_count > threshold:
                    Save [ [Top_port[i], Top_port[j], Top_port[k]], port_i_j_k_count ] in a Pandas dataframe for frequent 3 port set
```

# A More Scalable Approach:
## Because the dataset is large, finding frequent 3-port sets in a second nested loop inside the loop for frequent 2-port sets is costly: it must keep the two RDDs needed by the outer loops persisted. In addition, it must persist and unpersist the scanners for each 2-port set that exceeds the threshold so that we can iterate through possible third ports.
## An alternative approach is to find frequent 3-port sets AFTER we have found frequent 2-port sets, so that fewer RDDs need to be persisted at the same time.
## We can also reduce the number of scanners to consider, because scanners that scan fewer than 3 ports can be filtered out.
## Below is an algorithm:
```
Read scanners data, parse the ports_scanned_str into an array
Generate an RDD containing the list of ports scanned by each scanner.
Top_ports = A list of ports whose scanner count is > threshold
candidate_3PS_scanners = filter scanners for those that scan at least 3 ports
frequent_2PS_RDD = Read from the file created from frequent 2 port set mining
frequent_2PS_RDD.persist()
for each 2PS in frequent_2PS_RDD do:
    scanners_2PS = filter candidate_3PS_scanners for those that scan the two port set 2PS
    if the number of scanners in scanners_2PS > threshold:
        scanners_2PS.persist()
        index_i = index of first port in 2PS
        index_j = index of second port in 2PS
        for index_k from max{index_i, index_j} + 1 to len(Top_ports) - 1 do:
            scanners_3PS = filter scanners_2PS for Top_ports[index_k]
            if the number of scanners in scanners_3PS > threshold:
                Record Top_ports[index_i], Top_ports[index_j], and Top_ports[index_k] as a frequent 3PortSet together with its count
        scanners_2PS.unpersist()
frequent_2PS_RDD.unpersist()
        
```
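
The control flow of the algorithm above can be sketched in plain Python, with lists standing in for RDDs and no persist or unpersist calls. All data below (`top_ports`, `frequent_2ps`, `scanners`, and the small threshold) is made up for illustration; it is not the notebook's implementation.

```python
top_ports = ["17132", "17140", "17128", "23"]            # frequent 1-ports, toy order
frequent_2ps = [["17132", "17140"], ["17132", "17128"]]  # toy frequent 2-port sets
scanners = [
    ["17128", "17132", "17140"],
    ["17128", "17132", "17140", "23"],
    ["17132", "17140"],
    ["17128", "17132", "17140"],
]
threshold = 2

# Only scanners with at least 3 ports can contribute to a 3-port set.
candidates = [s for s in scanners if len(s) >= 3]

frequent_3ps = {}
for first, second in frequent_2ps:
    index_i, index_j = top_ports.index(first), top_ports.index(second)
    scanners_2ps = [s for s in candidates if first in s and second in s]
    if len(scanners_2ps) <= threshold:
        continue  # no superset of this pair can exceed the threshold
    # Start after the larger index so each triple is generated only once.
    for index_k in range(max(index_i, index_j) + 1, len(top_ports)):
        third = top_ports[index_k]
        count = sum(1 for s in scanners_2ps if third in s)
        if count > threshold:
            frequent_3ps[(first, second, third)] = count
print(frequent_3ps)
```

Starting `index_k` at `max(index_i, index_j) + 1` means a triple is only produced from the pair formed by its two earliest ports in `top_ports` order, which avoids reporting the same 3-port set several times.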

In [7]:
# If read from file, change this line to read from your cluster output
# TwoPS_DF = ss.read.csv("/storage/home/juy1/work/MiniProj1/2PS_juy1_local.csv", header=True, inferSchema=True)

In [53]:
DF2port_A = DF2port.withColumn("ports array", split("Port Sets", "-") )

In [54]:
DF2port_A.show(3)

+-----------+-----+--------------+
|  Port Sets|count|   ports array|
+-----------+-----+--------------+
|17132-17140|16317|[17132, 17140]|
|17132-17128|16279|[17132, 17128]|
|17132-17138|16299|[17132, 17138]|
+-----------+-----+--------------+
only showing top 3 rows



In [55]:
DF2port_RDD = DF2port_A.select("ports array").rdd

In [56]:
DF2port_RDD.take(3)

[Row(ports array=['17132', '17140']),
 Row(ports array=['17132', '17128']),
 Row(ports array=['17132', '17138'])]

In [57]:
TwoPort_list = DF2port_RDD.map(lambda x: x["ports array"]).collect()

In [58]:
print(TwoPort_list)

[['17132', '17140'], ['17132', '17128'], ['17132', '17138'], ['17132', '17130'], ['17132', '17136'], ['17132', '17142'], ['17132', '17134'], ['17132', '34218'], ['17132', '34220'], ['17132', '33968'], ['17132', '34224'], ['17132', '34228'], ['17132', '33962'], ['17132', '33960'], ['17132', '33964'], ['17132', '34216'], ['17132', '34226'], ['17132', '33970'], ['17132', '33972'], ['17132', '34222'], ['17132', '34230'], ['17132', '33966'], ['17132', '33974'], ['17140', '17128'], ['17140', '17138'], ['17140', '17130'], ['17140', '17136'], ['17140', '17142'], ['17140', '17134'], ['17140', '34218'], ['17140', '34220'], ['17140', '33968'], ['17140', '34224'], ['17140', '34228'], ['17140', '33962'], ['17140', '33960'], ['17140', '33964'], ['17140', '34216'], ['17140', '34226'], ['17140', '33970'], ['17140', '33972'], ['17140', '34222'], ['17140', '34230'], ['17140', '33966'], ['17140', '33974'], ['17128', '17138'], ['17128', '17130'], ['17128', '17136'], ['17128', '17142'], ['17128', '17134'],

## Filter Scanners for those that scan at least three ports

In [59]:
Candidate_3PS_scanners = MPscanner_port_list_RDD.filter(lambda x: len(x) >= 3)

In [60]:
Candidate_3PS_scanners.persist()

PythonRDD[1235] at RDD at PythonRDD.scala:53

In [61]:
MPscanner_port_list_RDD.unpersist()

PythonRDD[43] at RDD at PythonRDD.scala:53

# Problem 7 (15 points) 
## Complete the missing code (including persist and unpersist) below for mining frequent 3 port sets
## and write the results (three port sets and their counts) using PySpark DataFrame.

In [63]:
# Initialize a Pandas DataFrame to store frequent port sets and their counts 
Three_Port_Sets_df = pd.DataFrame( columns= ['Port Sets', 'count'])
# Initialize the index to Three_Port_Sets_df
index3 = 0
# Set the threshold for frequent port sets to 500 in local mode.
# This threshold needs to be changed to 10000 in the cluster mode.
threshold = 500
Top_1_Port_count = len(Top_Ports)
for TwoPS in TwoPort_list:
    index_i = Top_Ports.index( TwoPS[0] )
    index_j = Top_Ports.index( TwoPS[1] )
    filtered_scanners_i_j = Candidate_3PS_scanners.filter(lambda x: Top_Ports[index_i] in x).filter(lambda y: Top_Ports[index_j] in y)
    filtered_scanners_i_j.persist()  
    for k in range(max(index_i, index_j)+1, Top_1_Port_count):
        filtered_scanners_i_j_k = filtered_scanners_i_j.filter(lambda x: Top_Ports[k] in x)
        port_i_j_k_count = filtered_scanners_i_j_k.count()
        if port_i_j_k_count > threshold:
            Three_Port_Sets_df.loc[index3] = [ Top_Ports[index_i]+"-"+Top_Ports[index_j]+"-"+Top_Ports[k], port_i_j_k_count] 
            index3 = index3 + 1
            # The print statement is for running in the local mode.  It can be commented out for running in the cluster mode.
            print("Three Ports:", Top_Ports[index_i], " , ", Top_Ports[index_j], " , ", Top_Ports[k], ", Count: ", port_i_j_k_count)
    filtered_scanners_i_j.unpersist()

                                                                                

Three Ports: 17132  ,  17140  ,  17128 , Count:  12594
Three Ports: 17132  ,  17140  ,  17138 , Count:  12562
Three Ports: 17132  ,  17140  ,  17130 , Count:  12665
Three Ports: 17132  ,  17140  ,  17136 , Count:  12522
Three Ports: 17132  ,  17140  ,  17142 , Count:  10461
Three Ports: 17132  ,  17140  ,  17134 , Count:  10454
Three Ports: 17132  ,  17140  ,  34218 , Count:  2463
Three Ports: 17132  ,  17140  ,  34220 , Count:  2453
Three Ports: 17132  ,  17140  ,  33968 , Count:  2420
Three Ports: 17132  ,  17140  ,  34224 , Count:  2461
Three Ports: 17132  ,  17140  ,  34228 , Count:  2443
Three Ports: 17132  ,  17140  ,  33962 , Count:  2397
Three Ports: 17132  ,  17140  ,  33960 , Count:  2413
Three Ports: 17132  ,  17140  ,  33964 , Count:  2361
Three Ports: 17132  ,  17140  ,  34216 , Count:  2359
Three Ports: 17132  ,  17140  ,  34226 , Count:  2382
Three Ports: 17132  ,  17140  ,  33970 , Count:  2334
Three Ports: 17132  ,  17140  ,  33972 , Count:  2391
Three Ports: 17132  , 

                                                                                

Three Ports: 17132  ,  17134  ,  34218 , Count:  2315
Three Ports: 17132  ,  17134  ,  34220 , Count:  2295
Three Ports: 17132  ,  17134  ,  33968 , Count:  2263
Three Ports: 17132  ,  17134  ,  34224 , Count:  2282
Three Ports: 17132  ,  17134  ,  34228 , Count:  2252
Three Ports: 17132  ,  17134  ,  33962 , Count:  2241
Three Ports: 17132  ,  17134  ,  33960 , Count:  2262
Three Ports: 17132  ,  17134  ,  33964 , Count:  2186
Three Ports: 17132  ,  17134  ,  34216 , Count:  2222
Three Ports: 17132  ,  17134  ,  34226 , Count:  2259
Three Ports: 17132  ,  17134  ,  33970 , Count:  2191
Three Ports: 17132  ,  17134  ,  33972 , Count:  2207
Three Ports: 17132  ,  17134  ,  34222 , Count:  1459
Three Ports: 17132  ,  17134  ,  34230 , Count:  1404
Three Ports: 17132  ,  17134  ,  33966 , Count:  1420
Three Ports: 17132  ,  17134  ,  33974 , Count:  1320
Three Ports: 17132  ,  34218  ,  34220 , Count:  1008
Three Ports: 17132  ,  34218  ,  33968 , Count:  1031
Three Ports: 17132  ,  34218

                                                                                

Three Ports: 8080  ,  5555  ,  81 , Count:  3373
Three Ports: 8080  ,  5555  ,  1023 , Count:  3373
Three Ports: 8080  ,  5555  ,  52869 , Count:  3384
Three Ports: 8080  ,  5555  ,  8443 , Count:  3397
Three Ports: 8080  ,  5555  ,  49152 , Count:  3369
Three Ports: 8080  ,  5555  ,  7574 , Count:  3286
Three Ports: 8080  ,  5555  ,  37215 , Count:  3298
Three Ports: 8080  ,  81  ,  1023 , Count:  3441
Three Ports: 8080  ,  81  ,  52869 , Count:  3429
Three Ports: 8080  ,  81  ,  8443 , Count:  3454
Three Ports: 8080  ,  81  ,  49152 , Count:  3408
Three Ports: 8080  ,  81  ,  7574 , Count:  3347
Three Ports: 8080  ,  81  ,  37215 , Count:  3364
Three Ports: 8080  ,  1023  ,  52869 , Count:  3427
Three Ports: 8080  ,  1023  ,  8443 , Count:  3444
Three Ports: 8080  ,  1023  ,  49152 , Count:  3421
Three Ports: 8080  ,  1023  ,  7574 , Count:  3345
Three Ports: 8080  ,  1023  ,  37215 , Count:  3337
Three Ports: 8080  ,  52869  ,  8443 , Count:  3437
Three Ports: 8080  ,  52869  ,  491

In [64]:
DF3port = ss.createDataFrame(Three_Port_Sets_df)

In [65]:
# These output file names need to be changed in the cluster mode, so that you can compare them with those from the local mode.
output_path_3_port = "3PS_dfw5416_local.csv"
DF3port.write.option("header", True).csv(output_path_3_port)

In [66]:
ss.stop()

# Part E (cluster mode): Finding frequent 2-port sets and 3-port sets from the large dataset.

# Problem 8 (30 points)
- Remove .master("local") from SparkSession statement
- Change the input file to "Day_2020_profile.csv"
- Change the threshold from its local-mode value (500 in this notebook) to 10000.
- Change the output files to two different directories from the ones you used in local mode.
- Export the notebook as a .py file
- Run spark-submit on ICDS Roar 
- Submit the following items:
  - (a) the .py file for cluster mode (10%)
  - (b) One output CSV file for frequent 2-port sets and one output CSV file for frequent 3-port sets generated in the cluster mode. (10%)
  - (c) A screen shot (generated using the `ls -l` terminal command) of your `MiniProj1` directory that shows all files and directories. (5%)
  - (d) Discuss (in the cell below) three things you noticed that are interesting/surprising from the frequent 3-port sets (5%)
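
One way to keep the local and cluster versions in a single script is to read the values that change (input file, threshold, output suffix) from the command line. The sketch below is only an illustration using `argparse`; the flag names `--input`, `--threshold`, `--output-suffix`, and `--master` are hypothetical and not required by the assignment.

```python
import argparse

def parse_args(argv=None):
    """Parse the settings that differ between local and cluster runs."""
    parser = argparse.ArgumentParser(description="Frequent port set mining")
    parser.add_argument("--input", default="sampled_profile.csv")
    parser.add_argument("--threshold", type=int, default=500)
    parser.add_argument("--output-suffix", default="local")
    parser.add_argument("--master", default=None,
                        help="e.g. 'local'; leave unset so spark-submit decides")
    return parser.parse_args(argv)

# Defaults reproduce the local settings used in this notebook.
local_cfg = parse_args([])

# Cluster settings as listed in Problem 8.
cluster_cfg = parse_args(["--input", "Day_2020_profile.csv",
                          "--threshold", "10000",
                          "--output-suffix", "cluster"])
print(local_cfg.threshold, cluster_cfg.threshold, cluster_cfg.output_suffix)
```

With this pattern the output paths could be built from the suffix (for example `"2PS_" + cfg.output_suffix + ".csv"`), so local and cluster results never overwrite each other.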

# Your Answer to Problem 8 (d):
1. I found it surprising that the counts fit such a regular pattern: port sets that differ only in their last port tend to have very similar counts.
2. Counts in the 100,000+ range are far higher than those of the next-highest group.
3. 23-80-8080 was by far the most frequent combination among the port sets starting with port 23.