# Strong Scaling Test for Spark Cluster

**Author:** Noah Wassberg

**Date Created:** March 6, 2024

**Description:** This notebook runs a strong scaling test on our Spark cluster: the same fixed workload is executed with 1, 2, 4, and 8 executors, and the wall-clock time of each run is recorded.

**Output:**
- Execution time for each number of executors used
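
Strong scaling results are commonly summarized with speedup and parallel efficiency. As a reference (the notation $T(n)$, $S(n)$, and $E(n)$ is introduced here and does not appear in the notebook), with $T(n)$ the wall-clock time measured with $n$ executors:

$$
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
$$

Ideal strong scaling gives $S(n) = n$ and $E(n) = 1$.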

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import time

def process_data(spark, data_path):
    """Run a fixed workload: load, filter, aggregate, sort, and show the top row."""
    # Load the Reddit dataset
    df = spark.read.json(data_path)
    # Double the data to increase the amount of work
    doubled = df.union(df)
    filtered = doubled.filter(F.col("summary_len") > 30)
    grouped = filtered.groupBy("subreddit").agg(F.sum("content_len").alias("content_len_value"))
    ordered = grouped.orderBy(F.col("content_len_value").desc())
    # Action that triggers the computation
    ordered.show(1)

if __name__ == "__main__":
    data_path = "hdfs://192.168.2.193:9000/user/hadoop/input/input/corpus-webis-tldr-17.json"
    master_url = "spark://192.168.2.193:7077"
    app_name = "Strong scaling test"
    num_executors_list = [1, 2, 4, 8]

    for num_executors in num_executors_list:
        # Configure a fresh Spark session for each executor count
        spark = (
            SparkSession.builder
            .master(master_url)
            .appName(app_name)
            .config("spark.executor.memory", "8g")
            # The spark.dynamicAllocation.* options below only take effect when
            # spark.dynamicAllocation.enabled is true, which is not set here.
            .config("spark.dynamicAllocation.shuffleTracking.enabled", True)
            .config("spark.shuffle.service.enabled", False)
            .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
            # Requested number of executors for this run
            .config("spark.executor.instances", str(num_executors))
            .getOrCreate()
        )

        start_time = time.time()
        process_data(spark, data_path)
        end_time = time.time()

        print(f"Number of Executors: {num_executors}, Time Taken: {end_time - start_time} seconds")

        spark.stop()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/06 10:58:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        170712546|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 1, Time Taken: 238.74418020248413 seconds


24/03/06 11:03:15 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/03/06 11:03:30 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        170712546|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 2, Time Taken: 257.9959063529968 seconds


24/03/06 11:07:33 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        170712546|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 4, Time Taken: 264.3424050807953 seconds


24/03/06 11:11:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/03/06 11:12:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        170712546|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 8, Time Taken: 300.2862946987152 seconds
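
The timings printed above can be turned into the usual strong-scaling metrics. The sketch below is not part of the original notebook: it hard-codes the four measured times (rounded to two decimals) and computes speedup as the baseline time divided by the time with n executors, and efficiency as speedup divided by n. The function name `scaling_metrics` is hypothetical.

```python
# Measured wall-clock times from the runs above, rounded to two decimals
# (number of executors -> seconds).
measured = {1: 238.74, 2: 258.00, 4: 264.34, 8: 300.29}


def scaling_metrics(times):
    """Return {n: (speedup, efficiency)} relative to the smallest n in `times`."""
    base_n = min(times)
    base_time = times[base_n]
    metrics = {}
    for n, t in sorted(times.items()):
        speedup = base_time / t
        # Efficiency compares the speedup with the increase in executors.
        efficiency = speedup / (n / base_n)
        metrics[n] = (speedup, efficiency)
    return metrics


if __name__ == "__main__":
    for n, (speedup, efficiency) in scaling_metrics(measured).items():
        print(f"executors={n}: speedup={speedup:.2f}, efficiency={efficiency:.2f}")
```

A speedup below 1 for every run with more than one executor means the job got slower as executors were requested, so these timings show no strong-scaling benefit for this configuration.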
