# Weak Scaling Test for Spark Cluster

**Author**: Noah Wassberg

**Date Created**: March 7, 2024

**Description**: 
This notebook performs a weak scaling test on our Spark cluster by increasing the workload in direct proportion to the number of executors. The aim is to maintain a constant workload per executor as the total number of executors increases, thereby evaluating the cluster's ability to handle larger datasets without degrading performance per unit of computation.

**Output**:
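The output of this notebook is the wall-clock execution time for each executor count tested (1, 2, 4, and 8), together with the top aggregated row for each run. Under ideal weak scaling the execution time would stay constant as executors and workload grow together, so the change in time across runs indicates the scaling efficiency of the Spark cluster.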

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import time

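def process_data(spark, data_path, num_executors):
    # Load the Reddit dataset
    df = spark.read.json(data_path)

    # Weak scaling: double the data once for every doubling of the executor
    # count, so the row count is 1x, 2x, 4x, 8x for 1, 2, 4, 8 executors.
    # Assumes num_executors is a power of two, as in num_executors_list.
    copies = 1
    while copies * 2 <= num_executors:
        df = df.union(df)
        copies *= 2

    # Keep rows with summary_len above 30, sum content_len per subreddit,
    # and sort so the largest total comes first.
    filtered = df.filter(df["summary_len"] > 30)
    grouped = filtered.groupBy("subreddit").agg(F.sum("content_len").alias("content_len_value"))
    ordered = grouped.orderBy(grouped["content_len_value"].desc())

    # Action to trigger the computation
    ordered.show(1)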

if __name__ == "__main__":
    data_path = "hdfs://192.168.2.193:9000/user/hadoop/input/input/corpus-webis-tldr-17.json"
    master_url = "spark://192.168.2.193:7077"
    app_name = "Weak scaling test"
    num_executors_list = [1, 2, 4, 8]

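    for num_executors in num_executors_list:
        # Configure a fresh Spark session for this executor count.
        # Note: the two spark.dynamicAllocation.* settings below only take
        # effect when spark.dynamicAllocation.enabled is true (default: false).
        spark = (
            SparkSession.builder
            .master(master_url)
            .appName(app_name)
            .config("spark.executor.memory", "8g")
            .config("spark.dynamicAllocation.shuffleTracking.enabled", True)
            .config("spark.shuffle.service.enabled", False)
            .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
            .config("spark.executor.instances", str(num_executors))
            .getOrCreate()
        )

        # The timed region covers the JSON read (including schema inference)
        # as well as the filter, aggregation, and sort.
        start_time = time.time()
        process_data(spark, data_path, num_executors)
        end_time = time.time()

        print(f"Number of Executors: {num_executors}, Time Taken: {end_time - start_time} seconds")

        # Stop the session so the next iteration starts with a new configuration.
        spark.stop()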

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/07 09:34:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/03/07 09:35:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
                                                                                

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|         85356273|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 1, Time Taken: 215.00452399253845 seconds


24/03/07 09:38:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
                                                                                

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        170712546|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 2, Time Taken: 270.01586508750916 seconds


24/03/07 09:43:22 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
24/03/07 09:43:37 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
                                                                                

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        341425092|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 4, Time Taken: 472.54664492607117 seconds


24/03/07 09:51:15 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
                                                                                

+-------------+-----------------+
|    subreddit|content_len_value|
+-------------+-----------------+
|relationships|        682850184|
+-------------+-----------------+
only showing top 1 row

Number of Executors: 8, Time Taken: 883.1078379154205 seconds
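
The timings above can be condensed into a weak scaling efficiency figure. The following is a minimal sketch, not part of the original notebook: it assumes the usual definition E(N) = T(1) / T(N), where 1.0 means the runtime stayed flat as executors and data grew together. The names `measured_times`, `reported_sums`, and `weak_scaling_efficiency` are hypothetical, and the values are copied (times rounded to two decimals) from the output printed above.

```python
# Wall-clock times in seconds, copied from the run output above (rounded).
measured_times = {1: 215.00, 2: 270.02, 4: 472.55, 8: 883.11}

# content_len_value of the top row in each run, copied from the output above.
reported_sums = {1: 85356273, 2: 170712546, 4: 341425092, 8: 682850184}


def weak_scaling_efficiency(times):
    """Return {executors: T(1) / T(N)}; 1.0 means ideal weak scaling."""
    baseline = times[1]
    return {n: baseline / t for n, t in sorted(times.items())}


# Sanity check: the aggregated sum should grow by exactly the replication
# factor, confirming that the workload really was scaled 1x, 2x, 4x, 8x.
workload_scaled_as_intended = all(
    total == n * reported_sums[1] for n, total in reported_sums.items()
)

efficiency = weak_scaling_efficiency(measured_times)
for n, e in efficiency.items():
    print(f"Executors: {n}, weak scaling efficiency: {e:.2f}")
print(f"Workload scaled as intended: {workload_scaled_as_intended}")
```

For this run the efficiencies come out at roughly 1.00, 0.80, 0.45, and 0.24, and the sums scale exactly with the replication factor. In other words, from 2 to 8 executors the runtime grows almost in proportion to the data volume instead of staying flat.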
