# Task 1

## Input data
- A file with churn data for a US telecom operator (churn.csv)
- A lookup table of state names (state.json)
- A lookup table with the population of territories within states, keyed by the area code field (population.json)
- A territory with a population below 10_000 is considered **small**

## What needs to be done
1. Count churned and non-churned subscribers (the churn field), excluding **small** territories
2. The report must be broken down by **each state**, using its full name
3. Describe the bottlenecks that arise when performing this operation
4. Apply one optimization technique to speed up the query (assuming the population lookup table is **much smaller** than the main data)
5. If another technique exists, apply it as well, separately from item 4 (assuming the population lookup table is **comparable in size** to the main data)
6. Briefly describe the implemented techniques and their practical benefit

P.S. One of the chosen techniques must be a bucket specific join
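
As a reference point for the Spark cells that follow, the required report can be sketched in plain Python on a handful of rows. All rows, the `churn_report` helper and the choice to skip unknown area codes are illustrative assumptions, not part of the assignment data.

```python
from collections import Counter

POP_THRESHOLD = 10_000  # territories below this population are "small"

# Hypothetical toy rows standing in for churn.csv, state.json and population.json.
churn_rows = [
    {"state": "AL", "area code": 131, "churn": "False"},
    {"state": "AL", "area code": 131, "churn": "True"},
    {"state": "AL", "area code": 132, "churn": "False"},
    {"state": "AK", "area code": 133, "churn": "False"},
]
state_names = {"AL": "Alabama", "AK": "Alaska"}
population = {131: 15_742, 132: 9_000, 133: 10_000}


def churn_report(rows, names, pop, threshold=POP_THRESHOLD):
    """Count rows per (full state name, churn flag), skipping small territories."""
    counts = Counter()
    for row in rows:
        # Assumption: an area code missing from the lookup is treated as small.
        if pop.get(row["area code"], 0) < threshold:
            continue
        # Fall back to the state code if the full name is unknown.
        counts[(names.get(row["state"], row["state"]), row["churn"])] += 1
    return dict(counts)


report = churn_report(churn_rows, state_names, population)
```

Area code 132 is dropped as small, while 133 sits exactly on the threshold and is kept, because only populations strictly below 10_000 count as small.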

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

In [2]:
spark = SparkSession.builder \
    .master("spark://spark-master:7077") \
    .appName("sykorole_test") \
    .config("spark.sql.adaptive.enabled", False) \
    .config("spark.executor.memory", "450M") \
    .config("spark.driver.memory", "450M") \
    .config("spark.sql.autoBroadcastJoinThreshold", -1) \
    .config("spark.sql.sources.bucketing.enabled", True) \
    .config("spark.jars.packages", ",".join([
        # spark.jars.packages expects a comma-separated string, not a Python list
        "org.apache.hadoop:hadoop-aws:3.3.2",
        "com.amazonaws:aws-java-sdk-pom:1.12.365",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1",
    ])) \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config('spark.hadoop.fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider') \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.access.key", "XosDG9F6D3nDSHjOWd7H") \
    .config("spark.hadoop.fs.s3a.secret.key", "I2cpVw7DWhc6SpswCWcoB554vlYqnbq4iYSKYhPk") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .getOrCreate()

:: loading settings :: url = jar:file:/usr/local/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
com.amazonaws#aws-java-sdk-pom added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-c1e405e5-032d-4d51-bee7-62d9653bc748;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.3.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.1026 in central
	found org.wildfly.openssl#wildfly-openssl;1.0.7.Final in central
	found com.amazonaws#aws-java-sdk-pom;1.12.365 in central
:: resolution report :: resolve 117ms :: artifacts dl 4ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.1026 from central in [default]
	com.amazonaws#aws-java-sdk-pom;1.12.365 from central in [default]
	org.apache.hadoop#hadoop-aws;3.3.2 from central in [default]
	org.wildfly.openssl#wildfly-openssl;1.0.7.Final from central in [default]

In [3]:
churn_df = spark.read.option("header", True).csv("s3a://input/data/churn.csv")
state_dict = spark.read.json("s3a://input/data/state.json").withColumnRenamed("state_id", "state")
pop_dict = spark.read.json("s3a://input/data/population.json")

24/06/20 09:37:45 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

In [108]:
state_dict.show(1)

+--------+----------+
|state_id|state_name|
+--------+----------+
|      AL|   Alabama|
+--------+----------+
only showing top 1 row



In [109]:
pop_dict.show(1)

+---------+----------+
|area code|population|
+---------+----------+
|      131|     15742|
+---------+----------+
only showing top 1 row



In [4]:
POP_THRESHOLD = 10_000

#### Simple solution

In [5]:
# keep only territories that are not small, i.e. population >= threshold
data_with_population = churn_df.join(pop_dict, on="area code", how="left") \
    .filter(F.col("population") >= POP_THRESHOLD)

In [6]:
data_with_population.filter(F.col("population").isNull()).count()  # confirm no null population values remain (the population filter above already drops nulls)

                                                                                

0

In [47]:
result = data_with_population \
    .join(state_dict, on="state", how="left") \
    .groupBy("state_name", "churn") \
    .agg(F.count("*").alias("cnt")) \
    .orderBy("state_name")

In [48]:
result.explain()

== Physical Plan ==
*(9) Sort [state_name#72 ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(state_name#72 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=1667]
   +- *(8) HashAggregate(keys=[state_name#72, churn#945], functions=[count(1)])
      +- Exchange hashpartitioning(state_name#72, churn#945, 200), ENSURE_REQUIREMENTS, [plan_id=1663]
         +- *(7) HashAggregate(keys=[state_name#72, churn#945], functions=[partial_count(1)])
            +- *(7) Project [churn#945, state_name#72]
               +- *(7) SortMergeJoin [state#925], [state#75], LeftOuter
                  :- *(4) Sort [state#925 ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(state#925, 200), ENSURE_REQUIREMENTS, [plan_id=1645]
                  :     +- *(3) Project [state#925, churn#945]
                  :        +- *(3) SortMergeJoin [area code#927], [area code#970], Inner
                  :           :- *(1) Sort [area code#927 ASC NULLS FIRST], false, 0
             

- the plan contains several shuffles (Exchange): one before each of the two joins, plus those for the aggregation and the final sort
- both joins run as sort merge joins
- skew is visible in the Spark UI under Stages -> Event Timeline
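
The skew in the last bullet can be illustrated without a cluster. The sketch below is a simplification under stated assumptions: it places rows with `key % num_partitions`, whereas Spark hashes keys before taking the modulo, and both the key distribution and the `partition_sizes` helper are hypothetical.

```python
from collections import Counter

NUM_PARTITIONS = 4  # hypothetical; the plan above uses 200 shuffle partitions


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Return how many rows land in each partition under `key % num_partitions`."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical skewed join key: one area code dominates the data.
skewed_keys = [131] * 90 + [132] * 5 + [133] * 3 + [134] * 2
sizes = partition_sizes(skewed_keys)

# Largest partition relative to the mean partition size; 1.0 means perfectly even.
skew_ratio = max(sizes) / (sum(sizes) / len(sizes))
```

All rows sharing a key must land in the same partition, so one task ends up processing most of the data while the others finish early. That is the pattern the event timeline shows.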

In [50]:
result.show(4, False)



+----------+-----+---+
|state_name|churn|cnt|
+----------+-----+---+
|Alabama   |False|58 |
|Alabama   |True |7  |
|Alaska    |True |3  |
|Alaska    |False|44 |
+----------+-----+---+
only showing top 4 rows



24/06/20 12:29:18 ERROR TaskSchedulerImpl: Lost executor 0 on 172.20.0.2: worker lost: Not receiving heartbeat for 60 seconds
24/06/20 12:35:16 ERROR TaskSchedulerImpl: Lost executor 1 on 172.20.0.2: worker lost: Not receiving heartbeat for 60 seconds
24/06/20 12:51:52 ERROR TaskSchedulerImpl: Lost executor 2 on 172.20.0.2: worker lost: Not receiving heartbeat for 60 seconds


In [10]:
result.count()

                                                                                

2

#### Optimization - broadcast

In [24]:
# Both lookup tables are small, so both can be broadcast; at a minimum pop_dict should be, because its join key (area code) is skewed
pop_dict = F.broadcast(pop_dict)
state_dict = F.broadcast(state_dict)  # state is evenly distributed

In [25]:
# keep only territories that are not small, i.e. population >= threshold
data_with_population = churn_df.join(pop_dict, on="area code", how="left") \
    .filter(F.col("population") >= POP_THRESHOLD)

In [26]:
broadcast_result = data_with_population \
    .join(state_dict, on="state", how="left") \
    .groupBy("churn") \
    .agg(F.count("*").alias("cnt"))

In [27]:
broadcast_result.explain()

== Physical Plan ==
*(4) HashAggregate(keys=[churn#401], functions=[count(1)])
+- Exchange hashpartitioning(churn#401, 200), ENSURE_REQUIREMENTS, [plan_id=868]
   +- *(3) HashAggregate(keys=[churn#401], functions=[partial_count(1)])
      +- *(3) Project [churn#401]
         +- *(3) BroadcastHashJoin [state#381], [state#438], LeftOuter, BuildRight, false
            :- *(3) Project [state#381, churn#401]
            :  +- *(3) BroadcastHashJoin [cast(area code#383 as bigint)], [area code#450L], Inner, BuildRight, false
            :     :- *(3) Filter isnotnull(area code#383)
            :     :  +- FileScan csv [state#381,area code#383,churn#401] Batched: false, DataFilters: [isnotnull(area code#383)], Format: CSV, Location: InMemoryFileIndex(1 paths)[s3a://input/data/churn.csv], PartitionFilters: [], PushedFilters: [IsNotNull(area code)], ReadSchema: struct<state:string,area code:string,churn:string>
            :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, big

In [None]:
broadcast_result.count()
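
The idea behind BroadcastHashJoin in the plan above can be sketched in plain Python. This is a conceptual illustration, not Spark's implementation: the `broadcast_hash_join` helper and the toy rows are assumptions made for the example.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts by building a hash table from the small side."""
    # Build phase: index the small table by key. In Spark this hash table is
    # what gets shipped ("broadcast") to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large table once, without sorting or repartitioning it.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical toy data.
churn_rows = [
    {"area code": 131, "churn": "False"},
    {"area code": 131, "churn": "True"},
    {"area code": 999, "churn": "False"},  # no matching population row
]
pop_rows = [{"area code": 131, "population": 15742}]

joined = broadcast_hash_join(churn_rows, pop_rows, "area code")
```

Because the large side is never shuffled by the join key, a skewed key does not concentrate rows on a single task. The cost is that the small table must fit in memory on every executor.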

#### Bucket specific join

In [33]:
churn_df.repartition(1) \
    .write \
    .mode("overwrite") \
    .bucketBy(50, "area code") \
    .option("path", "s3a://input/data/bucketed/churn") \
    .saveAsTable("churn_bucketed")

                                                                                

In [34]:
pop_dict \
    .withColumn("area code", F.col("area code").cast("string")) \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .bucketBy(50, "area code") \
    .option("path", "s3a://input/data/bucketed/population") \
    .saveAsTable("pop")

                                                                                

In [35]:
# churn_bucketed_df = spark.read.parquet("s3a://input/data/bucketed/churn")
# pop_bucketed_dict = spark.read.parquet("s3a://input/data/bucketed/population")

churn_bucketed_df = spark.table("churn_bucketed")
pop_bucketed_dict = spark.table("pop")  # bucketing only works for tables, not for plain files

In [40]:
# keep only territories that are not small, i.e. population >= threshold
data_with_population = churn_bucketed_df.join(pop_bucketed_dict, on="area code", how="left") \
    .filter(F.col("population") >= POP_THRESHOLD)

In [41]:
bucketed_result = data_with_population \
    .join(state_dict, on="state", how="left") \
    .groupBy("churn") \
    .agg(F.count("*").alias("cnt"))

In [42]:
# restart the session first so that the broadcast hints set earlier are not applied
bucketed_result.explain()

== Physical Plan ==
*(8) HashAggregate(keys=[churn#945], functions=[count(1)])
+- Exchange hashpartitioning(churn#945, 200), ENSURE_REQUIREMENTS, [plan_id=1290]
   +- *(7) HashAggregate(keys=[churn#945], functions=[partial_count(1)])
      +- *(7) Project [churn#945]
         +- *(7) SortMergeJoin [state#925], [state#75], LeftOuter
            :- *(4) Sort [state#925 ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(state#925, 200), ENSURE_REQUIREMENTS, [plan_id=1272]
            :     +- *(3) Project [state#925, churn#945]
            :        +- *(3) SortMergeJoin [area code#927], [area code#970], Inner
            :           :- *(1) Sort [area code#927 ASC NULLS FIRST], false, 0
            :           :  +- *(1) Filter isnotnull(area code#927)
            :           :     +- *(1) ColumnarToRow
            :           :        +- FileScan parquet spark_catalog.default.churn_bucketed[state#925,area code#927,churn#945] Batched: true, Bucketed: true, DataFilters:

- the plan has no Exchange on area code, so that shuffle is eliminated and the join no longer repartitions data by the skewed key
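
Why matching buckets remove the shuffle can be sketched in plain Python. This is a simplified model under stated assumptions: buckets are assigned with `key % num_buckets` rather than Spark's hash function, rows inside a bucket are matched with a nested loop rather than a sort merge, and all names and rows are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; the notebook writes 50 buckets


def bucketize(rows, key, num_buckets=NUM_BUCKETS):
    """Group rows into buckets by `key % num_buckets` (done once, at write time)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[key] % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Inner-join bucket i of the left side only with bucket i of the right side."""
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        # Equal keys always share a bucket id, so no other bucket needs to be read.
        right_rows = right_buckets.get(bucket_id, [])
        for left in left_rows:
            for right in right_rows:
                if left[key] == right[key]:
                    joined.append({**left, **right})
    return joined


# Hypothetical toy data; 131 and 135 share a bucket but are different keys.
churn_rows = [
    {"area code": 131, "churn": "False"},
    {"area code": 132, "churn": "True"},
    {"area code": 135, "churn": "False"},
]
pop_rows = [
    {"area code": 131, "population": 15742},
    {"area code": 132, "population": 9000},
]

joined = bucketed_join(
    bucketize(churn_rows, "area code"),
    bucketize(pop_rows, "area code"),
    "area code",
)
```

The partitioning cost is paid once when the tables are written, which pays off when the same join runs repeatedly. Both sides must use the same bucket count and the same key type, which is presumably why the notebook casts `area code` to string before writing `pop`.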