# Report
Chen Kewen.3036195526

## 1. AKS Cluster Information

- **AKS Cluster Name:** assign4-cluster
- **Resource Group Name:** assign4-resource-group
- **Storage Account Name:** assign4storageaccount
- **Blob Container Name:** assign4blobcontainer
- **Service Account Name:** assign4serviceaccount
- **Spark Image:** spark:v3.1.2-hadoop3.2


## 2. Dataset Information

- **Dataset URL:** [https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data](https://www.kaggle.com/datasets/bhanupratapbiswas/olympic-data)
- **Dataset Description:** The Olympic Data dataset from Kaggle is a thorough compilation of historical information on the Olympic Games, including data on athletes, events, and results. This dataset offers insights into the performances of athletes from different nations across various editions of both the Summer and Winter Olympics. It consists of 15 columns and 270,000 rows, with data spanning from 1896 to 2016. However, I have focused on data from the 1956 to 2016 Summer Olympics, as I consider the post-World War II Games more reflective of modern sports and hence more pertinent for analysis. The refined dataset includes 15 columns and 171068 rows.
- **Column Descriptions:**
  - **ID:** A unique ID assigned to each athlete.
  - **Name:** The athlete's full name.
  - **Sex:** The gender of the athlete.
  - **Age:** The age of the athlete at the time of competition.
  - **Height:** The athlete's height measured in centimeters.
  - **Weight:** The weight of the athlete in kilograms.
  - **Team:** The team or country the athlete represents.
  - **NOC:** The National Olympic Committee code for the athlete's country.
  - **Games:** The specific edition of the Olympic Games (year and season).
  - **Year:** The year when the Olympic Games took place.
  - **Season:** Indicates whether the Games were held in summer or winter.
  - **City:** The host city for the Olympic Games.
  - **Sport:** The sport in which the athlete competes.
  - **Event:** The specific event or discipline within the sport.
  - **Medal:** The type of medal won by the athlete (Gold, Silver, Bronze).


## 3. Exploratory Data Analysis Questions

1. Distribution and development trends of sports participation between male and female athletes.
2. Identification of the sports development and traditional strong events of various countries.
3. What are the age, height, and weight patterns for gold medalists in different sports to win the championship?
4. Examination of the home advantage phenomenon in the Olympics: Do host countries win significantly more medals compared to the Olympics before and after they host?


In [45]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract,count,when,avg,lag,format_number,concat, lit,row_number,concat_ws,collect_list,expr,greatest
from pyspark.sql.window import Window
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType


In [46]:
# Create SparkSession
spark = SparkSession.builder \
    .appName("Athlete Events Data Import") \
    .getOrCreate()

# Define schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Sex", StringType(), True),
    StructField("Age", DoubleType(), True),
    StructField("Height", DoubleType(), True),
    StructField("Weight", DoubleType(), True),
    StructField("Team", StringType(), True),
    StructField("NOC", StringType(), True),
    StructField("Games", StringType(), True),
    StructField("Year", IntegerType(), True),
    StructField("Season", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Sport", StringType(), True),
    StructField("Event", StringType(), True),
    StructField("Medal", StringType(), True)
])

# File path
csv_file_path = "E:/hku/cloud cluster/ex4/dataset/athlete_events.csv"

# Read CSV file
spark_df = spark.read.csv(csv_file_path, schema=schema, header=True)

# Filter data
spark_df = spark_df.filter((spark_df.Games.contains('Summer')) & (spark_df.Year >= 1956))

# Print schema
spark_df.printSchema()

# Show first 5 rows
spark_df.show(5)



root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Height: double (nullable = true)
 |-- Weight: double (nullable = true)
 |-- Team: string (nullable = true)
 |-- NOC: string (nullable = true)
 |-- Games: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Season: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Sport: string (nullable = true)
 |-- Event: string (nullable = true)
 |-- Medal: string (nullable = true)

+---+------------------+---+----+------+------+-------+---+-----------+----+------+---------+----------+--------------------+-----+
| ID|              Name|Sex| Age|Height|Weight|   Team|NOC|      Games|Year|Season|     City|     Sport|               Event|Medal|
+---+------------------+---+----+------+------+-------+---+-----------+----+------+---------+----------+--------------------+-----+
|  1|         A Dijiang|  M|24.0| 180.0|  80.0|  Ch

# EDA1: Distribution and development trends of sports participation between male and female athletes
1. Calculate the number of male and female athletes participating in each Olympic Games 
2. Calculate the growth rate and ratio of female participants in each Olympic Games
Group by gender and year, and count the number of athletes in each group


In [47]:
#  Group by gender and year, and count the number of athletes in each group
gender_trend_df = spark_df.groupBy("Sex", "Year").agg(count("ID").alias("Participant_Count"))
#  Sort the result
sorted_gender_trend_df = gender_trend_df.orderBy("Year", "Sex")
# Show the first 20 rows
sorted_gender_trend_df.show(40)


+---+----+-----------------+
|Sex|Year|Participant_Count|
+---+----+-----------------+
|  F|1956|              891|
|  M|1956|             4208|
|  F|1960|             1422|
|  M|1960|             6660|
|  F|1964|             1336|
|  M|1964|             6326|
|  F|1968|             1767|
|  M|1968|             6786|
|  F|1972|             2179|
|  M|1972|             8090|
|  F|1976|             2164|
|  M|1976|             6457|
|  F|1980|             1755|
|  M|1980|             5435|
|  F|1984|             2442|
|  M|1984|             6984|
|  F|1988|             3535|
|  M|1988|             8473|
|  F|1992|             4114|
|  M|1992|             8832|
|  F|1996|             4998|
|  M|1996|             8760|
|  F|2000|             5430|
|  M|2000|             8386|
|  F|2004|             5545|
|  M|2004|             7895|
|  F|2008|             5816|
|  M|2008|             7783|
|  F|2012|             5815|
|  M|2012|             7099|
|  F|2016|             6223|
|  M|2016|    

In [48]:
#  Group by gender and year, and count the number of athletes in each group
gender_trend_df = spark_df.groupBy("Sex", "Year").agg(
    count("ID").alias("Participant_Count")
)
# Calculate annual growth rate and participation ratio
window_spec = Window.partitionBy("Sex").orderBy("Year")
gender_trend_df = gender_trend_df.withColumn(
    "Prev_Year_Participant", lag("Participant_Count").over(window_spec)
).withColumn(
    "Growth_Rate", 
    ((col("Participant_Count") - col("Prev_Year_Participant")) / col("Prev_Year_Participant") * 100).cast("decimal(10,3)")
).withColumn(
    "Growth_Rate", concat(col("Growth_Rate"), lit("%"))
)
#Calculate the total number of participants and participation ratio each year
total_participants_df = spark_df.groupBy("Year").agg(count("ID").alias("Total_Participants"))
gender_ratio_df = gender_trend_df.join(total_participants_df, on="Year").withColumn(
    "Participation_Ratio", (col("Participant_Count") / col("Total_Participants") * 100).cast("decimal(10,3)")
).withColumn(
    "Participation_Ratio", concat(col("Participation_Ratio"), lit("%"))
)


In [49]:
# Show all data
gender_ratio_df.orderBy("Year", "Sex").show(n=200000, truncate=False)
gender_ratio_df.orderBy("Year", "Sex").write.csv("E:/hku/cloud cluster/ex4/output/eda1_gender_ratio.csv", header=True)



+----+---+-----------------+---------------------+-----------+------------------+-------------------+
|Year|Sex|Participant_Count|Prev_Year_Participant|Growth_Rate|Total_Participants|Participation_Ratio|
+----+---+-----------------+---------------------+-----------+------------------+-------------------+
|1956|F  |891              |NULL                 |NULL       |5099              |17.474%            |
|1956|M  |4208             |NULL                 |NULL       |5099              |82.526%            |
|1960|F  |1422             |891                  |59.596%    |8082              |17.595%            |
|1960|M  |6660             |4208                 |58.270%    |8082              |82.405%            |
|1964|F  |1336             |1422                 |-6.048%    |7662              |17.437%            |
|1964|M  |6326             |6660                 |-5.015%    |7662              |82.563%            |
|1968|F  |1767             |1336                 |32.260%    |8553              |2


# Inference

Since 1956, the number and proportion of female athletes have steadily increased, with the number rising from 891 in 1956 to 7462 in 2016, and the proportion rising from 17% in 1956 to 45% in 2016 (approaching gender balance). It is inferred that this trend is related to the advancement of gender equality and women's rights movements.


# Analysis of the Action Operation: `gender_trend_df.withColumn(...)`

### Analysis of the Spark Job for `gender_trend_df.withColumn(...)`

#### Action Operation
```python
window_spec = Window.partitionBy("Sex").orderBy("Year")
gender_trend_df = gender_trend_df.withColumn(
    "Prev_Year_Participant", lag("Participant_Count").over(window_spec)
).withColumn(
    "Growth_Rate", 
    ((col("Participant_Count") - col("Prev_Year_Participant")) / col("Prev_Year_Participant") * 100).cast("decimal(10,3)")
).withColumn(
    "Growth_Rate", concat(col("Growth_Rate"), lit("%"))
)



Execution Plan Analysis

The Spark job for gender_trend_df.withColumn(...) involves the following stages:

    Stage 1:
        Operation: Read data from the CSV file and count the number of participants by Sex and Year.
        Details: This stage involves reading the data and performing a group-by operation to count the participants. This group-by operation requires the data to be repartitioned by Sex and Year, which results in a shuffle event.

    Stage 2:
        Operation: Apply the window function to compute the previous year's participant count.
        Details: In this stage, a window function is applied to compute the participant count for the previous year. This operation requires sorting the data, which results in another shuffle event.

Summary

In the analysis of the gender_trend_df.withColumn(...) Spark job, the job is divided into two main stages. The first stage reads the data from the CSV file and counts the number of participants by gender and year, which triggers the first shuffle event due to the group-by operation. The second stage applies a window function to compute the previous year's participant count and calculates the growth rate, triggering a second shuffle event due to the sorting requirement. The execution plan clearly shows the specific operations in each stage and the reasons for the shuffle events, helping us understand the job's execution process and identify potential areas for performance optimization.

# EDA2: Identification of the sports infrastructure and traditional strong events of various countries
Idea: Track the change in the total number of medals over time for each country  2. Filter the leading countries in each sport and finally merge them. 
For example, if France is the leading country in both sport1 and sport2, 
the result will show France: sport1, sport2


In [50]:
# Calculate the total number of medals each country has won every year
medal_counts = spark_df.groupBy("Year", "NOC").agg(count(when(col("Medal") != 'NA', 1)).alias("MedalCount"))

# Use window function to find the top 10 countries with the most medals each year
windowSpec = Window.partitionBy("Year").orderBy(col("MedalCount").desc())
top_10_countries = medal_counts.withColumn("rank", row_number().over(windowSpec)).filter(col("rank") <= 10).drop("rank")

# Display the results as the top 10 countries with the most medals and the corresponding number of medals each year
result = top_10_countries.groupBy("Year").agg(concat_ws(", ", collect_list(concat_ws("-", col("NOC"), col("MedalCount")))).alias("Top10Countries"))
# Show results
result.show(20, truncate=False)

+----+-----------------------------------------------------------------------------------+
|Year|Top10Countries                                                                     |
+----+-----------------------------------------------------------------------------------+
|1956|URS-169, USA-111, AUS-67, HUN-64, GER-52, ITA-47, GBR-46, SWE-34, FRA-33, FIN-26   |
|1960|URS-169, USA-118, ITA-88, GER-88, HUN-66, AUS-46, JPN-31, POL-30, GBR-28, DEN-23   |
|1964|URS-174, USA-153, GER-116, JPN-62, HUN-56, TCH-54, ITA-51, POL-46, AUS-44, FRA-31  |
|1968|URS-192, USA-156, HUN-81, JPN-63, GDR-52, FRG-51, AUS-51, POL-37, ITA-33, YUG-29   |
|1972|URS-214, USA-167, GDR-151, FRG-102, HUN-81, JPN-56, POL-46, ROU-40, TCH-29, GBR-29 |
|1976|URS-286, GDR-195, USA-156, FRG-77, POL-73, ROU-55, HUN-55, JPN-41, BUL-39, GBR-32  |
|1980|URS-442, GDR-264, BUL-90, ROU-68, HUN-61, YUG-57, TCH-51, POL-50, GBR-47, ITA-37   |
|1984|USA-339, FRG-158, ROU-106, YUG-87, CAN-85, CHN-74, GBR-71, FRA-67, ITA-63, AUS-52  |

In [51]:
# Calculate the total number of medals each country has won in each sport
sport_leaders = spark_df.groupBy("NOC", "Sport").agg(count(when(col("Medal") != 'Na', 1)).alias("MedalCount"))

# Find the leading country in each sport
windowSpec = Window.partitionBy("Sport").orderBy(col("MedalCount").desc())
sport_leaders = sport_leaders.withColumn("rank", row_number().over(windowSpec)).filter(col("rank") == 1).drop("rank")

# Merge all leading sports of the same country
merged_sport_leaders = sport_leaders.groupBy("NOC").agg(concat_ws(", ", collect_list(concat_ws("-", col("Sport"), col("MedalCount")))).alias("DominantSports"))

# Show results
merged_sport_leaders.show(40, truncate=False)

# Save results
result.write.csv("E:/hku/cloud cluster/ex4/output/eda2_top10_countries.csv", header=True)
merged_sport_leaders.write.csv("E:/hku/cloud cluster/ex4/output/eda2_dominant_sports.csv", header=True)


+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NOC|DominantSports                                                                                                                                                                                                |
+---+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|POL|Weightlifting-113                                                                                                                                                                                             |
|BRA|Football-295, Volleyball-285                                                                                                                   

# Inference
In terms of total medal count, the United States and the Soviet Union have long dominated the top two positions. From 1956 to 1984, the Soviet Union held the advantage, while the United States has consistently held the advantage since 1984, reflecting the gradual decline of the Soviet Union. The United States excels in various ball sports, swimming, and athletics, whereas the Soviet Union had dominance in strength-based events such as gymnastics, volleyball, and wrestling.

Analyzing the Spark Jobs

Based on the provided code and the parsed physical plan, we can identify multiple Spark jobs. Specifically, we have two main tasks:

    Tracking the Change in Medal Counts Over Time:
        Calculate the total number of medals for each country per year.
        Use window functions to identify the top 10 countries by medal count each year.
        Display the results as the top 10 countries and their corresponding medal counts each year.
        This involves a primary Spark job.

    Identifying Dominant Countries in Each Sport:
        Calculate the total number of medals for each country in each sport.
        Use window functions to identify the leading country in each sport.
        Merge the dominant sports for each country into a single list.
        This also involves a primary Spark job.

Detailed Analysis of the Most Critical Spark Job

Let's assume that identifying the dominant countries in each sport is the most critical Spark job. Here's a detailed analysis of this job:

    First Stage:
        Operation: Filter the dataset to include only Summer Olympics and years after 1956.
        Description: This stage applies filtering conditions to limit the dataset to relevant records.

    Second Stage:
        Operation: Aggregate the data to count the number of medals for each country in each sport.
        Description: This aggregation calculates the medal count for each country-sport combination.
        Shuffle Event: A shuffle occurs here because the data needs to be repartitioned based on the country and sport columns.

    Third Stage:
        Operation: Sort the aggregated data by sport and medal count.
        Description: This stage sorts the data to prepare for ranking within each sport.
        Shuffle Event: Another shuffle happens here as data is repartitioned based on the sport and medal count columns.

    Fourth Stage:
        Operation: Apply window functions to rank the countries within each sport by their medal counts.
        Description: This stage uses the row_number() window function to rank countries and identify the top country in each sport.

    Fifth Stage:
        Operation: Filter to retain only the top-ranked country for each sport.
        Description: This final filtering stage keeps only the leading country for each sport.

Summary Paragraph

In this analysis, we delved into the Olympic dataset to identify the dominant countries in various sports. First, we filtered the data to focus on the Summer Olympics and records from 1956 onwards. Next, we calculated the number of medals each country won in each sport, which triggered a shuffle event as the data had to be reorganized based on country and sport. We then sorted the data by sport and medal count, leading to another shuffle event. Following this, we applied window functions to rank the countries within each sport by their medal counts and filtered the results to retain only the top country for each sport. Finally, we merged the dominant sports for each country into a single list. This critical Spark job involved multiple stages and shuffle events, reflecting the complexity of the analysis.

This summary describes the most critical Spark job's stages, the shuffle events, and the key operations performed at each stage. If you have further questions or need additional details, please let me know!


# EDA3: The Relationship Between Height Vs Weight Vs Age of Participants Across Sports
 Analyze the distribution of winning age, height, and weight in various sports. 
 In other words, divide the age into intervals from 10 to 80, every 5 years, 10-15, 15-20.....,
 and the height and weight into different intervals.
 The final result is similar to: Sport1, the most winning age interval, the most winning weight interval, the most winning height interval



In [54]:
# Create a general bin function
def create_bin(column, bins, labels):
    bin_expr = F.when((column >= bins[0]) & (column < bins[1]), labels[0])
    for i in range(1, len(bins) - 1):
        bin_expr = bin_expr.when((column >= bins[i]) & (column < bins[i+1]), labels[i])
    return bin_expr

# Define bins and labels
age_bins = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
age_labels = ["10-15", "15-20", "20-25", "25-30", "30-35", "35-40", "40-45", "45-50", "50-55", "55-60", "60-65", "65-70", "70-75", "75-80"]

height_bins = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 170, 175, 180, 185, 190, 195, 200, 205, 210, 215, 220]
height_labels = ["120-125", "125-130", "130-135", "135-140", "140-145", "145-150", "150-155", "155-160", "160-165", "165-170", "170-175", "175-180", "180-185", "185-190", "190-195", "195-200", "200-205", "205-210", "210-215", "215-220"]

weight_bins = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150]
weight_labels = ["30-40", "40-50", "50-60", "60-70", "70-80", "80-90", "90-100", "100-110", "110-120", "120-130", "130-140", "140-150"]

# Add bin columns
spark_df = spark_df.withColumn("Age_Bin", create_bin(F.col("Age"), age_bins, age_labels))
spark_df = spark_df.withColumn("Height_Bin", create_bin(F.col("Height"), height_bins, height_labels))
spark_df = spark_df.withColumn("Weight_Bin", create_bin(F.col("Weight"), weight_bins, weight_labels))

# Filter out gold medalists
gold_medalists = spark_df.filter(spark_df["Medal"] == "Gold")

# Group statistics for the most winning age interval, height interval, and weight interval in different sports
age_mode = gold_medalists.groupBy("Sport", "Age_Bin").count().withColumnRenamed("count", "Age_Count")
height_mode = gold_medalists.groupBy("Sport", "Height_Bin").count().withColumnRenamed("count", "Height_Count")
weight_mode = gold_medalists.groupBy("Sport", "Weight_Bin").count().withColumnRenamed("count", "Weight_Count")

# Find the most winning age interval for each sport
age_mode = age_mode.withColumn("Row_Number", F.row_number().over(Window.partitionBy("Sport").orderBy(F.desc("Age_Count"))))
age_mode = age_mode.filter(age_mode["Row_Number"] == 1).drop("Row_Number")

# Find the most winning height interval for each sport
height_mode = height_mode.withColumn("Row_Number", F.row_number().over(Window.partitionBy("Sport").orderBy(F.desc("Height_Count"))))
height_mode = height_mode.filter(height_mode["Row_Number"] == 1).drop("Row_Number")

# Find the most winning weight interval for each sport
weight_mode = weight_mode.withColumn("Row_Number", F.row_number().over(Window.partitionBy("Sport").orderBy(F.desc("Weight_Count"))))
weight_mode = weight_mode.filter(weight_mode["Row_Number"] == 1).drop("Row_Number")

# Merge results
result = age_mode.join(height_mode, on="Sport").join(weight_mode, on="Sport")

# Select and rename required columns
result = result.select("Sport", "Age_Bin", "Height_Bin", "Weight_Bin")
result.show(200)
# Write EDA3 results to file
result.write.csv("E:/hku/cloud cluster/ex4/output/eda3_height_weight_age.csv", header=True)




+--------------------+-------+----------+----------+
|               Sport|Age_Bin|Height_Bin|Weight_Bin|
+--------------------+-------+----------+----------+
|             Archery|  20-25|   165-170|     70-80|
|           Athletics|  20-25|   180-185|     60-70|
|           Badminton|  25-30|   175-180|     60-70|
|            Baseball|  25-30|   185-190|     80-90|
|          Basketball|  20-25|   190-195|    90-100|
|    Beach Volleyball|  30-35|   190-195|     70-80|
|              Boxing|  20-25|   165-170|     60-70|
|            Canoeing|  20-25|   180-185|     80-90|
|             Cycling|  20-25|   180-185|     70-80|
|              Diving|  20-25|   160-165|     50-60|
|       Equestrianism|  30-35|   170-175|     60-70|
|             Fencing|  25-30|   175-180|     70-80|
|            Football|  20-25|   170-175|     70-80|
|                Golf|  25-30|   165-170|     60-70|
|          Gymnastics|  20-25|   160-165|     50-60|
|            Handball|  25-30|   180-185|     

# Inference
For the majority of Olympic events, gold medalists are typically within the age ranges of 20-25 and 25-30, likely because 20-30 is the peak period for human physical performance. However, there are exceptions: the gold medal age range for Rhythmic Gymnastics is 15-20, and for Beach Volleyball, it is 30-35. This could be due to several reasons:

- **Physical demands of gymnastics:**
    - **Flexibility:** Gymnastics requires high levels of flexibility and agility, areas where younger athletes often have a distinct advantage.
    - **Strength and explosiveness:** Young athletes typically have optimal muscle strength and explosiveness, necessary for performing complex movements.
    - **Weight and body composition:** A smaller body weight and compact physique help athletes execute intricate aerial maneuvers and reduce injury risk.



Analyzing the Spark Jobs

Based on the provided code and the parsed physical plan, we can identify multiple Spark jobs. Specifically, we have 3 main tasks:

1. Adding bin columns (`Age_Bin`, `Height_Bin`, `Weight_Bin`).
2. Grouping statistics for the most winning age interval, height interval, and weight interval.
3. Merging results for the final output.

Detailed Analysis of the Most Critical Spark Job

Let's assume that grouping statistics for the most winning intervals is the most critical Spark job. Here's a detailed analysis of this job:

First Stage:
- Operation: `groupBy("Sport", "Age_Bin").count()`
- Description: This operation groups the gold medalists by sport and age bin, then counts the number of occurrences in each group.
- Shuffle Event: This stage involves a shuffle because `groupBy` operation requires data to be redistributed across the cluster to ensure that all rows with the same key (Sport, Age_Bin) end up in the same partition.

Second Stage:
- Operation: `groupBy("Sport", "Height_Bin").count()`
- Description: Similar to the first stage, this operation groups the data by sport and height bin and counts the occurrences.
- Shuffle Event: Again, a shuffle occurs due to the `groupBy` operation.

Third Stage:
- Operation: `groupBy("Sport", "Weight_Bin").count()`
- Description: This stage groups the data by sport and weight bin and counts the occurrences.
- Shuffle Event: This stage also triggers a shuffle event due to the `groupBy` operation.

Fourth Stage:
- Operation: `withColumn("Row_Number", F.row_number().over(Window.partitionBy("Sport").orderBy(F.desc("Age_Count"))))`
- Description: This operation calculates the row number for each group partitioned by sport and ordered by the descending age count.
- Shuffle Event: This operation may trigger a shuffle if the data needs to be repartitioned for window functions.

Fifth Stage:
- Operation: `filter(age_mode["Row_Number"] == 1)`
- Description: Filters the result to keep only the rows with the highest count for each sport in the age bin.
- Shuffle Event: This operation may trigger a shuffle depending on how the filtering is implemented.

Summary Paragraph

The critical Spark job involves multiple stages, each performing essential operations to group and count data based on different bins (age, height, weight) and sports. The key operations are `groupBy` and `count`, which trigger shuffle events to redistribute data across the cluster. Additional operations like `withColumn` for row numbering and filtering to keep the highest counts also contribute to the job's complexity. Understanding these stages helps optimize performance and ensures efficient data processing in PySpark for exploratory data analysis.

# EDA4: Do host countries win significantly more medals compared to the Olympics before and after they host?
 Compare the total number of medals won by host countries during the host period and non-host periods. 
 Only compare the host period with the two previous and two subsequent Olympics. 
 Add a new column showing the percentage increase in medals won during the host period compared to the maximum number of medals won during the four non-host periods.
 Assume you have created SparkSession and imported spark_df
 spark = SparkSession.builder.appName("Olympic Analysis").getOrCreate()



In [53]:
# Filter records with medals
medals_df = spark_df.filter(col("Medal") != 'NA')

# Initialize host countries dictionary
host_countries = {
    1956: 'AUS', 1960: 'ITA', 1964: 'JPN', 1968: 'MEX', 1972: 'FRG', 1976: 'CAN',
    1980: 'URS', 1984: 'USA', 1988: 'KOR', 1992: 'ESP', 1996: 'USA', 2000: 'AUS',
    2004: 'GRE', 2008: 'CHN', 2012: 'GBR', 2016: 'BRA'
}

# Create a temporary view for SQL queries
medals_df.createOrReplaceTempView("medals")

# Result list
results = []

# Calculate the number of medals won by the host country during the host period and the two previous and two subsequent Olympics
for year, country in host_countries.items():
    query = f"""
        SELECT
            {year} as Year,
            '{country}' as Country,
            COUNT(CASE WHEN NOC = '{country}' AND Year = {year} THEN Event END) as Medals_host,
            COUNT(CASE WHEN NOC = '{country}' AND Year = {year-4} THEN Event END) as Medals_nothost1,
            COUNT(CASE WHEN NOC = '{country}' AND Year = {year-8} THEN Event END) as Medals_nothost2,
            COUNT(CASE WHEN NOC = '{country}' AND Year = {year+4} THEN Event END) as Medals_nothost3,
            COUNT(CASE WHEN NOC = '{country}' AND Year = {year+8} THEN Event END) as Medals_nothost4
        FROM medals
    """
    result = spark.sql(query)
    results.append(result)

# Combine all results into one DataFrame
final_df = results[0]
for df in results[1:]:
    final_df = final_df.union(df)

# Add a new column to calculate the percentage increase in medals won during the host period compared to the maximum number of medals won during the non-host periods
final_df = final_df.withColumn(
    "Medals_bigger%",
    (col("Medals_host") - greatest("Medals_nothost1", "Medals_nothost2", "Medals_nothost3", "Medals_nothost4"))
    / col("Medals_host") * 100
)

final_df.show()
# Write EDA4 results to file
final_df.write.csv("E:/hku/cloud cluster/ex4/output/eda4_host_country_medals.csv", header=True)

+----+-------+-----------+---------------+---------------+---------------+---------------+-------------------+
|Year|Country|Medals_host|Medals_nothost1|Medals_nothost2|Medals_nothost3|Medals_nothost4|     Medals_bigger%|
+----+-------+-----------+---------------+---------------+---------------+---------------+-------------------+
|1956|    AUS|         67|              0|              0|             46|             44| 31.343283582089555|
|1960|    ITA|         88|             47|              0|             51|             33|  42.04545454545455|
|1964|    JPN|         62|             31|             24|             63|             56|-1.6129032258064515|
|1968|    MEX|          9|              1|              1|              1|              2|  77.77777777777779|
|1972|    FRG|        102|             51|              0|             77|              0| 24.509803921568626|
|1976|    CAN|         23|             11|             10|              0|             85| -269.5652173913044|
|

# Inference:
In most countries hosting the Olympics, the host nation effect is quite evident, generally resulting in a performance increase of over 30%. However, the 1976, 2012, and 2016 Olympics are exceptions to this trend. Various factors could contribute to these anomalies, such as the reduction in the number of events during the 1970s due to the Cold War between the US and the Soviet Union.


Analyzing the Spark Jobs

Based on the provided code and the parsed physical plan, we can identify multiple Spark jobs. Specifically, we have 3 main tasks:

1. Filtering records with medals.
2. Calculating the number of medals won by host countries during the host period and the non-host periods.
3. Adding a new column to show the percentage increase in medals won during the host period.

Detailed Analysis of the Most Critical Spark Job

Let's assume that calculating the number of medals won by host countries during the host period and non-host periods is the most critical Spark job. Here's a detailed analysis of this job:

First Stage:
- Operation: `filter(col("Medal") != 'NA')`
- Description: This operation filters out rows where the Medal column is 'NA', leaving only records with valid medal entries.
- Shuffle Event: This operation does not involve a shuffle as it is a simple filter operation.

Second Stage:
- Operation: `spark.sql(query)`
- Description: For each host year and country, a SQL query is executed to count the number of medals won during the host period and two previous and two subsequent Olympics.
- Shuffle Event: This stage involves a shuffle because the `COUNT` operations and conditions on `NOC` and `Year` require data to be redistributed across the cluster to ensure correct grouping and counting.

Third Stage:
- Operation: `union(df)`
- Description: This operation combines the results of the individual SQL queries into one DataFrame.
- Shuffle Event: This stage involves a shuffle to merge the different partitions of each individual DataFrame into a single cohesive DataFrame.

Fourth Stage:
- Operation: `withColumn("Medals_bigger%", ...)`
- Description: This operation adds a new column calculating the percentage increase in medals won during the host period compared to the maximum number of medals won during the non-host periods.
- Shuffle Event: This operation may trigger a shuffle if it involves repartitioning the data for the new column calculation.

Summary Paragraph

The critical Spark job in this analysis calculates the number of medals won by host countries during their host period and compares it with the two previous and two subsequent Olympics. This job involves filtering the data, executing multiple SQL queries for counting medals, and combining the results into a single DataFrame. Key operations like `spark.sql` and `union` trigger shuffle events due to the need for data redistribution and merging of partitions. Finally, a new column is added to show the percentage increase in medals won during the host period, providing insights into the performance boost for host countries. Understanding these stages helps optimize performance and ensures efficient data processing in PySpark for this exploratory data analysis.