# 2024: Week 30 - International Football Special

July 24, 2024

Challenge by: Jenny Martin

It's been a big month for International Football, with both the Euros and the Copa América. Personally, I'm not a big football fan, so the question that I'm looking to answer is:

If I'm only going to watch 15 minutes of a match, which time interval will be the most exciting? How does that vary by competition and has it changed over the years?
Inputs
The data this week comes from an incredible data source I found on Kaggle which contains over 47,000 results of international football matches. Thanks to Mart Jürisoo for collecting this data! 

1. Results table 

![1](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZP2sGbD5_kyD7ISa2HYcMgv2xdrigRgf9zdVzKTAcdZEaNYxJdhP1RCvWqpUKPFtZF1P3U5uh62vEo0KXHK0uBOYUxeblifShgdDqKBc4wKpb1iQKMXNhJSThEOCA3wZwyiuZTTvreH83hmhVVFLcrtUaT0dSHEqR_3uwb3_e-1KjjPGEz8Qd9BRjkBjz/s1242/Screenshot%202024-07-12%20112558.png)

2. International Competitions table 

![2](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih9_edQVgi4u54jtdYSPhNjpYxhBVh1qEP0ApXkyLJ5z8TTm0jlOGfknwE3Ej8WcSxT_kZwJ53rBLmRD19hEeCFIrIVHh2VNpT2eeD1owayx6J-a_bLxxqzt5VRmJUNjXFy5FkXagj4_3JKyXCWEc96fBMnBIDmMuD2UyfKYl2JnjdxFgFSwO2WycmG91g/s839/Screenshot%202024-07-12%20112705.png)

3. Goal Scorers table - currently only available for World Cup and Continental Championships so this is where we will focus our analysis 

![3](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPGSAMLx6yHO24UeOWSmE1iHOHxt7OzyrvdZk6vFADFh7u59oMM55x0rPiDKLZHZF7uS7iuh18mdahHK2vRokqAVYoYDu523yM0wHJdjQAzQshzfMBnfAeYIMCaZ2PuIxw3F3pOl4z7FF6Il6IBFKoesZSqocWcgR70NmWt8iGHLMmx9k0Wq4mzXKG0gxe/s1065/Screenshot%202024-07-12%20112823.png)

4. Segment table 

![4](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgyLAB2DR7FlKWz3TSXr7FFDIPLtDmtia_KwtsZH3PurgdtslliaCiAYIeBaJu2qHLwqsLo9DzgcrhyCcZaWVgBBQr8KI8FmIKlxIXzfu_JE4-CJQFRw98dYCVyesbGodYqJMkdtbyNwyLo_V4_VoGqJ2MbCuUJ3wqpLWeIS_bMtVozPOWFCeTj9GoQYwxP/s293/Screenshot%202024-07-12%20112907.png)

### Requirements
- Input the data
- Filter out Qualification rounds from the Results table
- Split out the Football Association and Competition from the tournament field
- e.g. For UEFA Euro, UEFA is the Football Association and Euro is the Competition
- Not all tournaments contain information about the Football Association
- Join to the International Competitions table
- All 8 competitions should be included
- For the CONCACAF Championship ensure the correct Football Association and Competition are joined
- Create a field for the Decade the competition took place in
- Filter the data to 1950s onwards
- Create a Match ID field so every row in the data had a unique identifier
- Calculate the number of matches in each Decade, in each Competition
- Filter out the nulls from the Goal Scorers table and join to the dataset
- Join on the Segment table based on what segment of time the goal was scored in
- e.g. a goal scored 25 minutes into the game should be in the 15-30 segment
- Count how many goals were scored in each Segment, for each Competition and Decade
- Calculate the Expected number of Goals for each Segment, Competition and Decade
- Output the data
### Output

![5](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhIkofZTtNMyoDS8iZsJcnL0EFz_ve-kdk8HD_EidGjHZVSvFwVvAwzxT3CTQdNKecrjSy4WXQe6t6z3PNSr63VNNakxU0QpDjKi8rDZKLKKGPBPLJJ_8cH6jhyYzXmpdPDHNmnQHCfKwPQiwQr_lcxO6Eku5gRmDVOcsJHyblqFVZalnyrrGIMxoFb-tVJ/s927/Screenshot%202024-07-16%20163555.png)

- 6 fields
- Competition
- Decade
- Segment
- Total Goals
- Matches in a Decade per Competition
- Expected number of Goals
- 351 rows (352 including headers)

In [1]:
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("FootballAnalysis").getOrCreate()

# Read the goalscorers.csv file into a DataFrame
scorers_df = spark.read.csv("goalscorers.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
scorers_df.show()

+----------+---------+---------+---------+--------------------+------+--------+-------+
|      date|home_team|away_team|     team|              scorer|minute|own_goal|penalty|
+----------+---------+---------+---------+--------------------+------+--------+-------+
|1916-07-02|    Chile|  Uruguay|  Uruguay|     José Piendibene|    44|   false|  false|
|1916-07-02|    Chile|  Uruguay|  Uruguay|    Isabelino Gradín|    55|   false|  false|
|1916-07-02|    Chile|  Uruguay|  Uruguay|    Isabelino Gradín|    70|   false|  false|
|1916-07-02|    Chile|  Uruguay|  Uruguay|     José Piendibene|    75|   false|  false|
|1916-07-06|Argentina|    Chile|Argentina|       Alberto Ohaco|     2|   false|  false|
|1916-07-06|Argentina|    Chile|    Chile|      Telésforo Báez|    44|   false|  false|
|1916-07-06|Argentina|    Chile|Argentina|  Juan Domingo Brown|    60|   false|   true|
|1916-07-06|Argentina|    Chile|Argentina|  Juan Domingo Brown|    62|   false|   true|
|1916-07-06|Argentina|    Chile|

In [2]:
# Read the International Competitions.csv file into a DataFrame
competitions_df = spark.read.csv("International Competitions.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
competitions_df.show()

+------------+-------------+--------------------+--------------------+---------+
|Participants|    Continent|Football Association|         Competition|    Dates|
+------------+-------------+--------------------+--------------------+---------+
|      Global|         NULL|                FIFA|           World Cup|1930-2022|
| Continental|       Africa|                 CAF|Africa Cup of Nat...|1957-2024|
| Continental|         Asia|                 AFC|           Asian Cup|1956-2024|
| Continental|       Europe|                UEFA|                Euro|1960-2024|
| Continental|North America|            CONCACAF|        Championship|1963–1989|
| Continental|North America|            CONCACAF|            Gold Cup|1991-2023|
| Continental|      Oceania|                 OFC| Oceania Nations Cup|1973-2024|
| Continental|South America|            CONMEBOL|        Copa América|1916-2024|
+------------+-------------+--------------------+--------------------+---------+



In [3]:
# Read the results.csv file into a DataFrame
results_df = spark.read.csv("results.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
results_df.show()

+----------+----------------+---------+----------+----------+----------+---------+--------+-------+
|      date|       home_team|away_team|home_score|away_score|tournament|     city| country|neutral|
+----------+----------------+---------+----------+----------+----------+---------+--------+-------+
|1872-11-30|        Scotland|  England|         0|         0|  Friendly|  Glasgow|Scotland|  false|
|1873-03-08|         England| Scotland|         4|         2|  Friendly|   London| England|  false|
|1874-03-07|        Scotland|  England|         2|         1|  Friendly|  Glasgow|Scotland|  false|
|1875-03-06|         England| Scotland|         2|         2|  Friendly|   London| England|  false|
|1876-03-04|        Scotland|  England|         3|         0|  Friendly|  Glasgow|Scotland|  false|
|1876-03-25|        Scotland|    Wales|         4|         0|  Friendly|  Glasgow|Scotland|  false|
|1877-03-03|         England| Scotland|         1|         3|  Friendly|   London| England|  false|


In [4]:
# Read the segment.csv file into a DataFrame
segment_df = spark.read.csv("segment.csv", header=True, inferSchema=True)

# Show the first few rows of the DataFrame
segment_df.show()

+-------+
|Segment|
+-------+
|   0-15|
|  15-30|
|  30-45|
|  45-60|
|  60-75|
|  75-90|
|    90+|
+-------+



In [5]:
# Filter out qualification rounds from the results_df
filtered_results_df = results_df.filter(~results_df.tournament.contains("Qualification"))

# Show the first few rows of the filtered DataFrame
filtered_results_df.show()

+----------+----------------+---------+----------+----------+----------+---------+--------+-------+
|      date|       home_team|away_team|home_score|away_score|tournament|     city| country|neutral|
+----------+----------------+---------+----------+----------+----------+---------+--------+-------+
|1872-11-30|        Scotland|  England|         0|         0|  Friendly|  Glasgow|Scotland|  false|
|1873-03-08|         England| Scotland|         4|         2|  Friendly|   London| England|  false|
|1874-03-07|        Scotland|  England|         2|         1|  Friendly|  Glasgow|Scotland|  false|
|1875-03-06|         England| Scotland|         2|         2|  Friendly|   London| England|  false|
|1876-03-04|        Scotland|  England|         3|         0|  Friendly|  Glasgow|Scotland|  false|
|1876-03-25|        Scotland|    Wales|         4|         0|  Friendly|  Glasgow|Scotland|  false|
|1877-03-03|         England| Scotland|         1|         3|  Friendly|   London| England|  false|


In [6]:
from pyspark.sql.functions import regexp_extract

# Extract the Football Association from the tournament field
filtered_results_df = filtered_results_df.withColumn('Football_Association', regexp_extract('tournament', '([A-Z]{2,}).*', 1))

# Show the first few rows of the updated DataFrame
filtered_results_df.show()

+----------+----------------+---------+----------+----------+----------+---------+--------+-------+--------------------+
|      date|       home_team|away_team|home_score|away_score|tournament|     city| country|neutral|Football_Association|
+----------+----------------+---------+----------+----------+----------+---------+--------+-------+--------------------+
|1872-11-30|        Scotland|  England|         0|         0|  Friendly|  Glasgow|Scotland|  false|                    |
|1873-03-08|         England| Scotland|         4|         2|  Friendly|   London| England|  false|                    |
|1874-03-07|        Scotland|  England|         2|         1|  Friendly|  Glasgow|Scotland|  false|                    |
|1875-03-06|         England| Scotland|         2|         2|  Friendly|   London| England|  false|                    |
|1876-03-04|        Scotland|  England|         3|         0|  Friendly|  Glasgow|Scotland|  false|                    |
|1876-03-25|        Scotland|   

In [7]:
from pyspark.sql.functions import when, col

# Create the 'Competition' column
filtered_results_df = filtered_results_df.withColumn(
    'Competition',
    when(col('Football_Association') == '', col('tournament'))
    .otherwise(regexp_extract('tournament', 'Football_Association', 1))
)

# Show the first few rows of the updated DataFrame
filtered_results_df.show()

+----------+----------------+---------+----------+----------+----------+---------+--------+-------+--------------------+-----------+
|      date|       home_team|away_team|home_score|away_score|tournament|     city| country|neutral|Football_Association|Competition|
+----------+----------------+---------+----------+----------+----------+---------+--------+-------+--------------------+-----------+
|1872-11-30|        Scotland|  England|         0|         0|  Friendly|  Glasgow|Scotland|  false|                    |   Friendly|
|1873-03-08|         England| Scotland|         4|         2|  Friendly|   London| England|  false|                    |   Friendly|
|1874-03-07|        Scotland|  England|         2|         1|  Friendly|  Glasgow|Scotland|  false|                    |   Friendly|
|1875-03-06|         England| Scotland|         2|         2|  Friendly|   London| England|  false|                    |   Friendly|
|1876-03-04|        Scotland|  England|         3|         0|  Friend

In [8]:
unique_competitions_count = filtered_results_df.select("Competition").distinct().count()
print(f"Number of unique competitions: {unique_competitions_count}")

Number of unique competitions: 125
