# **FIFA World Cup (1930 - 2018) by the Numbers** 

As the 2022 FIFA World Cup comes to a close, with Messi finally winning his first World Cup, let's look back from the birth of the Copa Mundial in 1930 through the 2018 edition to see how the tournament has evolved over the decades.

**The questions we seek to answer in this Exploratory Data Analysis are:**
1. Which team has won the most World Cups?
2. Does being host give a better chance of a World Cup Win?
3. How many unique teams have joined the World Cup?
4. Which team has the most World Cup appearances? Which ones have the least?
5. How has goalscoring improved throughout the years?
6. How many goals do championship teams usually score and concede?

## Setup
The next cell imports all Python libraries needed for the project.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Import all datasets
Each import shows the first 5 rows of its data as a preview.

In [None]:
wc_1930 = pd.read_csv("../data/FIFA - 1930.csv", index_col="Position")
wc_1930.head()

In [None]:
wc_1934 = pd.read_csv("../data/FIFA - 1934.csv", index_col="Position")
wc_1934.head()

In [None]:
wc_1938 = pd.read_csv("../data/FIFA - 1938.csv", index_col="Position")
wc_1938.head()

In [None]:
wc_1950 = pd.read_csv("../data/FIFA - 1950.csv", index_col="Position")
wc_1950.head()

In [None]:
wc_1954 = pd.read_csv("../data/FIFA - 1954.csv", index_col="Position")
wc_1954.head()

In [None]:
wc_1958 = pd.read_csv("../data/FIFA - 1958.csv", index_col="Position")
wc_1958.head()

In [None]:
wc_1962 = pd.read_csv("../data/FIFA - 1962.csv", index_col="Position")
wc_1962.head()

In [None]:
wc_1966 = pd.read_csv("../data/FIFA - 1966.csv", index_col="Position")
wc_1966.head()

In [None]:
wc_1970 = pd.read_csv("../data/FIFA - 1970.csv", index_col="Position")
wc_1970.head()

In [None]:
wc_1974 = pd.read_csv("../data/FIFA - 1974.csv", index_col="Position")
wc_1974.head()

In [None]:
wc_1978 = pd.read_csv("../data/FIFA - 1978.csv", index_col="Position")
wc_1978.head()

In [None]:
wc_1982 = pd.read_csv("../data/FIFA - 1982.csv", index_col="Position")
wc_1982.head()

In [None]:
wc_1986 = pd.read_csv("../data/FIFA - 1986.csv", index_col="Position")
wc_1986.head()

In [None]:
wc_1990 = pd.read_csv("../data/FIFA - 1990.csv", index_col="Position")
wc_1990.head()

In [None]:
wc_1994 = pd.read_csv("../data/FIFA - 1994.csv", index_col="Position")
wc_1994.head()

In [None]:
wc_1998 = pd.read_csv("../data/FIFA - 1998.csv", index_col="Position")
wc_1998.head()

In [None]:
wc_2002 = pd.read_csv("../data/FIFA - 2002.csv", index_col="Position")
wc_2002.head()

In [None]:
wc_2006 = pd.read_csv("../data/FIFA - 2006.csv", index_col="Position")
wc_2006.head()

In [None]:
wc_2010 = pd.read_csv("../data/FIFA - 2010.csv", index_col="Position")
wc_2010.head()

In [None]:
wc_2014 = pd.read_csv("../data/FIFA - 2014.csv", index_col="Position")
wc_2014.head()

In [None]:
wc_2018 = pd.read_csv("../data/FIFA - 2018.csv", index_col="Position")
wc_2018.head()

In [None]:
wc_summ = pd.read_csv("../data/FIFA - World Cup Summary.csv", index_col="YEAR")
wc_summ.head()

In [None]:
# Generate a list containing all dataframes for later use

df_list = [wc_1930, wc_1934, wc_1938, wc_1950, wc_1954, wc_1958, wc_1962, wc_1966, wc_1970, wc_1974, wc_1978, wc_1982, wc_1986, wc_1990, wc_1994, wc_1998, wc_2002, wc_2006,
wc_2010, wc_2014, wc_2018]
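
The twenty-one near-identical `read_csv` cells above could be collapsed into a loop. The sketch below is one possible way to build the file paths, assuming the files keep the `FIFA - <year>.csv` naming used above. `WORLD_CUP_YEARS` and `tournament_paths` are hypothetical names, not part of the original notebook.

```python
from pathlib import Path

# World Cups ran every four years from 1930, except 1942 and 1946 (no tournament).
WORLD_CUP_YEARS = [year for year in range(1930, 2019, 4) if year not in (1942, 1946)]


def tournament_paths(data_dir="../data"):
    """Map each tournament year to its expected CSV path (hypothetical helper)."""
    return {year: Path(data_dir) / f"FIFA - {year}.csv" for year in WORLD_CUP_YEARS}


paths = tournament_paths()
print(len(paths))   # number of tournaments covered
print(paths[1930])  # path for the first tournament

# With pandas imported as pd, the dataframes could then be built in one pass:
# df_list = [pd.read_csv(path, index_col="Position") for path in paths.values()]
```

Keeping the paths in a dictionary keyed by year also makes it easy to look up a single tournament later without relying on its position in `df_list`.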

## Plots and Analysis

### 1. Which team has won the most World Cups?

In [None]:
# Generate a count plot of World Cup wins per Team

plt.figure(figsize=(14,9))
plt.title("Number of World Cup Wins per Team")
sns.countplot(data=wc_summ, x="CHAMPION", order=wc_summ['CHAMPION'].value_counts().index)
plt.xlabel("Team")

Based on the plot, Brazil has won the most World Cups with 5, followed by Italy with 4 and West Germany with 3. Note that the dataset counts West Germany and Germany separately; combined, Germany has 4 titles, level with Italy. England and Spain, home to two of the world's largest leagues (the EPL and La Liga), have only 1 win each. One possible explanation is that many players in these leagues are foreign rather than products of domestic grassroots programs, but this dataset cannot confirm that.
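
The count plot treats West Germany and Germany as separate champions because that is how they are labelled in the `CHAMPION` column. As a rough sketch of how the tally changes when the two are merged, the snippet below hardcodes the champions from 1930 to 2018 (taken from the historical record rather than read from the CSV) and uses a hypothetical `ALIASES` mapping. Neither is part of the original notebook.

```python
from collections import Counter

# Champions from 1930 to 2018 in chronological order (hardcoded for illustration).
champions = [
    "Uruguay", "Italy", "Italy", "Uruguay", "West Germany", "Brazil", "Brazil",
    "England", "Brazil", "West Germany", "Argentina", "Italy", "Argentina",
    "West Germany", "Brazil", "France", "Brazil", "Italy", "Spain", "Germany",
    "France",
]

# Hypothetical alias table: count West Germany's titles under Germany.
ALIASES = {"West Germany": "Germany"}

raw_wins = Counter(champions)
merged_wins = Counter(ALIASES.get(team, team) for team in champions)

print(raw_wins.most_common(3))
print(merged_wins.most_common(3))
```

On the real dataframe, the same idea could be applied with `wc_summ["CHAMPION"].replace(ALIASES)` before plotting, if merging the two labels is the desired reading.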

### 2. Does being host give a better chance of a World Cup Win?


In [None]:
# Add a boolean column that is True when the host country won the tournament

wc_summ["HOST WON"] = wc_summ["HOST"] == wc_summ["CHAMPION"]

In [None]:
# Generate a count plot of booleans that indicates that the host country won

plt.figure(figsize=(10,9))
plt.title("Number of Host Country Wins")
sns.countplot(data=wc_summ, x="HOST WON")
plt.xlabel("Boolean for Host Country Wins")

In [None]:
# Calculating percentage of Host Country Wins

host_win_total = wc_summ["HOST WON"].sum()  # True counts as 1
tournament_total = len(wc_summ)
host_win_percentage = (host_win_total / tournament_total) * 100
host_win_percentage

Host countries have won 6 of 21 World Cups, a win rate of about 28.57%. Hosts lose far more often than they win, so hosting alone does not appear to be a decisive factor in winning the World Cup. This figure does not account for the ranking and skill level of each host's national team, so it cannot isolate the effect of home advantage.
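
One way to put the 28.57% figure in context is to compare it with a naive baseline in which every team in a tournament is equally likely to win. The sketch below is only a rough yardstick under stated assumptions: the field sizes are hardcoded from the historical record rather than read from the CSVs, each tournament is treated as having a single host (2002 was co-hosted), and team strength is ignored entirely.

```python
# Number of teams at each final tournament from 1930 to 2018, in order
# (hardcoded assumption; len(df) for each table in df_list would replace this).
field_sizes = [13, 16, 15, 13] + [16] * 7 + [24] * 4 + [32] * 6

host_wins = 6  # from the analysis above
tournaments = len(field_sizes)

observed_rate = host_wins / tournaments * 100

# Naive baseline: a single host in a field of n equally matched teams
# wins with probability 1/n.
expected_host_wins = sum(1 / n for n in field_sizes)
baseline_rate = expected_host_wins / tournaments * 100

print(f"Observed host win rate: {observed_rate:.2f}%")
print(f"Naive baseline rate:    {baseline_rate:.2f}%")
```

Because hosts are often strong footballing nations to begin with, the gap between the two numbers cannot be read as the effect of hosting on its own.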

### 3. How many unique teams have joined the World Cup?

In [None]:
# Generate list of teams and unique teams

list_of_teams = []
for tournament in df_list:
    for team in tournament["Team"]:
        list_of_teams.append(team)
unique_teams = list(set(list_of_teams))
print(f"There are {len(unique_teams)} unique teams across {len(list_of_teams)} total tournament entries in the history of the World Cup.")

### 4. Which team has the most World Cup appearances? Which ones have the least?

In [None]:
# Calculate number of appearances per team in the World Cup 

team_apps = {
    "Team": [],
    "Appearances": []
}
for team in unique_teams:
    team_apps["Team"].append(team)
    team_apps["Appearances"].append(list_of_teams.count(team))
team_apps_data = pd.DataFrame(team_apps).sort_values(by="Appearances", ascending=False).reset_index(drop=True)
team_apps_data.head()

In [None]:
team_apps_data.tail()

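Brazil has appeared in all 21 World Cups, followed by Italy with 18, Argentina with 17, Mexico with 16, and England with 15. At the other end, the five teams shown by `tail()` (the Czech Republic, Slovakia, Togo, Serbia and Montenegro, and Jamaica) have 1 appearance each. Since `tail()` only shows the last five rows, other teams with a single appearance are not listed here. If the per-tournament files label West Germany and Germany separately, as the summary file does, Germany's combined total is also split across two rows in this ranking.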
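
The appearance count above can also be sketched with `collections.Counter`, which tallies every team in one pass and makes it easy to list all teams tied for the fewest appearances rather than only the last five rows. The nested lists below are a made-up stand-in for the `Team` columns in `df_list`, and the variable names are hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the "Team" columns of three tournaments.
teams_per_tournament = [
    ["Brazil", "Italy", "Mexico"],
    ["Brazil", "Italy", "Togo"],
    ["Brazil", "Mexico", "Jamaica"],
]

# Flatten the per-tournament lists, then count each team in a single pass.
appearances = Counter(chain.from_iterable(teams_per_tournament))

# Collect every team tied for the lowest appearance count.
fewest = min(appearances.values())
least_frequent = sorted(team for team, n in appearances.items() if n == fewest)

print(appearances.most_common(1))
print(least_frequent)
```

With the real data, `teams_per_tournament` could be replaced by `(df["Team"] for df in df_list)`.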

### 5. How has goalscoring improved throughout the years?

In [None]:
# Generate plot for total goals per year

plt.figure(figsize=(14,9))
plt.title("Total Goals per Year")
sns.scatterplot(x=wc_summ.index, y=wc_summ["GOALS SCORED"])
sns.regplot(x=wc_summ.index, y=wc_summ["GOALS SCORED"])
plt.xlabel("Year")
plt.ylabel("Goals Scored")

In [None]:
# Generate plot for average number of goals per game per year

plt.figure(figsize=(14,9))
plt.title("Average Goals per Game")
sns.scatterplot(x=wc_summ.index, y=wc_summ["AVG GOALS PER GAME"])
sns.regplot(x=wc_summ.index, y=wc_summ["AVG GOALS PER GAME"])
plt.xlabel("Year")
plt.ylabel("Average Goals per Game")

In [None]:
# Generate plot for total matches per year

plt.figure(figsize=(14,9))
plt.title("Total Matches per Year")
sns.scatterplot(x=wc_summ.index, y=wc_summ["MATCHES PLAYED"])
sns.regplot(x=wc_summ.index, y=wc_summ["MATCHES PLAYED"])
plt.xlabel("Year")
plt.ylabel("Matches per Year")

Based on the plots, total goals per tournament are increasing. This led me to ask whether teams are getting better at scoring. The average goals per game plot suggests otherwise: the average trends downward, which points to teams getting better defensively rather than at goalscoring. The rise in total goals can therefore be attributed to the growing number of matches per tournament rather than to teams scoring more.
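
The relationship behind this reasoning is simply total goals = matches played × average goals per game. As a small worked example, the snippet below compares the first and last tournaments in the range. The goal and match totals are quoted from the historical record as an assumption and should be checked against the values in `wc_summ`.

```python
# (goals, matches) for two tournaments, quoted from the historical record as an
# assumption; the notebook's own values live in wc_summ.
tournaments = {1930: (70, 18), 2018: (169, 64)}

for year, (goals, matches) in tournaments.items():
    print(f"{year}: {goals} goals / {matches} matches = {goals / matches:.2f} per game")

# Compare how much each quantity grew between the two tournaments.
goals_growth = tournaments[2018][0] / tournaments[1930][0]
matches_growth = tournaments[2018][1] / tournaments[1930][1]
print(f"Goals grew {goals_growth:.2f}x while matches grew {matches_growth:.2f}x")
```

When matches grow faster than goals, the per-game average has to fall, which is the pattern the plots show.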

### 6. How many goals do championship teams usually score and concede?

In [None]:
# Generate champion data

champion_data = {
    "Year": [1930, 1934, 1938, 1950, 1954, 1958, 1962, 1966, 1970, 1974, 1978, 1982, 1986, 1990, 1994, 1998, 2002, 2006, 2010, 2014, 2018],
    "Team": [],
    "Goals Scored": [],
    "Goals Conceded": [],
}

# Assumes each table is sorted by final position, so row 0 is the champion
for tournament in df_list:
    champion_data["Team"].append(tournament.iloc[0]["Team"])
    champion_data["Goals Scored"].append(tournament.iloc[0]["Goals For"])
    champion_data["Goals Conceded"].append(tournament.iloc[0]["Goals Against"])

champion_df = pd.DataFrame(champion_data).set_index("Year")
champion_df.head()

In [None]:
# Generate probability density of champions' goals scored and conceded

plt.title("Probability Density of Champions' Goal Stats")
sns.kdeplot(data=champion_df["Goals Scored"], color='r', fill=True, label="Goals Scored")
sns.kdeplot(data=champion_df["Goals Conceded"], color='b', fill=True, label="Goals Conceded")
plt.legend()
plt.xlabel("Goals")
plt.ylabel("Probability Density")

Champions usually score around 15 goals and concede around 3 to 4. They rarely score more than 20 or fewer than about 6 to 8 goals, and rarely concede more than 10. However, there are extreme cases, such as a champion that scored 25 goals while conceding around 13.
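
Reading values off a density plot is approximate, so a few summary statistics can back up the estimates above. The sketch below uses the standard-library `statistics` module on placeholder numbers. The lists are made up for illustration and are not the real champions' totals, and `summarize` is a hypothetical helper.

```python
import statistics

# Placeholder values for illustration only, not the real champions' totals.
goals_scored = [15, 12, 11, 25, 8, 14, 18]
goals_conceded = [3, 5, 14, 2, 4, 6, 4]


def summarize(values):
    """Return the median and the (min, max) range of a list of goal counts."""
    return statistics.median(values), (min(values), max(values))


print(summarize(goals_scored))
print(summarize(goals_conceded))
```

On the real data, the same helper could be called as `summarize(champion_df["Goals Scored"].tolist())`, or `champion_df.describe()` would give a similar overview directly in pandas.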