### Analyzing the FIFA World Cup 

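This project analyzes three CSV files:
- goalscorers.csv, which lists individual goals (teams, scorer, minute, own goal, penalty)
- results.csv, which lists match results (teams, scores, tournament, city, country, neutral venue)
- shootout.csv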


Questions to Answer:

1) Do teams perform better when playing as the host team in the men's FIFA World Cup tournament?
2) Who were the top goal scorers individually and by team?
3) Do teams that perform better in the 1st half or in the 2nd half win more often?


In [7]:
# Importing pandas
import pandas as pd


In [10]:
# Importing results.csv, which will help us answer the first question
results = pd.read_csv("results.csv")
results.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [12]:
# Question 1: do teams perform better when playing as the host team in the men's FIFA World Cup tournament?
# World Cup matches account for 964 of the records
results["tournament"].value_counts()

Friendly                                17593
FIFA World Cup qualification             7878
UEFA Euro qualification                  2631
African Cup of Nations qualification     1976
FIFA World Cup                            964
                                        ...  
Évence Coppée Trophy                        1
Copa Confraternidad                         1
Real Madrid 75th Anniversary Cup            1
TIFOCO Tournament                           1
FIFA 75th Anniversary Cup                   1
Name: tournament, Length: 142, dtype: int64

In [81]:
results["date"].value_counts()

2012-02-29    66
2016-03-29    64
2008-03-26    60
2014-03-05    59
2012-11-14    56
              ..
1974-02-23     1
1974-02-24     1
1974-02-25     1
1974-02-27     1
2023-03-29     1
Name: date, Length: 15572, dtype: int64

In [27]:
# Creating a new df, world_cup_matches, that stores only World Cup matches
world_cup_matches = results[results["tournament"] == "FIFA World Cup"]
world_cup_matches.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1311,1930-07-13,Belgium,United States,0,3,FIFA World Cup,Montevideo,Uruguay,True
1312,1930-07-13,France,Mexico,4,1,FIFA World Cup,Montevideo,Uruguay,True
1313,1930-07-14,Brazil,Yugoslavia,1,2,FIFA World Cup,Montevideo,Uruguay,True
1314,1930-07-14,Peru,Romania,1,3,FIFA World Cup,Montevideo,Uruguay,True
1315,1930-07-15,Argentina,France,1,0,FIFA World Cup,Montevideo,Uruguay,True


In [29]:
# Separating the games played by the host country
# Logic: keep rows where country == home_team or country == away_team
# In practice the host country is always listed as the home_team (checked in the next cell)
is_host_home = world_cup_matches["country"] == world_cup_matches["home_team"]
is_host_away = world_cup_matches["country"] == world_cup_matches["away_team"]
host_country_games = world_cup_matches[is_host_home | is_host_away]
host_country_games.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1320,1930-07-18,Uruguay,Peru,1,0,FIFA World Cup,Montevideo,Uruguay,False
1325,1930-07-21,Uruguay,Romania,4,0,FIFA World Cup,Montevideo,Uruguay,False
1329,1930-07-27,Uruguay,Yugoslavia,6,1,FIFA World Cup,Montevideo,Uruguay,False
1330,1930-07-30,Uruguay,Argentina,4,2,FIFA World Cup,Montevideo,Uruguay,False
1694,1934-05-27,Italy,United States,7,1,FIFA World Cup,Rome,Italy,False


In [33]:
# Double-checking that the away team is never the host country (an empty result confirms it)
away_check = host_country_games[(host_country_games["away_team"] == host_country_games["country"])]
away_check.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
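
The empty result confirms the assumption for this dataset. A slightly more defensive version of the same check, sketched here on a few rows copied from the outputs in this notebook, counts host appearances on each side and stops with an error if the host ever shows up as `away_team`. The frame `sample_matches` is illustrative only and stands in for `world_cup_matches`.

```python
import pandas as pd

# A few World Cup rows copied from the notebook output (illustrative sample only).
sample_matches = pd.DataFrame(
    {
        "home_team": ["Belgium", "Uruguay", "Uruguay", "Italy"],
        "away_team": ["United States", "Peru", "Romania", "United States"],
        "country": ["Uruguay", "Uruguay", "Uruguay", "Italy"],
    }
)

host_is_home = sample_matches["country"] == sample_matches["home_team"]
host_is_away = sample_matches["country"] == sample_matches["away_team"]

print("Host listed as home_team:", int(host_is_home.sum()))
print("Host listed as away_team:", int(host_is_away.sum()))

# Later cells read the host's goals from home_score, so stop if this ever fails.
assert not host_is_away.any(), "host appears as away_team; score columns need swapping"
```

Note that the comparison relies on the `country` value being spelled exactly like the team name; any host whose names differ between the two columns would be silently missed.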


In [43]:
# Gives us the number of rows in host_country_games
# This is also the number of games the host countries have played
host_games_played = host_country_games.shape[0]
host_games_played 

121

In [40]:
# Creating a df of games the host country won
# The host is always the home_team, so a host win means home_score > away_score
host_won = host_country_games[(host_country_games["home_score"] > host_country_games["away_score"])]
host_won.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
1320,1930-07-18,Uruguay,Peru,1,0,FIFA World Cup,Montevideo,Uruguay,False
1325,1930-07-21,Uruguay,Romania,4,0,FIFA World Cup,Montevideo,Uruguay,False
1329,1930-07-27,Uruguay,Yugoslavia,6,1,FIFA World Cup,Montevideo,Uruguay,False
1330,1930-07-30,Uruguay,Argentina,4,2,FIFA World Cup,Montevideo,Uruguay,False
1694,1934-05-27,Italy,United States,7,1,FIFA World Cup,Rome,Italy,False


In [72]:
# Number of games each country has won as host, most wins first
host_won["home_team"].value_counts()

Germany          11
Italy            10
France            7
Brazil            7
Mexico            5
England           5
Argentina         5
Uruguay           4
Chile             4
Sweden            4
South Korea       3
Switzerland       2
Japan             2
Russia            2
Spain             1
United States     1
South Africa      1
Name: home_team, dtype: int64

In [44]:
# Gives us the number of games won by the host country 
host_games_won = host_won.shape[0]
host_games_won

74

In [48]:
# The host won 61.16% of the FIFA World Cup games it played (draws count as not won)
percent_host_won = host_games_won / host_games_played * 100
percent_host_won

61.15702479338842
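
The win percentage groups draws together with losses. One way to separate them is to label each game by the sign of the goal difference. The sketch below assumes, as checked earlier in this notebook, that the host is always `home_team`. `outcome_shares` is a hypothetical helper, and the last two sample rows are made up so that a draw and a loss appear; they are not real matches.

```python
import numpy as np
import pandas as pd

# Sample in the shape of host_country_games. The first three rows come from the
# notebook output; the "Hostland" rows are hypothetical (one draw, one loss).
sample_host_games = pd.DataFrame(
    {
        "home_team": ["Uruguay", "Uruguay", "Italy", "Hostland", "Hostland"],
        "away_team": ["Peru", "Romania", "United States", "Visitors A", "Visitors B"],
        "home_score": [1, 4, 7, 1, 0],
        "away_score": [0, 0, 1, 1, 2],
    }
)


def outcome_shares(games: pd.DataFrame) -> pd.Series:
    """Return the percentage of wins, draws and losses for the home_team."""
    diff = games["home_score"] - games["away_score"]
    outcome = np.select([diff > 0, diff == 0], ["win", "draw"], default="loss")
    return pd.Series(outcome, index=games.index).value_counts(normalize=True) * 100


shares = outcome_shares(sample_host_games)
print(shares.round(2))
```

Calling `outcome_shares(host_country_games)` on the real data would give the full breakdown behind the single win percentage.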

In [66]:
# The host countries have a combined goal difference of +102
host_goals_scored = host_country_games["home_score"].sum()
host_goals_conceded = host_country_games["away_score"].sum()
host_goal_diff = host_goals_scored - host_goals_conceded
print(f"The host country has scored {host_goals_scored} goals")
print(f"The host country has conceded {host_goals_conceded} goals")

print(f"The goal difference is {host_goal_diff} in favor of the host team")

The host country has scored 222 goals
The host country has conceded 120 goals
The goal difference is 102 in favor of the host team


In [70]:
# Finding how many goals per game the host scores 
# Dividing the goal sum by the games played by host team 
average_goal_scores = host_country_games["home_score"].sum() / host_games_played
print("On average the host team scores " + str(round(average_goal_scores, 2)) + " goals per game")

On average the host team scores 1.83 goals per game
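
Whether hosts perform "better" depends on a baseline to compare against. A possible approach, sketched below, reshapes the matches into one row per team appearance and compares host appearances with all other appearances. The sample rows are copied from the 1930 outputs shown in this notebook; `to_team_rows` and the `role` labels are hypothetical names introduced for this sketch.

```python
import pandas as pd

# Rows copied from the notebook output (1930 World Cup); illustrative sample only.
sample_wc = pd.DataFrame(
    {
        "home_team": ["Uruguay", "Belgium", "France", "Brazil"],
        "away_team": ["Peru", "United States", "Mexico", "Yugoslavia"],
        "home_score": [1, 0, 4, 1],
        "away_score": [0, 3, 1, 2],
        "country": ["Uruguay", "Uruguay", "Uruguay", "Uruguay"],
    }
)


def to_team_rows(matches: pd.DataFrame) -> pd.DataFrame:
    """Return one row per team appearance, with goals for and against."""
    home = pd.DataFrame(
        {
            "team": matches["home_team"],
            "goals_for": matches["home_score"],
            "goals_against": matches["away_score"],
            "country": matches["country"],
        }
    )
    away = pd.DataFrame(
        {
            "team": matches["away_team"],
            "goals_for": matches["away_score"],
            "goals_against": matches["home_score"],
            "country": matches["country"],
        }
    )
    rows = pd.concat([home, away], ignore_index=True)
    rows["role"] = (rows["team"] == rows["country"]).map({True: "host", False: "non_host"})
    rows["won"] = rows["goals_for"] > rows["goals_against"]
    return rows


team_rows = to_team_rows(sample_wc)
summary = team_rows.groupby("role").agg(
    appearances=("team", "size"),
    win_pct=("won", "mean"),
    goals_per_game=("goals_for", "mean"),
)
summary["win_pct"] *= 100
print(summary.round(2))
```

Running `to_team_rows(world_cup_matches)` on the full data would put the host win rate and goals per game next to the same figures for non-host teams.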


In [73]:
# Importing the goalscorers.csv file 
goal_scorers = pd.read_csv("goalscorers.csv")
goal_scorers.head()

Unnamed: 0,date,home_team,away_team,team,scorer,minute,own_goal,penalty
0,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,44.0,False,False
1,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,55.0,False,False
2,1916-07-02,Chile,Uruguay,Uruguay,Isabelino Gradín,70.0,False,False
3,1916-07-02,Chile,Uruguay,Uruguay,José Piendibene,75.0,False,False
4,1916-07-06,Argentina,Chile,Argentina,Alberto Ohaco,2.0,False,False
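
The `minute` column is the natural starting point for question 3. As a rough sketch, each goal can be assigned to a half and then counted per team. The cut-offs below (up to minute 45 is the first half, up to 90 the second) are an assumption about how the file records stoppage-time goals, and the rows are the five shown above.

```python
import numpy as np
import pandas as pd

# The five goalscorers.csv rows shown in the notebook output.
sample_goals = pd.DataFrame(
    {
        "team": ["Uruguay", "Uruguay", "Uruguay", "Uruguay", "Argentina"],
        "scorer": [
            "José Piendibene",
            "Isabelino Gradín",
            "Isabelino Gradín",
            "José Piendibene",
            "Alberto Ohaco",
        ],
        "minute": [44.0, 55.0, 70.0, 75.0, 2.0],
    }
)

minute = sample_goals["minute"]
# Assumed cut-offs; missing minutes fail every comparison and fall into "unknown".
sample_goals["half"] = np.select(
    [minute <= 45, minute <= 90, minute > 90],
    ["first_half", "second_half", "extra_time"],
    default="unknown",
)

# Goals per team and half; combinations that never occur are filled with 0.
goals_by_half = sample_goals.groupby(["team", "half"]).size().unstack(fill_value=0)
print(goals_by_half)
```

Answering the question fully would also require joining these counts to the match result, which this sketch does not attempt.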


In [74]:
# Checking to see how many own goals were recorded 
goal_scorers["own_goal"].value_counts()

False    40290
True       718
Name: own_goal, dtype: int64

In [77]:
# Creating a df that excludes own goals
not_own_goals = goal_scorers[goal_scorers["own_goal"] == False]
not_own_goals["own_goal"].value_counts()

False    40290
Name: own_goal, dtype: int64

In [80]:
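# Counting goals per scorer across every match in goalscorers.csv
# Note: this is not filtered to World Cup matches (the file starts with 1916 games)
top_scorers = not_own_goals["scorer"].value_counts()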
top_scorers.head()

Cristiano Ronaldo     91
Robert Lewandowski    56
Lionel Messi          54
Ali Daei              49
Miroslav Klose        48
Name: scorer, dtype: int64
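
These totals cover every match in goalscorers.csv. To restrict question 2 to the World Cup and also count by team, one option is to join the goals to the World Cup rows of the results on `date`, `home_team` and `away_team`. The sketch below assumes those three columns identify a match in both files. The France vs Mexico match and the 1916 goals come from the outputs above, while the scorers named "Player A" to "Player D" and their minutes are hypothetical placeholders.

```python
import pandas as pd

# Two result rows copied from the notebook output.
results_sample = pd.DataFrame(
    {
        "date": ["1872-11-30", "1930-07-13"],
        "home_team": ["Scotland", "France"],
        "away_team": ["England", "Mexico"],
        "tournament": ["Friendly", "FIFA World Cup"],
    }
)

# The first two goals are real rows from the output above (not a World Cup match).
# The remaining rows use hypothetical scorers and minutes for illustration.
goals_sample = pd.DataFrame(
    {
        "date": ["1916-07-02", "1916-07-02"] + ["1930-07-13"] * 5,
        "home_team": ["Chile", "Chile"] + ["France"] * 5,
        "away_team": ["Uruguay", "Uruguay"] + ["Mexico"] * 5,
        "team": ["Uruguay", "Uruguay", "France", "France", "France", "Mexico", "France"],
        "scorer": [
            "José Piendibene",
            "Isabelino Gradín",
            "Player A",
            "Player A",
            "Player B",
            "Player C",
            "Player D",
        ],
        "own_goal": [False, False, False, False, False, False, True],
    }
)

match_keys = ["date", "home_team", "away_team"]
wc_keys = results_sample.loc[results_sample["tournament"] == "FIFA World Cup", match_keys]

# The inner join keeps only goals scored in World Cup matches.
wc_goals = goals_sample.merge(wc_keys, on=match_keys, how="inner")
wc_goals = wc_goals[wc_goals["own_goal"] == False]

top_wc_scorers = wc_goals["scorer"].value_counts()
top_wc_teams = wc_goals["team"].value_counts()
print(top_wc_scorers.head())
print(top_wc_teams.head())
```

On the real data the same steps would use `goal_scorers` and `world_cup_matches` in place of the two sample frames.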