# World Cup Matches: An Analysis

This project studies World Cup matches from 1930 to 2018. The aim is to gather as much information as possible about them, such as the frequency of each result, the most victorious country, the country with the most losses, and the biggest win.

First, we load our dataset, which combines a CSV file with data obtained by web scraping:

In [1]:
import data_loading

In [2]:
df = data_loading.create_df_new()
df

| Unnamed: 0 | year | home_team_names | home_team_goals | away_team_goals | away_team_names | victory | loss |
|---|---|---|---|---|---|---|---|
| 0 | 1930 | France | 4.0 | 1.0 | Mexico | home | away |
| 1 | 1930 | USA | 3.0 | 0.0 | Belgium | home | away |
| 2 | 1930 | Yugoslavia | 2.0 | 1.0 | Brazil | home | away |
| 3 | 1930 | Romania | 3.0 | 1.0 | Peru | home | away |
| 4 | 1930 | Argentina | 1.0 | 0.0 | France | home | away |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 911 | 2018 | Uruguay | 0.0 | 2.0 | France | away | home |
| 912 | 2018 | Croatia | 2.0 | 1.0 | England | home | away |
| 913 | 2018 | France | 1.0 | 0.0 | Belgium | home | away |
| 914 | 2018 | Belgium | 2.0 | 0.0 | England | home | away |
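
The victory and loss columns in this preview can be derived from the two goal columns. The following is a minimal sketch, not the code in data_loading: the helper name `add_result_columns` and the `"draw"` label for tied matches are assumptions, since the preview only shows the values `home` and `away`.

```python
import numpy as np
import pandas as pd


def add_result_columns(matches: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'victory' and 'loss' columns (hypothetical helper)."""
    out = matches.copy()
    home = out["home_team_goals"]
    away = out["away_team_goals"]
    conditions = [home > away, away > home]
    # np.select takes the first matching condition per row;
    # "draw" is an assumed label for tied matches.
    out["victory"] = np.select(conditions, ["home", "away"], default="draw")
    out["loss"] = np.select(conditions, ["away", "home"], default="draw")
    return out


# Two rows copied from the preview, without the result columns.
sample = pd.DataFrame({
    "year": [1930, 2018],
    "home_team_names": ["France", "Uruguay"],
    "home_team_goals": [4.0, 0.0],
    "away_team_goals": [1.0, 2.0],
    "away_team_names": ["Mexico", "France"],
})
result = add_result_columns(sample)
print(result[["home_team_names", "away_team_names", "victory", "loss"]])
```

Keeping the derivation in one function makes it easy to apply the same rule to both the CSV rows and the scraped rows before they are merged.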


### Data Understanding

Now, we can understand our data in more depth using statistics:

In [20]:
from data_vis_und import central_tedency, dispersion, game_with_most_goals
from data_vis_und import NonNumerical_based, Numerical_based

This module provides two classes: NonNumerical_based, which handles the non-numerical columns of df, and Numerical_based, which handles the numerical ones.

In [4]:
non_numerical = NonNumerical_based(df)
numerical = Numerical_based(df)

Let's start with the non-numerical columns:

In [5]:
# All teams in World Cup history:
non_numerical.all_teams()

{'Algeria',
 'Angola',
 'Argentina',
 'Australia',
 'Austria',
 'Belgium',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Brazil',
 'Bulgaria',
 'Cameroon',
 'Canada',
 'Chile',
 'China PR',
 'Colombia',
 'Costa Rica',
 'Croatia',
 'Cuba',
 'Czech Republic',
 'Czechoslovakia',
 "Côte d'Ivoire",
 'Denmark',
 'Dutch East Indies',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'England',
 'France',
 'German DR',
 'Germany',
 'Germany FR',
 'Ghana',
 'Greece',
 'Haiti',
 'Honduras',
 'Hungary',
 'IR Iran',
 'Iceland',
 'Iran',
 'Iraq',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Korea DPR',
 'Korea Republic',
 'Kuwait',
 'Mexico',
 'Morocco',
 'Netherlands',
 'New Zealand',
 'Nigeria',
 'Northern Ireland',
 'Norway',
 'Panama',
 'Paraguay',
 'Peru',
 'Poland',
 'Portugal',
 'Republic of Ireland',
 'Romania',
 'Russia',
 'Saudi Arabia',
 'Scotland',
 'Senegal',
 'Serbia',
 'Serbia and Montenegro',
 'Slovakia',
 'Slovenia',
 'South Africa',
 'South Korea',
 'Soviet Union',
 'Spain',
 'Sweden',
 'Switzerl

In [6]:
# Games played by each team:
non_numerical.games_played()

Algeria                 14.0
Angola                   3.0
Argentina               85.0
Australia               16.0
Austria                 29.0
                        ... 
United Arab Emirates     3.0
Uruguay                 57.0
Wales                    5.0
Yugoslavia              37.0
Zaire                    3.0
Name: count, Length: 86, dtype: float64
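
One way to obtain such a count is to stack the home and away team columns and count how often each name appears. This is a sketch on made-up rows and an assumption about the approach, not the actual body of games_played. It yields integer counts, whereas the output above is float64.

```python
import pandas as pd

# Made-up fixture list using the same column names as df.
sample = pd.DataFrame({
    "home_team_names": ["France", "Argentina", "Uruguay"],
    "away_team_names": ["Mexico", "France", "France"],
})

# Stack both team columns into one Series, then count appearances per team.
appearances = pd.concat([sample["home_team_names"], sample["away_team_names"]])
games_played = appearances.value_counts().sort_index()
print(games_played)
```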

So the team with the most games, as expected, is:

In [7]:
games = non_numerical.games_played()
games[games == games.max()]

Brazil    113.0
Name: count, dtype: float64

In [8]:
# Victories per team:
non_numerical.victories_per_team()

Algeria        3.0
Argentina     45.0
Australia      NaN
Austria       12.0
Belgium       21.0
              ... 
USA            8.0
Ukraine        2.0
Uruguay       24.0
Wales          NaN
Yugoslavia    16.0
Name: count, Length: 65, dtype: float64
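
The NaN entries above (for example Australia and Wales) are what pandas produces when two Series are added with `+` and a label is missing from one of them. Whether victories_per_team works this way is an assumption. The sketch below only shows the general behaviour, and how `Series.add(fill_value=0)` avoids it. The counts are made up for illustration.

```python
import pandas as pd

# Made-up win counts, split by the side the team was listed on.
home_wins = pd.Series({"Brazil": 3, "Australia": 1})
away_wins = pd.Series({"Brazil": 2, "Wales": 1})

# Plain addition aligns on the index and gives NaN for labels
# that are present on one side only.
naive_total = home_wins + away_wins

# Series.add with fill_value=0 treats a missing label as zero instead.
safe_total = home_wins.add(away_wins, fill_value=0)

print(naive_total)
print(safe_total)
```

If the module does combine counts this way, a NaN in the output would mean "no wins on one side", not "no wins at all", so it is worth treating those entries with care.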

The team with the most victories is, again as expected:

In [9]:
victories = non_numerical.victories_per_team()
victories[victories == victories.max()]

Brazil    74.0
Name: count, dtype: float64

In [10]:
# Losses per team:
non_numerical.losses_per_team()

Algeria                  8.0
Angola                   NaN
Argentina               24.0
Australia               10.0
Austria                 13.0
                        ... 
United Arab Emirates     3.0
Uruguay                 21.0
Wales                    NaN
Yugoslavia              13.0
Zaire                    3.0
Name: count, Length: 86, dtype: float64

The team with the most losses is:

In [11]:
losses = non_numerical.losses_per_team()
losses[losses == losses.max()]

Mexico    28.0
Name: count, dtype: float64

In [12]:
# Goals per team:
non_numerical.goals_per_team()

[('Costa Rica', 7.0),
 ('Hungary', 73.0),
 ('Soviet Union', 43.0),
 ('Korea DPR', 2.0),
 ('Croatia', 8.0),
 ('Morocco', 3.0),
 ('Peru', 13.0),
 ('Australia', 7.0),
 ('Netherlands', 51.0),
 ('Argentina', 112.0),
 ('Italy', 99.0),
 ('Spain', 53.0),
 ('Turkey', 10.0),
 ('Norway', 1.0),
 ('South Africa', 7.0),
 ('Germany FR', 99.0),
 ('Tunisia', 7.0),
 ('Scotland', 11.0),
 ('IR Iran', 0.0),
 ('Senegal', 3.0),
 ('Algeria', 5.0),
 ('Trinidad and Tobago', 0.0),
 ('Serbia and Montenegro', 0.0),
 ('Israel', 0.0),
 ('Angola', 0.0),
 ('Yugoslavia', 42.0),
 ('Egypt', 0.0),
 ('Ecuador', 4.0),
 ('Kuwait', 0.0),
 ('Romania', 15.0),
 ('New Zealand', 1.0),
 ('Bulgaria', 11.0),
 ('USA', 19.0),
 ('Slovenia', 3.0),
 ('England', 60.0),
 ('Colombia', 13.0),
 ('Sweden', 55.0),
 ('Northern Ireland', 5.0),
 ('Cameroon', 11.0),
 ('Iraq', 1.0),
 ('Brazil', 186.0),
 ('Denmark', 14.0),
 ('Russia', 19.0),
 ('Dutch East Indies', 0.0),
 ('Honduras', 2.0),
 ('Bosnia and Herzegovina', 3.0),
 ('Ukraine', 1.0),
 ('Republ

The team with the most goals is:

In [13]:
# Find the highest goal total scored by any single team
max_goals = 0

for _team, goals in non_numerical.goals_per_team():
    if goals > max_goals:
        max_goals = goals

max_goals

186.0

In [14]:
def team_max_goals(max_goals):
    # Return the first team whose goal total equals max_goals
    for team, goals in non_numerical.goals_per_team():
        if goals == max_goals:
            return team

team_max_goals(max_goals)

'Brazil'

Again, as expected...
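
Since goals_per_team() returns a list of (team, goals) tuples, the two cells above can also be condensed with the built-in `max` and a `key` function. A sketch on a few pairs copied from the goals_per_team() output:

```python
from operator import itemgetter

# A few (team, goals) pairs taken from the goals_per_team() output.
goals = [("Hungary", 73.0), ("Argentina", 112.0), ("Brazil", 186.0), ("Italy", 99.0)]

# max with key=itemgetter(1) compares the pairs by their goal count
# and returns the whole pair, so team and total come out together.
top_team, top_goals = max(goals, key=itemgetter(1))
print(top_team, top_goals)
```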

Now, let's explore the numerical columns:

In [15]:
# Defining the numerical columns:
numerical_columns = ["home_team_goals", "away_team_goals"]

In [16]:
# Various statistical metrics for the goals scored in World Cup matches.
# Both methods print their own report and return None, so they are called without print().
numerical.central_tedency_all(numerical_columns)
numerical.dispersion_all(numerical_columns)

For column home_team_goals, we have:
Min = 0.0, Max = 10.0, Midpoint = 5.0, Median = 1.0, Mode = 1.0
For column away_team_goals, we have:
Min = 0.0, Max = 7.0, Midpoint = 3.5, Median = 1.0, Mode = 1.0
For column home_team_goals, we have:
Variance = 2.5477605173360103, Std = 1.5961705790221827, First Quartile = 1.0, Second Quartile = 1.0, Third Quartile = 3.0, Range = 10.0
For column away_team_goals, we have:
Variance = 1.1604493282745068, Std = 1.077241536645569, First Quartile = 0.0, Second Quartile = 1.0, Third Quartile = 2.0, Range = 7.0
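
For reference, the metrics reported above can be reproduced with plain pandas. This sketch uses a small made-up column, not the real data. It assumes that "Midpoint" means the midrange `(min + max) / 2`, which matches the values shown (for example `(0 + 10) / 2 = 5.0`). Whether the module uses sample or population variance is not stated; pandas defaults to the sample variance (`ddof=1`).

```python
import pandas as pd

goals = pd.Series([0.0, 1.0, 1.0, 2.0, 6.0])  # made-up values

# Central tendency
midpoint = (goals.min() + goals.max()) / 2    # midrange
median = goals.median()
mode = goals.mode().iloc[0]                   # first mode if there are ties

# Dispersion
variance = goals.var()                        # sample variance (ddof=1)
std = goals.std()                             # square root of the variance
q1, q2, q3 = goals.quantile([0.25, 0.5, 0.75])
value_range = goals.max() - goals.min()

print(midpoint, median, mode)
print(variance, std, q1, q2, q3, value_range)
```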


The central_tedency_all and dispersion_all methods report metrics separately for the "home" and "away" teams. This labeling is not very meaningful for the World Cup: from a geographic standpoint, most teams are playing away, whichever column they are listed in. To work around this, there is the goals_per_game_stat method, which looks at the goals scored per game instead:

In [17]:
numerical.goals_per_game_stat()

Total number of goals in all matches at WC history: 2583.0
Mean of goals per game: 2.819868995633188
Standard Deviation of goals per game: 1.9292277298164873
Variance of goals per game: 3.7219196334928775
Maximum number of goals per game: 12.0


To see which match had 12 goals, we use the game_with_most_goals function:

In [21]:
game_with_most_goals(df)

Austria x Switzerland
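
A lookup like this can be sketched with `idxmax` on the per-match goal total. The code below is an illustration on three rows from the df preview, not the implementation of game_with_most_goals. The `home x away` label format simply mirrors the output above.

```python
import pandas as pd

# Three rows from the df preview.
sample = pd.DataFrame({
    "home_team_names": ["France", "USA", "Croatia"],
    "home_team_goals": [4.0, 3.0, 2.0],
    "away_team_goals": [1.0, 0.0, 1.0],
    "away_team_names": ["Mexico", "Belgium", "England"],
})

total = sample["home_team_goals"] + sample["away_team_goals"]

# idxmax returns the label of the first row with the largest total,
# so ties are resolved in favour of the earliest match.
top = sample.loc[total.idxmax()]
label = f"{top['home_team_names']} x {top['away_team_names']}"
print(label)
```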
