# The Beautiful Game: Investigating the European Football Database

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction


This football dataset comes from Kaggle. It covers more than 25,000 matches and more than 10,000 players across 11 European countries, and includes team squad formations with (X, Y) coordinates and detailed match events such as goal types, possession, fouls, and cards. The data spans the 2008 to 2016 seasons and ships as an SQLite database with 7 tables (Country, League, Match, Player, Player_Attributes, Team, and Team_Attributes) and 199 columns in total. We will extract the parts that serve our analysis and try to answer questions such as: **Which teams improved over the period? Which teams scored the most goals? Which attributes lead teams to the most victories?** We will also explore the player characteristics that distinguish those who dominate the game.


All thanks to <a href="https://www.kaggle.com/hugomathien">Hugo Mathien</a> for dedicating the time and effort to make this possible. Further reading and ways to improve the project can be found in Hugo's GitHub repo <a href="https://github.com/hugomathien/football-data-collection">here</a>.


In [1]:
import pandas as pd
import numpy as np
from sqlite3 import connect
import os
%matplotlib inline



<a id='wrangling'></a>
## Data Wrangling


### Database connection

In [3]:
# Connect to the SQLite database and preview the Match table to inspect
#   data types and look for missing or possibly errant data.

# database connection (build the path portably instead of concatenating strings)
db = os.path.join(os.getcwd(), 'database.sqlite')
conn = connect(db)

pd.read_sql_query("SELECT * FROM Match", conn)

# perform joins

# team_df = pd.read_sql_query("""SELECT * 
#                   FROM Team as t 
#                   JOIN Team_Attributes as ta ON t.team_api_id=ta.team_api_id""", conn)

# player_df = pd.read_sql_query("""SELECT * 
#                   FROM Player as p 
#                   JOIN Player_Attributes as pa ON p.player_api_id=pa.player_api_id""", conn)


Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1,1,1,2008/2009,1,2008-08-17 00:00:00,492473,9987,9993,1,...,4.00,1.65,3.40,4.50,1.78,3.25,4.00,1.73,3.40,4.20
1,2,1,1,2008/2009,1,2008-08-16 00:00:00,492474,10000,9994,0,...,3.80,2.00,3.25,3.25,1.85,3.25,3.75,1.91,3.25,3.60
2,3,1,1,2008/2009,1,2008-08-16 00:00:00,492475,9984,8635,0,...,2.50,2.35,3.25,2.65,2.50,3.20,2.50,2.30,3.20,2.75
3,4,1,1,2008/2009,1,2008-08-17 00:00:00,492476,9991,9998,5,...,7.50,1.45,3.75,6.50,1.50,3.75,5.50,1.44,3.75,6.50
4,5,1,1,2008/2009,1,2008-08-16 00:00:00,492477,7947,9985,1,...,1.73,4.50,3.40,1.65,4.50,3.50,1.65,4.75,3.30,1.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25974,25975,24558,24558,2015/2016,9,2015-09-22 00:00:00,1992091,10190,10191,1,...,,,,,,,,,,
25975,25976,24558,24558,2015/2016,9,2015-09-23 00:00:00,1992092,9824,10199,1,...,,,,,,,,,,
25976,25977,24558,24558,2015/2016,9,2015-09-23 00:00:00,1992093,9956,10179,2,...,,,,,,,,,,
25977,25978,24558,24558,2015/2016,9,2015-09-22 00:00:00,1992094,7896,10243,0,...,,,,,,,,,,
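
One caveat with the connection step: `sqlite3.connect` silently creates a new, empty database when the file does not exist, so a wrong working directory only shows up later as a "no such table" error. The following is a minimal sketch of a guard against that. The helper name `open_database` is hypothetical, and the demonstration uses a temporary directory rather than the real `database.sqlite`.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_database(path):
    """Open an existing SQLite file, refusing to create a new empty one."""
    db_path = Path(path)
    if not db_path.is_file():
        raise FileNotFoundError(f"SQLite file not found: {db_path}")
    return sqlite3.connect(db_path)


# Demonstration in a throwaway directory (not the real Kaggle file).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "database.sqlite"

    # The file does not exist yet, so the helper should refuse to open it.
    try:
        open_database(demo_path)
        opened_missing = True
    except FileNotFoundError:
        opened_missing = False

    # Create a tiny database, then open it through the helper.
    seed = sqlite3.connect(demo_path)
    seed.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")
    seed.commit()
    seed.close()

    demo_conn = open_database(demo_path)
    table_names = [row[0] for row in demo_conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    demo_conn.close()

print(opened_missing, table_names)
```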


In [4]:
pd.read_sql_query("""SELECT * 
                  FROM Team as t 
                  JOIN Team_Attributes as ta ON t.team_api_id=ta.team_api_id""", conn)

Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name,id.1,team_fifa_api_id.1,team_api_id.1,date,buildUpPlaySpeed,...,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,49119,9930,434,FC Aarau,AAR,1,434,9930,2010-02-22 00:00:00,60,...,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,49119,9930,434,FC Aarau,AAR,2,434,9930,2014-09-19 00:00:00,52,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,49119,9930,434,FC Aarau,AAR,3,434,9930,2015-09-10 00:00:00,47,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
3,39393,8485,77,Aberdeen,ABE,4,77,8485,2010-02-22 00:00:00,70,...,70,Lots,Organised,60,Medium,70,Double,70,Wide,Cover
4,39393,8485,77,Aberdeen,ABE,5,77,8485,2011-02-22 00:00:00,47,...,52,Normal,Organised,47,Medium,47,Press,52,Normal,Cover
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1453,3,10000,15005,SV Zulte-Waregem,ZUL,1454,15005,10000,2011-02-22 00:00:00,52,...,53,Normal,Organised,46,Medium,48,Press,53,Normal,Cover
1454,3,10000,15005,SV Zulte-Waregem,ZUL,1455,15005,10000,2012-02-22 00:00:00,54,...,50,Normal,Organised,44,Medium,55,Press,53,Normal,Cover
1455,3,10000,15005,SV Zulte-Waregem,ZUL,1456,15005,10000,2013-09-20 00:00:00,54,...,32,Little,Organised,44,Medium,58,Press,37,Normal,Cover
1456,3,10000,15005,SV Zulte-Waregem,ZUL,1457,15005,10000,2014-09-19 00:00:00,54,...,32,Little,Organised,44,Medium,58,Press,37,Normal,Cover
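
The `SELECT *` join above returns both tables' key columns, which is why `id.1`, `team_fifa_api_id.1`, and `team_api_id.1` appear in the result. One possible alternative, sketched here on a tiny in-memory database with a reduced, illustrative schema, is to name the wanted columns explicitly so that no duplicates are produced in the first place. The same query string could be passed to `pd.read_sql_query`.

```python
import sqlite3

# Toy in-memory database with a reduced version of the two tables.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE Team (id INTEGER, team_api_id INTEGER, team_long_name TEXT);
CREATE TABLE Team_Attributes (id INTEGER, team_api_id INTEGER,
                              date TEXT, buildUpPlaySpeed INTEGER);
INSERT INTO Team VALUES (49119, 9930, 'FC Aarau');
INSERT INTO Team_Attributes VALUES (1, 9930, '2010-02-22 00:00:00', 60);
INSERT INTO Team_Attributes VALUES (2, 9930, '2014-09-19 00:00:00', 52);
""")

# Select only the columns we need; the join key is taken from one side only.
cursor = conn_demo.execute("""
    SELECT t.team_api_id, t.team_long_name, ta.date, ta.buildUpPlaySpeed
    FROM Team AS t
    JOIN Team_Attributes AS ta ON t.team_api_id = ta.team_api_id
    ORDER BY ta.date
""")
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn_demo.close()

print(column_names)
print(rows)
```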


### Explore database tables

In [5]:
# explore database

db_tables = pd.read_sql("SELECT * FROM sqlite_master WHERE type='table';", conn)
db_tables

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,sqlite_sequence,sqlite_sequence,4,"CREATE TABLE sqlite_sequence(name,seq)"
1,table,Player_Attributes,Player_Attributes,11,"CREATE TABLE ""Player_Attributes"" (\n\t`id`\tIN..."
2,table,Player,Player,14,CREATE TABLE `Player` (\n\t`id`\tINTEGER PRIMA...
3,table,Match,Match,18,CREATE TABLE `Match` (\n\t`id`\tINTEGER PRIMAR...
4,table,League,League,24,CREATE TABLE `League` (\n\t`id`\tINTEGER PRIMA...
5,table,Country,Country,26,CREATE TABLE `Country` (\n\t`id`\tINTEGER PRIM...
6,table,Team,Team,29,"CREATE TABLE ""Team"" (\n\t`id`\tINTEGER PRIMARY..."
7,table,Team_Attributes,Team_Attributes,2,CREATE TABLE `Team_Attributes` (\n\t`id`\tINTE...
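
To go from the table list to a column count per table (the introduction mentions 199 columns in total), SQLite's `PRAGMA table_info` can be queried for each table name. This is a small sketch on an in-memory database with two made-up, reduced tables; against the real file, the same loop would run over the names returned from `sqlite_master`.

```python
import sqlite3

# Toy in-memory database standing in for database.sqlite.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE League (id INTEGER PRIMARY KEY, country_id INTEGER, name TEXT);
""")

tables = [row[0] for row in conn_demo.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]

column_counts = {}
for table in tables:
    # PRAGMA table_info returns one row per column of the named table.
    info = conn_demo.execute(f'PRAGMA table_info("{table}")').fetchall()
    column_counts[table] = len(info)
conn_demo.close()

print(column_counts)
print(sum(column_counts.values()))
```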


### Explore tables

In [6]:
# Team table
pd.read_sql_query("SELECT * FROM Team", conn)

Unnamed: 0,id,team_api_id,team_fifa_api_id,team_long_name,team_short_name
0,1,9987,673.0,KRC Genk,GEN
1,2,9993,675.0,Beerschot AC,BAC
2,3,10000,15005.0,SV Zulte-Waregem,ZUL
3,4,9994,2007.0,Sporting Lokeren,LOK
4,5,9984,1750.0,KSV Cercle Brugge,CEB
...,...,...,...,...,...
294,49479,10190,898.0,FC St. Gallen,GAL
295,49837,10191,1715.0,FC Thun,THU
296,50201,9777,324.0,Servette FC,SER
297,50204,7730,1862.0,FC Lausanne-Sports,LAU


In [7]:
# Team attributes table
pd.read_sql_query("SELECT * FROM Team_Attributes", conn)

Unnamed: 0,id,team_fifa_api_id,team_api_id,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribbling,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,...,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,1,434,9930,2010-02-22 00:00:00,60,Balanced,,Little,50,Mixed,...,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,2,434,9930,2014-09-19 00:00:00,52,Balanced,48.0,Normal,56,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,3,434,9930,2015-09-10 00:00:00,47,Balanced,41.0,Normal,54,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
3,4,77,8485,2010-02-22 00:00:00,70,Fast,,Little,70,Long,...,70,Lots,Organised,60,Medium,70,Double,70,Wide,Cover
4,5,77,8485,2011-02-22 00:00:00,47,Balanced,,Little,52,Mixed,...,52,Normal,Organised,47,Medium,47,Press,52,Normal,Cover
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1453,1454,15005,10000,2011-02-22 00:00:00,52,Balanced,,Little,52,Mixed,...,53,Normal,Organised,46,Medium,48,Press,53,Normal,Cover
1454,1455,15005,10000,2012-02-22 00:00:00,54,Balanced,,Little,51,Mixed,...,50,Normal,Organised,44,Medium,55,Press,53,Normal,Cover
1455,1456,15005,10000,2013-09-20 00:00:00,54,Balanced,,Little,51,Mixed,...,32,Little,Organised,44,Medium,58,Press,37,Normal,Cover
1456,1457,15005,10000,2014-09-19 00:00:00,54,Balanced,42.0,Normal,51,Mixed,...,32,Little,Organised,44,Medium,58,Press,37,Normal,Cover


In [8]:
# Player table
pd.read_sql_query("SELECT * FROM Player", conn)


Unnamed: 0,id,player_api_id,player_name,player_fifa_api_id,birthday,height,weight
0,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187
1,2,155782,Aaron Cresswell,189615,1989-12-15 00:00:00,170.18,146
2,3,162549,Aaron Doran,186170,1991-05-13 00:00:00,170.18,163
3,4,30572,Aaron Galindo,140161,1982-05-08 00:00:00,182.88,198
4,5,23780,Aaron Hughes,17725,1979-11-08 00:00:00,182.88,154
...,...,...,...,...,...,...,...
11055,11071,26357,Zoumana Camara,2488,1979-04-03 00:00:00,182.88,168
11056,11072,111182,Zsolt Laczko,164680,1986-12-18 00:00:00,182.88,176
11057,11073,36491,Zsolt Low,111191,1979-04-29 00:00:00,180.34,154
11058,11074,35506,Zurab Khizanishvili,47058,1981-10-06 00:00:00,185.42,172


In [9]:
# Player attributes table
pd.read_sql_query("SELECT * FROM Player_Attributes", conn)


Unnamed: 0,id,player_fifa_api_id,player_api_id,date,overall_rating,potential,preferred_foot,attacking_work_rate,defensive_work_rate,crossing,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,218353,505942,2016-02-18 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,2,218353,505942,2015-11-19 00:00:00,67.0,71.0,right,medium,medium,49.0,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,3,218353,505942,2015-09-21 00:00:00,62.0,66.0,right,medium,medium,49.0,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,4,218353,505942,2015-03-20 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,5,218353,505942,2007-02-22 00:00:00,61.0,65.0,right,medium,medium,48.0,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183973,183974,102359,39902,2009-08-30 00:00:00,83.0,85.0,right,medium,low,84.0,...,88.0,83.0,22.0,31.0,30.0,9.0,20.0,84.0,20.0,20.0
183974,183975,102359,39902,2009-02-22 00:00:00,78.0,80.0,right,medium,low,74.0,...,88.0,70.0,32.0,31.0,30.0,9.0,20.0,73.0,20.0,20.0
183975,183976,102359,39902,2008-08-30 00:00:00,77.0,80.0,right,medium,low,74.0,...,88.0,70.0,32.0,31.0,30.0,9.0,20.0,73.0,20.0,20.0
183976,183977,102359,39902,2007-08-30 00:00:00,78.0,81.0,right,medium,low,74.0,...,88.0,53.0,28.0,32.0,30.0,9.0,20.0,73.0,20.0,20.0


In [10]:
# Match table
pd.read_sql_query("SELECT * FROM Match", conn)


Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1,1,1,2008/2009,1,2008-08-17 00:00:00,492473,9987,9993,1,...,4.00,1.65,3.40,4.50,1.78,3.25,4.00,1.73,3.40,4.20
1,2,1,1,2008/2009,1,2008-08-16 00:00:00,492474,10000,9994,0,...,3.80,2.00,3.25,3.25,1.85,3.25,3.75,1.91,3.25,3.60
2,3,1,1,2008/2009,1,2008-08-16 00:00:00,492475,9984,8635,0,...,2.50,2.35,3.25,2.65,2.50,3.20,2.50,2.30,3.20,2.75
3,4,1,1,2008/2009,1,2008-08-17 00:00:00,492476,9991,9998,5,...,7.50,1.45,3.75,6.50,1.50,3.75,5.50,1.44,3.75,6.50
4,5,1,1,2008/2009,1,2008-08-16 00:00:00,492477,7947,9985,1,...,1.73,4.50,3.40,1.65,4.50,3.50,1.65,4.75,3.30,1.67
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25974,25975,24558,24558,2015/2016,9,2015-09-22 00:00:00,1992091,10190,10191,1,...,,,,,,,,,,
25975,25976,24558,24558,2015/2016,9,2015-09-23 00:00:00,1992092,9824,10199,1,...,,,,,,,,,,
25976,25977,24558,24558,2015/2016,9,2015-09-23 00:00:00,1992093,9956,10179,2,...,,,,,,,,,,
25977,25978,24558,24558,2015/2016,9,2015-09-22 00:00:00,1992094,7896,10243,0,...,,,,,,,,,,


In [12]:
# League table
pd.read_sql_query("SELECT * FROM League", conn)


Unnamed: 0,id,country_id,name
0,1,1,Belgium Jupiler League
1,1729,1729,England Premier League
2,4769,4769,France Ligue 1
3,7809,7809,Germany 1. Bundesliga
4,10257,10257,Italy Serie A
5,13274,13274,Netherlands Eredivisie
6,15722,15722,Poland Ekstraklasa
7,17642,17642,Portugal Liga ZON Sagres
8,19694,19694,Scotland Premier League
9,21518,21518,Spain LIGA BBVA


In [13]:
# Country table
pd.read_sql_query("SELECT * FROM Country", conn)

Unnamed: 0,id,name
0,1,Belgium
1,1729,England
2,4769,France
3,7809,Germany
4,10257,Italy
5,13274,Netherlands
6,15722,Poland
7,17642,Portugal
8,19694,Scotland
9,21518,Spain


In [14]:
# convert queries to csv files

# join team and team attributes tables into one
teams_query = pd.read_sql_query("""SELECT * 
                  FROM Team as t 
                  JOIN Team_Attributes as ta ON t.team_api_id=ta.team_api_id""", conn)
teams_query.to_csv('teams.csv', index=False)

# join player and player attributes tables into one
player_query = pd.read_sql_query("""SELECT * 
                  FROM Player as p 
                  JOIN Player_Attributes as pa ON p.player_api_id=pa.player_api_id""", conn)
player_query.to_csv('players.csv', index=False)

match_query = pd.read_sql_query("SELECT * FROM Match", conn)
match_query.to_csv('match.csv', index=False)


In [2]:
# read csv files

teams_df = pd.read_csv("teams.csv")
players_df = pd.read_csv("players.csv")
match_df = pd.read_csv('match.csv')
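
Note that a CSV round trip does not preserve dtypes: date columns come back as plain text unless they are parsed on load. As a sketch of one way to handle this at read time, `read_csv` accepts a `parse_dates` argument. The CSV text below is a toy excerpt held in memory, not the real `teams.csv`.

```python
import io

import pandas as pd

# Toy CSV excerpt mimicking the layout of teams.csv.
csv_text = (
    "team_api_id,date,buildUpPlaySpeed\n"
    "9930,2010-02-22 00:00:00,60\n"
    "9930,2014-09-19 00:00:00,52\n"
)

# Without parse_dates the date column is read as text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetime while loading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(plain["date"].dtype)
print(parsed["date"].dtype)
```

Parsing on load makes the later `pd.to_datetime` conversion unnecessary, at the cost of a slightly slower read.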

In [48]:
# explore dataframes

# teams df
teams_df.head()


Unnamed: 0,team_api_id,team_fifa_api_id,team_long_name,team_short_name,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,...,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
0,9930,434,FC Aarau,AAR,2010-02-22,60,Balanced,Little,50,Mixed,...,55,Normal,Organised,50,Medium,55,Press,45,Normal,Cover
1,9930,434,FC Aarau,AAR,2014-09-19,52,Balanced,Normal,56,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
2,9930,434,FC Aarau,AAR,2015-09-10,47,Balanced,Normal,54,Mixed,...,64,Normal,Organised,47,Medium,44,Press,54,Normal,Cover
3,8485,77,Aberdeen,ABE,2010-02-22,70,Fast,Little,70,Long,...,70,Lots,Organised,60,Medium,70,Double,70,Wide,Cover
4,8485,77,Aberdeen,ABE,2011-02-22,47,Balanced,Little,52,Mixed,...,52,Normal,Organised,47,Medium,47,Press,52,Normal,Cover


In [49]:
# player df
players_df.head()


Unnamed: 0,id,player_api_id,player_name,player_fifa_api_id,birthday,height,weight,id.1,player_fifa_api_id.1,player_api_id.1,...,vision,penalties,marking,standing_tackle,sliding_tackle,gk_diving,gk_handling,gk_kicking,gk_positioning,gk_reflexes
0,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187,1,218353,505942,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
1,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187,2,218353,505942,...,54.0,48.0,65.0,69.0,69.0,6.0,11.0,10.0,8.0,8.0
2,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187,3,218353,505942,...,54.0,48.0,65.0,66.0,69.0,6.0,11.0,10.0,8.0,8.0
3,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187,4,218353,505942,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0
4,1,505942,Aaron Appindangoye,218353,1992-02-29 00:00:00,182.88,187,5,218353,505942,...,53.0,47.0,62.0,63.0,66.0,5.0,10.0,9.0,7.0,7.0


In [50]:
# match df
match_df.head()

Unnamed: 0,id,country_id,league_id,season,stage,date,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
0,1,1,1,2008/2009,1,2008-08-17 00:00:00,492473,9987,9993,1,...,4.0,1.65,3.4,4.5,1.78,3.25,4.0,1.73,3.4,4.2
1,2,1,1,2008/2009,1,2008-08-16 00:00:00,492474,10000,9994,0,...,3.8,2.0,3.25,3.25,1.85,3.25,3.75,1.91,3.25,3.6
2,3,1,1,2008/2009,1,2008-08-16 00:00:00,492475,9984,8635,0,...,2.5,2.35,3.25,2.65,2.5,3.2,2.5,2.3,3.2,2.75
3,4,1,1,2008/2009,1,2008-08-17 00:00:00,492476,9991,9998,5,...,7.5,1.45,3.75,6.5,1.5,3.75,5.5,1.44,3.75,6.5
4,5,1,1,2008/2009,1,2008-08-16 00:00:00,492477,7947,9985,1,...,1.73,4.5,3.4,1.65,4.5,3.5,1.65,4.75,3.3,1.67


In [56]:
# print shapes
print(f"teams_df: {teams_df.shape[0]} rows and {teams_df.shape[1]} columns")
print(f"match_df: {match_df.shape[0]} rows and {match_df.shape[1]} columns")
print(f"players_df: {players_df.shape[0]} rows and {players_df.shape[1]} columns")

teams_df: 1458 rows and 25 columns
match_df: 25979 rows and 115 columns
players_df: 183978 rows and 49 columns


### Data Preprocessing


#### teams_df

In [20]:
# summary of teams_df 
teams_df.describe()

# get columns with missing values
teams_df.loc[:,teams_df.isna().any()].columns

#frequency of missing values in teams_df
teams_df.isna().sum()/len(teams_df)*100


id                                 0.000000
team_api_id                        0.000000
team_fifa_api_id                   0.000000
team_long_name                     0.000000
team_short_name                    0.000000
id.1                               0.000000
team_fifa_api_id.1                 0.000000
team_api_id.1                      0.000000
date                               0.000000
buildUpPlaySpeed                   0.000000
buildUpPlaySpeedClass              0.000000
buildUpPlayDribbling              66.460905
buildUpPlayDribblingClass          0.000000
buildUpPlayPassing                 0.000000
buildUpPlayPassingClass            0.000000
buildUpPlayPositioningClass        0.000000
chanceCreationPassing              0.000000
chanceCreationPassingClass         0.000000
chanceCreationCrossing             0.000000
chanceCreationCrossingClass        0.000000
chanceCreationShooting             0.000000
chanceCreationShootingClass        0.000000
chanceCreationPositioningClass  

In [17]:
# drop buildUpPlayDribbling column
teams_df.drop(columns="buildUpPlayDribbling", inplace=True)

#drop duplicated columns
teams_df.drop(columns=['id','id.1', 'team_api_id.1', 'team_fifa_api_id.1'], inplace=True)
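
The decision to drop `buildUpPlayDribbling` follows from its share of missing values (about 66%). The same rule can be written generically, as in this sketch on toy data; the 50% cut-off is an assumption chosen for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one column that is mostly missing.
demo = pd.DataFrame({
    "buildUpPlaySpeed": [60, 52, 47],
    "buildUpPlayDribbling": [np.nan, 48.0, np.nan],
})

# isna().mean() gives the fraction of missing values per column.
missing_pct = demo.isna().mean() * 100

threshold = 50  # hypothetical cut-off, in percent
to_drop = missing_pct[missing_pct > threshold].index.tolist()
cleaned = demo.drop(columns=to_drop)

print(missing_pct.round(2))
print(to_drop)
```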

In [32]:
# check duplicates
# number of duplicated rows
teams_df.duplicated().sum()

# show duplicated rows
teams_df[teams_df.duplicated()]

Unnamed: 0,team_api_id,team_fifa_api_id,team_long_name,team_short_name,date,buildUpPlaySpeed,buildUpPlaySpeedClass,buildUpPlayDribblingClass,buildUpPlayPassing,buildUpPlayPassingClass,...,chanceCreationShooting,chanceCreationShootingClass,chanceCreationPositioningClass,defencePressure,defencePressureClass,defenceAggression,defenceAggressionClass,defenceTeamWidth,defenceTeamWidthClass,defenceDefenderLineClass
860,9996,111560,Royal Excel Mouscron,MOU,2015-09-10 00:00:00,50,Balanced,Normal,50,Mixed,...,50,Normal,Organised,45,Medium,45,Press,50,Normal,Cover
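
For reference, here is a small sketch of the three duplicate-related calls on toy data: `duplicated().any()` answers whether any duplicate exists, `duplicated().sum()` counts them, and `drop_duplicates()` removes them. Whether the single duplicated row should actually be removed from `teams_df` is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Toy frame in which the second row repeats the first.
demo = pd.DataFrame({
    "team_api_id": [9996, 9996, 8485],
    "date": ["2015-09-10", "2015-09-10", "2010-02-22"],
    "buildUpPlaySpeed": [50, 50, 70],
})

has_duplicates = bool(demo.duplicated().any())   # is there at least one?
n_duplicates = int(demo.duplicated().sum())      # how many repeated rows?
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(has_duplicates, n_duplicates, len(deduplicated))
```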


In [33]:
teams_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   team_api_id                     1458 non-null   int64 
 1   team_fifa_api_id                1458 non-null   int64 
 2   team_long_name                  1458 non-null   object
 3   team_short_name                 1458 non-null   object
 4   date                            1458 non-null   object
 5   buildUpPlaySpeed                1458 non-null   int64 
 6   buildUpPlaySpeedClass           1458 non-null   object
 7   buildUpPlayDribblingClass       1458 non-null   object
 8   buildUpPlayPassing              1458 non-null   int64 
 9   buildUpPlayPassingClass         1458 non-null   object
 10  buildUpPlayPositioningClass     1458 non-null   object
 11  chanceCreationPassing           1458 non-null   int64 
 12  chanceCreationPassingClass      1458 non-null   

In [34]:
# convert date column data type

teams_df['date'] = pd.to_datetime(teams_df['date'])
teams_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 25 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   team_api_id                     1458 non-null   int64         
 1   team_fifa_api_id                1458 non-null   int64         
 2   team_long_name                  1458 non-null   object        
 3   team_short_name                 1458 non-null   object        
 4   date                            1458 non-null   datetime64[ns]
 5   buildUpPlaySpeed                1458 non-null   int64         
 6   buildUpPlaySpeedClass           1458 non-null   object        
 7   buildUpPlayDribblingClass       1458 non-null   object        
 8   buildUpPlayPassing              1458 non-null   int64         
 9   buildUpPlayPassingClass         1458 non-null   object        
 10  buildUpPlayPositioningClass     1458 non-null   object        
 11  chan

In [156]:
teams_df.shape

(1458, 30)

#### match_df

In [36]:
# summary stats for match_df
match_df.describe()

Unnamed: 0,id,country_id,league_id,stage,match_api_id,home_team_api_id,away_team_api_id,home_team_goal,away_team_goal,home_player_X1,...,SJA,VCH,VCD,VCA,GBH,GBD,GBA,BSH,BSD,BSA
count,25979.0,25979.0,25979.0,25979.0,25979.0,25979.0,25979.0,25979.0,25979.0,24158.0,...,17097.0,22568.0,22568.0,22568.0,14162.0,14162.0,14162.0,14161.0,14161.0,14161.0
mean,12990.0,11738.630317,11738.630317,18.242773,1195429.0,9984.371993,9984.475115,1.544594,1.160938,0.999586,...,4.622343,2.668107,3.899048,4.840281,2.498764,3.648189,4.353097,2.497894,3.660742,4.405663
std,7499.635658,7553.936759,7553.936759,10.407354,494627.9,14087.453758,14087.445135,1.297158,1.14211,0.022284,...,3.632164,1.928753,1.248221,4.318338,1.489299,0.86744,3.010189,1.507793,0.868272,3.189814
min,1.0,1.0,1.0,1.0,483129.0,1601.0,1601.0,0.0,0.0,0.0,...,1.1,1.03,1.62,1.08,1.05,1.45,1.12,1.04,1.33,1.12
25%,6495.5,4769.0,4769.0,9.0,768436.5,8475.0,8475.0,1.0,0.0,1.0,...,2.5,1.7,3.3,2.55,1.67,3.2,2.5,1.67,3.25,2.5
50%,12990.0,10257.0,10257.0,18.0,1147511.0,8697.0,8697.0,1.0,1.0,1.0,...,3.5,2.15,3.5,3.5,2.1,3.3,3.4,2.1,3.4,3.4
75%,19484.5,17642.0,17642.0,27.0,1709852.0,9925.0,9925.0,2.0,2.0,1.0,...,5.25,2.8,4.0,5.4,2.65,3.75,5.0,2.62,3.75,5.0
max,25979.0,24558.0,24558.0,38.0,2216672.0,274581.0,274581.0,10.0,9.0,2.0,...,41.0,36.0,26.0,67.0,21.0,11.0,34.0,17.0,13.0,34.0


In [46]:
# match_df info
match_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Columns: 115 entries, id to BSA
dtypes: float64(96), int64(9), object(10)
memory usage: 22.8+ MB


In [47]:
# too many columns for the default summary; force the full per-column listing
match_df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Data columns (total 115 columns):
 #    Column            Dtype  
---   ------            -----  
 0    id                int64  
 1    country_id        int64  
 2    league_id         int64  
 3    season            object 
 4    stage             int64  
 5    date              object 
 6    match_api_id      int64  
 7    home_team_api_id  int64  
 8    away_team_api_id  int64  
 9    home_team_goal    int64  
 10   away_team_goal    int64  
 11   home_player_X1    float64
 12   home_player_X2    float64
 13   home_player_X3    float64
 14   home_player_X4    float64
 15   home_player_X5    float64
 16   home_player_X6    float64
 17   home_player_X7    float64
 18   home_player_X8    float64
 19   home_player_X9    float64
 20   home_player_X10   float64
 21   home_player_X11   float64
 22   away_player_X1    float64
 23   away_player_X2    float64
 24   away_player_X3    float64
 25   away_player_X4    fl

In [3]:
# get column name and index
for i, j in enumerate(match_df.columns):
    print(i, j)

0 id
1 country_id
2 league_id
3 season
4 stage
5 date
6 match_api_id
7 home_team_api_id
8 away_team_api_id
9 home_team_goal
10 away_team_goal
11 home_player_X1
12 home_player_X2
13 home_player_X3
14 home_player_X4
15 home_player_X5
16 home_player_X6
17 home_player_X7
18 home_player_X8
19 home_player_X9
20 home_player_X10
21 home_player_X11
22 away_player_X1
23 away_player_X2
24 away_player_X3
25 away_player_X4
26 away_player_X5
27 away_player_X6
28 away_player_X7
29 away_player_X8
30 away_player_X9
31 away_player_X10
32 away_player_X11
33 home_player_Y1
34 home_player_Y2
35 home_player_Y3
36 home_player_Y4
37 home_player_Y5
38 home_player_Y6
39 home_player_Y7
40 home_player_Y8
41 home_player_Y9
42 home_player_Y10
43 home_player_Y11
44 away_player_Y1
45 away_player_Y2
46 away_player_Y3
47 away_player_Y4
48 away_player_Y5
49 away_player_Y6
50 away_player_Y7
51 away_player_Y8
52 away_player_Y9
53 away_player_Y10
54 away_player_Y11
55 home_player_1
56 home_player_2
57 home_player_3
58 home

In [4]:
match_df.columns[85:]

Index(['B365H', 'B365D', 'B365A', 'BWH', 'BWD', 'BWA', 'IWH', 'IWD', 'IWA',
       'LBH', 'LBD', 'LBA', 'PSH', 'PSD', 'PSA', 'WHH', 'WHD', 'WHA', 'SJH',
       'SJD', 'SJA', 'VCH', 'VCD', 'VCA', 'GBH', 'GBD', 'GBA', 'BSH', 'BSD',
       'BSA'],
      dtype='object')

In [5]:
# drop betting columns

match_df.drop(match_df.columns[85:], axis=1, inplace=True)
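
Dropping by position (`columns[85:]`) depends on the column order and removes different columns if the cell is run twice. A possibly more robust variant, sketched below on toy data, drops the betting columns by name; `errors="ignore"` makes the cell safe to re-run. Only three of the 30 betting column names are used here, and the odds values are made up.

```python
import pandas as pd

# Toy frame with two score columns and three betting-odds columns.
demo = pd.DataFrame({
    "home_team_goal": [1, 0],
    "away_team_goal": [1, 0],
    "B365H": [1.73, 1.95],
    "B365D": [3.4, 3.2],
    "BSA": [4.2, 3.6],
})

betting_columns = ["B365H", "B365D", "BSA"]  # subset of the 30 betting names

# errors="ignore" skips names that are already gone, so re-running is harmless.
trimmed = demo.drop(columns=betting_columns, errors="ignore")
trimmed_again = trimmed.drop(columns=betting_columns, errors="ignore")

print(list(trimmed.columns))
```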

In [6]:
# check columns with missing values
match_df.loc[:,match_df.isna().any()].columns


Index(['home_player_X1', 'home_player_X2', 'home_player_X3', 'home_player_X4',
       'home_player_X5', 'home_player_X6', 'home_player_X7', 'home_player_X8',
       'home_player_X9', 'home_player_X10', 'home_player_X11',
       'away_player_X1', 'away_player_X2', 'away_player_X3', 'away_player_X4',
       'away_player_X5', 'away_player_X6', 'away_player_X7', 'away_player_X8',
       'away_player_X9', 'away_player_X10', 'away_player_X11',
       'home_player_Y1', 'home_player_Y2', 'home_player_Y3', 'home_player_Y4',
       'home_player_Y5', 'home_player_Y6', 'home_player_Y7', 'home_player_Y8',
       'home_player_Y9', 'home_player_Y10', 'home_player_Y11',
       'away_player_Y1', 'away_player_Y2', 'away_player_Y3', 'away_player_Y4',
       'away_player_Y5', 'away_player_Y6', 'away_player_Y7', 'away_player_Y8',
       'away_player_Y9', 'away_player_Y10', 'away_player_Y11', 'home_player_1',
       'home_player_2', 'home_player_3', 'home_player_4', 'home_player_5',
       'home_player_6', 

In [13]:
# get frequency of missing values in each column 
print(match_df.isna().sum()/len(match_df)*100)


id             0.00000
country_id     0.00000
league_id      0.00000
season         0.00000
stage          0.00000
                ...   
foulcommit    45.27503
card          45.27503
cross         45.27503
corner        45.27503
possession    45.27503
Length: 85, dtype: float64


In [76]:
for i, j in enumerate(match_df):
    print(i, j, match_df[j].isna().sum()/len(match_df)*100)

0 id 0.0
1 country_id 0.0
2 league_id 0.0
3 season 0.0
4 stage 0.0
5 date 0.0
6 match_api_id 0.0
7 home_team_api_id 0.0
8 away_team_api_id 0.0
9 home_team_goal 0.0
10 away_team_goal 0.0
11 home_player_X1 7.009507679279419
12 home_player_X2 7.009507679279419
13 home_player_X3 7.051849570807191
14 home_player_X4 7.051849570807191
15 home_player_X5 7.051849570807191
16 home_player_X6 7.051849570807191
17 home_player_X7 7.051849570807191
18 home_player_X8 7.051849570807191
19 home_player_X9 7.051849570807191
20 home_player_X10 7.051849570807191
21 home_player_X11 7.051849570807191
22 away_player_X1 7.051849570807191
23 away_player_X2 7.051849570807191
24 away_player_X3 7.051849570807191
25 away_player_X4 7.051849570807191
26 away_player_X5 7.051849570807191
27 away_player_X6 7.051849570807191
28 away_player_X7 7.051849570807191
29 away_player_X8 7.051849570807191
30 away_player_X9 7.055698833673352
31 away_player_X10 7.055698833673352
32 away_player_X11 7.078794410870318
33 home_player_Y1 

In [61]:
# inspect missing values > 20%
print('***Printing goal column unique values***\n')
print(match_df['goal'].unique())
print('\n'*2)
print('***Printing shoton column unique values***\n')
print(match_df['shoton'].unique())
print('\n'*2)
print('***Printing foulcommit column unique values***\n')
print(match_df['foulcommit'].unique())
print('\n'*2)
print('***Printing card column unique values***\n')
print(match_df['card'].unique())
print('\n'*2)
print('***Printing cross column unique values***\n')
print(match_df['cross'].unique())
print('\n'*2)
print('***Printing corner column unique values***\n')
print(match_df['corner'].unique())
print('\n'*2)
print('***Printing possession column unique values***\n')
print(match_df['possession'].unique())


***Printing goal column unique values***

[nan
 '<goal><value><comment>n</comment><stats><goals>1</goals><shoton>1</shoton></stats><event_incident_typefk>406</event_incident_typefk><elapsed>22</elapsed><player2>38807</player2><subtype>header</subtype><player1>37799</player1><sortorder>5</sortorder><team>10261</team><id>378998</id><n>295</n><type>goal</type><goal_type>n</goal_type></value><value><comment>n</comment><stats><goals>1</goals><shoton>1</shoton></stats><event_incident_typefk>393</event_incident_typefk><elapsed>24</elapsed><player2>24154</player2><subtype>shot</subtype><player1>24148</player1><sortorder>4</sortorder><team>10260</team><id>379019</id><n>298</n><type>goal</type><goal_type>n</goal_type></value></goal>'
 '<goal><value><comment>n</comment><stats><goals>1</goals><shoton>1</shoton></stats><event_incident_typefk>393</event_incident_typefk><elapsed>4</elapsed><player2>39297</player2><subtype>shot</subtype><player1>26181</player1><sortorder>2</sortorder><team>9825</team>
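
The event columns (`goal`, `shoton`, `card`, and so on) hold XML strings rather than numbers, which is why they look unexplainable at first glance. Should they be needed later, they could be parsed with the standard library, as in this sketch. The XML below is a shortened copy of the first value printed above, keeping only a few tags; the helper names are hypothetical, and the assumption is that every event sits in a `<value>` element as in that sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of one `goal` cell (only some of the tags are kept).
goal_xml = (
    "<goal>"
    "<value><elapsed>22</elapsed><subtype>header</subtype>"
    "<player1>37799</player1><team>10261</team><type>goal</type></value>"
    "<value><elapsed>24</elapsed><subtype>shot</subtype>"
    "<player1>24148</player1><team>10260</team><type>goal</type></value>"
    "</goal>"
)


def _to_int(text):
    """Convert tag text to int, or return None when the tag is absent."""
    return int(text) if text is not None else None


def parse_goal_events(xml_text):
    """Return one dict per <value> element found in a goal XML string."""
    root = ET.fromstring(xml_text)
    events = []
    for value in root.findall("value"):
        events.append({
            "minute": _to_int(value.findtext("elapsed")),
            "team": _to_int(value.findtext("team")),
            "player1": _to_int(value.findtext("player1")),
            "subtype": value.findtext("subtype"),
        })
    return events


goal_events = parse_goal_events(goal_xml)
print(goal_events)
```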

In [74]:
match_df['home_player_Y1'].unique()

array([nan,  1.,  3.,  0.])

In [70]:
match_df['home_player_X2'].unique()

array([nan,  2.,  4.,  3.,  1.,  5.,  6.,  8.,  7.,  0.])

In [73]:
match_df['away_player_Y1'].unique()

array([nan,  1.,  3.])

In [138]:
# drop the player coordinate columns (positions 11-54) and the XML event columns (positions 77-84)
match_df.drop(match_df.columns[np.r_[11:55, 77:85]], axis=1, inplace=True)
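
`np.r_` concatenates several slices into one array of positions, which is what lets a single `drop` call remove two separate column ranges. A quick check of the arithmetic: 44 coordinate columns plus 8 event columns gives 52 dropped, and 85 - 52 = 33 columns remain.

```python
import numpy as np

# Two half-open ranges joined into one index array: 11..54 and 77..84.
positions = np.r_[11:55, 77:85]

n_dropped = len(positions)
n_remaining = 85 - n_dropped  # 85 columns were left after removing betting odds

print(positions[:3], positions[-3:])
print(n_dropped, n_remaining)
```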

In [147]:
# check missing values
for i, j in enumerate(match_df):
    print(i, j, match_df[j].isna().sum()/len(match_df)*100)

0 id 0.0
1 country_id 0.0
2 league_id 0.0
3 season 0.0
4 stage 0.0
5 date 0.0
6 match_api_id 0.0
7 home_team_api_id 0.0
8 away_team_api_id 0.0
9 home_team_goal 0.0
10 away_team_goal 0.0
11 home_player_1 4.711497748181223
12 home_player_2 5.0617806690018865
13 home_player_3 4.930905731552407
14 home_player_4 5.092574771931175
15 home_player_5 5.065629931868047
16 home_player_6 5.100273297663498
17 home_player_7 4.723045536779707
18 home_player_8 5.038685091804919
19 home_player_9 4.900111628623119
20 home_player_10 5.527541475807383
21 home_player_11 5.985603756880558
22 away_player_1 4.749990376842835
23 away_player_2 4.9193579429539245
24 away_player_3 4.977096885946342
25 away_player_4 5.084876246198853
26 away_player_5 5.138765926325109
27 away_player_6 5.0540821432695635
28 away_player_7 4.753839639708995
29 away_player_8 5.161861503522076
30 away_player_9 5.111821086261981
31 away_player_10 5.546787790138189
32 away_player_11 5.981754494014396


In [154]:
match_df['home_player_1'].unique() 
match_df['away_player_2'].unique()


array([    nan,  38388.,  38293., ...,  27232., 458806.,  92252.])

In [194]:
# a missing value in a home_player_N / away_player_N column means no player ID was recorded for that lineup slot in that match

match_df.iloc[25974:,np.r_[9:11,11:33]]

Unnamed: 0,home_team_goal,away_team_goal,home_player_1,home_player_2,home_player_3,home_player_4,home_player_5,home_player_6,home_player_7,home_player_8,...,away_player_2,away_player_3,away_player_4,away_player_5,away_player_6,away_player_7,away_player_8,away_player_9,away_player_10,away_player_11
25974,1,0,42231.0,678384.0,95220.0,638592.0,413155.0,45780.0,171229.0,67333.0,...,563066.0,8800.0,67304.0,158253.0,133126.0,186524.0,93223.0,121115.0,232110.0,289732.0
25975,1,2,33272.0,41621.0,25813.0,257845.0,114735.0,42237.0,113227.0,358156.0,...,114792.0,150007.0,178119.0,27232.0,570830.0,260708.0,201704.0,36382.0,34082.0,95257.0
25976,2,0,157856.0,274779.0,177689.0,294256.0,42258.0,39979.0,173936.0,147959.0,...,67349.0,202663.0,32597.0,114794.0,188114.0,25840.0,482200.0,95230.0,451335.0,275122.0
25977,0,0,,8881.0,173534.0,39646.0,282287.0,340790.0,393337.0,8893.0,...,121080.0,197757.0,260964.0,231614.0,113235.0,41116.0,462608.0,42262.0,92252.0,194532.0
25978,4,3,274787.0,492132.0,108451.0,25815.0,94553.0,384376.0,598355.0,36785.0,...,95216.0,172768.0,22834.0,458806.0,207234.0,25772.0,40274.0,34035.0,41726.0,527103.0


In [197]:
# filling missing values with 0
match_df.fillna(0, inplace=True)
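
Filling with 0 works, but it turns "no player recorded" into what looks like a player ID of 0, and the columns stay `float64`. One alternative worth considering, shown here as a sketch on toy data rather than as what this notebook does, is pandas' nullable `Int64` dtype, which keeps the IDs as integers while preserving the missing marker.

```python
import numpy as np
import pandas as pd

# Toy column with one missing player ID.
demo = pd.DataFrame({"home_player_1": [42231.0, np.nan, 274787.0]})

# Approach used above: replace missing IDs with 0.
filled = demo["home_player_1"].fillna(0).astype("int64")

# Alternative: nullable integers keep the gap visible as <NA>.
nullable = demo["home_player_1"].astype("Int64")

print(filled.tolist())
print(nullable.tolist())
```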

In [198]:
match_df.shape

(25979, 33)

In [199]:
match_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                25979 non-null  int64         
 1   country_id        25979 non-null  int64         
 2   league_id         25979 non-null  int64         
 3   season            25979 non-null  object        
 4   stage             25979 non-null  int64         
 5   date              25979 non-null  datetime64[ns]
 6   match_api_id      25979 non-null  int64         
 7   home_team_api_id  25979 non-null  int64         
 8   away_team_api_id  25979 non-null  int64         
 9   home_team_goal    25979 non-null  int64         
 10  away_team_goal    25979 non-null  int64         
 11  home_player_1     25979 non-null  float64       
 12  home_player_2     25979 non-null  float64       
 13  home_player_3     25979 non-null  float64       
 14  home_player_4     2597

In [162]:
# check duplicates
match_df.duplicated().sum()

0

In [145]:
# convert date to datetime
match_df['date'] = pd.to_datetime(match_df['date'])

In [200]:
match_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25979 entries, 0 to 25978
Data columns (total 33 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   id                25979 non-null  int64         
 1   country_id        25979 non-null  int64         
 2   league_id         25979 non-null  int64         
 3   season            25979 non-null  object        
 4   stage             25979 non-null  int64         
 5   date              25979 non-null  datetime64[ns]
 6   match_api_id      25979 non-null  int64         
 7   home_team_api_id  25979 non-null  int64         
 8   away_team_api_id  25979 non-null  int64         
 9   home_team_goal    25979 non-null  int64         
 10  away_team_goal    25979 non-null  int64         
 11  home_player_1     25979 non-null  float64       
 12  home_player_2     25979 non-null  float64       
 13  home_player_3     25979 non-null  float64       
 14  home_player_4     2597

<a id='eda'></a>
## Exploratory Data Analysis



### Research Question 1: Which team scored the most goals?
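
A possible starting point for this question, sketched on toy data: a team's goals are split between `home_team_goal` (when it plays at home) and `away_team_goal` (when it plays away), so both sides need to be summed per team and then added together. The goal values below are made up for illustration.

```python
import pandas as pd

# Toy matches using the column names of match_df.
matches = pd.DataFrame({
    "home_team_api_id": [9987, 10000, 9987],
    "away_team_api_id": [9993, 9987, 10000],
    "home_team_goal": [1, 0, 2],
    "away_team_goal": [1, 3, 0],
})

# Goals scored at home and away, indexed by team ID.
home_goals = matches.groupby("home_team_api_id")["home_team_goal"].sum()
away_goals = matches.groupby("away_team_api_id")["away_team_goal"].sum()

# fill_value=0 covers teams that appear on only one side.
total_goals = (
    home_goals.add(away_goals, fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)

print(total_goals)
```

Merging the result with the `Team` table on `team_api_id` would then replace the IDs with team names.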

### Research Question 2: What team improved the most over the time period?
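
"Improved" needs a definition before it can be measured. The sketch below assumes one possible definition: the change in average points per match (3 for a win, 1 for a draw, 0 for a loss) between the first and last seasons. Both the definition and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy matches for two teams (IDs 1 and 2) in two seasons.
matches = pd.DataFrame({
    "season": ["2008/2009", "2008/2009", "2015/2016", "2015/2016"],
    "home_team_api_id": [1, 2, 1, 2],
    "away_team_api_id": [2, 1, 2, 1],
    "home_team_goal": [0, 2, 3, 1],
    "away_team_goal": [1, 0, 0, 1],
})


def points(goals_for, goals_against):
    """League points per match: 3 for a win, 1 for a draw, 0 for a loss."""
    return (goals_for > goals_against) * 3 + (goals_for == goals_against) * 1


# One row per team per match, from both the home and the away perspective.
home = pd.DataFrame({
    "season": matches["season"],
    "team": matches["home_team_api_id"],
    "points": points(matches["home_team_goal"], matches["away_team_goal"]),
})
away = pd.DataFrame({
    "season": matches["season"],
    "team": matches["away_team_api_id"],
    "points": points(matches["away_team_goal"], matches["home_team_goal"]),
})
per_team = pd.concat([home, away], ignore_index=True)

# Average points per match for each team and season, seasons as columns.
ppm = per_team.groupby(["team", "season"])["points"].mean().unstack("season")
improvement = (ppm["2015/2016"] - ppm["2008/2009"]).sort_values(ascending=False)

print(improvement)
```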

### Research Question 3: Which player had the most penalties?

### Research Question 4: Which attributes lead teams to the most victories?

<a id='conclusions'></a>
## Conclusions

