<div>    
<img src="https://cdn.bleacherreport.net/images_root/slides/photos/000/587/176/Premier_League_original.jpg?1293083692" width="500"/> 
</div>

[Image Source](https://cdn.bleacherreport.net/images_root/slides/photos/000/587/176/Premier_League_original.jpg?1293083692)

# EDA on English Premier League Players Game Statistics 
---

## Objectives:

- Exploratory data analysis on the English Premier League players dataset.
- Learn and apply data visualization techniques using plotly's data visualization library.

---
<a id="top"></a>

## Table of Contents
* [1. Introduction](#1)
    * [1.1 The Premier League](#1.1)
    * [1.2 The Dataset](#1.2)
    * [1.3 Data Pre-Processing](#1.3)
* [2. General Statistics](#2)
    * [2.1 Countries most represented in the EPL](#2.1)
    * [2.2 Players Appearances (nr. games)](#2.2)
    * [2.3 Players' Age](#2.3)
* [3. Players Stats By Playing Position](#3)
    * [3.1 Goalkeepers](#3.1)
    * [3.2 Defenders](#3.2)
    * [3.3 Midfielders](#3.3)
    * [3.4 Forwards](#3.4)
* [4. Other Statistics](#4)
    * [4.1 Goal Distribution](#4.1)
    * [4.2 Unwanted records](#4.2)
* [5. Closing Remarks](#5)   
* [6. References](#5)  
---


## 1. Introduction <a class="anchor" id="1"></a>
### 1.1 The Premier League <a class="anchor" id="1.1"></a>
The Premier League, often referred to outside England as the English Premier League (EPL for short), is the top level of the English football league system. Contested by 20 clubs, it operates on a system of promotion and relegation with the English Football League (EFL). Seasons run from August to May, with each team playing 38 matches (playing all 19 other teams both home and away). [[source](https://en.wikipedia.org/wiki/Premier_League)]

### 1.2 The Dataset <a class="anchor" id="1.2"></a>

The [dataset](https://www.kaggle.com/rishikeshkanabar/premier-league-player-statistics-updated-daily) is correct up to and including 2020-09-24 (a lot has happened since then; the notebook will be updated when the dataset gets an update). Each row represents a football player currently playing in the EPL, and the columns are features describing the player's profile and game statistics. There are 571 rows and 59 columns in the data. A few of the columns are:

* Name: Name of the player
* Jersey Number: Number at the back of his shirt
* Club: Club the player plays for at present
* Position: Playing position (Goalkeeper, Defender, Midfielder, Forward)
* Nationality: Country the player is from
* Age: Player's age
* Appearances: Number of games played (a substitute appearance also counts)
* Wins: Number of games the player has won
* Losses: Number of games the player has lost
* Goals: Number of goals the player has scored in the EPL
* Goals per match: Goals scored per game, etc.


**Notes** : 

1. A player's stats from previous Premier League clubs are carried over to his current club.
2. When all-time stats are considered, longevity obviously plays a big role: the longer a player has played in the EPL, the higher his cumulative counts will be. Hence *per-game* stats are a better indicator of performance. On the other hand, players with very few games can show inflated per-game figures and appear as top performers. For this reason, only players with at least 38 appearances (a full season's worth of games) are considered in the per-game comparison.
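To make note 2 concrete, here is a minimal sketch of the per-game comparison on a toy table. The three players and their numbers are made up for illustration; only the column names `Name`, `Appearances` and `Goals` follow the dataset, and `Goals per game` is a helper column introduced here.

```python
import pandas as pd

# Toy rows (hypothetical players); column names follow the dataset.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Appearances": [200, 40, 3],
    "Goals": [50, 20, 2],
})

MIN_APPEARANCES = 38  # one full season, as in note 2

# Keep only players with a season's worth of games,
# then divide the cumulative count by the games played.
eligible = toy[toy["Appearances"] >= MIN_APPEARANCES].copy()
eligible["Goals per game"] = eligible["Goals"] / eligible["Appearances"]

ranked = eligible.sort_values("Goals per game", ascending=False)
print(ranked[["Name", "Goals", "Goals per game"]])
```

Without the threshold, Player C (2 goals in 3 games) would top the per-game ranking despite having played almost nothing, which is exactly the effect the note warns about.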


In [None]:
from itertools import repeat
import numpy as np 
import matplotlib.pyplot as plt
import pandas as pd 
pd.set_option('mode.chained_assignment', None)
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

In [None]:
data = pd.read_csv('../input/premier-league-player-statistics-updated-daily/dataset - 2020-09-24.csv')
data = data.copy()

In [None]:
print('Shape of the dataset is {}'.format(data.shape))
data.head()

In [None]:
data.info()

### 1.3 Data pre-processing <a class="anchor" id="1.3"></a>
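As a warm-up for the pre-processing cell, the sketch below shows two of its cleaning steps on a toy frame: dropping rows that lack identifying fields, and turning percentage strings such as `'25%'` into floats. The rows are hypothetical; only the column names come from the dataset. Using `dropna(subset=...)` is one compact way to do the filtering, not necessarily the only one.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Nationality": ["England", None, "France"],
    "Age": [24.0, 30.0, 27.0],
    "Jersey Number": [7.0, 9.0, 10.0],
    "Cross accuracy %": ["25%", "31%", None],
})

# Drop rows missing any of the identifying fields in a single call.
toy = toy.dropna(subset=["Nationality", "Age", "Jersey Number"]).copy()

# Strip the literal percent sign and convert to float; missing values stay NaN.
toy["Cross accuracy %"] = (
    toy["Cross accuracy %"].str.replace("%", "", regex=False).astype(float)
)
print(toy)
```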

In [None]:
# Remove entries which do not have age, jersey number and nationality 
data = data[data['Nationality'].notna()]
data = data[data['Age'].notna()]
data = data[data['Jersey Number'].notna()]

# cleaning the percentage sign
data['Cross accuracy %'] = data['Cross accuracy %'].str.replace(r'%', '').astype(float)
data['Shooting accuracy %'] = data['Shooting accuracy %'].str.replace(r'%', '').astype(float)
data['Tackle success %'] = data['Tackle success %'].str.replace(r'%', '').astype(float)

features = data.columns
data_clean = data[features]
data_clean.head()

# Keep only players with at least one appearance (prevents division by zero)
data_clean_appNonZero = data_clean[data_clean['Appearances'] > 0].copy()

# Per-game normalisation: divide every count column by the number of appearances.
# Text/identifier columns and stats that are already rates or percentages are excluded.

cols = features.drop(['Age', 'Name', 'Appearances', 'Club', 'Nationality', 'Jersey Number', 'Cross accuracy %', 'Position', 'Goals per match', 
                      'Passes per match','Tackle success %', 'Shooting accuracy %'])
data_clean_appNonZero[cols] = data_clean_appNonZero[cols].div(data_clean_appNonZero['Appearances'], axis=0)

# positional classifications on the data as is
goalies = data[data['Position'] == 'Goalkeeper']
defenders = data[data['Position'] == 'Defender']
midfielders = data[data['Position'] == 'Midfielder']
forwards = data[data['Position'] == 'Forward']

# players with at least 38 appearances (a season's worth of games)
# data as is (cumulative counts)
data_38app = data[data['Appearances'] >=38]
goalies_38app = goalies[goalies['Appearances'] >= 38]
defenders_38app = defenders[defenders['Appearances'] >= 38]
midfilders_38app = midfielders[midfielders['Appearances'] >= 38]
forwards_38app = forwards[forwards['Appearances'] >= 38]

# players with at least 38 appearances (a season's worth of games)
# data normalized per game
all_players = data_clean_appNonZero[data_clean_appNonZero['Appearances'] >= 38]
goalies_ = data_clean_appNonZero[(data_clean_appNonZero['Position'] == 'Goalkeeper') & (data_clean_appNonZero['Appearances'] >= 38)]
defenders_ = data_clean_appNonZero[(data_clean_appNonZero['Position'] == 'Defender') & (data_clean_appNonZero['Appearances'] >= 38)]
midfielders_ = data_clean_appNonZero[(data_clean_appNonZero['Position'] == 'Midfielder') & (data_clean_appNonZero['Appearances'] >= 38)]
forwards_ = data_clean_appNonZero[(data_clean_appNonZero['Position'] == 'Forward') & (data_clean_appNonZero['Appearances'] >= 38)]

<a href="#top">Back to top</a>  

## 2. General Statistics <a class="anchor" id="2"></a>
### 2.1 Countries most represented in the EPL <a class="anchor" id="2.1"></a>

In any league it is normal to have more home-grown players than foreign players, and the EPL is no different: the majority of the players are English. The other UK nations are also likely to be well represented. The question is which country comes next. Geographical proximity is one likely factor. Beyond that, representation is probably determined by talent, barring work-permit and visa-related issues that could prevent some players from playing in the EPL, though that is a rarity. Below are the top three nations after the home country (England), ranked by overall appearances, with a further breakdown by playing position.

#### Summary most represented nations:

* Overall: 1st **France**, 2nd **Spain**, 3rd *Brazil*
* Goalkeepers: 1st **Spain**, 2nd Denmark, 3rd **France**
* Defenders: 1st **Spain**, 2nd Netherlands, 3rd *Ireland*
* Midfielders: 1st *Scotland*, 2nd **France**, 3rd **Spain**
* Forwards: 1st *Brazil*, 2nd **France**, 3rd *Ireland*

**French** and **Spanish** players have found a second home in England. Honorable mention to **Brazil**.
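A ranking like the one above can be derived with a couple of `groupby` calls. The sketch below uses a made-up six-row table (the appearance counts are hypothetical, the column names follow the dataset) and assumes that "most represented" means the largest sum of `Appearances`.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Nationality": ["England", "France", "Spain", "France", "Brazil", "England"],
    "Position": ["Defender", "Midfielder", "Goalkeeper", "Forward", "Forward", "Forward"],
    "Appearances": [300, 120, 150, 90, 80, 50],
})

# Exclude the home country before ranking.
abroad = toy[toy["Nationality"] != "England"]

# Overall: total appearances per nationality, three largest.
overall = abroad.groupby("Nationality")["Appearances"].sum().nlargest(3)

# Per position: the same ranking within each playing position.
by_position = (
    abroad.groupby(["Position", "Nationality"])["Appearances"].sum()
    .sort_values(ascending=False)
    .groupby(level="Position").head(3)
)
print(overall)
print(by_position)
```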

In [None]:
df = data
fig = px.pie(df,
             values='Appearances',
             names='Nationality',
             title='Countries represented in the EPL by number of appearances',
             )
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(title_text='<b> Nr. of appearances per country </b>', 
                  title_x=0.5, 
                  title_font=dict(color='black', 
                            size=28, 
                            family="Courier New, monospace",),
                  width=600,
                  height=600,
                  showlegend=False,
                 )
iplot(fig)



In [None]:
df = data
fig = px.sunburst(df, 
                   path=['Position', 'Nationality'], 
                   values='Appearances', 
                 )
fig.update_layout(title_text='<b>Players position by country </b>', 
                  title_x=0.5, 
                  title_font=dict(color='black', 
                            size=28, 
                            family="Courier New, monospace",),
                  width=600,
                  height=600,
                  showlegend=False,
                 )
iplot(fig)

### 2.2 Players Appearances (nr. games) <a class="anchor" id="2.2"></a>

Longevity, versatility, the quality of the player and the squad, and playing position are the major factors behind the number of appearances.

Most of the appearances come from defensive and midfield positions, which is no surprise: averaging the most common line-ups (4-4-2, 4-5-1, 3-5-2, 4-3-3) gives roughly the split shown in the graph below. Midfielders are usually versatile and can play in defence or attack if needed, so they make up the largest part of a typical squad.

#### Summary most appearances

- Goalkeeper: Joe Hart, 340
- Defender: Phil Jagielka, 366
- Midfielder: James Milner, 539
- Forward: Theo Walcott, 346
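The per-position leaders listed above can be looked up with `groupby(...).idxmax()`. This is a small sketch on hypothetical rows (placeholder names, dataset column names), not the exact cell used to produce the summary.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Name": ["Keeper A", "Keeper B", "Defender A", "Midfielder A", "Midfielder B"],
    "Position": ["Goalkeeper", "Goalkeeper", "Defender", "Midfielder", "Midfielder"],
    "Appearances": [340, 120, 366, 539, 210],
})

# idxmax returns, for each position, the row label of the highest appearance count.
leader_rows = toy.groupby("Position")["Appearances"].idxmax()
leaders = toy.loc[leader_rows, ["Position", "Name", "Appearances"]]
print(leaders)
```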

In [None]:
df = data
fig = px.bar(df, x="Position", y="Appearances",color='Club',
             hover_data=["Name"],
             width=750, height=600,)
fig.update_layout(
             template='ggplot2',
             title='<b>Players appearance by position</b>',
             title_font={'size':24})
iplot(fig)

In [None]:
fig = px.bar(df, y="Club", x="Appearances",color='Position',
             hover_data=["Name"],
             width=750, 
             height=600,
             )
fig.update_layout(
             template='ggplot2',
             title='<b>Appearance by club</b>',
             title_font={'size':24},
)
iplot(fig)

### 2.3 Players' Age <a class="anchor" id="2.3"></a>
Goalkeepers and defenders are relatively older than their midfield and attacking colleagues. The defensive line (including goalkeepers) is widely regarded as the area where a wise, cool head is needed more than lightning-fast legs. There is no harm in a defender being fast, and many are, but most defenders mature with age. Being an older defender is not a bad thing.

**Summary:**
* The youngest squad is Leeds United
* Liverpool are the oldest group of players
* The min, median and max ages are 17, 25.8 and 38 years
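The figures in this summary can be reproduced along the following lines. The sketch uses hypothetical clubs and ages, and it assumes that the "youngest" and "oldest" squads are the ones with the lowest and highest mean age; a median-based ranking would be an equally reasonable choice.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Club": ["Club X", "Club X", "Club Y", "Club Y", "Club Z"],
    "Age": [19, 23, 30, 28, 25],
})

# Mean age per club, sorted from youngest to oldest squad.
mean_age = toy.groupby("Club")["Age"].mean().sort_values()
youngest_club, oldest_club = mean_age.index[0], mean_age.index[-1]

# League-wide minimum, median and maximum age.
age_summary = toy["Age"].agg(["min", "median", "max"])

print(youngest_club, oldest_club)
print(age_summary)
```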

**Extra info:**

The youngest ever EPL player: Harvey Elliott – 16 years and 30 days (made his debut for Fulham in May 2019, later joined Liverpool)

The oldest ever EPL player: John Burridge – 43 years and 163 days (Manchester City, last played in 1995)

<a href="#top">Back to top</a>  

In [None]:
df = data
age_avg=df['Age'].mean()
fig = px.violin(df, y="Age", x="Position", box=True,
                title='<b> Players Age distribution by position (avg. age dotted line)</b>',
                width=600,height=400,template='simple_white')
fig.add_shape( 
    type="line", line_color="blue", line_width=3, opacity=1, line_dash="dot",
    x0=0, x1=1, xref="paper", y0=age_avg, y1=age_avg, yref="y"
)
iplot(fig)

fig = px.box(df, y="Club", x="Age",
            title='<b>Players Age distribution by club (avg. age dotted line)</b>',
            width=750,height=750,template='ggplot2')
fig.add_shape( 
    type="line", line_color="black", line_width=3, opacity=1, line_dash="dot",
    y0=0, y1=1, yref="paper", x0=age_avg, x1=age_avg, xref="x"
)
# fig.update_layout( yaxis={'categoryorder':'total descending'})
iplot(fig)

<div>    
<img src="https://cdn.standardmedia.co.ke/images/thursday/spwhmm6qtavglc4atib5ed0054dcc631.jpg" width="500"/>    
</div>

## 3. Players Stats By Playing Position <a class="anchor" id="3"></a>
### 3.1 Goalkeepers  <a class="anchor" id="3.1"></a>

One of the key stats for goalkeepers is the much-coveted `clean sheet` (conceding zero goals in a game). Although this stat does not depend entirely on the goalkeeper's own ability (a solid defensive line in front of the goalkeeper always helps), it is a good indication of a goalkeeper's quality. The goalkeeper who keeps the most clean sheets in a season is awarded the Golden Glove. The goalkeeping stats available in the dataset include:

*  Clean sheets
*  Saves
*  Penalties saved 
*  Punches 
*  High Claims 
*  Catches

**Extra info**: 
The most expensive goalkeepers in the EPL are:
1. Kepa Arrizabalaga (ESP), Athletic Bilbao to Chelsea in 2018, £71m
2. Alisson Becker (BRA), AS Roma to Liverpool in 2018, £65m
3. Ederson Moraes (BRA), Benfica to Manchester City in 2017, £34.7m

[Here is the reference. ](https://www.espn.com/soccer/soccer-transfers/story/3135816/the-10-most-expensive-goalkeepers-kepa-alisson-becker-courtois-ederson)
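The plots in this section put an "overall" ranking next to a "per-game" ranking for each stat. The sketch below shows that comparison for a single stat on made-up goalkeepers. The helper `top_n` and its parameters are hypothetical names introduced for this illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical goalkeepers; column names follow the dataset.
toy = pd.DataFrame({
    "Name": ["Keeper A", "Keeper B", "Keeper C", "Keeper D"],
    "Appearances": [340, 100, 60, 10],
    "Clean sheets": [120, 45, 15, 6],
    "Saves": [900, 350, 200, 40],
})

def top_n(df, stat, n=2, per_game=False, min_apps=38):
    """Return the n best players for `stat`, optionally divided by appearances."""
    eligible = df[df["Appearances"] >= min_apps]
    score = eligible[stat] / eligible["Appearances"] if per_game else eligible[stat]
    best = score.nlargest(n)
    result = eligible.loc[best.index, ["Name"]].copy()
    result[stat] = best
    return result

overall = top_n(toy, "Clean sheets")
per_game = top_n(toy, "Clean sheets", per_game=True)
print(overall)
print(per_game)
```

In this toy example the long-serving Keeper A leads the overall list, while Keeper B leads per game; Keeper D has the best ratio of all but is filtered out by the 38-appearance threshold.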


In [None]:
from plotly.subplots import make_subplots

head = 5
df1=goalies_38app.sort_values(by='Clean sheets', ascending=False).head(head)
df2=goalies_38app.sort_values(by='Saves', ascending=False).head(head)
df3=goalies_38app.sort_values(by='High Claims', ascending=False).head(head)
df4=goalies_38app.sort_values(by='Catches', ascending=False).head(head)

df11=goalies_.sort_values(by='Clean sheets', ascending=False).head(head)
df12=goalies_.sort_values(by='Saves', ascending=False).head(head)
df13=goalies_.sort_values(by='High Claims', ascending=False).head(head)
df14=goalies_.sort_values(by='Catches', ascending=False).head(head)

fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=('Clean sheets (overall)', 'Clean sheets (per-game)','Saves (overall)','Saves (per-game)',
                    'High Claims (overall)', 'High Claims (per-game)','Catches (overall)', 'Catches (per-game)'),
    horizontal_spacing = 0.12,
    vertical_spacing = 0.075)

fig.add_trace(go.Bar(
                y=df1["Name"], 
                x=df1['Clean sheets'],
                hovertext=df1['Club'],
                orientation='h'),
                row=1, col=1)

fig.add_trace(go.Bar(
                y=df2["Name"], 
                x=df2['Saves'],
                hovertext=df2['Club'],
                orientation='h'),
                row=2, col=1)

fig.add_trace(go.Bar(
                y=df3["Name"], 
                x=df3['High Claims'],
                hovertext=df3['Club'],
                orientation='h'),
                row=3, col=1)

fig.add_trace(go.Bar(
                y=df4["Name"], 
                x=df4['Catches'],
                hovertext=df4['Club'],
                orientation='h'),
                row=4, col=1)


fig.add_trace(go.Bar(
                y=df11["Name"], 
                x=df11['Clean sheets'],
                hovertext=df11['Club'],
                orientation='h'),
                row=1, col=2)

fig.add_trace(go.Bar(
                y=df12["Name"], 
                x=df12['Saves'],
                hovertext=df12['Club'],
                orientation='h'),
                row=2, col=2)

fig.add_trace(go.Bar(
                y=df13["Name"], 
                x=df13['High Claims'],
                hovertext=df13['Club'],
                orientation='h'),
                row=3, col=2)

fig.add_trace(go.Bar(
                y=df14["Name"], 
                x=df14['Catches'],
                hovertext=df14['Club'],
                orientation='h'),
                row=4, col=2)

fig.update_traces(marker_color= ['rgb(110,102,250)','rgb(210,202,82)','rgb(210,202,82)','rgb(210,202,82)',
                                 'rgb(210,202,82)',], marker_line_color='rgb(8,48,107)',
                  marker_line_width=2.5, opacity=0.6)
fig.update_layout(title_text='<b> Top goalkeepers stat</b>', 
                  title_font={'size':28},
                  title_x=0.5,
                  showlegend=False,
                  autosize=False, 
                  width=1300, 
                  height=1200,
                  template='ggplot2',
                  paper_bgcolor='lightgray',
                  #plot_bgcolor='lightgray',
                 )
fig.show()

<a href="#top">Back to top</a>  
<div>    
<img src="https://staticg.sportskeeda.com/editor/2020/01/ed476-15783945689891-800.jpg" width="500"/>    
</div>

### 3.2 Defenders <a class="anchor" id="3.2"></a>

In a game of football (in fact, in any game where a draw is possible), if you cannot win the game, then do not lose it. That means defending well and not conceding, and defenders play a huge part in that. Good defenders read the game well and sense danger in time. They know when to join the attack and when to sit back and defend. Although they are rare, some defenders do more than their own job by assisting and scoring goals as well. Let's see who the best defenders are at doing their job, and who contributes more than is asked of them.

**Extra info**:
Most expensive defenders in the EPL:
1. Harry Maguire (Leicester City to Manchester United) - €87million
2. Virgil van Dijk (Southampton to Liverpool) - €84.65 million
3. Joao Cancelo (Juventus to Manchester City) - €65 million

Despite not carrying the highest fee, big Virgil is the best defender on the list, if not in the world (my opinion, of course).

[Here is the reference.](https://www.kickoff.com/news/articles/world-news/categories/news/english-premier-league/the-10-most-expensive-defenders-of-all-time/681803?gallery=681803&gallery-page=11#ig)

In [None]:
defenders_attr =['Blocked shots', 'Interceptions', 'Clearances','Headed Clearance', 'Clearances off line',
                 'Duels won','Successful 50/50s', 'Aerial battles won']
# top=5
# defenders_attr =['Tackles', 'Tackle success %', 'Last man tackles', 'Blocked shots', 'Interceptions', 'Clearances',
#                  'Headed Clearance', 'Clearances off line', 'Recoveries', 'Duels won','Successful 50/50s', 'Aerial battles won'] 
# for atr in defenders_attr:
#     text = 5
#     df = data_38app[data_38app["Position"] == 'Defender'].sort_values(by=atr, ascending=False).head(top)
#     fig = px.bar(df, x="Name", 
#                  y=atr,
#                  color='Club',
#                  hover_name=None,
#                  title="Defender defensive ability: Top {} {} ".format(text, atr.lower()))
#     fig.update_layout(autosize=False, width=1000, height=500)
#     iplot(fig)


top = 5
fig = make_subplots(
    rows=5, cols=2,
    horizontal_spacing = 0.05, 
    vertical_spacing = 0.075, 
    subplot_titles=('Blocked shots (overall)', 'Blocked shots (per-game)','Interceptions (overall)', 'Interceptions (per-game)','Clearances (overall)',
                    'Clearances (per-game)','Headed Clearance (overall)', 'Headed Clearance (per-game)','Clearances off line (overall)', 'Clearances off line (per-game)'),
    )


df = defenders_38app.sort_values(by='Blocked shots', ascending=False).head(top)
fig.add_trace(go.Bar(x=df["Name"], 
             y=df['Blocked shots'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=1, col=1)
              
df = defenders_38app.sort_values(by='Interceptions', ascending=False).head(top)
fig.add_trace(go.Bar(x=df["Name"], 
             y=df['Interceptions'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=2, col=1)  

df = defenders_38app.sort_values(by='Clearances', ascending=False).head(top)
fig.add_trace(go.Bar(x=df["Name"], 
             y=df['Clearances'],
             #color='Club',
            #hover_name=None,
             orientation='v'),
             row=3, col=1)


df = defenders_38app.sort_values(by='Headed Clearance', ascending=False).head(top)
fig.add_trace(go.Bar(x=df["Name"], 
             y=df['Headed Clearance'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=4,col=1)

df = defenders_38app.sort_values(by='Clearances off line', ascending=False).head(top)
fig.add_trace(go.Bar( x=df["Name"], 
             y=df['Clearances off line'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=5, col=1)


df = defenders_.sort_values(by='Blocked shots', ascending=False).head(top)
fig.add_trace(go.Bar( x=df["Name"], 
             y=df['Blocked shots'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=1, col=2)

df = defenders_.sort_values(by='Interceptions', ascending=False).head(top)
fig.add_trace(go.Bar( x=df["Name"], 
             y=df['Interceptions'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=2, col=2)
  

df = defenders_.sort_values(by='Clearances', ascending=False).head(top)
fig.add_trace(go.Bar(x=df["Name"], 
             y=df['Clearances'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=3, col=2)


df = defenders_.sort_values(by='Headed Clearance', ascending=False).head(top)
fig.add_trace(go.Bar( x=df["Name"], 
             y=df['Headed Clearance'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=4,col=2)

df = defenders_.sort_values(by='Clearances off line', ascending=False).head(top)
fig.add_trace(go.Bar( x=df["Name"], 
             y=df['Clearances off line'],
             #color='Club',
             #hover_name=None,
             orientation='v'),
             row=5, col=2)

#fig.update_layout(title_text='Top Defender Qualities', title_x=0.5)
fig.update_traces(marker_color= ['rgb(96, 96, 96)','rgb(210,202,82)','rgb(210,202,82)','rgb(210,202,82)',
                                 'rgb(210,202,82)', 'rgb(210,202,82)'], marker_line_color='rgb(8,48,107)',
                  marker_line_width=2.5, opacity=0.6)

fig.update_layout(title_text='<b>Top Defender Qualities</b>', 
                  title_font={'size':28},
                  title_x=0.5,
                  showlegend=False,
                  autosize=False, 
                  width=1300, 
                  height=1300,
                  template='ggplot2',
                  paper_bgcolor='lightgray')
fig.show()


<div>    
<img src="https://assets-cms.thescore.com/uploads/image/file/215195/w1280xh966_EPL.jpg?ts=1479567580" width="500"/>    
</div>

### 3.3 Midfielders <a class="anchor" id="3.3"></a>

As it is often called, the midfield is the engine room of a football team. Midfielders dictate the tempo of the game and help their attacking or defending teammates depending on the situation the team is in. Their ability to thread an incisive pass, and their awareness to sense danger before their defensive colleagues are in trouble, are crucial qualities of a good midfielder. All great teams, past and present, have had a couple of world-class midfielders. Some of these attributes are summarized below.



In [None]:
mid_field_attr_D =['Recoveries','Duels won','Successful 50/50s','Aerial battles won'] 

col=2
row=4

top = 5
fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=('Recoveries (overall)','Recoveries (per-game)','Duels won (overall)','Duels won (per-game)','Successful 50/50s (overall)',
                    'Successful 50/50s (per-game)','Aerial battles won (overall)', 'Aerial battles won (per-game)'))

for i, atr in enumerate(mid_field_attr_D):
    df = data_38app[data_38app["Position"] == 'Midfielder'].sort_values(by=atr, ascending=False).head(top)
    fig.add_trace(go.Bar(x=df['Name'], 
         y=df[atr],
         orientation='v'),
         row=i+1, col=1)

for j, atr in enumerate(mid_field_attr_D):    
    df = midfielders_.sort_values(by=atr, ascending=False).head(top)
    fig.add_trace(go.Bar(x=df['Name'], 
         y=df[atr],
         orientation='v'),
         row=j+1, col=2)
       
# Style all traces once, after both loops have added their bars
fig.update_traces(marker_color= ['rgb(96, 96, 96)','rgb(110,202,82)','rgb(110,202,82)','rgb(110,202,82)',
                                 'rgb(110,202,82)', 'rgb(110,202,82)'], marker_line_color='rgb(8,48,107)',
                  marker_line_width=2.5, opacity=0.6)

fig.update_layout(title_text='<b>Top Midfield Qualities: Defense Ability</b>',
                  title_font={'size': 28, 'family':'Courier'},
                  title_x=0.5,
                  showlegend=False,
                  autosize=False, 
                  width=1200, 
                  height=1200, 
                  template='ggplot2', 
                  paper_bgcolor='lightgray')
iplot(fig)

In [None]:
mid_field_attr_A =['Assists','Big chances created','Cross accuracy %','Through balls']
col=2
row=4

top = 5
fig = make_subplots(
    rows=4, cols=2,
    # 'Cross accuracy %' is already a rate and is not divided by appearances in the pre-processing step
    subplot_titles=('Assists (overall)','Assists (per-game)','Big chances created (overall)','Big chances created (per-game)','Cross accuracy % (overall)',
                    'Cross accuracy % (not divided by games)','Through balls (overall)', 'Through balls (per-game)'))

for i, atr in enumerate(mid_field_attr_A):
    df = data_38app[data_38app["Position"] == 'Midfielder'].sort_values(by=atr, ascending=False).head(top)
    fig.add_trace(go.Bar(x=df['Name'], 
         y=df[atr],
         orientation='v'),
         row=i+1, col=1)

for j, atr in enumerate(mid_field_attr_A):    
    df = midfielders_.sort_values(by=atr, ascending=False).head(top)
    fig.add_trace(go.Bar(x=df['Name'], 
         y=df[atr],
         orientation='v'),
         row=j+1, col=2)
       
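# Style all traces once, after both loops have added their bars
# (the blue channel was 288 in the first colour; RGB channels max out at 255)
fig.update_traces(marker_color= ['rgb(255, 208, 255)','rgb(100,100,100)','rgb(100,100,100)','rgb(100,100,100)',
                                 'rgb(100,100,100)', 'rgb(100,100,100)'], marker_line_color='rgb(8,48,107)',
                  marker_line_width=2.5, opacity=0.6)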

fig.update_layout(title_text='<b>Top Midfield Qualities: Creative Ability</b>', 
                  title_font={'size': 28, 'family':'Courier'},
                  title_x=0.5,
                  showlegend=False,
                  autosize=False, 
                  width=1200, 
                  height=1200, 
                  template='ggplot2', 
                  paper_bgcolor='lightgray')
iplot(fig)

<a href="#top">Back to top</a>  
<div>    
<img src="https://dailypost.ng/wp-content/uploads/2019/04/epl-top-scorers-1.jpg" width="500"/>    
</div>

### 3.4 Forwards <a class="anchor" id="3.4"></a>

#### Goals, goals and goals:

Winning is the ultimate aim of the beautiful game, and scoring more goals than your opponent does just that. We all enjoy the moment when the team we support scores. Some would argue that goals for their own sake are meaningless without attractive, entertaining attacking football, but goals are what win you games. Below are the goal machines in the EPL.

1. Most goals: Sergio Aguero, Harry Kane, Jamie Vardy
2. Most right foot goals: Sergio Aguero, Harry Kane, Jamie Vardy
3. Most left foot goals: Mohamed Salah, Olivier Giroud, Riyad Mahrez
4. Most headed goals: Olivier Giroud, Christian Benteke, Andy Carroll
5. Most goal scoring nations: England, France, Brazil, Argentina
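The lists above boil down to one top-N lookup per goal column plus a sum per nationality. Here is a minimal sketch on hypothetical players (placeholder names and made-up counts, dataset column names) that builds a small leaderboard table similar in shape to the one plotted in the next cell.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Name": ["Striker A", "Striker B", "Winger C", "Defender D"],
    "Nationality": ["Argentina", "England", "Egypt", "England"],
    "Goals": [180, 140, 75, 20],
    "Goals with right foot": [130, 90, 10, 5],
    "Goals with left foot": [30, 25, 60, 2],
    "Headed goals": [18, 22, 4, 13],
})

goal_columns = ["Goals", "Goals with right foot", "Goals with left foot", "Headed goals"]
top = 3

# One column of names per goal type, ordered from rank 1 to rank `top`.
leaderboard = pd.DataFrame(
    {col: toy.nlargest(top, col)["Name"].to_list() for col in goal_columns},
    index=range(1, top + 1),
)
leaderboard.index.name = "Rank"

# Total goals per nationality, highest first.
goals_by_nation = toy.groupby("Nationality")["Goals"].sum().sort_values(ascending=False)

print(leaderboard)
print(goals_by_nation)
```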

In [None]:
headerColor = 'grey'
rowEvenColor = 'lightgrey'
rowOddColor = 'white'


head = 10
table_header = ['Rank', 'Total goals','Goals with right foot','Goals with left foot','Headed goals']
df = data_38app

fig = go.Figure(data=[go.Table(
    
    header=dict(values=list(table_header),
                    line_color='darkslategray',
                    fill_color=headerColor,
                    align=['left'],
                    font=dict(color='white', size=20),
                    height=30
                               ),
    
    cells=dict(values=[list(np.arange(1, head+1)),
        data_38app.sort_values(by='Goals', ascending=False)['Name'].head(head),
                       
        data_38app.sort_values(by='Goals with right foot', ascending=False)['Name'].head(head),
        data_38app.sort_values(by='Goals with left foot', ascending=False)['Name'].head(head),
        data_38app.sort_values(by='Headed goals', ascending=False)['Name'].head(head)],

       fill_color=[[rowOddColor,rowEvenColor]*5],
       font=dict(color='black', size=16, family="Courier New, monospace",), 
       align='left', height=25,)
        )
])
fig.update_layout(title_text='TOP {} GOAL SCORERS'.format(head), title_x=0.5, font=dict(color='white', size=20, family="Courier New, monospace",))
fig.update_layout(width=1200, height=550, template='plotly_dark')
fig.show()

In [None]:
#data = data_38app.head(10)
fig = px.pie(data,
             values='Goals',
             names='Nationality',
             title='<b>Players of which country score the most goals? </b>',
             width=550, height=550,
             )
fig.update_traces(textposition='inside',
                  textinfo='percent+label',
                  showlegend= False,
                 )
iplot(fig)

In [None]:
#Headed goals
hg_f =data[data['Position'] == 'Forward']['Headed goals'].sum()
hg_m =data[data['Position'] == 'Midfielder']['Headed goals'].sum()
hg_d =data[data['Position'] == 'Defender']['Headed goals'].sum()
hg_kg =data[data['Position'] == 'Goalkeeper']['Headed goals'].sum()
#Goals with right foot
rfg_f =data[data['Position'] == 'Forward']['Goals with right foot'].sum()
rfg_m =data[data['Position'] == 'Midfielder']['Goals with right foot'].sum()
rfg_d =data[data['Position'] == 'Defender']['Goals with right foot'].sum()
rfg_kg =data[data['Position'] == 'Goalkeeper']['Goals with right foot'].sum()
#Goals with left foot
lfg_f =data[data['Position'] == 'Forward']['Goals with left foot'].sum()
lfg_m =data[data['Position'] == 'Midfielder']['Goals with left foot'].sum()
lfg_d =data[data['Position'] == 'Defender']['Goals with left foot'].sum()
lfg_kg =data[data['Position'] == 'Goalkeeper']['Goals with left foot'].sum()


## 4. Other Statistics <a class="anchor" id="4"></a>
### 4.1 Goal distribution <a class="anchor" id="4.1"></a>
Summary:

* Forwards score the most goals (obviously)
* Right-footed goals are dominant
* Defenders love to head
* The big hitters are Manchester City, Liverpool and Tottenham
* The new boys (Leeds United) have the fewest goals scored
* **Crystal Palace** have the best goal-scoring defenders
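Several of these observations come from summing the three goal-type columns per playing position. The sketch below shows that aggregation in one `groupby` on made-up numbers; the column names follow the dataset, and the `shares` table is an extra step added here for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the dataset.
toy = pd.DataFrame({
    "Position": ["Forward", "Forward", "Midfielder", "Defender", "Goalkeeper"],
    "Goals with right foot": [60, 20, 15, 3, 0],
    "Goals with left foot": [10, 30, 8, 1, 0],
    "Headed goals": [12, 5, 3, 9, 0],
})

goal_types = ["Goals with right foot", "Goals with left foot", "Headed goals"]

# Total goals of each type per position.
totals = toy.groupby("Position")[goal_types].sum()

# Share of each goal type within a position (each row sums to 1).
# A position with no goals at all yields NaN because of the 0/0 division.
shares = totals.div(totals.sum(axis=1), axis=0)

print(totals)
print(shares.round(2))
```

Looking at shares rather than raw totals is what supports a statement like "defenders love to head": in the toy data, headers are the largest share of defenders' goals even though forwards score more headers in absolute terms.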
 

In [None]:
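# Sum each goal type per club so that every bar shows the club total
# rather than the values of individual players.
club_goals = data.groupby('Club')[['Goals with right foot', 'Goals with left foot', 'Headed goals']].sum()

fig = go.Figure()

fig.add_trace(go.Bar(
    x=club_goals.index,
    y=club_goals['Goals with right foot'],
    name='Right foot goals',
    marker_color='indianred'
))
fig.add_trace(go.Bar(
    x=club_goals.index,
    y=club_goals['Goals with left foot'],
    name='Left foot goals',
    marker_color='lightsalmon'
))

fig.add_trace(go.Bar(
    x=club_goals.index,
    y=club_goals['Headed goals'],
    name='Headers',
    marker_color='lightseagreen'
))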

fig.update_layout(barmode='group',)# xaxis_tickangle=-45)
fig.update_layout(title_text="<b>Goal distribution by club</b>",
                  title_font={'size': 24, 'family': 'Courier'},
                  width=750,
                  height =500,
                  template='simple_white'
                 )

iplot(fig)

In [None]:
df = data

fig = px.sunburst(df, path=['Position', 'Club'], values='Headed goals')
fig.update_layout(title_text="<b>Headed goals (Club, position)</b>",
                  title_font= {'size': 24},
                  width=500, 
                  height=500)
iplot(fig)

fig1 = px.sunburst(df, path=['Position', 'Club'], values='Goals with left foot')
fig1.update_layout(title_text="<b>Left footed goals (Club, position)</b>", 
                   title_font= {'size': 24,},
                   width=500, 
                   height=500)
iplot(fig1)

fig2 = px.sunburst(df, path=['Position', 'Club'], values='Goals with right foot')
fig2.update_layout(title_text="<b>Right footed goals (Club, position)</b>",
                   title_font= {'size': 24,},
                   width=500, 
                   height=500)
iplot(fig2)

In [None]:
from itertools import repeat

# Total goals of each type per club (groupby returns the clubs in sorted order)
goals_groupedby_clubs = data.groupby(['Club']).agg({'Goals with left foot': 'sum',
                                                    'Goals with right foot': 'sum',
                                                    'Headed goals': 'sum'})

# Node labels: indices 0-2 are positions, 3-5 are goal types, 6-25 are clubs
all_nodes = ['Forwards', 'Midfielders', 'Defenders', 'Right foot goals', 'Left foot goals', 'Headed goals',
             'Arsenal', 'Aston-Villa', 'Brighton-and-Hove-Albion', 'Burnley',
             'Chelsea', 'Crystal-Palace', 'Everton', 'Fulham', 'Leeds-United',
             'Leicester-City', 'Liverpool', 'Manchester-City', 'Manchester-United',
             'Newcastle-United', 'Sheffield-United', 'Southampton',
             'Tottenham-Hotspur', 'West-Bromwich-Albion', 'West-Ham-United',
             'Wolverhampton-Wanderers']

# Source nodes: positions feed the goal types, goal types feed the clubs
list_1 = [0, 1, 2]
list_2 = [3, 4, 5]
source = 3*list_1 + 20*list_2

# Target nodes: each goal type (3-5) and each club (6-25), repeated three times
target = []
for tar in range(3, 26):
    target.extend(repeat(tar, 3))

# Per-club goal counts, in the same club order as all_nodes
R = goals_groupedby_clubs['Goals with right foot'].values
L = goals_groupedby_clubs['Goals with left foot'].values
H = goals_groupedby_clubs['Headed goals'].values

# Interleave as right, left, headed for each club to match list_2
goals_scored = []
for i in range(len(R)):
    goals_scored.append(R[i])
    goals_scored.append(L[i])
    goals_scored.append(H[i])

# Link values: position -> goal type first, then goal type -> club
value = [rfg_f, rfg_m, rfg_d, lfg_f, lfg_m, lfg_d, hg_f, hg_m, hg_d] + goals_scored

fig = go.Figure(
    data=[go.Sankey(
        node=dict(
            pad=2,
            thickness=75,
            line=dict(color="gray", width=0.75),
            label=all_nodes,
            # positions, then goal types, then clubs
            color=['#67AEE1', '#ff6e4a', '#48bf91'] + 3*['gold'] + 20*['#babad4']
        ),
        link=dict(
            source=source,
            target=target,
            value=value,
            # one colour per link: 23 groups of three links
            color=23*['#94C6EA', '#ffb6a4', '#a3dfc8'],
        ))])

fig.update_layout(title_text="<b>Distribution of goals:</b>"
                  '<br><span style="font-size:16px; color: darkgray">By player position, club and body part used to score</span>',
                  title_font={'size': 28, 'family': 'Courier'})
fig.update_traces(textfont_family='Courier', selector=dict(type='sankey'))
iplot(fig)
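
The index bookkeeping for the Sankey links is easy to get wrong, so here is a minimal sketch of the same layout on a reduced example with three positions, three goal types and two clubs. The helper name `build_links` and all the numbers are hypothetical. The sketch only illustrates that `source`, `target` and `value` must have the same length and line up element by element.

```python
from itertools import repeat


def build_links(n_positions, n_goal_types, n_clubs):
    """Return (source, target) index lists for a three-layer Sankey.

    Node order is assumed to be: positions, then goal types, then clubs.
    """
    positions = list(range(n_positions))
    goal_types = list(range(n_positions, n_positions + n_goal_types))
    first_club = n_positions + n_goal_types

    # Layer 1 -> 2: for each goal type, one link from every position
    source = n_goal_types * positions
    target = []
    for goal_type in goal_types:
        target.extend(repeat(goal_type, n_positions))

    # Layer 2 -> 3: for each club, one link from every goal type
    source += n_clubs * goal_types
    for club in range(first_club, first_club + n_clubs):
        target.extend(repeat(club, n_goal_types))

    return source, target


source, target = build_links(n_positions=3, n_goal_types=3, n_clubs=2)

# Invented values: 9 position -> goal-type links, then 3 links per club
value = [40, 12, 3, 30, 6, 1, 10, 2, 9] + [25, 20, 10] + [30, 17, 11]

assert len(source) == len(target) == len(value)
print(list(zip(source, target, value)))
```

With `n_clubs=20` the same helper yields 69 links and a highest node index of 25, which matches the 26 labels in `all_nodes`.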

<a href="#top">Back to top</a>  
### 4.2 Unwanted records <a class="anchor" id="4.2"></a>

Of course, as in everyday life, things sometimes happen in football whether players like it or not. In football terms, own goals, assisting the wrong player (i.e. an error leading to a goal) and being sent off are some of the major ones. Let's see who the unfortunate ones are.

**Summary**

- Surprisingly, Sergio Aguero (who's the top scorer) is also a big squanderer of chances.
- Joe Hart is error-prone. No wonder Pep Guardiola showed him the door when he took over at Manchester City.
- Phil Jagielka loves his own net, i.e. he has the most own goals.
- No wonder Mark Noble is the biggest loser, with so many red cards to his name.
- Wondered how many times Gabriel Jesus calls his own name (swears)? 0.56 times per game (that is his big chances missed per game).
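
A per-game figure like the one quoted for big chances missed is simply the overall count divided by games played. Below is a minimal sketch on invented numbers, assuming an `Appearances` column sits alongside the stat. The player names and values are placeholders, not rows from the dataset.

```python
import pandas as pd

# Placeholder rows: invented players and numbers
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Appearances": [100, 50, 200],
    "Big chances missed": [56, 10, 60],
})

# Per-game rate = overall count / games played
toy["Big chances missed (per game)"] = (
    toy["Big chances missed"] / toy["Appearances"]
)

# Rank by the per-game rate rather than by the raw count
top = toy.sort_values("Big chances missed (per game)", ascending=False).head(2)
print(top[["Name", "Big chances missed (per game)"]])
```

Here Player C has the most misses overall but Player A has the higher rate, which is why the overall and per-game rankings can differ.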


In [None]:
unwanted_records = ['Losses', 'Big chances missed', 'Own goals', 'Errors leading to goal', 'Red cards']
top = 5  # number of players shown in each subplot

fig = make_subplots(
    rows=5, cols=2,
    subplot_titles=('Losses (overall)', 'Losses (per-game)', 'Big chances missed (overall)', 'Big chances missed (per-game)',
                    'Own goals (overall)', 'Own goals (per-game)', 'Errors leading to goal (overall)',
                    'Errors leading to goal (per-game)', 'Red cards (overall)', 'Red cards (per-game)'))

# Left column: top players for each stat in data_38app
for i, atr in enumerate(unwanted_records):
    df = data_38app.sort_values(atr, ascending=False).head(top)
    # Values are negated so the bars point downwards
    fig.add_trace(go.Bar(x=df['Name'],
                         y=-df[atr],
                         orientation='v'),
                  row=i+1, col=1)

# Right column: top players for each stat in all_players
for j, atr in enumerate(unwanted_records):
    df = all_players.sort_values(atr, ascending=False).head(top)
    fig.add_trace(go.Bar(x=df['Name'],
                         y=-df[atr],
                         orientation='v'),
                  row=j+1, col=2)

# Apply the same marker styling to every trace
fig.update_traces(marker_color=['rgb(255, 0, 0)'],
                  marker_line_width=2.5, opacity=0.6)

fig.update_layout(title_text='<b>List of the unfortunates</b>',
                  title_font={'size': 28, 'family': 'Courier'},
                  showlegend=False,
                  autosize=False,
                  width=1200, height=1100,
                  paper_bgcolor='lightgray',
                  plot_bgcolor='lightgray',
                 )
iplot(fig)


## 5. Closing Remarks  <a class="anchor" id="5"></a>

- This exploration was a simple EDA and one step on my learning curve in data visualization and analysis. With football as the subject, I enjoyed the process.
- While doing this EDA I also learned an excellent data visualization library, plotly. It is interactive and customizable, with several options to choose from.
- However, I was slightly disappointed that the dataset was not updated to include games up to the end of the 2020/21 season. Some of those stats would certainly have changed parts of the conclusions. For example, Harry Kane ended the season as top goal scorer with 23 goals and would have claimed the top spot in one or more of the columns in the "Top 10 goal scorers" table.




## 6. References <a class="anchor" id="6"></a>

1. [The Premier League](https://en.wikipedia.org/wiki/Premier_League)
2. [The 10 most expensive goalkeepers](https://www.espn.com/soccer/soccer-transfers/story/3135816/the-10-most-expensive-goalkeepers-kepa-alisson-becker-courtois-ederson)
3. [10 most expensive defenders of all time](https://www.kickoff.com/news/articles/world-news/categories/news/english-premier-league/the-10-most-expensive-defenders-of-all-time/681803?gallery=681803&gallery-page=11#ig)
4. [Plotly](https://plotly.com/)

<div>    
<img src="https://www.collinsdictionary.com/images/full/goldengoal_110918564.jpg" width="250"/>   
</div>


###  <span align="center" style= 'background:skyblue'> Thank you very much for reading this notebook! </span> 

  

<a href="#top">Back to top</a>