# Player scouting

We often construct narratives about a player's performance from watching them play week in, week out. This tends to produce varying (*and sometimes unfair*) evaluations of how well a player actually performs relative to others in the same position or role.

This notebook takes a stab at [soccermatics'](https://soccermatics.readthedocs.io/en/latest/lesson3/ScoutingPlayers.html) ideas for painting a concise and complete picture of a player's performance, using data to do the following:

- Count each player's key actions
- Score their attributes
- Compare players by percentiles
- Give context by adjusting for possession, team attributes, etc.
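
As a rough illustration of the counting and percentile steps, the sketch below uses made-up season totals (the players, minutes and counts are hypothetical, not taken from the Wyscout data). It converts a raw count into a per-90 rate and then ranks each player within the group using pandas' `rank(pct=True)`.

```python
import pandas as pd

# Hypothetical season totals for five players (illustrative numbers only)
toy = pd.DataFrame({
    "player": ["A", "B", "C", "D", "E"],
    "minutes": [2700, 1800, 3000, 900, 2250],
    "key_passes": [60, 50, 45, 30, 25],
})

# Turn the raw count into a per-90 rate so playing time does not dominate
toy["key_passes_p90"] = toy["key_passes"] / toy["minutes"] * 90

# Percentile rank within the comparison group (1.0 = highest rate in the group)
toy["key_passes_pct"] = toy["key_passes_p90"].rank(pct=True)

print(toy.sort_values("key_passes_pct", ascending=False))
```

Player A has the highest raw count but only the third-highest rate, which is the kind of reordering that per-90 scaling is meant to surface.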

Examples of each metric type are listed below:

| metric_type | example |
| --- | --- |
| Outcome | Goals, Assists, xG, etc. |
| Behavioral/Actions | Forward passes, duels, interceptions, etc. |
| Context | Passes received, team possession, position/role played, etc. |

****

## Set up

Fetch the open-source `wyscout` data for the 2017-18 season, listed [here](https://github.com/koenvo/wyscout-soccer-match-event-dataset/tree/main?tab=readme-ov-file#sources).

In [1]:
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests, zipfile, io

# For reproducibility purposes
np.random.seed(14)

In [2]:
# Convenience functions to read files directly from open-source data
def check_filetype_from_url(url):
    """Return the subtype of the URL's Content-Type header, e.g. 'json' or 'zip'."""
    h = requests.head(url)
    return h.headers['Content-Type'].split('/')[-1]


def check_filenames(url):
    """Return the file name (json) or the list of archive members (zip) behind a URL."""
    ftype = check_filetype_from_url(url)
    if ftype == 'json':
        h = requests.head(url)
        return h.headers['Content-Disposition'].split('filename=')[-1]
    elif ftype == 'zip':
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        return z.namelist()
    # Any other content type is unsupported
    return None


def read_url_to_pandas(url, filenames=None):
    """Read a json URL, or the named json members of a zip URL, into pandas."""
    # Avoid a mutable default argument
    filenames = filenames or []
    ftype = check_filetype_from_url(url)

    if ftype == 'json':
        return pd.read_json(url)
    elif ftype == 'zip':
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        if len(filenames) == 1:
            return pd.read_json(z.open(filenames[0]))
        elif len(filenames) > 1:
            return [pd.read_json(z.open(filename)) for filename in filenames]
    # Unsupported content type, or a zip with no member names given
    return None
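
The zip branch of these helpers can be tried without any network access. The sketch below is only an offline illustration: it builds a small archive in memory (the file names and records are made up) and reads each member with `zipfile` and `pd.read_json`, the same two calls the helpers rely on.

```python
import io
import json
import zipfile

import pandas as pd

# Build a small zip archive in memory (hypothetical file names and records)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "teams_demo.json",
        json.dumps([{"wyId": 1, "name": "Team A"}, {"wyId": 2, "name": "Team B"}]),
    )
    zf.writestr(
        "players_demo.json",
        json.dumps([{"wyId": 10, "shortName": "P. One"}]),
    )

# Re-open the raw bytes, as one would with the content of a downloaded response
archive = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))

# Read every member of the archive into its own DataFrame
frames = {name: pd.read_json(archive.open(name)) for name in archive.namelist()}

for name, frame in frames.items():
    print(name, frame.shape)
```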

In [3]:
# Read and store all available filenames into a list
URLs = {
    'players': 'https://ndownloader.figshare.com/files/15073721',
    'teams': 'https://ndownloader.figshare.com/files/15073697',
    'matches': 'https://ndownloader.figshare.com/files/14464622',
    'events': 'https://ndownloader.figshare.com/files/14464685',
}
filenames = {}
for category, url in URLs.items():
    fnames = check_filenames(url)
    if isinstance(fnames, str):
        filenames[category] = [fnames]
    elif isinstance(fnames, list):
        filenames[category] = fnames

filenames

{'players': ['players.json'],
 'teams': ['teams.json'],
 'matches': ['matches_England.json',
  'matches_European_Championship.json',
  'matches_France.json',
  'matches_Germany.json',
  'matches_Italy.json',
  'matches_Spain.json',
  'matches_World_Cup.json'],
 'events': ['events_England.json',
  'events_European_Championship.json',
  'events_France.json',
  'events_Germany.json',
  'events_Italy.json',
  'events_Spain.json',
  'events_World_Cup.json']}

In [4]:
# Read players, teams and England's events and matches
players = read_url_to_pandas(URLs['players'], filenames['players'])
teams = read_url_to_pandas(URLs['teams'], filenames['teams'])
eng_matches = read_url_to_pandas(URLs['matches'], [fname for fname in filenames['matches'] if 'England' in fname])
eng_events = read_url_to_pandas(URLs['events'], [fname for fname in filenames['events'] if 'England' in fname])

display(players.head(3))
display(teams.head(3))
display(eng_matches.head(3))
display(eng_events.head(3))

Unnamed: 0,passportArea,weight,firstName,middleName,lastName,currentTeamId,birthDate,height,role,birthArea,wyId,foot,shortName,currentNationalTeamId
0,"{'name': 'Turkey', 'id': '792', 'alpha3code': ...",78,Harun,,Tekin,4502,1989-06-17,187,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'Turkey', 'id': '792', 'alpha3code': ...",32777,right,H. Tekin,4687.0
1,"{'name': 'Senegal', 'id': '686', 'alpha3code':...",73,Malang,,Sarr,3775,1999-01-23,182,"{'code2': 'DF', 'code3': 'DEF', 'name': 'Defen...","{'name': 'France', 'id': '250', 'alpha3code': ...",393228,left,M. Sarr,4423.0
2,"{'name': 'France', 'id': '250', 'alpha3code': ...",72,Over,,Mandanda,3772,1998-10-26,176,"{'code2': 'GK', 'code3': 'GKP', 'name': 'Goalk...","{'name': 'France', 'id': '250', 'alpha3code': ...",393230,,O. Mandanda,


Unnamed: 0,city,name,wyId,officialName,area,type
0,Newcastle upon Tyne,Newcastle United,1613,Newcastle United FC,"{'name': 'England', 'id': '0', 'alpha3code': '...",club
1,Vigo,Celta de Vigo,692,Real Club Celta de Vigo,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club
2,Barcelona,Espanyol,691,Reial Club Deportiu Espanyol,"{'name': 'Spain', 'id': '724', 'alpha3code': '...",club


Unnamed: 0,status,roundId,gameweek,teamsData,seasonId,dateutc,winner,venue,wyId,label,date,referees,duration,competitionId
0,Played,4405654,38,"{'1646': {'scoreET': 0, 'coachId': 8880, 'side...",181150,2018-05-13 14:00:00,1659,Turf Moor,2500089,"Burnley - AFC Bournemouth, 1 - 2","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 385705, 'role': 'referee'}, {'r...",Regular,364
1,Played,4405654,38,"{'1628': {'scoreET': 0, 'coachId': 8357, 'side...",181150,2018-05-13 14:00:00,1628,Selhurst Park,2500090,"Crystal Palace - West Bromwich Albion, 2 - 0","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 381851, 'role': 'referee'}, {'r...",Regular,364
2,Played,4405654,38,"{'1609': {'scoreET': 0, 'coachId': 7845, 'side...",181150,2018-05-13 14:00:00,1609,The John Smith's Stadium,2500091,"Huddersfield Town - Arsenal, 0 - 1","May 13, 2018 at 4:00:00 PM GMT+2","[{'refereeId': 384965, 'role': 'referee'}, {'r...",Regular,364


Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id
0,8,Simple pass,[{'id': 1801}],25413,"[{'y': 49, 'x': 49}, {'y': 78, 'x': 31}]",2499719,Pass,1609,1H,2.758649,85,177959171
1,8,High pass,[{'id': 1801}],370224,"[{'y': 78, 'x': 31}, {'y': 75, 'x': 51}]",2499719,Pass,1609,1H,4.94685,83,177959172
2,8,Head pass,[{'id': 1801}],3319,"[{'y': 75, 'x': 51}, {'y': 71, 'x': 35}]",2499719,Pass,1609,1H,6.542188,82,177959173
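
The `tags` and `positions` columns hold lists of dictionaries, which are awkward to filter on directly. The sketch below shows one possible way to unpack them on three toy events shaped like the rows above. It assumes that tag `1801` marks an accurate event and `1802` an inaccurate one, which is our reading of the Wyscout tag glossary; the column names `tag_ids`, `x_start`, `y_start`, `x_end`, `y_end` and `accurate` are our own.

```python
import pandas as pd

# Three toy events in the same shape as the event rows displayed above
toy_events = pd.DataFrame({
    "playerId": [25413, 370224, 3319],
    "eventName": ["Pass", "Pass", "Pass"],
    "tags": [[{"id": 1801}], [{"id": 1801}], [{"id": 1802}]],
    "positions": [
        [{"y": 49, "x": 49}, {"y": 78, "x": 31}],
        [{"y": 78, "x": 31}, {"y": 75, "x": 51}],
        [{"y": 75, "x": 51}, {"y": 71, "x": 35}],
    ],
})

flat = toy_events.assign(
    # Keep only the numeric tag ids
    tag_ids=toy_events["tags"].apply(lambda tags: [t["id"] for t in tags]),
    # First entry is where the event starts, last entry is where it ends
    x_start=toy_events["positions"].apply(lambda p: p[0]["x"]),
    y_start=toy_events["positions"].apply(lambda p: p[0]["y"]),
    x_end=toy_events["positions"].apply(lambda p: p[-1]["x"]),
    y_end=toy_events["positions"].apply(lambda p: p[-1]["y"]),
)

# Assumption: tag 1801 means the event was accurate
flat["accurate"] = flat["tag_ids"].apply(lambda ids: 1801 in ids)

print(flat.drop(columns=["tags", "positions"]))
```

Indexing with `p[-1]` rather than `p[1]` keeps the code from failing on events that carry a single position.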


## Playmakers: Who's elite?

Here, we look at some of the most prominent Premier League midfielders of the 2017/18 season, restricted to those who started at least 30 games. We score each player on outcome and action metrics, give those scores meaning by normalizing them with context metrics such as touches taken and possession minutes, and then compare the players in percentiles against the group's statistics.

Here are a few interesting metrics to take a look at:

| metric | metric_type | context |
| ------ | ----------- | ------- |
| Minutes Played | Outcome | - |
| Goals | Outcome | - |
| Goals | Outcome | Per 90 possession minutes |
| Assists | Outcome | - |
| Assists | Outcome | Per 90 possession minutes |
| Pass Completion % | Outcome | - |
| Passes Made | Actions | Per 90 possession minutes |
| Forward Passes % | Actions | - |
| Successful Forward Passes % | Outcome | - |
| Line-Breaking Passes | Actions | Per touch |
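
To make the "Per 90 possession minutes" context concrete, here is a minimal sketch on made-up numbers. It assumes that possession minutes can be approximated as minutes played multiplied by the team's share of possession; that definition, the column names and the values are all assumptions for illustration, not something derived from the data yet.

```python
import pandas as pd

# Two hypothetical players with identical raw output but different team styles
toy = pd.DataFrame({
    "player": ["A", "B"],
    "goals": [10, 10],
    "minutes_played": [2700, 2700],
    "team_possession_share": [0.60, 0.40],  # assumed share of time the team has the ball
})

# Assumed definition: minutes the player's team spent in possession while he played
toy["possession_minutes"] = toy["minutes_played"] * toy["team_possession_share"]

# Plain per-90 rate versus the possession-adjusted rate
toy["goals_p90"] = toy["goals"] / toy["minutes_played"] * 90
toy["goals_p90_possession"] = toy["goals"] / toy["possession_minutes"] * 90

print(toy[["player", "goals_p90", "goals_p90_possession"]])
```

Both players have the same plain per-90 rate, but player B scores more per minute of team possession because his team sees less of the ball.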

In [5]:
# Let's take a look at midfielders who started at least 30 games in the Premier League
# Get prem teams
prem_teams = (
    teams
    .merge(
        pd.DataFrame(teams.area.tolist()).rename({'name': 'country_name'}, axis = 1),
        left_index = True,
        right_index = True,
    )
    .loc[lambda x: x.country_name == 'England']
    .wyId.values.tolist()
)

# Get all players with midfielder roles (includes wingers)
prem_mids = (
    players
    .loc[
        lambda x: 
        (x.role.apply(lambda y: y['code3'] == 'MID')) & 
        (x.currentTeamId.isin(prem_teams))
    ]
    .copy()
)

# Get players who have started at least 30 games in 17/18 season
min_starts = 30
started_players = [
    player['playerId']
    for match in eng_matches.teamsData.values.tolist()
    for team_stats in match.values()
    for player in team_stats['formation']['lineup']
]
target_players = (
    pd
    .DataFrame(started_players, columns = ['playerId'])
    .groupby('playerId')
    .size()
    .reset_index(name = 'started')
    .loc[lambda x: x.started >= min_starts]
    .playerId
    .values
    .tolist()
)
target_players_df = (
    prem_mids
    .loc[lambda x: x.wyId.isin(target_players), ['wyId', 'shortName', 'currentTeamId', 'weight', 'height', 'birthDate', 'foot']]
    .assign(shortName = lambda x: x.shortName.str.encode('ascii', errors = 'ignore').str.decode('unicode_escape'))
    .reset_index(drop = True)
)
target_players_df

Unnamed: 0,wyId,shortName,currentTeamId,weight,height,birthDate,foot
0,54,C. Eriksen,1624,76,180,1992-02-14,right
1,93,J. Guðmunds­son,1646,77,186,1990-10-27,left
2,265366,W. Ndidi,1631,80,187,1996-12-16,right
3,70122,N. Matić,1611,84,194,1988-08-01,left
4,37725,C. Kouyaté,1628,78,193,1989-12-21,right
5,38021,K. De Bruyne,1625,68,181,1991-06-28,right
6,105339,Fernandinho,1625,67,179,1985-05-04,right
7,127537,L. Milivojević,1628,80,186,1991-04-07,right
8,210044,E. Dier,1624,90,188,1994-01-15,right
9,13484,D. Alli,1624,80,188,1996-04-11,right


### Scoring Metrics

In [6]:
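# Let's calculate some midfielder-critical statistics over the season
# TODO: To be continued...
# Note: target_players holds every player with at least 30 starts, not only
# the midfielders in target_players_df, so this filter is broader than the table above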
(
    eng_events
    .loc[lambda x: x.playerId.isin(target_players)]
)

Unnamed: 0,eventId,subEventName,tags,playerId,positions,matchId,eventName,teamId,matchPeriod,eventSec,subEventId,id
4,8,Simple pass,[{'id': 1801}],167145,"[{'y': 95, 'x': 41}, {'y': 88, 'x': 72}]",2499719,Pass,1609,1H,10.302366,85,177959175
6,8,Head pass,[{'id': 1801}],8653,"[{'y': 25, 'x': 23}, {'y': 15, 'x': 39}]",2499719,Pass,1631,1H,13.961228,82,177959186
7,1,Air duel,"[{'id': 701}, {'id': 1802}]",8013,"[{'y': 15, 'x': 39}, {'y': 20, 'x': 33}]",2499719,Duel,1631,1H,14.765321,10,177959189
9,8,Head pass,"[{'id': 1401}, {'id': 1801}]",167145,"[{'y': 80, 'x': 67}, {'y': 61, 'x': 59}]",2499719,Pass,1609,1H,15.320341,82,177959178
10,8,Head pass,[{'id': 1801}],49876,"[{'y': 61, 'x': 59}, {'y': 45, 'x': 45}]",2499719,Pass,1609,1H,18.051875,82,177959179
...,...,...,...,...,...,...,...,...,...,...,...,...
643128,8,Simple pass,[{'id': 1801}],21100,"[{'y': 40, 'x': 55}, {'y': 51, 'x': 55}]",2500098,Pass,1633,2H,2748.321677,85,251596211
643129,8,Simple pass,[{'id': 1801}],37725,"[{'y': 51, 'x': 55}, {'y': 67, 'x': 60}]",2500098,Pass,1633,2H,2750.561883,85,251596212
643131,8,Simple pass,[{'id': 1801}],8313,"[{'y': 77, 'x': 66}, {'y': 80, 'x': 60}]",2500098,Pass,1633,2H,2755.809381,85,251596215
643133,8,Simple pass,[{'id': 1801}],37725,"[{'y': 56, 'x': 57}, {'y': 63, 'x': 65}]",2500098,Pass,1633,2H,2758.973892,85,251596217
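
One way this cell could continue is sketched below on toy pass data. It assumes the `tags` and `positions` columns have already been unpacked into an `accurate` flag and start/end `x` coordinates (hypothetical column names), and that a pass counts as forward when its end `x` is larger than its start `x`. Both are our assumptions rather than definitions taken from the source.

```python
import pandas as pd

# Toy passes for two hypothetical players, already unpacked into flat columns
toy_passes = pd.DataFrame({
    "playerId": [1, 1, 1, 1, 2, 2],
    "accurate": [True, True, False, True, True, False],
    "x_start": [40, 50, 60, 55, 30, 45],
    "x_end": [55, 45, 80, 70, 20, 60],
})

# Assumption: a pass is "forward" when it ends further up the pitch than it starts
toy_passes["forward"] = toy_passes["x_end"] > toy_passes["x_start"]
toy_passes["forward_accurate"] = toy_passes["forward"] & toy_passes["accurate"]

# Count passes per player, then turn the counts into percentages
summary = (
    toy_passes
    .groupby("playerId")
    .agg(
        passes=("accurate", "size"),
        completed=("accurate", "sum"),
        forward=("forward", "sum"),
        forward_completed=("forward_accurate", "sum"),
    )
    .assign(
        pass_completion_pct=lambda d: 100 * d["completed"] / d["passes"],
        forward_pass_pct=lambda d: 100 * d["forward"] / d["passes"],
        forward_success_pct=lambda d: 100 * d["forward_completed"] / d["forward"],
    )
)

print(summary)
```

A player with no forward passes would get a missing value for `forward_success_pct`, so a real implementation would need to decide how to treat that case.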
