# xG 

With too little data to train a robust xG model, a learned model's predictions are poor and close to random.

Instead, we will use conditional probability to determine a generalized xG model:

P(Goal | #opponents) = P(Goal, #opponents) / P(#opponents)

In this notebook we count shots and goals for each pitch zone and opponent-count range. These counts give the intersection, P(Goal, #opponents), and from it the conditional probability above.
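Because the joint and the marginal are both estimated from the same set of shots, the total number of shots cancels and the estimate reduces to a ratio of counts. A short derivation, where N (all shots), N_k (shots with k opponents in the zone) and N_{goal,k} (goals among those shots) are notation introduced here for illustration:

```latex
P(\text{Goal} \mid k)
  = \frac{P(\text{Goal},\, k)}{P(k)}
  \approx \frac{N_{\text{goal},k} / N}{N_k / N}
  = \frac{N_{\text{goal},k}}{N_k}
```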

To keep the model simple, the number of opponents (#opponents) in the shooter's zone is the only feature.

Ideally, xG would be determined using an ML model, but since the purpose of this project was to explore crosses, I did not have the time to devote to building one.

In [58]:
import pandas as pd
import numpy as np
from pickle import dump

from statsbombpy import sb

from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

## Load data

In [2]:
# collect all shot events 
# 'shot_freeze_frame' is kept because the opponent count below reads it
cols = ['teammate', 'actor', 'location_x', 'id',
        'player', 'position', 'team', 'type',
        'under_pressure', 'shot_outcome', 'shot_freeze_frame']

shot_df = pd.DataFrame(columns=cols)

matches = sb.matches(competition_id=55, season_id=43)
match_ids = matches[matches['match_status_360'] == 'available'].match_id.unique()

for match_id in tqdm(match_ids):
    
    events = sb.events(match_id=match_id)
    match_frames = sb.frames(match_id=match_id, fmt='dataframe')

    # join the events to the frames 
    df = pd.merge(match_frames, events, left_on='id', right_on='id', how='left')
    
    # take relevant columns 
    df = df[cols]
    
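    # subset for shots only, keeping the frame row of the shooter (the actor)
    df = df[df['type'] == 'Shot']
    df = df[df['actor'] == True]
    
    # append to overall df (DataFrame.append was removed in pandas 2.0)
    shot_df = pd.concat([shot_df, df])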

# 'location_x' is the frames-side 'location' column (the merge adds the _x suffix); split it into x and y
shot_df[['loc_x', 'loc_y']] = shot_df['location_x'].apply(pd.Series)

len(shot_df.id.unique())

100%|██████████| 51/51 [03:28<00:00,  4.09s/it]


1221

In [14]:
shot_df.head()

Unnamed: 0,teammate,actor,location_x,id,player,position,team,type,timestamp,under_pressure,goalkeeper_position,shot_outcome,shot_statsbomb_xg,shot_freeze_frame,loc_x,loc_y
768,True,True,"[101.76876, 47.53321]",91f2f8aa-4ee8-4593-97a6-1a7862a7cca5,Magomed Ozdoev,Right Defensive Midfield,Russia,Shot,00:00:29.166,,,Blocked,0.036601,"[{'location': [64.3, 46.2], 'player': {'id': 3...",101.76876,47.53321
7288,True,True,"[110.53585, 48.80568]",0bb2daca-ab0b-497c-ac59-3cc559be6fb7,Magomed Ozdoev,Right Defensive Midfield,Russia,Shot,00:09:42.014,,,Off T,0.234583,"[{'location': [113.0, 43.6], 'player': {'id': ...",110.53585,48.80568
14372,True,True,"[107.50664, 44.759357]",1215b0dc-81fb-4416-b9a6-0faf150a4e8d,Joel Pohjanpalo,Left Center Forward,Finland,Shot,00:19:53.988,,,Blocked,0.280135,"[{'location': [105.9, 32.3], 'player': {'id': ...",107.50664,44.759357
14505,True,True,"[111.61468, 29.156918]",1ce72d1b-a7c5-4eb3-965f-61621b798f35,Joona Toivio,Right Center Back,Finland,Shot,00:20:38.796,True,,Off T,0.014893,"[{'location': [112.7, 42.0], 'player': {'id': ...",111.61468,29.156918
18075,True,True,"[115.15034, 42.60374]",7aa54ce6-106a-4e26-9fb1-77aa0d76f793,Artem Dzyuba,Center Forward,Russia,Shot,00:29:01.200,True,,Blocked,0.133869,"[{'location': [107.5, 32.1], 'player': {'id': ...",115.15034,42.60374


## Historical shot/goal probabilities 

In [4]:
# column index of each pitch zone in the count and probability matrices
mapping = {'RF': 0,
           'LF': 1,
           'RB': 2,
           'MB': 3,
           'LB': 4,
           'TOB': 5}

In [5]:
def section(x, y):
    """Return the zone label for a point on the 120 x 80 StatsBomb pitch.

    The two halves are rotated copies of each other, so each label
    covers one region in each half.
    """
    if (x <= 60 and y <= 18) or (x >= 60 and y >= 62):
        return 'RF'
    elif (x <= 60 and y >= 62) or (x >= 60 and y <= 18):
        return 'LF'
    elif (x <= 18 and 18 <= y <= 30) or (x >= 102 and 50 <= y <= 62):
        return 'RB'
    elif (x <= 18 and 30 <= y <= 50) or (x >= 102 and 30 <= y <= 50):
        return 'MB'
    elif (x <= 18 and 50 <= y <= 62) or (x >= 102 and 18 <= y <= 30):
        return 'LB'
    elif 18 <= x <= 102 and 18 <= y <= 62:
        return 'TOB'
    # coordinates outside the pitch fall through and return None

### Defining number of opponents

We will group the number of opponents in a zone into three ranges:

- 0-1 opponents
    - 0-1 opponents means that the zone is either clear for the receiver, an equal match (1v1), or the opponents are outnumbered

- 2-3 opponents

- more than 3 opponents
    - more than 3 opponents is an overcrowded zone
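The three ranges above correspond to three row indices. A minimal sketch of that mapping, where `opponent_bucket` and `BUCKET_LABELS` are hypothetical names introduced here for illustration and are not part of the notebook:

```python
# Hypothetical labels for the three opponent-count ranges
BUCKET_LABELS = ['0-1', '2-3', '>3']


def opponent_bucket(opp_count):
    """Return the row index (0, 1 or 2) for a number of opponents in the zone."""
    if opp_count <= 1:
        return 0  # clear zone, 1v1, or opponents outnumbered
    if opp_count <= 3:
        return 1  # 2 or 3 opponents
    return 2      # more than 3 opponents: overcrowded zone


# Show which range each count from 0 to 5 falls into
for n in range(6):
    print(n, BUCKET_LABELS[opponent_bucket(n)])
```

Note that a count of exactly 3 belongs to the middle range, so the last range starts at 4.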


In [9]:
# matrix structure (rows: opponents in zone, columns: pitch zone)
#         RF  LF  RB  MB  LB  TOB
# 0-1
# 2-3
# >3

In [40]:
# collect count of shots and goals
shots = [[0]*6, [0]*6, [0]*6]
goals = [[0]*6, [0]*6, [0]*6]

for _, row in shot_df.iterrows():  # index is unused; avoid shadowing the built-in id()
    # shot section 
    shot_section = section(row.loc_x, row.loc_y)
    
    opp_count = 0
    
    # count the opponents standing in the same zone as the shot;
    # a missing freeze frame is NaN (a float), so it is skipped
    if not isinstance(row.shot_freeze_frame, float):
        for player in row.shot_freeze_frame:
            player_x = player['location'][0]
            player_y = player['location'][1]

            if not player['teammate']:
                if section(player_x, player_y) == shot_section:
                    opp_count += 1
        
    j = mapping[shot_section]
    if opp_count <= 1:
        i = 0  # 0 or 1 opps in zone
    elif opp_count <= 3:
        i = 1  # 2 or 3 opps in zone
    else:
        i = 2  # more than 3 opps in zone

    if row.shot_outcome == 'Goal':
        goals[i][j] += 1

    shots[i][j] += 1

In [49]:
goals

[[0, 0, 4, 9, 4, 3], [0, 0, 2, 10, 3, 3], [0, 0, 0, 80, 0, 12]]

In [50]:
shots

[[3, 4, 61, 18, 66, 66], [0, 1, 42, 43, 67, 130], [0, 0, 8, 465, 9, 238]]

In [54]:
# calculate probs: goals / shots per cell; cells with no shots stay 0
p = [[0]*6, [0]*6, [0]*6]

for i in range(len(shots)):
    for j in range(len(shots[0])):
        if shots[i][j] > 0:
            p[i][j] = round(goals[i][j] / shots[i][j], 4)
        else:
            p[i][j] = 0

p

[[0.0, 0.0, 0.0656, 0.5, 0.0606, 0.0455],
 [0, 0.0, 0.0476, 0.2326, 0.0448, 0.0231],
 [0, 0, 0.0, 0.172, 0.0, 0.0504]]

### Applying conditional probability

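Example: probability of a goal given 0 or 1 opponents in the shooting zone (Middle Box)

P(G | 0-1) = P(G, 0-1) / P(0-1)

Note that each entry p[i][j] = goals[i][j] / shots[i][j] is already an empirical estimate of P(G | opponent range i, zone j), so p[0][3] = 0.5 is the conditional estimate for the Middle Box with 0-1 opponents. The cell below divides that rate by the sum of the rates in the 0-1 row, so its result is the Middle Box's share of that row's summed rates rather than a goal probability.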

In [57]:
p[0][3] / sum(p[0])

0.7443799315170463
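As a cross-check, the formula can also be evaluated directly from the `goals` and `shots` counts printed earlier. The sketch below is illustrative only: `conditional_goal_prob` is a hypothetical helper, and the joint and marginal are estimated as simple relative frequencies over all shots.

```python
# Counts copied from the `goals` and `shots` outputs above
# rows: 0-1, 2-3, >3 opponents; columns: RF, LF, RB, MB, LB, TOB
goals = [[0, 0, 4, 9, 4, 3], [0, 0, 2, 10, 3, 3], [0, 0, 0, 80, 0, 12]]
shots = [[3, 4, 61, 18, 66, 66], [0, 1, 42, 43, 67, 130], [0, 0, 8, 465, 9, 238]]


def conditional_goal_prob(goals, shots, i, j):
    """Hypothetical helper: P(Goal | range i, zone j) as joint / marginal."""
    total = sum(sum(row) for row in shots)  # all shots in the data set
    if shots[i][j] == 0:
        return 0.0                          # no shots observed in this cell
    p_joint = goals[i][j] / total           # P(Goal, range i, zone j)
    p_marginal = shots[i][j] / total        # P(range i, zone j)
    return p_joint / p_marginal


# Middle Box (column 3) with 0-1 opponents (row 0)
mb_clear = conditional_goal_prob(goals, shots, 0, 3)
print(round(mb_clear, 4))  # 0.5

# Pooled over all zones: P(Goal | 0-1 opponents)
pooled_clear = sum(goals[0]) / sum(shots[0])
print(round(pooled_clear, 4))  # 0.0917
```

The total shot count cancels in the ratio, which is why the per-cell result equals goals / shots for that cell.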

## Save probabilities 

In [59]:
# use a context manager so the file is closed after writing
with open('models/xG_historical_probs.pkl', 'wb') as f:
    dump(p, f)
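A minimal sketch of how the saved matrix might be read back and queried. It writes to a temporary directory rather than `models/` so that it runs on its own, the `p` values are copied from the output above, and `lookup_xg` is a hypothetical helper that is not part of the notebook.

```python
import os
import pickle
import tempfile

# Probabilities and zone mapping copied from the cells above
p = [[0.0, 0.0, 0.0656, 0.5, 0.0606, 0.0455],
     [0, 0.0, 0.0476, 0.2326, 0.0448, 0.0231],
     [0, 0, 0.0, 0.172, 0.0, 0.0504]]
mapping = {'RF': 0, 'LF': 1, 'RB': 2, 'MB': 3, 'LB': 4, 'TOB': 5}


def lookup_xg(probs, zone, opp_count):
    """Hypothetical helper: historical goal rate for a zone and opponent count."""
    if opp_count <= 1:
        row = 0
    elif opp_count <= 3:
        row = 1
    else:
        row = 2
    return probs[row][mapping[zone]]


# Round-trip through pickle using a temporary file (assumed path, for illustration)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xG_historical_probs.pkl')
    with open(path, 'wb') as f:
        pickle.dump(p, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(lookup_xg(loaded, 'MB', 0))   # 0.5
print(lookup_xg(loaded, 'TOB', 5))  # 0.0504
```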