# xG 

With too little data to train a robust xG model, a learned model's predictions were poor and essentially random.

Instead, we will use conditional probability to determine a generalized xG model:

P(Goal | #opponents) = P(Goal, #opponents) / P(#opponents)

In this notebook we calculate the joint probability P(Goal, #opponents) and use it to compute the conditional probability above.

To keep the model simple, #opponents is the only feature we use.
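
As a short worked step, assume each probability is estimated as a relative frequency over all $T$ shots in the data, with $N$ denoting one opponent bucket. The symbols $T$, $\text{goals}_N$ and $\text{shots}_N$ are notation introduced here for illustration. Under that assumption the total cancels and the conditional probability reduces to a ratio of counts:

$$
P(\text{Goal} \mid N) = \frac{P(\text{Goal}, N)}{P(N)} = \frac{\text{goals}_N / T}{\text{shots}_N / T} = \frac{\text{goals}_N}{\text{shots}_N}
$$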

Ideally, xG would be determined using an ML model, but since the purpose of this project was to explore crosses, I did not have time to prioritize a robust xG model.

In [1]:
import pandas as pd
import numpy as np
from pickle import dump

from statsbombpy import sb

from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

## Load data

In [9]:
# collect all shot events 
cols = ['teammate', 'actor', 'location_x', 'id',
           'player', 'position', 'team', 'type', 
           'under_pressure', 'shot_outcome', 'shot_freeze_frame']

shot_df = pd.DataFrame(columns=cols)

matches = sb.matches(competition_id=53, season_id=106)
match_ids = matches[matches['match_status_360'] == 'available'].match_id.unique()

for match_id in tqdm(match_ids):
    try:
        events = sb.events(match_id=match_id)
        match_frames = sb.frames(match_id=match_id, fmt='dataframe')

        # join the events to the frames 
        df = pd.merge(match_frames, events, left_on='id', right_on='id', how='left')

        # take relevant columns 
        df = df[cols]

        # subset for shots only 
        df = df[df['type'] == 'Shot']
        df = df[df['actor'] == True]

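        # append to overall df (DataFrame.append was removed in pandas 2.0)
        shot_df = pd.concat([shot_df, df])
    except Exception:
        # skip matches whose events or 360 frames fail to load or merge
        continue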

shot_df[['loc_x', 'loc_y']] = shot_df['location_x'].apply(pd.Series) # separate location into x and y 

len(shot_df.id.unique())

100%|██████████| 31/31 [00:41<00:00,  1.33s/it]


648

In [10]:
shot_df.head()

| | teammate | actor | location_x | id | player | position | team | type | under_pressure | shot_outcome | shot_freeze_frame | loc_x | loc_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 246 | True | True | [114.193245, 52.8877] | 26fe0e8e-8f01-41ab-9450-342e49c4a349 | Lauren Wade | Right Wing | Northern Ireland | Shot | | Saved | [{'location': [98.7, 51.8], 'player': {'id': 4... | 114.193245 | 52.8877 |
| 859 | True | True | [114.50086, 38.94683] | ff9cf131-e8cf-4f65-8db7-c42d012fbde8 | Bethany Mead | Right Wing | England Women's | Shot | True | Wayward | [{'location': [119.1, 38.9], 'player': {'id': ... | 114.50086 | 38.94683 |
| 1261 | True | True | [91.19042, 41.148174] | 0257cc00-caca-48e1-bf22-add700284fea | Georgia Stanway | Right Defensive Midfield | England Women's | Shot | | Blocked | [{'location': [93.0, 40.8], 'player': {'id': 1... | 91.19042 | 41.148174 |
| 1805 | True | True | [105.99634, 35.9548] | 365618f7-ebfc-4157-97ef-0032ec5ae5f1 | Lauren Wade | Right Wing | Northern Ireland | Shot | | Saved | [{'location': [105.3, 35.8], 'player': {'id': ... | 105.99634 | 35.9548 |
| 2223 | True | True | [118.05066, 38.696938] | d9379f46-6b0c-43b7-9263-827ea837d270 | Bethany Mead | Right Wing | England Women's | Shot | | Blocked | [{'location': [116.2, 45.4], 'player': {'id': ... | 118.05066 | 38.696938 |


## Historical shot/goal probabilities 

In [11]:
# column index of each zone in the shots/goals matrices
mapping = {'RF': 0,
          'LF': 1,
          'RB': 2,
          'MB': 3,
          'LB': 4,
          'TOB': 5}

In [12]:
def section(x, y):
    """Return the zone label (a key of `mapping`) for a pitch location (x, y)."""
    if (x <= 60 and y <= 18) or (x >= 60 and y >= 62):
        return 'RF'
    elif (x <= 60 and y >= 62) or (x >= 60 and y <= 18):
        return 'LF'
    elif (x <= 18 and 18 <= y <= 30) or (x >= 102 and 50 <= y <= 62):
        return 'RB'
    elif (x <= 18 and 30 <= y <= 50) or (x >= 102 and 30 <= y <= 50):
        return 'MB'
    elif (x <= 18 and 50 <= y <= 62) or (x >= 102 and 18 <= y <= 30):
        return 'LB'
    elif (18 <= x <= 60 and 18 <= y <= 62) or (60 <= x <= 102 and 18 <= y <= 62):
        return 'TOB'
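
Because `section` ends without a final `else`, any location that matches no branch would return `None` and later fail the `mapping` lookup. The sketch below is a sanity check, not part of the original analysis: it repeats the zone rules so it runs on its own and assumes StatsBomb pitch coordinates span x in [0, 120] and y in [0, 80]. The names `labels` and `unassigned` are introduced here for illustration.

```python
from collections import Counter


def section(x, y):
    # Same zone rules as the cell above, repeated so this sketch is self-contained.
    if (x <= 60 and y <= 18) or (x >= 60 and y >= 62):
        return 'RF'
    elif (x <= 60 and y >= 62) or (x >= 60 and y <= 18):
        return 'LF'
    elif (x <= 18 and 18 <= y <= 30) or (x >= 102 and 50 <= y <= 62):
        return 'RB'
    elif (x <= 18 and 30 <= y <= 50) or (x >= 102 and 30 <= y <= 50):
        return 'MB'
    elif (x <= 18 and 50 <= y <= 62) or (x >= 102 and 18 <= y <= 30):
        return 'LB'
    elif (18 <= x <= 60 and 18 <= y <= 62) or (60 <= x <= 102 and 18 <= y <= 62):
        return 'TOB'


# Assumption: the pitch is 120 x 80, sampled here on a 1-unit integer grid.
labels = Counter(section(x, y) for x in range(0, 121) for y in range(0, 81))

# Number of grid points that fell through every branch (returned None).
unassigned = labels.get(None, 0)
print(unassigned)
```

Where two zones share a boundary (for example y == 18), the first matching branch wins, so the order of the conditions decides the label.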

### Defining number of opponents

We will consider a range of opponents in a zone:

- 0-1 opponents
    - 0-1 opponents means that the zone is either clear for the receiver, an equal match (1v1), or the opponents are outnumbered

- 2-3 opponents

- more than 3 opponents
    - more than 3 opponents is an overcrowded zone
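
The three ranges above can be sketched as a small helper that turns an opponent count into a matrix row index. `opp_bucket` is a hypothetical name introduced for this illustration; the notebook itself applies the same thresholds inline.

```python
def opp_bucket(opp_count: int) -> int:
    """Return the row index for a number of opponents in the shot zone."""
    if opp_count <= 1:
        return 0  # 0-1 opponents
    if opp_count <= 3:
        return 1  # 2-3 opponents
    return 2      # more than 3 opponents


print([opp_bucket(n) for n in range(6)])  # [0, 0, 1, 1, 2, 2]
```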


In [13]:
# matrix structure
#         RF  LF  RB  MB  LB  TOB
# 0-1
# 2-3
# >3

In [14]:
# collect count of shots and goals
shots = [[0]*6, [0]*6, [0]*6]
goals = [[0]*6, [0]*6, [0]*6]

for _, row in shot_df.iterrows():
    # shot section 
    shot_section = section(row.loc_x, row.loc_y)
    
    opp_count = 0
    
    # count the opponents standing in the zone the shot was taken from;
    # a missing freeze frame shows up as a float NaN, so skip it
    if not isinstance(row.shot_freeze_frame, float):
        for player in row.shot_freeze_frame:
            player_x, player_y = player['location'][:2]

            if not player['teammate'] and section(player_x, player_y) == shot_section:
                opp_count += 1
        
    j = mapping[shot_section]
    if opp_count <= 1:
        i = 0  # 0 or 1 opps in zone
    elif opp_count <= 3:
        i = 1  # 2 or 3 opps in zone
    else:
        i = 2  # more than 3 opps in zone

    if row.shot_outcome == 'Goal':
        goals[i][j] += 1

    shots[i][j] += 1

In [15]:
goals

[[0, 0, 2, 7, 1, 2], [0, 0, 0, 8, 1, 1], [0, 0, 0, 49, 0, 3]]

In [16]:
shots

[[2, 2, 29, 9, 26, 20], [1, 3, 35, 21, 26, 50], [1, 0, 2, 293, 1, 127]]

In [17]:
# conversion rate (goals / shots) for each opponent-bucket x zone cell
p = [[0]*6, [0]*6, [0]*6]

for i in range(len(shots)):
    for j in range(len(shots[0])):
        if shots[i][j] > 0:
            p[i][j] = round(goals[i][j] / shots[i][j], 4)
        else:
            p[i][j] = 0  # no shots observed in this cell

p

[[0.0, 0.0, 0.069, 0.7778, 0.0385, 0.1],
 [0.0, 0.0, 0.0, 0.381, 0.0385, 0.02],
 [0.0, 0, 0.0, 0.1672, 0.0, 0.0236]]
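
The same per-cell rates can also be computed in one vectorized step. This is a sketch of an alternative, not the notebook's method: it copies the `goals` and `shots` counts printed above into NumPy arrays and uses `np.divide` with `where` so that cells with zero shots stay at 0. The name `rates` is introduced here for illustration.

```python
import numpy as np

# Counts copied from the goals and shots outputs above
# (rows: 0-1, 2-3, >3 opponents; columns: RF, LF, RB, MB, LB, TOB).
goals = np.array([[0, 0, 2, 7, 1, 2],
                  [0, 0, 0, 8, 1, 1],
                  [0, 0, 0, 49, 0, 3]])
shots = np.array([[2, 2, 29, 9, 26, 20],
                  [1, 3, 35, 21, 26, 50],
                  [1, 0, 2, 293, 1, 127]])

# Divide only where shots > 0; every other cell keeps the 0.0 from `out`.
rates = np.divide(goals, shots,
                  out=np.zeros(goals.shape, dtype=float),
                  where=shots > 0)
rates = np.round(rates, 4)

print(rates)
```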

### Applying conditional probability

Example: probability of a goal given 0 or 1 opponents in the shooting zone (Middle Box)

P(G | 0-1) = P(G, 0-1) / P(0-1)

In [18]:
p[0][3] / sum(p[0])

0.7894042423627322
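
For comparison, here is a sketch of one way to read the formula directly from the raw counts, assuming each probability is a relative frequency over all shots. Under that reading, each entry of `p` is already a conditional probability for its cell, whereas `p[0][3] / sum(p[0])` divides one conversion rate by the sum of the row's rates, so the two numbers need not agree. The names `MB`, `joint`, `marginal`, `p_goal_mb` and `p_goal_row` are introduced here for illustration.

```python
# Counts copied from the goals and shots outputs above.
goals = [[0, 0, 2, 7, 1, 2], [0, 0, 0, 8, 1, 1], [0, 0, 0, 49, 0, 3]]
shots = [[2, 2, 29, 9, 26, 20], [1, 3, 35, 21, 26, 50], [1, 0, 2, 293, 1, 127]]

MB = 3  # column index of the Middle Box in `mapping`
total_shots = sum(sum(row) for row in shots)

# Joint and marginal as relative frequencies over all shots.
joint = goals[0][MB] / total_shots     # P(G, 0-1 opponents, Middle Box)
marginal = shots[0][MB] / total_shots  # P(0-1 opponents, Middle Box)
p_goal_mb = joint / marginal           # P(G | 0-1 opponents, Middle Box)

# Pooled over every zone: P(G | 0-1 opponents).
p_goal_row = sum(goals[0]) / sum(shots[0])

print(total_shots, round(p_goal_mb, 4), round(p_goal_row, 4))  # 648 0.7778 0.1364
```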

## Save probabilities 

In [20]:
with open('/Users/CaitlanKrasinski/Desktop/crossing-probability-model/models/xG_historical_probs_womens_euros.pkl', 'wb') as f:
    dump(p, f)
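
A saved table like this could later be read back and queried by zone and opponent count. The sketch below is an assumption about downstream use, not code from this project: it round-trips the matrix through a temporary file (the file name is hypothetical) and defines a hypothetical helper, `lookup_xg`, that applies the same opponent ranges as above.

```python
import os
import pickle
import tempfile

# Probabilities and zone mapping copied from the cells above.
p = [[0.0, 0.0, 0.069, 0.7778, 0.0385, 0.1],
     [0.0, 0.0, 0.0, 0.381, 0.0385, 0.02],
     [0.0, 0, 0.0, 0.1672, 0.0, 0.0236]]
mapping = {'RF': 0, 'LF': 1, 'RB': 2, 'MB': 3, 'LB': 4, 'TOB': 5}


def lookup_xg(probs, zone, opp_count):
    """Hypothetical helper: historical goal rate for a zone and opponent count."""
    row = 0 if opp_count <= 1 else 1 if opp_count <= 3 else 2
    return probs[row][mapping[zone]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'xG_probs.pkl')  # hypothetical file name
    with open(path, 'wb') as f:
        pickle.dump(p, f)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(lookup_xg(loaded, 'MB', 5))  # 0.1672
```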