# Metric 3 — Pressing Effectiveness

Analyzing defensive pressing using actual possession outcomes: regains, disruptions, and counter-attack creation.
The goal is to separate players who press frequently from those whose pressing actually wins the ball back or forces errors.

## Data sources

- `all_events.csv` — dynamic events including `on_ball_engagement` with pressing outcomes
- `all_phases.csv` — defensive phase labels (high/medium/low block)
- `player_metadata.csv` — minutes played, positions, team info

All preprocessed from SkillCorner open data.

## High-level approach

1. Filter dynamic events to `on_ball_engagement` with pressing/counter-pressing subtypes
2. Add defensive phase context (high_block, medium_block, low_block)
3. Use actual outcome fields (`end_type`, `pressing_chain_end_type`) to measure effectiveness:
   - **Direct regain**: Pressing player wins ball immediately
   - **Indirect regain**: Team wins ball shortly after press
   - **Direct/indirect disruption**: Press forces error or poor decision
4. Aggregate to player-match level in Spark
5. Rank by volume, success rate, and regain efficiency

### Why not EPV flags?

SkillCorner's schema includes `stop_possession_danger` and `reduce_possession_danger` flags, but these are almost never populated in the open data (<0.1% of events). Instead, I'm using the `end_type` field, which records the actual outcome of a press wherever one was tagged. Presses with no `end_type` value count as unsuccessful in the rates below.
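
A quick way to check how well a field is populated is to compute its non-null share. The sketch below runs on a small made-up frame rather than `all_events.csv`; only the column names are taken from the schema, and the values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for all_events (values are invented; only the column names are real)
events = pd.DataFrame({
    "stop_possession_danger": [None, None, None, True],
    "reduce_possession_danger": [None, None, None, None],
    "end_type": ["indirect_regain", None, "direct_disruption", None],
})

# Share of rows where each column actually holds a value
fill_rate = events.notna().mean()
print(fill_rate.round(2))
```

Running the same `notna().mean()` on the real columns is how a figure like the <0.1% above can be reproduced.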

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

import sys
sys.path.append('..')
from src.metrics import aggregate_pressing_impact

spark = SparkSession.builder \
    .appName("pressing-effectiveness") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

print(f"Spark version: {spark.version}")

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/10 13:48:29 WARN Utils: Your hostname, Anguss-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.4.158 instead (on interface en0)
25/12/10 13:48:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/10 13:48:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/10 13:48:30 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/12/10 13:48:30 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


Spark version: 4.0.1


In [2]:
# Load preprocessed data
data_dir = Path('../output')

all_events = pd.read_csv(data_dir / 'all_events.csv', low_memory=False)
all_phases = pd.read_csv(data_dir / 'all_phases.csv')
player_metadata = pd.read_csv(data_dir / 'player_metadata.csv')

print(f"Loaded {len(all_events):,} event records")

Loaded 47,853 event records


In [3]:
def show_top(df, sort_col, cols, n=10, decimals=3):
    """Quick helper to print rounded top-N tables."""
    out = df.nlargest(n, sort_col)[cols].copy()
    num_cols = out.select_dtypes('number').columns
    out[num_cols] = out[num_cols].round(decimals)
    print(out.to_string(index=False))

## Step 1 — Filter to pressing events

Pull out `on_ball_engagement` events, then narrow to pressing/counter-pressing.
Check what outcome fields are available and how often different outcomes occur.

In [4]:
# Filter to on-ball engagements
engagements = all_events[all_events['event_type'] == 'on_ball_engagement'].copy()

print(f"Total engagements: {len(engagements):,}")
print(f"Unique players: {engagements['player_id'].nunique()}")

# Check subtypes
print("\n=== Engagement Subtypes ===")
print(engagements['event_subtype'].value_counts())

# Filter to pressing actions
pressing = engagements[engagements['event_subtype'].isin(['pressing', 'counter_press'])].copy()

print(f"\nPressing actions: {len(pressing):,}")
print(f"Counter-pressing: {(pressing['event_subtype'] == 'counter_press').sum():,}")

Total engagements: 8,911
Unique players: 193

=== Engagement Subtypes ===
event_subtype
pressure          2231
pressing          2018
recovery_press    1938
other             1848
counter_press      876
Name: count, dtype: int64

Pressing actions: 2,894
Counter-pressing: 876


In [5]:
# Check outcome fields
print("=== Pressing Outcome Fields ===\n")

print("end_type (individual press outcome):")
print(pressing['end_type'].value_counts())

print("\npressing_chain_end_type (sequence-level outcome):")
print(pressing['pressing_chain_end_type'].value_counts())

print("\nlead_to_shot:")
print(pressing['lead_to_shot'].value_counts())

print("\nlead_to_goal:")
print(pressing['lead_to_goal'].value_counts())

=== Pressing Outcome Fields ===

end_type (individual press outcome):
end_type
indirect_regain        261
indirect_disruption     85
direct_disruption       39
direct_regain           24
foul_committed          15
Name: count, dtype: int64

pressing_chain_end_type (sequence-level outcome):
pressing_chain_end_type
regain        888
disruption    301
Name: count, dtype: int64

lead_to_shot:
lead_to_shot
False    2850
True       44
Name: count, dtype: int64

lead_to_goal:
lead_to_goal
False    2889
True        5
Name: count, dtype: int64


## Step 2 — Add defensive phase context

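Join each press to the out-of-possession phase that is active at its start frame (`team_out_of_possession_phase_type`). Most presses fall in a high, medium, or low block, but the field also carries transition and set-play labels, so this shows the defensive context in which pressing happens.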
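
The cell below does this join with a row-by-row loop, which is fine at this data size. For larger inputs, a vectorized alternative is `pd.merge_asof`. This is a sketch on toy data, not the method used in this notebook, and it assumes phases within a match do not overlap (the loop keeps the first matching phase, while `merge_asof` keeps the latest-starting one). The helper name `join_phase` is hypothetical.

```python
import pandas as pd

def join_phase(presses, phases):
    """Attach the phase whose [frame_start, frame_end] contains each press start frame."""
    right = (
        phases.rename(columns={"frame_start": "phase_start", "frame_end": "phase_end"})
        [["match_id", "phase_start", "phase_end", "team_out_of_possession_phase_type"]]
        .sort_values("phase_start")
    )
    left = presses.sort_values("frame_start")
    # For each press, take the most recent phase in the same match that started at or before it
    out = pd.merge_asof(
        left, right,
        left_on="frame_start", right_on="phase_start",
        by="match_id", direction="backward",
    )
    # Blank the label when that phase had already ended (or no phase was found)
    outside = out["phase_end"].isna() | (out["phase_end"] < out["frame_start"])
    out.loc[outside, "team_out_of_possession_phase_type"] = None
    return out.drop(columns=["phase_start", "phase_end"])

# Toy data (invented values)
presses = pd.DataFrame({
    "event_id": [10, 11, 12],
    "match_id": ["1", "1", "2"],
    "frame_start": [50, 250, 30],
})
phases = pd.DataFrame({
    "match_id": ["1", "1", "2"],
    "frame_start": [0, 101, 0],
    "frame_end": [100, 200, 60],
    "team_out_of_possession_phase_type": ["high_block", "medium_block", "low_block"],
})

joined = join_phase(presses, phases).set_index("event_id")
print(joined["team_out_of_possession_phase_type"])
```

Press 11 starts at frame 250, after the last phase of its match has ended, so it comes back without a label.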

In [6]:
# Normalize match_id
pressing['match_id'] = pressing['match_id'].astype(str)
all_phases['match_id'] = all_phases['match_id'].astype(str)
player_metadata['match_id'] = player_metadata['match_id'].astype(str)

pressing_enriched = []

for _, press in pressing.iterrows():
    match_id = press['match_id']
    frame = press['frame_start']
    
    phase_match = all_phases[
        (all_phases['match_id'] == match_id) &
        (all_phases['frame_start'] <= frame) &
        (all_phases['frame_end'] >= frame)
    ]
    
    press_dict = press.to_dict()
    if len(phase_match) > 0:
        phase = phase_match.iloc[0]
        press_dict['team_out_of_possession_phase_type'] = phase['team_out_of_possession_phase_type']
    else:
        press_dict['team_out_of_possession_phase_type'] = None
    
    pressing_enriched.append(press_dict)

pressing_enriched = pd.DataFrame(pressing_enriched)

phase_coverage = pressing_enriched['team_out_of_possession_phase_type'].notna().mean()
print(f"Phase join coverage: {phase_coverage:.1%}")

print("\nPressing by defensive phase:")
print(pressing_enriched['team_out_of_possession_phase_type'].value_counts())

Phase join coverage: 100.0%

Pressing by defensive phase:
team_out_of_possession_phase_type
medium_block             1681
high_block               1020
chaotic                    66
low_block                  52
defending_transition       43
defending_direct           16
defending_quick_break       9
defending_set_play          7
Name: count, dtype: int64


In [7]:
# Clean for Spark — use outcome fields that actually have data
pressing_clean = pressing_enriched[[
    'event_id', 'match_id', 'team_id', 'player_id',
    'event_subtype', 'frame_start', 'frame_end',
    'end_type', 'pressing_chain_end_type',
    'lead_to_shot', 'lead_to_goal',
    'team_out_of_possession_phase_type'
]].copy()

# Convert boolean columns
for col in ['lead_to_shot', 'lead_to_goal']:
    if col in pressing_clean.columns:
        pressing_clean[col] = pressing_clean[col].map({
            'True': True, True: True,
            'False': False, False: False,
            None: False, 'None': False, '': False
        }).fillna(False).astype(bool)

# Create derived success flags
pressing_clean['direct_regain'] = pressing_clean['end_type'] == 'direct_regain'
pressing_clean['indirect_regain'] = pressing_clean['end_type'] == 'indirect_regain'
pressing_clean['any_regain'] = pressing_clean['end_type'].isin(['direct_regain', 'indirect_regain'])

pressing_clean['direct_disruption'] = pressing_clean['end_type'] == 'direct_disruption'
pressing_clean['indirect_disruption'] = pressing_clean['end_type'] == 'indirect_disruption'
pressing_clean['any_disruption'] = pressing_clean['end_type'].isin(['direct_disruption', 'indirect_disruption'])

pressing_clean['successful_press'] = pressing_clean['end_type'].isin([
    'direct_regain', 'indirect_regain', 'direct_disruption', 'indirect_disruption'
])

print(f"Cleaned pressing actions: {len(pressing_clean):,}")

# Verify distributions
print("\n=== Pressing Outcome Distributions ===")
print(f"Any regain: {pressing_clean['any_regain'].sum():,} ({pressing_clean['any_regain'].mean():.1%})")
print(f"Any disruption: {pressing_clean['any_disruption'].sum():,} ({pressing_clean['any_disruption'].mean():.1%})")
print(f"Successful press: {pressing_clean['successful_press'].sum():,} ({pressing_clean['successful_press'].mean():.1%})")
print(f"Lead to shot: {pressing_clean['lead_to_shot'].sum():,} ({pressing_clean['lead_to_shot'].mean():.1%})")
print(f"Lead to goal: {pressing_clean['lead_to_goal'].sum():,} ({pressing_clean['lead_to_goal'].mean():.1%})")

Cleaned pressing actions: 2,894

=== Pressing Outcome Distributions ===
Any regain: 285 (9.8%)
Any disruption: 124 (4.3%)
Successful press: 409 (14.1%)
Lead to shot: 44 (1.5%)
Lead to goal: 5 (0.2%)


## Step 3 — Aggregate to player-match level

Group by player-match and calculate:
- Volume: pressing actions per 90
- Success: regain rate, disruption rate, overall success rate
- Outcomes: presses leading to shots/goals
- Phase breakdown: high/medium/low block distribution
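
The aggregation itself lives in `aggregate_pressing_impact` (in `src.metrics`), which is not shown here. As a rough illustration of the kind of calculation involved, the pandas sketch below computes counts, rates, and per-90 values on toy data. The per-90 formula (`count / minutes_played * 90`) is an assumption about the helper, although it is consistent with the result tables; the `min_minutes` and `min_actions` filters are left out.

```python
import pandas as pd

# Toy press-level rows (invented values; column names follow pressing_clean)
presses = pd.DataFrame({
    "match_id": ["1"] * 6,
    "player_id": [7, 7, 7, 7, 9, 9],
    "any_regain": [True, False, False, False, False, False],
    "any_disruption": [False, True, False, False, False, False],
})
minutes = pd.DataFrame({
    "match_id": ["1", "1"],
    "player_id": [7, 9],
    "minutes_played": [90.0, 45.0],
})

agg = (
    presses.assign(successful_press=lambda d: d["any_regain"] | d["any_disruption"])
    .groupby(["match_id", "player_id"], as_index=False)
    .agg(
        pressing_action_count=("any_regain", "size"),
        total_regain_count=("any_regain", "sum"),
        successful_press_count=("successful_press", "sum"),
    )
    .merge(minutes, on=["match_id", "player_id"], how="inner")
)

# Volume scaled to 90 minutes (assumed formula)
agg["pressing_actions_per_90"] = agg["pressing_action_count"] / agg["minutes_played"] * 90
# Rates use all presses as the denominator, so untagged presses count as failures
agg["regain_rate"] = agg["total_regain_count"] / agg["pressing_action_count"]
agg["press_success_rate"] = agg["successful_press_count"] / agg["pressing_action_count"]

print(agg)
```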

In [8]:
# Convert to Spark
pressing_sdf = spark.createDataFrame(pressing_clean)
player_meta_sdf = spark.createDataFrame(player_metadata)

print(f"Pressing actions in Spark: {pressing_sdf.count():,}")
print(f"Player metadata in Spark: {player_meta_sdf.count():,}")

Pressing actions in Spark: 2,894
Player metadata in Spark: 360


25/12/10 13:48:35 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [9]:
# Aggregate using helper function
player_pressing_sdf = aggregate_pressing_impact(
    pressing_sdf=pressing_sdf,
    player_meta_sdf=player_meta_sdf,
    min_minutes=30.0,
    min_actions=3
)

print(f"Player-match records: {player_pressing_sdf.count()}")

Player-match records: 199


In [10]:
# Convert to pandas
player_pressing = player_pressing_sdf.toPandas()

print(f"Final dataset: {len(player_pressing)} player-match records")
print(f"Unique players: {player_pressing['player_id'].nunique()}")

# Save clean metrics (no metadata)
pressing_cols = [
    'match_id', 'player_id',
    'pressing_action_count', 'pressing_actions_per_90',
    'press_success_rate', 'regain_rate', 'disruption_rate',
    'successful_press_count', 'successful_presses_per_90',
    'total_regain_count', 'regains_per_90',
    'direct_regain_count', 'indirect_regain_count',
    'total_disruption_count', 
    'direct_disruption_count', 'indirect_disruption_count',
    'presses_leading_to_shot', 'presses_leading_to_goal',
    'shot_creation_rate',
    'high_block_press_count', 'medium_block_press_count', 'low_block_press_count',
    'counter_press_count', 'counter_presses_per_90'
]

player_pressing[pressing_cols].to_csv('../output/player_pressing.csv', index=False)
print(f"\nSaved {len(pressing_cols)} metric columns to ../output/player_pressing.csv")
print(f"player_pressing still has all {len(player_pressing.columns)} columns for analysis")

Final dataset: 199 player-match records
Unique players: 145

Saved 24 metric columns to ../output/player_pressing.csv
player_pressing still has all 30 columns for analysis


## Result tables – who presses most and who actually wins the ball back?

Below are four simple cuts on the player–match pressing metrics:

1. **Highest Pressing Volume (per 90)** – players who press most often relative to minutes played  
2. **Best Pressing Success Rate** – players whose presses most often end in a regain *or* disruption  
3. **Best Regain Rate** – players who actually win the ball back most reliably when they press  
4. **Best Combined Volume + Effectiveness** – players who both press a lot and convert those presses into successful outcomes per 90  

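In the two rate tables (success rate and regain rate) I filter out tiny samples (minimum 5 presses) so the numbers aren’t driven by one or two actions. The volume and combined tables use only the aggregation thresholds from Step 3 (at least 30 minutes and 3 pressing actions).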
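
The fourth cut works as a combined measure because, assuming both per-90 columns are scaled by the same minutes, successful presses per 90 is simply volume per 90 multiplied by success rate. A quick check using the M. Caputo row from the combined table further down:

```python
# Values copied from the M. Caputo row of the combined table
pressing_actions_per_90 = 36.573
press_success_rate = 0.200

# Volume x effectiveness should reproduce successful_presses_per_90 (7.315 in the table)
successful_presses_per_90 = pressing_actions_per_90 * press_success_rate
print(round(successful_presses_per_90, 3))
```

A player can therefore rank highly here either through very high volume at a modest success rate or through moderate volume at a high one.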

In [11]:
# Top 10 by pressing volume
print("=== Highest Pressing Volume (per 90) ===\n")

show_top(
    player_pressing,
    sort_col='pressing_actions_per_90',
    cols=['player_short_name', 'position_group', 'minutes_played',
          'pressing_actions_per_90', 'press_success_rate', 'regain_rate']
)

=== Highest Pressing Volume (per 90) ===

player_short_name position_group  minutes_played  pressing_actions_per_90  press_success_rate  regain_rate
    C. Ikonomidis Center Forward           64.92                   49.908               0.056        0.056
          M. Ruhs Center Forward           74.26                   49.690               0.122        0.122
           G. May Center Forward           86.65                   45.701               0.045        0.023
       L. Bayliss       Midfield           61.55                   43.867               0.067        0.033
       P. Klimala Center Forward           83.40                   43.165               0.150        0.125
      A. Bugarija Center Forward           67.75                   42.509               0.188        0.188
         N. Botić Center Forward           99.83                   40.569               0.133        0.067
          A. Kuol Center Forward           98.72                   40.113               0.023        0

In [12]:
# Top 10 by success rate
print("=== Best Pressing Success Rate ===")
print("(Players with 5+ pressing actions)\n")

success_df = player_pressing[player_pressing['pressing_action_count'] >= 5]

show_top(
    success_df,
    sort_col='press_success_rate',
    cols=['player_short_name', 'position_group', 'pressing_action_count',
          'press_success_rate', 'regain_rate', 'disruption_rate']
)

=== Best Pressing Success Rate ===
(Players with 5+ pressing actions)

  player_short_name   position_group  pressing_action_count  press_success_rate  regain_rate  disruption_rate
        Walid Shour         Midfield                      8               0.500        0.125            0.375
J. Courtney-Perkins Central Defender                      7               0.429        0.429            0.000
        N. Atkinson        Full Back                      7               0.429        0.286            0.143
         L. Gillion    Wide Attacker                     10               0.400        0.400            0.000
          A. Faisal    Wide Attacker                      8               0.375        0.375            0.000
             K. Bos        Full Back                      6               0.333        0.000            0.333
           T. Payne        Full Back                      6               0.333        0.333            0.000
       F. Talladira        Full Back             

In [13]:
# Top 10 by regain efficiency
print("=== Best Regain Rate ===")
print("(Players with 5+ pressing actions)\n")

regain_df = player_pressing[player_pressing['pressing_action_count'] >= 5]

show_top(
    regain_df,
    sort_col='regain_rate',
    cols=['player_short_name', 'position_group', 'pressing_action_count',
          'regain_rate', 'regains_per_90', 'direct_regain_count']
)

=== Best Regain Rate ===
(Players with 5+ pressing actions)

  player_short_name   position_group  pressing_action_count  regain_rate  regains_per_90  direct_regain_count
J. Courtney-Perkins Central Defender                      7        0.429           2.853                    0
         L. Gillion    Wide Attacker                     10        0.400           5.746                    1
          A. Faisal    Wide Attacker                      8        0.375           3.985                    0
           T. Payne        Full Back                      6        0.333           1.821                    0
       F. Talladira        Full Back                      6        0.333           1.875                    0
     K. Barbarouses    Wide Attacker                     15        0.333           4.462                    0
         R. Danzaki    Wide Attacker                     10        0.300           3.074                    1
          L. Toomey    Wide Attacker                     10

In [14]:
# Top 10 by combined volume + effectiveness
print("=== Best Combined Volume + Effectiveness ===")
print("(Successful presses per 90)\n")

show_top(
    player_pressing,
    sort_col='successful_presses_per_90',
    cols=['player_short_name', 'position_group', 'minutes_played',
          'pressing_actions_per_90', 'press_success_rate',
          'successful_presses_per_90']
)

=== Best Combined Volume + Effectiveness ===
(Successful presses per 90)

player_short_name position_group  minutes_played  pressing_actions_per_90  press_success_rate  successful_presses_per_90
      A. Bugarija Center Forward           67.75                   42.509               0.188                      7.970
        M. Caputo Center Forward           86.13                   36.573               0.200                      7.315
        L. Toomey  Wide Attacker           38.20                   23.560               0.300                      7.068
      A. Thurgate       Midfield           94.65                   21.870               0.304                      6.656
        B. Gibson Center Forward           69.02                   33.903               0.192                      6.520
       P. Klimala Center Forward           83.40                   43.165               0.150                      6.475
          M. Ruhs Center Forward           71.22                   35.383      

## Validation — distributions and position patterns

In [15]:
print("=== Pressing Volume Distribution ===")
print(player_pressing['pressing_actions_per_90'].describe().round(2))

print("\n=== Success Rate Distribution ===")
print(player_pressing['press_success_rate'].describe().round(3))

print("\n=== Regain Rate Distribution ===")
print(player_pressing['regain_rate'].describe().round(3))

print("\n=== Disruption Rate Distribution ===")
print(player_pressing['disruption_rate'].describe().round(3))

=== Pressing Volume Distribution ===
count    199.00
mean      15.99
std       10.83
min        2.68
25%        7.21
50%       13.36
75%       22.89
max       49.91
Name: pressing_actions_per_90, dtype: float64

=== Success Rate Distribution ===
count    199.000
mean       0.161
std        0.159
min        0.000
25%        0.059
50%        0.133
75%        0.200
max        1.000
Name: press_success_rate, dtype: float64

=== Regain Rate Distribution ===
count    199.000
mean       0.113
std        0.134
min        0.000
25%        0.000
50%        0.091
75%        0.143
max        0.750
Name: regain_rate, dtype: float64

=== Disruption Rate Distribution ===
count    199.000
mean       0.048
std        0.090
min        0.000
25%        0.000
50%        0.000
75%        0.067
max        0.667
Name: disruption_rate, dtype: float64


In [16]:
# Position breakdown
print("=== Pressing Metrics by Position ===\n")

position_summary = player_pressing.groupby('position_group').agg({
    'pressing_action_count': 'sum',
    'pressing_actions_per_90': 'mean',
    'press_success_rate': 'mean',
    'regain_rate': 'mean',
    'disruption_rate': 'mean',
}).round(3)

print(position_summary)

=== Pressing Metrics by Position ===

                  pressing_action_count  pressing_actions_per_90  \
position_group                                                     
Center Forward                      872                   29.383   
Central Defender                     74                    3.759   
Full Back                           272                    7.152   
Midfield                            781                   15.073   
Wide Attacker                       702                   19.106   

                  press_success_rate  regain_rate  disruption_rate  
position_group                                                      
Center Forward                 0.120        0.085            0.035  
Central Defender               0.330        0.228            0.103  
Full Back                      0.181        0.146            0.035  
Midfield                       0.136        0.082            0.054  
Wide Attacker                  0.139        0.098            0.041  
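
One caveat when reading the table above: the rate columns are unweighted means of player-match rates, so a player-match with 3 presses counts as much as one with 30. A pooled rate (total successes over total presses) can look quite different for low-volume positions. The sketch below uses invented numbers to show the gap; it is not computed from this dataset.

```python
import pandas as pd

# Toy player-match rows (invented values; column names follow player_pressing)
rows = pd.DataFrame({
    "position_group": ["Central Defender", "Central Defender",
                       "Center Forward", "Center Forward"],
    "pressing_action_count": [3, 12, 30, 20],
    "successful_press_count": [3, 0, 3, 4],
})
rows["press_success_rate"] = rows["successful_press_count"] / rows["pressing_action_count"]

by_pos = rows.groupby("position_group").agg(
    mean_of_rates=("press_success_rate", "mean"),   # what the table above reports
    presses=("pressing_action_count", "sum"),
    successes=("successful_press_count", "sum"),
)
# Pooled rate weights every press equally instead of every player-match
by_pos["pooled_rate"] = by_pos["successes"] / by_pos["presses"]

print(by_pos.round(3))
```

In this toy case the central defenders average 0.5 across player-matches but only 0.2 when pooled, because one 3-press match with a perfect record dominates the mean.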


## Key findings

**Volume patterns**

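- Center forwards press most often (29.4 actions per 90 on average), followed by wide attackers (19.1) and midfielders (15.1); full backs (7.2) and central defenders (3.8) pick their moments more.  
- There’s a clear split between players who are constantly engaging the ball and those who press more selectively: the middle half of player-matches ranges from 7.2 to 22.9 presses per 90, with a maximum of 49.9.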

**Effectiveness**

- The best pressers combine a solid volume of actions with **regain rates** and **overall press success rates** well above the dataset means (0.113 and 0.161 respectively).  
- A small group of players stand out for regain rates of 0.30 or more, though almost all of these regains are indirect: only 24 of the 2,894 presses ended in a direct regain.

**Impact**

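- Only a small share of presses lead to a shot (44 of 2,894, or 1.5%) or a goal (5, or 0.2%), which is expected. A handful of players may show up repeatedly in those chains; the saved `presses_leading_to_shot` and `shot_creation_rate` columns can be used to check this, since the tables above do not rank on them.  
- Combining volume with outcome rates makes this metric a good way to surface “trigger players” whose pressing actually tilts the pitch, rather than just generating work rate numbers.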