In [1]:
%matplotlib inline


# Using Statsbomb
Getting familiar with Statsbomb data


In [3]:
!pip install mplsoccer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting mplsoccer
  Downloading mplsoccer-1.1.11-py3-none-any.whl (68 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m68.1/68.1 KB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: mplsoccer
Successfully installed mplsoccer-1.1.11


In [4]:
# Import the Sbopen class from mplsoccer to open the data
from mplsoccer import Sbopen
# The first step is to open the data. We use the Sbopen parser available in mplsoccer.
parser = Sbopen()

## Competition data
Using the parser's *competition* method we can explore the available competitions and find the one we are interested in.
The most important columns for us are *competition_id* and *season_id*.
The first is the key of a competition in the Statsbomb database; the second is the key of a season
of that competition (for example, WC 2018 has a different *season_id* than WC 2014, but the same *competition_id*).



In [5]:
#opening data using competition method
df_competition = parser.competition()
#structure of data
df_competition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   competition_id             43 non-null     int64 
 1   season_id                  43 non-null     int64 
 2   country_name               43 non-null     object
 3   competition_name           43 non-null     object
 4   competition_gender         43 non-null     object
 5   competition_youth          43 non-null     bool  
 6   competition_international  43 non-null     bool  
 7   season_name                43 non-null     object
 8   match_updated              43 non-null     object
 9   match_updated_360          42 non-null     object
 10  match_available_360        4 non-null      object
 11  match_available            43 non-null     object
dtypes: bool(2), int64(2), object(8)
memory usage: 3.6+ KB
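
Looking up the two ids by name can be sketched as below. This is a minimal illustration, not part of the original notebook: `find_seasons` is a hypothetical helper, and `toy_competition` is a hand-made stand-in for `parser.competition()` so that the cell runs offline. Only the 72/30 pair comes from this tutorial; the competition name string and the second row are assumptions.

```python
import pandas as pd


def find_seasons(df_competition: pd.DataFrame, name: str) -> pd.DataFrame:
    """Return the id columns of every season whose competition name contains `name`."""
    # Case-insensitive plain-text match on the competition name
    mask = df_competition["competition_name"].str.contains(name, case=False, regex=False)
    columns = ["competition_id", "season_id", "competition_name", "season_name"]
    return df_competition.loc[mask, columns]


# Toy stand-in for parser.competition(); the second row is made up for illustration
toy_competition = pd.DataFrame({
    "competition_id": [72, 1],
    "season_id": [30, 2],
    "competition_name": ["Women's World Cup", "Example League"],
    "season_name": ["2019", "2000/2001"],
})

wwc = find_seasons(toy_competition, "women's world cup")
print(wwc)
```

With the real data, the same call would be `find_seasons(df_competition, "World Cup")`, which should list every World Cup season together with its ids.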


## Match data
Using the parser's *match* method we can explore the matches of a competition and find the match we are interested in.
To call it we need the *competition_id* and the *season_id*.
For the Women's World Cup, *competition_id* is 72 and *season_id* is 30.
The most important information in this dataframe is in *match_id*,
*home_team_id* and *home_team_name*, and correspondingly *away_team_id* and *away_team_name*.



In [6]:
#opening data using match method
df_match = parser.match(competition_id=72, season_id=30)
#structure of data
df_match.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 52 columns):
 #   Column                           Non-Null Count  Dtype         
---  ------                           --------------  -----         
 0   match_id                         52 non-null     int64         
 1   match_date                       52 non-null     datetime64[ns]
 2   kick_off                         52 non-null     datetime64[ns]
 3   home_score                       52 non-null     int64         
 4   away_score                       52 non-null     int64         
 5   match_status                     52 non-null     object        
 6   match_status_360                 52 non-null     object        
 7   last_updated                     52 non-null     datetime64[ns]
 8   last_updated_360                 52 non-null     datetime64[ns]
 9   match_week                       52 non-null     int64         
 10  competition_id                   52 non-null     int64         
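
Finding the *match_id* of a given fixture from the team names could look like the sketch below. `find_match` is a hypothetical helper, not an mplsoccer function, and `toy_match` is a hand-made stand-in for `df_match` so that the cell runs without downloading data. The id 69301 is the one used in this tutorial; the exact team name strings and which side is listed as home are assumptions.

```python
import pandas as pd


def find_match(df_match: pd.DataFrame, team_a: str, team_b: str) -> pd.DataFrame:
    """Return the matches between two teams, whichever of them played at home."""
    home = df_match["home_team_name"]
    away = df_match["away_team_name"]
    # Check both orientations, because we may not know which team is listed as home
    a_home = home.str.contains(team_a, regex=False) & away.str.contains(team_b, regex=False)
    b_home = home.str.contains(team_b, regex=False) & away.str.contains(team_a, regex=False)
    return df_match.loc[a_home | b_home, ["match_id", "home_team_name", "away_team_name"]]


# Toy stand-in for parser.match(competition_id=72, season_id=30)
toy_match = pd.DataFrame({
    "match_id": [69301, 1],
    "home_team_name": ["England Women's", "Team A"],
    "away_team_name": ["Sweden Women's", "Team B"],
})

found = find_match(toy_match, "Sweden", "England")
print(found)
```

Matching on a substring rather than the full name is a deliberate choice here, since team names in the data may carry suffixes that are easy to forget.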


## Lineup data
To check the lineups we use the *lineup* method. We do it for the England vs Sweden WWC 2019 game, whose *match_id* is 69301
(you can check that in df_match). In this dataframe you will find all players who played in this game, their teams
and jersey numbers.
The cell below is commented out because of a change in the data format.



In [7]:
#opening data using lineup method
#df_lineup = parser.lineup(69301)
#structure of data
#df_lineup.info()

## Event data
The Statsbomb data that we will use the most during the course is event data.
Knowing the *match_id*, you can open all the events that occurred on the pitch.
The event dataframe holds the events with additional information; we will mostly use this dataframe.
The tactics dataframe provides information about player positions on the pitch. The 'related' dataframe provides information
on events that were related to each other, for example a pass and the pressure applied to it. *df_freeze* consists of freeze
frames with player positions at the moment of shots. We will learn more about tracking data later in the course.
Below, an example of event data is presented.



In [8]:
#opening data
df_event, df_related, df_freeze, df_tactics = parser.event(69301)
#if you want only event data you can use 
#df_event = parser.event(69301)[0]
#structure of data
df_event.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3291 entries, 0 to 3290
Data columns (total 74 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              3291 non-null   object 
 1   index                           3291 non-null   int64  
 2   period                          3291 non-null   int64  
 3   timestamp                       3291 non-null   object 
 4   minute                          3291 non-null   int64  
 5   second                          3291 non-null   int64  
 6   possession                      3291 non-null   int64  
 7   duration                        2457 non-null   float64
 8   match_id                        3291 non-null   int64  
 9   type_id                         3291 non-null   int64  
 10  type_name                       3291 non-null   object 
 11  possession_team_id              3291 non-null   int64  
 12  possession_team_name            32
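
A common first step with the event dataframe is to see which event types it contains and to pull out one of them. The sketch below assumes only the columns listed in the output above; `count_event_types` and `events_of_type` are hypothetical helpers, and `toy_event` is a hand-made stand-in for `df_event` with made-up rows.

```python
import pandas as pd


def count_event_types(df_event: pd.DataFrame) -> pd.Series:
    """Count how many events of each type occurred."""
    return df_event["type_name"].value_counts()


def events_of_type(df_event: pd.DataFrame, type_name: str) -> pd.DataFrame:
    """Keep only the events of one type, with a fresh 0-based index."""
    return df_event.loc[df_event["type_name"] == type_name].reset_index(drop=True)


# Toy stand-in for df_event; the rows are invented for illustration
toy_event = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "minute": [0, 0, 1, 2],
    "type_name": ["Pass", "Pressure", "Pass", "Shot"],
    "possession_team_name": ["Team A", "Team A", "Team B", "Team B"],
})

counts = count_event_types(toy_event)
passes = events_of_type(toy_event, "Pass")
print(counts)
print(passes)
```

On the real data the equivalent calls would be `count_event_types(df_event)` and `events_of_type(df_event, "Pass")`.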

## 360 data
Statsbomb offers 360 data, which track not only the location of an event but also the locations of players. To open them we need
the id of a match. Later, we will also need the id of the event. In *df_frame* we find the players' positions (each player is described only by flags such as *teammate*, not by identity)
and *df_visible* gives the part of the pitch that was tracked during each event.



In [9]:
df_frame, df_visible = parser.frame(3788741)

# exploring the data
df_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45737 entries, 0 to 45736
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   teammate  45737 non-null  bool   
 1   actor     45737 non-null  bool   
 2   keeper    45737 non-null  bool   
 3   match_id  45737 non-null  int64  
 4   id        45737 non-null  object 
 5   x         45737 non-null  float64
 6   y         45737 non-null  float64
dtypes: bool(3), float64(2), int64(1), object(1)
memory usage: 1.5+ MB
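
Since each row of *df_frame* is one visible player at one event, selecting a single event and splitting on the *teammate* flag gives the two sides at that moment. The sketch below uses only the columns shown in the output above; `players_at_event` is a hypothetical helper, and `toy_frame` is a hand-made stand-in for `df_frame` with invented event ids and coordinates.

```python
import pandas as pd


def players_at_event(df_frame: pd.DataFrame, event_id: str):
    """Split the tracked players of one event into teammates and opponents."""
    frame = df_frame.loc[df_frame["id"] == event_id]
    teammates = frame.loc[frame["teammate"]]
    opponents = frame.loc[~frame["teammate"]]
    return teammates, opponents


# Toy stand-in for df_frame; ids and coordinates are made up for illustration
toy_frame = pd.DataFrame({
    "teammate": [True, True, False, True],
    "actor": [True, False, False, True],
    "keeper": [False, False, True, False],
    "match_id": [3788741] * 4,
    "id": ["event-1", "event-1", "event-1", "event-2"],
    "x": [60.0, 70.0, 118.0, 30.0],
    "y": [40.0, 30.0, 40.0, 20.0],
})

teammates, opponents = players_at_event(toy_frame, "event-1")
print(teammates[["x", "y"]])
print(opponents[["x", "y"]])
```

With the real data, `event_id` would be taken from the *id* column of the event dataframe of the same match.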
