# Kaggle DFL Data Shootout EDA
Exploratory data analysis of the Kaggle competition DFL - Bundesliga Data Shootout.

Goal: Detect passes (including throw-ins and crosses) and challenges in Bundesliga matches. The computer vision model should automatically classify and annotate these events in long video recordings.

In [21]:
import pandas as pd
from moviepy.video.io.ffmpeg_tools import ffmpeg_extract_subclip


In [2]:
train_df = pd.read_csv('data/train.csv')
train_df.head()

Unnamed: 0,video_id,time,event,event_attributes
0,1606b0e6_0,200.265822,start,
1,1606b0e6_0,201.15,challenge,['ball_action_forced']
2,1606b0e6_0,202.765822,end,
3,1606b0e6_0,210.124111,start,
4,1606b0e6_0,210.87,challenge,['opponent_dispossessed']


In [11]:
train_df.groupby(['event_attributes']).nunique()

event_attributes,video_id,time,event
['ball_action_forced'],12,239,1
['challenge_during_ball_transfer'],12,53,1
"['cross', 'corner']",12,33,1
"['cross', 'freekick']",4,5,1
"['cross', 'openplay']",12,80,1
['cross'],6,18,1
['fouled'],12,111,1
['opponent_dispossessed'],12,138,1
['opponent_rounded'],12,39,1
"['pass', 'corner']",3,4,1

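Because `pd.read_csv` loads `event_attributes` as plain strings such as `"['cross', 'corner']"`, grouping on the column treats every combination as its own category. A minimal sketch of parsing the strings into lists and counting individual attributes follows. The `raw` series is a small hand-made sample using labels from the output above, and `parse_attributes` is a hypothetical helper name, not part of the competition code.

```python
import ast

import pandas as pd

# Hand-made sample of attribute strings taken from the groupby output above
raw = pd.Series([
    "['ball_action_forced']",
    "['cross', 'corner']",
    "['cross', 'openplay']",
    None,  # start/end rows carry no attributes
])


def parse_attributes(value):
    """Turn "['cross', 'corner']" into ['cross', 'corner']; missing values become []."""
    if pd.isna(value):
        return []
    return ast.literal_eval(value)


attr_lists = raw.apply(parse_attributes)

# explode() yields one row per attribute; empty lists become NaN and are dropped
attr_counts = attr_lists.explode().dropna().value_counts()
print(attr_counts)
```

On the full `train_df`, the same idea would be applied to `train_df['event_attributes']`, assuming every non-null value is a valid Python list literal.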

In [17]:
train_df['time_diff'] = train_df.time.diff()
# A 'start' row opens a new window, so its gap to the previous row is not meaningful
train_df.loc[train_df['event'] == 'start', 'time_diff'] = None
train_df.head(10)

Unnamed: 0,video_id,time,event,event_attributes,time_diff
0,1606b0e6_0,200.265822,start,,
1,1606b0e6_0,201.15,challenge,['ball_action_forced'],0.884178
2,1606b0e6_0,202.765822,end,,1.615822
3,1606b0e6_0,210.124111,start,,
4,1606b0e6_0,210.87,challenge,['opponent_dispossessed'],0.745889
5,1606b0e6_0,212.624111,end,,1.754111
6,1606b0e6_0,217.850213,start,,
7,1606b0e6_0,219.23,throwin,['pass'],1.379787
8,1606b0e6_0,220.350213,end,,1.120213
9,1606b0e6_0,223.93085,start,,
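The rows above suggest that each labelled event sits inside a `start`/`end` window. A rough sketch of measuring the window lengths is shown below, using only the first nine rows from the output. The `window_id` column and the `windows` frame are names introduced here for illustration.

```python
import pandas as pd

# First nine rows of train_df, copied from the output above
sample = pd.DataFrame({
    "video_id": ["1606b0e6_0"] * 9,
    "time": [200.265822, 201.15, 202.765822,
             210.124111, 210.87, 212.624111,
             217.850213, 219.23, 220.350213],
    "event": ["start", "challenge", "end",
              "start", "challenge", "end",
              "start", "throwin", "end"],
})

# Every 'start' row opens a new window; a running count per video gives a window id
is_start = sample["event"].eq("start").astype(int)
sample["window_id"] = is_start.groupby(sample["video_id"]).cumsum()

# First and last timestamp of each window, and the time between them
windows = sample.groupby(["video_id", "window_id"])["time"].agg(start="min", end="max")
windows["duration"] = windows["end"] - windows["start"]
print(windows)
```

For these three windows the duration comes out at about 2.5 seconds each. Whether that holds across the whole file would need to be checked on the full `train_df`.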


In [19]:
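# Select the numeric columns explicitly; recent pandas versions raise on string columns in mean()
train_df.groupby('event_attributes')[['time', 'time_diff']].mean()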

event_attributes,time,time_diff
['ball_action_forced'],1905.769347,1.315094
['challenge_during_ball_transfer'],1746.260868,1.106434
"['cross', 'corner']",1877.533121,1.46611
"['cross', 'freekick']",1563.216,1.659324
"['cross', 'openplay']",1716.143587,1.387956
['cross'],1711.859444,1.365007
['fouled'],1812.061054,1.310193
['opponent_dispossessed'],1908.893087,1.290977
['opponent_rounded'],1803.720897,1.436239
"['pass', 'corner']",2020.185,0.849128


Idea for a pipeline (one model per step):
  1) Detect whether a challenge occurs
  2) Locate the start and end of the challenge
  3) Classify the challenge
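The three steps above could be wired together roughly as follows. This is only a structural sketch: `Detection`, `run_pipeline`, and the three callables are hypothetical names, and the lambdas at the bottom are stand-ins for real models that would have to be trained separately.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Detection:
    start: float  # seconds from the beginning of the video
    end: float
    label: str


def run_pipeline(
    clips: Sequence[Tuple[float, list]],
    has_challenge: Callable[[list], bool],
    find_bounds: Callable[[list], Optional[Tuple[float, float]]],
    classify: Callable[[list, Tuple[float, float]], str],
) -> List[Detection]:
    """Run the three stages over (clip_start_seconds, frames) pairs."""
    detections = []
    for clip_start, frames in clips:
        # Step 1: skip clips where the first model sees no challenge
        if not has_challenge(frames):
            continue
        # Step 2: start and end, relative to the beginning of the clip
        bounds = find_bounds(frames)
        if bounds is None:
            continue
        # Step 3: assign a class to the located challenge
        label = classify(frames, bounds)
        detections.append(
            Detection(clip_start + bounds[0], clip_start + bounds[1], label)
        )
    return detections


# Stand-in models, used only to show the call pattern
demo_clips = [(0.0, ["frame"]), (10.0, ["frame", "challenge"])]
demo = run_pipeline(
    demo_clips,
    has_challenge=lambda frames: "challenge" in frames,
    find_bounds=lambda frames: (1.0, 2.5),
    classify=lambda frames, bounds: "opponent_dispossessed",
)
print(demo)
```

Keeping each stage behind a plain callable means any one model can be swapped out without touching the other two.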

In [20]:
train_df.groupby(['event']).nunique()

event,video_id,time,event_attributes,time_diff
challenge,12,623,6,577
end,12,3418,0,2629
play,12,3579,6,2429
start,12,3418,0,0
throwin,12,172,2,168
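Note that `nunique()` on `time` counts distinct timestamps, so it would undercount whenever two rows share a timestamp. To get row counts per class, `value_counts()` may be the more direct tool. The sketch below uses the ten `event` values from the `head(10)` output above; `labelled`, `counts`, and `shares` are names introduced here.

```python
import pandas as pd

# The 'event' column of the first ten rows shown earlier
events = pd.Series(
    ["start", "challenge", "end", "start", "challenge",
     "end", "start", "throwin", "end", "start"],
    name="event",
)

# Drop the window markers so only the labelled events remain
labelled = events[~events.isin(["start", "end"])]

counts = labelled.value_counts()                # rows per class
shares = labelled.value_counts(normalize=True)  # fraction of labelled rows per class
print(counts)
print(shares)
```

On the full data, replacing `events` with `train_df['event']` would give the class balance that a classifier has to cope with.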


## Watching clips

In [22]:
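from IPython.display import Video  # needed for the inline player returned below


def vis_event(row, before=5, after=5):
    """Cut a short clip around one annotated event and return an inline player."""
    print(row["event_attributes"])
    # row.name is the DataFrame index label; train_df as loaded above has no 'index' column
    filename = f"test_{row.name}.mp4"
    ffmpeg_extract_subclip(
        f"../input/dfl-bundesliga-data-shootout/train/{row['video_id']}.mp4",
        int(row['time']) - before,
        int(row['time']) + after,
        targetname=filename,
    )

    return Video(filename, width=800)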
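One edge case in the clip extraction is an event within the first few seconds of a video, where subtracting `before` would give a negative start time. A small helper along these lines could clamp the range. `clip_bounds` and its `video_length` parameter are hypothetical, and the sketch keeps fractional seconds instead of truncating with `int()`, on the assumption that the clip extractor accepts float seconds.

```python
from typing import Optional, Tuple


def clip_bounds(
    event_time: float,
    before: float = 5.0,
    after: float = 5.0,
    video_length: Optional[float] = None,
) -> Tuple[float, float]:
    """Return (start, end) in seconds around an event, kept inside the video."""
    start = max(0.0, event_time - before)  # never seek before the first frame
    end = event_time + after
    if video_length is not None:
        end = min(end, video_length)       # never seek past the last frame
    return start, end


print(clip_bounds(201.15))                     # event in the middle of a video
print(clip_bounds(2.0))                        # event near the beginning
print(clip_bounds(100.0, video_length=103.0))  # event near the end
```

The returned pair could then be passed as the start and end arguments of `ffmpeg_extract_subclip`.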