## 0. Libraries and Personal Tools

A single objective for this notebook: a deep dive into how the ball position behaves during an event. Is the ball where we would expect it to be, or does it depart from our traditional conception of sports?

The questions I'll try to answer:

1. How many games, events, and frames do we have available to analyze? 
1. How many events ended up with a team scoring? What was the proportion?
1. What is the average event duration?
1. How does ball position change across different stages of the game?

**If you don't want to spend a lot of time reading this notebook, go directly to the last chart!**

In [None]:
import os
import sys
import datatable as dt
import numpy as np  # needed later for np.sqrt and np.histogram2d
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib import rcParams

In [None]:
# Set the default figure size and theme to display good looking matplotlib plots.
rcParams["figure.figsize"] = (12, 8)
plt.style.use("fivethirtyeight")

In [None]:
# Import custom modules
from visualize_tps_oct22 import *
from utilities_tps_oct22 import *

## 1. Get Raw Data

Kudos to Ayush Bihani! He helped me quickly read all the training data.

In [None]:
# Author: Ayush Bihani
# Source: https://www.kaggle.com/code/hsuyab/fast-loading-high-compression-with-feather/notebook

file_names = []
for dirname, _, filenames in os.walk("../input/tabular-playground-series-oct-2022"):
    for filename in filenames:
        file_names.append(os.path.join(dirname, filename))
        
train_paths = []
for k in file_names:
    if 'train_' in k and 'dtypes' not in k:
        train_paths.append(k)
        
train_paths

In [None]:
dtypes_df = pd.read_csv("../input/tabular-playground-series-oct-2022/train_dtypes.csv")
dtypes = {k: v for (k, v) in zip(dtypes_df.column, dtypes_df.dtype)}

# Read the ten training chunks and concatenate them in a single pass
df = pd.concat(
    [
        pd.read_csv(f"../input/tabular-playground-series-oct-2022/train_{i}.csv", dtype=dtypes)
        for i in range(10)
    ]
)
    
df = reduce_mem_usage(df)

## 2. Exploratory Data Analysis

Let's take a quick look at the type of data we are dealing with, the column names, and more.

In [None]:
df.info()

In [None]:
df.sample(5)

### 2.1. About Games, Events and Frames

First, we want to understand the basic concepts from this data. Our training dataset contains a set of games.

A game in Rocket League takes place between two teams of three players each (it can also be played 2v2, but this dataset is 3v3 only). To win the game, a team needs to score more goals than its opponent. The exciting part of this game is that you are driving powerful cars and can perform crazy maneuvers.

To learn more, check my post on [Better Understanding of Rocket League Might Help You](https://www.kaggle.com/competitions/tabular-playground-series-oct-2022/discussion/358801#1979635).

In [None]:
# How many games are available in the dataset?
df.game_num.nunique()

Alright, we know there are more than 7,000 games in this dataset. But each game might have more than one event!

An event is a stretch of play that ends when a team scores a goal, either team A or team B. Of course, an event can also end without either team scoring.

In [None]:
# How many events are available in the dataset?
df.event_id.nunique()

# Follow-up question: On average, how many events does a game have?
# Follow-up question: What proportion of events end with a goal, and what proportion don't?
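
The two follow-up questions can be answered with a couple of group-bys. The snippet below is only a sketch on a tiny made-up frame (`toy` and its values are hypothetical). It assumes that `team_scoring_next` is missing (NaN) for events that end without a goal, so check that assumption against the real data before reusing it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per frame, same column names as the training set.
toy = pd.DataFrame({
    "game_num": [1, 1, 1, 1, 2, 2],
    "event_id": [1001, 1001, 1002, 1002, 2001, 2001],
    "team_scoring_next": ["A", "A", np.nan, np.nan, "B", "B"],
})

# Collapse frames to one row per event.
events = toy.drop_duplicates("event_id")

# Average number of events per game.
events_per_game = events.groupby("game_num")["event_id"].nunique().mean()

# Share of events that end with a goal (assumes NaN means "no goal").
scored_share = events["team_scoring_next"].notna().mean()

# Breakdown by outcome: A, B, or no goal (NaN).
outcome_share = events["team_scoring_next"].value_counts(normalize=True, dropna=False)

print(events_per_game, scored_share)
print(outcome_share)
```

On the real data, replace `toy` with `df[["game_num", "event_id", "team_scoring_next"]]`.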

So, we finally have the lowest granularity level for the training set: a frame.

The Kaggle team recorded each event at 10 frames per second. Since we are interested in how each event ends, `event_time` is registered as a negative number: the time in seconds remaining until the event ends.

With this concept in mind, we also know that event durations can differ, even within a single game.

Imagine a game of Rocket League is starting. Team A takes the ball, manages to trick team B, and scores within the first 30 seconds. For this hypothetical situation, you could find `game_num` equal to 0, `event_id` equal to 1001, and 300 frames with a lot of information about what happened. The game continues, but neither team scores for the remaining time. In that case, you would find the same `game_num` with a new `event_id`, but many more frames, because no goal was scored and the recording was simply truncated.

In [None]:
# How many frames per event are available in the dataset?
(
    df.groupby("event_id")
    .event_time
    .count()
    .describe()
)

### 2.3. Event Time Duration

Since we can get the duration of any event, we are interested in segmenting each event and checking where the ball was during each part of it.

We want to understand how the ball position evolves until a team scores (or not!).

In [None]:
df_team_scoring = (
    df[["event_id", "team_scoring_next"]]
    .drop_duplicates()
    .set_index("event_id")
)

In [None]:
# With the following code I'm counting frames for each event, but I also want the flag of team_A|B_scoring_within_10sec
# for other purposes.
df_event_duration = (
    df_team_scoring
    .join(
        df
        .groupby(["event_id", "team_A_scoring_within_10sec"])
        [["event_time"]]
        .count()
        .pivot_table(index="event_id", columns="team_A_scoring_within_10sec", values="event_time")
        .rename(columns={0: "team_A_no_goal", 1: "team_A_goal"})
        # divide by 10 to get the number of seconds since event were recorded at 10 frames per second
        .div(10)
        )
    .join(
        df
        .groupby(["event_id", "team_B_scoring_within_10sec"])
        [["event_time"]]
        .count()
        .pivot_table(index="event_id", columns="team_B_scoring_within_10sec", values="event_time")
        .rename(columns={0: "team_B_no_goal", 1: "team_B_goal"})
        # divide by 10 to get the number of seconds since event were recorded at 10 frames per second
        .div(10)
        )
    ).reset_index()

# Remember, I split each event into two main segments: the last 10 seconds before a team scores and the rest of the event
df_event_duration["seconds_outside_scoring_range"] = df_event_duration.apply(lambda x: x["team_A_no_goal"] if x["team_scoring_next"] == "A" else x["team_B_no_goal"], axis=1)
df_event_duration["seconds_within_scoring_range"] = df_event_duration.apply(lambda x: x["team_A_goal"] if x["team_scoring_next"] == "A" else x["team_B_goal"], axis=1)

df_event_duration = (
    df_event_duration
    [["event_id", "team_scoring_next", "seconds_within_scoring_range", "seconds_outside_scoring_range"]]
    .set_index("event_id")
    )

df_event_duration["total_duration"] = df_event_duration[["seconds_within_scoring_range", "seconds_outside_scoring_range"]].sum(axis=1)
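
For readers who find the chained joins hard to follow, here is a more compact way to get comparable numbers. It is a sketch under stated assumptions, not the notebook's method: `toy` is made-up data, `FPS` is a hypothetical constant holding the 10 frames per second mentioned earlier, and a frame counts as "within scoring range" when either team's `*_scoring_within_10sec` flag is 1. Unlike the cell above, events without a goal get 0 seconds within range instead of NaN.

```python
import pandas as pd

# Hypothetical toy frames: event 1 ends with a goal by team A, event 2 ends without a goal.
toy = pd.DataFrame({
    "event_id": [1] * 5 + [2] * 4,
    "team_A_scoring_within_10sec": [0, 0, 1, 1, 1, 0, 0, 0, 0],
    "team_B_scoring_within_10sec": [0] * 9,
})

FPS = 10  # frames per second (assumption taken from the text)

# 1 if either team scores within the next 10 seconds, 0 otherwise.
in_range = (
    toy["team_A_scoring_within_10sec"] + toy["team_B_scoring_within_10sec"]
).gt(0).astype(int)

durations = (
    pd.crosstab(toy["event_id"], in_range)        # frame counts per event and flag
    .reindex(columns=[0, 1], fill_value=0)        # keep both columns even if one is absent
    .rename(columns={0: "seconds_outside_scoring_range", 1: "seconds_within_scoring_range"})
    .div(FPS)                                     # frames -> seconds
)
durations["total_duration"] = durations.sum(axis=1)

print(durations)
```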

In [None]:
# How is the event duration distributed? I'm particularly interested in what happens before a goal is scored,
# so I look at the seconds spent outside the "scoring range" (values above the 99th percentile are dropped)

(
    df_event_duration
    [df_event_duration["seconds_outside_scoring_range"] < df_event_duration["seconds_outside_scoring_range"].quantile(0.99)]
    .seconds_outside_scoring_range
    .plot.hist(
        bins=25, 
        title="Distribution of event duration outside scoring range (> 10s before next goal)"
        )
    );

In [None]:
# Taking both segments of an event into account (the 10 seconds before a goal and the rest of the event),
# we can see the average duration of a Rocket League event is 69 seconds (which is a great number btw)
df_event_duration["total_duration"].describe()

### 2.4. Average Ball Position (Every 10 seconds)

OK, enough vague insights; let's dig into the real deal.

We can split our events into 10-second buckets and visualize where the ball was (on average) in each one.

I'm choosing to create segments of 10 seconds since the challenge is to predict if a team will score within the next 10 seconds, so it makes sense (at least to me) to create these buckets with this time length.
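
To make the bucketing concrete, here is a small sketch of how `pd.cut` assigns `event_time` values to 10-second intervals. The sample values are made up, and the edges stop at -30 instead of the -300 used in the notebook, only to keep the output short.

```python
import pandas as pd

# Hypothetical event_time values: seconds before the event ends (negative).
event_time = pd.Series([-95.3, -12.0, -10.0, -9.9, -0.1])

# One edge every 10 seconds, plus a catch-all first bucket for everything earlier.
edges = [-1000] + list(range(-30, 0, 10)) + [0]  # [-1000, -30, -20, -10, 0]

# Intervals are right-closed by default: (-20, -10] contains -10.0 but not -20.0.
buckets = pd.cut(event_time, edges)

print(pd.DataFrame({"event_time": event_time, "bucket": buckets}))
```

Note that a frame exactly 10 seconds before the end falls into `(-20, -10]`, not `(-10, 0]`, because the right edge of each interval is inclusive.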

In [None]:
df_ball = df[[
    "event_id", "event_time", 
    "ball_pos_x", "ball_pos_y", "ball_pos_z", 
    "ball_vel_x", "ball_vel_y", "ball_vel_z", 
    "team_A_scoring_within_10sec", "team_B_scoring_within_10sec"
    ]].copy()

# I'm calculating some metrics for later modeling and analysis, bear with me
df_ball["scoring_within_10sec"] = df_ball["team_A_scoring_within_10sec"] + df_ball["team_B_scoring_within_10sec"]
df_ball["time_inverval_before_event_ending"] = pd.cut(df_ball["event_time"], [-1000] + list(range(-300, 0, 10)) + [0])
df_ball["ball_vel_scalar"] = np.sqrt(df_ball["ball_vel_x"]**2 + df_ball["ball_vel_y"]**2 + df_ball["ball_vel_z"]**2)
df_ball["ball_distance_from_center"] = np.sqrt(df_ball["ball_pos_x"]**2 + df_ball["ball_pos_y"]**2)


In [None]:
df_ball = df_ball[[
    "event_id", "scoring_within_10sec", "event_time", "time_inverval_before_event_ending", 
    "ball_pos_x", "ball_pos_y", "ball_pos_z", "ball_vel_scalar", "ball_distance_from_center"
    ]]

In [None]:
df_avg_ball_10s = (
    df_ball
    .groupby(["scoring_within_10sec", "event_id", "time_inverval_before_event_ending"])
    [["ball_pos_x", "ball_pos_y", "ball_pos_z", "ball_vel_scalar", "ball_distance_from_center"]]
    # I'm averaging my metrics, but feel free to suggest any other method you would have tried here
    .mean()
    .dropna(how="all")
    .reset_index()
    )

#### Plot a Sample of Average Ball Positions (Color = Velocity Scalar)

In [None]:
sample_df = df_avg_ball_10s[
    (df_avg_ball_10s["scoring_within_10sec"] == 1)
    & (1 < df_avg_ball_10s["ball_distance_from_center"])
    ].sample(frac=0.2).copy()

meta = {
    "game_num": False,
    "event_id": False,
    "event_time": False,
}

fig, ax = plt.subplots(1, figsize=(12, 8))
rl_fig, rl_ax = draw_rocket_league_ball(
    data=sample_df, 
    meta=meta,
    field=(fig, ax)
);

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.ndimage import gaussian_filter

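def myplot(x, y, s, bins=1000):
    """Return a Gaussian-smoothed 2D histogram of (x, y) and its extent for imshow."""
    # Count points on a bins x bins grid
    heatmap, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Smooth the counts; s is the standard deviation of the Gaussian kernel, in bins
    heatmap = gaussian_filter(heatmap, sigma=s)

    # Data limits in the order imshow expects: [left, right, bottom, top]
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]

    # histogram2d puts x along the first axis, so transpose before plotting
    return heatmap.T, extent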
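
The transpose in `myplot` matters because `np.histogram2d` puts `x` along the first axis (rows), while `imshow` draws rows along the vertical axis. The NumPy-only check below illustrates this with made-up points; the 2x2 grid and the `[0, 1]` range are assumptions chosen to keep the arrays readable.

```python
import numpy as np

# Hypothetical points: all of them sit at large x and small y.
x = np.array([0.9, 0.9, 0.9])
y = np.array([0.1, 0.1, 0.1])

counts, xedges, yedges = np.histogram2d(x, y, bins=2, range=[[0, 1], [0, 1]])

# counts[i, j] holds the points in x-bin i and y-bin j,
# so all three points land in counts[1, 0].
print(counts)

# After transposing, rows index y and columns index x. Combined with
# origin="lower" in imshow, the points show up at the bottom right.
image = counts.T
print(image)
```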

In [None]:
N = 9

cats = df_avg_ball_10s[df_avg_ball_10s["scoring_within_10sec"] == 0].time_inverval_before_event_ending.cat.categories.tolist()[-N:-1]
df_avg_ball_no10s = df_avg_ball_10s[df_avg_ball_10s["scoring_within_10sec"] == 0].copy()

#### Since plotting a bunch of points doesn't make a lot of sense, I decided to create some heatmaps!

In [None]:
# Author: Alejandro
# Source: https://stackoverflow.com/a/36515364

figs, axs = plt.subplots(3, 3, figsize=(20, 20))

for ax, i in zip(axs.flatten(), range(N)):
    
    if i == N-1:
        df_aux = df_avg_ball_10s[
            (df_avg_ball_10s["scoring_within_10sec"] == 1)
            & (10 < df_avg_ball_10s["ball_distance_from_center"])
            ]

        img, extent = myplot(df_aux.ball_pos_y, df_aux.ball_pos_x, 16)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title("Ball distribution when\ngoal is scored within 10s")
    
    else:
        df_aux = df_avg_ball_no10s[
            (df_avg_ball_no10s["scoring_within_10sec"] == 0)
            & (10 < df_avg_ball_no10s["ball_distance_from_center"])
            & (df_avg_ball_no10s["time_inverval_before_event_ending"] == cats[i])
            ]

        # Swap x and y so the goals end up at the left and right of the plot
        x = df_aux.ball_pos_y
        y = df_aux.ball_pos_x

        img, extent = myplot(x, y, 16)
        ax.imshow(img, extent=extent, origin='lower', cmap=cm.jet)
        ax.set_title(f"From {N*10 - (10*i)}s to {N*10 - (10*(i+1))}s")
        
# ===== ATTENTION =====
# X and Y are inverted to match the previous plot
# The scoring zones are at the left and right (not top/bottom)
# =====================