<img src="https://drive.google.com/uc?id=1xoRrlKGNXAYJ3zYjWOgvon6SmTQmvjo5" alt="data-vis-logo" width="150"/>


# Visualização de dados dentro e fora das quatro linhas
---

[PT] Neste notebook vamos explorar, com exemplos práticos, como retirar o máximo proveito da exploração visual para detectar padrões a partir de um conjunto de dados. 


[EN] In this notebook we'll walk through practical examples of how to use visual exploration to detect patterns in a dataset.

## 📦 Importar pacotes / Imports

In [1]:
from statsbombpy import sb
import altair as alt
import numpy as np
import pandas as pd
import re
from feedzai_altair_theme import tokens

## 🛠 Ferramentas Auxiliares / Helper Functions

In [2]:
# Carrega tema para a biblioteca de visualização de dados Altair.
# Loads theme for the Altair data visualization library.

alt.themes.enable("feedzai")

ThemeRegistry.enable('feedzai')

In [3]:
# SVG path da forma de uma cruz.
# SVG path of a cross.

cross_shape = "M 0.7 0.7 L -0.7 -0.7 M -0.7 0.7 L 0.7 -0.7"

In [4]:
# Escala de cores de valores booleanos para dois tons de verde-azulado. Utilizada na codificação da variável is_goal.
# Color scale from boolean values to two shades of teal. Used when encoding is_goal.

goal_color_scale = alt.Scale(
    domain=[True, False],
    range=[
        tokens.COLOR_PRIMITIVES["teal"]["50"],
        tokens.COLOR_PRIMITIVES["teal"]["20"]
    ])

In [5]:
# Função que desenha um campo de futebol.
# Function that renders a football pitch.

def render_pitch(width = 120, height = 80):
    pitch_color = tokens.COLOR_PRIMITIVES["teal"]["10"]
    line_color = "#fff"
    outside_lines = alt.Chart(pd.DataFrame({
        "x": [0, width, width, 0, width / 2],
        "y": [0, 0, height, height, 0]
    })).mark_rect(fill=pitch_color, opacity=0.8,stroke=line_color).encode(
        x=alt.X("x", axis=None),
        y=alt.Y("y", axis=None)
    )
    
    def render_pitch_lines(df):
        return alt.Chart(df.reset_index()).mark_line(
          stroke=line_color,
          strokeWidth = 1
        ).encode(
          x=alt.X("x"),
          y=alt.Y("y"),
          order='index'
        )
    
    penalty_box_left = render_pitch_lines(
      pd.DataFrame({
      "x": [0, 16.5, 16.5, 0, 0],
      "y": [height / 2 - 18.32 / 2 - 11, height / 2 - 18.32 / 2 - 11, height / 2 + 18.32 / 2 + 11, height / 2 + 18.32 / 2 + 11, height / 2 + 18.32 / 2 + 11],
    }))
    
    penalty_box_right = render_pitch_lines(
      pd.DataFrame({
      "x": [width, width - 16.5, width - 16.5, width, width],
      "y": [height / 2 - 18.32 / 2 - 11, height / 2 - 18.32 / 2 - 11, height / 2 + 18.32 / 2 + 11, height / 2 + 18.32 / 2 + 11, height / 2 + 18.32 / 2 + 11],
    }))
    
    centre_line = render_pitch_lines(
      pd.DataFrame({
          "x": [width / 2, width / 2],
          "y": [0, height]
      })
    )
    
    left_box = render_pitch_lines(
      pd.DataFrame({
      "x": [0, 5.5, 5.5, 0, 0],
      "y": [height / 2 + 18.32 / 2, height / 2 + 18.32 / 2, height / 2 - 18.32 / 2, height / 2 - 18.32 / 2, height / 2 + 18.32 / 2],
    }))
    
    right_box = render_pitch_lines(
      pd.DataFrame({
      "x": [width, width - 5.5, width - 5.5, width, width],
      "y": [height / 2 + 18.32 / 2, height / 2 + 18.32 / 2, height / 2 - 18.32 / 2, height / 2 - 18.32 / 2, height / 2 + 18.32 / 2],
    }))
    
    circles = alt.Chart(pd.DataFrame({
      "x": [11, width / 2, width - 11, width/2 ],
      "y": [height / 2, height / 2, height / 2, height / 2],
    })).mark_circle(fill=line_color, opacity=1).encode(
        x=alt.X("x"),
        y=alt.Y("y"),
    )
    
    centre_mark = alt.Chart(pd.DataFrame({
      "x": [width / 2],
      "y": [height /2]
    })).mark_arc(stroke=line_color, fill="transparent", opacity=1, radius=20).encode(
      x=alt.X("x"),
      y=alt.Y("y"),
    )
    
    return outside_lines + centre_line + penalty_box_left + left_box + penalty_box_right + right_box + circles + centre_mark
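
[EN] The penalty-area and goal-area outlines above repeat the same arithmetic several times. As a minimal sketch, and assuming a helper named `box_outline` (hypothetical, not part of the notebook), the corner coordinates of a box centred on the goal line could be computed once and reused for both ends of the pitch:

```python
def box_outline(pitch_width, pitch_height, depth, box_width, left=True):
    """Return x and y coordinates tracing a closed box centred on the goal line.

    `box_outline` is a hypothetical helper used only for illustration.
    """
    near = 0 if left else pitch_width
    far = depth if left else pitch_width - depth
    top = pitch_height / 2 - box_width / 2
    bottom = pitch_height / 2 + box_width / 2
    # The last point repeats the first one so the outline is closed.
    xs = [near, far, far, near, near]
    ys = [top, top, bottom, bottom, top]
    return xs, ys


# Penalty area: 16.5 deep, goal width (18.32) plus 11 on each side.
xs, ys = box_outline(120, 80, depth=16.5, box_width=18.32 + 2 * 11)
right_xs, right_ys = box_outline(120, 80, depth=16.5, box_width=18.32 + 2 * 11, left=False)
```

The two lists can be passed to `pd.DataFrame({"x": xs, "y": ys})` in the same way the cell above builds its line data.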

## ⚽️ Carregar dados / Fetch data

[PT] Vamos explorar dados da [StatsBomb](https://statsbomb.com/), que agrega dados multidimensionais de eventos por jogo. Utilizamos a biblioteca [statsbombpy](https://github.com/statsbomb/statsbombpy) para carregarmos facilmente dados StatsBomb em Python.

Neste workshop vamos analisar dados do **Europeu Feminino UEFA 2022**.

--- 

[EN] We will use data from [StatsBomb](https://statsbomb.com/), which collects rich event data for each match. We use the [statsbombpy](https://github.com/statsbomb/statsbombpy) package to easily fetch StatsBomb data into Python.

For today's workshop we will focus on the **2022 UEFA Women's Euro**.

In [6]:
# Parâmetros de pesquisa.
# Search parameters.

W_EURO = {
    "division": "UEFA Women's Euro",
    "country": "Europe",
    "season": "2022",
    "gender": "female"
}

In [None]:
euro_events = sb.competition_events(**W_EURO)

[PT] O conjunto de dados `euro_events` contém todos os eventos de todos os jogos do Euro. Vamos espreitar:

[EN] Our dataset `euro_events` contains all events for the matches that happened in the Euro. Let's take a look:

In [8]:
euro_events

Unnamed: 0,50_50,bad_behaviour_card,ball_receipt_outcome,ball_recovery_offensive,ball_recovery_recovery_failure,block_deflection,block_offensive,block_save_block,carry_end_location,clearance_aerial_won,...,shot_statsbomb_xg,shot_technique,shot_type,substitution_outcome,substitution_replacement,tactics,team,timestamp,type,under_pressure
0,,,,,,,,,,,...,,,,,,"{'formation': 4231, 'lineup': [{'player': {'id...",Sweden Women's,00:00:00.000,Starting XI,
1,,,,,,,,,,,...,,,,,,"{'formation': 4411, 'lineup': [{'player': {'id...",Switzerland Women's,00:00:00.000,Starting XI,
2,,,,,,,,,,,...,,,,,,"{'formation': 4231, 'lineup': [{'player': {'id...",Netherlands Women's,00:00:00.000,Starting XI,
3,,,,,,,,,,,...,,,,,,"{'formation': 343, 'lineup': [{'player': {'id'...",Sweden Women's,00:00:00.000,Starting XI,
4,,,,,,,,,,,...,,,,,,"{'formation': 4231, 'lineup': [{'player': {'id...",England Women's,00:00:00.000,Starting XI,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
105152,,,,,,,,,,,...,,,,,,,Switzerland Women's,00:03:27.394,Own Goal Against,
105153,,,,,,,,,,,...,,,,,,,England Women's,00:30:28.686,Own Goal For,
105154,,,,,,,,,,,...,,,,,,,France Women's,00:44:00.134,Own Goal For,
105155,,,,,,,,,,,...,,,,,,,Sweden Women's,00:51:22.764,Own Goal For,


[PT] Podemos reparar que o conjunto de dados é esparso. Como já devem ter notado, muitas das colunas só fazem sentido para determinados tipos de eventos. Por exemplo, `shot_body_part`, `shot_deflected` ou `shot_end_location` só são relevantes para eventos do tipo `Shot`. Este formato mais horizontal segue a convenção
[tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). Em conjuntos de dados tidy, as linhas são observações e as colunas são variáveis. Este formato torna mais fácil trabalhar os dados para visualizações.

[EN] We can see that the dataset is very sparse. As you've probably noticed, many of the columns only make sense for some types of events. For example, `shot_body_part`, `shot_deflected` and `shot_end_location` are only relevant for events of type `Shot`. This wide format follows the [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html) convention: rows are observations and columns are variables. This standard makes it easier to wrangle data for visualizations.
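
[EN] To make the sparsity concrete, here is a minimal sketch in plain Python, using a made-up three-row table in the same wide layout (the values and the `populated_columns` helper are illustrative assumptions, not StatsBomb data or API). It lists which columns actually hold a value for a given event type:

```python
# Toy events in a wide layout; values are made up for illustration.
events = [
    {"type": "Shot", "player": "Player A", "shot_outcome": "Goal", "pass_length": None},
    {"type": "Pass", "player": "Player B", "shot_outcome": None, "pass_length": 12.5},
    {"type": "Shot", "player": "Player B", "shot_outcome": "Saved", "pass_length": None},
]


def populated_columns(rows, event_type):
    """Return the sorted column names holding at least one value for event_type."""
    selected = [row for row in rows if row["type"] == event_type]
    return sorted(
        {key for row in selected for key, value in row.items() if value is not None}
    )


shot_columns_in_use = populated_columns(events, "Shot")
pass_columns_in_use = populated_columns(events, "Pass")
print(shot_columns_in_use)
print(pass_columns_in_use)
```

Each event type fills only its own slice of columns, which is why most cells in the full table are empty.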




In [9]:
# Tipos de eventos
# Types of events
set(euro_events["type"])

{'50/50',
 'Bad Behaviour',
 'Ball Receipt*',
 'Ball Recovery',
 'Block',
 'Carry',
 'Clearance',
 'Dispossessed',
 'Dribble',
 'Dribbled Past',
 'Duel',
 'Error',
 'Foul Committed',
 'Foul Won',
 'Goal Keeper',
 'Half End',
 'Half Start',
 'Injury Stoppage',
 'Interception',
 'Miscontrol',
 'Offside',
 'Own Goal Against',
 'Own Goal For',
 'Pass',
 'Player Off',
 'Player On',
 'Pressure',
 'Referee Ball-Drop',
 'Shield',
 'Shot',
 'Starting XI',
 'Substitution',
 'Tactical Shift'}

In [10]:
# Variáveis disponíveis
# Available variables
list(euro_events.columns)

['50_50',
 'bad_behaviour_card',
 'ball_receipt_outcome',
 'ball_recovery_offensive',
 'ball_recovery_recovery_failure',
 'block_deflection',
 'block_offensive',
 'block_save_block',
 'carry_end_location',
 'clearance_aerial_won',
 'clearance_body_part',
 'clearance_head',
 'clearance_left_foot',
 'clearance_other',
 'clearance_right_foot',
 'counterpress',
 'dribble_no_touch',
 'dribble_nutmeg',
 'dribble_outcome',
 'dribble_overrun',
 'duel_outcome',
 'duel_type',
 'duration',
 'foul_committed_advantage',
 'foul_committed_card',
 'foul_committed_offensive',
 'foul_committed_penalty',
 'foul_committed_type',
 'foul_won_advantage',
 'foul_won_defensive',
 'foul_won_penalty',
 'goalkeeper_body_part',
 'goalkeeper_end_location',
 'goalkeeper_outcome',
 'goalkeeper_position',
 'goalkeeper_punched_out',
 'goalkeeper_shot_saved_off_target',
 'goalkeeper_shot_saved_to_post',
 'goalkeeper_success_in_play',
 'goalkeeper_technique',
 'goalkeeper_type',
 'half_start_late_video_start',
 'id',
 'i

In [11]:
# Variáveis calculadas
# Calculated variables

euro_events.loc[:,"is_goal"] = euro_events.apply(
    lambda row: row.shot_outcome == "Goal", axis=1)
euro_events.loc[:,"location_x"] = euro_events.apply(
    lambda row: row.location[0] if isinstance(row.location, list) else np.nan, axis=1)
euro_events.loc[:,"location_y"] = euro_events.apply(
    lambda row: row.location[1] if isinstance(row.location, list) else np.nan, axis=1)
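
[EN] The location handling above can be sketched in isolation. This assumes, as the `isinstance` check in the cell suggests, that events without a pitch position carry a missing value rather than a list; `split_location` is a hypothetical helper name:

```python
import math


def split_location(location):
    """Return (x, y) from an [x, y] list, or (nan, nan) when there is no position."""
    if isinstance(location, list) and len(location) >= 2:
        return float(location[0]), float(location[1])
    return math.nan, math.nan


x, y = split_location([108.3, 41.2])
missing_x, missing_y = split_location(None)
```

Returning NaN instead of raising keeps rows without a position in the table, and they can simply be left out of any chart that encodes `location_x` and `location_y`.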

## 🧭 Começar a explorar / Starting the Exploration

[PT] Quando temos um novo conjunto de dados em mãos, é útil calcular algumas estatísticas sumárias. Há algumas ferramentas genéricas, como [pandas-profiling](https://pandas-profiling.ydata.ai/docs/master/index.html), que permitem obter rapidamente estatísticas e gráficos descritivos.

Nesta sessão, vamos saltar alguns destes passos à frente, e focar-nos em como explorar histórias nos nossos dados.

- Quem foi a jogadora inglesa que mais rematou? Foi a mesma jogadora que marcou mais golos? [**Ranking** / Ordered Bar]

---- 

[EN] Often, when we get our hands on a new dataset, we start by running some summary statistics to get a feel for it. There are great generic tools, such as [pandas-profiling](https://pandas-profiling.ydata.ai/docs/master/index.html), that produce a quick first summary of the data with statistics, data health checks, and charts of the distributions.

For today's session, we will move a few steps ahead, and focus on how to explore a story in our data.

- Out of the English players, who was the top scorer? Was it the same player that made the most shots? [**Ranking** / Ordered Bar]
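
[EN] The counting behind a ranked bar chart can be sketched in a few lines of plain Python. The player names and outcomes below are hypothetical placeholders, not tournament data:

```python
from collections import Counter

# Hypothetical (player, is_goal) pairs; not real tournament data.
shots = [
    ("Player A", False), ("Player A", True), ("Player A", False),
    ("Player B", True), ("Player B", True),
    ("Player C", False),
]

shots_per_player = Counter(player for player, _ in shots)
goals_per_player = Counter(player for player, is_goal in shots if is_goal)

# most_common() sorts by count in descending order, like the bars in the chart.
ranking = shots_per_player.most_common()
top_shooter = ranking[0][0]
top_scorer = goals_per_player.most_common(1)[0][0]
```

In this toy sample the player with the most shots is not the one with the most goals, which is exactly the kind of difference the stacked colours in the chart are meant to reveal.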

In [12]:
england_shots = euro_events[
    (euro_events["team"] == "England Women's") & 
    (euro_events["type"] == "Shot")
  ]

# Cada barra corresponde a uma jogadora. Como os nomes das jogadoras podem ser
# um pouco extensos, utilizamos um gráfico de barras horizontal em vez de vertical.

# The labels for each bar are players' names. As these can be quite long, we plot
# a horizontal bar chart so there's enough room for the text.

alt.Chart(england_shots).mark_bar(
    ).encode(
    y=alt.Y("player", sort="-x", axis=alt.Axis(title="Player")),
    x=alt.X("count()", axis=alt.Axis(title="Number of shots")),
    # color="shot_outcome"
    color = alt.Color("is_goal", scale=goal_color_scale)
).properties(
    
    # title={ 
    #    "text": ["Ellen White, Georgia Stanway, and Beth Mead made the most shots"],
    #    "subtitle": ["Among the players with most shots, Beth Mead is the most successful"]
    # }
)

In [13]:
# Os remates não são todos iguais, que outras variáveis relevantes existem?
# Not every shot is the same, what other relevant variables are there?
shot_pattern = re.compile(r'shot_')
shot_columns = [c for c in euro_events.columns if shot_pattern.match(c)]
shot_columns

['shot_aerial_won',
 'shot_body_part',
 'shot_deflected',
 'shot_end_location',
 'shot_first_time',
 'shot_freeze_frame',
 'shot_key_pass_id',
 'shot_one_on_one',
 'shot_open_goal',
 'shot_outcome',
 'shot_redirect',
 'shot_saved_off_target',
 'shot_saved_to_post',
 'shot_statsbomb_xg',
 'shot_technique',
 'shot_type']
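
[EN] A note on selecting columns by prefix with a regular expression: `_*` means "zero or more underscores", so a pattern written as `shot_*` also accepts names that merely start with `shot`. The sketch below compares it with the literal prefix `shot_` and with `str.startswith`; the column `shots_total` is a hypothetical name added only to show the difference:

```python
import re

# "shots_total" is a hypothetical column used only to show the difference.
columns = ["shot_outcome", "shot_type", "shots_total", "team"]

loose = re.compile(r"shot_*")   # "shot" followed by zero or more underscores
strict = re.compile(r"shot_")   # the literal prefix "shot_"

loose_matches = [c for c in columns if loose.match(c)]
strict_matches = [c for c in columns if strict.match(c)]
prefix_matches = [c for c in columns if c.startswith("shot_")]
```

For a plain prefix test, `str.startswith` says the same thing without any regular-expression syntax to misread.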

### 🧤 Golos improváveis / Unlikely goals

[PT] [Golos esperados (xG)](https://statsbomb.com/soccer-metrics/expected-goals-xg-explained/) é uma métrica desenhada para medir a probabilidade de um remate resultar em golo.
Por exemplo, um remate com um valor xG de 0.2 é um remate que esperaríamos ver convertido em golo duas vezes em cada 10 tentativas.

- Quem marcou golo a partir dos remates mais fáceis/difíceis? [**Distribution** / Dot plot]

--- 

[EN] [Expected Goals (xG)](https://statsbomb.com/soccer-metrics/expected-goals-xg-explained/) is a metric designed to measure the probability of a shot
resulting in a goal. For example, a shot with an xG value of 0.2 is one that we would generally expect to be converted twice in every 10 attempts.

- Who is scoring easier/more difficult shots? [**Distribution** / Dot plot]
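
[EN] The "twice in every 10 attempts" reading follows from adding up probabilities: the sum of the xG values of a set of shots is the number of goals we would expect from them. A minimal sketch, where `expected_goals` is an illustrative helper and not part of statsbombpy:

```python
def expected_goals(xg_values):
    """Sum per-shot xG values into the number of goals we would expect."""
    return sum(xg_values)


# Ten shots that each have a 20% chance of going in.
ten_similar_shots = [0.2] * 10
total = expected_goals(ten_similar_shots)
print(round(total, 2))
```

Comparing this total with the goals a player actually scored is one way to tell whether she converted easy chances or difficult ones.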

In [14]:
# Com um dot plot conseguimos uma visão mais granular dos dados.
# A dot plot allows us to have a more granular view of the data.

alt.Chart(england_shots).mark_point().encode(
    x=alt.X("shot_statsbomb_xg", scale=alt.Scale(domain=[0,1])),
    y=alt.Y("player"),
    color=alt.Color("is_goal", scale=alt.Scale(domain=[True, False], range=[tokens.COLOR_PRIMITIVES["teal"]["60"], tokens.COLOR_PRIMITIVES["teal"]["20"]])),
) + alt.Chart(
  # Desenhamos uma linha vertical na marca 0.5 para separar os remates que tinham uma probabilidade >50% de marcar golo.
  # We render a rule at the 0.5 mark to clearly split off shots that had a >50% chance of scoring a goal.
  pd.DataFrame({"x": [0.5]})).mark_rule(
      stroke=tokens.COLOR_PRIMITIVES["neutral"]["20"]
      ).encode(x="x") 

### 🏟 Em campo! / To the pitch! 

[PT] O que pode ter tornado os remates mais fáceis ou difíceis? A posição em campo de onde a jogadora remata é um factor importante para prever se pode resultar em golo. Assim, vamos analisar como a dimensão do espaço influencia os remates das jogadoras. 

- De onde foram feitos os remates [**Spatial** / Heatmap ]

---

[EN] What makes shots harder or easier to take? Where a player is standing on the pitch when she shoots is an important factor in whether the shot succeeds. Let's now take a look at how the spatial dimension affects the players' shots.

- Where did the shots happen? [**Spatial** / Heatmap ]
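
[EN] A heatmap is a 2D histogram: each shot is assigned to a grid cell and the cells are coloured by their count. The sketch below does that binning by hand with a fixed cell size, which is an assumption made for illustration (Altair's `bin=True` picks its own bin boundaries). The shot positions are hypothetical:

```python
from collections import Counter


def bin_shots(locations, bin_size=10):
    """Count shots per square grid cell of bin_size x bin_size pitch units."""
    counts = Counter()
    for x, y in locations:
        # Integer division maps a coordinate to the index of its cell.
        cell = (int(x // bin_size), int(y // bin_size))
        counts[cell] += 1
    return counts


# Hypothetical shot positions on a 120 x 80 pitch.
locations = [(108.0, 41.0), (102.5, 38.0), (109.9, 44.0), (95.0, 22.0)]
cell_counts = bin_shots(locations)
```

Larger cells give a smoother overview and smaller cells show more detail, but in both cases the individual shots disappear into the counts.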

In [15]:
# A função auxiliar render_pitch desenha o campo de jogo.
# We use the render_pitch function to render the playing field.
render_pitch() + alt.Chart(england_shots).mark_rect(
    ).encode(
      x=alt.X("location_x", bin=True),
      y=alt.Y("location_y",  bin=True),
      opacity="count()",
      color=alt.Color("count()", scale=alt.Scale(range=tokens.COLORS["schemes"]["sequential"]["lavenders"][::-1]))
)

[PT] Embora esta visão agregada nos permita uma visão geral dos dados, não nos permite explorar padrões multi-dimensionais. Vamos alterar o nosso gráfico de forma a dar-nos uma visão mais granular dos dados, que codifique também outras variáveis.

Conseguimos encontrar padrões quando olhamos para propriedades de remates individuais? [**Spatial** / Dot density ]

---

[EN] Once again, this aggregated view gives us the big picture of the data, but doesn't allow us to explore multidimensional patterns. Let's change our chart to a more granular view that allows us to encode other variables.

Can we find patterns when we look at properties of individual shots? [**Spatial** / Dot density ]


In [16]:
render_pitch() + alt.Chart(england_shots).mark_point(
      opacity=0.75,
      fill="transparent",
      strokeWidth=1.8,
      size=50
    ).encode(
      x=alt.X("location_x"),
      y=alt.Y("location_y"),
      stroke=alt.Stroke("shot_statsbomb_xg", scale=alt.Scale(range=tokens.COLORS["schemes"]["sequential"]["lavenders"][:-2][::-1])),
      shape=alt.Shape("is_goal", scale=alt.Scale(domain=[True, False], range=["circle", cross_shape])),
      # fill=alt.Color("shot_statsbomb_xg", scale=alt.Scale(range=tokens.COLORS["schemes"]["sequential"]["lavenders"][:-2][::-1])),
      # size="shot_statsbomb_xg"
)

## 🪄 Interação / Interactivity

[PT] O gráfico de barras e a visão espacial dos remates permitiram-nos analisar os dados de perspectivas diferentes. Como vimos no último gráfico, ao codificarmos mais e mais variáveis no mesmo gráfico este pode tornar-se confuso. Como poderíamos cruzar as duas análises e encontrar padrões entre estas variáveis? A **interactividade** é uma ferramenta que nos permite criar estas análises multidimensionais ricas.

---

[EN] The bar chart and the spatial view of the shots gave us different perspectives on the data. As we've just seen in the last chart, adding more and more visual encodings to a single chart can quickly make it confusing. How might we combine our two previous analyses and find patterns across these variables? **Interactivity** is a powerful tool for creating rich multidimensional analyses.
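
[EN] Linking two views boils down to each view filtering the data shown in the other. As a rough, library-free sketch of that idea (the data, `brush_filter` and `click_filter` are hypothetical and only mimic what a linked selection does): a rectangle brushed on the pitch restricts which shots are counted in the bars, and a clicked player restricts which shots are drawn on the pitch.

```python
from collections import Counter

# Hypothetical shots as (player, x, y); not real tournament data.
shots = [
    ("Player A", 108.0, 40.0),
    ("Player A", 90.0, 30.0),
    ("Player B", 110.0, 44.0),
    ("Player C", 85.0, 60.0),
]


def brush_filter(rows, x_range, y_range):
    """Keep the shots that fall inside a rectangular brush on the pitch."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [r for r in rows if x_min <= r[1] <= x_max and y_min <= r[2] <= y_max]


def click_filter(rows, players):
    """Keep the shots of the clicked players; an empty selection keeps everything."""
    return rows if not players else [r for r in rows if r[0] in players]


# Brushing a region near the goal updates the bar counts.
in_box = brush_filter(shots, x_range=(102, 120), y_range=(18, 62))
bars = Counter(player for player, _, _ in in_box)

# Clicking a bar updates the points drawn on the pitch.
points = click_filter(shots, {"Player A"})
```

An interactive chart runs these two filters in the browser on every brush or click, so the reader can ask such questions without writing any code.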




In [17]:
def shot_map(team, color="blue"):
    goal_color_scale = alt.Scale(
          domain=[True, False], 
          range=[
              tokens.COLOR_PRIMITIVES[color]["60"], 
              tokens.COLOR_PRIMITIVES[color]["30"]
      ])

    # Setup brush (field) and click (bar) interactions.
    # Configuramos as interações de brush (campo) e click (barras).
    #
    # https://altair-viz.github.io/user_guide/interactions.html
    brush = alt.selection_interval()
    click = alt.selection_multi(fields=['player'])

    country_shots = euro_events[
        (euro_events["team"] == team) & 
        (euro_events["type"] == "Shot")
    ]

    # Vamos precisar de fixar qual é o domínio do gráfico de barras para que não se altere entre interações/filtros
    # We will need to fix the domain of the bar chart so that it doesn't change when a filter is active.
    players_with_shots_sorted = list(
      country_shots.groupby(["player"]).size().sort_values(ascending=False).index
    )
    
    # Simbolos dos remates no campo.
    # Symbols per shot to be rendered on the pitch.
    shots = alt.Chart(country_shots).mark_point(
      opacity=0.75,
      fill="transparent",
      strokeWidth=1.8,
      size=50
      ).encode(
        x="location_x:Q",
        y="location_y:Q",
        stroke=alt.Color("is_goal:N", scale=goal_color_scale ),
        size=alt.Size("shot_statsbomb_xg:Q", scale=alt.Scale(domain=[0,1])),
        shape=alt.Shape("is_goal:N", scale=alt.Scale(
            domain=[True, False], 
            range=["circle", cross_shape]
        )),
        opacity=alt.condition(brush, alt.value(1), alt.value(0.3))
    ).add_selection(brush).transform_filter(
        click
    )
    
    # Ordered bar chart.
    # Gráfico de barras ordenado.
    players_shots = alt.Chart(country_shots).mark_bar().encode(
      y=alt.Y(
          "player:N",
          sort="-x",
          scale=alt.Scale(domain=players_with_shots_sorted)
      ),
      x=alt.X("count()", scale=alt.Scale(domain=[0,18])),
      color=alt.Color("is_goal:N", scale=goal_color_scale),
      opacity=alt.condition(click, alt.value(1), alt.value(0.5))
    ).transform_filter(
      brush
    ).add_selection(
      click
    )
    
    # Após interação, as barras são filtradas para corresponder à selecção. De forma a
    # manter o contexto do total de remates por jogadora, desenhamos o contorno da barra 
    # completa, que está sempre visível.
    # After interaction, the stacked bars are filtered to match selection. In order to 
    # keep context of how many shots each player made, we render the outline of the 
    # full bar at all times.
    players_shots_bg = alt.Chart(country_shots).mark_bar(
      fill="transparent", 
      strokeWidth=0.2, 
      stroke=tokens.COLOR_PRIMITIVES["neutral"]["20"]
      ).encode(
        y=alt.Y("player:N", 
                sort="-x", 
                scale=alt.Scale(domain=players_with_shots_sorted), 
                axis=alt.Axis(title="")
        ),
        x=alt.X("count()", 
                scale=alt.Scale(domain=[0,18]), 
                axis=alt.Axis(tickMinStep=1, title="Number of shots")
        ),
    )
    
    country_shot_map = render_pitch() + shots
    players_shot_bars = players_shots_bg + players_shots
    
    return (country_shot_map & players_shot_bars).properties(
        title=f"{team} Shot Map"
    )


shot_map("England Women's")

In [18]:
(shot_map("England Women's", "blue") | 
    shot_map("Portugal Women's", "red" )
).resolve_scale(
    color="independent",
    stroke="independent"
)

## 📚 Material Adicional / Additional Resources

*  [Altair Documentation](https://altair-viz.github.io/)
*  [statsbombpy package](https://github.com/statsbomb/statsbombpy)
*  [Accessing & Working With StatsBomb Data In R](https://statsbomb.com/wp-content/uploads/2021/11/Working-with-R.pdf)
*  [U. Washington Data Visualization Curriculum with Altair](https://uwdata.github.io/visualization-curriculum/intro.html)

