### Data Modelling #2 Overview

In this lesson, we will create a `TeamStats` class to represent a team's overall statistics in the tournament (goals scored, goals conceded, games that went to penalties, etc.).

In [None]:
import time
from bs4 import BeautifulSoup
import pandas as pd
import requests

In the last tutorial, we had the following code to gather Euro 2021 data:

In [None]:
base_url = 'https://www.bbc.co.uk/sport/football/european-championship/scores-fixtures'
start_date = '2021-06-11'
end_date = '2021-07-11'
KNOCKOUT_GAMES_START = pd.Timestamp('2021-06-26')

# generate tournament dates and URLs
tournament_dates = pd.date_range(start_date, end_date)
urls = [f"{base_url}/{dt.date()}" for dt in tournament_dates]

# container to store results
results = []

def show_result(home, home_goals, away, away_goals, pens=None) -> str:
  """ Stringifies a result from the scraped data """
  if pens:
    return f"{home} {home_goals} - {away_goals} {away} ({pens})"  
  return f"{home} {home_goals} - {away_goals} {away}"


for url in urls:
  response = requests.get(url)
  # time.sleep(1)
  
  soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly

  # get all fixtures on the page
  fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})

  for fixture in fixtures:
    home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
    away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
    home_goals = fixture.select_one('.sp-c-fixture__number--home').text
    away_goals = fixture.select_one('.sp-c-fixture__number--away').text

    game_date = pd.Timestamp(url.split("/")[-1])
    if game_date >= KNOCKOUT_GAMES_START:
      pens = fixture.select_one('.sp-c-fixture__win-message')
      if pens is not None:
        results.append(show_result(home, home_goals, away, away_goals, pens.text))
        continue
    
    results.append(show_result(home, home_goals, away, away_goals))

In [None]:
results[::5] # look at every 5th result

['Turkey 0 - 3 Italy',
 'Netherlands 3 - 2 Ukraine',
 'Hungary 0 - 3 Portugal',
 'Denmark 1 - 2 Belgium',
 'Sweden 1 - 0 Slovakia',
 'Switzerland 3 - 1 Turkey',
 'Croatia 3 - 1 Scotland',
 'Portugal 2 - 2 France',
 'Croatia 3 - 5 Spain',
 'Belgium 1 - 2 Italy',
 'Italy 1 - 1 England (Italy win 3-2 on penalties)']

We want to create a `Result` class that models this data in a better, more extensible and flexible manner. Currently, we only have strings representing each result.

Let's use Python's `dataclasses` to create a class container for results.

The class will have the following attributes:
- `home`: the home team
- `away`: the away team
- `home_goals`: the number of goals the home team scored.
- `away_goals`: the number of goals the away team scored.
- `penalty_winner`: the winner on penalties. Specified as `Optional[str]` with a default of `None` because matches might not go to penalties.
- `penalty_score`: the penalty score. Specified as `Optional[str]` with a default of `None` because matches might not go to penalties.

We will also define some helpful methods that encapsulate some functionality that we are interested in. Some methods we will implement include:

- `is_draw()`: was the result a draw?
- `winner()`: returns the winner, or `None` if the result was a draw
- `loser()`: returns the loser, or `None` if the result was a draw
- `goals_scored()`: returns the total number of goals scored in the match
- `goal_difference()`: the absolute difference between the two teams' goal counts. Example: 4-1 and 1-4 both give a difference of 3

In [None]:
from dataclasses import dataclass
from typing import ClassVar, Optional

@dataclass
class Result:
  home: str
  away: str
  home_goals: int
  away_goals: int
  penalty_winner: Optional[str] = None # "Italy"
  penalty_score: Optional[str]  = None # "5-4"

  def is_draw(self) -> bool:
    score_draw = self.home_goals == self.away_goals
    is_group_match = self.penalty_winner is None
    return score_draw and is_group_match

  def winner(self) -> Optional[str]:
    if self.is_draw(): return None

    if self.home_goals > self.away_goals:
      return self.home
    elif self.away_goals > self.home_goals:
      return self.away
    else:
      return self.penalty_winner

  def loser(self) -> Optional[str]:
    if self.is_draw(): return None
    if self.home_goals < self.away_goals:
      return self.home
    elif self.away_goals < self.home_goals:
      return self.away
    else:
      return self.home if self.penalty_winner == self.away else self.away

  def goals_scored(self) -> int:
    return self.home_goals + self.away_goals

  def goal_difference(self) -> int:
    return abs(self.home_goals - self.away_goals)

  def __contains__(self, team):
    return team in [self.home, self.away]

  def __str__(self):
    return f"{self.home} {self.home_goals}-{self.away_goals} {self.away}"

Soon, we are going to modify our loop to store the data as `Result` instances, rather than simply using strings.

### Penalty Data

First, though, we need to work out how to extract both the winning team's name and the score for matches that were decided by penalties.

When a knockout match goes to penalties, the winning team is displayed with the following message:

- `<TEAM_NAME> win 5-4 on penalties`

We need to extract both the `<TEAM_NAME>` and the score from this expression.

For the team name, we can split on the space and take the first element of the returned list. This works here because every team that won a shootout in this tournament has a single-word name.

For the score, we will define a regular expression that finds the following pattern: one or more digits, followed by a `-`, followed by one or more digits.

Code for this below.

In [None]:
import re

msg = "Italy win 5-4 on penalties"

winner = msg.split(" ")[0]
print(winner)

re.search(r"\d+-\d+", msg) # raw string for the pattern; call .group() to retrieve the text

Italy


<re.Match object; span=(10, 13), match='5-4'>
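Splitting on the first space would shorten a multi-word name such as "Czech Rep" or "North Macedonia" to its first word. Below is a minimal sketch of an alternative that captures the team name and the score with one regular expression. It assumes the message always has the form `<TEAM_NAME> win X-Y on penalties`. `parse_penalty_message` is a hypothetical helper, not part of the original code, and the "Czech Rep" message is invented for illustration.

```python
import re
from typing import Optional, Tuple

# Assumed message format: "<TEAM_NAME> win <X>-<Y> on penalties"
PENALTY_PATTERN = re.compile(r"^(?P<team>.+?) win (?P<score>\d+-\d+) on penalties$")

def parse_penalty_message(msg: str) -> Optional[Tuple[str, str]]:
  """Return (winner, score), or None if the message does not match."""
  match = PENALTY_PATTERN.match(msg.strip())
  if match is None:
    return None
  return match.group("team"), match.group("score")

print(parse_penalty_message("Italy win 3-2 on penalties"))      # ('Italy', '3-2')
print(parse_penalty_message("Czech Rep win 5-4 on penalties"))  # invented message
```

Returning `None` for an unexpected message lets the caller decide how to handle a fixture whose win message has a different wording.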

We can now add this code to the loop that collects the data, for knockout games that contain an element with the `.sp-c-fixture__win-message` class.

In [None]:
results = []

for url in urls:
  response = requests.get(url)
  time.sleep(1)
  
  soup = BeautifulSoup(response.text, 'html.parser')  # name the parser explicitly

  # get all fixtures on the page
  fixtures = soup.find_all('article', {'class': 'sp-c-fixture'})

  for fixture in fixtures:
    home = fixture.select_one('.sp-c-fixture__team--home .sp-c-fixture__team-name-trunc').text
    away = fixture.select_one('.sp-c-fixture__team--away .sp-c-fixture__team-name-trunc').text
    home_goals = fixture.select_one('.sp-c-fixture__number--home').text
    away_goals = fixture.select_one('.sp-c-fixture__number--away').text

    game_date = pd.Timestamp(url.split("/")[-1])
    if game_date >= KNOCKOUT_GAMES_START:
      pens = fixture.select_one('.sp-c-fixture__win-message')
      if pens is not None:

        # extract penalty winner from string:
        # TEAM_NAME win 5-4 on penalties
        pen_winner = pens.text.split(" ")[0]

        # get the score using a regular expression
        pen_score = re.search(r"\d+-\d+", pens.text).group()

        results.append(Result(
            home, 
            away, 
            int(home_goals), 
            int(away_goals),
            penalty_winner=pen_winner,
            penalty_score=pen_score)
        )
        continue
    
    results.append(Result(
        home,
        away,
        int(home_goals),
        int(away_goals))
    )

In [None]:
results[:10]

[Result(home='Turkey', away='Italy', home_goals=0, away_goals=3, penalty_winner=None, penalty_score=None),
 Result(home='Wales', away='Switzerland', home_goals=1, away_goals=1, penalty_winner=None, penalty_score=None),
 Result(home='Denmark', away='Finland', home_goals=0, away_goals=1, penalty_winner=None, penalty_score=None),
 Result(home='Belgium', away='Russia', home_goals=3, away_goals=0, penalty_winner=None, penalty_score=None),
 Result(home='Austria', away='North Macedonia', home_goals=3, away_goals=1, penalty_winner=None, penalty_score=None),
 Result(home='Netherlands', away='Ukraine', home_goals=3, away_goals=2, penalty_winner=None, penalty_score=None),
 Result(home='England', away='Croatia', home_goals=1, away_goals=0, penalty_winner=None, penalty_score=None),
 Result(home='Scotland', away='Czech Rep', home_goals=0, away_goals=2, penalty_winner=None, penalty_score=None),
 Result(home='Poland', away='Slovakia', home_goals=1, away_goals=2, penalty_winner=None, penalty_score=None

We can now use our new object-oriented approach to analyze the data a bit more "naturally".

Let's look at the winners - we'll use a list comprehension to call the `winner()` method for each of our `Result` objects that did not end in a draw.

In [None]:
# Look at the winners
winners = [r.winner() for r in results if not r.is_draw()]
winners

['Italy',
 'Finland',
 'Belgium',
 'Austria',
 'Netherlands',
 'England',
 'Czech Rep',
 'Slovakia',
 'Portugal',
 'France',
 'Wales',
 'Italy',
 'Russia',
 'Belgium',
 'Ukraine',
 'Netherlands',
 'Sweden',
 'Germany',
 'Italy',
 'Switzerland',
 'Belgium',
 'Denmark',
 'Netherlands',
 'Austria',
 'Croatia',
 'England',
 'Spain',
 'Sweden',
 'Denmark',
 'Italy',
 'Czech Rep',
 'Belgium',
 'Spain',
 'Switzerland',
 'England',
 'Ukraine',
 'Spain',
 'Italy',
 'Denmark',
 'England',
 'Italy',
 'England',
 'Italy']

We can now perform basic analytics on this data. For example, we can count up the results of calling `.winner()` and `.loser()` to find out who won/lost the most matches in the tournament.

We'll use the `collections.Counter` object to take care of the counting.

In [None]:
# Find the 10 teams that won the most matches in the tournament
from collections import Counter

Counter(winners).most_common(10)

[('Italy', 7),
 ('England', 5),
 ('Belgium', 4),
 ('Netherlands', 3),
 ('Denmark', 3),
 ('Spain', 3),
 ('Austria', 2),
 ('Czech Rep', 2),
 ('Ukraine', 2),
 ('Sweden', 2)]

In [None]:
# Find the 10 teams that lost the most matches in the tournament
losers = [r.loser() for r in results if not r.is_draw()]

# sorted(Counter(losers).items(), key=lambda x: x[1], reverse=True)[:10]
Counter(losers).most_common(10)

[('Turkey', 3),
 ('Denmark', 3),
 ('North Macedonia', 3),
 ('Ukraine', 3),
 ('Russia', 2),
 ('Croatia', 2),
 ('Scotland', 2),
 ('Poland', 2),
 ('Germany', 2),
 ('Switzerland', 2)]

In [None]:
# Find out how many draws there were by calling is_draw()
# Note: we can use sum() because True evaluates to 1 when cast to an int
# and False evaluates to 0 when cast to an int.
num_draws = sum([r.is_draw() for r in results])
num_draws

8

We can use our `goals_scored()` method on each result object to calculate the total number of goals scored in the tournament.

In [None]:
total_goals = sum([r.goals_scored() for r in results])
total_goals

142

We can also use our `goal_difference()` method to find the result that had the **maximum** goal difference - i.e. which victory was won by the biggest margin of goals.

In [None]:
max_goal_diff = max([r.goal_difference() for r in results])
print(f"Biggest margin of victory: {max_goal_diff} goals")

# which result(s) does this correspond to?
biggest_victories = [str(r) for r in results if r.goal_difference() == max_goal_diff]
biggest_victories

Biggest margin of victory: 5 goals


['Slovakia 0-5 Spain']

We can look at other comprehensive victories by searching for results that were won by 4 goals.

In [None]:
# which result(s) had a goal difference of 4 [max_goal_diff - 1]
biggest_victories = [str(r) for r in results if r.goal_difference() == 4]
biggest_victories

['Wales 0-4 Denmark', 'Ukraine 0-4 England']

### Team models

We might also be interested in looking at individual team statistics, such as:

- how many games did a team play?
- how many goals did they score?
- how many goals did they concede?
- what was their tournament goal difference? (goals scored - goals conceded)
- how many of their games went to penalties?

There are different ways we could do this. We could simply use a dictionary to track the data, and that would be an appropriate, natural approach.
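As a rough sketch of the dictionary approach, the snippet below tallies goals scored and conceded per team with a `defaultdict`. The names `sample_results` and `tally_goals` are illustrative assumptions. The three scores are copied from the results shown earlier.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (home, home_goals, away, away_goals), copied from the results listed earlier
sample_results: List[Tuple[str, int, str, int]] = [
  ("Turkey", 0, "Italy", 3),
  ("Switzerland", 3, "Turkey", 1),
  ("Belgium", 1, "Italy", 2),
]

def tally_goals(results: List[Tuple[str, int, str, int]]) -> Dict[str, Dict[str, int]]:
  """Accumulate goals scored and conceded for every team."""
  stats = defaultdict(lambda: {"scored": 0, "conceded": 0})
  for home, home_goals, away, away_goals in results:
    stats[home]["scored"] += home_goals
    stats[home]["conceded"] += away_goals
    stats[away]["scored"] += away_goals
    stats[away]["conceded"] += home_goals
  return dict(stats)

goal_tally = tally_goals(sample_results)
print(goal_tally["Italy"])   # {'scored': 5, 'conceded': 1}
print(goal_tally["Turkey"])  # {'scored': 1, 'conceded': 6}
```

The dictionary is quick to write, but derived values such as goal difference have to be recomputed wherever they are needed.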

We could also create a container class. The benefit of this approach is that it allows us to create instance methods that encapsulate functionality. Let's define a class called `TeamStats` that will collect this data for each distinct team in the tournament.

In [None]:
@dataclass
class TeamStats:
  name: str
  games_won: int = 0 # default values
  games_drawn: int = 0
  games_lost: int = 0
  goals_scored: int = 0
  goals_conceded: int = 0
  penalty_games: int = 0

  @property  # calculated field, so let's make it a property
  def goal_difference(self) -> int:
    return self.goals_scored - self.goals_conceded

  def __eq__(self, team) -> bool:
    return self.name == team.name

We now need to create a `TeamStats` object for each team in the tournament.

Because the `Result` class contains both home and away teams, we'll use a set union to get all the distinct teams, as below.

Note - in Python, the `|` operator performs a union of two sets (the `&` operator performs an intersection).

In [None]:
TEAMS_IN_TOURNAMENT = 24
all_teams = {r.home for r in results} | {r.away for r in results}

# sanity check to ensure the correct number of teams exist
assert len(all_teams) == TEAMS_IN_TOURNAMENT

# now we create the TeamStats object for each team
teams = [TeamStats(a) for a in all_teams]
teams[:5]

[TeamStats(name='France', games_won=0, games_drawn=0, games_lost=0, goals_scored=0, goals_conceded=0, penalty_games=0),
 TeamStats(name='Russia', games_won=0, games_drawn=0, games_lost=0, goals_scored=0, goals_conceded=0, penalty_games=0),
 TeamStats(name='North Macedonia', games_won=0, games_drawn=0, games_lost=0, goals_scored=0, goals_conceded=0, penalty_games=0),
 TeamStats(name='Slovakia', games_won=0, games_drawn=0, games_lost=0, goals_scored=0, goals_conceded=0, penalty_games=0),
 TeamStats(name='Czech Rep', games_won=0, games_drawn=0, games_lost=0, goals_scored=0, goals_conceded=0, penalty_games=0)]
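To confirm how Python's set operators behave, here is a small check with made-up team sets: `|` returns the union (members of either set) and `&` returns the intersection (members of both sets).

```python
# Made-up sets for illustration only
home_teams = {"Turkey", "Wales", "Denmark"}
away_teams = {"Italy", "Switzerland", "Denmark"}

union = home_teams | away_teams         # teams that appear in either set
intersection = home_teams & away_teams  # teams that appear in both sets

print(sorted(union))         # ['Denmark', 'Italy', 'Switzerland', 'Turkey', 'Wales']
print(sorted(intersection))  # ['Denmark']
```

With the intersection, a team that only ever appeared as the home side (or only as the away side) would be dropped, so the union is the safer way to collect every team.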

These objects only have the default values, for now. We need to populate the fields with the correct information.

We'll do this by iterating over all the results, getting the home and away teams, and finding the correct `TeamStats` object in the above `teams` list.

Once we have found these, we'll add the values for each of the fields.

Code for this is shown below.

NOTE: this is perfect code for unit testing. We are doing manual updates to a data structure, and we need to ensure that the updates are being done correctly. Unit tests are key for this type of code!

In [None]:
from typing import List

def parse_team_stats(results: List[Result]) -> List[TeamStats]:
  all_teams = {r.home for r in results} | {r.away for r in results}  # set union
  teams = [TeamStats(a) for a in all_teams]
  
  for result in results:
    home, away = result.home, result.away

    # Home team stats
    h_teamstats = next(t for t in teams if t.name == home)
    h_teamstats.goals_scored += result.home_goals
    h_teamstats.goals_conceded += result.away_goals

    # Away team stats
    a_teamstats = next(t for t in teams if t.name == away)
    a_teamstats.goals_scored += result.away_goals
    a_teamstats.goals_conceded += result.home_goals

    # Update games won/drawn/lost, and also penalties
    if result.winner() == home:
      h_teamstats.games_won += 1
      a_teamstats.games_lost += 1
    elif result.winner() == away:
      h_teamstats.games_lost += 1
      a_teamstats.games_won += 1
    else:
      h_teamstats.games_drawn += 1
      a_teamstats.games_drawn += 1
    
    # finally, check if the match went to penalties
    if result.penalty_winner is not None:
      h_teamstats.penalty_games += 1
      a_teamstats.penalty_games += 1
  return teams

In [None]:
team_stats = parse_team_stats(results)

In [None]:
team_stats

[TeamStats(name='France', games_won=1, games_drawn=2, games_lost=1, goals_scored=7, goals_conceded=6, penalty_games=1),
 TeamStats(name='Russia', games_won=1, games_drawn=0, games_lost=2, goals_scored=2, goals_conceded=7, penalty_games=0),
 TeamStats(name='North Macedonia', games_won=0, games_drawn=0, games_lost=3, goals_scored=2, goals_conceded=8, penalty_games=0),
 TeamStats(name='Slovakia', games_won=1, games_drawn=0, games_lost=2, goals_scored=2, goals_conceded=7, penalty_games=0),
 TeamStats(name='Czech Rep', games_won=2, games_drawn=1, games_lost=2, goals_scored=6, goals_conceded=4, penalty_games=0),
 TeamStats(name='Hungary', games_won=0, games_drawn=2, games_lost=1, goals_scored=3, goals_conceded=6, penalty_games=0),
 TeamStats(name='Croatia', games_won=1, games_drawn=1, games_lost=2, goals_scored=7, goals_conceded=8, penalty_games=0),
 TeamStats(name='Portugal', games_won=1, games_drawn=1, games_lost=2, goals_scored=7, goals_conceded=7, penalty_games=0),
 TeamStats(name='Spain
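Returning to the earlier note on unit testing, a check along the following lines might be a reasonable starting point. To keep the snippet self-contained, `Result`, `TeamStats` and `parse_team_stats` are condensed stand-ins for the definitions above (the stand-in function uses a dictionary lookup rather than `next(...)`), and `test_parse_team_stats` is a hypothetical test name. The three fixtures are taken from the scraped results shown earlier.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Result:  # condensed stand-in for the lesson's Result class
  home: str
  away: str
  home_goals: int
  away_goals: int
  penalty_winner: Optional[str] = None

  def winner(self) -> Optional[str]:
    if self.home_goals > self.away_goals:
      return self.home
    if self.away_goals > self.home_goals:
      return self.away
    return self.penalty_winner  # None when a level match had no shootout

@dataclass
class TeamStats:  # condensed stand-in for the lesson's TeamStats class
  name: str
  games_won: int = 0
  games_drawn: int = 0
  games_lost: int = 0
  goals_scored: int = 0
  goals_conceded: int = 0
  penalty_games: int = 0

def parse_team_stats(results: List[Result]) -> List[TeamStats]:
  """Condensed stand-in: same bookkeeping, with a dict lookup per team."""
  names = {r.home for r in results} | {r.away for r in results}
  teams = {name: TeamStats(name) for name in names}
  for r in results:
    home, away = teams[r.home], teams[r.away]
    home.goals_scored += r.home_goals
    home.goals_conceded += r.away_goals
    away.goals_scored += r.away_goals
    away.goals_conceded += r.home_goals
    if r.winner() == r.home:
      home.games_won += 1
      away.games_lost += 1
    elif r.winner() == r.away:
      home.games_lost += 1
      away.games_won += 1
    else:
      home.games_drawn += 1
      away.games_drawn += 1
    if r.penalty_winner is not None:
      home.penalty_games += 1
      away.penalty_games += 1
  return list(teams.values())

def test_parse_team_stats() -> None:
  results = [
    Result("Turkey", "Italy", 0, 3),
    Result("Wales", "Switzerland", 1, 1),
    Result("Italy", "England", 1, 1, penalty_winner="Italy"),
  ]
  stats = {t.name: t for t in parse_team_stats(results)}

  assert len(stats) == 5
  italy = stats["Italy"]
  assert (italy.games_won, italy.games_drawn, italy.games_lost) == (2, 0, 0)
  assert (italy.goals_scored, italy.goals_conceded) == (4, 1)
  assert italy.penalty_games == 1
  assert stats["England"].games_lost == 1
  assert stats["Wales"].games_drawn == 1

test_parse_team_stats()
print("all checks passed")
```

A small hand-checked input like this covers a win, a draw and a shootout, which are the three branches of the update logic.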

From here, we can do further interesting things, such as looking at the teams with the best/worst goal difference or the most penalty games.

In [None]:
# top 10 goal diff
N = 10
top10_goal_diff = sorted(team_stats, key=lambda x: x.goal_difference, reverse=True)[:N]
# top10_goal_diff

for team in top10_goal_diff:
  print(f"{team.name}: goal difference = {team.goal_difference}")

Italy: goal difference = 9
England: goal difference = 9
Spain: goal difference = 7
Belgium: goal difference = 6
Denmark: goal difference = 5
Netherlands: goal difference = 4
Czech Rep: goal difference = 2
France: goal difference = 1
Sweden: goal difference = 1
Portugal: goal difference = 0


In [None]:
# bottom 10 goal diff - remove "reverse=True" kwarg from sorted() function
bottom10_goal_diff = sorted(team_stats, key=lambda x: x.goal_difference)[:N]
# bottom10_goal_diff

for team in bottom10_goal_diff:
  print(f"{team.name}: goal difference = {team.goal_difference}")

Turkey: goal difference = -7
North Macedonia: goal difference = -6
Russia: goal difference = -5
Slovakia: goal difference = -5
Scotland: goal difference = -4
Ukraine: goal difference = -4
Hungary: goal difference = -3
Wales: goal difference = -3
Finland: goal difference = -2
Poland: goal difference = -2


Let's see which teams were involved in penalty games.

In [None]:
penalty_counts = [t for t in team_stats if t.penalty_games > 0]

for team in penalty_counts:
  print(f"{team.name}: {team.penalty_games} penalty games")

France: 1 penalty games
Spain: 2 penalty games
Italy: 2 penalty games
England: 1 penalty games
Switzerland: 2 penalty games


Let's look at the teams who scored the most goals.

In [None]:
top_goals = sorted(team_stats, key=lambda x: x.goals_scored, reverse=True)[:N]
top_goals

[TeamStats(name='Spain', games_won=3, games_drawn=2, games_lost=1, goals_scored=13, goals_conceded=6, penalty_games=2),
 TeamStats(name='Italy', games_won=7, games_drawn=0, games_lost=0, goals_scored=13, goals_conceded=4, penalty_games=2),
 TeamStats(name='Denmark', games_won=3, games_drawn=0, games_lost=3, goals_scored=12, goals_conceded=7, penalty_games=0),
 TeamStats(name='England', games_won=5, games_drawn=1, games_lost=1, goals_scored=11, goals_conceded=2, penalty_games=1),
 TeamStats(name='Belgium', games_won=4, games_drawn=0, games_lost=1, goals_scored=9, goals_conceded=3, penalty_games=0),
 TeamStats(name='Netherlands', games_won=3, games_drawn=0, games_lost=1, goals_scored=8, goals_conceded=4, penalty_games=0),
 TeamStats(name='Switzerland', games_won=2, games_drawn=1, games_lost=2, goals_scored=8, goals_conceded=9, penalty_games=2),
 TeamStats(name='France', games_won=1, games_drawn=2, games_lost=1, goals_scored=7, goals_conceded=6, penalty_games=1),
 TeamStats(name='Croatia'

And the teams that conceded the most goals.

In [None]:
top_goals_conceded = sorted(team_stats, key=lambda x: x.goals_conceded, reverse=True)[:N]
top_goals_conceded

[TeamStats(name='Ukraine', games_won=2, games_drawn=0, games_lost=3, goals_scored=6, goals_conceded=10, penalty_games=0),
 TeamStats(name='Switzerland', games_won=2, games_drawn=1, games_lost=2, goals_scored=8, goals_conceded=9, penalty_games=2),
 TeamStats(name='North Macedonia', games_won=0, games_drawn=0, games_lost=3, goals_scored=2, goals_conceded=8, penalty_games=0),
 TeamStats(name='Croatia', games_won=1, games_drawn=1, games_lost=2, goals_scored=7, goals_conceded=8, penalty_games=0),
 TeamStats(name='Turkey', games_won=0, games_drawn=0, games_lost=3, goals_scored=1, goals_conceded=8, penalty_games=0),
 TeamStats(name='Russia', games_won=1, games_drawn=0, games_lost=2, goals_scored=2, goals_conceded=7, penalty_games=0),
 TeamStats(name='Slovakia', games_won=1, games_drawn=0, games_lost=2, goals_scored=2, goals_conceded=7, penalty_games=0),
 TeamStats(name='Portugal', games_won=1, games_drawn=1, games_lost=2, goals_scored=7, goals_conceded=7, penalty_games=0),
 TeamStats(name='Ge

### Conclusion

We have now successfully scraped and extracted Euro 2021 results data from the BBC using the `BeautifulSoup` and `requests` libraries, and have subsequently modelled this raw data using Python objects. We were then able to extract some insights and analysis from the collected data, with the aid of our object-oriented approach.

In the next lesson, we'll go one step further and use Python's data analytics and visualization libraries to further analyze and plot the data.