## Web Scrape Transfermarkt Player Market Data

We want to scrape player valuation data from Transfermarkt for the last seven Premier League seasons.

You will learn how to:
- parse HTML
- iterate over pages asynchronously
- build an automated scraper using an OOP design pattern

## Why is web scraping an essential skill?

In my previous article on essential considerations for machine learning, I talk about how vital plentiful, available data is to ML applications. More often than not, data is hard to acquire: good public data is rarely handed to you as a perfect .csv file that you can use without any preprocessing. In the real world, acquiring and preparing data takes a significant amount of time.

Many businesses depend on web-scraped data to avoid the huge cost of purchasing data from third parties and other data vendors. Therefore, if you possess web scraping skills, you are a valuable asset. Furthermore, if you are interested in building your own applications, whether as a cool personal project or as the start of a business, web scraping is a good place to start.

## What packages will we use?

- httpx
- pandas
- BeautifulSoup

In [272]:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from collections.abc import Sequence
import httpx
import pandas as pd
from bs4 import BeautifulSoup
import time

In this first example we scrape Liverpool player valuation data from Transfermarkt for the 2023/24 season. The following URL links to the Liverpool squad page.

In [241]:
url = "https://www.transfermarkt.co.uk/fc-liverpool/kader/verein/31/saison_id/2023/plus/1"

In [242]:
#  get html response from url

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"
}
response = httpx.get(url, headers=headers)

In [243]:
print(response)

In [244]:
html_data = response.content

We can now parse the data we need from the HTML returned by calling .content on the response object. A good way to find the right elements is to right-click the part of the page you are interested in and select 'Inspect' at the bottom of the dropdown. This opens the 'Inspector' tab, which lets you analyse the page's HTML in more depth. For example, with the element picker (in Firefox the shortcut is Ctrl+Shift+C) you can hover over page elements and the corresponding HTML is highlighted in the Inspector. Once we know which tags and classes hold the data, we can use an HTML parser like BeautifulSoup to extract it.

For reference, I use Firefox Developer Edition as it's really good for inspecting page code, but any web browser will suffice.
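To make the idea concrete, here is a minimal sketch of looking elements up by tag and class. The HTML snippet is a hand-written stand-in (an assumption, not a copy of the real page) that borrows the class names used later in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for one table row; the real page is much larger.
html = """
<table>
  <tr>
    <td class="zentriert"><div class="rn_nummer">1</div></td>
    <td class="zentriert">Oct 2, 1992 (31)</td>
    <td class="rechts hauptlink"><a href="#">€28.00m</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every <td> whose class list contains "zentriert".
cells = soup.find_all("td", {"class": "zentriert"})

# Drill into a cell to reach a nested tag, then read its text.
number = cells[0].find("div", class_="rn_nummer").text.strip()
dob_age = cells[1].text.strip()

# A class value with a space is matched against the full class string.
value = soup.find("td", {"class": "rechts hauptlink"}).find("a").text
```

The same pattern (find the cells, then pull text or an attribute out of each one) is what the rest of the notebook repeats for each column.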

In [270]:
page_soup = BeautifulSoup(html_data, "html.parser")
page_soup

In [164]:
elements = page_soup.find_all("img", {"class": "bilderrahmen-fixed lazy lazy"})
names = [td.get('title') if td.get('title') else None for td in elements]

## Parse table stats

Inspecting the HTML for the table statistics shows that every player row contributes 8 cells with the class 'zentriert', so each statistic repeats every 8 elements. For example, the first squad number is at index 0, and to reach the next player's squad number we have to jump 8 places. We do this with a slice step (the double colon) when indexing the list of cells, e.g. stats[0::8].
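The stride trick is easier to see on a toy list. The sketch below uses made-up placeholder strings instead of real table cells, and the column labels are only illustrative.

```python
# Toy stand-in for the flat list returned by find_all: 2 players x 8 cells each.
columns = ["number", "dob_age", "country", "club", "height", "foot", "joined", "signed"]
cells = [f"{col}_{player}" for player in range(2) for col in columns]

# cells[start::8] picks one column: begin at `start`, then jump 8 cells at a time.
numbers = cells[0::8]
countries = cells[2::8]

# Alternative view: regroup the flat list into one row of 8 cells per player.
rows = [cells[i : i + 8] for i in range(0, len(cells), 8)]
```

Note that this approach relies on every player row having exactly 8 such cells; if the page layout changes, the offsets shift and the columns get mixed up.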

In [165]:
stats = page_soup.find_all("td", {"class": "zentriert"})

In [169]:
numbers = [stat for stat in stats[0::8]]
numbers = [
    (
        td.find("div", class_="rn_nummer").text.strip()
        if td.find("div", class_="rn_nummer")
        else None
    )
    for td in numbers
]

In [255]:
ages = [stat for stat in stats[1::8]]

In [None]:
dob = [td.text.strip().split(" (")[0] if td.text else None for td in ages]

In [174]:
age = [
    int(td.text.strip().split(" (")[1].split(")")[0]) if td.text else None
    for td in ages
]
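The cell text has the form "Oct 2, 1992 (31)". As a possible extension, the sketch below turns it into a real date and an integer age, and returns empty values when the text does not have that shape (the expression above would fail on such a cell). The helper name `parse_dob_age` is hypothetical, and the month parsing assumes an English locale.

```python
from datetime import date, datetime
from typing import Optional


def parse_dob_age(text: str) -> tuple[Optional[date], Optional[int]]:
    """Split 'Oct 2, 1992 (31)' into (date(1992, 10, 2), 31)."""
    dob_part, sep, age_part = text.strip().partition(" (")
    if not sep:
        # No " (" separator, e.g. a placeholder such as "-".
        return None, None
    try:
        dob = datetime.strptime(dob_part, "%b %d, %Y").date()
        age = int(age_part.rstrip(")"))
    except ValueError:
        return None, None
    return dob, age


dob_example, age_example = parse_dob_age("Oct 2, 1992 (31)")
```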

In [257]:
countries = [stat for stat in stats[2::8]]
print(countries[0])

In [None]:
countries = [
    td.find("img").get("title") if td.find("img") else None
    for td in countries
]

In [259]:
current_clubs = [stat for stat in stats[3::8]]

In [None]:
current_clubs = [
    td.find("a").get("title") if td.find("a") else None
    for td in current_clubs
]

In [261]:
heights = [stat for stat in stats[4::8]]

In [None]:
heights = [td.text if td.text else None for td in heights]

In [263]:
foots = [stat for stat in stats[5::8]]

In [None]:
foots = [td.text if td.text else None for td in foots]

In [271]:
joined_date = [stat for stat in stats[6::8]]

In [None]:
joined_date = [td.text if td.text else None for td in joined_date]

In [253]:
signing_info = [stat for stat in stats[7::8]]

In [None]:
signing_fee = [
    td.find("a").get("title").split(": Ablöse ")[1] if td.find("a") else 0
    for td in signing_info
]

In [221]:
signed_from = [
    td.find("a").get("title").split(": Ablöse ")[0] if td.find("a") else None
    for td in signing_info
]
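Both lists come from the same title attribute, split on ": Ablöse " (German for "transfer fee"). A small sketch of doing both splits in one pass is shown below; the sample titles are assumed examples of that shape, and `split_signing_title` is a hypothetical helper name.

```python
from typing import Optional


def split_signing_title(title: str) -> tuple[Optional[str], Optional[str]]:
    """Split '<club>: Ablöse <fee>' into (club, fee)."""
    club, sep, fee = title.partition(": Ablöse ")
    if not sep:
        # Separator missing: keep whatever text there is as the club.
        return (title or None), None
    return club, fee


# Assumed example of the title format.
club, fee = split_signing_title("AS Roma: Ablöse €62.50m")
```

Using `partition` instead of `split(...)[1]` avoids an IndexError if a title ever lacks the separator.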

### Parse the market valuations

In [252]:
values = page_soup.find_all("td", {"class": "rechts hauptlink"})

In [None]:
values = [td.find('a').text if td.find('a') else '€0' for td in values]
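The values are still strings such as "€28.00m" or "€600k". For analysis we will probably want numbers, so here is one possible sketch of a converter. The function name `parse_value` and the suffix table are assumptions for illustration; only the "k" and "m" suffixes appear in the data above.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed suffixes; "bn" is included defensively and does not appear in the data above.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "bn": 1_000_000_000}


def parse_value(text: str) -> Optional[int]:
    """Convert '€28.00m' -> 28000000, '€600k' -> 600000, '-' -> None."""
    cleaned = text.strip().lstrip("€").lower()
    for suffix, factor in MULTIPLIERS.items():
        if cleaned.endswith(suffix):
            number = cleaned[: -len(suffix)]
            break
    else:
        # No known suffix: treat the text as a plain number.
        number, factor = cleaned, 1
    try:
        # Decimal avoids float rounding on inputs like "6.82".
        return int(Decimal(number) * factor)
    except InvalidOperation:
        return None
```

With pandas this could be applied column-wise, for example `df["value"].map(parse_value)`.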

### Parse player positions

In [250]:
positions = page_soup.find_all("td", {"class": "posrela"})
positions

[<td class="posrela">
 <table class="inline-table">
 <tr>
 <td rowspan="2">
  </img></td>
 <td class="hauptlink">
 <a href="/alisson/profil/spieler/105470">
                 Alisson            </a>
 </td>
 </tr>
 <tr>
 <td>
             Goalkeeper        </td>
 </tr>
 </table>
 </td>,
 <td class="posrela">
 <table class="inline-table">
 <tr>
 <td rowspan="2">
  </td>
 <td class="hauptlink">
 <a href="/caoimhin-kelleher/profil/spieler/340918">
                 Caoimhí

In [None]:
positions = [
    (
        element.find_all("tr")[1].find("td").text.strip()
        if element.find_all("tr")
        else None
    )
    for element in positions
]

### Parse player transfermarkt name and id

Parsing this information is crucial for looking up more detail on players later, because the player's Transfermarkt name and id can be inserted into a URL (for example, the player's profile page).

In [50]:
links = page_soup.find_all("td", {"class": "hauptlink"})

tm_name = [
    link.find("a")["href"].split("/")[1] if link.find("a") else None
    for link in links[::2]
]

tm_id = [link.find("a")["href"].split("/")[4] if link.find('a') else None for link in links[::2]]
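To see why indices 1 and 4 are the right ones, here is the split worked through on the href from the Alisson row shown earlier. The profile URL is simply the site domain joined with that same relative path.

```python
href = "/alisson/profil/spieler/105470"

# The leading slash produces an empty first element, so the name is at index 1.
parts = href.split("/")  # ['', 'alisson', 'profil', 'spieler', '105470']
tm_name_example, tm_id_example = parts[1], parts[4]

# Rebuild an absolute URL from the two pieces.
profile_url = (
    f"https://www.transfermarkt.co.uk/{tm_name_example}/profil/spieler/{tm_id_example}"
)
```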

### Creating a dataframe

In [195]:
data = {
    "name": names,
    "number": numbers,
    'dob': dob,
    "age": age,
    "country": countries,
    "current_club": current_clubs,
    "height": heights,
    "foot": foots,
    "joined_date": joined_date,
    "signing_fee": signing_fee,
    "signed_from": signed_from,
    "value": values,
    "position": positions,
    # "tm_name": tm_name,
    # "tm_id": tm_id
}

In [196]:
df = pd.DataFrame(data)
df.head() 

Unnamed: 0,name,number,dob,age,country,current_club,height,foot,joined_date,signing_fee,signed_from,value,position
0,Alisson,1,"Oct 2, 1992",31,Brazil,Liverpool FC,"1,93m",right,"Jul 19, 2018",€62.50m,AS Roma,€28.00m,Goalkeeper
1,Caoimhín Kelleher,62,"Nov 23, 1998",25,Ireland,Liverpool FC,"1,88m",right,"Jul 1, 2019",-,Liverpool FC U23,€20.00m,Goalkeeper
2,Vitezslav Jaros,-,"Jul 23, 2001",22,Czech Republic,Liverpool FC U21,"1,90m",right,,0,,€5.00m,Goalkeeper
3,Adrián,13,"Jan 3, 1987",37,Spain,Real Betis Balompié,"1,90m",right,"Aug 5, 2019",free transfer,West Ham United,€600k,Goalkeeper
4,Marcelo Pitaluga,-,"Dec 20, 2002",21,Brazil,Liverpool FC U21,"1,93m",right,,0,,€300k,Goalkeeper


## Extending to all Premier League teams for all seasons

It's great that we've managed to extract detailed information on Liverpool players for the 2023/24 season, but a single squad on its own is not that useful. We want data on all Premier League teams across multiple seasons so that we can do some interesting analysis and uncover how player valuations have changed over time.

Remember our initial URL looked like this:
- "https://www.transfermarkt.co.uk/fc-liverpool/kader/verein/31/saison_id/2023/plus/1"

We can use it again, but we have to swap out 'fc-liverpool' (team name), '31' (team Transfermarkt id), and '2023' (season). If we have each Premier League team's Transfermarkt name and id, we can generate one URL per team and season and iterate over them.
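As a rough sketch of that substitution, the snippet below fills a URL template for every combination of team and season. The two teams are hard-coded for illustration, and the range of seasons (2017 to 2023, i.e. seven seasons) is an assumption about which seasons we want.

```python
from itertools import product

SQUAD_URL = (
    "https://www.transfermarkt.co.uk/{name}/kader/verein/{id}/saison_id/{year}/plus/1"
)

# (Transfermarkt name, Transfermarkt id) pairs, hard-coded here for illustration.
example_teams = [("fc-liverpool", "31"), ("manchester-united", "985")]
seasons = range(2017, 2024)  # 2017/18 through 2023/24

squad_urls = [
    SQUAD_URL.format(name=name, id=team_id, year=year)
    for (name, team_id), year in product(example_teams, seasons)
]
```

In practice the set of teams changes each season through promotion and relegation, so the team list should be looked up per season rather than reused across all of them.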

### Get team names

- get team names and ids for each season
- iterate over the URLs to extract data
- build the final scraper

In [197]:
team_url = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1/plus/?saison_id=2023"

In [198]:
team_resp = httpx.get(
    team_url,
    headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"
    },
)

In [199]:
team_html = team_resp.content

In [203]:
page_soup = BeautifulSoup(team_html, "html.parser")

In [205]:
team_info = page_soup.find_all("td", {"class": "hauptlink no-border-links"})

In [213]:
team_name = [
    td.find("a").get("href").split("/")[1] if td.find("a") else None for td in team_info
]

In [214]:
team_id = [
    td.find("a").get("href").split("/")[4] if td.find("a") else None for td in team_info
]

In [None]:
urls = []
for td in team_info:
    href = td.find('a').get('href')
    parts = href.split('/')
    # use loop-local names so the team_name and team_id lists above are not overwritten
    name, id_, year = parts[1], parts[4], parts[6]

    url = f'https://www.transfermarkt.co.uk/{name}/kader/verein/{id_}/saison_id/{year}/plus/1'

    urls.append(url)
    

Now we can iterate over these URLs to build a dataframe that contains player market value data for each Premier League team for the 2023/2024 season.

Before we can do that, we need reusable code to scrape data from each URL. We can create a pipeline around an abstract base class, with one small parser per column, which makes it easy to understand what is happening in the extraction process.

In [142]:
@dataclass
class Team:
    id: str
    name: str


class Parser(ABC):
    """Abstract base class for parsing data from Transfermarkt."""

    @abstractmethod
    def parse(self, soup: BeautifulSoup) -> "pd.Series | pd.DataFrame":
        """Return one or more columns extracted from the page soup."""


@dataclass
class Scraper:
    """Scrape data from transfermarkt for a given team and year."""

    team: Team
    parsers: Sequence[Parser]
    year: int
    url: str = (
        "https://www.transfermarkt.co.uk/{name}/kader/verein/{id}/saison_id/{year}/plus/1"
    )

    def run(self) -> pd.DataFrame:
        """Run the scraping process."""
        url = self.url.format(name=self.team.name, id=self.team.id, year=self.year)
        print(f"Scraping: {self.team.name} - {self.year}")

        soup = self._get_soup_content(url)  # get html content from url

        data = pd.concat(
            [parser.parse(soup) for parser in self.parsers], axis=1
        )  # concatenate parsers into a dataframe

        data["season"] = self.year  # add season to dataframe
        data["team"] = self.team.name  # add team name to dataframe

        return data

    def _get_soup_content(self, url: str) -> BeautifulSoup:
        """Get the html content from a given Transfermarkt url."""
        resp = self._make_request(url)
        return BeautifulSoup(resp.content, "html.parser")

    def _make_request(self, url: str) -> httpx.Response:
        """Make a request to a given Transfermarkt url."""
        try:
            response = httpx.get(
                url,
                headers={
                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"
                },
                timeout=60,
            )
            response.raise_for_status()
            return response

        except httpx.HTTPError as e:
            print(f"HTTP error occurred: {e}")
            raise e
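One detail of `run` worth seeing in isolation is how `pd.concat(..., axis=1)` stitches the parser outputs together. The sketch below uses small hand-made Series and a DataFrame (not real parser output) and deliberately makes one column a row short.

```python
import pandas as pd

names = pd.Series(["Alisson", "Adrián"], name="name")
ages = pd.DataFrame({"dob": ["Oct 2, 1992", "Jan 3, 1987"], "age": [31, 37]})
values = pd.Series(["€28.00m"], name="value")  # deliberately one row short

# axis=1 places the inputs side by side, aligned on the index.
combined = pd.concat([names, ages, values], axis=1)

# The short column is padded with NaN instead of raising an error.
missing_values = combined["value"].isna().tolist()
```

Because a parser that finds fewer elements than expected is padded silently, it may be worth checking that all parser outputs have the same length before concatenating.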

In [223]:
class PlayerNames(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        elements = soup.find_all("img", {"class": "bilderrahmen-fixed lazy lazy"})
        names = [td.get("title") if td.get("title") else None for td in elements]
        return pd.Series(names, name="name")  # name must be hashable, so a str, not a list

In [224]:
class PlayerNumbers(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        numbers = [stat for stat in stats[0::8]]
        numbers = [
            (
                td.find("div", class_="rn_nummer").text.strip()
                if td.find("div", class_="rn_nummer")
                else None
            )
            for td in numbers
        ]
        return pd.Series(numbers, name="number")  # name must be hashable, so a str, not a list

In [225]:
class PlayerAges(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.DataFrame:
        stats = soup.find_all("td", {"class": "zentriert"})
        ages = [stat for stat in stats[1::8]]
        dob = [td.text.strip().split(" (")[0] if td.text else None for td in ages]
        age = [
            int(td.text.strip().split(" (")[1].split(")")[0]) if td.text else None
            for td in ages
        ]
        return pd.DataFrame({"dob": dob, "age": age})

In [226]:
class PlayerCountries(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        countries = [stat for stat in stats[2::8]]
        countries = [
            td.find("img").get("title") if td.find("img") else None for td in countries
        ]
        return pd.Series(countries, name="country")  # name must be hashable, so a str, not a list

In [227]:
class CurrentClubs(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        current_clubs = [stat for stat in stats[3::8]]
        current_clubs = [
            td.find("a").get("title") if td.find("a") else None for td in current_clubs
        ]
        return pd.Series(current_clubs, name="current_club")  # name must be hashable, so a str, not a list

In [228]:
class PlayerHeights(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        heights = [stat for stat in stats[4::8]]
        heights = [td.text if td.text else None for td in heights]
        return pd.Series(heights, name="height")  # name must be hashable, so a str, not a list

In [229]:
class PlayerFoot(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        foots = [stat for stat in stats[5::8]]
        foots = [td.text if td.text else None for td in foots]
        return pd.Series(foots, name="foot")  # name must be hashable, so a str, not a list

In [231]:
class PlayerJoinedDate(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        joined_date = [stat for stat in stats[6::8]]
        joined_date = [td.text if td.text else None for td in joined_date]
        return pd.Series(joined_date, name="joined_date")  # name must be hashable, so a str, not a list

In [232]:
class PlayerSigningFee(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        signing_info = [stat for stat in stats[7::8]]
        signing_fee = [
            td.find("a").get("title").split(": Ablöse ")[1] if td.find("a") else 0
            for td in signing_info
        ]
        return pd.Series(signing_fee, name="signing_fee")  # name must be hashable, so a str, not a list

In [234]:
class PlayerSignedFrom(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        stats = soup.find_all("td", {"class": "zentriert"})
        signing_info = [stat for stat in stats[7::8]]
        signed_from = [
            td.find("a").get("title").split(": Ablöse ")[0] if td.find("a") else None
            for td in signing_info
        ]
        return pd.Series(signed_from, name="signed_from")  # name must be hashable, so a str, not a list

In [235]:
class PlayerValues(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        values = soup.find_all("td", {"class": "rechts hauptlink"})
        values = [td.find("a").text if td.find("a") else "€0" for td in values]
        return pd.Series(values, name="value")  # name must be hashable, so a str, not a list

In [237]:
class PlayerPositions(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        pos_soup = soup.find_all("td", {"class": "posrela"})
        positions = [
            td.find_all("tr")[1].find("td").text.strip() if td.find_all("tr") else None
            for td in pos_soup
        ]
        return pd.Series(positions, name="position")  # name must be hashable, so a str, not a list

In [238]:
class TransfermarktName(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        links = soup.find_all("td", {"class": "hauptlink"})
        tm_name = [
            link.find("a")["href"].split("/")[1] if link.find("a") else None
            for link in links[::2]
        ]
        return pd.Series(tm_name, name="tm_name")  # name must be hashable, so a str, not a list

In [239]:
class TransfermarktId(Parser):
    def parse(self, soup: BeautifulSoup) -> pd.Series:
        links = soup.find_all("td", {"class": "hauptlink"})
        tm_id = [
            link.find("a")["href"].split("/")[4] if link.find("a") else None
            for link in links[::2]
        ]
        return pd.Series(tm_id, name="tm_id")  # name must be hashable, so a str, not a list

In [118]:
parsers = (
    PlayerNames(),
    PlayerNumbers(),
    PlayerAges(),
    PlayerCountries(),
    CurrentClubs(),
    PlayerHeights(),
    PlayerFoot(),
    PlayerJoinedDate(),
    PlayerSigningFee(),
    PlayerSignedFrom(),
    PlayerValues(),
    PlayerPositions(),
    TransfermarktName(),
    TransfermarktId()
)

In [122]:
team = Team(id='985', name='manchester-united')
scraper = Scraper(team=team, parsers=parsers, year=2023)
df = scraper.run()

In [123]:
df.head()

Unnamed: 0,name,number,dob,age,country,current_club,height,foot,joined_date,signing_fee,signed_from,value,position,tm_name,tm_id,season
0,André Onana,24,"Apr 2, 1996",28,Cameroon,Manchester United,"1,90m",right,"Jul 20, 2023",€50.20m,Inter Milan,€35.00m,Goalkeeper,andre-onana,234509,2023
1,Dean Henderson,-,"Mar 12, 1997",27,England,Crystal Palace,"1,88m",right,"Aug 1, 2020",-,Manchester United U23,€12.00m,Goalkeeper,dean-henderson,258919,2023
2,Altay Bayındır,1,"Apr 14, 1998",26,Türkiye,Manchester United,"1,98m",right,"Sep 1, 2023",€5.00m,Fenerbahce,€10.00m,Goalkeeper,altay-bayindir,336077,2023
3,Radek Vitek,-,"Oct 24, 2003",20,Czech Republic,Manchester United U21,"1,98m",right,,0,,€300k,Goalkeeper,radek-vitek,622236,2023
4,Tom Heaton,22,"Apr 15, 1986",38,England,Manchester United,"1,88m",right,"Jul 2, 2021",free transfer,Aston Villa,€250k,Goalkeeper,tom-heaton,34130,2023


## Scrape every team for 2023/2024 season

We need to get all the team information for the Premier League 2023/24 season.

In [133]:
def get_team_info(league: str, league_id: str, year: int) -> tuple:
    link = "https://www.transfermarkt.co.uk/{league}/startseite/wettbewerb/{league_id}/plus/?saison_id={year}"
    url = link.format(league=league, league_id=league_id, year=year)
    resp = httpx.get(
        url,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"
        },
        timeout=20,
    )
    soup = BeautifulSoup(resp.content, "html.parser")
    team_info = soup.find_all("td", {"class": "hauptlink no-border-links"})
    team_name = [td.find('a').get('href').split('/')[1] for td in team_info]
    team_id = [td.find('a').get('href').split('/')[4] for td in team_info]
    return tuple(zip(team_name, team_id))

In [80]:
pl_teams = get_team_info('premier-league', 'GB1', 2023)

In [215]:
teams = [Team(id=id_, name=name) for name, id_ in pl_teams]  # build from pl_teams; id_ avoids shadowing the builtin id

In [216]:
teams

[Team(id='281', name='manchester-city'),
 Team(id='11', name='fc-arsenal'),
 Team(id='631', name='fc-chelsea'),
 Team(id='31', name='fc-liverpool'),
 Team(id='148', name='tottenham-hotspur'),
 Team(id='985', name='manchester-united'),
 Team(id='405', name='aston-villa'),
 Team(id='762', name='newcastle-united'),
 Team(id='1237', name='brighton-amp-hove-albion'),
 Team(id='703', name='nottingham-forest'),
 Team(id='379', name='west-ham-united'),
 Team(id='873', name='crystal-palace'),
 Team(id='543', name='wolverhampton-wanderers'),
 Team(id='1148', name='fc-brentford'),
 Team(id='989', name='afc-bournemouth'),
 Team(id='29', name='fc-everton'),
 Team(id='931', name='fc-fulham'),
 Team(id='1132', name='fc-burnley'),
 Team(id='350', name='sheffield-united'),
 Team(id='1031', name='luton-town')]

In [267]:
dfs = []
for name, id_ in pl_teams:
    team = Team(id=id_, name=name)
    scraper = Scraper(team=team, parsers=parsers, year=2023)
    df = scraper.run()
    dfs.append(df)
    time.sleep(5) # sleep for 5 seconds to avoid getting blocked
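A fixed sleep keeps the request rate low, but a single failed request still stops the whole loop. One possible refinement, sketched below, is to retry a failed call with an increasing delay. The helper `with_retries` and its defaults are hypothetical and are not part of httpx or of the scraper above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task(); on failure wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**attempt)
    raise ValueError("attempts must be at least 1")
```

Inside the loop above this could be used as `df = with_retries(scraper.run)`, keeping the `time.sleep(5)` between teams as the polite baseline delay.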

In [268]:
data = pd.concat(dfs)

In [269]:
data

Unnamed: 0,name,number,dob,age,country,current_club,height,foot,joined_date,signing_fee,signed_from,value,position,tm_name,tm_id,season,team
0,Ederson,31,"Aug 17, 1993",30,Brazil,Manchester City,"1,88m",left,"Jul 1, 2017",€40.00m,SL Benfica,€35.00m,Goalkeeper,ederson,238223,2023,manchester-city
1,Stefan Ortega,18,"Nov 6, 1992",31,Germany,Manchester City,"1,85m",right,"Jul 1, 2022",free transfer,Arminia Bielefeld,€9.00m,Goalkeeper,stefan-ortega,85941,2023,manchester-city
2,Zack Steffen,13,"Apr 2, 1995",29,United States,Colorado Rapids,"1,91m",right,"Jul 9, 2019",€6.82m,Columbus Crew SC,€2.00m,Goalkeeper,zack-steffen,221624,2023,manchester-city
3,True Grant,-,"Nov 2, 2005",18,England,Manchester City U21,-,,,0,,€500k,Goalkeeper,true-grant,919438,2023,manchester-city
4,Scott Carson,33,"Sep 3, 1985",38,England,Manchester City,"1,88m",right,"Jul 20, 2021",free transfer,Derby County,€200k,Goalkeeper,scott-carson,14555,2023,manchester-city
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41,Jacob Brown,19,"Apr 10, 1998",26,Scotland,Luton Town,"1,78m",right,"Aug 10, 2023",€2.90m,Stoke City,€4.00m,Centre-Forward,jacob-brown,469958,2023,luton-town
42,Cauley Woodrow,10,"Dec 2, 1994",29,England,Luton Town,"1,84m",right,"Jul 1, 2022",?,Barnsley FC,€1.00m,Centre-Forward,cauley-woodrow,169801,2023,luton-town
43,Joe Taylor,-,"Nov 18, 2002",21,Wales,Luton Town,-,right,"Jan 31, 2023",€450k,Peterborough United,€350k,Centre-Forward,joe-taylor,944551,2023,luton-town
44,Admiral Muskwe,-,"Aug 21, 1998",25,Zimbabwe,Without ClubWithout Club,"1,83m",right,"Jul 15, 2021",?,Leicester City U21,€275k,Centre-Forward,admiral-muskwe,314378,2023,luton-town


## Load Data

Now that the data is in a pandas dataframe, we can choose what we want to do with it. We can save it as a .csv or .parquet file on our local machine, load it into an SQL database, or upload it to a cloud platform like Google Cloud Storage. The latter is my preferred choice, as it saves storage on my local machine and I can easily access the data from anywhere, provided I have the correct credentials of course!
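As a minimal sketch of the first two options, the snippet below writes a tiny stand-in dataframe to a CSV file and to an SQLite database. The file name, table name, and the in-memory database are assumptions for illustration; the cloud upload is left out because it depends on your provider and credentials.

```python
import sqlite3
import tempfile
from pathlib import Path

import pandas as pd

# Tiny stand-in for the scraped dataframe.
sample = pd.DataFrame(
    {"name": ["Alisson", "Adrián"], "value": ["€28.00m", "€600k"], "season": [2023, 2023]}
)

# Option 1: a local CSV file (a temporary directory keeps this example tidy).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "player_values.csv"
    sample.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

# sample.to_parquet("player_values.parquet")  # needs pyarrow or fastparquet installed

# Option 2: an SQL database, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
sample.to_sql("player_values", conn, index=False, if_exists="replace")
from_sql = pd.read_sql("SELECT name, value FROM player_values WHERE season = 2023", conn)
conn.close()
```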