# Data collection 

## I. Introduction

This notebook is the **first** step in a comprehensive project aimed at exploring **the correlation between financial investment and performance outcomes in European football clubs**. Our ultimate goal is to analyze how various financial metrics—such as **`revenue`**, **`spending`**, and **`net`** balance—relate to performance indicators like **`goals scored`**, **`wins`**, **`losses`**, and final league **`positions`**.

To achieve this, we start by gathering the necessary data through **web scraping**. Using the popular website **`Transfermarkt`**, which provides detailed information on **player transfers**, **team finances**, and **league standings**, we focus on collecting data from the top **15** European football leagues. Specifically, we target the **top 10** teams in each league, totaling **150** teams, to build a dataset that will form the foundation of our analysis.

In this notebook, we will guide you through the process of **scraping** data related to player transfers and league standings for these teams. The scraped data will include essential financial metrics as well as performance metrics across multiple seasons. Once **collected**, this data will be **cleaned**, **processed**, and **organized** in a structured format, ready for the subsequent stages of analysis in the project.

By the end of this notebook, we will have compiled a comprehensive dataset that captures the financial and performance aspects of **top European football teams**. This dataset will enable us to conduct a deeper analysis in future notebooks, where we will explore the relationship between financial spending and team success, aiming to uncover meaningful insights about the financial efficiency and competitive performance of football clubs.

## II. Data Collection and Preparation Process

Before implementing the **data collection** process, let's first outline the step-by-step methodology used to gather the necessary data. After understanding the approach, we will proceed to examine the corresponding code that executes these steps.

### 1. Importing Necessary Libraries:

The first step is to **import the essential** Python libraries. They cover sending HTTP requests (**requests**), introducing delays between requests (**time**), **parsing HTML** pages (**BeautifulSoup**), and manipulating data (**numpy** and **pandas**).

### 2. Defining HTTP Headers and Base URL:

Next, **HTTP headers** are defined to make the requests appear as if they are coming from a regular web browser, which helps in avoiding blocks by the website. The base URL of the Transfermarkt website is also defined, making it easier to construct the complete URLs needed to access specific pages later.
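
As a rough illustration of this step, the sketch below keeps the browser-like headers and the base URL in one place and builds full URLs from site-relative paths. The header string and base URL are the ones used later in the notebook; the helper name `build_url` is hypothetical and only serves this example.

```python
from urllib.parse import urljoin

# Browser-like User-Agent so that requests resemble regular browser traffic.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"
    )
}
BASE_URL = "https://www.transfermarkt.us"


def build_url(path: str) -> str:
    """Join a site-relative path onto the base URL (hypothetical helper)."""
    return urljoin(BASE_URL + "/", path.lstrip("/"))


print(build_url("/wettbewerbe/europa"))
```

Centralizing the prefix this way means a change of domain only has to be made once.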

### 3. Requesting the European Leagues Page:

A request is sent to the Transfermarkt page that lists the European leagues. If the request succeeds, the page content is parsed with BeautifulSoup so that the necessary information can be extracted. If it fails, an error is reported.
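
The success-or-error pattern can be sketched as a small helper. This is an illustrative sketch, not the notebook's code: `fetch_page`, `FakeResponse`, and `fake_get` are hypothetical names, and the `get` parameter is assumed to behave like `requests.get` so that the example can run without network access.

```python
from typing import Callable, Optional


def fetch_page(get: Callable, url: str, headers: dict) -> Optional[bytes]:
    """Return the page body on HTTP 200; otherwise print an error and return None.

    `get` is any callable shaped like `requests.get` (an assumption made
    here so the sketch can be exercised offline).
    """
    response = get(url, headers=headers)
    if response.status_code == 200:
        return response.content
    print(f"Error scraping {url} (HTTP {response.status_code}).")
    return None


class FakeResponse:
    """Stand-in for a `requests.Response`, used only in this demonstration."""

    def __init__(self, status_code: int, content: bytes = b""):
        self.status_code = status_code
        self.content = content


def fake_get(url, headers=None):
    # Pretend that only the leagues page exists.
    if url.endswith("/wettbewerbe/europa"):
        return FakeResponse(200, b"<html><tbody></tbody></html>")
    return FakeResponse(404)


page = fetch_page(fake_get, "https://www.transfermarkt.us/wettbewerbe/europa", {})
missing = fetch_page(fake_get, "https://www.transfermarkt.us/missing", {})
```

In real use, `requests.get` would be passed as `get`, and the returned bytes would be handed to BeautifulSoup for parsing.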

### 4. Extracting Links for the Top 15 European Leagues:

Once the European leagues page is parsed, the code extracts the links to the top **15** leagues by iterating through the rows of the table that lists them. Each link is stored in a dictionary, with the country name (taken from the `alt` text of the league's flag image) as the key and the league URL as the value.
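
To make the row-by-row extraction concrete, here is a minimal sketch that runs on a small, made-up HTML fragment instead of the live page. The markup is a simplified assumption about the table layout, and `extract_league_links` is a hypothetical helper name; only the `flaggenrahmen` class and the URL prefix come from the notebook.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup: each league row is followed by an extra row without a flag.
SAMPLE_HTML = """
<table><tbody>
  <tr><td>First Tier</td></tr>
  <tr><td><a href="/premier-league/startseite/wettbewerb/GB1">Premier League</a>
          <img class="flaggenrahmen" alt="England"></td></tr>
  <tr><td>extra row</td></tr>
  <tr><td><a href="/laliga/startseite/wettbewerb/ES1">LaLiga</a>
          <img class="flaggenrahmen" alt="Spain"></td></tr>
  <tr><td>extra row</td></tr>
</tbody></table>
"""

URL_PREFIX = "https://www.transfermarkt.us"


def extract_league_links(html: str, max_rows: int = 30) -> dict:
    """Map each flag's alt text (the country name) to the league URL in the same row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("tbody").find_all("tr")
    links = {}
    for row in rows[:max_rows]:
        flag = row.find(class_="flaggenrahmen")
        anchor = row.find("a")
        if flag and anchor:  # Skip rows that carry no league information.
            links[flag["alt"]] = URL_PREFIX + anchor["href"]
    return links


league_links = extract_league_links(SAMPLE_HTML)
print(league_links)
```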

### 5. Cleaning and Converting Financial Values:

A function is defined to clean and convert financial values (e.g., "**€1.50m**" to 1500000.0). This function removes currency symbols and converts abbreviations (like "**k**" for thousand and "**m**" for million) into full numerical values. If the conversion fails, the function returns a missing value (`np.nan`).
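
The conversion rule can be illustrated with a few worked inputs. The sketch below is a regex-based variant written for this explanation, not the notebook's function: `parse_money` is a hypothetical name, the `bn` suffix is an assumed extra case, and `math.nan` stands in for `np.nan` to keep the example dependency-free.

```python
import math
import re

# Suffix -> multiplier; 'bn' (billions) is an assumed extra case.
MULTIPLIERS = {"": 1.0, "k": 1e3, "m": 1e6, "bn": 1e9}
# Optional euro sign, a decimal number, then an optional suffix.
PATTERN = re.compile(r"^€?\s*(-?\d+(?:\.\d+)?)\s*(k|m|bn)?$")


def parse_money(text) -> float:
    """Convert strings such as '€1.50m' to a float in euros, or NaN on failure."""
    if not isinstance(text, str):
        return math.nan
    match = PATTERN.match(text.strip().replace(",", ""))
    if match is None:
        return math.nan
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix or ""]


for raw in ["€1.50m", "€500k", "0", "-"]:
    print(raw, "->", parse_money(raw))
```

Scaling by an explicit multiplier keeps the result independent of how many decimal places the source string happens to show.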

### 6. Requesting Transfer Data for Each League:

The script iterates over the league links extracted earlier to retrieve the main page and the transfer balance page for each league. Delays of 2 seconds are introduced between requests to avoid overwhelming the server. The retrieved pages are then parsed using BeautifulSoup to extract the relevant content.
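
The notebook simply calls `time.sleep(2)` before each request. As an alternative sketch of the same politeness idea, the hypothetical `Throttle` class below sleeps only for whatever part of the minimum interval has not already elapsed. The injectable `clock` and `sleep` parameters are assumptions added so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between consecutive calls."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay


# The notebook waits 2 seconds; the interval is shortened here for the demo.
throttle = Throttle(min_interval=0.05)
for page in ["startseite", "transferbilanz"]:
    waited = throttle.wait()
    print(f"{page}: waited {waited:.2f}s before requesting")
```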

### 7. Extracting and Storing Team Data:

The script identifies the relevant rows in the transfer balance pages and the team pages for the top **10** teams in each league. Additional requests are made to retrieve the team pages and league position pages, and the extracted information is stored in a dictionary, organized by league and team.

### 8. Isolating and Processing Useful Information:

Specific data such as **revenue**, **spend**, **goals**, **league rankings**, and match results (**wins**, **ties**, **losses**) are isolated from the retrieved HTML content. These data points are cleaned and processed to be stored in the data dictionary. The script also ensures that the years of the transfer data and league rankings match to avoid errors.
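
The year-matching idea can be shown on plain Python data. This is a simplified sketch under stated assumptions: the seasons and win counts are made up, `align_seasons` is a hypothetical helper, and it uses a bounds check where the notebook resets its counter.

```python
import math


def align_seasons(transfer_years, ranking_rows):
    """Pair each transfer season with its ranking row, or NaN if the season is missing.

    `ranking_rows` is a list of (year, wins) tuples in the same order as
    `transfer_years`, possibly with some seasons absent.
    """
    aligned = {}
    pointer = 0  # Index of the next unused ranking row.
    for year in transfer_years:
        if pointer < len(ranking_rows) and ranking_rows[pointer][0] == year:
            aligned[year] = {"wins": ranking_rows[pointer][1]}
            pointer += 1  # Advance only when the seasons match.
        else:
            aligned[year] = {"wins": math.nan}
    return aligned


transfer_years = ["23", "22", "21", "20"]
ranking_rows = [("23", 28), ("22", 29), ("20", 26)]  # '21' is missing (made-up data).
result = align_seasons(transfer_years, ranking_rows)
print(result)
```

Because the pointer only advances on a match, a season missing from the rankings leaves a gap rather than shifting every later season by one row.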

### 9. Creating a DataFrame from the Collected Data:

After all the data is collected and organized in the dictionary, it is converted into a Pandas DataFrame, a tabular data structure commonly used in Python. This DataFrame contains all the extracted and processed information, organized in a way that is ready for analysis.
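
A compact way to picture this conversion is to flatten the nested dictionary into one row per league, team, and season. The sketch below uses made-up numbers and builds a list of row dictionaries, which is a slightly different route from the dictionary of lists used in the notebook.

```python
import pandas as pd

# Made-up numbers in the same league -> team -> season nesting.
nested = {
    "England": {
        "Team A": {
            "23": {"revenue": 10.0, "spent": 25.0},
            "22": {"revenue": 5.0, "spent": 8.0},
        },
    },
    "Spain": {
        "Team B": {"23": {"revenue": 12.0, "spent": 3.0}},
    },
}

# One flat record per (league, team, season) combination.
rows = [
    {"league": league, "team": team, "season": season, **metrics}
    for league, teams in nested.items()
    for team, seasons in teams.items()
    for season, metrics in seasons.items()
]
df = pd.DataFrame(rows)
print(df)
```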

### 10. Calculating 5-Season Aggregates:

The DataFrame is then enriched with metrics aggregated over a rolling 5-season window for each team. These include rolling sums (for example, of the net balance and of league-wide spending) and a rolling average of the relative spending ratio, allowing for the analysis of trends over multiple seasons.
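
The per-team rolling window can be illustrated on a tiny, made-up table. The sketch follows the same `groupby` plus `transform` pattern as the notebook; with `min_periods=1`, the first rows of each team are aggregated over fewer than five seasons instead of being left empty.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["A"] * 6 + ["B"] * 2,
        "net": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0, 20.0],  # made-up values
    }
)

# Rolling sum over up to five consecutive rows, restarted for every team.
df["5_season_net"] = df.groupby("team")["net"].transform(
    lambda s: s.rolling(5, min_periods=1).sum()
)
print(df)
```

For team A, the sixth row sums only the last five values (2 through 6), while team B's window never mixes with team A's rows.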

### 11. Creating the Final DataFrame:

Finally, the final DataFrame is generated, bringing together all the processed data and making it ready for further analysis. This DataFrame contains detailed information on the financial and performance metrics of the teams from the top European leagues across multiple seasons.

### 12. Saving the DataFrame as a CSV File:

Once the DataFrame containing all the scraped and processed data has been created, the next step is to save it as a **CSV** file. Saving the DataFrame as a CSV ensures that the data is preserved and can be **easily accessed** or shared without needing to re-run the entire scraping process. We can also **avoid the time-consuming** task of scraping again if we want to revisit the data for further analysis or make adjustments. This allows us to **focus** on the analytical steps in subsequent notebooks.
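
A short round trip shows what `index=False` changes when saving. The sketch below uses made-up data and an in-memory buffer instead of a file on disk, purely so it can run anywhere.

```python
import io

import pandas as pd

df = pd.DataFrame({"team": ["Team A", "Team B"], "net": [1.5, -2.0]})  # made-up values

# With index=False, only the data columns are written.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Writing the index as well produces an extra unnamed column on reload.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Saving without the index keeps the reloaded file aligned with the original columns in later notebooks.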

## III. Data Collection Process: Step-by-Step Implementation

Now that we've outlined and understood the steps involved in the **data collection** process, it's time to see how these steps are **implemented** in the **code**. Let’s dive into the code to observe how the data scraping and processing are carried out.

**Note:** Running this code took me about **1 hour and 15 minutes**.

In [1]:
# Step 1: Importing Necessary Libraries
import requests
import time
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd

# Step 2: Define HTTP headers and base URL to mimic a browser and simplify URL construction.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
url_prefix = "https://www.transfermarkt.us"

# Step 3: Request the European leagues page and parse the content if successful.
r_europe = requests.get(url_prefix + '/wettbewerbe/europa', headers=headers)
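if r_europe.status_code == 200:
    europe = bs(r_europe.content, 'html.parser')
else:
    # Stop here: every later step depends on the parsed leagues page.
    raise RuntimeError(f'Error scraping Europe (HTTP {r_europe.status_code}).')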

# Step 4: Extract links for the top 15 European leagues from the parsed page.
table = europe.find('tbody')
rows = table.find_all("tr")

league_links = {}

# Each league row is followed by an extra row, so the first 30 rows cover the top 15 leagues.
for row in rows[1:30]:
    flag = row.find(class_="flaggenrahmen")
    if flag:
        league_links[flag['alt']] = url_prefix + row.find('a')['href']

# Step 5: Define a function to clean and convert financial values from strings to floats.
def clean_value(value):
    if not isinstance(value, str):
        return np.nan
    value = value.replace('€', '').replace(',', '').strip()
    # Scale by the suffix so that '€1.50m' and '€500k' are both converted correctly.
    # The 'bn' suffix (billions) is an assumed extra case.
    factor = 1.0
    for suffix, multiplier in (('bn', 1e9), ('m', 1e6), ('k', 1e3)):
        if value.endswith(suffix):
            value, factor = value[:-len(suffix)], multiplier
            break
    try:
        return float(value) * factor
    except ValueError:
        return np.nan

# Step 6: Request and parse transfer data for the top ten teams in each examined league.
all_data = {}
for league in league_links:
    
    # League page request.
    time.sleep(2)
    r_league = requests.get(league_links[league], headers=headers)
    if r_league.status_code == 200:
        c_league = bs(r_league.content, 'html.parser')
    else:
        print(f'Error scraping {league}.')
        continue
    
    # League transfers page request.
    time.sleep(2)
    r_balance = requests.get(league_links[league].replace('startseite', 'transferbilanz'), headers=headers)
    if r_balance.status_code == 200:
        c_balance = bs(r_balance.content, 'html.parser')
    else:
        print(f'Error scraping {league} balance.')
        continue
              
    # Step 7: Find the league transfer balance table and locate the row for the season ending in '23'.
    all_rows = c_balance.find(class_="items").find("tbody").find_all("tr")
    for i, row in enumerate(all_rows):
        head = row.find("a")
        if head and head.text.strip()[-2:] == '23':
            start_row = i
            break
    else:
        # No matching season row: skip the league instead of reusing a stale start_row.
        print(f'No starting season found for {league} balance.')
        continue
    balance_rows = all_rows[start_row:]
    
    # Track progress while scraping by seeing which league is being scraped. 
    print(league)
    
    # Create a dictionary for each league in all data. 
    all_data[league] = {}
    
    # Step 7 (continued): Extract team information from the league table.
    table = c_league.find(class_='responsive-table')
    tbody = table.find('tbody')
    rows = tbody.find_all("tr")
    
    for i, row in enumerate(rows[:10]):
        link = row.find('a')
        team = link['title']
        all_data[league][team] = {}
               
        # Transfer and league rankings page's URLs are similar to the team overview page URL.
        time.sleep(2)
        r_team = requests.get(url_prefix + link['href'][:-14].replace('startseite', 'alletransfers'), headers=headers)
        
        time.sleep(2)
        r_position = requests.get(url_prefix + link['href'][:-14].replace('startseite', 'platzierungen'), headers=headers)

        if r_team.status_code == 200 and r_position.status_code == 200:
            c_team = bs(r_team.content, 'html.parser')
            c_position = bs(r_position.content, 'html.parser')
        else:
            print(f'Error scraping {team}.')
            continue
        
        # Step 8: Isolate useful information from the webpages.
        season_rows = c_position.select("tbody tr")[2:]
        
        # Find the correct row to start from in the transfer rows.
        all_rows = c_team.find_all(class_='row')
        for i, row in enumerate(all_rows):
            head = row.find("h2")
            if head and head.text.strip()[-2:] == '23':
                start_row = i
                break
        else:
            # No matching season block: skip the team instead of reusing a stale start_row.
            print(f'No starting season found for {team}.')
            continue
        transfer_rows = all_rows[start_row:]
                
        # Handle missing 'League rankings' rows and match the season information.
        season_counter = 0
        season_max = len(season_rows)
        
        # Step 8 (continued): Loop through up to 30 seasons of data for each team.
        for i, season in enumerate(transfer_rows[:30]):
            
            # Determine the year based on the last two characters (e.g., '22/23').
            year = season.find("h2").text.strip()[-2:]
            transfer_tables = season.find_all(class_='box')
            
            # If there was no data for the revenue or spend of a season, that value will be set to 0. 
            transfer_revenue = transfer_tables[1].select('tfoot td')
            revenue = '0' 
            if len(transfer_revenue) > 0:
                revenue = transfer_revenue[0].text
                
            transfer_spend = transfer_tables[0].select('tfoot td')
            spend = '0' 
            if len(transfer_spend) > 0:
                spend = transfer_spend[0].text
            
            # Clean and convert the financial values.
            revenue = clean_value(revenue)
            spend = clean_value(spend)
            
            # Reset season counter if necessary.
            if season_counter == season_max:
                season_counter = 0
            
            # Handle missing 'League ranking' rows.
            season_info = season_rows[season_counter].find_all('td')
            
            # If the years match, extract the relevant data.
            if year == season_info[0].text[-2:]:
                season_counter += 1
                goals = season_info[7].text
                competition = season_info[3].text
                position = season_info[10].text
                wins = season_info[4].text
                ties = season_info[5].text
                losses = season_info[6].text
                goals_for = season_info[8].text
                goals_against = season_info[9].text
            
            # If the years don't match, set all variables except 'competition' and 'position' to NaN.
            else:
                goals = np.nan
                competition = 'Not First'
                position = '≤10'
                wins = np.nan
                ties = np.nan
                losses = np.nan
                goals_for = np.nan
                goals_against = np.nan
                
            # Compute additional metrics as required.
            relative = (spend / revenue) if revenue != 0 else np.nan
            net = revenue - spend
            
            # Store all data for the season in a dictionary.
            all_data[league][team][year] = {
                'revenue': revenue,
                'spent': spend,
                'goals': goals,
                'competition': competition,
                'position': position,
                'wins': wins,
                'ties': ties,
                'losses': losses,
                'league_spent': clean_value(balance_rows[i].find_all("td")[1].text),
                'relative': relative,
                'net': net,
                'goals_for': goals_for,
                'goals_against': goals_against,
            }

# Step 9: Define a function to create a DataFrame from the collected data dictionary.
def make_df(dct):
    data = {
        'league': [],
        'team': [],
        'season': [],
        'revenue': [],
        'spent': [],
        'goals': [],
        'competition': [],
        'position': [],
        'wins': [],
        'ties': [],
        'losses': [],
        'league_spent': [],
        'relative': [],
        'net': [],
        'goals_for': [],
        'goals_against': [],
    }
    
    # Populate the DataFrame by iterating through the dictionary.
    for league in dct:
        for team in dct[league]:
            for season in dct[league][team]:
                data['league'].append(league)
                data['team'].append(team)
                data['season'].append(season)
                for metric in dct[league][team][season]:
                    data[metric].append(dct[league][team][season][metric])
    
    df = pd.DataFrame(data)
    
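    # Step 10: Calculate rolling 5-season aggregates for financial and performance metrics.
    # '5_season_agg' is the rolling sum of the team's own spend; '5_season_relative' below averages the ratio.
    df['5_season_agg'] = df.groupby(['team'])['spent'].transform(lambda x: x.rolling(5, min_periods=1).sum())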
    df['5_season_net'] = df.groupby(['team'])['net'].transform(lambda x: x.rolling(5, min_periods=1).sum())
    df['5_season_league_agg'] = df.groupby(['team'])['league_spent'].transform(lambda x: x.rolling(5, min_periods=1).sum())
    df['5_season_relative'] = df.groupby(['team'])['relative'].transform(lambda x: x.rolling(5, min_periods=1).mean())
    df['first_tier'] = df['competition'].apply(lambda x: 1 if x == 'First Tier' else 0)
    
    return df

England
Spain
Italy
Germany
France
Portugal
Netherlands
Türkiye
Russia
Belgium
Greece
Austria
Ukraine
Denmark
Switzerland


In [2]:
# Step 11: Create the final DataFrame from the collected data.
all_data_df = make_df(all_data)
all_data_df.sample(5)

| Unnamed: 0 | league | team | season | revenue | spent | competition | position | wins | ties | losses | league_spent | relative | net | goals_for | goals_against | 5_season_agg | 5_season_net | 5_season_league_agg | 5_season_relative | first_tier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3670 | Scotland | Ross County FC | 22 | 0.0 | 0.0 | First Tier | 5 | 10.0 | 10.0 | 13.0 | 32.14 | 0.0 | 0.0 | 45.0 | 52.0 | 0.045 | 1.083 | 144.33 | 0.000312 | 1 |
| 2016 | Turkey | Antalyaspor | 1 | 4.2 | 0.013 | First Tier | ≤10 | 9.0 | 9.0 | 16.0 | 75.76 | 0.000172 | 4.187 | 45.0 | 64.0 | 0.013 | 4.187 | 189.9 | 6.8e-05 | 1 |
| 707 | Italy | Juventus FC | 6 | 14.03 | 30.39 | First Tier | ≤10 | 27.0 | 10.0 | 1.0 | 196.88 | 0.154358 | -16.36 | 71.0 | 24.0 | 358.04 | -124.91 | 1761.45 | 0.203264 | 1 |
| 2668 | Belgium | KVC Westerlo | 9 | 1.8 | 0.2 | First Tier | 6 | 15.0 | 7.0 | 12.0 | 30.65 | 0.006525 | 1.6 | 42.0 | 38.0 | 0.475 | 2.975 | 115.19 | 0.004124 | 1 |
| 1843 | Turkey | Fenerbahce | 94 | 0.0 | 1.7 | First Tier | 2 | 21.0 | 6.0 | 3.0 | 2.42 | 0.702479 | -1.7 | 69.0 | 26.0 | 52.76 | -40.86 | 389.15 | 0.135578 | 1 |


In [3]:
# Summary of the dataset
all_data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4342 entries, 0 to 4341
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   league               4342 non-null   object 
 1   team                 4342 non-null   object 
 2   season               4342 non-null   int64  
 3   revenue              4342 non-null   float64
 4   spent                4342 non-null   float64
 5   competition          4342 non-null   object 
 6   position             4342 non-null   object 
 7   wins                 3790 non-null   float64
 8   ties                 3790 non-null   float64
 9   losses               3790 non-null   float64
 10  league_spent         4342 non-null   float64
 11  relative             4277 non-null   float64
 12  net                  4342 non-null   float64
 13  goals_for            3790 non-null   float64
 14  goals_against        3790 non-null   float64
 15  5_season_agg         4342 non-null   f

In [4]:
# Step 12: Save the final DataFrame as a CSV file
all_data_df.to_csv('../Data/first_data.csv', index=False)

## IV. Conclusion

In this notebook, we successfully completed the initial and crucial step of data collection for our broader project, which aims to analyze the correlation between financial metrics and performance outcomes in European football clubs. By employing **web scraping** techniques, we gathered comprehensive data on player transfers, **team finances**, and league standings from Transfermarkt for the top **150** teams across the top **15** European leagues.

Through meticulous scraping and data processing, we have compiled a robust dataset that includes key financial metrics—such as **`revenue`**, **`spent`**, and **`net`** balance—alongside **performance metrics** like **`goals`** scored, **`wins`**, **`losses`**, and final league **`positions`**. This dataset is now structured and ready for deeper analysis.

The **data collected** in this notebook will serve as the foundation for the next stages of our project, where we will analyze the relationships between financial investment and team performance. By exploring these correlations, we aim to gain insights into the financial strategies of football clubs and their impact on success in domestic leagues.

This notebook has laid the groundwork for our analysis, ensuring that we have reliable and well-organized data to draw meaningful conclusions about **`the financial efficiency and competitive performance of European football teams`** in subsequent analyses.