# 🚀 Get Data from Football Players 2023/2024 - La Liga Edition

Welcome to this data-driven project where we dive deep into the world of football, specifically focusing on the 2023/2024 La Liga season. In this notebook, you'll find a comprehensive approach to collecting and organizing data for every player who has taken the field in Spain's top football league this season.

## 🧩 Project Overview

This project is divided into two phases:

1. **Data Collection:**
   - We start by compiling a list of all participating players, which involves scraping the profile link for each player from the FutbolFantasy statistics page.

2. **Data Extraction & Cleaning:**
   - For each player, we access their individual pages, extract detailed statistics, and perform data cleaning to ensure consistency and accuracy. The final output is an Excel file containing all relevant stats and player information, ready for further analysis.

## 🛠️ Skills Utilized

- **Python:** The core programming language used to build and execute the data extraction and cleaning processes.
- **Selenium:** A powerful web automation tool employed to navigate and scrape data from the web pages efficiently.
- **HTML:** Knowledge of the page structure is used to write the XPath expressions that locate each statistic.
- **CSS:** Class names and CSS selectors are used to target specific elements on the web pages.
- **JavaScript:** Executed within the browser to interact with dynamic elements on the web pages, such as clicking buttons and loading content.


## 🌐 Data Source

All data used in this project has been obtained from the [FutbolFantasy website](https://www.futbolfantasy.com/laliga/estadisticas/jugador/2024/todos/total). This platform provides comprehensive statistics for La Liga players, making it an invaluable resource for this project.

## 🎯 Goals

- **Automate** the data collection process for ongoing updates throughout the season.
- **Consolidate** player statistics into a structured, easy-to-analyze format.
- **Enable** advanced analytics and insights on player performance throughout the 2023/2024 La Liga season.

Feel free to explore the code, understand the methodology, and use the data for your own analysis or projects. Let's get started! ⚽📊



In [9]:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
import random


In [10]:
ESPERA = 0.5
CHROMEDRIVER_PATH = 'C:/Users/User/Desktop/ESTIU/TABLEAU/MISTERFANTASY/chromedriver-win64/chromedriver.exe'

## Part 1: Getting the Links

In this section, the script gathers all player profile links from the La Liga 2023/2024 statistics page. 🚀 It opens the target page, accepts the cookie banner, and collects the profile URLs from the statistics table. It then clicks 'Next' to move through the paginated table until the button is disabled, collecting the links on every page. Finally, it checks the collected list for duplicates and prints the total number of unique player links. ⚽📊

In [11]:
# Initialize the Chrome WebDriver
service = Service(CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service)
driver.get("https://www.futbolfantasy.com/laliga/estadisticas/jugador/2024/todos/total")

# Accept cookies
try:
    wait = WebDriverWait(driver, 10)  # Adjust wait time as needed
    cookie_button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "css-pp6blp")))
    cookie_button.click()
    time.sleep(2)
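except Exception as e:
    # Catch Exception rather than using a bare except, so Ctrl+C still interrupts the run
    print("Error accepting cookies:", e)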

# List to store all player links
all_links = []

# Function to collect player links from the current table
def collect_links():
    table = driver.find_element(By.CLASS_NAME, "dataTables_scroll")
    rows = table.find_elements(By.TAG_NAME, "tr")
    for row in rows:
        link_elements = row.find_elements(By.TAG_NAME, "a")
        if link_elements:
            link = link_elements[0].get_attribute('href')
            all_links.append(link)

# Start collecting links from the first page
collect_links()

# Attempt to click 'Next' and continue collecting links
while True:
    try:
        # Two classes combined: use a CSS selector instead of By.CLASS_NAME
        next_button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".paginate_button.next")))
        driver.execute_script("arguments[0].scrollIntoView(true);", next_button)
        if "disabled" in next_button.get_attribute("class"):
            break  # Stop if 'Next' button is disabled
        next_button.click()
        time.sleep(2)  # Wait for the page to load
        collect_links()  # Collect links from the new page
    except Exception as e:
        print("Error navigating to the next page:", e)
        break

# Close the browser after collecting all links
driver.quit()



NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location
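The `NoSuchDriverException` above means Selenium could not obtain a Chrome driver, which typically happens when the hard-coded `CHROMEDRIVER_PATH` does not exist on the machine running the notebook. A minimal sketch of a more portable setup follows. It assumes Selenium 4.6 or later, where Selenium Manager tries to locate or download a matching driver when no path is given. `resolve_driver_path` and `build_driver` are hypothetical helper names, not part of the original notebook.

```python
import os
from typing import Optional


def resolve_driver_path(candidate: Optional[str]) -> Optional[str]:
    """Return `candidate` only if it points to an existing file, else None."""
    if candidate and os.path.isfile(candidate):
        return candidate
    return None


def build_driver(candidate: Optional[str] = None):
    """Create a Chrome WebDriver (hypothetical helper, not from the notebook)."""
    # Imported lazily so the path check above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    path = resolve_driver_path(candidate)
    if path is not None:
        # An explicit, existing driver binary was supplied: use it.
        return webdriver.Chrome(service=Service(path))
    # No usable path: let Selenium Manager (Selenium >= 4.6) resolve the driver.
    return webdriver.Chrome()
```

With this in place, `driver = build_driver(CHROMEDRIVER_PATH)` could replace the two initialization lines, and the notebook would no longer depend on one specific folder layout.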


In [12]:
all_links_no_duplicados = list(set(all_links))
if len(all_links)==len(all_links_no_duplicados):
    print("Correct!")
else:
    print("There are duplicate links")
    
print("Total players in La Liga 23/24: "+str(len(all_links_no_duplicados)))
print(all_links_no_duplicados[0:5])

NameError: name 'all_links' is not defined
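Note that `list(set(...))` removes duplicates but does not keep the order in which the links appeared in the table, so a later slice such as `[0:5]` is not guaranteed to select the same players on every run. A minimal sketch of an order-preserving alternative is shown below. `unique_in_order` is a hypothetical helper, and the URLs are placeholders for illustration, not real player pages.

```python
def unique_in_order(links):
    """Drop duplicate and empty links while keeping first-seen order."""
    # dict preserves insertion order, so the first occurrence of each link wins.
    return list(dict.fromkeys(link for link in links if link))


# Placeholder URLs for illustration only.
sample = [
    "https://example.com/player/a",
    "https://example.com/player/b",
    "https://example.com/player/a",
    None,  # get_attribute('href') can return None for anchors without an href
]
unique = unique_in_order(sample)
print(f"{len(sample) - len(unique)} entries dropped, {len(unique)} unique links kept")
```

Filtering out empty values at the same step also keeps `None` entries from reaching `driver.get()` in Part 2.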

## Part 2: Extracting Data from Player Links

In this section, the script visits each player link gathered in Part 1 and extracts the data from the individual profile. 🌟 For every player it reads the general information (club, position, and personal details), the Sofascore value and points for each of the 38 matchdays, and the season totals from the general statistics tab. Each player becomes one row, and the rows are compiled into a single table that can be used for performance analysis, trend identification, or predictive modeling. 📈

In [4]:
todos_links=all_links_no_duplicados[0:5]

NameError: name 'all_links_no_duplicados' is not defined

In [5]:
# XPaths for the general statistics.
# Every entry shares the same container, so only the varying tail is listed.
STATS_BASE = "/html/body/div[2]/div/div[1]/main/div[2]/section/div/div[2]/div/div[4]/div/div[1]/div"
estadisticas_generales = {
    "Partidos Jugados": f"{STATS_BASE}/div[1]/div[2]/div[1]/div[2]",
    "Titular": f"{STATS_BASE}/div[1]/div[2]/div[2]/div[2]",
    "Suplente": f"{STATS_BASE}/div[1]/div[2]/div[3]/div[2]",
    "Minutos Jugados": f"{STATS_BASE}/div[1]/div[2]/div[4]/div[2]",
    "Goles": f"{STATS_BASE}/div[1]/div[2]/div[5]/div[2]",
    "Asistencias": f"{STATS_BASE}/div[1]/div[2]/div[6]/div[2]",
    "Tiros a puerta": f"{STATS_BASE}/div[2]/div[1]/div[1]/div[2]/span",
    "Tiros": f"{STATS_BASE}/div[2]/div[1]/div[1]/div[2]",
    "Tiros al Palo": f"{STATS_BASE}/div[2]/div[1]/div[3]/div[2]",
    "Regates": f"{STATS_BASE}/div[2]/div[2]/div[2]/div[2]/span",
    "Córners Forzados": f"{STATS_BASE}/div[2]/div[2]/div[4]/div[2]",
    "Faltas Recibidas": f"{STATS_BASE}/div[2]/div[3]/div[1]/div[2]",
    "Faltas Cometidas": f"{STATS_BASE}/div[2]/div[3]/div[2]/div[2]",
    "Pases Interceptados": f"{STATS_BASE}/div[3]/div[1]/div[1]/div[2]",
    "Balones Robados": f"{STATS_BASE}/div[3]/div[1]/div[2]/div[2]",
    "Balones Robados Último Hombre": f"{STATS_BASE}/div[3]/div[1]/div[3]/div[2]",
    "Tarjetas Amarillas": f"{STATS_BASE}/div[3]/div[2]/div[1]/div[2]",
    "Tarjetas Rojas": f"{STATS_BASE}/div[3]/div[2]/div[2]/div[2]",
    "Penaltis Cometidos": f"{STATS_BASE}/div[3]/div[3]/div[2]/div[2]",
    "Penaltis Forzados": f"{STATS_BASE}/div[3]/div[3]/div[3]/div[2]",
    "Errores en gol en contra": f"{STATS_BASE}/div[3]/div[3]/div[4]/div[2]",
    "Pases Completados": f"{STATS_BASE}/div[2]/div[2]/div[1]/div[2]",
    "Centros Precisos / Centros": f"{STATS_BASE}/div[2]/div[2]/div[3]/div[2]",
    "Goles Penalti / Lanzados": f"{STATS_BASE}/div[3]/div[3]/div[1]/div[2]"
}

# Define XPaths for per-matchday statistics (Sofascore, points, matchday number) for 38 matchdays.
# The targeted rows are every other table row: tr[1], tr[3], ..., tr[75].
TABLA_JORNADAS = "/html/body/div[2]/div/div[1]/main/div[2]/section/div/div[2]/div/div[4]/div/div[3]/table/tbody"
jornadaNSofascore, jornadaNPuntos, jornadasNumero = [], [], []

for i in range(38):
    fila = i * 2 + 1
    jornadaNSofascore.append(f"{TABLA_JORNADAS}/tr[{fila}]/td[7]")
    jornadaNPuntos.append(f"{TABLA_JORNADAS}/tr[{fila}]/td[11]/span[3]")
    jornadasNumero.append(f"{TABLA_JORNADAS}/tr[{fila}]/td[2]")
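The three XPath lists above all use the row index `i * 2 + 1`, that is, rows 1, 3, ..., 75. This suggests the matchday table alternates one data row with one other row, which is an assumption about the page layout that cannot be verified here. The index arithmetic itself can be checked in isolation with a small sketch. `TABLE_XPATH` is copied from the notebook, while `matchday_xpaths` is a hypothetical helper.

```python
TABLE_XPATH = (
    "/html/body/div[2]/div/div[1]/main/div[2]/section/div/div[2]"
    "/div/div[4]/div/div[3]/table/tbody"
)


def matchday_xpaths(cell_suffix, total_matchdays=38):
    """Return one XPath per matchday, targeting every odd table row."""
    return [
        f"{TABLE_XPATH}/tr[{2 * i + 1}]/{cell_suffix}"
        for i in range(total_matchdays)
    ]


sofascore_xpaths = matchday_xpaths("td[7]")
points_xpaths = matchday_xpaths("td[11]/span[3]")
number_xpaths = matchday_xpaths("td[2]")

# First and last generated rows: tr[1] and tr[75].
print(sofascore_xpaths[0].rsplit("/", 2)[-2], sofascore_xpaths[-1].rsplit("/", 2)[-2])
```

Building all three lists from one template keeps the row arithmetic in a single place, so a change in the table layout only has to be fixed once.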

In [6]:
# Function to extract general statistics based on XPath
def estadisticas_general(driver, wait, xpath, variable):
    try:
        elemento = wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
        var = elemento.text
        print(f"{variable}: {var}")
        return var
    except Exception:
        # Element missing or not loaded in time: record the value as "NA"
        var = "NA"
        print(f"{variable}: {var}")
        return var

# Function to extract club information
def estadisticasClub(XPath, variable):
    try:
        elemento = WebDriverWait(driver, ESPERA).until(EC.presence_of_element_located((By.XPATH, XPath)))
        var = elemento.text
        print(f"{variable}: {var}")
        return var
    except Exception:
        # Element missing or not loaded in time: record the value as "NA"
        return "NA"


In [7]:
# Initialize an empty list to store all player data
data = []
contador = 1

# Iterate over each player link to scrape the data
for link in todos_links:
    print(f"###################\n{contador}\n###################")

    jugador = []
    print(f"Opening link: {link}")

    # Initialize WebDriver and open the player's link
    service = Service(CHROMEDRIVER_PATH)
    driver = webdriver.Chrome(service=service)
    driver.get(link)

    try:
        # Accept cookies if prompted
        wait = WebDriverWait(driver, ESPERA)
        button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "css-pp6blp")))
        button.click()
        time.sleep(2)
    except Exception:
        print("No cookie prompt found.")

    # Extract player image URL
    try:
        image = driver.find_element(By.CSS_SELECTOR, ".row.profile img")
        image_src = image.get_attribute("src")
        print(f"Player image source: {image_src}")
    except Exception:
        image_src = "NA"

    # Extract club information
    club = estadisticasClub("/html/body/div[2]/div/div[1]/main/div[2]/section/div/div[1]/div[1]/div/div[1]/div[1]/a", "Club")

    try:
        # Extract position and other general info
        span_element = driver.find_element(By.CSS_SELECTOR, ".mx-2.mb-3.text-center.mt-1 span")
        span_text = span_element.text
        print(f"Player position: {span_text}")

        info_general = []
        info_elements = driver.find_elements(By.CSS_SELECTOR, ".row.profile.mt-3.negative .info")
        for element in info_elements:
            info_right = element.find_element(By.CLASS_NAME, "info-right").text
            info_general.append(info_right)
        print(f"General info: {info_general}")

        # Extract playing positions (demarcaciones)
        demarcaciones = estadisticas_general(driver, wait, "/html/body/div[2]/div/div[1]/main/div[2]/section/div/div[1]/div[3]/div[2]/div[2]/div[2]", "Demarcaciones")

        # Select the "comunio" option
        # Two classes combined: use a CSS selector instead of By.CLASS_NAME
        select_element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".mt-2.js_game")))
        select = Select(select_element)
        select.select_by_value("comunio")

        # Collect matchday statistics for each jornada (38 matchdays)
        jornadas = {}
        for i in range(38):
            jornada = 38 - i
            jornada_i = estadisticas_general(driver, wait, jornadasNumero[i], f"Jornada: {jornada}")
            jornadaSofascore = estadisticas_general(driver, wait, jornadaNSofascore[i], f"Jornada {jornada} Sofascore")
            jornadaPuntos = estadisticas_general(driver, wait, jornadaNPuntos[i], f"Jornada {jornada} Puntos")
            puntuaciones = [jornadaSofascore, jornadaPuntos]
            jornadas[jornada_i] = puntuaciones

        print(f"Collected matchdays data: {jornadas}")

        # Switch to general statistics tab
        active_tab = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".nav-link.ff-chart-pie[data-ok='statsglobales']")))
        driver.execute_script("arguments[0].scrollIntoView(true);", active_tab)
        time.sleep(1)
        driver.execute_script("arguments[0].click();", active_tab)

        # Extract general statistics
        stat_general = []
        for estadistica, xpath in estadisticas_generales.items():
            valor = estadisticas_general(driver, wait, xpath, estadistica)
            stat_general.append(valor)

        # Append all player data
        jugador.extend([info_general[0], club, span_text, demarcaciones])
        jugador.extend(info_general[1:7])
        jugador.append(image_src)
        jugador.extend(stat_general)

        jornadasJugador = []

        # Add matchday data
        for i in range(38, 0, -1):
            str_i = str(i)
            if str_i in jornadas:
                jornadasJugador.extend(jornadas[str_i])
            else:
                jornadasJugador.extend(['0', '0'])

        jugador.extend(jornadasJugador)

        print(f"Player data: {jugador}")
        data.append(jugador)

    except Exception as e:
        # The partially filled row is discarded: only complete rows are appended to `data`
        print(f"Error collecting data for player: {e}")

    # Close the browser for each player
    driver.quit()
    contador += 1

    

NameError: name 'todos_links' is not defined
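Inside the loop above, matchday results are stored in a dictionary keyed by the matchday number read from the page, then flattened from matchday 38 down to 1, with `'0'` filled in for any matchday that is missing. Entries keyed `"NA"` (from failed lookups) never match a matchday number and are dropped. A minimal sketch of that flattening step as a standalone function follows. `flatten_matchdays` is a hypothetical name, and the values are illustrative, not scraped data.

```python
def flatten_matchdays(jornadas, total_matchdays=38, missing=("0", "0")):
    """Flatten {matchday: [sofascore, points]} into one row, last matchday first."""
    row = []
    for matchday in range(total_matchdays, 0, -1):
        # Keys are strings because they come from the page text.
        row.extend(jornadas.get(str(matchday), missing))
    return row


# Illustrative values only, using 3 matchdays instead of 38.
example = {"3": ["7.1", "6"], "1": ["6.5", "4"], "NA": ["NA", "NA"]}
flat = flatten_matchdays(example, total_matchdays=3)
print(flat)
```

The result always has `2 * total_matchdays` entries, which is what keeps each player row aligned with the `J38 Sofascore`, `J38 Puntos`, ... column headers.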

In [8]:
# Define the headers for the DataFrame
headers = [
    "Nombre", "Club", "Posición", "Demarcaciones", "Edad y cumpleaños", "Lugar de nacimiento", 
    "Nacionalidad", "Altura", "Pie dominante", "Fecha fin de Contrato", "URL Imagen", 
    "Partidos Jugados", "Titular", "Suplente", "Minutos Jugados", "Goles", "Asistencias", 
    "Tiros a puerta", "Tiros a puerta/tiros", "Tiros al palo", "Regates", "Córners Forzados", 
    "Faltas recibidas", "Faltas Cometidas", "Pases interceptados", "Balones robados", 
    "Balones Robados Último Hombre", "Tarjetas amarillas", "Tarjetas rojas", 
    "Penaltis cometidos", "Penaltis forzados", "Errores en gol en contra", "Pases completados", 
    "Centros precisos/Centros", "Goles Penalti/Lanzados"
]

# Add matchday headers (Jornada Sofascore and Puntos)
for i in range(38, 0, -1):
    headers.append(f"J{i} Sofascore")
    headers.append(f"J{i} Puntos")

# Convert the data into a DataFrame and save it to an Excel file
df = pd.DataFrame(data, columns=headers)
df.to_excel('dataPrueba.xlsx', index=False)

print("Excel file 'dataPrueba.xlsx' created successfully.")


Excel file 'dataPrueba.xlsx' created successfully.
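Because each player row is assembled by position, a row of unexpected length (for example, when a profile exposes fewer general-info fields) can make `pd.DataFrame(data, columns=headers)` fail or leave values under the wrong column. A minimal sketch of a pre-check is shown below, assuming the header layout used in the cell above. `build_headers` and `split_valid_rows` are hypothetical helpers.

```python
def build_headers(base_headers, total_matchdays=38):
    """Append 'J<n> Sofascore' and 'J<n> Puntos' for each matchday, last matchday first."""
    headers = list(base_headers)
    for matchday in range(total_matchdays, 0, -1):
        headers.append(f"J{matchday} Sofascore")
        headers.append(f"J{matchday} Puntos")
    return headers


def split_valid_rows(rows, headers):
    """Separate rows whose length matches the header count from those that do not."""
    expected = len(headers)
    valid = [row for row in rows if len(row) == expected]
    invalid = [row for row in rows if len(row) != expected]
    return valid, invalid


# Illustrative data: two base columns and one matchday give four columns in total.
demo_headers = build_headers(["Nombre", "Club"], total_matchdays=1)
demo_rows = [["A", "X", "7.0", "5"], ["B", "Y", "6.0"]]
valid_rows, invalid_rows = split_valid_rows(demo_rows, demo_headers)
print(f"{len(valid_rows)} valid rows, {len(invalid_rows)} rows with unexpected length")
```

Only `valid_rows` would then be passed to `pd.DataFrame`, while `invalid_rows` can be logged and the corresponding players scraped again.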
