# Pirelli F1 Tire Data Web Scraper

This notebook scrapes official tire and circuit data from Pirelli's website to complement F1 tire degradation analysis.

## Target Data:
- **Tire Compounds**: Soft/Medium/Hard compound specifications (C1-C5)
- **Circuit Length**: Track distance in km
- **Track Characteristics** (1-5 scale):
  - Traction
  - Asphalt Grip
  - Tire Stress
  - Braking
  - Lateral Forces
  - Downforce
  - Asphalt Abrasion
  - Track Evolution

## Output:
Structured DataFrame/CSV with all circuit data for analysis integration.
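
A row of that output can be sanity-checked before it is written to CSV. The sketch below is illustrative only: `validate_record` is a hypothetical helper, and its field names mirror the columns this notebook builds later (compounds in C1-C5, ratings on the 1-5 scale).

```python
from typing import Dict, List, Optional

RATING_FIELDS = ["traction", "asphalt_grip", "tire_stress", "braking",
                 "lateral", "downforce", "asphalt_abrasion", "track_evolution"]
COMPOUND_FIELDS = ["soft_compound", "medium_compound", "hard_compound"]
VALID_COMPOUNDS = {"C1", "C2", "C3", "C4", "C5"}


def validate_record(record: Dict[str, Optional[object]]) -> List[str]:
    """Return a list of problems; an empty list means the row looks sane."""
    problems = []
    for field in COMPOUND_FIELDS:
        value = record.get(field)
        # Missing values (None) are allowed; only wrong values are reported
        if value is not None and value not in VALID_COMPOUNDS:
            problems.append(f"{field}: unexpected compound {value!r}")
    for field in RATING_FIELDS:
        value = record.get(field)
        if value is not None and value not in range(1, 6):
            problems.append(f"{field}: rating {value!r} outside 1-5")
    return problems
```

Running it over every scraped row before saving gives an early signal that an extraction pattern has latched onto the wrong number.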

## Setup and Imports

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import time
from urllib.parse import urljoin, urlparse
import json
from typing import Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')

print("🕷️ Pirelli Data Scraper Ready!")
print("📊 Target: Tire compounds and circuit characteristics (2022-2024)")

🕷️ Pirelli Data Scraper Ready!
📊 Target: Tire compounds and circuit characteristics (2022-2024)


## Scraper Configuration

In [2]:
# Configuration
BASE_URL = "https://www.pirelli.com/global/en-ww/emotions-and-numbers/"
YEARS = [2022, 2023, 2024]
DELAY_BETWEEN_REQUESTS = 1  # Seconds - be respectful to the server

# Headers to mimic a real browser
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

print(f"🎯 Target years: {YEARS}")
print(f"⏱️ Delay between requests: {DELAY_BETWEEN_REQUESTS}s")

🎯 Target years: [2022, 2023, 2024]
⏱️ Delay between requests: 1s
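
Alongside the request delay, it can be worth checking a site's robots.txt before crawling it. The sketch below uses the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` are made up for illustration and are not Pirelli's actual robots.txt, and `is_allowed` is a hypothetical helper.

```python
from typing import Iterable
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; load the real file before relying on this
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(robots_lines: Iterable[str], user_agent: str, url: str) -> bool:
    """Parse robots.txt lines and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_ROBOTS, "MyScraper", "https://example.com/private/page")
allowed = is_allowed(SAMPLE_ROBOTS, "MyScraper", "https://example.com/infographics-2024/")
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` loads the real file instead of the sample lines.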


## Helper Functions

In [3]:
def get_page_content(url: str) -> Optional[BeautifulSoup]:
    """
    Safely fetch and parse a web page.
    
    Args:
        url: URL to fetch
        
    Returns:
        BeautifulSoup object or None if failed
    """
    try:
        print(f"📡 Fetching: {url}")
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        time.sleep(DELAY_BETWEEN_REQUESTS)  # Be respectful
        return soup
        
    except requests.RequestException as e:
        print(f"❌ Error fetching {url}: {e}")
        return None


def extract_tire_compounds(soup: BeautifulSoup) -> Dict[str, str]:
    """
    Extract tire compound information from the first infographic.
    
    Args:
        soup: Parsed HTML content
        
    Returns:
        Dictionary with tire compound info
    """
    compounds = {'soft': None, 'medium': None, 'hard': None}
    
    try:
        # Look for text nodes that mention a compound code (C1-C5).
        # This will need to be adjusted based on actual HTML structure.
        compound_pattern = re.compile(r'\bC([1-5])\b')
        # `string=` is the current name of the deprecated `text=` argument
        tire_elements = soup.find_all(string=compound_pattern)
        
        for element in tire_elements:
            text = element.strip()
            compound_match = compound_pattern.search(text)
            if not compound_match:
                continue
            compound = f"C{compound_match.group(1)}"
            
            # Use the parent's text to decide whether this is the soft, medium, or hard tire
            parent_text = element.parent.get_text().lower() if element.parent else text.lower()
            
            if 'soft' in parent_text or 'red' in parent_text:
                compounds['soft'] = compound
            elif 'medium' in parent_text or 'yellow' in parent_text:
                compounds['medium'] = compound
            elif 'hard' in parent_text or 'white' in parent_text:
                compounds['hard'] = compound
        
    except Exception as e:
        print(f"⚠️ Error extracting tire compounds: {e}")
    
    return compounds


def extract_circuit_length(soup: BeautifulSoup) -> Optional[float]:
    """
    Extract circuit length from the page.
    
    Args:
        soup: Parsed HTML content
        
    Returns:
        Circuit length in km or None
    """
    try:
        # Look for circuit length patterns
        length_patterns = [
            r'(\d+\.\d+)\s*km',
            r'(\d+,\d+)\s*km',
            r'length[^\d]*(\d+\.\d+)',
            r'circuit[^\d]*(\d+\.\d+)'
        ]
        
        page_text = soup.get_text()
        
        for pattern in length_patterns:
            matches = re.findall(pattern, page_text, re.IGNORECASE)
            if matches:
                # Convert comma to dot for European number format
                length_str = matches[0].replace(',', '.')
                return float(length_str)
                
    except Exception as e:
        print(f"⚠️ Error extracting circuit length: {e}")
    
    return None


def extract_track_characteristics(soup: BeautifulSoup) -> Dict[str, Optional[int]]:
    """
    Extract track characteristics (1-5 scale ratings).
    
    Args:
        soup: Parsed HTML content
        
    Returns:
        Dictionary with track characteristic ratings
    """
    characteristics = {
        'traction': None,
        'asphalt_grip': None,
        'tire_stress': None,
        'braking': None,
        'lateral': None,
        'downforce': None,
        'asphalt_abrasion': None,
        'track_evolution': None
    }
    
    try:
        # Look for rating elements - this will need adjustment based on actual HTML
        page_text = soup.get_text().lower()
        
        # Define search patterns for each characteristic.
        # The (?![0-9]) lookahead rejects multi-digit numbers such as years,
        # so "2024" is not read as a rating of 2.
        patterns = {
            'traction': r'traction[^0-9]*([1-5])(?![0-9])',
            'asphalt_grip': r'(?:asphalt\s*grip|grip)[^0-9]*([1-5])(?![0-9])',
            'tire_stress': r'(?:tire\s*stress|tyre\s*stress)[^0-9]*([1-5])(?![0-9])',
            'braking': r'braking[^0-9]*([1-5])(?![0-9])',
            'lateral': r'lateral[^0-9]*([1-5])(?![0-9])',
            'downforce': r'downforce[^0-9]*([1-5])(?![0-9])',
            'asphalt_abrasion': r'(?:asphalt\s*abrasion|abrasion)[^0-9]*([1-5])(?![0-9])',
            'track_evolution': r'(?:track\s*evolution|evolution)[^0-9]*([1-5])(?![0-9])'
        }
        
        for char_name, pattern in patterns.items():
            matches = re.findall(pattern, page_text)
            if matches:
                characteristics[char_name] = int(matches[0])
                
    except Exception as e:
        print(f"⚠️ Error extracting track characteristics: {e}")
    
    return characteristics


def get_race_links_for_year(year: int) -> List[str]:
    """
    Get all race page links for a specific year.
    
    Args:
        year: Year to scrape (2022-2024)
        
    Returns:
        List of race page URLs
    """
    # Build the yearly index URL from BASE_URL instead of repeating the literal
    year_url = urljoin(BASE_URL, f"infographics-{year}/")
    soup = get_page_content(year_url)
    
    if not soup:
        return []
    
    race_links = []
    
    try:
        # Look for race links - this will need adjustment based on actual HTML structure
        links = soup.find_all('a', href=True)
        
        for link in links:
            href = link['href']
            link_text = link.get_text().strip().lower()
            
            # Filter for race-related links
            race_keywords = ['grand prix', 'gp', 'race', 'circuit', 'bahrain', 'saudi', 'australia', 
                           'imola', 'miami', 'spain', 'monaco', 'azerbaijan', 'canada', 'britain',
                           'austria', 'france', 'hungary', 'belgium', 'netherlands', 'italy',
                           'singapore', 'japan', 'qatar', 'usa', 'mexico', 'brazil', 'abu dhabi']
            
            if any(keyword in link_text for keyword in race_keywords):
                full_url = urljoin(year_url, href)
                if full_url not in race_links:
                    race_links.append(full_url)
        
        print(f"🏁 Found {len(race_links)} race links for {year}")
        
    except Exception as e:
        print(f"❌ Error getting race links for {year}: {e}")
    
    return race_links


print("✅ Helper functions loaded!")

✅ Helper functions loaded!
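
`extract_circuit_length` returns the first `km` figure it finds, which is not necessarily the lap length, since a race page may also quote other distances such as the total race distance. One possible refinement, sketched here with a hypothetical `pick_lap_length` helper and an assumed plausibility window of 3 to 8 km, is to scan every candidate and keep the first one inside that window. The sample text is illustrative, not scraped.

```python
import re
from typing import Optional

# Assumed plausibility window for a lap length in km (an illustration, not an official limit)
MIN_LAP_KM, MAX_LAP_KM = 3.0, 8.0


def pick_lap_length(page_text: str) -> Optional[float]:
    """Return the first 'X.XXX km' value that falls inside the plausible window."""
    for raw in re.findall(r'(\d+[.,]\d+)\s*km', page_text, re.IGNORECASE):
        value = float(raw.replace(',', '.'))  # accept a European decimal comma
        if MIN_LAP_KM <= value <= MAX_LAP_KM:
            return value
    return None


sample = "Race distance: 308.238 km. Circuit length: 5.412 km."
lap_length = pick_lap_length(sample)
```

With the sample text above, the race distance is skipped and the shorter figure is returned.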


## Main Scraping Function

In [4]:
def scrape_race_data(race_url: str, year: int) -> Optional[Dict]:
    """
    Scrape all tire and circuit data from a single race page.
    
    Args:
        race_url: URL of the race page
        year: Year of the race
        
    Returns:
        Dictionary with all extracted data or None if failed
    """
    soup = get_page_content(race_url)
    if not soup:
        return None
    
    try:
        # Extract race name from URL or page title
        race_name = "Unknown"
        title_element = soup.find('title')
        if title_element:
            title_text = title_element.get_text()
            # Clean up the title to get race name
            race_name = title_text.split('|')[0].strip() if '|' in title_text else title_text.strip()
        
        # Extract all data
        compounds = extract_tire_compounds(soup)
        circuit_length = extract_circuit_length(soup)
        characteristics = extract_track_characteristics(soup)
        
        # Combine all data
        race_data = {
            'year': year,
            'race_name': race_name,
            'url': race_url,
            'circuit_length_km': circuit_length,
            'soft_compound': compounds['soft'],
            'medium_compound': compounds['medium'],
            'hard_compound': compounds['hard'],
            **characteristics  # Unpack all track characteristics
        }
        
        print(f"✅ Extracted data for: {race_name} ({year})")
        return race_data
        
    except Exception as e:
        print(f"❌ Error scraping race data from {race_url}: {e}")
        return None


def scrape_all_pirelli_data(years: List[int] = YEARS) -> pd.DataFrame:
    """
    Scrape tire and circuit data for all races across specified years.
    
    Args:
        years: List of years to scrape
        
    Returns:
        DataFrame with all race data
    """
    all_race_data = []
    
    print(f"🚀 Starting Pirelli data scraping for years: {years}")
    print("📊 Target data: Tire compounds + 8 track characteristics")
    print("="*50)
    
    for year in years:
        print(f"\n📅 Processing year: {year}")
        
        # Get all race links for this year
        race_links = get_race_links_for_year(year)
        
        if not race_links:
            print(f"⚠️ No race links found for {year}")
            continue
        
        # Scrape each race
        for i, race_url in enumerate(race_links, 1):
            print(f"\n🏁 Race {i}/{len(race_links)} for {year}")
            
            race_data = scrape_race_data(race_url, year)
            if race_data:
                all_race_data.append(race_data)
            
            # Extra delay between races
            time.sleep(DELAY_BETWEEN_REQUESTS)
    
    print("\n" + "="*50)
    print(f"🎉 Scraping complete! Collected data for {len(all_race_data)} races")
    
    # Convert to DataFrame
    df = pd.DataFrame(all_race_data)
    
    if not df.empty:
        print(f"📊 DataFrame shape: {df.shape}")
        print(f"📋 Columns: {list(df.columns)}")
    
    return df


print("✅ Main scraping functions loaded!")

✅ Main scraping functions loaded!
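
`scrape_race_data` takes the race name from the page title only. If titles turn out to be generic, the URL path is another possible source. The sketch below assumes race pages end in a descriptive slug. `race_name_from_url` and the example URL are hypothetical and not taken from Pirelli's site.

```python
from urllib.parse import urlparse


def race_name_from_url(url: str) -> str:
    """Turn the last non-empty path segment of a URL into a title-cased name."""
    segments = [part for part in urlparse(url).path.split('/') if part]
    if not segments:
        return "Unknown"
    return segments[-1].replace('-', ' ').replace('_', ' ').title()


# Hypothetical URL shape, for illustration only
name = race_name_from_url("https://example.com/infographics-2024/bahrain-grand-prix/")
```

A fallback like this could be used when the `<title>` lookup leaves `race_name` at `"Unknown"`.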


## Test Single Page First

In [5]:
# Test the scraper on a single page first to verify it works
print("🧪 Testing scraper on 2024 infographics page...")

test_url = "https://www.pirelli.com/global/en-ww/emotions-and-numbers/infographics-2024/"
test_soup = get_page_content(test_url)

if test_soup:
    print("✅ Successfully loaded test page")
    print(f"📄 Page title: {test_soup.find('title').get_text() if test_soup.find('title') else 'No title found'}")
    
    # Show a sample of the page content
    page_text = test_soup.get_text()[:500]
    print(f"📝 Sample content: {page_text}...")
    
    # Test link extraction
    links = test_soup.find_all('a', href=True)
    print(f"🔗 Found {len(links)} links on the page")
    
    # Show some sample links
    for i, link in enumerate(links[:5]):
        print(f"   Link {i+1}: {link.get_text().strip()[:50]} -> {link['href'][:50]}")
        
else:
    print("❌ Failed to load test page - check URL and network connection")
    print("💡 You may need to adjust the scraping approach based on the site structure")

🧪 Testing scraper on 2024 infographics page...
📡 Fetching: https://www.pirelli.com/global/en-ww/emotions-and-numbers/infographics-2024/
✅ Successfully loaded test page
📄 Page title: Emotions and Numbers: infographics 2024 | Pirelli
📝 Sample content: 
Emotions and Numbers: infographics 2024 | Pirelli
IT EN ES BR DE FR 中国
Stories
Stories
Road
Racing Spot
Life
Road overview Car Motorcycles Bicycles
Racing Spot overview Formula 1 Rally Gran Turismo Superbike Sailing Cycling Other Competitions E-sport
Life overview Sustainability People Pirelli Calendar Lifestyle Innovation
back
Products
Products
Car Tyres
Moto Tyres
...
🔗 Found 256 links on the page
   Link 1:  -> //www.pirelli.com/global/en-ww/homepage/
   Link 2:  -> //www.pirelli.com/global/en-ww/facebook-newsletter
   Link 3: IT -> //www.pirelli.com/global/it-it/homepage/
   Link 4: EN -
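
The sample links in this output are protocol-relative (they start with `//`) and point at site navigation rather than race pages. `urljoin` resolves such hrefs against the page URL, and a path-prefix check is one way to drop navigation links. The sketch below assumes race pages live under the same `emotions-and-numbers/` path as the index page, which has not been verified here. `keep_section_links` is a hypothetical helper.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def keep_section_links(page_url: str, hrefs: Iterable[str], section_path: str) -> List[str]:
    """Resolve hrefs against page_url and keep unique ones whose path starts with section_path."""
    kept: List[str] = []
    for href in hrefs:
        full_url = urljoin(page_url, href)  # handles relative and protocol-relative hrefs
        path = urlparse(full_url).path
        if path.startswith(section_path) and full_url != page_url and full_url not in kept:
            kept.append(full_url)
    return kept


page = "https://www.pirelli.com/global/en-ww/emotions-and-numbers/infographics-2024/"
hrefs = [
    "//www.pirelli.com/global/en-ww/homepage/",  # seen in the output above
    "//www.pirelli.com/global/it-it/homepage/",  # seen in the output above
    "some-race-page/",                           # hypothetical relative link
]
section_links = keep_section_links(page, hrefs, "/global/en-ww/emotions-and-numbers/")
```

Only the hypothetical relative link survives the filter; both homepage links resolve to `https://` URLs outside the section and are dropped.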

## Run Full Scraper

**⚠️ Important Notes:**
- This will make many requests to Pirelli's website
- The scraper includes delays to be respectful
- You may need to adjust the extraction functions based on the actual HTML structure
- Run the test cell above first to verify the scraper works
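
Because refining the extraction functions usually means re-running them many times, caching each page locally keeps the request count down. The following is a minimal sketch using only the standard library. `cached_fetch`, the cache directory, and the injected `fetch` callable are all assumptions for illustration; in this notebook, `fetch` could wrap `requests.get(...).text`.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, fetch: Callable[[str], str], cache_dir: Path) -> str:
    """Return the cached HTML for url, calling fetch(url) only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it becomes a safe, fixed-length file name
    cache_file = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html


# Demonstration with a stand-in fetcher that records how often it is called
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<html>stub</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch("https://example.com/page", fake_fetch, Path(tmp))
    second = cached_fetch("https://example.com/page", fake_fetch, Path(tmp))
```

In the demonstration, the second call is served from disk, so the stand-in fetcher runs only once.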

In [None]:
# Run the full scraper (uncomment when ready)
# pirelli_data = scrape_all_pirelli_data()

# For now, let's start with just one year to test
print("🎯 Starting with 2024 data only for testing...")
pirelli_data = scrape_all_pirelli_data([2024])

# Display results
if not pirelli_data.empty:
    print("\n📊 Scraped Data Summary:")
    print(pirelli_data.head())
    
    print("\n📈 Data Info:")
    pirelli_data.info()  # info() prints its report itself and returns None
    
    print("\n🔍 Sample tire compounds:")
    compound_cols = ['soft_compound', 'medium_compound', 'hard_compound']
    print(pirelli_data[compound_cols].head())
    
    print("\n🏁 Sample track characteristics:")
    char_cols = ['traction', 'asphalt_grip', 'tire_stress', 'braking']
    available_char_cols = [col for col in char_cols if col in pirelli_data.columns]
    if available_char_cols:
        print(pirelli_data[available_char_cols].head())
    
else:
    print("❌ No data scraped - check the extraction functions")
    print("💡 The HTML structure may be different than expected")

## Save Data to CSV

In [None]:
# Save the data to CSV for use in your tire degradation analysis
if not pirelli_data.empty:
    output_file = "pirelli_tire_circuit_data.csv"
    pirelli_data.to_csv(output_file, index=False)
    
    print(f"💾 Data saved to: {output_file}")
    print(f"📊 Shape: {pirelli_data.shape}")
    print(f"📋 Columns: {list(pirelli_data.columns)}")
    
    # Show data quality summary
    print("\n🔍 Data Quality Summary:")
    missing_data = pirelli_data.isnull().sum()
    print(missing_data[missing_data > 0])
    
    # Show unique tire compounds found
    print("\n🛞 Tire Compounds Found:")
    for compound_type in ['soft_compound', 'medium_compound', 'hard_compound']:
        if compound_type in pirelli_data.columns:
            unique_compounds = pirelli_data[compound_type].dropna().unique()
            print(f"  {compound_type}: {unique_compounds}")
    
else:
    print("❌ No data to save")

## Integration with F1 Analysis

Once you have the Pirelli data, you can merge it with your FastF1 analysis:

In [None]:
# Example of how to integrate with your F1 tire degradation analysis
if not pirelli_data.empty:
    print("🔗 Integration Example:")
    print("""# In your F1 analysis notebook:
    
import pandas as pd
import fastf1

# Load Pirelli data
pirelli_data = pd.read_csv('pirelli_tire_circuit_data.csv')

# Load F1 session
session = fastf1.get_session(2024, "Bahrain", "R")
session.load()

# Process with your functions
processed_laps = process_race_for_tire_analysis(session)

# Merge with circuit characteristics
race_name = "Bahrain Grand Prix"  # Match with pirelli_data
circuit_data = pirelli_data[pirelli_data['race_name'].str.contains('Bahrain')]

if not circuit_data.empty:
    # Add circuit characteristics to your lap data
    for col in ['traction', 'asphalt_grip', 'tire_stress', 'braking']:
        if col in circuit_data.columns:
            processed_laps[f'circuit_{col}'] = circuit_data[col].iloc[0]

# Now you have lap-by-lap data with official circuit characteristics!
# Perfect for advanced tire degradation modeling
""")
    
    print("\n🎯 Benefits:")
    print("✅ Official tire compound data (C1-C5)")
    print("✅ Circuit characteristics for modeling")
    print("✅ Track evolution and abrasion data")
    print("✅ Braking and lateral force intensity")
    print("✅ Perfect complement to your FastF1 analysis!")

else:
    print("❌ No data available for integration example")
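
Matching on a substring such as `'Bahrain'` works for a single race but gets fragile across a season. One option is to reduce both names to a comparable key first. The sketch below makes an assumption about how titles might look (for example with a year or a "Grand Prix" suffix). `race_key` and the `NOISE_WORDS` list are hypothetical and not part of FastF1 or pandas.

```python
import re

# Words treated as noise when comparing race names (an assumed list; extend as needed)
NOISE_WORDS = {"grand", "prix", "gp", "formula", "infographics", "pirelli"}


def race_key(name: str) -> str:
    """Lower-case a race name, drop digits, punctuation and noise words, and join the rest."""
    words = re.findall(r"[a-z]+", name.lower())
    return " ".join(word for word in words if word not in NOISE_WORDS)


key_a = race_key("2024 Bahrain Grand Prix")
key_b = race_key("Bahrain GP - Infographics")
```

Both sample names reduce to the same key, so they could be joined on it. Names with accented characters would need extra handling, since the pattern only keeps the letters a-z.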

## Troubleshooting and Refinement

If the scraper returns missing or wrong values, the helper below shows what a page actually contains:

In [None]:
# Debug a specific race page
def debug_race_page(race_url: str):
    """
    Debug what's available on a specific race page.
    """
    print(f"🔍 Debugging: {race_url}")
    
    soup = get_page_content(race_url)
    if not soup:
        return
    
    print("\n📄 Page title:")
    title = soup.find('title')
    print(title.get_text() if title else "No title found")
    
    print("\n🔤 Text content (first 1000 chars):")
    print(soup.get_text()[:1000])
    
    print("\n🖼️ Images found:")
    images = soup.find_all('img')
    for i, img in enumerate(images[:5]):
        alt_text = img.get('alt', 'No alt text')
        src = img.get('src', 'No src')
        print(f"  Image {i+1}: {alt_text} -> {src[:50]}...")
    
    print("\n🔢 Numbers found (potential ratings):")
    numbers = re.findall(r'\b[1-5]\b', soup.get_text())
    print(f"Found {len(numbers)} single digits 1-5: {numbers[:20]}...")
    
    print("\n🏎️ Tire-related text:")
    # Match compound codes C1-C5 plus both spellings, "tire" and "tyre"
    tire_text = re.findall(r'\b[Cc][1-5]\b|\w*t[iy]re\w*|\w*compound\w*', soup.get_text(), re.IGNORECASE)
    print(f"Found: {tire_text[:10]}...")

# Uncomment to debug a specific page
# debug_race_page("https://www.pirelli.com/global/en-ww/emotions-and-numbers/infographics-2024/")

print("🛠️ Debug function ready - uncomment the line above to use it")

## Summary

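This notebook provides a starting-point web scraper for Pirelli's F1 tire and circuit data. Its extraction functions are text-pattern heuristics that still need to be checked against the actual HTML structure.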

### What it scrapes:
1. **Tire Compounds** - Official C1-C5 designations for soft/medium/hard
2. **Circuit Length** - Track distance in kilometers
3. **Track Characteristics** (1-5 scale):
   - Traction
   - Asphalt Grip
   - Tire Stress
   - Braking
   - Lateral Forces
   - Downforce
   - Asphalt Abrasion
   - Track Evolution

### Next Steps:
1. **Test** the scraper on a few pages first
2. **Refine** extraction functions based on actual HTML structure
3. **Run** full scraper for all years (2022-2024)
4. **Integrate** data with your F1 tire degradation analysis

### Integration with your analysis:
The scraped data complements your FastF1 analysis by providing:
- Official tire compound context
- Circuit-specific characteristics for modeling
- Track evolution and surface data
- Braking and cornering intensity metrics

This will make your tire degradation analysis much more comprehensive! 🏎️📊