# ENTSO-E Day-Ahead Price Data Collection

**Purpose**: Collect 10 years of hourly day-ahead electricity prices for clustering and ML-based price prediction

**Scope**: Largest bidding zones in selected European countries

## Selected Zones (Largest per Country)

| Country | Zone | Code | Description |
|---------|------|------|-------------|
| Spain | ES | 10YES-REE------0 | Spain (single zone) |
| Norway | NO2 | 10YNO-2--------T | Southwest Norway (largest) |
| Denmark | DK1 | 10YDK-1--------W | West Denmark (largest) |

**Note**: The UK (GB) is excluded because it has not been part of the ENTSO-E Transparency Platform since Brexit

## Data Requirements for Project

- **Clustering**: Requires sufficient days for pattern detection (minimum 1-2 years)
- **XGBoost Training**: Needs seasonal variation and diverse market conditions
- **Time Period**: 10 years (3650 days) to capture all seasonal patterns

## Output Format

Each zone produces:
- `{ZONE}_raw.csv`: Hourly prices with timestamps
- Ready for preprocessing (outlier detection, interpolation, normalization)
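
A minimal sketch of reading one of these files back, assuming the two columns written by the collection cell (`timestamp`, `price_eur_mwh`). The helper name `load_raw_prices` and the inline sample rows are illustrative only, not collected data.

```python
import io
import pandas as pd

def load_raw_prices(source):
    """Read a {ZONE}_raw.csv file (path or file-like) into a UTC-indexed price Series."""
    df = pd.read_csv(source)
    # Timestamps are stored as text in the CSV, so parse them back to tz-aware UTC
    df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
    return df.set_index('timestamp')['price_eur_mwh'].sort_index()

# Hypothetical sample rows in the same layout as a {ZONE}_raw.csv file
sample_csv = io.StringIO(
    "timestamp,price_eur_mwh\n"
    "2024-01-01 01:00:00+00:00,41.5\n"
    "2024-01-01 00:00:00+00:00,40.0\n"
)
prices = load_raw_prices(sample_csv)
```

In the notebook the same helper could be pointed at `RAW_DIR / "ES_raw.csv"` once collection has run.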


In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import requests
import xml.etree.ElementTree as ET
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded successfully")

In [None]:
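# API Configuration
# Read the token from the environment instead of hard-coding a credential in the notebook
import os
ENTSOE_TOKEN = os.environ.get("ENTSOE_TOKEN", "")  # ENTSO-E API security token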

# Largest bidding zones per country
ZONES = {
    'ES': '10YES-REE------0',   # Spain
    'NO2': '10YNO-2--------T',  # Norway Zone 2 (Southern Norway, major volume)
    'DK1': '10YDK-1--------W'   # Denmark West (largest)
}

# Data directory
DATA_DIR = Path('data')
RAW_DIR = DATA_DIR / 'raw'
RAW_DIR.mkdir(exist_ok=True, parents=True)

# Time period: 10 years for robust clustering
DAYS_BACK = 3650  # 10 years

print("Configuration loaded:")
print(f"  Zones: {len(ZONES)}")
print(f"  Period: {DAYS_BACK} days (~10 years)")
print(f"  Output: {RAW_DIR}")

## Data Collection Functions

Following best practices from academic literature on electricity price data processing
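
Before calling the live API, it can help to see how a price document maps to timestamps. The sketch below parses a hypothetical, heavily trimmed response body (real documents carry many more elements) and assumes the hourly `PT60M` resolution shown in the sample. `parse_points` is an illustrative helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET
import pandas as pd

NS = {'ns': 'urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:3'}

# Hypothetical sample document, not real market data
SAMPLE_XML = """<Publication_MarketDocument xmlns="urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:3">
  <TimeSeries>
    <Period>
      <timeInterval><start>2024-01-01T23:00Z</start><end>2024-01-02T02:00Z</end></timeInterval>
      <resolution>PT60M</resolution>
      <Point><position>1</position><price.amount>40.10</price.amount></Point>
      <Point><position>2</position><price.amount>38.75</price.amount></Point>
      <Point><position>3</position><price.amount>37.20</price.amount></Point>
    </Period>
  </TimeSeries>
</Publication_MarketDocument>"""

def parse_points(xml_text):
    """Turn each Point into a (timestamp, price) row; position 1 is the period start."""
    root = ET.fromstring(xml_text)
    rows = []
    for period in root.findall('.//ns:Period', NS):
        start = pd.to_datetime(period.find('ns:timeInterval/ns:start', NS).text, utc=True)
        for point in period.findall('ns:Point', NS):
            position = int(point.find('ns:position', NS).text)
            price = float(point.find('ns:price.amount', NS).text)
            rows.append({
                'timestamp': start + pd.Timedelta(hours=position - 1),
                'price_eur_mwh': price,
            })
    return pd.DataFrame(rows)

sample_df = parse_points(SAMPLE_XML)
```

Positions are 1-based offsets from the period start, which is why the first point lands exactly on `start`.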

In [None]:
def fetch_entsoe_prices(zone_code, start_date, end_date, token, max_retries=3):
    """
    Fetch day-ahead electricity prices from ENTSO-E Transparency Platform
    
    Document Type A44: Price Document (Day-ahead prices)
    
    Returns:
    - DataFrame with columns: timestamp (UTC), price_eur_mwh
    """
    url = 'https://web-api.tp.entsoe.eu/api'
    
    params = {
        'securityToken': token,
        'documentType': 'A44',  # Day-ahead prices
        'in_Domain': zone_code,
        'out_Domain': zone_code,
        'periodStart': start_date.strftime('%Y%m%d%H%M'),
        'periodEnd': end_date.strftime('%Y%m%d%H%M')
    }
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=30)
            
            if response.status_code == 200:
                root = ET.fromstring(response.text)
                ns = {'ns': 'urn:iec62325.351:tc57wg16:451-3:publicationdocument:7:3'}
                
                data = []
                
                for timeseries in root.findall('.//ns:TimeSeries', ns):
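                    for period in timeseries.findall('.//ns:Period', ns):
                        start = period.find('.//ns:timeInterval/ns:start', ns).text
                        start_dt = pd.to_datetime(start, utc=True)
                        
                        # Periods may use sub-hourly resolution (e.g. PT15M), so read it
                        # instead of assuming one hour per position; default to hourly
                        res_node = period.find('ns:resolution', ns)
                        res_text = res_node.text if res_node is not None else 'PT60M'
                        step_minutes = {'PT60M': 60, 'PT30M': 30, 'PT15M': 15}.get(res_text, 60)
                        
                        for point in period.findall('.//ns:Point', ns):
                            position = int(point.find('.//ns:position', ns).text)
                            price = float(point.find('.//ns:price.amount', ns).text)
                            
                            timestamp = start_dt + timedelta(minutes=step_minutes * (position - 1))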
                            
                            data.append({
                                'timestamp': timestamp,
                                'price_eur_mwh': price
                            })
                
                if data:
                    df = pd.DataFrame(data)
                    df = df.sort_values('timestamp').reset_index(drop=True)
                    return df
                else:
                    return pd.DataFrame()
                    
            elif response.status_code == 429:
                wait_time = 60 * (attempt + 1)
                print(f"    Rate limit hit, waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                print(f"    Error {response.status_code}: {response.text[:200]}")
                return pd.DataFrame()
                
        except Exception as e:
            print(f"    Attempt {attempt + 1} failed: {str(e)}")
            if attempt < max_retries - 1:
                time.sleep(10)
    
    return pd.DataFrame()


def fetch_in_chunks(zone_code, start_date, end_date, token, chunk_days=90):
    """
    Fetch data in chunks to avoid API limits
    ENTSO-E has limits on request size, so we split into 90-day chunks
    """
    all_data = []
    current_start = start_date
    
    while current_start < end_date:
        current_end = min(current_start + timedelta(days=chunk_days), end_date)
        
        print(f"    Fetching {current_start.date()} to {current_end.date()}")
        
        chunk_data = fetch_entsoe_prices(zone_code, current_start, current_end, token)
        
        if not chunk_data.empty:
            all_data.append(chunk_data)
        
        current_start = current_end
        time.sleep(2)  # Rate limiting between chunks
    
    if all_data:
        combined = pd.concat(all_data, ignore_index=True)
        combined = combined.sort_values('timestamp').reset_index(drop=True)
        # Remove duplicates
        combined = combined.drop_duplicates(subset=['timestamp'], keep='first')
        return combined
    else:
        return pd.DataFrame()


print("Data collection functions defined")

## Collection Process

Collecting 10 years of data (`DAYS_BACK = 3650`) for each zone

**Expected duration**: ~5-10 minutes (due to API rate limits)
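
The duration is driven largely by the number of requests. A rough sketch of that arithmetic, assuming `DAYS_BACK = 3650`, 90-day chunks, the 2-second pause between chunks, and three zones; `chunk_windows` is an illustrative helper that mirrors the loop in `fetch_in_chunks`, and the fixed end date is arbitrary.

```python
from datetime import datetime, timedelta

def chunk_windows(start, end, chunk_days=90):
    """Split [start, end) into consecutive windows of at most chunk_days days."""
    windows = []
    current = start
    while current < end:
        nxt = min(current + timedelta(days=chunk_days), end)
        windows.append((current, nxt))
        current = nxt
    return windows

end = datetime(2025, 1, 1)            # arbitrary end date for the illustration
start = end - timedelta(days=3650)    # same span as DAYS_BACK
windows = chunk_windows(start, end)

n_zones = 3
requests_per_zone = len(windows)                       # 41 (40 full chunks + 1 partial)
total_requests = n_zones * requests_per_zone           # 123
pause_seconds = total_requests * 2                     # 246 s of sleeping alone
```

The pauses alone account for about four minutes; request latency and any rate-limit back-off come on top of that.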

In [None]:
# Date range: DAYS_BACK days (~10 years) back from today
end_date = datetime.utcnow()
start_date = end_date - timedelta(days=DAYS_BACK)

print(f"Collection Period: {start_date.date()} to {end_date.date()}")
print(f"Total: {DAYS_BACK} days\n")
print("="*70)

# Collect data for each zone
collection_stats = []

for zone_name, zone_code in ZONES.items():
    print(f"\n{zone_name}:")
    print("-" * 40)
    
    # Fetch data
    df = fetch_in_chunks(zone_code, start_date, end_date, ENTSOE_TOKEN)
    
    if not df.empty:
        # Basic statistics
        n_records = len(df)
        date_range = (df['timestamp'].min(), df['timestamp'].max())
        # +1 because both the first and the last hour are counted
        expected_hours = (date_range[1] - date_range[0]).total_seconds() / 3600 + 1
        completeness = 100 * n_records / expected_hours if expected_hours > 0 else 0
        
        # Save raw data
        filepath = RAW_DIR / f"{zone_name}_raw.csv"
        df.to_csv(filepath, index=False)
        
        stats = {
            'zone': zone_name,
            'records': n_records,
            'start': date_range[0],
            'end': date_range[1],
            'completeness_pct': round(completeness, 1),
            'mean_price': round(df['price_eur_mwh'].mean(), 2),
            'std_price': round(df['price_eur_mwh'].std(), 2)
        }
        collection_stats.append(stats)
        
        print(f"  Records: {n_records:,}")
        print(f"  Date range: {date_range[0].date()} to {date_range[1].date()}")
        print(f"  Completeness: {completeness:.1f}%")
        print(f"  Mean price: {df['price_eur_mwh'].mean():.2f} EUR/MWh")
        print(f"  Saved to: {filepath.name}")
    else:
        print(f"  FAILED: No data retrieved")
        collection_stats.append({
            'zone': zone_name,
            'records': 0,
            'status': 'FAILED'
        })

print("\n" + "="*70)
print("COLLECTION COMPLETE")
print("="*70)

In [None]:
# Summary table
summary_df = pd.DataFrame(collection_stats)

print("\nCollection Summary:")
print(summary_df.to_string(index=False))

# Save summary
summary_df.to_csv(DATA_DIR / 'collection_summary.csv', index=False)
print(f"\nSummary saved to: {DATA_DIR / 'collection_summary.csv'}")

# Check completeness
successful = summary_df[summary_df['records'] > 0]
print(f"\nSuccessful zones: {len(successful)}/{len(ZONES)}")

if len(successful) > 0:
    avg_completeness = successful['completeness_pct'].mean()
    print(f"Average completeness: {avg_completeness:.1f}%")
    
    if avg_completeness < 95:
        print("\nWarning: Low data completeness detected")
        print("Some gaps will need interpolation in preprocessing step")

## Data Quality Assessment
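
Besides the visual check, gaps can be listed explicitly. A minimal sketch, assuming hourly data with UTC timestamps; `find_missing_hours` is an illustrative helper and the sample timestamps are synthetic.

```python
import numpy as np
import pandas as pd

def find_missing_hours(timestamps):
    """Return the hourly timestamps between the first and last observation that are absent."""
    ts = pd.DatetimeIndex(pd.to_datetime(timestamps, utc=True)).sort_values()
    n_hours = int((ts[-1] - ts[0]) / pd.Timedelta(hours=1)) + 1
    # Full hourly grid from the first to the last observed timestamp
    expected = ts[0] + pd.to_timedelta(np.arange(n_hours), unit='h')
    return expected.difference(ts)

# Synthetic example: 48 hours with two hours removed
stamps = pd.to_datetime('2024-01-01', utc=True) + pd.to_timedelta(np.arange(48), unit='h')
observed = stamps.delete([5, 6])
missing = find_missing_hours(observed)
```

Applied to a collected zone, `find_missing_hours(df['timestamp'])` would show where interpolation is needed in the preprocessing step.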

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Load collected data for quick visualization
fig, axes = plt.subplots(len(successful), 1, figsize=(14, 3*len(successful)))
if len(successful) == 1:
    axes = [axes]

for idx, (_, row) in enumerate(successful.iterrows()):
    zone = row['zone']
    filepath = RAW_DIR / f"{zone}_raw.csv"
    
    df = pd.read_csv(filepath)
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df = df.sort_values('timestamp')
    
    axes[idx].plot(df['timestamp'], df['price_eur_mwh'], linewidth=0.5, alpha=0.7)
    axes[idx].set_title(f'{zone}: Raw Price Data (~{DAYS_BACK // 365} years)', fontweight='bold')
    axes[idx].set_ylabel('Price (EUR/MWh)')
    axes[idx].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(DATA_DIR / 'raw_price_overview.png', dpi=150, bbox_inches='tight')
plt.close()

print(f"Price overview saved to: {DATA_DIR / 'raw_price_overview.png'}")

## Next Steps

### 1. Data Preprocessing (Notebook 02)

Apply validated preprocessing methods:
- **Outlier Detection**: Z-score method (threshold=3) - Chikodili et al. (2021)
- **Missing Data**: Cubic spline interpolation - Moritz & Bartz-Beielstein (2017)
- **Normalization**: Min-max scaling [0,1] - Faiq et al. (2024)
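
The three steps can be chained roughly as follows. This is a sketch under stated assumptions, not the Notebook 02 implementation: outliers are masked with a global z-score, and the gaps are filled with linear interpolation as a dependency-free stand-in for the cubic spline (pandas offers `Series.interpolate(method='cubicspline')`, which requires SciPy). `preprocess_prices` and the synthetic series are illustrative.

```python
import numpy as np
import pandas as pd

def preprocess_prices(prices, z_threshold=3.0):
    """Mask z-score outliers, fill the gaps, then min-max scale to [0, 1]."""
    z = (prices - prices.mean()) / prices.std(ddof=0)
    cleaned = prices.mask(z.abs() > z_threshold)            # outliers become NaN
    # Stand-in for cubic spline interpolation (see lead-in)
    filled = cleaned.interpolate(method='linear', limit_direction='both')
    lo, hi = filled.min(), filled.max()
    if hi == lo:
        return filled * 0.0                                 # constant series: avoid division by zero
    return (filled - lo) / (hi - lo)

# Synthetic 48-hour series with a daily cycle and one artificial spike
hours = pd.to_datetime('2024-01-01', utc=True) + pd.to_timedelta(np.arange(48), unit='h')
raw = pd.Series(50 + 5 * np.sin(np.arange(48) * 2 * np.pi / 24), index=hours)
raw.iloc[10] = 500.0
scaled = preprocess_prices(raw)
```

A global z-score is sensitive to the spikes it is meant to detect, since they inflate the standard deviation, so the threshold may need tuning on real data.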

### 2. Daily Profile Creation

Transform hourly data into 24-hour daily profiles:
- Each day represented as vector of 24 hourly prices
- Normalized profiles for clustering
- Raw profiles for final predictions
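
A minimal sketch of the reshaping step, assuming hourly rows with the `timestamp` and `price_eur_mwh` columns from the raw files. Days are cut on UTC boundaries here, which is an assumption; local market days may be preferable. Incomplete days are dropped, and `to_daily_profiles` is an illustrative helper.

```python
import numpy as np
import pandas as pd

def to_daily_profiles(df):
    """Pivot hourly prices into one row per day with 24 hourly columns."""
    ts = pd.to_datetime(df['timestamp'], utc=True)
    frame = pd.DataFrame({
        'date': ts.dt.date.to_numpy(),
        'hour': ts.dt.hour.to_numpy(),
        'price': df['price_eur_mwh'].to_numpy(),
    })
    profiles = frame.pivot_table(index='date', columns='hour', values='price', aggfunc='mean')
    profiles = profiles.reindex(columns=range(24))   # make sure all 24 hours are present
    return profiles.dropna()                         # keep only complete days

# Synthetic example: three days of hourly data with one hour missing on day 2
stamps = pd.to_datetime('2024-01-01', utc=True) + pd.to_timedelta(np.arange(72), unit='h')
hourly = pd.DataFrame({'timestamp': stamps, 'price_eur_mwh': np.arange(72, dtype=float)})
hourly = hourly.drop(index=30)
profiles = to_daily_profiles(hourly)
```

Each remaining row is a 24-element vector that can be passed to the clustering step.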

### 3. Hierarchical Clustering

Following Roberts & Brown (2020) methodology:
- Ward's linkage method for clustering
- Elbow method for optimal k
- Cluster validation and interpretation
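
To make the Ward criterion and the elbow curve concrete, here is a deliberately naive NumPy sketch on synthetic profiles. It is not the cited methodology's implementation: in practice a library routine such as `scipy.cluster.hierarchy.linkage(X, method='ward')` would be the usual choice. `ward_labels` and `within_cluster_sse` are illustrative helpers.

```python
import numpy as np

def ward_labels(X, k):
    """Greedy agglomerative clustering: repeatedly merge the pair of clusters
    whose merge increases the within-cluster sum of squares the least."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = X[clusters[a]], X[clusters[b]]
                gap = ca.mean(axis=0) - cb.mean(axis=0)
                # Ward merge cost: n_a * n_b / (n_a + n_b) * ||centroid_a - centroid_b||^2
                cost = len(ca) * len(cb) / (len(ca) + len(cb)) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

def within_cluster_sse(X, labels):
    """Total squared distance of each point to its cluster centroid."""
    return sum(float(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum())
               for c in np.unique(labels))

# Synthetic normalized profiles: five flat days and five days with an evening peak
rng = np.random.default_rng(0)
flat = 0.5 + 0.01 * rng.standard_normal((5, 24))
peaky = 0.5 + 0.01 * rng.standard_normal((5, 24))
peaky[:, 17:21] += 0.5
profiles = np.vstack([flat, peaky])

labels = ward_labels(profiles, k=2)
# Elbow curve: SSE for k = 1..4; the sharp drop from k=1 to k=2 marks the elbow here
sse_by_k = {k: within_cluster_sse(profiles, ward_labels(profiles, k)) for k in range(1, 5)}
```

The nested loops make this cubic in the number of days, so it is only suitable for illustrating the criterion on small inputs.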

### 4. XGBoost Prediction Model

Two-stage approach:
1. **Classification**: Predict cluster for day t+1
2. **Regression**: Cluster-specific models for hourly prices

### Data Requirements Satisfied

- ✓ **Clustering**: 10 years of data provides ~3,650 days per zone
- ✓ **Seasonal Coverage**: All seasons, weekdays, and holidays captured
- ✓ **XGBoost Training**: Sufficient samples for robust model training
- ✓ **Cross-Validation**: Enough data for a train/test split


In [None]:
print("="*70)
print("DATA COLLECTION COMPLETE")
print("="*70)
print(f"\nOutput directory: {RAW_DIR}")
print(f"\nCollected zones ({len(successful)}/{len(ZONES)}):")
for _, row in successful.iterrows():
    print(f"  - {row['zone']}: {row['records']:,} records ({row['completeness_pct']:.1f}% complete)")

print(f"\nTotal hourly records: {successful['records'].sum():,}")
print(f"\nReady for preprocessing pipeline (Notebook 02)")