
UK WEATHER DATA PIPELINE FOR ELECTRICITY PRICE PREDICTION
==========================================================
By ANI

WORKFLOW:
1. Load REPD (Renewable Energy Planning Database), the UK government database of renewable energy projects
2. Filter for operational wind and solar projects (>= 1 MW)
3. Convert coordinates from British National Grid to lat/lon
4. Calculate capacity-weighted centroids for each UK region
5. Select the largest regions up to a 95% cumulative share of installed capacity
6. Fetch historical weather data from Open-Meteo API (2021-2025)
7. Save the wind and solar variables for each region as separate CSV files

DATA SOURCES:

REPD: https://www.gov.uk/government/publications/renewable-energy-planning-database-monthly-extract

Weather: Open-Meteo Historical API (ERA5 reanalysis)

Data Period: 2021-01-01 to 2025-10-31
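
As a quick completeness check on the period above, the nominal number of hourly rows per location can be worked out from the two dates. This is a minimal sketch using only the standard library. `nominal_hourly_rows` is a hypothetical helper, not part of the notebook, and it assumes 24 rows per calendar day, which may not hold exactly around daylight-saving changes when a local timezone is requested.

```python
from datetime import date

def nominal_hourly_rows(start: str, end: str) -> int:
    """Nominal hourly row count for an inclusive date range (assumes 24 rows per day)."""
    n_days = (date.fromisoformat(end) - date.fromisoformat(start)).days + 1
    return n_days * 24

# Dates taken from the "Data Period" line above.
expected_rows = nominal_hourly_rows("2021-01-01", "2025-10-31")
print(expected_rows)
```

Comparing this figure with the length of each saved CSV is one way to spot a truncated download.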


In [None]:
import pandas as pd
import numpy as np
import requests
from tqdm import tqdm
import time
from pyproj import Transformer
import warnings
warnings.filterwarnings('ignore')

print("Imports complete")

Imports complete


In [None]:
# Time period
start_date = '2021-01-01'
end_date = '2025-10-31'

# Technology filters
WIND_ONSHORE = 'Wind Onshore'
WIND_OFFSHORE = 'Wind Offshore'
SOLAR = 'Solar Photovoltaics'

# Minimum capacity threshold (MW) - filter out tiny installations
MIN_CAPACITY_MW = 1.0

print(f"Configuration set: {start_date} to {end_date}")


Configuration set: 2021-01-01 to 2025-10-31


In [None]:
#Load REPD CSV
from google.colab import drive
import os

drive.mount('/content/drive')

# Change this path to wherever you saved the REPD CSV
REPD_PATH = '/content/drive/MyDrive/Colab Notebooks/Ani_Data/repd.csv'

# Fail early if the file is missing, so later cells do not hit a NameError on repd_raw
if not os.path.exists(REPD_PATH):
    raise FileNotFoundError(f"REPD file not found: {REPD_PATH}")

repd_raw = pd.read_csv(REPD_PATH, encoding='latin-1')
print(f"Loaded {len(repd_raw)} records from REPD")
print(repd_raw.head())  # Check that the columns were read correctly

Mounted at /content/drive
Loaded 13524 records from REPD
   Old Ref ID    Ref ID Record Last Updated (dd/mm/yyyy)  \
0           1  10726459                       07/07/2009   
1           2       NaN                       20/11/2017   
2           3  12019680                       20/12/2019   
3           4  11877116                       18/12/2003   
4           5       NaN                       29/09/2005   

                       Operator (or Applicant)  \
0                                   RWE npower   
1  Orsted (formerly Dong Energy) / Peel Energy   
2           Scottish and Southern Energy (SSE)   
3                       Energy Power Resources   
4                                      Agrigen   

                        Site Name      Technology Type Storage Type  \
0  Aberthaw Power Station Biomass  Biomass (co-firing)          NaN   
1           Hunterston - cofiring  Biomass (co-firing)          NaN   
2   Ferrybridge Multifuel 2 (FM2)     EfW Incineration          NaN 

In [None]:
# Filter and clean the REPD CSV

def clean_repd(df):
    """Filter REPD for operational wind and solar projects"""

    # Check actual column names
    print("Checking column names...")
    print(df.columns.tolist())

    keep_cols = [
        'Site Name', 'Region', 'Technology Type', 'Development Status',
        'Installed Capacity (MWelec)', 'X-coordinate', 'Y-coordinate'
    ]
    print(keep_cols)

    # Filter for operational projects
    df_operational = df[df['Development Status'] == 'Operational'].copy()
    print(f"Operational projects: {len(df_operational)}")

    # Filter for wind and solar
    tech_filter = df_operational['Technology Type'].isin([WIND_ONSHORE, WIND_OFFSHORE, SOLAR])
    df_filtered = df_operational[tech_filter].copy()
    print(f"Wind/Solar operational: {len(df_filtered)}")

    # Convert 'Installed Capacity (MWelec)' to numeric, coercing errors, then drop NaNs
    df_filtered['Installed Capacity (MWelec)'] = pd.to_numeric(df_filtered['Installed Capacity (MWelec)'], errors='coerce')
    df_filtered = df_filtered.dropna(subset=['Installed Capacity (MWelec)']).copy()

    # Filter minimum capacity
    df_filtered = df_filtered[df_filtered['Installed Capacity (MWelec)'] >= MIN_CAPACITY_MW]
    print(f"After {MIN_CAPACITY_MW}MW threshold: {len(df_filtered)}")

    # Remove missing coordinates
    df_filtered = df_filtered.dropna(subset=['X-coordinate', 'Y-coordinate'])
    print(f"With valid coordinates: {len(df_filtered)}")

    df_filtered = df_filtered[keep_cols]

    # Summary by technology
    print("\n" + "="*50)
    print("CAPACITY SUMMARY BY TECHNOLOGY")
    print("="*50)
    summary = df_filtered.groupby('Technology Type')['Installed Capacity (MWelec)'].agg(['count', 'sum'])
    summary.columns = ['Projects', 'Total MW']
    print(summary)
    print("="*50)

    return df_filtered

repd_clean = clean_repd(repd_raw)


Checking column names...
['Old Ref ID', 'Ref ID', 'Record Last Updated (dd/mm/yyyy)', 'Operator (or Applicant)', 'Site Name', 'Technology Type', 'Storage Type', 'Storage Co-location REPD Ref ID', 'Installed Capacity (MWelec)', 'Share Community Scheme', 'CHP Enabled', 'CfD Allocation Round', 'RO Banding (ROC/MWh)', 'FiT Tariff (p/kWh)', 'CfD Capacity (MW)', 'Turbine Capacity', 'No. of Turbines', 'Height of Turbines (m)', 'Mounting Type for Solar', 'Development Status', 'Development Status (short)', 'Are they re-applying (New REPD Ref)', 'Are they re-applying (Old REPD Ref) ', 'Address', 'County', 'Region', 'Country', 'Post Code', 'X-coordinate', 'Y-coordinate', 'Planning Authority', 'Planning Application Reference', 'Appeal Reference', 'Secretary of State Reference', 'Type of Secretary of State Intervention', 'Judicial Review', 'Offshore Wind Round', 'Planning Application Submitted', 'Planning Application Withdrawn', 'Planning Permission Refused', 'Appeal Lodged', 'Appeal Withdrawn', 'A

In [None]:
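def add_lat_lon(df):
    """Add WGS84 'longitude' and 'latitude' columns from British National Grid X/Y coordinates."""

    # 1. Build the transformer: British National Grid (EPSG:27700) -> WGS84 (EPSG:4326)
    # always_xy=True fixes the axis order to (x, y): input is (easting, northing), output is (longitude, latitude)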
    transformer = Transformer.from_crs("epsg:27700", "epsg:4326", always_xy=True)

    # 2. Vectorized Transform
    lon, lat = transformer.transform(df['X-coordinate'].values, df['Y-coordinate'].values)

    # 3. Assign new columns
    df = df.copy()
    df['longitude'] = lon
    df['latitude'] = lat

    return df

# --- Execution Flow ---
# 1. Clean the raw REPD data (repeats the cleaning step from the previous cell)
repd_clean = clean_repd(repd_raw)

# 2. Run the conversion
repd_final = add_lat_lon(repd_clean)

# 3. Check results
print(repd_final[['Site Name', 'latitude', 'longitude']].head())

Checking column names...
['Old Ref ID', 'Ref ID', 'Record Last Updated (dd/mm/yyyy)', 'Operator (or Applicant)', 'Site Name', 'Technology Type', 'Storage Type', 'Storage Co-location REPD Ref ID', 'Installed Capacity (MWelec)', 'Share Community Scheme', 'CHP Enabled', 'CfD Allocation Round', 'RO Banding (ROC/MWh)', 'FiT Tariff (p/kWh)', 'CfD Capacity (MW)', 'Turbine Capacity', 'No. of Turbines', 'Height of Turbines (m)', 'Mounting Type for Solar', 'Development Status', 'Development Status (short)', 'Are they re-applying (New REPD Ref)', 'Are they re-applying (Old REPD Ref) ', 'Address', 'County', 'Region', 'Country', 'Post Code', 'X-coordinate', 'Y-coordinate', 'Planning Authority', 'Planning Application Reference', 'Appeal Reference', 'Secretary of State Reference', 'Type of Secretary of State Intervention', 'Judicial Review', 'Offshore Wind Round', 'Planning Application Submitted', 'Planning Application Withdrawn', 'Planning Permission Refused', 'Appeal Lodged', 'Appeal Withdrawn', 'A
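
A cheap sanity check after the coordinate conversion is to confirm that every point falls inside a rough UK bounding box, which catches swapped axes or unconverted easting/northing values. The sketch below is an illustration only: `in_uk_bounds` is a hypothetical helper and the bounds are approximate assumed values, not taken from REPD.

```python
# Rough bounding box for the UK and its offshore waters (assumed, approximate values).
UK_LAT_RANGE = (49.0, 61.5)
UK_LON_RANGE = (-9.0, 3.0)

def in_uk_bounds(lat: float, lon: float) -> bool:
    """Return True if (lat, lon) lies inside the rough UK bounding box."""
    return (UK_LAT_RANGE[0] <= lat <= UK_LAT_RANGE[1]
            and UK_LON_RANGE[0] <= lon <= UK_LON_RANGE[1])

print(in_uk_bounds(54.0, -2.0))          # a point in northern England
print(in_uk_bounds(-2.0, 54.0))          # latitude and longitude swapped
print(in_uk_bounds(500000.0, 400000.0))  # raw northing/easting left unconverted
```

In the notebook this could be applied row by row to `repd_final[['latitude', 'longitude']]` to flag suspicious sites before the centroids are computed.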

In [None]:
# Work out the capacity-weighted centroid and capacity share of each region
def calculate_all_weighted_locations(df, tech_types):

    # Filter Tech
    df_tech = df[df['Technology Type'].isin(tech_types)].copy()

    # Create Weighted Coordinates
    cap = 'Installed Capacity (MWelec)'
    df_tech['w_lat'] = df_tech['latitude'] * df_tech[cap]
    df_tech['w_lon'] = df_tech['longitude'] * df_tech[cap]

    # Group & Sort
    regions = df_tech.groupby('Region').agg(
        total_capacity_mw=(cap, 'sum'),
        lat_sum=('w_lat', 'sum'),
        lon_sum=('w_lon', 'sum')
    ).sort_values('total_capacity_mw', ascending=False)

    # Calculate Cumulative %
    total_uk = regions['total_capacity_mw'].sum()
    regions['cumulative%'] = regions['total_capacity_mw'].cumsum() / total_uk
    regions['global_share'] = regions['total_capacity_mw'] / total_uk

    # Final Centroids
    regions['latitude'] = regions['lat_sum'] / regions['total_capacity_mw']
    regions['longitude'] = regions['lon_sum'] / regions['total_capacity_mw']

    return regions.reset_index()[['Region', 'latitude', 'longitude', 'total_capacity_mw', 'cumulative%', 'global_share']]
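
The centroid computed above is a capacity-weighted mean: each site's latitude and longitude is multiplied by its installed capacity, the products are summed, and the sums are divided by the region's total capacity. A minimal pure-Python sketch of the same arithmetic on two made-up sites follows. `weighted_centroid` and the toy values are hypothetical and only illustrate the formula.

```python
def weighted_centroid(sites):
    """Capacity-weighted centroid of a list of (latitude, longitude, capacity_mw) tuples."""
    total = sum(cap for _, _, cap in sites)
    if total <= 0:
        raise ValueError("total capacity must be positive")
    lat = sum(lat * cap for lat, _, cap in sites) / total
    lon = sum(lon * cap for _, lon, cap in sites) / total
    return lat, lon, total

# Two made-up sites: a 300 MW site and a 100 MW site.
toy_sites = [(56.0, -4.0, 300.0), (54.0, -2.0, 100.0)]
lat, lon, total = weighted_centroid(toy_sites)
print(lat, lon, total)
```

The result sits three quarters of the way towards the larger site, which is the behaviour the regional centroids rely on. Averaging latitude and longitude directly is an approximation that is reasonable at regional scale.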

In [None]:
wind_all = calculate_all_weighted_locations(repd_final, [WIND_ONSHORE, WIND_OFFSHORE])
solar_all = calculate_all_weighted_locations(repd_final, [SOLAR])
for name, df in [("WIND", wind_all), ("SOLAR", solar_all)]:
    print(f"\n{'='*80}\n{name} CAPACITY-WEIGHTED LOCATIONS\n{'='*80}")
    print(df.to_string(index=False, formatters={
        'latitude': '{:.4f}'.format,
        'longitude': '{:.4f}'.format,
        'total_capacity_mw': '{:.1f}'.format,
        'global_share': '{:.2%}'.format,
        'cumulative%': '{:.2%}'.format
    }))


WIND CAPACITY-WEIGHTED LOCATIONS
              Region latitude longitude total_capacity_mw cumulative% global_share
            Offshore  53.9182   -0.4299           14679.0      49.82%       49.82%
            Scotland  56.5218   -3.7560            9511.1      82.09%       32.28%
               Wales  52.1411   -3.6936            1219.7      86.23%        4.14%
    Northern Ireland  54.7211   -7.0959            1194.8      90.29%        4.05%
Yorkshire and Humber  53.7178   -0.8216             652.2      92.50%        2.21%
          North East  55.0460   -1.7088             473.8      94.11%        1.61%
          North West  54.0710   -2.7548             467.3      95.69%        1.59%
             Eastern  52.3214    0.2815             451.7      97.23%        1.53%
       East Midlands  52.7549   -0.6387             398.5      98.58%        1.35%
          South West  50.8600   -4.1691             284.9      99.55%        0.97%
          South East  51.2432    0.4004             1

In [None]:
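# --- 2. Filter (cumulative share up to 95%) ---
# Keeps regions whose cumulative share is <= 95%. The region that crosses the
# threshold is excluded, so the capacity kept ends up slightly below 95%.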
wind_final = wind_all[wind_all['cumulative%'] <= 0.95].copy()
solar_final = solar_all[solar_all['cumulative%'] <= 0.95].copy()

# --- 3. Print Evidence (Wind) ---
print("--- WIND JUSTIFICATION ---")
print(f"Original Regions: {len(wind_all)}")
print(f"Final Regions:    {len(wind_final)}")
print(f"Noise Removed:    {len(wind_all) - len(wind_final)} regions")
print(f"Capacity Kept:    {wind_final['global_share'].sum():.1%}")
print("\n", wind_final)

# --- 4. Print Evidence (Solar) ---
print("\n\n--- SOLAR JUSTIFICATION ---")
print(f"Original Regions: {len(solar_all)}")
print(f"Final Regions:    {len(solar_final)}")
print(f"Noise Removed:    {len(solar_all) - len(solar_final)} regions")
print(f"Capacity Kept:    {solar_final['global_share'].sum():.1%}")
print("\n", solar_final)

--- WIND JUSTIFICATION ---
Original Regions: 13
Final Regions:    6
Noise Removed:    7 regions
Capacity Kept:    94.1%

                  Region   latitude  longitude  total_capacity_mw  cumulative%  \
0              Offshore  53.918238  -0.429913           14679.00     0.498153   
1              Scotland  56.521820  -3.756050            9511.10     0.820926   
2                 Wales  52.141054  -3.693580            1219.70     0.862318   
3      Northern Ireland  54.721094  -7.095877            1194.80     0.902865   
4  Yorkshire and Humber  53.717822  -0.821627             652.20     0.924999   
5            North East  55.045957  -1.708773             473.75     0.941076   

   global_share  
0      0.498153  
1      0.322773  
2      0.041392  
3      0.040547  
4      0.022133  
5      0.016077  


--- SOLAR JUSTIFICATION ---
Original Regions: 12
Final Regions:    7
Noise Removed:    5 regions
Capacity Kept:    94.5%

                  Region   latitude  longitude  total_capaci
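
The `<= 0.95` filter keeps only regions whose running total stays at or below the cutoff, which is why the retained capacity above is 94.1% rather than at least 95%. The sketch below contrasts that rule with one alternative that also keeps the region crossing the threshold. Both helper names are hypothetical, and the cumulative shares are the first seven wind values from the table printed earlier.

```python
def count_within(cumulative, target=0.95):
    """Number of regions whose cumulative share is at or below the target (the rule used above)."""
    return sum(1 for share in cumulative if share <= target)

def count_to_reach(cumulative, target=0.95):
    """Number of leading regions needed for the cumulative share to reach the target."""
    for n, share in enumerate(cumulative, start=1):
        if share >= target:
            return n
    return len(cumulative)  # target never reached: keep everything

# Cumulative wind shares for Offshore through North West, from the printed table.
wind_cumulative = [0.4982, 0.8209, 0.8623, 0.9029, 0.9250, 0.9411, 0.9569]

n_within = count_within(wind_cumulative)
n_reach = count_to_reach(wind_cumulative)
print(n_within, n_reach)
```

Which rule is preferable depends on whether the 95% figure is meant as a ceiling on the number of regions or as a minimum coverage guarantee.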

In [None]:
#Open-Meteo API

def fetch_weather_for_one_location(lat, lon, start_date, end_date):

    #Define the API URL
    url = "https://archive-api.open-meteo.com/v1/archive"

    #Define what is needed from the API
    params = {
        'latitude': lat,
        'longitude': lon,
        'start_date': start_date,
        'end_date': end_date,
        'hourly': [
            'wind_speed_10m',           # Wind speed at 10 m (km/h, the API default unit)
            'wind_speed_100m',          # Wind speed at 100 m, near turbine hub height (km/h)
            'wind_gusts_10m',           # Wind gusts at 10 m (km/h)
            'wind_direction_100m',      # Wind direction in degrees (0-360)
            'shortwave_radiation',      # Solar radiation GHI (W/m²)
            'direct_normal_irradiance', # Solar radiation DNI (W/m²)
            'cloud_cover',              # Cloud cover (%)
            'temperature_2m',           # Temperature at 2m (°C)
        ],
        'timezone': 'Europe/London'
    }

    #Send request to API
    response = requests.get(url, params=params, timeout=60)
    response.raise_for_status()

    #Convert response to JSON (Python dictionary)
    data = response.json()

    # Extract the hourly data into a DataFrame
    df = pd.DataFrame({
        'timestamp': pd.to_datetime(data['hourly']['time']),
        'wind_speed_10m': data['hourly']['wind_speed_10m'],
        'wind_speed_100m': data['hourly']['wind_speed_100m'],
        'wind_gusts': data['hourly']['wind_gusts_10m'],
        'wind_direction': data['hourly']['wind_direction_100m'],
        'ghi': data['hourly']['shortwave_radiation'],
        'dni': data['hourly']['direct_normal_irradiance'],
        'cloud_cover': data['hourly']['cloud_cover'],
        'temperature': data['hourly']['temperature_2m'],
    })

    # Return the DataFrame
    return df
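
Open-Meteo reports wind speeds in km/h unless a `wind_speed_unit` query parameter (for example `ms`) is supplied, so values fetched with the parameters above may need converting before they are compared with figures quoted in m/s. The helper below is a hypothetical sketch for converting data that has already been saved.

```python
KMH_PER_MS = 3.6  # 1 m/s = 3.6 km/h

def kmh_to_ms(speed_kmh):
    """Convert a wind speed from km/h to m/s; None (a missing hour) passes through unchanged."""
    return None if speed_kmh is None else speed_kmh / KMH_PER_MS

# Made-up hourly values, including one missing hour.
hourly_kmh = [36.0, None, 18.0]
hourly_ms = [kmh_to_ms(v) for v in hourly_kmh]
print(hourly_ms)
```

With pandas the same conversion is a single division on the wind columns, for example `df['wind_speed_100m'] / 3.6`.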


In [None]:
import os
import time
import pandas as pd

def fetch_all_locations(locations_df, start_date, end_date):
    """
    Fetch weather data for each location, save it to CSV files, and skip locations
    whose files already exist so that an interrupted run can be resumed.
    """
    folder_path = "/content/drive/MyDrive/Colab Notebooks/Ani_Data/"

    # Ensure the output folder exists
    os.makedirs(folder_path, exist_ok=True)

    all_data = {}

    for index, row in locations_df.iterrows():
        name = row['Region']
        lat = row['latitude']
        lon = row['longitude']

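        # Resume support: skip regions whose CSV was already written.
        # Note: the check uses the region name only, so a region saved during the wind run
        # is also skipped in the solar run, even though its solar centroid differs.
        if os.path.exists(os.path.join(folder_path, f"{name}_Solar.csv")):
            print(f"Skipping {name} (Already Saved)")
            continue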

        print(f"Fetching {name}...")

        try:
            # 1. Fetch from API
            df = fetch_weather_for_one_location(lat, lon, start_date, end_date)

            # 2. Save the solar-related columns
            solar_cols = ['timestamp', 'ghi', 'dni', 'cloud_cover', 'temperature']
            df[solar_cols].to_csv(os.path.join(folder_path, f"{name}_Solar.csv"), index=False)

            #    Save the wind-related columns
            wind_cols = ['timestamp', 'wind_speed_10m', 'wind_speed_100m', 'wind_gusts', 'wind_direction']
            df[wind_cols].to_csv(os.path.join(folder_path, f"{name}_Wind.csv"), index=False)

            print(f"   -> Saved {name} to Drive")

            # 3. Add to memory
            all_data[name] = df

            time.sleep(10)  # Pause between API requests

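        except Exception as e:
            print(f"FAIL on {name}: {e}")
            # Stop on any HTTP 4xx response (raise_for_status reports these as "Client Error"),
            # including 429 rate limiting, rather than keep sending requests
            if "429" in str(e) or "Client Error" in str(e):
                print("Client error (e.g. 429 rate limit) detected. Stopping loop.")
                break

    print("Process Complete (or Stopped). Check your Drive folder.")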
    return all_data
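
The loop above waits a fixed 10 seconds between requests and stops on the first client error. One common alternative is to retry a failed request with an increasing delay. The sketch below is a simplified illustration, not the notebook's method: `fetch_with_backoff` is a hypothetical helper, it retries on any exception, and the `sleep` argument exists only so the demo can run without actually waiting.

```python
import time

def fetch_with_backoff(fetch, max_attempts=4, base_delay=10.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake fetch that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

waits = []  # records the delays instead of sleeping
result = fetch_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the notebook the wrapped call would be something like `fetch_with_backoff(lambda: fetch_weather_for_one_location(lat, lon, start_date, end_date))`. A production version would probably retry only on rate-limit or timeout errors.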

In [None]:
#Fetch Wind data
print("--- Fetching Wind Sites ---")
raw_wind_data = fetch_all_locations(wind_final, start_date, end_date)

--- Fetching Wind Sites ---
Fetching Offshore...
   -> Saved Offshore to Drive
Fetching Scotland...
   -> Saved Scotland to Drive
Fetching Wales...
   -> Saved Wales to Drive
Fetching Northern Ireland...
   -> Saved Northern Ireland to Drive
Fetching Yorkshire and Humber...
   -> Saved Yorkshire and Humber to Drive
Fetching North East...
   -> Saved North East to Drive
Process Complete (or Stopped). Check your Drive folder.


In [None]:
#Fetch Solar Data
print("\n--- Fetching Solar Sites ---")
raw_solar_data = fetch_all_locations(solar_final, start_date, end_date)


--- Fetching Solar Sites ---
Fetching South West...
   -> Saved South West to Drive
Fetching South East...
   -> Saved South East to Drive
Fetching Eastern...
   -> Saved Eastern to Drive
Fetching East Midlands...
   -> Saved East Midlands to Drive
Skipping Wales (Already Saved)
Fetching West Midlands...
   -> Saved West Midlands to Drive
Skipping Yorkshire and Humber (Already Saved)
Process Complete (or Stopped). Check your Drive folder.
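
In the log above, Wales and Yorkshire and Humber are skipped during the solar run because files with the same names already exist from the wind run, even though the wind and solar centroids for a region are computed from different sites. One possible way to avoid the clash is to encode the centroid set in the file name. This is a sketch under that assumption: `output_path`, its arguments, and the naming scheme are hypothetical.

```python
import os

def output_path(folder, region, centroid_set, variable_group):
    """Build a CSV path that records which centroid set ('wind' or 'solar') the data was fetched for."""
    safe_region = region.replace(" ", "_")
    return os.path.join(folder, f"{centroid_set}-centroid_{safe_region}_{variable_group}.csv")

folder = "weather_data"  # placeholder folder name
wind_run_file = output_path(folder, "Wales", "wind", "Solar")
solar_run_file = output_path(folder, "Wales", "solar", "Solar")
print(wind_run_file)
print(solar_run_file)
```

With distinct names, the resume check in `fetch_all_locations` would skip a region only if it had already been fetched for the same centroid set.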
