# CrateDB UK Offshore Wind Farms Data Workshop

![Turbines forming part of a wind farm near the UK coastline.](multi-model-offshore-wind-farms.jpg "A wind farm near the UK coastline.")

This workshop explores multi-model data modeling and queries with [CrateDB](https://cratedb.com), using data from [The Crown Estate](https://www.thecrownestate.co.uk/our-business/marine/offshore-wind), which manages the UK's offshore wind farms.  It is derived from a conference presentation that you can [watch on YouTube](https://www.youtube.com/watch?v=xqiLGjaTlBk).

You'll work with tables containing data for:

* **Wind Farms**.  Details of the UK's 45 offshore wind farms are loaded into a CrateDB table from the supplied JSONL file.  Each record includes an ID for the wind farm as well as a name, description and geospatial data in [WKT format](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry) that describes the shape of the wind farm as one or more polygons.  The co-ordinates of each turbine are also included, where known.

* **Hourly Wind Farm Performance Data**.  This table contains time-series data pertaining to the power output of each wind farm on an hourly basis for the period 19th August 2024 to 28th October 2024.  The data is supplied as a compressed JSONL file.
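To get a feel for the compressed JSONL format, here is a minimal sketch that writes two records to an in-memory gzip buffer and reads them back, one JSON document per line.  The field names mirror the columns of the `windfarm_output` table created later in this workshop.  Treating them as the exact keys used in the supplied file is an assumption, and the two records are illustrative samples.

```python
import gzip
import io
import json

# Illustrative sample records. The key names are assumed to match the
# windfarm_output table columns; the real file's layout may differ.
sample_records = [
    {"windfarmid": "NHOYW-1", "ts": 1730088000000, "output": 39.5, "outputpercentage": 65.83},
    {"windfarmid": "NHOYW-1", "ts": 1730091600000, "output": 32.0, "outputpercentage": 53.33},
]

# Write the records as gzip-compressed JSONL into an in-memory buffer.
buffer = io.BytesIO()
with gzip.open(buffer, "wt", encoding="utf-8") as f:
    for record in sample_records:
        f.write(json.dumps(record) + "\n")

# Read the compressed data back: each non-empty line is one JSON document.
buffer.seek(0)
with gzip.open(buffer, "rt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(len(parsed), parsed[0]["windfarmid"])
```

The same pattern works on a downloaded copy of the file by passing its path to `gzip.open` instead of the buffer.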

## Install Dependencies

First, install the required dependencies by executing the `pip install` command below.

In [1]:
! pip install -U ipyleaflet sqlalchemy-cratedb pandas


## Connect to CrateDB

Before going any further, you'll need to update the code below to include a connection string for your CrateDB cluster.  If you prefer, you can set the environment variable `CRATEDB_CONNECTION_STRING` instead.

The code below assumes that you're using a managed CrateDB Cloud database cluster.  [Sign up here](https://console.cratedb.cloud/) to create a free cluster.

Alternatively, if you are running CrateDB locally (for example with [Docker](https://hub.docker.com/_/crate)), use the `localhost` code block to establish a database connection instead.

In [2]:
import os
import sqlalchemy as sa

# Define database address when using CrateDB Cloud.
# Please find these settings on your cluster overview page.
CONNECTION_STRING = os.environ.get(
    "CRATEDB_CONNECTION_STRING",
    "crate://<USERNAME>:<PASSWORD>@<HOST>/?ssl=true",
)

# # Define database address when using CrateDB on localhost.
# CONNECTION_STRING = os.environ.get(
#    "CRATEDB_CONNECTION_STRING",
#    "crate://crate@localhost/",
# )

# Connect to CrateDB using SQLAlchemy.
engine = sa.create_engine(
    CONNECTION_STRING, 
    echo=sa.util.asbool(os.environ.get("DEBUG", "false")))
connection = engine.connect()
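If your password contains characters such as `@` or `/`, it needs to be percent-encoded before being placed in the connection string.  The helper below is a hypothetical convenience, not part of any CrateDB library, and the credentials and host shown are made up for illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(username, password, host, ssl=True):
    """Assemble a crate:// URL, percent-encoding the credentials."""
    credentials = f"{quote_plus(username)}:{quote_plus(password)}"
    suffix = "/?ssl=true" if ssl else "/"
    return f"crate://{credentials}@{host}{suffix}"


# Hypothetical credentials and host, for illustration only.
print(build_connection_string("admin", "p@ss/word", "my-cluster.example.com"))
```

You could then export the result as `CRATEDB_CONNECTION_STRING` rather than editing the notebook.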

## Create Tables

Next, we'll create two tables as follows:

* `windfarms`: Contains data about each wind farm, including geospatial data, the number, type and location of each turbine, and a free-text description providing an overview of the wind farm and its history.

* `windfarm_output`: Hourly records for each wind farm containing details of the actual output for that hour and the percentage of the maximum output that the wind farm was operating at.

Run the code below to create them, taking a moment to understand the table schemas.

In [13]:
# Drop any previous version of the tables.
_ = connection.execute(sa.text("DROP TABLE IF EXISTS windfarms"))
_ = connection.execute(sa.text("DROP TABLE IF EXISTS windfarm_output"))

# Create the tables.

_= connection.execute(sa.text(
"""
    CREATE TABLE windfarms (
        id TEXT PRIMARY KEY,
        name TEXT,
        description TEXT INDEX USING fulltext WITH (analyzer='english'),
        description_vec FLOAT_VECTOR(2048),
        location GEO_POINT,
        territory TEXT,
        boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025),
        turbines OBJECT(STRICT) AS (
            brand TEXT,
            model TEXT,
            locations ARRAY(GEO_POINT),
            howmany SMALLINT
        ),
        capacity DOUBLE PRECISION,
        url TEXT
    );
"""    
))

_= connection.execute(sa.text(
"""
    CREATE TABLE windfarm_output (
        windfarmid TEXT,
        ts TIMESTAMP WITHOUT TIME ZONE,
        month GENERATED ALWAYS AS date_trunc('month', ts),
        day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts),
        output DOUBLE PRECISION,
        outputpercentage DOUBLE PRECISION
    ) PARTITIONED BY (day);
"""    
))

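A few points to note about the table schemas:

* `windfarms.description` has a full-text index that uses the `english` analyzer.  This supports the `match` query later in the workshop.
* `windfarms.description_vec` is a 2048-dimension `FLOAT_VECTOR` column, used by the `knn_match` query later in the workshop.
* `windfarms.location` is a single `GEO_POINT`, while `boundaries` is a `GEO_SHAPE` with a geohash index.
* `windfarms.turbines` is a `STRICT` object, so records containing keys other than `brand`, `model`, `locations` and `howmany` are rejected.
* `windfarm_output.month` and `windfarm_output.day` are generated columns, computed from `ts` using `date_trunc`.  The table is partitioned by `day`, so each day's records are stored in their own partition.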

## Loading the Data

We'll load the data from files contained in the [`cratedb-datasets` public GitHub repository](https://github.com/crate/cratedb-datasets/).  There's one JSONL file for each table.  The file containing the hourly output data has been compressed.

The code that follows populates each table in turn, using `COPY FROM` statements.

In [15]:
def display_results(table_name, info):
    print(f"{table_name}: loaded {info['success_count']}, errors: {info['error_count']}")

    if info["error_count"] > 0:
        print(f"Errors: {info['errors']}")


# Load the wind farm data.
result = connection.execute(sa.text("""
    COPY windfarms
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/devrel/uk-offshore-wind-farm-data/wind_farms.json'
    RETURN SUMMARY;
"""))

display_results("windfarms", result.mappings().first())

windfarms: loaded 0, errors: 45
Errors: {'A document with the same primary key exists already': {'count': 45, 'line_numbers': [1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]}}


In [16]:
# Load the wind farm output data.
result = connection.execute(sa.text("""
    COPY windfarm_output
    FROM 'https://github.com/crate/cratedb-datasets/raw/main/devrel/uk-offshore-wind-farm-data/wind_farm_output.json.gz' 
    WITH (compression='gzip')
    RETURN SUMMARY;
"""))

display_results("windfarm_output", result.mappings().first())

windfarm_output: loaded 75825, errors: 0


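Once the data's loaded, verify that the output shows 0 errors for each table.  If you see `A document with the same primary key exists already` errors for `windfarms`, the table was already populated by an earlier run: re-run the "Create Tables" cell to drop and recreate both tables, then load the data again.  Next, we'll run `REFRESH` and `ANALYZE` commands to make sure that the data's ready for immediate querying.  This isn't normally necessary as CrateDB will perform these tasks automatically in the background.  We're invoking them manually here to ensure that everyone in the workshop is on the same page at the same time.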

In [17]:
_ = connection.execute(sa.text("REFRESH TABLE windfarms, windfarm_output"))
_ = connection.execute(sa.text("ANALYZE"))
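The zero-errors check on the `COPY FROM` summaries can also be automated, so that a failed load stops the notebook instead of going unnoticed.  The sketch below uses a hypothetical helper named `check_load` and a hand-written sample dictionary.  In the notebook you would pass it `result.mappings().first()`, as `display_results` does.

```python
def check_load(table_name, info):
    """Return the number of loaded rows, raising if any row failed."""
    if info["error_count"] > 0:
        raise RuntimeError(
            f"{table_name}: {info['error_count']} rows failed: {info['errors']}"
        )
    return info["success_count"]


# Hand-written sample mirroring the summary keys used by display_results.
sample_summary = {"success_count": 75825, "error_count": 0, "errors": {}}
print(check_load("windfarm_output", sample_summary))
```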

TODO... the bulk of the query examples content!

In [None]:
# Slide 17

import pandas as pd

query = """
SELECT 
    ts, output, outputpercentage
FROM 
    windfarm_output
WHERE
    windfarmid = 'NHOYW-1' AND outputpercentage >= 50
ORDER BY 
    ts DESC
LIMIT 
    5
"""

pd.read_sql(query, CONNECTION_STRING)

| | ts | output | outputpercentage |
|---|---|---|---|
| 0 | 1730091600000 | 32.0 | 53.33 |
| 1 | 1730088000000 | 39.5 | 65.83 |
| 2 | 1730084400000 | 46.0 | 76.67 |
| 3 | 1730080800000 | 46.9 | 78.17 |
| 4 | 1730077200000 | 47.8 | 79.67 |
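The `ts` values are returned as milliseconds since the Unix epoch.  As a small sketch, and assuming the values should be interpreted as UTC, they can be converted to readable timestamps with the standard library.  The helper name `epoch_ms_to_utc` is ours.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# The two most recent ts values from the result above.
for ms in (1730091600000, 1730088000000):
    print(epoch_ms_to_utc(ms).isoformat())
```

With a DataFrame, `pd.to_datetime(df["ts"], unit="ms")` performs the same conversion on the whole column.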


In [23]:
# Slide 18

query = """
SELECT
    name, trunc(avg(outputpercentage), 2) AS avg_output_percent
FROM
    windfarm_output o, windfarms w
WHERE
    month = 1722470400000 AND o.windfarmid = w.id
GROUP BY
    name
ORDER BY
    avg_output_percent DESC
LIMIT
    5
"""

pd.read_sql(query, CONNECTION_STRING)


| | name | avg_output_percent |
|---|---|---|
| 0 | Seagreen Phase 1 | 69.41 |
| 1 | Walney 2 | 67.69 |
| 2 | West of Duddon Sands | 66.41 |
| 3 | Walney Extension 4 | 65.71 |
| 4 | Rhyl Flats | 61.25 |
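The literal `1722470400000` in the `WHERE` clause is the epoch-milliseconds value for midnight UTC on 1st August 2024, which is what `date_trunc('month', ts)` produces for any timestamp in that month.  A minimal sketch for computing such a value for another month follows.  The function name `month_start_epoch_ms` is ours.

```python
from datetime import datetime, timezone


def month_start_epoch_ms(year, month):
    """Epoch milliseconds for 00:00 UTC on the first day of the month."""
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    return int(start.timestamp() * 1000)


print(month_start_epoch_ms(2024, 8))
```

Swapping in `month_start_epoch_ms(2024, 9)` would give the value to use for September's averages.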


In [25]:
# Slide 19

query = """
SELECT
    date_bin('1 week'::INTERVAL, ts, 0) AS week,
    trunc(avg(outputpercentage), 2) AS hourly_avg_output_pct
FROM
    windfarm_output
WHERE
    windfarmid = 'NHOYW-1'
GROUP BY 
    week
ORDER BY
    week DESC
LIMIT 
    3
"""

pd.read_sql(query, CONNECTION_STRING)

| | week | hourly_avg_output_pct |
|---|---|---|
| 0 | 1729728000000 | 26.29 |
| 1 | 1729123200000 | 31.16 |
| 2 | 1728518400000 | 26.64 |
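To see where the `week` values come from, here is a rough Python equivalent of the bucketing arithmetic that `date_bin` performs with a one week interval and an origin of `0`.  It is a sketch of the idea rather than CrateDB's implementation.  Because the origin is the Unix epoch, which fell on a Thursday, every bucket starts at midnight UTC on a Thursday.

```python
from datetime import datetime, timezone

WEEK_MS = 7 * 24 * 60 * 60 * 1000


def week_bin(ts_ms, origin_ms=0):
    """Round a timestamp down to the start of its one-week bucket."""
    return ts_ms - (ts_ms - origin_ms) % WEEK_MS


# The most recent reading in the data set falls into the newest bucket.
bucket = week_bin(1730091600000)
print(bucket)
print(datetime.fromtimestamp(bucket / 1000, tz=timezone.utc).strftime("%A %d %B %Y"))
```

Passing a different `origin_ms`, for example a Monday at midnight, shifts the bucket boundaries accordingly.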


In [None]:
# TODO map for slide 21

In [None]:
# Slide 22

query = """
SELECT
    name,
    max_by(outputpercentage, ts) AS latest_output_pct
FROM
    windfarms w, windfarm_output o
WHERE
    w.id = o.windfarmid AND within(
        location,
        {
            coordinates = [
                [
                    [ 0.056102312465469595, 53.561105338449806 ],
                    [ 0.6294239522362943, 52.99772662833962 ],
                    [ 0.7490867055060448, 53.00672492889956 ],
                    [ 1.2526092934924407, 52.99172438336538 ],
                    [ 1.7362237521376755, 52.75097942970572 ],
                    [ 1.7711106241761172, 52.53317123980176 ],
                    [ 2.9676480713816034, 52.952692458677035 ],
                    [ 1.412164720757545, 54.09662224695873 ],
                    [ 0.056102312465469595, 53.561105338449806 ]                        
                ]
            ],              
            type = 'Polygon'
        }
    )
GROUP BY
    name
ORDER BY
    latest_output_pct DESC
"""

pd.read_sql(query, CONNECTION_STRING)

| | name | latest_output_pct |
|---|---|---|
| 0 | Race Bank | 71.48 |
| 1 | Dudgeon | 63.56 |
| 2 | Inner Dowsing | 59.48 |
| 3 | Triton Knoll | 51.89 |
| 4 | Lincs | 41.67 |
| 5 | Sheringham Shoal | 39.59 |
| 6 | Scroby Sands | 23.17 |
| 7 | Humber Gateway | 17.12 |
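To build some intuition for what `within` is testing, the sketch below applies a simple planar ray-casting test to the same polygon.  This is an illustration only: CrateDB uses its own geospatial implementation, and the two test points are hypothetical coordinates chosen for the example, not wind farm locations from the data set.

```python
# The polygon from the query above, as (longitude, latitude) pairs.
POLYGON = [
    (0.056102312465469595, 53.561105338449806),
    (0.6294239522362943, 52.99772662833962),
    (0.7490867055060448, 53.00672492889956),
    (1.2526092934924407, 52.99172438336538),
    (1.7362237521376755, 52.75097942970572),
    (1.7711106241761172, 52.53317123980176),
    (2.9676480713816034, 52.952692458677035),
    (1.412164720757545, 54.09662224695873),
    (0.056102312465469595, 53.561105338449806),
]


def point_in_polygon(lon, lat, ring):
    """Ray casting: count the edges crossed by a ray heading east."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical test points: one off the east coast, one in central London.
print(point_in_polygon(1.0, 53.3, POLYGON))
print(point_in_polygon(-0.1276, 51.5072, POLYGON))
```

An odd number of crossings means the point is inside the polygon, which is why the flag is toggled on each crossing.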


In [None]:

# Slide 25

query = """
SELECT
    name,
    turbines['howmany'] AS num_turbines,
    description
FROM 
    windfarms
WHERE
    match(description, 'oil gas') AND turbines['howmany'] > 80
ORDER BY
    num_turbines DESC
"""

pd.read_sql(query, CONNECTION_STRING)

| | name | num_turbines | description |
|---|---|---|---|
| 0 | East Anglia One | 102 | East Anglia ONE is located in the southern area of the East Anglia Zone, and is approximately 43 km (27 miles) from the shore. The initial proposal was for an installed capacity of 1200 MW. Cabling for East Anglia ONE lands near the River Deben at Bawdsey, runs north of Ipswich and is connected to the National Grid at Bramford. A plan was formally submitted to the government in December 2012, and planning consent was granted in June 2014. In October 2014 ScottishPower announced that it intended to scale down East Anglia ONE because of insufficient subsidies. In February 2015 it was announced that ScottishPower would proceed with a scaled-down 714 MW project. A contract for £119/MWh was published on 27 April 2016, using 102 Siemens Wind Power direct-drive 7 MW turbines. Nacelles were built in Cuxhaven, while blades were made in Hull. Due to water depths between 30-40m, the turbines use jacketed foundations. Cabling is at 66 kV as opposed to the traditional 33 kV. Two export cables at 220 kV AC send the power to shore. A support vessel is powered by used vegetable oil. |
| 1 | Moray (East) | 100 | Moray East Wind Farm is an offshore wind farm located in the Moray Firth off the coast of Scotland. The wind farm received consent in 2014, and received support under the Contracts for Difference (CfD) scheme at £57.50/MWh (2012 prices) in 2017. The wind farm began exporting power in June 2021. The final turbine was installed in September 2021. Full power output was achieved in April 2022 and was commissioned. However, as market prices had increased above the CfD price due to the 2021 United Kingdom natural gas supplier crisis, the operator deferred the CfD start. |
| 2 | Sheringham Shoal | 89 | Sheringham Shoal Offshore Wind Farm is a Round 2 wind farm in North Sea off the coast of Norfolk. A lease for use of the sea bed was obtained in 2004 by Scira Offshore Energy (later acquired by Statoil (now Equinor) and Statkraft), the development gained offshore planning consent in 2008, and was constructed 2009–2011, being officially opened in 2012. The wind farm has 88 Siemens Wind Power 3.6MW turbines (total power 316.8 MW) spread over a 35 km2 (14 sq mi) area over 17 km (11 mi) from shore. In 2004 the Crown Estate awarded Econventures (Utrecht, NL) the lease of the Round 2 wind farm site at Sheringham shoal. Econventures together with SLP Energy (Lowestoft, UK) formed a joint venture Scira Offshore Energy to develop a c. 315MW wind farm. Development work for Econventures was to be carried out by Evelop BV, both subsidiaries of Econcern BV. In 2005 Hydro took a 50% stake in Scira, acquiring 25% shareholdings from both SLP Energy and Ecocentures. In 2006 Scira submitted a planning application to the Department of Trade and Industry for a 108 turbine, 315 MW wind farm. The planned wind farm was approximately 18 kilometres (11 mi) off the coast of Norfolk at Sheringham, just within the 12 nm UK territorial water boundary, and 5 kilometres (3.1 mi) north of the sand bank known as Sheringham shoal. The wind farm would be located at water depths of 16 to 22 metres (52 to 72 ft) and consist of somewhere between 45 and 108 turbines. Benefits of the site included low shipping and trawling intensities; lack of any dredging, dumping, oil/gas, or MoD practice areas, and of cables or pipelines; as well as low visual impact from the coast, and outside any nature conservation areas. The seabed at the wind farm and offshore cable route consisted of mainly gravely sand, overlying chalk. The electrical power export cable was to be connected to a switching station near Muckleburgh Collection, via landfall near Weybourne Hope. Two routes were considered for the export cable, one avoiding the sandbank at Sheringham shoal. The connection to the National Grid was planned to be made at an electrical substation near Salle, Norfolk via a 132kV 21.3 kilometres (13.2 mi) underground cable. Planning consent for the wind farm was given on 8 August 2008. |
| 3 | Beatrice | 84 | The Beatrice Offshore Wind Farm now known as Beatrice Offshore Windfarm Ltd (BOWL) project, is an offshore wind farm close to the Beatrice oil field in the Moray Firth, part of the North Sea 13 km off the north east coast of Scotland. |


In [None]:
# TODO slide 26 needs the vector representation of the query text.

query = """
SELECT 
    _score, name, description 
FROM 
    windfarms 
WHERE 
    knn_match(
        description_vec, 
        # TODO QUERY VECTOR HERE, 
        2) 
    AND turbines['howmany'] > 100
ORDER BY _score DESC
"""

pd.read_sql(query, CONNECTION_STRING)

## Continue your Learning Journey

To learn more about CrateDB, sign up for our free courses at the CrateDB Academy.  We recommend the [CrateDB Fundamentals course](https://learn.cratedb.com/cratedb-fundamentals) for a comprehensive overview, and our [Advanced Time Series course](https://learn.cratedb.com/time-series) for a deep dive into time series data modelling, queries and aggregations.