## Genetic Programming with Energy Data

Data from the [National Grid ESO API ](https://www.nationalgrideso.com/data-portal/api-guidance). 

In [21]:
# imports and installs

import numpy as np
import pandas as pd
import requests
import json
import random
import itertools
import sys

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import HDBSCAN
from sklearn_extra.cluster import KMedoids

from sklearn.metrics import silhouette_score

## Description of Terms
Note: Units for flow are measured in MW. The following terms are descriptions of the time series. There are additional time series included which represent the carbon intensity forecast for 2023 of various regions. 

- IFA_FLOW: IFA stands for Interconnexion France-Angleterre, which is a subsea electricity link between France and Great Britain that began operating in 1986. It's a joint venture between National Grid and the French Transmission Operator RTE. The system-to-system flow (SSF) of the IFA is calculated and adjusted for interconnector losses to determine the flow. 
- TSD: This is the Transmission System generation requirement and is equivalent to the Initial Transmission System Outturn (ITSDO) and Transmission System Demand Forecast on BM Reports. Transmission System Demand is equal to the ND plus the additional generation required to meet station load, pump storage pumping and interconnector exports.
- VIKING_FLOW: Flow coming from a record-breaking 475-mile-long land and subsea cable connecting British and Danish energy grids for the first time.
- IFA2_FLOW: Commissioned in 2021 IFA2 is a 1,000 MW high voltage direct current (HVDC) electrical interconnector between the British and French transmission systems. It is the second link to France that National Grid has developed with RTE.
- EMBEDDED_WIND_GENERATION: This is an estimate of the GB wind generation from wind farms which do not have Transmission System metering installed. These wind farms are embedded in the distribution network and invisible to National Grid. Their effect is to suppress the electricity demand during periods of high wind. The true output of these generators is not known so an estimate is provided based on National Grid’s best model.
- ND: This is the Great Britain generation requirement and is equivalent to the Initial National Demand Outturn (INDO) and National Demand Forecast as published on BM Reports. National Demand is the sum of metered generation, but excludes generation required to meet station load, pump storage pumping and interconnector exports. National Demand is calculated as a sum of generation based on National Grid operational generation metering. 
- MOYLE_FLOW: Flow related to The Moyle Interconnector, a 500 megawatt (MW) HVDC link between Scotland and Northern Ireland, running between Auchencrosh in Ayrshire and Ballycronan More in County Antrim. It went into service in 2001
- NEMO_FLOW: Flow from Nemo Link, a 1,000 MegaWatt HVDC submarine power cable between Richborough Energy Park in Kent, the United Kingdom and Zeebrugge, Belgium
- ELECLINK_FLOW: Flow from ElecLink, a 1,000 MW high-voltage direct current (HVDC) electrical interconnector between the United Kingdom and France, passing through the Channel Tunnel.
- PUMP_STORAGE_PUMPING: The demand due to pumping at hydro pump storage units; the -ve signifies pumping load.
- EMBEDDED_WIND_CAPACITY: This is National Grid’s best view of the installed embedded wind capacity in GB. This is based on publically available information compiled from a variety of sources and is not the definitive view. It is consistent with the generation estimate provided above. 
- SETTLEMENT_DATE: Settlement Date. 
- ENGLAND_WALES_DEMAND: England and Wales Demand, as ND above but on an England and Wales basis.
- EMBEDDED_SOLAR_CAPACITY: As embedded wind capacity above, but for solar generation.
- SCOTTISH_TRANSFER: Power transfer across the scottish boundaries due to growth in renewable generation capacity.
- NON_BM_STOR: Operating reserve for units that are not included in the ND generator definition. This can be in the form of generation or demand reduction.
- SETTLEMENT_PERIOD: Settlement Period. 
- EAST_WEST_FLOW: Flow from the East - West Interconnector between Ireland and Great Britain.
- NSL_FLOW: Flow from the North Sea Link (NSL), a joint venture with Norwegian system operator Statnett. Stretching 720 kilometres under the North Sea, at depths of up to 700 metres, NSL is an interconnector capable of sharing up to 1400 megawatts of electricity
- BRITNED_FLOW: Flow from the Britned connector, a two-way 1,000 MW high-voltage direct current connection has a length of 260 km and runs from the Isle of Grain (in Kent) to Maasvlakte (near Rotterdam)
- _ID: Line ID. 
- EMBEDDED_SOLAR_GENERATION: As embedded wind generation above, but for solar generation.

In [22]:
# API Calls to the Britain national grid API. Calling to retrieve historic electricity demand,
# interconnector, wind and solar outturn, and carbon intensity data for 2022 and/or 2023.

URL = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "bb44a1b5-75b1-4db2-8491-257f23385006"'
response = requests.get(URL).json()
URL2 = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "bf5ab335-9b40-4ea4-b93a-ab4af7bce003"'
response2 = requests.get(URL2).json()
URL3 = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "3372646d-419f-4599-97a9-6bb4e7e32862"'
response3 = requests.get(URL3).json()
URL4 = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "c16b0e19-c02a-44a8-ba05-4db2c0545a2a"'
response4 = requests.get(URL4).json()


# Converting responses from json into pandas dataframe
df_demand_2022 = pd.json_normalize(
    response["result"]["records"],
    meta=[
        "IFA_FLOW",
        "TSD",
        "VIKING_FLOW",
        "IFA2_FLOW",
        "EMBEDDED_WIND_GENERATION",
        "ND",
        "MOYLE_FLOW",
        "NEMO_FLOW",
        "ELECLINK_FLOW",
        "PUMP_STORAGE_PUMPING",
        "EMBEDDED_WIND_CAPACITY",
        "SETTLEMENT_DATE",
        "ENGLAND_WALES_DEMAND",
        "EMBEDDED_SOLAR_CAPACITY",
        "SCOTTISH_TRANSFER",
        "NON_BM_STOR",
        "_FULL_TEXT",
        "SETTLEMENT_PERIOD",
        "EAST_WEST_FLOW",
        "NSL_FLOW",
        "BRITNED_FLOW",
        "_ID",
        "EMBEDDED_SOLAR_GENERATION",
    ],
)

df_demand_2023 = pd.json_normalize(
    response2["result"]["records"],
    meta=[
        "IFA_FLOW",
        "TSD",
        "VIKING_FLOW",
        "IFA2_FLOW",
        "EMBEDDED_WIND_GENERATION",
        "ND",
        "MOYLE_FLOW",
        "NEMO_FLOW",
        "ELECLINK_FLOW",
        "PUMP_STORAGE_PUMPING",
        "EMBEDDED_WIND_CAPACITY",
        "SETTLEMENT_DATE",
        "ENGLAND_WALES_DEMAND",
        "EMBEDDED_SOLAR_CAPACITY",
        "SCOTTISH_TRANSFER",
        "NON_BM_STOR",
        "_FULL_TEXT",
        "SETTLEMENT_PERIOD",
        "EAST_WEST_FLOW",
        "NSL_FLOW",
        "BRITNED_FLOW",
        "_ID",
        "EMBEDDED_SOLAR_GENERATION",
    ],
)

df_historic_prices_2022 = pd.json_normalize(
    response3["result"]["records"],
    meta=[
        "Settlement Period",
        "Half-hourly Charge",
        "Run Type",
        "Total Daily BSUoS Charge",
        "_full_text",
        "BSUoS Price (£/MWh Hour)",
        "Settlement Day",
        "_id",
    ],
)

df_carbon = pd.json_normalize(
    response4["result"]["records"],
    meta=[
        "East Midlands",
        "East England",
        "West Midlands",
        "North Scotland",
        "South Scotland",
        "_full_text",
        "South West England",
        "datetime",
        "North Wales and Merseyside",
        "North East England",
        "South East England",
        "South Wales",
        "North West England",
        "Yorkshire",
        "London",
        "_id",
        "South England",
    ],
)

# Conversions to datetime for extracting data for specific years
df_historic_prices_2022["Settlement Day"] = pd.to_datetime(
    df_historic_prices_2022["Settlement Day"]
)
df_historic_prices_2022 = df_historic_prices_2022[
    df_historic_prices_2022["Settlement Day"].dt.year == 2022
]

df_carbon["datetime"] = pd.to_datetime(df_carbon["datetime"])
df_carbon_2023 = df_carbon[df_carbon["datetime"].dt.year == 2023]

# Dropping unused columns for future concatenation
df_demand_2022.drop(["_full_text", "NON_BM_STOR"], axis=1, inplace=True)
df_demand_2023.drop(["_full_text", "NON_BM_STOR"], axis=1, inplace=True)
df_historic_prices_2022.drop(["Run Type", "_full_text"], axis=1, inplace=True)
df_carbon_2023.drop(["_full_text"], axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_carbon_2023.drop(["_full_text"], axis=1, inplace=True)


In [23]:
# Dropping id columns and now unused date columns.
# Only want the time series that will be clustered - each are of size 17520
# so the "_id" column is able to be dropped.

df_demand_2022_noid = df_demand_2022.drop(
    ["_id", "SETTLEMENT_DATE", "SETTLEMENT_PERIOD"], axis=1
)
df_demand_2023_noid = df_demand_2023.drop(
    ["_id", "SETTLEMENT_DATE", "SETTLEMENT_PERIOD"], axis=1
)
df_demand_2023_noid.columns = [str(col) + "_2" for col in df_demand_2023_noid.columns]
df_historic_prices_2022_noid = df_historic_prices_2022.drop(
    ["Settlement Period", "Settlement Day", "_id"], axis=1
)
df_carbon_2023_noid = df_carbon_2023.drop(["_id", "datetime"], axis=1)

# Concatenating the dataframes.
df_full = pd.concat(
    [
        df_historic_prices_2022_noid.reset_index().drop("index", axis=1, inplace=True),
        df_demand_2022_noid,
        df_demand_2023_noid,
        df_carbon_2023_noid.reset_index().drop("index", axis=1, inplace=True),
    ],
    axis=1,
)

# Must perform scaling since clustering algorithms works on similarity/distance
df_full = StandardScaler().fit_transform(df_full)
df_full_transposed = df_full.transpose()

Must add in a step to check for nulls and other errors

We are going to instantiate multiple clustering models as our initial population for the algorithm. This can include the following: 
- KMeans
    - n_clusters: 3, 4, 5, 6
    - max_iter: 100, 200, 300, 400
    - tol: 0.00001, 0.0001, 0.001, 0.01
- KMedoids
    - n_clusters: 3, 4, 5, 6
    - metric: euclidean, cosine, haversine, l2 
    - method: alternate, pam
    - max_iter: 100, 200, 300, 400
- DBSCAN
    - eps: 0.2, 0.5, 1.0
    - min_samples: 3, 5, 10
    - metric: euclidean, cosine, haversine, l2 
- HDBSCAN
    - metric: euclidean, cosine, haversine, l2 
    - min_samples: 3, 5, 10
    - cluster_selection_epsilon: 0.2, 0.5, 1.0


In [24]:
list(np.arange(0.1, 4.1, 0.1))

[0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6,
 0.7000000000000001,
 0.8,
 0.9,
 1.0,
 1.1,
 1.2000000000000002,
 1.3000000000000003,
 1.4000000000000001,
 1.5000000000000002,
 1.6,
 1.7000000000000002,
 1.8000000000000003,
 1.9000000000000001,
 2.0,
 2.1,
 2.2,
 2.3000000000000003,
 2.4000000000000004,
 2.5000000000000004,
 2.6,
 2.7,
 2.8000000000000003,
 2.9000000000000004,
 3.0000000000000004,
 3.1,
 3.2,
 3.3000000000000003,
 3.4000000000000004,
 3.5000000000000004,
 3.6,
 3.7,
 3.8000000000000003,
 3.9000000000000004,
 4.0]

In [25]:
# Defining which parameters are appropriate to adjust for each clustering model

model_list = ["KMeans", "KMedoids", "DBSCAN", "HDBSCAN"]

list_dict_model_params = [
    {"KMeans": ["n_clusters", "max_iter", "tol"]},
    {"KMedoids": ["n_clusters", "metric_1", "method", "max_iter"]},
    {"DBSCAN": ["eps", "min_samples", "metric_1"]},
    {"HDBSCAN": ["metric_2", "min_samples", "eps"]},
]

# Defining which parameter values each model can take
dict_param_values = {
    "n_clusters": list(range(2, 11)),
    "max_iter": list(range(50, 510, 10)),
    "tol": list(np.arange(0.0001, 0.1001, 0.001)),
    "metric_1": [
        "euclidean",
        "cosine",
        "haversine",
        "l2",
        "cityblock",
        "l1",
        "manhattan",
    ],
    "metric_2": [
        "l2",
        "canberra",
        "manhattan",
        "euclidean",
        "braycurtis",
        "chebyshev",
        "hamming",
    ],
    "method": ["alternate", "pam"],
    "eps": list(np.arange(0.1, 4.1, 0.1)),
    "min_samples": list(range(3, 11)),
}

## Functions for the Genetic Algorithm

In [26]:
import evolution_fns

In [28]:
evolution_fns.evolution(
    model_params=list_dict_model_params,
    param_values=dict_param_values,
    init_population_num=20,
    df=df_full_transposed,
    selection_param=0.8,
    crossover_repeat=2,
    mutation_repeat=2,
    cutoff_score=0.4,
)

Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.25858202317892304
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.2764332546405843
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.2920668700273994
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.29390556611451235
Top model is KMeans(max_iter=290, n_clusters=9, tol=0.022099999999999998) and associated score is 0.3020025780987883
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.29134024569718303
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.022099999999999998) and associated score is 0.300380827493414
Top model is KMeans(max_iter=290, n_clusters=9, tol=0.022099999999999998) and associated score is 0.3027224563469835
Top model is KMeans(max_iter=430, n_clusters=9, tol=0.02209999

[(KMeans(max_iter=430, n_clusters=10, tol=0.022099999999999998),
  0.3239370495457752),
 (KMeans(max_iter=430, n_clusters=10, tol=0.022099999999999998),
  0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.035100000000000006),
  0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.0061), 0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.0541), 0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.08710000000000001),
  0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.0121), 0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.0261), 0.32352768205051846),
 (KMeans(max_iter=430, n_clusters=10, tol=0.0201), 0.3226796662985823),
 (KMeans(max_iter=320, n_clusters=10, tol=0.022099999999999998),
  0.3226796662985823)]