## Genetic Programming with Energy Data

Data is sourced from the [National Grid ESO API](https://www.nationalgrideso.com/data-portal/api-guidance).

Next steps: implement a clustering algorithm for the time series, then, once the proof of concept (POC) works, implement a genetic programming algorithm to find which clustering method performs best.

In [102]:
# imports and installs

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json
import random
import itertools

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.cluster import HDBSCAN
from sklearn_extra.cluster import KMedoids
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import dendrogram
from scipy.spatial import distance

from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score

## Description of Terms
Note: Units for flow are measured in MW. 

- IFA_FLOW: IFA stands for Interconnexion France-Angleterre, which is a subsea electricity link between France and Great Britain that began operating in 1986. It's a joint venture between National Grid and the French Transmission Operator RTE. The system-to-system flow (SSF) of the IFA is calculated and adjusted for interconnector losses to determine the flow. 
- TSD: This is the Transmission System generation requirement and is equivalent to the Initial Transmission System Outturn (ITSDO) and Transmission System Demand Forecast on BM Reports. Transmission System Demand is equal to the ND plus the additional generation required to meet station load, pump storage pumping and interconnector exports.
- VIKING_FLOW: Flow coming from a record-breaking 475-mile-long land and subsea cable connecting British and Danish energy grids for the first time.
- IFA2_FLOW: Commissioned in 2021, IFA2 is a 1,000 MW high-voltage direct current (HVDC) electrical interconnector between the British and French transmission systems. It is the second link to France that National Grid has developed with RTE.
- EMBEDDED_WIND_GENERATION: This is an estimate of the GB wind generation from wind farms which do not have Transmission System metering installed. These wind farms are embedded in the distribution network and invisible to National Grid. Their effect is to suppress the electricity demand during periods of high wind. The true output of these generators is not known so an estimate is provided based on National Grid’s best model.
- ND: This is the Great Britain generation requirement and is equivalent to the Initial National Demand Outturn (INDO) and National Demand Forecast as published on BM Reports. National Demand is the sum of metered generation, but excludes generation required to meet station load, pump storage pumping and interconnector exports. National Demand is calculated as a sum of generation based on National Grid operational generation metering. 
- MOYLE_FLOW: Flow on the Moyle Interconnector, a 500 MW HVDC link between Scotland and Northern Ireland, running between Auchencrosh in Ayrshire and Ballycronan More in County Antrim. It went into service in 2001.
- NEMO_FLOW: Flow from Nemo Link, a 1,000 MW HVDC submarine power cable between Richborough Energy Park in Kent, United Kingdom, and Zeebrugge, Belgium.
- ELECLINK_FLOW: Flow from ElecLink, a 1,000 MW high-voltage direct current (HVDC) electrical interconnector between the United Kingdom and France, passing through the Channel Tunnel.
- PUMP_STORAGE_PUMPING: The demand due to pumping at hydro pump storage units; a negative value signifies pumping load.
- EMBEDDED_WIND_CAPACITY: This is National Grid’s best view of the installed embedded wind capacity in GB. This is based on publicly available information compiled from a variety of sources and is not the definitive view. It is consistent with the generation estimate provided above.
- SETTLEMENT_DATE: Settlement Date. 
- ENGLAND_WALES_DEMAND: England and Wales Demand, as ND above but on an England and Wales basis.
- EMBEDDED_SOLAR_CAPACITY: As embedded wind capacity above, but for solar generation.
- SCOTTISH_TRANSFER: Power transfer across the Scottish boundaries due to growth in renewable generation capacity.
- NON_BM_STOR: Operating reserve for units that are not included in the ND generator definition. This can be in the form of generation or demand reduction.
- SETTLEMENT_PERIOD: Settlement Period. 
- EAST_WEST_FLOW: Flow from the East-West Interconnector between Ireland and Great Britain.
- NSL_FLOW: Flow from the North Sea Link (NSL), a joint venture with Norwegian system operator Statnett. Stretching 720 kilometres under the North Sea, at depths of up to 700 metres, NSL is an interconnector capable of sharing up to 1,400 MW of electricity.
- BRITNED_FLOW: Flow from the BritNed interconnector, a two-way 1,000 MW high-voltage direct current connection that is 260 km long and runs from the Isle of Grain (in Kent) to Maasvlakte (near Rotterdam).
- _ID: Line ID. 
- EMBEDDED_SOLAR_GENERATION: As embedded wind generation above, but for solar generation.

In [2]:
# API calls to the National Grid ESO API. Retrieves historic electricity demand,
# interconnector, wind and solar outturn data for 2022 and 2023, plus historic
# BSUoS prices (filtered to 2022 below).

URL = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "bb44a1b5-75b1-4db2-8491-257f23385006"'
response = requests.get(URL).json()
URL2 = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "bf5ab335-9b40-4ea4-b93a-ab4af7bce003"'
response2 = requests.get(URL2).json()
URL3 = 'https://api.nationalgrideso.com/api/3/action/datastore_search_sql?sql=SELECT * FROM "3372646d-419f-4599-97a9-6bb4e7e32862"'
response3 = requests.get(URL3).json()


# Converting from json into pandas dataframe
df_demand_2022 = pd.json_normalize(
    response["result"]["records"],
    meta=[
        "IFA_FLOW",
        "TSD",
        "VIKING_FLOW",
        "IFA2_FLOW",
        "EMBEDDED_WIND_GENERATION",
        "ND",
        "MOYLE_FLOW",
        "NEMO_FLOW",
        "ELECLINK_FLOW",
        "PUMP_STORAGE_PUMPING",
        "EMBEDDED_WIND_CAPACITY",
        "SETTLEMENT_DATE",
        "ENGLAND_WALES_DEMAND",
        "EMBEDDED_SOLAR_CAPACITY",
        "SCOTTISH_TRANSFER",
        "NON_BM_STOR",
        "_FULL_TEXT",
        "SETTLEMENT_PERIOD",
        "EAST_WEST_FLOW",
        "NSL_FLOW",
        "BRITNED_FLOW",
        "_ID",
        "EMBEDDED_SOLAR_GENERATION",
    ],
)

df_demand_2023 = pd.json_normalize(
    response2["result"]["records"],
    meta=[
        "IFA_FLOW",
        "TSD",
        "VIKING_FLOW",
        "IFA2_FLOW",
        "EMBEDDED_WIND_GENERATION",
        "ND",
        "MOYLE_FLOW",
        "NEMO_FLOW",
        "ELECLINK_FLOW",
        "PUMP_STORAGE_PUMPING",
        "EMBEDDED_WIND_CAPACITY",
        "SETTLEMENT_DATE",
        "ENGLAND_WALES_DEMAND",
        "EMBEDDED_SOLAR_CAPACITY",
        "SCOTTISH_TRANSFER",
        "NON_BM_STOR",
        "_FULL_TEXT",
        "SETTLEMENT_PERIOD",
        "EAST_WEST_FLOW",
        "NSL_FLOW",
        "BRITNED_FLOW",
        "_ID",
        "EMBEDDED_SOLAR_GENERATION",
    ],
)

df_historic_prices_2022 = pd.json_normalize(
    response3["result"]["records"],
    meta=[
        "Settlement Period",
        "Half-hourly Charge",
        "Run Type",
        "Total Daily BSUoS Charge",
        "_full_text",
        "BSUoS Price (£/MWh Hour)",
        "Settlement Day",
        "_id",
    ],
)
df_historic_prices_2022["Settlement Day"] = pd.to_datetime(
    df_historic_prices_2022["Settlement Day"]
)
df_historic_prices_2022 = df_historic_prices_2022[
    df_historic_prices_2022["Settlement Day"].dt.year == 2022
]


# Drop columns that are not used for clustering
df_demand_2022.drop(["_full_text", "NON_BM_STOR"], axis=1, inplace=True)
df_demand_2023.drop(["_full_text", "NON_BM_STOR"], axis=1, inplace=True)
df_historic_prices_2022.drop(["Run Type", "_full_text"], axis=1, inplace=True)

In [3]:
df_demand_2022_noid = df_demand_2022.drop(
    ["_id", "SETTLEMENT_DATE", "SETTLEMENT_PERIOD"], axis=1
)
df_demand_2023_noid = df_demand_2023.drop(
    ["_id", "SETTLEMENT_DATE", "SETTLEMENT_PERIOD"], axis=1
)
df_demand_2023_noid.columns = [str(col) + "_2" for col in df_demand_2023_noid.columns]
df_historic_prices_2022_noid = df_historic_prices_2022.drop(
    ["Settlement Period", "Settlement Day", "_id"], axis=1
)

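df_full = pd.concat(
    [
        # reset_index(drop=True) returns the re-indexed frame. Chaining
        # .drop(..., inplace=True) returns None, which pd.concat silently discards.
        df_historic_prices_2022_noid.reset_index(drop=True),
        df_demand_2022_noid,
        df_demand_2023_noid,
    ],
    axis=1,
)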
df_full = StandardScaler().fit_transform(df_full)
df_full_transposed = df_full.transpose()

TODO: add a step that checks for nulls and other data errors before scaling.
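
A minimal sketch of such a check, assuming the frames only need to be screened for missing values, unparseable entries and constant columns. The helper name `report_data_issues` and the toy frame are hypothetical and not part of the notebook; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def report_data_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise nulls, non-numeric entries and constant columns, one row per column."""
    # Coerce every column to numbers; unparseable entries become NaN
    numeric = df.apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame(
        {
            "n_null": df.isna().sum(),
            # Values that are present but cannot be parsed as numbers
            "n_non_numeric": numeric.isna().sum() - df.isna().sum(),
            # Constant columns have zero variance and carry no signal after scaling
            "is_constant": numeric.nunique(dropna=True) <= 1,
        }
    )


# Toy frame standing in for the concatenated demand and price data (made-up values)
toy = pd.DataFrame(
    {
        "ND": [25000, 26000, np.nan],
        "TSD": ["27000", "bad", "28000"],
        "VIKING_FLOW": [0, 0, 0],
    }
)
report = report_data_issues(toy)
print(report)
```

Running a report like this on `df_full` before `StandardScaler` would show whether rows need to be dropped or imputed; KMeans and the other estimators used here reject inputs that contain NaN.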

We are going to instantiate multiple clustering models as our initial population for the algorithm. This can include the following: 
- KMeans
    - n_clusters: 3, 4, 5, 6
    - max_iter: 100, 200, 300, 400
    - tol: 0.00001, 0.0001, 0.001, 0.01
- KMedoids
    - n_clusters: 3, 4, 5, 6
    - metric: euclidean, cosine, haversine, l2 
    - method: alternate, pam
    - max_iter: 100, 200, 300, 400
- Hierarchical (Agglomerative Clustering linkage method with Scipy)
    - method: single, complete, average, weighted, centroid
    - metric: euclidean, cosine, canberra, chebyshev
- DBSCAN
    - eps: 0.2, 0.5, 1.0
    - min_samples: 3, 5, 10
    - metric: euclidean, cosine, haversine, l2 
- HDBSCAN
    - metric: euclidean, cosine, haversine, l2 
    - min_samples: 3, 5, 10
    - cluster_selection_epsilon: 0.2, 0.5, 1.0

In our genetic algorithm, we will want to cross over and mutate hyperparameters; however, some hyperparameters are only present in certain types of clustering algorithms. Consider using a dictionary of "model": "hyperparameter" to help during these steps. We will want to cross over or remove hyperparameters during these phases and change their values.

In [74]:
# Defining which parameters are appropriate to adjust for each clustering model

model_list = ["KMeans", "KMedoids", "DBSCAN", "HDBSCAN"]

list_dict_model_params = [
    {"KMeans": ["n_clusters", "max_iter", "tol"]},
    {"KMedoids": ["n_clusters", "metric_1", "method_1", "max_iter"]},
    {"DBSCAN": ["eps", "min_samples", "metric_1"]},
    {"HDBSCAN": ["metric_2", "min_samples", "eps"]},
]

# Defining which parameter values each model can take
dict_param_values = {
    "n_clusters": [3, 4, 5, 6, 7, 8],
    "max_iter": [100, 200, 300, 400, 500],
    "tol": [0.00001, 0.0001, 0.001, 0.01],
    "metric_1": ["euclidean", "cosine", "haversine", "l2"],
    "metric_2": ["l2", "canberra", "manhattan", "euclidean"],
    "method_1": ["alternate", "pam"],
    "method_2": ["complete", "average", "weighted", "centroid"],
    "eps": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1, 2, 3, 4],
    "min_samples": [3, 4, 5, 6],
}

## Functions for the Genetic Algorithm

In [75]:
# Function to map a model name to a new, default-configured model instance
def intantiate_model(model_name):
    if model_name == "KMeans":
        model_out = KMeans()
    elif model_name == "KMedoids":
        model_out = KMedoids()
    elif model_name == "DBSCAN":
        model_out = DBSCAN()
    else:
        model_out = HDBSCAN()
    return model_out

In [117]:
# Defining function to initialize the population
def init_population(
    model_params=list_dict_model_params, param_values=dict_param_values, num_models=15
):
    # Map the placeholder keys in dict_param_values to real estimator arguments
    aliases = {
        "metric_1": "metric",
        "metric_2": "metric",
        "method_1": "method",
        "method_2": "method",
    }
    population_models = []
    for _ in range(num_models):
        # Each entry of model_params is a one-item {model_name: [param_names]} dict
        curr_model_name, curr_model_params = next(
            iter(random.choice(model_params).items())
        )
        params_dict = {}
        for param in curr_model_params:
            curr_param_value = random.choice(param_values[param])
            if param == "eps" and curr_model_name == "HDBSCAN":
                # HDBSCAN takes this value as cluster_selection_epsilon
                param = "cluster_selection_epsilon"
            else:
                param = aliases.get(param, param)
            params_dict[param] = curr_param_value

        # Build the estimator and apply the sampled hyperparameters
        model_instant = intantiate_model(curr_model_name)
        model_instant.set_params(**params_dict)
        population_models.append(model_instant)

    return population_models

In [118]:
init_population()

[DBSCAN(eps=0.8, metric='cosine', min_samples=3),
 HDBSCAN(cluster_selection_epsilon=0.4, min_samples=6),
 DBSCAN(min_samples=4),
 KMeans(max_iter=200, tol=0.001),
 KMeans(max_iter=200),
 HDBSCAN(cluster_selection_epsilon=3, metric='canberra', min_samples=5),
 KMeans(max_iter=500, n_clusters=6),
 KMedoids(max_iter=400, method='pam', metric='l2', n_clusters=6),
 HDBSCAN(cluster_selection_epsilon=0.8, metric='canberra', min_samples=4),
 KMeans(max_iter=500, tol=0.001),
 KMeans(max_iter=400, n_clusters=5, tol=0.01),
 DBSCAN(metric='l2', min_samples=4),
 KMeans(max_iter=500, n_clusters=7, tol=0.01),
 KMedoids(max_iter=500, method='pam', n_clusters=4),
 KMeans(max_iter=500, tol=1e-05)]

In [119]:
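# Function to iterate models, fit, and evaluate cluster results.
def cluster_fitness(data=None, population=None):
    # Resolve defaults at call time. A default of init_population() in the
    # signature is evaluated only once, when the function is defined.
    if data is None:
        data = df_full_transposed
    if population is None:
        population = init_population()
    fitness_indices = {}
    for model in population:
        try:
            model.fit(data)
            fitness_indices[model] = silhouette_score(data, model.labels_)
        except Exception:
            # Invalid parameter combinations or single-label results get the worst score
            fitness_indices[model] = -1
    # Sort from best to worst silhouette score
    fitness_indices_sorted = dict(
        sorted(fitness_indices.items(), reverse=True, key=lambda item: item[1])
    )

    return fitness_indices_sorted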

In [120]:
cluster_fitness(population=init_population())

{KMeans(max_iter=500, n_clusters=7, tol=0.001): 0.25192759462953807,
 KMeans(max_iter=100, n_clusters=6, tol=0.001): 0.20460736869211613,
 KMeans(max_iter=200, n_clusters=5, tol=0.01): 0.1774473266895917,
 HDBSCAN(cluster_selection_epsilon=3, min_samples=3): 0.1658895330501038,
 HDBSCAN(cluster_selection_epsilon=4, metric='l2', min_samples=4): 0.16068084040114136,
 KMeans(max_iter=400, n_clusters=6, tol=1e-05): 0.15633400084166776,
 KMeans(max_iter=400, n_clusters=3): 0.15462603651507725,
 KMeans(n_clusters=4, tol=0.01): 0.1318302502122674,
 KMedoids(max_iter=400, method='pam', metric='haversine', n_clusters=7): -1,
 DBSCAN(eps=1, metric='cosine', min_samples=3): -1,
 HDBSCAN(cluster_selection_epsilon=0.4, min_samples=5): -1,
 HDBSCAN(cluster_selection_epsilon=2, metric='l2', min_samples=6): -1,
 DBSCAN(eps=0.7, metric='haversine'): -1,
 HDBSCAN(cluster_selection_epsilon=0.4, metric='l2', min_samples=5): -1,
 DBSCAN(eps=0.8, min_samples=3): -1}

In [121]:
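# Function for the selection piece of the algorithm
def population_selection(dict_fitness=None, selection_param=0.7):
    # Compute the fitness at call time rather than once at definition time
    if dict_fitness is None:
        dict_fitness = cluster_fitness()
    # dict_fitness is sorted best-first, so keep the leading fraction of models
    range_val = int(selection_param * len(dict_fitness))
    next_generation = dict(itertools.islice(dict_fitness.items(), range_val))

    return next_generation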

In [122]:
population_selection(dict_fitness=cluster_fitness(population=init_population()))

{KMeans(max_iter=100, n_clusters=6, tol=0.01): 0.24106104111350266,
 KMedoids(max_iter=400, method='pam', metric='l2', n_clusters=5): 0.22360726518269644,
 KMedoids(method='pam', metric='l2', n_clusters=5): 0.22360726518269644,
 KMeans(max_iter=400, n_clusters=6): 0.19021172264539576,
 KMedoids(metric='l2', n_clusters=4): 0.16188376438438726,
 HDBSCAN(cluster_selection_epsilon=0.7, metric='l2', min_samples=4): 0.16068084040114136,
 KMeans(max_iter=200, n_clusters=4, tol=1e-05): 0.15505427130737057,
 KMeans(max_iter=400, n_clusters=4, tol=1e-05): 0.14612960590412846,
 HDBSCAN(cluster_selection_epsilon=2, metric='manhattan', min_samples=3): 0.14179907547246046,
 HDBSCAN(cluster_selection_epsilon=0.4, metric='manhattan', min_samples=4): 0.12974897746480146}

In [115]:
# Defining the crossover and mutation functions for the selected models.

# Crossover - If there are models of the same model type (e.g. several HDBSCAN)
# then take random parameters from one and swap it with another and add it
# to the population as a new model. Iterate through the process three times
# to generate three new models. 

# Mutation - Select models at random, then select a hyperparameter at random, 
# and mutate the parameter to another value in the possible selection list. Iterate
# through twice and generate two new models. 
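
The crossover and mutation steps described in the comments above could be sketched as follows. This is an illustration under stated assumptions, not the notebook's implementation: individuals are represented as hypothetical `(model_name, params)` tuples rather than fitted estimators, `SEARCH_SPACE` is a trimmed stand-in for `dict_param_values` keyed by real argument names, and the function names `crossover`, `mutate` and `next_generation` are invented for this example.

```python
import random

# Hypothetical, trimmed search space: model name -> argument name -> allowed values
SEARCH_SPACE = {
    "KMeans": {"n_clusters": [3, 4, 5, 6], "max_iter": [100, 200, 300], "tol": [1e-4, 1e-3]},
    "DBSCAN": {"eps": [0.2, 0.5, 1.0], "min_samples": [3, 5, 10]},
}


def crossover(parent_a, parent_b, rng=random):
    """Uniform crossover: each parameter is taken from one parent, chosen at random."""
    name_a, params_a = parent_a
    name_b, params_b = parent_b
    if name_a != name_b:
        raise ValueError("crossover needs two models of the same type")
    child = {key: rng.choice([params_a[key], params_b[key]]) for key in params_a}
    return name_a, child


def mutate(individual, space=SEARCH_SPACE, rng=random):
    """Replace one randomly chosen parameter with a different allowed value."""
    name, params = individual
    key = rng.choice(sorted(params))
    alternatives = [value for value in space[name][key] if value != params[key]]
    mutated = dict(params)
    mutated[key] = rng.choice(alternatives)
    return name, mutated


def next_generation(selected, n_crossover=3, n_mutation=2, rng=random):
    """Extend the selected individuals with crossover and mutation offspring."""
    by_type = {}
    for individual in selected:
        by_type.setdefault(individual[0], []).append(individual)
    # Only model types with at least two survivors can be crossed
    crossable = [group for group in by_type.values() if len(group) >= 2]
    offspring = []
    for _ in range(n_crossover):
        if not crossable:
            break
        offspring.append(crossover(*rng.sample(rng.choice(crossable), 2), rng=rng))
    for _ in range(n_mutation):
        offspring.append(mutate(rng.choice(selected), rng=rng))
    return selected + offspring


rng = random.Random(0)
selected = [
    ("KMeans", {"n_clusters": 3, "max_iter": 100, "tol": 1e-4}),
    ("KMeans", {"n_clusters": 6, "max_iter": 300, "tol": 1e-3}),
    ("DBSCAN", {"eps": 0.5, "min_samples": 5}),
]
new_population = next_generation(selected, rng=rng)
print(len(new_population))  # 3 selected + 3 crossover + 2 mutation = 8
```

Keeping individuals as plain dictionaries makes it straightforward to swap or change values; each tuple can be turned back into an estimator afterwards with `set_params(**params)`, as `init_population` already does.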

In [6]:
# Determining structure of data for each model

kmeans_model = KMeans(n_clusters=10, max_iter=200, tol=0.00001)
kmeans_model.fit(df_full_transposed)
print(
    "Silhouette score for kmeans model trial: ",
    silhouette_score(df_full_transposed, kmeans_model.labels_),
)
print(
    "Davies Bouldin score for kmeans model trial: ",
    davies_bouldin_score(df_full_transposed, kmeans_model.labels_),
)

Silhouette score for kmeans model trial:  0.27678903395369786
Davies Bouldin score for kmeans model trial:  1.1870111161503307


In [6]:
kmedoids_model = KMedoids(n_clusters=7, metric="l2", method="alternate", max_iter=100)
kmedoids_model.fit(df_full_transposed)
print(
    "Silhouette score for kmedoids model trial: ",
    silhouette_score(df_full_transposed, kmedoids_model.labels_),
)
print(
    "Davies Bouldin score for kmedoids model trial: ",
    davies_bouldin_score(df_full_transposed, kmedoids_model.labels_),
)

Silhouette score for kmedoids model trial:  0.19878950439797027
Davies Bouldin score for kmedoids model trial:  1.4815774755918005


In [56]:
dbscan_model = DBSCAN(eps=100)
dbscan_model.fit(df_full_transposed)
print(
    "Silhouette score for DBSCAN model trial: ",
    silhouette_score(df_full_transposed, dbscan_model.labels_),
)
print(
    "Davies Bouldin score for DBSCAN model trial: ",
    davies_bouldin_score(df_full_transposed, dbscan_model.labels_),
)

Silhouette score for DBSCAN model trial:  0.1092080688765895
Davies Bouldin score for DBSCAN model trial:  1.3627369508006577


In [39]:
hdbscan_model = HDBSCAN(
    cluster_selection_epsilon=0.5, metric="euclidean", min_samples=5
)
hdbscan_model.fit(df_full_transposed)
print(
    "Silhouette score for HDBSCAN model trial: ",
    silhouette_score(df_full_transposed, hdbscan_model.labels_),
)
print(
    "Davies Bouldin score for HDBSCAN model trial: ",
    davies_bouldin_score(df_full_transposed, hdbscan_model.labels_),
)

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)

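We need a fitness function to evaluate the results of each clustering model. A commonly used metric is the silhouette score, available as `silhouette_score` in `sklearn.metrics`. For each sample it compares the mean intra-cluster distance a with the mean nearest-cluster distance b as (b - a) / max(a, b), and the reported score is the mean over all samples. The best value is 1 and the worst value is -1. Alternatively, the Davies-Bouldin score (`davies_bouldin_score`) has a best value of 0, with lower values indicating better separation. Both metrics need at least two distinct labels, which is why the HDBSCAN trial above raises a `ValueError`.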
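
To make the silhouette calculation concrete, here is a small pure-Python sketch of the per-sample formula (b - a) / max(a, b) on made-up 2-D points. It is for illustration only; the notebook itself relies on `sklearn.metrics.silhouette_score`, and the function name `silhouette_mean` and the toy points are assumptions of this example.

```python
from math import dist


def silhouette_mean(points, labels):
    """Mean silhouette coefficient, (b - a) / max(a, b), averaged over all samples."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight pairs of points that are far apart from each other (made-up data)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_mean(points, [0, 0, 1, 1])  # labels follow the two pairs
bad = silhouette_mean(points, [0, 1, 0, 1])   # labels split each pair
print(good, bad)
```

The labelling that follows the natural grouping scores close to 1, while the labelling that splits each pair scores below 0, which is the behaviour the genetic algorithm's fitness function relies on.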

In [6]:
URL = "https://api.nationalgrideso.com/api/3/action/resource_search?query=name:historic"
response = requests.get(URL).json()
response

{'help': 'https://api.nationalgrideso.com/api/3/action/help_show?name=resource_search',
 'success': True,
 'result': {'count': 173,
  'results': [{'hash': '',
    'description': 'This file includes all historic GTMA (Grid Trade Master Agreement) trades between April 2015 to March 2016. This data file will be updated on a monthly basis. ',
    'format': 'CSV',
    'package_id': '8e511255-6cd7-41c0-868e-8d4d29bced0b',
    'mimetype_inner': None,
    'url_type': 'upload',
    'datastore_active': True,
    'id': 'bde33bcf-efc6-4e4a-bc7c-452b7424ea45',
    'size': None,
    'mimetype': 'text/csv',
    'cache_url': None,
    'name': 'Historic GTMA Trades Data - Apr 2015 - March 2016',
    'metadata_modified': '2023-02-09T03:22:07.364683',
    'created': '2020-12-09T16:49:50.738907',
    'url': 'https://api.nationalgrideso.com/dataset/8e511255-6cd7-41c0-868e-8d4d29bced0b/resource/bde33bcf-efc6-4e4a-bc7c-452b7424ea45/download/gtma_trades_2015-16.csv',
    'cache_last_updated': None,
    'state

In [6]:
URL = "https://api.nationalgrideso.com/api/3/action/resource_show?id=f93d1835-75bc-43e5-84ad-12472b180a98"
response = requests.get(URL).json()
response

{'help': 'https://api.nationalgrideso.com/api/3/action/help_show?name=resource_show',
 'success': True,
 'result': {'force': 'True',
  'cache_last_updated': None,
  'package_id': '88313ae5-94e4-4ddc-a790-593554d8c6b9',
  'datastore_active': True,
  'id': 'f93d1835-75bc-43e5-84ad-12472b180a98',
  'size': None,
  'metadata_modified': '2024-06-03T12:35:37.474140',
  'state': 'active',
  'hash': '',
  'description': 'Historic GB generation mix from the 1st of Jan 2009 through to today. Data points are either MW or %. ',
  'format': 'CSV',
  'mimetype_inner': None,
  'url_type': 'upload',
  'mimetype': 'text/csv',
  'cache_url': None,
  'name': 'Historic GB Generation Mix',
  'created': '2020-08-25T10:07:16.512192',
  'url': 'https://api.nationalgrideso.com/dataset/88313ae5-94e4-4ddc-a790-593554d8c6b9/resource/f93d1835-75bc-43e5-84ad-12472b180a98/download/df_fuel_ckan.csv',
  'datastore_append_or_update': True,
  'last_modified': '2024-06-03T12:35:37.894252',
  'position': 0,
  'revision_id