In [None]:
import os
import timeit
import operator
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
import scipy.stats as st
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn import metrics
from collections import Counter
from scipy.spatial import distance
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import KernelDensity
import supporting_functions as sf
import hypercubes

%load_ext autoreload
%autoreload 2

# Layered Integration Approach for Multi-view Analysis of Real-world Datasets
In this notebook we validate the *Layered Integration Approach for Multi-view Analysis of Real-world Datasets*.

The proposed layered integration methodology is demonstrated on SCADA data from the wind turbine dataset from [Engie](https://opendata-renewables.engie.com/explore/index). This dataset contains 8 years of data from 4 wind turbines. It is an excellent example of a multi-source (multiple wind turbines) and multi-view (multiple subsets of parameters that differ in nature) dataset, which is exactly what the layered integration approach is designed for.


The initial aim of the studied experimental scenario is to identify and **characterise potentially different operating modes across the fleet**. Note that wind turbines can have several different operating modes, e.g. working at full speed, working at reduced speed in order to limit the noise burden on the surroundings, tailored production due to oversupply on the grid, and others. Subsequently, the ultimate goal is to **derive distinctive profiles of production performance** and establish an explicit link between those performance profiles and the characterised operating modes.


# 1 Overview of the Engie Dataset
The dataset should first be downloaded from [here](https://opendata-renewables.engie.com/explore/index), and the `DATA_DIR` path in the cell below should be changed to its location.

As one can see in the plots below, the 4th wind turbine has a large gap in its data between 2013 and 2014.

In [None]:
DATA_DIR='../../../Hymop_Engie/repository/hymop-Engie/data/'

df_engie = pd.read_csv(DATA_DIR + 'extended_clean_sensor_data.csv', parse_dates=[0])

print("Amount of entries in database: \t"+ str(df_engie.shape[0]))
print("Amount of wind turbines: \t"+str(len(df_engie.Wind_turbine_name.unique())))
print("Amount of features: \t\t"+ str(df_engie.shape[1]))

print("\nstart: \t"+str(min(df_engie.Date_time)))
print("end: \t"+str(max(df_engie.Date_time)))

In [None]:
g = sns.FacetGrid(df_engie.reset_index(), row="Wind_turbine_name", height=1.7, aspect=8);
g.map(sns.lineplot, "Date_time", "P_avg");

# 2 Data Preprocessing
## 2.1 Eliminating Correlated Parameters
A difficulty in this dataset is that some features are not clearly defined, which makes it difficult to see which parameters are redundant. To get more insight into whether a feature is derived from another, we check the correlation between all features:

In [None]:
plt.figure(figsize = (20,20))
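# numeric_only=True skips non-numeric columns such as Wind_turbine_name (required in pandas >= 2.0)
sns.heatmap(df_engie.corr(numeric_only=True), annot=True);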

Based on this correlation matrix, several groups of variables with a very high correlation are observed. We want to avoid this, because redundant variables cause an unbalanced focus in the clustering approach.

Below we list the features with the highest correlations and explain which ones are removed:
* DCs_avg (Generator converter speed), Ds_avg (Generator speed) and Rs_avg (Rotor speed)
    * The generator converter speed and generator speed are almost exactly equal, as shown in the table below.
    * Despite its lower values in rpm, the rotor speed is very strongly correlated with the other two features. A likely explanation is that they are connected to each other, but turn at different speeds because of the gear connection. The difference is an almost constant factor of around 100.
    * The gearbox bearing temperatures also show a high correlation with those 3 features. This is expected, since an increasing speed causes the temperature of the bearing to increase.
    * For this reason, we will **not use the generator converter speed and rotor speed** from now on.
* P_avg (Active power), S_avg (Apparent power), Cm_avg (Converter torque) and Rm_avg (Torque)
    * The apparent power and active power are almost equal.
    * The torque is the force which is finally transformed into active power, which explains the high correlation.
    * Since these features belong to two different views, we will only **remove the apparent power and the converter torque**.
* Ws1_avg, Ws2_avg and Ws_avg
    * As expected, these are also highly correlated.
    * We will **remove Ws1_avg and Ws2_avg**.
* Ya_avg (Nacelle angle) and Na_c_avg (corrected nacelle angle)
    * Evidently, we'll only use the corrected nacelle angle.

*Remark: After [normalization](#2.5-Normalize-the-parameters), one can also observe very similar distributions for the highly correlated variables.*
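Reading the highly correlated groups off a large annotated heatmap is error-prone, so the pairs can also be listed programmatically. The following is a minimal sketch, not part of the original notebook: the helper `highly_correlated_pairs`, the threshold of 0.95 and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.95):
    """Return (feature_a, feature_b, r) for every feature pair with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    names = corr.columns
    # Indices of the strict upper triangle, so that each pair is reported once.
    rows, cols = np.triu_indices(len(names), k=1)
    pairs = [
        (names[i], names[j], float(corr.iat[i, j]))
        for i, j in zip(rows, cols)
        if abs(corr.iat[i, j]) >= threshold
    ]
    # Strongest correlations first.
    pairs.sort(key=lambda t: abs(t[2]), reverse=True)
    return pairs


# Synthetic stand-in for the turbine data (not real measurements).
rng = np.random.default_rng(0)
speed = rng.uniform(0, 1800, size=500)
demo = pd.DataFrame({
    "Ds_avg": speed,
    "DCs_avg": speed + rng.normal(0, 1, size=500),  # near-duplicate of Ds_avg
    "Ot_avg": rng.normal(10, 5, size=500),          # unrelated feature
})
demo_pairs = highly_correlated_pairs(demo, threshold=0.95)
print(demo_pairs)
```

On the real data one would call the helper on `df_engie`; which threshold counts as "too correlated" remains a judgment call.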

## 2.2 Identification of the different in nature subset of parameters
Intrinsically, the wind turbine data contains three subsets of parameters:

- **Operational View**: sensor measurements about internal working of the turbine
    - Ba_avg:  Pitch angle
    - Db1t_avg: Generator bearing 1 temperature
    - Db2t_avg: Generator bearing 2 temperature
    - Dst_avg:  Generator stator temperature
    - Rbt_avg:  Rotor bearing temperature
    - Gb1t_avg: Gearbox bearing 1 temperature
    - Gb2t_avg: Gearbox bearing 2 temperature
    - Git_avg:  Gearbox inlet temperature
    - Gost_avg: Gearbox oil sump temperature
    - Yt_avg:   Nacelle temperature (assigned to the contextual features in the code below)
    - (Ya_avg:   Nacelle angle)
    - Na_c_avg: Nacelle angle corrected
    - (Rs_avg:   Rotor speed)
    - Rm_avg:   Torque
    - (DCs_avg: Generator converter speed)
    - Ds_avg:  Generator speed
    - (Cm_avg:  Converter torque)
    
- **Contextual View**: active power as a function of exogenous factors

    - (Ws1_avg:  First anemometer on the nacelle (wind speed))
    - (Ws2_avg:  Second anemometer on the nacelle (wind speed))
    - Ws_avg:   Average wind speed
    - Wa_avg:   Absolute wind direction
    - Wa_c_avg: Absolute wind direction corrected
    - Ot_avg:   Outdoor temperature
    - Rt_avg:   Hub temperature (the hub height is the distance from the ground to the center-line of the turbine rotor)
    - Nf_avg:   Grid frequency 

- **Output:**
    - P_avg:    Active power
    - Q_avg:   Reactive power
    - (S_avg:   Apparent power (Should be $ \sqrt{P^2 + Q^2}$ ) )
    - Cosphi_avg: Power factor (Should be $ \frac{P}{S}$ )
    
The features in parentheses are left out since they are highly correlated with other features in the same view.
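The relations stated for the output view explain why the apparent power is redundant: it is fully determined by the active and reactive power. A small sketch with made-up values (not taken from the dataset) illustrates both formulas:

```python
import numpy as np

# Hypothetical active power P [kW] and reactive power Q [kVAr] values.
P = np.array([300.0, 400.0, 1200.0])
Q = np.array([400.0, 300.0, 500.0])

S = np.hypot(P, Q)   # apparent power: sqrt(P^2 + Q^2)
cosphi = P / S       # power factor: P / S

print(S)
print(cosphi)
```

Comparing such derived values with the recorded `S_avg` and `Cosphi_avg` columns would be one way to verify the "should be" relations on the real data.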

Later on, the parameters of the operational view will be used in the 1st layer (individual analysis). In the 2nd layer (mediation analysis) we will use the parameters of the contextual view.


In [None]:
operational_features = ['Ba_avg', 'Db1t_avg', 'Db2t_avg', 'Dst_avg', 'Rbt_avg', 'Gb1t_avg', 'Gb2t_avg', 
                        'Git_avg', 'Gost_avg', 'Na_c_avg', 'Rm_avg', 'Ds_avg'] #, 'DCs_avg', 'Rs_avg', 'Cm_avg', 'Ya_avg'

contextual_features = ['P_avg', 'Q_avg', 'Cosphi_avg', 'Ws_avg', 
                       'Wa_avg', 'Wa_c_avg', 'Ot_avg', 'Yt_avg', 'Rt_avg', 'Nf_avg'] #, 'S_avg', 'Ws1_avg', 'Ws2_avg'

features_dict = {'Ba_avg':  'Pitch angle', 'Db1t_avg': 'Generator bearing 1 temperature', 'Db2t_avg': 'Generator bearing 2 temperature', 'Dst_avg':  'Generator stator temperature', 
                 'Rbt_avg':  'Rotor bearing temperature', 'Gb1t_avg': 'Gearbox bearing 1 temperature', 'Gb2t_avg': 'Gearbox bearing 2 temperature', 'Git_avg':  'Gearbox inlet temperature', 
                 'Gost_avg': 'Gearbox oil sump temperature', 'Yt_avg':   'Nacelle temperature', 'Ya_avg':   'Nacelle angle', 'Na_c_avg': 'Nacelle angle corrected', 'Rs_avg':   'Rotor speed', 
                 'Rm_avg':   'Torque', 'DCs_avg': 'Generator converter speed', 'Ds_avg':  'Generator speed', 'Cm_avg':  'Converter torque', 'P_avg':    'Active power', 
                 'Q_avg':   'Reactive power', 'S_avg':   'Apparent power', 'Cosphi_avg': 'Power factor', 'Ws1_avg':  'First anemometer on the nacelle', 
                 'Ws2_avg':  'Second anemometer on the nacelle', 'Ws_avg':   'Average wind speed', 'Wa_avg':   'Absolute wind direction', 'Wa_c_avg': 'Absolute wind direction corrected', 
                 'Ot_avg':   'Outdoor temperature', 'Rt_avg':   'Hub temperature' , 'Nf_avg':   'Grid frequency' }

## 2.3 Separate the dataset

To simplify the code in the following steps, we create a dictionary which contains a separate dataframe for each wind turbine.

In [None]:
turbines = df_engie['Wind_turbine_name'].unique()

dict_turbines_uncleaned = {}
for i, turbine in enumerate(turbines):
    dict_turbines_uncleaned[turbine] = df_engie[df_engie['Wind_turbine_name']==turbine].copy().sort_index()
    

## 2.4 Noise removal

In this cell we remove the timestamps which we consider to be noise. This is illustrated in the image below.

There are 4 methods we use to filter points:
- We remove points which have a negative active power or negative wind speed
- We remove points which have an uncommon wind speed-active power relation (by sparse cubes)
- We remove points which deviate too much from the expected active power on the **positive side** (hypercube approach)
- We remove points which deviate too much from the expected active power on the **negative side** (hypercube approach)
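The first two filters can be sketched in a few lines of pandas. This is only an illustration under assumptions: the function `remove_negative_and_sparse`, its bin widths and its `min_count` default are hypothetical and do not reproduce the internals of `sf.remove_noise`; the two deviation-based hypercube filters are not covered.

```python
import pandas as pd


def remove_negative_and_sparse(df, ws_bin=1.0, p_bin=100.0, min_count=2):
    """Drop rows with negative power or wind speed, then rows in sparse (wind speed, power) cells."""
    # Filter 1: physically implausible negative values.
    kept = df[(df["P_avg"] >= 0) & (df["Ws_avg"] >= 0)].copy()
    # Filter 2: bin wind speed and power on a regular grid and count the points per cell.
    ws_cell = (kept["Ws_avg"] // ws_bin).astype(int)
    p_cell = (kept["P_avg"] // p_bin).astype(int)
    cell_size = kept.groupby([ws_cell, p_cell])["P_avg"].transform("count")
    return kept[cell_size >= min_count]


# Made-up measurements, purely for illustration.
demo = pd.DataFrame({
    "Ws_avg": [5.1, 5.2, 5.3, -1.0, 5.4, 12.0],
    "P_avg":  [310.0, 320.0, 330.0, 300.0, -20.0, 150.0],
})
cleaned = remove_negative_and_sparse(demo)
print(cleaned)
```

In this toy example the two rows with a negative value are dropped first, and the isolated point at 12 m/s and 150 kW is dropped because it is alone in its cell.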



In [None]:
sns.set(font_scale=1)

dict_turbines = sf.remove_noise(dict_turbines_uncleaned, hypercubes, 
                                      remove_negative=True, min_cube_count=1, max_pos_deviation=500, max_neg_deviation=-1250)


## 2.5 Normalize the parameters

In order to give equal priority to each parameter, they need to be normalized.
The normalization is applied separately on each parameter and wind turbine by calculating the min-max score for each of the parameters.

The histograms of the min-max score results of each parameter from the first wind turbine are shown below.

In [None]:
sns.set(font_scale=0.8)

# Select parameters to normalize
parameters = list(df_engie.columns)
parameters.remove('Date_time')
parameters.remove('Wind_turbine_name')

# Make dictionary to separate the NORMALIZED data from each wind turbine:
dict_turbines_norm = {}

for i, turbine in enumerate(turbines):
    df_one_turbine = dict_turbines[turbine].copy()
    
    for p in parameters:
        df_one_turbine[p] = (df_one_turbine[p] - df_one_turbine[p].min())/(df_one_turbine[p].max() - df_one_turbine[p].min()) #min-max
        
    dict_turbines_norm[turbine] = df_one_turbine
    
# Plot the histograms of each feature from the First wind turbine
hist = dict_turbines_norm[turbines[0]].hist(figsize=(20,10), bins=20)
plt.tight_layout()


# 3. LAYER 1: Individual analysis
In the proposed layered integration approach, we start with the individual analysis. Here we cluster the timestamps of each of the 4 wind turbines separately, based on the subset of **operating parameters**. As shown in [Section 2.2](#2.2-Identification-of-the-different-in-nature-subset-of-parameters), these are 12 parameters. This results in turbine-dependent operating modes.


## 3.1 Determine the appropriate amount of clusters per dataset
We use the k-means approach to do each clustering. The initial centers are selected by the smart 'k-means++' approach. Since the k-means++ algorithm still uses a random initial start, we can (and do) repeat it 5 times. Only the best result is used to plot.

For each of the wind turbines, the amount of clusters is determined separately by multiple validation techniques: 
- **elbow method**: Here we plot the sum of the squared distances between each point and the centroid of its cluster. If the rate of decrease changes suddenly, the optimal number of clusters might be there.
- **connectivity**: Here we look at the 10 closest neighbours of each point and add a penalty if they are not in the same cluster. Again, we search for a knee in the curve.
- **silhouette**: The higher, the better. Since the silhouette algorithm is very slow, it is calculated based on only 5% of the data. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters visually.
- **Calinski Harabasz**: the higher the better
- **Davies Bouldin**: the lower the better
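Most of these measures are available directly in scikit-learn. The sketch below is not the implementation of `sf.plot_kmeans_validation`; it only shows, on synthetic blobs, how the inertia (for the elbow method), a sampled silhouette score, the Calinski-Harabasz score and the Davies-Bouldin score could be collected for a range of k. The connectivity measure is omitted, and the data and the range of k are assumptions.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Synthetic data with three well-separated blobs (a stand-in, not turbine data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(0, 0.3, size=(200, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=5, random_state=0).fit(X)
    scores[k] = {
        "inertia": km.inertia_,  # elbow method: look for a bend in this curve
        # Silhouette on a random subsample, because the full computation is slow.
        "silhouette": metrics.silhouette_score(X, km.labels_, sample_size=300, random_state=0),
        "calinski_harabasz": metrics.calinski_harabasz_score(X, km.labels_),  # higher is better
        "davies_bouldin": metrics.davies_bouldin_score(X, km.labels_),        # lower is better
    }

best_silhouette_k = max(scores, key=lambda k: scores[k]["silhouette"])
print(best_silhouette_k)
```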


In [None]:
# ATTENTION: this cell takes around 50 minutes to run!
start_time = timeit.default_timer()

for i, turbine in enumerate(turbines):
    print("Calculation of wind turbine " + str(i+1) + "...")
    df_One_norm = dict_turbines_norm[turbine].reset_index()
    
    sf.plot_kmeans_validation(df_One_norm.loc[:, operational_features], sample_size=0.1)

print("\t Elapsed time: "+ str( np.round((timeit.default_timer() - start_time)/60, 2)) + " minutes")

Deciding on the proper number of clusters is hard. We use majority voting over the results plotted above to decide the optimal number of clusters per wind turbine. This results in an optimum of 3, 4, 4 and 3 clusters, in the order of the plots.
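Majority voting over the validation measures can be sketched as follows. The votes are made up for illustration and are not the values read from the plots; the tie-breaking rule (prefer the smaller k) is likewise an assumption, since the notebook does not state one.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequently suggested k; ties go to the smaller k."""
    counts = Counter(votes.values())
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)


# Made-up votes for a single turbine, purely for illustration.
example_votes = {
    "elbow": 3,
    "connectivity": 4,
    "silhouette": 3,
    "calinski_harabasz": 3,
    "davies_bouldin": 4,
}
chosen_k = majority_vote(example_votes)
print(chosen_k)  # 3
```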


## 3.2 Compare power curve & torque curve of the obtained operating modes
Based on the optimal number of clusters for each dataset, we perform the final clustering of the timestamps. Below we plot the power curve, torque curve and number of points per cluster. Since it is interesting to see the clusters both in separate plots and in the same plot, both are shown below.


In [None]:
n_clusters = [3, 4, 4, 3]

# Define colors (which will be consistently used during this notebook)
colorpal = sns.color_palette("Set3", np.sum(n_clusters))
pd.set_option('mode.chained_assignment', None)

dict_turbines_norm = sf.do_individual_clustering(turbines, dict_turbines_norm,n_clusters, operational_features, dict_turbines, colorpal)


In [None]:
sf.visualize_individual_clustering_per_clusters(colorpal, n_clusters, dict_turbines, turbines)


## 3.3 Compare the difference of the parameters in the operating modes

In the cell below we plot for **one turbine** the distributions of each parameter for the clusters separately. This indicates what the main differences are between the operating modes.

In [None]:
# Select one turbine:
turbine_nr = 0 #Can be 0, 1, 2 or 3
turbine = turbines[turbine_nr]
df_One = dict_turbines[turbine]

color_index = sum(n_clusters[0:turbine_nr])
colors = colorpal[color_index:color_index+n_clusters[turbine_nr]]

for f, feature in enumerate(operational_features):
    g1 = sns.FacetGrid(df_One, col="Label", height=3, aspect = 1.5, hue='Label', palette=colors)
    g1.map(plt.hist, feature, linewidth=0, alpha=1, density=True, bins=20)
    g1.set_axis_labels(features_dict[feature], "Density")
    g1.set_titles("Turbine " + turbine + ", Cluster {col_name}")


Observations:
- Operating mode 1 differs mainly from the others in:
    - Generator stator temperature
    - Gearbox bearing 1 temperature
    - Gearbox bearing 2 temperature
    - Gearbox inlet temperature
    - Torque
- All operating modes differ clearly from the others in:
    - Nacelle angle corrected
    - Generator speed


## 3.4 Investigate the minimum and maximum value of each parameter in each cluster
Each cluster defines a range of allowable values for each operational parameter and thus generates a parametric characterisation of the operating mode. In this way, the pool of clusters produced for the fleet leads to the construction of a repository of operating modes, each one characterized by two vectors of low and high values (see Table below).
For each operating mode, the high and low ranges for each parameter in the operating mode can also be considered as allowable ranges for the corresponding parameter, provided the other parameters are within the other ranges specified for this mode.
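The low and high vectors amount to a per-cluster minimum and maximum of every operational parameter. A minimal pandas sketch, which does not claim to reproduce `sf.create_table_endogenous_view`, and which uses made-up values and only two parameters:

```python
import pandas as pd

# Tiny made-up example: two operating modes, two operational parameters.
demo = pd.DataFrame({
    "Label":  [0, 0, 0, 1, 1],
    "Ds_avg": [1000.0, 1100.0, 1050.0, 1700.0, 1750.0],
    "Rm_avg": [2000.0, 2500.0, 2200.0, 9000.0, 9500.0],
})

# One row per operating mode, one (min, max) column pair per parameter.
ranges = demo.groupby("Label")[["Ds_avg", "Rm_avg"]].agg(["min", "max"])
print(ranges)

# Allowable generator speed range of operating mode 0.
low = ranges.loc[0, ("Ds_avg", "min")]
high = ranges.loc[0, ("Ds_avg", "max")]
```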

In [None]:
table = sf.create_table_endogenous_view(turbines, dict_turbines, operational_features)
table


# 4. LAYER 2: Mediation Analysis 

In this layer, we pursue a way to derive an alternative representation of each operating mode in terms of expected performance. The richness of our multivariate data allows us to consider an alternative view for each cluster of timestamps generated in the previous layer.

To do this, we first divide the solution space of each operating mode into hypercubes with the dimensions wind speed, wind direction and ambient temperature. Next, we calculate the kernel density estimate (KDE) of the active power for each of the hypercubes. Finally, we obtain a profile for each operating mode by summing the weighted KDEs and normalizing the result.

## 4.1 Divide the space into hypercubes

We divide the timestamps in hypercubes based on wind speed, wind direction and ambient temperature. These are all [exogenous parameters](#2.2-Identification-of-the-different-in-nature-subset-of-parameters). Each hypercube has a size of 2 m/s (wind speeds), 72 degrees (wind direction) and 10 degrees (outside temperature). However, the method seems to be very robust to the hypercube size.
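Assigning a timestamp to a hypercube boils down to binning each of the three exogenous parameters. The sketch below assumes fixed bin edges that match the stated cube sizes; the wind speed and temperature ranges are assumptions, and the actual edges produced by `hypercubes.define_binning` may differ.

```python
import numpy as np


def cube_index(ws, wa, ot):
    """Map (wind speed, wind direction, outdoor temperature) to a tuple of bin indices."""
    # Assumed bin edges: bins of 2 m/s, 72 degrees and 10 degrees Celsius.
    ws_edges = np.arange(0, 22, 2)     # 0, 2, ..., 20 m/s
    wa_edges = np.arange(0, 432, 72)   # 0, 72, ..., 360 degrees
    ot_edges = np.arange(-10, 50, 10)  # -10, 0, ..., 40 degrees Celsius
    i = int(np.digitize(ws, ws_edges)) - 1
    j = int(np.digitize(wa % 360, wa_edges)) - 1  # wrap angles into [0, 360)
    k = int(np.digitize(ot, ot_edges)) - 1
    return i, j, k


example_cube = cube_index(7.3, 100.0, 12.5)
print(example_cube)  # (3, 1, 2)
```

The modulo operation is one simple way to handle the angular nature of the wind direction, so that for example 460 degrees falls into the same cube as 100 degrees.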

In [None]:
# Creating an empty Dataframe 
cluster_features = pd.DataFrame()

# Define the amount of splits per feature (for binning)
dict_features = {'Wa_c_avg': 6, 'Ws_avg':11, 'Ot_avg':6} 
ref_vars = list(dict_features.keys())
target_var = ['P_avg']
dat_types = ['angular','numerical','numerical']

# Define cubes for all 
_, cube_specs, _ = hypercubes.define_binning(dict_turbines[turbines[0]], ref_vars, dict_features)

# Give each hypercube an index number
cube_ids = sf.make_cube_ids(cube_specs, ref_vars)

for turbine in turbines:
    # Enrich dataframe with hypercube numbers
    dict_turbines[turbine],_ = hypercubes.define_cubes_in_df(dict_turbines[turbine].copy(), ref_vars, target_var[0], drop_stats=True, reset_idx=True,
                                       cube_specs=cube_specs, cube_ids=cube_ids, assign_nearest_cube=False,
                                       dat_types=dat_types)

Let us now remove all hypercubes which contain fewer than 10 points. Since these hypercubes are so sparse, the features we will derive later on within those hypercubes would not be reliable.

In the plot below one can see the effect of this filtering. Each operating mode loses around 30% of its hypercubes, while almost all points are still represented by a retained hypercube.
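The filtering itself is a count-and-select operation. A minimal sketch on made-up data, in which the column name `cube_id` and the reduced threshold are assumptions chosen to keep the example short:

```python
import pandas as pd

MIN_POINTS = 3  # the notebook uses 10; a small value keeps the toy example short

# Made-up timestamps with an already assigned hypercube id.
demo = pd.DataFrame({
    "cube_id": [0, 0, 0, 1, 2, 2, 2, 2],
    "P_avg":   [100, 120, 110, 900, 500, 520, 510, 505],
})

points_per_cube_demo = demo["cube_id"].value_counts()
dense_cubes = points_per_cube_demo[points_per_cube_demo >= MIN_POINTS].index
filtered = demo[demo["cube_id"].isin(dense_cubes)]

kept_cubes = len(dense_cubes) / points_per_cube_demo.size  # share of hypercubes kept
kept_points = len(filtered) / len(demo)                    # share of points kept
print(kept_cubes, kept_points)
```

Here one of three cubes is dropped while seven of eight points survive, which mirrors the pattern described above: many sparse cubes hold few points.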

In [None]:
MIN_POINTS_IN_CUBE = 10
df_cube_counts, cluster_labels = sf.visualize_sparse_hypercubes(MIN_POINTS_IN_CUBE, turbines, dict_turbines, n_clusters)


## 4.2 Calculate the kernel density estimations
Below, we estimate the density of the active power within each hypercube. This is estimated by the KDE approach where we use a Gaussian kernel and the Silverman rule of thumb to calculate the bandwidth. 
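For reference, a one-dimensional Gaussian KDE with a rule-of-thumb bandwidth can be written in plain NumPy. This is a sketch under assumptions: several variants of Silverman's rule exist, the one below uses $h = 0.9 \min(\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$, and `sf.calculate_KDE_per_hypercube` may use a different variant or library. The sample data is synthetic.

```python
import numpy as np


def silverman_bandwidth(x):
    """Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * spread * x.size ** (-0.2)


def gaussian_kde_1d(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    h = silverman_bandwidth(samples) if bandwidth is None else bandwidth
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2 * np.pi))


# Synthetic active power values of a single hypercube (not real measurements).
rng = np.random.default_rng(0)
power = rng.normal(1200, 150, size=400)
grid = np.linspace(0, 2500, 1001)  # 0, 2.5, 5, ..., 2500 kW
density = gaussian_kde_1d(power, grid)
area = density.sum() * (grid[1] - grid[0])  # should be close to 1
print(area)
```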


In [None]:
KDEs, points_per_cube = sf.calculate_KDE_per_hypercube(turbines, n_clusters, dict_turbines, MIN_POINTS_IN_CUBE)


## 4.3 Combine densities per cluster as an endogenous profile for each internal operating mode
Profiling is done as follows: we sum all the KDEs of the same operating mode, weighted by the number of points present in each hypercube, and normalize by the total number of (non-removed) points.

Below, one can see all original KDEs on the left (per operating mode), and the constructed performance profile (mixture probability distribution) on the right. One can already observe different patterns between operating modes of the same wind turbine, and similar patterns between operating modes of different wind turbines.
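In formula form, the profile of an operating mode is the mixture $f(p) = \sum_c \frac{n_c}{\sum_{c'} n_{c'}} f_c(p)$, where $f_c$ is the KDE of hypercube $c$ and $n_c$ its number of points. A short NumPy sketch follows, in which the function name `mixture_profile` and the two toy densities are assumptions for illustration only:

```python
import numpy as np


def mixture_profile(kdes, counts):
    """Weighted sum of per-hypercube densities; weights are the shares of points."""
    kdes = np.asarray(kdes, dtype=float)  # shape: (n_cubes, n_grid_points)
    weights = np.asarray(counts, dtype=float)
    weights = weights / weights.sum()     # normalise by the number of retained points
    return weights @ kdes


def normal_pdf(x, mu, sigma):
    """Gaussian density, used here only to fabricate two toy hypercube densities."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


grid = np.linspace(0, 2500, 1001)
step = grid[1] - grid[0]

# Two made-up hypercube densities and their point counts.
kdes = [normal_pdf(grid, 500, 80), normal_pdf(grid, 1800, 120)]
profile = mixture_profile(kdes, counts=[300, 100])
print(profile.sum() * step)  # close to 1: the mixture is again a density
```

Because the weights sum to one, the mixture integrates to one whenever the individual KDEs do, so the profiles of different operating modes are directly comparable.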

In [None]:
sum_densities = sf.construct_engdogenous_profiles(turbines, n_clusters, points_per_cube, KDEs, colorpal)


## 4.4 Calculate the pairwise distance
Below, one can see a heatmap of the distance matrix between the different mixture probability distributions. Distance is defined here as the Euclidean distance between the values of the mixture probability distributions evaluated at active powers of [0, 2.5, 5 ... 2500]. Once more, one can see some operating modes from different wind turbines with very small distances, meaning they might have a similar behavior (based on the exogenous parameters).

In [None]:
# Calculate distance matrix
cluster_distances = pairwise_distances(sum_densities, metric='euclidean')

# Visualise heatmap
plt.figure(figsize = (15,12))
OM_labels = ['OM '+str(i) for i in list(range(1, sum(n_clusters)+1))]
sns.heatmap(cluster_distances, annot=True, linewidths=1, square=True, xticklabels=OM_labels, yticklabels=OM_labels)
plt.yticks(rotation=0);

# 5. LAYER 3: Integration Analysis 

In this layer, we aim to combine the operational view (from [Layer 1](#3.-LAYER-1:-Individual-analysis)) with the exogenous view (from [Layer 2](#4.-LAYER-2:-Mediation-Analysis)). To do this, the obtained performance profiles (mixture distributions) per individual operating mode are pooled together and subjected to k-means clustering. This way we will obtain fleet-wide performance profiles, which we can trace back to the allowable ranges of turbine specific endogenous parameters (see [the table](#3.4-Investigate-the-minimum-and-maximum-value-of-each-parameter-in-each-cluster) which we constructed before).

## 5.1 Determine the appropriate amount of clusters

Here the optimal number of **fleet-wide** operating modes is investigated. We again use the Euclidean distance as distance measure. However, other measures such as the Jensen-Shannon distance might be used in the future.
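As an indication of what such an alternative could look like, the sketch below implements the Jensen-Shannon distance (base 2) between two discretised densities in NumPy. It is not used anywhere in this notebook, and the function name is an assumption.

```python
import numpy as np


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discretised densities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # renormalise so that both vectors sum to 1
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))  # guard against tiny negative rounding errors


print(jensen_shannon_distance([0.2, 0.8], [0.2, 0.8]))  # 0.0 for identical distributions
print(jensen_shannon_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 for disjoint distributions
```

Since `pairwise_distances` accepts a callable as `metric`, such a function could in principle replace `'euclidean'` when computing the distance matrix of the performance profiles. Note that k-means itself relies on Euclidean geometry, so a different clustering algorithm would be needed to use it for the clustering step.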

Following the majority voting system, the optimal amount of clusters (fleet-wide operating modes) is 3.

In [None]:
sf.plot_kmeans_validation(sum_densities)

## 5.2 Cluster the operating modes into fleet-wide operating modes, based on their individual performance profile
We do this again with k-means clustering. From the previous cell we know that k=3.

In [None]:
n_reduced_clusters = 3

# Do clustering:
kmeans = KMeans(n_reduced_clusters, init='k-means++', max_iter=300, n_init=10, random_state=0)
final_labels = kmeans.fit_predict(sum_densities) # Cluster the performance profiles (one row per operating mode)

# Map fleet-wide labels to letters:
OM_dict = dict({0:'A', 1:'B', 2:'C', 3:'C', 4:'C', 5:'D'})
final_labels = [OM_dict[label] for label in final_labels]

dictionary = dict(zip(df_cube_counts['cluster'], final_labels))

#### Show the power curve and torque curve for each of the fleet-wide operating modes
As we can see in the plots below, the operating modes differ clearly in both curves.

In [None]:
# Make one big dataframe with the fleet-wide operating mode labels in it
cluster_centers_reduced = pd.concat(dict_turbines, ignore_index=True)
cluster_centers_reduced["Operating_mode"] = cluster_centers_reduced.Wind_turbine_name + '_' + cluster_centers_reduced.Label.astype(str)
cluster_centers_reduced["FW_Operating_mode"] = cluster_centers_reduced["Operating_mode"].replace(dictionary)

# Plot the points within each cluster
sf.plot_power_and_torque(cluster_centers_reduced, colorpal)

### Show the points of the fleet-wide operating modes PER TURBINE

In [None]:
for name, dataset in dict_turbines.items():
    
    dataset["Operating_mode"] = dataset.Wind_turbine_name + '_' + dataset.Label.astype(str)
    dataset["FW_Operating_mode"] = dataset["Operating_mode"].replace(dictionary)
    
    # 1. Plot power curve
    g1 = sns.FacetGrid(dataset, col="FW_Operating_mode", height=3, aspect = 1.5, col_order=['A','B','C'])
    g1.map(plt.scatter, "Ws_avg", "P_avg", s=1,linewidth=0, alpha = 1)
    g1.set(xlim=(0, 20), ylim=(-100, 2100))
    g1.set_axis_labels( "Wind speed [m/s]", "Active power [kW]")
    g1.set_titles("Operating mode {col_name}")
    g1.fig.suptitle('Power curves of turbine '+name, y=1.02)
    g1.add_legend()
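    # `legendHandles` was renamed to `legend_handles` in newer matplotlib versions
    legend = g1._legend
    handles = legend.legend_handles if hasattr(legend, "legend_handles") else legend.legendHandles
    for lh in handles: # This makes the legend visible
        lh.set_alpha(1)
        lh._sizes = [50]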
    plt.tight_layout()


## 5.3 Calculate **fleet-wide** performance profiles of each fleet-wide operating mode

Now we calculate the **fleet-wide** performance profiles by summing the individual probability distributions of each operating mode. The mixture weights are computed as the number of points in the corresponding cluster from layer 1, normalized by the total number of points in the given fleet-wide cluster. The resulting, very distinctive fleet-wide performance profiles, denoted A, B and C, are depicted below.

*Remark: This is very similar to the [previous layer](#4.3-Combine-densities-per-cluster-as-an-endogenous-profile-for-each-internal-operating-mode), but one level higher.*

In [None]:
sf.construct_fleetwide_performance_profiles(n_reduced_clusters, final_labels, sum_densities, df_cube_counts, points_per_cube, colorpal)


Now each of the fleet-wide performance profiles can be traced back to a subset of individual operating modes ([this Table](#3.4-Investigate-the-minimum-and-maximum-value-of-each-parameter-in-each-cluster)), resulting in fleet-wide (composite) operating modes, which we also denote with A, B and C: A={1,3,5,6,8,9,12,13}; B={2,4,10,14}; C ={7,11}. It is interesting to observe that the composite operating mode linked to profile C can be traced back to only two of the four wind turbines. The derived fleet-wide (composite) operating modes, each associated with a very distinctive performance profile (see Figure above), can now be used to label the fleet data as follows: 

1. For each timestamp, consider the values of the 12 operational parameters.
2. Determine to which operating mode they can be assigned ([see this Table](#3.4-Investigate-the-minimum-and-maximum-value-of-each-parameter-in-each-cluster)). This is different per wind turbine.
3. Identify the fleet-wide (composite) operating mode to which the identified mode belongs.
4. Subsequently, assign the corresponding letter A, B, C or D (not seen) to the timestamp. 

In this way, each dataset per turbine can be converted into an A, B, C or D code (like a DNA sequence), which can be very insightful for monitoring purposes (e.g. long periods of B would signify optimal performance), but is also a powerful representation enabling more advanced applications, e.g.: 
- mining the fleet data for interesting patterns such as transitions between operating modes
- zooming in on periods with too many Ds
- training a predictor of expected production on historical data to be used to detect deviations during real-time operations.
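The four labelling steps, together with a simple count of transitions between letters, can be sketched as follows. Everything in this example is hypothetical: the ranges, the mapping to letters, the function names and the measurements are made up, only two of the 12 parameters are shown, and 'D' is interpreted here as "no known operating mode matches".

```python
# Hypothetical allowable ranges per turbine-specific operating mode
# (only two parameters shown; the notebook uses 12).
mode_ranges = {
    1: {"Ds_avg": (900.0, 1200.0), "Rm_avg": (1500.0, 4000.0)},
    2: {"Ds_avg": (1200.0, 1800.0), "Rm_avg": (4000.0, 11000.0)},
}
# Hypothetical mapping from turbine-specific modes to fleet-wide letters.
fleet_mode = {1: "A", 2: "B"}


def label_timestamp(measurement):
    """Return the fleet-wide letter for one timestamp, or 'D' if no mode matches."""
    for mode, ranges in mode_ranges.items():
        if all(low <= measurement[name] <= high for name, (low, high) in ranges.items()):
            return fleet_mode[mode]
    return "D"


def count_transitions(code):
    """Count how often consecutive letters differ, e.g. 'AAB' -> {('A', 'B'): 1}."""
    transitions = {}
    for a, b in zip(code, code[1:]):
        if a != b:
            transitions[(a, b)] = transitions.get((a, b), 0) + 1
    return transitions


# Made-up sequence of measurements for one turbine.
series = [
    {"Ds_avg": 1000.0, "Rm_avg": 2000.0},
    {"Ds_avg": 1100.0, "Rm_avg": 3000.0},
    {"Ds_avg": 1500.0, "Rm_avg": 8000.0},
    {"Ds_avg": 100.0, "Rm_avg": 50.0},
]
code = "".join(label_timestamp(m) for m in series)
print(code)                     # AABD
print(count_transitions(code))  # {('A', 'B'): 1, ('B', 'D'): 1}
```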

# 6. Conclusion and future work

We have proposed a novel data analysis approach that can be used for multi-view analysis and integration of heterogeneous real-world datasets originating from multiple sources. The validity and the potential of the proposed approach have been demonstrated on a real-world dataset of a fleet of wind turbines. The obtained results are very encouraging. The method is very efficient and robust in detecting characteristic operating modes across the fleet. Subsequently, distinctive performance profiles are derived and associated with each operating mode, which enable converting the fleet data into a powerful letter code suitable for more advanced mining.


For future work, we are interested to extend our research in the following directions: 
1. Further fine-tune the method, e.g. by using adaptive hypercube binning.
2. Test different experimental scenarios, e.g. comparing different time periods from the same wind turbine.
3. Consider additional validation use cases dealing with multi-source datasets, e.g. mobility or manufacturing data.
4. Further extend the method by exploiting the possibility to convert the fleet data into letter code.
