# Wind Farms Dataset Documentation

Version 0.91
___

## Overview of the Wind Farms Dataset Document

__[Part 1](#section_1). Introduction to Wind Farms Dataset (AGL)__

Describes the OSIsoft Academic Hub ([Part 1](#section_1)), the Wind Farms dataset, the time frame in which the dataset was compiled, and the role of wind power generation at AGL.

__[Part 2](#section_2). The Industry Challenges__

A short list of challenges that wind farm operators face, with links to customer presentations.

__[Part 3](#section_3). Data Stream Details__

Explains what data is collected and monitored for the wind turbines.  

__[Part 4](#section_4). Data Obfuscation and Quality__

Almost all industrial data contains sensitive business information, which is why data is usually obfuscated before it is shared. Part 4 details the two obfuscation steps applied to this dataset (turbine state and geo-location). Moreover, real-world data is "dirty": it contains errors and gaps, and students are expected to work with it. This part also describes the ways poor-quality data arises in this specific dataset and how it appears in dataframes.

__[Part 5](#section_5). Data Organization and Metadata__

All Academic Hub datasets are organized around a set of assets, each asset having at least a default data view. A data view is an OCS feature that extracts interpolated data from multiple related streams in a tabular form, ready for data-driven applications or any data analysis. Specialized data views (e.g. a subset of the default data view, or a data view with multiple related wind turbines) can also be defined.

## Part 1. Introduction to Wind Farms Dataset (from AGL)
---
<a id="section_1"></a>

    
OSIsoft Academic Hub is a cloud-based platform that supports data analytics in university curricula by providing a data infrastructure to host, aggregate, and analyze data. Students are exposed to real-world industrial data that illustrates some of the same engineering concepts taught in classrooms and labs.

OSIsoft Academic Hub hosts this real-world dataset, which covers a set of 50 wind turbines owned by AGL, the largest fully integrated energy and telecommunication company in Australia, with generation assets totalling over 11 GW of capacity, approximately 20% of the total generation capacity of Australia's National Energy Market. AGL has been an OSIsoft customer for more than 10 years and uses OSIsoft's technology to monitor, operate, and optimize its varied fleet of generation assets across multiple states. A presentation on AGL's use of the PI System can be found [here](https://resources.osisoft.com/presentations/agl-energy-s-real-time-data-journey-continues/). AGL kindly shared a substantial subset of its wind farms dataset for academic use under the terms of the [Academic Hub Subscription Agreement](https://resources.osisoft.com/uploadedFiles/Content_(New)/About_OSIsoft/Legal/AcademicHubAgreement%20(May%202020).pdf). Here is a summary of the dataset:

* 5 clusters of 10 wind turbines each (50 turbines total)
* 13 different sensors per turbine, the main categories being: 
    * Temperature: outside, drivetrain (3) and nacelle (all in °C)
    * Power: instant power sent to the grid, updated every 2 seconds (kW)
    * Speed: rotor (in RPM) and wind (m/s)
    * Angular: pitch, relative wind direction and yaw (degrees)
* 2 years of data covering the 2018-2019 period

[Part 3](#section_3) provides more details about the 13 data streams associated with each turbine.


## Part 2. Industry Challenges
---
<a id="section_2"></a>

Here is a short list of industry challenges, as explained in customer presentations:

* (EDF Renewables) Managing a large number of wind turbines to reduce revenue lost to unplanned failures: https://resources.osisoft.com/presentations/power-generation-qanda-session---when-turbines-go-bump-in-the-night---shedding-light-on-lost-revenue/
* (Statkraft) Enabling efficient maintenance with condition monitoring: https://resources.osisoft.com/presentations/condition-monitoring--enabling-efficient-maintenance--statkraftx/
* (Nergica) Predicting icing on wind turbine blades: https://resources.osisoft.com/presentations/icing-prediction-on-blade-wind-turbine-using-forecast-data/

## Part 3. Data Stream Details
<a id="section_3"></a>

Each sensor on a wind turbine generates a sequence of timestamped readings, i.e. a time series. Each such sequence is associated with a uniquely identifiable data stream on OCS and is stored efficiently by the [Sequential Data Store](https://ocs-docs.osisoft.com/Content_Portal/Concepts/Data_Storage_Concepts.html), a cloud-native streaming database.

As mentioned in Part 1, each wind turbine has a set of 13 streams, one per sensor. The table below summarizes them, including the unit of measure (UOM) of each stream. Each data stream is a time series, i.e. a series of timestamps with associated values. The first 12 data streams in the table return float values; only `State` reports non-numerical values.

### Table 1: Wind Turbine Data Streams

| Stream Name | UOM | Description | Note |
|:-------|:---:|:--------|:----:|
| Power to grid | kW | Instantaneous power generated | High resolution, about one event every 2 seconds |
| Ambient temp | °C | Outside temperature | Updated every 15 to 20 minutes |
| Pitch angle | degrees | Angle of the blade | From 22 up to 72 seconds between events |
| Rotor speed | RPM | Rotational speed | From 9 up to 14 minutes between events |
| Drivetrain temp IMSDE | °C | TBD | |
| Drivetrain temp IMSNDE | °C | TBD | |
| Drivetrain main bearing temp | °C | | |
| Nacelle temp | °C | | |
| Drivetrain vibration | m/s<sup>2</sup> | | |
| Wind relative direction | degrees | | Relative to the turbine direction |
| Wind speed | m/s | Measured wind velocity | |
| Yaw angle | degrees | Angle between the turbine rotor direction and the wind | |
| State | None | Functional status | Value is either `OK` or `TurbError` |
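
Because `Power to grid` is reported in kW, it is a power reading, and the energy delivered over a period has to be obtained by integrating the readings over time. The following is a minimal sketch of that calculation using the trapezoidal rule on synthetic readings. The function name `energy_kwh` and the sample values are illustrative assumptions, not part of the dataset or of the Hub library.

```python
import pandas as pd

def energy_kwh(power_kw: pd.Series) -> float:
    """Integrate power (kW) over a DatetimeIndex with the trapezoidal rule, returning kWh."""
    # Elapsed time between consecutive readings, in hours (first diff is NaN, so drop it)
    hours = power_kw.index.to_series().diff().dt.total_seconds().iloc[1:].to_numpy() / 3600.0
    values = power_kw.to_numpy(dtype=float)
    # Mean of each pair of neighbouring readings multiplied by the elapsed time
    return float(((values[:-1] + values[1:]) / 2.0 * hours).sum())

# Synthetic example: a constant 1800 kW sampled every 2 seconds for one hour
index = pd.date_range("2019-01-01 00:00:00", periods=1801, freq=pd.Timedelta(seconds=2))
power = pd.Series(1800.0, index=index)
print(round(energy_kwh(power), 3))
```

One hour at a constant 1800 kW corresponds to 1800 kWh, which is a convenient sanity check. Rows with missing power values would need to be handled before integrating.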

             
   

## Part 4. Data Obfuscation and Quality 
<a id="section_4"></a>

Before sharing their data with OSIsoft, AGL performed specific data transformations to prevent leaking sensitive business information. The main items obfuscated are the turbine states and geo-location.

### Turbine State Obfuscation

The `State` data stream is the result of a mapping from the original state data at AGL. Most of the time, turbines are in the `OK` state, indicating normal operation. The only other state is "not OK", denoted by `TurbError`, which could represent any bad state of the turbine, such as a shutdown or one of a variety of alarms.
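
For analysis it is often convenient to turn the two `State` values into a boolean flag. The sketch below does this on a small synthetic series (the readings are made up for illustration) and computes the share of readings in the `OK` state.

```python
import pandas as pd

# Synthetic State readings; the real stream only contains "OK" or "TurbError"
state = pd.Series(["OK", "OK", "TurbError", "OK", None, "OK"])

# True for normal operation, False for "TurbError";
# missing readings stay missing rather than being guessed
is_ok = state.map({"OK": True, "TurbError": False})

# Fraction of the available readings that are in the OK state
ok_fraction = is_ok.dropna().astype(bool).mean()
print(f"OK fraction: {ok_fraction:.2f}")
```

This is a fraction of readings, which only approximates a fraction of time when the rows are evenly spaced, as they are in interpolated data view results.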

### Geo-Location Obfuscation

Each wind turbine in this dataset has associated metadata. Unlike stream data, this information is static and does not change over time. `Latitude` and `Longitude` are in this category.

Latitudes and longitudes for this dataset have been replaced by pseudo-random coordinates in the following way:
* the coordinates are those of existing AGL turbines as found on Google Maps, but any match between a turbine and its true coordinates is a coincidence
* the coordinates respect clustering, so turbines of the same cluster are co-located when plotted on a map

### Data Quality 

Dirty data is a fact of life for sensor data related to industrial assets. The causes are numerous, including an error at the data source, propagation of poor-quality data through its use in calculations, manually entered data being miskeyed or mislabeled, and simply missing data.

With the current version of the Academic Hub Python library, data stream values are obtained through a data view associated with a given asset, i.e. a table of values where each row starts with a timestamp and each column holds the values of one data stream associated with the selected asset. Dirty values in this table are represented by a missing entry (an empty string, which the Pandas dataframe library interprets as NaN, Not a Number). Discarding incomplete rows is a standard technique to deal with such missing data.
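
As a minimal sketch of discarding incomplete rows, the example below builds a small synthetic dataframe standing in for a data view result and drops every row with a missing value. The column names and values are illustrative assumptions and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a data view result: one row per timestamp, one column per stream
df = pd.DataFrame(
    {
        "Timestamp": pd.date_range("2019-01-01", periods=4, freq=pd.Timedelta(minutes=10)),
        "Wind Speed": [7.2, np.nan, 8.1, 7.9],
        "Power To Grid": [950.0, 1010.0, np.nan, 1105.0],
    }
)

# Count missing entries per column before deciding how to handle them
print(df.isna().sum())

# Keep only the rows where every column has a value
clean = df.dropna()
print(f"{len(df) - len(clean)} of {len(df)} rows discarded")
```

Dropping rows is the simplest option. When too many rows would be lost, restricting the check to the columns actually used (the `subset` argument of `dropna`) keeps more data.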



## Part 5. Assets, Data Views and Metadata
---
<a id="section_5"></a>

Academic Hub datasets are structured around assets. For this dataset, all assets are wind turbines. Each asset has an Asset Identifier (Asset Id) and a description. The Asset Id has the format `clusterX.turbY`, for `X` in \[1,...,5\] and `Y` in \[1,...,10\]. This information is available through the Hub Python library (`ocs_academic_hub`) and is returned as a Pandas dataframe.

Each asset has a default data view, i.e. a structure which maps each data stream associated with the asset onto a column in a Pandas dataframe. Each data view has a Data view Id and exists within an OCS namespace, i.e. a named partition with its own Namespace Id. To request data for an asset, the following information should be provided:

* Namespace Id
* Data view Id
* Start index (timestamp)
* End index (timestamp)
* Interpolation interval

Reference for the start/end index and interpolation interval can be found at https://ocs-docs.osisoft.com/Content_Portal/Documentation/DataViews/GetDataViewData/Quick_Start_Get_Data_View_Data.html#index. 
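
To make the five parameters concrete, the sketch below collects them in a plain dictionary and computes the timestamps at which interpolated rows would fall. The dictionary keys are hypothetical names chosen for illustration, the `hh:mm:ss` interval format and the inclusion of both end points are assumptions to check against the reference above, and no call to the Hub library is made.

```python
import pandas as pd

# Hypothetical request parameters; the key names are illustrative only
request = {
    "namespace_id": "academic_hub_01",
    "dataview_id": "wind.farms_cluster3.turb2",
    "start_index": "2019-01-01T00:00",
    "end_index": "2019-01-01T01:00",
    "interval": "00:10:00",  # assumed hh:mm:ss format
}

# Timestamps of the expected rows, assuming both start and end are included
grid = pd.date_range(
    start=request["start_index"],
    end=request["end_index"],
    freq=pd.Timedelta(request["interval"]),
)
print(len(grid), "rows expected")
```

Estimating the row count this way before sending a request helps avoid asking for far more data than an analysis needs.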

For specific use cases, it is possible to define specialized data views having only a subset of the default data view columns. Specialized data views avoid overfetching data and reduce the time to get results back, especially for tables with a large number of rows. Most data views are for a single asset, but it is possible to define a data view spanning multiple assets. This avoids requesting multiple single-asset data views and stitching the results into a single Pandas dataframe for analysis.
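
For comparison, here is a sketch of the manual stitching that a multi-asset data view makes unnecessary. Two synthetic single-asset dataframes are tagged with their asset identifier and concatenated. The helper name `stitch`, the column names and the values are illustrative assumptions.

```python
import pandas as pd

def stitch(frames_by_asset):
    """Combine per-asset dataframes into one long dataframe with an Asset_Id column."""
    tagged = [df.assign(Asset_Id=asset_id) for asset_id, df in frames_by_asset.items()]
    return pd.concat(tagged, ignore_index=True)

# Synthetic single-asset results sharing the same timestamps
timestamps = pd.date_range("2019-01-01", periods=3, freq=pd.Timedelta(minutes=10))
frames = {
    "cluster1.turb1": pd.DataFrame({"Timestamp": timestamps, "Power To Grid": [900.0, 950.0, 1000.0]}),
    "cluster1.turb2": pd.DataFrame({"Timestamp": timestamps, "Power To Grid": [870.0, 910.0, 990.0]}),
}

combined = stitch(frames)
print(combined.shape)
```

The long format, with one row per timestamp and asset, works well with `groupby("Asset_Id")`. A wide format with one column per turbine could be obtained with `pivot` instead.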

Asset metadata is static information about the asset, e.g. the equipment manufacturer, installation date, etc.

In [1]:
# -------------------------------
# Code preamble for next sections
# -------------------------------
#
from ocs_academic_hub.datahub import hub_login
from IPython.display import display, Markdown
dataset = "Wind_Farms"
widget, hub = hub_login()
display(widget)
hub.refresh_datasets()
hub.set_dataset(dataset)
namespace_id = hub.namespace_of(dataset)
print(f"\n>>> Active dataset is {hub.current_dataset()} in namespace {namespace_id}")



>>> Active dataset is Wind_Farms in namespace academic_hub_01


## Wind Turbine Assets

The `assets` method returns a Pandas dataframe with `Asset_Id` as the first column and a short asset description as the second column.

In [2]:
#print(hub.assets().to_string(index=False))
print(f"Number of assets: {len(hub.assets())}")
hub.assets()

Number of assets: 50


| | Asset_Id | Description |
|:---:|:-------|:--------|
| 0 | cluster1.turb1 | Turbine |
| 1 | cluster1.turb10 | Turbine |
| 2 | cluster1.turb2 | Turbine |
| 3 | cluster1.turb3 | Turbine |
| 4 | cluster1.turb4 | Turbine |
| 5 | cluster1.turb5 | Turbine |
| 6 | cluster1.turb6 | Turbine |
| 7 | cluster1.turb7 | Turbine |
| 8 | cluster1.turb8 | Turbine |
| 9 | cluster1.turb9 | Turbine |
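
In the listing above, `cluster1.turb10` comes before `cluster1.turb2` because the identifiers are sorted as text. The sketch below parses the `clusterX.turbY` format into numbers so that assets can be sorted in numeric order. The helper `parse_asset_id` is a hypothetical name and is not part of the Hub library.

```python
import re

# Matches identifiers of the form clusterX.turbY and captures X and Y
ASSET_ID_PATTERN = re.compile(r"^cluster(\d+)\.turb(\d+)$")

def parse_asset_id(asset_id):
    """Return (cluster number, turbine number) for an identifier like 'cluster3.turb2'."""
    match = ASSET_ID_PATTERN.match(asset_id)
    if match is None:
        raise ValueError(f"Unexpected asset identifier: {asset_id!r}")
    return int(match.group(1)), int(match.group(2))

asset_ids = ["cluster1.turb10", "cluster1.turb2", "cluster2.turb1", "cluster1.turb1"]

# Sort by cluster number first, then by turbine number
print(sorted(asset_ids, key=parse_asset_id))
```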


## Available Data Views

All assets have at least a default data view, with an ID of the following format: `wind.farms_<asset_id>` where `<asset_id>` is an `Asset_Id` as returned by `hub.assets()`.

This dataset has no additional data views besides the default ones for the time being. 
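
Since default data view IDs follow a fixed pattern, they can be derived from asset identifiers without querying the Hub. The sketch below builds the IDs for all 50 turbines from the `clusterX.turbY` format; the helper `default_dataview_id` is a hypothetical name used for illustration only.

```python
def default_dataview_id(asset_id: str) -> str:
    """Build the default data view ID for an asset, following wind.farms_<asset_id>."""
    return f"wind.farms_{asset_id}"

# All asset identifiers: 5 clusters of 10 turbines each
asset_ids = [f"cluster{x}.turb{y}" for x in range(1, 6) for y in range(1, 11)]
dataview_ids = [default_dataview_id(asset_id) for asset_id in asset_ids]

print(len(dataview_ids), dataview_ids[0], dataview_ids[-1])
```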

Using the wind turbine `cluster3.turb2` as an example, the list of all available single-asset data views can be obtained with the method `asset_dataviews()`, as seen below:

In [3]:
dataview_ids = hub.asset_dataviews("cluster3.turb2")
dataview_ids

['wind.farms_cluster3.turb2']

The definition of each data view is given below (UOM == unit of measure):

In [4]:
for dv_id in dataview_ids:
    # Show the data view ID as a bold heading, followed by its definition table
    display(Markdown(f"**Data view ID: {dv_id}**"))
    display(hub.dataview_definition(namespace_id, dv_id))

**Data view ID: wind.farms_cluster3.turb2**

| | Asset_Id | Column_Name | Stream_Type | Stream_UOM | OCS_Stream_Name |
|:---:|:-------|:--------|:---:|:---:|:--------|
| 4 | cluster3.turb2 | Ambient Temperature | Float | °C | cluster3.turb2.temp_ambient |
| 5 | cluster3.turb2 | Drivetrain Gearbox Temp IMSDE | Float | °C | cluster3.turb2.temp_drivetrain_gearbox_IMSDE |
| 6 | cluster3.turb2 | Drivetrain Gearbox Temp IMSNDE | Float | °C | cluster3.turb2.temp_drivetrain_gearbox_IMSNDE |
| 7 | cluster3.turb2 | Drivetrain Mainbearing Temp | Float | °C | cluster3.turb2.temp_drivetrain_mainbearing |
| 9 | cluster3.turb2 | Drivetrain vibration | Float | m/s² | cluster3.turb2.vib_drive_train |
| 8 | cluster3.turb2 | Nacelle Temp | Float | °C | cluster3.turb2.temp_nacelle |
| 1 | cluster3.turb2 | Pitch Angle | Float | degrees | cluster3.turb2.pitch_angle |
| 2 | cluster3.turb2 | Power To Grid | Float | kW | cluster3.turb2.power_to_grid |
| 10 | cluster3.turb2 | Relative Wind Direction | Float | degrees | cluster3.turb2.wind_direction_relative |
| 3 | cluster3.turb2 | Rotor Speed | Float | RPM | cluster3.turb2.rotor_rpm |


## Multi-asset data views

Some analyses require the data of more than one asset. Adding the option `multiple_asset=True` returns all the data views with more than one asset. If the empty string `""` is given instead of an asset Id, the list of all multi-asset data views is returned.

There is currently one data view defined per cluster. 


In [5]:
dataview_ids = hub.asset_dataviews("", multiple_asset=True)
dataview_ids

['wind.farms_cluster1',
 'wind.farms_cluster2',
 'wind.farms_cluster3',
 'wind.farms_cluster4',
 'wind.farms_cluster5']

For more information on how to request interpolated data with a data view, please consult [this notebook](https://data.academic.osisoft.com/nbviewer/github/academic-hub/datasets/blob/master/Hub_Library_Quickstart.ipynb#Getting-data-from-a-Data-View).

## Asset Metadata


Each wind turbine asset has a set of metadata with the following fields:

* Asset_Id: the turbine asset identifier (from `hub.assets()`)
* Cluster: the cluster ID of the turbine (from 1 up to 5)
* ID: the ID of the turbine within the cluster (from 1 up to 10)
* Latitude and Longitude: pseudo-location respecting clustering
* Manufacturer and Model: name of the manufacturer and the model name

`hub.all_assets_metadata()` returns a dataframe with the above information, one row per turbine. For a specific turbine, use `hub.asset_metadata("cluster3.turb2")`.


In [6]:
display(hub.all_assets_metadata())

| | Cluster | ID | Latitude | Longitude | Manufacturer | Model | Asset_Id |
|:---:|:---:|:---:|:-------|:-------|:---:|:---:|:-------|
| 3 | 1 | 1 | -38.016069 | 142.139145 | | | cluster1.turb1 |
| 9 | 1 | 10 | -38.008225 | 142.172404 | | | cluster1.turb10 |
| 14 | 1 | 2 | -38.018385 | 142.143941 | | | cluster1.turb2 |
| 18 | 1 | 3 | -38.017802 | 142.149638 | | | cluster1.turb3 |
| 24 | 1 | 4 | -38.021606 | 142.153468 | | | cluster1.turb4 |
| 29 | 1 | 5 | -38.015985 | 142.152867 | | | cluster1.turb5 |
| 33 | 1 | 6 | -38.013043 | 142.156611 | | | cluster1.turb6 |
| 38 | 1 | 7 | -38.011429 | 142.160753 | | | cluster1.turb7 |
| 44 | 1 | 8 | -38.008259 | 142.162984 | | | cluster1.turb8 |
| 48 | 1 | 9 | -38.00825 | 142.167297 | | | cluster1.turb9 |


In [7]:
display(hub.asset_metadata("cluster3.turb2"))

{'Cluster': 3,
 'ID': '2',
 'Latitude': -35.113458,
 'Longitude': 137.719395,
 'Manufacturer': '',
 'Model': '',
 'Asset_Id': 'cluster3.turb2'}
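
Because the pseudo-locations respect clustering, the metadata can be grouped by cluster, for example to compute one map marker per wind farm. The sketch below uses a few rows copied from the outputs above rather than a live call to the Hub, and computes the mean position per cluster.

```python
import pandas as pd

# A few rows copied from the metadata outputs above
metadata = pd.DataFrame(
    {
        "Asset_Id": ["cluster1.turb1", "cluster1.turb2", "cluster3.turb2"],
        "Cluster": [1, 1, 3],
        "Latitude": [-38.016069, -38.018385, -35.113458],
        "Longitude": [142.139145, 142.143941, 137.719395],
    }
)

# Mean position per cluster, usable as a single marker for each farm
centroids = metadata.groupby("Cluster")[["Latitude", "Longitude"]].mean()
print(centroids)
```

With the full dataframe returned by `hub.all_assets_metadata()`, the same `groupby` would yield one centroid for each of the five clusters.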
