#ds. For the OpenOA examples, the metadata file `data/plant_meta.yml` will be used (with a JSON equivalent, `data/plant_meta.json`, for those who prefer JSON) to map the La Haute Borne fields to the OpenOA fields. This v3 update allows users to bring their data directly into a `PlantData` object and gives OpenOA a means of knowing which data fields are being used.

Below is a demonstration of loading a `PlantMetaData` object directly to show what data are
expected, though there is a `PlantMetaData.load()` method that can accept a dictionary or file path input for routinized workflows.

```python
metadata = PlantMetaData(
    latitude,  # float
    longitude,  # float
    scada,  # dictionary of column mappings and data frequency
    meter,  # dictionary of column mappings and data frequency
    tower,  # dictionary of column mappings and data frequency
    status,  # dictionary of column mappings and data frequency
    curtail,  # dictionary of column mappings and data frequency
    asset,  # dictionary of column mappings
    reanalysis,  # dictionary of each product's dictionary of column mappings and data frequency
)
```
For each of the data objects above, there is a corresponding metadata class to help guide users. For instance, `SCADAMetaData` (below) has pre-set attributes that supplement the docstrings and standard documentation. The other metadata objects are: `MeterMetaData`, `TowerMetaData`, `StatusMetaData`, `CurtailMetaData`, `AssetMetaData`, and `ReanalysisMetaData` (one is created for each product provided).

For example, each of the metadata classes allows inputs for the column mappings and timestamp frequency to enable the data validation steps outlined in the [process summary](#summarizing-the-qa-process-into-a-reproducible-workflow). However, to clarify the units and data types expected, each of the metadata classes contains the immutable attributes `units` and `dtypes`, as shown below for the `SCADAMetaData` class. These signal to users what units each data input should be in when passed, and what type the data should be convertible to if it is not already in that format. Some examples of acceptable formats are string-encoded floats and string-encoded timestamps, both of which can be converted automatically in the initialization steps.

In [1]:
from pprint import pprint
from openoa.schema import SCADAMetaData

scada_meta = SCADAMetaData()  # no inputs means use the default, internal mappings
print("Expected units for each column in the SCADA data:")
pprint(scada_meta.units)
print()
print("Expected data types for each column in the SCADA data:")
pprint(scada_meta.dtypes)



Expected units for each column in the SCADA data:
{'WMET_EnvTmp': 'C',
 'WMET_HorWdDir': 'deg',
 'WMET_HorWdDirRel': 'deg',
 'WMET_HorWdSpd': 'm/s',
 'WROT_BlPthAngVal': 'deg',
 'WTUR_SupWh': 'kWh',
 'WTUR_TurSt': None,
 'WTUR_W': 'kW',
 'asset_id': None,
 'time': 'datetim64[ns]'}

Expected data types for each column in the SCADA data:
{'WMET_EnvTmp': <class 'float'>,
 'WMET_HorWdDir': <class 'float'>,
 'WMET_HorWdDirRel': <class 'float'>,
 'WMET_HorWdSpd': <class 'float'>,
 'WROT_BlPthAngVal': <class 'float'>,
 'WTUR_SupWh': <class 'float'>,
 'WTUR_TurSt': <class 'str'>,
 'WTUR_W': <class 'float'>,
 'asset_id': <class 'str'>,
 'time': <class 'numpy.datetime64'>}

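As a rough illustration of what the column mappings and `dtypes` accomplish together, the sketch below renames one raw record to the OpenOA convention and coerces its string-encoded values. It uses only the standard library and is not OpenOA's implementation; the pairing of La Haute Borne names with OpenOA names, the `to_openoa` helper, and the sample values are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical mapping of OpenOA column names to the user's column names
# (an assumption for illustration; data/plant_meta.yml holds the real mapping)
column_map = {
    "time": "Date_time",
    "asset_id": "Wind_turbine_name",
    "WTUR_W": "P_avg",
    "WMET_HorWdSpd": "Ws_avg",
}

# Target type for each OpenOA column, mirroring the idea behind `dtypes`
converters = {
    "time": datetime.fromisoformat,
    "asset_id": str,
    "WTUR_W": float,
    "WMET_HorWdSpd": float,
}

# A made-up raw record in which every value is string-encoded
raw_record = {
    "Date_time": "2014-01-01 00:10:00",
    "Wind_turbine_name": "R80736",
    "P_avg": "512.5",
    "Ws_avg": "6.2",
}


def to_openoa(record: dict) -> dict:
    """Rename the user's columns to OpenOA names and coerce each value."""
    return {
        openoa_name: converters[openoa_name](record[user_name])
        for openoa_name, user_name in column_map.items()
    }


converted = to_openoa(raw_record)
print(converted)
```

A value that cannot be coerced (for example, a non-numeric string in `P_avg`) would raise an error here, which is the kind of problem the validation on loading is meant to surface early.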

Below is a demonstration of loading a `PlantData` object directly, though there are class methods for loading from file or an ENTR warehouse.
```python
plant = PlantData(
    metadata,  # PlantMetaData, dictionary, or file
    analysis_type,  # list of analysis types expected to be performed, "all", or None
    scada,  # None, DataFrame or CSV file path
    meter,  # None, DataFrame or CSV file path
    tower,  # None, DataFrame or CSV file path
    status,  # None, DataFrame or CSV file path
    curtail,  # None, DataFrame or CSV file path
    asset,  # None, DataFrame or CSV file path
    reanalysis,  # None, dictionary of DataFrames or CSV file paths with the name of the product for keys
)
```

On loading, the data will be validated automatically according to the `analysis_type` input(s) provided to ensure columns exist with the expected names, data types are correct, and data frequencies are of a sufficient resolution. However, while all errors in this process are caught, only those of concern to an `analysis_type` are raised, with two exceptions: `"all"` raises any error found, and `None` ignores all errors.
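
One way to picture this "catch everything, raise selectively" behavior is the sketch below. It is not OpenOA's implementation: the `requirements` table, the analysis names in it, the `errors_to_raise` helper, and the error messages are all hypothetical and only mirror the rules described above.

```python
# Hypothetical table of which data sets each analysis type depends on
requirements = {
    "analysis_a": {"scada", "meter"},
    "analysis_b": {"scada", "tower"},
}


def errors_to_raise(found, analysis_type, requirements):
    """Select which of the caught errors should be raised."""
    if analysis_type is None:
        return {}  # None: every caught error is ignored
    types = [analysis_type] if isinstance(analysis_type, str) else list(analysis_type)
    if "all" in types:
        return dict(found)  # "all": every caught error is raised
    needed = set()
    for name in types:
        needed |= requirements[name]
    # Otherwise, keep only the errors in data sets the analyses depend on
    return {data: msg for data, msg in found.items() if data in needed}


# Made-up errors caught during validation, keyed by data set
found = {"scada": "missing column", "tower": "no data provided"}

print(errors_to_raise(found, None, requirements))
print(errors_to_raise(found, "analysis_a", requirements))
print(errors_to_raise(found, "all", requirements))
```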

## Imports

In [2]:
from pprint import pprint

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

from openoa import PlantData
from openoa.utils import qa
from openoa.utils import plot

import project_ENGIE

# Avoid clipping data previews unnecessarily
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

## QA'ing ENGIE's open data set

ENGIE provides access to the data of its 'La Haute Borne' wind farm through https://opendata-renewables.engie.com and through an API. The data gives users the opportunity to work with real-world operational data. 

The series of notebooks in the 'examples' folder uses SCADA data downloaded from https://opendata-renewables.engie.com, saved in the `examples/data` folder. Additional plant level meter, availability, and curtailment data were synthesized based on the SCADA data.

The data used throughout these examples are pre-processed appropriately for the issues described in the subsequent sections, and synthesized into a routinized format in the `examples/project_ENGIE.py` Python script.

**Note**: This demonstration is centered around a specific data set, so other methods for working with data are not featured here. Please see the API documentation for further data checking and manipulation methods.

### Step 1: Load the SCADA data

First we'll need to unzip the data, and read the SCADA data to a pandas `DataFrame` so we can take a look at the data before we can start working with it. Here the `project_ENGIE.extract_data()` method is used to unzip the data folder because this demonstration is based on working with the ENGIE provided data without any preprocessing steps taken.
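
For readers curious about what an unzip helper of this kind might involve, here is a minimal standard-library sketch. It is not the `project_ENGIE.extract_data()` implementation; the `extract_if_needed` name, its skip-if-present behavior, and the throwaway archive are assumptions for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_if_needed(zip_path: Path, target_dir: Path) -> bool:
    """Unzip `zip_path` into `target_dir` unless the folder already exists."""
    if target_dir.exists():
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return True


# Demonstrate with a throwaway archive so the sketch is self-contained
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "example.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("scada.csv", "Date_time,P_avg\n")

    first = extract_if_needed(zip_path, Path(tmp) / "example")   # extracts
    second = extract_if_needed(zip_path, Path(tmp) / "example")  # skips
    extracted = (Path(tmp) / "example" / "scada.csv").exists()

print(first, second, extracted)
```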

In [None]:
data_path = "data/la_haute_borne"
project_ENGIE.extract_data(data_path)

scada_df = pd.read_csv(f"{data_path}/la-haute-borne-data-2014-2015.csv")

scada_df.head(10)

The timestamps in the column `Date_time` show that we have timezone information encoded, and that the data have a 10 minute frequency to them (or "10min" according to the pandas guidance: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases).

To demonstrate the breadth of data that the QA methods are intended to handle, this demonstration will step through the data using the current format, and an alternative where the timezone data has been stripped out.
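
To make the two variants concrete, the sketch below parses a few made-up timestamps with the standard library and checks both the timezone awareness and the sampling interval. The ISO 8601 layout with a trailing UTC offset is an assumption about the raw strings, and the values themselves are invented.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up samples in an ISO 8601 layout with a UTC offset (assumed format)
raw = [
    "2014-01-01T01:00:00+01:00",
    "2014-01-01T01:10:00+01:00",
    "2014-01-01T01:20:00+01:00",
    "2014-01-01T01:30:00+01:00",
]

# Timezone-aware parsing keeps the encoded offset
aware = [datetime.fromisoformat(s) for s in raw]

# Keeping only the first 19 characters gives the timezone-unaware variant
naive = [datetime.fromisoformat(s[:19]) for s in raw]

# The most common step between consecutive samples is the data frequency
steps = Counter(later - earlier for earlier, later in zip(aware, aware[1:]))
frequency = steps.most_common(1)[0][0]

print(aware[0].utcoffset(), naive[0].tzinfo, frequency)
```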

In [None]:
scada_df_tz = scada_df.loc[:, :].copy()  # timezone aware
scada_df_no_tz = scada_df.loc[:, :].copy()  # timezone unaware

# Remove the timezone information from the timezone unaware example dataframe
scada_df_no_tz.Date_time = scada_df_no_tz.Date_time.str[:19].str.replace("T", " ")

# Show the resulting change
scada_df_no_tz.head()

Below, we can see the data types for each of the columns. We should note that the timestamps are not correctly encoded, but are considered as objects at this time.

In [None]:
scada_df_tz.dtypes

In [None]:
scada_df_no_tz.dtypes

### Step 2: Convert the timestamps to proper timestamp data objects

Using the `qa.convert_datetime_column()` method, we can convert the timestamp data accordingly and insert the UTC-encoded data as an index for both the timezone aware, and timezone unaware data sets.

Under the hood this method does a few helpful items to create the resulting data set:
1) Converts the column "Date_time" to a datetime object
2) Creates the new datetime columns: "Date_time_localized" and "Date_time_utc" for the localized and UTC-encoded datetime objects
3) Sets the UTC timestamp as the index
4) Creates the column "utc_offset" containing the difference between the UTC timestamp and the localized timestamp that will be used to determine if the timestamp is in DST or not.
5) Creates the column "is_dst" indicating if the timestamps are in DST (`True`), or not (`False`) that will be used later when trying to assess time gaps and duplications in the data

Notice in the resulting data that the data type of the column "Date_time" is successfully made into a localized timestamp in the timezone aware example, but is kept as a non-localized timestamp in the unaware example.

In the below, the "Date_time_utc" column should always remain in UTC time and the "Date_time_localized" column should always remain in the localized time. Conveniently, Pandas provides two methods `tz_convert()` and `tz_localize()` to toggle back and forth between timezones, which will operate on the index of the DataFrame. It is worth noting that the local time could also be UTC, in which case the two columns would be redundant.

The localized time, even when the passed data is unaware, is adjusted using the `local_tz` keyword argument to help normalize the time strings, from which a UTC-based timestamp is created (even when local is also UTC). By calculating the UTC time from the local time, we are able to ascertain DST shifts in the data, and better assess any anomalies that may exist.

However, there may be cases where the timezone is neither encoded (the unaware example), nor known. In the former, we can use the `local_tz` keyword argument that is seen in the code below, but for the latter, this is much more difficult, and the default value of UTC may not be accurate. In this latter case it is useful to try multiple timezones, such as that of the operating or owning company's headquarters or, often, the wind farm's location, to find a best fit.

<div class="alert alert-block alert-info">
<b>Note:</b> 
In the case of US-based wind power plants, the "qa.wtk_xx()" methods, such as "qa.wtk_diurnal_prep()" and "qa.wtk_diurnal_plot()", can be used for working with NREL's WIND Toolkit for further data checking, validation, and plotting.
</div>
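
The conversion steps listed earlier in this step can be sketched with the standard library alone. This is only an illustration of the idea (attach a local timezone, derive the UTC time, the UTC offset, and a DST flag) and not the code behind `qa.convert_datetime_column()`; the `localize_and_convert` helper and the two sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def localize_and_convert(naive: datetime, local_tz: str) -> dict:
    """Attach a local timezone to a naive timestamp and derive UTC fields."""
    localized = naive.replace(tzinfo=ZoneInfo(local_tz))
    utc = localized.astimezone(timezone.utc)
    return {
        "localized": localized,
        "utc": utc,
        "utc_offset": localized.utcoffset(),
        # A non-zero DST adjustment means the timestamp falls in DST
        "is_dst": localized.dst() != timedelta(0),
    }


# Made-up timestamps: one in winter and one in summer
winter = localize_and_convert(datetime(2014, 1, 15, 12, 0), "Europe/Paris")
summer = localize_and_convert(datetime(2014, 7, 15, 12, 0), "Europe/Paris")

print(winter["utc"], winter["utc_offset"], winter["is_dst"])
print(summer["utc"], summer["utc_offset"], summer["is_dst"])
```

Because the UTC time is derived from the local time, the offset changes with the season, which is what makes DST shifts visible in the later checks.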

In [None]:
scada_df_tz = qa.convert_datetime_column(
    df=scada_df_tz,
    time_col="Date_time",
    local_tz="Europe/Paris",
    tz_aware=True,  # Indicate that we can use encoded data to convert between timezones
)
scada_df_tz.head()

In [None]:
print(scada_df_tz.index.dtype)
scada_df_tz.dtypes

In [None]:
scada_df_no_tz = qa.convert_datetime_column(
    df=scada_df_no_tz,
    time_col="Date_time",
    local_tz="Europe/Paris",
    tz_aware=False  # Indicates that we're going to need to make inferences about encoding the timezones
)
scada_df_no_tz.head()

In [None]:
print(scada_df_no_tz.index.dtype)
scada_df_no_tz.dtypes

### Step 3: Dive into the data

The `describe` method, which is a thin wrapper for the pandas method, shows us the distribution of each of the numeric and time-based data columns. Notice that both descriptions are equal, with the exception of the UTC offset, because they are the same data set.

In [None]:
no_tz = qa.describe(scada_df_no_tz)
no_tz = no_tz.loc[~no_tz.index.isin(["Date_time"])]  # Ignore the Date_time column that is not shared between the dataframes
col_order = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]  # Ensure description columns are in the same order
qa.describe(scada_df_tz)[col_order] == no_tz[col_order]

In [None]:
qa.describe(scada_df_tz)

#### Inspecting the distributions of each column of numerical data

While `column_histograms` is not part of the QA module, it is helpful for reviewing the independent distributions of data within a dataset. Aligning with the table version above, we can see that some distributions, such as "Ws_avg", don't have any outliers, whereas others, such as "Ot_avg", do and have very narrow histograms to accommodate this behavior.

In [None]:
plot.column_histograms(scada_df_tz)

It appears that there are a number of highly frequent values in these distributions, so we can dive into that further to see if we have some unresponsive sensors, in which case the data will need to be invalidated for later analysis. In the below analysis of repeated behaviors in the data, it seems that we should be flagging potentially unresponsive sensors (see Step 5 for more details).

In [None]:
# Only check data for a single turbine to avoid any spurious findings
single_turbine_df = scada_df_tz.loc[scada_df_tz.Wind_turbine_name == "R80736"].copy()

# Flag readings that differ from the previous reading (False marks a repeated value)
ix_consecutive = single_turbine_df.Va_avg.diff(1) != 0

# Determine how many consecutive occurrences there are for various thresholds, starting with 2 repeats
consecutive_counts = {i + 1: (ix_consecutive.rolling(i).sum() == 0).sum() for i in range(1, 10)}

# Plot the distribution of N occurrences for each threshold
plt.bar(consecutive_counts.keys(), consecutive_counts.values(), zorder=10)
plt.grid(zorder=0)
plt.xticks(range(2, 10))
plt.xlim((1, 10))
plt.ylim(0, 100)
plt.ylabel("Number of Consecutive Repeats")
plt.xlabel("Threshold for Consecutive Repeats")
plt.show()

It's evident in the above distribution that this sensor appears to be operating adequately and won't need to have any data flagged for unresponsiveness. However, in the below example, we can see that the temperature data potentially contain faulty readings and should therefore be flagged.

In [None]:
# Flag readings that differ from the previous reading (False marks a repeated value)
ix_consecutive = single_turbine_df.Ot_avg.diff(1) != 0

# Determine how many consecutive occurrences there are for various thresholds, starting with 2 repeats
consecutive_counts = {i + 1: (ix_consecutive.rolling(i).sum() == 0).sum() for i in range(1, 10)}

# Plot the distribution of N occurrences for each threshold
plt.bar(consecutive_counts.keys(), consecutive_counts.values(), zorder=10)
plt.grid(zorder=0)
plt.xlim((1, 10))
plt.ylim(0, 5500)
plt.xticks(range(2, 10))
plt.yticks(range(0, 5501, 500))
plt.ylabel("Number of Consecutive Repeats")
plt.xlabel("Threshold for Consecutive Repeats")
plt.show()
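
The rolling-sum counts used in the two cells above can be cross-checked with a run-length view of the same idea. The sketch below is a standard-library illustration on made-up readings, with hypothetical helper names; for data without missing values it should give the same counts as the rolling approach.

```python
from itertools import groupby


def run_lengths(values):
    """Return the length of each run of identical consecutive values."""
    return [len(list(group)) for _, group in groupby(values)]


def count_repeats(values, n_repeats):
    """Count the windows holding `n_repeats` identical consecutive readings."""
    return sum(max(0, run - n_repeats + 1) for run in run_lengths(values))


readings = [1.0, 1.0, 1.0, 2.0, 3.0, 3.0, 4.0]  # made-up sensor readings

# Thresholds of 2, 3, and 4 repeated readings
counts = {n: count_repeats(readings, n) for n in range(2, 5)}

print(run_lengths(readings))
print(counts)
```

A sensor with many long runs at high thresholds, as seen for the temperature data, is a candidate for being flagged as unresponsive.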

#### Checking the power curve distributions

While not contained in the QA module, the `plot_by_id` method is helpful for quickly assessing the quality of our operational power curves.

In [None]:
plot.plot_by_id(
    df=scada_df_no_tz,
    id_col="Wind_turbine_name",
    x_axis="Ws_avg",
    y_axis="P_avg",
)

### Step 4: Inspecting the timestamps for DST gaps and duplications

Now, we can get the duplicate timestamps from each of the data sets, according to each of the original, localized, and UTC time data. This will help us to compare the effects of DST and timezone encoding.

In the below timezone-unaware data, we can see that there is a significant deviation between the local timestamps and the UTC timestamps, especially around the end of March in both 2014 and 2015, suggesting that there is something missing with the DST data.

#### Timezone-unaware data

First we'll look only at the data without any timezone encoding, then compare the results to the data where we kept the timezone data encoded to confirm what modifications need to be made to the data. Notice that the UTC-converted data show duplications at roughly the same time each year, when the spring-time European DST shift occurs, which likely indicates that the original timestamps are missing the data needed to properly shift the duplicates.

In [None]:
dup_orig_no_tz, dup_local_no_tz, dup_utc_no_tz = qa.duplicate_time_identification(
    df=scada_df_no_tz,
    time_col="Date_time",
    id_col="Wind_turbine_name"
)
dup_orig_no_tz.size, dup_local_no_tz.size, dup_utc_no_tz.size

In [None]:
dup_utc_no_tz
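
Conceptually, a duplicate here is a timestamp that occurs more than once for the same asset, since different turbines legitimately share timestamps. The sketch below shows that idea with the standard library; it is not the logic of `qa.duplicate_time_identification()`, and the records, the second asset id, and the `duplicated` helper are made up.

```python
from collections import Counter
from datetime import datetime

# Made-up (asset id, timestamp) pairs; one reading for R80736 is repeated
records = [
    ("R80736", datetime(2014, 3, 30, 1, 40)),
    ("R80736", datetime(2014, 3, 30, 1, 50)),
    ("R80736", datetime(2014, 3, 30, 1, 50)),
    ("turbine_b", datetime(2014, 3, 30, 1, 50)),  # hypothetical second asset
]


def duplicated(records):
    """Return the (asset id, timestamp) pairs that occur more than once."""
    counts = Counter(records)
    return sorted(key for key, n in counts.items() if n > 1)


duplicates = duplicated(records)
print(duplicates)
```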

To help confirm there are DST corrections needed in the data, we can also take a look at the gaps in the timestamps, particularly in October. At a quick glance, the timezone unaware UTC encoding seems to create gaps in the data, likely accounting for the DST shift in the fall.

Based on the duplicated timestamps in the original data, it does seem like there is a DST correction in spring but no duplicate times in the fall. However, even with a UTC conversion, there still appear to be duplications in the data, so there is likely additional analysis needed here. While it appears that there are time gaps in the data for the original inputs, this phenomenon switches seasons to the fall for the UTC-converted timestamps, likely due to the lack of timezone encoding in the original inputs compared to a corrected timestamp.

In [None]:
gap_orig_no_tz, gap_local_no_tz, gap_utc_no_tz = qa.gap_time_identification(
    df=scada_df_no_tz,
    time_col="Date_time",
    freq="10min"
)
gap_orig_no_tz.size, gap_local_no_tz.size, gap_utc_no_tz.size

In [None]:
gap_orig_no_tz

In [None]:
gap_utc_no_tz

Below, we can observe the effects of not having timezones encoded, and what that might mean for potential analyses. In the unaware data, it appears that the original data (blue, solid line, labeled "Original Timestamp") has a time gap in the spring; however, when we compare it to the UTC timestamp (orange, dashed line), it is clear that there is not in fact any gap in the data, and the DST transition has been encoded properly in the data. On the other hand, it at first appears that there are no gaps in the fall when we make the same comparison, but when looking at the UTC timestamps, we can see that there is a 1 hour gap in the data for both 2014 and 2015. This is in line with our comparison of the original and UTC time gaps above, and further confirms our findings that there are duplicates in the spring and gaps in the fall.

Having both the original data and a UTC-converted timestamp enables us to see any gaps that may appear when there is no timezone data encoded. On the other hand, using the UTC-converted timestamp does not reduce the number of duplications in this dataset that are present in the spring, but helps adjust for seemingly missing or available data. In tandem, we can see in the scatter points that there are still duplicates in the spring data just before the DST switch.
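
The spring behavior described above, an apparent gap in local time that disappears in UTC, can be reproduced on a handful of made-up timestamps. The sketch below uses only the standard library; the `missing_steps` helper is hypothetical and is not how `qa.gap_time_identification()` is implemented.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

STEP = timedelta(minutes=10)


def missing_steps(stamps, step=STEP):
    """Return the expected timestamps absent between the first and last."""
    present = set(stamps)
    missing, current, last = [], min(stamps), max(stamps)
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Made-up local readings around the 2014 spring transition in Europe/Paris:
# the wall clock jumps from 02:00 to 03:00, so 02:00-02:50 never occurs
before = [datetime(2014, 3, 30, 1, 0) + i * STEP for i in range(6)]
after = [datetime(2014, 3, 30, 3, 0) + i * STEP for i in range(7)]
local_naive = before + after

# Localize to Europe/Paris, then convert to UTC
paris = ZoneInfo("Europe/Paris")
as_utc = [t.replace(tzinfo=paris).astimezone(timezone.utc) for t in local_naive]

print(len(missing_steps(local_naive)), len(missing_steps(as_utc)))
```

The naive local series appears to be missing an hour of data, while the UTC series is continuous, which is why the UTC view is the more reliable one for judging whether data are actually absent.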

In [None]:
# Timezone Unaware
qa.daylight_savings_plot(
    df=scada_df_no_tz,
    local_tz="Europe/Paris",
    id_col="Wind_turbine_name",
    time_col="Date_time",
    power_col="P_avg",
    freq="10min",
    hour_window=3  # default value
)

#### Timezone-aware data

We see a similar finding for the timezone-aware data, below, for both the number of duplications and gaps, likely confirming our hunches from above.

In [None]:
dup_orig_tz, dup_local_tz, dup_utc_tz = qa.duplicate_time_identification(
    df=scada_df_tz,
    time_col="Date_time",
    id_col="Wind_turbine_name"
)

In [None]:
dup_orig_tz.size, dup_local_tz.size, dup_utc_tz.size

In [None]:
dup_utc_tz

In [None]:
gap_orig_tz, gap_local_tz, gap_utc_tz = qa.gap_time_identification(
    df=scada_df_tz,
    time_col="Date_time",
    freq="10min"
)
gap_orig_tz.size, gap_local_tz.size, gap_utc_tz.size

In [None]:
gap_utc_tz

Again, we see a high degree of similarity between the two examples, and so can confirm that we have some duplicated data in the spring unrelated to the DST shift, and some missing data in the fall likely due to the DST shift. Additionally, we can confirm that the Europe/Paris timezone is in fact the encoding of our original data, and should therefore be converted to UTC for later analyses.

In [None]:
# Timezone Aware
qa.daylight_savings_plot(
    df=scada_df_tz,
    local_tz="Europe/Paris",
    id_col="Wind_turbine_name",
    time_col="Date_time",
    power_col="P_avg",
    freq="10min",
    hour_window=3  # default value
)

## Summarizing the QA process into a reproducible workflow

The following description summarizes the steps taken to successfully import the ENGIE SCADA data based on the above analysis, which are implemented in the `project_ENGIE.prepare()` method. It should be noted that this method is cleaned up to provide users with an easy-to-follow example; it could also be contained in an analysis notebook, stand-alone script, etc., as long as it is able to feed into `PlantData` at the end of it.

1. From [Step 2](#step-2-convert-the-timestamps-to-proper-timestamp-data-objects) and [Step 4](#step-4-inspecting-the-timestamps-for-dst-gaps-and-duplications) we found that the data is in local time and should be converted to UTC for clarity in the timestamps. This corresponds with line `project_ENGIE.py:79`.
2. Additionally from [Step 4](#step-4-inspecting-the-timestamps-for-dst-gaps-and-duplications), it was clear that duplicated timestamp data will need to be removed, corresponding to line `project_ENGIE.py:82`.
3. In [Step 3](#step-3-dive-into-the-data), there is an oversized range for the temperature data, so this data will be invalidated, corresponding to line `project_ENGIE.py:86`.
4. In [Step 3](#step-3-dive-into-the-data), the wind vane direction ("Va_avg") and temperature ("Ot_avg") fields seemed to have a large number of duplicated data that were identified, so these data are flagged and invalidated, which corresponds to lines `project_ENGIE.py:88-102`.
5. Finally, in [Step 3](#step-3-dive-into-the-data), it also should be noted that the pitch direction ranges from 0 to 360 degrees, and this will be corrected to the range of [-180, 180], which corresponds to lines `project_ENGIE.py:105-107`.
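
The pitch correction in the final item can be sketched as follows, assuming the readings lie in [0, 360] degrees. This is one possible formulation with a hypothetical helper name and made-up values, and is not necessarily the code used in `project_ENGIE.py`.

```python
def wrap_to_180(angle_deg: float) -> float:
    """Shift angles above 180 degrees down by a full turn."""
    return angle_deg - 360.0 if angle_deg > 180.0 else angle_deg


pitch_readings = [0.0, 90.0, 180.0, 270.0, 350.0]  # made-up values in [0, 360]
wrapped = [wrap_to_180(angle) for angle in pitch_readings]

print(wrapped)
```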

The remainder of the data do not need modification aside from additional variable calculations (see `project_ENGIE.py` for more details) and the aforementioned timestamp conversions.

## `PlantData` demonstration

In `project_ENGIE.prepare()` there are two methods to return the data: by dataframe (`return_value="dataframes"`) and by `PlantData` (`return_value="plantdata"`), which are both demonstrated below.

For the dataframe return selection, below it is also demonstrated how to load the dataframes into a `PlantData` object. A couple of things to notice about the creation of the v3 `PlantData` object:
- `metadata`: This field is what maps the OpenOA column convention to the user's column naming convention (`PlantData.update_column_names(to_original=True)` enables users to remap the data back to their original naming convention), in addition to a few other plant metadata objects.
- `analysis_type`: This field controls how the data will be validated, if at all, based on the analysis requirements defined in `openoa.plant.ANALYSIS_REQUIREMENTS`.


In [None]:
scada_df, meter_df, curtail_df, asset_df, reanalysis_dict = project_ENGIE.prepare(
    path="data/la_haute_borne",
    return_value="dataframes",
    use_cleansed=False,
)

engie = PlantData(
    analysis_type=None,  # No validation desired at this point in time
    metadata="data/plant_meta.yml",
    scada=scada_df,
    meter=meter_df,
    curtail=curtail_df,
    asset=asset_df,
    reanalysis=reanalysis_dict,
)

Below is a summary of the `PlantData` and `PlantMetaData` object specifications, both of which are new or effectively new as of version 3.0.

### `PlantData` documentation

In [None]:
print(PlantData.__doc__)

### `PlantMetaData` documentation

In [None]:
from openoa.plant import PlantMetaData
print(PlantMetaData.__doc__)

### `PlantData` validation

Because `PlantData` is an attrs dataclass, it supports automatic conversion and validation of variables, which enables users to change analysis types as they work through their data. Below is a demonstration of what happens when we add an invalid analysis type (which is rejected), add the "MonteCarloAEP" analysis type (which passes the validation), and then further add in the "all" analysis type (which fails).
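
The convert-and-validate-on-assignment behavior can be pictured with a plain Python property, as in the sketch below. This stand-in does not use attrs and is not how `PlantData` is written; the `Plant` class, the `VALID_TYPES` set, and the error message are hypothetical.

```python
VALID_TYPES = {"all", "MonteCarloAEP"}  # hypothetical subset of accepted names


class Plant:
    """Minimal stand-in that converts and validates `analysis_type` on assignment."""

    def __init__(self, analysis_type=None):
        self.analysis_type = analysis_type  # routed through the setter below

    @property
    def analysis_type(self):
        return self._analysis_type

    @analysis_type.setter
    def analysis_type(self, value):
        # Convert a single string or None into a list
        values = [value] if isinstance(value, str) or value is None else list(value)
        # Reject any name that is not recognized
        invalid = [v for v in values if v is not None and v not in VALID_TYPES]
        if invalid:
            raise ValueError(f"Invalid analysis type(s): {invalid}")
        self._analysis_type = values


plant = Plant()
plant.analysis_type = "MonteCarloAEP"  # stored as a one-element list

try:
    plant.analysis_type = "MonteCarlo"  # rejected, so the previous value is kept
except ValueError as error:
    message = str(error)

print(plant.analysis_type, message)
```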

In [None]:
try:
    engie.analysis_type = "MonteCarlo"
except ValueError as e:
    print(e)

In [None]:
engie.analysis_type = "MonteCarloAEP"
engie.validate()

Notice that in the above cell, the data validates successfully for the MonteCarloAEP analysis, but below, when we append the `"all"` type, the validation fails. The below failure is because when `"all"` is input to `analysis_type`, it checks all of the data, and not just the analysis categories. In this case, there are no inputs to `tower` and `status`, so each of the checks will fail for all of the required columns by the metadata classifications.

In [None]:
engie.analysis_type.append("all")
print(f"The analysis types now include both all and MonteCarloAEP: {engie.analysis_type}")
try:
    engie.validate()
except ValueError as e: # Catch the error message so that the whole notebook can run
    print(e)

Below, the ENGIE data has been re-validated for a "MonteCarloAEP" analysis, so that it can be saved for later, and reloaded as the cleaned up version for easier importing.

In [None]:
engie.analysis_type = "MonteCarloAEP"
engie.validate()

In [None]:
data_path = "data/cleansed"
engie.to_csv(save_path=data_path, with_openoa_col_names=True)

In [None]:
engie_clean = PlantData(
    metadata=f"{data_path}/metadata.yml",
    scada=f"{data_path}/scada.csv",
    meter=f"{data_path}/meter.csv",
    curtail=f"{data_path}/curtail.csv",
    asset=f"{data_path}/asset.csv",
    reanalysis={
        "era5": f"{data_path}/reanalysis_era5.csv",
        "merra2": f"{data_path}/reanalysis_merra2.csv"
    },
)

### `PlantData` string and markdown representations

Below, we can see a summary of all the data contained in the `PlantData` object, `engie_clean`.

In [None]:
print(engie_clean)

While this is really great, it is a bit more catered towards a terminal output, and so we provide a Jupyter-friendly Markdown representation as well, as can be seen below. Alternatively, `engie_clean.markdown()` can be called to ensure the data are correctly displayed in their markdown format.

**NOTE**: the output displayed here that is viewable on the documentation site does not display the markdown correctly, so please take a look at the actual notebook on GitHub or in your own Jupyter session.

In [None]:
engie_clean