## Quality Check Diagnostic Work, Part A

This notebook illustrates some quality control steps that should be considered when analyzing a new dataset. In this example we'll use the `WindToolKitQualityControlDiagnosticSuite` class to automate some of the QC analysis for SCADA data.

The `WindToolKitQualityControlDiagnosticSuite` is a subclass of `QualityControlDiagnosticSuite` that adds methods for working with the NREL WindToolKit database on top of all the base QC methods.

In part A of this exercise, we will demonstrate the use of timezone-naive timestamps.

### Step 1: Load in Data

To load in the data, we can either preload the data, or pass in a full file path and have the QC class import the data file.

For this example we'll load in the data first, and remove the timezone data from the datetime stamp to demonstrate the process of uncovering the DST overlap in the data.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import pandas as pd

from operational_analysis.methods.quality_check_automation import WindToolKitQualityControlDiagnosticSuite as QC

In [None]:
scada_df = pd.read_csv('./data/la_haute_borne/la-haute-borne-data-2014-2015.csv')

# Keep only the date (characters 0-9) and the time (characters 11-18); the UTC offset is dropped
date = scada_df['Date_time'].str[0:10]
time = scada_df['Date_time'].str[11:19]
scada_df['datetime'] = pd.to_datetime(date + ' ' + time, format="%Y-%m-%d %H:%M:%S")

scada_df.set_index('datetime', inplace = True, drop = False)
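
As a minimal sketch of what the slicing above does, the example below uses two hypothetical timestamp strings (not taken from the SCADA file) and assumes an ISO 8601-style layout with a trailing UTC offset. Dropping the offset leaves timezone-naive values, so the step across the spring DST change looks longer than it really is.

```python
import pandas as pd

# Hypothetical raw strings; the layout (date, separator, time, offset) is an assumption
raw = pd.Series([
    "2014-03-30T01:50:00+01:00",
    "2014-03-30T03:00:00+02:00",
])

# Characters 0-9 hold the date and 11-18 the time; everything after is discarded
naive = pd.to_datetime(
    raw.str[0:10] + " " + raw.str[11:19], format="%Y-%m-%d %H:%M:%S"
)

print(naive.dt.tz)           # None, i.e. the timestamps are timezone-naive
print(naive.diff().iloc[1])  # apparent step between the two naive readings
```

With the offsets removed, the two readings appear 70 minutes apart even though the offsets in the strings imply they are only 10 minutes apart.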

In [None]:
scada_df.head()

In [None]:
scada_df.dtypes

### Step 2: Initializing QC and Performing the Run Method

Now that we have our dataset with the necessary columns and datatypes, we are ready to perform our quality check diagnostic. This analysis will not make the adjustments for us, but it will allow us to quickly flag some key irregularities that we need to manage before moving on.

To start, let's initialize a QC object, `qc`, and call its `run` method.

In [None]:
qc = QC(
    data=scada_df, 
    ws_field='Ws_avg', 
    power_field='P_avg',
    time_field='datetime', 
    id_field='Wind_turbine_name', 
    freq='10T', 
    lat_lon=(48.45, 5.586),
    # It is highly recommended to add the local timezone even if it may not be present
    local_tz="Europe/Paris",
    timezone_aware=False,  # We should indicate that the timezone in the data is unknown
    check_tz=False,  # True for WIND ToolKit-valid locations only, though will not break the code if outside
)

Below is what the updated DataFrame looks like after being read in and manipulated for the initial setup. Notice that there is now a UTC offset column, which determines the `True`/`False` value in the `is_dst` column, i.e., whether a particular timestamp falls in Daylight Saving Time (if the time zone observes it at all).

In the DataFrame below, the `datetime_utc` column should always remain in UTC and the `datetime_localized` column should always remain in local time. Conveniently, pandas provides two methods, `tz_convert()` and `tz_localize()`, to toggle back and forth between timezones; both operate on the index of the DataFrame. Note that the local time could itself be UTC, in which case the two columns would be redundant.
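
To illustrate the difference between the two pandas methods, here is a small sketch on a hypothetical two-row DataFrame (the values and the choice of instants are illustrative only). `tz_localize()` attaches a timezone to a naive index without changing the wall-clock values, while `tz_convert()` keeps the same instants and re-expresses them in another timezone.

```python
import pandas as pd

# Hypothetical naive index that we assume is already in UTC
idx = pd.DatetimeIndex(["2014-03-30 00:50", "2014-03-30 01:00"])
df = pd.DataFrame({"P_avg": [100.0, 110.0]}, index=idx)

# Attach a timezone: wall-clock values stay the same
utc = df.tz_localize("UTC")

# Convert: same instants, shown as Europe/Paris wall-clock time
local = utc.tz_convert("Europe/Paris")

print(utc.index)
print(local.index)
```

The two UTC readings are 10 minutes apart, but in Paris time they straddle the spring DST change and show up as 01:50 (+01:00) and 03:00 (+02:00).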

The localized time, even when the passed data is timezone-naive, is adjusted using the `local_tz` keyword argument to help normalize the time strings, from which a UTC-based timestamp is created (even when the local timezone is also UTC). By calculating the UTC time from the local time, we can identify DST shifts in the data and better assess any anomalies that may exist.
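
The idea of deriving a UTC timestamp, a UTC offset, and a DST flag from local time can be sketched with plain pandas. This is not the QC class's internal implementation, which is not shown here; the column names simply mirror those mentioned above and the two timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical naive local timestamps: one in winter, one in summer
local_naive = pd.DatetimeIndex(["2014-01-15 12:00", "2014-07-15 12:00"])

# Interpret the naive values as Europe/Paris wall-clock time, then derive UTC
localized = local_naive.tz_localize("Europe/Paris")
utc = localized.tz_convert("UTC")

df = pd.DataFrame({"datetime_localized": localized, "datetime_utc": utc})

# utcoffset() and dst() come from the standard datetime interface
df["utc_offset"] = [ts.utcoffset() for ts in localized]
df["is_dst"] = [bool(ts.dst()) for ts in localized]

print(df)
```

A change in `utc_offset` between consecutive rows marks a DST transition, which is what makes the UTC-derived column useful for spotting shifts.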

However, there may be cases where the timezone is not encoded (as in this example) or not known at all. In the former case, we can use the `local_tz` keyword argument shown in the code above. The latter case is much more difficult, and the default value of UTC may not be accurate. Here it is useful to try multiple timezones, such as that of the operating or owning company's headquarters or, more often, the wind farm's location, to find a best fit. For a US-based wind farm, the subclass `WindToolKitQualityControlDiagnosticSuite` can be used to help better match a timezone to the data provided.

In [None]:
qc._df.head()

In [None]:
qc._df.dtypes

In [None]:
qc.run()

### Step 3: Deep Dive with QC Diagnostic Results

Let's take a deeper look at the results of our QC diagnostic. 

#### Perform a general scan of the distributions for each numeric variable

In [None]:
qc.column_histograms()

#### Check ranges of each variable

In [None]:
qc._max_min

These values look fairly reasonable and consistent. 
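
For readers without the QC class at hand, a comparable range check can be sketched in plain pandas. The frame below is hypothetical, and the layout of the result is not claimed to match `qc._max_min`.

```python
import pandas as pd

# Hypothetical SCADA-like sample using the column names from this notebook
df = pd.DataFrame({
    "Wind_turbine_name": ["T1", "T1", "T2"],
    "Ws_avg": [3.2, 7.9, 12.4],
    "P_avg": [-1.5, 640.0, 2050.0],
})

# Minimum and maximum of every numeric column, one row per variable
ranges = df.select_dtypes("number").agg(["min", "max"]).T
print(ranges)
```

Values far outside the physically plausible range for a variable would be the first thing to look for in such a table.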

#### Identify any timestamp duplications and timestamp gaps

Duplications in October and gaps in March would suggest DST.

In [None]:
qc._time_duplications

In [None]:
qc._time_duplications_utc

In [None]:
qc._time_gaps

In [None]:
qc._time_gaps_utc

Based on the duplicated timestamps, it does seem like there is a DST correction in spring but no duplicate times in the fall. However, even with a UTC conversion, there still appear to be duplications in the data, so additional analysis is likely needed here. While the original inputs appear to have time gaps, this phenomenon shifts to the fall for the UTC-converted timestamps, likely because the original inputs lack the timezone encoding that the corrected timestamps have.
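
The duplicate and gap checks can be approximated with plain pandas, as in the sketch below. It assumes a single turbine and a fixed 10-minute frequency, and the timestamps are hypothetical; with several turbines the check would need to be applied per turbine ID.

```python
import pandas as pd

# Hypothetical 10-minute series with one repeated stamp and one missing hour
idx = pd.DatetimeIndex([
    "2014-03-30 01:40", "2014-03-30 01:50", "2014-03-30 01:50",
    "2014-03-30 03:00", "2014-03-30 03:10",
])

# Timestamps that occur more than once (second and later occurrences)
duplicates = idx[idx.duplicated()]

# Timestamps expected at a 10-minute frequency but absent from the data
expected = pd.date_range(idx.min(), idx.max(), freq="10min")
gaps = expected.difference(idx)

print(duplicates)
print(gaps)
```

Running the same two checks on both the original and the UTC-converted timestamps is what allows the kind of comparison made above.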

#### Check the DST plot to look in more detail

In [None]:
qc.daylight_savings_plot()

It appears that the original data (blue, solid line, labeled "Original Timestamp") has a time gap in the spring; however, when we compare it to the UTC timestamp (orange, dashed line), it is clear that there is in fact no gap in the data and that the DST transition has been encoded properly. On the other hand, the original timestamps at first appear to have no gaps in the fall, but the UTC timestamps reveal a 1-hour gap in the data in both 2014 and 2015. This is in line with our comparison of `qc._time_gaps` and `qc._time_gaps_utc` above, and further confirms our findings that there are duplicates in the spring and gaps in the fall.

Having both the original and a UTC-converted timestamp enables us to see any gaps that may appear when no timezone data is encoded. On the other hand, using the UTC-converted timestamp does not reduce the number of duplications present in the spring in this dataset, but it helps adjust for seemingly missing or available data. In tandem, we can see in the scatter points that there are still duplicates in the spring data just before the

The final question regarding datetime is whether we're in UTC or local time. Given the daylight saving gap, it's likely we're in local time. This is further confirmed by the raw datetime information in the SCADA file, which shows an offset of either +1h or +2h from UTC. So we are operating in local time, and the project import script for La Haute Borne should therefore shift the timestamps back to put them into UTC.
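
One possible way to perform that shift, assuming the raw strings carry their UTC offset, is to let pandas parse the offset and convert to UTC directly. This is only a sketch on hypothetical strings; it is not the code used in the actual project import script.

```python
import pandas as pd

# Hypothetical raw strings with +1h (standard time) and +2h (DST) offsets
raw = pd.Series([
    "2014-03-30T01:50:00+01:00",
    "2014-03-30T03:00:00+02:00",
])

# utc=True applies each string's offset and returns timezone-aware UTC values
utc = pd.to_datetime(raw, utc=True)

print(utc)
print(utc.diff().iloc[1])  # true elapsed time between the two readings
```

Once in UTC, the two readings are 10 minutes apart, so the apparent spring gap in local time disappears.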

#### Inspect the turbine power curves

Now that we have gathered some useful information about our timeseries, the one last check we may want to make is to inspect each turbine profile. We can look at each turbine's power curve and perform an initial scan for irregularities.

In [None]:
qc.plot_by_id('Ws_avg', 'P_avg')

Overall, these power curves look fairly typical, with some downtime, derating, and what look like a few erroneous data points.

### Step 4: Performing adjustments on our data

Recall that this notebook is only for diagnostic QC of plant data and does not actually change the data in the project import script. Any issues identified here should be incorporated into the project import script.

Note that the necessary corrections have already been applied to the project import script for this data.