<center>
<h1>The Full Machine Learning Lifecycle - How to Use Machine Learning in Production (MLOps)</h1>
<hr>
<h2>Exploratory Data Analysis</h2>
<hr>
 </center>

# Introduction
We start our end-to-end ML project with an exploratory data analysis (EDA) to become familiar with the data and to look for patterns that may be useful for our classification task. In this notebook, we examine the structure of the dataset, aspects of data quality, the distribution of the data, and correlations between features. Let's get started...

In [None]:
import pandas as pd
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
import seaborn as sns
sys.path.insert(1, os.path.join(sys.path[0], '..'))
sys.path.append('/cd4ml/plugins/')

from cd4ml.data_processing import get_data

sns.set()
%matplotlib inline

# Getting the data

In [None]:
# paths and variables
_raw_data_dir = '/data/batch1'   
_data_dir = '/data/' 

In [None]:
# get the data from storage
df = get_data(_raw_data_dir)

# Structure of the dataset
Let's start our EDA by looking at the structure of the dataset.

In [None]:
df.shape

As we can see, the dataset contains 52383 rows and 16 feature columns.

The table below lists for each feature a short description, alongside its unit and data type.

| Column name        | Description                                                                           | Unit | Data Type |
|:-------------------|:--------------------------------------------------------------------------------------|:-----|:----------|
| wt_sk              | Unique device identifier, equivalent to device name                                   | -    | float     |
| measured_at        | Data timestamp in UTC format                                                          | -    | string    |
| wind_speed         | Average apparent wind speed measured by nacelle anemometer, normalised to rated value | m/s  | float     |
| power              | Average measured power production, normalised to rated max power                      | W    | float     |
| nacelle_direction  | Average position of nacelle relative to North (E=90°)                                 | °    | float     |
| wind_direction     | Average direction of incoming wind relative to North (E=90°)                          | °    | float     |
| rotor_speed        | Average revolutions per minute of the low speed rotor, normalised to rated RS         | -    | float     |
| generator_speed    | Average revolutions per minute of the generator, normalised to rated GS               | -    | float     |
| temp_environment   | Average outside temperature at nacelle height                                         | °C   | float     |
| temp_hydraulic_oil | Average oil temperature                                                               | °C   | float     |
| temp_gear_bearing  | Average gear temperature                                                              | °C   | float     |
| cosphi             | Average power factor of device                                                        | -    | float     |
| blade_angle_avg    | Average pitching angles, averaged over blades                                         | °    | float     |
| hydraulic_pressure | Average pressure in hydraulic circuit                                                 | mBar | float     |
| subtraction        | Error flag (NaN: no error, 0/1: error)                                                | -    | float     |
| categories_sk      | Categorisation of error type                                                          | -    | float     |


In [None]:
df.dtypes

Of the 16 features, 15 are numerical. The only non-numerical column is the column `measured_at`. Let's first have a look at this one.

### Non-numerical features

In [None]:
df['measured_at'].describe()

The feature `measured_at` represents the timestamp in UTC format corresponding to the measurement time. The number of unique entries is lower than the total count as the dataset contains entries for multiple wind turbines, recorded at the same time.
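The explanation above can be checked directly. The following is a minimal sketch on a made-up toy frame (the values and the two-turbine setup are assumptions, not taken from the dataset). It parses the timestamps, counts how many turbines report at each timestamp, and verifies that a (turbine, timestamp) pair identifies a row uniquely.

```python
import pandas as pd

# Hypothetical toy data: two turbines reporting at the same two timestamps.
toy = pd.DataFrame({
    "wt_sk": [1, 2, 1, 2],
    "measured_at": ["2020-01-01 00:00:00", "2020-01-01 00:00:00",
                    "2020-01-01 00:10:00", "2020-01-01 00:10:00"],
})

# Parse the strings into timezone-aware UTC timestamps.
toy["measured_at"] = pd.to_datetime(toy["measured_at"], utc=True)

# Number of distinct turbines reporting at each timestamp.
turbines_per_timestamp = toy.groupby("measured_at")["wt_sk"].nunique()
print(turbines_per_timestamp)

# A (turbine, timestamp) pair should not occur more than once.
n_repeated_pairs = int(toy.duplicated(subset=["wt_sk", "measured_at"]).sum())
print("Repeated (wt_sk, measured_at) pairs:", n_repeated_pairs)
```

On the real data, the same two calls could be run on `df` before the timestamp column is dropped.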

Since we want to do row-level predictions, we will remove the timestamps from the analysis.

In [None]:
df.drop("measured_at", axis=1, inplace=True)

### Numerical features

Now, we will turn our attention to the numerical features. More specifically, we will look at the number of unique values for each feature.

In [None]:
vals_unique = df.select_dtypes(include="number").nunique().sort_values()
print(vals_unique)

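We observe that `subtraction` is a binary feature. `wt_sk` and `categories_sk` are also discrete; because they encode a device identifier and an error category, they are better treated as nominal categories than as ordinal values. The remaining features are continuous. We thus have 3 categorical features and 12 continuous features.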

The feature `categories_sk` describes the type of error code of the wind turbine. This is the target that we want to predict. The feature `subtraction` indicates whether an error code was received. Since this information is already contained in `categories_sk`, we will exclude `subtraction` from further analysis.
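Before dropping a column as redundant, it can be worth confirming the redundancy. Here is a minimal sketch on a made-up toy frame (all values are hypothetical). It only checks that the flag is present exactly when a category is present; it does not examine whether the 0/1 values themselves carry additional information.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN means "no error" in both columns.
toy = pd.DataFrame({
    "subtraction":   [np.nan, 0.0, 1.0, np.nan, 1.0],
    "categories_sk": [np.nan, 3.0, 7.0, np.nan, 7.0],
})

flag_present = toy["subtraction"].notna()
category_present = toy["categories_sk"].notna()

# Cross-tabulate presence of the flag against presence of a category.
presence_table = pd.crosstab(flag_present, category_present)
print(presence_table)

# The flag adds no presence information if both masks agree on every row.
flag_is_redundant = bool((flag_present == category_present).all())
print("Flag is redundant:", flag_is_redundant)
```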

In [None]:
df.drop("subtraction", axis=1, inplace=True)

# Data quality
Now, we will have a look at the quality of the dataset.

### Duplicate rows

In [None]:
rows_duplicate = df.duplicated().sum()
print("Number of duplicate rows:", rows_duplicate)

Great. No duplicates!

### Missing values

In [None]:
df.isna().mean().sort_values()

Most of the features are complete, but fewer than 5% of the entries in `categories_sk` are non-NaN. The reason is that normal operation (i.e. no error) is encoded as `NaN`.

In [None]:
df["categories_sk"].value_counts(dropna=False)
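Because `NaN` carries meaning here, one option for later processing is to turn it into an explicit class. The sketch below is only an illustration: the sentinel value `NO_ERROR = 0` and the toy codes are assumptions, and this notebook does not prescribe that encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy target: NaN encodes normal operation.
categories = pd.Series([np.nan, np.nan, 3.0, np.nan, 7.0], name="categories_sk")

# Assumed sentinel for "no error"; it must not collide with a real error code.
NO_ERROR = 0
assert NO_ERROR not in set(categories.dropna())

# Replace NaN with the sentinel and convert the codes to integers.
target = categories.fillna(NO_ERROR).astype(int)

# Relative class frequencies, which make the class imbalance explicit.
class_share = target.value_counts(normalize=True)
print(class_share)
```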

### Erroneous recordings
Plotting each feature can help in identifying obvious data errors.

In [None]:
df.plot(lw=0,
        marker=".",
        subplots=True,
        layout=(-1, 3),
        figsize=(15, 30),
        markersize=1)
plt.show()

Nothing seems to be obviously wrong. It can be tricky to distinguish an outlier (which we want to keep) from a truly erroneous entry (such as a negative wind speed). Still, there are some interesting things to note here. This is where we start looking at the content of the data.
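The visual check can be complemented by simple range checks for physically impossible values. The following is a minimal sketch on made-up readings. The bounds are assumptions chosen for illustration (non-negative wind speed, wind direction between 0° and 360°), not limits documented for this dataset.

```python
import pandas as pd

# Hypothetical toy readings; the values are made up for illustration.
toy = pd.DataFrame({
    "wind_speed": [0.4, -0.1, 0.9, 1.3],
    "wind_direction": [10.0, 355.0, 370.0, 180.0],
})

# Assumed plausibility bounds as (lower, upper); None means "no bound".
bounds = {
    "wind_speed": (0.0, None),
    "wind_direction": (0.0, 360.0),
}

# Boolean frame marking every value that falls outside its bounds.
violations = pd.DataFrame(False, index=toy.index, columns=list(bounds))
for column, (lower, upper) in bounds.items():
    if lower is not None:
        violations[column] = violations[column] | (toy[column] < lower)
    if upper is not None:
        violations[column] = violations[column] | (toy[column] > upper)

print(violations.sum())

# Rows with at least one implausible value, for manual inspection.
suspicious_rows = toy[violations.any(axis=1)]
print(suspicious_rows)
```

Note that the large but plausible value `1.3` is not flagged: a check like this targets impossible values, not outliers.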

# Data content

### Feature correlations
Let's have a look at how the features are correlated with each other.

In [None]:
feat_correlations = df.corr(method="pearson")

plt.figure(figsize=(16, 16))
sns.heatmap(feat_correlations,
            square=True,
            center=0,
            annot=np.round(feat_correlations, 3),
            fmt="",
            linewidths=.5,
            cmap="vlag",
            cbar_kws={"shrink": 0.8})

Some of the features seem to be strongly correlated, such as `rotor_speed` and `generator_speed`.
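Reading strongly correlated pairs off a large heatmap is error-prone, so it can help to list them programmatically. Here is a minimal sketch on synthetic data. The threshold of 0.9 and the generated columns are assumptions; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic toy data: two nearly identical columns and one independent column.
rng = np.random.default_rng(0)
rotor = rng.normal(size=200)
toy = pd.DataFrame({
    "rotor_speed": rotor,
    "generator_speed": rotor + rng.normal(scale=0.01, size=200),
    "temp_environment": rng.normal(size=200),
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 excludes the diagonal), so that each
# pair appears once and self-correlations are ignored.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Assumed cut-off for calling a correlation "strong".
THRESHOLD = 0.9
strong_pairs = pairs[pairs.abs() > THRESHOLD].sort_values(ascending=False)
print(strong_pairs)
```

Applied to `feat_correlations`, the same masking and filtering would yield the candidate pairs for feature selection.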

Of special interest, of course, is the correlation of each feature with the target variable `categories_sk`.

In [None]:
feat_correlations["categories_sk"].sort_values(ascending=False)
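Two caveats apply when reading these numbers. `DataFrame.corr` excludes missing values pairwise, so correlations with `categories_sk` are computed on error rows only, and a Pearson coefficient treats the error codes as if they were ordered quantities. A complementary view is to compare features against a binary "error occurred" indicator. The sketch below does this on made-up values; the indicator `is_error` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN in the target means normal operation.
toy = pd.DataFrame({
    "temp_gear_bearing": [55.0, 56.0, 54.0, 80.0, 82.0, 57.0],
    "categories_sk":     [np.nan, np.nan, np.nan, 3.0, 7.0, np.nan],
})

features = toy.drop(columns="categories_sk")

# Binary indicator: True where any error category was recorded.
is_error = toy["categories_sk"].notna()

# Pearson correlation of each feature with the 0/1 indicator
# (equivalent to the point-biserial correlation).
corr_with_error = features.corrwith(is_error.astype(int))
print(corr_with_error)

# Feature means under error and under normal operation.
mean_error = features[is_error].mean()
mean_normal = features[~is_error].mean()
print(mean_error, mean_normal, sep="\n")
```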

# Summary
This completes our exploratory data analysis. You should now have a better understanding of the dataset, along with some first ideas about which processing steps would be useful and which features might be informative for our classification task.