# EXPLORATORY DATA ANALYSIS

The goal of our analysis is to understand how the wind speed behaves over time and to propose the most adequate way of forecasting it _day ahead_.

Developing wind farms involves a lot of resources, both financial and natural. Before deciding to build one, companies take several measurements of the wind resource by installing temporary measurement towers, which collect data over one year or, preferably, more.

The analysis here investigates the data provided by a single tower in an undisclosed location. The selected tower has wind measurements at different heights going back to Jan 2015 but, to simplify the analysis, only the data from the highest anemometer will be used.

Before advancing into forecasting, it is very important to be aware of the statistical behavior of the time series. To that end, this notebook focuses on visual analysis as well as a mathematical exploration of some theorems on time series forecasting.

> A simple library was built to help keep this notebook clean. Because of that, most of the code used here is imported from modules and submodules. Check `main.py` and `services` for the code.

In [None]:
from main import WindSpeedTower

from services import translation

## Loading the Dataset

As mentioned before, I'll be using modules to reduce the amount of code exposed in this notebook.

In [None]:
Tower = WindSpeedTower(name='UT001', csv_path='datasets/wind.csv')
Tower.data.head()

### Describing

`pandas` has a handy DataFrame method, `describe()`, which gives a statistical description of the columns.

In [None]:
Tower.data.describe()

## Visualizing

The following cell shows the _10min average_ wind speed. Even though it would be interesting to work at this level of granularity, the average over longer periods of time might reveal additional statistical behaviours.

In [None]:
Tower.plot_series(export=False)

As the plots show, there seems to be a lot of noise in the series, but if we plot the average speed over longer periods of time, some patterns start to emerge. As further analysis will show, averaging over time does not by itself add much to the exploration, but some seasonality and trend do emerge from the data.

### Handling Missing Values

Before advancing, let's investigate what seems to be a gap in 2016. It looks like December 2016 has null values in our series. Let's visualize it to make sure:

In [None]:
Tower.plot_date(year=2016, month=12)

Yes, there seems to be a problem with the dataset that should be addressed. With so many data points over such a long period, we should not assume that they are all filled in or that there are no gaps in our data. Before taking any measures on the missing values, let's check how they behave and whether we should worry about them at all.

> **NOTE**: During the CSV translation, the dataset was already forced onto a 10min frequency time series. Because of that, we should not worry about gaps in the range, only about the missing values.
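
To illustrate what forcing a fixed frequency means, here is a minimal sketch using plain pandas. The toy readings and the `speed` name are hypothetical, and the actual logic in `services.translation` is not shown in this notebook, so it may well differ.

```python
import pandas as pd

# Hypothetical raw readings: the 00:20 record is absent from the file entirely.
raw = pd.Series(
    [5.1, 5.4, 6.0],
    index=pd.to_datetime(
        ["2016-12-01 00:00", "2016-12-01 00:10", "2016-12-01 00:30"]
    ),
    name="speed",
)

# Reindex onto a strict 10-minute grid; absent timestamps become NaN rows.
regular = raw.asfreq("10min")

print(regular)
```

After this step every timestamp in the range exists, so a gap in the recording shows up as a missing value instead of a missing row.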

Missing values in a dataset can be classified as follows [1](https://link.springer.com/article/10.1007/s10489-018-1139-9#:~:text=There%20are%20three%20types%20of,to%20impute%20these%20missing%20values.):

1. Missing Completely At Random (MCAR)
2. Missing At Random (MAR)
3. Not Missing at Random (NMAR)

Depending on how your missing values are classified, your approach to them might be completely different. For example, values missing completely at random will not bring bias to your model, whereas values not missing at random might change the way you look at your data.

A special method was implemented to count missing values in the `WindSpeedTower` object.

In [None]:
Tower.missing_stats(verbose=True)

First, the missing values seem to be completely at random, which makes sense if we consider that our data is produced by sensors, which can fail and generate gaps.

Second, the amount of missing values does not seem to be significant, since they account for only 0.7961% of the total.
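
As a rough idea of how such a count could be obtained, the sketch below uses plain pandas on a hypothetical toy series. The internals of `missing_stats` are not shown here, so this is an assumption about the approach, not its actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical 10-minute series with two missing readings out of eight.
index = pd.date_range("2015-01-01", periods=8, freq="10min")
speed = pd.Series([6.2, np.nan, 5.9, 6.4, 7.0, np.nan, 6.8, 6.1], index=index)

# Count the NaN entries and express them as a share of all samples.
n_missing = int(speed.isna().sum())
pct_missing = 100 * n_missing / len(speed)

print(f"{n_missing} missing of {len(speed)} ({pct_missing:.4f}%)")
```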

Even though there are only a few missing values, let's check whether they actually fall into the Missing Completely At Random category (_MCAR_). To do that, I'll be using the [`missingno`](https://github.com/ResidentMario/missingno) Python package.

In [None]:
# Import the missingno package

import missingno as msno

### Visualizing Missing Values

By visualizing how the missing values are distributed, we can get a clue about how to classify them. Let's use the `matrix()` function.

In [None]:
msno.matrix(Tower.data);

As the plot shows, there seem to be only 3 significant gaps. Judging by the sparkline on the right-hand side of the plot, the gaps are not evenly distributed. Because of that, we assume that the missing values will not statistically impact our analysis.
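
The gaps can also be listed numerically. The following is a minimal sketch on a hypothetical toy series (the variable names are illustrative and not part of the `WindSpeedTower` library): it groups consecutive missing readings into runs and reports the start, end and length of each run.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one 3-sample gap and one isolated missing value.
index = pd.date_range("2016-12-01", periods=10, freq="10min")
speed = pd.Series(
    [5.0, np.nan, np.nan, np.nan, 5.5, 5.7, np.nan, 6.0, 6.1, 6.3], index=index
)

is_missing = speed.isna()
# A new run starts whenever the missing flag changes value.
run_id = (is_missing != is_missing.shift()).cumsum()

# Keep only the runs made of missing values and summarise each one.
gaps = (
    speed.index.to_series()[is_missing]
    .groupby(run_id[is_missing])
    .agg(start="min", end="max", n_missing="count")
    .sort_values("n_missing", ascending=False)
)

print(gaps)
```

Sorting by run length puts the significant gaps at the top, which makes it easy to compare them against what the matrix plot suggests.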

## Time Series Decomposition

Another very important step in the analysis is the decomposition of a time series into its main components:

- Seasonality
- Trend (Or Trend-Cycle)
- Remainder (Which accounts for everything else)

Those components underlie the patterns that might govern the data. It is often helpful to apply mathematical simplifications that remove known sources of variation. In our case, observing the series at a 10min granularity brings a lot of noise, so observing the patterns over longer periods of time may work better.



In [None]:
from services import timeseries 

In [None]:
timeseries.resample(dataset=Tower.data, rule='m')[['mean']]
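
For readers without access to `services.timeseries`, the sketch below shows what such an aggregation might look like in plain pandas. The toy data and the choice of `mean`, `std` and `count` as aggregates are assumptions; only the `mean` column is known from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical two days of 10-minute readings (144 samples per day).
index = pd.date_range("2015-01-01", periods=288, freq="10min")
speed = pd.Series(np.r_[np.full(144, 4.0), np.full(144, 8.0)], index=index)
speed.iloc[10] = np.nan  # one missing reading; the aggregates skip NaN

# Aggregate the 10-minute samples into one row per calendar day.
daily = speed.resample("D").agg(["mean", "std", "count"])

print(daily)
```

The `count` column is worth keeping next to the mean, since it shows how many valid samples each aggregated value is based on.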

In [None]:
Tower.decompose(period='d', model='additive', plot=True, overlay_trend=False, export=False)
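
How `Tower.decompose` works internally is not shown in this notebook. As an illustration of the additive model only, here is a minimal sketch of a classical decomposition on a synthetic daily series with an assumed weekly cycle (`period = 7`). All names and values are hypothetical.

```python
import numpy as np
import pandas as pd

period = 7  # assumed seasonal length (odd, so a centred window is symmetric)
index = pd.date_range("2015-01-01", periods=8 * period, freq="D")
t = np.arange(len(index))
pattern = np.array([-1.0, 0.5, 1.5, 0.0, -0.5, 0.5, -1.0])  # sums to zero
series = pd.Series(6.0 + 0.01 * t + pattern[t % period], index=index)

# 1. Trend: centred moving average over one full seasonal cycle.
trend = series.rolling(window=period, center=True).mean()

# 2. Seasonal: mean detrended value at each position in the cycle,
#    re-centred so the seasonal component averages to zero.
detrended = series - trend
position = t % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.to_numpy()[position], index=index)

# 3. Remainder: whatever trend and seasonality do not explain.
remainder = series - trend - seasonal

print(seasonal_means.round(3))
```

Because the synthetic series is built as trend plus seasonality with no noise, the remainder is zero wherever the moving average is defined. On real wind data the remainder carries the noise discussed above.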

### Unit Root Test

One of the most important characteristics of a series is its _stationarity_. To check for stationarity, let's use the Dickey-Fuller test for a unit root.
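
For reference, the regression behind the test can be written in its augmented form as below. Whether `Tower.stationarity` uses the augmented variant, and which lag order $p$ and deterministic terms it includes, is not shown in this notebook, so treat those as assumptions.

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

$$
H_0: \gamma = 0 \quad \text{(unit root, non-stationary)} \qquad H_1: \gamma < 0 \quad \text{(stationary)}
$$

The test statistic is the t-ratio of the estimated $\gamma$. The null hypothesis is rejected when the statistic is more negative than the critical value for the chosen significance level, or equivalently when the p-value falls below that level.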

#### **HOURLY AGGREGATION**

In [None]:
Tower.stationarity(period='h', plot=True, verbose=True, export=True)

---

#### **DAILY AGGREGATION**

In [None]:
Tower.stationarity(period='d', plot=True, verbose=True, export=True)

---

#### **WEEKLY AGGREGATION**

In [None]:
Tower.stationarity(period='w', plot=True, verbose=True, export=True)

---

#### **MONTHLY AGGREGATION**

In [None]:
Tower.stationarity(period='m', plot=True, verbose=True, export=True)

## Train/Test Splits

To evaluate any model, we first need to set the train and test splits.

Let's consider the following information:

 - Series Start: 2015-01-01 00:00:00
 - Series End: 2021-12-24 18:00:00
 - Complete years: 6
 
Hence, let's split the dataset into:
 
 - Train: < 2020-01-01 00:00:00
 - Test: >= 2020-01-01 00:00:00



In [None]:
split = '2020-01-01'
Tower.build_sets(period='m', split=split, plot=True)
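
Outside the `WindSpeedTower` helper, the same kind of split could be sketched in plain pandas as follows. The toy monthly series is hypothetical, and `build_sets` may do more than this (plotting, for instance).

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series spanning the split date.
index = pd.date_range("2019-10-01", periods=6, freq="MS")
monthly = pd.Series(np.arange(6, dtype=float), index=index, name="mean")

split = pd.Timestamp("2020-01-01")

# Everything strictly before the split date trains the model;
# the split date and everything after it is held out for testing.
train = monthly[monthly.index < split]
test = monthly[monthly.index >= split]

print(len(train), len(test))
```

Splitting on a timestamp, not at random, keeps the test set strictly in the future of the training set, which is what a forecasting evaluation requires.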

In [None]:
Tower.save()