# Data Exploration

This is the **first** notebook of the analysis.

> Make sure you have run the previous notebook, [0_Cover.ipynb](0_Cover.ipynb), first; otherwise the package imports below might fail.
---

# Objective

Before jumping into modeling, it is important to explore the data. The following cells load the necessary packages and perform an exploratory data analysis (EDA) of the dataset.

At the end, the dataset is pickled to a file, which the later notebooks (Preparation, Train and Forecast) use as their starting point.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots

pd.set_option('display.max_columns', None)

---

# Wind Time Series

This project uses a wind time series provided by the professor. The data comes in a wide, columnar format: each row corresponds to one hour of the day, and each column after the first holds the wind speed recorded at that hour on one day of the series. The table below describes the layout:

Column Index|Format                 |Description
------------|-----------------------|-----------------------
1           | $$0 \le hour \le 23$$ | Hour reference
2           | $$1 \le day \le 31$$  | First day of the series
.           |        .              | $n^{th}$ day of the series
32          | $$1 \le day \le 31$$  | Last day of the series

The cell below loads the dataset.

In [None]:
# Google Drive ID of the CSV file that holds the wind dataset
file_id = "1JIk-xFfeL-uTNtkCbOERjko3ovM0EF18"
file_url = f"https://drive.google.com/uc?id={file_id}"
df = pd.read_csv(file_url)
df.head()
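
If the Drive file is not reachable, a synthetic stand-in with the same layout can be enough to follow the rest of the notebook. The sketch below is only an illustration: the function name `make_synthetic_wind`, the Weibull parameters and the assumption that hours are labelled 0 to 23 are not taken from the original dataset, and the values are random, not real measurements.

```python
import numpy as np
import pandas as pd


def make_synthetic_wind(n_days: int = 31, seed: int = 0) -> pd.DataFrame:
    """Build a stand-in frame: one 'hour' column plus day1..day<n_days>."""
    rng = np.random.default_rng(seed)
    hours = np.arange(24)  # assumption: hours are labelled 0..23
    # Made-up speeds; a Weibull shape is a common rough model for wind speed.
    speeds = rng.weibull(2.0, size=(len(hours), n_days)) * 6.0
    columns = [f"day{d}" for d in range(1, n_days + 1)]
    data = pd.DataFrame(speeds, columns=columns)
    data.insert(0, "hour", hours)
    return data


synthetic_df = make_synthetic_wind()
print(synthetic_df.shape)  # (24, 32): 24 hourly rows, 1 hour column + 31 day columns
```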

---

# Feature Analysis

Although the provided dataset could be reshaped into a "single feature" dataset, with wind speed as the only feature, it is also interesting to investigate how the wind speed behaves at every hour of the day. The table below shows how the wind speed was recorded.

Index       | Feature   | Format  | Description
------------|-----------|-------- |------------------
1           | hour      | `int`   | record hour
2           | day1      | `float` | record wind speed
3           | day2      | `float` | record wind speed
.           |  .        |    .    | .
32          | day31     | `float` | record wind speed

A visual analysis of the wind speed behaviour is conducted in the next cells using multiple plots. To do that, the following steps are taken:

 - Rename and index the dataset so it becomes easier to use for investigation.
 - Plot the records grouped by hour or day.
 - Plot the main statistical values of the dataset.

## Preparing the dataset

In [None]:
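# Strip the "day" prefix so the day columns are labelled "1" ... "31"
df.rename(columns=lambda x: x.replace('day', ''), inplace=True)
# Use the hour as the row index
df.set_index('hour', inplace=True)
df.head()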
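
The "single feature" reshaping mentioned earlier in this section is not performed in this notebook, but it can be sketched on a tiny frame with the same layout. The frame `wide`, its values and the series name `wind_speed` are hypothetical; only the `rename` and `set_index` steps mirror the cell above.

```python
import pandas as pd

# Tiny hypothetical frame in the notebook's layout (3 hours x 2 days).
wide = pd.DataFrame(
    {"hour": [0, 1, 2], "day1": [1.0, 2.0, 3.0], "day2": [4.0, 5.0, 6.0]}
)
wide = wide.rename(columns=lambda c: c.replace("day", "")).set_index("hour")
wide.columns = wide.columns.astype(int)  # numeric day labels instead of strings
wide.columns.name = "day"

# unstack() on a frame with a flat index returns a Series indexed by
# (column label, row label), here (day, hour), ordered day by day.
series = wide.unstack().rename("wind_speed")
print(series)
```

The result is one long series ordered by day and then by hour, which is the shape a single-feature forecasting model would typically consume.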

## Plotting the records

Thanks to the `plotly` package, we can now plot the records interactively in this notebook.

In [None]:
# Days on the x-axis, one line per hour
daily_fig = px.line(df.T, title="Wind speed time series")
# Hours on the x-axis, one line per day
hourly_fig = px.line(df, title="Wind speed time series")

fig = make_subplots(rows=2, cols=1, subplot_titles=("Hourly Grouped", "Daily Grouped"))
fig.update_layout(height=600, showlegend=False, title_text="Wind Speed Time Series")

# Copy the Plotly Express traces into the two subplots
for trace in hourly_fig['data']:
    fig.add_trace(trace, row=1, col=1)
for trace in daily_fig['data']:
    fig.add_trace(trace, row=2, col=1)

fig.update_xaxes(title_text="Hours", row=1, col=1)
fig.update_xaxes(title_text="Days", row=2, col=1)

fig.update_yaxes(title_text="Wind speed (m/s)", row=1, col=1)
fig.update_yaxes(title_text="Wind speed (m/s)", row=2, col=1)

fig.show()

### **Notes**

The plots already show that, although the underlying wind data is the same, observing it from a different time perspective (by hour or by day) might be a good way to find correlations.

It also becomes clear that plotting the records alone is not sufficient to understand the statistical behaviour of the feature. The next cells address this issue.

## Statistical Behaviour

> At this point it is important to highlight that we are dealing with a time series, so the main statistical measurements depend on the chosen "point of view". For example, if we group the measurements by hour, we have 31 records for every hour (one for each day); any statistic for that hour is therefore computed from a sample of 31 measurements.
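
The two points of view can be illustrated on a toy frame. This is a minimal sketch with made-up values: the frame `toy` is hypothetical and only mimics the notebook's layout (rows are hours, columns are days).

```python
import pandas as pd

# Hypothetical 2 hours x 3 days frame: rows are hours, columns are days.
toy = pd.DataFrame(
    {"1": [2.0, 4.0], "2": [4.0, 6.0], "3": [6.0, 8.0]},
    index=pd.Index([0, 1], name="hour"),
)

# Hourly point of view: one statistic per hour, computed across the days.
hourly_mean = toy.mean(axis=1)
hourly_std = toy.std(axis=1)  # sample standard deviation (ddof=1)

# Daily point of view: one statistic per day, computed across the hours.
daily_mean = toy.mean(axis=0)

print(hourly_mean.tolist())  # [4.0, 6.0]
print(hourly_std.tolist())   # [2.0, 2.0]
print(daily_mean.tolist())   # [3.0, 5.0, 7.0]
```

Computing `toy.T.mean()` gives the same numbers as `toy.mean(axis=1)`; transposing the frame is simply another way of switching the point of view.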

### Mean, Standard Deviation and Box Plot Analysis

The following cell defines a function that plots the data together with its summary statistics. Each figure contains two subplots. The first shows the average wind speed, with a band of one standard deviation around it, from one of two different perspectives (daily or hourly). The second is a powerful statistical tool, the **Box Plot**, which shows important statistical measurements such as the median, the quartiles and the outliers.

In [None]:
def plot_graphs(df: pd.DataFrame, x_axis_title: str):
    """Plot the column-wise mean with a ±1 standard deviation band, plus a box plot."""
    mean = go.Scatter(
            name='Mean',
            x=df.columns,
            y=df.mean(),
            mode='lines')
    std1 = go.Scatter(
            name='Upper',
            x=df.columns,
            y=df.mean()+df.std(),
            mode='lines',
            marker=dict(color="#444"),
            line=dict(width=0),
            showlegend=False)
    std2 = go.Scatter(
            name='Lower',
            x=df.columns,
            y=df.mean()-df.std(),
            marker=dict(color="#444"),
            line=dict(width=0),
            mode='lines',
            fillcolor='rgba(163, 172, 247, 0.3)',
            fill='tonexty',
            showlegend=False)

    # Build the box plot with Plotly Express and reuse its trace in the subplot
    box_fig = px.box(df, labels={"value": "Wind Velocity (m/s)", "hour": "Hour of the day"})
    box = box_fig['data'][0]

    fig = make_subplots(rows=2, cols=1, subplot_titles=("Mean and Standard Deviation", "Box Plot"), shared_yaxes=True)
    fig.update_layout(height=600)
    fig.add_trace(mean, row=1, col=1)
    fig.add_trace(std1, row=1, col=1)
    fig.add_trace(std2, row=1, col=1)
    fig.add_trace(box, row=2, col=1)
    
    fig.update_xaxes(title_text=x_axis_title, row=1, col=1)
    fig.update_xaxes(title_text=x_axis_title, row=2, col=1)
    
    fig.show()
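
The quantities a box plot displays can also be computed numerically. The sketch below uses the common 1.5 × IQR rule to flag outliers; the helper `box_summary` and the sample values are hypothetical, and Plotly's own quartile method may give slightly different numbers than `pandas` quantiles.

```python
import pandas as pd


def box_summary(values: pd.Series) -> dict:
    """Return quartiles, 1.5*IQR fences and outliers for one sample."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points outside the fences are the ones usually drawn as outliers.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers.tolist(),
    }


# Made-up sample with one clearly extreme value.
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])
summary = box_summary(sample)
print(summary)
```

For the notebook's data, the same helper could be applied column by column, for example with `df.apply(box_summary)`, to list the outliers per day.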

#### **Hourly Grouped**

By transposing the dataset, we obtain an "hourly" point of view.

In [None]:
plot_graphs(df=df.T.copy(), x_axis_title="Hours")

#### **Statistics**

In [None]:
statistics_df = df.T.describe()
display(statistics_df.round(2))

#### **Daily Grouped**

In [None]:
plot_graphs(df=df.copy(), x_axis_title="Days")

#### **Statistics**

In [None]:
statistics_df = df.describe()
display(statistics_df.round(2))

# Pickle Dataframe

> This is the final step of this notebook.

Once the analysis is complete, the analysed dataframe is pickled to a file.

This practice allows the next notebook to start from where this one left off.

In [None]:
df.to_pickle("explore.pkl")
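
With a plain `.pkl` extension, `pandas` writes the pickle without compression, because the default `compression="infer"` chooses the codec from the file extension. The sketch below shows a round trip for both an uncompressed and a gzip-compressed file. The frame `frame` and the file name `explore.pkl.gz` are hypothetical; the following notebooks are assumed to read `explore.pkl`, so the compressed variant is shown only for comparison.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical small frame standing in for the analysed dataset.
frame = pd.DataFrame(
    {"1": [3.2, 4.1], "2": [5.0, 2.7]},
    index=pd.Index([0, 1], name="hour"),
)

with tempfile.TemporaryDirectory() as tmp:
    plain_path = Path(tmp) / "explore.pkl"
    gzip_path = Path(tmp) / "explore.pkl.gz"  # hypothetical file name

    frame.to_pickle(plain_path)  # no compression: ".pkl" maps to no codec
    frame.to_pickle(gzip_path)   # gzip, inferred from the ".gz" extension

    restored_plain = pd.read_pickle(plain_path)
    restored_gzip = pd.read_pickle(gzip_path)

# Both round trips should reproduce the original frame.
print(restored_plain.equals(frame), restored_gzip.equals(frame))
```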