From this article: [Time Series — ARIMA vs. SARIMA vs. LSTM: Hands-On Tutorial](https://towardsdatascience.com/time-series-arima-vs-sarima-vs-lstm-hands-on-tutorial-bd5630298da3)

First, let's download the data. It's available [here](https://archive.ics.uci.edu/dataset/360/air+quality) and is around 1 MB. Below we fetch it with the `ucimlrepo` package instead of downloading it by hand.

In [None]:
# import libraries
import pandas as pd
import numpy as np

from ucimlrepo import fetch_ucirepo

# Fetch dataset
air_quality = fetch_ucirepo(id=360)

Get the data from the data set.

Same thing with many names:
- Features, X, input variables, independent variables
- Targets, y, output variables, dependent variables

In [None]:
# Data (as pandas dataframe)
air_quality_df = air_quality.data.features
air_quality_df.head()

Explore the data a bit.

In [None]:
air_quality_df.info()

See statistics of the data.

In [None]:
air_quality_df.describe()

From the documentation for this data: Missing values are tagged with -200 value. Let's
see how many are -200 in each column. This can give us an idea of how to handle them.

In [None]:
# Do an element wise comparison on all values, then sum up for each column.
missing_values_count = (air_quality_df == -200).sum()
missing_values_count

This is great! We can see how many values in each column are effectively null. Let's also
get the percentage of null (-200) values per column.

In [None]:
# Calculate the percentage of missing values for each column. Round to a whole int.
missing_values_percentage = ((missing_values_count / len(air_quality_df)) * 100).round().astype(int)
missing_values_percentage
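
As a side note, the same percentage can be had from the mean of the boolean mask, since `True` counts as 1. Here is a minimal sketch on a made-up frame; the column names and values are hypothetical and only there to make the arithmetic easy to check.

```python
import pandas as pd

# Hypothetical toy data; -200 marks a missing reading, as in the dataset docs.
toy_df = pd.DataFrame({
    "sensor_a": [1.0, -200, 3.0, -200],
    "sensor_b": [5.0, 6.0, -200, 8.0],
})

# Boolean mask: True wherever the sentinel value appears.
is_missing = toy_df == -200

# Summing the mask gives counts; averaging it gives the fraction per column.
toy_missing_count = is_missing.sum()
toy_missing_pct = (is_missing.mean() * 100).round().astype(int)

print(toy_missing_count)
print(toy_missing_pct)
```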

<!--TODO on second pass: For the readings, let's use the average value of the column to fill in the missing values. -->
## Feature Engineering
We'll handle missing values by dropping them.

In [None]:
# Show the shape as (rows, columns) before dropping missing values, to compare afterwards.
air_quality_df.shape

In [None]:
# Replace missing values and drop unnecessary columns.

# Replace -200 with NaN
air_quality_df.replace(-200, np.nan, inplace=True)
# Drop columns where all values are NaN
# - axis=1 means columns, how='all' means only drop if all values in the column are NaN
air_quality_df.dropna(axis=1, how='all', inplace=True)
air_quality_df.shape


In [None]:
# Drop any rows with missing values.
air_quality_df.dropna(inplace=True)
air_quality_df.shape

Not great, we lost most of the data set. I don't like how the tutorial handled this.
I think we can do better. TODO: come back to this
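
One possible direction, sketched on made-up data: fill the gaps instead of dropping the rows. Filling with the column mean is one candidate, and linear interpolation between neighboring readings is another that may suit time-ordered data. The frame, column name, and values below are hypothetical, and neither approach is what the tutorial does.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with two gaps (NaN).
toy_readings = pd.DataFrame(
    {"reading": [10.0, np.nan, 30.0, np.nan, 50.0]},
    index=pd.date_range("2004-03-10 18:00", periods=5, freq=pd.Timedelta(hours=1)),
)

# Option 1: replace each NaN with the mean of its column.
mean_filled = toy_readings.fillna(toy_readings.mean())

# Option 2: draw a straight line between the known neighbors.
interpolated = toy_readings.interpolate(method="linear")

print(mean_filled)
print(interpolated)
```

Both options keep all five rows. Whether either is appropriate for the real data depends on how long the gaps are, which is worth checking first.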

Now let's combine the Date and Time columns into a single datetime column.

In [None]:
# Combine Date and Time columns into a single datetime column
air_quality_df['DateTime'] = pd.to_datetime(air_quality_df['Date'] + ' ' + air_quality_df['Time'])
air_quality_df.head()
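
If the parsed dates ever look off (for example, day and month swapped), passing an explicit `format` to `pd.to_datetime` removes the guesswork. A small sketch on made-up rows follows; the month-first format string is an assumption for illustration, so check it against the actual `Date` and `Time` values before using it.

```python
import pandas as pd

# Hypothetical rows; the real Date/Time formatting may differ.
toy_dt = pd.DataFrame({
    "Date": ["3/10/2004", "3/10/2004"],
    "Time": ["18:00:00", "19:00:00"],
})

# An explicit format avoids ambiguity between month-first and day-first dates.
toy_dt["DateTime"] = pd.to_datetime(
    toy_dt["Date"] + " " + toy_dt["Time"],
    format="%m/%d/%Y %H:%M:%S",
)

print(toy_dt["DateTime"])
```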

We can set the new datetime column as the index since we're working with time series
data, then sort the data by the index.
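Note 💡: `inplace=True` modifies the DataFrame directly and returns `None`, so there is
nothing to re-assign. It's mostly a style choice: pandas often still makes a copy
internally, so it isn't guaranteed to save memory.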

In [None]:
air_quality_df.set_index('DateTime', inplace=True)
air_quality_df.sort_index(inplace=True)
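
For comparison, the same two steps can be written either with `inplace=True` or by re-assigning a chained result. This small sketch on made-up data (the frame and its values are hypothetical) shows that both styles end up with the same frame.

```python
import pandas as pd

# Hypothetical rows, deliberately out of time order.
toy_ts = pd.DataFrame({
    "DateTime": pd.to_datetime(["2004-03-10 19:00", "2004-03-10 18:00"]),
    "value": [2, 1],
})

# Style 1: chain the methods and assign the result to a new name.
reassigned = toy_ts.set_index("DateTime").sort_index()

# Style 2: modify a copy in place; the in-place call returns None.
in_place = toy_ts.copy()
result = in_place.set_index("DateTime", inplace=True)
in_place.sort_index(inplace=True)

print(result)
print(reassigned.equals(in_place))
```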

The tutorial picks NOx(GT) as the target variable without giving a reason, so we'll use
that too, but don't ask me why. 😅

In [None]:
# Grab the target column.
# Note 💡: Selecting one column returns a Series, not a DataFrame. It keeps the DateTime index alongside the values.
target = air_quality_df['NOx(GT)']
target.head()
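
To make the Series versus DataFrame distinction concrete, here is a small sketch on made-up data (the values are hypothetical): single brackets return a Series, while double brackets return a one-column DataFrame. Both carry the same index.

```python
import pandas as pd

# Hypothetical values on a datetime index.
toy_cols = pd.DataFrame(
    {"NOx(GT)": [166.0, 103.0], "CO(GT)": [2.6, 2.0]},
    index=pd.to_datetime(["2004-03-10 18:00", "2004-03-10 19:00"]),
)

as_series = toy_cols["NOx(GT)"]    # single brackets: a Series
as_frame = toy_cols[["NOx(GT)"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```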

### Visualize
Let's visualize the data to see what we're working with. Visualizing is great. Always
visualize. 📊

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.plot(target)
plt.title('Hourly NOx(GT) Levels')
plt.xlabel('Date')
plt.ylabel('NOx(GT) Concentration')
plt.show()

