In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt

# Quick recap: How to use Dates & Times with pandas

## Basic building block: `pd.Timestamp`

In [None]:
time_stamp = pd.Timestamp(datetime(2022, 12, 12))
time_stamp

Attributes to store time-specific information

In [None]:
time_stamp.year

In [None]:
time_stamp.day_name()

## More building blocks: `pd.Period` and `freq`

In [None]:
period = pd.Period("2017-01")
period

Period object has freq attribute to store frequency info

In [None]:
# convert to daily
period.asfreq("D")

Convert Period to Timestamp and back

In [None]:
period.to_timestamp()

In [None]:
period.to_timestamp().to_period("M")

Frequency info enables basic arithmetic

In [None]:
period + 2
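
The effect of the frequency on arithmetic and conversion can be made explicit with a small sketch. The variable names below are illustrative and not part of the original notebook.

```python
import pandas as pd

# A monthly period; the frequency "M" is inferred from the string format.
period = pd.Period("2017-01")

# Adding an integer moves the period forward by that many frequency units.
two_months_later = period + 2

# Converting to daily frequency picks the last day of the month by default;
# how="start" picks the first day instead.
last_day = period.asfreq("D")
first_day = period.asfreq("D", how="start")

print(two_months_later)  # 2017-03
print(last_day)          # 2017-01-31
print(first_day)         # 2017-01-01
```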

## Sequences of Dates & Times

In [None]:
index = pd.date_range(start="2022-1-1", periods=12, freq="M")
index

`DatetimeIndex`: sequence of `Timestamp` objects with frequency info

In [None]:
index[0]

In [None]:
# index as monthly period to plot monthly aggregates
index.to_period()

# Indexing & resampling time series

## Let's load some data

In [None]:
ts = pd.read_csv("AAPL.csv", index_col="Date",  parse_dates=True)
ts.info()

In [None]:
ts.head()

In [None]:
ts.plot(y="Close", title="Apple stock closing price");

## Splitting TS by date

Select all data for 2018

In [None]:
# Approach 1
ts[ts.index.year == 2018]

In [None]:
# Approach 2, less verbose (plain ts["2018"] no longer selects rows in pandas >= 2.0)
ts.loc["2018"]

Selecting only data from summer months of 2018

In [None]:
# the verbose way... (strict "<" so that September 1st is excluded)
ts[(ts.index >= "2018-6") & (ts.index < "2018-9")]

In [None]:
# ... the concise way. Unlike Python slicing, the end label (all of August) is included
ts["2018-6" : "2018-8"]

In [None]:
# This also works
ts["2018-6":]
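
The inclusive end point of date-string slicing can be checked on a small synthetic frame. This is only a sketch: `demo` and its `value` column are made-up stand-ins for the stock data, and zero-padded month strings are used to keep the parsing unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical daily frame covering 2018 (stand-in for the stock data).
idx = pd.date_range("2018-01-01", "2018-12-31", freq="D")
demo = pd.DataFrame({"value": np.arange(len(idx))}, index=idx)

# Partial-string slicing: both end points are inclusive, and "2018-08"
# expands to the whole of August.
summer = demo.loc["2018-06":"2018-08"]

# Equivalent boolean mask: note the strict "<" against the first of September.
summer_mask = demo[(demo.index >= "2018-06-01") & (demo.index < "2018-09-01")]

print(summer.index.min().date(), summer.index.max().date())
print(summer.equals(summer_mask))
```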

## Changing frequency

In [None]:
ts.head()

In [None]:
ts.index

**Upsampling** - higher frequency implies new dates -> missing data

In [None]:
# Set daily frequency
ts_daily = ts.asfreq("D").head()
ts_daily

In [None]:
ts_daily.index

Options to impute missing data: mean, median, last value

In [None]:
ts.asfreq("D").ffill().head()
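
To compare the imputation options side by side, here is a minimal sketch on a made-up series observed every other day. The series and column names are assumptions for illustration, and linear interpolation is included only as one further possibility.

```python
import pandas as pd

# Hypothetical series observed every other day.
sparse = pd.Series(
    [10.0, 14.0, 12.0],
    index=pd.date_range("2022-01-01", periods=3, freq="2D"),
)

# Upsample to daily frequency: the new dates have no data yet.
daily = sparse.asfreq("D")

filled = pd.DataFrame({
    "raw": daily,
    "ffill": daily.ffill(),                  # carry the last known value forward
    "mean": daily.fillna(daily.mean()),      # replace gaps with the overall mean
    "median": daily.fillna(daily.median()),  # replace gaps with the overall median
    "interpolate": daily.interpolate(),      # linear interpolation between neighbours
})
print(filled)
```

Forward filling never uses future information, which matters when the filled series later feeds a forecasting model.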

Why is changing the frequency of a TS relevant? For example, to merge time series with different frequencies.

# Merging time series

Client case example (data made available courtesy of Daan Salome)

In [None]:
# pressurization time
pt = pd.read_csv("PressurisationSeconds.csv", index_col="_time", parse_dates=True).sort_index()
pt.head()

In [None]:
pt.info()

## Side note on Time Zones

The "+00:00" at the end of each timestamp indicates the time zone offset; here the time zone is UTC.

In [None]:
pt.index.tz

To convert a time zone aware pandas object from one time zone to another, you can use the `tz_convert` method.

In [None]:
pt.index[:4]

In [None]:
pt.index.tz_convert("US/Eastern")[:4]

In [None]:
pt.index.tz_convert("US/Pacific")[:4]

In [None]:
# convert DF to different TZ
pt.tz_convert("US/Pacific").head()

How to strip the TZ information if it annoys us

In [None]:
pt = pt.tz_localize(None)
pt.head()
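
The difference between converting and stripping a time zone can be seen on a tiny made-up frame. The data and the `seconds` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical UTC-aware index (stand-in for the sensor timestamps).
idx = pd.to_datetime(["2022-01-01 12:00", "2022-01-01 13:00"], utc=True)
df = pd.DataFrame({"seconds": [3.1, 2.9]}, index=idx)

# tz_convert changes the wall-clock representation, not the instant in time.
eastern = df.tz_convert("US/Eastern")

# tz_localize(None) drops the time zone and keeps the current wall-clock
# time, which here is the UTC time.
naive = df.tz_localize(None)

print(eastern.index[0])  # 2022-01-01 07:00:00-05:00
print(naive.index[0])    # 2022-01-01 12:00:00
```

Both series must be either time zone aware or naive before they can be aligned, which is why the notebook strips the time zone from both.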

## Load 2nd Time series

In [None]:
loads = pd.read_csv("smBpTestStandLoadTest.State.csv", index_col="_time", parse_dates=True).sort_index()
loads = loads.tz_localize(None)
loads.head(10)

## Solving for different frequencies
How to _align_ the data in the second TS with the data in the first TS

**Reindexing**  is useful in preparation for combining two time series data sets. 

If data is unavailable for one of the new index dates or times, you must tell pandas how to fill it
in. Otherwise, pandas will fill with NaN by default.

In [None]:
pt.reindex(loads.index)

In [None]:
pt_reindex = pt.reindex(loads.index, method="pad")
pt_reindex

In [None]:
# now we can join datasets
joined = loads.join(pt_reindex)
joined
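
The same reindex-then-join pattern, sketched on two small made-up series so the effect of `method="pad"` is easy to follow. The frames `slow` and `fast` and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical slow series: one measurement every 10 seconds.
slow = pd.DataFrame(
    {"pressurisation": [5.0, 6.0]},
    index=pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 00:00:10"]),
)

# Hypothetical fast series: one state reading every 5 seconds.
fast = pd.DataFrame(
    {"state": ["idle", "load", "load", "idle"]},
    index=pd.to_datetime([
        "2022-01-01 00:00:00", "2022-01-01 00:00:05",
        "2022-01-01 00:00:10", "2022-01-01 00:00:15",
    ]),
)

# Without a fill method, timestamps missing from `slow` become NaN.
aligned_nan = slow.reindex(fast.index)

# method="pad" (alias "ffill") carries the last earlier observation forward.
aligned_pad = slow.reindex(fast.index, method="pad")

joined_demo = fast.join(aligned_pad)
print(joined_demo)
```

Note that `reindex` with a fill method requires a monotonic index, which is why the notebook sorts both frames right after loading.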

## `merge_asof`

**Could be viewed as reindex and merge in one operation**

Similar to a left join, except that we match on the nearest key rather than on equal keys. With the default `direction="backward"`, for each row in the left DataFrame we select the last row in the right DataFrame whose key is less than or equal to the left's key.

In [None]:
loads.head()

In [None]:
pt.head()

In [None]:
pd.merge_asof(loads, pt, left_index=True, right_index=True).head()
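
A minimal sketch of the matching rule on made-up data: the first left row has no earlier right row and stays NaN, the second matches exactly, and the third picks up the last earlier value. The frames `left` and `right` are hypothetical.

```python
import pandas as pd

left = pd.DataFrame(
    {"state": ["a", "b", "c"]},
    index=pd.to_datetime([
        "2022-01-01 00:00:04", "2022-01-01 00:00:10", "2022-01-01 00:00:17",
    ]),
)
right = pd.DataFrame(
    {"pressurisation": [5.0, 6.0]},
    index=pd.to_datetime(["2022-01-01 00:00:05", "2022-01-01 00:00:10"]),
)

# Default direction="backward": last right row at or before each left timestamp.
backward = pd.merge_asof(left, right, left_index=True, right_index=True)

# direction="nearest" matches the closest right row in either direction.
nearest = pd.merge_asof(
    left, right, left_index=True, right_index=True, direction="nearest"
)

print(backward)
print(nearest)
```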

# Time series analysis: seasonality, trend and noise

CO2 levels at Mauna Loa, Hawaii

In [None]:
co2_levels = pd.read_csv("ch2_co2_levels.csv", index_col="datestamp", parse_dates=True).dropna()
co2_levels

In [None]:
co2_levels.index

In [None]:
co2_levels.plot(title="CO2 levels Mauna Loa observatory, Hawaii");

## Autocorrelations

Let's manually calculate how auto-correlated our signal is

In [None]:
# we shift (i.e. delay)  the data in multiples of 1 week
shifts = range(1, 8)

for ii in shifts:
    co2_levels[f"lag_{ii}"] = co2_levels["co2"].shift(ii)

co2_levels.head(10)

In [None]:
co2_levels.plot()

In [None]:
co2_levels.corr()
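
The shift-and-correlate idea can be condensed into a small helper and compared with the built-in `Series.autocorr`. This is a sketch on a synthetic weekly signal; `lag_correlation` and the signal itself are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly signal: slow upward trend plus a 52-week cycle.
t = np.arange(520)
signal = pd.Series(
    0.02 * t + np.sin(2 * np.pi * t / 52),
    index=pd.date_range("2000-01-01", periods=len(t), freq="7D"),
)

def lag_correlation(series: pd.Series, lag: int) -> float:
    """Pearson correlation between a series and its copy delayed by `lag` steps."""
    return series.corr(series.shift(lag))

manual = {lag: lag_correlation(signal, lag) for lag in range(1, 8)}
builtin = {lag: signal.autocorr(lag) for lag in range(1, 8)}

for lag in manual:
    print(f"lag {lag}: manual={manual[lag]:.4f}  autocorr={builtin[lag]:.4f}")
```

A strongly trending series keeps high correlations at every short lag, which is why the lag columns look nearly identical when plotted.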

... and now automatically 

In [None]:
from statsmodels.graphics import tsaplots
tsaplots.plot_acf(co2_levels["co2"], lags=60);

**Interpret autocorrelation plots**

If the autocorrelation at a given lag is close to 0, observations separated by that lag are not correlated with one another.
Conversely, autocorrelation values close to 1 or -1 indicate strong positive or negative correlation, respectively, between
observations separated by that lag.

To help you assess how trustworthy these autocorrelation values are, the `plot_acf()` function also draws
confidence intervals (represented as blue shaded regions). If an autocorrelation value goes beyond the confidence interval
region, you can assume that the observed autocorrelation value is statistically significant.

## Time series decomposition

In [None]:
import statsmodels.api as sm

`statsmodels` is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data
exploration.

In [None]:
decomposition = sm.tsa.seasonal_decompose(co2_levels["co2"])          

How to fix the error we've just received

In [None]:
co2_levels.index

In [None]:
# built-in pandas attribute not really helpful...
co2_levels.index.inferred_freq

In [None]:
freqs = co2_levels.index[1:] - co2_levels.index[:-1]
freqs

In [None]:
freqs.value_counts()

In [None]:
co2_levels_7d  = co2_levels.asfreq("7D").ffill()

In [None]:
co2_levels_7d.index
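
The diagnosis and fix can be reproduced on a small made-up series with a single missing week. The data below is synthetic and only meant to show why `inferred_freq` comes back empty and what `asfreq` does about it.

```python
import pandas as pd

# Hypothetical weekly series with one missing week.
full_index = pd.date_range("2020-01-04", periods=6, freq="7D")
weekly = pd.Series(range(6), index=full_index, dtype="float64").drop(full_index[3])

print(weekly.index.inferred_freq)  # None: the gap breaks the regular spacing

# Spacing between consecutive timestamps reveals the dominant step and the gaps.
steps = weekly.index.to_series().diff().dropna()
print(steps.value_counts())

# asfreq("7D") rebuilds a regular 7-day grid; ffill fills the re-created date.
regular = weekly.asfreq("7D").ffill()
print(regular.index.freq)
```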

In [None]:
# Let's try again to decompose the TS
decomposition = sm.tsa.seasonal_decompose(co2_levels_7d["co2"])

# Temporarily changing plotting params 
with plt.rc_context({"figure.figsize": (11, 9)}):
    decomposition.plot()                                      

The following additive model is fit to the data

Y[t] = T[t] + S[t] + e[t]

The results are obtained by first estimating the trend by applying a convolution filter to the data. The trend is then removed
from the series and the average of this de-trended series for each period is the returned seasonal component.
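
The procedure described above can be sketched by hand with pandas. This is a simplified illustration on noise-free synthetic data, not the `statsmodels` implementation: it assumes an odd period so that a plain centred moving average works, and all names and values are made up.

```python
import numpy as np
import pandas as pd

period = 5                                        # hypothetical, odd period
pattern = np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # zero-mean seasonal pattern
n = 10 * period
t = np.arange(n)

# Y[t] = T[t] + S[t] with a linear trend and no noise.
y = pd.Series(0.5 * t + np.tile(pattern, n // period))

# 1. Trend: centred moving average over one full period.
trend = y.rolling(window=period, center=True).mean()

# 2. Seasonal: average the de-trended values at each position within the
#    period, then centre the pattern so that it averages to zero.
detrended = y - trend
position = t % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means -= seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=y.index)

# 3. Residual: whatever trend and seasonal component do not explain.
resid = y - trend - seasonal

print(seasonal_means)
```

The moving average cannot be computed at the edges of the series, so the trend and residual start and end with a few NaN values.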

### Extracting components from TS decomposition

In [None]:
print(decomposition.seasonal)

In [None]:
seas = decomposition.seasonal
fig, ax = plt.subplots(figsize=(11,3))
seas.plot(ax=ax)
ax.set(xlabel="Date",
       title="Seasonal values of time series");

In [None]:
trend = decomposition.trend
fig, ax = plt.subplots(figsize=(11,3))
trend.plot(ax=ax)
ax.set(xlabel="Date",
       title="Trend values of time series");

However, the analysis does not **automatically** provide an answer to the question:   **What is the seasonality of my data?**

Eyeballing the seasonal pattern, we can infer that the time series has annual seasonality, with peaks in May/June.

_Could we think of an automatic approach to infer the seasonality period?_
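
One possible approach, offered as a sketch rather than as the intended answer, is to remove the trend and look for the dominant frequency in a periodogram. The example assumes a synthetic weekly series with a 52-week cycle; the data and all names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly series: linear trend + 52-week cycle + noise.
n = 520
t = np.arange(n)
y = 0.03 * t + np.sin(2 * np.pi * t / 52) + rng.normal(scale=0.2, size=n)

# Remove a fitted linear trend so it does not dominate the low frequencies.
slope, intercept = np.polyfit(t, y, deg=1)
detrended = y - (slope * t + intercept)

# Periodogram: squared magnitude of the Fourier transform at each frequency.
power = np.abs(np.fft.rfft(detrended)) ** 2
freqs = np.fft.rfftfreq(n, d=1.0)  # cycles per observation

# Skip the zero frequency and pick the strongest remaining component.
peak = 1 + np.argmax(power[1:])
estimated_period = 1.0 / freqs[peak]
print(f"Estimated period: {estimated_period:.1f} observations")
```

The estimate is only as fine as the frequency grid allows, so this works best when the series covers many full cycles and the trend is close to linear.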