<a href="https://colab.research.google.com/github/atlas-github/nih_time_series_nlp/blob/main/nih_time_series_nlp_day2_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 09:00 am: Feature engineering for time series data

### Creating Lag Features
Lag features are previous values of the time series that can help capture temporal dependencies.
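
A minimal sketch of lag features with `DataFrame.shift`, built on the same sample series used in this section. The column names `lag_1` and `lag_2` are illustrative choices, not names taken from the notebook.

```python
import pandas as pd

# Same sample values as the exercise cell in this section
df = pd.DataFrame(
    {"value": [10, 12, 15, 14, 16, 18, 20, 22, 21, 23]},
    index=pd.date_range(start="2024-01-01", periods=10, freq="D", name="date"),
)

# shift(k) moves values k rows down, so each row sees the value from k days earlier
df["lag_1"] = df["value"].shift(1)
df["lag_2"] = df["value"].shift(2)

# The first k rows have no history and hold NaN; drop them before modelling
df_lagged = df.dropna()
print(df_lagged.head())
```

Rows without enough history hold `NaN`, which is why `dropna()` is applied before the lagged columns are used as model inputs.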

In [None]:
import pandas as pd

# Sample data
data = {
    'date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
    'value': [10, 12, 15, 14, 16, 18, 20, 22, 21, 23]
}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# create lag features
# refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html


### Rolling Statistics (Moving Average)

Rolling statistics help in smoothing the time series data and identifying trends.
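
One possible way to compute a 3-day moving average with `rolling`, shown on the same sample values. The variable names are illustrative.

```python
import pandas as pd

values = pd.Series(
    [10, 12, 15, 14, 16, 18, 20, 22, 21, 23],
    index=pd.date_range(start="2024-01-01", periods=10, freq="D"),
    name="value",
)

# Trailing 3-day mean; the first two entries are NaN because the window is not yet full
rolling_mean_3 = values.rolling(window=3).mean()

# min_periods=1 averages whatever is available instead of returning NaN
rolling_mean_3_partial = values.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"value": values, "ma_3": rolling_mean_3, "ma_3_partial": rolling_mean_3_partial}))
```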

In [None]:
# moving average with a window of 3 days
# refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html


### Fourier Transform
Fourier Transform helps in analyzing the frequency components of the time series.
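
A small sketch of finding the dominant cycle in a series. It assumes a synthetic daily signal with a 7-day cycle plus noise (not data from the notebook) and uses `np.fft.rfft`, the real-input variant of `np.fft.fft`.

```python
import numpy as np

# Assumed synthetic signal: 52 weeks of daily samples with a weekly cycle plus noise
n = 364
t = np.arange(n)
rng = np.random.default_rng(0)
signal = 3.0 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=n)

# rfft returns only the non-negative frequencies of a real-valued signal;
# subtracting the mean removes the zero-frequency (constant) component
spectrum = np.fft.rfft(signal - signal.mean())
freqs = np.fft.rfftfreq(n, d=1.0)      # in cycles per day
amplitude = 2 * np.abs(spectrum) / n   # rescaled to the amplitude of each sinusoid

dominant_freq = freqs[np.argmax(amplitude)]
dominant_period = 1 / dominant_freq
print(f"Dominant period: {dominant_period:.1f} days")
```

Plotting `amplitude` against `freqs` with matplotlib would show a single sharp peak at the weekly frequency.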

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# generate sample data
# refer to https://numpy.org/doc/stable/reference/generated/numpy.fft.fft.html


# plot the Fourier Transform



### Exponential Moving Average

The Exponential Moving Average (EMA) gives more weight to recent observations.
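
A minimal sketch of an EMA with `ewm`, on the same sample values. The choice `adjust=False` is an assumption made here so that the recursive formula is easy to check by hand.

```python
import pandas as pd

values = pd.Series(
    [10, 12, 15, 14, 16, 18, 20, 22, 21, 23],
    index=pd.date_range(start="2024-01-01", periods=10, freq="D"),
    name="value",
)

# span=3 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.5.
# With adjust=False: ema[0] = x[0], ema[t] = alpha * x[t] + (1 - alpha) * ema[t-1]
ema_3 = values.ewm(span=3, adjust=False).mean()
print(ema_3.head())
```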

In [None]:
# exponential moving average with a span of 3 days
# refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ewm.html


## 09:45 am: Time series data normalization and scaling

### Normalization
Normalization (or Min-Max Scaling) scales the data to a fixed range, usually [0, 1].
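
A short sketch of min-max scaling with scikit-learn's `MinMaxScaler`, using the sample values from this section. The column name `value_normalized` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame(
    {"value": [10, 12, 15, 14, 16, 18, 20, 22, 21, 23]},
    index=pd.date_range(start="2024-01-01", periods=10, freq="D", name="date"),
)

scaler = MinMaxScaler()  # default feature_range is (0, 1)

# scikit-learn expects a 2-D input, hence the double brackets
df["value_normalized"] = scaler.fit_transform(df[["value"]]).ravel()
print(df)
```

When the scaled series feeds a forecasting model, fitting the scaler on the training period only and reusing it on later data avoids leaking future information into the past.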

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data
data = {
    'date': pd.date_range(start='2024-01-01', periods=10, freq='D'),
    'value': [10, 12, 15, 14, 16, 18, 20, 22, 21, 23]
}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# normalizing
# refer to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html



### Standardization
Standardization (or Z-score normalization) scales the data so that it has a mean of 0 and a standard deviation of 1.
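
A comparable sketch for standardization with `StandardScaler`. The column name `value_standardized` is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(
    {"value": [10, 12, 15, 14, 16, 18, 20, 22, 21, 23]},
    index=pd.date_range(start="2024-01-01", periods=10, freq="D", name="date"),
)

scaler = StandardScaler()
df["value_standardized"] = scaler.fit_transform(df[["value"]]).ravel()

# StandardScaler divides by the population standard deviation (ddof=0),
# whereas pandas' .std() defaults to the sample version (ddof=1)
print(df["value_standardized"].mean(), df["value_standardized"].std(ddof=0))
```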

In [None]:


# standardizing
# refer to https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html


### Effects of Scaling on Time Series Analysis
Scaling can impact time series models and their predictions. Let’s compare how normalization and standardization affect a simple moving average.
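
The comparison can be sketched without plotting. The example below writes both scalings out by hand (an assumption made to keep it dependency-free) and checks that a moving average of the scaled data equals the scaled moving average of the raw data.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 15, 14, 16, 18, 20, 22, 21, 23], dtype=float)

# Min-max normalization and z-score standardization written out by hand
normalized = (values - values.min()) / (values.max() - values.min())
standardized = (values - values.mean()) / values.std(ddof=0)

ma_raw = values.rolling(window=3).mean()
ma_normalized = normalized.rolling(window=3).mean()
ma_standardized = standardized.rolling(window=3).mean()

# Both scalings have the form a * x + b, and a moving average is linear,
# so scaling the raw moving average gives the same curve as averaging the scaled data
ma_raw_then_normalized = (ma_raw - values.min()) / (values.max() - values.min())
ma_raw_then_standardized = (ma_raw - values.mean()) / values.std(ddof=0)

same_normalized = np.allclose(ma_normalized.dropna(), ma_raw_then_normalized.dropna())
same_standardized = np.allclose(ma_standardized.dropna(), ma_raw_then_standardized.dropna())
print(same_normalized, same_standardized)
```

In other words, the shape of the smoothed curve is unchanged and only the y-axis units differ. Scaling matters more for models that are sensitive to the magnitude of their inputs.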

In [None]:

# compute moving averages before scaling


# compute moving averages after normalization


# compute moving averages after standardization


# plot


## 11:00 am: Practical session 4


In [None]:
# install the gdown library
!pip install gdown



In [None]:
# download dengue csv
import pandas as pd
import gdown

# insert file_id
file_id = '1F-faNnQoyhdjbuyEVZPHV1cv_h0h5UDl'
url = f'https://drive.google.com/uc?id={file_id}'

# download the CSV file
gdown.download(url, 'dengue.csv', quiet=False)

# read the csv file into a DataFrame
df_dengue = pd.read_csv('dengue.csv')
df_dengue

In [None]:
# filter to only needed columns
df_dengue_filtered = df_dengue[["NO_KES", "TRK_NOTI", "EPID_MINGG", "EPID_TAHUN", "JNS_KES", "NO_RUMAH",
                                "POSKOD", "LOKALITI", "MUKIM", "DAERAH", "NEGERI", "LATITUDE", "LONGITUDE", "PENTADBIRA",
                                "WARGANEGAR", "STATUS_PEN", "STATUS_LOK"]].copy()  # copy to avoid SettingWithCopyWarning below
# convert to datetime
df_dengue_filtered['TRK_NOTI'] = pd.to_datetime(df_dengue_filtered['TRK_NOTI'], format="%Y/%m/%d")

# count records per postcode to see which postcodes have the most (and fewest) notifications
postcode_counts = df_dengue_filtered['POSKOD'].value_counts().reset_index()

# rename the columns
postcode_counts.columns = ['postcode', 'count']

postcode_counts

In [None]:
# filter to only postcode 47620


In [None]:
# reset index


In [None]:
# Create pivot table

In [None]:
# reset index to make 'date' a column again


# melt the DataFrame to long format


# extract the case type from the 'case_type' column


In [None]:
# filter to only wabak


In [None]:
import pandas as pd

# sample data


# create lag features


In [None]:
# reset index


In [None]:
# create the plot


# customize the plot



# show the plot


In [None]:
# retrieve external data example: https://open-meteo.com/en/docs/historical-weather-api#start_date=2014-12-28&end_date=2022-02-06&hourly=&daily=rain_sum&timezone=Asia%2FSingapore


In [None]:
# convert to datetime


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# create the main figure and axis


# first plot: Dengue case counts and lag_2 on the primary y-axis


# customize the first axis


# create a secondary y-axis for rainfall


# second plot: Daily rainfall on the secondary y-axis


# customize the second axis


# adjust layout


# show the plot


## 12:00 pm: Time series decomposition

Time series decomposition is a useful technique for analyzing the components of a time series. It typically breaks down a time series into trend, seasonality, and residuals (or noise). There are two main types of decomposition: additive and multiplicative.

### Additive Decomposition
In additive decomposition, the time series is assumed to be the sum of the trend, seasonality, and residuals.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# generate sample data
np.random.seed(0)
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
trend = np.linspace(10, 20, 100)


# create DataFrame


# perform additive decomposition

# plotting the components




### Multiplicative Decomposition
In multiplicative decomposition, the time series is assumed to be the product of the trend, seasonality, and residuals.

In [None]:
# perform multiplicative decomposition


# plot the components


### Comparing Additive vs. Multiplicative Decomposition
It's useful to compare how the decompositions differ, especially if the data exhibits multiplicative seasonality or varying amplitude.

In [None]:
# generate sample data with multiplicative seasonality


# create DataFrame for multiplicative example


# perform decompositions


# plot the results for both decompositions



## 02:00 pm: Practical session 5

### 11 classical time series forecasting methods in Python (article link [here](https://machinelearningmastery.com/time-series-forecasting-methods-in-python-cheat-sheet/))

#### Autoregressive (AR) Model
The autoregression (AR) method predicts the subsequent value in a sequence using a linear combination of previous observations.

The notation for the model involves specifying the order of the model p as a parameter to the AR function, e.g. AR(p). For example, AR(1) is a first-order autoregression model.

The method is best suited for single-variable time series that lack trend and seasonal components.

In [None]:
# AR example


# contrived dataset


# fit model


# make prediction


#### Moving Average (MA) Model
The Moving Average (MA) method models the next step in the sequence as a linear function of the residual errors from a mean process at prior time steps.

It’s important to note that a Moving Average model is different from calculating the moving average of the time series.

The notation for the model involves specifying the order of the model q as a parameter to the MA function, e.g. MA(q). For example, MA(1) is a first-order moving average model.

The method is suitable for univariate time series without trend and seasonal components.

We can use the ARIMA class to create an MA model and set a zeroth-order AR model. We must specify the order of the MA model in the order argument.

In [None]:
# MA example


# contrived dataset


# fit model


# make prediction



#### Autoregressive Moving Average (ARMA)
The Autoregressive Moving Average (ARMA) method models the next step in the sequence as a linear function of the observations and residual errors at prior time steps.

The method combines both Autoregression (AR) and Moving Average (MA) models.

To represent the model, the notation involves specifying the order for the AR(p) and MA(q) models as parameters to an ARMA function, e.g. ARMA(p, q). An ARIMA model can be used to develop AR or MA models.

The method is suitable for univariate time series without trend and seasonal components.

In [None]:
# ARMA example


# contrived dataset


# fit model


# make prediction



#### Autoregressive Integrated Moving Average (ARIMA)

The Autoregressive Integrated Moving Average (ARIMA) method models the next step in the sequence as a linear function of the differenced observations and residual errors at prior time steps.

The method integrates the principles of Autoregression (AR) and Moving Average (MA) models as well as a differencing pre-processing step of the sequence to make the sequence stationary, called integration (I).

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function, e.g. ARIMA(p, d, q). An ARIMA model can also be used to develop AR, MA, and ARMA models.

The method is suitable for univariate time series with trend and without seasonal components.

In [None]:
# ARIMA example


# contrived dataset


# fit model


# make prediction



#### Seasonal Autoregressive Integrated Moving-Average (SARIMA)

The Seasonal Autoregressive Integrated Moving Average (SARIMA) method models the next step in the sequence based on a linear blend of differenced observations, errors, differenced seasonal observations, and seasonal errors at prior time steps.

SARIMA enhances the ARIMA model with the ability to perform the same autoregression, differencing, and moving average modeling at the seasonal level.

The notation for the model involves specifying the order for the AR(p), I(d), and MA(q) models as parameters to an ARIMA function and AR(P), I(D), MA(Q) and m parameters at the seasonal level, e.g. SARIMA(p, d, q)(P, D, Q)m where “m” is the number of time steps in each season (the seasonal period). A SARIMA model can be used to develop AR, MA, ARMA and ARIMA models.

The method is suitable for univariate time series with trend and/or seasonal components.

In [None]:
# SARIMA example


# contrived dataset


# fit model


# make prediction



#### Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX)

The Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors (SARIMAX) is an extension of the SARIMA model that also includes the modeling of exogenous variables.

Exogenous variables are also called covariates and can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series may be referred to as endogenous data to contrast it from the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The SARIMAX method can also be used to model the subsumed models with exogenous variables, such as ARX, MAX, ARMAX, and ARIMAX.

The method is suitable for univariate time series with trend and/or seasonal components and exogenous variables.

In [None]:
# SARIMAX example


# contrived dataset


# fit model


# make prediction



#### Vector Autoregression (VAR)

In [None]:
# contrived dataset with dependency


# convert to NumPy array


# fit model

# make prediction


# pass the last known observations as input for forecasting

# get the number of lags used in the model


# Get the last 'lag_order' observations


#### Vector Autoregression Moving-Average (VARMA)

The Vector Autoregression Moving-Average (VARMA) method models the next step in each of several time series using an ARMA model. It is the generalization of ARMA to multiple parallel time series, i.e. multivariate time series.

The notation for the model involves specifying the order for the AR(p) and MA(q) models as parameters to a VARMA function, e.g. VARMA(p, q). A VARMA model can also be used to develop VAR or VMA models.

The method is suitable for multivariate time series without trend and seasonal components.

In [None]:
# VARMA example


# contrived dataset with dependency


# fit model


# make prediction



#### Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX)

The Vector Autoregression Moving-Average with Exogenous Regressors (VARMAX) extends the VARMA model to also include the modelling of exogenous variables. It is a multivariate version of the ARMAX method.

Exogenous variables, also called covariates, can be thought of as parallel input sequences that have observations at the same time steps as the original series. The primary series are referred to as endogenous data to contrast them with the exogenous sequence(s). The observations for exogenous variables are included in the model directly at each time step and are not modeled in the same way as the primary endogenous sequence (e.g. as an AR, MA, etc. process).

The VARMAX method can also be used to model the subsumed models with exogenous variables, such as VARX and VMAX.

The method is suitable for multivariate time series without trend and seasonal components, and with exogenous variables.

In [None]:
# VARMAX example


# contrived dataset with dependency


# fit model


# make prediction


#### Simple Exponential Smoothing (SES)

The Simple Exponential Smoothing (SES) method models the next time step as an exponentially weighted linear function of observations at prior time steps.

The method is suitable for univariate time series without trend and seasonal components.

In [None]:
# SES example


# contrived dataset


# fit model


# make prediction



#### Holt-Winters Exponential Smoothing (HWES)

In [None]:
# HWES example


# contrived dataset


# fit model


# make prediction


### Link to [Prophet](https://facebook.github.io/prophet/)

In [None]:
# install prophet library
!pip install prophet

In [None]:


# Generate sample data


# plot original data


In [None]:
# initialize and fit Prophet model


In [None]:
# create a dataframe for future dates


# make predictions


# plot the forecast


In [None]:
# add custom seasonality


In [None]:
# define holidays


# add holidays to the model


# create future dataframe and make predictions



# Plot forecast with holidays


In [None]:
# introduce missing data


# introduce outliers



In [None]:
# fit Prophet model with missing data


# make predictions



# plot


1.   Basic Prophet Forecasting: Easy to use and fits well with time series data exhibiting seasonality and trends.
2.   Custom Seasonality: Allows for the addition of custom seasonal effects.
3.   Holidays: Can incorporate holiday effects to adjust forecasts.
4.   Missing Data and Outliers: Prophet fits on whichever observations are present, so gaps in the history need no imputation. Outliers are usually best handled by setting them to missing before fitting.

## 04:00 pm: Resampling and time series frequency conversion

#### Downsampling
Description: Downsampling involves reducing the frequency of your time series data. For instance, converting daily data to monthly data.

Example: Downsampling daily data to monthly data.
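
A minimal downsampling sketch on the same kind of daily data as the exercise cell. It assumes the monthly mean is the aggregate of interest and uses the month-start alias `"MS"`, which labels each bin by the first day of the month.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
dates = pd.date_range(start="2024-01-01", periods=365, freq="D")
df = pd.DataFrame({"value": np.random.randn(365)}, index=dates)

# one row per calendar month, labelled by the first day of that month
monthly_mean = df["value"].resample("MS").mean()
print(monthly_mean)
```

For month-end labels, the alias is `"ME"` in pandas 2.2 and later and `"M"` in earlier releases.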

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# generate sample daily data
np.random.seed(0)
dates = pd.date_range(start='2024-01-01', periods=365, freq='D')
data = np.random.randn(365)
df = pd.DataFrame({'date': dates, 'value': data})
df.set_index('date', inplace=True)

# downsample to monthly frequency


# plot original and resampled data


Pros:

1.   Reduces the amount of data, which can simplify analysis.
2.   Useful for identifying long-term trends.

Cons:

1.   Loss of detail due to aggregation.
2.   May miss short-term variations.

#### Upsampling
Description: Upsampling involves increasing the frequency of your time series data. For example, converting monthly data to daily data.

Example: Upsampling monthly data to daily data and filling in missing values.
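
A small upsampling sketch. The three monthly values are assumptions chosen so the result is easy to read. Forward fill and linear interpolation are shown side by side as two possible filling strategies.

```python
import pandas as pd

# Assumed monthly data, stamped on the first day of each month
monthly = pd.Series(
    [100.0, 110.0, 120.0],
    index=pd.date_range(start="2024-01-01", periods=3, freq="MS"),
    name="value",
)

# forward fill repeats the last known monthly value on every day until the next one
daily_ffill = monthly.resample("D").ffill()

# linear interpolation draws a straight line between consecutive monthly values
daily_interp = monthly.resample("D").interpolate(method="linear")

print(pd.DataFrame({"ffill": daily_ffill, "interpolated": daily_interp}).head(10))
```

Neither series contains new information: the daily values are entirely a product of the filling rule.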

In [None]:
# generate sample monthly data


# upsample to daily frequency
# forward fill to handle missing values

# plot original and upsampled data


Pros:

1.  Increases data granularity.
2.  Useful for detailed analysis and interpolating data.

Cons:

1.  May introduce artificial patterns.
2.  Requires careful handling of missing values.

#### Resampling with Aggregation

Description: Aggregating data with a custom function during resampling. For example, calculating the sum or median during resampling.

Example: Resampling daily data to weekly data and calculating the weekly sum.
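
A sketch of weekly aggregation. The four weeks of consecutive integers are an assumption that makes the sums easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Four full weeks of daily data, starting on Monday 2024-01-01
dates = pd.date_range(start="2024-01-01", periods=28, freq="D")
df = pd.DataFrame({"value": np.arange(1, 29)}, index=dates)

# "W" bins are weeks ending on Sunday, labelled by that Sunday
weekly_sum = df["value"].resample("W").sum()

# several aggregations can be computed in one pass
weekly_stats = df["value"].resample("W").agg(["sum", "median"])
print(weekly_stats)
```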

In [None]:
# resample daily data to weekly data and calculate the sum


# plot original and resampled data


Pros:

1.  Provides various aggregation options such as sum, mean, median, etc.
2.  Useful for understanding data patterns over different time periods.

Cons:

1.  Aggregation can mask fluctuations.
2.  Requires appropriate choice of aggregation method.

#### Frequency Conversion

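Description: Converting time series data to a different frequency without aggregating. Values are selected at the timestamps of the new frequency (e.g. with `asfreq`), and timestamps with no matching observation become missing.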

Example: Converting data from daily frequency to business day frequency.
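
A minimal frequency-conversion sketch with `asfreq`. The two weeks of integer values are an assumption. Note that `"B"` only skips weekends and has no knowledge of public holidays.

```python
import numpy as np
import pandas as pd

# Two weeks of daily data, starting on Monday 2024-01-01
dates = pd.date_range(start="2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"value": np.arange(14)}, index=dates)

# keep only the rows that fall on business days (Monday to Friday); nothing is aggregated
business = df.asfreq("B")

# converting back to daily re-creates the weekend timestamps, now with missing values
back_to_daily = business.asfreq("D")
back_to_daily_filled = business.asfreq("D", method="ffill")

print(business)
print(back_to_daily["value"].isna().sum())
```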

In [None]:
# generate sample daily data


# convert to business day frequency
# 'B' for business day frequency

# plot original and converted data


Pros:

1.  Allows for conversion to different frequencies based on needs.
2.  Can handle specific business-related requirements.

Cons:

1.  May result in missing values if frequency conversion does not align with original data.
2.  Requires appropriate filling strategy for missing data.