# Problem Statement
We will investigate how TimeGAN can help us understand the distribution of future PM2.5 time-series. We will answer questions such as: how much worse can air quality get within a given time frame? We will also look at how well TimeGAN can generate a PM2.5 time-series. TimeGAN was developed by Yoon et al.; the architecture is described in detail at https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks

In [9]:
# Necessary Packages
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# TimeGAN model
from timegan import timegan
# Data Loading
from data_loading import real_data_loading
# Metrics
from metrics.discriminative_metrics import discriminative_score_metrics
from metrics.predictive_metrics import predictive_score_metrics
from metrics.visualization_metrics import visualization

import os
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Data
We will be using the NAPA datasets from 2007 to 2020, from which we are interested in the daily mean PM2.5 concentration and the daily AQI.

In [2]:
# data_path = "data/"
# data = [pd.read_csv(data_path + file) for file in os.listdir(data_path)]
# master_napa = pd.concat(data).reset_index()

In [3]:
# # obtain only daily AQI and mean daily PM2.5 columns
# pm_data = master_napa[["Date", "Daily Mean PM2.5 Concentration", "DAILY_AQI_VALUE"]]
# pm_data.to_csv("data/napa_master_date.csv", index=False)

In [4]:
# beijing_data = pd.read_csv("data/beijing.csv")
# beijing_master = beijing_data.drop(columns=["No", "cbwd"])
# beijing_master.to_csv("data/beijing_master.csv", index=False)

In [5]:
# use TimeGAN's data loading 
data_name = 'beijing'
sequence_length = 24
ori_data = real_data_loading(data_name, sequence_length)
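
`real_data_loading` comes from the TimeGAN code base and its internals are not shown in this notebook. As a rough illustration of what a loader of this kind typically does, the sketch below scales each column to [0, 1] and cuts the table into overlapping windows of `sequence_length` rows. `make_windows` and the toy array are hypothetical names used only for this example; the real helper may order, scale, or shuffle the windows differently.

```python
import numpy as np

def make_windows(values, seq_len):
    """Min-max scale each column to [0, 1], then cut overlapping windows."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    scaled = (values - col_min) / (col_range + 1e-7)  # small constant avoids division by zero
    # One window per start index; consecutive windows overlap by seq_len - 1 rows.
    windows = [scaled[i:i + seq_len] for i in range(len(scaled) - seq_len + 1)]
    return np.stack(windows)

# Toy stand-in for the real table (assumption): 100 hourly rows, 3 features.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 3))
toy_windows = make_windows(toy, seq_len=24)
print(toy_windows.shape)  # (77, 24, 3)
```

The result has shape (number of windows, sequence length, number of features), which is the layout the rest of the notebook assumes for `ori_data`.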

# Network Parameters
The following parameters will be evaluated experimentally:
* module: gru, lstm, or lstmLN
* hidden_dim: hidden dimensions
* num_layer: number of layers
* iterations: number of training iterations
* batch_size: the number of samples in each batch
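
To evaluate these parameters experimentally, one option is to enumerate every combination and train one model per configuration. The sketch below only builds the list of parameter dictionaries; the candidate values in `search_space` are hypothetical and chosen for illustration, and `parameter_grid` is not part of the TimeGAN code.

```python
from itertools import product

# Hypothetical candidate values; this notebook itself trains a single configuration.
search_space = {
    'module': ['gru', 'lstm', 'lstmLN'],
    'hidden_dim': [12, 24],
    'num_layer': [2, 3],
}
# Values held constant across all runs in this sketch.
fixed = {'iterations': 10000, 'batch_size': 128}

def parameter_grid(space, fixed_values):
    """Yield one parameter dict per combination of the values in `space`."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        params = dict(zip(keys, combo))
        params.update(fixed_values)
        yield params

configs = list(parameter_grid(search_space, fixed))
print(len(configs))  # 3 * 2 * 2 = 12
print(configs[0])
```

Each dictionary has the same keys as the `parameters` dictionary used in this notebook, so it could be passed to `timegan(ori_data, params)` in a loop. Keep in mind that a full grid multiplies the training time by the number of configurations.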

In [6]:
parameters = dict()

parameters['module'] = 'gru' 
parameters['hidden_dim'] = 24
parameters['num_layer'] = 3
parameters['iterations'] = 10000 # default 10000
parameters['batch_size'] = 128

In [11]:
generated_data = timegan(ori_data, parameters) 

Start Embedding Network Training
step: 0/10000, e_loss: 0.3535
step: 100/10000, e_loss: 0.2296
step: 200/10000, e_loss: 0.2025
step: 300/10000, e_loss: 0.1927
step: 400/10000, e_loss: 0.187
step: 500/10000, e_loss: 0.1693
step: 600/10000, e_loss: 0.1585
step: 700/10000, e_loss: 0.1449
step: 800/10000, e_loss: 0.1347
step: 900/10000, e_loss: 0.1111
step: 1000/10000, e_loss: 0.0972
step: 1100/10000, e_loss: 0.0856
step: 1200/10000, e_loss: 0.0789
step: 1300/10000, e_loss: 0.0682
step: 1400/10000, e_loss: 0.0664
step: 1500/10000, e_loss: 0.0678
step: 1600/10000, e_loss: 0.0684
step: 1700/10000, e_loss: 0.0655
step: 1800/10000, e_loss: 0.0644
step: 1900/10000, e_loss: 0.0671
step: 2000/10000, e_loss: 0.0612
step: 2100/10000, e_loss: 0.0564
step: 2200/10000, e_loss: 0.061
step: 2300/10000, e_loss: 0.0579
step: 2400/10000, e_loss: 0.0569
step: 2500/10000, e_loss: 0.0589
step: 2600/10000, e_loss: 0.0559
step: 2700/10000, e_loss: 0.0516
step: 2800/10000, e_loss: 0.0503
step: 2900/10000, e_loss

step: 1500/10000, d_loss: 1.8327, g_loss_u: 0.9513, g_loss_s: 0.0247, g_loss_v: 0.0394, e_loss_t0: 0.0197


KeyboardInterrupt: 

# TimeGAN Evaluation
Now that TimeGAN has generated data, we want to evaluate its performance based on three metrics: discriminative score, predictive score, and visualization techniques.

### 1.  Discriminative Score
Following Yoon et al., the goal is "to evaluate the classification accuracy between original and synthetic data using post-hoc RNN network". We label original sequences as "real" and generated sequences as "fake", train an off-the-shelf RNN classifier to tell them apart, and report the classification error on the test set.

In [None]:
metric_iteration = 5
rnn_iterations = 2000 # default 2000

In [None]:

discriminative_score = list()
for _ in range(metric_iteration):
  temp_disc = discriminative_score_metrics(ori_data, generated_data, rnn_iterations)
  discriminative_score.append(temp_disc)

print('Discriminative score: ' + str(np.round(np.mean(discriminative_score), 4)))
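
To make the number printed above easier to interpret, here is a minimal sketch of how a score can be derived from the classifier's test predictions. We assume the score is reported as |0.5 - accuracy|, so that 0 means the classifier does no better than chance and 0.5 means it separates real from synthetic perfectly. This is an assumption about `discriminative_score_metrics`, whose source is not shown in this notebook, and `discriminative_score_from_predictions` is a hypothetical helper.

```python
import numpy as np

def discriminative_score_from_predictions(y_true, y_prob, threshold=0.5):
    """Distance of the test accuracy from chance level (assumed definition)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) > threshold).astype(int)
    accuracy = np.mean(y_pred == y_true)
    return abs(0.5 - accuracy)

# Labels: 1 = real sequence, 0 = synthetic sequence.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = np.array([0.9, 0.8, 0.7, 0.6, 0.1, 0.2, 0.3, 0.4])   # every sequence classified correctly
confused = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])  # half right, half wrong

print(discriminative_score_from_predictions(labels, perfect))   # 0.5
print(discriminative_score_from_predictions(labels, confused))  # 0.0
```

Under this convention, lower is better: a generator whose samples cannot be distinguished from real data yields a score near 0.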


### 2. Predictive Score
Now we will use a post-hoc RNN to predict one step ahead in the time-series and evaluate the performance of the prediction in terms of mean absolute error (MAE).

In [None]:
predictive_score = list()
for _ in range(metric_iteration):
  temp_pred = predictive_score_metrics(ori_data, generated_data, rnn_iterations)
  predictive_score.append(temp_pred)

print('Predictive score: ' + str(np.round(np.mean(predictive_score), 4)))
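
The one-step-ahead setup can be sketched as follows: the inputs are each sequence without its last step, the targets are the same sequence shifted by one step, and the score is the mean absolute difference between predictions and targets. The sketch uses a naive persistence predictor in place of the RNN, so it only illustrates the metric. `one_step_mae` and `persistence` are hypothetical names, and `predictive_score_metrics` may select input and target features differently.

```python
import numpy as np

def one_step_mae(sequences, predict_fn):
    """MAE of one-step-ahead predictions over all sequences and time steps."""
    sequences = np.asarray(sequences, dtype=float)
    inputs = sequences[:, :-1, :]   # steps 0 .. T-2
    targets = sequences[:, 1:, :]   # steps 1 .. T-1
    predictions = predict_fn(inputs)
    return float(np.mean(np.abs(predictions - targets)))

def persistence(inputs):
    # Naive baseline standing in for the RNN: the next value equals the current one.
    return inputs

# One toy sequence 0, 1, 2, 3, 4 with a single feature.
ramp = np.arange(5, dtype=float).reshape(1, 5, 1)
print(one_step_mae(ramp, persistence))  # 1.0
```

A trained predictor should beat this baseline on real data; comparing against it is a cheap sanity check on the reported score.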

### 3. Visualization of Data
We will use PCA and t-SNE to visualize the original and generated data.

In [None]:
visualization(ori_data, generated_data, 'pca')
visualization(ori_data, generated_data, 'tsne')
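
For readers who want to see what a PCA comparison involves, here is a minimal NumPy sketch. It assumes each sequence is first reduced to one vector by averaging over the feature axis, fits the projection on the real data only, and then projects both sets onto the first two components. `pca_2d` and the toy arrays are hypothetical, and the `visualization` function may preprocess the data differently.

```python
import numpy as np

def pca_2d(real, synthetic):
    """Project real and synthetic sequences onto the top two principal components of the real data."""
    # Collapse the feature axis so each sequence becomes one vector of length seq_len (assumption).
    real_flat = np.asarray(real, dtype=float).mean(axis=2)
    synth_flat = np.asarray(synthetic, dtype=float).mean(axis=2)
    # Fit on the real data only, then apply the same projection to both sets.
    center = real_flat.mean(axis=0)
    _, _, vt = np.linalg.svd(real_flat - center, full_matrices=False)
    components = vt[:2].T  # shape (seq_len, 2)
    return (real_flat - center) @ components, (synth_flat - center) @ components

# Toy stand-ins for ori_data and generated_data: 50 sequences, 24 steps, 3 features.
rng = np.random.default_rng(0)
toy_real = rng.normal(size=(50, 24, 3))
toy_synth = rng.normal(size=(50, 24, 3))
real_2d, synth_2d = pca_2d(toy_real, toy_synth)
print(real_2d.shape, synth_2d.shape)  # (50, 2) (50, 2)
```

Plotting `real_2d` and `synth_2d` in different colours on one scatter plot then shows how much the two point clouds overlap.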