<a href="https://colab.research.google.com/github/anra8571/INFO5871FinalProject/blob/main/DoppelGANger.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Note: You can select "GPU" on your Notebook
# Click "Runtime > Change runtime type" and select "T4 GPU"

In [2]:
#Uncomment to install the ydata-synthetic package
# !pip install ydata-synthetic==1.3.1

# Time Series Synthetic Data Generation with DoppelGANger

- DoppelGANger - Implemented according to the [paper](https://dl.acm.org/doi/pdf/10.1145/3419394.3423643)
- This notebook is an example of how DoppelGANger can be used to generate synthetic time-series data

## Dataset

- The data used in this notebook is the [Measuring Broadband America](https://www.fcc.gov/reports-research/reports/measuring-broadband-america/raw-data-measuring-broadband-america-seventh) (MBA) Dataset, freely available on the Federal Communications Commission (FCC) website. It is also available [here](https://drive.google.com/drive/folders/19hnyG8lN9_WWIac998rT6RtBB9Zit70X), and a CSV copy is provided for convenience [here](https://github.com/ydataai/ydata-synthetic/blob/dev/data/fcc_mba.csv). It comprises:
    - **2 continuous measurements** - traffic_byte_counter and ping_loss_rate
    - **3 categorical metadata features** - isp, technology, and state

In [None]:
!pip install ydata-synthetic==1.3.1

Collecting ydata-synthetic==1.3.1
  Downloading ydata_synthetic-1.3.1-py2.py3-none-any.whl.metadata (9.9 kB)
Collecting requests<2.31,>=2.30 (from ydata-synthetic==1.3.1)
  Downloading requests-2.30.0-py3-none-any.whl.metadata (4.6 kB)
Collecting pandas==2.0.* (from ydata-synthetic==1.3.1)
  Downloading pandas-2.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy==1.23.* (from ydata-synthetic==1.3.1)
  Downloading numpy-1.23.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting scikit-learn==1.2.* (from ydata-synthetic==1.3.1)
  Downloading scikit_learn-1.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting matplotlib==3.7.* (from ydata-synthetic==1.3.1)
  Downloading matplotlib-3.7.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting tensorflow==2.12.0 (from ydata-synthetic==1.3.1)
  Downloading tensorflow-2.12.0-cp310-cp310-manyli

In [None]:
# Importing the necessary modules
import pandas as pd
import matplotlib.pyplot as plt
from ydata_synthetic.synthesizers.timeseries import TimeSeriesSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

In [None]:
# Read the data
cntHb_data = pd.read_csv("cntHb_doppel_reduced.csv", usecols=range(1,4))
numerical_cols = ["cntHb"]
categorical_cols = [col for col in cntHb_data.columns if col not in numerical_cols]

In [None]:
# Preview the dataset
cntHb_data.head(28)

In [None]:
# Defining model and training parameters
model_args = ModelParameters(batch_size=100,
                             lr=0.001,
                             betas=(0.2, 0.9),
                             latent_dim=20,
                             gp_lambda=2,
                             pac=1)

train_args = TrainParameters(epochs=400,
                             sequence_length=28,
                             sample_length=7,
                             rounds=1,
                             measurement_cols=["cntHb"])
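
DoppelGANger generates each sequence in batches of `sample_length` time steps, so the two lengths above are related. The sketch below is a sanity check that can be run before training. It assumes that `sequence_length` has to be an exact multiple of `sample_length`; that assumption comes from the batched-generation design described in the paper and is not confirmed here against the library's source. `generation_steps` is a hypothetical helper, not part of ydata-synthetic.

```python
def generation_steps(sequence_length, sample_length):
    """Return how many generator passes are needed to emit one full sequence.

    Assumption: each pass emits `sample_length` consecutive time steps, so
    `sequence_length` must be an exact multiple of `sample_length`.
    """
    if sequence_length <= 0 or sample_length <= 0:
        raise ValueError("both lengths must be positive")
    if sequence_length % sample_length != 0:
        raise ValueError(
            f"sequence_length={sequence_length} is not a multiple of "
            f"sample_length={sample_length}"
        )
    return sequence_length // sample_length


# Values used in this notebook: 28 time steps, generated 7 at a time
steps = generation_steps(sequence_length=28, sample_length=7)
print(steps)  # 4
```

With the values used in this notebook, one sequence corresponds to 4 passes of 7 time steps each.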

In [None]:
# Training the DoppelGANger synthesizer
model_dop_gan = TimeSeriesSynthesizer(modelname='doppelganger',model_parameters=model_args)
model_dop_gan.fit(cntHb_data, train_args, num_cols=numerical_cols, cat_cols=categorical_cols)

In [None]:
# Generating new synthetic samples
synth_data = model_dop_gan.sample(n_samples=375000)
synth_df = pd.concat(synth_data, axis=0)

In [None]:
# Plot the real and synthetic cntHb values on a single set of axes
plt.figure(figsize=(10, 6))

plt.plot(cntHb_data['cntHb'].reset_index(drop=True), label='cntHb Real')
plt.plot(synth_df['cntHb'].reset_index(drop=True), label='cntHb Synthetic', alpha=0.7)
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('cntHb Comparison')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
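
An overlay plot can be hard to read when the real and synthetic series differ in length, so a numeric comparison may be a useful complement. The sketch below compares basic summary statistics of two value lists. It uses only the standard library and toy numbers; `summarize` and `compare_summaries` are hypothetical helpers, and in this notebook the inputs would be something like `cntHb_data['cntHb'].tolist()` and `synth_df['cntHb'].tolist()`.

```python
import statistics


def summarize(values):
    """Return count, mean, sample standard deviation, min and max of `values`."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        # stdev needs at least two points; report 0.0 for a single value
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def compare_summaries(real, synthetic):
    """Return synthetic-minus-real differences for each summary statistic."""
    real_stats = summarize(real)
    synth_stats = summarize(synthetic)
    return {
        key: synth_stats[key] - real_stats[key]
        for key in ("mean", "std", "min", "max")
    }


# Toy values for illustration only (not taken from the cntHb data)
real_values = [1.0, 2.0, 3.0, 4.0]
synthetic_values = [2.0, 3.0, 4.0, 5.0]
print(compare_summaries(real_values, synthetic_values))
```

Differences close to zero suggest that the synthetic values cover a similar range and spread as the real ones; they say nothing about temporal structure, which still needs a per-sequence comparison.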

In [None]:
# Divide the original data into consecutive sequences of fixed length
sequence_length = 28
mba_sequences = []

for i in range(0, len(cntHb_data), sequence_length):
    sequence = cntHb_data.iloc[i:i+sequence_length]
    mba_sequences.append(sequence)

print(f"Number of sequences: {len(mba_sequences)}")
print(f"Size of each sequence: {mba_sequences[0].shape} (rows x columns)")
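
The loop above slices the data at fixed offsets, which means the last sequence is shorter than the others whenever the row count is not a multiple of 28. The sketch below computes the same slice boundaries in plain Python and flags any incomplete trailing window. `sequence_bounds` is a hypothetical helper, and the row count of 60 is a toy value, not the size of the cntHb data.

```python
def sequence_bounds(n_rows, sequence_length):
    """Return (start, stop) row offsets for consecutive fixed-length windows.

    The final window is clipped to `n_rows`, so it may be shorter than
    `sequence_length`.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return [
        (start, min(start + sequence_length, n_rows))
        for start in range(0, n_rows, sequence_length)
    ]


# Toy example: 60 rows split into windows of 28 rows
bounds = sequence_bounds(n_rows=60, sequence_length=28)
incomplete = [(start, stop) for start, stop in bounds if stop - start < 28]

print(bounds)      # [(0, 28), (28, 56), (56, 60)]
print(incomplete)  # [(56, 60)]
```

Each pair can be passed to `cntHb_data.iloc[start:stop]` to obtain the corresponding rows. Whether to drop or keep an incomplete final window depends on how the sequences are used afterwards.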

In [None]:
# Choose a random sequence
import numpy as np

In [None]:
obs = np.random.randint(len(mba_sequences))
print(obs)
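
The random index above changes on every run, which makes it hard to revisit a particular sequence. A minimal sketch of a seeded alternative follows. It uses the standard library `random` module instead of NumPy, and `pick_sequence_index` is a hypothetical helper; the count of 100 sequences is a placeholder.

```python
import random


def pick_sequence_index(n_sequences, seed=None):
    """Return a random index in [0, n_sequences); repeatable when seed is set."""
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    rng = random.Random(seed)
    return rng.randrange(n_sequences)


# The same seed yields the same index on every call
first = pick_sequence_index(100, seed=0)
second = pick_sequence_index(100, seed=0)
print(first == second)  # True
```

In the notebook, `n_sequences` would be `len(mba_sequences)`.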

In [None]:
print(synth_data)

In [None]:
# # Create a plot for each measurement column
# print(synth_data[1])
# print(mba_sequences[1])
# plt.figure(figsize=(10, 6))

# plt.subplot(2, 1, 1)
# plt.plot(mba_sequences[1]['cntHb'].reset_index(drop=True), label='Real cntHb')
# plt.plot(synth_data[1]['cntHb'].reset_index(drop=True), label='Synthetic cntHb', alpha=0.7)
# plt.xlabel('Index')
# plt.ylabel('Value')
# plt.title('cntHb Comparison')
# plt.legend()
# plt.grid(True)

# plt.tight_layout()
# plt.show()

In [None]:
synth_df.to_csv('synthetic_cntHb_reduced_375000.csv', index=False)