# DoppelGANger

## Method Overview

### Assumptions

Assume a dataset can be modeled as $\mathcal{D} = \{O^1, O^2, \ldots, O^n\}$, where each $O^i$ is an object representing an atomic, high-dimensional data element, i.e., a single time series together with its associated metadata. More precisely, each object $O^i = (A^i, R^i)$ has two components. The first is a vector of $m$ *attributes* $A^i = [A_1^i, A_2^i, \ldots, A_m^i]$. For example, attribute $A_j^i$ could represent user $i$'s physical location. Note that this model supports datasets in which multiple objects have the same set of attributes.

The second component of each object is a time series of *records* $R^i = [R_1^i, R_2^i, \ldots, R_{T^i}^i]$. For example, in retail, the time series may contain the number of products that user $i$ purchases on each day. Different objects may contain different numbers of records (i.e., time series of different lengths); the number of records in object $O^i$ is denoted $T^i$. Each record $R_j^i = (t_j^i, f_j^i)$ contains a *timestamp* $t_j^i$ and $K$ *features* $f_j^i = [f_{j,1}^i, f_{j,2}^i, \ldots, f_{j,K}^i]$ (e.g., the quantity of each of the $K$ products that the user purchases).
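
To make the notation concrete, here is a minimal Python sketch of the object model. The class names `Record` and `DataObject`, the city names, and the numbers are hypothetical illustrations, not part of the DoppelGANger code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One record R_j^i = (t_j^i, f_j^i)."""
    timestamp: float       # t_j^i
    features: List[float]  # f_j^i, a list of K feature values


@dataclass
class DataObject:
    """One object O^i = (A^i, R^i)."""
    attributes: List[str]  # A^i, a list of m attribute values
    records: List[Record]  # R^i, a list of T^i records


# Hypothetical toy dataset: m = 1 attribute, K = 2 features,
# and time series of different lengths T^i.
dataset = [
    DataObject(
        attributes=["Seattle"],
        records=[Record(0.0, [3.0, 1.0]), Record(1.0, [0.0, 2.0])],
    ),
    DataObject(
        attributes=["Boston"],
        records=[Record(0.0, [1.0, 1.0])],
    ),
]

# T^i for each object.
lengths = [len(obj.records) for obj in dataset]
print(lengths)
```

Each object carries its own record list, so time series of different lengths fit naturally in this representation.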


In [1]:
import os
import sys
import pickle
import numpy as np
import pandas as pd



In [2]:
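# Add ./gan to the import path so that pickle can find the `output` module
# that defines the pickled Output objects.
sys.path.append("./gan")

file_path = os.path.join("..", "data", "FCC_MBA", "data_feature_output.pkl")

with open(file_path, "rb") as f: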
    data_feature_outputs = pickle.load(f)

data_feature_outputs

[<output.Output at 0x7f044e78eda0>, <output.Output at 0x7f044e797048>]

In [3]:
data_feature_outputs[0].__dict__

{'dim': 1,
 'is_gen_flag': False,
 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>,
 'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>}

In [4]:
data_feature_outputs[1].__dict__

{'dim': 1,
 'is_gen_flag': False,
 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>,
 'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>}

In [5]:
file_path = os.path.join("..", "data", "FCC_MBA", "data_attribute_output.pkl")

with open(file_path, "rb") as f:
    data_attribute_outputs = pickle.load(f)

data_attribute_outputs

[<output.Output at 0x7f044e797470>,
 <output.Output at 0x7f044e797358>,
 <output.Output at 0x7f044e7973c8>]

In [6]:
data_attribute_outputs[0].__dict__

{'dim': 15,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [7]:
data_attribute_outputs[1].__dict__

{'dim': 5,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [8]:
data_attribute_outputs[2].__dict__

{'dim': 53,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [9]:
data_npz = np.load(os.path.join("..", "data", "FCC_MBA", "data_train.npz"))

In [10]:
data_npz["data_gen_flag"]

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [11]:
data_npz["data_gen_flag"].shape

(600, 56)
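
The generation flag array has one row per sample and one column per time step. A minimal sketch of how per-sample series lengths could be recovered from it, assuming that a 1 marks a real time step and a 0 marks padding after the series ends (the toy array below is made up; in the data shown above every printed entry is 1):

```python
import numpy as np

# Hypothetical flags for 3 samples padded to 4 time steps.
gen_flag = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
])

# Under the assumption above, the row sum is the series length T^i.
lengths = gen_flag.sum(axis=1).astype(int)
print(lengths)
```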

In [12]:
data_npz["data_attribute"]

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [13]:
data_npz["data_attribute"].shape

(600, 73)
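
The 73 attribute columns match the sum of the three attribute output dimensions listed earlier (15 + 5 + 53 = 73). The sketch below assumes that the columns are the three one-hot blocks concatenated in that order, which is consistent with the shapes but not confirmed here. It builds a toy array with that layout and decodes each block back to a category index; the labels are randomly generated for illustration.

```python
import numpy as np

dims = [15, 5, 53]  # 'dim' of each discrete attribute output
assert sum(dims) == 73

# Hypothetical category indices for 4 samples, one column per attribute.
rng = np.random.default_rng(0)
labels = np.stack([rng.integers(0, d, size=4) for d in dims], axis=1)

# Encode: concatenate one one-hot block per attribute.
toy_attribute = np.concatenate(
    [np.eye(d, dtype=np.float32)[labels[:, k]] for k, d in enumerate(dims)],
    axis=1,
)

# Decode: slice each block and take the argmax within it.
offsets = np.cumsum([0] + dims)
decoded = np.stack(
    [
        toy_attribute[:, offsets[k]:offsets[k + 1]].argmax(axis=1)
        for k in range(len(dims))
    ],
    axis=1,
)
print(toy_attribute.shape, np.array_equal(decoded, labels))
```

If the layout assumption holds, the same slicing could be applied to `data_npz["data_attribute"]` to recover readable category indices.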

In [14]:
data_npz["data_feature"]

array([[[0.00190305, 0.        ],
        [0.00542065, 0.        ],
        [0.00351279, 0.        ],
        ...,
        [0.00570855, 0.        ],
        [0.0109244 , 0.        ],
        [0.01260134, 0.        ]],

       [[0.00451891, 0.        ],
        [0.01619207, 0.        ],
        [0.00240422, 0.        ],
        ...,
        [0.00979518, 0.        ],
        [0.00285562, 0.        ],
        [0.00412524, 0.        ]],

       [[0.00871978, 0.        ],
        [0.06614204, 0.        ],
        [0.00705297, 0.        ],
        ...,
        [0.01537887, 0.        ],
        [0.0196006 , 0.        ],
        [0.02046846, 0.        ]],

       ...,

       [[0.05727319, 0.        ],
        [0.04847331, 0.        ],
        [0.02320519, 0.        ],
        ...,
        [0.06335463, 0.        ],
        [0.0498604 , 0.        ],
        [0.04445172, 0.        ]],

       [[0.10847276, 0.        ],
        [0.0983321 , 0.        ],
        [0.03953493, 0.        ],
        .

In [15]:
data_npz["data_feature"].shape

(600, 56, 2)
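
The feature array is indexed as (sample, time step, feature), and its last axis matches the two continuous feature outputs of dimension 1 each. As a minimal sketch of working with this layout, the code below computes a per-sample mean of each feature over valid time steps only, assuming a generation flag of 1 marks a valid step. The arrays are small made-up stand-ins for `data_feature` and `data_gen_flag`.

```python
import numpy as np

# Hypothetical stand-ins: 2 samples, 3 time steps, 2 features.
toy_feature = np.array([
    [[0.2, 0.0], [0.4, 0.0], [0.0, 0.0]],
    [[0.1, 0.5], [0.3, 0.5], [0.5, 0.5]],
])
toy_gen_flag = np.array([
    [1.0, 1.0, 0.0],  # last step of sample 0 is padding
    [1.0, 1.0, 1.0],
])

# Broadcast the flag over the feature axis and average over valid steps.
mask = toy_gen_flag[..., None]
masked_mean = (toy_feature * mask).sum(axis=1) / toy_gen_flag.sum(
    axis=1, keepdims=True
)
print(masked_mean)
```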

For the Orange Juice data, the features include sales, price, and deal information, while the attributes contain store ID and brand ID.