# DoppelGANger

## Method Overview

### Assumptions

Assume a dataset can be modeled as $\mathcal{D} = \{ O^1 , O^2, ..., O^n\}$ where $O^i$ is an object representing an atomic, high-dimensional data element, i.e., the combination of a single time series with its associated metadata. More precisely, each object $O^i = (A^i, R^i)$ contains $m$ *attributes* $A^i = [A_1^i, A_2^i, ..., A_m^i]$. For example, attribute $A_j^i$ could represent user $i$'s physical location. Note that this model can support datasets in which multiple objects have the same set of attributes. 

The second component of each object is a time series of *records* $R^i = [R_1^i, R_2^i, ..., R_{T^i}^i]$. For example, in retail, the time series may contain the number of products that user $i$ purchases on each day. Different objects may contain different numbers of records (i.e., time series of different lengths); the number of records in object $O^i$ is given by $T^i$. Each record $R_j^i = (t_j^i, f_j^i)$ contains a *timestamp* $t_j^i$ and $K$ *features* $f_j^i = [f_{j, 1}^i, f_{j, 2}^i, ..., f_{j, K}^i]$ (e.g., the quantity of each of the $K$ products that the user purchases).
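To make the notation concrete, the following is a minimal sketch of the data model in plain Python. The class names `Record` and `DataObject` and all values are hypothetical and chosen only for illustration; they are not part of the DoppelGANger code base.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    """One record R_j^i: a timestamp t_j^i and K feature values f_j^i."""
    timestamp: int
    features: List[float]


@dataclass
class DataObject:
    """One object O^i = (A^i, R^i): m attributes plus a time series of records."""
    attributes: List[str]
    records: List[Record]

    @property
    def num_records(self) -> int:
        # T^i, the length of this object's time series
        return len(self.records)


# Toy dataset with n = 2 objects, m = 2 attributes and K = 2 features (made-up values)
dataset = [
    DataObject(["store_2", "brand_1"], [Record(40, [8256.0, 0.06]), Record(41, [6144.0, 0.06])]),
    DataObject(["store_5", "brand_3"], [Record(40, [1200.0, 0.04])]),
]

# Objects may have time series of different lengths
lengths = [obj.num_records for obj in dataset]
print(lengths)
```

Here the two objects have $T^1 = 2$ and $T^2 = 1$ records, which is the variable-length case described above.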


In [51]:
import os
import sys
import math
import pickle
import numpy as np
import pandas as pd

from gan.output import OutputType, Normalization, Output

In [2]:
sys.path.append("./gan")


file_path = os.path.join("..", "data", "FCC_MBA", "data_feature_output.pkl")

with open(file_path, "rb") as f:
    data_feature_outputs = pickle.load(f)

data_feature_outputs

[<output.Output at 0x7f9bac33c0b8>, <output.Output at 0x7f9bac33c320>]

In [3]:
data_feature_outputs[0].__dict__

{'dim': 1,
 'is_gen_flag': False,
 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>,
 'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>}

In [4]:
data_feature_outputs[1].__dict__

{'dim': 1,
 'is_gen_flag': False,
 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>,
 'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>}

In [5]:
file_path = os.path.join("..", "data", "FCC_MBA", "data_attribute_output.pkl")

with open(file_path, "rb") as f:
    data_attribute_outputs = pickle.load(f)

data_attribute_outputs

[<output.Output at 0x7f9ba5bb42e8>,
 <output.Output at 0x7f9bac33c6a0>,
 <output.Output at 0x7f9bac33c710>]

In [6]:
data_attribute_outputs[0].__dict__

{'dim': 15,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [7]:
data_attribute_outputs[1].__dict__

{'dim': 5,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [8]:
data_attribute_outputs[2].__dict__

{'dim': 53,
 'is_gen_flag': False,
 'normalization': None,
 'type_': <OutputType.DISCRETE: 'DISCRETE'>}

In [9]:
data_npz = np.load(os.path.join("..", "data", "FCC_MBA", "data_train.npz"))
data_npz.__dict__

{'_files': ['data_feature_max.npy',
  'data_feature.npy',
  'data_attribute.npy',
  'data_gen_flag.npy',
  'data_feature_min.npy'],
 'allow_pickle': False,
 'f': <numpy.lib.npyio.BagObj at 0x7f9bac32ff98>,
 'fid': <_io.BufferedReader name='../data/FCC_MBA/data_train.npz'>,
 'files': ['data_feature_max',
  'data_feature',
  'data_attribute',
  'data_gen_flag',
  'data_feature_min'],
 'pickle_kwargs': {'encoding': 'ASCII', 'fix_imports': True},
 'zip': <zipfile.ZipFile file=<_io.BufferedReader name='../data/FCC_MBA/data_train.npz'> mode='r'>}

In [10]:
data_npz["data_gen_flag"]

array([[1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       ...,
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.],
       [1., 1., 1., ..., 1., 1., 1.]])

In [11]:
data_npz["data_gen_flag"].shape

(600, 56)

In [12]:
data_npz["data_attribute"]

array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

In [13]:
data_npz["data_attribute"].shape

(600, 73)

In [14]:
data_npz["data_feature"]

array([[[0.00190305, 0.        ],
        [0.00542065, 0.        ],
        [0.00351279, 0.        ],
        ...,
        [0.00570855, 0.        ],
        [0.0109244 , 0.        ],
        [0.01260134, 0.        ]],

       [[0.00451891, 0.        ],
        [0.01619207, 0.        ],
        [0.00240422, 0.        ],
        ...,
        [0.00979518, 0.        ],
        [0.00285562, 0.        ],
        [0.00412524, 0.        ]],

       [[0.00871978, 0.        ],
        [0.06614204, 0.        ],
        [0.00705297, 0.        ],
        ...,
        [0.01537887, 0.        ],
        [0.0196006 , 0.        ],
        [0.02046846, 0.        ]],

       ...,

       [[0.05727319, 0.        ],
        [0.04847331, 0.        ],
        [0.02320519, 0.        ],
        ...,
        [0.06335463, 0.        ],
        [0.0498604 , 0.        ],
        [0.04445172, 0.        ]],

       [[0.10847276, 0.        ],
        [0.0983321 , 0.        ],
        [0.03953493, 0.        ],
        .

In [15]:
data_npz["data_feature"].shape

(600, 56, 2)
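The shapes printed above should agree with the `dim` values of the `Output` objects: the attribute dimensions 15 + 5 + 53 add up to 73, and the two one-dimensional features give a last axis of size 2. The sketch below expresses that check; `check_shapes` is a hypothetical helper, and zero-filled arrays stand in for the real FCC_MBA data.

```python
import numpy as np


def check_shapes(data_feature, data_attribute, data_gen_flag, feature_dims, attribute_dims):
    """Verify that the three training arrays agree with the declared output dims."""
    n_samples, max_len, feature_dim = data_feature.shape
    if feature_dim != sum(feature_dims):
        raise ValueError("feature dimension does not match the feature outputs")
    if data_attribute.shape != (n_samples, sum(attribute_dims)):
        raise ValueError("attribute array does not match the attribute outputs")
    if data_gen_flag.shape != (n_samples, max_len):
        raise ValueError("gen flag array does not match the feature array")
    return n_samples, max_len


# Stand-in arrays with the shapes shown above (not the real data)
n_samples, max_len = check_shapes(
    np.zeros((600, 56, 2), dtype=np.float32),
    np.zeros((600, 73), dtype=np.float32),
    np.zeros((600, 56), dtype=np.float32),
    feature_dims=[1, 1],
    attribute_dims=[15, 5, 53],
)
print(n_samples, max_len)
```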

We now prepare the same three inputs for the Orange Juice data. Its features are sales, price, deal, and advertisement information, and its attributes are store ID and brand ID.

| Attributes | Description | Possible Values |
| --- | --- | --- |
| store | Store ID | integers |
| brand | Brand/Product ID | integers |

| Features | Description | Possible Values |
| --- | --- | --- |
| sales | Number of products sold | integers |
| price | Price of the product | float numbers |
| deal | Deal information | float numbers |
| feat | Advertisement information | float numbers |

| Timestamp Description | Possible Values |
| --- | --- |
| The starting date of each week | 1990-01-07 |

### Create Attributes

In [16]:
ojdata_attribute_outputs = []
# store ID attribute
ojdata_attribute_outputs.append(Output(OutputType.DISCRETE, 83, None))
# brand ID attribute
ojdata_attribute_outputs.append(Output(OutputType.DISCRETE, 11, None))

print(ojdata_attribute_outputs[0].__dict__)
print(ojdata_attribute_outputs[1].__dict__)

{'type_': <OutputType.DISCRETE: 'DISCRETE'>, 'normalization': None, 'is_gen_flag': False, 'dim': 83}
{'type_': <OutputType.DISCRETE: 'DISCRETE'>, 'normalization': None, 'is_gen_flag': False, 'dim': 11}
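Assuming the one-hot blocks of the attributes are concatenated in the order of this list, the `dim` values determine which columns of the attribute array belong to which attribute. The helper `column_slices` below is hypothetical and only illustrates that bookkeeping.

```python
from itertools import accumulate


def column_slices(dims):
    """Map a list of output dims to the column range each output occupies."""
    ends = list(accumulate(dims))
    starts = [0] + ends[:-1]
    return [slice(start, end) for start, end in zip(starts, ends)]


# Store ID has 83 categories and brand ID has 11, as declared above
slices = column_slices([83, 11])
print(slices)
```

Under that assumption, columns 0 to 82 hold the store ID and columns 83 to 93 hold the brand ID, for 94 attribute columns in total.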


In [17]:
with open("../data/ojdata/data_attribute_output.pkl", "wb") as f:
    pickle.dump(ojdata_attribute_outputs, f)

In [18]:
with open("../data/ojdata/data_attribute_output.pkl", "rb") as f:
    data_attribute_outputs = pickle.load(f)

data_attribute_outputs

[<gan.output.Output at 0x7f9bac33cda0>, <gan.output.Output at 0x7f9bac33cfd0>]

### Create Features

In [19]:
ojdata_feature_outputs = []
# sales feature
ojdata_feature_outputs.append(Output(OutputType.CONTINUOUS, 1, Normalization.ZERO_ONE))
# price feature
ojdata_feature_outputs.append(Output(OutputType.CONTINUOUS, 1, Normalization.ZERO_ONE))
# deal feature
ojdata_feature_outputs.append(Output(OutputType.CONTINUOUS, 1, Normalization.ZERO_ONE))
# feat feature
ojdata_feature_outputs.append(Output(OutputType.CONTINUOUS, 1, Normalization.ZERO_ONE))

In [20]:
print(ojdata_feature_outputs[0].__dict__)
print(ojdata_feature_outputs[1].__dict__)
print(ojdata_feature_outputs[2].__dict__)
print(ojdata_feature_outputs[3].__dict__)

{'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>, 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>, 'is_gen_flag': False, 'dim': 1}
{'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>, 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>, 'is_gen_flag': False, 'dim': 1}
{'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>, 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>, 'is_gen_flag': False, 'dim': 1}
{'type_': <OutputType.CONTINUOUS: 'CONTINUOUS'>, 'normalization': <Normalization.ZERO_ONE: 'ZERO_ONE'>, 'is_gen_flag': False, 'dim': 1}


In [21]:
with open("../data/ojdata/data_feature_output.pkl", "wb") as f:
    pickle.dump(ojdata_feature_outputs, f)

In [22]:
with open("../data/ojdata/data_feature_output.pkl", "rb") as f:
    data_feature_outputs = pickle.load(f)

data_feature_outputs

[<gan.output.Output at 0x7f9bac357128>,
 <gan.output.Output at 0x7f9bac3572e8>,
 <gan.output.Output at 0x7f9bac357208>,
 <gan.output.Output at 0x7f9bac357198>]

### Create Training Data

Next, we create a dictionary called `data_npz` that includes the following three NumPy arrays: `data_feature`, `data_attribute`, and `data_gen_flag`.

* `data_feature`: Training features, as a NumPy float32 array of size `[(number of training samples) x (maximum length) x (total dimension of features)]`. Categorical features are one-hot encoded; for example, a categorical feature with 3 possible values takes one of `[1., 0., 0.]`, `[0., 1., 0.]`, or `[0., 0., 1.]`. Each continuous feature should be normalized to `[0, 1]` or `[-1, 1]`. The array is padded with zeros after the time series ends.

* `data_attribute`: Training attributes, as a NumPy float32 array of size `[(number of training samples) x (total dimension of attributes)]`. Categorical attributes are one-hot encoded; for example, a categorical attribute with 3 possible values takes one of `[1., 0., 0.]`, `[0., 1., 0.]`, or `[0., 0., 1.]`. Each continuous attribute should be normalized to `[0, 1]` or `[-1, 1]`.

* `data_gen_flag`: Flags indicating whether the time series is active, as a NumPy float32 array of size `[(number of training samples) x (maximum length)]`. A value of 1 means the time series is active at that time step; 0 means it is inactive (i.e., the series has already ended and the position is padding).
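The relationship between zero padding in `data_feature` and the flags in `data_gen_flag` can be sketched on a toy example. The helper `pad_series` and the values are hypothetical; the sketch assumes every series is already normalized and no longer than `max_len`.

```python
import numpy as np


def pad_series(series_list, max_len):
    """Stack variable-length series into a zero-padded array plus activation flags."""
    n_samples = len(series_list)
    n_features = len(series_list[0][0])
    data_feature = np.zeros((n_samples, max_len, n_features), dtype=np.float32)
    data_gen_flag = np.zeros((n_samples, max_len), dtype=np.float32)
    for i, series in enumerate(series_list):
        length = len(series)
        data_feature[i, :length, :] = series  # real values first, zeros afterwards
        data_gen_flag[i, :length] = 1.0       # 1 while the series is active
    return data_feature, data_gen_flag


# Two toy series with 2 features each, of lengths 3 and 1
series_list = [
    [[0.1, 0.5], [0.2, 0.4], [0.3, 0.3]],
    [[0.9, 0.0]],
]
data_feature, data_gen_flag = pad_series(series_list, max_len=3)
print(data_feature.shape)
print(data_gen_flag)
```

The second series is active only at the first time step, so its remaining flags and feature values are zero.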

In [23]:
original_sales = pd.read_csv("../data/ojdata/yx.csv")
original_sales.head()

Unnamed: 0,store,brand,week,logmove,constant,price1,price2,price3,price4,price5,price6,price7,price8,price9,price10,price11,deal,feat,profit
0,2,1,40,9.018695,1,0.060469,0.060497,0.042031,0.029531,0.049531,0.053021,0.038906,0.041406,0.028906,0.024844,0.038984,1,0.0,37.992326
1,2,1,46,8.723231,1,0.060469,0.060312,0.045156,0.046719,0.049531,0.047813,0.045781,0.027969,0.042969,0.042031,0.038984,0,0.0,30.126667
2,2,1,47,8.253228,1,0.060469,0.060312,0.045156,0.046719,0.037344,0.053021,0.045781,0.041406,0.048125,0.032656,0.038984,0,0.0,30.0
3,2,1,48,8.987197,1,0.060469,0.060312,0.049844,0.037344,0.049531,0.053021,0.045781,0.041406,0.042344,0.032656,0.038984,0,0.0,29.95
4,2,1,50,9.093357,1,0.060469,0.060312,0.043594,0.031094,0.049531,0.053021,0.046648,0.041406,0.042344,0.032656,0.038203,0,0.0,29.92


In [24]:
# Check number of time series in the data
n_ts_samples = len(original_sales.groupby(["store", "brand"]).groups.keys())
print(n_ts_samples)

913


In [25]:
# Get the maximum length of the time series
min_week = original_sales["week"].min()
max_week = original_sales["week"].max()
print("Minimum week number is ", min_week)
print("Maximum week number is ", max_week)
# Weeks min_week through max_week are both included, hence the +1
max_ts_length = max_week - min_week + 1
print("Maximum time series length ", max_ts_length)

Minimum week number is  40
Maximum week number is  160
Maximum time series length  121


In [26]:
data_npz = np.load(os.path.join("..", "data", "FCC_MBA", "data_train.npz"))
data_npz.__dict__

{'_files': ['data_feature_max.npy',
  'data_feature.npy',
  'data_attribute.npy',
  'data_gen_flag.npy',
  'data_feature_min.npy'],
 'allow_pickle': False,
 'f': <numpy.lib.npyio.BagObj at 0x7f9bb0b2feb8>,
 'fid': <_io.BufferedReader name='../data/FCC_MBA/data_train.npz'>,
 'files': ['data_feature_max',
  'data_feature',
  'data_attribute',
  'data_gen_flag',
  'data_feature_min'],
 'pickle_kwargs': {'encoding': 'ASCII', 'fix_imports': True},
 'zip': <zipfile.ZipFile file=<_io.BufferedReader name='../data/FCC_MBA/data_train.npz'> mode='r'>}

In [27]:
data_npz["data_feature_min"]

array([2079777.,       0.], dtype=float32)

In [28]:
# All flags are 1 because every series is imputed to the full length below
data_gen_flag = np.ones((n_ts_samples, max_ts_length), dtype=np.float32)

In [29]:
data_gen_flag.shape

(913, 121)

In [36]:
#!pip install scikit-learn


In [38]:
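from sklearn.preprocessing import OneHotEncoder

# One-hot encode the two categorical attributes (store ID and brand ID)
enc = OneHotEncoder()
enc.fit(original_sales[["store", "brand"]])
# Sanity check: encode the pair (store=2, brand=1)
enc.transform([[2, 1]]).toarray()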

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [45]:
import itertools

store_list = original_sales["store"].unique()
brand_list = original_sales["brand"].unique()
all_store_brand = np.array([[s, b] for s, b in itertools.product(store_list, brand_list)])
data_attribute = enc.transform(all_store_brand).toarray()
data_attribute

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [46]:
data_attribute.shape

(913, 94)
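The 94 columns are the 83 store categories followed by the 11 brand categories, and each row contains exactly two ones. The NumPy-only sketch below reproduces that layout on a tiny made-up example; `one_hot` is a hypothetical helper, and unlike `OneHotEncoder`, which sorts the categories it finds, it uses the category order it is given.

```python
import numpy as np


def one_hot(values, categories):
    """One-hot encode `values` against an ordered list of categories."""
    index = {category: j for j, category in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=np.float32)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded


# Toy example: 3 stores and 2 brands give 6 (store, brand) pairs
stores = [2, 5, 8]
brands = [1, 2]
pairs = [(s, b) for s in stores for b in brands]

# Concatenate the store block and the brand block column-wise
attr = np.hstack([
    one_hot([s for s, _ in pairs], stores),
    one_hot([b for _, b in pairs], brands),
])
print(attr.shape)
print(attr[0])
```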

In [47]:
# Impute missing values
# Build the full (store, brand, week) grid so every series covers every week
week_list = range(min_week, max_week + 1)
d = {"store": store_list, "brand": brand_list, "week": week_list}
cart = list(itertools.product(*d.values()))
data_grid = pd.DataFrame(cart, columns=d.keys())
# Left-join the observations onto the grid; unobserved weeks become NaN rows
original_sales_filled = pd.merge(data_grid, original_sales, how="left", on=["store", "brand", "week"])
# Within each series, forward fill gaps, then backward fill any leading gaps
original_sales_filled = original_sales_filled.groupby(["store", "brand"]).apply(lambda x: x.ffill().bfill())

In [48]:
original_sales_filled.shape

(110473, 19)

In [49]:
original_sales.shape

(106139, 19)
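The filled frame has 83 × 11 × 121 = 110,473 rows, one for every combination of store, brand, and week from 40 through 160, compared with 106,139 observed rows. The grid, left join, and fill pattern can be sketched on a small made-up frame; the column names and values here are illustrative, and a per-group `transform` is used as one possible way to fill within each series.

```python
import itertools

import pandas as pd

# Observed rows: store 1 is missing week 2, store 2 is missing weeks 1 and 3
observed = pd.DataFrame({
    "store": [1, 1, 2],
    "week": [1, 3, 2],
    "sales": [10.0, 30.0, 5.0],
})

# Full grid of every (store, week) combination
stores = [1, 2]
weeks = [1, 2, 3]
grid = pd.DataFrame(list(itertools.product(stores, weeks)), columns=["store", "week"])

# Left join keeps every grid row; unobserved weeks get NaN sales
filled = grid.merge(observed, how="left", on=["store", "week"])

# Fill within each store: forward fill first, then backward fill leading gaps
filled["sales"] = filled.groupby("store")["sales"].transform(lambda s: s.ffill().bfill())
print(filled)
```

Filling per group matters: without the `groupby`, the last value of one series would leak into the leading gap of the next.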

In [52]:
# Convert log sales (logmove) back to unit sales, rounded to whole units
original_sales_filled["move"] = original_sales_filled["logmove"].apply(lambda x: round(math.exp(x)))

In [53]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(original_sales_filled[["move", "feat", "deal"]])
scaler.data_max_

array([7.16416e+05, 1.00000e+00, 1.00000e+00])
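Min-max scaling maps each column linearly onto `[0, 1]` using its minimum and maximum, and the stored extremes allow generated values to be mapped back to the original scale. The NumPy sketch below shows the arithmetic on made-up numbers; the function names are hypothetical, and the guard for constant columns is an assumption of this sketch.

```python
import numpy as np


def min_max_fit(x):
    """Return the per-column minimum and maximum."""
    return x.min(axis=0), x.max(axis=0)


def min_max_transform(x, lo, hi):
    """Scale each column to [0, 1]; constant columns map to 0."""
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (x - lo) / span


def min_max_inverse(z, lo, hi):
    """Map scaled values back to the original range."""
    return z * (hi - lo) + lo


# Toy data: a sales-like column and a 0/1 indicator column
x = np.array([[100.0, 0.0], [300.0, 1.0], [200.0, 0.0]])
lo, hi = min_max_fit(x)
z = min_max_transform(x, lo, hi)
restored = min_max_inverse(z, lo, hi)
print(z)
```

The same per-column minimum and maximum must be kept alongside the training data so that synthetic samples can be converted back to real units.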