# Model Training Plan

In [152]:
import logging
from completejourney_py import get_data

from syncomp.utils.data_util import CompleteJourneyDataset
logger = logging.getLogger()
logger.setLevel(logging.INFO)

%reload_ext autoreload
%autoreload 2

## Raw Data Analysis

In [2]:
complete_dataset = get_data()
complete_dataset["transactions"].dtypes

household_id                      int64
store_id                          int64
basket_id                         int64
product_id                        int64
quantity                          int64
sales_value                     float64
retail_disc                     float64
coupon_disc                     float64
coupon_match_disc               float64
week                              int64
transaction_timestamp    datetime64[ns]
dtype: object

In [155]:
complete_dataset["transactions"][["household_id", "product_id","store_id", "week"]].nunique()

household_id     2469
product_id      68509
store_id          457
week               53
dtype: int64

In [4]:
complete_dataset["products"].dtypes

product_id           int64
manufacturer_id      int64
department          object
brand               object
product_category    object
product_type        object
package_size        object
dtype: object

In [5]:
complete_dataset["demographics"].dtypes

household_id       int64
age               object
income            object
home_ownership    object
marital_status    object
household_size    object
household_comp    object
kids_count        object
dtype: object

## Preprocessing

In [153]:
cd = CompleteJourneyDataset()
data = cd.run_preprocess()
train_data = cd.combine_product_with_few_transactions(data)

INFO:root:Filter out transactions with non-positive quantity sold or money spent. Number of transactions are decreased to 1458032.
INFO:root:Use the same label for products with the same hierarchy information. Number of products are decreased to 32333.
INFO:root:Filter out transactions with invalid customer id. Number of transactions are decreased to 730640.
INFO:root:Filter out transactions with extreme large quantity sold. Number of transactions are decreased to 723742.
INFO:root:Remove 16271 products with few transactions. Number of transactions are decreased to 644084.
INFO:root:Combine 4643 products with few transactions to belong to one category.


Numerical columns include the unit price, base price, discount sources, quantity purchased, and revenue per transaction. Note that the derived price columns are related as follows:

$unit\_price = sales\_value / quantity$ \
$base\_price = (sales\_value + retail\_disc + coupon\_match\_disc + coupon\_disc) / quantity$ \
$unit\_price = base\_price * (1 - retail\_discount\_portion - coupon\_match\_discount\_portion - coupon\_discount\_portion)$
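The three relations above can be checked on a single made-up transaction. This is a minimal sketch, not the preprocessing code itself: it assumes each discount portion is that discount divided by the pre-discount revenue (the notebook does not show this definition, but it is the one under which the third identity holds) and that discounts are stored as non-negative amounts. The function name `price_features` and the input numbers are hypothetical.

```python
import math


def price_features(sales_value, retail_disc, coupon_disc, coupon_match_disc, quantity):
    """Derive per-unit price features from one transaction (illustrative only)."""
    # Revenue before any discount was applied
    gross_value = sales_value + retail_disc + coupon_match_disc + coupon_disc
    return {
        "unit_price": sales_value / quantity,
        "base_price": gross_value / quantity,
        # Assumed definition: share of the pre-discount revenue taken by each discount
        "retail_discount_portion": retail_disc / gross_value,
        "coupon_match_discount_portion": coupon_match_disc / gross_value,
        "coupon_discount_portion": coupon_disc / gross_value,
    }


# Made-up transaction: 2 units sold for 4.00 after 0.80 retail and 0.20 coupon discounts
features = price_features(
    sales_value=4.0, retail_disc=0.8, coupon_disc=0.2, coupon_match_disc=0.0, quantity=2
)

# Third identity: unit_price = base_price * (1 - sum of discount portions)
reconstructed_unit_price = features["base_price"] * (
    1
    - features["retail_discount_portion"]
    - features["coupon_match_discount_portion"]
    - features["coupon_discount_portion"]
)
assert math.isclose(reconstructed_unit_price, features["unit_price"])
```

Under these assumptions, `base_price` and the three portions are enough to recover `unit_price` for every row.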



In [133]:
data.describe(percentiles=[0.8, 0.85, 0.9, 0.95, 0.99, 0.999])

| | product_id | quantity | sales_value | retail_disc | coupon_disc | coupon_match_disc | unit_price | base_price | retail_discount_portion | coupon_discount_portion | coupon_match_discount_portion |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 | 644084.0 |
| mean | 6328.23812 | 1.504582 | 3.122173 | 0.543492 | 0.015082 | 0.00411 | 2.390958 | 2.799378 | 0.127889 | 0.002817 | 0.001083 |
| std | 7564.997183 | 1.430167 | 3.47074 | 1.205408 | 0.177302 | 0.045405 | 2.271693 | 2.554744 | 0.160419 | 0.025034 | 0.012252 |
| min | 0.0 | 1.0 | 0.01 | 0.0 | 0.0 | 0.0 | 0.01 | 0.01 | 0.0 | 0.0 | 0.0 |
| 50% | 2702.0 | 1.0 | 2.39 | 0.1 | 0.0 | 0.0 | 1.99 | 2.29 | 0.043165 | 0.0 | 0.0 |
| 80% | 12185.0 | 2.0 | 4.0 | 0.89 | 0.0 | 0.0 | 3.19 | 3.79 | 0.27933 | 0.0 | 0.0 |
| 85% | 15057.0 | 2.0 | 4.99 | 1.02 | 0.0 | 0.0 | 3.63 | 3.99 | 0.328859 | 0.0 | 0.0 |
| 90% | 19097.0 | 2.0 | 5.98 | 1.49 | 0.0 | 0.0 | 4.19 | 4.99 | 0.373434 | 0.0 | 0.0 |
| 95% | 23716.0 | 4.0 | 7.99 | 2.24 | 0.0 | 0.0 | 5.99 | 6.99 | 0.459459 | 0.0 | 0.0 |
| 99% | 28870.0 | 7.0 | 15.0 | 4.92 | 0.55 | 0.023114 | 10.59 | 12.29 | 0.570201 | 0.12945 | 0.003875 |


Categorical columns contain identifiers, the product hierarchy, and customer demographics. For infrequent products with fewer than 100 transactions, we set every product hierarchy column to the placeholder value -1. These products can then be trained together regardless of `product_id`, and their product information can be sampled randomly for the synthetic dataset. After combining products, we need to synthesize weekly transactions for 1499 products in 180 categories, purchased by 801 customers. The general fitting strategy is to maintain a separate model for each product category.
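The placeholder step for infrequent products can be sketched in plain Python as follows. This is an illustration only, not the implementation of `combine_product_with_few_transactions`: the helper name `mask_infrequent_products` and the list `HIERARCHY_COLUMNS` are hypothetical, and which columns count as "product hierarchy" is an assumption based on the columns of the products table.

```python
from collections import Counter

# Assumed set of product hierarchy columns (taken from the products table)
HIERARCHY_COLUMNS = [
    "manufacturer_id", "department", "brand",
    "product_category", "product_type", "package_size",
]


def mask_infrequent_products(rows, min_transactions=100, placeholder=-1):
    """Return copies of rows where infrequent products get placeholder hierarchy values."""
    # Number of transactions observed for each product
    counts = Counter(row["product_id"] for row in rows)
    masked = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if counts[row["product_id"]] < min_transactions:
            for column in HIERARCHY_COLUMNS:
                if column in new_row:
                    new_row[column] = placeholder
        masked.append(new_row)
    return masked


# Toy data with a lowered threshold: product 2 appears only once, so it is masked
rows = [
    {"product_id": 1, "department": "GROCERY", "product_category": "SOUP"},
    {"product_id": 1, "department": "GROCERY", "product_category": "SOUP"},
    {"product_id": 2, "department": "DRUG GM", "product_category": "CANDY"},
]
masked = mask_infrequent_products(rows, min_transactions=2)
```

After masking, all infrequent products share the same hierarchy values, so they fall into one pooled category that a single model can cover.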

In [143]:
categorical_columns = train_data.select_dtypes(include='object').columns
train_data[categorical_columns].nunique()

product_id          1499
household_id         801
week                  53
manufacturer_id      392
department            18
brand                  3
product_category     180
product_type         564
package_size         335
age                    6
income                12
home_ownership         5
marital_status         3
household_size         5
household_comp         4
kids_count             4
dtype: int64

After combining infrequent products, no product category is extremely large. The largest category has only 48 distinct products, so we can treat `product_id` as a categorical variable and feed it directly to a diffusion model.

In [144]:
# Distinct values of each categorical column within every product category
str_columns = train_data.select_dtypes(include='object').columns
per_category = train_data.groupby(["product_category"])[str_columns].nunique().sort_values(['household_id', 'product_id'], ascending=False)
per_category.describe(percentiles=[0.55, 0.6, 0.65, 0.7, 0.8, 0.9])


| | product_id | household_id | week | manufacturer_id | department | brand | product_category | product_type | package_size | age | income | home_ownership | marital_status | household_size | household_comp | kids_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 | 180.0 |
| mean | 8.327778 | 380.944444 | 52.022222 | 3.716667 | 1.016667 | 1.544444 | 1.0 | 3.172222 | 4.766667 | 5.983333 | 11.733333 | 4.866667 | 3.0 | 5.0 | 4.0 | 4.0 |
| std | 9.063297 | 216.169516 | 3.259632 | 4.024055 | 0.128376 | 0.49941 | 0.0 | 2.36792 | 4.826718 | 0.166294 | 0.673522 | 0.401116 | 0.0 | 0.0 | 0.0 | 0.0 |
| min | 1.0 | 31.0 | 16.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 8.0 | 3.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 50% | 5.0 | 372.0 | 53.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 3.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 55% | 6.0 | 399.35 | 53.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 4.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 60% | 7.0 | 439.8 | 53.0 | 3.0 | 1.0 | 2.0 | 1.0 | 3.0 | 4.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 65% | 8.0 | 496.35 | 53.0 | 3.35 | 1.0 | 2.0 | 1.0 | 3.0 | 5.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 70% | 9.0 | 534.9 | 53.0 | 4.0 | 1.0 | 2.0 | 1.0 | 4.0 | 6.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |
| 80% | 13.0 | 601.6 | 53.0 | 5.2 | 1.0 | 2.0 | 1.0 | 5.0 | 7.0 | 6.0 | 12.0 | 5.0 | 3.0 | 5.0 | 4.0 | 4.0 |


However, roughly 30% of categories still have transactions from more than 500 distinct customers (the 70th percentile of `household_id` is 534.9). We therefore cannot treat `household_id` directly as a categorical variable. Instead, we can partition the customers into batches and train a synthesizer on each batch separately.

## Model & Training strategies

We will partition the training data by `product_category` and then by mini-batches of `household_id`, making sure that each training sample contains at least 100 transactions.

In [145]:
train_data.groupby('product_category').week.count().describe()

count       180.000000
mean       3578.244444
std       15470.573830
min         102.000000
25%         468.500000
50%        1312.000000
75%        3097.000000
max      204611.000000
Name: week, dtype: float64
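
The partitioning strategy described in this section could be sketched as follows. This is one possible reading, not the project's actual implementation: the function name `partition_training_data`, the greedy packing of households in sorted order, the default cap of 500 households per batch, and the folding of an undersized tail batch into the previous one are all assumptions made for illustration.

```python
from collections import defaultdict


def partition_training_data(rows, max_households=500, min_transactions=100):
    """Split rows into {(product_category, batch_index): [rows]} training samples."""
    # Group transactions by category, then by household
    by_category = defaultdict(lambda: defaultdict(list))
    for row in rows:
        by_category[row["product_category"]][row["household_id"]].append(row)

    partitions = {}
    for category, by_household in by_category.items():
        batches, current, n_households = [], [], 0
        for household_id in sorted(by_household):
            current.extend(by_household[household_id])
            n_households += 1
            # Close the batch once it holds enough households and enough transactions
            if n_households >= max_households and len(current) >= min_transactions:
                batches.append(current)
                current, n_households = [], 0
        if current:
            if batches and len(current) < min_transactions:
                # Fold an undersized tail into the previous batch
                batches[-1].extend(current)
            else:
                batches.append(current)
        for batch_index, batch in enumerate(batches):
            partitions[(category, batch_index)] = batch
    return partitions


# Toy data: category "A" has 5 households with 2 transactions each,
# category "B" has a single household with 3 transactions
rows = [
    {"product_category": "A", "household_id": h, "week": w}
    for h in range(1, 6)
    for w in (1, 2)
]
rows += [{"product_category": "B", "household_id": 9, "week": w} for w in (1, 2, 3)]
partitions = partition_training_data(rows, max_households=2, min_transactions=4)
```

A category whose total transaction count is below `min_transactions` still yields a single undersized batch in this sketch. With the counts shown above (minimum 102 transactions per category), that case would not arise at the 100-transaction threshold.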