# How to generate synthetic import data

- Written by Yeon Soo Choi, Research Unit, World Customs Organization
- Updated on 6th May 2020
- Objective: This notebook **generates synthetic import data for WCO capacity building programmes in data analytics**

## Outline
- Introduction
- Train a CTGAN model with real import data
- Generate synthetic data with a pre-trained CTGAN model
- Final touch up

## Introduction

**GAN (Generative Adversarial Network)** is one of the most popular algorithms in image analysis. Representative applications of GANs include:
- generating artificial human faces by training a model on real human faces.
- generating artificial driving environments for training self-driving cars.

Here is a good video from MIT (6.S191: Introduction to Deep Learning) for your reference:
- https://www.youtube.com/watch?v=rZufA635dq4

**CTGAN (Conditional GAN for Tabular Data)** is an advanced GAN model for tabular (structured) data, such as import declarations, composed of numeric and categorical variables. With a customized CTGAN model, we produced 200,000 synthetic import records **only for WCO capacity building activities**. For more details, please refer to:
- Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular data using Conditional GAN. NeurIPS, 2019.
- https://sdv-dev.github.io/CTGAN

The synthetic data has **the basic attributes of import data**.
- The overall distributions of the numeric variables are similar to those of the real data.
- The numbers of classes of the categorical variables are similar to those of the real data.
- A fraud detection model (XGBoost) performs well on the synthetic data. For instance:
    - Checking top 10% suspicious transactions: 2000
    - Precision: 0.2215, Recall: 0.2911, Seized Revenue (Recall): 0.3074
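
The first two claims can be checked with a few lines of pandas. The sketch below is only an illustration on toy frames: the helper name `compare_basic_attributes` and the toy values are assumptions, not part of the original notebook. It puts the quartiles of the numeric columns and the class counts of the categorical columns side by side.

```python
import pandas as pd


def compare_basic_attributes(real, synth, numeric_cols, discrete_cols):
    """Return two summary tables comparing a real and a synthetic DataFrame.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Quartiles of each numeric column, real vs synthetic, side by side
    quantiles = pd.concat(
        {
            "real": real[numeric_cols].quantile([0.25, 0.5, 0.75]),
            "synthetic": synth[numeric_cols].quantile([0.25, 0.5, 0.75]),
        },
        axis=1,
    )
    # Number of distinct classes in each categorical column
    n_classes = pd.DataFrame(
        {
            "real": real[discrete_cols].nunique(),
            "synthetic": synth[discrete_cols].nunique(),
        }
    )
    return quantiles, n_classes


# Toy frames standing in for the real and the synthetic import data
real = pd.DataFrame({"QUANTITY": [1.0, 2.0, 3.0, 4.0], "OFFICE": ["A", "B", "A", "C"]})
synth = pd.DataFrame({"QUANTITY": [1.0, 2.0, 2.0, 5.0], "OFFICE": ["A", "B", "B", "B"]})

quantiles, n_classes = compare_basic_attributes(real, synth, ["QUANTITY"], ["OFFICE"])
print(quantiles)
print(n_classes)
```

On real data, the same call would take the preprocessed training frame and the generated samples, with the column lists used elsewhere in this notebook.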

The synthetic data (in our practice) **is far from the real data**.
- Every synthetic import record is constructed from a series of random numbers, not by transforming or manipulating a real import record.
- To minimize any concerns of the real data provider, only 2013-2014 data was used to train the CTGAN model.
- The XGBoost model trained on the synthetic data performs **as badly as random targeting** on the real data: inspecting a random 10% of transactions would be expected to catch about 10% of the fraudulent ones, which is the recall of 0.1 reported below. This means that the hidden patterns of fraudulent imports in the synthetic data are **significantly different** from those in the real data. For instance:
    - Checking top 10% suspicious transactions: 27481
    - Precision: 0.0247, Recall: 0.1, Seized Revenue (Recall): 0.1201
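
The evaluation code is not shown in this notebook, so the following is a minimal sketch of how such top-10% figures could be computed. The function name `top_k_metrics` is hypothetical, and "Seized Revenue (Recall)" is assumed here to mean the share of the total raised tax that falls within the inspected transactions.

```python
import numpy as np


def top_k_metrics(scores, illicit, revenue, top_share=0.10):
    """Precision, recall and revenue recall when inspecting the top-scored share.

    Hypothetical helper; assumes at least one illicit record and non-zero revenue.
    """
    scores = np.asarray(scores, dtype=float)
    illicit = np.asarray(illicit, dtype=int)
    revenue = np.asarray(revenue, dtype=float)

    # Number of transactions to inspect (at least one)
    n_checked = max(1, int(round(len(scores) * top_share)))
    # Indices of the n_checked highest risk scores
    checked = np.argsort(-scores, kind="stable")[:n_checked]

    hits = illicit[checked].sum()
    precision = hits / n_checked                      # frauds found / inspections
    recall = hits / illicit.sum()                     # frauds found / all frauds
    revenue_recall = revenue[checked].sum() / revenue.sum()
    return n_checked, precision, recall, revenue_recall


# Toy example: 10 transactions, inspect the top 20% by risk score
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.6]
illicit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
raised_tax = [300.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

n_checked, precision, recall, revenue_recall = top_k_metrics(
    scores, illicit, raised_tax, top_share=0.20
)
```

In the toy example, two transactions are inspected and one of the two frauds is found, so precision and recall are both 0.5 and the revenue recall is 300/400.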

## Train a CTGAN model with real import data

In [1]:
## Set environment

# Install ctgan first: pip install ctgan
import pandas as pd
import numpy as np
from ctgan import CTGANSynthesizer
import pickle
import os

## Preprocess data

df = pd.read_csv("~/Sharedfolder/NTM/anonymized_full.csv", encoding="ISO-8859-1")

# Convert date to categories
df['SGD.DATE'] = pd.to_datetime(df['SGD.DATE'], format='%Y-%m-%d')
df['year'] = df['SGD.DATE'].dt.year
df['month'] = df['SGD.DATE'].dt.month
df['day'] = df['SGD.DATE'].dt.day

# Drop columns
del df['SGD.DATE']

# Select only 2013 and 2014 data
df = df[df['year']<2015]
# Reduce data size by random sampling (keep 1/500 of the rows)
df = df.sample(len(df)//500)

# Select columns to use
df = df[['year','month','day','OFFICE', 'IMPORTER.TIN','TARIFF.CODE','DECLARANT.CODE',
         'ORIGIN.CODE', 'CIF_USD_EQUIVALENT', 'QUANTITY', 'GROSS.WEIGHT', 
         'TOTAL.TAXES.USD','RAISED_TAX_AMOUNT_USD', 'illicit']]

# Drop NAs
df = df.dropna()

# Define categorical variables
discrete_columns = ['year','month','day','OFFICE','IMPORTER.TIN',
                    'DECLARANT.CODE','TARIFF.CODE','ORIGIN.CODE','illicit']

# Cast categorical variables to the 'category' dtype
for var in discrete_columns:
    df[var] = df[var].astype('category')

# Define numeric variables (every column not listed as categorical)
numeric_vars = [col for col in df.columns if col not in discrete_columns]

# Log-scale numeric variables with log(1 + x), which is defined at x = 0
df[numeric_vars] = df[numeric_vars].apply(np.log1p)

## CTGAN: train

num_epoch = 10
ctgan = CTGANSynthesizer()
ctgan.fit(df, discrete_columns, epochs=num_epoch)

# Save the trained model
with open('Pretrained_CTGAN.pkl', 'wb') as f:
    pickle.dump(ctgan, f)

print('end')


Epoch 1, Loss G: 4.0602, Loss D: -0.0612
Epoch 2, Loss G: 4.1001, Loss D: -0.1508
Epoch 3, Loss G: 4.0447, Loss D: -0.2176
Epoch 4, Loss G: 4.1481, Loss D: -0.2761
Epoch 5, Loss G: 4.0133, Loss D: -0.2503
Epoch 6, Loss G: 3.8790, Loss D: -0.2584
Epoch 7, Loss G: 4.1555, Loss D: -0.3340
Epoch 8, Loss G: 3.8727, Loss D: -0.2900
Epoch 9, Loss G: 4.0709, Loss D: -0.2541
Epoch 10, Loss G: 3.8185, Loss D: -0.2819




end
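
For readers without access to the input CSV, the date and dtype steps of the preprocessing can be tried on a toy frame. This is a sketch only: the column names follow the notebook, while the values are made up for illustration.

```python
import pandas as pd

# Toy declarations; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "SGD.DATE": ["2013-10-12", "2014-02-06"],
    "OFFICE": ["OFFICE60", "OFFICE51"],
    "QUANTITY": [5.0, 120.0],
})

# Split the date into three parts and drop the original column
dates = pd.to_datetime(toy.pop("SGD.DATE"), format="%Y-%m-%d")
toy["year"] = dates.dt.year
toy["month"] = dates.dt.month
toy["day"] = dates.dt.day

# Treat the date parts and the office code as categorical variables
discrete_columns = ["year", "month", "day", "OFFICE"]
for col in discrete_columns:
    toy[col] = toy[col].astype("category")

# Whatever is left is handled as a numeric variable
numeric_vars = [col for col in toy.columns if col not in discrete_columns]
print(toy.dtypes)
print(numeric_vars)
```

Casting the date parts to `category` means the model sees, for example, month 2 and month 10 as two unrelated classes rather than as points on a numeric scale.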


## Generate synthetic data with a pre-trained CTGAN model

In [2]:
# If you already have a pre-trained CTGAN model, load it
with open('Pretrained_CTGAN.pkl', 'rb') as f:
    ctgan=pickle.load(f)

In [3]:
# Generate 100,000 synthetic records (10 batches of 10,000) with the pre-trained CTGAN model
batches = []
for i in range(10):
    print(i)
    batches.append(ctgan.sample(10000))

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same frame
samples = pd.concat(batches, ignore_index=True)

0
1
2
3
4
5
6
7
8
9


In [4]:
# Scale back the numeric variables (exponentiate the log-values)
numeric_vars = ['CIF_USD_EQUIVALENT','TOTAL.TAXES.USD','QUANTITY','RAISED_TAX_AMOUNT_USD','GROSS.WEIGHT']

for var in numeric_vars:
    samples[var]=np.expm1(samples[var].values.astype(float))
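
The scale-back works because `np.expm1` is the exact inverse of the `np.log1p` transform applied before training. A minimal sketch on toy values follows. It also illustrates one caveat: a generated log-value below zero turns into a negative amount. The clipping step at the end is an assumption for illustration, not something the notebook does.

```python
import numpy as np

# Toy values standing in for a numeric column such as CIF_USD_EQUIVALENT
original = np.array([0.0, 9.0, 999.0])

logged = np.log1p(original)   # log(1 + x), defined at x = 0
restored = np.expm1(logged)   # exp(y) - 1, the inverse of log1p

# Generated values are not guaranteed to be non-negative on the log scale
generated = np.array([-0.5, 0.0, 2.0])
amounts = np.expm1(generated)  # the first amount is negative

# Hypothetical extra step (not in the notebook): floor amounts at zero
amounts_clipped = np.clip(amounts, 0, None)
```

Whether clipping is appropriate depends on how the synthetic data will be used, so it is left as an option rather than a recommendation.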

## Final touch up 

In [5]:
# Round numeric variables
samples[numeric_vars] = samples[numeric_vars].round()

In [6]:
# Redefine 'illicit' as 1 if the raised tax amount is positive; otherwise 0
samples['illicit']=np.where(samples['RAISED_TAX_AMOUNT_USD']>0,1,0)

In [7]:
# Check number of re-defined illicit imports
samples['illicit'].value_counts()

0    96153
1     3847
Name: illicit, dtype: int64
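
With the counts above, 3,847 of the 100,000 synthetic records (about 3.8%) are labelled illicit. The sketch below applies the same relabelling rule to a toy frame (values made up for illustration) and reports shares instead of raw counts.

```python
import numpy as np
import pandas as pd

# Toy sample; the RAISED_TAX_AMOUNT_USD values are made up
toy = pd.DataFrame({"RAISED_TAX_AMOUNT_USD": [0.0, 120.0, 0.0, 0.0, 35.0]})

# Same rule as in the notebook: illicit if any tax was raised
toy["illicit"] = np.where(toy["RAISED_TAX_AMOUNT_USD"] > 0, 1, 0)

# Relative frequencies of each label
shares = toy["illicit"].value_counts(normalize=True)
print(shares)
```

Because the label is derived from the rounded raised tax amount, a synthetic record with a raised tax that rounds to zero ends up labelled 0.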

In [8]:
samples[samples['illicit']==1].sample(10)

index,year,month,day,OFFICE,IMPORTER.TIN,TARIFF.CODE,DECLARANT.CODE,ORIGIN.CODE,CIF_USD_EQUIVALENT,QUANTITY,GROSS.WEIGHT,TOTAL.TAXES.USD,RAISED_TAX_AMOUNT_USD,illicit
18944,2013,10,12,OFFICE60,IMP791403,8474900000,DEC8784,CNTRY994,963673.0,2085.0,1130.0,810.0,6970.0,1
69892,2013,11,27,OFFICE66,IMP536032,8423100000,CA25063,CNTRY138,4281.0,1.0,3909.0,1478.0,4148.0,1
24986,2014,2,6,OFFICE51,IMP886748,6910900000,DEC3620,CNTRY429,150602.0,104.0,1040.0,609.0,10407.0,1
91153,2013,11,27,OFFICE60,IMP738364,8479900000,DEC6760,CNTRY139,235.0,317.0,1039.0,1350.0,4456.0,1
19711,2014,3,30,OFFICE51,IMP209068,8704219000,DEC1243,CNTRY351,3920.0,212.0,9878.0,12103.0,12293.0,1
2215,2014,10,2,OFFICE59,IMP345511,8418210000,DEC8041,CNTRY656,3602.0,7.0,4.0,13968.0,2485.0,1
91504,2013,3,23,OFFICE23,IMP148184,8539320000,DEC5210,CNTRY656,6940.0,4.0,935.0,1096.0,7735.0,1
37651,2013,12,1,OFFICE51,IMP475170,8703241932,DEC9584,CNTRY215,64290.0,233.0,716.0,1263.0,19724.0,1
85840,2014,3,13,OFFICE59,IMP289365,8441100000,DEC8128,CNTRY759,1171757.0,66.0,8011.0,23002.0,1788.0,1
14739,2013,11,19,OFFICE51,IMP929217,8413700000,CA25173,CNTRY284,197572.0,162.0,7456.0,17822.0,5037.0,1


In [9]:
samples[samples['illicit']==0].sample(10)

index,year,month,day,OFFICE,IMPORTER.TIN,TARIFF.CODE,DECLARANT.CODE,ORIGIN.CODE,CIF_USD_EQUIVALENT,QUANTITY,GROSS.WEIGHT,TOTAL.TAXES.USD,RAISED_TAX_AMOUNT_USD,illicit
44623,2014,1,29,OFFICE59,IMP612456,8517180000,DEC3394,CNTRY264,94809.0,5.0,931.0,1644.0,0.0,0
73946,2012,8,22,OFFICE59,IMP562250,8704219000,DEC7987,CNTRY770,35814.0,260.0,4280.0,1508.0,0.0,0
60647,2014,10,31,OFFICE51,IMP507104,3904100000,CA25064,CNTRY680,59699.0,556.0,1.0,11740.0,0.0,0
12076,2013,1,23,OFFICE51,IMP182800,3805900000,DEC2642,CNTRY233,712983.0,1.0,13372.0,94635.0,0.0,0
85496,2013,8,10,OFFICE51,IMP815180,8703332100,DEC5289,CNTRY277,5841.0,305.0,880.0,19261.0,0.0,0
2282,2014,5,7,OFFICE59,IMP823193,6912001000,DEC3088,CNTRY915,113839.0,1789.0,3829.0,1716.0,0.0,0
84704,2013,6,26,OFFICE59,IMP937153,1005900000,DEC2148,CNTRY994,318777.0,1.0,2513.0,52859.0,0.0,0
31044,2014,2,28,OFFICE40,IMP257154,6210400000,CA25147,CNTRY562,3837.0,1.0,5674.0,25953.0,0.0,0
36037,2013,2,4,OFFICE51,IMP298890,6905100000,DEC2516,CNTRY277,7310.0,120.0,685.0,2036.0,0.0,0
93144,2013,8,6,OFFICE51,IMP575237,8539299000,DEC5803,CNTRY846,149588.0,1.0,5645.0,2043.0,0.0,0
