TVAE Model
===========

In this guide we will go through a series of steps that will let you
discover functionalities of the `TVAE` model, including how to:

-   Create an instance of `TVAE`.
-   Fit the instance to your data.
-   Generate synthetic versions of your data.
-   Use `TVAE` to anonymize PII information.
-   Customize the data transformations to improve the learning process.
-   Specify hyperparameters to improve the output quality.

What is TVAE?
--------------

The `sdv.tabular.TVAE` model is based on the VAE-based Deep Learning
data synthesizer which was presented at the NeurIPS 2019 conference in
the paper titled [Modeling Tabular data using Conditional
GAN](https://arxiv.org/abs/1907.00503).

Let's now discover how to learn a dataset and later on generate
synthetic data with the same format and statistical properties by using
the `TVAE` class from SDV.

Quick Usage
-----------

We will start by loading one of our demo datasets, the
`student_placements`, which contains information about MBA students that
applied for placements during the year 2020.

<div class="alert alert-warning">

**Warning**

In order to follow this guide you need to have `sdv` installed on your
system. If you have not done it yet, please install it now by
executing the command `pip install sdv` in a terminal.

</div>

In [1]:
from sdv.demo import load_tabular_demo

data = load_tabular_demo('student_placements')
data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


As you can see, this table contains information about students which
includes, among other things:

-   Their id and gender
-   Their grades and specializations
-   Their work experience
-   The salary that they were offered
-   The duration and dates of their placement

You will notice that there is data with the following characteristics:

-   There are float, integer, boolean, categorical and datetime values.
-   There are some variables that have missing data. In particular, all
    the data related to the placement details is missing in the rows
    where the student was not placed.

Let us use `TVAE` to learn this data and then sample synthetic data
about new students to see how well the model captures the characteristics
indicated above. In order to do this you will need to:

-   Import the `sdv.tabular.TVAE` class and create an instance of it.
-   Call its `fit` method passing our table.
-   Call its `sample` method indicating the number of synthetic rows
    that you want to generate.

In [2]:
from sdv.tabular import TVAE

model = TVAE()
model.fit(data)

<div class="alert alert-info">

**Note**

Notice that the model `fitting` process took care of transforming the
different fields using the appropriate [Reversible Data
Transforms](http://github.com/sdv-dev/RDT) to ensure that the data has a
format that the underlying TVAESynthesizer class can handle.

</div>

### Generate synthetic data from the model

Once the modeling has finished you are ready to generate new synthetic
data by calling the `sample` method of your model and passing the number
of rows that you want to generate.

In [3]:
new_data = model.sample(200)

This will return a table with the same structure as the one the model
was fitted on, but filled with new data that resembles the original.

In [4]:
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17289,M,83.35318,55.357234,Science,68.778783,Sci&Tech,False,1,80.490271,Mkt&Fin,59.951833,,True,NaT,NaT,3.0
1,17284,M,68.644864,66.734732,Science,69.178014,Comm&Mgmt,False,1,83.620164,Mkt&Fin,59.936255,,True,NaT,NaT,3.0
2,17317,M,70.126206,67.059427,Commerce,77.617235,Comm&Mgmt,False,0,81.372499,Mkt&Fin,62.349719,,False,NaT,NaT,3.0
3,17292,M,68.359096,78.149577,Arts,72.483335,Comm&Mgmt,False,2,84.87575,Mkt&Fin,61.789867,,False,NaT,NaT,3.0
4,17338,F,58.302424,73.128391,Arts,77.993735,Comm&Mgmt,False,0,75.140944,Mkt&Fin,57.964864,27479.083484,False,NaT,NaT,3.0


<div class="alert alert-info">

**Note**

You can control the number of rows by passing the desired number to
`model.sample(<num_rows>)`. To test this, try `model.sample(10000)`.
Note that the original table only had \~200 rows.

</div>

### Save and Load the model

In many scenarios it will be convenient to generate synthetic versions
of your data directly in systems that do not have access to the original
data source. For example, you may want to generate testing data on
the fly inside a testing environment that does not have access to your
production database. In these scenarios, fitting the model with real
data every time that you need to generate new data is not feasible, so
you will need to fit a model in your production environment, save the
fitted model into a file, send this file to the testing environment and
then load it there to be able to `sample` from it.

Let's see how this process works.

#### Save and share the model

Once you have fitted the model, all you need to do is call its `save`
method passing the name of the file in which you want to save the model.
Note that the extension of the filename is not relevant, but we will be
using the `.pkl` extension to highlight that the serialization protocol
used is [pickle](https://docs.python.org/3/library/pickle.html).

In [5]:
model.save('my_model.pkl')

This will have created a file called `my_model.pkl` in the same
directory in which you are running SDV.

<div class="alert alert-info">

**Important**

If you inspect the generated file you will notice that its size is much
smaller than the size of the data that you used to generate it. This is
because the serialized model contains **no information about the
original data**, other than the parameters it needs to generate
synthetic versions of it. This means that you can safely share this
`my_model.pkl` file without the risk of disclosing any of your real
data!

</div>

#### Load the model and generate new data

The file you just generated can be sent over to the system where the
synthetic data will be generated. Once it is there, you can load it
using the `TVAE.load` method, and then you are ready to sample new data
from the loaded instance:

In [6]:
loaded = TVAE.load('my_model.pkl')
new_data = loaded.sample(200)

<div class="alert alert-warning">

**Warning**

Notice that the system where the model is loaded also needs to have
`sdv` installed, otherwise it will not be able to load the model and
use it.

</div>
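
The save and load round trip described in this section can be illustrated without SDV. The following is a minimal sketch, assuming a toy `ToyModel` class that is purely hypothetical and only stands in for a fitted model. It pickles the learned parameters, restores them in a fresh instance, and samples from the restored copy without touching the training values again. How `TVAE` serializes itself internally may differ from this.

```python
import os
import pickle
import random
import tempfile


class ToyModel:
    """Hypothetical stand-in for a fitted model; not part of SDV."""

    def __init__(self):
        self.categories = None

    def fit(self, values):
        # Keep only the distinct categories, mimicking "learned parameters".
        self.categories = sorted(set(values))

    def sample(self, num_rows, seed=0):
        # Draw new values from the learned categories only.
        rng = random.Random(seed)
        return [rng.choice(self.categories) for _ in range(num_rows)]

    def save(self, path):
        # Serialize the learned parameters with pickle.
        with open(path, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, path):
        # Rebuild an instance from the pickled parameters.
        obj = cls()
        with open(path, 'rb') as f:
            obj.__dict__.update(pickle.load(f))
        return obj


toy = ToyModel()
toy.fit(['M', 'F', 'M', 'M'])

path = os.path.join(tempfile.mkdtemp(), 'toy_model.pkl')
toy.save(path)

# In practice this would happen on a different machine.
restored = ToyModel.load(path)
restored_rows = restored.sample(5)
```

Keep in mind that unpickling a file can execute arbitrary code, so pickle files should only be loaded when they come from a source you trust.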

### Specifying the Primary Key of the table

One of the first things that you may have noticed when looking at the
demo data is that there is a `student_id` column which acts as the
primary key of the table, and which is supposed to have unique values.
Indeed, if we look at the number of times that each value appears, we
see that all of them appear at most once:

In [7]:
data.student_id.value_counts().max()

1

However, if we look at the synthetic data that we generated, we observe
that there are some values that appear more than once:

In [8]:
new_data[new_data.student_id == new_data.student_id.value_counts().index[0]]

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
7,17315,M,71.830018,60.088855,Science,68.713454,Comm&Mgmt,False,1,81.23433,Mkt&Fin,58.058374,,False,2020-03-12,NaT,3.0
40,17315,F,57.186171,63.406604,Arts,72.573235,Comm&Mgmt,False,0,79.755246,Mkt&HR,59.036378,,False,NaT,NaT,3.0
54,17315,M,70.879378,80.173777,Science,71.242612,Comm&Mgmt,True,0,81.263191,Mkt&Fin,60.00583,,False,2020-03-06,NaT,12.0
74,17315,M,72.61429,69.012887,Arts,70.687134,Comm&Mgmt,False,2,75.979427,Mkt&Fin,58.761753,,False,2020-03-03,NaT,3.0
77,17315,F,60.256375,66.576178,Arts,73.041089,Comm&Mgmt,False,0,81.28182,Mkt&Fin,57.716816,,True,NaT,NaT,3.0
91,17315,M,74.159233,66.663158,Arts,69.389609,Comm&Mgmt,False,1,82.248418,Mkt&Fin,59.669755,,False,2020-03-12,NaT,3.0
148,17315,M,71.514676,66.794494,Commerce,70.620976,Comm&Mgmt,False,0,71.724508,Mkt&HR,58.279448,36438.982309,False,NaT,NaT,3.0
168,17315,F,58.622536,64.922292,Commerce,66.843349,Comm&Mgmt,False,0,79.949809,Mkt&Fin,60.526892,,False,2020-03-14,NaT,3.0
172,17315,M,70.786906,61.248514,Arts,67.548021,Comm&Mgmt,False,1,71.812168,Mkt&Fin,62.112848,,False,NaT,NaT,3.0


This happens because the model was not told at any point that the
`student_id` had to be unique, so when it generates new data it will
produce collisions sooner or later. In order to solve this, we can pass
the argument `primary_key` to our model when we create it, indicating
the name of the column that is the index of the table.

In [9]:
model = TVAE(
    primary_key='student_id'
)
model.fit(data)
new_data = model.sample(200)
new_data.head()

Unnamed: 0,student_id,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,F,62.998857,63.543086,Arts,69.635247,Others,True,0,87.211753,Mkt&Fin,70.954342,,True,NaT,NaT,6.0
1,1,F,76.727291,42.696298,Commerce,66.95968,Comm&Mgmt,False,0,82.768902,Mkt&Fin,69.002179,,False,NaT,NaT,12.0
2,2,M,79.224738,67.681799,Commerce,65.523451,Comm&Mgmt,True,0,78.135573,Mkt&HR,75.241672,,True,NaT,NaT,12.0
3,3,M,63.555886,60.726611,Commerce,69.673942,Comm&Mgmt,False,0,85.724415,Mkt&Fin,69.165557,,True,NaT,NaT,6.0
4,4,F,55.202356,48.766086,Commerce,64.707694,Comm&Mgmt,True,0,89.009612,Mkt&Fin,71.538128,,True,NaT,NaT,12.0


As a result, the model will learn that this column must be unique and
generate a unique sequence of values for the column:

In [10]:
new_data.student_id.value_counts().max()

1
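
The `value_counts().max()` check only tells us whether any key is repeated. As a small sketch of how the colliding keys themselves could be listed for plain Python rows, assuming a hypothetical helper `duplicated_keys` (not part of SDV or pandas) and a few hand-written ids chosen for illustration:

```python
from collections import Counter


def duplicated_keys(rows, key):
    """Return the sorted key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


# Hand-written rows: the first list repeats an id, the second does not.
rows_without_pk = [
    {'student_id': 17315},
    {'student_id': 17289},
    {'student_id': 17315},
]
rows_with_pk = [{'student_id': i} for i in range(3)]

print(duplicated_keys(rows_without_pk, 'student_id'))  # [17315]
print(duplicated_keys(rows_with_pk, 'student_id'))     # []
```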

### Anonymizing Personally Identifiable Information (PII)

There will be many cases where the data will contain Personally
Identifiable Information which we cannot disclose. In these cases, we
will want our Tabular Models to replace the information within these
fields with fake, simulated data that looks similar to the real data but
does not contain any of the original values.

Let's load a new dataset that contains a PII field, the
`student_placements_pii` demo, and try to generate synthetic versions of
it that do not contain any of the PII fields.

<div class="alert alert-info">

**Note**

The `student_placements_pii` dataset is a modified version of the
`student_placements` dataset with one new field, `address`, which
contains PII information about the students. Notice that this additional
`address` field has been simulated and does not correspond to data from
the real users.

</div>

In [11]:
data_pii = load_tabular_demo('student_placements_pii')
data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,17264,"70304 Baker Turnpike\nEricborough, MS 15086",M,67.0,91.0,Commerce,58.0,Sci&Tech,False,0,55.0,Mkt&HR,58.8,27000.0,True,2020-07-23,2020-10-12,3.0
1,17265,"805 Herrera Avenue Apt. 134\nMaryview, NJ 36510",M,79.33,78.33,Science,77.48,Sci&Tech,True,1,86.5,Mkt&Fin,66.28,20000.0,True,2020-01-11,2020-04-09,3.0
2,17266,"3702 Bradley Island\nNorth Victor, FL 12268",M,65.0,68.0,Arts,64.0,Comm&Mgmt,False,0,75.0,Mkt&Fin,57.8,25000.0,True,2020-01-26,2020-07-13,6.0
3,17267,Unit 0879 Box 3878\nDPO AP 42663,M,56.0,52.0,Science,52.0,Sci&Tech,False,0,66.0,Mkt&HR,59.43,,False,NaT,NaT,
4,17268,"96493 Kelly Canyon Apt. 145\nEast Steven, NC 3...",M,85.8,73.6,Commerce,73.3,Comm&Mgmt,False,0,96.8,Mkt&Fin,55.5,42500.0,True,2020-07-04,2020-09-27,3.0


If we use our tabular model on this new data we will see how the
synthetic data that it generates discloses the addresses from the real
students:

In [12]:
model = TVAE(
    primary_key='student_id',
)
model.fit(data_pii)
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"049 Kurt Fords\nNew Lawrenceton, MO 77895",F,61.024426,62.844666,Arts,75.142477,Comm&Mgmt,False,1,75.950522,Mkt&HR,57.927106,26556.058418,True,2020-07-23,NaT,
1,1,"8497 Steven Estate\nCindyberg, WV 01019",F,69.185621,81.806211,Arts,76.02162,Others,True,1,75.522416,Mkt&HR,66.729737,,True,NaT,NaT,
2,2,"049 Kurt Fords\nNew Lawrenceton, MO 77895",M,67.908509,68.44156,Commerce,68.592869,Sci&Tech,True,1,61.703539,Mkt&HR,66.648979,,True,NaT,2020-07-04,12.0
3,3,"231 Rachel Trail Apt. 886\nEast Jennifer, CO 2...",F,63.66708,82.409662,Commerce,72.425383,Sci&Tech,True,1,68.858513,Mkt&Fin,69.04553,,True,2020-07-01,NaT,12.0
4,4,"049 Kurt Fords\nNew Lawrenceton, MO 77895",F,68.908847,69.463604,Commerce,69.844127,Others,True,2,58.966773,Mkt&HR,61.089308,,True,NaT,2020-08-16,3.0


More specifically, we can see how all the addresses that have been
generated actually come from the original dataset:

In [13]:
new_data_pii.address.isin(data_pii.address).sum()

200

In order to solve this, we can pass an additional argument
`anonymize_fields` to our model when we create the instance. This
`anonymize_fields` argument will need to be a dictionary that contains:

-   The name of the field that we want to anonymize.
-   The category of the field that we want to use when we generate fake
    values for it.

The complete list of possible categories can be seen in the [Faker
Providers](https://faker.readthedocs.io/en/master/providers.html) page,
and it contains a huge list of concepts such as:

-   name
-   address
-   country
-   city
-   ssn
-   credit_card_number
-   credit_card_expire
-   credit_card_security_code
-   email
-   telephone
-   \...

In this case, since the field is an address, we will pass a
dictionary indicating the category `address`:

In [14]:
model = TVAE(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)
model.fit(data_pii)

As a result, we can see how the real `address` values have been replaced
by other fake addresses:

In [15]:
new_data_pii = model.sample(200)
new_data_pii.head()

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,0,"547 Duke Alley Suite 255\nKaramouth, TN 09731",M,54.697593,56.81794,Arts,60.335703,Sci&Tech,True,1,79.696961,Mkt&Fin,65.519709,,False,NaT,2020-10-13,
1,1,"7598 Chen Place\nSeanborough, WV 68818",F,49.787108,59.736321,Arts,68.166026,Comm&Mgmt,True,0,73.485755,Mkt&Fin,57.712093,,False,NaT,2020-11-05,
2,2,"14793 Jones Vista Suite 667\nBarbarabury, IN 6...",F,49.009085,60.296029,Arts,62.066417,Sci&Tech,True,0,76.621463,Mkt&Fin,59.786268,24302.144818,True,2020-03-07,2020-05-07,
3,3,"45135 Bryce Lock Suite 550\nLake Rebecca, DC 7...",F,52.359116,62.147668,Arts,69.115026,Sci&Tech,True,0,78.695888,Mkt&Fin,65.370524,,False,NaT,2020-09-05,3.0
4,4,"28277 Tammy Cliff Suite 735\nHarrisonview, NY ...",F,55.017693,62.734469,Arts,64.701996,Comm&Mgmt,True,2,81.621391,Mkt&Fin,56.661962,,True,2020-03-08,2020-08-01,


Which means that none of the original addresses can be found in the
sampled data:

In [16]:
data_pii.address.isin(new_data_pii.address).sum()

0
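
The `isin` checks in this section count how many values are shared between the real and the sampled `address` columns. A minimal sketch of the same idea in plain Python, assuming a hypothetical helper `leaked_values` and invented placeholder addresses (neither comes from SDV nor from the demo dataset):

```python
def leaked_values(real_values, synthetic_values):
    """Return the synthetic values that also occur in the real data."""
    real = set(real_values)
    return [value for value in synthetic_values if value in real]


# Invented placeholder addresses, used only for illustration.
real_addresses = ['1 Example Road', '2 Sample Street']
copied = ['2 Sample Street', '2 Sample Street', '1 Example Road']
anonymized = ['9 Placeholder Lane', '7 Mock Avenue']

print(len(leaked_values(real_addresses, copied)))      # 3
print(len(leaked_values(real_addresses, anonymized)))  # 0
```

A count of zero is what we expect after anonymization, while a count equal to the number of sampled rows means every sampled value was copied from the real data.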

As we can see, in this case these modifications changed the obtained
results slightly, but they did not introduce dramatic changes in
performance.

### Conditional Sampling

As the name implies, conditional sampling allows us to sample from a conditional distribution using the `TVAE` model, which means we can generate only values that satisfy certain conditions. These conditional values can be passed to the `conditions` parameter in the `sample` method either as a dataframe or a dictionary.

In case a dictionary is passed, the model will generate as many rows as requested, all of which will satisfy the specified conditions, such as `gender = M`.

In [17]:
conditions = {
    'gender': 'M'
}
model.sample(5, conditions=conditions)

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,1,"564 Baker Place\nSouth Joshuachester, MI 15093",M,59.268576,63.748501,Commerce,75.52907,Comm&Mgmt,True,0,73.899364,Mkt&Fin,57.946745,37267.004708,True,2020-03-01,2020-12-06,6.0
1,1,"0872 Kimberly Light\nEast Christopherstad, AK ...",M,53.494102,65.998045,Arts,67.540636,Sci&Tech,True,0,82.876376,Mkt&Fin,63.490348,23985.606414,True,NaT,2020-08-30,
2,6,"723 Judy Garden\nWest Stephen, KY 26982",M,55.337768,60.708585,Arts,60.157231,Sci&Tech,True,1,75.91214,Mkt&Fin,59.409489,23181.438149,True,NaT,2020-10-02,6.0
3,1,"77750 Hall Passage Suite 739\nErinside, GA 52442",M,55.127807,63.145479,Arts,68.107433,Sci&Tech,True,0,86.822579,Mkt&Fin,58.861254,,False,NaT,2021-02-22,
4,2,"052 Timothy Brooks\nNorth Joshua, PA 59031",M,63.600719,64.735011,Commerce,63.043982,Sci&Tech,True,0,56.857493,Mkt&Fin,60.949091,,True,2020-03-05,2020-10-18,12.0


It's also possible to condition on multiple columns, such as `gender = M` and `experience_years = 0`.

In [19]:
conditions = {
    'gender': 'M',
    'experience_years': 0
}
model.sample(5, conditions=conditions)

Unnamed: 0,student_id,address,gender,second_perc,high_perc,high_spec,degree_perc,degree_type,work_experience,experience_years,employability_perc,mba_spec,mba_perc,salary,placed,start_date,end_date,duration
0,2,"0020 Brett Square Suite 092\nHuynhberg, CA 22876",M,61.934265,63.800347,Arts,75.068595,Sci&Tech,True,0,58.18634,Mkt&Fin,57.844271,,False,NaT,2020-09-10,
1,2,"8911 Dalton Valley Suite 943\nGreermouth, OK 0...",M,51.913288,72.483957,Arts,67.319725,Sci&Tech,True,0,81.802448,Mkt&Fin,60.458977,48354.545977,True,NaT,2020-12-15,3.0
2,10,"6012 Diaz Row Apt. 337\nAlvaradoberg, NJ 09205",M,57.062414,64.108624,Arts,64.033669,Sci&Tech,True,0,81.730565,Mkt&Fin,60.058818,,False,NaT,2021-01-18,
3,11,"94130 Timothy Forks Suite 869\nJessemouth, MO ...",M,54.96155,63.243112,Arts,71.81602,Sci&Tech,True,0,73.701814,Mkt&Fin,59.344342,,False,NaT,2020-09-02,
4,0,"749 Allen Inlet\nEast Shannonmouth, KS 11549",M,54.320593,63.751386,Arts,70.809816,Comm&Mgmt,True,0,78.717891,Mkt&Fin,56.568123,31014.332133,True,2020-03-05,2020-09-16,6.0


`conditions` can also be passed as a dataframe. In that case, the model will generate one sample for each row of the dataframe, sorted in the same order. Since the model already knows how many samples to generate, passing it as a parameter is unnecessary. For example, if we want to generate three samples where `gender = M` and three samples with `gender = F`, all of them with `work_experience = False`, we can do the following: 

In [None]:
import pandas as pd 

conditions = pd.DataFrame({
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'work_experience': [False, False, False, False, False, False]
})
model.sample(conditions=conditions)

`TVAE` also supports conditioning on continuous values, as long as the values are within the range of values seen during fitting. For example, if all the values of a column in the dataset are between 0 and 1, `TVAE` will not be able to set this value to 1000.

In [None]:
conditions = {
    'degree_perc': 70.0
}
model.sample(5, conditions=conditions)

<div class="alert alert-info">

**Note**

Currently, conditional sampling works through a rejection sampling process, where rows are sampled repeatedly until one that satisfies the conditions is found. If you run into a `Could not get enough valid rows within x trials` error, or simply wish to optimize the results, there are three parameters that can be fine-tuned: `max_rows_multiplier`, `max_retries` and `float_rtol`. More information about these parameters can be found in the API section.

</div>
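
To make the idea of rejection sampling more concrete, here is a minimal, self-contained sketch. It is not SDV's implementation. The functions `matches` and `sample_conditionally`, the `toy_sampler` and the `batch_multiplier` argument are all hypothetical, and the `max_retries` and `float_rtol` arguments only borrow their names from the parameters mentioned in the note. Their exact behavior in SDV may differ.

```python
import math
import random


def matches(row, conditions, float_rtol=0.01):
    """Check a row against the conditions.

    Float conditions are compared with a relative tolerance,
    everything else must match exactly.
    """
    for column, expected in conditions.items():
        actual = row[column]
        if isinstance(expected, float):
            if not math.isclose(actual, expected, rel_tol=float_rtol):
                return False
        elif actual != expected:
            return False
    return True


def sample_conditionally(sample_row, num_rows, conditions,
                         max_retries=100, batch_multiplier=10):
    """Draw rows from `sample_row` until enough of them satisfy `conditions`."""
    valid = []
    for _ in range(max_retries):
        # Oversample, then keep only the rows that satisfy the conditions.
        batch = [sample_row() for _ in range(num_rows * batch_multiplier)]
        valid.extend(row for row in batch if matches(row, conditions))
        if len(valid) >= num_rows:
            return valid[:num_rows]
    raise ValueError(
        f'Could not get enough valid rows within {max_retries} trials'
    )


rng = random.Random(0)


def toy_sampler():
    """Hypothetical unconditional sampler standing in for a fitted model."""
    return {
        'gender': rng.choice(['M', 'F']),
        'experience_years': rng.choice([0, 1, 2]),
    }


rows = sample_conditionally(
    toy_sampler, 5, {'gender': 'M', 'experience_years': 0}
)
```

The sketch also shows why rare conditions are expensive: the fewer rows satisfy the conditions, the more batches have to be drawn before enough valid rows are collected, and a condition that is never satisfied ends in an error once the retries are exhausted.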

### How do I specify constraints?

If you look closely at the data you may notice that some properties were
not completely captured by the model. For example, you may have seen
that sometimes the model produces an `experience_years` number greater
than `0` while also indicating that `work_experience` is `False`. These
types of properties are what we call `Constraints` and can also be
handled using `SDV`. For further details about them please visit the
[Handling Constraints](04_Handling_Constraints.ipynb) guide.
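
Before defining a constraint, it can help to measure how often the property is violated. The following sketch assumes a hypothetical helper `count_violations` and a few hand-written rows shaped like the sampled output shown earlier. It counts the rows that report `experience_years` greater than `0` while `work_experience` is `False`:

```python
def count_violations(rows):
    """Count rows that report experience years without work experience."""
    return sum(
        1 for row in rows
        if not row['work_experience'] and row['experience_years'] > 0
    )


# Hand-written rows for illustration; only the first one is inconsistent.
sampled_rows = [
    {'work_experience': False, 'experience_years': 1},
    {'work_experience': False, 'experience_years': 0},
    {'work_experience': True, 'experience_years': 2},
]

print(count_violations(sampled_rows))  # 1
```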

### Can I evaluate the Synthetic Data?

A very common question when someone starts using **SDV** to generate
synthetic data is: *\"How good is the data that I just generated?\"*

In order to answer this question, **SDV** has a collection of metrics
and tools that allow you to compare the *real* data that you provided
and the *synthetic* data that you generated using **SDV** or any other
tool.

You can read more about this in the [Evaluating Synthetic Data Generators](
05_Evaluating_Synthetic_Data_Generators.ipynb) guide.