#### HMA1 Class 
This demo notebook walks through the functionality of the HMA1 class for relational data modelling.

#### What is HMA1?
The `bulian.relational.HMA1` class implements the Hierarchical Modeling Algorithm, which recursively walks through a relational dataset and applies tabular models to every table, so that the models learn how the fields across all the tables are related.
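
To make the idea of a recursive walk concrete, here is a minimal, library-independent sketch. It only shows the traversal order over the demo schema used later in this notebook. The `CHILDREN` mapping and the `walk` helper are hypothetical names, and the modelling step that bulian performs at each table is not reproduced.

```python
# Hypothetical schema: parent table -> list of child tables.
CHILDREN = {
    "users": ["sessions"],
    "sessions": ["transactions"],
    "transactions": [],
}


def walk(table, children=CHILDREN, depth=0, visited=None):
    """Visit `table`, then recurse into each of its child tables.

    Returns a list of (depth, table) pairs in visiting order.
    """
    if visited is None:
        visited = []
    visited.append((depth, table))
    for child in children.get(table, []):
        walk(child, children, depth + 1, visited)
    return visited


# Print the schema as an indented tree, starting from the root table.
for depth, name in walk("users"):
    print("  " * depth + name)
```

Starting the walk at a table further down the hierarchy, for example `walk("sessions")`, visits only that table and its descendants.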

In [24]:
import sys

# Make the local bulian source tree importable
sys.path.insert(1, r'F:\Users\Kaggle\bulian')

#### Import all the relevant packages

In [25]:
import os
import sys
import pandas as pd
from bulian.metrics.demo import sample_relational_demo
from bulian.utils import display_tables
from bulian.relational import HMA1

#### Start by loading the Bulian demo relational dataset

In [26]:
metadata, tables = sample_relational_demo()

##### This returns two objects: the metadata describing the tables and their relationships, and a dict of the relational data tables

In [27]:
print(metadata)

Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id


In [28]:
for name, table in tables.items():
    print(name, table.shape)

users (30, 4)
sessions (77, 5)
transactions (419, 5)


In [29]:
display_tables(tables)

user_id,country,gender,age
0,ZM,M,28
1,KW,F,30
2,KW,M,42
3,NE,F,27
4,PW,F,26
5,NE,,37
6,PW,,34
7,KW,M,43
8,PW,F,35
9,ZM,F,33

session_id,user_id,device,os,minutes
0,1,tablet,android,57
1,2,pc,windows,41
2,3,tablet,android,29
3,4,tablet,android,44
4,4,mobile,android,67
5,4,pc,macos,61
6,4,pc,macos,40
7,4,pc,macos,55
8,4,pc,macos,63
9,4,tablet,android,57

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-11-24 01:16:37,7.99,False
1,0,2019-11-24 01:20:24,8.72,False
2,0,2019-11-24 01:23:01,9.69,False
3,0,2019-11-24 01:25:28,33.97,False
4,0,2019-11-24 01:32:45,24.49,False
5,0,2019-11-24 01:37:57,21.78,False
6,0,2019-11-24 01:52:47,21.95,False
7,0,2019-11-24 02:03:21,8.71,False
8,0,2019-11-24 02:06:17,13.06,False
9,1,2019-03-12 11:37:59,91.16,False


#### Deep dive into metadata class

- `get_tables()` returns the names of the tables in the dataset
- `get_parents(table)` returns the parent tables of a particular table in the child-parent relationships
 

In [30]:
metadata.get_tables()

['users', 'sessions', 'transactions']

In [31]:
metadata.get_parents('sessions')

{'users'}
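
The parent lookup above follows directly from the relationships printed by `print(metadata)`. As a rough, library-independent sketch, the same answers can be derived from plain foreign-key pairs. The `RELATIONSHIPS` list and both helper functions below are hypothetical stand-ins and are not part of bulian.

```python
# Hypothetical stand-in for the relationships shown by print(metadata):
# each entry is (child_table, parent_table).
RELATIONSHIPS = [
    ("sessions", "users"),
    ("transactions", "sessions"),
]


def get_parents(table, relationships=RELATIONSHIPS):
    """Return the set of tables that `table` references through a foreign key."""
    return {parent for child, parent in relationships if child == table}


def get_children(table, relationships=RELATIONSHIPS):
    """Return the set of tables that hold a foreign key pointing at `table`."""
    return {child for child, parent in relationships if parent == table}


print(get_parents("sessions"))   # {'users'}
print(get_children("sessions"))  # {'transactions'}
print(get_parents("users"))      # set()
```

A table with no parents, such as `users` here, is a root of the hierarchy.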

#### Next, let's fit an HMA1 model to this data so that we can later sample synthetic rows

- Import `bulian.relational.HMA1` and create an instance, passing it the metadata
- Fit the instance to the `tables` dict

In [39]:
model = HMA1(metadata)
model.fit(tables)

*The `fit` method iterates through all the tables in the `tables` dict and models each one with a Gaussian Copula model.*
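
To give a feel for how modelling one table can carry information about its children, the sketch below summarises the child rows of each parent row and attaches the summaries as extra parent columns. This illustrates the general hierarchical idea only: whether bulian builds its tables this way is an assumption, and `extend_parent` and the `child_*` column names are hypothetical. The rows are taken from the demo tables displayed earlier.

```python
from collections import defaultdict
from statistics import mean

# A few rows copied from the demo tables shown above (columns abbreviated).
users = [
    {"user_id": 1, "country": "KW"},
    {"user_id": 2, "country": "KW"},
    {"user_id": 4, "country": "PW"},
]
sessions = [
    {"session_id": 0, "user_id": 1, "minutes": 57},
    {"session_id": 1, "user_id": 2, "minutes": 41},
    {"session_id": 3, "user_id": 4, "minutes": 44},
    {"session_id": 4, "user_id": 4, "minutes": 67},
]


def extend_parent(parent_rows, child_rows, key, value):
    """Attach per-parent summaries of the child rows to each parent row."""
    # Group the child values by the foreign key they point at.
    grouped = defaultdict(list)
    for row in child_rows:
        grouped[row[key]].append(row[value])

    extended = []
    for row in parent_rows:
        values = grouped.get(row[key], [])
        extended.append({
            **row,
            "child_count": len(values),
            f"child_mean_{value}": mean(values) if values else None,
        })
    return extended


for row in extend_parent(users, sessions, key="user_id", value="minutes"):
    print(row)
```

A model fitted on such an extended parent table sees, for each user, how many sessions they have and how long those sessions tend to be, which is one way relationships between tables can be captured.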

#### Next, sample synthetic data rows from the fitted HMA1 model 

In [33]:
new_data = model.sample(num_rows=100)

Sampling rows: 100%|██████████| 100/100 [00:00<00:00, 503.29it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 219.86it/s]
Sampling rows: 100%|██████████| 9/9 [00:00<00:00, 473.61it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 144.25it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 329.92it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 218.22it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 214.58it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 326.57it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 55.56it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 227.82it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 103.40it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 53.43it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 265.98it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 227.28it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 210.29it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 160.00it/s]
...

In [34]:
display_tables(new_data)

user_id,country,gender,age
0,NE,F,31
1,NE,F,29
2,CN,F,26
3,NE,F,28
4,NE,M,36
5,PW,F,19
6,PW,M,28
7,KW,M,40
8,PW,M,22
9,ZM,,34

session_id,user_id,device,os,minutes
0,0,mobile,android,52
1,0,tablet,android,53
2,0,pc,windows,36
3,0,tablet,ios,32
4,1,mobile,android,19
5,1,mobile,android,49
6,1,pc,android,42
7,1,tablet,android,29
8,1,mobile,android,58
9,1,mobile,android,47

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-11-04 01:11:47,38.17,False
1,0,2019-11-04 00:56:45,59.6,False
2,0,2019-11-04 01:02:50,61.92,False
3,0,2019-11-04 01:16:44,60.02,False
4,0,2019-11-04 01:40:38,73.35,False
5,0,2019-11-04 01:25:29,63.34,False
6,0,2019-11-04 01:17:23,44.01,False
7,0,2019-11-04 01:08:12,27.36,False
8,1,2019-11-25 01:05:17,40.53,False
9,1,2019-11-25 00:58:24,61.62,False


In [35]:
for name, table in new_data.items():
    print(name, table.shape)

users (100, 4)
sessions (288, 5)
transactions (1512, 5)
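
A quick sanity check on relational synthetic data is that every foreign key points at an existing parent row. The helper below is a sketch of that check. `orphan_keys` is a hypothetical name, not a bulian function, and it is demonstrated on small stand-in tables rather than on real output.

```python
def orphan_keys(child, parent, foreign_key, primary_key=None):
    """Return foreign-key values in `child` that have no match in `parent`.

    Both tables only need to support column access by name
    (`table[column]`) returning an iterable, which holds for a dict of
    lists as well as for a pandas DataFrame.
    """
    primary_key = primary_key or foreign_key
    return set(child[foreign_key]) - set(parent[primary_key])


# Tiny stand-in tables; with real output you would pass
# new_data['sessions'] and new_data['users'] instead.
users = {"user_id": [0, 1, 2]}
sessions = {"session_id": [0, 1, 2, 3], "user_id": [0, 0, 1, 2]}
broken = {"session_id": [0, 1], "user_id": [0, 7]}

print(orphan_keys(sessions, users, "user_id"))  # set()
print(orphan_keys(broken, users, "user_id"))    # {7}
```

An empty set means every child row has a matching parent for that relationship.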


#### Save and share the model:  
Once the model is fitted, call its `save` method with the path of the file in which you want to store it. The file extension is not relevant, but we use `.pkl` here to highlight that the serialization protocol is pickle.

In [36]:
model.save('F:/Users/Kaggle/wids/relational_model.pkl')

**Important** If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains **no information about the original data**, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this `relational_model.pkl` file without the risk of disclosing any of your real data!
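
Because the file is written with pickle, reading it back follows the usual pickle round trip. The sketch below uses only the standard library and a plain dict with made-up numbers as a stand-in for a fitted model. Whether bulian also offers its own loading helper is not covered here.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: parameters only, no data rows.
# The numbers are invented for illustration.
fitted_params = {"table": "users", "age_mean": 30.0, "age_std": 5.0}

# Write the object to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as handle:
    pickle.dump(fitted_params, handle)

with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == fitted_params)  # True
```

On the receiving side, only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.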

#### Sample only a subset of tables? 

On some occasions you will not want to generate rows for the entire dataset, only for one table and its children.
To do this, pass the name of the table that you want to sample.
For example, if you pass the name `sessions` to the `sample` method, the model will only generate data for the `sessions` table and its child table, `transactions`.

In [40]:
model.sample('sessions', num_rows=25)

Sampling rows: 100%|██████████| 25/25 [00:00<00:00, 862.13it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 400.18it/s]
Sampling rows: 100%|██████████| 7/7 [00:00<00:00, 437.56it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 222.26it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 399.91it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 599.70it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 266.66it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 71.42it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 333.30it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 428.57it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 235.26it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 300.13it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 499.99it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 384.67it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 400.10it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 600.04it/s]
...

{'sessions':     session_id  user_id  device       os  minutes
 0            0        4      pc  windows       46
 1            1        3  mobile  windows       57
 2            2        5  tablet      ios       33
 3            3        6      pc      ios       40
 4            4        9  mobile    macos       54
 5            5        9      pc  android       19
 6            6        9      pc  windows       30
 7            7        1  tablet  windows       46
 8            8        6      pc      ios       30
 9            9        6  tablet      ios       40
 10          10        6      pc  android       38
 11          11        1      pc  android       37
 12          12        2      pc      ios       53
 13          13        1      pc  windows       31
 14          14        1      pc  android       52
 15          15        7  mobile  android       58
 16          16        8      pc      ios       51
 17          17        4  tablet      ios       52
 ...

If you want to further restrict the sampling process to only one table and also skip its child tables, you can add the argument `sample_children=False`. For example, you can sample data from the `users` table only, without producing any rows for the `sessions` and `transactions` tables.

In [41]:
model.sample('users', num_rows=5, sample_children=False)

Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 27.17it/s]


,user_id,country,gender,age
0,10,NE,M,25
1,11,CN,M,41
2,12,KW,F,26
3,13,KW,M,34
4,14,NE,F,31


**Note** In this case, since we are only producing a single table, the output is given directly as a `pandas.DataFrame` instead of a dictionary.
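
Since the return type depends on how `sample` is called, downstream code may want a single shape to work with. The helper below is a small sketch of one way to do that. `as_table_dict` is a hypothetical name, and plain lists stand in for DataFrames so the example runs on its own.

```python
def as_table_dict(result, table_name):
    """Wrap a single-table result in a dict so callers always see one shape."""
    if isinstance(result, dict):
        return result
    return {table_name: result}


# Stand-ins for the two kinds of return value described above.
single_table = [{"user_id": 10, "country": "NE"}]
many_tables = {"sessions": [], "transactions": []}

print(list(as_table_dict(single_table, "users")))     # ['users']
print(list(as_table_dict(many_tables, "sessions")))   # ['sessions', 'transactions']
```

With the fitted model from this notebook, the call would look like `as_table_dict(model.sample('users', num_rows=5, sample_children=False), 'users')`.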

### Fin ###