#### HMA1 Class 
This demo notebook walks through the functionality of the HMA1 class for relational data modelling.

#### What is HMA1?
The `bulian.relational.HMA1` class implements a Hierarchical Modeling Algorithm. The algorithm recursively walks through a relational dataset and applies tabular models to every table, so that the models learn how the fields across all the tables relate to one another.
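To make the idea of a recursive walk concrete, here is a minimal sketch of one possible visiting order over the demo tables, in which every child table is handled before its parent. The `(child, parent)` pairs and the helper names are assumptions made for illustration; this is not bulian's implementation, only a toy traversal in plain Python.

```python
# Toy illustration only: NOT bulian's implementation.
# Foreign keys of the demo dataset, written as (child, parent) pairs.
RELATIONSHIPS = [("sessions", "users"), ("transactions", "sessions")]


def children_of(table, relationships):
    """Return the tables that hold a foreign key pointing at `table`."""
    return [child for child, parent in relationships if parent == table]


def root_tables(relationships):
    """Return the tables that act as a parent but never as a child."""
    parents = {parent for _, parent in relationships}
    children = {child for child, _ in relationships}
    return sorted(parents - children)


def walk(table, relationships, visited=None):
    """Visit every descendant of `table` before `table` itself (post-order)."""
    if visited is None:
        visited = []
    for child in children_of(table, relationships):
        walk(child, relationships, visited)
    if table not in visited:
        visited.append(table)
    return visited


order = []
for root in root_tables(RELATIONSHIPS):
    walk(root, RELATIONSHIPS, order)

print(order)  # ['transactions', 'sessions', 'users']
```

Visiting the leaves first means that, by the time a parent table is reached, everything below it has already been seen, which is what lets information from child tables flow upward.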

In [30]:
import sys

# Make the local bulian checkout importable
sys.path.insert(1, r'F:\Users\Kaggle\bulian')

#### Import all the relevant packages

In [31]:
import os
import sys
import pandas as pd
from bulian.metrics.demo import sample_relational_demo
from bulian.utils import display_tables
from bulian.relational import HMA1

#### Start by loading the Bulian demo relational dataset

In [32]:
metadata, tables = sample_relational_demo()

In [33]:
metadata

Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id

##### `sample_relational_demo()` returns two objects: the metadata associated with the tables, and a dict of the relational data tables

In [34]:
print(metadata)

Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id


In [35]:
for name, table in tables.items():
    print(name, table.shape)

users (30, 4)
sessions (43, 5)
transactions (226, 5)


In [36]:
display_tables(tables)

user_id,country,gender,age
0,PK,F,32
1,IS,,21
2,MV,F,25
3,MM,F,40
4,MV,M,42
5,IS,F,32
6,MV,F,23
7,MM,M,41
8,MV,F,37
9,KN,F,28

session_id,user_id,device,os,minutes
0,0,mobile,android,48
1,1,mobile,android,63
2,3,tablet,android,40
3,3,mobile,android,35
4,4,tablet,android,53
5,4,tablet,android,67
6,5,tablet,ios,39
7,5,mobile,ios,46
8,5,tablet,ios,31
9,5,tablet,ios,41

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-12-24 18:04:14,44.01,False
1,0,2019-12-24 18:16:54,22.84,False
2,0,2019-12-24 18:18:07,159.0,False
3,0,2019-12-24 18:22:25,112.66,False
4,0,2019-12-24 18:32:21,33.31,False
5,0,2019-12-24 18:43:41,59.43,False
6,0,2019-12-24 18:47:48,104.97,False
7,1,2019-11-10 17:17:54,37.14,False
8,1,2019-11-10 17:50:26,68.12,False
9,1,2019-11-10 17:56:00,56.76,False


#### Deep dive into metadata class

- `get_tables()` lists the names of the tables in the dataset
- `get_parents(table)` returns the parent tables of a given table, that is, its child-parent relationships

In [37]:
metadata.get_tables()

['users', 'sessions', 'transactions']

In [38]:
metadata.get_parents('sessions')

{'users'}
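The parent lookup can be reproduced from the relationship lines that the metadata prints. The sketch below parses those lines in plain Python; `parse_relationship`, `parents_of` and `children_of` are hypothetical helpers written for this illustration and do not call into bulian.

```python
# Hypothetical helpers for illustration; they only parse the printed text.
PRINTED_RELATIONSHIPS = [
    "sessions.user_id -> users.user_id",
    "transactions.session_id -> sessions.session_id",
]


def parse_relationship(line):
    """Split 'child.fk -> parent.pk' into (child, fk, parent, pk)."""
    child_side, parent_side = (part.strip() for part in line.split("->"))
    child, foreign_key = child_side.split(".", 1)
    parent, primary_key = parent_side.split(".", 1)
    return child, foreign_key, parent, primary_key


def parents_of(table, lines):
    """Tables that `table` points at through a foreign key."""
    parsed = map(parse_relationship, lines)
    return {parent for child, _, parent, _ in parsed if child == table}


def children_of(table, lines):
    """Tables that point at `table` through a foreign key."""
    parsed = map(parse_relationship, lines)
    return {child for child, _, parent, _ in parsed if parent == table}


print(parents_of("sessions", PRINTED_RELATIONSHIPS))   # {'users'}
print(children_of("sessions", PRINTED_RELATIONSHIPS))  # {'transactions'}
```

The first result matches the output of `metadata.get_parents('sessions')` shown above.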

#### Next, let's fit an HMA1 model to this data so that we can later sample synthetic rows

- Import `bulian.relational.HMA1` and create an instance, passing it the metadata
- Fit the instance to the `tables` dict

In [39]:
model = HMA1(metadata)
model.fit(tables)

*The `fit` method iterates over every table in the `tables` dict and models each one with a Gaussian copula model.*
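For readers unfamiliar with Gaussian copulas, the following is a minimal two-column sketch of the general technique: learn each column's marginal distribution separately, capture the dependence between columns as a correlation in normal-score space, then sample correlated normals and map them back. It is an assumption-laden toy, not bulian's code: it uses empirical marginals (sorted observed values) instead of fitted parametric distributions, and the input numbers are made up.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normal_scores(values):
    """Map each value to a standard-normal score via its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_toy_copula(col_a, col_b):
    """Store the marginals (sorted values) and one dependence parameter."""
    rho = correlation(normal_scores(col_a), normal_scores(col_b))
    return {"a": sorted(col_a), "b": sorted(col_b), "rho": rho}


def empirical_quantile(sorted_values, u):
    """Return the observed value at probability level `u`."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_toy_copula(model, num_rows, seed=0):
    """Draw correlated normals, then map them back through the marginals."""
    rng = random.Random(seed)
    rho = model["rho"]
    noise_scale = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(num_rows):
        z_a = rng.gauss(0, 1)
        z_b = rho * z_a + noise_scale * rng.gauss(0, 1)
        rows.append((
            empirical_quantile(model["a"], STD_NORMAL.cdf(z_a)),
            empirical_quantile(model["b"], STD_NORMAL.cdf(z_b)),
        ))
    return rows


# Made-up numbers, not taken from the demo tables.
col_a = [21, 25, 32, 37, 40, 42]
col_b = [31, 35, 48, 40, 53, 67]
toy_model = fit_toy_copula(col_a, col_b)
toy_rows = sample_toy_copula(toy_model, num_rows=5)
print(toy_model["rho"], toy_rows)
```

Because the toy uses empirical marginals, it can only emit values it has already seen in each column; a parametric marginal would allow new values.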

#### Next, sample synthetic data rows from the fitted HMA1 model 

In [40]:
new_data = model.sample(num_rows=100)

Sampling rows: 100%|██████████| 100/100 [00:00<00:00, 349.49it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 97.36it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 262.97it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 198.93it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 93.40it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 37.81it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 223.26it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 157.90it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 160.93it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 53.57it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 55.83it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 52.66it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 206.71it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 56.33it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 111.20it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 53.58it/s]
Sampling row

In [41]:
display_tables(new_data)

user_id,country,gender,age
0,MM,M,40
1,IS,F,27
2,MV,F,32
3,PK,M,28
4,IS,F,21
5,MM,F,21
6,MV,M,28
7,MV,F,31
8,PK,M,26
9,IS,F,31

session_id,user_id,device,os,minutes
0,0,pc,android,21
1,0,mobile,android,27
2,2,pc,android,83
3,2,pc,ios,27
4,2,pc,android,62
5,2,pc,linux,45
6,2,pc,android,19
7,3,tablet,android,47
8,3,pc,android,63
9,7,pc,android,23

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-09-09 07:31:08,64.23,False
1,0,2019-09-09 07:37:11,65.41,False
2,0,2019-09-09 07:27:10,65.87,False
3,1,2019-08-07 10:59:23,24.49,False
4,1,2019-08-07 11:02:12,34.6,False
5,1,2019-08-07 10:56:24,94.5,False
6,2,2019-07-18 14:00:23,1.87,True
7,2,2019-07-18 14:34:03,0.55,True
8,2,2019-07-18 13:58:59,7.1,False
9,2,2019-07-18 14:15:57,8.08,False


In [42]:
for name, table in new_data.items():
    print(name, table.shape)

users (100, 4)
sessions (164, 5)
transactions (771, 5)
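A quick sanity check on relational synthetic data is that every foreign key points at an existing parent row. The sketch below shows one way to test this on plain records; `orphan_keys` is a hypothetical helper, and the records are limited to the key columns of the first ten synthetic rows displayed above.

```python
def orphan_keys(child_rows, foreign_key, parent_rows, primary_key):
    """Return foreign-key values in the child rows that match no parent row."""
    parent_keys = {row[primary_key] for row in parent_rows}
    child_keys = {row[foreign_key] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Key columns of the first ten synthetic rows displayed above.
users = [{"user_id": user_id} for user_id in range(10)]
sessions = [
    {"session_id": session_id, "user_id": user_id}
    for session_id, user_id in enumerate([0, 0, 2, 2, 2, 2, 2, 3, 3, 7])
]
transactions = [
    {"transaction_id": transaction_id, "session_id": session_id}
    for transaction_id, session_id in enumerate([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
]

print(orphan_keys(sessions, "user_id", users, "user_id"))              # []
print(orphan_keys(transactions, "session_id", sessions, "session_id")) # []

# A deliberately broken row, to show what a violation looks like.
broken_sessions = sessions + [{"session_id": 10, "user_id": 99}]
print(orphan_keys(broken_sessions, "user_id", users, "user_id"))       # [99]
```

To run the same check on the full sampled tables, each DataFrame can be converted to records of this shape with `new_data['sessions'].to_dict('records')`.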


#### Save and share the model:  
Once the model is fitted, call its `save` method with the path of the file to write. The file extension does not matter; this notebook uses `.pkl` to signal that the model is serialized with pickle.

In [43]:
model.save('F:/Users/Kaggle/wids/relational_model.pkl')

**Important** If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains **no information about the original data**, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this `relational_model.pkl` file without the risk of disclosing any of your real data!
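Since the text above states that the model is serialized with pickle, the round trip can be sketched with the standard library alone. The example below saves and restores a plain dict of made-up parameters standing in for a fitted model; `save_parameters` and `load_parameters` are hypothetical names, and how bulian itself reloads a saved model is not shown in this notebook.

```python
import os
import pickle
import tempfile

# Made-up parameters standing in for whatever a fitted model stores.
toy_parameters = {"rho": 0.8, "marginal_a": [21, 25, 32], "marginal_b": [31, 35, 48]}


def save_parameters(parameters, path):
    """Serialize the parameters to `path` with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(parameters, handle)


def load_parameters(path):
    """Read back what `save_parameters` wrote."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_model.pkl")
    save_parameters(toy_parameters, path)
    size_in_bytes = os.path.getsize(path)
    restored = load_parameters(path)

print(restored == toy_parameters)  # True
print(size_in_bytes)
```

One caveat when sharing pickle files: unpickling can execute arbitrary code, so only load a `.pkl` file that comes from a source you trust.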

#### Sample only a subset of tables? 

On some occasions you will not want to generate rows for the entire dataset, only for one table and its children.
To do this, pass the name of the table that you want to sample.
For example, if you pass the name `sessions` to the `sample` method, the model generates data only for the `sessions` table and its child table, `transactions`.

In [44]:
model.sample('sessions', num_rows=25)

Sampling rows: 100%|██████████| 25/25 [00:00<00:00, 2471.54it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 765.80it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 120.87it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 113.69it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 508.98it/s]
Sampling rows: 100%|██████████| 6/6 [00:00<00:00, 718.80it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 423.79it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 624.99it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 555.63it/s]
Sampling rows: 100%|██████████| 4/4 [00:00<00:00, 444.50it/s]
Sampling rows: 100%|██████████| 7/7 [00:00<00:00, 777.65it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 99.97it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 249.97it/s]
Sampling rows: 100%|██████████| 9/9 [00:00<00:00, 749.92it/s]
Sampling rows: 100%|██████████| 7/7 [00:00<00:00, 777.67it/s]
Sampling rows: 100%|██████████| 8/8 [00:00<00:00, 727.18it/s]
Sampli

{'sessions':     session_id  user_id  device       os  minutes
 0          164      105  tablet  windows       51
 1          165      101      pc    linux       44
 2          166      112  tablet  android       26
 3          167      103      pc  android       43
 4          168      107      pc  android       56
 5          169      112  tablet    linux       46
 6          170      107      pc  android       46
 7          171      100      pc    linux       60
 8          172      107  mobile  android       38
 9          173      110      pc  android       48
 10         174      107  tablet    linux       21
 11         175      103  tablet  android       29
 12         176      105      pc    linux       47
 13         177      101      pc    linux       57
 14         178      107      pc  android       46
 15         179      105      pc    linux       45
 16         180      100  mobile  android       29
 17         181      113      pc  android       62
 18         182    
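Which tables appear in the result follows from the relationships: the requested table plus everything below it. A minimal sketch of that rule, using `(child, parent)` pairs as an assumed representation of the metadata and a hypothetical helper `tables_covered`:

```python
# Toy illustration only; the (child, parent) pairs mirror the demo metadata.
RELATIONSHIPS = [("sessions", "users"), ("transactions", "sessions")]


def tables_covered(table, relationships):
    """Return `table` followed by all of its descendants."""
    covered = [table]
    for child, parent in relationships:
        if parent == table:
            for name in tables_covered(child, relationships):
                if name not in covered:
                    covered.append(name)
    return covered


print(tables_covered("sessions", RELATIONSHIPS))      # ['sessions', 'transactions']
print(tables_covered("users", RELATIONSHIPS))         # ['users', 'sessions', 'transactions']
print(tables_covered("transactions", RELATIONSHIPS))  # ['transactions']
```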

To restrict sampling to a single table and skip its child tables, add the argument `sample_children=False`. For example, you can sample from the `users` table alone, without producing any rows for the `sessions` and `transactions` tables.

In [45]:
model.sample('users', num_rows=5, sample_children=False)

Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 23.92it/s]


Unnamed: 0,user_id,country,gender,age
0,117,MM,F,22
1,118,KN,F,32
2,119,PK,F,30
3,120,MV,F,26
4,121,MV,F,32


**Note** In this case, since we are only producing a single table, the output is given directly as a `pandas.DataFrame` instead of a dictionary.
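Because the return type depends on how `sample` is called, downstream code that loops over tables may want a uniform shape. One possible approach is a small wrapper like the one below; `as_table_dict` is a hypothetical helper, and the lists merely stand in for DataFrames so the sketch runs without any dependencies.

```python
def as_table_dict(result, table_name):
    """Wrap a single-table result so callers can always iterate over a dict."""
    if isinstance(result, dict):
        return result
    return {table_name: result}


# Stand-ins for sampled output: lists take the place of DataFrames here.
multi_table_result = {"sessions": ["row"], "transactions": ["row", "row"]}
single_table_result = ["row", "row", "row"]

for result in (multi_table_result, single_table_result):
    for name, table in as_table_dict(result, "users").items():
        print(name, len(table))
```

With the fitted model from this notebook, the same wrapper could be applied as `as_table_dict(model.sample('users', num_rows=5, sample_children=False), 'users')`.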

### Fin ###