#### HMA1 Class 
This demo notebook walks through the functionality of the HMA1 class for relational data modelling.

#### What is HMA1?
The `bulian.relational.HMA1` class implements a Hierarchical Modeling Algorithm. The algorithm recursively walks through a relational dataset and applies tabular models across all the tables, so that the models learn how the fields from all the tables are related.
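
To make the idea of a recursive walk concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the `modeling_order` helper and the `(child, parent)` representation are assumptions made for illustration, and the actual traversal order used by HMA1 is not stated in this notebook. The sketch visits every child table before its parent, which is one plausible order for a hierarchical model.

```python
# Hypothetical sketch: relationships written as (child, parent) pairs,
# mirroring the demo schema used later in this notebook.
relationships = [
    ("sessions", "users"),
    ("transactions", "sessions"),
]


def modeling_order(relationships):
    """Return table names in post-order: every child before its parent."""
    children = {}
    tables = []
    for child, parent in relationships:
        children.setdefault(parent, []).append(child)
        for name in (parent, child):
            if name not in tables:
                tables.append(name)

    # Root tables are those that never appear as a child.
    has_parent = {child for child, _ in relationships}
    roots = [t for t in tables if t not in has_parent]

    order = []

    def visit(table):
        # Recurse into the children first, then record the table itself.
        for child in children.get(table, []):
            visit(child)
        if table not in order:
            order.append(table)

    for root in roots:
        visit(root)
    return order


order = modeling_order(relationships)
print(order)
```

With the demo schema, the walk starts at `users` (the only table without a parent) and descends through `sessions` to `transactions` before recording anything.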

#### Import all the relevant packages

In [13]:
import os
import sys
import pandas as pd
from bulian.metrics.demo import sample_relational_demo
from bulian.metrics.reports import get_multi_table_report
from bulian.utils import display_tables
from bulian.relational import HMA1

#### Start by loading the Bulian demo relational dataset

In [28]:
metadata, tables = sample_relational_demo()

In [15]:
metadata

Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id

##### This returns two objects: the metadata describing the tables and their relationships, and a dict of the relational data tables

In [16]:
print(metadata)

Metadata
  root_path: .
  tables: ['users', 'sessions', 'transactions']
  relationships:
    sessions.user_id -> users.user_id
    transactions.session_id -> sessions.session_id


In [17]:
for name, table in tables.items():
    print(name, table.shape)

users (30, 4)
sessions (58, 5)
transactions (269, 5)


In [18]:
display_tables(tables)

user_id,country,gender,age
0,BG,M,28
1,AL,F,21
2,AR,M,37
3,AL,M,30
4,AR,F,26
5,GW,F,27
6,AR,M,43
7,IQ,M,26
8,AL,F,36
9,AL,M,34

session_id,user_id,device,os,minutes
0,1,tablet,android,57
1,2,pc,windows,52
2,3,tablet,android,43
3,5,mobile,android,28
4,5,pc,windows,53
5,6,pc,windows,46
6,6,pc,windows,37
7,7,tablet,android,43
8,8,tablet,ios,50
9,8,tablet,ios,38

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-05-14 12:45:35,7.82,False
1,0,2019-05-14 12:51:14,17.97,False
2,0,2019-05-14 13:00:30,14.76,False
3,0,2019-05-14 13:11:29,3.22,False
4,0,2019-05-14 13:14:21,6.16,False
5,0,2019-05-14 13:14:28,2.94,True
6,0,2019-05-14 13:25:12,18.87,False
7,0,2019-05-14 13:25:55,12.86,False
8,0,2019-05-14 13:31:01,6.55,False
9,1,2019-03-10 05:13:34,77.04,False


#### Deep dive into metadata class

- `get_tables()` lists the names of the tables described by the metadata
- `get_parents(table)` returns the parent tables of a particular table
 

In [19]:
metadata.get_tables()

['users', 'sessions', 'transactions']

In [20]:
metadata.get_parents('sessions')

{'users'}
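
A parent-child relationship implies that every foreign key value in the child table should match a key in its parent. The following is a minimal sketch of such a check using pandas only. The miniature tables and the `orphan_keys` helper are hypothetical and are not part of the bulian API.

```python
import pandas as pd

# Hypothetical miniature tables that mimic the demo schema.
users = pd.DataFrame({"user_id": [0, 1, 2]})
sessions = pd.DataFrame({"session_id": [0, 1, 2], "user_id": [1, 2, 2]})
transactions = pd.DataFrame({"transaction_id": [0, 1], "session_id": [0, 5]})


def orphan_keys(child, parent, key):
    """Return the foreign-key values in `child` that have no match in `parent`."""
    missing = ~child[key].isin(parent[key])
    return sorted(child.loc[missing, key].unique().tolist())


# Every session points at an existing user, so nothing is reported.
print(orphan_keys(sessions, users, "user_id"))

# One transaction points at session 5, which does not exist.
print(orphan_keys(transactions, sessions, "session_id"))
```

The same helper could be applied to the real `tables` dict, or to synthetic output, as a quick integrity check.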

#### Next, let's fit an HMA1 model to this data so that we can later sample synthetic rows

- Create an instance of `bulian.relational.HMA1` (imported above), passing the metadata
- Fit the instance to the `tables` dict

In [29]:
model = HMA1(metadata)
model.fit(tables)

*The `fit` method iterates through all the tables in the `tables` dict and models each one with a Gaussian Copula model.*

#### Next, sample synthetic data rows from the fitted HMA1 model 

In [30]:
new_data = model.sample(num_rows=100)

Sampling rows: 100%|██████████| 100/100 [00:00<00:00, 654.05it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 32.84it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 103.71it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 160.73it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 98.71it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 166.02it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 52.63it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 147.27it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 254.38it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 154.01it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 147.25it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 51.55it/s]
Sampling rows: 100%|██████████| 5/5 [00:00<00:00, 259.88it/s]
Sampling rows: 100%|██████████| 3/3 [00:00<00:00, 161.19it/s]
Sampling rows: 100%|██████████| 2/2 [00:00<00:00, 101.98it/s]
Sampling rows: 100%|██████████| 1/1 [00:00<00:00, 49.39it/s]

In [23]:
display_tables(new_data)

user_id,country,gender,age
0,AL,M,34
1,AR,F,29
2,AL,F,34
3,AL,F,27
4,AL,M,40
5,AR,M,43
6,BG,M,33
7,AR,F,34
8,AR,M,40
9,AL,F,32

session_id,user_id,device,os,minutes
0,0,pc,windows,46
1,0,pc,android,70
2,2,tablet,windows,40
3,2,tablet,windows,33
4,2,mobile,windows,50
5,2,mobile,android,44
6,3,tablet,android,61
7,3,pc,windows,50
8,3,tablet,ios,52
9,4,tablet,ios,52

transaction_id,session_id,timestamp,amount,cancelled
0,0,2019-01-08 15:30:56,87.66,False
1,0,2019-01-08 15:38:27,87.61,False
2,0,2019-01-08 15:42:43,89.59,False
3,0,2019-01-08 15:43:35,88.52,False
4,1,2019-05-12 17:28:55,89.29,True
5,2,2019-12-29 20:28:17,162.16,False
6,2,2019-12-29 20:27:39,77.78,False
7,2,2019-12-29 20:18:30,125.85,False
8,2,2019-12-29 20:12:42,116.86,False
9,2,2019-12-29 20:26:09,139.7,False


In [24]:
for name, table in new_data.items():
    print(name, table.shape)

users (100, 4)
sessions (207, 5)
transactions (947, 5)
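
One rough sanity check on these shapes is to compare the average number of child rows per parent row in the real and synthetic data. The sketch below uses the row counts printed above; the `children_per_parent` helper is hypothetical, and this is an informal check rather than a formal quality metric.

```python
# Row counts copied from the outputs above.
real_rows = {"users": 30, "sessions": 58, "transactions": 269}
synthetic_rows = {"users": 100, "sessions": 207, "transactions": 947}

# (child, parent) pairs taken from the metadata relationships.
relationships = [("sessions", "users"), ("transactions", "sessions")]


def children_per_parent(rows, relationships):
    """Average number of child rows per parent row, for each relationship."""
    return {
        (child, parent): round(rows[child] / rows[parent], 2)
        for child, parent in relationships
    }


real_ratios = children_per_parent(real_rows, relationships)
synthetic_ratios = children_per_parent(synthetic_rows, relationships)

for pair in relationships:
    print(pair, "real:", real_ratios[pair], "synthetic:", synthetic_ratios[pair])
```

For this run the ratios are close: about 1.93 versus 2.07 sessions per user, and about 4.64 versus 4.57 transactions per session.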


#### Generate Report on Real and Synthetic Data

After generating new data, you can create a quality report on it. The function takes the real data, the synthetic data and the metadata as required inputs. You can also specify the numeric and discrete columns explicitly, or let the function determine them.

The numeric and discrete column inputs are each a dictionary whose keys are table names and whose values are lists of column names present in those tables.
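
Because these dictionaries must name columns that actually exist, it can help to validate them before generating a report. Here is a minimal sketch of such a check. The `missing_columns` helper and the miniature table are hypothetical and are not part of the bulian API.

```python
import pandas as pd


def missing_columns(tables, feature_map):
    """Return {table: [columns]} for feature_map entries that are absent from tables."""
    problems = {}
    for table_name, columns in feature_map.items():
        if table_name not in tables:
            # The whole table is unknown, so every listed column is a problem.
            problems[table_name] = list(columns)
            continue
        absent = [c for c in columns if c not in tables[table_name].columns]
        if absent:
            problems[table_name] = absent
    return problems


# Hypothetical miniature table standing in for the demo data.
demo_tables = {
    "users": pd.DataFrame({"user_id": [0], "country": ["BG"], "age": [28]})
}

print(missing_columns(demo_tables, {"users": ["age"]}))            # nothing missing
print(missing_columns(demo_tables, {"users": ["age", "income"]}))  # 'income' is absent
```

An empty result means every listed column was found in its table.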

##### Generate Report without specifying columns

In [25]:
get_multi_table_report(tables, new_data, metadata)

##### Generate Report with defined columns

As mentioned above you can also specify the numeric and discrete columns and pass them as parameters while generating the report.

In [31]:
discrete_columns = {
    'users': ['country', 'gender'],
    'sessions': ['device', 'os'],
    'transactions': ['timestamp', 'cancelled']
}

numeric_columns = {
    'users': ['age'],
    'sessions': ['minutes'],
    'transactions': ['amount']
}

In [32]:
get_multi_table_report(tables, new_data, metadata, discrete_features=discrete_columns, numeric_features=numeric_columns)


Unexpected value nan in synthetic data.



##### Generate Report on a Dashboard

Reports can be viewed on a dashboard webpage by setting the optional boolean <b>show_dashboard</b> parameter of the <b>get_multi_table_report</b> function. The dashboard is a Dash application that runs on a local server. You can also specify which port the local server uses; by default the app runs on port <b>8050</b>.

In [None]:
get_multi_table_report(tables, new_data, metadata, discrete_features=discrete_columns, numeric_features=numeric_columns, show_dashboard=True, port=8050)

#### Save and share the model:  
Once you have fitted the model, call its `save` method, passing the name of the file in which you want to save it. The file extension does not matter, but we use `.pkl` here to highlight that the serialization protocol is pickle.

In [33]:
model.save('relational_model.pkl')

**Important** If you inspect the generated file you will notice that its size is much smaller than the size of the data that you used to generate it. This is because the serialized model contains **no information about the original data**, other than the parameters it needs to generate synthetic versions of it. This means that you can safely share this `relational_model.pkl` file without the risk of disclosing any of your real data!
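
To illustrate why a parameter-only model can be much smaller than the data it was fitted on, here is a toy sketch using only the standard library. It does not use the HMA1 class: the `params` dict is a hypothetical stand-in for a fitted model, and the data is randomly generated.

```python
import os
import pickle
import random
import statistics
import tempfile

# Hypothetical stand-in for "real data": 10,000 numeric values.
random.seed(0)
data = [random.gauss(30, 5) for _ in range(10_000)]

# A toy "model" that keeps only summary parameters, not the rows themselves.
params = {"mean": statistics.fmean(data), "stdev": statistics.stdev(data)}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "toy_model.pkl")

    # Serialize the parameters with pickle and measure the file size.
    with open(model_path, "wb") as f:
        pickle.dump(params, f)
    model_size = os.path.getsize(model_path)

    # Load the parameters back to confirm the round trip is lossless.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

data_size = len(pickle.dumps(data))
print("model bytes:", model_size, "data bytes:", data_size)
print("round trip ok:", restored == params)
```

Unpickling can execute arbitrary code, so a pickle file should only be loaded when it comes from a source you trust.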

#### Sampling only a subset of tables

On some occasions you will not be interested in generating rows for the entire dataset, and would rather generate data for only one table and its children.
To do this, pass the name of the table that you want to sample.
For example, if you pass the name `sessions` to the `sample` method, the model generates data only for the `sessions` table and its child table, `transactions`.

In [None]:
model.sample('sessions', num_rows=25)

If you want to further restrict sampling to a single table and skip its child tables, add the argument `sample_children=False`. For example, you can sample data from the `users` table only, without producing any rows for the `sessions` and `transactions` tables.

In [None]:
model.sample('users', num_rows=5, sample_children=False)

**Note** In this case, since we are only producing a single table, the output is given directly as a `pandas.DataFrame` instead of a dictionary.
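
Since the return type depends on how sampling was called, downstream code may want to handle both cases uniformly. Here is a minimal sketch using pandas only. The `as_table_dict` helper and the stand-in tables are hypothetical and are not part of the bulian API.

```python
import pandas as pd


def as_table_dict(result, table_name):
    """Wrap a single DataFrame in a dict so both output shapes can be handled alike."""
    if isinstance(result, pd.DataFrame):
        return {table_name: result}
    return dict(result)


# Hypothetical stand-ins for the two shapes of output described above.
single = pd.DataFrame({"user_id": [0, 1]})
multi = {
    "sessions": pd.DataFrame({"session_id": [0]}),
    "transactions": pd.DataFrame({"transaction_id": [0, 1, 2]}),
}

for name, table in as_table_dict(single, "users").items():
    print(name, table.shape)

for name, table in as_table_dict(multi, "sessions").items():
    print(name, table.shape)
```

With this wrapper, the same `for name, table in ...items()` loop used earlier in the notebook works for either kind of result.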

### Fin ###