## 1.3 Pearson's Correlation Coefficient Comparison
This test checks whether the synthetic data has captured the dependencies between variables in the original data.  
We calculate Pearson's correlation coefficient, **_r_**, between every pair of columns, separately for the original data and for the synthetic data.   

To compare how similar those _r_'s are:
* We calculate the **MSE** between corresponding _r_'s (the same pair of columns, one _r_ from the synthetic data and one from the real data).
* We calculate the **SRA** (Synthetic Ranking Agreement, see explanation below) of the _r_'s for each column.


In [1]:
def r_corr_test(df,PTable = False, CoefficientandPtable = False, lower = True ):
    '''Returns a table of Pearson's r correlation coefficients between every pair of columns in the dataframe.
    
    Args:
    df: The input dataframe.
    PTable: False (default) or True. If True, the return is a table containing the p-value (probability) of each correlation test.
    CoefficientandPtable: False (default) or True. If True, the return is a table containing tuples (r coefficient, p-value) from the correlation test.
    lower: True (default) or False. If True, the lower triangle of the table is filled with the transpose of the upper triangle rather than left as None.
    
    Returns:
    The requested table as specified in the args. If PTable and CoefficientandPtable are both False, the returned table consists of coefficient values only.
    PTable takes precedence if both are True.
    
    '''
    from scipy.stats import pearsonr
    import pandas as pd
    import numpy as np

    df_index = (df.keys()).tolist()
    n = len(df_index)
    ini = [ [ None for y in range( n ) ] 
                 for x in range( n ) ]

    #pearsonr returns two values: the correlation coefficient and the p-value of the significance test,
    #so we create empty dataframes to store the coefficients, the p-values, and the (coefficient, p-value) tuples
    coefficient_table = pd.DataFrame(ini,index = df_index,columns = df_index)
    p_table = coefficient_table.copy()
    coe_and_p_table = coefficient_table.copy()

    for i in range(n):
        for j in range(i+1,n):
            name1 = df_index[i]
            name2 = df_index[j]
            obs_1 = df[name1].dropna()
            obs_2 = df[name2].dropna()
            dataframe = pd.DataFrame({name1: obs_1, name2: obs_2})

            values = dataframe.dropna().values
            (coe,p) = pearsonr(values[:,0],values[:,1])
            coefficient_table.loc[name1,name2]=coe
            p_table.loc[name1,name2]=p
            coe_and_p_table.loc[name1,name2]=(coe,p)
    
    if lower:
        #Fill the lower triangle of each table, because the loop above only fills the upper triangle.
        #A filled table is more convenient for comparison.
        def fill_lower(df):
            n = df.values.shape[0]
            for j in range(n):
                for i in range(j+1,n):
                    df.iloc[i,j]=df.iloc[j,i]
            return df
        
        coefficient_table = fill_lower(coefficient_table)
        p_table = fill_lower(p_table)
        coe_and_p_table = fill_lower(coe_and_p_table)
    
    
    if PTable:
        return p_table
    elif CoefficientandPtable:
        return coe_and_p_table
    else:
        return coefficient_table
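
As a cross-check on the loop above, pandas' built-in `DataFrame.corr` also computes pairwise Pearson coefficients and excludes missing values pair by pair, so for numeric columns it should give the same coefficients as `r_corr_test(df)`. The following is a minimal sketch under that assumption; `r_table_vectorized` and the toy dataframe `demo` are hypothetical names used only for illustration, and the sketch does not return p-values.

```python
import numpy as np
import pandas as pd

def r_table_vectorized(df):
    """Pairwise Pearson r for numeric columns, with the diagonal set to NaN."""
    # DataFrame.corr drops missing values pair by pair
    r = df.corr(method="pearson")
    values = r.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(values, np.nan)  # mirror the empty diagonal of r_corr_test
    return pd.DataFrame(values, index=r.index, columns=r.columns)

# Toy data (illustrative only), with missing values in two columns
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],
    "c": [5.0, 3.0, np.nan, 1.0, 0.0],
})
demo_r = r_table_vectorized(demo)
print(demo_r)
```

The explicit loop in `r_corr_test` is still needed when the p-values are required, since `DataFrame.corr` returns coefficients only.
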

## What is SRA?
**SRA** is used when we want to test whether a synthetic dataset respects a certain ranking. Suppose we have a list of metrics $ R_{1}, R_{2}, R_{3}, \dots, R_{n} $ calculated from the real data and a list of the same metrics $ S_{1}, S_{2}, S_{3}, \dots, S_{n} $ calculated from the synthetic data. Then we define **SRA** as
$$
SRA(R,S) = \frac{1}{n(n-1)} \sum_{i=1}^{n}\sum_{j\neq{i}} Id\left((S_{i}-S_{j})\times(R_{i}-R_{j})>0\right)
$$
where $ Id $ is the indicator function, equal to $ 1 $ when its argument is true and $ 0 $ otherwise. $ SRA \in [0,1] $; the closer the SRA is to $ 1 $, the better the ranking agreement.

In the case of correlation comparison, suppose we have columns _A, B, C, D, E_. For column _A_, we calculate correlation coefficients $ r_{AB}, r_{AC}, r_{AD}, r_{AE} $ for the real data, and $ r'_{AB}, r'_{AC}, r'_{AD}, r'_{AE} $ for the synthetic data. We hope the rankings of $ r $ and $ r' $ agree, e.g. if $ r_{AB} > r_{AC} $ then $ r'_{AB} > r'_{AC} $ as well. As a result, our $ R $ is $ r_{AB}, r_{AC}, r_{AD}, r_{AE} $, our $ S $ is $ r'_{AB}, r'_{AC}, r'_{AD}, r'_{AE} $, and we can calculate the $ SRA $ for each column. The implementation below ranks the absolute values $ |r| $, so the comparison is on the strength of each dependency regardless of its sign.
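
As a small worked example (with made-up numbers, not taken from the datasets), take $ R = (0.9, 0.5, 0.1) $ and $ S = (0.8, 0.2, 0.6) $. The pairs $ (1,2) $ and $ (1,3) $ are ordered the same way in both lists, while the pair $ (2,3) $ is ordered differently ($ R_{2} > R_{3} $ but $ S_{2} < S_{3} $). Each unordered pair appears twice in the double sum, so
$$
SRA(R,S) = \frac{1}{3 \times 2}\left(2 + 2 + 0\right) = \frac{4}{6} \approx 0.667
$$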

In [2]:
def SRA(R,S):
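    '''Calculate the SRA of lists R and S
    
    Args:
    - R: A list of metrics calculated from the real data (e.g. model performance metrics or correlation coefficients)
    - S: A list of the same metrics calculated from the synthetic data, len(S)=len(R)
    
    Returns:
    - SRA: SRA value between 0 and 1
    
    Note: a tie in R (R[i]==R[j]) counts as agreement only if S is tied as well (S[i]==S[j]).
    
    '''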
    def identity_function(statement):
        v = 0
        if statement:
            v = 1
        return v
            
    k = len(R)
    sum_ = 0
    for i in range(k):
        for j in range(k):
            if i != j:
                if (R[i]-R[j])==0:
                    if (S[i]-S[j])==0:
                        agree = True
                    else:
                        agree = False
                else:
                    agree = (R[i]-R[j])*(S[i]-S[j])>0
                sum_ += identity_function(agree)
    SRA = sum_ / (k*(k-1))
    return SRA
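
The double loop above can also be written with NumPy broadcasting. The sketch below is one possible vectorized form, not the notebook's own implementation; `sra_vectorized` is a hypothetical name. It is meant to match the tie handling of `SRA` above: a pair agrees when the sign of $ R_{i}-R_{j} $ equals the sign of $ S_{i}-S_{j} $, so ties count as agreement only when both lists are tied. Converting the inputs to arrays also makes the indexing positional when the inputs are pandas Series.

```python
import numpy as np

def sra_vectorized(R, S):
    """SRA of two equal-length sequences, using pairwise sign comparison."""
    R = np.asarray(R, dtype=float)
    S = np.asarray(S, dtype=float)
    k = len(R)
    # sign of every pairwise difference, shape (k, k)
    sign_R = np.sign(R[:, None] - R[None, :])
    sign_S = np.sign(S[:, None] - S[None, :])
    agree = sign_R == sign_S
    off_diag = ~np.eye(k, dtype=bool)  # exclude the i == j terms
    return agree[off_diag].sum() / (k * (k - 1))

# Made-up values for illustration
R = [0.9, 0.5, 0.1]
S_same = [0.8, 0.6, 0.2]      # same ranking as R
S_swapped = [0.8, 0.2, 0.6]   # last two entries ranked the other way round
sra_same = sra_vectorized(R, S_same)
sra_swapped = sra_vectorized(R, S_swapped)
print(sra_same, sra_swapped)
```
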

In [3]:
def CorrelationSRA(ori_correlation_df,gen_correlation_df,ColumnWise = False):
    '''Returns the value of SRA for the absolute Pearsons correlation coefficients for each column between \
    all other columns. SRA is between 0 and 1, the closer the SRA is to 1, the more the agreement between the ranking,\
    the more similar the synthetic data and the real data are.
    
    Args:
    ori_correlation_df: the correlation coefficient dataframe for the real data, usually generated from the function\
                        r_corr_test.
    gen_correlation_df: the correlation coefficient dataframe for the synthetic data, usually generated from the function\
                        r_corr_test. 
    ColumnWise: False(default) or True. If True, the return is a Series containing the SRA value for each column and the average.\
                Otherwise, the return is the average of SRA values for all columns
    
    Returns:
    s: It is either a column-wise SRA series or the average SRA values of them, determined by the arg ColumnWise.
    
    '''
    import numpy as np
    import pandas as pd
    
    columns = (ori_correlation_df.keys()).tolist()
    n = len(columns)
    ini = np.ones(n)
    
    for i in range(n):
        ori_values = ori_correlation_df.iloc[i,:].dropna()
        gen_values = gen_correlation_df.iloc[i,:].dropna()
        ini[i] = SRA(abs(ori_values), abs(gen_values))
    
    if ColumnWise:
        s = pd.Series(ini,index = columns)
        s['average'] = sum(ini)/n
    else:
        s = sum(ini)/n
    return s

In [4]:
def MSE(r_table_ori,r_table_gen):
    '''
    Returns a dataframe of squared errors between corresponding entries of the two correlation tables,
    and their mean over the off-diagonal entries (the MSE score).
    '''
    import pandas as pd
    import numpy as np
    ori = r_table_ori.fillna(0).values
    gen = r_table_gen.fillna(0).values
    columns = (r_table_gen.keys()).tolist()
    matrix = (ori-gen)**2
    df = pd.DataFrame(matrix, index = columns, columns = columns)
    score = np.sum(matrix)/(len(ori)*(len(ori)-1)) #The diagonal is always zero so we don't count them
    return df, score

## Data Loading

In [5]:
import numpy as np
import pandas as pd
dp_ori_df = pd.read_csv('synthetic data/doppelGANger/dp_ori.csv') #originally ori_features_prism.npy
dp_gen_df = pd.read_csv('synthetic data/doppelGANger/dp_gen.csv') #originally features_600.npy
tgan_ori_df = pd.read_csv('synthetic data/TGAN/tgan_ori.csv') #originally cat_time_10visits_all_noid.csv
tgan_gen_df = pd.read_csv('synthetic data/TGAN/tgan_gen.csv') #originally gen_cat_time_10visits_wl_5000it.npy
ori_df = pd.read_csv('synthetic data/2_no_id/ori_df.csv') #originally cat_time_5abovevisits_all.csv
gen_1_df = pd.read_csv('synthetic data/2_no_id/gen_1_df.csv') #originally gen_cat_time_10visits_wl_5000it_hd10_nl5.npy
gen_2_df = pd.read_csv('synthetic data/2_no_id/gen_2_df.csv') #originally gen_cat_time_10visits_wl_5000it_hd10.npy
gen_3_df = pd.read_csv('synthetic data/2_no_id/gen_3_df.csv') #originally gen_dop_cat_5abovevisits_d2g_e449.npy
gen_4_df = pd.read_csv('synthetic data/2_no_id/gen_4_df.csv') #originally from gen_cat_time_10visits_all_5000it.npy.
dp_0827_df = pd.read_csv('synthetic data/2_no_id/dp_0827_gen.csv')  #originally gen_doptf2_cat_5abovevisits_e200_lstm.csv

synthetic_data_dic = {'DoppelGANger_0814':[dp_ori_df, dp_gen_df],'DoppelGANger_0824':[ori_df,gen_3_df],\
                      'DoppelGANger_0827':[ori_df,dp_0827_df],'tGAN':[tgan_ori_df,tgan_gen_df],\
                      'tGAN 1':[tgan_ori_df,gen_1_df],'tGAN 2':[tgan_ori_df,gen_2_df],\
                     'tGAN 4':[tgan_ori_df,gen_4_df]}
syn_keys = list(synthetic_data_dic.keys())

In [6]:
n = len(syn_keys)
MSE_array = np.zeros(n)
for i in range(n):
    key = syn_keys[i]
    df_ori = synthetic_data_dic[key][0]
    df_gen = synthetic_data_dic[key][1]
    r_table_ori = r_corr_test(df_ori)
    r_table_gen = r_corr_test(df_gen)
    
    #Highlight all r values with |r| > 0.5 in yellow, indicating strong correlation
    def color_threshold_yellow(val):
        threshold = 0.5
        if ((val is not None) and (abs(val) > threshold)):
            color = 'yellow' 
        else:
            color = 'black'
        return 'color: %s' % color

    display(key+' '+'generated r table',r_table_gen.style.applymap(color_threshold_yellow))
    display(key+' '+'real r table',r_table_ori.style.applymap(color_threshold_yellow))
    sra = CorrelationSRA(r_table_ori,r_table_gen,ColumnWise=True)
    if i==0:
        sra_df = pd.DataFrame(sra,columns = [key])
    else:
        sra_df = pd.concat([sra_df,pd.DataFrame(sra,columns = [key])], axis = 1, sort = False)
    display(key+' '+'SRA',sra)
    MSE_df, MSE_score = MSE(r_table_gen,r_table_ori)
    display(key+' '+'MSE table', MSE_df)
    MSE_array[i] = MSE_score
MSE_series = pd.Series(MSE_array,index = syn_keys)

'DoppelGANger_0814 generated r table'

Unnamed: 0,dday,weight,height,age,temp
dday,,0.46059,0.579901,0.385757,0.572924
weight,0.46059,,0.85236,0.765801,0.725225
height,0.579901,0.85236,,0.700063,0.951644
age,0.385757,0.765801,0.700063,,0.561933
temp,0.572924,0.725225,0.951644,0.561933,


'DoppelGANger_0814 real r table'

Unnamed: 0,dday,weight,height,age,temp
dday,,0.547442,0.625742,0.43148,0.624604
weight,0.547442,,0.904009,0.888397,0.787127
height,0.625742,0.904009,,0.739106,0.964485
age,0.43148,0.888397,0.739106,,0.589636
temp,0.624604,0.787127,0.964485,0.589636,


'DoppelGANger_0814 SRA'

dday       1.0
weight     1.0
height     1.0
age        1.0
temp       1.0
average    1.0
dtype: float64

'DoppelGANger_0814 MSE table'

Unnamed: 0,dday,weight,height,age,temp
dday,0.0,0.007543,0.002101,0.002091,0.002671
weight,0.007543,0.0,0.002668,0.01503,0.003832
height,0.002101,0.002668,0.0,0.001524,0.000165
age,0.002091,0.01503,0.001524,0.0,0.000767
temp,0.002671,0.003832,0.000165,0.000767,0.0


'DoppelGANger_0824 generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,-0.162625,-0.230949,0.090384,0.047434,-0.06428,0.083909,-0.083909,0.072363,-0.072363
height,-0.162625,,0.76762,0.250086,0.004119,-0.098146,0.314467,-0.314467,0.170356,-0.170356
weight,-0.230949,0.76762,,0.241961,-0.101985,-0.036908,0.213542,-0.213542,0.13813,-0.13813
temp,0.090384,0.250086,0.241961,,0.074358,0.113351,0.351852,-0.351852,0.2934,-0.2934
vomit_dur,0.047434,0.004119,-0.101985,0.074358,,-0.030081,-0.051109,0.051109,-0.047358,0.047358
cough_dur,-0.06428,-0.098146,-0.036908,0.113351,-0.030081,,0.011111,-0.011111,0.036099,-0.036099
diar_No,0.083909,0.314467,0.213542,0.351852,-0.051109,0.011111,,-1.0,0.28616,-0.28616
diar_Yes,-0.083909,-0.314467,-0.213542,-0.351852,0.051109,-0.011111,-1.0,,-0.28616,0.28616
head_No,0.072363,0.170356,0.13813,0.2934,-0.047358,0.036099,0.28616,-0.28616,,-1.0
head_Yes,-0.072363,-0.170356,-0.13813,-0.2934,0.047358,-0.036099,-0.28616,0.28616,-1.0,


'DoppelGANger_0824 real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,-0.036634,0.000878,-0.059368,-0.029941,-0.05115,0.021721,-0.021721,0.048467,-0.048467
height,-0.036634,,0.881265,-0.175099,-0.059278,-0.101346,0.105971,-0.105971,-0.149711,0.149711
weight,0.000878,0.881265,,-0.160768,-0.040652,-0.063729,0.055076,-0.055076,-0.139772,0.139772
temp,-0.059368,-0.175099,-0.160768,,0.178328,0.133677,-0.051018,0.051018,-0.200484,0.200484
vomit_dur,-0.029941,-0.059278,-0.040652,0.178328,,0.045722,-0.172186,0.172186,-0.056125,0.056125
cough_dur,-0.05115,-0.101346,-0.063729,0.133677,0.045722,,-0.017408,0.017408,-0.068576,0.068576
diar_No,0.021721,0.105971,0.055076,-0.051018,-0.172186,-0.017408,,-1.0,-0.007423,0.007423
diar_Yes,-0.021721,-0.105971,-0.055076,0.051018,0.172186,0.017408,-1.0,,0.007423,-0.007423
head_No,0.048467,-0.149711,-0.139772,-0.200484,-0.056125,-0.068576,-0.007423,0.007423,,-1.0
head_Yes,-0.048467,0.149711,0.139772,0.200484,0.056125,0.068576,0.007423,-0.007423,-1.0,


'DoppelGANger_0824 SRA'

dday         0.361111
height       0.750000
weight       0.638889
temp         0.444444
vomit_dur    0.555556
cough_dur    0.833333
diar_No      0.583333
diar_Yes     0.611111
head_No      0.638889
head_Yes     0.611111
average      0.602778
dtype: float64

'DoppelGANger_0824 MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.015874,0.053744,0.022426,0.005987,0.000172,0.003867285,0.003867285,0.0005709811,0.0005709811
height,0.015874,0.0,0.012915,0.180783,0.004019,1e-05,0.04347078,0.04347078,0.1024431,0.1024431
weight,0.053744,0.012915,0.0,0.162191,0.003762,0.000719,0.02511168,0.02511168,0.07722939,0.07722939
temp,0.022426,0.180783,0.162191,0.0,0.01081,0.000413,0.1623037,0.1623037,0.2439214,0.2439214
vomit_dur,0.005987,0.004019,0.003762,0.01081,0.0,0.005746,0.01465964,0.01465964,7.685961e-05,7.685963e-05
cough_dur,0.000172,1e-05,0.000719,0.000413,0.005746,0.0,0.0008133489,0.0008133491,0.01095699,0.01095699
diar_No,0.003867,0.043471,0.025112,0.162304,0.01466,0.000813,0.0,3.8103950000000005e-27,0.0861911,0.0861911
diar_Yes,0.003867,0.043471,0.025112,0.162304,0.01466,0.000813,3.8103950000000005e-27,0.0,0.0861911,0.0861911
head_No,0.000571,0.102443,0.077229,0.243921,7.7e-05,0.010957,0.0861911,0.0861911,0.0,5.230641e-28
head_Yes,0.000571,0.102443,0.077229,0.243921,7.7e-05,0.010957,0.0861911,0.0861911,5.230641e-28,0.0


'DoppelGANger_0827 generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,-0.387425,-0.256111,-0.007959,0.023495,0.218447,0.08368,-0.08368,0.054254,-0.054254
height,-0.387425,,0.772865,0.270505,-0.207991,-0.139755,0.492757,-0.492757,0.329424,-0.329424
weight,-0.256111,0.772865,,0.018055,-0.160459,-0.163045,0.219853,-0.219853,0.165243,-0.165243
temp,-0.007959,0.270505,0.018055,,0.228755,-0.07267,0.634756,-0.634756,0.493087,-0.493087
vomit_dur,0.023495,-0.207991,-0.160459,0.228755,,0.029032,0.027499,-0.027499,0.010257,-0.010257
cough_dur,0.218447,-0.139755,-0.163045,-0.07267,0.029032,,0.049845,-0.049845,-0.006349,0.006349
diar_No,0.08368,0.492757,0.219853,0.634756,0.027499,0.049845,,-1.0,0.596697,-0.596697
diar_Yes,-0.08368,-0.492757,-0.219853,-0.634756,-0.027499,-0.049845,-1.0,,-0.596697,0.596697
head_No,0.054254,0.329424,0.165243,0.493087,0.010257,-0.006349,0.596697,-0.596697,,-1.0
head_Yes,-0.054254,-0.329424,-0.165243,-0.493087,-0.010257,0.006349,-0.596697,0.596697,-1.0,


'DoppelGANger_0827 real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,-0.036634,0.000878,-0.059368,-0.029941,-0.05115,0.021721,-0.021721,0.048467,-0.048467
height,-0.036634,,0.881265,-0.175099,-0.059278,-0.101346,0.105971,-0.105971,-0.149711,0.149711
weight,0.000878,0.881265,,-0.160768,-0.040652,-0.063729,0.055076,-0.055076,-0.139772,0.139772
temp,-0.059368,-0.175099,-0.160768,,0.178328,0.133677,-0.051018,0.051018,-0.200484,0.200484
vomit_dur,-0.029941,-0.059278,-0.040652,0.178328,,0.045722,-0.172186,0.172186,-0.056125,0.056125
cough_dur,-0.05115,-0.101346,-0.063729,0.133677,0.045722,,-0.017408,0.017408,-0.068576,0.068576
diar_No,0.021721,0.105971,0.055076,-0.051018,-0.172186,-0.017408,,-1.0,-0.007423,0.007423
diar_Yes,-0.021721,-0.105971,-0.055076,0.051018,0.172186,0.017408,-1.0,,0.007423,-0.007423
head_No,0.048467,-0.149711,-0.139772,-0.200484,-0.056125,-0.068576,-0.007423,0.007423,,-1.0
head_Yes,-0.048467,0.149711,0.139772,0.200484,0.056125,0.068576,0.007423,-0.007423,-1.0,


'DoppelGANger_0827 SRA'

dday         0.333333
height       0.555556
weight       0.444444
temp         0.527778
vomit_dur    0.583333
cough_dur    0.472222
diar_No      0.527778
diar_Yes     0.500000
head_No      0.583333
head_Yes     0.555556
average      0.508333
dtype: float64

'DoppelGANger_0827 MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.123054,0.066044,0.002643,0.002855,0.072682,0.003838895,0.003838894,3.348902e-05,3.348905e-05
height,0.123054,0.0,0.011751,0.198563,0.022116,0.001475,0.1496033,0.1496033,0.2295709,0.2295709
weight,0.066044,0.011751,0.0,0.031978,0.014354,0.009864,0.02715143,0.02715143,0.09303412,0.09303412
temp,0.002643,0.198563,0.031978,0.0,0.002543,0.042579,0.4702847,0.4702847,0.4810405,0.4810405
vomit_dur,0.002855,0.022116,0.014354,0.002543,0.0,0.000279,0.03987431,0.03987431,0.004406513,0.004406512
cough_dur,0.072682,0.001475,0.009864,0.042579,0.000279,0.0,0.00452293,0.00452293,0.003872173,0.003872173
diar_No,0.003839,0.149603,0.027151,0.470285,0.039874,0.004523,0.0,2.896708e-26,0.3649606,0.3649606
diar_Yes,0.003839,0.149603,0.027151,0.470285,0.039874,0.004523,2.896708e-26,0.0,0.3649606,0.3649606
head_No,3.3e-05,0.229571,0.093034,0.48104,0.004407,0.003872,0.3649606,0.3649606,0.0,1.374511e-26
head_Yes,3.3e-05,0.229571,0.093034,0.48104,0.004407,0.003872,0.3649606,0.3649606,1.374511e-26,0.0


'tGAN generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.630282,0.674201,-0.518343,0.509896,0.10626,-0.508896,0.505435,0.013767,-0.013388
height,0.630282,,0.967306,0.271986,0.840133,0.607276,-0.476903,0.444277,-0.627081,0.627144
weight,0.674201,0.967306,,0.264717,0.934518,0.704238,-0.642798,0.610614,-0.679343,0.679569
temp,-0.518343,0.271986,0.264717,,0.417741,0.737285,-0.165282,0.13641,-0.845591,0.845392
vomit_dur,0.509896,0.840133,0.934518,0.417741,,0.848396,-0.777692,0.744161,-0.802912,0.803213
cough_dur,0.10626,0.607276,0.704238,0.737285,0.848396,,-0.72696,0.697473,-0.96676,0.967068
diar_No,-0.508896,-0.476903,-0.642798,-0.165282,-0.777692,-0.72696,,-0.998401,0.575161,-0.575615
diar_Yes,0.505435,0.444277,0.610614,0.13641,0.744161,0.697473,-0.998401,,-0.541316,0.541765
head_No,0.013767,-0.627081,-0.679343,-0.845591,-0.802912,-0.96676,0.575161,-0.541316,,-0.999998
head_Yes,-0.013388,0.627144,0.679569,0.845392,0.803213,0.967068,-0.575615,0.541765,-0.999998,


'tGAN real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.176702,0.216054,-0.090951,-0.039075,-0.091015,0.017397,-0.017397,0.069809,-0.069809
height,0.176702,,0.873474,-0.143156,-0.026711,-0.05938,0.05302,-0.05302,-0.108343,0.108343
weight,0.216054,0.873474,,-0.122154,-0.018737,-0.041915,0.015691,-0.015691,-0.093078,0.093078
temp,-0.090951,-0.143156,-0.122154,,0.125559,0.112293,-0.048428,0.048428,-0.281744,0.281744
vomit_dur,-0.039075,-0.026711,-0.018737,0.125559,,0.020258,-0.147209,0.147209,-0.086867,0.086867
cough_dur,-0.091015,-0.05938,-0.041915,0.112293,0.020258,,-0.015304,0.015304,-0.10895,0.10895
diar_No,0.017397,0.05302,0.015691,-0.048428,-0.147209,-0.015304,,-1.0,0.040633,-0.040633
diar_Yes,-0.017397,-0.05302,-0.015691,0.048428,0.147209,0.015304,-1.0,,-0.040633,0.040633
head_No,0.069809,-0.108343,-0.093078,-0.281744,-0.086867,-0.10895,0.040633,-0.040633,,-1.0
head_Yes,-0.069809,0.108343,0.093078,0.281744,0.086867,0.10895,-0.040633,0.040633,-1.0,


'tGAN SRA'

dday         0.694444
height       0.638889
weight       0.555556
temp         0.750000
vomit_dur    0.166667
cough_dur    0.583333
diar_No      0.500000
diar_Yes     0.500000
head_No      0.805556
head_Yes     0.805556
average      0.600000
dtype: float64

'tGAN MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.205735,0.209899,0.182664,0.301369,0.038918,0.276985,0.273354,0.00314079,0.003183373
height,0.205735,0.0,0.008805,0.172343,0.751419,0.44443,0.280819,0.247304,0.2690889,0.2691544
weight,0.209899,0.008805,0.0,0.149669,0.908696,0.556745,0.433608,0.392258,0.3437072,0.3439713
temp,0.182664,0.172343,0.149669,0.0,0.08537,0.390614,0.013655,0.007741,0.3179238,0.3176998
vomit_dur,0.301369,0.751419,0.908696,0.08537,0.0,0.685812,0.397508,0.356351,0.5127207,0.5131519
cough_dur,0.038918,0.44443,0.556745,0.390614,0.685812,0.0,0.506454,0.465354,0.7358369,0.7363662
diar_No,0.276985,0.280819,0.433608,0.013655,0.397508,0.506454,0.0,3e-06,0.2857201,0.2862055
diar_Yes,0.273354,0.247304,0.392258,0.007741,0.356351,0.465354,3e-06,0.0,0.2506829,0.251133
head_No,0.003141,0.269089,0.343707,0.317924,0.512721,0.735837,0.28572,0.250683,0.0,5.322748e-12
head_Yes,0.003183,0.269154,0.343971,0.3177,0.513152,0.736366,0.286206,0.251133,5.322748e-12,0.0


'tGAN 1 generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.162797,0.097409,-0.33907,0.110246,0.021433,0.225388,-0.242572,0.366513,-0.368334
height,0.162797,,0.937405,-0.563485,-0.075009,-0.798786,0.606384,-0.623032,-0.024406,0.024948
weight,0.097409,0.937405,,-0.514464,-0.047376,-0.693646,0.488053,-0.500043,-0.007414,0.008275
temp,-0.33907,-0.563485,-0.514464,,0.589742,0.72994,-0.108582,0.125227,-0.748058,0.745608
vomit_dur,0.110246,-0.075009,-0.047376,0.589742,,0.560289,0.235987,-0.250278,-0.646799,0.647633
cough_dur,0.021433,-0.798786,-0.693646,0.72994,0.560289,,-0.440975,0.450831,-0.289458,0.290085
diar_No,0.225388,0.606384,0.488053,-0.108582,0.235987,-0.440975,,-0.998209,-0.248394,0.247638
diar_Yes,-0.242572,-0.623032,-0.500043,0.125227,-0.250278,0.450831,-0.998209,,0.250436,-0.249732
head_No,0.366513,-0.024406,-0.007414,-0.748058,-0.646799,-0.289458,-0.248394,0.250436,,-0.999857
head_Yes,-0.368334,0.024948,0.008275,0.745608,0.647633,0.290085,0.247638,-0.249732,-0.999857,


'tGAN 1 real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.176702,0.216054,-0.090951,-0.039075,-0.091015,0.017397,-0.017397,0.069809,-0.069809
height,0.176702,,0.873474,-0.143156,-0.026711,-0.05938,0.05302,-0.05302,-0.108343,0.108343
weight,0.216054,0.873474,,-0.122154,-0.018737,-0.041915,0.015691,-0.015691,-0.093078,0.093078
temp,-0.090951,-0.143156,-0.122154,,0.125559,0.112293,-0.048428,0.048428,-0.281744,0.281744
vomit_dur,-0.039075,-0.026711,-0.018737,0.125559,,0.020258,-0.147209,0.147209,-0.086867,0.086867
cough_dur,-0.091015,-0.05938,-0.041915,0.112293,0.020258,,-0.015304,0.015304,-0.10895,0.10895
diar_No,0.017397,0.05302,0.015691,-0.048428,-0.147209,-0.015304,,-1.0,0.040633,-0.040633
diar_Yes,-0.017397,-0.05302,-0.015691,0.048428,0.147209,0.015304,-1.0,,-0.040633,0.040633
head_No,0.069809,-0.108343,-0.093078,-0.281744,-0.086867,-0.10895,0.040633,-0.040633,,-1.0
head_Yes,-0.069809,0.108343,0.093078,0.281744,0.086867,0.10895,-0.040633,0.040633,-1.0,


'tGAN 1 SRA'

dday         0.333333
height       0.527778
weight       0.527778
temp         0.861111
vomit_dur    0.611111
cough_dur    0.500000
diar_No      0.527778
diar_Yes     0.555556
head_No      0.722222
head_Yes     0.722222
average      0.588889
dtype: float64

'tGAN 1 MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.000193,0.014077,0.061563,0.022297,0.012645,0.04326,0.050704,0.0880328,0.08911684
height,0.000193,0.0,0.004087,0.176676,0.002333,0.546722,0.306211,0.324913,0.007045521,0.006954713
weight,0.014077,0.004087,0.0,0.153907,0.00082,0.424753,0.223126,0.234597,0.007338358,0.007191577
temp,0.061563,0.176676,0.153907,0.0,0.215465,0.381487,0.003619,0.005898,0.2174486,0.2151704
vomit_dur,0.022297,0.002333,0.00082,0.215465,0.0,0.291634,0.14684,0.157996,0.3135232,0.3144579
cough_dur,0.012645,0.546722,0.424753,0.381487,0.291634,0.0,0.181195,0.189684,0.03258313,0.03280977
diar_No,0.04326,0.306211,0.223126,0.003619,0.14684,0.181195,0.0,3e-06,0.08353678,0.08310056
diar_Yes,0.050704,0.324913,0.234597,0.005898,0.157996,0.189684,3e-06,0.0,0.08472164,0.0843119
head_No,0.088033,0.007046,0.007338,0.217449,0.313523,0.032583,0.083537,0.084722,0.0,2.047105e-08
head_Yes,0.089117,0.006955,0.007192,0.21517,0.314458,0.03281,0.083101,0.084312,2.047105e-08,0.0


'tGAN 2 generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.669932,0.600616,-0.421997,-0.319321,-0.365681,0.303687,-0.324059,0.355109,-0.355177
height,0.669932,,0.936421,-0.027013,0.160809,-0.505931,-0.073868,0.053488,-0.323211,0.323054
weight,0.600616,0.936421,,-0.117218,0.049449,-0.566673,0.02792,-0.042256,-0.26002,0.259797
temp,-0.421997,-0.027013,-0.117218,,0.861116,0.197196,-0.615479,0.611958,-0.551272,0.551744
vomit_dur,-0.319321,0.160809,0.049449,0.861116,,0.291194,-0.684354,0.660524,-0.723152,0.723586
cough_dur,-0.365681,-0.505931,-0.566673,0.197196,0.291194,,0.073297,-0.069696,-0.287328,0.287804
diar_No,0.303687,-0.073868,0.02792,-0.615479,-0.684354,0.073297,,-0.993841,0.383103,-0.383243
diar_Yes,-0.324059,0.053488,-0.042256,0.611958,0.660524,-0.069696,-0.993841,,-0.382184,0.382314
head_No,0.355109,-0.323211,-0.26002,-0.551272,-0.723152,-0.287328,0.383103,-0.382184,,-0.999999
head_Yes,-0.355177,0.323054,0.259797,0.551744,0.723586,0.287804,-0.383243,0.382314,-0.999999,


'tGAN 2 real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.176702,0.216054,-0.090951,-0.039075,-0.091015,0.017397,-0.017397,0.069809,-0.069809
height,0.176702,,0.873474,-0.143156,-0.026711,-0.05938,0.05302,-0.05302,-0.108343,0.108343
weight,0.216054,0.873474,,-0.122154,-0.018737,-0.041915,0.015691,-0.015691,-0.093078,0.093078
temp,-0.090951,-0.143156,-0.122154,,0.125559,0.112293,-0.048428,0.048428,-0.281744,0.281744
vomit_dur,-0.039075,-0.026711,-0.018737,0.125559,,0.020258,-0.147209,0.147209,-0.086867,0.086867
cough_dur,-0.091015,-0.05938,-0.041915,0.112293,0.020258,,-0.015304,0.015304,-0.10895,0.10895
diar_No,0.017397,0.05302,0.015691,-0.048428,-0.147209,-0.015304,,-1.0,0.040633,-0.040633
diar_Yes,-0.017397,-0.05302,-0.015691,0.048428,0.147209,0.015304,-1.0,,-0.040633,0.040633
head_No,0.069809,-0.108343,-0.093078,-0.281744,-0.086867,-0.10895,0.040633,-0.040633,,-1.0
head_Yes,-0.069809,0.108343,0.093078,0.281744,0.086867,0.10895,-0.040633,0.040633,-1.0,


'tGAN 2 SRA'

dday         0.861111
height       0.694444
weight       0.805556
temp         0.361111
vomit_dur    0.750000
cough_dur    0.500000
diar_No      0.833333
diar_Yes     0.805556
head_No      0.527778
head_Yes     0.527778
average      0.666667
dtype: float64

'tGAN 2 MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.243275,0.147888,0.109591,0.078538,0.075441,0.081962,0.094041,0.08139562,0.08143487
height,0.243275,0.0,0.003962,0.013489,0.035164,0.199408,0.016101,0.011344,0.04616809,0.04610061
weight,0.147888,0.003962,0.0,2.4e-05,0.004649,0.275371,0.00015,0.000706,0.02786954,0.02779516
temp,0.109591,0.013489,2.4e-05,0.0,0.541044,0.007208,0.321547,0.317566,0.07264567,0.07290021
vomit_dur,0.078538,0.035164,0.004649,0.541044,0.0,0.073406,0.288525,0.263492,0.404859,0.4054104
cough_dur,0.075441,0.199408,0.275371,0.007208,0.073406,0.0,0.00785,0.007225,0.03181875,0.03198881
diar_No,0.081962,0.016101,0.00015,0.321547,0.288525,0.00785,0.0,3.8e-05,0.1172854,0.1173811
diar_Yes,0.094041,0.011344,0.000706,0.317566,0.263492,0.007225,3.8e-05,0.0,0.116657,0.1167459
head_No,0.081396,0.046168,0.02787,0.072646,0.404859,0.031819,0.117285,0.116657,0.0,2.216756e-12
head_Yes,0.081435,0.046101,0.027795,0.0729,0.40541,0.031989,0.117381,0.116746,2.216756e-12,0.0


'tGAN 4 generated r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.725814,0.780287,-0.703732,-0.398647,-0.284812,0.345047,-0.367659,0.300212,-0.299526
height,0.725814,,0.903226,-0.237532,-0.087795,-0.554406,0.303831,-0.318263,-0.093038,0.09315
weight,0.780287,0.903226,,-0.289294,-0.045537,-0.315996,0.183392,-0.198528,-0.116846,0.117114
temp,-0.703732,-0.237532,-0.289294,,0.486734,0.058875,-0.363811,0.400465,-0.476977,0.476242
vomit_dur,-0.398647,-0.087795,-0.045537,0.486734,,0.440218,-0.142562,0.150752,-0.921556,0.921134
cough_dur,-0.284812,-0.554406,-0.315996,0.058875,0.440218,,-0.298121,0.300001,-0.402301,0.402814
diar_No,0.345047,0.303831,0.183392,-0.363811,-0.142562,-0.298121,,-0.997279,0.119775,-0.119507
diar_Yes,-0.367659,-0.318263,-0.198528,0.400465,0.150752,0.300001,-0.997279,,-0.130189,0.129886
head_No,0.300212,-0.093038,-0.116846,-0.476977,-0.921556,-0.402301,0.119775,-0.130189,,-0.999992
head_Yes,-0.299526,0.09315,0.117114,0.476242,0.921134,0.402814,-0.119507,0.129886,-0.999992,


'tGAN 4 real r table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,,0.176702,0.216054,-0.090951,-0.039075,-0.091015,0.017397,-0.017397,0.069809,-0.069809
height,0.176702,,0.873474,-0.143156,-0.026711,-0.05938,0.05302,-0.05302,-0.108343,0.108343
weight,0.216054,0.873474,,-0.122154,-0.018737,-0.041915,0.015691,-0.015691,-0.093078,0.093078
temp,-0.090951,-0.143156,-0.122154,,0.125559,0.112293,-0.048428,0.048428,-0.281744,0.281744
vomit_dur,-0.039075,-0.026711,-0.018737,0.125559,,0.020258,-0.147209,0.147209,-0.086867,0.086867
cough_dur,-0.091015,-0.05938,-0.041915,0.112293,0.020258,,-0.015304,0.015304,-0.10895,0.10895
diar_No,0.017397,0.05302,0.015691,-0.048428,-0.147209,-0.015304,,-1.0,0.040633,-0.040633
diar_Yes,-0.017397,-0.05302,-0.015691,0.048428,0.147209,0.015304,-1.0,,-0.040633,0.040633
head_No,0.069809,-0.108343,-0.093078,-0.281744,-0.086867,-0.10895,0.040633,-0.040633,,-1.0
head_Yes,-0.069809,0.108343,0.093078,0.281744,0.086867,0.10895,-0.040633,0.040633,-1.0,


'tGAN 4 SRA'

dday         0.611111
height       0.694444
weight       0.694444
temp         0.527778
vomit_dur    0.555556
cough_dur    0.444444
diar_No      0.583333
diar_Yes     0.583333
head_No      0.694444
head_Yes     0.694444
average      0.608333
dtype: float64

'tGAN 4 MSE table'

Unnamed: 0,dday,height,weight,temp,vomit_dur,cough_dur,diar_No,diar_Yes,head_No,head_Yes
dday,0.0,0.301524,0.318359,0.3755,0.129292,0.037557,0.107354,0.122683,0.05308513,0.05276953
height,0.301524,0.0,0.000885,0.008907,0.003731,0.245051,0.062906,0.070354,0.0002342499,0.000230839
weight,0.318359,0.000885,0.0,0.027936,0.000718,0.07512,0.028124,0.033429,0.0005649137,0.0005777484
temp,0.3755,0.008907,0.027936,0.0,0.130447,0.002854,0.099466,0.12393,0.03811595,0.03782954
vomit_dur,0.129292,0.003731,0.000718,0.130447,0.0,0.176366,2.2e-05,1.3e-05,0.6967063,0.6960018
cough_dur,0.037557,0.245051,0.07512,0.002854,0.176366,0.0,0.079986,0.081052,0.08605441,0.0863561
diar_No,0.107354,0.062906,0.028124,0.099466,2.2e-05,0.079986,0.0,7e-06,0.006263326,0.00622113
diar_Yes,0.122683,0.070354,0.033429,0.12393,1.3e-05,0.081052,7e-06,0.0,0.008020146,0.007966082
head_No,0.053085,0.000234,0.000565,0.038116,0.696706,0.086054,0.006263,0.00802,0.0,6.311835e-11
head_Yes,0.05277,0.000231,0.000578,0.03783,0.696002,0.086356,0.006221,0.007966,6.311835e-11,0.0


In [7]:
display('MSE values of r for each synthetic data',MSE_series)

'MSE values of r for each synthetic data'

DoppelGANger_0814    0.003839
DoppelGANger_0824    0.048782
DoppelGANger_0827    0.112824
tGAN                 0.315190
tGAN 1               0.129868
tGAN 2               0.111499
tGAN 4               0.098235
dtype: float64
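
The score for DoppelGANger_0814 can be re-derived from the r tables printed earlier in the output. The sketch below copies those printed (rounded) values by hand and applies the same formula as `MSE`: the sum of squared differences divided by $ n(n-1) $, so the empty diagonal is not counted. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

cols = ["dday", "weight", "height", "age", "temp"]
nan = np.nan

# r values copied from the 'DoppelGANger_0814 real r table' output
r_real = pd.DataFrame([
    [nan, 0.547442, 0.625742, 0.43148, 0.624604],
    [0.547442, nan, 0.904009, 0.888397, 0.787127],
    [0.625742, 0.904009, nan, 0.739106, 0.964485],
    [0.43148, 0.888397, 0.739106, nan, 0.589636],
    [0.624604, 0.787127, 0.964485, 0.589636, nan],
], index=cols, columns=cols)

# r values copied from the 'DoppelGANger_0814 generated r table' output
r_gen = pd.DataFrame([
    [nan, 0.46059, 0.579901, 0.385757, 0.572924],
    [0.46059, nan, 0.85236, 0.765801, 0.725225],
    [0.579901, 0.85236, nan, 0.700063, 0.951644],
    [0.385757, 0.765801, 0.700063, nan, 0.561933],
    [0.572924, 0.725225, 0.951644, 0.561933, nan],
], index=cols, columns=cols)

n = len(cols)
# The diagonal becomes 0 - 0 = 0 after fillna, so it adds nothing to the sum
sq_err = (r_real.fillna(0).to_numpy() - r_gen.fillna(0).to_numpy()) ** 2
mse_0814 = sq_err.sum() / (n * (n - 1))  # average over off-diagonal entries only
print(round(mse_0814, 6))
```

Up to the rounding of the printed r values, this agrees with the 0.003839 reported for DoppelGANger_0814.
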

## Conclusion
* The lower the MSE, the smaller the average squared difference between the correlation coefficients of the real and synthetic data, and hence the better the result.   

* DoppelGANger_0814 performs well (it has the lowest MSE), but it contains only 5 columns.  

* tGAN, tGAN 1 and tGAN 2 have large MSE values. Comparing the correlation tables shows that these generated datasets contain spuriously strong dependencies (|r|>0.5) between some columns that are only weakly correlated in the real data.  

* DoppelGANger_0824 agrees with its original data on every pair with |r|>0.5. DoppelGANger_0827 does not do well on the categorical columns (diar and head): it shows spuriously strong dependencies between these columns.
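
The table-by-table comparison behind these conclusions can be automated. The sketch below is an illustrative helper, not part of the original analysis: `spurious_strong_pairs` is a hypothetical name, and the 0.5 threshold is the same one used for highlighting. It lists column pairs that are strongly correlated in the generated data but not in the real data. The example uses three columns copied from the tGAN tables above.

```python
import numpy as np
import pandas as pd

def spurious_strong_pairs(r_real, r_gen, threshold=0.5):
    """List column pairs with |r_gen| > threshold but |r_real| <= threshold."""
    cols = list(r_real.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            real = r_real.iloc[i, j]
            gen = r_gen.iloc[i, j]
            if abs(gen) > threshold and abs(real) <= threshold:
                pairs.append((cols[i], cols[j], real, gen))
    return pd.DataFrame(pairs, columns=["col_1", "col_2", "r_real", "r_gen"])

cols = ["dday", "height", "weight"]
nan = np.nan
# Values copied from the 'tGAN real r table' and 'tGAN generated r table' outputs
tgan_real = pd.DataFrame([
    [nan, 0.176702, 0.216054],
    [0.176702, nan, 0.873474],
    [0.216054, 0.873474, nan],
], index=cols, columns=cols)
tgan_gen = pd.DataFrame([
    [nan, 0.630282, 0.674201],
    [0.630282, nan, 0.967306],
    [0.674201, 0.967306, nan],
], index=cols, columns=cols)

flagged = spurious_strong_pairs(tgan_real, tgan_gen)
print(flagged)
```

For this subset the helper flags dday/height and dday/weight. The height/weight pair is not flagged, because it is strongly correlated in the real data as well.
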

In [8]:
display('SRA for each column and synthetic data',sra_df)

'SRA for each column and synthetic data'

Unnamed: 0,DoppelGANger_0814,DoppelGANger_0824,DoppelGANger_0827,tGAN,tGAN 1,tGAN 2,tGAN 4
dday,1.0,0.361111,0.333333,0.694444,0.333333,0.861111,0.611111
weight,1.0,0.638889,0.444444,0.555556,0.527778,0.805556,0.694444
height,1.0,0.75,0.555556,0.638889,0.527778,0.694444,0.694444
age,1.0,,,,,,
temp,1.0,0.444444,0.527778,0.75,0.861111,0.361111,0.527778
average,1.0,0.602778,0.508333,0.6,0.588889,0.666667,0.608333
vomit_dur,,0.555556,0.583333,0.166667,0.611111,0.75,0.555556
cough_dur,,0.833333,0.472222,0.583333,0.5,0.5,0.444444
diar_No,,0.583333,0.527778,0.5,0.527778,0.833333,0.583333
diar_Yes,,0.611111,0.5,0.5,0.555556,0.805556,0.583333


## Conclusion
* We can conclude that DoppelGANger_0814 preserves the dependency ranking between columns very well (its SRA is 1.0 for every column).   

* Comparing the 'average' row, tGAN 2 is the best of the remaining datasets at preserving the ranking, although the previous section shows that tGAN 2 tends to have large r values on average.  
* DoppelGANger_0827 achieves the lowest average SRA.

## Possible Improvements in this method
Note that the TGAN data contains categorical columns, e.g. diar_No and diar_Yes, between which _r_ is -1. This corresponds to the logical fact that if diar_No = 1 then diar_Yes = 0, and if diar_No = 0 then diar_Yes = 1. A reasonable synthetic dataset has to respect this kind of logical relationship.  

It is therefore reasonable to say that the larger the absolute value of _r_, the more important the relationship. That is why we use MSE rather than MAE for the quantitative evaluation. A possible improvement is to weight each squared error according to _r_ rather than taking a plain average.
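
One way such a weighting could look is sketched below. This is an assumption for illustration, not the method used in this notebook: each squared error is weighted by the absolute real-data coefficient $ |r| $, and the weighted sum is divided by the sum of the weights. The function name `weighted_mse` and the toy tables are hypothetical.

```python
import numpy as np
import pandas as pd

def weighted_mse(r_real, r_gen):
    """Squared error between two r tables, weighted by |r| of the real data."""
    real = r_real.to_numpy(dtype=float)
    gen = r_gen.to_numpy(dtype=float)
    off_diag = ~np.eye(real.shape[0], dtype=bool)
    valid = off_diag & ~np.isnan(real) & ~np.isnan(gen)
    weights = np.abs(real[valid])  # assumed weighting: stronger real dependencies count more
    sq_err = (real[valid] - gen[valid]) ** 2
    # Undefined if every real coefficient is exactly 0 (the weights then sum to 0)
    return float(np.sum(weights * sq_err) / np.sum(weights))

cols = ["a", "b", "c"]
nan = np.nan
# Toy tables (illustrative values only)
toy_real = pd.DataFrame([[nan, 1.0, 0.0], [1.0, nan, 0.5], [0.0, 0.5, nan]],
                        index=cols, columns=cols)
toy_gen = pd.DataFrame([[nan, 0.8, 0.3], [0.8, nan, 0.5], [0.3, 0.5, nan]],
                       index=cols, columns=cols)

toy_weighted = weighted_mse(toy_real, toy_gen)
print(toy_weighted)
```

In this toy case the error on the pair with real $ r = 0 $ gets zero weight, so the score is driven by the error on the $ r = 1 $ pair. Whether weak relationships should be ignored to this degree is a design choice; a weight such as $ |r| $ plus a small constant would keep them in play.
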

**Reference:**
* [James Jordon, Jinsung Yoon, Mihaela van der Schaar. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees](https://openreview.net/pdf?id=S1zk9iRqF7)