<h1>Synthetizing the Insurance Dataset</h1>
<p>
The method is based on copulas. In this version, I work with the non-categorical features only. No grouping is performed.
The content is as follows:
    
<ol>
    <li><a href="#section1">Imports and Reading Dataset</a>
    <li><a href="#section2">Exploratory Analysis</a>
    <li><a href="#section3">Step 1: Compute correlation matrix on real data</a>
    <li><a href="#section4">Step 2: Multivariate Gaussian generation</a>
    <li><a href="#section5">Step 3 and 4: From Gaussian to uniform to target distribution</a>
    <li><a href="#section6">Synthetic Data: Snapshot</a>
    <li><a href="#section7">Gathering More Stats and Insights</a>
    <li><a href="#section8">First Attempt at Parametric Copulas</a>
</ol>


<h2>Imports and Reading Dataset</h2><a id='section1'></a>
<p>
The dataset is available on Kaggle, <a href="https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset">here</a>. 
In this notebook, I use the version on my GitHub repository,
    <a href="https://github.com/VincentGranville/Main/blob/main/insurance.csv">here</a>.
<p>
<b>Features: </b> age, sex, bmi, children, smoker, region, charges. 
<p>
The last one is the response. Categorical fields (sex, smoker, region) are to be treated separately. Check for
outliers,
missing values,
values with commas inside,
and so on. Also, do we need to transform the data?
<p>
<b>Exercise 1</b><br>
Use dummy variables for categorical fields, and include them in the synthetization.

In [1]:
# copula_insurance_nogroup.py

import pandas as pd
from scipy.stats import norm
import numpy as np

filename = 'insurance.csv' 
data = pd.read_csv(filename)
print(data.head(10))
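<p>
The checks mentioned above (missing values, values with commas inside, outliers) can be sketched as follows. This is a minimal sketch on a small made-up table rather than the real file: the toy values and the 1.5 x IQR outlier rule are assumptions chosen for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy table standing in for insurance.csv (values are made up)
toy = pd.DataFrame({
    "age": [19, 33, 45, 62, 27],
    "bmi": [27.9, 22.7, None, 30.1, 25.3],
    "charges": ["16,884.92", "1725.55", "21984.47", "3866.86", "63770.43"],
})

# Missing values per column
n_missing = toy.isna().sum()

# Values with commas inside: strip the commas, then convert to numbers
toy["charges"] = pd.to_numeric(toy["charges"].str.replace(",", "", regex=False))

# Simple outlier flag (assumed rule): beyond 1.5 * IQR from the quartiles
q1, q3 = toy["charges"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["charges"] < q1 - 1.5 * iqr) | (toy["charges"] > q3 + 1.5 * iqr)]

print(n_missing)
print(outliers)
```

On the real data, the same calls would be applied to <code>data</code> after reading the CSV file.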

Now using numerical fields only: age, bmi, children, charges

In [2]:
age = data.loc[:,"age"]
bmi = data.loc[:,"bmi"]
children = data.loc[:,"children"]
charges = data.loc[:,"charges"]

<a id='section2'></a>
<h2>Exploratory Analysis</h2>
<p>
The focus is on checking what kind of distributions we are dealing with, by plotting the distribution of each feature, and scatterplots for pairs of features. In particular:
<ul>
<li> 'age' looks uniform except for the extremes
<li> 'bmi' looks Gaussian
<li> 'children' looks like a geometric distribution
<li> 'charges' is bimodal
</ul>

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
axes = plt.axes()
n_bins = 15  # n_bins = 20 produces periodic spikes, why?
plt.hist(age, n_bins, density = True, rwidth = 0.8, color = 'orange') 
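<p>
One likely answer to the question in the comment above: 'age' takes integer values, so when the bin width is not a whole number, some bins capture more distinct ages than others. The sketch below assumes integer ages from 18 to 64 (an assumption about the data range, to be checked with <code>age.min()</code> and <code>age.max()</code>) and counts how many distinct ages fall in each bin.

```python
import numpy as np

# Assumed range: one observation per integer age from 18 to 64
ages = np.arange(18, 65)

# Number of distinct integer ages captured by each bin
per_bin_20, _ = np.histogram(ages, bins=20)
per_bin_15, _ = np.histogram(ages, bins=15)

print("20 bins:", per_bin_20)
print("15 bins:", per_bin_15)
```

With 20 bins the width is 2.3, so each bin holds either 2 or 3 distinct ages, and the bins holding 3 show up as regularly spaced spikes. With 15 bins each bin holds 3 or 4 ages, so the relative jumps are smaller.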

In [4]:
n_bins = 6
plt.hist(children, n_bins, density = True, rwidth = 0.8, color = 'orange') 

In [5]:
n_bins = 15
plt.hist(bmi, n_bins, density = True, rwidth = 0.8, color = 'orange') 

In [6]:
n_bins = 15 # a lot smoother with 10 bins
plt.hist(charges, n_bins, density = False, rwidth = 0.8, color = 'orange') 

<a id='section3'></a>
<h2>Step 1: Compute correlation matrix on real data</h2>

In [7]:
# need correlation matrix computed on real data, for Gaussian copula
r_data = np.stack((age, bmi, children, charges), axis = 0)
r_corr = np.corrcoef(r_data) 
print(r_corr)
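<p>
A note on orientation: <code>np.corrcoef</code> treats each row as a variable by default, which is why the features are stacked along axis 0 above. The sketch below, on made-up data, shows that the row-wise and column-wise conventions give the same matrix, and that the result is symmetric with a unit diagonal, as needed for the covariance matrix in Step 2.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 observations of 3 features, one feature per column
x = rng.normal(size=(200, 3))
x[:, 2] += 0.8 * x[:, 0]   # induce some correlation between features 0 and 2

corr_rows = np.corrcoef(x.T)               # features as rows (default)
corr_cols = np.corrcoef(x, rowvar=False)   # features as columns

print(np.allclose(corr_rows, corr_cols))
print(np.allclose(np.diag(corr_rows), 1.0))
```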

Also compute the mean of each feature. The means are not needed by the copula, but they serve as a sanity check and are compared with the synthetic means later.

In [8]:
r_mu  = [np.mean(age), np.mean(bmi), np.mean(children), np.mean(charges)]
nobs_synth = len(age)
print("Mean: %5.2f %5.2f %5.2f %6.0f" % (r_mu[0],r_mu[1],r_mu[2],r_mu[3]))
print("Nobs: %4d" %(nobs_synth))
zero = [0, 0, 0, 0] 

<a id='section4'></a>
<h2>Step 2: Multivariate Gaussian generation</h2>
<p>
Generate a multivariate Gaussian with zero mean and covariance matrix equal to the correlation matrix of the real data. We generate <code>nobs_synth</code> observations, in this case the same number as in the real data. The use of a Gaussian copula here is similar to using Gaussians for the latent data in a GAN. There are alternatives, such as the Frank copula, in the same way that you could use (say) uniform deviates for latent features in a GAN. If some features in the real data have very thick tails, a Gaussian copula, though theoretically correct (especially if the dataset is large), may undersample extremes.
<p>
First, we want to control all sources of randomness for replicability and to study volatility. This is done with <code>seed</code>. When saving results, don't forget to save the seed that you used as well.

In [9]:
seed = 453
np.random.seed(seed)
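<p>
The notebook seeds the global NumPy state with <code>np.random.seed</code>. As an alternative sketch (not what the cells here use), a local generator created with <code>np.random.default_rng</code> keeps each stream of random numbers tied to its own seed. The helper name <code>draw</code> is hypothetical. Note that the two mechanisms produce different numbers for the same seed, so results are only comparable within one of them.

```python
import numpy as np

def draw(seed, n=5):
    # Local generator: the seed affects only this stream
    rng = np.random.default_rng(seed)
    return rng.standard_normal(n)

a = draw(453)
b = draw(453)
c = draw(454)

print(np.array_equal(a, b))   # same seed, same numbers
print(np.array_equal(a, c))   # different seed, different numbers
```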

In [10]:
# this step reproduces the correlation structure in the synthetic data
gfg = np.random.multivariate_normal(zero, r_corr, nobs_synth) 
g_age = gfg[:,0]
g_bmi = gfg[:,1]
g_children = gfg[:,2]
g_charges = gfg[:,3]
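<p>
A quick sanity check for this step is to compare the correlation matrix of the generated Gaussian sample with the target. The sketch below uses a hypothetical 3 x 3 target matrix (not the one computed from the insurance data) and the newer generator interface. With 1338 observations, deviations of a few hundredths are expected from sampling noise alone.

```python
import numpy as np

# Hypothetical target correlation matrix (not from the insurance data)
target = np.array([[1.0, 0.1, 0.3],
                   [0.1, 1.0, 0.2],
                   [0.3, 0.2, 1.0]])

rng = np.random.default_rng(453)
g = rng.multivariate_normal(np.zeros(3), target, size=1338)

# Correlation actually observed in the Gaussian sample
sample_corr = np.corrcoef(g, rowvar=False)
max_gap = np.max(np.abs(sample_corr - target))
print("Largest deviation from target: %.3f" % max_gap)
```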

<a id='section5'></a>
<h2>Step 3 and 4: From Gaussian to uniform to target distribution</h2>
<p>
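From the correlated Gaussian with the target correlation structure, extract the marginals (one per feature) and turn them into uniforms on [0, 1] using the standard normal CDF. Then transform the uniforms into the correct target distribution: the empirical distribution of the real data, for each feature, obtained via its empirical quantiles. Both transforms are monotonic, so the dependence structure carries over; the Pearson correlations are preserved approximately rather than exactly.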

In [11]:
s_data = []

for k in range(nobs_synth):  

    # Step 3: first get uniform distrib. for each feature
    u_age = norm.cdf(g_age[k])
    u_bmi = norm.cdf(g_bmi[k])
    u_children = norm.cdf(g_children[k])
    u_charges = norm.cdf(g_charges[k])

    # Step 4: turn uniform into target distrib.
    s_age = np.quantile(age, u_age)                # synthesized age 
    s_bmi = np.quantile(bmi, u_bmi)                # synthesized bmi
    s_children = np.quantile(children, u_children) # synthesized children
    s_charges = np.quantile(charges, u_charges)    # synthesized charges
    s_data.append((s_age,s_bmi,s_children, s_charges))

s_data = np.array(s_data)
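<p>
The loop above can also be written without iterating over observations, since both <code>norm.cdf</code> and <code>np.quantile</code> accept arrays. The sketch below uses made-up stand-ins for one real feature and its Gaussian column, and checks that the vectorized result matches the element-by-element version. It also shows an optional variant, assuming NumPy 1.22 or later: <code>method="inverted_cdf"</code> returns only values observed in the real data, whereas the default linear interpolation can occasionally return a non-integer value for an ordinal feature such as 'children'.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(453)
# Made-up stand-ins for one real feature and its Gaussian column
real_children = rng.integers(0, 6, size=500)
g_col = rng.standard_normal(500)

# Step 3: Gaussian to uniform, whole column at once
u_col = norm.cdf(g_col)

# Step 4: uniform to empirical quantiles of the real feature
s_linear = np.quantile(real_children, u_col)  # default: linear interpolation

# Optional (NumPy >= 1.22): observed values only, no interpolation
s_observed = np.quantile(real_children, u_col, method="inverted_cdf")

# Same result as the element-by-element loop
s_loop = np.array([np.quantile(real_children, norm.cdf(z)) for z in g_col])
print(np.allclose(s_linear, s_loop))
print(np.isin(s_observed, real_children).all())
```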

<a id='sectionx'></a>
<h2>Assessing Quality</h2>

In [12]:
s_mu = np.mean(s_data, axis=0)
print("Mean Synth: %5.2f %5.2f %5.2f %6.0f" % (s_mu[0],s_mu[1],s_mu[2],s_mu[3]))
print("Mean Real : %5.2f %5.2f %5.2f %6.0f" % (r_mu[0],r_mu[1],r_mu[2],r_mu[3]))

In [13]:
s_corr = np.corrcoef(np.transpose(s_data))
print("Correlation, real data\n")
print(r_corr)
print("\nCorrelation, synthetic data\n")
print(s_corr)

<b>Exercise 2</b><br>
Check if pairwise feature scatter plots on real and synthetic data are similar.
<p>
<b>Exercise 3</b><br>    
Try with different seeds. Assess volatility of the results. Compute confidence intervals for mean age and so on.
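<p>
One possible starting point for Exercise 3, as a sketch only: wrap the copula steps in a function of the seed, run it for many seeds, and look at the spread of the synthetic means. The data below is made up (two correlated features), the function name <code>synth_means</code> is hypothetical, and the 2.5 to 97.5 percentile range across seeds is just one way to read "confidence interval" here.

```python
import numpy as np
from scipy.stats import norm

rng0 = np.random.default_rng(0)
# Made-up "real" data: two correlated features, 500 rows
real = rng0.multivariate_normal([40, 30], [[200, 20], [20, 36]], size=500)
corr = np.corrcoef(real, rowvar=False)

def synth_means(seed):
    """Run the copula steps once and return the synthetic feature means."""
    rng = np.random.default_rng(seed)
    g = rng.multivariate_normal(np.zeros(2), corr, size=len(real))
    u = norm.cdf(g)
    s = np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(2)])
    return s.mean(axis=0)

# One row of synthetic means per seed
means = np.array([synth_means(seed) for seed in range(100)])
lo, hi = np.percentile(means, [2.5, 97.5], axis=0)
print("Real means      :", real.mean(axis=0))
print("95% range, synth:", list(zip(lo, hi)))
```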

<a id='section6'></a>
<h2>Synthetic Data: Snapshot</h2>

In [14]:
# convert the NumPy array to a pandas DataFrame
s_data = pd.DataFrame(s_data, columns = ['age','bmi','children','charges'])
print(s_data.head(10))

In [15]:
r_data = np.transpose(r_data)
r_data = pd.DataFrame(r_data, columns = ['age','bmi','children','charges'])
print(r_data.head(10))

<a id='section7'></a>
<h2>Gathering More Stats and Insights</h2>
<p>
The goal is to better compare synthetic with real data, and save summary stats for future comparison with other seeds (to assess volatility) and other methods: GAN, copula with grouping, Frank copula, copula with empirical quantiles replaced by parametric distributions fit to the real data, and feature substitution to reduce algorithmic bias.

In [16]:
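# r_data and s_data are DataFrames with one column per feature,
# so per-feature statistics are taken along axis 0
r_arr = r_data.to_numpy()
s_arr = s_data.to_numpy()
r_min = np.min(r_arr, axis=0)
s_min = np.min(s_arr, axis=0)
r_max = np.max(r_arr, axis=0)
s_max = np.max(s_arr, axis=0)
r_std = np.std(r_arr, axis=0)
s_std = np.std(s_arr, axis=0)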

print("Min Real:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f\n" 
      % (r_min[0],r_min[1],r_min[2],r_min[3]))
print("Min Synth:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f\n" 
      % (s_min[0],s_min[1],s_min[2],s_min[3]))
print("Max Real:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f\n" 
      % (r_max[0],r_max[1],r_max[2],r_max[3]))
print("Max Synth:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f\n" 
      % (s_max[0],s_max[1],s_max[2],s_max[3]))
print("Std Real:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f\n" 
      % (r_std[0],r_std[1],r_std[2],r_std[3]))
print("Std Synth:\nage\t\t%9.3f\nbmi\t\t%9.3f\nchildren\t%9.3f\ncharges\t\t%9.3f" 
      % (s_std[0],s_std[1],s_std[2],s_std[3]))
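<p>
The same statistics can be gathered in a single table, which is easier to save and compare across seeds or methods. The sketch below is one way to do it with pandas, on made-up stand-ins for the real and synthetic DataFrames. The helper name <code>summarize</code> is hypothetical, and the 25th and 75th percentiles are included as an illustration of what Exercise 4 asks for.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(453)
cols = ["age", "bmi", "children", "charges"]
# Made-up stand-ins for the real and synthetic DataFrames
real_df = pd.DataFrame(rng.normal(size=(300, 4)), columns=cols)
synth_df = pd.DataFrame(rng.normal(size=(300, 4)), columns=cols)

def summarize(df):
    # One row per feature; ddof=0 matches the default of np.std
    return pd.DataFrame({
        "min": df.min(),
        "q25": df.quantile(0.25),
        "q75": df.quantile(0.75),
        "max": df.max(),
        "std": df.std(ddof=0),
    })

# Side-by-side table: top-level columns 'real' and 'synth'
summary = pd.concat({"real": summarize(real_df), "synth": summarize(synth_df)}, axis=1)
print(summary.round(3))
```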

<b>Exercise 4</b><br>
Add 25- and 75-percentiles, both for real and synthetic.
<p>
<b>Exercise 5</b><br> 
Compute the distance between statistical summaries on real and synthetized data. To do so, transform the real data so that each feature has zero mean and unit variance. Make the comparison between the standardized data and the synthetic data produced from the standardized version of the real data. Then try different seeds, and see which ones provide the best fit.
<p>
<b>Exercise 6</b><br>
In addition to the above statistics in Exercise 5, also include the correlation matrices. No need to transform the data here: correlation matrices are invariant under the standardization used in Exercise 5 (a positive linear rescaling of each feature). To be discussed in the GAN module. 
<p>
<b>Exercise 7</b><br>
Instead of summary stats, use Hellinger distance for comparison purposes: (1) feature-wise, (2) max or average Hellinger across features.
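<p>
For reference, the Hellinger distance between two discrete distributions p and q is sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), a value between 0 and 1. The sketch below is one way to compute it for a single feature by binning both samples on a common grid. The function name <code>hellinger</code>, the number of bins, and the test data are all assumptions for illustration.

```python
import numpy as np

def hellinger(x, y, n_bins=20):
    """Hellinger distance between the binned distributions of x and y."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=n_bins, range=(lo, hi))
    p = p / p.sum()   # bin counts to probabilities
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

rng = np.random.default_rng(453)
a = rng.normal(0, 1, 2000)
b = rng.normal(0, 1, 2000)   # same distribution as a
c = rng.normal(3, 1, 2000)   # shifted distribution

print("same  : %.3f" % hellinger(a, b))
print("shift : %.3f" % hellinger(a, c))
```

Applied feature by feature to the real and synthetic columns, the resulting values can then be combined with <code>max</code> or <code>np.mean</code>. The result depends on the number of bins, so the same binning should be used for all comparisons.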

<b>Conclusions</b><p>
Copulas do a great job at replicating the correlation structure and marginal distributions. They work with both ordinal and continuous features, producing ordinal values for ordinal features. Replication of the results is easy thanks to
<code>seed</code>. It is also a very fast technique. We will see that copulas, unlike GANs, are not great at replicating nonlinear structures (correlation is a linear structure). Also, unless you use parametric distributions for the quantiles, you can't sample outside the range of observations (min, max) in the real data. This limitation applies to all features, but is easy to fix.

<a id='section8'></a>
<h2>First Attempt at Parametric Copulas</h2>
<p>
Let's replace the empirical quantiles for 'children' by quantiles of a geometric distribution of parameter <em>p</em>. 
The parameter is estimated on the real data from the mean of the feature in question: <em>p</em> = 1 / mean for a geometric distribution on {1, 2, ...}, or <em>p</em> = 1 / (1 + mean) for the version on {0, 1, ...}, which is the relevant one here since 'children' starts at 0. The
feature <code>s_children</code> obtained in the synthetic data will have a
sister feature <code>s_param_children</code>. See what happens when we switch them. Are the results worse? Are we able to sample outside the range (that was the purpose)? Likewise, we could use a GMM (Gaussian mixture model) for the bimodal 'charges', preferably after a log-transform to make sure we do not generate negative charges in the synthetic data (and typically, expenses may follow a log-normal rather than normal distribution).
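<p>
A minimal sketch of the geometric version, under stated assumptions: the arrays below are made-up stand-ins for the real 'children' feature and its Gaussian column, and the geometric distribution is taken on {0, 1, 2, ...} because 'children' starts at 0. That version has mean (1 - p) / p, hence the estimate p = 1 / (1 + mean). In SciPy, <code>geom</code> is defined on {1, 2, ...}, so <code>loc=-1</code> shifts it to start at 0.

```python
import numpy as np
from scipy.stats import geom, norm

rng = np.random.default_rng(453)
# Made-up stand-in for the real 'children' feature (counts starting at 0)
real_children = rng.geometric(0.5, size=1000) - 1
g_children = rng.standard_normal(1000)   # stand-in for the Gaussian column

# Geometric law on {0, 1, 2, ...}: mean = (1 - p) / p, so p = 1 / (1 + mean)
p = 1.0 / (1.0 + real_children.mean())

# Step 3: Gaussian to uniform
u_children = norm.cdf(g_children)
# Step 4: parametric quantiles; loc=-1 shifts scipy's geom to start at 0
s_param_children = geom.ppf(u_children, p, loc=-1)

print("p = %.3f" % p)
print("max real: %d, max synth: %d" % (real_children.max(), s_param_children.max()))
```

Because the quantiles now come from a fitted distribution rather than from the observed values, the synthetic maximum is no longer capped by the largest value seen in the real data.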