<h1>Synthesizing the Insurance Dataset</h1>
<p>
The method is based on a Gaussian copula. In this version, I work with the non-categorical features only. No grouping is performed.

<h2>Imports and Reading Dataset</h2>
<p>
The dataset is available on Kaggle, <a href="https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset">here</a>. 
In this notebook, I use the version on my GitHub repository,
    <a href="https://github.com/VincentGranville/Main/blob/main/insurance.csv">here</a>.
<p>
    <b>Features: </b>
<ul>
    <li>age, 
    <li>sex, 
    <li>bmi, 
    <li>children, 
    <li>smoker, 
    <li>region, 
    <li>charges. 
</ul>
The last one is the response. Categorical fields (sex, smoker, region) are to be treated separately. Check for 
<p>
<ul>
    <li>outliers, 
    <li>missing values, 
    <li>values with commas inside,
</ul>
and so on. Also, do we need to transform the data?
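<p>
The checks listed above can be sketched as follows. This is a minimal illustration on a small hand-made table, not on the real file: the frame <code>toy</code>, its values, and the helper <code>iqr_outliers</code> are assumptions introduced here, and the 1.5 IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the insurance data (hypothetical values, not the real file)
toy = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31],
    "bmi": [27.9, 33.77, np.nan, 22.705, 28.88, 25.74],
    "charges": ["16884.92", "1725.55", "4449.46", "21,984.47", "3866.86", "3756.62"],
})

# Missing values: count per column
n_missing = toy.isna().sum()

# Values with commas inside: the column was read as text, so flag and clean it
has_comma = toy["charges"].astype(str).str.contains(",")
toy["charges"] = pd.to_numeric(
    toy["charges"].astype(str).str.replace(",", "", regex=False)
)

# Outliers: flag points outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR] (hypothetical helper)
def iqr_outliers(x):
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

outlier_mask = iqr_outliers(toy["charges"])

print(n_missing)
print("rows with commas:", int(has_comma.sum()))
print("outliers in charges:", int(outlier_mask.sum()))
```

On the real data, the same calls would be applied to <code>data</code> after <code>pd.read_csv</code>.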
<p>
<b>Exercise:</b> Use dummy variables for categorical fields, and include them in the synthetization.
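<p>
As a possible starting point for this exercise, the sketch below builds dummy variables with <code>pd.get_dummies</code>. The frame <code>toy</code> is a hypothetical four-row sample, and dropping the first level of each field is a design choice made here, not something prescribed by the method. How to feed the resulting 0/1 columns to the copula is left open.

```python
import pandas as pd

# Toy rows mimicking the categorical columns (hypothetical sample)
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# One 0/1 column per category level; drop_first removes the redundant level
dummies = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(dummies)
```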

In [1]:
# copula_insurance_nogroup.py

import pandas as pd
from scipy.stats import norm
import numpy as np

filename = 'insurance.csv' 
data = pd.read_csv(filename)
print(data.head(10))
print("\n")

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
5   31  female  25.740         0     no  southeast   3756.62160
6   46  female  33.440         1     no  southeast   8240.58960
7   37  female  27.740         3     no  northwest   7281.50560
8   37    male  29.830         2     no  northeast   6406.41070
9   60  female  25.840         0     no  northwest  28923.13692




We now use the numerical fields only: age, bmi, children, and charges.

In [2]:
age = data.loc[:,"age"]
bmi = data.loc[:,"bmi"]
children = data.loc[:,"children"]
charges = data.loc[:,"charges"]

<h2>Step 1: Compute correlation matrix on real data</h2>

In [3]:
# need correlation matrix computed on real data, for Gaussian copula
r_data = np.stack((age, bmi, children, charges), axis = 0)
r_corr = np.corrcoef(r_data) 
print(r_corr)
print("\n")

[[1.         0.10927188 0.042469   0.29900819]
 [0.10927188 1.         0.0127589  0.19834097]
 [0.042469   0.0127589  1.         0.06799823]
 [0.29900819 0.19834097 0.06799823 1.        ]]




We also compute the mean of each feature. The means are not needed by the copula, but they serve as a sanity check and are compared with the means of the synthetic data later on.

In [4]:
r_mu  = [np.mean(age), np.mean(bmi), np.mean(children), np.mean(charges)]
nobs_synth = len(age)
print("Mean: %5.2f %5.2f %5.2f %6.0f" % (r_mu[0],r_mu[1],r_mu[2],r_mu[3]))
print("Nobs: ",nobs_synth)
zero = [0, 0, 0, 0] 

Mean: 39.21 30.66  1.09  13270
Nobs:  1338


<h2>Step 2: Multivariate Gaussian generation</h2>
<p>
Generate a multivariate Gaussian sample with zero mean and covariance matrix equal to the correlation matrix computed on the real data. We generate <code>nobs_synth</code> observations: in this case, the same number as in the real data.
<p>
First, we want to control all sources of randomness for replicability. This is done by fixing the <code>seed</code> of the random number generator.

In [5]:
seed = 453
np.random.seed(seed)

In [6]:
# this step reproduces the correlation structure in the synthetic data
gfg = np.random.multivariate_normal(zero, r_corr, nobs_synth) 
g_age = gfg[:,0]
g_bmi = gfg[:,1]
g_children = gfg[:,2]
g_charges = gfg[:,3]
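
A quick sanity check on this step is to verify that the Gaussian sample reproduces the target correlations up to sampling noise. The sketch below is self-contained: the matrix <code>target</code> is copied from the printed output of Step 1, and the names <code>rng</code>, <code>sample</code>, and <code>max_gap</code> are introduced here for illustration. It uses a local generator from <code>np.random.default_rng</code>, so the random stream differs from the one produced by <code>np.random.seed</code> in the notebook.

```python
import numpy as np

# Target correlation matrix, copied from the real-data output of Step 1
target = np.array([
    [1.0,        0.10927188, 0.042469,   0.29900819],
    [0.10927188, 1.0,        0.0127589,  0.19834097],
    [0.042469,   0.0127589,  1.0,        0.06799823],
    [0.29900819, 0.19834097, 0.06799823, 1.0],
])

# Local generator (assumption: not the same stream as np.random.seed(453))
rng = np.random.default_rng(453)
sample = rng.multivariate_normal(np.zeros(4), target, size=1338)

# Correlation of the generated sample, columns treated as variables
sample_corr = np.corrcoef(sample, rowvar=False)
max_gap = np.abs(sample_corr - target).max()
print("largest absolute gap:", max_gap)
```

With 1338 observations, gaps of a few hundredths are expected from sampling noise alone.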

<h2>Steps 3 and 4: From Gaussian to uniform to target distribution</h2>
<p>
From the correlated Gaussian sample with the target correlation structure, extract the marginals (one per feature) and turn them into uniforms on [0, 1] using the Gaussian CDF. Then transform the uniforms into the target distribution, that is, the empirical distribution of the real data, for each feature. Both transformations are monotone increasing, so the correlation structure is approximately preserved.

In [7]:
s_data = []

for k in range(nobs_synth):  

    # Step 3: first get uniform distrib. for each feature
    u_age = norm.cdf(g_age[k])
    u_bmi = norm.cdf(g_bmi[k])
    u_children = norm.cdf(g_children[k])
    u_charges = norm.cdf(g_charges[k])

    # Step 4: turn uniform into target distrib.
    s_age = np.quantile(age, u_age)                # synthesized age 
    s_bmi = np.quantile(bmi, u_bmi)                # synthesized bmi
    s_children = np.quantile(children, u_children) # synthesized children
    s_charges = np.quantile(charges, u_charges)    # synthesized charges
    s_data.append((s_age,s_bmi,s_children, s_charges))

s_data = np.array(s_data)
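
The loop above processes one observation at a time. Since <code>norm.cdf</code> and <code>np.quantile</code> both accept arrays, the same two steps can be applied column by column. The sketch below illustrates this on stand-in data generated on the spot (the arrays <code>real</code>, <code>gauss</code>, <code>unif</code>, and <code>synth</code> are hypothetical names, and the data is not the insurance file), and compares the result with a row-by-row loop.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Stand-in "real" data: two skewed features (hypothetical, not the insurance file)
real = np.column_stack([rng.gamma(2.0, 10.0, 500), rng.integers(0, 6, 500)])

# Correlated Gaussian draws, one column per feature
corr = np.corrcoef(real, rowvar=False)
gauss = rng.multivariate_normal(np.zeros(2), corr, size=500)

# Step 3: Gaussian -> uniform, applied to the whole array at once
unif = norm.cdf(gauss)

# Step 4: uniform -> empirical quantiles of each real feature
synth = np.column_stack(
    [np.quantile(real[:, j], unif[:, j]) for j in range(real.shape[1])]
)

# Row-by-row version, for comparison with the vectorized result
loop = np.array([
    [np.quantile(real[:, j], norm.cdf(gauss[k, j])) for j in range(2)]
    for k in range(500)
])
print(np.allclose(synth, loop))
```

The vectorized form mainly saves time on larger datasets; the synthetic values are the same.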

<h2>Assessing Quality</h2>

In [8]:
s_mu = np.mean(s_data, axis=0)
print("\n")
print("Mean Synth: %5.2f %5.2f %5.2f %6.0f" % (s_mu[0],s_mu[1],s_mu[2],s_mu[3]))
print("Mean Real : %5.2f %5.2f %5.2f %6.0f" % (r_mu[0],r_mu[1],r_mu[2],r_mu[3]))
print("\n")



Mean Synth: 38.84 30.65  1.02  13328
Mean Real : 39.21 30.66  1.09  13270




In [9]:
s_corr = np.corrcoef(np.transpose(s_data))
print(r_corr)   # correlation matrix, real data
print("\n")
print(s_corr)   # correlation matrix, synthetic data
print("\n")

[[1.         0.10927188 0.042469   0.29900819]
 [0.10927188 1.         0.0127589  0.19834097]
 [0.042469   0.0127589  1.         0.06799823]
 [0.29900819 0.19834097 0.06799823 1.        ]]


[[1.         0.09547703 0.02413221 0.28762577]
 [0.09547703 1.         0.05036735 0.15233377]
 [0.02413221 0.05036735 1.         0.05744995]
 [0.28762577 0.15233377 0.05744995 1.        ]]




<b>Exercises:</b>
<ol>
    <li>Check whether pairwise feature scatter plots on real and synthetic data are similar.
    <li>Try different seeds and assess the volatility of the results. Compute confidence intervals for the mean age, and so on.
</ol>
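
For the second exercise, one simple option is to repeat the synthesis over many seeds and look at the spread of the synthetic means. The sketch below is an illustration under stated assumptions: the function <code>synthesize</code> is a hypothetical helper that wraps Steps 1 to 4, the data is a generated stand-in rather than the insurance file, and the interval is a plain percentile interval across seeds, which is only one way to express the volatility.

```python
import numpy as np
from scipy.stats import norm

def synthesize(real, seed):
    """Gaussian-copula synthesis of an (n_obs, n_features) array (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(real, rowvar=False)
    gauss = rng.multivariate_normal(
        np.zeros(real.shape[1]), corr, size=real.shape[0]
    )
    unif = norm.cdf(gauss)
    return np.column_stack(
        [np.quantile(real[:, j], unif[:, j]) for j in range(real.shape[1])]
    )

# Stand-in data: age and a positively related, skewed charge (not the real file)
data_rng = np.random.default_rng(1)
age = data_rng.integers(18, 65, 1000).astype(float)
charges = 200.0 * age + data_rng.gamma(2.0, 3000.0, 1000)
real = np.column_stack([age, charges])

# One synthetic dataset per seed; keep the feature means of each run
seeds = range(100, 150)
means = np.array([synthesize(real, s).mean(axis=0) for s in seeds])

# 95% percentile interval of the synthetic means across seeds
lo, hi = np.percentile(means, [2.5, 97.5], axis=0)
print("real mean age:", real[:, 0].mean())
print("95% interval of synthetic mean age:", lo[0], hi[0])
```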

<h2>Synthetic Data: Snapshot</h2>

In [10]:
# --- convert NumPy array to pandas DataFrame
s_data = pd.DataFrame(s_data, columns = ['age','bmi','children','charges'])
print(s_data.head(10))
print("\n")

    age        bmi  children       charges
0  26.0  35.200000       3.0   7203.616777
1  34.0  29.193193       2.0   9526.725877
2  32.0  24.761289       1.0   9381.264084
3  19.0  26.885000       2.0   3333.067695
4  23.0  25.665819       0.0   9793.251928
5  27.0  28.837104       3.0  21223.696804
6  24.0  28.000000       0.0   2903.971962
7  33.0  25.534879       0.0   6626.613634
8  51.0  30.347082       1.0  12920.077903
9  18.0  21.755000       1.0   4932.769237




In [11]:
r_data = np.transpose(r_data)
r_data = pd.DataFrame(r_data, columns = ['age','bmi','children','charges'])
print(r_data.head(10))

    age     bmi  children      charges
0  19.0  27.900       0.0  16884.92400
1  18.0  33.770       1.0   1725.55230
2  28.0  33.000       3.0   4449.46200
3  33.0  22.705       0.0  21984.47061
4  32.0  28.880       0.0   3866.85520
5  31.0  25.740       0.0   3756.62160
6  46.0  33.440       1.0   8240.58960
7  37.0  27.740       3.0   7281.50560
8  37.0  29.830       2.0   6406.41070
9  60.0  25.840       0.0  28923.13692


<h2>Gathering More Stats and Insights</h2>
<p>
The goal is to better compare synthetic with real data, and save summary stats for future comparison with other seeds (to assess volatility) and other methods: GAN, copula with grouping, Frank copula, copula with empirical quantiles replaced by parametric distributions fit to the real data, and feature substitution to reduce algorithmic bias.
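<p>
One way to collect such summary statistics is sketched below. The helper <code>summary_stats</code>, the run label, the choice of statistics (mean, standard deviation, two-sample Kolmogorov-Smirnov distance), and the output file name are all assumptions made for illustration. The two frames are generated stand-ins, not the real and synthetic insurance data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def summary_stats(real, synth, label):
    """One row per feature comparing real and synthetic data (hypothetical helper)."""
    rows = []
    for col in real.columns:
        ks = ks_2samp(real[col], synth[col])
        rows.append({
            "run": label,
            "feature": col,
            "real_mean": real[col].mean(),
            "synth_mean": synth[col].mean(),
            "real_std": real[col].std(),
            "synth_std": synth[col].std(),
            "ks_stat": ks.statistic,  # distance between the two empirical CDFs
        })
    return pd.DataFrame(rows)

# Stand-in frames (generated here, not the insurance data)
rng = np.random.default_rng(2)
real = pd.DataFrame({"age": rng.integers(18, 65, 300), "bmi": rng.normal(30, 6, 300)})
synth = pd.DataFrame({"age": rng.integers(18, 65, 300), "bmi": rng.normal(30, 6, 300)})

stats = summary_stats(real, synth, label="copula_seed_453")
print(stats)
# stats.to_csv("summary_stats.csv", index=False)  # hypothetical file name
```

Tables produced for different seeds or methods can then be stacked with <code>pd.concat</code> and compared feature by feature.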