<h1>GAN: Synthetizing the Diabetes Dataset - Module 3</h1>
<p>
This notebook covers Module 3 of my course on synthetic data and generative AI. This module features the GAN method (generative adversarial networks) applied to tabular data. Emphasis is on creating replicable synthetic data by controlling all sources of randomness in GAN (TensorFlow, Python, NumPy), leveraging the seed to get the best synthetic data, reducing training time, assessing the quality of synthetized data using the distance between correlation matrices, and identifying when GAN performs better than copulas. A separate GAN may be used to synthetize missing observations.
<p>
The dataset diabetes.csv is used to predict the risk of diabetes based on other features such as BMI, insulin, blood pressure, and so on. The rightmost column is the response: diabetes status (yes or no). The dataset is available 
    <a href="https://github.com/VincentGranville/Main/blob/main/diabetes.csv">here</a>.
<p>    
The content is as follows:
    
<ol>
    <li><a href="#section1">Imports and Reading Dataset</a>  
    <li><a href="#section2">Seed, Replicability and Faster Training</a>
    <li><a href="#section3">Supervised Classification: Real Data</a>
    <li><a href="#section4">Setting up the Deep Neural Network</a>
    <li><a href="#section5">Assessing the Quality of the Synthetized Data</a>
    <li><a href="#section6">Main Part</a>
    <li><a href="#section7">Supervised Classification: Synthetic Data</a>
    <li><a href="#section8">Final Evaluation of the Synthetized Data</a>
    <li><a href="#section9">Conclusions</a>
</ol>

<a id='section1'></a>
<h2>1. Imports and Reading Dataset</h2>
<p>
The dataset has the following features: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome. 
    <p>
   More information is available in chapter 10 in my book 
    <em>Synthetic Data and Generative AI</em>, available <a href="https://mltechniques.com/shop/">here</a>.
The last feature "Outcome" is the response: diabetes or not. Values set to zero correspond to missing data, except for the binary response (diabetes status). Here I ignore observations with missing values, but they can be processed with a separate GAN.

In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import random as python_random
from tensorflow import random
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam    # type of gradient descent optimizer
from numpy.random import randn
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

data = pd.read_csv('diabetes.csv')
# rows with missing data must be treated separately: I remove them here
data.drop(data.index[(data["Insulin"] == 0)], axis=0, inplace=True) 
data.drop(data.index[(data["Glucose"] == 0)], axis=0, inplace=True) 
data.drop(data.index[(data["BMI"] == 0)], axis=0, inplace=True) 
# no further data transformation used beyond this point
data.to_csv('diabetes_clean.csv')

print (data.shape)
print (data.tail())
print (data.columns)

(392, 9)
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
753            0      181             88             44      510  43.3   
755            1      128             88             39      110  36.5   
760            2       88             58             26       16  28.4   
763           10      101             76             48      180  32.9   
765            5      121             72             23      112  26.2   

     DiabetesPedigreeFunction  Age  Outcome  
753                     0.222   26        1  
755                     1.057   37        1  
760                     0.766   22        0  
763                     0.171   63        0  
765                     0.245   30        0  
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
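
<p>
As a minimal sketch of the zero-coded missing values discussed above, the snippet below counts zeros per column and drops the affected rows on a toy data frame. The toy values and the helper names <code>count_zero_coded</code> and <code>drop_zero_coded</code> are assumptions made for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows mimicking three columns of diabetes.csv (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI":     [33.6, 26.6, 0.0, 28.1],
})

def count_zero_coded(df, columns):
    """Return the number of zeros (treated as missing) in each listed column."""
    return (df[columns] == 0).sum()

def drop_zero_coded(df, columns):
    """Keep only the rows where none of the listed columns is zero."""
    return df[(df[columns] != 0).all(axis=1)]

cols = ["Glucose", "Insulin", "BMI"]
zero_counts = count_zero_coded(toy, cols)
clean = drop_zero_coded(toy, cols)
print(zero_counts)
print(clean.shape)
```

Counting first gives an idea of how many observations are lost by removing rows, before deciding whether a separate GAN for missing values is worth the effort.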


<a id='section2'></a>
<h2>2. Seed, Replicability and Faster Training</h2>
<p>
The synthetized data is very sensitive to the seed. Trying different seeds allows you to find one that provides better results, or reaches results of the same quality faster. In the enhanced mode, the code allows you to stop the main loop (epoch iterations) as soon as a good enough solution is found. In order to do so, the synthetized data is assessed for its quality every <code>n_eval</code> epochs. By default (standard mode), the number of epochs is set to <code>n_epochs=10000</code>. The learning rate is another critical hyperparameter that you can fine-tune. See also my article on smart grid search to automatically and efficiently fine-tune hyperparameters. The article (with Python code) is available 
    <a href="https://mltechniques.com/2023/03/30/smart-grid-search-case-study-with-hybrid-zeta-geometric-distributions-and-synthetic-data/">here</a>. Further optimization can be achieved by reducing the size of the training set 
    (randomly removing more than 50% of all observations, see <a href="https://mltechniques.com/2023/04/07/massively-speed-up-your-learning-algorithm-with-stochastic-thinning/">here</a>) or eliminating redundant features via feature clustering (see <a href="https://mltechniques.com/2023/03/12/feature-clustering-a-simple-solution-to-many-machine-learning-problems/">here</a>).
    
In addition, my GAN implementation leads to replicable results if you use the same seed each time you run it.     

In [2]:
seed = 103     # to make results replicable
np.random.seed(seed)     # for numpy
random.set_seed(seed)    # for tensorflow/keras
python_random.seed(seed) # for python

adam = Adam(learning_rate=0.001) # also try 0.01
latent_dim = 10
n_inputs   = 9   # number of features
n_outputs  = 9   # number of features
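
<p>
The idea of trying several seeds and keeping the best one can be sketched as follows. This is only an illustration: <code>toy_quality</code> is a hypothetical stand-in for a full GAN run returning a quality score (lower is better, like <code>g_dist</code>), and <code>best_seed</code> is a helper name introduced here, not part of the notebook.

```python
import numpy as np

def toy_quality(seed):
    """Stand-in for a full GAN run: returns a seed-dependent score (lower is better)."""
    rng = np.random.default_rng(seed)
    return float(rng.uniform(0.2, 0.6))

def best_seed(seeds, quality_fn):
    """Evaluate quality_fn for each seed and return (best_seed, best_score)."""
    scores = {s: quality_fn(s) for s in seeds}
    winner = min(scores, key=scores.get)
    return winner, scores[winner]

seeds = range(100, 110)
winner, score = best_seed(seeds, toy_quality)
print(winner, round(score, 3))
```

In a real experiment, <code>quality_fn</code> would reset all seeds, rebuild the models, train them, and return the correlation distance, so each candidate seed costs one full training run.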

<a id='section3'></a>
<h2>3. Supervised Classification: Real Data</h2>
<p>
The purpose here is to perform classification using random forests, to predict the risk of diabetes, using the real dataset. We will later run the same classifier on the synthetic data as well, to see how it compares. Usually, the real data is blended with synthetic observations to produce a larger training set called <em>augmented training set</em>. The idea is that by enriching the original training set, you get a more powerful predictive model (classifier in this case), offering more robustness and able to deal with future observations truly different from those in the original training set.  

In [3]:
#--- STEP 1: Base Accuracy for Real Dataset

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
label = ['Outcome']  # Outcome column is the label (binary 0/1) 
X = data[features]
y = data[label] 

# Real data split into train/test dataset for classification with random forest

X_true_train, X_true_test, y_true_train, y_true_test = train_test_split(X, y, test_size=0.30, random_state=42)
clf_true = RandomForestClassifier(n_estimators=100)
clf_true.fit(X_true_train,y_true_train)
y_true_pred=clf_true.predict(X_true_test)
print("Base Accuracy: %5.3f" % (metrics.accuracy_score(y_true_test, y_true_pred)))
print("Base classification report:\n",metrics.classification_report(y_true_test, y_true_pred))




Base Accuracy: 0.754
Base classification report:
               precision    recall  f1-score   support

           0       0.80      0.85      0.82        80
           1       0.64      0.55      0.59        38

    accuracy                           0.75       118
   macro avg       0.72      0.70      0.71       118
weighted avg       0.75      0.75      0.75       118
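
<p>
The augmented training set mentioned above can be built by stacking real and synthetic rows. Below is a minimal sketch on toy frames; the helper name <code>augment</code>, the <code>source</code> column and all values are assumptions made for illustration.

```python
import pandas as pd

def augment(real, synth):
    """Stack real and synthetic rows, tagging each row with its origin."""
    real = real.assign(source="real")
    synth = synth.assign(source="synthetic")
    return pd.concat([real, synth], ignore_index=True)

# Toy frames (made-up values) sharing two of the dataset's column names
real_toy  = pd.DataFrame({"Glucose": [148, 85], "Outcome": [1, 0]})
synth_toy = pd.DataFrame({"Glucose": [120.5, 99.1, 160.2], "Outcome": [0, 0, 1]})

augmented = augment(real_toy, synth_toy)
print(augmented)
```

Keeping a <code>source</code> tag makes it easy to train on the augmented set while evaluating on real observations only.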



<a id='section4'></a>
<h2>4. Setting up the Deep Neural Network</h2>
<p>
This section features all the functions needed to create the architecture, as well as sampling fake (synthetic) data, sampling from the real data, and generating latent data.
    <p>
The DNN has two components: the generator for synthetization, and the discriminator to compare synthetic against real data. The GAN model is the combined DNN. The discriminator is used to check if the synthetic data can be distinguished from the real data after enough training (iterations), statistically speaking. If not, it means that the synthetic data is good enough and we can stop the iterations (referred to as epochs). 
    <p>
    The DNN essentially performs a gradient descent using the <code>adam</code> method to minimize a loss function, whose parameters are the weights attached to the neurons in the different layers. Here 3 layers are used in both models (the generator and the discriminator):
    <code>model.add</code> adds one layer at a time. We use a different loss function for the generator and discriminator. The value of the loss function is one of the elements stored in the object <code>model</code> created by the functions
    <code>define_generator</code> and <code>define_discriminator</code>, and updated by 
        <code>model.train_on_batch</code> in <a href="#section6">section 6</a>. The value for the loss function, decreasing over time (on average) over successive epochs in case of convergence to a local minimum, is stored at each epoch in the variables
        <code>d_loss_real</code>, <code>d_loss_fake</code> and <code>g_loss_fake</code> (see code in <a href="#section6">section 6</a>), respectively for the discriminator (letter d) and the generator (letter g).
        <p>
            An epoch consists of processing the full data, once. More details are available in my book on synthetic data and generative AI, available <a href="https://mltechniques.com/shop/">here</a>.
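
<p>
To make the discriminator loss more concrete, here is a small NumPy sketch of the binary cross-entropy on a toy half-batch. The probabilities are made up, and the clipping constant <code>eps</code> is an assumption; Keras performs this computation internally, possibly with different numerical details.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels in {0,1} and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

# Discriminator outputs (probability of "real") on a toy half-batch of each kind
p_on_real = np.array([0.9, 0.8, 0.7])   # true label 1
p_on_fake = np.array([0.2, 0.4, 0.1])   # true label 0

d_loss_real = binary_crossentropy(np.ones(3), p_on_real)
d_loss_fake = binary_crossentropy(np.zeros(3), p_on_fake)
d_loss = 0.5 * (d_loss_real + d_loss_fake)
print(d_loss_real, d_loss_fake, d_loss)
```

A discriminator that cannot tell real from fake outputs probabilities near 0.5, which gives a loss near log(2), about 0.693, on both halves.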

In [4]:
#--- STEP 2: Generate Synthetic Data 

def generate_latent_points(latent_dim, n_samples):
    x_input = randn(latent_dim * n_samples) 
    x_input = x_input.reshape(n_samples, latent_dim)
    return x_input

def generate_fake_samples(generator, latent_dim, n_samples):
    x_input = generate_latent_points(latent_dim, n_samples) # random N(0,1) data
    X = generator.predict(x_input,verbose=0) 
    y = np.zeros((n_samples, 1))  # class label = 0 for fake data
    return X, y

def generate_real_samples(n):
    X = data.sample(n)   # sample from real data
    y = np.ones((n, 1))  # class label = 1 for real data
    return X, y

def define_generator(latent_dim, n_outputs): 
    model = Sequential()
    model.add(Dense(15, activation='relu',  kernel_initializer='he_uniform', input_dim=latent_dim))
    model.add(Dense(30, activation='relu'))
    model.add(Dense(n_outputs, activation='linear'))
    model.compile(loss='mean_absolute_error', optimizer=adam, metrics=['mean_absolute_error']) # 
    return model

def define_discriminator(n_inputs):
    model = Sequential()
    model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy']) 
    return model

def define_gan(generator, discriminator):
    discriminator.trainable = False # weights must be set to not trainable
    model = Sequential()
    model.add(generator) 
    model.add(discriminator) 
    model.compile(loss='binary_crossentropy', optimizer=adam)  
    return model
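
<p>
As a quick sanity check on the architecture, the number of parameters of each dense layer can be computed by hand (weights plus biases). The helper names below are introduced for illustration; the layer sizes are those used in <code>define_discriminator</code> and <code>define_generator</code>.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def count_params(layer_sizes):
    """Total parameters of a stack of dense layers given [input, hidden..., output] sizes."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

discriminator_sizes = [9, 25, 50, 1]    # n_inputs = 9
generator_sizes     = [10, 15, 30, 9]   # latent_dim = 10, n_outputs = 9

print(count_params(discriminator_sizes))  # 250 + 1300 + 51 = 1601
print(count_params(generator_sizes))      # 165 + 480 + 279 = 924
```

The discriminator total of 1,601 agrees with the model summary printed in <a href="#section6">section 6</a>.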


<a id='section5'></a>
<h2> 5. Assessing the Quality of the Synthetized Data</h2>
<p>
This function also produces the final synthetic data. It compares the correlation matrix (correlations between features/response) observed on the real data, with that computed on the synthetic data. The returned value 
    <code>g_dist</code> is between 0 and 1, with zero being the best fit. Other metrics (besides the correlation distance) are discussed later in <a href="#section8">section 8</a>, using open source libraries. The array <code>data_fake</code> is the output synthetic data.

In [5]:
def gan_distance(data, model, latent_dim, nobs_synth): 

    # generate nobs_synth synthetic rows as X, and return it as data_fake
    # also return correlation distance between data_fake and real data

    latent_points = generate_latent_points(latent_dim, nobs_synth)  
    X = model.predict(latent_points, verbose=0)  
    data_fake = pd.DataFrame(data=X,  columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'])
 
    # convert Outcome field to binary 0/1
    outcome_mean = data_fake.Outcome.mean()
    data_fake['Outcome'] = data_fake['Outcome'] > outcome_mean
    data_fake["Outcome"] = data_fake["Outcome"].astype(int)

    # compute correlation distance
    R_data      = np.corrcoef(data.T) # T for transpose
    R_data_fake = np.corrcoef(data_fake.T)
    g_dist = np.average(abs(R_data-R_data_fake))
    return(g_dist, data_fake) 
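
<p>
The correlation distance can be illustrated in isolation on toy data. The sketch below uses the same formula as <code>gan_distance</code> (average absolute difference between correlation matrices); the function name <code>corr_distance</code> and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def corr_distance(a, b):
    """Average absolute difference between the correlation matrices of a and b.

    a, b: 2-D arrays with observations in rows and features in columns.
    """
    r_a = np.corrcoef(a, rowvar=False)
    r_b = np.corrcoef(b, rowvar=False)
    return float(np.mean(np.abs(r_a - r_b)))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
noise = rng.normal(size=500)

real_toy = np.column_stack([x, x + 0.1 * noise])   # strongly correlated pair
good_toy = np.column_stack([x, x + 0.2 * noise])   # similar structure
bad_toy  = np.column_stack([x, noise])             # correlation lost

print(corr_distance(real_toy, real_toy))   # identical data: 0.0
print(corr_distance(real_toy, good_toy))
print(corr_distance(real_toy, bad_toy))
```

The diagonal entries always contribute zero, so with few features the distance is pulled toward zero; comparisons are most meaningful between synthetizations of the same dataset.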

<a id='section6'></a>
<h2>6. Main Part</h2>
<p>
 The function <code>train</code> is the main function to train both models (discriminator, generator), produce
    the synthetic data <code>data_fake</code>, compute the correlation distance <code>g_dist</code> attached to that synthetic data, and decide when to stop: after <code>n_epochs</code> iterations or (if using the enhanced mode), after producing good enough synthetic data at iteration <code>best_epoch</code>, whichever comes first. The goodness of fit is computed as the distance between the correlation matrices (real versus synthetic data) as discussed in 
    <a href="#section5">section 5</a>,
    with a call to the function <code>gan_distance</code>.
    <p>
        The <code>train</code> function also saves the value of the loss functions obtained at each iteration (<code>d_history</code> for the discriminator, <code>g_history</code> for the generator) to the file <code>history.txt</code>, to produce a plot later if desired. 
        <p>
            <b> Important note</b>: <br>
            In testing mode, set <code>n_epochs</code> to 200 rather than 10,000 (the recommended value for this dataset) when calling the function <code>train</code>. Otherwise you might have to wait quite a bit for the training to complete. The function <code>train</code> displays a message every <code>n_eval</code> epochs to show the progress, showing the current values of the loss functions and the number of epochs completed.            

In [6]:
def train(g_model, d_model, gan_model, latent_dim, mode, n_epochs=10000, n_batch=128, n_eval=50):   
    
    # determine half the size of one batch, for updating the  discriminator
    half_batch = int(n_batch / 2)
    d_history = [] 
    g_history = [] 
    g_dist_history = []
    if mode == 'Enhanced':
        g_dist_min = 999999999.0  

    for epoch in range(0,n_epochs+1): 
                 
        # update discriminator
        x_real, y_real = generate_real_samples(half_batch)  # sample from real data
        x_fake, y_fake = generate_fake_samples(g_model, latent_dim, half_batch)
        d_loss_real, d_real_acc = d_model.train_on_batch(x_real, y_real) 
        d_loss_fake, d_fake_acc = d_model.train_on_batch(x_fake, y_fake)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)

        # update generator via the discriminator error
        x_gan = generate_latent_points(latent_dim, n_batch)  # random input for generator
        y_gan = np.ones((n_batch, 1))                        # label = 1 for fake samples
        g_loss_fake = gan_model.train_on_batch(x_gan, y_gan) 
        d_history.append(d_loss)
        g_history.append(g_loss_fake)

        if mode == 'Enhanced': 
            (g_dist, data_fake) = gan_distance(data, g_model, latent_dim, nobs_synth=400)
            if g_dist < g_dist_min and epoch > int(0.75*n_epochs): 
               g_dist_min = g_dist
               best_data_fake = data_fake
               best_epoch = epoch
        else: 
            g_dist = -1.0
        g_dist_history.append(g_dist)
                
        if epoch % n_eval == 0: # evaluate the model every n_eval epochs
            print('>%d, d1=%.3f, d2=%.3f d=%.3f g=%.3f g_dist=%.3f' % (epoch, d_loss_real, d_loss_fake, d_loss,  g_loss_fake, g_dist))       
            plt.subplot(1, 1, 1)
            plt.plot(d_history, label='d')
            plt.plot(g_history, label='gen')
            # plt.show() # un-comment to see the plots
            plt.close()
       
    # save the loss and distance history; the file is closed automatically
    with open("history.txt", "w") as OUT:
        for k in range(len(d_history)):
            OUT.write("%6.4f\t%6.4f\t%6.4f\n" % (d_history[k], g_history[k], g_dist_history[k]))
    
    if mode == 'Standard':
        # best synth data is assumed to be the one produced at last epoch
        best_epoch = epoch
        (g_dist_min, best_data_fake) = gan_distance(data, g_model, latent_dim, nobs_synth=400)
       
    return(g_model, best_data_fake, g_dist_min, best_epoch) 
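
<p>
The selection rule used in the enhanced mode of <code>train</code> (keep the epoch with the lowest <code>g_dist</code> among those past 75% of <code>n_epochs</code>) can be sketched on its own. The function name <code>select_best_epoch</code> and the toy history are assumptions made for illustration.

```python
def select_best_epoch(g_dist_history, n_epochs, min_fraction=0.75):
    """Return (best_epoch, best_g_dist) among epochs strictly after min_fraction * n_epochs."""
    start = int(min_fraction * n_epochs)
    best_epoch, best_dist = None, float("inf")
    for epoch, g_dist in enumerate(g_dist_history):
        if epoch > start and g_dist < best_dist:
            best_epoch, best_dist = epoch, g_dist
    return best_epoch, best_dist

# Toy history for n_epochs = 8 (epochs 0..8); values are made up
history = [0.90, 0.70, 0.30, 0.55, 0.50, 0.45, 0.48, 0.41, 0.44]
print(select_best_epoch(history, n_epochs=8))   # (7, 0.41)
```

Note that the low value at epoch 2 is ignored: early epochs are excluded on purpose, since a good correlation distance reached before the losses stabilize may be accidental.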

Below is the very core of the program. Besides selecting the mode, initializing the models and saving the synthetic data to the
file <code>diabetes_synthetic.csv</code>, it consists of one instruction: the call to the <code>train</code> function.

In [7]:
#--- main part for building & training model

discriminator = define_discriminator(n_inputs)
discriminator.summary()
generator = define_generator(latent_dim, n_outputs)
generator.summary()
gan_model = define_gan(generator, discriminator)

mode = 'Enhanced'  # options: 'Standard' or 'Enhanced'
model, data_fake, g_dist, best_epoch = train(generator, discriminator, gan_model, latent_dim, mode)
data_fake.to_csv('diabetes_synthetic.csv') 

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 25)                250       
                                                                 
 dense_1 (Dense)             (None, 50)                1300      
                                                                 
 dense_2 (Dense)             (None, 1)                 51        
                                                                 
Total params: 1,601
Trainable params: 1,601
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 15)                165       
                                                                 
 dense_4 (Dense)             (No

>6000, d1=1.001, d2=0.796 d=0.898 g=0.833 g_dist=0.441
>6050, d1=0.773, d2=0.846 d=0.810 g=0.905 g_dist=0.507
>6100, d1=1.284, d2=0.861 d=1.072 g=0.824 g_dist=0.406
>6150, d1=0.968, d2=0.889 d=0.928 g=0.819 g_dist=0.465
>6200, d1=0.759, d2=1.813 d=1.286 g=0.567 g_dist=0.344
>6250, d1=0.687, d2=0.740 d=0.713 g=1.018 g_dist=0.549
>6300, d1=0.378, d2=0.320 d=0.349 g=1.753 g_dist=0.425
>6350, d1=0.691, d2=0.544 d=0.617 g=1.062 g_dist=0.454
>6400, d1=0.922, d2=0.950 d=0.936 g=0.759 g_dist=0.485
>6450, d1=0.912, d2=0.833 d=0.873 g=0.967 g_dist=0.358
>6500, d1=0.599, d2=0.595 d=0.597 g=1.082 g_dist=0.539
>6550, d1=0.558, d2=0.476 d=0.517 g=1.494 g_dist=0.363
>6600, d1=0.577, d2=0.718 d=0.647 g=1.034 g_dist=0.478
>6650, d1=0.611, d2=0.833 d=0.722 g=0.791 g_dist=0.407
>6700, d1=0.567, d2=1.066 d=0.816 g=0.773 g_dist=0.355
>6750, d1=0.720, d2=0.731 d=0.725 g=1.103 g_dist=0.503
>6800, d1=0.648, d2=0.715 d=0.681 g=1.242 g_dist=0.361
>6850, d1=0.407, d2=0.448 d=0.427 g=1.299 g_dist=0.495
>6900, d1=

Now plotting sample historical data extracted from the output file <code>history.txt</code> (values of the loss function), starting with two different seeds. It shows the sensitivity to the seed, and the volatility during successive epochs. The right plot indicates that seed 103 is better than 102: it leads to a lower local minimum of the loss function, in fewer iterations (about 6000), compared to seed 102 which produces worse results (higher minimum, higher volatility) and has not fully stabilized even after 8000 epochs. The loss function values of interest are in <code>g_history</code> (attached to the generator model). 
<p>
<img style="float: left;" src="https://raw.githubusercontent.com/VincentGranville/Notebooks/main/history.png" width=600>
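
<p>
Because the loss values are volatile, a moving average can make the trend easier to read. The sketch below parses the three tab-separated columns written by <code>train</code> (discriminator loss, generator loss, correlation distance) and smooths the generator loss. The five sample rows are borrowed from the training log above (one row every 50 epochs), and the helper <code>moving_average</code> is an assumption made for illustration.

```python
import io
import numpy as np

# Same layout as history.txt: d_loss, g_loss, g_dist (tab-separated)
sample = io.StringIO(
    "0.8980\t0.8330\t0.4410\n"
    "0.8100\t0.9050\t0.5070\n"
    "1.0720\t0.8240\t0.4060\n"
    "0.9280\t0.8190\t0.4650\n"
    "1.2860\t0.5670\t0.3440\n"
)
history = np.loadtxt(sample, delimiter="\t")
g_loss = history[:, 1]

def moving_average(values, window):
    """Simple moving average; output has len(values) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

smooth = moving_average(g_loss, window=3)
print(history.shape, smooth)
```

To work on an actual run, replace <code>sample</code> with the path <code>"history.txt"</code> and pick a larger window.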

<a id='section7'></a>
<h2>7. Supervised Classification: Synthetic Data</h2>
<p>
Here we assess the performance of the classifier on the synthetic data. That is, we compare the performance statistics with those obtained for the classification on the real data 
    in <a href="#section3">section 3</a>. With well synthetized data (as measured via the <code>gan_distance</code> function in <a href="#section5">section 5</a>, or other metrics at the end of this notebook), the classifier does a much better job on the synthetic data (assigning diabetes status yes/no to each observation) than on the real data.
    <p>
        In test mode, when using only 200 epochs, the opposite is true: you are far from a good synthetization, and the classifier on the synthetic data is quite poor.

In [8]:
#--- STEP 3: Classify synthetic data based on Outcome field

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
label = ['Outcome']
X_fake_created = data_fake[features]
y_fake_created = data_fake[label]
X_fake_train, X_fake_test, y_fake_train, y_fake_test = train_test_split(X_fake_created, y_fake_created, test_size=0.30, random_state=42)
clf_fake = RandomForestClassifier(n_estimators=100)
clf_fake.fit(X_fake_train,y_fake_train)
y_fake_pred=clf_fake.predict(X_fake_test)
print("Accuracy of fake data model: %5.3f" % (metrics.accuracy_score(y_fake_test, y_fake_pred)))
print("Classification report of fake data model:\n",metrics.classification_report(y_fake_test, y_fake_pred))



Accuracy of fake data model: 0.942
Classification report of fake data model:
               precision    recall  f1-score   support

           0       0.96      0.94      0.95        70
           1       0.92      0.94      0.93        50

    accuracy                           0.94       120
   macro avg       0.94      0.94      0.94       120
weighted avg       0.94      0.94      0.94       120



<a id='section8'></a>
<h2>8. Final Evaluation of the Synthetized Data</h2>
<p>
    Here I use the <code>table_evaluator</code> Python library to gather several metrics measuring the quality of the synthetized data, in addition to my correlation matrix distance. The "Basic Statistics" metric is an aggregate of multiple distances, such as normalized differences (in absolute value) in the mean, variance, and 25th and 75th percentiles of each feature, between the synthetic and real data. It is equal to 1 if both datasets are identical. Note that the best fit may lead to overfitting, also reproducing the biases present in the real data. Bias removal is best achieved by eliminating features favoring biases, such as race or gender, and replacing them with other or new features. 
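
<p>
To give a feel for this kind of comparison, here is a hand-rolled sketch that compares a few summary statistics of one feature between real and synthetic data. It is not the formula used by <code>table_evaluator</code>; the function names, the scaling by the real standard deviation, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def summary_stats(x):
    """Mean, standard deviation, 25th and 75th percentiles of a 1-D array."""
    return np.array([np.mean(x), np.std(x), np.percentile(x, 25), np.percentile(x, 75)])

def stats_gap(real_col, synth_col):
    """Absolute differences between summary statistics, scaled by the real data's std."""
    scale = np.std(real_col)
    return np.abs(summary_stats(real_col) - summary_stats(synth_col)) / scale

rng = np.random.default_rng(1)
real_bmi  = rng.normal(33.0, 7.0, size=400)    # made-up "real" feature
synth_bmi = rng.normal(34.0, 9.0, size=400)    # made-up synthetic counterpart

print(stats_gap(real_bmi, real_bmi))    # all zeros for identical data
print(stats_gap(real_bmi, synth_bmi))
```

Here zero means a perfect match, whereas the library reports a similarity score where 1 is the best value.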

In [9]:
#--- STEP 4: Evaluate the Quality of Generated Fake Data With g_dist and Table_evaluator

from table_evaluator import load_data, TableEvaluator

table_evaluator = TableEvaluator(data, data_fake)
table_evaluator.evaluate(target_col='Outcome')
# table_evaluator.visual_evaluation() 

print("Avg correlation distance: %5.3f" % (g_dist))
print("Based on epoch number: %5d" % (best_epoch))




Classifier F1-scores and their Jaccard similarities::
                             f1_real  f1_fake  jaccard_similarity
index                                                            
DecisionTreeClassifier_fake   0.4810   0.8481              0.2344
DecisionTreeClassifier_real   0.7089   0.6835              0.5048
LogisticRegression_fake       0.6329   0.9620              0.4234
LogisticRegression_real       0.7595   0.5949              0.4107
MLPClassifier_fake            0.6456   0.8861              0.4364
MLPClassifier_real            0.6709   0.6329              0.4906
RandomForestClassifier_fake   0.6709   0.8608              0.4906
RandomForestClassifier_real   0.7975   0.5949              0.5048

Privacy results:
                                         result
Duplicate rows between sets (real/fake)  (0, 0)
nearest neighbor mean                    1.6491
nearest neighbor std                     0.6970

Miscellaneous results:
                                  Result
Column Cor



<a id='section9'></a>
<h2>9. Conclusions</h2>
<p>
The copula method outperforms GAN on the type of data investigated here, if the criterion to measure quality is how well the correlation structure is reproduced in the synthetic data. See the picture below, comparing the two methods on the diabetes dataset, in terms of correlation matrix. The synth2 matrix corresponds to the GAN-generated synthetic data. The full analysis is in my spreadsheet <code>diabetes.synthetic.xlsx</code> on GitHub,
    <a href="https://github.com/VincentGranville/Main">here</a>.
    <p>
    <img  src="https://raw.githubusercontent.com/VincentGranville/Notebooks/main/GAN_compare_diabetes.png" width=600>
    <p>
        
        
However, when structures other than linear are present in the real dataset, GAN can do a better job. For instance, in my artificial circle dataset (concentric circles, zero correlation), the copula method correctly generates non-correlated data, but fails to capture the circular structure. GAN does a much better job, as shown in the picture below. The full details are in my spreadsheet <code>Circle8D.xlsx</code>, in the same GitHub folder,
        <a href="https://github.com/VincentGranville/Main">here</a>.
    <p>
    <img  src="https://raw.githubusercontent.com/VincentGranville/Notebooks/main/gan_circle.png" width=600>
    <p>
        
The left, middle and right plots correspond respectively to the real data, the copula synthetization, and the GAN synthetization. Actually, it is a 2-dimensional projection of the full 8-dimensional dataset.
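
<p>
The point that zero correlation does not imply absence of structure can be checked numerically. The sketch below uses a single unit circle in two dimensions (a simplification of the 8-dimensional dataset, assumed here for illustration). Shuffling one coordinate is only loosely analogous to what a correlation-based method produces: it preserves the marginals and the near-zero correlation but loses the circle.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
x, y = np.cos(theta), np.sin(theta)          # points on the unit circle

corr_xy = np.corrcoef(x, y)[0, 1]            # close to zero
radius = np.sqrt(x**2 + y**2)                # equal to 1 up to rounding

# Shuffling y keeps each marginal distribution and the (near-zero) correlation,
# but destroys the circular dependence between x and y.
y_shuffled = rng.permutation(y)
radius_shuffled = np.sqrt(x**2 + y_shuffled**2)

print(round(corr_xy, 3), radius.std(), radius_shuffled.std())
```

The spread of the radius, near zero for the original points and large after shuffling, is a simple example of a quality metric that captures what the correlation distance misses.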

I discussed in <a href="#section2">section 2</a> different options to train GAN much faster. 
Another option is to use a fast gradient boosting algorithm such as LightGBM. This is implemented in the TabGAN library.
Finally, the SDV Python library offers a comprehensive set of synthetization methods, including for tabular data, as well as a variety of real-life datasets. See my sample code 
<code>GAN_copula_SDV.py</code> in the same GitHub folder. SDV also includes mechanisms to mask private information, using 
functions from the Faker Python library. It also handles missing data.
<p>
    <b> Exercise 1</b><br>
Blend the synthetic diabetes data with the real data to produce a larger (augmented) training set. Use it to classify the real data into two categories: diabetes/no-diabetes. Assess the impact of adding synthetic data on the predictions, using cross-validation techniques.    