# Synthetic Data Generation using SDV

The Synthetic Data Vault (SDV) is a Python library designed for creating tabular synthetic data. The SDV uses a variety of machine learning algorithms to learn patterns from your real data and emulate them in synthetic data.

In [1]:
%pip install sdv

Collecting sdv
  Downloading sdv-1.2.1-py2.py3-none-any.whl (123 kB)
Collecting Faker<15,>=10 (from sdv)
  Downloading Faker-14.2.1-py3-none-any.whl (1.6 MB)
Collecting copulas<0.10,>=0.9.0 (from sdv)
  Downloading copulas-0.9.0-py2.py3-none-any.whl (54 kB)
Collecting ctgan<0.8,>=0.7.2 (from sdv)
  Downloading ctgan-0.7.3-py2.py3-none-any.whl (26 kB)
Collecting deepecho<0.5,>=0.4.1 (from sdv)
  Downloading deepecho-0.4.1-py2.py3-none-any.whl (28 kB)
Collecting rdt<2,>=1.5.0 (from sdv)
  Downloading rdt-1.6.0-py2.py3-none-any.whl (67 kB)

## 1. Data Preparation

### Loading data

In [1]:
import sdv
import pandas as pd
import numpy as np

df = pd.read_csv("/content/drive/MyDrive/ADNOC/with_only_context.csv")
df.head()

| | ID | context |
|---|---|---|
| 0 | 1 | On Day 01/01/2021 Well Name is FORGE 16A [78... |
| 1 | 2 | On Day 01/02/2021 Well Name is FORGE 16A [78... |
| 2 | 3 | On Day 01/03/2021 Well Name is FORGE 16A [78... |
| 3 | 4 | On Day 10/22/2020 Well Name is FORGE 16A [78... |
| 4 | 5 | On Day 10/23/2020 Well Name is FORGE 16A [78... |


### Metadata Description

The SDV requires that you provide a description of your data, also known as **metadata**.

The metadata describes the types of data that are available in every column.

In [3]:
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)

# The SDV can auto-detect parts of the metadata by inspecting the actual data.
print('Auto detected data:\n')
metadata

Auto detected data:



{
    "columns": {
        "ID": {
            "sdtype": "numerical"
        },
        "context": {
            "sdtype": "categorical"
        }
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1"
}
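
The detection step above can be pictured with a small, self-contained sketch. It is a toy heuristic, not the SDV's actual detection logic: `guess_sdtype` and the sample rows are hypothetical, and the real library distinguishes more sdtypes than these two. It does suggest why a free-text column such as `context` ends up labeled `categorical`, which is worth reviewing before fitting.

```python
from numbers import Number


def guess_sdtype(values):
    """Toy heuristic (NOT the SDV algorithm): all numbers -> 'numerical', else 'categorical'."""
    non_null = [v for v in values if v is not None]
    is_numeric = non_null and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in non_null
    )
    return "numerical" if is_numeric else "categorical"


# Hypothetical rows that mirror the two columns of the table (placeholder text).
rows = [
    {"ID": 1, "context": "free-text report A"},
    {"ID": 2, "context": "free-text report B"},
    {"ID": 3, "context": "free-text report C"},
]

# Pivot the rows into columns, then guess one sdtype per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}
guessed = {name: {"sdtype": guess_sdtype(vals)} for name, vals in columns.items()}
print(guessed)
```

With this heuristic the integer `ID` column is guessed as numerical, which matches the auto-detected result above and is why the notebook corrects it to `id` in the next step.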

In [None]:
# Multi-table variant (a MultiTableMetadata method); not needed for a single table.
# metadata.detect_table_from_dataframe(
#     table_name='data',
#     data=df
# )

### Updating the data types

In [4]:
# Example for a datetime column (disabled: this table has no 'RPT DATE' column).
# A format string tells the SDV how to parse the date and time components.
# metadata.update_column(
#     column_name='RPT DATE',
#     sdtype='datetime',
#     datetime_format=" %m/%d/%Y"
# )

# ID Columns: These columns do not have statistical value, as they are only used to identify rows.
metadata.update_column(
    column_name='ID',
    sdtype='id'
)

### Setting the primary key

**Primary Keys**: A primary key identifies every row of the table. Its values must be unique across the entire table, and other tables may refer to them.
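
Before declaring a primary key, it can help to confirm that the column really is unique. The sketch below uses plain Python lists and a hypothetical helper, `primary_key_problems`; treating missing values as a problem is an extra assumption on top of the uniqueness rule stated above.

```python
from collections import Counter


def primary_key_problems(ids):
    """Return the reasons why `ids` could not serve as a primary key (empty list if none)."""
    problems = []
    # Assumption: a key with missing values cannot identify every row.
    if any(v is None for v in ids):
        problems.append("contains missing values")
    # Any value that occurs more than once violates uniqueness.
    duplicates = sorted(v for v, n in Counter(ids).items() if n > 1 and v is not None)
    if duplicates:
        problems.append(f"duplicate values: {duplicates}")
    return problems


good_ids = [1, 2, 3, 4, 5]   # hypothetical ID column
bad_ids = [1, 2, 2, None]    # hypothetical ID column with a duplicate and a gap

print(primary_key_problems(good_ids))
print(primary_key_problems(bad_ids))
```

On the DataFrame itself, `df['ID'].is_unique` and `df['ID'].notna().all()` give the same two answers with pandas.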

In [5]:
metadata.set_primary_key(
    column_name='ID'
)

Now the metadata should be accurate. Let's validate it: `validate()` raises an error if the metadata is inconsistent and otherwise completes silently.

In [8]:
metadata.validate()
# metadata.save_to_json('metadata.json')
# metadata = SingleTableMetadata.load_from_json('metadata.json')

In [9]:
metadata

{
    "primary_key": "ID",
    "columns": {
        "ID": {
            "sdtype": "id"
        },
        "context": {
            "sdtype": "categorical"
        }
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1"
}
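
The printed metadata is plain JSON, so its structure can be inspected with the standard library alone. The sketch below parses the same content and runs two illustrative consistency checks. `check_metadata` is a hypothetical helper, and the SDV's own `validate()` may check more, or different, conditions.

```python
import json

# The metadata shown above, as it would appear in a saved JSON file.
metadata_json = """
{
    "primary_key": "ID",
    "columns": {
        "ID": {"sdtype": "id"},
        "context": {"sdtype": "categorical"}
    },
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1"
}
"""


def check_metadata(meta):
    """Illustrative structural checks only; not a replacement for metadata.validate()."""
    errors = []
    columns = meta.get("columns", {})
    # Every column entry should declare an sdtype.
    for name, spec in columns.items():
        if "sdtype" not in spec:
            errors.append(f"column {name!r} has no sdtype")
    # A declared primary key should refer to a described column.
    key = meta.get("primary_key")
    if key is not None and key not in columns:
        errors.append(f"primary key {key!r} is not a known column")
    return errors


meta = json.loads(metadata_json)
print(check_metadata(meta))

# A deliberately broken copy: the key points at a column that is not described.
broken = dict(meta, primary_key="row_id")
print(check_metadata(broken))
```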

## 2. Creating a Synthesizer

An SDV **synthesizer** is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data.

In [10]:
from sdv.lite import SingleTablePreset
from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
# synthesizer = SingleTablePreset(
#     metadata,
#     name='FAST_ML'
# )

synthesizer.fit(
    data=df
)

Use the `sample` method and pass the number of rows to synthesize.

In [12]:
synthetic_data = synthesizer.sample(
    num_rows=100
)

synthetic_data.head()

| | ID | context |
|---|---|---|
| 0 | 0 | On Day 11/01/2020 Well Name is FORGE 16A [78... |
| 1 | 1 | On Day 12/16/2020 Well Name is FORGE 16A [78... |
| 2 | 2 | On Day 11/09/2020 Well Name is FORGE 16A [78... |
| 3 | 3 | On Day 11/27/2020 Well Name is FORGE 16A [78... |
| 4 | 4 | On Day 11/03/2020 Well Name is FORGE 16A [78... |
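
Because `context` is described as `categorical`, the synthetic rows are likely to reuse text values seen during fitting rather than contain newly written text. The toy class below illustrates that fit/sample idea with simple frequency resampling. It is not the Gaussian copula model, and `ToyCategoricalSynthesizer` and the placeholder strings are hypothetical.

```python
import random
from collections import Counter


class ToyCategoricalSynthesizer:
    """Toy fit/sample pair: learns category frequencies only (not a Gaussian copula)."""

    def fit(self, values):
        # "Learning" here is just counting how often each category occurs.
        counts = Counter(values)
        self.categories = list(counts)
        self.weights = [counts[c] for c in self.categories]
        return self

    def sample(self, num_rows, seed=None):
        # Draw categories in proportion to their observed frequency.
        rng = random.Random(seed)
        return rng.choices(self.categories, weights=self.weights, k=num_rows)


real_context = ["report A", "report B", "report C", "report A"]  # placeholder text
toy = ToyCategoricalSynthesizer().fit(real_context)
synthetic_context = toy.sample(num_rows=100, seed=0)

# Every sampled value comes from the set of values seen during fit.
print(set(synthetic_context) <= set(real_context))
```

In the notebook, `synthetic_data['context'].isin(df['context']).all()` would test the same property on the real output. If every synthetic value already appears in the source file, the result is a resampling of existing records, not new text.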


In [None]:
from google.colab import files

synthetic_data.to_csv('synthetic_data_context.csv', encoding='utf-8-sig')
files.download('synthetic_data_context.csv')
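
The `utf-8-sig` encoding used in the export writes a byte order mark (BOM) at the start of the file, which helps some spreadsheet tools recognize the file as UTF-8. The standard-library sketch below shows the effect on a small in-memory CSV. The rows are hypothetical placeholders.

```python
import codecs
import csv
import io

# Hypothetical rows standing in for the synthetic table.
rows = [{"ID": 0, "context": "report A"}, {"ID": 1, "context": "report B"}]

# Build the CSV text in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["ID", "context"])
writer.writeheader()
writer.writerows(rows)

# Encoding with 'utf-8-sig' prepends the three-byte UTF-8 BOM.
payload = buffer.getvalue().encode("utf-8-sig")
print(payload.startswith(codecs.BOM_UTF8))

# Decoding with 'utf-8-sig' strips the BOM again; plain 'utf-8' keeps it as '\ufeff'.
decoded = payload.decode("utf-8-sig")
print(decoded.splitlines()[0])
```

`DataFrame.to_csv` also writes the row index as an extra unnamed first column by default. Passing `index=False` omits it if the exported file should contain only `ID` and `context`.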
