# Generating Synthetic Data in `R`

## 1. What is synthetic data?

According to the US Census Bureau, “Synthetic data are microdata records created to improve data utility while preventing disclosure of confidential respondent information. Synthetic data is created by statistically modelling original data and then using those models to generate new data values that reproduce the original data’s statistical properties. Users are unable to identify the information of the entities that provided the original data.”

There are many situations in ONS where the generation of synthetic data could be used to improve outputs. These are listed in [Synthetic data at ONS](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot) and they include:

* provision of microdata to users
* testing systems
* developing systems or methods in non-secure environments
* obtaining non-disclosive data from data suppliers
* teaching – a useful way to promote the use of ONS data sources

**Methods for Generating Synthetic Data in `R`**

Some of the examples here were adapted from Iain Dove's presentation [Synthetic data, a useful tool for ONS](https://officenationalstatistics.sharepoint.com/sites/ONS_coffee_and_coding/Recording%20of%20Monthly%20Presentations/Forms/AllItems.aspx?ct=1741603363150&or=Teams%2DHL&ga=1&LOF=1&id=%2Fsites%2FONS%5Fcoffee%5Fand%5Fcoding%2FRecording%20of%20Monthly%20Presentations%2FSythetic%20Data%2C%20a%20tool%20for%20ONS%2FSynthetic%2Ddata%2C%2Da%2Duseful%2Dtool%2Dfor%2DONS%2Ehtml&parent=%2Fsites%2FONS%5Fcoffee%5Fand%5Fcoding%2FRecording%20of%20Monthly%20Presentations%2FSythetic%20Data%2C%20a%20tool%20for%20ONS). We will look at two methods for generating synthetic data:

1. Writing your own code
2. Using the `synthpop` R package

## 2. Generating synthetic data in `R` from scratch

### 2.1 Numerical Variables



In this section, we will generate synthetic numerical variables in `R` by drawing from the following distributions:

1. Uniform Distribution
2. Normal Distribution
3. Poisson Distribution
4. Binomial Distribution

The `R` code for each distribution is shown below.

**1. Uniform Distribution**: The uniform distribution generates data where each value within a specified range is equally likely. This is useful for simulating data with a consistent spread. In `R`, `runif()` draws values from this distribution.


```r
# Create data from scratch
# Creates a uniform distribution of data

uniform_dist <- runif(n = 1000, # number of observations/rows
                      min = 0,  # lower limit
                      max = 1)  # upper limit

# Example: Simulating random probabilities or proportions.
# To explore the generated data:
# hist(uniform_dist, main="Uniform Distribution Example", xlab="Value")
# summary(uniform_dist)
```
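The `min` and `max` arguments can be set to any range. As a minimal sketch, the example below draws values over a hypothetical age range of 18 to 65 (the range, the seed and the object names are illustrative assumptions). It also shows `sample()` as one way to obtain whole-number values that are equally likely, since `runif()` returns continuous values.

```r
# Fix the seed so the draws are reproducible (1 is an arbitrary choice)
set.seed(1)

# Continuous values between 18 and 65 (hypothetical age range)
age_continuous <- runif(n = 1000, min = 18, max = 65)

# Whole-number ages from 18 to 65, each equally likely
age_integer <- sample(18:65, size = 1000, replace = TRUE)

# Both sets of draws stay inside the requested range
range(age_continuous)
range(age_integer)
```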

**2. Normal Distribution**: The normal distribution, or Gaussian distribution, is useful for simulating data that clusters around a mean. This is common in many natural phenomena. In `R`, `rnorm()` draws values around a specified mean with a given standard deviation.

```r
# Create data from scratch
# Creates a normal distribution of data

normal_dist <- rnorm(n = 1000,  # number of observations/rows
                     mean = 50, # average
                     sd = 4)    # standard deviation

# Example: Simulating test scores or measurements.
# To explore the generated data:
# hist(normal_dist, main="Normal Distribution Example", xlab="Value")
# summary(normal_dist)
```

**3. Poisson Distribution**: The Poisson distribution is used for count data, where you are counting the number of events in a fixed interval of time or space. In `R`, `rpois()` draws counts from this distribution.

```r
# Create data from scratch
# Creates a Poisson distribution of data

poisson_dist <- rpois(n = 1000,   # number of observations/rows
                      lambda = 4) # mean number of events per interval (must be non-negative)

# Example: Simulating the number of visits to a government website per day.
# To explore the generated data:
# hist(poisson_dist, main="Poisson Distribution Example", xlab="Number of Events")
# summary(poisson_dist)
```

**4. Binomial Distribution**: The binomial distribution is useful for simulating the number of successes in a fixed number of trials, each with the same probability of success. In `R`, `rbinom()` draws values from this distribution.

```r
# Create data from scratch
# Creates a binomial distribution of data

binomial_dist <- rbinom(n = 1000,  # number of observations/rows
                        size = 20, # number of trials
                        prob = 0.2) # probability of success of each trial

# Example: Simulating the number of successful outcomes in a series of experiments.
# To explore the generated data:
# hist(binomial_dist, main="Binomial Distribution Example", xlab="Number of Successes")
# summary(binomial_dist)
```
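As a quick sanity check, the sample mean of each variable can be compared with the theoretical mean of its distribution. The sketch below regenerates the four variables with the parameters used above under a fixed seed; the seed value `2024` and the object names `draws`, `theoretical_mean` and `sample_mean` are illustrative choices rather than part of the original examples.

```r
# Fix the seed so the draws are reproducible (2024 is an arbitrary choice)
set.seed(2024)

n <- 1000

# Regenerate the four variables with the parameters used above
draws <- list(
  uniform  = runif(n, min = 0, max = 1),
  normal   = rnorm(n, mean = 50, sd = 4),
  poisson  = rpois(n, lambda = 4),
  binomial = rbinom(n, size = 20, prob = 0.2)
)

# Theoretical means: (min + max) / 2, mean, lambda, size * prob
theoretical_mean <- c(uniform = 0.5, normal = 50, poisson = 4, binomial = 4)

# Sample means, in the same order as the theoretical values
sample_mean <- sapply(draws, mean)

# Side-by-side comparison; the differences should be small for n = 1000
round(cbind(theoretical_mean,
            sample_mean,
            difference = sample_mean - theoretical_mean), 3)
```

If a difference looks large, the usual suspects are a mistyped parameter or too few observations.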

### 2.2 Character/factor variables

In this section, we will generate synthetic character (factor) variables using random sampling in `R`. These variables are useful for simulating categorical data, such as demographic information.

**1. Random sampling from a vector**: Random sampling from a vector allows you to generate categorical data by randomly selecting elements from a specified set. This is useful for creating variables like gender, where each observation is randomly assigned a category.

```r
# Set seed for reproducibility
set.seed(123)

# Random sampling from a vector

gender <- sample(x = c("M", "F"),  # elements to choose from
                 size = 1000,      # number of observations
                 replace = TRUE)   # sampling with replacement

# Example: Simulating gender distribution in a population.
# To explore the generated data:
# table(gender)
# prop.table(table(gender))
```

**2. Weighted sampling from a vector**: Weighted sampling allows you to generate categorical data with specified probabilities for each category. This is useful for creating variables like marital status, where each category has a different likelihood of being selected.



```r
# Set seed for reproducibility
set.seed(456)

# Weighted sampling from a vector, using a vector of probabilities

marriage_status <- sample(x = c("Single", "Married", "Divorced", "Widowed"),  
                          size = 1000,    
                          replace = TRUE,
                          prob = c(0.35, 0.50, 0.10, 0.05))

# Example: Simulating marital status distribution in a population.
# To explore the generated data:
# table(marriage_status)
# prop.table(table(marriage_status))
```

The two code snippets above work as follows:

1. Setting a seed (`set.seed`):

    * `set.seed(123)`: Sets the seed for random number generation. The number `123` can be any integer. Using the same seed produces the same sequence of random numbers each time the code is run.

2. Random sampling (`sample`):

    * `x`: The vector of elements to choose from.
    * `size`: The number of observations to generate.
    * `replace`: Whether sampling is with replacement (`TRUE`) or without replacement (`FALSE`).

3. Weighted sampling (`sample` with `prob`):

    * `prob`: A vector of weights giving the relative likelihood of each element in `x` being selected. The weights must be non-negative and not all zero; `sample()` rescales them internally, so they do not strictly have to sum to `1`, although supplying probabilities that do keeps the intent clear.
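The points above can be checked directly. The following sketch repeats the weighted draw twice with the same seed to show that the results are identical. It then passes the weights as counts (`35, 50, 10, 5`) to show that `sample()` accepts weights that do not sum to 1. The object names and the larger sample size of 10,000 are illustrative assumptions.

```r
status_levels <- c("Single", "Married", "Divorced", "Widowed")

# Same seed and same call give identical draws
set.seed(456)
draw_1 <- sample(status_levels, size = 1000, replace = TRUE,
                 prob = c(0.35, 0.50, 0.10, 0.05))
set.seed(456)
draw_2 <- sample(status_levels, size = 1000, replace = TRUE,
                 prob = c(0.35, 0.50, 0.10, 0.05))
identical(draw_1, draw_2)

# Weights supplied as counts rather than probabilities
set.seed(456)
draw_counts <- sample(status_levels, size = 10000, replace = TRUE,
                      prob = c(35, 50, 10, 5))

# Observed proportions should sit close to 0.35, 0.50, 0.10 and 0.05
round(prop.table(table(draw_counts)), 2)
```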

### 2.3 Combining data into a synthetic dataset


Now, we will combine the numerical and categorical data into a single synthetic dataset using the `data.table` package:

```r
# Load the data.table package (first install the package using: install.packages("data.table"))
library(data.table)

# Combine the data into a synthetic dataset
synthetic_data1 <- data.table(uniform_dist,
                              normal_dist,
                              poisson_dist,
                              binomial_dist,
                              gender,
                              marriage_status)

# Explore the synthetic dataset
print(head(synthetic_data1))
summary(synthetic_data1)
str(synthetic_data1)
```

**Converting Character Variables to Factors**  
To ensure that the categorical variables are treated appropriately in analyses, we should convert them to factor variables:

```r
# Change gender and marriage_status to factor variables
synthetic_data1$gender <- as.factor(synthetic_data1$gender)
synthetic_data1$marriage_status <- as.factor(synthetic_data1$marriage_status)

# Check the structure of the dataset again
str(synthetic_data1)
```

This combined synthetic dataset can now be used for analyses, simulations, and testing. It contains both numerical and categorical variables, but each variable was generated independently, so the dataset does not reproduce any relationships between variables.

The snippets in this section provide a starting point for generating synthetic data from scratch with reproducible results. You can adjust the parameters to fit the characteristics of the population you are modelling. For more detailed information on the `sample` function, refer to the `R` documentation (`?sample`).
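The steps in this section can also be wrapped in a single function so that the whole dataset is regenerated from one seed. The sketch below is one possible way to do this. The function name `make_synthetic()` and its arguments are hypothetical, and it uses a base `R` `data.frame` rather than `data.table` so that it runs without extra packages.

```r
# Hypothetical helper: builds the synthetic dataset from a single seed
make_synthetic <- function(n = 1000, seed = 123) {
  set.seed(seed)
  data.frame(
    uniform_dist  = runif(n, min = 0, max = 1),
    normal_dist   = rnorm(n, mean = 50, sd = 4),
    poisson_dist  = rpois(n, lambda = 4),
    binomial_dist = rbinom(n, size = 20, prob = 0.2),
    # Categorical variables are created as factors straight away
    gender = factor(sample(c("M", "F"), size = n, replace = TRUE)),
    marriage_status = factor(sample(
      c("Single", "Married", "Divorced", "Widowed"),
      size = n, replace = TRUE,
      prob = c(0.35, 0.50, 0.10, 0.05)
    ))
  )
}

# Two calls with the same seed return the same dataset
synthetic_a <- make_synthetic(n = 500, seed = 42)
synthetic_b <- make_synthetic(n = 500, seed = 42)
identical(synthetic_a, synthetic_b)

# Inspect the structure of the result
str(synthetic_a)
```

Keeping the seed as an argument makes it easy to produce several different but reproducible datasets for testing.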

## 3. `synthpop` for Generating Synthetic Data in `R`

In this guidance, we will work with the `R` package `synthpop` [(Nowok, Raab, and Dibben 2016)](https://www.synthpop.org.uk/get-started.html), which is one of the most advanced and dedicated packages in `R` for creating synthetic data.

### 3.1 Install `synthpop`

```r
install.packages("synthpop")
```



This will install `synthpop` and its dependencies from [ONS Artifactory](https://onsart-01/ui/login/).


### 3.2 Start `synthpop`

To start using the package, load it using the `library()` function:



```r
library("synthpop")
```



To get a list of all `synthpop` functions, use:



```r
help(package = synthpop)
```



To access a help file for a specific function, e.g., the main `synthpop` function `syn()`, type:



```r
?syn
```



### 3.3 First synthesis

You will be working with your own data, but for practice, you can use the sample data `SD2011` provided with the `synthpop` package.

**Read the Data**  
Read the data you want to synthesise into `R`. You can use the `synthpop` function `read.obs()` to read data from other formats.

**Examine Your Data**  
Start with a modest number of variables (8-12) to understand `synthpop`. If your data have more variables, make a selection. The package is intended for large datasets (at least `500` observations).

Use the `codebook.syn()` function to examine the features relevant to synthesising; step 2 of the sample synthesis below shows it applied to `SD2011`.



### 3.3.1 Sample synthesis using sample data `SD2011` provided with the `synthpop` package

In this section, we will guide you through a sample synthesis using the `synthpop` package in `R`. This example uses the `SD2011` dataset provided with the package and demonstrates how to generate synthetic data from a smaller subset of variables.

#### Step-by-Step Guide

1. **Clean Workspace and Load Package**



```r
# Clean out workspace
rm(list = ls())

# Load the synthpop package
library(synthpop)
```



2. **Explore the `SD2011` Dataset**



```r
# Get information about the SD2011 dataset
help(SD2011)

# Get the size of the data frame
dim(SD2011)

# Get summary information about variables
codebook.syn(SD2011)$tab
```



3. **Select a Subset of Variables**



```r
# Select a subset of variables from SD2011
mydata <- SD2011[, c(1, 3, 6, 8, 11, 17, 18, 19, 20, 10)]

# Get summary information about the selected variables
codebook.syn(mydata)$tab
```



4. **Handle Missing Values**



```r
# Check for negative income values, which may be codes for missing data
table(mydata$income[mydata$income < 0], useNA = "ifany")
```
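Numeric codes for missing data matter because they distort summaries if they are treated as real values. The sketch below illustrates this with a small made-up income vector in which `-8` stands for a missing value, as in the synthesis step that follows. The values are invented for illustration and are not taken from `SD2011`.

```r
# Toy income vector; -8 is used here as a missing-value code
income <- c(1200, -8, 800, 1500, -8, 2000)

# Treating -8 as a real value pulls the mean down
mean(income)                      # 914

# Recode the missing-value code to NA before summarising
income_clean <- replace(income, income == -8, NA)
mean(income_clean, na.rm = TRUE)  # 1375

# Count how many observations carried the code
sum(income == -8)                 # 2
```

The recoding here is only to show the effect; in the synthesis step the code is declared through the `cont.na` argument instead.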



5. **Synthesize Data**



```r
# Synthesize data, handling -8 as a missing value for income
mysyn <- syn(mydata, cont.na = list(income = -8))

# Get a summary of the synthetic data
summary(mysyn)

# Compare the synthetic data with the original data
compare(mysyn, mydata, stat = "counts")
```



6. **Export Synthetic Data**



```r
# Export synthetic data to SPSS format
# (filename is given without an extension; write.syn() adds it)
write.syn(mysyn, filename = "mysyn_SD2011", filetype = "SPSS")

# Export synthetic data to CSV format
write.syn(mysyn, filename = "mysyn_SD2011", filetype = "csv")
```



7. **Explore Synthetic Data**

After generating synthetic data using the `synthpop` package, it is important to explore and understand the structure and components of the synthetic data object. This section provides guidance on how to explore the synthetic data object and perform additional comparisons.

**Exploring the synthetic data object**  

**I. Retrieve Component Names**

```r
names(mysyn)
```

* This command retrieves the names of the components within the `mysyn` object, helping you understand its structure.
* Running `names(mysyn)` returns the component names as a character vector; some of these components are explained below.

**II. Explanation of some of the components**
1. `syn`: The synthetic data set.
2. `method`: The methods used for synthesising each variable.
3. `predictor.matrix`: The matrix indicating which variables were used as predictors for each synthesised variable.
4. `visit.sequence`: The order in which the variables were synthesised.
5. `cont.na`: Information about how missing values were handled.
6. `rules`: Any rules applied during synthesis.
7. `rvalues`: Values used for rules.
8. `m`: Number of synthetic datasets created (if multiple).
9. `proper`: Indicates if proper synthesis was used.
10. `seed`: The seed used for random number generation.
11. `call`: The original call to the `syn()` function.
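12. `obs.vars`: The names of the variables in the original (observed) data. The original data themselves are not stored in the object, which is why they are passed separately to `compare()`.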

**III. Inspect key components**

```r
mysyn$method
mysyn$predictor.matrix
mysyn$visit.sequence
mysyn$cont.na
mysyn$seed
```
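The `seed` component is useful because a synthesis can be repeated exactly by passing the same value to the `seed` argument of `syn()`. The sketch below assumes `synthpop` is installed and skips itself otherwise. The three-variable subset, the seed value `2024` and the object names are illustrative choices, and `print.flag = FALSE` only suppresses the progress output.

```r
# Runs only if synthpop is installed
if (requireNamespace("synthpop", quietly = TRUE)) {
  library(synthpop)

  # Small subset of SD2011 to keep the run quick
  small_data <- SD2011[, c("sex", "agegr", "income")]

  # Two runs with the same (arbitrary) seed
  syn_a <- syn(small_data, cont.na = list(income = -8),
               seed = 2024, print.flag = FALSE)
  syn_b <- syn(small_data, cont.na = list(income = -8),
               seed = 2024, print.flag = FALSE)

  # The two synthetic data frames should match exactly
  print(identical(syn_a$syn, syn_b$syn))

  # The seed used is stored in the object for later reference
  print(syn_a$seed)
}
```

Recording the seed alongside the exported files makes it possible to regenerate the same synthetic dataset later.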

**IV. Additional comparisons**  
To further validate the synthetic data, you can perform additional comparisons between the synthetic and original data using the `multi.compare()` function.

```r
# Additional comparisons
multi.compare(mysyn, mydata, var = "marital", by = "sex")
multi.compare(mysyn, mydata, var = "income", by = "agegr")
multi.compare(mysyn, mydata, var = "income", by = "edu", cont.type = "boxplot")
```

* `multi.compare(mysyn, mydata, var = "marital", by = "sex")`: Compares the distribution of the `marital` variable by `sex` between the synthetic and original data.
* `multi.compare(mysyn, mydata, var = "income", by = "agegr")`: Compares the distribution of the `income` variable by age group (`agegr`) between the synthetic and original data.
* `multi.compare(mysyn, mydata, var = "income", by = "edu", cont.type = "boxplot")`: Compares the distribution of the `income` variable by education level (`edu`) between the synthetic and original data, using boxplots.

These comparisons help you assess how well the synthetic data reflect the structure and relationships present in the original data, and therefore how suitable they are for analysis in place of the original data.
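To make the idea behind these comparisons concrete, the sketch below compares category percentages between an "observed" and a "synthetic" variable in base `R`. It is not how `synthpop` implements `compare()` or `multi.compare()`. The data are made up, and the "synthetic" variable is simply a resample of the observed one, used as a stand-in for a real synthesis.

```r
# Fix the seed so the toy data are reproducible (101 is arbitrary)
set.seed(101)

marital_levels <- c("Single", "Married", "Divorced", "Widowed")

# Toy "observed" variable
observed <- factor(sample(marital_levels, size = 2000, replace = TRUE,
                          prob = c(0.35, 0.50, 0.10, 0.05)),
                   levels = marital_levels)

# Stand-in "synthetic" variable: a resample of the observed values
synthetic <- factor(sample(observed, size = length(observed), replace = TRUE),
                    levels = marital_levels)

# Percentage in each category, one column per dataset
comparison <- sapply(list(observed = observed, synthetic = synthetic),
                     function(x) prop.table(table(x)) * 100)
round(comparison, 1)

# Largest gap between the two datasets, in percentage points
max(abs(comparison[, "observed"] - comparison[, "synthetic"]))
```

Small gaps suggest the marginal distribution has been preserved; checking relationships between variables needs grouped comparisons like the `multi.compare()` calls above.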

The figure below shows some of the plots produced by the `multi.compare()` function.

![Observed versus synthetic data](figures/Rplot_Observed_versus_synthetic_data.png)



### Explanation

1. **Clean Workspace and Load Package**:
    - `rm(list = ls())`: Clears the workspace.
    - `library(synthpop)`: Loads the `synthpop` package.

2. **Explore the `SD2011` Dataset**:
    - `help(SD2011)`: Provides information about the `SD2011` dataset.
    - `dim(SD2011)`: Displays the dimensions of the dataset.
    - `codebook.syn(SD2011)$tab`: Generates a codebook for the dataset.

3. **Select a Subset of Variables**:
    - Selects specific columns from `SD2011` to create a smaller dataset `mydata`.
    - `codebook.syn(mydata)$tab`: Generates a codebook for the selected variables.

4. **Handle Missing Values**:
    - `table(mydata$income[mydata$income < 0], useNA = "ifany")`: Checks for negative income values.

5. **Synthesize Data**:
    - `syn(mydata, cont.na = list(income = -8))`: Synthesizes the data, treating -8 as a missing value for income.
    - `summary(mysyn)`: Summarizes the synthetic data.
    - `compare(mysyn, mydata, stat = "counts")`: Compares the synthetic data with the original data.

6. **Export Synthetic Data**:
    - `write.syn()` with `filetype = "SPSS"` or `filetype = "csv"`: Exports the synthetic data to SPSS or CSV format.

7. **Explore Synthetic Data**:
    - `names(mysyn)`, `mysyn$method`, `mysyn$predictor.matrix`, `mysyn$visit.sequence`, `mysyn$cont.na`: Explores various components of the synthetic data object.
    - `multi.compare()`: Performs additional comparisons between the synthetic and original data.

This guide provides a concise overview of generating synthetic data using the `synthpop` package in R. For more detailed information, refer to the [synthpop documentation](https://www.synthpop.org.uk/get-started.html).

## References

1. [Synthetic data at ONS](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot)
2. [Synthetic data, a useful tool for ONS (Iain Dove 2022)](https://officenationalstatistics.sharepoint.com/sites/ONS_coffee_and_coding/Recording%20of%20Monthly%20Presentations/Forms/AllItems.aspx?ct=1741603363150&or=Teams%2DHL&ga=1&LOF=1&id=%2Fsites%2FONS%5Fcoffee%5Fand%5Fcoding%2FRecording%20of%20Monthly%20Presentations%2FSythetic%20Data%2C%20a%20tool%20for%20ONS%2FSynthetic%2Ddata%2C%2Da%2Duseful%2Dtool%2Dfor%2DONS%2Ehtml&parent=%2Fsites%2FONS%5Fcoffee%5Fand%5Fcoding%2FRecording%20of%20Monthly%20Presentations%2FSythetic%20Data%2C%20a%20tool%20for%20ONS)
3. [`synthpop` (Nowok, Raab, and Dibben 2016)](https://www.synthpop.org.uk/get-started.html)