# Generating Synthetic Data in `R`

## 1. What is synthetic data?

According to the US Census Bureau, “Synthetic data are microdata records created to improve data utility while preventing disclosure of confidential respondent information. Synthetic data is created by statistically modelling original data and then using those models to generate new data values that reproduce the original data’s statistical properties. Users are unable to identify the information of the entities that provided the original data.”

There are many situations in ONS where the generation of synthetic data could be used to improve outputs. These are listed in [Synthetic data at ONS](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot) and they include:

* provision of microdata to users
* testing systems
* developing systems or methods in non-secure environments
* obtaining non-disclosive data from data suppliers
* teaching – a useful way to promote the use of ONS data sources

**Methods for Generating Synthetic Data in `R`**

We will look at two methods for generating synthetic data:

1. Writing your own code
2. Using the `synthpop` R package

## 2. Generating synthetic data in `R` from scratch

### 2.1 Numerical Variables



In this section, we will generate synthetic numerical variables using different distributions in `R`. We will use the following distributions:

1. Uniform Distribution
2. Normal Distribution
3. Poisson Distribution
4. Binomial Distribution

The `R` code below generates a variable from each distribution.

**1. Uniform Distribution**: In a uniform distribution, every value within a specified range is equally likely. This is useful for simulating quantities with no preferred value, such as random probabilities or proportions.

`runif` draws `n` values between `min` and `max`.


```r
# Create data from scratch
# Creates a uniform distribution of data

uniform_dist <- runif(n = 1000, # number of observations/rows
                      min = 0,  # lower limit
                      max = 1)  # upper limit

# Example: Simulating random probabilities or proportions.
# To explore the generated data:
# hist(uniform_dist, main="Uniform Distribution Example", xlab="Value")
# summary(uniform_dist)
```

**2. Normal Distribution**: The normal (Gaussian) distribution is useful for simulating data that cluster around a mean, as many natural measurements do.

`rnorm` draws `n` values with the specified `mean` and standard deviation `sd`.

```r
# Create data from scratch
# Creates a normal distribution of data

normal_dist <- rnorm(n = 1000,  # number of observations/rows
                     mean = 50, # average
                     sd = 4)    # standard deviation

# Example: Simulating test scores or measurements.
# To explore the generated data:
# hist(normal_dist, main="Normal Distribution Example", xlab="Value")
# summary(normal_dist)
```

**3. Poisson Distribution**: The Poisson distribution is used for count data, where you are counting the number of events in a fixed interval of time or space.

`rpois` draws `n` counts whose mean is `lambda`.

```r
# Create data from scratch
# Creates a Poisson distribution of data

poisson_dist <- rpois(n = 1000,   # number of observations/rows
                      lambda = 4) # mean number of events (must be non-negative)

# Example: Simulating the number of visits to a government website per day.
# To explore the generated data:
# hist(poisson_dist, main="Poisson Distribution Example", xlab="Number of Events")
# summary(poisson_dist)
```

**4. Binomial Distribution**: The binomial distribution is useful for simulating the number of successes in a fixed number of trials, each with the same probability of success.

`rbinom` draws `n` values, each being the number of successes in `size` trials with success probability `prob`.

```r
# Create data from scratch
# Creates a binomial distribution of data

binomial_dist <- rbinom(n = 1000,  # number of observations/rows
                        size = 20, # number of trials
                        prob = 0.2) # probability of success of each trial

# Example: Simulating the number of successful outcomes in a series of experiments.
# To explore the generated data:
# hist(binomial_dist, main="Binomial Distribution Example", xlab="Number of Successes")
# summary(binomial_dist)
```
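One way to sanity-check the four generators above is to compare each sample mean with its theoretical value. The sketch below is illustrative only: the seed value and the `check_*` and `mean_check` names are arbitrary choices made for this example, and the parameters simply mirror those used above.

```r
# Illustrative sketch: compare sample means with theoretical means.
# The seed (2024) is an arbitrary choice so the output is reproducible.
set.seed(2024)

n <- 1000
check_unif  <- runif(n, min = 0, max = 1)       # theoretical mean: (0 + 1) / 2
check_norm  <- rnorm(n, mean = 50, sd = 4)      # theoretical mean: 50
check_pois  <- rpois(n, lambda = 4)             # theoretical mean: lambda
check_binom <- rbinom(n, size = 20, prob = 0.2) # theoretical mean: size * prob

mean_check <- data.frame(
  distribution = c("uniform", "normal", "poisson", "binomial"),
  theoretical  = c(0.5, 50, 4, 20 * 0.2),
  observed     = c(mean(check_unif), mean(check_norm),
                   mean(check_pois), mean(check_binom))
)

print(mean_check)
```

With 1,000 observations the observed means should sit close to, but not exactly on, the theoretical values. Larger `n` narrows the gap.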

### 2.2 Character/factor variables

In this section, we will generate synthetic character (factor) variables using random sampling in `R`. These variables are useful for simulating categorical data, such as demographic information.

**1. Random sampling from a vector**: Random sampling from a vector allows you to generate categorical data by randomly selecting elements from a specified set. This is useful for creating variables like gender, where each observation is randomly assigned a category.

```r
# Set seed for reproducibility
set.seed(123)

# Random sampling from a vector

gender <- sample(x = c("M", "F"),  # elements to choose from
                 size = 1000,      # number of observations
                 replace = TRUE)   # sampling with replacement

# Example: Simulating gender distribution in a population.
# To explore the generated data:
# table(gender)
# prop.table(table(gender))
```

**2. Weighted sampling from a vector**: Weighted sampling allows you to generate categorical data with specified probabilities for each category. This is useful for creating variables like marital status, where each category has a different likelihood of being selected.



```r
# Set seed for reproducibility
set.seed(456)

# Weighted sampling from a vector, using a vector of probabilities

marriage_status <- sample(x = c("Single", "Married", "Divorced", "Widowed"),  # elements to choose from
                          size = 1000,    # number of observations
                          replace = TRUE, # sampling with replacement
                          prob = c(0.35, 0.50, 0.10, 0.05))  # weights for obtaining the elements of the vector being sampled

# Example: Simulating marital status distribution in a population.
# To explore the generated data:
# table(marriage_status)
# prop.table(table(marriage_status))
```

### 2.3 Combining data into a synthetic dataset


Now, we will combine the numerical and categorical data into a single synthetic dataset using the `data.table` package:

```r
# Load the data.table package
library(data.table)

# Combine the data into a synthetic dataset
synthetic_data1 <- data.table(uniform_dist,
                              normal_dist,
                              poisson_dist,
                              binomial_dist,
                              gender,
                              marriage_status)

# Explore the synthetic dataset
print(head(synthetic_data1))
summary(synthetic_data1)
str(synthetic_data1)
```

**Converting Character Variables to Factors**

To ensure that the categorical variables are treated appropriately in analyses, we should convert them to factor variables:

```r
# Change gender and marriage_status to factor variables
synthetic_data1$gender <- as.factor(synthetic_data1$gender)
synthetic_data1$marriage_status <- as.factor(synthetic_data1$marriage_status)

# Check the structure of the dataset again
str(synthetic_data1)
```
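`as.factor` orders the levels alphabetically, which may not be the order you want in tables or plots. A minimal sketch of setting the level order explicitly with `factor()`, using a small stand-in vector (`status` is a made-up example, not the full dataset):

```r
# Small stand-in vector for illustration only
status <- c("Married", "Single", "Widowed", "Single", "Divorced")

# as.factor() sorts the levels alphabetically
status_default <- as.factor(status)
levels(status_default)

# factor() with an explicit levels argument keeps the order you specify
status_ordered <- factor(status,
                         levels = c("Single", "Married", "Divorced", "Widowed"))
levels(status_ordered)

# Tables follow the level order, so the two versions print differently
table(status_default)
table(status_ordered)
```

The same `factor(..., levels = ...)` call could be applied to `synthetic_data1$marriage_status` if a specific category order matters for your outputs.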

This combined synthetic dataset can now be used for analyses, simulations, and system testing. Note that each variable was generated independently, so the dataset reproduces the chosen distribution of each variable but contains no relationships between variables.

**Explanation of the sampling code in section 2.2**

1. Setting a seed (`set.seed`):
   * `set.seed(123)`: Sets the seed for random number generation. The number `123` can be any integer. Using the same seed produces the same sequence of random numbers each time the code is run.
   * The examples in section 2.1 do not set a seed. Call `set.seed` before them as well if the whole dataset needs to be reproducible.
2. Random sampling (`sample`):
   * `x`: The vector of elements to choose from.
   * `size`: The number of observations to generate.
   * `replace`: Whether sampling is with replacement (`TRUE`) or without replacement (`FALSE`).
3. Weighted sampling (`sample` with `prob`):
   * `prob`: A vector of weights, one for each element in `x`, giving the relative likelihood of that element being selected. The weights must be non-negative and not all zero. `R` rescales them internally, so they do not have to sum to 1, although writing them as probabilities that sum to 1 is easier to read.

These snippets provide a starting point for generating synthetic categorical data reproducibly, using random and weighted sampling. You can adjust the parameters to fit the characteristics of the population you are modelling. For more detail on the `sample` function, see its help page (`?sample`).
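The seed behaviour described above can be checked directly. The sketch below repeats a draw with the same seed and confirms the results match, then compares the proportions from a larger weighted sample with the weights supplied. The object names and the sample size of 10,000 are arbitrary choices for this illustration.

```r
# Same seed, same call: the two draws are identical
set.seed(123)
first_draw <- sample(x = c("M", "F"), size = 10, replace = TRUE)

set.seed(123)
second_draw <- sample(x = c("M", "F"), size = 10, replace = TRUE)

identical(first_draw, second_draw)  # TRUE

# Weighted sampling: observed proportions approach the weights as size grows
categories <- c("Single", "Married", "Divorced", "Widowed")
weights    <- c(0.35, 0.50, 0.10, 0.05)

set.seed(456)
weighted_draw <- sample(x = categories,
                        size = 10000,
                        replace = TRUE,
                        prob = weights)

# Put the table in the same order as the weights before comparing
observed_props <- prop.table(table(factor(weighted_draw, levels = categories)))
round(observed_props, 3)
```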

## 3. Application of `synthpop` in `R`

In this guide, we will work with the `R` package `synthpop` [(Nowok, Raab, and Dibben 2016)](https://www.synthpop.org.uk/get-started.html), one of the most advanced dedicated packages in `R` for creating synthetic data. Alternatives include the `R` package `mice` (van Buuren and Groothuis-Oudshoorn 2011; see Volker and Vink 2021) and the stand-alone software IVEware (“IVEware: Imputation and Variance Estimation Software,” n.d.).
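As a first taste of the package, the sketch below synthesises a small made-up dataset with `synthpop::syn()`. It is a minimal illustration under stated assumptions rather than a recommended workflow: the toy data (`original_data`), the seed, and the object names are invented for this example, the `synthpop` package is assumed to be installed, and the call relies on the package's default settings (which, according to its documentation, use CART models).

```r
# Minimal sketch, assuming synthpop is installed: install.packages("synthpop")
# The "original" data here is made up purely for illustration.
set.seed(789)
original_data <- data.frame(
  age    = round(rnorm(500, mean = 45, sd = 12)),
  visits = rpois(500, lambda = 4),
  gender = factor(sample(c("M", "F"), size = 500, replace = TRUE))
)

synthetic_copy <- NULL  # stays NULL if synthpop is not available

if (requireNamespace("synthpop", quietly = TRUE)) {
  # syn() models each variable in turn and draws synthetic values
  # from those models; seed makes the synthesis reproducible.
  synth_obj <- synthpop::syn(original_data, seed = 789, print.flag = FALSE)

  # The synthetic data frame is stored in the $syn element
  synthetic_copy <- synth_obj$syn

  # Compare the original and synthetic summaries side by side
  print(summary(original_data))
  print(summary(synthetic_copy))
}
```

Unlike the from-scratch approach in section 2, the synthetic values here are generated from models fitted to an existing dataset, so they aim to preserve relationships between variables as well as the distribution of each one.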

## References

1. [Synthetic data at ONS](https://www.ons.gov.uk/methodology/methodologicalpublications/generalmethodology/onsworkingpaperseries/onsmethodologyworkingpaperseriesnumber16syntheticdatapilot)
2. [`synthpop` (Nowok, Raab, and Dibben 2016)](https://www.synthpop.org.uk/get-started.html)