## **Generating Synthetic/Dummy Data Using Faker**

### 1. Introduction to Faker

`Faker` is a Python package that generates fake data such as names, addresses, emails, dates, credit card numbers, and more.

**Key Features:** Randomised generation, locale support, wide range of data types.

### 2. Install required libraries
We need the `Faker` library to generate the dummy data. Later sections also use pandas and PySpark to work with DataFrames; the cell below does not install them, so they must already be available in your environment.

Install Faker with pip:

In [8]:
!pip install Faker
!pip install setuptools


For more help with installation and setup, see the [official documentation](https://faker.readthedocs.io/en/master/).

**Check the version of the `Faker` package installed in your environment**

In [9]:
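# importlib.metadata is part of the standard library (Python 3.8+);
# pkg_resources is deprecated and should be avoided in new code
from importlib.metadata import version

faker_version = version("Faker")
print(f"faker_version: {faker_version}")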

faker_version: 37.1.0


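**Set a random seed for reproducibility**

Seeding makes the generated values repeatable across runs. `fake.seed_instance(42)` seeds the random generator of this one `Faker` instance, and every provider called through it draws from that generator, so there is no need to pass a seed to each provider. (To seed the generator shared by all instances that have not been seeded individually, use the class method `Faker.seed(42)` instead.)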

In [10]:
from faker import Faker
fake = Faker()
fake.seed_instance(42)

### 3. Generate your first fake data

In [11]:
# Generate sample personal data
print("Name:", fake.name())
print("Email:", fake.email())
# address() returns a multi-line string, so replace "\n" with ", " to print it on one line
print("Address:", fake.address().replace("\n", ", "))
print("Phone:", fake.phone_number())
print("Birthday:", fake.date_of_birth(minimum_age=18, maximum_age=90))

Name: Allison Hill
Email: donaldgarcia@example.net
Address: 600 Jeffery Parkways, New Jamesside, MT 29394
Phone: 394.802.6542x351
Birthday: 1947-06-02


### 4. International data generation


Faker can generate data tailored to a specific locale. Pass a locale code when creating the instance to get names, addresses, and other values in that region's format:

In [12]:
# Create localised Faker instances
fake_us = Faker('en_US')
fake_uk = Faker('en_GB')
fake_fr = Faker('fr_FR')
fake_jp = Faker('ja_JP')

print("US:", fake_us.address().replace("\n", ", "))
print("UK:", fake_uk.address().replace("\n", ", "))
print("France:", fake_fr.address().replace("\n", ", "))
print("Japan:", fake_jp.address().replace("\n", ", "))

US: 6828 Rachel Mountain Suite 480, Port Bryanshire, MD 52233
UK: 071 Hollie vista, Garethside, TA4 2ND
France: 75, rue Claudine Buisson, 20591 Allarddan
Japan: 山梨県小平市蟇沼19丁目9番2号 シティ入谷630


Notice the differences in address formats:

* US addresses include a state abbreviation and a ZIP code.
* UK addresses end with a British postcode.
* French addresses put the house number first and the postal code before the town name.
* Japanese addresses are written in Japanese script and run from the largest unit (the prefecture) down to the building.


For an up-to-date list of supported locales, see the [official documentation](https://faker.readthedocs.io/en/master/locales.html).

### 5. Faker’s provider architecture


Faker utilises a modular system of "providers," where each provider is responsible for generating a specific type of data. Let's take a closer look at how providers work and the types available:

In [13]:
from faker import Faker

fake = Faker('en_GB')

print("Available Faker Providers:")
for provider in fake.providers:
    print(f"- {provider}")

Available Faker Providers:
- <faker.providers.user_agent.Provider object at 0x000002B7A21F4D40>
- <faker.providers.ssn.en_GB.Provider object at 0x000002B7A21F4BF0>
- <faker.providers.sbn.Provider object at 0x000002B7A21F45F0>
- <faker.providers.python.Provider object at 0x000002B7A21F4650>
- <faker.providers.profile.Provider object at 0x000002B7A21F5A00>
- <faker.providers.phone_number.en_GB.Provider object at 0x000002B7A21F45C0>
- <faker.providers.person.en_GB.Provider object at 0x000002B7A21F4530>
- <faker.providers.passport.en_US.Provider object at 0x000002B7A2124B00>
- <faker.providers.misc.en_US.Provider object at 0x000002B7A21AC320>
- <faker.providers.lorem.la.Provider object at 0x000002B7A21F4560>
- <faker.providers.job.en_US.Provider object at 0x000002B7A21F44A0>
- <faker.providers.isbn.en_US.Provider object at 0x000002B7A21F44D0>
- <faker.providers.internet.en_GB.Provider object at 0x000002B7A21F4410>
- <faker.providers.geo.en_US.Provider object at 0x000002B7A21F4470>
- <faker

Each item in the list corresponds to a specific provider, such as:

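- **User Agent Provider:** Generates browser and device identification strings.
- **SSN Provider:** Creates national identification numbers in a valid format for the locale (the `en_GB` variant listed above produces UK National Insurance numbers rather than US social security numbers).
- **SBN Provider:** Generates 9-digit Standard Book Numbers, the predecessor of the ISBN (used until 1974).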

In addition to these, Faker includes other commonly used providers for generating various types of data:

- **Person Provider:** Generates names, birthdates, and personal details.  
- **Address Provider:** Produces realistic street addresses and postal codes.
- **Internet Provider:** Generates email addresses, domain names, and URLs.

You can explore the full list of available providers and their methods in the official documentation, which offers detailed information and usage examples for each one.

### 6. Best practices for using Faker

When working with Faker, keep the following best practices in mind:

- Select locale-specific providers when generating data tailored to a particular region.
- Combine different providers to generate more realistic and interconnected data.
- Organise your fake data generation to align with the requirements of your application.
- Explore the official documentation to learn about additional provider features and options.

### 7. Building a typical dataset


Now, let's explore how to combine multiple providers to generate rich and interconnected data for more complex structures:

In [16]:
from faker import Faker
import pandas as pd

fake = Faker('en_GB')
fake.seed_instance(42)

n_rows = 100


def generate_user_profile():
    """ Function to generate a user profile """
    return {
        # Personal details
        'name': fake.name(),
        'age': fake.random_int(min=18, max=80),
        'email': fake.email(),
        
        # Location details
        'street': fake.street_address().replace("\n", " "),
        'city': fake.city(),
        'postcode': fake.postcode(),
        
        # Professional information
        'job': fake.job(),
        'company': fake.company(),
        
        # Other details
        'username': fake.user_name(),
        'website': fake.url()
    }
 
profiles = [generate_user_profile() for _ in range(n_rows)]
 
df = pd.DataFrame(profiles)

df.head()

| | name | age | email | street | city | postcode | job | company | username | website |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | William Jennings | 35 | francisdavidson@example.org | 600 Charlie fort | New Joeside | DT79 0GS | Librarian, public | Bryan-Andrews | timothy16 | http://butler-gough.info/ |
| 1 | Benjamin Simpson | 42 | fisherteresa@example.net | Studio 31 Irene forks | Jasonbury | S4 5GQ | Marine scientist | Davies Ltd | johnronald | http://www.lamb-scott.co.uk/ |
| 2 | Ricky Lloyd-Duncan | 67 | wilsonryan@example.com | Flat 13K Kelly parks | Youngmouth | M23 9SY | Oceanographer | Lloyd-Turner | dgreen | http://www.walsh.biz/ |
| 3 | Dr Jasmine Smith | 53 | daviesgeoffrey@example.net | Studio 51s Steele alley | Donnaburgh | E4W 2QG | Solicitor, Scotland | Bell, Anderson and Jones | murraymohammad | http://www.shaw.com/ |
| 4 | Norman Sharp-Stewart | 52 | scottknowles@example.net | 89 Arnold plains | Lake Graeme | SG57 1JJ | Bonds trader | Brown-Naylor | thomashammond | http://www.howe.com/ |


### 8. Using synthetic data with big data frameworks

#### 8.1. Using synthetic data with PySpark

This section shows how to generate synthetic data with Faker and load it into a Spark DataFrame.

In [18]:
from pyspark.sql import SparkSession
from faker import Faker
import pandas as pd

# Initialise a local Spark session that uses two worker threads
spark = SparkSession.builder.master("local[2]").appName("Synthetic Data Example").getOrCreate()


# Create synthetic data with Faker
fake = Faker('en_GB')
fake.seed_instance(42)

n_rows = 100
data = [(fake.name(), fake.address(), fake.email()) for _ in range(n_rows)]

df_pandas = pd.DataFrame(data, columns=["Name", "Address", "Email"])

# Replace the newline characters in the 'Address' column with ", "
df_pandas['Address'] = df_pandas['Address'].str.replace("\n", ", ")

# Workaround: spark.createDataFrame(df_pandas) raised an error in this virtual environment,
# so the data is written to a CSV file and read back with Spark instead
df_pandas.to_csv('temp.csv', index=False)

df_spark = spark.read.csv('temp.csv', header=True, inferSchema=True)

df_spark.show(5)

+--------------------+--------------------+--------------------+
|                Name|             Address|               Email|
+--------------------+--------------------+--------------------+
|    William Jennings|2 Sian streets, N...|francescaharrison...|
|     Rosemary Wright|654 Robin track, ...|simpsongemma@exam...|
|         Sean Norton|103 Robinson walk...|  rita19@example.net|
|       Brenda Briggs|Studio 4, Lydia i...|iwilkins@example.org|
|Leonard Powell-Mo...|Flat 32G, Green c...|andrea01@example.net|
+--------------------+--------------------+--------------------+
only showing top 5 rows



**Application to big data:**

This example can be extended to generate large datasets (millions of records) that can be processed in parallel using PySpark.

#### 8.2. Generate synthetic data with more features

We will generate synthetic data with the following additional features:

1. Mode of Transportation (Car, Public Transport, Walking, etc.)
2. Highest Education (High School, Bachelor's, Master's, PhD, etc.)
3. Marital Status (Single, Married, Divorced, Widowed)
4. Favourite High Street Supermarket (Tesco, Sainsbury's, Asda, etc.)
5. Pet Ownership (Yes, No - with type of pet if Yes)

We will also simulate some missing data in the dataset, which is commonly encountered in real-world scenarios.

In [20]:
import random

import pandas as pd
from faker import Faker
from pyspark.sql import SparkSession

fake = Faker('en_GB')
fake.seed_instance(42)

# Separate seeded generator for the missing-data step. The shared generator in
# faker.generator is not affected by seed_instance(), so using it here would make
# the pattern of missing values differ on every run.
rng = random.Random(42)

n_samples = 1000

# Create lists of categories for features not captured in the Faker library
mode_of_transport = ['Car', 'Public Transport', 'Walking', 'Cycling', 'Taxi']
education_levels = ['Primary', 'High School', 'Vocational', "Bachelor's", "Master's", 'PhD']
marital_status = ['Single', 'Married', 'Divorced', 'Widowed']
supermarkets = ['Tesco', "Sainsbury's", 'Asda', 'Morrisons', 'Waitrose']
pets = ['Dog', 'Cat', 'None']  # the string 'None' means "no pet"; a missing value is stored as Python None

# Generate synthetic data with the new features
data = []
for _ in range(n_samples):
    name = fake.name()
    address = fake.address()
    postcode = fake.postcode()
    city = fake.city()
    email = fake.email()
    transport = fake.random_element(mode_of_transport)
    education = fake.random_element(education_levels)
    marital = fake.random_element(marital_status)
    supermarket = fake.random_element(supermarkets)
    pet = fake.random_element(pets)

    # Simulate missing data: each feature below is independently set to None with 10% probability
    if rng.random() < 0.1:  # 'Mode of Transport'
        transport = None
    if rng.random() < 0.1:  # 'Education'
        education = None
    if rng.random() < 0.1:  # 'Marital Status'
        marital = None
    if rng.random() < 0.1:  # 'Supermarket'
        supermarket = None
    if rng.random() < 0.1:  # 'Pet'
        pet = None

    data.append([name, address, postcode, city, email, transport, education, marital, supermarket, pet])


columns = ['Name', 'Address', 'Postcode', 'City', 'Email', 'Mode_of_Transport', 'Education', 'Marital_Status', 'Supermarket', 'Pet']
synthetic_df = pd.DataFrame(data, columns=columns)
synthetic_df['Address'] = synthetic_df['Address'].str.replace("\n", ", ")


synthetic_df.head()

| | Name | Address | Postcode | City | Email | Mode_of_Transport | Education | Marital_Status | Supermarket | Pet |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | William Jennings | 2 Sian streets, New Maryton, E3 8ZA | L8G 7YL | Port Samchester | ricky23@example.com | Car | Primary | Widowed | Tesco | Cat |
| 1 | Dr Josh Pritchard | Studio 16, Lynn hill, Melissaborough, BR0X 4DJ | S4 5GQ | Rhysview | johnronald@example.net | Public Transport | High School | Widowed | Morrisons | Cat |
| 2 | Leigh Randall | 0 Alexander circles, New Guy, W67 4FJ | NW8M 9RQ | East Natasha | dgreen@example.org | Public Transport | PhD | Married | Morrisons | |
| 3 | Lorraine Palmer | Studio 01, Read junctions, West Tracyburgh, AB... | MK9H 5GX | New Brandonfort | griffithslinda@example.com | Public Transport | | Married | Waitrose | Dog |
| 4 | June Sharp | 782 Hill rest, Arnoldside, SM3Y 6QT | W1D 1PA | New Gerald | hammondjulia@example.org | Public Transport | Primary | Single | Asda | Dog |
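
The repeated `if` blocks that blank out values can be folded into a small helper, which also makes the missing rate easy to check. This is a sketch using only the standard library; `maybe_missing` and `rng` are hypothetical names, not part of Faker.

```python
import random
from typing import Optional


def maybe_missing(value: str, rng: random.Random, rate: float = 0.1) -> Optional[str]:
    """Return None with probability `rate`, otherwise return the value unchanged."""
    return None if rng.random() < rate else value


# A dedicated, seeded generator keeps the missing-data pattern reproducible
rng = random.Random(42)

# Apply the helper to a constant value many times and measure the missing share
values = [maybe_missing("Car", rng, rate=0.1) for _ in range(10_000)]
missing_share = values.count(None) / len(values)
print(f"Share of missing values: {missing_share:.3f}")
```

The measured share should be close to the requested 10%. Inside the generation loop the helper would be used as, for example, `transport = maybe_missing(transport, rng)`.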


**Integrate with PySpark for big data workflow**

Now that we have a robust dataset with additional features and some missing data, let's see how to integrate this with PySpark, which is commonly used in big data workflows.

In [21]:

spark = SparkSession.builder.master("local[2]").appName("SyntheticDataForBigData").getOrCreate()

# Convert the synthetic pandas DataFrame to a Spark DataFrame.
# Calling createDataFrame directly raised an error in this virtual environment,
# so the data is written to a CSV file and read back with Spark instead.
# spark_df = spark.createDataFrame(synthetic_df)
synthetic_df.to_csv('temp.csv', index=False)

spark_df = spark.read.csv('temp.csv', header=True, inferSchema=True)

# Show the Spark DataFrame
spark_df.show(5)

+-----------------+--------------------+--------+---------------+--------------------+-----------------+-----------+--------------+-----------+----+
|             Name|             Address|Postcode|           City|               Email|Mode_of_Transport|  Education|Marital_Status|Supermarket| Pet|
+-----------------+--------------------+--------+---------------+--------------------+-----------------+-----------+--------------+-----------+----+
| William Jennings|2 Sian streets, N...| L8G 7YL|Port Samchester| ricky23@example.com|              Car|    Primary|       Widowed|      Tesco| Cat|
|Dr Josh Pritchard|Studio 16, Lynn h...|  S4 5GQ|       Rhysview|johnronald@exampl...| Public Transport|High School|       Widowed|  Morrisons| Cat|
|    Leigh Randall|0 Alexander circl...|NW8M 9RQ|   East Natasha|  dgreen@example.org| Public Transport|        PhD|       Married|  Morrisons|None|
|  Lorraine Palmer|Studio 01, Read j...|MK9H 5GX|New Brandonfort|griffithslinda@ex...| Public Transport|  

### Conclusion
Synthetic data is a valuable tool for testing algorithms, saving resources, and protecting privacy. Faker offers a wide range of providers for generating synthetic data in Python, and big data frameworks such as `PySpark` help process large datasets efficiently.


### References and Further Reading
* [Faker Documentation (Python)](https://faker.readthedocs.io/en/master/)
* [PySpark Documentation](https://spark.apache.org/docs/latest/api/python/)