## **Generating Synthetic/Dummy Data Using Mimesis**

## 1. Introduction to Mimesis

Mimesis is a Python library used to generate fake, random, or synthetic data for various purposes such as testing, model validation, or training datasets. It allows users to generate a wide range of data types, including personal information, addresses, financial data, dates, and more.

## 2. Install Required Libraries
To start, we will need the `mimesis` library to generate the dummy data. Additionally, we will use PySpark for working with Spark DataFrames in Python, and sparklyr for R users.

For Python, use:

In [1]:
pip install mimesis

Note: you may need to restart the kernel to use updated packages.


**Check the Version of the `mimesis` package installed in your environment**

In [2]:
import mimesis
print(mimesis.__version__)

18.0.0


In [3]:
from mimesis import Generic, Person, Address, Finance
from mimesis.locales import Locale

**Set Up a Global Random Seed for Reproducibility**

You can set a global seed for all data providers and use it without explicitly passing it to each provider:

In [4]:
from mimesis import random

random.global_seed = 42

**Initialise the Generic Provider with the Seed and locale**

In [5]:
generic = Generic(locale=Locale.EN_GB, seed=42)

Setting this seed allows us to generate the exact same synthetic data as will be shown in our examples. This is especially helpful when we are debugging or creating tests that require consistent data.

## 3. How Mimesis Works


Mimesis, inspired by the ancient Greek concept of `"mimesis"` (which means to imitate or replicate), is designed to create realistic synthetic data rather than just random values. Its goal is to generate contextually appropriate data that mimics real-world information. This is achieved through a structured provider system and built-in support for different locales, ensuring that the generated data is both varied and meaningful.

Let us walk through how to generate basic synthetic data using a seeded instance of Mimesis:

In [6]:
from mimesis import Person, Address, Finance
from mimesis.locales import Locale

# Initialise the providers with the selected locale and a fixed seed for reproducibility
person = Person(locale=Locale.EN_GB, seed=42)
address = Address(locale=Locale.EN_GB, seed=42)
finance = Finance(locale=Locale.EN_GB, seed=42)

# Generate and display basic personal and financial information
print("Name:", person.full_name())
print("Email:", person.email())
print("Address:", address.address())
print("Job:", person.occupation())
print("Company:", finance.company())


Name: Anthony Reilly
Email: holds1871@live.com
Address: 1310 Blaney Avenue
Job: Choreographer
Company: Centrica


Let's take a closer look at this output. Each value is plausible rather than arbitrary. The name follows common naming conventions, and the email address combines a generated username with a well-known email provider. The address follows a typical street number and street name structure, and the occupation "Choreographer" shows that Mimesis draws on a broad range of job titles, not just corporate ones. Lastly, "Centrica" reads like a typical company name, which makes this kind of output useful as test data across diverse business sectors.

## 4. Mimesis Provider System

One of the key strengths of Mimesis is its well-structured provider system. At its core is the **Generic provider**, which acts as a central hub, giving you access to all the specialised data generators available in Mimesis. Let us begin by exploring the range of providers that you can use:

In [7]:
from mimesis import Generic
from mimesis.locales import Locale

# Initialise the Generic provider with the desired locale and a seed
generic = Generic(locale=Locale.EN_GB, seed=42)


print("Providers available through Generic:")
for attribute in dir(generic):
    if not attribute.startswith('_'):  # Here, we skip internal attributes
        print(attribute)


Providers available through Generic:
address
binaryfile
choice
code
cryptographic
datetime
development
file
finance
food
hardware
internet
numeric
path
payment
person
science
text
transport


## 5. Locale Support for International Applications

A key feature of Mimesis is its ability to handle different locales. When generating test data for global applications, it is essential that the data reflects not only different languages but also the cultural and formatting norms of various regions. Mimesis achieves this through its locale system.

Now, let us see how Mimesis adjusts its data generation based on different locale settings. We will begin by exploring the available locales:

In [8]:
from mimesis.locales import Locale

# Display all available locales in Mimesis
print("Available Locales:")
for locale in Locale:
    print(f"- {locale}: {locale.value}")

Available Locales:
- Locale.AR_AE: ar-ae
- Locale.AR_DZ: ar-dz
- Locale.AR_EG: ar-eg
- Locale.AR_JO: ar-jo
- Locale.AR_OM: ar-om
- Locale.AR_SY: ar-sy
- Locale.AR_YE: ar-ye
- Locale.CS: cs
- Locale.DA: da
- Locale.DE: de
- Locale.DE_AT: de-at
- Locale.DE_CH: de-ch
- Locale.EL: el
- Locale.EN: en
- Locale.EN_AU: en-au
- Locale.EN_CA: en-ca
- Locale.EN_GB: en-gb
- Locale.ES: es
- Locale.ES_MX: es-mx
- Locale.ET: et
- Locale.FA: fa
- Locale.FI: fi
- Locale.FR: fr
- Locale.HU: hu
- Locale.HR: hr
- Locale.IS: is
- Locale.IT: it
- Locale.JA: ja
- Locale.KK: kk
- Locale.KO: ko
- Locale.NL: nl
- Locale.NL_BE: nl-be
- Locale.NO: no
- Locale.PL: pl
- Locale.PT: pt
- Locale.PT_BR: pt-br
- Locale.RU: ru
- Locale.SK: sk
- Locale.SV: sv
- Locale.TR: tr
- Locale.UK: uk
- Locale.ZH: zh


In this example, we are listing all the locales supported by Mimesis, allowing you to see which regions are available for generating region-specific data.

To demonstrate how locales influence data generation, let us create a simple example that generates person data (name, phone number, email, and job) for several locales. We will use a fixed seed to ensure consistent results.

In [9]:
from mimesis import Person
from mimesis.locales import Locale

# Create a dictionary to store examples
examples = {}

# List of diverse locales to generate data for
locales = [Locale.EN_GB, Locale.EN, Locale.JA, Locale.FR, Locale.AR_EG]

# Generate data for each locale
for locale in locales:
    person = Person(locale=locale, seed=42)  # Use the same seed for consistency
    examples[locale] = {
        "Full Name": person.full_name(),
        "Phone": person.telephone(),
        "Email": person.email(),
        "Job": person.occupation()
    }

for locale, data in examples.items():
    print(f"\n{locale.value.upper()} Examples:")
    for key, value in data.items():
        print(f"{key}: {value}")



EN-GB Examples:
Full Name: Anthony Reilly
Phone: 055 2768 0402
Email: appeared1901@example.org
Job: Veterinary Surgeon

EN Examples:
Full Name: Anthony Reilly
Phone: +1-309-276-8040
Email: guitars1813@yahoo.com
Job: Yacht Master

JA Examples:
Full Name: 石松 田場
Phone: +81 117 5500 2657
Email: readers2029@example.org
Job: レコーディング・エンジニア

FR Examples:
Full Name: Alexy Rigal
Phone: 0427680402
Email: appeared1901@example.org
Job: Responsable de la promotion des ventes

AR-EG Examples:
Full Name: أيمن عبد الماجد باشا
Phone: 0611755002
Email: water2079@duck.com
Job: مهندس تنظيف


## 6. Generic Provider to Generate UK Data

### 6.1. Class: mimesis.Person

The `mimesis.Person` class is designed to generate personal information such as names, genders, emails, phone numbers, and more. This class can be particularly useful for generating synthetic user data, such as for testing user-related models or systems.

#### **Key Methods in `mimesis.Person`**

1. **`full_name(gender=None)`**: Generates a random full name. Optionally, you can specify a gender using the `Gender` enum (`Gender.MALE`, `Gender.FEMALE`).
   

In [10]:
generic = Generic(locale=Locale.EN_GB, seed=42)
from mimesis.enums import Gender

# Generate male and female names
male_name = generic.person.full_name(gender=Gender.MALE)
female_name = generic.person.full_name(gender=Gender.FEMALE)

print(f"male_name: {male_name}, female_name: {female_name}")

male_name: Cornelius Avila, female_name: Krystin Downs


2. **`first_name(gender=None)`**: Generates a random first name. You can specify the gender.


In [11]:
male_first_name = generic.person.first_name(gender=Gender.MALE)
female_first_name = generic.person.first_name(gender=Gender.FEMALE)

print(f"male_first_name: {male_first_name}, female_first_name: {female_first_name}")

male_first_name: Garry, female_first_name: Edelmira


3. **`last_name()`**: Generates a random last name (no gender specification needed). It has an alias, `surname()`

In [12]:
last_name = generic.person.last_name()
surname = generic.person.surname()

print(f"last_name: {last_name}, surname: {surname}")

last_name: Raymond, surname: Brock


4. **`email()`**: Generates a random email address.

In [13]:
email = generic.person.email()
print(f"email: {email}")

email: chapel1816@example.org


5. **`telephone()`**: Generates a random phone number in the format of the current locale (here, UK phone numbers). `phone_number()` is an alias for the same method.

In [14]:
phone_number = generic.person.telephone()
print(f"phone_number: {phone_number}")

phone_number: 01250 165258


The `telephone()` method also accepts a custom `mask` and a `placeholder`. The `placeholder` is the character in the mask that is replaced with random digits (`#` by default); every other character is kept as written. In the third call below, the placeholder is `X` but the mask contains only `#`, so the mask comes back unchanged.

In [15]:
phone_number = generic.person.telephone(mask='+44-(###)-###-####')
mobile_number = generic.person.telephone(mask='07### ######')
custom_phone_number =  generic.person.telephone(mask='+44-(###)-###-####', placeholder='X')

print(f"phone_number: {phone_number}")
print(f"mobile_number: {mobile_number}")
print(f"custom_phone_number: {custom_phone_number}")

phone_number: +44-(086)-319-3008
mobile_number: 07687 593586
custom_phone_number: +44-(###)-###-####
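The mask behaviour seen above can be sketched with the standard library alone. The `fill_mask` helper below is hypothetical and is not Mimesis's implementation; it only illustrates the idea that placeholder characters become digits and everything else is copied through.

```python
import random
from typing import Optional


def fill_mask(mask: str, placeholder: str = "#",
              rng: Optional[random.Random] = None) -> str:
    """Replace every `placeholder` character in `mask` with a random digit."""
    rng = rng or random.Random()
    return "".join(
        str(rng.randint(0, 9)) if char == placeholder else char
        for char in mask
    )


rng = random.Random(42)
print(fill_mask("07### ######", rng=rng))                         # '#' replaced
print(fill_mask("+44-(XXX)-XXX-XXXX", placeholder="X", rng=rng))  # 'X' replaced
print(fill_mask("+44-(###)-###-####", placeholder="X", rng=rng))  # nothing to replace
```

The last call mirrors `custom_phone_number` in the cell above: because the mask contains no `X`, it is returned as written.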


6. **`username(mask=None)`**: Generates a random username. You can provide a mask to specify the structure of the username (e.g., lowercase letters, uppercase letters, digits).

In [16]:
username = generic.person.username(mask="l_d_U-C")
print(username)  # Format: l = lowercase, d = digits, U = uppercase, C = capitalised

organ_1835_GRANDE-Carroll


7. **Weighted Choice**

You may want to generate data with a specific probability of occurrence.

For example, let's say you want to generate random full names for both males and females, but with a higher probability of generating male names (here 0.9 versus 0.1).

Here’s one way to achieve this:

In [17]:
from mimesis import Person, Locale, Gender

person = Person(Locale.EN_GB)

#person.reseed('ok')

for _ in range(10):
    full_name = person.full_name(
        gender=person.random.weighted_choice(
            choices={
                Gender.MALE: 0.9,
                Gender.FEMALE: 0.1,
            }
        ),
    )
    print(full_name)

Anthony Reilly
Garry Cardenas
Tom Boyd
Armand Baker
Gilberto Lane
Vern Cortez
Tom Hogan
Zachariah Fields
Alan Robbins
Neville Gomez
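With weights of 0.9 and 0.1, ten draws will often contain no female names at all, as in the output above. To see the weights take effect, it helps to count a larger sample. The sketch below uses the standard library's `random.Random.choices` as a stand-in for a weighted draw; it does not call Mimesis, and the labels are plain strings rather than `Gender` members.

```python
import random
from collections import Counter

rng = random.Random(42)
weights = {"male": 0.9, "female": 0.1}

# Draw 1000 labels with replacement, in proportion to the weights
draws = rng.choices(list(weights), weights=list(weights.values()), k=1000)
counts = Counter(draws)
print(counts)
```

The counts should land close to a 900/100 split, with some random variation from run to run when no seed is set.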


### 6.2. Class: mimesis.Datetime

The `mimesis.Datetime` class allows us to generate random dates, times, and even future or past dates, which is useful for time-based simulations, datasets, and models. For the list of all the methods within this class, visit the [mimesis.Datetime Page](https://mimesis.name/v12.1.1/api.html#mimesis.Datetime). Next, we will show some key methods.

#### **Key Methods in `mimesis.Datetime`**

1. **`date()`**: Generates a random date. You can restrict the range of years with the `start` and `end` parameters.

In [18]:
random_date = generic.datetime.date(start=2010, end=2025) 
print(random_date)

2013-01-24
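For intuition about what a year-bounded random date means, here is a standard-library sketch that picks a date uniformly between 1 January of the start year and 31 December of the end year. The `random_date` helper is hypothetical and is not how Mimesis implements `date()`; it only shows one way to keep generated dates inside a year range.

```python
import datetime as dt
import random


def random_date(start_year: int, end_year: int, rng: random.Random) -> dt.date:
    """Pick a date uniformly between the first and last day of the year range."""
    first = dt.date(start_year, 1, 1).toordinal()
    last = dt.date(end_year, 12, 31).toordinal()
    return dt.date.fromordinal(rng.randint(first, last))


rng = random.Random(42)
sample = [random_date(2010, 2025, rng) for _ in range(5)]
print(sample)
```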


2. **`time()`**: Generates a random time.

In [19]:
random_time = generic.datetime.time()
print(random_time)

08:15:14.146316


3. **`datetime()`**: Generates a random datetime (combination of date and time).

In [20]:
random_datetime = generic.datetime.datetime(start=2010, end=2025)
print(random_datetime)

2013-11-24 17:05:37.442417


There is an option to specify a timezone, using the `pytz` library.

In [21]:
pip install pytz

Note: you may need to restart the kernel to use updated packages.


In [22]:
import pytz
# Generate a random datetime with a specific timezone
random_datetime_warsaw = generic.datetime.datetime(start=2010, end=2025, timezone='Europe/Warsaw')

print(random_datetime_warsaw)

2011-01-03 06:14:32.631262+01:00


4. **`timestamp()`**: Generates a random timestamp in a given format. Supported formats are `POSIX`, `RFC_3339`, and `ISO_8601`.

In [23]:
from mimesis.enums import TimestampFormat
time_stamp_format_posix = generic.datetime.timestamp(TimestampFormat.POSIX)
time_stamp_format_rfc_3339 = generic.datetime.timestamp(TimestampFormat.RFC_3339)
time_stamp_format_iso_8601 = generic.datetime.timestamp(TimestampFormat.ISO_8601)

print(f"time_stamp_format_posix: {time_stamp_format_posix}")
print(f"time_stamp_format_rfc_3339: {time_stamp_format_rfc_3339}")
print(f"time_stamp_format_iso_8601: {time_stamp_format_iso_8601}")

time_stamp_format_posix: 1757284904
time_stamp_format_rfc_3339: 2025-04-15T18:17:51Z
time_stamp_format_iso_8601: 2025-03-23T13:21:17.163032
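To make the difference between the three formats concrete, the sketch below renders one fixed moment in each of them using only the standard library. The chosen moment and the variable names are illustrative, and the format strings are an assumption about what each label conventionally means rather than a copy of Mimesis's internals.

```python
import calendar
from datetime import datetime, timezone

# One fixed, timezone-aware moment to render in three ways
moment = datetime(2025, 4, 15, 18, 17, 51, tzinfo=timezone.utc)

posix = calendar.timegm(moment.utctimetuple())    # whole seconds since 1970-01-01 UTC
rfc_3339 = moment.strftime("%Y-%m-%dT%H:%M:%SZ")  # UTC marked with a trailing "Z"
iso_8601 = moment.isoformat()                     # Python's ISO 8601 rendering

print(posix)
print(rfc_3339)
print(iso_8601)
```

POSIX is a plain integer that is easy to sort and store, while the two string formats stay human-readable.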


### 6.3. Class: mimesis.Finance

The `mimesis.Finance` class is used to generate financial data, such as amounts, currency values, and financial transactions. This class is particularly useful for generating test data for financial systems, accounting models, or other financial applications.

#### **Key Methods in `mimesis.Finance`**

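1. **`currency_symbol()`**: Returns the currency symbol for the current locale (e.g., £ for `EN_GB`). The related `currency_iso_code()` method returns an ISO code such as GBP.


In [24]:
currency_symbol = generic.finance.currency_symbol()
print(currency_symbol)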

£


2. **`bank()`**: Generates a random bank name.

In [25]:
bank_name = generic.finance.bank()
print(bank_name)

National Counties Building Society


3. **`company()`**: Generates a random company name. `company_type()`: Generates a random type of business entity.

In [26]:
company_name = generic.finance.company()
print(company_name)

company_registered_type = generic.finance.company_type(abbr=True)
print(company_registered_type)

Centrica
Corp.


### 6.4. Class: mimesis.Address

The `mimesis.Address` class generates fake address-related information, such as street names, city names, zip/post codes, and more. This is helpful when generating addresses for test data in location-based applications or geospatial data models.

#### **Key Methods in `mimesis.Address`**

1. **`address()`**: Generates a random address. **`street_name()`**: Generates a random street name. **`street_suffix()`**: Generate a random street suffix


In [27]:
address = generic.address.address()
print(address)

street_name = generic.address.street_name()
print(street_name)

street_suffix = generic.address.street_suffix()
print(street_suffix)

1310 Blaney Avenue
Covehill
Hill


2. **`postal_code()`**: Generates a random postal code following the pattern of the current locale (here, the UK postcode pattern).

In [28]:
postcode = generic.address.postal_code()
print(postcode)

FT6X 0KA


3. **`city()`**: Generates a random city name.

In [29]:
city = generic.address.city()
print(city)

Horwich


4. **`region()`**: Generates a random region name.

In [30]:
region = generic.address.region()
print(region)

Gwent


5. **`coordinates(dms=False)`**: Generates random geographic coordinates.


In [31]:
geo_coordinates = generic.address.coordinates(dms=False)
print(geo_coordinates)

{'longitude': 1.927904, 'latitude': -85.223525}


### 6.5. Class: mimesis.Transport

The `mimesis.Transport` class generates random transportation-related data, such as vehicle makes, models, license plates, and more. This is useful for applications dealing with logistics, traffic analysis, or fleet management.

#### **Key Methods in `mimesis.Transport`**

1. **`car()`**: Generates a random car name (make and model).

In [32]:
car_make = generic.transport.car()
print(car_make)

Peugeot 605


2. **`manufacturer()`**: Generates a random car manufacturer.

In [33]:
car_maker = generic.transport.manufacturer()
print(car_maker)

Dodge


3. **`airplane()`**: Generates a random airplane model name.

In [34]:
airplane_model_name = generic.transport.airplane()
print(airplane_model_name)

Airbus A319


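### 6.6. Class: mimesis.Text

The `mimesis.Text` class generates text data such as words, sentences, quotes, and answers. The cell below demonstrates several of its methods.

In [35]: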
# Import the necessary classes from Mimesis
from mimesis import Generic
from mimesis.locales import Locale

# Initialize the Generic provider with UK locale and a fixed seed for reproducibility
generic = Generic(locale=Locale.EN_GB, seed='OK')

# --- Section 1: Random Sentence ---
print("### 1. Random Sentence")

# Generate a random sentence
random_sentence = generic.text.sentence()
print(f"Random Sentence: {random_sentence}")

# --- Section 2: Random Word ---
print("\n### 2. Random Word")

# Generate a random word
random_word = generic.text.word()
print(f"Random Word: {random_word}")

# --- Section 3: Random Text (Multiple Sentences) ---
print("\n### 3. Random Text")

# Generate random text consisting of multiple sentences
random_text = generic.text.text()
print(f"Random Text (multiple sentences):\n{random_text}")

# --- Section 4: Random Quote ---
print("\n### 4. Random Quote")

# Generate a random quote (could be used for example datasets in surveys or quotes sections)
random_quote = generic.text.quote()
print(f"Random Quote: {random_quote}")

# --- Section 5: Random Answer  ---
print("\n### 5. An Answer")

# Generates a random answer in the current language
random_answer = generic.text.answer()
print(f"Random Answer:\n{random_answer}")

# --- Section 6: Random Level ---
print("\n### 6. Random Level")

# Generates a word that indicates a level of something
random_level = generic.text.level()
print(f"Random Level:\n{random_level}")

# --- Section 7: Random Long Paragraphs ---
print("\n### 7. Random Paragraph")

# Builds a long paragraph by joining the results of ten text() calls
random_sentences = [generic.text.text() for _ in range(10)]
long_paragraph = " ".join(random_sentences)
print(f"Random Long Paragraph:\n{long_paragraph}")



### 1. Random Sentence
Random Sentence: Messages can be sent to and received from ports, but these messages must obey the so-called "port protocol."

### 2. Random Word
Random Word: checkout

### 3. Random Text
Random Text (multiple sentences):
Haskell features a type system with type inference and lazy evaluation. They are written as strings of consecutive alphanumeric characters, the first character being lowercase. She spent her earliest years reading classic literature, and writing poetry. Make me a sandwich. He looked inquisitively at his keyboard and wrote another sentence.

### 4. Random Quote
Random Quote: Those who refuse to learn from history are condemned to repeat it.

### 5. An Answer
Random Answer:
Yes

### 6. Random Level
Random Level:
high

### 7. Random Paragraph
Random Long Paragraph:
Do you come here often? I don't even care. Haskell features a type system with type inference and lazy evaluation. Haskell is a standardized, general-purpose purely functional programmin


## 7. Generating Fake Datasets

### 7.1 Generating Fake Person Data

In [36]:
import pandas as pd
from mimesis import Person, Locale

# Person instance with Locale.EN_GB for English
person = Person(locale=Locale.EN_GB)
n_rows = 10
personal_data = {
    "First Name": [person.first_name() for _ in range(n_rows)],  # random first names
    "Last Name": [person.last_name() for _ in range(n_rows)],  # random last names
    "Full Name": [person.full_name() for _ in range(n_rows)],  # random full names
    "Gender": [person.gender() for _ in range(n_rows)],  # random genders
    "Age": [person.random.randint(16, 88) for _ in range(n_rows)],  # random ages
    "Email": [person.email() for _ in range(n_rows)],  # random email addresses
    "Phone Number": [person.phone_number() for _ in range(n_rows)],  # random phone numbers
    "Nationality": [person.nationality() for _ in range(n_rows)],  #  random nationalities
    "Occupation": [person.occupation() for _ in range(n_rows)]  # random occupations
}

personal_df = pd.DataFrame(personal_data)
personal_df

| | First Name | Last Name | Full Name | Gender | Age | Email | Phone Number | Nationality | Occupation |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Anthony | Hogan | Neville Gomez | Male | 42 | gear1828@example.org | 056 9807 6526 | Dutch | Riding Instructor |
| 1 | Kaley | Davidson | Epifania Daniel | Male | 50 | arrived2005@yahoo.com | 0800 449 8251 | Cambodian | Taxidermist |
| 2 | Demarcus | Hunt | Crysta Bradshaw | Female | 25 | briefly2090@live.com | 0800 851 3199 | Belgian | Tax Advisor |
| 3 | Tom | Mcknight | Conchita Gross | Female | 37 | franklin2002@outlook.com | 0841 398 2250 | Saudi | Landlord |
| 4 | Zack | Fields | Kenisha Schwartz | Female | 84 | depend1871@example.com | 01951 691569 | Mexican | Research Director |
| 5 | Arlena | Sears | Randy Lynch | Other | 47 | waters1934@yahoo.com | 0800 363 8420 | Afghan | Booking Clerk |
| 6 | Chris | Sykes | Malik Bolton | Female | 36 | previously1912@protonmail.com | 055 5070 0035 | Brazilian | Heating Engineer |
| 7 | Gilberto | Aguirre | Melodi Mcdowell | Male | 75 | tales1846@yandex.com | 0800 219547 | Uruguayan | Purchasing Assistant |
| 8 | Vern | Robbins | Bryan Barton | Female | 64 | compute1881@duck.com | 0306 348 0660 | Dominican | Accounts Staff |
| 9 | Tom | Schultz | Jayson Bond | Female | 50 | boxing1995@protonmail.com | 0121 144 2294 | Chinese | Records Supervisor |


### 7.2 Generating Fake Finance Data

In [37]:
from mimesis import Finance, Locale

# Finance instance with an English (GB) locale
finance = Finance(locale=Locale.EN_GB)

n_rows = 10

financial_data = {
    "Bank Name": [finance.bank() for _ in range(n_rows)],  # Generating bank names
    "Company Name": [finance.company() for _ in range(n_rows)],  # company names
    "Company Type": [finance.company_type() for _ in range(n_rows)],  # company types
    "Cryptocurrency ISO Code": [finance.cryptocurrency_iso_code() for _ in range(n_rows)],  # ISO codes
    "Cryptocurrency Symbol": [finance.cryptocurrency_symbol() for _ in range(n_rows)],  # symbols
    "Currency ISO Code": [finance.currency_iso_code() for _ in range(n_rows)],  # currency ISO codes
    "Currency Symbol": [finance.currency_symbol() for _ in range(n_rows)],  # currency symbols
    "Random Price": [finance.price() for _ in range(n_rows)],  # random prices
    "Price in BTC": [finance.price_in_btc() for _ in range(n_rows)],  # prices in BTC
    "Stock Exchange Name": [finance.stock_exchange() for _ in range(n_rows)],  # stock exchange names
    "Stock Name": [finance.stock_name() for _ in range(n_rows)],  # stock names
    "Stock Ticker": [finance.stock_ticker() for _ in range(n_rows)]  # stock tickers
}


# Use a separate name so the DataFrame does not overwrite the Finance provider
finance_df = pd.DataFrame(financial_data)
finance_df

| | Bank Name | Company Name | Company Type | Cryptocurrency ISO Code | Cryptocurrency Symbol | Currency ISO Code | Currency Symbol | Random Price | Price in BTC | Stock Exchange Name | Stock Name | Stock Ticker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | National Counties Building Society | Brown Grp. | Corporation | ZEC | ₿ | GBP | £ | 959.42 | 1.322527 | HKEX | First Trust CEF Income Opportunity ETF | GF |
| 1 | The Royal Bank of Scotland plc | Vodafone Grp. | Limited Partnership | LTC | ₿ | GBP | £ | 624.83 | 1.546137 | AMEX | Conatus Pharmaceuticals Inc. | YMAB |
| 2 | Royal Bank of Scotland Group plc | Polymetal Int | Incorporated | BCH | Ł | GBP | £ | 1422.3 | 1.970443 | HKEX | Nuveen New York Municipal Value Fund 2 | TRVN |
| 3 | Paragon Banking Group plc | Aveva Grp | Private company limited by guarantee | EOS | ₿ | GBP | £ | 578.8 | 1.710635 | HKEX | Lazydays Holdings | TWO^B |
| 4 | Triodos Bank UK | Atkins | Private company limited by guarantee | XBT | Ł | GBP | £ | 793.18 | 1.732967 | HKEX | Frontier Communications Corporation | ARI |
| 5 | The Bank of Ireland | Bunzl | Private company limited by guarantee | VTC | Ł | GBP | £ | 1128.64 | 0.760252 | NYSE | Tyson Foods | DSS |
| 6 | OneSavings Bank plc | Genesis E.m.f. | Limited Partnership | LTC | Ξ | GBP | £ | 1385.45 | 0.906821 | SSE | WisdomTree Barclays Negative Duration U.S. Agg... | WFC^R |
| 7 | Tesco Bank | Grainger | Limited Liability Partnership | DOT | Ł | GBP | £ | 861.64 | 1.668221 | HKEX | Salarius Pharmaceuticals | ALGRW |
| 8 | Penrith Building Society | Sig | Incorporated | ETH | ₿ | GBP | £ | 692.29 | 0.325308 | NASDAQ | Entera Bio Ltd. | VMD |
| 9 | Felixstowe & Walton United Community Social En... | Witan Inv Tst | Limited Liability Partnership | VTC | Ξ | GBP | £ | 569.56 | 0.710541 | SSE | Westrock Company | F^B |


### 7.3 Generating Fake Uber Data

In [38]:
from mimesis import Generic
import pandas as pd
from datetime import timedelta


# Initialise mimesis instance for data generation
data_generator = Generic(locale=Locale.EN_GB, seed=42)

# Providers
address_provider = data_generator.address
person_provider = data_generator.person
datetime_provider = data_generator.datetime

n_rows = 1000

# Function to generate random payment methods
def generate_payment_method():
    return person_provider.random.choice(["credit card", "cash", "PayPal", "Apple Pay", "Google Pay"])

# Function to generate a random category for the ride
def generate_category():
    return person_provider.random.choice(["business", "personal", "travel"])

# Function to generate random reviews between 1 and 5
def generate_review():
    return round(person_provider.random.uniform(1, 5), 2)

# Function to generate random gender
def generate_gender():
    return person_provider.random.choice(["Male", "Female", "Other"])

# Create an efficient generator for the Uber dataset
def generate_uber_data(n_rows=n_rows):
    # Empty lists to hold the data
    data = {
        "pickup_date": [],
        "drop_date": [],
        "category": [],
        "start_location": [],
        "end_location": [],
        "distance_travelled": [],
        "payment_method": [],
        "fare_amount": [],
        "passenger_count": [],
        "customer_name": [],
        "gender": [],
        "reviews": []
    }
    
    for _ in range(n_rows):
        # Generate pickup and drop dates
        #pickup_time = datetime.now() - timedelta(days=person_provider.random.randint(1, 365))
        pickup_time = datetime_provider.datetime(start=2022, end=2024)
        drop_time = pickup_time + timedelta(minutes=person_provider.random.randint(10, 180))
        
        data["pickup_date"].append(pickup_time.strftime('%Y-%m-%d %H:%M:%S'))
        data["drop_date"].append(drop_time.strftime('%Y-%m-%d %H:%M:%S'))

        # Randomly generate ride category, start and end locations
        data["category"].append(generate_category())
        data["start_location"].append(address_provider.address())
        data["end_location"].append(address_provider.address())

        # Random distance travelled, fare amount, and payment method
        data["distance_travelled"].append(round(person_provider.random.uniform(1, 50), 2))
        data["payment_method"].append(generate_payment_method())
        data["fare_amount"].append(round(person_provider.random.uniform(5, 50), 2))

        # Random passenger count, customer name, gender, and review score
        data["passenger_count"].append(person_provider.random.randint(1, 4))
        data["customer_name"].append(person_provider.full_name())
        data["gender"].append(generate_gender())
        data["reviews"].append(generate_review())
    
    return pd.DataFrame(data)

# Generate the dataset
uber_df = generate_uber_data(n_rows)

uber_df.head()


| | pickup_date | drop_date | category | start_location | end_location | distance_travelled | payment_method | fare_amount | passenger_count | customer_name | gender | reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-02-01 23:17:15 | 2024-02-02 02:10:15 | business | 1310 Blaney Avenue | 564 Colebrooke Grove | 2.23 | PayPal | 16.02 | 2 | Tom Boyd | Other | 2.69 |
| 1 | 2022-12-04 21:47:57 | 2022-12-04 22:04:57 | business | 286 Bernisk Walk | 179 Irish Road | 11.71 | Google Pay | 32.09 | 2 | Jazmine Hunt | Other | 2.11 |
| 2 | 2022-10-14 01:01:05 | 2022-10-14 01:12:05 | business | 66 Ballycrummy Circle | 448 Clonfeacle Terrace | 35.21 | PayPal | 17.5 | 2 | Crysta Bradshaw | Female | 1.39 |
| 3 | 2022-09-20 00:35:12 | 2022-09-20 02:13:12 | travel | 1233 Ballyclander Gardens | 1331 Linsfort Walk | 13.96 | credit card | 37.84 | 1 | Charis Martin | Female | 4.32 |
| 4 | 2024-12-18 13:14:28 | 2024-12-18 16:02:28 | personal | 860 Clanmorris Run | 1207 Craigatoke Alley | 29.29 | credit card | 7.06 | 2 | Charlesetta Stevenson | Male | 4.47 |
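In the dataset above, `fare_amount` and `distance_travelled` are drawn independently, so a ride of 2.23 can cost more (16.02) than a ride of 29.29 (7.06). If the test data needs a plausible relationship between the two, one option is to derive the fare from the distance. The sketch below uses only the standard library; `BASE_FARE`, `RATE_PER_UNIT`, and the noise range are hypothetical values chosen for illustration, not real pricing.

```python
import random

BASE_FARE = 2.50       # hypothetical flat charge per ride
RATE_PER_UNIT = 1.20   # hypothetical charge per unit of distance


def make_ride(rng: random.Random) -> dict:
    """Generate a distance, then derive a fare that grows with it."""
    distance = round(rng.uniform(1, 50), 2)
    noise = rng.uniform(0.9, 1.1)  # up to 10% variation either way
    fare = round((BASE_FARE + RATE_PER_UNIT * distance) * noise, 2)
    return {"distance_travelled": distance, "fare_amount": fare}


rng = random.Random(42)
rides = [make_ride(rng) for _ in range(5)]
for ride in rides:
    print(ride)
```

The same pattern applies to any pair of columns that should move together, such as trip duration and distance.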


### 7.4 Generating Fake Population Data

In [39]:
import datetime

import numpy as np
import pandas as pd
from mimesis import Generic
from mimesis.enums import TitleType
from mimesis.locales import Locale

# Initialise mimesis Generic provider
data_generator = Generic(locale=Locale.EN_GB)

# Providers
address_provider = data_generator.address
person_provider = data_generator.person

# Helper function to generate fake elector data
def generate_elector_data(n_rows):
    data = {
        "Local_Authority": [],
        "Postcode": [],
        "Address_Line_2": [],
        "Address_Line_3": [],
        "Address_Line_5": [],
        "Last_Name": [],
        "Middlenames": [],
        "First_Name": [],
        "Title": [],
        "guid": [],
        "SOURCE_FILE": [],
        "Address_Line_1": [],
        "DOB": [],
        "TIME_STAMP": []
    }

    for i in range(n_rows):
        # Generate address and personal data
        data["Local_Authority"].append(address_provider.region())
        data["Postcode"].append(address_provider.postal_code())
        data["Address_Line_2"].append(address_provider.street_name())
        data["Address_Line_3"].append(address_provider.city())
        data["Address_Line_5"].append(address_provider.country())
        
        # Generate names
        data["Last_Name"].append(person_provider.last_name())
        data["Middlenames"].append(person_provider.name())
        data["First_Name"].append(person_provider.name())
        data["Title"].append(person_provider.title(title_type=TitleType.TYPICAL))
        
        # Identifier for each entry: a random password string stands in for a GUID here
        data["guid"].append(person_provider.password())
        
        # Source file: placeholder for now, filled with a random height value
        data["SOURCE_FILE"].append(person_provider.height())
        
        # Address line 1 - Random number as address number
        data["Address_Line_1"].append(person_provider.random.randint(1, 200))

        # Date of Birth: Randomly generate between ages 18-99
        dob_year = person_provider.random.randint(1925, 2007)
        dob_month = person_provider.random.randint(1, 12)
        dob_day = person_provider.random.randint(1, 28)  # Simplified for valid day generation
        data["DOB"].append(f"{dob_year}-{dob_month:02d}-{dob_day:02d}")
        
        # Timestamp of data generation
        data["TIME_STAMP"].append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))

    return pd.DataFrame(data)

# Generate fake data
n_rows = 300
df = generate_elector_data(n_rows)

# Add Address_Line_4 as a copy of Local_Authority for certain rows
same_street_rows = 10
for i in range(same_street_rows):
    # Ensure the last rows share the same location info
    df.at[n_rows-(i+1), 'Address_Line_4'] = df.at[n_rows-1, 'Local_Authority']
    df.at[n_rows-(i+1), 'Address_Line_3'] = df.at[n_rows-1, 'Address_Line_3']
    df.at[n_rows-(i+1), 'Address_Line_2'] = df.at[n_rows-1, 'Address_Line_2']
    df.at[n_rows-(i+1), 'Postcode'] = df.at[n_rows-1, 'Postcode']

# Adding extra columns for address continuity
df['Address_Line_6'] = np.nan
df['Address_Line_7'] = np.nan
df['Address_Line_8'] = np.nan
df['Address_Line_9'] = np.nan
df['F1'] = np.nan
df['ID'] = np.arange(n_rows)

# Reorganise columns
df = df[[
    'F1', 'SOURCE_FILE', 'TIME_STAMP', 'Local_Authority', 'Postcode','Address_Line_1','Address_Line_2',
    'Address_Line_3','Address_Line_4', 'Address_Line_5', 'Address_Line_6','Address_Line_7','Address_Line_8', 
    'Address_Line_9','Last_Name', 'Middlenames', 'First_Name', 'Title', 'DOB', 'guid'     
]]


df.head()


| | F1 | SOURCE_FILE | TIME_STAMP | Local_Authority | Postcode | Address_Line_1 | Address_Line_2 | Address_Line_3 | Address_Line_4 | Address_Line_5 | Address_Line_6 | Address_Line_7 | Address_Line_8 | Address_Line_9 | Last_Name | Middlenames | First_Name | Title | DOB | guid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 1.51 | 2025-01-22 16:23:32 | County Londonderry | AH2T 6XC | 51 | Egeria | Barrhead | | Antarctica | | | | | Mullins | Anthony | Kaley | Mr. | 1994-07-08 | ("@iNcuV |
| 1 | | 1.86 | 2025-01-22 16:23:32 | County Armagh | FN0F 6OF | 138 | Invergourie | Luton | | St. Lucia | | | | | Hunt | Aleen | Neville | Miss | 1940-07-03 | o`Fij<4. |
| 2 | | 1.69 | 2025-01-22 16:23:32 | Avon | TE4H 2TC | 54 | Druminiskill | Charlbury | | Guadeloupe | | | | | Martin | Melodi | Bryan | Master | 1959-12-22 | vBhvjA7I |
| 3 | | 1.9 | 2025-01-22 16:23:32 | Norfolk | PU7N 9JO | 103 | Killylane | Redhill | | Mongolia | | | | | Newman | Edmundo | Dusty | Madam | 1959-02-07 | z]#uE+f: |
| 4 | | 2.0 | 2025-01-22 16:23:32 | Essex | SB2H 0GC | 131 | Craigatempin | Wednesfield | | Niue | | | | | Wallace | Jaimee | Oda | Ms. | 1988-02-25 | yx0y2[Lu |
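Two columns in this dataset use stand-ins: `guid` is filled with password strings, and `DOB` caps the day at 28 to avoid invalid dates. If GUID-shaped identifiers and full-month dates are wanted, the standard library can provide both while staying reproducible. The sketch below is one possible approach; `random_guid` and `random_dob` are hypothetical helpers, and the year range simply mirrors the one used in the cell above.

```python
import calendar
import datetime as dt
import random
import uuid


def random_guid(rng: random.Random) -> str:
    """Build a version-4 UUID from the seeded generator's random bits."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def random_dob(rng: random.Random, start_year: int = 1925,
               end_year: int = 2007) -> dt.date:
    """Pick a date of birth, allowing every valid day of the chosen month."""
    year = rng.randint(start_year, end_year)
    month = rng.randint(1, 12)
    last_day = calendar.monthrange(year, month)[1]  # 28, 29, 30 or 31
    return dt.date(year, month, rng.randint(1, last_day))


rng = random.Random(42)
for _ in range(3):
    print(random_guid(rng), random_dob(rng).isoformat())
```

Seeding `random.Random` keeps the identifiers repeatable between runs, which `uuid.uuid4()` on its own would not.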


## 8. Conclusion

In this notebook, we explored how to generate synthetic data using the `mimesis` library. We covered various classes, including `Person`, `Datetime`, `Finance`, `Address`, `Transport`, and `Text`. These classes offer a rich set of features to generate realistic data for testing machine learning models or simulating real-world datasets.

Feel free to explore and modify the code to suit your data generation needs!


This notebook can be expanded with additional classes, more detailed data generation examples, and use cases, depending on the specific needs of the users.

## References


1. [Mimesis API](https://mimesis.name/v12.1.1/api.html)
2. [Medium](https://medium.com/@tubelwj/mimesis-a-python-library-for-generating-test-sample-data-7809d894cbd9)
3. [Getting Started with Mimesis: A Modern Approach to Synthetic Data Generation](https://www.statology.org/getting-started-mimesis-modern-approach-synthetic-data-generation/)