### The 'Faker' Library

This library can be quite useful when you test your code or need to generate synthetic data sets.

Often our code works with data, and we want to test it against that data. However, the data we have available is sometimes "real" data that may contain confidential or sensitive information (so-called PII, personally identifiable information).

We still want to test our code, and we usually keep the tests and the data they need in the code repository. But putting real client data or sensitive data there should be a big no-no. Or maybe we just want to test our code against a variety of data that our available "real" data does not cover (internationalization issues, for example).

Instead, we have to build up fake data that has similar properties to the real data but is completely synthetic.

That's where the `Faker` library can be helpful.

> Installation: You can pip install `Faker`, or simply use the `Pipfile` provided in this repo.
>
> ```bash
> pip install Faker
> ```

In this notebook we'll take a quick look at this library and some of the functionality it offers to construct synthetic datasets.

#### Seeding

The `Faker` library uses a random number generator to produce its sequences of values, and we can set the `seed` to obtain repeatable results. This is especially important when writing unit tests, where we may need the synthetic data to remain consistent from run to run (although we often generate the synthetic data once and store it somewhere).

`Faker` mostly uses generators to generate those sequences of data.

In [1]:
from faker import Faker

In [2]:
Faker.seed(0)

fake = Faker()
for _ in range(10):
    print(fake.name())


Norma Fisher
Jorge Sullivan
Elizabeth Woods
Susan Wagner
Peter Montgomery
Theodore Mcgrath
Stephanie Collins
Stephanie Sutton
Brian Hamilton
Susan Levy


If we reset the seed, we'll get the same sequence of names:

In [3]:
Faker.seed(0)

fake = Faker()
for _ in range(10):
    print(fake.name())

Norma Fisher
Jorge Sullivan
Elizabeth Woods
Susan Wagner
Peter Montgomery
Theodore Mcgrath
Stephanie Collins
Stephanie Sutton
Brian Hamilton
Susan Levy


You can find docs for all the various categories of synthetic data (called "providers") [here](https://faker.readthedocs.io/en/master/providers.html).

#### Person Names

Several functions are available to generate synthetic names. 

We saw `name` in the previous example, but there are many others:
- `first_name`
- `first_name_female`
- `first_name_male`
- `last_name`
- `name_female`
- `name_male`
- `prefix`
- `prefix_female`
- `prefix_male`

and more...

In [4]:
Faker.seed(0)
faker = Faker()

for _ in range(10):
    print(
        faker.prefix_female(), 
        faker.first_name_female(), 
        faker.last_name()
    )

Dr. Norma Fisher
Mrs. Kayla Sullivan
Dr. Elizabeth Woods
Ms. Susan Wagner
Mrs. Nicole Montgomery
Mrs. Susan Mcgrath
Dr. Stephanie Collins
Dr. Stephanie Sutton
Mrs. Ashlee Hamilton
Miss Susan Levy


#### Addresses

See [here](https://faker.readthedocs.io/en/master/providers/faker.providers.address.html) for all available generators for addresses.

There are lots of ways to generate addresses, at various levels of granularity, including:

- `address` (full address, including newline characters for separate address lines)
- `street_address`
- `building_number`
- `street_name`
- `postcode`
- `city`
- `country` or `country_code`

and more...

In [5]:
Faker.seed(0)
for _ in range(10):
    print(faker.address())
    print('=' * 20)

48764 Howard Forge Apt. 421
Vanessaside, PA 19763
578 Michael Island
New Thomas, NC 34644
60975 Jessica Squares
East Sallybury, FL 71671
8714 Mann Plaza
Lisaside, PA 72227
96593 White View Apt. 094
Jonesberg, FL 05565
848 Melissa Springs Suite 947
Kellerstad, MD 80819
30413 Norton Isle Suite 012
North Lisa, ND 79428
39916 Mitchell Crescent
New Andrewburgh, DE 63315
086 Mary Cliff
North Deborah, NE 24135
91634 Strong Mountains Apt. 302
West Alyssa, PA 58475


You'll notice that all these addresses (at least for me) are US based addresses - more on locales in a bit.

In [6]:
Faker.seed(0)

for _ in range(10):
    print(faker.building_number(), faker.street_name(), faker.city(), faker.postcode(), faker.country())

6048 Sullivan Tunnel Tammystad 76966 Uganda
82421 Archer Place West Corey 10166 United Arab Emirates
578 Michael Island New Thomas 68835 Eritrea
80160 Clayton Isle Lake Mark 08756 Comoros
332 Davis Island Rodriguezside 38654 Romania
85839 Wallace Ranch Stewartbury 38555 Colombia
20947 Christopher Throughway East Sandra 11019 Ukraine
868 Boyd Freeway Lake Brittany 55475 Morocco
7751 Salazar Oval Meganbury 02625 Saint Martin
1352 Simmons Circle Port Dustinbury 06429 Svalbard & Jan Mayen Islands


As you can see, some of this data is not limited to US values (like the countries), although it does look like the postal codes are US postal codes, and the street names are all in English.

#### Locales

Some of the synthetic data Faker generates can be tied to specific locale(s).

To do this, we need to let Faker know which locale or locales to use.

Also, note that not all providers support all locales - to see what locales are available, and what providers are supported for each locale, see [here](https://faker.readthedocs.io/en/master/locales.html)

For example, we can see that the `hi_IN` locale supports `address`, `date_time`, `person` and `phone_number`, while the `fr_FR` locale supports a lot more (you can always contribute to the project if you want to!)

Let's try out the `hi_IN` locale:

In [7]:
Faker.seed(0)
faker = Faker(['hi_IN'])

for _ in range(10):
    print(faker.address())
    print('=' * 20)

4876 विकावि
फतहपुर-824219
1/1 लोकनाट्यों
बचेली-593877
1609 ड़ाल
देहरा-139332
587 रतन महादेव
असम-858398
965 प्रणव चौधरी
विस्तारण-094711
1868 मदन
लखनऊ 969477
91/79 चौहान
अगरतला-135256
30 रेयांश बादामी
फतेहपुर 615109
17 अयांश हासन
चिपलुन 914131
20/870 कान्ती छाबरा
बचेली-923022


And with the person provider:

In [8]:
for _ in range(10):
    print(faker.name())

ललित बादामी
नाम, अमर
सुलभा हुसैन
विष्णु कुमार
ज्योत्सना मंडल
आव्या महाजन
शर्मा, शान्ता
गावित, अभिलाषा
अहलुवालिया, प्रभाकर
कालिदास विकावि


According to the library's documentation we can specify multiple locales and the sequences of values (as long as they are supported by the locale) should be randomly selected from each locale as a whole.

Unfortunately there seems to be a bug in the more recent version of the library, where this is not working (see [issue #1656](https://github.com/joke2k/faker/issues/1656)) which currently remains unresolved.

Here, I'll show you a workaround to this problem. It's not pretty, but it works!

We can specify multiple locales by passing them as a list when creating an instance of the `Faker` class:

In [9]:
fake = Faker(['en_US', 'fr_FR', 'es_ES'])
Faker.seed(0)
for _ in range(10):
    print(fake.name())

Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier
Alice Poirier


As you can see though, this is not working as expected. So, we have something that works for single locales, but not for multiple locales. (It seems that setting the seed triggers this bug: if you don't set a seed, or if you call `Faker.seed()` without an argument, things appear to work fine.)

To get around this, we'll use Python's `random.choice` function to pick the "next" value from a list of single-locale providers.

In [10]:
locales = ['en_US', 'fr_FR', 'es_ES']
providers = [Faker(locale) for locale in locales]

In [11]:
from random import choice, choices, seed

In [12]:
provider = lambda: choice(providers)

This `provider` function will give us a random provider, which we can use to get the next value:

In [13]:
Faker.seed(0)
seed(0)
for _ in range(10):
    print(provider().name())

Paulette Fournier
Pierre-Patrick Lebon
Sandra Faulkner
Gabrielle Bodin
Epifanio Chaves Bustamante
Jeannine Rossi de la Ruiz
Auguste Pottier
Sylvie Vaillant
Joseph Gay
Emmanuelle Courtois


We can modify this slightly to give us the ability to **weight** the choice of locale providers. We simply use Python's `random.choices` function, which allows us to specify weights:

In [14]:
locales = ['hi_IN', 'en_US']
weights = [1, 2]
providers = [Faker(locale) for locale in locales]
provider = lambda: choices(providers, weights=weights, k=1)[0]

In [15]:
Faker.seed(0)
seed(0)
for _ in range(10):
    print(provider().name())

Norma Fisher
Jorge Sullivan
Elizabeth Woods
डानी, ज़ोया
Charles Davis
Victoria Patel
Lindsay Thomas
किआन हुसैन
Justin Gomez
Martin Harris
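
To get a feel for what the weights `[1, 2]` mean, the sketch below draws a few thousand locale picks with `choices` and tallies them. It uses only the standard library; the sample size of 3000 and the local `Random(0)` generator are arbitrary choices for illustration.

```python
from collections import Counter
from random import Random

locales = ["hi_IN", "en_US"]
weights = [1, 2]

# A local generator, so the module-level random state is left untouched.
rng = Random(0)
picks = rng.choices(locales, weights=weights, k=3000)
counts = Counter(picks)

# With weights 1:2 we expect roughly one third hi_IN and two thirds en_US.
for locale in locales:
    print(locale, round(counts[locale] / len(picks), 3))
```

The weights are relative, so `[1, 2]` and `[10, 20]` describe the same distribution.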


#### Unique Values

The functions we have used so far do not guarantee that the same value (be it a name, an address, or the output of any other provider function) will not be returned more than once. When we generate large data sets, this can easily happen. You may want to guarantee that a particular sequence of values contains only unique values.

Faker provides the ability to do this, by using the `unique` attribute of a Faker instance.

For example to guarantee unique SSNs:

In [16]:
Faker.seed(0)
faker = Faker()
for _ in range(10):
    print(faker.unique.ssn())

865-50-6891
042-34-8377
498-52-4970
489-46-9559
224-65-2282
289-18-1554
634-33-8726
723-78-2408
318-13-1209
871-88-5410


Now, it may be that the number of values you are building will exceed the available possibilities for a particular provider generator - in which case Faker will raise a `UniquenessException`.

In [17]:
from faker.exceptions import UniquenessException

We know that there should be a limited number of countries available, so given a sufficiently large number of requests, we should, at some point, expect the unique list of countries to become exhausted:

In [18]:
faker = Faker()
try:
    for _ in range(244):
        faker.unique.country()
except UniquenessException:
    print('Not enough unique values available')

Not enough unique values available


But it will work just fine with fewer than 244 countries:

In [19]:
faker = Faker()
try:
    for _ in range(243):
        faker.unique.country()
except UniquenessException:
    print('Not enough unique values available')

And we can easily show the values are indeed unique:

In [20]:
faker = Faker()
countries = [faker.unique.country() for _ in range(243)]
len(countries) == len(set(countries))

True

Since the set and the original list have the same length, the list did not contain any duplicates.

#### Other Providers

There are many more providers available - just see the docs I linked earlier for more info.

In [21]:
Faker.seed(0)
faker = Faker()

print('ssn:', faker.ssn())
print('color name:', faker.color_name())
print('isbn-10:', faker.isbn10())
print('company email:', faker.company_email())
print('bank IBAN:', faker.iban())
print('User agent:', faker.user_agent())

ssn: 865-50-6891
color name: BlueViolet
isbn-10: 1-64759-382-4
company email: donald19@archer-patel.org
bank IBAN: GB17EJDX15781565938778
User agent: Mozilla/5.0 (compatible; MSIE 5.0; Windows NT 10.0; Trident/5.0)


#### The Python Provider

There is also one very interesting provider, called the `python` provider - docs for this provider are [here](https://faker.readthedocs.io/en/master/providers/faker.providers.python.html).

We can use it to generate Python objects, such as lists, dictionaries, etc.

In [22]:
Faker.seed(0)
faker = Faker()

for _ in range(10):
    data = faker.pylist(
        nb_elements=6,
        variable_nb_elements=False,
        value_types=['bool', 'int'],
    )
    print(data)

[6890, True, 6634, 7808, 9558, False]
[2289, True, True, False, 7735, True]
[5180, True, 8541, 1020, False, 18]
[5458, True, False, False, True, False]
[8322, 1786, 9031, 2044, 8852, True]
[1501, 5194, True, False, False, 7807]
[False, False, False, 8594, 8549, False]
[9497, 7382, 5855, True, True, 3119]
[False, 1919, True, True, 1018, False]
[False, False, False, True, True, False]


You have other value type choices available as well (also passed as strings), such as `email`, `uri`, `date_time`, `decimal`, etc.

In fact, the `value_types` argument can reference other provider functions:

In [23]:
faker.pylist(nb_elements=10, value_types=['name', 'hostname'])

['Wesley Robbins',
 'srv-00.coleman.com',
 'Hayley White',
 'db-91.mcneil.info',
 'Christina Saunders',
 'Gloria King',
 'email-20.maldonado-mccullough.net',
 'email-56.king.com',
 'email-08.taylor-gill.info',
 'lt-45.arroyo.com',
 'email-61.wheeler.com',
 'Mark Gray',
 'db-69.barton-fletcher.com']

Another interesting one is the `pystr_format` function, where we can specify a format of mixed characters and digits.

We use a template string to indicate the output string format we want, using `#` as a placeholder for a single digit and `?` as a placeholder for a single letter.

In [24]:
Faker.seed(0)
faker = Faker()

template = 'x-??##-###-?'
for _ in range(10):
    print(faker.pystr_format(template))

x-Fz66-048-Y
x-Gi47-593-s
x-TZ21-948-M
x-EJ24-115-g
x-JE56-593-C
x-fU84-080-z
x-Uu09-753-T
x-Zj13-933-Z
x-GF87-115-g
x-vI48-418-n


We can also limit the character set to choose from:

In [25]:
Faker.seed(0)
faker = Faker()

template = 'x-??##-###-?'
for _ in range(10):
    print(faker.pystr_format(template, 'abcdefg'))

x-dd66-048-g
x-eb47-593-c
x-fg21-948-e
x-de24-115-a
x-ed56-593-d
x-af84-080-g
x-bf60-975-c
x-gb13-933-g
x-ed87-115-a
x-cg48-418-e


Note: this `pystr_format` is actually using another function called `bothify`, documented [here](https://faker.readthedocs.io/en/master/providers/baseprovider.html), along with other functions like `hexify`, `lexify`, `numerify`, etc.

In [26]:
Faker.seed(0)
faker = Faker()
template = 'x-??##-###-?'
for _ in range(10):
    print(faker.bothify(template, 'abcdefg'))

x-dd66-048-g
x-eb47-593-c
x-fg21-948-e
x-de24-115-a
x-ed56-593-d
x-af84-080-g
x-bf60-975-c
x-gb13-933-g
x-ed87-115-a
x-cg48-418-e


One notable difference between `pystr_format` and `bothify` is that `pystr_format` is more versatile: its template can also reference other provider functions, using double curly braces:

In [27]:
Faker.seed(0)
faker = Faker()

template = "Customer: {{name}} ({{email}})"
for _ in range(10):
    print(faker.pystr_format(template))

Customer: Norma Fisher (tammy766example.com)
Customer: Susan Wagner (donald19example.com)
Customer: Nicholas Nolan (thomas154example.com)
Customer: Karen Grimes (bryan801example.org)
Customer: Samantha Cook (jane13example.net)
Customer: Walter Pratt (udavis2example.net)
Customer: Eddie Martinez (lisa838example.net)
Customer: Robert Stewart (kellylopezexample.org)
Customer: Mary Alvarez (sheltondavidexample.org)
Customer: Stephanie Leblanc (hmasseyexample.com)


As you can see there are **lots** of options available!

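Note: look closely at those emails. They are not quite correct, as they are all missing the `@` character. It looks like a bug, but the likely cause is that `pystr_format` passes the rendered string through `bothify`, whose digit-substitution step treats `@` as a placeholder too (it is replaced by a random non-zero digit or by nothing at all).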

#### Even More Providers

If the built-in providers are not sufficient, there are also community built providers available.

For example, you can pip install `faker_airtravel` (included in this repo's `Pipfile`) for air travel related synthetic data.

In [28]:
from faker_airtravel import AirTravelProvider

We can use this additional provider by adding it to a Faker instance's providers:

In [29]:
Faker.seed(0)
faker = Faker()
faker.add_provider(AirTravelProvider)

And now we can use it like any other provider:

In [30]:
faker.airport_object()

{'airport': 'Fort Lauderdale Hollywood International airport',
 'iata': 'FLL',
 'icao': 'KFLL',
 'city': 'Dania Beach',
 'state': 'Florida',
 'country': 'United states'}

In [31]:
faker.flight()

{'airline': 'Qatar Airways',
 'origin': {'airport': 'Pucallpa airport',
  'iata': 'PCL',
  'icao': 'SPCL',
  'city': 'Callaria',
  'state': 'Ucayali',
  'country': 'Peru'},
 'destination': {'airport': 'Tancredo Neves International airport',
  'iata': 'CNF',
  'icao': 'SBCF',
  'city': 'Confins',
  'state': 'Minas Gerais',
  'country': 'Brazil'},
 'stops': 2,
 'price': 973}

There is a lot more to the `Faker` library, so feel free to explore it. It is fairly powerful, but the documentation can be a bit sketchy at times, and I often end up looking at the source code (located [here](https://github.com/joke2k/faker)) when I need to dig deeper.

While producing this video, I ran across another synthetic data library called [Mimesis](https://mimesis.name/en/master/getting_started.html), which I have not used before, but it looks pretty powerful, with some features that Faker does not have. It will be interesting to see whether that library has fewer bugs than Faker. So, I'll take a look at it and report back!