## Generating Random Data in Python (Guide)

https://realpython.com/python-random/

https://docs.python.org/3/library/secrets.html#module-secrets

How random is random? This is a weird question to ask, but it is one of paramount importance in cases where information security is concerned. Whenever you’re generating random data, strings, or numbers in Python, it’s a good idea to have at least a rough idea of how that data was generated.

Here, you’ll cover a handful of different options for generating random data in Python, and then build up to a comparison of each in terms of its level of security, versatility, purpose, and speed.

I promise that this tutorial will not be a lesson in mathematics or cryptography, which I wouldn’t be well equipped to lecture on in the first place. You’ll get into just as much math as needed, and no more.

How Random Is Random?

First, a prominent disclaimer is necessary. Most random data generated with Python is not fully random in the scientific sense of the word. Rather, it is pseudorandom: generated with a pseudorandom number generator (PRNG), which is essentially any algorithm for generating seemingly random but still reproducible data.

“True” random numbers can be generated by, you guessed it, a true random number generator (TRNG). One example is to repeatedly pick up a die off the floor, toss it in the air, and let it land how it may.

Assuming that your toss is unbiased, you have truly no idea what number the die will land on. Rolling a die is a crude form of using hardware to generate a number that is not deterministic whatsoever. (Or, you can have the dice-o-matic do this for you.) TRNGs are out of the scope of this article but worth a mention nonetheless for comparison’s sake.

PRNGs, usually done with software rather than hardware, work slightly differently. Here’s a concise description:

They start with a random number, known as the seed, and then use an algorithm to generate a pseudo-random sequence of bits based on it. 

You’ve likely been told to “read the docs!” at some point. Well, those people are not wrong. Here’s a particularly notable snippet from the random module’s documentation that you don’t want to miss:

    Warning: The pseudo-random generators of this module should not be used for security purposes

You’ve probably seen random.seed(999), random.seed(1234), or the like, in Python. This function call is seeding the underlying random number generator used by Python’s random module. It is what makes subsequent calls to generate random numbers deterministic: input A always produces output B. This blessing can also be a curse if it is used maliciously.

Perhaps the terms “random” and “deterministic” seem like they cannot exist next to each other. To make that clearer, here’s an extremely trimmed down version of random() that iteratively creates a “random” number by using x = (x * 3) % 19. x is originally defined as a seed value and then morphs into a deterministic sequence of numbers based on that seed:

In [1]:
class NotSoRandom(object):
    def seed(self, a=3):
        """Seed the world's most mysterious random number generator."""
        self.seedval = a
    def random(self):
        """Look, random numbers!"""
        self.seedval = (self.seedval * 3) % 19
        return self.seedval

_inst = NotSoRandom()
seed = _inst.seed
random = _inst.random


In [2]:
seed(1234)
[random() for _ in range(10)]

[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]

In [3]:
seed(1234)
[random() for _ in range(10)]

[16, 10, 11, 14, 4, 12, 17, 13, 1, 3]

In [6]:
(1234*3)%19, (16*3)%19, (10*3)%19

(16, 10, 11)
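One consequence of this determinism is easy to check: as long as the seed is not a multiple of 19, x can only ever take the values 1 through 18, so the toy generator has to start repeating itself quickly. Here is a minimal sketch of that idea. The helper name toy_sequence is made up for illustration and simply re-implements the same update rule as a plain function:

```python
def toy_sequence(seed: int, n: int) -> list:
    """Return the first n outputs of x = (x * 3) % 19, starting from seed."""
    x = seed
    out = []
    for _ in range(n):
        x = (x * 3) % 19
        out.append(x)
    return out

seq = toy_sequence(1234, 36)
first_cycle, second_cycle = seq[:18], seq[18:]

print(seq[:10])                     # [16, 10, 11, 14, 4, 12, 17, 13, 1, 3]
print(first_cycle == second_cycle)  # True: the outputs repeat every 18 steps
print(len(set(seq)))                # 18 distinct values, then the cycle restarts
```

Real PRNGs follow the same pattern of "state in, next state out," just with a vastly longer period before the sequence repeats.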

What Is “Cryptographically Secure?”

If you haven’t had enough with the “RNG” acronyms, let’s throw one more into the mix: a CSPRNG, or cryptographically secure PRNG. CSPRNGs are suitable for generating sensitive data such as passwords, authenticators, and tokens. Given a random string, there is realistically no way for Malicious Joe to determine what string came before or after that string in a sequence of random strings.

One other term that you may see is entropy. In a nutshell, this refers to the amount of randomness introduced or desired. For example, one Python module that you’ll cover here defines DEFAULT_ENTROPY = 32, the number of bytes to return by default. The developers deem this to be “enough” bytes to be a sufficient amount of noise.

A key point about CSPRNGs is that they are still pseudorandom. They are engineered in some way that is internally deterministic, but they add some other variable or have some property that makes them “random enough” to prohibit backing into whatever function enforces determinism.

What You’ll Cover Here

In practical terms, this means that you should use plain PRNGs for statistical modeling, simulation, and to make random data reproducible. They’re also significantly faster than CSPRNGs, as you’ll see later on. Use CSPRNGs for security and cryptographic applications where data sensitivity is imperative.

In addition to expanding on the use cases above, in this tutorial, you’ll delve into Python tools for using both PRNGs and CSPRNGs:

    PRNG options include the random module from Python’s standard library and its array-based NumPy counterpart, numpy.random.
    Python’s os, secrets, and uuid modules contain functions for generating cryptographically secure objects.

You’ll touch on all of the above and wrap up with a high-level comparison.

The random Module

Probably the most widely known tool for generating random data in Python is its random module, which uses the Mersenne Twister PRNG algorithm as its core generator.

Earlier, you touched briefly on random.seed(), and now is a good time to see how it works. First, let’s build some random data without seeding. The random.random() function returns a random float in the interval [0.0, 1.0). The result will always be less than the right-hand endpoint (1.0). This is also known as a semi-open range:

In [9]:
import random
random.random(), random.random()

(0.44642411981601304, 0.8024881910580238)

If you run this code yourself, I’ll bet my life savings that the numbers returned on your machine will be different. The default when you don’t seed the generator is to use your current system time or a “randomness source” from your OS if one is available. Seed the generator with random.seed(), though, and the calls that follow produce the same trail of data every time:

In [11]:
random.seed(444), random.random(), random.random()

(None, 0.3088946587429545, 0.01323751590501987)

In [12]:
random.seed(444), random.random(), random.random()

(None, 0.3088946587429545, 0.01323751590501987)
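The same reproducibility applies to standalone generator objects, which can be handy when you don’t want to touch the module-level generator. The following is a small sketch rather than a recommended pattern, and the variable names are only illustrative:

```python
import random

# Two independent generators seeded with the same value
gen_a = random.Random(444)
gen_b = random.Random(444)

draws_a = [gen_a.random() for _ in range(3)]
draws_b = [gen_b.random() for _ in range(3)]
print(draws_a == draws_b)  # True

# Saving and restoring the internal state replays the stream from that point
state = gen_a.getstate()
next_values = [gen_a.randint(0, 9) for _ in range(5)]
gen_a.setstate(state)
replayed = [gen_a.randint(0, 9) for _ in range(5)]
print(next_values == replayed)  # True
```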

Now for a more practical use: generating a set of unique random strings of uniform length. It can help to think about the design of the function first. You need to choose from a “pool” of characters such as letters, numbers, and/or punctuation, combine these into a single string, and then check that this string has not already been generated. A Python set works well for this type of membership testing:

In [13]:
import string

def unique_strings(k: int, ntokens: int,
               pool: str=string.ascii_letters) -> set:
    """Generate a set of unique string tokens.

    k: Length of each token
    ntokens: Number of tokens
    pool: Iterable of characters to choose from

    For a highly optimized version:
    https://stackoverflow.com/a/48421303/7954504
    """

    seen = set()

    # An optimization for tightly-bound loops:
    # Bind these methods outside of a loop
    join = ''.join
    add = seen.add

    while len(seen) < ntokens:
        token = join(random.choices(pool, k=k))
        add(token)
    return seen


In [16]:
unique_strings(k=4, ntokens=5)

{'MduV', 'ZeeN', 'dJVs', 'nlSm', 'ppQZ'}

In [17]:
unique_strings(5, 4, string.printable)

{'\rgA)l', '%)nFC', "'aqZ-", 'WbZ0t'}

With NumPy’s numpy.random, you can also generate two time series that are correlated but still random.

## CSPRNGs in Python

os.urandom(): About as Random as It Gets

Python’s os.urandom() function is used by both secrets and uuid (both of which you’ll see here in a moment). Without getting into too much detail, os.urandom() generates operating-system-dependent random bytes that can safely be called cryptographically secure:

On Unix operating systems, it reads random bytes from the special file /dev/urandom, which in turn “allow[s] access to environmental noise collected from device drivers and other sources.” (Thank you, Wikipedia.) This is garbled information that is particular to your hardware and system state at an instant in time but at the same time sufficiently random.

On Windows, the Windows API function CryptGenRandom() is used (BCryptGenRandom() as of Python 3.11). This function is still technically pseudorandom, but it works by generating a seed value from variables such as the process ID, memory status, and so on.

With os.urandom(), there is no concept of manually seeding. While still technically pseudorandom, this function better aligns with how we think of randomness. The only argument is the number of bytes to return:

In [22]:
import os
os.urandom(3)

b'/\xec\x0c'

Before we go any further, this might be a good time to delve into a mini-lesson on character encoding. Many people, including myself, have some type of allergic reaction when they see bytes objects and a long line of \x characters. However, it’s useful to know how sequences such as the ones shown here eventually get turned into strings or numbers.

In [24]:
os.urandom(6)

b']Z\xb5\xb8\x8f\xd6'

First, recall one of the fundamental concepts of computing, which is that a byte is made up of 8 bits. You can think of a bit as a single digit that is either 0 or 1. A byte effectively chooses between 0 and 1 eight times, so both 01101100 and 11110000 could represent bytes. Try this, which makes use of Python f-strings introduced in Python 3.6, in your interpreter:

In [26]:
binary = [f'{i:0>8b}' for i in range(256)]
binary[:4]

['00000000', '00000001', '00000010', '00000011']

This is equivalent to [bin(i) for i in range(256)], with some special formatting. bin() converts an integer to its binary representation as a string.

Where does that leave us? Using range(256) above is not a random choice. (No pun intended.) Given that we are allowed 8 bits, each with 2 choices, there are 2 ** 8 == 256 possible bytes “combinations.”

This means that each byte maps to an integer between 0 and 255. In other words, we would need more than 8 bits to express the integer 256. You can verify this by checking that len(f'{256:0>8b}') is now 9, not 8.
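If you want to check the arithmetic yourself, here is a short sketch using only built-ins. It assumes nothing beyond what the paragraphs above describe:

```python
# 8 bits give 2 ** 8 distinct values: the integers 0 through 255
print(2 ** 8)                # 256
print((255).bit_length())    # 8
print((256).bit_length())    # 9
print(f'{255:0>8b}')         # 11111111
print(len(f'{256:0>8b}'))    # 9

# bytes() refuses integers that do not fit in a single byte
try:
    bytes([256])
except ValueError as err:
    print('ValueError:', err)
```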

Okay, now let’s get back to the bytes data type that you saw above, by constructing a sequence of the bytes that correspond to integers 0 through 255:

In [27]:
bites = bytes(range(256))
bites

b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

If you call list(bites), you’ll get back to a Python list that runs from 0 to 255. But if you just print bites, as above, you get an ugly-looking sequence littered with backslashes.

These backslashes are escape sequences, and \xhh represents the character with hex value hh. Some of the elements of bites are displayed literally (printable characters such as letters, numbers, and punctuation). Most are expressed with escapes. \x08 represents a keyboard’s backspace, while \x0d (displayed as \r) is a carriage return (part of a new line, on Windows systems).

If you need a refresher on hexadecimal, Charles Petzold’s Code: The Hidden Language of Computer Hardware and Software is a great place for that. Hex is a base-16 numbering system that, instead of using 0 through 9, uses 0 through 9 and a through f as its basic digits.
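To make the escape notation a bit more concrete, here is a small sketch showing that a hex escape, a named escape, and a plain integer can all describe the same byte. Note that carriage return is decimal 13, which is 0d in hex:

```python
# Hex escapes, named escapes, and integer values describe the same bytes
print(b'\x08' == b'\b' == bytes([8]))    # True: backspace
print(b'\x0d' == b'\r' == bytes([13]))   # True: carriage return
print(b'\x41' == b'A' == bytes([65]))    # True: printable, so shown literally

print(hex(13), int('0d', 16))            # 0xd 13
print(bytes([13, 65, 200]))              # b'\rA\xc8'
```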

Finally, let’s get back to where you started, with the sequence of random bytes x. Hopefully this makes a little more sense now. Calling .hex() on a bytes object gives a str of hexadecimal numbers, with each corresponding to a decimal number from 0 through 255:

In [29]:
x = os.urandom(6)
x

b'd*\xd1\xd3\x0f!'

In [30]:
list(x)

[100, 42, 209, 211, 15, 33]
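The .hex() call itself is worth seeing next to list(x). This sketch hard-codes the six bytes shown above instead of calling os.urandom() again, purely so that the results are repeatable:

```python
x = b'd*\xd1\xd3\x0f!'   # the six bytes from the cell above, fixed for repeatability

print(x.hex())                       # 642ad1d30f21
print([f'{b:02x}' for b in x])       # ['64', '2a', 'd1', 'd3', '0f', '21']
print(bytes.fromhex(x.hex()) == x)   # True: .hex() is reversible
print(int.from_bytes(x, 'big'))      # the same six bytes read as one big integer
```

Each pair of hex digits lines up with one entry of list(x): 64 is 100, 2a is 42, and so on.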

Python’s Best Kept secrets

Introduced in Python 3.6 by one of the more colorful PEPs out there, the secrets module is intended to be the de facto Python module for generating cryptographically secure random bytes and strings.

You can check out the source code for the module, which is short and sweet at about 25 lines of code. secrets is basically a wrapper around os.urandom(). It exports just a handful of functions for generating random numbers, bytes, and strings. Most of these examples should be fairly self-explanatory:

In [32]:
n = 16
import secrets
secrets.token_bytes(n)

b'\xe5\xd8B\xc6\xa8\xber\xb9\xfd\xa8\xa2i\x7fC_\xc5'

In [33]:
secrets.token_hex(n)

'9e6d4d51f28f28c87a8116dec6a871ae'

In [34]:
secrets.token_urlsafe(n)

'BLpVrjj_4dtEFp8kkysgRw'

In [35]:
secrets.choice('rain')

'n'
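It can be useful to see how the nbytes argument relates to the length of what you get back, and how secrets.choice() can be combined into something like a password generator. This is a minimal sketch; the 12-character length and the letters-plus-digits alphabet are arbitrary choices for illustration:

```python
import secrets
import string

n = 16
print(len(secrets.token_bytes(n)))    # 16 bytes
print(len(secrets.token_hex(n)))      # 32: two hex digits per byte
print(len(secrets.token_urlsafe(n)))  # 22: Base64 text, about 1.3 characters per byte

# A random password drawn from letters and digits with secrets.choice()
alphabet = string.ascii_letters + string.digits
password = ''.join(secrets.choice(alphabet) for _ in range(12))
print(len(password))                  # 12
```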

Now, how about a concrete example? You’ve probably used URL shortener services like tinyurl.com or bit.ly that turn an unwieldy URL into something like https://bit.ly/2IcCp9u. Most shorteners don’t do any complicated hashing from input to output; they just generate a random string, make sure that string has not already been generated previously, and then tie that back to the input URL.

Let’s say that after taking a look at the Root Zone Database, you’ve registered the site short.ly. Here’s a function to get you started with your service:

In [40]:
from secrets import token_urlsafe

DATABASE = {}

def shorten(url: str, nbytes: int=5) -> str:
    ext = token_urlsafe(nbytes=nbytes)
    if ext in DATABASE:
        return shorten(url, nbytes=nbytes)
    else:
        DATABASE.update({ext: url})
        return f'short.ly/{ext}'

Is this a full-fledged, real-world illustration? No. I would wager that bit.ly does things in a slightly more advanced way than storing its gold mine in a global Python dictionary that is not persistent between sessions. However, it’s roughly accurate conceptually:

In [42]:
urls = ('https://realpython.com/','https://docs.python.org/3/howto/regex.html')
for u in urls:
    print(shorten(u))

short.ly/1D9XqNE
short.ly/DIwWlTY


In [44]:
DATABASE

{'1D9XqNE': 'https://realpython.com/',
 'DIwWlTY': 'https://docs.python.org/3/howto/regex.html'}

The bottom line here is that, while secrets is really just a wrapper around existing Python functions, it can be your go-to when security is your foremost concern. 
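If you wanted to take the shortener one small step further, you might swap the recursion for a loop and add a way to look a token back up. The sketch below is only one possible variation: shorten_iter, resolve, and the store argument are hypothetical names introduced here, not part of secrets or of any real service:

```python
from secrets import token_urlsafe

def shorten_iter(url: str, store: dict, nbytes: int = 5) -> str:
    """Loop until an unused token turns up, then record it in store."""
    while True:
        ext = token_urlsafe(nbytes)
        if ext not in store:
            store[ext] = url
            return f'short.ly/{ext}'

def resolve(short: str, store: dict) -> str:
    """Map a shortened URL back to the original URL."""
    ext = short.rsplit('/', 1)[-1]
    return store[ext]

store = {}
short = shorten_iter('https://realpython.com/', store)
print(short)                  # short.ly/ followed by a 7-character token
print(resolve(short, store))  # https://realpython.com/
```

Passing the dictionary in as an argument, rather than relying on a global, also makes it easier to replace with a persistent store later on.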

## One Last Candidate: uuid

One last option for generating a random token is the uuid4() function from Python’s uuid module. A UUID is a Universally Unique IDentifier, a 128-bit sequence (32 hexadecimal digits, usually displayed with four hyphens) designed to “guarantee uniqueness across space and time.” uuid4() is one of the module’s most useful functions, and this function also uses os.urandom():

In [45]:
import uuid

In [46]:
uuid.uuid4(), uuid.uuid4()

(UUID('e0fec97e-22e2-42b1-a1e9-1fcaa7f4b20a'),
 UUID('c1a1c9fe-441b-42c4-a676-7703f176124d'))

You may also have seen some other variations: uuid1(), uuid3(), and uuid5(). The key difference between these and uuid4() is that those three functions all take some form of input and therefore don’t meet the definition of “random” to the extent that a Version 4 UUID does:

uuid1() uses your machine’s host ID and current time by default. Because of the reliance on current time, counted in 100-nanosecond intervals, this version is where UUID derives the claim “guaranteed uniqueness across time.”

uuid3() and uuid5() both take a namespace identifier and a name. The former uses an MD5 hash and the latter uses SHA-1.

uuid4(), conversely, is entirely pseudorandom (or random). It consists of getting 16 bytes via os.urandom(), converting this to a big-endian integer, and doing a number of bitwise operations to comply with the formal specification.
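To make those “bitwise operations” less mysterious, here is a sketch that follows the steps just described by hand. It is meant as an illustration of the idea, not a copy of the module’s source. The second half shows the contrast with name-based UUIDs, which are fully determined by their inputs:

```python
import os
import uuid

raw = os.urandom(16)
n = int.from_bytes(raw, 'big')

# Overwrite the two variant bits with binary 10 (the RFC 4122 layout)
n &= ~(0xc000 << 48)
n |= 0x8000 << 48
# Overwrite the four version bits with the number 4
n &= ~(0xf000 << 64)
n |= 4 << 76

manual = uuid.UUID(int=n)
print(manual.version, manual.variant == uuid.RFC_4122)  # 4 True

# The UUID constructor applies the same masking when given a version
print(manual == uuid.UUID(bytes=raw, version=4))        # True

# Name-based UUIDs are deterministic: same namespace and name, same result
a = uuid.uuid5(uuid.NAMESPACE_DNS, 'python.org')
b = uuid.uuid5(uuid.NAMESPACE_DNS, 'python.org')
print(a == b, a.version)                                # True 5
```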

Hopefully, by now you have a good idea of the distinction between different “types” of random data and how to create them. However, one other issue that might come to mind is that of collisions.

In this case, a collision would simply refer to generating two matching UUIDs. What is the chance of that? Well, it is technically not zero, but perhaps it is close enough: 6 of the 128 bits are fixed to mark the version and variant, which leaves 2 ** 122 or roughly 5.3 undecillion possible uuid4 values. So, I’ll leave it up to you to judge whether this is enough of a guarantee to sleep well.
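If you’d like a number to go with that judgment call, the standard birthday-problem approximation gives a rough estimate. The sketch below assumes 122 random bits per version 4 UUID (128 bits minus 4 version bits and 2 variant bits) and that every value is equally likely. The function name collision_probability is just an illustrative choice:

```python
import math

RANDOM_BITS = 122          # 128 bits minus 4 version bits and 2 variant bits
SPACE = 2 ** RANDOM_BITS   # number of distinct version 4 UUIDs

def collision_probability(n: int, space: int = SPACE) -> float:
    """Birthday-problem approximation: 1 - exp(-n * (n - 1) / (2 * space))."""
    return -math.expm1(-n * (n - 1) / (2 * space))

for n in (10 ** 6, 10 ** 9, 10 ** 12):
    print(f'{n:>17,} UUIDs -> p ~ {collision_probability(n):.3e}')
```

math.expm1() is used instead of 1 - math.exp() because the probabilities involved are so small that the naive form would round to zero.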

One common use of uuid is in Django, which has a UUIDField that is often used as a primary key in a model’s underlying relational database.

## Why Not Just “Default to” SystemRandom?

In addition to the secure modules discussed here such as secrets, Python’s random module actually has a little-used class called SystemRandom that uses os.urandom(). (SystemRandom, in turn, is also used by secrets. It’s all a bit of a web that traces back to urandom().)

At this point, you might be asking yourself why you wouldn’t just “default to” this version. Why not “always be safe” rather than defaulting to the deterministic random functions that aren’t cryptographically secure?

I’ve already mentioned one reason: sometimes you want your data to be deterministic and reproducible for others to follow along with.
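You can see the lack of reproducibility for yourself. In this short sketch, seeding a SystemRandom instance is accepted but has no effect, and there is no internal state to save:

```python
import random

sysrand = random.SystemRandom()

# seed() is accepted but ignored by an OS-backed generator
sysrand.seed(444)
first = sysrand.random()
sysrand.seed(444)
second = sysrand.random()
print(first == second)  # almost certainly False

# There is no internal state to save or restore, either
try:
    sysrand.getstate()
except NotImplementedError:
    print('SystemRandom has no reproducible state')
```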

But the second reason is that CSPRNGs, at least in Python, tend to be meaningfully slower than PRNGs. Let’s test that with a script, timed.py, that compares the PRNG and CSPRNG versions of randint() using Python’s timeit.repeat():

In [47]:
import random
import timeit

# The "default" random is actually an instance of `random.Random()`.
# The CSPRNG version uses `SystemRandom()` and `os.urandom()` in turn.
_sysrand = random.SystemRandom()

def prng() -> None:
    random.randint(0, 95)

def csprng() -> None:
    _sysrand.randint(0, 95)

setup = 'import random; from __main__ import prng, csprng'

if __name__ == '__main__':
    print('Best of 3 trials with 1,000,000 loops per trial:')

    for f in ('prng()', 'csprng()'):
        # Pass repeat=3 explicitly: the default changed from 3 to 5 in Python 3.7
        best = min(timeit.repeat(f, setup=setup, repeat=3))
        print('\t{:8s} {:0.2f} seconds total time.'.format(f, best))

Best of 3 trials with 1,000,000 loops per trial:
	prng()   1.70 seconds total time.
	csprng() 3.14 seconds total time.


Odds and Ends: Hashing

One concept that hasn’t received much attention in this tutorial is that of hashing, which can be done with Python’s hashlib module.

A hash is designed to be a one-way mapping from an input value to a fixed-size string that is virtually impossible to reverse engineer. As such, while the result of a hash function may “look like” random data, it doesn’t really qualify under the definition here.
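A quick sketch with hashlib shows why a hash doesn’t count as random data: the output is completely determined by the input. SHA-256 is used here only as a familiar example, and the input strings are arbitrary:

```python
import hashlib

digest_a = hashlib.sha256(b'realpython').hexdigest()
digest_b = hashlib.sha256(b'realpython').hexdigest()
digest_c = hashlib.sha256(b'realpython!').hexdigest()

print(digest_a == digest_b)  # True: same input, same digest, every time
print(digest_a == digest_c)  # False: a small change gives a different digest
print(len(digest_a))         # 64 hex digits (32 bytes), whatever the input size
```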
