# Sampling with or without replacement

Given a set of objects (for example, the elements of a list), we can sample from it with or without replacement, obtain permutations, shuffle it, and so on.

https://towardsdatascience.com/how-to-generate-random-numbers-in-python-eb5aecf3e059

https://www.pythonpool.com/numpy-random/

https://www.javatpoint.com/numpy-random

https://www.thoughtco.com/sampling-with-or-without-replacement-3126563

https://docs.python.org/3/library/random.html#real-valued-distributions

https://www.thoughtco.com/example-of-bootstrapping-3126155

In [1]:
import random
import numpy as np
import matplotlib.pyplot as plt
import copy
from time import time

The sequence does not have to contain only numbers; it can contain strings, or elements of different types.

In [2]:
list_value_1 = ["red", "green", "blue", "yellow", "orange", "dark"]
list_value_2 = [i for i in range(10)]
list_value_3 = [{"a": 2, "b": 3}, {"c": 5, "d": 5}, {"e": 7, "f": 2}]
list_value_4 = [{"a": 2, "b": 3}, {"c": 5, "d": 5}, "red", "blue", 7.7, 2]

# 1. Intro: Select random elements from a sequence

We can select one element or several (fewer than the list contains, or more).

The same element can be chosen several times, so this is sampling with replacement: after an element is chosen, it is placed back (re-placed, hence replacement) in the set. At the next extraction it can therefore be extracted again, with the same probability as the first time. By contrast, sampling without replacement means that a chosen element is no longer available to be extracted again.

One element:

Python: random.choice

Numpy: np.random.choice

Several elements:

Python: random.choices

Numpy: np.random.choice
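
To make the difference between the two sampling modes concrete, here is a minimal sketch using only the standard library. The list `colors` and the local generator `rng` are illustrative names, not part of the cells in this notebook.

```python
import random

colors = ["red", "green", "blue", "yellow", "orange", "dark"]

rng = random.Random(1)  # local generator, so the global seed is left untouched

# With replacement: each draw sees the full list, so repeats are possible
# and k may exceed len(colors).
with_replacement = rng.choices(colors, k=10)

# Without replacement: every element is drawn at most once, so k <= len(colors).
without_replacement = rng.sample(colors, k=4)

print(with_replacement)
print(without_replacement)
print(len(set(without_replacement)) == len(without_replacement))  # True: no repeats
```

Drawing 10 elements from a list of 6 is only possible with replacement, and it necessarily produces repeats.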

# 2. Select one element

## 2.1 Python

In [3]:
# Python
random.seed(1)
value = random.choice(list_value_1)
print(value)
value = random.choice(list_value_1)
print(value)
value = random.choice(list_value_1)
print(value)

print()

random.seed(1)
value = random.choice(list_value_1)
print(value)
value = random.choice(list_value_1)
print(value)
value = random.choice(list_value_1)
print(value)

print()

random.seed(1)
value = random.choice(list_value_2)
print(value)

print()

random.seed(1)
value = random.choice(list_value_3)
print(value)

print()

random.seed(1)
value = random.choice(list_value_4)
print(value)

green
orange
red

green
orange
red

2

{'a': 2, 'b': 3}

{'c': 5, 'd': 5}
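
The cell above resets the global seed to reproduce a sequence. As a small sketch of the same idea, assuming we prefer not to touch the global state, two separate `random.Random` instances built from the same seed produce the same draws. The names `gen_a`, `gen_b` and `colors` are illustrative.

```python
import random

colors = ["red", "green", "blue", "yellow", "orange", "dark"]

# Two independent generators built from the same seed
gen_a = random.Random(1)
gen_b = random.Random(1)

draws_a = [gen_a.choice(colors) for _ in range(3)]
draws_b = [gen_b.choice(colors) for _ in range(3)]

print(draws_a)
print(draws_a == draws_b)  # True: same seed, same sequence
```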


## 2.2 Numpy

In [4]:
# Numpy
np.random.seed(1)
value = np.random.choice(list_value_1)
print(value)
value = np.random.choice(list_value_1)
print(value)
value = np.random.choice(list_value_1)
print(value)

print()

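# Note: random.seed() seeds Python's random module, not NumPy's generator,
# so the three draws below do not repeat the three above; np.random.seed(1) would reset NumPy
random.seed(1)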
value = np.random.choice(list_value_1)
print(value)
value = np.random.choice(list_value_1)
print(value)
value = np.random.choice(list_value_1)
print(value)

print()

np.random.seed(1)
value = np.random.choice(list_value_2)
print(value)

print()

np.random.seed(1)
value = np.random.choice(list_value_3)
print(value)

print()

# Note: this last block uses Python's random module (as in section 2.1), not NumPy
random.seed(1)
value = random.choice(list_value_4)
print(value)

dark
yellow
orange

red
green
yellow

5

{'c': 5, 'd': 5}

{'c': 5, 'd': 5}


# 3. Select several elements with replacement

Python: random.choice several times (with integer arithmetic), or random.choices (with floating point arithmetic)

Numpy: np.random.choice

In [5]:
N = 7

## 3.1 Python sampling with replacement

In [6]:
# Python: choice
random.seed(1)
values = np.array([random.choice(list_value_1) for _ in range(N)])
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(7,), dtype=<U6, values=
['green' 'orange' 'red' 'blue' 'red' 'yellow' 'yellow']


In [7]:
# Python: choices - sampling with replacement
random.seed(1)
values = np.array(random.choices(list_value_1, k = N))
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(7,), dtype=<U6, values=
['red' 'dark' 'orange' 'green' 'blue' 'blue' 'yellow']


In [8]:
# Python: choices - sampling with replacement
random.seed(1)
values = np.array(random.choices(list_value_1, k = 2 * N))
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(14,), dtype=<U6, values=
['red' 'dark' 'orange' 'green' 'blue' 'blue' 'yellow' 'orange' 'red' 'red'
 'dark' 'blue' 'orange' 'red']


In [9]:
# Python: choices - weighted sampling with replacement
# Six roulette wheel spins
# A roulette wheel has 18 red, 18 black and 2 green pockets
# so we need weighted sampling,
# and it is with replacement, as the colors and probabilities are the same for every spin
# So we simulate 6 spins of the roulette wheel
# Each outcome is one of 'red', 'black' or 'green'
random.seed(1)
values = random.choices(['red', 'black', 'green'], [18, 18, 2], k = 6)
print(f"type={type(values)}, values=\n{values}")
# can add the variable name weights
random.seed(1)
values = random.choices(['red', 'black', 'green'], weights = [18, 18, 2], k = 6)
print(f"type={type(values)}, values=\n{values}")
# can give instead the cumulative weights
random.seed(1)
values = random.choices(['red', 'black', 'green'], cum_weights = [18, 36, 38], k = 6)
print(f"type={type(values)}, values=\n{values}")

type=<class 'list'>, values=
['red', 'black', 'black', 'red', 'black', 'red']
type=<class 'list'>, values=
['red', 'black', 'black', 'red', 'black', 'red']
type=<class 'list'>, values=
['red', 'black', 'black', 'red', 'black', 'red']
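
The three calls above give the same result because cumulative weights are just the running totals of the weights. A short sketch of that relationship, using `itertools.accumulate` (the variable names are illustrative):

```python
import random
from itertools import accumulate

colors = ["red", "black", "green"]
weights = [18, 18, 2]

# Running totals of the weights: [18, 36, 38]
cum_weights = list(accumulate(weights))
print(cum_weights)

random.seed(1)
spins_weights = random.choices(colors, weights=weights, k=6)
random.seed(1)
spins_cum = random.choices(colors, cum_weights=cum_weights, k=6)

print(spins_weights == spins_cum)  # True: both describe the same distribution
```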


In [10]:
# Python with replacement: coin throwing
# A coin can have head (H) or tail (T) => the sequence can be the string "HT"
# Let's throw the coin 20 times, so k = 20
random.seed(1)
values = random.choices(["H", "T"], cum_weights = [0.5, 1.0], k = 20)
print(f"type={type(values)}, values=\n{values}")
# Let's assume the coin is biased to have 60% heads => cum_weights = [0.6, 1.0]
random.seed(1)
values = random.choices(["H", "T"], cum_weights = [0.6, 1.0], k = 20)
print(f"type={type(values)}, values=\n{values}")
# count the heads in this one experiment; we want to know how often we get 15 or more heads out of k = 20
count_H = values.count("H")
print(f"type={type(count_H)}, values=\n{count_H}")
# check if count_H >= 15, we get a boolean
enough_H = count_H >= 15
print(f"type={type(enough_H)}, values=\n{enough_H}")

# we then repeat this many times and calculate the fraction of experiments that return True
N = 10_000 # underscores make large integer literals easier to read
print(f"type={type(N)}, values=\n{N}")
random.seed(1)
def experiment():
    # returns a boolean
    return random.choices("HT", cum_weights = [0.6, 1.0], k = 20).count("H") >= 15
probability = sum([experiment() for _ in range(N)]) / N
print(f"type={type(probability)}, values=\n{probability}")

type=<class 'list'>, values=
['H', 'T', 'T', 'H', 'H', 'H', 'T', 'T', 'H', 'H', 'T', 'H', 'T', 'H', 'H', 'T', 'H', 'T', 'T', 'H']
type=<class 'list'>, values=
['H', 'T', 'T', 'H', 'H', 'H', 'T', 'T', 'H', 'H', 'T', 'H', 'T', 'H', 'H', 'T', 'H', 'T', 'T', 'H']
type=<class 'int'>, values=
11
type=<class 'bool'>, values=
False
type=<class 'int'>, values=
10000
type=<class 'float'>, values=
0.13
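
The simulated value can be cross-checked against the exact binomial tail probability, P(X >= 15) for X ~ Binomial(20, 0.6). This is a sketch using `math.comb` (Python 3.8 or newer); the variable names are illustrative.

```python
from math import comb

n = 20       # number of throws
p = 0.6      # probability of heads for the biased coin
k_min = 15   # minimum number of heads counted as a success

# P(X >= k_min) for X ~ Binomial(n, p)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))
print(f"exact probability = {exact:.4f}")
```

The exact value is about 0.126, which agrees with the simulated 0.13 within the expected statistical fluctuation.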


In [11]:
# Python with replacement
# We have the sequence of numbers from 0 to 9999 => range(10_000)
# We randomly select 5 samples with replacement
# We take the median value of these 5 samples => sorted, then pick index [2]
# Check if the median value belongs to the middle two quartiles, so N/4 <= v < 3*N/4
# This is a boolean, outcome of one experiment
# Throw many such experiments and calculate the fraction of events where outcome is true
# That is the probability of this to happen
N = 10_000
r = range(N)
k = 5
nb_experiment = 100_000
random.seed(1)
def experiment(N):
    return 1*N/4 <= sorted(random.choices(r, k = k))[2] < 3*N/4
probability = sum([experiment(N) for _ in range(nb_experiment)]) / nb_experiment
print(f"type={type(probability)}, values=\n{probability}")
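# Repeat with k = 20. Note that index [2] is then the third-smallest of the 20 samples,
# not the median (which sits around index k // 2), so the probability is much lower.
k = 20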
probability = sum([experiment(N) for _ in range(nb_experiment)]) / nb_experiment
print(f"type={type(probability)}, values=\n{probability}")

type=<class 'float'>, values=
0.79285
type=<class 'float'>, values=
0.09078
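
The fixed index `[2]` in the cell above is the median only for k = 5. A possible variant that works for any k, assuming we are happy to use `statistics.median` (which averages the two middle values when k is even), could look like this. The function name `median_in_middle_half` and the reduced number of experiments are choices made for this sketch.

```python
import random
from statistics import median

N = 10_000
population = range(N)

def median_in_middle_half(k, rng):
    """Return True if the median of k draws (with replacement) lies in [N/4, 3N/4)."""
    return N / 4 <= median(rng.choices(population, k=k)) < 3 * N / 4

rng = random.Random(1)
nb_experiment = 10_000
results = {}
for k in (5, 20):
    hits = sum(median_in_middle_half(k, rng) for _ in range(nb_experiment))
    results[k] = hits / nb_experiment
    print(f"k={k}: probability ~ {results[k]}")
```

With a true median, the probability increases with k, because the median of a larger sample fluctuates less around the center of the range.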


## 3.2 Numpy sampling with replacement

In [12]:
# Numpy
# Note: N was reassigned to 10_000 in the cells above, so this draws 10,000 values
np.random.seed(1)
values = np.random.choice(list_value_4, N)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

print()

# the size argument can also be a shape tuple
values = np.random.choice(list_value_4, (3, 2))
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

# the shape can have more than two dimensions
values = np.random.choice(list_value_2, (3, 2, 5))
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(10000,), dtype=object, values=
[2 'blue' 7.7 ... 2 {'a': 2, 'b': 3} 7.7]

type=<class 'numpy.ndarray'>, shape=(3, 2), dtype=object, values=
[[2 'red']
 ['red' {'c': 5, 'd': 5}]
 [{'a': 2, 'b': 3} {'a': 2, 'b': 3}]]
type=<class 'numpy.ndarray'>, shape=(3, 2, 5), dtype=int64, values=
[[[1 7 5 3 6]
  [9 6 3 1 0]]

 [[7 1 3 9 0]
  [9 1 6 3 0]]

 [[0 0 0 9 0]
  [7 3 0 5 4]]]


In [13]:
# Numpy can also use weights for the roulette problem
# np.random.choice: weighted sampling with replacement
# Six roulette wheel spins
# A roulette wheel has 18 red, 18 black and 2 green pockets
# so we need weighted sampling,
# and it is with replacement, as the colors and probabilities are the same for every spin
# np.random.choice takes probabilities (p, summing to 1) rather than weights
np.random.seed(1)
# weights
weights = [18, 18, 2]
print(f"type={type(weights)}, values=\n{weights}")
# convert weights to probabilities
p = np.array(weights)
print(f"type={type(p)}, shape={p.shape}, dtype={p.dtype}, values=\n{p}")
p = p/np.sum(p)
print(f"type={type(p)}, shape={p.shape}, dtype={p.dtype}, values=\n{p}")
# replace = True is the default, so this samples with replacement
values = np.random.choice(['red', 'black', 'green'], size = 6, p = p)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")
# same call with replace = True written explicitly (the generator has advanced, so the draws differ)
values = np.random.choice(['red', 'black', 'green'], size = 6, p = p, replace = True)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'list'>, values=
[18, 18, 2]
type=<class 'numpy.ndarray'>, shape=(3,), dtype=int64, values=
[18 18  2]
type=<class 'numpy.ndarray'>, shape=(3,), dtype=float64, values=
[0.47368421 0.47368421 0.05263158]
type=<class 'numpy.ndarray'>, shape=(6,), dtype=<U5, values=
['red' 'black' 'red' 'red' 'red' 'red']
type=<class 'numpy.ndarray'>, shape=(6,), dtype=<U5, values=
['red' 'red' 'red' 'black' 'red' 'black']


# 4. Select several elements without replacement

Sampling without replacement means that once an element is chosen, it is not put back in the set, so it cannot be extracted again.

Python: random.sample(sequence, N), where N <= len(sequence)

https://www.geeksforgeeks.org/python-random-sample-function/

Numpy: np.random.choice(), but use replace = False. 
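
As a quick illustration of the constraint N <= len(sequence), the sketch below asks `random.sample` for one element more than the list contains and catches the resulting `ValueError`. The list `colors` and the flag `raised` are illustrative names.

```python
import random

colors = ["red", "green", "blue", "yellow", "orange", "dark"]

random.seed(1)
print(random.sample(colors, len(colors)))  # allowed: k == len(colors)

try:
    random.sample(colors, len(colors) + 1)  # k > len(colors)
    raised = False
except ValueError as error:
    raised = True
    print(f"ValueError: {error}")
```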

## 4.1 Python sampling without replacement

In [14]:
# Python
values = copy.deepcopy(list_value_1)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 3)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 1)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 2)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 3)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 4)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 5)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = random.sample(list_value_1, 6)
print(f"type={type(values)}, values=\n{values}")

type=<class 'list'>, values=
['red', 'green', 'blue', 'yellow', 'orange', 'dark']
type=<class 'list'>, values=
['green', 'orange', 'red']
type=<class 'list'>, values=
['green']
type=<class 'list'>, values=
['green', 'orange']
type=<class 'list'>, values=
['green', 'orange', 'red']
type=<class 'list'>, values=
['green', 'orange', 'red', 'dark']
type=<class 'list'>, values=
['green', 'orange', 'red', 'dark', 'yellow']
type=<class 'list'>, values=
['green', 'orange', 'red', 'dark', 'yellow', 'blue']


In [15]:
# deal 20 cards from a deck of 52 cards without replacement
# but care only whether each card is in the "tens" category (ranks 11-14: 4 ranks in 4 suits, so 16 cards in total)
# or is a lower card (ranks 2-10: 9 ranks in 4 suits, so 36 cards in total)
#
random.seed(1)
# the call below would raise ValueError: Sample larger than population or is negative
# values = random.sample(["tens", "low cards"], k = 20)
# instead we state how many cards there are of each type with the keyword argument counts (Python 3.9+)
N = 20
values = random.sample(["tens", "low cards"], counts = [16, 36], k = N)
print(f"type={type(values)}, values=\n{values}")
# then we can ask what fraction of these are in the tens category
count_tens = values.count("tens")
print(f"type={type(count_tens)}, values=\n{count_tens}")
fraction_tens = count_tens / N
print(f"type={type(fraction_tens)}, values=\n{fraction_tens}")

type=<class 'list'>, values=
['tens', 'low cards', 'low cards', 'tens', 'low cards', 'tens', 'low cards', 'low cards', 'low cards', 'low cards', 'low cards', 'tens', 'tens', 'low cards', 'tens', 'low cards', 'low cards', 'tens', 'low cards', 'low cards']
type=<class 'int'>, values=
7
type=<class 'float'>, values=
0.35
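
A single deal gives a noisy fraction. As a sketch, repeating the deal many times and averaging should approach the share of "tens" in the deck, 16/52. The helper `fraction_of_tens` and the number of experiments are assumptions made for this example; `counts` requires Python 3.9 or newer.

```python
import random

rng = random.Random(1)
nb_experiment = 10_000
hand_size = 20

def fraction_of_tens():
    # counts= tells sample() how many copies of each element the population holds
    hand = rng.sample(["tens", "low cards"], counts=[16, 36], k=hand_size)
    return hand.count("tens") / hand_size

average = sum(fraction_of_tens() for _ in range(nb_experiment)) / nb_experiment
expected = 16 / 52
print(f"average fraction = {average:.4f}, expected = {expected:.4f}")
```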


## 4.2 Numpy sampling without replacement

In [16]:
# Numpy
values = copy.deepcopy(list_value_1)
print(f"type={type(values)}, values=\n{values}")
#
np.random.seed(1)
values = np.random.choice(list_value_1, 3, replace = False)
print(f"type={type(values)}, values=\n{values}")
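#
# Note: the calls below seed Python's random module, not NumPy's generator,
# so NumPy keeps advancing its own stream and the outputs are not prefixes of each other.
# Use np.random.seed(1) to reset NumPy's generator.
random.seed(1)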
values = np.random.choice(list_value_1, 1, replace = False)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = np.random.choice(list_value_1, 2, replace = False)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = np.random.choice(list_value_1, 3, replace = False)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = np.random.choice(list_value_1, 4, replace = False)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = np.random.choice(list_value_1, 5, replace = False)
print(f"type={type(values)}, values=\n{values}")
#
random.seed(1)
values = np.random.choice(list_value_1, 6, replace = False)
print(f"type={type(values)}, values=\n{values}")

type=<class 'list'>, values=
['red', 'green', 'blue', 'yellow', 'orange', 'dark']
type=<class 'numpy.ndarray'>, values=
['blue' 'green' 'orange']
type=<class 'numpy.ndarray'>, values=
['orange']
type=<class 'numpy.ndarray'>, values=
['blue' 'red']
type=<class 'numpy.ndarray'>, values=
['yellow' 'dark' 'orange']
type=<class 'numpy.ndarray'>, values=
['green' 'dark' 'red' 'yellow']
type=<class 'numpy.ndarray'>, values=
['yellow' 'dark' 'green' 'red' 'blue']
type=<class 'numpy.ndarray'>, values=
['blue' 'dark' 'yellow' 'orange' 'green' 'red']


# 5. Permutations

The output is a sequence with the same number of elements and the same elements, but in a different order. Each element appears at most once, so this is sampling without replacement: once an element is chosen, it is not put back in the set and cannot be extracted again.

Python: there is no dedicated function, but random.sample(sequence, len(sequence)) returns a new list in random order.

Numpy: np.random.permutation(list_value)
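
Python's random module has no function called permutation, but sampling without replacement with k equal to the length of the list gives the same effect. A minimal sketch, with `colors` and `permuted` as illustrative names:

```python
import random

colors = ["red", "green", "blue", "yellow", "orange", "dark"]

random.seed(1)
permuted = random.sample(colors, k=len(colors))  # new list, all elements, random order

print(permuted)
print(colors)                              # the original list is unchanged
print(sorted(permuted) == sorted(colors))  # True: same elements
```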

In [17]:
# Numpy
np.random.seed(1)
values = np.random.permutation(list_value_1)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(6,), dtype=<U6, values=
['blue' 'green' 'orange' 'red' 'yellow' 'dark']


In [18]:
# Numpy
np.random.seed(1)
values = np.random.permutation(list_value_2)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(10,), dtype=int64, values=
[2 9 6 4 0 3 1 7 8 5]


In [19]:
# if an integer N is given, then it permutes np.arange(N)
# our list_value_2 is exactly np.arange(10)
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [20]:
np.random.seed(1)
values = np.random.permutation(10)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(10,), dtype=int64, values=
[2 9 6 4 0 3 1 7 8 5]


In [21]:
# Numpy - dtype is object when the elements are not plain numbers or strings (here dicts)
np.random.seed(1)
values = np.random.permutation(list_value_3)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(3,), dtype=object, values=
[{'a': 2, 'b': 3} {'e': 7, 'f': 2} {'c': 5, 'd': 5}]


In [22]:
# Numpy - type returned as object if input list has object of different types
np.random.seed(1)
values = np.random.permutation(list_value_4)
print(f"type={type(values)}, shape={values.shape}, dtype={values.dtype}, values=\n{values}")

type=<class 'numpy.ndarray'>, shape=(6,), dtype=object, values=
['red' {'c': 5, 'd': 5} 7.7 {'a': 2, 'b': 3} 'blue' 2]


# 6. Shuffle

Same as permutation, but the object itself is changed in place and nothing is returned.

The same result can be obtained with choice (without replacement) for the same number of elements, but shuffle is about 50 times faster.
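
To show what "in place" means, here is a sketch of the classic Fisher-Yates shuffle. It is meant as an illustration of the idea, not as a description of how any particular library implements its shuffle; `shuffle_in_place` is a hypothetical helper.

```python
import random

def shuffle_in_place(items, rng):
    """Fisher-Yates shuffle: reorder `items` in place, return None."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)   # random index in [0, i], both ends inclusive
        items[i], items[j] = items[j], items[i]

colors = ["red", "green", "blue", "yellow", "orange", "dark"]
same_object = colors
shuffle_in_place(colors, random.Random(1))

print(colors)
print(same_object is colors)  # True: no new list was created
```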

In [23]:
# Python - no output, but the same object is shuffled in place
random.seed(1)
list_value_3_2 = copy.deepcopy(list_value_1)
print(f"type={type(list_value_3_2)}, values=\n{list_value_3_2}")
random.shuffle(list_value_3_2)
print(f"type={type(list_value_3_2)}, values=\n{list_value_3_2}")

type=<class 'list'>, values=
['red', 'green', 'blue', 'yellow', 'orange', 'dark']
type=<class 'list'>, values=
['blue', 'yellow', 'dark', 'red', 'orange', 'green']


In [24]:
# Numpy - no output, but the same object is shuffled in place
np.random.seed(1)
list_value_3_2 = copy.deepcopy(list_value_1)
print(f"type={type(list_value_3_2)}, values=\n{list_value_3_2}")
np.random.shuffle(list_value_3_2)
print(f"type={type(list_value_3_2)}, values=\n{list_value_3_2}")

type=<class 'list'>, values=
['red', 'green', 'blue', 'yellow', 'orange', 'dark']
type=<class 'list'>, values=
['blue', 'green', 'orange', 'red', 'yellow', 'dark']


# 7. Bootstrapping

https://www.thoughtco.com/example-of-bootstrapping-3126155

https://www.youtube.com/watch?v=O_Fj4q8lgmc

When a sample is small (typically fewer than 40 elements), we cannot assume that it follows a particular distribution (Gaussian or t). Suppose we want to estimate the mean and a 90% confidence interval (the 5% and 95% quantiles). The usual approach assumes a Gaussian distribution with a known mean (mu) and standard deviation (sigma). For a small sample, say 5 elements, we do not know mu and sigma, so we apply the bootstrapping technique instead.

Bootstrapping means that we repeat an experiment many times (more than a few hundred, typically thousands). In each experiment we resample the original sample with replacement to get a new sample with the same number of elements, and we calculate its mean. We then sort the means of all experiments and read off the 5% and 95% quantiles.

In [71]:
list_value = [1, 2, 4, 4, 10]
k = len(list_value)
nb_experiment = 10_000
confidence_interval = 0.50 # as a fraction; 0.50 gives the 25% and 75% quantiles, 0.90 would give the 5% and 95% quantiles
edge = (1.0 - confidence_interval) / 2
low = edge
high = 1.0 - edge
print(f"low={low}, high={high}")
# after sorting the means, these are the indices of the low and high quantiles
index_low = int(nb_experiment * low)
index_high = int(nb_experiment * high)
index_median = int(nb_experiment * 0.5)
print(f"index_low={index_low}, index_high={index_high}, index_median={index_median}")

low=0.25, high=0.75
index_low=2500, index_high=7500, index_median=5000


## 7.1 Python 

* sample with replacement => random.choices(), not random.sample(), which is without replacement

* need mean of values in a list => from statistics import fmean as mean 

In [72]:
from statistics import fmean as mean
random.seed(1)
# one example resample; the result is discarded, but the call advances the generator
random.choices(list_value, k = k)
def experiment():
    return mean(random.choices(list_value, k = k))
list_mean_sorted = sorted([experiment() for _ in range(nb_experiment)])
print(f"The mean of the original sample = {list_value} for the {confidence_interval} confidence interval is [{list_mean_sorted[index_low]}, {list_mean_sorted[index_high]}].")

The mean of the original sample = [1, 2, 4, 4, 10] for the 0.5 confidence interval is [3.2, 5.2].


## 7.2 Numpy 

* sample with replacement => np.random.choice(replace = True)

* need mean of values in a list => np.mean()

In [73]:
np.random.seed(1)
# one example resample; the result is discarded, but the call advances the generator
np.random.choice(list_value, size = k, replace = True)
def experiment():
    return np.mean(np.random.choice(list_value, size = k))
nparray_mean_sorted = np.sort([experiment() for _ in range(nb_experiment)])
print(f"The mean of the original sample = {list_value} for the {confidence_interval} confidence interval is [{nparray_mean_sorted[index_low]}, {nparray_mean_sorted[index_high]}].")

The mean of the original sample = [1, 2, 4, 4, 10] for the 0.5 confidence interval is [3.0, 5.2].
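
The NumPy version above calls `np.random.choice` once per experiment. A possible vectorized variant, sketched here under the assumption that all resamples fit in memory, draws them in a single call as a 2-D array and reads the interval with `np.quantile` instead of sorting and indexing. The names `resamples` and `means` are illustrative.

```python
import numpy as np

list_value = [1, 2, 4, 4, 10]
nb_experiment = 10_000
confidence_interval = 0.50  # as a fraction

np.random.seed(1)
# One row per experiment, each row a resample of the same size as the original
resamples = np.random.choice(list_value, size=(nb_experiment, len(list_value)), replace=True)
means = resamples.mean(axis=1)

edge = (1.0 - confidence_interval) / 2
low, high = np.quantile(means, [edge, 1.0 - edge])
print(f"{confidence_interval:.0%} interval for the mean: [{low}, {high}]")
```

Because `np.quantile` interpolates between neighboring values, its bounds can differ slightly from those obtained by indexing the sorted array.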
