### Sampling and Shuffling

We can shuffle (in-place) a list (or any mutable sequence) using the `shuffle` function in the `random` module:

In [1]:
import random

In [2]:
l = [1, 2, 3, 4, 5]

In [3]:
random.shuffle(l)

As you can see, nothing is *returned* from this function call (it returns `None`) - that's because it shuffled `l` in place:

In [4]:
l

[4, 1, 5, 2, 3]

The `shuffle` function also uses `random()` as its source of random numbers, so for the same seed, we'll get the same shuffle:

In [5]:
l = [1, 2, 3, 4, 5]
random.seed(0)
random.shuffle(l)
print(l)

[3, 2, 1, 5, 4]


In [6]:
l = [1, 2, 3, 4, 5]
random.seed(0)
random.shuffle(l)
print(l)

[3, 2, 1, 5, 4]

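If you want a shuffled copy and need to leave the original untouched, one option is `random.sample` with `k` equal to the length of the sequence. The following is a minimal sketch; the names `original` and `shuffled` are just for illustration.

```python
import random

original = [1, 2, 3, 4, 5]

# shuffle works in place and returns None
result = random.shuffle(original.copy())
print(result)

# sample with k == len(original) returns a new list in random order
shuffled = random.sample(original, k=len(original))
print(original)          # the original list is unchanged
print(sorted(shuffled))  # same elements, just reordered
```

This also works for immutable sequences such as tuples, which `shuffle` cannot handle because it has to mutate its argument.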

The `choice` function can be used to pick a random element from a sequence:

In [7]:
random.seed(0)
l = [1, 2, 3, 4, 5]
for _ in range(5):
    print(random.choice(l))

4
4
1
3
5


And again, repeatable if we use the same seed:

In [8]:
random.seed(0)
l = [1, 2, 3, 4, 5]
for _ in range(5):
    print(random.choice(l))

4
4
1
3
5


Over many repeated choices, the frequency distribution of the picked elements approaches a uniform distribution.
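As a rough check of that uniformity, we can count how often each element comes up over a large number of picks. This sketch uses `collections.Counter` for the counting; the number of picks (`n`) is an arbitrary choice.

```python
import random
from collections import Counter

random.seed(0)
l = [1, 2, 3, 4, 5]
n = 100_000

# count how many times each element is picked
counts = Counter(random.choice(l) for _ in range(n))

# each relative frequency should be close to 1/5 = 0.2
for value in sorted(counts):
    print(value, f"{counts[value] / n:.3f}")
```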

We also have two methods for randomly choosing multiple elements at a time.

The first one we'll look at is the `sample` function.

This allows us to pick a random subset from a sequence, including a `range`. The example below passes a set, which only Python versions before 3.11 accept.

In [9]:
s = set('abcdef')

In [10]:
s

{'a', 'b', 'c', 'd', 'e', 'f'}

In [11]:
random.seed(0)

for _ in range(5):
    print(random.sample(s, 3))

['b', 'f', 'd']
['e', 'c', 'b']
['b', 'e', 'f']
['e', 'c', 'a']
['c', 'a', 'e']


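The `DeprecationWarning` emitted by this cell says that sampling from a set has been deprecated since Python 3.9. From Python 3.11 the same call raises a `TypeError`, so convert the set to a sequence first, for example with `random.sample(sorted(s), 3)`.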


You'll notice that within each sample, no element is repeated (and in general, none will be).

This is like doing population sampling, or picking cards from a deck of cards. As you are creating your random sample, once an element has been picked into the sample, it cannot be picked again.

This also means that you cannot set a sample size that exceeds the population size.

In [12]:
try:
    random.sample(sorted(s), 10)  # sorted() gives a list; sets are rejected from Python 3.11
except ValueError as ex:
    print(ex)

Sample larger than population or is negative
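The card-deck analogy can be sketched directly. The deck representation here (`ranks`, `suits`, and two-character strings) is only an illustrative assumption, not something from the `random` module.

```python
import random

ranks = ['2', '3', '4', '5', '6', '7', '8', '9', 'T', 'J', 'Q', 'K', 'A']
suits = ['S', 'H', 'D', 'C']
deck = [rank + suit for suit in suits for rank in ranks]  # 52 distinct cards

random.seed(0)
hand = random.sample(deck, 5)  # deal five cards without replacement
print(hand)

# no card can be dealt twice
print(len(set(hand)) == len(hand))

# and we cannot deal more cards than the deck holds
try:
    random.sample(deck, 53)
    oversample_rejected = False
except ValueError:
    oversample_rejected = True
print(oversample_rejected)
```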




We can also use a `range` object as the population:

In [13]:
random.seed(0)
random.sample(range(2, 100, 2), 5)

[50, 98, 54, 6, 34]

This is a quick way to select multiple random integers from a range (remember that you can also generate one random int at a time within a given range):

In [14]:
random.seed(0)
for _ in range(5):
    print(random.randrange(2, 100, 2))

50
98
54
6
34


We could even use a list comprehension:

In [15]:
random.seed(0)
[random.randrange(2, 100, 2) for _ in range(5)]

[50, 98, 54, 6, 34]

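But the `sample` approach is actually faster, and the syntax is easier. Keep in mind that the two are not equivalent in general: `sample` never picks the same value twice, whereas repeated `randrange` calls can.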

We can see this approximately using a rough timing mechanism:

In [16]:
from time import perf_counter

In [17]:
start = perf_counter()
for _ in range(100):
    # repeat test 100 times
    sample = random.sample(range(2, 1_000_000, 2), 10_000)
end = perf_counter()
print(end - start)

0.49355484399999994


In [18]:
start = perf_counter()
for _ in range(100):
    sample = [random.randrange(2, 1_000_000, 2) for _ in range(10_000)]
end = perf_counter()
print(end - start)

0.9185610720000001


Next, the `choices` function lets us choose `k` elements from a sequence (it must be a sequence; sets are not supported), **with** repetition.

This means that when we pick `k` elements from the sequence, the **same** element may be found more than once in the `k` choices.

This is very similar to rolling two dice.

Each die has numbers from `1` to `6`, and from those numbers we want to pick two elements at a time to simulate a dice roll. But just because the first die rolled a `6` does not mean the other die cannot also roll a `6`. In other words, picking one element as the first choice does not preclude it from being chosen again for the second choice.

In [19]:
s = 1, 2, 3, 4, 5, 6

random.seed(11)
for _ in range(5):
    print(random.choices(s, k=2))

[3, 4]
[6, 3]
[4, 4]
[2, 4]
[4, 5]


As you can see, the third time we made our picks, the element `4` was selected twice.
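To get a feel for how often this happens, we can simulate many rolls and count the doubles. With two fair dice we would expect roughly one roll in six to be a double. This is just a sketch; the number of rolls (`n`) is arbitrary.

```python
import random

random.seed(0)
faces = (1, 2, 3, 4, 5, 6)
n = 60_000

# each roll picks two faces with repetition allowed
rolls = [random.choices(faces, k=2) for _ in range(n)]

# a double is a roll where both dice show the same face
doubles = sum(1 for first, second in rolls if first == second)
print(doubles / n)  # should be close to 1/6
```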

The default behavior of `choices` is that, for each of the `k` picks, every element has an equal probability of being chosen (uniform distribution).

But, we can actually change this and assign weights to each element in the sequence - the higher the number relative to another element, the higher the probability that it will be selected.

So, the default is like having a weight of `1` assigned to every element.

To assign weights we have to create another sequence, **equal in length** to the sequence we are choosing from, with the assigned weight in the corresponding position.

The weights can be integers or floats.
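Here is a small sketch of those rules, using a hypothetical two-element population. Because only the relative size of the weights matters, the integer weights `[3, 1]` and the float weights `[0.75, 0.25]` should both pick `'heads'` about three times out of four. A weights sequence of the wrong length is rejected with a `ValueError`.

```python
import random
from collections import Counter

population = ('heads', 'tails')
n = 40_000

random.seed(0)
int_counts = Counter(random.choices(population, weights=[3, 1], k=n))

random.seed(0)
float_counts = Counter(random.choices(population, weights=[0.75, 0.25], k=n))

# both should be close to 0.75
print(int_counts['heads'] / n)
print(float_counts['heads'] / n)

# the weights must line up one-to-one with the population
try:
    random.choices(population, weights=[3, 1, 1], k=1)
    mismatch_rejected = False
except ValueError as ex:
    mismatch_rejected = True
    print(ex)
```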

To actually visualize the distribution based on the weights, let's bring back some of the code we used in the last coding video.

In [20]:
def chart_freq(data):
    pad = max([len(str(el[0])) for el in data])
    for k, v in data:
        print(f"{str(k).rjust(pad)}| {'*' * round(v)}")
        
def freq_distribution(data):
    freq = {}
    for el in data:
        freq[el] = freq.get(el, 0) + 1
    return freq

def relative_freq(freq_dist):
    sum_freq = sum(freq_dist.values())
    return {
        k: v / sum_freq * 100 for k, v in freq_dist.items()
    }

What we want to do here is run `choices` multiple times, and then count how many times each element from the sequence was selected in all these choices.

For example, if we had these choices:

```
[1, 2, 2]
[3, 2, 2]
[2, 1, 2]
```

our frequency analysis should result in:

```
{
  1: 2,
  2: 6,
  3: 1
}
```

We already have a function (`freq_distribution`) that can do this for a single list of elements:

In [21]:
freq_distribution([1, 1, 1, 2, 3, 3])

{1: 3, 2: 1, 3: 2}

A quick and simple way to analyze something like this:

```
[1, 2, 2]
[3, 2, 2]
[2, 1, 2]
```

is to join all these sub-lists into a single list and run it through the same function. (There are probably better ways to do this, but I'm not looking for a super efficient approach here, just some quick code so we can look at how the distribution changes when we modify the weights.)

In [22]:
def freq_distribution_matrix(data):
    # data is a sequence of sequences that we'll join up and then analyze
    linearized = [el for row in data for el in row]
    return freq_distribution(linearized)

In [23]:
random.seed(0)
population = tuple('abcdefghij')
data = [random.choices(population, k=5) for _ in range(3)]
print(data)    

[['i', 'h', 'e', 'c', 'f'], ['e', 'h', 'd', 'e', 'f'], ['j', 'f', 'c', 'h', 'g']]


In [24]:
freq_distribution_matrix(data)

{'i': 1, 'h': 3, 'e': 3, 'c': 2, 'f': 3, 'd': 1, 'j': 1, 'g': 1}
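As one possible alternative to the hand-written helpers, the standard library can do the same counting: `itertools.chain.from_iterable` flattens the rows lazily and `collections.Counter` tallies the elements. A short sketch on the small example from earlier:

```python
from collections import Counter
from itertools import chain

data = [
    [1, 2, 2],
    [3, 2, 2],
    [2, 1, 2],
]

# flatten the rows without building an intermediate list, then count
freq = Counter(chain.from_iterable(data))
print(dict(freq))  # {1: 2, 2: 6, 3: 1}
```

A `Counter` is a `dict` subclass, so it could be passed to `relative_freq` unchanged.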

Let's create a function that will perform all the steps, from generating the data to producing the chart:

In [25]:
def analyze_choices(base_data, num_choices, choice_size, weights=None):
    data = [
        random.choices(base_data, k=choice_size, weights=weights) 
        for _ in range(num_choices)
    ]
    
    freq = freq_distribution_matrix(data)
    rel = relative_freq(freq)
    
    # let's sort the data
    sorted_items = sorted(rel.items(), key=lambda x: x[0])
    chart_freq(sorted_items)

In [26]:
base_data = tuple('abcdefghij')

random.seed(0)
analyze_choices(base_data, 10_000, 5)

a| **********
b| **********
c| **********
d| **********
e| **********
f| **********
g| **********
h| **********
i| **********
j| **********


Looks pretty uniform.

Now let's assign some weights: every element gets a weight of `1`, except `a` (index `0`), which gets `2`, `b` (index `1`), which gets `3`, and `j` (the last element), which gets `4`.

In [27]:
weights = [1] * 10
weights[0] = 2
weights[1] = 3
weights[-1] = 4

In [28]:
weights

[2, 3, 1, 1, 1, 1, 1, 1, 1, 4]

In [29]:
random.seed(0)
analyze_choices(base_data, 10_000, 5, weights)

a| *************
b| *******************
c| ******
d| ******
e| ******
f| ******
g| ******
h| ******
i| ******
j| *************************


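As you can see, our frequency distribution matches the weights we assigned to each element. The weights sum to `16`, so we expect roughly 2/16 = 12.5% for `a`, 3/16 ≈ 18.8% for `b`, 4/16 = 25% for `j`, and 1/16 ≈ 6.3% for each of the other elements, which is what the chart shows (one star per percentage point).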
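The percentages we should expect can be computed directly from the weights, since each element's probability is its weight divided by the sum of all weights. A minimal sketch (the `expected` dictionary is just an illustrative name):

```python
population = 'abcdefghij'
weights = [2, 3, 1, 1, 1, 1, 1, 1, 1, 4]

total = sum(weights)

# probability of each element, expressed as a percentage
expected = {
    el: weight / total * 100
    for el, weight in zip(population, weights)
}

for el, pct in expected.items():
    print(f"{el}| {pct:.2f}%")
```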

Sampling can be quite useful when you are dealing with a huge dataset but want to perform some quick calculations on a smaller random sample, for example estimating the population mean with a sample mean.
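As a rough illustration, we can build a large synthetic population and compare its mean with the mean of a much smaller random sample. The population here is made up (normally distributed values generated with `random.gauss`), and the sizes are arbitrary assumptions.

```python
import random
from statistics import fmean

random.seed(0)

# a synthetic "huge dataset": 200,000 values centered around 100
population = [random.gauss(100, 15) for _ in range(200_000)]

# a random sample of 1,000 values, picked without replacement
sample = random.sample(population, 1_000)

population_mean = fmean(population)
sample_mean = fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean will not match the population mean exactly, but with a sample of this size it should land close to it.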

We'll come back to sampling later in this course.