# Lecture 24 – Randomness and Simulations Part 2

## Data 6, Summer 2024

In [None]:
from datascience import *
import numpy as np
Table.interactive_plots()

from ipywidgets import interact, widgets
from IPython.display import display

## `tbl.sample()`

In Lecture 22 we saw how to use the `np.random` library to simulate randomness. This time we'll take another approach, using the `tbl.sample()` method that is built into the `datascience` library.

### Simulating Random Numbers

Last time we saw that each time you run the below cell, you will get a random integer between 1 and 10 inclusive.

In [None]:
np.random.randint(1, 11)

To do the same thing using the datascience library we first have to create a table with the numbers 1 through 10. We'll then be able to use table methods including our new `.sample` to generate a random number.

In [None]:
numbers = Table().with_column("Numbers 1-10", np.arange(1,11))
numbers.sample().column(0).item(0)

What is going on in the above cell? After making the new table we called `.sample`. By default, `.sample` randomly samples as many rows as there are in the table, with replacement. So we actually simulated 10 random numbers above and then kept the first one. You can see all 10 numbers by removing `.item(0)` from the end of the line above.
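
The default behavior described above can be sketched with plain NumPy. This is only an illustration of the idea, not the `datascience` source code: `np.random.choice` stands in for row sampling, and the variable names are made up for this example.

```python
import numpy as np

values = np.arange(1, 11)  # the numbers 1 through 10

# Pick 10 row positions (0 through 9) at random, with replacement...
positions = np.random.choice(len(values), size=len(values), replace=True)
resampled = values[positions]

# ...then keep only the first of the 10 sampled values.
one_number = resampled.item(0)
print(resampled, one_number)
```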

Similar to the functions in the `np.random` library, you can set the random seed using `np.random.seed` to get reproducible results. Each time you run the following cell you will get the number 9. Changing the seed (15) to something else may change the number you see. Note that the seed is not local to the cell: `np.random.seed(15)` resets NumPy's global random state, so random calls in cells you run afterwards continue from that state. The cell below is repeatable only because it re-seeds every time it runs.

We actually get the same number as before because `.sample` is built off the `np.random` module.

In [None]:
np.random.seed(15)
numbers.sample().column(0).item(0)
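
As a small NumPy-only sketch of what seeding does (the seed value 15 comes from the cell above; the variable names are made up for illustration): re-seeding with the same value replays the same draws, while drawing again without re-seeding simply continues the sequence.

```python
import numpy as np

# Seed NumPy's global generator, then draw three numbers from 1 to 10.
np.random.seed(15)
first_run = np.random.randint(1, 11, size=3)

# Re-seeding with the same value replays exactly the same draws.
np.random.seed(15)
second_run = np.random.randint(1, 11, size=3)

# Without re-seeding, the generator continues from where it left off,
# so these are the next three values in the same sequence.
continued = np.random.randint(1, 11, size=3)

print(first_run, second_run, continued)
print(np.array_equal(first_run, second_run))  # True
```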

#### Choosing our sample size

Just as `np.random.randint` lets us ask for several numbers at once, we can use the first argument of `sample`, `k`, to choose how large a sample we want. As we see below, `numbers.sample(k=5)` gives us a new table with five rows randomly sampled from `numbers`. _Note: this first argument is often referred to as 'n' or the sample\_size._

In [None]:
np.random.seed(12)
numbers.sample(k=5)

#### with_replacement

Notice that in the above example we get two fours. Why is that? By default, `tbl.sample` samples rows from the table with `with_replacement=True`, meaning each row can be sampled more than once. You can think of this as randomly drawing a piece of paper with a number from 1 to 10 written on it from a bag, then putting it back in the bag so it can be drawn again. If you set `with_replacement=False`, you will never choose the same row twice.

In [None]:
np.random.seed(12)
numbers.sample(k=10, with_replacement=False)

As you saw above, sampling with `with_replacement=False` and a sample size equal to the number of rows in the table just shuffles the rows.
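
The difference between the two modes can be sketched with `np.random.choice`, used here as a stand-in for row sampling (an assumption for illustration; its `replace` argument plays the role of `with_replacement`).

```python
import numpy as np

np.random.seed(12)
values = np.arange(1, 11)  # the numbers 1 through 10

# With replacement: each draw is independent, so repeats are possible.
with_repl = np.random.choice(values, size=10, replace=True)

# Without replacement: every value is used at most once, so drawing
# all 10 values just returns them in a shuffled order.
without_repl = np.random.choice(values, size=10, replace=False)

print(with_repl)
print(without_repl)
print(sorted(without_repl) == list(values))  # True
```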

What happens if we try to sample more than 10 rows without replacement?

In [None]:
numbers.sample(k=11, with_replacement=False)

We got an error! That is because we run out of rows in the table to sample: we cannot draw 11 distinct rows if there were only 10 to begin with.
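
NumPy's analogous function behaves the same way, which makes the limit easy to check without a table. This is a sketch using `np.random.choice` as a stand-in; the names `outcome` and `eleven` are made up for illustration.

```python
import numpy as np

values = np.arange(1, 11)  # only 10 distinct values

try:
    np.random.choice(values, size=11, replace=False)
    outcome = "no error"
except ValueError as err:
    # NumPy refuses: there are only 10 distinct values to hand out.
    outcome = f"ValueError: {err}"

print(outcome)

# With replacement there is no such limit, because values can repeat.
eleven = np.random.choice(values, size=11, replace=True)
print(len(eleven))  # 11
```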

#### Weighting

You can also conduct a weighted sample. Don't worry about this too much for this class but if you're curious we would use the `weights` argument as seen below.

In [None]:
# only sampling even numbers
numbers.sample(5, with_replacement=True, weights=[0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2])
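
To check that weights like these really rule out the odd numbers, here is a NumPy-only sketch that draws many more samples. It assumes the weights line up with the values in order (one weight per value, summing to 1) and uses the `p` argument of `np.random.choice` in place of `weights`.

```python
import numpy as np

values = np.arange(1, 11)
# One weight per value, in order; odd numbers get weight 0.
weights = [0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2, 0, 0.2]

draws = np.random.choice(values, size=1000, p=weights)

# Every draw is even, and each even number shows up roughly 20% of the time.
print(np.all(draws % 2 == 0))  # True
print({int(v): float(np.mean(draws == v)) for v in [2, 4, 6, 8, 10]})
```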

## Simulations

Just like with the `np.random` library we can also run simulations with `.sample`. In fact there are times where we may prefer using `.sample` so we only have to work with the `Table` data type rather than arrays.

For instance, let's simulate 100 random numbers between 1 and 6 as if we're rolling a die and create an empirical histogram.

In [None]:
die = Table().with_columns("Roll", np.arange(1, 7)) # faces 1 through 6
sample_100 = die.sample(100)
sample_100

The table above is the result of simulating 100 rolls of a single die. Below we'll make the histogram. We can just call `.hist` directly since we already have a table.

In [None]:
sample_100.hist(bins = np.arange(0.5, 7.5))
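
The bar heights of an empirical histogram like this are just counts per face. As a NumPy-only sketch (the seed and variable names are chosen for illustration), the same counts can be computed directly:

```python
import numpy as np

np.random.seed(1)
rolls = np.random.randint(1, 7, size=100)  # 100 rolls of a six-sided die

# Count how many times each face appeared; these counts are the bar
# heights of the empirical histogram.
faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(face, count)
print(counts.sum())  # 100
```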

Let's once again see what happens when we simulate this process a varying number of times. You don't need to understand the code, but note that it is simpler with tables than with NumPy arrays.

In [None]:
# Don't worry about the code, just play with the slider that appears after running.
w = widgets.FloatLogSlider(
    value=1000,
    base=10,
    min=0, # min exponent of base
    max=6, # max exponent of base
    step=0.2, # exponent step
    description='Log Slider'
)

def sum_rolls_hist(scale):
    fig = die.sample(int(scale)).hist(density = False, 
                                 bins = np.arange(0.5, 7.5),
                                 title = f'Empirical Distribution of 1 Die Roll, Repeated {int(scale)} Times',
                                 show = False)
    display(fig)
    
interact(sum_rolls_hist, scale=w);

Sampling two rolls becomes a lot more difficult though.

In [None]:
two_die = die.with_column("2nd roll", die.column(0))
two_die

#simulating 100 rolls
two_dice_sampled = two_die.sample(100)
two_dice_sampled = two_dice_sampled.with_column("Two Die Sum", two_dice_sampled.column(0) + two_dice_sampled.column(1))
two_dice_sampled

As we can see above, since we're sampling a row at a time, we get nothing but doubles. We'd need a table with every possible ordered pair of rolls to make this work.

In [None]:
# don't worry about how this works
from itertools import product
combos = list(product(range(1, 7), repeat=2)) # all 36 ordered (die 1, die 2) pairs
first_rolls = [roll[0] for roll in combos]
second_rolls = [roll[1] for roll in combos]
two_die = Table().with_columns("Die 1", first_rolls, "Die 2", second_rolls)

#simulating 100 rolls
two_dice_sampled = two_die.sample(100)
two_dice_sampled = two_dice_sampled.with_column("Two Die Sum", two_dice_sampled.column(0) + two_dice_sampled.column(1))
two_dice_sampled
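
One detail worth checking when building a table of pairs is whether the pairs are ordered. The standard-library sketch below (not part of the lecture code) compares the 36 ordered pairs from `itertools.product` with the 21 unordered pairs from `itertools.combinations_with_replacement`. Because `.sample` treats every row as equally likely, only the ordered version matches two fair dice.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

faces = range(1, 7)

ordered = list(product(faces, repeat=2))                   # (1, 2) and (2, 1) both appear
unordered = list(combinations_with_replacement(faces, 2))  # only (1, 2) appears

print(len(ordered), len(unordered))  # 36 21

# Chance of a sum of 7 when every row is equally likely to be sampled.
sums_ordered = Counter(a + b for a, b in ordered)
sums_unordered = Counter(a + b for a, b in unordered)
print(sums_ordered[7] / len(ordered))      # 6/36, about 0.167
print(sums_unordered[7] / len(unordered))  # 3/21, about 0.143
```

With unordered pairs, doubles such as (1, 1) are overrepresented relative to mixed pairs such as (1, 2), so the sums would not follow the usual two-dice distribution.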

### Quick Check 1

Fill in the blanks to shuffle the deck of cards.

In [50]:
cards = Table().with_columns("Card Number", ["Ace", "King", "Queen", "Jack", "10", "9", "8", "7", "6", "5", "4", "3", "2"] * 4).sort(0).with_column("Card Type", ["Spade", "Club", "Diamond", "Hearts"]*13)
...

## Repetition

When testing statistical hypotheses, it is useful to be able to repeat simulations many, many times.

In [57]:
coin = Table().with_columns("Flip Result", make_array("heads", "tails"))
coin

Flip Result
heads
tails


### Simulating Coin Flips

Idea:
1. Flip a coin 100 times. Write down the number of heads.
2. Repeat step 1 many times – say, 10,000 times.
3. Draw a histogram of the number of heads in each **iteration**.

In [58]:
num_heads_arr = np.array([])

for i in np.arange(10000):
    flips = coin.sample(100) # note how this time we used sample
    heads = np.count_nonzero(flips.column(0) == 'heads') # now we have to convert table to an array
    num_heads_arr = np.append(num_heads_arr, heads)

In [60]:
num_heads_arr

array([ 54.,  55.,  51., ...,  53.,  40.,  46.])

In [61]:
len(num_heads_arr)

10000

In [62]:
Table().with_columns('Number of Heads', num_heads_arr) \
       .hist(density = False, bins = np.arange(25.5, 76.5), title = 'Empirical Distribution of 100 Coin Flips')
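
For comparison, the same three steps can be sketched without a loop by generating all the flips at once in NumPy. This is an alternative illustration, not the approach used above; the seed and the encoding of heads as 1 are assumptions made for the example.

```python
import numpy as np

np.random.seed(0)

# 10,000 repetitions (rows) of 100 fair coin flips (columns); 1 means heads.
flips = np.random.randint(0, 2, size=(10000, 100))
num_heads = flips.sum(axis=1)  # number of heads in each repetition

print(len(num_heads))    # 10000
print(num_heads.mean())  # close to 50
print(num_heads.min(), num_heads.max())
```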