# Random Sampling

Think back for a moment to quant methods with Coye. There, you spent time talking about *sampling*, which is the process of drawing data from a population: the full set of values that we could potentially draw from.

Consider the following vector, `population`, which is arranged from lowest to highest.

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
N = 1000
population = np.random.normal(loc=10, scale=10, size=N)
population = np.sort(population)

In this vector, we've placed 1,000 draws from a Gaussian distribution with a mean of 10 and a standard deviation of 10. In this case, we know that the population *parameter* **mean** is just:

In [3]:
population.mean()

9.9156957170544313

There isn't any value statement that we are attaching to this parameter value; it is simply the average of the population.

What if, for some reason, we couldn't actually observe all of the values in the population? Perhaps it would take too much time, or cost too much money. How could we produce an *estimate* of the population parameter that we're interested in?

1. A *bad* estimate might just take the first 9 data values that we observe, and add them up: `population[:9].sum()`.

Why is this a bad estimate? Well, there just isn't any clear relationship between the **sample** that we're drawing and then applying a function across, and the parameter that we're interested in. That is, the *estimator* that we've produced just doesn't provide us an unbiased estimate of the population parameter. In particular, because the vector is sorted and its smallest values here are negative, the sum is going to underestimate the true value.

In [4]:
population[:9].sum()

-145.62388471995698

2. The choice to sum those values seems pretty illogical, right? What if, instead, we were to take the average of them. Would this estimator produce an estimate of the population parameter we're looking for that is any less biased? 

In [5]:
population[:9].mean()

-16.180431635550775

Not really. Although there is a more natural relationship between the function we're applying to the sample and the population parameter we're looking for -- the population average -- the sample we're drawing from is still systematically different from the population. Specifically, it is always on the *low* side of the average, right?

And so, this estimator is going to remain a **biased estimator** for any sample size that is smaller than the whole population.
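
One way to see the bias is to repeat the experiment on many freshly generated populations and check which side of the truth the estimate lands on. The sketch below is only an illustration and is not part of the original notebook: the seed, the number of repetitions `n_reps`, and the helper name `lowest_k_mean` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility


def lowest_k_mean(values, k=9):
    """Mean of the k smallest values, i.e. the first k entries of the sorted vector."""
    return np.sort(values)[:k].mean()


n_reps = 200  # assumed number of repeated experiments
errors = np.empty(n_reps)
for i in range(n_reps):
    pop = rng.normal(loc=10, scale=10, size=1000)
    # Estimation error: the estimate minus the true population mean
    errors[i] = lowest_k_mean(pop) - pop.mean()

# Print whether every error is negative, and the average error
print((errors < 0).all(), errors.mean())
```

Because the estimate misses in the same direction on every repetition, averaging over repetitions does not help; that is what distinguishes bias from ordinary sampling noise.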

Interestingly, however, as the size of the sample gets larger, the estimate that the estimator produces gets closer to the truth. When this is the case, we say that the estimator is **consistent.** Although it might be biased in any finite sample, its estimate converges to the truth as the sample size grows toward infinity.

In [6]:
sample_1 = population[:9].mean()
sample_2 = population[:100].mean()
sample_3 = population[:999].mean()

In [7]:
f'Sample 1: {sample_1}' 

'Sample 1: -16.180431635550775'

In [8]:
f'Sample 2: {sample_2}'

'Sample 2: -7.177306697103318'

In [9]:
f'Sample 3: {sample_3}'

'Sample 3: 9.884058346917456'

Would you say that the first estimator we were using -- the sum of the sample -- was a consistent estimator? Why or why not? 

# A natural estimator

Of course, the natural estimator that you probably had in mind was: 

1. Take a random sample, and 
2. Apply the sample mean function onto that sample. 

Good idea! This estimator -- the sample mean -- is going to be an unbiased, consistent estimator of the population parameter we're interested in.

One way to make the random sample is just to use the `np.random.choice` function, and to provide a vector of integers that we want to sample from.

In [10]:
small_pop = np.arange(10)
np.random.choice(small_pop)

0

If we call `np.random.choice` without any other arguments, we're going to pull a single draw from the population. What we really want is a group of draws, though, so we'll add an argument to increase the number.

In [11]:
np.random.choice(small_pop, size=5, replace=False)

array([9, 6, 8, 2, 5])

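If you pass a `size` argument, you'll pull a sample of that size. We can use these values as positional indexes into the vector that we're sampling from; for example, the population could be the set of the first ten letters. We can also pass the vector itself to `np.random.choice` and draw its values directly, as the `sample_mean_*` cells do with `population`.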
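
The positional-index idea can be sketched as follows. This is a hedged illustration rather than the notebook's own cell: the `letters` array is a hypothetical population, and the code uses NumPy's seeded `default_rng` generator instead of `np.random.choice` so that the draw is reproducible.

```python
import numpy as np

rng = np.random.default_rng(1)  # assumed seed, only for reproducibility

# Hypothetical population: the first ten lowercase letters
letters = np.array(list("abcdefghij"))

# Draw 5 distinct positions from 0..9, then use them to index the population
positions = rng.choice(np.arange(len(letters)), size=5, replace=False)
sampled_letters = letters[positions]

# replace=False guarantees that no position, and so no letter, repeats
print(positions, sampled_letters)
```

Sampling positions rather than values is handy when each unit carries more than one attribute, since the same positions can index several aligned arrays.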

In [12]:
sample_mean_1a = np.random.choice(population, replace=False, size=10).mean()
sample_mean_1b = np.random.choice(population, replace=False, size=100).mean()
sample_mean_2 = np.random.choice(population, replace=False, size=100).mean()
sample_mean_3 = np.random.choice(population, replace=False, size=999).mean()

In [13]:
sample_mean_1a

5.3858784315362165

In [14]:
sample_mean_1b

9.9272463722901669

In [15]:
sample_mean_2

9.3900048740046227

In [16]:
sample_mean_3

9.9100982707701206
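
The claim that the sample mean is unbiased can be checked by simulation: draw many random samples of the same size and compare the average of their means to the population mean. The sketch below is a minimal illustration under stated assumptions: the seed, the sample size of 10, and the repetition count `n_reps` are choices made for the example, and the population is regenerated with a seeded generator, so its numbers will not match the outputs shown above.

```python
import numpy as np

rng = np.random.default_rng(2)  # assumed seed, only for reproducibility

# Rebuild a population like the one above: 1,000 sorted Gaussian draws
population = np.sort(rng.normal(loc=10, scale=10, size=1000))

n_reps = 5000  # assumed number of repeated random samples
sample_means = np.array([
    rng.choice(population, size=10, replace=False).mean()
    for _ in range(n_reps)
])

# Print the population mean, the average of the sample means, and their spread
print(population.mean(), sample_means.mean(), sample_means.std())
```

Any single sample mean can land well away from the population mean, but the misses fall on both sides, so the average of the sample means sits close to it.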

# Questions for Understanding 

1. Are you surprised that the values are not all the same? Why or why not? Are you surprised that `sample_mean_1b` and `sample_mean_2` are not exactly the same? Aren't they the same size?

2. As the size of the sample that you take gets larger, will the estimate that you calculate tend to get closer to the truth, farther from the truth, or stay the same distance from the truth? Why? 