1\. From sample mean to population mean
---------------------------------------

00:00 - 00:13

Now we're going to study some patterns that we can observe in the sample mean when the sample size becomes larger. These patterns form the basis of the law of large numbers.

2\. Sample mean review
----------------------

00:13 - 00:27

Jakob Bernoulli developed the law of large numbers in his book Ars Conjectandi (1713). The law states that the sample mean tends to the expected value as the sample grows larger.

3\. Sample mean review (Cont.)
------------------------------

00:27 - 00:33

For example, we calculate the sample mean of two values by adding the values and dividing by two.

4\. Sample mean review (Cont.)
------------------------------

00:33 - 00:39

For three values, we add up the values and divide by three.

5\. Sample mean review (Cont.)
------------------------------

00:39 - 00:45

If we have n samples, we add the n values and divide by n.
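
Written as a formula, with $x_1, \dots, x_n$ denoting the n observed values (this notation is assumed here and does not come from the slides), the sample mean is:

$$
\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

For two values this reduces to $(x_1 + x_2)/2$, and for three values to $(x_1 + x_2 + x_3)/3$, which matches the cases above.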

6\. Sample mean review (Cont.)
------------------------------

00:45 - 00:54

As the sample becomes larger, the sample mean gets nearer to the population mean. Let's code a bit.

7\. Generating the sample
-------------------------

00:54 - 01:34

To generate a sample of coin flips, we will use the binomial distribution. First we import the binom object and the describe method from scipy dot stats, then we generate the sample using binom dot rvs. We specify n as 1 coin flip and p as the probability of success (0.5 for a fair coin), then we specify the sample size as 250 and set random_state so we can reproduce our results. After that, we print the first 100 values from our samples.
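
The steps just described can be sketched roughly as follows. The variable name `samples` and the seed value `42` are assumptions, since the narration does not state them.

```python
from scipy.stats import binom

# One fair coin flip per draw: n=1 trial, p=0.5 probability of success.
# size=250 draws; random_state=42 is an assumed seed for reproducibility.
samples = binom.rvs(n=1, p=0.5, size=250, random_state=42)

# Print the first 100 values of the sample (each value is 0 or 1)
print(samples[:100])
```

Because `n=1`, every draw is a single Bernoulli trial, so the sample contains only zeros and ones.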

8\. Calculating the sample mean
-------------------------------

01:34 - 01:53

To calculate the sample mean, we pass the sample to describe and read the mean attribute of the result. We slice samples from index 0 to 10, and we see that for the first 10 values the sample mean is 0.6. Now let's see what this process looks like with an animation.
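
A minimal sketch of that calculation is shown here. The seed is an assumption, so the printed value may differ from the 0.6 mentioned in the narration.

```python
from scipy.stats import binom, describe

# Regenerate the coin-flip sample (seed value is assumed, not from the transcript)
samples = binom.rvs(n=1, p=0.5, size=250, random_state=42)

# describe() returns a result object; its .mean attribute holds the sample mean
first_ten_mean = describe(samples[0:10]).mean
print(first_ten_mean)

# The same number computed directly, as a cross-check
print(samples[0:10].mean())
```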

9\. Sample mean of coin flips (Cont.)
-------------------------------------

01:53 - 02:31

In this animation you see how we take the sample mean for values from 2 to 250 using the describe method. The red line represents the population mean, in this case 0.5, and the blue line is the sample mean. As you'll notice, due to the randomness of the data, the sample mean fluctuates around the population mean -- but as more data becomes available, the sample mean approaches the population mean. Let's see another example with the normal distribution.

10\. Sample mean of normal distribution
---------------------------------------

02:31 - 03:19

Now we have three animated plots. At the top left we have our sample data from a normal distribution. We use one dot for each sample. At the top right we've plotted a histogram of the sample data, and at the bottom we've plotted the sample mean. In all the plots the population mean is represented with a black line and the sample mean is drawn using a red line. You can see how the red line moves and gets nearer to the population mean as more data becomes available. Enjoy the animations for a bit, and get some perspective. Now let's move on and learn how to plot the sample mean with Python.

11\. Plotting the sample mean
-----------------------------

03:19 - 03:45

First we import the binom object and describe from scipy dot stats, along with matplotlib dot pyplot as plt. Then we initialize the variables, setting coin_flips to 1, p to 0.5, sample_size to 1000, and averages to an empty list.

12\. Plotting the sample mean (Cont.)
-------------------------------------

03:45 - 04:06

Finally, we loop over an index i that runs from 2 to sample_size, using a range from 2 to sample_size plus 1. In each iteration we use describe to calculate the mean of the sample values from index 0 up to i, and we store the result in the averages list using append. Then we print the first 10 values.
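
Putting the initialization and the loop together gives a sketch like the one below. The narration does not show how the sample itself is drawn, so the `binom.rvs` call and its seed are assumptions.

```python
from scipy.stats import binom, describe

# Initialize the variables named in the narration
coin_flips, p, sample_size = 1, 0.5, 1000
averages = []

# Assumed step: draw the full sample once (the seed value is also assumed)
samples = binom.rvs(n=coin_flips, p=p, size=sample_size, random_state=42)

# Sample mean of the first i values, for i = 2, ..., sample_size
for i in range(2, sample_size + 1):
    averages.append(describe(samples[0:i]).mean)

print(averages[0:10])
```

The loop starts at 2 because `describe` also computes the variance, which needs at least two values.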

13\. Plotting the sample mean (Cont.)
-------------------------------------

04:06 - 04:20

We add a red line with plt dot axhline at the population mean and plot the averages. Then we add a legend in the upper-right corner and show our plot.
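
The plotting step might look like the sketch below. It swaps the `describe` loop for a cumulative sum to compute the running means, which gives the same values with less work. The seed and the sample generation are assumptions. The final `plt.show()` is left out so the sketch also runs without a display.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

p, sample_size = 0.5, 1000
samples = binom.rvs(n=1, p=p, size=sample_size, random_state=42)  # assumed seed

# Running mean of the first i values; drop i = 1 so the series starts at i = 2
counts = np.arange(1, sample_size + 1)
averages = (np.cumsum(samples) / counts)[1:]

# Red horizontal line at the population mean, then the running sample mean
plt.axhline(binom.mean(n=1, p=p), color="red")
plt.plot(averages, "-")

# Legend entries follow the order in which the lines were drawn
plt.legend(("Population mean", "Sample mean"), loc="upper right")
# In an interactive session, finish with plt.show()
```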

14\. Sample mean plot
---------------------

04:20 - 04:25

The result is this beautiful plot that shows the law of large numbers in action.

15\. Let's practice!
--------------------

04:25 - 04:32

Let's get some hands-on practice with the law of large numbers.

Generating a sample
===================

A hospital's planning department is investigating different treatments for newborns. As a data scientist you are hired to simulate the sex of 250 newborn children, and you are told that on average 50.50% are males.

Instructions
------------

-   Import the `binom` object from `scipy.stats`.
-   Generate a sample of 250 newborns with 50.50% probability of being male.
-   Print the sample.

In [None]:
# Import the binom object
from scipy.stats import binom

# Generate a sample of 250 newborn children
sample = binom.rvs(n=1, p=0.505, size=250, random_state=42)

# Show the sample values
print(sample)

Calculating the sample mean
===========================

Now you can calculate the sample mean for this generated sample by taking some elements from the sample.

Using the `sample` variable you just created, you'll calculate the sample means of the first 10, 50, and 250 samples.

The `binom` object and `describe()` method from `scipy.stats` have been imported for your convenience.

Instructions 1/3
----------------

Print the sample mean of the first 10 samples.

In [None]:
# Print the sample mean of the first 10 samples
print(describe(sample[0:10]).mean)

Instructions 2/3
----------------

-   Print the sample mean of the first 50 samples.

In [None]:
# Print the sample mean of the first 50 samples
print(describe(sample[0:50]).mean)

Instructions 3/3
----------------

-   Print the sample mean of the first 250 samples.

In [None]:
# Print the sample mean of the first 250 samples
print(describe(sample[0:250]).mean)

Plotting the sample mean
========================

Now let's plot the sample mean, so you can see more clearly how it evolves as more data becomes available.

For this exercise we'll again use the sample you generated earlier, which is available in the `sample` variable. The `binom` object and `describe()` function have already been imported for you from `scipy.stats`, and `matplotlib.pyplot` is available as `plt`.

Instructions 1/3
----------------

In a `for` statement for `i` in a range that goes from `2` to `251`, do the following:

-   Calculate the sample mean for the first `i` values.
-   Use `append` to add the value to the `averages` list.

In [None]:
# Calculate the sample mean and store it in the averages list
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

Instructions 2/3
----------------

Add a horizontal line at the mean value of the binomial distribution with `n=1` and `p=0.505`.

In [None]:
# Calculate the sample mean and store it in the averages list
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')

Instructions 3/3
----------------

Add a legend with labels `Population mean` and `Sample mean` and show the plot.

In [None]:
# Calculate the sample mean and store it in the averages list
averages = []
for i in range(2, 251):
    averages.append(describe(sample[0:i]).mean)

# Add population mean line and sample mean plot
plt.axhline(binom.mean(n=1, p=0.505), color='red')
plt.plot(averages, '-')

# Add legend
plt.legend(("Population mean","Sample mean"), loc='upper right')
plt.show()

1\. Adding random variables
---------------------------

00:00 - 00:11

The most important result in probability and statistics is the central limit theorem. Let's take a look at what happens when you add random variables.

2\. The central limit theorem (CLT)
-----------------------------------

00:11 - 00:54

The CLT states that the sum of random variables tends to a normal distribution as the number of them grows to infinity. This theorem works under certain conditions: the variables must have the same distribution with a finite variance, and the variables must be independent. You can start adding binomial, geometric, or even Poisson random variables, and as you add more, you get a normal distribution. Recall that random variables are independent when the outcome of one variable does not affect the outcome of the others. Let's see an example.
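
One common way to write this down uses notation that is assumed here rather than taken from the slides. Let $X_1, \dots, X_n$ be independent, identically distributed random variables with mean $\mu$ and finite variance $\sigma^2 > 0$. Then the standardized sum converges in distribution to a standard normal:

$$
\frac{\sum_{i=1}^{n} X_i - n\mu}{\sigma\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty
$$

Equivalently, for large n the sum is approximately normal with mean $n\mu$ and variance $n\sigma^2$.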

3\. Poisson sample generation
-----------------------------

00:54 - 01:22

In an example we saw previously about a busy highway with two accidents per day on average, we modeled the number of accidents per day with a Poisson random variable. Now imagine we have the data from 1,000 days. In the following animation you can see on the left the values of our population, and on the right you can see the histogram of the population values. This is our data.

4\. Selection from population
-----------------------------

01:22 - 01:45

Now we are going to take 10 values from our population many times, so we can calculate the sample mean of those values. Notice the red dots. Recall that when calculating the sample mean we are adding the values, and the central limit theorem applies to the sum of random variables that are identically distributed.

5\. Selection from population (Cont.)
-------------------------------------

01:45 - 01:57

Notice the histogram of the population -- it's skewed! Now we are going to repeat this process 350 times and see the outcome.

6\. Poisson sample mean plot
----------------------------

01:57 - 02:39

Take a look at these animations. At the top we have the population. We're highlighting in red the 10 randomly selected values used to calculate the sample means, and plotting those values. At the bottom left we're plotting the sample means, and at the bottom right is a histogram of the sample means. Notice that as we calculate more sample means from our population the histogram is centered at 2, which is the mean of our population, and the histogram takes on a bell shape. That is the magic of the central limit theorem. Now let's code this important result.

7\. Poisson population plot
---------------------------

02:39 - 03:11

First we import poisson and describe from scipy dot stats. Then, from matplotlib we import pyplot as plt, and we import numpy as np. We generate our population with poisson dot rvs with mu equals 2, size equals 1000, and the random_state seed set to reproduce our results. Now we can plot a histogram of our population.
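
A rough sketch of the population step follows. The seed value `20` is an assumption, because the narration only says that a seed is set. `plt.show()` is omitted so the sketch runs without a display.

```python
import matplotlib.pyplot as plt
from scipy.stats import poisson

# 1,000 days of accident counts, averaging 2 accidents per day (seed is assumed)
population = poisson.rvs(mu=2, size=1000, random_state=20)

# Histogram of the population values
plt.hist(population)
# In an interactive session, finish with plt.show()
```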

8\. Poisson population plot (Cont.)
-----------------------------------

03:11 - 03:17

This is the plot. It's a skewed histogram of our Poisson data. Next, let's plot the sample means.

9\. Sample means plot
---------------------

03:17 - 03:50

We first fix our random seed to make the results reproducible. We define an empty list called sample_means to store the sample mean values. Then we write a for statement that loops an arbitrarily chosen large number of times, such as 350. We select 10 values from our population using np dot random dot choice, and then we append the sample mean of the 10 values to the sample_means list.
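
The loop described here can be sketched as follows. Both seed values are assumptions, and the population is regenerated so that the snippet stands on its own.

```python
import numpy as np
from scipy.stats import describe, poisson

# Population of 1,000 Poisson values with mean 2 (seed assumed)
population = poisson.rvs(mu=2, size=1000, random_state=20)

# Fix the NumPy seed so the random selections are reproducible (value assumed)
np.random.seed(42)

sample_means = []
for _ in range(350):
    # Select 10 values from the population (with replacement by default)
    sample = np.random.choice(population, 10)
    # Store the mean of those 10 values
    sample_means.append(describe(sample).mean)

# The average of the sample means should sit close to the population mean
print(np.mean(sample_means), population.mean())
```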

10\. Sample means plot (Cont.)
------------------------------

03:50 - 04:08

Outside the for statement, we add labels and a title to the plot. Finally, we plot and show the histogram. We get a plot centered at 2, which is the mean of the population, with a bell shape as we expected.
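
A sketch of the histogram step is shown below. It draws all 350 samples in a single vectorized call instead of the for statement, so it is a swapped-in shortcut and not the code from the slides. The seeds, axis labels, and title text are assumptions, and `plt.show()` is omitted so the sketch runs without a display.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import poisson

population = poisson.rvs(mu=2, size=1000, random_state=20)  # seed assumed

np.random.seed(42)  # seed assumed
# 350 rows of 10 randomly chosen values each; each row mean is one sample mean
sample_means = np.random.choice(population, size=(350, 10)).mean(axis=1)

# Labels and title (the exact wording is assumed)
plt.xlabel("Sample mean values")
plt.ylabel("Frequency")
plt.title("Sample means histogram")

# Histogram of the sample means; it should look roughly bell shaped around 2
plt.hist(sample_means)
# In an interactive session, finish with plt.show()
```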

11\. Let's add random variables
-------------------------------

04:08 - 04:29

We've finished with the most important results in probability and statistics. After exercising a bit with the central limit theorem, we will work on two applications of probability in data science, linear regression and logistic regression. Let's add random variables!

Sample means
============

An important result in probability and statistics is that the distribution of the sample mean tends to a normal distribution. This happens when you add random variables that follow **any** distribution, as long as they share the same expected value and variance.

For your convenience, we've loaded `binom` and `describe()` from the `scipy.stats` library and imported `matplotlib.pyplot` as `plt` and `numpy` as `np`. We generated a simulated population with size 1,000 that follows a binomial distribution for 10 fair coin flips and is available in the `population` variable.
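
The preloaded `population` is not shown in the exercise. A plausible way to build one that matches the description is sketched below. The seed is an assumption, and the course may have generated it differently.

```python
from scipy.stats import binom

# 1,000 draws, each counting the heads in 10 fair coin flips (seed assumed)
population = binom.rvs(n=10, p=0.5, size=1000, random_state=42)

# Each value lies between 0 and 10, and the average is close to n * p = 5
print(population[:10], population.mean())
```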

Instructions 1/4
----------------

Select 20 random values from the `population` using `np.random.choice()`.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)

Instructions 2/4
----------------

Calculate the sample mean of `sample` and add the calculated sample mean to the `sample_means` list.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

Instructions 3/4
----------------

Plot a histogram of the `sample_means` list.

In [None]:
# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.xlabel("Sample mean values")
plt.ylabel("Frequency")
plt.show()

Instructions 4/4
----------------

Question
--------

Inspecting the plot, what is the distribution of the sample mean?

### Possible answers

Same as the generated sample

Binomial

[x] Normal

Sample means follow a normal distribution
=========================================

In the previous exercise, we generated a population that followed a binomial distribution, chose 20 random samples from the population, and calculated the sample mean. Now we're going to test some other probability distributions to see the shape of the sample means.

From the `scipy.stats` library, we've loaded the `poisson` and `geom` objects and the `describe()` function. We've also imported `matplotlib.pyplot` as `plt` and `numpy` as `np`.

As you'll see, the shape of the distribution of the means is the same even though the samples are generated from different distributions.

Instructions 1/2
----------------

Select 20 values from the population, add the sample mean to the `sample_means` list, and plot a histogram.

In [None]:
# Generate the population
population = geom.rvs(p=0.5, size=1000)

# Create list for sample means
sample_means = []
for _ in range(3000):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.show()

Instructions 2/2
----------------

-   Repeat the process with a population generated from a Poisson distribution: select 20 values from the population, add the sample mean to the `sample_means` list, and plot a histogram.

In [None]:
# Generate the population
population = poisson.rvs(mu=2, size=1000)

# Create list for sample means
sample_means = []
for _ in range(1500):
    # Take 20 values from the population
    sample = np.random.choice(population, 20)
    # Calculate the sample mean
    sample_means.append(describe(sample).mean)

# Plot the histogram
plt.hist(sample_means)
plt.show()

Adding dice rolls
=================

To illustrate the central limit theorem, we are going to work with dice rolls. We'll generate the samples and then add them to plot the outcome.

You're provided with a function named `roll_dice()` that will generate the sample dice rolls. `numpy` is already imported as `np` for your convenience: you have to use `np.add(sample1, sample2)` to add samples. Also, `matplotlib.pyplot` is imported as `plt` so you can plot the histograms.
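
The body of `roll_dice()` is not shown in the exercise. The version below is a hypothetical stand-in, assuming fair six-sided dice drawn with `np.random.choice`, so that the cells in this exercise can be run outside the course environment.

```python
import numpy as np


def roll_dice(num_rolls):
    """Hypothetical stand-in for the course helper: fair six-sided dice rolls."""
    return np.random.choice([1, 2, 3, 4, 5, 6], num_rolls)


# Quick check with a fixed seed
np.random.seed(42)
print(roll_dice(num_rolls=5))
```

Because this stand-in relies on NumPy's global random state, `np.random.seed(42)` makes its output reproducible.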

Instructions 1/3
----------------

Generate a sample of `2000` dice rolls using `roll_dice()` and plot a histogram of the sample.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate the sample
sample1 = roll_dice(num_rolls=2000)

# Plot the sample
plt.hist(sample1, bins=range(1, 8), width=0.9)
plt.show()

Instructions 2/3
----------------

-   Generate two samples of `2000` dice rolls, add `sample1` and `sample2` using `np.add()`, store the result in the variable `sum_of_1_and_2`, then plot `sum_of_1_and_2`.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate two samples of 2000 dice rolls
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)

# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)

# Plot the sum
plt.hist(sum_of_1_and_2, bins=range(2, 14), width=0.9)
plt.show()

Instructions 3/3
----------------

-   Generate a third sample, add it to `sum_of_1_and_2` using `np.add()`, store the result in the variable `sum_of_3_samples`, then plot `sum_of_3_samples`.

In [None]:
# Configure random generator
np.random.seed(42)

# Generate the samples
sample1 = roll_dice(2000)
sample2 = roll_dice(2000)
sample3 = roll_dice(2000)  # Generate the third sample

# Add the first two samples
sum_of_1_and_2 = np.add(sample1, sample2)

# Add the first two with the third sample
sum_of_3_samples = np.add(sum_of_1_and_2, sample3)

# Plot the result
plt.hist(sum_of_3_samples, bins=range(3, 20), width=0.9)
plt.show()
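
As a numeric complement to the histogram, the sketch below compares the simulated sum of three dice with its theoretical mean and variance. It uses a hypothetical stand-in for `roll_dice()` that assumes fair six-sided dice. One fair die has mean 3.5 and variance 35/12, so the sum of three independent dice has mean 10.5 and variance 8.75.

```python
import numpy as np


def roll_dice(num_rolls):
    """Hypothetical stand-in for the course helper: fair six-sided dice rolls."""
    return np.random.choice([1, 2, 3, 4, 5, 6], num_rolls)


np.random.seed(42)

# Element-wise sum of three independent samples of 2000 rolls
sum_of_3_samples = np.add(np.add(roll_dice(2000), roll_dice(2000)), roll_dice(2000))

# Theoretical values for the sum of three independent fair dice
expected_mean = 3 * 3.5       # 10.5
expected_var = 3 * 35 / 12    # 8.75

print(sum_of_3_samples.mean(), expected_mean)
print(sum_of_3_samples.var(), expected_var)
```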