1\. What are the chances?
-------------------------

00:00 - 00:12

People talk about chance pretty frequently, like what are the chances of closing a sale, of rain tomorrow, or of winning a game? But how exactly do we measure chance?

2\. Measuring chance
--------------------

00:12 - 00:59

We can measure the chances of an event using probability. We can calculate the probability of some event by taking the number of ways the event can happen and dividing it by the total number of possible outcomes. For example, if we flip a coin, it can land on either heads or tails. To get the probability of the coin landing on heads, we divide the one way to get heads by the two possible outcomes, heads and tails. This gives us one half, or a 50 percent chance of getting heads. Probability is always between zero and 100 percent. If the probability of something is zero, it's impossible, and if the probability of something is 100 percent, it will certainly happen.
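
As a minimal sketch of that calculation, the coin-flip example could be written as below. The helper name `probability` is our own and is not part of any library.

```python
from fractions import Fraction


def probability(ways_event_can_happen: int, total_outcomes: int) -> Fraction:
    """Probability = ways the event can happen / total possible outcomes."""
    return Fraction(ways_event_can_happen, total_outcomes)


# A coin has two possible outcomes, and only one of them is heads
outcomes = ["heads", "tails"]
p_heads = probability(outcomes.count("heads"), len(outcomes))

print(p_heads)         # 1/2
print(float(p_heads))  # 0.5
```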

3\. Assigning salespeople
-------------------------

00:59 - 01:15

Let's look at a more complex scenario. There's a meeting coming up with a potential client, and we want to send someone from the sales team to the meeting. We'll put each person's name on a ticket in a box and pull one out randomly to decide who goes to the meeting.

4\. Assigning salespeople
-------------------------

01:15 - 01:23

Brian's name gets pulled out. Since there are four names in the box, the probability of Brian being selected is one out of four, or 25%.

5\. Sampling from a DataFrame
-----------------------------

01:23 - 01:45

We can recreate this scenario in Python using the sample() method. By default, it randomly samples one row from the DataFrame. However, if we run the same thing again, we may get a different row since the sample method chooses randomly. If we want to show the team how we picked Brian, this won't work well.
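
A rough sketch of what this could look like is below. The `sales_team` DataFrame and the names Amir and Damian are assumptions made for illustration; the lesson only names Brian and Claire and implies a team of four.

```python
import pandas as pd

# Hypothetical four-person team; only Brian and Claire are named in the lesson
sales_team = pd.DataFrame({"name": ["Amir", "Brian", "Claire", "Damian"]})

# With no arguments, sample() returns one randomly chosen row
picked = sales_team.sample()
print(picked)

# Running it again may return a different row
picked_again = sales_team.sample()
print(picked_again)
```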

6\. Setting a random seed
-------------------------

01:45 - 02:24

To ensure we get the same results when we run the script in front of the team, we'll set the random seed using np.random.seed(). The seed is a number that NumPy's random number generator uses as a starting point, so if we give it the same seed, it will generate the same random values each time. The number itself doesn't matter. We could use 5, 139, or 3 million. The only thing that matters is that we use the same seed the next time we run the script. Now, we, or one of the sales-team members, can run this code over and over and get Brian every time.
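
The sketch below illustrates the idea, assuming the same hypothetical `sales_team` DataFrame (names other than Brian and Claire are made up). The seed value 10 is arbitrary, and which name comes out depends on the seed and the library versions, so no particular person is guaranteed here.

```python
import numpy as np
import pandas as pd

# Hypothetical four-person team; only Brian and Claire are named in the lesson
sales_team = pd.DataFrame({"name": ["Amir", "Brian", "Claire", "Damian"]})

np.random.seed(10)  # any integer works; 10 is arbitrary
first_run = sales_team.sample()

np.random.seed(10)  # same seed again, as if re-running the script
second_run = sales_team.sample()

# Both runs pick the same row because they start from the same seed
print(first_run.equals(second_run))  # True
```

pandas also accepts a `random_state` argument on `sample()`, which fixes the seed for that one call without changing NumPy's global state.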

7\. A second meeting
--------------------

02:24 - 02:45

Now there's another potential client who wants to meet at the same time, so we need to pick another salesperson. Brian has already been picked and he can't be in two meetings at once, so we'll pick between the remaining three. This is called sampling without replacement, since we aren't replacing the name we already pulled out.

8\. A second meeting
--------------------

02:45 - 02:52

This time, Claire is picked, and the probability of this is one out of three, or about 33%.

9\. Sampling twice in Python
----------------------------

02:52 - 02:59

To recreate this in Python, we can pass 2 into the sample method, which will give us 2 rows of the DataFrame.
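
A minimal sketch, again assuming a hypothetical `sales_team` DataFrame: because `sample()` does not replace rows by default, the two rows returned are always different people.

```python
import numpy as np
import pandas as pd

# Hypothetical four-person team; only Brian and Claire are named in the lesson
sales_team = pd.DataFrame({"name": ["Amir", "Brian", "Claire", "Damian"]})

np.random.seed(5)  # arbitrary seed, only for reproducibility

# replace=False is the default, so no name can appear twice
two_picks = sales_team.sample(2)
print(two_picks)
print(two_picks["name"].is_unique)  # True
```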

10\. Sampling with replacement
------------------------------

02:59 - 03:15

Now let's say the two meetings are happening on different days, so the same person could attend both. In this scenario, we need to return Brian's name to the box after picking it. This is called sampling with replacement.

11\. Sampling with replacement
------------------------------

03:15 - 03:23

Claire gets picked for the second meeting, but this time, all four names were in the box, so the probability of picking her is 25%.

12\. Sampling with/without replacement in Python
------------------------------------------------

03:23 - 03:38

To sample with replacement, set the replace argument to True, so names can appear more than once. If there were 5 meetings, all at different times, it's possible to pick some rows multiple times since we're replacing them each time.
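
As an illustrative sketch (the `sales_team` DataFrame is hypothetical), sampling five times from four names is only possible with replacement, and at least one name must then repeat.

```python
import numpy as np
import pandas as pd

# Hypothetical four-person team; only Brian and Claire are named in the lesson
sales_team = pd.DataFrame({"name": ["Amir", "Brian", "Claire", "Damian"]})

np.random.seed(5)  # arbitrary seed, only for reproducibility

# Five meetings, four people: rows can be picked more than once
five_picks = sales_team.sample(5, replace=True)
print(five_picks)

# Without replacement, asking for more rows than exist raises an error
try:
    sales_team.sample(5)
except ValueError as err:
    print("Sampling 5 of 4 rows without replacement failed:", err)
```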

13\. Independent events
-----------------------

03:38 - 03:53

Let's quickly talk about independence. Two events are independent if the probability of the second event isn't affected by the outcome of the first event. For example, if we're sampling with replacement, the probability

14\. Independent events
-----------------------

03:53 - 04:04

that Claire is picked second is 25%, no matter who gets picked first. In general, when sampling with replacement, each pick is independent.

15\. Dependent events
---------------------

04:04 - 04:19

In contrast, events are considered dependent when the outcome of the first changes the probability of the second. If we sample without replacement, the probability that Claire is picked second depends on who gets picked first.

16\. Dependent events
---------------------

04:19 - 04:25

If Claire is picked first, there's 0% probability that Claire will be picked second.

17\. Dependent events
---------------------

04:25 - 04:36

If someone else is picked first, there's a 33% probability Claire will be picked second. In general, when sampling without replacement, each pick is dependent.
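
The difference between independent and dependent picks can be checked by enumerating the cases. This is a sketch under stated assumptions: the helper `p_second` and the names Amir and Damian are ours, chosen for illustration.

```python
from fractions import Fraction

# Hypothetical four-person team; only Brian and Claire are named in the lesson
names = ["Amir", "Brian", "Claire", "Damian"]


def p_second(target, first_pick, names, replace):
    """P(target is picked second), given who was picked first."""
    if replace:
        remaining = list(names)  # the first name goes back in the box
    else:
        remaining = [n for n in names if n != first_pick]
    return Fraction(remaining.count(target), len(remaining))


with_repl = {first: p_second("Claire", first, names, True) for first in names}
without_repl = {first: p_second("Claire", first, names, False) for first in names}

for first in names:
    print(first, with_repl[first], without_repl[first])
# With replacement the probability is 1/4 in every row (independent).
# Without replacement it is 0 if Claire went first and 1/3 otherwise (dependent).
```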

18\. Let's practice!
--------------------

04:36 - 04:40

Head over to the exercises!

#### With or without replacement?

In the video, you learned about two different ways of taking samples: with replacement and without replacement. Although it isn't always easy to tell which best fits various situations, it's important to correctly identify this so that any probabilities you report are accurate. In this exercise, you'll put your new knowledge to the test and practice figuring this out.

##### Instructions

-   For each scenario, decide whether it's sampling with replacement or sampling without replacement.

## Drag the items into the correct bucket

---

### With replacement

- Rolling a die twice  
- Flipping a coin 3 times

---

### Without replacement

- Randomly selecting 5 products from the assembly line to test for quality assurance  
- Randomly picking 3 people to work on the weekend from a group of 20 people  
- From a deck of cards, dealing 3 players 7 cards each


Calculating probabilities
=========================

You're in charge of the sales team, and it's time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he's worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you'll first figure out what the chances are of selecting certain deals.

Recall that the probability of an event can be calculated by

```latex
P(\text{event}) = \frac{\# \text{ ways event can happen}}{\text{total } \# \text{ of possible outcomes}}
```

Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is available.

Instructions 1/3
----------------

-   Count the number of deals Amir worked on for each `product` type using `.value_counts()` and store in `counts`.

In [None]:
# Count the deals for each product
counts = amir_deals['product'].value_counts()
print(counts)

Instructions 2/3
----------------

-   Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as `probs`.

In [None]:
# Count the deals for each product
counts = amir_deals['product'].value_counts()

# Calculate probability of picking a deal with each product
probs = counts / amir_deals.shape[0]
print(probs)

Instructions 3/3
----------------

Question
--------

If you randomly select one of Amir's deals, what's the probability that the deal will involve `Product C`?

### Possible answers

15%

80.43%

[x] 8.43%

22.5%

124.3%

Sampling deals
==============

In the previous exercise, you counted the deals Amir worked on. Now it's time to randomly pick five deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You'll try doing this both with and without replacement.

Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you'll need to set the random seed before sampling from the deals.

Both `pandas` as `pd` and `numpy` as `np` are loaded and `amir_deals` is available.

Instructions 1/3
----------------

-   Set the random seed to `24`.
-   Take a sample of `5` deals **without** replacement and store them as `sample_without_replacement`.

In [None]:
# Set random seed
np.random.seed(24)

# Sample 5 deals without replacement
sample_without_replacement = amir_deals.sample(5)
print(sample_without_replacement)

Instructions 2/3
----------------

-   Take a sample of 5 deals **with** replacement and save as `sample_with_replacement`.

In [None]:
# Set random seed
np.random.seed(24)

# Sample 5 deals with replacement
sample_with_replacement = amir_deals.sample(5, replace=True)
print(sample_with_replacement)


Instructions 3/3
----------------

Question
--------

What type of sampling is better to use for this situation?

### Possible answers

With replacement

[x] Without replacement

It doesn't matter

1\. Discrete distributions
--------------------------

00:00 - 00:08

In this lesson, we'll take a deeper dive into probability and begin looking at probability distributions.

2\. Rolling the dice
--------------------

00:08 - 00:11

Let's consider rolling a standard, six-sided die.

3\. Rolling the dice
--------------------

00:11 - 00:23

There are six numbers, or six possible outcomes, and every number has one sixth, or about a 17 percent chance of being rolled. This is an example of a probability distribution.

4\. Choosing salespeople
------------------------

00:23 - 00:35

This is similar to the scenario from earlier, except we had names instead of numbers. Just like rolling a die, each outcome, or name, had an equal chance of being chosen.

5\. Probability distribution
----------------------------

00:35 - 01:01

A probability distribution describes the probability of each possible outcome in a scenario. We can also talk about the expected value of a distribution, which is the mean of a distribution. We can calculate this by multiplying each value by its probability (one sixth in this case) and summing, so the expected value of rolling a fair die is 3.5.
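
A small sketch of that calculation, using exact fractions so the arithmetic is easy to follow (the variable names are our own):

```python
from fractions import Fraction

# Fair six-sided die: every face has probability 1/6
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Expected value = sum of (value * probability) over all outcomes
expected_value = sum(face * prob for face, prob in fair_die.items())

print(expected_value, float(expected_value))  # 7/2 3.5
```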

6\. Visualizing a probability distribution
------------------------------------------

01:01 - 01:10

We can visualize this using a barplot, where each bar represents an outcome, and each bar's height represents the probability of that outcome.

7\. Probability = area
----------------------

01:10 - 01:29

We can calculate probabilities of different outcomes by taking areas of the probability distribution. For example, what's the probability that our die roll is less than or equal to 2? To figure this out, we'll take the area of each bar representing an outcome of 2 or less.

8\. Probability = area
----------------------

01:29 - 01:43

Each bar has a width of 1 and a height of one sixth, so the area of each bar is one sixth. We'll sum the areas for 1 and 2, to get a total probability of one third.

9\. Uneven die
--------------

01:43 - 02:11

Now let's say we have a die where the two got turned into a three. This means that we now have a 0% chance of getting a 2, and about a 33% chance of getting a 3. To calculate the expected value of this die, we now multiply 2 by 0, since it's impossible to get a 2, and 3 by its new probability, one third. This gives us an expected value of about 3.67, which is slightly higher than that of the fair die.
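
The same multiply-and-sum calculation can be sketched for the uneven die. The dictionary below is our own encoding of the probabilities described above.

```python
from fractions import Fraction

# Uneven die: the 2 was turned into a 3, so P(2) = 0 and P(3) = 1/3
uneven_die = {
    1: Fraction(1, 6),
    2: Fraction(0),
    3: Fraction(1, 3),
    4: Fraction(1, 6),
    5: Fraction(1, 6),
    6: Fraction(1, 6),
}

expected_value = sum(face * prob for face, prob in uneven_die.items())

print(expected_value, round(float(expected_value), 2))  # 11/3 3.67
```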

10\. Visualizing uneven probabilities
-------------------------------------

02:11 - 02:16

When we visualize these new probabilities, the bars are no longer even.

11\. Adding areas
-----------------

02:16 - 02:29

With this die, what's the probability of getting something less than or equal to 2? There's a one sixth probability of getting 1, and zero probability of getting 2,

12\. Adding areas
-----------------

02:29 - 02:33

which sums to one sixth.
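
To compare the two dice side by side, here is a sketch that adds up the bar areas for outcomes of 2 or less. The helper `prob_at_most` is a hypothetical name used only for this illustration.

```python
from fractions import Fraction

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
# Same die, but the 2 became a 3
uneven_die = {**fair_die, 2: Fraction(0), 3: Fraction(1, 3)}


def prob_at_most(dist, threshold):
    """Sum the bar areas (width 1 times height) for outcomes <= threshold."""
    return sum(prob for face, prob in dist.items() if face <= threshold)


print(prob_at_most(fair_die, 2))    # 1/3
print(prob_at_most(uneven_die, 2))  # 1/6
```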

13\. Discrete probability distributions
---------------------------------------

02:33 - 03:04

The probability distributions you've seen so far are both discrete probability distributions, since they represent situations with discrete outcomes. Recall from chapter 1 that discrete variables can be thought of as counted variables. In the case of a die, we're counting dots, so we can't roll a 1.5 or 4.3. When all outcomes have the same probability, like a fair die, this is a special distribution called a discrete uniform distribution.

14\. Sampling from discrete distributions
-----------------------------------------

03:04 - 03:30

Just like we sampled names from a box, we can do the same thing with probability distributions like the ones we've seen. Here's a DataFrame called die that represents a fair die, and its expected value is 3.5. We'll sample from it 10 times to simulate 10 rolls. Notice that we sample with replacement so that we're sampling from the same distribution every time.
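
A sketch of this step is below. The column name `number`, the variable `rolls_10`, and the seed are assumptions, since the transcript doesn't show the code on the slide.

```python
import numpy as np
import pandas as pd

# A fair die as a one-column DataFrame (column name is an assumption)
die = pd.DataFrame({"number": [1, 2, 3, 4, 5, 6]})
print(np.mean(die["number"]))  # 3.5

np.random.seed(123)  # arbitrary seed, only for reproducibility

# Ten rolls: sample with replacement so every roll uses the full die
rolls_10 = die.sample(10, replace=True)
print(rolls_10)
print(rolls_10["number"].mean())
```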

15\. Visualizing a sample
-------------------------

03:30 - 03:39

We can visualize the outcomes of the ten rolls using a histogram, defining the bins we want using np.linspace().
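
The transcript doesn't give the arguments, so the sketch below assumes bin edges from 1 to 7 so that each face gets its own bin. It uses `np.histogram` to show the counts without drawing a plot.

```python
import numpy as np
import pandas as pd

die = pd.DataFrame({"number": [1, 2, 3, 4, 5, 6]})
np.random.seed(123)  # arbitrary seed, only for reproducibility
rolls_10 = die.sample(10, replace=True)

# Seven evenly spaced edges from 1 to 7 give six bins, one per face
bin_edges = np.linspace(1, 7, 7)
print(bin_edges)  # [1. 2. 3. 4. 5. 6. 7.]

# Count how many rolls fall into each bin
counts, _ = np.histogram(rolls_10["number"], bins=bin_edges)
print(counts)
```

Passing the same `bin_edges` to `rolls_10["number"].hist(bins=bin_edges)` and calling `plt.show()` would draw the corresponding histogram.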

16\. Sample distribution vs. theoretical distribution
-----------------------------------------------------

03:39 - 03:58

Notice that we have different numbers of 1's, 2's, 3's, and so on since the sample was random, even though on each roll we had the same probability of rolling each number. The mean of our sample is 3.0, which isn't super close to the 3.5 we were expecting.

17\. A bigger sample
--------------------

03:58 - 04:07

If we roll the die 100 times, the distribution of the rolls looks a bit more even, and the mean is closer to 3.5.

18\. An even bigger sample
--------------------------

04:07 - 04:16

If we roll 1000 times, it looks even more like the theoretical probability distribution and the mean closely matches 3.5.

19\. Law of large numbers
-------------------------

04:16 - 04:26

This is called the law of large numbers, which is the idea that as the size of your sample increases, the sample mean will approach the theoretical mean.
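
The effect can be seen in a quick simulation. This sketch uses NumPy's `default_rng` and `integers` rather than the DataFrame approach from the lesson, and the seed and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily

sample_means = {}
for n in [10, 100, 1000, 100_000]:
    # The upper bound is exclusive, so this draws faces 1 through 6
    rolls = rng.integers(1, 7, size=n)
    sample_means[n] = rolls.mean()
    print(n, sample_means[n])

# The mean of the largest sample should sit very close to 3.5
```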

20\. Let's practice!
--------------------

04:26 - 04:33

Time to solidify your knowledge of probability distributions.

Creating a probability distribution
===================================

A new restaurant opened a few months ago, and the restaurant's management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. In this exercise, you'll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the `restaurant_groups` DataFrame.

Remember that expected value can be calculated by multiplying each possible outcome with its corresponding probability and taking the sum. The `restaurant_groups` data is available. `pandas` is loaded as `pd`, `numpy` is loaded as `np`, and `matplotlib.pyplot` is loaded as `plt`.

Instructions 1/4
----------------

-   Create a histogram of the `group_size` column of `restaurant_groups`, setting `bins` to `[2, 3, 4, 5, 6]`. Remember to show the plot.

In [None]:
# Create a histogram of restaurant_groups and show plot
restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
plt.show()

Instructions 2/4
----------------

-   Count the number of each `group_size` in `restaurant_groups`, then divide by the number of rows in `restaurant_groups` to calculate the probability of randomly selecting a group of each size. Save as `size_dist`.
-   Reset the index of `size_dist`.
-   Rename the columns of `size_dist` to `group_size` and `prob`.

In [None]:
# Create a histogram of restaurant_groups and show plot
restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
plt.show()

# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]

# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

print(size_dist)

Instructions 3/4
----------------

-   Calculate the expected value of the `size_dist`, which represents the expected group size, by multiplying the `group_size` by the `prob` and taking the sum.

In [None]:
# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]
# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

# Expected value
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)

Instructions 4/4
----------------

-   Calculate the probability of randomly picking a group of 4 or more people by subsetting for groups of size 4 or more and summing the probabilities of selecting those groups.

In [None]:
# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]
# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

# Expected value
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])

# Subset groups of size 4 or more
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]

# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)

Identifying distributions
=========================

Which sample is most likely to have been taken from a uniform distribution?

![A: bell-shaped distribution, B: relatively flat distribution, C: lots of lower values, fewer high values](https://assets.datacamp.com/production/repositories/5786/datasets/bd64d4775ec28f36b081d92aa38a391033c03b8f/Screen%20Shot%202020-05-04%20at%204.35.58%20PM.png)

##### Answer the question

#### Possible Answers

Select one answer

-   A

[x] -   B

-   C

#### Expected value vs. sample mean

The app to the right will take a sample from a discrete uniform distribution, which includes the numbers 1 through 9, and calculate the sample's mean. You can adjust the size of the sample using the slider. Note that the expected value of this distribution is 5.

A sample is taken, and you win twenty dollars if the sample's mean is less than 4. There's a catch: you get to pick the sample's size.

Which sample size is ***most likely*** to win you the twenty dollars?

##### Instructions

[x] -   10

-   100

-   1000

-   5000

-   10000
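
One way to see why the smallest sample size is the best bet is to simulate the game. The sketch below is our own illustration, not the app's code: it estimates how often the sample mean falls below 4 for a few sample sizes, with an arbitrary seed and number of trials.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n_trials = 5000

win_rates = {}
for size in [10, 100, 1000]:
    # Each row is one sample of `size` draws from the integers 1 through 9
    samples = rng.integers(1, 10, size=(n_trials, size))
    # Fraction of samples whose mean is below 4 (a win)
    win_rates[size] = np.mean(samples.mean(axis=1) < 4)
    print(size, win_rates[size])
```

Larger samples have means that cluster tightly around the expected value of 5, so a mean below 4 becomes increasingly rare as the sample size grows.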