<a href="https://colab.research.google.com/github/gibsonea/Biostats2026/main/Labs/04_Overview_of_Probability.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="04intro">2.1: An Overview of Probability</a>

----

Statistical inference is the process of using data from a sample to
describe characteristics of the population. Our conclusions are going to
be based on randomness in the sampling process, and we will need to
account for some uncertainty in our predictions. Statistics relies on
theory and applications from probability to help quantify the
uncertainty in our models and estimates.

-   <font color="mediumseagreen">**If we know the make up of a
    population, we can use probability to predict the likelihood that
    certain outcomes occur.**</font >
    -   If we know a six-sided die is fair, what is the probability of
        rolling the die 10 times and getting exactly 4 rolls that land
        on 1?
-   <font color="tomato">**We use statistics to go in the
    reverse direction. Namely, how can we make predictions about a
    population from a random sample of data?**</font >
    -   We want to determine whether or not a six-sided die is fair. We
        roll the die 100 times and get 40 rolls that land on 1. Is the
        die fair?
    -   We will need to use probability to answer statistical questions
        and make inferences about a population.
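
The first die question above can be sketched in R. This is a minimal illustration that assumes the 10 rolls are independent and that each roll lands on 1 with probability 1/6; the variable names `p_exact` and `p_formula` are chosen here for illustration only.

```r
# Probability of exactly 4 ones in 10 independent rolls of a fair die,
# using the built-in binomial probability function.
p_exact <- dbinom(4, size = 10, prob = 1/6)

# The same value from counting: choose which 4 of the 10 rolls are ones,
# then multiply the probabilities of 4 ones and 6 non-ones.
p_formula <- choose(10, 4) * (1/6)^4 * (5/6)^6

print(p_exact)
print(p_formula)
```

Both lines should print the same number, roughly 0.054, so this outcome is fairly unlikely for a fair die.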



## <a name="04sample-space">Sample Space, Outcomes and Events</a>

----

A <font color="dodgerblue">**statistical experiment or
observation**</font > is any random activity that results in a definite
outcome.

-   The <font color="dodgerblue">**sample space**</font >, $\color{dodgerblue}{\Omega}$, is the set of all possible outcomes of an experiment.
-   An <font color="dodgerblue">**outcome**</font >, $\color{dodgerblue}{\omega}$, is a result from an experiment or observation.
-   An <font color="dodgerblue">**event**</font >, $\color{dodgerblue}{A}$, is a collection of one or more outcomes from an experiment or observation.



## <a name="04ex-roll">Example: Rolling a Fair Six-Sided Die</a>

----

<figure>
<img
src="https://upload.wikimedia.org/wikipedia/commons/c/c4/2-Dice-Icon.svg"
alt="Icon of two dice." width = "20%"/>
<figcaption aria-hidden="true">
<a href="https://commons.wikimedia.org/wiki/File:2-Dice-Icon.svg">Steaphan
Greene</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0">CC
BY-SA 3.0</a>, via Wikimedia
Commons
</figcaption>
</figure>


The study of probability was initially inspired by calculating odds of
outcomes from card and dice games. For example, consider rolling a
fair six-sided die.

-   The sample space is $\Omega = \left\{ 1, 2, 3, 4, 5, 6 \right\}$
-   For a fair die, each of the six possible outcomes has an equally likely chance of occurring.
-   One possible outcome is rolling a 4, $\omega = 4$
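
As a rough sketch, the sample space and a random outcome can be represented in R as follows. The seed value and the number of simulated rolls (600) are arbitrary choices made for this illustration.

```r
# Sample space for one roll of a six-sided die
omega <- 1:6

set.seed(1)  # arbitrary seed, used only so the simulation is reproducible

# One outcome: a single roll of a fair die
roll <- sample(omega, size = 1)
print(roll)

# Many rolls: for a fair die, each face should appear about 1/6 of the time
rolls <- sample(omega, size = 600, replace = TRUE)
print(table(rolls) / length(rolls))
```

The observed proportions will not equal 1/6 exactly, which is the kind of sampling variability that statistical inference has to account for.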




## <a name="04equal-likelihood">Probabilities with Equally Likely Outcomes</a>

----

-   Let $A$ denote the event that the roll is a multiple of 3, $A = \left\{ 3, 6 \right\}$.

For finite sample spaces, we often use counting to determine
probabilities. A special case which we will encounter often is when each
outcome in the sample space $\Omega$ is equally likely to occur, and
therefore

$$ P(A) = \frac{\mbox{Number of outcomes in event $A$}}{\mbox{Total number of outcomes in $\Omega$}} $$

We use the notation $P(A)$ to denote the probability that event $A$
occurs.

-   Probabilities are proportions between $0$ (impossible to occur) and $1$ (certain to occur) that we typically represent as decimals or fractions.
-   Sometimes we convert the proportion to a percentage when giving a probability.
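
The counting formula above can be sketched as a small R function. The helper name `prob_equally_likely()` and the event "the roll is even" are hypothetical choices for illustration, and the function is only valid when every outcome in the sample space is equally likely.

```r
# Sample space for one roll of a fair six-sided die
omega <- 1:6

# Hypothetical helper: P(event) = (# outcomes in event) / (# outcomes in omega).
# Only valid when all outcomes in the sample space are equally likely.
prob_equally_likely <- function(event, sample_space) {
  length(intersect(event, sample_space)) / length(sample_space)
}

# Illustrative event: the roll is even
even <- c(2, 4, 6)
p_even <- prob_equally_likely(even, omega)
print(p_even)        # 0.5 as a proportion
print(100 * p_even)  # 50 as a percentage
```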



## <a name="04q1">Question 1</a>

----

If a person rolls a fair, six-sided die, what is the probability the
result of the roll is a number that is divisible by 3?



### <a name="04ans1">Solution to Question 1</a>

----



<br>  
<br>  

### <a name="04cardata">Installing and Loading `carData` Package</a>

----

The data set `Arrests` is in the `carData` package in R, which is not
installed in Google Colaboratory.

-   First run the code cell below to install the `carData` package.

In [None]:
install.packages("carData")

-   Next load the package with the library command so we can access the
    `Arrests` data set.

In [None]:
library(carData)

- **We can now access and explore the data set `Arrests` in `carData`.**



## <a name="04q2">Becoming Familiar with the `Arrests` Dataset</a>

----

Use R functions such as `summary()`, `str()`, and/or `?Arrests` to
answer some questions about the data:

-   What data is included in the `Arrests` data set?
-   How many observations are in the data?
-   What is the population of interest?
-   What is the source of the data?
-   What are the categorical variables in the data set?
-   What are the quantitative variables in the data set?
-   Are the variable data types accurate, or do some variables need to
    be converted to other data types?



In [None]:
# Summarize and/or get help with Arrests data
?Arrests
str(Arrests)

## <a name="O4q3">Question 2</a>

----

Suppose you would like to analyze whether female arrestees are more or
less likely to be not released with a summons (and therefore detained in
jail) compared to male arrestees. The two variables of interest are
therefore `sex` and `released`.



### <a name="O4q3a">Question 2a</a>

----

Using the `table()` function, create a two-way table to summarize the
relationship between `sex` and `released`.

#### <a name="O4ans3a">Solution to Question 2a</a>

----

### <a name="O4q3b">Question 2b</a>

----

You should see from your output in Question 2a that more male arrestees
were not released (829) compared to the female arrestees who were not released
(63).

Would it be fair to say, based on that fact alone, that male arrestees are more likely to be detained (not released)? If not, why is it problematic to arrive at that conclusion?


#### <a name="O4ans3b">Solution to Question 2b</a>

----



<br>  
<br>  
<br>


# <a name="O4notation">Notation and Symbols</a>

----

Probability is closely tied with something called "set theory" - the study of how sets of elements can interact or be part of different groups. We often use notation from set theory as shorthand to express probabilities. The main ones you need to know are:

- $\color{dodgerblue}{\cap}$ : also referred to as **intersection**. It can also be interpreted as "***and***".
    - <img src="https://github.com/gibsonea/Biostats/blob/main/Images/a_intersect_b.png?raw=true"/>

- $\color{dodgerblue}{\cup}$ : also referred to as **union**. It can also be interpreted as "***or***".
    - <img src="https://github.com/gibsonea/Biostats/blob/main/Images/a_union_b.png?raw=true"/>

- $\color{dodgerblue}{\bar{}}$ : over a variable name, also referred to as **not**, or more formally, ***complement***. For example, ${\bar{A}}$ would be read as "not A".
    - <img src="https://github.com/gibsonea/Biostats/blob/main/Images/not_a.png?raw=true"/>
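
Base R has functions that mirror these three symbols. The sketch below uses the die sample space with two events chosen only for illustration: `A` (the roll is even) and `B` (the roll is greater than 3).

```r
omega <- 1:6       # sample space for one die roll
A <- c(2, 4, 6)    # illustrative event: roll is even
B <- c(4, 5, 6)    # illustrative event: roll is greater than 3

A_and_B <- intersect(A, B)    # intersection: outcomes in both A and B
A_or_B  <- sort(union(A, B))  # union: outcomes in A or B (or both)
not_A   <- setdiff(omega, A)  # complement: outcomes in omega that are not in A

print(A_and_B)
print(A_or_B)
print(not_A)
```

Note that the complement is always taken relative to the sample space, which is why `setdiff()` needs `omega` as its first argument.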

# <a name="O4notation">Simple and Conditional Probabilities</a>

----

Let $A$ and $B$ denote two events in sample space $\Omega$, then

-   $\color{dodgerblue}{P(A)}$ is the probability that event $A$ occurs.
-   $\color{dodgerblue}{P(\bar{A}) }$ is the probability that event $A$ <font color="dodgerblue">**does NOT occur**</font >.
-   The notation $\color{dodgerblue}{\bar{A}}$, is used to denote the <font color="dodgerblue">**complement**</font > of $A$, AKA "not A".
-   $\color{dodgerblue}{P(A \cap B)}$ is the probability that events $A$ <font color="dodgerblue">**AND**</font > $B$ both occur.
-   $\color{dodgerblue}{P(A \cap \bar{B})}$ is the probability that <font color="dodgerblue">**event A occurs AND event B does NOT occur**</font >.
-   $\color{dodgerblue}{P(A \cup B)}$ is the probability that either event $A$ <font color="dodgerblue">**OR**</font > event $B$ occurs (or both $A$ and $B$ occur).
-   $\color{dodgerblue}{P(B \ | \ A )}$ is the <font color="dodgerblue">**conditional probability**</font > that event $B$ occurs <font color="dodgerblue">**given that**</font > event $A$ occurs.
    - $\color{dodgerblue}{P(B \ | \ A ) = P(A \cap B) / P(A)}$
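
When the data are stored in a data frame, each of these probabilities can be estimated as a proportion of rows. The sketch below uses a small made-up data frame (`toy`, with invented columns `smoker` and `cough`), not real data, and relies on the fact that `mean()` of a logical vector gives the proportion of `TRUE` values.

```r
# Made-up data for illustration only: 8 hypothetical people
toy <- data.frame(
  smoker = c("Yes", "No", "Yes", "No", "No", "Yes", "No", "No"),
  cough  = c("Yes", "No", "No",  "No", "Yes", "Yes", "No", "No")
)

A <- toy$smoker == "Yes"  # event A: person is a smoker
B <- toy$cough == "Yes"   # event B: person has a cough

p_A         <- mean(A)      # P(A)
p_not_A     <- mean(!A)     # P(not A)
p_A_and_B   <- mean(A & B)  # P(A and B)
p_A_or_B    <- mean(A | B)  # P(A or B)
p_B_given_A <- mean(B[A])   # P(B | A): proportion of B among rows where A occurs

print(c(p_A, p_not_A, p_A_and_B, p_A_or_B, p_B_given_A))
```

The conditional probability is computed by restricting attention to the rows where `A` is true, which is what "given that $A$ occurs" means in practice.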




# <a name="04prop">Properties of Probability</a>

----

We can generalize the calculations from the previous study on people
arrested for small quantities of marijuana to obtain the following
results:

Let $A$ and $B$ denote two events in sample space $\Omega$, then

-   <font color="dodgerblue">**Additive property**</font >:
    $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
-   <font color="dodgerblue">**Multiplicative property**</font >:
    $P(A \cap B) = P(A) \cdot P(B | A)$
-   <font color="dodgerblue">**Complement property**</font >:
    $P(\bar{A}) = 1 - P(A)$
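
As a quick numerical check, the three properties can be verified on the fair-die sample space. The events `A` and `B` and the helper `p()` below are illustrative assumptions; the check works by counting equally likely outcomes.

```r
omega <- 1:6
A <- c(2, 4, 6)   # illustrative event: roll is even
B <- c(4, 5, 6)   # illustrative event: roll is greater than 3

# Hypothetical helper: probability by counting equally likely outcomes
p <- function(event) length(event) / length(omega)

# Additive property: P(A or B) = P(A) + P(B) - P(A and B)
lhs_add <- p(union(A, B))
rhs_add <- p(A) + p(B) - p(intersect(A, B))

# Multiplicative property: P(A and B) = P(A) * P(B | A),
# where P(B | A) is the share of outcomes in A that are also in B
p_B_given_A <- length(intersect(A, B)) / length(A)
lhs_mult <- p(intersect(A, B))
rhs_mult <- p(A) * p_B_given_A

# Complement property: P(not A) = 1 - P(A)
lhs_comp <- p(setdiff(omega, A))
rhs_comp <- 1 - p(A)

print(all.equal(lhs_add, rhs_add))
print(all.equal(lhs_mult, rhs_mult))
print(all.equal(lhs_comp, rhs_comp))
```

`all.equal()` is used instead of `==` because it tolerates tiny floating-point rounding differences.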



### <a name="O4q3c">Question 3</a>

----

Think carefully about the additive property above. Why would it be wrong to just add P(A) and P(B) together?

*Hint: drawing a Venn diagram may help.*

#### <a name="O4ans3c">Solution to Question 3</a>

----



<br> <br> <br>

## <a name="04q4">Question 4</a>

----

Match one of the <font color="dodgerblue">**Venn
diagrams**</font > labelled (i)-(vi) in the table below to one of the set
operations below. Note that two of the Venn Diagrams do not match any of
the set operations.

| <font size=3>diagram (i)</font>  | <font size=3>diagram (ii)</font>  | <font size=3>diagram (iii)</font> |
|--------------|---------------|----------------|
| <img src="https://upload.wikimedia.org/wikipedia/commons/6/60/04fig-venn-001.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0."  /> | <img src="https://upload.wikimedia.org/wikipedia/commons/0/0f/04fig-venn-002.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0." /> | <img src="https://upload.wikimedia.org/wikipedia/commons/c/c4/04fig-venn-003.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0."  />  |


| <font size=3>diagram (iv)</font>  | <font size=3>diagram (v)</font> | <font size=3>diagram (vi)</font>   |
|---------------|-------------|----------------|
| <img src="https://upload.wikimedia.org/wikipedia/commons/7/73/04fig-venn-004.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0."  /> | <img src="https://upload.wikimedia.org/wikipedia/commons/7/77/04fig-venn-008.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0."  />| <img src="https://upload.wikimedia.org/wikipedia/commons/9/9e/04fig-venn-005.png" alt="Image Credit: Adam Spiegler, CC BY-SA 4.0."  /> |

Image credit: Adam Spiegler, [CC BY-SA
4.0](https://creativecommons.org/licenses/by-sa/4.0), via Wikimedia
Commons

### <a name="04ans4">Solution to Question 4</a>

----
${\bar{A}}$ = __

${A \cup B}$ = __

${A \cap B}$ = __

${A \cap \bar{B}}$ = __

<br> <br> <br>



### <a name="O4q3c">Question 5</a>

----
Let's go back to the arrests dataset.

What is the probability that a randomly selected arrestee in the study:

1.  Was not released?
2.  Was male?
3.  Was not released **and** was male?
4.  Was not released **or** was male?
5.  Given that a person was male, what is the probability they were not released?
6.  Given that a person was female, what is the probability they were not released?

*Hint: Use the function `nrow()` to get the total number of observations in the dataset. Use `?nrow` to see the documentation. Also, the "Counting observations with logical statements" section from lab 3 might help.*


In [None]:
# Enter code to compute each of the following
# Be sure to print results to screen!



# (1) was not released



In [None]:
# (2) was Male



In [None]:
# (3) was not released and was Male



In [None]:
# (4) was not released or was Male




In [None]:
# (5) was not released given they were Male



#### <a name="O4ans3d">Solution to Question 5</a>

----

Summarize results below

1.  The probability that a randomly selected person in the study was not released is $\color{dodgerblue}{P(N)=???}$.

2.  The probability that a randomly selected person in the study was Male is $\color{dodgerblue}{P(M)=???}$.

3.  The probability that a randomly selected person in the study was not released and was Male is $\color{dodgerblue}{P(N \cap M)=???}.$

4.  The probability that a randomly selected person in the study was not released or was Male is $\color{dodgerblue}{P(N \cup M)=???}$.

5.  Given that a person was Male, what is the probability they were not released?

$$\color{dodgerblue}{P(N \ | \ M)=???}$$



### <a name="O4q3d">Question 6</a>

----

Based on the data from this study, do you believe Male arrestees are
more, less, or equally likely to be detained (not be released) than
Female arrestees? Support your answer using probabilities. You may want
to compute additional probabilities that were not asked so far before reaching your conclusion.




In [None]:
# use this block for additional probability calculations



#### <a name="O4ans3d">Solution to Question 6</a>

----



<br>  
<br>  
<br>


# <a name="04ind">Independent Events</a>

----

Often in statistics we want to investigate questions such as:

-   Do certain sentencing laws have an effect on crime rates?
-   Did increasing the minimum wage for fast food workers affect fast food prices?
-   **Does the occurrence of one event** $M$ (being male) **affect the likelihood that another event** $N$ (being detained after arrest) **occurs**?

We would want to know if the two events are <font color="dodgerblue">**dependent**</font > or <font color="dodgerblue">**independent**</font >.



## <a name="04def-ind">Definition of Independent Events</a>

----

Two events $A$ and $B$ are
<font color="dodgerblue">**independent**</font > if the
occurrence of one has no effect on the occurrence of the other:

$$ P(B) = P(B \ | \ A) \quad \mbox{or} \quad P(A) = P(A \ | \ B).$$

**Special case:** If events $A$ and $B$ are independent events then we
have $P(A \cap B) = P(A) \cdot P(B)$.
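
The definition can be checked numerically on the fair-die sample space. In the sketch below, the events `A`, `B`, and `C` and the helper `p()` are illustrative choices: `A` and `B` turn out to be independent, while `A` and `C` are not.

```r
omega <- 1:6
A <- c(2, 4, 6)     # illustrative event: roll is even
B <- c(1, 2, 3, 4)  # illustrative event: roll is at most 4
C <- c(4, 5, 6)     # illustrative event: roll is greater than 3

# Hypothetical helper: probability by counting equally likely outcomes
p <- function(event) length(event) / length(omega)

# P(B | A) and P(C | A): share of outcomes in A that are also in B (or C)
p_B_given_A <- length(intersect(A, B)) / length(A)
p_C_given_A <- length(intersect(A, C)) / length(A)

# Independent: knowing A occurred does not change the probability of B
indep_AB <- isTRUE(all.equal(p(B), p_B_given_A))
# Equivalent check using the special case P(A and B) = P(A) * P(B)
prod_AB <- isTRUE(all.equal(p(intersect(A, B)), p(A) * p(B)))

# Dependent: knowing A occurred changes the probability of C
indep_AC <- isTRUE(all.equal(p(C), p_C_given_A))

print(c(indep_AB, prod_AB, indep_AC))
```

With real data the two sides will rarely match exactly, so deciding whether a difference is meaningful requires the inference tools developed later in the course.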



## <a name="04q5">Question 7</a>

----

Suppose the probability that CU football will beat CSU is 70%, while the probability that CU will beat Kansas is 40%.  

$$ P(A) = 0.7$$
$$P(B) = 0.4 $$

where event $A$ = {CU beats CSU} and event $B$ = {CU beats Kansas}.

**Assume that the two events are independent, and that CU will play each team once.**




### <a name="04q5a">Question 7a</a>

----

If the two events are independent, does it matter which game happens first?




#### <a name="04ans5a">Solution to Question 7a</a>

----



<br>  
<br>  
<br>

### <a name="04q5a">Question 7b</a>

----

What is the probability that CU football wins both games?  How would you write this in probability notation?

*Hint: drawing a Venn diagram might help.*




#### <a name="04ans5a">Solution to Question 7b</a>

----



<br>  
<br>  
<br>

### <a name="04q5b">Question 7c</a>

----

What is the probability that CU wins **at least** one game? How would you write this in probability notation?

*Hint: drawing a Venn diagram might help.*





#### <a name="04ans5b">Solution to Question 7c</a>

----



<br>  
<br>  
<br>

### <a name="04q5c">Question 7d</a>

----

What is the probability that CU wins only one game?  Again, write in probability notation.

*Hint: drawing a Venn diagram might help.*



#### <a name="04ans5c">Solution to Question 7d</a>

----



<br>  
<br>  
<br>

### <a name="04ans5d">Question 7e</a>

----

What is the probability that CU wins no games?  Again, write in probability notation.

*Hint: drawing a Venn diagram might help.*



#### <a name="04q5d">Solution to Question 7e</a>

----



<br>  
<br>  
<br>

## <a name="04q7">Question 8</a>

----

Does having health insurance help avoid bankruptcies? Let $B$ denote the
event a person goes bankrupt. Let $H$ denote the event a person has
health insurance.



### <a name="04q7a">Question 8a</a>

----

What is the difference in the practical meaning of $P(H \ | \ B)$ and
$P(B \ | \  H )$? Explain in practical terms that a non-statistician
could understand, avoiding technical language.





#### <a name="04ans7a">Solution to Question 8a</a>

----



<br>  
<br>  
<br>

### <a name="04q7b">Question 8b</a>

----

Can you determine whether having health insurance has any effect on the
likelihood that a person goes bankrupt by comparing $P(H \ | \ B)$ and
$P(B \ | \ H )$?

-   If so, explain how you would compare those two probabilities to help
    answer the question.
-   If not, what additional probability (or probabilities) would be
    useful to know in order to answer this question?





#### <a name="04ans7b">Solution to Question 8b</a>

----



<br>  
<br>  
<br>

# <a name="04excl">Mutually Exclusive Events</a>

----

Two events $A$ and $B$ are <font color="dodgerblue">**mutually
exclusive**</font > (or
<font color="dodgerblue">**disjoint**</font >) if they cannot
occur at the same time, and therefore $P(A \cap B) = 0$.

**Special case:** If events $A$ and $B$ are disjoint then we have
$P(A \cup B) = P(A)+P(B)$.



## <a name="04q8">Question 9</a>

----

Give an example of two events that are mutually exclusive, and give an
example of two events that are not mutually exclusive.





#### <a name="04ans8">Solution to Question 9</a>

----



<br>  
<br>  
<br>

# <a name="05totalprob-rule">Total Probability Rule</a>

----

The total probability (or unconditional probability), $P(B)$, is related to the conditional probabilities as follows:

$$P(B) = P(B | A) \cdot P(A) + P(B | \bar{A}) \cdot P(\bar{A}) $$

In general, if there is a set of events $A_{1}, A_{2}, \ldots, A_{n}$ that are mutually exclusive and exhaustive (together they cover the full sample space), then the total probability rule can be expressed as:

$$P(B) = \sum_{i=1}^n P(B | A_{i}) \cdot P(A_{i}) $$
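
The rule is a weighted sum, which can be sketched in R as follows. The helper name `total_probability()` and the numbers used are hypothetical values chosen for illustration; they are not taken from any study.

```r
# Hypothetical helper implementing P(B) = sum of P(B | A_i) * P(A_i).
# Assumes the events A_1, ..., A_n are mutually exclusive and exhaustive,
# so their probabilities must add up to 1.
total_probability <- function(p_b_given_a, p_a) {
  stopifnot(length(p_b_given_a) == length(p_a))
  stopifnot(isTRUE(all.equal(sum(p_a), 1)))
  sum(p_b_given_a * p_a)
}

# Made-up example with two cases, A and not-A
p_a         <- c(0.3, 0.7)  # P(A), P(not A)
p_b_given_a <- c(0.5, 0.2)  # P(B | A), P(B | not A)

p_b <- total_probability(p_b_given_a, p_a)
print(p_b)  # 0.5 * 0.3 + 0.2 * 0.7 = 0.29
```

The check that the probabilities sum to 1 guards against accidentally leaving out one of the cases.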





## <a name="04q9">Question 10</a>

----

A large study found that among 100,000 women with negative mammograms, 20 will be diagnosed with breast cancer within 2 years, while 1 in 10 women with positive mammogram results will be diagnosed with breast cancer within 2 years. Suppose that 7% of women in the general population will have a positive mammogram result.

Let $A$ be the event of a *positive* mammogram and $B$ be the event of a positive diagnosis.

1. Calculate the probability that a randomly selected woman in the general population develops breast cancer over the next two years.

2. Write the probability notation for your answer.





#### <a name="04ans9">Solution to Question 10</a>

----



<br>  
<br>  
<br>

# <a name="CC License">Creative Commons License Information</a>
---


![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain* by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.