## <span style="color:darkblue"> Class Notes: Unit 2 (Introduction to sampling and probability)

*The content of this notebook is based on modified materials from Open Intro Biostatistics (OIBiostat)*

<span style="color:darkblue"> **Outline**<span style="color:black">:

A. Data Collection Principles

B. Basic Concepts of Probability

C. Conditional Probability

D. Positive Predictive Value and Bayes' Theorem

----

#### <span style="color:darkblue"> A. Data Collection Principles

#### A.1 Populations and samples

* In general, research questions pertain to a <span style="color:violet">*target population*<span style="color:black">:
    1. Do bluefin tuna from the Atlantic Ocean have particularly high levels of mercury, such that they are unsafe for human consumption?
    2. For infants predisposed to developing a peanut allergy, is there evidence that introducing peanut products early in life is an effective strategy for reducing the risk of developing a peanut allergy?
    3. Does a recently developed drug designed to treat glioblastoma, a form of brain cancer, appear more effective at inducing tumor shrinkage than the drug currently on the market?
* Nearly all research is based on information obtained about a <span style="color:violet">*sample* <span style="color:black">from the population, representing a small fraction of the total population.

A sample is considered <span style="color:violet">*representative* <span style="color:black"> of the population if the sample characteristics correspond to the population characteristics.

The <span style="color:violet">*sampling scheme*<span style="color:black"> helps determine this, but does not guarantee it!

<img style="float: left", src="sampleRandomHealthPlan.png">


<img style="float: left", src="sampleConvenienceHealthPlan.png">

<img style="float: left", src="sampleNonResponseHealthPlan.png">

#### A.2 Types of bias

<span style="color:violet">*Selection bias*<span style="color:black"> occurs when some units are more likely to be selected for analysis than others.  

*Example:* Suppose you want to conduct a survey to determine if people prefer Uber, Lyft or traditional taxis.

* What are the units in this example?
* If you use Facebook to administer the survey, will there be selection bias?  
* What impact does this have on the survey results?

<span style="color:violet">*Recall bias*<span style="color:black"> occurs when some individuals in a study exhibit different abilities to correctly recollect their past experiences.

*Example:* Suppose you are interested in investigating the association between coffee consumption during pregnancy and risk of miscarriage.  For logistical and economic reasons, you decide to do a case-control study in which you identify women who have had a miscarriage and women who have not, and ask them to report their average daily coffee consumption during the first 2 months of pregnancy.  

* Are women who have had a miscarriage and those who have not likely to recall their coffee consumption differently? Why?
* What would be the impact of this on our interpretation of the study findings?

Bias can also be introduced as a result of <span style="color:violet">*informative missing data*.

<span style="color:violet">*Estimation bias*<span style="color:black"> occurs when a summary measure based on a sample is systematically different from the population measure of interest.    

*Example*: Suppose we are interested in reporting the average SAT/ACT score for students who decide to come to MHC.  
 
* Taking the SAT/ACT is not required for admission to MHC, so some percentage of students who are here never took the SAT/ACT or did not submit their scores. 
* Is it possible that the SAT/ACT scores of individuals who report scores are different from the 'would-be' scores of individuals who do not report them?
* How would this impact your interpretation of the study findings?

#### A.3 Study design and confounding

The first step is to understand the <span style="color:violet">*data generating process*<span style="color:black">.

* In an <span style="color:orange">observational study<span style="color:black"> researchers observe and record data, without interfering with how the data arise.
* In an <span style="color:orange">experimental study<span style="color:black"> investigators intervene by assigning a *treatment* to each unit. In a *randomized* experiment the units are randomly assigned to groups that receive the treatments.  

*Example:* 

In June 2006, the NYT published an article (http://www.nytimes.com/2006/06/13/health/13brea.html) entitled "Breast-Feed or Else," quoting a senior scientific adviser to the Office on Women's Health in the Department of Health and Human Services who likened not breastfeeding to smoking during pregnancy.

The AAP policy statement recommending that women nurse cited a scientific study (http://pediatrics.aappublications.org/content/113/5/e435.full) that reported:

*"We evaluated the effect of breastfeeding on postneonatal mortality in United States using 1988 National Maternal and Infant Health Survey (NMIHS) data ...  children who were ever breastfed had 0.79 (95% confidence interval: 0.67-0.93) times the risk of never breastfed children for dying in the postneonatal period."*


Important information:
* The NMIHS is an *observational study*.
* The children who were not breast fed were more likely to die of injury (e.g. car accident).
* At the time of the survey, breast feeding was highly associated with economic class -- higher-income women were more likely to breast feed than lower-income women.

How does this impact the interpretation of the study results?


In the previous example, economic status is a confounder, as it confounds the relationship between breast feeding and infant mortality. Formally, a <span style="color:violet">*confounder*<span style="color:black"> is a variable that is associated with both the explanatory (predictor) variable and the response (outcome) variable, but is not in the 'causal pathway'.

<img style="float: left", src="confoundingVariable.png">

----
#### <span style="color:darkblue"> B. Basic Concepts of Probability

#### B.1 Outcomes and events   

An <span style="color:violet"> *outcome*<span style="color:black">  in a study or experiment is the observable result after conducting the experiment.

- The sum of the faces on two dice that have been rolled  

- The response of a patient treated with an experimental therapy

- The total volume of eggs in a clutch laid by a frog 

An <span style="color:violet"> *event*<span style="color:black">  is a collection of outcomes.  

- The sum after rolling two dice is 7

- 22 of 30 patients in a study have a good response to a therapy

- The total volume of eggs in a clutch is larger than $750 \text{ mm}^3$


#### B.2 Working definition of probability   

The <span style="color:violet"> *probability* <span style="color:black"> of an outcome or an event is the proportion of times the outcome or event would occur if the random phenomenon could be observed an infinite number of times.
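The long-run-proportion idea can be illustrated by simulation. The following is a minimal sketch in Python (a choice made here for illustration; the same steps work in R), and `proportion_sum_seven` is a hypothetical helper name. It rolls two fair dice many times and reports how often the sum is 7; the exact probability is $6/36 \approx 0.167$.

```python
import random

def proportion_sum_seven(n_rolls, seed=1):
    """Proportion of n_rolls throws of two fair dice whose faces sum to 7."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(n_rolls):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total == 7:
            hits += 1
    return hits / n_rolls

# The observed proportion should settle near 1/6 as the number of rolls grows.
for n in (100, 10_000, 200_000):
    print(n, round(proportion_sum_seven(n), 4))
```

With only 100 rolls the proportion can be noticeably off; with a very large number of rolls it should be close to $1/6$, which is the sense in which probability is a long-run proportion.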


Two outcomes or events are called <span style="color:violet"> *disjoint* <span style="color:black"> or <span style="color:violet"> *mutually exclusive* <span style="color:black"> if they cannot both happen at the same time.

Addition rule for disjoint events (Equation 2.6 in *OI Biostat*):

- If $A_1$ and $A_2$ represent two disjoint events, then the probability that either one of them occurs is $$P(A_1 \text{ or } A_2) = P(A_1) + P(A_2)$$   


- If there are $k$ disjoint events $A_1,\dots,A_k$, then the probability that one of these events will occur is $$P(A_1 \text{ or } A_2 \text{ or } \cdots \text{ or } A_k) = P(A_1) + P(A_2) + \cdots + P(A_k)$$



#### B.3 Rules/definitions for probability    

<span style="color:violet"> General Addition Rule<span style="color:black">  (Equation 2.12 in *OI Biostat*)

- If $A$ and $B$ are any two events, disjoint or not, then the probability that at least one of them will occur is $$P(A \text{ or }B) = P(A) + P(B) - P(A  \text{ and }B)$$ where $P(A \text{ and } B)$ is the probability that both events occur.
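The general addition rule can be checked by listing every outcome. The sketch below (Python, with illustrative events chosen here, not taken from *OI Biostat*) enumerates the 36 equally likely results of rolling two dice, with $A$ = (first die shows 6) and $B$ = (sum is at least 10). These events overlap, so simply adding $P(A)$ and $P(B)$ would double count the shared outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely (first die, second die) pairs.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event, given as a true/false function of an outcome."""
    favorable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favorable), len(sample_space))

def a(outcome):
    """Event A: the first die shows a 6."""
    return outcome[0] == 6

def b(outcome):
    """Event B: the sum of the two dice is at least 10."""
    return sum(outcome) >= 10

p_a = prob(a)
p_b = prob(b)
p_a_and_b = prob(lambda o: a(o) and b(o))
p_a_or_b = prob(lambda o: a(o) or b(o))

# Direct count of "A or B" versus the general addition rule.
print(p_a_or_b, p_a + p_b - p_a_and_b)  # 1/4 1/4
```

Here $P(A) = P(B) = 1/6$ and $P(A \text{ and } B) = 1/12$, so the rule gives $1/6 + 1/6 - 1/12 = 1/4$, matching the direct count.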

The <span style="color:violet"> *complement*<span style="color:black">  $A^c$ of an event $A$ is the collection of all outcomes *not* in $A$. The complement is sometimes written *not* $A$.

- $P(A) = 1-P(A^c)$  (Equation 2.19 in *OI Biostat*)

- Equivalently $P(A^c) = 1-P(A)$  


#### B.4 Independent events   

Two events $A$ and $B$ are called <span style="color:violet">*independent* <span style="color:black">if the probability that both $A$ and $B$ occur equals the product of their separate probabilities
$$ P(A \text{ and }B) = P(A)P(B)$$ (Equation 2.23 in *OI Biostat*)

Similarly, $k$ events $A_1, \dots, A_k$ are called *independent* if the probability that they all occur, $P(A_1 \text{ and } A_2 \text{ and } \cdots \text{ and } A_k)$, is the product
$$P(A_1) P(A_2) \cdots P(A_k)$$
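Independence and the complement rule are often used together to find the probability that something happens *at least once*. As a minimal sketch (Python; the dice example and the helper name `prob_at_least_one` are illustrative assumptions), consider the probability of at least one 6 in four rolls of a fair die. The rolls are independent, so $P(\text{no 6 in four rolls}) = (5/6)^4$, and the complement gives the answer.

```python
from fractions import Fraction

def prob_at_least_one(p_single, n_trials):
    """P(event occurs at least once in n independent trials), via the complement."""
    # Independence: P(event never occurs) is the product of n copies of P(not event).
    p_none = (1 - p_single) ** n_trials
    return 1 - p_none

p = prob_at_least_one(Fraction(1, 6), 4)
print(p, round(float(p), 4))  # 671/1296 0.5177
```

The calculation is only valid when the trials really are independent; if one trial changes the chances on the next, the simple product no longer applies.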

----

<span style="color:orange"> **In-class (paper and pencil) exercises:**
    
**Exercise 1**. Toss a fair coin 3 times. What is the probability of at least 2 heads?

----

**Exercise 2**.  Mammograms incorrectly suggest breast cancer is present approximately 5% of the time when a woman does *not* have cancer.  Suppose a clinic conducts approximately 50 mammograms in a week.  What is the probability that at least 1 woman will test positive if none of the women have breast cancer?

----

----
#### <span style="color:darkblue"> C. Conditional probabilities

#### C.1 Conceptual Definition:

The <span style="color:violet">*conditional probability*<span style="color:black"> of an event $B$, given a second event $A$, is the probability of $B$ happening, knowing that the event $A$ has happened. This is denoted as $P(B|A)$.

<img style="float: left", src="premature_infant_death.png">

Published in Patel, et al., NEJM (2015) Vol 372, pp 331 - 340, posted on website under Supplementary Reading. 

Toss a fair coin 3 times. Let $B$ be the event that *exactly* two heads occur, and $A$ the event that *at least* two heads occur.

- $P(B|A)$ is the probability of having exactly two heads among
   the outcomes that have at least two heads.

- Conditioning on $A$ means we know the outcome is in the set
    - (HHH, HHT, HTH, THH)
- In this set of outcomes, $B$ consists of the last three, so
   $P(B|A) = 3/4$
- Note that $P(B) = 3/8$

#### C.2 Mathematical Definition:

As long as $P(A) > 0$,
 $$ P(B | A) = \dfrac{P(A \text{ and } B)}{P(A)}$$

Using the definition, 
\begin{align*} 
P(B | A) &= \dfrac{P(A \text{ and } B)}{ P(A) }\\
         &= \frac{P(\text{At least two heads and exactly two heads})}
             {P(\text{At least two heads})} \\
          &= \frac{P(\text{Exactly two heads})}{P(\text{At least two heads})} \\
          &= (3/8)/(4/8) = 3/4
\end{align*}
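The coin-toss calculation can also be verified by brute-force enumeration. This Python sketch (variable names are illustrative) lists all 8 equally likely outcomes of three tosses and applies the definition $P(B|A) = P(A \text{ and } B)/P(A)$ directly.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences of three tosses, e.g. "HHT".
outcomes = ["".join(tosses) for tosses in product("HT", repeat=3)]

A = {o for o in outcomes if o.count("H") >= 2}   # at least two heads
B = {o for o in outcomes if o.count("H") == 2}   # exactly two heads

p_a = Fraction(len(A), len(outcomes))
p_b = Fraction(len(B), len(outcomes))
p_a_and_b = Fraction(len(A & B), len(outcomes))  # outcomes in both A and B

p_b_given_a = p_a_and_b / p_a
print(p_b, p_b_given_a)  # 3/8 3/4
```

Conditioning on $A$ shrinks the sample space from 8 outcomes to 4, which is why $P(B|A)$ is larger than $P(B)$ here.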


*Independence again*...

A simple consequence of the definition of conditional probability:

- $A$ and $B$ are independent if $P(B|A) =  P(B)$


#### C.3 Multiplication Rule of Probability

The <span style="color:violet">general multiplication rule <span style="color:black">applies for events that might not be independent.

$$ P(A \text{ and } B) = P(B | A)P(A) $$

It is a rearrangement of the definition for conditional probability, $P(B | A) = P(A \text{ and } B) / P(A)$
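As a small worked example of the multiplication rule for dependent events (a Python sketch using a standard 52-card deck, an example chosen here for illustration), let $A$ = (first card drawn is an ace) and $B$ = (second card drawn is an ace), with no replacement. After one ace is removed, 3 aces remain among 51 cards, so $P(B|A) = 3/51$.

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # P(A): 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # P(B | A): 3 aces left among 51 cards

# General multiplication rule: P(A and B) = P(B | A) P(A)
p_both_aces = p_second_ace_given_first * p_first_ace
print(p_both_aces)  # 1/221

# For comparison, wrongly treating the draws as independent gives a different answer.
p_if_independent = p_first_ace * p_first_ace
print(p_if_independent)  # 1/169
```

The two answers differ because the first draw changes the composition of the deck, so the draws are not independent.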


----

<span style="color:orange"> **In-class (paper and pencil) exercises:**
    
**Exercise 3**. Suppose that a study conducted in the United States showed that 10% of the
population have some mental disorder, 8% have an alcohol related
disorder, and 6% have both.  

a) If a person has been diagnosed with an alcohol related disorder, what is
the probability that he/she has a mental disorder?

b) If a person has been diagnosed with a mental disorder, what is
the probability that he/she has an alcohol related disorder?

c) Are the events *A* and *B* independent, where *A* = (a person has a mental disorder), *B* = (a person has an alcohol related disorder)?

----

**Exercise 4**. A bag contains 3 red and 3 white balls.  Two balls are drawn from the bag, one at a time; the first ball is not replaced before the second ball is drawn. 

a) What is the probability of drawing a white ball on the first pick and a red on the second?
    
b) What is the probability of drawing exactly one red ball?

----

**Exercise 5**. In the US population, approximately 20% of men and 3% of women are taller than 6 feet (72 in).  Let $F$ be the event that a person is female and $T$ be the event that a person is taller than 6 feet. Assume an equal probability of being male and female.  

a) What is $P(T|F)$?
 
b) What is the probability that the next person walking through the door is a woman and taller than 6 feet?
    
c) What is the probability that the next person walking through the door is taller than 6 feet tall?  

----

----
#### <span style="color:darkblue"> D. Positive Predictive Value of a Diagnostic Test (Bayes' Theorem)


*Example: Pre-natal testing for trisomy 21, 13, and 18*

Some congenital disorders are caused by an additional copy of a chromosome being attached (translocated) to another chromosome during reproduction.

- Trisomy 21: Down syndrome, approximately 1 in 800 births

- Trisomy 13: Patau's syndrome,  physical and mental
disabilities, approximately 1 in 16,000 newborns

- Trisomy 18: Edwards syndrome, nearly always fatal, either
in stillbirth or infant mortality.  Occurs in about 1 in 6,000 births

Until recently, testing for these conditions consisted of screening the mother's blood for serum markers, followed by amniocentesis in women who test positive.


*Cell-free fetal DNA (cfDNA)*

cfDNA consists of copies of embryo DNA present in maternal blood.

Advances in DNA sequencing have made non-invasive testing for these disorders possible, using only a blood sample. 

Initial testing of the technology was done using archived samples of genetic material from children whose trisomy status was known.

The results are variable, but generally very good:

- Of 1000 children with one of the disorders, approximately 980 have cfDNA that tests positive. The test has high *sensitivity*.

- Of 1000 children without the disorders, approximately 995 test negative. The test has high *specificity*.


*What do the parents care about?*
  
The designers of a test want a test to have high sensitivity and specificity:

- <span style="color:violet">*sensitivity*<span style="color:black"> is the probability of a positive test given (conditional on) disease positive.
- <span style="color:violet">*specificity*<span style="color:black"> is the probability of a negative test given disease negative.

High values of both make for a good test.

A family with a child undergoing testing, however, wants to know something different: the probability that the condition is present if the test is positive.

Suppose a child has tested positive for trisomy 21. What is the probability that the child does have the trisomy 21 condition, given the positive test result?

We will address this question using each of the following approaches:

- Table-based solution 

- Tree-based solution
	
- Algebraic solution using Bayes' rule

- Simulation


#### D.1 Using a table to calculate PPV

Steps:

- Create a contingency table showing the distribution of trisomy 21 incidence and cfDNA test outcomes in a large, hypothetical population.
 
- Use the table to calculate the correct conditional probabilities.

- We use *R* to calculate the entries in the table, but it can be done by hand.
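The steps above can be sketched in code. The example below is written in Python rather than R (an assumption made here; the arithmetic is identical), and it uses a *hypothetical* test with 1% prevalence, 90% sensitivity, and 95% specificity, not the trisomy 21 values. The function name `contingency_counts` is also illustrative.

```python
def contingency_counts(prevalence, sensitivity, specificity, population=100_000):
    """Expected counts in the 2x2 table of disease status by test result."""
    diseased = prevalence * population
    healthy = population - diseased
    true_pos = sensitivity * diseased    # disease present, test positive
    false_neg = diseased - true_pos      # disease present, test negative
    true_neg = specificity * healthy     # disease absent, test negative
    false_pos = healthy - true_neg       # disease absent, test positive
    # Each row holds (disease present, disease absent) counts.
    return {
        "test_pos": (true_pos, false_pos),
        "test_neg": (false_neg, true_neg),
        "totals": (diseased, healthy),
    }

# Hypothetical test: 1% prevalence, 90% sensitivity, 95% specificity.
table = contingency_counts(0.01, 0.90, 0.95)
for row, (present, absent) in table.items():
    print(f"{row:9s} {present:10.0f} {absent:10.0f} {present + absent:10.0f}")

# P(D | T+): among all positive tests, the fraction with the disease.
true_pos, false_pos = table["test_pos"]
ppv = true_pos / (true_pos + false_pos)
print(round(ppv, 3))  # 0.154
```

In this hypothetical case only about 15% of positive tests are true positives, even though the test itself is fairly accurate, because the disease is rare.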

----

<span style="color:orange"> **In-class (paper and pencil) exercises:**

**Exercise 6**. Constructing a contingency table for a large, hypothetical population offers an intuitive way to understand the distribution of disease incidence and test outcome. In this case, we will work with a population of 100,000.


| | Disease Present | Disease Absent | Total |
|------|------|------|------|
| Test Positive | | | |
| Test Negative | | | |
| Total | | | 100,000 |

Let's start filling in the table, based on the known information about the disease prevalence, test specificity, and test sensitivity.

a) Calculate the two column totals -- out of 100,000 children, how many are expected to have trisomy 21? How many are expected to not have trisomy 21?

b) Populate the rest of the table (you may find it helpful to use R as a calculator).

c) Using values from the table, calculate an estimate of the probability that a child who tests positive actually has trisomy 21. *Hint*: Think about how the definition of conditional probability could be applied here... 

$$P(D|T^{+}) = \frac{P(D \text{ and } T^{+})}{P(T^{+})}$$

----

#### D.2 Tree-based solution

Using a tree diagram allows for the information in the problem to be mapped out in a way that makes it easier to calculate the desired conditional probability, $P(D|T^{+})$.

----
    
<span style="color:orange"> **In-class (paper and pencil) exercises:**

**Exercise 7**. Consider the set-up from the previous exercise.

a) First, let's write out what we know (prevalence, sensitivity, specificity) using some notation. Let $D$ = \{disease presence\}, $D^C$ = \{disease absence\}, $T^{+}$ = \{positive test\}, $T^{-}$ = \{negative test\}.

b) Draw a tree diagram that organizes the four possible combinations of disease incidence and test outcome: $D \text{ and } T^{+}$, $D \text{ and } T^{-}$, $D^C \text{ and } T^{+}$, $D^C \text{ and } T^{-}$.

c) Calculate $P(D|T^{+})$. Remember that $P(D \text{ and } T^{+})$ can be rewritten as $P(D)P(T^{+}|D)$ by the general multiplication rule.

----

#### D.3 Diagnostic tests and Bayes' rule

Events of interest in diagnostic testing, where () denotes an event:

- $D$ = (person has disease) 

- $D^C$ = (person does not have disease)  

- $T^+$ = (positive screening result)  

- $T^-$ = (negative screening result).  Could use $T$ and $T^C$,
    but $T^+,\,\, T^-$ are consistent with notation in medical and public health
    literature.



#### Measures of accuracy for diagnostic tests
  
- Sensitivity = $P (T^+ | D)$ (want very high!)   

- Specificity = $P(T^- | D^C)$ (want high!)   

- False negative rate = $P(T^- | D)$ = 1 - sensitivity   

- False positive rate = $P(T^+ | D^C)$ = 1 - specificity   

These measures are all characteristics of a diagnostic test.


####  Positive predictive value of a test

Suppose an individual tests positive for a disease $D$.   

<span style="color:violet">*Positive Predictive Value (PPV)*<span style="color:black">: The PPV of a diagnostic test is the probability that a person has a disease $D$, given that she has tested positive.

- PPV = $P(D | T^+)$

- The conditioning here is in the reverse order from the test characteristics 

The characteristics of the test give us $P(T^+|D)$, among other things, but not $P(D|T^+)$.



#### Bayes' Theorem, aka Bayes' Rule

Bayes' Theorem (simplest form):

$$ P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$

Note: Follows directly from the definition of conditional probability by noting that $P(A) P(B|A)$ equals $P(A \text{ and } B)$:

$$ P(A|B) =  \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A) P(B|A)}{P(B)}$$



#### The denominator P(B) in Bayes' Theorem

Bayes' Theorem is seldom stated in its simplest form, because in many problems, $P(B)$ is not given directly, but is calculated using the general multiplication formula for probabilities:

Suppose $A$ and $B$ are events.  Then,
\begin{align*}
P(B) = & P(B \text{ and } A) + P(B \text{ and } A^C) \\
    = & P(A) P(B | A)  + P(A^C)P(B|A^C)
\end{align*}

Bayes' Theorem can be written as:

$$ P(A|B) = \frac{P(A) P(B|A)}{P(B)} = \frac{P (A) P (B|A)}{P (A) P (B | A)  + P (A^C)P (B|A^C)}$$



#### Bayes' Theorem for diagnostic tests

Let $A = D$, $B = T^+$, where the PPV is $P(D|T^+)$. 

\begin{align*}
  P(D|T^+) = & \dfrac{P(D \text{ and } T^{+})}{P(T^{+})} \\
  = & \frac{P(D)P(T^{+}|D)}{P(D)P(T^{+}|D)+
    (1-P(D))P(T^{+}|D^{C})} \\
  = & \frac{\text{prevalence} \times
    \text{sensitivity}}{[\text{prevalence} \times
    \text{sensitivity}] +
      [(1-\text{prevalence}) \times (1-\text{specificity})]}
\end{align*}
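The formula can be applied to the trisomy 21 example, and the result checked by simulation. The following Python sketch (the notes use R; Python and the function names here are illustrative assumptions) takes the values quoted earlier: prevalence of about 1 in 800, sensitivity of about 980/1000, and specificity of about 995/1000. The simulation draws a disease status for each hypothetical child, then a test result that depends on that status, and computes the fraction of positive tests that are true positives.

```python
import random

def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(D | T+) from Bayes' Theorem."""
    true_pos = prevalence * sensitivity                # P(D) P(T+ | D)
    false_pos = (1 - prevalence) * (1 - specificity)   # P(D^C) P(T+ | D^C)
    return true_pos / (true_pos + false_pos)

def simulate_ppv(prevalence, sensitivity, specificity, n=1_000_000, seed=1):
    """Estimate P(D | T+) by simulating disease status and test result n times."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    positives = 0
    true_positives = 0
    for _ in range(n):
        has_disease = rng.random() < prevalence
        # Chance of a positive test depends on the true disease status.
        p_positive = sensitivity if has_disease else 1 - specificity
        if rng.random() < p_positive:
            positives += 1
            true_positives += has_disease
    return true_positives / positives

exact = positive_predictive_value(1 / 800, 0.98, 0.995)
approx = simulate_ppv(1 / 800, 0.98, 0.995)
print(round(exact, 3), round(approx, 3))
```

Under these assumed values the formula gives a PPV of roughly 0.197, and the simulated estimate should land close to it. Despite the high sensitivity and specificity, fewer than 1 in 5 positive results corresponds to a true case, because the condition is rare.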