## <span style="color:darkblue"> Problem Set 4: Probability and practice with simulations </span>

Stat140-02: Introduction to the Ideas and Applications of Statistics 

Due: Friday, March 2 @11am

**Problem set policies:** Please provide concise, clear answers for each question. Note that only writing the result of a calculation (e.g., "SD = 3.3") without explanation is not sufficient. For problems involving `R`, include the code in your solution, along with any plots.
		
Each problem set is due by 11:00am on the due date; please submit your problem set via gyrd.
		
*You are encouraged to discuss problems with other students (and, of course, with the course head and the TAs), but you must write your final answer in your own words. Solutions prepared "in committee" are not acceptable. If you do collaborate with classmates on a problem, please list your collaborators on your solution.*

#### Problem 1:

The ABO blood group system consists of four different blood groups, which describe whether an individual's red blood cells carry the A antigen, B antigen, both, or neither. The ABO gene has three alleles: $I^A$, $I^B$, and $i$. The $i$ allele is recessive to both $I^A$ and $I^B$, while the $I^A$ and $I^B$ alleles are codominant. Individuals homozygous for the $i$ allele are known as blood group O, with neither A nor B antigens. This is summarized in the following table:


| **Alleles inherited** | **Blood type** |
| -- | -- |
| $I^A$ and $I^A$ | A |
| $I^A$ and $I^B$ | AB |
| $I^A$ and $i$ | A |
| $I^B$ and $I^B$ | B |
| $I^B$ and $i$ | B |
| $i$ and $i$ | O |


Blood group follows the rules of Mendelian single-gene inheritance: a child inherits one allele from each parent, the two draws are independent, and each of a parent's two alleles is passed on with probability 0.5.
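As an illustration of these inheritance rules, the following `R` sketch simulates children of a hypothetical couple who are both genotype $I^A i$ (a cross that is not asked about below). The function `blood.type()` and all variable names are assumptions made for this example; they are not part of the course code.

```r
# Hypothetical helper: map a pair of inherited alleles to a blood type,
# following the table above (A and B codominant, i recessive)
blood.type = function(allele.1, allele.2) {
  pair = c(allele.1, allele.2)
  has.A = "A" %in% pair
  has.B = "B" %in% pair
  if (has.A && has.B) return("AB")
  if (has.A) return("A")
  if (has.B) return("B")
  "O"
}

set.seed(140)
num.children = 10000

# Assumed genotypes for illustration: both parents carry I^A and i
parent.1 = c("A", "i")
parent.2 = c("A", "i")

# Each parent passes on one of their two alleles with probability 0.5,
# independently of the other parent
from.parent.1 = sample(parent.1, size = num.children, replace = TRUE)
from.parent.2 = sample(parent.2, size = num.children, replace = TRUE)

# Determine each simulated child's blood type and tabulate proportions
child.type = mapply(blood.type, from.parent.1, from.parent.2)
prop.table(table(child.type))
```

By the rules above, about three quarters of the simulated children should have Type A blood and about one quarter Type O.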

a) Suppose that both members of a couple have Group AB blood. What is the probability that a child of this couple will have Group A blood?

b) Suppose that one member of a couple is genotype $I^{B}i$ and the other is $I^{A}i$. What is the probability that their first child has Type O blood and the next two do not?

c) Suppose that one member of a couple is genotype $I^{B}i$ and the other is $I^{A}i$. Given that one child has Type O blood and two do not, what is the probability of the first child having Type O blood?

#### Problem 2:

Recall that the *positive predictive value (PPV)* of a diagnostic test is the probability that a person has a disease, given that they tested positive for it. In class we considered the setting in which a child tested positive for trisomy 21 from a cell-free fetal DNA (cfDNA) test and calculated the PPV, i.e., the probability that the child does have trisomy 21, given the positive test result.

The information necessary for the calculation was:

- Disease prevalence: Trisomy 21 occurs at a rate of approximately 1 in 800 births.

- Test sensitivity: Of 1000 children with trisomy 21, approximately 980 test positive.

- Test specificity: Of 1000 children without trisomy 21, approximately 995 test negative. 
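The calculation from these three quantities can be written out with Bayes' theorem. The notation is an assumption introduced here for illustration: $D$ denotes the event that a child has trisomy 21, $D^{c}$ its complement, and $T^{+}$ the event of a positive test.

$$
P(D \mid T^{+}) = \frac{P(T^{+} \mid D)\,P(D)}{P(T^{+} \mid D)\,P(D) + P(T^{+} \mid D^{c})\,P(D^{c})} = \frac{(0.980)(1/800)}{(0.980)(1/800) + (0.005)(799/800)} \approx 0.197
$$

Here $P(T^{+} \mid D^{c}) = 1 - 0.995 = 0.005$ is one minus the specificity.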

In this problem you will calculate the PPV again, this time using a simulation study.

The following `R` code creates a simulated dataset of 100,000 individuals that each have a disease status and test result. A contingency table is created and the number of individuals with disease, the total number of positive tests and the number of true positives are also recorded.

In [None]:
#set parameters
population.size = 100000
prevalence = 1/800
sensitivity = 0.980
specificity = 0.995

#set seed
set.seed(2018)

#create empty vectors
disease.status = vector("numeric", population.size)
test.result = vector("numeric", population.size)

#assign disease status (part a)
disease.status = sample(c(0,1), size = population.size,
                          prob=c(1 - prevalence, prevalence),
                          replace = TRUE)

#assign test result (part b)
for (ii in 1:population.size) {   
  if(disease.status[ii] == 0) {test.result[ii] = sample(c(0,1), size=1, 
                                   prob = c(specificity, 1 - specificity)) }
  if(disease.status[ii] == 1) {test.result[ii] = sample(c(0,1), size=1, 
                                   prob = c(1 - sensitivity, sensitivity)) }
}  

#create matrix of disease status and test result (part c)
disease.stat.and.test.result = cbind(disease.status, test.result)

#create a table of test result by disease status
addmargins(table(test.result, disease.status))

#calculate number of individuals with disease
num.disease = sum(disease.status)
num.disease

#calculate total number of positive tests
num.pos.test = sum(test.result)
num.pos.test

#calculate number of true positives
num.true.pos = sum(test.result[disease.status == 1])
num.true.pos

Run the code chunk, then answer the following questions. There are comments in the code that correspond to the following questions; i.e., when answering (a), look for the comment in the above code that says *(part a)*.

a) Explain how `sample()` is being used to fill in `disease.status`. If an individual is assigned a `0`, what is their disease status?

b) The `for()` loop that assigns test results to each individual contains two `if()` statements, which instruct `R` to follow a different set of instructions based on whether an individual has been assigned a `0` or `1` for disease status. How is test outcome assigned if an individual has disease status `0`? How is test outcome assigned if an individual has disease status `1`?

c) Take a look at `disease.stat.and.test.result`. What does a single row with a `0` in both columns represent?

d) Calculate the PPV based on the results of this simulation.
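As background for parts (a) and (b), here is a minimal sketch of how the `prob` argument of `sample()` works, using a hypothetical biased coin rather than the disease data. The variable name `flips` and the probabilities 0.9 and 0.1 are assumptions chosen for illustration.

```r
set.seed(1)

# Draw 1000 values from {0, 1}; prob gives the chance of each value in order,
# so 0 is drawn with probability 0.9 and 1 with probability 0.1
flips = sample(c(0, 1), size = 1000,
               prob = c(0.9, 0.1),
               replace = TRUE)

# Count how many 0s and 1s were drawn
table(flips)

# Because the values are 0/1, the mean is the proportion of 1s
mean(flips)
```

The proportion of 1s should be close to 0.1, though it will vary somewhat from one seed to another.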

#### Problem 3:

Recall from Problem Set 3 that the strongest risk factor for breast cancer is age; as a woman gets older, her risk of developing breast cancer increases. The following table shows the average proportion of American women in each age group who develop breast cancer, according to statistics from the National Cancer Institute. For example, approximately 3.56% of women in their 60s get breast cancer.

*Prevalence of Breast Cancer by Age Group*:


| **Age Group** | **Prevalence** |
| -- | -- |
| 30 - 40 | 0.0044 |
| 40 - 50 | 0.0147 |
| 50 - 60 | 0.0238 |
| 60 - 70 | 0.0356 |
| 70 - 80 | 0.0382 |

A mammogram typically identifies a breast cancer about 85% of the time (the test's sensitivity), and is correct 95% of the time when a woman does not have breast cancer (its specificity).

Use `R` to simulate the results of administering mammograms to a population of 100,000 women in their 30s. How many women in this hypothetical population are expected to test positive for breast cancer? Estimate the PPV of a mammogram for a woman in her 30s.

*Hint: Modify the code from Problem 2.*