---
title: Back Bay National Wildlife Refuge
jupyter: python3
---

> Back Bay National Wildlife Refuge is located in the southeastern corner of the City of Virginia Beach. The refuge was established in 1938 to protect and provide habitat for migrating and wintering waterfowl. Diverse habitats, including beachfront, freshwater marsh, dunes, shrub-scrub and upland forest are home to hundreds of species of birds, reptiles, amphibians, mammals and fish.

![BBNWR](https://www.fws.gov/sites/default/files/styles/banner_image_xl/public/banner_images/2020-09/waterfowl%20%28tundras%29.jpg?h=0c8d0f81&itok=NcZlpD27)


To get introduced to the park and its history, please view the following interactive story map.

[BBNWR History and Introduction](https://storymaps.arcgis.com/stories/960d9db38cca4f3d8d38111119b9874f)

Additionally, here is some drone footage of the park for a better look at the geography and ecology of the area.

[BBNWR Drone Footage](https://www.youtube.com/watch?v=NlW330aBTCc)

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sb
import statsmodels.api as sm
import scipy.stats as sps
import statsmodels.stats.proportion as prop

In [None]:
bbnwr = pd.read_csv("./BKB_WaterQualityData_2020084.csv")
bbnwr.columns

In [None]:
bbnwr["Site_Id"] = bbnwr["Site_Id"].replace({'d': 'D'})

## Question 1

### Q1a

The water in BBNWR is a mix of fresh water and sea water. Sea water has an average salinity of 35 ppt (parts per thousand). Because fresh water flows into the Bay, however, the salinity can be much lower, depending on how much fresh water enters the system. Such systems are classified into tiers based on the amount of salt in the water.

An oligohaline mixture is one in which the salinity is between 0.5 and 5.0 ppt. More details on [classifying estuaries can be found in this EPA report](https://www.epa.gov/sites/default/files/2015-09/documents/2009_03_13_estuaries_monitor_chap14.pdf).

Let's test the theory that the measurements from the Bay come from an oligohaline mixture or a mixture with more saline content, that is, that the average salinity is more than 0.5 ppt.

For this hypothesis test, we will use an $\alpha$-level (maximum Type I error probability) of $0.0015$.

Clearly state:

- The null hypothesis
- The alternative hypothesis
- A suitable test statistic
- The standard error of this test statistic
- A rejection region that contains the test statistic with probability 0.0015 when the null hypothesis is true.

*Double click to add your answer*

Null Hypothesis: The average salinity content is less than or equal to 0.5 ppt, indicating fresh water.

Alternative Hypothesis: The average salinity content is greater than 0.5 ppt, indicating an oligohaline or more saline mixture.

Test Statistic: z = (x̄ - μ₀) / (σ / √n)

Standard Error: σ / √n

Rejection region: z > z_α, where z_α is the upper-tail critical value of the standard normal distribution for α = 0.0015


### Q1b

Now that we have described our hypothesis test, compute all quantities needed to compute the test statistic, standard error, and rejection region. You may find it helpful to create a table that only includes observations from the Bay with non-missing values for "Salinity (ppt)".

## Q1a: Hypothesis Test Setup

### Hypotheses

- **Null Hypothesis (H₀):**  
  \( H₀: \mu \leq 0.5 \)  
  *(The average salinity content is less than or equal to 0.5 ppt, indicating fresh water.)*

- **Alternative Hypothesis (Hₐ):**  
  \( Hₐ: \mu > 0.5 \)  
  *(The average salinity content is greater than 0.5 ppt, indicating an oligohaline or more saline mixture.)*

---

### Test Statistic
The test statistic is calculated as:
$$
z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}
$$
Where:
- \( \bar{x} \): Sample mean salinity  
- \( \mu_0 \): Hypothesized population mean (\( 0.5 \))  
- \( \sigma \): Standard deviation of the sample  
- \( n \): Sample size  

---

### Standard Error (SE)
The standard error of the test statistic is:
$$
SE = \frac{\sigma}{\sqrt{n}}
$$

---

### Rejection Region
For a one-tailed test at \( \alpha = 0.0015 \), find the critical \( z \)-value (\( z_{\alpha} \)) using the standard normal distribution. The rejection region is:
$$
z > z_{\alpha}
$$
Where:
- \( z_{\alpha} \) corresponds to the cumulative probability \( 1 - \alpha = 0.9985 \).  
- Reject \( H₀ \) if the test statistic falls in this region.

---

### Notes
- Use the sample data to compute the test statistic (\( z \)), standard error (\( SE \)), and evaluate against the rejection region.  
- If \( z \) lies in the rejection region, reject the null hypothesis; otherwise, fail to reject it.
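
The critical value described above can be looked up numerically. This is a minimal sketch, assuming the standard normal reference distribution; the names `alpha` and `z_crit` are illustrative and not part of the assignment.

```python
import scipy.stats as sps

alpha = 0.0015

# Upper-tail critical value: the point with cumulative probability 1 - alpha
z_crit = sps.norm.ppf(1 - alpha)

print(f"Reject H0 when z > {z_crit:.3f}")
```

The inverse CDF (`ppf`) is used because the rejection region is defined by a tail probability rather than by a fixed cutoff.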


## Q1b: Compute Quantities

Now that we have set up the hypothesis test, compute the following:

1. **Test Statistic (z):**
   Use the formula:
   $$
   z = \frac{\bar{x} - \mu_0}{\frac{\sigma}{\sqrt{n}}}
   $$

2. **Standard Error (SE):**
   $$
   SE = \frac{\sigma}{\sqrt{n}}
   $$

3. **Critical Value and Rejection Region:**
   - Find the critical \( z_{\alpha} \) for \( \alpha = 0.0015 \).  
   - Compare the calculated \( z \)-score to the critical value to determine whether to reject \( H₀ \).

---

### Instructions
- Use the provided data from the Bay to calculate the sample mean (\( \bar{x} \)) and standard deviation (\( \sigma \)).
- Compute the standard error and the \( z \)-score using the formulas above.
- Interpret the result based on the rejection region.


In [None]:
# compute quantities
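
One way to organize these computations is sketched below. The `demo` table is a small hypothetical stand-in for the real data (its values are invented for illustration), and the sketch assumes that the sample standard deviation is used as the estimate of \( \sigma \).

```python
import numpy as np
import pandas as pd
import scipy.stats as sps

# Hypothetical stand-in for the refuge data; real values come from the CSV.
demo = pd.DataFrame({
    "Site_Id": ["Bay", "Bay", "Bay", "Bay", "Bay", "A", "B"],
    "Salinity (ppt)": [0.4, 0.9, np.nan, 1.1, 0.6, 2.0, 0.3],
})

# Keep only Bay observations with a recorded salinity
bay = demo[(demo["Site_Id"] == "Bay") & demo["Salinity (ppt)"].notna()]
salinity = bay["Salinity (ppt)"]

mu_0 = 0.5                       # hypothesized mean under H0
n = len(salinity)                # sample size
x_bar = salinity.mean()          # sample mean
s = salinity.std()               # sample standard deviation (ddof=1)
se = s / np.sqrt(n)              # standard error of the mean
z = (x_bar - mu_0) / se          # test statistic
z_crit = sps.norm.ppf(1 - 0.0015)  # edge of the rejection region

print(f"n={n}, mean={x_bar:.3f}, SE={se:.3f}, z={z:.3f}, critical value={z_crit:.3f}")
```

With the real data, the same steps would apply to `bbnwr` in place of `demo`.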

### Q1c

Perform the hypothesis test. Clearly state whether you reject or fail to reject the null hypothesis. Interpret this result with respect to the original question of whether the average salinity of the Bay is consistent with an oligohaline (or more saline) mixture.


In [None]:
# perform test

### Q1d

We could also approach this question by creating a confidence interval for the average salinity in the population of all measurements in the Bay.

Using the quantities above, create a 99.7% confidence interval of the average salinity of the Bay. What does this interval tell us about the following table of salinity mixture tiers:

| Level | Salinity |
| ----- | -------- |
| Fresh Water | < 0.5 ppt |
| Oligohaline | 0.5 -- 5.0 ppt |
| Mesohaline | 5.0 -- 18.0 ppt |
| Polyhaline | 18.0 -- 30 ppt |
| Ocean | > 30 ppt |


In [None]:
# confidence interval
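
The interval and its comparison against the tiers can be sketched as follows. The summary statistics are hypothetical placeholders, not results from the data, and the rule "a tier is ruled out when it does not overlap the interval" is one reasonable reading of the question rather than the only one.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical summary statistics (placeholders, not results from the data)
x_bar, s, n = 1.2, 0.8, 100

se = s / np.sqrt(n)
# A 99.7% two-sided interval leaves 0.15% in each tail
z_star = sps.norm.ppf(1 - 0.003 / 2)
ci = (x_bar - z_star * se, x_bar + z_star * se)

tiers = {
    "Fresh Water": (0.0, 0.5),
    "Oligohaline": (0.5, 5.0),
    "Mesohaline": (5.0, 18.0),
    "Polyhaline": (18.0, 30.0),
    "Ocean": (30.0, np.inf),
}

# A tier is ruled out when its range does not overlap the interval
ruled_out = [name for name, (lo, hi) in tiers.items() if hi <= ci[0] or lo >= ci[1]]

print(f"99.7% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print("Ruled out:", ruled_out)
```

Note that the critical value here matches the one-sided critical value for $\alpha = 0.0015$, since both leave 0.15% in the upper tail.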

Which of these levels can be ruled out for the Bay?

*Double click to add your answer*

## Question 2

### Q2a

A [Secchi disk](https://en.wikipedia.org/wiki/Secchi_disk) is a device used to measure the clarity of water by submerging the disk and measuring the depth at which it is no longer visible.

![Secchi Disk](https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/Secchi_disk_pattern.svg/240px-Secchi_disk_pattern.svg.png)

A common definition of "clear water" is being able to view a Secchi disk at 4m. In the case of the BBNWR, most of the depths are less than 4m:


In [None]:
sb.histplot(data = bbnwr, x = "Water Depth (m)")

Create a new column "clear" that is `True` if either of the following conditions is met:

- The "Secchi Depth (m)" is at least 4m
- The "Secchi Depth (m)" is at least as large as the "Water Depth (m)" (because of small differences from waves, location, etc., Secchi Depth can be slightly greater than Water Depth)

Display the proportion of "clear" observations.


In [None]:
# proportion of clear observations
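
A possible construction of the "clear" column is sketched below on a small hypothetical table (`demo` and its values are invented for illustration). One assumption worth flagging: comparisons against a missing value evaluate to `False`, so rows with a missing Secchi depth are counted as not clear unless they are dropped first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the refuge data
demo = pd.DataFrame({
    "Secchi Depth (m)": [0.5, 1.2, 4.5, 0.8, np.nan],
    "Water Depth (m)": [1.0, 1.1, 6.0, 0.8, 2.0],
})

# Condition 1: the disk is visible at 4 m or deeper
deep_enough = demo["Secchi Depth (m)"] >= 4
# Condition 2: the disk is visible all the way to the bottom
reaches_bottom = demo["Secchi Depth (m)"] >= demo["Water Depth (m)"]

# Element-wise "or" of the two conditions
demo["clear"] = deep_enough | reaches_bottom

# The mean of a boolean column is the proportion of True values
p_hat = demo["clear"].mean()
print(p_hat)
```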

### Q2b

Test the hypothesis that the population proportion of clear measurements is 37% against the alternative that it is not equal to 37%.

Use $\alpha = 0.05$. Clearly state if you reject or fail to reject this hypothesis.

In [None]:
# test

*Double click to add your answer*

### Q2c

Referring to the result from the previous section, compute the $p$-value for this hypothesis.

In [None]:
# you will find sps.norm.cdf helpful
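
A two-sided $p$-value for a one-sample proportion test can be sketched as follows. The counts are hypothetical placeholders, and the sketch assumes the standard error is computed under the null value of 0.37, which is a common convention for this test.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical counts (placeholders, not results from the data)
n_clear, n = 46, 100
p_0 = 0.37

p_hat = n_clear / n
se_0 = np.sqrt(p_0 * (1 - p_0) / n)   # standard error under the null
z = (p_hat - p_0) / se_0

# Two-sided p-value: probability of a statistic at least this extreme in either tail
p_value = 2 * sps.norm.cdf(-abs(z))

# Each person rejects when the p-value is below their Type I error tolerance
tolerances = {"A": 0.10, "B": 0.05, "C": 0.01}
decisions = {label: bool(p_value < alpha) for label, alpha in tolerances.items()}

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print(decisions)
```

Comparing a single $p$-value against several $\alpha$-levels avoids recomputing a rejection region for each person.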

Consider three different people:

- Person A has a 10% tolerance for Type I errors
- Person B has a 5% tolerance for Type I errors
- Person C has a 1% tolerance for Type I errors


Which of these people (if any) would reject the null hypothesis that 37% of all possible measurements would be clear? Justify your answer.

### Q2d

Using values you computed in the previous sections, create a 95% confidence interval for the proportion of clear observations in the population of all observations.

Note: you will need to use the estimated standard error of the sample proportion ($\hat p$) of: $\sqrt{\frac{\hat p(1-\hat p)}{n}}$.

Interpret this result in words.

In [None]:
# confidence interval
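
The interval can be sketched using the estimated standard error given in the note above. The values of `p_hat` and `n` are hypothetical placeholders.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical sample proportion and sample size
p_hat, n = 0.46, 100

# Estimated standard error of the sample proportion
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)

# A 95% two-sided interval leaves 2.5% in each tail
z_star = sps.norm.ppf(0.975)
ci = (p_hat - z_star * se_hat, p_hat + z_star * se_hat)

print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```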

*Double click to add your answer*

## Question 3

### Q3a

Recall that if two variables are statistically independent in the population (or, more generally, have no linear relationship), then the population correlation coefficient is zero.

Using the standard error for a correlation coefficient, test the hypothesis that the correlation between "AirTemp (C)" and "Water Temp (?C)" is zero against the alternative that it is non-zero. Use an $\alpha$-level of 0.05.

Note: it is helpful to create a table that contains only these two variables and no missing values.

In [None]:
# test
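
A sketch of this test on simulated data follows. It assumes the standard error $\sqrt{(1 - r^2)/(n - 2)}$ and a $t$ reference distribution with $n - 2$ degrees of freedom; the course may instead use a normal approximation, so treat the choice of reference distribution as an assumption. The simulated `demo` table is purely illustrative.

```python
import numpy as np
import pandas as pd
import scipy.stats as sps

# Simulated stand-in for the two temperature columns
rng = np.random.default_rng(0)
air = rng.normal(20, 5, size=50)
water = 0.8 * air + rng.normal(0, 2, size=50)
demo = pd.DataFrame({"AirTemp (C)": air, "Water Temp (?C)": water})
demo.loc[[3, 7], "AirTemp (C)"] = np.nan   # introduce some missing values

# Keep only the two variables, dropping rows where either is missing
temps = demo[["AirTemp (C)", "Water Temp (?C)"]].dropna()

n = len(temps)
r = temps["AirTemp (C)"].corr(temps["Water Temp (?C)"])

se_r = np.sqrt((1 - r**2) / (n - 2))   # assumed standard error of r
t_stat = r / se_r                      # test statistic under H0: rho = 0
p_value = 2 * sps.t.sf(abs(t_stat), df=n - 2)

print(f"n={n}, r={r:.3f}, SE={se_r:.3f}, t={t_stat:.2f}, p={p_value:.3g}")
```

The result can be cross-checked against `sps.pearsonr`, which reports the same correlation and an equivalent two-sided $p$-value.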

Interpret this result. Would you reject the hypothesis that there is no linear relationship between these two variables (in the population)?

*Double click to add your answer*

### Q3b

Create two confidence intervals for the population correlation coefficient:

- a 95% CI
- a 99.7% CI

In [None]:
# confidence intervals
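
Both intervals can be built from the same estimate and standard error, changing only the critical value. The sketch below uses hypothetical values of `r` and `n` and assumes a normal approximation with standard error $\sqrt{(1 - r^2)/(n - 2)}$; other constructions (such as the Fisher transformation) exist and may give slightly different endpoints.

```python
import numpy as np
import scipy.stats as sps

# Hypothetical sample correlation and sample size
r, n = 0.85, 200
se_r = np.sqrt((1 - r**2) / (n - 2))

intervals = {}
for level in (0.95, 0.997):
    # Split the remaining probability evenly between the two tails
    z_star = sps.norm.ppf(1 - (1 - level) / 2)
    intervals[level] = (r - z_star * se_r, r + z_star * se_r)

widths = {level: hi - lo for level, (lo, hi) in intervals.items()}
print(intervals)
print(widths)
```

Because the estimate and standard error are shared, the width depends only on the critical value, which grows with the confidence level.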

Which interval is wider? Explain why we know this would be true without ever calculating the intervals.

*Double click to add your answer*

## Question 4

Let's investigate whether the locations of the measurements are related to the time of year of the measurement. In other words, do the people taking the measurements favor different sites at different times of the year?

In [None]:
bbnwr["Date"] = pd.to_datetime(bbnwr["Read_Date"])

### Q4a

Create a table that only includes the columns "Site_Id" and "Date". Then, create a new column "Month" that is the month of the date of the measurement (using the `.dt.month` attribute of the `"Date"` column created above). Plot the distribution of this variable.

In [None]:
# month plot

### Q4b

We have noted that one way to think about independence is by thinking about conditional distributions. If two variables are independent, then the conditional distribution of one variable should be the same regardless of the value of the other variable on which we condition.

Create a plot that shows the conditional distribution of Site_Id for each value of month. Note: the `histplot` function has two arguments that are helpful: `multiple = 'fill'` and `discrete = True`.

In [None]:
# plots
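
The numbers behind such a plot can also be inspected directly. The sketch below computes the conditional distribution of site within each month using `pd.crosstab` with row normalization; the `demo` table is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the Site_Id / Date table
demo = pd.DataFrame({
    "Site_Id": ["A", "B", "A", "Bay", "B", "A", "Bay", "B"],
    "Date": pd.to_datetime([
        "2019-01-05", "2019-01-19", "2019-02-02", "2019-02-16",
        "2019-02-23", "2019-03-09", "2019-03-16", "2019-03-30",
    ]),
})
demo["Month"] = demo["Date"].dt.month

# Each row is one month; entries are the share of that month's
# measurements taken at each site, so every row sums to 1
cond = pd.crosstab(demo["Month"], demo["Site_Id"], normalize="index")
print(cond)
```

A filled histogram with "Month" on the x-axis and "Site_Id" as the hue displays these same row proportions as stacked bars. If the variables were independent, the rows (or bars) would look roughly alike.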

What do you notice about the conditional distributions? Does this suggest that the variables are independent?

*Double click to add interpretation*

### Q4c

Create the *contingency table* for the variables "Site_Id" and "Month". Print out this table. What do you notice about the row or column (this will depend on how you order the data) that represents the "Bay" site? This row/column will look different from the others. Is this evidence that the variables are dependent? Why or why not?

In [None]:
# print out table

### Q4d

Using the `sps.chi2_contingency` function, perform a $\chi^2$ test for independence between these two variables. Use an $\alpha$-level of 0.05. Report if you would reject the null hypothesis that the variables are independent. Write, in words, your conclusion to the question of whether different sites were favored in particular months.

In [None]:
# compute p-value
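
A sketch of the test on a hypothetical contingency table follows. The counts are invented for illustration; with the real data the table would come from the contingency table built in Q4c.

```python
import pandas as pd
import scipy.stats as sps

# Hypothetical counts: rows are sites, columns are months
table = pd.DataFrame(
    [[30, 10, 5], [10, 30, 10], [5, 10, 30]],
    index=["A", "B", "Bay"],
    columns=[1, 2, 3],
)

# Returns the test statistic, p-value, degrees of freedom,
# and the expected counts under independence
chi2, p_value, dof, expected = sps.chi2_contingency(table.to_numpy())

alpha = 0.05
reject = bool(p_value < alpha)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}, reject H0: {reject}")
```

The degrees of freedom equal (rows - 1) × (columns - 1), which is a quick sanity check on the shape of the table passed in.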

*Double click to add your answer*