# HW3 problems

### 1. Cumulative upwelling index

_Note:_ More information on loops can be found in Chapter 3 of
Python For Everybody.

NOAA publishes an upwelling index for different sites along the coast.

[http://www.pfeg.noaa.gov/products/PFEL/modeled/indices/upwelling/NA/what_is_upwell.html](http://www.pfeg.noaa.gov/products/PFEL/modeled/indices/upwelling/NA/what_is_upwell.html)

For January 2017, the following daily averages were calculated (in units of cubic meters per second along each 100 meters of coastline; positive values are upwelling favorable):

[88., 11., -164.,-16., 82., -53., -321., -257., 1., -43., 21., 67., 45., 54., 41., 12., 1., -134., -9., -6., -122., -94., 22., -6., 8., 10., -7., -3., 14., 5.]

You are interested in the cumulative upwelling index. 

Write a function that uses a for loop to calculate and print the cumulative sum of an arbitrary 1-D array. Do not use `np.cumsum` in your function, but feel free to use it to check your results. Use the function in the code block below to display the cumulative upwelling index for January 2017 (in units of cubic meters per 100 m of coastline).

_Submission format:_ Update the `cumulativesum` function in the [myfuncs.py](myfuncs.py) file in this repository, upload it to your HW3 repository and commit the changes.

_Grading criteria:_ Your function gives correct output. There is a docstring that describes the function. There are comments in your code.  Your function works for an array of any length. Correct units are displayed.
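
The running total described above can be sketched as follows. This is a minimal illustration, not the contents of `myfuncs.py`: the name `cumulative_sum_loop` is hypothetical, and returning a NumPy array in addition to printing it is an assumption made so that the result can be rescaled afterwards.

```python
import numpy as np


def cumulative_sum_loop(values):
    """Print and return the cumulative sum of a 1-D sequence.

    Hypothetical sketch; not the cumulativesum function in myfuncs.py.
    """
    values = np.asarray(values, dtype=float)
    running_total = 0.0
    result = np.zeros(len(values))
    # Add each element to the running total and store the total so far
    for i, value in enumerate(values):
        running_total += value
        result[i] = running_total
    print(result)
    return result


# First three days of the January 2017 record
first_days = cumulative_sum_loop([88., 11., -164.])
```

Because the loop runs over `len(values)` elements, the same sketch works for an array of any length, including an empty one.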

In [36]:
import numpy as np
import myfuncs

In [37]:
upwelling_index = [88., 11., -164.,-16., 82., -53., -321.,
                   -257., 1., -43., 21., 67., 45., 54., 41.,
                   12., 1., -134., -9., -6., -122., -94., 22.,
                   -6., 8., 10., -7., -3., 14., 5.]

# use the cumulativesum function in myfuncs.py 
cui = myfuncs.cumulativesum(upwelling_index)

# These values are a rate (cubic meters per second), but the output
# is cubic meters, so multiply by the time of each observation (1 day)
# in seconds.
cui = cui * 24 * 3600

# display results here
print('The cumulative upwelling index for each day in the provided data is', cui, 'cubic meters per 100 m of coastline')


[  88.   99.  -65.  -81.    1.  -52. -373. -630. -629. -672. -651. -584.
 -539. -485. -444. -432. -431. -565. -574. -580. -702. -796. -774. -780.
 -772. -762. -769. -772. -758. -753.]
The cumulative upwelling index for each day in the provided data is [  7603200.   8553600.  -5616000.  -6998400.     86400.  -4492800.
 -32227200. -54432000. -54345600. -58060800. -56246400. -50457600.
 -46569600. -41904000. -38361600. -37324800. -37238400. -48816000.
 -49593600. -50112000. -60652800. -68774400. -66873600. -67392000.
 -66700800. -65836800. -66441600. -66700800. -65491200. -65059200.] cubic meters per 100 m of coastline
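
The output above can be cross-checked with `np.cumsum`, which the problem allows for verification only. The variable names `seconds_per_day` and `cui_check` are illustrative, and the conversion assumes each value is a daily mean rate that applies for the full 86400 seconds of its day.

```python
import numpy as np

upwelling_index = np.array([88., 11., -164., -16., 82., -53., -321.,
                            -257., 1., -43., 21., 67., 45., 54., 41.,
                            12., 1., -134., -9., -6., -122., -94., 22.,
                            -6., 8., 10., -7., -3., 14., 5.])

# Convert the running sum of rates (m^3/s) to volumes (m^3) per 100 m of coastline
seconds_per_day = 24 * 3600
cui_check = np.cumsum(upwelling_index) * seconds_per_day

# First and last entries, for comparison with the printed array above
print(cui_check[0], cui_check[-1])
```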


### 2. One-way ANOVA function

Write a function that performs a one-way analysis of variance for an array with _J_ groups (in columns) and _N_ samples in each group (in rows). You may assume that the data set is balanced, i.e. there are the same number of observations in each group.

The function should return the F statistic and the p-value. You can calculate the p-value with the help of the `stats.f.cdf` function.

```
from scipy import stats
stats.f.cdf(F,dfn,dfd)
```
which returns the cumulative F distribution given the F statistic, the degrees of freedom in the numerator (dfn) and the degrees of freedom in the denominator (dfd).
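
Because `stats.f.cdf` gives the probability of an F value at or below the one observed, the p-value is the complementary upper-tail probability. A minimal sketch with illustrative inputs: `F = 10.8` is the statistic reported later in this notebook, while `dfn = 2` and `dfd = 9` are assumptions corresponding to 3 groups of 4 samples each.

```python
from scipy import stats

F = 10.8   # F statistic (illustrative value)
dfn = 2    # numerator degrees of freedom, J - 1, assuming J = 3 groups
dfd = 9    # denominator degrees of freedom, J * (N - 1), assuming N = 4 samples per group

# Upper-tail probability of the F distribution
p_value = 1.0 - stats.f.cdf(F, dfn, dfd)
print(p_value)
```

`stats.f.sf(F, dfn, dfd)` computes the same tail probability directly.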

Do not use `stats.f_oneway` in your function, but feel free to use this function to check your work.

Use the function to test the null hypothesis of no difference between sample means for the MgO example in Table 10.1 of McKillup and Dyar (using the csv file included in this repository). Use if-else statements to print whether the null hypothesis can be rejected with 95% and 99% confidence.

_Submission format:_ Update the `anova` function in the [myfuncs.py](myfuncs.py) file in this repository, upload it to your HW3 repository and commit the changes. Define the null hypothesis being tested, and use the function in the space below to determine whether the null hypothesis can be rejected.

_Grading criteria:_ Your function should work for any balanced array of data with groups in columns and samples in rows. Your function should have a doc-string that describes the inputs and outputs. Your function should use descriptive variable names.
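
One way to organize the calculation for a balanced design is sketched below. It is an illustration under stated assumptions, not the function in `myfuncs.py`: the name `anova_oneway_sketch` and the small `example` array are hypothetical, and the input is assumed to be a 2-D array with groups in columns and samples in rows.

```python
import numpy as np
from scipy import stats


def anova_oneway_sketch(data):
    """Return (F, p) for a balanced one-way ANOVA.

    data: 2-D array with N samples in rows and J groups in columns.
    Hypothetical sketch; not the anova function in myfuncs.py.
    """
    data = np.asarray(data, dtype=float)
    n_samples, n_groups = data.shape
    grand_mean = data.mean()
    group_means = data.mean(axis=0)

    # Variation of the group means about the grand mean
    ss_between = n_samples * np.sum((group_means - grand_mean) ** 2)
    # Variation of the samples about their own group mean
    ss_within = np.sum((data - group_means) ** 2)

    df_between = n_groups - 1
    df_within = n_groups * (n_samples - 1)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    p_value = 1.0 - stats.f.cdf(f_stat, df_between, df_within)
    return f_stat, p_value


# Made-up example data: 4 samples (rows) in each of 3 groups (columns)
example = np.array([[1.0, 2.0, 5.0],
                    [2.0, 3.0, 6.0],
                    [3.0, 4.0, 7.0],
                    [2.0, 3.0, 6.0]])
f_stat, p_value = anova_oneway_sketch(example)

# Check against scipy, passing each column as a separate group
f_check, p_check = stats.f_oneway(*example.T)
print(f_stat, p_value, f_check, p_check)
```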

In [38]:
data = np.genfromtxt('MgO_Maine.csv',skip_header=1,delimiter=',')

In [39]:
# use the anova_oneway function in myfuncs.py
F,p = myfuncs.anova_oneway(data)

Display the results below. State the null hypothesis and determine whether the null hypothesis should be accepted or rejected.

In [40]:
# display results and interpretation here
print('The null hypothesis is that MgO concentration is the same in all 3 areas.')
print('From one-way ANOVA we find F = ', F, ' and p = ', p)
if p < 0.01:
    print('Because p < 0.01 the null hypothesis can be rejected with 99% confidence (and of course 95% as well).')
elif p < 0.05:
    print('Because p < 0.05 the null hypothesis can be rejected with 95% confidence.')
else:
    print('The p-value is large, so the data do not support a rejection of the null hypothesis.')
    
    

The null hypothesis is that MgO concentration is the same in all 3 areas.
From one-way ANOVA we find F =  10.8  and p =  0.00405830677724
Because p < 0.01 the null hypothesis can be rejected with 99% confidence (and of course 95% as well).


### 3. Short problems



Fill in code to solve the short problems below. Use any function in `scipy.stats` or other libraries.

##### a. Comparing respiration rates

Water column respiration rates are measured in dark bottle incubations at two different stations on an oceanographic cruise. Three replicates are taken at each station. The values (in units of mL/L d$^{-1}$) are given below:

Station A: ```[0.45, 0.77, 0.71]```

Station B: ```[0.54, 0.43, 0.36]```

Use an appropriate statistical test to determine whether there is a significant difference in the mean respiration rate between the two stations.

In [41]:
import numpy as np
import scipy.stats as stats
# A t-test seems appropriate.
data_A = np.array([0.45, 0.77, 0.71])
data_B = np.array([0.54, 0.43, 0.36])
# ttest_ind is a t-test for two independent samples.
(t_stat, p_value) = stats.ttest_ind(data_A, data_B)
print('A t-test comparing respiration rates at stations A and B finds a test statistic of %.3f and a p-value of %.3E' % (t_stat, p_value))
if p_value < 0.01:
    print('Because p < 0.01 there is 99% confidence that there is a difference between the two stations.')
elif p_value < 0.05:
    print('Because p < 0.05 there is 95% confidence that there is a difference between the two stations.')
else:
    print('The p-value is large, so no significant difference in respiration rate between the samples is found.')

A t-test comparing respiration rates at stations A and B finds a test statistic of 1.797 and a p-value of 1.468E-01
The p-value is large, so no significant difference in respiration rate between the samples is found.
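
The statistic reported above can be reproduced by hand. The sketch below assumes the default behavior of `stats.ttest_ind`, which pools the two sample variances (equal population variances) and reports a two-tailed p-value; the names `pooled_var`, `std_err`, `t_manual` and `p_manual` are illustrative.

```python
import numpy as np
from scipy import stats

data_A = np.array([0.45, 0.77, 0.71])
data_B = np.array([0.54, 0.43, 0.36])
n_A, n_B = len(data_A), len(data_B)
dof = n_A + n_B - 2

# Pooled variance from the two sample variances (ddof=1)
pooled_var = ((n_A - 1) * data_A.var(ddof=1)
              + (n_B - 1) * data_B.var(ddof=1)) / dof
# Standard error of the difference between the two means
std_err = np.sqrt(pooled_var * (1.0 / n_A + 1.0 / n_B))

t_manual = (data_A.mean() - data_B.mean()) / std_err
# Two-tailed p-value from the t distribution
p_manual = 2.0 * stats.t.sf(abs(t_manual), dof)
print('t = %.3f, p = %.3E' % (t_manual, p_manual))
```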


##### b. Comparing two years of current meter records

January means of alongshore velocity ($V$) from current meter data off the coast of British Columbia are reported in the literature for two different years. The means and standard deviations of daily averaged values in January are

Year 1: $\bar{V_1} = 23 \pm 3 \text{ cm/s}$

Year 2: $\bar{V_2} = 20 \pm 2 \text{ cm/s}$

Test the null hypothesis that the means are the same between these two years, with 95% confidence. You may assume that each daily average is an independent sample. State any other assumptions that you make in your analysis.

In [44]:
# These are two samples which we can assume are independent since they were taken in different
# years. We don't have the data values, but the descriptive stats are usable as input to ttest_ind_from_stats.
# I will assume that the observations were taken daily, giving 31 samples per January. If the meter data
# were recorded at a higher frequency, this will be a conservative assumption.
# By default, ttest_ind_from_stats also assumes that the two years have equal population variances.
stat, p_value = stats.ttest_ind_from_stats(23, 3, 31, 20, 2, 31)
print('A two-tailed test of independent samples, assuming 31 samples for each year,')
print('gives a t-statistic of %.3f and a p-value of %.3E' % (stat, p_value))
if p_value < 0.05:
    print('Because p < 0.05 there is 95% confidence that there is a difference in current between the two years.')
else:
    print('The p-value is large, so no significant difference between the two years is found.')

A two-tailed test of independent samples, assuming 31 samples for each year,
gives a t-statistic of 4.633 and a p-value of 1.990E-05
Because p < 0.05 there is 95% confidence that there is a difference in current between the two years.
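
Since the two standard deviations differ (3 and 2 cm/s), one hedged sensitivity check is to repeat the test without the equal-variance assumption, using the `equal_var=False` option (Welch's t-test). This sketch keeps the same assumed 31 samples per year; the variable names are illustrative.

```python
from scipy import stats

# Pooled-variance test (the default) and Welch's test on the same summary statistics
t_pooled, p_pooled = stats.ttest_ind_from_stats(23, 3, 31, 20, 2, 31, equal_var=True)
t_welch, p_welch = stats.ttest_ind_from_stats(23, 3, 31, 20, 2, 31, equal_var=False)

print('pooled: t = %.3f, p = %.3E' % (t_pooled, p_pooled))
print('Welch:  t = %.3f, p = %.3E' % (t_welch, p_welch))
```

With equal sample sizes the two t-statistics coincide; only the degrees of freedom, and therefore the p-value, change.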


##### c. Power analysis and experimental design

You are studying the effects of a marine reserve on juvenile rockfish. Previous literature indicates that the juveniles of the species you are studying have a standard length of 70 +/- 30 mm (mean +/- standard deviation). The marine reserve will allow you to collect 20 fish for scientific purposes.

If your target power is 80% and your confidence level is 95%, what is the minimum difference in mean length you can expect to observe in the marine reserve? You can assume that the fish lengths are normally distributed.

What is the probability of not observing a significant effect of this magnitude if there actually is one?

In [45]:
from statsmodels.stats import power

# power.tt_solve_power() solves for any one of effect_size, number of observations, alpha, and power.
target_power = 0.8
effect_size = power.tt_solve_power(effect_size = None, nobs=10, alpha=0.05, power=target_power)
# effect_size is in units of standard deviations, so multiply by standard deviation to get a physical value.
# This assumes that the published data and the sample will have the same standard deviation.
effect_mm = effect_size * 30
print('The sample size of 10 juvenile rockfish will support detection if there is a size',
      'difference of %.1f mm or more.' % effect_mm)

# The probability of not observing a significant effect of this size is the type II error, 1 minus the power.
p_typeII = 1.0 - target_power
print('There is a %.2f probability of not observing a significant effect of size %.1f mm, even if it exists.' % (p_typeII, effect_mm))

The sample size of 10 juvenile rockfish will support detection if there is a size difference of 29.9 mm or more.
There is a 0.20 probability of not observing a significant effect of size 29.9 mm, even if it exists.
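
The result above can be cross-checked without `statsmodels`. The sketch below assumes a two-sided, one-sample t-test with the same inputs as the cell above (10 observations, alpha of 0.05, standard deviation of 30 mm) and computes power from the noncentral t distribution in `scipy.stats`; the helper `one_sample_t_power` is hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq


def one_sample_t_power(effect_size, nobs, alpha=0.05):
    """Power of a two-sided, one-sample t-test (sketch using the noncentral t)."""
    dof = nobs - 1
    noncentrality = effect_size * np.sqrt(nobs)
    t_crit = stats.t.isf(alpha / 2.0, dof)
    # Probability that the test statistic falls in either rejection region
    upper = stats.nct.sf(t_crit, dof, noncentrality)
    lower = stats.nct.cdf(-t_crit, dof, noncentrality)
    return upper + lower


# Effect size (in standard deviations) that gives 80% power with 10 observations
effect_size = brentq(lambda d: one_sample_t_power(d, 10) - 0.8, 0.01, 5.0)
effect_mm = effect_size * 30  # convert to mm using the published standard deviation
print('Detectable difference: %.1f mm' % effect_mm)
```

The type II error probability is one minus the power evaluated at this effect size.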
