# Practice for Hypothesis Testing

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
from scipy import stats
import urllib.parse
import urllib.request

## 1. Hypothesis testing practice

In the first section of HW4, we calculated the annual peak flow for the St. Lawrence River. In this practice, we continue with those annual peak flows and ask whether the mean annual peak flow changed by a statistically significant amount around 1980.

A. **Descriptive Plots**: Make a line plot of the annual peak flow time series, using a different color and line style for each period: 1936-1979 in blue with a solid line (`"-"`), and 1980-2022 in orange with a dashed line (`"--"`).

B. **Two-sample test** for a change in the mean annual peak flow: Test for statistical significance of the observed change in the mean annual peak flow around 1980.

* Use a two-sample test with alpha = 0.05 (95% confidence), and use the z-distribution to define the rejection region.

* Discuss why using the z-distribution is appropriate here.

* Use the two-sample test to compare the 1980-2022 data to the 1936-1979 data, accounting for the different sample sizes and sample standard deviations appropriately (remember to use the “pooled standard deviation”).

* For your null hypothesis, postulate that the difference between the two means is 0. For the alternative hypothesis, state that the difference is not 0 (a two-sided test, because you don’t know the direction of the change). Also state the test statistic you’ll be using.

* Can you reject the null hypothesis?

* Calculate the p-value (P) for your test.

* How does your estimate of P change if your null hypothesis is that the difference in the mean between the two data sets is equal to 5% of the pre-1980 sample mean? (In other words, test with a new null hypothesis: the mean of the second period is 1.05 times the mean of the first period.)
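
The steps in part B can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the assigned solution: the function name `two_sample_z_test`, its parameters, and the random demo data are all hypothetical. It uses the standard library's `statistics.NormalDist` for the standard normal CDF and quantile so that it runs on its own; `scipy.stats.norm.cdf` and `scipy.stats.norm.ppf` would play the same role in this notebook. The `delta0` argument carries the hypothesized difference in means, so the shifted null hypothesis in the last bullet only changes that one argument.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z_test(before, after, delta0=0.0, alpha=0.05):
    """Two-sided pooled two-sample z-test of H0: mean(after) - mean(before) = delta0."""
    n1, n2 = len(before), len(after)
    m1, m2 = mean(before), mean(after)
    # statistics.variance uses the n - 1 (sample) denominator.
    s1_sq, s2_sq = variance(before), variance(after)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    # Standard error of the difference in sample means.
    std_err = sqrt(sp_sq * (1.0 / n1 + 1.0 / n2))
    z = ((m2 - m1) - delta0) / std_err
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return {"z": z, "z_crit": z_crit, "p_value": p_value,
            "reject_h0": abs(z) > z_crit}


# Made-up demo samples (NOT the St. Lawrence data), sized like the two periods.
random.seed(0)
before = [random.gauss(100.0, 10.0) for _ in range(44)]
after = [random.gauss(105.0, 10.0) for _ in range(43)]

# H0: no difference between the two means.
print(two_sample_z_test(before, after))
# H0: the second-period mean is 1.05 times the first-period mean.
print(two_sample_z_test(before, after, delta0=0.05 * mean(before)))
```

For the real data, the two lists would be the `peakflow_cfs` values for 1936-1979 and 1980-2022. The z-distribution is usually justified here by the sample sizes: with roughly 40 values in each period, the sample means are approximately normal by the central limit theorem.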

#### Because we already practiced this calculation in the last homework, the annual peak flow time series is provided here as a DataFrame named `st_law_peakflow`

In [None]:
args = {
    'site_no': '04264331',
    'begin_date': '1936-01-01',
    'end_date': '2022-12-31'
}
query = urllib.parse.urlencode(args)
verde_url = (
    f'https://waterdata.usgs.gov/nwis/dv?'
    f'cb_00060=on&format=rdb&referred_module=sw&{query}'
)
response = urllib.request.urlopen(verde_url)
df1 = pd.read_table(
    response,
    comment='#',
    sep=r'\s+',
    names=['agency', 'site', 'date', 'streamflow', 'quality_flag'],
    index_col=2,
    parse_dates=True,
    date_format='%Y-%m-%d',
    engine='python')
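# Discard the first two rows, which hold the RDB column names and format codes.
# .copy() avoids a SettingWithCopyWarning when assigning to the column below.
df1 = df1.iloc[2:].copy()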
# Now convert the streamflow data to floats and
# the index to datetimes. When processing raw data
# it's common to have to do some extra postprocessing
df1['streamflow'] = df1['streamflow'].astype(np.float64)
df1.index = pd.DatetimeIndex(df1.index)

# Annual peak flow for the St. Lawrence River: the largest daily streamflow value in each calendar year
st_law_peakflow = df1[['streamflow']].groupby(df1.index.year).max()
st_law_peakflow.columns = ['peakflow_cfs']
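
Task A can be sketched as follows. This is a minimal sketch, not the assigned solution: it builds a synthetic stand-in DataFrame (`demo_peakflow`, a hypothetical name, filled with random values) with the same layout as `st_law_peakflow`, an integer year index and one `peakflow_cfs` column, so that it runs without the USGS download. Swapping in `st_law_peakflow` should work the same way, assuming its index holds integer years.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for st_law_peakflow (made-up values, not real data).
rng = np.random.default_rng(0)
years = np.arange(1936, 2023)
demo_peakflow = pd.DataFrame(
    {'peakflow_cfs': rng.normal(100.0, 10.0, size=years.size)},
    index=pd.Index(years, name='year'),
)

# .loc slices by index label and includes both endpoints.
early = demo_peakflow.loc[1936:1979]
late = demo_peakflow.loc[1980:2022]

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(early.index, early['peakflow_cfs'],
        color='blue', linestyle='-', label='1936-1979')
ax.plot(late.index, late['peakflow_cfs'],
        color='orange', linestyle='--', label='1980-2022')
ax.set_xlabel('Year')
ax.set_ylabel('Annual peak flow (cfs)')
ax.legend()
```

The same `early` and `late` slices can then feed the two-sample test in part B, which keeps the split at 1980 defined in one place.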

In [None]:
# INSERT Your code here