# In Class Practice #7-2: More Hypothesis Testing
---

In [None]:
# import libraries we'll need
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

## Two-sample T-Test for small sample sizes (n<30)

We have instantaneous monthly observations of dissolved organic carbon (DOC) in two streams over the course of one water year (October-September). **In all three of the following tests, please use a two-sample, two-sided t-test to determine:**

### Practice 1. Using data for all 12 months, with what confidence can we say that the annual mean DOC concentrations are different between the two streams?

In [None]:
wy_month_labels = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
wy_month_numbers = np.arange(12)+1

In [None]:
# DOC for the first stream, mg/L
doc_1 = [65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9]
# DOC for the second stream, mg/L
doc_2 = [62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5]

### If you do not remember how to calculate the T-score, check the slides or this wiki page:
https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_variances_(%E2%81%A01/2%E2%81%A0_%3C_%E2%81%A0sX1/sX2%E2%81%A0_%3C_2)

In [None]:
# Note that you need to enter the code to calculate the t-test yourself, based on the lecture notes or book


In [None]:
# Step 1: Calculate the sample mean, standard deviation, and sample size


In [None]:
# Step 2: Calculate the pooled standard deviation


In [None]:
# Step 3: Calculate T-score


In [None]:
# Step 4: Check the T-score table
# (https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf)
# or use the following function (stats.t.ppf) to find the T-score threshold
# for different confidence levels

# Example: degrees of freedom = 2, significance level = 0.05, two-tailed
# stats.t.ppf(1 - 0.025, 2)
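
One possible way to carry out Steps 1-4 for Practice 1 is sketched below. It assumes the equal-variance (pooled) form of the two-sample t-test from the wiki page linked above. The variable names (`pooled_sd`, `t_score`, `dof`, `confidence`) are illustrative choices, not names from the lecture notes. The last line cross-checks the hand calculation against `scipy.stats.ttest_ind`.

```python
import numpy as np
import scipy.stats as stats

# DOC observations copied from the cells above, mg/L
doc_1 = np.array([65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9])
doc_2 = np.array([62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5])

# Step 1: sample means, sample standard deviations (ddof=1), and sample sizes
n1, n2 = len(doc_1), len(doc_2)
mean_1, mean_2 = doc_1.mean(), doc_2.mean()
sd_1, sd_2 = doc_1.std(ddof=1), doc_2.std(ddof=1)

# Step 2: pooled standard deviation (assumes similar variances in both streams)
dof = n1 + n2 - 2
pooled_sd = np.sqrt(((n1 - 1) * sd_1**2 + (n2 - 1) * sd_2**2) / dof)

# Step 3: T-score
t_score = (mean_1 - mean_2) / (pooled_sd * np.sqrt(1 / n1 + 1 / n2))

# Step 4: two-sided p-value and the confidence level it corresponds to
p_value = 2 * stats.t.sf(abs(t_score), dof)
confidence = 1 - p_value
print(f"t = {t_score:.3f}, dof = {dof}, confidence = {confidence:.3f}")

# Cross-check against SciPy's built-in pooled t-test
scipy_result = stats.ttest_ind(doc_1, doc_2, equal_var=True)
print(scipy_result.statistic, scipy_result.pvalue)
```

Comparing `abs(t_score)` with `stats.t.ppf(1 - alpha/2, dof)` for a chosen `alpha` gives the same decision as comparing `p_value` with `alpha`.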

### Practice 2.
Compare the two streams again, but this time perform two tests, one for the **first 6 months** of the water year (October-March), and a second test for the **last 6 months (April-September)**.

Can we say that the DOC concentrations between the two streams are different in the first half and/or second half of the water year? With what level of confidence could we say that they are different?
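
A minimal sketch of the half-year comparison, again assuming the pooled (equal-variance) two-sample t-test. The helper `pooled_t_test` is a hypothetical name introduced here for illustration. Slicing the lists at index 6 assumes the data are ordered October through September, as in the month labels above.

```python
import numpy as np
import scipy.stats as stats

def pooled_t_test(a, b):
    """Two-sided, equal-variance two-sample t-test; returns (t_score, dof, p_value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.size, b.size
    dof = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    t_score = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    p_value = 2 * stats.t.sf(abs(t_score), dof)
    return t_score, dof, p_value

# DOC observations copied from the cells above, mg/L (ordered Oct..Sep)
doc_1 = [65.3, 98.4, 113.1, 120.5, 105.3, 100.3, 92.3, 97.5, 88.2, 89.5, 72.1, 61.9]
doc_2 = [62.0, 50.7, 30.9, 52.5, 98.7, 95.8, 99.3, 110.2, 104.9, 96.4, 82.5, 75.5]

first_half = pooled_t_test(doc_1[:6], doc_2[:6])    # October-March
second_half = pooled_t_test(doc_1[6:], doc_2[6:])   # April-September

for label, (t_score, dof, p_value) in [("Oct-Mar", first_half), ("Apr-Sep", second_half)]:
    print(f"{label}: t = {t_score:.3f}, dof = {dof}, confidence = {1 - p_value:.3f}")
```

With only 6 observations per stream in each half, the degrees of freedom drop to 10, so the critical T-score is larger than in the 12-month test.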


## Chi-Squared Test for a Change in the Standard Deviation

#### Z-tests and T-tests are designed to compare sample means. But how can we detect a change in standard deviation?

Here we test whether a change in the standard deviation is statistically significant. Note that, unlike the sample mean, the standard deviation does not benefit from the Central Limit Theorem, so this test relies on the population being normally distributed. Even though it is not strictly true, assume for the moment that the sample data are derived from a normally distributed population.

Use a single sample test (with rejection region based on the Chi Squared distribution). Assume that the sample standard deviation from the 1929-1974 data is close to the true population standard deviation of the earlier data set. Test that the more recent sample is different from this.

More details can be found in this reading: [9.5 Chi Squared Test for Variance or Standard Deviation](https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/9-5-chi-squared-test-for-variance-or-standard-deviation/).
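
For reference, the single-sample test can be written out as follows. The symbols are notation chosen here, not taken from the lecture notes: $m$ is the size of the recent sample, $s_2$ is its sample standard deviation, and $\sigma_1$ is the earlier-period standard deviation that we treat as the true population value. The rejection rule shown is for a one-sided test for an increase at significance level $\alpha$.

$$
\chi^2 = \frac{(m-1)\,s_2^2}{\sigma_1^2}, \qquad \text{reject } H_0:\ \sigma = \sigma_1 \ \text{ in favor of } H_1:\ \sigma > \sigma_1 \ \text{ if } \ \chi^2 > \chi^2_{1-\alpha,\; m-1}
$$

Here $\chi^2_{1-\alpha,\; m-1}$ denotes the $(1-\alpha)$ quantile of the chi-squared distribution with $m-1$ degrees of freedom.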



In [None]:
# Read the CSV file
niagara_data_file = 'niagara_river.peak_flow.cfs.csv'
niagara_peak_flow = pd.read_csv(niagara_data_file,index_col=[0])
# Preview our data
niagara_peak_flow.head(3)


In [None]:
# define niagara usgs id
niagara_id = "04216000"

# Divide the data into the early period (before 1980) and late period (after and including 1980). 

niagara_before = niagara_peak_flow[ niagara_peak_flow.index < 1980 ]
niagara_after = niagara_peak_flow[ niagara_peak_flow.index >= 1980 ]

In [None]:
# first calculate the test statistic
sd1 = niagara_before[niagara_id].std() # we pretend this is the "true" population standard deviation
sd2 = niagara_after[niagara_id].std()
m = len(niagara_after[niagara_id])
t = (m-1)*sd2**2/sd1**2
print(t)

Now, we know from the lecture notes that this test statistic is chi-squared distributed with m-1 degrees of freedom, where m is the size of the more recent sample. Let’s say we want 95% confidence that there is a change, so alpha = 0.05. In this example we are only going to test for an increase in the standard deviation (a one-sided test), so the rejection region is the upper tail of the distribution. We can look up our critical value in a chi-squared distribution table using our degrees of freedom and chosen alpha.

How can we look this up in python?

In [None]:
stats.chi2.ppf?

In [None]:
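alpha = 0.05
# Upper-tail critical value: we are testing for an increase in the standard deviation
vals = stats.chi2.ppf(1 - alpha, m-1)
print(vals)

### If our test statistic is larger than this cut-off value from the chi-squared distribution, we conclude with 95% confidence that the standard deviation has increased; otherwise, we cannot reject the null hypothesis of no change.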
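
As an alternative to comparing against a critical value, the same one-sided test can be expressed as a p-value. The sketch below is illustrative only: `chi2_sd_increase_test` is a hypothetical helper name, and the data are synthetic random numbers, not the Niagara record. It assumes a one-sided test for an increase in the standard deviation, with the reference value `sigma_0` treated as known.

```python
import numpy as np
import scipy.stats as stats

def chi2_sd_increase_test(sample, sigma_0, alpha=0.05):
    """One-sided chi-squared test for an increase in standard deviation over sigma_0."""
    sample = np.asarray(sample, dtype=float)
    dof = sample.size - 1
    # Test statistic: (m - 1) * s^2 / sigma_0^2, with s^2 the sample variance
    statistic = dof * sample.var(ddof=1) / sigma_0**2
    # Upper-tail critical value and upper-tail p-value
    critical = stats.chi2.ppf(1 - alpha, dof)
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, critical, p_value

# Synthetic example data (NOT real observations): 40 draws with true sd = 15
rng = np.random.default_rng(0)
synthetic = rng.normal(loc=100.0, scale=15.0, size=40)

statistic, critical, p_value = chi2_sd_increase_test(synthetic, sigma_0=10.0)
print(f"statistic = {statistic:.2f}, critical = {critical:.2f}, p-value = {p_value:.4f}")
```

The two decision rules agree: the statistic exceeds the critical value exactly when the p-value is below `alpha`.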