# F and T tests in Python

To compare the data from your various indicators, we need to process the data a little (to get averages and standard deviations, at least) and then set up the F-test and t-tests to make a decision. In computer programming, decisions are often made with an "if - then" format. In this case, we might say "if Fcalc is greater than Ftable, then we must run the t-test for unequal variances". Make a plan for how to approach the question: are these two indicators providing the same results?
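As a minimal illustration of the if-then pattern, here is a sketch with made-up placeholder numbers (they are not values from the experiment, and `chosen_test` is just an illustrative variable name):

```python
# Hypothetical placeholder values, chosen only to illustrate the branching
F_calc = 2.1
F_table = 5.19

if F_calc < F_table:
    # Variances are not significantly different
    chosen_test = "pooled (equal-variance) t-test"
else:
    # Variances are significantly different
    chosen_test = "unequal-variance t-test"

print("Plan: use the " + chosen_test)
```

Using `else` rather than a second `if` guarantees that exactly one branch runs, even if the two values happen to be equal.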



Once you've structured your plan, let's get started! 

In [10]:
# Just like we imported some extra math functions before, we're going to import some 
# extra statistical functions here
import math
import numpy as np
import scipy.stats as stats

# we need the average and standard deviation for each of our data sets.
# Convert them into csv files and save them in the same folder as this notebook
data = np.genfromtxt('SP19_exp3.csv', dtype=float, delimiter=',', names=True) 

#using the names = True in our import command has told python that all of our columns have names in the first row 
# and we can use those names to call the data!



average_BB = stats.tmean(data['BB'])
s_BB = stats.tstd(data['BB'])

print ("the average HCl concentration calculated using bromothymol blue is " + str(average_BB) + " +/- " + str(s_BB) + " M")

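# note: np.nanstd defaults to ddof=0 (population standard deviation), while stats.tstd uses ddof=1;
# pass ddof=1 to np.nanstd if you want the sample standard deviation
average_MR =  np.nanmean(data['MR'])
s_MR = np.nanstd(data['MR'])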

print ("the average HCl concentration calculated using methyl red is " + str(average_MR) + " +/- " + str(s_MR) + " M")

average_BG =  np.nanmean(data['BG'])
s_BG = np.nanstd(data['BG'])

print ("the average HCl concentration calculated using Bromocresol Green is " + str(average_BG) + " +/- " + str(s_BG) + " M")

average_Ph =  np.nanmean(data['Ph'])
s_Ph = np.nanstd(data['Ph'])

print ("the average HCl concentration calculated using Phenolphthalein is " + str(average_Ph) + " +/- " + str(s_Ph) + " M")

# we're just missing one last indicator! Fill in your own code to print out the average and standard deviation for thymolphthalein.






the average HCl concentration calculated using bromothymol blue is 0.10857142857142861 +/- 0.010703804397102397 M
the average HCl concentration calculated using methyl red is 0.17604166666666665 +/- 0.24542551978938315 M
the average HCl concentration calculated using Bromocresol Green is 0.09876666666666667 +/- 0.002933238634835034 M
the average HCl concentration calculated using Phenolphthalein is 0.36000000000000004 +/- 0.09946356116689167 M
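One way to avoid repeating the same three lines for every indicator is a small helper. The sketch below is an assumption about how you might organize it, not part of the original notebook: `summarize` is a hypothetical name, and the array is made-up data standing in for a CSV column. With the real file you would pass the thymolphthalein column instead, using whatever name it has in your CSV header.

```python
import numpy as np

def summarize(values):
    """Return (mean, sample standard deviation, n), ignoring NaN entries."""
    values = np.asarray(values, dtype=float)
    n = int(np.count_nonzero(~np.isnan(values)))
    mean = np.nanmean(values)
    s = np.nanstd(values, ddof=1)  # ddof=1 gives the sample standard deviation
    return mean, s, n

# Made-up titration results (M); NaN marks a blank row in the CSV column
example = np.array([0.101, 0.099, 0.102, np.nan, 0.098])
mean, s, n = summarize(example)
print("mean = " + str(mean) + ", s = " + str(s) + ", n = " + str(n))
```

Passing `ddof=1` makes `np.nanstd` divide by n-1, which matches what `stats.tstd` does by default.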


We might also want a 95% confidence interval. We calculated this back in the very first Experiment 1 post-lab notebook:

In [22]:
# Since our data arrays are all different sizes, we have to make sure we remove any blank rows from our calculation of the size:
n = data.size - np.isnan(data['BB']).sum()

#the first input is the cumulative probability, the second is degrees of freedom (n-1)
# note: 0.95 gives the one-tailed t value; a two-tailed 95% interval would use 0.975
t = stats.t.ppf(0.95, n-1)

CI_BB = s_BB*t/math.sqrt(n)

print ("[HCl] calculated using bromothymol blue is " + str(average_BB) + " +/- " + str(CI_BB) + " M, at the 95% confidence interval")

[HCl] calculated using bromothymol blue is 0.10857142857142861 +/- 0.0050661305169051336 M, at the 95% confidence interval
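If a two-sided interval is what you want (the usual reading of "+/-"), the t value is taken at a cumulative probability of 1 - 0.05/2 = 0.975. The sketch below assumes that convention; `confidence_half_width` is a hypothetical helper name and the inputs are made up.

```python
import math
import scipy.stats as stats

def confidence_half_width(s, n, confidence=0.95):
    """Half-width of a two-sided confidence interval for a mean."""
    # A two-sided interval splits the excluded probability between both tails
    q = 1.0 - (1.0 - confidence) / 2.0
    t = stats.t.ppf(q, n - 1)
    return s * t / math.sqrt(n)

# Hypothetical inputs: s = 0.010 M from n = 5 replicate titrations
half_width = confidence_half_width(0.010, 5)
print("95% CI half-width: " + str(half_width) + " M")
```

The two-sided half-width is always wider than the one obtained from `stats.t.ppf(0.95, n-1)` for the same data.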


### F-test
We want to know whether we can use all of the indicators interchangeably. Are they all giving us essentially the same answer, or are some of the indicators producing results which are significantly different from the others? We will eventually want to know whether the means are the same, but there are many different ways to compare the means. First, we need to know if the standard deviations are similar, to help us decide which t-test to do!

The F-test is a simple test:

$$ F_{calculated} = (s_{1}^{2}/s_{2}^{2}) $$

Note that $ s_{1} $ must be the larger standard deviation, so your F value should never be less than 1!


In [3]:
# Just like t values, we can tell python to look up critical F values for us as well!
# Now, we can get an F critical value. What are each of those values in that equation? Double check that 
# they are right for our equation, and add comments to explain!

F_crit = stats.f.ppf(q=1-0.05, dfn=4, dfd=5)

# Pick the first two indicators you'd like to compare!
# Think about what the equation for F calculated is, and then calculate your F_calc value here


F_calc = 

SyntaxError: invalid syntax (<ipython-input-3-966c18ed0d90>, line 11)
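The cell above fails because `F_calc` has not been filled in yet. As a sketch of what a completed version could look like, the example below uses made-up standard deviations and sample sizes (`s_a`, `n_a`, `s_b`, `n_b` are hypothetical names) and derives the degrees of freedom from the sample sizes instead of hard-coding them.

```python
import scipy.stats as stats

# Hypothetical sample standard deviations and sample sizes for two indicators
s_a, n_a = 0.0107, 14
s_b, n_b = 0.0030, 6

# Put the larger variance in the numerator so that F_calc >= 1
if s_a >= s_b:
    F_calc = s_a**2 / s_b**2
    dfn, dfd = n_a - 1, n_b - 1
else:
    F_calc = s_b**2 / s_a**2
    dfn, dfd = n_b - 1, n_a - 1

# q is the cumulative probability (1 - alpha); dfn and dfd are the
# numerator and denominator degrees of freedom
F_crit = stats.f.ppf(q=1 - 0.05, dfn=dfn, dfd=dfd)

print("F_calc = " + str(F_calc) + ", F_crit = " + str(F_crit))
```

Note that the numerator degrees of freedom must belong to whichever data set supplied the larger variance, so they swap along with the standard deviations.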

## Using the F-test to choose a t-test
Then we need to actually use our data to make a decision! This is where we use if-then statements to decide how to proceed!

We have two possible methods for calculating our t value.
1. If the variances of the two data sets are not significantly different, then we can use:

$$ {\displaystyle t_{calc}={\frac {{\bar {x}}_{1}-{\bar {x}}_{2}}{s_{pooled}\cdot {\sqrt {{\frac {1}{n_{1}}}+{\frac {1}{n_{2}}}}}}}} $$

where $$ {\displaystyle s_{pooled}={\sqrt {\frac {\left(n_{1}-1\right)s_{1}^{2}+\left(n_{2}-1\right)s_{2}^{2}}{n_{1}+n_{2}-2}}}} $$


In this case, degrees of freedom is $ d.o.f = n_{1} + n_{2} -2 $

2. If the variances of the two data sets are significantly different, then we must use:
$$ {\displaystyle t_{calc}={\frac {{\bar {x}}_{1}-{\bar {x}}_{2}}{{\sqrt {{\frac {s_{1}^{2}}{n_{1}}}+{\frac {s_{2}^{2}}{n_{2}}}}}}}} $$

and the degrees of freedom equation is a little more complicated:

$$ {\displaystyle \mathrm {d.o.f.} ={\frac {\left({\frac {s_{1}^{2}}{n_{1}}}+{\frac {s_{2}^{2}}{n_{2}}}\right)^{2}}{{\frac {\left(s_{1}^{2}/n_{1}\right)^{2}}{n_{1}-1}}+{\frac {\left(s_{2}^{2}/n_{2}\right)^{2}}{n_{2}-1}}}}.} $$


In [14]:
# Here, add equations for the correct t test and the correct degrees of freedom calculations
# Note that there must be a colon at the end of the if statement!

if F_calc < F_crit:
    t_calc = 
    dof = 
    print ("Standard deviations are not significantly different")
    
#
if F_calc > F_crit:
    t_calc = 
    dof = 
    print ("Standard deviations are significantly different")




SyntaxError: invalid syntax (<ipython-input-14-9b05ef0afd4f>, line 4)
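The two sets of equations above can be sketched as a pair of functions. Everything here is illustrative: `pooled_t` and `unequal_variance_t` are hypothetical names, the summary statistics are made up, and 5.05 is the tabulated 95% F value for 5 and 5 degrees of freedom. The absolute value of the difference in means is used so that `t_calc` is positive no matter which data set is listed first.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """t and degrees of freedom when the variances are not significantly different."""
    dof = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / dof)
    t_calc = abs(mean1 - mean2) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))
    return t_calc, dof

def unequal_variance_t(mean1, s1, n1, mean2, s2, n2):
    """t and degrees of freedom when the variances are significantly different."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t_calc = abs(mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_calc, dof

# Hypothetical summary statistics: (mean, s, n) for each of two indicators
args = (0.105, 0.004, 6, 0.100, 0.003, 6)
F_calc = 0.004**2 / 0.003**2
F_crit = 5.05  # tabulated value for 5 and 5 degrees of freedom at 95%

if F_calc < F_crit:
    t_calc, dof = pooled_t(*args)
    print("Standard deviations are not significantly different")
else:
    t_calc, dof = unequal_variance_t(*args)
    print("Standard deviations are significantly different")

print("t_calc = " + str(t_calc) + ", dof = " + str(dof))
```

The unequal-variance formula generally returns a non-integer number of degrees of freedom, which `stats.t.ppf` accepts directly.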

## Using the t-test to make a decision
Now we have another decision to make, using that t_calc. If t_calc is greater than t_critical, the means are not the same, and therefore the two indicators are NOT producing the same answer. We would not want to combine those data sets, or use those two indicators interchangeably.


To make this decision, again, we'll use scipy to pull the right critical value. Then write your own if-else statement to print out whether the two data sets have similar means or different means.

In [8]:
# Again, add comments to explain what the inputs for this function are!

t_crit = stats.t.ppf(1.0 - 0.05, dof)

# add your if-then statements to make your final decision!

# abs() keeps the comparison valid even if the second mean is the larger one
if abs(t_calc) < t_crit:
    print("")
    
if abs(t_calc) > t_crit:
    print("")

NameError: name 'dof' is not defined
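The `NameError` appears because `dof` is only defined once the previous cell has been completed and run. A self-contained sketch of the final decision is shown below with made-up values for `t_calc` and `dof`. It assumes a two-tailed test, which is the usual choice when the question is only whether two means differ, so the critical value is taken at 1 - 0.05/2.

```python
import scipy.stats as stats

# Hypothetical results carried over from a t-test calculation
t_calc = 2.45
dof = 10

# Two-tailed test at 95% confidence: 2.5% of the probability in each tail
t_crit = stats.t.ppf(1.0 - 0.05 / 2, dof)

if abs(t_calc) > t_crit:
    decision = "means are significantly different"
else:
    decision = "means are not significantly different"

print("t_calc = " + str(t_calc) + ", t_crit = " + str(t_crit) + ": " + decision)
```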

Repeat this process for as many indicators as you'd like to compare! 
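If you have several indicators, a loop over every pair saves retyping. The sketch below is one possible approach rather than the method this notebook builds up by hand: it swaps in `scipy.stats.ttest_ind`, which computes the t statistic and a two-sided p-value directly, and uses the F-test only to set its `equal_var` argument. The indicator names and data are made up.

```python
from itertools import combinations
import numpy as np
import scipy.stats as stats

# Made-up replicate results (M) for three hypothetical indicators
results = {
    "A": np.array([0.101, 0.099, 0.102, 0.098]),
    "B": np.array([0.100, 0.103, 0.097, 0.100]),
    "C": np.array([0.120, 0.125, 0.118, 0.122]),
}

decisions = {}
for name1, name2 in combinations(results, 2):
    x1, x2 = results[name1], results[name2]
    # Sort so the larger standard deviation (and its n) comes first
    (s_big, n_big), (s_small, n_small) = sorted(
        [(np.std(x1, ddof=1), x1.size), (np.std(x2, ddof=1), x2.size)],
        reverse=True,
    )
    F_calc = s_big**2 / s_small**2
    F_crit = stats.f.ppf(0.95, n_big - 1, n_small - 1)
    # The F-test result selects the pooled or unequal-variance t-test
    equal_var = bool(F_calc < F_crit)
    t_calc, p_value = stats.ttest_ind(x1, x2, equal_var=equal_var)
    decisions[(name1, name2)] = "different" if p_value < 0.05 else "same"
    print(name1 + " vs " + name2 + ": means are " + decisions[(name1, name2)])
```

Comparing the p-value with 0.05 is equivalent to comparing the absolute t value with the two-tailed critical value at 95% confidence.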