# Overview
In this notebook you will perform a one-way ANOVA (analysis of variance) and a Mann-Whitney test. One-way ANOVA tests the null hypothesis that the means of several groups, which represent the levels of a single factor, are equal. The alternative hypothesis is that at least one group mean differs from the others. One-way ANOVA makes the following assumptions:

* The data has interval or ratio measurement scales.
* The observations in all groups are independent from all other observations.
* The data in each group is approximately normally distributed (Shapiro-Wilk Test).
* There is homogeneity in variance across all groups (Levene's Test).
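
The quantities behind a one-way ANOVA can be sketched by hand. The example below is a minimal illustration on made-up numbers (the `groups` arrays are hypothetical, not the notebook's data). It computes the between-group and within-group sums of squares, forms the F statistic from them, and compares the result with `stats.f_oneway`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical data: three groups of five measurements each (not from the notebook).
groups = [
    np.array([4.1, 5.0, 4.7, 5.3, 4.9]),
    np.array([5.8, 6.1, 5.5, 6.4, 6.0]),
    np.array([4.9, 5.2, 5.6, 5.1, 5.4]),
]

k = len(groups)                              # number of groups
N = sum(len(g) for g in groups)              # total number of observations
grand_mean = np.concatenate(groups).mean()   # mean of all observations pooled

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1   # degrees of freedom between groups
df_within = N - k    # degrees of freedom within groups

# F is the ratio of the two mean squares; the p-value is the upper tail of the F distribution.
F_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(F_manual, df_between, df_within)

F_scipy, p_scipy = stats.f_oneway(*groups)
print("manual:", F_manual, p_manual)
print("scipy: ", F_scipy, p_scipy)
```

The two printed lines should agree up to floating-point rounding, which shows that the library call is a shortcut for the mean-square ratio above.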

The Mann-Whitney Test is a non-parametric test for differences between two samples. This means it can be employed like an independent samples T-test, but on samples that violate the assumption of normality. It is a rank test, which means it can be used on interval or ratio data and also on ordinal data. Mann-Whitney tests the null hypothesis that a randomly selected value from one distribution (modeled with sample 1) is just as likely to be less than as greater than a randomly selected value from a second distribution (modeled with sample 2). Another way of stating this null hypothesis is that there is no difference between the distributions the two samples came from. Just as ANOVA extends the T-test to more than two groups, the Kruskal–Wallis One-Way ANOVA generalizes Mann-Whitney to more than two groups. Mann-Whitney makes the following assumptions:

* The data has ordinal, interval, or ratio measurement scales.
* The observations in the groups are independent.
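
As a rough illustration of the null hypothesis described above, the sketch below uses synthetic, skewed samples (the arrays `a`, `b`, and `c` are invented for this example). It estimates how often a value from one sample exceeds a value from the other, then runs `stats.mannwhitneyu` on two samples and `stats.kruskal` on three. Under the null hypothesis the pairwise fraction would be close to 0.5.

```python
import numpy as np
import scipy.stats as stats

# Synthetic, non-normal samples (assumed for illustration only).
rng = np.random.default_rng(0)
a = rng.exponential(scale=2.0, size=20)
b = rng.exponential(scale=3.0, size=20)
c = rng.exponential(scale=3.0, size=20)

# Fraction of all (a_i, b_j) pairs in which the value from a is larger.
frac_a_greater = np.mean(a[:, None] > b[None, :])
print("fraction of pairs with a > b:", frac_a_greater)

# Two-sample rank test.
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative='two-sided')
print("Mann-Whitney U =", u_stat, " p =", p_mw)

# Rank-based analogue of one-way ANOVA for three or more samples.
h_stat, p_kw = stats.kruskal(a, b, c)
print("Kruskal-Wallis H =", h_stat, " p =", p_kw)
```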

#### Run the following cell (shift-enter) to load the needed Python packages and modules.

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats

### Measurements of Extension Growth from Various Apple Tree Root Stocks

* Load appleroots.csv into a pandas dataframe.
* If this was in your library you would use the path `.../library/filename.csv`.
* Use the `.head()` method to print out the first 5 rows of the dataframe (the default when no argument is given).
* Get the `.shape` (no parentheses) property to see how many rows and columns are in the dataset.

**Source:** S.C. Pearce, University of Kent at Canterbury, England.

**Description:** Five types of root-stock were used in an apple orchard grafting experiment. The following data represent the extension growth (cm) after four years.
* X1 = extension growth for type I 
* X2 = extension growth for type II 
* X3 = extension growth for type III 
* X4 = extension growth for type IV 
* X5 = extension growth for type V 


* H<sub>0</sub>: There is no difference in mean extension growth between the five root-stocks. 
* H<sub>A</sub>: There is a difference in mean extension growth between the five root-stocks.

In [42]:
# DON'T MODIFY THIS CELL
url = "https://raw.githubusercontent.com/prof-groff/evns462/master/data/appleroots.csv"
apple = pd.read_csv(url)
print(apple.head())
print("shape: ", apple.shape)

  root stock  growth
0         X1    2569
1         X1    2928
2         X1    2865
3         X1    3844
4         X1    3027
shape:  (40, 2)


### Perform a One-Way ANOVA on the Apple Root Stock Data to Test the Null Hypothesis

In [43]:
# Here the .groupby() function is used to group the data in the apple dataframe by root stock.
roots = apple.groupby('root stock')

# Here the .get_group() function is used to get the data for root stock X1.
X1 = roots.get_group('X1')

# TO DO: REPEAT THE ABOVE COMMAND WITH APPROPRIATE MODIFICATIONS TO EXTRACT THE DATA FOR THE OTHER FOUR ROOT STOCKS (X2 THROUGH X5).


# The following prints a collection of descriptive statistics for each group
roots.describe()

           growth
            count      mean         std     min      25%     50%      75%     max
root stock
X1            8.0  2977.125  447.985152  2336.0   2791.0  2977.5   3080.5  3844.0
X2            8.0  3109.125  552.074125  2074.0  2859.25  3198.0   3399.0  3906.0
X3            8.0   2815.25  392.804841  2315.0  2476.25  2844.0   3121.5  3308.0
X4            8.0   2879.75  512.011928  2199.0   2417.0  2919.5  3297.75  3601.0
X5            8.0   2557.25  606.694734  1532.0  2267.25  2484.0  3087.25  3366.0


## Questions:

How many groups will be compared, that is, how many different root stocks are there?

How many degrees of freedom are there between the groups? 

What is the total number of trees across all groups?

How many degrees of freedom are there within the groups?

What is the measurement scale (or data type) of the root stock data?

What is the F-critical value for these data at an alpha level of 0.05? [F-table](http://web.anglia.ac.uk/numbers/biostatistics/one_way_anova/local_folder/critical_values.html) [F-statistic calculator](https://www.danielsoper.com/statcalc/calculator.aspx?id=4)
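
As an alternative to a printed table or an online calculator, the critical value can be looked up with SciPy's F distribution. The sketch below uses hypothetical values for the number of groups `k` and the total sample size `N`; substitute the values you found for the root stock data.

```python
import scipy.stats as stats

alpha = 0.05
k = 3     # hypothetical number of groups
N = 30    # hypothetical total number of observations

df_between = k - 1   # numerator degrees of freedom
df_within = N - k    # denominator degrees of freedom

# The critical value is the point with probability alpha in the upper tail.
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
print("F-critical =", f_crit)
```

An F statistic larger than `f_crit` corresponds to a p-value smaller than `alpha`.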



In [44]:
# USE THIS CELL TO CHECK THAT THE ASSUMPTIONS FOR ONE-WAY ANOVA ARE SATISFIED
# Remember, something like X1['growth'] is needed to get the target data from a column in the dataframe.

# LET'S DO A SHAPIRO-WILK TEST FOR NORMALITY ON EACH GROUP
# Remember that the null hypothesis of this test is that the data IS normally distributed. So we are 'hoping' for
# a p-value greater than 0.05, in which case we fail to reject the null hypothesis.
# The following does the test for group X1. Repeat the test for the other four groups.
statistic, pvalue = stats.shapiro(X1['growth'])
print(statistic, pvalue)

# NEXT LET'S DO LEVENE'S TEST FOR HOMOSCEDASTICITY (EQUALITY OF VARIANCE IN ALL GROUPS)
# We could alternatively do Bartlett's test here.
# Remember that the null hypothesis of this test is that the data in all groups HAVE uniform variance. So we are
# 'hoping' for a p-value greater than 0.05, in which case we fail to reject the null hypothesis.
# Here is a template for the python function to do this test:
# statistic, pvalue = stats.levene(REPLACE THIS WITH DATA FROM YOUR FIVE GROUPS)
# print(statistic, pvalue)

0.9433721303939819 0.644533634185791
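
Repeating the same call for every group by hand gets tedious, so one possible pattern is to loop over the groups. The sketch below assumes a small hypothetical long-format dataframe `df` with columns `group` and `value` (invented for illustration, not the apple data). It collects each group's values in a dictionary, runs Shapiro-Wilk on each, and passes all groups to Levene's test at once.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical long-format data: one row per observation.
df = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "value": [12.1, 11.4, 13.0, 12.6, 11.9, 12.3,
              13.5, 14.1, 12.9, 13.8, 14.4, 13.2,
              11.0, 12.2, 11.6, 10.9, 11.8, 12.0],
})

# Map each group label to its column of measurements.
samples = {name: g["value"] for name, g in df.groupby("group")}

# Shapiro-Wilk test for normality, one group at a time.
for name, values in samples.items():
    w, p = stats.shapiro(values)
    print(name, "Shapiro-Wilk W =", round(w, 3), "p =", round(p, 3))

# Levene's test takes every group as a separate argument, so unpack with *.
lev_stat, lev_p = stats.levene(*samples.values())
print("Levene W =", round(lev_stat, 3), "p =", round(lev_p, 3))
```

The same `*samples.values()` unpacking works for any SciPy test that accepts several samples as separate arguments.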


## Questions:

Is the assumption of normality valid for the apple root stock data set?

Is the assumption of equal variance valid for the apple root stock data set?

In [23]:
# USE THIS CELL TO CARRY OUT ONE-WAY ANOVA ON THE APPLE ROOT STOCK DATA
# Here is a template for the python function to do this test:
# F, p = stats.f_oneway(REPLACE THIS WITH DATA FROM YOUR FIVE GROUPS)
# print("F = ", F, " p = ", p)

## Questions:

Is the null hypothesis rejected, or do you fail to reject it?

### Prevalence of Fox Rabies

* Load foxrabies.csv into a pandas dataframe.
* If this was in your library you would use the path `.../library/filename.csv`.
* Use the `.head()` method to print out the first 7 rows of the dataframe.
* Get the `.shape` (no parentheses) property to see how many rows and columns are in the dataset.

**Source:** Sayers, B., Medical Informatics, Vol. 2, 11-34

**Description:** The data represent the number of cases of red fox rabies for a random sample of 16 areas in each of two different regions of southern Germany.

* H<sub>0</sub>: Foxes from region 1 are less likely or equally likely to have rabies compared to foxes from region 2.
* H<sub>A</sub>: Foxes from region 1 are more likely to have rabies compared to foxes from region 2.

In [45]:
# DON'T MODIFY THIS CELL
url = "https://raw.githubusercontent.com/prof-groff/evns462/master/data/foxrabies.csv"
rabies = pd.read_csv(url)
print(rabies.head(7))
print("shape: ", rabies.shape)

  region  cases
0     R1     10
1     R1      2
2     R1      2
3     R1      5
4     R1      3
5     R1      4
6     R1      3
shape:  (32, 2)


### Perform a Mann-Whitney Test on the Fox Rabies Data

Performing a Shapiro-Wilk Test on the R1 and R2 data will reveal that the data for R2 are not normally distributed. So, we cannot perform an independent samples T-test to compare the two groups. Instead, perform a Mann-Whitney Rank Test.

In [46]:
# USE THIS CELL TO PERFORM A MANN-WHITNEY RANK TEST
# First you must use .groupby to group the data by region.
# Second you must use .get_group() to get the R1 and R2 groups individually
# Finally, use the python function templated below to carry out the test.
# The alternative can be 'greater', 'less', or 'two-sided' depending on the situation.
# statistic, pvalue = stats.mannwhitneyu(GROUP1, GROUP2, alternative=TESTSITUATION)
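
The choice of `alternative` depends on the order of the two samples. The sketch below uses two small hypothetical arrays, `x` and `y` (invented counts, not the fox rabies data), where `x` clearly tends to be larger. With `alternative='greater'` the test asks whether the first sample tends to exceed the second, so the order of the arguments matters.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical counts for two independent samples.
x = np.array([7, 9, 4, 8, 6, 10, 5, 9])
y = np.array([3, 5, 2, 6, 4, 3, 5, 1])

# One-sided: does x tend to be larger than y?
_, p_greater = stats.mannwhitneyu(x, y, alternative='greater')
# One-sided in the opposite direction: does x tend to be smaller than y?
_, p_less = stats.mannwhitneyu(x, y, alternative='less')
# Two-sided: is there a difference in either direction?
_, p_two = stats.mannwhitneyu(x, y, alternative='two-sided')

print("greater:  ", p_greater)
print("less:     ", p_less)
print("two-sided:", p_two)
```

Here the 'greater' p-value is small and the 'less' p-value is large because `x` was passed first. Swapping the arguments swaps the two one-sided results.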



## Questions:

Based on your results, should you reject or fail to reject the null hypothesis?

Are the foxes from region 1 more likely to have rabies than foxes from region 2?