#### Run the code below to import all libraries required by the sample code in this notebook

In [6]:
import numpy as np 
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
init_notebook_mode(connected=True)

import random
import pandas as pd
from IPython.display import display, Math

from bokeh.io import show, output_notebook
from bokeh.plotting import figure, show
from scipy.stats import norm 
from bokeh import plotting as pl
from bokeh.models import HoverTool, Arrow, OpenHead, NormalHead, VeeHead
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from scipy import stats
from statsmodels.formula.api import ols
output_notebook()


### Solution code

```python
# Just run the above code
```

Our next topic seems simple on the surface, but there is a lot of depth to it. ANOVA, an acronym for analysis of variance, solves a simple problem: given two or more groups of data, can you conclude whether the means of all the groups are the same or not? For example, you might have two groups, young people and old people, and measure how long each person watches the telly. You want to see if there is a significant difference in the number of hours the two groups watch. To find out, you would do a one-way ANOVA test. In this notebook we are going to do the following: 

1) One way Anova- the easy way <br>
2) One way Anova- the hard way <br>
3) Two way Anova-  a simple example<br>


## One way Anova

As mentioned, ANOVA is an acronym for analysis of variance.

When do we use ANOVA? We stated earlier that ANOVA is used to compare means between groups. This takes an interesting form when it comes to datasets. Suppose we have a categorical variable and a continuous variable, like age group (young, middle, old) and the height of individuals. We can treat the levels of the categorical variable age group as separate groups and test whether the mean height is the same in all of them. Why is this important? Suppose you are solving a machine learning problem in which you want to predict height, and you use age group as a predictor. If the group means are the same, then height does not vary much with age group, and you will not be able to teach a computer to predict height from it. Put the other way round, you *want* a difference in means between the groups of a categorical variable, because this variation is what the computer learns when you run an ML algorithm. Very loosely, you can say that height may depend on the categorical variable age group. That is what we are after! 

If the above seemed really confusing, just hold on, you will see what I mean! 

ANOVA is another kind of hypothesis test, meaning we follow the same logic as in hypothesis testing. In the last section, when we did chi-square testing, we used the chi-square distribution and defined the chi-square test statistic. Here we use the F-distribution, with the F-statistic as the test statistic. We are essentially going to decide whether to reject our null hypothesis using the F-distribution. As in the lesson on chi-square testing, we will not go in depth into how the F-distribution is derived; we go straight to applications. Unlike the chi-square lesson, we are going to generate some fake data. So what does it involve? Here are the steps in ANOVA: 

1) Define the null and the alternate hypothesis  <br>
2) Set the significance level <br>
3) Calculate the F-statistic <br>
4) Calculate the p-value <br>
5) Compare the p-value with the significance level to decide whether to reject the null hypothesis <br>


You will see that these steps are much the same as in the hypothesis testing notebook and the chi-square testing lesson. 
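The five steps above can be sketched in a few lines. This is only a minimal illustration, assuming `scipy.stats.f_oneway` for steps 3 and 4; the names `group_a`, `group_b` and `alpha`, and the data itself, are made up for the example and are not the notebook's dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two groups drawn from normal distributions (illustrative values only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=20, scale=5, size=20)
group_b = rng.normal(loc=30, scale=5, size=20)

# Step 1: H0 = all group means are equal, H1 = the group means are not all equal
# Step 2: set the significance level
alpha = 0.05

# Steps 3 and 4: F-statistic and p-value in one call
f_stat, p_value = stats.f_oneway(group_a, group_b)

# Step 5: compare the p-value with the significance level
reject_null = p_value < alpha
print(f"F = {f_stat:.3f}, p = {p_value:.3g}, reject H0: {reject_null}")
```

The rest of the notebook unpacks what happens inside steps 3 and 4.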

For this problem we will not use real data; we will create our own. This is because of the assumptions behind an ANOVA: 

Assumptions- 
1) Normality: The data in each level of a category should be approximately normal. In the data of age vs number of hours of telly watched, the assumption is that in each category, i.e. young, middle-aged and old, the number of hours of TV watched is normally distributed. 

2) Homogeneity of variance: The variance within each group should be similar to the variance in the other groups. 

3) Sample size: Typically the expectation is that you have about 20 samples in each group. 

4) Independence of observations: The values in one group must be independent of the values in the other groups. This is a rather hard condition to meet, and one we will worry about a bit less. 

Rather than trying to find a dataset that meets these criteria, we will simply pull data from a normal distribution and run ANOVA on it. This is more instructive, since we can show various cases rather than depending on a single one. 
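If you do work with real data later, the first three assumptions can be checked roughly before running ANOVA. The sketch below is one possible way to do it, assuming the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and the Levene test (`scipy.stats.levene`) for equal variances; neither test is used elsewhere in this notebook, and the `samples` dictionary is hypothetical data.

```python
import numpy as np
from scipy import stats

# Hypothetical groups, drawn in the same spirit as the notebook's fake data
rng = np.random.default_rng(0)
samples = {
    "old": rng.normal(loc=20, scale=5, size=20),
    "young": rng.normal(loc=30, scale=5, size=20),
}

# Assumption 1 (normality): Shapiro-Wilk test per group.
# H0: the sample comes from a normal distribution, so a small p-value is a warning sign.
normality_p = {}
for name, values in samples.items():
    _, normality_p[name] = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk p = {normality_p[name]:.3f}")

# Assumption 2 (homogeneity of variance): Levene test across the groups.
# H0: all groups have the same variance.
_, levene_p = stats.levene(*samples.values())
print(f"Levene p = {levene_p:.3f}")

# Assumption 3 (sample size): count the observations in each group.
group_sizes = {name: len(values) for name, values in samples.items()}
print(group_sizes)
```

Independence (assumption 4) depends on how the data was collected, so it cannot be checked with a simple test like these.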

The null and alternate hypotheses are pretty clear. 

Null hypothesis: The means of all groups are equal 

Alternate hypothesis: The means of the groups are not all equal. 
Note: This does not tell us which means differ from the others, just that the null hypothesis does not hold.
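
Written out in symbols, the two hypotheses look as follows. This is only a sketch: $\mu_i$ is an assumed notation (not used elsewhere in this notebook) for the true mean of the $i^{th}$ group, and $K$ is the number of groups.

$$
H_0: \mu_1 = \mu_2 = \dots = \mu_K \qquad \text{versus} \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j
$$

With only two groups, the null hypothesis reduces to $H_0: \mu_1 = \mu_2$.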


Our first example is a really simple one. We are going to pull some data from a normal distribution, provide some context, then run ANOVA on it. Then we will do the whole calculation by hand. For simplicity, we will use a categorical variable with just 2 groups. 

Suppose we have two groups of people, young and old. We measure the amount of telly they watch per week and get the following data. 


In [7]:
# Just run the below code

# 20 old people: weekly TV hours drawn from a normal distribution (mean 20, sd 5)
np.random.seed(1)
old_people_mean = 20
old_people_sigma = 5
old_count = 20
old_tv_times = np.random.normal(old_people_mean, old_people_sigma, old_count)

# 20 young people: weekly TV hours drawn from a normal distribution (mean 30, sd 5)
np.random.seed(2)
young_people_mean = 30
young_people_sigma = 5
young_count = 20
young_tv_times = np.random.normal(young_people_mean, young_people_sigma, young_count)

old_tv_times, young_tv_times

(array([28.12172682, 16.94121793, 17.35914124, 14.63515689, 24.32703815,
         8.49230652, 28.72405882, 16.1939655 , 21.59519548, 18.75314812,
        27.31053969,  9.69929645, 18.38791398, 18.07972823, 25.66884721,
        14.50054366, 19.13785896, 15.61070791, 20.21106873, 22.91407607]),
 array([27.91621076, 29.71866586, 19.31901952, 38.20135404, 21.03282207,
        25.79126317, 32.51440709, 23.77355957, 24.71023891, 25.45496193,
        32.75727022, 41.46104006, 30.20769696, 24.41037277, 32.6952916 ,
        27.0192015 , 29.90434752, 35.8750061 , 26.26064525, 30.04512625]))

### Solution code

```python
# Just run the above code
```

In the code above we randomly drew 20 observations each for the group of old people and the group of young people. In this next step, we create a dataframe that marks each observation as belonging to group 1 (old people) or group 2 (young people). The first column (after the index) is age_class, which is 1 for old and 2 for young people, and the second column, tv_times, holds the observations we generated above.

In [8]:
# Just run the below code

# Label each observation: 1 = old, 2 = young
age_variable = np.append(np.repeat(1, 20), np.repeat(2, 20))
age_tv_times = pd.DataFrame([age_variable, np.append(old_tv_times, young_tv_times)]).T

age_tv_times.columns = ["age_class", "tv_times"]
# Show the first 10 of the 40 rows
age_tv_times.head(10)

   age_class   tv_times
0        1.0  28.121727
1        1.0  16.941218
2        1.0  17.359141
3        1.0  14.635157
4        1.0  24.327038
5        1.0   8.492307
6        1.0  28.724059
7        1.0  16.193965
8        1.0  21.595195
9        1.0  18.753148


### Solution code

```python
# Just run the above code
```

### One way anova- The easy way 

If you only care about running an ANOVA without really caring what it is, go ahead and just follow the method here. If you are coming to this topic for the very first time and have no idea what ANOVA is, I would strongly advise that you also go through the next section. 


In [9]:
# Just run the below code

model = ols('tv_times ~ C(age_class)', data=age_tv_times).fit()
anova_results = sm.stats.anova_lm(model, typ=2)
print(anova_results)

                   sum_sq    df          F    PR(>F)
C(age_class)   925.491762   1.0  29.419906  0.000004
Residual      1195.404476  38.0        NaN       NaN


### Solution code

```python
# Just run the above code
```

You might be wondering what on earth the term "tv_times ~ C(age_class)" in the above code is. You are going to have to wait a bit for a complete explanation, since we need to cover linear regression and generalized linear models first. For now, remember that for all practical purposes an ANOVA is the same as a linear regression, and that the term in question is a general formula of a line. Whenever you want to do a one-way ANOVA you need to write a formula of this type. In general, we write: 

" output variable ~ C(input_variable)" 

We fit a model object and then use the 'stats.anova_lm' function in the 'statsmodels' module to calculate the ANOVA result. As you can see in the results, we get both an F value and a p-value (the PR(>F) column). Following our hypothesis testing method, all we have to do now is note that the p-value $\lt$ the significance level of 5%, so we reject the null hypothesis and conclude that the means of our two groups are not equal. Hey, we are done! 

Well, that was uneventful. You can get other information from anova_results too, like the residual, the sum of squares and the degrees of freedom. But wait, what are those and why are they relevant? For that, look at the next section! 

### One way Anova- the hard way 

In the last section, you saw how easy it is to get ANOVA results in Python. Once you know how to do the calculation "by hand", it is always better to use Python packages, since there are a lot of steps involved and doing them manually gets tedious. But the very first time, it is useful to see these steps. So let's get on with it. 
 
When we do ANOVA, what we are really doing is something called an F-test, which uses the F-statistic. The F-statistic is- 


\begin{equation}
F= \dfrac{\text{variation between group means}}{\text{variation within groups} } 
\end{equation}

We will decode each term one at a time. 

Before we start, let's plot the data to get an intuition for how spread out it is and what we should expect. 

Question: Plot a plotly scatter plot with the dataframe age_tv_times.

In [10]:
trace1 = go.Scatter(
    x = np.repeat(1,20),
    y = old_tv_times,
    mode = 'markers', 
    name ="old"
    
)

trace2 = go.Scatter(
    x = np.repeat(2,20),
    y = young_tv_times,
    mode = 'markers', 
    name="young"
    
)


layout = go.Layout(
    xaxis=dict(
        range=[0, 3], 
        title ="Category", 
        dtick =1
       
    ),
    yaxis=dict(
        range=[0, 50], 
        title ="Number of hours watching TV (hrs/week) ", 
        
    )
)
data = [trace1, trace2]
fig =go.Figure(data =data, layout= layout)
iplot(fig, filename='tv-times-anova')



### Solution code

```python
# Just run the above code
```

Looking at the plot, we find that for group 1 (old people) the data lies between $\approx 8 $ and $ 30 $. Hover over the points to identify the group a point belongs to. For the young people, the data is distributed between $\approx 20 $ and $ 42 $. If we could easily judge that the means are different, we really wouldn't need a test. For example, if there were no overlap between the two sets of points, we would not even have to do a test. In this case, just by looking at the data it is unclear, but one can hazard the guess that the means of these two groups are not the same.
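
The ranges read off the plot can also be computed directly. As a minimal sketch, a pandas `groupby` gives the count, mean, spread and range per group; the `demo` dataframe is hypothetical stand-in data with the same column names as `age_tv_times`, so its numbers will differ from the plot.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the same column names as age_tv_times
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age_class": np.repeat([1, 2], 20),
    "tv_times": np.concatenate([rng.normal(20, 5, 20), rng.normal(30, 5, 20)]),
})

# One row per group: number of observations, mean, standard deviation and range
summary = demo.groupby("age_class")["tv_times"].agg(["count", "mean", "std", "min", "max"])
print(summary)

# The two ranges overlap when each group's maximum is at least the other group's minimum
ranges_overlap = (summary.loc[1, "max"] >= summary.loc[2, "min"]) and (
    summary.loc[2, "max"] >= summary.loc[1, "min"]
)
print("ranges overlap:", bool(ranges_overlap))
```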

### So now on to the F-statistic

#### Variation between the group means 

First, what we mean by the variation between group means is the between-group sum of squares divided by its degrees of freedom. 

We can write it down as- 
\begin{equation}
\text{variation between groups} = \dfrac{\text{sum of squares between groups}}{(K-1)} = \dfrac{\sum_{i=1}^K n_i (\text{mean}_i - \text{grand mean})^2 }{(K-1)} 
\end{equation}

where 

grand mean: the mean of the whole "tv_times" column in the "age_tv_times" dataframe, i.e. with both levels of "age_class" combined <br>

$\text{mean}_i$- is the mean of the $i^{th}$ group. Notice the sum runs from $i=1$ to $K$ 

$K$- is the number of groups. In our case $K =2$ 

$n_i$- is the number of entries in the $i^{th}$ group. This number is the same for both groups in our case: $n_1 =n_2 =20$

$K- 1$- is also known as the degrees of freedom, typically written df1. A second degrees of freedom value, df2, will appear when we calculate the variation within groups 

So let us calculate this now.

In [11]:
# variation between groups 
old_mean =np.mean(old_tv_times)
young_mean=np.mean(young_tv_times)
grand_mean = np.mean(age_tv_times.tv_times)
num_samples =20 
num_groups_K =2 
df1 = (num_groups_K-1)
sum_squares_between_groups= num_samples*(np.square(old_mean-grand_mean) + np.square(young_mean-grand_mean))
variation_between_group = sum_squares_between_groups/df1
print("Variation between groups- {}".format(variation_between_group))

Variation between groups- 925.4917621802622


### Solution code

```python
# Just run the above code
```

You can compare this value with what we obtained in the section "One way anova- The easy way". You will find that the sum_sq value for C(age_class) is the same as the variation between groups. That is because df1 $= 1$ here, so dividing the sum of squares by the degrees of freedom leaves it unchanged. This is good! We are on the right track. 
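
The cell above hard-codes two groups of equal size. As a minimal sketch, the same formula can be written for any number of groups with different sizes; `mean_square_between` and `toy_groups` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def mean_square_between(groups):
    """Sum of n_i * (mean_i - grand mean)^2 over the K groups, divided by K - 1."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / (len(groups) - 1)

# Three small made-up groups of unequal size (K = 3, so df1 = 2)
toy_groups = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
print(mean_square_between(toy_groups))
```

With more than two groups df1 is larger than 1, so the result no longer coincides with the raw sum of squares.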

#### Variation within groups 

This one is a bit tricky.
It involves three steps: 
1) For each group, sum up the squared differences between the individual values and the group mean 
2) Do the 1st step for all the groups involved, then add up these sums 
3) Divide the total by the degrees of freedom. 


We can write it down as- 
\begin{equation}
\text{variation within groups} = \dfrac{\text{sum of squares within groups}}{(N-K)} = \dfrac{\sum_{i=1}^K \sum_{j=1}^{n_i}  (\text{value}_{ij} - \text{mean}_i)^2 }{(N-K)} 
\end{equation}

where 

$(N-K)$ - is the degrees of freedom, where $N$ is the total number of values in all groups combined and $K$ is the number of groups <br>
$i$ - indexes the groups; the outer sum runs over all the groups <br>
$j$ - indexes the values within a group; the inner sum runs over all $n_i$ values in the $i^{th}$ group <br>
$\text{value}_{ij}$-  is the $j^{th}$ value in the $i^{th}$ group. For example $\text{value}_{15}$ is the $5^{th}$ value in the first group. 

So how should we proceed? Here are the steps:

1) Calculate the mean of each group <br>
2) Subtract the group mean from each value in the group, then square the differences <br>
3) Sum these squared values <br>
4) Do steps 1-3 for all the groups, then add up the results <br>
5) Divide by the degrees of freedom <br>

We already have the mean of each group from the last section. Those are the "old_mean" and "young_mean" variables. So let's do this- 


In [12]:
# step 2 and 3 
old_squared=np.sum(np.square(old_tv_times-old_mean))
young_squared=np.sum(np.square(young_tv_times-young_mean))

#step 4 
sum_of_squares_within_groups = old_squared+young_squared

#degrees of freedom 
df2 = len(age_tv_times) - 2 

variation_within_group= sum_of_squares_within_groups/df2

print("Variation within groups- {}".format(variation_within_group))

Variation within groups- 31.45801251631027


### Solution code

```python
# Just run the above code
```
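
A useful sanity check on the two sums of squares is that together they account for all the variation in the data: the between-group and within-group sums of squares add up to the total sum of squares around the grand mean. The snippet below verifies this identity numerically on hypothetical data; `groups` and the `ss_*` names are made up for the illustration.

```python
import numpy as np

# Hypothetical data: two groups of 20 values each
rng = np.random.default_rng(0)
groups = [rng.normal(20, 5, 20), rng.normal(30, 5, 20)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation of every value around the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Variation of the group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Variation of the values around their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print(np.isclose(ss_total, ss_between + ss_within))
```

This is why ANOVA is called an analysis of variance: it splits the total variation into a part explained by group membership and a leftover part.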

Now we have everything. The last step is to evaluate the F-statistic formula from the start of this section by dividing the variation between groups by the variation within groups. When we do that we get-

In [13]:
F_value =  variation_between_group/variation_within_group
print("F value is {}".format(F_value))


F value is 29.419905714018505


### Solution code

```python
# Just run the above code
```

You will find that this is the same F value we got from the ANOVA analysis with the statsmodels module. To get the p-value we can simply do- 


In [14]:
p_value = 1-stats.f.cdf(F_value,df1,df2)
print("p value for the given F-value is {}".format(p_value))


p value for the given F-value is 3.5082754100690394e-06


### Solution code

```python
# Just run the above code
```

Similar to what we did with the chi-square distribution, we subtract the cumulative probability from 1. The CDF gives the probability to the left of our F value, whereas the p-value is the probability to the right of it. 
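
As a small aside, `scipy.stats.f` also has a survival function, `sf`, which returns the right-tail probability directly, and `ppf`, which gives the critical F value for a chosen significance level. The sketch below uses the F value and degrees of freedom computed in this notebook; `f_obs`, `dfn`, `dfd` and `alpha` are names chosen just for this example.

```python
from scipy import stats

# Values taken from the calculation in this notebook
f_obs = 29.419905714018505
dfn, dfd = 1, 38          # df1 = K - 1, df2 = N - K
alpha = 0.05              # significance level of 5%

# Two equivalent ways of getting the right-tail probability
p_from_cdf = 1 - stats.f.cdf(f_obs, dfn, dfd)
p_from_sf = stats.f.sf(f_obs, dfn, dfd)

# Critical value: the F value that leaves alpha in the right tail
f_critical = stats.f.ppf(1 - alpha, dfn, dfd)

print(p_from_cdf, p_from_sf)
print("reject H0:", f_obs > f_critical)
```

Comparing the F value with the critical value and comparing the p-value with the significance level always lead to the same decision.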


In [15]:
conf_int = 0 
xrange = np.linspace(0,10,10000)

#answer
tools_to_show= 'box_zoom,pan,save,hover,reset,tap,wheel_zoom'        



def get_sig_lvl(significance_level,f_val_crtic, df1,df2): 
    xrange =np.linspace(0,f_val_crtic+5,10000)
    pdf = stats.f.pdf(xrange, df1,df2)

    fig = pl.figure(x_range=[0,f_val_crtic+5], 
                    plot_height=400,
                    tools = tools_to_show,
                    title="F-test calculator: Figure 1",
                    x_axis_label= "f values",
                    y_axis_label ="Probability distribution")
    
    fig.line(x=xrange, y= pdf, line_width = 4)

    fig.xgrid.grid_line_color = None
    fig.y_range.start = 0
    
    hover = fig.select(dict(type=HoverTool))
    hover.tooltips = [("xvalue", "@x"), ("yvalue", "@y")]
   
    # critical F value: leaves significance_level percent in the right tail
    f_critical = stats.f.ppf(1-(significance_level)/100, df1,df2)

    # p-value: right-tail area beyond the supplied F value
    p_value = 1.0-stats.f.cdf(f_val_crtic,df1,df2)

    show(fig)

    print("critical F value for the given significance level: {} ".format(f_critical))
    print("p value  :  {}".format(p_value))
    return None 




interact(get_sig_lvl, 
                 significance_level = widgets.FloatText(value = 5, 
                                                        min = 50,
                                                        max = 99.9, 
                                                         step = 0.001), 
                   f_val_crtic = widgets.FloatText(value = 5, 
                                                        min = 0,
                                                        max = 1000, 
                                                        step = 0.001),
                  df1 = widgets.IntText(value = 3, 
                                                        min = 1,
                                                        max = 100, 
                                                        step = 1),
                 df2 = widgets.IntText(value = 2, 
                                                        min = 1,
                                                        max = 100, 
                                                        step = 1)
     
        
        );



### Solution code

```python
# Just run the above code
```

#### Just some Rough code

In [16]:
# Two small groups of 5 values each, drawn with different means
np.random.seed(1)
mu = 0
sigma = 10
group1 = np.random.normal(mu, sigma, 5)

np.random.seed(2)
mu = 5
sigma = 10
group2 = np.random.normal(mu, sigma, 5)

group1, group2

(array([ 16.24345364,  -6.11756414,  -5.28171752, -10.72968622,
          8.65407629]),
 array([  0.83242153,   4.43733173, -16.36196096,  21.40270808,
        -12.93435585]))

### Solution code

```python
# Just run the above code
```

#### Just some Rough code

In [17]:
from scipy import stats
stats.f_oneway(old_tv_times, young_tv_times)

F_onewayResult(statistic=29.419905714018498, pvalue=3.5082754101159494e-06)

### Solution code

```python
# Just run the above code
```
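
`stats.f_oneway` accepts any number of groups, so the whole "hard way" calculation can be cross-checked against it in one go. The sketch below wraps the steps from this notebook in a hypothetical helper, `one_way_anova`, and compares it with `f_oneway` on three small made-up groups of unequal size.

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """One-way ANOVA by hand: returns the F-statistic and its p-value."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups, K
    n_total = sum(len(g) for g in groups)    # total number of values, N
    grand_mean = np.concatenate(groups).mean()

    # Variation between groups: sum of squares between groups / (K - 1)
    ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    # Variation within groups: sum of squares within groups / (N - K)
    ms_within = sum(np.sum((g - g.mean()) ** 2) for g in groups) / (n_total - k)

    f_stat = ms_between / ms_within
    p_val = stats.f.sf(f_stat, k - 1, n_total - k)   # right-tail probability
    return f_stat, p_val

# Made-up data: three groups of unequal size
toy_groups = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

f_hand, p_hand = one_way_anova(toy_groups)
f_scipy, p_scipy = stats.f_oneway(*toy_groups)
print(np.isclose(f_hand, f_scipy), np.isclose(p_hand, p_scipy))
```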