#### Run the below code to import all libraries required to run sample code within this notebook

In [2]:
import numpy as np 
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
init_notebook_mode(connected=True)

import random
import pandas as pd
from IPython.display import display, Math

from bokeh.io import show, output_notebook
from bokeh.plotting import figure, show
from scipy.stats import norm 
from bokeh import plotting as pl
from bokeh.models import HoverTool, Arrow, OpenHead, NormalHead, VeeHead
import statsmodels.api as sm
from scipy.stats import chi2_contingency
from scipy import stats


output_notebook()


### Solution code

```python
# Just run the above code
```

So far we have been talking about general concepts. From here on we are going to get into a few specific examples and techniques. We needed to cover hypothesis testing before coming to chi square testing, since chi square testing is in essence a specific type of hypothesis testing. 

Chi square testing deals with categorical data. We are going to first learn how to run a chi square test of independence, but before that we need to talk a little bit about how to represent categorical variables.

So let us get into it. In this notebook we are going to cover:  <br> 

1) Describing categorical data - contingency table <br>
2) Chi square testing steps 

##  Describing Categorical Data

So far we have not really made a distinction between continuous data and categorical data. Continuous data represents variables like weight, height or speed: quantities that can take any value in a given range and that have a clear ordering. Continuous values are typically numbers, either integers or real numbers of some sort. Categorical variables represent quantities that belong to one of a set of categories. Common examples are gender or colors. Categorical variables can also be numerical, for example the rating (1, 2, 3 etc.) of a song. 

One thing we should always keep in mind is that some categorical variables have an ordering, like ratings, where $1 \lt 2 \lt 3$. In other cases a categorical variable has no specific ordering: for example, female $\gt$ male or vice versa does not make any sense.
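The difference between ordered and unordered categories can be made explicit in pandas. The snippet below is a minimal sketch on made-up values (the ratings and colours are hypothetical and not taken from the Flags dataset), using `pd.Categorical` with and without `ordered=True`.

```python
import pandas as pd

# Hypothetical song ratings: the categories 1 < 2 < 3 have a natural order.
ratings = pd.Series(pd.Categorical([3, 1, 2, 3], categories=[1, 2, 3], ordered=True))

# Hypothetical colours: no colour is "greater" than another.
colours = pd.Series(pd.Categorical(["red", "blue", "red", "green"]))

print(ratings.cat.ordered)   # True
print(colours.cat.ordered)   # False

# min and max are only meaningful for the ordered variable
print(ratings.min(), ratings.max())
```

For the unordered series, asking for a minimum or maximum is not meaningful, which mirrors the point made above.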

To proceed, we will use a dataset from the UCI Machine Learning Repository called the Flags dataset. It contains various design aspects of flags. The data dictionary can be found here: 
https://data.world/uci/flags/workspace/file?filename=flag.names.txt

Question: Import the csv file "flags_data.csv" into pandas and display its first few rows

In [3]:
#answer 
flags_data = pd.read_csv("../../../data/flags_data.csv", header =None)
flags_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20,21,22,23,24,25,26,27,28,29
0,Afghanistan,5,1,648,16,10,2,0,3,5,...,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,...,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,...,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,...,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,...,0,0,0,0,0,0,0,0,blue,red


### Solution code

```python
# Just run the above code
```

In the last two lines we merely imported the data from that dataset. Notice we set header = None. This is because the file has no header row, so we don't know what the column names are. To find them we must go back to the UCI page for the dataset and look at the data dictionary.

So you can see there are 30 columns in the dataset. 
We do not want all of them, let us just take a look at three columns. The three columns that we are going to take are: 

| column number | column title | column description |
|---------------|--------------|--------------------|
| 4             | population   | in round millions |
| 27            | text         | 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise |
| 28            | topleft      | colour in the top-left corner (moving right to decide tie-breaks) |



In [4]:
flags_data_subset  = flags_data[[4,27,28]] #selecting the columns 
flags_data_subset.columns = ["population","text", "topleft"]  #labeling the columns 
flags_data_subset.head() 

Unnamed: 0,population,text,topleft
0,16,0,black
1,3,0,red
2,20,0,green
3,0,0,blue
4,0,0,blue


### Solution code

```python
# Just run the above code
```

Question: Now that you have this information can you write down which type of variables the columns are? 

Answer: Population is essentially a continuous variable, since it is unbounded and can take any value. The other two are categorical variables. The text column is a binary yes or no, whereas the topleft variable can take multiple values, which are colors.  


So we have two categorical variables and one continuous variable. From here on we care only about the categorical variables. How can we describe categorical data? One method, used in the code below, is called a contingency table. 

In [5]:
contingency_table = pd.crosstab(flags_data_subset["text"], flags_data_subset["topleft"])
contingency_table["total"] = contingency_table.sum(axis= 1)
contingency_table.head()


topleft,black,blue,gold,green,orange,red,white,total
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,12,41,6,29,4,51,35,178
1,0,2,0,3,0,5,6,16


### Solution code

```python
# Just run the above code
```

Question: What is the contingency table describing? What does the total column represent? 

Answer: The contingency table tells us how the levels of the two categorical variables are related. For instance, "0" and "1" are the two levels of the "text" column, and "black" is a level of the "topleft" column. Each row of the dataset represents a single country's flag, so the first column of the contingency table tells us that, among flags whose "topleft" value is "black", there are no flags with text and 12 flags without text.

The total values are the number of flags without text (178 in total) and the number of flags with text (16 in total).

Question: What should the sum of the total column represent with respect to the flags dataset? 

Answer: The sum of the total column should equal the total number of flags, i.e. the total number of rows in the dataset (here 178 + 16 = 194). Regardless of the values in the text column, the total number of rows is always conserved. This is an important sanity check to have.
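This sanity check is easy to automate. The sketch below uses a small made-up DataFrame (the `toy` data is hypothetical, not the Flags dataset) and the `margins=True` option of `pd.crosstab`, which adds both row and column totals in one call.

```python
import pandas as pd

# Hypothetical toy data standing in for two categorical columns.
toy = pd.DataFrame({
    "text":    [0, 0, 1, 0, 1, 0],
    "topleft": ["red", "blue", "red", "red", "white", "blue"],
})

# margins=True appends the row totals and the column totals.
table = pd.crosstab(toy["text"], toy["topleft"], margins=True, margins_name="total")
print(table)

# Sanity check: the grand total equals the number of rows in the data.
grand_total = table.loc["total", "total"]
print(grand_total == len(toy))
```

The same check should hold for any pair of categorical columns, as long as neither column contains missing values.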

Before we delve into the chi square testing part, let's recall what we are doing in terms of populations and samples. In this case, we can consider our dataset a population. By pulling certain features, we are really pulling features of the population and checking whether there is a dependency between them.
Now that we have this in place, let's run a chi square hypothesis test on this dataset. 


## Chi square testing for independence of variables 

The procedure for chi square testing follows the same steps as general hypothesis testing: 

1) Define the null and alternate hypothesis <br>
2) Fix the significance level  <br>
3) Calculate the test statistic  <br> 
4) Compare the p-value of the test statistic to the significance level <br> 


This is a pretty straightforward procedure. Step 1 is actually pre-defined. 

1) Define the null and alternate hypothesis- 

Null hypothesis: $\qquad$ two categorical variables are independent <br>
Alternate hypothesis: $\qquad$ two categorical variables are <b>not </b> independent 


2) Fix the significance level 
 
Similar to hypothesis testing we are going to stick with 5% as the significance level. 

3) Calculate the test statistic 
       
Here is where the fun is. The test statistic for a chi-square test is a bit different. I will state it first and then define the terms.


\begin{equation}
 \chi^{2} = \sum^{mrows}_{r =1} {\sum^{mcols}_{c=1}} \dfrac{(O_{r,c} -E_{r,c})^{2}}{E_{r,c}}
\end{equation}

where <br>

$\chi^{2}$ is the test statistic <br>
$O_{r,c}$ is the observed frequency in row $r$ and column $c$ <br>
$E_{r,c}$ is the expected frequency in row $r$ and column $c$ <br>
$mrows$ is the number of rows in the contingency table (excluding totals) <br>
$mcols$ is the number of columns in the contingency table (excluding totals) <br>


We call $E$ and $O$ frequencies because each value in the contingency table is the frequency with which a certain combination of levels appears. 

The sum is over all rows and columns. It looks messy, but we will decode it step by step, so hold on! 

For this purpose we will not stick with the "text" and "topleft" columns. This is because you need <b>a minimum count of 5 in each cell of the contingency table to apply the chi square test</b>. (This rule of thumb is usually stated for the expected frequencies, which are introduced further down.) This is an important point to keep in mind. The "text" and "topleft" table does not meet this condition: it contains several cells with a count of 0. 
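To see how far the "text" and "topleft" table is from meeting this condition, here is a hedged sketch that counts the cells with an expected count below 5. It assumes the rule of thumb is applied to expected counts (the usual formulation) and uses `scipy.stats.contingency.expected_freq`; the observed counts are copied by hand from the contingency table shown earlier.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts copied from the text/topleft contingency table
# (rows: text = 0, 1; columns: black, blue, gold, green, orange, red, white).
observed = np.array([
    [12, 41, 6, 29, 4, 51, 35],
    [0, 2, 0, 3, 0, 5, 6],
])

# Expected counts under independence: row total * column total / grand total
expected = expected_freq(observed)
n_small = int((expected < 5).sum())

print(np.round(expected, 2))
print("cells with expected count below 5:", n_small)
```

Every cell in the "text = 1" row falls below the threshold, which is why this pair of columns is a poor candidate for the test.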

For the chi square test we instead choose two columns that each hold a binary value (yes or no): whether the flag contains the color green and whether it contains the color blue. 


In [6]:
flags_data_subset2  = flags_data[[11,12]] #selecting the columns 
flags_data_subset2.columns = ["green","blue"]  #labeling the columns 
flags_data_subset2.head()

Unnamed: 0,green,blue
0,1,0
1,0,0
2,1,0
3,0,1
4,0,1


### Solution code

```python
# Just run the above code
```

Question: Generate a contingency table for the data above. Do you think we can use a chi square test given the new contingency table values? 

Let's find out...(Execute the code below)

In [7]:
# answer code

new_cont_table = pd.crosstab(flags_data_subset2["green"],flags_data_subset2["blue"]) 
new_cont_table["total"] =new_cont_table.sum(axis=1)
new_cont_table


blue,0,1,total
green,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,36,67,103
1,59,32,91


### Solution code

```python
# Just run the above code
```

Looking at the contingency table, we can see that all values are greater than 5. 
Now let us calculate the test statistic. To do that, we first need to calculate the expected frequencies.

### Expected frequency

For this purpose we need to make an addition to the contingency table. The "total" column we already have holds the row totals. We also need the column totals, which go in a new row. 

Question: Can you add a new row called "column total" to the contingency table and rename the "total" column to "row total"? 


In [8]:
# Rebuild new_cont_table from scratch so this cell can be re-run safely
new_cont_table = pd.crosstab(flags_data_subset2["green"], flags_data_subset2["blue"])
new_cont_table["total"] = new_cont_table.sum(axis=1)

# rename the columns and convert the index labels to strings
new_cont_table = new_cont_table.rename(index=str, columns={0: "blue_no", 1: "blue_yes", "total": "row total"})
# add the column totals as a new row (DataFrame.append was removed in pandas 2.0)
new_cont_table.loc["column total"] = new_cont_table.sum()
new_cont_table.index = ["green_no", "green_yes", "column total"]
new_cont_table

blue,blue_no,blue_yes,row total
green_no,36,67,103
green_yes,59,32,91
column total,95,99,194


### Solution code

```python
# Just run the above code
```

Don't worry if you could not get the correct answer for the above case. 

Now the expected frequency has to be calculated for each element of the contingency table except the row total and column total elements. 

So for green_no and blue_yes the expected frequency will be: 
 
 
Expected frequency (green_no, blue_yes)  = $\dfrac { \text{row total for green_no}  \times  \text{column total for blue_yes}}{\text{total of the dataset} }$ 

The total of the dataset is simply the sum of all the values in the contingency table. That is the value in the bottom-right cell, at the index ("column total", "row total"), which is 194. 



In [9]:
tot_blue_yes = 99 
tot_green_no = 103
#blue yes, green no expected frequency is - 
expected_freq_12  = tot_blue_yes*tot_green_no/194 
print("expected frequency for the index  green_no, blue_yes is {} " .format(str(expected_freq_12)))


expected frequency for the index  green_no, blue_yes is 52.56185567010309 


### Solution code

```python
# Just run the above code
```

The observed frequency is 67, so you can see that there is a difference between the expected and the observed frequency. Now it's up to you to do the rest. I have used the notation expected_freq_12 because I am looking at row 1 and column 2 of the contingency table. 

Question: Can you calculate the rest of the expected frequency values? There are three more to go!



In [10]:
# green_no, blue_no 
tot_blue_no = 95 
tot_green_no = 103
expected_freq_11  = tot_blue_no*tot_green_no/194 
print("expected frequency for the index  green_no, blue_no is {} " .format(str(expected_freq_11)))


# green_yes, blue_no 
tot_blue_no = 95 
tot_green_yes = 91
expected_freq_21  = tot_blue_no*tot_green_yes/194 
print("expected frequency for the index  green_yes, blue_no is {} " .format(str(expected_freq_21)))


# green_yes, blue_yes 
tot_blue_yes = 99
tot_green_yes = 91
expected_freq_22  = tot_blue_yes*tot_green_yes/194 
print("expected frequency for the index  green_yes, blue_yes is {} " .format(str(expected_freq_22)))


expected frequency for the index  green_no, blue_no is 50.43814432989691 
expected frequency for the index  green_yes, blue_no is 44.56185567010309 
expected frequency for the index  green_yes, blue_yes is 46.43814432989691 


### Solution code

```python
# Just run the above code
```
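The four expected frequencies computed by hand can also be obtained in one step. The sketch below is one possible vectorized version: it takes the outer product of the row totals and the column totals and divides by the grand total. The variable names are illustrative, and the observed counts are typed in from the contingency table.

```python
import numpy as np

# Observed counts (rows: green_no, green_yes; columns: blue_no, blue_yes)
observed = np.array([[36, 67],
                     [59, 32]])

row_totals = observed.sum(axis=1)   # [103, 91]
col_totals = observed.sum(axis=0)   # [95, 99]
grand_total = observed.sum()        # 194

# expected[r, c] = row_totals[r] * col_totals[c] / grand_total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The result should match the four values printed by the hand calculation, arranged in the same layout as the contingency table.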

Now that we have the expected frequencies, we can finally expand the test statistic and calculate it.<br>
The test statistic in expanded form is 

\begin{equation}
 \chi^{2} = \dfrac{(O_{1,1} -E_{1,1})^{2}}{E_{1,1}}  +\dfrac{(O_{1,2} -E_{1,2})^{2}}{E_{1,2}} +\dfrac{(O_{2,1} -E_{2,1})^{2}}{E_{2,1}} + \dfrac{(O_{2,2} -E_{2,2})^{2}}{E_{2,2}} 
\end{equation}

What we have done here is expand the sums over rows and columns. Since we have a two by two table, the expansion has 4 terms. For a 3 by 3 table we would have 9 terms, and so on. <br>

Now I am going to calculate the first element. You can do the rest! 


In [11]:
O_11 = 36  # from the contingency table 
E_11 = expected_freq_11

first_term = np.square(O_11 - E_11)/E_11
print("first term in equation (2) is {}".format(first_term ))


first term in equation (2) is 4.132983369242845


### Solution code

```python
# Just run the above code
```

Question: calculate the rest of the terms from equation (2) and sum everything to get the chi square value 


In [12]:
#answer 
#second term 
O_12 = 67  # from the contingency table 
E_12 = expected_freq_12

second_term = np.square(O_12 - E_12)/E_12
print("second term in equation (2) is {}".format(second_term ))

#third 
O_21 = 59  # from the contingency table 
E_21 = expected_freq_21

third_term = np.square(O_21 - E_21)/E_21
print("third term in equation (2) is {}".format(third_term ))



#fourth term 
O_22 = 32  # from the contingency table 
E_22 = expected_freq_22

fourth_term = np.square(O_22 - E_22)/E_22
print("fourth term in equation (2) is {}".format(fourth_term ))


chi_square_value = first_term+second_term+third_term+fourth_term
print("\n Chi square value:  {}".format(chi_square_value))

second term in equation (2) is 3.96599414220273
third term in equation (2) is 4.677992165186956
fourth term in equation (2) is 4.488982380734958

 Chi square value:  17.265952057367493


### Solution code

```python
# Just run the above code
```
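The term-by-term sum can also be written as a single array expression. This is a minimal sketch, assuming the observed counts are typed in from the contingency table and the expected counts are rebuilt from the row and column totals.

```python
import numpy as np

# Observed counts (rows: green_no, green_yes; columns: blue_no, blue_yes)
observed = np.array([[36, 67],
                     [59, 32]])

# Expected counts under independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum of (O - E)^2 / E over every cell of the table
chi_square = ((observed - expected) ** 2 / expected).sum()
print(chi_square)
```

Because the expression works cell by cell, the same two lines apply unchanged to larger tables.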

Now that we have this value, let's calculate the p-value from it. But wait... what distribution do we use? The normal distribution, right? No! In the distributions section you went through the chi square distribution, and that is what we have to use. If you remember, that distribution needs something called the degrees of freedom. So how do we calculate that in this case? 

Well it's just: <br>

\begin{equation}
\text{Degrees of freedom }\text{(DF)}    = (\text{number of columns} -1)* (\text{number of rows} -1)
\end{equation}

We need the degrees of freedom to be able to draw the chi square distribution. Our two by two contingency table has 1 degree of freedom: $(2-1) \times (2-1) = 1$. Check your calculation against equation (3)!
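As a quick illustration of the degrees of freedom formula, here is a small helper. The function name `degrees_of_freedom` is hypothetical, and it assumes the table is passed without its total row and total column.

```python
import numpy as np

def degrees_of_freedom(table):
    """Return (rows - 1) * (columns - 1) for a table without totals."""
    n_rows, n_cols = np.asarray(table).shape
    return (n_rows - 1) * (n_cols - 1)

# The green/blue table is 2 by 2
df_green_blue = degrees_of_freedom([[36, 67], [59, 32]])
print(df_green_blue)   # 1

# A 2 by 7 table, like the text/topleft table, has more degrees of freedom
df_text_topleft = degrees_of_freedom(np.zeros((2, 7)))
print(df_text_topleft)  # 6
```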

Now let's use a chi square calculator 

In [13]:
# answer
# Bokeh tools shown in the plot toolbar
tools_to_show = 'box_zoom,pan,save,hover,reset,tap,wheel_zoom'



def get_sig_lvl(significance_level, chi_val_crtic, df): 
    xrange =np.linspace(0,chi_val_crtic+5,10000)
    pdf = stats.chi2.pdf(xrange, df)

    fig = pl.figure(x_range=[0,chi_val_crtic+5], 
                    plot_height=400,
                    tools = tools_to_show,
                    title="Chi square test calculator: Figure 1",
                    x_axis_label= "chi-square values",
                    y_axis_label ="Probability distribution")
    
    fig.line(x=xrange, y= pdf, line_width = 4)

    fig.xgrid.grid_line_color = None
    fig.y_range.start = 0
    
    hover = fig.select(dict(type=HoverTool))
    hover.tooltips = [("xvalue", "@x"), ("yvalue", "@y")]
   
    # critical chi-square value: the rejection region lies to its right
    critical_value = stats.chi2.ppf(1 - significance_level / 100, df)

    # p-value: area under the curve to the right of the test statistic
    p_value = stats.chi2.sf(chi_val_crtic, df)

    show(fig)

    print("critical chi-square value for the given significance level: {} ".format(critical_value))
    print("p value  :  {}".format(p_value))
    return None 




# FloatText and IntText are unbounded text boxes, so only value and step are set here
interact(get_sig_lvl,
         significance_level=widgets.FloatText(value=5, step=0.001),  # in percent
         chi_val_crtic=widgets.FloatText(value=5, step=0.001),       # chi-square test statistic
         df=widgets.IntText(value=1, step=1));                       # degrees of freedom


interactive(children=(FloatText(value=5.0, description='significance_level', step=0.001), FloatText(value=5.0,…

### Solution code

```python
# Just run the above code
```

In the calculator above there are three values you can adjust: the significance level, the chi square value and the degrees of freedom. You can insert the values that we obtained while solving the problem. So we have: 

Significance level = 5% <br>
$\chi^2 = 17.27$ (rounded) <br>
df = 1 from  equation (3) <br>

When we enter these values we find that the p-value is very small, about $0.0000325$. So what does all this mean? 

Question: Given that the significance level is 5%, do we accept or reject the null hypothesis? What does this mean? 

Answer: We need to reject the null hypothesis and accept the alternate hypothesis, because the p-value is much smaller than the significance level that we have set ($0.0000325 \lt 0.05$). In other words, whether a flag contains green and whether it contains blue are not independent. 
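The same decision can be reproduced without the interactive calculator. The following is a minimal sketch using `scipy.stats.chi2`, with the chi square value and degrees of freedom from the hand calculation typed in directly; the variable names are illustrative.

```python
from scipy import stats

chi_square = 17.265952057367493   # test statistic from the hand calculation
df = 1                            # degrees of freedom for a 2 by 2 table
alpha = 0.05                      # 5% significance level

# p-value: probability of a chi square value at least this large (1 - cdf)
p_value = stats.chi2.sf(chi_square, df)

# critical value: boundary of the rejection region for this significance level
critical_value = stats.chi2.ppf(1 - alpha, df)

reject_null = p_value < alpha
print("p-value:", p_value)
print("critical value:", critical_value)
print("reject the null hypothesis:", reject_null)
```

Comparing the p-value with the significance level and comparing the test statistic with the critical value are two views of the same decision.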

Now there are a couple of details that we need to talk about. For that let us look at the chi square distribution again. We have fixed the number of degrees of freedom and we want to vary the significance level. So the task for you is to vary the significance level and see what happens. 

Question: What happens when you change the value of the significance level? What does this mean? Hint: in which direction does the rejection region grow?

Answer: As you increase the significance level you will see that the boundary of the rejection region moves from right to left, i.e. from larger values of chi square to smaller values of chi square. Hence a large significance level corresponds to a small critical value of chi square, and vice versa. A small difference between the observed and expected frequencies gives a small chi square value, which falls outside the rejection region, so we fail to reject the null hypothesis. This is an important point to keep in mind! To understand this better, just look at equation (2). 
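The movement of the rejection region can also be checked numerically. This sketch prints the critical chi square value for a few significance levels, assuming 1 degree of freedom; the chosen levels are only examples.

```python
from scipy import stats

df = 1
levels = [0.01, 0.05, 0.10]   # significance levels as fractions

# The rejection region starts at the critical value and extends to the right
critical_values = [stats.chi2.ppf(1 - alpha, df) for alpha in levels]

for alpha, crit in zip(levels, critical_values):
    print("significance level {:.0%}: reject when chi square > {:.4f}".format(alpha, crit))
```

As the significance level grows, the critical value shrinks, so the rejection region extends further to the left.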

In [14]:
# answer
# Bokeh tools shown in the plot toolbar
tools_to_show = 'box_zoom,pan,save,hover,reset,tap,wheel_zoom'

def shade_reject_region(z_min,z_max, df ): 
    shade_x  = np.arange(z_min,z_max,0.001)
    shade_region  =stats.chi2.pdf(shade_x,df)
    
    shade_region[0] = 0 
    shade_region[-1] = 0
   
    return shade_x, shade_region

def get_sig_lvl(significance_level, chi_val_crtic, df): 
    xrange =np.linspace(0,chi_val_crtic+15,10000)
    pdf = stats.chi2.pdf(xrange, df)

    fig = pl.figure(x_range=[0,chi_val_crtic+15], 
                    plot_height=400,
                    tools = tools_to_show,
                    title="Chi square test calculator: Figure 1",
                    x_axis_label= "chi-square values",
                    y_axis_label ="Probability distribution")
    
    fig.line(x=xrange, y= pdf, line_width = 4)

    fig.xgrid.grid_line_color = None
    fig.y_range.start = 0
    
    hover = fig.select(dict(type=HoverTool))
    hover.tooltips = [("xvalue", "@x"), ("yvalue", "@y")]
   
    # critical chi-square value: the rejection region lies to its right
    critical_value = stats.chi2.ppf(1 - significance_level / 100, df)

    # p-value: area under the curve to the right of the test statistic
    p_value = stats.chi2.sf(chi_val_crtic, df)

    # shade the rejection region in the right tail of the distribution
    right_shade, right_region = shade_reject_region(critical_value, chi_val_crtic + 15, df)
    fig.patch(right_shade, right_region, color="red", alpha=0.4)

    # label the rejection region and point an arrow at it
    reject_text_x = critical_value
    reject_text_y = 0.1
    fig.text(x=reject_text_x, y=reject_text_y, text=["Significance\n level"])
    arrow_y_end = right_region[int(right_region.size / 2)]
    arrow_x_end = right_shade[int(right_shade.size * 0.01)]

    fig.add_layout(Arrow(end=NormalHead(fill_color="black"),
                   x_start=reject_text_x + 0.5,
                   y_start=reject_text_y - 0.001,
                   x_end=arrow_x_end,
                   y_end=arrow_y_end + 0.01))

    show(fig)

    print("critical chi-square value for the given significance level: {} ".format(critical_value))
    print("p value  :  {}".format(p_value))
    return None 





interact_manual(get_sig_lvl,
                significance_level=widgets.FloatSlider(value=5, min=2, max=10, step=0.001),  # in percent
                chi_val_crtic=widgets.FloatText(value=5, step=0.001),  # unbounded text box
                df=widgets.IntText(value=3, step=1));                  # unbounded text box


interactive(children=(FloatSlider(value=5.0, description='significance_level', max=10.0, min=2.0, step=0.001),…

### Solution code

```python
# Just run the above code
```

So far you have seen how the hand calculation of the chi square test is done. Luckily, Python has a nice module that implements the chi square test: 'scipy.stats'. I love this library since it makes our lives so much easier. See below how you can do everything we did above by hand with one line of code! To reproduce the hand calculation, remember to set correction to False. The correction is called Yates' correction for continuity. SciPy applies it by default when the degree of freedom is 1, and it mainly matters when the cell counts are small. In our case we don't care much about it for now.

In [15]:
chi_value, p_value, df_new, expected_values= chi2_contingency(np.array([[36,67],[59,32]]), correction= False)

print("chi squared value is {}".format(chi_value))
print("\n P-value is {}".format(p_value))
print("\n degrees of freedom of the table {}".format(df_new))
print("\n Expected values  \n{}".format(expected_values))


chi squared value is 17.265952057367493

 P-value is 3.2495779824285224e-05

 degrees of freedom of the table 1

 Expected values  
[[50.43814433 52.56185567]
 [44.56185567 46.43814433]]


### Solution code

```python
# Just run the above code
```
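To see what the correction argument does, the sketch below runs `chi2_contingency` on the same table with and without Yates' correction. It is only an illustration; the variable names are made up for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[36, 67],
                     [59, 32]])

# Without the correction: matches the hand calculation
chi2_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

# With Yates' correction (only applied when the degree of freedom is 1)
chi2_yates, p_yates, _, _ = chi2_contingency(observed, correction=True)

print("without correction: chi2 = {:.4f}, p = {:.2e}".format(chi2_plain, p_plain))
print("with correction:    chi2 = {:.4f}, p = {:.2e}".format(chi2_yates, p_yates))
```

The corrected statistic is somewhat smaller and its p-value somewhat larger, but with counts this large both versions lead to the same conclusion at the 5% level.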

Chi square tests are fairly easy to run once you understand the underlying idea. 

We have covered only one use of the chi square test. There are other uses for it: 

1) Chi square test for homogeneity: used to check whether the same categorical variable is distributed differently in two different populations.   

2) Chi square test for goodness of fit: used when you have a single categorical variable and you want to see if it fits a certain distribution. This distribution might be an assumption that you are making about the process or it might be the result of a previous observation.

We are not going into the details of these tests. If you are curious, you might want to check out: 
https://stattrek.com/chi-square-test/goodness-of-fit.aspx?Tutorial=AP



In [16]:
# End of notebook

### Solution code

```python
# End of notebook
```