# EDA 5 - Categorical Bivariate Analysis
In EDA 4, we did a bivariate analysis that compared two quantitative variables. Now we're going to focus on two categorical variables. 

There was a LOT of math in the last lesson. This should be a bit easier because our sample (hint, not a population!) is from the NPI-40, which is the Narcissistic Personality Inventory. This is a personality test w/ 40 questions. Each question is yes/no, so all of the variables are binary. 

We're going to look at just a subset of the 40 questions: 
- **influence**: *yes* = I have a natural talent for influencing people; *no* = I am not good at influencing people.
- **blend_in**: *yes* = I prefer to blend in with the crowd; *no* = I like to be the center of attention.
- **special**: *yes* = I think I am a special person; *no* = I am no better or worse than most people.
- **leader**: *yes* = I see myself as a good leader; *no* = I am not sure if I would make a good leader.
- **authority**: *yes* = I like to have authority over other people; *no* = I don’t mind following orders.

In [None]:
import pandas as pd

npi40 = pd.read_csv('npi_40.csv')
npi40.head()

In [None]:
# info
npi40.info()

In [None]:
# Summarization
## NOTE: Since we only have categorical variables, we don't need to specify include='all'
npi40.describe()

---
Before we get to the meat and potatoes, let's look at the data and think about what might be associated. (Silently, and that means me too. I'm not going to write the notes here.) 

### Contingency Tables (Frequencies) 

Contingency Tables (a.k.a. Two-Way Tables, Cross-Tabulations, Crosstabs) are a useful tool for summarizing two variables at the same time. They're similar to ```value_counts()```, but they count combinations of two variables instead of the values of one. 


Let's look at two examples: 
- **leader** vs. **influence**
- **special** vs **authority**

In [None]:
# crosstab for leader vs. influence
leader_influence_ct = pd.crosstab(npi40.leader, npi40.influence)
leader_influence_ct

In [None]:
# crosstab for special  vs. authority
special_authority_ct = pd.crosstab(npi40.special, npi40.authority)
special_authority_ct


I'll use the first example to show you how to read these: 
- 3015 people have determined that they aren't influential or a leader
- 2360 people have determined that they are influential but not a leader
- 1293 people have determined that they aren't influential but they are a leader.
- 4429 people have determined that they are both influential and a leader.
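If you'd rather pull those cells out with code than read them by eye, here is a minimal sketch. It assumes the table has **leader** on the rows and **influence** on the columns (as in the `pd.crosstab` call above) and rebuilds it from the four counts in the list, since `npi_40.csv` isn't loaded here. The name `observed` is just a stand-in for `leader_influence_ct`.

```python
import pandas as pd

# Rebuild the leader vs. influence table from the four counts listed above.
# Assumption: rows = leader, columns = influence, labels are "no"/"yes".
observed = pd.DataFrame(
    [[3015, 2360],
     [1293, 4429]],
    index=pd.Index(["no", "yes"], name="leader"),
    columns=pd.Index(["no", "yes"], name="influence"),
)

# .loc[row_label, column_label] reads a single cell of the table
not_leader_but_influential = observed.loc["no", "yes"]
leader_and_influential = observed.loc["yes", "yes"]

# Adding up every cell gives the number of respondents in the table
total_respondents = observed.to_numpy().sum()

print(observed)
print("leader = no, influence = yes:", not_leader_but_influential)
print("leader = yes, influence = yes:", leader_and_influential)
print("total respondents:", total_respondents)
```

The row label always comes first in `.loc`, so `observed.loc["no", "yes"]` means *leader = no, influence = yes*.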

How do we **assess an association?** We have to look at the responses and the magnitude of the commonality to determine if information about one variable gives us additional information about the other variable. 

Looking at the results above, the 4429 is the strongest indicator: those who think they are a good leader also tend to see themselves as influential (and vice versa). 

### Contingency Tables (Proportions)

You may remember some of the challenges from previous statistical analysis: raw values are often not good indicators of strength or magnitude, because they don't have broad significance beyond the single observation. As a result, we often express them as proportions. 

Let's go ahead and do that. 

In [None]:
# leaders and influence
leader_influence_ct_prop = leader_influence_ct / len(npi40)
leader_influence_ct_prop

In [None]:
# special and authority
special_authority_ct_prop = special_authority_ct / len(npi40)
special_authority_ct_prop

It quickly becomes clear how much easier this is to interpret. 

In the first case, almost 40% of the sample view themselves as both influential and good leaders, and 27% view themselves as neither. 
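As a side note, `pd.crosstab` can do the division for us through its `normalize` parameter. The sketch below is an illustration, not the lesson's data: `demo` is a hypothetical stand-in for `npi40`, rebuilt one row per respondent from the four leader/influence counts we read off earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for npi40: one (leader, influence) row per respondent,
# rebuilt from the four counts in the frequency table above.
counts = {("no", "no"): 3015, ("no", "yes"): 2360,
          ("yes", "no"): 1293, ("yes", "yes"): 4429}
rows = [pair for pair, n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["leader", "influence"])

# Option 1: divide the frequency table by the sample size (what we did above)
by_hand = pd.crosstab(demo.leader, demo.influence) / len(demo)

# Option 2: ask crosstab to normalize over the whole table
built_in = pd.crosstab(demo.leader, demo.influence, normalize="all")

print(built_in.round(6))
print("same result:", np.allclose(by_hand, built_in))
```

Both routes give the same proportions, so use whichever reads more clearly to you.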

Special and authority are a little bit less conclusive...

### Marginal Proportions

All else equal, we might expect each quadrant to hold 1/4 (25%) of the people. This would mean that the respondents in any single category (top row, bottom row, left column, right column, or either diagonal) represent 50%. 

Well, this isn't accurate, but the concept of looking at respondents of single categories (and their sums) is the definition of **marginal proportions**. We can carry this out by summing the relevant values. 

To calculate the leadership **marginal proportions**, sum across the no row and across the yes row to get the total no/yes proportions for leadership, respectively:
- (no) 0.271695+0.212670 = **0.484365**
- (yes) 0.116518+0.399117 = **0.515635**

If we want to calculate the influence **marginal proportions**, we sum down the no and yes columns instead. This can be done programmatically w/ ```sum()``` by changing the ```axis``` parameter. 

In [None]:
leader_marginal_proportions = leader_influence_ct_prop.sum(axis=1)
leader_marginal_proportions

In [None]:
influence_marginal_proportions = leader_influence_ct_prop.sum(axis=0)
influence_marginal_proportions

By investigating the marginal proportions, we've found that more people feel they have a talent for influence, but they are split on whether or not they are good leaders. 
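If you want the marginal proportions attached to the table itself, `pd.crosstab` has a `margins` parameter that adds an `All` row and column. This is a sketch under the same assumption as before: `demo` is a hypothetical stand-in for `npi40`, rebuilt from the four leader/influence counts.

```python
import pandas as pd

# Hypothetical stand-in for npi40, rebuilt from the four observed counts
counts = {("no", "no"): 3015, ("no", "yes"): 2360,
          ("yes", "no"): 1293, ("yes", "yes"): 4429}
rows = [pair for pair, n in counts.items() for _ in range(n)]
demo = pd.DataFrame(rows, columns=["leader", "influence"])

# normalize="all" gives proportions; margins=True appends the row and column sums
with_margins = pd.crosstab(demo.leader, demo.influence,
                           normalize="all", margins=True)

print(with_margins.round(6))
```

The `All` column holds the leader marginal proportions and the `All` row holds the influence marginal proportions.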

### Expected Contingency Tables

This is a tool that helps us better understand the association between categorical variables. We use **marginal proportions** to create a special contingency table of *expected proportions* between the variables under the assumption that they have **no association**. 

We accomplish this by 
- calculating the product of the marginal proportions for each combination of categories.
- de-normalizing the data: we multiply the resulting proportions by the sample size to convert them back into frequencies.

Let's calculate the products of our **marginal proportions**:

| | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 0.484365×0.388213 = **0.18803679** | 0.484365×0.611787 = **0.29632821**  | 
| **leader = yes** | 0.515635×0.388213 = **0.20017621** | 0.515635×0.611787 = **0.31545879** | 

Now let's multiply each of these by the sample size of the dataset (11097) to de-normalize them back into frequencies. 

| | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 0.18803679*11097 = **2086.64425863** | 0.29632821*11097 = **3288.35414637** | 
| **leader = yes** |  0.20017621*11097 = **2221.35540237** | 0.31545879*11097 = **3500.64619263** | 

Now, let's round off those values to whole numbers to arrive at our **expected contingency table**:

| | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 2087 | 3288 | 
| **leader = yes** | 2221 | 3501 | 

What this table tells us is that *if there were **no** association between the **leader** and **influence** questions* we would expect 2087 people to answer *no* to both and 3501 people to answer *yes* to both, and so on. 
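The multiplication tables above amount to an outer product of the two sets of marginal proportions, scaled by the sample size. Here is a minimal sketch with `numpy`, starting from the four observed counts (variable names are mine, not part of the lesson's notebook):

```python
import numpy as np

# Observed counts: rows = leader (no, yes), columns = influence (no, yes)
observed = np.array([[3015, 2360],
                     [1293, 4429]])
n = observed.sum()  # sample size

# Marginal proportions: sum across each row / down each column, then divide by n
leader_marginals = observed.sum(axis=1) / n      # [no, yes]
influence_marginals = observed.sum(axis=0) / n   # [no, yes]

# Every leader marginal times every influence marginal = expected proportions
expected_props = np.outer(leader_marginals, influence_marginals)

# De-normalize back to frequencies
expected_counts = expected_props * n

print(np.round(expected_props, 6))
print(np.round(expected_counts))
```

`np.outer` is doing exactly what each cell of the hand-written table does: row marginal times column marginal.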

---

Cool right! That's a lot of math (less than the last lesson, though!) 

As usual, there is an easier way to do this with python libraries! (```numpy```, ```scipy.stats.chi2_contingency```). Let's check our work. 

In [None]:
# let's import those libraries. 
import numpy as np
from scipy.stats import chi2_contingency

# due to the magic of python and scipy.stats, we don't need to compute the marginal proportions, we just need the original observed contingency table
## ignore the details of this command for now, we'll eventually come back to them
ch2, pval, dof, expected = chi2_contingency(leader_influence_ct)

expected

In [None]:
# Hmm. There must be a way to round this off easily.. didn't Ed say that we needed numpy? 
np.round(expected)

The numbers look familiar!

So let's compare these values to the original contingency table. 

| **observed** | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 3015 | 2360 | 
| **leader = yes** | 1293 | 4429 | 

| **expected** | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 2087 | 3288 | 
| **leader = yes** | 2221 | 3501 | 

The no/no and yes/yes counts we observed are much higher than in the expected table. Remember that the expected contingencies are what we'd expect under the condition of "no association." You'll note that the no/yes and yes/no counts are lower than expected by the same amount. This is normal. The row and column totals are the same in both tables, so a surplus in the matching cells has to be offset by a deficit in the mismatched cells (and vice versa). 

This can be hard for some to wrap their head around. We aren't changing the sample size. The expected contingencies attempt to establish a baseline for the values without any association, so that we can evaluate the magnitude of the delta between our observations and the baseline. We can't just look at a single quadrant, we have to look at the contingencies and/or proportions holistically in order to determine that there is an association between the variables. 
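One way to see that the sample size isn't changing is to subtract the two tables and check that the differences cancel out along every row and column. A small sketch, using the four observed counts and the unrounded expected table that `chi2_contingency` returns (variable names are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = leader (no, yes), columns = influence (no, yes)
observed = np.array([[3015, 2360],
                     [1293, 4429]])

# The fourth return value is the expected table under "no association"
_, _, _, expected = chi2_contingency(observed)

# Positive = more people than expected, negative = fewer
diff = observed - expected

print(np.round(diff, 1))
print("row sums of the differences:", np.round(diff.sum(axis=1), 6))
print("column sums of the differences:", np.round(diff.sum(axis=0), 6))
```

Every row and column of differences sums to zero, which is why a surplus in one cell always shows up as a deficit somewhere else.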

### Chi-Square Statistic

After we compared the observed and expected values, we were able to determine that there is likely an association between the two. However, as we've discussed, raw frequencies aren't very good at conveying the magnitude or strength of a difference. Normally we'd use proportions; however, as I mentioned, we look at the tables holistically, which means that the delta between observed and expected has to be boiled down to a single scalar value, not a proportion. (The difference between two percentages or proportions is expressed as a percentage only in terms of units, not as a proportion in the statistical sense.) 

Another way of stating this is that it's not a part of a single whole, but rather a gap between two separate proportions. 

So we use the **chi-square** statistic to summarize how different the two contingency tables are. The calculation is as follows: 
1. find the squared difference between each value in the observed table and its corresponding value in the expected table
2. divide the result(s) of step 1 by the value from the expected table
3. Add those numbers up to get the result.

*You: Ed, are you really going to make me do the math?*
*Me: Have you met me?*

#### Step 1: Find the difference between the tables. 
Since we're going to square the results, the order of subtraction doesn't matter. That said, every version of this I've ever been taught has shown *observed - expected*, so I'm going to do the same. 

| **observed - expected** | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 3015 - 2087 = **928** | 2360 - 3288 = **-928** | 
| **leader = yes** | 1293 - 2221 = **-928** | 4429 - 3501 = **928** | 

See why it doesn't matter now?

| **squared** | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 928^2 = **861184** | (-928)^2 = **861184** | 
| **leader = yes** | (-928)^2 = **861184** | 928^2 = **861184** | 

#### Step 2: Divide the results by their expected value

| **divided by expected** | **influence = no** | **influence = yes** | 
|-|-|-|
| **leader = no** | 861184 / 2087 = **412.642069957** | 861184 / 3288 = **261.917274939** | 
| **leader = yes** | 861184 / 2221 = **387.746060333** | 861184 / 3501 = **245.982290774** | 

#### Step 3: Add 'em up!

412.642069957 + 261.917274939 + 387.746060333 + 245.982290774 = **1308.287696003**
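For reference, the three steps fit in three lines of `numpy`. This sketch deliberately uses the same **rounded** expected table we used by hand, so it should reproduce the hand arithmetic rather than the exact statistic (variable names are mine):

```python
import numpy as np

# Observed counts and the rounded expected counts from the tables above
observed = np.array([[3015, 2360],
                     [1293, 4429]])
expected_rounded = np.array([[2087, 3288],
                             [2221, 3501]])

# Step 1: squared difference between observed and expected, cell by cell
squared_diff = (observed - expected_rounded) ** 2

# Step 2: divide each squared difference by its expected value
contributions = squared_diff / expected_rounded

# Step 3: add 'em up
chi_square_by_hand = contributions.sum()

print(squared_diff)
print(contributions)
print(chi_square_by_hand)
```

Because the arrays line up cell for cell, `numpy` applies each step to all four quadrants at once.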

Now let's calculate the result to check our math...

In [None]:
# ready to get really annoyed? You've already calculated the variable. It was part of the chi2_contingency output above. 
ch2

---
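You'll notice that the numbers are slightly different. Part of that is rounding: we rounded the expected table to whole numbers before doing the arithmetic. The other part is that ```chi2_contingency``` applies Yates' continuity correction by default when there is 1 degree of freedom (as in a 2 x 2 table); passing ```correction=False``` turns it off. It's worth trusting the output of ```scipy.stats```. You **can** go back and do this all by hand, without rounding and with the same correction setting, and get the same result. 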
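To see how much of the gap is rounding and how much is the continuity correction that `chi2_contingency` applies by default to 2 x 2 tables, here is a hedged sketch that computes the statistic three ways from the four observed counts. The variable names are illustrative; the `correction` parameter is part of `scipy.stats.chi2_contingency`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = leader (no, yes), columns = influence (no, yes)
observed = np.array([[3015, 2360],
                     [1293, 4429]])

# Default call: Yates' continuity correction is applied because df = 1
chi2_default, _, _, expected = chi2_contingency(observed)

# Same call with the correction switched off (plain Pearson chi-square)
chi2_uncorrected, _, _, _ = chi2_contingency(observed, correction=False)

# The three-step hand calculation, but with the unrounded expected table
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

print("scipy, default (corrected):", chi2_default)
print("scipy, correction=False:   ", chi2_uncorrected)
print("by hand, no rounding:      ", chi2_by_hand)
```

The unrounded hand calculation lines up with `correction=False`, and the default value comes out a little smaller because the correction shrinks each difference before squaring it.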

I want to stop for a moment to emphasize why these libraries are so valuable to data scientists. They take the complexity and **error-prone** nature of math by hand out of the science. 

---
So how do we interpret the number? 

This gets a little complicated, and we're going to get more into this later, but the interpretation depends on the complexity of the table. 

We calculate a value called **degrees of freedom**, which is (num_of_rows - 1) * (num_of_cols - 1). For a 2 x 2 table, **df (degrees of freedom)** is 1. 

We also select a **significance level** (\(\alpha\)). 0.05 is the most commonly used value. This value has to do with probability, and I'm going to handwave over it for now. (We'll get there, I promise!) 

#### Chi-Square Critical Values Table

| Degrees of Freedom (df) | \(\alpha = 0.10\) | \(\alpha = 0.05\) | \(\alpha = 0.01\) | \(\alpha = 0.001\) |
|-------------------------|-------------------|-------------------|-------------------|--------------------|
| 1                       | 2.706             | 3.841             | 6.635             | 10.828             |
| 2                       | 4.605             | 5.991             | 9.210             | 13.816             |
| 3                       | 6.251             | 7.815             | 11.345            | 16.266             |
| 4                       | 7.779             | 9.488             | 13.277            | 18.467             |
| 5                       | 9.236             | 11.070            | 15.086            | 20.515             |
| 6                       | 10.645            | 12.592            | 16.812            | 22.458             |
| 7                       | 12.017            | 14.067            | 18.475            | 24.322             |
| 8                       | 13.362            | 15.507            | 20.090            | 26.125             |
| 9                       | 14.684            | 16.919            | 21.666            | 27.877             |
| 10                      | 15.987            | 18.307            | 23.209            | 29.588             |
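You don't have to memorize or look up this table. As a sketch (not something the lesson depends on), the critical values can be regenerated with the chi-square distribution in `scipy.stats`: the critical value for a given \(\alpha\) is the point that leaves \(\alpha\) of the distribution to its right, which is `chi2.ppf(1 - alpha, df)`.

```python
from scipy.stats import chi2

# Significance levels used as the table's columns
alphas = [0.10, 0.05, 0.01, 0.001]

# Rebuild the first few rows of the critical values table
critical_table = {}
for df in range(1, 4):
    critical_table[df] = [float(chi2.ppf(1 - alpha, df)) for alpha in alphas]
    print(df, [round(value, 3) for value in critical_table[df]])
```

Change `range(1, 4)` to `range(1, 11)` to print all ten rows.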

--- 

When the **chi-square statistic > critical value** we *reject the null hypothesis* (This is going to mean more later, but I'm introducing the term now. File this under the same promise I made above). 

When we reject the null hypothesis, we are concluding that the data give us evidence of an association. 
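Putting the pieces together, here is a minimal sketch of the decision rule for the leader vs. influence table, assuming a significance level of 0.05. It uses the four observed counts directly; the variable names are mine.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Observed counts: rows = leader (no, yes), columns = influence (no, yes)
observed = np.array([[3015, 2360],
                     [1293, 4429]])
alpha = 0.05  # assumed significance level

# chi2_contingency reports the statistic and the degrees of freedom
statistic, p_value, dof, _ = chi2_contingency(observed)

# Degrees of freedom by the formula: (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
dof_by_formula = (n_rows - 1) * (n_cols - 1)

# Critical value for this alpha and df, then the comparison
critical_value = chi2.ppf(1 - alpha, dof)
reject_null = statistic > critical_value

print("degrees of freedom:", dof, "(formula gives", dof_by_formula, ")")
print("chi-square statistic:", statistic)
print("critical value:", critical_value)
print("reject the null hypothesis:", reject_null)
```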

Our value of ~1308 is much greater than the 