<a href="https://colab.research.google.com/github/fundamentals-of-data-science/course-materials/blob/master/problem_sets/Problem_Set_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%matplotlib inline

# Problem Set 1

## Directions

1. Show all work/steps/calculations using a combination of code and Markdown.
2. **All** work is to be your own. This is not a group project. You may, however, use code from the lectures and labs. Provide citations for any code that is not your own. You may also consult Stackoverflow, etc. This is not by any means "closed book" or anything like that. Basically, I need to see if *you* learned the concepts from the Lectures, *Fundamentals*, and Labs.
3. Add whatever markdown or code cells you need in each part to explain and calculate your answers. Don't just provide answers but explain them as well. **Explain and interpret your results.**

**Submission**

For this assignment...

0. Complete the pre-submission checklist at the end of the notebook.
1. Name the notebook to be your JHED ID, ie, `ssmith1.ipynb`. Do not add anything else to the name.
2. Upload to Canvas.

Do not add anything to the names like " Problem Set 2".
Any assignment not following the submission guidelines will generally be assumed to be incomplete under the Syllabus and therefore a "C".


<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Note</strong>
    <p>This Problem Set covers Lab 2 and the Titanic Case Study (and the corresponding course materials).</p>
</div>


**Do not delete or otherwise change the formatting, order, text of the questions**



## Question 1 - Bayes Rule

### Concepts

**Part 1**

$$P(H|D) = \frac{P(D|H)P(H)}{P(D)}$$

If we let H be some condition, characteristic, hypothesis and D be some data, evidence, a test result), then how do we interpret each of the following in Bayes Rule: $P(H)$, $P(D)$, $P(H|D)$, $P(D|H)$, $P(H, D)$?


`P(H)` - The prior distribution, or prior probability of the hypothesis. How likely the hypothesis is before we receive new evidence. <br><br>

`P(D)` - The probability of the data, also called the normalizer. Often an unknown value, and we can use the law of Total Probability in order to find it. <br><br>

`P(H|D)` - The posterior distribution, or posterior probability of the hypothesis. How likely the hypothesis is after we receive new evidence. <br><br>

`P(D|H)` - The probability of the data, given the hypothesis. Also called the likelihood. <br><br>

`P(H,D)` - The joint probability of the hypothesis and the data. It does not appear by name in Bayes Rule, but it equals the numerator, since `P(H,D) = P(D|H)P(H)`.
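
To make the relationship between these five quantities concrete, here is a minimal sketch in plain Python. The prior and likelihood values are made-up assumptions for illustration, not numbers from this problem set. It builds the joint from the prior and likelihood, sums it to get the normalizer, and divides to get the posterior.

```python
# Illustrative values only (assumptions, not from the problem set).
prior = {"h": 0.3, "~h": 0.7}          # P(H)
likelihood = {"h": 0.8, "~h": 0.1}     # P(D = d | H)

# Joint: P(H, D = d) = P(D = d | H) * P(H), the numerator of Bayes Rule.
joint = {h: likelihood[h] * prior[h] for h in prior}

# Normalizer: P(D = d) is the joint summed over every value of H.
p_d = sum(joint.values())

# Posterior: P(H | D = d) = P(H, D = d) / P(D = d).
posterior = {h: joint[h] / p_d for h in joint}

print(joint, p_d, posterior)
```

The posterior values sum to 1 because dividing by `p_d` renormalizes the joint over the hypotheses.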



### Application

Use the code provided in *Fundamentals* to do your Naive Bayes calculations.
Focus on calculating the required probabilities to fill out the data structure and on discussion.
Read the directions *carefully*.

**Part 2.** 

There is a genetic condition that affects 2.3% of the population, C = {c, ~c}. If someone has the condition, the test (T) can detect that fact accurately 72.1% of the time (true positive rate). However, the test (T) also returns a positive response in 20.3% of the cases where the someone does not have the condition (false positive rate), T={pos, neg}.

**DO NOT CHANGE THE NOTATION**

1. If we want to know `P(C|T)`, write out Bayes Rule for this problem:

$? = \frac{? * ?}{?}$

Bayes Rule for this problem looks like: <br>
$$P(C|T) = \frac{P(T|C)*P(C)}{P(T)}$$
or
$$P(Condition|Test) = \frac{P(Test|Condition)*P(Condition)}{P(Test)}$$

2. Using the values supplied in the description of the problem, specify all priors and likelihoods for the version of Bayes Rule you just wrote out. Calculating the missing values.

The values we have from the problem are: <br>
`P(C = c) = 0.023` <br>
`P(T = pos|C = c) = 0.721` <br>
`P(T = pos|C = ~c) = 0.203`


We are missing the prior probability `P(C = ~c)`, and the likelihoods `P(T = neg | C = c)`, <br> `P(T = neg | C = ~c)`

<strong>Calculations</strong> <br> We can easily calculate `P(C = ~c)` by the second axiom of probability. Since there are only two events for `P(C)`, we know that `P(C = c) + P(C = ~c) = 1`, and find that <br><br> `P(C = ~c) = 1 - 0.023 = 0.977`

Similarly, we can say that <br><br>
`P(T = neg | C = c) = 1 - P(T = pos | C = c) = 1 - 0.721 = 0.279` <br>
`P(T = neg | C = ~c) = 1 - P(T = pos | C = ~c) = 1 - 0.203 = 0.797`

3. Calculate `P(C|T)`.

There are 4 posterior probabilities we need to calculate. Let's start with <br>
`P(C = c | T = pos)`. We know that <br>

$$P(C = c|T = pos) = \frac{P(T = pos|C = c)*P(C = c)}{P(T = pos)}$$


First, we need to find $P(T = pos)$, which we can do by the law of total probability: <br>
$$P(T = pos) = \sum_{C} P(T = pos | C) * P(C) $$
$$ = P(T = pos | C = c) * P(C = c) + P(T = pos | C =  \sim c)* P(C =  \sim c) $$
$$ = (0.721 * 0.023) + (0.203 * 0.977) = 0.215 $$

Therefore, <br>
$$P(C = c|T = pos)  = \frac{(0.721 * 0.023)}{0.215}  = 0.077$$

Similarly, with $P(T = neg) = 1 - P(T = pos) = 0.785$, and carrying $P(T = pos) = 0.2149$ to four decimals so that the rounded posteriors for each test result sum to 1, <br>
$$P(C = \sim c|T = pos)  = \frac{(0.203 * 0.977)}{0.2149} = 0.923$$ 
$$P(C = c|T = neg)  = \frac{(0.279 * 0.023)}{0.785} = 0.008$$ 
$$P(C = \sim c|T = neg)  = \frac{(0.797 * 0.977)}{0.785} = 0.992$$ 

Note - We could have used the `normalize` and `query` helper functions from *Fundamentals*, Chapter 2, pages 132-133, to code these solutions instead of writing them out manually. I felt more comfortable doing the manual calculations, but recognize that coding the solution would be much faster for a larger probability distribution.

In [2]:
from tabulate import tabulate

Below we can see the same posterior probabilities in a table and compare to the prior.

In [3]:
table = [
    ['c', 'pos', 0.077, 0.023],
    ['~c', 'pos', 0.923, 0.977],
    ['c', 'neg', 0.008, 0.023],
    ['~c', 'neg', 0.992, 0.977]
]

print(tabulate(table, headers=['C', 'T', 'P(C|T)', 'P(C)']))

C    T      P(C|T)    P(C)
---  ---  --------  ------
c    pos     0.077   0.023
~c   pos     0.923   0.977
c    neg     0.008   0.023
~c   neg     0.992   0.977


Here we can see the posterior probabilities as well as the priors, for reference.
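
The hand calculation can be cross-checked with a few lines of plain Python. This is a minimal sketch, not the `normalize` and `query` helpers from *Fundamentals*; the function name `posteriors` is hypothetical. It keeps full precision until the final formatting step.

```python
def posteriors(prior_c, tpr, fpr):
    """Return P(C | T) for every (C, T) pair of a binary condition and test.

    prior_c: P(C = c); tpr: P(T = pos | C = c); fpr: P(T = pos | C = ~c).
    """
    prior = {"c": prior_c, "~c": 1 - prior_c}
    likelihood = {
        "pos": {"c": tpr, "~c": fpr},
        "neg": {"c": 1 - tpr, "~c": 1 - fpr},
    }
    result = {}
    for t, given_c in likelihood.items():
        # Law of total probability: P(T = t) = sum over C of P(T = t | C) P(C).
        p_t = sum(given_c[c] * prior[c] for c in prior)
        for c in prior:
            result[(c, t)] = given_c[c] * prior[c] / p_t
    return result


part2 = posteriors(prior_c=0.023, tpr=0.721, fpr=0.203)
for (c, t), p in part2.items():
    print(f"P(C = {c} | T = {t}) = {p:.3f}")
```

Rounding early matters in the third decimal: dividing by the rounded 0.215 instead of the unrounded 0.214914 shifts `P(C = ~c | T = pos)` from 0.923 to 0.922.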

4. Please explain each individual probability in words and the difference between the prior and the posterior and the reason for the difference.

<Strong>Conclusion</Strong><br>

`P(C = c | T = pos)` - The probability a patient has the condition, given they test positive for it: 0.077, larger than the prior of 0.023. This is not the true positive rate, which is `P(T = pos | C = c)`. A positive result makes the condition more than three times as likely as before, but it remains unlikely because the condition is rare and false positives are common. <br>

`P(C = ~c | T = pos)` - The probability a patient does not have the condition, given they test positive for it: about 0.92, smaller than the prior of 0.977. This is not the false positive rate, which is `P(T = pos | C = ~c)`. A positive result is evidence for the condition, so the probability of not having it goes down. <br>

`P(C = c | T = neg)` - The probability a patient has the condition, given they test negative for it: 0.008, smaller than the prior of 0.023. A negative result is evidence against the condition, but the test misses 27.9% of true cases, so the probability does not drop to zero. <br>

`P(C = ~c | T = neg)` - The probability a patient does not have the condition, given they test negative for it: 0.992, larger than the prior of 0.977. A negative result makes it even more likely that the patient does not have the condition.

**Part 3.** 

Re-calculate #2-4 (including discussion) but assume a new test has a true positive rate is 97.3% but the false positive rate is 37.2%.


2. Using the values supplied in the description of the problem, specify all priors and likelihoods for the version of Bayes Rule you just wrote out. Calculating the missing values.

The new values we have from the problem are: <br>
`P(C = c) = 0.023` <br>
`P(T = pos|C = c) = 0.973` <br>
`P(T = pos|C = ~c) = 0.372`

Again, we are missing the prior probability `P(C = ~c)`, and the likelihoods `P(T = neg | C = c)`, <br> `P(T = neg | C = ~c)`

The prior probability `P(C = ~c) = 0.977` does not change from earlier.

We can calculate <br><br>
`P(T = neg | C = c) = 1 - P(T = pos | C = c) = 1 - 0.973 = 0.027` <br>
`P(T = neg | C = ~c) = 1 - P(T = pos | C = ~c) = 1 - 0.372 = 0.628`

3. Calculate `P(C|T)`.

We find again by the law of total probability that <br>
`P(T = pos) = 0.386` <br>
`P(T = neg) = 0.614`

Again, there are 4 posterior probabilities we need to calculate. <br>

$$P(C = c|T = pos)  = \frac{(0.973 * 0.023)}{0.386} = 0.058$$ 
$$P(C = \sim c|T = pos)  = \frac{(0.372 * 0.977)}{0.386} = 0.942$$ 
$$P(C = c|T = neg)  = \frac{(0.027 * 0.023)}{0.614} = 0.001$$ 
$$P(C = \sim c|T = neg)  = \frac{(0.628 * 0.977)}{0.614} = 0.999$$ 

In [4]:
table = [
    ['c', 'pos', 0.058, 0.023],
    ['~c', 'pos', 0.942, 0.977],
    ['c', 'neg', 0.001, 0.023],
    ['~c', 'neg', 0.999, 0.977]
]

print(tabulate(table, headers=['C', 'T', 'P(C|T)', 'P(C)']))

C    T      P(C|T)    P(C)
---  ---  --------  ------
c    pos     0.058   0.023
~c   pos     0.942   0.977
c    neg     0.001   0.023
~c   neg     0.999   0.977


<Strong>Conclusion</Strong><br>

We see similar results to the first part, but for `T=pos` the posterior moved less from the prior (0.058 here versus 0.077 before), while for `T=neg` it moved more (0.001 here versus 0.008 before). This makes sense because both the true positive and false positive rates are much higher in this scenario. The higher false positive rate makes a positive result weaker evidence for the condition, and the higher true positive rate makes a negative result stronger evidence against it.
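
The comparison between the two tests can also be tabulated directly. The sketch below uses only the rates given in Parts 2 and 3 and prints $P(C = c | T)$ for both tests next to the prior. The helper name `p_condition_given` is an assumption for illustration.

```python
PRIOR_C = 0.023  # P(C = c), the same for both tests

# (true positive rate, false positive rate) as given in Parts 2 and 3.
tests = {"Part 2 test": (0.721, 0.203), "Part 3 test": (0.973, 0.372)}


def p_condition_given(result, tpr, fpr, prior_c=PRIOR_C):
    """P(C = c | T = result) for result in {'pos', 'neg'}."""
    like_c = tpr if result == "pos" else 1 - tpr        # P(T = result | c)
    like_not_c = fpr if result == "pos" else 1 - fpr    # P(T = result | ~c)
    evidence = like_c * prior_c + like_not_c * (1 - prior_c)
    return like_c * prior_c / evidence


for name, (tpr, fpr) in tests.items():
    pos = p_condition_given("pos", tpr, fpr)
    neg = p_condition_given("neg", tpr, fpr)
    print(f"{name}: prior={PRIOR_C:.3f}  P(c|pos)={pos:.3f}  P(c|neg)={neg:.3f}")
```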

## Question 2 - 1994 Adult Census Data - Empirical Probability

For this question, you will answer queries in conditional probability as was done with the Titanic Case Study.
Because we have not had Exploratory Data Analysis yet, we will not ask you to explore the data "in general".
Instead, for each calculation, you should examine the variables involved individually, looking to see if they have missing values, what their values look like (most of the data is categorical), and to see if any transformations are in order.
This means that the variable stated in the question $P(X|Y)$ may not need to be the *literal* variable $X$ or $Y$ but can instead be a transformation of that variable (with an informative name).
State your reasons for any such transformation.

You may use code from the Titanic Case Study as you see fit. Add any other imports you need.

To reiterate, for each calculation:

1. Look at the values and counts/percentages for each variable (unless it's been done before). That is, look at the marginal probabilities.
2. Before each calculation, write down a hypothesis about what you expect to see. For example, if you are about to calculate a conditional probability distribution, what do you expect the distribution to look like?
3. Make the calculation.
4. Discuss what the results show, relative to your hypothesis.

In [5]:
import warnings

In [6]:
warnings.filterwarnings('ignore')

In [7]:
import pandas as pd
from pandasql import sqldf
import seaborn as sns
import matplotlib.pyplot as plt

Load the data:

In [8]:
income = pd.read_csv("https://raw.githubusercontent.com/fundamentals-of-data-science/datasets/master/income.csv")

`info()` will give us a quick overview of the data (we'll talk about this more in Module 4, ETL)

In [9]:
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1   class            32561 non-null  object
 2   fnlwght          32561 non-null  int64 
 3   education        32561 non-null  object
 4   education_years  32561 non-null  int64 
 5   marital_status   32561 non-null  object
 6   occupation       32561 non-null  object
 7   relationship     32561 non-null  object
 8   race             32561 non-null  object
 9   sex              32561 non-null  object
 10  capital_gain     32561 non-null  int64 
 11  capital_loss     32561 non-null  int64 
 12  hours_per_week   32561 non-null  int64 
 13  native_country   32561 non-null  object
 14  agi              32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Looking at the first rows, we can get a general idea of the data:

In [10]:
income.head()

Unnamed: 0,age,class,fnlwght,education,education_years,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,agi
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


## Conditional Probabilities

You *will* need more cells for your answer. Add them as you need them. There are already the 3 default cells there for you. You will need more. Always make those three and fill them in: 

* discuss
* code
* discuss

1. $P(agi|age)$

To make the conditional probabilities easier to work with, I will copy the `summarize_category` function from the Titanic Case Study in *Fundamentals*, Chapter 4, page 158.

In [11]:
from pandas.core.algorithms import value_counts
from pandas.core.reshape.concat import concat

In [12]:
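def summarize_category(series):
    """Return counts and relative frequencies of each value, sorted by value."""
    # Series.value_counts and pd.concat are the public equivalents of the
    # internal pandas helpers imported above.
    res_regu = series.value_counts()
    res_norm = series.value_counts(normalize=True)
    result = pd.concat([res_regu, res_norm], axis=1, keys=['Count', 'Frequency'])
    result = result.sort_index()
    return result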

First let's get a good idea of what the `Agi` and `Age` categories look like by summarizing them. First up is `Agi`.

In [13]:
summarize_category(income['agi'])

Unnamed: 0_level_0,Count,Frequency
agi,Unnamed: 1_level_1,Unnamed: 2_level_1
<=50K,24720,0.75919
>50K,7841,0.24081


We can see that the probability a person's adjusted gross income is less than or equal to 50K is about 76%. Similarly, the probability a person's agi is greater than 50K is about 24%. Note that there are only 2 categories here, even though we have a `<=` and `>` in the descriptors. In other words <br>
`P(agi = <=50K ) = 0.759` <br>
`P(agi = >50K) = 0.241` <br>
I will continue to use decimal notation rather than percents, since we tend to use decimals in calculations and not actual percentages.

Now let's look at `Age`.

In [14]:
summarize_category(income['age'])

Unnamed: 0_level_0,Count,Frequency
age,Unnamed: 1_level_1,Unnamed: 2_level_1
17,395,0.012131
18,550,0.016891
19,712,0.021867
20,753,0.023126
21,720,0.022112
...,...,...
85,3,0.000092
86,1,0.000031
87,1,0.000031
88,3,0.000092


We see there are a lot more categories here, since age is recorded in single years. It would probably help to group the ages by decade, similar to the Titanic Case Study, to make the data easier to work with. We note the minimum age is 17 and the max is 90, so there is no 0-10 bin as in the Titanic Case Study.

In [15]:
income['decade'] = (income['age'] // 10) * 10
summarize_category(income['decade'])

Unnamed: 0_level_0,Count,Frequency
decade,Unnamed: 1_level_1,Unnamed: 2_level_1
10,1657,0.050889
20,8054,0.247351
30,8613,0.264519
40,7175,0.220356
50,4418,0.135684
60,2015,0.061884
70,508,0.015601
80,78,0.002396
90,43,0.001321


Now we can see the probability that a person is in their 30s, `P(decade = 30) = 0.265`, rather than distinguishing between, say, 33 and 36. For the oldest decades the frequency drops sharply, since there is much less data, likely due to people retiring. Binning the ages is helpful here for that reason.
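
The decade binning used above is plain integer arithmetic. A minimal sketch on made-up ages (the sample values are assumptions, not rows from the income data):

```python
import pandas as pd

# Hypothetical ages chosen to cover several decades.
ages = pd.Series([17, 19, 23, 38, 39, 50, 53, 90], name="age")

# Floor division drops the ones digit; multiplying by 10 restores the scale,
# so 17 -> 10, 38 -> 30, 90 -> 90.
decade = (ages // 10) * 10

# Relative frequency of each bin, i.e. the empirical P(decade).
freq = decade.value_counts(normalize=True).sort_index()
print(freq)
```

Each bin is labeled by its lower edge, so `10` covers ages 10 through 19.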

Now we can finally look at the conditional probability $P(agi|decade)$. I hypothesize that as age increases, agi will also tend to increase, up to a certain decade (maybe 70, based on common retiring age). We will use the `crosstab` method in Pandas to calculate the conditional probability distribution.

In [16]:
pd.crosstab(income['decade'], income['agi'], normalize='index')

agi,<=50K,>50K
decade,Unnamed: 1_level_1,Unnamed: 2_level_1
10,0.998793,0.001207
20,0.936802,0.063198
30,0.731917,0.268083
40,0.62899,0.37101
50,0.613626,0.386374
60,0.732506,0.267494
70,0.809055,0.190945
80,0.897436,0.102564
90,0.813953,0.186047


The probability of making over 50K rises steadily through the 50s, peaking at 0.386, and then falls from the 60s onward (the uptick in the 90s rests on only 43 people). The probability of making 50K or less mirrors this. So the hypothesis was correct in shape: as age increases, a higher adjusted gross income becomes more likely, up to a point. That point comes earlier than I guessed, in the 50s rather than the 70s, with the decline starting in the 60s, when people are nearing retirement age.
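
The `normalize='index'` argument is what turns the `crosstab` counts into a conditional distribution. A minimal sketch on made-up rows (the toy data is an assumption; the column names mirror the income DataFrame):

```python
import pandas as pd

# Hypothetical toy data: six people, two decades, two income brackets.
toy = pd.DataFrame({
    "decade": [20, 20, 20, 20, 40, 40],
    "agi": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# normalize='index' divides each row by its row total, so every row is a
# distribution over agi conditioned on that row's decade.
cond = pd.crosstab(toy["decade"], toy["agi"], normalize="index")
print(cond)
```

Using `normalize='columns'` instead would divide by column totals and answer a different question, $P(decade|agi)$.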

2. $P(agi|occupation)$

Similar to problem 1, we can look at the conditional probability `P(agi | occupation)`. We already took a look at the `Agi` category, so let's look at `Occupation`.

In [17]:
summarize_category(income['occupation'])

Unnamed: 0_level_0,Count,Frequency
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1
?,1843,0.056601
Adm-clerical,3770,0.115783
Armed-Forces,9,0.000276
Craft-repair,4099,0.125887
Exec-managerial,4066,0.124873
Farming-fishing,994,0.030527
Handlers-cleaners,1370,0.042075
Machine-op-inspct,2002,0.061485
Other-service,3295,0.101195
Priv-house-serv,149,0.004576


The categories here are broad, and for now we assume that `?` means unknown. I will use this variable as is, without binning it as we did with age, although grouping similar occupations might make the data easier to work with. I hypothesize that `P(agi = >50K | occupation)` will be highest for `occupation = tech-support` and `occupation = sales`, and that `P(agi = <=50K | occupation)` will be highest for `occupation = farming-fishing` and `occupation = transport-moving`.

In [18]:
pd.crosstab(income['occupation'], income['agi'], normalize='index')

agi,<=50K,>50K
occupation,Unnamed: 1_level_1,Unnamed: 2_level_1
?,0.896365,0.103635
Adm-clerical,0.865517,0.134483
Armed-Forces,0.888889,0.111111
Craft-repair,0.773359,0.226641
Exec-managerial,0.515986,0.484014
Farming-fishing,0.884306,0.115694
Handlers-cleaners,0.937226,0.062774
Machine-op-inspct,0.875125,0.124875
Other-service,0.958422,0.041578
Priv-house-serv,0.993289,0.006711


It looks like the highest probability of a high income lies with `exec-managerial`, at about 0.484, and `prof-specialty`, at 0.449. The occupations I hypothesized, `tech-support` and `sales`, are not too far behind those categories. On the lower end of income, we see `priv-house-serv` and `other-service`. Again, the occupations I hypothesized for the low end, `farming-fishing` and `transport-moving`, are not too far behind, though there is a distinction to be made.

3. $P(agi|sex)$

Next we can look at how `sex` affects `agi`. Let's start by summarizing `sex`.

In [19]:
summarize_category(income['sex'])

Unnamed: 0_level_0,Count,Frequency
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,10771,0.330795
Male,21790,0.669205


We see a higher probability that a person in this data is male, at 0.669: roughly 2/3 male and 1/3 female. I hypothesize that `P(agi = >50K | sex = male)` is higher than `P(agi = >50K | sex = female)`. Let's find out.

In [20]:
pd.crosstab(income['sex'], income['agi'], normalize='index')

agi,<=50K,>50K
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,0.890539,0.109461
Male,0.694263,0.305737


It looks like my hypothesis was correct. Given `sex = male`, the probability that `agi = >50K` is 0.306, while given `sex = female` it is only 0.109. That is a sizeable difference. One could argue that men typically work jobs with higher income, although I don't know if that is necessarily the case. It would be interesting to look at `P(occupation | sex)`, but that is an exercise left to the reader.

4. $P(agi|sex, occupation)$

Now we want `agi` given two variables at once, so the single-column `crosstab` call used so far is not enough. I will copy the `conditional_probability` function from *Fundamentals*, Chapter 4, page 187.

In [21]:
def conditional_probability(df, target, givens, cell="index"):
    """
    calculates a simple conditional probability (only one target variable) based off of:
    https://stackoverflow.com/questions/54040923/change-order-of-pandas-multiindex
    
    P(target|givens...)
    
    df: the DataFrame to use for the calculation
    target: the string name of the target variable
    givens: a string or List of strings that represent the "givens"
    cell: a column that is neither target nor givens to "count". Should be a column without NA.
    
    The default assumes you have added a column: df["index"] = df.index to your DataFrame.
    """
    if isinstance(givens, str):
        givens = [givens]
    print(f"P({target}|{', '.join(givens)})")
    columns = [target] + givens
    # handling multiple targets would require a more sophisticated join.
    result = (df.groupby(columns).count() / df.groupby(givens).count())[cell]
    # this makes sure the target is always the column
    result = result.reorder_levels(givens + [target]).sort_index()
    # this flattens the hierarchical index and should fill in missing values.
    result = result.unstack(fill_value=0.0)
    return pd.DataFrame(result)

We have already looked at `agi`, `sex`, and `occupation` separately, as well as `P(agi|sex)` and `P(agi|occupation)`. What we found was that `agi = >50K` is more probable given `sex = male`, and more probable given `occupation` is `exec-managerial` or `prof-specialty`.

I hypothesize that we will see a high probability for <br>
$$P(agi = >50K | sex = male, occupation = exec-managerial)$$
and a high probability for
$$P(agi = <=50K | sex = female, occupation = priv-house-serv)$$

In [22]:
conditional_probability(income, 'agi', ['sex', 'occupation'], 'class')


P(agi|sex, occupation)


Unnamed: 0_level_0,agi,<=50K,>50K
sex,occupation,Unnamed: 2_level_1,Unnamed: 3_level_1
Female,?,0.938169,0.061831
Female,Adm-clerical,0.916437,0.083563
Female,Craft-repair,0.90991,0.09009
Female,Exec-managerial,0.758412,0.241588
Female,Farming-fishing,0.969231,0.030769
Female,Handlers-cleaners,0.97561,0.02439
Female,Machine-op-inspct,0.963636,0.036364
Female,Other-service,0.971667,0.028333
Female,Priv-house-serv,0.992908,0.007092
Female,Prof-specialty,0.745875,0.254125


It looks like the hypothesis was correct. <br>
$$P(agi = >50K | sex = male, occupation = exec-managerial) = 0.581$$
while
$$P(agi = <=50K | sex = female, occupation = priv-house-serv) = 0.993$$
In general, there's a good chance that a female makes 50K or less, regardless of occupation. While that pattern mostly holds for males as well, there is an exception for `exec-managerial` and `prof-specialty`, as expected. We do note that given a female's occupation is `exec-managerial` or `prof-specialty`, there is a higher probability her `agi = >50K`, relative to the other occupations in this table. It seems that `occupation` has a greater effect on `agi` than `sex`, although `sex` is still a contributing factor.
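
As a cross-check on a two-given table like this one, `pd.crosstab` also accepts a list of Series for its row index. The sketch below runs on made-up rows (the toy data is an assumption; the column names mirror the income DataFrame) and is an alternative route, not the method from *Fundamentals*.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the income DataFrame.
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male", "Male", "Male"],
    "occupation": ["Sales", "Sales", "Sales", "Sales",
                   "Tech-support", "Tech-support"],
    "agi": ["<=50K", ">50K", ">50K", ">50K", "<=50K", ">50K"],
})

# Passing a list as the row index groups by both givens at once;
# normalize='index' then makes each (sex, occupation) row sum to 1.
two_givens = pd.crosstab(
    [toy["sex"], toy["occupation"]], toy["agi"], normalize="index"
)
print(two_givens)
```

Combinations that never occur in the data, such as a female in `Tech-support` here, do not get a row.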

5. $P(occupation|race)$

Finally we can look at how race affects occupation. Let's start with summarizing the `race` category.

In [23]:
summarize_category(income['race'])

Unnamed: 0_level_0,Count,Frequency
race,Unnamed: 1_level_1,Unnamed: 2_level_1
Amer-Indian-Eskimo,311,0.009551
Asian-Pac-Islander,1039,0.031909
Black,3124,0.095943
Other,271,0.008323
White,27816,0.854274


We see a large difference here.
$$P(race = White) = 0.854$$
while the next largest probability is only
$$P(race = Black) = 0.096$$ 
This makes sense given the counts. We have more data for people whose `race = white` than any other `race`. If we look at `native_country`, we might see why. 

In [24]:
summarize_category(income['native_country'])

Unnamed: 0_level_0,Count,Frequency
native_country,Unnamed: 1_level_1,Unnamed: 2_level_1
?,583,0.017905
Cambodia,19,0.000584
Canada,121,0.003716
China,75,0.002303
Columbia,59,0.001812
Cuba,95,0.002918
Dominican-Republic,70,0.00215
Ecuador,28,0.00086
El-Salvador,106,0.003255
England,90,0.002764


We can see we mostly have people from the US, so the difference that we see in the `race` category starts to make sense in this context. Now we can look at `P(occupation | race)`. I hypothesize that given `race = White`, we will see a higher probability that `occupation = exec-managerial` and a lower probability that `occupation = handlers-cleaners`. And given `race = Asian-Pac-Islander`, we might expect a higher probability that `occupation = Transport-moving`. 

In [25]:
pd.crosstab(income['race'], income['occupation'], normalize='index')

occupation,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
race,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Amer-Indian-Eskimo,0.080386,0.099678,0.003215,0.141479,0.096463,0.032154,0.07074,0.061093,0.106109,0.0,0.106109,0.025723,0.083601,0.012862,0.080386
Asian-Pac-Islander,0.06256,0.133782,0.0,0.085659,0.129933,0.015399,0.022137,0.056785,0.123195,0.00385,0.179018,0.014437,0.103946,0.042348,0.026949
Black,0.068822,0.15685,0.00032,0.078105,0.078105,0.013444,0.057298,0.087708,0.182778,0.008963,0.076504,0.03265,0.081306,0.022727,0.054417
Other,0.084871,0.095941,0.0,0.103321,0.04059,0.04059,0.04428,0.143911,0.147601,0.01107,0.114391,0.01845,0.092251,0.01107,0.051661
White,0.054465,0.110871,0.000252,0.132801,0.131076,0.032895,0.040768,0.057916,0.090703,0.004098,0.131255,0.018658,0.116372,0.028976,0.048893


We actually see about the same result for <br>
$$P(occupation = exec-managerial | race = White) = 0.131$$
$$P(occupation = exec-managerial | race = Asian-Pac-Islander) = 0.130$$
The probability one's occupation is `handlers-cleaners` is highest given `race = Amer-Indian-Eskimo` (0.071), and 
$$P(occupation = Transport-moving | race = Asian-Pac-Islander) = 0.027$$
is the lowest probability for that occupation across all races.


My hypothesis was pretty far off. Overall, `race` does not seem to influence `occupation` strongly, although we do see some differences: for example, `P(occupation = Other-service | race)` ranges from 0.091 for `White` to 0.183 for `Black`.

## Naive Bayes Classifier

Train a Naive Bayes Classifier of the form $P(agi|age, sex, race, occupation, education_years)$. Make 5 predictions with of specific cases with it, after first making a hypothesis about the prediction. Import what you need. Follow the example in the Titanic Case Study (don't make predictions in a loop!).

So far we've looked at every category except `education_years`. Let's take a look now.

In [26]:
summarize_category(income['education_years'])

Unnamed: 0_level_0,Count,Frequency
education_years,Unnamed: 1_level_1,Unnamed: 2_level_1
1,51,0.001566
2,168,0.00516
3,333,0.010227
4,646,0.01984
5,514,0.015786
6,933,0.028654
7,1175,0.036086
8,433,0.013298
9,10501,0.322502
10,7291,0.223918


We see a distribution that might be expected: most people have about 9, 10, or 13 years of education, and the probability decreases as you get further from these values. We could create bins here, as we did with `age`, but I will leave that exercise to the reader.

We can also look at the empirical conditional probability using the function created above.

In [27]:
agi_ = conditional_probability(income, 'agi', ['decade', 'sex', 'race', 'occupation', 'education_years'], 'class')
agi_

P(agi|decade, sex, race, occupation, education_years)


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,agi,<=50K,>50K
decade,sex,race,occupation,education_years,Unnamed: 5_level_1,Unnamed: 6_level_1
10,Female,Amer-Indian-Eskimo,?,10,1.0,0.0
10,Female,Amer-Indian-Eskimo,Adm-clerical,6,1.0,0.0
10,Female,Amer-Indian-Eskimo,Adm-clerical,7,1.0,0.0
10,Female,Amer-Indian-Eskimo,Other-service,6,1.0,0.0
10,Female,Amer-Indian-Eskimo,Other-service,7,1.0,0.0
...,...,...,...,...,...,...
90,Male,White,Prof-specialty,13,1.0,0.0
90,Male,White,Prof-specialty,15,0.0,1.0
90,Male,White,Protective-serv,4,1.0,0.0
90,Male,White,Sales,13,0.5,0.5


Here we have a large table with the empirical conditional probabilities. We won't go into these now since we are more concerned with Naive Bayes, so we now turn to the Naive Bayes Classifier.

I will start by copying the code for the Ordinal Encoder, which converts each category value into an integer code, the input format `CategoricalNB` expects. The `notnull` filter below is what would drop rows with a missing age; based on `info()`, there are no missing values, so it removes nothing here. Note I am using the `decade` column here again, instead of `age`.

In [28]:
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

Listing income info here again just for reference, so we can see the different columns again without having to scroll up too far.

In [29]:
income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1   class            32561 non-null  object
 2   fnlwght          32561 non-null  int64 
 3   education        32561 non-null  object
 4   education_years  32561 non-null  int64 
 5   marital_status   32561 non-null  object
 6   occupation       32561 non-null  object
 7   relationship     32561 non-null  object
 8   race             32561 non-null  object
 9   sex              32561 non-null  object
 10  capital_gain     32561 non-null  int64 
 11  capital_loss     32561 non-null  int64 
 12  hours_per_week   32561 non-null  int64 
 13  native_country   32561 non-null  object
 14  agi              32561 non-null  object
 15  decade           32561 non-null  int64 
dtypes: int64(7), object(9)
memory usage: 4.0+ MB


Now we fit the encoder on the columns we are interested in and inspect its categories. I am essentially following the same process as in *Fundamentals*, Chapter 4, pages 195-196.

In [30]:
encoder = OrdinalEncoder()

with_age = income[income['age'].notnull()]

encoder.fit(with_age[['decade', 'sex', 'race', 'occupation', 'education_years']])

encoder.categories_


[array([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64),
 array(['Female', 'Male'], dtype=object),
 array(['Amer-Indian-Eskimo', 'Asian-Pac-Islander', 'Black', 'Other',
        'White'], dtype=object),
 array(['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair',
        'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners',
        'Machine-op-inspct', 'Other-service', 'Priv-house-serv',
        'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support',
        'Transport-moving'], dtype=object),
 array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16],
       dtype=int64)]
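
Conceptually, ordinal encoding replaces each value with its position in the sorted list of distinct values for its column, which is what the `categories_` output above lists. The pure-Python sketch below mimics that idea on made-up rows. It illustrates the mapping and is not scikit-learn's implementation.

```python
# Hypothetical rows with the same kinds of columns the encoder was fit on.
rows = [
    (20, "Male", "White"),
    (50, "Female", "Black"),
    (30, "Female", "Other"),
]

# One sorted list of distinct values per column, analogous to categories_.
categories = [sorted(set(column)) for column in zip(*rows)]

# Each value is replaced by its position in that column's sorted list.
encoded = [
    [categories[j].index(value) for j, value in enumerate(row)]
    for row in rows
]

print(categories)
print(encoded)
```

The integer codes carry no numeric meaning for columns like `race`; they are only labels the classifier can index by.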

Then we fit the Classifier with these categories, as in Lab 2.

In [31]:
clf = CategoricalNB()

clf.fit(encoder.transform(with_age[['decade', 'sex', 'race', 'occupation',
                                        'education_years']]), with_age['agi'])
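
To show what a categorical Naive Bayes model computes, here is a hand-rolled sketch on made-up rows. The toy data, the function name, and the add-one smoothing formula are assumptions of this sketch; scikit-learn's `CategoricalNB` applies its own additive smoothing, so its numbers need not match these.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (sex, occupation) -> agi.
X = [("Male", "Sales"), ("Male", "Sales"), ("Female", "Sales"),
     ("Female", "Tech"), ("Male", "Tech"), ("Female", "Tech")]
y = [">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"]

ALPHA = 1.0  # add-one (Laplace) smoothing, an assumption of this sketch

class_counts = Counter(y)
# feature_counts[j][label][value] = rows of class `label` with X[j] == value
feature_counts = defaultdict(lambda: defaultdict(Counter))
for row, label in zip(X, y):
    for j, value in enumerate(row):
        feature_counts[j][label][value] += 1
# Number of distinct values per feature, used in the smoothing denominator.
n_values = [len({row[j] for row in X}) for j in range(len(X[0]))]


def predict_proba(row):
    """Return P(class | row) under the naive independence assumption."""
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(y)  # class prior P(agi)
        for j, value in enumerate(row):
            # Smoothed P(feature_j = value | agi = label).
            count = feature_counts[j][label][value]
            score *= (count + ALPHA) / (n_label + ALPHA * n_values[j])
        scores[label] = score
    total = sum(scores.values())  # normalizer, as in Bayes Rule
    return {label: s / total for label, s in scores.items()}


probs = predict_proba(("Male", "Sales"))
print(probs)
```

The model multiplies one likelihood per feature rather than looking up the full joint combination, which is why it can score combinations that never appear in the training data.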

Next, creating a function to display the results of our predictions.

In [32]:
def display_prediction(probs):
    print(f'P( agi = >50K|*): {probs[0][1]}')
    print(f'P(agi = <=50K|*): {probs[0][0]}')
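
The indices in `display_prediction` rely on the order of the probability columns. scikit-learn classifiers keep their class labels in sorted order in `classes_`, and the columns of `predict_proba` follow that order. A quick check of how the two labels sort:

```python
# Class labels as they appear in the agi column.
labels = [">50K", "<=50K"]

# Sorting puts '<=50K' first because '<' precedes '>' in character order,
# which matches index 0 and index 1 in display_prediction.
ordered = sorted(labels)
print(ordered)
```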

Using a Dict to "Namespace" results, as in lab 2.

In [33]:
predictions = {}

Now we can make the 5 predictions. For the first prediction, let's look at a white man in his 20s, who works in Adm-Clerical work and has 9 years of education. I'm going to predict that his Agi is >50K.

In [34]:
predictions[1] = clf.predict_proba(encoder.transform([(20, 'Male', 'White', 
                                                        'Adm-clerical', 9)]))

In [35]:
display_prediction(predictions[1])

P( agi = >50K|*): 0.028898293582970868
P(agi = <=50K|*): 0.9711017064170292


There is a 97% chance his agi is in fact <=50K, so my hypothesis was not correct.

Moving on, let's predict that a black woman in her 50s, who works in Exec-managerial work and has 12 years of education has an agi >50K.

In [36]:
predictions[2] = clf.predict_proba(encoder.transform([(50, 'Female', 'Black', 
                                                        'Exec-managerial', 12)]))

In [37]:
display_prediction(predictions[2])

P( agi = >50K|*): 0.2512875089844287
P(agi = <=50K|*): 0.7487124910155714


It looks like there is only a 25% chance that she makes more than 50K and a 75% chance that she does not, which is surprising to me.

For my next trick... I mean, my next prediction, let's take a woman in her 30s whose race is recorded as `Other`, working in Craft-repair, with 10 years of education. I hypothesize she does not make over 50K.

In [38]:
predictions[3] = clf.predict_proba(encoder.transform([(30, 'Female', 'Other', 
                                                        'Craft-repair', 10)]))

In [39]:
display_prediction(predictions[3])

P( agi = >50K|*): 0.03117210843226513
P(agi = <=50K|*): 0.9688278915677351


About a 97% chance she does not make over 50K, a correct hypothesis.

For the fourth prediction, let's try a man in his 50s, of Asian descent, working in a Prof-specialty field, with 16 years of education. I hypothesize he makes over 50K.

In [40]:
predictions[4] = clf.predict_proba(encoder.transform([(50, 'Male', 'Asian-Pac-Islander', 
                                                        'Prof-specialty', 16)]))

In [41]:
display_prediction(predictions[4])

P( agi = >50K|*): 0.9582007976681152
P(agi = <=50K|*): 0.0417992023318841


The hypothesis looks correct. There is a 96% chance he makes over 50K.

Finally, let's predict a black man in his 60s, who works in sales, and has 10 years of education. I predict he will make more than 50K.

In [42]:
predictions[5] = clf.predict_proba(encoder.transform([(60, 'Male', 'Black', 
                                                        'Sales', 10)]))

In [43]:
display_prediction(predictions[5])

P( agi = >50K|*): 0.1629121390131851
P(agi = <=50K|*): 0.8370878609868143


The hypothesis was wrong, as there is an 84% chance he does not make over 50K.

Overall, a good variety of cases. I think `education_years` tripped me up a bit when making the predictions. It definitely would have helped to bin the values into a smaller range, maybe something like 'High School Education', 'Bachelor's', 'Master's', and 'Beyond Master's'. A good exercise I'll try out in my spare time.
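
The binning idea could be sketched with `pd.cut`. The bin edges and labels below are assumptions for illustration: the first rows of the data suggest 9 years corresponds to a high school graduate and 13 to a Bachelor's degree, but where the Master's bin begins is a guess.

```python
import pandas as pd

# Hypothetical sample of education_years values.
years = pd.Series([7, 9, 10, 13, 14, 16], name="education_years")

# Bin edges and labels are assumptions for illustration; intervals are
# right-inclusive, e.g. (0, 9] and (9, 13].
edges = [0, 9, 13, 14, 16]
labels = ["High school or less", "Up to Bachelor's", "Master's", "Beyond Master's"]

education_level = pd.cut(years, bins=edges, labels=labels)
print(education_level.tolist())
```

The binned column could then replace `education_years` in the encoder and classifier inputs.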

---

**PRE-SUBMISSION CHECK LIST**

Before you submit this assignment, go back and review the directions to ensure that you have followed each instruction.

* [ ] Have you completed every section and answered every question asked?
* [ ] For every question, have you described your approach and explained your results?
* [ ] Have you checked for spelling and grammar errors?
* [ ] Are your code blocks free of any errors?
* [ ] Have you deleted unused code or markdown blocks? Removed scratch calculations? Excessive raw data print outs?
* [ ] Hide all the code/output cells and make sure that you have sufficient discussion. Re-show the output cells but leave code cells hidden.
* [ ] Have you *SAVED* your notebook?
* [ ] Are you following the submission requirements for this particular assignment?
