# CP 218 Worksheet 2: Probability Theory

First, we import the relevant libraries (numpy, pandas, matplotlib, etc.) and make sure our plots appear inline rather than in separate windows.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
%matplotlib inline



In [2]:
import os
cwd = os.getcwd()
cwd

'/content'

## Probability Theory


### Practice Questions

1. Real estate data suggest that 57% of houses in a city have a garden, 52% have a garage, and 14% have both. What is the probability that a house in that city has

       a) a garage or a garden?  

       b) neither a garage nor a garden?

       c) a garage but no garden?  


2.  The probabilities that an adult man has high blood pressure and/or high cholesterol are shown in the table.


    | Blood pressure | Cholesterol: High | Cholesterol: OK |
    |----------------|-------------------|-----------------|
    | High           | 0.11              | 0.16            |
    | OK             | 0.21              | 0.52            |


   What's the probability that

     a) a man has high blood pressure?

     b) a man with high blood pressure has high cholesterol?  

     c) a man has high blood pressure if it's known that he has high cholesterol?
     

### Programming
We will take a computational approach to understanding probability and some of its laws. We will use data from the General Social Survey (GSS) to compute the probability of several kinds of propositions:

* What is the probability that a particular proposition is true or false?
* Conjunction: What is the probability that two propositions are both true?
* Conditional probability: What is the probability that one proposition is true, given that another is true?

In [3]:
#load the data
gss = pd.read_csv('gss_survey_data.csv', index_col=0)

In [None]:
gss.shape

In [None]:
gss.head()

In [None]:
gss.columns

The columns are

* `caseid`: Respondent id (which is the index of the table).

* `year`: Year when the respondent was surveyed.

* `age`: Respondent's age when surveyed.

* `sex`: Male or female.

* `polviews`: Political views on a range from liberal to conservative.

* `partyid`: Political party affiliation, Democrat, Independent, or Republican.

* `indus10`: [Code](https://www.census.gov/naics/) for the industry the respondent works in.

Let's look at these variables in more detail, starting with `indus10`.

In [None]:
gss.describe()

Let's compute the probability of some propositions about banking. The `indus10` code for "Banking and related activities" is 6870.

`Question 1: If we choose a random person from the dataset, what is the probability they are a banker?`

In [None]:
banker = (gss['indus10'] == 6870)
banker.head()

Let's find how many times each value appears.

In [None]:
banker.value_counts(dropna=False).sort_index()

In [None]:
# Compute the probability that a randomly chosen person is a banker.

banker.mean()
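
The mean of a Boolean series is the fraction of its values that are `True`, because `True` counts as 1 and `False` as 0. A minimal sketch on made-up values (the series `toy` is hypothetical and not part of the GSS data):

```python
import pandas as pd

# Hypothetical Boolean series: 2 of the 5 entries are True
toy = pd.Series([True, False, False, True, False])

# True counts as 1 and False as 0, so the mean is the fraction of True values
fraction_true = toy.mean()
print(fraction_true)  # 0.4
```

This is why `banker.mean()` can be read as the probability that a randomly chosen respondent is a banker.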

`Question 2: If we choose a random person from the dataset, what is the probability they are female?`

The values of the column `sex` are encoded like this:

```
1    Male
2    Female
```

In [None]:
female = (gss['sex'] == 2)  # one possible solution: 2 encodes Female
female.mean()

0.5378575776019476

`Question 3: If we choose a random person in this dataset, what is the probability they are liberal?`

The values of `polviews` are on a seven-point scale:


1	Extremely liberal

2	Liberal

3	Slightly liberal

4	Moderate

5	Slightly conservative

6	Conservative

7	Extremely conservative


We will consider `liberal` to be `True` for anyone whose response is "Extremely liberal", "Liberal", or "Slightly liberal".

In [None]:
gss['polviews'].value_counts(dropna=False).sort_index()

In [None]:
liberal = (gss['polviews'] < 4)
liberal.mean()

Let's define a function that takes a Boolean series and returns a probability:

In [None]:
def prob(A):
    """Computes the probability of a proposition, A.

    A: Boolean series

    returns: probability
    """
    assert isinstance(A, pd.Series)
    assert A.dtype == 'bool'

    return A.mean()

In [None]:
# Verify that the function behaves correctly.
prob(liberal)

`Question 4: If we choose a random person from the dataset, what is the probability they are a democrat?`

 The values of `partyid` are encoded like this:

```
0	Strong democrat
1	Not str democrat
2	Ind,near dem
3	Independent
4	Ind,near rep
5	Not str republican
6	Strong republican
7	Other party
```

We will consider `democrat` to include respondents who chose "Strong democrat" or "Not str democrat":

In [None]:
# compute probability
# One possible solution: codes 0 and 1 are the two Democrat categories
democrat = (gss['partyid'] <= 1)
prob(democrat)


0.3662609048488537

### Conjunction

Now that we have defined a function to compute probability, let's move on to conjunction.

"Conjunction" is another name for the logical `and` operation.  If you have two propositions, `A` and `B`, the conjunction `A and B` is `True` if both `A` and `B` are `True`, and `False` otherwise.

 Use `prob` and the `&` operator to compute the following probabilities.


* Q5: What is the probability that a random respondent is a banker and liberal?

* Q6: What is the probability that a random respondent is female, a banker, and liberal?

* Q7: What is the probability that a random respondent is female, a banker, and a liberal Democrat?
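
Before working on the GSS data, it may help to see how `&` behaves on a small example. The sketch below uses two hypothetical Boolean series; the names `A`, `B`, and `both` are made up for illustration.

```python
import pandas as pd

# Hypothetical propositions about four respondents
A = pd.Series([True, True, False, False])
B = pd.Series([True, False, True, False])

# `&` works element-wise: the result is True only where both A and B are True
both = A & B
print(both.tolist())  # [True, False, False, False]

# The fraction of True values is the probability of the conjunction
print(both.mean())    # 0.25
```

Note that the Python keyword `and` does not work element-wise on a pandas Series, which is why the `&` operator is used here.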


In [None]:
 #Solution for Q5 goes here

In [None]:
 #Solution for Q6 goes here

In [None]:
 #Solution for Q7 goes here

`Q8: Is conjunction commutative?`

In [None]:
# Solution for Q8 goes here

### Conditional probability

* Q9: What is the probability that a respondent is a Democrat, given that they are liberal?

* Q10: What is the probability that a respondent is female, given that they are a banker?

* Q11: What is the probability that a respondent is liberal, given that they are female?


In [None]:
selected = democrat[liberal]
prob(selected)

Write a function named `conditional` that takes two Boolean series, `A` and `B`, and computes the conditional probability of `A` given `B`:


In [None]:
#solution goes here
def conditional(A, B):
    """Conditional probability of A given B.

    A: Boolean series
    B: Boolean series

    returns: probability
    """
    return prob(A[B])

In [None]:
conditional(democrat, liberal)
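
To see what the indexing expression `A[B]` inside `conditional` does, here is a minimal sketch on made-up data. The two series are hypothetical and unrelated to the GSS.

```python
import pandas as pd

# Hypothetical propositions about five respondents
A = pd.Series([True, False, True, True, False])
B = pd.Series([True, True, False, True, False])

# Boolean indexing keeps only the entries of A where B is True
A_given_B = A[B]
print(A_given_B.tolist())  # [True, False, True]

# The fraction of True values among those entries is P(A|B), here 2/3
print(A_given_B.mean())
```

Conditioning on `B` therefore means restricting attention to the respondents for whom `B` holds, then computing an ordinary probability within that subset.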

In [None]:
  #solution for Q10 goes here

In [None]:
  #solution for Q11 goes here

`Q 12: Is conditional probability commutative?`

In [None]:
#solution for Q12 goes here

`Q13: Compute the probability a respondent is female, given that they are a liberal Democrat.`

In [None]:
#solution for Q13 goes here
conditional(female, liberal & democrat)

`Q 14: What fraction of female bankers are liberal Democrats?`

In [None]:
#solution for Q14 goes here
conditional(liberal & democrat, female & banker)

### More propositions

Now, we'll derive three relationships between conjunction and conditional probability:

* Theorem 1: Using conjunction to compute a conditional probability,
$P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)}$


* Theorem 2: Using a conditional probability to compute a conjunction,
$P(A~\mathrm{and}~B) = P(B) P(A|B)$


* Theorem 3: Using `conditional(B, A)` to compute `conditional(A, B)` (Bayes's Theorem),
$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$
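
Theorem 3 follows from Theorems 1 and 2. A sketch of the derivation, assuming that conjunction is commutative, so that $P(A~\mathrm{and}~B) = P(B~\mathrm{and}~A)$:

$$
P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)} = \frac{P(B~\mathrm{and}~A)}{P(B)} = \frac{P(A)\,P(B|A)}{P(B)}
$$

The first step is Theorem 1, the second uses commutativity, and the third applies Theorem 2 with the roles of $A$ and $B$ swapped.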


We will validate these relationships with some examples.




    
` Q15: Verify Theorem 1 by checking what fraction of builders are male`

The industry code (`indus10` ) for "Construction" is 770.

In [None]:
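#solution for Q15
male = (gss['sex'] == 1)           # 1 encodes Male
builder = (gss['indus10'] == 770)  # 770 is the Construction code

# Print both values; a bare expression is only displayed if it is last in the cell
print(prob(male))
print(prob(builder))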


In [None]:
print(conditional(male, builder))
print(prob(male & builder) / prob(builder))

 `Q16: Verify Theorem 2 by checking the fraction of respondents who are conservative republican`

Consider "Strong republican" and "Not str republican" as Republican, and consider "Slightly conservative", "Conservative", and "Extremely conservative" as conservative.

Hint: The `isin` function checks whether values are in a given sequence
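
As a small illustration of the hint, here is a sketch on hypothetical codes (the series `codes` is made up and not taken from the GSS data):

```python
import pandas as pd

# Hypothetical party codes for five respondents
codes = pd.Series([0, 5, 3, 6, 1])

# isin returns a Boolean series: True where the value is one of the listed codes
in_listed = codes.isin([5, 6])
print(in_listed.tolist())  # [False, True, False, True, False]
```

The result is a Boolean series, so it can be passed straight to `prob` or combined with other propositions using `&`.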

In [None]:
#Solution for Q16 goes here

conservative = (gss['polviews'] > 4)      # codes 5, 6, and 7
republican = gss['partyid'].isin([5, 6])  # codes 5 and 6

# The two printed values should agree (Theorem 2)
print(prob(conservative & republican))
print(prob(republican) * conditional(conservative, republican))

 `Q17: Verify Theorem 3 by computing the fraction of builders who are liberal`

In [None]:
#Solution for Q17 goes here


### Joint, Marginal, and Conditional Probability

Now we will take a step toward multivariate distributions, starting with two variables. We will use a contingency table (cross tabulation) to compute a joint distribution, then use the joint distribution to compute conditional and marginal distributions.

Let's generate a dataset of colors and fruits.

In [None]:
colors = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'grape']

Now let's take a random sample of 100 fruits.

In [None]:
np.random.seed(2)
fruit_sample = np.random.choice(fruits, 100, replace=True)
fruit_sample

Similarly, let's take a random sample of 100 colors.

In [None]:
color_sample = np.random.choice(colors, 100, replace=True)
color_sample

Can we see the distribution (probability mass function) of fruits and of colors?

In [None]:
def pmf_from_seq(seq):
    """Make a PMF from a sequence of values.

    seq: sequence

    returns: Series representing a PMF
    """
    pmf = pd.Series(seq).value_counts(sort=False).sort_index()
    pmf /= pmf.sum()
    return pmf

In [None]:
pmf_fruit = pmf_from_seq(fruit_sample)
pmf_fruit.plot.bar(color='C0')

plt.ylabel('Probability')
plt.title('Distribution of fruit');

In [None]:
pmf_color = pmf_from_seq(color_sample)

pmf_color.plot.bar(color='C1')

plt.ylabel('Probability')
plt.title('Distribution of colors');

Looking at these two probability mass functions, we know the distributions of fruits, ignoring color, and we know the proportion of each color, ignoring fruit type. But if we only have the distributions and not the original data, we don't know how many apples are green, for example, or how many yellow fruits are bananas.
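
To illustrate the point with made-up numbers, the two hypothetical joint PMFs below share the same marginal distributions but differ cell by cell, so the marginals alone cannot tell them apart. The tables `joint_a` and `joint_b` are invented for this sketch and are unrelated to the samples above.

```python
import pandas as pd

# Two hypothetical joint PMFs over color (rows) and fruit (columns)
joint_a = pd.DataFrame([[0.5, 0.0], [0.0, 0.5]],
                       index=['red', 'yellow'], columns=['apple', 'banana'])
joint_b = pd.DataFrame([[0.25, 0.25], [0.25, 0.25]],
                       index=['red', 'yellow'], columns=['apple', 'banana'])

# Both tables have the same marginal distributions...
print(joint_a.sum(axis=0).tolist(), joint_b.sum(axis=0).tolist())  # fruit
print(joint_a.sum(axis=1).tolist(), joint_b.sum(axis=1).tolist())  # color

# ...but they disagree on individual cells, e.g. P(red and apple)
print(joint_a.loc['red', 'apple'], joint_b.loc['red', 'apple'])  # 0.5 0.25
```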

We can compute that information in the form of a contingency table using `crosstab`, which counts the number of cases for each combination of fruit type and color.

In [None]:
contingency_table = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'])
contingency_table

A contingency table (or cross tabulation) represents the "joint distribution" of two variables.

If we normalize `contingency_table` so the sum of the elements is 1, the result is a joint PMF:

In [None]:
joint = contingency_table / contingency_table.to_numpy().sum()
joint

In this joint PMF table, each column is proportional to the conditional distribution of color for a given fruit. For example, here is the column for apples:

In [None]:
col = joint['apple']
col

If we normalize the column so that it sums to 1, we get the conditional distribution of color given that the fruit is an apple.

In [None]:
col / col.sum()

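Let's write a function to compute a conditional distribution. Note that it reuses the name `conditional`, so running this cell replaces the two-series `conditional(A, B)` defined earlier.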

In [None]:
def conditional(joint, name, value):
    """Compute a conditional distribution.

    joint: DataFrame representing a joint PMF
    name: string name of an axis
    value: value to condition on

    returns: Series representing a conditional PMF
    """
    if joint.columns.name == name:
        cond = joint[value]
    elif joint.index.name == name:
        cond = joint.loc[value]
    else:
        # Without this branch, an unknown name would leave `cond` undefined
        raise ValueError(f"{name!r} is not the name of either axis")
    return cond / cond.sum()

In [None]:
conditional(joint, 'fruit', 'apple')

Given a joint distribution, we can compute the unconditioned (marginal) distribution of either variable.

Let's compute the marginal distribution of fruit, and then of color.

In [None]:
marg_fruit = joint.sum(axis=0)
marg_fruit

In [None]:
marg_color = joint.sum(axis=1)
marg_color

Let's write a function to compute a marginal distribution.

In [None]:
def marginal(joint, name):
    """Compute a marginal distribution.

    joint: DataFrame representing a joint PMF
    name: string name of an axis

    returns: Series representing a marginal PMF
    """
    if joint.columns.name == name:
        return joint.sum(axis=0)
    elif joint.index.name == name:
        return joint.sum(axis=1)
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"{name!r} is not the name of either axis")

In [None]:
marg_fruit = marginal(joint, 'fruit')
marg_fruit
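
As a sanity check, the marginal computed from a joint PMF should match the PMF computed directly from the sample. The sketch below assumes a fresh random sample with a hypothetical seed, and does not reuse the variables from the cells above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, independent of the cells above
fruit = rng.choice(['apple', 'banana', 'grape'], 100)
color = rng.choice(['red', 'yellow', 'green'], 100)

# Joint PMF from a cross tabulation
table = pd.crosstab(color, fruit, rownames=['color'], colnames=['fruit'])
joint_pmf = table / table.to_numpy().sum()

# Marginal of fruit: sum over the color axis (rows)
marg = joint_pmf.sum(axis=0)

# PMF of fruit computed directly from the sample
direct = pd.Series(fruit).value_counts(normalize=True).sort_index()

print(np.allclose(marg.to_numpy(), direct.to_numpy()))  # True
```

Summing the joint PMF over colors discards the color information, which leaves exactly the distribution of fruit on its own.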

**Exercise**: Use the GSS survey data to explore the joint distribution of two variables, `partyid` and `polviews`.

1. Make a cross tabulation of `gss['partyid']` and `gss['polviews']` and normalize it to make a joint PMF.

2. Compute the marginal distribution of `polviews`, and plot the results.

3. Compute the conditional distribution of `partyid` for people who identify themselves as "Extremely conservative" (`polviews`==7). What fraction of them are "Strong republican" (`partyid`==6)?

In [None]:
#Solution for task 1 goes here
cross_tab = pd.crosstab(gss['partyid'], gss['polviews'])
joint_table = cross_tab / cross_tab.to_numpy().sum()
joint_table

In [None]:
#Solution for task 2 goes here
marginal(joint_table, 'polviews').plot.bar(color='C2')
plt.ylabel('Probability')
plt.title('Distribution of polviews');

In [None]:
#Solution for task 3 goes here
cond_partyid = conditional(joint_table, 'polviews', 7)
cond_partyid.plot.bar(label='Extremely conservative', color='C4')

plt.ylabel('Probability')
plt.title('Distribution of partyid')

cond_partyid[6]

**References**:

1. Stats: Data and Models, by De Veaux, Velleman, and Bock, Fourth (Global) Edition, Pearson, 2016. Chapters 1-4, 6-8, 13-14.
2. Bite Size Bayes, an introduction to probability and Bayesian statistics using Python