download or load the data file:

In [16]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv')

I'll use Pandas to read the data and store it in a `DataFrame`.

In [17]:
import pandas as pd

gss = pd.read_csv('gss_bayes.csv', index_col=0)
gss.head()

| caseid | year | age | sex | polviews | partyid | indus10 |
|--------|------|------|-----|----------|---------|---------|
| 1 | 1974 | 21.0 | 1 | 4.0 | 2.0 | 4970.0 |
| 2 | 1974 | 41.0 | 1 | 5.0 | 0.0 | 9160.0 |
| 5 | 1974 | 58.0 | 2 | 6.0 | 1.0 | 2670.0 |
| 6 | 1974 | 30.0 | 1 | 5.0 | 4.0 | 6870.0 |
| 7 | 1974 | 48.0 | 1 | 5.0 | 4.0 | 7860.0 |


The DataFrame has one row for each person surveyed and one column for each variable I selected.

The columns are:

* `caseid`: Respondent id (which is the index of the table).

* `year`: Year when the respondent was surveyed.

* `age`: Respondent's age when surveyed.

* `sex`: Male or female.

* `polviews`: Political views on a range from liberal to conservative.

* `partyid`: Political party affiliation: Democrat, Independent, or Republican.

* `indus10`: Code for the industry the respondent works in.

Let's look at these variables in more detail, starting with `indus10`.


## Fraction of Bankers

The code for "Banking and related activities" is 6870, so we can select bankers like this:

In [18]:
banker = (gss['indus10'] == 6870)
banker.head()

caseid
1    False
2    False
5    False
6     True
7    False
Name: indus10, dtype: bool

The result is a Pandas `Series` that contains the Boolean values `True` and `False`.

If we call the `sum` method on this `Series`, it treats `True` as 1 and `False` as 0, so the total is the number of bankers.


In [19]:
banker.sum()

728

In this dataset, there are 728 bankers.

To compute the *fraction* of bankers, we can use the `mean` function, which computes the fraction of `True` values in the `Series`:

In [20]:
banker.mean()

0.014769730168391155
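To see why `sum` counts and `mean` gives a fraction, here is a minimal sketch with a small made-up Boolean series. The name `toy_banker` and its values are illustrative assumptions, not taken from the GSS data.

```python
import pandas as pd

# Hypothetical data: five respondents, two of whom are bankers.
toy_banker = pd.Series([False, True, False, True, False])

# True counts as 1 and False as 0, so the sum is the number of True values.
count = toy_banker.sum()

# The mean is the sum divided by the length, that is, the fraction of True values.
fraction = toy_banker.mean()

print(count, fraction)
```

The same relationship holds for the real data: `banker.mean()` equals `banker.sum() / len(banker)`.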

Now we can wrap this computation in a function that takes a Boolean series and returns a probability:

In [21]:
def prob(A):
    """Computes the probability of a proposition, A."""    
    return A.mean()

## Conditional Probability

Conditional probability is a probability that depends on a condition, but that might not be the most helpful definition.  Here are some examples:

* What is the probability that a respondent is a Democrat, given that they are liberal?

* What is the probability that a respondent is female, given that they are a banker?

* What is the probability that a respondent is liberal, given that they are female?

Let's start with the first one, which we can interpret like this: "Of all the respondents who are liberal, what fraction are Democrats?"

We can compute this probability in two steps:

1. Select all respondents who are liberal.

2. Compute the fraction of the selected respondents who are Democrats.

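To select liberal respondents, we can use the bracket operator, `[]`, with a Boolean series as the index. The following function combines both steps: it selects the rows where `given` is `True`, then computes the fraction of those rows where `proposition` is `True`.

In [22]:
def conditional(proposition, given):
    """Of the respondents for whom `given` is True, what fraction satisfy `proposition`?"""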
    return prob(proposition[given])
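The two steps can be checked on a small made-up example. This is a sketch, not part of the original analysis: `toy_liberal` and `toy_democrat` are hypothetical series, and `prob` and `conditional` are repeated so the block runs on its own.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean series."""
    return A.mean()

def conditional(proposition, given):
    """Fraction of rows satisfying `proposition` among rows where `given` is True."""
    return prob(proposition[given])

# Hypothetical data: six respondents.
toy_liberal = pd.Series([True, True, True, False, False, False])
toy_democrat = pd.Series([True, True, False, True, False, False])

# Step 1: select the rows where the condition holds.
selected = toy_democrat[toy_liberal]

# Step 2: compute the fraction of True values among the selected rows.
two_step = selected.mean()

# The function does both steps in one call.
one_call = conditional(toy_democrat, given=toy_liberal)

# The same quantity written as P(A and B) / P(B).
ratio = prob(toy_democrat & toy_liberal) / prob(toy_liberal)

print(two_step, one_call, ratio)
```

Of the three liberal respondents in the toy data, two are Democrats, so all three expressions come out to two thirds.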

Now let's look at another variable in this dataset.
The values of the column `sex` are encoded like this:

```
1    Male
2    Female
```

So we can make a Boolean series that is `True` for female respondents and `False` otherwise:

In [23]:
female = (gss['sex'] == 2)

We can use it to compute the fraction of respondents who are female:

In [24]:
prob(female)

0.5378575776019476

## Political Views and Parties

The other variables we'll consider are `polviews`, which describes the political views of the respondents, and `partyid`, which describes their affiliation with a political party.

The values of `polviews` are on a seven-point scale:

```
1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slightly conservative
6	Conservative
7	Extremely conservative
```

I'll define `liberal` to be `True` for anyone whose response is "Extremely liberal", "Liberal", or "Slightly liberal".

In [27]:
liberal = (gss['polviews'] <= 3)
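One detail worth keeping in mind with a comparison like this: in Pandas, comparing a missing value (`NaN`) with a number yields `False`. I have not checked whether `gss_bayes.csv` contains missing values, so the sketch below uses a hypothetical series, `toy_polviews`, to show the behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical responses on the seven-point scale, with one missing value.
toy_polviews = pd.Series([1.0, 3.0, 4.0, np.nan, 7.0])

# NaN <= 3 evaluates to False, so the missing response is not counted as liberal.
toy_liberal = (toy_polviews <= 3)
print(toy_liberal.tolist())  # [True, True, False, False, False]

# Fraction of all rows, with the missing row counted as "not liberal".
frac_all = toy_liberal.mean()

# Fraction among rows that have a response.
frac_valid = toy_liberal[toy_polviews.notna()].mean()
print(frac_all, frac_valid)
```

If a column does contain missing values, the two fractions differ, so it matters which denominator is intended.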

The values of `partyid` are encoded like this:

```
0	Strong democrat
1	Not strong democrat
2	Independent, near democrat
3	Independent
4	Independent, near republican
5	Not strong republican
6	Strong republican
7	Other party
```

I'll define `democrat` to include respondents who chose "Strong democrat" or "Not strong democrat":

In [36]:
democrat = (gss['partyid'] <= 1)
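The threshold `<= 1` works because the two Democrat codes are the lowest values. As an alternative sketch, a membership test with `isin` names the codes explicitly. The series `toy_partyid` below is made up for illustration.

```python
import pandas as pd

# Hypothetical party codes, using the encoding listed above.
toy_partyid = pd.Series([0.0, 1.0, 2.0, 3.0, 6.0, 7.0])

# Threshold version, as used in the text.
by_threshold = (toy_partyid <= 1)

# Membership version: True where the code is 0 or 1.
by_membership = toy_partyid.isin([0, 1])

print(by_threshold.equals(by_membership))
```

Both versions give the same Boolean series here. The membership form also handles groups whose codes are not contiguous.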

**Exercise:** Let's use the tools in this chapter to solve a variation of the Linda problem.

> Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.  Which is more probable?
> 1. Linda is a banker.
> 2. Linda is a banker and considers herself a liberal Democrat.

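To answer this question, treat the fact that Linda is female as given and compute:

* The probability that Linda is a banker, given that she is female,

* The probability that Linda is a liberal banker, given that she is female, and

* The probability that Linda is a liberal banker and a Democrat, given that she is female.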

In [33]:
conditional(banker, given=female)

0.02116102749801969

## Conjunction

Now that we have a definition of probability and a function that computes it, let's move on to conjunction.

"Conjunction" is another name for the logical `and` operation.  If you have two [propositions](https://en.wikipedia.org/wiki/Proposition), `A` and `B`, the conjunction `A and B` is `True` if both `A` and `B` are `True`, and `False` otherwise.

If we have two Boolean series, we can use the `&` operator to compute their conjunction.
For example, here is the probability that a respondent is a liberal banker, given that they are female:

In [34]:
conditional(liberal & banker, given=female)

0.004752744143940251
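Here is a minimal sketch of how `&` behaves on two Boolean series, using made-up series `A` and `B`. It also shows that Python's `and` keyword is not a substitute, because it asks for a single truth value for the whole series.

```python
import pandas as pd

# Hypothetical propositions covering all four combinations.
A = pd.Series([True, True, False, False])
B = pd.Series([True, False, True, False])

# `&` is applied element by element: True only where both are True.
both = A & B
print(both.tolist())  # [True, False, False, False]

# `and` tries to convert each Series to one bool, which Pandas refuses.
try:
    result = A and B
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```

When building a conjunction from comparisons, keep the parentheses, as in `(gss['sex'] == 2) & (gss['indus10'] == 6870)`, because `&` binds more tightly than `==`.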

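Adding `democrat` to the conjunction gives the probability that a respondent is a liberal banker and a Democrat, given that they are female:

In [37]: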
conditional(liberal & banker & democrat, given=female)

0.0023009316887329786
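The three results get smaller as propositions are added: about 0.021, then 0.0048, then 0.0023. That is not specific to this dataset, since every row that satisfies `A & B` also satisfies `A`. The sketch below checks this on a small made-up table; the `toy` data frame and its values are assumptions for illustration only.

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean series."""
    return A.mean()

# Hypothetical data: eight respondents and three propositions.
toy = pd.DataFrame({
    'banker':   [True, True, True, False, False, False, False, False],
    'liberal':  [True, True, False, True, False, True, False, False],
    'democrat': [True, False, False, True, True, False, False, False],
})

p_banker = prob(toy['banker'])
p_banker_liberal = prob(toy['banker'] & toy['liberal'])
p_all = prob(toy['banker'] & toy['liberal'] & toy['democrat'])

print(p_banker, p_banker_liberal, p_all)  # 0.375 0.25 0.125
```

Each added condition can only keep the probability the same or lower it, which is why "banker and liberal Democrat" cannot be more probable than "banker" alone.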