# ***1. Probability***

> ## **1.1 Probabilities and fractions**
***
- <span style="color:#DCF763">A probability, in simple terms, is a fraction of a finite set. For example, if we survey 1000 people and 20 of them are bank tellers, the fraction of people who work as bank tellers is 2%.</span>
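
As a minimal sketch of the counting idea, the bank-teller example can be written with a plain list of Boolean values. The list `is_teller` is a made-up stand-in for a survey, not real data.

```python
# Hypothetical survey: 1000 respondents, 20 of whom are bank tellers.
is_teller = [True] * 20 + [False] * 980

# True counts as 1, so sum() gives the number of tellers;
# dividing by the number of respondents gives the fraction.
fraction = sum(is_teller) / len(is_teller)
print(fraction)  # 0.02
```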

With this definition and an appropriate dataset, we can compute probabilities by counting. I’ll use data from the General Social Survey (GSS).

In [27]:
# Download the data file if it is not already present
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)
    
download('https://github.com/AllenDowney/ThinkBayes2/raw/master/data/gss_bayes.csv')

I’ll use Pandas to read the data and store it in a DataFrame.

In [28]:
import pandas as pd

gss = pd.read_csv('gss_bayes.csv')
gss.head()

Unnamed: 0,caseid,year,age,sex,polviews,partyid,indus10
0,1,1974,21.0,1,4.0,2.0,4970.0
1,2,1974,41.0,1,5.0,0.0,9160.0
2,5,1974,58.0,2,6.0,1.0,2670.0
3,6,1974,30.0,1,5.0,4.0,6870.0
4,7,1974,48.0,1,5.0,4.0,7860.0


The DataFrame has one row for each person surveyed and one column for each variable I selected.

The columns are

* `caseid`: Respondent identifier.

* `year`: Year when the respondent was surveyed.

* `age`: Respondent's age when surveyed.

* `sex`: Male (1) or female (2).

* `polviews`: Political views on a range from liberal to conservative.

* `partyid`: Political party affiliation, Democrat, Independent, or Republican.

* `indus10`: Code for the industry the respondent works in.

The code for “Banking and related activities” is 6870, so we can select the bankers and compute the fraction they represent like this:

In [29]:
# Boolean Series that is True for respondents who work in banking.
banker = (gss['indus10'] == 6870)
# Total number of bankers.
print("Bankers in the survey: ", banker.sum())
# Fraction of respondents who are bankers, i.e. the probability
# that a randomly chosen respondent is a banker.
print("Fraction of bankers: ", banker.mean())

Bankers in the survey:  728
Fraction of bankers:  0.014769730168391155


<div class="alert alert-block alert-info">
<b>Reminder:</b> To select a whole column, write the name of the DataFrame (<code>gss</code>) followed by the column name in quotes inside square brackets (<code>['indus10']</code>). The outer parentheses around the comparison with the industry code (6870) are optional, because the comparison (<code>==</code> or <code>&lt;=</code>) is always evaluated before the assignment, but they make the expression easier to read.
</div>
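
To make the role of `sum()` and `mean()` concrete, here is a small sketch on a toy DataFrame. The five rows and the names `toy`, `is_banker`, `n_bankers`, and `frac_bankers` are invented for illustration and are not part of the GSS data.

```python
import pandas as pd

# Toy stand-in for the GSS data; these rows are made up.
toy = pd.DataFrame({'indus10': [6870, 4970, 6870, 9160, 2670]})

# Comparing a column to a value yields a Boolean Series, one entry per row.
is_banker = toy['indus10'] == 6870

# True counts as 1 and False as 0, so sum() counts the matches
# and mean() gives the fraction of rows that match.
n_bankers = is_banker.sum()
frac_bankers = is_banker.mean()
print(n_bankers, frac_bankers)
```

Here two of the five rows carry the banking code, so the count is 2 and the fraction is 2/5.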

> ## **1.2 Conjunction and Conditional Probability**
***
I’ll put the code from the previous section in a function that takes a Boolean series and returns a probability:

In [30]:
def prob(A):
    """Computes the probability of a proposition, A."""    
    return A.mean()

- <span style="color:#DCF763">Conjunction is another name for the logical AND operation, and it is commutative: "A and B" is the same proposition as "B and A".</span>

The values of `partyid` are encoded like this:

```
0	Strong democrat
1	Not strong democrat
2	Independent, near democrat
3	Independent
4	Independent, near republican
5	Not strong republican
6	Strong republican
7	Other party
```

I'll define `democrat` to include respondents who chose "Strong democrat" or "Not strong democrat", and then compute the probability that a respondent is both a banker and a democrat:

In [31]:
# True for respondents who chose "Strong democrat" (0) or "Not strong democrat" (1).
democrat = (gss['partyid'] <= 1)
print("Probability of sampling a democrat banker: ", prob(banker & democrat))

Probability of sampling a democrat banker:  0.004686548995739501


<div class="alert alert-block alert-info">
<b>Reminder:</b> <code>prob(banker and democrat)</code> would raise an error because <code>and</code> in Python does not work element-wise on Pandas Series, whereas <code>prob(banker &amp; democrat)</code> performs an element-wise AND.
</div>
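
The difference between `and` and `&` can be checked on two short Series. This is a sketch on made-up values; `a`, `b`, `both`, and `raised` are illustrative names only.

```python
import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])

# Element-wise AND; the order of the operands does not matter.
both = a & b
assert (both == (b & a)).all()

# The keyword `and` asks for a single truth value for the whole Series,
# which pandas refuses to provide, so it raises a ValueError.
try:
    a and b
    raised = False
except ValueError:
    raised = True

print(both.tolist(), raised)
```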

As we should expect, the probability of being both a banker and a democrat is less than either the probability of being a banker or the probability of being a democrat, because not all bankers are democrats and vice versa.

- <span style="color:#DCF763">Conditional probability is the probability of an event occurring, given that another event (by assumption or evidence) is already known to have occurred.</span>

- <span style="color:#DCF763">Conditional probability is not commutative.</span>

We can use the bracket operator to select only the bankers and `prob` to compute the fraction of them that are female:

In [32]:
# True for female respondents.
female = (gss['sex'] == 2)
# Keep the values of `female` only for the rows where `banker` is True.
# Both `female` and `banker` are Boolean Series.
fem_banker = female[banker]
print("Probability that a respondent is female given that they are a banker: ", prob(fem_banker))

Probability that a respondent is female given that they are a banker:  0.7706043956043956
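
To see why the order of the two propositions matters, here is a sketch on ten made-up respondents. It also checks that selecting with the bracket operator agrees with the ratio definition $P(A|B) = P(A~\mathrm{and}~B) / P(B)$. The Series `a` and `b` are hypothetical.

```python
import pandas as pd

# Made-up data: 10 respondents with two Boolean properties.
a = pd.Series([True] * 2 + [False] * 8)   # 2 of 10 have property A
b = pd.Series([True] * 5 + [False] * 5)   # 5 of 10 have property B

p_a_given_b = a[b].mean()   # fraction of the B rows that are also A
p_b_given_a = b[a].mean()   # fraction of the A rows that are also B

# Ratio definition of conditional probability, for comparison.
ratio = (a & b).mean() / b.mean()

print(p_a_given_b, p_b_given_a, ratio)
```

In this toy data every respondent with property A also has property B, but only two of the five respondents with B have A, so the two conditional probabilities differ.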


Next we'll use `polviews`, which describes the political views of the respondents, together with `partyid`, which describes their affiliation with a political party.

The values of `polviews` are on a seven-point scale:

```
1	Extremely liberal
2	Liberal
3	Slightly liberal
4	Moderate
5	Slightly conservative
6	Conservative
7	Extremely conservative
```

We can use conditional probability to compute the probability that a respondent is liberal given that they are female. I’ll define `liberal` to be `True` for anyone whose response is “Extremely liberal”, “Liberal”, or “Slightly liberal”.

Since I'll be using this calculation often, I'll wrap it in a function that takes two Boolean Series, `proposition` and `given`, and computes the conditional probability of `proposition` conditioned on `given`:

In [33]:
def conditional(proposition, given):
    """Probability of proposition conditioned on given."""
    return prob(proposition[given])

In [34]:
# True for respondents whose views range from "Extremely liberal" (1) to "Slightly liberal" (3).
liberal = (gss['polviews'] <= 3)
print("Probability that a respondent is liberal given that they are female: ", conditional(liberal, given=female))

Probability that a respondent is liberal given that they are female:  0.27581004111500884


We can combine conditional probability and conjunction. For example, here’s the probability that a respondent is female, given that they are a liberal democrat. About 58% of liberal democrats are female.

In [35]:
print("Probability of sampling a female respondent given that they're a liberal democrat: ", conditional(female, given = liberal & democrat))

Probability of sampling a female respondent given that they're a liberal democrat:  0.576085409252669


> ## **1.3 Laws of probability**
***
&emsp; ***Bayes's Theorem:*** $P(A|B) = \frac{P(A) P(B|A)}{P(B)}$
- If A and B are independent, then $P(B|A) = P(B)$, so the numerator is just $P(A) P(B)$ and the result reduces to $P(A)$. <br>
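
As a quick check of the theorem, here is a sketch that uses exact fractions on made-up counts. The counts are chosen so that A and B happen to be independent, which also illustrates the special case in the bullet above. All names and numbers are hypothetical.

```python
from fractions import Fraction

# Made-up counts for 100 respondents.
n_total = 100
n_a = 30          # respondents with property A
n_b = 40          # respondents with property B
n_a_and_b = 12    # respondents with both properties

p_a = Fraction(n_a, n_total)
p_b = Fraction(n_b, n_total)
p_b_given_a = Fraction(n_a_and_b, n_a)
p_a_given_b = Fraction(n_a_and_b, n_b)

# Bayes's theorem: P(A|B) = P(A) P(B|A) / P(B)
bayes = p_a * p_b_given_a / p_b
assert bayes == p_a_given_b

# Here P(A and B) = P(A) P(B), so A and B are independent
# and P(A|B) equals P(A).
assert Fraction(n_a_and_b, n_total) == p_a * p_b
assert p_a_given_b == p_a
print(bayes)
```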

&emsp; ***Law of Total Probability:*** $P(A) = \sum_i P(B_i) P(A|B_i)$ 
- This holds as long as the conditions $B_i$ are mutually exclusive and collectively exhaustive (MECE), i.e. exactly one of them is true. For example, this survey records each respondent as either male or female, never both and never neither.

In [37]:
# True for male respondents.
male = (gss['sex'] == 1)
# Probability of being a banker, computed directly.
print("Probability of being a banker: ", prob(banker))
# Probability of being a banker, computed with the law of total probability.
print("Probability of being a banker: ",
      prob(male) * conditional(banker, given=male) +
      prob(female) * conditional(banker, given=female))

Probability of being a banker:  0.014769730168391155
Probability of being a banker:  0.014769730168391153
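
The two results differ only in the last digit, which is a floating-point rounding effect rather than a failure of the law. A minimal sketch of how one might compare such values, using `math.isclose` from the standard library and the two numbers printed above:

```python
import math

direct = 0.014769730168391155      # prob(banker)
total_law = 0.014769730168391153   # law of total probability

# Exact equality fails because of rounding in the last bit...
exactly_equal = (direct == total_law)

# ...but the values agree within the default relative tolerance (1e-09).
close = math.isclose(direct, total_law)
print(exactly_equal, close)
```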
