# Probability

## Linda the Banker

In [3]:
import pandas as pd

gss = pd.read_csv('gss_bayes.csv')
gss

,caseid,year,age,sex,polviews,partyid,indus10
0,1,1974,21.0,1,4.0,2.0,4970.0
1,2,1974,41.0,1,5.0,0.0,9160.0
2,5,1974,58.0,2,6.0,1.0,2670.0
3,6,1974,30.0,1,5.0,4.0,6870.0
4,7,1974,48.0,1,5.0,4.0,7860.0
...,...,...,...,...,...,...,...
49285,2863,2016,57.0,2,1.0,0.0,7490.0
49286,2864,2016,77.0,1,6.0,7.0,3590.0
49287,2865,2016,87.0,2,4.0,5.0,770.0
49288,2866,2016,55.0,2,5.0,5.0,8680.0


In [4]:
gss.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49290 entries, 0 to 49289
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   caseid    49290 non-null  int64  
 1   year      49290 non-null  int64  
 2   age       49290 non-null  float64
 3   sex       49290 non-null  int64  
 4   polviews  49290 non-null  float64
 5   partyid   49290 non-null  float64
 6   indus10   49290 non-null  float64
dtypes: float64(4), int64(3)
memory usage: 2.6 MB


In [5]:
banker = (gss['indus10'] == 6870)
banker

0        False
1        False
2        False
3         True
4        False
         ...  
49285    False
49286    False
49287    False
49288    False
49289    False
Name: indus10, Length: 49290, dtype: bool

In [6]:
banker.sum()

728

In [7]:
banker.mean()

0.014769730168391155
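
The two results above are related: `sum()` counts the `True` values, and `mean()` divides that count by the number of rows (here, 728 out of 49290). A minimal sketch of the same idea on a made-up Boolean Series; the values are hypothetical and not drawn from the GSS file:

```python
import pandas as pd

# Hypothetical toy data: 2 of 8 respondents are bankers.
toy_banker = pd.Series([False, True, False, False, True, False, False, False])

count = toy_banker.sum()             # True counts as 1, False as 0
fraction = count / len(toy_banker)   # same value as toy_banker.mean()

print(count, fraction, toy_banker.mean())
```

Because the mean of a Boolean Series is the fraction of `True` values, it can be read directly as a probability over the dataset.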

## The Probability Function

In [8]:
def prob(A):
    '''Compute the probability of a proposition A.'''
    return A.mean()

In [9]:
prob(banker)

0.014769730168391155

In [10]:
female = (gss['sex'] == 2)

In [11]:
prob(female)

0.5378575776019476

## Political Views and Parties

The `polviews` column records political views on a seven-point scale:

1. Extremely liberal
2. Liberal
3. Slightly liberal
4. Moderate
5. Slightly conservative
6. Conservative
7. Extremely conservative

In [12]:
liberal = (gss['polviews'] <= 3)

In [13]:
prob(liberal)

0.27374721038750255

The `partyid` column records party affiliation with the following codes:

0. Strong Democrat
1. Not strong Democrat
2. Independent, near Democrat
3. Independent
4. Independent, near Republican
5. Not strong Republican
6. Strong Republican
7. Other party

In [14]:
democrat = (gss['partyid'] <= 1)

In [15]:
prob(democrat)

0.3662609048488537

## Conjunction

In [17]:
prob(banker)

0.014769730168391155

In [18]:
prob(democrat)

0.3662609048488537

In [19]:
prob(banker & democrat)

0.004686548995739501

In [20]:
prob(democrat & banker)

0.004686548995739501

In [21]:
prob(banker & democrat) == prob(democrat & banker)

True
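
`&` combines two Boolean Series element by element, which is why the order of the operands does not matter. One detail worth noting: in Python `&` binds more tightly than `==` and `<=`, so each comparison needs its own parentheses when the conjunction is written inline. A small sketch on a hypothetical four-row frame (column names borrowed from the GSS data, values made up):

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({'sex': [1, 2, 2, 1], 'partyid': [0, 1, 5, 6]})

female = (toy['sex'] == 2)
democrat = (toy['partyid'] <= 1)

both = female & democrat                          # True only where both are True
same = (both == (democrat & female)).all()        # operand order is irrelevant

# Inline form: parentheses around each comparison are required.
inline = (toy['sex'] == 2) & (toy['partyid'] <= 1)

print(both.tolist(), same)
```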

## Conditional Probability

In [22]:
selected = democrat[liberal]

In [23]:
prob(selected)

0.5206403320240125

In [24]:
selected = female[banker]

In [25]:
prob(selected)

0.7706043956043956

In [26]:
def conditional(proposition, given):
    '''Compute the conditional probability of `proposition` given `given`.'''
    return prob(proposition[given])

In [27]:
conditional(liberal, given=female)

0.27581004111500884
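
`conditional` relies on Boolean indexing: `proposition[given]` keeps only the rows where `given` is `True`, and the mean of what remains is the conditional probability. A small sketch with hypothetical data for six respondents:

```python
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of `proposition` among the rows where `given` is True."""
    return prob(proposition[given])

# Hypothetical toy data, not taken from the GSS file.
toy_female = pd.Series([True, True, False, True, False, False])
toy_liberal = pd.Series([True, False, False, True, True, False])

subset = toy_liberal[toy_female]    # keeps rows 0, 1 and 3 only
p = conditional(toy_liberal, given=toy_female)

print(subset.tolist(), p)
```

Two of the three selected rows are `True`, so the conditional probability in this toy case is 2/3.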

## Conditional Probability Is Not Commutative

In [28]:
conditional(female, given=banker)

0.7706043956043956

In [29]:
conditional(banker, given=female)

0.02116102749801969

In [31]:
conditional(female, given=banker) != conditional(banker, given=female)

True
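
One way to see why the two values differ: both conditionals share the same numerator (the number of female bankers) but divide by different group sizes. A sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical toy data: 2 bankers, 4 women, 1 person who is both.
toy_banker = pd.Series([True, True, False, False, False, False, False, False])
toy_female = pd.Series([True, False, True, True, True, False, False, False])

both = (toy_banker & toy_female).sum()            # shared numerator

p_female_given_banker = both / toy_banker.sum()   # divide by number of bankers
p_banker_given_female = both / toy_female.sum()   # divide by number of women

print(p_female_given_banker, p_banker_given_female)
```

The two results coincide only when the two groups happen to be the same size.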

## Condition and Conjunction

In [32]:
conditional(female, given=liberal & democrat)

0.576085409252669

About 58% of liberal Democrats are female.

In [34]:
conditional(liberal & female, given=banker)

0.17307692307692307

17% of bankers are liberal women.

## Laws of Probability

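* $ P(A) $: the probability of A
* $ P(A~\mathrm{and}~B) $: the probability that A and B are both true
* $ P(A|B) $: the conditional probability of A given B, that is, the probability of A when B is true, or the fraction of B that is also A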

### Theorem 1

$$P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)}$$

In [35]:
prob(female & banker) / prob(banker)

0.7706043956043956

In [36]:
prob(female & banker) / prob(banker) == conditional(female, given=banker)

True

### Theorem 2

$$ P(A~\mathrm{and}~B) = P(B) \cdot P(A|B) $$

In [38]:
prob(liberal & democrat)

0.1425238385067965

In [39]:
prob(democrat) * conditional(liberal, given=democrat)

0.1425238385067965

In [40]:
prob(liberal & democrat) == prob(democrat) * conditional(liberal, given=democrat)

True

### Theorem 3

$$ P(A~\mathrm{and}~B) = P(B~\mathrm{and}~A) $$
$$ P(B) \cdot P(A|B) = P(A) \cdot P(B|A) $$

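* Left side: take the probability of B first, then the conditional probability of A within B.
* Right side: take the probability of A first, then the conditional probability of B within A.

Dividing both sides by $P(B)$ gives Bayes's theorem:

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$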

In [41]:
conditional(liberal, given=banker)

0.2239010989010989

In [42]:
prob(liberal) * conditional(banker, given=liberal) / prob(banker)

0.2239010989010989

In [43]:
conditional(liberal, given=banker) == prob(liberal) * conditional(banker, given=liberal) / prob(banker)

True
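
The same relationship can be rearranged to recover a quantity the cells above never print, $P(\mathrm{banker}|\mathrm{liberal})$, from three values that do appear in the outputs. A sketch that uses those printed values as literals; the variable names are chosen here for illustration:

```python
# Values copied from the outputs shown earlier in this section.
p_liberal = 0.27374721038750255              # prob(liberal)
p_banker = 0.014769730168391155              # prob(banker)
p_liberal_given_banker = 0.2239010989010989  # conditional(liberal, given=banker)

# Bayes's theorem solved for P(banker | liberal):
#   P(banker | liberal) = P(banker) * P(liberal | banker) / P(liberal)
p_banker_given_liberal = p_banker * p_liberal_given_banker / p_liberal

print(p_banker_given_liberal)
```

With the full dataset loaded, the result should agree with `conditional(banker, given=liberal)` up to floating-point rounding.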

### The Law of Total Probability

$$ P(A) = P(B_1~\mathrm{and}~A) + P(B_2~\mathrm{and}~A) $$

Here $B_1$ and $B_2$ must form a set of events that is mutually exclusive and collectively exhaustive (MECE).

This breaks a complicated problem into simpler, separate pieces.

In [44]:
prob(banker)

0.014769730168391155

In [45]:
male = (gss['sex'] == 1)

In [46]:
prob(male & banker) + prob(female & banker)

0.014769730168391155

$$ P(A) = P(B_1)P(A|B_1) + P(B_2)P(A|B_2) $$

In [47]:
(prob(male) * conditional(banker, given=male) + 
 prob(female) * conditional(banker, given=female)) 

0.014769730168391153
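
This result differs from `prob(banker)` in the last digit because of floating-point rounding, so an exact `==` comparison would fail here even though the two quantities are mathematically equal. A tolerance-based comparison is a safer check; the sketch below uses `math.isclose` from the standard library on the two printed values:

```python
import math

# Values copied from the outputs shown above.
direct = 0.014769730168391155     # prob(banker)
by_parts = 0.014769730168391153   # sum over male and female

exact_match = (direct == by_parts)              # exact comparison
close_match = math.isclose(direct, by_parts)    # default relative tolerance 1e-09

print(exact_match, close_match)
```

The earlier `==` checks in this section returned `True`, but that depends on the order of the arithmetic, so `math.isclose` (or `numpy.isclose`) is the more robust choice for verifying identities like these.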

$$P(A) = \sum_i P(B_i) P(A|B_i)$$
where the $B_i$ are MECE.

In [49]:
B = gss['polviews']
B.value_counts().sort_index()

polviews
1.0     1442
2.0     5808
3.0     6243
4.0    18943
5.0     7940
6.0     7319
7.0     1595
Name: count, dtype: int64

In [51]:
i = 4
prob(B == i) * conditional(banker, B == i)

0.005822682085615744

In [52]:
sum(prob(B == i) * conditional(banker, B == i) for i in range(1, 8))

0.014769730168391157

In [53]:
prob(banker)

0.014769730168391155
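
The sum over `range(1, 8)` hard-codes the seven `polviews` codes. A possible generalization is to iterate over whatever distinct values the partitioning column contains. The helper below, `total_probability`, is a hypothetical name introduced for this sketch, and the data is made up:

```python
import math
import pandas as pd

def prob(A):
    """Fraction of True values in a Boolean Series."""
    return A.mean()

def conditional(proposition, given):
    """Probability of `proposition` among the rows where `given` is True."""
    return prob(proposition[given])

def total_probability(A, B):
    """Sum P(B == b) * P(A | B == b) over the distinct values b of B.

    Hypothetical helper; assumes B has no missing values, so its
    distinct values partition the rows (MECE).
    """
    return sum(prob(B == b) * conditional(A, B == b) for b in B.unique())

# Hypothetical toy data: three categories, ten respondents.
toy_views = pd.Series([1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
toy_banker = pd.Series([True, False, False, True, False,
                        False, False, True, False, False])

total = total_probability(toy_banker, toy_views)
print(total, prob(toy_banker), math.isclose(total, prob(toy_banker)))
```

Each of the three categories contributes 0.1 in this toy case, so the total matches `prob(toy_banker)` at 0.3 up to rounding.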