# Conditional Probability

### Intersections

$A \cap B = \{\: x : x \in A \text{ and } x \in B \:\}$

Order doesn't matter: $A \cap B = B \cap A$

In [None]:
# https://anaconda.org/conda-forge/matplotlib-venn
# https://pypi.org/project/matplotlib-venn/
# https://towardsdatascience.com/visualizing-intersections-and-overlaps-with-python-a6af49c597d9
# https://practicaldatascience.co.uk/data-science/how-to-visualise-data-using-venn-diagrams-in-matplotlib

# import matplotlib.pyplot as plt
# from matplotlib_venn import venn2, venn2_circles

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6, 7, 8}
print([value for value in set1 if value in set2])
print(set1 & set2)
print(set1.intersection(set2))

# venn2([set1, set2])
# plt.show()

[3, 4]
{3, 4}
{3, 4}


#### Independent

Events $A$ and $B$ are independent when $P(A \cap B) = P(A) * P(B)$

In [None]:
# probability that a number picked at random from 1 - 10 is less than 5 (event A) and odd (event B)
# filter https://www.pythonlikeyoumeanit.com/Module2_EssentialsOfPython/Iterables.html
event_space = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
N = len(event_space)
a = filter(lambda x: x < 5, event_space)
a = list(a)
print(a)
pa = len(list(a))/N
print(pa)
b = filter(lambda x: x % 2 == 1, event_space)
b = list(b)
print(b)
pb = len(list(b))/N
print(pb)
print(f'Probability of < 5 and odd: {pa * pb}')

[1, 2, 3, 4]
0.4
[1, 3, 5, 7, 9]
0.5
Probability of < 5 and odd: 0.2
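
Multiplying $P(A)$ by $P(B)$ only gives $P(A \cap B)$ when the events are independent, so it is worth confirming that here. The sketch below counts the intersection directly and compares it with the product. The `sample_space` name and the `prob` helper are introduced for illustration and assume every outcome is equally likely.

```python
from fractions import Fraction

# Same outcomes as the cell above: the integers 1 through 10
sample_space = set(range(1, 11))
A = {x for x in sample_space if x < 5}        # event A: number is less than 5
B = {x for x in sample_space if x % 2 == 1}   # event B: number is odd

def prob(event):
    """Probability of an event when every outcome is equally likely."""
    return Fraction(len(event), len(sample_space))

p_joint = prob(A & B)            # counted directly from the intersection
p_product = prob(A) * prob(B)    # what independence would predict

print(p_joint, p_product, p_joint == p_product)   # 1/5 1/5 True
```

Because the two values agree, A and B are independent in this sample space and the product shortcut is valid.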


#### Dependent

$P(A \cap B) = P(A) * P(B|A) = P(B) * P(A|B)$<br />
also expressed as P(A and B) = P(A) * P(B given A)

In [None]:
# https://www.mathsisfun.com/data/probability-events-conditional.html
# https://en.wikipedia.org/wiki/Sample_space
# number of outcomes in event / total possible outcomes in sample space

bag = ['red', 'red', 'red', 'blue', 'blue']
mcolor = 'blue'
print(f'Probability of picking two {mcolor} marbles from this bag:')
bag_dict = {item:bag.count(item) for item in bag}
print(bag_dict)

# P(A) probability of picking blue 2 / 5
pa = bag_dict[mcolor] / len(bag)
print(f'Probability of picking a {mcolor}: {pa}')
bag.remove(mcolor)
print(f'Bag minus one {mcolor}:')
bag_dict = {item:bag.count(item) for item in bag}
print(bag_dict)
print(f'Probability of picking another {mcolor} with just four marbles: {(bag_dict[mcolor] / len(bag))}')
print()
# P(B|A) probability of picking another blue given you already picked a blue
print(f'Probability of picking two {mcolor} marbles in a row: {pa * (bag_dict[mcolor] / len(bag))}')

Probability of picking two blue marbles from this bag:
{'red': 3, 'blue': 2}
Probability of picking a blue: 0.4
Bag minus one blue:
{'red': 3, 'blue': 1}
Probability of picking another blue with just four marbles: 0.25

Probability of picking two blue marbles in a row: 0.1
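
As a cross-check on $P(A) * P(B|A)$, the sketch below lists every ordered pair of marbles that can be drawn without replacement and counts the pairs where both are blue. Indexing marbles by position is an illustrative choice so that identical colors stay distinct.

```python
from fractions import Fraction
from itertools import permutations

bag = ['red', 'red', 'red', 'blue', 'blue']

# Every ordered pair of distinct positions: two draws without replacement
draws = list(permutations(range(len(bag)), 2))
both_blue = [d for d in draws if bag[d[0]] == 'blue' and bag[d[1]] == 'blue']

p_two_blue = Fraction(len(both_blue), len(draws))
print(len(draws), len(both_blue), p_two_blue)   # 20 2 1/10
```

The enumeration gives 2 favorable draws out of 20, which matches 2/5 * 1/4.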


### Unions

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$<br />
also expressed as P(A or B)<br />
order doesn't matter: $A \cup B = B \cup A$

In [None]:
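# probability that a number picked at random from 1 - 10 is less than 5 (event A) or odd (event B)
print(event_space)
print(a)
pa = len(a) / N  # recompute: pa was reassigned in the marble example
print(pa)
print(b)
print(pb)
p_a_and_b = len(set(a) & set(b)) / N  # count the overlap directly instead of assuming independence
print(f'Probability of the union of A or B is {pa + pb - p_a_and_b}')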

# venn2([set(a), set(b)])
# plt.show()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4]
0.4
[1, 3, 5, 7, 9]
0.5
Probability of the union of A or B is 0.7


#### Unions if mutually exclusive (vs. disjoint)

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$ where $P(A \cap B) = 0$
<br />so<br />
$P(A \cup B) = P(A) + P(B)$

Mutually Exclusive: $P(A \cap B) = 0$<br />
Disjoint (dealing in sets): $A \cap B = \emptyset$

In [None]:
# is drawing an even number or odd number mutually exclusive?
print(event_space)
N = len(event_space)
c = filter(lambda x: x % 2 == 0, event_space)
c = list(c)
print(c)
pc = len(list(c))/N
print(pc)
d = filter(lambda x: x % 2 == 1, event_space)
d = list(d)
print(d)
pd = len(list(d))/N
print(pd)
print(set(c).intersection(set(d)))

# venn2([set(c), set(d)])
# plt.show()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[2, 4, 6, 8, 10]
0.5
[1, 3, 5, 7, 9]
0.5
set()
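
The empty intersection means the correction term drops out of the addition rule. A minimal sketch of that check follows, with a contrasting pair of events that do overlap. The `prob` helper is an illustrative name and assumes equally likely outcomes.

```python
from fractions import Fraction

sample_space = set(range(1, 11))
evens = {x for x in sample_space if x % 2 == 0}
odds = {x for x in sample_space if x % 2 == 1}
below_5 = {x for x in sample_space if x < 5}

def prob(event):
    return Fraction(len(event), len(sample_space))

# Mutually exclusive: the simple sum equals the probability of the union
print(prob(evens | odds) == prob(evens) + prob(odds))        # True

# Not mutually exclusive: the simple sum double counts the overlap {1, 3}
print(prob(below_5 | odds) == prob(below_5) + prob(odds))    # False
```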


### Complement

The complement of $A$, written $\bar{A}$, is every outcome in the sample space that is not in $A$.

Is the complement of A mutually exclusive with A?

In [None]:
import numpy as np

event_space = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
N = len(event_space)
a = filter(lambda x: x < 5, event_space)
a = list(a)
print(a)

# complement of a: values in event_space that are not in a
a_complement = np.setdiff1d(event_space, a)
print(a_complement)

# same result with a list comprehension
print([value for value in event_space if value not in a])

[1, 2, 3, 4]
[ 5  6  7  8  9 10]
[5, 6, 7, 8, 9, 10]
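
To answer the question above, the sketch below checks that $A$ and $\bar{A}$ share no outcomes and that their probabilities sum to 1. It uses plain Python sets rather than NumPy, and the variable names are illustrative.

```python
from fractions import Fraction

sample_space = set(range(1, 11))
A = {x for x in sample_space if x < 5}
A_complement = sample_space - A          # everything in the sample space that is not in A

print(A & A_complement)                  # set(): A and its complement never occur together

p_A = Fraction(len(A), len(sample_space))
p_not_A = Fraction(len(A_complement), len(sample_space))
print(p_A + p_not_A)                     # 1
```

So an event and its complement are always mutually exclusive, and $P(\bar{A}) = 1 - P(A)$.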


### Conditional Probability

https://towardsdatascience.com/conditional-probability-with-a-python-example-fd6f5937cd2<br />
https://towardsdatascience.com/conditional-probability-with-python-concepts-tables-code-c23ffe65d110<br />

$P(A|B) = \frac{P(A \cap B)}{P(B)}$<br />

The probability that $A$ occurs given that $B$ has occurred. It is defined only when $P(B) > 0$.

Terms
* $P(A|B)$: Probability of A given B
* $P(A \cap B)$: Probability of A and B
* $P(B)$: Probability of B
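
The definition can be sketched on the 1 - 10 example used earlier, with A as "less than 5" and B as "odd". The `conditional` helper is a hypothetical name, and the code assumes equally likely outcomes.

```python
from fractions import Fraction

sample_space = set(range(1, 11))
A = {x for x in sample_space if x < 5}        # less than 5
B = {x for x in sample_space if x % 2 == 1}   # odd

def prob(event):
    return Fraction(len(event), len(sample_space))

def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    if not given:
        raise ValueError('cannot condition on an event with probability 0')
    return prob(event & given) / prob(given)

print(conditional(A, B))   # 2/5: share of the odd numbers that are below 5
print(conditional(B, A))   # 1/2: share of the numbers below 5 that are odd
```

Conditioning on B shrinks the sample space to the five odd numbers, and two of those (1 and 3) are below 5.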

### Addition and Multiplication Rules

* Addition Rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
* Multiplication Rule: $P(A \cap B) = P(A) * P(B|A)$<br />
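
As a worked example of both rules, take the 1 - 10 sample space used earlier with $A$ = "less than 5" and $B$ = "odd", so that $A \cap B = \{1, 3\}$:

$$P(A \cup B) = \frac{4}{10} + \frac{5}{10} - \frac{2}{10} = \frac{7}{10}$$

$$P(A \cap B) = P(A) * P(B|A) = \frac{4}{10} * \frac{2}{4} = \frac{2}{10}$$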

### Contingency Tables


A **contingency table** (or **cross tabulation/crosstab**) is a table presented in a matrix format used in statistics to display the **multivariate frequency distribution** of categorical variables.

### Core Function

The primary purpose of a contingency table is to summarize and analyze the relationship between two or more categorical variables by showing the number of observations (frequencies) that fall into each combination of categories.

### Structure (Two-Way Table)

The most common contingency table structure (a **two-way table**) organizes data as follows:

* The categories of one variable form the **rows**.
* The categories of the second variable form the **columns**.
* The cells inside the table display the **count** (frequency) of cases that meet both the row and column conditions simultaneously.
* The margins of the table display the **marginal totals** (the total frequency for each category of the individual variables).

| | Category B1 | Category B2 | Row Total |
| :--- | :--- | :--- | :--- |
| **Category A1** | Count $A1 \& B1$ | Count $A1 \& B2$ | Total A1 |
| **Category A2** | Count $A2 \& B1$ | Count $A2 \& B2$ | Total A2 |
| **Column Total** | Total B1 | Total B2 | Grand Total |
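
The layout above can be sketched in plain Python by counting (row, column) pairs. The `responses` data is invented for illustration and the variable names are assumptions, not part of any library.

```python
from collections import Counter

# Hypothetical observations: one (variable A, variable B) pair per case
responses = [
    ('A1', 'B1'), ('A1', 'B2'), ('A1', 'B1'),
    ('A2', 'B2'), ('A2', 'B2'), ('A2', 'B1'),
]

cell_counts = Counter(responses)                      # joint frequency of each (row, column) pair
row_totals = Counter(row for row, _ in responses)     # marginal totals for variable A
col_totals = Counter(col for _, col in responses)     # marginal totals for variable B
grand_total = len(responses)

rows, cols = sorted(row_totals), sorted(col_totals)
print('', *cols, 'Total', sep='\t')
for r in rows:
    print(r, *(cell_counts[(r, c)] for c in cols), row_totals[r], sep='\t')
print('Total', *(col_totals[c] for c in cols), grand_total, sep='\t')
```

With pandas, `pd.crosstab` with `margins=True` should produce a comparable table in one call.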

### Applications

Contingency tables are widely used across disciplines, and they are the input for statistical tests of association. Common uses include:

* **Chi-Square Test of Independence:** This test uses the observed frequencies within the contingency table to determine if the two categorical variables are statistically independent or related.
* **Survey Research:** Used to summarize survey responses, such as showing how many male and female respondents selected "Yes" versus "No" to a given question.
* **Business Intelligence:** Analyzing customer demographics against product purchase history.

Contingency table. (February 6, 2022). In *Wikipedia*. https://en.wikipedia.org/wiki/Contingency_table
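
To make the chi-square test of independence concrete, here is a minimal sketch that computes the test statistic by hand for a 2x2 table. The `observed` counts are invented for illustration, no continuity correction is applied, and the result is compared against the standard 5% critical value for one degree of freedom rather than computing a p-value.

```python
# Hypothetical 2x2 observed counts, invented for illustration
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

CRITICAL_5PCT_DF1 = 3.841   # chi-square critical value for 1 degree of freedom at the 5% level
print(round(chi2, 2), dof, chi2 > CRITICAL_5PCT_DF1)   # 16.67 1 True
```

A statistic above the critical value suggests the two variables are related. In practice a statistics library would also report the p-value.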

In [None]:
# seaborn mpg dataset
import seaborn as sns

mpg = sns.load_dataset('mpg')
print(mpg.shape)
print(mpg.head())

(398, 9)
    mpg  cylinders  displacement  horsepower  weight  acceleration  \
0  18.0          8         307.0       130.0    3504          12.0   
1  15.0          8         350.0       165.0    3693          11.5   
2  18.0          8         318.0       150.0    3436          11.0   
3  16.0          8         304.0       150.0    3433          12.0   
4  17.0          8         302.0       140.0    3449          10.5   

   model_year origin                       name  
0          70    usa  chevrolet chevelle malibu  
1          70    usa          buick skylark 320  
2          70    usa         plymouth satellite  
3          70    usa              amc rebel sst  
4          70    usa                ford torino  


In [None]:
# find mpg quartiles
mpg.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 398.0 | 398.0 | 398.0 | 392.0 | 398.0 | 398.0 | 398.0 |
| mean | 23.514573 | 5.454774 | 193.425879 | 104.469388 | 2970.424623 | 15.56809 | 76.01005 |
| std | 7.815984 | 1.701004 | 104.269838 | 38.49116 | 846.841774 | 2.757689 | 3.697627 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 |
| 25% | 17.5 | 4.0 | 104.25 | 75.0 | 2223.75 | 13.825 | 73.0 |
| 50% | 23.0 | 4.0 | 148.5 | 93.5 | 2803.5 | 15.5 | 76.0 |
| 75% | 29.0 | 8.0 | 262.0 | 126.0 | 3608.0 | 17.175 | 79.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 |


In [None]:
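import numpy as np
# create a feature where 1 represents horsepower > 94 (the median is 93.5) and 0 if not
# rows with missing horsepower compare as False, so they are coded 0
mpg['above_median_hp'] = np.where(mpg['horsepower'] > 94, 1, 0)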
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   mpg              398 non-null    float64
 1   cylinders        398 non-null    int64  
 2   displacement     398 non-null    float64
 3   horsepower       392 non-null    float64
 4   weight           398 non-null    int64  
 5   acceleration     398 non-null    float64
 6   model_year       398 non-null    int64  
 7   origin           398 non-null    object 
 8   name             398 non-null    object 
 9   above_median_hp  398 non-null    int32  
dtypes: float64(4), int32(1), int64(3), object(2)
memory usage: 29.7+ KB


In [None]:
# create a feature where 1 represents mpg > 23 and 0 if not
mpg['above_median_mpg'] = np.where(mpg['mpg'] > 23, 1, 0)
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   mpg               398 non-null    float64
 1   cylinders         398 non-null    int64  
 2   displacement      398 non-null    float64
 3   horsepower        392 non-null    float64
 4   weight            398 non-null    int64  
 5   acceleration      398 non-null    float64
 6   model_year        398 non-null    int64  
 7   origin            398 non-null    object 
 8   name              398 non-null    object 
 9   above_median_hp   398 non-null    int32  
 10  above_median_mpg  398 non-null    int32  
dtypes: float64(4), int32(2), int64(3), object(2)
memory usage: 31.2+ KB


In [None]:
mpg['count'] = 1

### Pivot Table

A **Pivot Table** is a data summarization tool used to quickly and dynamically reorganize, group, and aggregate data from a large dataset, making it easier to analyze complex information.

### Core Function and Process

The primary function of a pivot table is to take individual items from an extensive table (like a spreadsheet or database) and group them into a compact, easily readable matrix based on discrete categories.

The process involves four key actions:

1.  **Pivoting:** Selecting a field from the original data and using its unique values to form the **rows** and **columns** of the new summary table.
2.  **Grouping:** Categorizing the data based on the chosen row and column fields.
3.  **Aggregation:** Applying an aggregation function (such as **sum**, **average**, **count**, $\text{min}$, or $\text{max}$) to the data field that resides in the table's central **values** area.
4.  **Summarization:** Displaying the calculated statistics for each intersection of the row and column categories.
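
The four steps above can be sketched in plain Python. The `records` data and the `pivot` helper are hypothetical and written for illustration; this is not how pandas implements `pivot_table`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records, invented for illustration
records = [
    {'region': 'east', 'product': 'widget', 'units': 10},
    {'region': 'east', 'product': 'gadget', 'units': 4},
    {'region': 'west', 'product': 'widget', 'units': 7},
    {'region': 'east', 'product': 'widget', 'units': 6},
    {'region': 'west', 'product': 'gadget', 'units': 3},
]

def pivot(rows, index, columns, values, aggfunc):
    """Group rows by (index, columns) and aggregate the values in each group."""
    groups = defaultdict(list)
    for row in rows:                                    # pivoting and grouping
        groups[(row[index], row[columns])].append(row[values])
    return {key: aggfunc(vals) for key, vals in groups.items()}   # aggregation

by_region = pivot(records, 'region', 'product', 'units', sum)
by_product = pivot(records, 'product', 'region', 'units', mean)   # same data, pivoted the other way
print(by_region)
print(by_product)
```

Swapping the `index` and `columns` arguments, or the aggregation function, gives a different summary without touching `records`.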

### Key Advantage

The main advantage of a pivot table is its **flexibility** and **speed**. A user can quickly "pivot" the data—changing which fields are used for rows, columns, and values—to analyze the same data from multiple perspectives without altering the original source data.



Pivot table. (February 9, 2022). In *Wikipedia*. https://en.wikipedia.org/wiki/Pivot_table

In [None]:
# create pivot table: count the rows in each (above_median_mpg, above_median_hp) group
result_pivot = mpg.pivot_table(values='count', index='above_median_mpg', columns='above_median_hp', aggfunc='count')
result_pivot

| above_median_mpg | above_median_hp = 0 | above_median_hp = 1 |
| :--- | :--- | :--- |
| 0 | 39 | 168 |
| 1 | 164 | 27 |


**Our terms**:
* $A$: horsepower > 94 (just above the median of 93.5)
* $B$: mpg > 23 (median)
* $P(A \cap B)$: probability of getting a car with above median mpg and above median horsepower

**Calculations**
* $P(A)$ = (168 + 27) / (39 + 168 + 164 + 27)
* $P(B)$ = (164 + 27) / (39 + 168 + 164 + 27)
* $P(A \cap B)$ = (27) / (39 + 168 + 164 + 27)

In [None]:
# P(A and B): the denominator is the sum of all four pivot table cells
round((27) / (39 + 168 + 164 + 27), 2)

0.07
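
With $P(A)$, $P(B)$ and $P(A \cap B)$ in hand, the conditional probabilities follow from the formula $P(A|B) = \frac{P(A \cap B)}{P(B)}$. The sketch below copies the four counts from the pivot table above into a dictionary. The `counts` layout and variable names are illustrative.

```python
from fractions import Fraction

# Counts copied from the pivot table: keys are (above_median_mpg, above_median_hp)
counts = {(0, 0): 39, (0, 1): 168,
          (1, 0): 164, (1, 1): 27}
total = sum(counts.values())

p_a = Fraction(counts[(0, 1)] + counts[(1, 1)], total)   # A: horsepower > 94
p_b = Fraction(counts[(1, 0)] + counts[(1, 1)], total)   # B: mpg > 23
p_a_and_b = Fraction(counts[(1, 1)], total)

p_a_given_b = p_a_and_b / p_b    # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_a_and_b / p_a    # P(B|A) = P(A and B) / P(A)

print(round(float(p_a_given_b), 2), round(float(p_b_given_a), 2))   # 0.14 0.14
print(p_a_and_b == p_a * p_b)    # False: the events are not independent
```

Only about 14% of the cars with above median mpg also have above median horsepower, well below the roughly 49% share across all cars, so the two features are clearly dependent.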