---
**Outline**

1. Review: Categorical Data Analysis
2. Contingency Table and Probability Structure
3. Odds, Odds Ratio, and Relative Risk

In [8]:
# Load data manipulation package
import numpy as np
import pandas as pd
import math

# <font color='blue'>Review: Categorical Data Analysis</font>

## **Contingency Table**
---

- **Joint Probability**: probability that ($X$, $Y$) falls in the cell in row $i$ and column $j$.
- **Marginal Probability**: the row and column totals of the joint probabilities.
- **Conditional Probability**: probability distribution for $Y$ at each level of $X$.


- **Joint Probability**
$$
\pi_{ij}=n_{ij}/n
$$
- **Marginal Probability**
$$
\pi_{i+}=\pi_{i1}+\pi_{i2}+\dots+\pi_{iJ}
$$
$$
\pi_{+j}=\pi_{1j}+\pi_{2j}+\dots+\pi_{Ij}
$$
- **Conditional Probability**
$$
\pi_{j|i}=n_{ij}/n_{i+}
$$
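
As a quick illustration of these three definitions, the sketch below computes them with NumPy for a small table. The counts are hypothetical (they are not taken from the insurance data), and the variable names are our own.

```python
import numpy as np

# Hypothetical 2x2 table of counts n_ij (illustration only)
# rows = levels of X, columns = levels of Y
counts = np.array([[30, 10],
                   [20, 40]])
n = counts.sum()  # total sample size

# Joint probabilities: pi_ij = n_ij / n
joint = counts / n

# Marginal probabilities: row totals (pi_i+) and column totals (pi_+j)
row_marginal = joint.sum(axis=1)
col_marginal = joint.sum(axis=0)

# Conditional distribution of Y within each level of X: n_ij / n_i+
cond_y_given_x = counts / counts.sum(axis=1, keepdims=True)

print(joint)
print(row_marginal, col_marginal)
print(cond_y_given_x)
```

Each row of `cond_y_given_x` sums to 1, because it is a probability distribution for $Y$ at one level of $X$.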

## **Odds, Odds Ratio, Relative Risk**

- **Odds (of success)**:
$$
\text{odds} = \frac{\pi}{1-\pi}
$$
- **Odds Ratio (OR)**:
$$
\text{OR} = \frac{\text {odds}_{1}}{\text {odds}_{2}}
$$
- **Relative Risk**:
$$
\text{relative risk} = \frac{\pi_1}{\pi_2}
$$

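where $\pi$ is the probability of success, and the subscripts 1 and 2 denote the two groups being compared (so $\pi_1$, $\pi_2$ are the success probabilities and $\text{odds}_1$, $\text{odds}_2$ the corresponding odds in each group).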
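
The three quantities can be sketched as small helper functions. The function names and the two probabilities below are assumptions made for illustration; they do not come from the dataset.

```python
def odds(p):
    """Odds of success for a success probability p, with 0 <= p < 1."""
    return p / (1 - p)


def odds_ratio(p1, p2):
    """Ratio of the odds in group 1 to the odds in group 2."""
    return odds(p1) / odds(p2)


def relative_risk(p1, p2):
    """Ratio of the success probability in group 1 to that in group 2."""
    return p1 / p2


# Hypothetical success probabilities in two groups
p1, p2 = 0.6, 0.5

print(f"odds group 1  : {odds(p1):.2f}")
print(f"odds group 2  : {odds(p2):.2f}")
print(f"odds ratio    : {odds_ratio(p1, p2):.2f}")
print(f"relative risk : {relative_risk(p1, p2):.2f}")
```

Note that the odds ratio and the relative risk generally differ; they are close only when both success probabilities are small.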

## **Test of Independence**
---
The test of independence checks whether certain levels of one variable tend to be **associated** with certain levels of another.

- **Null Hypothesis:**
  - The two variables are independent.
- **Alternative Hypothesis:**
  - The two variables are not independent.
- **Test Statistic:**
  - Pearson Chi-squared Statistic

  $$
  \chi ^{2}=\sum \frac{(n_{ij}-\mu _{ij})^{2}}{\mu _{ij}}
  $$
  where
  $$
  \mu _{ij}=\frac{(n_{i+})(n_{+j})}{n}
  $$

  - degrees of freedom = $(I-1) \times (J-1)$
- **Rejection Region:**

$$
\begin{align*}
\chi^{2} &> \chi^{2}_{\alpha} \\
&\text{or} \\
P_{\text{value}} &< \alpha
\end{align*}
$$
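
To make the formulas concrete, here is a minimal sketch that computes the expected counts, the Pearson statistic, and the degrees of freedom by hand with NumPy. The observed counts are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical observed counts n_ij (illustration only)
observed = np.array([[30, 10],
                     [20, 40]])

n = observed.sum()
row_totals = observed.sum(axis=1)  # n_i+
col_totals = observed.sum(axis=0)  # n_+j

# Expected counts under independence: mu_ij = n_i+ * n_+j / n
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-squared statistic: sum over cells of (n_ij - mu_ij)^2 / mu_ij
chi_sq = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (I - 1) * (J - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(f"chi-squared = {chi_sq:.4f}, dof = {dof}")
```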


# <font color='blue'>Example</font>

## **Data Preparation**

- This is a fictitious dataset from a car insurance company.
- Column `OUTCOME` is 1 if a customer has filed an insurance claim, else 0.
- We will generate our target `DEFAULT` from column `CREDIT_SCORE`:
    - `DEFAULT = 1` (default) if `CREDIT_SCORE > 0.5`,
    - `DEFAULT = 0` (non-default) if `CREDIT_SCORE <= 0.5`.

In [11]:
# Import dataset from csv file
data = pd.read_csv('Car_Insurance.csv')

# Table check
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ID | 569520 | 750365 | 199901 | 478866 | 731664 |
| AGE | 65+ | 16-25 | 16-25 | 16-25 | 26-39 |
| GENDER | female | male | female | male | male |
| RACE | majority | majority | majority | majority | majority |
| DRIVING_EXPERIENCE | 0-9y | 0-9y | 0-9y | 0-9y | 10-19y |
| EDUCATION | high school | none | high school | university | none |
| INCOME | upper class | poverty | working class | working class | working class |
| CREDIT_SCORE | 0.629027 | 0.357757 | 0.493146 | 0.206013 | 0.388366 |
| VEHICLE_OWNERSHIP | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| VEHICLE_YEAR | after 2015 | before 2015 | before 2015 | before 2015 | before 2015 |


Now we create the target variable **DEFAULT** from `CREDIT_SCORE`, using the following rule:
 - If CREDIT_SCORE > 0.5, we assign 1
 - If CREDIT_SCORE <= 0.5, we assign 0

In [12]:
# Assign the default status: 1 if CREDIT_SCORE > 0.5, else 0
# (rows with a missing CREDIT_SCORE fail the comparison and also get 0)
data['DEFAULT'] = (data['CREDIT_SCORE'] > 0.5).astype(int)
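
The thresholding rule treats a missing `CREDIT_SCORE` as non-default, because a comparison with `NaN` is always false. The sketch below shows this on a small hypothetical series and, as one possible alternative (an assumption, not what the notebook does), keeps the missing values as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical credit scores, including one missing value
scores = pd.Series([0.63, 0.36, np.nan, 0.50])

# Rule used above: NaN > 0.5 is False, so the missing score becomes 0
default_rule = (scores > 0.5).astype(int)

# Alternative: keep missing scores as missing instead of coding them as 0
default_keep_na = default_rule.where(scores.notna())

print(default_rule.tolist())
print(default_keep_na.tolist())
```

Which behaviour is appropriate depends on the analysis; the rest of this notebook uses the first rule.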

In [13]:
# Table check
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| ID | 569520 | 750365 | 199901 | 478866 | 731664 |
| AGE | 65+ | 16-25 | 16-25 | 16-25 | 26-39 |
| GENDER | female | male | female | male | male |
| RACE | majority | majority | majority | majority | majority |
| DRIVING_EXPERIENCE | 0-9y | 0-9y | 0-9y | 0-9y | 10-19y |
| EDUCATION | high school | none | high school | university | none |
| INCOME | upper class | poverty | working class | working class | working class |
| CREDIT_SCORE | 0.629027 | 0.357757 | 0.493146 | 0.206013 | 0.388366 |
| VEHICLE_OWNERSHIP | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| VEHICLE_YEAR | after 2015 | before 2015 | before 2015 | before 2015 | before 2015 |


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   10000 non-null  int64  
 1   AGE                  10000 non-null  object 
 2   GENDER               10000 non-null  object 
 3   RACE                 10000 non-null  object 
 4   DRIVING_EXPERIENCE   10000 non-null  object 
 5   EDUCATION            10000 non-null  object 
 6   INCOME               10000 non-null  object 
 7   CREDIT_SCORE         9018 non-null   float64
 8   VEHICLE_OWNERSHIP    10000 non-null  float64
 9   VEHICLE_YEAR         10000 non-null  object 
 10  MARRIED              10000 non-null  float64
 11  CHILDREN             10000 non-null  float64
 12  POSTAL_CODE          10000 non-null  int64  
 13  ANNUAL_MILEAGE       9043 non-null   float64
 14  VEHICLE_TYPE         10000 non-null  object 
 15  SPEEDING_VIOLATIONS  10000 non-null  

The dataset has **10000** observations and **20 columns** (the 19 original variables plus the new `DEFAULT` column), but only a few of them will be used.

In [15]:
data1 = data[['GENDER','POSTAL_CODE','DEFAULT']]

In [16]:
data1.head()

| | GENDER | POSTAL_CODE | DEFAULT |
|---|---|---|---|
| 0 | female | 10238 | 1 |
| 1 | male | 10238 | 0 |
| 2 | female | 10238 | 0 |
| 3 | male | 32765 | 0 |
| 4 | male | 32765 | 0 |


Check for missing values.

In [17]:
data1.isna().sum()

GENDER         0
POSTAL_CODE    0
DEFAULT        0
dtype: int64

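There are no missing values in the selected columns. Note that `CREDIT_SCORE` itself has 982 missing values (9018 non-null out of 10000); because a missing score fails the `> 0.5` comparison, those rows were assigned `DEFAULT = 0`.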

## **1. Contingency Table**

### **Create a Contingency Table**

- You can use `pd.crosstab()` to create a Contingency Table.
- Example:

```python
contingency_table = pd.crosstab(row_data,
                                column_data,
                                margins,
                                normalize)
```

- Use `margins = True` if you want to add row and column subtotals.
- Use `normalize = True` if you want to normalize all values by dividing them by the grand total.
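
A minimal sketch of these options on a small hypothetical DataFrame (the column names `group` and `outcome` and the values are made up for illustration). It also uses `normalize='index'`, which divides each row by its own total and therefore gives conditional probabilities.

```python
import pandas as pd

# Hypothetical toy data (not the insurance dataset)
toy = pd.DataFrame({
    'group':   ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'outcome': [1, 0, 1, 0, 0, 1, 0, 0],
})

# Counts with row and column subtotals
counts = pd.crosstab(toy['group'], toy['outcome'], margins=True)

# Joint probabilities: every cell divided by the grand total
joint = pd.crosstab(toy['group'], toy['outcome'], normalize=True)

# Conditional probabilities of outcome within each group (rows sum to 1)
row_cond = pd.crosstab(toy['group'], toy['outcome'], normalize='index')

print(counts)
print(joint)
print(row_cond)
```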

**Questions**

- We want to answer several questions.
  - What is the probability that a customer is female and defaults?
  - What is the probability of default given that the customer is female?
  - What is the odds ratio of default between female and male debtors?
  - What is the relative risk of default for females?
  - What is the relative risk of default for males?

**1. What is the probability that a customer is female and defaults?**

In [18]:
# Contingency table for default by gender
crosstab_gender = pd.crosstab(
            data1['GENDER'],
            data1['DEFAULT'],
            margins = True

)

In [19]:
crosstab_gender

| GENDER \ DEFAULT | 0 | 1 | All |
|---|---|---|---|
| female | 2324 | 2686 | 5010 |
| male | 2568 | 2422 | 4990 |
| All | 4892 | 5108 | 10000 |


In [20]:
#P(female and default)---> Joint Probability
2686/10000

0.2686

In [21]:
# Normalized contingency table (joint probabilities) for default by gender
crosstab_gender_normalize = pd.crosstab(
            data1['GENDER'],
            data1['DEFAULT'],
            margins = True,
            normalize = True

)

In [22]:
crosstab_gender_normalize

| GENDER \ DEFAULT | 0 | 1 | All |
|---|---|---|---|
| female | 0.2324 | 0.2686 | 0.501 |
| male | 0.2568 | 0.2422 | 0.499 |
| All | 0.4892 | 0.5108 | 1.0 |


In [23]:
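female_and_default = crosstab_gender[1]['female']
# Grand total is the 'All'/'All' cell; summing the whole table with margins
# would count every observation four times (40000 instead of 10000)
margin_total = crosstab_gender.loc['All', 'All']

female_and_default

2686

In [24]:
margin_total

10000

In [25]:
p_female_and_default = female_and_default/margin_total

In [26]:
p_female_and_default

0.2686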

**2. What is the probability of default given that the customer is female?**

P(default | female) --> Conditional Probability

In [27]:
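# Rebuild the table without margins; crosstab_gender is rebound as well,
# and the later cells rely on this margin-free version
crosstab_gender_2 = crosstab_gender = pd.crosstab(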
            data1['GENDER'],
            data1['DEFAULT'],
            margins = False

)

In [28]:
crosstab_gender_2

| GENDER \ DEFAULT | 0 | 1 |
|---|---|---|
| female | 2324 | 2686 |
| male | 2568 | 2422 |


In [29]:
# P(default | female) = n(female and default) / n(female)
2686/5010

0.536127744510978

In [30]:
female_and_default = crosstab_gender_2[1]['female'] # 2686
total_female = crosstab_gender_2.iloc[0].sum() # 5010

p_default_given_female = female_and_default/total_female
p_default_given_female

0.536127744510978

In [31]:
crosstab_gender_2.iloc[0].sum() # Compute sum of female regardless default or not

5010

In [32]:
female_and_default = crosstab_gender_2[1]['female'] # Number of females who default

In [33]:
female_and_default

2686
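
The same conditional probabilities can be obtained for both genders at once by dividing each row of the table by its row total. The sketch below rebuilds the table from the counts shown above so that it runs on its own; the variable names are our own.

```python
import pandas as pd

# Counts copied from the contingency table above
# (rows: GENDER, columns: DEFAULT)
table = pd.DataFrame([[2324, 2686],
                      [2568, 2422]],
                     index=['female', 'male'], columns=[0, 1])

# Divide each row by its row total: P(DEFAULT = j | GENDER = i)
p_default_given_gender = table.div(table.sum(axis=1), axis=0)

print(p_default_given_gender)
print(f"P(default | female) = {p_default_given_gender.loc['female', 1]:.4f}")
print(f"P(default | male)   = {p_default_given_gender.loc['male', 1]:.4f}")
```

On the original data, `pd.crosstab(data1['GENDER'], data1['DEFAULT'], normalize='index')` should give the same table directly.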

**3. What is the odds ratio of default between female and male debtors?**

In [65]:
# Display the contingency table
crosstab_gender

| GENDER \ DEFAULT | 0 | 1 |
|---|---|---|
| female | 2324 | 2686 |
| male | 2568 | 2422 |


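In [70]:
# Needs relative_risk (In [49]) and relative_risk_male (In [57]); their ratio equals the odds ratio, so this is log(OR)
WOE_female = math.log(relative_risk/relative_risk_male)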

In [71]:
WOE_female

0.20329699980819546

In [35]:
# Number of observations in each group
female_and_default = crosstab_gender[1]['female']
female_and_non_default = crosstab_gender[0]['female']

# Calculate the odds of default for female
odds_female = female_and_default/female_and_non_default

In [36]:
print(f"Odds female is {odds_female:.2f}.")

Odds female is 1.16.


In [37]:
# Number of observations in each group
male_and_default = crosstab_gender[1]['male']
male_and_non_default = crosstab_gender[0]['male']

# Calculate the odds of default for male
odds_male = male_and_default/male_and_non_default

In [38]:
print(f"Odds male is {odds_male:.2f}.")

Odds male is 0.94.


In [39]:
# Calculate odds ratio between female and male
odds_ratio = odds_female/ odds_male

In [40]:
print(f"Odds ratio between female and male is {odds_ratio:.2f}.")

Odds ratio between female and male is 1.23.


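**What does it mean?**
- The female odds of default (1.16) are higher than the male odds (0.94).
- The female odds are greater than 1, so a female debtor is more likely to default than not; the male odds are below 1.
- The odds ratio is greater than 1, so the odds of default are about 1.23 times higher for female debtors than for male debtors.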
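
For a 2x2 table, the odds ratio can also be computed in one step as the cross-product ratio of the four cell counts. The sketch below uses the counts from the table above; the names `a`, `b`, `c`, `d` are our own labels for the cells.

```python
import math

# Cell counts from the GENDER x DEFAULT table above
a, b = 2686, 2324  # female: default, non-default
c, d = 2422, 2568  # male:   default, non-default

# Cross-product form of the odds ratio: (a/b) / (c/d) = (a*d) / (b*c)
odds_ratio_cross = (a * d) / (b * c)

# The log odds ratio is 0 when the odds are equal in both groups
log_odds_ratio = math.log(odds_ratio_cross)

print(f"Odds ratio     : {odds_ratio_cross:.2f}")
print(f"Log odds ratio : {log_odds_ratio:.4f}")
```

The log odds ratio obtained this way agrees with the value stored in `WOE_female` earlier in this section.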

**4. What is the relative risk of default for females?**

In [41]:
# Display the contingency table
crosstab_gender

| GENDER \ DEFAULT | 0 | 1 |
|---|---|---|
| female | 2324 | 2686 |
| male | 2568 | 2422 |


In [42]:
# Calculate the total number of defaults (female + male)
total_default_male_female = crosstab_gender[1]['female'] + crosstab_gender[1]['male']

In [43]:
total_default_male_female

5108

In [44]:
# Number of female customers who default
crosstab_gender[1]['female']

2686

In [45]:
# Number of male customers who default
crosstab_gender[1]['male']

2422

In [46]:
# Calculate the conditional probability female given default
p_female_given_default = crosstab_gender[1]['female']/total_default_male_female
print(f"Conditional probability of female given default = {p_female_given_default:.2f}.")

Conditional probability of female given default = 0.53.


In [47]:
# Total number of non-defaults (female + male)
total_no_default_male_female = crosstab_gender[0]['female'] + crosstab_gender[0]['male']
total_no_default_male_female

4892

In [48]:
p_female_given_no_default = crosstab_gender[0]['female'] /total_no_default_male_female
print(f"Conditional probability of female given no default = {p_female_given_no_default:.2f}.")

Conditional probability of female given no default = 0.48.


In [49]:
# Calculate Relative Risk Default for female
relative_risk = p_female_given_default / p_female_given_no_default
print(f"relative_risk_female_default = {relative_risk:.2f}.")

relative_risk_female_default = 1.11.


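- The ratio is greater than 1: females make up a larger share of defaulters (0.53) than of non-defaulters (0.48).
- So female debtors are over-represented among defaults. Note that this ratio conditions on the outcome, i.e. it is $P(\text{female} \mid \text{default}) / P(\text{female} \mid \text{no default})$, whereas the relative risk $\pi_1/\pi_2$ defined in the review compares $P(\text{default} \mid \text{female})$ with $P(\text{default} \mid \text{male})$.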
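
For comparison, here is a minimal sketch of the relative risk following the definition $\pi_1/\pi_2$ from the review section, taking $\pi_1 = P(\text{default} \mid \text{female})$ and $\pi_2 = P(\text{default} \mid \text{male})$. The counts come from the contingency table above; the variable names are our own.

```python
# Counts from the GENDER x DEFAULT table above
n_female_default, n_female = 2686, 2324 + 2686
n_male_default, n_male = 2422, 2568 + 2422

# Probability of default within each gender
p_default_female = n_female_default / n_female
p_default_male = n_male_default / n_male

# Relative risk of default, female relative to male, and the reverse
rr_female_vs_male = p_default_female / p_default_male
rr_male_vs_female = p_default_male / p_default_female

print(f"P(default | female) = {p_default_female:.4f}")
print(f"P(default | male)   = {p_default_male:.4f}")
print(f"Relative risk (female vs male) = {rr_female_vs_male:.2f}")
print(f"Relative risk (male vs female) = {rr_male_vs_female:.2f}")
```

With this definition the two relative risks are reciprocals of each other, and both lead to the same qualitative conclusion as the ratios computed in the cells above.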

**5. What is the relative risk of default for males?**

In [50]:
# Display the contingency table
crosstab_gender

| GENDER \ DEFAULT | 0 | 1 |
|---|---|---|
| female | 2324 | 2686 |
| male | 2568 | 2422 |


In [51]:
# Calculate the conditional probability
p_male_given_default = 2422/(2686 + 2422)
p_male_given_non_default = 2568/(2324 + 2568)

print(f"Conditional probability of male given default = {p_male_given_default:.2f}.")
print(f"Conditional probability of male given non default = {p_male_given_non_default:.2f}.")

Conditional probability of male given default = 0.47.
Conditional probability of male given non default = 0.52.


In [52]:
# Calculate the total number of defaults (female + male)
total_default_male_female = crosstab_gender[1]['female'] + crosstab_gender[1]['male']

In [53]:
# Number of male customers who default
crosstab_gender[1]['male']

2422

In [54]:
# Calculate the conditional probability of male given default
p_male_given_default = crosstab_gender[1]['male']/total_default_male_female
print(f"Conditional probability of male given default = {p_male_given_default:.2f}.")

Conditional probability of male given default = 0.47.


In [55]:
# Total number of non-defaults (female + male)
total_no_default_male_female = crosstab_gender[0]['female'] + crosstab_gender[0]['male']
total_no_default_male_female

4892

In [56]:
p_male_given_no_default = crosstab_gender[0]['male'] /total_no_default_male_female
print(f"Conditional probability of male given no default = {p_male_given_no_default:.2f}.")

Conditional probability of male given no default = 0.52.


In [57]:
# Calculate Relative Risk Default for male
relative_risk_male = p_male_given_default / p_male_given_no_default
print(f"relative_risk_male_default = {relative_risk_male :.2f}.")

relative_risk_male_default = 0.90.


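- The ratio is less than 1: males make up a smaller share of defaulters (0.47) than of non-defaulters (0.52).
- So male debtors are under-represented among defaults, i.e. their default risk is lower than that of female debtors.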

## **2. Test of Independence**


- You need a contingency_table from `pd.crosstab()` to run the test.
- Run the Chi-sq test of independence:

```python
stats.chi2_contingency(contingency_table)
```

- The output from the test returns:
  - `statistic` : the test statistic.
  - `pvalue` : the p-value of the test.
  - `dof` : the degrees of freedom $(I-1) \times (J-1)$
  - `expected_freq`: the expected frequency for each cell in contingency table.
  $$
  \mu _{ij}=\frac{(n_{i+})(n_{+j})}{n}
  $$
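
One detail worth knowing: for a 2x2 table (one degree of freedom), `chi2_contingency` applies Yates' continuity correction by default, so its statistic is slightly smaller than the plain Pearson formula. The sketch below, which rebuilds the gender table from the counts shown earlier so that it runs on its own, compares the default call with `correction=False` and with a manual computation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the GENDER x DEFAULT table (rows: female, male; columns: 0, 1)
observed = np.array([[2324, 2686],
                     [2568, 2422]])

# Default call: Yates' continuity correction is applied because dof = 1
stat_corr, p_corr, dof, expected = chi2_contingency(observed)

# correction=False gives the plain Pearson statistic
stat_plain, p_plain, _, _ = chi2_contingency(observed, correction=False)

# Manual Pearson statistic from the expected frequencies
stat_manual = ((observed - expected) ** 2 / expected).sum()

print(f"With correction    : {stat_corr:.4f} (p = {p_corr:.2e})")
print(f"Without correction : {stat_plain:.4f} (p = {p_plain:.2e})")
print(f"Manual Pearson     : {stat_manual:.4f}, dof = {dof}")
```

For tables larger than 2x2 the correction is not applied, so the two calls return the same statistic.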

---
**1. Is the probability of default dependent on gender?**

State the hypotheses:
- **Null Hypothesis:**
  - The probability of default is independent of gender.
- **Alternative Hypothesis:**
  - The probability of default is not independent of gender.

- **Rejection Region:**

$$
\begin{align*}
\chi^{2} &> \chi^{2}_{\alpha} \\
&\text{or} \\
P_{\text{value}} &< \alpha
\end{align*}
$$

In [58]:
# Load package
import scipy.stats as stats
from scipy.stats import chi2_contingency

In [59]:
# Test whether probability of DEFAULT is independent of GENDER
stats.chi2_contingency(crosstab_gender)

Chi2ContingencyResult(statistic=25.57193334363074, pvalue=4.2619373946694464e-07, dof=1, expected_freq=array([[2450.892, 2559.108],
       [2441.108, 2548.892]]))

In [60]:
# Result of independence test
result = stats.chi2_contingency(crosstab_gender)

# Extract the test result
stat = result[0]
pval = result[1]

print(f"Chi Square   : {stat:.4f}")
print(f"p-value  : {pval:.4e}")

Chi Square   : 25.5719
p-value  : 4.2619e-07


Yes, with $\alpha=0.05$, the probability of default is dependent on gender.
  - Since the p-value ($\approx 4.26 \times 10^{-7}$) is less than $\alpha$, we reject the null hypothesis at $\alpha=0.05$.
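
The same decision can be reached through the other form of the rejection region, by comparing the statistic with the critical value $\chi^{2}_{\alpha}$. A minimal sketch using `scipy.stats.chi2`, with the statistic and degrees of freedom copied from the output above:

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1
stat = 25.57193334363074  # statistic reported above for GENDER vs DEFAULT

# Critical value: the (1 - alpha) quantile of the chi-squared distribution
critical_value = chi2.ppf(1 - alpha, dof)

# Upper-tail probability of the statistic, i.e. the p-value
p_value = chi2.sf(stat, dof)

reject_null = stat > critical_value

print(f"Critical value : {critical_value:.4f}")
print(f"p-value        : {p_value:.2e}")
print(f"Reject H0      : {reject_null}")
```

Both criteria always agree, because the statistic exceeds the critical value exactly when the p-value is below $\alpha$.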

**2. Is the probability of default dependent on postal code?**

State the hypotheses:
- **Null Hypothesis:**
  - The probability of default is independent of postal code.
- **Alternative Hypothesis:**
  - The probability of default is not independent of postal code.

In [61]:
crosstab_postal = pd.crosstab(
            data1['POSTAL_CODE'],
            data1['DEFAULT'],
            margins = False

)

In [62]:
crosstab_postal

| POSTAL_CODE \ DEFAULT | 0 | 1 |
|---|---|---|
| 10238 | 3416 | 3524 |
| 21217 | 50 | 70 |
| 32765 | 1194 | 1262 |
| 92101 | 232 | 252 |


In [63]:
# Test whether probability of DEFAULT is independent of POSTAL
stats.chi2_contingency(crosstab_postal)

Chi2ContingencyResult(statistic=3.0590346365281826, pvalue=0.3826124539284613, dof=3, expected_freq=array([[3395.048 , 3544.952 ],
       [  58.704 ,   61.296 ],
       [1201.4752, 1254.5248],
       [ 236.7728,  247.2272]]))

In [64]:
# Result of independence test
result = stats.chi2_contingency(crosstab_postal)

# Extract the test result
stat = result[0]
pval = result[1]

print(f"Chi Square   : {stat:.4f}")
print(f"p-value  : {pval:.4f}")

Chi Square   : 3.0590
p-value  : 0.3826


No, with $\alpha=0.05$, we do not find evidence that the probability of default depends on postal code.
  - Since the p-value = 0.3826 > $\alpha$, we fail to reject the null hypothesis of independence at $\alpha=0.05$.