# Odds, Odds Ratios, and Their Logs

- StatQuest Videos
    - https://youtu.be/ARfXDSkQf1Y?si
    - https://youtu.be/8nm0G-1uJzA?si

Odds are a way of representing probability. They are calculated as the ratio of the probability that an event happens to the probability that it does *not* happen.

$$
\text{Odds} = \frac{\text{Probability of an event occurring}}{\text{Probability of the event NOT occurring}}
$$

#### Example (Team Winning):

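- If there are 2 chances to win and 4 chances to lose, the odds of winning are:
- `Odds(Winning) = 2 / 4 = 0.5`
- Equivalently, in terms of probabilities: `(2/6) / (4/6) = 0.5`. Note that the *probability* of winning is `2 / 6 ≈ 0.33`, not 0.5.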

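**Key Property:** Odds range from 0 to infinity, which is an asymmetric scale: odds against the event are squeezed between 0 and 1, while odds in favor run from 1 to infinity. Taking the **logarithm of the odds**, `log(odds)`, maps this onto a scale that is symmetric around 0 (for example, `log(2/4) ≈ -0.69` and `log(4/2) ≈ +0.69`), which is useful for statistical modeling.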
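As a rough illustration of the definition and the key property, the sketch below computes the odds for the team example from counts and from a probability, then takes the natural log. The helper names `odds_from_counts` and `odds_from_probability` are made up for this example and do not come from the videos.

```python
import math

def odds_from_counts(for_count, against_count):
    """Odds = (times the event happens) / (times it does not happen)."""
    return for_count / against_count

def odds_from_probability(p):
    """Odds = p / (1 - p), for a probability 0 <= p < 1."""
    return p / (1 - p)

# Team example: 2 chances to win, 4 chances to lose
odds_win = odds_from_counts(2, 4)    # odds in favor of winning
odds_lose = odds_from_counts(4, 2)   # odds in favor of losing

# The same odds, starting from the probability of winning (2 out of 6)
p_win = 2 / (2 + 4)
odds_win_from_p = odds_from_probability(p_win)

# On the log scale the two opposite outcomes mirror each other around 0
log_odds_win = math.log(odds_win)
log_odds_lose = math.log(odds_lose)

print(f"odds(win)  = {odds_win:.3f}, log(odds) = {log_odds_win:+.3f}")
print(f"odds(lose) = {odds_lose:.3f}, log(odds) = {log_odds_lose:+.3f}")
```

The two log(odds) values have the same magnitude and opposite signs, which is the symmetry the raw odds lack.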

## Odds Ratios

The core concept is simple: an **Odds Ratio (OR)** is a **ratio of two different odds**. It compares the odds of an event occurring in one group to the odds of it occurring in another group. This helps quantify how a certain factor (like a treatment or a risk factor) affects the odds of an outcome.

$$
\text{Odds Ratio} = \frac{\text{Odds of event in Group 1}}{\text{Odds of event in Group 2}}
$$

##### **Properties of the Odds Ratio:**

*   **Range:** Like odds, the odds ratio has an asymmetric range from 0 to infinity.
    *   **OR < 1:** The odds of the event are lower in the first group.
    *   **OR = 1:** The odds of the event are the same in both groups (no association).
    *   **OR > 1:** The odds of the event are higher in the first group.
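*   **Symmetry Issue:** An odds ratio of 0.25 (a four-fold decrease) and an odds ratio of 4 (a four-fold increase) describe effects of the same size, but they are not equally far from 1: all decreases are squeezed between 0 and 1, while increases range from 1 to infinity.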

---

### The Log(Odds Ratio)

To address the asymmetry of the odds ratio, we can take its natural logarithm, which makes the scale symmetrical around 0.
*   `log(OR) < 0`: The odds of the event are lower in the first group.
*   `log(OR) = 0`: There is no difference in odds between the groups.
*   `log(OR) > 0`: The odds of the event are higher in the first group.

**Symmetry:** A log(OR) of -1.5 is perfectly symmetrical to a log(OR) of +1.5. This property makes it very useful in statistical models like logistic regression.
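A small sketch of this symmetry, using the four-fold decrease and four-fold increase mentioned earlier as illustrative values (they are not taken from a dataset):

```python
import math

or_decrease = 0.25  # four-fold decrease in odds
or_increase = 4.0   # four-fold increase in odds

# On the raw scale the two effects sit at different distances from "no effect" (OR = 1)
dist_decrease = abs(or_decrease - 1)
dist_increase = abs(or_increase - 1)

# On the log scale both sit at the same distance from "no effect" (log(OR) = 0)
log_decrease = math.log(or_decrease)
log_increase = math.log(or_increase)

print(f"Raw distance from 1: {dist_decrease:.2f} vs {dist_increase:.2f}")
print(f"log(OR):             {log_decrease:+.3f} vs {log_increase:+.3f}")
```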

## Example: Gene Mutation and Cancer

The video uses a contingency table to illustrate the calculation and interpretation of an odds ratio.

| | **Has Cancer (Yes)** | **Has Cancer (No)** | **Total** |
| :--- | :---: | :---: | :---: |
| **Has the Mutated Gene (Yes)** | 23 | 117 | **140** |
| **Has the Mutated Gene (No)** | 6 | 210 | **216** |
| **Total** | **29** | **327** | **356** |

**Question:** Is having the mutated gene associated with an increased odds of having cancer?

1.  **Calculate the odds of having cancer for the group WITH the mutated gene:**
    *   `Odds(Cancer | Gene) = (Has Cancer) / (No Cancer) = 23 / 117 ≈ 0.20`

2.  **Calculate the odds of having cancer for the group WITHOUT the mutated gene:**
    *   `Odds(Cancer | No Gene) = (Has Cancer) / (No Cancer) = 6 / 210 ≈ 0.03`

3.  **Calculate the Odds Ratio:**
    *   `OR = Odds(Cancer | Gene) / Odds(Cancer | No Gene)`
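    *   `OR = (23 / 117) / (6 / 210) = (23 × 210) / (117 × 6) ≈ 6.88` (dividing the rounded odds, `0.20 / 0.03`, would give about 6.67, so use the unrounded values)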

4.  **Calculate the Log(Odds Ratio):**
    *   `log(OR) = log(6.88) ≈ 1.93`

**Interpretation of the Odds Ratio (6.88):**

> The odds of having cancer are **6.88 times greater** for someone with the mutated gene compared to someone without it.

This shows a strong relationship or **effect size**. However, it doesn't tell us if this result is statistically significant.

In [1]:
import numpy as np

# Create the 2x2 contingency table from the video
# Rows: Has Mutated Gene (Yes, No)
# Columns: Has Cancer (Yes, No)
table = np.array([
    [23, 117],
    [6, 210]
])

table

array([[ 23, 117],
       [  6, 210]])
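The four calculation steps from the example can be reproduced with this table. The following is a minimal sketch; the variable names are chosen for illustration, and the table is redefined so the snippet runs on its own.

```python
import numpy as np

# Rows: Has Mutated Gene (Yes, No); Columns: Has Cancer (Yes, No)
table = np.array([
    [23, 117],
    [6, 210]
])

# Steps 1 and 2: odds of cancer within each row (cancer / no cancer)
odds_per_group = table[:, 0] / table[:, 1]
odds_with_gene, odds_without_gene = odds_per_group

# Step 3: odds ratio, computed from the unrounded odds
odds_ratio = odds_with_gene / odds_without_gene

# Equivalent cross-product form: (a * d) / (b * c)
(a, b), (c, d) = table
cross_product_or = (a * d) / (b * c)

# Step 4: natural log of the odds ratio
log_odds_ratio = np.log(odds_ratio)

print(f"Odds (gene):    {odds_with_gene:.4f}")
print(f"Odds (no gene): {odds_without_gene:.4f}")
print(f"Odds ratio:     {odds_ratio:.4f}")
print(f"log(OR):        {log_odds_ratio:.4f}")
```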

## Determining Statistical Significance

An odds ratio tells you the *magnitude* of an effect, similar to how R-squared indicates the strength of a correlation. To determine if this effect is statistically significant (i.e., not just due to random chance), we need to calculate a **p-value**.

There are three common statistical tests used for this:

1.  **Fisher's Exact Test**
2.  **Chi-Square Test**
3.  **The Wald Test**

### Fisher's Exact Test

Fisher's test is called "exact" because it calculates the precise probability of observing the specific cell counts in our 2x2 table, given the fixed row and column totals (the "marginals"). It doesn't rely on approximations. It's considered the gold standard for 2x2 contingency tables, especially when sample sizes are small or when some cells have very low counts (e.g., less than 5). In our example the smallest cell count is 6, which is close to that threshold, so Fisher's test is a safe choice.

The test uses the hypergeometric distribution to calculate the probability of every possible 2x2 table that could be formed with the same row and column totals. It then sums the probabilities of all tables that show an association as strong as or stronger than the one we observed.
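The procedure described above can be sketched by hand with `scipy.stats.hypergeom`. This is an illustrative reconstruction, not necessarily how SciPy implements the test internally: it assumes the common two-sided rule of summing the probabilities of every table that is no more likely than the observed one, and the small tolerance factor is an arbitrary choice to guard against floating-point ties.

```python
import numpy as np
from scipy.stats import hypergeom, fisher_exact

table = np.array([
    [23, 117],
    [6, 210]
])

row1_total = table[0].sum()      # people with the mutated gene
col1_total = table[:, 0].sum()   # people with cancer
grand_total = table.sum()        # everyone in the study

# With the margins fixed, the top-left cell determines the whole table.
# hypergeom(M, n, N): population size M, n "successes" in it, N draws.
dist = hypergeom(grand_total, col1_total, row1_total)

# All values the top-left cell can take given the margins
lowest = max(0, row1_total + col1_total - grand_total)
highest = min(row1_total, col1_total)
possible_counts = np.arange(lowest, highest + 1)
probabilities = dist.pmf(possible_counts)

# Two-sided p-value: sum over tables no more probable than the observed one
p_observed = dist.pmf(table[0, 0])
p_manual = probabilities[probabilities <= p_observed * (1 + 1e-7)].sum()

_, p_scipy = fisher_exact(table)
print(f"Manual p-value: {p_manual:.10f}")
print(f"SciPy p-value:  {p_scipy:.10f}")
```

The two p-values should agree closely, which suggests the summation rule assumed here matches the default two-sided behavior of `fisher_exact` for this table.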

In [3]:
from scipy.stats import fisher_exact

# The function returns the odds ratio and the p-value
odds_ratio, p_value = fisher_exact(table)

print(f"Observed Odds Ratio: {odds_ratio:.4f}")
print(f"P-value from Fisher's Exact Test: {p_value:.10f}")

Observed Odds Ratio: 6.8803
P-value from Fisher's Exact Test: 0.0000109870


The p-value is extremely small (much less than 0.05). This means there is only a 0.001% chance of seeing an association this strong if the gene and cancer were truly unrelated. We confidently **reject the null hypothesis** and conclude there is a significant association.

### Chi-Square (χ²) Test

The chi-square test is an approximate test that compares the observed frequencies in each cell of the table to the "expected" frequencies we would see if the null hypothesis (no association) were true. It's very common and computationally fast. It works well when sample sizes are large and all expected cell counts are at least 5.

**How it Works:**
1.  **Calculate Expected Values:** For each cell, the expected count is `(Row Total * Column Total) / Grand Total`. For the top-left cell (Has Gene, Has Cancer), the expected value would be `(140 * 29) / 356 ≈ 11.4`.
2.  **Calculate the Chi-Square Statistic:** It measures the total difference between observed (O) and expected (E) counts across all cells: `χ² = Σ [ (O - E)² / E ]`. A larger `χ²` value means a bigger discrepancy from what we'd expect under the null hypothesis.
3.  **Find the p-value:** The `χ²` statistic is compared to a known Chi-Square distribution to determine the p-value.

**Yates' Continuity Correction:** For 2x2 tables, a "continuity correction" is often applied to make the Chi-Square approximation more accurate. It tends to be conservative, making the p-value slightly larger.
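The three steps, plus the continuity correction, can be worked through by hand as a check. The sketch below assumes the usual form of Yates' correction, which subtracts 0.5 from each absolute difference `|O - E|` before squaring; variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([
    [23, 117],
    [6, 210]
], dtype=float)

# Step 1: expected counts = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
expected = row_totals @ col_totals / observed.sum()

# Step 2: chi-square statistic, without and with Yates' correction
chi2_plain = ((observed - expected) ** 2 / expected).sum()
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# Step 3: p-value from the chi-square distribution
# Degrees of freedom for a 2x2 table: (rows - 1) * (columns - 1) = 1
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_plain = chi2.sf(chi2_plain, dof)
p_yates = chi2.sf(chi2_yates, dof)

print("Expected counts:")
print(expected.round(2))
print(f"Without correction: chi2 = {chi2_plain:.4f}, p = {p_plain:.6f}")
print(f"With correction:    chi2 = {chi2_yates:.4f}, p = {p_yates:.6f}")
```

These hand-computed statistics should match what `scipy.stats.chi2_contingency` reports for the same table with `correction=False` and `correction=True`.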

In [8]:
from scipy.stats import chi2_contingency

# Perform Chi-Square Test WITH continuity correction (default)
chi2_stat_corr, p_value_corr, _, _ = chi2_contingency(table, correction=True)
print(f"Chi-Square Test (with correction):")
print(f"  >  Chi-Square Statistic: {chi2_stat_corr:.4f}")
print(f"  >  P-value: {p_value_corr:.6f}")

# Perform Chi-Square Test WITHOUT continuity correction
chi2_stat_no_corr, p_value_no_corr, _, _ = chi2_contingency(table, correction=False)
print(f"\nChi-Square Test (without correction):")
print(f"  >  Chi-Square Statistic: {chi2_stat_no_corr:.4f}")
print(f"  >  P-value: {p_value_no_corr:.6f}")

Chi-Square Test (with correction):
  >  Chi-Square Statistic: 19.3694
  >  P-value: 0.000011

Chi-Square Test (without correction):
  >  Chi-Square Statistic: 21.1545
  >  P-value: 0.000004


Both p-values are incredibly small and lead to the same conclusion as Fisher's test: we **reject the null hypothesis**. The continuity correction gives a slightly more conservative (larger) p-value, but the difference is negligible here because the effect is so strong.

### The Wald Test

The Wald test is commonly used in the context of regression models, including logistic regression. It tests whether a model coefficient is statistically different from zero. In a simple logistic regression with one binary predictor, that coefficient is precisely the **log(odds ratio)**. It is the standard way significance is reported in logistic regression output.

**How it Works:**
1.  The null hypothesis is that the log(odds ratio) is 0.
2.  The Wald statistic is calculated by dividing the estimated log(odds ratio) by its standard error.
    `Wald Statistic (z) = log(OR) / StandardError(log(OR))`
3.  This z-score is then used to find a p-value from the standard normal distribution. A z-score greater than ~1.96 (or less than -1.96) corresponds to a two-sided p-value < 0.05.
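For a single 2x2 table, these steps can be carried out without fitting a model. The sketch below relies on one assumption that is not stated in these notes: the standard large-sample formula for the standard error of the log(odds ratio), `sqrt(1/a + 1/b + 1/c + 1/d)`, where `a`, `b`, `c`, `d` are the four cell counts.

```python
import math
from scipy.stats import norm

# Cell counts: a, b = gene (cancer, no cancer); c, d = no gene (cancer, no cancer)
a, b, c, d = 23, 117, 6, 210

# Estimated log(odds ratio)
log_or = math.log((a * d) / (b * c))

# Assumed large-sample standard error of log(OR)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Wald statistic: estimate divided by its standard error
z = log_or / se_log_or

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))

print(f"log(OR):         {log_or:.4f}")
print(f"Standard error:  {se_log_or:.4f}")
print(f"Wald z:          {z:.3f}")
print(f"Two-sided p:     {p_value:.6f}")
```

For a model with a single binary predictor, these values are expected to agree with the `has_gene` row of a logistic regression fit to the same data.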

In [2]:
import pandas as pd
import statsmodels.api as sm

# The four cells of the contingency table, one row per cell
cells = pd.DataFrame({
    'has_gene': [1, 1, 0, 0],    # 1 for Yes, 0 for No
    'has_cancer': [1, 0, 1, 0],  # 1 for Yes, 0 for No
    'count': [23, 117, 6, 210]   # The number of people in each cell
})

# sm.Logit fits one row per observation and does not use frequency weights,
# so expand the table to one row per person (356 rows in total)
data = cells.loc[cells.index.repeat(cells['count'])].reset_index(drop=True)

# Predict 'has_cancer' from 'has_gene'
y = data['has_cancer']
X = sm.add_constant(data[['has_gene']])  # Add an intercept to the model

# Fit the logistic regression model
result = sm.Logit(y, X).fit()

# Print the summary, which includes the Wald test results
print(result.summary())



The key line is for the `has_gene` variable:

|                    | **coef** | **std err** | **z**   | **P>\|z\|** |
|--------------------|----------|-------------|---------|-----------|
| **has_gene**       | **1.9287** | **0.473**   | **4.080** | **0.000** |


*   **coef (1.9287):** This is the **log(odds ratio)**. Note that `exp(1.9287) ≈ 6.88`, our odds ratio.
*   **std err (0.473):** This is the standard error of the log(odds ratio), calculated from the data.
*   **z (4.080):** This is the Wald statistic (`1.9287 / 0.4727`).
*   **P>|z| (0.000):** This is the **p-value** for the Wald test. The value is so small it's rounded to 0.000. It's actually about `0.000045`.

Again, this p-value is far below 0.05, so we **reject the null hypothesis**.