# Hypergeometric Test

You can interact with the notebook using Binder:
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/claudiavicente/TFG_MTISCR/main?filepath=Pipeline/hg_test.ipynb)

The goal of this notebook is to understand how the hypergeometric test calculates the probability of drawing a given number of successes, and to explore how changing the parameters affects that probability.

The probability mass function (PMF) of the hypergeometric distribution is:
$$
P(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}
$$
Where:
- `N`: Total number of items.
- `K`: Total number of items of interest.
- `n`: Number of items in the sample.
- `k`: Number of items in the sample that are also in the items of interest.
  
1. The cumulative distribution function (CDF) is $P(X \leq k)$, the probability of observing at most `k` successes.
2. The survival function (SF) is $P(X > k) = 1 - P(X \leq k)$, the probability of observing more than `k` successes. The probability of observing at least `k` successes, $P(X \geq k)$, is therefore the SF evaluated at `k - 1`.
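
The relationship between the PMF, CDF and SF can be checked numerically. The following is a minimal sketch in pure Python (standard library only), not the notebook's own code: the helper names `hypergeom_pmf`, `hypergeom_cdf` and `hypergeom_sf` are illustrative assumptions, and `N = 52`, `K = 13`, `n = 10`, `k = 4` are just example values.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k items of interest in a sample of size n."""
    if k < 0 or k > n:
        return 0.0
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible counts (e.g. k > K) get probability 0.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_cdf(k, N, K, n):
    """P(X <= k): at most k items of interest."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(0, k + 1))


def hypergeom_sf(k, N, K, n):
    """P(X > k): more than k items of interest (equals 1 - CDF)."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k + 1, n + 1))


# Example values (assumed for illustration)
N, K, n, k = 52, 13, 10, 4

at_most_k = hypergeom_cdf(k, N, K, n)       # P(X <= k)
more_than_k = hypergeom_sf(k, N, K, n)      # P(X > k)
at_least_k = hypergeom_sf(k - 1, N, K, n)   # P(X >= k)

print(f"P(X <= {k}) = {at_most_k:.4f}")
print(f"P(X >  {k}) = {more_than_k:.4f}")
print(f"P(X >= {k}) = {at_least_k:.4f}")
```

The "at least `k`" probability differs from the "more than `k`" probability by exactly the PMF at `k`, which is why the SF has to be shifted by one when `k` itself should be included.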

***

In [1]:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interact

Let's consider a standard deck of 52 cards, which contains 13 spades. Suppose we draw 10 cards from the deck without replacement and want to find the probability of drawing exactly 4 spades.

From the same distribution, we can also calculate the cumulative probability (CDF) of drawing up to `k` spades, and the probability of drawing more than `k` spades by using the survival function (SF = 1 - CDF).
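
As a worked check of the card example, substituting $N = 52$, $K = 13$, $n = 10$ and $k = 4$ into the PMF gives:
$$
P(X = 4) = \frac{\binom{13}{4} \binom{39}{6}}{\binom{52}{10}} = \frac{715 \times 3{,}262{,}623}{15{,}820{,}024{,}220} \approx 0.1475
$$
So roughly 15% of 10-card draws contain exactly 4 spades.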

In [2]:
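def hypergeometric_test(N, K, n, k):
    """Print and plot the PMF, CDF and SF of a hypergeometric distribution, highlighting k.

    N: population size, K: items of interest in the population,
    n: sample size, k: items of interest observed in the sample.
    """
    # Evaluate the three functions at every possible count 0..n
    x = np.arange(0, n+1)
    pmf_values = stats.hypergeom.pmf(x, N, K, n)
    cum_values = stats.hypergeom.cdf(x, N, K, n)
    sf_values = stats.hypergeom.sf(x, N, K, n)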

    prob = stats.hypergeom.pmf(k, N, K, n)
    print(f"The probability of drawing exactly {k} spades in {n} draws is: {prob:.4f}")
    plt.figure(figsize=(10, 6))
    plt.plot(x, pmf_values, 'o', ms=8, color = "#88E2FF", label='hypergeometric pmf')
    plt.vlines(x, 0, pmf_values, colors='#88E2FF', lw=5)
    plt.plot(k, prob, 'o', ms=8, color = "#FF0000", label='prob (pmf)')
    plt.xlabel('Number of spades drawn')
    plt.ylabel('Probability')
    plt.title('Hypergeometric Distribution - Drawing spades from a deck of cards')
    plt.legend()
    plt.grid()
    plt.show()

    cum_prob = stats.hypergeom.cdf(k, N, K, n)
    print(f"The cumulative probability of drawing up to {k} spades in {n} draws is: {cum_prob:.4f}")
    plt.figure(figsize=(10, 6))
    plt.plot(x, cum_values, 'o', ms=8, color = "#6EC6E2", label='hypergeometric cdf')
    plt.vlines(x, 0, cum_values, colors = "#6EC6E2", lw=5)
    plt.plot(k, cum_prob, 'o', ms=8, color = "#FF0000", label='prob (cdf)')
    plt.xlabel('Number of spades drawn')
    plt.ylabel('Cumulative distribution function (CDF)')
    plt.title('Cumulative distribution function - Drawing spades from a deck of cards')
    plt.legend()
    plt.grid()
    plt.show()
    
    surv_prob = stats.hypergeom.sf(k, N, K, n)
    print(f"The survival function (probability of drawing more than {k} spades, 1-CDF) is: {surv_prob:.4f}")
    plt.figure(figsize=(10, 6))
    plt.plot(x, sf_values, 'o', ms=8, color = "#54A9C4", label='hypergeometric sf')
    plt.vlines(x, 0, sf_values, colors='#54A9C4', lw=5)
    plt.plot(k, surv_prob, 'o', ms=8, color = "#FF0000", label='prob (sf)')
    plt.xlabel('Number of spades drawn')
    plt.ylabel('Survival Function (SF, 1-CDF)')
    plt.title('Survival Function - Drawing spades from a deck of cards')
    plt.legend()
    plt.grid()
    plt.show()

In [3]:
interact(hypergeometric_test,
         N = widgets.IntSlider(min = 10, max = 100, step = 1, value = 52, description = 'Total cards (N)'),
         K = widgets.IntSlider(min = 1, max = 52, step = 1, value = 13, description = 'Successes (K)'),
         n = widgets.IntSlider(min = 1, max = 52, step = 1, value = 10, description = 'Draws (n)'),
         k = widgets.IntSlider(min = 0, max = 52, step = 1, value = 4, description = 'Successes in draw (k)'));


In this project, the hypergeometric test is used to find enriched pathways in context-specific models. The parameters map as follows:

- `N`: All reactions in the metabolic model Recon3D.
- `K`: Reactions of Recon3D involved in the specific pathway.
- `n`: Reactions identified from the context-specific models.
- `k`: Reactions from the context-specific models that are involved in the specific pathway.

The p-value that determines which pathways are enriched is the probability of observing `k` or more items of interest in the "sample", $P(X \geq k)$. Because the SF gives $P(X > k)$, it is evaluated at `k - 1` (**SF: (k-1, N, K, n)**).

This p-value quantifies the probability of seeing at least `k` items of interest purely by chance under the null hypothesis. A low p-value indicates that the observed overlap between the sample and the pathway reactions is unlikely to be due to chance, suggesting significant enrichment.
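
The enrichment p-value described above can be sketched as an upper-tail sum. This is an illustrative, standard-library-only example rather than the project's pipeline: the function name `enrichment_p_value` is an assumption, and the reaction counts are hypothetical, not taken from Recon3D or from any context-specific model. It assumes `k >= 0`.

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """Upper-tail hypergeometric p-value, P(X >= k), assuming k >= 0."""
    upper = min(K, n)  # largest possible overlap between sample and pathway
    # Sum the PMF numerators as exact integers, then divide once.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical counts, chosen only for illustration
N = 10_000  # reactions in the full model
K = 100     # reactions annotated to the pathway
n = 500     # reactions in the context-specific model
k = 15      # of those, reactions that belong to the pathway

p_value = enrichment_p_value(N, K, n, k)
expected_overlap = n * K / N  # mean overlap under the null hypothesis

print(f"expected overlap: {expected_overlap:.1f}, observed: {k}, p-value: {p_value:.2e}")
```

With the imports used in this notebook, the same quantity is `stats.hypergeom.sf(k - 1, N, K, n)`. When many pathways are tested, the resulting p-values would normally also need a multiple-testing correction, which this sketch does not cover.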