In [1]:
%pip install discopula

Note: you may need to restart the kernel to use updated packages.


> Make sure the latest version of discopula is installed via `pip`. Details on the latest release are available at https://pypi.org/project/discopula/

If you run into issues related to `pip` or `scipy`, run the following upgrade commands in your terminal:

```
pip install --upgrade pip
pip install --upgrade scipy
```
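
To confirm which versions are actually installed before upgrading, a small check along these lines can help. This is a minimal sketch using only the standard library; the helper name `installed_version` is ours and is not part of discopula.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report the packages mentioned above; missing ones are flagged instead of raising
for name in ("discopula", "scipy", "pip"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```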

In [2]:
import numpy as np
from discopula import GenericCheckerboardCopula

# 2-Dimensional Case 

### Create Sample Contingency Table and Initialize the GenericCheckerboardCopula

The `GenericCheckerboardCopula` object can be initialized with a contingency table represented as a NumPy array. For a 2D contingency table:

- `axis=0`: First variable ($X_1$) with 5 categories
- `axis=1`: Second variable ($X_2$) with 3 categories

The axis indexing follows NumPy's convention, starting from the outermost dimension (0). As described in Joe, Liu & Manuguerra (2021), the variables are ordered such that:

- $X_1$ corresponds to the rows
- $X_2$ corresponds to the columns

This ordering matters when calculating measures of regression association between two variables. Later in this notebook, the 1-indexed variable numbers are passed directly to the methods that calculate association measures, regression, and predictions.
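
To make the axis convention concrete, the sketch below normalizes the same 5x3 counts used in the next cell into a joint probability matrix and sums it along each axis. It uses plain Python with exact fractions rather than discopula, so the variable names here are illustrative only.

```python
from fractions import Fraction

# Same counts as the 5x3 example table used in this notebook
table = [
    [0, 0, 20],
    [0, 10, 0],
    [20, 0, 0],
    [0, 10, 0],
    [0, 0, 20],
]

total = sum(sum(row) for row in table)

# Joint probabilities: each cell count divided by the grand total
P = [[Fraction(cell, total) for cell in row] for row in table]

# Marginal pmf of X1 (rows, axis 0) and of X2 (columns, axis 1)
pmf_x1 = [sum(row) for row in P]
pmf_x2 = [sum(col) for col in zip(*P)]

print([float(p) for p in pmf_x1])  # [0.25, 0.125, 0.25, 0.125, 0.25]
print([float(p) for p in pmf_x2])  # [0.25, 0.25, 0.5]
```

Summing over the columns leaves one probability per row category ($X_1$), and summing over the rows leaves one per column category ($X_2$).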

In [3]:
contingency_table = np.array([
    [0, 0, 20],
    [0, 10, 0],
    [20, 0, 0],
    [0, 10, 0],
    [0, 0, 20]
])
copula = GenericCheckerboardCopula.from_contingency_table(contingency_table)
print(f"Shape of the inferred joint probability matrix P: {copula.P.shape}")
print(f"Probability matrix P:\n{copula.P}")

Shape of the inferred joint probability matrix P: (5, 3)
Probability matrix P:
[[0.    0.    0.25 ]
 [0.    0.125 0.   ]
 [0.25  0.    0.   ]
 [0.    0.125 0.   ]
 [0.    0.    0.25 ]]


### Calculating the Checkerboard Copula Scores and their Variances

In [4]:
# Calculate and display scores for both axes
scores_X1 = copula.calculate_scores(1)
scores_X2 = copula.calculate_scores(2)

print("Checkerboard Copula Scores for X1:")
print(scores_X1)
# Expected: [0.125, 0.3125, 0.5, 0.6875, 0.875]

print("\nCheckerboard Copula Scores for X2:")
print(scores_X2)
# Expected: [0.125, 0.375, 0.75]

# Calculate and display variance of scores
variance_S_X1 = copula.calculate_variance_S(1)
variance_S_X2 = copula.calculate_variance_S(2)

print("\nVariance of Checkerboard Copula Scores for X1:", variance_S_X1)
# Expected: 81/1024 = 0.0791015625
print("Variance of Checkerboard Copula Scores for X2:", variance_S_X2)
# Expected: 9/128 = 0.0703125 

Checkerboard Copula Scores for X1:
[np.float64(0.125), np.float64(0.3125), np.float64(0.5), np.float64(0.6875), np.float64(0.875)]

Checkerboard Copula Scores for X2:
[np.float64(0.125), np.float64(0.375), np.float64(0.75)]

Variance of Checkerboard Copula Scores for X1: 0.0791015625
Variance of Checkerboard Copula Scores for X2: 0.0703125
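
The expected values above can be checked by hand. The sketch below assumes each score is the midpoint between consecutive values of the marginal CDF, and that the variance is taken under the marginal pmf; this reproduces the printed numbers, but it is our reading of the calculation and not discopula's source code.

```python
from fractions import Fraction


def scores_and_variance(pmf):
    """Midpoint scores of the marginal CDF and their variance under pmf."""
    scores, cdf_prev = [], Fraction(0)
    for p in pmf:
        cdf_next = cdf_prev + p
        scores.append((cdf_prev + cdf_next) / 2)
        cdf_prev = cdf_next
    mean = sum(p * s for p, s in zip(pmf, scores))  # works out to 1/2
    variance = sum(p * (s - mean) ** 2 for p, s in zip(pmf, scores))
    return scores, variance


# Marginal pmfs of the 5x3 example (row sums and column sums of P)
pmf_x1 = [Fraction(1, 4), Fraction(1, 8), Fraction(1, 4), Fraction(1, 8), Fraction(1, 4)]
pmf_x2 = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

scores_x1, var_x1 = scores_and_variance(pmf_x1)
scores_x2, var_x2 = scores_and_variance(pmf_x2)
print([float(s) for s in scores_x1], var_x1)  # [0.125, 0.3125, 0.5, 0.6875, 0.875] 81/1024
print([float(s) for s in scores_x2], var_x2)  # [0.125, 0.375, 0.75] 9/128
```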


### Category Prediction Using Checkerboard Copula Regression (CCR)

Besides measuring regression association with CCRAM, the Checkerboard Copula Regression (CCR) can be used to predict the category of the response variable. The `get_category_predictions_multi()` method:

- Predicts the category of the response variable (passed as the `response` argument) for each combination of predictor categories (variables listed in the `predictors` argument)
- Returns the predictions in an easy-to-read DataFrame format
- Optionally accepts custom axis names (the `axis_names` argument) for easier interpretation

In [5]:
predictions_X1_to_X2 = copula.get_category_predictions_multi(predictors=[1], response=2)
print("\nPredictions from X1 to X2:")
print(predictions_X1_to_X2)

# Example: Showcasing the use of custom axis names for the output
axis_to_name_dict = {1: "Income Bracket", 2: "Education Level"}
predictions_Education_to_Income = copula.get_category_predictions_multi(predictors=[2], response=1, axis_names=axis_to_name_dict)
print("\nPredictions from Education Level to Income Bracket:")
print(predictions_Education_to_Income)


Predictions from X1 to X2:
   X1 Category  Predicted X2 Category
0            1                      3
1            2                      2
2            3                      1
3            4                      2
4            5                      3

Predictions from Education Level to Income Bracket:
   Education Level Category  Predicted Income Bracket Category
0                         1                                  3
1                         2                                  3
2                         3                                  3
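
The predictions above can be reproduced with a few lines of plain Python. This sketch assumes the regression value for a predictor category is the conditional mean of the response scores, and that the predicted category is the one whose marginal CDF interval $(u_{j-1}, u_j]$ contains that value. The function name `ccr_predictions` and the boundary rule are our assumptions, not discopula's implementation.

```python
from fractions import Fraction


def ccr_predictions(table):
    """Predict a column category (1-indexed) for each row category of a 2D count table.

    Assumes every row has at least one observation.
    """
    total = sum(map(sum, table))
    col_pmf = [Fraction(sum(col), total) for col in zip(*table)]
    # Marginal CDF breakpoints u_0 = 0 <= u_1 <= ... and midpoint scores
    cdf = [Fraction(0)]
    for p in col_pmf:
        cdf.append(cdf[-1] + p)
    scores = [(lo + hi) / 2 for lo, hi in zip(cdf, cdf[1:])]

    predictions = []
    for row in table:
        n = sum(row)
        r = sum(Fraction(c, n) * s for c, s in zip(row, scores))  # E[S | row]
        # First category j with r <= u_j
        predictions.append(next(j for j, u in enumerate(cdf[1:], start=1) if r <= u))
    return predictions


table = [[0, 0, 20], [0, 10, 0], [20, 0, 0], [0, 10, 0], [0, 0, 20]]
transposed = [list(col) for col in zip(*table)]  # swap the roles of X1 and X2

print(ccr_predictions(table))       # X1 -> X2: [3, 2, 1, 2, 3]
print(ccr_predictions(transposed))  # X2 -> X1: [3, 3, 3]
```

In the $X_2 \rightarrow X_1$ direction every regression value equals 0.5, which falls in the interval of the third $X_1$ category, so all three predictions coincide.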


### Calculating CCRAM & SCCRAM

The CCRAM (Checkerboard Copula Regression Association Measure) quantifies the regression relationship between categorical variables. In our example:

- Variables are 1-indexed: $X_1$ (rows) and $X_2$ (columns)
- $X_1 \rightarrow X_2$ measures how much $X_1$ (listed in the `predictors` argument) explains the variation in $X_2$ (passed as the `response` argument)
- The scaled version (SCCRAM) normalizes the CCRAM by taking its upper bound into account, so that the magnitude of the regression association can be assessed properly

In [6]:
ccram_X1_to_X2 = copula.calculate_CCRAM(predictors=[1], response=2)
ccram_X2_to_X1 = copula.calculate_CCRAM(predictors=[2], response=1)
print(f"CCRAM X1 to X2: {ccram_X1_to_X2:.4f}")
print(f"CCRAM X2 to X1: {ccram_X2_to_X1:.4f}")

sccram_X1_to_X2 = copula.calculate_CCRAM(predictors=[1], response=2, scaled=True)
sccram_X2_to_X1 = copula.calculate_CCRAM(predictors=[2], response=1, scaled=True)
print(f"SCCRAM X1 to X2: {sccram_X1_to_X2:.4f}")
print(f"SCCRAM X2 to X1: {sccram_X2_to_X1:.4f}")

CCRAM X1 to X2: 0.8438
CCRAM X2 to X1: 0.0000
SCCRAM X1 to X2: 1.0000
SCCRAM X2 to X1: 0.0000
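
The four numbers above can be checked with a short calculation. The sketch assumes CCRAM equals 12 times the variance of the regression values $E[S_Y \mid X]$, and that SCCRAM divides this by 12 times the variance of the response scores. These formulas match the printed output for this table, but the function below is an illustration and not discopula's code.

```python
from fractions import Fraction


def ccram(table, scaled=False):
    """CCRAM (rows -> columns) of a 2D count table; SCCRAM when scaled=True."""
    total = sum(map(sum, table))
    col_pmf = [Fraction(sum(col), total) for col in zip(*table)]
    cdf = [Fraction(0)]
    for p in col_pmf:
        cdf.append(cdf[-1] + p)
    scores = [(lo + hi) / 2 for lo, hi in zip(cdf, cdf[1:])]

    half = Fraction(1, 2)  # mean of the scores
    explained = Fraction(0)
    for row in table:
        n = sum(row)
        if n == 0:
            continue  # an empty predictor category contributes nothing
        r = sum(Fraction(c, n) * s for c, s in zip(row, scores))
        explained += Fraction(n, total) * (r - half) ** 2
    value = 12 * explained
    if scaled:
        score_var = sum(p * (s - half) ** 2 for p, s in zip(col_pmf, scores))
        value /= 12 * score_var
    return value


table = [[0, 0, 20], [0, 10, 0], [20, 0, 0], [0, 10, 0], [0, 0, 20]]
transposed = [list(col) for col in zip(*table)]

print(float(ccram(table)))                    # 0.84375
print(float(ccram(table, scaled=True)))       # 1.0
print(float(ccram(transposed)))               # 0.0
print(float(ccram(transposed, scaled=True)))  # 0.0
```

The asymmetry is visible in the table itself: each $X_1$ category maps to exactly one $X_2$ category, whereas knowing $X_2$ leaves the expected $X_1$ score at 0.5 in every case.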


# 4-Dimensional Case (Real Data Analysis)

### Create Sample Data in Cases Form and Initialize the GenericCheckerboardCopula

The `GenericCheckerboardCopula` can be initialized using categorical data with multiple variables. Let's explain this with a concrete example:

Consider a dataset with 4 categorical variables:
- Length of Previous Attack ($X_1$): 2 categories (Short, Long)
- Pain Change ($X_2$): 3 categories (Worse, Same, Better)
- Lordosis ($X_3$): 2 categories (absent/decreasing, present/increasing)
- Back Pain ($X_4$): 6 categories (Worse (W), Same (S), Slight Improvement (SI), Moderate Improvement (MODI), Marked Improvement (MARI), Complete Relief (CR))

In the data structure:
- Each row represents one observation
- Each column represents one categorical variable
- The data is stored as a NumPy array of integer category codes, running from 1 to the number of categories of each variable

When creating the copula:
- Variables are numbered from $X_1$ to $X_4$, in the order of the columns in the input data.
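
Conceptually, going from cases form to a contingency table is a tally of identical rows. The sketch below shows the idea on a hypothetical two-variable data set; `cases_to_counts` and `toy_cases` are illustrative names and do not describe how `from_cases` is implemented.

```python
from collections import Counter
from itertools import product


def cases_to_counts(cases, shape):
    """Tally 1-indexed category tuples into a dict keyed by every cell of `shape`."""
    tallies = Counter(tuple(case) for case in cases)
    cells = product(*(range(1, k + 1) for k in shape))
    # Cells that never occur in the data get an explicit zero count
    return {cell: tallies.get(cell, 0) for cell in cells}


# Hypothetical mini data set: 2 variables with 2 and 3 categories
toy_cases = [[1, 3], [1, 3], [2, 1], [2, 2], [2, 2], [1, 1]]
counts = cases_to_counts(toy_cases, shape=(2, 3))
n = len(toy_cases)
for cell, count in counts.items():
    print(cell, count, round(count / n, 4))
```

Passing the shape explicitly matters because categories that never occur in the data would otherwise be missing from the table.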

In [7]:
real_cases_data = np.array([
    # RDA Row 1
    [1,3,1,2],[1,3,1,5],[1,3,1,5],
    [1,3,1,6],[1,3,1,6],[1,3,1,6],[1,3,1,6],
    # RDA Row 2
    [1,3,2,4],[1,3,2,5],[1,3,2,5],[1,3,2,5],
    # RDA Row 3
    [1,2,1,2],[1,2,1,2],[1,2,1,3],[1,2,1,3],[1,2,1,3],
    [1,2,1,5],[1,2,1,5],[1,2,1,5],[1,2,1,5],[1,2,1,5],[1,2,1,5],
    [1,2,1,6],[1,2,1,6],[1,2,1,6],[1,2,1,6],
    # RDA Row 4
    [1,2,2,2],[1,2,2,4],[1,2,2,4],[1,2,2,6],
    # RDA Row 5
    [1,1,1,5],[1,1,1,5],[1,1,1,6],[1,1,1,6],
    # RDA Row 6
    [1,1,2,3],[1,1,2,4],[1,1,2,5],[1,1,2,5],[1,1,2,5],
    # RDA Row 7
    [2,3,1,3],[2,3,1,3],[2,3,1,3],[2,3,1,5],[2,3,1,6],[2,3,1,6],
    # RDA Row 8
    [2,3,2,2],[2,3,2,5],[2,3,2,5],[2,3,2,5],
    # RDA Row 9
    [2,2,1,2],[2,2,1,2],[2,2,1,2],[2,2,1,3],[2,2,1,3],[2,2,1,3],[2,2,1,3],
    [2,2,1,4],[2,2,1,4],[2,2,1,4],[2,2,1,4],[2,2,1,4],
    [2,2,1,5],[2,2,1,5],[2,2,1,5],[2,2,1,5],[2,2,1,5],[2,2,1,5],
    [2,2,1,6],[2,2,1,6],
    # RDA Row 10
    [2,2,2,1],[2,2,2,2],[2,2,2,2],[2,2,2,2],[2,2,2,2],
    [2,2,2,3],[2,2,2,3],[2,2,2,3],[2,2,2,3],
    [2,2,2,4],[2,2,2,4],[2,2,2,4],[2,2,2,6],
    # RDA Row 11
    [2,1,1,1],[2,1,1,1],[2,1,1,2],[2,1,1,2],[2,1,1,3],
    [2,1,1,4],[2,1,1,4],[2,1,1,4],[2,1,1,4],[2,1,1,4],
    [2,1,1,5],[2,1,1,5],
    # RDA Row 12
    [2,1,2,1],[2,1,2,1],[2,1,2,3],[2,1,2,3],
    [2,1,2,4],[2,1,2,4],[2,1,2,4]
])

rda_copula = GenericCheckerboardCopula.from_cases(cases=real_cases_data, shape=(2,3,2,6))
print(f"Shape of the inferred joint probability matrix P: {rda_copula.P.shape}")
print(f"Probability matrix P:\n{rda_copula.P}\n")

for idx, marginal_pdf in rda_copula.marginal_pdfs.items():
    print(f"Marginal pdf for X{idx+1}: {marginal_pdf}")

for idx, marginal_cdf in rda_copula.marginal_cdfs.items():
    print(f"Marginal cdf for X{idx+1}: {marginal_cdf}")

Shape of the inferred joint probability matrix P: (2, 3, 2, 6)
Probability matrix P:
[[[[0.         0.         0.         0.         0.01980198 0.01980198]
   [0.         0.         0.00990099 0.00990099 0.02970297 0.        ]]

  [[0.         0.01980198 0.02970297 0.         0.05940594 0.03960396]
   [0.         0.00990099 0.         0.01980198 0.         0.00990099]]

  [[0.         0.00990099 0.         0.         0.01980198 0.03960396]
   [0.         0.         0.         0.00990099 0.02970297 0.        ]]]


 [[[0.01980198 0.01980198 0.00990099 0.04950495 0.01980198 0.        ]
   [0.01980198 0.         0.01980198 0.02970297 0.         0.        ]]

  [[0.         0.02970297 0.03960396 0.04950495 0.05940594 0.01980198]
   [0.00990099 0.03960396 0.03960396 0.02970297 0.         0.00990099]]

  [[0.         0.         0.02970297 0.         0.00990099 0.01980198]
   [0.         0.00990099 0.         0.         0.02970297 0.        ]]]]

Marginal pdf for X1: [0.38613861 0.61386139]
Ma

### Calculating Checkerboard Copula Scores and their Variances

In [8]:
# Calculate and display scores for all four axes
rda_scores_X1 = rda_copula.calculate_scores(1)
rda_scores_X2 = rda_copula.calculate_scores(2)
rda_scores_X3 = rda_copula.calculate_scores(3)
rda_scores_X4 = rda_copula.calculate_scores(4)

print("Checkerboard Copula Scores for X1:")
print(rda_scores_X1)
print("\nCheckerboard Copula Scores for X2:")
print(rda_scores_X2)
print("\nCheckerboard Copula Scores for X3:")
print(rda_scores_X3)
print("\nCheckerboard Copula Scores for X4:")
print(rda_scores_X4)

# Calculate and display variance of scores
rda_variance_S_X1 = rda_copula.calculate_variance_S(1)
rda_variance_S_X2 = rda_copula.calculate_variance_S(2)
rda_variance_S_X3 = rda_copula.calculate_variance_S(3)
rda_variance_S_X4 = rda_copula.calculate_variance_S(4)

print("\nVariance of Checkerboard Copula scores for X1:", rda_variance_S_X1)
print("\nVariance of Checkerboard Copula scores for X2:", rda_variance_S_X2)
print("\nVariance of Checkerboard Copula scores for X3:", rda_variance_S_X3)
print("\nVariance of Checkerboard Copula scores for X4:", rda_variance_S_X4)
# Expected: 12 * (variance of Checkerboard Copula scores for X4) = 12 * 0.07987568681385342 = 0.95850824176

Checkerboard Copula Scores for X1:
[np.float64(0.19306930693069307), np.float64(0.693069306930693)]

Checkerboard Copula Scores for X2:
[np.float64(0.13861386138613863), np.float64(0.5346534653465347), np.float64(0.8960396039603961)]

Checkerboard Copula Scores for X3:
[np.float64(0.3168316831683168), np.float64(0.8168316831683167)]

Checkerboard Copula Scores for X4:
[np.float64(0.024752475247524754), np.float64(0.1188118811881188), np.float64(0.27722772277227725), np.float64(0.4653465346534653), np.float64(0.7029702970297029), np.float64(0.9207920792079207)]

Variance of Checkerboard Copula scores for X1: 0.059258896186648376

Variance of Checkerboard Copula scores for X2: 0.0694360191827437

Variance of Checkerboard Copula scores for X3: 0.0580335261248897

Variance of Checkerboard Copula scores for X4: 0.07987568681385342
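
The $X_4$ results can be cross-checked from its marginal counts alone. In the sketch below the counts were tallied by hand from `real_cases_data`, the scores are assumed to be midpoints of the marginal CDF, and 12 times the score variance is interpreted as the upper bound used to scale the CCRAM. That interpretation is an assumption based on the comment in the cell above.

```python
from fractions import Fraction

# Marginal counts of X4 (Back Pain), tallied by hand from real_cases_data
x4_counts = [5, 14, 18, 20, 28, 16]
n = sum(x4_counts)  # 101 observations

# Midpoint scores of the marginal CDF
cdf_prev, scores = Fraction(0), []
for count in x4_counts:
    cdf_next = cdf_prev + Fraction(count, n)
    scores.append((cdf_prev + cdf_next) / 2)
    cdf_prev = cdf_next

# Variance of the scores around their mean of 1/2
variance = sum(Fraction(c, n) * (s - Fraction(1, 2)) ** 2 for c, s in zip(x4_counts, scores))
upper_bound = 12 * variance

print([round(float(s), 4) for s in scores])
print(float(variance), float(upper_bound))
```

A continuous uniform variable on [0, 1] has variance 1/12, so multiplying by 12 puts the score variance on a 0 to 1 scale; the value falls below 1 because $X_4$ has only six categories.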


### Category Prediction Using Checkerboard Copula Regression (CCR)

Besides measuring regression association with CCRAM, the Checkerboard Copula Regression (CCR) can be used to predict the category of the response variable. The `get_category_predictions_multi()` method:

- Predicts the category of the response variable (passed as the `response` argument) for each combination of predictor categories (variables listed in the `predictors` argument)
- Returns the predictions in an easy-to-read DataFrame format
- Optionally accepts custom axis names (the `axis_names` argument) for easier interpretation

In [9]:
rda_predictions_X1_X2_X3_to_X4 = rda_copula.get_category_predictions_multi(predictors=[1, 2, 3], response=4)
print("\nPredictions from X1, X2, X3 to Y = X4:")
print(rda_predictions_X1_X2_X3_to_X4)

rda_axis_to_name_dict = {1: "Length of Previous Attack", 2: "Pain Change", 3: "Lordosis", 4: "Back Pain"}
rda_predictions_X1_X2_X3_to_X4_named = rda_copula.get_category_predictions_multi(predictors=[1, 2, 3], response=4, axis_names=rda_axis_to_name_dict)
print("\nPredictions from Length of Previous Attack, Pain Change, Lordosis to Y = Back Pain:")
print(rda_predictions_X1_X2_X3_to_X4_named)


Predictions from X1, X2, X3 to Y = X4:
    X1 Category  X2 Category  X3 Category  Predicted X4 Category
0             1            1            1                      5
1             1            1            2                      5
2             1            2            1                      5
3             1            2            2                      4
4             1            3            1                      5
5             1            3            2                      5
6             2            1            1                      3
7             2            1            2                      3
8             2            2            1                      4
9             2            2            2                      3
10            2            3            1                      4
11            2            3            2                      4

Predictions from Length of Previous Attack, Pain Change, Lordosis to Y = Back Pain:
    Length of Previous Attack 

### Calculating CCRAM & SCCRAM

The CCRAM (Checkerboard Copula Regression Association Measure) quantifies the regression relationship between categorical variables. In our example:

- Variables are 1-indexed and named $X_1$, $X_2$, $X_3$, $X_4$ by default
- $(X_1, X_2, X_3) \rightarrow X_4$ measures how much $X_1$, $X_2$, and $X_3$ together (listed in the `predictors` argument) explain the variation in $X_4$ (passed as the `response` argument)
- The scaled version (SCCRAM) normalizes the CCRAM by taking its upper bound into account, so that the magnitude of the regression association can be assessed properly

In [10]:
rda_ccram_X1_X2_X3_to_X4 = rda_copula.calculate_CCRAM(predictors=[1, 2, 3], response=4)
print(f"CCRAM from (X1, X2, X3) to X4: {rda_ccram_X1_X2_X3_to_X4:.4f}")

rda_sccram_X1_X2_X3_to_X4 = rda_copula.calculate_CCRAM(predictors=[1, 2, 3], response=4, scaled=True)
print(f"SCCRAM from (X1, X2, X3) to X4: {rda_sccram_X1_X2_X3_to_X4:.4f}")

CCRAM from (X1, X2, X3) to X4: 0.2604
SCCRAM from (X1, X2, X3) to X4: 0.2716
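
With several predictors, one way to think about the measure is to treat every observed combination of predictor categories as a single combined category and then proceed as in the two-variable case. The sketch below illustrates this on a hypothetical toy data set (not the back pain data). Whether discopula handles multiple predictors in exactly this way is an assumption, and `ccram_from_cases` is an illustrative name.

```python
from collections import defaultdict
from fractions import Fraction


def ccram_from_cases(cases, predictors, response, n_response_categories):
    """CCRAM of `response` on the joint `predictors` (all 1-indexed column numbers)."""
    n = len(cases)
    # Tally response counts within each observed predictor combination
    groups = defaultdict(lambda: [0] * n_response_categories)
    for case in cases:
        key = tuple(case[p - 1] for p in predictors)
        groups[key][case[response - 1] - 1] += 1

    # Midpoint scores of the response's marginal CDF
    marginal = [sum(g[j] for g in groups.values()) for j in range(n_response_categories)]
    cdf_prev, scores = Fraction(0), []
    for count in marginal:
        cdf_next = cdf_prev + Fraction(count, n)
        scores.append((cdf_prev + cdf_next) / 2)
        cdf_prev = cdf_next

    # 12 times the variance of the conditional mean score across combinations
    explained = Fraction(0)
    for counts in groups.values():
        size = sum(counts)
        r = sum(Fraction(c, size) * s for c, s in zip(counts, scores))
        explained += Fraction(size, n) * (r - Fraction(1, 2)) ** 2
    return 12 * explained


# Hypothetical data: X3 depends on X1 and X2 jointly, but on neither one alone
toy_cases = [
    [1, 1, 1], [1, 1, 1], [1, 2, 2], [1, 2, 2],
    [2, 1, 2], [2, 1, 2], [2, 2, 1], [2, 2, 1],
]
print(ccram_from_cases(toy_cases, predictors=[1, 2], response=3, n_response_categories=2))  # 3/4
print(ccram_from_cases(toy_cases, predictors=[1], response=3, n_response_categories=2))     # 0
```

The toy example shows why listing all relevant predictors matters: $X_1$ alone explains nothing about $X_3$, while $X_1$ and $X_2$ together determine it completely.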
