# Statistical Tests Frequency

This notebook performs a step-by-step exploratory analysis of antibody–antigen interactions for several immune checkpoint proteins. Below is a concise summary of the key operations:

- Define datasets for major inhibitory checkpoints: PD-1, PD-L1, CTLA-4, KIR, LAG3, TIM3.
- Calculate interface/epitope/paratope length and residue frequency using structural data.
- Perform PCA clustering based on interface/epitope/paratope residue frequency.
- Use linear regression to predict checkpoint type from the provided residue frequencies.

In [7]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from scipy.stats import fisher_exact

In [8]:
# Remove water molecules from ctla4_interface_residues
# Load the file
df = pd.read_csv("data/ctla4/ctla4_interface_residues.csv")

# Drop rows whose third column contains 'HOH' (water)
df_clean = df[~df.iloc[:, 2].astype(str).str.contains("HOH")]

# Save the cleaned file
df_clean.to_csv("data/ctla4/ctla4_interface_residues_clean.csv", index=False)

## Chi2 Test
Tests whether the row distributions (residue frequencies per checkpoint) differ significantly.
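As a rough illustration of what the statistic measures, the sketch below computes the expected counts under independence, the chi-squared statistic, and the Pearson residuals by hand. The counts in `toy_table` are made up for illustration and are not taken from the data files. It also counts the cells with an expected count below 5, a common rule of thumb for when the chi-squared approximation becomes unreliable.

```python
from math import sqrt

def expected_counts(table):
    """Expected counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, expected

def pearson_residuals(table, expected):
    """(observed - expected) / sqrt(expected) for every cell."""
    return [
        [(o - e) / sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]

# Hypothetical counts: 2 checkpoints (rows) x 3 residue types (columns)
toy_table = [[30, 10, 10],
             [10, 20, 20]]

stat, dof, expected = chi2_statistic(toy_table)
residuals = pearson_residuals(toy_table, expected)

# Number of cells whose expected count falls below the rule-of-thumb threshold
small_cells = sum(e < 5 for row in expected for e in row)

print("statistic:", stat, "dof:", dof, "cells with expected < 5:", small_cells)
```

The squared Pearson residuals sum to the statistic, so large absolute residuals point to the cells that contribute most to a significant result.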

In [9]:
var = 'paratope'
files = {
    'CTLA4': f'data/ctla4/ctla4_{var}_residues.csv',
    'KIR': f'data/kir/kir_{var}_residues.csv',
    'LAG3': f'data/lag3/lag3_{var}_residues.csv',
    'PD1': f'data/pd1/pd1_{var}_residues.csv',
    'PDL1': f'data/pdl1/pdl1_{var}_residues.csv',
    'TIM3': f'data/tim3/tim3_{var}_residues.csv',
}

# Build the frequency tables
dfs = []
for checkpoint, path in files.items():
    df = pd.read_csv(path)
    counts = df['residue_name'].value_counts()
    counts.name = checkpoint
    dfs.append(counts)

# Build the contingency table (residues = columns, checkpoints = rows)
contingency_table = pd.DataFrame(dfs).fillna(0).astype(int)

print("Contingency table:\n", contingency_table)

# Chi-squared test
chi2, p, dof, expected = chi2_contingency(contingency_table)

print("\nChi² statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)

# Expected frequencies and standardized (Pearson) residuals: (observed - expected) / sqrt(expected)
expected_df = pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns)
residuals = (contingency_table - expected_df) / (expected_df**0.5)

print("\nStandardized residuals:\n", residuals)

Contingency table:
 residue_name  TYR  SER  GLY  TRP  ASN  THR  HIS  LEU  VAL  PHE  ...  ARG  GLU  \
CTLA4          60   42   36   22   19   11   10   10    8    8  ...    4    4   
KIR             4    6    1    1    1    2    0    0    1    2  ...    0    1   
LAG3            6    2   10    0    2    8    0    2    0    1  ...    0    0   
PD1            99   64   54   16   41   30    9   19    8   22  ...   19    2   
PDL1           30   34   17   15    5   12    4    8    4    8  ...   10    6   
TIM3            4    6    3    2    1    1    0    0    3    1  ...    0    0   

residue_name  GLN  PRO  FLC  ILE  ALA  MET  EDO  SO4  
CTLA4           4    3    1    1    1    0    0    0  
KIR             0    1    0    1    0    0    0    0  
LAG3            2    2    0    2    2    0    0    0  
PD1             3    5    0   12   11    5    1    0  
PDL1            0    8    0    4    6    0    0    1  
TIM3            0    0    0    0    4    0    0    0  

[6 rows x 22 columns]

Chi
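The table above contains the columns FLC, EDO and SO4, which are not amino acid codes. They look like PDB hetero-compound identifiers (citrate, ethylene glycol, sulfate), though that reading is an assumption. If the test is meant to compare amino acid usage only, one option is to filter each file before counting. This is a minimal sketch: `STANDARD_AA`, `keep_standard_residues` and the `toy` frame are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# The 20 standard amino acids as three-letter codes
STANDARD_AA = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

def keep_standard_residues(df, column="residue_name"):
    """Return only the rows whose residue code is a standard amino acid."""
    return df[df[column].isin(STANDARD_AA)].reset_index(drop=True)

# Hypothetical rows standing in for one of the *_residues.csv files
toy = pd.DataFrame(
    {"residue_name": ["TYR", "SER", "HOH", "SO4", "EDO", "GLY", "TYR"]}
)

filtered = keep_standard_residues(toy)
counts = filtered["residue_name"].value_counts()
print(counts)
```

Applied inside the loop that builds the frequency tables, this would drop water, ligands and buffer components in one step, before `value_counts()` is called.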

## Fisher Exact Test

- Tests whether two categorical variables in a 2x2 contingency table are independent of each other
- Determines whether there is a statistically significant association between them
- Answers the question: Does amino acid X occur significantly more often in the paratope than in the epitope, or is the difference purely due to chance?
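To make the points above concrete, the following sketch computes Fisher p-values by hand from the hypergeometric distribution, using only the standard library. The residue lists are invented for illustration. The two-sided value follows the usual definition (sum over all tables at most as probable as the observed one), which is assumed to match the default of `scipy.stats.fisher_exact`. The one-sided value corresponds to the directional question "more often in the paratope than in the epitope".

```python
from math import comb

def table_probability(a, row1, row2, col1):
    """Hypergeometric probability of top-left cell `a` given fixed margins."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def fisher_p_values(table):
    """Return (two_sided, greater) p-values for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {k: table_probability(k, row1, row2, col1) for k in range(lo, hi + 1)}
    p_obs = probs[a]
    # Two-sided: all tables at most as probable as the observed one
    # (the small relative tolerance guards against floating-point ties).
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-7))
    # One-sided: tables with at least as many hits in the first row.
    greater = sum(p for k, p in probs.items() if k >= a)
    return min(two_sided, 1.0), min(greater, 1.0)

# Hypothetical residue lists (not taken from the data files)
paratope = ["SER", "SER", "SER", "TYR"]
epitope = ["SER", "ALA", "MET", "GLU"]

aa = "SER"
table = [
    [paratope.count(aa), len(paratope) - paratope.count(aa)],
    [epitope.count(aa), len(epitope) - epitope.count(aa)],
]

p_two_sided, p_greater = fisher_p_values(table)
print(table, p_two_sided, p_greater)
```

With SciPy, the one-sided variant is available through the `alternative="greater"` argument of `fisher_exact`. The default is two-sided, which tests for any difference rather than specifically for enrichment in the paratope.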

In [10]:
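var = 'PDL1'
# Directory and file names are lowercase elsewhere in this notebook (e.g. data/pdl1/),
# so build the paths from the lowercase name to stay safe on case-sensitive file systems
prefix = var.lower()
epitope_path = f"data/{prefix}/{prefix}_epitope_residues.csv"
paratope_path = f"data/{prefix}/{prefix}_paratope_residues.csv"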

# load csvs
df_epi = pd.read_csv(epitope_path)
df_para = pd.read_csv(paratope_path)

# Select only the column holding the amino acid information (residue_name)
residue_col = "residue_name"

aa_list = sorted(set(df_epi[residue_col]) | set(df_para[residue_col]))

# save results
results = []

for aa in aa_list:
    # Count occurrences for this amino acid:
    epi_aa = (df_epi[residue_col] == aa).sum()
    epi_not_aa = (df_epi[residue_col] != aa).sum()
    para_aa = (df_para[residue_col] == aa).sum()
    para_not_aa = (df_para[residue_col] != aa).sum()

    # 2x2-table
    table = [[para_aa, para_not_aa],
             [epi_aa, epi_not_aa]]

    # Fisher-Test
    _, p = fisher_exact(table)

    results.append({
        "amino_acid": aa,
        "paratope_count": para_aa,
        "epitope_count": epi_aa,
        "p_value": p
    })

# Collect the results in a DataFrame
results_df = pd.DataFrame(results)

# Sort by significance
results_df = results_df.sort_values("p_value")

# Print the results
print(results_df)

# Save as CSV
results_df.to_csv(f"fisher_test_{var}.csv", index=False)


   amino_acid  paratope_count  epitope_count   p_value
14        SER              34              6  0.000001
11        MET               0             15  0.000048
17        TRP              15              1  0.000163
13        PRO               8              0  0.003042
0         ALA               6             21  0.004588
3         ASP               8             22  0.012849
4         GLN               0              7  0.014910
16        THR              12              3  0.016861
19        VAL               4             14  0.027700
5         GLU               6             16  0.046899
8         ILE               4             11  0.112399
12        PHE               8              3  0.132959
18        TYR              30             22  0.182333
6         GLY              17             11  0.239557
9         LEU               8              5  0.407307
1         ARG              10             15  0.411093
15        SO4               1              0  0.489418
7         
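Because one Fisher test is run per residue type, some p-values below 0.05 are expected by chance alone. A Benjamini-Hochberg adjustment is one common way to account for this. The sketch below implements it with the standard library only. `benjamini_hochberg` is a hypothetical helper and the p-values are made up; neither is part of the original analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from four tests
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```

For the table above, the adjusted values could be attached with `results_df["p_adjusted"] = benjamini_hochberg(results_df["p_value"].tolist())`, since the helper returns values in the order it receives them.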


CTLA4:

SER, TRP, GLY, TYR → more frequent in the paratope → important for antibody binding.

PRO, MET, GLU → more frequent in the epitope → typical of the antigen.